From f8640770493dd077bb77ad28c3951f2acc6f0f0b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=82=93=E4=BC=9F=E9=94=AE?= Date: Wed, 3 Dec 2025 16:57:02 +0800 Subject: [PATCH 1/3] init --- .../\351\230\237\344\274\215emmm/README.md" | 201 + .../patches/0001-20251104commit.patch" | 1272 + .../patches/0002-20251106commit.patch" | 3200 + .../patches/0003-20261106secondcommit.patch" | 2769 + .../patches/0004-20251106change.patch" | 7498 +++ .../patches/0005-20251107001commit.patch" | 7707 +++ .../patches/0006-20251107002commit.patch" | 7931 +++ .../patches/0007-20251107003commit.patch" | 8034 +++ .../patches/0008-moe-change.patch" | 8789 +++ .../patches/0009-20251109firstcommit.patch" | 9078 +++ .../patches/0010-.patch" | 49453 ++++++++++++++++ 11 files changed, 105932 insertions(+) create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" new file mode 100644 index 00000000..a3cda3f5 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" @@ -0,0 +1,201 @@ +# README + +## 比赛内容(MOE赛道) + +MoE类: + +deepseek-ai/deepseek-moe-16b-chat + +Qwen/Qwen1.5-MoE-A2.7B-Chat + +在无精度误差的情况下提速这两个模型的prefill,decode和显存峰值 + +![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=NWIwMTIzNjY4NDhkMTI4ZmYxNTFmMWNhOWIyNWRlYzZfeldtbk84b3lhUWVNYjJCZlRtT05TZ0JubU1hMzB0S3RfVG9rZW46WU10NmJsWXFab01CaER4NFlHT2NZRzJHbjJuXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) + +## 最终成绩 + +![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=Zjk2MzEzNmNhYWUxODQ5NzI1NTNhODRmMjhmMDljMGZfeWNuZ0tzT3JBcHBlY0Z1ZnFJWHRNczRGWnd1UWFOaGdfVG9rZW46QVRXVWJGeUpGb2k5R094WmxuVGM5TUdEbmxmXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) + +# 比赛复盘 + +## 前期思路 + +- flash-attention : 由于确实是很常用的加速手段,原理上也work,所以基本上贯穿了优化策略,但实际上收益甚微,就连显存上也没有显示出优化 +- 算子融合 : 一开始做的是诸如合并两次运算到一次运算里面;官方开会的时候提到了mindnlp.core.F,也提到了融合算子的下发调用开销可能和他的加速持平,实际测试下来没有加速,而且造成了精度误差(F.rms_norm) +- 合并python循环为矩阵运算 : 可以说是最有效的方法了,但是前期并没有探索的很深,浅浅掠过 +- 复用图/kernal : 这个可谓是花了最多心思、同时又没弄出效果的方法,具体放在中期测试里面讲 +- 只遍历激活专家 : 前期唯一的提分手段,从100->120 + +## 中期测试 + +- flash-attention + - 通过简单网络来测试,flash-attention对于长序列确而有提速效果,但是在中短序列不明显,有时候还会因为未知波动效果不如baseline + - 官方接口 `mindspore.ops.flash_attention_score`会带来一定的精度误差,具体而言qwen的prompt2会mismatch +- 算子融合 + - F.rms_norm 不仅没加速还带来了精度误差(应该是qwen的prompt1会mismatch),遂直接放弃 + - 但是我没太理解会议里面讲的要比较下放损耗和融合算子加速效果,我个人仍然觉得这应该要work,但是却没有 +- Graph&Pynative mode - kernal/图复用 + - 一开始打算用分桶填充策略,设置 `seq_len = [1,2,4,8,..,128]`的桶来多次调用模型生成来生成这些尺寸的图,为输入的prompt寻找恰好不小于他的桶进行padding触发图复用,但是毫无效果,于是开始探索图复用的条件,网上有说法是需要 `@mindspore.jit`即时编译/`Graph mode`静态图模式才能生成可以复用的图,于是进入下一步测试 + - @mindspore.jit:基本用不了,对于网络模型底层的try-except控制流,jit不支持这种低效分支,而我们又不大可能去修改底层的控制流(太多),遂放弃 + - Graph mode:经过我用简单网络的多次测试(充分预热,多次测试取平均),Graph模式比Pynative模式慢10倍左右,匪夷所思,初步怀疑是没触发图复用所以编译开销也算里面了(也就是单纯用Graph跑一次不会建立图?要显性调用什么函数或者装饰器吗),遂半放弃 + - static-cache :没做成功,因为需要把动态cache 换成 static cache,bug较多,时间上不允许,而且直播的时候说提升不大。 +- Profiler + - 这是一个很好的工具(疑似),但是直到最后都不知道如何使用,一方面是断点设置和信息收集的问题,但这个问题不大 + - ![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=ODNmOTFhMDg2NjZjYmJmODgwMjBlNzVjYTE1MWFiMzRfTTh1S1pnUVZXbWdPRGU0MGhSREh5TU05ZkRaNEJCMGZfVG9rZW46VmFIRWJnMDlob1FUV294YktYZGNTNklqbnlmXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) + - 最重要的是这个页面我只看到NPU的free/compute比值很大,除此之外不知道如何分析来调优了,要是能看**别人实际调优一遍肯定会好很多,求教程!!** +- MOE分析 + - 通过模型原来的代码,在self.mlp = ... 这一行,我发现了有一个if控制流,走moe/mlp,尝试使用走mlp之后,prefill/decode耗时降低了**20倍**,这时候我才意识到,原来前面有的没的都是**次要矛盾**,只要把**moe这个模块的代码**优化好了,就已经胜利了 + - 测试对比了moe和attention模块的占时,发现moe耗时是attention的20倍左右,所以压根不需要管attention部分的优化,也说明了为什么flash-attention部分没什么提速效果,因为**attention对总时长的影响实在微乎其微。** + - 通过查看榜单,思考前两名prefill=1700的同时为什么显存都有略微下降,可以明确的是他们肯定用空间换了时间,但是具体是做了什么呢? + - 通过对moe模块的分析,终于才把握住了主要矛盾;结合前面总总,不难推断肯定是moe部分把某些**串行换并行或者大矩阵计算**,从显存情况我更倾向于是**大矩阵计算**,所以问题就转移到如何做这个大矩阵运算 + +## 后期优化 + +以下代码是我们moe prefill 和 decode 最快的一版 + +### prefill + +1. **Pad (填充)**: 将分配给不同专家的、数量不等的“锯齿状”token数据,通过 tensor_scatter_update 填充成一个规整的、[专家数, 最大Token数, 隐藏层大小] 的“矩形”张量。 +2. **BMM (批量矩阵乘法)**: 利用这个规整的张量,调用一次 ops.bmm 即可同时计算所有专家的输出,将硬件并行度拉满。 +3. **Gather (收集)**: 计算完成后,用 gather_nd 从填充后的结果中高效地提取出有效的输出数据。 + +```Python + @no_grad() + def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): + num_total_assignments = flat_expert_indices.shape[0] + hidden_size = x.shape[-1] + num_experts = len(self.experts) + + # 1) 排序专家分配 + idxs = flat_expert_indices.argsort() + sorted_expert_indices = flat_expert_indices[idxs] + sorted_token_indices = idxs // self.num_experts_per_tok + permuted_tokens = x[sorted_token_indices] + sorted_weights = flat_expert_weights[idxs] + + # 2) 计算Padding所需的尺寸 + tokens_per_expert = sorted_expert_indices.bincount(minlength=num_experts) + max_tokens_per_expert = tokens_per_expert.max().item() + + if max_tokens_per_expert == 0: + return ops.zeros_like(x) + + # 3) 创建Padded张量 + expert_offsets = ops.cumsum(tokens_per_expert, dim=0) - tokens_per_expert + token_indices_in_sorted = mnp.arange(num_total_assignments) + relative_pos_in_expert = token_indices_in_sorted - expert_offsets[sorted_expert_indices] + + gather_indices_sparse = ops.stack([sorted_expert_indices, relative_pos_in_expert], dim=1) + + # --- 关键修正: 使用 tensor_scatter_update --- + padded_tokens = ops.zeros((num_experts, max_tokens_per_expert, hidden_size), dtype=x.dtype) + # 直接使用 mindspore.ops.tensor_scatter_update + padded_tokens = mindspore.ops.tensor_scatter_update(padded_tokens, gather_indices_sparse, permuted_tokens) + + # 4) 堆叠所有专家权重 + gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) + up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) + down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) + + # 5) --- 核心:巨型批处理矩阵乘法 (BMM) --- + gate_out = ops.bmm(padded_tokens, gate_weights.transpose(0, 2, 1)) + up_out = ops.bmm(padded_tokens, up_weights.transpose(0, 2, 1)) + act_out = self.experts[0].act_fn(gate_out) * up_out + padded_expert_outputs = ops.bmm(act_out, down_weights.transpose(0, 2, 1)) + + # 6) 从Padded结果张量中按原始顺序gather回所有有效结果 + # mindspore.ops.gather_nd 是 tensor_scatter_update 的逆操作,非常适合此场景 + expert_outputs_sorted = mindspore.ops.gather_nd(padded_expert_outputs, gather_indices_sparse) + + # 7) 最终加权和还原 + final_output = ops.zeros_like(x) + final_output = mindspore.mint.scatter_add( + final_output, + 0, + sorted_token_indices.view(-1, 1).tile((1, hidden_size)), + expert_outputs_sorted * sorted_weights + ) + return final_output +``` + +### Decode + +1. **向量化计算** + +新代码先将这个 token 选出的 top_k 个专家的weight收集起来,用 ops.stack 堆叠成一个批次。然后,通过 **ops.bmm** 这个算子,一次性并行完成这topk个专家的所有计算。 + +1. **内存局部性优化** + +通过 init_active_expert_cache 函数提前进行预处理。在模型预热(阶段识别出最常用的专家,然后将这些“热门专家”的权重从它们各自零散的位置提前抽调出来,用 ops.stack 堆叠成一个巨大且连续的内存块(即 self.cache_gate_w 等缓存张量)。在实际解码时,代码会优先从这个连续的缓存块中通过索引(self.cache_gate_w[eid])直接读取权重。这种直接在大块连续内存上的索引操作速度极快,因为它能高效利用硬件的内存缓存机制(“缓存命中” Cache Hit),避免了复杂的对象查找。 + +```Python + def init_active_expert_cache(self, active_ids): + """ + 预热后调用,将常用专家的权重预先提取并堆叠, + 形成一个内存连续的“快速访问缓存”。 + """ + self.cache_gate_w = ops.stack([self.experts[i].gate_proj.weight for i in active_ids], dim=0) + self.cache_up_w = ops.stack([self.experts[i].up_proj.weight for i in active_ids], dim=0) + self.cache_down_w = ops.stack([self.experts[i].down_proj.weight for i in active_ids], dim=0) + + def moe_infer_decode_fast(self, x, flat_expert_indices, flat_expert_weights): + """ + 利用“权重缓存”和“BMM向量化”实现极致解码速度。 + """ + top_k = flat_expert_indices.shape[0] + hidden_size = x.shape[-1] + + selected_gate_w = [] + selected_up_w = [] + selected_down_w = [] + + # 1. 核心:从“快速缓存”或“慢速原始列表”中收集权重 + for eid in flat_expert_indices.tolist(): + # 检查缓存是否存在且eid在缓存范围内,如果满足则进入“快速通道” + if hasattr(self, "cache_gate_w") and eid < self.cache_gate_w.shape[0]: + selected_gate_w.append(self.cache_gate_w[eid]) + selected_up_w.append(self.cache_up_w[eid]) + selected_down_w.append(self.cache_down_w[eid]) + else: # 否则,回退到“慢速通道” + selected_gate_w.append(self.experts[eid].gate_proj.weight) + selected_up_w.append(self.experts[eid].up_proj.weight) + selected_down_w.append(self.experts[eid].down_proj.weight) + + # 2. 将收集到的分散权重堆叠成一个批次 + selected_gate_w = ops.stack(selected_gate_w, dim=0) + selected_up_w = ops.stack(selected_up_w, dim=0) + selected_down_w = ops.stack(selected_down_w, dim=0) + + # 3. 向量化计算:使用BMM一次性完成所有专家运算 + x_expanded = x.expand((top_k, 1, hidden_size)) + gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) + up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) + intermediate_states = self.experts[0].act_fn(gate_out) * up_out + expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) + + # 4. 向量化聚合 + weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) + return weighted_sum +``` + +#### 其他trick + +- trick1 : **劫持预热**,事先过一遍短中长三种prompt的generate,充分预热,当时的想法是瞎猫碰一下死耗子看看能不能触发图复用,结果意外确实降低了prefill时延,由于没有充分实验不确定原理 +- trick2 : 根据测试的三个prompt的长度,**用Prompt=0/1/2控制走的优化流**,部分优化流对某些Prompt会带来精度误差,这样是为了在实在解决不了精度问题的情况下,不至于直接放弃这个有效的优化,利用其他Prompt的优化先快带动后快 +- trick3 : **init_active_expert_cache和warmup_moe_model_deep** + - 在预热的时候,记录下所有被激活过的专家的ID,缓存那些在预热中被激活过的active_ids的权重(ops.stack)。 + - 如果缓存已经建立,并且当前需要的专家 eid 就在缓存里,它会直接从连续的 cache_gate_w 张量中索引权重。 + +## 收益点 + +| DeepseekMoe + Qwen moe都进行MoE模块前向优化,decode直接遍历激活专家 | 总分的具体收益估计是从100->120 | +| ------------------------------------------------------------ | ------------------------------------------------------------ | +| 在DeepseekAttention和QwenAttention的forward函数里有用apply_rotary_pos_emb函数,而对于该函数里用了rotate_half函数。对于rotate_half函数,可以使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :]。 | 显存峰值100优化后100prefill133.4445 132.4821decode427.7311 437.5848总分220.919 223.3556 | +| **moe_prefill_fast** 通过将权重堆叠、对输入 tokens 重新排序,将原本多个小的、串行的专家计算,转换为了几个大的、连续的计算块,并使用一个高效的 scatter_add 操作完成结果聚合,从而大幅提升了性能。**moe_decode_fast** 将多个小规模的、串行的专家计算,巧妙地转换成了一次大规模的、并行的批量矩阵乘法(bmm)操作,彻底消除了 Python 循环,因此速度更快。但是有mismatch,所以根据LongPrompt来做dispatch | 显存峰值100优化后98.4848prefill132.4821 163.8114decode437.5848 454.7424总分223.3556 239.0129 | +| **init_active_expert_cache**和**warmup_moe_model_deep**:在预热的时候,记录下所有被激活过的专家的ID,缓存那些在预热中被激活过的active_ids的权重(ops.stack)。如果缓存已经建立,并且当前需要的专家 eid 就在缓存里,它会直接从连续的 cache_gate_w 张量中索引权重。 | 显存峰值98.4848优化后98.4848prefill163.8114 198.4985decode454.7424 493.2538总分239.0129 263.4124 | +| 通过 **Pad -> BMM -> Gather** 的流程,将所有专家的计算合并为单个、大规模的并行操作Pad : 将分配给不同专家的、数量不等的“锯齿状”token数据,通过 tensor_scatter_update 填充成一个规整的、[专家数, 最大Token数, 隐藏层大小] 的“矩形”张量。BMM: 利用这个规整的张量,调用一次 ops.bmm 即可同时计算所有专家的输出,将硬件并行度拉满。Gather : 计算完成后,用 gather_nd 从填充后的结果中高效地提取出有效的输出数据。+但是有mismatch,解决思路是:在核心计算中使用 float32 保证数值精度,从根本上解决 mismatch 问题+根据LongPrompt来做dispatch | 显存峰值98.4848优化后83.3333prefill198.4985 487.1616decode493.2538 490.5996总分263.4124 353.6982 | + +## 总结 + +- 对于精度优化比赛,不应该一上来就花费大量时间在框架本身优化、常用优化等上,应该先通过充分测试找到**主要矛盾**,因为一般这种比赛都会有侧重点,比如这个比赛就是moe部分,如果第一天我就能测试出moe的时间占比如此浮夸,我想我就不会把时间放在细枝末节上面 +- 对于调试工具(如Profiler等),这是辅助完成第一步的,很有学习的必要,这种可视化工具分析的能力或者输出、断点分析能力将是打比赛的重要能力,说是最重要也不为过,这样才能有的放矢 \ No newline at end of file diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" new file mode 100644 index 00000000..c23f7201 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" @@ -0,0 +1,1272 @@ +From d61fd429337580809fe74a59b1dfa81b91094dae Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Tue, 4 Nov 2025 09:11:51 +0800 +Subject: [PATCH 01/10] 20251104commit + +--- + mindnlp/transformers/cache_utils.py | 28 +- + .../models/deepseek/modeling_deepseek.py | 149 ++- + .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- + 3 files changed, 976 insertions(+), 87 deletions(-) + +diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +index cadd2e04..02f8d4be 100644 +--- a/mindnlp/transformers/cache_utils.py ++++ b/mindnlp/transformers/cache_utils.py +@@ -812,14 +812,26 @@ class StaticCache(Cache): + # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. + # k_out[:, :, cache_position] = key_states + # v_out[:, :, cache_position] = value_states +- if ON_ORANGE_PI: +- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +- else: +- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +- ++ # if ON_ORANGE_PI: ++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++ # else: ++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++ # 确保 cache_position 是 1D tensor 并且类型正确 ++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++ if cache_position.ndim > 1: ++ cache_position = cache_position.flatten() ++ # 确保类型是 int32 或 int64(MindSpore 要求) ++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++ cache_position = cache_position.int() ++ ++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++ k_out[:, :, cache_position] = key_states ++ v_out[:, :, cache_position] = value_states ++ + return k_out, v_out + + def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index c695b944..d8303e45 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): + # Copied from transformers.models.llama.modeling_llama.rotate_half + def rotate_half(x): + """Rotates half the hidden dims of the input.""" +- x1 = x[..., : x.shape[-1] // 2] +- x2 = x[..., x.shape[-1] // 2 :] ++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++ # x1 = x[..., : x.shape[-1] // 2] ++ # x2 = x[..., x.shape[-1] // 2 :] ++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + + +@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): + if self.training: + raise NotImplementedError("Training is not supported yet.") + else: +- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +- if self.config.n_shared_experts is not None: +- y = y + self.shared_experts(identity) +- return y ++ # @lwx ++ if orig_shape[1] == 1: ++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++ y=y.view(*orig_shape) ++ if self.config.n_shared_experts is not None: ++ y = y + self.shared_experts(identity) ++ return y ++ else: ++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++ if self.config.n_shared_experts is not None: ++ y = y + self.shared_experts(identity) ++ return y ++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++ # if self.config.n_shared_experts is not None: ++ # y = y + self.shared_experts(identity) ++ # return y ++ ++ @no_grad() ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ ++ expert_cache = ops.zeros_like(x) ++ for i in range(self.num_experts_per_tok): ++ expert_id = flat_expert_indices[i].item() ++ weight = flat_expert_weights[i].item() ++ expert = self.experts[expert_id] ++ expert_out = expert(x) ++ expert_cache += expert_out * weight ++ return expert_cache + + @no_grad() +- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +- # expert_cache = torch.zeros_like(x) +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +- # token_idxs = idxs // self.num_experts_per_tok +- # for i, end_idx in enumerate(tokens_per_expert): +- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +- # if start_idx == end_idx: +- # continue +- # expert = self.experts[i] +- # exp_token_idx = token_idxs[start_idx:end_idx] +- # expert_tokens = x[exp_token_idx] +- # expert_out = expert(expert_tokens) +- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +- # return expert_cache ++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): + expert_cache = ops.zeros_like(x) + idxs = flat_expert_indices.argsort() + tokens_per_expert = flat_expert_indices.bincount().cumsum(0) + token_idxs = idxs // self.num_experts_per_tok ++ + for i, end_idx in enumerate(tokens_per_expert): + start_idx = 0 if i == 0 else tokens_per_expert[i-1] + if start_idx == end_idx: +@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): + expert_out = expert(expert_tokens) + expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) + expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++ + return expert_cache ++ ++ # @no_grad() ++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++ # # expert_cache = torch.zeros_like(x) ++ # # idxs = flat_expert_indices.argsort() ++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++ # # token_idxs = idxs // self.num_experts_per_tok ++ # # for i, end_idx in enumerate(tokens_per_expert): ++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++ # # if start_idx == end_idx: ++ # # continue ++ # # expert = self.experts[i] ++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++ # # expert_tokens = x[exp_token_idx] ++ # # expert_out = expert(expert_tokens) ++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++ # # return expert_cache ++ # expert_cache = ops.zeros_like(x) ++ # idxs = flat_expert_indices.argsort() ++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ # token_idxs = idxs // self.num_experts_per_tok ++ ++ # for i, end_idx in enumerate(tokens_per_expert): ++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ # if start_idx == end_idx: ++ # continue ++ # expert = self.experts[i] ++ # exp_token_idx = token_idxs[start_idx:end_idx] ++ # expert_tokens = x[exp_token_idx] ++ # expert_out = expert(expert_tokens) ++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++ ++ # return expert_cache ++ # @no_grad() ++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++ # expert_cache = ops.zeros_like(x) ++ ++ # # 排序保证顺序一致 ++ # idxs = flat_expert_indices.argsort() ++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ # token_idxs = idxs // self.num_experts_per_tok ++ ++ # # 找出有 token 的专家 ++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++ ++ # for i in active_experts.tolist(): ++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ # end_idx = tokens_per_expert[i] ++ # if start_idx == end_idx: # 没有 token ++ # continue ++ ++ # exp_token_idx = token_idxs[start_idx:end_idx] ++ # expert_tokens = x[exp_token_idx] ++ # expert_out = self.experts[i](expert_tokens) ++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++ ++ # expert_cache = mindspore.mint.scatter_add( ++ # expert_cache, ++ # 0, ++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++ # expert_out ++ # ) ++ ++ # return expert_cache ++ ++ + + # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): + # """ +@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + + # Initialize weights and apply final processing + self.post_init() ++ self.warm_up = False ++ ++ def warmup_moe_model_deep(self): ++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++ test_texts = [ ++ "warmup short", ++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++ ] ++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++ if tokenizer is None: ++ from mindnlp.transformers import AutoTokenizer ++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++ self._warmup_tokenizer = tokenizer ++ ++ for text in test_texts: ++ inputs = tokenizer(text, return_tensors="ms") ++ with mindspore._no_grad(): ++ _ = self(**inputs, use_cache=False) ++ print("[Warmup] DeepSeek-MoE 模型预热完成。") + + def get_input_embeddings(self): + return self.model.embed_tokens +@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." + ```""" ++ if not self.warm_up: ++ self.warm_up = True ++ self.warmup_moe_model_deep() ++ + output_attentions = ( + output_attentions + if output_attentions is not None +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index 3cbf820e..d4c6b651 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -18,7 +18,6 @@ + # See the License for the specific language governing permissions and + # limitations under the License. + """MindSpore Qwen2MoE model.""" +- + import math + from typing import List, Optional, Tuple, Union + +@@ -36,6 +35,7 @@ from ...modeling_outputs import ( + TokenClassifierOutput, + ) + from ...modeling_utils import PreTrainedModel ++from ...generation import GenerationMixin + from ....utils import logging + from .configuration_qwen2_moe import Qwen2MoeConfig + +@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): + self.variance_epsilon = eps + + def forward(self, hidden_states): ++ # @dwj ++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++ # @lwx ++ # if not self.training : ++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) + input_dtype = hidden_states.dtype + hidden_states = hidden_states.to(mindspore.float32) + variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +@@ -234,6 +239,8 @@ def rotate_half(x): + """Rotates half the hidden dims of the input.""" + x1 = x[..., : x.shape[-1] // 2] + x2 = x[..., x.shape[-1] // 2 :] ++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + + +@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): + self.config = config + self.hidden_size = config.hidden_size + self.intermediate_size = intermediate_size ++ + self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) + self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) + self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) + self.act_fn = ACT2FN[config.hidden_act] + + def forward(self, x): +- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +- + ++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++ # @lwx ++ # gate_up_output = self.gate_up_proj(x) ++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++ # return self.down_proj(swiglu_output) ++ ++ # def forward(self, x): ++ # gate_proj_out = self.gate_proj(x) ++ # up_proj_out = self.up_proj(x) ++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++ # return self.down_proj(swiglu_out) ++ + # Copied from transformers.models.llama.modeling_llama.repeat_kv + def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: + """ +@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): + use_cache: bool = False, + cache_position: Optional[mindspore.Tensor] = None, + ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++ ++ + bsz, q_len, _ = hidden_states.shape + + query_states = self.q_proj(hidden_states) +@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): + "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " + "with a layer index." + ) +- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ if isinstance(past_key_value, StaticCache): ++ kv_seq_len = key_states.shape[-2] ++ else: ++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) + cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) + + if past_key_value is not None: + cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++ if isinstance(past_key_value, StaticCache): ++ kv_seq_len = key_states.shape[-2] + + # repeat k/v heads if n_kv_heads < n_heads + key_states = repeat_kv(key_states, self.num_key_value_groups) + value_states = repeat_kv(value_states, self.num_key_value_groups) +- ++ + attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) + +- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +- raise ValueError( +- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +- f" {attn_weights.shape}" +- ) +- +- if attention_mask is not None: # no matter the length, we just slice it +- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++ if attention_mask is not None: ++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] + attn_weights = attn_weights + causal_mask + + # upcast attention to fp32 +@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): + attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) + + attn_output = self.o_proj(attn_output) +- ++ # @lwx ++ ++ # max_seq_len = self.max_position_embeddings # 2048 ++ ++ # if attention_mask is not None: ++ # # attention_mask: [B, 1, Sq, Sk] ++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++ ++ # # pad 到 [max_seq_len, max_seq_len] ++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++ # global_attention_mask = padded_mask ++ # else: ++ # global_attention_mask = None ++ ++ ++ # sparse_mode=3 ++ # attn_output = mindspore.ops.flash_attention_score( ++ # query=query_states, ++ # key=key_states, ++ # value=value_states, ++ # real_shift=None, ++ # padding_mask=None, ++ ++ # head_num=self.num_heads, ++ # attn_mask=global_attention_mask, ++ # keep_prob=1.0 - self.attention_dropout, ++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++ # input_layout="BNSD", ++ # pre_tokens=2147483647, ++ # next_tokens=2147483647, ++ # inner_precise=0, ++ # drop_mask=None, ++ # prefix=None, ++ # actual_seq_qlen=None, ++ # actual_seq_kvlen=None, ++ # sparse_mode=sparse_mode, ++ # ) + if not output_attentions: + attn_weights = None + + return attn_output, attn_weights, past_key_value + + ++class Qwen2MoeFlashAttention(nn.Module): ++ """ ++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++ ++ 关键改动: ++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++ 直接传入原始的 key 和 value 张量效率更高。 ++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++ """ ++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++ super().__init__() ++ self.config = config ++ self.layer_idx = layer_idx ++ self.hidden_size = config.hidden_size ++ self.num_heads = config.num_attention_heads ++ self.head_dim = self.hidden_size // self.num_heads ++ self.num_key_value_heads = config.num_key_value_heads ++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++ self.max_position_embeddings = config.max_position_embeddings ++ self.rope_theta = config.rope_theta ++ self.attention_dropout = config.attention_dropout ++ ++ if (self.head_dim * self.num_heads) != self.hidden_size: ++ raise ValueError( ++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++ ) ++ ++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++ ++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++ self.head_dim, ++ max_position_embeddings=self.max_position_embeddings, ++ base=self.rope_theta, ++ ) ++ ++ def forward( ++ self, ++ hidden_states: mindspore.Tensor, ++ attention_mask: Optional[mindspore.Tensor] = None, ++ position_ids: Optional[mindspore.Tensor] = None, ++ past_key_value: Optional[Cache] = None, ++ output_attentions: bool = False, ++ use_cache: bool = False, ++ cache_position: Optional[mindspore.Tensor] = None, ++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++ bsz, q_len, _ = hidden_states.shape ++ ++ # 1. 线性投射 Q, K, V ++ query_states = self.q_proj(hidden_states) ++ key_states = self.k_proj(hidden_states) ++ value_states = self.v_proj(hidden_states) ++ ++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++ # query: [B, S, H*D] -> [B, N1, S, D] ++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++ # 3. RoPE 旋转位置编码 ++ kv_seq_len = key_states.shape[-2] ++ if past_key_value is not None: ++ if self.layer_idx is None: ++ raise ValueError( ++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++ "with a layer index." ++ ) ++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++ if cache_position.shape[0] == 1: ++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++ kv_seq_len = past_seen_tokens + 1 ++ else: ++ # prefill 阶段:cache_position 是范围,使用其长度 ++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++ else: ++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ # 4. KV 缓存更新 ++ if past_key_value is not None: ++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++ key_states, value_states = past_key_value.update( ++ key_states, value_states, self.layer_idx, cache_kwargs ++ ) ++ ++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++ if cache_position.shape[0] == 1: ++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++ kv_seq_len = key_states.shape[-2] ++ ++ # 5. [重要] 准备 Attention Mask ++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++ fa_attention_mask = None ++ if attention_mask is not None: ++ # 截取与当前key长度匹配的部分 ++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++ fa_attention_mask = (mask_slice != 0) ++ ++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++ input_dtype = query_states.dtype ++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++ query_states = query_states.to(mindspore.float16) ++ key_states = key_states.to(mindspore.float16) ++ value_states = value_states.to(mindspore.float16) ++ ++ # 6. [核心] 调用 flash_attention_score 算子 ++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++ attn_output = mindspore.ops.flash_attention_score( ++ query=query_states, ++ key=key_states, ++ value=value_states, ++ head_num=self.num_heads, # 传入Q的头数(N1) ++ attn_mask=fa_attention_mask, ++ keep_prob=1.0 - self.attention_dropout, ++ scalar_value=1.0 / math.sqrt(self.head_dim), ++ input_layout="BNSD", ++ sparse_mode=0 # 使用 defaultMask 模式 ++ ) ++ ++ # 恢复原始数据类型 ++ attn_output = attn_output.to(input_dtype) ++ ++ # 7. 调整输出形状 ++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ attn_output = self.o_proj(attn_output) ++ ++ # FlashAttention 算子不直接返回注意力权重矩阵 ++ attn_weights = None ++ if output_attentions: ++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++ ++ return attn_output, attn_weights, past_key_value ++ ++ # def forward( ++ # self, ++ # hidden_states: mindspore.Tensor, ++ # attention_mask: Optional[mindspore.Tensor] = None, ++ # position_ids: Optional[mindspore.Tensor] = None, ++ # past_key_value: Optional[Cache] = None, ++ # output_attentions: bool = False, ++ # use_cache: bool = False, ++ # cache_position: Optional[mindspore.Tensor] = None, ++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++ # bsz, q_len, _ = hidden_states.shape ++ ++ # # 1. 线性投射 Q, K, V ++ # query_states = self.q_proj(hidden_states) ++ # key_states = self.k_proj(hidden_states) ++ # value_states = self.v_proj(hidden_states) ++ ++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++ # # 3. RoPE 旋转位置编码 ++ # kv_seq_len = key_states.shape[-2] ++ # if past_key_value is not None: ++ # if self.layer_idx is None: ++ # raise ValueError( ++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++ # "with a layer index." ++ # ) ++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ # # 4. KV 缓存更新 ++ # if past_key_value is not None: ++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++ # key_states, value_states = past_key_value.update( ++ # key_states, value_states, self.layer_idx, cache_kwargs ++ # ) ++ ++ # # 5. 准备 Attention Mask ++ # fa_attention_mask = None ++ # if attention_mask is not None: ++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++ # fa_attention_mask = (mask_slice != 0) ++ ++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++ # input_dtype = query_states.dtype ++ ++ # # 6. [核心] 调用 flash_attention_score 算子 ++ # attn_output = mindspore.ops.flash_attention_score( ++ # query=query_states, ++ # key=key_states, ++ # value=value_states, ++ # head_num=self.num_heads, ++ # attn_mask=fa_attention_mask, ++ # keep_prob=1.0 - self.attention_dropout, ++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++ # input_layout="BNSD", ++ # sparse_mode=0, ++ # # <--- 修改点 2: 启用内部高精度计算 --- ++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++ # inner_precise=1 ++ # ) ++ ++ # # 恢复原始数据类型 ++ # attn_output = attn_output.to(input_dtype) ++ ++ # # 7. 调整输出形状 ++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ # attn_output = self.o_proj(attn_output) ++ ++ # attn_weights = None ++ # if output_attentions: ++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++ ++ # return attn_output, attn_weights, past_key_value ++ ++ # def forward( ++ # self, ++ # hidden_states: mindspore.Tensor, ++ # attention_mask: Optional[mindspore.Tensor] = None, ++ # position_ids: Optional[mindspore.Tensor] = None, ++ # past_key_value: Optional[Cache] = None, ++ # output_attentions: bool = False, ++ # use_cache: bool = False, ++ # cache_position: Optional[mindspore.Tensor] = None, ++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++ # bsz, q_len, _ = hidden_states.shape ++ ++ # query_states = self.q_proj(hidden_states) ++ # key_states = self.k_proj(hidden_states) ++ # value_states = self.v_proj(hidden_states) ++ ++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++ # kv_seq_len = key_states.shape[-2] ++ # if past_key_value is not None: ++ # if self.layer_idx is None: ++ # raise ValueError("`layer_idx` must be specified for caching") ++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ # if past_key_value is not None: ++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++ # key_states, value_states = past_key_value.update( ++ # key_states, value_states, self.layer_idx, cache_kwargs ++ # ) ++ ++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++ ++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++ # query_states = query_states / math.sqrt(self.head_dim) ++ # # <--- 修改结束 --- ++ ++ # fa_attention_mask = None ++ # if attention_mask is not None: ++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++ # fa_attention_mask = (mask_slice != 0) ++ ++ # input_dtype = query_states.dtype ++ ++ # attn_output = mindspore.ops.flash_attention_score( ++ # query=query_states, # 传入已经预先缩放过的 query ++ # key=key_states, ++ # value=value_states, ++ # head_num=self.num_heads, ++ # attn_mask=fa_attention_mask, ++ # keep_prob=1.0 - self.attention_dropout, ++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++ # input_layout="BNSD", ++ # sparse_mode=0, ++ # inner_precise=1 # 仍然保持内部高精度计算 ++ # ) ++ ++ # attn_output = attn_output.to(input_dtype) ++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ # attn_output = self.o_proj(attn_output) ++ ++ # attn_weights = None ++ # if output_attentions: ++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++ ++ # return attn_output, attn_weights, past_key_value ++ + QWEN2MOE_ATTENTION_CLASSES = { + "eager": Qwen2MoeAttention, ++ "flash-attention": Qwen2MoeFlashAttention, + } + + +@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) + self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) + ++ #@dwj ++ # 只遍历激活的专家,而非全部专家 + def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +- batch_size, sequence_length, hidden_dim = hidden_states.shape +- hidden_states = hidden_states.view(-1, hidden_dim) +- # router_logits: (batch * sequence_length, n_experts) +- router_logits = self.gate(hidden_states) +- +- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- if self.norm_topk_prob: +- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- # we cast back to the input dtype +- routing_weights = routing_weights.to(hidden_states.dtype) +- +- final_hidden_states = ops.zeros( +- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +- ) +- +- # One hot encode the selected experts to create an expert mask +- # this will be used to easily index which expert is going to be sollicitated +- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +- +- # Loop over all available experts in the model and perform the computation on each expert +- for expert_idx in range(self.num_experts): +- expert_layer = self.experts[expert_idx] +- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +- +- # Index the correct hidden states and compute the expert hidden state for +- # the current expert. We need to make sure to multiply the output hidden +- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +- if 0 not in idx.shape: +- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +- +- # However `index_add_` only support torch tensors for indexing so we'll use +- # the `top_x` tensor here. +- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +- +- shared_expert_output = self.shared_expert(hidden_states) +- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +- +- final_hidden_states = final_hidden_states + shared_expert_output ++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++ num_tokens = hidden_states_reshaped.shape[0] ++ ++ router_logits = self.gate(hidden_states_reshaped) ++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++ if self.norm_topk_prob: ++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ routing_weights = routing_weights.to(hidden_states.dtype) ++ ++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++ flat_selected_experts = selected_experts.flatten() ++ ++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++ token_indices = broadcasted_token_indices.flatten() ++ ++ active_experts = ops.unique(flat_selected_experts) ++ ++ for expert_idx_tensor in active_experts: ++ expert_idx = expert_idx_tensor.item() ++ expert_layer = self.experts[expert_idx] ++ ++ mask = (flat_selected_experts == expert_idx_tensor) ++ selected_token_indices = token_indices[mask] ++ selected_routing_weights = routing_weights.flatten()[mask] ++ ++ current_states = hidden_states_reshaped[selected_token_indices] ++ ++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++ ++ final_hidden_states = final_hidden_states.index_add( ++ dim=0, ++ index=selected_token_indices, ++ source=expert_output.to(hidden_states.dtype) ++ ) ++ ++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output + +- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +- return final_hidden_states, router_logits ++ final_hidden_states = final_hidden_states + shared_expert_output ++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++ ++ return final_hidden_states, router_logits + + + class Qwen2MoeDecoderLayer(nn.Module): +@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): + + self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + ++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++ + if (layer_idx not in config.mlp_only_layers) and ( + config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 + ): +@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): + _no_split_modules = ["Qwen2MoeDecoderLayer"] + _skip_keys_device_placement = "past_key_values" + _supports_cache_class = True ++#lwx ++ # _supports_static_cache = True + + def _init_weights(self, module): + std = self.config.initializer_range +@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): + return causal_mask + + +-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + _tied_weights_keys = ["lm_head.weight"] + + def __init__(self, config): +@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): + self.num_experts_per_tok = config.num_experts_per_tok + # Initialize weights and apply final processing + self.post_init() ++ # @lwx ++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++ # self.generation_config.cache_implementation = "static" ++ self._warmed_up = False ++ ++ def warmup_moe_model(self): ++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++ test_texts = [ ++ "warmup short", ++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++ ] ++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++ if tokenizer is None: ++ from mindnlp.transformers import AutoTokenizer ++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++ self._warmup_tokenizer = tokenizer ++ ++ for text in test_texts: ++ inputs = tokenizer(text, return_tensors="ms") ++ with mindspore._no_grad(): ++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++ print("[Warmup] Qwen2-MoE 模型预热完成。") + + def get_input_embeddings(self): + return self.model.embed_tokens +@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): + >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] + "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." + ```""" ++ if not self._warmed_up: ++ self._warmed_up = True ++ self.warmup_moe_model() + + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_router_logits = ( +@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): + } + ) + return model_inputs ++# @lwx ++ # def _decode_one_tokens_logits( ++ # self, ++ # cur_token: mindspore.Tensor, ++ # input_pos: Optional[mindspore.Tensor], ++ # cache_position: mindspore.Tensor, ++ # past_key_values: StaticCache, ++ # ) -> mindspore.Tensor: ++ # """ ++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++ ++ # Args: ++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++ # input_pos: 输入位置信息,可选 ++ # cache_position: 当前token在cache中的位置,shape为(1,) ++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++ ++ # Returns: ++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++ # """ ++ # # 调用JIT编译的版本 ++ # return self.get_decode_one_tokens_logits( ++ # cur_token=cur_token, ++ # input_pos=input_pos, ++ # cache_position=cache_position, ++ # past_key_values=past_key_values, ++ # ) ++ ++ # @mindspore.jit(jit_level='O1') ++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++ # """ ++ # JIT编译的函数,用于高效的单token解码 ++ # 使用JIT编译优化以支持静态shape和高效执行 ++ ++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++ # """ ++ # outputs = self.model.forward( ++ # input_ids=cur_token, ++ # position_ids=input_pos, ++ # cache_position=cache_position, ++ # past_key_values=past_key_values, ++ # use_cache=True, ++ # return_dict=False, ++ # ) ++ ++ # hidden_states = outputs[0] ++ # logits = self.lm_head.forward(hidden_states) ++ # logits = logits.float() ++ ++ # return logits[:, -1, :] ++ ++ # def _sample( ++ # self, ++ # input_ids: mindspore.Tensor, ++ # logits_processor, ++ # stopping_criteria, ++ # generation_config, ++ # synced_devices: bool, ++ # streamer=None, ++ # logits_warper=None, ++ # **model_kwargs, ++ # ): ++ # """ ++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++ # """ ++ # from ...generation.logits_process import LogitsProcessorList ++ # from ...generation.stopping_criteria import StoppingCriteriaList ++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++ # from mindnlp.core import nn, ops, no_grad ++ # import numpy as np ++ ++ # # 检查是否使用 StaticCache ++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++ # # 否则,直接调用父类方法 ++ # past_key_values = model_kwargs.get("past_key_values") ++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++ ++ # if not isinstance(past_key_values, StaticCache): ++ # # 不使用 StaticCache,直接调用父类方法 ++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++ # return super()._sample( ++ # input_ids=input_ids, ++ # logits_processor=logits_processor, ++ # stopping_criteria=stopping_criteria, ++ # generation_config=generation_config, ++ # synced_devices=synced_devices, ++ # streamer=streamer, ++ # logits_warper=logits_warper, ++ # **model_kwargs, ++ # ) ++ ++ # # 使用 StaticCache,进入自定义循环 ++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++ # pad_token_id = generation_config._pad_token_tensor ++ # output_attentions = generation_config.output_attentions ++ # output_hidden_states = generation_config.output_hidden_states ++ # output_scores = generation_config.output_scores ++ # output_logits = generation_config.output_logits ++ # return_dict_in_generate = generation_config.return_dict_in_generate ++ # max_length = generation_config.max_length ++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++ # do_sample = generation_config.do_sample ++ ++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++ # raise ValueError( ++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++ # f"{logits_warper})." ++ # ) ++ ++ # # init attention / hidden states / scores tuples ++ # scores = () if (return_dict_in_generate and output_scores) else None ++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++ ++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++ # encoder_hidden_states = ( ++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++ # ) ++ ++ # # keep track of which sequences are already finished ++ # batch_size, cur_len = input_ids.shape ++ # this_peer_finished = False ++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++ ++ # time_record = [] ++ # from ....utils.testing_utils import parse_flag_from_env ++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++ ++ # while self._has_unfinished_sequences( ++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++ # ): ++ # if _record_time: ++ # import time as time_module ++ # infer_start = time_module.time() ++ ++ # # prepare model inputs ++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++ ++ # # prepare variable output controls ++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++ ++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++ # cur_cache_position = model_inputs.get("cache_position") ++ # cur_past_key_values = model_inputs.get("past_key_values") ++ # cur_input_ids = model_inputs.get("input_ids") ++ ++ # if (isinstance(cur_past_key_values, StaticCache) and ++ # cur_cache_position is not None and ++ # len(cur_cache_position.shape) > 0 and ++ # cur_cache_position.shape[0] == 1 and ++ # cur_input_ids is not None and ++ # cur_input_ids.shape[1] == 1): ++ # # 使用 JIT 优化的单 token 解码 ++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++ # if not hasattr(self, '_jit_used'): ++ # self._jit_used = False ++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++ ++ # next_token_logits = self.get_decode_one_tokens_logits( ++ # cur_token=cur_input_ids, ++ # input_pos=model_inputs.get("position_ids"), ++ # cache_position=cur_cache_position, ++ # past_key_values=cur_past_key_values, ++ # ) ++ ++ # # 标记已使用JIT(用于后续判断) ++ # if not self._jit_used: ++ # self._jit_used = True ++ ++ # # 构造兼容的输出对象 ++ # class JitOptimizedOutput: ++ # def __init__(self, logits, config): ++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++ # self.config = config ++ # # 对于 JIT 优化路径,这些属性通常不需要 ++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++ # self.attentions = None if not config.is_encoder_decoder else None ++ # self.cross_attentions = None ++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++ # self.hidden_states = None if not config.is_encoder_decoder else None ++ ++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++ # else: ++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++ # outputs = self(**model_inputs, return_dict=True) ++ ++ # if synced_devices and this_peer_finished: ++ # continue ++ ++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++ # next_token_logits = outputs.logits[:, -1, :] ++ ++ # # pre-process distribution ++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++ # if do_sample: ++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++ ++ # # Store scores, attentions and hidden_states when required ++ # if return_dict_in_generate: ++ # if output_scores: ++ # scores += (next_token_scores,) ++ # if output_logits: ++ # raw_logits += (next_token_logits,) ++ # if output_attentions: ++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++ # decoder_attentions += (attn,) if attn is not None else (None,) ++ # if self.config.is_encoder_decoder: ++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++ ++ # if output_hidden_states: ++ # hidden = ( ++ # outputs.decoder_hidden_states ++ # if self.config.is_encoder_decoder ++ # else outputs.hidden_states ++ # ) ++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++ ++ # # token selection ++ # if do_sample: ++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++ # else: ++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++ ++ # # finished sentences should have their next token be a padding token ++ # if has_eos_stopping_criteria: ++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++ ++ # # update generated ids, model inputs, and length for next step ++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++ # if streamer is not None: ++ # streamer.put(next_tokens) ++ ++ # model_kwargs = self._update_model_kwargs_for_generation( ++ # outputs, ++ # model_kwargs, ++ # is_encoder_decoder=self.config.is_encoder_decoder, ++ # ) ++ ++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++ # cur_len += 1 ++ ++ # if _record_time: ++ # import time as time_module ++ # infer_stop = time_module.time() ++ # time_record.append(infer_stop - infer_start) ++ ++ # del outputs ++ ++ # average_infer_time = None ++ # if time_record: ++ # if len(time_record) > 1: ++ # time_record.pop(0) ++ # average_infer_time = sum(time_record) / len(time_record) ++ # print(f'average inference time is: {average_infer_time}') ++ # print(f'inference time record: {time_record}') ++ ++ # if streamer is not None: ++ # streamer.end() ++ ++ # # 简单判断:打印是否使用了JIT路径 ++ # if hasattr(self, '_jit_used') and self._jit_used: ++ # print("[JIT] ✓ JIT optimization was used during generation") ++ # else: ++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++ ++ # if return_dict_in_generate: ++ # if self.config.is_encoder_decoder: ++ # return GenerateEncoderDecoderOutput( ++ # sequences=input_ids, ++ # scores=scores, ++ # logits=raw_logits, ++ # encoder_attentions=encoder_attentions, ++ # encoder_hidden_states=encoder_hidden_states, ++ # decoder_attentions=decoder_attentions, ++ # cross_attentions=cross_attentions, ++ # decoder_hidden_states=decoder_hidden_states, ++ # past_key_values=model_kwargs.get("past_key_values"), ++ # average_infer_time=average_infer_time ++ # ) ++ # else: ++ # return GenerateDecoderOnlyOutput( ++ # sequences=input_ids, ++ # scores=scores, ++ # logits=raw_logits, ++ # attentions=decoder_attentions, ++ # hidden_states=decoder_hidden_states, ++ # past_key_values=model_kwargs.get("past_key_values"), ++ # average_infer_time=average_infer_time ++ # ) ++ # else: ++ # return input_ids ++ ++ # def _prepare_cache_for_generation( ++ # self, ++ # generation_config, ++ # model_kwargs, ++ # assistant_model, ++ # batch_size, ++ # max_cache_length, ++ # ): ++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++ # generation_config.cache_implementation = "static" ++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++ ++ # if generation_config.cache_implementation == "static": ++ # base_required_from_max_length = generation_config.max_length + 1 ++ # base_required = max(max_cache_length, base_required_from_max_length) ++ # min_cache_size = 50 ++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++ # else: ++ # max_cache_length = max(base_required, min_cache_size) ++ ++ # original_max_cache_length = max_cache_length ++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++ # print(f" - input max_cache_length: {original_max_cache_length}") ++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++ # print(f" - final max_cache_length: {max_cache_length}") ++ ++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++ # if max_cache_length > self.config.max_position_embeddings: ++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++ ++ # result = super()._prepare_cache_for_generation( ++ # generation_config=generation_config, ++ # model_kwargs=model_kwargs, ++ # assistant_model=assistant_model, ++ # batch_size=batch_size, ++ # max_cache_length=max_cache_length, ++ # ) ++ ++ # if generation_config.cache_implementation == "static": ++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++ # created_cache = model_kwargs.get(cache_name) ++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++ # if created_cache.max_cache_len < generation_config.max_length: ++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++ ++ # return result ++ ++ ++ + + + # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" new file mode 100644 index 00000000..baee9388 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" @@ -0,0 +1,3200 @@ +From dcd6fc7b6307db27f23087ba3958949eb52a9beb Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Thu, 6 Nov 2025 09:20:38 +0800 +Subject: [PATCH 02/10] 20251106commit + +--- + .../models/deepseek/modeling_deepseek.py | 379 ++++- + .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- + patches/0001-20251104commit.patch | 1272 ++++++++++++++++ + 3 files changed, 2689 insertions(+), 305 deletions(-) + create mode 100644 patches/0001-20251104commit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index d8303e45..73773c22 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): + # y = y + self.shared_experts(identity) + # return y + ++ # @no_grad() ++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ ++ # expert_cache = ops.zeros_like(x) ++ # for i in range(self.num_experts_per_tok): ++ # expert_id = flat_expert_indices[i].item() ++ # weight = flat_expert_weights[i].item() ++ # expert = self.experts[expert_id] ++ # expert_out = expert(x) ++ # expert_cache += expert_out * weight ++ # return expert_cache ++ + @no_grad() + def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ # x 的 shape: (1, hidden_size) ++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++ ++ # 1. 收集所有需要的专家层 ++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++ ++ # 2. 并行计算所有专家的输出 ++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++ # ops.cat 会将它们堆叠成一个新的 Tensor ++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++ ++ # 3. 使用矩阵乘法进行加权求和 ++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ # 最终结果 final_output 的 shape: (1, hidden_size) ++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++ ++ return final_output + +- expert_cache = ops.zeros_like(x) +- for i in range(self.num_experts_per_tok): +- expert_id = flat_expert_indices[i].item() +- weight = flat_expert_weights[i].item() +- expert = self.experts[expert_id] +- expert_out = expert(x) +- expert_cache += expert_out * weight +- return expert_cache + + @no_grad() + def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + +- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++ # @lwx ++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) ++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) ++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) + + kv_seq_len = key_states.shape[-2] + if past_key_value is not None: +@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): + return attn_output, attn_weights, past_key_value + + ++# class DeepseekFlashAttention(nn.Module): ++# """ ++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++ ++# This class is designed as a drop-in replacement for DeepseekAttention. ++# """ ++ ++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++# super().__init__() ++# self.config = config ++# self.layer_idx = layer_idx ++# if layer_idx is None: ++# logger.warning( ++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++# "when creating this class." ++# ) ++ ++# self.attention_dropout = config.attention_dropout ++# self.hidden_size = config.hidden_size ++# self.num_heads = config.num_attention_heads ++# self.head_dim = self.hidden_size // self.num_heads ++# self.num_key_value_heads = config.num_key_value_heads ++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++# self.max_position_embeddings = config.max_position_embeddings ++# self.rope_theta = config.rope_theta ++# self.is_causal = True ++ ++# if (self.head_dim * self.num_heads) != self.hidden_size: ++# raise ValueError( ++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++# f" and `num_heads`: {self.num_heads})." ++# ) ++ ++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++# self._init_rope() ++ ++# def _init_rope(self): ++# if self.config.rope_scaling is None: ++# self.rotary_emb = DeepseekRotaryEmbedding( ++# self.head_dim, ++# max_position_embeddings=self.max_position_embeddings, ++# base=self.rope_theta, ++# ) ++# else: ++# scaling_type = self.config.rope_scaling["type"] ++# scaling_factor = self.config.rope_scaling["factor"] ++# if scaling_type == "linear": ++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++# self.head_dim, ++# max_position_embeddings=self.max_position_embeddings, ++# scaling_factor=scaling_factor, ++# base=self.rope_theta, ++# ) ++# elif scaling_type == "dynamic": ++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++# self.head_dim, ++# max_position_embeddings=self.max_position_embeddings, ++# scaling_factor=scaling_factor, ++# base=self.rope_theta, ++# ) ++# else: ++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++ ++# def forward( ++# self, ++# hidden_states: mindspore.Tensor, ++# attention_mask: Optional[mindspore.Tensor] = None, ++# position_ids: Optional[mindspore.Tensor] = None, ++# past_key_value: Optional[Cache] = None, ++# output_attentions: bool = False, ++# use_cache: bool = False, ++# **kwargs, ++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++# if "padding_mask" in kwargs: ++# warnings.warn( ++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++# ) ++ ++# if output_attentions: ++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++ ++# bsz, q_len, _ = hidden_states.shape ++ ++# if self.config.pretraining_tp > 1: ++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++ ++# query_states = self.q_proj(hidden_states) ++# key_states = self.k_proj(hidden_states) ++# value_states = self.v_proj(hidden_states) ++ ++# # Reshape for multi-head attention ++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++# kv_seq_len = key_states.shape[-2] ++# if past_key_value is not None: ++# if self.layer_idx is None: ++# raise ValueError( ++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++# "with a layer index." ++# ) ++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++# # Apply Rotary Positional Embedding ++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++# if past_key_value is not None: ++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ ++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++ ++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++ ++# # Convert attention_mask for flash_attention_score ++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++# if attention_mask is not None: ++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++# raise ValueError( ++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++# ) ++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++# else: ++# attn_mask_for_fa = None ++ ++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++ ++# # Call the fused flash_attention_score operator ++# attn_output = mindspore.ops.flash_attention_score( ++# query=query_states_for_fa, ++# key=key_states_for_fa, ++# value=value_states_for_fa, ++# head_num=self.num_heads, # This is N1, the number of query heads ++# input_layout='BSH', ++# attn_mask=attn_mask_for_fa, ++# keep_prob=keep_prob, ++# scalar_value=1.0 / math.sqrt(self.head_dim), ++# sparse_mode=0 # Default mask mode ++# ) ++ ++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++# attn_output = self.o_proj(attn_output) ++ ++# # Flash Attention does not return attention weights ++# attn_weights = None ++ ++# return attn_output, attn_weights, past_key_value ++ ++class DeepseekFlashAttention(nn.Module): ++ """ ++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. ++ This implementation is a drop-in replacement for the original DeepseekAttention class, ++ designed for high performance on supported hardware (Ascend). ++ ++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. ++ """ ++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++ super().__init__() ++ self.config = config ++ self.layer_idx = layer_idx ++ if layer_idx is None: ++ logger.warning( ++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++ "when creating this class." ++ ) ++ ++ # --- [FIX] Correctly initialize all required attributes --- ++ self.attention_dropout = config.attention_dropout ++ self.hidden_size = config.hidden_size ++ self.num_heads = config.num_attention_heads ++ self.head_dim = self.hidden_size // self.num_heads ++ self.num_key_value_heads = config.num_key_value_heads ++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++ self.max_position_embeddings = config.max_position_embeddings ++ self.rope_theta = config.rope_theta ++ self.is_causal = True ++ ++ if (self.head_dim * self.num_heads) != self.hidden_size: ++ raise ValueError( ++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++ f" and `num_heads`: {self.num_heads})." ++ ) ++ ++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++ ++ # This call will now succeed as all attributes are initialized. ++ self._init_rope() ++ ++ def _init_rope(self): ++ if self.config.rope_scaling is None: ++ self.rotary_emb = DeepseekRotaryEmbedding( ++ self.head_dim, ++ max_position_embeddings=self.max_position_embeddings, ++ base=self.rope_theta, ++ ) ++ else: ++ scaling_type = self.config.rope_scaling["type"] ++ scaling_factor = self.config.rope_scaling["factor"] ++ if scaling_type == "linear": ++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++ self.head_dim, ++ max_position_embeddings=self.max_position_embeddings, ++ scaling_factor=scaling_factor, ++ base=self.rope_theta, ++ ) ++ elif scaling_type == "dynamic": ++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++ self.head_dim, ++ max_position_embeddings=self.max_position_embeddings, ++ scaling_factor=scaling_factor, ++ base=self.rope_theta, ++ ) ++ else: ++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++ ++ def forward( ++ self, ++ hidden_states: mindspore.Tensor, ++ attention_mask: Optional[mindspore.Tensor] = None, ++ position_ids: Optional[mindspore.Tensor] = None, ++ past_key_value: Optional[Cache] = None, ++ output_attentions: bool = False, ++ use_cache: bool = False, ++ **kwargs, ++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ if "padding_mask" in kwargs: ++ warnings.warn( ++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++ ) ++ if output_attentions: ++ warnings.warn( ++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." ++ ) ++ ++ bsz, q_len, _ = hidden_states.shape ++ ++ if self.config.pretraining_tp > 1: ++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++ ++ query_states = self.q_proj(hidden_states) ++ key_states = self.k_proj(hidden_states) ++ value_states = self.v_proj(hidden_states) ++ ++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) ++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++ kv_seq_len = key_states.shape[-2] ++ if past_key_value is not None: ++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++ # Apply Rotary Position Embedding ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ if past_key_value is not None: ++ cache_kwargs = {"sin": sin, "cos": cos} ++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. ++ # So we must explicitly repeat the KV heads. ++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++ ++ # Convert attention mask for flash_attention_score ++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. ++ if attention_mask is not None: ++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++ raise ValueError( ++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++ ) ++ attn_mask_for_fa = attention_mask < 0 ++ else: ++ attn_mask_for_fa = None ++ ++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++ ++ # Call the fused operator using the efficient BNSD layout ++ attn_output = mindspore.ops.flash_attention_score( ++ query=query_states, ++ key=key_states, ++ value=value_states, ++ head_num=self.num_heads, ++ input_layout='BNSD', # Specify the correct layout ++ attn_mask=attn_mask_for_fa, ++ keep_prob=keep_prob, ++ scalar_value=1.0 / math.sqrt(self.head_dim) ++ ) ++ ++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. ++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ ++ # Apply output projection ++ attn_output = self.o_proj(attn_output) ++ ++ # Flash attention does not return attention weights, so we return None. ++ attn_weights = None ++ ++ return attn_output, attn_weights, past_key_value ++ + Deepseek_ATTENTION_CLASSES = { + "eager": DeepseekAttention, ++ "flash-attention": DeepseekFlashAttention, + } + + +@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): + config=config, layer_idx=layer_idx + ) + ++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++ config=config, layer_idx=layer_idx ++ ) ++ + self.mlp = ( + DeepseekMoE(config) + if ( +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index d4c6b651..bced285c 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union + + import mindspore + import mindnlp.core.nn.functional as F +-from mindnlp.core import nn, ops ++from mindnlp.core import nn, ops, no_grad + from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss + + from ....common.activations import ACT2FN +@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) + _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" + _CONFIG_FOR_DOC = "Qwen2MoeConfig" + ++Long_Prompt = False ++PROMPT_LENGTH_THRESHOLD = 128 + + # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position + def _prepare_4d_causal_attention_mask_with_cache_position( +@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): + return attn_output, attn_weights, past_key_value + + ++# class Qwen2MoeFlashAttention(nn.Module): ++# """ ++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++ ++# 关键改动: ++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++# 直接传入原始的 key 和 value 张量效率更高。 ++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++# super().__init__() ++# self.config = config ++# self.layer_idx = layer_idx ++# self.hidden_size = config.hidden_size ++# self.num_heads = config.num_attention_heads ++# self.head_dim = self.hidden_size // self.num_heads ++# self.num_key_value_heads = config.num_key_value_heads ++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++# self.max_position_embeddings = config.max_position_embeddings ++# self.rope_theta = config.rope_theta ++# self.attention_dropout = config.attention_dropout ++ ++# if (self.head_dim * self.num_heads) != self.hidden_size: ++# raise ValueError( ++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++# ) ++ ++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++ ++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++# self.head_dim, ++# max_position_embeddings=self.max_position_embeddings, ++# base=self.rope_theta, ++# ) ++ ++# def forward( ++# self, ++# hidden_states: mindspore.Tensor, ++# attention_mask: Optional[mindspore.Tensor] = None, ++# position_ids: Optional[mindspore.Tensor] = None, ++# past_key_value: Optional[Cache] = None, ++# output_attentions: bool = False, ++# use_cache: bool = False, ++# cache_position: Optional[mindspore.Tensor] = None, ++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++# bsz, q_len, _ = hidden_states.shape ++ ++# # 1. 线性投射 Q, K, V ++# query_states = self.q_proj(hidden_states) ++# key_states = self.k_proj(hidden_states) ++# value_states = self.v_proj(hidden_states) ++ ++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++# # query: [B, S, H*D] -> [B, N1, S, D] ++# # key/val: [B, S, H2*D] -> [B, N2, S, D] ++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++# # 3. RoPE 旋转位置编码 ++# kv_seq_len = key_states.shape[-2] ++# if past_key_value is not None: ++# if self.layer_idx is None: ++# raise ValueError( ++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++# "with a layer index." ++# ) ++# # 对于 StaticCache,需要特殊处理 kv_seq_len ++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++# # 使用 cache_position 的长度来确定实际的 kv_seq_len ++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++# # 临时解决方案:使用 cache_position 的最大值(如果可能) ++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++# if cache_position.shape[0] == 1: ++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++# kv_seq_len = past_seen_tokens + 1 ++# else: ++# # prefill 阶段:cache_position 是范围,使用其长度 ++# kv_seq_len = cache_position.shape[0] + past_seen_tokens ++# else: ++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++# # 4. KV 缓存更新 ++# if past_key_value is not None: ++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++# key_states, value_states = past_key_value.update( ++# key_states, value_states, self.layer_idx, cache_kwargs ++# ) ++ ++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++# if cache_position.shape[0] == 1: ++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++# kv_seq_len = key_states.shape[-2] ++ ++# # 5. [重要] 准备 Attention Mask ++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++# fa_attention_mask = None ++# if attention_mask is not None: ++# # 截取与当前key长度匹配的部分 ++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++# # 转换为布尔类型: 大负数 -> True, 0 -> False ++# fa_attention_mask = (mask_slice != 0) ++ ++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++# input_dtype = query_states.dtype ++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++# query_states = query_states.to(mindspore.float16) ++# key_states = key_states.to(mindspore.float16) ++# value_states = value_states.to(mindspore.float16) ++ ++# # 6. [核心] 调用 flash_attention_score 算子 ++# # - 无需手动 repeat_kv, 算子原生支持 GQA ++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++# attn_output = mindspore.ops.flash_attention_score( ++# query=query_states, ++# key=key_states, ++# value=value_states, ++# head_num=self.num_heads, # 传入Q的头数(N1) ++# attn_mask=fa_attention_mask, ++# keep_prob=1.0 - self.attention_dropout, ++# scalar_value=1.0 / math.sqrt(self.head_dim), ++# input_layout="BNSD", ++# sparse_mode=0 # 使用 defaultMask 模式 ++# ) ++ ++# # 恢复原始数据类型 ++# attn_output = attn_output.to(input_dtype) ++ ++# # 7. 调整输出形状 ++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++# attn_output = self.o_proj(attn_output) ++ ++# # FlashAttention 算子不直接返回注意力权重矩阵 ++# attn_weights = None ++# if output_attentions: ++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++ ++# return attn_output, attn_weights, past_key_value ++ ++# # def forward( ++# # self, ++# # hidden_states: mindspore.Tensor, ++# # attention_mask: Optional[mindspore.Tensor] = None, ++# # position_ids: Optional[mindspore.Tensor] = None, ++# # past_key_value: Optional[Cache] = None, ++# # output_attentions: bool = False, ++# # use_cache: bool = False, ++# # cache_position: Optional[mindspore.Tensor] = None, ++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++# # bsz, q_len, _ = hidden_states.shape ++ ++# # # 1. 线性投射 Q, K, V ++# # query_states = self.q_proj(hidden_states) ++# # key_states = self.k_proj(hidden_states) ++# # value_states = self.v_proj(hidden_states) ++ ++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ ++# # # 3. RoPE 旋转位置编码 ++# # kv_seq_len = key_states.shape[-2] ++# # if past_key_value is not None: ++# # if self.layer_idx is None: ++# # raise ValueError( ++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++# # "with a layer index." ++# # ) ++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++# # # 4. KV 缓存更新 ++# # if past_key_value is not None: ++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++# # key_states, value_states = past_key_value.update( ++# # key_states, value_states, self.layer_idx, cache_kwargs ++# # ) ++ ++# # # 5. 准备 Attention Mask ++# # fa_attention_mask = None ++# # if attention_mask is not None: ++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++# # fa_attention_mask = (mask_slice != 0) ++ ++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++# # input_dtype = query_states.dtype ++ ++# # # 6. [核心] 调用 flash_attention_score 算子 ++# # attn_output = mindspore.ops.flash_attention_score( ++# # query=query_states, ++# # key=key_states, ++# # value=value_states, ++# # head_num=self.num_heads, ++# # attn_mask=fa_attention_mask, ++# # keep_prob=1.0 - self.attention_dropout, ++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++# # input_layout="BNSD", ++# # sparse_mode=0, ++# # # <--- 修改点 2: 启用内部高精度计算 --- ++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++# # inner_precise=1 ++# # ) ++ ++# # # 恢复原始数据类型 ++# # attn_output = attn_output.to(input_dtype) ++ ++# # # 7. 调整输出形状 ++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++# # attn_output = self.o_proj(attn_output) ++ ++# # attn_weights = None ++# # if output_attentions: ++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++ ++# # return attn_output, attn_weights, past_key_value ++ ++ + class Qwen2MoeFlashAttention(nn.Module): + """ +- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +- +- 关键改动: +- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +- 直接传入原始的 key 和 value 张量效率更高。 +- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 ++ ++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` ++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, ++ 完全使用模型的低精度数据类型(如 float16)进行计算, ++ 以达到理论上的最高执行速度。 + """ + def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): + super().__init__() + self.config = config + self.layer_idx = layer_idx ++ if layer_idx is None: ++ logger.warning_once( ++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." ++ ) ++ + self.hidden_size = config.hidden_size + self.num_heads = config.num_attention_heads + self.head_dim = self.hidden_size // self.num_heads + self.num_key_value_heads = config.num_key_value_heads +- self.num_key_value_groups = self.num_heads // self.num_key_value_heads + self.max_position_embeddings = config.max_position_embeddings + self.rope_theta = config.rope_theta + self.attention_dropout = config.attention_dropout + +- if (self.head_dim * self.num_heads) != self.hidden_size: +- raise ValueError( +- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +- ) +- + self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) + self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) + self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + +- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +- # query: [B, S, H*D] -> [B, N1, S, D] +- # key/val: [B, S, H2*D] -> [B, N2, S, D] ++ # 2. 调整形状以匹配 BNSD 布局 + query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) + key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) + value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +- # 3. RoPE 旋转位置编码 ++ ++ # 3. RoPE 和 KV 缓存 + kv_seq_len = key_states.shape[-2] + if past_key_value is not None: +- if self.layer_idx is None: +- raise ValueError( +- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +- "with a layer index." +- ) +- # 对于 StaticCache,需要特殊处理 kv_seq_len +- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +- if isinstance(past_key_value, StaticCache) and cache_position is not None: +- # 使用 cache_position 的长度来确定实际的 kv_seq_len +- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +- # 临时解决方案:使用 cache_position 的最大值(如果可能) +- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +- if cache_position.shape[0] == 1: +- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +- kv_seq_len = past_seen_tokens + 1 +- else: +- # prefill 阶段:cache_position 是范围,使用其长度 +- kv_seq_len = cache_position.shape[0] + past_seen_tokens +- else: +- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- ++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ + cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) + +- # 4. KV 缓存更新 + if past_key_value is not None: + cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +- key_states, value_states = past_key_value.update( +- key_states, value_states, self.layer_idx, cache_kwargs +- ) +- +- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +- if isinstance(past_key_value, StaticCache) and cache_position is not None: +- if cache_position.shape[0] == 1: +- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +- kv_seq_len = key_states.shape[-2] +- +- # 5. [重要] 准备 Attention Mask +- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++ # 4. 准备 Attention Mask + fa_attention_mask = None + if attention_mask is not None: +- # 截取与当前key长度匹配的部分 +- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) + mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +- # 转换为布尔类型: 大负数 -> True, 0 -> False + fa_attention_mask = (mask_slice != 0) + +- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +- input_dtype = query_states.dtype +- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +- query_states = query_states.to(mindspore.float16) +- key_states = key_states.to(mindspore.float16) +- value_states = value_states.to(mindspore.float16) +- +- # 6. [核心] 调用 flash_attention_score 算子 +- # - 无需手动 repeat_kv, 算子原生支持 GQA +- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 + attn_output = mindspore.ops.flash_attention_score( + query=query_states, + key=key_states, + value=value_states, +- head_num=self.num_heads, # 传入Q的头数(N1) ++ head_num=self.num_heads, + attn_mask=fa_attention_mask, +- keep_prob=1.0 - self.attention_dropout, ++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout + scalar_value=1.0 / math.sqrt(self.head_dim), + input_layout="BNSD", +- sparse_mode=0 # 使用 defaultMask 模式 ++ sparse_mode=0, ++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 + ) + +- # 恢复原始数据类型 +- attn_output = attn_output.to(input_dtype) +- +- # 7. 调整输出形状 +- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++ # 6. 调整输出形状 + attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) + attn_output = self.o_proj(attn_output) + +- # FlashAttention 算子不直接返回注意力权重矩阵 ++ # 7. 返回结果 + attn_weights = None + if output_attentions: +- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") + + return attn_output, attn_weights, past_key_value + +- # def forward( +- # self, +- # hidden_states: mindspore.Tensor, +- # attention_mask: Optional[mindspore.Tensor] = None, +- # position_ids: Optional[mindspore.Tensor] = None, +- # past_key_value: Optional[Cache] = None, +- # output_attentions: bool = False, +- # use_cache: bool = False, +- # cache_position: Optional[mindspore.Tensor] = None, +- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- +- # bsz, q_len, _ = hidden_states.shape +- +- # # 1. 线性投射 Q, K, V +- # query_states = self.q_proj(hidden_states) +- # key_states = self.k_proj(hidden_states) +- # value_states = self.v_proj(hidden_states) +- +- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +- # # 3. RoPE 旋转位置编码 +- # kv_seq_len = key_states.shape[-2] +- # if past_key_value is not None: +- # if self.layer_idx is None: +- # raise ValueError( +- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +- # "with a layer index." +- # ) +- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) + +- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +- # # 4. KV 缓存更新 +- # if past_key_value is not None: +- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +- # key_states, value_states = past_key_value.update( +- # key_states, value_states, self.layer_idx, cache_kwargs +- # ) +- +- # # 5. 准备 Attention Mask +- # fa_attention_mask = None +- # if attention_mask is not None: +- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +- # fa_attention_mask = (mask_slice != 0) +- +- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +- # input_dtype = query_states.dtype +- +- # # 6. [核心] 调用 flash_attention_score 算子 +- # attn_output = mindspore.ops.flash_attention_score( +- # query=query_states, +- # key=key_states, +- # value=value_states, +- # head_num=self.num_heads, +- # attn_mask=fa_attention_mask, +- # keep_prob=1.0 - self.attention_dropout, +- # scalar_value=1.0 / math.sqrt(self.head_dim), +- # input_layout="BNSD", +- # sparse_mode=0, +- # # <--- 修改点 2: 启用内部高精度计算 --- +- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +- # inner_precise=1 +- # ) +- +- # # 恢复原始数据类型 +- # attn_output = attn_output.to(input_dtype) ++QWEN2MOE_ATTENTION_CLASSES = { ++ "eager": Qwen2MoeAttention, ++ "flash-attention": Qwen2MoeFlashAttention, ++} + +- # # 7. 调整输出形状 +- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +- # attn_output = self.o_proj(attn_output) + +- # attn_weights = None +- # if output_attentions: +- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# def __init__(self, config): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# # gating ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++ ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# #@dwj ++# # 只遍历激活的专家,而非全部专家 ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# num_tokens = hidden_states_reshaped.shape[0] ++ ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++# routing_weights = routing_weights.to(hidden_states.dtype) ++ ++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++# flat_selected_experts = selected_experts.flatten() ++ ++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++# token_indices = broadcasted_token_indices.flatten() ++ ++# active_experts = ops.unique(flat_selected_experts) ++ ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++ ++# mask = (flat_selected_experts == expert_idx_tensor) ++# selected_token_indices = token_indices[mask] ++# selected_routing_weights = routing_weights.flatten()[mask] ++ ++# current_states = hidden_states_reshaped[selected_token_indices] ++ ++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++ ++# final_hidden_states = final_hidden_states.index_add( ++# dim=0, ++# index=selected_token_indices, ++# source=expert_output.to(hidden_states.dtype) ++# ) ++ ++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output + +- # return attn_output, attn_weights, past_key_value ++# final_hidden_states = final_hidden_states + shared_expert_output ++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ ++ ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# """ ++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++# `_moe_infer_prefill` (用于长序列处理) 方法。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# # 门控网络 ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# # 专家列表 ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++# # 共享专家 ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# @no_grad() ++# def _moe_infer_decode( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# """ ++# 【解码路径】针对 sequence_length=1 的极致优化。 ++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++# """ ++# batch_size, hidden_dim = hidden_states.shape ++ ++# expert_outputs_list = [ ++# ops.cat([ ++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++# ], dim=0) ++# for i in range(batch_size) ++# ] ++ ++# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++# # shape: (batch_size, top_k, hidden_dim) ++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++ ++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++ ++# return moe_output.squeeze(1) ++ ++# @no_grad() ++# def _moe_infer_prefill( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# """ ++# 【预填充路径】针对 sequence_length > 1 的优化。 ++# 按专家对 Token 进行分组,并进行批处理。 ++# """ ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens = hidden_states.shape[0] ++# flat_selected_experts = selected_experts.flatten() ++ ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++ ++# active_experts = ops.unique(flat_selected_experts) ++ ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++ ++# mask = (flat_selected_experts == expert_idx_tensor) ++# selected_token_indices = token_indices[mask] ++# selected_routing_weights = routing_weights.flatten()[mask] ++ ++# current_states = hidden_states[selected_token_indices] ++ ++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++ ++# moe_output = moe_output.index_add( ++# dim=0, ++# index=selected_token_indices, ++# source=expert_output.to(hidden_states.dtype) ++# ) ++# return moe_output ++ ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# """ ++# 顶层 forward 方法,作为智能分发器。 ++# """ ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++ ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) + +- # def forward( +- # self, +- # hidden_states: mindspore.Tensor, +- # attention_mask: Optional[mindspore.Tensor] = None, +- # position_ids: Optional[mindspore.Tensor] = None, +- # past_key_value: Optional[Cache] = None, +- # output_attentions: bool = False, +- # use_cache: bool = False, +- # cache_position: Optional[mindspore.Tensor] = None, +- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- +- # bsz, q_len, _ = hidden_states.shape +- +- # query_states = self.q_proj(hidden_states) +- # key_states = self.k_proj(hidden_states) +- # value_states = self.v_proj(hidden_states) +- +- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +- # kv_seq_len = key_states.shape[-2] +- # if past_key_value is not None: +- # if self.layer_idx is None: +- # raise ValueError("`layer_idx` must be specified for caching") +- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- +- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +- # if past_key_value is not None: +- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +- # key_states, value_states = past_key_value.update( +- # key_states, value_states, self.layer_idx, cache_kwargs +- # ) ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++# routing_weights = routing_weights.to(hidden_states.dtype) ++ ++# moe_output = None ++# # 在推理时,根据序列长度选择最优路径 ++# if not self.training: ++# if sequence_length == 1: ++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++# else: ++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++# else: ++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++# raise NotImplementedError("Training path is not implemented.") ++ ++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++ ++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++ ++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ ++ ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# """ ++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# # 门控网络 ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# # 专家列表 ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++# # 共享专家 ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# @no_grad() ++# def _moe_infer_decode( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# batch_size, _ = hidden_states.shape ++# expert_outputs_list = [ ++# ops.cat([ ++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++# ], dim=0) ++# for i in range(batch_size) ++# ] ++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++# return moe_output.squeeze(1) ++ ++# @no_grad() ++# def _moe_infer_prefill( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens = hidden_states.shape[0] ++# flat_selected_experts = selected_experts.flatten() ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++# active_experts = ops.unique(flat_selected_experts) ++ ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++# mask = (flat_selected_experts == expert_idx_tensor) ++# selected_token_indices = token_indices[mask] ++# selected_routing_weights = routing_weights.flatten()[mask] ++# current_states = hidden_states[selected_token_indices] ++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++# moe_output = moe_output.index_add( ++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++# ) ++# return moe_output ++ ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# """ ++# 顶层 forward 方法,作为智能分发器。 ++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++# """ ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++ ++# # 1. 门控计算 (通用逻辑) ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++# routing_weights = routing_weights.to(hidden_states.dtype) ++ ++# # 2. 智能分发到最优 MoE 路径 ++# moe_output = None ++# if not self.training: ++# if sequence_length == 1: ++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++# else: ++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++# else: ++# raise NotImplementedError("Training path is not implemented.") ++ ++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++ ++# # 4. 合并 MoE 输出和共享专家输出 ++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++ ++# # 5. 恢复原始形状并返回 ++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ ++# prefill fastest ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# """ ++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# # 门控网络 ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# # 专家列表 ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++# # 共享专家 ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# @no_grad() ++# def _moe_infer_dispatch( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# """ ++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++# """ ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens, _ = hidden_states.shape ++ ++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++# flat_selected_experts = selected_experts.flatten() ++# flat_routing_weights = routing_weights.flatten() + +- # key_states = repeat_kv(key_states, self.num_key_value_groups) +- # value_states = repeat_kv(value_states, self.num_key_value_groups) +- +- # # <--- 核心修改点: 手动进行高精度缩放 --- +- # # 在调用算子前,手动将 query_states 除以缩放因子。 +- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +- # query_states = query_states / math.sqrt(self.head_dim) +- # # <--- 修改结束 --- +- +- # fa_attention_mask = None +- # if attention_mask is not None: +- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +- # fa_attention_mask = (mask_slice != 0) +- +- # input_dtype = query_states.dtype +- +- # attn_output = mindspore.ops.flash_attention_score( +- # query=query_states, # 传入已经预先缩放过的 query +- # key=key_states, +- # value=value_states, +- # head_num=self.num_heads, +- # attn_mask=fa_attention_mask, +- # keep_prob=1.0 - self.attention_dropout, +- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +- # input_layout="BNSD", +- # sparse_mode=0, +- # inner_precise=1 # 仍然保持内部高精度计算 +- # ) ++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() + +- # attn_output = attn_output.to(input_dtype) +- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +- # attn_output = self.o_proj(attn_output) ++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++# active_experts = ops.unique(flat_selected_experts) ++ ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++ ++# # 找到所有分配给该专家的 token ++# mask = (flat_selected_experts == expert_idx_tensor) ++ ++# # 使用 mask 选取对应的 token 和权重 ++# current_token_indices = token_indices[mask] ++# current_routing_weights = flat_routing_weights[mask] ++# current_hidden_states = hidden_states[current_token_indices] ++ ++# # 对这些 token 进行批处理 ++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++ ++# # 使用 index_add 将结果精确地加回到对应位置 ++# moe_output = moe_output.index_add( ++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++# ) ++# return moe_output ++ ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# """ ++# 顶层 forward 方法,作为智能分发器。 ++# """ ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++ ++# # 1. 门控计算 ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++# routing_weights = routing_weights.to(hidden_states.dtype) ++ ++# # 2. 调用统一的 MoE 计算内核 ++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) + +- # attn_weights = None +- # if output_attentions: +- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++# # 3. 统一处理共享专家 ++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++ ++# # 4. 合并输出 ++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++ ++# # 5. 恢复原始形状并返回 ++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ ++ ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# """ ++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++# 【最终高性能与高精度版】: ++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++# 3. 这样实现了速度和准确性的两全其美。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# @no_grad() ++# def _moe_infer_decode( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# """ ++# 【解码路径】极致优化版:bmm + 高精度累加。 ++# """ ++# original_dtype = hidden_states.dtype ++# batch_size, _ = hidden_states.shape ++ ++# expert_outputs_list = [ ++# ops.cat([ ++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++# ], dim=0) ++# for i in range(batch_size) ++# ] ++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++ ++# # 在 float32 下执行 bmm,得到高精度结果 ++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++ ++# # 将高精度结果转换回原始数据类型 ++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++ ++# return moe_output ++ ++# @no_grad() ++# def _moe_infer_prefill( ++# self, ++# hidden_states: mindspore.Tensor, ++# selected_experts: mindspore.Tensor, ++# routing_weights: mindspore.Tensor ++# ) -> mindspore.Tensor: ++# """ ++# 【预填充路径】与原始实现一致,结果精确。 ++# """ ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens, _ = hidden_states.shape ++# flat_selected_experts = selected_experts.flatten() ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++# active_experts = ops.unique(flat_selected_experts) ++ ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++# mask = (flat_selected_experts == expert_idx_tensor) ++# selected_token_indices = token_indices[mask] ++# selected_routing_weights = routing_weights.flatten()[mask] ++# current_states = hidden_states[selected_token_indices] ++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++# moe_output = moe_output.index_add( ++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++# ) ++# return moe_output ++ ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++ ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) + +- # return attn_output, attn_weights, past_key_value ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++# # 如果模型主体是 float16,后续再转换 ++ ++# moe_output = None ++# if not self.training: ++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++# # _moe_infer_decode 内部会处理好类型转换 ++# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++# if sequence_length == 1: ++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++# else: ++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++# else: ++# raise NotImplementedError("Training path is not implemented.") ++ ++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++ ++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ + +-QWEN2MOE_ATTENTION_CLASSES = { +- "eager": Qwen2MoeAttention, +- "flash-attention": Qwen2MoeFlashAttention, +-} ++# class Qwen2MoeSparseMoeBlock(nn.Module): ++# """ ++# 【融合版】一个混合专家模块,内置两种推理策略, ++# 由外部全局变量 `Long_Prompt` 控制: ++ ++# - if Long_Prompt is True: 【精度优先模式】 ++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++# 适用于处理长序列,避免误差累积。 ++ ++# - if Long_Prompt is False: 【速度优先模式】 ++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++# 在解码阶段获得极致速度,同时保证结果高度准确。 ++# """ ++# def __init__(self, config: Qwen2MoeConfig): ++# super().__init__() ++# self.num_experts = config.num_experts ++# self.top_k = config.num_experts_per_tok ++# self.norm_topk_prob = config.norm_topk_prob ++ ++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++# self.experts = nn.ModuleList( ++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++# ) ++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++# # --- 速度优先模式的辅助函数 --- ++# @no_grad() ++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++# original_dtype = hidden_states.dtype ++# batch_size, _ = hidden_states.shape ++# expert_outputs_list = [ ++# ops.cat([ ++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++# ], dim=0) ++# for i in range(batch_size) ++# ] ++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++# weights_fp32 = routing_weights.to(mindspore.float32) ++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++# return moe_output_fp32.squeeze(1).to(original_dtype) ++ ++# @no_grad() ++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens, _ = hidden_states.shape ++# flat_selected_experts = selected_experts.flatten() ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++# active_experts = ops.unique(flat_selected_experts) ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++# mask = (flat_selected_experts == expert_idx_tensor) ++# selected_token_indices = token_indices[mask] ++# selected_routing_weights = routing_weights.flatten()[mask] ++# current_states = hidden_states[selected_token_indices] ++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++# return moe_output ++ ++# # --- 精度优先模式的辅助函数 --- ++# @no_grad() ++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++# moe_output = ops.zeros_like(hidden_states) ++# num_tokens, _ = hidden_states.shape ++# flat_selected_experts = selected_experts.flatten() ++# flat_routing_weights = routing_weights.flatten() ++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++# active_experts = ops.unique(flat_selected_experts) ++# for expert_idx_tensor in active_experts: ++# expert_idx = expert_idx_tensor.item() ++# expert_layer = self.experts[expert_idx] ++# mask = (flat_selected_experts == expert_idx_tensor) ++# current_token_indices = token_indices[mask] ++# current_routing_weights = flat_routing_weights[mask] ++# current_hidden_states = hidden_states[current_token_indices] ++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++# return moe_output ++ ++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++# # 声明我们将要使用一个在模块外部定义的全局变量 ++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++# global Long_Prompt ++ ++# # 1. 门控计算 (所有模式通用) ++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++# router_logits = self.gate(hidden_states_reshaped) ++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++# if self.norm_topk_prob: ++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++# moe_output = None ++# if not self.training: ++# # 根据 Long_Prompt 标志选择模式 ++# if Long_Prompt: ++# # --- 精度优先模式 --- ++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++# else: ++# # --- 速度优先模式 --- ++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++# if sequence_length == 1: ++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++# else: ++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++# else: ++# raise NotImplementedError("Training path is not implemented.") ++ ++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++ ++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++ ++# return final_hidden_states, router_logits ++ ++class Qwen2MoeSparseMoeBlock(nn.Module): ++ """ ++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++ 控制的顶级推理策略: + ++ - if Long_Prompt is True: 【精度优先模式】 ++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 ++ 适用于需要严格可复现性的长序列任务。 + +-class Qwen2MoeSparseMoeBlock(nn.Module): +- def __init__(self, config): ++ - if Long_Prompt is False: 【速度优先模式】 ++ 采用业界最强的性能组合: ++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 ++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 ++ """ ++ def __init__(self, config: Qwen2MoeConfig): + super().__init__() + self.num_experts = config.num_experts + self.top_k = config.num_experts_per_tok + self.norm_topk_prob = config.norm_topk_prob + +- # gating + self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) + self.experts = nn.ModuleList( + [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] + ) +- + self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) + self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) + +- #@dwj +- # 只遍历激活的专家,而非全部专家 +- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +- batch_size, sequence_length, hidden_dim = hidden_states.shape +- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +- num_tokens = hidden_states_reshaped.shape[0] +- +- router_logits = self.gate(hidden_states_reshaped) +- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +- if self.norm_topk_prob: +- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- routing_weights = routing_weights.to(hidden_states.dtype) +- +- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +- flat_selected_experts = selected_experts.flatten() +- +- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +- token_indices = broadcasted_token_indices.flatten() +- +- active_experts = ops.unique(flat_selected_experts) +- +- for expert_idx_tensor in active_experts: +- expert_idx = expert_idx_tensor.item() +- expert_layer = self.experts[expert_idx] +- +- mask = (flat_selected_experts == expert_idx_tensor) +- selected_token_indices = token_indices[mask] +- selected_routing_weights = routing_weights.flatten()[mask] +- +- current_states = hidden_states_reshaped[selected_token_indices] +- +- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +- +- final_hidden_states = final_hidden_states.index_add( +- dim=0, +- index=selected_token_indices, +- source=expert_output.to(hidden_states.dtype) +- ) +- +- shared_expert_output = self.shared_expert(hidden_states_reshaped) +- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- ++ @no_grad() ++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ original_dtype = hidden_states.dtype ++ batch_size, _ = hidden_states.shape ++ expert_outputs_list = [ ++ ops.cat([ ++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++ ], dim=0) ++ for i in range(batch_size) ++ ] ++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++ weights_fp32 = routing_weights.to(mindspore.float32) ++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++ return moe_output_fp32.squeeze(1).to(original_dtype) ++ ++ @no_grad() ++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ num_tokens, _ = hidden_states.shape ++ flat_selected_experts = selected_experts.flatten() ++ sorted_expert_indices = flat_selected_experts.argsort() ++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++ original_token_indices = sorted_expert_indices // self.top_k ++ moe_output = ops.zeros_like(hidden_states) ++ current_token_offset = 0 ++ for i in range(self.num_experts): ++ expert_token_count = tokens_per_expert[i] - current_token_offset ++ if expert_token_count == 0: ++ continue ++ end_offset = current_token_offset + expert_token_count ++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++ expert_hidden_states = hidden_states[expert_original_token_indices] ++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++ current_token_offset += expert_token_count ++ return moe_output ++ ++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++ @no_grad() ++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ moe_output = ops.zeros_like(hidden_states) ++ num_tokens, _ = hidden_states.shape ++ flat_selected_experts = selected_experts.flatten() ++ flat_routing_weights = routing_weights.flatten() ++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++ active_experts = ops.unique(flat_selected_experts) ++ for expert_idx_tensor in active_experts: ++ expert_idx = expert_idx_tensor.item() ++ expert_layer = self.experts[expert_idx] ++ mask = (flat_selected_experts == expert_idx_tensor) ++ current_token_indices = token_indices[mask] ++ current_routing_weights = flat_routing_weights[mask] ++ current_hidden_states = hidden_states[current_token_indices] ++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++ return moe_output + +- final_hidden_states = final_hidden_states + shared_expert_output +- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +- +- return final_hidden_states, router_logits ++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++ global Long_Prompt ++ ++ # 1. 门控计算 (所有模式通用) ++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++ router_logits = self.gate(hidden_states_reshaped) ++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++ if self.norm_topk_prob: ++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++ moe_output = None ++ if Long_Prompt: ++ # --- 精度优先模式 (ACCURACY MODE) --- ++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ else: ++ # --- 速度优先模式 (SPEED MODE) --- ++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ if sequence_length == 1: ++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ else: ++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ + ++ # 3. 共享专家计算与合并 (所有模式通用) ++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++ ++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++ ++ return final_hidden_states, router_logits + + class Qwen2MoeDecoderLayer(nn.Module): + def __init__(self, config: Qwen2MoeConfig, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size ++ ++ # if Long_Prompt: ++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ # else: ++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) + + self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + +- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +- + if (layer_idx not in config.mlp_only_layers) and ( + config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 + ): +@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + self._warmed_up = True + self.warmup_moe_model() + ++ ++ + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_router_logits = ( + output_router_logits if output_router_logits is not None else self.config.output_router_logits +@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + router_logits=outputs.router_logits, + ) + ++ def generate(self, *args, **kwargs): ++ """ ++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++ """ ++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++ ++ input_ids = kwargs.get("input_ids") ++ if input_ids is None and args: ++ input_ids = args[0] ++ ++ if input_ids is not None: ++ prompt_length = input_ids.shape[1] ++ ++ if prompt_length > PROMPT_LENGTH_THRESHOLD: ++ Long_Prompt = True ++ else: ++ Long_Prompt = False ++ ++ return super().generate(*args, **kwargs) ++ + # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation + def prepare_inputs_for_generation( + self, +@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens + # Exception 1: when passing input_embeds, input_ids may be missing entries + # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here ++ + if past_key_values is not None: + if inputs_embeds is not None: # Exception 1 + if 0 not in input_ids.shape: +@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + } + ) + return model_inputs ++ + # @lwx + # def _decode_one_tokens_logits( + # self, +@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): + attentions=outputs.attentions, + ) + ++ + __all__ = [ + "Qwen2MoeForCausalLM", + "Qwen2MoeModel", +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +new file mode 100644 +index 00000000..6dfb5b93 +--- /dev/null ++++ b/patches/0001-20251104commit.patch +@@ -0,0 +1,1272 @@ ++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++Subject: [PATCH] 20251104commit ++ ++--- ++ mindnlp/transformers/cache_utils.py | 28 +- ++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++ 3 files changed, 976 insertions(+), 87 deletions(-) ++ ++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++index cadd2e04..02f8d4be 100644 ++--- a/mindnlp/transformers/cache_utils.py +++++ b/mindnlp/transformers/cache_utils.py ++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++ # k_out[:, :, cache_position] = key_states ++ # v_out[:, :, cache_position] = value_states ++- if ON_ORANGE_PI: ++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++- else: ++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++- +++ # if ON_ORANGE_PI: +++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++ # else: +++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++ # 确保 cache_position 是 1D tensor 并且类型正确 +++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++ if cache_position.ndim > 1: +++ cache_position = cache_position.flatten() +++ # 确保类型是 int32 或 int64(MindSpore 要求) +++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++ cache_position = cache_position.int() +++ +++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++ k_out[:, :, cache_position] = key_states +++ v_out[:, :, cache_position] = value_states +++ ++ return k_out, v_out ++ ++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index c695b944..d8303e45 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++- x1 = x[..., : x.shape[-1] // 2] ++- x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++ # x1 = x[..., : x.shape[-1] // 2] +++ # x2 = x[..., x.shape[-1] // 2 :] +++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++ if self.training: ++ raise NotImplementedError("Training is not supported yet.") ++ else: ++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++- if self.config.n_shared_experts is not None: ++- y = y + self.shared_experts(identity) ++- return y +++ # @lwx +++ if orig_shape[1] == 1: +++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++ y=y.view(*orig_shape) +++ if self.config.n_shared_experts is not None: +++ y = y + self.shared_experts(identity) +++ return y +++ else: +++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++ if self.config.n_shared_experts is not None: +++ y = y + self.shared_experts(identity) +++ return y +++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++ # if self.config.n_shared_experts is not None: +++ # y = y + self.shared_experts(identity) +++ # return y +++ +++ @no_grad() +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ +++ expert_cache = ops.zeros_like(x) +++ for i in range(self.num_experts_per_tok): +++ expert_id = flat_expert_indices[i].item() +++ weight = flat_expert_weights[i].item() +++ expert = self.experts[expert_id] +++ expert_out = expert(x) +++ expert_cache += expert_out * weight +++ return expert_cache ++ ++ @no_grad() ++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++- # expert_cache = torch.zeros_like(x) ++- # idxs = flat_expert_indices.argsort() ++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++- # token_idxs = idxs // self.num_experts_per_tok ++- # for i, end_idx in enumerate(tokens_per_expert): ++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++- # if start_idx == end_idx: ++- # continue ++- # expert = self.experts[i] ++- # exp_token_idx = token_idxs[start_idx:end_idx] ++- # expert_tokens = x[exp_token_idx] ++- # expert_out = expert(expert_tokens) ++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++- # return expert_cache +++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ expert_cache = ops.zeros_like(x) ++ idxs = flat_expert_indices.argsort() ++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ token_idxs = idxs // self.num_experts_per_tok +++ ++ for i, end_idx in enumerate(tokens_per_expert): ++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ if start_idx == end_idx: ++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++ expert_out = expert(expert_tokens) ++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ ++ return expert_cache +++ +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # # expert_cache = torch.zeros_like(x) +++ # # idxs = flat_expert_indices.argsort() +++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++ # # token_idxs = idxs // self.num_experts_per_tok +++ # # for i, end_idx in enumerate(tokens_per_expert): +++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++ # # if start_idx == end_idx: +++ # # continue +++ # # expert = self.experts[i] +++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++ # # expert_tokens = x[exp_token_idx] +++ # # expert_out = expert(expert_tokens) +++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++ # # return expert_cache +++ # expert_cache = ops.zeros_like(x) +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # for i, end_idx in enumerate(tokens_per_expert): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # if start_idx == end_idx: +++ # continue +++ # expert = self.experts[i] +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = expert(expert_tokens) +++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ +++ # return expert_cache +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # expert_cache = ops.zeros_like(x) +++ +++ # # 排序保证顺序一致 +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # # 找出有 token 的专家 +++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++ +++ # for i in active_experts.tolist(): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # end_idx = tokens_per_expert[i] +++ # if start_idx == end_idx: # 没有 token +++ # continue +++ +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = self.experts[i](expert_tokens) +++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++ +++ # expert_cache = mindspore.mint.scatter_add( +++ # expert_cache, +++ # 0, +++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++ # expert_out +++ # ) +++ +++ # return expert_cache +++ +++ ++ ++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++ # """ ++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ ++ # Initialize weights and apply final processing ++ self.post_init() +++ self.warm_up = False +++ +++ def warmup_moe_model_deep(self): +++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++ test_texts = [ +++ "warmup short", +++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++ ] +++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++ if tokenizer is None: +++ from mindnlp.transformers import AutoTokenizer +++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++ self._warmup_tokenizer = tokenizer +++ +++ for text in test_texts: +++ inputs = tokenizer(text, return_tensors="ms") +++ with mindspore._no_grad(): +++ _ = self(**inputs, use_cache=False) +++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++ ++ def get_input_embeddings(self): ++ return self.model.embed_tokens ++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++ ```""" +++ if not self.warm_up: +++ self.warm_up = True +++ self.warmup_moe_model_deep() +++ ++ output_attentions = ( ++ output_attentions ++ if output_attentions is not None ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index 3cbf820e..d4c6b651 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -18,7 +18,6 @@ ++ # See the License for the specific language governing permissions and ++ # limitations under the License. ++ """MindSpore Qwen2MoE model.""" ++- ++ import math ++ from typing import List, Optional, Tuple, Union ++ ++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++ TokenClassifierOutput, ++ ) ++ from ...modeling_utils import PreTrainedModel +++from ...generation import GenerationMixin ++ from ....utils import logging ++ from .configuration_qwen2_moe import Qwen2MoeConfig ++ ++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++ self.variance_epsilon = eps ++ ++ def forward(self, hidden_states): +++ # @dwj +++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++ # @lwx +++ # if not self.training : +++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++ input_dtype = hidden_states.dtype ++ hidden_states = hidden_states.to(mindspore.float32) ++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++@@ -234,6 +239,8 @@ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++ x1 = x[..., : x.shape[-1] // 2] ++ x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++ self.config = config ++ self.hidden_size = config.hidden_size ++ self.intermediate_size = intermediate_size +++ ++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++ self.act_fn = ACT2FN[config.hidden_act] ++ ++ def forward(self, x): ++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++- ++ +++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++ # @lwx +++ # gate_up_output = self.gate_up_proj(x) +++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++ # return self.down_proj(swiglu_output) +++ +++ # def forward(self, x): +++ # gate_proj_out = self.gate_proj(x) +++ # up_proj_out = self.up_proj(x) +++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++ # return self.down_proj(swiglu_out) +++ ++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++ """ ++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++ use_cache: bool = False, ++ cache_position: Optional[mindspore.Tensor] = None, ++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ +++ ++ bsz, q_len, _ = hidden_states.shape ++ ++ query_states = self.q_proj(hidden_states) ++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++ "with a layer index." ++ ) ++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ if isinstance(past_key_value, StaticCache): +++ kv_seq_len = key_states.shape[-2] +++ else: +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ if past_key_value is not None: ++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++ if isinstance(past_key_value, StaticCache): +++ kv_seq_len = key_states.shape[-2] ++ ++ # repeat k/v heads if n_kv_heads < n_heads ++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++- +++ ++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++ ++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++- raise ValueError( ++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++- f" {attn_weights.shape}" ++- ) ++- ++- if attention_mask is not None: # no matter the length, we just slice it ++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++ if attention_mask is not None: +++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++ attn_weights = attn_weights + causal_mask ++ ++ # upcast attention to fp32 ++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++ ++ attn_output = self.o_proj(attn_output) ++- +++ # @lwx +++ +++ # max_seq_len = self.max_position_embeddings # 2048 +++ +++ # if attention_mask is not None: +++ # # attention_mask: [B, 1, Sq, Sk] +++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++ +++ # # pad 到 [max_seq_len, max_seq_len] +++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++ # global_attention_mask = padded_mask +++ # else: +++ # global_attention_mask = None +++ +++ +++ # sparse_mode=3 +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, +++ # key=key_states, +++ # value=value_states, +++ # real_shift=None, +++ # padding_mask=None, +++ +++ # head_num=self.num_heads, +++ # attn_mask=global_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++ # input_layout="BNSD", +++ # pre_tokens=2147483647, +++ # next_tokens=2147483647, +++ # inner_precise=0, +++ # drop_mask=None, +++ # prefix=None, +++ # actual_seq_qlen=None, +++ # actual_seq_kvlen=None, +++ # sparse_mode=sparse_mode, +++ # ) ++ if not output_attentions: ++ attn_weights = None ++ ++ return attn_output, attn_weights, past_key_value ++ ++ +++class Qwen2MoeFlashAttention(nn.Module): +++ """ +++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++ +++ 关键改动: +++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++ 直接传入原始的 key 和 value 张量效率更高。 +++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++ """ +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++ super().__init__() +++ self.config = config +++ self.layer_idx = layer_idx +++ self.hidden_size = config.hidden_size +++ self.num_heads = config.num_attention_heads +++ self.head_dim = self.hidden_size // self.num_heads +++ self.num_key_value_heads = config.num_key_value_heads +++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++ self.max_position_embeddings = config.max_position_embeddings +++ self.rope_theta = config.rope_theta +++ self.attention_dropout = config.attention_dropout +++ +++ if (self.head_dim * self.num_heads) != self.hidden_size: +++ raise ValueError( +++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++ ) +++ +++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++ +++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++ self.head_dim, +++ max_position_embeddings=self.max_position_embeddings, +++ base=self.rope_theta, +++ ) +++ +++ def forward( +++ self, +++ hidden_states: mindspore.Tensor, +++ attention_mask: Optional[mindspore.Tensor] = None, +++ position_ids: Optional[mindspore.Tensor] = None, +++ past_key_value: Optional[Cache] = None, +++ output_attentions: bool = False, +++ use_cache: bool = False, +++ cache_position: Optional[mindspore.Tensor] = None, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ bsz, q_len, _ = hidden_states.shape +++ +++ # 1. 线性投射 Q, K, V +++ query_states = self.q_proj(hidden_states) +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++ # query: [B, S, H*D] -> [B, N1, S, D] +++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # 3. RoPE 旋转位置编码 +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++ if self.layer_idx is None: +++ raise ValueError( +++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ "with a layer index." +++ ) +++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++ if cache_position.shape[0] == 1: +++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++ kv_seq_len = past_seen_tokens + 1 +++ else: +++ # prefill 阶段:cache_position 是范围,使用其长度 +++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++ else: +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # 4. KV 缓存更新 +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ key_states, value_states = past_key_value.update( +++ key_states, value_states, self.layer_idx, cache_kwargs +++ ) +++ +++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++ if cache_position.shape[0] == 1: +++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++ kv_seq_len = key_states.shape[-2] +++ +++ # 5. [重要] 准备 Attention Mask +++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++ fa_attention_mask = None +++ if attention_mask is not None: +++ # 截取与当前key长度匹配的部分 +++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++ fa_attention_mask = (mask_slice != 0) +++ +++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++ input_dtype = query_states.dtype +++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++ query_states = query_states.to(mindspore.float16) +++ key_states = key_states.to(mindspore.float16) +++ value_states = value_states.to(mindspore.float16) +++ +++ # 6. [核心] 调用 flash_attention_score 算子 +++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++ attn_output = mindspore.ops.flash_attention_score( +++ query=query_states, +++ key=key_states, +++ value=value_states, +++ head_num=self.num_heads, # 传入Q的头数(N1) +++ attn_mask=fa_attention_mask, +++ keep_prob=1.0 - self.attention_dropout, +++ scalar_value=1.0 / math.sqrt(self.head_dim), +++ input_layout="BNSD", +++ sparse_mode=0 # 使用 defaultMask 模式 +++ ) +++ +++ # 恢复原始数据类型 +++ attn_output = attn_output.to(input_dtype) +++ +++ # 7. 调整输出形状 +++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ attn_output = self.o_proj(attn_output) +++ +++ # FlashAttention 算子不直接返回注意力权重矩阵 +++ attn_weights = None +++ if output_attentions: +++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++ return attn_output, attn_weights, past_key_value +++ +++ # def forward( +++ # self, +++ # hidden_states: mindspore.Tensor, +++ # attention_mask: Optional[mindspore.Tensor] = None, +++ # position_ids: Optional[mindspore.Tensor] = None, +++ # past_key_value: Optional[Cache] = None, +++ # output_attentions: bool = False, +++ # use_cache: bool = False, +++ # cache_position: Optional[mindspore.Tensor] = None, +++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ # bsz, q_len, _ = hidden_states.shape +++ +++ # # 1. 线性投射 Q, K, V +++ # query_states = self.q_proj(hidden_states) +++ # key_states = self.k_proj(hidden_states) +++ # value_states = self.v_proj(hidden_states) +++ +++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # # 3. RoPE 旋转位置编码 +++ # kv_seq_len = key_states.shape[-2] +++ # if past_key_value is not None: +++ # if self.layer_idx is None: +++ # raise ValueError( +++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ # "with a layer index." +++ # ) +++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # # 4. KV 缓存更新 +++ # if past_key_value is not None: +++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ # key_states, value_states = past_key_value.update( +++ # key_states, value_states, self.layer_idx, cache_kwargs +++ # ) +++ +++ # # 5. 准备 Attention Mask +++ # fa_attention_mask = None +++ # if attention_mask is not None: +++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # fa_attention_mask = (mask_slice != 0) +++ +++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++ # input_dtype = query_states.dtype +++ +++ # # 6. [核心] 调用 flash_attention_score 算子 +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, +++ # key=key_states, +++ # value=value_states, +++ # head_num=self.num_heads, +++ # attn_mask=fa_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++ # input_layout="BNSD", +++ # sparse_mode=0, +++ # # <--- 修改点 2: 启用内部高精度计算 --- +++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++ # inner_precise=1 +++ # ) +++ +++ # # 恢复原始数据类型 +++ # attn_output = attn_output.to(input_dtype) +++ +++ # # 7. 调整输出形状 +++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ # attn_output = self.o_proj(attn_output) +++ +++ # attn_weights = None +++ # if output_attentions: +++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++ # return attn_output, attn_weights, past_key_value +++ +++ # def forward( +++ # self, +++ # hidden_states: mindspore.Tensor, +++ # attention_mask: Optional[mindspore.Tensor] = None, +++ # position_ids: Optional[mindspore.Tensor] = None, +++ # past_key_value: Optional[Cache] = None, +++ # output_attentions: bool = False, +++ # use_cache: bool = False, +++ # cache_position: Optional[mindspore.Tensor] = None, +++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ # bsz, q_len, _ = hidden_states.shape +++ +++ # query_states = self.q_proj(hidden_states) +++ # key_states = self.k_proj(hidden_states) +++ # value_states = self.v_proj(hidden_states) +++ +++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # kv_seq_len = key_states.shape[-2] +++ # if past_key_value is not None: +++ # if self.layer_idx is None: +++ # raise ValueError("`layer_idx` must be specified for caching") +++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # if past_key_value is not None: +++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ # key_states, value_states = past_key_value.update( +++ # key_states, value_states, self.layer_idx, cache_kwargs +++ # ) +++ +++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++ +++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++ # query_states = query_states / math.sqrt(self.head_dim) +++ # # <--- 修改结束 --- +++ +++ # fa_attention_mask = None +++ # if attention_mask is not None: +++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # fa_attention_mask = (mask_slice != 0) +++ +++ # input_dtype = query_states.dtype +++ +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, # 传入已经预先缩放过的 query +++ # key=key_states, +++ # value=value_states, +++ # head_num=self.num_heads, +++ # attn_mask=fa_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++ # input_layout="BNSD", +++ # sparse_mode=0, +++ # inner_precise=1 # 仍然保持内部高精度计算 +++ # ) +++ +++ # attn_output = attn_output.to(input_dtype) +++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ # attn_output = self.o_proj(attn_output) +++ +++ # attn_weights = None +++ # if output_attentions: +++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++ +++ # return attn_output, attn_weights, past_key_value +++ ++ QWEN2MOE_ATTENTION_CLASSES = { ++ "eager": Qwen2MoeAttention, +++ "flash-attention": Qwen2MoeFlashAttention, ++ } ++ ++ ++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ +++ #@dwj +++ # 只遍历激活的专家,而非全部专家 ++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++- hidden_states = hidden_states.view(-1, hidden_dim) ++- # router_logits: (batch * sequence_length, n_experts) ++- router_logits = self.gate(hidden_states) ++- ++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- if self.norm_topk_prob: ++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- # we cast back to the input dtype ++- routing_weights = routing_weights.to(hidden_states.dtype) ++- ++- final_hidden_states = ops.zeros( ++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++- ) ++- ++- # One hot encode the selected experts to create an expert mask ++- # this will be used to easily index which expert is going to be sollicitated ++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++- ++- # Loop over all available experts in the model and perform the computation on each expert ++- for expert_idx in range(self.num_experts): ++- expert_layer = self.experts[expert_idx] ++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++- ++- # Index the correct hidden states and compute the expert hidden state for ++- # the current expert. We need to make sure to multiply the output hidden ++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++- if 0 not in idx.shape: ++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++- ++- # However `index_add_` only support torch tensors for indexing so we'll use ++- # the `top_x` tensor here. ++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++- ++- shared_expert_output = self.shared_expert(hidden_states) ++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++- ++- final_hidden_states = final_hidden_states + shared_expert_output +++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++ num_tokens = hidden_states_reshaped.shape[0] +++ +++ router_logits = self.gate(hidden_states_reshaped) +++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++ if self.norm_topk_prob: +++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ routing_weights = routing_weights.to(hidden_states.dtype) +++ +++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++ flat_selected_experts = selected_experts.flatten() +++ +++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++ token_indices = broadcasted_token_indices.flatten() +++ +++ active_experts = ops.unique(flat_selected_experts) +++ +++ for expert_idx_tensor in active_experts: +++ expert_idx = expert_idx_tensor.item() +++ expert_layer = self.experts[expert_idx] +++ +++ mask = (flat_selected_experts == expert_idx_tensor) +++ selected_token_indices = token_indices[mask] +++ selected_routing_weights = routing_weights.flatten()[mask] +++ +++ current_states = hidden_states_reshaped[selected_token_indices] +++ +++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++ +++ final_hidden_states = final_hidden_states.index_add( +++ dim=0, +++ index=selected_token_indices, +++ source=expert_output.to(hidden_states.dtype) +++ ) +++ +++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++ ++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++- return final_hidden_states, router_logits +++ final_hidden_states = final_hidden_states + shared_expert_output +++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++ +++ return final_hidden_states, router_logits ++ ++ ++ class Qwen2MoeDecoderLayer(nn.Module): ++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++ ++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++ ++ if (layer_idx not in config.mlp_only_layers) and ( ++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++ ): ++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++ _skip_keys_device_placement = "past_key_values" ++ _supports_cache_class = True +++#lwx +++ # _supports_static_cache = True ++ ++ def _init_weights(self, module): ++ std = self.config.initializer_range ++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++ return causal_mask ++ ++ ++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ _tied_weights_keys = ["lm_head.weight"] ++ ++ def __init__(self, config): ++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ self.num_experts_per_tok = config.num_experts_per_tok ++ # Initialize weights and apply final processing ++ self.post_init() +++ # @lwx +++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++ # self.generation_config.cache_implementation = "static" +++ self._warmed_up = False +++ +++ def warmup_moe_model(self): +++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++ test_texts = [ +++ "warmup short", +++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++ ] +++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++ if tokenizer is None: +++ from mindnlp.transformers import AutoTokenizer +++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++ self._warmup_tokenizer = tokenizer +++ +++ for text in test_texts: +++ inputs = tokenizer(text, return_tensors="ms") +++ with mindspore._no_grad(): +++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++ ++ def get_input_embeddings(self): ++ return self.model.embed_tokens ++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++ ```""" +++ if not self._warmed_up: +++ self._warmed_up = True +++ self.warmup_moe_model() ++ ++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++ output_router_logits = ( ++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ } ++ ) ++ return model_inputs +++# @lwx +++ # def _decode_one_tokens_logits( +++ # self, +++ # cur_token: mindspore.Tensor, +++ # input_pos: Optional[mindspore.Tensor], +++ # cache_position: mindspore.Tensor, +++ # past_key_values: StaticCache, +++ # ) -> mindspore.Tensor: +++ # """ +++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++ +++ # Args: +++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++ # input_pos: 输入位置信息,可选 +++ # cache_position: 当前token在cache中的位置,shape为(1,) +++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++ +++ # Returns: +++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++ # """ +++ # # 调用JIT编译的版本 +++ # return self.get_decode_one_tokens_logits( +++ # cur_token=cur_token, +++ # input_pos=input_pos, +++ # cache_position=cache_position, +++ # past_key_values=past_key_values, +++ # ) +++ +++ # @mindspore.jit(jit_level='O1') +++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++ # """ +++ # JIT编译的函数,用于高效的单token解码 +++ # 使用JIT编译优化以支持静态shape和高效执行 +++ +++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++ # """ +++ # outputs = self.model.forward( +++ # input_ids=cur_token, +++ # position_ids=input_pos, +++ # cache_position=cache_position, +++ # past_key_values=past_key_values, +++ # use_cache=True, +++ # return_dict=False, +++ # ) +++ +++ # hidden_states = outputs[0] +++ # logits = self.lm_head.forward(hidden_states) +++ # logits = logits.float() +++ +++ # return logits[:, -1, :] +++ +++ # def _sample( +++ # self, +++ # input_ids: mindspore.Tensor, +++ # logits_processor, +++ # stopping_criteria, +++ # generation_config, +++ # synced_devices: bool, +++ # streamer=None, +++ # logits_warper=None, +++ # **model_kwargs, +++ # ): +++ # """ +++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++ # """ +++ # from ...generation.logits_process import LogitsProcessorList +++ # from ...generation.stopping_criteria import StoppingCriteriaList +++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++ # from mindnlp.core import nn, ops, no_grad +++ # import numpy as np +++ +++ # # 检查是否使用 StaticCache +++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++ # # 否则,直接调用父类方法 +++ # past_key_values = model_kwargs.get("past_key_values") +++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++ +++ # if not isinstance(past_key_values, StaticCache): +++ # # 不使用 StaticCache,直接调用父类方法 +++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++ # return super()._sample( +++ # input_ids=input_ids, +++ # logits_processor=logits_processor, +++ # stopping_criteria=stopping_criteria, +++ # generation_config=generation_config, +++ # synced_devices=synced_devices, +++ # streamer=streamer, +++ # logits_warper=logits_warper, +++ # **model_kwargs, +++ # ) +++ +++ # # 使用 StaticCache,进入自定义循环 +++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++ # pad_token_id = generation_config._pad_token_tensor +++ # output_attentions = generation_config.output_attentions +++ # output_hidden_states = generation_config.output_hidden_states +++ # output_scores = generation_config.output_scores +++ # output_logits = generation_config.output_logits +++ # return_dict_in_generate = generation_config.return_dict_in_generate +++ # max_length = generation_config.max_length +++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++ # do_sample = generation_config.do_sample +++ +++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++ # raise ValueError( +++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++ # f"{logits_warper})." +++ # ) +++ +++ # # init attention / hidden states / scores tuples +++ # scores = () if (return_dict_in_generate and output_scores) else None +++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++ +++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++ # encoder_hidden_states = ( +++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++ # ) +++ +++ # # keep track of which sequences are already finished +++ # batch_size, cur_len = input_ids.shape +++ # this_peer_finished = False +++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++ +++ # time_record = [] +++ # from ....utils.testing_utils import parse_flag_from_env +++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++ +++ # while self._has_unfinished_sequences( +++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++ # ): +++ # if _record_time: +++ # import time as time_module +++ # infer_start = time_module.time() +++ +++ # # prepare model inputs +++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++ +++ # # prepare variable output controls +++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++ +++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++ # cur_cache_position = model_inputs.get("cache_position") +++ # cur_past_key_values = model_inputs.get("past_key_values") +++ # cur_input_ids = model_inputs.get("input_ids") +++ +++ # if (isinstance(cur_past_key_values, StaticCache) and +++ # cur_cache_position is not None and +++ # len(cur_cache_position.shape) > 0 and +++ # cur_cache_position.shape[0] == 1 and +++ # cur_input_ids is not None and +++ # cur_input_ids.shape[1] == 1): +++ # # 使用 JIT 优化的单 token 解码 +++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++ # if not hasattr(self, '_jit_used'): +++ # self._jit_used = False +++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++ +++ # next_token_logits = self.get_decode_one_tokens_logits( +++ # cur_token=cur_input_ids, +++ # input_pos=model_inputs.get("position_ids"), +++ # cache_position=cur_cache_position, +++ # past_key_values=cur_past_key_values, +++ # ) +++ +++ # # 标记已使用JIT(用于后续判断) +++ # if not self._jit_used: +++ # self._jit_used = True +++ +++ # # 构造兼容的输出对象 +++ # class JitOptimizedOutput: +++ # def __init__(self, logits, config): +++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++ # self.config = config +++ # # 对于 JIT 优化路径,这些属性通常不需要 +++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++ # self.attentions = None if not config.is_encoder_decoder else None +++ # self.cross_attentions = None +++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++ # self.hidden_states = None if not config.is_encoder_decoder else None +++ +++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++ # else: +++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++ # outputs = self(**model_inputs, return_dict=True) +++ +++ # if synced_devices and this_peer_finished: +++ # continue +++ +++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++ # next_token_logits = outputs.logits[:, -1, :] +++ +++ # # pre-process distribution +++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++ # if do_sample: +++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++ +++ # # Store scores, attentions and hidden_states when required +++ # if return_dict_in_generate: +++ # if output_scores: +++ # scores += (next_token_scores,) +++ # if output_logits: +++ # raw_logits += (next_token_logits,) +++ # if output_attentions: +++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++ # decoder_attentions += (attn,) if attn is not None else (None,) +++ # if self.config.is_encoder_decoder: +++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++ +++ # if output_hidden_states: +++ # hidden = ( +++ # outputs.decoder_hidden_states +++ # if self.config.is_encoder_decoder +++ # else outputs.hidden_states +++ # ) +++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++ +++ # # token selection +++ # if do_sample: +++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++ # else: +++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++ +++ # # finished sentences should have their next token be a padding token +++ # if has_eos_stopping_criteria: +++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++ +++ # # update generated ids, model inputs, and length for next step +++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++ # if streamer is not None: +++ # streamer.put(next_tokens) +++ +++ # model_kwargs = self._update_model_kwargs_for_generation( +++ # outputs, +++ # model_kwargs, +++ # is_encoder_decoder=self.config.is_encoder_decoder, +++ # ) +++ +++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++ # cur_len += 1 +++ +++ # if _record_time: +++ # import time as time_module +++ # infer_stop = time_module.time() +++ # time_record.append(infer_stop - infer_start) +++ +++ # del outputs +++ +++ # average_infer_time = None +++ # if time_record: +++ # if len(time_record) > 1: +++ # time_record.pop(0) +++ # average_infer_time = sum(time_record) / len(time_record) +++ # print(f'average inference time is: {average_infer_time}') +++ # print(f'inference time record: {time_record}') +++ +++ # if streamer is not None: +++ # streamer.end() +++ +++ # # 简单判断:打印是否使用了JIT路径 +++ # if hasattr(self, '_jit_used') and self._jit_used: +++ # print("[JIT] ✓ JIT optimization was used during generation") +++ # else: +++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++ +++ # if return_dict_in_generate: +++ # if self.config.is_encoder_decoder: +++ # return GenerateEncoderDecoderOutput( +++ # sequences=input_ids, +++ # scores=scores, +++ # logits=raw_logits, +++ # encoder_attentions=encoder_attentions, +++ # encoder_hidden_states=encoder_hidden_states, +++ # decoder_attentions=decoder_attentions, +++ # cross_attentions=cross_attentions, +++ # decoder_hidden_states=decoder_hidden_states, +++ # past_key_values=model_kwargs.get("past_key_values"), +++ # average_infer_time=average_infer_time +++ # ) +++ # else: +++ # return GenerateDecoderOnlyOutput( +++ # sequences=input_ids, +++ # scores=scores, +++ # logits=raw_logits, +++ # attentions=decoder_attentions, +++ # hidden_states=decoder_hidden_states, +++ # past_key_values=model_kwargs.get("past_key_values"), +++ # average_infer_time=average_infer_time +++ # ) +++ # else: +++ # return input_ids +++ +++ # def _prepare_cache_for_generation( +++ # self, +++ # generation_config, +++ # model_kwargs, +++ # assistant_model, +++ # batch_size, +++ # max_cache_length, +++ # ): +++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++ # generation_config.cache_implementation = "static" +++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++ +++ # if generation_config.cache_implementation == "static": +++ # base_required_from_max_length = generation_config.max_length + 1 +++ # base_required = max(max_cache_length, base_required_from_max_length) +++ # min_cache_size = 50 +++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++ # else: +++ # max_cache_length = max(base_required, min_cache_size) +++ +++ # original_max_cache_length = max_cache_length +++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++ # print(f" - input max_cache_length: {original_max_cache_length}") +++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++ # print(f" - final max_cache_length: {max_cache_length}") +++ +++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++ # if max_cache_length > self.config.max_position_embeddings: +++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++ +++ # result = super()._prepare_cache_for_generation( +++ # generation_config=generation_config, +++ # model_kwargs=model_kwargs, +++ # assistant_model=assistant_model, +++ # batch_size=batch_size, +++ # max_cache_length=max_cache_length, +++ # ) +++ +++ # if generation_config.cache_implementation == "static": +++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++ # created_cache = model_kwargs.get(cache_name) +++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++ # if created_cache.max_cache_len < generation_config.max_length: +++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++ +++ # return result +++ +++ +++ ++ ++ ++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" new file mode 100644 index 00000000..d64b7f3f --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" @@ -0,0 +1,2769 @@ +From 7a37d9be16fe823c251701c26bbb20cc09f9922a Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Thu, 6 Nov 2025 14:54:37 +0800 +Subject: [PATCH 03/10] 20261106secondcommit + +--- + .../models/deepseek/modeling_deepseek.py | 217 ++- + .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- + patches/0001-20251104commit.patch | 1272 ----------------- + 3 files changed, 528 insertions(+), 2032 deletions(-) + delete mode 100644 patches/0001-20251104commit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 73773c22..2f9192bf 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) + + _CONFIG_FOR_DOC = "DeepseekConfig" + ++_attn_mask_cache = {} ++ ++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): ++ q_len = batch_and_seq[1] ++ kv_len = batch_and_seq[1] + past_key_values_length ++ key = (batch_and_seq[0], q_len, kv_len) ++ ++ if key in _attn_mask_cache: ++ return _attn_mask_cache[key] ++ ++ mask = _prepare_4d_causal_attention_mask( ++ attention_mask, ++ batch_and_seq, ++ inputs_embeds, ++ past_key_values_length, ++ ) ++ _attn_mask_cache[key] = mask ++ return mask + + def _get_unpad_data(attention_mask): + seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): + return final_output + + +- @no_grad() +- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- expert_cache = ops.zeros_like(x) +- idxs = flat_expert_indices.argsort() +- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- token_idxs = idxs // self.num_experts_per_tok +- +- for i, end_idx in enumerate(tokens_per_expert): +- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- if start_idx == end_idx: +- continue +- expert = self.experts[i] +- exp_token_idx = token_idxs[start_idx:end_idx] +- expert_tokens = x[exp_token_idx] +- expert_out = expert(expert_tokens) +- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +- +- return expert_cache +- + # @no_grad() +- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +- # # expert_cache = torch.zeros_like(x) +- # # idxs = flat_expert_indices.argsort() +- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +- # # token_idxs = idxs // self.num_experts_per_tok +- # # for i, end_idx in enumerate(tokens_per_expert): +- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +- # # if start_idx == end_idx: +- # # continue +- # # expert = self.experts[i] +- # # exp_token_idx = token_idxs[start_idx:end_idx] +- # # expert_tokens = x[exp_token_idx] +- # # expert_out = expert(expert_tokens) +- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +- # # return expert_cache ++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): + # expert_cache = ops.zeros_like(x) + # idxs = flat_expert_indices.argsort() + # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): + # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) + + # return expert_cache +- # @no_grad() +- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +- # expert_cache = ops.zeros_like(x) ++ ++ @no_grad() ++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ """ ++ 优化版 MoE prefill: ++ - 批量张量化处理同一个 expert 的所有 token ++ - 跳过无 token 的专家 ++ - 保持结果完全一致 ++ """ ++ # 初始化输出缓存 ++ expert_cache = ops.zeros_like(x) + +- # # 排序保证顺序一致 +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- # token_idxs = idxs // self.num_experts_per_tok ++ # 排序(确保 scatter_add 位置对应原逻辑) ++ idxs = flat_expert_indices.argsort() ++ sorted_expert_indices = flat_expert_indices[idxs] ++ sorted_token_indices = idxs // self.num_experts_per_tok + +- # # 找出有 token 的专家 +- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++ # 每个 expert 的 token 数 ++ tokens_per_expert = sorted_expert_indices.bincount() + +- # for i in active_experts.tolist(): +- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- # end_idx = tokens_per_expert[i] +- # if start_idx == end_idx: # 没有 token +- # continue ++ # 找出有 token 的专家 ++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() + +- # exp_token_idx = token_idxs[start_idx:end_idx] +- # expert_tokens = x[exp_token_idx] +- # expert_out = self.experts[i](expert_tokens) +- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++ for expert_id in active_experts.tolist(): ++ # 取该 expert 对应的排序后 token 区间 ++ start = (tokens_per_expert[:expert_id]).sum().item() ++ end = start + tokens_per_expert[expert_id].item() + +- # expert_cache = mindspore.mint.scatter_add( +- # expert_cache, +- # 0, +- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +- # expert_out +- # ) ++ token_idx = sorted_token_indices[start:end] # 原 token 位置 ++ expert_tokens = x[token_idx] # 取输入向量 + +- # return expert_cache ++ # 执行专家 MLP ++ expert_out = self.experts[expert_id](expert_tokens) ++ ++ # 按权重缩放 ++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] ++ ++ # 回写到缓存(等价 scatter_add) ++ expert_cache = mindspore.mint.scatter_add( ++ expert_cache, ++ 0, ++ token_idx.view(-1, 1).tile((1, x.shape[-1])), ++ scaled_out ++ ) ++ ++ return expert_cache ++ ++ # @no_grad() ++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++ # # expert_cache = torch.zeros_like(x) ++ # # idxs = flat_expert_indices.argsort() ++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++ # # token_idxs = idxs // self.num_experts_per_tok ++ # # for i, end_idx in enumerate(tokens_per_expert): ++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++ # # if start_idx == end_idx: ++ # # continue ++ # # expert = self.experts[i] ++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++ # # expert_tokens = x[exp_token_idx] ++ # # expert_out = expert(expert_tokens) ++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++ # # return expert_cache ++ # expert_cache = ops.zeros_like(x) ++ # idxs = flat_expert_indices.argsort() ++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ # token_idxs = idxs // self.num_experts_per_tok ++ ++ # for i, end_idx in enumerate(tokens_per_expert): ++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ # if start_idx == end_idx: ++ # continue ++ # expert = self.experts[i] ++ # exp_token_idx = token_idxs[start_idx:end_idx] ++ # expert_tokens = x[exp_token_idx] ++ # expert_out = expert(expert_tokens) ++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++ ++ # return expert_cache ++ # @no_grad() ++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++ # expert_cache = ops.zeros_like(x) ++ ++ # # 排序保证顺序一致 ++ # idxs = flat_expert_indices.argsort() ++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ # token_idxs = idxs // self.num_experts_per_tok ++ ++ # # 找出有 token 的专家 ++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++ ++ # for i in active_experts.tolist(): ++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ # end_idx = tokens_per_expert[i] ++ # if start_idx == end_idx: # 没有 token ++ # continue ++ ++ # exp_token_idx = token_idxs[start_idx:end_idx] ++ # expert_tokens = x[exp_token_idx] ++ # expert_out = self.experts[i](expert_tokens) ++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++ ++ # expert_cache = mindspore.mint.scatter_add( ++ # expert_cache, ++ # 0, ++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++ # expert_out ++ # ) ++ ++ # return expert_cache + + + +@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): + + return attn_output, attn_weights, past_key_value + +- + # class DeepseekFlashAttention(nn.Module): + # """ + # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): + + return attn_output, attn_weights, past_key_value + ++ + Deepseek_ATTENTION_CLASSES = { + "eager": DeepseekAttention, + "flash-attention": DeepseekFlashAttention, +@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): + ) + else: + # 4d mask is passed through the layers +- attention_mask = _prepare_4d_causal_attention_mask( ++ # attention_mask = _prepare_4d_causal_attention_mask( ++ # attention_mask, ++ # (batch_size, seq_length), ++ # inputs_embeds, ++ # past_key_values_length, ++ # ) ++ #@dwj ++ attention_mask = get_cached_causal_mask( + attention_mask, + (batch_size, seq_length), + inputs_embeds, +@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + # Initialize weights and apply final processing + self.post_init() + self.warm_up = False ++ #@dwj ++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++ self.num_layers, ++ self.num_attention_heads, ++ self.head_dim, ++ batch_size=1, ++ max_length=self.max_length, ++ dtype=mindspore.float16 ++ ) ++ ++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++ key_cache = [] ++ value_cache = [] ++ for _ in range(num_layers): ++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++ key_cache.append(k) ++ value_cache.append(v) ++ return key_cache, value_cache ++ + + def warmup_moe_model_deep(self): + print("[Warmup] DeepSeek-MoE 模型预热开始...") +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index bced285c..ebd7782e 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) + _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" + _CONFIG_FOR_DOC = "Qwen2MoeConfig" + +-Long_Prompt = False +-PROMPT_LENGTH_THRESHOLD = 128 ++Long_Prompt = 1 ++LONG_PROMPT_LENGTH_THRESHOLD = 128 ++SHORT_PROMPT_LENGTH_THRESHOLD = 32 ++ ++_causal_mask_cache = {} ++ ++def get_cached_causal_mask_with_cache_position( ++ attention_mask: mindspore.Tensor, ++ sequence_length: int, ++ target_length: int, ++ dtype: mindspore.dtype, ++ min_dtype: float, ++ cache_position: mindspore.Tensor, ++ batch_size: int, ++): ++ """ ++ 带缓存的 causal mask 构造函数 ++ """ ++ # q_len 是当前 query 长度 ++ q_len = sequence_length ++ # kv_len 是 target_length ++ kv_len = target_length ++ ++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 ++ key = (batch_size, q_len, kv_len, dtype, min_dtype) ++ ++ if key in _causal_mask_cache: ++ return _causal_mask_cache[key] ++ ++ # 调用原来的 mask 构造逻辑 ++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++ attention_mask, ++ sequence_length=sequence_length, ++ target_length=target_length, ++ dtype=dtype, ++ min_dtype=min_dtype, ++ cache_position=cache_position, ++ batch_size=batch_size, ++ ) ++ # 缓存结果 ++ _causal_mask_cache[key] = causal_mask ++ return causal_mask + + # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position + def _prepare_4d_causal_attention_mask_with_cache_position( +@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: + + + # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe ++# class Qwen2MoeAttention(nn.Module): ++# """ ++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++# and "Generating Long Sequences with Sparse Transformers". ++# """ ++ ++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++# super().__init__() ++# self.config = config ++# self.layer_idx = layer_idx ++# if layer_idx is None: ++# logger.warning_once( ++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++# "when creating this class." ++# ) ++ ++# self.hidden_size = config.hidden_size ++# self.num_heads = config.num_attention_heads ++# self.head_dim = self.hidden_size // self.num_heads ++# self.num_key_value_heads = config.num_key_value_heads ++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++# self.max_position_embeddings = config.max_position_embeddings ++# self.rope_theta = config.rope_theta ++# self.is_causal = True ++# self.attention_dropout = config.attention_dropout ++ ++# if (self.head_dim * self.num_heads) != self.hidden_size: ++# raise ValueError( ++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++# f" and `num_heads`: {self.num_heads})." ++# ) ++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++ ++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++# self.head_dim, ++# max_position_embeddings=self.max_position_embeddings, ++# base=self.rope_theta, ++# ) ++ ++# def forward( ++# self, ++# hidden_states: mindspore.Tensor, ++# attention_mask: Optional[mindspore.Tensor] = None, ++# position_ids: Optional[mindspore.Tensor] = None, ++# past_key_value: Optional[Cache] = None, ++# output_attentions: bool = False, ++# use_cache: bool = False, ++# cache_position: Optional[mindspore.Tensor] = None, ++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++ ++ ++ ++# bsz, q_len, _ = hidden_states.shape ++ ++# query_states = self.q_proj(hidden_states) ++# key_states = self.k_proj(hidden_states) ++# value_states = self.v_proj(hidden_states) ++ ++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++ ++# kv_seq_len = key_states.shape[-2] ++# if past_key_value is not None: ++# if self.layer_idx is None: ++# raise ValueError( ++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++# "with a layer index." ++# ) ++# if isinstance(past_key_value, StaticCache): ++# kv_seq_len = key_states.shape[-2] ++# else: ++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++# if past_key_value is not None: ++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++# if isinstance(past_key_value, StaticCache): ++# kv_seq_len = key_states.shape[-2] ++ ++# # repeat k/v heads if n_kv_heads < n_heads ++# key_states = repeat_kv(key_states, self.num_key_value_groups) ++# value_states = repeat_kv(value_states, self.num_key_value_groups) ++ ++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++ ++# if attention_mask is not None: ++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++# attn_weights = attn_weights + causal_mask ++ ++# # upcast attention to fp32 ++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++# attn_output = ops.matmul(attn_weights, value_states) ++ ++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++# raise ValueError( ++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++# f" {attn_output.shape}" ++# ) ++ ++# attn_output = ops.transpose(attn_output, 1, 2) ++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++ ++# attn_output = self.o_proj(attn_output) ++# # @lwx ++ ++# # max_seq_len = self.max_position_embeddings # 2048 ++ ++# # if attention_mask is not None: ++# # # attention_mask: [B, 1, Sq, Sk] ++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++ ++# # # pad 到 [max_seq_len, max_seq_len] ++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++# # global_attention_mask = padded_mask ++# # else: ++# # global_attention_mask = None ++ ++ ++# # sparse_mode=3 ++# # attn_output = mindspore.ops.flash_attention_score( ++# # query=query_states, ++# # key=key_states, ++# # value=value_states, ++# # real_shift=None, ++# # padding_mask=None, ++ ++# # head_num=self.num_heads, ++# # attn_mask=global_attention_mask, ++# # keep_prob=1.0 - self.attention_dropout, ++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++# # input_layout="BNSD", ++# # pre_tokens=2147483647, ++# # next_tokens=2147483647, ++# # inner_precise=0, ++# # drop_mask=None, ++# # prefix=None, ++# # actual_seq_qlen=None, ++# # actual_seq_kvlen=None, ++# # sparse_mode=sparse_mode, ++# # ) ++# if not output_attentions: ++# attn_weights = None ++ ++# return attn_output, attn_weights, past_key_value ++ + class Qwen2MoeAttention(nn.Module): + """ +- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +- and "Generating Long Sequences with Sparse Transformers". +- """ ++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 + ++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: ++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 ++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 ++ ++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 ++ """ + def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): + super().__init__() + self.config = config +@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): + if layer_idx is None: + logger.warning_once( + f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " + "when creating this class." + ) + +@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): + use_cache: bool = False, + cache_position: Optional[mindspore.Tensor] = None, + ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- + +- ++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- + bsz, q_len, _ = hidden_states.shape + + query_states = self.q_proj(hidden_states) + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + +- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- ++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ + kv_seq_len = key_states.shape[-2] + if past_key_value is not None: +- if self.layer_idx is None: +- raise ValueError( +- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +- "with a layer index." +- ) +- if isinstance(past_key_value, StaticCache): +- kv_seq_len = key_states.shape[-2] +- else: +- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ + cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) + query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) + + if past_key_value is not None: +- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++ ++ # --- 2. 动态调度核心注意力计算 --- ++ global Long_Prompt ++ if Long_Prompt >= 1: ++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- ++ fa_attention_mask = None ++ if attention_mask is not None: ++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++ fa_attention_mask = (mask_slice != 0) ++ ++ attn_output = mindspore.ops.flash_attention_score( ++ query=query_states, ++ key=key_states, ++ value=value_states, ++ head_num=self.num_heads, ++ attn_mask=fa_attention_mask, ++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, ++ scalar_value=1.0 / math.sqrt(self.head_dim), ++ input_layout="BNSD", ++ sparse_mode=0, ++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 ++ ) + +- if isinstance(past_key_value, StaticCache): +- kv_seq_len = key_states.shape[-2] ++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ attn_output = self.o_proj(attn_output) ++ attn_weights = None ++ if output_attentions: ++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") + +- # repeat k/v heads if n_kv_heads < n_heads +- key_states = repeat_kv(key_states, self.num_key_value_groups) +- value_states = repeat_kv(value_states, self.num_key_value_groups) +- +- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++ else: ++ # --- Eager Attention 路径 (用于短序列和解码) --- ++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++ ++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) + +- if attention_mask is not None: +- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +- attn_weights = attn_weights + causal_mask ++ if attention_mask is not None: ++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++ attn_weights = attn_weights + causal_mask + +- # upcast attention to fp32 +- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +- attn_output = ops.matmul(attn_weights, value_states) ++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++ attn_output = ops.matmul(attn_weights, value_states) + +- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +- raise ValueError( +- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +- f" {attn_output.shape}" +- ) ++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++ raise ValueError( ++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" ++ ) + +- attn_output = ops.transpose(attn_output, 1, 2) +- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++ attn_output = ops.transpose(attn_output, 1, 2) ++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++ attn_output = self.o_proj(attn_output) + +- attn_output = self.o_proj(attn_output) +- # @lwx ++ if not output_attentions: ++ attn_weights = None + +- # max_seq_len = self.max_position_embeddings # 2048 +- +- # if attention_mask is not None: +- # # attention_mask: [B, 1, Sq, Sk] +- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +- +- # # pad 到 [max_seq_len, max_seq_len] +- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +- # global_attention_mask = padded_mask +- # else: +- # global_attention_mask = None +- +- +- # sparse_mode=3 +- # attn_output = mindspore.ops.flash_attention_score( +- # query=query_states, +- # key=key_states, +- # value=value_states, +- # real_shift=None, +- # padding_mask=None, +- +- # head_num=self.num_heads, +- # attn_mask=global_attention_mask, +- # keep_prob=1.0 - self.attention_dropout, +- # scalar_value=1.0 / math.sqrt(self.head_dim), +- # input_layout="BNSD", +- # pre_tokens=2147483647, +- # next_tokens=2147483647, +- # inner_precise=0, +- # drop_mask=None, +- # prefix=None, +- # actual_seq_qlen=None, +- # actual_seq_kvlen=None, +- # sparse_mode=sparse_mode, +- # ) +- if not output_attentions: +- attn_weights = None +- + return attn_output, attn_weights, past_key_value + +- + # class Qwen2MoeFlashAttention(nn.Module): + # """ + # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { + # return final_hidden_states, router_logits + + +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# """ +-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# # 门控网络 +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# # 专家列表 +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +-# # 共享专家 +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# @no_grad() +-# def _moe_infer_decode( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# """ +-# 【解码路径】针对 sequence_length=1 的极致优化。 +-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-# """ +-# batch_size, hidden_dim = hidden_states.shape +- +-# expert_outputs_list = [ +-# ops.cat([ +-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-# ], dim=0) +-# for i in range(batch_size) +-# ] +- +-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-# # shape: (batch_size, top_k, hidden_dim) +-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +- +-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +- +-# return moe_output.squeeze(1) +- +-# @no_grad() +-# def _moe_infer_prefill( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# """ +-# 【预填充路径】针对 sequence_length > 1 的优化。 +-# 按专家对 Token 进行分组,并进行批处理。 +-# """ +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens = hidden_states.shape[0] +-# flat_selected_experts = selected_experts.flatten() +- +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +- +-# active_experts = ops.unique(flat_selected_experts) +- +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +- +-# mask = (flat_selected_experts == expert_idx_tensor) +-# selected_token_indices = token_indices[mask] +-# selected_routing_weights = routing_weights.flatten()[mask] +- +-# current_states = hidden_states[selected_token_indices] +- +-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +- +-# moe_output = moe_output.index_add( +-# dim=0, +-# index=selected_token_indices, +-# source=expert_output.to(hidden_states.dtype) +-# ) +-# return moe_output +- +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# """ +-# 顶层 forward 方法,作为智能分发器。 +-# """ +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +- +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +-# routing_weights = routing_weights.to(hidden_states.dtype) +- +-# moe_output = None +-# # 在推理时,根据序列长度选择最优路径 +-# if not self.training: +-# if sequence_length == 1: +-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-# else: +-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-# else: +-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-# raise NotImplementedError("Training path is not implemented.") +- +-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +- +-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +- +-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- +- +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# """ +-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# # 门控网络 +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# # 专家列表 +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +-# # 共享专家 +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# @no_grad() +-# def _moe_infer_decode( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# batch_size, _ = hidden_states.shape +-# expert_outputs_list = [ +-# ops.cat([ +-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-# ], dim=0) +-# for i in range(batch_size) +-# ] +-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-# return moe_output.squeeze(1) +- +-# @no_grad() +-# def _moe_infer_prefill( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens = hidden_states.shape[0] +-# flat_selected_experts = selected_experts.flatten() +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-# active_experts = ops.unique(flat_selected_experts) +- +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +-# mask = (flat_selected_experts == expert_idx_tensor) +-# selected_token_indices = token_indices[mask] +-# selected_routing_weights = routing_weights.flatten()[mask] +-# current_states = hidden_states[selected_token_indices] +-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-# moe_output = moe_output.index_add( +-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-# ) +-# return moe_output +- +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# """ +-# 顶层 forward 方法,作为智能分发器。 +-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-# """ +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +- +-# # 1. 门控计算 (通用逻辑) +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +-# routing_weights = routing_weights.to(hidden_states.dtype) +- +-# # 2. 智能分发到最优 MoE 路径 +-# moe_output = None +-# if not self.training: +-# if sequence_length == 1: +-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-# else: +-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-# else: +-# raise NotImplementedError("Training path is not implemented.") +- +-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +- +-# # 4. 合并 MoE 输出和共享专家输出 +-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +- +-# # 5. 恢复原始形状并返回 +-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- +-# prefill fastest +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# """ +-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# # 门控网络 +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# # 专家列表 +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +-# # 共享专家 +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# @no_grad() +-# def _moe_infer_dispatch( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# """ +-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-# """ +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens, _ = hidden_states.shape +- +-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-# flat_selected_experts = selected_experts.flatten() +-# flat_routing_weights = routing_weights.flatten() +- +-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +- +-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-# active_experts = ops.unique(flat_selected_experts) +- +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +- +-# # 找到所有分配给该专家的 token +-# mask = (flat_selected_experts == expert_idx_tensor) +- +-# # 使用 mask 选取对应的 token 和权重 +-# current_token_indices = token_indices[mask] +-# current_routing_weights = flat_routing_weights[mask] +-# current_hidden_states = hidden_states[current_token_indices] +- +-# # 对这些 token 进行批处理 +-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +- +-# # 使用 index_add 将结果精确地加回到对应位置 +-# moe_output = moe_output.index_add( +-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-# ) +-# return moe_output +- +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# """ +-# 顶层 forward 方法,作为智能分发器。 +-# """ +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +- +-# # 1. 门控计算 +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +-# routing_weights = routing_weights.to(hidden_states.dtype) +- +-# # 2. 调用统一的 MoE 计算内核 +-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +- +-# # 3. 统一处理共享专家 +-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +- +-# # 4. 合并输出 +-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +- +-# # 5. 恢复原始形状并返回 +-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- +- +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# """ +-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-# 【最终高性能与高精度版】: +-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-# 3. 这样实现了速度和准确性的两全其美。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# @no_grad() +-# def _moe_infer_decode( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# """ +-# 【解码路径】极致优化版:bmm + 高精度累加。 +-# """ +-# original_dtype = hidden_states.dtype +-# batch_size, _ = hidden_states.shape +- +-# expert_outputs_list = [ +-# ops.cat([ +-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-# ], dim=0) +-# for i in range(batch_size) +-# ] +-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +- +-# # 在 float32 下执行 bmm,得到高精度结果 +-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +- +-# # 将高精度结果转换回原始数据类型 +-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +- +-# return moe_output +- +-# @no_grad() +-# def _moe_infer_prefill( +-# self, +-# hidden_states: mindspore.Tensor, +-# selected_experts: mindspore.Tensor, +-# routing_weights: mindspore.Tensor +-# ) -> mindspore.Tensor: +-# """ +-# 【预填充路径】与原始实现一致,结果精确。 +-# """ +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens, _ = hidden_states.shape +-# flat_selected_experts = selected_experts.flatten() +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-# active_experts = ops.unique(flat_selected_experts) +- +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +-# mask = (flat_selected_experts == expert_idx_tensor) +-# selected_token_indices = token_indices[mask] +-# selected_routing_weights = routing_weights.flatten()[mask] +-# current_states = hidden_states[selected_token_indices] +-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-# moe_output = moe_output.index_add( +-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-# ) +-# return moe_output +- +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +- +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-# # 如果模型主体是 float16,后续再转换 +- +-# moe_output = None +-# if not self.training: +-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-# # _moe_infer_decode 内部会处理好类型转换 +-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-# if sequence_length == 1: +-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-# else: +-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-# else: +-# raise NotImplementedError("Training path is not implemented.") +- +-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +- +-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- +- +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# """ +-# 【融合版】一个混合专家模块,内置两种推理策略, +-# 由外部全局变量 `Long_Prompt` 控制: +- +-# - if Long_Prompt is True: 【精度优先模式】 +-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-# 适用于处理长序列,避免误差累积。 +- +-# - if Long_Prompt is False: 【速度优先模式】 +-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# # --- 速度优先模式的辅助函数 --- +-# @no_grad() +-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-# original_dtype = hidden_states.dtype +-# batch_size, _ = hidden_states.shape +-# expert_outputs_list = [ +-# ops.cat([ +-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-# ], dim=0) +-# for i in range(batch_size) +-# ] +-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-# weights_fp32 = routing_weights.to(mindspore.float32) +-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-# return moe_output_fp32.squeeze(1).to(original_dtype) +- +-# @no_grad() +-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens, _ = hidden_states.shape +-# flat_selected_experts = selected_experts.flatten() +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-# active_experts = ops.unique(flat_selected_experts) +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +-# mask = (flat_selected_experts == expert_idx_tensor) +-# selected_token_indices = token_indices[mask] +-# selected_routing_weights = routing_weights.flatten()[mask] +-# current_states = hidden_states[selected_token_indices] +-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-# return moe_output +- +-# # --- 精度优先模式的辅助函数 --- +-# @no_grad() +-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-# moe_output = ops.zeros_like(hidden_states) +-# num_tokens, _ = hidden_states.shape +-# flat_selected_experts = selected_experts.flatten() +-# flat_routing_weights = routing_weights.flatten() +-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-# active_experts = ops.unique(flat_selected_experts) +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +-# mask = (flat_selected_experts == expert_idx_tensor) +-# current_token_indices = token_indices[mask] +-# current_routing_weights = flat_routing_weights[mask] +-# current_hidden_states = hidden_states[current_token_indices] +-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-# return moe_output +- +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# # 声明我们将要使用一个在模块外部定义的全局变量 +-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-# global Long_Prompt +- +-# # 1. 门控计算 (所有模式通用) +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +-# moe_output = None +-# if not self.training: +-# # 根据 Long_Prompt 标志选择模式 +-# if Long_Prompt: +-# # --- 精度优先模式 --- +-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-# else: +-# # --- 速度优先模式 --- +-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-# if sequence_length == 1: +-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-# else: +-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-# else: +-# raise NotImplementedError("Training path is not implemented.") +- +-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +- +-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- + class Qwen2MoeSparseMoeBlock(nn.Module): + """ + 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) + return moe_output_fp32.squeeze(1).to(original_dtype) + ++ # @no_grad() ++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ # num_tokens, _ = hidden_states.shape ++ # flat_selected_experts = selected_experts.flatten() ++ # sorted_expert_indices = flat_selected_experts.argsort() ++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++ # original_token_indices = sorted_expert_indices // self.top_k ++ # moe_output = ops.zeros_like(hidden_states) ++ # current_token_offset = 0 ++ # for i in range(self.num_experts): ++ # expert_token_count = tokens_per_expert[i] - current_token_offset ++ # if expert_token_count == 0: ++ # continue ++ # end_offset = current_token_offset + expert_token_count ++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++ # expert_hidden_states = hidden_states[expert_original_token_indices] ++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++ # current_token_offset += expert_token_count ++ # return moe_output ++ + @no_grad() + def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- num_tokens, _ = hidden_states.shape +- flat_selected_experts = selected_experts.flatten() +- sorted_expert_indices = flat_selected_experts.argsort() +- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +- original_token_indices = sorted_expert_indices // self.top_k ++ """ ++ 优化版 MoE prefill (速度优先模式): ++ - 批量张量化处理同一个 expert 的所有 token ++ - 跳过无 token 的专家 ++ - 保持结果完全一致 ++ """ + moe_output = ops.zeros_like(hidden_states) +- current_token_offset = 0 +- for i in range(self.num_experts): +- expert_token_count = tokens_per_expert[i] - current_token_offset +- if expert_token_count == 0: +- continue +- end_offset = current_token_offset + expert_token_count +- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +- expert_hidden_states = hidden_states[expert_original_token_indices] +- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +- current_token_offset += expert_token_count ++ ++ flat_selected_experts = selected_experts.flatten() ++ flat_routing_weights = routing_weights.flatten() ++ ++ idxs = flat_selected_experts.argsort() ++ sorted_expert_indices = flat_selected_experts[idxs] ++ sorted_token_indices = idxs // self.top_k ++ ++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) ++ ++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++ ++ for expert_id in active_experts.tolist(): ++ start = int(tokens_per_expert[:expert_id].sum().item()) ++ end = start + int(tokens_per_expert[expert_id].item()) ++ ++ token_idx = sorted_token_indices[start:end] ++ expert_tokens = hidden_states[token_idx] ++ ++ expert_out = self.experts[expert_id](expert_tokens) ++ ++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) ++ ++ moe_output = mindspore.mint.scatter_add( ++ moe_output, ++ 0, ++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), ++ scaled_out.to(hidden_states.dtype) ++ ) ++ + return moe_output + ++ + # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- + @no_grad() + def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) + + moe_output = None +- if Long_Prompt: +- # --- 精度优先模式 (ACCURACY MODE) --- +- routing_weights_casted = routing_weights.to(hidden_states.dtype) +- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # if Long_Prompt==0: ++ # # --- 精度优先模式 (ACCURACY MODE) --- ++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # else: ++ # # --- 速度优先模式 (SPEED MODE) --- ++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ # if sequence_length == 1: ++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # else: ++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ ++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ if sequence_length == 1: ++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) + else: +- # --- 速度优先模式 (SPEED MODE) --- +- routing_weights_casted = routing_weights.to(hidden_states.dtype) +- if sequence_length == 1: +- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +- else: +- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +- ++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ + + # 3. 共享专家计算与合并 (所有模式通用) + gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + + return final_hidden_states, router_logits + ++ + class Qwen2MoeDecoderLayer(nn.Module): + def __init__(self, config: Qwen2MoeConfig, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size + +- # if Long_Prompt: +- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- # else: ++ # if Long_Prompt == 2: + # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++ # else: ++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + + self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + +@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): + ) + + # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++ # attention_mask, ++ # sequence_length=sequence_length, ++ # target_length=target_length, ++ # dtype=dtype, ++ # min_dtype=min_dtype, ++ # cache_position=cache_position, ++ # batch_size=input_tensor.shape[0], ++ # ) ++ #@dwj ++ causal_mask = get_cached_causal_mask_with_cache_position( + attention_mask, + sequence_length=sequence_length, + target_length=target_length, +@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 + 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 + """ +- global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache ++ _causal_mask_cache.clear() + + input_ids = kwargs.get("input_ids") + if input_ids is None and args: +@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + + if input_ids is not None: + prompt_length = input_ids.shape[1] +- +- if prompt_length > PROMPT_LENGTH_THRESHOLD: +- Long_Prompt = True ++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: ++ Long_Prompt = 2 ++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: ++ Long_Prompt = 0 + else: +- Long_Prompt = False ++ Long_Prompt = 1 ++ + + return super().generate(*args, **kwargs) + +@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + dtype = self.lm_head.weight.dtype + min_dtype = float(ops.finfo(dtype).min) + +- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++ # attention_mask, ++ # sequence_length=sequence_length, ++ # target_length=past_key_values.get_max_length(), ++ # dtype=dtype, ++ # min_dtype=min_dtype, ++ # cache_position=cache_position, ++ # batch_size=batch_size, ++ # ) ++ ++ #@dwj ++ attention_mask = get_cached_causal_mask_with_cache_position( + attention_mask, + sequence_length=sequence_length, + target_length=past_key_values.get_max_length(), +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +deleted file mode 100644 +index 6dfb5b93..00000000 +--- a/patches/0001-20251104commit.patch ++++ /dev/null +@@ -1,1272 +0,0 @@ +-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH] 20251104commit +- +---- +- mindnlp/transformers/cache_utils.py | 28 +- +- .../models/deepseek/modeling_deepseek.py | 149 ++- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +- 3 files changed, 976 insertions(+), 87 deletions(-) +- +-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-index cadd2e04..02f8d4be 100644 +---- a/mindnlp/transformers/cache_utils.py +-+++ b/mindnlp/transformers/cache_utils.py +-@@ -812,14 +812,26 @@ class StaticCache(Cache): +- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +- # k_out[:, :, cache_position] = key_states +- # v_out[:, :, cache_position] = value_states +-- if ON_ORANGE_PI: +-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-- else: +-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-- +-+ # if ON_ORANGE_PI: +-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+ # else: +-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+ if cache_position.ndim > 1: +-+ cache_position = cache_position.flatten() +-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+ cache_position = cache_position.int() +-+ +-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+ k_out[:, :, cache_position] = key_states +-+ v_out[:, :, cache_position] = value_states +-+ +- return k_out, v_out +- +- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index c695b944..d8303e45 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +- # Copied from transformers.models.llama.modeling_llama.rotate_half +- def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +-- x1 = x[..., : x.shape[-1] // 2] +-- x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+ # x1 = x[..., : x.shape[-1] // 2] +-+ # x2 = x[..., x.shape[-1] // 2 :] +-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +- if self.training: +- raise NotImplementedError("Training is not supported yet.") +- else: +-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-- if self.config.n_shared_experts is not None: +-- y = y + self.shared_experts(identity) +-- return y +-+ # @lwx +-+ if orig_shape[1] == 1: +-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+ y=y.view(*orig_shape) +-+ if self.config.n_shared_experts is not None: +-+ y = y + self.shared_experts(identity) +-+ return y +-+ else: +-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+ if self.config.n_shared_experts is not None: +-+ y = y + self.shared_experts(identity) +-+ return y +-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+ # if self.config.n_shared_experts is not None: +-+ # y = y + self.shared_experts(identity) +-+ # return y +-+ +-+ @no_grad() +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ +-+ expert_cache = ops.zeros_like(x) +-+ for i in range(self.num_experts_per_tok): +-+ expert_id = flat_expert_indices[i].item() +-+ weight = flat_expert_weights[i].item() +-+ expert = self.experts[expert_id] +-+ expert_out = expert(x) +-+ expert_cache += expert_out * weight +-+ return expert_cache +- +- @no_grad() +-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-- # expert_cache = torch.zeros_like(x) +-- # idxs = flat_expert_indices.argsort() +-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-- # token_idxs = idxs // self.num_experts_per_tok +-- # for i, end_idx in enumerate(tokens_per_expert): +-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-- # if start_idx == end_idx: +-- # continue +-- # expert = self.experts[i] +-- # exp_token_idx = token_idxs[start_idx:end_idx] +-- # expert_tokens = x[exp_token_idx] +-- # expert_out = expert(expert_tokens) +-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-- # return expert_cache +-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- expert_cache = ops.zeros_like(x) +- idxs = flat_expert_indices.argsort() +- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- token_idxs = idxs // self.num_experts_per_tok +-+ +- for i, end_idx in enumerate(tokens_per_expert): +- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- if start_idx == end_idx: +-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +- expert_out = expert(expert_tokens) +- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +- return expert_cache +-+ +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # # expert_cache = torch.zeros_like(x) +-+ # # idxs = flat_expert_indices.argsort() +-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+ # # token_idxs = idxs // self.num_experts_per_tok +-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+ # # if start_idx == end_idx: +-+ # # continue +-+ # # expert = self.experts[i] +-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # # expert_tokens = x[exp_token_idx] +-+ # # expert_out = expert(expert_tokens) +-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+ # # return expert_cache +-+ # expert_cache = ops.zeros_like(x) +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # for i, end_idx in enumerate(tokens_per_expert): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # if start_idx == end_idx: +-+ # continue +-+ # expert = self.experts[i] +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = expert(expert_tokens) +-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +-+ # return expert_cache +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # expert_cache = ops.zeros_like(x) +-+ +-+ # # 排序保证顺序一致 +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # # 找出有 token 的专家 +-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+ +-+ # for i in active_experts.tolist(): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # end_idx = tokens_per_expert[i] +-+ # if start_idx == end_idx: # 没有 token +-+ # continue +-+ +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = self.experts[i](expert_tokens) +-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+ +-+ # expert_cache = mindspore.mint.scatter_add( +-+ # expert_cache, +-+ # 0, +-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+ # expert_out +-+ # ) +-+ +-+ # return expert_cache +-+ +-+ +- +- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +- # """ +-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- +- # Initialize weights and apply final processing +- self.post_init() +-+ self.warm_up = False +-+ +-+ def warmup_moe_model_deep(self): +-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+ test_texts = [ +-+ "warmup short", +-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+ ] +-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+ if tokenizer is None: +-+ from mindnlp.transformers import AutoTokenizer +-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+ self._warmup_tokenizer = tokenizer +-+ +-+ for text in test_texts: +-+ inputs = tokenizer(text, return_tensors="ms") +-+ with mindspore._no_grad(): +-+ _ = self(**inputs, use_cache=False) +-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +- +- def get_input_embeddings(self): +- return self.model.embed_tokens +-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +- ```""" +-+ if not self.warm_up: +-+ self.warm_up = True +-+ self.warmup_moe_model_deep() +-+ +- output_attentions = ( +- output_attentions +- if output_attentions is not None +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index 3cbf820e..d4c6b651 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -18,7 +18,6 @@ +- # See the License for the specific language governing permissions and +- # limitations under the License. +- """MindSpore Qwen2MoE model.""" +-- +- import math +- from typing import List, Optional, Tuple, Union +- +-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +- TokenClassifierOutput, +- ) +- from ...modeling_utils import PreTrainedModel +-+from ...generation import GenerationMixin +- from ....utils import logging +- from .configuration_qwen2_moe import Qwen2MoeConfig +- +-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +- self.variance_epsilon = eps +- +- def forward(self, hidden_states): +-+ # @dwj +-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+ # @lwx +-+ # if not self.training : +-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +- input_dtype = hidden_states.dtype +- hidden_states = hidden_states.to(mindspore.float32) +- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-@@ -234,6 +239,8 @@ def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +- x1 = x[..., : x.shape[-1] // 2] +- x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +- self.config = config +- self.hidden_size = config.hidden_size +- self.intermediate_size = intermediate_size +-+ +- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +- self.act_fn = ACT2FN[config.hidden_act] +- +- def forward(self, x): +-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-- +- +-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+ # @lwx +-+ # gate_up_output = self.gate_up_proj(x) +-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+ # return self.down_proj(swiglu_output) +-+ +-+ # def forward(self, x): +-+ # gate_proj_out = self.gate_proj(x) +-+ # up_proj_out = self.up_proj(x) +-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+ # return self.down_proj(swiglu_out) +-+ +- # Copied from transformers.models.llama.modeling_llama.repeat_kv +- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +- """ +-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +- use_cache: bool = False, +- cache_position: Optional[mindspore.Tensor] = None, +- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ +-+ +- bsz, q_len, _ = hidden_states.shape +- +- query_states = self.q_proj(hidden_states) +-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +- "with a layer index." +- ) +-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ if isinstance(past_key_value, StaticCache): +-+ kv_seq_len = key_states.shape[-2] +-+ else: +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +- if past_key_value is not None: +- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+ if isinstance(past_key_value, StaticCache): +-+ kv_seq_len = key_states.shape[-2] +- +- # repeat k/v heads if n_kv_heads < n_heads +- key_states = repeat_kv(key_states, self.num_key_value_groups) +- value_states = repeat_kv(value_states, self.num_key_value_groups) +-- +-+ +- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +- +-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-- raise ValueError( +-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-- f" {attn_weights.shape}" +-- ) +-- +-- if attention_mask is not None: # no matter the length, we just slice it +-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+ if attention_mask is not None: +-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +- attn_weights = attn_weights + causal_mask +- +- # upcast attention to fp32 +-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +- +- attn_output = self.o_proj(attn_output) +-- +-+ # @lwx +-+ +-+ # max_seq_len = self.max_position_embeddings # 2048 +-+ +-+ # if attention_mask is not None: +-+ # # attention_mask: [B, 1, Sq, Sk] +-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+ +-+ # # pad 到 [max_seq_len, max_seq_len] +-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+ # global_attention_mask = padded_mask +-+ # else: +-+ # global_attention_mask = None +-+ +-+ +-+ # sparse_mode=3 +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, +-+ # key=key_states, +-+ # value=value_states, +-+ # real_shift=None, +-+ # padding_mask=None, +-+ +-+ # head_num=self.num_heads, +-+ # attn_mask=global_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+ # input_layout="BNSD", +-+ # pre_tokens=2147483647, +-+ # next_tokens=2147483647, +-+ # inner_precise=0, +-+ # drop_mask=None, +-+ # prefix=None, +-+ # actual_seq_qlen=None, +-+ # actual_seq_kvlen=None, +-+ # sparse_mode=sparse_mode, +-+ # ) +- if not output_attentions: +- attn_weights = None +- +- return attn_output, attn_weights, past_key_value +- +- +-+class Qwen2MoeFlashAttention(nn.Module): +-+ """ +-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+ +-+ 关键改动: +-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+ 直接传入原始的 key 和 value 张量效率更高。 +-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+ """ +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+ super().__init__() +-+ self.config = config +-+ self.layer_idx = layer_idx +-+ self.hidden_size = config.hidden_size +-+ self.num_heads = config.num_attention_heads +-+ self.head_dim = self.hidden_size // self.num_heads +-+ self.num_key_value_heads = config.num_key_value_heads +-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+ self.max_position_embeddings = config.max_position_embeddings +-+ self.rope_theta = config.rope_theta +-+ self.attention_dropout = config.attention_dropout +-+ +-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+ raise ValueError( +-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+ ) +-+ +-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+ +-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+ self.head_dim, +-+ max_position_embeddings=self.max_position_embeddings, +-+ base=self.rope_theta, +-+ ) +-+ +-+ def forward( +-+ self, +-+ hidden_states: mindspore.Tensor, +-+ attention_mask: Optional[mindspore.Tensor] = None, +-+ position_ids: Optional[mindspore.Tensor] = None, +-+ past_key_value: Optional[Cache] = None, +-+ output_attentions: bool = False, +-+ use_cache: bool = False, +-+ cache_position: Optional[mindspore.Tensor] = None, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ # 1. 线性投射 Q, K, V +-+ query_states = self.q_proj(hidden_states) +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+ # query: [B, S, H*D] -> [B, N1, S, D] +-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # 3. RoPE 旋转位置编码 +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+ if self.layer_idx is None: +-+ raise ValueError( +-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ "with a layer index." +-+ ) +-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+ if cache_position.shape[0] == 1: +-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+ kv_seq_len = past_seen_tokens + 1 +-+ else: +-+ # prefill 阶段:cache_position 是范围,使用其长度 +-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+ else: +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # 4. KV 缓存更新 +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ key_states, value_states = past_key_value.update( +-+ key_states, value_states, self.layer_idx, cache_kwargs +-+ ) +-+ +-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+ if cache_position.shape[0] == 1: +-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+ kv_seq_len = key_states.shape[-2] +-+ +-+ # 5. [重要] 准备 Attention Mask +-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+ fa_attention_mask = None +-+ if attention_mask is not None: +-+ # 截取与当前key长度匹配的部分 +-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+ fa_attention_mask = (mask_slice != 0) +-+ +-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+ input_dtype = query_states.dtype +-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+ query_states = query_states.to(mindspore.float16) +-+ key_states = key_states.to(mindspore.float16) +-+ value_states = value_states.to(mindspore.float16) +-+ +-+ # 6. [核心] 调用 flash_attention_score 算子 +-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+ attn_output = mindspore.ops.flash_attention_score( +-+ query=query_states, +-+ key=key_states, +-+ value=value_states, +-+ head_num=self.num_heads, # 传入Q的头数(N1) +-+ attn_mask=fa_attention_mask, +-+ keep_prob=1.0 - self.attention_dropout, +-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+ input_layout="BNSD", +-+ sparse_mode=0 # 使用 defaultMask 模式 +-+ ) +-+ +-+ # 恢复原始数据类型 +-+ attn_output = attn_output.to(input_dtype) +-+ +-+ # 7. 调整输出形状 +-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ attn_output = self.o_proj(attn_output) +-+ +-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-+ attn_weights = None +-+ if output_attentions: +-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+ # def forward( +-+ # self, +-+ # hidden_states: mindspore.Tensor, +-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+ # position_ids: Optional[mindspore.Tensor] = None, +-+ # past_key_value: Optional[Cache] = None, +-+ # output_attentions: bool = False, +-+ # use_cache: bool = False, +-+ # cache_position: Optional[mindspore.Tensor] = None, +-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ # bsz, q_len, _ = hidden_states.shape +-+ +-+ # # 1. 线性投射 Q, K, V +-+ # query_states = self.q_proj(hidden_states) +-+ # key_states = self.k_proj(hidden_states) +-+ # value_states = self.v_proj(hidden_states) +-+ +-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # # 3. RoPE 旋转位置编码 +-+ # kv_seq_len = key_states.shape[-2] +-+ # if past_key_value is not None: +-+ # if self.layer_idx is None: +-+ # raise ValueError( +-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ # "with a layer index." +-+ # ) +-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # # 4. KV 缓存更新 +-+ # if past_key_value is not None: +-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ # key_states, value_states = past_key_value.update( +-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+ # ) +-+ +-+ # # 5. 准备 Attention Mask +-+ # fa_attention_mask = None +-+ # if attention_mask is not None: +-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # fa_attention_mask = (mask_slice != 0) +-+ +-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+ # input_dtype = query_states.dtype +-+ +-+ # # 6. [核心] 调用 flash_attention_score 算子 +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, +-+ # key=key_states, +-+ # value=value_states, +-+ # head_num=self.num_heads, +-+ # attn_mask=fa_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+ # input_layout="BNSD", +-+ # sparse_mode=0, +-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+ # inner_precise=1 +-+ # ) +-+ +-+ # # 恢复原始数据类型 +-+ # attn_output = attn_output.to(input_dtype) +-+ +-+ # # 7. 调整输出形状 +-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ # attn_output = self.o_proj(attn_output) +-+ +-+ # attn_weights = None +-+ # if output_attentions: +-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+ # return attn_output, attn_weights, past_key_value +-+ +-+ # def forward( +-+ # self, +-+ # hidden_states: mindspore.Tensor, +-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+ # position_ids: Optional[mindspore.Tensor] = None, +-+ # past_key_value: Optional[Cache] = None, +-+ # output_attentions: bool = False, +-+ # use_cache: bool = False, +-+ # cache_position: Optional[mindspore.Tensor] = None, +-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ # bsz, q_len, _ = hidden_states.shape +-+ +-+ # query_states = self.q_proj(hidden_states) +-+ # key_states = self.k_proj(hidden_states) +-+ # value_states = self.v_proj(hidden_states) +-+ +-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # kv_seq_len = key_states.shape[-2] +-+ # if past_key_value is not None: +-+ # if self.layer_idx is None: +-+ # raise ValueError("`layer_idx` must be specified for caching") +-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # if past_key_value is not None: +-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ # key_states, value_states = past_key_value.update( +-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+ # ) +-+ +-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+ +-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+ # query_states = query_states / math.sqrt(self.head_dim) +-+ # # <--- 修改结束 --- +-+ +-+ # fa_attention_mask = None +-+ # if attention_mask is not None: +-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # fa_attention_mask = (mask_slice != 0) +-+ +-+ # input_dtype = query_states.dtype +-+ +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, # 传入已经预先缩放过的 query +-+ # key=key_states, +-+ # value=value_states, +-+ # head_num=self.num_heads, +-+ # attn_mask=fa_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+ # input_layout="BNSD", +-+ # sparse_mode=0, +-+ # inner_precise=1 # 仍然保持内部高精度计算 +-+ # ) +-+ +-+ # attn_output = attn_output.to(input_dtype) +-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ # attn_output = self.o_proj(attn_output) +-+ +-+ # attn_weights = None +-+ # if output_attentions: +-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+ +-+ # return attn_output, attn_weights, past_key_value +-+ +- QWEN2MOE_ATTENTION_CLASSES = { +- "eager": Qwen2MoeAttention, +-+ "flash-attention": Qwen2MoeFlashAttention, +- } +- +- +-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-+ #@dwj +-+ # 只遍历激活的专家,而非全部专家 +- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-- hidden_states = hidden_states.view(-1, hidden_dim) +-- # router_logits: (batch * sequence_length, n_experts) +-- router_logits = self.gate(hidden_states) +-- +-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- if self.norm_topk_prob: +-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- # we cast back to the input dtype +-- routing_weights = routing_weights.to(hidden_states.dtype) +-- +-- final_hidden_states = ops.zeros( +-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-- ) +-- +-- # One hot encode the selected experts to create an expert mask +-- # this will be used to easily index which expert is going to be sollicitated +-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-- +-- # Loop over all available experts in the model and perform the computation on each expert +-- for expert_idx in range(self.num_experts): +-- expert_layer = self.experts[expert_idx] +-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-- +-- # Index the correct hidden states and compute the expert hidden state for +-- # the current expert. We need to make sure to multiply the output hidden +-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-- if 0 not in idx.shape: +-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-- +-- # However `index_add_` only support torch tensors for indexing so we'll use +-- # the `top_x` tensor here. +-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-- +-- shared_expert_output = self.shared_expert(hidden_states) +-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-- +-- final_hidden_states = final_hidden_states + shared_expert_output +-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+ num_tokens = hidden_states_reshaped.shape[0] +-+ +-+ router_logits = self.gate(hidden_states_reshaped) +-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+ if self.norm_topk_prob: +-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+ flat_selected_experts = selected_experts.flatten() +-+ +-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+ token_indices = broadcasted_token_indices.flatten() +-+ +-+ active_experts = ops.unique(flat_selected_experts) +-+ +-+ for expert_idx_tensor in active_experts: +-+ expert_idx = expert_idx_tensor.item() +-+ expert_layer = self.experts[expert_idx] +-+ +-+ mask = (flat_selected_experts == expert_idx_tensor) +-+ selected_token_indices = token_indices[mask] +-+ selected_routing_weights = routing_weights.flatten()[mask] +-+ +-+ current_states = hidden_states_reshaped[selected_token_indices] +-+ +-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+ +-+ final_hidden_states = final_hidden_states.index_add( +-+ dim=0, +-+ index=selected_token_indices, +-+ source=expert_output.to(hidden_states.dtype) +-+ ) +-+ +-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +- +-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-- return final_hidden_states, router_logits +-+ final_hidden_states = final_hidden_states + shared_expert_output +-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+ +-+ return final_hidden_states, router_logits +- +- +- class Qwen2MoeDecoderLayer(nn.Module): +-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +- +- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+ +- if (layer_idx not in config.mlp_only_layers) and ( +- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +- ): +-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +- _no_split_modules = ["Qwen2MoeDecoderLayer"] +- _skip_keys_device_placement = "past_key_values" +- _supports_cache_class = True +-+#lwx +-+ # _supports_static_cache = True +- +- def _init_weights(self, module): +- std = self.config.initializer_range +-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +- return causal_mask +- +- +--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- _tied_weights_keys = ["lm_head.weight"] +- +- def __init__(self, config): +-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- self.num_experts_per_tok = config.num_experts_per_tok +- # Initialize weights and apply final processing +- self.post_init() +-+ # @lwx +-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+ # self.generation_config.cache_implementation = "static" +-+ self._warmed_up = False +-+ +-+ def warmup_moe_model(self): +-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+ test_texts = [ +-+ "warmup short", +-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+ ] +-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+ if tokenizer is None: +-+ from mindnlp.transformers import AutoTokenizer +-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+ self._warmup_tokenizer = tokenizer +-+ +-+ for text in test_texts: +-+ inputs = tokenizer(text, return_tensors="ms") +-+ with mindspore._no_grad(): +-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +- +- def get_input_embeddings(self): +- return self.model.embed_tokens +-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +- ```""" +-+ if not self._warmed_up: +-+ self._warmed_up = True +-+ self.warmup_moe_model() +- +- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +- output_router_logits = ( +-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- } +- ) +- return model_inputs +-+# @lwx +-+ # def _decode_one_tokens_logits( +-+ # self, +-+ # cur_token: mindspore.Tensor, +-+ # input_pos: Optional[mindspore.Tensor], +-+ # cache_position: mindspore.Tensor, +-+ # past_key_values: StaticCache, +-+ # ) -> mindspore.Tensor: +-+ # """ +-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+ +-+ # Args: +-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+ # input_pos: 输入位置信息,可选 +-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+ +-+ # Returns: +-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+ # """ +-+ # # 调用JIT编译的版本 +-+ # return self.get_decode_one_tokens_logits( +-+ # cur_token=cur_token, +-+ # input_pos=input_pos, +-+ # cache_position=cache_position, +-+ # past_key_values=past_key_values, +-+ # ) +-+ +-+ # @mindspore.jit(jit_level='O1') +-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+ # """ +-+ # JIT编译的函数,用于高效的单token解码 +-+ # 使用JIT编译优化以支持静态shape和高效执行 +-+ +-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+ # """ +-+ # outputs = self.model.forward( +-+ # input_ids=cur_token, +-+ # position_ids=input_pos, +-+ # cache_position=cache_position, +-+ # past_key_values=past_key_values, +-+ # use_cache=True, +-+ # return_dict=False, +-+ # ) +-+ +-+ # hidden_states = outputs[0] +-+ # logits = self.lm_head.forward(hidden_states) +-+ # logits = logits.float() +-+ +-+ # return logits[:, -1, :] +-+ +-+ # def _sample( +-+ # self, +-+ # input_ids: mindspore.Tensor, +-+ # logits_processor, +-+ # stopping_criteria, +-+ # generation_config, +-+ # synced_devices: bool, +-+ # streamer=None, +-+ # logits_warper=None, +-+ # **model_kwargs, +-+ # ): +-+ # """ +-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+ # """ +-+ # from ...generation.logits_process import LogitsProcessorList +-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+ # from mindnlp.core import nn, ops, no_grad +-+ # import numpy as np +-+ +-+ # # 检查是否使用 StaticCache +-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+ # # 否则,直接调用父类方法 +-+ # past_key_values = model_kwargs.get("past_key_values") +-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+ +-+ # if not isinstance(past_key_values, StaticCache): +-+ # # 不使用 StaticCache,直接调用父类方法 +-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+ # return super()._sample( +-+ # input_ids=input_ids, +-+ # logits_processor=logits_processor, +-+ # stopping_criteria=stopping_criteria, +-+ # generation_config=generation_config, +-+ # synced_devices=synced_devices, +-+ # streamer=streamer, +-+ # logits_warper=logits_warper, +-+ # **model_kwargs, +-+ # ) +-+ +-+ # # 使用 StaticCache,进入自定义循环 +-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+ # pad_token_id = generation_config._pad_token_tensor +-+ # output_attentions = generation_config.output_attentions +-+ # output_hidden_states = generation_config.output_hidden_states +-+ # output_scores = generation_config.output_scores +-+ # output_logits = generation_config.output_logits +-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-+ # max_length = generation_config.max_length +-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+ # do_sample = generation_config.do_sample +-+ +-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+ # raise ValueError( +-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+ # f"{logits_warper})." +-+ # ) +-+ +-+ # # init attention / hidden states / scores tuples +-+ # scores = () if (return_dict_in_generate and output_scores) else None +-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+ +-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+ # encoder_hidden_states = ( +-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+ # ) +-+ +-+ # # keep track of which sequences are already finished +-+ # batch_size, cur_len = input_ids.shape +-+ # this_peer_finished = False +-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+ +-+ # time_record = [] +-+ # from ....utils.testing_utils import parse_flag_from_env +-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+ +-+ # while self._has_unfinished_sequences( +-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+ # ): +-+ # if _record_time: +-+ # import time as time_module +-+ # infer_start = time_module.time() +-+ +-+ # # prepare model inputs +-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+ +-+ # # prepare variable output controls +-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+ +-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+ # cur_cache_position = model_inputs.get("cache_position") +-+ # cur_past_key_values = model_inputs.get("past_key_values") +-+ # cur_input_ids = model_inputs.get("input_ids") +-+ +-+ # if (isinstance(cur_past_key_values, StaticCache) and +-+ # cur_cache_position is not None and +-+ # len(cur_cache_position.shape) > 0 and +-+ # cur_cache_position.shape[0] == 1 and +-+ # cur_input_ids is not None and +-+ # cur_input_ids.shape[1] == 1): +-+ # # 使用 JIT 优化的单 token 解码 +-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+ # if not hasattr(self, '_jit_used'): +-+ # self._jit_used = False +-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+ +-+ # next_token_logits = self.get_decode_one_tokens_logits( +-+ # cur_token=cur_input_ids, +-+ # input_pos=model_inputs.get("position_ids"), +-+ # cache_position=cur_cache_position, +-+ # past_key_values=cur_past_key_values, +-+ # ) +-+ +-+ # # 标记已使用JIT(用于后续判断) +-+ # if not self._jit_used: +-+ # self._jit_used = True +-+ +-+ # # 构造兼容的输出对象 +-+ # class JitOptimizedOutput: +-+ # def __init__(self, logits, config): +-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+ # self.config = config +-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+ # self.attentions = None if not config.is_encoder_decoder else None +-+ # self.cross_attentions = None +-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-+ +-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+ # else: +-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+ # outputs = self(**model_inputs, return_dict=True) +-+ +-+ # if synced_devices and this_peer_finished: +-+ # continue +-+ +-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+ # next_token_logits = outputs.logits[:, -1, :] +-+ +-+ # # pre-process distribution +-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+ # if do_sample: +-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+ +-+ # # Store scores, attentions and hidden_states when required +-+ # if return_dict_in_generate: +-+ # if output_scores: +-+ # scores += (next_token_scores,) +-+ # if output_logits: +-+ # raw_logits += (next_token_logits,) +-+ # if output_attentions: +-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-+ # if self.config.is_encoder_decoder: +-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+ +-+ # if output_hidden_states: +-+ # hidden = ( +-+ # outputs.decoder_hidden_states +-+ # if self.config.is_encoder_decoder +-+ # else outputs.hidden_states +-+ # ) +-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+ +-+ # # token selection +-+ # if do_sample: +-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+ # else: +-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+ +-+ # # finished sentences should have their next token be a padding token +-+ # if has_eos_stopping_criteria: +-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+ +-+ # # update generated ids, model inputs, and length for next step +-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+ # if streamer is not None: +-+ # streamer.put(next_tokens) +-+ +-+ # model_kwargs = self._update_model_kwargs_for_generation( +-+ # outputs, +-+ # model_kwargs, +-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-+ # ) +-+ +-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+ # cur_len += 1 +-+ +-+ # if _record_time: +-+ # import time as time_module +-+ # infer_stop = time_module.time() +-+ # time_record.append(infer_stop - infer_start) +-+ +-+ # del outputs +-+ +-+ # average_infer_time = None +-+ # if time_record: +-+ # if len(time_record) > 1: +-+ # time_record.pop(0) +-+ # average_infer_time = sum(time_record) / len(time_record) +-+ # print(f'average inference time is: {average_infer_time}') +-+ # print(f'inference time record: {time_record}') +-+ +-+ # if streamer is not None: +-+ # streamer.end() +-+ +-+ # # 简单判断:打印是否使用了JIT路径 +-+ # if hasattr(self, '_jit_used') and self._jit_used: +-+ # print("[JIT] ✓ JIT optimization was used during generation") +-+ # else: +-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+ +-+ # if return_dict_in_generate: +-+ # if self.config.is_encoder_decoder: +-+ # return GenerateEncoderDecoderOutput( +-+ # sequences=input_ids, +-+ # scores=scores, +-+ # logits=raw_logits, +-+ # encoder_attentions=encoder_attentions, +-+ # encoder_hidden_states=encoder_hidden_states, +-+ # decoder_attentions=decoder_attentions, +-+ # cross_attentions=cross_attentions, +-+ # decoder_hidden_states=decoder_hidden_states, +-+ # past_key_values=model_kwargs.get("past_key_values"), +-+ # average_infer_time=average_infer_time +-+ # ) +-+ # else: +-+ # return GenerateDecoderOnlyOutput( +-+ # sequences=input_ids, +-+ # scores=scores, +-+ # logits=raw_logits, +-+ # attentions=decoder_attentions, +-+ # hidden_states=decoder_hidden_states, +-+ # past_key_values=model_kwargs.get("past_key_values"), +-+ # average_infer_time=average_infer_time +-+ # ) +-+ # else: +-+ # return input_ids +-+ +-+ # def _prepare_cache_for_generation( +-+ # self, +-+ # generation_config, +-+ # model_kwargs, +-+ # assistant_model, +-+ # batch_size, +-+ # max_cache_length, +-+ # ): +-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+ # generation_config.cache_implementation = "static" +-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+ +-+ # if generation_config.cache_implementation == "static": +-+ # base_required_from_max_length = generation_config.max_length + 1 +-+ # base_required = max(max_cache_length, base_required_from_max_length) +-+ # min_cache_size = 50 +-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+ # else: +-+ # max_cache_length = max(base_required, min_cache_size) +-+ +-+ # original_max_cache_length = max_cache_length +-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+ # print(f" - final max_cache_length: {max_cache_length}") +-+ +-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+ # if max_cache_length > self.config.max_position_embeddings: +-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+ +-+ # result = super()._prepare_cache_for_generation( +-+ # generation_config=generation_config, +-+ # model_kwargs=model_kwargs, +-+ # assistant_model=assistant_model, +-+ # batch_size=batch_size, +-+ # max_cache_length=max_cache_length, +-+ # ) +-+ +-+ # if generation_config.cache_implementation == "static": +-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+ # created_cache = model_kwargs.get(cache_name) +-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+ # if created_cache.max_cache_len < generation_config.max_length: +-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+ +-+ # return result +-+ +-+ +-+ +- +- +- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +--- +-2.27.0 +- +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" new file mode 100644 index 00000000..25b442d5 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" @@ -0,0 +1,7498 @@ +From 60df5bdc79368911a03b9c034b11b7437df753ca Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Thu, 6 Nov 2025 15:48:09 +0800 +Subject: [PATCH 04/10] 20251106change + +--- + .../models/deepseek/modeling_deepseek.py | 189 +- + patches/0001-20251104commit.patch | 1272 +++++++ + patches/0002-20251106commit.patch | 3200 +++++++++++++++++ + patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ + 4 files changed, 7244 insertions(+), 186 deletions(-) + create mode 100644 patches/0001-20251104commit.patch + create mode 100644 patches/0002-20251106commit.patch + create mode 100644 patches/0003-20261106secondcommit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 2f9192bf..0546f318 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): + + return attn_output, attn_weights, past_key_value + +-# class DeepseekFlashAttention(nn.Module): +-# """ +-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +- +-# This class is designed as a drop-in replacement for DeepseekAttention. +-# """ +- +-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-# super().__init__() +-# self.config = config +-# self.layer_idx = layer_idx +-# if layer_idx is None: +-# logger.warning( +-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-# "when creating this class." +-# ) +- +-# self.attention_dropout = config.attention_dropout +-# self.hidden_size = config.hidden_size +-# self.num_heads = config.num_attention_heads +-# self.head_dim = self.hidden_size // self.num_heads +-# self.num_key_value_heads = config.num_key_value_heads +-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-# self.max_position_embeddings = config.max_position_embeddings +-# self.rope_theta = config.rope_theta +-# self.is_causal = True +- +-# if (self.head_dim * self.num_heads) != self.hidden_size: +-# raise ValueError( +-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-# f" and `num_heads`: {self.num_heads})." +-# ) +- +-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-# self._init_rope() +- +-# def _init_rope(self): +-# if self.config.rope_scaling is None: +-# self.rotary_emb = DeepseekRotaryEmbedding( +-# self.head_dim, +-# max_position_embeddings=self.max_position_embeddings, +-# base=self.rope_theta, +-# ) +-# else: +-# scaling_type = self.config.rope_scaling["type"] +-# scaling_factor = self.config.rope_scaling["factor"] +-# if scaling_type == "linear": +-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-# self.head_dim, +-# max_position_embeddings=self.max_position_embeddings, +-# scaling_factor=scaling_factor, +-# base=self.rope_theta, +-# ) +-# elif scaling_type == "dynamic": +-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-# self.head_dim, +-# max_position_embeddings=self.max_position_embeddings, +-# scaling_factor=scaling_factor, +-# base=self.rope_theta, +-# ) +-# else: +-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +- +-# def forward( +-# self, +-# hidden_states: mindspore.Tensor, +-# attention_mask: Optional[mindspore.Tensor] = None, +-# position_ids: Optional[mindspore.Tensor] = None, +-# past_key_value: Optional[Cache] = None, +-# output_attentions: bool = False, +-# use_cache: bool = False, +-# **kwargs, +-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-# if "padding_mask" in kwargs: +-# warnings.warn( +-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-# ) +- +-# if output_attentions: +-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +- +-# bsz, q_len, _ = hidden_states.shape +- +-# if self.config.pretraining_tp > 1: +-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +- +-# query_states = self.q_proj(hidden_states) +-# key_states = self.k_proj(hidden_states) +-# value_states = self.v_proj(hidden_states) +- +-# # Reshape for multi-head attention +-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +-# kv_seq_len = key_states.shape[-2] +-# if past_key_value is not None: +-# if self.layer_idx is None: +-# raise ValueError( +-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-# "with a layer index." +-# ) +-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- +-# # Apply Rotary Positional Embedding +-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +-# if past_key_value is not None: +-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +- +-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +- +-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +- +-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +- +-# # Convert attention_mask for flash_attention_score +-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-# if attention_mask is not None: +-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-# raise ValueError( +-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-# ) +-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-# else: +-# attn_mask_for_fa = None +- +-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +- +-# # Call the fused flash_attention_score operator +-# attn_output = mindspore.ops.flash_attention_score( +-# query=query_states_for_fa, +-# key=key_states_for_fa, +-# value=value_states_for_fa, +-# head_num=self.num_heads, # This is N1, the number of query heads +-# input_layout='BSH', +-# attn_mask=attn_mask_for_fa, +-# keep_prob=keep_prob, +-# scalar_value=1.0 / math.sqrt(self.head_dim), +-# sparse_mode=0 # Default mask mode +-# ) +- +-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-# attn_output = self.o_proj(attn_output) +- +-# # Flash Attention does not return attention weights +-# attn_weights = None +- +-# return attn_output, attn_weights, past_key_value + + class DeepseekFlashAttention(nn.Module): + """ +@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): + super().__init__() + self.hidden_size = config.hidden_size + +- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +- config=config, layer_idx=layer_idx +- ) ++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++ # config=config, layer_idx=layer_idx ++ # ) + + self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( + config=config, layer_idx=layer_idx +@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): + return outputs + + +- + class DeepseekPreTrainedModel(PreTrainedModel): + config_class = DeepseekConfig + base_model_prefix = "model" +@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + # Initialize weights and apply final processing + self.post_init() + self.warm_up = False +- #@dwj +- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +- self.num_layers, +- self.num_attention_heads, +- self.head_dim, +- batch_size=1, +- max_length=self.max_length, +- dtype=mindspore.float16 +- ) +- +- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +- key_cache = [] +- value_cache = [] +- for _ in range(num_layers): +- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +- key_cache.append(k) +- value_cache.append(v) +- return key_cache, value_cache +- + + def warmup_moe_model_deep(self): + print("[Warmup] DeepSeek-MoE 模型预热开始...") +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +new file mode 100644 +index 00000000..78f22642 +--- /dev/null ++++ b/patches/0001-20251104commit.patch +@@ -0,0 +1,1272 @@ ++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++Subject: [PATCH 1/3] 20251104commit ++ ++--- ++ mindnlp/transformers/cache_utils.py | 28 +- ++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++ 3 files changed, 976 insertions(+), 87 deletions(-) ++ ++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++index cadd2e04..02f8d4be 100644 ++--- a/mindnlp/transformers/cache_utils.py +++++ b/mindnlp/transformers/cache_utils.py ++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++ # k_out[:, :, cache_position] = key_states ++ # v_out[:, :, cache_position] = value_states ++- if ON_ORANGE_PI: ++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++- else: ++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++- +++ # if ON_ORANGE_PI: +++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++ # else: +++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++ # 确保 cache_position 是 1D tensor 并且类型正确 +++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++ if cache_position.ndim > 1: +++ cache_position = cache_position.flatten() +++ # 确保类型是 int32 或 int64(MindSpore 要求) +++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++ cache_position = cache_position.int() +++ +++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++ k_out[:, :, cache_position] = key_states +++ v_out[:, :, cache_position] = value_states +++ ++ return k_out, v_out ++ ++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index c695b944..d8303e45 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++- x1 = x[..., : x.shape[-1] // 2] ++- x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++ # x1 = x[..., : x.shape[-1] // 2] +++ # x2 = x[..., x.shape[-1] // 2 :] +++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++ if self.training: ++ raise NotImplementedError("Training is not supported yet.") ++ else: ++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++- if self.config.n_shared_experts is not None: ++- y = y + self.shared_experts(identity) ++- return y +++ # @lwx +++ if orig_shape[1] == 1: +++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++ y=y.view(*orig_shape) +++ if self.config.n_shared_experts is not None: +++ y = y + self.shared_experts(identity) +++ return y +++ else: +++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++ if self.config.n_shared_experts is not None: +++ y = y + self.shared_experts(identity) +++ return y +++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++ # if self.config.n_shared_experts is not None: +++ # y = y + self.shared_experts(identity) +++ # return y +++ +++ @no_grad() +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ +++ expert_cache = ops.zeros_like(x) +++ for i in range(self.num_experts_per_tok): +++ expert_id = flat_expert_indices[i].item() +++ weight = flat_expert_weights[i].item() +++ expert = self.experts[expert_id] +++ expert_out = expert(x) +++ expert_cache += expert_out * weight +++ return expert_cache ++ ++ @no_grad() ++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++- # expert_cache = torch.zeros_like(x) ++- # idxs = flat_expert_indices.argsort() ++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++- # token_idxs = idxs // self.num_experts_per_tok ++- # for i, end_idx in enumerate(tokens_per_expert): ++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++- # if start_idx == end_idx: ++- # continue ++- # expert = self.experts[i] ++- # exp_token_idx = token_idxs[start_idx:end_idx] ++- # expert_tokens = x[exp_token_idx] ++- # expert_out = expert(expert_tokens) ++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++- # return expert_cache +++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ expert_cache = ops.zeros_like(x) ++ idxs = flat_expert_indices.argsort() ++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++ token_idxs = idxs // self.num_experts_per_tok +++ ++ for i, end_idx in enumerate(tokens_per_expert): ++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++ if start_idx == end_idx: ++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++ expert_out = expert(expert_tokens) ++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ ++ return expert_cache +++ +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # # expert_cache = torch.zeros_like(x) +++ # # idxs = flat_expert_indices.argsort() +++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++ # # token_idxs = idxs // self.num_experts_per_tok +++ # # for i, end_idx in enumerate(tokens_per_expert): +++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++ # # if start_idx == end_idx: +++ # # continue +++ # # expert = self.experts[i] +++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++ # # expert_tokens = x[exp_token_idx] +++ # # expert_out = expert(expert_tokens) +++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++ # # return expert_cache +++ # expert_cache = ops.zeros_like(x) +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # for i, end_idx in enumerate(tokens_per_expert): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # if start_idx == end_idx: +++ # continue +++ # expert = self.experts[i] +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = expert(expert_tokens) +++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ +++ # return expert_cache +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # expert_cache = ops.zeros_like(x) +++ +++ # # 排序保证顺序一致 +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # # 找出有 token 的专家 +++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++ +++ # for i in active_experts.tolist(): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # end_idx = tokens_per_expert[i] +++ # if start_idx == end_idx: # 没有 token +++ # continue +++ +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = self.experts[i](expert_tokens) +++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++ +++ # expert_cache = mindspore.mint.scatter_add( +++ # expert_cache, +++ # 0, +++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++ # expert_out +++ # ) +++ +++ # return expert_cache +++ +++ ++ ++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++ # """ ++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ ++ # Initialize weights and apply final processing ++ self.post_init() +++ self.warm_up = False +++ +++ def warmup_moe_model_deep(self): +++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++ test_texts = [ +++ "warmup short", +++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++ ] +++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++ if tokenizer is None: +++ from mindnlp.transformers import AutoTokenizer +++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++ self._warmup_tokenizer = tokenizer +++ +++ for text in test_texts: +++ inputs = tokenizer(text, return_tensors="ms") +++ with mindspore._no_grad(): +++ _ = self(**inputs, use_cache=False) +++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++ ++ def get_input_embeddings(self): ++ return self.model.embed_tokens ++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++ ```""" +++ if not self.warm_up: +++ self.warm_up = True +++ self.warmup_moe_model_deep() +++ ++ output_attentions = ( ++ output_attentions ++ if output_attentions is not None ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index 3cbf820e..d4c6b651 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -18,7 +18,6 @@ ++ # See the License for the specific language governing permissions and ++ # limitations under the License. ++ """MindSpore Qwen2MoE model.""" ++- ++ import math ++ from typing import List, Optional, Tuple, Union ++ ++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++ TokenClassifierOutput, ++ ) ++ from ...modeling_utils import PreTrainedModel +++from ...generation import GenerationMixin ++ from ....utils import logging ++ from .configuration_qwen2_moe import Qwen2MoeConfig ++ ++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++ self.variance_epsilon = eps ++ ++ def forward(self, hidden_states): +++ # @dwj +++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++ # @lwx +++ # if not self.training : +++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++ input_dtype = hidden_states.dtype ++ hidden_states = hidden_states.to(mindspore.float32) ++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++@@ -234,6 +239,8 @@ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++ x1 = x[..., : x.shape[-1] // 2] ++ x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++ self.config = config ++ self.hidden_size = config.hidden_size ++ self.intermediate_size = intermediate_size +++ ++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++ self.act_fn = ACT2FN[config.hidden_act] ++ ++ def forward(self, x): ++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++- ++ +++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++ # @lwx +++ # gate_up_output = self.gate_up_proj(x) +++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++ # return self.down_proj(swiglu_output) +++ +++ # def forward(self, x): +++ # gate_proj_out = self.gate_proj(x) +++ # up_proj_out = self.up_proj(x) +++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++ # return self.down_proj(swiglu_out) +++ ++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++ """ ++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++ use_cache: bool = False, ++ cache_position: Optional[mindspore.Tensor] = None, ++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ +++ ++ bsz, q_len, _ = hidden_states.shape ++ ++ query_states = self.q_proj(hidden_states) ++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++ "with a layer index." ++ ) ++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ if isinstance(past_key_value, StaticCache): +++ kv_seq_len = key_states.shape[-2] +++ else: +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ if past_key_value is not None: ++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++ if isinstance(past_key_value, StaticCache): +++ kv_seq_len = key_states.shape[-2] ++ ++ # repeat k/v heads if n_kv_heads < n_heads ++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++- +++ ++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++ ++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++- raise ValueError( ++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++- f" {attn_weights.shape}" ++- ) ++- ++- if attention_mask is not None: # no matter the length, we just slice it ++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++ if attention_mask is not None: +++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++ attn_weights = attn_weights + causal_mask ++ ++ # upcast attention to fp32 ++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++ ++ attn_output = self.o_proj(attn_output) ++- +++ # @lwx +++ +++ # max_seq_len = self.max_position_embeddings # 2048 +++ +++ # if attention_mask is not None: +++ # # attention_mask: [B, 1, Sq, Sk] +++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++ +++ # # pad 到 [max_seq_len, max_seq_len] +++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++ # global_attention_mask = padded_mask +++ # else: +++ # global_attention_mask = None +++ +++ +++ # sparse_mode=3 +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, +++ # key=key_states, +++ # value=value_states, +++ # real_shift=None, +++ # padding_mask=None, +++ +++ # head_num=self.num_heads, +++ # attn_mask=global_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++ # input_layout="BNSD", +++ # pre_tokens=2147483647, +++ # next_tokens=2147483647, +++ # inner_precise=0, +++ # drop_mask=None, +++ # prefix=None, +++ # actual_seq_qlen=None, +++ # actual_seq_kvlen=None, +++ # sparse_mode=sparse_mode, +++ # ) ++ if not output_attentions: ++ attn_weights = None ++ ++ return attn_output, attn_weights, past_key_value ++ ++ +++class Qwen2MoeFlashAttention(nn.Module): +++ """ +++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++ +++ 关键改动: +++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++ 直接传入原始的 key 和 value 张量效率更高。 +++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++ """ +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++ super().__init__() +++ self.config = config +++ self.layer_idx = layer_idx +++ self.hidden_size = config.hidden_size +++ self.num_heads = config.num_attention_heads +++ self.head_dim = self.hidden_size // self.num_heads +++ self.num_key_value_heads = config.num_key_value_heads +++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++ self.max_position_embeddings = config.max_position_embeddings +++ self.rope_theta = config.rope_theta +++ self.attention_dropout = config.attention_dropout +++ +++ if (self.head_dim * self.num_heads) != self.hidden_size: +++ raise ValueError( +++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++ ) +++ +++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++ +++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++ self.head_dim, +++ max_position_embeddings=self.max_position_embeddings, +++ base=self.rope_theta, +++ ) +++ +++ def forward( +++ self, +++ hidden_states: mindspore.Tensor, +++ attention_mask: Optional[mindspore.Tensor] = None, +++ position_ids: Optional[mindspore.Tensor] = None, +++ past_key_value: Optional[Cache] = None, +++ output_attentions: bool = False, +++ use_cache: bool = False, +++ cache_position: Optional[mindspore.Tensor] = None, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ bsz, q_len, _ = hidden_states.shape +++ +++ # 1. 线性投射 Q, K, V +++ query_states = self.q_proj(hidden_states) +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++ # query: [B, S, H*D] -> [B, N1, S, D] +++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # 3. RoPE 旋转位置编码 +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++ if self.layer_idx is None: +++ raise ValueError( +++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ "with a layer index." +++ ) +++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++ if cache_position.shape[0] == 1: +++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++ kv_seq_len = past_seen_tokens + 1 +++ else: +++ # prefill 阶段:cache_position 是范围,使用其长度 +++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++ else: +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # 4. KV 缓存更新 +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ key_states, value_states = past_key_value.update( +++ key_states, value_states, self.layer_idx, cache_kwargs +++ ) +++ +++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++ if cache_position.shape[0] == 1: +++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++ kv_seq_len = key_states.shape[-2] +++ +++ # 5. [重要] 准备 Attention Mask +++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++ fa_attention_mask = None +++ if attention_mask is not None: +++ # 截取与当前key长度匹配的部分 +++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++ fa_attention_mask = (mask_slice != 0) +++ +++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++ input_dtype = query_states.dtype +++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++ query_states = query_states.to(mindspore.float16) +++ key_states = key_states.to(mindspore.float16) +++ value_states = value_states.to(mindspore.float16) +++ +++ # 6. [核心] 调用 flash_attention_score 算子 +++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++ attn_output = mindspore.ops.flash_attention_score( +++ query=query_states, +++ key=key_states, +++ value=value_states, +++ head_num=self.num_heads, # 传入Q的头数(N1) +++ attn_mask=fa_attention_mask, +++ keep_prob=1.0 - self.attention_dropout, +++ scalar_value=1.0 / math.sqrt(self.head_dim), +++ input_layout="BNSD", +++ sparse_mode=0 # 使用 defaultMask 模式 +++ ) +++ +++ # 恢复原始数据类型 +++ attn_output = attn_output.to(input_dtype) +++ +++ # 7. 调整输出形状 +++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ attn_output = self.o_proj(attn_output) +++ +++ # FlashAttention 算子不直接返回注意力权重矩阵 +++ attn_weights = None +++ if output_attentions: +++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++ return attn_output, attn_weights, past_key_value +++ +++ # def forward( +++ # self, +++ # hidden_states: mindspore.Tensor, +++ # attention_mask: Optional[mindspore.Tensor] = None, +++ # position_ids: Optional[mindspore.Tensor] = None, +++ # past_key_value: Optional[Cache] = None, +++ # output_attentions: bool = False, +++ # use_cache: bool = False, +++ # cache_position: Optional[mindspore.Tensor] = None, +++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ # bsz, q_len, _ = hidden_states.shape +++ +++ # # 1. 线性投射 Q, K, V +++ # query_states = self.q_proj(hidden_states) +++ # key_states = self.k_proj(hidden_states) +++ # value_states = self.v_proj(hidden_states) +++ +++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # # 3. RoPE 旋转位置编码 +++ # kv_seq_len = key_states.shape[-2] +++ # if past_key_value is not None: +++ # if self.layer_idx is None: +++ # raise ValueError( +++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ # "with a layer index." +++ # ) +++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # # 4. KV 缓存更新 +++ # if past_key_value is not None: +++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ # key_states, value_states = past_key_value.update( +++ # key_states, value_states, self.layer_idx, cache_kwargs +++ # ) +++ +++ # # 5. 准备 Attention Mask +++ # fa_attention_mask = None +++ # if attention_mask is not None: +++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # fa_attention_mask = (mask_slice != 0) +++ +++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++ # input_dtype = query_states.dtype +++ +++ # # 6. [核心] 调用 flash_attention_score 算子 +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, +++ # key=key_states, +++ # value=value_states, +++ # head_num=self.num_heads, +++ # attn_mask=fa_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++ # input_layout="BNSD", +++ # sparse_mode=0, +++ # # <--- 修改点 2: 启用内部高精度计算 --- +++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++ # inner_precise=1 +++ # ) +++ +++ # # 恢复原始数据类型 +++ # attn_output = attn_output.to(input_dtype) +++ +++ # # 7. 调整输出形状 +++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ # attn_output = self.o_proj(attn_output) +++ +++ # attn_weights = None +++ # if output_attentions: +++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++ # return attn_output, attn_weights, past_key_value +++ +++ # def forward( +++ # self, +++ # hidden_states: mindspore.Tensor, +++ # attention_mask: Optional[mindspore.Tensor] = None, +++ # position_ids: Optional[mindspore.Tensor] = None, +++ # past_key_value: Optional[Cache] = None, +++ # output_attentions: bool = False, +++ # use_cache: bool = False, +++ # cache_position: Optional[mindspore.Tensor] = None, +++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ # bsz, q_len, _ = hidden_states.shape +++ +++ # query_states = self.q_proj(hidden_states) +++ # key_states = self.k_proj(hidden_states) +++ # value_states = self.v_proj(hidden_states) +++ +++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ # kv_seq_len = key_states.shape[-2] +++ # if past_key_value is not None: +++ # if self.layer_idx is None: +++ # raise ValueError("`layer_idx` must be specified for caching") +++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ # if past_key_value is not None: +++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ # key_states, value_states = past_key_value.update( +++ # key_states, value_states, self.layer_idx, cache_kwargs +++ # ) +++ +++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++ +++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++ # query_states = query_states / math.sqrt(self.head_dim) +++ # # <--- 修改结束 --- +++ +++ # fa_attention_mask = None +++ # if attention_mask is not None: +++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ # fa_attention_mask = (mask_slice != 0) +++ +++ # input_dtype = query_states.dtype +++ +++ # attn_output = mindspore.ops.flash_attention_score( +++ # query=query_states, # 传入已经预先缩放过的 query +++ # key=key_states, +++ # value=value_states, +++ # head_num=self.num_heads, +++ # attn_mask=fa_attention_mask, +++ # keep_prob=1.0 - self.attention_dropout, +++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++ # input_layout="BNSD", +++ # sparse_mode=0, +++ # inner_precise=1 # 仍然保持内部高精度计算 +++ # ) +++ +++ # attn_output = attn_output.to(input_dtype) +++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ # attn_output = self.o_proj(attn_output) +++ +++ # attn_weights = None +++ # if output_attentions: +++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++ +++ # return attn_output, attn_weights, past_key_value +++ ++ QWEN2MOE_ATTENTION_CLASSES = { ++ "eager": Qwen2MoeAttention, +++ "flash-attention": Qwen2MoeFlashAttention, ++ } ++ ++ ++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ +++ #@dwj +++ # 只遍历激活的专家,而非全部专家 ++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++- hidden_states = hidden_states.view(-1, hidden_dim) ++- # router_logits: (batch * sequence_length, n_experts) ++- router_logits = self.gate(hidden_states) ++- ++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- if self.norm_topk_prob: ++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- # we cast back to the input dtype ++- routing_weights = routing_weights.to(hidden_states.dtype) ++- ++- final_hidden_states = ops.zeros( ++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++- ) ++- ++- # One hot encode the selected experts to create an expert mask ++- # this will be used to easily index which expert is going to be sollicitated ++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++- ++- # Loop over all available experts in the model and perform the computation on each expert ++- for expert_idx in range(self.num_experts): ++- expert_layer = self.experts[expert_idx] ++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++- ++- # Index the correct hidden states and compute the expert hidden state for ++- # the current expert. We need to make sure to multiply the output hidden ++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++- if 0 not in idx.shape: ++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++- ++- # However `index_add_` only support torch tensors for indexing so we'll use ++- # the `top_x` tensor here. ++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++- ++- shared_expert_output = self.shared_expert(hidden_states) ++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++- ++- final_hidden_states = final_hidden_states + shared_expert_output +++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++ num_tokens = hidden_states_reshaped.shape[0] +++ +++ router_logits = self.gate(hidden_states_reshaped) +++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++ if self.norm_topk_prob: +++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ routing_weights = routing_weights.to(hidden_states.dtype) +++ +++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++ flat_selected_experts = selected_experts.flatten() +++ +++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++ token_indices = broadcasted_token_indices.flatten() +++ +++ active_experts = ops.unique(flat_selected_experts) +++ +++ for expert_idx_tensor in active_experts: +++ expert_idx = expert_idx_tensor.item() +++ expert_layer = self.experts[expert_idx] +++ +++ mask = (flat_selected_experts == expert_idx_tensor) +++ selected_token_indices = token_indices[mask] +++ selected_routing_weights = routing_weights.flatten()[mask] +++ +++ current_states = hidden_states_reshaped[selected_token_indices] +++ +++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++ +++ final_hidden_states = final_hidden_states.index_add( +++ dim=0, +++ index=selected_token_indices, +++ source=expert_output.to(hidden_states.dtype) +++ ) +++ +++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++ ++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++- return final_hidden_states, router_logits +++ final_hidden_states = final_hidden_states + shared_expert_output +++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++ +++ return final_hidden_states, router_logits ++ ++ ++ class Qwen2MoeDecoderLayer(nn.Module): ++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++ ++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++ ++ if (layer_idx not in config.mlp_only_layers) and ( ++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++ ): ++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++ _skip_keys_device_placement = "past_key_values" ++ _supports_cache_class = True +++#lwx +++ # _supports_static_cache = True ++ ++ def _init_weights(self, module): ++ std = self.config.initializer_range ++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++ return causal_mask ++ ++ ++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ _tied_weights_keys = ["lm_head.weight"] ++ ++ def __init__(self, config): ++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ self.num_experts_per_tok = config.num_experts_per_tok ++ # Initialize weights and apply final processing ++ self.post_init() +++ # @lwx +++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++ # self.generation_config.cache_implementation = "static" +++ self._warmed_up = False +++ +++ def warmup_moe_model(self): +++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++ test_texts = [ +++ "warmup short", +++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++ ] +++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++ if tokenizer is None: +++ from mindnlp.transformers import AutoTokenizer +++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++ self._warmup_tokenizer = tokenizer +++ +++ for text in test_texts: +++ inputs = tokenizer(text, return_tensors="ms") +++ with mindspore._no_grad(): +++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++ ++ def get_input_embeddings(self): ++ return self.model.embed_tokens ++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++ ```""" +++ if not self._warmed_up: +++ self._warmed_up = True +++ self.warmup_moe_model() ++ ++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++ output_router_logits = ( ++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++ } ++ ) ++ return model_inputs +++# @lwx +++ # def _decode_one_tokens_logits( +++ # self, +++ # cur_token: mindspore.Tensor, +++ # input_pos: Optional[mindspore.Tensor], +++ # cache_position: mindspore.Tensor, +++ # past_key_values: StaticCache, +++ # ) -> mindspore.Tensor: +++ # """ +++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++ +++ # Args: +++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++ # input_pos: 输入位置信息,可选 +++ # cache_position: 当前token在cache中的位置,shape为(1,) +++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++ +++ # Returns: +++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++ # """ +++ # # 调用JIT编译的版本 +++ # return self.get_decode_one_tokens_logits( +++ # cur_token=cur_token, +++ # input_pos=input_pos, +++ # cache_position=cache_position, +++ # past_key_values=past_key_values, +++ # ) +++ +++ # @mindspore.jit(jit_level='O1') +++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++ # """ +++ # JIT编译的函数,用于高效的单token解码 +++ # 使用JIT编译优化以支持静态shape和高效执行 +++ +++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++ # """ +++ # outputs = self.model.forward( +++ # input_ids=cur_token, +++ # position_ids=input_pos, +++ # cache_position=cache_position, +++ # past_key_values=past_key_values, +++ # use_cache=True, +++ # return_dict=False, +++ # ) +++ +++ # hidden_states = outputs[0] +++ # logits = self.lm_head.forward(hidden_states) +++ # logits = logits.float() +++ +++ # return logits[:, -1, :] +++ +++ # def _sample( +++ # self, +++ # input_ids: mindspore.Tensor, +++ # logits_processor, +++ # stopping_criteria, +++ # generation_config, +++ # synced_devices: bool, +++ # streamer=None, +++ # logits_warper=None, +++ # **model_kwargs, +++ # ): +++ # """ +++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++ # """ +++ # from ...generation.logits_process import LogitsProcessorList +++ # from ...generation.stopping_criteria import StoppingCriteriaList +++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++ # from mindnlp.core import nn, ops, no_grad +++ # import numpy as np +++ +++ # # 检查是否使用 StaticCache +++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++ # # 否则,直接调用父类方法 +++ # past_key_values = model_kwargs.get("past_key_values") +++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++ +++ # if not isinstance(past_key_values, StaticCache): +++ # # 不使用 StaticCache,直接调用父类方法 +++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++ # return super()._sample( +++ # input_ids=input_ids, +++ # logits_processor=logits_processor, +++ # stopping_criteria=stopping_criteria, +++ # generation_config=generation_config, +++ # synced_devices=synced_devices, +++ # streamer=streamer, +++ # logits_warper=logits_warper, +++ # **model_kwargs, +++ # ) +++ +++ # # 使用 StaticCache,进入自定义循环 +++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++ # pad_token_id = generation_config._pad_token_tensor +++ # output_attentions = generation_config.output_attentions +++ # output_hidden_states = generation_config.output_hidden_states +++ # output_scores = generation_config.output_scores +++ # output_logits = generation_config.output_logits +++ # return_dict_in_generate = generation_config.return_dict_in_generate +++ # max_length = generation_config.max_length +++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++ # do_sample = generation_config.do_sample +++ +++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++ # raise ValueError( +++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++ # f"{logits_warper})." +++ # ) +++ +++ # # init attention / hidden states / scores tuples +++ # scores = () if (return_dict_in_generate and output_scores) else None +++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++ +++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++ # encoder_hidden_states = ( +++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++ # ) +++ +++ # # keep track of which sequences are already finished +++ # batch_size, cur_len = input_ids.shape +++ # this_peer_finished = False +++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++ +++ # time_record = [] +++ # from ....utils.testing_utils import parse_flag_from_env +++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++ +++ # while self._has_unfinished_sequences( +++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++ # ): +++ # if _record_time: +++ # import time as time_module +++ # infer_start = time_module.time() +++ +++ # # prepare model inputs +++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++ +++ # # prepare variable output controls +++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++ +++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++ # cur_cache_position = model_inputs.get("cache_position") +++ # cur_past_key_values = model_inputs.get("past_key_values") +++ # cur_input_ids = model_inputs.get("input_ids") +++ +++ # if (isinstance(cur_past_key_values, StaticCache) and +++ # cur_cache_position is not None and +++ # len(cur_cache_position.shape) > 0 and +++ # cur_cache_position.shape[0] == 1 and +++ # cur_input_ids is not None and +++ # cur_input_ids.shape[1] == 1): +++ # # 使用 JIT 优化的单 token 解码 +++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++ # if not hasattr(self, '_jit_used'): +++ # self._jit_used = False +++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++ +++ # next_token_logits = self.get_decode_one_tokens_logits( +++ # cur_token=cur_input_ids, +++ # input_pos=model_inputs.get("position_ids"), +++ # cache_position=cur_cache_position, +++ # past_key_values=cur_past_key_values, +++ # ) +++ +++ # # 标记已使用JIT(用于后续判断) +++ # if not self._jit_used: +++ # self._jit_used = True +++ +++ # # 构造兼容的输出对象 +++ # class JitOptimizedOutput: +++ # def __init__(self, logits, config): +++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++ # self.config = config +++ # # 对于 JIT 优化路径,这些属性通常不需要 +++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++ # self.attentions = None if not config.is_encoder_decoder else None +++ # self.cross_attentions = None +++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++ # self.hidden_states = None if not config.is_encoder_decoder else None +++ +++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++ # else: +++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++ # outputs = self(**model_inputs, return_dict=True) +++ +++ # if synced_devices and this_peer_finished: +++ # continue +++ +++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++ # next_token_logits = outputs.logits[:, -1, :] +++ +++ # # pre-process distribution +++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++ # if do_sample: +++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++ +++ # # Store scores, attentions and hidden_states when required +++ # if return_dict_in_generate: +++ # if output_scores: +++ # scores += (next_token_scores,) +++ # if output_logits: +++ # raw_logits += (next_token_logits,) +++ # if output_attentions: +++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++ # decoder_attentions += (attn,) if attn is not None else (None,) +++ # if self.config.is_encoder_decoder: +++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++ +++ # if output_hidden_states: +++ # hidden = ( +++ # outputs.decoder_hidden_states +++ # if self.config.is_encoder_decoder +++ # else outputs.hidden_states +++ # ) +++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++ +++ # # token selection +++ # if do_sample: +++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++ # else: +++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++ +++ # # finished sentences should have their next token be a padding token +++ # if has_eos_stopping_criteria: +++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++ +++ # # update generated ids, model inputs, and length for next step +++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++ # if streamer is not None: +++ # streamer.put(next_tokens) +++ +++ # model_kwargs = self._update_model_kwargs_for_generation( +++ # outputs, +++ # model_kwargs, +++ # is_encoder_decoder=self.config.is_encoder_decoder, +++ # ) +++ +++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++ # cur_len += 1 +++ +++ # if _record_time: +++ # import time as time_module +++ # infer_stop = time_module.time() +++ # time_record.append(infer_stop - infer_start) +++ +++ # del outputs +++ +++ # average_infer_time = None +++ # if time_record: +++ # if len(time_record) > 1: +++ # time_record.pop(0) +++ # average_infer_time = sum(time_record) / len(time_record) +++ # print(f'average inference time is: {average_infer_time}') +++ # print(f'inference time record: {time_record}') +++ +++ # if streamer is not None: +++ # streamer.end() +++ +++ # # 简单判断:打印是否使用了JIT路径 +++ # if hasattr(self, '_jit_used') and self._jit_used: +++ # print("[JIT] ✓ JIT optimization was used during generation") +++ # else: +++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++ +++ # if return_dict_in_generate: +++ # if self.config.is_encoder_decoder: +++ # return GenerateEncoderDecoderOutput( +++ # sequences=input_ids, +++ # scores=scores, +++ # logits=raw_logits, +++ # encoder_attentions=encoder_attentions, +++ # encoder_hidden_states=encoder_hidden_states, +++ # decoder_attentions=decoder_attentions, +++ # cross_attentions=cross_attentions, +++ # decoder_hidden_states=decoder_hidden_states, +++ # past_key_values=model_kwargs.get("past_key_values"), +++ # average_infer_time=average_infer_time +++ # ) +++ # else: +++ # return GenerateDecoderOnlyOutput( +++ # sequences=input_ids, +++ # scores=scores, +++ # logits=raw_logits, +++ # attentions=decoder_attentions, +++ # hidden_states=decoder_hidden_states, +++ # past_key_values=model_kwargs.get("past_key_values"), +++ # average_infer_time=average_infer_time +++ # ) +++ # else: +++ # return input_ids +++ +++ # def _prepare_cache_for_generation( +++ # self, +++ # generation_config, +++ # model_kwargs, +++ # assistant_model, +++ # batch_size, +++ # max_cache_length, +++ # ): +++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++ # generation_config.cache_implementation = "static" +++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++ +++ # if generation_config.cache_implementation == "static": +++ # base_required_from_max_length = generation_config.max_length + 1 +++ # base_required = max(max_cache_length, base_required_from_max_length) +++ # min_cache_size = 50 +++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++ # else: +++ # max_cache_length = max(base_required, min_cache_size) +++ +++ # original_max_cache_length = max_cache_length +++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++ # print(f" - input max_cache_length: {original_max_cache_length}") +++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++ # print(f" - final max_cache_length: {max_cache_length}") +++ +++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++ # if max_cache_length > self.config.max_position_embeddings: +++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++ +++ # result = super()._prepare_cache_for_generation( +++ # generation_config=generation_config, +++ # model_kwargs=model_kwargs, +++ # assistant_model=assistant_model, +++ # batch_size=batch_size, +++ # max_cache_length=max_cache_length, +++ # ) +++ +++ # if generation_config.cache_implementation == "static": +++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++ # created_cache = model_kwargs.get(cache_name) +++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++ # if created_cache.max_cache_len < generation_config.max_length: +++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++ +++ # return result +++ +++ +++ ++ ++ ++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++-- ++2.27.0 ++ +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +new file mode 100644 +index 00000000..22b65dd5 +--- /dev/null ++++ b/patches/0002-20251106commit.patch +@@ -0,0 +1,3200 @@ ++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Thu, 6 Nov 2025 09:20:38 +0800 ++Subject: [PATCH 2/3] 20251106commit ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- ++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ ++ 3 files changed, 2689 insertions(+), 305 deletions(-) ++ create mode 100644 patches/0001-20251104commit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index d8303e45..73773c22 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): ++ # y = y + self.shared_experts(identity) ++ # return y ++ +++ # @no_grad() +++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ +++ # expert_cache = ops.zeros_like(x) +++ # for i in range(self.num_experts_per_tok): +++ # expert_id = flat_expert_indices[i].item() +++ # weight = flat_expert_weights[i].item() +++ # expert = self.experts[expert_id] +++ # expert_out = expert(x) +++ # expert_cache += expert_out * weight +++ # return expert_cache +++ ++ @no_grad() ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ # x 的 shape: (1, hidden_size) +++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++ +++ # 1. 收集所有需要的专家层 +++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++ selected_experts = [self.experts[i] for i in flat_expert_indices] +++ +++ # 2. 并行计算所有专家的输出 +++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++ # ops.cat 会将它们堆叠成一个新的 Tensor +++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++ +++ # 3. 使用矩阵乘法进行加权求和 +++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ # 最终结果 final_output 的 shape: (1, hidden_size) +++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++ +++ return final_output ++ ++- expert_cache = ops.zeros_like(x) ++- for i in range(self.num_experts_per_tok): ++- expert_id = flat_expert_indices[i].item() ++- weight = flat_expert_weights[i].item() ++- expert = self.experts[expert_id] ++- expert_out = expert(x) ++- expert_cache += expert_out * weight ++- return expert_cache ++ ++ @no_grad() ++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): ++ key_states = self.k_proj(hidden_states) ++ value_states = self.v_proj(hidden_states) ++ ++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++ # @lwx +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++ ++ kv_seq_len = key_states.shape[-2] ++ if past_key_value is not None: ++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): ++ return attn_output, attn_weights, past_key_value ++ ++ +++# class DeepseekFlashAttention(nn.Module): +++# """ +++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +++ +++# This class is designed as a drop-in replacement for DeepseekAttention. +++# """ +++ +++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++# super().__init__() +++# self.config = config +++# self.layer_idx = layer_idx +++# if layer_idx is None: +++# logger.warning( +++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++# "when creating this class." +++# ) +++ +++# self.attention_dropout = config.attention_dropout +++# self.hidden_size = config.hidden_size +++# self.num_heads = config.num_attention_heads +++# self.head_dim = self.hidden_size // self.num_heads +++# self.num_key_value_heads = config.num_key_value_heads +++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++# self.max_position_embeddings = config.max_position_embeddings +++# self.rope_theta = config.rope_theta +++# self.is_causal = True +++ +++# if (self.head_dim * self.num_heads) != self.hidden_size: +++# raise ValueError( +++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++# f" and `num_heads`: {self.num_heads})." +++# ) +++ +++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++# self._init_rope() +++ +++# def _init_rope(self): +++# if self.config.rope_scaling is None: +++# self.rotary_emb = DeepseekRotaryEmbedding( +++# self.head_dim, +++# max_position_embeddings=self.max_position_embeddings, +++# base=self.rope_theta, +++# ) +++# else: +++# scaling_type = self.config.rope_scaling["type"] +++# scaling_factor = self.config.rope_scaling["factor"] +++# if scaling_type == "linear": +++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++# self.head_dim, +++# max_position_embeddings=self.max_position_embeddings, +++# scaling_factor=scaling_factor, +++# base=self.rope_theta, +++# ) +++# elif scaling_type == "dynamic": +++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++# self.head_dim, +++# max_position_embeddings=self.max_position_embeddings, +++# scaling_factor=scaling_factor, +++# base=self.rope_theta, +++# ) +++# else: +++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++ +++# def forward( +++# self, +++# hidden_states: mindspore.Tensor, +++# attention_mask: Optional[mindspore.Tensor] = None, +++# position_ids: Optional[mindspore.Tensor] = None, +++# past_key_value: Optional[Cache] = None, +++# output_attentions: bool = False, +++# use_cache: bool = False, +++# **kwargs, +++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++# if "padding_mask" in kwargs: +++# warnings.warn( +++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++# ) +++ +++# if output_attentions: +++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +++ +++# bsz, q_len, _ = hidden_states.shape +++ +++# if self.config.pretraining_tp > 1: +++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++ +++# query_states = self.q_proj(hidden_states) +++# key_states = self.k_proj(hidden_states) +++# value_states = self.v_proj(hidden_states) +++ +++# # Reshape for multi-head attention +++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++# kv_seq_len = key_states.shape[-2] +++# if past_key_value is not None: +++# if self.layer_idx is None: +++# raise ValueError( +++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++# "with a layer index." +++# ) +++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++# # Apply Rotary Positional Embedding +++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++# if past_key_value is not None: +++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ +++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++ +++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++ +++# # Convert attention_mask for flash_attention_score +++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +++# if attention_mask is not None: +++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++# raise ValueError( +++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++# ) +++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +++# else: +++# attn_mask_for_fa = None +++ +++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++ +++# # Call the fused flash_attention_score operator +++# attn_output = mindspore.ops.flash_attention_score( +++# query=query_states_for_fa, +++# key=key_states_for_fa, +++# value=value_states_for_fa, +++# head_num=self.num_heads, # This is N1, the number of query heads +++# input_layout='BSH', +++# attn_mask=attn_mask_for_fa, +++# keep_prob=keep_prob, +++# scalar_value=1.0 / math.sqrt(self.head_dim), +++# sparse_mode=0 # Default mask mode +++# ) +++ +++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +++# attn_output = self.o_proj(attn_output) +++ +++# # Flash Attention does not return attention weights +++# attn_weights = None +++ +++# return attn_output, attn_weights, past_key_value +++ +++class DeepseekFlashAttention(nn.Module): +++ """ +++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +++ This implementation is a drop-in replacement for the original DeepseekAttention class, +++ designed for high performance on supported hardware (Ascend). +++ +++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +++ """ +++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++ super().__init__() +++ self.config = config +++ self.layer_idx = layer_idx +++ if layer_idx is None: +++ logger.warning( +++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++ "when creating this class." +++ ) +++ +++ # --- [FIX] Correctly initialize all required attributes --- +++ self.attention_dropout = config.attention_dropout +++ self.hidden_size = config.hidden_size +++ self.num_heads = config.num_attention_heads +++ self.head_dim = self.hidden_size // self.num_heads +++ self.num_key_value_heads = config.num_key_value_heads +++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++ self.max_position_embeddings = config.max_position_embeddings +++ self.rope_theta = config.rope_theta +++ self.is_causal = True +++ +++ if (self.head_dim * self.num_heads) != self.hidden_size: +++ raise ValueError( +++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++ f" and `num_heads`: {self.num_heads})." +++ ) +++ +++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++ +++ # This call will now succeed as all attributes are initialized. +++ self._init_rope() +++ +++ def _init_rope(self): +++ if self.config.rope_scaling is None: +++ self.rotary_emb = DeepseekRotaryEmbedding( +++ self.head_dim, +++ max_position_embeddings=self.max_position_embeddings, +++ base=self.rope_theta, +++ ) +++ else: +++ scaling_type = self.config.rope_scaling["type"] +++ scaling_factor = self.config.rope_scaling["factor"] +++ if scaling_type == "linear": +++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++ self.head_dim, +++ max_position_embeddings=self.max_position_embeddings, +++ scaling_factor=scaling_factor, +++ base=self.rope_theta, +++ ) +++ elif scaling_type == "dynamic": +++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++ self.head_dim, +++ max_position_embeddings=self.max_position_embeddings, +++ scaling_factor=scaling_factor, +++ base=self.rope_theta, +++ ) +++ else: +++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++ +++ def forward( +++ self, +++ hidden_states: mindspore.Tensor, +++ attention_mask: Optional[mindspore.Tensor] = None, +++ position_ids: Optional[mindspore.Tensor] = None, +++ past_key_value: Optional[Cache] = None, +++ output_attentions: bool = False, +++ use_cache: bool = False, +++ **kwargs, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ if "padding_mask" in kwargs: +++ warnings.warn( +++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++ ) +++ if output_attentions: +++ warnings.warn( +++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +++ ) +++ +++ bsz, q_len, _ = hidden_states.shape +++ +++ if self.config.pretraining_tp > 1: +++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++ +++ query_states = self.q_proj(hidden_states) +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++ # Apply Rotary Position Embedding +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos} +++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +++ # So we must explicitly repeat the KV heads. +++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++ +++ # Convert attention mask for flash_attention_score +++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +++ if attention_mask is not None: +++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++ raise ValueError( +++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++ ) +++ attn_mask_for_fa = attention_mask < 0 +++ else: +++ attn_mask_for_fa = None +++ +++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++ +++ # Call the fused operator using the efficient BNSD layout +++ attn_output = mindspore.ops.flash_attention_score( +++ query=query_states, +++ key=key_states, +++ value=value_states, +++ head_num=self.num_heads, +++ input_layout='BNSD', # Specify the correct layout +++ attn_mask=attn_mask_for_fa, +++ keep_prob=keep_prob, +++ scalar_value=1.0 / math.sqrt(self.head_dim) +++ ) +++ +++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ +++ # Apply output projection +++ attn_output = self.o_proj(attn_output) +++ +++ # Flash attention does not return attention weights, so we return None. +++ attn_weights = None +++ +++ return attn_output, attn_weights, past_key_value +++ ++ Deepseek_ATTENTION_CLASSES = { ++ "eager": DeepseekAttention, +++ "flash-attention": DeepseekFlashAttention, ++ } ++ ++ ++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): ++ config=config, layer_idx=layer_idx ++ ) ++ +++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +++ config=config, layer_idx=layer_idx +++ ) +++ ++ self.mlp = ( ++ DeepseekMoE(config) ++ if ( ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index d4c6b651..bced285c 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union ++ ++ import mindspore ++ import mindnlp.core.nn.functional as F ++-from mindnlp.core import nn, ops +++from mindnlp.core import nn, ops, no_grad ++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss ++ ++ from ....common.activations import ACT2FN ++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) ++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++ +++Long_Prompt = False +++PROMPT_LENGTH_THRESHOLD = 128 ++ ++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++ def _prepare_4d_causal_attention_mask_with_cache_position( ++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): ++ return attn_output, attn_weights, past_key_value ++ ++ +++# class Qwen2MoeFlashAttention(nn.Module): +++# """ +++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++ +++# 关键改动: +++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++# 直接传入原始的 key 和 value 张量效率更高。 +++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++# super().__init__() +++# self.config = config +++# self.layer_idx = layer_idx +++# self.hidden_size = config.hidden_size +++# self.num_heads = config.num_attention_heads +++# self.head_dim = self.hidden_size // self.num_heads +++# self.num_key_value_heads = config.num_key_value_heads +++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++# self.max_position_embeddings = config.max_position_embeddings +++# self.rope_theta = config.rope_theta +++# self.attention_dropout = config.attention_dropout +++ +++# if (self.head_dim * self.num_heads) != self.hidden_size: +++# raise ValueError( +++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++# ) +++ +++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++ +++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++# self.head_dim, +++# max_position_embeddings=self.max_position_embeddings, +++# base=self.rope_theta, +++# ) +++ +++# def forward( +++# self, +++# hidden_states: mindspore.Tensor, +++# attention_mask: Optional[mindspore.Tensor] = None, +++# position_ids: Optional[mindspore.Tensor] = None, +++# past_key_value: Optional[Cache] = None, +++# output_attentions: bool = False, +++# use_cache: bool = False, +++# cache_position: Optional[mindspore.Tensor] = None, +++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++# bsz, q_len, _ = hidden_states.shape +++ +++# # 1. 线性投射 Q, K, V +++# query_states = self.q_proj(hidden_states) +++# key_states = self.k_proj(hidden_states) +++# value_states = self.v_proj(hidden_states) +++ +++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++# # query: [B, S, H*D] -> [B, N1, S, D] +++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++# # 3. RoPE 旋转位置编码 +++# kv_seq_len = key_states.shape[-2] +++# if past_key_value is not None: +++# if self.layer_idx is None: +++# raise ValueError( +++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++# "with a layer index." +++# ) +++# # 对于 StaticCache,需要特殊处理 kv_seq_len +++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++# if cache_position.shape[0] == 1: +++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++# kv_seq_len = past_seen_tokens + 1 +++# else: +++# # prefill 阶段:cache_position 是范围,使用其长度 +++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +++# else: +++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++# # 4. KV 缓存更新 +++# if past_key_value is not None: +++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++# key_states, value_states = past_key_value.update( +++# key_states, value_states, self.layer_idx, cache_kwargs +++# ) +++ +++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++# if cache_position.shape[0] == 1: +++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++# kv_seq_len = key_states.shape[-2] +++ +++# # 5. [重要] 准备 Attention Mask +++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++# fa_attention_mask = None +++# if attention_mask is not None: +++# # 截取与当前key长度匹配的部分 +++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++# # 转换为布尔类型: 大负数 -> True, 0 -> False +++# fa_attention_mask = (mask_slice != 0) +++ +++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++# input_dtype = query_states.dtype +++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++# query_states = query_states.to(mindspore.float16) +++# key_states = key_states.to(mindspore.float16) +++# value_states = value_states.to(mindspore.float16) +++ +++# # 6. [核心] 调用 flash_attention_score 算子 +++# # - 无需手动 repeat_kv, 算子原生支持 GQA +++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++# attn_output = mindspore.ops.flash_attention_score( +++# query=query_states, +++# key=key_states, +++# value=value_states, +++# head_num=self.num_heads, # 传入Q的头数(N1) +++# attn_mask=fa_attention_mask, +++# keep_prob=1.0 - self.attention_dropout, +++# scalar_value=1.0 / math.sqrt(self.head_dim), +++# input_layout="BNSD", +++# sparse_mode=0 # 使用 defaultMask 模式 +++# ) +++ +++# # 恢复原始数据类型 +++# attn_output = attn_output.to(input_dtype) +++ +++# # 7. 调整输出形状 +++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++# attn_output = self.o_proj(attn_output) +++ +++# # FlashAttention 算子不直接返回注意力权重矩阵 +++# attn_weights = None +++# if output_attentions: +++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++# return attn_output, attn_weights, past_key_value +++ +++# # def forward( +++# # self, +++# # hidden_states: mindspore.Tensor, +++# # attention_mask: Optional[mindspore.Tensor] = None, +++# # position_ids: Optional[mindspore.Tensor] = None, +++# # past_key_value: Optional[Cache] = None, +++# # output_attentions: bool = False, +++# # use_cache: bool = False, +++# # cache_position: Optional[mindspore.Tensor] = None, +++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++# # bsz, q_len, _ = hidden_states.shape +++ +++# # # 1. 线性投射 Q, K, V +++# # query_states = self.q_proj(hidden_states) +++# # key_states = self.k_proj(hidden_states) +++# # value_states = self.v_proj(hidden_states) +++ +++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ +++# # # 3. RoPE 旋转位置编码 +++# # kv_seq_len = key_states.shape[-2] +++# # if past_key_value is not None: +++# # if self.layer_idx is None: +++# # raise ValueError( +++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++# # "with a layer index." +++# # ) +++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++# # # 4. KV 缓存更新 +++# # if past_key_value is not None: +++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++# # key_states, value_states = past_key_value.update( +++# # key_states, value_states, self.layer_idx, cache_kwargs +++# # ) +++ +++# # # 5. 准备 Attention Mask +++# # fa_attention_mask = None +++# # if attention_mask is not None: +++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++# # fa_attention_mask = (mask_slice != 0) +++ +++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++# # input_dtype = query_states.dtype +++ +++# # # 6. [核心] 调用 flash_attention_score 算子 +++# # attn_output = mindspore.ops.flash_attention_score( +++# # query=query_states, +++# # key=key_states, +++# # value=value_states, +++# # head_num=self.num_heads, +++# # attn_mask=fa_attention_mask, +++# # keep_prob=1.0 - self.attention_dropout, +++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++# # input_layout="BNSD", +++# # sparse_mode=0, +++# # # <--- 修改点 2: 启用内部高精度计算 --- +++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++# # inner_precise=1 +++# # ) +++ +++# # # 恢复原始数据类型 +++# # attn_output = attn_output.to(input_dtype) +++ +++# # # 7. 调整输出形状 +++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++# # attn_output = self.o_proj(attn_output) +++ +++# # attn_weights = None +++# # if output_attentions: +++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ +++# # return attn_output, attn_weights, past_key_value +++ +++ ++ class Qwen2MoeFlashAttention(nn.Module): ++ """ ++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++- ++- 关键改动: ++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++- 直接传入原始的 key 和 value 张量效率更高。 ++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +++ +++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +++ 完全使用模型的低精度数据类型(如 float16)进行计算, +++ 以达到理论上的最高执行速度。 ++ """ ++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++ super().__init__() ++ self.config = config ++ self.layer_idx = layer_idx +++ if layer_idx is None: +++ logger.warning_once( +++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +++ ) +++ ++ self.hidden_size = config.hidden_size ++ self.num_heads = config.num_attention_heads ++ self.head_dim = self.hidden_size // self.num_heads ++ self.num_key_value_heads = config.num_key_value_heads ++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++ self.max_position_embeddings = config.max_position_embeddings ++ self.rope_theta = config.rope_theta ++ self.attention_dropout = config.attention_dropout ++ ++- if (self.head_dim * self.num_heads) != self.hidden_size: ++- raise ValueError( ++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++- ) ++- ++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): ++ key_states = self.k_proj(hidden_states) ++ value_states = self.v_proj(hidden_states) ++ ++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++- # query: [B, S, H*D] -> [B, N1, S, D] ++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +++ # 2. 调整形状以匹配 BNSD 布局 ++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- ++- # 3. RoPE 旋转位置编码 +++ +++ # 3. RoPE 和 KV 缓存 ++ kv_seq_len = key_states.shape[-2] ++ if past_key_value is not None: ++- if self.layer_idx is None: ++- raise ValueError( ++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++- "with a layer index." ++- ) ++- # 对于 StaticCache,需要特殊处理 kv_seq_len ++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++- # 使用 cache_position 的长度来确定实际的 kv_seq_len ++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++- # 临时解决方案:使用 cache_position 的最大值(如果可能) ++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++- if cache_position.shape[0] == 1: ++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++- kv_seq_len = past_seen_tokens + 1 ++- else: ++- # prefill 阶段:cache_position 是范围,使用其长度 ++- kv_seq_len = cache_position.shape[0] + past_seen_tokens ++- else: ++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++- +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++- # 4. KV 缓存更新 ++ if past_key_value is not None: ++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++- key_states, value_states = past_key_value.update( ++- key_states, value_states, self.layer_idx, cache_kwargs ++- ) ++- ++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++- if cache_position.shape[0] == 1: ++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++- kv_seq_len = key_states.shape[-2] ++- ++- # 5. [重要] 准备 Attention Mask ++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++ # 4. 准备 Attention Mask ++ fa_attention_mask = None ++ if attention_mask is not None: ++- # 截取与当前key长度匹配的部分 ++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++- # 转换为布尔类型: 大负数 -> True, 0 -> False ++ fa_attention_mask = (mask_slice != 0) ++ ++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++- input_dtype = query_states.dtype ++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++- query_states = query_states.to(mindspore.float16) ++- key_states = key_states.to(mindspore.float16) ++- value_states = value_states.to(mindspore.float16) ++- ++- # 6. [核心] 调用 flash_attention_score 算子 ++- # - 无需手动 repeat_kv, 算子原生支持 GQA ++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 ++ attn_output = mindspore.ops.flash_attention_score( ++ query=query_states, ++ key=key_states, ++ value=value_states, ++- head_num=self.num_heads, # 传入Q的头数(N1) +++ head_num=self.num_heads, ++ attn_mask=fa_attention_mask, ++- keep_prob=1.0 - self.attention_dropout, +++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout ++ scalar_value=1.0 / math.sqrt(self.head_dim), ++ input_layout="BNSD", ++- sparse_mode=0 # 使用 defaultMask 模式 +++ sparse_mode=0, +++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 ++ ) ++ ++- # 恢复原始数据类型 ++- attn_output = attn_output.to(input_dtype) ++- ++- # 7. 调整输出形状 ++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++ # 6. 调整输出形状 ++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++ attn_output = self.o_proj(attn_output) ++ ++- # FlashAttention 算子不直接返回注意力权重矩阵 +++ # 7. 返回结果 ++ attn_weights = None ++ if output_attentions: ++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++ ++ return attn_output, attn_weights, past_key_value ++ ++- # def forward( ++- # self, ++- # hidden_states: mindspore.Tensor, ++- # attention_mask: Optional[mindspore.Tensor] = None, ++- # position_ids: Optional[mindspore.Tensor] = None, ++- # past_key_value: Optional[Cache] = None, ++- # output_attentions: bool = False, ++- # use_cache: bool = False, ++- # cache_position: Optional[mindspore.Tensor] = None, ++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++- ++- # bsz, q_len, _ = hidden_states.shape ++- ++- # # 1. 线性投射 Q, K, V ++- # query_states = self.q_proj(hidden_states) ++- # key_states = self.k_proj(hidden_states) ++- # value_states = self.v_proj(hidden_states) ++- ++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- ++- # # 3. RoPE 旋转位置编码 ++- # kv_seq_len = key_states.shape[-2] ++- # if past_key_value is not None: ++- # if self.layer_idx is None: ++- # raise ValueError( ++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++- # "with a layer index." ++- # ) ++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++ ++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++- ++- # # 4. KV 缓存更新 ++- # if past_key_value is not None: ++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++- # key_states, value_states = past_key_value.update( ++- # key_states, value_states, self.layer_idx, cache_kwargs ++- # ) ++- ++- # # 5. 准备 Attention Mask ++- # fa_attention_mask = None ++- # if attention_mask is not None: ++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++- # fa_attention_mask = (mask_slice != 0) ++- ++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++- # input_dtype = query_states.dtype ++- ++- # # 6. [核心] 调用 flash_attention_score 算子 ++- # attn_output = mindspore.ops.flash_attention_score( ++- # query=query_states, ++- # key=key_states, ++- # value=value_states, ++- # head_num=self.num_heads, ++- # attn_mask=fa_attention_mask, ++- # keep_prob=1.0 - self.attention_dropout, ++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++- # input_layout="BNSD", ++- # sparse_mode=0, ++- # # <--- 修改点 2: 启用内部高精度计算 --- ++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++- # inner_precise=1 ++- # ) ++- ++- # # 恢复原始数据类型 ++- # attn_output = attn_output.to(input_dtype) +++QWEN2MOE_ATTENTION_CLASSES = { +++ "eager": Qwen2MoeAttention, +++ "flash-attention": Qwen2MoeFlashAttention, +++} ++ ++- # # 7. 调整输出形状 ++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++- # attn_output = self.o_proj(attn_output) ++ ++- # attn_weights = None ++- # if output_attentions: ++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# def __init__(self, config): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# # gating +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++ +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# #@dwj +++# # 只遍历激活的专家,而非全部专家 +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# num_tokens = hidden_states_reshaped.shape[0] +++ +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++# routing_weights = routing_weights.to(hidden_states.dtype) +++ +++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++# flat_selected_experts = selected_experts.flatten() +++ +++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++# token_indices = broadcasted_token_indices.flatten() +++ +++# active_experts = ops.unique(flat_selected_experts) +++ +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++ +++# mask = (flat_selected_experts == expert_idx_tensor) +++# selected_token_indices = token_indices[mask] +++# selected_routing_weights = routing_weights.flatten()[mask] +++ +++# current_states = hidden_states_reshaped[selected_token_indices] +++ +++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++ +++# final_hidden_states = final_hidden_states.index_add( +++# dim=0, +++# index=selected_token_indices, +++# source=expert_output.to(hidden_states.dtype) +++# ) +++ +++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++ ++- # return attn_output, attn_weights, past_key_value +++# final_hidden_states = final_hidden_states + shared_expert_output +++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ +++ +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# """ +++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++# `_moe_infer_prefill` (用于长序列处理) 方法。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# # 门控网络 +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# # 专家列表 +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++# # 共享专家 +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# @no_grad() +++# def _moe_infer_decode( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# """ +++# 【解码路径】针对 sequence_length=1 的极致优化。 +++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++# """ +++# batch_size, hidden_dim = hidden_states.shape +++ +++# expert_outputs_list = [ +++# ops.cat([ +++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++# ], dim=0) +++# for i in range(batch_size) +++# ] +++ +++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++# # shape: (batch_size, top_k, hidden_dim) +++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++ +++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++ +++# return moe_output.squeeze(1) +++ +++# @no_grad() +++# def _moe_infer_prefill( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# """ +++# 【预填充路径】针对 sequence_length > 1 的优化。 +++# 按专家对 Token 进行分组,并进行批处理。 +++# """ +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens = hidden_states.shape[0] +++# flat_selected_experts = selected_experts.flatten() +++ +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++ +++# active_experts = ops.unique(flat_selected_experts) +++ +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++ +++# mask = (flat_selected_experts == expert_idx_tensor) +++# selected_token_indices = token_indices[mask] +++# selected_routing_weights = routing_weights.flatten()[mask] +++ +++# current_states = hidden_states[selected_token_indices] +++ +++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++ +++# moe_output = moe_output.index_add( +++# dim=0, +++# index=selected_token_indices, +++# source=expert_output.to(hidden_states.dtype) +++# ) +++# return moe_output +++ +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# """ +++# 顶层 forward 方法,作为智能分发器。 +++# """ +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++ +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++- # def forward( ++- # self, ++- # hidden_states: mindspore.Tensor, ++- # attention_mask: Optional[mindspore.Tensor] = None, ++- # position_ids: Optional[mindspore.Tensor] = None, ++- # past_key_value: Optional[Cache] = None, ++- # output_attentions: bool = False, ++- # use_cache: bool = False, ++- # cache_position: Optional[mindspore.Tensor] = None, ++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++- ++- # bsz, q_len, _ = hidden_states.shape ++- ++- # query_states = self.q_proj(hidden_states) ++- # key_states = self.k_proj(hidden_states) ++- # value_states = self.v_proj(hidden_states) ++- ++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- ++- # kv_seq_len = key_states.shape[-2] ++- # if past_key_value is not None: ++- # if self.layer_idx is None: ++- # raise ValueError("`layer_idx` must be specified for caching") ++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++- ++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++- ++- # if past_key_value is not None: ++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++- # key_states, value_states = past_key_value.update( ++- # key_states, value_states, self.layer_idx, cache_kwargs ++- # ) +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++# routing_weights = routing_weights.to(hidden_states.dtype) +++ +++# moe_output = None +++# # 在推理时,根据序列长度选择最优路径 +++# if not self.training: +++# if sequence_length == 1: +++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++# else: +++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++# else: +++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++# raise NotImplementedError("Training path is not implemented.") +++ +++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++ +++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++ +++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ +++ +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# """ +++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# # 门控网络 +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# # 专家列表 +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++# # 共享专家 +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# @no_grad() +++# def _moe_infer_decode( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# batch_size, _ = hidden_states.shape +++# expert_outputs_list = [ +++# ops.cat([ +++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++# ], dim=0) +++# for i in range(batch_size) +++# ] +++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++# return moe_output.squeeze(1) +++ +++# @no_grad() +++# def _moe_infer_prefill( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens = hidden_states.shape[0] +++# flat_selected_experts = selected_experts.flatten() +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++# active_experts = ops.unique(flat_selected_experts) +++ +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++# mask = (flat_selected_experts == expert_idx_tensor) +++# selected_token_indices = token_indices[mask] +++# selected_routing_weights = routing_weights.flatten()[mask] +++# current_states = hidden_states[selected_token_indices] +++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++# moe_output = moe_output.index_add( +++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++# ) +++# return moe_output +++ +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# """ +++# 顶层 forward 方法,作为智能分发器。 +++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++# """ +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++ +++# # 1. 门控计算 (通用逻辑) +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++# routing_weights = routing_weights.to(hidden_states.dtype) +++ +++# # 2. 智能分发到最优 MoE 路径 +++# moe_output = None +++# if not self.training: +++# if sequence_length == 1: +++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++# else: +++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++# else: +++# raise NotImplementedError("Training path is not implemented.") +++ +++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++ +++# # 4. 合并 MoE 输出和共享专家输出 +++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++ +++# # 5. 恢复原始形状并返回 +++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ +++# prefill fastest +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# """ +++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# # 门控网络 +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# # 专家列表 +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++# # 共享专家 +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# @no_grad() +++# def _moe_infer_dispatch( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# """ +++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++# """ +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens, _ = hidden_states.shape +++ +++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++# flat_selected_experts = selected_experts.flatten() +++# flat_routing_weights = routing_weights.flatten() ++ ++- # key_states = repeat_kv(key_states, self.num_key_value_groups) ++- # value_states = repeat_kv(value_states, self.num_key_value_groups) ++- ++- # # <--- 核心修改点: 手动进行高精度缩放 --- ++- # # 在调用算子前,手动将 query_states 除以缩放因子。 ++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++- # query_states = query_states / math.sqrt(self.head_dim) ++- # # <--- 修改结束 --- ++- ++- # fa_attention_mask = None ++- # if attention_mask is not None: ++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++- # fa_attention_mask = (mask_slice != 0) ++- ++- # input_dtype = query_states.dtype ++- ++- # attn_output = mindspore.ops.flash_attention_score( ++- # query=query_states, # 传入已经预先缩放过的 query ++- # key=key_states, ++- # value=value_states, ++- # head_num=self.num_heads, ++- # attn_mask=fa_attention_mask, ++- # keep_prob=1.0 - self.attention_dropout, ++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++- # input_layout="BNSD", ++- # sparse_mode=0, ++- # inner_precise=1 # 仍然保持内部高精度计算 ++- # ) +++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++ ++- # attn_output = attn_output.to(input_dtype) ++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++- # attn_output = self.o_proj(attn_output) +++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++# active_experts = ops.unique(flat_selected_experts) +++ +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++ +++# # 找到所有分配给该专家的 token +++# mask = (flat_selected_experts == expert_idx_tensor) +++ +++# # 使用 mask 选取对应的 token 和权重 +++# current_token_indices = token_indices[mask] +++# current_routing_weights = flat_routing_weights[mask] +++# current_hidden_states = hidden_states[current_token_indices] +++ +++# # 对这些 token 进行批处理 +++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++ +++# # 使用 index_add 将结果精确地加回到对应位置 +++# moe_output = moe_output.index_add( +++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++# ) +++# return moe_output +++ +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# """ +++# 顶层 forward 方法,作为智能分发器。 +++# """ +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++ +++# # 1. 门控计算 +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++# routing_weights = routing_weights.to(hidden_states.dtype) +++ +++# # 2. 调用统一的 MoE 计算内核 +++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++ ++- # attn_weights = None ++- # if output_attentions: ++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++# # 3. 统一处理共享专家 +++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++ +++# # 4. 合并输出 +++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++ +++# # 5. 恢复原始形状并返回 +++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ +++ +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# """ +++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++# 【最终高性能与高精度版】: +++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++# 3. 这样实现了速度和准确性的两全其美。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# @no_grad() +++# def _moe_infer_decode( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# """ +++# 【解码路径】极致优化版:bmm + 高精度累加。 +++# """ +++# original_dtype = hidden_states.dtype +++# batch_size, _ = hidden_states.shape +++ +++# expert_outputs_list = [ +++# ops.cat([ +++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++# ], dim=0) +++# for i in range(batch_size) +++# ] +++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++ +++# # 在 float32 下执行 bmm,得到高精度结果 +++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++ +++# # 将高精度结果转换回原始数据类型 +++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++ +++# return moe_output +++ +++# @no_grad() +++# def _moe_infer_prefill( +++# self, +++# hidden_states: mindspore.Tensor, +++# selected_experts: mindspore.Tensor, +++# routing_weights: mindspore.Tensor +++# ) -> mindspore.Tensor: +++# """ +++# 【预填充路径】与原始实现一致,结果精确。 +++# """ +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens, _ = hidden_states.shape +++# flat_selected_experts = selected_experts.flatten() +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++# active_experts = ops.unique(flat_selected_experts) +++ +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++# mask = (flat_selected_experts == expert_idx_tensor) +++# selected_token_indices = token_indices[mask] +++# selected_routing_weights = routing_weights.flatten()[mask] +++# current_states = hidden_states[selected_token_indices] +++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++# moe_output = moe_output.index_add( +++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++# ) +++# return moe_output +++ +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++ +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++ ++- # return attn_output, attn_weights, past_key_value +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++# # 如果模型主体是 float16,后续再转换 +++ +++# moe_output = None +++# if not self.training: +++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++# # _moe_infer_decode 内部会处理好类型转换 +++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++# if sequence_length == 1: +++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++# else: +++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++# else: +++# raise NotImplementedError("Training path is not implemented.") +++ +++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++ +++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ ++ ++-QWEN2MOE_ATTENTION_CLASSES = { ++- "eager": Qwen2MoeAttention, ++- "flash-attention": Qwen2MoeFlashAttention, ++-} +++# class Qwen2MoeSparseMoeBlock(nn.Module): +++# """ +++# 【融合版】一个混合专家模块,内置两种推理策略, +++# 由外部全局变量 `Long_Prompt` 控制: +++ +++# - if Long_Prompt is True: 【精度优先模式】 +++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++# 适用于处理长序列,避免误差累积。 +++ +++# - if Long_Prompt is False: 【速度优先模式】 +++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++# 在解码阶段获得极致速度,同时保证结果高度准确。 +++# """ +++# def __init__(self, config: Qwen2MoeConfig): +++# super().__init__() +++# self.num_experts = config.num_experts +++# self.top_k = config.num_experts_per_tok +++# self.norm_topk_prob = config.norm_topk_prob +++ +++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++# self.experts = nn.ModuleList( +++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++# ) +++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++# # --- 速度优先模式的辅助函数 --- +++# @no_grad() +++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++# original_dtype = hidden_states.dtype +++# batch_size, _ = hidden_states.shape +++# expert_outputs_list = [ +++# ops.cat([ +++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++# ], dim=0) +++# for i in range(batch_size) +++# ] +++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++# weights_fp32 = routing_weights.to(mindspore.float32) +++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++# return moe_output_fp32.squeeze(1).to(original_dtype) +++ +++# @no_grad() +++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens, _ = hidden_states.shape +++# flat_selected_experts = selected_experts.flatten() +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++# active_experts = ops.unique(flat_selected_experts) +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++# mask = (flat_selected_experts == expert_idx_tensor) +++# selected_token_indices = token_indices[mask] +++# selected_routing_weights = routing_weights.flatten()[mask] +++# current_states = hidden_states[selected_token_indices] +++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++# return moe_output +++ +++# # --- 精度优先模式的辅助函数 --- +++# @no_grad() +++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++# moe_output = ops.zeros_like(hidden_states) +++# num_tokens, _ = hidden_states.shape +++# flat_selected_experts = selected_experts.flatten() +++# flat_routing_weights = routing_weights.flatten() +++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++# active_experts = ops.unique(flat_selected_experts) +++# for expert_idx_tensor in active_experts: +++# expert_idx = expert_idx_tensor.item() +++# expert_layer = self.experts[expert_idx] +++# mask = (flat_selected_experts == expert_idx_tensor) +++# current_token_indices = token_indices[mask] +++# current_routing_weights = flat_routing_weights[mask] +++# current_hidden_states = hidden_states[current_token_indices] +++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++# return moe_output +++ +++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++# # 声明我们将要使用一个在模块外部定义的全局变量 +++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++# global Long_Prompt +++ +++# # 1. 门控计算 (所有模式通用) +++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++# router_logits = self.gate(hidden_states_reshaped) +++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++# if self.norm_topk_prob: +++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++# moe_output = None +++# if not self.training: +++# # 根据 Long_Prompt 标志选择模式 +++# if Long_Prompt: +++# # --- 精度优先模式 --- +++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++# else: +++# # --- 速度优先模式 --- +++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++# if sequence_length == 1: +++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++# else: +++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++# else: +++# raise NotImplementedError("Training path is not implemented.") +++ +++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++ +++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++ +++# return final_hidden_states, router_logits +++ +++class Qwen2MoeSparseMoeBlock(nn.Module): +++ """ +++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++ 控制的顶级推理策略: ++ +++ - if Long_Prompt is True: 【精度优先模式】 +++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +++ 适用于需要严格可复现性的长序列任务。 ++ ++-class Qwen2MoeSparseMoeBlock(nn.Module): ++- def __init__(self, config): +++ - if Long_Prompt is False: 【速度优先模式】 +++ 采用业界最强的性能组合: +++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +++ """ +++ def __init__(self, config: Qwen2MoeConfig): ++ super().__init__() ++ self.num_experts = config.num_experts ++ self.top_k = config.num_experts_per_tok ++ self.norm_topk_prob = config.norm_topk_prob ++ ++- # gating ++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++ self.experts = nn.ModuleList( ++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++ ) ++- ++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++ ++- #@dwj ++- # 只遍历激活的专家,而非全部专家 ++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++- num_tokens = hidden_states_reshaped.shape[0] ++- ++- router_logits = self.gate(hidden_states_reshaped) ++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- ++- if self.norm_topk_prob: ++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- routing_weights = routing_weights.to(hidden_states.dtype) ++- ++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++- flat_selected_experts = selected_experts.flatten() ++- ++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++- token_indices = broadcasted_token_indices.flatten() ++- ++- active_experts = ops.unique(flat_selected_experts) ++- ++- for expert_idx_tensor in active_experts: ++- expert_idx = expert_idx_tensor.item() ++- expert_layer = self.experts[expert_idx] ++- ++- mask = (flat_selected_experts == expert_idx_tensor) ++- selected_token_indices = token_indices[mask] ++- selected_routing_weights = routing_weights.flatten()[mask] ++- ++- current_states = hidden_states_reshaped[selected_token_indices] ++- ++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++- ++- final_hidden_states = final_hidden_states.index_add( ++- dim=0, ++- index=selected_token_indices, ++- source=expert_output.to(hidden_states.dtype) ++- ) ++- ++- shared_expert_output = self.shared_expert(hidden_states_reshaped) ++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +++ @no_grad() +++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ original_dtype = hidden_states.dtype +++ batch_size, _ = hidden_states.shape +++ expert_outputs_list = [ +++ ops.cat([ +++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++ ], dim=0) +++ for i in range(batch_size) +++ ] +++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++ weights_fp32 = routing_weights.to(mindspore.float32) +++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++ return moe_output_fp32.squeeze(1).to(original_dtype) +++ +++ @no_grad() +++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ num_tokens, _ = hidden_states.shape +++ flat_selected_experts = selected_experts.flatten() +++ sorted_expert_indices = flat_selected_experts.argsort() +++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++ original_token_indices = sorted_expert_indices // self.top_k +++ moe_output = ops.zeros_like(hidden_states) +++ current_token_offset = 0 +++ for i in range(self.num_experts): +++ expert_token_count = tokens_per_expert[i] - current_token_offset +++ if expert_token_count == 0: +++ continue +++ end_offset = current_token_offset + expert_token_count +++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++ expert_hidden_states = hidden_states[expert_original_token_indices] +++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++ current_token_offset += expert_token_count +++ return moe_output +++ +++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++ @no_grad() +++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ moe_output = ops.zeros_like(hidden_states) +++ num_tokens, _ = hidden_states.shape +++ flat_selected_experts = selected_experts.flatten() +++ flat_routing_weights = routing_weights.flatten() +++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++ active_experts = ops.unique(flat_selected_experts) +++ for expert_idx_tensor in active_experts: +++ expert_idx = expert_idx_tensor.item() +++ expert_layer = self.experts[expert_idx] +++ mask = (flat_selected_experts == expert_idx_tensor) +++ current_token_indices = token_indices[mask] +++ current_routing_weights = flat_routing_weights[mask] +++ current_hidden_states = hidden_states[current_token_indices] +++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++ return moe_output ++ ++- final_hidden_states = final_hidden_states + shared_expert_output ++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++- ++- return final_hidden_states, router_logits +++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++ global Long_Prompt +++ +++ # 1. 门控计算 (所有模式通用) +++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++ router_logits = self.gate(hidden_states_reshaped) +++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++ if self.norm_topk_prob: +++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++ moe_output = None +++ if Long_Prompt: +++ # --- 精度优先模式 (ACCURACY MODE) --- +++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ else: +++ # --- 速度优先模式 (SPEED MODE) --- +++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++ if sequence_length == 1: +++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ else: +++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ ++ +++ # 3. 共享专家计算与合并 (所有模式通用) +++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++ +++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++ +++ return final_hidden_states, router_logits ++ ++ class Qwen2MoeDecoderLayer(nn.Module): ++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++ super().__init__() ++ self.hidden_size = config.hidden_size +++ +++ # if Long_Prompt: +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ # else: +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++ ++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ ++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++- ++ if (layer_idx not in config.mlp_only_layers) and ( ++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++ ): ++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ self._warmed_up = True ++ self.warmup_moe_model() ++ +++ +++ ++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++ output_router_logits = ( ++ output_router_logits if output_router_logits is not None else self.config.output_router_logits ++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ router_logits=outputs.router_logits, ++ ) ++ +++ def generate(self, *args, **kwargs): +++ """ +++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++ """ +++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++ +++ input_ids = kwargs.get("input_ids") +++ if input_ids is None and args: +++ input_ids = args[0] +++ +++ if input_ids is not None: +++ prompt_length = input_ids.shape[1] +++ +++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +++ Long_Prompt = True +++ else: +++ Long_Prompt = False +++ +++ return super().generate(*args, **kwargs) +++ ++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation ++ def prepare_inputs_for_generation( ++ self, ++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens ++ # Exception 1: when passing input_embeds, input_ids may be missing entries ++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +++ ++ if past_key_values is not None: ++ if inputs_embeds is not None: # Exception 1 ++ if 0 not in input_ids.shape: ++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ } ++ ) ++ return model_inputs +++ ++ # @lwx ++ # def _decode_one_tokens_logits( ++ # self, ++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): ++ attentions=outputs.attentions, ++ ) ++ +++ ++ __all__ = [ ++ "Qwen2MoeForCausalLM", ++ "Qwen2MoeModel", ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++new file mode 100644 ++index 00000000..6dfb5b93 ++--- /dev/null +++++ b/patches/0001-20251104commit.patch ++@@ -0,0 +1,1272 @@ +++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++Subject: [PATCH] 20251104commit +++ +++--- +++ mindnlp/transformers/cache_utils.py | 28 +- +++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++ 3 files changed, 976 insertions(+), 87 deletions(-) +++ +++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++index cadd2e04..02f8d4be 100644 +++--- a/mindnlp/transformers/cache_utils.py ++++++ b/mindnlp/transformers/cache_utils.py +++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++ # k_out[:, :, cache_position] = key_states +++ # v_out[:, :, cache_position] = value_states +++- if ON_ORANGE_PI: +++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++- else: +++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++- ++++ # if ON_ORANGE_PI: ++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++ # else: ++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++ if cache_position.ndim > 1: ++++ cache_position = cache_position.flatten() ++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++ cache_position = cache_position.int() ++++ ++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++ k_out[:, :, cache_position] = key_states ++++ v_out[:, :, cache_position] = value_states ++++ +++ return k_out, v_out +++ +++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index c695b944..d8303e45 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++- x1 = x[..., : x.shape[-1] // 2] +++- x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++ # x1 = x[..., : x.shape[-1] // 2] ++++ # x2 = x[..., x.shape[-1] // 2 :] ++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++ if self.training: +++ raise NotImplementedError("Training is not supported yet.") +++ else: +++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++- if self.config.n_shared_experts is not None: +++- y = y + self.shared_experts(identity) +++- return y ++++ # @lwx ++++ if orig_shape[1] == 1: ++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++ y=y.view(*orig_shape) ++++ if self.config.n_shared_experts is not None: ++++ y = y + self.shared_experts(identity) ++++ return y ++++ else: ++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++ if self.config.n_shared_experts is not None: ++++ y = y + self.shared_experts(identity) ++++ return y ++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++ # if self.config.n_shared_experts is not None: ++++ # y = y + self.shared_experts(identity) ++++ # return y ++++ ++++ @no_grad() ++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ ++++ expert_cache = ops.zeros_like(x) ++++ for i in range(self.num_experts_per_tok): ++++ expert_id = flat_expert_indices[i].item() ++++ weight = flat_expert_weights[i].item() ++++ expert = self.experts[expert_id] ++++ expert_out = expert(x) ++++ expert_cache += expert_out * weight ++++ return expert_cache +++ +++ @no_grad() +++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++- # expert_cache = torch.zeros_like(x) +++- # idxs = flat_expert_indices.argsort() +++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++- # token_idxs = idxs // self.num_experts_per_tok +++- # for i, end_idx in enumerate(tokens_per_expert): +++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++- # if start_idx == end_idx: +++- # continue +++- # expert = self.experts[i] +++- # exp_token_idx = token_idxs[start_idx:end_idx] +++- # expert_tokens = x[exp_token_idx] +++- # expert_out = expert(expert_tokens) +++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++- # return expert_cache ++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ expert_cache = ops.zeros_like(x) +++ idxs = flat_expert_indices.argsort() +++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ token_idxs = idxs // self.num_experts_per_tok ++++ +++ for i, end_idx in enumerate(tokens_per_expert): +++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ if start_idx == end_idx: +++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++ expert_out = expert(expert_tokens) +++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ +++ return expert_cache ++++ ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # # expert_cache = torch.zeros_like(x) ++++ # # idxs = flat_expert_indices.argsort() ++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++ # # token_idxs = idxs // self.num_experts_per_tok ++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++ # # if start_idx == end_idx: ++++ # # continue ++++ # # expert = self.experts[i] ++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # # expert_tokens = x[exp_token_idx] ++++ # # expert_out = expert(expert_tokens) ++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++ # # return expert_cache ++++ # expert_cache = ops.zeros_like(x) ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # for i, end_idx in enumerate(tokens_per_expert): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # if start_idx == end_idx: ++++ # continue ++++ # expert = self.experts[i] ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = expert(expert_tokens) ++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ ++++ # return expert_cache ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # expert_cache = ops.zeros_like(x) ++++ ++++ # # 排序保证顺序一致 ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # # 找出有 token 的专家 ++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++ ++++ # for i in active_experts.tolist(): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # end_idx = tokens_per_expert[i] ++++ # if start_idx == end_idx: # 没有 token ++++ # continue ++++ ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = self.experts[i](expert_tokens) ++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++ ++++ # expert_cache = mindspore.mint.scatter_add( ++++ # expert_cache, ++++ # 0, ++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++ # expert_out ++++ # ) ++++ ++++ # return expert_cache ++++ ++++ +++ +++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++ # """ +++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ +++ # Initialize weights and apply final processing +++ self.post_init() ++++ self.warm_up = False ++++ ++++ def warmup_moe_model_deep(self): ++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++ test_texts = [ ++++ "warmup short", ++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++ ] ++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++ if tokenizer is None: ++++ from mindnlp.transformers import AutoTokenizer ++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++ self._warmup_tokenizer = tokenizer ++++ ++++ for text in test_texts: ++++ inputs = tokenizer(text, return_tensors="ms") ++++ with mindspore._no_grad(): ++++ _ = self(**inputs, use_cache=False) ++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++ +++ def get_input_embeddings(self): +++ return self.model.embed_tokens +++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++ ```""" ++++ if not self.warm_up: ++++ self.warm_up = True ++++ self.warmup_moe_model_deep() ++++ +++ output_attentions = ( +++ output_attentions +++ if output_attentions is not None +++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++index 3cbf820e..d4c6b651 100644 +++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++@@ -18,7 +18,6 @@ +++ # See the License for the specific language governing permissions and +++ # limitations under the License. +++ """MindSpore Qwen2MoE model.""" +++- +++ import math +++ from typing import List, Optional, Tuple, Union +++ +++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++ TokenClassifierOutput, +++ ) +++ from ...modeling_utils import PreTrainedModel ++++from ...generation import GenerationMixin +++ from ....utils import logging +++ from .configuration_qwen2_moe import Qwen2MoeConfig +++ +++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++ self.variance_epsilon = eps +++ +++ def forward(self, hidden_states): ++++ # @dwj ++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++ # @lwx ++++ # if not self.training : ++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++ input_dtype = hidden_states.dtype +++ hidden_states = hidden_states.to(mindspore.float32) +++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++@@ -234,6 +239,8 @@ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++ x1 = x[..., : x.shape[-1] // 2] +++ x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++ self.config = config +++ self.hidden_size = config.hidden_size +++ self.intermediate_size = intermediate_size ++++ +++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++ self.act_fn = ACT2FN[config.hidden_act] +++ +++ def forward(self, x): +++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++- +++ ++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++ # @lwx ++++ # gate_up_output = self.gate_up_proj(x) ++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++ # return self.down_proj(swiglu_output) ++++ ++++ # def forward(self, x): ++++ # gate_proj_out = self.gate_proj(x) ++++ # up_proj_out = self.up_proj(x) ++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++ # return self.down_proj(swiglu_out) ++++ +++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++ """ +++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++ use_cache: bool = False, +++ cache_position: Optional[mindspore.Tensor] = None, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ ++++ +++ bsz, q_len, _ = hidden_states.shape +++ +++ query_states = self.q_proj(hidden_states) +++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ "with a layer index." +++ ) +++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ if isinstance(past_key_value, StaticCache): ++++ kv_seq_len = key_states.shape[-2] ++++ else: ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++ if isinstance(past_key_value, StaticCache): ++++ kv_seq_len = key_states.shape[-2] +++ +++ # repeat k/v heads if n_kv_heads < n_heads +++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++- ++++ +++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++ +++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++- raise ValueError( +++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++- f" {attn_weights.shape}" +++- ) +++- +++- if attention_mask is not None: # no matter the length, we just slice it +++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++ if attention_mask is not None: ++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++ attn_weights = attn_weights + causal_mask +++ +++ # upcast attention to fp32 +++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++ +++ attn_output = self.o_proj(attn_output) +++- ++++ # @lwx ++++ ++++ # max_seq_len = self.max_position_embeddings # 2048 ++++ ++++ # if attention_mask is not None: ++++ # # attention_mask: [B, 1, Sq, Sk] ++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++ ++++ # # pad 到 [max_seq_len, max_seq_len] ++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++ # global_attention_mask = padded_mask ++++ # else: ++++ # global_attention_mask = None ++++ ++++ ++++ # sparse_mode=3 ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, ++++ # key=key_states, ++++ # value=value_states, ++++ # real_shift=None, ++++ # padding_mask=None, ++++ ++++ # head_num=self.num_heads, ++++ # attn_mask=global_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++ # input_layout="BNSD", ++++ # pre_tokens=2147483647, ++++ # next_tokens=2147483647, ++++ # inner_precise=0, ++++ # drop_mask=None, ++++ # prefix=None, ++++ # actual_seq_qlen=None, ++++ # actual_seq_kvlen=None, ++++ # sparse_mode=sparse_mode, ++++ # ) +++ if not output_attentions: +++ attn_weights = None +++ +++ return attn_output, attn_weights, past_key_value +++ +++ ++++class Qwen2MoeFlashAttention(nn.Module): ++++ """ ++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++ ++++ 关键改动: ++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++ 直接传入原始的 key 和 value 张量效率更高。 ++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++ """ ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++ super().__init__() ++++ self.config = config ++++ self.layer_idx = layer_idx ++++ self.hidden_size = config.hidden_size ++++ self.num_heads = config.num_attention_heads ++++ self.head_dim = self.hidden_size // self.num_heads ++++ self.num_key_value_heads = config.num_key_value_heads ++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++ self.max_position_embeddings = config.max_position_embeddings ++++ self.rope_theta = config.rope_theta ++++ self.attention_dropout = config.attention_dropout ++++ ++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++ raise ValueError( ++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++ ) ++++ ++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++ ++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++ self.head_dim, ++++ max_position_embeddings=self.max_position_embeddings, ++++ base=self.rope_theta, ++++ ) ++++ ++++ def forward( ++++ self, ++++ hidden_states: mindspore.Tensor, ++++ attention_mask: Optional[mindspore.Tensor] = None, ++++ position_ids: Optional[mindspore.Tensor] = None, ++++ past_key_value: Optional[Cache] = None, ++++ output_attentions: bool = False, ++++ use_cache: bool = False, ++++ cache_position: Optional[mindspore.Tensor] = None, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ # 1. 线性投射 Q, K, V ++++ query_states = self.q_proj(hidden_states) ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # 3. RoPE 旋转位置编码 ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++ if self.layer_idx is None: ++++ raise ValueError( ++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ "with a layer index." ++++ ) ++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++ if cache_position.shape[0] == 1: ++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++ kv_seq_len = past_seen_tokens + 1 ++++ else: ++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++ else: ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # 4. KV 缓存更新 ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ key_states, value_states = past_key_value.update( ++++ key_states, value_states, self.layer_idx, cache_kwargs ++++ ) ++++ ++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++ if cache_position.shape[0] == 1: ++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++ kv_seq_len = key_states.shape[-2] ++++ ++++ # 5. [重要] 准备 Attention Mask ++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++ fa_attention_mask = None ++++ if attention_mask is not None: ++++ # 截取与当前key长度匹配的部分 ++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++ fa_attention_mask = (mask_slice != 0) ++++ ++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++ input_dtype = query_states.dtype ++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++ query_states = query_states.to(mindspore.float16) ++++ key_states = key_states.to(mindspore.float16) ++++ value_states = value_states.to(mindspore.float16) ++++ ++++ # 6. [核心] 调用 flash_attention_score 算子 ++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++ attn_output = mindspore.ops.flash_attention_score( ++++ query=query_states, ++++ key=key_states, ++++ value=value_states, ++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++ attn_mask=fa_attention_mask, ++++ keep_prob=1.0 - self.attention_dropout, ++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++ input_layout="BNSD", ++++ sparse_mode=0 # 使用 defaultMask 模式 ++++ ) ++++ ++++ # 恢复原始数据类型 ++++ attn_output = attn_output.to(input_dtype) ++++ ++++ # 7. 调整输出形状 ++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ attn_output = self.o_proj(attn_output) ++++ ++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++ attn_weights = None ++++ if output_attentions: ++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++ # def forward( ++++ # self, ++++ # hidden_states: mindspore.Tensor, ++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++ # position_ids: Optional[mindspore.Tensor] = None, ++++ # past_key_value: Optional[Cache] = None, ++++ # output_attentions: bool = False, ++++ # use_cache: bool = False, ++++ # cache_position: Optional[mindspore.Tensor] = None, ++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ # bsz, q_len, _ = hidden_states.shape ++++ ++++ # # 1. 线性投射 Q, K, V ++++ # query_states = self.q_proj(hidden_states) ++++ # key_states = self.k_proj(hidden_states) ++++ # value_states = self.v_proj(hidden_states) ++++ ++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # # 3. RoPE 旋转位置编码 ++++ # kv_seq_len = key_states.shape[-2] ++++ # if past_key_value is not None: ++++ # if self.layer_idx is None: ++++ # raise ValueError( ++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ # "with a layer index." ++++ # ) ++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # # 4. KV 缓存更新 ++++ # if past_key_value is not None: ++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ # key_states, value_states = past_key_value.update( ++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++ # ) ++++ ++++ # # 5. 准备 Attention Mask ++++ # fa_attention_mask = None ++++ # if attention_mask is not None: ++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # fa_attention_mask = (mask_slice != 0) ++++ ++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++ # input_dtype = query_states.dtype ++++ ++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, ++++ # key=key_states, ++++ # value=value_states, ++++ # head_num=self.num_heads, ++++ # attn_mask=fa_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++ # input_layout="BNSD", ++++ # sparse_mode=0, ++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++ # inner_precise=1 ++++ # ) ++++ ++++ # # 恢复原始数据类型 ++++ # attn_output = attn_output.to(input_dtype) ++++ ++++ # # 7. 调整输出形状 ++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ # attn_output = self.o_proj(attn_output) ++++ ++++ # attn_weights = None ++++ # if output_attentions: ++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++ # return attn_output, attn_weights, past_key_value ++++ ++++ # def forward( ++++ # self, ++++ # hidden_states: mindspore.Tensor, ++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++ # position_ids: Optional[mindspore.Tensor] = None, ++++ # past_key_value: Optional[Cache] = None, ++++ # output_attentions: bool = False, ++++ # use_cache: bool = False, ++++ # cache_position: Optional[mindspore.Tensor] = None, ++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ # bsz, q_len, _ = hidden_states.shape ++++ ++++ # query_states = self.q_proj(hidden_states) ++++ # key_states = self.k_proj(hidden_states) ++++ # value_states = self.v_proj(hidden_states) ++++ ++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # kv_seq_len = key_states.shape[-2] ++++ # if past_key_value is not None: ++++ # if self.layer_idx is None: ++++ # raise ValueError("`layer_idx` must be specified for caching") ++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # if past_key_value is not None: ++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ # key_states, value_states = past_key_value.update( ++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++ # ) ++++ ++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++ ++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++ # query_states = query_states / math.sqrt(self.head_dim) ++++ # # <--- 修改结束 --- ++++ ++++ # fa_attention_mask = None ++++ # if attention_mask is not None: ++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # fa_attention_mask = (mask_slice != 0) ++++ ++++ # input_dtype = query_states.dtype ++++ ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, # 传入已经预先缩放过的 query ++++ # key=key_states, ++++ # value=value_states, ++++ # head_num=self.num_heads, ++++ # attn_mask=fa_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++ # input_layout="BNSD", ++++ # sparse_mode=0, ++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++ # ) ++++ ++++ # attn_output = attn_output.to(input_dtype) ++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ # attn_output = self.o_proj(attn_output) ++++ ++++ # attn_weights = None ++++ # if output_attentions: ++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++ ++++ # return attn_output, attn_weights, past_key_value ++++ +++ QWEN2MOE_ATTENTION_CLASSES = { +++ "eager": Qwen2MoeAttention, ++++ "flash-attention": Qwen2MoeFlashAttention, +++ } +++ +++ +++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ ++++ #@dwj ++++ # 只遍历激活的专家,而非全部专家 +++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++- hidden_states = hidden_states.view(-1, hidden_dim) +++- # router_logits: (batch * sequence_length, n_experts) +++- router_logits = self.gate(hidden_states) +++- +++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- if self.norm_topk_prob: +++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- # we cast back to the input dtype +++- routing_weights = routing_weights.to(hidden_states.dtype) +++- +++- final_hidden_states = ops.zeros( +++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++- ) +++- +++- # One hot encode the selected experts to create an expert mask +++- # this will be used to easily index which expert is going to be sollicitated +++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++- +++- # Loop over all available experts in the model and perform the computation on each expert +++- for expert_idx in range(self.num_experts): +++- expert_layer = self.experts[expert_idx] +++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++- +++- # Index the correct hidden states and compute the expert hidden state for +++- # the current expert. We need to make sure to multiply the output hidden +++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++- if 0 not in idx.shape: +++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++- +++- # However `index_add_` only support torch tensors for indexing so we'll use +++- # the `top_x` tensor here. +++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++- +++- shared_expert_output = self.shared_expert(hidden_states) +++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++- +++- final_hidden_states = final_hidden_states + shared_expert_output ++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++ num_tokens = hidden_states_reshaped.shape[0] ++++ ++++ router_logits = self.gate(hidden_states_reshaped) ++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++ if self.norm_topk_prob: ++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++ flat_selected_experts = selected_experts.flatten() ++++ ++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++ token_indices = broadcasted_token_indices.flatten() ++++ ++++ active_experts = ops.unique(flat_selected_experts) ++++ ++++ for expert_idx_tensor in active_experts: ++++ expert_idx = expert_idx_tensor.item() ++++ expert_layer = self.experts[expert_idx] ++++ ++++ mask = (flat_selected_experts == expert_idx_tensor) ++++ selected_token_indices = token_indices[mask] ++++ selected_routing_weights = routing_weights.flatten()[mask] ++++ ++++ current_states = hidden_states_reshaped[selected_token_indices] ++++ ++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++ ++++ final_hidden_states = final_hidden_states.index_add( ++++ dim=0, ++++ index=selected_token_indices, ++++ source=expert_output.to(hidden_states.dtype) ++++ ) ++++ ++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++ +++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++- return final_hidden_states, router_logits ++++ final_hidden_states = final_hidden_states + shared_expert_output ++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++ ++++ return final_hidden_states, router_logits +++ +++ +++ class Qwen2MoeDecoderLayer(nn.Module): +++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++ +++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++ +++ if (layer_idx not in config.mlp_only_layers) and ( +++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++ ): +++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++ _skip_keys_device_placement = "past_key_values" +++ _supports_cache_class = True ++++#lwx ++++ # _supports_static_cache = True +++ +++ def _init_weights(self, module): +++ std = self.config.initializer_range +++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++ return causal_mask +++ +++ +++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ _tied_weights_keys = ["lm_head.weight"] +++ +++ def __init__(self, config): +++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ self.num_experts_per_tok = config.num_experts_per_tok +++ # Initialize weights and apply final processing +++ self.post_init() ++++ # @lwx ++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++ # self.generation_config.cache_implementation = "static" ++++ self._warmed_up = False ++++ ++++ def warmup_moe_model(self): ++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++ test_texts = [ ++++ "warmup short", ++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++ ] ++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++ if tokenizer is None: ++++ from mindnlp.transformers import AutoTokenizer ++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++ self._warmup_tokenizer = tokenizer ++++ ++++ for text in test_texts: ++++ inputs = tokenizer(text, return_tensors="ms") ++++ with mindspore._no_grad(): ++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++ +++ def get_input_embeddings(self): +++ return self.model.embed_tokens +++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++ ```""" ++++ if not self._warmed_up: ++++ self._warmed_up = True ++++ self.warmup_moe_model() +++ +++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++ output_router_logits = ( +++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ } +++ ) +++ return model_inputs ++++# @lwx ++++ # def _decode_one_tokens_logits( ++++ # self, ++++ # cur_token: mindspore.Tensor, ++++ # input_pos: Optional[mindspore.Tensor], ++++ # cache_position: mindspore.Tensor, ++++ # past_key_values: StaticCache, ++++ # ) -> mindspore.Tensor: ++++ # """ ++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++ ++++ # Args: ++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++ # input_pos: 输入位置信息,可选 ++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++ ++++ # Returns: ++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++ # """ ++++ # # 调用JIT编译的版本 ++++ # return self.get_decode_one_tokens_logits( ++++ # cur_token=cur_token, ++++ # input_pos=input_pos, ++++ # cache_position=cache_position, ++++ # past_key_values=past_key_values, ++++ # ) ++++ ++++ # @mindspore.jit(jit_level='O1') ++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++ # """ ++++ # JIT编译的函数,用于高效的单token解码 ++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++ ++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++ # """ ++++ # outputs = self.model.forward( ++++ # input_ids=cur_token, ++++ # position_ids=input_pos, ++++ # cache_position=cache_position, ++++ # past_key_values=past_key_values, ++++ # use_cache=True, ++++ # return_dict=False, ++++ # ) ++++ ++++ # hidden_states = outputs[0] ++++ # logits = self.lm_head.forward(hidden_states) ++++ # logits = logits.float() ++++ ++++ # return logits[:, -1, :] ++++ ++++ # def _sample( ++++ # self, ++++ # input_ids: mindspore.Tensor, ++++ # logits_processor, ++++ # stopping_criteria, ++++ # generation_config, ++++ # synced_devices: bool, ++++ # streamer=None, ++++ # logits_warper=None, ++++ # **model_kwargs, ++++ # ): ++++ # """ ++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++ # """ ++++ # from ...generation.logits_process import LogitsProcessorList ++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++ # from mindnlp.core import nn, ops, no_grad ++++ # import numpy as np ++++ ++++ # # 检查是否使用 StaticCache ++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++ # # 否则,直接调用父类方法 ++++ # past_key_values = model_kwargs.get("past_key_values") ++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++ ++++ # if not isinstance(past_key_values, StaticCache): ++++ # # 不使用 StaticCache,直接调用父类方法 ++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++ # return super()._sample( ++++ # input_ids=input_ids, ++++ # logits_processor=logits_processor, ++++ # stopping_criteria=stopping_criteria, ++++ # generation_config=generation_config, ++++ # synced_devices=synced_devices, ++++ # streamer=streamer, ++++ # logits_warper=logits_warper, ++++ # **model_kwargs, ++++ # ) ++++ ++++ # # 使用 StaticCache,进入自定义循环 ++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++ # pad_token_id = generation_config._pad_token_tensor ++++ # output_attentions = generation_config.output_attentions ++++ # output_hidden_states = generation_config.output_hidden_states ++++ # output_scores = generation_config.output_scores ++++ # output_logits = generation_config.output_logits ++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++ # max_length = generation_config.max_length ++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++ # do_sample = generation_config.do_sample ++++ ++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++ # raise ValueError( ++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++ # f"{logits_warper})." ++++ # ) ++++ ++++ # # init attention / hidden states / scores tuples ++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++ ++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++ # encoder_hidden_states = ( ++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++ # ) ++++ ++++ # # keep track of which sequences are already finished ++++ # batch_size, cur_len = input_ids.shape ++++ # this_peer_finished = False ++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++ ++++ # time_record = [] ++++ # from ....utils.testing_utils import parse_flag_from_env ++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++ ++++ # while self._has_unfinished_sequences( ++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++ # ): ++++ # if _record_time: ++++ # import time as time_module ++++ # infer_start = time_module.time() ++++ ++++ # # prepare model inputs ++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++ ++++ # # prepare variable output controls ++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++ ++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++ # cur_cache_position = model_inputs.get("cache_position") ++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++ # cur_input_ids = model_inputs.get("input_ids") ++++ ++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++ # cur_cache_position is not None and ++++ # len(cur_cache_position.shape) > 0 and ++++ # cur_cache_position.shape[0] == 1 and ++++ # cur_input_ids is not None and ++++ # cur_input_ids.shape[1] == 1): ++++ # # 使用 JIT 优化的单 token 解码 ++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++ # if not hasattr(self, '_jit_used'): ++++ # self._jit_used = False ++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++ ++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++ # cur_token=cur_input_ids, ++++ # input_pos=model_inputs.get("position_ids"), ++++ # cache_position=cur_cache_position, ++++ # past_key_values=cur_past_key_values, ++++ # ) ++++ ++++ # # 标记已使用JIT(用于后续判断) ++++ # if not self._jit_used: ++++ # self._jit_used = True ++++ ++++ # # 构造兼容的输出对象 ++++ # class JitOptimizedOutput: ++++ # def __init__(self, logits, config): ++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++ # self.config = config ++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++ # self.attentions = None if not config.is_encoder_decoder else None ++++ # self.cross_attentions = None ++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++ ++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++ # else: ++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++ # outputs = self(**model_inputs, return_dict=True) ++++ ++++ # if synced_devices and this_peer_finished: ++++ # continue ++++ ++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++ # next_token_logits = outputs.logits[:, -1, :] ++++ ++++ # # pre-process distribution ++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++ # if do_sample: ++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++ ++++ # # Store scores, attentions and hidden_states when required ++++ # if return_dict_in_generate: ++++ # if output_scores: ++++ # scores += (next_token_scores,) ++++ # if output_logits: ++++ # raw_logits += (next_token_logits,) ++++ # if output_attentions: ++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++ # if self.config.is_encoder_decoder: ++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++ ++++ # if output_hidden_states: ++++ # hidden = ( ++++ # outputs.decoder_hidden_states ++++ # if self.config.is_encoder_decoder ++++ # else outputs.hidden_states ++++ # ) ++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++ ++++ # # token selection ++++ # if do_sample: ++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++ # else: ++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++ ++++ # # finished sentences should have their next token be a padding token ++++ # if has_eos_stopping_criteria: ++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++ ++++ # # update generated ids, model inputs, and length for next step ++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++ # if streamer is not None: ++++ # streamer.put(next_tokens) ++++ ++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++ # outputs, ++++ # model_kwargs, ++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++ # ) ++++ ++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++ # cur_len += 1 ++++ ++++ # if _record_time: ++++ # import time as time_module ++++ # infer_stop = time_module.time() ++++ # time_record.append(infer_stop - infer_start) ++++ ++++ # del outputs ++++ ++++ # average_infer_time = None ++++ # if time_record: ++++ # if len(time_record) > 1: ++++ # time_record.pop(0) ++++ # average_infer_time = sum(time_record) / len(time_record) ++++ # print(f'average inference time is: {average_infer_time}') ++++ # print(f'inference time record: {time_record}') ++++ ++++ # if streamer is not None: ++++ # streamer.end() ++++ ++++ # # 简单判断:打印是否使用了JIT路径 ++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++ # else: ++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++ ++++ # if return_dict_in_generate: ++++ # if self.config.is_encoder_decoder: ++++ # return GenerateEncoderDecoderOutput( ++++ # sequences=input_ids, ++++ # scores=scores, ++++ # logits=raw_logits, ++++ # encoder_attentions=encoder_attentions, ++++ # encoder_hidden_states=encoder_hidden_states, ++++ # decoder_attentions=decoder_attentions, ++++ # cross_attentions=cross_attentions, ++++ # decoder_hidden_states=decoder_hidden_states, ++++ # past_key_values=model_kwargs.get("past_key_values"), ++++ # average_infer_time=average_infer_time ++++ # ) ++++ # else: ++++ # return GenerateDecoderOnlyOutput( ++++ # sequences=input_ids, ++++ # scores=scores, ++++ # logits=raw_logits, ++++ # attentions=decoder_attentions, ++++ # hidden_states=decoder_hidden_states, ++++ # past_key_values=model_kwargs.get("past_key_values"), ++++ # average_infer_time=average_infer_time ++++ # ) ++++ # else: ++++ # return input_ids ++++ ++++ # def _prepare_cache_for_generation( ++++ # self, ++++ # generation_config, ++++ # model_kwargs, ++++ # assistant_model, ++++ # batch_size, ++++ # max_cache_length, ++++ # ): ++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++ # generation_config.cache_implementation = "static" ++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++ ++++ # if generation_config.cache_implementation == "static": ++++ # base_required_from_max_length = generation_config.max_length + 1 ++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++ # min_cache_size = 50 ++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++ # else: ++++ # max_cache_length = max(base_required, min_cache_size) ++++ ++++ # original_max_cache_length = max_cache_length ++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++ # print(f" - final max_cache_length: {max_cache_length}") ++++ ++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++ # if max_cache_length > self.config.max_position_embeddings: ++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++ ++++ # result = super()._prepare_cache_for_generation( ++++ # generation_config=generation_config, ++++ # model_kwargs=model_kwargs, ++++ # assistant_model=assistant_model, ++++ # batch_size=batch_size, ++++ # max_cache_length=max_cache_length, ++++ # ) ++++ ++++ # if generation_config.cache_implementation == "static": ++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++ # created_cache = model_kwargs.get(cache_name) ++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++ # if created_cache.max_cache_len < generation_config.max_length: ++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++ ++++ # return result ++++ ++++ ++++ +++ +++ +++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +new file mode 100644 +index 00000000..966529e4 +--- /dev/null ++++ b/patches/0003-20261106secondcommit.patch +@@ -0,0 +1,2769 @@ ++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Thu, 6 Nov 2025 14:54:37 +0800 ++Subject: [PATCH 3/3] 20261106secondcommit ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- ++ patches/0001-20251104commit.patch | 1272 ----------------- ++ 3 files changed, 528 insertions(+), 2032 deletions(-) ++ delete mode 100644 patches/0001-20251104commit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index 73773c22..2f9192bf 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) ++ ++ _CONFIG_FOR_DOC = "DeepseekConfig" ++ +++_attn_mask_cache = {} +++ +++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +++ q_len = batch_and_seq[1] +++ kv_len = batch_and_seq[1] + past_key_values_length +++ key = (batch_and_seq[0], q_len, kv_len) +++ +++ if key in _attn_mask_cache: +++ return _attn_mask_cache[key] +++ +++ mask = _prepare_4d_causal_attention_mask( +++ attention_mask, +++ batch_and_seq, +++ inputs_embeds, +++ past_key_values_length, +++ ) +++ _attn_mask_cache[key] = mask +++ return mask ++ ++ def _get_unpad_data(attention_mask): ++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) ++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): ++ return final_output ++ ++ ++- @no_grad() ++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++- expert_cache = ops.zeros_like(x) ++- idxs = flat_expert_indices.argsort() ++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++- token_idxs = idxs // self.num_experts_per_tok ++- ++- for i, end_idx in enumerate(tokens_per_expert): ++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++- if start_idx == end_idx: ++- continue ++- expert = self.experts[i] ++- exp_token_idx = token_idxs[start_idx:end_idx] ++- expert_tokens = x[exp_token_idx] ++- expert_out = expert(expert_tokens) ++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++- ++- return expert_cache ++- ++ # @no_grad() ++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++- # # expert_cache = torch.zeros_like(x) ++- # # idxs = flat_expert_indices.argsort() ++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++- # # token_idxs = idxs // self.num_experts_per_tok ++- # # for i, end_idx in enumerate(tokens_per_expert): ++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++- # # if start_idx == end_idx: ++- # # continue ++- # # expert = self.experts[i] ++- # # exp_token_idx = token_idxs[start_idx:end_idx] ++- # # expert_tokens = x[exp_token_idx] ++- # # expert_out = expert(expert_tokens) ++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++- # # return expert_cache +++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ # expert_cache = ops.zeros_like(x) ++ # idxs = flat_expert_indices.argsort() ++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): ++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++ ++ # return expert_cache ++- # @no_grad() ++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++- # expert_cache = ops.zeros_like(x) +++ +++ @no_grad() +++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ """ +++ 优化版 MoE prefill: +++ - 批量张量化处理同一个 expert 的所有 token +++ - 跳过无 token 的专家 +++ - 保持结果完全一致 +++ """ +++ # 初始化输出缓存 +++ expert_cache = ops.zeros_like(x) ++ ++- # # 排序保证顺序一致 ++- # idxs = flat_expert_indices.argsort() ++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++- # token_idxs = idxs // self.num_experts_per_tok +++ # 排序(确保 scatter_add 位置对应原逻辑) +++ idxs = flat_expert_indices.argsort() +++ sorted_expert_indices = flat_expert_indices[idxs] +++ sorted_token_indices = idxs // self.num_experts_per_tok ++ ++- # # 找出有 token 的专家 ++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++ # 每个 expert 的 token 数 +++ tokens_per_expert = sorted_expert_indices.bincount() ++ ++- # for i in active_experts.tolist(): ++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++- # end_idx = tokens_per_expert[i] ++- # if start_idx == end_idx: # 没有 token ++- # continue +++ # 找出有 token 的专家 +++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++ ++- # exp_token_idx = token_idxs[start_idx:end_idx] ++- # expert_tokens = x[exp_token_idx] ++- # expert_out = self.experts[i](expert_tokens) ++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++ for expert_id in active_experts.tolist(): +++ # 取该 expert 对应的排序后 token 区间 +++ start = (tokens_per_expert[:expert_id]).sum().item() +++ end = start + tokens_per_expert[expert_id].item() ++ ++- # expert_cache = mindspore.mint.scatter_add( ++- # expert_cache, ++- # 0, ++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++- # expert_out ++- # ) +++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +++ expert_tokens = x[token_idx] # 取输入向量 ++ ++- # return expert_cache +++ # 执行专家 MLP +++ expert_out = self.experts[expert_id](expert_tokens) +++ +++ # 按权重缩放 +++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +++ +++ # 回写到缓存(等价 scatter_add) +++ expert_cache = mindspore.mint.scatter_add( +++ expert_cache, +++ 0, +++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +++ scaled_out +++ ) +++ +++ return expert_cache +++ +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # # expert_cache = torch.zeros_like(x) +++ # # idxs = flat_expert_indices.argsort() +++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++ # # token_idxs = idxs // self.num_experts_per_tok +++ # # for i, end_idx in enumerate(tokens_per_expert): +++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++ # # if start_idx == end_idx: +++ # # continue +++ # # expert = self.experts[i] +++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++ # # expert_tokens = x[exp_token_idx] +++ # # expert_out = expert(expert_tokens) +++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++ # # return expert_cache +++ # expert_cache = ops.zeros_like(x) +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # for i, end_idx in enumerate(tokens_per_expert): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # if start_idx == end_idx: +++ # continue +++ # expert = self.experts[i] +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = expert(expert_tokens) +++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ +++ # return expert_cache +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++ # expert_cache = ops.zeros_like(x) +++ +++ # # 排序保证顺序一致 +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ # token_idxs = idxs // self.num_experts_per_tok +++ +++ # # 找出有 token 的专家 +++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++ +++ # for i in active_experts.tolist(): +++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ # end_idx = tokens_per_expert[i] +++ # if start_idx == end_idx: # 没有 token +++ # continue +++ +++ # exp_token_idx = token_idxs[start_idx:end_idx] +++ # expert_tokens = x[exp_token_idx] +++ # expert_out = self.experts[i](expert_tokens) +++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++ +++ # expert_cache = mindspore.mint.scatter_add( +++ # expert_cache, +++ # 0, +++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++ # expert_out +++ # ) +++ +++ # return expert_cache ++ ++ ++ ++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): ++ ++ return attn_output, attn_weights, past_key_value ++ ++- ++ # class DeepseekFlashAttention(nn.Module): ++ # """ ++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): ++ ++ return attn_output, attn_weights, past_key_value ++ +++ ++ Deepseek_ATTENTION_CLASSES = { ++ "eager": DeepseekAttention, ++ "flash-attention": DeepseekFlashAttention, ++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): ++ ) ++ else: ++ # 4d mask is passed through the layers ++- attention_mask = _prepare_4d_causal_attention_mask( +++ # attention_mask = _prepare_4d_causal_attention_mask( +++ # attention_mask, +++ # (batch_size, seq_length), +++ # inputs_embeds, +++ # past_key_values_length, +++ # ) +++ #@dwj +++ attention_mask = get_cached_causal_mask( ++ attention_mask, ++ (batch_size, seq_length), ++ inputs_embeds, ++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ # Initialize weights and apply final processing ++ self.post_init() ++ self.warm_up = False +++ #@dwj +++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +++ self.num_layers, +++ self.num_attention_heads, +++ self.head_dim, +++ batch_size=1, +++ max_length=self.max_length, +++ dtype=mindspore.float16 +++ ) +++ +++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +++ key_cache = [] +++ value_cache = [] +++ for _ in range(num_layers): +++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++ key_cache.append(k) +++ value_cache.append(v) +++ return key_cache, value_cache +++ ++ ++ def warmup_moe_model_deep(self): ++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index bced285c..ebd7782e 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) ++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++ ++-Long_Prompt = False ++-PROMPT_LENGTH_THRESHOLD = 128 +++Long_Prompt = 1 +++LONG_PROMPT_LENGTH_THRESHOLD = 128 +++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +++ +++_causal_mask_cache = {} +++ +++def get_cached_causal_mask_with_cache_position( +++ attention_mask: mindspore.Tensor, +++ sequence_length: int, +++ target_length: int, +++ dtype: mindspore.dtype, +++ min_dtype: float, +++ cache_position: mindspore.Tensor, +++ batch_size: int, +++): +++ """ +++ 带缓存的 causal mask 构造函数 +++ """ +++ # q_len 是当前 query 长度 +++ q_len = sequence_length +++ # kv_len 是 target_length +++ kv_len = target_length +++ +++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +++ +++ if key in _causal_mask_cache: +++ return _causal_mask_cache[key] +++ +++ # 调用原来的 mask 构造逻辑 +++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++ attention_mask, +++ sequence_length=sequence_length, +++ target_length=target_length, +++ dtype=dtype, +++ min_dtype=min_dtype, +++ cache_position=cache_position, +++ batch_size=batch_size, +++ ) +++ # 缓存结果 +++ _causal_mask_cache[key] = causal_mask +++ return causal_mask ++ ++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++ def _prepare_4d_causal_attention_mask_with_cache_position( ++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++ ++ ++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +++# class Qwen2MoeAttention(nn.Module): +++# """ +++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++# and "Generating Long Sequences with Sparse Transformers". +++# """ +++ +++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++# super().__init__() +++# self.config = config +++# self.layer_idx = layer_idx +++# if layer_idx is None: +++# logger.warning_once( +++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++# "when creating this class." +++# ) +++ +++# self.hidden_size = config.hidden_size +++# self.num_heads = config.num_attention_heads +++# self.head_dim = self.hidden_size // self.num_heads +++# self.num_key_value_heads = config.num_key_value_heads +++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++# self.max_position_embeddings = config.max_position_embeddings +++# self.rope_theta = config.rope_theta +++# self.is_causal = True +++# self.attention_dropout = config.attention_dropout +++ +++# if (self.head_dim * self.num_heads) != self.hidden_size: +++# raise ValueError( +++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++# f" and `num_heads`: {self.num_heads})." +++# ) +++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++ +++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++# self.head_dim, +++# max_position_embeddings=self.max_position_embeddings, +++# base=self.rope_theta, +++# ) +++ +++# def forward( +++# self, +++# hidden_states: mindspore.Tensor, +++# attention_mask: Optional[mindspore.Tensor] = None, +++# position_ids: Optional[mindspore.Tensor] = None, +++# past_key_value: Optional[Cache] = None, +++# output_attentions: bool = False, +++# use_cache: bool = False, +++# cache_position: Optional[mindspore.Tensor] = None, +++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++ +++ +++ +++# bsz, q_len, _ = hidden_states.shape +++ +++# query_states = self.q_proj(hidden_states) +++# key_states = self.k_proj(hidden_states) +++# value_states = self.v_proj(hidden_states) +++ +++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++ +++# kv_seq_len = key_states.shape[-2] +++# if past_key_value is not None: +++# if self.layer_idx is None: +++# raise ValueError( +++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++# "with a layer index." +++# ) +++# if isinstance(past_key_value, StaticCache): +++# kv_seq_len = key_states.shape[-2] +++# else: +++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++# if past_key_value is not None: +++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++# if isinstance(past_key_value, StaticCache): +++# kv_seq_len = key_states.shape[-2] +++ +++# # repeat k/v heads if n_kv_heads < n_heads +++# key_states = repeat_kv(key_states, self.num_key_value_groups) +++# value_states = repeat_kv(value_states, self.num_key_value_groups) +++ +++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++ +++# if attention_mask is not None: +++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++# attn_weights = attn_weights + causal_mask +++ +++# # upcast attention to fp32 +++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++# attn_output = ops.matmul(attn_weights, value_states) +++ +++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++# raise ValueError( +++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++# f" {attn_output.shape}" +++# ) +++ +++# attn_output = ops.transpose(attn_output, 1, 2) +++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++ +++# attn_output = self.o_proj(attn_output) +++# # @lwx +++ +++# # max_seq_len = self.max_position_embeddings # 2048 +++ +++# # if attention_mask is not None: +++# # # attention_mask: [B, 1, Sq, Sk] +++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++ +++# # # pad 到 [max_seq_len, max_seq_len] +++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++# # global_attention_mask = padded_mask +++# # else: +++# # global_attention_mask = None +++ +++ +++# # sparse_mode=3 +++# # attn_output = mindspore.ops.flash_attention_score( +++# # query=query_states, +++# # key=key_states, +++# # value=value_states, +++# # real_shift=None, +++# # padding_mask=None, +++ +++# # head_num=self.num_heads, +++# # attn_mask=global_attention_mask, +++# # keep_prob=1.0 - self.attention_dropout, +++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++# # input_layout="BNSD", +++# # pre_tokens=2147483647, +++# # next_tokens=2147483647, +++# # inner_precise=0, +++# # drop_mask=None, +++# # prefix=None, +++# # actual_seq_qlen=None, +++# # actual_seq_kvlen=None, +++# # sparse_mode=sparse_mode, +++# # ) +++# if not output_attentions: +++# attn_weights = None +++ +++# return attn_output, attn_weights, past_key_value +++ ++ class Qwen2MoeAttention(nn.Module): ++ """ ++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++- and "Generating Long Sequences with Sparse Transformers". ++- """ +++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 ++ +++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +++ +++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +++ """ ++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++ super().__init__() ++ self.config = config ++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): ++ if layer_idx is None: ++ logger.warning_once( ++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++ "when creating this class." ++ ) ++ ++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): ++ use_cache: bool = False, ++ cache_position: Optional[mindspore.Tensor] = None, ++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++- ++ ++- +++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- ++ bsz, q_len, _ = hidden_states.shape ++ ++ query_states = self.q_proj(hidden_states) ++ key_states = self.k_proj(hidden_states) ++ value_states = self.v_proj(hidden_states) ++ ++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++- +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ ++ kv_seq_len = key_states.shape[-2] ++ if past_key_value is not None: ++- if self.layer_idx is None: ++- raise ValueError( ++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++- "with a layer index." ++- ) ++- if isinstance(past_key_value, StaticCache): ++- kv_seq_len = key_states.shape[-2] ++- else: ++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ ++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++ ++ if past_key_value is not None: ++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++ +++ # --- 2. 动态调度核心注意力计算 --- +++ global Long_Prompt +++ if Long_Prompt >= 1: +++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +++ fa_attention_mask = None +++ if attention_mask is not None: +++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++ fa_attention_mask = (mask_slice != 0) +++ +++ attn_output = mindspore.ops.flash_attention_score( +++ query=query_states, +++ key=key_states, +++ value=value_states, +++ head_num=self.num_heads, +++ attn_mask=fa_attention_mask, +++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +++ scalar_value=1.0 / math.sqrt(self.head_dim), +++ input_layout="BNSD", +++ sparse_mode=0, +++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +++ ) ++ ++- if isinstance(past_key_value, StaticCache): ++- kv_seq_len = key_states.shape[-2] +++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ attn_output = self.o_proj(attn_output) +++ attn_weights = None +++ if output_attentions: +++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++ ++- # repeat k/v heads if n_kv_heads < n_heads ++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++- ++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++ else: +++ # --- Eager Attention 路径 (用于短序列和解码) --- +++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++ +++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++ ++- if attention_mask is not None: ++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++- attn_weights = attn_weights + causal_mask +++ if attention_mask is not None: +++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++ attn_weights = attn_weights + causal_mask ++ ++- # upcast attention to fp32 ++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++- attn_output = ops.matmul(attn_weights, value_states) +++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++ attn_output = ops.matmul(attn_weights, value_states) ++ ++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++- raise ValueError( ++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++- f" {attn_output.shape}" ++- ) +++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++ raise ValueError( +++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +++ ) ++ ++- attn_output = ops.transpose(attn_output, 1, 2) ++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++ attn_output = ops.transpose(attn_output, 1, 2) +++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++ attn_output = self.o_proj(attn_output) ++ ++- attn_output = self.o_proj(attn_output) ++- # @lwx +++ if not output_attentions: +++ attn_weights = None ++ ++- # max_seq_len = self.max_position_embeddings # 2048 ++- ++- # if attention_mask is not None: ++- # # attention_mask: [B, 1, Sq, Sk] ++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++- ++- # # pad 到 [max_seq_len, max_seq_len] ++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++- # global_attention_mask = padded_mask ++- # else: ++- # global_attention_mask = None ++- ++- ++- # sparse_mode=3 ++- # attn_output = mindspore.ops.flash_attention_score( ++- # query=query_states, ++- # key=key_states, ++- # value=value_states, ++- # real_shift=None, ++- # padding_mask=None, ++- ++- # head_num=self.num_heads, ++- # attn_mask=global_attention_mask, ++- # keep_prob=1.0 - self.attention_dropout, ++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++- # input_layout="BNSD", ++- # pre_tokens=2147483647, ++- # next_tokens=2147483647, ++- # inner_precise=0, ++- # drop_mask=None, ++- # prefix=None, ++- # actual_seq_qlen=None, ++- # actual_seq_kvlen=None, ++- # sparse_mode=sparse_mode, ++- # ) ++- if not output_attentions: ++- attn_weights = None ++- ++ return attn_output, attn_weights, past_key_value ++ ++- ++ # class Qwen2MoeFlashAttention(nn.Module): ++ # """ ++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { ++ # return final_hidden_states, router_logits ++ ++ ++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++-# """ ++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++-# `_moe_infer_prefill` (用于长序列处理) 方法。 ++-# """ ++-# def __init__(self, config: Qwen2MoeConfig): ++-# super().__init__() ++-# self.num_experts = config.num_experts ++-# self.top_k = config.num_experts_per_tok ++-# self.norm_topk_prob = config.norm_topk_prob ++- ++-# # 门控网络 ++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++-# # 专家列表 ++-# self.experts = nn.ModuleList( ++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++-# ) ++-# # 共享专家 ++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-# @no_grad() ++-# def _moe_infer_decode( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# """ ++-# 【解码路径】针对 sequence_length=1 的极致优化。 ++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++-# """ ++-# batch_size, hidden_dim = hidden_states.shape ++- ++-# expert_outputs_list = [ ++-# ops.cat([ ++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++-# ], dim=0) ++-# for i in range(batch_size) ++-# ] ++- ++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++-# # shape: (batch_size, top_k, hidden_dim) ++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++- ++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++- ++-# return moe_output.squeeze(1) ++- ++-# @no_grad() ++-# def _moe_infer_prefill( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# """ ++-# 【预填充路径】针对 sequence_length > 1 的优化。 ++-# 按专家对 Token 进行分组,并进行批处理。 ++-# """ ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens = hidden_states.shape[0] ++-# flat_selected_experts = selected_experts.flatten() ++- ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++- ++-# active_experts = ops.unique(flat_selected_experts) ++- ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++- ++-# mask = (flat_selected_experts == expert_idx_tensor) ++-# selected_token_indices = token_indices[mask] ++-# selected_routing_weights = routing_weights.flatten()[mask] ++- ++-# current_states = hidden_states[selected_token_indices] ++- ++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++- ++-# moe_output = moe_output.index_add( ++-# dim=0, ++-# index=selected_token_indices, ++-# source=expert_output.to(hidden_states.dtype) ++-# ) ++-# return moe_output ++- ++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-# """ ++-# 顶层 forward 方法,作为智能分发器。 ++-# """ ++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++- ++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-# router_logits = self.gate(hidden_states_reshaped) ++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- ++-# if self.norm_topk_prob: ++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- ++-# routing_weights = routing_weights.to(hidden_states.dtype) ++- ++-# moe_output = None ++-# # 在推理时,根据序列长度选择最优路径 ++-# if not self.training: ++-# if sequence_length == 1: ++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++-# else: ++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++-# else: ++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++-# raise NotImplementedError("Training path is not implemented.") ++- ++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++- ++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++- ++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++- ++-# return final_hidden_states, router_logits ++- ++- ++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++-# """ ++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++-# """ ++-# def __init__(self, config: Qwen2MoeConfig): ++-# super().__init__() ++-# self.num_experts = config.num_experts ++-# self.top_k = config.num_experts_per_tok ++-# self.norm_topk_prob = config.norm_topk_prob ++- ++-# # 门控网络 ++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++-# # 专家列表 ++-# self.experts = nn.ModuleList( ++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++-# ) ++-# # 共享专家 ++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-# @no_grad() ++-# def _moe_infer_decode( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# batch_size, _ = hidden_states.shape ++-# expert_outputs_list = [ ++-# ops.cat([ ++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++-# ], dim=0) ++-# for i in range(batch_size) ++-# ] ++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++-# return moe_output.squeeze(1) ++- ++-# @no_grad() ++-# def _moe_infer_prefill( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens = hidden_states.shape[0] ++-# flat_selected_experts = selected_experts.flatten() ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++-# active_experts = ops.unique(flat_selected_experts) ++- ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++-# mask = (flat_selected_experts == expert_idx_tensor) ++-# selected_token_indices = token_indices[mask] ++-# selected_routing_weights = routing_weights.flatten()[mask] ++-# current_states = hidden_states[selected_token_indices] ++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++-# moe_output = moe_output.index_add( ++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++-# ) ++-# return moe_output ++- ++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-# """ ++-# 顶层 forward 方法,作为智能分发器。 ++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++-# """ ++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++- ++-# # 1. 门控计算 (通用逻辑) ++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-# router_logits = self.gate(hidden_states_reshaped) ++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- ++-# if self.norm_topk_prob: ++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- ++-# routing_weights = routing_weights.to(hidden_states.dtype) ++- ++-# # 2. 智能分发到最优 MoE 路径 ++-# moe_output = None ++-# if not self.training: ++-# if sequence_length == 1: ++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++-# else: ++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++-# else: ++-# raise NotImplementedError("Training path is not implemented.") ++- ++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++- ++-# # 4. 合并 MoE 输出和共享专家输出 ++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++- ++-# # 5. 恢复原始形状并返回 ++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++- ++-# return final_hidden_states, router_logits ++- ++-# prefill fastest ++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++-# """ ++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++-# """ ++-# def __init__(self, config: Qwen2MoeConfig): ++-# super().__init__() ++-# self.num_experts = config.num_experts ++-# self.top_k = config.num_experts_per_tok ++-# self.norm_topk_prob = config.norm_topk_prob ++- ++-# # 门控网络 ++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++-# # 专家列表 ++-# self.experts = nn.ModuleList( ++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++-# ) ++-# # 共享专家 ++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-# @no_grad() ++-# def _moe_infer_dispatch( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# """ ++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++-# """ ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens, _ = hidden_states.shape ++- ++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++-# flat_selected_experts = selected_experts.flatten() ++-# flat_routing_weights = routing_weights.flatten() ++- ++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++- ++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++-# active_experts = ops.unique(flat_selected_experts) ++- ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++- ++-# # 找到所有分配给该专家的 token ++-# mask = (flat_selected_experts == expert_idx_tensor) ++- ++-# # 使用 mask 选取对应的 token 和权重 ++-# current_token_indices = token_indices[mask] ++-# current_routing_weights = flat_routing_weights[mask] ++-# current_hidden_states = hidden_states[current_token_indices] ++- ++-# # 对这些 token 进行批处理 ++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++- ++-# # 使用 index_add 将结果精确地加回到对应位置 ++-# moe_output = moe_output.index_add( ++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++-# ) ++-# return moe_output ++- ++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-# """ ++-# 顶层 forward 方法,作为智能分发器。 ++-# """ ++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++- ++-# # 1. 门控计算 ++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-# router_logits = self.gate(hidden_states_reshaped) ++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- ++-# if self.norm_topk_prob: ++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- ++-# routing_weights = routing_weights.to(hidden_states.dtype) ++- ++-# # 2. 调用统一的 MoE 计算内核 ++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++- ++-# # 3. 统一处理共享专家 ++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++- ++-# # 4. 合并输出 ++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++- ++-# # 5. 恢复原始形状并返回 ++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++- ++-# return final_hidden_states, router_logits ++- ++- ++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++-# """ ++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++-# 【最终高性能与高精度版】: ++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++-# 3. 这样实现了速度和准确性的两全其美。 ++-# """ ++-# def __init__(self, config: Qwen2MoeConfig): ++-# super().__init__() ++-# self.num_experts = config.num_experts ++-# self.top_k = config.num_experts_per_tok ++-# self.norm_topk_prob = config.norm_topk_prob ++- ++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++-# self.experts = nn.ModuleList( ++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++-# ) ++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-# @no_grad() ++-# def _moe_infer_decode( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# """ ++-# 【解码路径】极致优化版:bmm + 高精度累加。 ++-# """ ++-# original_dtype = hidden_states.dtype ++-# batch_size, _ = hidden_states.shape ++- ++-# expert_outputs_list = [ ++-# ops.cat([ ++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++-# ], dim=0) ++-# for i in range(batch_size) ++-# ] ++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++- ++-# # 在 float32 下执行 bmm,得到高精度结果 ++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++- ++-# # 将高精度结果转换回原始数据类型 ++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++- ++-# return moe_output ++- ++-# @no_grad() ++-# def _moe_infer_prefill( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# selected_experts: mindspore.Tensor, ++-# routing_weights: mindspore.Tensor ++-# ) -> mindspore.Tensor: ++-# """ ++-# 【预填充路径】与原始实现一致,结果精确。 ++-# """ ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens, _ = hidden_states.shape ++-# flat_selected_experts = selected_experts.flatten() ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++-# active_experts = ops.unique(flat_selected_experts) ++- ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++-# mask = (flat_selected_experts == expert_idx_tensor) ++-# selected_token_indices = token_indices[mask] ++-# selected_routing_weights = routing_weights.flatten()[mask] ++-# current_states = hidden_states[selected_token_indices] ++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++-# moe_output = moe_output.index_add( ++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++-# ) ++-# return moe_output ++- ++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++- ++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-# router_logits = self.gate(hidden_states_reshaped) ++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++- ++-# if self.norm_topk_prob: ++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- ++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++-# # 如果模型主体是 float16,后续再转换 ++- ++-# moe_output = None ++-# if not self.training: ++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++-# # _moe_infer_decode 内部会处理好类型转换 ++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++-# if sequence_length == 1: ++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++-# else: ++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++-# else: ++-# raise NotImplementedError("Training path is not implemented.") ++- ++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++- ++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++- ++-# return final_hidden_states, router_logits ++- ++- ++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++-# """ ++-# 【融合版】一个混合专家模块,内置两种推理策略, ++-# 由外部全局变量 `Long_Prompt` 控制: ++- ++-# - if Long_Prompt is True: 【精度优先模式】 ++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++-# 适用于处理长序列,避免误差累积。 ++- ++-# - if Long_Prompt is False: 【速度优先模式】 ++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++-# 在解码阶段获得极致速度,同时保证结果高度准确。 ++-# """ ++-# def __init__(self, config: Qwen2MoeConfig): ++-# super().__init__() ++-# self.num_experts = config.num_experts ++-# self.top_k = config.num_experts_per_tok ++-# self.norm_topk_prob = config.norm_topk_prob ++- ++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++-# self.experts = nn.ModuleList( ++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++-# ) ++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-# # --- 速度优先模式的辅助函数 --- ++-# @no_grad() ++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++-# original_dtype = hidden_states.dtype ++-# batch_size, _ = hidden_states.shape ++-# expert_outputs_list = [ ++-# ops.cat([ ++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++-# ], dim=0) ++-# for i in range(batch_size) ++-# ] ++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++-# weights_fp32 = routing_weights.to(mindspore.float32) ++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++-# return moe_output_fp32.squeeze(1).to(original_dtype) ++- ++-# @no_grad() ++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens, _ = hidden_states.shape ++-# flat_selected_experts = selected_experts.flatten() ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++-# active_experts = ops.unique(flat_selected_experts) ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++-# mask = (flat_selected_experts == expert_idx_tensor) ++-# selected_token_indices = token_indices[mask] ++-# selected_routing_weights = routing_weights.flatten()[mask] ++-# current_states = hidden_states[selected_token_indices] ++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++-# return moe_output ++- ++-# # --- 精度优先模式的辅助函数 --- ++-# @no_grad() ++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++-# moe_output = ops.zeros_like(hidden_states) ++-# num_tokens, _ = hidden_states.shape ++-# flat_selected_experts = selected_experts.flatten() ++-# flat_routing_weights = routing_weights.flatten() ++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++-# active_experts = ops.unique(flat_selected_experts) ++-# for expert_idx_tensor in active_experts: ++-# expert_idx = expert_idx_tensor.item() ++-# expert_layer = self.experts[expert_idx] ++-# mask = (flat_selected_experts == expert_idx_tensor) ++-# current_token_indices = token_indices[mask] ++-# current_routing_weights = flat_routing_weights[mask] ++-# current_hidden_states = hidden_states[current_token_indices] ++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++-# return moe_output ++- ++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-# # 声明我们将要使用一个在模块外部定义的全局变量 ++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++-# global Long_Prompt ++- ++-# # 1. 门控计算 (所有模式通用) ++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-# router_logits = self.gate(hidden_states_reshaped) ++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++-# if self.norm_topk_prob: ++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++- ++-# moe_output = None ++-# if not self.training: ++-# # 根据 Long_Prompt 标志选择模式 ++-# if Long_Prompt: ++-# # --- 精度优先模式 --- ++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++-# else: ++-# # --- 速度优先模式 --- ++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++-# if sequence_length == 1: ++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++-# else: ++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++-# else: ++-# raise NotImplementedError("Training path is not implemented.") ++- ++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++- ++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++- ++-# return final_hidden_states, router_logits ++- ++ class Qwen2MoeSparseMoeBlock(nn.Module): ++ """ ++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++ return moe_output_fp32.squeeze(1).to(original_dtype) ++ +++ # @no_grad() +++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ # num_tokens, _ = hidden_states.shape +++ # flat_selected_experts = selected_experts.flatten() +++ # sorted_expert_indices = flat_selected_experts.argsort() +++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++ # original_token_indices = sorted_expert_indices // self.top_k +++ # moe_output = ops.zeros_like(hidden_states) +++ # current_token_offset = 0 +++ # for i in range(self.num_experts): +++ # expert_token_count = tokens_per_expert[i] - current_token_offset +++ # if expert_token_count == 0: +++ # continue +++ # end_offset = current_token_offset + expert_token_count +++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++ # expert_hidden_states = hidden_states[expert_original_token_indices] +++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++ # current_token_offset += expert_token_count +++ # return moe_output +++ ++ @no_grad() ++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++- num_tokens, _ = hidden_states.shape ++- flat_selected_experts = selected_experts.flatten() ++- sorted_expert_indices = flat_selected_experts.argsort() ++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++- original_token_indices = sorted_expert_indices // self.top_k +++ """ +++ 优化版 MoE prefill (速度优先模式): +++ - 批量张量化处理同一个 expert 的所有 token +++ - 跳过无 token 的专家 +++ - 保持结果完全一致 +++ """ ++ moe_output = ops.zeros_like(hidden_states) ++- current_token_offset = 0 ++- for i in range(self.num_experts): ++- expert_token_count = tokens_per_expert[i] - current_token_offset ++- if expert_token_count == 0: ++- continue ++- end_offset = current_token_offset + expert_token_count ++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++- expert_hidden_states = hidden_states[expert_original_token_indices] ++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++- current_token_offset += expert_token_count +++ +++ flat_selected_experts = selected_experts.flatten() +++ flat_routing_weights = routing_weights.flatten() +++ +++ idxs = flat_selected_experts.argsort() +++ sorted_expert_indices = flat_selected_experts[idxs] +++ sorted_token_indices = idxs // self.top_k +++ +++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +++ +++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++ +++ for expert_id in active_experts.tolist(): +++ start = int(tokens_per_expert[:expert_id].sum().item()) +++ end = start + int(tokens_per_expert[expert_id].item()) +++ +++ token_idx = sorted_token_indices[start:end] +++ expert_tokens = hidden_states[token_idx] +++ +++ expert_out = self.experts[expert_id](expert_tokens) +++ +++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +++ +++ moe_output = mindspore.mint.scatter_add( +++ moe_output, +++ 0, +++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +++ scaled_out.to(hidden_states.dtype) +++ ) +++ ++ return moe_output ++ +++ ++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++ @no_grad() ++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++ ++ moe_output = None ++- if Long_Prompt: ++- # --- 精度优先模式 (ACCURACY MODE) --- ++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # if Long_Prompt==0: +++ # # --- 精度优先模式 (ACCURACY MODE) --- +++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # else: +++ # # --- 速度优先模式 (SPEED MODE) --- +++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++ # if sequence_length == 1: +++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # else: +++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ +++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++ if sequence_length == 1: +++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ else: ++- # --- 速度优先模式 (SPEED MODE) --- ++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++- if sequence_length == 1: ++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++- else: ++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++- +++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ ++ ++ # 3. 共享专家计算与合并 (所有模式通用) ++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ ++ return final_hidden_states, router_logits ++ +++ ++ class Qwen2MoeDecoderLayer(nn.Module): ++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++ super().__init__() ++ self.hidden_size = config.hidden_size ++ ++- # if Long_Prompt: ++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++- # else: +++ # if Long_Prompt == 2: ++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++ # else: +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ ++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++ ++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++ ) ++ ++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). ++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++ # attention_mask, +++ # sequence_length=sequence_length, +++ # target_length=target_length, +++ # dtype=dtype, +++ # min_dtype=min_dtype, +++ # cache_position=cache_position, +++ # batch_size=input_tensor.shape[0], +++ # ) +++ #@dwj +++ causal_mask = get_cached_causal_mask_with_cache_position( ++ attention_mask, ++ sequence_length=sequence_length, ++ target_length=target_length, ++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++ """ ++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +++ _causal_mask_cache.clear() ++ ++ input_ids = kwargs.get("input_ids") ++ if input_ids is None and args: ++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ ++ if input_ids is not None: ++ prompt_length = input_ids.shape[1] ++- ++- if prompt_length > PROMPT_LENGTH_THRESHOLD: ++- Long_Prompt = True +++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +++ Long_Prompt = 2 +++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +++ Long_Prompt = 0 ++ else: ++- Long_Prompt = False +++ Long_Prompt = 1 +++ ++ ++ return super().generate(*args, **kwargs) ++ ++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++ dtype = self.lm_head.weight.dtype ++ min_dtype = float(ops.finfo(dtype).min) ++ ++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++ # attention_mask, +++ # sequence_length=sequence_length, +++ # target_length=past_key_values.get_max_length(), +++ # dtype=dtype, +++ # min_dtype=min_dtype, +++ # cache_position=cache_position, +++ # batch_size=batch_size, +++ # ) +++ +++ #@dwj +++ attention_mask = get_cached_causal_mask_with_cache_position( ++ attention_mask, ++ sequence_length=sequence_length, ++ target_length=past_key_values.get_max_length(), ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++deleted file mode 100644 ++index 6dfb5b93..00000000 ++--- a/patches/0001-20251104commit.patch +++++ /dev/null ++@@ -1,1272 +0,0 @@ ++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++-From: Pinoeer-kingxi <13022943007@163.com> ++-Date: Tue, 4 Nov 2025 09:11:51 +0800 ++-Subject: [PATCH] 20251104commit ++- ++---- ++- mindnlp/transformers/cache_utils.py | 28 +- ++- .../models/deepseek/modeling_deepseek.py | 149 ++- ++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++- 3 files changed, 976 insertions(+), 87 deletions(-) ++- ++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++-index cadd2e04..02f8d4be 100644 ++---- a/mindnlp/transformers/cache_utils.py ++-+++ b/mindnlp/transformers/cache_utils.py ++-@@ -812,14 +812,26 @@ class StaticCache(Cache): ++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++- # k_out[:, :, cache_position] = key_states ++- # v_out[:, :, cache_position] = value_states ++-- if ON_ORANGE_PI: ++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++-- else: ++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++-- ++-+ # if ON_ORANGE_PI: ++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++-+ # else: ++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++-+ # 确保 cache_position 是 1D tensor 并且类型正确 ++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++-+ if cache_position.ndim > 1: ++-+ cache_position = cache_position.flatten() ++-+ # 确保类型是 int32 或 int64(MindSpore 要求) ++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++-+ cache_position = cache_position.int() ++-+ ++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++-+ k_out[:, :, cache_position] = key_states ++-+ v_out[:, :, cache_position] = value_states ++-+ ++- return k_out, v_out ++- ++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++-index c695b944..d8303e45 100644 ++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++- # Copied from transformers.models.llama.modeling_llama.rotate_half ++- def rotate_half(x): ++- """Rotates half the hidden dims of the input.""" ++-- x1 = x[..., : x.shape[-1] // 2] ++-- x2 = x[..., x.shape[-1] // 2 :] ++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++-+ # x1 = x[..., : x.shape[-1] // 2] ++-+ # x2 = x[..., x.shape[-1] // 2 :] ++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++- return ops.cat((-x2, x1), dim=-1) ++- ++- ++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++- if self.training: ++- raise NotImplementedError("Training is not supported yet.") ++- else: ++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++-- if self.config.n_shared_experts is not None: ++-- y = y + self.shared_experts(identity) ++-- return y ++-+ # @lwx ++-+ if orig_shape[1] == 1: ++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++-+ y=y.view(*orig_shape) ++-+ if self.config.n_shared_experts is not None: ++-+ y = y + self.shared_experts(identity) ++-+ return y ++-+ else: ++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++-+ if self.config.n_shared_experts is not None: ++-+ y = y + self.shared_experts(identity) ++-+ return y ++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++-+ # if self.config.n_shared_experts is not None: ++-+ # y = y + self.shared_experts(identity) ++-+ # return y ++-+ ++-+ @no_grad() ++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++-+ ++-+ expert_cache = ops.zeros_like(x) ++-+ for i in range(self.num_experts_per_tok): ++-+ expert_id = flat_expert_indices[i].item() ++-+ weight = flat_expert_weights[i].item() ++-+ expert = self.experts[expert_id] ++-+ expert_out = expert(x) ++-+ expert_cache += expert_out * weight ++-+ return expert_cache ++- ++- @no_grad() ++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++-- # expert_cache = torch.zeros_like(x) ++-- # idxs = flat_expert_indices.argsort() ++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++-- # token_idxs = idxs // self.num_experts_per_tok ++-- # for i, end_idx in enumerate(tokens_per_expert): ++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++-- # if start_idx == end_idx: ++-- # continue ++-- # expert = self.experts[i] ++-- # exp_token_idx = token_idxs[start_idx:end_idx] ++-- # expert_tokens = x[exp_token_idx] ++-- # expert_out = expert(expert_tokens) ++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++-- # return expert_cache ++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++- expert_cache = ops.zeros_like(x) ++- idxs = flat_expert_indices.argsort() ++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++- token_idxs = idxs // self.num_experts_per_tok ++-+ ++- for i, end_idx in enumerate(tokens_per_expert): ++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++- if start_idx == end_idx: ++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++- expert_out = expert(expert_tokens) ++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++-+ ++- return expert_cache ++-+ ++-+ # @no_grad() ++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++-+ # # expert_cache = torch.zeros_like(x) ++-+ # # idxs = flat_expert_indices.argsort() ++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++-+ # # token_idxs = idxs // self.num_experts_per_tok ++-+ # # for i, end_idx in enumerate(tokens_per_expert): ++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++-+ # # if start_idx == end_idx: ++-+ # # continue ++-+ # # expert = self.experts[i] ++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] ++-+ # # expert_tokens = x[exp_token_idx] ++-+ # # expert_out = expert(expert_tokens) ++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++-+ # # return expert_cache ++-+ # expert_cache = ops.zeros_like(x) ++-+ # idxs = flat_expert_indices.argsort() ++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++-+ # token_idxs = idxs // self.num_experts_per_tok ++-+ ++-+ # for i, end_idx in enumerate(tokens_per_expert): ++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++-+ # if start_idx == end_idx: ++-+ # continue ++-+ # expert = self.experts[i] ++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++-+ # expert_tokens = x[exp_token_idx] ++-+ # expert_out = expert(expert_tokens) ++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++-+ ++-+ # return expert_cache ++-+ # @no_grad() ++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++-+ # expert_cache = ops.zeros_like(x) ++-+ ++-+ # # 排序保证顺序一致 ++-+ # idxs = flat_expert_indices.argsort() ++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++-+ # token_idxs = idxs // self.num_experts_per_tok ++-+ ++-+ # # 找出有 token 的专家 ++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++-+ ++-+ # for i in active_experts.tolist(): ++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++-+ # end_idx = tokens_per_expert[i] ++-+ # if start_idx == end_idx: # 没有 token ++-+ # continue ++-+ ++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++-+ # expert_tokens = x[exp_token_idx] ++-+ # expert_out = self.experts[i](expert_tokens) ++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++-+ ++-+ # expert_cache = mindspore.mint.scatter_add( ++-+ # expert_cache, ++-+ # 0, ++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++-+ # expert_out ++-+ # ) ++-+ ++-+ # return expert_cache ++-+ ++-+ ++- ++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++- # """ ++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++- ++- # Initialize weights and apply final processing ++- self.post_init() ++-+ self.warm_up = False ++-+ ++-+ def warmup_moe_model_deep(self): ++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++-+ test_texts = [ ++-+ "warmup short", ++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++-+ ] ++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++-+ if tokenizer is None: ++-+ from mindnlp.transformers import AutoTokenizer ++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++-+ self._warmup_tokenizer = tokenizer ++-+ ++-+ for text in test_texts: ++-+ inputs = tokenizer(text, return_tensors="ms") ++-+ with mindspore._no_grad(): ++-+ _ = self(**inputs, use_cache=False) ++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++- ++- def get_input_embeddings(self): ++- return self.model.embed_tokens ++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++- ```""" ++-+ if not self.warm_up: ++-+ self.warm_up = True ++-+ self.warmup_moe_model_deep() ++-+ ++- output_attentions = ( ++- output_attentions ++- if output_attentions is not None ++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++-index 3cbf820e..d4c6b651 100644 ++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++-@@ -18,7 +18,6 @@ ++- # See the License for the specific language governing permissions and ++- # limitations under the License. ++- """MindSpore Qwen2MoE model.""" ++-- ++- import math ++- from typing import List, Optional, Tuple, Union ++- ++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++- TokenClassifierOutput, ++- ) ++- from ...modeling_utils import PreTrainedModel ++-+from ...generation import GenerationMixin ++- from ....utils import logging ++- from .configuration_qwen2_moe import Qwen2MoeConfig ++- ++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++- self.variance_epsilon = eps ++- ++- def forward(self, hidden_states): ++-+ # @dwj ++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++-+ # @lwx ++-+ # if not self.training : ++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++- input_dtype = hidden_states.dtype ++- hidden_states = hidden_states.to(mindspore.float32) ++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++-@@ -234,6 +239,8 @@ def rotate_half(x): ++- """Rotates half the hidden dims of the input.""" ++- x1 = x[..., : x.shape[-1] // 2] ++- x2 = x[..., x.shape[-1] // 2 :] ++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++- return ops.cat((-x2, x1), dim=-1) ++- ++- ++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++- self.config = config ++- self.hidden_size = config.hidden_size ++- self.intermediate_size = intermediate_size ++-+ ++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++- self.act_fn = ACT2FN[config.hidden_act] ++- ++- def forward(self, x): ++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++-- ++- ++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++-+ # @lwx ++-+ # gate_up_output = self.gate_up_proj(x) ++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++-+ # return self.down_proj(swiglu_output) ++-+ ++-+ # def forward(self, x): ++-+ # gate_proj_out = self.gate_proj(x) ++-+ # up_proj_out = self.up_proj(x) ++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++-+ # return self.down_proj(swiglu_out) ++-+ ++- # Copied from transformers.models.llama.modeling_llama.repeat_kv ++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++- """ ++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++- use_cache: bool = False, ++- cache_position: Optional[mindspore.Tensor] = None, ++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++-+ ++-+ ++-+ ++- bsz, q_len, _ = hidden_states.shape ++- ++- query_states = self.q_proj(hidden_states) ++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++- "with a layer index." ++- ) ++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++-+ if isinstance(past_key_value, StaticCache): ++-+ kv_seq_len = key_states.shape[-2] ++-+ else: ++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++- ++- if past_key_value is not None: ++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++-+ ++-+ if isinstance(past_key_value, StaticCache): ++-+ kv_seq_len = key_states.shape[-2] ++- ++- # repeat k/v heads if n_kv_heads < n_heads ++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++-- ++-+ ++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++- ++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++-- raise ValueError( ++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++-- f" {attn_weights.shape}" ++-- ) ++-- ++-- if attention_mask is not None: # no matter the length, we just slice it ++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++-+ if attention_mask is not None: ++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++- attn_weights = attn_weights + causal_mask ++- ++- # upcast attention to fp32 ++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++- ++- attn_output = self.o_proj(attn_output) ++-- ++-+ # @lwx ++-+ ++-+ # max_seq_len = self.max_position_embeddings # 2048 ++-+ ++-+ # if attention_mask is not None: ++-+ # # attention_mask: [B, 1, Sq, Sk] ++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++-+ ++-+ # # pad 到 [max_seq_len, max_seq_len] ++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++-+ # global_attention_mask = padded_mask ++-+ # else: ++-+ # global_attention_mask = None ++-+ ++-+ ++-+ # sparse_mode=3 ++-+ # attn_output = mindspore.ops.flash_attention_score( ++-+ # query=query_states, ++-+ # key=key_states, ++-+ # value=value_states, ++-+ # real_shift=None, ++-+ # padding_mask=None, ++-+ ++-+ # head_num=self.num_heads, ++-+ # attn_mask=global_attention_mask, ++-+ # keep_prob=1.0 - self.attention_dropout, ++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++-+ # input_layout="BNSD", ++-+ # pre_tokens=2147483647, ++-+ # next_tokens=2147483647, ++-+ # inner_precise=0, ++-+ # drop_mask=None, ++-+ # prefix=None, ++-+ # actual_seq_qlen=None, ++-+ # actual_seq_kvlen=None, ++-+ # sparse_mode=sparse_mode, ++-+ # ) ++- if not output_attentions: ++- attn_weights = None ++- ++- return attn_output, attn_weights, past_key_value ++- ++- ++-+class Qwen2MoeFlashAttention(nn.Module): ++-+ """ ++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++-+ ++-+ 关键改动: ++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++-+ 直接传入原始的 key 和 value 张量效率更高。 ++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++-+ """ ++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++-+ super().__init__() ++-+ self.config = config ++-+ self.layer_idx = layer_idx ++-+ self.hidden_size = config.hidden_size ++-+ self.num_heads = config.num_attention_heads ++-+ self.head_dim = self.hidden_size // self.num_heads ++-+ self.num_key_value_heads = config.num_key_value_heads ++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++-+ self.max_position_embeddings = config.max_position_embeddings ++-+ self.rope_theta = config.rope_theta ++-+ self.attention_dropout = config.attention_dropout ++-+ ++-+ if (self.head_dim * self.num_heads) != self.hidden_size: ++-+ raise ValueError( ++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++-+ ) ++-+ ++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++-+ ++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++-+ self.head_dim, ++-+ max_position_embeddings=self.max_position_embeddings, ++-+ base=self.rope_theta, ++-+ ) ++-+ ++-+ def forward( ++-+ self, ++-+ hidden_states: mindspore.Tensor, ++-+ attention_mask: Optional[mindspore.Tensor] = None, ++-+ position_ids: Optional[mindspore.Tensor] = None, ++-+ past_key_value: Optional[Cache] = None, ++-+ output_attentions: bool = False, ++-+ use_cache: bool = False, ++-+ cache_position: Optional[mindspore.Tensor] = None, ++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++-+ ++-+ bsz, q_len, _ = hidden_states.shape ++-+ ++-+ # 1. 线性投射 Q, K, V ++-+ query_states = self.q_proj(hidden_states) ++-+ key_states = self.k_proj(hidden_states) ++-+ value_states = self.v_proj(hidden_states) ++-+ ++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++-+ # query: [B, S, H*D] -> [B, N1, S, D] ++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ ++-+ # 3. RoPE 旋转位置编码 ++-+ kv_seq_len = key_states.shape[-2] ++-+ if past_key_value is not None: ++-+ if self.layer_idx is None: ++-+ raise ValueError( ++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++-+ "with a layer index." ++-+ ) ++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len ++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++-+ if cache_position.shape[0] == 1: ++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++-+ kv_seq_len = past_seen_tokens + 1 ++-+ else: ++-+ # prefill 阶段:cache_position 是范围,使用其长度 ++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++-+ else: ++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++-+ ++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++-+ ++-+ # 4. KV 缓存更新 ++-+ if past_key_value is not None: ++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++-+ key_states, value_states = past_key_value.update( ++-+ key_states, value_states, self.layer_idx, cache_kwargs ++-+ ) ++-+ ++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++-+ if cache_position.shape[0] == 1: ++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++-+ kv_seq_len = key_states.shape[-2] ++-+ ++-+ # 5. [重要] 准备 Attention Mask ++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++-+ fa_attention_mask = None ++-+ if attention_mask is not None: ++-+ # 截取与当前key长度匹配的部分 ++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False ++-+ fa_attention_mask = (mask_slice != 0) ++-+ ++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++-+ input_dtype = query_states.dtype ++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++-+ query_states = query_states.to(mindspore.float16) ++-+ key_states = key_states.to(mindspore.float16) ++-+ value_states = value_states.to(mindspore.float16) ++-+ ++-+ # 6. [核心] 调用 flash_attention_score 算子 ++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA ++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++-+ attn_output = mindspore.ops.flash_attention_score( ++-+ query=query_states, ++-+ key=key_states, ++-+ value=value_states, ++-+ head_num=self.num_heads, # 传入Q的头数(N1) ++-+ attn_mask=fa_attention_mask, ++-+ keep_prob=1.0 - self.attention_dropout, ++-+ scalar_value=1.0 / math.sqrt(self.head_dim), ++-+ input_layout="BNSD", ++-+ sparse_mode=0 # 使用 defaultMask 模式 ++-+ ) ++-+ ++-+ # 恢复原始数据类型 ++-+ attn_output = attn_output.to(input_dtype) ++-+ ++-+ # 7. 调整输出形状 ++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++-+ attn_output = self.o_proj(attn_output) ++-+ ++-+ # FlashAttention 算子不直接返回注意力权重矩阵 ++-+ attn_weights = None ++-+ if output_attentions: ++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++-+ ++-+ return attn_output, attn_weights, past_key_value ++-+ ++-+ # def forward( ++-+ # self, ++-+ # hidden_states: mindspore.Tensor, ++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++-+ # position_ids: Optional[mindspore.Tensor] = None, ++-+ # past_key_value: Optional[Cache] = None, ++-+ # output_attentions: bool = False, ++-+ # use_cache: bool = False, ++-+ # cache_position: Optional[mindspore.Tensor] = None, ++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++-+ ++-+ # bsz, q_len, _ = hidden_states.shape ++-+ ++-+ # # 1. 线性投射 Q, K, V ++-+ # query_states = self.q_proj(hidden_states) ++-+ # key_states = self.k_proj(hidden_states) ++-+ # value_states = self.v_proj(hidden_states) ++-+ ++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ ++-+ # # 3. RoPE 旋转位置编码 ++-+ # kv_seq_len = key_states.shape[-2] ++-+ # if past_key_value is not None: ++-+ # if self.layer_idx is None: ++-+ # raise ValueError( ++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++-+ # "with a layer index." ++-+ # ) ++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++-+ ++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++-+ ++-+ # # 4. KV 缓存更新 ++-+ # if past_key_value is not None: ++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++-+ # key_states, value_states = past_key_value.update( ++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++-+ # ) ++-+ ++-+ # # 5. 准备 Attention Mask ++-+ # fa_attention_mask = None ++-+ # if attention_mask is not None: ++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++-+ # fa_attention_mask = (mask_slice != 0) ++-+ ++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++-+ # input_dtype = query_states.dtype ++-+ ++-+ # # 6. [核心] 调用 flash_attention_score 算子 ++-+ # attn_output = mindspore.ops.flash_attention_score( ++-+ # query=query_states, ++-+ # key=key_states, ++-+ # value=value_states, ++-+ # head_num=self.num_heads, ++-+ # attn_mask=fa_attention_mask, ++-+ # keep_prob=1.0 - self.attention_dropout, ++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++-+ # input_layout="BNSD", ++-+ # sparse_mode=0, ++-+ # # <--- 修改点 2: 启用内部高精度计算 --- ++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++-+ # inner_precise=1 ++-+ # ) ++-+ ++-+ # # 恢复原始数据类型 ++-+ # attn_output = attn_output.to(input_dtype) ++-+ ++-+ # # 7. 调整输出形状 ++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++-+ # attn_output = self.o_proj(attn_output) ++-+ ++-+ # attn_weights = None ++-+ # if output_attentions: ++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++-+ ++-+ # return attn_output, attn_weights, past_key_value ++-+ ++-+ # def forward( ++-+ # self, ++-+ # hidden_states: mindspore.Tensor, ++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++-+ # position_ids: Optional[mindspore.Tensor] = None, ++-+ # past_key_value: Optional[Cache] = None, ++-+ # output_attentions: bool = False, ++-+ # use_cache: bool = False, ++-+ # cache_position: Optional[mindspore.Tensor] = None, ++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++-+ ++-+ # bsz, q_len, _ = hidden_states.shape ++-+ ++-+ # query_states = self.q_proj(hidden_states) ++-+ # key_states = self.k_proj(hidden_states) ++-+ # value_states = self.v_proj(hidden_states) ++-+ ++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-+ ++-+ # kv_seq_len = key_states.shape[-2] ++-+ # if past_key_value is not None: ++-+ # if self.layer_idx is None: ++-+ # raise ValueError("`layer_idx` must be specified for caching") ++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++-+ ++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++-+ ++-+ # if past_key_value is not None: ++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++-+ # key_states, value_states = past_key_value.update( ++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++-+ # ) ++-+ ++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++-+ ++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- ++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++-+ # query_states = query_states / math.sqrt(self.head_dim) ++-+ # # <--- 修改结束 --- ++-+ ++-+ # fa_attention_mask = None ++-+ # if attention_mask is not None: ++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++-+ # fa_attention_mask = (mask_slice != 0) ++-+ ++-+ # input_dtype = query_states.dtype ++-+ ++-+ # attn_output = mindspore.ops.flash_attention_score( ++-+ # query=query_states, # 传入已经预先缩放过的 query ++-+ # key=key_states, ++-+ # value=value_states, ++-+ # head_num=self.num_heads, ++-+ # attn_mask=fa_attention_mask, ++-+ # keep_prob=1.0 - self.attention_dropout, ++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++-+ # input_layout="BNSD", ++-+ # sparse_mode=0, ++-+ # inner_precise=1 # 仍然保持内部高精度计算 ++-+ # ) ++-+ ++-+ # attn_output = attn_output.to(input_dtype) ++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++-+ # attn_output = self.o_proj(attn_output) ++-+ ++-+ # attn_weights = None ++-+ # if output_attentions: ++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++-+ ++-+ # return attn_output, attn_weights, past_key_value ++-+ ++- QWEN2MOE_ATTENTION_CLASSES = { ++- "eager": Qwen2MoeAttention, ++-+ "flash-attention": Qwen2MoeFlashAttention, ++- } ++- ++- ++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++- ++-+ #@dwj ++-+ # 只遍历激活的专家,而非全部专家 ++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++-- batch_size, sequence_length, hidden_dim = hidden_states.shape ++-- hidden_states = hidden_states.view(-1, hidden_dim) ++-- # router_logits: (batch * sequence_length, n_experts) ++-- router_logits = self.gate(hidden_states) ++-- ++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++-- if self.norm_topk_prob: ++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++-- # we cast back to the input dtype ++-- routing_weights = routing_weights.to(hidden_states.dtype) ++-- ++-- final_hidden_states = ops.zeros( ++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++-- ) ++-- ++-- # One hot encode the selected experts to create an expert mask ++-- # this will be used to easily index which expert is going to be sollicitated ++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++-- ++-- # Loop over all available experts in the model and perform the computation on each expert ++-- for expert_idx in range(self.num_experts): ++-- expert_layer = self.experts[expert_idx] ++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++-- ++-- # Index the correct hidden states and compute the expert hidden state for ++-- # the current expert. We need to make sure to multiply the output hidden ++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++-- if 0 not in idx.shape: ++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++-- ++-- # However `index_add_` only support torch tensors for indexing so we'll use ++-- # the `top_x` tensor here. ++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++-- ++-- shared_expert_output = self.shared_expert(hidden_states) ++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++-- ++-- final_hidden_states = final_hidden_states + shared_expert_output ++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape ++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++-+ num_tokens = hidden_states_reshaped.shape[0] ++-+ ++-+ router_logits = self.gate(hidden_states_reshaped) ++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++-+ ++-+ if self.norm_topk_prob: ++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++-+ routing_weights = routing_weights.to(hidden_states.dtype) ++-+ ++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++-+ flat_selected_experts = selected_experts.flatten() ++-+ ++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++-+ token_indices = broadcasted_token_indices.flatten() ++-+ ++-+ active_experts = ops.unique(flat_selected_experts) ++-+ ++-+ for expert_idx_tensor in active_experts: ++-+ expert_idx = expert_idx_tensor.item() ++-+ expert_layer = self.experts[expert_idx] ++-+ ++-+ mask = (flat_selected_experts == expert_idx_tensor) ++-+ selected_token_indices = token_indices[mask] ++-+ selected_routing_weights = routing_weights.flatten()[mask] ++-+ ++-+ current_states = hidden_states_reshaped[selected_token_indices] ++-+ ++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++-+ ++-+ final_hidden_states = final_hidden_states.index_add( ++-+ dim=0, ++-+ index=selected_token_indices, ++-+ source=expert_output.to(hidden_states.dtype) ++-+ ) ++-+ ++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++- ++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++-- return final_hidden_states, router_logits ++-+ final_hidden_states = final_hidden_states + shared_expert_output ++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++-+ ++-+ return final_hidden_states, router_logits ++- ++- ++- class Qwen2MoeDecoderLayer(nn.Module): ++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++- ++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++- ++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++-+ ++- if (layer_idx not in config.mlp_only_layers) and ( ++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++- ): ++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++- _no_split_modules = ["Qwen2MoeDecoderLayer"] ++- _skip_keys_device_placement = "past_key_values" ++- _supports_cache_class = True ++-+#lwx ++-+ # _supports_static_cache = True ++- ++- def _init_weights(self, module): ++- std = self.config.initializer_range ++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++- return causal_mask ++- ++- ++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++- _tied_weights_keys = ["lm_head.weight"] ++- ++- def __init__(self, config): ++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++- self.num_experts_per_tok = config.num_experts_per_tok ++- # Initialize weights and apply final processing ++- self.post_init() ++-+ # @lwx ++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++-+ # self.generation_config.cache_implementation = "static" ++-+ self._warmed_up = False ++-+ ++-+ def warmup_moe_model(self): ++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") ++-+ test_texts = [ ++-+ "warmup short", ++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++-+ ] ++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++-+ if tokenizer is None: ++-+ from mindnlp.transformers import AutoTokenizer ++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++-+ self._warmup_tokenizer = tokenizer ++-+ ++-+ for text in test_texts: ++-+ inputs = tokenizer(text, return_tensors="ms") ++-+ with mindspore._no_grad(): ++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) ++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") ++- ++- def get_input_embeddings(self): ++- return self.model.embed_tokens ++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++- ```""" ++-+ if not self._warmed_up: ++-+ self._warmed_up = True ++-+ self.warmup_moe_model() ++- ++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++- output_router_logits = ( ++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++- } ++- ) ++- return model_inputs ++-+# @lwx ++-+ # def _decode_one_tokens_logits( ++-+ # self, ++-+ # cur_token: mindspore.Tensor, ++-+ # input_pos: Optional[mindspore.Tensor], ++-+ # cache_position: mindspore.Tensor, ++-+ # past_key_values: StaticCache, ++-+ # ) -> mindspore.Tensor: ++-+ # """ ++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++-+ ++-+ # Args: ++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++-+ # input_pos: 输入位置信息,可选 ++-+ # cache_position: 当前token在cache中的位置,shape为(1,) ++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 ++-+ ++-+ # Returns: ++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++-+ # """ ++-+ # # 调用JIT编译的版本 ++-+ # return self.get_decode_one_tokens_logits( ++-+ # cur_token=cur_token, ++-+ # input_pos=input_pos, ++-+ # cache_position=cache_position, ++-+ # past_key_values=past_key_values, ++-+ # ) ++-+ ++-+ # @mindspore.jit(jit_level='O1') ++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++-+ # """ ++-+ # JIT编译的函数,用于高效的单token解码 ++-+ # 使用JIT编译优化以支持静态shape和高效执行 ++-+ ++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++-+ # """ ++-+ # outputs = self.model.forward( ++-+ # input_ids=cur_token, ++-+ # position_ids=input_pos, ++-+ # cache_position=cache_position, ++-+ # past_key_values=past_key_values, ++-+ # use_cache=True, ++-+ # return_dict=False, ++-+ # ) ++-+ ++-+ # hidden_states = outputs[0] ++-+ # logits = self.lm_head.forward(hidden_states) ++-+ # logits = logits.float() ++-+ ++-+ # return logits[:, -1, :] ++-+ ++-+ # def _sample( ++-+ # self, ++-+ # input_ids: mindspore.Tensor, ++-+ # logits_processor, ++-+ # stopping_criteria, ++-+ # generation_config, ++-+ # synced_devices: bool, ++-+ # streamer=None, ++-+ # logits_warper=None, ++-+ # **model_kwargs, ++-+ # ): ++-+ # """ ++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++-+ # """ ++-+ # from ...generation.logits_process import LogitsProcessorList ++-+ # from ...generation.stopping_criteria import StoppingCriteriaList ++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++-+ # from mindnlp.core import nn, ops, no_grad ++-+ # import numpy as np ++-+ ++-+ # # 检查是否使用 StaticCache ++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++-+ # # 否则,直接调用父类方法 ++-+ # past_key_values = model_kwargs.get("past_key_values") ++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++-+ ++-+ # if not isinstance(past_key_values, StaticCache): ++-+ # # 不使用 StaticCache,直接调用父类方法 ++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++-+ # return super()._sample( ++-+ # input_ids=input_ids, ++-+ # logits_processor=logits_processor, ++-+ # stopping_criteria=stopping_criteria, ++-+ # generation_config=generation_config, ++-+ # synced_devices=synced_devices, ++-+ # streamer=streamer, ++-+ # logits_warper=logits_warper, ++-+ # **model_kwargs, ++-+ # ) ++-+ ++-+ # # 使用 StaticCache,进入自定义循环 ++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++-+ # pad_token_id = generation_config._pad_token_tensor ++-+ # output_attentions = generation_config.output_attentions ++-+ # output_hidden_states = generation_config.output_hidden_states ++-+ # output_scores = generation_config.output_scores ++-+ # output_logits = generation_config.output_logits ++-+ # return_dict_in_generate = generation_config.return_dict_in_generate ++-+ # max_length = generation_config.max_length ++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++-+ # do_sample = generation_config.do_sample ++-+ ++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++-+ # raise ValueError( ++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++-+ # f"{logits_warper})." ++-+ # ) ++-+ ++-+ # # init attention / hidden states / scores tuples ++-+ # scores = () if (return_dict_in_generate and output_scores) else None ++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++-+ ++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: ++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++-+ # encoder_hidden_states = ( ++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++-+ # ) ++-+ ++-+ # # keep track of which sequences are already finished ++-+ # batch_size, cur_len = input_ids.shape ++-+ # this_peer_finished = False ++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++-+ ++-+ # time_record = [] ++-+ # from ....utils.testing_utils import parse_flag_from_env ++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++-+ ++-+ # while self._has_unfinished_sequences( ++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++-+ # ): ++-+ # if _record_time: ++-+ # import time as time_module ++-+ # infer_start = time_module.time() ++-+ ++-+ # # prepare model inputs ++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++-+ ++-+ # # prepare variable output controls ++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++-+ ++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++-+ # cur_cache_position = model_inputs.get("cache_position") ++-+ # cur_past_key_values = model_inputs.get("past_key_values") ++-+ # cur_input_ids = model_inputs.get("input_ids") ++-+ ++-+ # if (isinstance(cur_past_key_values, StaticCache) and ++-+ # cur_cache_position is not None and ++-+ # len(cur_cache_position.shape) > 0 and ++-+ # cur_cache_position.shape[0] == 1 and ++-+ # cur_input_ids is not None and ++-+ # cur_input_ids.shape[1] == 1): ++-+ # # 使用 JIT 优化的单 token 解码 ++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++-+ # if not hasattr(self, '_jit_used'): ++-+ # self._jit_used = False ++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++-+ ++-+ # next_token_logits = self.get_decode_one_tokens_logits( ++-+ # cur_token=cur_input_ids, ++-+ # input_pos=model_inputs.get("position_ids"), ++-+ # cache_position=cur_cache_position, ++-+ # past_key_values=cur_past_key_values, ++-+ # ) ++-+ ++-+ # # 标记已使用JIT(用于后续判断) ++-+ # if not self._jit_used: ++-+ # self._jit_used = True ++-+ ++-+ # # 构造兼容的输出对象 ++-+ # class JitOptimizedOutput: ++-+ # def __init__(self, logits, config): ++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++-+ # self.config = config ++-+ # # 对于 JIT 优化路径,这些属性通常不需要 ++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None ++-+ # self.attentions = None if not config.is_encoder_decoder else None ++-+ # self.cross_attentions = None ++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++-+ # self.hidden_states = None if not config.is_encoder_decoder else None ++-+ ++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++-+ # else: ++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++-+ # outputs = self(**model_inputs, return_dict=True) ++-+ ++-+ # if synced_devices and this_peer_finished: ++-+ # continue ++-+ ++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++-+ # next_token_logits = outputs.logits[:, -1, :] ++-+ ++-+ # # pre-process distribution ++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) ++-+ # if do_sample: ++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) ++-+ ++-+ # # Store scores, attentions and hidden_states when required ++-+ # if return_dict_in_generate: ++-+ # if output_scores: ++-+ # scores += (next_token_scores,) ++-+ # if output_logits: ++-+ # raw_logits += (next_token_logits,) ++-+ # if output_attentions: ++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++-+ # decoder_attentions += (attn,) if attn is not None else (None,) ++-+ # if self.config.is_encoder_decoder: ++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++-+ ++-+ # if output_hidden_states: ++-+ # hidden = ( ++-+ # outputs.decoder_hidden_states ++-+ # if self.config.is_encoder_decoder ++-+ # else outputs.hidden_states ++-+ # ) ++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++-+ ++-+ # # token selection ++-+ # if do_sample: ++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++-+ # else: ++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++-+ ++-+ # # finished sentences should have their next token be a padding token ++-+ # if has_eos_stopping_criteria: ++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++-+ ++-+ # # update generated ids, model inputs, and length for next step ++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++-+ # if streamer is not None: ++-+ # streamer.put(next_tokens) ++-+ ++-+ # model_kwargs = self._update_model_kwargs_for_generation( ++-+ # outputs, ++-+ # model_kwargs, ++-+ # is_encoder_decoder=self.config.is_encoder_decoder, ++-+ # ) ++-+ ++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++-+ # cur_len += 1 ++-+ ++-+ # if _record_time: ++-+ # import time as time_module ++-+ # infer_stop = time_module.time() ++-+ # time_record.append(infer_stop - infer_start) ++-+ ++-+ # del outputs ++-+ ++-+ # average_infer_time = None ++-+ # if time_record: ++-+ # if len(time_record) > 1: ++-+ # time_record.pop(0) ++-+ # average_infer_time = sum(time_record) / len(time_record) ++-+ # print(f'average inference time is: {average_infer_time}') ++-+ # print(f'inference time record: {time_record}') ++-+ ++-+ # if streamer is not None: ++-+ # streamer.end() ++-+ ++-+ # # 简单判断:打印是否使用了JIT路径 ++-+ # if hasattr(self, '_jit_used') and self._jit_used: ++-+ # print("[JIT] ✓ JIT optimization was used during generation") ++-+ # else: ++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++-+ ++-+ # if return_dict_in_generate: ++-+ # if self.config.is_encoder_decoder: ++-+ # return GenerateEncoderDecoderOutput( ++-+ # sequences=input_ids, ++-+ # scores=scores, ++-+ # logits=raw_logits, ++-+ # encoder_attentions=encoder_attentions, ++-+ # encoder_hidden_states=encoder_hidden_states, ++-+ # decoder_attentions=decoder_attentions, ++-+ # cross_attentions=cross_attentions, ++-+ # decoder_hidden_states=decoder_hidden_states, ++-+ # past_key_values=model_kwargs.get("past_key_values"), ++-+ # average_infer_time=average_infer_time ++-+ # ) ++-+ # else: ++-+ # return GenerateDecoderOnlyOutput( ++-+ # sequences=input_ids, ++-+ # scores=scores, ++-+ # logits=raw_logits, ++-+ # attentions=decoder_attentions, ++-+ # hidden_states=decoder_hidden_states, ++-+ # past_key_values=model_kwargs.get("past_key_values"), ++-+ # average_infer_time=average_infer_time ++-+ # ) ++-+ # else: ++-+ # return input_ids ++-+ ++-+ # def _prepare_cache_for_generation( ++-+ # self, ++-+ # generation_config, ++-+ # model_kwargs, ++-+ # assistant_model, ++-+ # batch_size, ++-+ # max_cache_length, ++-+ # ): ++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: ++-+ # generation_config.cache_implementation = "static" ++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++-+ ++-+ # if generation_config.cache_implementation == "static": ++-+ # base_required_from_max_length = generation_config.max_length + 1 ++-+ # base_required = max(max_cache_length, base_required_from_max_length) ++-+ # min_cache_size = 50 ++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++-+ # else: ++-+ # max_cache_length = max(base_required, min_cache_size) ++-+ ++-+ # original_max_cache_length = max_cache_length ++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") ++-+ # print(f" - input max_cache_length: {original_max_cache_length}") ++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") ++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++-+ # print(f" - final max_cache_length: {max_cache_length}") ++-+ ++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++-+ # if max_cache_length > self.config.max_position_embeddings: ++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++-+ ++-+ # result = super()._prepare_cache_for_generation( ++-+ # generation_config=generation_config, ++-+ # model_kwargs=model_kwargs, ++-+ # assistant_model=assistant_model, ++-+ # batch_size=batch_size, ++-+ # max_cache_length=max_cache_length, ++-+ # ) ++-+ ++-+ # if generation_config.cache_implementation == "static": ++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++-+ # created_cache = model_kwargs.get(cache_name) ++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++-+ # if created_cache.max_cache_len < generation_config.max_length: ++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++-+ ++-+ # return result ++-+ ++-+ ++-+ ++- ++- ++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++--- ++-2.27.0 ++- ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" new file mode 100644 index 00000000..bbe6df27 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" @@ -0,0 +1,7707 @@ +From ab47c0478530d34d2b48200af0453dda94d1ec18 Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Fri, 7 Nov 2025 11:48:18 +0800 +Subject: [PATCH 05/10] 20251107001commit + +--- + .../models/deepseek/modeling_deepseek.py | 91 +- + .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- + .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- + patches/0001-20251104commit.patch | 2 +- + patches/0002-20251106commit.patch | 2 +- + patches/0003-20261106secondcommit.patch | 2 +- + patches/0004-20251106change.patch | 7498 +++++++++++++++++ + 7 files changed, 7577 insertions(+), 30 deletions(-) + create mode 100644 patches/0004-20251106change.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 0546f318..8831e4b7 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): + # expert_cache += expert_out * weight + # return expert_cache + +- @no_grad() +- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- # x 的 shape: (1, hidden_size) +- # flat_expert_indices 的 shape: (num_experts_per_tok,) +- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +- +- # 1. 收集所有需要的专家层 +- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +- selected_experts = [self.experts[i] for i in flat_expert_indices] +- +- # 2. 并行计算所有专家的输出 +- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +- # ops.cat 会将它们堆叠成一个新的 Tensor +- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +- +- # 3. 使用矩阵乘法进行加权求和 +- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- # 最终结果 final_output 的 shape: (1, hidden_size) +- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++ # @no_grad() ++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ # # x 的 shape: (1, hidden_size) ++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) ++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++ ++ # # 1. 收集所有需要的专家层 ++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++ # selected_experts = [self.experts[i] for i in flat_expert_indices] ++ ++ # # 2. 并行计算所有专家的输出 ++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++ # # ops.cat 会将它们堆叠成一个新的 Tensor ++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++ ++ # # 3. 使用矩阵乘法进行加权求和 ++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ # # 最终结果 final_output 的 shape: (1, hidden_size) ++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) + +- return final_output ++ # return final_output + + + # @no_grad() +@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): + ) + + return expert_cache ++# 放置在 DeepseekMoE 类中 ++ @no_grad() ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ """ ++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++ ++ Args: ++ x (Tensor): 输入张量, shape: (1, hidden_size) ++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++ """ ++ top_k, _ = flat_expert_weights.shape ++ hidden_size = x.shape[-1] ++ ++ # 1. 将所有专家的权重堆叠起来 ++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++ ++ # 2. "收集" 所需的专家权重 ++ selected_gate_w = stacked_gate_w[flat_expert_indices] ++ selected_up_w = stacked_up_w[flat_expert_indices] ++ selected_down_w = stacked_down_w[flat_expert_indices] ++ ++ # 3. 准备输入 ++ x_expanded = x.expand((top_k, 1, hidden_size)) ++ ++ # 4. 并行计算 gate_proj 和 up_proj ++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++ ++ # 5. 计算中间状态 ++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++ ++ # 6. 并行计算 down_proj ++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++ # --- [FIX] --- ++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++ # --- [FIX END] --- ++ ++ # 7. 根据路由权重进行加权求和 ++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++ ++ return weighted_sum ++ ++ + + # @no_grad() + # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index ebd7782e..913a7609 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): + # Copied from transformers.models.llama.modeling_llama.rotate_half + def rotate_half(x): + """Rotates half the hidden dims of the input.""" +- x1 = x[..., : x.shape[-1] // 2] +- x2 = x[..., x.shape[-1] // 2 :] ++ # x1 = x[..., : x.shape[-1] // 2] ++ # x2 = x[..., x.shape[-1] // 2 :] + # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + + +diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +index d059dcbe..2b217b64 100644 +--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): + # Copied from transformers.models.llama.modeling_llama.rotate_half + def rotate_half(x): + """Rotates half the hidden dims of the input.""" +- x1 = x[..., : x.shape[-1] // 2] +- x2 = x[..., x.shape[-1] // 2 :] ++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++ # x1 = x[..., : x.shape[-1] // 2] ++ # x2 = x[..., x.shape[-1] // 2 :] ++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + + +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +index 78f22642..0a0ef2d7 100644 +--- a/patches/0001-20251104commit.patch ++++ b/patches/0001-20251104commit.patch +@@ -1,7 +1,7 @@ + From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/3] 20251104commit ++Subject: [PATCH 1/4] 20251104commit + + --- + mindnlp/transformers/cache_utils.py | 28 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +index 22b65dd5..5185270c 100644 +--- a/patches/0002-20251106commit.patch ++++ b/patches/0002-20251106commit.patch +@@ -1,7 +1,7 @@ + From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/3] 20251106commit ++Subject: [PATCH 2/4] 20251106commit + + --- + .../models/deepseek/modeling_deepseek.py | 379 ++++- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +index 966529e4..3e05f821 100644 +--- a/patches/0003-20261106secondcommit.patch ++++ b/patches/0003-20261106secondcommit.patch +@@ -1,7 +1,7 @@ + From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/3] 20261106secondcommit ++Subject: [PATCH 3/4] 20261106secondcommit + + --- + .../models/deepseek/modeling_deepseek.py | 217 ++- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +new file mode 100644 +index 00000000..88a1aef4 +--- /dev/null ++++ b/patches/0004-20251106change.patch +@@ -0,0 +1,7498 @@ ++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Thu, 6 Nov 2025 15:48:09 +0800 ++Subject: [PATCH 4/4] 20251106change ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 189 +- ++ patches/0001-20251104commit.patch | 1272 +++++++ ++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ ++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ ++ 4 files changed, 7244 insertions(+), 186 deletions(-) ++ create mode 100644 patches/0001-20251104commit.patch ++ create mode 100644 patches/0002-20251106commit.patch ++ create mode 100644 patches/0003-20261106secondcommit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index 2f9192bf..0546f318 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): ++ ++ return attn_output, attn_weights, past_key_value ++ ++-# class DeepseekFlashAttention(nn.Module): ++-# """ ++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++- ++-# This class is designed as a drop-in replacement for DeepseekAttention. ++-# """ ++- ++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++-# super().__init__() ++-# self.config = config ++-# self.layer_idx = layer_idx ++-# if layer_idx is None: ++-# logger.warning( ++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++-# "when creating this class." ++-# ) ++- ++-# self.attention_dropout = config.attention_dropout ++-# self.hidden_size = config.hidden_size ++-# self.num_heads = config.num_attention_heads ++-# self.head_dim = self.hidden_size // self.num_heads ++-# self.num_key_value_heads = config.num_key_value_heads ++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++-# self.max_position_embeddings = config.max_position_embeddings ++-# self.rope_theta = config.rope_theta ++-# self.is_causal = True ++- ++-# if (self.head_dim * self.num_heads) != self.hidden_size: ++-# raise ValueError( ++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++-# f" and `num_heads`: {self.num_heads})." ++-# ) ++- ++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++-# self._init_rope() ++- ++-# def _init_rope(self): ++-# if self.config.rope_scaling is None: ++-# self.rotary_emb = DeepseekRotaryEmbedding( ++-# self.head_dim, ++-# max_position_embeddings=self.max_position_embeddings, ++-# base=self.rope_theta, ++-# ) ++-# else: ++-# scaling_type = self.config.rope_scaling["type"] ++-# scaling_factor = self.config.rope_scaling["factor"] ++-# if scaling_type == "linear": ++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++-# self.head_dim, ++-# max_position_embeddings=self.max_position_embeddings, ++-# scaling_factor=scaling_factor, ++-# base=self.rope_theta, ++-# ) ++-# elif scaling_type == "dynamic": ++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++-# self.head_dim, ++-# max_position_embeddings=self.max_position_embeddings, ++-# scaling_factor=scaling_factor, ++-# base=self.rope_theta, ++-# ) ++-# else: ++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++- ++-# def forward( ++-# self, ++-# hidden_states: mindspore.Tensor, ++-# attention_mask: Optional[mindspore.Tensor] = None, ++-# position_ids: Optional[mindspore.Tensor] = None, ++-# past_key_value: Optional[Cache] = None, ++-# output_attentions: bool = False, ++-# use_cache: bool = False, ++-# **kwargs, ++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++-# if "padding_mask" in kwargs: ++-# warnings.warn( ++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++-# ) ++- ++-# if output_attentions: ++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++- ++-# bsz, q_len, _ = hidden_states.shape ++- ++-# if self.config.pretraining_tp > 1: ++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++- ++-# query_states = self.q_proj(hidden_states) ++-# key_states = self.k_proj(hidden_states) ++-# value_states = self.v_proj(hidden_states) ++- ++-# # Reshape for multi-head attention ++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++- ++-# kv_seq_len = key_states.shape[-2] ++-# if past_key_value is not None: ++-# if self.layer_idx is None: ++-# raise ValueError( ++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++-# "with a layer index." ++-# ) ++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++- ++-# # Apply Rotary Positional Embedding ++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++- ++-# if past_key_value is not None: ++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++- ++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++- ++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++- ++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++- ++-# # Convert attention_mask for flash_attention_score ++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++-# if attention_mask is not None: ++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++-# raise ValueError( ++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++-# ) ++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++-# else: ++-# attn_mask_for_fa = None ++- ++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++- ++-# # Call the fused flash_attention_score operator ++-# attn_output = mindspore.ops.flash_attention_score( ++-# query=query_states_for_fa, ++-# key=key_states_for_fa, ++-# value=value_states_for_fa, ++-# head_num=self.num_heads, # This is N1, the number of query heads ++-# input_layout='BSH', ++-# attn_mask=attn_mask_for_fa, ++-# keep_prob=keep_prob, ++-# scalar_value=1.0 / math.sqrt(self.head_dim), ++-# sparse_mode=0 # Default mask mode ++-# ) ++- ++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++-# attn_output = self.o_proj(attn_output) ++- ++-# # Flash Attention does not return attention weights ++-# attn_weights = None ++- ++-# return attn_output, attn_weights, past_key_value ++ ++ class DeepseekFlashAttention(nn.Module): ++ """ ++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): ++ super().__init__() ++ self.hidden_size = config.hidden_size ++ ++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++- config=config, layer_idx=layer_idx ++- ) +++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +++ # config=config, layer_idx=layer_idx +++ # ) ++ ++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++ config=config, layer_idx=layer_idx ++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): ++ return outputs ++ ++ ++- ++ class DeepseekPreTrainedModel(PreTrainedModel): ++ config_class = DeepseekConfig ++ base_model_prefix = "model" ++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ # Initialize weights and apply final processing ++ self.post_init() ++ self.warm_up = False ++- #@dwj ++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++- self.num_layers, ++- self.num_attention_heads, ++- self.head_dim, ++- batch_size=1, ++- max_length=self.max_length, ++- dtype=mindspore.float16 ++- ) ++- ++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++- key_cache = [] ++- value_cache = [] ++- for _ in range(num_layers): ++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++- key_cache.append(k) ++- value_cache.append(v) ++- return key_cache, value_cache ++- ++ ++ def warmup_moe_model_deep(self): ++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++new file mode 100644 ++index 00000000..78f22642 ++--- /dev/null +++++ b/patches/0001-20251104commit.patch ++@@ -0,0 +1,1272 @@ +++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++Subject: [PATCH 1/3] 20251104commit +++ +++--- +++ mindnlp/transformers/cache_utils.py | 28 +- +++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++ 3 files changed, 976 insertions(+), 87 deletions(-) +++ +++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++index cadd2e04..02f8d4be 100644 +++--- a/mindnlp/transformers/cache_utils.py ++++++ b/mindnlp/transformers/cache_utils.py +++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++ # k_out[:, :, cache_position] = key_states +++ # v_out[:, :, cache_position] = value_states +++- if ON_ORANGE_PI: +++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++- else: +++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++- ++++ # if ON_ORANGE_PI: ++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++ # else: ++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++ if cache_position.ndim > 1: ++++ cache_position = cache_position.flatten() ++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++ cache_position = cache_position.int() ++++ ++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++ k_out[:, :, cache_position] = key_states ++++ v_out[:, :, cache_position] = value_states ++++ +++ return k_out, v_out +++ +++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index c695b944..d8303e45 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++- x1 = x[..., : x.shape[-1] // 2] +++- x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++ # x1 = x[..., : x.shape[-1] // 2] ++++ # x2 = x[..., x.shape[-1] // 2 :] ++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++ if self.training: +++ raise NotImplementedError("Training is not supported yet.") +++ else: +++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++- if self.config.n_shared_experts is not None: +++- y = y + self.shared_experts(identity) +++- return y ++++ # @lwx ++++ if orig_shape[1] == 1: ++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++ y=y.view(*orig_shape) ++++ if self.config.n_shared_experts is not None: ++++ y = y + self.shared_experts(identity) ++++ return y ++++ else: ++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++ if self.config.n_shared_experts is not None: ++++ y = y + self.shared_experts(identity) ++++ return y ++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++ # if self.config.n_shared_experts is not None: ++++ # y = y + self.shared_experts(identity) ++++ # return y ++++ ++++ @no_grad() ++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ ++++ expert_cache = ops.zeros_like(x) ++++ for i in range(self.num_experts_per_tok): ++++ expert_id = flat_expert_indices[i].item() ++++ weight = flat_expert_weights[i].item() ++++ expert = self.experts[expert_id] ++++ expert_out = expert(x) ++++ expert_cache += expert_out * weight ++++ return expert_cache +++ +++ @no_grad() +++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++- # expert_cache = torch.zeros_like(x) +++- # idxs = flat_expert_indices.argsort() +++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++- # token_idxs = idxs // self.num_experts_per_tok +++- # for i, end_idx in enumerate(tokens_per_expert): +++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++- # if start_idx == end_idx: +++- # continue +++- # expert = self.experts[i] +++- # exp_token_idx = token_idxs[start_idx:end_idx] +++- # expert_tokens = x[exp_token_idx] +++- # expert_out = expert(expert_tokens) +++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++- # return expert_cache ++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ expert_cache = ops.zeros_like(x) +++ idxs = flat_expert_indices.argsort() +++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++ token_idxs = idxs // self.num_experts_per_tok ++++ +++ for i, end_idx in enumerate(tokens_per_expert): +++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++ if start_idx == end_idx: +++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++ expert_out = expert(expert_tokens) +++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ +++ return expert_cache ++++ ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # # expert_cache = torch.zeros_like(x) ++++ # # idxs = flat_expert_indices.argsort() ++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++ # # token_idxs = idxs // self.num_experts_per_tok ++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++ # # if start_idx == end_idx: ++++ # # continue ++++ # # expert = self.experts[i] ++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # # expert_tokens = x[exp_token_idx] ++++ # # expert_out = expert(expert_tokens) ++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++ # # return expert_cache ++++ # expert_cache = ops.zeros_like(x) ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # for i, end_idx in enumerate(tokens_per_expert): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # if start_idx == end_idx: ++++ # continue ++++ # expert = self.experts[i] ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = expert(expert_tokens) ++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ ++++ # return expert_cache ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # expert_cache = ops.zeros_like(x) ++++ ++++ # # 排序保证顺序一致 ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # # 找出有 token 的专家 ++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++ ++++ # for i in active_experts.tolist(): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # end_idx = tokens_per_expert[i] ++++ # if start_idx == end_idx: # 没有 token ++++ # continue ++++ ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = self.experts[i](expert_tokens) ++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++ ++++ # expert_cache = mindspore.mint.scatter_add( ++++ # expert_cache, ++++ # 0, ++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++ # expert_out ++++ # ) ++++ ++++ # return expert_cache ++++ ++++ +++ +++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++ # """ +++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ +++ # Initialize weights and apply final processing +++ self.post_init() ++++ self.warm_up = False ++++ ++++ def warmup_moe_model_deep(self): ++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++ test_texts = [ ++++ "warmup short", ++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++ ] ++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++ if tokenizer is None: ++++ from mindnlp.transformers import AutoTokenizer ++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++ self._warmup_tokenizer = tokenizer ++++ ++++ for text in test_texts: ++++ inputs = tokenizer(text, return_tensors="ms") ++++ with mindspore._no_grad(): ++++ _ = self(**inputs, use_cache=False) ++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++ +++ def get_input_embeddings(self): +++ return self.model.embed_tokens +++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++ ```""" ++++ if not self.warm_up: ++++ self.warm_up = True ++++ self.warmup_moe_model_deep() ++++ +++ output_attentions = ( +++ output_attentions +++ if output_attentions is not None +++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++index 3cbf820e..d4c6b651 100644 +++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++@@ -18,7 +18,6 @@ +++ # See the License for the specific language governing permissions and +++ # limitations under the License. +++ """MindSpore Qwen2MoE model.""" +++- +++ import math +++ from typing import List, Optional, Tuple, Union +++ +++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++ TokenClassifierOutput, +++ ) +++ from ...modeling_utils import PreTrainedModel ++++from ...generation import GenerationMixin +++ from ....utils import logging +++ from .configuration_qwen2_moe import Qwen2MoeConfig +++ +++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++ self.variance_epsilon = eps +++ +++ def forward(self, hidden_states): ++++ # @dwj ++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++ # @lwx ++++ # if not self.training : ++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++ input_dtype = hidden_states.dtype +++ hidden_states = hidden_states.to(mindspore.float32) +++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++@@ -234,6 +239,8 @@ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++ x1 = x[..., : x.shape[-1] // 2] +++ x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++ self.config = config +++ self.hidden_size = config.hidden_size +++ self.intermediate_size = intermediate_size ++++ +++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++ self.act_fn = ACT2FN[config.hidden_act] +++ +++ def forward(self, x): +++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++- +++ ++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++ # @lwx ++++ # gate_up_output = self.gate_up_proj(x) ++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++ # return self.down_proj(swiglu_output) ++++ ++++ # def forward(self, x): ++++ # gate_proj_out = self.gate_proj(x) ++++ # up_proj_out = self.up_proj(x) ++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++ # return self.down_proj(swiglu_out) ++++ +++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++ """ +++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++ use_cache: bool = False, +++ cache_position: Optional[mindspore.Tensor] = None, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ ++++ +++ bsz, q_len, _ = hidden_states.shape +++ +++ query_states = self.q_proj(hidden_states) +++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++ "with a layer index." +++ ) +++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ if isinstance(past_key_value, StaticCache): ++++ kv_seq_len = key_states.shape[-2] ++++ else: ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++ if isinstance(past_key_value, StaticCache): ++++ kv_seq_len = key_states.shape[-2] +++ +++ # repeat k/v heads if n_kv_heads < n_heads +++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++- ++++ +++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++ +++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++- raise ValueError( +++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++- f" {attn_weights.shape}" +++- ) +++- +++- if attention_mask is not None: # no matter the length, we just slice it +++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++ if attention_mask is not None: ++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++ attn_weights = attn_weights + causal_mask +++ +++ # upcast attention to fp32 +++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++ +++ attn_output = self.o_proj(attn_output) +++- ++++ # @lwx ++++ ++++ # max_seq_len = self.max_position_embeddings # 2048 ++++ ++++ # if attention_mask is not None: ++++ # # attention_mask: [B, 1, Sq, Sk] ++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++ ++++ # # pad 到 [max_seq_len, max_seq_len] ++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++ # global_attention_mask = padded_mask ++++ # else: ++++ # global_attention_mask = None ++++ ++++ ++++ # sparse_mode=3 ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, ++++ # key=key_states, ++++ # value=value_states, ++++ # real_shift=None, ++++ # padding_mask=None, ++++ ++++ # head_num=self.num_heads, ++++ # attn_mask=global_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++ # input_layout="BNSD", ++++ # pre_tokens=2147483647, ++++ # next_tokens=2147483647, ++++ # inner_precise=0, ++++ # drop_mask=None, ++++ # prefix=None, ++++ # actual_seq_qlen=None, ++++ # actual_seq_kvlen=None, ++++ # sparse_mode=sparse_mode, ++++ # ) +++ if not output_attentions: +++ attn_weights = None +++ +++ return attn_output, attn_weights, past_key_value +++ +++ ++++class Qwen2MoeFlashAttention(nn.Module): ++++ """ ++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++ ++++ 关键改动: ++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++ 直接传入原始的 key 和 value 张量效率更高。 ++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++ """ ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++ super().__init__() ++++ self.config = config ++++ self.layer_idx = layer_idx ++++ self.hidden_size = config.hidden_size ++++ self.num_heads = config.num_attention_heads ++++ self.head_dim = self.hidden_size // self.num_heads ++++ self.num_key_value_heads = config.num_key_value_heads ++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++ self.max_position_embeddings = config.max_position_embeddings ++++ self.rope_theta = config.rope_theta ++++ self.attention_dropout = config.attention_dropout ++++ ++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++ raise ValueError( ++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++ ) ++++ ++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++ ++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++ self.head_dim, ++++ max_position_embeddings=self.max_position_embeddings, ++++ base=self.rope_theta, ++++ ) ++++ ++++ def forward( ++++ self, ++++ hidden_states: mindspore.Tensor, ++++ attention_mask: Optional[mindspore.Tensor] = None, ++++ position_ids: Optional[mindspore.Tensor] = None, ++++ past_key_value: Optional[Cache] = None, ++++ output_attentions: bool = False, ++++ use_cache: bool = False, ++++ cache_position: Optional[mindspore.Tensor] = None, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ # 1. 线性投射 Q, K, V ++++ query_states = self.q_proj(hidden_states) ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # 3. RoPE 旋转位置编码 ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++ if self.layer_idx is None: ++++ raise ValueError( ++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ "with a layer index." ++++ ) ++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++ if cache_position.shape[0] == 1: ++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++ kv_seq_len = past_seen_tokens + 1 ++++ else: ++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++ else: ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # 4. KV 缓存更新 ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ key_states, value_states = past_key_value.update( ++++ key_states, value_states, self.layer_idx, cache_kwargs ++++ ) ++++ ++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++ if cache_position.shape[0] == 1: ++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++ kv_seq_len = key_states.shape[-2] ++++ ++++ # 5. [重要] 准备 Attention Mask ++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++ fa_attention_mask = None ++++ if attention_mask is not None: ++++ # 截取与当前key长度匹配的部分 ++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++ fa_attention_mask = (mask_slice != 0) ++++ ++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++ input_dtype = query_states.dtype ++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++ query_states = query_states.to(mindspore.float16) ++++ key_states = key_states.to(mindspore.float16) ++++ value_states = value_states.to(mindspore.float16) ++++ ++++ # 6. [核心] 调用 flash_attention_score 算子 ++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++ attn_output = mindspore.ops.flash_attention_score( ++++ query=query_states, ++++ key=key_states, ++++ value=value_states, ++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++ attn_mask=fa_attention_mask, ++++ keep_prob=1.0 - self.attention_dropout, ++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++ input_layout="BNSD", ++++ sparse_mode=0 # 使用 defaultMask 模式 ++++ ) ++++ ++++ # 恢复原始数据类型 ++++ attn_output = attn_output.to(input_dtype) ++++ ++++ # 7. 调整输出形状 ++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ attn_output = self.o_proj(attn_output) ++++ ++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++ attn_weights = None ++++ if output_attentions: ++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++ # def forward( ++++ # self, ++++ # hidden_states: mindspore.Tensor, ++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++ # position_ids: Optional[mindspore.Tensor] = None, ++++ # past_key_value: Optional[Cache] = None, ++++ # output_attentions: bool = False, ++++ # use_cache: bool = False, ++++ # cache_position: Optional[mindspore.Tensor] = None, ++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ # bsz, q_len, _ = hidden_states.shape ++++ ++++ # # 1. 线性投射 Q, K, V ++++ # query_states = self.q_proj(hidden_states) ++++ # key_states = self.k_proj(hidden_states) ++++ # value_states = self.v_proj(hidden_states) ++++ ++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # # 3. RoPE 旋转位置编码 ++++ # kv_seq_len = key_states.shape[-2] ++++ # if past_key_value is not None: ++++ # if self.layer_idx is None: ++++ # raise ValueError( ++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ # "with a layer index." ++++ # ) ++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # # 4. KV 缓存更新 ++++ # if past_key_value is not None: ++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ # key_states, value_states = past_key_value.update( ++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++ # ) ++++ ++++ # # 5. 准备 Attention Mask ++++ # fa_attention_mask = None ++++ # if attention_mask is not None: ++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # fa_attention_mask = (mask_slice != 0) ++++ ++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++ # input_dtype = query_states.dtype ++++ ++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, ++++ # key=key_states, ++++ # value=value_states, ++++ # head_num=self.num_heads, ++++ # attn_mask=fa_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++ # input_layout="BNSD", ++++ # sparse_mode=0, ++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++ # inner_precise=1 ++++ # ) ++++ ++++ # # 恢复原始数据类型 ++++ # attn_output = attn_output.to(input_dtype) ++++ ++++ # # 7. 调整输出形状 ++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ # attn_output = self.o_proj(attn_output) ++++ ++++ # attn_weights = None ++++ # if output_attentions: ++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++ # return attn_output, attn_weights, past_key_value ++++ ++++ # def forward( ++++ # self, ++++ # hidden_states: mindspore.Tensor, ++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++ # position_ids: Optional[mindspore.Tensor] = None, ++++ # past_key_value: Optional[Cache] = None, ++++ # output_attentions: bool = False, ++++ # use_cache: bool = False, ++++ # cache_position: Optional[mindspore.Tensor] = None, ++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ # bsz, q_len, _ = hidden_states.shape ++++ ++++ # query_states = self.q_proj(hidden_states) ++++ # key_states = self.k_proj(hidden_states) ++++ # value_states = self.v_proj(hidden_states) ++++ ++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ # kv_seq_len = key_states.shape[-2] ++++ # if past_key_value is not None: ++++ # if self.layer_idx is None: ++++ # raise ValueError("`layer_idx` must be specified for caching") ++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ # if past_key_value is not None: ++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ # key_states, value_states = past_key_value.update( ++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++ # ) ++++ ++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++ ++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++ # query_states = query_states / math.sqrt(self.head_dim) ++++ # # <--- 修改结束 --- ++++ ++++ # fa_attention_mask = None ++++ # if attention_mask is not None: ++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ # fa_attention_mask = (mask_slice != 0) ++++ ++++ # input_dtype = query_states.dtype ++++ ++++ # attn_output = mindspore.ops.flash_attention_score( ++++ # query=query_states, # 传入已经预先缩放过的 query ++++ # key=key_states, ++++ # value=value_states, ++++ # head_num=self.num_heads, ++++ # attn_mask=fa_attention_mask, ++++ # keep_prob=1.0 - self.attention_dropout, ++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++ # input_layout="BNSD", ++++ # sparse_mode=0, ++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++ # ) ++++ ++++ # attn_output = attn_output.to(input_dtype) ++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ # attn_output = self.o_proj(attn_output) ++++ ++++ # attn_weights = None ++++ # if output_attentions: ++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++ ++++ # return attn_output, attn_weights, past_key_value ++++ +++ QWEN2MOE_ATTENTION_CLASSES = { +++ "eager": Qwen2MoeAttention, ++++ "flash-attention": Qwen2MoeFlashAttention, +++ } +++ +++ +++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ ++++ #@dwj ++++ # 只遍历激活的专家,而非全部专家 +++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++- hidden_states = hidden_states.view(-1, hidden_dim) +++- # router_logits: (batch * sequence_length, n_experts) +++- router_logits = self.gate(hidden_states) +++- +++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- if self.norm_topk_prob: +++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- # we cast back to the input dtype +++- routing_weights = routing_weights.to(hidden_states.dtype) +++- +++- final_hidden_states = ops.zeros( +++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++- ) +++- +++- # One hot encode the selected experts to create an expert mask +++- # this will be used to easily index which expert is going to be sollicitated +++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++- +++- # Loop over all available experts in the model and perform the computation on each expert +++- for expert_idx in range(self.num_experts): +++- expert_layer = self.experts[expert_idx] +++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++- +++- # Index the correct hidden states and compute the expert hidden state for +++- # the current expert. We need to make sure to multiply the output hidden +++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++- if 0 not in idx.shape: +++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++- +++- # However `index_add_` only support torch tensors for indexing so we'll use +++- # the `top_x` tensor here. +++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++- +++- shared_expert_output = self.shared_expert(hidden_states) +++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++- +++- final_hidden_states = final_hidden_states + shared_expert_output ++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++ num_tokens = hidden_states_reshaped.shape[0] ++++ ++++ router_logits = self.gate(hidden_states_reshaped) ++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++ if self.norm_topk_prob: ++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++ flat_selected_experts = selected_experts.flatten() ++++ ++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++ token_indices = broadcasted_token_indices.flatten() ++++ ++++ active_experts = ops.unique(flat_selected_experts) ++++ ++++ for expert_idx_tensor in active_experts: ++++ expert_idx = expert_idx_tensor.item() ++++ expert_layer = self.experts[expert_idx] ++++ ++++ mask = (flat_selected_experts == expert_idx_tensor) ++++ selected_token_indices = token_indices[mask] ++++ selected_routing_weights = routing_weights.flatten()[mask] ++++ ++++ current_states = hidden_states_reshaped[selected_token_indices] ++++ ++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++ ++++ final_hidden_states = final_hidden_states.index_add( ++++ dim=0, ++++ index=selected_token_indices, ++++ source=expert_output.to(hidden_states.dtype) ++++ ) ++++ ++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++ +++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++- return final_hidden_states, router_logits ++++ final_hidden_states = final_hidden_states + shared_expert_output ++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++ ++++ return final_hidden_states, router_logits +++ +++ +++ class Qwen2MoeDecoderLayer(nn.Module): +++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++ +++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++ +++ if (layer_idx not in config.mlp_only_layers) and ( +++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++ ): +++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++ _skip_keys_device_placement = "past_key_values" +++ _supports_cache_class = True ++++#lwx ++++ # _supports_static_cache = True +++ +++ def _init_weights(self, module): +++ std = self.config.initializer_range +++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++ return causal_mask +++ +++ +++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ _tied_weights_keys = ["lm_head.weight"] +++ +++ def __init__(self, config): +++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ self.num_experts_per_tok = config.num_experts_per_tok +++ # Initialize weights and apply final processing +++ self.post_init() ++++ # @lwx ++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++ # self.generation_config.cache_implementation = "static" ++++ self._warmed_up = False ++++ ++++ def warmup_moe_model(self): ++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++ test_texts = [ ++++ "warmup short", ++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++ ] ++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++ if tokenizer is None: ++++ from mindnlp.transformers import AutoTokenizer ++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++ self._warmup_tokenizer = tokenizer ++++ ++++ for text in test_texts: ++++ inputs = tokenizer(text, return_tensors="ms") ++++ with mindspore._no_grad(): ++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++ +++ def get_input_embeddings(self): +++ return self.model.embed_tokens +++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++ ```""" ++++ if not self._warmed_up: ++++ self._warmed_up = True ++++ self.warmup_moe_model() +++ +++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++ output_router_logits = ( +++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++ } +++ ) +++ return model_inputs ++++# @lwx ++++ # def _decode_one_tokens_logits( ++++ # self, ++++ # cur_token: mindspore.Tensor, ++++ # input_pos: Optional[mindspore.Tensor], ++++ # cache_position: mindspore.Tensor, ++++ # past_key_values: StaticCache, ++++ # ) -> mindspore.Tensor: ++++ # """ ++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++ ++++ # Args: ++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++ # input_pos: 输入位置信息,可选 ++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++ ++++ # Returns: ++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++ # """ ++++ # # 调用JIT编译的版本 ++++ # return self.get_decode_one_tokens_logits( ++++ # cur_token=cur_token, ++++ # input_pos=input_pos, ++++ # cache_position=cache_position, ++++ # past_key_values=past_key_values, ++++ # ) ++++ ++++ # @mindspore.jit(jit_level='O1') ++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++ # """ ++++ # JIT编译的函数,用于高效的单token解码 ++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++ ++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++ # """ ++++ # outputs = self.model.forward( ++++ # input_ids=cur_token, ++++ # position_ids=input_pos, ++++ # cache_position=cache_position, ++++ # past_key_values=past_key_values, ++++ # use_cache=True, ++++ # return_dict=False, ++++ # ) ++++ ++++ # hidden_states = outputs[0] ++++ # logits = self.lm_head.forward(hidden_states) ++++ # logits = logits.float() ++++ ++++ # return logits[:, -1, :] ++++ ++++ # def _sample( ++++ # self, ++++ # input_ids: mindspore.Tensor, ++++ # logits_processor, ++++ # stopping_criteria, ++++ # generation_config, ++++ # synced_devices: bool, ++++ # streamer=None, ++++ # logits_warper=None, ++++ # **model_kwargs, ++++ # ): ++++ # """ ++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++ # """ ++++ # from ...generation.logits_process import LogitsProcessorList ++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++ # from mindnlp.core import nn, ops, no_grad ++++ # import numpy as np ++++ ++++ # # 检查是否使用 StaticCache ++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++ # # 否则,直接调用父类方法 ++++ # past_key_values = model_kwargs.get("past_key_values") ++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++ ++++ # if not isinstance(past_key_values, StaticCache): ++++ # # 不使用 StaticCache,直接调用父类方法 ++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++ # return super()._sample( ++++ # input_ids=input_ids, ++++ # logits_processor=logits_processor, ++++ # stopping_criteria=stopping_criteria, ++++ # generation_config=generation_config, ++++ # synced_devices=synced_devices, ++++ # streamer=streamer, ++++ # logits_warper=logits_warper, ++++ # **model_kwargs, ++++ # ) ++++ ++++ # # 使用 StaticCache,进入自定义循环 ++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++ # pad_token_id = generation_config._pad_token_tensor ++++ # output_attentions = generation_config.output_attentions ++++ # output_hidden_states = generation_config.output_hidden_states ++++ # output_scores = generation_config.output_scores ++++ # output_logits = generation_config.output_logits ++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++ # max_length = generation_config.max_length ++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++ # do_sample = generation_config.do_sample ++++ ++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++ # raise ValueError( ++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++ # f"{logits_warper})." ++++ # ) ++++ ++++ # # init attention / hidden states / scores tuples ++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++ ++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++ # encoder_hidden_states = ( ++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++ # ) ++++ ++++ # # keep track of which sequences are already finished ++++ # batch_size, cur_len = input_ids.shape ++++ # this_peer_finished = False ++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++ ++++ # time_record = [] ++++ # from ....utils.testing_utils import parse_flag_from_env ++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++ ++++ # while self._has_unfinished_sequences( ++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++ # ): ++++ # if _record_time: ++++ # import time as time_module ++++ # infer_start = time_module.time() ++++ ++++ # # prepare model inputs ++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++ ++++ # # prepare variable output controls ++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++ ++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++ # cur_cache_position = model_inputs.get("cache_position") ++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++ # cur_input_ids = model_inputs.get("input_ids") ++++ ++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++ # cur_cache_position is not None and ++++ # len(cur_cache_position.shape) > 0 and ++++ # cur_cache_position.shape[0] == 1 and ++++ # cur_input_ids is not None and ++++ # cur_input_ids.shape[1] == 1): ++++ # # 使用 JIT 优化的单 token 解码 ++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++ # if not hasattr(self, '_jit_used'): ++++ # self._jit_used = False ++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++ ++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++ # cur_token=cur_input_ids, ++++ # input_pos=model_inputs.get("position_ids"), ++++ # cache_position=cur_cache_position, ++++ # past_key_values=cur_past_key_values, ++++ # ) ++++ ++++ # # 标记已使用JIT(用于后续判断) ++++ # if not self._jit_used: ++++ # self._jit_used = True ++++ ++++ # # 构造兼容的输出对象 ++++ # class JitOptimizedOutput: ++++ # def __init__(self, logits, config): ++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++ # self.config = config ++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++ # self.attentions = None if not config.is_encoder_decoder else None ++++ # self.cross_attentions = None ++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++ ++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++ # else: ++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++ # outputs = self(**model_inputs, return_dict=True) ++++ ++++ # if synced_devices and this_peer_finished: ++++ # continue ++++ ++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++ # next_token_logits = outputs.logits[:, -1, :] ++++ ++++ # # pre-process distribution ++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++ # if do_sample: ++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++ ++++ # # Store scores, attentions and hidden_states when required ++++ # if return_dict_in_generate: ++++ # if output_scores: ++++ # scores += (next_token_scores,) ++++ # if output_logits: ++++ # raw_logits += (next_token_logits,) ++++ # if output_attentions: ++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++ # if self.config.is_encoder_decoder: ++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++ ++++ # if output_hidden_states: ++++ # hidden = ( ++++ # outputs.decoder_hidden_states ++++ # if self.config.is_encoder_decoder ++++ # else outputs.hidden_states ++++ # ) ++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++ ++++ # # token selection ++++ # if do_sample: ++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++ # else: ++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++ ++++ # # finished sentences should have their next token be a padding token ++++ # if has_eos_stopping_criteria: ++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++ ++++ # # update generated ids, model inputs, and length for next step ++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++ # if streamer is not None: ++++ # streamer.put(next_tokens) ++++ ++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++ # outputs, ++++ # model_kwargs, ++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++ # ) ++++ ++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++ # cur_len += 1 ++++ ++++ # if _record_time: ++++ # import time as time_module ++++ # infer_stop = time_module.time() ++++ # time_record.append(infer_stop - infer_start) ++++ ++++ # del outputs ++++ ++++ # average_infer_time = None ++++ # if time_record: ++++ # if len(time_record) > 1: ++++ # time_record.pop(0) ++++ # average_infer_time = sum(time_record) / len(time_record) ++++ # print(f'average inference time is: {average_infer_time}') ++++ # print(f'inference time record: {time_record}') ++++ ++++ # if streamer is not None: ++++ # streamer.end() ++++ ++++ # # 简单判断:打印是否使用了JIT路径 ++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++ # else: ++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++ ++++ # if return_dict_in_generate: ++++ # if self.config.is_encoder_decoder: ++++ # return GenerateEncoderDecoderOutput( ++++ # sequences=input_ids, ++++ # scores=scores, ++++ # logits=raw_logits, ++++ # encoder_attentions=encoder_attentions, ++++ # encoder_hidden_states=encoder_hidden_states, ++++ # decoder_attentions=decoder_attentions, ++++ # cross_attentions=cross_attentions, ++++ # decoder_hidden_states=decoder_hidden_states, ++++ # past_key_values=model_kwargs.get("past_key_values"), ++++ # average_infer_time=average_infer_time ++++ # ) ++++ # else: ++++ # return GenerateDecoderOnlyOutput( ++++ # sequences=input_ids, ++++ # scores=scores, ++++ # logits=raw_logits, ++++ # attentions=decoder_attentions, ++++ # hidden_states=decoder_hidden_states, ++++ # past_key_values=model_kwargs.get("past_key_values"), ++++ # average_infer_time=average_infer_time ++++ # ) ++++ # else: ++++ # return input_ids ++++ ++++ # def _prepare_cache_for_generation( ++++ # self, ++++ # generation_config, ++++ # model_kwargs, ++++ # assistant_model, ++++ # batch_size, ++++ # max_cache_length, ++++ # ): ++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++ # generation_config.cache_implementation = "static" ++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++ ++++ # if generation_config.cache_implementation == "static": ++++ # base_required_from_max_length = generation_config.max_length + 1 ++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++ # min_cache_size = 50 ++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++ # else: ++++ # max_cache_length = max(base_required, min_cache_size) ++++ ++++ # original_max_cache_length = max_cache_length ++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++ # print(f" - final max_cache_length: {max_cache_length}") ++++ ++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++ # if max_cache_length > self.config.max_position_embeddings: ++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++ ++++ # result = super()._prepare_cache_for_generation( ++++ # generation_config=generation_config, ++++ # model_kwargs=model_kwargs, ++++ # assistant_model=assistant_model, ++++ # batch_size=batch_size, ++++ # max_cache_length=max_cache_length, ++++ # ) ++++ ++++ # if generation_config.cache_implementation == "static": ++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++ # created_cache = model_kwargs.get(cache_name) ++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++ # if created_cache.max_cache_len < generation_config.max_length: ++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++ ++++ # return result ++++ ++++ ++++ +++ +++ +++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++-- +++2.27.0 +++ ++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++new file mode 100644 ++index 00000000..22b65dd5 ++--- /dev/null +++++ b/patches/0002-20251106commit.patch ++@@ -0,0 +1,3200 @@ +++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Thu, 6 Nov 2025 09:20:38 +0800 +++Subject: [PATCH 2/3] 20251106commit +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +++ 3 files changed, 2689 insertions(+), 305 deletions(-) +++ create mode 100644 patches/0001-20251104commit.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index d8303e45..73773c22 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +++ # y = y + self.shared_experts(identity) +++ # return y +++ ++++ # @no_grad() ++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ ++++ # expert_cache = ops.zeros_like(x) ++++ # for i in range(self.num_experts_per_tok): ++++ # expert_id = flat_expert_indices[i].item() ++++ # weight = flat_expert_weights[i].item() ++++ # expert = self.experts[expert_id] ++++ # expert_out = expert(x) ++++ # expert_cache += expert_out * weight ++++ # return expert_cache ++++ +++ @no_grad() +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ # x 的 shape: (1, hidden_size) ++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++ ++++ # 1. 收集所有需要的专家层 ++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++++ ++++ # 2. 并行计算所有专家的输出 ++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++ # ops.cat 会将它们堆叠成一个新的 Tensor ++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++ ++++ # 3. 使用矩阵乘法进行加权求和 ++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ # 最终结果 final_output 的 shape: (1, hidden_size) ++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++ ++++ return final_output +++ +++- expert_cache = ops.zeros_like(x) +++- for i in range(self.num_experts_per_tok): +++- expert_id = flat_expert_indices[i].item() +++- weight = flat_expert_weights[i].item() +++- expert = self.experts[expert_id] +++- expert_out = expert(x) +++- expert_cache += expert_out * weight +++- return expert_cache +++ +++ @no_grad() +++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++ # @lwx ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) ++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++ +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +++ return attn_output, attn_weights, past_key_value +++ +++ ++++# class DeepseekFlashAttention(nn.Module): ++++# """ ++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++++ ++++# This class is designed as a drop-in replacement for DeepseekAttention. ++++# """ ++++ ++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++# super().__init__() ++++# self.config = config ++++# self.layer_idx = layer_idx ++++# if layer_idx is None: ++++# logger.warning( ++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++# "when creating this class." ++++# ) ++++ ++++# self.attention_dropout = config.attention_dropout ++++# self.hidden_size = config.hidden_size ++++# self.num_heads = config.num_attention_heads ++++# self.head_dim = self.hidden_size // self.num_heads ++++# self.num_key_value_heads = config.num_key_value_heads ++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++# self.max_position_embeddings = config.max_position_embeddings ++++# self.rope_theta = config.rope_theta ++++# self.is_causal = True ++++ ++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++# raise ValueError( ++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++# f" and `num_heads`: {self.num_heads})." ++++# ) ++++ ++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++# self._init_rope() ++++ ++++# def _init_rope(self): ++++# if self.config.rope_scaling is None: ++++# self.rotary_emb = DeepseekRotaryEmbedding( ++++# self.head_dim, ++++# max_position_embeddings=self.max_position_embeddings, ++++# base=self.rope_theta, ++++# ) ++++# else: ++++# scaling_type = self.config.rope_scaling["type"] ++++# scaling_factor = self.config.rope_scaling["factor"] ++++# if scaling_type == "linear": ++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++# self.head_dim, ++++# max_position_embeddings=self.max_position_embeddings, ++++# scaling_factor=scaling_factor, ++++# base=self.rope_theta, ++++# ) ++++# elif scaling_type == "dynamic": ++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++# self.head_dim, ++++# max_position_embeddings=self.max_position_embeddings, ++++# scaling_factor=scaling_factor, ++++# base=self.rope_theta, ++++# ) ++++# else: ++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++ ++++# def forward( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# attention_mask: Optional[mindspore.Tensor] = None, ++++# position_ids: Optional[mindspore.Tensor] = None, ++++# past_key_value: Optional[Cache] = None, ++++# output_attentions: bool = False, ++++# use_cache: bool = False, ++++# **kwargs, ++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++# if "padding_mask" in kwargs: ++++# warnings.warn( ++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++# ) ++++ ++++# if output_attentions: ++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++++ ++++# bsz, q_len, _ = hidden_states.shape ++++ ++++# if self.config.pretraining_tp > 1: ++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++ ++++# query_states = self.q_proj(hidden_states) ++++# key_states = self.k_proj(hidden_states) ++++# value_states = self.v_proj(hidden_states) ++++ ++++# # Reshape for multi-head attention ++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++# kv_seq_len = key_states.shape[-2] ++++# if past_key_value is not None: ++++# if self.layer_idx is None: ++++# raise ValueError( ++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++# "with a layer index." ++++# ) ++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++# # Apply Rotary Positional Embedding ++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++# if past_key_value is not None: ++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ ++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++ ++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++ ++++# # Convert attention_mask for flash_attention_score ++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++++# if attention_mask is not None: ++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++# raise ValueError( ++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++# ) ++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++++# else: ++++# attn_mask_for_fa = None ++++ ++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++ ++++# # Call the fused flash_attention_score operator ++++# attn_output = mindspore.ops.flash_attention_score( ++++# query=query_states_for_fa, ++++# key=key_states_for_fa, ++++# value=value_states_for_fa, ++++# head_num=self.num_heads, # This is N1, the number of query heads ++++# input_layout='BSH', ++++# attn_mask=attn_mask_for_fa, ++++# keep_prob=keep_prob, ++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++# sparse_mode=0 # Default mask mode ++++# ) ++++ ++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++++# attn_output = self.o_proj(attn_output) ++++ ++++# # Flash Attention does not return attention weights ++++# attn_weights = None ++++ ++++# return attn_output, attn_weights, past_key_value ++++ ++++class DeepseekFlashAttention(nn.Module): ++++ """ ++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. ++++ This implementation is a drop-in replacement for the original DeepseekAttention class, ++++ designed for high performance on supported hardware (Ascend). ++++ ++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. ++++ """ ++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++ super().__init__() ++++ self.config = config ++++ self.layer_idx = layer_idx ++++ if layer_idx is None: ++++ logger.warning( ++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++ "when creating this class." ++++ ) ++++ ++++ # --- [FIX] Correctly initialize all required attributes --- ++++ self.attention_dropout = config.attention_dropout ++++ self.hidden_size = config.hidden_size ++++ self.num_heads = config.num_attention_heads ++++ self.head_dim = self.hidden_size // self.num_heads ++++ self.num_key_value_heads = config.num_key_value_heads ++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++ self.max_position_embeddings = config.max_position_embeddings ++++ self.rope_theta = config.rope_theta ++++ self.is_causal = True ++++ ++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++ raise ValueError( ++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++ f" and `num_heads`: {self.num_heads})." ++++ ) ++++ ++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++ ++++ # This call will now succeed as all attributes are initialized. ++++ self._init_rope() ++++ ++++ def _init_rope(self): ++++ if self.config.rope_scaling is None: ++++ self.rotary_emb = DeepseekRotaryEmbedding( ++++ self.head_dim, ++++ max_position_embeddings=self.max_position_embeddings, ++++ base=self.rope_theta, ++++ ) ++++ else: ++++ scaling_type = self.config.rope_scaling["type"] ++++ scaling_factor = self.config.rope_scaling["factor"] ++++ if scaling_type == "linear": ++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++ self.head_dim, ++++ max_position_embeddings=self.max_position_embeddings, ++++ scaling_factor=scaling_factor, ++++ base=self.rope_theta, ++++ ) ++++ elif scaling_type == "dynamic": ++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++ self.head_dim, ++++ max_position_embeddings=self.max_position_embeddings, ++++ scaling_factor=scaling_factor, ++++ base=self.rope_theta, ++++ ) ++++ else: ++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++ ++++ def forward( ++++ self, ++++ hidden_states: mindspore.Tensor, ++++ attention_mask: Optional[mindspore.Tensor] = None, ++++ position_ids: Optional[mindspore.Tensor] = None, ++++ past_key_value: Optional[Cache] = None, ++++ output_attentions: bool = False, ++++ use_cache: bool = False, ++++ **kwargs, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ if "padding_mask" in kwargs: ++++ warnings.warn( ++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++ ) ++++ if output_attentions: ++++ warnings.warn( ++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." ++++ ) ++++ ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ if self.config.pretraining_tp > 1: ++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++ ++++ query_states = self.q_proj(hidden_states) ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++ # Apply Rotary Position Embedding ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos} ++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. ++++ # So we must explicitly repeat the KV heads. ++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++ ++++ # Convert attention mask for flash_attention_score ++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. ++++ if attention_mask is not None: ++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++ raise ValueError( ++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++ ) ++++ attn_mask_for_fa = attention_mask < 0 ++++ else: ++++ attn_mask_for_fa = None ++++ ++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++ ++++ # Call the fused operator using the efficient BNSD layout ++++ attn_output = mindspore.ops.flash_attention_score( ++++ query=query_states, ++++ key=key_states, ++++ value=value_states, ++++ head_num=self.num_heads, ++++ input_layout='BNSD', # Specify the correct layout ++++ attn_mask=attn_mask_for_fa, ++++ keep_prob=keep_prob, ++++ scalar_value=1.0 / math.sqrt(self.head_dim) ++++ ) ++++ ++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. ++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ ++++ # Apply output projection ++++ attn_output = self.o_proj(attn_output) ++++ ++++ # Flash attention does not return attention weights, so we return None. ++++ attn_weights = None ++++ ++++ return attn_output, attn_weights, past_key_value ++++ +++ Deepseek_ATTENTION_CLASSES = { +++ "eager": DeepseekAttention, ++++ "flash-attention": DeepseekFlashAttention, +++ } +++ +++ +++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +++ config=config, layer_idx=layer_idx +++ ) +++ ++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++++ config=config, layer_idx=layer_idx ++++ ) ++++ +++ self.mlp = ( +++ DeepseekMoE(config) +++ if ( +++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++index d4c6b651..bced285c 100644 +++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +++ +++ import mindspore +++ import mindnlp.core.nn.functional as F +++-from mindnlp.core import nn, ops ++++from mindnlp.core import nn, ops, no_grad +++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +++ +++ from ....common.activations import ACT2FN +++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++ ++++Long_Prompt = False ++++PROMPT_LENGTH_THRESHOLD = 128 +++ +++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++ def _prepare_4d_causal_attention_mask_with_cache_position( +++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +++ return attn_output, attn_weights, past_key_value +++ +++ ++++# class Qwen2MoeFlashAttention(nn.Module): ++++# """ ++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++ ++++# 关键改动: ++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++# 直接传入原始的 key 和 value 张量效率更高。 ++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++# super().__init__() ++++# self.config = config ++++# self.layer_idx = layer_idx ++++# self.hidden_size = config.hidden_size ++++# self.num_heads = config.num_attention_heads ++++# self.head_dim = self.hidden_size // self.num_heads ++++# self.num_key_value_heads = config.num_key_value_heads ++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++# self.max_position_embeddings = config.max_position_embeddings ++++# self.rope_theta = config.rope_theta ++++# self.attention_dropout = config.attention_dropout ++++ ++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++# raise ValueError( ++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++# ) ++++ ++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++ ++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++# self.head_dim, ++++# max_position_embeddings=self.max_position_embeddings, ++++# base=self.rope_theta, ++++# ) ++++ ++++# def forward( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# attention_mask: Optional[mindspore.Tensor] = None, ++++# position_ids: Optional[mindspore.Tensor] = None, ++++# past_key_value: Optional[Cache] = None, ++++# output_attentions: bool = False, ++++# use_cache: bool = False, ++++# cache_position: Optional[mindspore.Tensor] = None, ++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++# bsz, q_len, _ = hidden_states.shape ++++ ++++# # 1. 线性投射 Q, K, V ++++# query_states = self.q_proj(hidden_states) ++++# key_states = self.k_proj(hidden_states) ++++# value_states = self.v_proj(hidden_states) ++++ ++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++# # query: [B, S, H*D] -> [B, N1, S, D] ++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++# # 3. RoPE 旋转位置编码 ++++# kv_seq_len = key_states.shape[-2] ++++# if past_key_value is not None: ++++# if self.layer_idx is None: ++++# raise ValueError( ++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++# "with a layer index." ++++# ) ++++# # 对于 StaticCache,需要特殊处理 kv_seq_len ++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++# if cache_position.shape[0] == 1: ++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++# kv_seq_len = past_seen_tokens + 1 ++++# else: ++++# # prefill 阶段:cache_position 是范围,使用其长度 ++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++# else: ++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++# # 4. KV 缓存更新 ++++# if past_key_value is not None: ++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++# key_states, value_states = past_key_value.update( ++++# key_states, value_states, self.layer_idx, cache_kwargs ++++# ) ++++ ++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++# if cache_position.shape[0] == 1: ++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++# kv_seq_len = key_states.shape[-2] ++++ ++++# # 5. [重要] 准备 Attention Mask ++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++# fa_attention_mask = None ++++# if attention_mask is not None: ++++# # 截取与当前key长度匹配的部分 ++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++# # 转换为布尔类型: 大负数 -> True, 0 -> False ++++# fa_attention_mask = (mask_slice != 0) ++++ ++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++# input_dtype = query_states.dtype ++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++# query_states = query_states.to(mindspore.float16) ++++# key_states = key_states.to(mindspore.float16) ++++# value_states = value_states.to(mindspore.float16) ++++ ++++# # 6. [核心] 调用 flash_attention_score 算子 ++++# # - 无需手动 repeat_kv, 算子原生支持 GQA ++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++# attn_output = mindspore.ops.flash_attention_score( ++++# query=query_states, ++++# key=key_states, ++++# value=value_states, ++++# head_num=self.num_heads, # 传入Q的头数(N1) ++++# attn_mask=fa_attention_mask, ++++# keep_prob=1.0 - self.attention_dropout, ++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++# input_layout="BNSD", ++++# sparse_mode=0 # 使用 defaultMask 模式 ++++# ) ++++ ++++# # 恢复原始数据类型 ++++# attn_output = attn_output.to(input_dtype) ++++ ++++# # 7. 调整输出形状 ++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++# attn_output = self.o_proj(attn_output) ++++ ++++# # FlashAttention 算子不直接返回注意力权重矩阵 ++++# attn_weights = None ++++# if output_attentions: ++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++# return attn_output, attn_weights, past_key_value ++++ ++++# # def forward( ++++# # self, ++++# # hidden_states: mindspore.Tensor, ++++# # attention_mask: Optional[mindspore.Tensor] = None, ++++# # position_ids: Optional[mindspore.Tensor] = None, ++++# # past_key_value: Optional[Cache] = None, ++++# # output_attentions: bool = False, ++++# # use_cache: bool = False, ++++# # cache_position: Optional[mindspore.Tensor] = None, ++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++# # bsz, q_len, _ = hidden_states.shape ++++ ++++# # # 1. 线性投射 Q, K, V ++++# # query_states = self.q_proj(hidden_states) ++++# # key_states = self.k_proj(hidden_states) ++++# # value_states = self.v_proj(hidden_states) ++++ ++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ ++++# # # 3. RoPE 旋转位置编码 ++++# # kv_seq_len = key_states.shape[-2] ++++# # if past_key_value is not None: ++++# # if self.layer_idx is None: ++++# # raise ValueError( ++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++# # "with a layer index." ++++# # ) ++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++# # # 4. KV 缓存更新 ++++# # if past_key_value is not None: ++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++# # key_states, value_states = past_key_value.update( ++++# # key_states, value_states, self.layer_idx, cache_kwargs ++++# # ) ++++ ++++# # # 5. 准备 Attention Mask ++++# # fa_attention_mask = None ++++# # if attention_mask is not None: ++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++# # fa_attention_mask = (mask_slice != 0) ++++ ++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++# # input_dtype = query_states.dtype ++++ ++++# # # 6. [核心] 调用 flash_attention_score 算子 ++++# # attn_output = mindspore.ops.flash_attention_score( ++++# # query=query_states, ++++# # key=key_states, ++++# # value=value_states, ++++# # head_num=self.num_heads, ++++# # attn_mask=fa_attention_mask, ++++# # keep_prob=1.0 - self.attention_dropout, ++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++# # input_layout="BNSD", ++++# # sparse_mode=0, ++++# # # <--- 修改点 2: 启用内部高精度计算 --- ++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++# # inner_precise=1 ++++# # ) ++++ ++++# # # 恢复原始数据类型 ++++# # attn_output = attn_output.to(input_dtype) ++++ ++++# # # 7. 调整输出形状 ++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++# # attn_output = self.o_proj(attn_output) ++++ ++++# # attn_weights = None ++++# # if output_attentions: ++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ ++++# # return attn_output, attn_weights, past_key_value ++++ ++++ +++ class Qwen2MoeFlashAttention(nn.Module): +++ """ +++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++- +++- 关键改动: +++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++- 直接传入原始的 key 和 value 张量效率更高。 +++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 ++++ ++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` ++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, ++++ 完全使用模型的低精度数据类型(如 float16)进行计算, ++++ 以达到理论上的最高执行速度。 +++ """ +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++ super().__init__() +++ self.config = config +++ self.layer_idx = layer_idx ++++ if layer_idx is None: ++++ logger.warning_once( ++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." ++++ ) ++++ +++ self.hidden_size = config.hidden_size +++ self.num_heads = config.num_attention_heads +++ self.head_dim = self.hidden_size // self.num_heads +++ self.num_key_value_heads = config.num_key_value_heads +++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++ self.max_position_embeddings = config.max_position_embeddings +++ self.rope_theta = config.rope_theta +++ self.attention_dropout = config.attention_dropout +++ +++- if (self.head_dim * self.num_heads) != self.hidden_size: +++- raise ValueError( +++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++- ) +++- +++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++- # query: [B, S, H*D] -> [B, N1, S, D] +++- # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++ # 2. 调整形状以匹配 BNSD 布局 +++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- +++- # 3. RoPE 旋转位置编码 ++++ ++++ # 3. RoPE 和 KV 缓存 +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++- if self.layer_idx is None: +++- raise ValueError( +++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++- "with a layer index." +++- ) +++- # 对于 StaticCache,需要特殊处理 kv_seq_len +++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++- if cache_position.shape[0] == 1: +++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++- kv_seq_len = past_seen_tokens + 1 +++- else: +++- # prefill 阶段:cache_position 是范围,使用其长度 +++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +++- else: +++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++- ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++- # 4. KV 缓存更新 +++ if past_key_value is not None: +++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++- key_states, value_states = past_key_value.update( +++- key_states, value_states, self.layer_idx, cache_kwargs +++- ) +++- +++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++- if cache_position.shape[0] == 1: +++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++- kv_seq_len = key_states.shape[-2] +++- +++- # 5. [重要] 准备 Attention Mask +++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++ # 4. 准备 Attention Mask +++ fa_attention_mask = None +++ if attention_mask is not None: +++- # 截取与当前key长度匹配的部分 +++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++- # 转换为布尔类型: 大负数 -> True, 0 -> False +++ fa_attention_mask = (mask_slice != 0) +++ +++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++- input_dtype = query_states.dtype +++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++- query_states = query_states.to(mindspore.float16) +++- key_states = key_states.to(mindspore.float16) +++- value_states = value_states.to(mindspore.float16) +++- +++- # 6. [核心] 调用 flash_attention_score 算子 +++- # - 无需手动 repeat_kv, 算子原生支持 GQA +++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +++ attn_output = mindspore.ops.flash_attention_score( +++ query=query_states, +++ key=key_states, +++ value=value_states, +++- head_num=self.num_heads, # 传入Q的头数(N1) ++++ head_num=self.num_heads, +++ attn_mask=fa_attention_mask, +++- keep_prob=1.0 - self.attention_dropout, ++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +++ scalar_value=1.0 / math.sqrt(self.head_dim), +++ input_layout="BNSD", +++- sparse_mode=0 # 使用 defaultMask 模式 ++++ sparse_mode=0, ++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +++ ) +++ +++- # 恢复原始数据类型 +++- attn_output = attn_output.to(input_dtype) +++- +++- # 7. 调整输出形状 +++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++ # 6. 调整输出形状 +++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++ attn_output = self.o_proj(attn_output) +++ +++- # FlashAttention 算子不直接返回注意力权重矩阵 ++++ # 7. 返回结果 +++ attn_weights = None +++ if output_attentions: +++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++ +++ return attn_output, attn_weights, past_key_value +++ +++- # def forward( +++- # self, +++- # hidden_states: mindspore.Tensor, +++- # attention_mask: Optional[mindspore.Tensor] = None, +++- # position_ids: Optional[mindspore.Tensor] = None, +++- # past_key_value: Optional[Cache] = None, +++- # output_attentions: bool = False, +++- # use_cache: bool = False, +++- # cache_position: Optional[mindspore.Tensor] = None, +++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++- +++- # bsz, q_len, _ = hidden_states.shape +++- +++- # # 1. 线性投射 Q, K, V +++- # query_states = self.q_proj(hidden_states) +++- # key_states = self.k_proj(hidden_states) +++- # value_states = self.v_proj(hidden_states) +++- +++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- +++- # # 3. RoPE 旋转位置编码 +++- # kv_seq_len = key_states.shape[-2] +++- # if past_key_value is not None: +++- # if self.layer_idx is None: +++- # raise ValueError( +++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++- # "with a layer index." +++- # ) +++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++ +++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++- +++- # # 4. KV 缓存更新 +++- # if past_key_value is not None: +++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++- # key_states, value_states = past_key_value.update( +++- # key_states, value_states, self.layer_idx, cache_kwargs +++- # ) +++- +++- # # 5. 准备 Attention Mask +++- # fa_attention_mask = None +++- # if attention_mask is not None: +++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++- # fa_attention_mask = (mask_slice != 0) +++- +++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++- # input_dtype = query_states.dtype +++- +++- # # 6. [核心] 调用 flash_attention_score 算子 +++- # attn_output = mindspore.ops.flash_attention_score( +++- # query=query_states, +++- # key=key_states, +++- # value=value_states, +++- # head_num=self.num_heads, +++- # attn_mask=fa_attention_mask, +++- # keep_prob=1.0 - self.attention_dropout, +++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++- # input_layout="BNSD", +++- # sparse_mode=0, +++- # # <--- 修改点 2: 启用内部高精度计算 --- +++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++- # inner_precise=1 +++- # ) +++- +++- # # 恢复原始数据类型 +++- # attn_output = attn_output.to(input_dtype) ++++QWEN2MOE_ATTENTION_CLASSES = { ++++ "eager": Qwen2MoeAttention, ++++ "flash-attention": Qwen2MoeFlashAttention, ++++} +++ +++- # # 7. 调整输出形状 +++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++- # attn_output = self.o_proj(attn_output) +++ +++- # attn_weights = None +++- # if output_attentions: +++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# def __init__(self, config): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# # gating ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++ ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# #@dwj ++++# # 只遍历激活的专家,而非全部专家 ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# num_tokens = hidden_states_reshaped.shape[0] ++++ ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++# flat_selected_experts = selected_experts.flatten() ++++ ++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++# token_indices = broadcasted_token_indices.flatten() ++++ ++++# active_experts = ops.unique(flat_selected_experts) ++++ ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++ ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# selected_token_indices = token_indices[mask] ++++# selected_routing_weights = routing_weights.flatten()[mask] ++++ ++++# current_states = hidden_states_reshaped[selected_token_indices] ++++ ++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++ ++++# final_hidden_states = final_hidden_states.index_add( ++++# dim=0, ++++# index=selected_token_indices, ++++# source=expert_output.to(hidden_states.dtype) ++++# ) ++++ ++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++ +++- # return attn_output, attn_weights, past_key_value ++++# final_hidden_states = final_hidden_states + shared_expert_output ++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ ++++ ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# """ ++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++++# `_moe_infer_prefill` (用于长序列处理) 方法。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# # 门控网络 ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# # 专家列表 ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++# # 共享专家 ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# @no_grad() ++++# def _moe_infer_decode( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# """ ++++# 【解码路径】针对 sequence_length=1 的极致优化。 ++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++++# """ ++++# batch_size, hidden_dim = hidden_states.shape ++++ ++++# expert_outputs_list = [ ++++# ops.cat([ ++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++# ], dim=0) ++++# for i in range(batch_size) ++++# ] ++++ ++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++++# # shape: (batch_size, top_k, hidden_dim) ++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++ ++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++ ++++# return moe_output.squeeze(1) ++++ ++++# @no_grad() ++++# def _moe_infer_prefill( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# """ ++++# 【预填充路径】针对 sequence_length > 1 的优化。 ++++# 按专家对 Token 进行分组,并进行批处理。 ++++# """ ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens = hidden_states.shape[0] ++++# flat_selected_experts = selected_experts.flatten() ++++ ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++ ++++# active_experts = ops.unique(flat_selected_experts) ++++ ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++ ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# selected_token_indices = token_indices[mask] ++++# selected_routing_weights = routing_weights.flatten()[mask] ++++ ++++# current_states = hidden_states[selected_token_indices] ++++ ++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++ ++++# moe_output = moe_output.index_add( ++++# dim=0, ++++# index=selected_token_indices, ++++# source=expert_output.to(hidden_states.dtype) ++++# ) ++++# return moe_output ++++ ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# """ ++++# 顶层 forward 方法,作为智能分发器。 ++++# """ ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++- # def forward( +++- # self, +++- # hidden_states: mindspore.Tensor, +++- # attention_mask: Optional[mindspore.Tensor] = None, +++- # position_ids: Optional[mindspore.Tensor] = None, +++- # past_key_value: Optional[Cache] = None, +++- # output_attentions: bool = False, +++- # use_cache: bool = False, +++- # cache_position: Optional[mindspore.Tensor] = None, +++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++- +++- # bsz, q_len, _ = hidden_states.shape +++- +++- # query_states = self.q_proj(hidden_states) +++- # key_states = self.k_proj(hidden_states) +++- # value_states = self.v_proj(hidden_states) +++- +++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- +++- # kv_seq_len = key_states.shape[-2] +++- # if past_key_value is not None: +++- # if self.layer_idx is None: +++- # raise ValueError("`layer_idx` must be specified for caching") +++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++- +++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++- +++- # if past_key_value is not None: +++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++- # key_states, value_states = past_key_value.update( +++- # key_states, value_states, self.layer_idx, cache_kwargs +++- # ) ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++# moe_output = None ++++# # 在推理时,根据序列长度选择最优路径 ++++# if not self.training: ++++# if sequence_length == 1: ++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++# else: ++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++# else: ++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++++# raise NotImplementedError("Training path is not implemented.") ++++ ++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++++ ++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++++ ++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ ++++ ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# """ ++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# # 门控网络 ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# # 专家列表 ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++# # 共享专家 ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# @no_grad() ++++# def _moe_infer_decode( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# batch_size, _ = hidden_states.shape ++++# expert_outputs_list = [ ++++# ops.cat([ ++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++# ], dim=0) ++++# for i in range(batch_size) ++++# ] ++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++# return moe_output.squeeze(1) ++++ ++++# @no_grad() ++++# def _moe_infer_prefill( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens = hidden_states.shape[0] ++++# flat_selected_experts = selected_experts.flatten() ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++# active_experts = ops.unique(flat_selected_experts) ++++ ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# selected_token_indices = token_indices[mask] ++++# selected_routing_weights = routing_weights.flatten()[mask] ++++# current_states = hidden_states[selected_token_indices] ++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++# moe_output = moe_output.index_add( ++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++# ) ++++# return moe_output ++++ ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# """ ++++# 顶层 forward 方法,作为智能分发器。 ++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++++# """ ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ ++++# # 1. 门控计算 (通用逻辑) ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++# # 2. 智能分发到最优 MoE 路径 ++++# moe_output = None ++++# if not self.training: ++++# if sequence_length == 1: ++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++# else: ++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++# else: ++++# raise NotImplementedError("Training path is not implemented.") ++++ ++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++ ++++# # 4. 合并 MoE 输出和共享专家输出 ++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++ ++++# # 5. 恢复原始形状并返回 ++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ ++++# prefill fastest ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# """ ++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# # 门控网络 ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# # 专家列表 ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++# # 共享专家 ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# @no_grad() ++++# def _moe_infer_dispatch( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# """ ++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++++# """ ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens, _ = hidden_states.shape ++++ ++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++++# flat_selected_experts = selected_experts.flatten() ++++# flat_routing_weights = routing_weights.flatten() +++ +++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +++- +++- # # <--- 核心修改点: 手动进行高精度缩放 --- +++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++- # query_states = query_states / math.sqrt(self.head_dim) +++- # # <--- 修改结束 --- +++- +++- # fa_attention_mask = None +++- # if attention_mask is not None: +++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++- # fa_attention_mask = (mask_slice != 0) +++- +++- # input_dtype = query_states.dtype +++- +++- # attn_output = mindspore.ops.flash_attention_score( +++- # query=query_states, # 传入已经预先缩放过的 query +++- # key=key_states, +++- # value=value_states, +++- # head_num=self.num_heads, +++- # attn_mask=fa_attention_mask, +++- # keep_prob=1.0 - self.attention_dropout, +++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++- # input_layout="BNSD", +++- # sparse_mode=0, +++- # inner_precise=1 # 仍然保持内部高精度计算 +++- # ) ++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++ +++- # attn_output = attn_output.to(input_dtype) +++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++- # attn_output = self.o_proj(attn_output) ++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++++# active_experts = ops.unique(flat_selected_experts) ++++ ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++ ++++# # 找到所有分配给该专家的 token ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++ ++++# # 使用 mask 选取对应的 token 和权重 ++++# current_token_indices = token_indices[mask] ++++# current_routing_weights = flat_routing_weights[mask] ++++# current_hidden_states = hidden_states[current_token_indices] ++++ ++++# # 对这些 token 进行批处理 ++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++ ++++# # 使用 index_add 将结果精确地加回到对应位置 ++++# moe_output = moe_output.index_add( ++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++++# ) ++++# return moe_output ++++ ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# """ ++++# 顶层 forward 方法,作为智能分发器。 ++++# """ ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ ++++# # 1. 门控计算 ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++ ++++# # 2. 调用统一的 MoE 计算内核 ++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++ +++- # attn_weights = None +++- # if output_attentions: +++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++# # 3. 统一处理共享专家 ++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++ ++++# # 4. 合并输出 ++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++ ++++# # 5. 恢复原始形状并返回 ++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ ++++ ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# """ ++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++# 【最终高性能与高精度版】: ++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++++# 3. 这样实现了速度和准确性的两全其美。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# @no_grad() ++++# def _moe_infer_decode( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# """ ++++# 【解码路径】极致优化版:bmm + 高精度累加。 ++++# """ ++++# original_dtype = hidden_states.dtype ++++# batch_size, _ = hidden_states.shape ++++ ++++# expert_outputs_list = [ ++++# ops.cat([ ++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++# ], dim=0) ++++# for i in range(batch_size) ++++# ] ++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++ ++++# # 在 float32 下执行 bmm,得到高精度结果 ++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++ ++++# # 将高精度结果转换回原始数据类型 ++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++++ ++++# return moe_output ++++ ++++# @no_grad() ++++# def _moe_infer_prefill( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# selected_experts: mindspore.Tensor, ++++# routing_weights: mindspore.Tensor ++++# ) -> mindspore.Tensor: ++++# """ ++++# 【预填充路径】与原始实现一致,结果精确。 ++++# """ ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens, _ = hidden_states.shape ++++# flat_selected_experts = selected_experts.flatten() ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++# active_experts = ops.unique(flat_selected_experts) ++++ ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# selected_token_indices = token_indices[mask] ++++# selected_routing_weights = routing_weights.flatten()[mask] ++++# current_states = hidden_states[selected_token_indices] ++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++# moe_output = moe_output.index_add( ++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++# ) ++++# return moe_output ++++ ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++ +++- # return attn_output, attn_weights, past_key_value ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++++# # 如果模型主体是 float16,后续再转换 ++++ ++++# moe_output = None ++++# if not self.training: ++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++++# # _moe_infer_decode 内部会处理好类型转换 ++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++++# if sequence_length == 1: ++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++# else: ++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++# else: ++++# raise NotImplementedError("Training path is not implemented.") ++++ ++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++ ++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ +++ +++-QWEN2MOE_ATTENTION_CLASSES = { +++- "eager": Qwen2MoeAttention, +++- "flash-attention": Qwen2MoeFlashAttention, +++-} ++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++# """ ++++# 【融合版】一个混合专家模块,内置两种推理策略, ++++# 由外部全局变量 `Long_Prompt` 控制: ++++ ++++# - if Long_Prompt is True: 【精度优先模式】 ++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++++# 适用于处理长序列,避免误差累积。 ++++ ++++# - if Long_Prompt is False: 【速度优先模式】 ++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++++# 在解码阶段获得极致速度,同时保证结果高度准确。 ++++# """ ++++# def __init__(self, config: Qwen2MoeConfig): ++++# super().__init__() ++++# self.num_experts = config.num_experts ++++# self.top_k = config.num_experts_per_tok ++++# self.norm_topk_prob = config.norm_topk_prob ++++ ++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++# self.experts = nn.ModuleList( ++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++# ) ++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++# # --- 速度优先模式的辅助函数 --- ++++# @no_grad() ++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++# original_dtype = hidden_states.dtype ++++# batch_size, _ = hidden_states.shape ++++# expert_outputs_list = [ ++++# ops.cat([ ++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++# ], dim=0) ++++# for i in range(batch_size) ++++# ] ++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++# weights_fp32 = routing_weights.to(mindspore.float32) ++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++# return moe_output_fp32.squeeze(1).to(original_dtype) ++++ ++++# @no_grad() ++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens, _ = hidden_states.shape ++++# flat_selected_experts = selected_experts.flatten() ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++# active_experts = ops.unique(flat_selected_experts) ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# selected_token_indices = token_indices[mask] ++++# selected_routing_weights = routing_weights.flatten()[mask] ++++# current_states = hidden_states[selected_token_indices] ++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++++# return moe_output ++++ ++++# # --- 精度优先模式的辅助函数 --- ++++# @no_grad() ++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++# moe_output = ops.zeros_like(hidden_states) ++++# num_tokens, _ = hidden_states.shape ++++# flat_selected_experts = selected_experts.flatten() ++++# flat_routing_weights = routing_weights.flatten() ++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++# active_experts = ops.unique(flat_selected_experts) ++++# for expert_idx_tensor in active_experts: ++++# expert_idx = expert_idx_tensor.item() ++++# expert_layer = self.experts[expert_idx] ++++# mask = (flat_selected_experts == expert_idx_tensor) ++++# current_token_indices = token_indices[mask] ++++# current_routing_weights = flat_routing_weights[mask] ++++# current_hidden_states = hidden_states[current_token_indices] ++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++# return moe_output ++++ ++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++# # 声明我们将要使用一个在模块外部定义的全局变量 ++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++++# global Long_Prompt ++++ ++++# # 1. 门控计算 (所有模式通用) ++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++# router_logits = self.gate(hidden_states_reshaped) ++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++# if self.norm_topk_prob: ++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++# moe_output = None ++++# if not self.training: ++++# # 根据 Long_Prompt 标志选择模式 ++++# if Long_Prompt: ++++# # --- 精度优先模式 --- ++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++# else: ++++# # --- 速度优先模式 --- ++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++# if sequence_length == 1: ++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++# else: ++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++# else: ++++# raise NotImplementedError("Training path is not implemented.") ++++ ++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++ ++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++ ++++# return final_hidden_states, router_logits ++++ ++++class Qwen2MoeSparseMoeBlock(nn.Module): ++++ """ ++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++++ 控制的顶级推理策略: +++ ++++ - if Long_Prompt is True: 【精度优先模式】 ++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 ++++ 适用于需要严格可复现性的长序列任务。 +++ +++-class Qwen2MoeSparseMoeBlock(nn.Module): +++- def __init__(self, config): ++++ - if Long_Prompt is False: 【速度优先模式】 ++++ 采用业界最强的性能组合: ++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 ++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 ++++ """ ++++ def __init__(self, config: Qwen2MoeConfig): +++ super().__init__() +++ self.num_experts = config.num_experts +++ self.top_k = config.num_experts_per_tok +++ self.norm_topk_prob = config.norm_topk_prob +++ +++- # gating +++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++ self.experts = nn.ModuleList( +++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++ ) +++- +++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++ +++- #@dwj +++- # 只遍历激活的专家,而非全部专家 +++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++- num_tokens = hidden_states_reshaped.shape[0] +++- +++- router_logits = self.gate(hidden_states_reshaped) +++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- +++- if self.norm_topk_prob: +++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- routing_weights = routing_weights.to(hidden_states.dtype) +++- +++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++- flat_selected_experts = selected_experts.flatten() +++- +++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++- token_indices = broadcasted_token_indices.flatten() +++- +++- active_experts = ops.unique(flat_selected_experts) +++- +++- for expert_idx_tensor in active_experts: +++- expert_idx = expert_idx_tensor.item() +++- expert_layer = self.experts[expert_idx] +++- +++- mask = (flat_selected_experts == expert_idx_tensor) +++- selected_token_indices = token_indices[mask] +++- selected_routing_weights = routing_weights.flatten()[mask] +++- +++- current_states = hidden_states_reshaped[selected_token_indices] +++- +++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++- +++- final_hidden_states = final_hidden_states.index_add( +++- dim=0, +++- index=selected_token_indices, +++- source=expert_output.to(hidden_states.dtype) +++- ) +++- +++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- ++++ @no_grad() ++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++ original_dtype = hidden_states.dtype ++++ batch_size, _ = hidden_states.shape ++++ expert_outputs_list = [ ++++ ops.cat([ ++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++ ], dim=0) ++++ for i in range(batch_size) ++++ ] ++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++ weights_fp32 = routing_weights.to(mindspore.float32) ++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++ return moe_output_fp32.squeeze(1).to(original_dtype) ++++ ++++ @no_grad() ++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++ num_tokens, _ = hidden_states.shape ++++ flat_selected_experts = selected_experts.flatten() ++++ sorted_expert_indices = flat_selected_experts.argsort() ++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++ original_token_indices = sorted_expert_indices // self.top_k ++++ moe_output = ops.zeros_like(hidden_states) ++++ current_token_offset = 0 ++++ for i in range(self.num_experts): ++++ expert_token_count = tokens_per_expert[i] - current_token_offset ++++ if expert_token_count == 0: ++++ continue ++++ end_offset = current_token_offset + expert_token_count ++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++ expert_hidden_states = hidden_states[expert_original_token_indices] ++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++ current_token_offset += expert_token_count ++++ return moe_output ++++ ++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++++ @no_grad() ++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++ moe_output = ops.zeros_like(hidden_states) ++++ num_tokens, _ = hidden_states.shape ++++ flat_selected_experts = selected_experts.flatten() ++++ flat_routing_weights = routing_weights.flatten() ++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++ active_experts = ops.unique(flat_selected_experts) ++++ for expert_idx_tensor in active_experts: ++++ expert_idx = expert_idx_tensor.item() ++++ expert_layer = self.experts[expert_idx] ++++ mask = (flat_selected_experts == expert_idx_tensor) ++++ current_token_indices = token_indices[mask] ++++ current_routing_weights = flat_routing_weights[mask] ++++ current_hidden_states = hidden_states[current_token_indices] ++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++ return moe_output +++ +++- final_hidden_states = final_hidden_states + shared_expert_output +++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++- +++- return final_hidden_states, router_logits ++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++ global Long_Prompt ++++ ++++ # 1. 门控计算 (所有模式通用) ++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++ router_logits = self.gate(hidden_states_reshaped) ++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++ if self.norm_topk_prob: ++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++ moe_output = None ++++ if Long_Prompt: ++++ # --- 精度优先模式 (ACCURACY MODE) --- ++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ else: ++++ # --- 速度优先模式 (SPEED MODE) --- ++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++ if sequence_length == 1: ++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ else: ++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ +++ ++++ # 3. 共享专家计算与合并 (所有模式通用) ++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++ ++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++ ++++ return final_hidden_states, router_logits +++ +++ class Qwen2MoeDecoderLayer(nn.Module): +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++ super().__init__() +++ self.hidden_size = config.hidden_size ++++ ++++ # if Long_Prompt: ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ # else: ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++ +++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ +++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++- +++ if (layer_idx not in config.mlp_only_layers) and ( +++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++ ): +++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ self._warmed_up = True +++ self.warmup_moe_model() +++ ++++ ++++ +++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++ output_router_logits = ( +++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ router_logits=outputs.router_logits, +++ ) +++ ++++ def generate(self, *args, **kwargs): ++++ """ ++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++++ """ ++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++ ++++ input_ids = kwargs.get("input_ids") ++++ if input_ids is None and args: ++++ input_ids = args[0] ++++ ++++ if input_ids is not None: ++++ prompt_length = input_ids.shape[1] ++++ ++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: ++++ Long_Prompt = True ++++ else: ++++ Long_Prompt = False ++++ ++++ return super().generate(*args, **kwargs) ++++ +++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +++ def prepare_inputs_for_generation( +++ self, +++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +++ # Exception 1: when passing input_embeds, input_ids may be missing entries +++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here ++++ +++ if past_key_values is not None: +++ if inputs_embeds is not None: # Exception 1 +++ if 0 not in input_ids.shape: +++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ } +++ ) +++ return model_inputs ++++ +++ # @lwx +++ # def _decode_one_tokens_logits( +++ # self, +++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +++ attentions=outputs.attentions, +++ ) +++ ++++ +++ __all__ = [ +++ "Qwen2MoeForCausalLM", +++ "Qwen2MoeModel", +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++new file mode 100644 +++index 00000000..6dfb5b93 +++--- /dev/null ++++++ b/patches/0001-20251104commit.patch +++@@ -0,0 +1,1272 @@ ++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++Subject: [PATCH] 20251104commit ++++ ++++--- ++++ mindnlp/transformers/cache_utils.py | 28 +- ++++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++ 3 files changed, 976 insertions(+), 87 deletions(-) ++++ ++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++index cadd2e04..02f8d4be 100644 ++++--- a/mindnlp/transformers/cache_utils.py +++++++ b/mindnlp/transformers/cache_utils.py ++++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++ # k_out[:, :, cache_position] = key_states ++++ # v_out[:, :, cache_position] = value_states ++++- if ON_ORANGE_PI: ++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++- else: ++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++- +++++ # if ON_ORANGE_PI: +++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++ # else: +++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++ # 确保 cache_position 是 1D tensor 并且类型正确 +++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++ if cache_position.ndim > 1: +++++ cache_position = cache_position.flatten() +++++ # 确保类型是 int32 或 int64(MindSpore 要求) +++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++ cache_position = cache_position.int() +++++ +++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++ k_out[:, :, cache_position] = key_states +++++ v_out[:, :, cache_position] = value_states +++++ ++++ return k_out, v_out ++++ ++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index c695b944..d8303e45 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++- x1 = x[..., : x.shape[-1] // 2] ++++- x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++ # x1 = x[..., : x.shape[-1] // 2] +++++ # x2 = x[..., x.shape[-1] // 2 :] +++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++ if self.training: ++++ raise NotImplementedError("Training is not supported yet.") ++++ else: ++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++- if self.config.n_shared_experts is not None: ++++- y = y + self.shared_experts(identity) ++++- return y +++++ # @lwx +++++ if orig_shape[1] == 1: +++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++ y=y.view(*orig_shape) +++++ if self.config.n_shared_experts is not None: +++++ y = y + self.shared_experts(identity) +++++ return y +++++ else: +++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++ if self.config.n_shared_experts is not None: +++++ y = y + self.shared_experts(identity) +++++ return y +++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++ # if self.config.n_shared_experts is not None: +++++ # y = y + self.shared_experts(identity) +++++ # return y +++++ +++++ @no_grad() +++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ +++++ expert_cache = ops.zeros_like(x) +++++ for i in range(self.num_experts_per_tok): +++++ expert_id = flat_expert_indices[i].item() +++++ weight = flat_expert_weights[i].item() +++++ expert = self.experts[expert_id] +++++ expert_out = expert(x) +++++ expert_cache += expert_out * weight +++++ return expert_cache ++++ ++++ @no_grad() ++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++- # expert_cache = torch.zeros_like(x) ++++- # idxs = flat_expert_indices.argsort() ++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++- # token_idxs = idxs // self.num_experts_per_tok ++++- # for i, end_idx in enumerate(tokens_per_expert): ++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++- # if start_idx == end_idx: ++++- # continue ++++- # expert = self.experts[i] ++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++- # expert_tokens = x[exp_token_idx] ++++- # expert_out = expert(expert_tokens) ++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++- # return expert_cache +++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++ expert_cache = ops.zeros_like(x) ++++ idxs = flat_expert_indices.argsort() ++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ token_idxs = idxs // self.num_experts_per_tok +++++ ++++ for i, end_idx in enumerate(tokens_per_expert): ++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ if start_idx == end_idx: ++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++ expert_out = expert(expert_tokens) ++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ ++++ return expert_cache +++++ +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # # expert_cache = torch.zeros_like(x) +++++ # # idxs = flat_expert_indices.argsort() +++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++ # # token_idxs = idxs // self.num_experts_per_tok +++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++ # # if start_idx == end_idx: +++++ # # continue +++++ # # expert = self.experts[i] +++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # # expert_tokens = x[exp_token_idx] +++++ # # expert_out = expert(expert_tokens) +++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++ # # return expert_cache +++++ # expert_cache = ops.zeros_like(x) +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # for i, end_idx in enumerate(tokens_per_expert): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # if start_idx == end_idx: +++++ # continue +++++ # expert = self.experts[i] +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = expert(expert_tokens) +++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ +++++ # return expert_cache +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # expert_cache = ops.zeros_like(x) +++++ +++++ # # 排序保证顺序一致 +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # # 找出有 token 的专家 +++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++ +++++ # for i in active_experts.tolist(): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # end_idx = tokens_per_expert[i] +++++ # if start_idx == end_idx: # 没有 token +++++ # continue +++++ +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = self.experts[i](expert_tokens) +++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++ +++++ # expert_cache = mindspore.mint.scatter_add( +++++ # expert_cache, +++++ # 0, +++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++ # expert_out +++++ # ) +++++ +++++ # return expert_cache +++++ +++++ ++++ ++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++ # """ ++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ ++++ # Initialize weights and apply final processing ++++ self.post_init() +++++ self.warm_up = False +++++ +++++ def warmup_moe_model_deep(self): +++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++ test_texts = [ +++++ "warmup short", +++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++ ] +++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++ if tokenizer is None: +++++ from mindnlp.transformers import AutoTokenizer +++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++ self._warmup_tokenizer = tokenizer +++++ +++++ for text in test_texts: +++++ inputs = tokenizer(text, return_tensors="ms") +++++ with mindspore._no_grad(): +++++ _ = self(**inputs, use_cache=False) +++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++ ++++ def get_input_embeddings(self): ++++ return self.model.embed_tokens ++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++ ```""" +++++ if not self.warm_up: +++++ self.warm_up = True +++++ self.warmup_moe_model_deep() +++++ ++++ output_attentions = ( ++++ output_attentions ++++ if output_attentions is not None ++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++index 3cbf820e..d4c6b651 100644 ++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++@@ -18,7 +18,6 @@ ++++ # See the License for the specific language governing permissions and ++++ # limitations under the License. ++++ """MindSpore Qwen2MoE model.""" ++++- ++++ import math ++++ from typing import List, Optional, Tuple, Union ++++ ++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++ TokenClassifierOutput, ++++ ) ++++ from ...modeling_utils import PreTrainedModel +++++from ...generation import GenerationMixin ++++ from ....utils import logging ++++ from .configuration_qwen2_moe import Qwen2MoeConfig ++++ ++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++ self.variance_epsilon = eps ++++ ++++ def forward(self, hidden_states): +++++ # @dwj +++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++ # @lwx +++++ # if not self.training : +++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++ input_dtype = hidden_states.dtype ++++ hidden_states = hidden_states.to(mindspore.float32) ++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++@@ -234,6 +239,8 @@ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++ x1 = x[..., : x.shape[-1] // 2] ++++ x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++ self.config = config ++++ self.hidden_size = config.hidden_size ++++ self.intermediate_size = intermediate_size +++++ ++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++ self.act_fn = ACT2FN[config.hidden_act] ++++ ++++ def forward(self, x): ++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++- ++++ +++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++ # @lwx +++++ # gate_up_output = self.gate_up_proj(x) +++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++ # return self.down_proj(swiglu_output) +++++ +++++ # def forward(self, x): +++++ # gate_proj_out = self.gate_proj(x) +++++ # up_proj_out = self.up_proj(x) +++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++ # return self.down_proj(swiglu_out) +++++ ++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++ """ ++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++ use_cache: bool = False, ++++ cache_position: Optional[mindspore.Tensor] = None, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ +++++ ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ query_states = self.q_proj(hidden_states) ++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ "with a layer index." ++++ ) ++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ if isinstance(past_key_value, StaticCache): +++++ kv_seq_len = key_states.shape[-2] +++++ else: +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++ if isinstance(past_key_value, StaticCache): +++++ kv_seq_len = key_states.shape[-2] ++++ ++++ # repeat k/v heads if n_kv_heads < n_heads ++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++- +++++ ++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++ ++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++- raise ValueError( ++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++- f" {attn_weights.shape}" ++++- ) ++++- ++++- if attention_mask is not None: # no matter the length, we just slice it ++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++ if attention_mask is not None: +++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++ attn_weights = attn_weights + causal_mask ++++ ++++ # upcast attention to fp32 ++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++ ++++ attn_output = self.o_proj(attn_output) ++++- +++++ # @lwx +++++ +++++ # max_seq_len = self.max_position_embeddings # 2048 +++++ +++++ # if attention_mask is not None: +++++ # # attention_mask: [B, 1, Sq, Sk] +++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++ +++++ # # pad 到 [max_seq_len, max_seq_len] +++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++ # global_attention_mask = padded_mask +++++ # else: +++++ # global_attention_mask = None +++++ +++++ +++++ # sparse_mode=3 +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, +++++ # key=key_states, +++++ # value=value_states, +++++ # real_shift=None, +++++ # padding_mask=None, +++++ +++++ # head_num=self.num_heads, +++++ # attn_mask=global_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++ # input_layout="BNSD", +++++ # pre_tokens=2147483647, +++++ # next_tokens=2147483647, +++++ # inner_precise=0, +++++ # drop_mask=None, +++++ # prefix=None, +++++ # actual_seq_qlen=None, +++++ # actual_seq_kvlen=None, +++++ # sparse_mode=sparse_mode, +++++ # ) ++++ if not output_attentions: ++++ attn_weights = None ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++ +++++class Qwen2MoeFlashAttention(nn.Module): +++++ """ +++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++ +++++ 关键改动: +++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++ 直接传入原始的 key 和 value 张量效率更高。 +++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++ """ +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++ super().__init__() +++++ self.config = config +++++ self.layer_idx = layer_idx +++++ self.hidden_size = config.hidden_size +++++ self.num_heads = config.num_attention_heads +++++ self.head_dim = self.hidden_size // self.num_heads +++++ self.num_key_value_heads = config.num_key_value_heads +++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++ self.max_position_embeddings = config.max_position_embeddings +++++ self.rope_theta = config.rope_theta +++++ self.attention_dropout = config.attention_dropout +++++ +++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++ raise ValueError( +++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++ ) +++++ +++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++ +++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++ self.head_dim, +++++ max_position_embeddings=self.max_position_embeddings, +++++ base=self.rope_theta, +++++ ) +++++ +++++ def forward( +++++ self, +++++ hidden_states: mindspore.Tensor, +++++ attention_mask: Optional[mindspore.Tensor] = None, +++++ position_ids: Optional[mindspore.Tensor] = None, +++++ past_key_value: Optional[Cache] = None, +++++ output_attentions: bool = False, +++++ use_cache: bool = False, +++++ cache_position: Optional[mindspore.Tensor] = None, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ # 1. 线性投射 Q, K, V +++++ query_states = self.q_proj(hidden_states) +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++ # query: [B, S, H*D] -> [B, N1, S, D] +++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # 3. RoPE 旋转位置编码 +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++ if self.layer_idx is None: +++++ raise ValueError( +++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ "with a layer index." +++++ ) +++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++ if cache_position.shape[0] == 1: +++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++ kv_seq_len = past_seen_tokens + 1 +++++ else: +++++ # prefill 阶段:cache_position 是范围,使用其长度 +++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++ else: +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # 4. KV 缓存更新 +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ key_states, value_states = past_key_value.update( +++++ key_states, value_states, self.layer_idx, cache_kwargs +++++ ) +++++ +++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++ if cache_position.shape[0] == 1: +++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++ kv_seq_len = key_states.shape[-2] +++++ +++++ # 5. [重要] 准备 Attention Mask +++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++ fa_attention_mask = None +++++ if attention_mask is not None: +++++ # 截取与当前key长度匹配的部分 +++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++ fa_attention_mask = (mask_slice != 0) +++++ +++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++ input_dtype = query_states.dtype +++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++ query_states = query_states.to(mindspore.float16) +++++ key_states = key_states.to(mindspore.float16) +++++ value_states = value_states.to(mindspore.float16) +++++ +++++ # 6. [核心] 调用 flash_attention_score 算子 +++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++ attn_output = mindspore.ops.flash_attention_score( +++++ query=query_states, +++++ key=key_states, +++++ value=value_states, +++++ head_num=self.num_heads, # 传入Q的头数(N1) +++++ attn_mask=fa_attention_mask, +++++ keep_prob=1.0 - self.attention_dropout, +++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++ input_layout="BNSD", +++++ sparse_mode=0 # 使用 defaultMask 模式 +++++ ) +++++ +++++ # 恢复原始数据类型 +++++ attn_output = attn_output.to(input_dtype) +++++ +++++ # 7. 调整输出形状 +++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ attn_output = self.o_proj(attn_output) +++++ +++++ # FlashAttention 算子不直接返回注意力权重矩阵 +++++ attn_weights = None +++++ if output_attentions: +++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++ # def forward( +++++ # self, +++++ # hidden_states: mindspore.Tensor, +++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++ # position_ids: Optional[mindspore.Tensor] = None, +++++ # past_key_value: Optional[Cache] = None, +++++ # output_attentions: bool = False, +++++ # use_cache: bool = False, +++++ # cache_position: Optional[mindspore.Tensor] = None, +++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ # bsz, q_len, _ = hidden_states.shape +++++ +++++ # # 1. 线性投射 Q, K, V +++++ # query_states = self.q_proj(hidden_states) +++++ # key_states = self.k_proj(hidden_states) +++++ # value_states = self.v_proj(hidden_states) +++++ +++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # # 3. RoPE 旋转位置编码 +++++ # kv_seq_len = key_states.shape[-2] +++++ # if past_key_value is not None: +++++ # if self.layer_idx is None: +++++ # raise ValueError( +++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ # "with a layer index." +++++ # ) +++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # # 4. KV 缓存更新 +++++ # if past_key_value is not None: +++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ # key_states, value_states = past_key_value.update( +++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++ # ) +++++ +++++ # # 5. 准备 Attention Mask +++++ # fa_attention_mask = None +++++ # if attention_mask is not None: +++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # fa_attention_mask = (mask_slice != 0) +++++ +++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++ # input_dtype = query_states.dtype +++++ +++++ # # 6. [核心] 调用 flash_attention_score 算子 +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, +++++ # key=key_states, +++++ # value=value_states, +++++ # head_num=self.num_heads, +++++ # attn_mask=fa_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++ # input_layout="BNSD", +++++ # sparse_mode=0, +++++ # # <--- 修改点 2: 启用内部高精度计算 --- +++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++ # inner_precise=1 +++++ # ) +++++ +++++ # # 恢复原始数据类型 +++++ # attn_output = attn_output.to(input_dtype) +++++ +++++ # # 7. 调整输出形状 +++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ # attn_output = self.o_proj(attn_output) +++++ +++++ # attn_weights = None +++++ # if output_attentions: +++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++ # return attn_output, attn_weights, past_key_value +++++ +++++ # def forward( +++++ # self, +++++ # hidden_states: mindspore.Tensor, +++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++ # position_ids: Optional[mindspore.Tensor] = None, +++++ # past_key_value: Optional[Cache] = None, +++++ # output_attentions: bool = False, +++++ # use_cache: bool = False, +++++ # cache_position: Optional[mindspore.Tensor] = None, +++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ # bsz, q_len, _ = hidden_states.shape +++++ +++++ # query_states = self.q_proj(hidden_states) +++++ # key_states = self.k_proj(hidden_states) +++++ # value_states = self.v_proj(hidden_states) +++++ +++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # kv_seq_len = key_states.shape[-2] +++++ # if past_key_value is not None: +++++ # if self.layer_idx is None: +++++ # raise ValueError("`layer_idx` must be specified for caching") +++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # if past_key_value is not None: +++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ # key_states, value_states = past_key_value.update( +++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++ # ) +++++ +++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++ +++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++ # query_states = query_states / math.sqrt(self.head_dim) +++++ # # <--- 修改结束 --- +++++ +++++ # fa_attention_mask = None +++++ # if attention_mask is not None: +++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # fa_attention_mask = (mask_slice != 0) +++++ +++++ # input_dtype = query_states.dtype +++++ +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, # 传入已经预先缩放过的 query +++++ # key=key_states, +++++ # value=value_states, +++++ # head_num=self.num_heads, +++++ # attn_mask=fa_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++ # input_layout="BNSD", +++++ # sparse_mode=0, +++++ # inner_precise=1 # 仍然保持内部高精度计算 +++++ # ) +++++ +++++ # attn_output = attn_output.to(input_dtype) +++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ # attn_output = self.o_proj(attn_output) +++++ +++++ # attn_weights = None +++++ # if output_attentions: +++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++ +++++ # return attn_output, attn_weights, past_key_value +++++ ++++ QWEN2MOE_ATTENTION_CLASSES = { ++++ "eager": Qwen2MoeAttention, +++++ "flash-attention": Qwen2MoeFlashAttention, ++++ } ++++ ++++ ++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ +++++ #@dwj +++++ # 只遍历激活的专家,而非全部专家 ++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- hidden_states = hidden_states.view(-1, hidden_dim) ++++- # router_logits: (batch * sequence_length, n_experts) ++++- router_logits = self.gate(hidden_states) ++++- ++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- if self.norm_topk_prob: ++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- # we cast back to the input dtype ++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++- final_hidden_states = ops.zeros( ++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++- ) ++++- ++++- # One hot encode the selected experts to create an expert mask ++++- # this will be used to easily index which expert is going to be sollicitated ++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++- ++++- # Loop over all available experts in the model and perform the computation on each expert ++++- for expert_idx in range(self.num_experts): ++++- expert_layer = self.experts[expert_idx] ++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++- ++++- # Index the correct hidden states and compute the expert hidden state for ++++- # the current expert. We need to make sure to multiply the output hidden ++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++- if 0 not in idx.shape: ++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++- ++++- # However `index_add_` only support torch tensors for indexing so we'll use ++++- # the `top_x` tensor here. ++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++- ++++- shared_expert_output = self.shared_expert(hidden_states) ++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++- ++++- final_hidden_states = final_hidden_states + shared_expert_output +++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++ num_tokens = hidden_states_reshaped.shape[0] +++++ +++++ router_logits = self.gate(hidden_states_reshaped) +++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++ if self.norm_topk_prob: +++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++ flat_selected_experts = selected_experts.flatten() +++++ +++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++ token_indices = broadcasted_token_indices.flatten() +++++ +++++ active_experts = ops.unique(flat_selected_experts) +++++ +++++ for expert_idx_tensor in active_experts: +++++ expert_idx = expert_idx_tensor.item() +++++ expert_layer = self.experts[expert_idx] +++++ +++++ mask = (flat_selected_experts == expert_idx_tensor) +++++ selected_token_indices = token_indices[mask] +++++ selected_routing_weights = routing_weights.flatten()[mask] +++++ +++++ current_states = hidden_states_reshaped[selected_token_indices] +++++ +++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++ +++++ final_hidden_states = final_hidden_states.index_add( +++++ dim=0, +++++ index=selected_token_indices, +++++ source=expert_output.to(hidden_states.dtype) +++++ ) +++++ +++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++ ++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++- return final_hidden_states, router_logits +++++ final_hidden_states = final_hidden_states + shared_expert_output +++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++ +++++ return final_hidden_states, router_logits ++++ ++++ ++++ class Qwen2MoeDecoderLayer(nn.Module): ++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++ ++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++ ++++ if (layer_idx not in config.mlp_only_layers) and ( ++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++ ): ++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++ _skip_keys_device_placement = "past_key_values" ++++ _supports_cache_class = True +++++#lwx +++++ # _supports_static_cache = True ++++ ++++ def _init_weights(self, module): ++++ std = self.config.initializer_range ++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++ return causal_mask ++++ ++++ ++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ _tied_weights_keys = ["lm_head.weight"] ++++ ++++ def __init__(self, config): ++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ self.num_experts_per_tok = config.num_experts_per_tok ++++ # Initialize weights and apply final processing ++++ self.post_init() +++++ # @lwx +++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++ # self.generation_config.cache_implementation = "static" +++++ self._warmed_up = False +++++ +++++ def warmup_moe_model(self): +++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++ test_texts = [ +++++ "warmup short", +++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++ ] +++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++ if tokenizer is None: +++++ from mindnlp.transformers import AutoTokenizer +++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++ self._warmup_tokenizer = tokenizer +++++ +++++ for text in test_texts: +++++ inputs = tokenizer(text, return_tensors="ms") +++++ with mindspore._no_grad(): +++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++ ++++ def get_input_embeddings(self): ++++ return self.model.embed_tokens ++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++ ```""" +++++ if not self._warmed_up: +++++ self._warmed_up = True +++++ self.warmup_moe_model() ++++ ++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++ output_router_logits = ( ++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ } ++++ ) ++++ return model_inputs +++++# @lwx +++++ # def _decode_one_tokens_logits( +++++ # self, +++++ # cur_token: mindspore.Tensor, +++++ # input_pos: Optional[mindspore.Tensor], +++++ # cache_position: mindspore.Tensor, +++++ # past_key_values: StaticCache, +++++ # ) -> mindspore.Tensor: +++++ # """ +++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++ +++++ # Args: +++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++ # input_pos: 输入位置信息,可选 +++++ # cache_position: 当前token在cache中的位置,shape为(1,) +++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++ +++++ # Returns: +++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++ # """ +++++ # # 调用JIT编译的版本 +++++ # return self.get_decode_one_tokens_logits( +++++ # cur_token=cur_token, +++++ # input_pos=input_pos, +++++ # cache_position=cache_position, +++++ # past_key_values=past_key_values, +++++ # ) +++++ +++++ # @mindspore.jit(jit_level='O1') +++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++ # """ +++++ # JIT编译的函数,用于高效的单token解码 +++++ # 使用JIT编译优化以支持静态shape和高效执行 +++++ +++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++ # """ +++++ # outputs = self.model.forward( +++++ # input_ids=cur_token, +++++ # position_ids=input_pos, +++++ # cache_position=cache_position, +++++ # past_key_values=past_key_values, +++++ # use_cache=True, +++++ # return_dict=False, +++++ # ) +++++ +++++ # hidden_states = outputs[0] +++++ # logits = self.lm_head.forward(hidden_states) +++++ # logits = logits.float() +++++ +++++ # return logits[:, -1, :] +++++ +++++ # def _sample( +++++ # self, +++++ # input_ids: mindspore.Tensor, +++++ # logits_processor, +++++ # stopping_criteria, +++++ # generation_config, +++++ # synced_devices: bool, +++++ # streamer=None, +++++ # logits_warper=None, +++++ # **model_kwargs, +++++ # ): +++++ # """ +++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++ # """ +++++ # from ...generation.logits_process import LogitsProcessorList +++++ # from ...generation.stopping_criteria import StoppingCriteriaList +++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++ # from mindnlp.core import nn, ops, no_grad +++++ # import numpy as np +++++ +++++ # # 检查是否使用 StaticCache +++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++ # # 否则,直接调用父类方法 +++++ # past_key_values = model_kwargs.get("past_key_values") +++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++ +++++ # if not isinstance(past_key_values, StaticCache): +++++ # # 不使用 StaticCache,直接调用父类方法 +++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++ # return super()._sample( +++++ # input_ids=input_ids, +++++ # logits_processor=logits_processor, +++++ # stopping_criteria=stopping_criteria, +++++ # generation_config=generation_config, +++++ # synced_devices=synced_devices, +++++ # streamer=streamer, +++++ # logits_warper=logits_warper, +++++ # **model_kwargs, +++++ # ) +++++ +++++ # # 使用 StaticCache,进入自定义循环 +++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++ # pad_token_id = generation_config._pad_token_tensor +++++ # output_attentions = generation_config.output_attentions +++++ # output_hidden_states = generation_config.output_hidden_states +++++ # output_scores = generation_config.output_scores +++++ # output_logits = generation_config.output_logits +++++ # return_dict_in_generate = generation_config.return_dict_in_generate +++++ # max_length = generation_config.max_length +++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++ # do_sample = generation_config.do_sample +++++ +++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++ # raise ValueError( +++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++ # f"{logits_warper})." +++++ # ) +++++ +++++ # # init attention / hidden states / scores tuples +++++ # scores = () if (return_dict_in_generate and output_scores) else None +++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++ +++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++ # encoder_hidden_states = ( +++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++ # ) +++++ +++++ # # keep track of which sequences are already finished +++++ # batch_size, cur_len = input_ids.shape +++++ # this_peer_finished = False +++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++ +++++ # time_record = [] +++++ # from ....utils.testing_utils import parse_flag_from_env +++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++ +++++ # while self._has_unfinished_sequences( +++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++ # ): +++++ # if _record_time: +++++ # import time as time_module +++++ # infer_start = time_module.time() +++++ +++++ # # prepare model inputs +++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++ +++++ # # prepare variable output controls +++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++ +++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++ # cur_cache_position = model_inputs.get("cache_position") +++++ # cur_past_key_values = model_inputs.get("past_key_values") +++++ # cur_input_ids = model_inputs.get("input_ids") +++++ +++++ # if (isinstance(cur_past_key_values, StaticCache) and +++++ # cur_cache_position is not None and +++++ # len(cur_cache_position.shape) > 0 and +++++ # cur_cache_position.shape[0] == 1 and +++++ # cur_input_ids is not None and +++++ # cur_input_ids.shape[1] == 1): +++++ # # 使用 JIT 优化的单 token 解码 +++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++ # if not hasattr(self, '_jit_used'): +++++ # self._jit_used = False +++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++ +++++ # next_token_logits = self.get_decode_one_tokens_logits( +++++ # cur_token=cur_input_ids, +++++ # input_pos=model_inputs.get("position_ids"), +++++ # cache_position=cur_cache_position, +++++ # past_key_values=cur_past_key_values, +++++ # ) +++++ +++++ # # 标记已使用JIT(用于后续判断) +++++ # if not self._jit_used: +++++ # self._jit_used = True +++++ +++++ # # 构造兼容的输出对象 +++++ # class JitOptimizedOutput: +++++ # def __init__(self, logits, config): +++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++ # self.config = config +++++ # # 对于 JIT 优化路径,这些属性通常不需要 +++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++ # self.attentions = None if not config.is_encoder_decoder else None +++++ # self.cross_attentions = None +++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++ # self.hidden_states = None if not config.is_encoder_decoder else None +++++ +++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++ # else: +++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++ # outputs = self(**model_inputs, return_dict=True) +++++ +++++ # if synced_devices and this_peer_finished: +++++ # continue +++++ +++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++ # next_token_logits = outputs.logits[:, -1, :] +++++ +++++ # # pre-process distribution +++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++ # if do_sample: +++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++ +++++ # # Store scores, attentions and hidden_states when required +++++ # if return_dict_in_generate: +++++ # if output_scores: +++++ # scores += (next_token_scores,) +++++ # if output_logits: +++++ # raw_logits += (next_token_logits,) +++++ # if output_attentions: +++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++ # decoder_attentions += (attn,) if attn is not None else (None,) +++++ # if self.config.is_encoder_decoder: +++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++ +++++ # if output_hidden_states: +++++ # hidden = ( +++++ # outputs.decoder_hidden_states +++++ # if self.config.is_encoder_decoder +++++ # else outputs.hidden_states +++++ # ) +++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++ +++++ # # token selection +++++ # if do_sample: +++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++ # else: +++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++ +++++ # # finished sentences should have their next token be a padding token +++++ # if has_eos_stopping_criteria: +++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++ +++++ # # update generated ids, model inputs, and length for next step +++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++ # if streamer is not None: +++++ # streamer.put(next_tokens) +++++ +++++ # model_kwargs = self._update_model_kwargs_for_generation( +++++ # outputs, +++++ # model_kwargs, +++++ # is_encoder_decoder=self.config.is_encoder_decoder, +++++ # ) +++++ +++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++ # cur_len += 1 +++++ +++++ # if _record_time: +++++ # import time as time_module +++++ # infer_stop = time_module.time() +++++ # time_record.append(infer_stop - infer_start) +++++ +++++ # del outputs +++++ +++++ # average_infer_time = None +++++ # if time_record: +++++ # if len(time_record) > 1: +++++ # time_record.pop(0) +++++ # average_infer_time = sum(time_record) / len(time_record) +++++ # print(f'average inference time is: {average_infer_time}') +++++ # print(f'inference time record: {time_record}') +++++ +++++ # if streamer is not None: +++++ # streamer.end() +++++ +++++ # # 简单判断:打印是否使用了JIT路径 +++++ # if hasattr(self, '_jit_used') and self._jit_used: +++++ # print("[JIT] ✓ JIT optimization was used during generation") +++++ # else: +++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++ +++++ # if return_dict_in_generate: +++++ # if self.config.is_encoder_decoder: +++++ # return GenerateEncoderDecoderOutput( +++++ # sequences=input_ids, +++++ # scores=scores, +++++ # logits=raw_logits, +++++ # encoder_attentions=encoder_attentions, +++++ # encoder_hidden_states=encoder_hidden_states, +++++ # decoder_attentions=decoder_attentions, +++++ # cross_attentions=cross_attentions, +++++ # decoder_hidden_states=decoder_hidden_states, +++++ # past_key_values=model_kwargs.get("past_key_values"), +++++ # average_infer_time=average_infer_time +++++ # ) +++++ # else: +++++ # return GenerateDecoderOnlyOutput( +++++ # sequences=input_ids, +++++ # scores=scores, +++++ # logits=raw_logits, +++++ # attentions=decoder_attentions, +++++ # hidden_states=decoder_hidden_states, +++++ # past_key_values=model_kwargs.get("past_key_values"), +++++ # average_infer_time=average_infer_time +++++ # ) +++++ # else: +++++ # return input_ids +++++ +++++ # def _prepare_cache_for_generation( +++++ # self, +++++ # generation_config, +++++ # model_kwargs, +++++ # assistant_model, +++++ # batch_size, +++++ # max_cache_length, +++++ # ): +++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++ # generation_config.cache_implementation = "static" +++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++ +++++ # if generation_config.cache_implementation == "static": +++++ # base_required_from_max_length = generation_config.max_length + 1 +++++ # base_required = max(max_cache_length, base_required_from_max_length) +++++ # min_cache_size = 50 +++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++ # else: +++++ # max_cache_length = max(base_required, min_cache_size) +++++ +++++ # original_max_cache_length = max_cache_length +++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++ # print(f" - input max_cache_length: {original_max_cache_length}") +++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++ # print(f" - final max_cache_length: {max_cache_length}") +++++ +++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++ # if max_cache_length > self.config.max_position_embeddings: +++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++ +++++ # result = super()._prepare_cache_for_generation( +++++ # generation_config=generation_config, +++++ # model_kwargs=model_kwargs, +++++ # assistant_model=assistant_model, +++++ # batch_size=batch_size, +++++ # max_cache_length=max_cache_length, +++++ # ) +++++ +++++ # if generation_config.cache_implementation == "static": +++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++ # created_cache = model_kwargs.get(cache_name) +++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++ # if created_cache.max_cache_len < generation_config.max_length: +++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++ +++++ # return result +++++ +++++ +++++ ++++ ++++ ++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++-- ++++2.27.0 ++++ +++-- +++2.27.0 +++ ++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++new file mode 100644 ++index 00000000..966529e4 ++--- /dev/null +++++ b/patches/0003-20261106secondcommit.patch ++@@ -0,0 +1,2769 @@ +++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Thu, 6 Nov 2025 14:54:37 +0800 +++Subject: [PATCH 3/3] 20261106secondcommit +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +++ patches/0001-20251104commit.patch | 1272 ----------------- +++ 3 files changed, 528 insertions(+), 2032 deletions(-) +++ delete mode 100644 patches/0001-20251104commit.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index 73773c22..2f9192bf 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +++ +++ _CONFIG_FOR_DOC = "DeepseekConfig" +++ ++++_attn_mask_cache = {} ++++ ++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): ++++ q_len = batch_and_seq[1] ++++ kv_len = batch_and_seq[1] + past_key_values_length ++++ key = (batch_and_seq[0], q_len, kv_len) ++++ ++++ if key in _attn_mask_cache: ++++ return _attn_mask_cache[key] ++++ ++++ mask = _prepare_4d_causal_attention_mask( ++++ attention_mask, ++++ batch_and_seq, ++++ inputs_embeds, ++++ past_key_values_length, ++++ ) ++++ _attn_mask_cache[key] = mask ++++ return mask +++ +++ def _get_unpad_data(attention_mask): +++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +++ return final_output +++ +++ +++- @no_grad() +++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++- expert_cache = ops.zeros_like(x) +++- idxs = flat_expert_indices.argsort() +++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++- token_idxs = idxs // self.num_experts_per_tok +++- +++- for i, end_idx in enumerate(tokens_per_expert): +++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++- if start_idx == end_idx: +++- continue +++- expert = self.experts[i] +++- exp_token_idx = token_idxs[start_idx:end_idx] +++- expert_tokens = x[exp_token_idx] +++- expert_out = expert(expert_tokens) +++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++- +++- return expert_cache +++- +++ # @no_grad() +++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++- # # expert_cache = torch.zeros_like(x) +++- # # idxs = flat_expert_indices.argsort() +++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++- # # token_idxs = idxs // self.num_experts_per_tok +++- # # for i, end_idx in enumerate(tokens_per_expert): +++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++- # # if start_idx == end_idx: +++- # # continue +++- # # expert = self.experts[i] +++- # # exp_token_idx = token_idxs[start_idx:end_idx] +++- # # expert_tokens = x[exp_token_idx] +++- # # expert_out = expert(expert_tokens) +++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++- # # return expert_cache ++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ # expert_cache = ops.zeros_like(x) +++ # idxs = flat_expert_indices.argsort() +++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++ +++ # return expert_cache +++- # @no_grad() +++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++- # expert_cache = ops.zeros_like(x) ++++ ++++ @no_grad() ++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++ """ ++++ 优化版 MoE prefill: ++++ - 批量张量化处理同一个 expert 的所有 token ++++ - 跳过无 token 的专家 ++++ - 保持结果完全一致 ++++ """ ++++ # 初始化输出缓存 ++++ expert_cache = ops.zeros_like(x) +++ +++- # # 排序保证顺序一致 +++- # idxs = flat_expert_indices.argsort() +++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++- # token_idxs = idxs // self.num_experts_per_tok ++++ # 排序(确保 scatter_add 位置对应原逻辑) ++++ idxs = flat_expert_indices.argsort() ++++ sorted_expert_indices = flat_expert_indices[idxs] ++++ sorted_token_indices = idxs // self.num_experts_per_tok +++ +++- # # 找出有 token 的专家 +++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++ # 每个 expert 的 token 数 ++++ tokens_per_expert = sorted_expert_indices.bincount() +++ +++- # for i in active_experts.tolist(): +++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++- # end_idx = tokens_per_expert[i] +++- # if start_idx == end_idx: # 没有 token +++- # continue ++++ # 找出有 token 的专家 ++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++ +++- # exp_token_idx = token_idxs[start_idx:end_idx] +++- # expert_tokens = x[exp_token_idx] +++- # expert_out = self.experts[i](expert_tokens) +++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++ for expert_id in active_experts.tolist(): ++++ # 取该 expert 对应的排序后 token 区间 ++++ start = (tokens_per_expert[:expert_id]).sum().item() ++++ end = start + tokens_per_expert[expert_id].item() +++ +++- # expert_cache = mindspore.mint.scatter_add( +++- # expert_cache, +++- # 0, +++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++- # expert_out +++- # ) ++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 ++++ expert_tokens = x[token_idx] # 取输入向量 +++ +++- # return expert_cache ++++ # 执行专家 MLP ++++ expert_out = self.experts[expert_id](expert_tokens) ++++ ++++ # 按权重缩放 ++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] ++++ ++++ # 回写到缓存(等价 scatter_add) ++++ expert_cache = mindspore.mint.scatter_add( ++++ expert_cache, ++++ 0, ++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++ scaled_out ++++ ) ++++ ++++ return expert_cache ++++ ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # # expert_cache = torch.zeros_like(x) ++++ # # idxs = flat_expert_indices.argsort() ++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++ # # token_idxs = idxs // self.num_experts_per_tok ++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++ # # if start_idx == end_idx: ++++ # # continue ++++ # # expert = self.experts[i] ++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # # expert_tokens = x[exp_token_idx] ++++ # # expert_out = expert(expert_tokens) ++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++ # # return expert_cache ++++ # expert_cache = ops.zeros_like(x) ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # for i, end_idx in enumerate(tokens_per_expert): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # if start_idx == end_idx: ++++ # continue ++++ # expert = self.experts[i] ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = expert(expert_tokens) ++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ ++++ # return expert_cache ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++ # expert_cache = ops.zeros_like(x) ++++ ++++ # # 排序保证顺序一致 ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ # token_idxs = idxs // self.num_experts_per_tok ++++ ++++ # # 找出有 token 的专家 ++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++ ++++ # for i in active_experts.tolist(): ++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ # end_idx = tokens_per_expert[i] ++++ # if start_idx == end_idx: # 没有 token ++++ # continue ++++ ++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++ # expert_tokens = x[exp_token_idx] ++++ # expert_out = self.experts[i](expert_tokens) ++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++ ++++ # expert_cache = mindspore.mint.scatter_add( ++++ # expert_cache, ++++ # 0, ++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++ # expert_out ++++ # ) ++++ ++++ # return expert_cache +++ +++ +++ +++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +++ +++ return attn_output, attn_weights, past_key_value +++ +++- +++ # class DeepseekFlashAttention(nn.Module): +++ # """ +++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +++ +++ return attn_output, attn_weights, past_key_value +++ ++++ +++ Deepseek_ATTENTION_CLASSES = { +++ "eager": DeepseekAttention, +++ "flash-attention": DeepseekFlashAttention, +++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +++ ) +++ else: +++ # 4d mask is passed through the layers +++- attention_mask = _prepare_4d_causal_attention_mask( ++++ # attention_mask = _prepare_4d_causal_attention_mask( ++++ # attention_mask, ++++ # (batch_size, seq_length), ++++ # inputs_embeds, ++++ # past_key_values_length, ++++ # ) ++++ #@dwj ++++ attention_mask = get_cached_causal_mask( +++ attention_mask, +++ (batch_size, seq_length), +++ inputs_embeds, +++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ # Initialize weights and apply final processing +++ self.post_init() +++ self.warm_up = False ++++ #@dwj ++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++++ self.num_layers, ++++ self.num_attention_heads, ++++ self.head_dim, ++++ batch_size=1, ++++ max_length=self.max_length, ++++ dtype=mindspore.float16 ++++ ) ++++ ++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++++ key_cache = [] ++++ value_cache = [] ++++ for _ in range(num_layers): ++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++ key_cache.append(k) ++++ value_cache.append(v) ++++ return key_cache, value_cache ++++ +++ +++ def warmup_moe_model_deep(self): +++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++index bced285c..ebd7782e 100644 +++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++ +++-Long_Prompt = False +++-PROMPT_LENGTH_THRESHOLD = 128 ++++Long_Prompt = 1 ++++LONG_PROMPT_LENGTH_THRESHOLD = 128 ++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 ++++ ++++_causal_mask_cache = {} ++++ ++++def get_cached_causal_mask_with_cache_position( ++++ attention_mask: mindspore.Tensor, ++++ sequence_length: int, ++++ target_length: int, ++++ dtype: mindspore.dtype, ++++ min_dtype: float, ++++ cache_position: mindspore.Tensor, ++++ batch_size: int, ++++): ++++ """ ++++ 带缓存的 causal mask 构造函数 ++++ """ ++++ # q_len 是当前 query 长度 ++++ q_len = sequence_length ++++ # kv_len 是 target_length ++++ kv_len = target_length ++++ ++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 ++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) ++++ ++++ if key in _causal_mask_cache: ++++ return _causal_mask_cache[key] ++++ ++++ # 调用原来的 mask 构造逻辑 ++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++ attention_mask, ++++ sequence_length=sequence_length, ++++ target_length=target_length, ++++ dtype=dtype, ++++ min_dtype=min_dtype, ++++ cache_position=cache_position, ++++ batch_size=batch_size, ++++ ) ++++ # 缓存结果 ++++ _causal_mask_cache[key] = causal_mask ++++ return causal_mask +++ +++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++ def _prepare_4d_causal_attention_mask_with_cache_position( +++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++ +++ +++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe ++++# class Qwen2MoeAttention(nn.Module): ++++# """ ++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++++# and "Generating Long Sequences with Sparse Transformers". ++++# """ ++++ ++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++# super().__init__() ++++# self.config = config ++++# self.layer_idx = layer_idx ++++# if layer_idx is None: ++++# logger.warning_once( ++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++# "when creating this class." ++++# ) ++++ ++++# self.hidden_size = config.hidden_size ++++# self.num_heads = config.num_attention_heads ++++# self.head_dim = self.hidden_size // self.num_heads ++++# self.num_key_value_heads = config.num_key_value_heads ++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++# self.max_position_embeddings = config.max_position_embeddings ++++# self.rope_theta = config.rope_theta ++++# self.is_causal = True ++++# self.attention_dropout = config.attention_dropout ++++ ++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++# raise ValueError( ++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++# f" and `num_heads`: {self.num_heads})." ++++# ) ++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++ ++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++# self.head_dim, ++++# max_position_embeddings=self.max_position_embeddings, ++++# base=self.rope_theta, ++++# ) ++++ ++++# def forward( ++++# self, ++++# hidden_states: mindspore.Tensor, ++++# attention_mask: Optional[mindspore.Tensor] = None, ++++# position_ids: Optional[mindspore.Tensor] = None, ++++# past_key_value: Optional[Cache] = None, ++++# output_attentions: bool = False, ++++# use_cache: bool = False, ++++# cache_position: Optional[mindspore.Tensor] = None, ++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++ ++++ ++++ ++++# bsz, q_len, _ = hidden_states.shape ++++ ++++# query_states = self.q_proj(hidden_states) ++++# key_states = self.k_proj(hidden_states) ++++# value_states = self.v_proj(hidden_states) ++++ ++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++ ++++# kv_seq_len = key_states.shape[-2] ++++# if past_key_value is not None: ++++# if self.layer_idx is None: ++++# raise ValueError( ++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++# "with a layer index." ++++# ) ++++# if isinstance(past_key_value, StaticCache): ++++# kv_seq_len = key_states.shape[-2] ++++# else: ++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++# if past_key_value is not None: ++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++# if isinstance(past_key_value, StaticCache): ++++# kv_seq_len = key_states.shape[-2] ++++ ++++# # repeat k/v heads if n_kv_heads < n_heads ++++# key_states = repeat_kv(key_states, self.num_key_value_groups) ++++# value_states = repeat_kv(value_states, self.num_key_value_groups) ++++ ++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++ ++++# if attention_mask is not None: ++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++# attn_weights = attn_weights + causal_mask ++++ ++++# # upcast attention to fp32 ++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++# attn_output = ops.matmul(attn_weights, value_states) ++++ ++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++# raise ValueError( ++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++++# f" {attn_output.shape}" ++++# ) ++++ ++++# attn_output = ops.transpose(attn_output, 1, 2) ++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++ ++++# attn_output = self.o_proj(attn_output) ++++# # @lwx ++++ ++++# # max_seq_len = self.max_position_embeddings # 2048 ++++ ++++# # if attention_mask is not None: ++++# # # attention_mask: [B, 1, Sq, Sk] ++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++ ++++# # # pad 到 [max_seq_len, max_seq_len] ++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++# # global_attention_mask = padded_mask ++++# # else: ++++# # global_attention_mask = None ++++ ++++ ++++# # sparse_mode=3 ++++# # attn_output = mindspore.ops.flash_attention_score( ++++# # query=query_states, ++++# # key=key_states, ++++# # value=value_states, ++++# # real_shift=None, ++++# # padding_mask=None, ++++ ++++# # head_num=self.num_heads, ++++# # attn_mask=global_attention_mask, ++++# # keep_prob=1.0 - self.attention_dropout, ++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++# # input_layout="BNSD", ++++# # pre_tokens=2147483647, ++++# # next_tokens=2147483647, ++++# # inner_precise=0, ++++# # drop_mask=None, ++++# # prefix=None, ++++# # actual_seq_qlen=None, ++++# # actual_seq_kvlen=None, ++++# # sparse_mode=sparse_mode, ++++# # ) ++++# if not output_attentions: ++++# attn_weights = None ++++ ++++# return attn_output, attn_weights, past_key_value ++++ +++ class Qwen2MoeAttention(nn.Module): +++ """ +++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++- and "Generating Long Sequences with Sparse Transformers". +++- """ ++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +++ ++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: ++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 ++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 ++++ ++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 ++++ """ +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++ super().__init__() +++ self.config = config +++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +++ if layer_idx is None: +++ logger.warning_once( +++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++ "when creating this class." +++ ) +++ +++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +++ use_cache: bool = False, +++ cache_position: Optional[mindspore.Tensor] = None, +++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++- +++ +++- ++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +++ bsz, q_len, _ = hidden_states.shape +++ +++ query_states = self.q_proj(hidden_states) +++ key_states = self.k_proj(hidden_states) +++ value_states = self.v_proj(hidden_states) +++ +++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++- ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ +++ kv_seq_len = key_states.shape[-2] +++ if past_key_value is not None: +++- if self.layer_idx is None: +++- raise ValueError( +++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++- "with a layer index." +++- ) +++- if isinstance(past_key_value, StaticCache): +++- kv_seq_len = key_states.shape[-2] +++- else: +++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ +++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++ +++ if past_key_value is not None: +++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++ ++++ # --- 2. 动态调度核心注意力计算 --- ++++ global Long_Prompt ++++ if Long_Prompt >= 1: ++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- ++++ fa_attention_mask = None ++++ if attention_mask is not None: ++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++ fa_attention_mask = (mask_slice != 0) ++++ ++++ attn_output = mindspore.ops.flash_attention_score( ++++ query=query_states, ++++ key=key_states, ++++ value=value_states, ++++ head_num=self.num_heads, ++++ attn_mask=fa_attention_mask, ++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, ++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++ input_layout="BNSD", ++++ sparse_mode=0, ++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 ++++ ) +++ +++- if isinstance(past_key_value, StaticCache): +++- kv_seq_len = key_states.shape[-2] ++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ attn_output = self.o_proj(attn_output) ++++ attn_weights = None ++++ if output_attentions: ++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++ +++- # repeat k/v heads if n_kv_heads < n_heads +++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++- +++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++ else: ++++ # --- Eager Attention 路径 (用于短序列和解码) --- ++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++ ++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++ +++- if attention_mask is not None: +++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++- attn_weights = attn_weights + causal_mask ++++ if attention_mask is not None: ++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++ attn_weights = attn_weights + causal_mask +++ +++- # upcast attention to fp32 +++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++- attn_output = ops.matmul(attn_weights, value_states) ++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++ attn_output = ops.matmul(attn_weights, value_states) +++ +++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++- raise ValueError( +++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++- f" {attn_output.shape}" +++- ) ++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++ raise ValueError( ++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" ++++ ) +++ +++- attn_output = ops.transpose(attn_output, 1, 2) +++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++ attn_output = ops.transpose(attn_output, 1, 2) ++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++ attn_output = self.o_proj(attn_output) +++ +++- attn_output = self.o_proj(attn_output) +++- # @lwx ++++ if not output_attentions: ++++ attn_weights = None +++ +++- # max_seq_len = self.max_position_embeddings # 2048 +++- +++- # if attention_mask is not None: +++- # # attention_mask: [B, 1, Sq, Sk] +++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++- +++- # # pad 到 [max_seq_len, max_seq_len] +++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++- # global_attention_mask = padded_mask +++- # else: +++- # global_attention_mask = None +++- +++- +++- # sparse_mode=3 +++- # attn_output = mindspore.ops.flash_attention_score( +++- # query=query_states, +++- # key=key_states, +++- # value=value_states, +++- # real_shift=None, +++- # padding_mask=None, +++- +++- # head_num=self.num_heads, +++- # attn_mask=global_attention_mask, +++- # keep_prob=1.0 - self.attention_dropout, +++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++- # input_layout="BNSD", +++- # pre_tokens=2147483647, +++- # next_tokens=2147483647, +++- # inner_precise=0, +++- # drop_mask=None, +++- # prefix=None, +++- # actual_seq_qlen=None, +++- # actual_seq_kvlen=None, +++- # sparse_mode=sparse_mode, +++- # ) +++- if not output_attentions: +++- attn_weights = None +++- +++ return attn_output, attn_weights, past_key_value +++ +++- +++ # class Qwen2MoeFlashAttention(nn.Module): +++ # """ +++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +++ # return final_hidden_states, router_logits +++ +++ +++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++-# """ +++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +++-# """ +++-# def __init__(self, config: Qwen2MoeConfig): +++-# super().__init__() +++-# self.num_experts = config.num_experts +++-# self.top_k = config.num_experts_per_tok +++-# self.norm_topk_prob = config.norm_topk_prob +++- +++-# # 门控网络 +++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++-# # 专家列表 +++-# self.experts = nn.ModuleList( +++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++-# ) +++-# # 共享专家 +++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-# @no_grad() +++-# def _moe_infer_decode( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# """ +++-# 【解码路径】针对 sequence_length=1 的极致优化。 +++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++-# """ +++-# batch_size, hidden_dim = hidden_states.shape +++- +++-# expert_outputs_list = [ +++-# ops.cat([ +++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++-# ], dim=0) +++-# for i in range(batch_size) +++-# ] +++- +++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++-# # shape: (batch_size, top_k, hidden_dim) +++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++- +++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++- +++-# return moe_output.squeeze(1) +++- +++-# @no_grad() +++-# def _moe_infer_prefill( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# """ +++-# 【预填充路径】针对 sequence_length > 1 的优化。 +++-# 按专家对 Token 进行分组,并进行批处理。 +++-# """ +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens = hidden_states.shape[0] +++-# flat_selected_experts = selected_experts.flatten() +++- +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++- +++-# active_experts = ops.unique(flat_selected_experts) +++- +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++- +++-# mask = (flat_selected_experts == expert_idx_tensor) +++-# selected_token_indices = token_indices[mask] +++-# selected_routing_weights = routing_weights.flatten()[mask] +++- +++-# current_states = hidden_states[selected_token_indices] +++- +++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++- +++-# moe_output = moe_output.index_add( +++-# dim=0, +++-# index=selected_token_indices, +++-# source=expert_output.to(hidden_states.dtype) +++-# ) +++-# return moe_output +++- +++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-# """ +++-# 顶层 forward 方法,作为智能分发器。 +++-# """ +++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++- +++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-# router_logits = self.gate(hidden_states_reshaped) +++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- +++-# if self.norm_topk_prob: +++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- +++-# routing_weights = routing_weights.to(hidden_states.dtype) +++- +++-# moe_output = None +++-# # 在推理时,根据序列长度选择最优路径 +++-# if not self.training: +++-# if sequence_length == 1: +++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++-# else: +++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++-# else: +++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++-# raise NotImplementedError("Training path is not implemented.") +++- +++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++- +++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++- +++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++- +++-# return final_hidden_states, router_logits +++- +++- +++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++-# """ +++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++-# """ +++-# def __init__(self, config: Qwen2MoeConfig): +++-# super().__init__() +++-# self.num_experts = config.num_experts +++-# self.top_k = config.num_experts_per_tok +++-# self.norm_topk_prob = config.norm_topk_prob +++- +++-# # 门控网络 +++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++-# # 专家列表 +++-# self.experts = nn.ModuleList( +++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++-# ) +++-# # 共享专家 +++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-# @no_grad() +++-# def _moe_infer_decode( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# batch_size, _ = hidden_states.shape +++-# expert_outputs_list = [ +++-# ops.cat([ +++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++-# ], dim=0) +++-# for i in range(batch_size) +++-# ] +++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++-# return moe_output.squeeze(1) +++- +++-# @no_grad() +++-# def _moe_infer_prefill( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens = hidden_states.shape[0] +++-# flat_selected_experts = selected_experts.flatten() +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++-# active_experts = ops.unique(flat_selected_experts) +++- +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++-# mask = (flat_selected_experts == expert_idx_tensor) +++-# selected_token_indices = token_indices[mask] +++-# selected_routing_weights = routing_weights.flatten()[mask] +++-# current_states = hidden_states[selected_token_indices] +++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++-# moe_output = moe_output.index_add( +++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++-# ) +++-# return moe_output +++- +++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-# """ +++-# 顶层 forward 方法,作为智能分发器。 +++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++-# """ +++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++- +++-# # 1. 门控计算 (通用逻辑) +++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-# router_logits = self.gate(hidden_states_reshaped) +++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- +++-# if self.norm_topk_prob: +++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- +++-# routing_weights = routing_weights.to(hidden_states.dtype) +++- +++-# # 2. 智能分发到最优 MoE 路径 +++-# moe_output = None +++-# if not self.training: +++-# if sequence_length == 1: +++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++-# else: +++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++-# else: +++-# raise NotImplementedError("Training path is not implemented.") +++- +++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++- +++-# # 4. 合并 MoE 输出和共享专家输出 +++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++- +++-# # 5. 恢复原始形状并返回 +++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++- +++-# return final_hidden_states, router_logits +++- +++-# prefill fastest +++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++-# """ +++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++-# """ +++-# def __init__(self, config: Qwen2MoeConfig): +++-# super().__init__() +++-# self.num_experts = config.num_experts +++-# self.top_k = config.num_experts_per_tok +++-# self.norm_topk_prob = config.norm_topk_prob +++- +++-# # 门控网络 +++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++-# # 专家列表 +++-# self.experts = nn.ModuleList( +++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++-# ) +++-# # 共享专家 +++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-# @no_grad() +++-# def _moe_infer_dispatch( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# """ +++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++-# """ +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens, _ = hidden_states.shape +++- +++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++-# flat_selected_experts = selected_experts.flatten() +++-# flat_routing_weights = routing_weights.flatten() +++- +++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++- +++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++-# active_experts = ops.unique(flat_selected_experts) +++- +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++- +++-# # 找到所有分配给该专家的 token +++-# mask = (flat_selected_experts == expert_idx_tensor) +++- +++-# # 使用 mask 选取对应的 token 和权重 +++-# current_token_indices = token_indices[mask] +++-# current_routing_weights = flat_routing_weights[mask] +++-# current_hidden_states = hidden_states[current_token_indices] +++- +++-# # 对这些 token 进行批处理 +++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++- +++-# # 使用 index_add 将结果精确地加回到对应位置 +++-# moe_output = moe_output.index_add( +++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++-# ) +++-# return moe_output +++- +++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-# """ +++-# 顶层 forward 方法,作为智能分发器。 +++-# """ +++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++- +++-# # 1. 门控计算 +++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-# router_logits = self.gate(hidden_states_reshaped) +++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- +++-# if self.norm_topk_prob: +++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- +++-# routing_weights = routing_weights.to(hidden_states.dtype) +++- +++-# # 2. 调用统一的 MoE 计算内核 +++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++- +++-# # 3. 统一处理共享专家 +++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++- +++-# # 4. 合并输出 +++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++- +++-# # 5. 恢复原始形状并返回 +++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++- +++-# return final_hidden_states, router_logits +++- +++- +++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++-# """ +++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++-# 【最终高性能与高精度版】: +++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++-# 3. 这样实现了速度和准确性的两全其美。 +++-# """ +++-# def __init__(self, config: Qwen2MoeConfig): +++-# super().__init__() +++-# self.num_experts = config.num_experts +++-# self.top_k = config.num_experts_per_tok +++-# self.norm_topk_prob = config.norm_topk_prob +++- +++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++-# self.experts = nn.ModuleList( +++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++-# ) +++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-# @no_grad() +++-# def _moe_infer_decode( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# """ +++-# 【解码路径】极致优化版:bmm + 高精度累加。 +++-# """ +++-# original_dtype = hidden_states.dtype +++-# batch_size, _ = hidden_states.shape +++- +++-# expert_outputs_list = [ +++-# ops.cat([ +++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++-# ], dim=0) +++-# for i in range(batch_size) +++-# ] +++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++- +++-# # 在 float32 下执行 bmm,得到高精度结果 +++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++- +++-# # 将高精度结果转换回原始数据类型 +++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++- +++-# return moe_output +++- +++-# @no_grad() +++-# def _moe_infer_prefill( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# selected_experts: mindspore.Tensor, +++-# routing_weights: mindspore.Tensor +++-# ) -> mindspore.Tensor: +++-# """ +++-# 【预填充路径】与原始实现一致,结果精确。 +++-# """ +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens, _ = hidden_states.shape +++-# flat_selected_experts = selected_experts.flatten() +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++-# active_experts = ops.unique(flat_selected_experts) +++- +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++-# mask = (flat_selected_experts == expert_idx_tensor) +++-# selected_token_indices = token_indices[mask] +++-# selected_routing_weights = routing_weights.flatten()[mask] +++-# current_states = hidden_states[selected_token_indices] +++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++-# moe_output = moe_output.index_add( +++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++-# ) +++-# return moe_output +++- +++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++- +++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-# router_logits = self.gate(hidden_states_reshaped) +++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++- +++-# if self.norm_topk_prob: +++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- +++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++-# # 如果模型主体是 float16,后续再转换 +++- +++-# moe_output = None +++-# if not self.training: +++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++-# # _moe_infer_decode 内部会处理好类型转换 +++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++-# if sequence_length == 1: +++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++-# else: +++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++-# else: +++-# raise NotImplementedError("Training path is not implemented.") +++- +++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++- +++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++- +++-# return final_hidden_states, router_logits +++- +++- +++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++-# """ +++-# 【融合版】一个混合专家模块,内置两种推理策略, +++-# 由外部全局变量 `Long_Prompt` 控制: +++- +++-# - if Long_Prompt is True: 【精度优先模式】 +++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++-# 适用于处理长序列,避免误差累积。 +++- +++-# - if Long_Prompt is False: 【速度优先模式】 +++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +++-# """ +++-# def __init__(self, config: Qwen2MoeConfig): +++-# super().__init__() +++-# self.num_experts = config.num_experts +++-# self.top_k = config.num_experts_per_tok +++-# self.norm_topk_prob = config.norm_topk_prob +++- +++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++-# self.experts = nn.ModuleList( +++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++-# ) +++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-# # --- 速度优先模式的辅助函数 --- +++-# @no_grad() +++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++-# original_dtype = hidden_states.dtype +++-# batch_size, _ = hidden_states.shape +++-# expert_outputs_list = [ +++-# ops.cat([ +++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++-# ], dim=0) +++-# for i in range(batch_size) +++-# ] +++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++-# weights_fp32 = routing_weights.to(mindspore.float32) +++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++-# return moe_output_fp32.squeeze(1).to(original_dtype) +++- +++-# @no_grad() +++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens, _ = hidden_states.shape +++-# flat_selected_experts = selected_experts.flatten() +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++-# active_experts = ops.unique(flat_selected_experts) +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++-# mask = (flat_selected_experts == expert_idx_tensor) +++-# selected_token_indices = token_indices[mask] +++-# selected_routing_weights = routing_weights.flatten()[mask] +++-# current_states = hidden_states[selected_token_indices] +++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++-# return moe_output +++- +++-# # --- 精度优先模式的辅助函数 --- +++-# @no_grad() +++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++-# moe_output = ops.zeros_like(hidden_states) +++-# num_tokens, _ = hidden_states.shape +++-# flat_selected_experts = selected_experts.flatten() +++-# flat_routing_weights = routing_weights.flatten() +++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++-# active_experts = ops.unique(flat_selected_experts) +++-# for expert_idx_tensor in active_experts: +++-# expert_idx = expert_idx_tensor.item() +++-# expert_layer = self.experts[expert_idx] +++-# mask = (flat_selected_experts == expert_idx_tensor) +++-# current_token_indices = token_indices[mask] +++-# current_routing_weights = flat_routing_weights[mask] +++-# current_hidden_states = hidden_states[current_token_indices] +++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++-# return moe_output +++- +++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-# # 声明我们将要使用一个在模块外部定义的全局变量 +++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++-# global Long_Prompt +++- +++-# # 1. 门控计算 (所有模式通用) +++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-# router_logits = self.gate(hidden_states_reshaped) +++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++-# if self.norm_topk_prob: +++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++- +++-# moe_output = None +++-# if not self.training: +++-# # 根据 Long_Prompt 标志选择模式 +++-# if Long_Prompt: +++-# # --- 精度优先模式 --- +++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++-# else: +++-# # --- 速度优先模式 --- +++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++-# if sequence_length == 1: +++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++-# else: +++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++-# else: +++-# raise NotImplementedError("Training path is not implemented.") +++- +++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++- +++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++- +++-# return final_hidden_states, router_logits +++- +++ class Qwen2MoeSparseMoeBlock(nn.Module): +++ """ +++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++ return moe_output_fp32.squeeze(1).to(original_dtype) +++ ++++ # @no_grad() ++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++ # num_tokens, _ = hidden_states.shape ++++ # flat_selected_experts = selected_experts.flatten() ++++ # sorted_expert_indices = flat_selected_experts.argsort() ++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++ # original_token_indices = sorted_expert_indices // self.top_k ++++ # moe_output = ops.zeros_like(hidden_states) ++++ # current_token_offset = 0 ++++ # for i in range(self.num_experts): ++++ # expert_token_count = tokens_per_expert[i] - current_token_offset ++++ # if expert_token_count == 0: ++++ # continue ++++ # end_offset = current_token_offset + expert_token_count ++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++ # expert_hidden_states = hidden_states[expert_original_token_indices] ++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++ # current_token_offset += expert_token_count ++++ # return moe_output ++++ +++ @no_grad() +++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++- num_tokens, _ = hidden_states.shape +++- flat_selected_experts = selected_experts.flatten() +++- sorted_expert_indices = flat_selected_experts.argsort() +++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++- original_token_indices = sorted_expert_indices // self.top_k ++++ """ ++++ 优化版 MoE prefill (速度优先模式): ++++ - 批量张量化处理同一个 expert 的所有 token ++++ - 跳过无 token 的专家 ++++ - 保持结果完全一致 ++++ """ +++ moe_output = ops.zeros_like(hidden_states) +++- current_token_offset = 0 +++- for i in range(self.num_experts): +++- expert_token_count = tokens_per_expert[i] - current_token_offset +++- if expert_token_count == 0: +++- continue +++- end_offset = current_token_offset + expert_token_count +++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++- expert_hidden_states = hidden_states[expert_original_token_indices] +++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++- current_token_offset += expert_token_count ++++ ++++ flat_selected_experts = selected_experts.flatten() ++++ flat_routing_weights = routing_weights.flatten() ++++ ++++ idxs = flat_selected_experts.argsort() ++++ sorted_expert_indices = flat_selected_experts[idxs] ++++ sorted_token_indices = idxs // self.top_k ++++ ++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) ++++ ++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++++ ++++ for expert_id in active_experts.tolist(): ++++ start = int(tokens_per_expert[:expert_id].sum().item()) ++++ end = start + int(tokens_per_expert[expert_id].item()) ++++ ++++ token_idx = sorted_token_indices[start:end] ++++ expert_tokens = hidden_states[token_idx] ++++ ++++ expert_out = self.experts[expert_id](expert_tokens) ++++ ++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) ++++ ++++ moe_output = mindspore.mint.scatter_add( ++++ moe_output, ++++ 0, ++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), ++++ scaled_out.to(hidden_states.dtype) ++++ ) ++++ +++ return moe_output +++ ++++ +++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++ @no_grad() +++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++ +++ moe_output = None +++- if Long_Prompt: +++- # --- 精度优先模式 (ACCURACY MODE) --- +++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ # if Long_Prompt==0: ++++ # # --- 精度优先模式 (ACCURACY MODE) --- ++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ # else: ++++ # # --- 速度优先模式 (SPEED MODE) --- ++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++ # if sequence_length == 1: ++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ # else: ++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ ++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++ if sequence_length == 1: ++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ else: +++- # --- 速度优先模式 (SPEED MODE) --- +++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++- if sequence_length == 1: +++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++- else: +++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++- ++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ +++ +++ # 3. 共享专家计算与合并 (所有模式通用) +++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++ +++ return final_hidden_states, router_logits +++ ++++ +++ class Qwen2MoeDecoderLayer(nn.Module): +++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++ super().__init__() +++ self.hidden_size = config.hidden_size +++ +++- # if Long_Prompt: +++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++- # else: ++++ # if Long_Prompt == 2: +++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++ # else: ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ +++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++ +++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++ ) +++ +++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++ # attention_mask, ++++ # sequence_length=sequence_length, ++++ # target_length=target_length, ++++ # dtype=dtype, ++++ # min_dtype=min_dtype, ++++ # cache_position=cache_position, ++++ # batch_size=input_tensor.shape[0], ++++ # ) ++++ #@dwj ++++ causal_mask = get_cached_causal_mask_with_cache_position( +++ attention_mask, +++ sequence_length=sequence_length, +++ target_length=target_length, +++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++ """ +++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache ++++ _causal_mask_cache.clear() +++ +++ input_ids = kwargs.get("input_ids") +++ if input_ids is None and args: +++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ +++ if input_ids is not None: +++ prompt_length = input_ids.shape[1] +++- +++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +++- Long_Prompt = True ++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: ++++ Long_Prompt = 2 ++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: ++++ Long_Prompt = 0 +++ else: +++- Long_Prompt = False ++++ Long_Prompt = 1 ++++ +++ +++ return super().generate(*args, **kwargs) +++ +++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++ dtype = self.lm_head.weight.dtype +++ min_dtype = float(ops.finfo(dtype).min) +++ +++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++ # attention_mask, ++++ # sequence_length=sequence_length, ++++ # target_length=past_key_values.get_max_length(), ++++ # dtype=dtype, ++++ # min_dtype=min_dtype, ++++ # cache_position=cache_position, ++++ # batch_size=batch_size, ++++ # ) ++++ ++++ #@dwj ++++ attention_mask = get_cached_causal_mask_with_cache_position( +++ attention_mask, +++ sequence_length=sequence_length, +++ target_length=past_key_values.get_max_length(), +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++deleted file mode 100644 +++index 6dfb5b93..00000000 +++--- a/patches/0001-20251104commit.patch ++++++ /dev/null +++@@ -1,1272 +0,0 @@ +++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++-From: Pinoeer-kingxi <13022943007@163.com> +++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +++-Subject: [PATCH] 20251104commit +++- +++---- +++- mindnlp/transformers/cache_utils.py | 28 +- +++- .../models/deepseek/modeling_deepseek.py | 149 ++- +++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++- 3 files changed, 976 insertions(+), 87 deletions(-) +++- +++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++-index cadd2e04..02f8d4be 100644 +++---- a/mindnlp/transformers/cache_utils.py +++-+++ b/mindnlp/transformers/cache_utils.py +++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++- # k_out[:, :, cache_position] = key_states +++- # v_out[:, :, cache_position] = value_states +++-- if ON_ORANGE_PI: +++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++-- else: +++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++-- +++-+ # if ON_ORANGE_PI: +++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++-+ # else: +++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++-+ if cache_position.ndim > 1: +++-+ cache_position = cache_position.flatten() +++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++-+ cache_position = cache_position.int() +++-+ +++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++-+ k_out[:, :, cache_position] = key_states +++-+ v_out[:, :, cache_position] = value_states +++-+ +++- return k_out, v_out +++- +++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++-index c695b944..d8303e45 100644 +++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++- # Copied from transformers.models.llama.modeling_llama.rotate_half +++- def rotate_half(x): +++- """Rotates half the hidden dims of the input.""" +++-- x1 = x[..., : x.shape[-1] // 2] +++-- x2 = x[..., x.shape[-1] // 2 :] +++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++-+ # x1 = x[..., : x.shape[-1] // 2] +++-+ # x2 = x[..., x.shape[-1] // 2 :] +++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++- return ops.cat((-x2, x1), dim=-1) +++- +++- +++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++- if self.training: +++- raise NotImplementedError("Training is not supported yet.") +++- else: +++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++-- if self.config.n_shared_experts is not None: +++-- y = y + self.shared_experts(identity) +++-- return y +++-+ # @lwx +++-+ if orig_shape[1] == 1: +++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++-+ y=y.view(*orig_shape) +++-+ if self.config.n_shared_experts is not None: +++-+ y = y + self.shared_experts(identity) +++-+ return y +++-+ else: +++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++-+ if self.config.n_shared_experts is not None: +++-+ y = y + self.shared_experts(identity) +++-+ return y +++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++-+ # if self.config.n_shared_experts is not None: +++-+ # y = y + self.shared_experts(identity) +++-+ # return y +++-+ +++-+ @no_grad() +++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++-+ +++-+ expert_cache = ops.zeros_like(x) +++-+ for i in range(self.num_experts_per_tok): +++-+ expert_id = flat_expert_indices[i].item() +++-+ weight = flat_expert_weights[i].item() +++-+ expert = self.experts[expert_id] +++-+ expert_out = expert(x) +++-+ expert_cache += expert_out * weight +++-+ return expert_cache +++- +++- @no_grad() +++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++-- # expert_cache = torch.zeros_like(x) +++-- # idxs = flat_expert_indices.argsort() +++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++-- # token_idxs = idxs // self.num_experts_per_tok +++-- # for i, end_idx in enumerate(tokens_per_expert): +++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++-- # if start_idx == end_idx: +++-- # continue +++-- # expert = self.experts[i] +++-- # exp_token_idx = token_idxs[start_idx:end_idx] +++-- # expert_tokens = x[exp_token_idx] +++-- # expert_out = expert(expert_tokens) +++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++-- # return expert_cache +++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++- expert_cache = ops.zeros_like(x) +++- idxs = flat_expert_indices.argsort() +++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++- token_idxs = idxs // self.num_experts_per_tok +++-+ +++- for i, end_idx in enumerate(tokens_per_expert): +++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++- if start_idx == end_idx: +++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++- expert_out = expert(expert_tokens) +++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++-+ +++- return expert_cache +++-+ +++-+ # @no_grad() +++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++-+ # # expert_cache = torch.zeros_like(x) +++-+ # # idxs = flat_expert_indices.argsort() +++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++-+ # # token_idxs = idxs // self.num_experts_per_tok +++-+ # # for i, end_idx in enumerate(tokens_per_expert): +++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++-+ # # if start_idx == end_idx: +++-+ # # continue +++-+ # # expert = self.experts[i] +++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +++-+ # # expert_tokens = x[exp_token_idx] +++-+ # # expert_out = expert(expert_tokens) +++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++-+ # # return expert_cache +++-+ # expert_cache = ops.zeros_like(x) +++-+ # idxs = flat_expert_indices.argsort() +++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++-+ # token_idxs = idxs // self.num_experts_per_tok +++-+ +++-+ # for i, end_idx in enumerate(tokens_per_expert): +++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++-+ # if start_idx == end_idx: +++-+ # continue +++-+ # expert = self.experts[i] +++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++-+ # expert_tokens = x[exp_token_idx] +++-+ # expert_out = expert(expert_tokens) +++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++-+ +++-+ # return expert_cache +++-+ # @no_grad() +++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++-+ # expert_cache = ops.zeros_like(x) +++-+ +++-+ # # 排序保证顺序一致 +++-+ # idxs = flat_expert_indices.argsort() +++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++-+ # token_idxs = idxs // self.num_experts_per_tok +++-+ +++-+ # # 找出有 token 的专家 +++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++-+ +++-+ # for i in active_experts.tolist(): +++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++-+ # end_idx = tokens_per_expert[i] +++-+ # if start_idx == end_idx: # 没有 token +++-+ # continue +++-+ +++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++-+ # expert_tokens = x[exp_token_idx] +++-+ # expert_out = self.experts[i](expert_tokens) +++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++-+ +++-+ # expert_cache = mindspore.mint.scatter_add( +++-+ # expert_cache, +++-+ # 0, +++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++-+ # expert_out +++-+ # ) +++-+ +++-+ # return expert_cache +++-+ +++-+ +++- +++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++- # """ +++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++- +++- # Initialize weights and apply final processing +++- self.post_init() +++-+ self.warm_up = False +++-+ +++-+ def warmup_moe_model_deep(self): +++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++-+ test_texts = [ +++-+ "warmup short", +++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++-+ ] +++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++-+ if tokenizer is None: +++-+ from mindnlp.transformers import AutoTokenizer +++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++-+ self._warmup_tokenizer = tokenizer +++-+ +++-+ for text in test_texts: +++-+ inputs = tokenizer(text, return_tensors="ms") +++-+ with mindspore._no_grad(): +++-+ _ = self(**inputs, use_cache=False) +++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++- +++- def get_input_embeddings(self): +++- return self.model.embed_tokens +++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++- ```""" +++-+ if not self.warm_up: +++-+ self.warm_up = True +++-+ self.warmup_moe_model_deep() +++-+ +++- output_attentions = ( +++- output_attentions +++- if output_attentions is not None +++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++-index 3cbf820e..d4c6b651 100644 +++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++-@@ -18,7 +18,6 @@ +++- # See the License for the specific language governing permissions and +++- # limitations under the License. +++- """MindSpore Qwen2MoE model.""" +++-- +++- import math +++- from typing import List, Optional, Tuple, Union +++- +++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++- TokenClassifierOutput, +++- ) +++- from ...modeling_utils import PreTrainedModel +++-+from ...generation import GenerationMixin +++- from ....utils import logging +++- from .configuration_qwen2_moe import Qwen2MoeConfig +++- +++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++- self.variance_epsilon = eps +++- +++- def forward(self, hidden_states): +++-+ # @dwj +++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++-+ # @lwx +++-+ # if not self.training : +++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++- input_dtype = hidden_states.dtype +++- hidden_states = hidden_states.to(mindspore.float32) +++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++-@@ -234,6 +239,8 @@ def rotate_half(x): +++- """Rotates half the hidden dims of the input.""" +++- x1 = x[..., : x.shape[-1] // 2] +++- x2 = x[..., x.shape[-1] // 2 :] +++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++- return ops.cat((-x2, x1), dim=-1) +++- +++- +++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++- self.config = config +++- self.hidden_size = config.hidden_size +++- self.intermediate_size = intermediate_size +++-+ +++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++- self.act_fn = ACT2FN[config.hidden_act] +++- +++- def forward(self, x): +++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++-- +++- +++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++-+ # @lwx +++-+ # gate_up_output = self.gate_up_proj(x) +++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++-+ # return self.down_proj(swiglu_output) +++-+ +++-+ # def forward(self, x): +++-+ # gate_proj_out = self.gate_proj(x) +++-+ # up_proj_out = self.up_proj(x) +++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++-+ # return self.down_proj(swiglu_out) +++-+ +++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++- """ +++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++- use_cache: bool = False, +++- cache_position: Optional[mindspore.Tensor] = None, +++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++-+ +++-+ +++-+ +++- bsz, q_len, _ = hidden_states.shape +++- +++- query_states = self.q_proj(hidden_states) +++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++- "with a layer index." +++- ) +++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++-+ if isinstance(past_key_value, StaticCache): +++-+ kv_seq_len = key_states.shape[-2] +++-+ else: +++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++- +++- if past_key_value is not None: +++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++-+ +++-+ if isinstance(past_key_value, StaticCache): +++-+ kv_seq_len = key_states.shape[-2] +++- +++- # repeat k/v heads if n_kv_heads < n_heads +++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++-- +++-+ +++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++- +++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++-- raise ValueError( +++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++-- f" {attn_weights.shape}" +++-- ) +++-- +++-- if attention_mask is not None: # no matter the length, we just slice it +++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++-+ if attention_mask is not None: +++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++- attn_weights = attn_weights + causal_mask +++- +++- # upcast attention to fp32 +++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++- +++- attn_output = self.o_proj(attn_output) +++-- +++-+ # @lwx +++-+ +++-+ # max_seq_len = self.max_position_embeddings # 2048 +++-+ +++-+ # if attention_mask is not None: +++-+ # # attention_mask: [B, 1, Sq, Sk] +++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++-+ +++-+ # # pad 到 [max_seq_len, max_seq_len] +++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++-+ # global_attention_mask = padded_mask +++-+ # else: +++-+ # global_attention_mask = None +++-+ +++-+ +++-+ # sparse_mode=3 +++-+ # attn_output = mindspore.ops.flash_attention_score( +++-+ # query=query_states, +++-+ # key=key_states, +++-+ # value=value_states, +++-+ # real_shift=None, +++-+ # padding_mask=None, +++-+ +++-+ # head_num=self.num_heads, +++-+ # attn_mask=global_attention_mask, +++-+ # keep_prob=1.0 - self.attention_dropout, +++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++-+ # input_layout="BNSD", +++-+ # pre_tokens=2147483647, +++-+ # next_tokens=2147483647, +++-+ # inner_precise=0, +++-+ # drop_mask=None, +++-+ # prefix=None, +++-+ # actual_seq_qlen=None, +++-+ # actual_seq_kvlen=None, +++-+ # sparse_mode=sparse_mode, +++-+ # ) +++- if not output_attentions: +++- attn_weights = None +++- +++- return attn_output, attn_weights, past_key_value +++- +++- +++-+class Qwen2MoeFlashAttention(nn.Module): +++-+ """ +++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++-+ +++-+ 关键改动: +++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++-+ 直接传入原始的 key 和 value 张量效率更高。 +++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++-+ """ +++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++-+ super().__init__() +++-+ self.config = config +++-+ self.layer_idx = layer_idx +++-+ self.hidden_size = config.hidden_size +++-+ self.num_heads = config.num_attention_heads +++-+ self.head_dim = self.hidden_size // self.num_heads +++-+ self.num_key_value_heads = config.num_key_value_heads +++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++-+ self.max_position_embeddings = config.max_position_embeddings +++-+ self.rope_theta = config.rope_theta +++-+ self.attention_dropout = config.attention_dropout +++-+ +++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +++-+ raise ValueError( +++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++-+ ) +++-+ +++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++-+ +++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++-+ self.head_dim, +++-+ max_position_embeddings=self.max_position_embeddings, +++-+ base=self.rope_theta, +++-+ ) +++-+ +++-+ def forward( +++-+ self, +++-+ hidden_states: mindspore.Tensor, +++-+ attention_mask: Optional[mindspore.Tensor] = None, +++-+ position_ids: Optional[mindspore.Tensor] = None, +++-+ past_key_value: Optional[Cache] = None, +++-+ output_attentions: bool = False, +++-+ use_cache: bool = False, +++-+ cache_position: Optional[mindspore.Tensor] = None, +++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++-+ +++-+ bsz, q_len, _ = hidden_states.shape +++-+ +++-+ # 1. 线性投射 Q, K, V +++-+ query_states = self.q_proj(hidden_states) +++-+ key_states = self.k_proj(hidden_states) +++-+ value_states = self.v_proj(hidden_states) +++-+ +++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++-+ # query: [B, S, H*D] -> [B, N1, S, D] +++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ +++-+ # 3. RoPE 旋转位置编码 +++-+ kv_seq_len = key_states.shape[-2] +++-+ if past_key_value is not None: +++-+ if self.layer_idx is None: +++-+ raise ValueError( +++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++-+ "with a layer index." +++-+ ) +++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++-+ if cache_position.shape[0] == 1: +++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++-+ kv_seq_len = past_seen_tokens + 1 +++-+ else: +++-+ # prefill 阶段:cache_position 是范围,使用其长度 +++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++-+ else: +++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++-+ +++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++-+ +++-+ # 4. KV 缓存更新 +++-+ if past_key_value is not None: +++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++-+ key_states, value_states = past_key_value.update( +++-+ key_states, value_states, self.layer_idx, cache_kwargs +++-+ ) +++-+ +++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++-+ if cache_position.shape[0] == 1: +++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++-+ kv_seq_len = key_states.shape[-2] +++-+ +++-+ # 5. [重要] 准备 Attention Mask +++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++-+ fa_attention_mask = None +++-+ if attention_mask is not None: +++-+ # 截取与当前key长度匹配的部分 +++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +++-+ fa_attention_mask = (mask_slice != 0) +++-+ +++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++-+ input_dtype = query_states.dtype +++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++-+ query_states = query_states.to(mindspore.float16) +++-+ key_states = key_states.to(mindspore.float16) +++-+ value_states = value_states.to(mindspore.float16) +++-+ +++-+ # 6. [核心] 调用 flash_attention_score 算子 +++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++-+ attn_output = mindspore.ops.flash_attention_score( +++-+ query=query_states, +++-+ key=key_states, +++-+ value=value_states, +++-+ head_num=self.num_heads, # 传入Q的头数(N1) +++-+ attn_mask=fa_attention_mask, +++-+ keep_prob=1.0 - self.attention_dropout, +++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +++-+ input_layout="BNSD", +++-+ sparse_mode=0 # 使用 defaultMask 模式 +++-+ ) +++-+ +++-+ # 恢复原始数据类型 +++-+ attn_output = attn_output.to(input_dtype) +++-+ +++-+ # 7. 调整输出形状 +++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++-+ attn_output = self.o_proj(attn_output) +++-+ +++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +++-+ attn_weights = None +++-+ if output_attentions: +++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++-+ +++-+ return attn_output, attn_weights, past_key_value +++-+ +++-+ # def forward( +++-+ # self, +++-+ # hidden_states: mindspore.Tensor, +++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++-+ # position_ids: Optional[mindspore.Tensor] = None, +++-+ # past_key_value: Optional[Cache] = None, +++-+ # output_attentions: bool = False, +++-+ # use_cache: bool = False, +++-+ # cache_position: Optional[mindspore.Tensor] = None, +++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++-+ +++-+ # bsz, q_len, _ = hidden_states.shape +++-+ +++-+ # # 1. 线性投射 Q, K, V +++-+ # query_states = self.q_proj(hidden_states) +++-+ # key_states = self.k_proj(hidden_states) +++-+ # value_states = self.v_proj(hidden_states) +++-+ +++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ +++-+ # # 3. RoPE 旋转位置编码 +++-+ # kv_seq_len = key_states.shape[-2] +++-+ # if past_key_value is not None: +++-+ # if self.layer_idx is None: +++-+ # raise ValueError( +++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++-+ # "with a layer index." +++-+ # ) +++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++-+ +++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++-+ +++-+ # # 4. KV 缓存更新 +++-+ # if past_key_value is not None: +++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++-+ # key_states, value_states = past_key_value.update( +++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++-+ # ) +++-+ +++-+ # # 5. 准备 Attention Mask +++-+ # fa_attention_mask = None +++-+ # if attention_mask is not None: +++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++-+ # fa_attention_mask = (mask_slice != 0) +++-+ +++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++-+ # input_dtype = query_states.dtype +++-+ +++-+ # # 6. [核心] 调用 flash_attention_score 算子 +++-+ # attn_output = mindspore.ops.flash_attention_score( +++-+ # query=query_states, +++-+ # key=key_states, +++-+ # value=value_states, +++-+ # head_num=self.num_heads, +++-+ # attn_mask=fa_attention_mask, +++-+ # keep_prob=1.0 - self.attention_dropout, +++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++-+ # input_layout="BNSD", +++-+ # sparse_mode=0, +++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++-+ # inner_precise=1 +++-+ # ) +++-+ +++-+ # # 恢复原始数据类型 +++-+ # attn_output = attn_output.to(input_dtype) +++-+ +++-+ # # 7. 调整输出形状 +++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++-+ # attn_output = self.o_proj(attn_output) +++-+ +++-+ # attn_weights = None +++-+ # if output_attentions: +++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++-+ +++-+ # return attn_output, attn_weights, past_key_value +++-+ +++-+ # def forward( +++-+ # self, +++-+ # hidden_states: mindspore.Tensor, +++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++-+ # position_ids: Optional[mindspore.Tensor] = None, +++-+ # past_key_value: Optional[Cache] = None, +++-+ # output_attentions: bool = False, +++-+ # use_cache: bool = False, +++-+ # cache_position: Optional[mindspore.Tensor] = None, +++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++-+ +++-+ # bsz, q_len, _ = hidden_states.shape +++-+ +++-+ # query_states = self.q_proj(hidden_states) +++-+ # key_states = self.k_proj(hidden_states) +++-+ # value_states = self.v_proj(hidden_states) +++-+ +++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-+ +++-+ # kv_seq_len = key_states.shape[-2] +++-+ # if past_key_value is not None: +++-+ # if self.layer_idx is None: +++-+ # raise ValueError("`layer_idx` must be specified for caching") +++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++-+ +++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++-+ +++-+ # if past_key_value is not None: +++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++-+ # key_states, value_states = past_key_value.update( +++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++-+ # ) +++-+ +++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++-+ +++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++-+ # query_states = query_states / math.sqrt(self.head_dim) +++-+ # # <--- 修改结束 --- +++-+ +++-+ # fa_attention_mask = None +++-+ # if attention_mask is not None: +++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++-+ # fa_attention_mask = (mask_slice != 0) +++-+ +++-+ # input_dtype = query_states.dtype +++-+ +++-+ # attn_output = mindspore.ops.flash_attention_score( +++-+ # query=query_states, # 传入已经预先缩放过的 query +++-+ # key=key_states, +++-+ # value=value_states, +++-+ # head_num=self.num_heads, +++-+ # attn_mask=fa_attention_mask, +++-+ # keep_prob=1.0 - self.attention_dropout, +++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++-+ # input_layout="BNSD", +++-+ # sparse_mode=0, +++-+ # inner_precise=1 # 仍然保持内部高精度计算 +++-+ # ) +++-+ +++-+ # attn_output = attn_output.to(input_dtype) +++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++-+ # attn_output = self.o_proj(attn_output) +++-+ +++-+ # attn_weights = None +++-+ # if output_attentions: +++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++-+ +++-+ # return attn_output, attn_weights, past_key_value +++-+ +++- QWEN2MOE_ATTENTION_CLASSES = { +++- "eager": Qwen2MoeAttention, +++-+ "flash-attention": Qwen2MoeFlashAttention, +++- } +++- +++- +++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++- +++-+ #@dwj +++-+ # 只遍历激活的专家,而非全部专家 +++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +++-- hidden_states = hidden_states.view(-1, hidden_dim) +++-- # router_logits: (batch * sequence_length, n_experts) +++-- router_logits = self.gate(hidden_states) +++-- +++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++-- if self.norm_topk_prob: +++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++-- # we cast back to the input dtype +++-- routing_weights = routing_weights.to(hidden_states.dtype) +++-- +++-- final_hidden_states = ops.zeros( +++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++-- ) +++-- +++-- # One hot encode the selected experts to create an expert mask +++-- # this will be used to easily index which expert is going to be sollicitated +++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++-- +++-- # Loop over all available experts in the model and perform the computation on each expert +++-- for expert_idx in range(self.num_experts): +++-- expert_layer = self.experts[expert_idx] +++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++-- +++-- # Index the correct hidden states and compute the expert hidden state for +++-- # the current expert. We need to make sure to multiply the output hidden +++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++-- if 0 not in idx.shape: +++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++-- +++-- # However `index_add_` only support torch tensors for indexing so we'll use +++-- # the `top_x` tensor here. +++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++-- +++-- shared_expert_output = self.shared_expert(hidden_states) +++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++-- +++-- final_hidden_states = final_hidden_states + shared_expert_output +++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++-+ num_tokens = hidden_states_reshaped.shape[0] +++-+ +++-+ router_logits = self.gate(hidden_states_reshaped) +++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++-+ +++-+ if self.norm_topk_prob: +++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++-+ routing_weights = routing_weights.to(hidden_states.dtype) +++-+ +++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++-+ flat_selected_experts = selected_experts.flatten() +++-+ +++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++-+ token_indices = broadcasted_token_indices.flatten() +++-+ +++-+ active_experts = ops.unique(flat_selected_experts) +++-+ +++-+ for expert_idx_tensor in active_experts: +++-+ expert_idx = expert_idx_tensor.item() +++-+ expert_layer = self.experts[expert_idx] +++-+ +++-+ mask = (flat_selected_experts == expert_idx_tensor) +++-+ selected_token_indices = token_indices[mask] +++-+ selected_routing_weights = routing_weights.flatten()[mask] +++-+ +++-+ current_states = hidden_states_reshaped[selected_token_indices] +++-+ +++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++-+ +++-+ final_hidden_states = final_hidden_states.index_add( +++-+ dim=0, +++-+ index=selected_token_indices, +++-+ source=expert_output.to(hidden_states.dtype) +++-+ ) +++-+ +++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++- +++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++-- return final_hidden_states, router_logits +++-+ final_hidden_states = final_hidden_states + shared_expert_output +++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++-+ +++-+ return final_hidden_states, router_logits +++- +++- +++- class Qwen2MoeDecoderLayer(nn.Module): +++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++- +++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++- +++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++-+ +++- if (layer_idx not in config.mlp_only_layers) and ( +++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++- ): +++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +++- _skip_keys_device_placement = "past_key_values" +++- _supports_cache_class = True +++-+#lwx +++-+ # _supports_static_cache = True +++- +++- def _init_weights(self, module): +++- std = self.config.initializer_range +++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++- return causal_mask +++- +++- +++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++- _tied_weights_keys = ["lm_head.weight"] +++- +++- def __init__(self, config): +++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++- self.num_experts_per_tok = config.num_experts_per_tok +++- # Initialize weights and apply final processing +++- self.post_init() +++-+ # @lwx +++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++-+ # self.generation_config.cache_implementation = "static" +++-+ self._warmed_up = False +++-+ +++-+ def warmup_moe_model(self): +++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +++-+ test_texts = [ +++-+ "warmup short", +++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++-+ ] +++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++-+ if tokenizer is None: +++-+ from mindnlp.transformers import AutoTokenizer +++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++-+ self._warmup_tokenizer = tokenizer +++-+ +++-+ for text in test_texts: +++-+ inputs = tokenizer(text, return_tensors="ms") +++-+ with mindspore._no_grad(): +++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +++- +++- def get_input_embeddings(self): +++- return self.model.embed_tokens +++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++- ```""" +++-+ if not self._warmed_up: +++-+ self._warmed_up = True +++-+ self.warmup_moe_model() +++- +++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++- output_router_logits = ( +++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++- } +++- ) +++- return model_inputs +++-+# @lwx +++-+ # def _decode_one_tokens_logits( +++-+ # self, +++-+ # cur_token: mindspore.Tensor, +++-+ # input_pos: Optional[mindspore.Tensor], +++-+ # cache_position: mindspore.Tensor, +++-+ # past_key_values: StaticCache, +++-+ # ) -> mindspore.Tensor: +++-+ # """ +++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++-+ +++-+ # Args: +++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++-+ # input_pos: 输入位置信息,可选 +++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +++-+ +++-+ # Returns: +++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++-+ # """ +++-+ # # 调用JIT编译的版本 +++-+ # return self.get_decode_one_tokens_logits( +++-+ # cur_token=cur_token, +++-+ # input_pos=input_pos, +++-+ # cache_position=cache_position, +++-+ # past_key_values=past_key_values, +++-+ # ) +++-+ +++-+ # @mindspore.jit(jit_level='O1') +++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++-+ # """ +++-+ # JIT编译的函数,用于高效的单token解码 +++-+ # 使用JIT编译优化以支持静态shape和高效执行 +++-+ +++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++-+ # """ +++-+ # outputs = self.model.forward( +++-+ # input_ids=cur_token, +++-+ # position_ids=input_pos, +++-+ # cache_position=cache_position, +++-+ # past_key_values=past_key_values, +++-+ # use_cache=True, +++-+ # return_dict=False, +++-+ # ) +++-+ +++-+ # hidden_states = outputs[0] +++-+ # logits = self.lm_head.forward(hidden_states) +++-+ # logits = logits.float() +++-+ +++-+ # return logits[:, -1, :] +++-+ +++-+ # def _sample( +++-+ # self, +++-+ # input_ids: mindspore.Tensor, +++-+ # logits_processor, +++-+ # stopping_criteria, +++-+ # generation_config, +++-+ # synced_devices: bool, +++-+ # streamer=None, +++-+ # logits_warper=None, +++-+ # **model_kwargs, +++-+ # ): +++-+ # """ +++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++-+ # """ +++-+ # from ...generation.logits_process import LogitsProcessorList +++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++-+ # from mindnlp.core import nn, ops, no_grad +++-+ # import numpy as np +++-+ +++-+ # # 检查是否使用 StaticCache +++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++-+ # # 否则,直接调用父类方法 +++-+ # past_key_values = model_kwargs.get("past_key_values") +++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++-+ +++-+ # if not isinstance(past_key_values, StaticCache): +++-+ # # 不使用 StaticCache,直接调用父类方法 +++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++-+ # return super()._sample( +++-+ # input_ids=input_ids, +++-+ # logits_processor=logits_processor, +++-+ # stopping_criteria=stopping_criteria, +++-+ # generation_config=generation_config, +++-+ # synced_devices=synced_devices, +++-+ # streamer=streamer, +++-+ # logits_warper=logits_warper, +++-+ # **model_kwargs, +++-+ # ) +++-+ +++-+ # # 使用 StaticCache,进入自定义循环 +++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++-+ # pad_token_id = generation_config._pad_token_tensor +++-+ # output_attentions = generation_config.output_attentions +++-+ # output_hidden_states = generation_config.output_hidden_states +++-+ # output_scores = generation_config.output_scores +++-+ # output_logits = generation_config.output_logits +++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +++-+ # max_length = generation_config.max_length +++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++-+ # do_sample = generation_config.do_sample +++-+ +++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++-+ # raise ValueError( +++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++-+ # f"{logits_warper})." +++-+ # ) +++-+ +++-+ # # init attention / hidden states / scores tuples +++-+ # scores = () if (return_dict_in_generate and output_scores) else None +++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++-+ +++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++-+ # encoder_hidden_states = ( +++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++-+ # ) +++-+ +++-+ # # keep track of which sequences are already finished +++-+ # batch_size, cur_len = input_ids.shape +++-+ # this_peer_finished = False +++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++-+ +++-+ # time_record = [] +++-+ # from ....utils.testing_utils import parse_flag_from_env +++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++-+ +++-+ # while self._has_unfinished_sequences( +++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++-+ # ): +++-+ # if _record_time: +++-+ # import time as time_module +++-+ # infer_start = time_module.time() +++-+ +++-+ # # prepare model inputs +++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++-+ +++-+ # # prepare variable output controls +++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++-+ +++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++-+ # cur_cache_position = model_inputs.get("cache_position") +++-+ # cur_past_key_values = model_inputs.get("past_key_values") +++-+ # cur_input_ids = model_inputs.get("input_ids") +++-+ +++-+ # if (isinstance(cur_past_key_values, StaticCache) and +++-+ # cur_cache_position is not None and +++-+ # len(cur_cache_position.shape) > 0 and +++-+ # cur_cache_position.shape[0] == 1 and +++-+ # cur_input_ids is not None and +++-+ # cur_input_ids.shape[1] == 1): +++-+ # # 使用 JIT 优化的单 token 解码 +++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++-+ # if not hasattr(self, '_jit_used'): +++-+ # self._jit_used = False +++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++-+ +++-+ # next_token_logits = self.get_decode_one_tokens_logits( +++-+ # cur_token=cur_input_ids, +++-+ # input_pos=model_inputs.get("position_ids"), +++-+ # cache_position=cur_cache_position, +++-+ # past_key_values=cur_past_key_values, +++-+ # ) +++-+ +++-+ # # 标记已使用JIT(用于后续判断) +++-+ # if not self._jit_used: +++-+ # self._jit_used = True +++-+ +++-+ # # 构造兼容的输出对象 +++-+ # class JitOptimizedOutput: +++-+ # def __init__(self, logits, config): +++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++-+ # self.config = config +++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +++-+ # self.attentions = None if not config.is_encoder_decoder else None +++-+ # self.cross_attentions = None +++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +++-+ +++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++-+ # else: +++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++-+ # outputs = self(**model_inputs, return_dict=True) +++-+ +++-+ # if synced_devices and this_peer_finished: +++-+ # continue +++-+ +++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++-+ # next_token_logits = outputs.logits[:, -1, :] +++-+ +++-+ # # pre-process distribution +++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +++-+ # if do_sample: +++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +++-+ +++-+ # # Store scores, attentions and hidden_states when required +++-+ # if return_dict_in_generate: +++-+ # if output_scores: +++-+ # scores += (next_token_scores,) +++-+ # if output_logits: +++-+ # raw_logits += (next_token_logits,) +++-+ # if output_attentions: +++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +++-+ # if self.config.is_encoder_decoder: +++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++-+ +++-+ # if output_hidden_states: +++-+ # hidden = ( +++-+ # outputs.decoder_hidden_states +++-+ # if self.config.is_encoder_decoder +++-+ # else outputs.hidden_states +++-+ # ) +++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++-+ +++-+ # # token selection +++-+ # if do_sample: +++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++-+ # else: +++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++-+ +++-+ # # finished sentences should have their next token be a padding token +++-+ # if has_eos_stopping_criteria: +++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++-+ +++-+ # # update generated ids, model inputs, and length for next step +++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++-+ # if streamer is not None: +++-+ # streamer.put(next_tokens) +++-+ +++-+ # model_kwargs = self._update_model_kwargs_for_generation( +++-+ # outputs, +++-+ # model_kwargs, +++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +++-+ # ) +++-+ +++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++-+ # cur_len += 1 +++-+ +++-+ # if _record_time: +++-+ # import time as time_module +++-+ # infer_stop = time_module.time() +++-+ # time_record.append(infer_stop - infer_start) +++-+ +++-+ # del outputs +++-+ +++-+ # average_infer_time = None +++-+ # if time_record: +++-+ # if len(time_record) > 1: +++-+ # time_record.pop(0) +++-+ # average_infer_time = sum(time_record) / len(time_record) +++-+ # print(f'average inference time is: {average_infer_time}') +++-+ # print(f'inference time record: {time_record}') +++-+ +++-+ # if streamer is not None: +++-+ # streamer.end() +++-+ +++-+ # # 简单判断:打印是否使用了JIT路径 +++-+ # if hasattr(self, '_jit_used') and self._jit_used: +++-+ # print("[JIT] ✓ JIT optimization was used during generation") +++-+ # else: +++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++-+ +++-+ # if return_dict_in_generate: +++-+ # if self.config.is_encoder_decoder: +++-+ # return GenerateEncoderDecoderOutput( +++-+ # sequences=input_ids, +++-+ # scores=scores, +++-+ # logits=raw_logits, +++-+ # encoder_attentions=encoder_attentions, +++-+ # encoder_hidden_states=encoder_hidden_states, +++-+ # decoder_attentions=decoder_attentions, +++-+ # cross_attentions=cross_attentions, +++-+ # decoder_hidden_states=decoder_hidden_states, +++-+ # past_key_values=model_kwargs.get("past_key_values"), +++-+ # average_infer_time=average_infer_time +++-+ # ) +++-+ # else: +++-+ # return GenerateDecoderOnlyOutput( +++-+ # sequences=input_ids, +++-+ # scores=scores, +++-+ # logits=raw_logits, +++-+ # attentions=decoder_attentions, +++-+ # hidden_states=decoder_hidden_states, +++-+ # past_key_values=model_kwargs.get("past_key_values"), +++-+ # average_infer_time=average_infer_time +++-+ # ) +++-+ # else: +++-+ # return input_ids +++-+ +++-+ # def _prepare_cache_for_generation( +++-+ # self, +++-+ # generation_config, +++-+ # model_kwargs, +++-+ # assistant_model, +++-+ # batch_size, +++-+ # max_cache_length, +++-+ # ): +++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +++-+ # generation_config.cache_implementation = "static" +++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++-+ +++-+ # if generation_config.cache_implementation == "static": +++-+ # base_required_from_max_length = generation_config.max_length + 1 +++-+ # base_required = max(max_cache_length, base_required_from_max_length) +++-+ # min_cache_size = 50 +++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++-+ # else: +++-+ # max_cache_length = max(base_required, min_cache_size) +++-+ +++-+ # original_max_cache_length = max_cache_length +++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++-+ # print(f" - final max_cache_length: {max_cache_length}") +++-+ +++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++-+ # if max_cache_length > self.config.max_position_embeddings: +++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++-+ +++-+ # result = super()._prepare_cache_for_generation( +++-+ # generation_config=generation_config, +++-+ # model_kwargs=model_kwargs, +++-+ # assistant_model=assistant_model, +++-+ # batch_size=batch_size, +++-+ # max_cache_length=max_cache_length, +++-+ # ) +++-+ +++-+ # if generation_config.cache_implementation == "static": +++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++-+ # created_cache = model_kwargs.get(cache_name) +++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++-+ # if created_cache.max_cache_len < generation_config.max_length: +++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++-+ +++-+ # return result +++-+ +++-+ +++-+ +++- +++- +++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++--- +++-2.27.0 +++- +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" new file mode 100644 index 00000000..46db89f2 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" @@ -0,0 +1,7931 @@ +From 2c9ca98c339c674179652ab1635dab69b46d9012 Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Fri, 7 Nov 2025 12:06:32 +0800 +Subject: [PATCH 06/10] 20251107002commit + +--- + .../models/deepseek/modeling_deepseek.py | 122 +- + patches/0001-20251104commit.patch | 2 +- + patches/0002-20251106commit.patch | 2 +- + patches/0003-20261106secondcommit.patch | 2 +- + patches/0004-20251106change.patch | 2 +- + patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ + 6 files changed, 7773 insertions(+), 64 deletions(-) + create mode 100644 patches/0005-20251107001commit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 8831e4b7..e7e1c053 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): + # expert_out = expert(x) + # expert_cache += expert_out * weight + # return expert_cache +- +- # @no_grad() +- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- # # x 的 shape: (1, hidden_size) +- # # flat_expert_indices 的 shape: (num_experts_per_tok,) +- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +- +- # # 1. 收集所有需要的专家层 +- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +- # selected_experts = [self.experts[i] for i in flat_expert_indices] +- +- # # 2. 并行计算所有专家的输出 +- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +- # # ops.cat 会将它们堆叠成一个新的 Tensor +- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +- +- # # 3. 使用矩阵乘法进行加权求和 +- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- # # 最终结果 final_output 的 shape: (1, hidden_size) +- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++ ++ @no_grad() ++ dwj ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ # x 的 shape: (1, hidden_size) ++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++ ++ # 1. 收集所有需要的专家层 ++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++ ++ # 2. 并行计算所有专家的输出 ++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++ # ops.cat 会将它们堆叠成一个新的 Tensor ++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++ ++ # 3. 使用矩阵乘法进行加权求和 ++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ # 最终结果 final_output 的 shape: (1, hidden_size) ++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) + +- # return final_output ++ return final_output + + + # @no_grad() +@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): + + return expert_cache + # 放置在 DeepseekMoE 类中 +- @no_grad() +- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- """ +- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +- +- Args: +- x (Tensor): 输入张量, shape: (1, hidden_size) +- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +- """ +- top_k, _ = flat_expert_weights.shape +- hidden_size = x.shape[-1] +- +- # 1. 将所有专家的权重堆叠起来 +- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++ # @no_grad() ++ # #lwx 20251107 ++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ # """ ++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++ ++ # Args: ++ # x (Tensor): 输入张量, shape: (1, hidden_size) ++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++ # """ ++ # top_k, _ = flat_expert_weights.shape ++ # hidden_size = x.shape[-1] ++ ++ # # 1. 将所有专家的权重堆叠起来 ++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) + +- # 2. "收集" 所需的专家权重 +- selected_gate_w = stacked_gate_w[flat_expert_indices] +- selected_up_w = stacked_up_w[flat_expert_indices] +- selected_down_w = stacked_down_w[flat_expert_indices] ++ # # 2. "收集" 所需的专家权重 ++ # selected_gate_w = stacked_gate_w[flat_expert_indices] ++ # selected_up_w = stacked_up_w[flat_expert_indices] ++ # selected_down_w = stacked_down_w[flat_expert_indices] + +- # 3. 准备输入 +- x_expanded = x.expand((top_k, 1, hidden_size)) ++ # # 3. 准备输入 ++ # x_expanded = x.expand((top_k, 1, hidden_size)) + +- # 4. 并行计算 gate_proj 和 up_proj +- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++ # # 4. 并行计算 gate_proj 和 up_proj ++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) + +- # 5. 计算中间状态 +- intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++ # # 5. 计算中间状态 ++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out + +- # 6. 并行计算 down_proj +- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +- # --- [FIX] --- +- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +- # --- [FIX END] --- ++ # # 6. 并行计算 down_proj ++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++ # # --- [FIX] --- ++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++ # # --- [FIX END] --- + +- # 7. 根据路由权重进行加权求和 +- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++ # # 7. 根据路由权重进行加权求和 ++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) + +- return weighted_sum ++ # return weighted_sum + + + +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +index 0a0ef2d7..2842180e 100644 +--- a/patches/0001-20251104commit.patch ++++ b/patches/0001-20251104commit.patch +@@ -1,7 +1,7 @@ + From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/4] 20251104commit ++Subject: [PATCH 1/5] 20251104commit + + --- + mindnlp/transformers/cache_utils.py | 28 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +index 5185270c..c6cd8757 100644 +--- a/patches/0002-20251106commit.patch ++++ b/patches/0002-20251106commit.patch +@@ -1,7 +1,7 @@ + From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/4] 20251106commit ++Subject: [PATCH 2/5] 20251106commit + + --- + .../models/deepseek/modeling_deepseek.py | 379 ++++- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +index 3e05f821..601960c9 100644 +--- a/patches/0003-20261106secondcommit.patch ++++ b/patches/0003-20261106secondcommit.patch +@@ -1,7 +1,7 @@ + From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/4] 20261106secondcommit ++Subject: [PATCH 3/5] 20261106secondcommit + + --- + .../models/deepseek/modeling_deepseek.py | 217 ++- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +index 88a1aef4..8976f10b 100644 +--- a/patches/0004-20251106change.patch ++++ b/patches/0004-20251106change.patch +@@ -1,7 +1,7 @@ + From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 15:48:09 +0800 +-Subject: [PATCH 4/4] 20251106change ++Subject: [PATCH 4/5] 20251106change + + --- + .../models/deepseek/modeling_deepseek.py | 189 +- +diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +new file mode 100644 +index 00000000..8d9032be +--- /dev/null ++++ b/patches/0005-20251107001commit.patch +@@ -0,0 +1,7707 @@ ++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Fri, 7 Nov 2025 11:48:18 +0800 ++Subject: [PATCH 5/5] 20251107001commit ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 91 +- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- ++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- ++ patches/0001-20251104commit.patch | 2 +- ++ patches/0002-20251106commit.patch | 2 +- ++ patches/0003-20261106secondcommit.patch | 2 +- ++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ ++ 7 files changed, 7577 insertions(+), 30 deletions(-) ++ create mode 100644 patches/0004-20251106change.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index 0546f318..8831e4b7 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): ++ # expert_cache += expert_out * weight ++ # return expert_cache ++ ++- @no_grad() ++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++- # x 的 shape: (1, hidden_size) ++- # flat_expert_indices 的 shape: (num_experts_per_tok,) ++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++- ++- # 1. 收集所有需要的专家层 ++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++- selected_experts = [self.experts[i] for i in flat_expert_indices] ++- ++- # 2. 并行计算所有专家的输出 ++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++- # ops.cat 会将它们堆叠成一个新的 Tensor ++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++- ++- # 3. 使用矩阵乘法进行加权求和 ++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++- # 最终结果 final_output 的 shape: (1, hidden_size) ++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++ # @no_grad() +++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ # # x 的 shape: (1, hidden_size) +++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++ +++ # # 1. 收集所有需要的专家层 +++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++ # selected_experts = [self.experts[i] for i in flat_expert_indices] +++ +++ # # 2. 并行计算所有专家的输出 +++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++ # # ops.cat 会将它们堆叠成一个新的 Tensor +++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++ +++ # # 3. 使用矩阵乘法进行加权求和 +++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ # # 最终结果 final_output 的 shape: (1, hidden_size) +++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++ ++- return final_output +++ # return final_output ++ ++ ++ # @no_grad() ++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): ++ ) ++ ++ return expert_cache +++# 放置在 DeepseekMoE 类中 +++ @no_grad() +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ """ +++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +++ +++ Args: +++ x (Tensor): 输入张量, shape: (1, hidden_size) +++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +++ """ +++ top_k, _ = flat_expert_weights.shape +++ hidden_size = x.shape[-1] +++ +++ # 1. 将所有专家的权重堆叠起来 +++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +++ +++ # 2. "收集" 所需的专家权重 +++ selected_gate_w = stacked_gate_w[flat_expert_indices] +++ selected_up_w = stacked_up_w[flat_expert_indices] +++ selected_down_w = stacked_down_w[flat_expert_indices] +++ +++ # 3. 准备输入 +++ x_expanded = x.expand((top_k, 1, hidden_size)) +++ +++ # 4. 并行计算 gate_proj 和 up_proj +++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +++ +++ # 5. 计算中间状态 +++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +++ +++ # 6. 并行计算 down_proj +++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +++ # --- [FIX] --- +++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +++ # --- [FIX END] --- +++ +++ # 7. 根据路由权重进行加权求和 +++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +++ +++ return weighted_sum +++ +++ ++ ++ # @no_grad() ++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index ebd7782e..913a7609 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): ++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++- x1 = x[..., : x.shape[-1] // 2] ++- x2 = x[..., x.shape[-1] // 2 :] +++ # x1 = x[..., : x.shape[-1] // 2] +++ # x2 = x[..., x.shape[-1] // 2 :] ++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++index d059dcbe..2b217b64 100644 ++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): ++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++ def rotate_half(x): ++ """Rotates half the hidden dims of the input.""" ++- x1 = x[..., : x.shape[-1] // 2] ++- x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++ # x1 = x[..., : x.shape[-1] // 2] +++ # x2 = x[..., x.shape[-1] // 2 :] +++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++ return ops.cat((-x2, x1), dim=-1) ++ ++ ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++index 78f22642..0a0ef2d7 100644 ++--- a/patches/0001-20251104commit.patch +++++ b/patches/0001-20251104commit.patch ++@@ -1,7 +1,7 @@ ++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++-Subject: [PATCH 1/3] 20251104commit +++Subject: [PATCH 1/4] 20251104commit ++ ++ --- ++ mindnlp/transformers/cache_utils.py | 28 +- ++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++index 22b65dd5..5185270c 100644 ++--- a/patches/0002-20251106commit.patch +++++ b/patches/0002-20251106commit.patch ++@@ -1,7 +1,7 @@ ++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++-Subject: [PATCH 2/3] 20251106commit +++Subject: [PATCH 2/4] 20251106commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++index 966529e4..3e05f821 100644 ++--- a/patches/0003-20261106secondcommit.patch +++++ b/patches/0003-20261106secondcommit.patch ++@@ -1,7 +1,7 @@ ++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++-Subject: [PATCH 3/3] 20261106secondcommit +++Subject: [PATCH 3/4] 20261106secondcommit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++new file mode 100644 ++index 00000000..88a1aef4 ++--- /dev/null +++++ b/patches/0004-20251106change.patch ++@@ -0,0 +1,7498 @@ +++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Thu, 6 Nov 2025 15:48:09 +0800 +++Subject: [PATCH 4/4] 20251106change +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 189 +- +++ patches/0001-20251104commit.patch | 1272 +++++++ +++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +++ 4 files changed, 7244 insertions(+), 186 deletions(-) +++ create mode 100644 patches/0001-20251104commit.patch +++ create mode 100644 patches/0002-20251106commit.patch +++ create mode 100644 patches/0003-20261106secondcommit.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index 2f9192bf..0546f318 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +++ +++ return attn_output, attn_weights, past_key_value +++ +++-# class DeepseekFlashAttention(nn.Module): +++-# """ +++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +++- +++-# This class is designed as a drop-in replacement for DeepseekAttention. +++-# """ +++- +++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++-# super().__init__() +++-# self.config = config +++-# self.layer_idx = layer_idx +++-# if layer_idx is None: +++-# logger.warning( +++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++-# "when creating this class." +++-# ) +++- +++-# self.attention_dropout = config.attention_dropout +++-# self.hidden_size = config.hidden_size +++-# self.num_heads = config.num_attention_heads +++-# self.head_dim = self.hidden_size // self.num_heads +++-# self.num_key_value_heads = config.num_key_value_heads +++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++-# self.max_position_embeddings = config.max_position_embeddings +++-# self.rope_theta = config.rope_theta +++-# self.is_causal = True +++- +++-# if (self.head_dim * self.num_heads) != self.hidden_size: +++-# raise ValueError( +++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++-# f" and `num_heads`: {self.num_heads})." +++-# ) +++- +++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++-# self._init_rope() +++- +++-# def _init_rope(self): +++-# if self.config.rope_scaling is None: +++-# self.rotary_emb = DeepseekRotaryEmbedding( +++-# self.head_dim, +++-# max_position_embeddings=self.max_position_embeddings, +++-# base=self.rope_theta, +++-# ) +++-# else: +++-# scaling_type = self.config.rope_scaling["type"] +++-# scaling_factor = self.config.rope_scaling["factor"] +++-# if scaling_type == "linear": +++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++-# self.head_dim, +++-# max_position_embeddings=self.max_position_embeddings, +++-# scaling_factor=scaling_factor, +++-# base=self.rope_theta, +++-# ) +++-# elif scaling_type == "dynamic": +++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++-# self.head_dim, +++-# max_position_embeddings=self.max_position_embeddings, +++-# scaling_factor=scaling_factor, +++-# base=self.rope_theta, +++-# ) +++-# else: +++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++- +++-# def forward( +++-# self, +++-# hidden_states: mindspore.Tensor, +++-# attention_mask: Optional[mindspore.Tensor] = None, +++-# position_ids: Optional[mindspore.Tensor] = None, +++-# past_key_value: Optional[Cache] = None, +++-# output_attentions: bool = False, +++-# use_cache: bool = False, +++-# **kwargs, +++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++-# if "padding_mask" in kwargs: +++-# warnings.warn( +++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++-# ) +++- +++-# if output_attentions: +++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +++- +++-# bsz, q_len, _ = hidden_states.shape +++- +++-# if self.config.pretraining_tp > 1: +++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++- +++-# query_states = self.q_proj(hidden_states) +++-# key_states = self.k_proj(hidden_states) +++-# value_states = self.v_proj(hidden_states) +++- +++-# # Reshape for multi-head attention +++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++- +++-# kv_seq_len = key_states.shape[-2] +++-# if past_key_value is not None: +++-# if self.layer_idx is None: +++-# raise ValueError( +++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++-# "with a layer index." +++-# ) +++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++- +++-# # Apply Rotary Positional Embedding +++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++- +++-# if past_key_value is not None: +++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++- +++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++- +++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++- +++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++- +++-# # Convert attention_mask for flash_attention_score +++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +++-# if attention_mask is not None: +++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++-# raise ValueError( +++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++-# ) +++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +++-# else: +++-# attn_mask_for_fa = None +++- +++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++- +++-# # Call the fused flash_attention_score operator +++-# attn_output = mindspore.ops.flash_attention_score( +++-# query=query_states_for_fa, +++-# key=key_states_for_fa, +++-# value=value_states_for_fa, +++-# head_num=self.num_heads, # This is N1, the number of query heads +++-# input_layout='BSH', +++-# attn_mask=attn_mask_for_fa, +++-# keep_prob=keep_prob, +++-# scalar_value=1.0 / math.sqrt(self.head_dim), +++-# sparse_mode=0 # Default mask mode +++-# ) +++- +++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +++-# attn_output = self.o_proj(attn_output) +++- +++-# # Flash Attention does not return attention weights +++-# attn_weights = None +++- +++-# return attn_output, attn_weights, past_key_value +++ +++ class DeepseekFlashAttention(nn.Module): +++ """ +++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +++ super().__init__() +++ self.hidden_size = config.hidden_size +++ +++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +++- config=config, layer_idx=layer_idx +++- ) ++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++++ # config=config, layer_idx=layer_idx ++++ # ) +++ +++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +++ config=config, layer_idx=layer_idx +++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +++ return outputs +++ +++ +++- +++ class DeepseekPreTrainedModel(PreTrainedModel): +++ config_class = DeepseekConfig +++ base_model_prefix = "model" +++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++ # Initialize weights and apply final processing +++ self.post_init() +++ self.warm_up = False +++- #@dwj +++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +++- self.num_layers, +++- self.num_attention_heads, +++- self.head_dim, +++- batch_size=1, +++- max_length=self.max_length, +++- dtype=mindspore.float16 +++- ) +++- +++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +++- key_cache = [] +++- value_cache = [] +++- for _ in range(num_layers): +++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++- key_cache.append(k) +++- value_cache.append(v) +++- return key_cache, value_cache +++- +++ +++ def warmup_moe_model_deep(self): +++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++new file mode 100644 +++index 00000000..78f22642 +++--- /dev/null ++++++ b/patches/0001-20251104commit.patch +++@@ -0,0 +1,1272 @@ ++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++Subject: [PATCH 1/3] 20251104commit ++++ ++++--- ++++ mindnlp/transformers/cache_utils.py | 28 +- ++++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++ 3 files changed, 976 insertions(+), 87 deletions(-) ++++ ++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++index cadd2e04..02f8d4be 100644 ++++--- a/mindnlp/transformers/cache_utils.py +++++++ b/mindnlp/transformers/cache_utils.py ++++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++ # k_out[:, :, cache_position] = key_states ++++ # v_out[:, :, cache_position] = value_states ++++- if ON_ORANGE_PI: ++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++- else: ++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++- +++++ # if ON_ORANGE_PI: +++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++ # else: +++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++ # 确保 cache_position 是 1D tensor 并且类型正确 +++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++ if cache_position.ndim > 1: +++++ cache_position = cache_position.flatten() +++++ # 确保类型是 int32 或 int64(MindSpore 要求) +++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++ cache_position = cache_position.int() +++++ +++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++ k_out[:, :, cache_position] = key_states +++++ v_out[:, :, cache_position] = value_states +++++ ++++ return k_out, v_out ++++ ++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index c695b944..d8303e45 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++- x1 = x[..., : x.shape[-1] // 2] ++++- x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++ # x1 = x[..., : x.shape[-1] // 2] +++++ # x2 = x[..., x.shape[-1] // 2 :] +++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++ if self.training: ++++ raise NotImplementedError("Training is not supported yet.") ++++ else: ++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++- if self.config.n_shared_experts is not None: ++++- y = y + self.shared_experts(identity) ++++- return y +++++ # @lwx +++++ if orig_shape[1] == 1: +++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++ y=y.view(*orig_shape) +++++ if self.config.n_shared_experts is not None: +++++ y = y + self.shared_experts(identity) +++++ return y +++++ else: +++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++ if self.config.n_shared_experts is not None: +++++ y = y + self.shared_experts(identity) +++++ return y +++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++ # if self.config.n_shared_experts is not None: +++++ # y = y + self.shared_experts(identity) +++++ # return y +++++ +++++ @no_grad() +++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ +++++ expert_cache = ops.zeros_like(x) +++++ for i in range(self.num_experts_per_tok): +++++ expert_id = flat_expert_indices[i].item() +++++ weight = flat_expert_weights[i].item() +++++ expert = self.experts[expert_id] +++++ expert_out = expert(x) +++++ expert_cache += expert_out * weight +++++ return expert_cache ++++ ++++ @no_grad() ++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++- # expert_cache = torch.zeros_like(x) ++++- # idxs = flat_expert_indices.argsort() ++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++- # token_idxs = idxs // self.num_experts_per_tok ++++- # for i, end_idx in enumerate(tokens_per_expert): ++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++- # if start_idx == end_idx: ++++- # continue ++++- # expert = self.experts[i] ++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++- # expert_tokens = x[exp_token_idx] ++++- # expert_out = expert(expert_tokens) ++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++- # return expert_cache +++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++ expert_cache = ops.zeros_like(x) ++++ idxs = flat_expert_indices.argsort() ++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++ token_idxs = idxs // self.num_experts_per_tok +++++ ++++ for i, end_idx in enumerate(tokens_per_expert): ++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++ if start_idx == end_idx: ++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++ expert_out = expert(expert_tokens) ++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ ++++ return expert_cache +++++ +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # # expert_cache = torch.zeros_like(x) +++++ # # idxs = flat_expert_indices.argsort() +++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++ # # token_idxs = idxs // self.num_experts_per_tok +++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++ # # if start_idx == end_idx: +++++ # # continue +++++ # # expert = self.experts[i] +++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # # expert_tokens = x[exp_token_idx] +++++ # # expert_out = expert(expert_tokens) +++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++ # # return expert_cache +++++ # expert_cache = ops.zeros_like(x) +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # for i, end_idx in enumerate(tokens_per_expert): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # if start_idx == end_idx: +++++ # continue +++++ # expert = self.experts[i] +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = expert(expert_tokens) +++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ +++++ # return expert_cache +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # expert_cache = ops.zeros_like(x) +++++ +++++ # # 排序保证顺序一致 +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # # 找出有 token 的专家 +++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++ +++++ # for i in active_experts.tolist(): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # end_idx = tokens_per_expert[i] +++++ # if start_idx == end_idx: # 没有 token +++++ # continue +++++ +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = self.experts[i](expert_tokens) +++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++ +++++ # expert_cache = mindspore.mint.scatter_add( +++++ # expert_cache, +++++ # 0, +++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++ # expert_out +++++ # ) +++++ +++++ # return expert_cache +++++ +++++ ++++ ++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++ # """ ++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ ++++ # Initialize weights and apply final processing ++++ self.post_init() +++++ self.warm_up = False +++++ +++++ def warmup_moe_model_deep(self): +++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++ test_texts = [ +++++ "warmup short", +++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++ ] +++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++ if tokenizer is None: +++++ from mindnlp.transformers import AutoTokenizer +++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++ self._warmup_tokenizer = tokenizer +++++ +++++ for text in test_texts: +++++ inputs = tokenizer(text, return_tensors="ms") +++++ with mindspore._no_grad(): +++++ _ = self(**inputs, use_cache=False) +++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++ ++++ def get_input_embeddings(self): ++++ return self.model.embed_tokens ++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++ ```""" +++++ if not self.warm_up: +++++ self.warm_up = True +++++ self.warmup_moe_model_deep() +++++ ++++ output_attentions = ( ++++ output_attentions ++++ if output_attentions is not None ++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++index 3cbf820e..d4c6b651 100644 ++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++@@ -18,7 +18,6 @@ ++++ # See the License for the specific language governing permissions and ++++ # limitations under the License. ++++ """MindSpore Qwen2MoE model.""" ++++- ++++ import math ++++ from typing import List, Optional, Tuple, Union ++++ ++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++ TokenClassifierOutput, ++++ ) ++++ from ...modeling_utils import PreTrainedModel +++++from ...generation import GenerationMixin ++++ from ....utils import logging ++++ from .configuration_qwen2_moe import Qwen2MoeConfig ++++ ++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++ self.variance_epsilon = eps ++++ ++++ def forward(self, hidden_states): +++++ # @dwj +++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++ # @lwx +++++ # if not self.training : +++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++ input_dtype = hidden_states.dtype ++++ hidden_states = hidden_states.to(mindspore.float32) ++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++@@ -234,6 +239,8 @@ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++ x1 = x[..., : x.shape[-1] // 2] ++++ x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++ self.config = config ++++ self.hidden_size = config.hidden_size ++++ self.intermediate_size = intermediate_size +++++ ++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++ self.act_fn = ACT2FN[config.hidden_act] ++++ ++++ def forward(self, x): ++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++- ++++ +++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++ # @lwx +++++ # gate_up_output = self.gate_up_proj(x) +++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++ # return self.down_proj(swiglu_output) +++++ +++++ # def forward(self, x): +++++ # gate_proj_out = self.gate_proj(x) +++++ # up_proj_out = self.up_proj(x) +++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++ # return self.down_proj(swiglu_out) +++++ ++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++ """ ++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++ use_cache: bool = False, ++++ cache_position: Optional[mindspore.Tensor] = None, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ +++++ ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ query_states = self.q_proj(hidden_states) ++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++ "with a layer index." ++++ ) ++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ if isinstance(past_key_value, StaticCache): +++++ kv_seq_len = key_states.shape[-2] +++++ else: +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++ if isinstance(past_key_value, StaticCache): +++++ kv_seq_len = key_states.shape[-2] ++++ ++++ # repeat k/v heads if n_kv_heads < n_heads ++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++- +++++ ++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++ ++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++- raise ValueError( ++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++- f" {attn_weights.shape}" ++++- ) ++++- ++++- if attention_mask is not None: # no matter the length, we just slice it ++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++ if attention_mask is not None: +++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++ attn_weights = attn_weights + causal_mask ++++ ++++ # upcast attention to fp32 ++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++ ++++ attn_output = self.o_proj(attn_output) ++++- +++++ # @lwx +++++ +++++ # max_seq_len = self.max_position_embeddings # 2048 +++++ +++++ # if attention_mask is not None: +++++ # # attention_mask: [B, 1, Sq, Sk] +++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++ +++++ # # pad 到 [max_seq_len, max_seq_len] +++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++ # global_attention_mask = padded_mask +++++ # else: +++++ # global_attention_mask = None +++++ +++++ +++++ # sparse_mode=3 +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, +++++ # key=key_states, +++++ # value=value_states, +++++ # real_shift=None, +++++ # padding_mask=None, +++++ +++++ # head_num=self.num_heads, +++++ # attn_mask=global_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++ # input_layout="BNSD", +++++ # pre_tokens=2147483647, +++++ # next_tokens=2147483647, +++++ # inner_precise=0, +++++ # drop_mask=None, +++++ # prefix=None, +++++ # actual_seq_qlen=None, +++++ # actual_seq_kvlen=None, +++++ # sparse_mode=sparse_mode, +++++ # ) ++++ if not output_attentions: ++++ attn_weights = None ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++ +++++class Qwen2MoeFlashAttention(nn.Module): +++++ """ +++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++ +++++ 关键改动: +++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++ 直接传入原始的 key 和 value 张量效率更高。 +++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++ """ +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++ super().__init__() +++++ self.config = config +++++ self.layer_idx = layer_idx +++++ self.hidden_size = config.hidden_size +++++ self.num_heads = config.num_attention_heads +++++ self.head_dim = self.hidden_size // self.num_heads +++++ self.num_key_value_heads = config.num_key_value_heads +++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++ self.max_position_embeddings = config.max_position_embeddings +++++ self.rope_theta = config.rope_theta +++++ self.attention_dropout = config.attention_dropout +++++ +++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++ raise ValueError( +++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++ ) +++++ +++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++ +++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++ self.head_dim, +++++ max_position_embeddings=self.max_position_embeddings, +++++ base=self.rope_theta, +++++ ) +++++ +++++ def forward( +++++ self, +++++ hidden_states: mindspore.Tensor, +++++ attention_mask: Optional[mindspore.Tensor] = None, +++++ position_ids: Optional[mindspore.Tensor] = None, +++++ past_key_value: Optional[Cache] = None, +++++ output_attentions: bool = False, +++++ use_cache: bool = False, +++++ cache_position: Optional[mindspore.Tensor] = None, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ # 1. 线性投射 Q, K, V +++++ query_states = self.q_proj(hidden_states) +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++ # query: [B, S, H*D] -> [B, N1, S, D] +++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # 3. RoPE 旋转位置编码 +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++ if self.layer_idx is None: +++++ raise ValueError( +++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ "with a layer index." +++++ ) +++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++ if cache_position.shape[0] == 1: +++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++ kv_seq_len = past_seen_tokens + 1 +++++ else: +++++ # prefill 阶段:cache_position 是范围,使用其长度 +++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++ else: +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # 4. KV 缓存更新 +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ key_states, value_states = past_key_value.update( +++++ key_states, value_states, self.layer_idx, cache_kwargs +++++ ) +++++ +++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++ if cache_position.shape[0] == 1: +++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++ kv_seq_len = key_states.shape[-2] +++++ +++++ # 5. [重要] 准备 Attention Mask +++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++ fa_attention_mask = None +++++ if attention_mask is not None: +++++ # 截取与当前key长度匹配的部分 +++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++ fa_attention_mask = (mask_slice != 0) +++++ +++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++ input_dtype = query_states.dtype +++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++ query_states = query_states.to(mindspore.float16) +++++ key_states = key_states.to(mindspore.float16) +++++ value_states = value_states.to(mindspore.float16) +++++ +++++ # 6. [核心] 调用 flash_attention_score 算子 +++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++ attn_output = mindspore.ops.flash_attention_score( +++++ query=query_states, +++++ key=key_states, +++++ value=value_states, +++++ head_num=self.num_heads, # 传入Q的头数(N1) +++++ attn_mask=fa_attention_mask, +++++ keep_prob=1.0 - self.attention_dropout, +++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++ input_layout="BNSD", +++++ sparse_mode=0 # 使用 defaultMask 模式 +++++ ) +++++ +++++ # 恢复原始数据类型 +++++ attn_output = attn_output.to(input_dtype) +++++ +++++ # 7. 调整输出形状 +++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ attn_output = self.o_proj(attn_output) +++++ +++++ # FlashAttention 算子不直接返回注意力权重矩阵 +++++ attn_weights = None +++++ if output_attentions: +++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++ # def forward( +++++ # self, +++++ # hidden_states: mindspore.Tensor, +++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++ # position_ids: Optional[mindspore.Tensor] = None, +++++ # past_key_value: Optional[Cache] = None, +++++ # output_attentions: bool = False, +++++ # use_cache: bool = False, +++++ # cache_position: Optional[mindspore.Tensor] = None, +++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ # bsz, q_len, _ = hidden_states.shape +++++ +++++ # # 1. 线性投射 Q, K, V +++++ # query_states = self.q_proj(hidden_states) +++++ # key_states = self.k_proj(hidden_states) +++++ # value_states = self.v_proj(hidden_states) +++++ +++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # # 3. RoPE 旋转位置编码 +++++ # kv_seq_len = key_states.shape[-2] +++++ # if past_key_value is not None: +++++ # if self.layer_idx is None: +++++ # raise ValueError( +++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ # "with a layer index." +++++ # ) +++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # # 4. KV 缓存更新 +++++ # if past_key_value is not None: +++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ # key_states, value_states = past_key_value.update( +++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++ # ) +++++ +++++ # # 5. 准备 Attention Mask +++++ # fa_attention_mask = None +++++ # if attention_mask is not None: +++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # fa_attention_mask = (mask_slice != 0) +++++ +++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++ # input_dtype = query_states.dtype +++++ +++++ # # 6. [核心] 调用 flash_attention_score 算子 +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, +++++ # key=key_states, +++++ # value=value_states, +++++ # head_num=self.num_heads, +++++ # attn_mask=fa_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++ # input_layout="BNSD", +++++ # sparse_mode=0, +++++ # # <--- 修改点 2: 启用内部高精度计算 --- +++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++ # inner_precise=1 +++++ # ) +++++ +++++ # # 恢复原始数据类型 +++++ # attn_output = attn_output.to(input_dtype) +++++ +++++ # # 7. 调整输出形状 +++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ # attn_output = self.o_proj(attn_output) +++++ +++++ # attn_weights = None +++++ # if output_attentions: +++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++ # return attn_output, attn_weights, past_key_value +++++ +++++ # def forward( +++++ # self, +++++ # hidden_states: mindspore.Tensor, +++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++ # position_ids: Optional[mindspore.Tensor] = None, +++++ # past_key_value: Optional[Cache] = None, +++++ # output_attentions: bool = False, +++++ # use_cache: bool = False, +++++ # cache_position: Optional[mindspore.Tensor] = None, +++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ # bsz, q_len, _ = hidden_states.shape +++++ +++++ # query_states = self.q_proj(hidden_states) +++++ # key_states = self.k_proj(hidden_states) +++++ # value_states = self.v_proj(hidden_states) +++++ +++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ # kv_seq_len = key_states.shape[-2] +++++ # if past_key_value is not None: +++++ # if self.layer_idx is None: +++++ # raise ValueError("`layer_idx` must be specified for caching") +++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ # if past_key_value is not None: +++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ # key_states, value_states = past_key_value.update( +++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++ # ) +++++ +++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++ +++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++ # query_states = query_states / math.sqrt(self.head_dim) +++++ # # <--- 修改结束 --- +++++ +++++ # fa_attention_mask = None +++++ # if attention_mask is not None: +++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ # fa_attention_mask = (mask_slice != 0) +++++ +++++ # input_dtype = query_states.dtype +++++ +++++ # attn_output = mindspore.ops.flash_attention_score( +++++ # query=query_states, # 传入已经预先缩放过的 query +++++ # key=key_states, +++++ # value=value_states, +++++ # head_num=self.num_heads, +++++ # attn_mask=fa_attention_mask, +++++ # keep_prob=1.0 - self.attention_dropout, +++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++ # input_layout="BNSD", +++++ # sparse_mode=0, +++++ # inner_precise=1 # 仍然保持内部高精度计算 +++++ # ) +++++ +++++ # attn_output = attn_output.to(input_dtype) +++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ # attn_output = self.o_proj(attn_output) +++++ +++++ # attn_weights = None +++++ # if output_attentions: +++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++ +++++ # return attn_output, attn_weights, past_key_value +++++ ++++ QWEN2MOE_ATTENTION_CLASSES = { ++++ "eager": Qwen2MoeAttention, +++++ "flash-attention": Qwen2MoeFlashAttention, ++++ } ++++ ++++ ++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ +++++ #@dwj +++++ # 只遍历激活的专家,而非全部专家 ++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- hidden_states = hidden_states.view(-1, hidden_dim) ++++- # router_logits: (batch * sequence_length, n_experts) ++++- router_logits = self.gate(hidden_states) ++++- ++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- if self.norm_topk_prob: ++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- # we cast back to the input dtype ++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++- final_hidden_states = ops.zeros( ++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++- ) ++++- ++++- # One hot encode the selected experts to create an expert mask ++++- # this will be used to easily index which expert is going to be sollicitated ++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++- ++++- # Loop over all available experts in the model and perform the computation on each expert ++++- for expert_idx in range(self.num_experts): ++++- expert_layer = self.experts[expert_idx] ++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++- ++++- # Index the correct hidden states and compute the expert hidden state for ++++- # the current expert. We need to make sure to multiply the output hidden ++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++- if 0 not in idx.shape: ++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++- ++++- # However `index_add_` only support torch tensors for indexing so we'll use ++++- # the `top_x` tensor here. ++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++- ++++- shared_expert_output = self.shared_expert(hidden_states) ++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++- ++++- final_hidden_states = final_hidden_states + shared_expert_output +++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++ num_tokens = hidden_states_reshaped.shape[0] +++++ +++++ router_logits = self.gate(hidden_states_reshaped) +++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++ if self.norm_topk_prob: +++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++ flat_selected_experts = selected_experts.flatten() +++++ +++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++ token_indices = broadcasted_token_indices.flatten() +++++ +++++ active_experts = ops.unique(flat_selected_experts) +++++ +++++ for expert_idx_tensor in active_experts: +++++ expert_idx = expert_idx_tensor.item() +++++ expert_layer = self.experts[expert_idx] +++++ +++++ mask = (flat_selected_experts == expert_idx_tensor) +++++ selected_token_indices = token_indices[mask] +++++ selected_routing_weights = routing_weights.flatten()[mask] +++++ +++++ current_states = hidden_states_reshaped[selected_token_indices] +++++ +++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++ +++++ final_hidden_states = final_hidden_states.index_add( +++++ dim=0, +++++ index=selected_token_indices, +++++ source=expert_output.to(hidden_states.dtype) +++++ ) +++++ +++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++ ++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++- return final_hidden_states, router_logits +++++ final_hidden_states = final_hidden_states + shared_expert_output +++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++ +++++ return final_hidden_states, router_logits ++++ ++++ ++++ class Qwen2MoeDecoderLayer(nn.Module): ++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++ ++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++ ++++ if (layer_idx not in config.mlp_only_layers) and ( ++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++ ): ++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++ _skip_keys_device_placement = "past_key_values" ++++ _supports_cache_class = True +++++#lwx +++++ # _supports_static_cache = True ++++ ++++ def _init_weights(self, module): ++++ std = self.config.initializer_range ++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++ return causal_mask ++++ ++++ ++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ _tied_weights_keys = ["lm_head.weight"] ++++ ++++ def __init__(self, config): ++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ self.num_experts_per_tok = config.num_experts_per_tok ++++ # Initialize weights and apply final processing ++++ self.post_init() +++++ # @lwx +++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++ # self.generation_config.cache_implementation = "static" +++++ self._warmed_up = False +++++ +++++ def warmup_moe_model(self): +++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++ test_texts = [ +++++ "warmup short", +++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++ ] +++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++ if tokenizer is None: +++++ from mindnlp.transformers import AutoTokenizer +++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++ self._warmup_tokenizer = tokenizer +++++ +++++ for text in test_texts: +++++ inputs = tokenizer(text, return_tensors="ms") +++++ with mindspore._no_grad(): +++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++ ++++ def get_input_embeddings(self): ++++ return self.model.embed_tokens ++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++ ```""" +++++ if not self._warmed_up: +++++ self._warmed_up = True +++++ self.warmup_moe_model() ++++ ++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++ output_router_logits = ( ++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++ } ++++ ) ++++ return model_inputs +++++# @lwx +++++ # def _decode_one_tokens_logits( +++++ # self, +++++ # cur_token: mindspore.Tensor, +++++ # input_pos: Optional[mindspore.Tensor], +++++ # cache_position: mindspore.Tensor, +++++ # past_key_values: StaticCache, +++++ # ) -> mindspore.Tensor: +++++ # """ +++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++ +++++ # Args: +++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++ # input_pos: 输入位置信息,可选 +++++ # cache_position: 当前token在cache中的位置,shape为(1,) +++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++ +++++ # Returns: +++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++ # """ +++++ # # 调用JIT编译的版本 +++++ # return self.get_decode_one_tokens_logits( +++++ # cur_token=cur_token, +++++ # input_pos=input_pos, +++++ # cache_position=cache_position, +++++ # past_key_values=past_key_values, +++++ # ) +++++ +++++ # @mindspore.jit(jit_level='O1') +++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++ # """ +++++ # JIT编译的函数,用于高效的单token解码 +++++ # 使用JIT编译优化以支持静态shape和高效执行 +++++ +++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++ # """ +++++ # outputs = self.model.forward( +++++ # input_ids=cur_token, +++++ # position_ids=input_pos, +++++ # cache_position=cache_position, +++++ # past_key_values=past_key_values, +++++ # use_cache=True, +++++ # return_dict=False, +++++ # ) +++++ +++++ # hidden_states = outputs[0] +++++ # logits = self.lm_head.forward(hidden_states) +++++ # logits = logits.float() +++++ +++++ # return logits[:, -1, :] +++++ +++++ # def _sample( +++++ # self, +++++ # input_ids: mindspore.Tensor, +++++ # logits_processor, +++++ # stopping_criteria, +++++ # generation_config, +++++ # synced_devices: bool, +++++ # streamer=None, +++++ # logits_warper=None, +++++ # **model_kwargs, +++++ # ): +++++ # """ +++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++ # """ +++++ # from ...generation.logits_process import LogitsProcessorList +++++ # from ...generation.stopping_criteria import StoppingCriteriaList +++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++ # from mindnlp.core import nn, ops, no_grad +++++ # import numpy as np +++++ +++++ # # 检查是否使用 StaticCache +++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++ # # 否则,直接调用父类方法 +++++ # past_key_values = model_kwargs.get("past_key_values") +++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++ +++++ # if not isinstance(past_key_values, StaticCache): +++++ # # 不使用 StaticCache,直接调用父类方法 +++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++ # return super()._sample( +++++ # input_ids=input_ids, +++++ # logits_processor=logits_processor, +++++ # stopping_criteria=stopping_criteria, +++++ # generation_config=generation_config, +++++ # synced_devices=synced_devices, +++++ # streamer=streamer, +++++ # logits_warper=logits_warper, +++++ # **model_kwargs, +++++ # ) +++++ +++++ # # 使用 StaticCache,进入自定义循环 +++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++ # pad_token_id = generation_config._pad_token_tensor +++++ # output_attentions = generation_config.output_attentions +++++ # output_hidden_states = generation_config.output_hidden_states +++++ # output_scores = generation_config.output_scores +++++ # output_logits = generation_config.output_logits +++++ # return_dict_in_generate = generation_config.return_dict_in_generate +++++ # max_length = generation_config.max_length +++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++ # do_sample = generation_config.do_sample +++++ +++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++ # raise ValueError( +++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++ # f"{logits_warper})." +++++ # ) +++++ +++++ # # init attention / hidden states / scores tuples +++++ # scores = () if (return_dict_in_generate and output_scores) else None +++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++ +++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++ # encoder_hidden_states = ( +++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++ # ) +++++ +++++ # # keep track of which sequences are already finished +++++ # batch_size, cur_len = input_ids.shape +++++ # this_peer_finished = False +++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++ +++++ # time_record = [] +++++ # from ....utils.testing_utils import parse_flag_from_env +++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++ +++++ # while self._has_unfinished_sequences( +++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++ # ): +++++ # if _record_time: +++++ # import time as time_module +++++ # infer_start = time_module.time() +++++ +++++ # # prepare model inputs +++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++ +++++ # # prepare variable output controls +++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++ +++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++ # cur_cache_position = model_inputs.get("cache_position") +++++ # cur_past_key_values = model_inputs.get("past_key_values") +++++ # cur_input_ids = model_inputs.get("input_ids") +++++ +++++ # if (isinstance(cur_past_key_values, StaticCache) and +++++ # cur_cache_position is not None and +++++ # len(cur_cache_position.shape) > 0 and +++++ # cur_cache_position.shape[0] == 1 and +++++ # cur_input_ids is not None and +++++ # cur_input_ids.shape[1] == 1): +++++ # # 使用 JIT 优化的单 token 解码 +++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++ # if not hasattr(self, '_jit_used'): +++++ # self._jit_used = False +++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++ +++++ # next_token_logits = self.get_decode_one_tokens_logits( +++++ # cur_token=cur_input_ids, +++++ # input_pos=model_inputs.get("position_ids"), +++++ # cache_position=cur_cache_position, +++++ # past_key_values=cur_past_key_values, +++++ # ) +++++ +++++ # # 标记已使用JIT(用于后续判断) +++++ # if not self._jit_used: +++++ # self._jit_used = True +++++ +++++ # # 构造兼容的输出对象 +++++ # class JitOptimizedOutput: +++++ # def __init__(self, logits, config): +++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++ # self.config = config +++++ # # 对于 JIT 优化路径,这些属性通常不需要 +++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++ # self.attentions = None if not config.is_encoder_decoder else None +++++ # self.cross_attentions = None +++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++ # self.hidden_states = None if not config.is_encoder_decoder else None +++++ +++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++ # else: +++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++ # outputs = self(**model_inputs, return_dict=True) +++++ +++++ # if synced_devices and this_peer_finished: +++++ # continue +++++ +++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++ # next_token_logits = outputs.logits[:, -1, :] +++++ +++++ # # pre-process distribution +++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++ # if do_sample: +++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++ +++++ # # Store scores, attentions and hidden_states when required +++++ # if return_dict_in_generate: +++++ # if output_scores: +++++ # scores += (next_token_scores,) +++++ # if output_logits: +++++ # raw_logits += (next_token_logits,) +++++ # if output_attentions: +++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++ # decoder_attentions += (attn,) if attn is not None else (None,) +++++ # if self.config.is_encoder_decoder: +++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++ +++++ # if output_hidden_states: +++++ # hidden = ( +++++ # outputs.decoder_hidden_states +++++ # if self.config.is_encoder_decoder +++++ # else outputs.hidden_states +++++ # ) +++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++ +++++ # # token selection +++++ # if do_sample: +++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++ # else: +++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++ +++++ # # finished sentences should have their next token be a padding token +++++ # if has_eos_stopping_criteria: +++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++ +++++ # # update generated ids, model inputs, and length for next step +++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++ # if streamer is not None: +++++ # streamer.put(next_tokens) +++++ +++++ # model_kwargs = self._update_model_kwargs_for_generation( +++++ # outputs, +++++ # model_kwargs, +++++ # is_encoder_decoder=self.config.is_encoder_decoder, +++++ # ) +++++ +++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++ # cur_len += 1 +++++ +++++ # if _record_time: +++++ # import time as time_module +++++ # infer_stop = time_module.time() +++++ # time_record.append(infer_stop - infer_start) +++++ +++++ # del outputs +++++ +++++ # average_infer_time = None +++++ # if time_record: +++++ # if len(time_record) > 1: +++++ # time_record.pop(0) +++++ # average_infer_time = sum(time_record) / len(time_record) +++++ # print(f'average inference time is: {average_infer_time}') +++++ # print(f'inference time record: {time_record}') +++++ +++++ # if streamer is not None: +++++ # streamer.end() +++++ +++++ # # 简单判断:打印是否使用了JIT路径 +++++ # if hasattr(self, '_jit_used') and self._jit_used: +++++ # print("[JIT] ✓ JIT optimization was used during generation") +++++ # else: +++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++ +++++ # if return_dict_in_generate: +++++ # if self.config.is_encoder_decoder: +++++ # return GenerateEncoderDecoderOutput( +++++ # sequences=input_ids, +++++ # scores=scores, +++++ # logits=raw_logits, +++++ # encoder_attentions=encoder_attentions, +++++ # encoder_hidden_states=encoder_hidden_states, +++++ # decoder_attentions=decoder_attentions, +++++ # cross_attentions=cross_attentions, +++++ # decoder_hidden_states=decoder_hidden_states, +++++ # past_key_values=model_kwargs.get("past_key_values"), +++++ # average_infer_time=average_infer_time +++++ # ) +++++ # else: +++++ # return GenerateDecoderOnlyOutput( +++++ # sequences=input_ids, +++++ # scores=scores, +++++ # logits=raw_logits, +++++ # attentions=decoder_attentions, +++++ # hidden_states=decoder_hidden_states, +++++ # past_key_values=model_kwargs.get("past_key_values"), +++++ # average_infer_time=average_infer_time +++++ # ) +++++ # else: +++++ # return input_ids +++++ +++++ # def _prepare_cache_for_generation( +++++ # self, +++++ # generation_config, +++++ # model_kwargs, +++++ # assistant_model, +++++ # batch_size, +++++ # max_cache_length, +++++ # ): +++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++ # generation_config.cache_implementation = "static" +++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++ +++++ # if generation_config.cache_implementation == "static": +++++ # base_required_from_max_length = generation_config.max_length + 1 +++++ # base_required = max(max_cache_length, base_required_from_max_length) +++++ # min_cache_size = 50 +++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++ # else: +++++ # max_cache_length = max(base_required, min_cache_size) +++++ +++++ # original_max_cache_length = max_cache_length +++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++ # print(f" - input max_cache_length: {original_max_cache_length}") +++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++ # print(f" - final max_cache_length: {max_cache_length}") +++++ +++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++ # if max_cache_length > self.config.max_position_embeddings: +++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++ +++++ # result = super()._prepare_cache_for_generation( +++++ # generation_config=generation_config, +++++ # model_kwargs=model_kwargs, +++++ # assistant_model=assistant_model, +++++ # batch_size=batch_size, +++++ # max_cache_length=max_cache_length, +++++ # ) +++++ +++++ # if generation_config.cache_implementation == "static": +++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++ # created_cache = model_kwargs.get(cache_name) +++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++ # if created_cache.max_cache_len < generation_config.max_length: +++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++ +++++ # return result +++++ +++++ +++++ ++++ ++++ ++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++-- ++++2.27.0 ++++ +++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++new file mode 100644 +++index 00000000..22b65dd5 +++--- /dev/null ++++++ b/patches/0002-20251106commit.patch +++@@ -0,0 +1,3200 @@ ++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Thu, 6 Nov 2025 09:20:38 +0800 ++++Subject: [PATCH 2/3] 20251106commit ++++ ++++--- ++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- ++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ ++++ 3 files changed, 2689 insertions(+), 305 deletions(-) ++++ create mode 100644 patches/0001-20251104commit.patch ++++ ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index d8303e45..73773c22 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): ++++ # y = y + self.shared_experts(identity) ++++ # return y ++++ +++++ # @no_grad() +++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ +++++ # expert_cache = ops.zeros_like(x) +++++ # for i in range(self.num_experts_per_tok): +++++ # expert_id = flat_expert_indices[i].item() +++++ # weight = flat_expert_weights[i].item() +++++ # expert = self.experts[expert_id] +++++ # expert_out = expert(x) +++++ # expert_cache += expert_out * weight +++++ # return expert_cache +++++ ++++ @no_grad() ++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ # x 的 shape: (1, hidden_size) +++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++++ +++++ # 1. 收集所有需要的专家层 +++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +++++ +++++ # 2. 并行计算所有专家的输出 +++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++++ # ops.cat 会将它们堆叠成一个新的 Tensor +++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++++ +++++ # 3. 使用矩阵乘法进行加权求和 +++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ # 最终结果 final_output 的 shape: (1, hidden_size) +++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++++ +++++ return final_output ++++ ++++- expert_cache = ops.zeros_like(x) ++++- for i in range(self.num_experts_per_tok): ++++- expert_id = flat_expert_indices[i].item() ++++- weight = flat_expert_weights[i].item() ++++- expert = self.experts[expert_id] ++++- expert_out = expert(x) ++++- expert_cache += expert_out * weight ++++- return expert_cache ++++ ++++ @no_grad() ++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++ # @lwx +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++++ ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): ++++ return attn_output, attn_weights, past_key_value ++++ ++++ +++++# class DeepseekFlashAttention(nn.Module): +++++# """ +++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +++++ +++++# This class is designed as a drop-in replacement for DeepseekAttention. +++++# """ +++++ +++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++++# super().__init__() +++++# self.config = config +++++# self.layer_idx = layer_idx +++++# if layer_idx is None: +++++# logger.warning( +++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++# "when creating this class." +++++# ) +++++ +++++# self.attention_dropout = config.attention_dropout +++++# self.hidden_size = config.hidden_size +++++# self.num_heads = config.num_attention_heads +++++# self.head_dim = self.hidden_size // self.num_heads +++++# self.num_key_value_heads = config.num_key_value_heads +++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++# self.max_position_embeddings = config.max_position_embeddings +++++# self.rope_theta = config.rope_theta +++++# self.is_causal = True +++++ +++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++# raise ValueError( +++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++# f" and `num_heads`: {self.num_heads})." +++++# ) +++++ +++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++++# self._init_rope() +++++ +++++# def _init_rope(self): +++++# if self.config.rope_scaling is None: +++++# self.rotary_emb = DeepseekRotaryEmbedding( +++++# self.head_dim, +++++# max_position_embeddings=self.max_position_embeddings, +++++# base=self.rope_theta, +++++# ) +++++# else: +++++# scaling_type = self.config.rope_scaling["type"] +++++# scaling_factor = self.config.rope_scaling["factor"] +++++# if scaling_type == "linear": +++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++++# self.head_dim, +++++# max_position_embeddings=self.max_position_embeddings, +++++# scaling_factor=scaling_factor, +++++# base=self.rope_theta, +++++# ) +++++# elif scaling_type == "dynamic": +++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++++# self.head_dim, +++++# max_position_embeddings=self.max_position_embeddings, +++++# scaling_factor=scaling_factor, +++++# base=self.rope_theta, +++++# ) +++++# else: +++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++++ +++++# def forward( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# attention_mask: Optional[mindspore.Tensor] = None, +++++# position_ids: Optional[mindspore.Tensor] = None, +++++# past_key_value: Optional[Cache] = None, +++++# output_attentions: bool = False, +++++# use_cache: bool = False, +++++# **kwargs, +++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++# if "padding_mask" in kwargs: +++++# warnings.warn( +++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++++# ) +++++ +++++# if output_attentions: +++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +++++ +++++# bsz, q_len, _ = hidden_states.shape +++++ +++++# if self.config.pretraining_tp > 1: +++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++++ +++++# query_states = self.q_proj(hidden_states) +++++# key_states = self.k_proj(hidden_states) +++++# value_states = self.v_proj(hidden_states) +++++ +++++# # Reshape for multi-head attention +++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++# kv_seq_len = key_states.shape[-2] +++++# if past_key_value is not None: +++++# if self.layer_idx is None: +++++# raise ValueError( +++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++# "with a layer index." +++++# ) +++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++# # Apply Rotary Positional Embedding +++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++# if past_key_value is not None: +++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ +++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++ +++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++ +++++# # Convert attention_mask for flash_attention_score +++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +++++# if attention_mask is not None: +++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++++# raise ValueError( +++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++++# ) +++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +++++# else: +++++# attn_mask_for_fa = None +++++ +++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++++ +++++# # Call the fused flash_attention_score operator +++++# attn_output = mindspore.ops.flash_attention_score( +++++# query=query_states_for_fa, +++++# key=key_states_for_fa, +++++# value=value_states_for_fa, +++++# head_num=self.num_heads, # This is N1, the number of query heads +++++# input_layout='BSH', +++++# attn_mask=attn_mask_for_fa, +++++# keep_prob=keep_prob, +++++# scalar_value=1.0 / math.sqrt(self.head_dim), +++++# sparse_mode=0 # Default mask mode +++++# ) +++++ +++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +++++# attn_output = self.o_proj(attn_output) +++++ +++++# # Flash Attention does not return attention weights +++++# attn_weights = None +++++ +++++# return attn_output, attn_weights, past_key_value +++++ +++++class DeepseekFlashAttention(nn.Module): +++++ """ +++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +++++ This implementation is a drop-in replacement for the original DeepseekAttention class, +++++ designed for high performance on supported hardware (Ascend). +++++ +++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +++++ """ +++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++++ super().__init__() +++++ self.config = config +++++ self.layer_idx = layer_idx +++++ if layer_idx is None: +++++ logger.warning( +++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++ "when creating this class." +++++ ) +++++ +++++ # --- [FIX] Correctly initialize all required attributes --- +++++ self.attention_dropout = config.attention_dropout +++++ self.hidden_size = config.hidden_size +++++ self.num_heads = config.num_attention_heads +++++ self.head_dim = self.hidden_size // self.num_heads +++++ self.num_key_value_heads = config.num_key_value_heads +++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++ self.max_position_embeddings = config.max_position_embeddings +++++ self.rope_theta = config.rope_theta +++++ self.is_causal = True +++++ +++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++ raise ValueError( +++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++ f" and `num_heads`: {self.num_heads})." +++++ ) +++++ +++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++++ +++++ # This call will now succeed as all attributes are initialized. +++++ self._init_rope() +++++ +++++ def _init_rope(self): +++++ if self.config.rope_scaling is None: +++++ self.rotary_emb = DeepseekRotaryEmbedding( +++++ self.head_dim, +++++ max_position_embeddings=self.max_position_embeddings, +++++ base=self.rope_theta, +++++ ) +++++ else: +++++ scaling_type = self.config.rope_scaling["type"] +++++ scaling_factor = self.config.rope_scaling["factor"] +++++ if scaling_type == "linear": +++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++++ self.head_dim, +++++ max_position_embeddings=self.max_position_embeddings, +++++ scaling_factor=scaling_factor, +++++ base=self.rope_theta, +++++ ) +++++ elif scaling_type == "dynamic": +++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++++ self.head_dim, +++++ max_position_embeddings=self.max_position_embeddings, +++++ scaling_factor=scaling_factor, +++++ base=self.rope_theta, +++++ ) +++++ else: +++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++++ +++++ def forward( +++++ self, +++++ hidden_states: mindspore.Tensor, +++++ attention_mask: Optional[mindspore.Tensor] = None, +++++ position_ids: Optional[mindspore.Tensor] = None, +++++ past_key_value: Optional[Cache] = None, +++++ output_attentions: bool = False, +++++ use_cache: bool = False, +++++ **kwargs, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ if "padding_mask" in kwargs: +++++ warnings.warn( +++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++++ ) +++++ if output_attentions: +++++ warnings.warn( +++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +++++ ) +++++ +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ if self.config.pretraining_tp > 1: +++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++++ +++++ query_states = self.q_proj(hidden_states) +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++ # Apply Rotary Position Embedding +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos} +++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +++++ # So we must explicitly repeat the KV heads. +++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++ +++++ # Convert attention mask for flash_attention_score +++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +++++ if attention_mask is not None: +++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++++ raise ValueError( +++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++++ ) +++++ attn_mask_for_fa = attention_mask < 0 +++++ else: +++++ attn_mask_for_fa = None +++++ +++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++++ +++++ # Call the fused operator using the efficient BNSD layout +++++ attn_output = mindspore.ops.flash_attention_score( +++++ query=query_states, +++++ key=key_states, +++++ value=value_states, +++++ head_num=self.num_heads, +++++ input_layout='BNSD', # Specify the correct layout +++++ attn_mask=attn_mask_for_fa, +++++ keep_prob=keep_prob, +++++ scalar_value=1.0 / math.sqrt(self.head_dim) +++++ ) +++++ +++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ +++++ # Apply output projection +++++ attn_output = self.o_proj(attn_output) +++++ +++++ # Flash attention does not return attention weights, so we return None. +++++ attn_weights = None +++++ +++++ return attn_output, attn_weights, past_key_value +++++ ++++ Deepseek_ATTENTION_CLASSES = { ++++ "eager": DeepseekAttention, +++++ "flash-attention": DeepseekFlashAttention, ++++ } ++++ ++++ ++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): ++++ config=config, layer_idx=layer_idx ++++ ) ++++ +++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +++++ config=config, layer_idx=layer_idx +++++ ) +++++ ++++ self.mlp = ( ++++ DeepseekMoE(config) ++++ if ( ++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++index d4c6b651..bced285c 100644 ++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union ++++ ++++ import mindspore ++++ import mindnlp.core.nn.functional as F ++++-from mindnlp.core import nn, ops +++++from mindnlp.core import nn, ops, no_grad ++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss ++++ ++++ from ....common.activations import ACT2FN ++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) ++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++++ +++++Long_Prompt = False +++++PROMPT_LENGTH_THRESHOLD = 128 ++++ ++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++++ def _prepare_4d_causal_attention_mask_with_cache_position( ++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): ++++ return attn_output, attn_weights, past_key_value ++++ ++++ +++++# class Qwen2MoeFlashAttention(nn.Module): +++++# """ +++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++ +++++# 关键改动: +++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++# 直接传入原始的 key 和 value 张量效率更高。 +++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++# super().__init__() +++++# self.config = config +++++# self.layer_idx = layer_idx +++++# self.hidden_size = config.hidden_size +++++# self.num_heads = config.num_attention_heads +++++# self.head_dim = self.hidden_size // self.num_heads +++++# self.num_key_value_heads = config.num_key_value_heads +++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++# self.max_position_embeddings = config.max_position_embeddings +++++# self.rope_theta = config.rope_theta +++++# self.attention_dropout = config.attention_dropout +++++ +++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++# raise ValueError( +++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++# ) +++++ +++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++ +++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++# self.head_dim, +++++# max_position_embeddings=self.max_position_embeddings, +++++# base=self.rope_theta, +++++# ) +++++ +++++# def forward( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# attention_mask: Optional[mindspore.Tensor] = None, +++++# position_ids: Optional[mindspore.Tensor] = None, +++++# past_key_value: Optional[Cache] = None, +++++# output_attentions: bool = False, +++++# use_cache: bool = False, +++++# cache_position: Optional[mindspore.Tensor] = None, +++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++# bsz, q_len, _ = hidden_states.shape +++++ +++++# # 1. 线性投射 Q, K, V +++++# query_states = self.q_proj(hidden_states) +++++# key_states = self.k_proj(hidden_states) +++++# value_states = self.v_proj(hidden_states) +++++ +++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++# # query: [B, S, H*D] -> [B, N1, S, D] +++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++# # 3. RoPE 旋转位置编码 +++++# kv_seq_len = key_states.shape[-2] +++++# if past_key_value is not None: +++++# if self.layer_idx is None: +++++# raise ValueError( +++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++# "with a layer index." +++++# ) +++++# # 对于 StaticCache,需要特殊处理 kv_seq_len +++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++# if cache_position.shape[0] == 1: +++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++# kv_seq_len = past_seen_tokens + 1 +++++# else: +++++# # prefill 阶段:cache_position 是范围,使用其长度 +++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++# else: +++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++# # 4. KV 缓存更新 +++++# if past_key_value is not None: +++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++# key_states, value_states = past_key_value.update( +++++# key_states, value_states, self.layer_idx, cache_kwargs +++++# ) +++++ +++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++# if cache_position.shape[0] == 1: +++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++# kv_seq_len = key_states.shape[-2] +++++ +++++# # 5. [重要] 准备 Attention Mask +++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++# fa_attention_mask = None +++++# if attention_mask is not None: +++++# # 截取与当前key长度匹配的部分 +++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++# # 转换为布尔类型: 大负数 -> True, 0 -> False +++++# fa_attention_mask = (mask_slice != 0) +++++ +++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++# input_dtype = query_states.dtype +++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++# query_states = query_states.to(mindspore.float16) +++++# key_states = key_states.to(mindspore.float16) +++++# value_states = value_states.to(mindspore.float16) +++++ +++++# # 6. [核心] 调用 flash_attention_score 算子 +++++# # - 无需手动 repeat_kv, 算子原生支持 GQA +++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++# attn_output = mindspore.ops.flash_attention_score( +++++# query=query_states, +++++# key=key_states, +++++# value=value_states, +++++# head_num=self.num_heads, # 传入Q的头数(N1) +++++# attn_mask=fa_attention_mask, +++++# keep_prob=1.0 - self.attention_dropout, +++++# scalar_value=1.0 / math.sqrt(self.head_dim), +++++# input_layout="BNSD", +++++# sparse_mode=0 # 使用 defaultMask 模式 +++++# ) +++++ +++++# # 恢复原始数据类型 +++++# attn_output = attn_output.to(input_dtype) +++++ +++++# # 7. 调整输出形状 +++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++# attn_output = self.o_proj(attn_output) +++++ +++++# # FlashAttention 算子不直接返回注意力权重矩阵 +++++# attn_weights = None +++++# if output_attentions: +++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++# return attn_output, attn_weights, past_key_value +++++ +++++# # def forward( +++++# # self, +++++# # hidden_states: mindspore.Tensor, +++++# # attention_mask: Optional[mindspore.Tensor] = None, +++++# # position_ids: Optional[mindspore.Tensor] = None, +++++# # past_key_value: Optional[Cache] = None, +++++# # output_attentions: bool = False, +++++# # use_cache: bool = False, +++++# # cache_position: Optional[mindspore.Tensor] = None, +++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++# # bsz, q_len, _ = hidden_states.shape +++++ +++++# # # 1. 线性投射 Q, K, V +++++# # query_states = self.q_proj(hidden_states) +++++# # key_states = self.k_proj(hidden_states) +++++# # value_states = self.v_proj(hidden_states) +++++ +++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ +++++# # # 3. RoPE 旋转位置编码 +++++# # kv_seq_len = key_states.shape[-2] +++++# # if past_key_value is not None: +++++# # if self.layer_idx is None: +++++# # raise ValueError( +++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++# # "with a layer index." +++++# # ) +++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++# # # 4. KV 缓存更新 +++++# # if past_key_value is not None: +++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++# # key_states, value_states = past_key_value.update( +++++# # key_states, value_states, self.layer_idx, cache_kwargs +++++# # ) +++++ +++++# # # 5. 准备 Attention Mask +++++# # fa_attention_mask = None +++++# # if attention_mask is not None: +++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++# # fa_attention_mask = (mask_slice != 0) +++++ +++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++# # input_dtype = query_states.dtype +++++ +++++# # # 6. [核心] 调用 flash_attention_score 算子 +++++# # attn_output = mindspore.ops.flash_attention_score( +++++# # query=query_states, +++++# # key=key_states, +++++# # value=value_states, +++++# # head_num=self.num_heads, +++++# # attn_mask=fa_attention_mask, +++++# # keep_prob=1.0 - self.attention_dropout, +++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++++# # input_layout="BNSD", +++++# # sparse_mode=0, +++++# # # <--- 修改点 2: 启用内部高精度计算 --- +++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++# # inner_precise=1 +++++# # ) +++++ +++++# # # 恢复原始数据类型 +++++# # attn_output = attn_output.to(input_dtype) +++++ +++++# # # 7. 调整输出形状 +++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++# # attn_output = self.o_proj(attn_output) +++++ +++++# # attn_weights = None +++++# # if output_attentions: +++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ +++++# # return attn_output, attn_weights, past_key_value +++++ +++++ ++++ class Qwen2MoeFlashAttention(nn.Module): ++++ """ ++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++- ++++- 关键改动: ++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++- 直接传入原始的 key 和 value 张量效率更高。 ++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +++++ +++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +++++ 完全使用模型的低精度数据类型(如 float16)进行计算, +++++ 以达到理论上的最高执行速度。 ++++ """ ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++ super().__init__() ++++ self.config = config ++++ self.layer_idx = layer_idx +++++ if layer_idx is None: +++++ logger.warning_once( +++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +++++ ) +++++ ++++ self.hidden_size = config.hidden_size ++++ self.num_heads = config.num_attention_heads ++++ self.head_dim = self.hidden_size // self.num_heads ++++ self.num_key_value_heads = config.num_key_value_heads ++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++ self.max_position_embeddings = config.max_position_embeddings ++++ self.rope_theta = config.rope_theta ++++ self.attention_dropout = config.attention_dropout ++++ ++++- if (self.head_dim * self.num_heads) != self.hidden_size: ++++- raise ValueError( ++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++- ) ++++- ++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++- # query: [B, S, H*D] -> [B, N1, S, D] ++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++ # 2. 调整形状以匹配 BNSD 布局 ++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- ++++- # 3. RoPE 旋转位置编码 +++++ +++++ # 3. RoPE 和 KV 缓存 ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++- if self.layer_idx is None: ++++- raise ValueError( ++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++- "with a layer index." ++++- ) ++++- # 对于 StaticCache,需要特殊处理 kv_seq_len ++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++- if cache_position.shape[0] == 1: ++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++- kv_seq_len = past_seen_tokens + 1 ++++- else: ++++- # prefill 阶段:cache_position 是范围,使用其长度 ++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++- else: ++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++- +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++- # 4. KV 缓存更新 ++++ if past_key_value is not None: ++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++- key_states, value_states = past_key_value.update( ++++- key_states, value_states, self.layer_idx, cache_kwargs ++++- ) ++++- ++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++- if cache_position.shape[0] == 1: ++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++- kv_seq_len = key_states.shape[-2] ++++- ++++- # 5. [重要] 准备 Attention Mask ++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++ # 4. 准备 Attention Mask ++++ fa_attention_mask = None ++++ if attention_mask is not None: ++++- # 截取与当前key长度匹配的部分 ++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++- # 转换为布尔类型: 大负数 -> True, 0 -> False ++++ fa_attention_mask = (mask_slice != 0) ++++ ++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++- input_dtype = query_states.dtype ++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++- query_states = query_states.to(mindspore.float16) ++++- key_states = key_states.to(mindspore.float16) ++++- value_states = value_states.to(mindspore.float16) ++++- ++++- # 6. [核心] 调用 flash_attention_score 算子 ++++- # - 无需手动 repeat_kv, 算子原生支持 GQA ++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 ++++ attn_output = mindspore.ops.flash_attention_score( ++++ query=query_states, ++++ key=key_states, ++++ value=value_states, ++++- head_num=self.num_heads, # 传入Q的头数(N1) +++++ head_num=self.num_heads, ++++ attn_mask=fa_attention_mask, ++++- keep_prob=1.0 - self.attention_dropout, +++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout ++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++ input_layout="BNSD", ++++- sparse_mode=0 # 使用 defaultMask 模式 +++++ sparse_mode=0, +++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 ++++ ) ++++ ++++- # 恢复原始数据类型 ++++- attn_output = attn_output.to(input_dtype) ++++- ++++- # 7. 调整输出形状 ++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++ # 6. 调整输出形状 ++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++ attn_output = self.o_proj(attn_output) ++++ ++++- # FlashAttention 算子不直接返回注意力权重矩阵 +++++ # 7. 返回结果 ++++ attn_weights = None ++++ if output_attentions: ++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++- # def forward( ++++- # self, ++++- # hidden_states: mindspore.Tensor, ++++- # attention_mask: Optional[mindspore.Tensor] = None, ++++- # position_ids: Optional[mindspore.Tensor] = None, ++++- # past_key_value: Optional[Cache] = None, ++++- # output_attentions: bool = False, ++++- # use_cache: bool = False, ++++- # cache_position: Optional[mindspore.Tensor] = None, ++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++- ++++- # bsz, q_len, _ = hidden_states.shape ++++- ++++- # # 1. 线性投射 Q, K, V ++++- # query_states = self.q_proj(hidden_states) ++++- # key_states = self.k_proj(hidden_states) ++++- # value_states = self.v_proj(hidden_states) ++++- ++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- ++++- # # 3. RoPE 旋转位置编码 ++++- # kv_seq_len = key_states.shape[-2] ++++- # if past_key_value is not None: ++++- # if self.layer_idx is None: ++++- # raise ValueError( ++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++- # "with a layer index." ++++- # ) ++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++ ++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++- ++++- # # 4. KV 缓存更新 ++++- # if past_key_value is not None: ++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++- # key_states, value_states = past_key_value.update( ++++- # key_states, value_states, self.layer_idx, cache_kwargs ++++- # ) ++++- ++++- # # 5. 准备 Attention Mask ++++- # fa_attention_mask = None ++++- # if attention_mask is not None: ++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++- # fa_attention_mask = (mask_slice != 0) ++++- ++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++- # input_dtype = query_states.dtype ++++- ++++- # # 6. [核心] 调用 flash_attention_score 算子 ++++- # attn_output = mindspore.ops.flash_attention_score( ++++- # query=query_states, ++++- # key=key_states, ++++- # value=value_states, ++++- # head_num=self.num_heads, ++++- # attn_mask=fa_attention_mask, ++++- # keep_prob=1.0 - self.attention_dropout, ++++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++++- # input_layout="BNSD", ++++- # sparse_mode=0, ++++- # # <--- 修改点 2: 启用内部高精度计算 --- ++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++- # inner_precise=1 ++++- # ) ++++- ++++- # # 恢复原始数据类型 ++++- # attn_output = attn_output.to(input_dtype) +++++QWEN2MOE_ATTENTION_CLASSES = { +++++ "eager": Qwen2MoeAttention, +++++ "flash-attention": Qwen2MoeFlashAttention, +++++} ++++ ++++- # # 7. 调整输出形状 ++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++- # attn_output = self.o_proj(attn_output) ++++ ++++- # attn_weights = None ++++- # if output_attentions: ++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# def __init__(self, config): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# # gating +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++ +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# #@dwj +++++# # 只遍历激活的专家,而非全部专家 +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# num_tokens = hidden_states_reshaped.shape[0] +++++ +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++# flat_selected_experts = selected_experts.flatten() +++++ +++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++# token_indices = broadcasted_token_indices.flatten() +++++ +++++# active_experts = ops.unique(flat_selected_experts) +++++ +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++ +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# selected_token_indices = token_indices[mask] +++++# selected_routing_weights = routing_weights.flatten()[mask] +++++ +++++# current_states = hidden_states_reshaped[selected_token_indices] +++++ +++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++ +++++# final_hidden_states = final_hidden_states.index_add( +++++# dim=0, +++++# index=selected_token_indices, +++++# source=expert_output.to(hidden_states.dtype) +++++# ) +++++ +++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++ ++++- # return attn_output, attn_weights, past_key_value +++++# final_hidden_states = final_hidden_states + shared_expert_output +++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ +++++ +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# """ +++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++++# `_moe_infer_prefill` (用于长序列处理) 方法。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# # 门控网络 +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# # 专家列表 +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++# # 共享专家 +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# @no_grad() +++++# def _moe_infer_decode( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# """ +++++# 【解码路径】针对 sequence_length=1 的极致优化。 +++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++++# """ +++++# batch_size, hidden_dim = hidden_states.shape +++++ +++++# expert_outputs_list = [ +++++# ops.cat([ +++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++# ], dim=0) +++++# for i in range(batch_size) +++++# ] +++++ +++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++++# # shape: (batch_size, top_k, hidden_dim) +++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++ +++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++ +++++# return moe_output.squeeze(1) +++++ +++++# @no_grad() +++++# def _moe_infer_prefill( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# """ +++++# 【预填充路径】针对 sequence_length > 1 的优化。 +++++# 按专家对 Token 进行分组,并进行批处理。 +++++# """ +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens = hidden_states.shape[0] +++++# flat_selected_experts = selected_experts.flatten() +++++ +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++ +++++# active_experts = ops.unique(flat_selected_experts) +++++ +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++ +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# selected_token_indices = token_indices[mask] +++++# selected_routing_weights = routing_weights.flatten()[mask] +++++ +++++# current_states = hidden_states[selected_token_indices] +++++ +++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++ +++++# moe_output = moe_output.index_add( +++++# dim=0, +++++# index=selected_token_indices, +++++# source=expert_output.to(hidden_states.dtype) +++++# ) +++++# return moe_output +++++ +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# """ +++++# 顶层 forward 方法,作为智能分发器。 +++++# """ +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++- # def forward( ++++- # self, ++++- # hidden_states: mindspore.Tensor, ++++- # attention_mask: Optional[mindspore.Tensor] = None, ++++- # position_ids: Optional[mindspore.Tensor] = None, ++++- # past_key_value: Optional[Cache] = None, ++++- # output_attentions: bool = False, ++++- # use_cache: bool = False, ++++- # cache_position: Optional[mindspore.Tensor] = None, ++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++- ++++- # bsz, q_len, _ = hidden_states.shape ++++- ++++- # query_states = self.q_proj(hidden_states) ++++- # key_states = self.k_proj(hidden_states) ++++- # value_states = self.v_proj(hidden_states) ++++- ++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- ++++- # kv_seq_len = key_states.shape[-2] ++++- # if past_key_value is not None: ++++- # if self.layer_idx is None: ++++- # raise ValueError("`layer_idx` must be specified for caching") ++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++- ++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++- ++++- # if past_key_value is not None: ++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++- # key_states, value_states = past_key_value.update( ++++- # key_states, value_states, self.layer_idx, cache_kwargs ++++- # ) +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++# moe_output = None +++++# # 在推理时,根据序列长度选择最优路径 +++++# if not self.training: +++++# if sequence_length == 1: +++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++# else: +++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++# else: +++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++++# raise NotImplementedError("Training path is not implemented.") +++++ +++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++++ +++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++++ +++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ +++++ +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# """ +++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# # 门控网络 +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# # 专家列表 +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++# # 共享专家 +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# @no_grad() +++++# def _moe_infer_decode( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# batch_size, _ = hidden_states.shape +++++# expert_outputs_list = [ +++++# ops.cat([ +++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++# ], dim=0) +++++# for i in range(batch_size) +++++# ] +++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++# return moe_output.squeeze(1) +++++ +++++# @no_grad() +++++# def _moe_infer_prefill( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens = hidden_states.shape[0] +++++# flat_selected_experts = selected_experts.flatten() +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++# active_experts = ops.unique(flat_selected_experts) +++++ +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# selected_token_indices = token_indices[mask] +++++# selected_routing_weights = routing_weights.flatten()[mask] +++++# current_states = hidden_states[selected_token_indices] +++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++# moe_output = moe_output.index_add( +++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++# ) +++++# return moe_output +++++ +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# """ +++++# 顶层 forward 方法,作为智能分发器。 +++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++++# """ +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ +++++# # 1. 门控计算 (通用逻辑) +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++# # 2. 智能分发到最优 MoE 路径 +++++# moe_output = None +++++# if not self.training: +++++# if sequence_length == 1: +++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++# else: +++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++# else: +++++# raise NotImplementedError("Training path is not implemented.") +++++ +++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++ +++++# # 4. 合并 MoE 输出和共享专家输出 +++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++ +++++# # 5. 恢复原始形状并返回 +++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ +++++# prefill fastest +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# """ +++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# # 门控网络 +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# # 专家列表 +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++# # 共享专家 +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# @no_grad() +++++# def _moe_infer_dispatch( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# """ +++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++++# """ +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens, _ = hidden_states.shape +++++ +++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++++# flat_selected_experts = selected_experts.flatten() +++++# flat_routing_weights = routing_weights.flatten() ++++ ++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++- ++++- # # <--- 核心修改点: 手动进行高精度缩放 --- ++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++- # query_states = query_states / math.sqrt(self.head_dim) ++++- # # <--- 修改结束 --- ++++- ++++- # fa_attention_mask = None ++++- # if attention_mask is not None: ++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++- # fa_attention_mask = (mask_slice != 0) ++++- ++++- # input_dtype = query_states.dtype ++++- ++++- # attn_output = mindspore.ops.flash_attention_score( ++++- # query=query_states, # 传入已经预先缩放过的 query ++++- # key=key_states, ++++- # value=value_states, ++++- # head_num=self.num_heads, ++++- # attn_mask=fa_attention_mask, ++++- # keep_prob=1.0 - self.attention_dropout, ++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++- # input_layout="BNSD", ++++- # sparse_mode=0, ++++- # inner_precise=1 # 仍然保持内部高精度计算 ++++- # ) +++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++ ++++- # attn_output = attn_output.to(input_dtype) ++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++- # attn_output = self.o_proj(attn_output) +++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++++# active_experts = ops.unique(flat_selected_experts) +++++ +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++ +++++# # 找到所有分配给该专家的 token +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++ +++++# # 使用 mask 选取对应的 token 和权重 +++++# current_token_indices = token_indices[mask] +++++# current_routing_weights = flat_routing_weights[mask] +++++# current_hidden_states = hidden_states[current_token_indices] +++++ +++++# # 对这些 token 进行批处理 +++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++ +++++# # 使用 index_add 将结果精确地加回到对应位置 +++++# moe_output = moe_output.index_add( +++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++++# ) +++++# return moe_output +++++ +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# """ +++++# 顶层 forward 方法,作为智能分发器。 +++++# """ +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ +++++# # 1. 门控计算 +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++ +++++# # 2. 调用统一的 MoE 计算内核 +++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++++ ++++- # attn_weights = None ++++- # if output_attentions: ++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++# # 3. 统一处理共享专家 +++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++ +++++# # 4. 合并输出 +++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++ +++++# # 5. 恢复原始形状并返回 +++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ +++++ +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# """ +++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++# 【最终高性能与高精度版】: +++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++++# 3. 这样实现了速度和准确性的两全其美。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# @no_grad() +++++# def _moe_infer_decode( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# """ +++++# 【解码路径】极致优化版:bmm + 高精度累加。 +++++# """ +++++# original_dtype = hidden_states.dtype +++++# batch_size, _ = hidden_states.shape +++++ +++++# expert_outputs_list = [ +++++# ops.cat([ +++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++# ], dim=0) +++++# for i in range(batch_size) +++++# ] +++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++ +++++# # 在 float32 下执行 bmm,得到高精度结果 +++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++ +++++# # 将高精度结果转换回原始数据类型 +++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++++ +++++# return moe_output +++++ +++++# @no_grad() +++++# def _moe_infer_prefill( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# selected_experts: mindspore.Tensor, +++++# routing_weights: mindspore.Tensor +++++# ) -> mindspore.Tensor: +++++# """ +++++# 【预填充路径】与原始实现一致,结果精确。 +++++# """ +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens, _ = hidden_states.shape +++++# flat_selected_experts = selected_experts.flatten() +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++# active_experts = ops.unique(flat_selected_experts) +++++ +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# selected_token_indices = token_indices[mask] +++++# selected_routing_weights = routing_weights.flatten()[mask] +++++# current_states = hidden_states[selected_token_indices] +++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++# moe_output = moe_output.index_add( +++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++# ) +++++# return moe_output +++++ +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++ ++++- # return attn_output, attn_weights, past_key_value +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++++# # 如果模型主体是 float16,后续再转换 +++++ +++++# moe_output = None +++++# if not self.training: +++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++++# # _moe_infer_decode 内部会处理好类型转换 +++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++++# if sequence_length == 1: +++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++# else: +++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++# else: +++++# raise NotImplementedError("Training path is not implemented.") +++++ +++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++ +++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ ++++ ++++-QWEN2MOE_ATTENTION_CLASSES = { ++++- "eager": Qwen2MoeAttention, ++++- "flash-attention": Qwen2MoeFlashAttention, ++++-} +++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++# """ +++++# 【融合版】一个混合专家模块,内置两种推理策略, +++++# 由外部全局变量 `Long_Prompt` 控制: +++++ +++++# - if Long_Prompt is True: 【精度优先模式】 +++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++++# 适用于处理长序列,避免误差累积。 +++++ +++++# - if Long_Prompt is False: 【速度优先模式】 +++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++++# 在解码阶段获得极致速度,同时保证结果高度准确。 +++++# """ +++++# def __init__(self, config: Qwen2MoeConfig): +++++# super().__init__() +++++# self.num_experts = config.num_experts +++++# self.top_k = config.num_experts_per_tok +++++# self.norm_topk_prob = config.norm_topk_prob +++++ +++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++# self.experts = nn.ModuleList( +++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++# ) +++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++# # --- 速度优先模式的辅助函数 --- +++++# @no_grad() +++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++# original_dtype = hidden_states.dtype +++++# batch_size, _ = hidden_states.shape +++++# expert_outputs_list = [ +++++# ops.cat([ +++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++# ], dim=0) +++++# for i in range(batch_size) +++++# ] +++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++# weights_fp32 = routing_weights.to(mindspore.float32) +++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++# return moe_output_fp32.squeeze(1).to(original_dtype) +++++ +++++# @no_grad() +++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens, _ = hidden_states.shape +++++# flat_selected_experts = selected_experts.flatten() +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++# active_experts = ops.unique(flat_selected_experts) +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# selected_token_indices = token_indices[mask] +++++# selected_routing_weights = routing_weights.flatten()[mask] +++++# current_states = hidden_states[selected_token_indices] +++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++++# return moe_output +++++ +++++# # --- 精度优先模式的辅助函数 --- +++++# @no_grad() +++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++# moe_output = ops.zeros_like(hidden_states) +++++# num_tokens, _ = hidden_states.shape +++++# flat_selected_experts = selected_experts.flatten() +++++# flat_routing_weights = routing_weights.flatten() +++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++# active_experts = ops.unique(flat_selected_experts) +++++# for expert_idx_tensor in active_experts: +++++# expert_idx = expert_idx_tensor.item() +++++# expert_layer = self.experts[expert_idx] +++++# mask = (flat_selected_experts == expert_idx_tensor) +++++# current_token_indices = token_indices[mask] +++++# current_routing_weights = flat_routing_weights[mask] +++++# current_hidden_states = hidden_states[current_token_indices] +++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++# return moe_output +++++ +++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++# # 声明我们将要使用一个在模块外部定义的全局变量 +++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++++# global Long_Prompt +++++ +++++# # 1. 门控计算 (所有模式通用) +++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++# router_logits = self.gate(hidden_states_reshaped) +++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++# if self.norm_topk_prob: +++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++# moe_output = None +++++# if not self.training: +++++# # 根据 Long_Prompt 标志选择模式 +++++# if Long_Prompt: +++++# # --- 精度优先模式 --- +++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++# else: +++++# # --- 速度优先模式 --- +++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++# if sequence_length == 1: +++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++# else: +++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++# else: +++++# raise NotImplementedError("Training path is not implemented.") +++++ +++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++ +++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++ +++++# return final_hidden_states, router_logits +++++ +++++class Qwen2MoeSparseMoeBlock(nn.Module): +++++ """ +++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++++ 控制的顶级推理策略: ++++ +++++ - if Long_Prompt is True: 【精度优先模式】 +++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +++++ 适用于需要严格可复现性的长序列任务。 ++++ ++++-class Qwen2MoeSparseMoeBlock(nn.Module): ++++- def __init__(self, config): +++++ - if Long_Prompt is False: 【速度优先模式】 +++++ 采用业界最强的性能组合: +++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +++++ """ +++++ def __init__(self, config: Qwen2MoeConfig): ++++ super().__init__() ++++ self.num_experts = config.num_experts ++++ self.top_k = config.num_experts_per_tok ++++ self.norm_topk_prob = config.norm_topk_prob ++++ ++++- # gating ++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++ self.experts = nn.ModuleList( ++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++ ) ++++- ++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++ ++++- #@dwj ++++- # 只遍历激活的专家,而非全部专家 ++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++- num_tokens = hidden_states_reshaped.shape[0] ++++- ++++- router_logits = self.gate(hidden_states_reshaped) ++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- ++++- if self.norm_topk_prob: ++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++- flat_selected_experts = selected_experts.flatten() ++++- ++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++- token_indices = broadcasted_token_indices.flatten() ++++- ++++- active_experts = ops.unique(flat_selected_experts) ++++- ++++- for expert_idx_tensor in active_experts: ++++- expert_idx = expert_idx_tensor.item() ++++- expert_layer = self.experts[expert_idx] ++++- ++++- mask = (flat_selected_experts == expert_idx_tensor) ++++- selected_token_indices = token_indices[mask] ++++- selected_routing_weights = routing_weights.flatten()[mask] ++++- ++++- current_states = hidden_states_reshaped[selected_token_indices] ++++- ++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++- ++++- final_hidden_states = final_hidden_states.index_add( ++++- dim=0, ++++- index=selected_token_indices, ++++- source=expert_output.to(hidden_states.dtype) ++++- ) ++++- ++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +++++ @no_grad() +++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++ original_dtype = hidden_states.dtype +++++ batch_size, _ = hidden_states.shape +++++ expert_outputs_list = [ +++++ ops.cat([ +++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++ ], dim=0) +++++ for i in range(batch_size) +++++ ] +++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++ weights_fp32 = routing_weights.to(mindspore.float32) +++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++ return moe_output_fp32.squeeze(1).to(original_dtype) +++++ +++++ @no_grad() +++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++ num_tokens, _ = hidden_states.shape +++++ flat_selected_experts = selected_experts.flatten() +++++ sorted_expert_indices = flat_selected_experts.argsort() +++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++ original_token_indices = sorted_expert_indices // self.top_k +++++ moe_output = ops.zeros_like(hidden_states) +++++ current_token_offset = 0 +++++ for i in range(self.num_experts): +++++ expert_token_count = tokens_per_expert[i] - current_token_offset +++++ if expert_token_count == 0: +++++ continue +++++ end_offset = current_token_offset + expert_token_count +++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++ expert_hidden_states = hidden_states[expert_original_token_indices] +++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++ current_token_offset += expert_token_count +++++ return moe_output +++++ +++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++++ @no_grad() +++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++ moe_output = ops.zeros_like(hidden_states) +++++ num_tokens, _ = hidden_states.shape +++++ flat_selected_experts = selected_experts.flatten() +++++ flat_routing_weights = routing_weights.flatten() +++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++ active_experts = ops.unique(flat_selected_experts) +++++ for expert_idx_tensor in active_experts: +++++ expert_idx = expert_idx_tensor.item() +++++ expert_layer = self.experts[expert_idx] +++++ mask = (flat_selected_experts == expert_idx_tensor) +++++ current_token_indices = token_indices[mask] +++++ current_routing_weights = flat_routing_weights[mask] +++++ current_hidden_states = hidden_states[current_token_indices] +++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++ return moe_output ++++ ++++- final_hidden_states = final_hidden_states + shared_expert_output ++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++- ++++- return final_hidden_states, router_logits +++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++ global Long_Prompt +++++ +++++ # 1. 门控计算 (所有模式通用) +++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++ router_logits = self.gate(hidden_states_reshaped) +++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++ if self.norm_topk_prob: +++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++ moe_output = None +++++ if Long_Prompt: +++++ # --- 精度优先模式 (ACCURACY MODE) --- +++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ else: +++++ # --- 速度优先模式 (SPEED MODE) --- +++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++ if sequence_length == 1: +++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ else: +++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ ++++ +++++ # 3. 共享专家计算与合并 (所有模式通用) +++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++ +++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++ +++++ return final_hidden_states, router_logits ++++ ++++ class Qwen2MoeDecoderLayer(nn.Module): ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++++ super().__init__() ++++ self.hidden_size = config.hidden_size +++++ +++++ # if Long_Prompt: +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ # else: +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++ ++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ ++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++- ++++ if (layer_idx not in config.mlp_only_layers) and ( ++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++ ): ++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ self._warmed_up = True ++++ self.warmup_moe_model() ++++ +++++ +++++ ++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++ output_router_logits = ( ++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits ++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ router_logits=outputs.router_logits, ++++ ) ++++ +++++ def generate(self, *args, **kwargs): +++++ """ +++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++++ """ +++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++++ +++++ input_ids = kwargs.get("input_ids") +++++ if input_ids is None and args: +++++ input_ids = args[0] +++++ +++++ if input_ids is not None: +++++ prompt_length = input_ids.shape[1] +++++ +++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +++++ Long_Prompt = True +++++ else: +++++ Long_Prompt = False +++++ +++++ return super().generate(*args, **kwargs) +++++ ++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation ++++ def prepare_inputs_for_generation( ++++ self, ++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens ++++ # Exception 1: when passing input_embeds, input_ids may be missing entries ++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +++++ ++++ if past_key_values is not None: ++++ if inputs_embeds is not None: # Exception 1 ++++ if 0 not in input_ids.shape: ++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ } ++++ ) ++++ return model_inputs +++++ ++++ # @lwx ++++ # def _decode_one_tokens_logits( ++++ # self, ++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): ++++ attentions=outputs.attentions, ++++ ) ++++ +++++ ++++ __all__ = [ ++++ "Qwen2MoeForCausalLM", ++++ "Qwen2MoeModel", ++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++new file mode 100644 ++++index 00000000..6dfb5b93 ++++--- /dev/null +++++++ b/patches/0001-20251104commit.patch ++++@@ -0,0 +1,1272 @@ +++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++Subject: [PATCH] 20251104commit +++++ +++++--- +++++ mindnlp/transformers/cache_utils.py | 28 +- +++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++ 3 files changed, 976 insertions(+), 87 deletions(-) +++++ +++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++index cadd2e04..02f8d4be 100644 +++++--- a/mindnlp/transformers/cache_utils.py ++++++++ b/mindnlp/transformers/cache_utils.py +++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++ # k_out[:, :, cache_position] = key_states +++++ # v_out[:, :, cache_position] = value_states +++++- if ON_ORANGE_PI: +++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++- else: +++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++- ++++++ # if ON_ORANGE_PI: ++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++ # else: ++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++++ if cache_position.ndim > 1: ++++++ cache_position = cache_position.flatten() ++++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++++ cache_position = cache_position.int() ++++++ ++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++++ k_out[:, :, cache_position] = key_states ++++++ v_out[:, :, cache_position] = value_states ++++++ +++++ return k_out, v_out +++++ +++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index c695b944..d8303e45 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++- x1 = x[..., : x.shape[-1] // 2] +++++- x2 = x[..., x.shape[-1] // 2 :] ++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++ # x2 = x[..., x.shape[-1] // 2 :] ++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++ if self.training: +++++ raise NotImplementedError("Training is not supported yet.") +++++ else: +++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++- if self.config.n_shared_experts is not None: +++++- y = y + self.shared_experts(identity) +++++- return y ++++++ # @lwx ++++++ if orig_shape[1] == 1: ++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++++ y=y.view(*orig_shape) ++++++ if self.config.n_shared_experts is not None: ++++++ y = y + self.shared_experts(identity) ++++++ return y ++++++ else: ++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++++ if self.config.n_shared_experts is not None: ++++++ y = y + self.shared_experts(identity) ++++++ return y ++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++ # if self.config.n_shared_experts is not None: ++++++ # y = y + self.shared_experts(identity) ++++++ # return y ++++++ ++++++ @no_grad() ++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ ++++++ expert_cache = ops.zeros_like(x) ++++++ for i in range(self.num_experts_per_tok): ++++++ expert_id = flat_expert_indices[i].item() ++++++ weight = flat_expert_weights[i].item() ++++++ expert = self.experts[expert_id] ++++++ expert_out = expert(x) ++++++ expert_cache += expert_out * weight ++++++ return expert_cache +++++ +++++ @no_grad() +++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++- # expert_cache = torch.zeros_like(x) +++++- # idxs = flat_expert_indices.argsort() +++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++- # token_idxs = idxs // self.num_experts_per_tok +++++- # for i, end_idx in enumerate(tokens_per_expert): +++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++- # if start_idx == end_idx: +++++- # continue +++++- # expert = self.experts[i] +++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++- # expert_tokens = x[exp_token_idx] +++++- # expert_out = expert(expert_tokens) +++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++- # return expert_cache ++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++ expert_cache = ops.zeros_like(x) +++++ idxs = flat_expert_indices.argsort() +++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ token_idxs = idxs // self.num_experts_per_tok ++++++ +++++ for i, end_idx in enumerate(tokens_per_expert): +++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ if start_idx == end_idx: +++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++ expert_out = expert(expert_tokens) +++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ +++++ return expert_cache ++++++ ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # # expert_cache = torch.zeros_like(x) ++++++ # # idxs = flat_expert_indices.argsort() ++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++ # # if start_idx == end_idx: ++++++ # # continue ++++++ # # expert = self.experts[i] ++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # # expert_tokens = x[exp_token_idx] ++++++ # # expert_out = expert(expert_tokens) ++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++ # # return expert_cache ++++++ # expert_cache = ops.zeros_like(x) ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # if start_idx == end_idx: ++++++ # continue ++++++ # expert = self.experts[i] ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = expert(expert_tokens) ++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ ++++++ # return expert_cache ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # expert_cache = ops.zeros_like(x) ++++++ ++++++ # # 排序保证顺序一致 ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # # 找出有 token 的专家 ++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++ ++++++ # for i in active_experts.tolist(): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # end_idx = tokens_per_expert[i] ++++++ # if start_idx == end_idx: # 没有 token ++++++ # continue ++++++ ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = self.experts[i](expert_tokens) ++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++ ++++++ # expert_cache = mindspore.mint.scatter_add( ++++++ # expert_cache, ++++++ # 0, ++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++ # expert_out ++++++ # ) ++++++ ++++++ # return expert_cache ++++++ ++++++ +++++ +++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++ # """ +++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ +++++ # Initialize weights and apply final processing +++++ self.post_init() ++++++ self.warm_up = False ++++++ ++++++ def warmup_moe_model_deep(self): ++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++ test_texts = [ ++++++ "warmup short", ++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++++ ] ++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++ if tokenizer is None: ++++++ from mindnlp.transformers import AutoTokenizer ++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++ self._warmup_tokenizer = tokenizer ++++++ ++++++ for text in test_texts: ++++++ inputs = tokenizer(text, return_tensors="ms") ++++++ with mindspore._no_grad(): ++++++ _ = self(**inputs, use_cache=False) ++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++ +++++ def get_input_embeddings(self): +++++ return self.model.embed_tokens +++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++ ```""" ++++++ if not self.warm_up: ++++++ self.warm_up = True ++++++ self.warmup_moe_model_deep() ++++++ +++++ output_attentions = ( +++++ output_attentions +++++ if output_attentions is not None +++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++index 3cbf820e..d4c6b651 100644 +++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++@@ -18,7 +18,6 @@ +++++ # See the License for the specific language governing permissions and +++++ # limitations under the License. +++++ """MindSpore Qwen2MoE model.""" +++++- +++++ import math +++++ from typing import List, Optional, Tuple, Union +++++ +++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++ TokenClassifierOutput, +++++ ) +++++ from ...modeling_utils import PreTrainedModel ++++++from ...generation import GenerationMixin +++++ from ....utils import logging +++++ from .configuration_qwen2_moe import Qwen2MoeConfig +++++ +++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++ self.variance_epsilon = eps +++++ +++++ def forward(self, hidden_states): ++++++ # @dwj ++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++ # @lwx ++++++ # if not self.training : ++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++ input_dtype = hidden_states.dtype +++++ hidden_states = hidden_states.to(mindspore.float32) +++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++@@ -234,6 +239,8 @@ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++ x1 = x[..., : x.shape[-1] // 2] +++++ x2 = x[..., x.shape[-1] // 2 :] ++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++ self.config = config +++++ self.hidden_size = config.hidden_size +++++ self.intermediate_size = intermediate_size ++++++ +++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++ self.act_fn = ACT2FN[config.hidden_act] +++++ +++++ def forward(self, x): +++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++- +++++ ++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++ # @lwx ++++++ # gate_up_output = self.gate_up_proj(x) ++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++++ # return self.down_proj(swiglu_output) ++++++ ++++++ # def forward(self, x): ++++++ # gate_proj_out = self.gate_proj(x) ++++++ # up_proj_out = self.up_proj(x) ++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++++ # return self.down_proj(swiglu_out) ++++++ +++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++ """ +++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++ use_cache: bool = False, +++++ cache_position: Optional[mindspore.Tensor] = None, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ ++++++ +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ query_states = self.q_proj(hidden_states) +++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ "with a layer index." +++++ ) +++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ if isinstance(past_key_value, StaticCache): ++++++ kv_seq_len = key_states.shape[-2] ++++++ else: ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++ if isinstance(past_key_value, StaticCache): ++++++ kv_seq_len = key_states.shape[-2] +++++ +++++ # repeat k/v heads if n_kv_heads < n_heads +++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++- ++++++ +++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++ +++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++- raise ValueError( +++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++- f" {attn_weights.shape}" +++++- ) +++++- +++++- if attention_mask is not None: # no matter the length, we just slice it +++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++++ if attention_mask is not None: ++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++ attn_weights = attn_weights + causal_mask +++++ +++++ # upcast attention to fp32 +++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++ +++++ attn_output = self.o_proj(attn_output) +++++- ++++++ # @lwx ++++++ ++++++ # max_seq_len = self.max_position_embeddings # 2048 ++++++ ++++++ # if attention_mask is not None: ++++++ # # attention_mask: [B, 1, Sq, Sk] ++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++ ++++++ # # pad 到 [max_seq_len, max_seq_len] ++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++ # global_attention_mask = padded_mask ++++++ # else: ++++++ # global_attention_mask = None ++++++ ++++++ ++++++ # sparse_mode=3 ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # real_shift=None, ++++++ # padding_mask=None, ++++++ ++++++ # head_num=self.num_heads, ++++++ # attn_mask=global_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ # input_layout="BNSD", ++++++ # pre_tokens=2147483647, ++++++ # next_tokens=2147483647, ++++++ # inner_precise=0, ++++++ # drop_mask=None, ++++++ # prefix=None, ++++++ # actual_seq_qlen=None, ++++++ # actual_seq_kvlen=None, ++++++ # sparse_mode=sparse_mode, ++++++ # ) +++++ if not output_attentions: +++++ attn_weights = None +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++ ++++++class Qwen2MoeFlashAttention(nn.Module): ++++++ """ ++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++ ++++++ 关键改动: ++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++ 直接传入原始的 key 和 value 张量效率更高。 ++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++ """ ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++ super().__init__() ++++++ self.config = config ++++++ self.layer_idx = layer_idx ++++++ self.hidden_size = config.hidden_size ++++++ self.num_heads = config.num_attention_heads ++++++ self.head_dim = self.hidden_size // self.num_heads ++++++ self.num_key_value_heads = config.num_key_value_heads ++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++ self.max_position_embeddings = config.max_position_embeddings ++++++ self.rope_theta = config.rope_theta ++++++ self.attention_dropout = config.attention_dropout ++++++ ++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++ raise ValueError( ++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++ ) ++++++ ++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++ ++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++ self.head_dim, ++++++ max_position_embeddings=self.max_position_embeddings, ++++++ base=self.rope_theta, ++++++ ) ++++++ ++++++ def forward( ++++++ self, ++++++ hidden_states: mindspore.Tensor, ++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++ past_key_value: Optional[Cache] = None, ++++++ output_attentions: bool = False, ++++++ use_cache: bool = False, ++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # 1. 线性投射 Q, K, V ++++++ query_states = self.q_proj(hidden_states) ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # 3. RoPE 旋转位置编码 ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++ if self.layer_idx is None: ++++++ raise ValueError( ++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ "with a layer index." ++++++ ) ++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++ if cache_position.shape[0] == 1: ++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++ kv_seq_len = past_seen_tokens + 1 ++++++ else: ++++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++ else: ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # 4. KV 缓存更新 ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ key_states, value_states = past_key_value.update( ++++++ key_states, value_states, self.layer_idx, cache_kwargs ++++++ ) ++++++ ++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++ if cache_position.shape[0] == 1: ++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++ kv_seq_len = key_states.shape[-2] ++++++ ++++++ # 5. [重要] 准备 Attention Mask ++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++ fa_attention_mask = None ++++++ if attention_mask is not None: ++++++ # 截取与当前key长度匹配的部分 ++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++ fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++ input_dtype = query_states.dtype ++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++ query_states = query_states.to(mindspore.float16) ++++++ key_states = key_states.to(mindspore.float16) ++++++ value_states = value_states.to(mindspore.float16) ++++++ ++++++ # 6. [核心] 调用 flash_attention_score 算子 ++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++ attn_output = mindspore.ops.flash_attention_score( ++++++ query=query_states, ++++++ key=key_states, ++++++ value=value_states, ++++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++++ attn_mask=fa_attention_mask, ++++++ keep_prob=1.0 - self.attention_dropout, ++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ input_layout="BNSD", ++++++ sparse_mode=0 # 使用 defaultMask 模式 ++++++ ) ++++++ ++++++ # 恢复原始数据类型 ++++++ attn_output = attn_output.to(input_dtype) ++++++ ++++++ # 7. 调整输出形状 ++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = self.o_proj(attn_output) ++++++ ++++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++++ attn_weights = None ++++++ if output_attentions: ++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ # def forward( ++++++ # self, ++++++ # hidden_states: mindspore.Tensor, ++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++ # past_key_value: Optional[Cache] = None, ++++++ # output_attentions: bool = False, ++++++ # use_cache: bool = False, ++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ # bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # # 1. 线性投射 Q, K, V ++++++ # query_states = self.q_proj(hidden_states) ++++++ # key_states = self.k_proj(hidden_states) ++++++ # value_states = self.v_proj(hidden_states) ++++++ ++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # # 3. RoPE 旋转位置编码 ++++++ # kv_seq_len = key_states.shape[-2] ++++++ # if past_key_value is not None: ++++++ # if self.layer_idx is None: ++++++ # raise ValueError( ++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ # "with a layer index." ++++++ # ) ++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # # 4. KV 缓存更新 ++++++ # if past_key_value is not None: ++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ # key_states, value_states = past_key_value.update( ++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++ # ) ++++++ ++++++ # # 5. 准备 Attention Mask ++++++ # fa_attention_mask = None ++++++ # if attention_mask is not None: ++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++ # input_dtype = query_states.dtype ++++++ ++++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # head_num=self.num_heads, ++++++ # attn_mask=fa_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ # input_layout="BNSD", ++++++ # sparse_mode=0, ++++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++ # inner_precise=1 ++++++ # ) ++++++ ++++++ # # 恢复原始数据类型 ++++++ # attn_output = attn_output.to(input_dtype) ++++++ ++++++ # # 7. 调整输出形状 ++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ # attn_output = self.o_proj(attn_output) ++++++ ++++++ # attn_weights = None ++++++ # if output_attentions: ++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++ # return attn_output, attn_weights, past_key_value ++++++ ++++++ # def forward( ++++++ # self, ++++++ # hidden_states: mindspore.Tensor, ++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++ # past_key_value: Optional[Cache] = None, ++++++ # output_attentions: bool = False, ++++++ # use_cache: bool = False, ++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ # bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # query_states = self.q_proj(hidden_states) ++++++ # key_states = self.k_proj(hidden_states) ++++++ # value_states = self.v_proj(hidden_states) ++++++ ++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # kv_seq_len = key_states.shape[-2] ++++++ # if past_key_value is not None: ++++++ # if self.layer_idx is None: ++++++ # raise ValueError("`layer_idx` must be specified for caching") ++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # if past_key_value is not None: ++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ # key_states, value_states = past_key_value.update( ++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++ # ) ++++++ ++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++ ++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++ # query_states = query_states / math.sqrt(self.head_dim) ++++++ # # <--- 修改结束 --- ++++++ ++++++ # fa_attention_mask = None ++++++ # if attention_mask is not None: ++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # input_dtype = query_states.dtype ++++++ ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, # 传入已经预先缩放过的 query ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # head_num=self.num_heads, ++++++ # attn_mask=fa_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++ # input_layout="BNSD", ++++++ # sparse_mode=0, ++++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++++ # ) ++++++ ++++++ # attn_output = attn_output.to(input_dtype) ++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ # attn_output = self.o_proj(attn_output) ++++++ ++++++ # attn_weights = None ++++++ # if output_attentions: ++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++ ++++++ # return attn_output, attn_weights, past_key_value ++++++ +++++ QWEN2MOE_ATTENTION_CLASSES = { +++++ "eager": Qwen2MoeAttention, ++++++ "flash-attention": Qwen2MoeFlashAttention, +++++ } +++++ +++++ +++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ ++++++ #@dwj ++++++ # 只遍历激活的专家,而非全部专家 +++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- hidden_states = hidden_states.view(-1, hidden_dim) +++++- # router_logits: (batch * sequence_length, n_experts) +++++- router_logits = self.gate(hidden_states) +++++- +++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- if self.norm_topk_prob: +++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- # we cast back to the input dtype +++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++- final_hidden_states = ops.zeros( +++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++- ) +++++- +++++- # One hot encode the selected experts to create an expert mask +++++- # this will be used to easily index which expert is going to be sollicitated +++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++- +++++- # Loop over all available experts in the model and perform the computation on each expert +++++- for expert_idx in range(self.num_experts): +++++- expert_layer = self.experts[expert_idx] +++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++- +++++- # Index the correct hidden states and compute the expert hidden state for +++++- # the current expert. We need to make sure to multiply the output hidden +++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++- if 0 not in idx.shape: +++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++- +++++- # However `index_add_` only support torch tensors for indexing so we'll use +++++- # the `top_x` tensor here. +++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++- +++++- shared_expert_output = self.shared_expert(hidden_states) +++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++- +++++- final_hidden_states = final_hidden_states + shared_expert_output ++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++ num_tokens = hidden_states_reshaped.shape[0] ++++++ ++++++ router_logits = self.gate(hidden_states_reshaped) ++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++ if self.norm_topk_prob: ++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++ flat_selected_experts = selected_experts.flatten() ++++++ ++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++ token_indices = broadcasted_token_indices.flatten() ++++++ ++++++ active_experts = ops.unique(flat_selected_experts) ++++++ ++++++ for expert_idx_tensor in active_experts: ++++++ expert_idx = expert_idx_tensor.item() ++++++ expert_layer = self.experts[expert_idx] ++++++ ++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++ selected_token_indices = token_indices[mask] ++++++ selected_routing_weights = routing_weights.flatten()[mask] ++++++ ++++++ current_states = hidden_states_reshaped[selected_token_indices] ++++++ ++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++ ++++++ final_hidden_states = final_hidden_states.index_add( ++++++ dim=0, ++++++ index=selected_token_indices, ++++++ source=expert_output.to(hidden_states.dtype) ++++++ ) ++++++ ++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++ +++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++- return final_hidden_states, router_logits ++++++ final_hidden_states = final_hidden_states + shared_expert_output ++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++ ++++++ return final_hidden_states, router_logits +++++ +++++ +++++ class Qwen2MoeDecoderLayer(nn.Module): +++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++ +++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++ +++++ if (layer_idx not in config.mlp_only_layers) and ( +++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++ ): +++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++ _skip_keys_device_placement = "past_key_values" +++++ _supports_cache_class = True ++++++#lwx ++++++ # _supports_static_cache = True +++++ +++++ def _init_weights(self, module): +++++ std = self.config.initializer_range +++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++ return causal_mask +++++ +++++ +++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ _tied_weights_keys = ["lm_head.weight"] +++++ +++++ def __init__(self, config): +++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ self.num_experts_per_tok = config.num_experts_per_tok +++++ # Initialize weights and apply final processing +++++ self.post_init() ++++++ # @lwx ++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++++ # self.generation_config.cache_implementation = "static" ++++++ self._warmed_up = False ++++++ ++++++ def warmup_moe_model(self): ++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++++ test_texts = [ ++++++ "warmup short", ++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++++ ] ++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++ if tokenizer is None: ++++++ from mindnlp.transformers import AutoTokenizer ++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++ self._warmup_tokenizer = tokenizer ++++++ ++++++ for text in test_texts: ++++++ inputs = tokenizer(text, return_tensors="ms") ++++++ with mindspore._no_grad(): ++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++ +++++ def get_input_embeddings(self): +++++ return self.model.embed_tokens +++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++ ```""" ++++++ if not self._warmed_up: ++++++ self._warmed_up = True ++++++ self.warmup_moe_model() +++++ +++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++ output_router_logits = ( +++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ } +++++ ) +++++ return model_inputs ++++++# @lwx ++++++ # def _decode_one_tokens_logits( ++++++ # self, ++++++ # cur_token: mindspore.Tensor, ++++++ # input_pos: Optional[mindspore.Tensor], ++++++ # cache_position: mindspore.Tensor, ++++++ # past_key_values: StaticCache, ++++++ # ) -> mindspore.Tensor: ++++++ # """ ++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++++ ++++++ # Args: ++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++++ # input_pos: 输入位置信息,可选 ++++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++++ ++++++ # Returns: ++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++++ # """ ++++++ # # 调用JIT编译的版本 ++++++ # return self.get_decode_one_tokens_logits( ++++++ # cur_token=cur_token, ++++++ # input_pos=input_pos, ++++++ # cache_position=cache_position, ++++++ # past_key_values=past_key_values, ++++++ # ) ++++++ ++++++ # @mindspore.jit(jit_level='O1') ++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++++ # """ ++++++ # JIT编译的函数,用于高效的单token解码 ++++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++++ ++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++++ # """ ++++++ # outputs = self.model.forward( ++++++ # input_ids=cur_token, ++++++ # position_ids=input_pos, ++++++ # cache_position=cache_position, ++++++ # past_key_values=past_key_values, ++++++ # use_cache=True, ++++++ # return_dict=False, ++++++ # ) ++++++ ++++++ # hidden_states = outputs[0] ++++++ # logits = self.lm_head.forward(hidden_states) ++++++ # logits = logits.float() ++++++ ++++++ # return logits[:, -1, :] ++++++ ++++++ # def _sample( ++++++ # self, ++++++ # input_ids: mindspore.Tensor, ++++++ # logits_processor, ++++++ # stopping_criteria, ++++++ # generation_config, ++++++ # synced_devices: bool, ++++++ # streamer=None, ++++++ # logits_warper=None, ++++++ # **model_kwargs, ++++++ # ): ++++++ # """ ++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++++ # """ ++++++ # from ...generation.logits_process import LogitsProcessorList ++++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++++ # from mindnlp.core import nn, ops, no_grad ++++++ # import numpy as np ++++++ ++++++ # # 检查是否使用 StaticCache ++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++++ # # 否则,直接调用父类方法 ++++++ # past_key_values = model_kwargs.get("past_key_values") ++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++++ ++++++ # if not isinstance(past_key_values, StaticCache): ++++++ # # 不使用 StaticCache,直接调用父类方法 ++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++++ # return super()._sample( ++++++ # input_ids=input_ids, ++++++ # logits_processor=logits_processor, ++++++ # stopping_criteria=stopping_criteria, ++++++ # generation_config=generation_config, ++++++ # synced_devices=synced_devices, ++++++ # streamer=streamer, ++++++ # logits_warper=logits_warper, ++++++ # **model_kwargs, ++++++ # ) ++++++ ++++++ # # 使用 StaticCache,进入自定义循环 ++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++++ # pad_token_id = generation_config._pad_token_tensor ++++++ # output_attentions = generation_config.output_attentions ++++++ # output_hidden_states = generation_config.output_hidden_states ++++++ # output_scores = generation_config.output_scores ++++++ # output_logits = generation_config.output_logits ++++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++++ # max_length = generation_config.max_length ++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++++ # do_sample = generation_config.do_sample ++++++ ++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++++ # raise ValueError( ++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++++ # f"{logits_warper})." ++++++ # ) ++++++ ++++++ # # init attention / hidden states / scores tuples ++++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++++ ++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++++ # encoder_hidden_states = ( ++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++++ # ) ++++++ ++++++ # # keep track of which sequences are already finished ++++++ # batch_size, cur_len = input_ids.shape ++++++ # this_peer_finished = False ++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++++ ++++++ # time_record = [] ++++++ # from ....utils.testing_utils import parse_flag_from_env ++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++++ ++++++ # while self._has_unfinished_sequences( ++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++++ # ): ++++++ # if _record_time: ++++++ # import time as time_module ++++++ # infer_start = time_module.time() ++++++ ++++++ # # prepare model inputs ++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++++ ++++++ # # prepare variable output controls ++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++++ ++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++++ # cur_cache_position = model_inputs.get("cache_position") ++++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++++ # cur_input_ids = model_inputs.get("input_ids") ++++++ ++++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++++ # cur_cache_position is not None and ++++++ # len(cur_cache_position.shape) > 0 and ++++++ # cur_cache_position.shape[0] == 1 and ++++++ # cur_input_ids is not None and ++++++ # cur_input_ids.shape[1] == 1): ++++++ # # 使用 JIT 优化的单 token 解码 ++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++++ # if not hasattr(self, '_jit_used'): ++++++ # self._jit_used = False ++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++++ ++++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++++ # cur_token=cur_input_ids, ++++++ # input_pos=model_inputs.get("position_ids"), ++++++ # cache_position=cur_cache_position, ++++++ # past_key_values=cur_past_key_values, ++++++ # ) ++++++ ++++++ # # 标记已使用JIT(用于后续判断) ++++++ # if not self._jit_used: ++++++ # self._jit_used = True ++++++ ++++++ # # 构造兼容的输出对象 ++++++ # class JitOptimizedOutput: ++++++ # def __init__(self, logits, config): ++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++++ # self.config = config ++++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++++ # self.attentions = None if not config.is_encoder_decoder else None ++++++ # self.cross_attentions = None ++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++++ ++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++++ # else: ++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++++ # outputs = self(**model_inputs, return_dict=True) ++++++ ++++++ # if synced_devices and this_peer_finished: ++++++ # continue ++++++ ++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++++ # next_token_logits = outputs.logits[:, -1, :] ++++++ ++++++ # # pre-process distribution ++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++++ # if do_sample: ++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++++ ++++++ # # Store scores, attentions and hidden_states when required ++++++ # if return_dict_in_generate: ++++++ # if output_scores: ++++++ # scores += (next_token_scores,) ++++++ # if output_logits: ++++++ # raw_logits += (next_token_logits,) ++++++ # if output_attentions: ++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++++ # if self.config.is_encoder_decoder: ++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++++ ++++++ # if output_hidden_states: ++++++ # hidden = ( ++++++ # outputs.decoder_hidden_states ++++++ # if self.config.is_encoder_decoder ++++++ # else outputs.hidden_states ++++++ # ) ++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++++ ++++++ # # token selection ++++++ # if do_sample: ++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++++ # else: ++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++++ ++++++ # # finished sentences should have their next token be a padding token ++++++ # if has_eos_stopping_criteria: ++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++++ ++++++ # # update generated ids, model inputs, and length for next step ++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++++ # if streamer is not None: ++++++ # streamer.put(next_tokens) ++++++ ++++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++++ # outputs, ++++++ # model_kwargs, ++++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++++ # ) ++++++ ++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++++ # cur_len += 1 ++++++ ++++++ # if _record_time: ++++++ # import time as time_module ++++++ # infer_stop = time_module.time() ++++++ # time_record.append(infer_stop - infer_start) ++++++ ++++++ # del outputs ++++++ ++++++ # average_infer_time = None ++++++ # if time_record: ++++++ # if len(time_record) > 1: ++++++ # time_record.pop(0) ++++++ # average_infer_time = sum(time_record) / len(time_record) ++++++ # print(f'average inference time is: {average_infer_time}') ++++++ # print(f'inference time record: {time_record}') ++++++ ++++++ # if streamer is not None: ++++++ # streamer.end() ++++++ ++++++ # # 简单判断:打印是否使用了JIT路径 ++++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++++ # else: ++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++++ ++++++ # if return_dict_in_generate: ++++++ # if self.config.is_encoder_decoder: ++++++ # return GenerateEncoderDecoderOutput( ++++++ # sequences=input_ids, ++++++ # scores=scores, ++++++ # logits=raw_logits, ++++++ # encoder_attentions=encoder_attentions, ++++++ # encoder_hidden_states=encoder_hidden_states, ++++++ # decoder_attentions=decoder_attentions, ++++++ # cross_attentions=cross_attentions, ++++++ # decoder_hidden_states=decoder_hidden_states, ++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++ # average_infer_time=average_infer_time ++++++ # ) ++++++ # else: ++++++ # return GenerateDecoderOnlyOutput( ++++++ # sequences=input_ids, ++++++ # scores=scores, ++++++ # logits=raw_logits, ++++++ # attentions=decoder_attentions, ++++++ # hidden_states=decoder_hidden_states, ++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++ # average_infer_time=average_infer_time ++++++ # ) ++++++ # else: ++++++ # return input_ids ++++++ ++++++ # def _prepare_cache_for_generation( ++++++ # self, ++++++ # generation_config, ++++++ # model_kwargs, ++++++ # assistant_model, ++++++ # batch_size, ++++++ # max_cache_length, ++++++ # ): ++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++++ # generation_config.cache_implementation = "static" ++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++++ ++++++ # if generation_config.cache_implementation == "static": ++++++ # base_required_from_max_length = generation_config.max_length + 1 ++++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++++ # min_cache_size = 50 ++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++++ # else: ++++++ # max_cache_length = max(base_required, min_cache_size) ++++++ ++++++ # original_max_cache_length = max_cache_length ++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++++ # print(f" - final max_cache_length: {max_cache_length}") ++++++ ++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++ # if max_cache_length > self.config.max_position_embeddings: ++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++++ ++++++ # result = super()._prepare_cache_for_generation( ++++++ # generation_config=generation_config, ++++++ # model_kwargs=model_kwargs, ++++++ # assistant_model=assistant_model, ++++++ # batch_size=batch_size, ++++++ # max_cache_length=max_cache_length, ++++++ # ) ++++++ ++++++ # if generation_config.cache_implementation == "static": ++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++++ # created_cache = model_kwargs.get(cache_name) ++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++++ # if created_cache.max_cache_len < generation_config.max_length: ++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++++ ++++++ # return result ++++++ ++++++ ++++++ +++++ +++++ +++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++-- +++++2.27.0 +++++ ++++-- ++++2.27.0 ++++ +++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++new file mode 100644 +++index 00000000..966529e4 +++--- /dev/null ++++++ b/patches/0003-20261106secondcommit.patch +++@@ -0,0 +1,2769 @@ ++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Thu, 6 Nov 2025 14:54:37 +0800 ++++Subject: [PATCH 3/3] 20261106secondcommit ++++ ++++--- ++++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- ++++ patches/0001-20251104commit.patch | 1272 ----------------- ++++ 3 files changed, 528 insertions(+), 2032 deletions(-) ++++ delete mode 100644 patches/0001-20251104commit.patch ++++ ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index 73773c22..2f9192bf 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) ++++ ++++ _CONFIG_FOR_DOC = "DeepseekConfig" ++++ +++++_attn_mask_cache = {} +++++ +++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +++++ q_len = batch_and_seq[1] +++++ kv_len = batch_and_seq[1] + past_key_values_length +++++ key = (batch_and_seq[0], q_len, kv_len) +++++ +++++ if key in _attn_mask_cache: +++++ return _attn_mask_cache[key] +++++ +++++ mask = _prepare_4d_causal_attention_mask( +++++ attention_mask, +++++ batch_and_seq, +++++ inputs_embeds, +++++ past_key_values_length, +++++ ) +++++ _attn_mask_cache[key] = mask +++++ return mask ++++ ++++ def _get_unpad_data(attention_mask): ++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) ++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): ++++ return final_output ++++ ++++ ++++- @no_grad() ++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++- expert_cache = ops.zeros_like(x) ++++- idxs = flat_expert_indices.argsort() ++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++- token_idxs = idxs // self.num_experts_per_tok ++++- ++++- for i, end_idx in enumerate(tokens_per_expert): ++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++- if start_idx == end_idx: ++++- continue ++++- expert = self.experts[i] ++++- exp_token_idx = token_idxs[start_idx:end_idx] ++++- expert_tokens = x[exp_token_idx] ++++- expert_out = expert(expert_tokens) ++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++- ++++- return expert_cache ++++- ++++ # @no_grad() ++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++- # # expert_cache = torch.zeros_like(x) ++++- # # idxs = flat_expert_indices.argsort() ++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++- # # token_idxs = idxs // self.num_experts_per_tok ++++- # # for i, end_idx in enumerate(tokens_per_expert): ++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++- # # if start_idx == end_idx: ++++- # # continue ++++- # # expert = self.experts[i] ++++- # # exp_token_idx = token_idxs[start_idx:end_idx] ++++- # # expert_tokens = x[exp_token_idx] ++++- # # expert_out = expert(expert_tokens) ++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++- # # return expert_cache +++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++ # expert_cache = ops.zeros_like(x) ++++ # idxs = flat_expert_indices.argsort() ++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): ++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++ ++++ # return expert_cache ++++- # @no_grad() ++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++- # expert_cache = ops.zeros_like(x) +++++ +++++ @no_grad() +++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++ """ +++++ 优化版 MoE prefill: +++++ - 批量张量化处理同一个 expert 的所有 token +++++ - 跳过无 token 的专家 +++++ - 保持结果完全一致 +++++ """ +++++ # 初始化输出缓存 +++++ expert_cache = ops.zeros_like(x) ++++ ++++- # # 排序保证顺序一致 ++++- # idxs = flat_expert_indices.argsort() ++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++- # token_idxs = idxs // self.num_experts_per_tok +++++ # 排序(确保 scatter_add 位置对应原逻辑) +++++ idxs = flat_expert_indices.argsort() +++++ sorted_expert_indices = flat_expert_indices[idxs] +++++ sorted_token_indices = idxs // self.num_experts_per_tok ++++ ++++- # # 找出有 token 的专家 ++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++ # 每个 expert 的 token 数 +++++ tokens_per_expert = sorted_expert_indices.bincount() ++++ ++++- # for i in active_experts.tolist(): ++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++- # end_idx = tokens_per_expert[i] ++++- # if start_idx == end_idx: # 没有 token ++++- # continue +++++ # 找出有 token 的专家 +++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++++ ++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++- # expert_tokens = x[exp_token_idx] ++++- # expert_out = self.experts[i](expert_tokens) ++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++ for expert_id in active_experts.tolist(): +++++ # 取该 expert 对应的排序后 token 区间 +++++ start = (tokens_per_expert[:expert_id]).sum().item() +++++ end = start + tokens_per_expert[expert_id].item() ++++ ++++- # expert_cache = mindspore.mint.scatter_add( ++++- # expert_cache, ++++- # 0, ++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++- # expert_out ++++- # ) +++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +++++ expert_tokens = x[token_idx] # 取输入向量 ++++ ++++- # return expert_cache +++++ # 执行专家 MLP +++++ expert_out = self.experts[expert_id](expert_tokens) +++++ +++++ # 按权重缩放 +++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +++++ +++++ # 回写到缓存(等价 scatter_add) +++++ expert_cache = mindspore.mint.scatter_add( +++++ expert_cache, +++++ 0, +++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++ scaled_out +++++ ) +++++ +++++ return expert_cache +++++ +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # # expert_cache = torch.zeros_like(x) +++++ # # idxs = flat_expert_indices.argsort() +++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++ # # token_idxs = idxs // self.num_experts_per_tok +++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++ # # if start_idx == end_idx: +++++ # # continue +++++ # # expert = self.experts[i] +++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # # expert_tokens = x[exp_token_idx] +++++ # # expert_out = expert(expert_tokens) +++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++ # # return expert_cache +++++ # expert_cache = ops.zeros_like(x) +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # for i, end_idx in enumerate(tokens_per_expert): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # if start_idx == end_idx: +++++ # continue +++++ # expert = self.experts[i] +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = expert(expert_tokens) +++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ +++++ # return expert_cache +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++ # expert_cache = ops.zeros_like(x) +++++ +++++ # # 排序保证顺序一致 +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ # token_idxs = idxs // self.num_experts_per_tok +++++ +++++ # # 找出有 token 的专家 +++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++ +++++ # for i in active_experts.tolist(): +++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ # end_idx = tokens_per_expert[i] +++++ # if start_idx == end_idx: # 没有 token +++++ # continue +++++ +++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++ # expert_tokens = x[exp_token_idx] +++++ # expert_out = self.experts[i](expert_tokens) +++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++ +++++ # expert_cache = mindspore.mint.scatter_add( +++++ # expert_cache, +++++ # 0, +++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++ # expert_out +++++ # ) +++++ +++++ # return expert_cache ++++ ++++ ++++ ++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++- ++++ # class DeepseekFlashAttention(nn.Module): ++++ # """ ++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): ++++ ++++ return attn_output, attn_weights, past_key_value ++++ +++++ ++++ Deepseek_ATTENTION_CLASSES = { ++++ "eager": DeepseekAttention, ++++ "flash-attention": DeepseekFlashAttention, ++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): ++++ ) ++++ else: ++++ # 4d mask is passed through the layers ++++- attention_mask = _prepare_4d_causal_attention_mask( +++++ # attention_mask = _prepare_4d_causal_attention_mask( +++++ # attention_mask, +++++ # (batch_size, seq_length), +++++ # inputs_embeds, +++++ # past_key_values_length, +++++ # ) +++++ #@dwj +++++ attention_mask = get_cached_causal_mask( ++++ attention_mask, ++++ (batch_size, seq_length), ++++ inputs_embeds, ++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ # Initialize weights and apply final processing ++++ self.post_init() ++++ self.warm_up = False +++++ #@dwj +++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +++++ self.num_layers, +++++ self.num_attention_heads, +++++ self.head_dim, +++++ batch_size=1, +++++ max_length=self.max_length, +++++ dtype=mindspore.float16 +++++ ) +++++ +++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +++++ key_cache = [] +++++ value_cache = [] +++++ for _ in range(num_layers): +++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++ key_cache.append(k) +++++ value_cache.append(v) +++++ return key_cache, value_cache +++++ ++++ ++++ def warmup_moe_model_deep(self): ++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++index bced285c..ebd7782e 100644 ++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) ++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++++ ++++-Long_Prompt = False ++++-PROMPT_LENGTH_THRESHOLD = 128 +++++Long_Prompt = 1 +++++LONG_PROMPT_LENGTH_THRESHOLD = 128 +++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +++++ +++++_causal_mask_cache = {} +++++ +++++def get_cached_causal_mask_with_cache_position( +++++ attention_mask: mindspore.Tensor, +++++ sequence_length: int, +++++ target_length: int, +++++ dtype: mindspore.dtype, +++++ min_dtype: float, +++++ cache_position: mindspore.Tensor, +++++ batch_size: int, +++++): +++++ """ +++++ 带缓存的 causal mask 构造函数 +++++ """ +++++ # q_len 是当前 query 长度 +++++ q_len = sequence_length +++++ # kv_len 是 target_length +++++ kv_len = target_length +++++ +++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +++++ +++++ if key in _causal_mask_cache: +++++ return _causal_mask_cache[key] +++++ +++++ # 调用原来的 mask 构造逻辑 +++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++ attention_mask, +++++ sequence_length=sequence_length, +++++ target_length=target_length, +++++ dtype=dtype, +++++ min_dtype=min_dtype, +++++ cache_position=cache_position, +++++ batch_size=batch_size, +++++ ) +++++ # 缓存结果 +++++ _causal_mask_cache[key] = causal_mask +++++ return causal_mask ++++ ++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++++ def _prepare_4d_causal_attention_mask_with_cache_position( ++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++ ++++ ++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +++++# class Qwen2MoeAttention(nn.Module): +++++# """ +++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++++# and "Generating Long Sequences with Sparse Transformers". +++++# """ +++++ +++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++# super().__init__() +++++# self.config = config +++++# self.layer_idx = layer_idx +++++# if layer_idx is None: +++++# logger.warning_once( +++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++# "when creating this class." +++++# ) +++++ +++++# self.hidden_size = config.hidden_size +++++# self.num_heads = config.num_attention_heads +++++# self.head_dim = self.hidden_size // self.num_heads +++++# self.num_key_value_heads = config.num_key_value_heads +++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++# self.max_position_embeddings = config.max_position_embeddings +++++# self.rope_theta = config.rope_theta +++++# self.is_causal = True +++++# self.attention_dropout = config.attention_dropout +++++ +++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++# raise ValueError( +++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++# f" and `num_heads`: {self.num_heads})." +++++# ) +++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++ +++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++# self.head_dim, +++++# max_position_embeddings=self.max_position_embeddings, +++++# base=self.rope_theta, +++++# ) +++++ +++++# def forward( +++++# self, +++++# hidden_states: mindspore.Tensor, +++++# attention_mask: Optional[mindspore.Tensor] = None, +++++# position_ids: Optional[mindspore.Tensor] = None, +++++# past_key_value: Optional[Cache] = None, +++++# output_attentions: bool = False, +++++# use_cache: bool = False, +++++# cache_position: Optional[mindspore.Tensor] = None, +++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++ +++++ +++++ +++++# bsz, q_len, _ = hidden_states.shape +++++ +++++# query_states = self.q_proj(hidden_states) +++++# key_states = self.k_proj(hidden_states) +++++# value_states = self.v_proj(hidden_states) +++++ +++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++ +++++# kv_seq_len = key_states.shape[-2] +++++# if past_key_value is not None: +++++# if self.layer_idx is None: +++++# raise ValueError( +++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++# "with a layer index." +++++# ) +++++# if isinstance(past_key_value, StaticCache): +++++# kv_seq_len = key_states.shape[-2] +++++# else: +++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++# if past_key_value is not None: +++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++# if isinstance(past_key_value, StaticCache): +++++# kv_seq_len = key_states.shape[-2] +++++ +++++# # repeat k/v heads if n_kv_heads < n_heads +++++# key_states = repeat_kv(key_states, self.num_key_value_groups) +++++# value_states = repeat_kv(value_states, self.num_key_value_groups) +++++ +++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++ +++++# if attention_mask is not None: +++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++# attn_weights = attn_weights + causal_mask +++++ +++++# # upcast attention to fp32 +++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++# attn_output = ops.matmul(attn_weights, value_states) +++++ +++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++# raise ValueError( +++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++++# f" {attn_output.shape}" +++++# ) +++++ +++++# attn_output = ops.transpose(attn_output, 1, 2) +++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++ +++++# attn_output = self.o_proj(attn_output) +++++# # @lwx +++++ +++++# # max_seq_len = self.max_position_embeddings # 2048 +++++ +++++# # if attention_mask is not None: +++++# # # attention_mask: [B, 1, Sq, Sk] +++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++ +++++# # # pad 到 [max_seq_len, max_seq_len] +++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++# # global_attention_mask = padded_mask +++++# # else: +++++# # global_attention_mask = None +++++ +++++ +++++# # sparse_mode=3 +++++# # attn_output = mindspore.ops.flash_attention_score( +++++# # query=query_states, +++++# # key=key_states, +++++# # value=value_states, +++++# # real_shift=None, +++++# # padding_mask=None, +++++ +++++# # head_num=self.num_heads, +++++# # attn_mask=global_attention_mask, +++++# # keep_prob=1.0 - self.attention_dropout, +++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++++# # input_layout="BNSD", +++++# # pre_tokens=2147483647, +++++# # next_tokens=2147483647, +++++# # inner_precise=0, +++++# # drop_mask=None, +++++# # prefix=None, +++++# # actual_seq_qlen=None, +++++# # actual_seq_kvlen=None, +++++# # sparse_mode=sparse_mode, +++++# # ) +++++# if not output_attentions: +++++# attn_weights = None +++++ +++++# return attn_output, attn_weights, past_key_value +++++ ++++ class Qwen2MoeAttention(nn.Module): ++++ """ ++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++++- and "Generating Long Sequences with Sparse Transformers". ++++- """ +++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 ++++ +++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +++++ +++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +++++ """ ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++ super().__init__() ++++ self.config = config ++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): ++++ if layer_idx is None: ++++ logger.warning_once( ++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++ "when creating this class." ++++ ) ++++ ++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): ++++ use_cache: bool = False, ++++ cache_position: Optional[mindspore.Tensor] = None, ++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++- ++++ ++++- +++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- ++++ bsz, q_len, _ = hidden_states.shape ++++ ++++ query_states = self.q_proj(hidden_states) ++++ key_states = self.k_proj(hidden_states) ++++ value_states = self.v_proj(hidden_states) ++++ ++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++- +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ ++++ kv_seq_len = key_states.shape[-2] ++++ if past_key_value is not None: ++++- if self.layer_idx is None: ++++- raise ValueError( ++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++- "with a layer index." ++++- ) ++++- if isinstance(past_key_value, StaticCache): ++++- kv_seq_len = key_states.shape[-2] ++++- else: ++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ ++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++ ++++ if past_key_value is not None: ++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++ +++++ # --- 2. 动态调度核心注意力计算 --- +++++ global Long_Prompt +++++ if Long_Prompt >= 1: +++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +++++ fa_attention_mask = None +++++ if attention_mask is not None: +++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++ fa_attention_mask = (mask_slice != 0) +++++ +++++ attn_output = mindspore.ops.flash_attention_score( +++++ query=query_states, +++++ key=key_states, +++++ value=value_states, +++++ head_num=self.num_heads, +++++ attn_mask=fa_attention_mask, +++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++ input_layout="BNSD", +++++ sparse_mode=0, +++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +++++ ) ++++ ++++- if isinstance(past_key_value, StaticCache): ++++- kv_seq_len = key_states.shape[-2] +++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ attn_output = self.o_proj(attn_output) +++++ attn_weights = None +++++ if output_attentions: +++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++++ ++++- # repeat k/v heads if n_kv_heads < n_heads ++++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++++- ++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++ else: +++++ # --- Eager Attention 路径 (用于短序列和解码) --- +++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++ +++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++ ++++- if attention_mask is not None: ++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++- attn_weights = attn_weights + causal_mask +++++ if attention_mask is not None: +++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++ attn_weights = attn_weights + causal_mask ++++ ++++- # upcast attention to fp32 ++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++- attn_output = ops.matmul(attn_weights, value_states) +++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++ attn_output = ops.matmul(attn_weights, value_states) ++++ ++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++- raise ValueError( ++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++++- f" {attn_output.shape}" ++++- ) +++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++ raise ValueError( +++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +++++ ) ++++ ++++- attn_output = ops.transpose(attn_output, 1, 2) ++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++ attn_output = ops.transpose(attn_output, 1, 2) +++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++ attn_output = self.o_proj(attn_output) ++++ ++++- attn_output = self.o_proj(attn_output) ++++- # @lwx +++++ if not output_attentions: +++++ attn_weights = None ++++ ++++- # max_seq_len = self.max_position_embeddings # 2048 ++++- ++++- # if attention_mask is not None: ++++- # # attention_mask: [B, 1, Sq, Sk] ++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++- ++++- # # pad 到 [max_seq_len, max_seq_len] ++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++- # global_attention_mask = padded_mask ++++- # else: ++++- # global_attention_mask = None ++++- ++++- ++++- # sparse_mode=3 ++++- # attn_output = mindspore.ops.flash_attention_score( ++++- # query=query_states, ++++- # key=key_states, ++++- # value=value_states, ++++- # real_shift=None, ++++- # padding_mask=None, ++++- ++++- # head_num=self.num_heads, ++++- # attn_mask=global_attention_mask, ++++- # keep_prob=1.0 - self.attention_dropout, ++++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++++- # input_layout="BNSD", ++++- # pre_tokens=2147483647, ++++- # next_tokens=2147483647, ++++- # inner_precise=0, ++++- # drop_mask=None, ++++- # prefix=None, ++++- # actual_seq_qlen=None, ++++- # actual_seq_kvlen=None, ++++- # sparse_mode=sparse_mode, ++++- # ) ++++- if not output_attentions: ++++- attn_weights = None ++++- ++++ return attn_output, attn_weights, past_key_value ++++ ++++- ++++ # class Qwen2MoeFlashAttention(nn.Module): ++++ # """ ++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { ++++ # return final_hidden_states, router_logits ++++ ++++ ++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++-# """ ++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 ++++-# """ ++++-# def __init__(self, config: Qwen2MoeConfig): ++++-# super().__init__() ++++-# self.num_experts = config.num_experts ++++-# self.top_k = config.num_experts_per_tok ++++-# self.norm_topk_prob = config.norm_topk_prob ++++- ++++-# # 门控网络 ++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++-# # 专家列表 ++++-# self.experts = nn.ModuleList( ++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++-# ) ++++-# # 共享专家 ++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-# @no_grad() ++++-# def _moe_infer_decode( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# """ ++++-# 【解码路径】针对 sequence_length=1 的极致优化。 ++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++++-# """ ++++-# batch_size, hidden_dim = hidden_states.shape ++++- ++++-# expert_outputs_list = [ ++++-# ops.cat([ ++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++-# ], dim=0) ++++-# for i in range(batch_size) ++++-# ] ++++- ++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++++-# # shape: (batch_size, top_k, hidden_dim) ++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++- ++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++- ++++-# return moe_output.squeeze(1) ++++- ++++-# @no_grad() ++++-# def _moe_infer_prefill( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# """ ++++-# 【预填充路径】针对 sequence_length > 1 的优化。 ++++-# 按专家对 Token 进行分组,并进行批处理。 ++++-# """ ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens = hidden_states.shape[0] ++++-# flat_selected_experts = selected_experts.flatten() ++++- ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++- ++++-# active_experts = ops.unique(flat_selected_experts) ++++- ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++- ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++-# selected_token_indices = token_indices[mask] ++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++- ++++-# current_states = hidden_states[selected_token_indices] ++++- ++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++- ++++-# moe_output = moe_output.index_add( ++++-# dim=0, ++++-# index=selected_token_indices, ++++-# source=expert_output.to(hidden_states.dtype) ++++-# ) ++++-# return moe_output ++++- ++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-# """ ++++-# 顶层 forward 方法,作为智能分发器。 ++++-# """ ++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- ++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-# router_logits = self.gate(hidden_states_reshaped) ++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- ++++-# if self.norm_topk_prob: ++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- ++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++-# moe_output = None ++++-# # 在推理时,根据序列长度选择最优路径 ++++-# if not self.training: ++++-# if sequence_length == 1: ++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++-# else: ++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++-# else: ++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++++-# raise NotImplementedError("Training path is not implemented.") ++++- ++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++++- ++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++++- ++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++++- ++++-# return final_hidden_states, router_logits ++++- ++++- ++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++-# """ ++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++++-# """ ++++-# def __init__(self, config: Qwen2MoeConfig): ++++-# super().__init__() ++++-# self.num_experts = config.num_experts ++++-# self.top_k = config.num_experts_per_tok ++++-# self.norm_topk_prob = config.norm_topk_prob ++++- ++++-# # 门控网络 ++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++-# # 专家列表 ++++-# self.experts = nn.ModuleList( ++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++-# ) ++++-# # 共享专家 ++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-# @no_grad() ++++-# def _moe_infer_decode( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# batch_size, _ = hidden_states.shape ++++-# expert_outputs_list = [ ++++-# ops.cat([ ++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++-# ], dim=0) ++++-# for i in range(batch_size) ++++-# ] ++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++-# return moe_output.squeeze(1) ++++- ++++-# @no_grad() ++++-# def _moe_infer_prefill( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens = hidden_states.shape[0] ++++-# flat_selected_experts = selected_experts.flatten() ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++-# active_experts = ops.unique(flat_selected_experts) ++++- ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++-# selected_token_indices = token_indices[mask] ++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++-# current_states = hidden_states[selected_token_indices] ++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++-# moe_output = moe_output.index_add( ++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++-# ) ++++-# return moe_output ++++- ++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-# """ ++++-# 顶层 forward 方法,作为智能分发器。 ++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++++-# """ ++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- ++++-# # 1. 门控计算 (通用逻辑) ++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-# router_logits = self.gate(hidden_states_reshaped) ++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- ++++-# if self.norm_topk_prob: ++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- ++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++-# # 2. 智能分发到最优 MoE 路径 ++++-# moe_output = None ++++-# if not self.training: ++++-# if sequence_length == 1: ++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++-# else: ++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++-# else: ++++-# raise NotImplementedError("Training path is not implemented.") ++++- ++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++- ++++-# # 4. 合并 MoE 输出和共享专家输出 ++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++- ++++-# # 5. 恢复原始形状并返回 ++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++- ++++-# return final_hidden_states, router_logits ++++- ++++-# prefill fastest ++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++-# """ ++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++++-# """ ++++-# def __init__(self, config: Qwen2MoeConfig): ++++-# super().__init__() ++++-# self.num_experts = config.num_experts ++++-# self.top_k = config.num_experts_per_tok ++++-# self.norm_topk_prob = config.norm_topk_prob ++++- ++++-# # 门控网络 ++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++-# # 专家列表 ++++-# self.experts = nn.ModuleList( ++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++-# ) ++++-# # 共享专家 ++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-# @no_grad() ++++-# def _moe_infer_dispatch( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# """ ++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++++-# """ ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens, _ = hidden_states.shape ++++- ++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++++-# flat_selected_experts = selected_experts.flatten() ++++-# flat_routing_weights = routing_weights.flatten() ++++- ++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++- ++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++++-# active_experts = ops.unique(flat_selected_experts) ++++- ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++- ++++-# # 找到所有分配给该专家的 token ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++- ++++-# # 使用 mask 选取对应的 token 和权重 ++++-# current_token_indices = token_indices[mask] ++++-# current_routing_weights = flat_routing_weights[mask] ++++-# current_hidden_states = hidden_states[current_token_indices] ++++- ++++-# # 对这些 token 进行批处理 ++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++- ++++-# # 使用 index_add 将结果精确地加回到对应位置 ++++-# moe_output = moe_output.index_add( ++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++++-# ) ++++-# return moe_output ++++- ++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-# """ ++++-# 顶层 forward 方法,作为智能分发器。 ++++-# """ ++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- ++++-# # 1. 门控计算 ++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-# router_logits = self.gate(hidden_states_reshaped) ++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- ++++-# if self.norm_topk_prob: ++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- ++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++- ++++-# # 2. 调用统一的 MoE 计算内核 ++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++++- ++++-# # 3. 统一处理共享专家 ++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++- ++++-# # 4. 合并输出 ++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++- ++++-# # 5. 恢复原始形状并返回 ++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++- ++++-# return final_hidden_states, router_logits ++++- ++++- ++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++-# """ ++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++-# 【最终高性能与高精度版】: ++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++++-# 3. 这样实现了速度和准确性的两全其美。 ++++-# """ ++++-# def __init__(self, config: Qwen2MoeConfig): ++++-# super().__init__() ++++-# self.num_experts = config.num_experts ++++-# self.top_k = config.num_experts_per_tok ++++-# self.norm_topk_prob = config.norm_topk_prob ++++- ++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++-# self.experts = nn.ModuleList( ++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++-# ) ++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-# @no_grad() ++++-# def _moe_infer_decode( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# """ ++++-# 【解码路径】极致优化版:bmm + 高精度累加。 ++++-# """ ++++-# original_dtype = hidden_states.dtype ++++-# batch_size, _ = hidden_states.shape ++++- ++++-# expert_outputs_list = [ ++++-# ops.cat([ ++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++-# ], dim=0) ++++-# for i in range(batch_size) ++++-# ] ++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++- ++++-# # 在 float32 下执行 bmm,得到高精度结果 ++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++- ++++-# # 将高精度结果转换回原始数据类型 ++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++++- ++++-# return moe_output ++++- ++++-# @no_grad() ++++-# def _moe_infer_prefill( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# selected_experts: mindspore.Tensor, ++++-# routing_weights: mindspore.Tensor ++++-# ) -> mindspore.Tensor: ++++-# """ ++++-# 【预填充路径】与原始实现一致,结果精确。 ++++-# """ ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens, _ = hidden_states.shape ++++-# flat_selected_experts = selected_experts.flatten() ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++-# active_experts = ops.unique(flat_selected_experts) ++++- ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++-# selected_token_indices = token_indices[mask] ++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++-# current_states = hidden_states[selected_token_indices] ++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++-# moe_output = moe_output.index_add( ++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++-# ) ++++-# return moe_output ++++- ++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++- ++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-# router_logits = self.gate(hidden_states_reshaped) ++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++- ++++-# if self.norm_topk_prob: ++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- ++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++++-# # 如果模型主体是 float16,后续再转换 ++++- ++++-# moe_output = None ++++-# if not self.training: ++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++++-# # _moe_infer_decode 内部会处理好类型转换 ++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++++-# if sequence_length == 1: ++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++-# else: ++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++-# else: ++++-# raise NotImplementedError("Training path is not implemented.") ++++- ++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++- ++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++- ++++-# return final_hidden_states, router_logits ++++- ++++- ++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++-# """ ++++-# 【融合版】一个混合专家模块,内置两种推理策略, ++++-# 由外部全局变量 `Long_Prompt` 控制: ++++- ++++-# - if Long_Prompt is True: 【精度优先模式】 ++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++++-# 适用于处理长序列,避免误差累积。 ++++- ++++-# - if Long_Prompt is False: 【速度优先模式】 ++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 ++++-# """ ++++-# def __init__(self, config: Qwen2MoeConfig): ++++-# super().__init__() ++++-# self.num_experts = config.num_experts ++++-# self.top_k = config.num_experts_per_tok ++++-# self.norm_topk_prob = config.norm_topk_prob ++++- ++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++-# self.experts = nn.ModuleList( ++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++-# ) ++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-# # --- 速度优先模式的辅助函数 --- ++++-# @no_grad() ++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++-# original_dtype = hidden_states.dtype ++++-# batch_size, _ = hidden_states.shape ++++-# expert_outputs_list = [ ++++-# ops.cat([ ++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++-# ], dim=0) ++++-# for i in range(batch_size) ++++-# ] ++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++-# weights_fp32 = routing_weights.to(mindspore.float32) ++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++-# return moe_output_fp32.squeeze(1).to(original_dtype) ++++- ++++-# @no_grad() ++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens, _ = hidden_states.shape ++++-# flat_selected_experts = selected_experts.flatten() ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++-# active_experts = ops.unique(flat_selected_experts) ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++-# selected_token_indices = token_indices[mask] ++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++-# current_states = hidden_states[selected_token_indices] ++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++++-# return moe_output ++++- ++++-# # --- 精度优先模式的辅助函数 --- ++++-# @no_grad() ++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++-# moe_output = ops.zeros_like(hidden_states) ++++-# num_tokens, _ = hidden_states.shape ++++-# flat_selected_experts = selected_experts.flatten() ++++-# flat_routing_weights = routing_weights.flatten() ++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++-# active_experts = ops.unique(flat_selected_experts) ++++-# for expert_idx_tensor in active_experts: ++++-# expert_idx = expert_idx_tensor.item() ++++-# expert_layer = self.experts[expert_idx] ++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++-# current_token_indices = token_indices[mask] ++++-# current_routing_weights = flat_routing_weights[mask] ++++-# current_hidden_states = hidden_states[current_token_indices] ++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++-# return moe_output ++++- ++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-# # 声明我们将要使用一个在模块外部定义的全局变量 ++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++++-# global Long_Prompt ++++- ++++-# # 1. 门控计算 (所有模式通用) ++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-# router_logits = self.gate(hidden_states_reshaped) ++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++-# if self.norm_topk_prob: ++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++- ++++-# moe_output = None ++++-# if not self.training: ++++-# # 根据 Long_Prompt 标志选择模式 ++++-# if Long_Prompt: ++++-# # --- 精度优先模式 --- ++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++-# else: ++++-# # --- 速度优先模式 --- ++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++-# if sequence_length == 1: ++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++-# else: ++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++-# else: ++++-# raise NotImplementedError("Training path is not implemented.") ++++- ++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++- ++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++- ++++-# return final_hidden_states, router_logits ++++- ++++ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ """ ++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++ return moe_output_fp32.squeeze(1).to(original_dtype) ++++ +++++ # @no_grad() +++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++ # num_tokens, _ = hidden_states.shape +++++ # flat_selected_experts = selected_experts.flatten() +++++ # sorted_expert_indices = flat_selected_experts.argsort() +++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++ # original_token_indices = sorted_expert_indices // self.top_k +++++ # moe_output = ops.zeros_like(hidden_states) +++++ # current_token_offset = 0 +++++ # for i in range(self.num_experts): +++++ # expert_token_count = tokens_per_expert[i] - current_token_offset +++++ # if expert_token_count == 0: +++++ # continue +++++ # end_offset = current_token_offset + expert_token_count +++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++ # expert_hidden_states = hidden_states[expert_original_token_indices] +++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++ # current_token_offset += expert_token_count +++++ # return moe_output +++++ ++++ @no_grad() ++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++- num_tokens, _ = hidden_states.shape ++++- flat_selected_experts = selected_experts.flatten() ++++- sorted_expert_indices = flat_selected_experts.argsort() ++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++- original_token_indices = sorted_expert_indices // self.top_k +++++ """ +++++ 优化版 MoE prefill (速度优先模式): +++++ - 批量张量化处理同一个 expert 的所有 token +++++ - 跳过无 token 的专家 +++++ - 保持结果完全一致 +++++ """ ++++ moe_output = ops.zeros_like(hidden_states) ++++- current_token_offset = 0 ++++- for i in range(self.num_experts): ++++- expert_token_count = tokens_per_expert[i] - current_token_offset ++++- if expert_token_count == 0: ++++- continue ++++- end_offset = current_token_offset + expert_token_count ++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++- expert_hidden_states = hidden_states[expert_original_token_indices] ++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++- current_token_offset += expert_token_count +++++ +++++ flat_selected_experts = selected_experts.flatten() +++++ flat_routing_weights = routing_weights.flatten() +++++ +++++ idxs = flat_selected_experts.argsort() +++++ sorted_expert_indices = flat_selected_experts[idxs] +++++ sorted_token_indices = idxs // self.top_k +++++ +++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +++++ +++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++++ +++++ for expert_id in active_experts.tolist(): +++++ start = int(tokens_per_expert[:expert_id].sum().item()) +++++ end = start + int(tokens_per_expert[expert_id].item()) +++++ +++++ token_idx = sorted_token_indices[start:end] +++++ expert_tokens = hidden_states[token_idx] +++++ +++++ expert_out = self.experts[expert_id](expert_tokens) +++++ +++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +++++ +++++ moe_output = mindspore.mint.scatter_add( +++++ moe_output, +++++ 0, +++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +++++ scaled_out.to(hidden_states.dtype) +++++ ) +++++ ++++ return moe_output ++++ +++++ ++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++++ @no_grad() ++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++ ++++ moe_output = None ++++- if Long_Prompt: ++++- # --- 精度优先模式 (ACCURACY MODE) --- ++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ # if Long_Prompt==0: +++++ # # --- 精度优先模式 (ACCURACY MODE) --- +++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ # else: +++++ # # --- 速度优先模式 (SPEED MODE) --- +++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++ # if sequence_length == 1: +++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ # else: +++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ +++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++ if sequence_length == 1: +++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++ else: ++++- # --- 速度优先模式 (SPEED MODE) --- ++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++- if sequence_length == 1: ++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++- else: ++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++- +++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ ++++ ++++ # 3. 共享专家计算与合并 (所有模式通用) ++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++ ++++ return final_hidden_states, router_logits ++++ +++++ ++++ class Qwen2MoeDecoderLayer(nn.Module): ++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++++ super().__init__() ++++ self.hidden_size = config.hidden_size ++++ ++++- # if Long_Prompt: ++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++- # else: +++++ # if Long_Prompt == 2: ++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++ # else: +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ ++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++ ++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++ ) ++++ ++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). ++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++ # attention_mask, +++++ # sequence_length=sequence_length, +++++ # target_length=target_length, +++++ # dtype=dtype, +++++ # min_dtype=min_dtype, +++++ # cache_position=cache_position, +++++ # batch_size=input_tensor.shape[0], +++++ # ) +++++ #@dwj +++++ causal_mask = get_cached_causal_mask_with_cache_position( ++++ attention_mask, ++++ sequence_length=sequence_length, ++++ target_length=target_length, ++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++++ """ ++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +++++ _causal_mask_cache.clear() ++++ ++++ input_ids = kwargs.get("input_ids") ++++ if input_ids is None and args: ++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ ++++ if input_ids is not None: ++++ prompt_length = input_ids.shape[1] ++++- ++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: ++++- Long_Prompt = True +++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +++++ Long_Prompt = 2 +++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +++++ Long_Prompt = 0 ++++ else: ++++- Long_Prompt = False +++++ Long_Prompt = 1 +++++ ++++ ++++ return super().generate(*args, **kwargs) ++++ ++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++ dtype = self.lm_head.weight.dtype ++++ min_dtype = float(ops.finfo(dtype).min) ++++ ++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++ # attention_mask, +++++ # sequence_length=sequence_length, +++++ # target_length=past_key_values.get_max_length(), +++++ # dtype=dtype, +++++ # min_dtype=min_dtype, +++++ # cache_position=cache_position, +++++ # batch_size=batch_size, +++++ # ) +++++ +++++ #@dwj +++++ attention_mask = get_cached_causal_mask_with_cache_position( ++++ attention_mask, ++++ sequence_length=sequence_length, ++++ target_length=past_key_values.get_max_length(), ++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++deleted file mode 100644 ++++index 6dfb5b93..00000000 ++++--- a/patches/0001-20251104commit.patch +++++++ /dev/null ++++@@ -1,1272 +0,0 @@ ++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++-From: Pinoeer-kingxi <13022943007@163.com> ++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++-Subject: [PATCH] 20251104commit ++++- ++++---- ++++- mindnlp/transformers/cache_utils.py | 28 +- ++++- .../models/deepseek/modeling_deepseek.py | 149 ++- ++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++- 3 files changed, 976 insertions(+), 87 deletions(-) ++++- ++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++-index cadd2e04..02f8d4be 100644 ++++---- a/mindnlp/transformers/cache_utils.py ++++-+++ b/mindnlp/transformers/cache_utils.py ++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++- # k_out[:, :, cache_position] = key_states ++++- # v_out[:, :, cache_position] = value_states ++++-- if ON_ORANGE_PI: ++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++-- else: ++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++-- ++++-+ # if ON_ORANGE_PI: ++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++-+ # else: ++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 ++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++-+ if cache_position.ndim > 1: ++++-+ cache_position = cache_position.flatten() ++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) ++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++-+ cache_position = cache_position.int() ++++-+ ++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++-+ k_out[:, :, cache_position] = key_states ++++-+ v_out[:, :, cache_position] = value_states ++++-+ ++++- return k_out, v_out ++++- ++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++-index c695b944..d8303e45 100644 ++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++- # Copied from transformers.models.llama.modeling_llama.rotate_half ++++- def rotate_half(x): ++++- """Rotates half the hidden dims of the input.""" ++++-- x1 = x[..., : x.shape[-1] // 2] ++++-- x2 = x[..., x.shape[-1] // 2 :] ++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++-+ # x1 = x[..., : x.shape[-1] // 2] ++++-+ # x2 = x[..., x.shape[-1] // 2 :] ++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++- return ops.cat((-x2, x1), dim=-1) ++++- ++++- ++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++- if self.training: ++++- raise NotImplementedError("Training is not supported yet.") ++++- else: ++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++-- if self.config.n_shared_experts is not None: ++++-- y = y + self.shared_experts(identity) ++++-- return y ++++-+ # @lwx ++++-+ if orig_shape[1] == 1: ++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++-+ y=y.view(*orig_shape) ++++-+ if self.config.n_shared_experts is not None: ++++-+ y = y + self.shared_experts(identity) ++++-+ return y ++++-+ else: ++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++-+ if self.config.n_shared_experts is not None: ++++-+ y = y + self.shared_experts(identity) ++++-+ return y ++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++-+ # if self.config.n_shared_experts is not None: ++++-+ # y = y + self.shared_experts(identity) ++++-+ # return y ++++-+ ++++-+ @no_grad() ++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++-+ ++++-+ expert_cache = ops.zeros_like(x) ++++-+ for i in range(self.num_experts_per_tok): ++++-+ expert_id = flat_expert_indices[i].item() ++++-+ weight = flat_expert_weights[i].item() ++++-+ expert = self.experts[expert_id] ++++-+ expert_out = expert(x) ++++-+ expert_cache += expert_out * weight ++++-+ return expert_cache ++++- ++++- @no_grad() ++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++-- # expert_cache = torch.zeros_like(x) ++++-- # idxs = flat_expert_indices.argsort() ++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++-- # token_idxs = idxs // self.num_experts_per_tok ++++-- # for i, end_idx in enumerate(tokens_per_expert): ++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++-- # if start_idx == end_idx: ++++-- # continue ++++-- # expert = self.experts[i] ++++-- # exp_token_idx = token_idxs[start_idx:end_idx] ++++-- # expert_tokens = x[exp_token_idx] ++++-- # expert_out = expert(expert_tokens) ++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++-- # return expert_cache ++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++- expert_cache = ops.zeros_like(x) ++++- idxs = flat_expert_indices.argsort() ++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++- token_idxs = idxs // self.num_experts_per_tok ++++-+ ++++- for i, end_idx in enumerate(tokens_per_expert): ++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++- if start_idx == end_idx: ++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++- expert_out = expert(expert_tokens) ++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++-+ ++++- return expert_cache ++++-+ ++++-+ # @no_grad() ++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++-+ # # expert_cache = torch.zeros_like(x) ++++-+ # # idxs = flat_expert_indices.argsort() ++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++-+ # # token_idxs = idxs // self.num_experts_per_tok ++++-+ # # for i, end_idx in enumerate(tokens_per_expert): ++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++-+ # # if start_idx == end_idx: ++++-+ # # continue ++++-+ # # expert = self.experts[i] ++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++-+ # # expert_tokens = x[exp_token_idx] ++++-+ # # expert_out = expert(expert_tokens) ++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++-+ # # return expert_cache ++++-+ # expert_cache = ops.zeros_like(x) ++++-+ # idxs = flat_expert_indices.argsort() ++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++-+ # token_idxs = idxs // self.num_experts_per_tok ++++-+ ++++-+ # for i, end_idx in enumerate(tokens_per_expert): ++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++-+ # if start_idx == end_idx: ++++-+ # continue ++++-+ # expert = self.experts[i] ++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++++-+ # expert_tokens = x[exp_token_idx] ++++-+ # expert_out = expert(expert_tokens) ++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++-+ ++++-+ # return expert_cache ++++-+ # @no_grad() ++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++-+ # expert_cache = ops.zeros_like(x) ++++-+ ++++-+ # # 排序保证顺序一致 ++++-+ # idxs = flat_expert_indices.argsort() ++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++-+ # token_idxs = idxs // self.num_experts_per_tok ++++-+ ++++-+ # # 找出有 token 的专家 ++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++-+ ++++-+ # for i in active_experts.tolist(): ++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++-+ # end_idx = tokens_per_expert[i] ++++-+ # if start_idx == end_idx: # 没有 token ++++-+ # continue ++++-+ ++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++++-+ # expert_tokens = x[exp_token_idx] ++++-+ # expert_out = self.experts[i](expert_tokens) ++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++-+ ++++-+ # expert_cache = mindspore.mint.scatter_add( ++++-+ # expert_cache, ++++-+ # 0, ++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++-+ # expert_out ++++-+ # ) ++++-+ ++++-+ # return expert_cache ++++-+ ++++-+ ++++- ++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++- # """ ++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++- ++++- # Initialize weights and apply final processing ++++- self.post_init() ++++-+ self.warm_up = False ++++-+ ++++-+ def warmup_moe_model_deep(self): ++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++-+ test_texts = [ ++++-+ "warmup short", ++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++-+ ] ++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++-+ if tokenizer is None: ++++-+ from mindnlp.transformers import AutoTokenizer ++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++-+ self._warmup_tokenizer = tokenizer ++++-+ ++++-+ for text in test_texts: ++++-+ inputs = tokenizer(text, return_tensors="ms") ++++-+ with mindspore._no_grad(): ++++-+ _ = self(**inputs, use_cache=False) ++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++- ++++- def get_input_embeddings(self): ++++- return self.model.embed_tokens ++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++- ```""" ++++-+ if not self.warm_up: ++++-+ self.warm_up = True ++++-+ self.warmup_moe_model_deep() ++++-+ ++++- output_attentions = ( ++++- output_attentions ++++- if output_attentions is not None ++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++-index 3cbf820e..d4c6b651 100644 ++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++-@@ -18,7 +18,6 @@ ++++- # See the License for the specific language governing permissions and ++++- # limitations under the License. ++++- """MindSpore Qwen2MoE model.""" ++++-- ++++- import math ++++- from typing import List, Optional, Tuple, Union ++++- ++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++- TokenClassifierOutput, ++++- ) ++++- from ...modeling_utils import PreTrainedModel ++++-+from ...generation import GenerationMixin ++++- from ....utils import logging ++++- from .configuration_qwen2_moe import Qwen2MoeConfig ++++- ++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++- self.variance_epsilon = eps ++++- ++++- def forward(self, hidden_states): ++++-+ # @dwj ++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++-+ # @lwx ++++-+ # if not self.training : ++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++- input_dtype = hidden_states.dtype ++++- hidden_states = hidden_states.to(mindspore.float32) ++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++-@@ -234,6 +239,8 @@ def rotate_half(x): ++++- """Rotates half the hidden dims of the input.""" ++++- x1 = x[..., : x.shape[-1] // 2] ++++- x2 = x[..., x.shape[-1] // 2 :] ++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++- return ops.cat((-x2, x1), dim=-1) ++++- ++++- ++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++- self.config = config ++++- self.hidden_size = config.hidden_size ++++- self.intermediate_size = intermediate_size ++++-+ ++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++- self.act_fn = ACT2FN[config.hidden_act] ++++- ++++- def forward(self, x): ++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++-- ++++- ++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++-+ # @lwx ++++-+ # gate_up_output = self.gate_up_proj(x) ++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++-+ # return self.down_proj(swiglu_output) ++++-+ ++++-+ # def forward(self, x): ++++-+ # gate_proj_out = self.gate_proj(x) ++++-+ # up_proj_out = self.up_proj(x) ++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++-+ # return self.down_proj(swiglu_out) ++++-+ ++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++- """ ++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++- use_cache: bool = False, ++++- cache_position: Optional[mindspore.Tensor] = None, ++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++-+ ++++-+ ++++-+ ++++- bsz, q_len, _ = hidden_states.shape ++++- ++++- query_states = self.q_proj(hidden_states) ++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++- "with a layer index." ++++- ) ++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++-+ if isinstance(past_key_value, StaticCache): ++++-+ kv_seq_len = key_states.shape[-2] ++++-+ else: ++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++- ++++- if past_key_value is not None: ++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++-+ ++++-+ if isinstance(past_key_value, StaticCache): ++++-+ kv_seq_len = key_states.shape[-2] ++++- ++++- # repeat k/v heads if n_kv_heads < n_heads ++++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++++-- ++++-+ ++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++- ++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++-- raise ValueError( ++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++-- f" {attn_weights.shape}" ++++-- ) ++++-- ++++-- if attention_mask is not None: # no matter the length, we just slice it ++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++-+ if attention_mask is not None: ++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++- attn_weights = attn_weights + causal_mask ++++- ++++- # upcast attention to fp32 ++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++- ++++- attn_output = self.o_proj(attn_output) ++++-- ++++-+ # @lwx ++++-+ ++++-+ # max_seq_len = self.max_position_embeddings # 2048 ++++-+ ++++-+ # if attention_mask is not None: ++++-+ # # attention_mask: [B, 1, Sq, Sk] ++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++-+ ++++-+ # # pad 到 [max_seq_len, max_seq_len] ++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++-+ # global_attention_mask = padded_mask ++++-+ # else: ++++-+ # global_attention_mask = None ++++-+ ++++-+ ++++-+ # sparse_mode=3 ++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++-+ # query=query_states, ++++-+ # key=key_states, ++++-+ # value=value_states, ++++-+ # real_shift=None, ++++-+ # padding_mask=None, ++++-+ ++++-+ # head_num=self.num_heads, ++++-+ # attn_mask=global_attention_mask, ++++-+ # keep_prob=1.0 - self.attention_dropout, ++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++-+ # input_layout="BNSD", ++++-+ # pre_tokens=2147483647, ++++-+ # next_tokens=2147483647, ++++-+ # inner_precise=0, ++++-+ # drop_mask=None, ++++-+ # prefix=None, ++++-+ # actual_seq_qlen=None, ++++-+ # actual_seq_kvlen=None, ++++-+ # sparse_mode=sparse_mode, ++++-+ # ) ++++- if not output_attentions: ++++- attn_weights = None ++++- ++++- return attn_output, attn_weights, past_key_value ++++- ++++- ++++-+class Qwen2MoeFlashAttention(nn.Module): ++++-+ """ ++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++-+ ++++-+ 关键改动: ++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++-+ 直接传入原始的 key 和 value 张量效率更高。 ++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++-+ """ ++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++-+ super().__init__() ++++-+ self.config = config ++++-+ self.layer_idx = layer_idx ++++-+ self.hidden_size = config.hidden_size ++++-+ self.num_heads = config.num_attention_heads ++++-+ self.head_dim = self.hidden_size // self.num_heads ++++-+ self.num_key_value_heads = config.num_key_value_heads ++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++-+ self.max_position_embeddings = config.max_position_embeddings ++++-+ self.rope_theta = config.rope_theta ++++-+ self.attention_dropout = config.attention_dropout ++++-+ ++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: ++++-+ raise ValueError( ++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++-+ ) ++++-+ ++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++-+ ++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++-+ self.head_dim, ++++-+ max_position_embeddings=self.max_position_embeddings, ++++-+ base=self.rope_theta, ++++-+ ) ++++-+ ++++-+ def forward( ++++-+ self, ++++-+ hidden_states: mindspore.Tensor, ++++-+ attention_mask: Optional[mindspore.Tensor] = None, ++++-+ position_ids: Optional[mindspore.Tensor] = None, ++++-+ past_key_value: Optional[Cache] = None, ++++-+ output_attentions: bool = False, ++++-+ use_cache: bool = False, ++++-+ cache_position: Optional[mindspore.Tensor] = None, ++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++-+ ++++-+ bsz, q_len, _ = hidden_states.shape ++++-+ ++++-+ # 1. 线性投射 Q, K, V ++++-+ query_states = self.q_proj(hidden_states) ++++-+ key_states = self.k_proj(hidden_states) ++++-+ value_states = self.v_proj(hidden_states) ++++-+ ++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++-+ # query: [B, S, H*D] -> [B, N1, S, D] ++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ ++++-+ # 3. RoPE 旋转位置编码 ++++-+ kv_seq_len = key_states.shape[-2] ++++-+ if past_key_value is not None: ++++-+ if self.layer_idx is None: ++++-+ raise ValueError( ++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++-+ "with a layer index." ++++-+ ) ++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++-+ if cache_position.shape[0] == 1: ++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++-+ kv_seq_len = past_seen_tokens + 1 ++++-+ else: ++++-+ # prefill 阶段:cache_position 是范围,使用其长度 ++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++-+ else: ++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++-+ ++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++-+ ++++-+ # 4. KV 缓存更新 ++++-+ if past_key_value is not None: ++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++-+ key_states, value_states = past_key_value.update( ++++-+ key_states, value_states, self.layer_idx, cache_kwargs ++++-+ ) ++++-+ ++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++-+ if cache_position.shape[0] == 1: ++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++-+ kv_seq_len = key_states.shape[-2] ++++-+ ++++-+ # 5. [重要] 准备 Attention Mask ++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++-+ fa_attention_mask = None ++++-+ if attention_mask is not None: ++++-+ # 截取与当前key长度匹配的部分 ++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++-+ fa_attention_mask = (mask_slice != 0) ++++-+ ++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++-+ input_dtype = query_states.dtype ++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++-+ query_states = query_states.to(mindspore.float16) ++++-+ key_states = key_states.to(mindspore.float16) ++++-+ value_states = value_states.to(mindspore.float16) ++++-+ ++++-+ # 6. [核心] 调用 flash_attention_score 算子 ++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++-+ attn_output = mindspore.ops.flash_attention_score( ++++-+ query=query_states, ++++-+ key=key_states, ++++-+ value=value_states, ++++-+ head_num=self.num_heads, # 传入Q的头数(N1) ++++-+ attn_mask=fa_attention_mask, ++++-+ keep_prob=1.0 - self.attention_dropout, ++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), ++++-+ input_layout="BNSD", ++++-+ sparse_mode=0 # 使用 defaultMask 模式 ++++-+ ) ++++-+ ++++-+ # 恢复原始数据类型 ++++-+ attn_output = attn_output.to(input_dtype) ++++-+ ++++-+ # 7. 调整输出形状 ++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++-+ attn_output = self.o_proj(attn_output) ++++-+ ++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 ++++-+ attn_weights = None ++++-+ if output_attentions: ++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++-+ ++++-+ return attn_output, attn_weights, past_key_value ++++-+ ++++-+ # def forward( ++++-+ # self, ++++-+ # hidden_states: mindspore.Tensor, ++++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++++-+ # position_ids: Optional[mindspore.Tensor] = None, ++++-+ # past_key_value: Optional[Cache] = None, ++++-+ # output_attentions: bool = False, ++++-+ # use_cache: bool = False, ++++-+ # cache_position: Optional[mindspore.Tensor] = None, ++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++-+ ++++-+ # bsz, q_len, _ = hidden_states.shape ++++-+ ++++-+ # # 1. 线性投射 Q, K, V ++++-+ # query_states = self.q_proj(hidden_states) ++++-+ # key_states = self.k_proj(hidden_states) ++++-+ # value_states = self.v_proj(hidden_states) ++++-+ ++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ ++++-+ # # 3. RoPE 旋转位置编码 ++++-+ # kv_seq_len = key_states.shape[-2] ++++-+ # if past_key_value is not None: ++++-+ # if self.layer_idx is None: ++++-+ # raise ValueError( ++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++-+ # "with a layer index." ++++-+ # ) ++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++-+ ++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++-+ ++++-+ # # 4. KV 缓存更新 ++++-+ # if past_key_value is not None: ++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++-+ # key_states, value_states = past_key_value.update( ++++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++++-+ # ) ++++-+ ++++-+ # # 5. 准备 Attention Mask ++++-+ # fa_attention_mask = None ++++-+ # if attention_mask is not None: ++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++-+ # fa_attention_mask = (mask_slice != 0) ++++-+ ++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++-+ # input_dtype = query_states.dtype ++++-+ ++++-+ # # 6. [核心] 调用 flash_attention_score 算子 ++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++-+ # query=query_states, ++++-+ # key=key_states, ++++-+ # value=value_states, ++++-+ # head_num=self.num_heads, ++++-+ # attn_mask=fa_attention_mask, ++++-+ # keep_prob=1.0 - self.attention_dropout, ++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++-+ # input_layout="BNSD", ++++-+ # sparse_mode=0, ++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- ++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++-+ # inner_precise=1 ++++-+ # ) ++++-+ ++++-+ # # 恢复原始数据类型 ++++-+ # attn_output = attn_output.to(input_dtype) ++++-+ ++++-+ # # 7. 调整输出形状 ++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++-+ # attn_output = self.o_proj(attn_output) ++++-+ ++++-+ # attn_weights = None ++++-+ # if output_attentions: ++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++-+ ++++-+ # return attn_output, attn_weights, past_key_value ++++-+ ++++-+ # def forward( ++++-+ # self, ++++-+ # hidden_states: mindspore.Tensor, ++++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++++-+ # position_ids: Optional[mindspore.Tensor] = None, ++++-+ # past_key_value: Optional[Cache] = None, ++++-+ # output_attentions: bool = False, ++++-+ # use_cache: bool = False, ++++-+ # cache_position: Optional[mindspore.Tensor] = None, ++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++-+ ++++-+ # bsz, q_len, _ = hidden_states.shape ++++-+ ++++-+ # query_states = self.q_proj(hidden_states) ++++-+ # key_states = self.k_proj(hidden_states) ++++-+ # value_states = self.v_proj(hidden_states) ++++-+ ++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-+ ++++-+ # kv_seq_len = key_states.shape[-2] ++++-+ # if past_key_value is not None: ++++-+ # if self.layer_idx is None: ++++-+ # raise ValueError("`layer_idx` must be specified for caching") ++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++-+ ++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++-+ ++++-+ # if past_key_value is not None: ++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++-+ # key_states, value_states = past_key_value.update( ++++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++++-+ # ) ++++-+ ++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++-+ ++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++-+ # query_states = query_states / math.sqrt(self.head_dim) ++++-+ # # <--- 修改结束 --- ++++-+ ++++-+ # fa_attention_mask = None ++++-+ # if attention_mask is not None: ++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++-+ # fa_attention_mask = (mask_slice != 0) ++++-+ ++++-+ # input_dtype = query_states.dtype ++++-+ ++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++-+ # query=query_states, # 传入已经预先缩放过的 query ++++-+ # key=key_states, ++++-+ # value=value_states, ++++-+ # head_num=self.num_heads, ++++-+ # attn_mask=fa_attention_mask, ++++-+ # keep_prob=1.0 - self.attention_dropout, ++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++-+ # input_layout="BNSD", ++++-+ # sparse_mode=0, ++++-+ # inner_precise=1 # 仍然保持内部高精度计算 ++++-+ # ) ++++-+ ++++-+ # attn_output = attn_output.to(input_dtype) ++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++-+ # attn_output = self.o_proj(attn_output) ++++-+ ++++-+ # attn_weights = None ++++-+ # if output_attentions: ++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++-+ ++++-+ # return attn_output, attn_weights, past_key_value ++++-+ ++++- QWEN2MOE_ATTENTION_CLASSES = { ++++- "eager": Qwen2MoeAttention, ++++-+ "flash-attention": Qwen2MoeFlashAttention, ++++- } ++++- ++++- ++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++- ++++-+ #@dwj ++++-+ # 只遍历激活的专家,而非全部专家 ++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++-- hidden_states = hidden_states.view(-1, hidden_dim) ++++-- # router_logits: (batch * sequence_length, n_experts) ++++-- router_logits = self.gate(hidden_states) ++++-- ++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++-- if self.norm_topk_prob: ++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++-- # we cast back to the input dtype ++++-- routing_weights = routing_weights.to(hidden_states.dtype) ++++-- ++++-- final_hidden_states = ops.zeros( ++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++-- ) ++++-- ++++-- # One hot encode the selected experts to create an expert mask ++++-- # this will be used to easily index which expert is going to be sollicitated ++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++-- ++++-- # Loop over all available experts in the model and perform the computation on each expert ++++-- for expert_idx in range(self.num_experts): ++++-- expert_layer = self.experts[expert_idx] ++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++-- ++++-- # Index the correct hidden states and compute the expert hidden state for ++++-- # the current expert. We need to make sure to multiply the output hidden ++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++-- if 0 not in idx.shape: ++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++-- ++++-- # However `index_add_` only support torch tensors for indexing so we'll use ++++-- # the `top_x` tensor here. ++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++-- ++++-- shared_expert_output = self.shared_expert(hidden_states) ++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++-- ++++-- final_hidden_states = final_hidden_states + shared_expert_output ++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++-+ num_tokens = hidden_states_reshaped.shape[0] ++++-+ ++++-+ router_logits = self.gate(hidden_states_reshaped) ++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++-+ ++++-+ if self.norm_topk_prob: ++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++-+ routing_weights = routing_weights.to(hidden_states.dtype) ++++-+ ++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++-+ flat_selected_experts = selected_experts.flatten() ++++-+ ++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++-+ token_indices = broadcasted_token_indices.flatten() ++++-+ ++++-+ active_experts = ops.unique(flat_selected_experts) ++++-+ ++++-+ for expert_idx_tensor in active_experts: ++++-+ expert_idx = expert_idx_tensor.item() ++++-+ expert_layer = self.experts[expert_idx] ++++-+ ++++-+ mask = (flat_selected_experts == expert_idx_tensor) ++++-+ selected_token_indices = token_indices[mask] ++++-+ selected_routing_weights = routing_weights.flatten()[mask] ++++-+ ++++-+ current_states = hidden_states_reshaped[selected_token_indices] ++++-+ ++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++-+ ++++-+ final_hidden_states = final_hidden_states.index_add( ++++-+ dim=0, ++++-+ index=selected_token_indices, ++++-+ source=expert_output.to(hidden_states.dtype) ++++-+ ) ++++-+ ++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++- ++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++-- return final_hidden_states, router_logits ++++-+ final_hidden_states = final_hidden_states + shared_expert_output ++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++-+ ++++-+ return final_hidden_states, router_logits ++++- ++++- ++++- class Qwen2MoeDecoderLayer(nn.Module): ++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++- ++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++- ++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++-+ ++++- if (layer_idx not in config.mlp_only_layers) and ( ++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++- ): ++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++- _skip_keys_device_placement = "past_key_values" ++++- _supports_cache_class = True ++++-+#lwx ++++-+ # _supports_static_cache = True ++++- ++++- def _init_weights(self, module): ++++- std = self.config.initializer_range ++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++- return causal_mask ++++- ++++- ++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++- _tied_weights_keys = ["lm_head.weight"] ++++- ++++- def __init__(self, config): ++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++- self.num_experts_per_tok = config.num_experts_per_tok ++++- # Initialize weights and apply final processing ++++- self.post_init() ++++-+ # @lwx ++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++-+ # self.generation_config.cache_implementation = "static" ++++-+ self._warmed_up = False ++++-+ ++++-+ def warmup_moe_model(self): ++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++-+ test_texts = [ ++++-+ "warmup short", ++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++-+ ] ++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++-+ if tokenizer is None: ++++-+ from mindnlp.transformers import AutoTokenizer ++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++-+ self._warmup_tokenizer = tokenizer ++++-+ ++++-+ for text in test_texts: ++++-+ inputs = tokenizer(text, return_tensors="ms") ++++-+ with mindspore._no_grad(): ++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++- ++++- def get_input_embeddings(self): ++++- return self.model.embed_tokens ++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++- ```""" ++++-+ if not self._warmed_up: ++++-+ self._warmed_up = True ++++-+ self.warmup_moe_model() ++++- ++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++- output_router_logits = ( ++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++- } ++++- ) ++++- return model_inputs ++++-+# @lwx ++++-+ # def _decode_one_tokens_logits( ++++-+ # self, ++++-+ # cur_token: mindspore.Tensor, ++++-+ # input_pos: Optional[mindspore.Tensor], ++++-+ # cache_position: mindspore.Tensor, ++++-+ # past_key_values: StaticCache, ++++-+ # ) -> mindspore.Tensor: ++++-+ # """ ++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++-+ ++++-+ # Args: ++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++-+ # input_pos: 输入位置信息,可选 ++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) ++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++-+ ++++-+ # Returns: ++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++-+ # """ ++++-+ # # 调用JIT编译的版本 ++++-+ # return self.get_decode_one_tokens_logits( ++++-+ # cur_token=cur_token, ++++-+ # input_pos=input_pos, ++++-+ # cache_position=cache_position, ++++-+ # past_key_values=past_key_values, ++++-+ # ) ++++-+ ++++-+ # @mindspore.jit(jit_level='O1') ++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++-+ # """ ++++-+ # JIT编译的函数,用于高效的单token解码 ++++-+ # 使用JIT编译优化以支持静态shape和高效执行 ++++-+ ++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++-+ # """ ++++-+ # outputs = self.model.forward( ++++-+ # input_ids=cur_token, ++++-+ # position_ids=input_pos, ++++-+ # cache_position=cache_position, ++++-+ # past_key_values=past_key_values, ++++-+ # use_cache=True, ++++-+ # return_dict=False, ++++-+ # ) ++++-+ ++++-+ # hidden_states = outputs[0] ++++-+ # logits = self.lm_head.forward(hidden_states) ++++-+ # logits = logits.float() ++++-+ ++++-+ # return logits[:, -1, :] ++++-+ ++++-+ # def _sample( ++++-+ # self, ++++-+ # input_ids: mindspore.Tensor, ++++-+ # logits_processor, ++++-+ # stopping_criteria, ++++-+ # generation_config, ++++-+ # synced_devices: bool, ++++-+ # streamer=None, ++++-+ # logits_warper=None, ++++-+ # **model_kwargs, ++++-+ # ): ++++-+ # """ ++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++-+ # """ ++++-+ # from ...generation.logits_process import LogitsProcessorList ++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList ++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++-+ # from mindnlp.core import nn, ops, no_grad ++++-+ # import numpy as np ++++-+ ++++-+ # # 检查是否使用 StaticCache ++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++-+ # # 否则,直接调用父类方法 ++++-+ # past_key_values = model_kwargs.get("past_key_values") ++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++-+ ++++-+ # if not isinstance(past_key_values, StaticCache): ++++-+ # # 不使用 StaticCache,直接调用父类方法 ++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++-+ # return super()._sample( ++++-+ # input_ids=input_ids, ++++-+ # logits_processor=logits_processor, ++++-+ # stopping_criteria=stopping_criteria, ++++-+ # generation_config=generation_config, ++++-+ # synced_devices=synced_devices, ++++-+ # streamer=streamer, ++++-+ # logits_warper=logits_warper, ++++-+ # **model_kwargs, ++++-+ # ) ++++-+ ++++-+ # # 使用 StaticCache,进入自定义循环 ++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++-+ # pad_token_id = generation_config._pad_token_tensor ++++-+ # output_attentions = generation_config.output_attentions ++++-+ # output_hidden_states = generation_config.output_hidden_states ++++-+ # output_scores = generation_config.output_scores ++++-+ # output_logits = generation_config.output_logits ++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate ++++-+ # max_length = generation_config.max_length ++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++-+ # do_sample = generation_config.do_sample ++++-+ ++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++-+ # raise ValueError( ++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++-+ # f"{logits_warper})." ++++-+ # ) ++++-+ ++++-+ # # init attention / hidden states / scores tuples ++++-+ # scores = () if (return_dict_in_generate and output_scores) else None ++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++-+ ++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++-+ # encoder_hidden_states = ( ++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++-+ # ) ++++-+ ++++-+ # # keep track of which sequences are already finished ++++-+ # batch_size, cur_len = input_ids.shape ++++-+ # this_peer_finished = False ++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++-+ ++++-+ # time_record = [] ++++-+ # from ....utils.testing_utils import parse_flag_from_env ++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++-+ ++++-+ # while self._has_unfinished_sequences( ++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++-+ # ): ++++-+ # if _record_time: ++++-+ # import time as time_module ++++-+ # infer_start = time_module.time() ++++-+ ++++-+ # # prepare model inputs ++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++-+ ++++-+ # # prepare variable output controls ++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++-+ ++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++-+ # cur_cache_position = model_inputs.get("cache_position") ++++-+ # cur_past_key_values = model_inputs.get("past_key_values") ++++-+ # cur_input_ids = model_inputs.get("input_ids") ++++-+ ++++-+ # if (isinstance(cur_past_key_values, StaticCache) and ++++-+ # cur_cache_position is not None and ++++-+ # len(cur_cache_position.shape) > 0 and ++++-+ # cur_cache_position.shape[0] == 1 and ++++-+ # cur_input_ids is not None and ++++-+ # cur_input_ids.shape[1] == 1): ++++-+ # # 使用 JIT 优化的单 token 解码 ++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++-+ # if not hasattr(self, '_jit_used'): ++++-+ # self._jit_used = False ++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++-+ ++++-+ # next_token_logits = self.get_decode_one_tokens_logits( ++++-+ # cur_token=cur_input_ids, ++++-+ # input_pos=model_inputs.get("position_ids"), ++++-+ # cache_position=cur_cache_position, ++++-+ # past_key_values=cur_past_key_values, ++++-+ # ) ++++-+ ++++-+ # # 标记已使用JIT(用于后续判断) ++++-+ # if not self._jit_used: ++++-+ # self._jit_used = True ++++-+ ++++-+ # # 构造兼容的输出对象 ++++-+ # class JitOptimizedOutput: ++++-+ # def __init__(self, logits, config): ++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++-+ # self.config = config ++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 ++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++-+ # self.attentions = None if not config.is_encoder_decoder else None ++++-+ # self.cross_attentions = None ++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None ++++-+ ++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++-+ # else: ++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++-+ # outputs = self(**model_inputs, return_dict=True) ++++-+ ++++-+ # if synced_devices and this_peer_finished: ++++-+ # continue ++++-+ ++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++-+ # next_token_logits = outputs.logits[:, -1, :] ++++-+ ++++-+ # # pre-process distribution ++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++-+ # if do_sample: ++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++-+ ++++-+ # # Store scores, attentions and hidden_states when required ++++-+ # if return_dict_in_generate: ++++-+ # if output_scores: ++++-+ # scores += (next_token_scores,) ++++-+ # if output_logits: ++++-+ # raw_logits += (next_token_logits,) ++++-+ # if output_attentions: ++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) ++++-+ # if self.config.is_encoder_decoder: ++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++-+ ++++-+ # if output_hidden_states: ++++-+ # hidden = ( ++++-+ # outputs.decoder_hidden_states ++++-+ # if self.config.is_encoder_decoder ++++-+ # else outputs.hidden_states ++++-+ # ) ++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++-+ ++++-+ # # token selection ++++-+ # if do_sample: ++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++-+ # else: ++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++-+ ++++-+ # # finished sentences should have their next token be a padding token ++++-+ # if has_eos_stopping_criteria: ++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++-+ ++++-+ # # update generated ids, model inputs, and length for next step ++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++-+ # if streamer is not None: ++++-+ # streamer.put(next_tokens) ++++-+ ++++-+ # model_kwargs = self._update_model_kwargs_for_generation( ++++-+ # outputs, ++++-+ # model_kwargs, ++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, ++++-+ # ) ++++-+ ++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++-+ # cur_len += 1 ++++-+ ++++-+ # if _record_time: ++++-+ # import time as time_module ++++-+ # infer_stop = time_module.time() ++++-+ # time_record.append(infer_stop - infer_start) ++++-+ ++++-+ # del outputs ++++-+ ++++-+ # average_infer_time = None ++++-+ # if time_record: ++++-+ # if len(time_record) > 1: ++++-+ # time_record.pop(0) ++++-+ # average_infer_time = sum(time_record) / len(time_record) ++++-+ # print(f'average inference time is: {average_infer_time}') ++++-+ # print(f'inference time record: {time_record}') ++++-+ ++++-+ # if streamer is not None: ++++-+ # streamer.end() ++++-+ ++++-+ # # 简单判断:打印是否使用了JIT路径 ++++-+ # if hasattr(self, '_jit_used') and self._jit_used: ++++-+ # print("[JIT] ✓ JIT optimization was used during generation") ++++-+ # else: ++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++-+ ++++-+ # if return_dict_in_generate: ++++-+ # if self.config.is_encoder_decoder: ++++-+ # return GenerateEncoderDecoderOutput( ++++-+ # sequences=input_ids, ++++-+ # scores=scores, ++++-+ # logits=raw_logits, ++++-+ # encoder_attentions=encoder_attentions, ++++-+ # encoder_hidden_states=encoder_hidden_states, ++++-+ # decoder_attentions=decoder_attentions, ++++-+ # cross_attentions=cross_attentions, ++++-+ # decoder_hidden_states=decoder_hidden_states, ++++-+ # past_key_values=model_kwargs.get("past_key_values"), ++++-+ # average_infer_time=average_infer_time ++++-+ # ) ++++-+ # else: ++++-+ # return GenerateDecoderOnlyOutput( ++++-+ # sequences=input_ids, ++++-+ # scores=scores, ++++-+ # logits=raw_logits, ++++-+ # attentions=decoder_attentions, ++++-+ # hidden_states=decoder_hidden_states, ++++-+ # past_key_values=model_kwargs.get("past_key_values"), ++++-+ # average_infer_time=average_infer_time ++++-+ # ) ++++-+ # else: ++++-+ # return input_ids ++++-+ ++++-+ # def _prepare_cache_for_generation( ++++-+ # self, ++++-+ # generation_config, ++++-+ # model_kwargs, ++++-+ # assistant_model, ++++-+ # batch_size, ++++-+ # max_cache_length, ++++-+ # ): ++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++-+ # generation_config.cache_implementation = "static" ++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++-+ ++++-+ # if generation_config.cache_implementation == "static": ++++-+ # base_required_from_max_length = generation_config.max_length + 1 ++++-+ # base_required = max(max_cache_length, base_required_from_max_length) ++++-+ # min_cache_size = 50 ++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++-+ # else: ++++-+ # max_cache_length = max(base_required, min_cache_size) ++++-+ ++++-+ # original_max_cache_length = max_cache_length ++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") ++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++-+ # print(f" - final max_cache_length: {max_cache_length}") ++++-+ ++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++-+ # if max_cache_length > self.config.max_position_embeddings: ++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++-+ ++++-+ # result = super()._prepare_cache_for_generation( ++++-+ # generation_config=generation_config, ++++-+ # model_kwargs=model_kwargs, ++++-+ # assistant_model=assistant_model, ++++-+ # batch_size=batch_size, ++++-+ # max_cache_length=max_cache_length, ++++-+ # ) ++++-+ ++++-+ # if generation_config.cache_implementation == "static": ++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++-+ # created_cache = model_kwargs.get(cache_name) ++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++-+ # if created_cache.max_cache_len < generation_config.max_length: ++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++-+ ++++-+ # return result ++++-+ ++++-+ ++++-+ ++++- ++++- ++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++--- ++++-2.27.0 ++++- ++++-- ++++2.27.0 ++++ +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" new file mode 100644 index 00000000..695e3df9 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" @@ -0,0 +1,8034 @@ +From 2831c3ffbda41719e00e1cd83c3840bcb9dd79db Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Fri, 7 Nov 2025 12:12:51 +0800 +Subject: [PATCH 07/10] 20251107003commit + +--- + .../models/deepseek/modeling_deepseek.py | 2 +- + patches/0001-20251104commit.patch | 2 +- + patches/0002-20251106commit.patch | 2 +- + patches/0003-20261106secondcommit.patch | 2 +- + patches/0004-20251106change.patch | 2 +- + patches/0005-20251107001commit.patch | 2 +- + patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ + 7 files changed, 7937 insertions(+), 6 deletions(-) + create mode 100644 patches/0006-20251107002commit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index e7e1c053..ff631974 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): + # return expert_cache + + @no_grad() +- dwj ++ # dwj + def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): + # x 的 shape: (1, hidden_size) + # flat_expert_indices 的 shape: (num_experts_per_tok,) +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +index 2842180e..c9c8c5ee 100644 +--- a/patches/0001-20251104commit.patch ++++ b/patches/0001-20251104commit.patch +@@ -1,7 +1,7 @@ + From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/5] 20251104commit ++Subject: [PATCH 1/6] 20251104commit + + --- + mindnlp/transformers/cache_utils.py | 28 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +index c6cd8757..625656eb 100644 +--- a/patches/0002-20251106commit.patch ++++ b/patches/0002-20251106commit.patch +@@ -1,7 +1,7 @@ + From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/5] 20251106commit ++Subject: [PATCH 2/6] 20251106commit + + --- + .../models/deepseek/modeling_deepseek.py | 379 ++++- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +index 601960c9..dcb85080 100644 +--- a/patches/0003-20261106secondcommit.patch ++++ b/patches/0003-20261106secondcommit.patch +@@ -1,7 +1,7 @@ + From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/5] 20261106secondcommit ++Subject: [PATCH 3/6] 20261106secondcommit + + --- + .../models/deepseek/modeling_deepseek.py | 217 ++- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +index 8976f10b..bbed13cc 100644 +--- a/patches/0004-20251106change.patch ++++ b/patches/0004-20251106change.patch +@@ -1,7 +1,7 @@ + From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 15:48:09 +0800 +-Subject: [PATCH 4/5] 20251106change ++Subject: [PATCH 4/6] 20251106change + + --- + .../models/deepseek/modeling_deepseek.py | 189 +- +diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +index 8d9032be..b2d1035c 100644 +--- a/patches/0005-20251107001commit.patch ++++ b/patches/0005-20251107001commit.patch +@@ -1,7 +1,7 @@ + From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 11:48:18 +0800 +-Subject: [PATCH 5/5] 20251107001commit ++Subject: [PATCH 5/6] 20251107001commit + + --- + .../models/deepseek/modeling_deepseek.py | 91 +- +diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +new file mode 100644 +index 00000000..bffa134e +--- /dev/null ++++ b/patches/0006-20251107002commit.patch +@@ -0,0 +1,7931 @@ ++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Fri, 7 Nov 2025 12:06:32 +0800 ++Subject: [PATCH 6/6] 20251107002commit ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 122 +- ++ patches/0001-20251104commit.patch | 2 +- ++ patches/0002-20251106commit.patch | 2 +- ++ patches/0003-20261106secondcommit.patch | 2 +- ++ patches/0004-20251106change.patch | 2 +- ++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ ++ 6 files changed, 7773 insertions(+), 64 deletions(-) ++ create mode 100644 patches/0005-20251107001commit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index 8831e4b7..e7e1c053 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): ++ # expert_out = expert(x) ++ # expert_cache += expert_out * weight ++ # return expert_cache ++- ++- # @no_grad() ++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++- # # x 的 shape: (1, hidden_size) ++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) ++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++- ++- # # 1. 收集所有需要的专家层 ++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++- # selected_experts = [self.experts[i] for i in flat_expert_indices] ++- ++- # # 2. 并行计算所有专家的输出 ++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++- # # ops.cat 会将它们堆叠成一个新的 Tensor ++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++- ++- # # 3. 使用矩阵乘法进行加权求和 ++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++- # # 最终结果 final_output 的 shape: (1, hidden_size) ++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++ +++ @no_grad() +++ dwj +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ # x 的 shape: (1, hidden_size) +++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++ +++ # 1. 收集所有需要的专家层 +++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++ selected_experts = [self.experts[i] for i in flat_expert_indices] +++ +++ # 2. 并行计算所有专家的输出 +++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++ # ops.cat 会将它们堆叠成一个新的 Tensor +++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++ +++ # 3. 使用矩阵乘法进行加权求和 +++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++ # 最终结果 final_output 的 shape: (1, hidden_size) +++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++ ++- # return final_output +++ return final_output ++ ++ ++ # @no_grad() ++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): ++ ++ return expert_cache ++ # 放置在 DeepseekMoE 类中 ++- @no_grad() ++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++- """ ++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++- ++- Args: ++- x (Tensor): 输入张量, shape: (1, hidden_size) ++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++- """ ++- top_k, _ = flat_expert_weights.shape ++- hidden_size = x.shape[-1] ++- ++- # 1. 将所有专家的权重堆叠起来 ++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +++ # @no_grad() +++ # #lwx 20251107 +++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ # """ +++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +++ +++ # Args: +++ # x (Tensor): 输入张量, shape: (1, hidden_size) +++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +++ # """ +++ # top_k, _ = flat_expert_weights.shape +++ # hidden_size = x.shape[-1] +++ +++ # # 1. 将所有专家的权重堆叠起来 +++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++ ++- # 2. "收集" 所需的专家权重 ++- selected_gate_w = stacked_gate_w[flat_expert_indices] ++- selected_up_w = stacked_up_w[flat_expert_indices] ++- selected_down_w = stacked_down_w[flat_expert_indices] +++ # # 2. "收集" 所需的专家权重 +++ # selected_gate_w = stacked_gate_w[flat_expert_indices] +++ # selected_up_w = stacked_up_w[flat_expert_indices] +++ # selected_down_w = stacked_down_w[flat_expert_indices] ++ ++- # 3. 准备输入 ++- x_expanded = x.expand((top_k, 1, hidden_size)) +++ # # 3. 准备输入 +++ # x_expanded = x.expand((top_k, 1, hidden_size)) ++ ++- # 4. 并行计算 gate_proj 和 up_proj ++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +++ # # 4. 并行计算 gate_proj 和 up_proj +++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++ ++- # 5. 计算中间状态 ++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out +++ # # 5. 计算中间状态 +++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++ ++- # 6. 并行计算 down_proj ++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++- # --- [FIX] --- ++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++- # --- [FIX END] --- +++ # # 6. 并行计算 down_proj +++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +++ # # --- [FIX] --- +++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +++ # # --- [FIX END] --- ++ ++- # 7. 根据路由权重进行加权求和 ++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +++ # # 7. 根据路由权重进行加权求和 +++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++ ++- return weighted_sum +++ # return weighted_sum ++ ++ ++ ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++index 0a0ef2d7..2842180e 100644 ++--- a/patches/0001-20251104commit.patch +++++ b/patches/0001-20251104commit.patch ++@@ -1,7 +1,7 @@ ++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++-Subject: [PATCH 1/4] 20251104commit +++Subject: [PATCH 1/5] 20251104commit ++ ++ --- ++ mindnlp/transformers/cache_utils.py | 28 +- ++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++index 5185270c..c6cd8757 100644 ++--- a/patches/0002-20251106commit.patch +++++ b/patches/0002-20251106commit.patch ++@@ -1,7 +1,7 @@ ++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++-Subject: [PATCH 2/4] 20251106commit +++Subject: [PATCH 2/5] 20251106commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++index 3e05f821..601960c9 100644 ++--- a/patches/0003-20261106secondcommit.patch +++++ b/patches/0003-20261106secondcommit.patch ++@@ -1,7 +1,7 @@ ++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++-Subject: [PATCH 3/4] 20261106secondcommit +++Subject: [PATCH 3/5] 20261106secondcommit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++index 88a1aef4..8976f10b 100644 ++--- a/patches/0004-20251106change.patch +++++ b/patches/0004-20251106change.patch ++@@ -1,7 +1,7 @@ ++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 15:48:09 +0800 ++-Subject: [PATCH 4/4] 20251106change +++Subject: [PATCH 4/5] 20251106change ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 189 +- ++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch ++new file mode 100644 ++index 00000000..8d9032be ++--- /dev/null +++++ b/patches/0005-20251107001commit.patch ++@@ -0,0 +1,7707 @@ +++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Fri, 7 Nov 2025 11:48:18 +0800 +++Subject: [PATCH 5/5] 20251107001commit +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 91 +- +++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +++ patches/0001-20251104commit.patch | 2 +- +++ patches/0002-20251106commit.patch | 2 +- +++ patches/0003-20261106secondcommit.patch | 2 +- +++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ +++ 7 files changed, 7577 insertions(+), 30 deletions(-) +++ create mode 100644 patches/0004-20251106change.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index 0546f318..8831e4b7 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +++ # expert_cache += expert_out * weight +++ # return expert_cache +++ +++- @no_grad() +++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++- # x 的 shape: (1, hidden_size) +++- # flat_expert_indices 的 shape: (num_experts_per_tok,) +++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++- +++- # 1. 收集所有需要的专家层 +++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++- selected_experts = [self.experts[i] for i in flat_expert_indices] +++- +++- # 2. 并行计算所有专家的输出 +++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++- # ops.cat 会将它们堆叠成一个新的 Tensor +++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++- +++- # 3. 使用矩阵乘法进行加权求和 +++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++- # 最终结果 final_output 的 shape: (1, hidden_size) +++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++ # @no_grad() ++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ # # x 的 shape: (1, hidden_size) ++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++ ++++ # # 1. 收集所有需要的专家层 ++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] ++++ ++++ # # 2. 并行计算所有专家的输出 ++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++ # # ops.cat 会将它们堆叠成一个新的 Tensor ++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++ ++++ # # 3. 使用矩阵乘法进行加权求和 ++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ # # 最终结果 final_output 的 shape: (1, hidden_size) ++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++ +++- return final_output ++++ # return final_output +++ +++ +++ # @no_grad() +++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +++ ) +++ +++ return expert_cache ++++# 放置在 DeepseekMoE 类中 ++++ @no_grad() ++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ """ ++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++++ ++++ Args: ++++ x (Tensor): 输入张量, shape: (1, hidden_size) ++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++++ """ ++++ top_k, _ = flat_expert_weights.shape ++++ hidden_size = x.shape[-1] ++++ ++++ # 1. 将所有专家的权重堆叠起来 ++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++++ ++++ # 2. "收集" 所需的专家权重 ++++ selected_gate_w = stacked_gate_w[flat_expert_indices] ++++ selected_up_w = stacked_up_w[flat_expert_indices] ++++ selected_down_w = stacked_down_w[flat_expert_indices] ++++ ++++ # 3. 准备输入 ++++ x_expanded = x.expand((top_k, 1, hidden_size)) ++++ ++++ # 4. 并行计算 gate_proj 和 up_proj ++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++++ ++++ # 5. 计算中间状态 ++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++++ ++++ # 6. 并行计算 down_proj ++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++++ # --- [FIX] --- ++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++++ # --- [FIX END] --- ++++ ++++ # 7. 根据路由权重进行加权求和 ++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++++ ++++ return weighted_sum ++++ ++++ +++ +++ # @no_grad() +++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++index ebd7782e..913a7609 100644 +++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++- x1 = x[..., : x.shape[-1] // 2] +++- x2 = x[..., x.shape[-1] // 2 :] ++++ # x1 = x[..., : x.shape[-1] // 2] ++++ # x2 = x[..., x.shape[-1] // 2 :] +++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++index d059dcbe..2b217b64 100644 +++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++ def rotate_half(x): +++ """Rotates half the hidden dims of the input.""" +++- x1 = x[..., : x.shape[-1] // 2] +++- x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++ # x1 = x[..., : x.shape[-1] // 2] ++++ # x2 = x[..., x.shape[-1] // 2 :] ++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++ return ops.cat((-x2, x1), dim=-1) +++ +++ +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++index 78f22642..0a0ef2d7 100644 +++--- a/patches/0001-20251104commit.patch ++++++ b/patches/0001-20251104commit.patch +++@@ -1,7 +1,7 @@ +++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +++-Subject: [PATCH 1/3] 20251104commit ++++Subject: [PATCH 1/4] 20251104commit +++ +++ --- +++ mindnlp/transformers/cache_utils.py | 28 +- +++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++index 22b65dd5..5185270c 100644 +++--- a/patches/0002-20251106commit.patch ++++++ b/patches/0002-20251106commit.patch +++@@ -1,7 +1,7 @@ +++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +++-Subject: [PATCH 2/3] 20251106commit ++++Subject: [PATCH 2/4] 20251106commit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++index 966529e4..3e05f821 100644 +++--- a/patches/0003-20261106secondcommit.patch ++++++ b/patches/0003-20261106secondcommit.patch +++@@ -1,7 +1,7 @@ +++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +++-Subject: [PATCH 3/3] 20261106secondcommit ++++Subject: [PATCH 3/4] 20261106secondcommit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +++new file mode 100644 +++index 00000000..88a1aef4 +++--- /dev/null ++++++ b/patches/0004-20251106change.patch +++@@ -0,0 +1,7498 @@ ++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Thu, 6 Nov 2025 15:48:09 +0800 ++++Subject: [PATCH 4/4] 20251106change ++++ ++++--- ++++ .../models/deepseek/modeling_deepseek.py | 189 +- ++++ patches/0001-20251104commit.patch | 1272 +++++++ ++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ ++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ ++++ 4 files changed, 7244 insertions(+), 186 deletions(-) ++++ create mode 100644 patches/0001-20251104commit.patch ++++ create mode 100644 patches/0002-20251106commit.patch ++++ create mode 100644 patches/0003-20261106secondcommit.patch ++++ ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index 2f9192bf..0546f318 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): ++++ ++++ return attn_output, attn_weights, past_key_value ++++ ++++-# class DeepseekFlashAttention(nn.Module): ++++-# """ ++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++++- ++++-# This class is designed as a drop-in replacement for DeepseekAttention. ++++-# """ ++++- ++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++-# super().__init__() ++++-# self.config = config ++++-# self.layer_idx = layer_idx ++++-# if layer_idx is None: ++++-# logger.warning( ++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++-# "when creating this class." ++++-# ) ++++- ++++-# self.attention_dropout = config.attention_dropout ++++-# self.hidden_size = config.hidden_size ++++-# self.num_heads = config.num_attention_heads ++++-# self.head_dim = self.hidden_size // self.num_heads ++++-# self.num_key_value_heads = config.num_key_value_heads ++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++-# self.max_position_embeddings = config.max_position_embeddings ++++-# self.rope_theta = config.rope_theta ++++-# self.is_causal = True ++++- ++++-# if (self.head_dim * self.num_heads) != self.hidden_size: ++++-# raise ValueError( ++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++-# f" and `num_heads`: {self.num_heads})." ++++-# ) ++++- ++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++-# self._init_rope() ++++- ++++-# def _init_rope(self): ++++-# if self.config.rope_scaling is None: ++++-# self.rotary_emb = DeepseekRotaryEmbedding( ++++-# self.head_dim, ++++-# max_position_embeddings=self.max_position_embeddings, ++++-# base=self.rope_theta, ++++-# ) ++++-# else: ++++-# scaling_type = self.config.rope_scaling["type"] ++++-# scaling_factor = self.config.rope_scaling["factor"] ++++-# if scaling_type == "linear": ++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++-# self.head_dim, ++++-# max_position_embeddings=self.max_position_embeddings, ++++-# scaling_factor=scaling_factor, ++++-# base=self.rope_theta, ++++-# ) ++++-# elif scaling_type == "dynamic": ++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++-# self.head_dim, ++++-# max_position_embeddings=self.max_position_embeddings, ++++-# scaling_factor=scaling_factor, ++++-# base=self.rope_theta, ++++-# ) ++++-# else: ++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++- ++++-# def forward( ++++-# self, ++++-# hidden_states: mindspore.Tensor, ++++-# attention_mask: Optional[mindspore.Tensor] = None, ++++-# position_ids: Optional[mindspore.Tensor] = None, ++++-# past_key_value: Optional[Cache] = None, ++++-# output_attentions: bool = False, ++++-# use_cache: bool = False, ++++-# **kwargs, ++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++-# if "padding_mask" in kwargs: ++++-# warnings.warn( ++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++-# ) ++++- ++++-# if output_attentions: ++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++++- ++++-# bsz, q_len, _ = hidden_states.shape ++++- ++++-# if self.config.pretraining_tp > 1: ++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++- ++++-# query_states = self.q_proj(hidden_states) ++++-# key_states = self.k_proj(hidden_states) ++++-# value_states = self.v_proj(hidden_states) ++++- ++++-# # Reshape for multi-head attention ++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++- ++++-# kv_seq_len = key_states.shape[-2] ++++-# if past_key_value is not None: ++++-# if self.layer_idx is None: ++++-# raise ValueError( ++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++-# "with a layer index." ++++-# ) ++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++- ++++-# # Apply Rotary Positional Embedding ++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++- ++++-# if past_key_value is not None: ++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++- ++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++- ++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++- ++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++- ++++-# # Convert attention_mask for flash_attention_score ++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++++-# if attention_mask is not None: ++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++-# raise ValueError( ++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++-# ) ++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++++-# else: ++++-# attn_mask_for_fa = None ++++- ++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++- ++++-# # Call the fused flash_attention_score operator ++++-# attn_output = mindspore.ops.flash_attention_score( ++++-# query=query_states_for_fa, ++++-# key=key_states_for_fa, ++++-# value=value_states_for_fa, ++++-# head_num=self.num_heads, # This is N1, the number of query heads ++++-# input_layout='BSH', ++++-# attn_mask=attn_mask_for_fa, ++++-# keep_prob=keep_prob, ++++-# scalar_value=1.0 / math.sqrt(self.head_dim), ++++-# sparse_mode=0 # Default mask mode ++++-# ) ++++- ++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++++-# attn_output = self.o_proj(attn_output) ++++- ++++-# # Flash Attention does not return attention weights ++++-# attn_weights = None ++++- ++++-# return attn_output, attn_weights, past_key_value ++++ ++++ class DeepseekFlashAttention(nn.Module): ++++ """ ++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): ++++ super().__init__() ++++ self.hidden_size = config.hidden_size ++++ ++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++++- config=config, layer_idx=layer_idx ++++- ) +++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +++++ # config=config, layer_idx=layer_idx +++++ # ) ++++ ++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++++ config=config, layer_idx=layer_idx ++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): ++++ return outputs ++++ ++++ ++++- ++++ class DeepseekPreTrainedModel(PreTrainedModel): ++++ config_class = DeepseekConfig ++++ base_model_prefix = "model" ++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++ # Initialize weights and apply final processing ++++ self.post_init() ++++ self.warm_up = False ++++- #@dwj ++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++++- self.num_layers, ++++- self.num_attention_heads, ++++- self.head_dim, ++++- batch_size=1, ++++- max_length=self.max_length, ++++- dtype=mindspore.float16 ++++- ) ++++- ++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++++- key_cache = [] ++++- value_cache = [] ++++- for _ in range(num_layers): ++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++- key_cache.append(k) ++++- value_cache.append(v) ++++- return key_cache, value_cache ++++- ++++ ++++ def warmup_moe_model_deep(self): ++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++new file mode 100644 ++++index 00000000..78f22642 ++++--- /dev/null +++++++ b/patches/0001-20251104commit.patch ++++@@ -0,0 +1,1272 @@ +++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++Subject: [PATCH 1/3] 20251104commit +++++ +++++--- +++++ mindnlp/transformers/cache_utils.py | 28 +- +++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++ 3 files changed, 976 insertions(+), 87 deletions(-) +++++ +++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++index cadd2e04..02f8d4be 100644 +++++--- a/mindnlp/transformers/cache_utils.py ++++++++ b/mindnlp/transformers/cache_utils.py +++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++ # k_out[:, :, cache_position] = key_states +++++ # v_out[:, :, cache_position] = value_states +++++- if ON_ORANGE_PI: +++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++- else: +++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++- ++++++ # if ON_ORANGE_PI: ++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++ # else: ++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++++ if cache_position.ndim > 1: ++++++ cache_position = cache_position.flatten() ++++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++++ cache_position = cache_position.int() ++++++ ++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++++ k_out[:, :, cache_position] = key_states ++++++ v_out[:, :, cache_position] = value_states ++++++ +++++ return k_out, v_out +++++ +++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index c695b944..d8303e45 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++- x1 = x[..., : x.shape[-1] // 2] +++++- x2 = x[..., x.shape[-1] // 2 :] ++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++ # x2 = x[..., x.shape[-1] // 2 :] ++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++ if self.training: +++++ raise NotImplementedError("Training is not supported yet.") +++++ else: +++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++- if self.config.n_shared_experts is not None: +++++- y = y + self.shared_experts(identity) +++++- return y ++++++ # @lwx ++++++ if orig_shape[1] == 1: ++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++++ y=y.view(*orig_shape) ++++++ if self.config.n_shared_experts is not None: ++++++ y = y + self.shared_experts(identity) ++++++ return y ++++++ else: ++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++++ if self.config.n_shared_experts is not None: ++++++ y = y + self.shared_experts(identity) ++++++ return y ++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++ # if self.config.n_shared_experts is not None: ++++++ # y = y + self.shared_experts(identity) ++++++ # return y ++++++ ++++++ @no_grad() ++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ ++++++ expert_cache = ops.zeros_like(x) ++++++ for i in range(self.num_experts_per_tok): ++++++ expert_id = flat_expert_indices[i].item() ++++++ weight = flat_expert_weights[i].item() ++++++ expert = self.experts[expert_id] ++++++ expert_out = expert(x) ++++++ expert_cache += expert_out * weight ++++++ return expert_cache +++++ +++++ @no_grad() +++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++- # expert_cache = torch.zeros_like(x) +++++- # idxs = flat_expert_indices.argsort() +++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++- # token_idxs = idxs // self.num_experts_per_tok +++++- # for i, end_idx in enumerate(tokens_per_expert): +++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++- # if start_idx == end_idx: +++++- # continue +++++- # expert = self.experts[i] +++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++- # expert_tokens = x[exp_token_idx] +++++- # expert_out = expert(expert_tokens) +++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++- # return expert_cache ++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++ expert_cache = ops.zeros_like(x) +++++ idxs = flat_expert_indices.argsort() +++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++ token_idxs = idxs // self.num_experts_per_tok ++++++ +++++ for i, end_idx in enumerate(tokens_per_expert): +++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++ if start_idx == end_idx: +++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++ expert_out = expert(expert_tokens) +++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ +++++ return expert_cache ++++++ ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # # expert_cache = torch.zeros_like(x) ++++++ # # idxs = flat_expert_indices.argsort() ++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++ # # if start_idx == end_idx: ++++++ # # continue ++++++ # # expert = self.experts[i] ++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # # expert_tokens = x[exp_token_idx] ++++++ # # expert_out = expert(expert_tokens) ++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++ # # return expert_cache ++++++ # expert_cache = ops.zeros_like(x) ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # if start_idx == end_idx: ++++++ # continue ++++++ # expert = self.experts[i] ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = expert(expert_tokens) ++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ ++++++ # return expert_cache ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # expert_cache = ops.zeros_like(x) ++++++ ++++++ # # 排序保证顺序一致 ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # # 找出有 token 的专家 ++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++ ++++++ # for i in active_experts.tolist(): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # end_idx = tokens_per_expert[i] ++++++ # if start_idx == end_idx: # 没有 token ++++++ # continue ++++++ ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = self.experts[i](expert_tokens) ++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++ ++++++ # expert_cache = mindspore.mint.scatter_add( ++++++ # expert_cache, ++++++ # 0, ++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++ # expert_out ++++++ # ) ++++++ ++++++ # return expert_cache ++++++ ++++++ +++++ +++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++ # """ +++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ +++++ # Initialize weights and apply final processing +++++ self.post_init() ++++++ self.warm_up = False ++++++ ++++++ def warmup_moe_model_deep(self): ++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++ test_texts = [ ++++++ "warmup short", ++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++++ ] ++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++ if tokenizer is None: ++++++ from mindnlp.transformers import AutoTokenizer ++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++ self._warmup_tokenizer = tokenizer ++++++ ++++++ for text in test_texts: ++++++ inputs = tokenizer(text, return_tensors="ms") ++++++ with mindspore._no_grad(): ++++++ _ = self(**inputs, use_cache=False) ++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++ +++++ def get_input_embeddings(self): +++++ return self.model.embed_tokens +++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++ ```""" ++++++ if not self.warm_up: ++++++ self.warm_up = True ++++++ self.warmup_moe_model_deep() ++++++ +++++ output_attentions = ( +++++ output_attentions +++++ if output_attentions is not None +++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++index 3cbf820e..d4c6b651 100644 +++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++@@ -18,7 +18,6 @@ +++++ # See the License for the specific language governing permissions and +++++ # limitations under the License. +++++ """MindSpore Qwen2MoE model.""" +++++- +++++ import math +++++ from typing import List, Optional, Tuple, Union +++++ +++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++ TokenClassifierOutput, +++++ ) +++++ from ...modeling_utils import PreTrainedModel ++++++from ...generation import GenerationMixin +++++ from ....utils import logging +++++ from .configuration_qwen2_moe import Qwen2MoeConfig +++++ +++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++ self.variance_epsilon = eps +++++ +++++ def forward(self, hidden_states): ++++++ # @dwj ++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++ # @lwx ++++++ # if not self.training : ++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++ input_dtype = hidden_states.dtype +++++ hidden_states = hidden_states.to(mindspore.float32) +++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++@@ -234,6 +239,8 @@ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++ x1 = x[..., : x.shape[-1] // 2] +++++ x2 = x[..., x.shape[-1] // 2 :] ++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++ self.config = config +++++ self.hidden_size = config.hidden_size +++++ self.intermediate_size = intermediate_size ++++++ +++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++ self.act_fn = ACT2FN[config.hidden_act] +++++ +++++ def forward(self, x): +++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++- +++++ ++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++ # @lwx ++++++ # gate_up_output = self.gate_up_proj(x) ++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++++ # return self.down_proj(swiglu_output) ++++++ ++++++ # def forward(self, x): ++++++ # gate_proj_out = self.gate_proj(x) ++++++ # up_proj_out = self.up_proj(x) ++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++++ # return self.down_proj(swiglu_out) ++++++ +++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++ """ +++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++ use_cache: bool = False, +++++ cache_position: Optional[mindspore.Tensor] = None, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ ++++++ +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ query_states = self.q_proj(hidden_states) +++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++ "with a layer index." +++++ ) +++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ if isinstance(past_key_value, StaticCache): ++++++ kv_seq_len = key_states.shape[-2] ++++++ else: ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++ if isinstance(past_key_value, StaticCache): ++++++ kv_seq_len = key_states.shape[-2] +++++ +++++ # repeat k/v heads if n_kv_heads < n_heads +++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++- ++++++ +++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++ +++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++- raise ValueError( +++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++- f" {attn_weights.shape}" +++++- ) +++++- +++++- if attention_mask is not None: # no matter the length, we just slice it +++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++++ if attention_mask is not None: ++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++ attn_weights = attn_weights + causal_mask +++++ +++++ # upcast attention to fp32 +++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++ +++++ attn_output = self.o_proj(attn_output) +++++- ++++++ # @lwx ++++++ ++++++ # max_seq_len = self.max_position_embeddings # 2048 ++++++ ++++++ # if attention_mask is not None: ++++++ # # attention_mask: [B, 1, Sq, Sk] ++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++ ++++++ # # pad 到 [max_seq_len, max_seq_len] ++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++ # global_attention_mask = padded_mask ++++++ # else: ++++++ # global_attention_mask = None ++++++ ++++++ ++++++ # sparse_mode=3 ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # real_shift=None, ++++++ # padding_mask=None, ++++++ ++++++ # head_num=self.num_heads, ++++++ # attn_mask=global_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ # input_layout="BNSD", ++++++ # pre_tokens=2147483647, ++++++ # next_tokens=2147483647, ++++++ # inner_precise=0, ++++++ # drop_mask=None, ++++++ # prefix=None, ++++++ # actual_seq_qlen=None, ++++++ # actual_seq_kvlen=None, ++++++ # sparse_mode=sparse_mode, ++++++ # ) +++++ if not output_attentions: +++++ attn_weights = None +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++ ++++++class Qwen2MoeFlashAttention(nn.Module): ++++++ """ ++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++ ++++++ 关键改动: ++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++ 直接传入原始的 key 和 value 张量效率更高。 ++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++ """ ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++ super().__init__() ++++++ self.config = config ++++++ self.layer_idx = layer_idx ++++++ self.hidden_size = config.hidden_size ++++++ self.num_heads = config.num_attention_heads ++++++ self.head_dim = self.hidden_size // self.num_heads ++++++ self.num_key_value_heads = config.num_key_value_heads ++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++ self.max_position_embeddings = config.max_position_embeddings ++++++ self.rope_theta = config.rope_theta ++++++ self.attention_dropout = config.attention_dropout ++++++ ++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++ raise ValueError( ++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++ ) ++++++ ++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++ ++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++ self.head_dim, ++++++ max_position_embeddings=self.max_position_embeddings, ++++++ base=self.rope_theta, ++++++ ) ++++++ ++++++ def forward( ++++++ self, ++++++ hidden_states: mindspore.Tensor, ++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++ past_key_value: Optional[Cache] = None, ++++++ output_attentions: bool = False, ++++++ use_cache: bool = False, ++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # 1. 线性投射 Q, K, V ++++++ query_states = self.q_proj(hidden_states) ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # 3. RoPE 旋转位置编码 ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++ if self.layer_idx is None: ++++++ raise ValueError( ++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ "with a layer index." ++++++ ) ++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++ if cache_position.shape[0] == 1: ++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++ kv_seq_len = past_seen_tokens + 1 ++++++ else: ++++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++ else: ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # 4. KV 缓存更新 ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ key_states, value_states = past_key_value.update( ++++++ key_states, value_states, self.layer_idx, cache_kwargs ++++++ ) ++++++ ++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++ if cache_position.shape[0] == 1: ++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++ kv_seq_len = key_states.shape[-2] ++++++ ++++++ # 5. [重要] 准备 Attention Mask ++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++ fa_attention_mask = None ++++++ if attention_mask is not None: ++++++ # 截取与当前key长度匹配的部分 ++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++ fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++ input_dtype = query_states.dtype ++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++ query_states = query_states.to(mindspore.float16) ++++++ key_states = key_states.to(mindspore.float16) ++++++ value_states = value_states.to(mindspore.float16) ++++++ ++++++ # 6. [核心] 调用 flash_attention_score 算子 ++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++ attn_output = mindspore.ops.flash_attention_score( ++++++ query=query_states, ++++++ key=key_states, ++++++ value=value_states, ++++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++++ attn_mask=fa_attention_mask, ++++++ keep_prob=1.0 - self.attention_dropout, ++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ input_layout="BNSD", ++++++ sparse_mode=0 # 使用 defaultMask 模式 ++++++ ) ++++++ ++++++ # 恢复原始数据类型 ++++++ attn_output = attn_output.to(input_dtype) ++++++ ++++++ # 7. 调整输出形状 ++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = self.o_proj(attn_output) ++++++ ++++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++++ attn_weights = None ++++++ if output_attentions: ++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ # def forward( ++++++ # self, ++++++ # hidden_states: mindspore.Tensor, ++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++ # past_key_value: Optional[Cache] = None, ++++++ # output_attentions: bool = False, ++++++ # use_cache: bool = False, ++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ # bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # # 1. 线性投射 Q, K, V ++++++ # query_states = self.q_proj(hidden_states) ++++++ # key_states = self.k_proj(hidden_states) ++++++ # value_states = self.v_proj(hidden_states) ++++++ ++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # # 3. RoPE 旋转位置编码 ++++++ # kv_seq_len = key_states.shape[-2] ++++++ # if past_key_value is not None: ++++++ # if self.layer_idx is None: ++++++ # raise ValueError( ++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ # "with a layer index." ++++++ # ) ++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # # 4. KV 缓存更新 ++++++ # if past_key_value is not None: ++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ # key_states, value_states = past_key_value.update( ++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++ # ) ++++++ ++++++ # # 5. 准备 Attention Mask ++++++ # fa_attention_mask = None ++++++ # if attention_mask is not None: ++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++ # input_dtype = query_states.dtype ++++++ ++++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # head_num=self.num_heads, ++++++ # attn_mask=fa_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ # input_layout="BNSD", ++++++ # sparse_mode=0, ++++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++ # inner_precise=1 ++++++ # ) ++++++ ++++++ # # 恢复原始数据类型 ++++++ # attn_output = attn_output.to(input_dtype) ++++++ ++++++ # # 7. 调整输出形状 ++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ # attn_output = self.o_proj(attn_output) ++++++ ++++++ # attn_weights = None ++++++ # if output_attentions: ++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++ # return attn_output, attn_weights, past_key_value ++++++ ++++++ # def forward( ++++++ # self, ++++++ # hidden_states: mindspore.Tensor, ++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++ # past_key_value: Optional[Cache] = None, ++++++ # output_attentions: bool = False, ++++++ # use_cache: bool = False, ++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ # bsz, q_len, _ = hidden_states.shape ++++++ ++++++ # query_states = self.q_proj(hidden_states) ++++++ # key_states = self.k_proj(hidden_states) ++++++ # value_states = self.v_proj(hidden_states) ++++++ ++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ # kv_seq_len = key_states.shape[-2] ++++++ # if past_key_value is not None: ++++++ # if self.layer_idx is None: ++++++ # raise ValueError("`layer_idx` must be specified for caching") ++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ # if past_key_value is not None: ++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ # key_states, value_states = past_key_value.update( ++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++ # ) ++++++ ++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++ ++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++ # query_states = query_states / math.sqrt(self.head_dim) ++++++ # # <--- 修改结束 --- ++++++ ++++++ # fa_attention_mask = None ++++++ # if attention_mask is not None: ++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ # fa_attention_mask = (mask_slice != 0) ++++++ ++++++ # input_dtype = query_states.dtype ++++++ ++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++ # query=query_states, # 传入已经预先缩放过的 query ++++++ # key=key_states, ++++++ # value=value_states, ++++++ # head_num=self.num_heads, ++++++ # attn_mask=fa_attention_mask, ++++++ # keep_prob=1.0 - self.attention_dropout, ++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++ # input_layout="BNSD", ++++++ # sparse_mode=0, ++++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++++ # ) ++++++ ++++++ # attn_output = attn_output.to(input_dtype) ++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ # attn_output = self.o_proj(attn_output) ++++++ ++++++ # attn_weights = None ++++++ # if output_attentions: ++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++ ++++++ # return attn_output, attn_weights, past_key_value ++++++ +++++ QWEN2MOE_ATTENTION_CLASSES = { +++++ "eager": Qwen2MoeAttention, ++++++ "flash-attention": Qwen2MoeFlashAttention, +++++ } +++++ +++++ +++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ ++++++ #@dwj ++++++ # 只遍历激活的专家,而非全部专家 +++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- hidden_states = hidden_states.view(-1, hidden_dim) +++++- # router_logits: (batch * sequence_length, n_experts) +++++- router_logits = self.gate(hidden_states) +++++- +++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- if self.norm_topk_prob: +++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- # we cast back to the input dtype +++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++- final_hidden_states = ops.zeros( +++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++- ) +++++- +++++- # One hot encode the selected experts to create an expert mask +++++- # this will be used to easily index which expert is going to be sollicitated +++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++- +++++- # Loop over all available experts in the model and perform the computation on each expert +++++- for expert_idx in range(self.num_experts): +++++- expert_layer = self.experts[expert_idx] +++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++- +++++- # Index the correct hidden states and compute the expert hidden state for +++++- # the current expert. We need to make sure to multiply the output hidden +++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++- if 0 not in idx.shape: +++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++- +++++- # However `index_add_` only support torch tensors for indexing so we'll use +++++- # the `top_x` tensor here. +++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++- +++++- shared_expert_output = self.shared_expert(hidden_states) +++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++- +++++- final_hidden_states = final_hidden_states + shared_expert_output ++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++ num_tokens = hidden_states_reshaped.shape[0] ++++++ ++++++ router_logits = self.gate(hidden_states_reshaped) ++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++ if self.norm_topk_prob: ++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++ flat_selected_experts = selected_experts.flatten() ++++++ ++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++ token_indices = broadcasted_token_indices.flatten() ++++++ ++++++ active_experts = ops.unique(flat_selected_experts) ++++++ ++++++ for expert_idx_tensor in active_experts: ++++++ expert_idx = expert_idx_tensor.item() ++++++ expert_layer = self.experts[expert_idx] ++++++ ++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++ selected_token_indices = token_indices[mask] ++++++ selected_routing_weights = routing_weights.flatten()[mask] ++++++ ++++++ current_states = hidden_states_reshaped[selected_token_indices] ++++++ ++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++ ++++++ final_hidden_states = final_hidden_states.index_add( ++++++ dim=0, ++++++ index=selected_token_indices, ++++++ source=expert_output.to(hidden_states.dtype) ++++++ ) ++++++ ++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++ +++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++- return final_hidden_states, router_logits ++++++ final_hidden_states = final_hidden_states + shared_expert_output ++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++ ++++++ return final_hidden_states, router_logits +++++ +++++ +++++ class Qwen2MoeDecoderLayer(nn.Module): +++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++ +++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++ +++++ if (layer_idx not in config.mlp_only_layers) and ( +++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++ ): +++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++ _skip_keys_device_placement = "past_key_values" +++++ _supports_cache_class = True ++++++#lwx ++++++ # _supports_static_cache = True +++++ +++++ def _init_weights(self, module): +++++ std = self.config.initializer_range +++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++ return causal_mask +++++ +++++ +++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ _tied_weights_keys = ["lm_head.weight"] +++++ +++++ def __init__(self, config): +++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ self.num_experts_per_tok = config.num_experts_per_tok +++++ # Initialize weights and apply final processing +++++ self.post_init() ++++++ # @lwx ++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++++ # self.generation_config.cache_implementation = "static" ++++++ self._warmed_up = False ++++++ ++++++ def warmup_moe_model(self): ++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++++ test_texts = [ ++++++ "warmup short", ++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++++ ] ++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++ if tokenizer is None: ++++++ from mindnlp.transformers import AutoTokenizer ++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++ self._warmup_tokenizer = tokenizer ++++++ ++++++ for text in test_texts: ++++++ inputs = tokenizer(text, return_tensors="ms") ++++++ with mindspore._no_grad(): ++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++ +++++ def get_input_embeddings(self): +++++ return self.model.embed_tokens +++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++ ```""" ++++++ if not self._warmed_up: ++++++ self._warmed_up = True ++++++ self.warmup_moe_model() +++++ +++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++ output_router_logits = ( +++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++ } +++++ ) +++++ return model_inputs ++++++# @lwx ++++++ # def _decode_one_tokens_logits( ++++++ # self, ++++++ # cur_token: mindspore.Tensor, ++++++ # input_pos: Optional[mindspore.Tensor], ++++++ # cache_position: mindspore.Tensor, ++++++ # past_key_values: StaticCache, ++++++ # ) -> mindspore.Tensor: ++++++ # """ ++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++++ ++++++ # Args: ++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++++ # input_pos: 输入位置信息,可选 ++++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++++ ++++++ # Returns: ++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++++ # """ ++++++ # # 调用JIT编译的版本 ++++++ # return self.get_decode_one_tokens_logits( ++++++ # cur_token=cur_token, ++++++ # input_pos=input_pos, ++++++ # cache_position=cache_position, ++++++ # past_key_values=past_key_values, ++++++ # ) ++++++ ++++++ # @mindspore.jit(jit_level='O1') ++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++++ # """ ++++++ # JIT编译的函数,用于高效的单token解码 ++++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++++ ++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++++ # """ ++++++ # outputs = self.model.forward( ++++++ # input_ids=cur_token, ++++++ # position_ids=input_pos, ++++++ # cache_position=cache_position, ++++++ # past_key_values=past_key_values, ++++++ # use_cache=True, ++++++ # return_dict=False, ++++++ # ) ++++++ ++++++ # hidden_states = outputs[0] ++++++ # logits = self.lm_head.forward(hidden_states) ++++++ # logits = logits.float() ++++++ ++++++ # return logits[:, -1, :] ++++++ ++++++ # def _sample( ++++++ # self, ++++++ # input_ids: mindspore.Tensor, ++++++ # logits_processor, ++++++ # stopping_criteria, ++++++ # generation_config, ++++++ # synced_devices: bool, ++++++ # streamer=None, ++++++ # logits_warper=None, ++++++ # **model_kwargs, ++++++ # ): ++++++ # """ ++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++++ # """ ++++++ # from ...generation.logits_process import LogitsProcessorList ++++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++++ # from mindnlp.core import nn, ops, no_grad ++++++ # import numpy as np ++++++ ++++++ # # 检查是否使用 StaticCache ++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++++ # # 否则,直接调用父类方法 ++++++ # past_key_values = model_kwargs.get("past_key_values") ++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++++ ++++++ # if not isinstance(past_key_values, StaticCache): ++++++ # # 不使用 StaticCache,直接调用父类方法 ++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++++ # return super()._sample( ++++++ # input_ids=input_ids, ++++++ # logits_processor=logits_processor, ++++++ # stopping_criteria=stopping_criteria, ++++++ # generation_config=generation_config, ++++++ # synced_devices=synced_devices, ++++++ # streamer=streamer, ++++++ # logits_warper=logits_warper, ++++++ # **model_kwargs, ++++++ # ) ++++++ ++++++ # # 使用 StaticCache,进入自定义循环 ++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++++ # pad_token_id = generation_config._pad_token_tensor ++++++ # output_attentions = generation_config.output_attentions ++++++ # output_hidden_states = generation_config.output_hidden_states ++++++ # output_scores = generation_config.output_scores ++++++ # output_logits = generation_config.output_logits ++++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++++ # max_length = generation_config.max_length ++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++++ # do_sample = generation_config.do_sample ++++++ ++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++++ # raise ValueError( ++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++++ # f"{logits_warper})." ++++++ # ) ++++++ ++++++ # # init attention / hidden states / scores tuples ++++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++++ ++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++++ # encoder_hidden_states = ( ++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++++ # ) ++++++ ++++++ # # keep track of which sequences are already finished ++++++ # batch_size, cur_len = input_ids.shape ++++++ # this_peer_finished = False ++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++++ ++++++ # time_record = [] ++++++ # from ....utils.testing_utils import parse_flag_from_env ++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++++ ++++++ # while self._has_unfinished_sequences( ++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++++ # ): ++++++ # if _record_time: ++++++ # import time as time_module ++++++ # infer_start = time_module.time() ++++++ ++++++ # # prepare model inputs ++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++++ ++++++ # # prepare variable output controls ++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++++ ++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++++ # cur_cache_position = model_inputs.get("cache_position") ++++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++++ # cur_input_ids = model_inputs.get("input_ids") ++++++ ++++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++++ # cur_cache_position is not None and ++++++ # len(cur_cache_position.shape) > 0 and ++++++ # cur_cache_position.shape[0] == 1 and ++++++ # cur_input_ids is not None and ++++++ # cur_input_ids.shape[1] == 1): ++++++ # # 使用 JIT 优化的单 token 解码 ++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++++ # if not hasattr(self, '_jit_used'): ++++++ # self._jit_used = False ++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++++ ++++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++++ # cur_token=cur_input_ids, ++++++ # input_pos=model_inputs.get("position_ids"), ++++++ # cache_position=cur_cache_position, ++++++ # past_key_values=cur_past_key_values, ++++++ # ) ++++++ ++++++ # # 标记已使用JIT(用于后续判断) ++++++ # if not self._jit_used: ++++++ # self._jit_used = True ++++++ ++++++ # # 构造兼容的输出对象 ++++++ # class JitOptimizedOutput: ++++++ # def __init__(self, logits, config): ++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++++ # self.config = config ++++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++++ # self.attentions = None if not config.is_encoder_decoder else None ++++++ # self.cross_attentions = None ++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++++ ++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++++ # else: ++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++++ # outputs = self(**model_inputs, return_dict=True) ++++++ ++++++ # if synced_devices and this_peer_finished: ++++++ # continue ++++++ ++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++++ # next_token_logits = outputs.logits[:, -1, :] ++++++ ++++++ # # pre-process distribution ++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++++ # if do_sample: ++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++++ ++++++ # # Store scores, attentions and hidden_states when required ++++++ # if return_dict_in_generate: ++++++ # if output_scores: ++++++ # scores += (next_token_scores,) ++++++ # if output_logits: ++++++ # raw_logits += (next_token_logits,) ++++++ # if output_attentions: ++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++++ # if self.config.is_encoder_decoder: ++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++++ ++++++ # if output_hidden_states: ++++++ # hidden = ( ++++++ # outputs.decoder_hidden_states ++++++ # if self.config.is_encoder_decoder ++++++ # else outputs.hidden_states ++++++ # ) ++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++++ ++++++ # # token selection ++++++ # if do_sample: ++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++++ # else: ++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++++ ++++++ # # finished sentences should have their next token be a padding token ++++++ # if has_eos_stopping_criteria: ++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++++ ++++++ # # update generated ids, model inputs, and length for next step ++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++++ # if streamer is not None: ++++++ # streamer.put(next_tokens) ++++++ ++++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++++ # outputs, ++++++ # model_kwargs, ++++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++++ # ) ++++++ ++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++++ # cur_len += 1 ++++++ ++++++ # if _record_time: ++++++ # import time as time_module ++++++ # infer_stop = time_module.time() ++++++ # time_record.append(infer_stop - infer_start) ++++++ ++++++ # del outputs ++++++ ++++++ # average_infer_time = None ++++++ # if time_record: ++++++ # if len(time_record) > 1: ++++++ # time_record.pop(0) ++++++ # average_infer_time = sum(time_record) / len(time_record) ++++++ # print(f'average inference time is: {average_infer_time}') ++++++ # print(f'inference time record: {time_record}') ++++++ ++++++ # if streamer is not None: ++++++ # streamer.end() ++++++ ++++++ # # 简单判断:打印是否使用了JIT路径 ++++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++++ # else: ++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++++ ++++++ # if return_dict_in_generate: ++++++ # if self.config.is_encoder_decoder: ++++++ # return GenerateEncoderDecoderOutput( ++++++ # sequences=input_ids, ++++++ # scores=scores, ++++++ # logits=raw_logits, ++++++ # encoder_attentions=encoder_attentions, ++++++ # encoder_hidden_states=encoder_hidden_states, ++++++ # decoder_attentions=decoder_attentions, ++++++ # cross_attentions=cross_attentions, ++++++ # decoder_hidden_states=decoder_hidden_states, ++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++ # average_infer_time=average_infer_time ++++++ # ) ++++++ # else: ++++++ # return GenerateDecoderOnlyOutput( ++++++ # sequences=input_ids, ++++++ # scores=scores, ++++++ # logits=raw_logits, ++++++ # attentions=decoder_attentions, ++++++ # hidden_states=decoder_hidden_states, ++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++ # average_infer_time=average_infer_time ++++++ # ) ++++++ # else: ++++++ # return input_ids ++++++ ++++++ # def _prepare_cache_for_generation( ++++++ # self, ++++++ # generation_config, ++++++ # model_kwargs, ++++++ # assistant_model, ++++++ # batch_size, ++++++ # max_cache_length, ++++++ # ): ++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++++ # generation_config.cache_implementation = "static" ++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++++ ++++++ # if generation_config.cache_implementation == "static": ++++++ # base_required_from_max_length = generation_config.max_length + 1 ++++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++++ # min_cache_size = 50 ++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++++ # else: ++++++ # max_cache_length = max(base_required, min_cache_size) ++++++ ++++++ # original_max_cache_length = max_cache_length ++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++++ # print(f" - final max_cache_length: {max_cache_length}") ++++++ ++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++ # if max_cache_length > self.config.max_position_embeddings: ++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++++ ++++++ # result = super()._prepare_cache_for_generation( ++++++ # generation_config=generation_config, ++++++ # model_kwargs=model_kwargs, ++++++ # assistant_model=assistant_model, ++++++ # batch_size=batch_size, ++++++ # max_cache_length=max_cache_length, ++++++ # ) ++++++ ++++++ # if generation_config.cache_implementation == "static": ++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++++ # created_cache = model_kwargs.get(cache_name) ++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++++ # if created_cache.max_cache_len < generation_config.max_length: ++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++++ ++++++ # return result ++++++ ++++++ ++++++ +++++ +++++ +++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++-- +++++2.27.0 +++++ ++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++++new file mode 100644 ++++index 00000000..22b65dd5 ++++--- /dev/null +++++++ b/patches/0002-20251106commit.patch ++++@@ -0,0 +1,3200 @@ +++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Thu, 6 Nov 2025 09:20:38 +0800 +++++Subject: [PATCH 2/3] 20251106commit +++++ +++++--- +++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +++++ 3 files changed, 2689 insertions(+), 305 deletions(-) +++++ create mode 100644 patches/0001-20251104commit.patch +++++ +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index d8303e45..73773c22 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +++++ # y = y + self.shared_experts(identity) +++++ # return y +++++ ++++++ # @no_grad() ++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ ++++++ # expert_cache = ops.zeros_like(x) ++++++ # for i in range(self.num_experts_per_tok): ++++++ # expert_id = flat_expert_indices[i].item() ++++++ # weight = flat_expert_weights[i].item() ++++++ # expert = self.experts[expert_id] ++++++ # expert_out = expert(x) ++++++ # expert_cache += expert_out * weight ++++++ # return expert_cache ++++++ +++++ @no_grad() +++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ # x 的 shape: (1, hidden_size) ++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++++ ++++++ # 1. 收集所有需要的专家层 ++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++++++ ++++++ # 2. 并行计算所有专家的输出 ++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++++ # ops.cat 会将它们堆叠成一个新的 Tensor ++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++++ ++++++ # 3. 使用矩阵乘法进行加权求和 ++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++ # 最终结果 final_output 的 shape: (1, hidden_size) ++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++++ ++++++ return final_output +++++ +++++- expert_cache = ops.zeros_like(x) +++++- for i in range(self.num_experts_per_tok): +++++- expert_id = flat_expert_indices[i].item() +++++- weight = flat_expert_weights[i].item() +++++- expert = self.experts[expert_id] +++++- expert_out = expert(x) +++++- expert_cache += expert_out * weight +++++- return expert_cache +++++ +++++ @no_grad() +++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++ # @lwx ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) ++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++++ +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +++++ return attn_output, attn_weights, past_key_value +++++ +++++ ++++++# class DeepseekFlashAttention(nn.Module): ++++++# """ ++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++++++ ++++++# This class is designed as a drop-in replacement for DeepseekAttention. ++++++# """ ++++++ ++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++++# super().__init__() ++++++# self.config = config ++++++# self.layer_idx = layer_idx ++++++# if layer_idx is None: ++++++# logger.warning( ++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++# "when creating this class." ++++++# ) ++++++ ++++++# self.attention_dropout = config.attention_dropout ++++++# self.hidden_size = config.hidden_size ++++++# self.num_heads = config.num_attention_heads ++++++# self.head_dim = self.hidden_size // self.num_heads ++++++# self.num_key_value_heads = config.num_key_value_heads ++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++# self.max_position_embeddings = config.max_position_embeddings ++++++# self.rope_theta = config.rope_theta ++++++# self.is_causal = True ++++++ ++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++# raise ValueError( ++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++# f" and `num_heads`: {self.num_heads})." ++++++# ) ++++++ ++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++++# self._init_rope() ++++++ ++++++# def _init_rope(self): ++++++# if self.config.rope_scaling is None: ++++++# self.rotary_emb = DeepseekRotaryEmbedding( ++++++# self.head_dim, ++++++# max_position_embeddings=self.max_position_embeddings, ++++++# base=self.rope_theta, ++++++# ) ++++++# else: ++++++# scaling_type = self.config.rope_scaling["type"] ++++++# scaling_factor = self.config.rope_scaling["factor"] ++++++# if scaling_type == "linear": ++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++++# self.head_dim, ++++++# max_position_embeddings=self.max_position_embeddings, ++++++# scaling_factor=scaling_factor, ++++++# base=self.rope_theta, ++++++# ) ++++++# elif scaling_type == "dynamic": ++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++++# self.head_dim, ++++++# max_position_embeddings=self.max_position_embeddings, ++++++# scaling_factor=scaling_factor, ++++++# base=self.rope_theta, ++++++# ) ++++++# else: ++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++++ ++++++# def forward( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++# past_key_value: Optional[Cache] = None, ++++++# output_attentions: bool = False, ++++++# use_cache: bool = False, ++++++# **kwargs, ++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++# if "padding_mask" in kwargs: ++++++# warnings.warn( ++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++++# ) ++++++ ++++++# if output_attentions: ++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++++++ ++++++# bsz, q_len, _ = hidden_states.shape ++++++ ++++++# if self.config.pretraining_tp > 1: ++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++++ ++++++# query_states = self.q_proj(hidden_states) ++++++# key_states = self.k_proj(hidden_states) ++++++# value_states = self.v_proj(hidden_states) ++++++ ++++++# # Reshape for multi-head attention ++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++# kv_seq_len = key_states.shape[-2] ++++++# if past_key_value is not None: ++++++# if self.layer_idx is None: ++++++# raise ValueError( ++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++# "with a layer index." ++++++# ) ++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++# # Apply Rotary Positional Embedding ++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++# if past_key_value is not None: ++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ ++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++ ++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++ ++++++# # Convert attention_mask for flash_attention_score ++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++++++# if attention_mask is not None: ++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++++# raise ValueError( ++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++++# ) ++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++++++# else: ++++++# attn_mask_for_fa = None ++++++ ++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++++ ++++++# # Call the fused flash_attention_score operator ++++++# attn_output = mindspore.ops.flash_attention_score( ++++++# query=query_states_for_fa, ++++++# key=key_states_for_fa, ++++++# value=value_states_for_fa, ++++++# head_num=self.num_heads, # This is N1, the number of query heads ++++++# input_layout='BSH', ++++++# attn_mask=attn_mask_for_fa, ++++++# keep_prob=keep_prob, ++++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++++# sparse_mode=0 # Default mask mode ++++++# ) ++++++ ++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++++++# attn_output = self.o_proj(attn_output) ++++++ ++++++# # Flash Attention does not return attention weights ++++++# attn_weights = None ++++++ ++++++# return attn_output, attn_weights, past_key_value ++++++ ++++++class DeepseekFlashAttention(nn.Module): ++++++ """ ++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. ++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, ++++++ designed for high performance on supported hardware (Ascend). ++++++ ++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. ++++++ """ ++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++++ super().__init__() ++++++ self.config = config ++++++ self.layer_idx = layer_idx ++++++ if layer_idx is None: ++++++ logger.warning( ++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++ "when creating this class." ++++++ ) ++++++ ++++++ # --- [FIX] Correctly initialize all required attributes --- ++++++ self.attention_dropout = config.attention_dropout ++++++ self.hidden_size = config.hidden_size ++++++ self.num_heads = config.num_attention_heads ++++++ self.head_dim = self.hidden_size // self.num_heads ++++++ self.num_key_value_heads = config.num_key_value_heads ++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++ self.max_position_embeddings = config.max_position_embeddings ++++++ self.rope_theta = config.rope_theta ++++++ self.is_causal = True ++++++ ++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++ raise ValueError( ++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++ f" and `num_heads`: {self.num_heads})." ++++++ ) ++++++ ++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++++ ++++++ # This call will now succeed as all attributes are initialized. ++++++ self._init_rope() ++++++ ++++++ def _init_rope(self): ++++++ if self.config.rope_scaling is None: ++++++ self.rotary_emb = DeepseekRotaryEmbedding( ++++++ self.head_dim, ++++++ max_position_embeddings=self.max_position_embeddings, ++++++ base=self.rope_theta, ++++++ ) ++++++ else: ++++++ scaling_type = self.config.rope_scaling["type"] ++++++ scaling_factor = self.config.rope_scaling["factor"] ++++++ if scaling_type == "linear": ++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++++ self.head_dim, ++++++ max_position_embeddings=self.max_position_embeddings, ++++++ scaling_factor=scaling_factor, ++++++ base=self.rope_theta, ++++++ ) ++++++ elif scaling_type == "dynamic": ++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++++ self.head_dim, ++++++ max_position_embeddings=self.max_position_embeddings, ++++++ scaling_factor=scaling_factor, ++++++ base=self.rope_theta, ++++++ ) ++++++ else: ++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++++ ++++++ def forward( ++++++ self, ++++++ hidden_states: mindspore.Tensor, ++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++ past_key_value: Optional[Cache] = None, ++++++ output_attentions: bool = False, ++++++ use_cache: bool = False, ++++++ **kwargs, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ if "padding_mask" in kwargs: ++++++ warnings.warn( ++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++++ ) ++++++ if output_attentions: ++++++ warnings.warn( ++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." ++++++ ) ++++++ ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ if self.config.pretraining_tp > 1: ++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++++ ++++++ query_states = self.q_proj(hidden_states) ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++ # Apply Rotary Position Embedding ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos} ++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. ++++++ # So we must explicitly repeat the KV heads. ++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++ ++++++ # Convert attention mask for flash_attention_score ++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. ++++++ if attention_mask is not None: ++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++++ raise ValueError( ++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++++ ) ++++++ attn_mask_for_fa = attention_mask < 0 ++++++ else: ++++++ attn_mask_for_fa = None ++++++ ++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++++ ++++++ # Call the fused operator using the efficient BNSD layout ++++++ attn_output = mindspore.ops.flash_attention_score( ++++++ query=query_states, ++++++ key=key_states, ++++++ value=value_states, ++++++ head_num=self.num_heads, ++++++ input_layout='BNSD', # Specify the correct layout ++++++ attn_mask=attn_mask_for_fa, ++++++ keep_prob=keep_prob, ++++++ scalar_value=1.0 / math.sqrt(self.head_dim) ++++++ ) ++++++ ++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. ++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ ++++++ # Apply output projection ++++++ attn_output = self.o_proj(attn_output) ++++++ ++++++ # Flash attention does not return attention weights, so we return None. ++++++ attn_weights = None ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ +++++ Deepseek_ATTENTION_CLASSES = { +++++ "eager": DeepseekAttention, ++++++ "flash-attention": DeepseekFlashAttention, +++++ } +++++ +++++ +++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +++++ config=config, layer_idx=layer_idx +++++ ) +++++ ++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++++++ config=config, layer_idx=layer_idx ++++++ ) ++++++ +++++ self.mlp = ( +++++ DeepseekMoE(config) +++++ if ( +++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++index d4c6b651..bced285c 100644 +++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +++++ +++++ import mindspore +++++ import mindnlp.core.nn.functional as F +++++-from mindnlp.core import nn, ops ++++++from mindnlp.core import nn, ops, no_grad +++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +++++ +++++ from ....common.activations import ACT2FN +++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++++ ++++++Long_Prompt = False ++++++PROMPT_LENGTH_THRESHOLD = 128 +++++ +++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++++ def _prepare_4d_causal_attention_mask_with_cache_position( +++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +++++ return attn_output, attn_weights, past_key_value +++++ +++++ ++++++# class Qwen2MoeFlashAttention(nn.Module): ++++++# """ ++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++ ++++++# 关键改动: ++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++# 直接传入原始的 key 和 value 张量效率更高。 ++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++# super().__init__() ++++++# self.config = config ++++++# self.layer_idx = layer_idx ++++++# self.hidden_size = config.hidden_size ++++++# self.num_heads = config.num_attention_heads ++++++# self.head_dim = self.hidden_size // self.num_heads ++++++# self.num_key_value_heads = config.num_key_value_heads ++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++# self.max_position_embeddings = config.max_position_embeddings ++++++# self.rope_theta = config.rope_theta ++++++# self.attention_dropout = config.attention_dropout ++++++ ++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++# raise ValueError( ++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++# ) ++++++ ++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++ ++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++# self.head_dim, ++++++# max_position_embeddings=self.max_position_embeddings, ++++++# base=self.rope_theta, ++++++# ) ++++++ ++++++# def forward( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++# past_key_value: Optional[Cache] = None, ++++++# output_attentions: bool = False, ++++++# use_cache: bool = False, ++++++# cache_position: Optional[mindspore.Tensor] = None, ++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++# bsz, q_len, _ = hidden_states.shape ++++++ ++++++# # 1. 线性投射 Q, K, V ++++++# query_states = self.q_proj(hidden_states) ++++++# key_states = self.k_proj(hidden_states) ++++++# value_states = self.v_proj(hidden_states) ++++++ ++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++# # query: [B, S, H*D] -> [B, N1, S, D] ++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++# # 3. RoPE 旋转位置编码 ++++++# kv_seq_len = key_states.shape[-2] ++++++# if past_key_value is not None: ++++++# if self.layer_idx is None: ++++++# raise ValueError( ++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++# "with a layer index." ++++++# ) ++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++# if cache_position.shape[0] == 1: ++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++# kv_seq_len = past_seen_tokens + 1 ++++++# else: ++++++# # prefill 阶段:cache_position 是范围,使用其长度 ++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++# else: ++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++# # 4. KV 缓存更新 ++++++# if past_key_value is not None: ++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++# key_states, value_states = past_key_value.update( ++++++# key_states, value_states, self.layer_idx, cache_kwargs ++++++# ) ++++++ ++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++# if cache_position.shape[0] == 1: ++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++# kv_seq_len = key_states.shape[-2] ++++++ ++++++# # 5. [重要] 准备 Attention Mask ++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++# fa_attention_mask = None ++++++# if attention_mask is not None: ++++++# # 截取与当前key长度匹配的部分 ++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++# fa_attention_mask = (mask_slice != 0) ++++++ ++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++# input_dtype = query_states.dtype ++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++# query_states = query_states.to(mindspore.float16) ++++++# key_states = key_states.to(mindspore.float16) ++++++# value_states = value_states.to(mindspore.float16) ++++++ ++++++# # 6. [核心] 调用 flash_attention_score 算子 ++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++# attn_output = mindspore.ops.flash_attention_score( ++++++# query=query_states, ++++++# key=key_states, ++++++# value=value_states, ++++++# head_num=self.num_heads, # 传入Q的头数(N1) ++++++# attn_mask=fa_attention_mask, ++++++# keep_prob=1.0 - self.attention_dropout, ++++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++++# input_layout="BNSD", ++++++# sparse_mode=0 # 使用 defaultMask 模式 ++++++# ) ++++++ ++++++# # 恢复原始数据类型 ++++++# attn_output = attn_output.to(input_dtype) ++++++ ++++++# # 7. 调整输出形状 ++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++# attn_output = self.o_proj(attn_output) ++++++ ++++++# # FlashAttention 算子不直接返回注意力权重矩阵 ++++++# attn_weights = None ++++++# if output_attentions: ++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++# return attn_output, attn_weights, past_key_value ++++++ ++++++# # def forward( ++++++# # self, ++++++# # hidden_states: mindspore.Tensor, ++++++# # attention_mask: Optional[mindspore.Tensor] = None, ++++++# # position_ids: Optional[mindspore.Tensor] = None, ++++++# # past_key_value: Optional[Cache] = None, ++++++# # output_attentions: bool = False, ++++++# # use_cache: bool = False, ++++++# # cache_position: Optional[mindspore.Tensor] = None, ++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++# # bsz, q_len, _ = hidden_states.shape ++++++ ++++++# # # 1. 线性投射 Q, K, V ++++++# # query_states = self.q_proj(hidden_states) ++++++# # key_states = self.k_proj(hidden_states) ++++++# # value_states = self.v_proj(hidden_states) ++++++ ++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ ++++++# # # 3. RoPE 旋转位置编码 ++++++# # kv_seq_len = key_states.shape[-2] ++++++# # if past_key_value is not None: ++++++# # if self.layer_idx is None: ++++++# # raise ValueError( ++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++# # "with a layer index." ++++++# # ) ++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++# # # 4. KV 缓存更新 ++++++# # if past_key_value is not None: ++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++# # key_states, value_states = past_key_value.update( ++++++# # key_states, value_states, self.layer_idx, cache_kwargs ++++++# # ) ++++++ ++++++# # # 5. 准备 Attention Mask ++++++# # fa_attention_mask = None ++++++# # if attention_mask is not None: ++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++# # fa_attention_mask = (mask_slice != 0) ++++++ ++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++# # input_dtype = query_states.dtype ++++++ ++++++# # # 6. [核心] 调用 flash_attention_score 算子 ++++++# # attn_output = mindspore.ops.flash_attention_score( ++++++# # query=query_states, ++++++# # key=key_states, ++++++# # value=value_states, ++++++# # head_num=self.num_heads, ++++++# # attn_mask=fa_attention_mask, ++++++# # keep_prob=1.0 - self.attention_dropout, ++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++# # input_layout="BNSD", ++++++# # sparse_mode=0, ++++++# # # <--- 修改点 2: 启用内部高精度计算 --- ++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++# # inner_precise=1 ++++++# # ) ++++++ ++++++# # # 恢复原始数据类型 ++++++# # attn_output = attn_output.to(input_dtype) ++++++ ++++++# # # 7. 调整输出形状 ++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++# # attn_output = self.o_proj(attn_output) ++++++ ++++++# # attn_weights = None ++++++# # if output_attentions: ++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ ++++++# # return attn_output, attn_weights, past_key_value ++++++ ++++++ +++++ class Qwen2MoeFlashAttention(nn.Module): +++++ """ +++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++- +++++- 关键改动: +++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++- 直接传入原始的 key 和 value 张量效率更高。 +++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 ++++++ ++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` ++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, ++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, ++++++ 以达到理论上的最高执行速度。 +++++ """ +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++ super().__init__() +++++ self.config = config +++++ self.layer_idx = layer_idx ++++++ if layer_idx is None: ++++++ logger.warning_once( ++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." ++++++ ) ++++++ +++++ self.hidden_size = config.hidden_size +++++ self.num_heads = config.num_attention_heads +++++ self.head_dim = self.hidden_size // self.num_heads +++++ self.num_key_value_heads = config.num_key_value_heads +++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++ self.max_position_embeddings = config.max_position_embeddings +++++ self.rope_theta = config.rope_theta +++++ self.attention_dropout = config.attention_dropout +++++ +++++- if (self.head_dim * self.num_heads) != self.hidden_size: +++++- raise ValueError( +++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++- ) +++++- +++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++- # query: [B, S, H*D] -> [B, N1, S, D] +++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++ # 2. 调整形状以匹配 BNSD 布局 +++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- +++++- # 3. RoPE 旋转位置编码 ++++++ ++++++ # 3. RoPE 和 KV 缓存 +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++- if self.layer_idx is None: +++++- raise ValueError( +++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++- "with a layer index." +++++- ) +++++- # 对于 StaticCache,需要特殊处理 kv_seq_len +++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++- if cache_position.shape[0] == 1: +++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++- kv_seq_len = past_seen_tokens + 1 +++++- else: +++++- # prefill 阶段:cache_position 是范围,使用其长度 +++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++- else: +++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++- ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++- # 4. KV 缓存更新 +++++ if past_key_value is not None: +++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++- key_states, value_states = past_key_value.update( +++++- key_states, value_states, self.layer_idx, cache_kwargs +++++- ) +++++- +++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++- if cache_position.shape[0] == 1: +++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++- kv_seq_len = key_states.shape[-2] +++++- +++++- # 5. [重要] 准备 Attention Mask +++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++ # 4. 准备 Attention Mask +++++ fa_attention_mask = None +++++ if attention_mask is not None: +++++- # 截取与当前key长度匹配的部分 +++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++- # 转换为布尔类型: 大负数 -> True, 0 -> False +++++ fa_attention_mask = (mask_slice != 0) +++++ +++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++- input_dtype = query_states.dtype +++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++- query_states = query_states.to(mindspore.float16) +++++- key_states = key_states.to(mindspore.float16) +++++- value_states = value_states.to(mindspore.float16) +++++- +++++- # 6. [核心] 调用 flash_attention_score 算子 +++++- # - 无需手动 repeat_kv, 算子原生支持 GQA +++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +++++ attn_output = mindspore.ops.flash_attention_score( +++++ query=query_states, +++++ key=key_states, +++++ value=value_states, +++++- head_num=self.num_heads, # 传入Q的头数(N1) ++++++ head_num=self.num_heads, +++++ attn_mask=fa_attention_mask, +++++- keep_prob=1.0 - self.attention_dropout, ++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++ input_layout="BNSD", +++++- sparse_mode=0 # 使用 defaultMask 模式 ++++++ sparse_mode=0, ++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +++++ ) +++++ +++++- # 恢复原始数据类型 +++++- attn_output = attn_output.to(input_dtype) +++++- +++++- # 7. 调整输出形状 +++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++ # 6. 调整输出形状 +++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++ attn_output = self.o_proj(attn_output) +++++ +++++- # FlashAttention 算子不直接返回注意力权重矩阵 ++++++ # 7. 返回结果 +++++ attn_weights = None +++++ if output_attentions: +++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++- # def forward( +++++- # self, +++++- # hidden_states: mindspore.Tensor, +++++- # attention_mask: Optional[mindspore.Tensor] = None, +++++- # position_ids: Optional[mindspore.Tensor] = None, +++++- # past_key_value: Optional[Cache] = None, +++++- # output_attentions: bool = False, +++++- # use_cache: bool = False, +++++- # cache_position: Optional[mindspore.Tensor] = None, +++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++- +++++- # bsz, q_len, _ = hidden_states.shape +++++- +++++- # # 1. 线性投射 Q, K, V +++++- # query_states = self.q_proj(hidden_states) +++++- # key_states = self.k_proj(hidden_states) +++++- # value_states = self.v_proj(hidden_states) +++++- +++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- +++++- # # 3. RoPE 旋转位置编码 +++++- # kv_seq_len = key_states.shape[-2] +++++- # if past_key_value is not None: +++++- # if self.layer_idx is None: +++++- # raise ValueError( +++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++- # "with a layer index." +++++- # ) +++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++ +++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++- +++++- # # 4. KV 缓存更新 +++++- # if past_key_value is not None: +++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++- # key_states, value_states = past_key_value.update( +++++- # key_states, value_states, self.layer_idx, cache_kwargs +++++- # ) +++++- +++++- # # 5. 准备 Attention Mask +++++- # fa_attention_mask = None +++++- # if attention_mask is not None: +++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++- # fa_attention_mask = (mask_slice != 0) +++++- +++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++- # input_dtype = query_states.dtype +++++- +++++- # # 6. [核心] 调用 flash_attention_score 算子 +++++- # attn_output = mindspore.ops.flash_attention_score( +++++- # query=query_states, +++++- # key=key_states, +++++- # value=value_states, +++++- # head_num=self.num_heads, +++++- # attn_mask=fa_attention_mask, +++++- # keep_prob=1.0 - self.attention_dropout, +++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++++- # input_layout="BNSD", +++++- # sparse_mode=0, +++++- # # <--- 修改点 2: 启用内部高精度计算 --- +++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++- # inner_precise=1 +++++- # ) +++++- +++++- # # 恢复原始数据类型 +++++- # attn_output = attn_output.to(input_dtype) ++++++QWEN2MOE_ATTENTION_CLASSES = { ++++++ "eager": Qwen2MoeAttention, ++++++ "flash-attention": Qwen2MoeFlashAttention, ++++++} +++++ +++++- # # 7. 调整输出形状 +++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++- # attn_output = self.o_proj(attn_output) +++++ +++++- # attn_weights = None +++++- # if output_attentions: +++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# def __init__(self, config): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# # gating ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++ ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# #@dwj ++++++# # 只遍历激活的专家,而非全部专家 ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# num_tokens = hidden_states_reshaped.shape[0] ++++++ ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++# flat_selected_experts = selected_experts.flatten() ++++++ ++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++# token_indices = broadcasted_token_indices.flatten() ++++++ ++++++# active_experts = ops.unique(flat_selected_experts) ++++++ ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++ ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# selected_token_indices = token_indices[mask] ++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++ ++++++# current_states = hidden_states_reshaped[selected_token_indices] ++++++ ++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++ ++++++# final_hidden_states = final_hidden_states.index_add( ++++++# dim=0, ++++++# index=selected_token_indices, ++++++# source=expert_output.to(hidden_states.dtype) ++++++# ) ++++++ ++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++ +++++- # return attn_output, attn_weights, past_key_value ++++++# final_hidden_states = final_hidden_states + shared_expert_output ++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ ++++++ ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# """ ++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# # 门控网络 ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# # 专家列表 ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++# # 共享专家 ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_decode( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# """ ++++++# 【解码路径】针对 sequence_length=1 的极致优化。 ++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++++++# """ ++++++# batch_size, hidden_dim = hidden_states.shape ++++++ ++++++# expert_outputs_list = [ ++++++# ops.cat([ ++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++# ], dim=0) ++++++# for i in range(batch_size) ++++++# ] ++++++ ++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++++++# # shape: (batch_size, top_k, hidden_dim) ++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++ ++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++ ++++++# return moe_output.squeeze(1) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_prefill( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# """ ++++++# 【预填充路径】针对 sequence_length > 1 的优化。 ++++++# 按专家对 Token 进行分组,并进行批处理。 ++++++# """ ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens = hidden_states.shape[0] ++++++# flat_selected_experts = selected_experts.flatten() ++++++ ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++ ++++++# active_experts = ops.unique(flat_selected_experts) ++++++ ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++ ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# selected_token_indices = token_indices[mask] ++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++ ++++++# current_states = hidden_states[selected_token_indices] ++++++ ++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++ ++++++# moe_output = moe_output.index_add( ++++++# dim=0, ++++++# index=selected_token_indices, ++++++# source=expert_output.to(hidden_states.dtype) ++++++# ) ++++++# return moe_output ++++++ ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# """ ++++++# 顶层 forward 方法,作为智能分发器。 ++++++# """ ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++- # def forward( +++++- # self, +++++- # hidden_states: mindspore.Tensor, +++++- # attention_mask: Optional[mindspore.Tensor] = None, +++++- # position_ids: Optional[mindspore.Tensor] = None, +++++- # past_key_value: Optional[Cache] = None, +++++- # output_attentions: bool = False, +++++- # use_cache: bool = False, +++++- # cache_position: Optional[mindspore.Tensor] = None, +++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++- +++++- # bsz, q_len, _ = hidden_states.shape +++++- +++++- # query_states = self.q_proj(hidden_states) +++++- # key_states = self.k_proj(hidden_states) +++++- # value_states = self.v_proj(hidden_states) +++++- +++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- +++++- # kv_seq_len = key_states.shape[-2] +++++- # if past_key_value is not None: +++++- # if self.layer_idx is None: +++++- # raise ValueError("`layer_idx` must be specified for caching") +++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++- +++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++- +++++- # if past_key_value is not None: +++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++- # key_states, value_states = past_key_value.update( +++++- # key_states, value_states, self.layer_idx, cache_kwargs +++++- # ) ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++# moe_output = None ++++++# # 在推理时,根据序列长度选择最优路径 ++++++# if not self.training: ++++++# if sequence_length == 1: ++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++# else: ++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++# else: ++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++++++# raise NotImplementedError("Training path is not implemented.") ++++++ ++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++++++ ++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++++++ ++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ ++++++ ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# """ ++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# # 门控网络 ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# # 专家列表 ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++# # 共享专家 ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_decode( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# batch_size, _ = hidden_states.shape ++++++# expert_outputs_list = [ ++++++# ops.cat([ ++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++# ], dim=0) ++++++# for i in range(batch_size) ++++++# ] ++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++# return moe_output.squeeze(1) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_prefill( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens = hidden_states.shape[0] ++++++# flat_selected_experts = selected_experts.flatten() ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++# active_experts = ops.unique(flat_selected_experts) ++++++ ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# selected_token_indices = token_indices[mask] ++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++# current_states = hidden_states[selected_token_indices] ++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++# moe_output = moe_output.index_add( ++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++# ) ++++++# return moe_output ++++++ ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# """ ++++++# 顶层 forward 方法,作为智能分发器。 ++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++++++# """ ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ ++++++# # 1. 门控计算 (通用逻辑) ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++# # 2. 智能分发到最优 MoE 路径 ++++++# moe_output = None ++++++# if not self.training: ++++++# if sequence_length == 1: ++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++# else: ++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++# else: ++++++# raise NotImplementedError("Training path is not implemented.") ++++++ ++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++ ++++++# # 4. 合并 MoE 输出和共享专家输出 ++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++ ++++++# # 5. 恢复原始形状并返回 ++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ ++++++# prefill fastest ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# """ ++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# # 门控网络 ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# # 专家列表 ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++# # 共享专家 ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_dispatch( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# """ ++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++++++# """ ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens, _ = hidden_states.shape ++++++ ++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++++++# flat_selected_experts = selected_experts.flatten() ++++++# flat_routing_weights = routing_weights.flatten() +++++ +++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++- +++++- # # <--- 核心修改点: 手动进行高精度缩放 --- +++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++- # query_states = query_states / math.sqrt(self.head_dim) +++++- # # <--- 修改结束 --- +++++- +++++- # fa_attention_mask = None +++++- # if attention_mask is not None: +++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++- # fa_attention_mask = (mask_slice != 0) +++++- +++++- # input_dtype = query_states.dtype +++++- +++++- # attn_output = mindspore.ops.flash_attention_score( +++++- # query=query_states, # 传入已经预先缩放过的 query +++++- # key=key_states, +++++- # value=value_states, +++++- # head_num=self.num_heads, +++++- # attn_mask=fa_attention_mask, +++++- # keep_prob=1.0 - self.attention_dropout, +++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++- # input_layout="BNSD", +++++- # sparse_mode=0, +++++- # inner_precise=1 # 仍然保持内部高精度计算 +++++- # ) ++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++ +++++- # attn_output = attn_output.to(input_dtype) +++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++- # attn_output = self.o_proj(attn_output) ++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++++++# active_experts = ops.unique(flat_selected_experts) ++++++ ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++ ++++++# # 找到所有分配给该专家的 token ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++ ++++++# # 使用 mask 选取对应的 token 和权重 ++++++# current_token_indices = token_indices[mask] ++++++# current_routing_weights = flat_routing_weights[mask] ++++++# current_hidden_states = hidden_states[current_token_indices] ++++++ ++++++# # 对这些 token 进行批处理 ++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++ ++++++# # 使用 index_add 将结果精确地加回到对应位置 ++++++# moe_output = moe_output.index_add( ++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++++++# ) ++++++# return moe_output ++++++ ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# """ ++++++# 顶层 forward 方法,作为智能分发器。 ++++++# """ ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ ++++++# # 1. 门控计算 ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++ ++++++# # 2. 调用统一的 MoE 计算内核 ++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++++ +++++- # attn_weights = None +++++- # if output_attentions: +++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++# # 3. 统一处理共享专家 ++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++ ++++++# # 4. 合并输出 ++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++ ++++++# # 5. 恢复原始形状并返回 ++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ ++++++ ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# """ ++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++# 【最终高性能与高精度版】: ++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++++++# 3. 这样实现了速度和准确性的两全其美。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_decode( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# """ ++++++# 【解码路径】极致优化版:bmm + 高精度累加。 ++++++# """ ++++++# original_dtype = hidden_states.dtype ++++++# batch_size, _ = hidden_states.shape ++++++ ++++++# expert_outputs_list = [ ++++++# ops.cat([ ++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++# ], dim=0) ++++++# for i in range(batch_size) ++++++# ] ++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++ ++++++# # 在 float32 下执行 bmm,得到高精度结果 ++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++ ++++++# # 将高精度结果转换回原始数据类型 ++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++++++ ++++++# return moe_output ++++++ ++++++# @no_grad() ++++++# def _moe_infer_prefill( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# selected_experts: mindspore.Tensor, ++++++# routing_weights: mindspore.Tensor ++++++# ) -> mindspore.Tensor: ++++++# """ ++++++# 【预填充路径】与原始实现一致,结果精确。 ++++++# """ ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens, _ = hidden_states.shape ++++++# flat_selected_experts = selected_experts.flatten() ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++# active_experts = ops.unique(flat_selected_experts) ++++++ ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# selected_token_indices = token_indices[mask] ++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++# current_states = hidden_states[selected_token_indices] ++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++# moe_output = moe_output.index_add( ++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++# ) ++++++# return moe_output ++++++ ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++ +++++- # return attn_output, attn_weights, past_key_value ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++++++# # 如果模型主体是 float16,后续再转换 ++++++ ++++++# moe_output = None ++++++# if not self.training: ++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++++++# # _moe_infer_decode 内部会处理好类型转换 ++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++++++# if sequence_length == 1: ++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++# else: ++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++# else: ++++++# raise NotImplementedError("Training path is not implemented.") ++++++ ++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++ ++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ +++++ +++++-QWEN2MOE_ATTENTION_CLASSES = { +++++- "eager": Qwen2MoeAttention, +++++- "flash-attention": Qwen2MoeFlashAttention, +++++-} ++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++# """ ++++++# 【融合版】一个混合专家模块,内置两种推理策略, ++++++# 由外部全局变量 `Long_Prompt` 控制: ++++++ ++++++# - if Long_Prompt is True: 【精度优先模式】 ++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++++++# 适用于处理长序列,避免误差累积。 ++++++ ++++++# - if Long_Prompt is False: 【速度优先模式】 ++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 ++++++# """ ++++++# def __init__(self, config: Qwen2MoeConfig): ++++++# super().__init__() ++++++# self.num_experts = config.num_experts ++++++# self.top_k = config.num_experts_per_tok ++++++# self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++# self.experts = nn.ModuleList( ++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++# ) ++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++# # --- 速度优先模式的辅助函数 --- ++++++# @no_grad() ++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++# original_dtype = hidden_states.dtype ++++++# batch_size, _ = hidden_states.shape ++++++# expert_outputs_list = [ ++++++# ops.cat([ ++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++# ], dim=0) ++++++# for i in range(batch_size) ++++++# ] ++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++# weights_fp32 = routing_weights.to(mindspore.float32) ++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++# return moe_output_fp32.squeeze(1).to(original_dtype) ++++++ ++++++# @no_grad() ++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens, _ = hidden_states.shape ++++++# flat_selected_experts = selected_experts.flatten() ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++# active_experts = ops.unique(flat_selected_experts) ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# selected_token_indices = token_indices[mask] ++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++# current_states = hidden_states[selected_token_indices] ++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++# return moe_output ++++++ ++++++# # --- 精度优先模式的辅助函数 --- ++++++# @no_grad() ++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++# moe_output = ops.zeros_like(hidden_states) ++++++# num_tokens, _ = hidden_states.shape ++++++# flat_selected_experts = selected_experts.flatten() ++++++# flat_routing_weights = routing_weights.flatten() ++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++# active_experts = ops.unique(flat_selected_experts) ++++++# for expert_idx_tensor in active_experts: ++++++# expert_idx = expert_idx_tensor.item() ++++++# expert_layer = self.experts[expert_idx] ++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++# current_token_indices = token_indices[mask] ++++++# current_routing_weights = flat_routing_weights[mask] ++++++# current_hidden_states = hidden_states[current_token_indices] ++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++# return moe_output ++++++ ++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++# # 声明我们将要使用一个在模块外部定义的全局变量 ++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++++++# global Long_Prompt ++++++ ++++++# # 1. 门控计算 (所有模式通用) ++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++# router_logits = self.gate(hidden_states_reshaped) ++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++++# if self.norm_topk_prob: ++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++# moe_output = None ++++++# if not self.training: ++++++# # 根据 Long_Prompt 标志选择模式 ++++++# if Long_Prompt: ++++++# # --- 精度优先模式 --- ++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++# else: ++++++# # --- 速度优先模式 --- ++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++# if sequence_length == 1: ++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++# else: ++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++# else: ++++++# raise NotImplementedError("Training path is not implemented.") ++++++ ++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++ ++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++# return final_hidden_states, router_logits ++++++ ++++++class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ """ ++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++++++ 控制的顶级推理策略: +++++ ++++++ - if Long_Prompt is True: 【精度优先模式】 ++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 ++++++ 适用于需要严格可复现性的长序列任务。 +++++ +++++-class Qwen2MoeSparseMoeBlock(nn.Module): +++++- def __init__(self, config): ++++++ - if Long_Prompt is False: 【速度优先模式】 ++++++ 采用业界最强的性能组合: ++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 ++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 ++++++ """ ++++++ def __init__(self, config: Qwen2MoeConfig): +++++ super().__init__() +++++ self.num_experts = config.num_experts +++++ self.top_k = config.num_experts_per_tok +++++ self.norm_topk_prob = config.norm_topk_prob +++++ +++++- # gating +++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++ self.experts = nn.ModuleList( +++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++ ) +++++- +++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++ +++++- #@dwj +++++- # 只遍历激活的专家,而非全部专家 +++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++- num_tokens = hidden_states_reshaped.shape[0] +++++- +++++- router_logits = self.gate(hidden_states_reshaped) +++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- +++++- if self.norm_topk_prob: +++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++- flat_selected_experts = selected_experts.flatten() +++++- +++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++- token_indices = broadcasted_token_indices.flatten() +++++- +++++- active_experts = ops.unique(flat_selected_experts) +++++- +++++- for expert_idx_tensor in active_experts: +++++- expert_idx = expert_idx_tensor.item() +++++- expert_layer = self.experts[expert_idx] +++++- +++++- mask = (flat_selected_experts == expert_idx_tensor) +++++- selected_token_indices = token_indices[mask] +++++- selected_routing_weights = routing_weights.flatten()[mask] +++++- +++++- current_states = hidden_states_reshaped[selected_token_indices] +++++- +++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++- +++++- final_hidden_states = final_hidden_states.index_add( +++++- dim=0, +++++- index=selected_token_indices, +++++- source=expert_output.to(hidden_states.dtype) +++++- ) +++++- +++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- ++++++ @no_grad() ++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++ original_dtype = hidden_states.dtype ++++++ batch_size, _ = hidden_states.shape ++++++ expert_outputs_list = [ ++++++ ops.cat([ ++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++ ], dim=0) ++++++ for i in range(batch_size) ++++++ ] ++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++ weights_fp32 = routing_weights.to(mindspore.float32) ++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++ return moe_output_fp32.squeeze(1).to(original_dtype) ++++++ ++++++ @no_grad() ++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++ num_tokens, _ = hidden_states.shape ++++++ flat_selected_experts = selected_experts.flatten() ++++++ sorted_expert_indices = flat_selected_experts.argsort() ++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++++ original_token_indices = sorted_expert_indices // self.top_k ++++++ moe_output = ops.zeros_like(hidden_states) ++++++ current_token_offset = 0 ++++++ for i in range(self.num_experts): ++++++ expert_token_count = tokens_per_expert[i] - current_token_offset ++++++ if expert_token_count == 0: ++++++ continue ++++++ end_offset = current_token_offset + expert_token_count ++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++++ expert_hidden_states = hidden_states[expert_original_token_indices] ++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++ current_token_offset += expert_token_count ++++++ return moe_output ++++++ ++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++++++ @no_grad() ++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++ moe_output = ops.zeros_like(hidden_states) ++++++ num_tokens, _ = hidden_states.shape ++++++ flat_selected_experts = selected_experts.flatten() ++++++ flat_routing_weights = routing_weights.flatten() ++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++ active_experts = ops.unique(flat_selected_experts) ++++++ for expert_idx_tensor in active_experts: ++++++ expert_idx = expert_idx_tensor.item() ++++++ expert_layer = self.experts[expert_idx] ++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++ current_token_indices = token_indices[mask] ++++++ current_routing_weights = flat_routing_weights[mask] ++++++ current_hidden_states = hidden_states[current_token_indices] ++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++ return moe_output +++++ +++++- final_hidden_states = final_hidden_states + shared_expert_output +++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++- +++++- return final_hidden_states, router_logits ++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++ global Long_Prompt ++++++ ++++++ # 1. 门控计算 (所有模式通用) ++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++ router_logits = self.gate(hidden_states_reshaped) ++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++++ if self.norm_topk_prob: ++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++ moe_output = None ++++++ if Long_Prompt: ++++++ # --- 精度优先模式 (ACCURACY MODE) --- ++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ else: ++++++ # --- 速度优先模式 (SPEED MODE) --- ++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++ if sequence_length == 1: ++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ else: ++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ +++++ ++++++ # 3. 共享专家计算与合并 (所有模式通用) ++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++ ++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++ ++++++ return final_hidden_states, router_logits +++++ +++++ class Qwen2MoeDecoderLayer(nn.Module): +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++++ super().__init__() +++++ self.hidden_size = config.hidden_size ++++++ ++++++ # if Long_Prompt: ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ # else: ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++ +++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ +++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++- +++++ if (layer_idx not in config.mlp_only_layers) and ( +++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++ ): +++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ self._warmed_up = True +++++ self.warmup_moe_model() +++++ ++++++ ++++++ +++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++ output_router_logits = ( +++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ router_logits=outputs.router_logits, +++++ ) +++++ ++++++ def generate(self, *args, **kwargs): ++++++ """ ++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++++++ """ ++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++++ ++++++ input_ids = kwargs.get("input_ids") ++++++ if input_ids is None and args: ++++++ input_ids = args[0] ++++++ ++++++ if input_ids is not None: ++++++ prompt_length = input_ids.shape[1] ++++++ ++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: ++++++ Long_Prompt = True ++++++ else: ++++++ Long_Prompt = False ++++++ ++++++ return super().generate(*args, **kwargs) ++++++ +++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +++++ def prepare_inputs_for_generation( +++++ self, +++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +++++ # Exception 1: when passing input_embeds, input_ids may be missing entries +++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here ++++++ +++++ if past_key_values is not None: +++++ if inputs_embeds is not None: # Exception 1 +++++ if 0 not in input_ids.shape: +++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ } +++++ ) +++++ return model_inputs ++++++ +++++ # @lwx +++++ # def _decode_one_tokens_logits( +++++ # self, +++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +++++ attentions=outputs.attentions, +++++ ) +++++ ++++++ +++++ __all__ = [ +++++ "Qwen2MoeForCausalLM", +++++ "Qwen2MoeModel", +++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++new file mode 100644 +++++index 00000000..6dfb5b93 +++++--- /dev/null ++++++++ b/patches/0001-20251104commit.patch +++++@@ -0,0 +1,1272 @@ ++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++++Subject: [PATCH] 20251104commit ++++++ ++++++--- ++++++ mindnlp/transformers/cache_utils.py | 28 +- ++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++++ 3 files changed, 976 insertions(+), 87 deletions(-) ++++++ ++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++++index cadd2e04..02f8d4be 100644 ++++++--- a/mindnlp/transformers/cache_utils.py +++++++++ b/mindnlp/transformers/cache_utils.py ++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++++ # k_out[:, :, cache_position] = key_states ++++++ # v_out[:, :, cache_position] = value_states ++++++- if ON_ORANGE_PI: ++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++- else: ++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++- +++++++ # if ON_ORANGE_PI: +++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++ # else: +++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++++ if cache_position.ndim > 1: +++++++ cache_position = cache_position.flatten() +++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++++ cache_position = cache_position.int() +++++++ +++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++++ k_out[:, :, cache_position] = key_states +++++++ v_out[:, :, cache_position] = value_states +++++++ ++++++ return k_out, v_out ++++++ ++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++index c695b944..d8303e45 100644 ++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++++ def rotate_half(x): ++++++ """Rotates half the hidden dims of the input.""" ++++++- x1 = x[..., : x.shape[-1] // 2] ++++++- x2 = x[..., x.shape[-1] // 2 :] +++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++ # x1 = x[..., : x.shape[-1] // 2] +++++++ # x2 = x[..., x.shape[-1] // 2 :] +++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++ return ops.cat((-x2, x1), dim=-1) ++++++ ++++++ ++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++++ if self.training: ++++++ raise NotImplementedError("Training is not supported yet.") ++++++ else: ++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++- if self.config.n_shared_experts is not None: ++++++- y = y + self.shared_experts(identity) ++++++- return y +++++++ # @lwx +++++++ if orig_shape[1] == 1: +++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++++ y=y.view(*orig_shape) +++++++ if self.config.n_shared_experts is not None: +++++++ y = y + self.shared_experts(identity) +++++++ return y +++++++ else: +++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++++ if self.config.n_shared_experts is not None: +++++++ y = y + self.shared_experts(identity) +++++++ return y +++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++ # if self.config.n_shared_experts is not None: +++++++ # y = y + self.shared_experts(identity) +++++++ # return y +++++++ +++++++ @no_grad() +++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++ +++++++ expert_cache = ops.zeros_like(x) +++++++ for i in range(self.num_experts_per_tok): +++++++ expert_id = flat_expert_indices[i].item() +++++++ weight = flat_expert_weights[i].item() +++++++ expert = self.experts[expert_id] +++++++ expert_out = expert(x) +++++++ expert_cache += expert_out * weight +++++++ return expert_cache ++++++ ++++++ @no_grad() ++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++- # expert_cache = torch.zeros_like(x) ++++++- # idxs = flat_expert_indices.argsort() ++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++- # token_idxs = idxs // self.num_experts_per_tok ++++++- # for i, end_idx in enumerate(tokens_per_expert): ++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++- # if start_idx == end_idx: ++++++- # continue ++++++- # expert = self.experts[i] ++++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++++- # expert_tokens = x[exp_token_idx] ++++++- # expert_out = expert(expert_tokens) ++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++- # return expert_cache +++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++ expert_cache = ops.zeros_like(x) ++++++ idxs = flat_expert_indices.argsort() ++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ token_idxs = idxs // self.num_experts_per_tok +++++++ ++++++ for i, end_idx in enumerate(tokens_per_expert): ++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ if start_idx == end_idx: ++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++++ expert_out = expert(expert_tokens) ++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ ++++++ return expert_cache +++++++ +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # # expert_cache = torch.zeros_like(x) +++++++ # # idxs = flat_expert_indices.argsort() +++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++ # # token_idxs = idxs // self.num_experts_per_tok +++++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++ # # if start_idx == end_idx: +++++++ # # continue +++++++ # # expert = self.experts[i] +++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # # expert_tokens = x[exp_token_idx] +++++++ # # expert_out = expert(expert_tokens) +++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++ # # return expert_cache +++++++ # expert_cache = ops.zeros_like(x) +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # for i, end_idx in enumerate(tokens_per_expert): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # if start_idx == end_idx: +++++++ # continue +++++++ # expert = self.experts[i] +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = expert(expert_tokens) +++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ +++++++ # return expert_cache +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # expert_cache = ops.zeros_like(x) +++++++ +++++++ # # 排序保证顺序一致 +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # # 找出有 token 的专家 +++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++ +++++++ # for i in active_experts.tolist(): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # end_idx = tokens_per_expert[i] +++++++ # if start_idx == end_idx: # 没有 token +++++++ # continue +++++++ +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = self.experts[i](expert_tokens) +++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++ +++++++ # expert_cache = mindspore.mint.scatter_add( +++++++ # expert_cache, +++++++ # 0, +++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++ # expert_out +++++++ # ) +++++++ +++++++ # return expert_cache +++++++ +++++++ ++++++ ++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++++ # """ ++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ ++++++ # Initialize weights and apply final processing ++++++ self.post_init() +++++++ self.warm_up = False +++++++ +++++++ def warmup_moe_model_deep(self): +++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++++ test_texts = [ +++++++ "warmup short", +++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++++ ] +++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++ if tokenizer is None: +++++++ from mindnlp.transformers import AutoTokenizer +++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++ self._warmup_tokenizer = tokenizer +++++++ +++++++ for text in test_texts: +++++++ inputs = tokenizer(text, return_tensors="ms") +++++++ with mindspore._no_grad(): +++++++ _ = self(**inputs, use_cache=False) +++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++++ ++++++ def get_input_embeddings(self): ++++++ return self.model.embed_tokens ++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++ ```""" +++++++ if not self.warm_up: +++++++ self.warm_up = True +++++++ self.warmup_moe_model_deep() +++++++ ++++++ output_attentions = ( ++++++ output_attentions ++++++ if output_attentions is not None ++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++index 3cbf820e..d4c6b651 100644 ++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++@@ -18,7 +18,6 @@ ++++++ # See the License for the specific language governing permissions and ++++++ # limitations under the License. ++++++ """MindSpore Qwen2MoE model.""" ++++++- ++++++ import math ++++++ from typing import List, Optional, Tuple, Union ++++++ ++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++++ TokenClassifierOutput, ++++++ ) ++++++ from ...modeling_utils import PreTrainedModel +++++++from ...generation import GenerationMixin ++++++ from ....utils import logging ++++++ from .configuration_qwen2_moe import Qwen2MoeConfig ++++++ ++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++++ self.variance_epsilon = eps ++++++ ++++++ def forward(self, hidden_states): +++++++ # @dwj +++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++ # @lwx +++++++ # if not self.training : +++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++ input_dtype = hidden_states.dtype ++++++ hidden_states = hidden_states.to(mindspore.float32) ++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++++@@ -234,6 +239,8 @@ def rotate_half(x): ++++++ """Rotates half the hidden dims of the input.""" ++++++ x1 = x[..., : x.shape[-1] // 2] ++++++ x2 = x[..., x.shape[-1] // 2 :] +++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++ return ops.cat((-x2, x1), dim=-1) ++++++ ++++++ ++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++++ self.config = config ++++++ self.hidden_size = config.hidden_size ++++++ self.intermediate_size = intermediate_size +++++++ ++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++++ self.act_fn = ACT2FN[config.hidden_act] ++++++ ++++++ def forward(self, x): ++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++- ++++++ +++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++ # @lwx +++++++ # gate_up_output = self.gate_up_proj(x) +++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++++ # return self.down_proj(swiglu_output) +++++++ +++++++ # def forward(self, x): +++++++ # gate_proj_out = self.gate_proj(x) +++++++ # up_proj_out = self.up_proj(x) +++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++++ # return self.down_proj(swiglu_out) +++++++ ++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++++ """ ++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++++ use_cache: bool = False, ++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ +++++++ ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ query_states = self.q_proj(hidden_states) ++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ "with a layer index." ++++++ ) ++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ if isinstance(past_key_value, StaticCache): +++++++ kv_seq_len = key_states.shape[-2] +++++++ else: +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++ if isinstance(past_key_value, StaticCache): +++++++ kv_seq_len = key_states.shape[-2] ++++++ ++++++ # repeat k/v heads if n_kv_heads < n_heads ++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++- +++++++ ++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++ ++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++++- raise ValueError( ++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++++- f" {attn_weights.shape}" ++++++- ) ++++++- ++++++- if attention_mask is not None: # no matter the length, we just slice it ++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++++ if attention_mask is not None: +++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++ attn_weights = attn_weights + causal_mask ++++++ ++++++ # upcast attention to fp32 ++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++ ++++++ attn_output = self.o_proj(attn_output) ++++++- +++++++ # @lwx +++++++ +++++++ # max_seq_len = self.max_position_embeddings # 2048 +++++++ +++++++ # if attention_mask is not None: +++++++ # # attention_mask: [B, 1, Sq, Sk] +++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++ +++++++ # # pad 到 [max_seq_len, max_seq_len] +++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++ # global_attention_mask = padded_mask +++++++ # else: +++++++ # global_attention_mask = None +++++++ +++++++ +++++++ # sparse_mode=3 +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # real_shift=None, +++++++ # padding_mask=None, +++++++ +++++++ # head_num=self.num_heads, +++++++ # attn_mask=global_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ # input_layout="BNSD", +++++++ # pre_tokens=2147483647, +++++++ # next_tokens=2147483647, +++++++ # inner_precise=0, +++++++ # drop_mask=None, +++++++ # prefix=None, +++++++ # actual_seq_qlen=None, +++++++ # actual_seq_kvlen=None, +++++++ # sparse_mode=sparse_mode, +++++++ # ) ++++++ if not output_attentions: ++++++ attn_weights = None ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ +++++++class Qwen2MoeFlashAttention(nn.Module): +++++++ """ +++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++ +++++++ 关键改动: +++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++ 直接传入原始的 key 和 value 张量效率更高。 +++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++ """ +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++ super().__init__() +++++++ self.config = config +++++++ self.layer_idx = layer_idx +++++++ self.hidden_size = config.hidden_size +++++++ self.num_heads = config.num_attention_heads +++++++ self.head_dim = self.hidden_size // self.num_heads +++++++ self.num_key_value_heads = config.num_key_value_heads +++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++ self.max_position_embeddings = config.max_position_embeddings +++++++ self.rope_theta = config.rope_theta +++++++ self.attention_dropout = config.attention_dropout +++++++ +++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++++ raise ValueError( +++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++ ) +++++++ +++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++ +++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++ self.head_dim, +++++++ max_position_embeddings=self.max_position_embeddings, +++++++ base=self.rope_theta, +++++++ ) +++++++ +++++++ def forward( +++++++ self, +++++++ hidden_states: mindspore.Tensor, +++++++ attention_mask: Optional[mindspore.Tensor] = None, +++++++ position_ids: Optional[mindspore.Tensor] = None, +++++++ past_key_value: Optional[Cache] = None, +++++++ output_attentions: bool = False, +++++++ use_cache: bool = False, +++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # 1. 线性投射 Q, K, V +++++++ query_states = self.q_proj(hidden_states) +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++ # query: [B, S, H*D] -> [B, N1, S, D] +++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # 3. RoPE 旋转位置编码 +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++ if self.layer_idx is None: +++++++ raise ValueError( +++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ "with a layer index." +++++++ ) +++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++ if cache_position.shape[0] == 1: +++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++ kv_seq_len = past_seen_tokens + 1 +++++++ else: +++++++ # prefill 阶段:cache_position 是范围,使用其长度 +++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++ else: +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # 4. KV 缓存更新 +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ key_states, value_states = past_key_value.update( +++++++ key_states, value_states, self.layer_idx, cache_kwargs +++++++ ) +++++++ +++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++ if cache_position.shape[0] == 1: +++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++ kv_seq_len = key_states.shape[-2] +++++++ +++++++ # 5. [重要] 准备 Attention Mask +++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++ fa_attention_mask = None +++++++ if attention_mask is not None: +++++++ # 截取与当前key长度匹配的部分 +++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++ fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++ input_dtype = query_states.dtype +++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++ query_states = query_states.to(mindspore.float16) +++++++ key_states = key_states.to(mindspore.float16) +++++++ value_states = value_states.to(mindspore.float16) +++++++ +++++++ # 6. [核心] 调用 flash_attention_score 算子 +++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++ attn_output = mindspore.ops.flash_attention_score( +++++++ query=query_states, +++++++ key=key_states, +++++++ value=value_states, +++++++ head_num=self.num_heads, # 传入Q的头数(N1) +++++++ attn_mask=fa_attention_mask, +++++++ keep_prob=1.0 - self.attention_dropout, +++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ input_layout="BNSD", +++++++ sparse_mode=0 # 使用 defaultMask 模式 +++++++ ) +++++++ +++++++ # 恢复原始数据类型 +++++++ attn_output = attn_output.to(input_dtype) +++++++ +++++++ # 7. 调整输出形状 +++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = self.o_proj(attn_output) +++++++ +++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +++++++ attn_weights = None +++++++ if output_attentions: +++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ # def forward( +++++++ # self, +++++++ # hidden_states: mindspore.Tensor, +++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++ # past_key_value: Optional[Cache] = None, +++++++ # output_attentions: bool = False, +++++++ # use_cache: bool = False, +++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ # bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # # 1. 线性投射 Q, K, V +++++++ # query_states = self.q_proj(hidden_states) +++++++ # key_states = self.k_proj(hidden_states) +++++++ # value_states = self.v_proj(hidden_states) +++++++ +++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # # 3. RoPE 旋转位置编码 +++++++ # kv_seq_len = key_states.shape[-2] +++++++ # if past_key_value is not None: +++++++ # if self.layer_idx is None: +++++++ # raise ValueError( +++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ # "with a layer index." +++++++ # ) +++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # # 4. KV 缓存更新 +++++++ # if past_key_value is not None: +++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ # key_states, value_states = past_key_value.update( +++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++ # ) +++++++ +++++++ # # 5. 准备 Attention Mask +++++++ # fa_attention_mask = None +++++++ # if attention_mask is not None: +++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++ # input_dtype = query_states.dtype +++++++ +++++++ # # 6. [核心] 调用 flash_attention_score 算子 +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # head_num=self.num_heads, +++++++ # attn_mask=fa_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ # input_layout="BNSD", +++++++ # sparse_mode=0, +++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++ # inner_precise=1 +++++++ # ) +++++++ +++++++ # # 恢复原始数据类型 +++++++ # attn_output = attn_output.to(input_dtype) +++++++ +++++++ # # 7. 调整输出形状 +++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ # attn_output = self.o_proj(attn_output) +++++++ +++++++ # attn_weights = None +++++++ # if output_attentions: +++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++ # return attn_output, attn_weights, past_key_value +++++++ +++++++ # def forward( +++++++ # self, +++++++ # hidden_states: mindspore.Tensor, +++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++ # past_key_value: Optional[Cache] = None, +++++++ # output_attentions: bool = False, +++++++ # use_cache: bool = False, +++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ # bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # query_states = self.q_proj(hidden_states) +++++++ # key_states = self.k_proj(hidden_states) +++++++ # value_states = self.v_proj(hidden_states) +++++++ +++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # kv_seq_len = key_states.shape[-2] +++++++ # if past_key_value is not None: +++++++ # if self.layer_idx is None: +++++++ # raise ValueError("`layer_idx` must be specified for caching") +++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # if past_key_value is not None: +++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ # key_states, value_states = past_key_value.update( +++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++ # ) +++++++ +++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++ +++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++++ # query_states = query_states / math.sqrt(self.head_dim) +++++++ # # <--- 修改结束 --- +++++++ +++++++ # fa_attention_mask = None +++++++ # if attention_mask is not None: +++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # input_dtype = query_states.dtype +++++++ +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, # 传入已经预先缩放过的 query +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # head_num=self.num_heads, +++++++ # attn_mask=fa_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++++ # input_layout="BNSD", +++++++ # sparse_mode=0, +++++++ # inner_precise=1 # 仍然保持内部高精度计算 +++++++ # ) +++++++ +++++++ # attn_output = attn_output.to(input_dtype) +++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ # attn_output = self.o_proj(attn_output) +++++++ +++++++ # attn_weights = None +++++++ # if output_attentions: +++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++++ +++++++ # return attn_output, attn_weights, past_key_value +++++++ ++++++ QWEN2MOE_ATTENTION_CLASSES = { ++++++ "eager": Qwen2MoeAttention, +++++++ "flash-attention": Qwen2MoeFlashAttention, ++++++ } ++++++ ++++++ ++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ +++++++ #@dwj +++++++ # 只遍历激活的专家,而非全部专家 ++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- hidden_states = hidden_states.view(-1, hidden_dim) ++++++- # router_logits: (batch * sequence_length, n_experts) ++++++- router_logits = self.gate(hidden_states) ++++++- ++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- if self.norm_topk_prob: ++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- # we cast back to the input dtype ++++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++- final_hidden_states = ops.zeros( ++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++++- ) ++++++- ++++++- # One hot encode the selected experts to create an expert mask ++++++- # this will be used to easily index which expert is going to be sollicitated ++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++++- ++++++- # Loop over all available experts in the model and perform the computation on each expert ++++++- for expert_idx in range(self.num_experts): ++++++- expert_layer = self.experts[expert_idx] ++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++++- ++++++- # Index the correct hidden states and compute the expert hidden state for ++++++- # the current expert. We need to make sure to multiply the output hidden ++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++++- if 0 not in idx.shape: ++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++++- ++++++- # However `index_add_` only support torch tensors for indexing so we'll use ++++++- # the `top_x` tensor here. ++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++++- ++++++- shared_expert_output = self.shared_expert(hidden_states) ++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++++- ++++++- final_hidden_states = final_hidden_states + shared_expert_output +++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++ num_tokens = hidden_states_reshaped.shape[0] +++++++ +++++++ router_logits = self.gate(hidden_states_reshaped) +++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++ if self.norm_topk_prob: +++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++ flat_selected_experts = selected_experts.flatten() +++++++ +++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++ token_indices = broadcasted_token_indices.flatten() +++++++ +++++++ active_experts = ops.unique(flat_selected_experts) +++++++ +++++++ for expert_idx_tensor in active_experts: +++++++ expert_idx = expert_idx_tensor.item() +++++++ expert_layer = self.experts[expert_idx] +++++++ +++++++ mask = (flat_selected_experts == expert_idx_tensor) +++++++ selected_token_indices = token_indices[mask] +++++++ selected_routing_weights = routing_weights.flatten()[mask] +++++++ +++++++ current_states = hidden_states_reshaped[selected_token_indices] +++++++ +++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++ +++++++ final_hidden_states = final_hidden_states.index_add( +++++++ dim=0, +++++++ index=selected_token_indices, +++++++ source=expert_output.to(hidden_states.dtype) +++++++ ) +++++++ +++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++ ++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++- return final_hidden_states, router_logits +++++++ final_hidden_states = final_hidden_states + shared_expert_output +++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++ +++++++ return final_hidden_states, router_logits ++++++ ++++++ ++++++ class Qwen2MoeDecoderLayer(nn.Module): ++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++++ ++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++ ++++++ if (layer_idx not in config.mlp_only_layers) and ( ++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++++ ): ++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++++ _skip_keys_device_placement = "past_key_values" ++++++ _supports_cache_class = True +++++++#lwx +++++++ # _supports_static_cache = True ++++++ ++++++ def _init_weights(self, module): ++++++ std = self.config.initializer_range ++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++++ return causal_mask ++++++ ++++++ ++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ _tied_weights_keys = ["lm_head.weight"] ++++++ ++++++ def __init__(self, config): ++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ self.num_experts_per_tok = config.num_experts_per_tok ++++++ # Initialize weights and apply final processing ++++++ self.post_init() +++++++ # @lwx +++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++++ # self.generation_config.cache_implementation = "static" +++++++ self._warmed_up = False +++++++ +++++++ def warmup_moe_model(self): +++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++++ test_texts = [ +++++++ "warmup short", +++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++++ ] +++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++ if tokenizer is None: +++++++ from mindnlp.transformers import AutoTokenizer +++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++ self._warmup_tokenizer = tokenizer +++++++ +++++++ for text in test_texts: +++++++ inputs = tokenizer(text, return_tensors="ms") +++++++ with mindspore._no_grad(): +++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++++ ++++++ def get_input_embeddings(self): ++++++ return self.model.embed_tokens ++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++ ```""" +++++++ if not self._warmed_up: +++++++ self._warmed_up = True +++++++ self.warmup_moe_model() ++++++ ++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++++ output_router_logits = ( ++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ } ++++++ ) ++++++ return model_inputs +++++++# @lwx +++++++ # def _decode_one_tokens_logits( +++++++ # self, +++++++ # cur_token: mindspore.Tensor, +++++++ # input_pos: Optional[mindspore.Tensor], +++++++ # cache_position: mindspore.Tensor, +++++++ # past_key_values: StaticCache, +++++++ # ) -> mindspore.Tensor: +++++++ # """ +++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++++ +++++++ # Args: +++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++++ # input_pos: 输入位置信息,可选 +++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++++ +++++++ # Returns: +++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++++ # """ +++++++ # # 调用JIT编译的版本 +++++++ # return self.get_decode_one_tokens_logits( +++++++ # cur_token=cur_token, +++++++ # input_pos=input_pos, +++++++ # cache_position=cache_position, +++++++ # past_key_values=past_key_values, +++++++ # ) +++++++ +++++++ # @mindspore.jit(jit_level='O1') +++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++++ # """ +++++++ # JIT编译的函数,用于高效的单token解码 +++++++ # 使用JIT编译优化以支持静态shape和高效执行 +++++++ +++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++++ # """ +++++++ # outputs = self.model.forward( +++++++ # input_ids=cur_token, +++++++ # position_ids=input_pos, +++++++ # cache_position=cache_position, +++++++ # past_key_values=past_key_values, +++++++ # use_cache=True, +++++++ # return_dict=False, +++++++ # ) +++++++ +++++++ # hidden_states = outputs[0] +++++++ # logits = self.lm_head.forward(hidden_states) +++++++ # logits = logits.float() +++++++ +++++++ # return logits[:, -1, :] +++++++ +++++++ # def _sample( +++++++ # self, +++++++ # input_ids: mindspore.Tensor, +++++++ # logits_processor, +++++++ # stopping_criteria, +++++++ # generation_config, +++++++ # synced_devices: bool, +++++++ # streamer=None, +++++++ # logits_warper=None, +++++++ # **model_kwargs, +++++++ # ): +++++++ # """ +++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++++ # """ +++++++ # from ...generation.logits_process import LogitsProcessorList +++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++++ # from mindnlp.core import nn, ops, no_grad +++++++ # import numpy as np +++++++ +++++++ # # 检查是否使用 StaticCache +++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++++ # # 否则,直接调用父类方法 +++++++ # past_key_values = model_kwargs.get("past_key_values") +++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++++ +++++++ # if not isinstance(past_key_values, StaticCache): +++++++ # # 不使用 StaticCache,直接调用父类方法 +++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++++ # return super()._sample( +++++++ # input_ids=input_ids, +++++++ # logits_processor=logits_processor, +++++++ # stopping_criteria=stopping_criteria, +++++++ # generation_config=generation_config, +++++++ # synced_devices=synced_devices, +++++++ # streamer=streamer, +++++++ # logits_warper=logits_warper, +++++++ # **model_kwargs, +++++++ # ) +++++++ +++++++ # # 使用 StaticCache,进入自定义循环 +++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++++ # pad_token_id = generation_config._pad_token_tensor +++++++ # output_attentions = generation_config.output_attentions +++++++ # output_hidden_states = generation_config.output_hidden_states +++++++ # output_scores = generation_config.output_scores +++++++ # output_logits = generation_config.output_logits +++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +++++++ # max_length = generation_config.max_length +++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++++ # do_sample = generation_config.do_sample +++++++ +++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++++ # raise ValueError( +++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++++ # f"{logits_warper})." +++++++ # ) +++++++ +++++++ # # init attention / hidden states / scores tuples +++++++ # scores = () if (return_dict_in_generate and output_scores) else None +++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++++ +++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++++ # encoder_hidden_states = ( +++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++++ # ) +++++++ +++++++ # # keep track of which sequences are already finished +++++++ # batch_size, cur_len = input_ids.shape +++++++ # this_peer_finished = False +++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++++ +++++++ # time_record = [] +++++++ # from ....utils.testing_utils import parse_flag_from_env +++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++++ +++++++ # while self._has_unfinished_sequences( +++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++++ # ): +++++++ # if _record_time: +++++++ # import time as time_module +++++++ # infer_start = time_module.time() +++++++ +++++++ # # prepare model inputs +++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++++ +++++++ # # prepare variable output controls +++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++++ +++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++++ # cur_cache_position = model_inputs.get("cache_position") +++++++ # cur_past_key_values = model_inputs.get("past_key_values") +++++++ # cur_input_ids = model_inputs.get("input_ids") +++++++ +++++++ # if (isinstance(cur_past_key_values, StaticCache) and +++++++ # cur_cache_position is not None and +++++++ # len(cur_cache_position.shape) > 0 and +++++++ # cur_cache_position.shape[0] == 1 and +++++++ # cur_input_ids is not None and +++++++ # cur_input_ids.shape[1] == 1): +++++++ # # 使用 JIT 优化的单 token 解码 +++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++++ # if not hasattr(self, '_jit_used'): +++++++ # self._jit_used = False +++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++++ +++++++ # next_token_logits = self.get_decode_one_tokens_logits( +++++++ # cur_token=cur_input_ids, +++++++ # input_pos=model_inputs.get("position_ids"), +++++++ # cache_position=cur_cache_position, +++++++ # past_key_values=cur_past_key_values, +++++++ # ) +++++++ +++++++ # # 标记已使用JIT(用于后续判断) +++++++ # if not self._jit_used: +++++++ # self._jit_used = True +++++++ +++++++ # # 构造兼容的输出对象 +++++++ # class JitOptimizedOutput: +++++++ # def __init__(self, logits, config): +++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++++ # self.config = config +++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++++ # self.attentions = None if not config.is_encoder_decoder else None +++++++ # self.cross_attentions = None +++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +++++++ +++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++++ # else: +++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++++ # outputs = self(**model_inputs, return_dict=True) +++++++ +++++++ # if synced_devices and this_peer_finished: +++++++ # continue +++++++ +++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++++ # next_token_logits = outputs.logits[:, -1, :] +++++++ +++++++ # # pre-process distribution +++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++++ # if do_sample: +++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++++ +++++++ # # Store scores, attentions and hidden_states when required +++++++ # if return_dict_in_generate: +++++++ # if output_scores: +++++++ # scores += (next_token_scores,) +++++++ # if output_logits: +++++++ # raw_logits += (next_token_logits,) +++++++ # if output_attentions: +++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +++++++ # if self.config.is_encoder_decoder: +++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++++ +++++++ # if output_hidden_states: +++++++ # hidden = ( +++++++ # outputs.decoder_hidden_states +++++++ # if self.config.is_encoder_decoder +++++++ # else outputs.hidden_states +++++++ # ) +++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++++ +++++++ # # token selection +++++++ # if do_sample: +++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++++ # else: +++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++++ +++++++ # # finished sentences should have their next token be a padding token +++++++ # if has_eos_stopping_criteria: +++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++++ +++++++ # # update generated ids, model inputs, and length for next step +++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++++ # if streamer is not None: +++++++ # streamer.put(next_tokens) +++++++ +++++++ # model_kwargs = self._update_model_kwargs_for_generation( +++++++ # outputs, +++++++ # model_kwargs, +++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +++++++ # ) +++++++ +++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++++ # cur_len += 1 +++++++ +++++++ # if _record_time: +++++++ # import time as time_module +++++++ # infer_stop = time_module.time() +++++++ # time_record.append(infer_stop - infer_start) +++++++ +++++++ # del outputs +++++++ +++++++ # average_infer_time = None +++++++ # if time_record: +++++++ # if len(time_record) > 1: +++++++ # time_record.pop(0) +++++++ # average_infer_time = sum(time_record) / len(time_record) +++++++ # print(f'average inference time is: {average_infer_time}') +++++++ # print(f'inference time record: {time_record}') +++++++ +++++++ # if streamer is not None: +++++++ # streamer.end() +++++++ +++++++ # # 简单判断:打印是否使用了JIT路径 +++++++ # if hasattr(self, '_jit_used') and self._jit_used: +++++++ # print("[JIT] ✓ JIT optimization was used during generation") +++++++ # else: +++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++++ +++++++ # if return_dict_in_generate: +++++++ # if self.config.is_encoder_decoder: +++++++ # return GenerateEncoderDecoderOutput( +++++++ # sequences=input_ids, +++++++ # scores=scores, +++++++ # logits=raw_logits, +++++++ # encoder_attentions=encoder_attentions, +++++++ # encoder_hidden_states=encoder_hidden_states, +++++++ # decoder_attentions=decoder_attentions, +++++++ # cross_attentions=cross_attentions, +++++++ # decoder_hidden_states=decoder_hidden_states, +++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++ # average_infer_time=average_infer_time +++++++ # ) +++++++ # else: +++++++ # return GenerateDecoderOnlyOutput( +++++++ # sequences=input_ids, +++++++ # scores=scores, +++++++ # logits=raw_logits, +++++++ # attentions=decoder_attentions, +++++++ # hidden_states=decoder_hidden_states, +++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++ # average_infer_time=average_infer_time +++++++ # ) +++++++ # else: +++++++ # return input_ids +++++++ +++++++ # def _prepare_cache_for_generation( +++++++ # self, +++++++ # generation_config, +++++++ # model_kwargs, +++++++ # assistant_model, +++++++ # batch_size, +++++++ # max_cache_length, +++++++ # ): +++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++++ # generation_config.cache_implementation = "static" +++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++++ +++++++ # if generation_config.cache_implementation == "static": +++++++ # base_required_from_max_length = generation_config.max_length + 1 +++++++ # base_required = max(max_cache_length, base_required_from_max_length) +++++++ # min_cache_size = 50 +++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++++ # else: +++++++ # max_cache_length = max(base_required, min_cache_size) +++++++ +++++++ # original_max_cache_length = max_cache_length +++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++++ # print(f" - final max_cache_length: {max_cache_length}") +++++++ +++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++ # if max_cache_length > self.config.max_position_embeddings: +++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++++ +++++++ # result = super()._prepare_cache_for_generation( +++++++ # generation_config=generation_config, +++++++ # model_kwargs=model_kwargs, +++++++ # assistant_model=assistant_model, +++++++ # batch_size=batch_size, +++++++ # max_cache_length=max_cache_length, +++++++ # ) +++++++ +++++++ # if generation_config.cache_implementation == "static": +++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++++ # created_cache = model_kwargs.get(cache_name) +++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++++ # if created_cache.max_cache_len < generation_config.max_length: +++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++++ +++++++ # return result +++++++ +++++++ +++++++ ++++++ ++++++ ++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++++-- ++++++2.27.0 ++++++ +++++-- +++++2.27.0 +++++ ++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++++new file mode 100644 ++++index 00000000..966529e4 ++++--- /dev/null +++++++ b/patches/0003-20261106secondcommit.patch ++++@@ -0,0 +1,2769 @@ +++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Thu, 6 Nov 2025 14:54:37 +0800 +++++Subject: [PATCH 3/3] 20261106secondcommit +++++ +++++--- +++++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +++++ patches/0001-20251104commit.patch | 1272 ----------------- +++++ 3 files changed, 528 insertions(+), 2032 deletions(-) +++++ delete mode 100644 patches/0001-20251104commit.patch +++++ +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index 73773c22..2f9192bf 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +++++ +++++ _CONFIG_FOR_DOC = "DeepseekConfig" +++++ ++++++_attn_mask_cache = {} ++++++ ++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): ++++++ q_len = batch_and_seq[1] ++++++ kv_len = batch_and_seq[1] + past_key_values_length ++++++ key = (batch_and_seq[0], q_len, kv_len) ++++++ ++++++ if key in _attn_mask_cache: ++++++ return _attn_mask_cache[key] ++++++ ++++++ mask = _prepare_4d_causal_attention_mask( ++++++ attention_mask, ++++++ batch_and_seq, ++++++ inputs_embeds, ++++++ past_key_values_length, ++++++ ) ++++++ _attn_mask_cache[key] = mask ++++++ return mask +++++ +++++ def _get_unpad_data(attention_mask): +++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +++++ return final_output +++++ +++++ +++++- @no_grad() +++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++- expert_cache = ops.zeros_like(x) +++++- idxs = flat_expert_indices.argsort() +++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++- token_idxs = idxs // self.num_experts_per_tok +++++- +++++- for i, end_idx in enumerate(tokens_per_expert): +++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++- if start_idx == end_idx: +++++- continue +++++- expert = self.experts[i] +++++- exp_token_idx = token_idxs[start_idx:end_idx] +++++- expert_tokens = x[exp_token_idx] +++++- expert_out = expert(expert_tokens) +++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++- +++++- return expert_cache +++++- +++++ # @no_grad() +++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++- # # expert_cache = torch.zeros_like(x) +++++- # # idxs = flat_expert_indices.argsort() +++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++- # # token_idxs = idxs // self.num_experts_per_tok +++++- # # for i, end_idx in enumerate(tokens_per_expert): +++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++- # # if start_idx == end_idx: +++++- # # continue +++++- # # expert = self.experts[i] +++++- # # exp_token_idx = token_idxs[start_idx:end_idx] +++++- # # expert_tokens = x[exp_token_idx] +++++- # # expert_out = expert(expert_tokens) +++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++- # # return expert_cache ++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++ # expert_cache = ops.zeros_like(x) +++++ # idxs = flat_expert_indices.argsort() +++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++ +++++ # return expert_cache +++++- # @no_grad() +++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++- # expert_cache = ops.zeros_like(x) ++++++ ++++++ @no_grad() ++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++ """ ++++++ 优化版 MoE prefill: ++++++ - 批量张量化处理同一个 expert 的所有 token ++++++ - 跳过无 token 的专家 ++++++ - 保持结果完全一致 ++++++ """ ++++++ # 初始化输出缓存 ++++++ expert_cache = ops.zeros_like(x) +++++ +++++- # # 排序保证顺序一致 +++++- # idxs = flat_expert_indices.argsort() +++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++- # token_idxs = idxs // self.num_experts_per_tok ++++++ # 排序(确保 scatter_add 位置对应原逻辑) ++++++ idxs = flat_expert_indices.argsort() ++++++ sorted_expert_indices = flat_expert_indices[idxs] ++++++ sorted_token_indices = idxs // self.num_experts_per_tok +++++ +++++- # # 找出有 token 的专家 +++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++ # 每个 expert 的 token 数 ++++++ tokens_per_expert = sorted_expert_indices.bincount() +++++ +++++- # for i in active_experts.tolist(): +++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++- # end_idx = tokens_per_expert[i] +++++- # if start_idx == end_idx: # 没有 token +++++- # continue ++++++ # 找出有 token 的专家 ++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++++ +++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++- # expert_tokens = x[exp_token_idx] +++++- # expert_out = self.experts[i](expert_tokens) +++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++ for expert_id in active_experts.tolist(): ++++++ # 取该 expert 对应的排序后 token 区间 ++++++ start = (tokens_per_expert[:expert_id]).sum().item() ++++++ end = start + tokens_per_expert[expert_id].item() +++++ +++++- # expert_cache = mindspore.mint.scatter_add( +++++- # expert_cache, +++++- # 0, +++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++- # expert_out +++++- # ) ++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 ++++++ expert_tokens = x[token_idx] # 取输入向量 +++++ +++++- # return expert_cache ++++++ # 执行专家 MLP ++++++ expert_out = self.experts[expert_id](expert_tokens) ++++++ ++++++ # 按权重缩放 ++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] ++++++ ++++++ # 回写到缓存(等价 scatter_add) ++++++ expert_cache = mindspore.mint.scatter_add( ++++++ expert_cache, ++++++ 0, ++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++ scaled_out ++++++ ) ++++++ ++++++ return expert_cache ++++++ ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # # expert_cache = torch.zeros_like(x) ++++++ # # idxs = flat_expert_indices.argsort() ++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++ # # if start_idx == end_idx: ++++++ # # continue ++++++ # # expert = self.experts[i] ++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # # expert_tokens = x[exp_token_idx] ++++++ # # expert_out = expert(expert_tokens) ++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++ # # return expert_cache ++++++ # expert_cache = ops.zeros_like(x) ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # if start_idx == end_idx: ++++++ # continue ++++++ # expert = self.experts[i] ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = expert(expert_tokens) ++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ ++++++ # return expert_cache ++++++ # @no_grad() ++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++ # expert_cache = ops.zeros_like(x) ++++++ ++++++ # # 排序保证顺序一致 ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++ ++++++ # # 找出有 token 的专家 ++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++ ++++++ # for i in active_experts.tolist(): ++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ # end_idx = tokens_per_expert[i] ++++++ # if start_idx == end_idx: # 没有 token ++++++ # continue ++++++ ++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++ # expert_tokens = x[exp_token_idx] ++++++ # expert_out = self.experts[i](expert_tokens) ++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++ ++++++ # expert_cache = mindspore.mint.scatter_add( ++++++ # expert_cache, ++++++ # 0, ++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++ # expert_out ++++++ # ) ++++++ ++++++ # return expert_cache +++++ +++++ +++++ +++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++- +++++ # class DeepseekFlashAttention(nn.Module): +++++ # """ +++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +++++ +++++ return attn_output, attn_weights, past_key_value +++++ ++++++ +++++ Deepseek_ATTENTION_CLASSES = { +++++ "eager": DeepseekAttention, +++++ "flash-attention": DeepseekFlashAttention, +++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +++++ ) +++++ else: +++++ # 4d mask is passed through the layers +++++- attention_mask = _prepare_4d_causal_attention_mask( ++++++ # attention_mask = _prepare_4d_causal_attention_mask( ++++++ # attention_mask, ++++++ # (batch_size, seq_length), ++++++ # inputs_embeds, ++++++ # past_key_values_length, ++++++ # ) ++++++ #@dwj ++++++ attention_mask = get_cached_causal_mask( +++++ attention_mask, +++++ (batch_size, seq_length), +++++ inputs_embeds, +++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ # Initialize weights and apply final processing +++++ self.post_init() +++++ self.warm_up = False ++++++ #@dwj ++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++++++ self.num_layers, ++++++ self.num_attention_heads, ++++++ self.head_dim, ++++++ batch_size=1, ++++++ max_length=self.max_length, ++++++ dtype=mindspore.float16 ++++++ ) ++++++ ++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++++++ key_cache = [] ++++++ value_cache = [] ++++++ for _ in range(num_layers): ++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++ key_cache.append(k) ++++++ value_cache.append(v) ++++++ return key_cache, value_cache ++++++ +++++ +++++ def warmup_moe_model_deep(self): +++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++index bced285c..ebd7782e 100644 +++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++++ +++++-Long_Prompt = False +++++-PROMPT_LENGTH_THRESHOLD = 128 ++++++Long_Prompt = 1 ++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 ++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 ++++++ ++++++_causal_mask_cache = {} ++++++ ++++++def get_cached_causal_mask_with_cache_position( ++++++ attention_mask: mindspore.Tensor, ++++++ sequence_length: int, ++++++ target_length: int, ++++++ dtype: mindspore.dtype, ++++++ min_dtype: float, ++++++ cache_position: mindspore.Tensor, ++++++ batch_size: int, ++++++): ++++++ """ ++++++ 带缓存的 causal mask 构造函数 ++++++ """ ++++++ # q_len 是当前 query 长度 ++++++ q_len = sequence_length ++++++ # kv_len 是 target_length ++++++ kv_len = target_length ++++++ ++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 ++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) ++++++ ++++++ if key in _causal_mask_cache: ++++++ return _causal_mask_cache[key] ++++++ ++++++ # 调用原来的 mask 构造逻辑 ++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++ attention_mask, ++++++ sequence_length=sequence_length, ++++++ target_length=target_length, ++++++ dtype=dtype, ++++++ min_dtype=min_dtype, ++++++ cache_position=cache_position, ++++++ batch_size=batch_size, ++++++ ) ++++++ # 缓存结果 ++++++ _causal_mask_cache[key] = causal_mask ++++++ return causal_mask +++++ +++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++++ def _prepare_4d_causal_attention_mask_with_cache_position( +++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++ +++++ +++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe ++++++# class Qwen2MoeAttention(nn.Module): ++++++# """ ++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++++++# and "Generating Long Sequences with Sparse Transformers". ++++++# """ ++++++ ++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++# super().__init__() ++++++# self.config = config ++++++# self.layer_idx = layer_idx ++++++# if layer_idx is None: ++++++# logger.warning_once( ++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++# "when creating this class." ++++++# ) ++++++ ++++++# self.hidden_size = config.hidden_size ++++++# self.num_heads = config.num_attention_heads ++++++# self.head_dim = self.hidden_size // self.num_heads ++++++# self.num_key_value_heads = config.num_key_value_heads ++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++# self.max_position_embeddings = config.max_position_embeddings ++++++# self.rope_theta = config.rope_theta ++++++# self.is_causal = True ++++++# self.attention_dropout = config.attention_dropout ++++++ ++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++# raise ValueError( ++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++# f" and `num_heads`: {self.num_heads})." ++++++# ) ++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++ ++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++# self.head_dim, ++++++# max_position_embeddings=self.max_position_embeddings, ++++++# base=self.rope_theta, ++++++# ) ++++++ ++++++# def forward( ++++++# self, ++++++# hidden_states: mindspore.Tensor, ++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++# past_key_value: Optional[Cache] = None, ++++++# output_attentions: bool = False, ++++++# use_cache: bool = False, ++++++# cache_position: Optional[mindspore.Tensor] = None, ++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++ ++++++ ++++++ ++++++# bsz, q_len, _ = hidden_states.shape ++++++ ++++++# query_states = self.q_proj(hidden_states) ++++++# key_states = self.k_proj(hidden_states) ++++++# value_states = self.v_proj(hidden_states) ++++++ ++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++ ++++++# kv_seq_len = key_states.shape[-2] ++++++# if past_key_value is not None: ++++++# if self.layer_idx is None: ++++++# raise ValueError( ++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++# "with a layer index." ++++++# ) ++++++# if isinstance(past_key_value, StaticCache): ++++++# kv_seq_len = key_states.shape[-2] ++++++# else: ++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++# if past_key_value is not None: ++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++# if isinstance(past_key_value, StaticCache): ++++++# kv_seq_len = key_states.shape[-2] ++++++ ++++++# # repeat k/v heads if n_kv_heads < n_heads ++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++ ++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++ ++++++# if attention_mask is not None: ++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++# attn_weights = attn_weights + causal_mask ++++++ ++++++# # upcast attention to fp32 ++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++++# attn_output = ops.matmul(attn_weights, value_states) ++++++ ++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++++# raise ValueError( ++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++++++# f" {attn_output.shape}" ++++++# ) ++++++ ++++++# attn_output = ops.transpose(attn_output, 1, 2) ++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++ ++++++# attn_output = self.o_proj(attn_output) ++++++# # @lwx ++++++ ++++++# # max_seq_len = self.max_position_embeddings # 2048 ++++++ ++++++# # if attention_mask is not None: ++++++# # # attention_mask: [B, 1, Sq, Sk] ++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++ ++++++# # # pad 到 [max_seq_len, max_seq_len] ++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++# # global_attention_mask = padded_mask ++++++# # else: ++++++# # global_attention_mask = None ++++++ ++++++ ++++++# # sparse_mode=3 ++++++# # attn_output = mindspore.ops.flash_attention_score( ++++++# # query=query_states, ++++++# # key=key_states, ++++++# # value=value_states, ++++++# # real_shift=None, ++++++# # padding_mask=None, ++++++ ++++++# # head_num=self.num_heads, ++++++# # attn_mask=global_attention_mask, ++++++# # keep_prob=1.0 - self.attention_dropout, ++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++# # input_layout="BNSD", ++++++# # pre_tokens=2147483647, ++++++# # next_tokens=2147483647, ++++++# # inner_precise=0, ++++++# # drop_mask=None, ++++++# # prefix=None, ++++++# # actual_seq_qlen=None, ++++++# # actual_seq_kvlen=None, ++++++# # sparse_mode=sparse_mode, ++++++# # ) ++++++# if not output_attentions: ++++++# attn_weights = None ++++++ ++++++# return attn_output, attn_weights, past_key_value ++++++ +++++ class Qwen2MoeAttention(nn.Module): +++++ """ +++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++++- and "Generating Long Sequences with Sparse Transformers". +++++- """ ++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +++++ ++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: ++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 ++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 ++++++ ++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 ++++++ """ +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++ super().__init__() +++++ self.config = config +++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +++++ if layer_idx is None: +++++ logger.warning_once( +++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++ "when creating this class." +++++ ) +++++ +++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +++++ use_cache: bool = False, +++++ cache_position: Optional[mindspore.Tensor] = None, +++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++- +++++ +++++- ++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +++++ bsz, q_len, _ = hidden_states.shape +++++ +++++ query_states = self.q_proj(hidden_states) +++++ key_states = self.k_proj(hidden_states) +++++ value_states = self.v_proj(hidden_states) +++++ +++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++- ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ +++++ kv_seq_len = key_states.shape[-2] +++++ if past_key_value is not None: +++++- if self.layer_idx is None: +++++- raise ValueError( +++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++- "with a layer index." +++++- ) +++++- if isinstance(past_key_value, StaticCache): +++++- kv_seq_len = key_states.shape[-2] +++++- else: +++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ +++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++ +++++ if past_key_value is not None: +++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++ ++++++ # --- 2. 动态调度核心注意力计算 --- ++++++ global Long_Prompt ++++++ if Long_Prompt >= 1: ++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- ++++++ fa_attention_mask = None ++++++ if attention_mask is not None: ++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++ fa_attention_mask = (mask_slice != 0) ++++++ ++++++ attn_output = mindspore.ops.flash_attention_score( ++++++ query=query_states, ++++++ key=key_states, ++++++ value=value_states, ++++++ head_num=self.num_heads, ++++++ attn_mask=fa_attention_mask, ++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, ++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ input_layout="BNSD", ++++++ sparse_mode=0, ++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 ++++++ ) +++++ +++++- if isinstance(past_key_value, StaticCache): +++++- kv_seq_len = key_states.shape[-2] ++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = self.o_proj(attn_output) ++++++ attn_weights = None ++++++ if output_attentions: ++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++++ +++++- # repeat k/v heads if n_kv_heads < n_heads +++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++++- +++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++ else: ++++++ # --- Eager Attention 路径 (用于短序列和解码) --- ++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++ ++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++ +++++- if attention_mask is not None: +++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++- attn_weights = attn_weights + causal_mask ++++++ if attention_mask is not None: ++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++ attn_weights = attn_weights + causal_mask +++++ +++++- # upcast attention to fp32 +++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++- attn_output = ops.matmul(attn_weights, value_states) ++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++++ attn_output = ops.matmul(attn_weights, value_states) +++++ +++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++- raise ValueError( +++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++++- f" {attn_output.shape}" +++++- ) ++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++++ raise ValueError( ++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" ++++++ ) +++++ +++++- attn_output = ops.transpose(attn_output, 1, 2) +++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = ops.transpose(attn_output, 1, 2) ++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = self.o_proj(attn_output) +++++ +++++- attn_output = self.o_proj(attn_output) +++++- # @lwx ++++++ if not output_attentions: ++++++ attn_weights = None +++++ +++++- # max_seq_len = self.max_position_embeddings # 2048 +++++- +++++- # if attention_mask is not None: +++++- # # attention_mask: [B, 1, Sq, Sk] +++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++- +++++- # # pad 到 [max_seq_len, max_seq_len] +++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++- # global_attention_mask = padded_mask +++++- # else: +++++- # global_attention_mask = None +++++- +++++- +++++- # sparse_mode=3 +++++- # attn_output = mindspore.ops.flash_attention_score( +++++- # query=query_states, +++++- # key=key_states, +++++- # value=value_states, +++++- # real_shift=None, +++++- # padding_mask=None, +++++- +++++- # head_num=self.num_heads, +++++- # attn_mask=global_attention_mask, +++++- # keep_prob=1.0 - self.attention_dropout, +++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++++- # input_layout="BNSD", +++++- # pre_tokens=2147483647, +++++- # next_tokens=2147483647, +++++- # inner_precise=0, +++++- # drop_mask=None, +++++- # prefix=None, +++++- # actual_seq_qlen=None, +++++- # actual_seq_kvlen=None, +++++- # sparse_mode=sparse_mode, +++++- # ) +++++- if not output_attentions: +++++- attn_weights = None +++++- +++++ return attn_output, attn_weights, past_key_value +++++ +++++- +++++ # class Qwen2MoeFlashAttention(nn.Module): +++++ # """ +++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +++++ # return final_hidden_states, router_logits +++++ +++++ +++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++-# """ +++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +++++-# """ +++++-# def __init__(self, config: Qwen2MoeConfig): +++++-# super().__init__() +++++-# self.num_experts = config.num_experts +++++-# self.top_k = config.num_experts_per_tok +++++-# self.norm_topk_prob = config.norm_topk_prob +++++- +++++-# # 门控网络 +++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++-# # 专家列表 +++++-# self.experts = nn.ModuleList( +++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++-# ) +++++-# # 共享专家 +++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-# @no_grad() +++++-# def _moe_infer_decode( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# """ +++++-# 【解码路径】针对 sequence_length=1 的极致优化。 +++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++++-# """ +++++-# batch_size, hidden_dim = hidden_states.shape +++++- +++++-# expert_outputs_list = [ +++++-# ops.cat([ +++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++-# ], dim=0) +++++-# for i in range(batch_size) +++++-# ] +++++- +++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++++-# # shape: (batch_size, top_k, hidden_dim) +++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++- +++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++- +++++-# return moe_output.squeeze(1) +++++- +++++-# @no_grad() +++++-# def _moe_infer_prefill( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# """ +++++-# 【预填充路径】针对 sequence_length > 1 的优化。 +++++-# 按专家对 Token 进行分组,并进行批处理。 +++++-# """ +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens = hidden_states.shape[0] +++++-# flat_selected_experts = selected_experts.flatten() +++++- +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++- +++++-# active_experts = ops.unique(flat_selected_experts) +++++- +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++- +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++-# selected_token_indices = token_indices[mask] +++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++- +++++-# current_states = hidden_states[selected_token_indices] +++++- +++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++- +++++-# moe_output = moe_output.index_add( +++++-# dim=0, +++++-# index=selected_token_indices, +++++-# source=expert_output.to(hidden_states.dtype) +++++-# ) +++++-# return moe_output +++++- +++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-# """ +++++-# 顶层 forward 方法,作为智能分发器。 +++++-# """ +++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- +++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-# router_logits = self.gate(hidden_states_reshaped) +++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- +++++-# if self.norm_topk_prob: +++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- +++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++-# moe_output = None +++++-# # 在推理时,根据序列长度选择最优路径 +++++-# if not self.training: +++++-# if sequence_length == 1: +++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++-# else: +++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++-# else: +++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++++-# raise NotImplementedError("Training path is not implemented.") +++++- +++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++++- +++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++++- +++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++++- +++++-# return final_hidden_states, router_logits +++++- +++++- +++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++-# """ +++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++++-# """ +++++-# def __init__(self, config: Qwen2MoeConfig): +++++-# super().__init__() +++++-# self.num_experts = config.num_experts +++++-# self.top_k = config.num_experts_per_tok +++++-# self.norm_topk_prob = config.norm_topk_prob +++++- +++++-# # 门控网络 +++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++-# # 专家列表 +++++-# self.experts = nn.ModuleList( +++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++-# ) +++++-# # 共享专家 +++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-# @no_grad() +++++-# def _moe_infer_decode( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# batch_size, _ = hidden_states.shape +++++-# expert_outputs_list = [ +++++-# ops.cat([ +++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++-# ], dim=0) +++++-# for i in range(batch_size) +++++-# ] +++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++-# return moe_output.squeeze(1) +++++- +++++-# @no_grad() +++++-# def _moe_infer_prefill( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens = hidden_states.shape[0] +++++-# flat_selected_experts = selected_experts.flatten() +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++-# active_experts = ops.unique(flat_selected_experts) +++++- +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++-# selected_token_indices = token_indices[mask] +++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++-# current_states = hidden_states[selected_token_indices] +++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++-# moe_output = moe_output.index_add( +++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++-# ) +++++-# return moe_output +++++- +++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-# """ +++++-# 顶层 forward 方法,作为智能分发器。 +++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++++-# """ +++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- +++++-# # 1. 门控计算 (通用逻辑) +++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-# router_logits = self.gate(hidden_states_reshaped) +++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- +++++-# if self.norm_topk_prob: +++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- +++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++-# # 2. 智能分发到最优 MoE 路径 +++++-# moe_output = None +++++-# if not self.training: +++++-# if sequence_length == 1: +++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++-# else: +++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++-# else: +++++-# raise NotImplementedError("Training path is not implemented.") +++++- +++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++- +++++-# # 4. 合并 MoE 输出和共享专家输出 +++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++- +++++-# # 5. 恢复原始形状并返回 +++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++- +++++-# return final_hidden_states, router_logits +++++- +++++-# prefill fastest +++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++-# """ +++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++++-# """ +++++-# def __init__(self, config: Qwen2MoeConfig): +++++-# super().__init__() +++++-# self.num_experts = config.num_experts +++++-# self.top_k = config.num_experts_per_tok +++++-# self.norm_topk_prob = config.norm_topk_prob +++++- +++++-# # 门控网络 +++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++-# # 专家列表 +++++-# self.experts = nn.ModuleList( +++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++-# ) +++++-# # 共享专家 +++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-# @no_grad() +++++-# def _moe_infer_dispatch( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# """ +++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++++-# """ +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens, _ = hidden_states.shape +++++- +++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++++-# flat_selected_experts = selected_experts.flatten() +++++-# flat_routing_weights = routing_weights.flatten() +++++- +++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++- +++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++++-# active_experts = ops.unique(flat_selected_experts) +++++- +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++- +++++-# # 找到所有分配给该专家的 token +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++- +++++-# # 使用 mask 选取对应的 token 和权重 +++++-# current_token_indices = token_indices[mask] +++++-# current_routing_weights = flat_routing_weights[mask] +++++-# current_hidden_states = hidden_states[current_token_indices] +++++- +++++-# # 对这些 token 进行批处理 +++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++- +++++-# # 使用 index_add 将结果精确地加回到对应位置 +++++-# moe_output = moe_output.index_add( +++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++++-# ) +++++-# return moe_output +++++- +++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-# """ +++++-# 顶层 forward 方法,作为智能分发器。 +++++-# """ +++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- +++++-# # 1. 门控计算 +++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-# router_logits = self.gate(hidden_states_reshaped) +++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- +++++-# if self.norm_topk_prob: +++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- +++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++- +++++-# # 2. 调用统一的 MoE 计算内核 +++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++++- +++++-# # 3. 统一处理共享专家 +++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++- +++++-# # 4. 合并输出 +++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++- +++++-# # 5. 恢复原始形状并返回 +++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++- +++++-# return final_hidden_states, router_logits +++++- +++++- +++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++-# """ +++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++-# 【最终高性能与高精度版】: +++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++++-# 3. 这样实现了速度和准确性的两全其美。 +++++-# """ +++++-# def __init__(self, config: Qwen2MoeConfig): +++++-# super().__init__() +++++-# self.num_experts = config.num_experts +++++-# self.top_k = config.num_experts_per_tok +++++-# self.norm_topk_prob = config.norm_topk_prob +++++- +++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++-# self.experts = nn.ModuleList( +++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++-# ) +++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-# @no_grad() +++++-# def _moe_infer_decode( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# """ +++++-# 【解码路径】极致优化版:bmm + 高精度累加。 +++++-# """ +++++-# original_dtype = hidden_states.dtype +++++-# batch_size, _ = hidden_states.shape +++++- +++++-# expert_outputs_list = [ +++++-# ops.cat([ +++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++-# ], dim=0) +++++-# for i in range(batch_size) +++++-# ] +++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++- +++++-# # 在 float32 下执行 bmm,得到高精度结果 +++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++- +++++-# # 将高精度结果转换回原始数据类型 +++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++++- +++++-# return moe_output +++++- +++++-# @no_grad() +++++-# def _moe_infer_prefill( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# selected_experts: mindspore.Tensor, +++++-# routing_weights: mindspore.Tensor +++++-# ) -> mindspore.Tensor: +++++-# """ +++++-# 【预填充路径】与原始实现一致,结果精确。 +++++-# """ +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens, _ = hidden_states.shape +++++-# flat_selected_experts = selected_experts.flatten() +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++-# active_experts = ops.unique(flat_selected_experts) +++++- +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++-# selected_token_indices = token_indices[mask] +++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++-# current_states = hidden_states[selected_token_indices] +++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++-# moe_output = moe_output.index_add( +++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++-# ) +++++-# return moe_output +++++- +++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++- +++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-# router_logits = self.gate(hidden_states_reshaped) +++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++- +++++-# if self.norm_topk_prob: +++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- +++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++++-# # 如果模型主体是 float16,后续再转换 +++++- +++++-# moe_output = None +++++-# if not self.training: +++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++++-# # _moe_infer_decode 内部会处理好类型转换 +++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++++-# if sequence_length == 1: +++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++-# else: +++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++-# else: +++++-# raise NotImplementedError("Training path is not implemented.") +++++- +++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++- +++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++- +++++-# return final_hidden_states, router_logits +++++- +++++- +++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++-# """ +++++-# 【融合版】一个混合专家模块,内置两种推理策略, +++++-# 由外部全局变量 `Long_Prompt` 控制: +++++- +++++-# - if Long_Prompt is True: 【精度优先模式】 +++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++++-# 适用于处理长序列,避免误差累积。 +++++- +++++-# - if Long_Prompt is False: 【速度优先模式】 +++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +++++-# """ +++++-# def __init__(self, config: Qwen2MoeConfig): +++++-# super().__init__() +++++-# self.num_experts = config.num_experts +++++-# self.top_k = config.num_experts_per_tok +++++-# self.norm_topk_prob = config.norm_topk_prob +++++- +++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++-# self.experts = nn.ModuleList( +++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++-# ) +++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-# # --- 速度优先模式的辅助函数 --- +++++-# @no_grad() +++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++-# original_dtype = hidden_states.dtype +++++-# batch_size, _ = hidden_states.shape +++++-# expert_outputs_list = [ +++++-# ops.cat([ +++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++-# ], dim=0) +++++-# for i in range(batch_size) +++++-# ] +++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++-# weights_fp32 = routing_weights.to(mindspore.float32) +++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++-# return moe_output_fp32.squeeze(1).to(original_dtype) +++++- +++++-# @no_grad() +++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens, _ = hidden_states.shape +++++-# flat_selected_experts = selected_experts.flatten() +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++-# active_experts = ops.unique(flat_selected_experts) +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++-# selected_token_indices = token_indices[mask] +++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++-# current_states = hidden_states[selected_token_indices] +++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++++-# return moe_output +++++- +++++-# # --- 精度优先模式的辅助函数 --- +++++-# @no_grad() +++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++-# moe_output = ops.zeros_like(hidden_states) +++++-# num_tokens, _ = hidden_states.shape +++++-# flat_selected_experts = selected_experts.flatten() +++++-# flat_routing_weights = routing_weights.flatten() +++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++-# active_experts = ops.unique(flat_selected_experts) +++++-# for expert_idx_tensor in active_experts: +++++-# expert_idx = expert_idx_tensor.item() +++++-# expert_layer = self.experts[expert_idx] +++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++-# current_token_indices = token_indices[mask] +++++-# current_routing_weights = flat_routing_weights[mask] +++++-# current_hidden_states = hidden_states[current_token_indices] +++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++-# return moe_output +++++- +++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-# # 声明我们将要使用一个在模块外部定义的全局变量 +++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++++-# global Long_Prompt +++++- +++++-# # 1. 门控计算 (所有模式通用) +++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-# router_logits = self.gate(hidden_states_reshaped) +++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++-# if self.norm_topk_prob: +++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++- +++++-# moe_output = None +++++-# if not self.training: +++++-# # 根据 Long_Prompt 标志选择模式 +++++-# if Long_Prompt: +++++-# # --- 精度优先模式 --- +++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++-# else: +++++-# # --- 速度优先模式 --- +++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++-# if sequence_length == 1: +++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++-# else: +++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++-# else: +++++-# raise NotImplementedError("Training path is not implemented.") +++++- +++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++- +++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++- +++++-# return final_hidden_states, router_logits +++++- +++++ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ """ +++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++ return moe_output_fp32.squeeze(1).to(original_dtype) +++++ ++++++ # @no_grad() ++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++ # num_tokens, _ = hidden_states.shape ++++++ # flat_selected_experts = selected_experts.flatten() ++++++ # sorted_expert_indices = flat_selected_experts.argsort() ++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++++ # original_token_indices = sorted_expert_indices // self.top_k ++++++ # moe_output = ops.zeros_like(hidden_states) ++++++ # current_token_offset = 0 ++++++ # for i in range(self.num_experts): ++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset ++++++ # if expert_token_count == 0: ++++++ # continue ++++++ # end_offset = current_token_offset + expert_token_count ++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] ++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++ # current_token_offset += expert_token_count ++++++ # return moe_output ++++++ +++++ @no_grad() +++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++- num_tokens, _ = hidden_states.shape +++++- flat_selected_experts = selected_experts.flatten() +++++- sorted_expert_indices = flat_selected_experts.argsort() +++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++- original_token_indices = sorted_expert_indices // self.top_k ++++++ """ ++++++ 优化版 MoE prefill (速度优先模式): ++++++ - 批量张量化处理同一个 expert 的所有 token ++++++ - 跳过无 token 的专家 ++++++ - 保持结果完全一致 ++++++ """ +++++ moe_output = ops.zeros_like(hidden_states) +++++- current_token_offset = 0 +++++- for i in range(self.num_experts): +++++- expert_token_count = tokens_per_expert[i] - current_token_offset +++++- if expert_token_count == 0: +++++- continue +++++- end_offset = current_token_offset + expert_token_count +++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++- expert_hidden_states = hidden_states[expert_original_token_indices] +++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++- current_token_offset += expert_token_count ++++++ ++++++ flat_selected_experts = selected_experts.flatten() ++++++ flat_routing_weights = routing_weights.flatten() ++++++ ++++++ idxs = flat_selected_experts.argsort() ++++++ sorted_expert_indices = flat_selected_experts[idxs] ++++++ sorted_token_indices = idxs // self.top_k ++++++ ++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) ++++++ ++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++++++ ++++++ for expert_id in active_experts.tolist(): ++++++ start = int(tokens_per_expert[:expert_id].sum().item()) ++++++ end = start + int(tokens_per_expert[expert_id].item()) ++++++ ++++++ token_idx = sorted_token_indices[start:end] ++++++ expert_tokens = hidden_states[token_idx] ++++++ ++++++ expert_out = self.experts[expert_id](expert_tokens) ++++++ ++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) ++++++ ++++++ moe_output = mindspore.mint.scatter_add( ++++++ moe_output, ++++++ 0, ++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), ++++++ scaled_out.to(hidden_states.dtype) ++++++ ) ++++++ +++++ return moe_output +++++ ++++++ +++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++++ @no_grad() +++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++ +++++ moe_output = None +++++- if Long_Prompt: +++++- # --- 精度优先模式 (ACCURACY MODE) --- +++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ # if Long_Prompt==0: ++++++ # # --- 精度优先模式 (ACCURACY MODE) --- ++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ # else: ++++++ # # --- 速度优先模式 (SPEED MODE) --- ++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++ # if sequence_length == 1: ++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ # else: ++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ ++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++ if sequence_length == 1: ++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++ else: +++++- # --- 速度优先模式 (SPEED MODE) --- +++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++- if sequence_length == 1: +++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++- else: +++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++- ++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ +++++ +++++ # 3. 共享专家计算与合并 (所有模式通用) +++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++ +++++ return final_hidden_states, router_logits +++++ ++++++ +++++ class Qwen2MoeDecoderLayer(nn.Module): +++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++++ super().__init__() +++++ self.hidden_size = config.hidden_size +++++ +++++- # if Long_Prompt: +++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++- # else: ++++++ # if Long_Prompt == 2: +++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++ # else: ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ +++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++ +++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++ ) +++++ +++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++ # attention_mask, ++++++ # sequence_length=sequence_length, ++++++ # target_length=target_length, ++++++ # dtype=dtype, ++++++ # min_dtype=min_dtype, ++++++ # cache_position=cache_position, ++++++ # batch_size=input_tensor.shape[0], ++++++ # ) ++++++ #@dwj ++++++ causal_mask = get_cached_causal_mask_with_cache_position( +++++ attention_mask, +++++ sequence_length=sequence_length, +++++ target_length=target_length, +++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++++ """ +++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache ++++++ _causal_mask_cache.clear() +++++ +++++ input_ids = kwargs.get("input_ids") +++++ if input_ids is None and args: +++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ +++++ if input_ids is not None: +++++ prompt_length = input_ids.shape[1] +++++- +++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +++++- Long_Prompt = True ++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: ++++++ Long_Prompt = 2 ++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: ++++++ Long_Prompt = 0 +++++ else: +++++- Long_Prompt = False ++++++ Long_Prompt = 1 ++++++ +++++ +++++ return super().generate(*args, **kwargs) +++++ +++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++ dtype = self.lm_head.weight.dtype +++++ min_dtype = float(ops.finfo(dtype).min) +++++ +++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++ # attention_mask, ++++++ # sequence_length=sequence_length, ++++++ # target_length=past_key_values.get_max_length(), ++++++ # dtype=dtype, ++++++ # min_dtype=min_dtype, ++++++ # cache_position=cache_position, ++++++ # batch_size=batch_size, ++++++ # ) ++++++ ++++++ #@dwj ++++++ attention_mask = get_cached_causal_mask_with_cache_position( +++++ attention_mask, +++++ sequence_length=sequence_length, +++++ target_length=past_key_values.get_max_length(), +++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++deleted file mode 100644 +++++index 6dfb5b93..00000000 +++++--- a/patches/0001-20251104commit.patch ++++++++ /dev/null +++++@@ -1,1272 +0,0 @@ +++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++-From: Pinoeer-kingxi <13022943007@163.com> +++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++-Subject: [PATCH] 20251104commit +++++- +++++---- +++++- mindnlp/transformers/cache_utils.py | 28 +- +++++- .../models/deepseek/modeling_deepseek.py | 149 ++- +++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++- 3 files changed, 976 insertions(+), 87 deletions(-) +++++- +++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++-index cadd2e04..02f8d4be 100644 +++++---- a/mindnlp/transformers/cache_utils.py +++++-+++ b/mindnlp/transformers/cache_utils.py +++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++- # k_out[:, :, cache_position] = key_states +++++- # v_out[:, :, cache_position] = value_states +++++-- if ON_ORANGE_PI: +++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++-- else: +++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++-- +++++-+ # if ON_ORANGE_PI: +++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++-+ # else: +++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++-+ if cache_position.ndim > 1: +++++-+ cache_position = cache_position.flatten() +++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++-+ cache_position = cache_position.int() +++++-+ +++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++-+ k_out[:, :, cache_position] = key_states +++++-+ v_out[:, :, cache_position] = value_states +++++-+ +++++- return k_out, v_out +++++- +++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++-index c695b944..d8303e45 100644 +++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++- # Copied from transformers.models.llama.modeling_llama.rotate_half +++++- def rotate_half(x): +++++- """Rotates half the hidden dims of the input.""" +++++-- x1 = x[..., : x.shape[-1] // 2] +++++-- x2 = x[..., x.shape[-1] // 2 :] +++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++-+ # x1 = x[..., : x.shape[-1] // 2] +++++-+ # x2 = x[..., x.shape[-1] // 2 :] +++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++- return ops.cat((-x2, x1), dim=-1) +++++- +++++- +++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++- if self.training: +++++- raise NotImplementedError("Training is not supported yet.") +++++- else: +++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++-- if self.config.n_shared_experts is not None: +++++-- y = y + self.shared_experts(identity) +++++-- return y +++++-+ # @lwx +++++-+ if orig_shape[1] == 1: +++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++-+ y=y.view(*orig_shape) +++++-+ if self.config.n_shared_experts is not None: +++++-+ y = y + self.shared_experts(identity) +++++-+ return y +++++-+ else: +++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++-+ if self.config.n_shared_experts is not None: +++++-+ y = y + self.shared_experts(identity) +++++-+ return y +++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++-+ # if self.config.n_shared_experts is not None: +++++-+ # y = y + self.shared_experts(identity) +++++-+ # return y +++++-+ +++++-+ @no_grad() +++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++-+ +++++-+ expert_cache = ops.zeros_like(x) +++++-+ for i in range(self.num_experts_per_tok): +++++-+ expert_id = flat_expert_indices[i].item() +++++-+ weight = flat_expert_weights[i].item() +++++-+ expert = self.experts[expert_id] +++++-+ expert_out = expert(x) +++++-+ expert_cache += expert_out * weight +++++-+ return expert_cache +++++- +++++- @no_grad() +++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++-- # expert_cache = torch.zeros_like(x) +++++-- # idxs = flat_expert_indices.argsort() +++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++-- # token_idxs = idxs // self.num_experts_per_tok +++++-- # for i, end_idx in enumerate(tokens_per_expert): +++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++-- # if start_idx == end_idx: +++++-- # continue +++++-- # expert = self.experts[i] +++++-- # exp_token_idx = token_idxs[start_idx:end_idx] +++++-- # expert_tokens = x[exp_token_idx] +++++-- # expert_out = expert(expert_tokens) +++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++-- # return expert_cache +++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++- expert_cache = ops.zeros_like(x) +++++- idxs = flat_expert_indices.argsort() +++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++- token_idxs = idxs // self.num_experts_per_tok +++++-+ +++++- for i, end_idx in enumerate(tokens_per_expert): +++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++- if start_idx == end_idx: +++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++- expert_out = expert(expert_tokens) +++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++-+ +++++- return expert_cache +++++-+ +++++-+ # @no_grad() +++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++-+ # # expert_cache = torch.zeros_like(x) +++++-+ # # idxs = flat_expert_indices.argsort() +++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++-+ # # token_idxs = idxs // self.num_experts_per_tok +++++-+ # # for i, end_idx in enumerate(tokens_per_expert): +++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++-+ # # if start_idx == end_idx: +++++-+ # # continue +++++-+ # # expert = self.experts[i] +++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++-+ # # expert_tokens = x[exp_token_idx] +++++-+ # # expert_out = expert(expert_tokens) +++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++-+ # # return expert_cache +++++-+ # expert_cache = ops.zeros_like(x) +++++-+ # idxs = flat_expert_indices.argsort() +++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++-+ # token_idxs = idxs // self.num_experts_per_tok +++++-+ +++++-+ # for i, end_idx in enumerate(tokens_per_expert): +++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++-+ # if start_idx == end_idx: +++++-+ # continue +++++-+ # expert = self.experts[i] +++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++++-+ # expert_tokens = x[exp_token_idx] +++++-+ # expert_out = expert(expert_tokens) +++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++-+ +++++-+ # return expert_cache +++++-+ # @no_grad() +++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++-+ # expert_cache = ops.zeros_like(x) +++++-+ +++++-+ # # 排序保证顺序一致 +++++-+ # idxs = flat_expert_indices.argsort() +++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++-+ # token_idxs = idxs // self.num_experts_per_tok +++++-+ +++++-+ # # 找出有 token 的专家 +++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++-+ +++++-+ # for i in active_experts.tolist(): +++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++-+ # end_idx = tokens_per_expert[i] +++++-+ # if start_idx == end_idx: # 没有 token +++++-+ # continue +++++-+ +++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++++-+ # expert_tokens = x[exp_token_idx] +++++-+ # expert_out = self.experts[i](expert_tokens) +++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++-+ +++++-+ # expert_cache = mindspore.mint.scatter_add( +++++-+ # expert_cache, +++++-+ # 0, +++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++-+ # expert_out +++++-+ # ) +++++-+ +++++-+ # return expert_cache +++++-+ +++++-+ +++++- +++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++- # """ +++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++- +++++- # Initialize weights and apply final processing +++++- self.post_init() +++++-+ self.warm_up = False +++++-+ +++++-+ def warmup_moe_model_deep(self): +++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++-+ test_texts = [ +++++-+ "warmup short", +++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++-+ ] +++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++-+ if tokenizer is None: +++++-+ from mindnlp.transformers import AutoTokenizer +++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++-+ self._warmup_tokenizer = tokenizer +++++-+ +++++-+ for text in test_texts: +++++-+ inputs = tokenizer(text, return_tensors="ms") +++++-+ with mindspore._no_grad(): +++++-+ _ = self(**inputs, use_cache=False) +++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++- +++++- def get_input_embeddings(self): +++++- return self.model.embed_tokens +++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++- ```""" +++++-+ if not self.warm_up: +++++-+ self.warm_up = True +++++-+ self.warmup_moe_model_deep() +++++-+ +++++- output_attentions = ( +++++- output_attentions +++++- if output_attentions is not None +++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++-index 3cbf820e..d4c6b651 100644 +++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++-@@ -18,7 +18,6 @@ +++++- # See the License for the specific language governing permissions and +++++- # limitations under the License. +++++- """MindSpore Qwen2MoE model.""" +++++-- +++++- import math +++++- from typing import List, Optional, Tuple, Union +++++- +++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++- TokenClassifierOutput, +++++- ) +++++- from ...modeling_utils import PreTrainedModel +++++-+from ...generation import GenerationMixin +++++- from ....utils import logging +++++- from .configuration_qwen2_moe import Qwen2MoeConfig +++++- +++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++- self.variance_epsilon = eps +++++- +++++- def forward(self, hidden_states): +++++-+ # @dwj +++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++-+ # @lwx +++++-+ # if not self.training : +++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++- input_dtype = hidden_states.dtype +++++- hidden_states = hidden_states.to(mindspore.float32) +++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++-@@ -234,6 +239,8 @@ def rotate_half(x): +++++- """Rotates half the hidden dims of the input.""" +++++- x1 = x[..., : x.shape[-1] // 2] +++++- x2 = x[..., x.shape[-1] // 2 :] +++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++- return ops.cat((-x2, x1), dim=-1) +++++- +++++- +++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++- self.config = config +++++- self.hidden_size = config.hidden_size +++++- self.intermediate_size = intermediate_size +++++-+ +++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++- self.act_fn = ACT2FN[config.hidden_act] +++++- +++++- def forward(self, x): +++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++-- +++++- +++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++-+ # @lwx +++++-+ # gate_up_output = self.gate_up_proj(x) +++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++-+ # return self.down_proj(swiglu_output) +++++-+ +++++-+ # def forward(self, x): +++++-+ # gate_proj_out = self.gate_proj(x) +++++-+ # up_proj_out = self.up_proj(x) +++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++-+ # return self.down_proj(swiglu_out) +++++-+ +++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++- """ +++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++- use_cache: bool = False, +++++- cache_position: Optional[mindspore.Tensor] = None, +++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++-+ +++++-+ +++++-+ +++++- bsz, q_len, _ = hidden_states.shape +++++- +++++- query_states = self.q_proj(hidden_states) +++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++- "with a layer index." +++++- ) +++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++-+ if isinstance(past_key_value, StaticCache): +++++-+ kv_seq_len = key_states.shape[-2] +++++-+ else: +++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++- +++++- if past_key_value is not None: +++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++-+ +++++-+ if isinstance(past_key_value, StaticCache): +++++-+ kv_seq_len = key_states.shape[-2] +++++- +++++- # repeat k/v heads if n_kv_heads < n_heads +++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++++-- +++++-+ +++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++- +++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++-- raise ValueError( +++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++-- f" {attn_weights.shape}" +++++-- ) +++++-- +++++-- if attention_mask is not None: # no matter the length, we just slice it +++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++-+ if attention_mask is not None: +++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++- attn_weights = attn_weights + causal_mask +++++- +++++- # upcast attention to fp32 +++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++- +++++- attn_output = self.o_proj(attn_output) +++++-- +++++-+ # @lwx +++++-+ +++++-+ # max_seq_len = self.max_position_embeddings # 2048 +++++-+ +++++-+ # if attention_mask is not None: +++++-+ # # attention_mask: [B, 1, Sq, Sk] +++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++-+ +++++-+ # # pad 到 [max_seq_len, max_seq_len] +++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++-+ # global_attention_mask = padded_mask +++++-+ # else: +++++-+ # global_attention_mask = None +++++-+ +++++-+ +++++-+ # sparse_mode=3 +++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++-+ # query=query_states, +++++-+ # key=key_states, +++++-+ # value=value_states, +++++-+ # real_shift=None, +++++-+ # padding_mask=None, +++++-+ +++++-+ # head_num=self.num_heads, +++++-+ # attn_mask=global_attention_mask, +++++-+ # keep_prob=1.0 - self.attention_dropout, +++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++-+ # input_layout="BNSD", +++++-+ # pre_tokens=2147483647, +++++-+ # next_tokens=2147483647, +++++-+ # inner_precise=0, +++++-+ # drop_mask=None, +++++-+ # prefix=None, +++++-+ # actual_seq_qlen=None, +++++-+ # actual_seq_kvlen=None, +++++-+ # sparse_mode=sparse_mode, +++++-+ # ) +++++- if not output_attentions: +++++- attn_weights = None +++++- +++++- return attn_output, attn_weights, past_key_value +++++- +++++- +++++-+class Qwen2MoeFlashAttention(nn.Module): +++++-+ """ +++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++-+ +++++-+ 关键改动: +++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++-+ 直接传入原始的 key 和 value 张量效率更高。 +++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++-+ """ +++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++-+ super().__init__() +++++-+ self.config = config +++++-+ self.layer_idx = layer_idx +++++-+ self.hidden_size = config.hidden_size +++++-+ self.num_heads = config.num_attention_heads +++++-+ self.head_dim = self.hidden_size // self.num_heads +++++-+ self.num_key_value_heads = config.num_key_value_heads +++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++-+ self.max_position_embeddings = config.max_position_embeddings +++++-+ self.rope_theta = config.rope_theta +++++-+ self.attention_dropout = config.attention_dropout +++++-+ +++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +++++-+ raise ValueError( +++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++-+ ) +++++-+ +++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++-+ +++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++-+ self.head_dim, +++++-+ max_position_embeddings=self.max_position_embeddings, +++++-+ base=self.rope_theta, +++++-+ ) +++++-+ +++++-+ def forward( +++++-+ self, +++++-+ hidden_states: mindspore.Tensor, +++++-+ attention_mask: Optional[mindspore.Tensor] = None, +++++-+ position_ids: Optional[mindspore.Tensor] = None, +++++-+ past_key_value: Optional[Cache] = None, +++++-+ output_attentions: bool = False, +++++-+ use_cache: bool = False, +++++-+ cache_position: Optional[mindspore.Tensor] = None, +++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++-+ +++++-+ bsz, q_len, _ = hidden_states.shape +++++-+ +++++-+ # 1. 线性投射 Q, K, V +++++-+ query_states = self.q_proj(hidden_states) +++++-+ key_states = self.k_proj(hidden_states) +++++-+ value_states = self.v_proj(hidden_states) +++++-+ +++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++-+ # query: [B, S, H*D] -> [B, N1, S, D] +++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ +++++-+ # 3. RoPE 旋转位置编码 +++++-+ kv_seq_len = key_states.shape[-2] +++++-+ if past_key_value is not None: +++++-+ if self.layer_idx is None: +++++-+ raise ValueError( +++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++-+ "with a layer index." +++++-+ ) +++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++-+ if cache_position.shape[0] == 1: +++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++-+ kv_seq_len = past_seen_tokens + 1 +++++-+ else: +++++-+ # prefill 阶段:cache_position 是范围,使用其长度 +++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++-+ else: +++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++-+ +++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++-+ +++++-+ # 4. KV 缓存更新 +++++-+ if past_key_value is not None: +++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++-+ key_states, value_states = past_key_value.update( +++++-+ key_states, value_states, self.layer_idx, cache_kwargs +++++-+ ) +++++-+ +++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++-+ if cache_position.shape[0] == 1: +++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++-+ kv_seq_len = key_states.shape[-2] +++++-+ +++++-+ # 5. [重要] 准备 Attention Mask +++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++-+ fa_attention_mask = None +++++-+ if attention_mask is not None: +++++-+ # 截取与当前key长度匹配的部分 +++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++-+ fa_attention_mask = (mask_slice != 0) +++++-+ +++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++-+ input_dtype = query_states.dtype +++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++-+ query_states = query_states.to(mindspore.float16) +++++-+ key_states = key_states.to(mindspore.float16) +++++-+ value_states = value_states.to(mindspore.float16) +++++-+ +++++-+ # 6. [核心] 调用 flash_attention_score 算子 +++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++-+ attn_output = mindspore.ops.flash_attention_score( +++++-+ query=query_states, +++++-+ key=key_states, +++++-+ value=value_states, +++++-+ head_num=self.num_heads, # 传入Q的头数(N1) +++++-+ attn_mask=fa_attention_mask, +++++-+ keep_prob=1.0 - self.attention_dropout, +++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +++++-+ input_layout="BNSD", +++++-+ sparse_mode=0 # 使用 defaultMask 模式 +++++-+ ) +++++-+ +++++-+ # 恢复原始数据类型 +++++-+ attn_output = attn_output.to(input_dtype) +++++-+ +++++-+ # 7. 调整输出形状 +++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++-+ attn_output = self.o_proj(attn_output) +++++-+ +++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +++++-+ attn_weights = None +++++-+ if output_attentions: +++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++-+ +++++-+ return attn_output, attn_weights, past_key_value +++++-+ +++++-+ # def forward( +++++-+ # self, +++++-+ # hidden_states: mindspore.Tensor, +++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++++-+ # position_ids: Optional[mindspore.Tensor] = None, +++++-+ # past_key_value: Optional[Cache] = None, +++++-+ # output_attentions: bool = False, +++++-+ # use_cache: bool = False, +++++-+ # cache_position: Optional[mindspore.Tensor] = None, +++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++-+ +++++-+ # bsz, q_len, _ = hidden_states.shape +++++-+ +++++-+ # # 1. 线性投射 Q, K, V +++++-+ # query_states = self.q_proj(hidden_states) +++++-+ # key_states = self.k_proj(hidden_states) +++++-+ # value_states = self.v_proj(hidden_states) +++++-+ +++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ +++++-+ # # 3. RoPE 旋转位置编码 +++++-+ # kv_seq_len = key_states.shape[-2] +++++-+ # if past_key_value is not None: +++++-+ # if self.layer_idx is None: +++++-+ # raise ValueError( +++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++-+ # "with a layer index." +++++-+ # ) +++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++-+ +++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++-+ +++++-+ # # 4. KV 缓存更新 +++++-+ # if past_key_value is not None: +++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++-+ # key_states, value_states = past_key_value.update( +++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++++-+ # ) +++++-+ +++++-+ # # 5. 准备 Attention Mask +++++-+ # fa_attention_mask = None +++++-+ # if attention_mask is not None: +++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++-+ # fa_attention_mask = (mask_slice != 0) +++++-+ +++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++-+ # input_dtype = query_states.dtype +++++-+ +++++-+ # # 6. [核心] 调用 flash_attention_score 算子 +++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++-+ # query=query_states, +++++-+ # key=key_states, +++++-+ # value=value_states, +++++-+ # head_num=self.num_heads, +++++-+ # attn_mask=fa_attention_mask, +++++-+ # keep_prob=1.0 - self.attention_dropout, +++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++-+ # input_layout="BNSD", +++++-+ # sparse_mode=0, +++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++-+ # inner_precise=1 +++++-+ # ) +++++-+ +++++-+ # # 恢复原始数据类型 +++++-+ # attn_output = attn_output.to(input_dtype) +++++-+ +++++-+ # # 7. 调整输出形状 +++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++-+ # attn_output = self.o_proj(attn_output) +++++-+ +++++-+ # attn_weights = None +++++-+ # if output_attentions: +++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++-+ +++++-+ # return attn_output, attn_weights, past_key_value +++++-+ +++++-+ # def forward( +++++-+ # self, +++++-+ # hidden_states: mindspore.Tensor, +++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++++-+ # position_ids: Optional[mindspore.Tensor] = None, +++++-+ # past_key_value: Optional[Cache] = None, +++++-+ # output_attentions: bool = False, +++++-+ # use_cache: bool = False, +++++-+ # cache_position: Optional[mindspore.Tensor] = None, +++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++-+ +++++-+ # bsz, q_len, _ = hidden_states.shape +++++-+ +++++-+ # query_states = self.q_proj(hidden_states) +++++-+ # key_states = self.k_proj(hidden_states) +++++-+ # value_states = self.v_proj(hidden_states) +++++-+ +++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-+ +++++-+ # kv_seq_len = key_states.shape[-2] +++++-+ # if past_key_value is not None: +++++-+ # if self.layer_idx is None: +++++-+ # raise ValueError("`layer_idx` must be specified for caching") +++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++-+ +++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++-+ +++++-+ # if past_key_value is not None: +++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++-+ # key_states, value_states = past_key_value.update( +++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++++-+ # ) +++++-+ +++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++-+ +++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++-+ # query_states = query_states / math.sqrt(self.head_dim) +++++-+ # # <--- 修改结束 --- +++++-+ +++++-+ # fa_attention_mask = None +++++-+ # if attention_mask is not None: +++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++-+ # fa_attention_mask = (mask_slice != 0) +++++-+ +++++-+ # input_dtype = query_states.dtype +++++-+ +++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++-+ # query=query_states, # 传入已经预先缩放过的 query +++++-+ # key=key_states, +++++-+ # value=value_states, +++++-+ # head_num=self.num_heads, +++++-+ # attn_mask=fa_attention_mask, +++++-+ # keep_prob=1.0 - self.attention_dropout, +++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++-+ # input_layout="BNSD", +++++-+ # sparse_mode=0, +++++-+ # inner_precise=1 # 仍然保持内部高精度计算 +++++-+ # ) +++++-+ +++++-+ # attn_output = attn_output.to(input_dtype) +++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++-+ # attn_output = self.o_proj(attn_output) +++++-+ +++++-+ # attn_weights = None +++++-+ # if output_attentions: +++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++-+ +++++-+ # return attn_output, attn_weights, past_key_value +++++-+ +++++- QWEN2MOE_ATTENTION_CLASSES = { +++++- "eager": Qwen2MoeAttention, +++++-+ "flash-attention": Qwen2MoeFlashAttention, +++++- } +++++- +++++- +++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++- +++++-+ #@dwj +++++-+ # 只遍历激活的专家,而非全部专家 +++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++-- hidden_states = hidden_states.view(-1, hidden_dim) +++++-- # router_logits: (batch * sequence_length, n_experts) +++++-- router_logits = self.gate(hidden_states) +++++-- +++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++-- if self.norm_topk_prob: +++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++-- # we cast back to the input dtype +++++-- routing_weights = routing_weights.to(hidden_states.dtype) +++++-- +++++-- final_hidden_states = ops.zeros( +++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++-- ) +++++-- +++++-- # One hot encode the selected experts to create an expert mask +++++-- # this will be used to easily index which expert is going to be sollicitated +++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++-- +++++-- # Loop over all available experts in the model and perform the computation on each expert +++++-- for expert_idx in range(self.num_experts): +++++-- expert_layer = self.experts[expert_idx] +++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++-- +++++-- # Index the correct hidden states and compute the expert hidden state for +++++-- # the current expert. We need to make sure to multiply the output hidden +++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++-- if 0 not in idx.shape: +++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++-- +++++-- # However `index_add_` only support torch tensors for indexing so we'll use +++++-- # the `top_x` tensor here. +++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++-- +++++-- shared_expert_output = self.shared_expert(hidden_states) +++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++-- +++++-- final_hidden_states = final_hidden_states + shared_expert_output +++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++-+ num_tokens = hidden_states_reshaped.shape[0] +++++-+ +++++-+ router_logits = self.gate(hidden_states_reshaped) +++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++-+ +++++-+ if self.norm_topk_prob: +++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++-+ routing_weights = routing_weights.to(hidden_states.dtype) +++++-+ +++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++-+ flat_selected_experts = selected_experts.flatten() +++++-+ +++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++-+ token_indices = broadcasted_token_indices.flatten() +++++-+ +++++-+ active_experts = ops.unique(flat_selected_experts) +++++-+ +++++-+ for expert_idx_tensor in active_experts: +++++-+ expert_idx = expert_idx_tensor.item() +++++-+ expert_layer = self.experts[expert_idx] +++++-+ +++++-+ mask = (flat_selected_experts == expert_idx_tensor) +++++-+ selected_token_indices = token_indices[mask] +++++-+ selected_routing_weights = routing_weights.flatten()[mask] +++++-+ +++++-+ current_states = hidden_states_reshaped[selected_token_indices] +++++-+ +++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++-+ +++++-+ final_hidden_states = final_hidden_states.index_add( +++++-+ dim=0, +++++-+ index=selected_token_indices, +++++-+ source=expert_output.to(hidden_states.dtype) +++++-+ ) +++++-+ +++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++- +++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++-- return final_hidden_states, router_logits +++++-+ final_hidden_states = final_hidden_states + shared_expert_output +++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++-+ +++++-+ return final_hidden_states, router_logits +++++- +++++- +++++- class Qwen2MoeDecoderLayer(nn.Module): +++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++- +++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++- +++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++-+ +++++- if (layer_idx not in config.mlp_only_layers) and ( +++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++- ): +++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++- _skip_keys_device_placement = "past_key_values" +++++- _supports_cache_class = True +++++-+#lwx +++++-+ # _supports_static_cache = True +++++- +++++- def _init_weights(self, module): +++++- std = self.config.initializer_range +++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++- return causal_mask +++++- +++++- +++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++- _tied_weights_keys = ["lm_head.weight"] +++++- +++++- def __init__(self, config): +++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++- self.num_experts_per_tok = config.num_experts_per_tok +++++- # Initialize weights and apply final processing +++++- self.post_init() +++++-+ # @lwx +++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++-+ # self.generation_config.cache_implementation = "static" +++++-+ self._warmed_up = False +++++-+ +++++-+ def warmup_moe_model(self): +++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++-+ test_texts = [ +++++-+ "warmup short", +++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++-+ ] +++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++-+ if tokenizer is None: +++++-+ from mindnlp.transformers import AutoTokenizer +++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++-+ self._warmup_tokenizer = tokenizer +++++-+ +++++-+ for text in test_texts: +++++-+ inputs = tokenizer(text, return_tensors="ms") +++++-+ with mindspore._no_grad(): +++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++- +++++- def get_input_embeddings(self): +++++- return self.model.embed_tokens +++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++- ```""" +++++-+ if not self._warmed_up: +++++-+ self._warmed_up = True +++++-+ self.warmup_moe_model() +++++- +++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++- output_router_logits = ( +++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++- } +++++- ) +++++- return model_inputs +++++-+# @lwx +++++-+ # def _decode_one_tokens_logits( +++++-+ # self, +++++-+ # cur_token: mindspore.Tensor, +++++-+ # input_pos: Optional[mindspore.Tensor], +++++-+ # cache_position: mindspore.Tensor, +++++-+ # past_key_values: StaticCache, +++++-+ # ) -> mindspore.Tensor: +++++-+ # """ +++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++-+ +++++-+ # Args: +++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++-+ # input_pos: 输入位置信息,可选 +++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++-+ +++++-+ # Returns: +++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++-+ # """ +++++-+ # # 调用JIT编译的版本 +++++-+ # return self.get_decode_one_tokens_logits( +++++-+ # cur_token=cur_token, +++++-+ # input_pos=input_pos, +++++-+ # cache_position=cache_position, +++++-+ # past_key_values=past_key_values, +++++-+ # ) +++++-+ +++++-+ # @mindspore.jit(jit_level='O1') +++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++-+ # """ +++++-+ # JIT编译的函数,用于高效的单token解码 +++++-+ # 使用JIT编译优化以支持静态shape和高效执行 +++++-+ +++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++-+ # """ +++++-+ # outputs = self.model.forward( +++++-+ # input_ids=cur_token, +++++-+ # position_ids=input_pos, +++++-+ # cache_position=cache_position, +++++-+ # past_key_values=past_key_values, +++++-+ # use_cache=True, +++++-+ # return_dict=False, +++++-+ # ) +++++-+ +++++-+ # hidden_states = outputs[0] +++++-+ # logits = self.lm_head.forward(hidden_states) +++++-+ # logits = logits.float() +++++-+ +++++-+ # return logits[:, -1, :] +++++-+ +++++-+ # def _sample( +++++-+ # self, +++++-+ # input_ids: mindspore.Tensor, +++++-+ # logits_processor, +++++-+ # stopping_criteria, +++++-+ # generation_config, +++++-+ # synced_devices: bool, +++++-+ # streamer=None, +++++-+ # logits_warper=None, +++++-+ # **model_kwargs, +++++-+ # ): +++++-+ # """ +++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++-+ # """ +++++-+ # from ...generation.logits_process import LogitsProcessorList +++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++-+ # from mindnlp.core import nn, ops, no_grad +++++-+ # import numpy as np +++++-+ +++++-+ # # 检查是否使用 StaticCache +++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++-+ # # 否则,直接调用父类方法 +++++-+ # past_key_values = model_kwargs.get("past_key_values") +++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++-+ +++++-+ # if not isinstance(past_key_values, StaticCache): +++++-+ # # 不使用 StaticCache,直接调用父类方法 +++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++-+ # return super()._sample( +++++-+ # input_ids=input_ids, +++++-+ # logits_processor=logits_processor, +++++-+ # stopping_criteria=stopping_criteria, +++++-+ # generation_config=generation_config, +++++-+ # synced_devices=synced_devices, +++++-+ # streamer=streamer, +++++-+ # logits_warper=logits_warper, +++++-+ # **model_kwargs, +++++-+ # ) +++++-+ +++++-+ # # 使用 StaticCache,进入自定义循环 +++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++-+ # pad_token_id = generation_config._pad_token_tensor +++++-+ # output_attentions = generation_config.output_attentions +++++-+ # output_hidden_states = generation_config.output_hidden_states +++++-+ # output_scores = generation_config.output_scores +++++-+ # output_logits = generation_config.output_logits +++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +++++-+ # max_length = generation_config.max_length +++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++-+ # do_sample = generation_config.do_sample +++++-+ +++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++-+ # raise ValueError( +++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++-+ # f"{logits_warper})." +++++-+ # ) +++++-+ +++++-+ # # init attention / hidden states / scores tuples +++++-+ # scores = () if (return_dict_in_generate and output_scores) else None +++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++-+ +++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++-+ # encoder_hidden_states = ( +++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++-+ # ) +++++-+ +++++-+ # # keep track of which sequences are already finished +++++-+ # batch_size, cur_len = input_ids.shape +++++-+ # this_peer_finished = False +++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++-+ +++++-+ # time_record = [] +++++-+ # from ....utils.testing_utils import parse_flag_from_env +++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++-+ +++++-+ # while self._has_unfinished_sequences( +++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++-+ # ): +++++-+ # if _record_time: +++++-+ # import time as time_module +++++-+ # infer_start = time_module.time() +++++-+ +++++-+ # # prepare model inputs +++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++-+ +++++-+ # # prepare variable output controls +++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++-+ +++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++-+ # cur_cache_position = model_inputs.get("cache_position") +++++-+ # cur_past_key_values = model_inputs.get("past_key_values") +++++-+ # cur_input_ids = model_inputs.get("input_ids") +++++-+ +++++-+ # if (isinstance(cur_past_key_values, StaticCache) and +++++-+ # cur_cache_position is not None and +++++-+ # len(cur_cache_position.shape) > 0 and +++++-+ # cur_cache_position.shape[0] == 1 and +++++-+ # cur_input_ids is not None and +++++-+ # cur_input_ids.shape[1] == 1): +++++-+ # # 使用 JIT 优化的单 token 解码 +++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++-+ # if not hasattr(self, '_jit_used'): +++++-+ # self._jit_used = False +++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++-+ +++++-+ # next_token_logits = self.get_decode_one_tokens_logits( +++++-+ # cur_token=cur_input_ids, +++++-+ # input_pos=model_inputs.get("position_ids"), +++++-+ # cache_position=cur_cache_position, +++++-+ # past_key_values=cur_past_key_values, +++++-+ # ) +++++-+ +++++-+ # # 标记已使用JIT(用于后续判断) +++++-+ # if not self._jit_used: +++++-+ # self._jit_used = True +++++-+ +++++-+ # # 构造兼容的输出对象 +++++-+ # class JitOptimizedOutput: +++++-+ # def __init__(self, logits, config): +++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++-+ # self.config = config +++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++-+ # self.attentions = None if not config.is_encoder_decoder else None +++++-+ # self.cross_attentions = None +++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +++++-+ +++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++-+ # else: +++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++-+ # outputs = self(**model_inputs, return_dict=True) +++++-+ +++++-+ # if synced_devices and this_peer_finished: +++++-+ # continue +++++-+ +++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++-+ # next_token_logits = outputs.logits[:, -1, :] +++++-+ +++++-+ # # pre-process distribution +++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++-+ # if do_sample: +++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++-+ +++++-+ # # Store scores, attentions and hidden_states when required +++++-+ # if return_dict_in_generate: +++++-+ # if output_scores: +++++-+ # scores += (next_token_scores,) +++++-+ # if output_logits: +++++-+ # raw_logits += (next_token_logits,) +++++-+ # if output_attentions: +++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +++++-+ # if self.config.is_encoder_decoder: +++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++-+ +++++-+ # if output_hidden_states: +++++-+ # hidden = ( +++++-+ # outputs.decoder_hidden_states +++++-+ # if self.config.is_encoder_decoder +++++-+ # else outputs.hidden_states +++++-+ # ) +++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++-+ +++++-+ # # token selection +++++-+ # if do_sample: +++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++-+ # else: +++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++-+ +++++-+ # # finished sentences should have their next token be a padding token +++++-+ # if has_eos_stopping_criteria: +++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++-+ +++++-+ # # update generated ids, model inputs, and length for next step +++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++-+ # if streamer is not None: +++++-+ # streamer.put(next_tokens) +++++-+ +++++-+ # model_kwargs = self._update_model_kwargs_for_generation( +++++-+ # outputs, +++++-+ # model_kwargs, +++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +++++-+ # ) +++++-+ +++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++-+ # cur_len += 1 +++++-+ +++++-+ # if _record_time: +++++-+ # import time as time_module +++++-+ # infer_stop = time_module.time() +++++-+ # time_record.append(infer_stop - infer_start) +++++-+ +++++-+ # del outputs +++++-+ +++++-+ # average_infer_time = None +++++-+ # if time_record: +++++-+ # if len(time_record) > 1: +++++-+ # time_record.pop(0) +++++-+ # average_infer_time = sum(time_record) / len(time_record) +++++-+ # print(f'average inference time is: {average_infer_time}') +++++-+ # print(f'inference time record: {time_record}') +++++-+ +++++-+ # if streamer is not None: +++++-+ # streamer.end() +++++-+ +++++-+ # # 简单判断:打印是否使用了JIT路径 +++++-+ # if hasattr(self, '_jit_used') and self._jit_used: +++++-+ # print("[JIT] ✓ JIT optimization was used during generation") +++++-+ # else: +++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++-+ +++++-+ # if return_dict_in_generate: +++++-+ # if self.config.is_encoder_decoder: +++++-+ # return GenerateEncoderDecoderOutput( +++++-+ # sequences=input_ids, +++++-+ # scores=scores, +++++-+ # logits=raw_logits, +++++-+ # encoder_attentions=encoder_attentions, +++++-+ # encoder_hidden_states=encoder_hidden_states, +++++-+ # decoder_attentions=decoder_attentions, +++++-+ # cross_attentions=cross_attentions, +++++-+ # decoder_hidden_states=decoder_hidden_states, +++++-+ # past_key_values=model_kwargs.get("past_key_values"), +++++-+ # average_infer_time=average_infer_time +++++-+ # ) +++++-+ # else: +++++-+ # return GenerateDecoderOnlyOutput( +++++-+ # sequences=input_ids, +++++-+ # scores=scores, +++++-+ # logits=raw_logits, +++++-+ # attentions=decoder_attentions, +++++-+ # hidden_states=decoder_hidden_states, +++++-+ # past_key_values=model_kwargs.get("past_key_values"), +++++-+ # average_infer_time=average_infer_time +++++-+ # ) +++++-+ # else: +++++-+ # return input_ids +++++-+ +++++-+ # def _prepare_cache_for_generation( +++++-+ # self, +++++-+ # generation_config, +++++-+ # model_kwargs, +++++-+ # assistant_model, +++++-+ # batch_size, +++++-+ # max_cache_length, +++++-+ # ): +++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++-+ # generation_config.cache_implementation = "static" +++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++-+ +++++-+ # if generation_config.cache_implementation == "static": +++++-+ # base_required_from_max_length = generation_config.max_length + 1 +++++-+ # base_required = max(max_cache_length, base_required_from_max_length) +++++-+ # min_cache_size = 50 +++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++-+ # else: +++++-+ # max_cache_length = max(base_required, min_cache_size) +++++-+ +++++-+ # original_max_cache_length = max_cache_length +++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++-+ # print(f" - final max_cache_length: {max_cache_length}") +++++-+ +++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++-+ # if max_cache_length > self.config.max_position_embeddings: +++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++-+ +++++-+ # result = super()._prepare_cache_for_generation( +++++-+ # generation_config=generation_config, +++++-+ # model_kwargs=model_kwargs, +++++-+ # assistant_model=assistant_model, +++++-+ # batch_size=batch_size, +++++-+ # max_cache_length=max_cache_length, +++++-+ # ) +++++-+ +++++-+ # if generation_config.cache_implementation == "static": +++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++-+ # created_cache = model_kwargs.get(cache_name) +++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++-+ # if created_cache.max_cache_len < generation_config.max_length: +++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++-+ +++++-+ # return result +++++-+ +++++-+ +++++-+ +++++- +++++- +++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++--- +++++-2.27.0 +++++- +++++-- +++++2.27.0 +++++ ++++-- ++++2.27.0 ++++ +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" new file mode 100644 index 00000000..31d324c3 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" @@ -0,0 +1,8789 @@ +From 3b0f98eeed90a7204357d96aacc9dc7098b9dab1 Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Sun, 9 Nov 2025 00:50:01 +0800 +Subject: [PATCH 08/10] moe change + +--- + .../models/deepseek/modeling_deepseek.py | 433 +- + .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- + patches/0001-20251104commit.patch | 2 +- + patches/0002-20251106commit.patch | 2 +- + patches/0003-20261106secondcommit.patch | 2 +- + patches/0004-20251106change.patch | 2 +- + patches/0005-20251107001commit.patch | 2 +- + patches/0006-20251107002commit.patch | 2 +- + patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ + 9 files changed, 8510 insertions(+), 55 deletions(-) + create mode 100644 patches/0007-20251107003commit.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index ff631974..0af29305 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -19,8 +19,10 @@ + # limitations under the License. + """ MindNLP DeepSeek model.""" + import math ++import time + import warnings + from typing import List, Optional, Tuple, Union ++from mindspore import mint + import mindspore + from mindnlp.core import nn, ops, no_grad + from mindnlp.core.nn import functional as F +@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) + + _CONFIG_FOR_DOC = "DeepseekConfig" + ++Long_Prompt = 1 ++LONG_PROMPT_LENGTH_THRESHOLD = 128 ++SHORT_PROMPT_LENGTH_THRESHOLD = 32 ++ + _attn_mask_cache = {} + + def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +@@ -380,6 +386,8 @@ class MoEGate(nn.Module): + return topk_idx, topk_weight, aux_loss + + ++bincount_op = mindspore.ops.Bincount() ++ + class DeepseekMoE(nn.Module): + """ + A mixed expert module containing shared experts. +@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): + y = y + self.shared_experts(identity) + return y + else: +- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++ if Long_Prompt == 0: ++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++ else: ++ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) + if self.config.n_shared_experts is not None: + y = y + self.shared_experts(identity) + return y +@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): + # if self.config.n_shared_experts is not None: + # y = y + self.shared_experts(identity) + # return y +- ++ ++ ++ ++ # lwx ++ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): ++ # """ ++ # 如果 expert_ids 为 None,走单专家逻辑; ++ # 如果有,多专家批量处理,保证和原逻辑一致。 ++ # """ ++ # if expert_ids is None: ++ # # 原单专家逻辑 ++ # if self.config.pretraining_tp > 1: ++ # slice = self.intermediate_size // self.config.pretraining_tp ++ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) ++ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) ++ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) ++ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) ++ # for i in range(self.config.pretraining_tp)], dim=-1) ++ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) ++ # for i in range(self.config.pretraining_tp)], dim=-1) ++ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) ++ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) ++ # for i in range(self.config.pretraining_tp)] ++ # down_proj = sum(down_proj) ++ # else: ++ # down_proj = self.down_proj( ++ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) ++ # ) ++ # return down_proj ++ ++ # # ====== 批量多专家路径 ====== ++ # hidden_size = x.shape[-1] ++ ++ # # 按 token expert_ids 选权重 ++ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] ++ # up_weights = self.up_proj.weight[expert_ids] ++ # down_weights = self.down_proj.weight[expert_ids] ++ ++ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 ++ # if self.config.pretraining_tp > 1: ++ # outputs = [] ++ # slice = self.intermediate_size // self.config.pretraining_tp ++ # for i in range(self.config.pretraining_tp): ++ # # 每个 slice 单独计算 ++ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) ++ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) ++ # act_out = self.act_fn(gate_proj_out) * up_proj_out ++ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) ++ # outputs.append(down_proj_out) ++ # return sum(outputs) ++ # else: ++ # gate_proj_out = F.linear(x, gate_weights) ++ # up_proj_out = F.linear(x, up_weights) ++ # act_out = self.act_fn(gate_proj_out) * up_proj_out ++ # return F.linear(act_out, down_weights) ++ # @no_grad() ++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ # num_tokens = x.shape[0] ++ # hidden_size = x.shape[-1] ++ ++ # idxs = flat_expert_indices.argsort() ++ # sorted_expert_indices = flat_expert_indices[idxs] ++ # sorted_token_indices = idxs // self.num_experts_per_tok ++ # sorted_indices = sorted_token_indices ++ ++ # permuted_tokens = x[sorted_token_indices] ++ # sorted_weights = flat_expert_weights[idxs] ++ ++ # # 一次调用多专家 forward ++ # expert_outputs = ops.zeros_like(permuted_tokens) ++ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) ++ ++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) ++ # try: ++ # final_output = ops.moe_token_unpermute( ++ # expert_outputs, ++ # sorted_indices, ++ # probs=probs, ++ # padded_mode=False ++ # ) ++ # except Exception: ++ # final_output = ops.zeros_like(x) ++ # final_output = mindspore.mint.scatter_add( ++ # final_output, ++ # 0, ++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), ++ # expert_outputs * sorted_weights ++ # ) ++ ++ # return final_output ++ ++ # def mlp_batch_forward(self, tokens, expert_ids): ++ # """ ++ # 使用批量专家 forward(保留精度) ++ # """ ++ # return self.experts[0].forward(tokens, expert_ids) ++ + # @no_grad() + # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): + +@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): + # expert_cache += expert_out * weight + # return expert_cache + ++ #@dwj + @no_grad() +- # dwj + def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- # x 的 shape: (1, hidden_size) +- # flat_expert_indices 的 shape: (num_experts_per_tok,) +- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +- +- # 1. 收集所有需要的专家层 +- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 + selected_experts = [self.experts[i] for i in flat_expert_indices] +- +- # 2. 并行计算所有专家的输出 +- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +- # ops.cat 会将它们堆叠成一个新的 Tensor +- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) + expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +- +- # 3. 使用矩阵乘法进行加权求和 +- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- # 最终结果 final_output 的 shape: (1, hidden_size) + final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +- + return final_output + + +- # @no_grad() +- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # expert_cache = ops.zeros_like(x) +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- # token_idxs = idxs // self.num_experts_per_tok +- +- # for i, end_idx in enumerate(tokens_per_expert): +- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- # if start_idx == end_idx: +- # continue +- # expert = self.experts[i] +- # exp_token_idx = token_idxs[start_idx:end_idx] +- # expert_tokens = x[exp_token_idx] +- # expert_out = expert(expert_tokens) +- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +- +- # return expert_cache +- + @no_grad() + def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): + """ +@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): + ) + + return expert_cache ++ ++ ++ # @no_grad() ++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ # """ ++ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add ++ # """ ++ # num_tokens = x.shape[0] ++ # hidden_size = x.shape[-1] ++ ++ # # 生成排序后的 token 索引 ++ # idxs = flat_expert_indices.argsort() ++ # sorted_expert_indices = flat_expert_indices[idxs] ++ # sorted_token_indices = idxs // self.num_experts_per_tok ++ ++ # # 记录到 sorted_indices(moe_token_unpermute 用) ++ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] ++ ++ # # 收集专家输入 ++ # permuted_tokens = x[sorted_token_indices] ++ ++ # # 执行每个专家的 MLP(批量处理) ++ # expert_outputs = [] ++ # token_ptr = 0 ++ # tokens_per_expert = sorted_expert_indices.bincount() ++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): ++ # if count == 0: ++ # continue ++ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] ++ # out = self.experts[expert_id](cur_tokens) ++ # expert_outputs.append(out) ++ # token_ptr += count ++ ++ # # 拼接所有专家输出 ++ # permuted_outputs = ops.cat(expert_outputs, axis=0) ++ ++ # # 权重缩放(probs 形状为 [num_tokens, top_k]) ++ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) ++ ++ # # 直接调用硬件加速的 unpermute ++ # final_output = ops.moe_token_unpermute( ++ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] ++ # sorted_indices, # shape: [num_tokens * top_k] ++ # probs=probs, # 按概率加权 ++ # padded_mode=False ++ # ) ++ ++ # return final_output ++ ++ # lwx prefill 20251108 ++ @no_grad() ++ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): ++ """ ++ 高性能 + 数值一致的 MoE prefill 推理: ++ 1. 批量化处理所有专家计算,减少 Python 循环开销 ++ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 ++ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 ++ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch ++ ++ 参数: ++ x: [num_tokens, hidden_size], ++ MoE 输入的 token 表示 ++ flat_expert_indices: [num_tokens * top_k], ++ 每个 token 的路由专家 id ++ flat_expert_weights: [num_tokens * top_k, 1], ++ 路由专家权重 ++ """ ++ num_tokens = x.shape[0] ++ hidden_size = x.shape[-1] ++ ++ # 1) 排序专家分配(与原 scatter_add 一致的顺序) ++ idxs = flat_expert_indices.argsort() # 排序索引 ++ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] ++ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID ++ ++ # sorted_indices 必须与 permuted_tokens 顺序匹配 ++ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 ++ ++ # 2) 收集专家输入(按 idxs 排序) ++ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] ++ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 ++ ++ # 3) 计算每个专家的 token 数 ++ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) ++ ++ # 4) 批量专家计算(减少 Python 循环) ++ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) ++ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) ++ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) ++ ++ expert_outputs = ops.zeros_like(permuted_tokens) ++ ptr = 0 ++ for expert_id, count in enumerate(tokens_per_expert.tolist()): ++ if count == 0: ++ continue ++ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] ++ ++ # 与 DeepseekMLP forward 等价 ++ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) ++ up_proj_out = F.linear(tokens, up_weights[expert_id]) ++ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out ++ expert_out = F.linear(act_out, down_weights[expert_id]) ++ ++ expert_outputs[ptr:ptr+count] = expert_out ++ ptr += count ++ ++ # 5) Ascend 加速的 unpermute(已排序的权重) ++ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape ++ ++ final_output = ops.zeros_like(x) ++ final_output = mindspore.mint.scatter_add( ++ final_output, ++ 0, ++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), ++ expert_outputs * sorted_weights ++ ) ++ ++ ++ # try: ++ # final_output = ops.moe_token_unpermute( ++ # expert_outputs, # [num_tokens*top_k, hidden_size] ++ # sorted_indices, # [num_tokens*top_k] 原 token id ++ # probs=probs, # 对应权重 ++ # padded_mode=False ++ # ) ++ # except Exception: ++ # # CPU/GPU fallback:用 scatter_add 保证完全一致 ++ # final_output = ops.zeros_like(x) ++ # final_output = mindspore.mint.scatter_add( ++ # final_output, ++ # 0, ++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), ++ # expert_outputs * sorted_weights ++ # ) ++ ++ return final_output ++ ++ ++ # @no_grad() ++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ # num_tokens = x.shape[0] ++ # hidden_size = x.shape[-1] ++ ++ # idxs = flat_expert_indices.argsort() ++ # sorted_expert_indices = flat_expert_indices[idxs] ++ # sorted_token_indices = idxs // self.num_experts_per_tok ++ ++ # # sorted_indices = sorted_token_indices ++ # sorted_indices = sorted_token_indices.astype(mindspore.int32) ++ # permuted_tokens = x[sorted_token_indices] ++ # sorted_weights = flat_expert_weights[idxs] ++ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) ++ ++ # expert_outputs = ops.zeros_like(permuted_tokens) ++ # ptr = 0 ++ ++ # # 只按专家维度循环 ++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): ++ # if count == 0: ++ # continue ++ # token_slice = slice(ptr, ptr + count) ++ # expert_tokens = permuted_tokens[token_slice] ++ ++ # # 保持原 forward(含 pretraining_tp、bias 等) ++ # expert_out = self.experts[expert_id](expert_tokens) ++ ++ # expert_outputs[token_slice] = expert_out ++ # ptr += count ++ ++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) ++ # try: ++ # final_output = mindspore.ops.moe_token_unpermute( ++ # expert_outputs, ++ # sorted_indices, ++ # probs=probs, ++ # padded_mode=False ++ # ) ++ # except Exception: ++ # final_output = ops.zeros_like(x) ++ # final_output = mindspore.mint.scatter_add( ++ # final_output, ++ # 0, ++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), ++ # expert_outputs * sorted_weights ++ # ) ++ ++ # return final_output ++ ++ ++ #lwx ++ # @no_grad() ++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ # """ ++ # 并行化 MoE prefill: ++ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 ++ # - 保证结果与原版完全一致 ++ # """ ++ # # 输出缓存 ++ # expert_cache = ops.zeros_like(x) ++ ++ # # token 总数(批量*seq_len*num_experts_per_tok) ++ # num_tokens = flat_expert_indices.shape[0] ++ # hidden_dim = x.shape[-1] ++ ++ # # 原 token ID(idxs // num_experts_per_tok) ++ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) ++ ++ # # ====== Step 1: 组织输入 ====== ++ # # 按 experts 排序,保证 scatter_add 对应位置一致 ++ # sort_ids = flat_expert_indices.argsort() ++ # sorted_experts = flat_expert_indices[sort_ids] ++ # sorted_tokens = token_ids[sort_ids] ++ # sorted_weights = flat_expert_weights[sort_ids] ++ ++ # # 收集每个专家的输入 ++ # # build: expert_inputs[expert_id] = [tokens...] ++ # expert_inputs = [] ++ # expert_outs = [] ++ ++ # for eid in range(self.config.n_routed_experts): ++ # eid_mask = (sorted_experts == eid) ++ # if eid_mask.any(): ++ # tokens_for_eid = x[sorted_tokens[eid_mask]] ++ # expert_inputs.append(tokens_for_eid) ++ # else: ++ # expert_inputs.append(None) ++ ++ # # ====== Step 2: 并行计算所有专家输出 ====== ++ # # 存储所有专家结果到一个列表 ++ # for eid in range(self.config.n_routed_experts): ++ # if expert_inputs[eid] is not None: ++ # out = self.experts[eid](expert_inputs[eid]) ++ # expert_outs.append(out) ++ # else: ++ # expert_outs.append(None) ++ ++ # # ====== Step 3: scatter_add 回写结果 ====== ++ # # 遍历专家,将结果加回对应的 token ++ # pos = 0 ++ # for eid in range(self.config.n_routed_experts): ++ # if expert_outs[eid] is not None: ++ # size = expert_outs[eid].shape[0] ++ # tokens_idx = sorted_tokens[pos:pos+size] ++ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] ++ # pos += size ++ ++ # # scatter_add 到 expert_cache ++ # expert_cache = mindspore.mint.scatter_add( ++ # expert_cache, ++ # dim=0, ++ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), ++ # src=scaled_out ++ # ) ++ ++ # return expert_cache ++ ++ ++ + # 放置在 DeepseekMoE 类中 + # @no_grad() + # #lwx 20251107 +@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): + self.hidden_size = config.hidden_size + + # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +- # config=config, layer_idx=layer_idx ++ # config=config, layer_idx=layer_idx + # ) + + self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): + ) + else DeepseekMLP(config) + ) ++ + self.input_layernorm = DeepseekRMSNorm( + config.hidden_size, eps=config.rms_norm_eps + ) +@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + def get_decoder(self): + return self.model + ++ def generate(self, *args, **kwargs): ++ """ ++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++ """ ++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++ ++ input_ids = kwargs.get("input_ids") ++ if input_ids is None and args: ++ input_ids = args[0] ++ ++ if input_ids is not None: ++ prompt_length = input_ids.shape[1] ++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: ++ Long_Prompt = 2 ++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: ++ Long_Prompt = 0 ++ else: ++ Long_Prompt = 1 ++ ++ ++ return super().generate(*args, **kwargs) + + def forward( + self, +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index 913a7609..6566958b 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + + # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- + @no_grad() +- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + original_dtype = hidden_states.dtype + batch_size, _ = hidden_states.shape + expert_outputs_list = [ +@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) + return moe_output_fp32.squeeze(1).to(original_dtype) + ++ + # @no_grad() +- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + # num_tokens, _ = hidden_states.shape + # flat_selected_experts = selected_experts.flatten() + # sorted_expert_indices = flat_selected_experts.argsort() +@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + # current_token_offset += expert_token_count + # return moe_output + ++ # baseline + @no_grad() +- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + """ + 优化版 MoE prefill (速度优先模式): + - 批量张量化处理同一个 expert 的所有 token +@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + return moe_output + + ++ @no_grad() ++ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ """ ++ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add ++ 逻辑: ++ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 ++ 2. 每个 expert 一次性处理其全部 token ++ 3. 最后一次 scatter_add 回到原 token 顺序 ++ """ ++ ++ num_tokens = hidden_states.shape[0] ++ hidden_size = hidden_states.shape[-1] ++ ++ # 展平为一维 ++ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] ++ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] ++ ++ # 按 expert 排序 ++ idxs = flat_selected_experts.argsort() ++ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 ++ sorted_token_indices = idxs // self.top_k # 对应原 token ID ++ ++ # 排好序的输入向量(连续内存) ++ permuted_tokens = hidden_states[sorted_token_indices] ++ ++ # 排好序的权重 ++ sorted_weights = flat_routing_weights[idxs] ++ ++ # 每个 expert 对应的 token 数量 ++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) ++ ++ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) ++ expert_outputs = ops.zeros_like(permuted_tokens) ++ ++ ptr = 0 # 指向当前切片的起点 ++ for expert_id, count in enumerate(tokens_per_expert.tolist()): ++ if count == 0: ++ continue ++ ++ token_slice = slice(ptr, ptr + count) ++ expert_tokens = permuted_tokens[token_slice] # 连续切片 ++ ++ # 执行专家 MLP ++ expert_out = self.experts[expert_id](expert_tokens) ++ ++ expert_outputs[token_slice] = expert_out ++ ptr += count ++ ++ # 按权重缩放 ++ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) ++ ++ # 回写到原 token 顺序 (单次 scatter_add) ++ moe_output = mindspore.mint.scatter_add( ++ ops.zeros_like(hidden_states), ++ 0, ++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), ++ scaled_outputs ++ ) ++ ++ return moe_output ++ ++ ++ + # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++ + @no_grad() + def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + moe_output = ops.zeros_like(hidden_states) +@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + # # --- 速度优先模式 (SPEED MODE) --- + # routing_weights_casted = routing_weights.to(hidden_states.dtype) + # if sequence_length == 1: +- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) + # else: +- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) + + routing_weights_casted = routing_weights.to(hidden_states.dtype) + if sequence_length == 1: +- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) + else: +- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +- ++ # if Long_Prompt == 1: ++ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # else: ++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ + + # 3. 共享专家计算与合并 (所有模式通用) + gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +index c9c8c5ee..513dd40b 100644 +--- a/patches/0001-20251104commit.patch ++++ b/patches/0001-20251104commit.patch +@@ -1,7 +1,7 @@ + From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/6] 20251104commit ++Subject: [PATCH 1/7] 20251104commit + + --- + mindnlp/transformers/cache_utils.py | 28 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +index 625656eb..41081b85 100644 +--- a/patches/0002-20251106commit.patch ++++ b/patches/0002-20251106commit.patch +@@ -1,7 +1,7 @@ + From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/6] 20251106commit ++Subject: [PATCH 2/7] 20251106commit + + --- + .../models/deepseek/modeling_deepseek.py | 379 ++++- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +index dcb85080..c1392569 100644 +--- a/patches/0003-20261106secondcommit.patch ++++ b/patches/0003-20261106secondcommit.patch +@@ -1,7 +1,7 @@ + From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/6] 20261106secondcommit ++Subject: [PATCH 3/7] 20261106secondcommit + + --- + .../models/deepseek/modeling_deepseek.py | 217 ++- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +index bbed13cc..e548b1b2 100644 +--- a/patches/0004-20251106change.patch ++++ b/patches/0004-20251106change.patch +@@ -1,7 +1,7 @@ + From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 15:48:09 +0800 +-Subject: [PATCH 4/6] 20251106change ++Subject: [PATCH 4/7] 20251106change + + --- + .../models/deepseek/modeling_deepseek.py | 189 +- +diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +index b2d1035c..bf224d2a 100644 +--- a/patches/0005-20251107001commit.patch ++++ b/patches/0005-20251107001commit.patch +@@ -1,7 +1,7 @@ + From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 11:48:18 +0800 +-Subject: [PATCH 5/6] 20251107001commit ++Subject: [PATCH 5/7] 20251107001commit + + --- + .../models/deepseek/modeling_deepseek.py | 91 +- +diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +index bffa134e..1bd306b9 100644 +--- a/patches/0006-20251107002commit.patch ++++ b/patches/0006-20251107002commit.patch +@@ -1,7 +1,7 @@ + From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 12:06:32 +0800 +-Subject: [PATCH 6/6] 20251107002commit ++Subject: [PATCH 6/7] 20251107002commit + + --- + .../models/deepseek/modeling_deepseek.py | 122 +- +diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch +new file mode 100644 +index 00000000..ce558554 +--- /dev/null ++++ b/patches/0007-20251107003commit.patch +@@ -0,0 +1,8034 @@ ++From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Fri, 7 Nov 2025 12:12:51 +0800 ++Subject: [PATCH 7/7] 20251107003commit ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 2 +- ++ patches/0001-20251104commit.patch | 2 +- ++ patches/0002-20251106commit.patch | 2 +- ++ patches/0003-20261106secondcommit.patch | 2 +- ++ patches/0004-20251106change.patch | 2 +- ++ patches/0005-20251107001commit.patch | 2 +- ++ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ ++ 7 files changed, 7937 insertions(+), 6 deletions(-) ++ create mode 100644 patches/0006-20251107002commit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index e7e1c053..ff631974 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): ++ # return expert_cache ++ ++ @no_grad() ++- dwj +++ # dwj ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ # x 的 shape: (1, hidden_size) ++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++index 2842180e..c9c8c5ee 100644 ++--- a/patches/0001-20251104commit.patch +++++ b/patches/0001-20251104commit.patch ++@@ -1,7 +1,7 @@ ++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++-Subject: [PATCH 1/5] 20251104commit +++Subject: [PATCH 1/6] 20251104commit ++ ++ --- ++ mindnlp/transformers/cache_utils.py | 28 +- ++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++index c6cd8757..625656eb 100644 ++--- a/patches/0002-20251106commit.patch +++++ b/patches/0002-20251106commit.patch ++@@ -1,7 +1,7 @@ ++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++-Subject: [PATCH 2/5] 20251106commit +++Subject: [PATCH 2/6] 20251106commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++index 601960c9..dcb85080 100644 ++--- a/patches/0003-20261106secondcommit.patch +++++ b/patches/0003-20261106secondcommit.patch ++@@ -1,7 +1,7 @@ ++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++-Subject: [PATCH 3/5] 20261106secondcommit +++Subject: [PATCH 3/6] 20261106secondcommit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++index 8976f10b..bbed13cc 100644 ++--- a/patches/0004-20251106change.patch +++++ b/patches/0004-20251106change.patch ++@@ -1,7 +1,7 @@ ++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 15:48:09 +0800 ++-Subject: [PATCH 4/5] 20251106change +++Subject: [PATCH 4/6] 20251106change ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 189 +- ++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch ++index 8d9032be..b2d1035c 100644 ++--- a/patches/0005-20251107001commit.patch +++++ b/patches/0005-20251107001commit.patch ++@@ -1,7 +1,7 @@ ++ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Fri, 7 Nov 2025 11:48:18 +0800 ++-Subject: [PATCH 5/5] 20251107001commit +++Subject: [PATCH 5/6] 20251107001commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 91 +- ++diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch ++new file mode 100644 ++index 00000000..bffa134e ++--- /dev/null +++++ b/patches/0006-20251107002commit.patch ++@@ -0,0 +1,7931 @@ +++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Fri, 7 Nov 2025 12:06:32 +0800 +++Subject: [PATCH 6/6] 20251107002commit +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 122 +- +++ patches/0001-20251104commit.patch | 2 +- +++ patches/0002-20251106commit.patch | 2 +- +++ patches/0003-20261106secondcommit.patch | 2 +- +++ patches/0004-20251106change.patch | 2 +- +++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ +++ 6 files changed, 7773 insertions(+), 64 deletions(-) +++ create mode 100644 patches/0005-20251107001commit.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index 8831e4b7..e7e1c053 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): +++ # expert_out = expert(x) +++ # expert_cache += expert_out * weight +++ # return expert_cache +++- +++- # @no_grad() +++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++- # # x 的 shape: (1, hidden_size) +++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) +++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++- +++- # # 1. 收集所有需要的专家层 +++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++- # selected_experts = [self.experts[i] for i in flat_expert_indices] +++- +++- # # 2. 并行计算所有专家的输出 +++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++- # # ops.cat 会将它们堆叠成一个新的 Tensor +++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++- +++- # # 3. 使用矩阵乘法进行加权求和 +++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++- # # 最终结果 final_output 的 shape: (1, hidden_size) +++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++ ++++ @no_grad() ++++ dwj ++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ # x 的 shape: (1, hidden_size) ++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++ ++++ # 1. 收集所有需要的专家层 ++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++++ ++++ # 2. 并行计算所有专家的输出 ++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++ # ops.cat 会将它们堆叠成一个新的 Tensor ++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++ ++++ # 3. 使用矩阵乘法进行加权求和 ++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++ # 最终结果 final_output 的 shape: (1, hidden_size) ++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++ +++- # return final_output ++++ return final_output +++ +++ +++ # @no_grad() +++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): +++ +++ return expert_cache +++ # 放置在 DeepseekMoE 类中 +++- @no_grad() +++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++- """ +++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +++- +++- Args: +++- x (Tensor): 输入张量, shape: (1, hidden_size) +++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +++- """ +++- top_k, _ = flat_expert_weights.shape +++- hidden_size = x.shape[-1] +++- +++- # 1. 将所有专家的权重堆叠起来 +++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++++ # @no_grad() ++++ # #lwx 20251107 ++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++ # """ ++++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++++ ++++ # Args: ++++ # x (Tensor): 输入张量, shape: (1, hidden_size) ++++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++++ # """ ++++ # top_k, _ = flat_expert_weights.shape ++++ # hidden_size = x.shape[-1] ++++ ++++ # # 1. 将所有专家的权重堆叠起来 ++++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +++ +++- # 2. "收集" 所需的专家权重 +++- selected_gate_w = stacked_gate_w[flat_expert_indices] +++- selected_up_w = stacked_up_w[flat_expert_indices] +++- selected_down_w = stacked_down_w[flat_expert_indices] ++++ # # 2. "收集" 所需的专家权重 ++++ # selected_gate_w = stacked_gate_w[flat_expert_indices] ++++ # selected_up_w = stacked_up_w[flat_expert_indices] ++++ # selected_down_w = stacked_down_w[flat_expert_indices] +++ +++- # 3. 准备输入 +++- x_expanded = x.expand((top_k, 1, hidden_size)) ++++ # # 3. 准备输入 ++++ # x_expanded = x.expand((top_k, 1, hidden_size)) +++ +++- # 4. 并行计算 gate_proj 和 up_proj +++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++++ # # 4. 并行计算 gate_proj 和 up_proj ++++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +++ +++- # 5. 计算中间状态 +++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++++ # # 5. 计算中间状态 ++++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out +++ +++- # 6. 并行计算 down_proj +++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +++- # --- [FIX] --- +++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +++- # --- [FIX END] --- ++++ # # 6. 并行计算 down_proj ++++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++++ # # --- [FIX] --- ++++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++++ # # --- [FIX END] --- +++ +++- # 7. 根据路由权重进行加权求和 +++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++++ # # 7. 根据路由权重进行加权求和 ++++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +++ +++- return weighted_sum ++++ # return weighted_sum +++ +++ +++ +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++index 0a0ef2d7..2842180e 100644 +++--- a/patches/0001-20251104commit.patch ++++++ b/patches/0001-20251104commit.patch +++@@ -1,7 +1,7 @@ +++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +++-Subject: [PATCH 1/4] 20251104commit ++++Subject: [PATCH 1/5] 20251104commit +++ +++ --- +++ mindnlp/transformers/cache_utils.py | 28 +- +++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++index 5185270c..c6cd8757 100644 +++--- a/patches/0002-20251106commit.patch ++++++ b/patches/0002-20251106commit.patch +++@@ -1,7 +1,7 @@ +++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +++-Subject: [PATCH 2/4] 20251106commit ++++Subject: [PATCH 2/5] 20251106commit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++index 3e05f821..601960c9 100644 +++--- a/patches/0003-20261106secondcommit.patch ++++++ b/patches/0003-20261106secondcommit.patch +++@@ -1,7 +1,7 @@ +++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +++-Subject: [PATCH 3/4] 20261106secondcommit ++++Subject: [PATCH 3/5] 20261106secondcommit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +++index 88a1aef4..8976f10b 100644 +++--- a/patches/0004-20251106change.patch ++++++ b/patches/0004-20251106change.patch +++@@ -1,7 +1,7 @@ +++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 15:48:09 +0800 +++-Subject: [PATCH 4/4] 20251106change ++++Subject: [PATCH 4/5] 20251106change +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 189 +- +++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +++new file mode 100644 +++index 00000000..8d9032be +++--- /dev/null ++++++ b/patches/0005-20251107001commit.patch +++@@ -0,0 +1,7707 @@ ++++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Fri, 7 Nov 2025 11:48:18 +0800 ++++Subject: [PATCH 5/5] 20251107001commit ++++ ++++--- ++++ .../models/deepseek/modeling_deepseek.py | 91 +- ++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- ++++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- ++++ patches/0001-20251104commit.patch | 2 +- ++++ patches/0002-20251106commit.patch | 2 +- ++++ patches/0003-20261106secondcommit.patch | 2 +- ++++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ ++++ 7 files changed, 7577 insertions(+), 30 deletions(-) ++++ create mode 100644 patches/0004-20251106change.patch ++++ ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index 0546f318..8831e4b7 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): ++++ # expert_cache += expert_out * weight ++++ # return expert_cache ++++ ++++- @no_grad() ++++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++- # x 的 shape: (1, hidden_size) ++++- # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++- ++++- # 1. 收集所有需要的专家层 ++++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++- selected_experts = [self.experts[i] for i in flat_expert_indices] ++++- ++++- # 2. 并行计算所有专家的输出 ++++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++- # ops.cat 会将它们堆叠成一个新的 Tensor ++++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++- ++++- # 3. 使用矩阵乘法进行加权求和 ++++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++- # 最终结果 final_output 的 shape: (1, hidden_size) ++++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++++ # @no_grad() +++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ # # x 的 shape: (1, hidden_size) +++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++++ +++++ # # 1. 收集所有需要的专家层 +++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] +++++ +++++ # # 2. 并行计算所有专家的输出 +++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++++ # # ops.cat 会将它们堆叠成一个新的 Tensor +++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++++ +++++ # # 3. 使用矩阵乘法进行加权求和 +++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ # # 最终结果 final_output 的 shape: (1, hidden_size) +++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++ ++++- return final_output +++++ # return final_output ++++ ++++ ++++ # @no_grad() ++++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): ++++ ) ++++ ++++ return expert_cache +++++# 放置在 DeepseekMoE 类中 +++++ @no_grad() +++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ """ +++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +++++ +++++ Args: +++++ x (Tensor): 输入张量, shape: (1, hidden_size) +++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +++++ """ +++++ top_k, _ = flat_expert_weights.shape +++++ hidden_size = x.shape[-1] +++++ +++++ # 1. 将所有专家的权重堆叠起来 +++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +++++ +++++ # 2. "收集" 所需的专家权重 +++++ selected_gate_w = stacked_gate_w[flat_expert_indices] +++++ selected_up_w = stacked_up_w[flat_expert_indices] +++++ selected_down_w = stacked_down_w[flat_expert_indices] +++++ +++++ # 3. 准备输入 +++++ x_expanded = x.expand((top_k, 1, hidden_size)) +++++ +++++ # 4. 并行计算 gate_proj 和 up_proj +++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +++++ +++++ # 5. 计算中间状态 +++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +++++ +++++ # 6. 并行计算 down_proj +++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +++++ # --- [FIX] --- +++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +++++ # --- [FIX END] --- +++++ +++++ # 7. 根据路由权重进行加权求和 +++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +++++ +++++ return weighted_sum +++++ +++++ ++++ ++++ # @no_grad() ++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++index ebd7782e..913a7609 100644 ++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): ++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++- x1 = x[..., : x.shape[-1] // 2] ++++- x2 = x[..., x.shape[-1] // 2 :] +++++ # x1 = x[..., : x.shape[-1] // 2] +++++ # x2 = x[..., x.shape[-1] // 2 :] ++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++++index d059dcbe..2b217b64 100644 ++++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): ++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++ def rotate_half(x): ++++ """Rotates half the hidden dims of the input.""" ++++- x1 = x[..., : x.shape[-1] // 2] ++++- x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++ # x1 = x[..., : x.shape[-1] // 2] +++++ # x2 = x[..., x.shape[-1] // 2 :] +++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++ return ops.cat((-x2, x1), dim=-1) ++++ ++++ ++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++index 78f22642..0a0ef2d7 100644 ++++--- a/patches/0001-20251104commit.patch +++++++ b/patches/0001-20251104commit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++-Subject: [PATCH 1/3] 20251104commit +++++Subject: [PATCH 1/4] 20251104commit ++++ ++++ --- ++++ mindnlp/transformers/cache_utils.py | 28 +- ++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++++index 22b65dd5..5185270c 100644 ++++--- a/patches/0002-20251106commit.patch +++++++ b/patches/0002-20251106commit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++++-Subject: [PATCH 2/3] 20251106commit +++++Subject: [PATCH 2/4] 20251106commit ++++ ++++ --- ++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++++index 966529e4..3e05f821 100644 ++++--- a/patches/0003-20261106secondcommit.patch +++++++ b/patches/0003-20261106secondcommit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++++-Subject: [PATCH 3/3] 20261106secondcommit +++++Subject: [PATCH 3/4] 20261106secondcommit ++++ ++++ --- ++++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++++new file mode 100644 ++++index 00000000..88a1aef4 ++++--- /dev/null +++++++ b/patches/0004-20251106change.patch ++++@@ -0,0 +1,7498 @@ +++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Thu, 6 Nov 2025 15:48:09 +0800 +++++Subject: [PATCH 4/4] 20251106change +++++ +++++--- +++++ .../models/deepseek/modeling_deepseek.py | 189 +- +++++ patches/0001-20251104commit.patch | 1272 +++++++ +++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +++++ 4 files changed, 7244 insertions(+), 186 deletions(-) +++++ create mode 100644 patches/0001-20251104commit.patch +++++ create mode 100644 patches/0002-20251106commit.patch +++++ create mode 100644 patches/0003-20261106secondcommit.patch +++++ +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index 2f9192bf..0546f318 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +++++ +++++ return attn_output, attn_weights, past_key_value +++++ +++++-# class DeepseekFlashAttention(nn.Module): +++++-# """ +++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +++++- +++++-# This class is designed as a drop-in replacement for DeepseekAttention. +++++-# """ +++++- +++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++++-# super().__init__() +++++-# self.config = config +++++-# self.layer_idx = layer_idx +++++-# if layer_idx is None: +++++-# logger.warning( +++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++-# "when creating this class." +++++-# ) +++++- +++++-# self.attention_dropout = config.attention_dropout +++++-# self.hidden_size = config.hidden_size +++++-# self.num_heads = config.num_attention_heads +++++-# self.head_dim = self.hidden_size // self.num_heads +++++-# self.num_key_value_heads = config.num_key_value_heads +++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++-# self.max_position_embeddings = config.max_position_embeddings +++++-# self.rope_theta = config.rope_theta +++++-# self.is_causal = True +++++- +++++-# if (self.head_dim * self.num_heads) != self.hidden_size: +++++-# raise ValueError( +++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++-# f" and `num_heads`: {self.num_heads})." +++++-# ) +++++- +++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++++-# self._init_rope() +++++- +++++-# def _init_rope(self): +++++-# if self.config.rope_scaling is None: +++++-# self.rotary_emb = DeepseekRotaryEmbedding( +++++-# self.head_dim, +++++-# max_position_embeddings=self.max_position_embeddings, +++++-# base=self.rope_theta, +++++-# ) +++++-# else: +++++-# scaling_type = self.config.rope_scaling["type"] +++++-# scaling_factor = self.config.rope_scaling["factor"] +++++-# if scaling_type == "linear": +++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++++-# self.head_dim, +++++-# max_position_embeddings=self.max_position_embeddings, +++++-# scaling_factor=scaling_factor, +++++-# base=self.rope_theta, +++++-# ) +++++-# elif scaling_type == "dynamic": +++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++++-# self.head_dim, +++++-# max_position_embeddings=self.max_position_embeddings, +++++-# scaling_factor=scaling_factor, +++++-# base=self.rope_theta, +++++-# ) +++++-# else: +++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++++- +++++-# def forward( +++++-# self, +++++-# hidden_states: mindspore.Tensor, +++++-# attention_mask: Optional[mindspore.Tensor] = None, +++++-# position_ids: Optional[mindspore.Tensor] = None, +++++-# past_key_value: Optional[Cache] = None, +++++-# output_attentions: bool = False, +++++-# use_cache: bool = False, +++++-# **kwargs, +++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++-# if "padding_mask" in kwargs: +++++-# warnings.warn( +++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++++-# ) +++++- +++++-# if output_attentions: +++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +++++- +++++-# bsz, q_len, _ = hidden_states.shape +++++- +++++-# if self.config.pretraining_tp > 1: +++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++++- +++++-# query_states = self.q_proj(hidden_states) +++++-# key_states = self.k_proj(hidden_states) +++++-# value_states = self.v_proj(hidden_states) +++++- +++++-# # Reshape for multi-head attention +++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++- +++++-# kv_seq_len = key_states.shape[-2] +++++-# if past_key_value is not None: +++++-# if self.layer_idx is None: +++++-# raise ValueError( +++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++-# "with a layer index." +++++-# ) +++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++- +++++-# # Apply Rotary Positional Embedding +++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++- +++++-# if past_key_value is not None: +++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++- +++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++- +++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++- +++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++- +++++-# # Convert attention_mask for flash_attention_score +++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +++++-# if attention_mask is not None: +++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++++-# raise ValueError( +++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++++-# ) +++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +++++-# else: +++++-# attn_mask_for_fa = None +++++- +++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++++- +++++-# # Call the fused flash_attention_score operator +++++-# attn_output = mindspore.ops.flash_attention_score( +++++-# query=query_states_for_fa, +++++-# key=key_states_for_fa, +++++-# value=value_states_for_fa, +++++-# head_num=self.num_heads, # This is N1, the number of query heads +++++-# input_layout='BSH', +++++-# attn_mask=attn_mask_for_fa, +++++-# keep_prob=keep_prob, +++++-# scalar_value=1.0 / math.sqrt(self.head_dim), +++++-# sparse_mode=0 # Default mask mode +++++-# ) +++++- +++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +++++-# attn_output = self.o_proj(attn_output) +++++- +++++-# # Flash Attention does not return attention weights +++++-# attn_weights = None +++++- +++++-# return attn_output, attn_weights, past_key_value +++++ +++++ class DeepseekFlashAttention(nn.Module): +++++ """ +++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +++++ super().__init__() +++++ self.hidden_size = config.hidden_size +++++ +++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +++++- config=config, layer_idx=layer_idx +++++- ) ++++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++++++ # config=config, layer_idx=layer_idx ++++++ # ) +++++ +++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +++++ config=config, layer_idx=layer_idx +++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +++++ return outputs +++++ +++++ +++++- +++++ class DeepseekPreTrainedModel(PreTrainedModel): +++++ config_class = DeepseekConfig +++++ base_model_prefix = "model" +++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++ # Initialize weights and apply final processing +++++ self.post_init() +++++ self.warm_up = False +++++- #@dwj +++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +++++- self.num_layers, +++++- self.num_attention_heads, +++++- self.head_dim, +++++- batch_size=1, +++++- max_length=self.max_length, +++++- dtype=mindspore.float16 +++++- ) +++++- +++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +++++- key_cache = [] +++++- value_cache = [] +++++- for _ in range(num_layers): +++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++- key_cache.append(k) +++++- value_cache.append(v) +++++- return key_cache, value_cache +++++- +++++ +++++ def warmup_moe_model_deep(self): +++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++new file mode 100644 +++++index 00000000..78f22642 +++++--- /dev/null ++++++++ b/patches/0001-20251104commit.patch +++++@@ -0,0 +1,1272 @@ ++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++++Subject: [PATCH 1/3] 20251104commit ++++++ ++++++--- ++++++ mindnlp/transformers/cache_utils.py | 28 +- ++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++++ 3 files changed, 976 insertions(+), 87 deletions(-) ++++++ ++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++++index cadd2e04..02f8d4be 100644 ++++++--- a/mindnlp/transformers/cache_utils.py +++++++++ b/mindnlp/transformers/cache_utils.py ++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++++ # k_out[:, :, cache_position] = key_states ++++++ # v_out[:, :, cache_position] = value_states ++++++- if ON_ORANGE_PI: ++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++- else: ++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++- +++++++ # if ON_ORANGE_PI: +++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++ # else: +++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++++ if cache_position.ndim > 1: +++++++ cache_position = cache_position.flatten() +++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++++ cache_position = cache_position.int() +++++++ +++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++++ k_out[:, :, cache_position] = key_states +++++++ v_out[:, :, cache_position] = value_states +++++++ ++++++ return k_out, v_out ++++++ ++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++index c695b944..d8303e45 100644 ++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++++ def rotate_half(x): ++++++ """Rotates half the hidden dims of the input.""" ++++++- x1 = x[..., : x.shape[-1] // 2] ++++++- x2 = x[..., x.shape[-1] // 2 :] +++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++ # x1 = x[..., : x.shape[-1] // 2] +++++++ # x2 = x[..., x.shape[-1] // 2 :] +++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++ return ops.cat((-x2, x1), dim=-1) ++++++ ++++++ ++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++++ if self.training: ++++++ raise NotImplementedError("Training is not supported yet.") ++++++ else: ++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++- if self.config.n_shared_experts is not None: ++++++- y = y + self.shared_experts(identity) ++++++- return y +++++++ # @lwx +++++++ if orig_shape[1] == 1: +++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++++ y=y.view(*orig_shape) +++++++ if self.config.n_shared_experts is not None: +++++++ y = y + self.shared_experts(identity) +++++++ return y +++++++ else: +++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++++ if self.config.n_shared_experts is not None: +++++++ y = y + self.shared_experts(identity) +++++++ return y +++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++ # if self.config.n_shared_experts is not None: +++++++ # y = y + self.shared_experts(identity) +++++++ # return y +++++++ +++++++ @no_grad() +++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++ +++++++ expert_cache = ops.zeros_like(x) +++++++ for i in range(self.num_experts_per_tok): +++++++ expert_id = flat_expert_indices[i].item() +++++++ weight = flat_expert_weights[i].item() +++++++ expert = self.experts[expert_id] +++++++ expert_out = expert(x) +++++++ expert_cache += expert_out * weight +++++++ return expert_cache ++++++ ++++++ @no_grad() ++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++- # expert_cache = torch.zeros_like(x) ++++++- # idxs = flat_expert_indices.argsort() ++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++- # token_idxs = idxs // self.num_experts_per_tok ++++++- # for i, end_idx in enumerate(tokens_per_expert): ++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++- # if start_idx == end_idx: ++++++- # continue ++++++- # expert = self.experts[i] ++++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++++- # expert_tokens = x[exp_token_idx] ++++++- # expert_out = expert(expert_tokens) ++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++- # return expert_cache +++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++ expert_cache = ops.zeros_like(x) ++++++ idxs = flat_expert_indices.argsort() ++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++ token_idxs = idxs // self.num_experts_per_tok +++++++ ++++++ for i, end_idx in enumerate(tokens_per_expert): ++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++ if start_idx == end_idx: ++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++++ expert_out = expert(expert_tokens) ++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ ++++++ return expert_cache +++++++ +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # # expert_cache = torch.zeros_like(x) +++++++ # # idxs = flat_expert_indices.argsort() +++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++ # # token_idxs = idxs // self.num_experts_per_tok +++++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++ # # if start_idx == end_idx: +++++++ # # continue +++++++ # # expert = self.experts[i] +++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # # expert_tokens = x[exp_token_idx] +++++++ # # expert_out = expert(expert_tokens) +++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++ # # return expert_cache +++++++ # expert_cache = ops.zeros_like(x) +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # for i, end_idx in enumerate(tokens_per_expert): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # if start_idx == end_idx: +++++++ # continue +++++++ # expert = self.experts[i] +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = expert(expert_tokens) +++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ +++++++ # return expert_cache +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # expert_cache = ops.zeros_like(x) +++++++ +++++++ # # 排序保证顺序一致 +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # # 找出有 token 的专家 +++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++ +++++++ # for i in active_experts.tolist(): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # end_idx = tokens_per_expert[i] +++++++ # if start_idx == end_idx: # 没有 token +++++++ # continue +++++++ +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = self.experts[i](expert_tokens) +++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++ +++++++ # expert_cache = mindspore.mint.scatter_add( +++++++ # expert_cache, +++++++ # 0, +++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++ # expert_out +++++++ # ) +++++++ +++++++ # return expert_cache +++++++ +++++++ ++++++ ++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++++ # """ ++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ ++++++ # Initialize weights and apply final processing ++++++ self.post_init() +++++++ self.warm_up = False +++++++ +++++++ def warmup_moe_model_deep(self): +++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++++ test_texts = [ +++++++ "warmup short", +++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++++ ] +++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++ if tokenizer is None: +++++++ from mindnlp.transformers import AutoTokenizer +++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++ self._warmup_tokenizer = tokenizer +++++++ +++++++ for text in test_texts: +++++++ inputs = tokenizer(text, return_tensors="ms") +++++++ with mindspore._no_grad(): +++++++ _ = self(**inputs, use_cache=False) +++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++++ ++++++ def get_input_embeddings(self): ++++++ return self.model.embed_tokens ++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++ ```""" +++++++ if not self.warm_up: +++++++ self.warm_up = True +++++++ self.warmup_moe_model_deep() +++++++ ++++++ output_attentions = ( ++++++ output_attentions ++++++ if output_attentions is not None ++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++index 3cbf820e..d4c6b651 100644 ++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++@@ -18,7 +18,6 @@ ++++++ # See the License for the specific language governing permissions and ++++++ # limitations under the License. ++++++ """MindSpore Qwen2MoE model.""" ++++++- ++++++ import math ++++++ from typing import List, Optional, Tuple, Union ++++++ ++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++++ TokenClassifierOutput, ++++++ ) ++++++ from ...modeling_utils import PreTrainedModel +++++++from ...generation import GenerationMixin ++++++ from ....utils import logging ++++++ from .configuration_qwen2_moe import Qwen2MoeConfig ++++++ ++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++++ self.variance_epsilon = eps ++++++ ++++++ def forward(self, hidden_states): +++++++ # @dwj +++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++ # @lwx +++++++ # if not self.training : +++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++ input_dtype = hidden_states.dtype ++++++ hidden_states = hidden_states.to(mindspore.float32) ++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++++@@ -234,6 +239,8 @@ def rotate_half(x): ++++++ """Rotates half the hidden dims of the input.""" ++++++ x1 = x[..., : x.shape[-1] // 2] ++++++ x2 = x[..., x.shape[-1] // 2 :] +++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++ return ops.cat((-x2, x1), dim=-1) ++++++ ++++++ ++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++++ self.config = config ++++++ self.hidden_size = config.hidden_size ++++++ self.intermediate_size = intermediate_size +++++++ ++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++++ self.act_fn = ACT2FN[config.hidden_act] ++++++ ++++++ def forward(self, x): ++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++- ++++++ +++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++ # @lwx +++++++ # gate_up_output = self.gate_up_proj(x) +++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++++ # return self.down_proj(swiglu_output) +++++++ +++++++ # def forward(self, x): +++++++ # gate_proj_out = self.gate_proj(x) +++++++ # up_proj_out = self.up_proj(x) +++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++++ # return self.down_proj(swiglu_out) +++++++ ++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++++ """ ++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++++ use_cache: bool = False, ++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ +++++++ ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ query_states = self.q_proj(hidden_states) ++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++ "with a layer index." ++++++ ) ++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ if isinstance(past_key_value, StaticCache): +++++++ kv_seq_len = key_states.shape[-2] +++++++ else: +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++ if isinstance(past_key_value, StaticCache): +++++++ kv_seq_len = key_states.shape[-2] ++++++ ++++++ # repeat k/v heads if n_kv_heads < n_heads ++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++- +++++++ ++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++ ++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++++- raise ValueError( ++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++++- f" {attn_weights.shape}" ++++++- ) ++++++- ++++++- if attention_mask is not None: # no matter the length, we just slice it ++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++++ if attention_mask is not None: +++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++ attn_weights = attn_weights + causal_mask ++++++ ++++++ # upcast attention to fp32 ++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++ ++++++ attn_output = self.o_proj(attn_output) ++++++- +++++++ # @lwx +++++++ +++++++ # max_seq_len = self.max_position_embeddings # 2048 +++++++ +++++++ # if attention_mask is not None: +++++++ # # attention_mask: [B, 1, Sq, Sk] +++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++ +++++++ # # pad 到 [max_seq_len, max_seq_len] +++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++ # global_attention_mask = padded_mask +++++++ # else: +++++++ # global_attention_mask = None +++++++ +++++++ +++++++ # sparse_mode=3 +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # real_shift=None, +++++++ # padding_mask=None, +++++++ +++++++ # head_num=self.num_heads, +++++++ # attn_mask=global_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ # input_layout="BNSD", +++++++ # pre_tokens=2147483647, +++++++ # next_tokens=2147483647, +++++++ # inner_precise=0, +++++++ # drop_mask=None, +++++++ # prefix=None, +++++++ # actual_seq_qlen=None, +++++++ # actual_seq_kvlen=None, +++++++ # sparse_mode=sparse_mode, +++++++ # ) ++++++ if not output_attentions: ++++++ attn_weights = None ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ +++++++class Qwen2MoeFlashAttention(nn.Module): +++++++ """ +++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++ +++++++ 关键改动: +++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++ 直接传入原始的 key 和 value 张量效率更高。 +++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++ """ +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++ super().__init__() +++++++ self.config = config +++++++ self.layer_idx = layer_idx +++++++ self.hidden_size = config.hidden_size +++++++ self.num_heads = config.num_attention_heads +++++++ self.head_dim = self.hidden_size // self.num_heads +++++++ self.num_key_value_heads = config.num_key_value_heads +++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++ self.max_position_embeddings = config.max_position_embeddings +++++++ self.rope_theta = config.rope_theta +++++++ self.attention_dropout = config.attention_dropout +++++++ +++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++++ raise ValueError( +++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++ ) +++++++ +++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++ +++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++ self.head_dim, +++++++ max_position_embeddings=self.max_position_embeddings, +++++++ base=self.rope_theta, +++++++ ) +++++++ +++++++ def forward( +++++++ self, +++++++ hidden_states: mindspore.Tensor, +++++++ attention_mask: Optional[mindspore.Tensor] = None, +++++++ position_ids: Optional[mindspore.Tensor] = None, +++++++ past_key_value: Optional[Cache] = None, +++++++ output_attentions: bool = False, +++++++ use_cache: bool = False, +++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # 1. 线性投射 Q, K, V +++++++ query_states = self.q_proj(hidden_states) +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++ # query: [B, S, H*D] -> [B, N1, S, D] +++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # 3. RoPE 旋转位置编码 +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++ if self.layer_idx is None: +++++++ raise ValueError( +++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ "with a layer index." +++++++ ) +++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++ if cache_position.shape[0] == 1: +++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++ kv_seq_len = past_seen_tokens + 1 +++++++ else: +++++++ # prefill 阶段:cache_position 是范围,使用其长度 +++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++ else: +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # 4. KV 缓存更新 +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ key_states, value_states = past_key_value.update( +++++++ key_states, value_states, self.layer_idx, cache_kwargs +++++++ ) +++++++ +++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++ if cache_position.shape[0] == 1: +++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++ kv_seq_len = key_states.shape[-2] +++++++ +++++++ # 5. [重要] 准备 Attention Mask +++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++ fa_attention_mask = None +++++++ if attention_mask is not None: +++++++ # 截取与当前key长度匹配的部分 +++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++ fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++ input_dtype = query_states.dtype +++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++ query_states = query_states.to(mindspore.float16) +++++++ key_states = key_states.to(mindspore.float16) +++++++ value_states = value_states.to(mindspore.float16) +++++++ +++++++ # 6. [核心] 调用 flash_attention_score 算子 +++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++ attn_output = mindspore.ops.flash_attention_score( +++++++ query=query_states, +++++++ key=key_states, +++++++ value=value_states, +++++++ head_num=self.num_heads, # 传入Q的头数(N1) +++++++ attn_mask=fa_attention_mask, +++++++ keep_prob=1.0 - self.attention_dropout, +++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ input_layout="BNSD", +++++++ sparse_mode=0 # 使用 defaultMask 模式 +++++++ ) +++++++ +++++++ # 恢复原始数据类型 +++++++ attn_output = attn_output.to(input_dtype) +++++++ +++++++ # 7. 调整输出形状 +++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = self.o_proj(attn_output) +++++++ +++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +++++++ attn_weights = None +++++++ if output_attentions: +++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ # def forward( +++++++ # self, +++++++ # hidden_states: mindspore.Tensor, +++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++ # past_key_value: Optional[Cache] = None, +++++++ # output_attentions: bool = False, +++++++ # use_cache: bool = False, +++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ # bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # # 1. 线性投射 Q, K, V +++++++ # query_states = self.q_proj(hidden_states) +++++++ # key_states = self.k_proj(hidden_states) +++++++ # value_states = self.v_proj(hidden_states) +++++++ +++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # # 3. RoPE 旋转位置编码 +++++++ # kv_seq_len = key_states.shape[-2] +++++++ # if past_key_value is not None: +++++++ # if self.layer_idx is None: +++++++ # raise ValueError( +++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ # "with a layer index." +++++++ # ) +++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # # 4. KV 缓存更新 +++++++ # if past_key_value is not None: +++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ # key_states, value_states = past_key_value.update( +++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++ # ) +++++++ +++++++ # # 5. 准备 Attention Mask +++++++ # fa_attention_mask = None +++++++ # if attention_mask is not None: +++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++ # input_dtype = query_states.dtype +++++++ +++++++ # # 6. [核心] 调用 flash_attention_score 算子 +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # head_num=self.num_heads, +++++++ # attn_mask=fa_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ # input_layout="BNSD", +++++++ # sparse_mode=0, +++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++ # inner_precise=1 +++++++ # ) +++++++ +++++++ # # 恢复原始数据类型 +++++++ # attn_output = attn_output.to(input_dtype) +++++++ +++++++ # # 7. 调整输出形状 +++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ # attn_output = self.o_proj(attn_output) +++++++ +++++++ # attn_weights = None +++++++ # if output_attentions: +++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++ # return attn_output, attn_weights, past_key_value +++++++ +++++++ # def forward( +++++++ # self, +++++++ # hidden_states: mindspore.Tensor, +++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++ # past_key_value: Optional[Cache] = None, +++++++ # output_attentions: bool = False, +++++++ # use_cache: bool = False, +++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ # bsz, q_len, _ = hidden_states.shape +++++++ +++++++ # query_states = self.q_proj(hidden_states) +++++++ # key_states = self.k_proj(hidden_states) +++++++ # value_states = self.v_proj(hidden_states) +++++++ +++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ # kv_seq_len = key_states.shape[-2] +++++++ # if past_key_value is not None: +++++++ # if self.layer_idx is None: +++++++ # raise ValueError("`layer_idx` must be specified for caching") +++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ # if past_key_value is not None: +++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ # key_states, value_states = past_key_value.update( +++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++ # ) +++++++ +++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++ +++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++++ # query_states = query_states / math.sqrt(self.head_dim) +++++++ # # <--- 修改结束 --- +++++++ +++++++ # fa_attention_mask = None +++++++ # if attention_mask is not None: +++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ # fa_attention_mask = (mask_slice != 0) +++++++ +++++++ # input_dtype = query_states.dtype +++++++ +++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++ # query=query_states, # 传入已经预先缩放过的 query +++++++ # key=key_states, +++++++ # value=value_states, +++++++ # head_num=self.num_heads, +++++++ # attn_mask=fa_attention_mask, +++++++ # keep_prob=1.0 - self.attention_dropout, +++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++++ # input_layout="BNSD", +++++++ # sparse_mode=0, +++++++ # inner_precise=1 # 仍然保持内部高精度计算 +++++++ # ) +++++++ +++++++ # attn_output = attn_output.to(input_dtype) +++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ # attn_output = self.o_proj(attn_output) +++++++ +++++++ # attn_weights = None +++++++ # if output_attentions: +++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++++ +++++++ # return attn_output, attn_weights, past_key_value +++++++ ++++++ QWEN2MOE_ATTENTION_CLASSES = { ++++++ "eager": Qwen2MoeAttention, +++++++ "flash-attention": Qwen2MoeFlashAttention, ++++++ } ++++++ ++++++ ++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ +++++++ #@dwj +++++++ # 只遍历激活的专家,而非全部专家 ++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- hidden_states = hidden_states.view(-1, hidden_dim) ++++++- # router_logits: (batch * sequence_length, n_experts) ++++++- router_logits = self.gate(hidden_states) ++++++- ++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- if self.norm_topk_prob: ++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- # we cast back to the input dtype ++++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++- final_hidden_states = ops.zeros( ++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++++- ) ++++++- ++++++- # One hot encode the selected experts to create an expert mask ++++++- # this will be used to easily index which expert is going to be sollicitated ++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++++- ++++++- # Loop over all available experts in the model and perform the computation on each expert ++++++- for expert_idx in range(self.num_experts): ++++++- expert_layer = self.experts[expert_idx] ++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++++- ++++++- # Index the correct hidden states and compute the expert hidden state for ++++++- # the current expert. We need to make sure to multiply the output hidden ++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++++- if 0 not in idx.shape: ++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++++- ++++++- # However `index_add_` only support torch tensors for indexing so we'll use ++++++- # the `top_x` tensor here. ++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++++- ++++++- shared_expert_output = self.shared_expert(hidden_states) ++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++++- ++++++- final_hidden_states = final_hidden_states + shared_expert_output +++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++ num_tokens = hidden_states_reshaped.shape[0] +++++++ +++++++ router_logits = self.gate(hidden_states_reshaped) +++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++ if self.norm_topk_prob: +++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++ flat_selected_experts = selected_experts.flatten() +++++++ +++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++ token_indices = broadcasted_token_indices.flatten() +++++++ +++++++ active_experts = ops.unique(flat_selected_experts) +++++++ +++++++ for expert_idx_tensor in active_experts: +++++++ expert_idx = expert_idx_tensor.item() +++++++ expert_layer = self.experts[expert_idx] +++++++ +++++++ mask = (flat_selected_experts == expert_idx_tensor) +++++++ selected_token_indices = token_indices[mask] +++++++ selected_routing_weights = routing_weights.flatten()[mask] +++++++ +++++++ current_states = hidden_states_reshaped[selected_token_indices] +++++++ +++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++ +++++++ final_hidden_states = final_hidden_states.index_add( +++++++ dim=0, +++++++ index=selected_token_indices, +++++++ source=expert_output.to(hidden_states.dtype) +++++++ ) +++++++ +++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++ ++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++- return final_hidden_states, router_logits +++++++ final_hidden_states = final_hidden_states + shared_expert_output +++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++ +++++++ return final_hidden_states, router_logits ++++++ ++++++ ++++++ class Qwen2MoeDecoderLayer(nn.Module): ++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++++ ++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++ ++++++ if (layer_idx not in config.mlp_only_layers) and ( ++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++++ ): ++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++++ _skip_keys_device_placement = "past_key_values" ++++++ _supports_cache_class = True +++++++#lwx +++++++ # _supports_static_cache = True ++++++ ++++++ def _init_weights(self, module): ++++++ std = self.config.initializer_range ++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++++ return causal_mask ++++++ ++++++ ++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ _tied_weights_keys = ["lm_head.weight"] ++++++ ++++++ def __init__(self, config): ++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ self.num_experts_per_tok = config.num_experts_per_tok ++++++ # Initialize weights and apply final processing ++++++ self.post_init() +++++++ # @lwx +++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++++ # self.generation_config.cache_implementation = "static" +++++++ self._warmed_up = False +++++++ +++++++ def warmup_moe_model(self): +++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++++ test_texts = [ +++++++ "warmup short", +++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++++ ] +++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++ if tokenizer is None: +++++++ from mindnlp.transformers import AutoTokenizer +++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++ self._warmup_tokenizer = tokenizer +++++++ +++++++ for text in test_texts: +++++++ inputs = tokenizer(text, return_tensors="ms") +++++++ with mindspore._no_grad(): +++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++++ ++++++ def get_input_embeddings(self): ++++++ return self.model.embed_tokens ++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++ ```""" +++++++ if not self._warmed_up: +++++++ self._warmed_up = True +++++++ self.warmup_moe_model() ++++++ ++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++++ output_router_logits = ( ++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++ } ++++++ ) ++++++ return model_inputs +++++++# @lwx +++++++ # def _decode_one_tokens_logits( +++++++ # self, +++++++ # cur_token: mindspore.Tensor, +++++++ # input_pos: Optional[mindspore.Tensor], +++++++ # cache_position: mindspore.Tensor, +++++++ # past_key_values: StaticCache, +++++++ # ) -> mindspore.Tensor: +++++++ # """ +++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++++ +++++++ # Args: +++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++++ # input_pos: 输入位置信息,可选 +++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++++ +++++++ # Returns: +++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++++ # """ +++++++ # # 调用JIT编译的版本 +++++++ # return self.get_decode_one_tokens_logits( +++++++ # cur_token=cur_token, +++++++ # input_pos=input_pos, +++++++ # cache_position=cache_position, +++++++ # past_key_values=past_key_values, +++++++ # ) +++++++ +++++++ # @mindspore.jit(jit_level='O1') +++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++++ # """ +++++++ # JIT编译的函数,用于高效的单token解码 +++++++ # 使用JIT编译优化以支持静态shape和高效执行 +++++++ +++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++++ # """ +++++++ # outputs = self.model.forward( +++++++ # input_ids=cur_token, +++++++ # position_ids=input_pos, +++++++ # cache_position=cache_position, +++++++ # past_key_values=past_key_values, +++++++ # use_cache=True, +++++++ # return_dict=False, +++++++ # ) +++++++ +++++++ # hidden_states = outputs[0] +++++++ # logits = self.lm_head.forward(hidden_states) +++++++ # logits = logits.float() +++++++ +++++++ # return logits[:, -1, :] +++++++ +++++++ # def _sample( +++++++ # self, +++++++ # input_ids: mindspore.Tensor, +++++++ # logits_processor, +++++++ # stopping_criteria, +++++++ # generation_config, +++++++ # synced_devices: bool, +++++++ # streamer=None, +++++++ # logits_warper=None, +++++++ # **model_kwargs, +++++++ # ): +++++++ # """ +++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++++ # """ +++++++ # from ...generation.logits_process import LogitsProcessorList +++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++++ # from mindnlp.core import nn, ops, no_grad +++++++ # import numpy as np +++++++ +++++++ # # 检查是否使用 StaticCache +++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++++ # # 否则,直接调用父类方法 +++++++ # past_key_values = model_kwargs.get("past_key_values") +++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++++ +++++++ # if not isinstance(past_key_values, StaticCache): +++++++ # # 不使用 StaticCache,直接调用父类方法 +++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++++ # return super()._sample( +++++++ # input_ids=input_ids, +++++++ # logits_processor=logits_processor, +++++++ # stopping_criteria=stopping_criteria, +++++++ # generation_config=generation_config, +++++++ # synced_devices=synced_devices, +++++++ # streamer=streamer, +++++++ # logits_warper=logits_warper, +++++++ # **model_kwargs, +++++++ # ) +++++++ +++++++ # # 使用 StaticCache,进入自定义循环 +++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++++ # pad_token_id = generation_config._pad_token_tensor +++++++ # output_attentions = generation_config.output_attentions +++++++ # output_hidden_states = generation_config.output_hidden_states +++++++ # output_scores = generation_config.output_scores +++++++ # output_logits = generation_config.output_logits +++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +++++++ # max_length = generation_config.max_length +++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++++ # do_sample = generation_config.do_sample +++++++ +++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++++ # raise ValueError( +++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++++ # f"{logits_warper})." +++++++ # ) +++++++ +++++++ # # init attention / hidden states / scores tuples +++++++ # scores = () if (return_dict_in_generate and output_scores) else None +++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++++ +++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++++ # encoder_hidden_states = ( +++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++++ # ) +++++++ +++++++ # # keep track of which sequences are already finished +++++++ # batch_size, cur_len = input_ids.shape +++++++ # this_peer_finished = False +++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++++ +++++++ # time_record = [] +++++++ # from ....utils.testing_utils import parse_flag_from_env +++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++++ +++++++ # while self._has_unfinished_sequences( +++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++++ # ): +++++++ # if _record_time: +++++++ # import time as time_module +++++++ # infer_start = time_module.time() +++++++ +++++++ # # prepare model inputs +++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++++ +++++++ # # prepare variable output controls +++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++++ +++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++++ # cur_cache_position = model_inputs.get("cache_position") +++++++ # cur_past_key_values = model_inputs.get("past_key_values") +++++++ # cur_input_ids = model_inputs.get("input_ids") +++++++ +++++++ # if (isinstance(cur_past_key_values, StaticCache) and +++++++ # cur_cache_position is not None and +++++++ # len(cur_cache_position.shape) > 0 and +++++++ # cur_cache_position.shape[0] == 1 and +++++++ # cur_input_ids is not None and +++++++ # cur_input_ids.shape[1] == 1): +++++++ # # 使用 JIT 优化的单 token 解码 +++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++++ # if not hasattr(self, '_jit_used'): +++++++ # self._jit_used = False +++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++++ +++++++ # next_token_logits = self.get_decode_one_tokens_logits( +++++++ # cur_token=cur_input_ids, +++++++ # input_pos=model_inputs.get("position_ids"), +++++++ # cache_position=cur_cache_position, +++++++ # past_key_values=cur_past_key_values, +++++++ # ) +++++++ +++++++ # # 标记已使用JIT(用于后续判断) +++++++ # if not self._jit_used: +++++++ # self._jit_used = True +++++++ +++++++ # # 构造兼容的输出对象 +++++++ # class JitOptimizedOutput: +++++++ # def __init__(self, logits, config): +++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++++ # self.config = config +++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++++ # self.attentions = None if not config.is_encoder_decoder else None +++++++ # self.cross_attentions = None +++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +++++++ +++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++++ # else: +++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++++ # outputs = self(**model_inputs, return_dict=True) +++++++ +++++++ # if synced_devices and this_peer_finished: +++++++ # continue +++++++ +++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++++ # next_token_logits = outputs.logits[:, -1, :] +++++++ +++++++ # # pre-process distribution +++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++++ # if do_sample: +++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++++ +++++++ # # Store scores, attentions and hidden_states when required +++++++ # if return_dict_in_generate: +++++++ # if output_scores: +++++++ # scores += (next_token_scores,) +++++++ # if output_logits: +++++++ # raw_logits += (next_token_logits,) +++++++ # if output_attentions: +++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +++++++ # if self.config.is_encoder_decoder: +++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++++ +++++++ # if output_hidden_states: +++++++ # hidden = ( +++++++ # outputs.decoder_hidden_states +++++++ # if self.config.is_encoder_decoder +++++++ # else outputs.hidden_states +++++++ # ) +++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++++ +++++++ # # token selection +++++++ # if do_sample: +++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++++ # else: +++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++++ +++++++ # # finished sentences should have their next token be a padding token +++++++ # if has_eos_stopping_criteria: +++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++++ +++++++ # # update generated ids, model inputs, and length for next step +++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++++ # if streamer is not None: +++++++ # streamer.put(next_tokens) +++++++ +++++++ # model_kwargs = self._update_model_kwargs_for_generation( +++++++ # outputs, +++++++ # model_kwargs, +++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +++++++ # ) +++++++ +++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++++ # cur_len += 1 +++++++ +++++++ # if _record_time: +++++++ # import time as time_module +++++++ # infer_stop = time_module.time() +++++++ # time_record.append(infer_stop - infer_start) +++++++ +++++++ # del outputs +++++++ +++++++ # average_infer_time = None +++++++ # if time_record: +++++++ # if len(time_record) > 1: +++++++ # time_record.pop(0) +++++++ # average_infer_time = sum(time_record) / len(time_record) +++++++ # print(f'average inference time is: {average_infer_time}') +++++++ # print(f'inference time record: {time_record}') +++++++ +++++++ # if streamer is not None: +++++++ # streamer.end() +++++++ +++++++ # # 简单判断:打印是否使用了JIT路径 +++++++ # if hasattr(self, '_jit_used') and self._jit_used: +++++++ # print("[JIT] ✓ JIT optimization was used during generation") +++++++ # else: +++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++++ +++++++ # if return_dict_in_generate: +++++++ # if self.config.is_encoder_decoder: +++++++ # return GenerateEncoderDecoderOutput( +++++++ # sequences=input_ids, +++++++ # scores=scores, +++++++ # logits=raw_logits, +++++++ # encoder_attentions=encoder_attentions, +++++++ # encoder_hidden_states=encoder_hidden_states, +++++++ # decoder_attentions=decoder_attentions, +++++++ # cross_attentions=cross_attentions, +++++++ # decoder_hidden_states=decoder_hidden_states, +++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++ # average_infer_time=average_infer_time +++++++ # ) +++++++ # else: +++++++ # return GenerateDecoderOnlyOutput( +++++++ # sequences=input_ids, +++++++ # scores=scores, +++++++ # logits=raw_logits, +++++++ # attentions=decoder_attentions, +++++++ # hidden_states=decoder_hidden_states, +++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++ # average_infer_time=average_infer_time +++++++ # ) +++++++ # else: +++++++ # return input_ids +++++++ +++++++ # def _prepare_cache_for_generation( +++++++ # self, +++++++ # generation_config, +++++++ # model_kwargs, +++++++ # assistant_model, +++++++ # batch_size, +++++++ # max_cache_length, +++++++ # ): +++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++++ # generation_config.cache_implementation = "static" +++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++++ +++++++ # if generation_config.cache_implementation == "static": +++++++ # base_required_from_max_length = generation_config.max_length + 1 +++++++ # base_required = max(max_cache_length, base_required_from_max_length) +++++++ # min_cache_size = 50 +++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++++ # else: +++++++ # max_cache_length = max(base_required, min_cache_size) +++++++ +++++++ # original_max_cache_length = max_cache_length +++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++++ # print(f" - final max_cache_length: {max_cache_length}") +++++++ +++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++ # if max_cache_length > self.config.max_position_embeddings: +++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++++ +++++++ # result = super()._prepare_cache_for_generation( +++++++ # generation_config=generation_config, +++++++ # model_kwargs=model_kwargs, +++++++ # assistant_model=assistant_model, +++++++ # batch_size=batch_size, +++++++ # max_cache_length=max_cache_length, +++++++ # ) +++++++ +++++++ # if generation_config.cache_implementation == "static": +++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++++ # created_cache = model_kwargs.get(cache_name) +++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++++ # if created_cache.max_cache_len < generation_config.max_length: +++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++++ +++++++ # return result +++++++ +++++++ +++++++ ++++++ ++++++ ++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++++-- ++++++2.27.0 ++++++ +++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++++new file mode 100644 +++++index 00000000..22b65dd5 +++++--- /dev/null ++++++++ b/patches/0002-20251106commit.patch +++++@@ -0,0 +1,3200 @@ ++++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++Date: Thu, 6 Nov 2025 09:20:38 +0800 ++++++Subject: [PATCH 2/3] 20251106commit ++++++ ++++++--- ++++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- ++++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ ++++++ 3 files changed, 2689 insertions(+), 305 deletions(-) ++++++ create mode 100644 patches/0001-20251104commit.patch ++++++ ++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++index d8303e45..73773c22 100644 ++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): ++++++ # y = y + self.shared_experts(identity) ++++++ # return y ++++++ +++++++ # @no_grad() +++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++ +++++++ # expert_cache = ops.zeros_like(x) +++++++ # for i in range(self.num_experts_per_tok): +++++++ # expert_id = flat_expert_indices[i].item() +++++++ # weight = flat_expert_weights[i].item() +++++++ # expert = self.experts[expert_id] +++++++ # expert_out = expert(x) +++++++ # expert_cache += expert_out * weight +++++++ # return expert_cache +++++++ ++++++ @no_grad() ++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++ # x 的 shape: (1, hidden_size) +++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++++++ +++++++ # 1. 收集所有需要的专家层 +++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +++++++ +++++++ # 2. 并行计算所有专家的输出 +++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++++++ # ops.cat 会将它们堆叠成一个新的 Tensor +++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++++++ +++++++ # 3. 使用矩阵乘法进行加权求和 +++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++++ # 最终结果 final_output 的 shape: (1, hidden_size) +++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++++++ +++++++ return final_output ++++++ ++++++- expert_cache = ops.zeros_like(x) ++++++- for i in range(self.num_experts_per_tok): ++++++- expert_id = flat_expert_indices[i].item() ++++++- weight = flat_expert_weights[i].item() ++++++- expert = self.experts[expert_id] ++++++- expert_out = expert(x) ++++++- expert_cache += expert_out * weight ++++++- return expert_cache ++++++ ++++++ @no_grad() ++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++ # @lwx +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++++++ ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ +++++++# class DeepseekFlashAttention(nn.Module): +++++++# """ +++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +++++++ +++++++# This class is designed as a drop-in replacement for DeepseekAttention. +++++++# """ +++++++ +++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++++++# super().__init__() +++++++# self.config = config +++++++# self.layer_idx = layer_idx +++++++# if layer_idx is None: +++++++# logger.warning( +++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++++# "when creating this class." +++++++# ) +++++++ +++++++# self.attention_dropout = config.attention_dropout +++++++# self.hidden_size = config.hidden_size +++++++# self.num_heads = config.num_attention_heads +++++++# self.head_dim = self.hidden_size // self.num_heads +++++++# self.num_key_value_heads = config.num_key_value_heads +++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++# self.max_position_embeddings = config.max_position_embeddings +++++++# self.rope_theta = config.rope_theta +++++++# self.is_causal = True +++++++ +++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++++# raise ValueError( +++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++++# f" and `num_heads`: {self.num_heads})." +++++++# ) +++++++ +++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++++++# self._init_rope() +++++++ +++++++# def _init_rope(self): +++++++# if self.config.rope_scaling is None: +++++++# self.rotary_emb = DeepseekRotaryEmbedding( +++++++# self.head_dim, +++++++# max_position_embeddings=self.max_position_embeddings, +++++++# base=self.rope_theta, +++++++# ) +++++++# else: +++++++# scaling_type = self.config.rope_scaling["type"] +++++++# scaling_factor = self.config.rope_scaling["factor"] +++++++# if scaling_type == "linear": +++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++++++# self.head_dim, +++++++# max_position_embeddings=self.max_position_embeddings, +++++++# scaling_factor=scaling_factor, +++++++# base=self.rope_theta, +++++++# ) +++++++# elif scaling_type == "dynamic": +++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++++++# self.head_dim, +++++++# max_position_embeddings=self.max_position_embeddings, +++++++# scaling_factor=scaling_factor, +++++++# base=self.rope_theta, +++++++# ) +++++++# else: +++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++++++ +++++++# def forward( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# attention_mask: Optional[mindspore.Tensor] = None, +++++++# position_ids: Optional[mindspore.Tensor] = None, +++++++# past_key_value: Optional[Cache] = None, +++++++# output_attentions: bool = False, +++++++# use_cache: bool = False, +++++++# **kwargs, +++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++# if "padding_mask" in kwargs: +++++++# warnings.warn( +++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++++++# ) +++++++ +++++++# if output_attentions: +++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +++++++ +++++++# bsz, q_len, _ = hidden_states.shape +++++++ +++++++# if self.config.pretraining_tp > 1: +++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++++++ +++++++# query_states = self.q_proj(hidden_states) +++++++# key_states = self.k_proj(hidden_states) +++++++# value_states = self.v_proj(hidden_states) +++++++ +++++++# # Reshape for multi-head attention +++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++# kv_seq_len = key_states.shape[-2] +++++++# if past_key_value is not None: +++++++# if self.layer_idx is None: +++++++# raise ValueError( +++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++# "with a layer index." +++++++# ) +++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++# # Apply Rotary Positional Embedding +++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++# if past_key_value is not None: +++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ +++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++++ +++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +++++++ +++++++# # Convert attention_mask for flash_attention_score +++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +++++++# if attention_mask is not None: +++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++++++# raise ValueError( +++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++++++# ) +++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +++++++# else: +++++++# attn_mask_for_fa = None +++++++ +++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++++++ +++++++# # Call the fused flash_attention_score operator +++++++# attn_output = mindspore.ops.flash_attention_score( +++++++# query=query_states_for_fa, +++++++# key=key_states_for_fa, +++++++# value=value_states_for_fa, +++++++# head_num=self.num_heads, # This is N1, the number of query heads +++++++# input_layout='BSH', +++++++# attn_mask=attn_mask_for_fa, +++++++# keep_prob=keep_prob, +++++++# scalar_value=1.0 / math.sqrt(self.head_dim), +++++++# sparse_mode=0 # Default mask mode +++++++# ) +++++++ +++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +++++++# attn_output = self.o_proj(attn_output) +++++++ +++++++# # Flash Attention does not return attention weights +++++++# attn_weights = None +++++++ +++++++# return attn_output, attn_weights, past_key_value +++++++ +++++++class DeepseekFlashAttention(nn.Module): +++++++ """ +++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, +++++++ designed for high performance on supported hardware (Ascend). +++++++ +++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +++++++ """ +++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +++++++ super().__init__() +++++++ self.config = config +++++++ self.layer_idx = layer_idx +++++++ if layer_idx is None: +++++++ logger.warning( +++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++++ "when creating this class." +++++++ ) +++++++ +++++++ # --- [FIX] Correctly initialize all required attributes --- +++++++ self.attention_dropout = config.attention_dropout +++++++ self.hidden_size = config.hidden_size +++++++ self.num_heads = config.num_attention_heads +++++++ self.head_dim = self.hidden_size // self.num_heads +++++++ self.num_key_value_heads = config.num_key_value_heads +++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++ self.max_position_embeddings = config.max_position_embeddings +++++++ self.rope_theta = config.rope_theta +++++++ self.is_causal = True +++++++ +++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++++ raise ValueError( +++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++++ f" and `num_heads`: {self.num_heads})." +++++++ ) +++++++ +++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +++++++ +++++++ # This call will now succeed as all attributes are initialized. +++++++ self._init_rope() +++++++ +++++++ def _init_rope(self): +++++++ if self.config.rope_scaling is None: +++++++ self.rotary_emb = DeepseekRotaryEmbedding( +++++++ self.head_dim, +++++++ max_position_embeddings=self.max_position_embeddings, +++++++ base=self.rope_theta, +++++++ ) +++++++ else: +++++++ scaling_type = self.config.rope_scaling["type"] +++++++ scaling_factor = self.config.rope_scaling["factor"] +++++++ if scaling_type == "linear": +++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +++++++ self.head_dim, +++++++ max_position_embeddings=self.max_position_embeddings, +++++++ scaling_factor=scaling_factor, +++++++ base=self.rope_theta, +++++++ ) +++++++ elif scaling_type == "dynamic": +++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +++++++ self.head_dim, +++++++ max_position_embeddings=self.max_position_embeddings, +++++++ scaling_factor=scaling_factor, +++++++ base=self.rope_theta, +++++++ ) +++++++ else: +++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +++++++ +++++++ def forward( +++++++ self, +++++++ hidden_states: mindspore.Tensor, +++++++ attention_mask: Optional[mindspore.Tensor] = None, +++++++ position_ids: Optional[mindspore.Tensor] = None, +++++++ past_key_value: Optional[Cache] = None, +++++++ output_attentions: bool = False, +++++++ use_cache: bool = False, +++++++ **kwargs, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ if "padding_mask" in kwargs: +++++++ warnings.warn( +++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +++++++ ) +++++++ if output_attentions: +++++++ warnings.warn( +++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +++++++ ) +++++++ +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ if self.config.pretraining_tp > 1: +++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +++++++ +++++++ query_states = self.q_proj(hidden_states) +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++ # Apply Rotary Position Embedding +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos} +++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +++++++ # So we must explicitly repeat the KV heads. +++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++ +++++++ # Convert attention mask for flash_attention_score +++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +++++++ if attention_mask is not None: +++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +++++++ raise ValueError( +++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +++++++ ) +++++++ attn_mask_for_fa = attention_mask < 0 +++++++ else: +++++++ attn_mask_for_fa = None +++++++ +++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +++++++ +++++++ # Call the fused operator using the efficient BNSD layout +++++++ attn_output = mindspore.ops.flash_attention_score( +++++++ query=query_states, +++++++ key=key_states, +++++++ value=value_states, +++++++ head_num=self.num_heads, +++++++ input_layout='BNSD', # Specify the correct layout +++++++ attn_mask=attn_mask_for_fa, +++++++ keep_prob=keep_prob, +++++++ scalar_value=1.0 / math.sqrt(self.head_dim) +++++++ ) +++++++ +++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ +++++++ # Apply output projection +++++++ attn_output = self.o_proj(attn_output) +++++++ +++++++ # Flash attention does not return attention weights, so we return None. +++++++ attn_weights = None +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ ++++++ Deepseek_ATTENTION_CLASSES = { ++++++ "eager": DeepseekAttention, +++++++ "flash-attention": DeepseekFlashAttention, ++++++ } ++++++ ++++++ ++++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): ++++++ config=config, layer_idx=layer_idx ++++++ ) ++++++ +++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +++++++ config=config, layer_idx=layer_idx +++++++ ) +++++++ ++++++ self.mlp = ( ++++++ DeepseekMoE(config) ++++++ if ( ++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++index d4c6b651..bced285c 100644 ++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union ++++++ ++++++ import mindspore ++++++ import mindnlp.core.nn.functional as F ++++++-from mindnlp.core import nn, ops +++++++from mindnlp.core import nn, ops, no_grad ++++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss ++++++ ++++++ from ....common.activations import ACT2FN ++++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) ++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++++++ +++++++Long_Prompt = False +++++++PROMPT_LENGTH_THRESHOLD = 128 ++++++ ++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++++++ def _prepare_4d_causal_attention_mask_with_cache_position( ++++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++ +++++++# class Qwen2MoeFlashAttention(nn.Module): +++++++# """ +++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++ +++++++# 关键改动: +++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++# 直接传入原始的 key 和 value 张量效率更高。 +++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++# super().__init__() +++++++# self.config = config +++++++# self.layer_idx = layer_idx +++++++# self.hidden_size = config.hidden_size +++++++# self.num_heads = config.num_attention_heads +++++++# self.head_dim = self.hidden_size // self.num_heads +++++++# self.num_key_value_heads = config.num_key_value_heads +++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++# self.max_position_embeddings = config.max_position_embeddings +++++++# self.rope_theta = config.rope_theta +++++++# self.attention_dropout = config.attention_dropout +++++++ +++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++++# raise ValueError( +++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++# ) +++++++ +++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++ +++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++# self.head_dim, +++++++# max_position_embeddings=self.max_position_embeddings, +++++++# base=self.rope_theta, +++++++# ) +++++++ +++++++# def forward( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# attention_mask: Optional[mindspore.Tensor] = None, +++++++# position_ids: Optional[mindspore.Tensor] = None, +++++++# past_key_value: Optional[Cache] = None, +++++++# output_attentions: bool = False, +++++++# use_cache: bool = False, +++++++# cache_position: Optional[mindspore.Tensor] = None, +++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++# bsz, q_len, _ = hidden_states.shape +++++++ +++++++# # 1. 线性投射 Q, K, V +++++++# query_states = self.q_proj(hidden_states) +++++++# key_states = self.k_proj(hidden_states) +++++++# value_states = self.v_proj(hidden_states) +++++++ +++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++# # query: [B, S, H*D] -> [B, N1, S, D] +++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++# # 3. RoPE 旋转位置编码 +++++++# kv_seq_len = key_states.shape[-2] +++++++# if past_key_value is not None: +++++++# if self.layer_idx is None: +++++++# raise ValueError( +++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++# "with a layer index." +++++++# ) +++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++# if cache_position.shape[0] == 1: +++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++# kv_seq_len = past_seen_tokens + 1 +++++++# else: +++++++# # prefill 阶段:cache_position 是范围,使用其长度 +++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++# else: +++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++# # 4. KV 缓存更新 +++++++# if past_key_value is not None: +++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++# key_states, value_states = past_key_value.update( +++++++# key_states, value_states, self.layer_idx, cache_kwargs +++++++# ) +++++++ +++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++# if cache_position.shape[0] == 1: +++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++# kv_seq_len = key_states.shape[-2] +++++++ +++++++# # 5. [重要] 准备 Attention Mask +++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++# fa_attention_mask = None +++++++# if attention_mask is not None: +++++++# # 截取与当前key长度匹配的部分 +++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++# fa_attention_mask = (mask_slice != 0) +++++++ +++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++# input_dtype = query_states.dtype +++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++# query_states = query_states.to(mindspore.float16) +++++++# key_states = key_states.to(mindspore.float16) +++++++# value_states = value_states.to(mindspore.float16) +++++++ +++++++# # 6. [核心] 调用 flash_attention_score 算子 +++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++# attn_output = mindspore.ops.flash_attention_score( +++++++# query=query_states, +++++++# key=key_states, +++++++# value=value_states, +++++++# head_num=self.num_heads, # 传入Q的头数(N1) +++++++# attn_mask=fa_attention_mask, +++++++# keep_prob=1.0 - self.attention_dropout, +++++++# scalar_value=1.0 / math.sqrt(self.head_dim), +++++++# input_layout="BNSD", +++++++# sparse_mode=0 # 使用 defaultMask 模式 +++++++# ) +++++++ +++++++# # 恢复原始数据类型 +++++++# attn_output = attn_output.to(input_dtype) +++++++ +++++++# # 7. 调整输出形状 +++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++# attn_output = self.o_proj(attn_output) +++++++ +++++++# # FlashAttention 算子不直接返回注意力权重矩阵 +++++++# attn_weights = None +++++++# if output_attentions: +++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++# return attn_output, attn_weights, past_key_value +++++++ +++++++# # def forward( +++++++# # self, +++++++# # hidden_states: mindspore.Tensor, +++++++# # attention_mask: Optional[mindspore.Tensor] = None, +++++++# # position_ids: Optional[mindspore.Tensor] = None, +++++++# # past_key_value: Optional[Cache] = None, +++++++# # output_attentions: bool = False, +++++++# # use_cache: bool = False, +++++++# # cache_position: Optional[mindspore.Tensor] = None, +++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++# # bsz, q_len, _ = hidden_states.shape +++++++ +++++++# # # 1. 线性投射 Q, K, V +++++++# # query_states = self.q_proj(hidden_states) +++++++# # key_states = self.k_proj(hidden_states) +++++++# # value_states = self.v_proj(hidden_states) +++++++ +++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ +++++++# # # 3. RoPE 旋转位置编码 +++++++# # kv_seq_len = key_states.shape[-2] +++++++# # if past_key_value is not None: +++++++# # if self.layer_idx is None: +++++++# # raise ValueError( +++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++# # "with a layer index." +++++++# # ) +++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++# # # 4. KV 缓存更新 +++++++# # if past_key_value is not None: +++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++# # key_states, value_states = past_key_value.update( +++++++# # key_states, value_states, self.layer_idx, cache_kwargs +++++++# # ) +++++++ +++++++# # # 5. 准备 Attention Mask +++++++# # fa_attention_mask = None +++++++# # if attention_mask is not None: +++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++# # fa_attention_mask = (mask_slice != 0) +++++++ +++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++# # input_dtype = query_states.dtype +++++++ +++++++# # # 6. [核心] 调用 flash_attention_score 算子 +++++++# # attn_output = mindspore.ops.flash_attention_score( +++++++# # query=query_states, +++++++# # key=key_states, +++++++# # value=value_states, +++++++# # head_num=self.num_heads, +++++++# # attn_mask=fa_attention_mask, +++++++# # keep_prob=1.0 - self.attention_dropout, +++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++# # input_layout="BNSD", +++++++# # sparse_mode=0, +++++++# # # <--- 修改点 2: 启用内部高精度计算 --- +++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++# # inner_precise=1 +++++++# # ) +++++++ +++++++# # # 恢复原始数据类型 +++++++# # attn_output = attn_output.to(input_dtype) +++++++ +++++++# # # 7. 调整输出形状 +++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++# # attn_output = self.o_proj(attn_output) +++++++ +++++++# # attn_weights = None +++++++# # if output_attentions: +++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ +++++++# # return attn_output, attn_weights, past_key_value +++++++ +++++++ ++++++ class Qwen2MoeFlashAttention(nn.Module): ++++++ """ ++++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++- ++++++- 关键改动: ++++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++- 直接传入原始的 key 和 value 张量效率更高。 ++++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +++++++ +++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, +++++++ 以达到理论上的最高执行速度。 ++++++ """ ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++ super().__init__() ++++++ self.config = config ++++++ self.layer_idx = layer_idx +++++++ if layer_idx is None: +++++++ logger.warning_once( +++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +++++++ ) +++++++ ++++++ self.hidden_size = config.hidden_size ++++++ self.num_heads = config.num_attention_heads ++++++ self.head_dim = self.hidden_size // self.num_heads ++++++ self.num_key_value_heads = config.num_key_value_heads ++++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++ self.max_position_embeddings = config.max_position_embeddings ++++++ self.rope_theta = config.rope_theta ++++++ self.attention_dropout = config.attention_dropout ++++++ ++++++- if (self.head_dim * self.num_heads) != self.hidden_size: ++++++- raise ValueError( ++++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++- ) ++++++- ++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++- # query: [B, S, H*D] -> [B, N1, S, D] ++++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++ # 2. 调整形状以匹配 BNSD 布局 ++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- ++++++- # 3. RoPE 旋转位置编码 +++++++ +++++++ # 3. RoPE 和 KV 缓存 ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++- if self.layer_idx is None: ++++++- raise ValueError( ++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++- "with a layer index." ++++++- ) ++++++- # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++- if cache_position.shape[0] == 1: ++++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++- kv_seq_len = past_seen_tokens + 1 ++++++- else: ++++++- # prefill 阶段:cache_position 是范围,使用其长度 ++++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++- else: ++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++- +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++- # 4. KV 缓存更新 ++++++ if past_key_value is not None: ++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++- key_states, value_states = past_key_value.update( ++++++- key_states, value_states, self.layer_idx, cache_kwargs ++++++- ) ++++++- ++++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++- if cache_position.shape[0] == 1: ++++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++- kv_seq_len = key_states.shape[-2] ++++++- ++++++- # 5. [重要] 准备 Attention Mask ++++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++ # 4. 准备 Attention Mask ++++++ fa_attention_mask = None ++++++ if attention_mask is not None: ++++++- # 截取与当前key长度匹配的部分 ++++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++- # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++ fa_attention_mask = (mask_slice != 0) ++++++ ++++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++- input_dtype = query_states.dtype ++++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++- query_states = query_states.to(mindspore.float16) ++++++- key_states = key_states.to(mindspore.float16) ++++++- value_states = value_states.to(mindspore.float16) ++++++- ++++++- # 6. [核心] 调用 flash_attention_score 算子 ++++++- # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 ++++++ attn_output = mindspore.ops.flash_attention_score( ++++++ query=query_states, ++++++ key=key_states, ++++++ value=value_states, ++++++- head_num=self.num_heads, # 传入Q的头数(N1) +++++++ head_num=self.num_heads, ++++++ attn_mask=fa_attention_mask, ++++++- keep_prob=1.0 - self.attention_dropout, +++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout ++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++ input_layout="BNSD", ++++++- sparse_mode=0 # 使用 defaultMask 模式 +++++++ sparse_mode=0, +++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 ++++++ ) ++++++ ++++++- # 恢复原始数据类型 ++++++- attn_output = attn_output.to(input_dtype) ++++++- ++++++- # 7. 调整输出形状 ++++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++ # 6. 调整输出形状 ++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++ attn_output = self.o_proj(attn_output) ++++++ ++++++- # FlashAttention 算子不直接返回注意力权重矩阵 +++++++ # 7. 返回结果 ++++++ attn_weights = None ++++++ if output_attentions: ++++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++- # def forward( ++++++- # self, ++++++- # hidden_states: mindspore.Tensor, ++++++- # attention_mask: Optional[mindspore.Tensor] = None, ++++++- # position_ids: Optional[mindspore.Tensor] = None, ++++++- # past_key_value: Optional[Cache] = None, ++++++- # output_attentions: bool = False, ++++++- # use_cache: bool = False, ++++++- # cache_position: Optional[mindspore.Tensor] = None, ++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++- ++++++- # bsz, q_len, _ = hidden_states.shape ++++++- ++++++- # # 1. 线性投射 Q, K, V ++++++- # query_states = self.q_proj(hidden_states) ++++++- # key_states = self.k_proj(hidden_states) ++++++- # value_states = self.v_proj(hidden_states) ++++++- ++++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- ++++++- # # 3. RoPE 旋转位置编码 ++++++- # kv_seq_len = key_states.shape[-2] ++++++- # if past_key_value is not None: ++++++- # if self.layer_idx is None: ++++++- # raise ValueError( ++++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++- # "with a layer index." ++++++- # ) ++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++ ++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++- ++++++- # # 4. KV 缓存更新 ++++++- # if past_key_value is not None: ++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++- # key_states, value_states = past_key_value.update( ++++++- # key_states, value_states, self.layer_idx, cache_kwargs ++++++- # ) ++++++- ++++++- # # 5. 准备 Attention Mask ++++++- # fa_attention_mask = None ++++++- # if attention_mask is not None: ++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++- # fa_attention_mask = (mask_slice != 0) ++++++- ++++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++- # input_dtype = query_states.dtype ++++++- ++++++- # # 6. [核心] 调用 flash_attention_score 算子 ++++++- # attn_output = mindspore.ops.flash_attention_score( ++++++- # query=query_states, ++++++- # key=key_states, ++++++- # value=value_states, ++++++- # head_num=self.num_heads, ++++++- # attn_mask=fa_attention_mask, ++++++- # keep_prob=1.0 - self.attention_dropout, ++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++- # input_layout="BNSD", ++++++- # sparse_mode=0, ++++++- # # <--- 修改点 2: 启用内部高精度计算 --- ++++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++- # inner_precise=1 ++++++- # ) ++++++- ++++++- # # 恢复原始数据类型 ++++++- # attn_output = attn_output.to(input_dtype) +++++++QWEN2MOE_ATTENTION_CLASSES = { +++++++ "eager": Qwen2MoeAttention, +++++++ "flash-attention": Qwen2MoeFlashAttention, +++++++} ++++++ ++++++- # # 7. 调整输出形状 ++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++- # attn_output = self.o_proj(attn_output) ++++++ ++++++- # attn_weights = None ++++++- # if output_attentions: ++++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# def __init__(self, config): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# # gating +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++ +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# #@dwj +++++++# # 只遍历激活的专家,而非全部专家 +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# num_tokens = hidden_states_reshaped.shape[0] +++++++ +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++# flat_selected_experts = selected_experts.flatten() +++++++ +++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++# token_indices = broadcasted_token_indices.flatten() +++++++ +++++++# active_experts = ops.unique(flat_selected_experts) +++++++ +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++ +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# selected_token_indices = token_indices[mask] +++++++# selected_routing_weights = routing_weights.flatten()[mask] +++++++ +++++++# current_states = hidden_states_reshaped[selected_token_indices] +++++++ +++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++ +++++++# final_hidden_states = final_hidden_states.index_add( +++++++# dim=0, +++++++# index=selected_token_indices, +++++++# source=expert_output.to(hidden_states.dtype) +++++++# ) +++++++ +++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++ ++++++- # return attn_output, attn_weights, past_key_value +++++++# final_hidden_states = final_hidden_states + shared_expert_output +++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ +++++++ +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# """ +++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# # 门控网络 +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# # 专家列表 +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++# # 共享专家 +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_decode( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# """ +++++++# 【解码路径】针对 sequence_length=1 的极致优化。 +++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++++++# """ +++++++# batch_size, hidden_dim = hidden_states.shape +++++++ +++++++# expert_outputs_list = [ +++++++# ops.cat([ +++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++# ], dim=0) +++++++# for i in range(batch_size) +++++++# ] +++++++ +++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++++++# # shape: (batch_size, top_k, hidden_dim) +++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++ +++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++ +++++++# return moe_output.squeeze(1) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_prefill( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# """ +++++++# 【预填充路径】针对 sequence_length > 1 的优化。 +++++++# 按专家对 Token 进行分组,并进行批处理。 +++++++# """ +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens = hidden_states.shape[0] +++++++# flat_selected_experts = selected_experts.flatten() +++++++ +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++ +++++++# active_experts = ops.unique(flat_selected_experts) +++++++ +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++ +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# selected_token_indices = token_indices[mask] +++++++# selected_routing_weights = routing_weights.flatten()[mask] +++++++ +++++++# current_states = hidden_states[selected_token_indices] +++++++ +++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++ +++++++# moe_output = moe_output.index_add( +++++++# dim=0, +++++++# index=selected_token_indices, +++++++# source=expert_output.to(hidden_states.dtype) +++++++# ) +++++++# return moe_output +++++++ +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# """ +++++++# 顶层 forward 方法,作为智能分发器。 +++++++# """ +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++- # def forward( ++++++- # self, ++++++- # hidden_states: mindspore.Tensor, ++++++- # attention_mask: Optional[mindspore.Tensor] = None, ++++++- # position_ids: Optional[mindspore.Tensor] = None, ++++++- # past_key_value: Optional[Cache] = None, ++++++- # output_attentions: bool = False, ++++++- # use_cache: bool = False, ++++++- # cache_position: Optional[mindspore.Tensor] = None, ++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++- ++++++- # bsz, q_len, _ = hidden_states.shape ++++++- ++++++- # query_states = self.q_proj(hidden_states) ++++++- # key_states = self.k_proj(hidden_states) ++++++- # value_states = self.v_proj(hidden_states) ++++++- ++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- ++++++- # kv_seq_len = key_states.shape[-2] ++++++- # if past_key_value is not None: ++++++- # if self.layer_idx is None: ++++++- # raise ValueError("`layer_idx` must be specified for caching") ++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++- ++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++- ++++++- # if past_key_value is not None: ++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++- # key_states, value_states = past_key_value.update( ++++++- # key_states, value_states, self.layer_idx, cache_kwargs ++++++- # ) +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++# moe_output = None +++++++# # 在推理时,根据序列长度选择最优路径 +++++++# if not self.training: +++++++# if sequence_length == 1: +++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++++# else: +++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++++# else: +++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++++++# raise NotImplementedError("Training path is not implemented.") +++++++ +++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++++++ +++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++++++ +++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ +++++++ +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# """ +++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# # 门控网络 +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# # 专家列表 +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++# # 共享专家 +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_decode( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# batch_size, _ = hidden_states.shape +++++++# expert_outputs_list = [ +++++++# ops.cat([ +++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++# ], dim=0) +++++++# for i in range(batch_size) +++++++# ] +++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++# return moe_output.squeeze(1) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_prefill( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens = hidden_states.shape[0] +++++++# flat_selected_experts = selected_experts.flatten() +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++# active_experts = ops.unique(flat_selected_experts) +++++++ +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# selected_token_indices = token_indices[mask] +++++++# selected_routing_weights = routing_weights.flatten()[mask] +++++++# current_states = hidden_states[selected_token_indices] +++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++# moe_output = moe_output.index_add( +++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++++# ) +++++++# return moe_output +++++++ +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# """ +++++++# 顶层 forward 方法,作为智能分发器。 +++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++++++# """ +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ +++++++# # 1. 门控计算 (通用逻辑) +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++# # 2. 智能分发到最优 MoE 路径 +++++++# moe_output = None +++++++# if not self.training: +++++++# if sequence_length == 1: +++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++++# else: +++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++++# else: +++++++# raise NotImplementedError("Training path is not implemented.") +++++++ +++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++ +++++++# # 4. 合并 MoE 输出和共享专家输出 +++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++ +++++++# # 5. 恢复原始形状并返回 +++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ +++++++# prefill fastest +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# """ +++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# # 门控网络 +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# # 专家列表 +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++# # 共享专家 +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_dispatch( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# """ +++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++++++# """ +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens, _ = hidden_states.shape +++++++ +++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++++++# flat_selected_experts = selected_experts.flatten() +++++++# flat_routing_weights = routing_weights.flatten() ++++++ ++++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++- ++++++- # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++- # query_states = query_states / math.sqrt(self.head_dim) ++++++- # # <--- 修改结束 --- ++++++- ++++++- # fa_attention_mask = None ++++++- # if attention_mask is not None: ++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++- # fa_attention_mask = (mask_slice != 0) ++++++- ++++++- # input_dtype = query_states.dtype ++++++- ++++++- # attn_output = mindspore.ops.flash_attention_score( ++++++- # query=query_states, # 传入已经预先缩放过的 query ++++++- # key=key_states, ++++++- # value=value_states, ++++++- # head_num=self.num_heads, ++++++- # attn_mask=fa_attention_mask, ++++++- # keep_prob=1.0 - self.attention_dropout, ++++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++- # input_layout="BNSD", ++++++- # sparse_mode=0, ++++++- # inner_precise=1 # 仍然保持内部高精度计算 ++++++- # ) +++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++ ++++++- # attn_output = attn_output.to(input_dtype) ++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++- # attn_output = self.o_proj(attn_output) +++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++++++# active_experts = ops.unique(flat_selected_experts) +++++++ +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++ +++++++# # 找到所有分配给该专家的 token +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++ +++++++# # 使用 mask 选取对应的 token 和权重 +++++++# current_token_indices = token_indices[mask] +++++++# current_routing_weights = flat_routing_weights[mask] +++++++# current_hidden_states = hidden_states[current_token_indices] +++++++ +++++++# # 对这些 token 进行批处理 +++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++++ +++++++# # 使用 index_add 将结果精确地加回到对应位置 +++++++# moe_output = moe_output.index_add( +++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++++++# ) +++++++# return moe_output +++++++ +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# """ +++++++# 顶层 forward 方法,作为智能分发器。 +++++++# """ +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ +++++++# # 1. 门控计算 +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++# routing_weights = routing_weights.to(hidden_states.dtype) +++++++ +++++++# # 2. 调用统一的 MoE 计算内核 +++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++++++ ++++++- # attn_weights = None ++++++- # if output_attentions: ++++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++++# # 3. 统一处理共享专家 +++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++ +++++++# # 4. 合并输出 +++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++ +++++++# # 5. 恢复原始形状并返回 +++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ +++++++ +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# """ +++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++# 【最终高性能与高精度版】: +++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++++++# 3. 这样实现了速度和准确性的两全其美。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_decode( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# """ +++++++# 【解码路径】极致优化版:bmm + 高精度累加。 +++++++# """ +++++++# original_dtype = hidden_states.dtype +++++++# batch_size, _ = hidden_states.shape +++++++ +++++++# expert_outputs_list = [ +++++++# ops.cat([ +++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++# ], dim=0) +++++++# for i in range(batch_size) +++++++# ] +++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++ +++++++# # 在 float32 下执行 bmm,得到高精度结果 +++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++ +++++++# # 将高精度结果转换回原始数据类型 +++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++++++ +++++++# return moe_output +++++++ +++++++# @no_grad() +++++++# def _moe_infer_prefill( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# selected_experts: mindspore.Tensor, +++++++# routing_weights: mindspore.Tensor +++++++# ) -> mindspore.Tensor: +++++++# """ +++++++# 【预填充路径】与原始实现一致,结果精确。 +++++++# """ +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens, _ = hidden_states.shape +++++++# flat_selected_experts = selected_experts.flatten() +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++# active_experts = ops.unique(flat_selected_experts) +++++++ +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# selected_token_indices = token_indices[mask] +++++++# selected_routing_weights = routing_weights.flatten()[mask] +++++++# current_states = hidden_states[selected_token_indices] +++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++# moe_output = moe_output.index_add( +++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++++# ) +++++++# return moe_output +++++++ +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++ ++++++- # return attn_output, attn_weights, past_key_value +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++++++# # 如果模型主体是 float16,后续再转换 +++++++ +++++++# moe_output = None +++++++# if not self.training: +++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++++++# # _moe_infer_decode 内部会处理好类型转换 +++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++++++# if sequence_length == 1: +++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++++# else: +++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++++# else: +++++++# raise NotImplementedError("Training path is not implemented.") +++++++ +++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++ +++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ ++++++ ++++++-QWEN2MOE_ATTENTION_CLASSES = { ++++++- "eager": Qwen2MoeAttention, ++++++- "flash-attention": Qwen2MoeFlashAttention, ++++++-} +++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++# """ +++++++# 【融合版】一个混合专家模块,内置两种推理策略, +++++++# 由外部全局变量 `Long_Prompt` 控制: +++++++ +++++++# - if Long_Prompt is True: 【精度优先模式】 +++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++++++# 适用于处理长序列,避免误差累积。 +++++++ +++++++# - if Long_Prompt is False: 【速度优先模式】 +++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 +++++++# """ +++++++# def __init__(self, config: Qwen2MoeConfig): +++++++# super().__init__() +++++++# self.num_experts = config.num_experts +++++++# self.top_k = config.num_experts_per_tok +++++++# self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++# self.experts = nn.ModuleList( +++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++# ) +++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++# # --- 速度优先模式的辅助函数 --- +++++++# @no_grad() +++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++# original_dtype = hidden_states.dtype +++++++# batch_size, _ = hidden_states.shape +++++++# expert_outputs_list = [ +++++++# ops.cat([ +++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++# ], dim=0) +++++++# for i in range(batch_size) +++++++# ] +++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++# weights_fp32 = routing_weights.to(mindspore.float32) +++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++++# return moe_output_fp32.squeeze(1).to(original_dtype) +++++++ +++++++# @no_grad() +++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens, _ = hidden_states.shape +++++++# flat_selected_experts = selected_experts.flatten() +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++# active_experts = ops.unique(flat_selected_experts) +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# selected_token_indices = token_indices[mask] +++++++# selected_routing_weights = routing_weights.flatten()[mask] +++++++# current_states = hidden_states[selected_token_indices] +++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++# return moe_output +++++++ +++++++# # --- 精度优先模式的辅助函数 --- +++++++# @no_grad() +++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++# moe_output = ops.zeros_like(hidden_states) +++++++# num_tokens, _ = hidden_states.shape +++++++# flat_selected_experts = selected_experts.flatten() +++++++# flat_routing_weights = routing_weights.flatten() +++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++# active_experts = ops.unique(flat_selected_experts) +++++++# for expert_idx_tensor in active_experts: +++++++# expert_idx = expert_idx_tensor.item() +++++++# expert_layer = self.experts[expert_idx] +++++++# mask = (flat_selected_experts == expert_idx_tensor) +++++++# current_token_indices = token_indices[mask] +++++++# current_routing_weights = flat_routing_weights[mask] +++++++# current_hidden_states = hidden_states[current_token_indices] +++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++# return moe_output +++++++ +++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++# # 声明我们将要使用一个在模块外部定义的全局变量 +++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++++++# global Long_Prompt +++++++ +++++++# # 1. 门控计算 (所有模式通用) +++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++# router_logits = self.gate(hidden_states_reshaped) +++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++++# if self.norm_topk_prob: +++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++# moe_output = None +++++++# if not self.training: +++++++# # 根据 Long_Prompt 标志选择模式 +++++++# if Long_Prompt: +++++++# # --- 精度优先模式 --- +++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++# else: +++++++# # --- 速度优先模式 --- +++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++# if sequence_length == 1: +++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++# else: +++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++# else: +++++++# raise NotImplementedError("Training path is not implemented.") +++++++ +++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++ +++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++# return final_hidden_states, router_logits +++++++ +++++++class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ """ +++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++++++ 控制的顶级推理策略: ++++++ +++++++ - if Long_Prompt is True: 【精度优先模式】 +++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +++++++ 适用于需要严格可复现性的长序列任务。 ++++++ ++++++-class Qwen2MoeSparseMoeBlock(nn.Module): ++++++- def __init__(self, config): +++++++ - if Long_Prompt is False: 【速度优先模式】 +++++++ 采用业界最强的性能组合: +++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +++++++ """ +++++++ def __init__(self, config: Qwen2MoeConfig): ++++++ super().__init__() ++++++ self.num_experts = config.num_experts ++++++ self.top_k = config.num_experts_per_tok ++++++ self.norm_topk_prob = config.norm_topk_prob ++++++ ++++++- # gating ++++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++ self.experts = nn.ModuleList( ++++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++ ) ++++++- ++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++ ++++++- #@dwj ++++++- # 只遍历激活的专家,而非全部专家 ++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++- num_tokens = hidden_states_reshaped.shape[0] ++++++- ++++++- router_logits = self.gate(hidden_states_reshaped) ++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- ++++++- if self.norm_topk_prob: ++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++- flat_selected_experts = selected_experts.flatten() ++++++- ++++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++- token_indices = broadcasted_token_indices.flatten() ++++++- ++++++- active_experts = ops.unique(flat_selected_experts) ++++++- ++++++- for expert_idx_tensor in active_experts: ++++++- expert_idx = expert_idx_tensor.item() ++++++- expert_layer = self.experts[expert_idx] ++++++- ++++++- mask = (flat_selected_experts == expert_idx_tensor) ++++++- selected_token_indices = token_indices[mask] ++++++- selected_routing_weights = routing_weights.flatten()[mask] ++++++- ++++++- current_states = hidden_states_reshaped[selected_token_indices] ++++++- ++++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++- ++++++- final_hidden_states = final_hidden_states.index_add( ++++++- dim=0, ++++++- index=selected_token_indices, ++++++- source=expert_output.to(hidden_states.dtype) ++++++- ) ++++++- ++++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +++++++ @no_grad() +++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++ original_dtype = hidden_states.dtype +++++++ batch_size, _ = hidden_states.shape +++++++ expert_outputs_list = [ +++++++ ops.cat([ +++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++ ], dim=0) +++++++ for i in range(batch_size) +++++++ ] +++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++ weights_fp32 = routing_weights.to(mindspore.float32) +++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++++ return moe_output_fp32.squeeze(1).to(original_dtype) +++++++ +++++++ @no_grad() +++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++ num_tokens, _ = hidden_states.shape +++++++ flat_selected_experts = selected_experts.flatten() +++++++ sorted_expert_indices = flat_selected_experts.argsort() +++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++++ original_token_indices = sorted_expert_indices // self.top_k +++++++ moe_output = ops.zeros_like(hidden_states) +++++++ current_token_offset = 0 +++++++ for i in range(self.num_experts): +++++++ expert_token_count = tokens_per_expert[i] - current_token_offset +++++++ if expert_token_count == 0: +++++++ continue +++++++ end_offset = current_token_offset + expert_token_count +++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++++ expert_hidden_states = hidden_states[expert_original_token_indices] +++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++ current_token_offset += expert_token_count +++++++ return moe_output +++++++ +++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++++++ @no_grad() +++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++ moe_output = ops.zeros_like(hidden_states) +++++++ num_tokens, _ = hidden_states.shape +++++++ flat_selected_experts = selected_experts.flatten() +++++++ flat_routing_weights = routing_weights.flatten() +++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++ active_experts = ops.unique(flat_selected_experts) +++++++ for expert_idx_tensor in active_experts: +++++++ expert_idx = expert_idx_tensor.item() +++++++ expert_layer = self.experts[expert_idx] +++++++ mask = (flat_selected_experts == expert_idx_tensor) +++++++ current_token_indices = token_indices[mask] +++++++ current_routing_weights = flat_routing_weights[mask] +++++++ current_hidden_states = hidden_states[current_token_indices] +++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++ return moe_output ++++++ ++++++- final_hidden_states = final_hidden_states + shared_expert_output ++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++- ++++++- return final_hidden_states, router_logits +++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++ global Long_Prompt +++++++ +++++++ # 1. 门控计算 (所有模式通用) +++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++ router_logits = self.gate(hidden_states_reshaped) +++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++++ if self.norm_topk_prob: +++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++ moe_output = None +++++++ if Long_Prompt: +++++++ # --- 精度优先模式 (ACCURACY MODE) --- +++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ else: +++++++ # --- 速度优先模式 (SPEED MODE) --- +++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++ if sequence_length == 1: +++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ else: +++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ ++++++ +++++++ # 3. 共享专家计算与合并 (所有模式通用) +++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++ +++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++ +++++++ return final_hidden_states, router_logits ++++++ ++++++ class Qwen2MoeDecoderLayer(nn.Module): ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++++++ super().__init__() ++++++ self.hidden_size = config.hidden_size +++++++ +++++++ # if Long_Prompt: +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ # else: +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++ ++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ ++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++- ++++++ if (layer_idx not in config.mlp_only_layers) and ( ++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++++ ): ++++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ self._warmed_up = True ++++++ self.warmup_moe_model() ++++++ +++++++ +++++++ ++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++++ output_router_logits = ( ++++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits ++++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ router_logits=outputs.router_logits, ++++++ ) ++++++ +++++++ def generate(self, *args, **kwargs): +++++++ """ +++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++++++ """ +++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++++++ +++++++ input_ids = kwargs.get("input_ids") +++++++ if input_ids is None and args: +++++++ input_ids = args[0] +++++++ +++++++ if input_ids is not None: +++++++ prompt_length = input_ids.shape[1] +++++++ +++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +++++++ Long_Prompt = True +++++++ else: +++++++ Long_Prompt = False +++++++ +++++++ return super().generate(*args, **kwargs) +++++++ ++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation ++++++ def prepare_inputs_for_generation( ++++++ self, ++++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens ++++++ # Exception 1: when passing input_embeds, input_ids may be missing entries ++++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +++++++ ++++++ if past_key_values is not None: ++++++ if inputs_embeds is not None: # Exception 1 ++++++ if 0 not in input_ids.shape: ++++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ } ++++++ ) ++++++ return model_inputs +++++++ ++++++ # @lwx ++++++ # def _decode_one_tokens_logits( ++++++ # self, ++++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): ++++++ attentions=outputs.attentions, ++++++ ) ++++++ +++++++ ++++++ __all__ = [ ++++++ "Qwen2MoeForCausalLM", ++++++ "Qwen2MoeModel", ++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++++new file mode 100644 ++++++index 00000000..6dfb5b93 ++++++--- /dev/null +++++++++ b/patches/0001-20251104commit.patch ++++++@@ -0,0 +1,1272 @@ +++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++++From: Pinoeer-kingxi <13022943007@163.com> +++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++++Subject: [PATCH] 20251104commit +++++++ +++++++--- +++++++ mindnlp/transformers/cache_utils.py | 28 +- +++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++++ 3 files changed, 976 insertions(+), 87 deletions(-) +++++++ +++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++++index cadd2e04..02f8d4be 100644 +++++++--- a/mindnlp/transformers/cache_utils.py ++++++++++ b/mindnlp/transformers/cache_utils.py +++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++++ # k_out[:, :, cache_position] = key_states +++++++ # v_out[:, :, cache_position] = value_states +++++++- if ON_ORANGE_PI: +++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++- else: +++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++- ++++++++ # if ON_ORANGE_PI: ++++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++++ # else: ++++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++++++ if cache_position.ndim > 1: ++++++++ cache_position = cache_position.flatten() ++++++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++++++ cache_position = cache_position.int() ++++++++ ++++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++++++ k_out[:, :, cache_position] = key_states ++++++++ v_out[:, :, cache_position] = value_states ++++++++ +++++++ return k_out, v_out +++++++ +++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++index c695b944..d8303e45 100644 +++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++++ def rotate_half(x): +++++++ """Rotates half the hidden dims of the input.""" +++++++- x1 = x[..., : x.shape[-1] // 2] +++++++- x2 = x[..., x.shape[-1] // 2 :] ++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++++ # x2 = x[..., x.shape[-1] // 2 :] ++++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++ return ops.cat((-x2, x1), dim=-1) +++++++ +++++++ +++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++++ if self.training: +++++++ raise NotImplementedError("Training is not supported yet.") +++++++ else: +++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++- if self.config.n_shared_experts is not None: +++++++- y = y + self.shared_experts(identity) +++++++- return y ++++++++ # @lwx ++++++++ if orig_shape[1] == 1: ++++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++++++ y=y.view(*orig_shape) ++++++++ if self.config.n_shared_experts is not None: ++++++++ y = y + self.shared_experts(identity) ++++++++ return y ++++++++ else: ++++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++++++ if self.config.n_shared_experts is not None: ++++++++ y = y + self.shared_experts(identity) ++++++++ return y ++++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++++ # if self.config.n_shared_experts is not None: ++++++++ # y = y + self.shared_experts(identity) ++++++++ # return y ++++++++ ++++++++ @no_grad() ++++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++++ ++++++++ expert_cache = ops.zeros_like(x) ++++++++ for i in range(self.num_experts_per_tok): ++++++++ expert_id = flat_expert_indices[i].item() ++++++++ weight = flat_expert_weights[i].item() ++++++++ expert = self.experts[expert_id] ++++++++ expert_out = expert(x) ++++++++ expert_cache += expert_out * weight ++++++++ return expert_cache +++++++ +++++++ @no_grad() +++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++- # expert_cache = torch.zeros_like(x) +++++++- # idxs = flat_expert_indices.argsort() +++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++- # token_idxs = idxs // self.num_experts_per_tok +++++++- # for i, end_idx in enumerate(tokens_per_expert): +++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++- # if start_idx == end_idx: +++++++- # continue +++++++- # expert = self.experts[i] +++++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++++- # expert_tokens = x[exp_token_idx] +++++++- # expert_out = expert(expert_tokens) +++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++- # return expert_cache ++++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++ expert_cache = ops.zeros_like(x) +++++++ idxs = flat_expert_indices.argsort() +++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ token_idxs = idxs // self.num_experts_per_tok ++++++++ +++++++ for i, end_idx in enumerate(tokens_per_expert): +++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ if start_idx == end_idx: +++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++++ expert_out = expert(expert_tokens) +++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++++ +++++++ return expert_cache ++++++++ ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # # expert_cache = torch.zeros_like(x) ++++++++ # # idxs = flat_expert_indices.argsort() ++++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++++ # # if start_idx == end_idx: ++++++++ # # continue ++++++++ # # expert = self.experts[i] ++++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # # expert_tokens = x[exp_token_idx] ++++++++ # # expert_out = expert(expert_tokens) ++++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++++ # # return expert_cache ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # if start_idx == end_idx: ++++++++ # continue ++++++++ # expert = self.experts[i] ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = expert(expert_tokens) ++++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++++ ++++++++ # return expert_cache ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ ++++++++ # # 排序保证顺序一致 ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # # 找出有 token 的专家 ++++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++++ ++++++++ # for i in active_experts.tolist(): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # end_idx = tokens_per_expert[i] ++++++++ # if start_idx == end_idx: # 没有 token ++++++++ # continue ++++++++ ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = self.experts[i](expert_tokens) ++++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++++ ++++++++ # expert_cache = mindspore.mint.scatter_add( ++++++++ # expert_cache, ++++++++ # 0, ++++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++++ # expert_out ++++++++ # ) ++++++++ ++++++++ # return expert_cache ++++++++ ++++++++ +++++++ +++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++++ # """ +++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++ +++++++ # Initialize weights and apply final processing +++++++ self.post_init() ++++++++ self.warm_up = False ++++++++ ++++++++ def warmup_moe_model_deep(self): ++++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++++ test_texts = [ ++++++++ "warmup short", ++++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++++++ ] ++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++++ if tokenizer is None: ++++++++ from mindnlp.transformers import AutoTokenizer ++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++++ self._warmup_tokenizer = tokenizer ++++++++ ++++++++ for text in test_texts: ++++++++ inputs = tokenizer(text, return_tensors="ms") ++++++++ with mindspore._no_grad(): ++++++++ _ = self(**inputs, use_cache=False) ++++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++++ +++++++ def get_input_embeddings(self): +++++++ return self.model.embed_tokens +++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++ ```""" ++++++++ if not self.warm_up: ++++++++ self.warm_up = True ++++++++ self.warmup_moe_model_deep() ++++++++ +++++++ output_attentions = ( +++++++ output_attentions +++++++ if output_attentions is not None +++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++index 3cbf820e..d4c6b651 100644 +++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++@@ -18,7 +18,6 @@ +++++++ # See the License for the specific language governing permissions and +++++++ # limitations under the License. +++++++ """MindSpore Qwen2MoE model.""" +++++++- +++++++ import math +++++++ from typing import List, Optional, Tuple, Union +++++++ +++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++++ TokenClassifierOutput, +++++++ ) +++++++ from ...modeling_utils import PreTrainedModel ++++++++from ...generation import GenerationMixin +++++++ from ....utils import logging +++++++ from .configuration_qwen2_moe import Qwen2MoeConfig +++++++ +++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++++ self.variance_epsilon = eps +++++++ +++++++ def forward(self, hidden_states): ++++++++ # @dwj ++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++++ # @lwx ++++++++ # if not self.training : ++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++ input_dtype = hidden_states.dtype +++++++ hidden_states = hidden_states.to(mindspore.float32) +++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++++@@ -234,6 +239,8 @@ def rotate_half(x): +++++++ """Rotates half the hidden dims of the input.""" +++++++ x1 = x[..., : x.shape[-1] // 2] +++++++ x2 = x[..., x.shape[-1] // 2 :] ++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++ return ops.cat((-x2, x1), dim=-1) +++++++ +++++++ +++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++++ self.config = config +++++++ self.hidden_size = config.hidden_size +++++++ self.intermediate_size = intermediate_size ++++++++ +++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++++ self.act_fn = ACT2FN[config.hidden_act] +++++++ +++++++ def forward(self, x): +++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++- +++++++ ++++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++++ # @lwx ++++++++ # gate_up_output = self.gate_up_proj(x) ++++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++++++ # return self.down_proj(swiglu_output) ++++++++ ++++++++ # def forward(self, x): ++++++++ # gate_proj_out = self.gate_proj(x) ++++++++ # up_proj_out = self.up_proj(x) ++++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++++++ # return self.down_proj(swiglu_out) ++++++++ +++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++++ """ +++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++++ use_cache: bool = False, +++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ ++++++++ +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ query_states = self.q_proj(hidden_states) +++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ "with a layer index." +++++++ ) +++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ if isinstance(past_key_value, StaticCache): ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ else: ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++ if isinstance(past_key_value, StaticCache): ++++++++ kv_seq_len = key_states.shape[-2] +++++++ +++++++ # repeat k/v heads if n_kv_heads < n_heads +++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++- ++++++++ +++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++ +++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++++- raise ValueError( +++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++++- f" {attn_weights.shape}" +++++++- ) +++++++- +++++++- if attention_mask is not None: # no matter the length, we just slice it +++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++++++ if attention_mask is not None: ++++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++ attn_weights = attn_weights + causal_mask +++++++ +++++++ # upcast attention to fp32 +++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++ +++++++ attn_output = self.o_proj(attn_output) +++++++- ++++++++ # @lwx ++++++++ ++++++++ # max_seq_len = self.max_position_embeddings # 2048 ++++++++ ++++++++ # if attention_mask is not None: ++++++++ # # attention_mask: [B, 1, Sq, Sk] ++++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++++ ++++++++ # # pad 到 [max_seq_len, max_seq_len] ++++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++++ # global_attention_mask = padded_mask ++++++++ # else: ++++++++ # global_attention_mask = None ++++++++ ++++++++ ++++++++ # sparse_mode=3 ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # real_shift=None, ++++++++ # padding_mask=None, ++++++++ ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=global_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ # input_layout="BNSD", ++++++++ # pre_tokens=2147483647, ++++++++ # next_tokens=2147483647, ++++++++ # inner_precise=0, ++++++++ # drop_mask=None, ++++++++ # prefix=None, ++++++++ # actual_seq_qlen=None, ++++++++ # actual_seq_kvlen=None, ++++++++ # sparse_mode=sparse_mode, ++++++++ # ) +++++++ if not output_attentions: +++++++ attn_weights = None +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ ++++++++class Qwen2MoeFlashAttention(nn.Module): ++++++++ """ ++++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++++ ++++++++ 关键改动: ++++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++++ 直接传入原始的 key 和 value 张量效率更高。 ++++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++++ """ ++++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++++ super().__init__() ++++++++ self.config = config ++++++++ self.layer_idx = layer_idx ++++++++ self.hidden_size = config.hidden_size ++++++++ self.num_heads = config.num_attention_heads ++++++++ self.head_dim = self.hidden_size // self.num_heads ++++++++ self.num_key_value_heads = config.num_key_value_heads ++++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++ self.max_position_embeddings = config.max_position_embeddings ++++++++ self.rope_theta = config.rope_theta ++++++++ self.attention_dropout = config.attention_dropout ++++++++ ++++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++ raise ValueError( ++++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++++ ) ++++++++ ++++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++++ ++++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++++ self.head_dim, ++++++++ max_position_embeddings=self.max_position_embeddings, ++++++++ base=self.rope_theta, ++++++++ ) ++++++++ ++++++++ def forward( ++++++++ self, ++++++++ hidden_states: mindspore.Tensor, ++++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++++ past_key_value: Optional[Cache] = None, ++++++++ output_attentions: bool = False, ++++++++ use_cache: bool = False, ++++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # 1. 线性投射 Q, K, V ++++++++ query_states = self.q_proj(hidden_states) ++++++++ key_states = self.k_proj(hidden_states) ++++++++ value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # 3. RoPE 旋转位置编码 ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ if past_key_value is not None: ++++++++ if self.layer_idx is None: ++++++++ raise ValueError( ++++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++ "with a layer index." ++++++++ ) ++++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++++ if cache_position.shape[0] == 1: ++++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++++ kv_seq_len = past_seen_tokens + 1 ++++++++ else: ++++++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++++ else: ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # 4. KV 缓存更新 ++++++++ if past_key_value is not None: ++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ key_states, value_states = past_key_value.update( ++++++++ key_states, value_states, self.layer_idx, cache_kwargs ++++++++ ) ++++++++ ++++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++ if cache_position.shape[0] == 1: ++++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ ++++++++ # 5. [重要] 准备 Attention Mask ++++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++++ fa_attention_mask = None ++++++++ if attention_mask is not None: ++++++++ # 截取与当前key长度匹配的部分 ++++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++++ fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++++ input_dtype = query_states.dtype ++++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++++ query_states = query_states.to(mindspore.float16) ++++++++ key_states = key_states.to(mindspore.float16) ++++++++ value_states = value_states.to(mindspore.float16) ++++++++ ++++++++ # 6. [核心] 调用 flash_attention_score 算子 ++++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++++ attn_output = mindspore.ops.flash_attention_score( ++++++++ query=query_states, ++++++++ key=key_states, ++++++++ value=value_states, ++++++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++++++ attn_mask=fa_attention_mask, ++++++++ keep_prob=1.0 - self.attention_dropout, ++++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ input_layout="BNSD", ++++++++ sparse_mode=0 # 使用 defaultMask 模式 ++++++++ ) ++++++++ ++++++++ # 恢复原始数据类型 ++++++++ attn_output = attn_output.to(input_dtype) ++++++++ ++++++++ # 7. 调整输出形状 ++++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++++++ attn_weights = None ++++++++ if output_attentions: ++++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++ return attn_output, attn_weights, past_key_value ++++++++ ++++++++ # def forward( ++++++++ # self, ++++++++ # hidden_states: mindspore.Tensor, ++++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++++ # past_key_value: Optional[Cache] = None, ++++++++ # output_attentions: bool = False, ++++++++ # use_cache: bool = False, ++++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ # bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # # 1. 线性投射 Q, K, V ++++++++ # query_states = self.q_proj(hidden_states) ++++++++ # key_states = self.k_proj(hidden_states) ++++++++ # value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # # 3. RoPE 旋转位置编码 ++++++++ # kv_seq_len = key_states.shape[-2] ++++++++ # if past_key_value is not None: ++++++++ # if self.layer_idx is None: ++++++++ # raise ValueError( ++++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++ # "with a layer index." ++++++++ # ) ++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # # 4. KV 缓存更新 ++++++++ # if past_key_value is not None: ++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ # key_states, value_states = past_key_value.update( ++++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++++ # ) ++++++++ ++++++++ # # 5. 准备 Attention Mask ++++++++ # fa_attention_mask = None ++++++++ # if attention_mask is not None: ++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++++ # input_dtype = query_states.dtype ++++++++ ++++++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=fa_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ # input_layout="BNSD", ++++++++ # sparse_mode=0, ++++++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++++ # inner_precise=1 ++++++++ # ) ++++++++ ++++++++ # # 恢复原始数据类型 ++++++++ # attn_output = attn_output.to(input_dtype) ++++++++ ++++++++ # # 7. 调整输出形状 ++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ # attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # attn_weights = None ++++++++ # if output_attentions: ++++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++ # return attn_output, attn_weights, past_key_value ++++++++ ++++++++ # def forward( ++++++++ # self, ++++++++ # hidden_states: mindspore.Tensor, ++++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++++ # past_key_value: Optional[Cache] = None, ++++++++ # output_attentions: bool = False, ++++++++ # use_cache: bool = False, ++++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ # bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # query_states = self.q_proj(hidden_states) ++++++++ # key_states = self.k_proj(hidden_states) ++++++++ # value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # kv_seq_len = key_states.shape[-2] ++++++++ # if past_key_value is not None: ++++++++ # if self.layer_idx is None: ++++++++ # raise ValueError("`layer_idx` must be specified for caching") ++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # if past_key_value is not None: ++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ # key_states, value_states = past_key_value.update( ++++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++++ # ) ++++++++ ++++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++ ++++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++++ # query_states = query_states / math.sqrt(self.head_dim) ++++++++ # # <--- 修改结束 --- ++++++++ ++++++++ # fa_attention_mask = None ++++++++ # if attention_mask is not None: ++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # input_dtype = query_states.dtype ++++++++ ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, # 传入已经预先缩放过的 query ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=fa_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++++ # input_layout="BNSD", ++++++++ # sparse_mode=0, ++++++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++++++ # ) ++++++++ ++++++++ # attn_output = attn_output.to(input_dtype) ++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ # attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # attn_weights = None ++++++++ # if output_attentions: ++++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++++ ++++++++ # return attn_output, attn_weights, past_key_value ++++++++ +++++++ QWEN2MOE_ATTENTION_CLASSES = { +++++++ "eager": Qwen2MoeAttention, ++++++++ "flash-attention": Qwen2MoeFlashAttention, +++++++ } +++++++ +++++++ +++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ ++++++++ #@dwj ++++++++ # 只遍历激活的专家,而非全部专家 +++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- hidden_states = hidden_states.view(-1, hidden_dim) +++++++- # router_logits: (batch * sequence_length, n_experts) +++++++- router_logits = self.gate(hidden_states) +++++++- +++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- if self.norm_topk_prob: +++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- # we cast back to the input dtype +++++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++- final_hidden_states = ops.zeros( +++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++++- ) +++++++- +++++++- # One hot encode the selected experts to create an expert mask +++++++- # this will be used to easily index which expert is going to be sollicitated +++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++++- +++++++- # Loop over all available experts in the model and perform the computation on each expert +++++++- for expert_idx in range(self.num_experts): +++++++- expert_layer = self.experts[expert_idx] +++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++++- +++++++- # Index the correct hidden states and compute the expert hidden state for +++++++- # the current expert. We need to make sure to multiply the output hidden +++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++++- if 0 not in idx.shape: +++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++++- +++++++- # However `index_add_` only support torch tensors for indexing so we'll use +++++++- # the `top_x` tensor here. +++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++++- +++++++- shared_expert_output = self.shared_expert(hidden_states) +++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++++- +++++++- final_hidden_states = final_hidden_states + shared_expert_output ++++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++ num_tokens = hidden_states_reshaped.shape[0] ++++++++ ++++++++ router_logits = self.gate(hidden_states_reshaped) ++++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++ ++++++++ if self.norm_topk_prob: ++++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++++ flat_selected_experts = selected_experts.flatten() ++++++++ ++++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++++ token_indices = broadcasted_token_indices.flatten() ++++++++ ++++++++ active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++ for expert_idx_tensor in active_experts: ++++++++ expert_idx = expert_idx_tensor.item() ++++++++ expert_layer = self.experts[expert_idx] ++++++++ ++++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++++ selected_token_indices = token_indices[mask] ++++++++ selected_routing_weights = routing_weights.flatten()[mask] ++++++++ ++++++++ current_states = hidden_states_reshaped[selected_token_indices] ++++++++ ++++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++ ++++++++ final_hidden_states = final_hidden_states.index_add( ++++++++ dim=0, ++++++++ index=selected_token_indices, ++++++++ source=expert_output.to(hidden_states.dtype) ++++++++ ) ++++++++ ++++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++++ +++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++- return final_hidden_states, router_logits ++++++++ final_hidden_states = final_hidden_states + shared_expert_output ++++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++ return final_hidden_states, router_logits +++++++ +++++++ +++++++ class Qwen2MoeDecoderLayer(nn.Module): +++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++++ +++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ ++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++++ +++++++ if (layer_idx not in config.mlp_only_layers) and ( +++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++++ ): +++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++++ _skip_keys_device_placement = "past_key_values" +++++++ _supports_cache_class = True ++++++++#lwx ++++++++ # _supports_static_cache = True +++++++ +++++++ def _init_weights(self, module): +++++++ std = self.config.initializer_range +++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++++ return causal_mask +++++++ +++++++ +++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ _tied_weights_keys = ["lm_head.weight"] +++++++ +++++++ def __init__(self, config): +++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ self.num_experts_per_tok = config.num_experts_per_tok +++++++ # Initialize weights and apply final processing +++++++ self.post_init() ++++++++ # @lwx ++++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++++++ # self.generation_config.cache_implementation = "static" ++++++++ self._warmed_up = False ++++++++ ++++++++ def warmup_moe_model(self): ++++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++++++ test_texts = [ ++++++++ "warmup short", ++++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++++++ ] ++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++++ if tokenizer is None: ++++++++ from mindnlp.transformers import AutoTokenizer ++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++++ self._warmup_tokenizer = tokenizer ++++++++ ++++++++ for text in test_texts: ++++++++ inputs = tokenizer(text, return_tensors="ms") ++++++++ with mindspore._no_grad(): ++++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++++ +++++++ def get_input_embeddings(self): +++++++ return self.model.embed_tokens +++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++ ```""" ++++++++ if not self._warmed_up: ++++++++ self._warmed_up = True ++++++++ self.warmup_moe_model() +++++++ +++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++++ output_router_logits = ( +++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ } +++++++ ) +++++++ return model_inputs ++++++++# @lwx ++++++++ # def _decode_one_tokens_logits( ++++++++ # self, ++++++++ # cur_token: mindspore.Tensor, ++++++++ # input_pos: Optional[mindspore.Tensor], ++++++++ # cache_position: mindspore.Tensor, ++++++++ # past_key_values: StaticCache, ++++++++ # ) -> mindspore.Tensor: ++++++++ # """ ++++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++++++ ++++++++ # Args: ++++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++++++ # input_pos: 输入位置信息,可选 ++++++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++++++ ++++++++ # Returns: ++++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++++++ # """ ++++++++ # # 调用JIT编译的版本 ++++++++ # return self.get_decode_one_tokens_logits( ++++++++ # cur_token=cur_token, ++++++++ # input_pos=input_pos, ++++++++ # cache_position=cache_position, ++++++++ # past_key_values=past_key_values, ++++++++ # ) ++++++++ ++++++++ # @mindspore.jit(jit_level='O1') ++++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++++++ # """ ++++++++ # JIT编译的函数,用于高效的单token解码 ++++++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++++++ ++++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++++++ # """ ++++++++ # outputs = self.model.forward( ++++++++ # input_ids=cur_token, ++++++++ # position_ids=input_pos, ++++++++ # cache_position=cache_position, ++++++++ # past_key_values=past_key_values, ++++++++ # use_cache=True, ++++++++ # return_dict=False, ++++++++ # ) ++++++++ ++++++++ # hidden_states = outputs[0] ++++++++ # logits = self.lm_head.forward(hidden_states) ++++++++ # logits = logits.float() ++++++++ ++++++++ # return logits[:, -1, :] ++++++++ ++++++++ # def _sample( ++++++++ # self, ++++++++ # input_ids: mindspore.Tensor, ++++++++ # logits_processor, ++++++++ # stopping_criteria, ++++++++ # generation_config, ++++++++ # synced_devices: bool, ++++++++ # streamer=None, ++++++++ # logits_warper=None, ++++++++ # **model_kwargs, ++++++++ # ): ++++++++ # """ ++++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++++++ # """ ++++++++ # from ...generation.logits_process import LogitsProcessorList ++++++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++++++ # from mindnlp.core import nn, ops, no_grad ++++++++ # import numpy as np ++++++++ ++++++++ # # 检查是否使用 StaticCache ++++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++++++ # # 否则,直接调用父类方法 ++++++++ # past_key_values = model_kwargs.get("past_key_values") ++++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++++++ ++++++++ # if not isinstance(past_key_values, StaticCache): ++++++++ # # 不使用 StaticCache,直接调用父类方法 ++++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++++++ # return super()._sample( ++++++++ # input_ids=input_ids, ++++++++ # logits_processor=logits_processor, ++++++++ # stopping_criteria=stopping_criteria, ++++++++ # generation_config=generation_config, ++++++++ # synced_devices=synced_devices, ++++++++ # streamer=streamer, ++++++++ # logits_warper=logits_warper, ++++++++ # **model_kwargs, ++++++++ # ) ++++++++ ++++++++ # # 使用 StaticCache,进入自定义循环 ++++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++++++ # pad_token_id = generation_config._pad_token_tensor ++++++++ # output_attentions = generation_config.output_attentions ++++++++ # output_hidden_states = generation_config.output_hidden_states ++++++++ # output_scores = generation_config.output_scores ++++++++ # output_logits = generation_config.output_logits ++++++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++++++ # max_length = generation_config.max_length ++++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++++++ # do_sample = generation_config.do_sample ++++++++ ++++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++++++ # raise ValueError( ++++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++++++ # f"{logits_warper})." ++++++++ # ) ++++++++ ++++++++ # # init attention / hidden states / scores tuples ++++++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++++++ ++++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++++++ # encoder_hidden_states = ( ++++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++++++ # ) ++++++++ ++++++++ # # keep track of which sequences are already finished ++++++++ # batch_size, cur_len = input_ids.shape ++++++++ # this_peer_finished = False ++++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++++++ ++++++++ # time_record = [] ++++++++ # from ....utils.testing_utils import parse_flag_from_env ++++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++++++ ++++++++ # while self._has_unfinished_sequences( ++++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++++++ # ): ++++++++ # if _record_time: ++++++++ # import time as time_module ++++++++ # infer_start = time_module.time() ++++++++ ++++++++ # # prepare model inputs ++++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++++++ ++++++++ # # prepare variable output controls ++++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++++++ ++++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++++++ # cur_cache_position = model_inputs.get("cache_position") ++++++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++++++ # cur_input_ids = model_inputs.get("input_ids") ++++++++ ++++++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++++++ # cur_cache_position is not None and ++++++++ # len(cur_cache_position.shape) > 0 and ++++++++ # cur_cache_position.shape[0] == 1 and ++++++++ # cur_input_ids is not None and ++++++++ # cur_input_ids.shape[1] == 1): ++++++++ # # 使用 JIT 优化的单 token 解码 ++++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++++++ # if not hasattr(self, '_jit_used'): ++++++++ # self._jit_used = False ++++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++++++ ++++++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++++++ # cur_token=cur_input_ids, ++++++++ # input_pos=model_inputs.get("position_ids"), ++++++++ # cache_position=cur_cache_position, ++++++++ # past_key_values=cur_past_key_values, ++++++++ # ) ++++++++ ++++++++ # # 标记已使用JIT(用于后续判断) ++++++++ # if not self._jit_used: ++++++++ # self._jit_used = True ++++++++ ++++++++ # # 构造兼容的输出对象 ++++++++ # class JitOptimizedOutput: ++++++++ # def __init__(self, logits, config): ++++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++++++ # self.config = config ++++++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++++++ # self.attentions = None if not config.is_encoder_decoder else None ++++++++ # self.cross_attentions = None ++++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++++++ ++++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++++++ # else: ++++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++++++ # outputs = self(**model_inputs, return_dict=True) ++++++++ ++++++++ # if synced_devices and this_peer_finished: ++++++++ # continue ++++++++ ++++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++++++ # next_token_logits = outputs.logits[:, -1, :] ++++++++ ++++++++ # # pre-process distribution ++++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++++++ # if do_sample: ++++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++++++ ++++++++ # # Store scores, attentions and hidden_states when required ++++++++ # if return_dict_in_generate: ++++++++ # if output_scores: ++++++++ # scores += (next_token_scores,) ++++++++ # if output_logits: ++++++++ # raw_logits += (next_token_logits,) ++++++++ # if output_attentions: ++++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++++++ # if self.config.is_encoder_decoder: ++++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++++++ ++++++++ # if output_hidden_states: ++++++++ # hidden = ( ++++++++ # outputs.decoder_hidden_states ++++++++ # if self.config.is_encoder_decoder ++++++++ # else outputs.hidden_states ++++++++ # ) ++++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++++++ ++++++++ # # token selection ++++++++ # if do_sample: ++++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++++++ # else: ++++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++++++ ++++++++ # # finished sentences should have their next token be a padding token ++++++++ # if has_eos_stopping_criteria: ++++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++++++ ++++++++ # # update generated ids, model inputs, and length for next step ++++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++++++ # if streamer is not None: ++++++++ # streamer.put(next_tokens) ++++++++ ++++++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++++++ # outputs, ++++++++ # model_kwargs, ++++++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++++++ # ) ++++++++ ++++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++++++ # cur_len += 1 ++++++++ ++++++++ # if _record_time: ++++++++ # import time as time_module ++++++++ # infer_stop = time_module.time() ++++++++ # time_record.append(infer_stop - infer_start) ++++++++ ++++++++ # del outputs ++++++++ ++++++++ # average_infer_time = None ++++++++ # if time_record: ++++++++ # if len(time_record) > 1: ++++++++ # time_record.pop(0) ++++++++ # average_infer_time = sum(time_record) / len(time_record) ++++++++ # print(f'average inference time is: {average_infer_time}') ++++++++ # print(f'inference time record: {time_record}') ++++++++ ++++++++ # if streamer is not None: ++++++++ # streamer.end() ++++++++ ++++++++ # # 简单判断:打印是否使用了JIT路径 ++++++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++++++ # else: ++++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++++++ ++++++++ # if return_dict_in_generate: ++++++++ # if self.config.is_encoder_decoder: ++++++++ # return GenerateEncoderDecoderOutput( ++++++++ # sequences=input_ids, ++++++++ # scores=scores, ++++++++ # logits=raw_logits, ++++++++ # encoder_attentions=encoder_attentions, ++++++++ # encoder_hidden_states=encoder_hidden_states, ++++++++ # decoder_attentions=decoder_attentions, ++++++++ # cross_attentions=cross_attentions, ++++++++ # decoder_hidden_states=decoder_hidden_states, ++++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++++ # average_infer_time=average_infer_time ++++++++ # ) ++++++++ # else: ++++++++ # return GenerateDecoderOnlyOutput( ++++++++ # sequences=input_ids, ++++++++ # scores=scores, ++++++++ # logits=raw_logits, ++++++++ # attentions=decoder_attentions, ++++++++ # hidden_states=decoder_hidden_states, ++++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++++ # average_infer_time=average_infer_time ++++++++ # ) ++++++++ # else: ++++++++ # return input_ids ++++++++ ++++++++ # def _prepare_cache_for_generation( ++++++++ # self, ++++++++ # generation_config, ++++++++ # model_kwargs, ++++++++ # assistant_model, ++++++++ # batch_size, ++++++++ # max_cache_length, ++++++++ # ): ++++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++++++ # generation_config.cache_implementation = "static" ++++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++++++ ++++++++ # if generation_config.cache_implementation == "static": ++++++++ # base_required_from_max_length = generation_config.max_length + 1 ++++++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++++++ # min_cache_size = 50 ++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++++++ # else: ++++++++ # max_cache_length = max(base_required, min_cache_size) ++++++++ ++++++++ # original_max_cache_length = max_cache_length ++++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++++++ # print(f" - final max_cache_length: {max_cache_length}") ++++++++ ++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++++ # if max_cache_length > self.config.max_position_embeddings: ++++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++++++ ++++++++ # result = super()._prepare_cache_for_generation( ++++++++ # generation_config=generation_config, ++++++++ # model_kwargs=model_kwargs, ++++++++ # assistant_model=assistant_model, ++++++++ # batch_size=batch_size, ++++++++ # max_cache_length=max_cache_length, ++++++++ # ) ++++++++ ++++++++ # if generation_config.cache_implementation == "static": ++++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++++++ # created_cache = model_kwargs.get(cache_name) ++++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++++++ # if created_cache.max_cache_len < generation_config.max_length: ++++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++++++ ++++++++ # return result ++++++++ ++++++++ ++++++++ +++++++ +++++++ +++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++++-- +++++++2.27.0 +++++++ ++++++-- ++++++2.27.0 ++++++ +++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++++new file mode 100644 +++++index 00000000..966529e4 +++++--- /dev/null ++++++++ b/patches/0003-20261106secondcommit.patch +++++@@ -0,0 +1,2769 @@ ++++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++Date: Thu, 6 Nov 2025 14:54:37 +0800 ++++++Subject: [PATCH 3/3] 20261106secondcommit ++++++ ++++++--- ++++++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- ++++++ patches/0001-20251104commit.patch | 1272 ----------------- ++++++ 3 files changed, 528 insertions(+), 2032 deletions(-) ++++++ delete mode 100644 patches/0001-20251104commit.patch ++++++ ++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++index 73773c22..2f9192bf 100644 ++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) ++++++ ++++++ _CONFIG_FOR_DOC = "DeepseekConfig" ++++++ +++++++_attn_mask_cache = {} +++++++ +++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +++++++ q_len = batch_and_seq[1] +++++++ kv_len = batch_and_seq[1] + past_key_values_length +++++++ key = (batch_and_seq[0], q_len, kv_len) +++++++ +++++++ if key in _attn_mask_cache: +++++++ return _attn_mask_cache[key] +++++++ +++++++ mask = _prepare_4d_causal_attention_mask( +++++++ attention_mask, +++++++ batch_and_seq, +++++++ inputs_embeds, +++++++ past_key_values_length, +++++++ ) +++++++ _attn_mask_cache[key] = mask +++++++ return mask ++++++ ++++++ def _get_unpad_data(attention_mask): ++++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) ++++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): ++++++ return final_output ++++++ ++++++ ++++++- @no_grad() ++++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++- expert_cache = ops.zeros_like(x) ++++++- idxs = flat_expert_indices.argsort() ++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++- token_idxs = idxs // self.num_experts_per_tok ++++++- ++++++- for i, end_idx in enumerate(tokens_per_expert): ++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++- if start_idx == end_idx: ++++++- continue ++++++- expert = self.experts[i] ++++++- exp_token_idx = token_idxs[start_idx:end_idx] ++++++- expert_tokens = x[exp_token_idx] ++++++- expert_out = expert(expert_tokens) ++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++- ++++++- return expert_cache ++++++- ++++++ # @no_grad() ++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++- # # expert_cache = torch.zeros_like(x) ++++++- # # idxs = flat_expert_indices.argsort() ++++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++- # # token_idxs = idxs // self.num_experts_per_tok ++++++- # # for i, end_idx in enumerate(tokens_per_expert): ++++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++- # # if start_idx == end_idx: ++++++- # # continue ++++++- # # expert = self.experts[i] ++++++- # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++- # # expert_tokens = x[exp_token_idx] ++++++- # # expert_out = expert(expert_tokens) ++++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++- # # return expert_cache +++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++ # expert_cache = ops.zeros_like(x) ++++++ # idxs = flat_expert_indices.argsort() ++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): ++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++ ++++++ # return expert_cache ++++++- # @no_grad() ++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++- # expert_cache = ops.zeros_like(x) +++++++ +++++++ @no_grad() +++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++ """ +++++++ 优化版 MoE prefill: +++++++ - 批量张量化处理同一个 expert 的所有 token +++++++ - 跳过无 token 的专家 +++++++ - 保持结果完全一致 +++++++ """ +++++++ # 初始化输出缓存 +++++++ expert_cache = ops.zeros_like(x) ++++++ ++++++- # # 排序保证顺序一致 ++++++- # idxs = flat_expert_indices.argsort() ++++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++- # token_idxs = idxs // self.num_experts_per_tok +++++++ # 排序(确保 scatter_add 位置对应原逻辑) +++++++ idxs = flat_expert_indices.argsort() +++++++ sorted_expert_indices = flat_expert_indices[idxs] +++++++ sorted_token_indices = idxs // self.num_experts_per_tok ++++++ ++++++- # # 找出有 token 的专家 ++++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++ # 每个 expert 的 token 数 +++++++ tokens_per_expert = sorted_expert_indices.bincount() ++++++ ++++++- # for i in active_experts.tolist(): ++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++- # end_idx = tokens_per_expert[i] ++++++- # if start_idx == end_idx: # 没有 token ++++++- # continue +++++++ # 找出有 token 的专家 +++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++++++ ++++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++++- # expert_tokens = x[exp_token_idx] ++++++- # expert_out = self.experts[i](expert_tokens) ++++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++ for expert_id in active_experts.tolist(): +++++++ # 取该 expert 对应的排序后 token 区间 +++++++ start = (tokens_per_expert[:expert_id]).sum().item() +++++++ end = start + tokens_per_expert[expert_id].item() ++++++ ++++++- # expert_cache = mindspore.mint.scatter_add( ++++++- # expert_cache, ++++++- # 0, ++++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++- # expert_out ++++++- # ) +++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +++++++ expert_tokens = x[token_idx] # 取输入向量 ++++++ ++++++- # return expert_cache +++++++ # 执行专家 MLP +++++++ expert_out = self.experts[expert_id](expert_tokens) +++++++ +++++++ # 按权重缩放 +++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +++++++ +++++++ # 回写到缓存(等价 scatter_add) +++++++ expert_cache = mindspore.mint.scatter_add( +++++++ expert_cache, +++++++ 0, +++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++ scaled_out +++++++ ) +++++++ +++++++ return expert_cache +++++++ +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # # expert_cache = torch.zeros_like(x) +++++++ # # idxs = flat_expert_indices.argsort() +++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++ # # token_idxs = idxs // self.num_experts_per_tok +++++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++ # # if start_idx == end_idx: +++++++ # # continue +++++++ # # expert = self.experts[i] +++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # # expert_tokens = x[exp_token_idx] +++++++ # # expert_out = expert(expert_tokens) +++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++ # # return expert_cache +++++++ # expert_cache = ops.zeros_like(x) +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # for i, end_idx in enumerate(tokens_per_expert): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # if start_idx == end_idx: +++++++ # continue +++++++ # expert = self.experts[i] +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = expert(expert_tokens) +++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ +++++++ # return expert_cache +++++++ # @no_grad() +++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++ # expert_cache = ops.zeros_like(x) +++++++ +++++++ # # 排序保证顺序一致 +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++ +++++++ # # 找出有 token 的专家 +++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++ +++++++ # for i in active_experts.tolist(): +++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ # end_idx = tokens_per_expert[i] +++++++ # if start_idx == end_idx: # 没有 token +++++++ # continue +++++++ +++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++ # expert_tokens = x[exp_token_idx] +++++++ # expert_out = self.experts[i](expert_tokens) +++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++ +++++++ # expert_cache = mindspore.mint.scatter_add( +++++++ # expert_cache, +++++++ # 0, +++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++ # expert_out +++++++ # ) +++++++ +++++++ # return expert_cache ++++++ ++++++ ++++++ ++++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++- ++++++ # class DeepseekFlashAttention(nn.Module): ++++++ # """ ++++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ +++++++ ++++++ Deepseek_ATTENTION_CLASSES = { ++++++ "eager": DeepseekAttention, ++++++ "flash-attention": DeepseekFlashAttention, ++++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): ++++++ ) ++++++ else: ++++++ # 4d mask is passed through the layers ++++++- attention_mask = _prepare_4d_causal_attention_mask( +++++++ # attention_mask = _prepare_4d_causal_attention_mask( +++++++ # attention_mask, +++++++ # (batch_size, seq_length), +++++++ # inputs_embeds, +++++++ # past_key_values_length, +++++++ # ) +++++++ #@dwj +++++++ attention_mask = get_cached_causal_mask( ++++++ attention_mask, ++++++ (batch_size, seq_length), ++++++ inputs_embeds, ++++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ # Initialize weights and apply final processing ++++++ self.post_init() ++++++ self.warm_up = False +++++++ #@dwj +++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +++++++ self.num_layers, +++++++ self.num_attention_heads, +++++++ self.head_dim, +++++++ batch_size=1, +++++++ max_length=self.max_length, +++++++ dtype=mindspore.float16 +++++++ ) +++++++ +++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +++++++ key_cache = [] +++++++ value_cache = [] +++++++ for _ in range(num_layers): +++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +++++++ key_cache.append(k) +++++++ value_cache.append(v) +++++++ return key_cache, value_cache +++++++ ++++++ ++++++ def warmup_moe_model_deep(self): ++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++index bced285c..ebd7782e 100644 ++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) ++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" ++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" ++++++ ++++++-Long_Prompt = False ++++++-PROMPT_LENGTH_THRESHOLD = 128 +++++++Long_Prompt = 1 +++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 +++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +++++++ +++++++_causal_mask_cache = {} +++++++ +++++++def get_cached_causal_mask_with_cache_position( +++++++ attention_mask: mindspore.Tensor, +++++++ sequence_length: int, +++++++ target_length: int, +++++++ dtype: mindspore.dtype, +++++++ min_dtype: float, +++++++ cache_position: mindspore.Tensor, +++++++ batch_size: int, +++++++): +++++++ """ +++++++ 带缓存的 causal mask 构造函数 +++++++ """ +++++++ # q_len 是当前 query 长度 +++++++ q_len = sequence_length +++++++ # kv_len 是 target_length +++++++ kv_len = target_length +++++++ +++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +++++++ +++++++ if key in _causal_mask_cache: +++++++ return _causal_mask_cache[key] +++++++ +++++++ # 调用原来的 mask 构造逻辑 +++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++++ attention_mask, +++++++ sequence_length=sequence_length, +++++++ target_length=target_length, +++++++ dtype=dtype, +++++++ min_dtype=min_dtype, +++++++ cache_position=cache_position, +++++++ batch_size=batch_size, +++++++ ) +++++++ # 缓存结果 +++++++ _causal_mask_cache[key] = causal_mask +++++++ return causal_mask ++++++ ++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position ++++++ def _prepare_4d_causal_attention_mask_with_cache_position( ++++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++++ ++++++ ++++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +++++++# class Qwen2MoeAttention(nn.Module): +++++++# """ +++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++++++# and "Generating Long Sequences with Sparse Transformers". +++++++# """ +++++++ +++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++# super().__init__() +++++++# self.config = config +++++++# self.layer_idx = layer_idx +++++++# if layer_idx is None: +++++++# logger.warning_once( +++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++++# "when creating this class." +++++++# ) +++++++ +++++++# self.hidden_size = config.hidden_size +++++++# self.num_heads = config.num_attention_heads +++++++# self.head_dim = self.hidden_size // self.num_heads +++++++# self.num_key_value_heads = config.num_key_value_heads +++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++# self.max_position_embeddings = config.max_position_embeddings +++++++# self.rope_theta = config.rope_theta +++++++# self.is_causal = True +++++++# self.attention_dropout = config.attention_dropout +++++++ +++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +++++++# raise ValueError( +++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +++++++# f" and `num_heads`: {self.num_heads})." +++++++# ) +++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++ +++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++# self.head_dim, +++++++# max_position_embeddings=self.max_position_embeddings, +++++++# base=self.rope_theta, +++++++# ) +++++++ +++++++# def forward( +++++++# self, +++++++# hidden_states: mindspore.Tensor, +++++++# attention_mask: Optional[mindspore.Tensor] = None, +++++++# position_ids: Optional[mindspore.Tensor] = None, +++++++# past_key_value: Optional[Cache] = None, +++++++# output_attentions: bool = False, +++++++# use_cache: bool = False, +++++++# cache_position: Optional[mindspore.Tensor] = None, +++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++ +++++++ +++++++ +++++++# bsz, q_len, _ = hidden_states.shape +++++++ +++++++# query_states = self.q_proj(hidden_states) +++++++# key_states = self.k_proj(hidden_states) +++++++# value_states = self.v_proj(hidden_states) +++++++ +++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++ +++++++# kv_seq_len = key_states.shape[-2] +++++++# if past_key_value is not None: +++++++# if self.layer_idx is None: +++++++# raise ValueError( +++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++# "with a layer index." +++++++# ) +++++++# if isinstance(past_key_value, StaticCache): +++++++# kv_seq_len = key_states.shape[-2] +++++++# else: +++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++# if past_key_value is not None: +++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++# if isinstance(past_key_value, StaticCache): +++++++# kv_seq_len = key_states.shape[-2] +++++++ +++++++# # repeat k/v heads if n_kv_heads < n_heads +++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++ +++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++ +++++++# if attention_mask is not None: +++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++# attn_weights = attn_weights + causal_mask +++++++ +++++++# # upcast attention to fp32 +++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++++# attn_output = ops.matmul(attn_weights, value_states) +++++++ +++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++++# raise ValueError( +++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++++++# f" {attn_output.shape}" +++++++# ) +++++++ +++++++# attn_output = ops.transpose(attn_output, 1, 2) +++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++ +++++++# attn_output = self.o_proj(attn_output) +++++++# # @lwx +++++++ +++++++# # max_seq_len = self.max_position_embeddings # 2048 +++++++ +++++++# # if attention_mask is not None: +++++++# # # attention_mask: [B, 1, Sq, Sk] +++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++ +++++++# # # pad 到 [max_seq_len, max_seq_len] +++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++# # global_attention_mask = padded_mask +++++++# # else: +++++++# # global_attention_mask = None +++++++ +++++++ +++++++# # sparse_mode=3 +++++++# # attn_output = mindspore.ops.flash_attention_score( +++++++# # query=query_states, +++++++# # key=key_states, +++++++# # value=value_states, +++++++# # real_shift=None, +++++++# # padding_mask=None, +++++++ +++++++# # head_num=self.num_heads, +++++++# # attn_mask=global_attention_mask, +++++++# # keep_prob=1.0 - self.attention_dropout, +++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++# # input_layout="BNSD", +++++++# # pre_tokens=2147483647, +++++++# # next_tokens=2147483647, +++++++# # inner_precise=0, +++++++# # drop_mask=None, +++++++# # prefix=None, +++++++# # actual_seq_qlen=None, +++++++# # actual_seq_kvlen=None, +++++++# # sparse_mode=sparse_mode, +++++++# # ) +++++++# if not output_attentions: +++++++# attn_weights = None +++++++ +++++++# return attn_output, attn_weights, past_key_value +++++++ ++++++ class Qwen2MoeAttention(nn.Module): ++++++ """ ++++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++++++- and "Generating Long Sequences with Sparse Transformers". ++++++- """ +++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 ++++++ +++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +++++++ +++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +++++++ """ ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++ super().__init__() ++++++ self.config = config ++++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): ++++++ if layer_idx is None: ++++++ logger.warning_once( ++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++ "when creating this class." ++++++ ) ++++++ ++++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): ++++++ use_cache: bool = False, ++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++- ++++++ ++++++- +++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- ++++++ bsz, q_len, _ = hidden_states.shape ++++++ ++++++ query_states = self.q_proj(hidden_states) ++++++ key_states = self.k_proj(hidden_states) ++++++ value_states = self.v_proj(hidden_states) ++++++ ++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++- +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ ++++++ kv_seq_len = key_states.shape[-2] ++++++ if past_key_value is not None: ++++++- if self.layer_idx is None: ++++++- raise ValueError( ++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++- "with a layer index." ++++++- ) ++++++- if isinstance(past_key_value, StaticCache): ++++++- kv_seq_len = key_states.shape[-2] ++++++- else: ++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ ++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++ ++++++ if past_key_value is not None: ++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++ +++++++ # --- 2. 动态调度核心注意力计算 --- +++++++ global Long_Prompt +++++++ if Long_Prompt >= 1: +++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +++++++ fa_attention_mask = None +++++++ if attention_mask is not None: +++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++ fa_attention_mask = (mask_slice != 0) +++++++ +++++++ attn_output = mindspore.ops.flash_attention_score( +++++++ query=query_states, +++++++ key=key_states, +++++++ value=value_states, +++++++ head_num=self.num_heads, +++++++ attn_mask=fa_attention_mask, +++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ input_layout="BNSD", +++++++ sparse_mode=0, +++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +++++++ ) ++++++ ++++++- if isinstance(past_key_value, StaticCache): ++++++- kv_seq_len = key_states.shape[-2] +++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = self.o_proj(attn_output) +++++++ attn_weights = None +++++++ if output_attentions: +++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") ++++++ ++++++- # repeat k/v heads if n_kv_heads < n_heads ++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++- ++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++ else: +++++++ # --- Eager Attention 路径 (用于短序列和解码) --- +++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++ +++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++ ++++++- if attention_mask is not None: ++++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++- attn_weights = attn_weights + causal_mask +++++++ if attention_mask is not None: +++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++ attn_weights = attn_weights + causal_mask ++++++ ++++++- # upcast attention to fp32 ++++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++++- attn_output = ops.matmul(attn_weights, value_states) +++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++++ attn_output = ops.matmul(attn_weights, value_states) ++++++ ++++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++++- raise ValueError( ++++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++++++- f" {attn_output.shape}" ++++++- ) +++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++++ raise ValueError( +++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +++++++ ) ++++++ ++++++- attn_output = ops.transpose(attn_output, 1, 2) ++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = ops.transpose(attn_output, 1, 2) +++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = self.o_proj(attn_output) ++++++ ++++++- attn_output = self.o_proj(attn_output) ++++++- # @lwx +++++++ if not output_attentions: +++++++ attn_weights = None ++++++ ++++++- # max_seq_len = self.max_position_embeddings # 2048 ++++++- ++++++- # if attention_mask is not None: ++++++- # # attention_mask: [B, 1, Sq, Sk] ++++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++- ++++++- # # pad 到 [max_seq_len, max_seq_len] ++++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++- # global_attention_mask = padded_mask ++++++- # else: ++++++- # global_attention_mask = None ++++++- ++++++- ++++++- # sparse_mode=3 ++++++- # attn_output = mindspore.ops.flash_attention_score( ++++++- # query=query_states, ++++++- # key=key_states, ++++++- # value=value_states, ++++++- # real_shift=None, ++++++- # padding_mask=None, ++++++- ++++++- # head_num=self.num_heads, ++++++- # attn_mask=global_attention_mask, ++++++- # keep_prob=1.0 - self.attention_dropout, ++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++- # input_layout="BNSD", ++++++- # pre_tokens=2147483647, ++++++- # next_tokens=2147483647, ++++++- # inner_precise=0, ++++++- # drop_mask=None, ++++++- # prefix=None, ++++++- # actual_seq_qlen=None, ++++++- # actual_seq_kvlen=None, ++++++- # sparse_mode=sparse_mode, ++++++- # ) ++++++- if not output_attentions: ++++++- attn_weights = None ++++++- ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++- ++++++ # class Qwen2MoeFlashAttention(nn.Module): ++++++ # """ ++++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { ++++++ # return final_hidden_states, router_logits ++++++ ++++++ ++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++-# """ ++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 ++++++-# """ ++++++-# def __init__(self, config: Qwen2MoeConfig): ++++++-# super().__init__() ++++++-# self.num_experts = config.num_experts ++++++-# self.top_k = config.num_experts_per_tok ++++++-# self.norm_topk_prob = config.norm_topk_prob ++++++- ++++++-# # 门控网络 ++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++-# # 专家列表 ++++++-# self.experts = nn.ModuleList( ++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++-# ) ++++++-# # 共享专家 ++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_decode( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# """ ++++++-# 【解码路径】针对 sequence_length=1 的极致优化。 ++++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++++++-# """ ++++++-# batch_size, hidden_dim = hidden_states.shape ++++++- ++++++-# expert_outputs_list = [ ++++++-# ops.cat([ ++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++-# ], dim=0) ++++++-# for i in range(batch_size) ++++++-# ] ++++++- ++++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++++++-# # shape: (batch_size, top_k, hidden_dim) ++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++- ++++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++- ++++++-# return moe_output.squeeze(1) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_prefill( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# """ ++++++-# 【预填充路径】针对 sequence_length > 1 的优化。 ++++++-# 按专家对 Token 进行分组,并进行批处理。 ++++++-# """ ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens = hidden_states.shape[0] ++++++-# flat_selected_experts = selected_experts.flatten() ++++++- ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++- ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++- ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++- ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++-# selected_token_indices = token_indices[mask] ++++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++++- ++++++-# current_states = hidden_states[selected_token_indices] ++++++- ++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++- ++++++-# moe_output = moe_output.index_add( ++++++-# dim=0, ++++++-# index=selected_token_indices, ++++++-# source=expert_output.to(hidden_states.dtype) ++++++-# ) ++++++-# return moe_output ++++++- ++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-# """ ++++++-# 顶层 forward 方法,作为智能分发器。 ++++++-# """ ++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- ++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-# router_logits = self.gate(hidden_states_reshaped) ++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- ++++++-# if self.norm_topk_prob: ++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- ++++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++-# moe_output = None ++++++-# # 在推理时,根据序列长度选择最优路径 ++++++-# if not self.training: ++++++-# if sequence_length == 1: ++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++-# else: ++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++-# else: ++++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++++++-# raise NotImplementedError("Training path is not implemented.") ++++++- ++++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++++++- ++++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++++++- ++++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++++++- ++++++-# return final_hidden_states, router_logits ++++++- ++++++- ++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++-# """ ++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++++++-# """ ++++++-# def __init__(self, config: Qwen2MoeConfig): ++++++-# super().__init__() ++++++-# self.num_experts = config.num_experts ++++++-# self.top_k = config.num_experts_per_tok ++++++-# self.norm_topk_prob = config.norm_topk_prob ++++++- ++++++-# # 门控网络 ++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++-# # 专家列表 ++++++-# self.experts = nn.ModuleList( ++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++-# ) ++++++-# # 共享专家 ++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_decode( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# batch_size, _ = hidden_states.shape ++++++-# expert_outputs_list = [ ++++++-# ops.cat([ ++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++-# ], dim=0) ++++++-# for i in range(batch_size) ++++++-# ] ++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++-# return moe_output.squeeze(1) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_prefill( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens = hidden_states.shape[0] ++++++-# flat_selected_experts = selected_experts.flatten() ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++- ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++-# selected_token_indices = token_indices[mask] ++++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++++-# current_states = hidden_states[selected_token_indices] ++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++-# moe_output = moe_output.index_add( ++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++-# ) ++++++-# return moe_output ++++++- ++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-# """ ++++++-# 顶层 forward 方法,作为智能分发器。 ++++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++++++-# """ ++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- ++++++-# # 1. 门控计算 (通用逻辑) ++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-# router_logits = self.gate(hidden_states_reshaped) ++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- ++++++-# if self.norm_topk_prob: ++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- ++++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++-# # 2. 智能分发到最优 MoE 路径 ++++++-# moe_output = None ++++++-# if not self.training: ++++++-# if sequence_length == 1: ++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++-# else: ++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++-# else: ++++++-# raise NotImplementedError("Training path is not implemented.") ++++++- ++++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++- ++++++-# # 4. 合并 MoE 输出和共享专家输出 ++++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++- ++++++-# # 5. 恢复原始形状并返回 ++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++- ++++++-# return final_hidden_states, router_logits ++++++- ++++++-# prefill fastest ++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++-# """ ++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++++++-# """ ++++++-# def __init__(self, config: Qwen2MoeConfig): ++++++-# super().__init__() ++++++-# self.num_experts = config.num_experts ++++++-# self.top_k = config.num_experts_per_tok ++++++-# self.norm_topk_prob = config.norm_topk_prob ++++++- ++++++-# # 门控网络 ++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++-# # 专家列表 ++++++-# self.experts = nn.ModuleList( ++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++-# ) ++++++-# # 共享专家 ++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_dispatch( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# """ ++++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++++++-# """ ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens, _ = hidden_states.shape ++++++- ++++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++++++-# flat_selected_experts = selected_experts.flatten() ++++++-# flat_routing_weights = routing_weights.flatten() ++++++- ++++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++- ++++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++- ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++- ++++++-# # 找到所有分配给该专家的 token ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++- ++++++-# # 使用 mask 选取对应的 token 和权重 ++++++-# current_token_indices = token_indices[mask] ++++++-# current_routing_weights = flat_routing_weights[mask] ++++++-# current_hidden_states = hidden_states[current_token_indices] ++++++- ++++++-# # 对这些 token 进行批处理 ++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++- ++++++-# # 使用 index_add 将结果精确地加回到对应位置 ++++++-# moe_output = moe_output.index_add( ++++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++++++-# ) ++++++-# return moe_output ++++++- ++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-# """ ++++++-# 顶层 forward 方法,作为智能分发器。 ++++++-# """ ++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- ++++++-# # 1. 门控计算 ++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-# router_logits = self.gate(hidden_states_reshaped) ++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- ++++++-# if self.norm_topk_prob: ++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- ++++++-# routing_weights = routing_weights.to(hidden_states.dtype) ++++++- ++++++-# # 2. 调用统一的 MoE 计算内核 ++++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) ++++++- ++++++-# # 3. 统一处理共享专家 ++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++- ++++++-# # 4. 合并输出 ++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++- ++++++-# # 5. 恢复原始形状并返回 ++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++- ++++++-# return final_hidden_states, router_logits ++++++- ++++++- ++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++-# """ ++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++-# 【最终高性能与高精度版】: ++++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++++++-# 3. 这样实现了速度和准确性的两全其美。 ++++++-# """ ++++++-# def __init__(self, config: Qwen2MoeConfig): ++++++-# super().__init__() ++++++-# self.num_experts = config.num_experts ++++++-# self.top_k = config.num_experts_per_tok ++++++-# self.norm_topk_prob = config.norm_topk_prob ++++++- ++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++-# self.experts = nn.ModuleList( ++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++-# ) ++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_decode( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# """ ++++++-# 【解码路径】极致优化版:bmm + 高精度累加。 ++++++-# """ ++++++-# original_dtype = hidden_states.dtype ++++++-# batch_size, _ = hidden_states.shape ++++++- ++++++-# expert_outputs_list = [ ++++++-# ops.cat([ ++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++-# ], dim=0) ++++++-# for i in range(batch_size) ++++++-# ] ++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++- ++++++-# # 在 float32 下执行 bmm,得到高精度结果 ++++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++- ++++++-# # 将高精度结果转换回原始数据类型 ++++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++++++- ++++++-# return moe_output ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_prefill( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# selected_experts: mindspore.Tensor, ++++++-# routing_weights: mindspore.Tensor ++++++-# ) -> mindspore.Tensor: ++++++-# """ ++++++-# 【预填充路径】与原始实现一致,结果精确。 ++++++-# """ ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens, _ = hidden_states.shape ++++++-# flat_selected_experts = selected_experts.flatten() ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++- ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++-# selected_token_indices = token_indices[mask] ++++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++++-# current_states = hidden_states[selected_token_indices] ++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++-# moe_output = moe_output.index_add( ++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++-# ) ++++++-# return moe_output ++++++- ++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++- ++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-# router_logits = self.gate(hidden_states_reshaped) ++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++- ++++++-# if self.norm_topk_prob: ++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- ++++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++++++-# # 如果模型主体是 float16,后续再转换 ++++++- ++++++-# moe_output = None ++++++-# if not self.training: ++++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++++++-# # _moe_infer_decode 内部会处理好类型转换 ++++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++++++-# if sequence_length == 1: ++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++-# else: ++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++-# else: ++++++-# raise NotImplementedError("Training path is not implemented.") ++++++- ++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++- ++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++- ++++++-# return final_hidden_states, router_logits ++++++- ++++++- ++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++-# """ ++++++-# 【融合版】一个混合专家模块,内置两种推理策略, ++++++-# 由外部全局变量 `Long_Prompt` 控制: ++++++- ++++++-# - if Long_Prompt is True: 【精度优先模式】 ++++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++++++-# 适用于处理长序列,避免误差累积。 ++++++- ++++++-# - if Long_Prompt is False: 【速度优先模式】 ++++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 ++++++-# """ ++++++-# def __init__(self, config: Qwen2MoeConfig): ++++++-# super().__init__() ++++++-# self.num_experts = config.num_experts ++++++-# self.top_k = config.num_experts_per_tok ++++++-# self.norm_topk_prob = config.norm_topk_prob ++++++- ++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++-# self.experts = nn.ModuleList( ++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++-# ) ++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-# # --- 速度优先模式的辅助函数 --- ++++++-# @no_grad() ++++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++-# original_dtype = hidden_states.dtype ++++++-# batch_size, _ = hidden_states.shape ++++++-# expert_outputs_list = [ ++++++-# ops.cat([ ++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++-# ], dim=0) ++++++-# for i in range(batch_size) ++++++-# ] ++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++-# weights_fp32 = routing_weights.to(mindspore.float32) ++++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++-# return moe_output_fp32.squeeze(1).to(original_dtype) ++++++- ++++++-# @no_grad() ++++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens, _ = hidden_states.shape ++++++-# flat_selected_experts = selected_experts.flatten() ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++-# selected_token_indices = token_indices[mask] ++++++-# selected_routing_weights = routing_weights.flatten()[mask] ++++++-# current_states = hidden_states[selected_token_indices] ++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++-# return moe_output ++++++- ++++++-# # --- 精度优先模式的辅助函数 --- ++++++-# @no_grad() ++++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++-# moe_output = ops.zeros_like(hidden_states) ++++++-# num_tokens, _ = hidden_states.shape ++++++-# flat_selected_experts = selected_experts.flatten() ++++++-# flat_routing_weights = routing_weights.flatten() ++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++-# active_experts = ops.unique(flat_selected_experts) ++++++-# for expert_idx_tensor in active_experts: ++++++-# expert_idx = expert_idx_tensor.item() ++++++-# expert_layer = self.experts[expert_idx] ++++++-# mask = (flat_selected_experts == expert_idx_tensor) ++++++-# current_token_indices = token_indices[mask] ++++++-# current_routing_weights = flat_routing_weights[mask] ++++++-# current_hidden_states = hidden_states[current_token_indices] ++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++-# return moe_output ++++++- ++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-# # 声明我们将要使用一个在模块外部定义的全局变量 ++++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++++++-# global Long_Prompt ++++++- ++++++-# # 1. 门控计算 (所有模式通用) ++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-# router_logits = self.gate(hidden_states_reshaped) ++++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++++-# if self.norm_topk_prob: ++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++- ++++++-# moe_output = None ++++++-# if not self.training: ++++++-# # 根据 Long_Prompt 标志选择模式 ++++++-# if Long_Prompt: ++++++-# # --- 精度优先模式 --- ++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++-# else: ++++++-# # --- 速度优先模式 --- ++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++-# if sequence_length == 1: ++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++-# else: ++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++-# else: ++++++-# raise NotImplementedError("Training path is not implemented.") ++++++- ++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++- ++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++- ++++++-# return final_hidden_states, router_logits ++++++- ++++++ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ """ ++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++ return moe_output_fp32.squeeze(1).to(original_dtype) ++++++ +++++++ # @no_grad() +++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++ # num_tokens, _ = hidden_states.shape +++++++ # flat_selected_experts = selected_experts.flatten() +++++++ # sorted_expert_indices = flat_selected_experts.argsort() +++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++++ # original_token_indices = sorted_expert_indices // self.top_k +++++++ # moe_output = ops.zeros_like(hidden_states) +++++++ # current_token_offset = 0 +++++++ # for i in range(self.num_experts): +++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset +++++++ # if expert_token_count == 0: +++++++ # continue +++++++ # end_offset = current_token_offset + expert_token_count +++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] +++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++ # current_token_offset += expert_token_count +++++++ # return moe_output +++++++ ++++++ @no_grad() ++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++- num_tokens, _ = hidden_states.shape ++++++- flat_selected_experts = selected_experts.flatten() ++++++- sorted_expert_indices = flat_selected_experts.argsort() ++++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++++- original_token_indices = sorted_expert_indices // self.top_k +++++++ """ +++++++ 优化版 MoE prefill (速度优先模式): +++++++ - 批量张量化处理同一个 expert 的所有 token +++++++ - 跳过无 token 的专家 +++++++ - 保持结果完全一致 +++++++ """ ++++++ moe_output = ops.zeros_like(hidden_states) ++++++- current_token_offset = 0 ++++++- for i in range(self.num_experts): ++++++- expert_token_count = tokens_per_expert[i] - current_token_offset ++++++- if expert_token_count == 0: ++++++- continue ++++++- end_offset = current_token_offset + expert_token_count ++++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++++- expert_hidden_states = hidden_states[expert_original_token_indices] ++++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++- current_token_offset += expert_token_count +++++++ +++++++ flat_selected_experts = selected_experts.flatten() +++++++ flat_routing_weights = routing_weights.flatten() +++++++ +++++++ idxs = flat_selected_experts.argsort() +++++++ sorted_expert_indices = flat_selected_experts[idxs] +++++++ sorted_token_indices = idxs // self.top_k +++++++ +++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +++++++ +++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++++++ +++++++ for expert_id in active_experts.tolist(): +++++++ start = int(tokens_per_expert[:expert_id].sum().item()) +++++++ end = start + int(tokens_per_expert[expert_id].item()) +++++++ +++++++ token_idx = sorted_token_indices[start:end] +++++++ expert_tokens = hidden_states[token_idx] +++++++ +++++++ expert_out = self.experts[expert_id](expert_tokens) +++++++ +++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +++++++ +++++++ moe_output = mindspore.mint.scatter_add( +++++++ moe_output, +++++++ 0, +++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +++++++ scaled_out.to(hidden_states.dtype) +++++++ ) +++++++ ++++++ return moe_output ++++++ +++++++ ++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++++++ @no_grad() ++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++ ++++++ moe_output = None ++++++- if Long_Prompt: ++++++- # --- 精度优先模式 (ACCURACY MODE) --- ++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ # if Long_Prompt==0: +++++++ # # --- 精度优先模式 (ACCURACY MODE) --- +++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ # else: +++++++ # # --- 速度优先模式 (SPEED MODE) --- +++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++ # if sequence_length == 1: +++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ # else: +++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ +++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++ if sequence_length == 1: +++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++ else: ++++++- # --- 速度优先模式 (SPEED MODE) --- ++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++- if sequence_length == 1: ++++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++- else: ++++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++- +++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ ++++++ ++++++ # 3. 共享专家计算与合并 (所有模式通用) ++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++ ++++++ return final_hidden_states, router_logits ++++++ +++++++ ++++++ class Qwen2MoeDecoderLayer(nn.Module): ++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): ++++++ super().__init__() ++++++ self.hidden_size = config.hidden_size ++++++ ++++++- # if Long_Prompt: ++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++- # else: +++++++ # if Long_Prompt == 2: ++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++ # else: +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ ++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++ ++++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++++ ) ++++++ ++++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). ++++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++++ # attention_mask, +++++++ # sequence_length=sequence_length, +++++++ # target_length=target_length, +++++++ # dtype=dtype, +++++++ # min_dtype=min_dtype, +++++++ # cache_position=cache_position, +++++++ # batch_size=input_tensor.shape[0], +++++++ # ) +++++++ #@dwj +++++++ causal_mask = get_cached_causal_mask_with_cache_position( ++++++ attention_mask, ++++++ sequence_length=sequence_length, ++++++ target_length=target_length, ++++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++++++ """ ++++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +++++++ _causal_mask_cache.clear() ++++++ ++++++ input_ids = kwargs.get("input_ids") ++++++ if input_ids is None and args: ++++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ ++++++ if input_ids is not None: ++++++ prompt_length = input_ids.shape[1] ++++++- ++++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: ++++++- Long_Prompt = True +++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +++++++ Long_Prompt = 2 +++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +++++++ Long_Prompt = 0 ++++++ else: ++++++- Long_Prompt = False +++++++ Long_Prompt = 1 +++++++ ++++++ ++++++ return super().generate(*args, **kwargs) ++++++ ++++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++ dtype = self.lm_head.weight.dtype ++++++ min_dtype = float(ops.finfo(dtype).min) ++++++ ++++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +++++++ # attention_mask, +++++++ # sequence_length=sequence_length, +++++++ # target_length=past_key_values.get_max_length(), +++++++ # dtype=dtype, +++++++ # min_dtype=min_dtype, +++++++ # cache_position=cache_position, +++++++ # batch_size=batch_size, +++++++ # ) +++++++ +++++++ #@dwj +++++++ attention_mask = get_cached_causal_mask_with_cache_position( ++++++ attention_mask, ++++++ sequence_length=sequence_length, ++++++ target_length=past_key_values.get_max_length(), ++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++++deleted file mode 100644 ++++++index 6dfb5b93..00000000 ++++++--- a/patches/0001-20251104commit.patch +++++++++ /dev/null ++++++@@ -1,1272 +0,0 @@ ++++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++++-From: Pinoeer-kingxi <13022943007@163.com> ++++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++++-Subject: [PATCH] 20251104commit ++++++- ++++++---- ++++++- mindnlp/transformers/cache_utils.py | 28 +- ++++++- .../models/deepseek/modeling_deepseek.py | 149 ++- ++++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++++- 3 files changed, 976 insertions(+), 87 deletions(-) ++++++- ++++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++++-index cadd2e04..02f8d4be 100644 ++++++---- a/mindnlp/transformers/cache_utils.py ++++++-+++ b/mindnlp/transformers/cache_utils.py ++++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++++- # k_out[:, :, cache_position] = key_states ++++++- # v_out[:, :, cache_position] = value_states ++++++-- if ON_ORANGE_PI: ++++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++-- else: ++++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++-- ++++++-+ # if ON_ORANGE_PI: ++++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++-+ # else: ++++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 ++++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++++-+ if cache_position.ndim > 1: ++++++-+ cache_position = cache_position.flatten() ++++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) ++++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++++-+ cache_position = cache_position.int() ++++++-+ ++++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++++-+ k_out[:, :, cache_position] = key_states ++++++-+ v_out[:, :, cache_position] = value_states ++++++-+ ++++++- return k_out, v_out ++++++- ++++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++-index c695b944..d8303e45 100644 ++++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++++- # Copied from transformers.models.llama.modeling_llama.rotate_half ++++++- def rotate_half(x): ++++++- """Rotates half the hidden dims of the input.""" ++++++-- x1 = x[..., : x.shape[-1] // 2] ++++++-- x2 = x[..., x.shape[-1] // 2 :] ++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++-+ # x1 = x[..., : x.shape[-1] // 2] ++++++-+ # x2 = x[..., x.shape[-1] // 2 :] ++++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++- return ops.cat((-x2, x1), dim=-1) ++++++- ++++++- ++++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++++- if self.training: ++++++- raise NotImplementedError("Training is not supported yet.") ++++++- else: ++++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++-- if self.config.n_shared_experts is not None: ++++++-- y = y + self.shared_experts(identity) ++++++-- return y ++++++-+ # @lwx ++++++-+ if orig_shape[1] == 1: ++++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++++-+ y=y.view(*orig_shape) ++++++-+ if self.config.n_shared_experts is not None: ++++++-+ y = y + self.shared_experts(identity) ++++++-+ return y ++++++-+ else: ++++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++++-+ if self.config.n_shared_experts is not None: ++++++-+ y = y + self.shared_experts(identity) ++++++-+ return y ++++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++-+ # if self.config.n_shared_experts is not None: ++++++-+ # y = y + self.shared_experts(identity) ++++++-+ # return y ++++++-+ ++++++-+ @no_grad() ++++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++-+ ++++++-+ expert_cache = ops.zeros_like(x) ++++++-+ for i in range(self.num_experts_per_tok): ++++++-+ expert_id = flat_expert_indices[i].item() ++++++-+ weight = flat_expert_weights[i].item() ++++++-+ expert = self.experts[expert_id] ++++++-+ expert_out = expert(x) ++++++-+ expert_cache += expert_out * weight ++++++-+ return expert_cache ++++++- ++++++- @no_grad() ++++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++-- # expert_cache = torch.zeros_like(x) ++++++-- # idxs = flat_expert_indices.argsort() ++++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++-- # token_idxs = idxs // self.num_experts_per_tok ++++++-- # for i, end_idx in enumerate(tokens_per_expert): ++++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++-- # if start_idx == end_idx: ++++++-- # continue ++++++-- # expert = self.experts[i] ++++++-- # exp_token_idx = token_idxs[start_idx:end_idx] ++++++-- # expert_tokens = x[exp_token_idx] ++++++-- # expert_out = expert(expert_tokens) ++++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++-- # return expert_cache ++++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++- expert_cache = ops.zeros_like(x) ++++++- idxs = flat_expert_indices.argsort() ++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++- token_idxs = idxs // self.num_experts_per_tok ++++++-+ ++++++- for i, end_idx in enumerate(tokens_per_expert): ++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++- if start_idx == end_idx: ++++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++++- expert_out = expert(expert_tokens) ++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++-+ ++++++- return expert_cache ++++++-+ ++++++-+ # @no_grad() ++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++-+ # # expert_cache = torch.zeros_like(x) ++++++-+ # # idxs = flat_expert_indices.argsort() ++++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++-+ # # token_idxs = idxs // self.num_experts_per_tok ++++++-+ # # for i, end_idx in enumerate(tokens_per_expert): ++++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++-+ # # if start_idx == end_idx: ++++++-+ # # continue ++++++-+ # # expert = self.experts[i] ++++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++-+ # # expert_tokens = x[exp_token_idx] ++++++-+ # # expert_out = expert(expert_tokens) ++++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++-+ # # return expert_cache ++++++-+ # expert_cache = ops.zeros_like(x) ++++++-+ # idxs = flat_expert_indices.argsort() ++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++-+ # token_idxs = idxs // self.num_experts_per_tok ++++++-+ ++++++-+ # for i, end_idx in enumerate(tokens_per_expert): ++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++-+ # if start_idx == end_idx: ++++++-+ # continue ++++++-+ # expert = self.experts[i] ++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++-+ # expert_tokens = x[exp_token_idx] ++++++-+ # expert_out = expert(expert_tokens) ++++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++-+ ++++++-+ # return expert_cache ++++++-+ # @no_grad() ++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++-+ # expert_cache = ops.zeros_like(x) ++++++-+ ++++++-+ # # 排序保证顺序一致 ++++++-+ # idxs = flat_expert_indices.argsort() ++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++-+ # token_idxs = idxs // self.num_experts_per_tok ++++++-+ ++++++-+ # # 找出有 token 的专家 ++++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++-+ ++++++-+ # for i in active_experts.tolist(): ++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++-+ # end_idx = tokens_per_expert[i] ++++++-+ # if start_idx == end_idx: # 没有 token ++++++-+ # continue ++++++-+ ++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++-+ # expert_tokens = x[exp_token_idx] ++++++-+ # expert_out = self.experts[i](expert_tokens) ++++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++-+ ++++++-+ # expert_cache = mindspore.mint.scatter_add( ++++++-+ # expert_cache, ++++++-+ # 0, ++++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++-+ # expert_out ++++++-+ # ) ++++++-+ ++++++-+ # return expert_cache ++++++-+ ++++++-+ ++++++- ++++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++++- # """ ++++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++- ++++++- # Initialize weights and apply final processing ++++++- self.post_init() ++++++-+ self.warm_up = False ++++++-+ ++++++-+ def warmup_moe_model_deep(self): ++++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++-+ test_texts = [ ++++++-+ "warmup short", ++++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++++-+ ] ++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++-+ if tokenizer is None: ++++++-+ from mindnlp.transformers import AutoTokenizer ++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++-+ self._warmup_tokenizer = tokenizer ++++++-+ ++++++-+ for text in test_texts: ++++++-+ inputs = tokenizer(text, return_tensors="ms") ++++++-+ with mindspore._no_grad(): ++++++-+ _ = self(**inputs, use_cache=False) ++++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++++- ++++++- def get_input_embeddings(self): ++++++- return self.model.embed_tokens ++++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++- ```""" ++++++-+ if not self.warm_up: ++++++-+ self.warm_up = True ++++++-+ self.warmup_moe_model_deep() ++++++-+ ++++++- output_attentions = ( ++++++- output_attentions ++++++- if output_attentions is not None ++++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++-index 3cbf820e..d4c6b651 100644 ++++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++-@@ -18,7 +18,6 @@ ++++++- # See the License for the specific language governing permissions and ++++++- # limitations under the License. ++++++- """MindSpore Qwen2MoE model.""" ++++++-- ++++++- import math ++++++- from typing import List, Optional, Tuple, Union ++++++- ++++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++++- TokenClassifierOutput, ++++++- ) ++++++- from ...modeling_utils import PreTrainedModel ++++++-+from ...generation import GenerationMixin ++++++- from ....utils import logging ++++++- from .configuration_qwen2_moe import Qwen2MoeConfig ++++++- ++++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++++- self.variance_epsilon = eps ++++++- ++++++- def forward(self, hidden_states): ++++++-+ # @dwj ++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++-+ # @lwx ++++++-+ # if not self.training : ++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++- input_dtype = hidden_states.dtype ++++++- hidden_states = hidden_states.to(mindspore.float32) ++++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++++-@@ -234,6 +239,8 @@ def rotate_half(x): ++++++- """Rotates half the hidden dims of the input.""" ++++++- x1 = x[..., : x.shape[-1] // 2] ++++++- x2 = x[..., x.shape[-1] // 2 :] ++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++- return ops.cat((-x2, x1), dim=-1) ++++++- ++++++- ++++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++++- self.config = config ++++++- self.hidden_size = config.hidden_size ++++++- self.intermediate_size = intermediate_size ++++++-+ ++++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++++- self.act_fn = ACT2FN[config.hidden_act] ++++++- ++++++- def forward(self, x): ++++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++-- ++++++- ++++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++-+ # @lwx ++++++-+ # gate_up_output = self.gate_up_proj(x) ++++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++++-+ # return self.down_proj(swiglu_output) ++++++-+ ++++++-+ # def forward(self, x): ++++++-+ # gate_proj_out = self.gate_proj(x) ++++++-+ # up_proj_out = self.up_proj(x) ++++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++++-+ # return self.down_proj(swiglu_out) ++++++-+ ++++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++++- """ ++++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++++- use_cache: bool = False, ++++++- cache_position: Optional[mindspore.Tensor] = None, ++++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++-+ ++++++-+ ++++++-+ ++++++- bsz, q_len, _ = hidden_states.shape ++++++- ++++++- query_states = self.q_proj(hidden_states) ++++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++- "with a layer index." ++++++- ) ++++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++-+ if isinstance(past_key_value, StaticCache): ++++++-+ kv_seq_len = key_states.shape[-2] ++++++-+ else: ++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++- ++++++- if past_key_value is not None: ++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++-+ ++++++-+ if isinstance(past_key_value, StaticCache): ++++++-+ kv_seq_len = key_states.shape[-2] ++++++- ++++++- # repeat k/v heads if n_kv_heads < n_heads ++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++-- ++++++-+ ++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++- ++++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++++-- raise ValueError( ++++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++++-- f" {attn_weights.shape}" ++++++-- ) ++++++-- ++++++-- if attention_mask is not None: # no matter the length, we just slice it ++++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++++-+ if attention_mask is not None: ++++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++- attn_weights = attn_weights + causal_mask ++++++- ++++++- # upcast attention to fp32 ++++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++- ++++++- attn_output = self.o_proj(attn_output) ++++++-- ++++++-+ # @lwx ++++++-+ ++++++-+ # max_seq_len = self.max_position_embeddings # 2048 ++++++-+ ++++++-+ # if attention_mask is not None: ++++++-+ # # attention_mask: [B, 1, Sq, Sk] ++++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++-+ ++++++-+ # # pad 到 [max_seq_len, max_seq_len] ++++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++-+ # global_attention_mask = padded_mask ++++++-+ # else: ++++++-+ # global_attention_mask = None ++++++-+ ++++++-+ ++++++-+ # sparse_mode=3 ++++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++++-+ # query=query_states, ++++++-+ # key=key_states, ++++++-+ # value=value_states, ++++++-+ # real_shift=None, ++++++-+ # padding_mask=None, ++++++-+ ++++++-+ # head_num=self.num_heads, ++++++-+ # attn_mask=global_attention_mask, ++++++-+ # keep_prob=1.0 - self.attention_dropout, ++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++-+ # input_layout="BNSD", ++++++-+ # pre_tokens=2147483647, ++++++-+ # next_tokens=2147483647, ++++++-+ # inner_precise=0, ++++++-+ # drop_mask=None, ++++++-+ # prefix=None, ++++++-+ # actual_seq_qlen=None, ++++++-+ # actual_seq_kvlen=None, ++++++-+ # sparse_mode=sparse_mode, ++++++-+ # ) ++++++- if not output_attentions: ++++++- attn_weights = None ++++++- ++++++- return attn_output, attn_weights, past_key_value ++++++- ++++++- ++++++-+class Qwen2MoeFlashAttention(nn.Module): ++++++-+ """ ++++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++-+ ++++++-+ 关键改动: ++++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++-+ 直接传入原始的 key 和 value 张量效率更高。 ++++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++-+ """ ++++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++-+ super().__init__() ++++++-+ self.config = config ++++++-+ self.layer_idx = layer_idx ++++++-+ self.hidden_size = config.hidden_size ++++++-+ self.num_heads = config.num_attention_heads ++++++-+ self.head_dim = self.hidden_size // self.num_heads ++++++-+ self.num_key_value_heads = config.num_key_value_heads ++++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++-+ self.max_position_embeddings = config.max_position_embeddings ++++++-+ self.rope_theta = config.rope_theta ++++++-+ self.attention_dropout = config.attention_dropout ++++++-+ ++++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++-+ raise ValueError( ++++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++-+ ) ++++++-+ ++++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++-+ ++++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++-+ self.head_dim, ++++++-+ max_position_embeddings=self.max_position_embeddings, ++++++-+ base=self.rope_theta, ++++++-+ ) ++++++-+ ++++++-+ def forward( ++++++-+ self, ++++++-+ hidden_states: mindspore.Tensor, ++++++-+ attention_mask: Optional[mindspore.Tensor] = None, ++++++-+ position_ids: Optional[mindspore.Tensor] = None, ++++++-+ past_key_value: Optional[Cache] = None, ++++++-+ output_attentions: bool = False, ++++++-+ use_cache: bool = False, ++++++-+ cache_position: Optional[mindspore.Tensor] = None, ++++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++-+ ++++++-+ bsz, q_len, _ = hidden_states.shape ++++++-+ ++++++-+ # 1. 线性投射 Q, K, V ++++++-+ query_states = self.q_proj(hidden_states) ++++++-+ key_states = self.k_proj(hidden_states) ++++++-+ value_states = self.v_proj(hidden_states) ++++++-+ ++++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++-+ # query: [B, S, H*D] -> [B, N1, S, D] ++++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ ++++++-+ # 3. RoPE 旋转位置编码 ++++++-+ kv_seq_len = key_states.shape[-2] ++++++-+ if past_key_value is not None: ++++++-+ if self.layer_idx is None: ++++++-+ raise ValueError( ++++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++-+ "with a layer index." ++++++-+ ) ++++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++-+ if cache_position.shape[0] == 1: ++++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++-+ kv_seq_len = past_seen_tokens + 1 ++++++-+ else: ++++++-+ # prefill 阶段:cache_position 是范围,使用其长度 ++++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++-+ else: ++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++-+ ++++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++-+ ++++++-+ # 4. KV 缓存更新 ++++++-+ if past_key_value is not None: ++++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++-+ key_states, value_states = past_key_value.update( ++++++-+ key_states, value_states, self.layer_idx, cache_kwargs ++++++-+ ) ++++++-+ ++++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++-+ if cache_position.shape[0] == 1: ++++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++-+ kv_seq_len = key_states.shape[-2] ++++++-+ ++++++-+ # 5. [重要] 准备 Attention Mask ++++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++-+ fa_attention_mask = None ++++++-+ if attention_mask is not None: ++++++-+ # 截取与当前key长度匹配的部分 ++++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++-+ fa_attention_mask = (mask_slice != 0) ++++++-+ ++++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++-+ input_dtype = query_states.dtype ++++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++-+ query_states = query_states.to(mindspore.float16) ++++++-+ key_states = key_states.to(mindspore.float16) ++++++-+ value_states = value_states.to(mindspore.float16) ++++++-+ ++++++-+ # 6. [核心] 调用 flash_attention_score 算子 ++++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++-+ attn_output = mindspore.ops.flash_attention_score( ++++++-+ query=query_states, ++++++-+ key=key_states, ++++++-+ value=value_states, ++++++-+ head_num=self.num_heads, # 传入Q的头数(N1) ++++++-+ attn_mask=fa_attention_mask, ++++++-+ keep_prob=1.0 - self.attention_dropout, ++++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++-+ input_layout="BNSD", ++++++-+ sparse_mode=0 # 使用 defaultMask 模式 ++++++-+ ) ++++++-+ ++++++-+ # 恢复原始数据类型 ++++++-+ attn_output = attn_output.to(input_dtype) ++++++-+ ++++++-+ # 7. 调整输出形状 ++++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++-+ attn_output = self.o_proj(attn_output) ++++++-+ ++++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 ++++++-+ attn_weights = None ++++++-+ if output_attentions: ++++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++-+ ++++++-+ return attn_output, attn_weights, past_key_value ++++++-+ ++++++-+ # def forward( ++++++-+ # self, ++++++-+ # hidden_states: mindspore.Tensor, ++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++++++-+ # position_ids: Optional[mindspore.Tensor] = None, ++++++-+ # past_key_value: Optional[Cache] = None, ++++++-+ # output_attentions: bool = False, ++++++-+ # use_cache: bool = False, ++++++-+ # cache_position: Optional[mindspore.Tensor] = None, ++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++-+ ++++++-+ # bsz, q_len, _ = hidden_states.shape ++++++-+ ++++++-+ # # 1. 线性投射 Q, K, V ++++++-+ # query_states = self.q_proj(hidden_states) ++++++-+ # key_states = self.k_proj(hidden_states) ++++++-+ # value_states = self.v_proj(hidden_states) ++++++-+ ++++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ ++++++-+ # # 3. RoPE 旋转位置编码 ++++++-+ # kv_seq_len = key_states.shape[-2] ++++++-+ # if past_key_value is not None: ++++++-+ # if self.layer_idx is None: ++++++-+ # raise ValueError( ++++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++-+ # "with a layer index." ++++++-+ # ) ++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++-+ ++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++-+ ++++++-+ # # 4. KV 缓存更新 ++++++-+ # if past_key_value is not None: ++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++-+ # key_states, value_states = past_key_value.update( ++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++++++-+ # ) ++++++-+ ++++++-+ # # 5. 准备 Attention Mask ++++++-+ # fa_attention_mask = None ++++++-+ # if attention_mask is not None: ++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++-+ # fa_attention_mask = (mask_slice != 0) ++++++-+ ++++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++-+ # input_dtype = query_states.dtype ++++++-+ ++++++-+ # # 6. [核心] 调用 flash_attention_score 算子 ++++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++++-+ # query=query_states, ++++++-+ # key=key_states, ++++++-+ # value=value_states, ++++++-+ # head_num=self.num_heads, ++++++-+ # attn_mask=fa_attention_mask, ++++++-+ # keep_prob=1.0 - self.attention_dropout, ++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++-+ # input_layout="BNSD", ++++++-+ # sparse_mode=0, ++++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- ++++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++-+ # inner_precise=1 ++++++-+ # ) ++++++-+ ++++++-+ # # 恢复原始数据类型 ++++++-+ # attn_output = attn_output.to(input_dtype) ++++++-+ ++++++-+ # # 7. 调整输出形状 ++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++-+ # attn_output = self.o_proj(attn_output) ++++++-+ ++++++-+ # attn_weights = None ++++++-+ # if output_attentions: ++++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++-+ ++++++-+ # return attn_output, attn_weights, past_key_value ++++++-+ ++++++-+ # def forward( ++++++-+ # self, ++++++-+ # hidden_states: mindspore.Tensor, ++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, ++++++-+ # position_ids: Optional[mindspore.Tensor] = None, ++++++-+ # past_key_value: Optional[Cache] = None, ++++++-+ # output_attentions: bool = False, ++++++-+ # use_cache: bool = False, ++++++-+ # cache_position: Optional[mindspore.Tensor] = None, ++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++-+ ++++++-+ # bsz, q_len, _ = hidden_states.shape ++++++-+ ++++++-+ # query_states = self.q_proj(hidden_states) ++++++-+ # key_states = self.k_proj(hidden_states) ++++++-+ # value_states = self.v_proj(hidden_states) ++++++-+ ++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-+ ++++++-+ # kv_seq_len = key_states.shape[-2] ++++++-+ # if past_key_value is not None: ++++++-+ # if self.layer_idx is None: ++++++-+ # raise ValueError("`layer_idx` must be specified for caching") ++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++-+ ++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++-+ ++++++-+ # if past_key_value is not None: ++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++-+ # key_states, value_states = past_key_value.update( ++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs ++++++-+ # ) ++++++-+ ++++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++-+ ++++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++-+ # query_states = query_states / math.sqrt(self.head_dim) ++++++-+ # # <--- 修改结束 --- ++++++-+ ++++++-+ # fa_attention_mask = None ++++++-+ # if attention_mask is not None: ++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++-+ # fa_attention_mask = (mask_slice != 0) ++++++-+ ++++++-+ # input_dtype = query_states.dtype ++++++-+ ++++++-+ # attn_output = mindspore.ops.flash_attention_score( ++++++-+ # query=query_states, # 传入已经预先缩放过的 query ++++++-+ # key=key_states, ++++++-+ # value=value_states, ++++++-+ # head_num=self.num_heads, ++++++-+ # attn_mask=fa_attention_mask, ++++++-+ # keep_prob=1.0 - self.attention_dropout, ++++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++-+ # input_layout="BNSD", ++++++-+ # sparse_mode=0, ++++++-+ # inner_precise=1 # 仍然保持内部高精度计算 ++++++-+ # ) ++++++-+ ++++++-+ # attn_output = attn_output.to(input_dtype) ++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++-+ # attn_output = self.o_proj(attn_output) ++++++-+ ++++++-+ # attn_weights = None ++++++-+ # if output_attentions: ++++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++-+ ++++++-+ # return attn_output, attn_weights, past_key_value ++++++-+ ++++++- QWEN2MOE_ATTENTION_CLASSES = { ++++++- "eager": Qwen2MoeAttention, ++++++-+ "flash-attention": Qwen2MoeFlashAttention, ++++++- } ++++++- ++++++- ++++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++- ++++++-+ #@dwj ++++++-+ # 只遍历激活的专家,而非全部专家 ++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++-- hidden_states = hidden_states.view(-1, hidden_dim) ++++++-- # router_logits: (batch * sequence_length, n_experts) ++++++-- router_logits = self.gate(hidden_states) ++++++-- ++++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++-- if self.norm_topk_prob: ++++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++-- # we cast back to the input dtype ++++++-- routing_weights = routing_weights.to(hidden_states.dtype) ++++++-- ++++++-- final_hidden_states = ops.zeros( ++++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++++-- ) ++++++-- ++++++-- # One hot encode the selected experts to create an expert mask ++++++-- # this will be used to easily index which expert is going to be sollicitated ++++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++++-- ++++++-- # Loop over all available experts in the model and perform the computation on each expert ++++++-- for expert_idx in range(self.num_experts): ++++++-- expert_layer = self.experts[expert_idx] ++++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++++-- ++++++-- # Index the correct hidden states and compute the expert hidden state for ++++++-- # the current expert. We need to make sure to multiply the output hidden ++++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++++-- if 0 not in idx.shape: ++++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++++-- ++++++-- # However `index_add_` only support torch tensors for indexing so we'll use ++++++-- # the `top_x` tensor here. ++++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++++-- ++++++-- shared_expert_output = self.shared_expert(hidden_states) ++++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++++-- ++++++-- final_hidden_states = final_hidden_states + shared_expert_output ++++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++-+ num_tokens = hidden_states_reshaped.shape[0] ++++++-+ ++++++-+ router_logits = self.gate(hidden_states_reshaped) ++++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++-+ ++++++-+ if self.norm_topk_prob: ++++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++-+ routing_weights = routing_weights.to(hidden_states.dtype) ++++++-+ ++++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++-+ flat_selected_experts = selected_experts.flatten() ++++++-+ ++++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++-+ token_indices = broadcasted_token_indices.flatten() ++++++-+ ++++++-+ active_experts = ops.unique(flat_selected_experts) ++++++-+ ++++++-+ for expert_idx_tensor in active_experts: ++++++-+ expert_idx = expert_idx_tensor.item() ++++++-+ expert_layer = self.experts[expert_idx] ++++++-+ ++++++-+ mask = (flat_selected_experts == expert_idx_tensor) ++++++-+ selected_token_indices = token_indices[mask] ++++++-+ selected_routing_weights = routing_weights.flatten()[mask] ++++++-+ ++++++-+ current_states = hidden_states_reshaped[selected_token_indices] ++++++-+ ++++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++-+ ++++++-+ final_hidden_states = final_hidden_states.index_add( ++++++-+ dim=0, ++++++-+ index=selected_token_indices, ++++++-+ source=expert_output.to(hidden_states.dtype) ++++++-+ ) ++++++-+ ++++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++- ++++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++-- return final_hidden_states, router_logits ++++++-+ final_hidden_states = final_hidden_states + shared_expert_output ++++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++-+ ++++++-+ return final_hidden_states, router_logits ++++++- ++++++- ++++++- class Qwen2MoeDecoderLayer(nn.Module): ++++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++++- ++++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++- ++++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++-+ ++++++- if (layer_idx not in config.mlp_only_layers) and ( ++++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++++- ): ++++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++++- _skip_keys_device_placement = "past_key_values" ++++++- _supports_cache_class = True ++++++-+#lwx ++++++-+ # _supports_static_cache = True ++++++- ++++++- def _init_weights(self, module): ++++++- std = self.config.initializer_range ++++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++++- return causal_mask ++++++- ++++++- ++++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++- _tied_weights_keys = ["lm_head.weight"] ++++++- ++++++- def __init__(self, config): ++++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++- self.num_experts_per_tok = config.num_experts_per_tok ++++++- # Initialize weights and apply final processing ++++++- self.post_init() ++++++-+ # @lwx ++++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++++-+ # self.generation_config.cache_implementation = "static" ++++++-+ self._warmed_up = False ++++++-+ ++++++-+ def warmup_moe_model(self): ++++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++++-+ test_texts = [ ++++++-+ "warmup short", ++++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++++-+ ] ++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++-+ if tokenizer is None: ++++++-+ from mindnlp.transformers import AutoTokenizer ++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++-+ self._warmup_tokenizer = tokenizer ++++++-+ ++++++-+ for text in test_texts: ++++++-+ inputs = tokenizer(text, return_tensors="ms") ++++++-+ with mindspore._no_grad(): ++++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++++- ++++++- def get_input_embeddings(self): ++++++- return self.model.embed_tokens ++++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++- ```""" ++++++-+ if not self._warmed_up: ++++++-+ self._warmed_up = True ++++++-+ self.warmup_moe_model() ++++++- ++++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++++- output_router_logits = ( ++++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++- } ++++++- ) ++++++- return model_inputs ++++++-+# @lwx ++++++-+ # def _decode_one_tokens_logits( ++++++-+ # self, ++++++-+ # cur_token: mindspore.Tensor, ++++++-+ # input_pos: Optional[mindspore.Tensor], ++++++-+ # cache_position: mindspore.Tensor, ++++++-+ # past_key_values: StaticCache, ++++++-+ # ) -> mindspore.Tensor: ++++++-+ # """ ++++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++++-+ ++++++-+ # Args: ++++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++++-+ # input_pos: 输入位置信息,可选 ++++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) ++++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++++-+ ++++++-+ # Returns: ++++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++++-+ # """ ++++++-+ # # 调用JIT编译的版本 ++++++-+ # return self.get_decode_one_tokens_logits( ++++++-+ # cur_token=cur_token, ++++++-+ # input_pos=input_pos, ++++++-+ # cache_position=cache_position, ++++++-+ # past_key_values=past_key_values, ++++++-+ # ) ++++++-+ ++++++-+ # @mindspore.jit(jit_level='O1') ++++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++++-+ # """ ++++++-+ # JIT编译的函数,用于高效的单token解码 ++++++-+ # 使用JIT编译优化以支持静态shape和高效执行 ++++++-+ ++++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++++-+ # """ ++++++-+ # outputs = self.model.forward( ++++++-+ # input_ids=cur_token, ++++++-+ # position_ids=input_pos, ++++++-+ # cache_position=cache_position, ++++++-+ # past_key_values=past_key_values, ++++++-+ # use_cache=True, ++++++-+ # return_dict=False, ++++++-+ # ) ++++++-+ ++++++-+ # hidden_states = outputs[0] ++++++-+ # logits = self.lm_head.forward(hidden_states) ++++++-+ # logits = logits.float() ++++++-+ ++++++-+ # return logits[:, -1, :] ++++++-+ ++++++-+ # def _sample( ++++++-+ # self, ++++++-+ # input_ids: mindspore.Tensor, ++++++-+ # logits_processor, ++++++-+ # stopping_criteria, ++++++-+ # generation_config, ++++++-+ # synced_devices: bool, ++++++-+ # streamer=None, ++++++-+ # logits_warper=None, ++++++-+ # **model_kwargs, ++++++-+ # ): ++++++-+ # """ ++++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++++-+ # """ ++++++-+ # from ...generation.logits_process import LogitsProcessorList ++++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList ++++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++++-+ # from mindnlp.core import nn, ops, no_grad ++++++-+ # import numpy as np ++++++-+ ++++++-+ # # 检查是否使用 StaticCache ++++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++++-+ # # 否则,直接调用父类方法 ++++++-+ # past_key_values = model_kwargs.get("past_key_values") ++++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++++-+ ++++++-+ # if not isinstance(past_key_values, StaticCache): ++++++-+ # # 不使用 StaticCache,直接调用父类方法 ++++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++++-+ # return super()._sample( ++++++-+ # input_ids=input_ids, ++++++-+ # logits_processor=logits_processor, ++++++-+ # stopping_criteria=stopping_criteria, ++++++-+ # generation_config=generation_config, ++++++-+ # synced_devices=synced_devices, ++++++-+ # streamer=streamer, ++++++-+ # logits_warper=logits_warper, ++++++-+ # **model_kwargs, ++++++-+ # ) ++++++-+ ++++++-+ # # 使用 StaticCache,进入自定义循环 ++++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++++-+ # pad_token_id = generation_config._pad_token_tensor ++++++-+ # output_attentions = generation_config.output_attentions ++++++-+ # output_hidden_states = generation_config.output_hidden_states ++++++-+ # output_scores = generation_config.output_scores ++++++-+ # output_logits = generation_config.output_logits ++++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate ++++++-+ # max_length = generation_config.max_length ++++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++++-+ # do_sample = generation_config.do_sample ++++++-+ ++++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++++-+ # raise ValueError( ++++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++++-+ # f"{logits_warper})." ++++++-+ # ) ++++++-+ ++++++-+ # # init attention / hidden states / scores tuples ++++++-+ # scores = () if (return_dict_in_generate and output_scores) else None ++++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++++-+ ++++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++++-+ # encoder_hidden_states = ( ++++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++++-+ # ) ++++++-+ ++++++-+ # # keep track of which sequences are already finished ++++++-+ # batch_size, cur_len = input_ids.shape ++++++-+ # this_peer_finished = False ++++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++++-+ ++++++-+ # time_record = [] ++++++-+ # from ....utils.testing_utils import parse_flag_from_env ++++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++++-+ ++++++-+ # while self._has_unfinished_sequences( ++++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++++-+ # ): ++++++-+ # if _record_time: ++++++-+ # import time as time_module ++++++-+ # infer_start = time_module.time() ++++++-+ ++++++-+ # # prepare model inputs ++++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++++-+ ++++++-+ # # prepare variable output controls ++++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++++-+ ++++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++++-+ # cur_cache_position = model_inputs.get("cache_position") ++++++-+ # cur_past_key_values = model_inputs.get("past_key_values") ++++++-+ # cur_input_ids = model_inputs.get("input_ids") ++++++-+ ++++++-+ # if (isinstance(cur_past_key_values, StaticCache) and ++++++-+ # cur_cache_position is not None and ++++++-+ # len(cur_cache_position.shape) > 0 and ++++++-+ # cur_cache_position.shape[0] == 1 and ++++++-+ # cur_input_ids is not None and ++++++-+ # cur_input_ids.shape[1] == 1): ++++++-+ # # 使用 JIT 优化的单 token 解码 ++++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++++-+ # if not hasattr(self, '_jit_used'): ++++++-+ # self._jit_used = False ++++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++++-+ ++++++-+ # next_token_logits = self.get_decode_one_tokens_logits( ++++++-+ # cur_token=cur_input_ids, ++++++-+ # input_pos=model_inputs.get("position_ids"), ++++++-+ # cache_position=cur_cache_position, ++++++-+ # past_key_values=cur_past_key_values, ++++++-+ # ) ++++++-+ ++++++-+ # # 标记已使用JIT(用于后续判断) ++++++-+ # if not self._jit_used: ++++++-+ # self._jit_used = True ++++++-+ ++++++-+ # # 构造兼容的输出对象 ++++++-+ # class JitOptimizedOutput: ++++++-+ # def __init__(self, logits, config): ++++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++++-+ # self.config = config ++++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 ++++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++++-+ # self.attentions = None if not config.is_encoder_decoder else None ++++++-+ # self.cross_attentions = None ++++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None ++++++-+ ++++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++++-+ # else: ++++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++++-+ # outputs = self(**model_inputs, return_dict=True) ++++++-+ ++++++-+ # if synced_devices and this_peer_finished: ++++++-+ # continue ++++++-+ ++++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++++-+ # next_token_logits = outputs.logits[:, -1, :] ++++++-+ ++++++-+ # # pre-process distribution ++++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++++-+ # if do_sample: ++++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++++-+ ++++++-+ # # Store scores, attentions and hidden_states when required ++++++-+ # if return_dict_in_generate: ++++++-+ # if output_scores: ++++++-+ # scores += (next_token_scores,) ++++++-+ # if output_logits: ++++++-+ # raw_logits += (next_token_logits,) ++++++-+ # if output_attentions: ++++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) ++++++-+ # if self.config.is_encoder_decoder: ++++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++++-+ ++++++-+ # if output_hidden_states: ++++++-+ # hidden = ( ++++++-+ # outputs.decoder_hidden_states ++++++-+ # if self.config.is_encoder_decoder ++++++-+ # else outputs.hidden_states ++++++-+ # ) ++++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++++-+ ++++++-+ # # token selection ++++++-+ # if do_sample: ++++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++++-+ # else: ++++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++++-+ ++++++-+ # # finished sentences should have their next token be a padding token ++++++-+ # if has_eos_stopping_criteria: ++++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++++-+ ++++++-+ # # update generated ids, model inputs, and length for next step ++++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++++-+ # if streamer is not None: ++++++-+ # streamer.put(next_tokens) ++++++-+ ++++++-+ # model_kwargs = self._update_model_kwargs_for_generation( ++++++-+ # outputs, ++++++-+ # model_kwargs, ++++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, ++++++-+ # ) ++++++-+ ++++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++++-+ # cur_len += 1 ++++++-+ ++++++-+ # if _record_time: ++++++-+ # import time as time_module ++++++-+ # infer_stop = time_module.time() ++++++-+ # time_record.append(infer_stop - infer_start) ++++++-+ ++++++-+ # del outputs ++++++-+ ++++++-+ # average_infer_time = None ++++++-+ # if time_record: ++++++-+ # if len(time_record) > 1: ++++++-+ # time_record.pop(0) ++++++-+ # average_infer_time = sum(time_record) / len(time_record) ++++++-+ # print(f'average inference time is: {average_infer_time}') ++++++-+ # print(f'inference time record: {time_record}') ++++++-+ ++++++-+ # if streamer is not None: ++++++-+ # streamer.end() ++++++-+ ++++++-+ # # 简单判断:打印是否使用了JIT路径 ++++++-+ # if hasattr(self, '_jit_used') and self._jit_used: ++++++-+ # print("[JIT] ✓ JIT optimization was used during generation") ++++++-+ # else: ++++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++++-+ ++++++-+ # if return_dict_in_generate: ++++++-+ # if self.config.is_encoder_decoder: ++++++-+ # return GenerateEncoderDecoderOutput( ++++++-+ # sequences=input_ids, ++++++-+ # scores=scores, ++++++-+ # logits=raw_logits, ++++++-+ # encoder_attentions=encoder_attentions, ++++++-+ # encoder_hidden_states=encoder_hidden_states, ++++++-+ # decoder_attentions=decoder_attentions, ++++++-+ # cross_attentions=cross_attentions, ++++++-+ # decoder_hidden_states=decoder_hidden_states, ++++++-+ # past_key_values=model_kwargs.get("past_key_values"), ++++++-+ # average_infer_time=average_infer_time ++++++-+ # ) ++++++-+ # else: ++++++-+ # return GenerateDecoderOnlyOutput( ++++++-+ # sequences=input_ids, ++++++-+ # scores=scores, ++++++-+ # logits=raw_logits, ++++++-+ # attentions=decoder_attentions, ++++++-+ # hidden_states=decoder_hidden_states, ++++++-+ # past_key_values=model_kwargs.get("past_key_values"), ++++++-+ # average_infer_time=average_infer_time ++++++-+ # ) ++++++-+ # else: ++++++-+ # return input_ids ++++++-+ ++++++-+ # def _prepare_cache_for_generation( ++++++-+ # self, ++++++-+ # generation_config, ++++++-+ # model_kwargs, ++++++-+ # assistant_model, ++++++-+ # batch_size, ++++++-+ # max_cache_length, ++++++-+ # ): ++++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++++-+ # generation_config.cache_implementation = "static" ++++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++++-+ ++++++-+ # if generation_config.cache_implementation == "static": ++++++-+ # base_required_from_max_length = generation_config.max_length + 1 ++++++-+ # base_required = max(max_cache_length, base_required_from_max_length) ++++++-+ # min_cache_size = 50 ++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++++-+ # else: ++++++-+ # max_cache_length = max(base_required, min_cache_size) ++++++-+ ++++++-+ # original_max_cache_length = max_cache_length ++++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") ++++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++++-+ # print(f" - final max_cache_length: {max_cache_length}") ++++++-+ ++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++-+ # if max_cache_length > self.config.max_position_embeddings: ++++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++++-+ ++++++-+ # result = super()._prepare_cache_for_generation( ++++++-+ # generation_config=generation_config, ++++++-+ # model_kwargs=model_kwargs, ++++++-+ # assistant_model=assistant_model, ++++++-+ # batch_size=batch_size, ++++++-+ # max_cache_length=max_cache_length, ++++++-+ # ) ++++++-+ ++++++-+ # if generation_config.cache_implementation == "static": ++++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++++-+ # created_cache = model_kwargs.get(cache_name) ++++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++++-+ # if created_cache.max_cache_len < generation_config.max_length: ++++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++++-+ ++++++-+ # return result ++++++-+ ++++++-+ ++++++-+ ++++++- ++++++- ++++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++++--- ++++++-2.27.0 ++++++- ++++++-- ++++++2.27.0 ++++++ +++++-- +++++2.27.0 +++++ ++++-- ++++2.27.0 ++++ +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" new file mode 100644 index 00000000..5ba94286 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" @@ -0,0 +1,9078 @@ +From 4f88911daf60910b3b94b56b8a590650454a2dde Mon Sep 17 00:00:00 2001 +From: Pinoeer-kingxi <13022943007@163.com> +Date: Sun, 9 Nov 2025 02:09:15 +0800 +Subject: [PATCH 09/10] 20251109firstcommit + +--- + .../models/deepseek/modeling_deepseek.py | 103 +- + patches/0001-20251104commit.patch | 2 +- + patches/0002-20251106commit.patch | 2 +- + patches/0003-20261106secondcommit.patch | 2 +- + patches/0004-20251106change.patch | 2 +- + patches/0005-20251107001commit.patch | 2 +- + patches/0006-20251107002commit.patch | 2 +- + patches/0007-20251107003commit.patch | 2 +- + patches/0008-moe-change.patch | 8789 +++++++++++++++++ + 9 files changed, 8889 insertions(+), 17 deletions(-) + create mode 100644 patches/0008-moe-change.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 0af29305..8d004af1 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -415,7 +415,9 @@ class DeepseekMoE(nn.Module): + else: + # @lwx + if orig_shape[1] == 1: +- y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++ # lwx moe_infer_decode_fast ++ # y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++ y=self.moe_infer_decode_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) + y=y.view(*orig_shape) + if self.config.n_shared_experts is not None: + y = y + self.shared_experts(identity) +@@ -544,6 +546,7 @@ class DeepseekMoE(nn.Module): + #@dwj + @no_grad() + def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ + selected_experts = [self.experts[i] for i in flat_expert_indices] + expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) + final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +@@ -643,6 +646,43 @@ class DeepseekMoE(nn.Module): + # ) + + # return final_output ++ # def init_expert_cache(self): ++ # """ ++ # 在模型初始化时调用,缓存所有专家的权重到显存。 ++ # """ ++ # self.cache_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) ++ # self.cache_up_w = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) ++ # self.cache_down_w = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) ++ @no_grad() ++ def moe_infer_decode_fast(self, x, flat_expert_indices, flat_expert_weights): ++ top_k = flat_expert_indices.shape[0] ++ hidden_size = x.shape[-1] ++ ++ selected_gate_w = [] ++ selected_up_w = [] ++ selected_down_w = [] ++ ++ for eid in flat_expert_indices.tolist(): ++ if hasattr(self, "cache_gate_w") and eid < self.cache_gate_w.shape[0]: ++ selected_gate_w.append(self.cache_gate_w[eid]) ++ selected_up_w.append(self.cache_up_w[eid]) ++ selected_down_w.append(self.cache_down_w[eid]) ++ else: ++ selected_gate_w.append(self.experts[eid].gate_proj.weight) ++ selected_up_w.append(self.experts[eid].up_proj.weight) ++ selected_down_w.append(self.experts[eid].down_proj.weight) ++ ++ selected_gate_w = ops.stack(selected_gate_w, dim=0) ++ selected_up_w = ops.stack(selected_up_w, dim=0) ++ selected_down_w = ops.stack(selected_down_w, dim=0) ++ ++ x_expanded = x.expand((top_k, 1, hidden_size)) ++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++ return weighted_sum + + # lwx prefill 20251108 + @no_grad() +@@ -711,7 +751,7 @@ class DeepseekMoE(nn.Module): + sorted_token_indices.view(-1, 1).tile((1, hidden_size)), + expert_outputs * sorted_weights + ) +- ++ return final_output + + # try: + # final_output = ops.moe_token_unpermute( +@@ -730,7 +770,7 @@ class DeepseekMoE(nn.Module): + # expert_outputs * sorted_weights + # ) + +- return final_output ++ # return final_output + + + # @no_grad() +@@ -1827,27 +1867,68 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + + # Initialize weights and apply final processing + self.post_init() ++ # lwx + self.warm_up = False +- ++ #初始 ++ ++ # def warmup_moe_model_deep(self): ++ # print("[Warmup] DeepSeek-MoE 模型预热开始...") ++ # test_texts = [ ++ # "warmup short", ++ # "This is a medium length warmup sentence for MoE experts. middle middle middle", ++ # "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++ # ] ++ # tokenizer = getattr(self, "_warmup_tokenizer", None) ++ # if tokenizer is None: ++ # from mindnlp.transformers import AutoTokenizer ++ # tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++ # self._warmup_tokenizer = tokenizer ++ ++ # for text in test_texts: ++ # inputs = tokenizer(text, return_tensors="ms") ++ # with mindspore._no_grad(): ++ # _ = self(**inputs, use_cache=False) ++ # print("[Warmup] DeepSeek-MoE 模型预热完成。") ++ + def warmup_moe_model_deep(self): + print("[Warmup] DeepSeek-MoE 模型预热开始...") +- test_texts = [ +- "warmup short", +- "This is a medium length warmup sentence for MoE experts. middle middle middle", +- "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++ ++ # 直接用 eval.py 默认的 prompts 内容 ++ warmup_prompts = [ ++ "Hello, how are you?", ++ "This American studied art at Yale and is the author of multiple popular mystery novels. First name is 'Hillary'. What's the last name?", ++ """Summarize the following text: US President Donald Trump has said he is 'not happy' with his Russian counterpart Vladimir Putin, following Moscow's largest aerial attack yet on Ukraine. ++ In a rare rebuke, Trump said: "What the hell happened to him? He's killing a lot of people." He later called Putin "absolutely crazy". ++ Ukrainian President Volodymyr Zelensky earlier said Washington's "silence" over recent Russian attacks was encouraging Putin, urging "strong pressure" - including tougher sanctions - on Moscow. ++ """ + ] ++ + tokenizer = getattr(self, "_warmup_tokenizer", None) + if tokenizer is None: + from mindnlp.transformers import AutoTokenizer + tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) + self._warmup_tokenizer = tokenizer + +- for text in test_texts: ++ # 跑一遍 warmup_prompts,触发路由逻辑 ++ for text in warmup_prompts: + inputs = tokenizer(text, return_tensors="ms") + with mindspore._no_grad(): + _ = self(**inputs, use_cache=False) ++ ++ # 这里可以加按需缓存逻辑,避免显存 OOM ++ from mindnlp.transformers.models.deepseek.modeling_deepseek import DeepseekMoE ++ for module in self.modules(): ++ if isinstance(module, DeepseekMoE): ++ active_ids = getattr(module, "_last_routed_expert_ids", None) ++ if active_ids is not None: ++ module.init_active_expert_cache(active_ids) + print("[Warmup] DeepSeek-MoE 模型预热完成。") + ++ def init_active_expert_cache(self, active_ids): ++ self.cache_gate_w = ops.stack([self.experts[i].gate_proj.weight for i in active_ids], dim=0) ++ self.cache_up_w = ops.stack([self.experts[i].up_proj.weight for i in active_ids], dim=0) ++ self.cache_down_w = ops.stack([self.experts[i].down_proj.weight for i in active_ids], dim=0) ++ + def get_input_embeddings(self): + return self.model.embed_tokens + +@@ -2208,7 +2289,9 @@ if __name__ == "__main__": + config.num_hidden_layers = 2 + config.n_routed_experts = 2 + model = DeepseekForCausalLM(config) +- ++ # for module in model.modules(): ++ # if isinstance(module, DeepseekMoE): ++ # module.init_expert_cache() + print('init model') + input_ids = mindspore.Tensor(np.random.randint(0, 10000, (1, 11)), mindspore.int32) + attention_mask = mindspore.Tensor(np.ones((1,11)), mindspore.int32) +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +index 513dd40b..8de61195 100644 +--- a/patches/0001-20251104commit.patch ++++ b/patches/0001-20251104commit.patch +@@ -1,7 +1,7 @@ + From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/7] 20251104commit ++Subject: [PATCH 1/8] 20251104commit + + --- + mindnlp/transformers/cache_utils.py | 28 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +index 41081b85..d7a129ea 100644 +--- a/patches/0002-20251106commit.patch ++++ b/patches/0002-20251106commit.patch +@@ -1,7 +1,7 @@ + From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/7] 20251106commit ++Subject: [PATCH 2/8] 20251106commit + + --- + .../models/deepseek/modeling_deepseek.py | 379 ++++- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +index c1392569..179a9bb5 100644 +--- a/patches/0003-20261106secondcommit.patch ++++ b/patches/0003-20261106secondcommit.patch +@@ -1,7 +1,7 @@ + From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/7] 20261106secondcommit ++Subject: [PATCH 3/8] 20261106secondcommit + + --- + .../models/deepseek/modeling_deepseek.py | 217 ++- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +index e548b1b2..bc5549ca 100644 +--- a/patches/0004-20251106change.patch ++++ b/patches/0004-20251106change.patch +@@ -1,7 +1,7 @@ + From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Thu, 6 Nov 2025 15:48:09 +0800 +-Subject: [PATCH 4/7] 20251106change ++Subject: [PATCH 4/8] 20251106change + + --- + .../models/deepseek/modeling_deepseek.py | 189 +- +diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +index bf224d2a..7217a46b 100644 +--- a/patches/0005-20251107001commit.patch ++++ b/patches/0005-20251107001commit.patch +@@ -1,7 +1,7 @@ + From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 11:48:18 +0800 +-Subject: [PATCH 5/7] 20251107001commit ++Subject: [PATCH 5/8] 20251107001commit + + --- + .../models/deepseek/modeling_deepseek.py | 91 +- +diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +index 1bd306b9..80906633 100644 +--- a/patches/0006-20251107002commit.patch ++++ b/patches/0006-20251107002commit.patch +@@ -1,7 +1,7 @@ + From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 12:06:32 +0800 +-Subject: [PATCH 6/7] 20251107002commit ++Subject: [PATCH 6/8] 20251107002commit + + --- + .../models/deepseek/modeling_deepseek.py | 122 +- +diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch +index ce558554..8a2fc4fe 100644 +--- a/patches/0007-20251107003commit.patch ++++ b/patches/0007-20251107003commit.patch +@@ -1,7 +1,7 @@ + From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 + From: Pinoeer-kingxi <13022943007@163.com> + Date: Fri, 7 Nov 2025 12:12:51 +0800 +-Subject: [PATCH 7/7] 20251107003commit ++Subject: [PATCH 7/8] 20251107003commit + + --- + .../models/deepseek/modeling_deepseek.py | 2 +- +diff --git a/patches/0008-moe-change.patch b/patches/0008-moe-change.patch +new file mode 100644 +index 00000000..349f1429 +--- /dev/null ++++ b/patches/0008-moe-change.patch +@@ -0,0 +1,8789 @@ ++From 45ba3bbc411b64cbffd547fa3d66bce9545639dd Mon Sep 17 00:00:00 2001 ++From: Pinoeer-kingxi <13022943007@163.com> ++Date: Sun, 9 Nov 2025 00:50:01 +0800 ++Subject: [PATCH 8/8] moe change ++ ++--- ++ .../models/deepseek/modeling_deepseek.py | 433 +- ++ .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- ++ patches/0001-20251104commit.patch | 2 +- ++ patches/0002-20251106commit.patch | 2 +- ++ patches/0003-20261106secondcommit.patch | 2 +- ++ patches/0004-20251106change.patch | 2 +- ++ patches/0005-20251107001commit.patch | 2 +- ++ patches/0006-20251107002commit.patch | 2 +- ++ patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ ++ 9 files changed, 8510 insertions(+), 55 deletions(-) ++ create mode 100644 patches/0007-20251107003commit.patch ++ ++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++index ff631974..0af29305 100644 ++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++@@ -19,8 +19,10 @@ ++ # limitations under the License. ++ """ MindNLP DeepSeek model.""" ++ import math +++import time ++ import warnings ++ from typing import List, Optional, Tuple, Union +++from mindspore import mint ++ import mindspore ++ from mindnlp.core import nn, ops, no_grad ++ from mindnlp.core.nn import functional as F ++@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) ++ ++ _CONFIG_FOR_DOC = "DeepseekConfig" ++ +++Long_Prompt = 1 +++LONG_PROMPT_LENGTH_THRESHOLD = 128 +++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +++ ++ _attn_mask_cache = {} ++ ++ def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): ++@@ -380,6 +386,8 @@ class MoEGate(nn.Module): ++ return topk_idx, topk_weight, aux_loss ++ ++ +++bincount_op = mindspore.ops.Bincount() +++ ++ class DeepseekMoE(nn.Module): ++ """ ++ A mixed expert module containing shared experts. ++@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): ++ y = y + self.shared_experts(identity) ++ return y ++ else: ++- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++ if Long_Prompt == 0: +++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++ else: +++ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++ if self.config.n_shared_experts is not None: ++ y = y + self.shared_experts(identity) ++ return y ++@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): ++ # if self.config.n_shared_experts is not None: ++ # y = y + self.shared_experts(identity) ++ # return y ++- +++ +++ +++ +++ # lwx +++ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): +++ # """ +++ # 如果 expert_ids 为 None,走单专家逻辑; +++ # 如果有,多专家批量处理,保证和原逻辑一致。 +++ # """ +++ # if expert_ids is None: +++ # # 原单专家逻辑 +++ # if self.config.pretraining_tp > 1: +++ # slice = self.intermediate_size // self.config.pretraining_tp +++ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) +++ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) +++ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) +++ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) +++ # for i in range(self.config.pretraining_tp)], dim=-1) +++ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) +++ # for i in range(self.config.pretraining_tp)], dim=-1) +++ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) +++ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) +++ # for i in range(self.config.pretraining_tp)] +++ # down_proj = sum(down_proj) +++ # else: +++ # down_proj = self.down_proj( +++ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) +++ # ) +++ # return down_proj +++ +++ # # ====== 批量多专家路径 ====== +++ # hidden_size = x.shape[-1] +++ +++ # # 按 token expert_ids 选权重 +++ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] +++ # up_weights = self.up_proj.weight[expert_ids] +++ # down_weights = self.down_proj.weight[expert_ids] +++ +++ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 +++ # if self.config.pretraining_tp > 1: +++ # outputs = [] +++ # slice = self.intermediate_size // self.config.pretraining_tp +++ # for i in range(self.config.pretraining_tp): +++ # # 每个 slice 单独计算 +++ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) +++ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) +++ # act_out = self.act_fn(gate_proj_out) * up_proj_out +++ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) +++ # outputs.append(down_proj_out) +++ # return sum(outputs) +++ # else: +++ # gate_proj_out = F.linear(x, gate_weights) +++ # up_proj_out = F.linear(x, up_weights) +++ # act_out = self.act_fn(gate_proj_out) * up_proj_out +++ # return F.linear(act_out, down_weights) +++ # @no_grad() +++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ # num_tokens = x.shape[0] +++ # hidden_size = x.shape[-1] +++ +++ # idxs = flat_expert_indices.argsort() +++ # sorted_expert_indices = flat_expert_indices[idxs] +++ # sorted_token_indices = idxs // self.num_experts_per_tok +++ # sorted_indices = sorted_token_indices +++ +++ # permuted_tokens = x[sorted_token_indices] +++ # sorted_weights = flat_expert_weights[idxs] +++ +++ # # 一次调用多专家 forward +++ # expert_outputs = ops.zeros_like(permuted_tokens) +++ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) +++ +++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +++ # try: +++ # final_output = ops.moe_token_unpermute( +++ # expert_outputs, +++ # sorted_indices, +++ # probs=probs, +++ # padded_mode=False +++ # ) +++ # except Exception: +++ # final_output = ops.zeros_like(x) +++ # final_output = mindspore.mint.scatter_add( +++ # final_output, +++ # 0, +++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +++ # expert_outputs * sorted_weights +++ # ) +++ +++ # return final_output +++ +++ # def mlp_batch_forward(self, tokens, expert_ids): +++ # """ +++ # 使用批量专家 forward(保留精度) +++ # """ +++ # return self.experts[0].forward(tokens, expert_ids) +++ ++ # @no_grad() ++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++ ++@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): ++ # expert_cache += expert_out * weight ++ # return expert_cache ++ +++ #@dwj ++ @no_grad() ++- # dwj ++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++- # x 的 shape: (1, hidden_size) ++- # flat_expert_indices 的 shape: (num_experts_per_tok,) ++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++- ++- # 1. 收集所有需要的专家层 ++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++- ++- # 2. 并行计算所有专家的输出 ++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++- # ops.cat 会将它们堆叠成一个新的 Tensor ++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++- ++- # 3. 使用矩阵乘法进行加权求和 ++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++- # 最终结果 final_output 的 shape: (1, hidden_size) ++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++- ++ return final_output ++ ++ ++- # @no_grad() ++- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++- # expert_cache = ops.zeros_like(x) ++- # idxs = flat_expert_indices.argsort() ++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++- # token_idxs = idxs // self.num_experts_per_tok ++- ++- # for i, end_idx in enumerate(tokens_per_expert): ++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++- # if start_idx == end_idx: ++- # continue ++- # expert = self.experts[i] ++- # exp_token_idx = token_idxs[start_idx:end_idx] ++- # expert_tokens = x[exp_token_idx] ++- # expert_out = expert(expert_tokens) ++- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++- ++- # return expert_cache ++- ++ @no_grad() ++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++ """ ++@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): ++ ) ++ ++ return expert_cache +++ +++ +++ # @no_grad() +++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ # """ +++ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add +++ # """ +++ # num_tokens = x.shape[0] +++ # hidden_size = x.shape[-1] +++ +++ # # 生成排序后的 token 索引 +++ # idxs = flat_expert_indices.argsort() +++ # sorted_expert_indices = flat_expert_indices[idxs] +++ # sorted_token_indices = idxs // self.num_experts_per_tok +++ +++ # # 记录到 sorted_indices(moe_token_unpermute 用) +++ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] +++ +++ # # 收集专家输入 +++ # permuted_tokens = x[sorted_token_indices] +++ +++ # # 执行每个专家的 MLP(批量处理) +++ # expert_outputs = [] +++ # token_ptr = 0 +++ # tokens_per_expert = sorted_expert_indices.bincount() +++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): +++ # if count == 0: +++ # continue +++ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] +++ # out = self.experts[expert_id](cur_tokens) +++ # expert_outputs.append(out) +++ # token_ptr += count +++ +++ # # 拼接所有专家输出 +++ # permuted_outputs = ops.cat(expert_outputs, axis=0) +++ +++ # # 权重缩放(probs 形状为 [num_tokens, top_k]) +++ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) +++ +++ # # 直接调用硬件加速的 unpermute +++ # final_output = ops.moe_token_unpermute( +++ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] +++ # sorted_indices, # shape: [num_tokens * top_k] +++ # probs=probs, # 按概率加权 +++ # padded_mode=False +++ # ) +++ +++ # return final_output +++ +++ # lwx prefill 20251108 +++ @no_grad() +++ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): +++ """ +++ 高性能 + 数值一致的 MoE prefill 推理: +++ 1. 批量化处理所有专家计算,减少 Python 循环开销 +++ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 +++ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 +++ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch +++ +++ 参数: +++ x: [num_tokens, hidden_size], +++ MoE 输入的 token 表示 +++ flat_expert_indices: [num_tokens * top_k], +++ 每个 token 的路由专家 id +++ flat_expert_weights: [num_tokens * top_k, 1], +++ 路由专家权重 +++ """ +++ num_tokens = x.shape[0] +++ hidden_size = x.shape[-1] +++ +++ # 1) 排序专家分配(与原 scatter_add 一致的顺序) +++ idxs = flat_expert_indices.argsort() # 排序索引 +++ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] +++ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID +++ +++ # sorted_indices 必须与 permuted_tokens 顺序匹配 +++ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 +++ +++ # 2) 收集专家输入(按 idxs 排序) +++ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] +++ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 +++ +++ # 3) 计算每个专家的 token 数 +++ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) +++ +++ # 4) 批量专家计算(减少 Python 循环) +++ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) +++ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) +++ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) +++ +++ expert_outputs = ops.zeros_like(permuted_tokens) +++ ptr = 0 +++ for expert_id, count in enumerate(tokens_per_expert.tolist()): +++ if count == 0: +++ continue +++ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] +++ +++ # 与 DeepseekMLP forward 等价 +++ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) +++ up_proj_out = F.linear(tokens, up_weights[expert_id]) +++ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out +++ expert_out = F.linear(act_out, down_weights[expert_id]) +++ +++ expert_outputs[ptr:ptr+count] = expert_out +++ ptr += count +++ +++ # 5) Ascend 加速的 unpermute(已排序的权重) +++ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape +++ +++ final_output = ops.zeros_like(x) +++ final_output = mindspore.mint.scatter_add( +++ final_output, +++ 0, +++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +++ expert_outputs * sorted_weights +++ ) +++ +++ +++ # try: +++ # final_output = ops.moe_token_unpermute( +++ # expert_outputs, # [num_tokens*top_k, hidden_size] +++ # sorted_indices, # [num_tokens*top_k] 原 token id +++ # probs=probs, # 对应权重 +++ # padded_mode=False +++ # ) +++ # except Exception: +++ # # CPU/GPU fallback:用 scatter_add 保证完全一致 +++ # final_output = ops.zeros_like(x) +++ # final_output = mindspore.mint.scatter_add( +++ # final_output, +++ # 0, +++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +++ # expert_outputs * sorted_weights +++ # ) +++ +++ return final_output +++ +++ +++ # @no_grad() +++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ # num_tokens = x.shape[0] +++ # hidden_size = x.shape[-1] +++ +++ # idxs = flat_expert_indices.argsort() +++ # sorted_expert_indices = flat_expert_indices[idxs] +++ # sorted_token_indices = idxs // self.num_experts_per_tok +++ +++ # # sorted_indices = sorted_token_indices +++ # sorted_indices = sorted_token_indices.astype(mindspore.int32) +++ # permuted_tokens = x[sorted_token_indices] +++ # sorted_weights = flat_expert_weights[idxs] +++ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) +++ +++ # expert_outputs = ops.zeros_like(permuted_tokens) +++ # ptr = 0 +++ +++ # # 只按专家维度循环 +++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): +++ # if count == 0: +++ # continue +++ # token_slice = slice(ptr, ptr + count) +++ # expert_tokens = permuted_tokens[token_slice] +++ +++ # # 保持原 forward(含 pretraining_tp、bias 等) +++ # expert_out = self.experts[expert_id](expert_tokens) +++ +++ # expert_outputs[token_slice] = expert_out +++ # ptr += count +++ +++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +++ # try: +++ # final_output = mindspore.ops.moe_token_unpermute( +++ # expert_outputs, +++ # sorted_indices, +++ # probs=probs, +++ # padded_mode=False +++ # ) +++ # except Exception: +++ # final_output = ops.zeros_like(x) +++ # final_output = mindspore.mint.scatter_add( +++ # final_output, +++ # 0, +++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +++ # expert_outputs * sorted_weights +++ # ) +++ +++ # return final_output +++ +++ +++ #lwx +++ # @no_grad() +++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++ # """ +++ # 并行化 MoE prefill: +++ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 +++ # - 保证结果与原版完全一致 +++ # """ +++ # # 输出缓存 +++ # expert_cache = ops.zeros_like(x) +++ +++ # # token 总数(批量*seq_len*num_experts_per_tok) +++ # num_tokens = flat_expert_indices.shape[0] +++ # hidden_dim = x.shape[-1] +++ +++ # # 原 token ID(idxs // num_experts_per_tok) +++ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) +++ +++ # # ====== Step 1: 组织输入 ====== +++ # # 按 experts 排序,保证 scatter_add 对应位置一致 +++ # sort_ids = flat_expert_indices.argsort() +++ # sorted_experts = flat_expert_indices[sort_ids] +++ # sorted_tokens = token_ids[sort_ids] +++ # sorted_weights = flat_expert_weights[sort_ids] +++ +++ # # 收集每个专家的输入 +++ # # build: expert_inputs[expert_id] = [tokens...] +++ # expert_inputs = [] +++ # expert_outs = [] +++ +++ # for eid in range(self.config.n_routed_experts): +++ # eid_mask = (sorted_experts == eid) +++ # if eid_mask.any(): +++ # tokens_for_eid = x[sorted_tokens[eid_mask]] +++ # expert_inputs.append(tokens_for_eid) +++ # else: +++ # expert_inputs.append(None) +++ +++ # # ====== Step 2: 并行计算所有专家输出 ====== +++ # # 存储所有专家结果到一个列表 +++ # for eid in range(self.config.n_routed_experts): +++ # if expert_inputs[eid] is not None: +++ # out = self.experts[eid](expert_inputs[eid]) +++ # expert_outs.append(out) +++ # else: +++ # expert_outs.append(None) +++ +++ # # ====== Step 3: scatter_add 回写结果 ====== +++ # # 遍历专家,将结果加回对应的 token +++ # pos = 0 +++ # for eid in range(self.config.n_routed_experts): +++ # if expert_outs[eid] is not None: +++ # size = expert_outs[eid].shape[0] +++ # tokens_idx = sorted_tokens[pos:pos+size] +++ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] +++ # pos += size +++ +++ # # scatter_add 到 expert_cache +++ # expert_cache = mindspore.mint.scatter_add( +++ # expert_cache, +++ # dim=0, +++ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), +++ # src=scaled_out +++ # ) +++ +++ # return expert_cache +++ +++ +++ ++ # 放置在 DeepseekMoE 类中 ++ # @no_grad() ++ # #lwx 20251107 ++@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): ++ self.hidden_size = config.hidden_size ++ ++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++- # config=config, layer_idx=layer_idx +++ # config=config, layer_idx=layer_idx ++ # ) ++ ++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): ++ ) ++ else DeepseekMLP(config) ++ ) +++ ++ self.input_layernorm = DeepseekRMSNorm( ++ config.hidden_size, eps=config.rms_norm_eps ++ ) ++@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++ def get_decoder(self): ++ return self.model ++ +++ def generate(self, *args, **kwargs): +++ """ +++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++ """ +++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +++ +++ input_ids = kwargs.get("input_ids") +++ if input_ids is None and args: +++ input_ids = args[0] +++ +++ if input_ids is not None: +++ prompt_length = input_ids.shape[1] +++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +++ Long_Prompt = 2 +++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +++ Long_Prompt = 0 +++ else: +++ Long_Prompt = 1 +++ +++ +++ return super().generate(*args, **kwargs) ++ ++ def forward( ++ self, ++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++index 913a7609..6566958b 100644 ++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ ++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- ++ @no_grad() ++- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ original_dtype = hidden_states.dtype ++ batch_size, _ = hidden_states.shape ++ expert_outputs_list = [ ++@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++ return moe_output_fp32.squeeze(1).to(original_dtype) ++ +++ ++ # @no_grad() ++- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ # num_tokens, _ = hidden_states.shape ++ # flat_selected_experts = selected_experts.flatten() ++ # sorted_expert_indices = flat_selected_experts.argsort() ++@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ # current_token_offset += expert_token_count ++ # return moe_output ++ +++ # baseline ++ @no_grad() ++- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ """ ++ 优化版 MoE prefill (速度优先模式): ++ - 批量张量化处理同一个 expert 的所有 token ++@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ return moe_output ++ ++ +++ @no_grad() +++ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++ """ +++ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add +++ 逻辑: +++ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 +++ 2. 每个 expert 一次性处理其全部 token +++ 3. 最后一次 scatter_add 回到原 token 顺序 +++ """ +++ +++ num_tokens = hidden_states.shape[0] +++ hidden_size = hidden_states.shape[-1] +++ +++ # 展平为一维 +++ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] +++ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] +++ +++ # 按 expert 排序 +++ idxs = flat_selected_experts.argsort() +++ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 +++ sorted_token_indices = idxs // self.top_k # 对应原 token ID +++ +++ # 排好序的输入向量(连续内存) +++ permuted_tokens = hidden_states[sorted_token_indices] +++ +++ # 排好序的权重 +++ sorted_weights = flat_routing_weights[idxs] +++ +++ # 每个 expert 对应的 token 数量 +++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +++ +++ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) +++ expert_outputs = ops.zeros_like(permuted_tokens) +++ +++ ptr = 0 # 指向当前切片的起点 +++ for expert_id, count in enumerate(tokens_per_expert.tolist()): +++ if count == 0: +++ continue +++ +++ token_slice = slice(ptr, ptr + count) +++ expert_tokens = permuted_tokens[token_slice] # 连续切片 +++ +++ # 执行专家 MLP +++ expert_out = self.experts[expert_id](expert_tokens) +++ +++ expert_outputs[token_slice] = expert_out +++ ptr += count +++ +++ # 按权重缩放 +++ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) +++ +++ # 回写到原 token 顺序 (单次 scatter_add) +++ moe_output = mindspore.mint.scatter_add( +++ ops.zeros_like(hidden_states), +++ 0, +++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +++ scaled_outputs +++ ) +++ +++ return moe_output +++ +++ +++ ++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++ ++ @no_grad() ++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++ moe_output = ops.zeros_like(hidden_states) ++@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++ # # --- 速度优先模式 (SPEED MODE) --- ++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ # if sequence_length == 1: ++- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ # else: ++- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ ++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++ if sequence_length == 1: ++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++ else: ++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++- +++ # if Long_Prompt == 1: +++ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ # else: +++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++ ++ ++ # 3. 共享专家计算与合并 (所有模式通用) ++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++index c9c8c5ee..513dd40b 100644 ++--- a/patches/0001-20251104commit.patch +++++ b/patches/0001-20251104commit.patch ++@@ -1,7 +1,7 @@ ++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++-Subject: [PATCH 1/6] 20251104commit +++Subject: [PATCH 1/7] 20251104commit ++ ++ --- ++ mindnlp/transformers/cache_utils.py | 28 +- ++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++index 625656eb..41081b85 100644 ++--- a/patches/0002-20251106commit.patch +++++ b/patches/0002-20251106commit.patch ++@@ -1,7 +1,7 @@ ++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++-Subject: [PATCH 2/6] 20251106commit +++Subject: [PATCH 2/7] 20251106commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++index dcb85080..c1392569 100644 ++--- a/patches/0003-20261106secondcommit.patch +++++ b/patches/0003-20261106secondcommit.patch ++@@ -1,7 +1,7 @@ ++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++-Subject: [PATCH 3/6] 20261106secondcommit +++Subject: [PATCH 3/7] 20261106secondcommit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++index bbed13cc..e548b1b2 100644 ++--- a/patches/0004-20251106change.patch +++++ b/patches/0004-20251106change.patch ++@@ -1,7 +1,7 @@ ++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Thu, 6 Nov 2025 15:48:09 +0800 ++-Subject: [PATCH 4/6] 20251106change +++Subject: [PATCH 4/7] 20251106change ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 189 +- ++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch ++index b2d1035c..bf224d2a 100644 ++--- a/patches/0005-20251107001commit.patch +++++ b/patches/0005-20251107001commit.patch ++@@ -1,7 +1,7 @@ ++ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Fri, 7 Nov 2025 11:48:18 +0800 ++-Subject: [PATCH 5/6] 20251107001commit +++Subject: [PATCH 5/7] 20251107001commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 91 +- ++diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch ++index bffa134e..1bd306b9 100644 ++--- a/patches/0006-20251107002commit.patch +++++ b/patches/0006-20251107002commit.patch ++@@ -1,7 +1,7 @@ ++ From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 ++ From: Pinoeer-kingxi <13022943007@163.com> ++ Date: Fri, 7 Nov 2025 12:06:32 +0800 ++-Subject: [PATCH 6/6] 20251107002commit +++Subject: [PATCH 6/7] 20251107002commit ++ ++ --- ++ .../models/deepseek/modeling_deepseek.py | 122 +- ++diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch ++new file mode 100644 ++index 00000000..ce558554 ++--- /dev/null +++++ b/patches/0007-20251107003commit.patch ++@@ -0,0 +1,8034 @@ +++From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 +++From: Pinoeer-kingxi <13022943007@163.com> +++Date: Fri, 7 Nov 2025 12:12:51 +0800 +++Subject: [PATCH 7/7] 20251107003commit +++ +++--- +++ .../models/deepseek/modeling_deepseek.py | 2 +- +++ patches/0001-20251104commit.patch | 2 +- +++ patches/0002-20251106commit.patch | 2 +- +++ patches/0003-20261106secondcommit.patch | 2 +- +++ patches/0004-20251106change.patch | 2 +- +++ patches/0005-20251107001commit.patch | 2 +- +++ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ +++ 7 files changed, 7937 insertions(+), 6 deletions(-) +++ create mode 100644 patches/0006-20251107002commit.patch +++ +++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++index e7e1c053..ff631974 100644 +++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): +++ # return expert_cache +++ +++ @no_grad() +++- dwj ++++ # dwj +++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++ # x 的 shape: (1, hidden_size) +++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++index 2842180e..c9c8c5ee 100644 +++--- a/patches/0001-20251104commit.patch ++++++ b/patches/0001-20251104commit.patch +++@@ -1,7 +1,7 @@ +++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +++-Subject: [PATCH 1/5] 20251104commit ++++Subject: [PATCH 1/6] 20251104commit +++ +++ --- +++ mindnlp/transformers/cache_utils.py | 28 +- +++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++index c6cd8757..625656eb 100644 +++--- a/patches/0002-20251106commit.patch ++++++ b/patches/0002-20251106commit.patch +++@@ -1,7 +1,7 @@ +++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +++-Subject: [PATCH 2/5] 20251106commit ++++Subject: [PATCH 2/6] 20251106commit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++index 601960c9..dcb85080 100644 +++--- a/patches/0003-20261106secondcommit.patch ++++++ b/patches/0003-20261106secondcommit.patch +++@@ -1,7 +1,7 @@ +++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +++-Subject: [PATCH 3/5] 20261106secondcommit ++++Subject: [PATCH 3/6] 20261106secondcommit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +++index 8976f10b..bbed13cc 100644 +++--- a/patches/0004-20251106change.patch ++++++ b/patches/0004-20251106change.patch +++@@ -1,7 +1,7 @@ +++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Thu, 6 Nov 2025 15:48:09 +0800 +++-Subject: [PATCH 4/5] 20251106change ++++Subject: [PATCH 4/6] 20251106change +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 189 +- +++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +++index 8d9032be..b2d1035c 100644 +++--- a/patches/0005-20251107001commit.patch ++++++ b/patches/0005-20251107001commit.patch +++@@ -1,7 +1,7 @@ +++ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +++ From: Pinoeer-kingxi <13022943007@163.com> +++ Date: Fri, 7 Nov 2025 11:48:18 +0800 +++-Subject: [PATCH 5/5] 20251107001commit ++++Subject: [PATCH 5/6] 20251107001commit +++ +++ --- +++ .../models/deepseek/modeling_deepseek.py | 91 +- +++diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +++new file mode 100644 +++index 00000000..bffa134e +++--- /dev/null ++++++ b/patches/0006-20251107002commit.patch +++@@ -0,0 +1,7931 @@ ++++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 ++++From: Pinoeer-kingxi <13022943007@163.com> ++++Date: Fri, 7 Nov 2025 12:06:32 +0800 ++++Subject: [PATCH 6/6] 20251107002commit ++++ ++++--- ++++ .../models/deepseek/modeling_deepseek.py | 122 +- ++++ patches/0001-20251104commit.patch | 2 +- ++++ patches/0002-20251106commit.patch | 2 +- ++++ patches/0003-20261106secondcommit.patch | 2 +- ++++ patches/0004-20251106change.patch | 2 +- ++++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ ++++ 6 files changed, 7773 insertions(+), 64 deletions(-) ++++ create mode 100644 patches/0005-20251107001commit.patch ++++ ++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++index 8831e4b7..e7e1c053 100644 ++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): ++++ # expert_out = expert(x) ++++ # expert_cache += expert_out * weight ++++ # return expert_cache ++++- ++++- # @no_grad() ++++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++- # # x 的 shape: (1, hidden_size) ++++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++- ++++- # # 1. 收集所有需要的专家层 ++++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++- # selected_experts = [self.experts[i] for i in flat_expert_indices] ++++- ++++- # # 2. 并行计算所有专家的输出 ++++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++- # # ops.cat 会将它们堆叠成一个新的 Tensor ++++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++- ++++- # # 3. 使用矩阵乘法进行加权求和 ++++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++- # # 最终结果 final_output 的 shape: (1, hidden_size) ++++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++++ +++++ @no_grad() +++++ dwj +++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ # x 的 shape: (1, hidden_size) +++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++++ +++++ # 1. 收集所有需要的专家层 +++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +++++ +++++ # 2. 并行计算所有专家的输出 +++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++++ # ops.cat 会将它们堆叠成一个新的 Tensor +++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++++ +++++ # 3. 使用矩阵乘法进行加权求和 +++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++ # 最终结果 final_output 的 shape: (1, hidden_size) +++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++ ++++- # return final_output +++++ return final_output ++++ ++++ ++++ # @no_grad() ++++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): ++++ ++++ return expert_cache ++++ # 放置在 DeepseekMoE 类中 ++++- @no_grad() ++++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++- """ ++++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++++- ++++- Args: ++++- x (Tensor): 输入张量, shape: (1, hidden_size) ++++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++++- """ ++++- top_k, _ = flat_expert_weights.shape ++++- hidden_size = x.shape[-1] ++++- ++++- # 1. 将所有专家的权重堆叠起来 ++++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +++++ # @no_grad() +++++ # #lwx 20251107 +++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++ # """ +++++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +++++ +++++ # Args: +++++ # x (Tensor): 输入张量, shape: (1, hidden_size) +++++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +++++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +++++ # """ +++++ # top_k, _ = flat_expert_weights.shape +++++ # hidden_size = x.shape[-1] +++++ +++++ # # 1. 将所有专家的权重堆叠起来 +++++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +++++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +++++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++++ ++++- # 2. "收集" 所需的专家权重 ++++- selected_gate_w = stacked_gate_w[flat_expert_indices] ++++- selected_up_w = stacked_up_w[flat_expert_indices] ++++- selected_down_w = stacked_down_w[flat_expert_indices] +++++ # # 2. "收集" 所需的专家权重 +++++ # selected_gate_w = stacked_gate_w[flat_expert_indices] +++++ # selected_up_w = stacked_up_w[flat_expert_indices] +++++ # selected_down_w = stacked_down_w[flat_expert_indices] ++++ ++++- # 3. 准备输入 ++++- x_expanded = x.expand((top_k, 1, hidden_size)) +++++ # # 3. 准备输入 +++++ # x_expanded = x.expand((top_k, 1, hidden_size)) ++++ ++++- # 4. 并行计算 gate_proj 和 up_proj ++++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +++++ # # 4. 并行计算 gate_proj 和 up_proj +++++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +++++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++++ ++++- # 5. 计算中间状态 ++++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out +++++ # # 5. 计算中间状态 +++++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++++ ++++- # 6. 并行计算 down_proj ++++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++++- # --- [FIX] --- ++++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++++- # --- [FIX END] --- +++++ # # 6. 并行计算 down_proj +++++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +++++ # # --- [FIX] --- +++++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +++++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +++++ # # --- [FIX END] --- ++++ ++++- # 7. 根据路由权重进行加权求和 ++++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +++++ # # 7. 根据路由权重进行加权求和 +++++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++++ ++++- return weighted_sum +++++ # return weighted_sum ++++ ++++ ++++ ++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++index 0a0ef2d7..2842180e 100644 ++++--- a/patches/0001-20251104commit.patch +++++++ b/patches/0001-20251104commit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++-Subject: [PATCH 1/4] 20251104commit +++++Subject: [PATCH 1/5] 20251104commit ++++ ++++ --- ++++ mindnlp/transformers/cache_utils.py | 28 +- ++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++++index 5185270c..c6cd8757 100644 ++++--- a/patches/0002-20251106commit.patch +++++++ b/patches/0002-20251106commit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Thu, 6 Nov 2025 09:20:38 +0800 ++++-Subject: [PATCH 2/4] 20251106commit +++++Subject: [PATCH 2/5] 20251106commit ++++ ++++ --- ++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- ++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++++index 3e05f821..601960c9 100644 ++++--- a/patches/0003-20261106secondcommit.patch +++++++ b/patches/0003-20261106secondcommit.patch ++++@@ -1,7 +1,7 @@ ++++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Thu, 6 Nov 2025 14:54:37 +0800 ++++-Subject: [PATCH 3/4] 20261106secondcommit +++++Subject: [PATCH 3/5] 20261106secondcommit ++++ ++++ --- ++++ .../models/deepseek/modeling_deepseek.py | 217 ++- ++++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch ++++index 88a1aef4..8976f10b 100644 ++++--- a/patches/0004-20251106change.patch +++++++ b/patches/0004-20251106change.patch ++++@@ -1,7 +1,7 @@ ++++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++++ From: Pinoeer-kingxi <13022943007@163.com> ++++ Date: Thu, 6 Nov 2025 15:48:09 +0800 ++++-Subject: [PATCH 4/4] 20251106change +++++Subject: [PATCH 4/5] 20251106change ++++ ++++ --- ++++ .../models/deepseek/modeling_deepseek.py | 189 +- ++++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch ++++new file mode 100644 ++++index 00000000..8d9032be ++++--- /dev/null +++++++ b/patches/0005-20251107001commit.patch ++++@@ -0,0 +1,7707 @@ +++++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +++++From: Pinoeer-kingxi <13022943007@163.com> +++++Date: Fri, 7 Nov 2025 11:48:18 +0800 +++++Subject: [PATCH 5/5] 20251107001commit +++++ +++++--- +++++ .../models/deepseek/modeling_deepseek.py | 91 +- +++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +++++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +++++ patches/0001-20251104commit.patch | 2 +- +++++ patches/0002-20251106commit.patch | 2 +- +++++ patches/0003-20261106secondcommit.patch | 2 +- +++++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ +++++ 7 files changed, 7577 insertions(+), 30 deletions(-) +++++ create mode 100644 patches/0004-20251106change.patch +++++ +++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++index 0546f318..8831e4b7 100644 +++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +++++ # expert_cache += expert_out * weight +++++ # return expert_cache +++++ +++++- @no_grad() +++++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++- # x 的 shape: (1, hidden_size) +++++- # flat_expert_indices 的 shape: (num_experts_per_tok,) +++++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +++++- +++++- # 1. 收集所有需要的专家层 +++++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +++++- selected_experts = [self.experts[i] for i in flat_expert_indices] +++++- +++++- # 2. 并行计算所有专家的输出 +++++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +++++- # ops.cat 会将它们堆叠成一个新的 Tensor +++++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +++++- +++++- # 3. 使用矩阵乘法进行加权求和 +++++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +++++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +++++- # 最终结果 final_output 的 shape: (1, hidden_size) +++++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++++ # @no_grad() ++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ # # x 的 shape: (1, hidden_size) ++++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++++ ++++++ # # 1. 收集所有需要的专家层 ++++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] ++++++ ++++++ # # 2. 并行计算所有专家的输出 ++++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++++ # # ops.cat 会将它们堆叠成一个新的 Tensor ++++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++++ ++++++ # # 3. 使用矩阵乘法进行加权求和 ++++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++ # # 最终结果 final_output 的 shape: (1, hidden_size) ++++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +++++ +++++- return final_output ++++++ # return final_output +++++ +++++ +++++ # @no_grad() +++++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +++++ ) +++++ +++++ return expert_cache ++++++# 放置在 DeepseekMoE 类中 ++++++ @no_grad() ++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++ """ ++++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 ++++++ ++++++ Args: ++++++ x (Tensor): 输入张量, shape: (1, hidden_size) ++++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) ++++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) ++++++ """ ++++++ top_k, _ = flat_expert_weights.shape ++++++ hidden_size = x.shape[-1] ++++++ ++++++ # 1. 将所有专家的权重堆叠起来 ++++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) ++++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) ++++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) ++++++ ++++++ # 2. "收集" 所需的专家权重 ++++++ selected_gate_w = stacked_gate_w[flat_expert_indices] ++++++ selected_up_w = stacked_up_w[flat_expert_indices] ++++++ selected_down_w = stacked_down_w[flat_expert_indices] ++++++ ++++++ # 3. 准备输入 ++++++ x_expanded = x.expand((top_k, 1, hidden_size)) ++++++ ++++++ # 4. 并行计算 gate_proj 和 up_proj ++++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) ++++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) ++++++ ++++++ # 5. 计算中间状态 ++++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out ++++++ ++++++ # 6. 并行计算 down_proj ++++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) ++++++ # --- [FIX] --- ++++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 ++++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) ++++++ # --- [FIX END] --- ++++++ ++++++ # 7. 根据路由权重进行加权求和 ++++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) ++++++ ++++++ return weighted_sum ++++++ ++++++ +++++ +++++ # @no_grad() +++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++index ebd7782e..913a7609 100644 +++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++- x1 = x[..., : x.shape[-1] // 2] +++++- x2 = x[..., x.shape[-1] // 2 :] ++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++ # x2 = x[..., x.shape[-1] // 2 :] +++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++++index d059dcbe..2b217b64 100644 +++++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py ++++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +++++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++ def rotate_half(x): +++++ """Rotates half the hidden dims of the input.""" +++++- x1 = x[..., : x.shape[-1] // 2] +++++- x2 = x[..., x.shape[-1] // 2 :] ++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++ # x2 = x[..., x.shape[-1] // 2 :] ++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++ return ops.cat((-x2, x1), dim=-1) +++++ +++++ +++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++index 78f22642..0a0ef2d7 100644 +++++--- a/patches/0001-20251104commit.patch ++++++++ b/patches/0001-20251104commit.patch +++++@@ -1,7 +1,7 @@ +++++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++ From: Pinoeer-kingxi <13022943007@163.com> +++++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++-Subject: [PATCH 1/3] 20251104commit ++++++Subject: [PATCH 1/4] 20251104commit +++++ +++++ --- +++++ mindnlp/transformers/cache_utils.py | 28 +- +++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +++++index 22b65dd5..5185270c 100644 +++++--- a/patches/0002-20251106commit.patch ++++++++ b/patches/0002-20251106commit.patch +++++@@ -1,7 +1,7 @@ +++++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++++ From: Pinoeer-kingxi <13022943007@163.com> +++++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +++++-Subject: [PATCH 2/3] 20251106commit ++++++Subject: [PATCH 2/4] 20251106commit +++++ +++++ --- +++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +++++index 966529e4..3e05f821 100644 +++++--- a/patches/0003-20261106secondcommit.patch ++++++++ b/patches/0003-20261106secondcommit.patch +++++@@ -1,7 +1,7 @@ +++++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++++ From: Pinoeer-kingxi <13022943007@163.com> +++++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +++++-Subject: [PATCH 3/3] 20261106secondcommit ++++++Subject: [PATCH 3/4] 20261106secondcommit +++++ +++++ --- +++++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +++++new file mode 100644 +++++index 00000000..88a1aef4 +++++--- /dev/null ++++++++ b/patches/0004-20251106change.patch +++++@@ -0,0 +1,7498 @@ ++++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 ++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++Date: Thu, 6 Nov 2025 15:48:09 +0800 ++++++Subject: [PATCH 4/4] 20251106change ++++++ ++++++--- ++++++ .../models/deepseek/modeling_deepseek.py | 189 +- ++++++ patches/0001-20251104commit.patch | 1272 +++++++ ++++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ ++++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ ++++++ 4 files changed, 7244 insertions(+), 186 deletions(-) ++++++ create mode 100644 patches/0001-20251104commit.patch ++++++ create mode 100644 patches/0002-20251106commit.patch ++++++ create mode 100644 patches/0003-20261106secondcommit.patch ++++++ ++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++index 2f9192bf..0546f318 100644 ++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): ++++++ ++++++ return attn_output, attn_weights, past_key_value ++++++ ++++++-# class DeepseekFlashAttention(nn.Module): ++++++-# """ ++++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++++++- ++++++-# This class is designed as a drop-in replacement for DeepseekAttention. ++++++-# """ ++++++- ++++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++++-# super().__init__() ++++++-# self.config = config ++++++-# self.layer_idx = layer_idx ++++++-# if layer_idx is None: ++++++-# logger.warning( ++++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++-# "when creating this class." ++++++-# ) ++++++- ++++++-# self.attention_dropout = config.attention_dropout ++++++-# self.hidden_size = config.hidden_size ++++++-# self.num_heads = config.num_attention_heads ++++++-# self.head_dim = self.hidden_size // self.num_heads ++++++-# self.num_key_value_heads = config.num_key_value_heads ++++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++-# self.max_position_embeddings = config.max_position_embeddings ++++++-# self.rope_theta = config.rope_theta ++++++-# self.is_causal = True ++++++- ++++++-# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++-# raise ValueError( ++++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++-# f" and `num_heads`: {self.num_heads})." ++++++-# ) ++++++- ++++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++++-# self._init_rope() ++++++- ++++++-# def _init_rope(self): ++++++-# if self.config.rope_scaling is None: ++++++-# self.rotary_emb = DeepseekRotaryEmbedding( ++++++-# self.head_dim, ++++++-# max_position_embeddings=self.max_position_embeddings, ++++++-# base=self.rope_theta, ++++++-# ) ++++++-# else: ++++++-# scaling_type = self.config.rope_scaling["type"] ++++++-# scaling_factor = self.config.rope_scaling["factor"] ++++++-# if scaling_type == "linear": ++++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++++-# self.head_dim, ++++++-# max_position_embeddings=self.max_position_embeddings, ++++++-# scaling_factor=scaling_factor, ++++++-# base=self.rope_theta, ++++++-# ) ++++++-# elif scaling_type == "dynamic": ++++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++++-# self.head_dim, ++++++-# max_position_embeddings=self.max_position_embeddings, ++++++-# scaling_factor=scaling_factor, ++++++-# base=self.rope_theta, ++++++-# ) ++++++-# else: ++++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++++- ++++++-# def forward( ++++++-# self, ++++++-# hidden_states: mindspore.Tensor, ++++++-# attention_mask: Optional[mindspore.Tensor] = None, ++++++-# position_ids: Optional[mindspore.Tensor] = None, ++++++-# past_key_value: Optional[Cache] = None, ++++++-# output_attentions: bool = False, ++++++-# use_cache: bool = False, ++++++-# **kwargs, ++++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++-# if "padding_mask" in kwargs: ++++++-# warnings.warn( ++++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++++-# ) ++++++- ++++++-# if output_attentions: ++++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++++++- ++++++-# bsz, q_len, _ = hidden_states.shape ++++++- ++++++-# if self.config.pretraining_tp > 1: ++++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++++- ++++++-# query_states = self.q_proj(hidden_states) ++++++-# key_states = self.k_proj(hidden_states) ++++++-# value_states = self.v_proj(hidden_states) ++++++- ++++++-# # Reshape for multi-head attention ++++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++- ++++++-# kv_seq_len = key_states.shape[-2] ++++++-# if past_key_value is not None: ++++++-# if self.layer_idx is None: ++++++-# raise ValueError( ++++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++-# "with a layer index." ++++++-# ) ++++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++- ++++++-# # Apply Rotary Positional Embedding ++++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++- ++++++-# if past_key_value is not None: ++++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++- ++++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++- ++++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++- ++++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++- ++++++-# # Convert attention_mask for flash_attention_score ++++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++++++-# if attention_mask is not None: ++++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++++-# raise ValueError( ++++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++++-# ) ++++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++++++-# else: ++++++-# attn_mask_for_fa = None ++++++- ++++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++++- ++++++-# # Call the fused flash_attention_score operator ++++++-# attn_output = mindspore.ops.flash_attention_score( ++++++-# query=query_states_for_fa, ++++++-# key=key_states_for_fa, ++++++-# value=value_states_for_fa, ++++++-# head_num=self.num_heads, # This is N1, the number of query heads ++++++-# input_layout='BSH', ++++++-# attn_mask=attn_mask_for_fa, ++++++-# keep_prob=keep_prob, ++++++-# scalar_value=1.0 / math.sqrt(self.head_dim), ++++++-# sparse_mode=0 # Default mask mode ++++++-# ) ++++++- ++++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++++++-# attn_output = self.o_proj(attn_output) ++++++- ++++++-# # Flash Attention does not return attention weights ++++++-# attn_weights = None ++++++- ++++++-# return attn_output, attn_weights, past_key_value ++++++ ++++++ class DeepseekFlashAttention(nn.Module): ++++++ """ ++++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): ++++++ super().__init__() ++++++ self.hidden_size = config.hidden_size ++++++ ++++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( ++++++- config=config, layer_idx=layer_idx ++++++- ) +++++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +++++++ # config=config, layer_idx=layer_idx +++++++ # ) ++++++ ++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++++++ config=config, layer_idx=layer_idx ++++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): ++++++ return outputs ++++++ ++++++ ++++++- ++++++ class DeepseekPreTrainedModel(PreTrainedModel): ++++++ config_class = DeepseekConfig ++++++ base_model_prefix = "model" ++++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++ # Initialize weights and apply final processing ++++++ self.post_init() ++++++ self.warm_up = False ++++++- #@dwj ++++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++++++- self.num_layers, ++++++- self.num_attention_heads, ++++++- self.head_dim, ++++++- batch_size=1, ++++++- max_length=self.max_length, ++++++- dtype=mindspore.float16 ++++++- ) ++++++- ++++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++++++- key_cache = [] ++++++- value_cache = [] ++++++- for _ in range(num_layers): ++++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++- key_cache.append(k) ++++++- value_cache.append(v) ++++++- return key_cache, value_cache ++++++- ++++++ ++++++ def warmup_moe_model_deep(self): ++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch ++++++new file mode 100644 ++++++index 00000000..78f22642 ++++++--- /dev/null +++++++++ b/patches/0001-20251104commit.patch ++++++@@ -0,0 +1,1272 @@ +++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++++From: Pinoeer-kingxi <13022943007@163.com> +++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++++Subject: [PATCH 1/3] 20251104commit +++++++ +++++++--- +++++++ mindnlp/transformers/cache_utils.py | 28 +- +++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++++ 3 files changed, 976 insertions(+), 87 deletions(-) +++++++ +++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++++index cadd2e04..02f8d4be 100644 +++++++--- a/mindnlp/transformers/cache_utils.py ++++++++++ b/mindnlp/transformers/cache_utils.py +++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++++ # k_out[:, :, cache_position] = key_states +++++++ # v_out[:, :, cache_position] = value_states +++++++- if ON_ORANGE_PI: +++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++- else: +++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++- ++++++++ # if ON_ORANGE_PI: ++++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++++ # else: ++++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++++ # 确保 cache_position 是 1D tensor 并且类型正确 ++++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ++++++++ if cache_position.ndim > 1: ++++++++ cache_position = cache_position.flatten() ++++++++ # 确保类型是 int32 或 int64(MindSpore 要求) ++++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ++++++++ cache_position = cache_position.int() ++++++++ ++++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ++++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ++++++++ k_out[:, :, cache_position] = key_states ++++++++ v_out[:, :, cache_position] = value_states ++++++++ +++++++ return k_out, v_out +++++++ +++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++index c695b944..d8303e45 100644 +++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +++++++ def rotate_half(x): +++++++ """Rotates half the hidden dims of the input.""" +++++++- x1 = x[..., : x.shape[-1] // 2] +++++++- x2 = x[..., x.shape[-1] // 2 :] ++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++++ # x1 = x[..., : x.shape[-1] // 2] ++++++++ # x2 = x[..., x.shape[-1] // 2 :] ++++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++ return ops.cat((-x2, x1), dim=-1) +++++++ +++++++ +++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++++ if self.training: +++++++ raise NotImplementedError("Training is not supported yet.") +++++++ else: +++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++- if self.config.n_shared_experts is not None: +++++++- y = y + self.shared_experts(identity) +++++++- return y ++++++++ # @lwx ++++++++ if orig_shape[1] == 1: ++++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ++++++++ y=y.view(*orig_shape) ++++++++ if self.config.n_shared_experts is not None: ++++++++ y = y + self.shared_experts(identity) ++++++++ return y ++++++++ else: ++++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ++++++++ if self.config.n_shared_experts is not None: ++++++++ y = y + self.shared_experts(identity) ++++++++ return y ++++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++++ # if self.config.n_shared_experts is not None: ++++++++ # y = y + self.shared_experts(identity) ++++++++ # return y ++++++++ ++++++++ @no_grad() ++++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++++ ++++++++ expert_cache = ops.zeros_like(x) ++++++++ for i in range(self.num_experts_per_tok): ++++++++ expert_id = flat_expert_indices[i].item() ++++++++ weight = flat_expert_weights[i].item() ++++++++ expert = self.experts[expert_id] ++++++++ expert_out = expert(x) ++++++++ expert_cache += expert_out * weight ++++++++ return expert_cache +++++++ +++++++ @no_grad() +++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++- # expert_cache = torch.zeros_like(x) +++++++- # idxs = flat_expert_indices.argsort() +++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++- # token_idxs = idxs // self.num_experts_per_tok +++++++- # for i, end_idx in enumerate(tokens_per_expert): +++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++- # if start_idx == end_idx: +++++++- # continue +++++++- # expert = self.experts[i] +++++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++++- # expert_tokens = x[exp_token_idx] +++++++- # expert_out = expert(expert_tokens) +++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++- # return expert_cache ++++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++ expert_cache = ops.zeros_like(x) +++++++ idxs = flat_expert_indices.argsort() +++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++ token_idxs = idxs // self.num_experts_per_tok ++++++++ +++++++ for i, end_idx in enumerate(tokens_per_expert): +++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++ if start_idx == end_idx: +++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++++ expert_out = expert(expert_tokens) +++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++++ +++++++ return expert_cache ++++++++ ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # # expert_cache = torch.zeros_like(x) ++++++++ # # idxs = flat_expert_indices.argsort() ++++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++++ # # if start_idx == end_idx: ++++++++ # # continue ++++++++ # # expert = self.experts[i] ++++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # # expert_tokens = x[exp_token_idx] ++++++++ # # expert_out = expert(expert_tokens) ++++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++++ # # return expert_cache ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # if start_idx == end_idx: ++++++++ # continue ++++++++ # expert = self.experts[i] ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = expert(expert_tokens) ++++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++++ ++++++++ # return expert_cache ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ ++++++++ # # 排序保证顺序一致 ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # # 找出有 token 的专家 ++++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++++ ++++++++ # for i in active_experts.tolist(): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # end_idx = tokens_per_expert[i] ++++++++ # if start_idx == end_idx: # 没有 token ++++++++ # continue ++++++++ ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = self.experts[i](expert_tokens) ++++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++++ ++++++++ # expert_cache = mindspore.mint.scatter_add( ++++++++ # expert_cache, ++++++++ # 0, ++++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++++ # expert_out ++++++++ # ) ++++++++ ++++++++ # return expert_cache ++++++++ ++++++++ +++++++ +++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++++ # """ +++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++ +++++++ # Initialize weights and apply final processing +++++++ self.post_init() ++++++++ self.warm_up = False ++++++++ ++++++++ def warmup_moe_model_deep(self): ++++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") ++++++++ test_texts = [ ++++++++ "warmup short", ++++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", ++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ++++++++ ] ++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++++ if tokenizer is None: ++++++++ from mindnlp.transformers import AutoTokenizer ++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++++ self._warmup_tokenizer = tokenizer ++++++++ ++++++++ for text in test_texts: ++++++++ inputs = tokenizer(text, return_tensors="ms") ++++++++ with mindspore._no_grad(): ++++++++ _ = self(**inputs, use_cache=False) ++++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++++ +++++++ def get_input_embeddings(self): +++++++ return self.model.embed_tokens +++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++ ```""" ++++++++ if not self.warm_up: ++++++++ self.warm_up = True ++++++++ self.warmup_moe_model_deep() ++++++++ +++++++ output_attentions = ( +++++++ output_attentions +++++++ if output_attentions is not None +++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++index 3cbf820e..d4c6b651 100644 +++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++@@ -18,7 +18,6 @@ +++++++ # See the License for the specific language governing permissions and +++++++ # limitations under the License. +++++++ """MindSpore Qwen2MoE model.""" +++++++- +++++++ import math +++++++ from typing import List, Optional, Tuple, Union +++++++ +++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++++ TokenClassifierOutput, +++++++ ) +++++++ from ...modeling_utils import PreTrainedModel ++++++++from ...generation import GenerationMixin +++++++ from ....utils import logging +++++++ from .configuration_qwen2_moe import Qwen2MoeConfig +++++++ +++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++++ self.variance_epsilon = eps +++++++ +++++++ def forward(self, hidden_states): ++++++++ # @dwj ++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++++ # @lwx ++++++++ # if not self.training : ++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++ input_dtype = hidden_states.dtype +++++++ hidden_states = hidden_states.to(mindspore.float32) +++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++++@@ -234,6 +239,8 @@ def rotate_half(x): +++++++ """Rotates half the hidden dims of the input.""" +++++++ x1 = x[..., : x.shape[-1] // 2] +++++++ x2 = x[..., x.shape[-1] // 2 :] ++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ++++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++ return ops.cat((-x2, x1), dim=-1) +++++++ +++++++ +++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++++ self.config = config +++++++ self.hidden_size = config.hidden_size +++++++ self.intermediate_size = intermediate_size ++++++++ +++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++++ self.act_fn = ACT2FN[config.hidden_act] +++++++ +++++++ def forward(self, x): +++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++- +++++++ ++++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++++ # @lwx ++++++++ # gate_up_output = self.gate_up_proj(x) ++++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ++++++++ # return self.down_proj(swiglu_output) ++++++++ ++++++++ # def forward(self, x): ++++++++ # gate_proj_out = self.gate_proj(x) ++++++++ # up_proj_out = self.up_proj(x) ++++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ++++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ++++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ++++++++ # return self.down_proj(swiglu_out) ++++++++ +++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++++ """ +++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++++ use_cache: bool = False, +++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ ++++++++ +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ query_states = self.q_proj(hidden_states) +++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++ "with a layer index." +++++++ ) +++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ if isinstance(past_key_value, StaticCache): ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ else: ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++ if isinstance(past_key_value, StaticCache): ++++++++ kv_seq_len = key_states.shape[-2] +++++++ +++++++ # repeat k/v heads if n_kv_heads < n_heads +++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++- ++++++++ +++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++ +++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++++- raise ValueError( +++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++++- f" {attn_weights.shape}" +++++++- ) +++++++- +++++++- if attention_mask is not None: # no matter the length, we just slice it +++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ++++++++ if attention_mask is not None: ++++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++ attn_weights = attn_weights + causal_mask +++++++ +++++++ # upcast attention to fp32 +++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++ +++++++ attn_output = self.o_proj(attn_output) +++++++- ++++++++ # @lwx ++++++++ ++++++++ # max_seq_len = self.max_position_embeddings # 2048 ++++++++ ++++++++ # if attention_mask is not None: ++++++++ # # attention_mask: [B, 1, Sq, Sk] ++++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++++ ++++++++ # # pad 到 [max_seq_len, max_seq_len] ++++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++++ # global_attention_mask = padded_mask ++++++++ # else: ++++++++ # global_attention_mask = None ++++++++ ++++++++ ++++++++ # sparse_mode=3 ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # real_shift=None, ++++++++ # padding_mask=None, ++++++++ ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=global_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ # input_layout="BNSD", ++++++++ # pre_tokens=2147483647, ++++++++ # next_tokens=2147483647, ++++++++ # inner_precise=0, ++++++++ # drop_mask=None, ++++++++ # prefix=None, ++++++++ # actual_seq_qlen=None, ++++++++ # actual_seq_kvlen=None, ++++++++ # sparse_mode=sparse_mode, ++++++++ # ) +++++++ if not output_attentions: +++++++ attn_weights = None +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ ++++++++class Qwen2MoeFlashAttention(nn.Module): ++++++++ """ ++++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++++ ++++++++ 关键改动: ++++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++++ 直接传入原始的 key 和 value 张量效率更高。 ++++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++++ """ ++++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++++ super().__init__() ++++++++ self.config = config ++++++++ self.layer_idx = layer_idx ++++++++ self.hidden_size = config.hidden_size ++++++++ self.num_heads = config.num_attention_heads ++++++++ self.head_dim = self.hidden_size // self.num_heads ++++++++ self.num_key_value_heads = config.num_key_value_heads ++++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++ self.max_position_embeddings = config.max_position_embeddings ++++++++ self.rope_theta = config.rope_theta ++++++++ self.attention_dropout = config.attention_dropout ++++++++ ++++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++ raise ValueError( ++++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++++ ) ++++++++ ++++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++++ ++++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++++ self.head_dim, ++++++++ max_position_embeddings=self.max_position_embeddings, ++++++++ base=self.rope_theta, ++++++++ ) ++++++++ ++++++++ def forward( ++++++++ self, ++++++++ hidden_states: mindspore.Tensor, ++++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++++ past_key_value: Optional[Cache] = None, ++++++++ output_attentions: bool = False, ++++++++ use_cache: bool = False, ++++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # 1. 线性投射 Q, K, V ++++++++ query_states = self.q_proj(hidden_states) ++++++++ key_states = self.k_proj(hidden_states) ++++++++ value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++ # query: [B, S, H*D] -> [B, N1, S, D] ++++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # 3. RoPE 旋转位置编码 ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ if past_key_value is not None: ++++++++ if self.layer_idx is None: ++++++++ raise ValueError( ++++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++ "with a layer index." ++++++++ ) ++++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++++ if cache_position.shape[0] == 1: ++++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++++ kv_seq_len = past_seen_tokens + 1 ++++++++ else: ++++++++ # prefill 阶段:cache_position 是范围,使用其长度 ++++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++++ else: ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # 4. KV 缓存更新 ++++++++ if past_key_value is not None: ++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ key_states, value_states = past_key_value.update( ++++++++ key_states, value_states, self.layer_idx, cache_kwargs ++++++++ ) ++++++++ ++++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++ if cache_position.shape[0] == 1: ++++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ ++++++++ # 5. [重要] 准备 Attention Mask ++++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++++ fa_attention_mask = None ++++++++ if attention_mask is not None: ++++++++ # 截取与当前key长度匹配的部分 ++++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++++ fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++++ input_dtype = query_states.dtype ++++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++++ query_states = query_states.to(mindspore.float16) ++++++++ key_states = key_states.to(mindspore.float16) ++++++++ value_states = value_states.to(mindspore.float16) ++++++++ ++++++++ # 6. [核心] 调用 flash_attention_score 算子 ++++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++++ attn_output = mindspore.ops.flash_attention_score( ++++++++ query=query_states, ++++++++ key=key_states, ++++++++ value=value_states, ++++++++ head_num=self.num_heads, # 传入Q的头数(N1) ++++++++ attn_mask=fa_attention_mask, ++++++++ keep_prob=1.0 - self.attention_dropout, ++++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ input_layout="BNSD", ++++++++ sparse_mode=0 # 使用 defaultMask 模式 ++++++++ ) ++++++++ ++++++++ # 恢复原始数据类型 ++++++++ attn_output = attn_output.to(input_dtype) ++++++++ ++++++++ # 7. 调整输出形状 ++++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # FlashAttention 算子不直接返回注意力权重矩阵 ++++++++ attn_weights = None ++++++++ if output_attentions: ++++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++ return attn_output, attn_weights, past_key_value ++++++++ ++++++++ # def forward( ++++++++ # self, ++++++++ # hidden_states: mindspore.Tensor, ++++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++++ # past_key_value: Optional[Cache] = None, ++++++++ # output_attentions: bool = False, ++++++++ # use_cache: bool = False, ++++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ # bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # # 1. 线性投射 Q, K, V ++++++++ # query_states = self.q_proj(hidden_states) ++++++++ # key_states = self.k_proj(hidden_states) ++++++++ # value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # # 3. RoPE 旋转位置编码 ++++++++ # kv_seq_len = key_states.shape[-2] ++++++++ # if past_key_value is not None: ++++++++ # if self.layer_idx is None: ++++++++ # raise ValueError( ++++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++ # "with a layer index." ++++++++ # ) ++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # # 4. KV 缓存更新 ++++++++ # if past_key_value is not None: ++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ # key_states, value_states = past_key_value.update( ++++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++++ # ) ++++++++ ++++++++ # # 5. 准备 Attention Mask ++++++++ # fa_attention_mask = None ++++++++ # if attention_mask is not None: ++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++++ # input_dtype = query_states.dtype ++++++++ ++++++++ # # 6. [核心] 调用 flash_attention_score 算子 ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=fa_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ # input_layout="BNSD", ++++++++ # sparse_mode=0, ++++++++ # # <--- 修改点 2: 启用内部高精度计算 --- ++++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++++ # inner_precise=1 ++++++++ # ) ++++++++ ++++++++ # # 恢复原始数据类型 ++++++++ # attn_output = attn_output.to(input_dtype) ++++++++ ++++++++ # # 7. 调整输出形状 ++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ # attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # attn_weights = None ++++++++ # if output_attentions: ++++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++ # return attn_output, attn_weights, past_key_value ++++++++ ++++++++ # def forward( ++++++++ # self, ++++++++ # hidden_states: mindspore.Tensor, ++++++++ # attention_mask: Optional[mindspore.Tensor] = None, ++++++++ # position_ids: Optional[mindspore.Tensor] = None, ++++++++ # past_key_value: Optional[Cache] = None, ++++++++ # output_attentions: bool = False, ++++++++ # use_cache: bool = False, ++++++++ # cache_position: Optional[mindspore.Tensor] = None, ++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ # bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ # query_states = self.q_proj(hidden_states) ++++++++ # key_states = self.k_proj(hidden_states) ++++++++ # value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ # kv_seq_len = key_states.shape[-2] ++++++++ # if past_key_value is not None: ++++++++ # if self.layer_idx is None: ++++++++ # raise ValueError("`layer_idx` must be specified for caching") ++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ # if past_key_value is not None: ++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++ # key_states, value_states = past_key_value.update( ++++++++ # key_states, value_states, self.layer_idx, cache_kwargs ++++++++ # ) ++++++++ ++++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++ ++++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- ++++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 ++++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ++++++++ # query_states = query_states / math.sqrt(self.head_dim) ++++++++ # # <--- 修改结束 --- ++++++++ ++++++++ # fa_attention_mask = None ++++++++ # if attention_mask is not None: ++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ # fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ # input_dtype = query_states.dtype ++++++++ ++++++++ # attn_output = mindspore.ops.flash_attention_score( ++++++++ # query=query_states, # 传入已经预先缩放过的 query ++++++++ # key=key_states, ++++++++ # value=value_states, ++++++++ # head_num=self.num_heads, ++++++++ # attn_mask=fa_attention_mask, ++++++++ # keep_prob=1.0 - self.attention_dropout, ++++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ++++++++ # input_layout="BNSD", ++++++++ # sparse_mode=0, ++++++++ # inner_precise=1 # 仍然保持内部高精度计算 ++++++++ # ) ++++++++ ++++++++ # attn_output = attn_output.to(input_dtype) ++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ # attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # attn_weights = None ++++++++ # if output_attentions: ++++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++++ ++++++++ # return attn_output, attn_weights, past_key_value ++++++++ +++++++ QWEN2MOE_ATTENTION_CLASSES = { +++++++ "eager": Qwen2MoeAttention, ++++++++ "flash-attention": Qwen2MoeFlashAttention, +++++++ } +++++++ +++++++ +++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ ++++++++ #@dwj ++++++++ # 只遍历激活的专家,而非全部专家 +++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- hidden_states = hidden_states.view(-1, hidden_dim) +++++++- # router_logits: (batch * sequence_length, n_experts) +++++++- router_logits = self.gate(hidden_states) +++++++- +++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- if self.norm_topk_prob: +++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- # we cast back to the input dtype +++++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++- final_hidden_states = ops.zeros( +++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++++- ) +++++++- +++++++- # One hot encode the selected experts to create an expert mask +++++++- # this will be used to easily index which expert is going to be sollicitated +++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++++- +++++++- # Loop over all available experts in the model and perform the computation on each expert +++++++- for expert_idx in range(self.num_experts): +++++++- expert_layer = self.experts[expert_idx] +++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++++- +++++++- # Index the correct hidden states and compute the expert hidden state for +++++++- # the current expert. We need to make sure to multiply the output hidden +++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++++- if 0 not in idx.shape: +++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++++- +++++++- # However `index_add_` only support torch tensors for indexing so we'll use +++++++- # the `top_x` tensor here. +++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++++- +++++++- shared_expert_output = self.shared_expert(hidden_states) +++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++++- +++++++- final_hidden_states = final_hidden_states + shared_expert_output ++++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++ num_tokens = hidden_states_reshaped.shape[0] ++++++++ ++++++++ router_logits = self.gate(hidden_states_reshaped) ++++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++ ++++++++ if self.norm_topk_prob: ++++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++++ flat_selected_experts = selected_experts.flatten() ++++++++ ++++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++++ token_indices = broadcasted_token_indices.flatten() ++++++++ ++++++++ active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++ for expert_idx_tensor in active_experts: ++++++++ expert_idx = expert_idx_tensor.item() ++++++++ expert_layer = self.experts[expert_idx] ++++++++ ++++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++++ selected_token_indices = token_indices[mask] ++++++++ selected_routing_weights = routing_weights.flatten()[mask] ++++++++ ++++++++ current_states = hidden_states_reshaped[selected_token_indices] ++++++++ ++++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++ ++++++++ final_hidden_states = final_hidden_states.index_add( ++++++++ dim=0, ++++++++ index=selected_token_indices, ++++++++ source=expert_output.to(hidden_states.dtype) ++++++++ ) ++++++++ ++++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++++ +++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++- return final_hidden_states, router_logits ++++++++ final_hidden_states = final_hidden_states + shared_expert_output ++++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++ return final_hidden_states, router_logits +++++++ +++++++ +++++++ class Qwen2MoeDecoderLayer(nn.Module): +++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++++ +++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ ++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++++ +++++++ if (layer_idx not in config.mlp_only_layers) and ( +++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++++ ): +++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++++ _skip_keys_device_placement = "past_key_values" +++++++ _supports_cache_class = True ++++++++#lwx ++++++++ # _supports_static_cache = True +++++++ +++++++ def _init_weights(self, module): +++++++ std = self.config.initializer_range +++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++++ return causal_mask +++++++ +++++++ +++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ _tied_weights_keys = ["lm_head.weight"] +++++++ +++++++ def __init__(self, config): +++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ self.num_experts_per_tok = config.num_experts_per_tok +++++++ # Initialize weights and apply final processing +++++++ self.post_init() ++++++++ # @lwx ++++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ++++++++ # self.generation_config.cache_implementation = "static" ++++++++ self._warmed_up = False ++++++++ ++++++++ def warmup_moe_model(self): ++++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") ++++++++ test_texts = [ ++++++++ "warmup short", ++++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ++++++++ ] ++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) ++++++++ if tokenizer is None: ++++++++ from mindnlp.transformers import AutoTokenizer ++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ++++++++ self._warmup_tokenizer = tokenizer ++++++++ ++++++++ for text in test_texts: ++++++++ inputs = tokenizer(text, return_tensors="ms") ++++++++ with mindspore._no_grad(): ++++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) ++++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++++ +++++++ def get_input_embeddings(self): +++++++ return self.model.embed_tokens +++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++ ```""" ++++++++ if not self._warmed_up: ++++++++ self._warmed_up = True ++++++++ self.warmup_moe_model() +++++++ +++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++++ output_router_logits = ( +++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++ } +++++++ ) +++++++ return model_inputs ++++++++# @lwx ++++++++ # def _decode_one_tokens_logits( ++++++++ # self, ++++++++ # cur_token: mindspore.Tensor, ++++++++ # input_pos: Optional[mindspore.Tensor], ++++++++ # cache_position: mindspore.Tensor, ++++++++ # past_key_values: StaticCache, ++++++++ # ) -> mindspore.Tensor: ++++++++ # """ ++++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ++++++++ ++++++++ # Args: ++++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) ++++++++ # input_pos: 输入位置信息,可选 ++++++++ # cache_position: 当前token在cache中的位置,shape为(1,) ++++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 ++++++++ ++++++++ # Returns: ++++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) ++++++++ # """ ++++++++ # # 调用JIT编译的版本 ++++++++ # return self.get_decode_one_tokens_logits( ++++++++ # cur_token=cur_token, ++++++++ # input_pos=input_pos, ++++++++ # cache_position=cache_position, ++++++++ # past_key_values=past_key_values, ++++++++ # ) ++++++++ ++++++++ # @mindspore.jit(jit_level='O1') ++++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ++++++++ # """ ++++++++ # JIT编译的函数,用于高效的单token解码 ++++++++ # 使用JIT编译优化以支持静态shape和高效执行 ++++++++ ++++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ++++++++ # """ ++++++++ # outputs = self.model.forward( ++++++++ # input_ids=cur_token, ++++++++ # position_ids=input_pos, ++++++++ # cache_position=cache_position, ++++++++ # past_key_values=past_key_values, ++++++++ # use_cache=True, ++++++++ # return_dict=False, ++++++++ # ) ++++++++ ++++++++ # hidden_states = outputs[0] ++++++++ # logits = self.lm_head.forward(hidden_states) ++++++++ # logits = logits.float() ++++++++ ++++++++ # return logits[:, -1, :] ++++++++ ++++++++ # def _sample( ++++++++ # self, ++++++++ # input_ids: mindspore.Tensor, ++++++++ # logits_processor, ++++++++ # stopping_criteria, ++++++++ # generation_config, ++++++++ # synced_devices: bool, ++++++++ # streamer=None, ++++++++ # logits_warper=None, ++++++++ # **model_kwargs, ++++++++ # ): ++++++++ # """ ++++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ++++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ++++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ++++++++ # """ ++++++++ # from ...generation.logits_process import LogitsProcessorList ++++++++ # from ...generation.stopping_criteria import StoppingCriteriaList ++++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ++++++++ # from mindnlp.core import nn, ops, no_grad ++++++++ # import numpy as np ++++++++ ++++++++ # # 检查是否使用 StaticCache ++++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ++++++++ # # 否则,直接调用父类方法 ++++++++ # past_key_values = model_kwargs.get("past_key_values") ++++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ++++++++ ++++++++ # if not isinstance(past_key_values, StaticCache): ++++++++ # # 不使用 StaticCache,直接调用父类方法 ++++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ++++++++ # return super()._sample( ++++++++ # input_ids=input_ids, ++++++++ # logits_processor=logits_processor, ++++++++ # stopping_criteria=stopping_criteria, ++++++++ # generation_config=generation_config, ++++++++ # synced_devices=synced_devices, ++++++++ # streamer=streamer, ++++++++ # logits_warper=logits_warper, ++++++++ # **model_kwargs, ++++++++ # ) ++++++++ ++++++++ # # 使用 StaticCache,进入自定义循环 ++++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ++++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ++++++++ # pad_token_id = generation_config._pad_token_tensor ++++++++ # output_attentions = generation_config.output_attentions ++++++++ # output_hidden_states = generation_config.output_hidden_states ++++++++ # output_scores = generation_config.output_scores ++++++++ # output_logits = generation_config.output_logits ++++++++ # return_dict_in_generate = generation_config.return_dict_in_generate ++++++++ # max_length = generation_config.max_length ++++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ++++++++ # do_sample = generation_config.do_sample ++++++++ ++++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ++++++++ # raise ValueError( ++++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ++++++++ # f"{logits_warper})." ++++++++ # ) ++++++++ ++++++++ # # init attention / hidden states / scores tuples ++++++++ # scores = () if (return_dict_in_generate and output_scores) else None ++++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None ++++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ++++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ++++++++ ++++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ++++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: ++++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ++++++++ # encoder_hidden_states = ( ++++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ++++++++ # ) ++++++++ ++++++++ # # keep track of which sequences are already finished ++++++++ # batch_size, cur_len = input_ids.shape ++++++++ # this_peer_finished = False ++++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ++++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ++++++++ ++++++++ # time_record = [] ++++++++ # from ....utils.testing_utils import parse_flag_from_env ++++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ++++++++ ++++++++ # while self._has_unfinished_sequences( ++++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ++++++++ # ): ++++++++ # if _record_time: ++++++++ # import time as time_module ++++++++ # infer_start = time_module.time() ++++++++ ++++++++ # # prepare model inputs ++++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ++++++++ ++++++++ # # prepare variable output controls ++++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ++++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ++++++++ ++++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ++++++++ # cur_cache_position = model_inputs.get("cache_position") ++++++++ # cur_past_key_values = model_inputs.get("past_key_values") ++++++++ # cur_input_ids = model_inputs.get("input_ids") ++++++++ ++++++++ # if (isinstance(cur_past_key_values, StaticCache) and ++++++++ # cur_cache_position is not None and ++++++++ # len(cur_cache_position.shape) > 0 and ++++++++ # cur_cache_position.shape[0] == 1 and ++++++++ # cur_input_ids is not None and ++++++++ # cur_input_ids.shape[1] == 1): ++++++++ # # 使用 JIT 优化的单 token 解码 ++++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ++++++++ # if not hasattr(self, '_jit_used'): ++++++++ # self._jit_used = False ++++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ++++++++ ++++++++ # next_token_logits = self.get_decode_one_tokens_logits( ++++++++ # cur_token=cur_input_ids, ++++++++ # input_pos=model_inputs.get("position_ids"), ++++++++ # cache_position=cur_cache_position, ++++++++ # past_key_values=cur_past_key_values, ++++++++ # ) ++++++++ ++++++++ # # 标记已使用JIT(用于后续判断) ++++++++ # if not self._jit_used: ++++++++ # self._jit_used = True ++++++++ ++++++++ # # 构造兼容的输出对象 ++++++++ # class JitOptimizedOutput: ++++++++ # def __init__(self, logits, config): ++++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ++++++++ # self.config = config ++++++++ # # 对于 JIT 优化路径,这些属性通常不需要 ++++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None ++++++++ # self.attentions = None if not config.is_encoder_decoder else None ++++++++ # self.cross_attentions = None ++++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ++++++++ # self.hidden_states = None if not config.is_encoder_decoder else None ++++++++ ++++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) ++++++++ # else: ++++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ++++++++ # outputs = self(**model_inputs, return_dict=True) ++++++++ ++++++++ # if synced_devices and this_peer_finished: ++++++++ # continue ++++++++ ++++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ++++++++ # next_token_logits = outputs.logits[:, -1, :] ++++++++ ++++++++ # # pre-process distribution ++++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) ++++++++ # if do_sample: ++++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) ++++++++ ++++++++ # # Store scores, attentions and hidden_states when required ++++++++ # if return_dict_in_generate: ++++++++ # if output_scores: ++++++++ # scores += (next_token_scores,) ++++++++ # if output_logits: ++++++++ # raw_logits += (next_token_logits,) ++++++++ # if output_attentions: ++++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ++++++++ # decoder_attentions += (attn,) if attn is not None else (None,) ++++++++ # if self.config.is_encoder_decoder: ++++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ++++++++ ++++++++ # if output_hidden_states: ++++++++ # hidden = ( ++++++++ # outputs.decoder_hidden_states ++++++++ # if self.config.is_encoder_decoder ++++++++ # else outputs.hidden_states ++++++++ # ) ++++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ++++++++ ++++++++ # # token selection ++++++++ # if do_sample: ++++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) ++++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ++++++++ # else: ++++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) ++++++++ ++++++++ # # finished sentences should have their next token be a padding token ++++++++ # if has_eos_stopping_criteria: ++++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ++++++++ ++++++++ # # update generated ids, model inputs, and length for next step ++++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ++++++++ # if streamer is not None: ++++++++ # streamer.put(next_tokens) ++++++++ ++++++++ # model_kwargs = self._update_model_kwargs_for_generation( ++++++++ # outputs, ++++++++ # model_kwargs, ++++++++ # is_encoder_decoder=self.config.is_encoder_decoder, ++++++++ # ) ++++++++ ++++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ++++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ++++++++ # cur_len += 1 ++++++++ ++++++++ # if _record_time: ++++++++ # import time as time_module ++++++++ # infer_stop = time_module.time() ++++++++ # time_record.append(infer_stop - infer_start) ++++++++ ++++++++ # del outputs ++++++++ ++++++++ # average_infer_time = None ++++++++ # if time_record: ++++++++ # if len(time_record) > 1: ++++++++ # time_record.pop(0) ++++++++ # average_infer_time = sum(time_record) / len(time_record) ++++++++ # print(f'average inference time is: {average_infer_time}') ++++++++ # print(f'inference time record: {time_record}') ++++++++ ++++++++ # if streamer is not None: ++++++++ # streamer.end() ++++++++ ++++++++ # # 简单判断:打印是否使用了JIT路径 ++++++++ # if hasattr(self, '_jit_used') and self._jit_used: ++++++++ # print("[JIT] ✓ JIT optimization was used during generation") ++++++++ # else: ++++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ++++++++ ++++++++ # if return_dict_in_generate: ++++++++ # if self.config.is_encoder_decoder: ++++++++ # return GenerateEncoderDecoderOutput( ++++++++ # sequences=input_ids, ++++++++ # scores=scores, ++++++++ # logits=raw_logits, ++++++++ # encoder_attentions=encoder_attentions, ++++++++ # encoder_hidden_states=encoder_hidden_states, ++++++++ # decoder_attentions=decoder_attentions, ++++++++ # cross_attentions=cross_attentions, ++++++++ # decoder_hidden_states=decoder_hidden_states, ++++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++++ # average_infer_time=average_infer_time ++++++++ # ) ++++++++ # else: ++++++++ # return GenerateDecoderOnlyOutput( ++++++++ # sequences=input_ids, ++++++++ # scores=scores, ++++++++ # logits=raw_logits, ++++++++ # attentions=decoder_attentions, ++++++++ # hidden_states=decoder_hidden_states, ++++++++ # past_key_values=model_kwargs.get("past_key_values"), ++++++++ # average_infer_time=average_infer_time ++++++++ # ) ++++++++ # else: ++++++++ # return input_ids ++++++++ ++++++++ # def _prepare_cache_for_generation( ++++++++ # self, ++++++++ # generation_config, ++++++++ # model_kwargs, ++++++++ # assistant_model, ++++++++ # batch_size, ++++++++ # max_cache_length, ++++++++ # ): ++++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: ++++++++ # generation_config.cache_implementation = "static" ++++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ++++++++ ++++++++ # if generation_config.cache_implementation == "static": ++++++++ # base_required_from_max_length = generation_config.max_length + 1 ++++++++ # base_required = max(max_cache_length, base_required_from_max_length) ++++++++ # min_cache_size = 50 ++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ++++++++ # else: ++++++++ # max_cache_length = max(base_required, min_cache_size) ++++++++ ++++++++ # original_max_cache_length = max_cache_length ++++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") ++++++++ # print(f" - input max_cache_length: {original_max_cache_length}") ++++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") ++++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ++++++++ # print(f" - final max_cache_length: {max_cache_length}") ++++++++ ++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ++++++++ # if max_cache_length > self.config.max_position_embeddings: ++++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ++++++++ ++++++++ # result = super()._prepare_cache_for_generation( ++++++++ # generation_config=generation_config, ++++++++ # model_kwargs=model_kwargs, ++++++++ # assistant_model=assistant_model, ++++++++ # batch_size=batch_size, ++++++++ # max_cache_length=max_cache_length, ++++++++ # ) ++++++++ ++++++++ # if generation_config.cache_implementation == "static": ++++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ++++++++ # created_cache = model_kwargs.get(cache_name) ++++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ++++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ++++++++ # if created_cache.max_cache_len < generation_config.max_length: ++++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ++++++++ ++++++++ # return result ++++++++ ++++++++ ++++++++ +++++++ +++++++ +++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++++-- +++++++2.27.0 +++++++ ++++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch ++++++new file mode 100644 ++++++index 00000000..22b65dd5 ++++++--- /dev/null +++++++++ b/patches/0002-20251106commit.patch ++++++@@ -0,0 +1,3200 @@ +++++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +++++++From: Pinoeer-kingxi <13022943007@163.com> +++++++Date: Thu, 6 Nov 2025 09:20:38 +0800 +++++++Subject: [PATCH 2/3] 20251106commit +++++++ +++++++--- +++++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +++++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +++++++ 3 files changed, 2689 insertions(+), 305 deletions(-) +++++++ create mode 100644 patches/0001-20251104commit.patch +++++++ +++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++index d8303e45..73773c22 100644 +++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +++++++ # y = y + self.shared_experts(identity) +++++++ # return y +++++++ ++++++++ # @no_grad() ++++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++++ ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ # for i in range(self.num_experts_per_tok): ++++++++ # expert_id = flat_expert_indices[i].item() ++++++++ # weight = flat_expert_weights[i].item() ++++++++ # expert = self.experts[expert_id] ++++++++ # expert_out = expert(x) ++++++++ # expert_cache += expert_out * weight ++++++++ # return expert_cache ++++++++ +++++++ @no_grad() +++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # x 的 shape: (1, hidden_size) ++++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) ++++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) ++++++++ ++++++++ # 1. 收集所有需要的专家层 ++++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 ++++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] ++++++++ ++++++++ # 2. 并行计算所有专家的输出 ++++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors ++++++++ # ops.cat 会将它们堆叠成一个新的 Tensor ++++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) ++++++++ ++++++++ # 3. 使用矩阵乘法进行加权求和 ++++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) ++++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) ++++++++ # 最终结果 final_output 的 shape: (1, hidden_size) ++++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) ++++++++ ++++++++ return final_output +++++++ +++++++- expert_cache = ops.zeros_like(x) +++++++- for i in range(self.num_experts_per_tok): +++++++- expert_id = flat_expert_indices[i].item() +++++++- weight = flat_expert_weights[i].item() +++++++- expert = self.experts[expert_id] +++++++- expert_out = expert(x) +++++++- expert_cache += expert_out * weight +++++++- return expert_cache +++++++ +++++++ @no_grad() +++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++++ # @lwx ++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) ++++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) ++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) ++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) ++++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +++++++ +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ ++++++++# class DeepseekFlashAttention(nn.Module): ++++++++# """ ++++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ++++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. ++++++++ ++++++++# This class is designed as a drop-in replacement for DeepseekAttention. ++++++++# """ ++++++++ ++++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++++++# super().__init__() ++++++++# self.config = config ++++++++# self.layer_idx = layer_idx ++++++++# if layer_idx is None: ++++++++# logger.warning( ++++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++++# "when creating this class." ++++++++# ) ++++++++ ++++++++# self.attention_dropout = config.attention_dropout ++++++++# self.hidden_size = config.hidden_size ++++++++# self.num_heads = config.num_attention_heads ++++++++# self.head_dim = self.hidden_size // self.num_heads ++++++++# self.num_key_value_heads = config.num_key_value_heads ++++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++# self.max_position_embeddings = config.max_position_embeddings ++++++++# self.rope_theta = config.rope_theta ++++++++# self.is_causal = True ++++++++ ++++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++# raise ValueError( ++++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++++# f" and `num_heads`: {self.num_heads})." ++++++++# ) ++++++++ ++++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++++++# self._init_rope() ++++++++ ++++++++# def _init_rope(self): ++++++++# if self.config.rope_scaling is None: ++++++++# self.rotary_emb = DeepseekRotaryEmbedding( ++++++++# self.head_dim, ++++++++# max_position_embeddings=self.max_position_embeddings, ++++++++# base=self.rope_theta, ++++++++# ) ++++++++# else: ++++++++# scaling_type = self.config.rope_scaling["type"] ++++++++# scaling_factor = self.config.rope_scaling["factor"] ++++++++# if scaling_type == "linear": ++++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++++++# self.head_dim, ++++++++# max_position_embeddings=self.max_position_embeddings, ++++++++# scaling_factor=scaling_factor, ++++++++# base=self.rope_theta, ++++++++# ) ++++++++# elif scaling_type == "dynamic": ++++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++++++# self.head_dim, ++++++++# max_position_embeddings=self.max_position_embeddings, ++++++++# scaling_factor=scaling_factor, ++++++++# base=self.rope_theta, ++++++++# ) ++++++++# else: ++++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++++++ ++++++++# def forward( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++++# past_key_value: Optional[Cache] = None, ++++++++# output_attentions: bool = False, ++++++++# use_cache: bool = False, ++++++++# **kwargs, ++++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++# if "padding_mask" in kwargs: ++++++++# warnings.warn( ++++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++++++# ) ++++++++ ++++++++# if output_attentions: ++++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") ++++++++ ++++++++# bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++# if self.config.pretraining_tp > 1: ++++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++++++ ++++++++# query_states = self.q_proj(hidden_states) ++++++++# key_states = self.k_proj(hidden_states) ++++++++# value_states = self.v_proj(hidden_states) ++++++++ ++++++++# # Reshape for multi-head attention ++++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++# kv_seq_len = key_states.shape[-2] ++++++++# if past_key_value is not None: ++++++++# if self.layer_idx is None: ++++++++# raise ValueError( ++++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++# "with a layer index." ++++++++# ) ++++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++# # Apply Rotary Positional Embedding ++++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++# if past_key_value is not None: ++++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ++++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ++++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ++++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ ++++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++++ ++++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ++++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) ++++++++ ++++++++# # Convert attention_mask for flash_attention_score ++++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ++++++++# if attention_mask is not None: ++++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ++++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++++++# raise ValueError( ++++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++++++# ) ++++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ++++++++# else: ++++++++# attn_mask_for_fa = None ++++++++ ++++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++++++ ++++++++# # Call the fused flash_attention_score operator ++++++++# attn_output = mindspore.ops.flash_attention_score( ++++++++# query=query_states_for_fa, ++++++++# key=key_states_for_fa, ++++++++# value=value_states_for_fa, ++++++++# head_num=self.num_heads, # This is N1, the number of query heads ++++++++# input_layout='BSH', ++++++++# attn_mask=attn_mask_for_fa, ++++++++# keep_prob=keep_prob, ++++++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++# sparse_mode=0 # Default mask mode ++++++++# ) ++++++++ ++++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ++++++++# attn_output = self.o_proj(attn_output) ++++++++ ++++++++# # Flash Attention does not return attention weights ++++++++# attn_weights = None ++++++++ ++++++++# return attn_output, attn_weights, past_key_value ++++++++ ++++++++class DeepseekFlashAttention(nn.Module): ++++++++ """ ++++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. ++++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, ++++++++ designed for high performance on supported hardware (Ascend). ++++++++ ++++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. ++++++++ """ ++++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ++++++++ super().__init__() ++++++++ self.config = config ++++++++ self.layer_idx = layer_idx ++++++++ if layer_idx is None: ++++++++ logger.warning( ++++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++++ "when creating this class." ++++++++ ) ++++++++ ++++++++ # --- [FIX] Correctly initialize all required attributes --- ++++++++ self.attention_dropout = config.attention_dropout ++++++++ self.hidden_size = config.hidden_size ++++++++ self.num_heads = config.num_attention_heads ++++++++ self.head_dim = self.hidden_size // self.num_heads ++++++++ self.num_key_value_heads = config.num_key_value_heads ++++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++ self.max_position_embeddings = config.max_position_embeddings ++++++++ self.rope_theta = config.rope_theta ++++++++ self.is_causal = True ++++++++ ++++++++ if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++ raise ValueError( ++++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++++ f" and `num_heads`: {self.num_heads})." ++++++++ ) ++++++++ ++++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ++++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ++++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ++++++++ ++++++++ # This call will now succeed as all attributes are initialized. ++++++++ self._init_rope() ++++++++ ++++++++ def _init_rope(self): ++++++++ if self.config.rope_scaling is None: ++++++++ self.rotary_emb = DeepseekRotaryEmbedding( ++++++++ self.head_dim, ++++++++ max_position_embeddings=self.max_position_embeddings, ++++++++ base=self.rope_theta, ++++++++ ) ++++++++ else: ++++++++ scaling_type = self.config.rope_scaling["type"] ++++++++ scaling_factor = self.config.rope_scaling["factor"] ++++++++ if scaling_type == "linear": ++++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ++++++++ self.head_dim, ++++++++ max_position_embeddings=self.max_position_embeddings, ++++++++ scaling_factor=scaling_factor, ++++++++ base=self.rope_theta, ++++++++ ) ++++++++ elif scaling_type == "dynamic": ++++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ++++++++ self.head_dim, ++++++++ max_position_embeddings=self.max_position_embeddings, ++++++++ scaling_factor=scaling_factor, ++++++++ base=self.rope_theta, ++++++++ ) ++++++++ else: ++++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") ++++++++ ++++++++ def forward( ++++++++ self, ++++++++ hidden_states: mindspore.Tensor, ++++++++ attention_mask: Optional[mindspore.Tensor] = None, ++++++++ position_ids: Optional[mindspore.Tensor] = None, ++++++++ past_key_value: Optional[Cache] = None, ++++++++ output_attentions: bool = False, ++++++++ use_cache: bool = False, ++++++++ **kwargs, ++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ if "padding_mask" in kwargs: ++++++++ warnings.warn( ++++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ++++++++ ) ++++++++ if output_attentions: ++++++++ warnings.warn( ++++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." ++++++++ ) ++++++++ ++++++++ bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ if self.config.pretraining_tp > 1: ++++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") ++++++++ ++++++++ query_states = self.q_proj(hidden_states) ++++++++ key_states = self.k_proj(hidden_states) ++++++++ value_states = self.v_proj(hidden_states) ++++++++ ++++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) ++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++ kv_seq_len = key_states.shape[-2] ++++++++ if past_key_value is not None: ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++ # Apply Rotary Position Embedding ++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ if past_key_value is not None: ++++++++ cache_kwargs = {"sin": sin, "cos": cos} ++++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. ++++++++ # So we must explicitly repeat the KV heads. ++++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++ ++++++++ # Convert attention mask for flash_attention_score ++++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. ++++++++ if attention_mask is not None: ++++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ++++++++ raise ValueError( ++++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ++++++++ ) ++++++++ attn_mask_for_fa = attention_mask < 0 ++++++++ else: ++++++++ attn_mask_for_fa = None ++++++++ ++++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 ++++++++ ++++++++ # Call the fused operator using the efficient BNSD layout ++++++++ attn_output = mindspore.ops.flash_attention_score( ++++++++ query=query_states, ++++++++ key=key_states, ++++++++ value=value_states, ++++++++ head_num=self.num_heads, ++++++++ input_layout='BNSD', # Specify the correct layout ++++++++ attn_mask=attn_mask_for_fa, ++++++++ keep_prob=keep_prob, ++++++++ scalar_value=1.0 / math.sqrt(self.head_dim) ++++++++ ) ++++++++ ++++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. ++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ ++++++++ # Apply output projection ++++++++ attn_output = self.o_proj(attn_output) ++++++++ ++++++++ # Flash attention does not return attention weights, so we return None. ++++++++ attn_weights = None ++++++++ ++++++++ return attn_output, attn_weights, past_key_value ++++++++ +++++++ Deepseek_ATTENTION_CLASSES = { +++++++ "eager": DeepseekAttention, ++++++++ "flash-attention": DeepseekFlashAttention, +++++++ } +++++++ +++++++ +++++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +++++++ config=config, layer_idx=layer_idx +++++++ ) +++++++ ++++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( ++++++++ config=config, layer_idx=layer_idx ++++++++ ) ++++++++ +++++++ self.mlp = ( +++++++ DeepseekMoE(config) +++++++ if ( +++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++index d4c6b651..bced285c 100644 +++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +++++++ +++++++ import mindspore +++++++ import mindnlp.core.nn.functional as F +++++++-from mindnlp.core import nn, ops ++++++++from mindnlp.core import nn, ops, no_grad +++++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +++++++ +++++++ from ....common.activations import ACT2FN +++++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++++++ ++++++++Long_Prompt = False ++++++++PROMPT_LENGTH_THRESHOLD = 128 +++++++ +++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++++++ def _prepare_4d_causal_attention_mask_with_cache_position( +++++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++ ++++++++# class Qwen2MoeFlashAttention(nn.Module): ++++++++# """ ++++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ++++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ++++++++ ++++++++# 关键改动: ++++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ++++++++# 直接传入原始的 key 和 value 张量效率更高。 ++++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ++++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++++# super().__init__() ++++++++# self.config = config ++++++++# self.layer_idx = layer_idx ++++++++# self.hidden_size = config.hidden_size ++++++++# self.num_heads = config.num_attention_heads ++++++++# self.head_dim = self.hidden_size // self.num_heads ++++++++# self.num_key_value_heads = config.num_key_value_heads ++++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++# self.max_position_embeddings = config.max_position_embeddings ++++++++# self.rope_theta = config.rope_theta ++++++++# self.attention_dropout = config.attention_dropout ++++++++ ++++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++# raise ValueError( ++++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ++++++++# ) ++++++++ ++++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++++ ++++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++++# self.head_dim, ++++++++# max_position_embeddings=self.max_position_embeddings, ++++++++# base=self.rope_theta, ++++++++# ) ++++++++ ++++++++# def forward( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++++# past_key_value: Optional[Cache] = None, ++++++++# output_attentions: bool = False, ++++++++# use_cache: bool = False, ++++++++# cache_position: Optional[mindspore.Tensor] = None, ++++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++# bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++# # 1. 线性投射 Q, K, V ++++++++# query_states = self.q_proj(hidden_states) ++++++++# key_states = self.k_proj(hidden_states) ++++++++# value_states = self.v_proj(hidden_states) ++++++++ ++++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++# # query: [B, S, H*D] -> [B, N1, S, D] ++++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++# # 3. RoPE 旋转位置编码 ++++++++# kv_seq_len = key_states.shape[-2] ++++++++# if past_key_value is not None: ++++++++# if self.layer_idx is None: ++++++++# raise ValueError( ++++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++# "with a layer index." ++++++++# ) ++++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len ++++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ++++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len ++++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ++++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ++++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ++++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ++++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) ++++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ++++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ++++++++# if cache_position.shape[0] == 1: ++++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ++++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ++++++++# kv_seq_len = past_seen_tokens + 1 ++++++++# else: ++++++++# # prefill 阶段:cache_position 是范围,使用其长度 ++++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens ++++++++# else: ++++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++# # 4. KV 缓存更新 ++++++++# if past_key_value is not None: ++++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++# key_states, value_states = past_key_value.update( ++++++++# key_states, value_states, self.layer_idx, cache_kwargs ++++++++# ) ++++++++ ++++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ++++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ++++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: ++++++++# if cache_position.shape[0] == 1: ++++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ++++++++# kv_seq_len = key_states.shape[-2] ++++++++ ++++++++# # 5. [重要] 准备 Attention Mask ++++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ++++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++++# fa_attention_mask = None ++++++++# if attention_mask is not None: ++++++++# # 截取与当前key长度匹配的部分 ++++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ++++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ++++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False ++++++++# fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ++++++++# input_dtype = query_states.dtype ++++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): ++++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ++++++++# query_states = query_states.to(mindspore.float16) ++++++++# key_states = key_states.to(mindspore.float16) ++++++++# value_states = value_states.to(mindspore.float16) ++++++++ ++++++++# # 6. [核心] 调用 flash_attention_score 算子 ++++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA ++++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++++# attn_output = mindspore.ops.flash_attention_score( ++++++++# query=query_states, ++++++++# key=key_states, ++++++++# value=value_states, ++++++++# head_num=self.num_heads, # 传入Q的头数(N1) ++++++++# attn_mask=fa_attention_mask, ++++++++# keep_prob=1.0 - self.attention_dropout, ++++++++# scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++# input_layout="BNSD", ++++++++# sparse_mode=0 # 使用 defaultMask 模式 ++++++++# ) ++++++++ ++++++++# # 恢复原始数据类型 ++++++++# attn_output = attn_output.to(input_dtype) ++++++++ ++++++++# # 7. 调整输出形状 ++++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++# attn_output = self.o_proj(attn_output) ++++++++ ++++++++# # FlashAttention 算子不直接返回注意力权重矩阵 ++++++++# attn_weights = None ++++++++# if output_attentions: ++++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++# return attn_output, attn_weights, past_key_value ++++++++ ++++++++# # def forward( ++++++++# # self, ++++++++# # hidden_states: mindspore.Tensor, ++++++++# # attention_mask: Optional[mindspore.Tensor] = None, ++++++++# # position_ids: Optional[mindspore.Tensor] = None, ++++++++# # past_key_value: Optional[Cache] = None, ++++++++# # output_attentions: bool = False, ++++++++# # use_cache: bool = False, ++++++++# # cache_position: Optional[mindspore.Tensor] = None, ++++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++# # bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++# # # 1. 线性投射 Q, K, V ++++++++# # query_states = self.q_proj(hidden_states) ++++++++# # key_states = self.k_proj(hidden_states) ++++++++# # value_states = self.v_proj(hidden_states) ++++++++ ++++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ++++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ ++++++++# # # 3. RoPE 旋转位置编码 ++++++++# # kv_seq_len = key_states.shape[-2] ++++++++# # if past_key_value is not None: ++++++++# # if self.layer_idx is None: ++++++++# # raise ValueError( ++++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++# # "with a layer index." ++++++++# # ) ++++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ ++++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++# # # 4. KV 缓存更新 ++++++++# # if past_key_value is not None: ++++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ++++++++# # key_states, value_states = past_key_value.update( ++++++++# # key_states, value_states, self.layer_idx, cache_kwargs ++++++++# # ) ++++++++ ++++++++# # # 5. 准备 Attention Mask ++++++++# # fa_attention_mask = None ++++++++# # if attention_mask is not None: ++++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++# # fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ++++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ++++++++# # input_dtype = query_states.dtype ++++++++ ++++++++# # # 6. [核心] 调用 flash_attention_score 算子 ++++++++# # attn_output = mindspore.ops.flash_attention_score( ++++++++# # query=query_states, ++++++++# # key=key_states, ++++++++# # value=value_states, ++++++++# # head_num=self.num_heads, ++++++++# # attn_mask=fa_attention_mask, ++++++++# # keep_prob=1.0 - self.attention_dropout, ++++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++# # input_layout="BNSD", ++++++++# # sparse_mode=0, ++++++++# # # <--- 修改点 2: 启用内部高精度计算 --- ++++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ++++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ++++++++# # inner_precise=1 ++++++++# # ) ++++++++ ++++++++# # # 恢复原始数据类型 ++++++++# # attn_output = attn_output.to(input_dtype) ++++++++ ++++++++# # # 7. 调整输出形状 ++++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++# # attn_output = self.o_proj(attn_output) ++++++++ ++++++++# # attn_weights = None ++++++++# # if output_attentions: ++++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ ++++++++# # return attn_output, attn_weights, past_key_value ++++++++ ++++++++ +++++++ class Qwen2MoeFlashAttention(nn.Module): +++++++ """ +++++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++- +++++++- 关键改动: +++++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++- 直接传入原始的 key 和 value 张量效率更高。 +++++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ++++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 ++++++++ ++++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` ++++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, ++++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, ++++++++ 以达到理论上的最高执行速度。 +++++++ """ +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++ super().__init__() +++++++ self.config = config +++++++ self.layer_idx = layer_idx ++++++++ if layer_idx is None: ++++++++ logger.warning_once( ++++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." ++++++++ ) ++++++++ +++++++ self.hidden_size = config.hidden_size +++++++ self.num_heads = config.num_attention_heads +++++++ self.head_dim = self.hidden_size // self.num_heads +++++++ self.num_key_value_heads = config.num_key_value_heads +++++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++ self.max_position_embeddings = config.max_position_embeddings +++++++ self.rope_theta = config.rope_theta +++++++ self.attention_dropout = config.attention_dropout +++++++ +++++++- if (self.head_dim * self.num_heads) != self.hidden_size: +++++++- raise ValueError( +++++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++- ) +++++++- +++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++- # query: [B, S, H*D] -> [B, N1, S, D] +++++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] ++++++++ # 2. 调整形状以匹配 BNSD 布局 +++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- +++++++- # 3. RoPE 旋转位置编码 ++++++++ ++++++++ # 3. RoPE 和 KV 缓存 +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++- if self.layer_idx is None: +++++++- raise ValueError( +++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++- "with a layer index." +++++++- ) +++++++- # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++- if cache_position.shape[0] == 1: +++++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++- kv_seq_len = past_seen_tokens + 1 +++++++- else: +++++++- # prefill 阶段:cache_position 是范围,使用其长度 +++++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++- else: +++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++- ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++- # 4. KV 缓存更新 +++++++ if past_key_value is not None: +++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++- key_states, value_states = past_key_value.update( +++++++- key_states, value_states, self.layer_idx, cache_kwargs +++++++- ) +++++++- +++++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++- if cache_position.shape[0] == 1: +++++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++- kv_seq_len = key_states.shape[-2] +++++++- +++++++- # 5. [重要] 准备 Attention Mask +++++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ++++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++ # 4. 准备 Attention Mask +++++++ fa_attention_mask = None +++++++ if attention_mask is not None: +++++++- # 截取与当前key长度匹配的部分 +++++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++- # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++ fa_attention_mask = (mask_slice != 0) +++++++ +++++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++- input_dtype = query_states.dtype +++++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++- query_states = query_states.to(mindspore.float16) +++++++- key_states = key_states.to(mindspore.float16) +++++++- value_states = value_states.to(mindspore.float16) +++++++- +++++++- # 6. [核心] 调用 flash_attention_score 算子 +++++++- # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ++++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +++++++ attn_output = mindspore.ops.flash_attention_score( +++++++ query=query_states, +++++++ key=key_states, +++++++ value=value_states, +++++++- head_num=self.num_heads, # 传入Q的头数(N1) ++++++++ head_num=self.num_heads, +++++++ attn_mask=fa_attention_mask, +++++++- keep_prob=1.0 - self.attention_dropout, ++++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++ input_layout="BNSD", +++++++- sparse_mode=0 # 使用 defaultMask 模式 ++++++++ sparse_mode=0, ++++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +++++++ ) +++++++ +++++++- # 恢复原始数据类型 +++++++- attn_output = attn_output.to(input_dtype) +++++++- +++++++- # 7. 调整输出形状 +++++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ++++++++ # 6. 调整输出形状 +++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++ attn_output = self.o_proj(attn_output) +++++++ +++++++- # FlashAttention 算子不直接返回注意力权重矩阵 ++++++++ # 7. 返回结果 +++++++ attn_weights = None +++++++ if output_attentions: +++++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++- # def forward( +++++++- # self, +++++++- # hidden_states: mindspore.Tensor, +++++++- # attention_mask: Optional[mindspore.Tensor] = None, +++++++- # position_ids: Optional[mindspore.Tensor] = None, +++++++- # past_key_value: Optional[Cache] = None, +++++++- # output_attentions: bool = False, +++++++- # use_cache: bool = False, +++++++- # cache_position: Optional[mindspore.Tensor] = None, +++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++- +++++++- # bsz, q_len, _ = hidden_states.shape +++++++- +++++++- # # 1. 线性投射 Q, K, V +++++++- # query_states = self.q_proj(hidden_states) +++++++- # key_states = self.k_proj(hidden_states) +++++++- # value_states = self.v_proj(hidden_states) +++++++- +++++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- +++++++- # # 3. RoPE 旋转位置编码 +++++++- # kv_seq_len = key_states.shape[-2] +++++++- # if past_key_value is not None: +++++++- # if self.layer_idx is None: +++++++- # raise ValueError( +++++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++- # "with a layer index." +++++++- # ) +++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++ +++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++- +++++++- # # 4. KV 缓存更新 +++++++- # if past_key_value is not None: +++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++- # key_states, value_states = past_key_value.update( +++++++- # key_states, value_states, self.layer_idx, cache_kwargs +++++++- # ) +++++++- +++++++- # # 5. 准备 Attention Mask +++++++- # fa_attention_mask = None +++++++- # if attention_mask is not None: +++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++- # fa_attention_mask = (mask_slice != 0) +++++++- +++++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++- # input_dtype = query_states.dtype +++++++- +++++++- # # 6. [核心] 调用 flash_attention_score 算子 +++++++- # attn_output = mindspore.ops.flash_attention_score( +++++++- # query=query_states, +++++++- # key=key_states, +++++++- # value=value_states, +++++++- # head_num=self.num_heads, +++++++- # attn_mask=fa_attention_mask, +++++++- # keep_prob=1.0 - self.attention_dropout, +++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++- # input_layout="BNSD", +++++++- # sparse_mode=0, +++++++- # # <--- 修改点 2: 启用内部高精度计算 --- +++++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++- # inner_precise=1 +++++++- # ) +++++++- +++++++- # # 恢复原始数据类型 +++++++- # attn_output = attn_output.to(input_dtype) ++++++++QWEN2MOE_ATTENTION_CLASSES = { ++++++++ "eager": Qwen2MoeAttention, ++++++++ "flash-attention": Qwen2MoeFlashAttention, ++++++++} +++++++ +++++++- # # 7. 调整输出形状 +++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++- # attn_output = self.o_proj(attn_output) +++++++ +++++++- # attn_weights = None +++++++- # if output_attentions: +++++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# def __init__(self, config): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# # gating ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++ ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# #@dwj ++++++++# # 只遍历激活的专家,而非全部专家 ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# num_tokens = hidden_states_reshaped.shape[0] ++++++++ ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++ ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++ ++++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ++++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ++++++++# token_indices = broadcasted_token_indices.flatten() ++++++++ ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++ ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# selected_token_indices = token_indices[mask] ++++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++++ ++++++++# current_states = hidden_states_reshaped[selected_token_indices] ++++++++ ++++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++ ++++++++# final_hidden_states = final_hidden_states.index_add( ++++++++# dim=0, ++++++++# index=selected_token_indices, ++++++++# source=expert_output.to(hidden_states.dtype) ++++++++# ) ++++++++ ++++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++++ +++++++- # return attn_output, attn_weights, past_key_value ++++++++# final_hidden_states = final_hidden_states + shared_expert_output ++++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ ++++++++ ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# """ ++++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ++++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ++++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# # 门控网络 ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# # 专家列表 ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++# # 共享专家 ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_decode( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# """ ++++++++# 【解码路径】针对 sequence_length=1 的极致优化。 ++++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ++++++++# """ ++++++++# batch_size, hidden_dim = hidden_states.shape ++++++++ ++++++++# expert_outputs_list = [ ++++++++# ops.cat([ ++++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++++# ], dim=0) ++++++++# for i in range(batch_size) ++++++++# ] ++++++++ ++++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- ++++++++# # shape: (batch_size, top_k, hidden_dim) ++++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++++ ++++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ++++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++++ ++++++++# return moe_output.squeeze(1) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_prefill( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# """ ++++++++# 【预填充路径】针对 sequence_length > 1 的优化。 ++++++++# 按专家对 Token 进行分组,并进行批处理。 ++++++++# """ ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens = hidden_states.shape[0] ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++ ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++ ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++ ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# selected_token_indices = token_indices[mask] ++++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++++ ++++++++# current_states = hidden_states[selected_token_indices] ++++++++ ++++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++ ++++++++# moe_output = moe_output.index_add( ++++++++# dim=0, ++++++++# index=selected_token_indices, ++++++++# source=expert_output.to(hidden_states.dtype) ++++++++# ) ++++++++# return moe_output ++++++++ ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# """ ++++++++# 顶层 forward 方法,作为智能分发器。 ++++++++# """ ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++- # def forward( +++++++- # self, +++++++- # hidden_states: mindspore.Tensor, +++++++- # attention_mask: Optional[mindspore.Tensor] = None, +++++++- # position_ids: Optional[mindspore.Tensor] = None, +++++++- # past_key_value: Optional[Cache] = None, +++++++- # output_attentions: bool = False, +++++++- # use_cache: bool = False, +++++++- # cache_position: Optional[mindspore.Tensor] = None, +++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++- +++++++- # bsz, q_len, _ = hidden_states.shape +++++++- +++++++- # query_states = self.q_proj(hidden_states) +++++++- # key_states = self.k_proj(hidden_states) +++++++- # value_states = self.v_proj(hidden_states) +++++++- +++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++- +++++++- # kv_seq_len = key_states.shape[-2] +++++++- # if past_key_value is not None: +++++++- # if self.layer_idx is None: +++++++- # raise ValueError("`layer_idx` must be specified for caching") +++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++- +++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++- +++++++- # if past_key_value is not None: +++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++- # key_states, value_states = past_key_value.update( +++++++- # key_states, value_states, self.layer_idx, cache_kwargs +++++++- # ) ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++# moe_output = None ++++++++# # 在推理时,根据序列长度选择最优路径 ++++++++# if not self.training: ++++++++# if sequence_length == 1: ++++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++++# else: ++++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++++# else: ++++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ++++++++# raise NotImplementedError("Training path is not implemented.") ++++++++ ++++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) ++++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ++++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) ++++++++ ++++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights ++++++++ ++++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ ++++++++ ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# """ ++++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# # 门控网络 ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# # 专家列表 ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++# # 共享专家 ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_decode( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# batch_size, _ = hidden_states.shape ++++++++# expert_outputs_list = [ ++++++++# ops.cat([ ++++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++++# ], dim=0) ++++++++# for i in range(batch_size) ++++++++# ] ++++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++++# return moe_output.squeeze(1) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_prefill( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens = hidden_states.shape[0] ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# selected_token_indices = token_indices[mask] ++++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++++# current_states = hidden_states[selected_token_indices] ++++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++# moe_output = moe_output.index_add( ++++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++++# ) ++++++++# return moe_output ++++++++ ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# """ ++++++++# 顶层 forward 方法,作为智能分发器。 ++++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ++++++++# """ ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ ++++++++# # 1. 门控计算 (通用逻辑) ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++ ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++# # 2. 智能分发到最优 MoE 路径 ++++++++# moe_output = None ++++++++# if not self.training: ++++++++# if sequence_length == 1: ++++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ++++++++# else: ++++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ++++++++# else: ++++++++# raise NotImplementedError("Training path is not implemented.") ++++++++ ++++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ++++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ++++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++++ ++++++++# # 4. 合并 MoE 输出和共享专家输出 ++++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ++++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++++ ++++++++# # 5. 恢复原始形状并返回 ++++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ ++++++++# prefill fastest ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# """ ++++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ++++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# # 门控网络 ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# # 专家列表 ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++# # 共享专家 ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_dispatch( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# """ ++++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ++++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ++++++++# """ ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens, _ = hidden_states.shape ++++++++ ++++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++# flat_routing_weights = routing_weights.flatten() +++++++ +++++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++- +++++++- # # <--- 核心修改点: 手动进行高精度缩放 --- +++++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++++- # query_states = query_states / math.sqrt(self.head_dim) +++++++- # # <--- 修改结束 --- +++++++- +++++++- # fa_attention_mask = None +++++++- # if attention_mask is not None: +++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++- # fa_attention_mask = (mask_slice != 0) +++++++- +++++++- # input_dtype = query_states.dtype +++++++- +++++++- # attn_output = mindspore.ops.flash_attention_score( +++++++- # query=query_states, # 传入已经预先缩放过的 query +++++++- # key=key_states, +++++++- # value=value_states, +++++++- # head_num=self.num_heads, +++++++- # attn_mask=fa_attention_mask, +++++++- # keep_prob=1.0 - self.attention_dropout, +++++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++++- # input_layout="BNSD", +++++++- # sparse_mode=0, +++++++- # inner_precise=1 # 仍然保持内部高精度计算 +++++++- # ) ++++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++ +++++++- # attn_output = attn_output.to(input_dtype) +++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++- # attn_output = self.o_proj(attn_output) ++++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++ ++++++++# # 找到所有分配给该专家的 token ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++ ++++++++# # 使用 mask 选取对应的 token 和权重 ++++++++# current_token_indices = token_indices[mask] ++++++++# current_routing_weights = flat_routing_weights[mask] ++++++++# current_hidden_states = hidden_states[current_token_indices] ++++++++ ++++++++# # 对这些 token 进行批处理 ++++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++++ ++++++++# # 使用 index_add 将结果精确地加回到对应位置 ++++++++# moe_output = moe_output.index_add( ++++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ++++++++# ) ++++++++# return moe_output ++++++++ ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# """ ++++++++# 顶层 forward 方法,作为智能分发器。 ++++++++# """ ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ ++++++++# # 1. 门控计算 ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++ ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++# routing_weights = routing_weights.to(hidden_states.dtype) ++++++++ ++++++++# # 2. 调用统一的 MoE 计算内核 ++++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ++++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++++++ +++++++- # attn_weights = None +++++++- # if output_attentions: +++++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ++++++++# # 3. 统一处理共享专家 ++++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++++ ++++++++# # 4. 合并输出 ++++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++++ ++++++++# # 5. 恢复原始形状并返回 ++++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ ++++++++ ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# """ ++++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ++++++++# 【最终高性能与高精度版】: ++++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ++++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ++++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ++++++++# 3. 这样实现了速度和准确性的两全其美。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_decode( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# """ ++++++++# 【解码路径】极致优化版:bmm + 高精度累加。 ++++++++# """ ++++++++# original_dtype = hidden_states.dtype ++++++++# batch_size, _ = hidden_states.shape ++++++++ ++++++++# expert_outputs_list = [ ++++++++# ops.cat([ ++++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++++# ], dim=0) ++++++++# for i in range(batch_size) ++++++++# ] ++++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++++ ++++++++# # 在 float32 下执行 bmm,得到高精度结果 ++++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ++++++++ ++++++++# # 将高精度结果转换回原始数据类型 ++++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) ++++++++ ++++++++# return moe_output ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_prefill( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# selected_experts: mindspore.Tensor, ++++++++# routing_weights: mindspore.Tensor ++++++++# ) -> mindspore.Tensor: ++++++++# """ ++++++++# 【预填充路径】与原始实现一致,结果精确。 ++++++++# """ ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens, _ = hidden_states.shape ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++ ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# selected_token_indices = token_indices[mask] ++++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++++# current_states = hidden_states[selected_token_indices] ++++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++# moe_output = moe_output.index_add( ++++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ++++++++# ) ++++++++# return moe_output ++++++++ ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++ +++++++- # return attn_output, attn_weights, past_key_value ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ++++++++# # 如果模型主体是 float16,后续再转换 ++++++++ ++++++++# moe_output = None ++++++++# if not self.training: ++++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ++++++++# # _moe_infer_decode 内部会处理好类型转换 ++++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) ++++++++# if sequence_length == 1: ++++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++++# else: ++++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ++++++++# else: ++++++++# raise NotImplementedError("Training path is not implemented.") ++++++++ ++++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++++ ++++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ +++++++ +++++++-QWEN2MOE_ATTENTION_CLASSES = { +++++++- "eager": Qwen2MoeAttention, +++++++- "flash-attention": Qwen2MoeFlashAttention, +++++++-} ++++++++# class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++# """ ++++++++# 【融合版】一个混合专家模块,内置两种推理策略, ++++++++# 由外部全局变量 `Long_Prompt` 控制: ++++++++ ++++++++# - if Long_Prompt is True: 【精度优先模式】 ++++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ++++++++# 适用于处理长序列,避免误差累积。 ++++++++ ++++++++# - if Long_Prompt is False: 【速度优先模式】 ++++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ++++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 ++++++++# """ ++++++++# def __init__(self, config: Qwen2MoeConfig): ++++++++# super().__init__() ++++++++# self.num_experts = config.num_experts ++++++++# self.top_k = config.num_experts_per_tok ++++++++# self.norm_topk_prob = config.norm_topk_prob ++++++++ ++++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ++++++++# self.experts = nn.ModuleList( ++++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ++++++++# ) ++++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ ++++++++# # --- 速度优先模式的辅助函数 --- ++++++++# @no_grad() ++++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++# original_dtype = hidden_states.dtype ++++++++# batch_size, _ = hidden_states.shape ++++++++# expert_outputs_list = [ ++++++++# ops.cat([ ++++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++++# ], dim=0) ++++++++# for i in range(batch_size) ++++++++# ] ++++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++++# weights_fp32 = routing_weights.to(mindspore.float32) ++++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++++# return moe_output_fp32.squeeze(1).to(original_dtype) ++++++++ ++++++++# @no_grad() ++++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens, _ = hidden_states.shape ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# selected_token_indices = token_indices[mask] ++++++++# selected_routing_weights = routing_weights.flatten()[mask] ++++++++# current_states = hidden_states[selected_token_indices] ++++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ++++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++++# return moe_output ++++++++ ++++++++# # --- 精度优先模式的辅助函数 --- ++++++++# @no_grad() ++++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++# moe_output = ops.zeros_like(hidden_states) ++++++++# num_tokens, _ = hidden_states.shape ++++++++# flat_selected_experts = selected_experts.flatten() ++++++++# flat_routing_weights = routing_weights.flatten() ++++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++# active_experts = ops.unique(flat_selected_experts) ++++++++# for expert_idx_tensor in active_experts: ++++++++# expert_idx = expert_idx_tensor.item() ++++++++# expert_layer = self.experts[expert_idx] ++++++++# mask = (flat_selected_experts == expert_idx_tensor) ++++++++# current_token_indices = token_indices[mask] ++++++++# current_routing_weights = flat_routing_weights[mask] ++++++++# current_hidden_states = hidden_states[current_token_indices] ++++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++++# return moe_output ++++++++ ++++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++# # 声明我们将要使用一个在模块外部定义的全局变量 ++++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ++++++++# global Long_Prompt ++++++++ ++++++++# # 1. 门控计算 (所有模式通用) ++++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++# router_logits = self.gate(hidden_states_reshaped) ++++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++++++# if self.norm_topk_prob: ++++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++# moe_output = None ++++++++# if not self.training: ++++++++# # 根据 Long_Prompt 标志选择模式 ++++++++# if Long_Prompt: ++++++++# # --- 精度优先模式 --- ++++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++# else: ++++++++# # --- 速度优先模式 --- ++++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++# if sequence_length == 1: ++++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++# else: ++++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++# else: ++++++++# raise NotImplementedError("Training path is not implemented.") ++++++++ ++++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++++ ++++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++# return final_hidden_states, router_logits ++++++++ ++++++++class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++ """ ++++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` ++++++++ 控制的顶级推理策略: +++++++ ++++++++ - if Long_Prompt is True: 【精度优先模式】 ++++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 ++++++++ 适用于需要严格可复现性的长序列任务。 +++++++ +++++++-class Qwen2MoeSparseMoeBlock(nn.Module): +++++++- def __init__(self, config): ++++++++ - if Long_Prompt is False: 【速度优先模式】 ++++++++ 采用业界最强的性能组合: ++++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 ++++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 ++++++++ """ ++++++++ def __init__(self, config: Qwen2MoeConfig): +++++++ super().__init__() +++++++ self.num_experts = config.num_experts +++++++ self.top_k = config.num_experts_per_tok +++++++ self.norm_topk_prob = config.norm_topk_prob +++++++ +++++++- # gating +++++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++ self.experts = nn.ModuleList( +++++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++ ) +++++++- +++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++ +++++++- #@dwj +++++++- # 只遍历激活的专家,而非全部专家 +++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++- num_tokens = hidden_states_reshaped.shape[0] +++++++- +++++++- router_logits = self.gate(hidden_states_reshaped) +++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- +++++++- if self.norm_topk_prob: +++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++- flat_selected_experts = selected_experts.flatten() +++++++- +++++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++- token_indices = broadcasted_token_indices.flatten() +++++++- +++++++- active_experts = ops.unique(flat_selected_experts) +++++++- +++++++- for expert_idx_tensor in active_experts: +++++++- expert_idx = expert_idx_tensor.item() +++++++- expert_layer = self.experts[expert_idx] +++++++- +++++++- mask = (flat_selected_experts == expert_idx_tensor) +++++++- selected_token_indices = token_indices[mask] +++++++- selected_routing_weights = routing_weights.flatten()[mask] +++++++- +++++++- current_states = hidden_states_reshaped[selected_token_indices] +++++++- +++++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++- +++++++- final_hidden_states = final_hidden_states.index_add( +++++++- dim=0, +++++++- index=selected_token_indices, +++++++- source=expert_output.to(hidden_states.dtype) +++++++- ) +++++++- +++++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- ++++++++ @no_grad() ++++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++ original_dtype = hidden_states.dtype ++++++++ batch_size, _ = hidden_states.shape ++++++++ expert_outputs_list = [ ++++++++ ops.cat([ ++++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ++++++++ ], dim=0) ++++++++ for i in range(batch_size) ++++++++ ] ++++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ++++++++ weights_fp32 = routing_weights.to(mindspore.float32) ++++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ++++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ++++++++ return moe_output_fp32.squeeze(1).to(original_dtype) ++++++++ ++++++++ @no_grad() ++++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++ num_tokens, _ = hidden_states.shape ++++++++ flat_selected_experts = selected_experts.flatten() ++++++++ sorted_expert_indices = flat_selected_experts.argsort() ++++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++++++ original_token_indices = sorted_expert_indices // self.top_k ++++++++ moe_output = ops.zeros_like(hidden_states) ++++++++ current_token_offset = 0 ++++++++ for i in range(self.num_experts): ++++++++ expert_token_count = tokens_per_expert[i] - current_token_offset ++++++++ if expert_token_count == 0: ++++++++ continue ++++++++ end_offset = current_token_offset + expert_token_count ++++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++++++ expert_hidden_states = hidden_states[expert_original_token_indices] ++++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++++ current_token_offset += expert_token_count ++++++++ return moe_output ++++++++ ++++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- ++++++++ @no_grad() ++++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++ moe_output = ops.zeros_like(hidden_states) ++++++++ num_tokens, _ = hidden_states.shape ++++++++ flat_selected_experts = selected_experts.flatten() ++++++++ flat_routing_weights = routing_weights.flatten() ++++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ++++++++ active_experts = ops.unique(flat_selected_experts) ++++++++ for expert_idx_tensor in active_experts: ++++++++ expert_idx = expert_idx_tensor.item() ++++++++ expert_layer = self.experts[expert_idx] ++++++++ mask = (flat_selected_experts == expert_idx_tensor) ++++++++ current_token_indices = token_indices[mask] ++++++++ current_routing_weights = flat_routing_weights[mask] ++++++++ current_hidden_states = hidden_states[current_token_indices] ++++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ++++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++++ return moe_output +++++++ +++++++- final_hidden_states = final_hidden_states + shared_expert_output +++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++- +++++++- return final_hidden_states, router_logits ++++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++ global Long_Prompt ++++++++ ++++++++ # 1. 门控计算 (所有模式通用) ++++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ++++++++ router_logits = self.gate(hidden_states_reshaped) ++++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ++++++++ if self.norm_topk_prob: ++++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++ ++++++++ moe_output = None ++++++++ if Long_Prompt: ++++++++ # --- 精度优先模式 (ACCURACY MODE) --- ++++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ else: ++++++++ # --- 速度优先模式 (SPEED MODE) --- ++++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++ if sequence_length == 1: ++++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ else: ++++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ +++++++ ++++++++ # 3. 共享专家计算与合并 (所有模式通用) ++++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ++++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) ++++++++ ++++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output ++++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) ++++++++ ++++++++ return final_hidden_states, router_logits +++++++ +++++++ class Qwen2MoeDecoderLayer(nn.Module): +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++++++ super().__init__() +++++++ self.hidden_size = config.hidden_size ++++++++ ++++++++ # if Long_Prompt: ++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++++ # else: ++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++ +++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ +++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++- +++++++ if (layer_idx not in config.mlp_only_layers) and ( +++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++++ ): +++++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ self._warmed_up = True +++++++ self.warmup_moe_model() +++++++ ++++++++ ++++++++ +++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++++ output_router_logits = ( +++++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +++++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ router_logits=outputs.router_logits, +++++++ ) +++++++ ++++++++ def generate(self, *args, **kwargs): ++++++++ """ ++++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 ++++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 ++++++++ """ ++++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++++++ ++++++++ input_ids = kwargs.get("input_ids") ++++++++ if input_ids is None and args: ++++++++ input_ids = args[0] ++++++++ ++++++++ if input_ids is not None: ++++++++ prompt_length = input_ids.shape[1] ++++++++ ++++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: ++++++++ Long_Prompt = True ++++++++ else: ++++++++ Long_Prompt = False ++++++++ ++++++++ return super().generate(*args, **kwargs) ++++++++ +++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +++++++ def prepare_inputs_for_generation( +++++++ self, +++++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +++++++ # Exception 1: when passing input_embeds, input_ids may be missing entries +++++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here ++++++++ +++++++ if past_key_values is not None: +++++++ if inputs_embeds is not None: # Exception 1 +++++++ if 0 not in input_ids.shape: +++++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ } +++++++ ) +++++++ return model_inputs ++++++++ +++++++ # @lwx +++++++ # def _decode_one_tokens_logits( +++++++ # self, +++++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +++++++ attentions=outputs.attentions, +++++++ ) +++++++ ++++++++ +++++++ __all__ = [ +++++++ "Qwen2MoeForCausalLM", +++++++ "Qwen2MoeModel", +++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++++new file mode 100644 +++++++index 00000000..6dfb5b93 +++++++--- /dev/null ++++++++++ b/patches/0001-20251104commit.patch +++++++@@ -0,0 +1,1272 @@ ++++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ++++++++From: Pinoeer-kingxi <13022943007@163.com> ++++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 ++++++++Subject: [PATCH] 20251104commit ++++++++ ++++++++--- ++++++++ mindnlp/transformers/cache_utils.py | 28 +- ++++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- ++++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- ++++++++ 3 files changed, 976 insertions(+), 87 deletions(-) ++++++++ ++++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ++++++++index cadd2e04..02f8d4be 100644 ++++++++--- a/mindnlp/transformers/cache_utils.py +++++++++++ b/mindnlp/transformers/cache_utils.py ++++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): ++++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. ++++++++ # k_out[:, :, cache_position] = key_states ++++++++ # v_out[:, :, cache_position] = value_states ++++++++- if ON_ORANGE_PI: ++++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ++++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ++++++++- else: ++++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ++++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ++++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ++++++++- +++++++++ # if ON_ORANGE_PI: +++++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++++ # else: +++++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +++++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++++++ if cache_position.ndim > 1: +++++++++ cache_position = cache_position.flatten() +++++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +++++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++++++ cache_position = cache_position.int() +++++++++ +++++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++++++ k_out[:, :, cache_position] = key_states +++++++++ v_out[:, :, cache_position] = value_states +++++++++ ++++++++ return k_out, v_out ++++++++ ++++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ++++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++index c695b944..d8303e45 100644 ++++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): ++++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half ++++++++ def rotate_half(x): ++++++++ """Rotates half the hidden dims of the input.""" ++++++++- x1 = x[..., : x.shape[-1] // 2] ++++++++- x2 = x[..., x.shape[-1] // 2 :] +++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++++ # x1 = x[..., : x.shape[-1] // 2] +++++++++ # x2 = x[..., x.shape[-1] // 2 :] +++++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++++ return ops.cat((-x2, x1), dim=-1) ++++++++ ++++++++ ++++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): ++++++++ if self.training: ++++++++ raise NotImplementedError("Training is not supported yet.") ++++++++ else: ++++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ++++++++- if self.config.n_shared_experts is not None: ++++++++- y = y + self.shared_experts(identity) ++++++++- return y +++++++++ # @lwx +++++++++ if orig_shape[1] == 1: +++++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++++++ y=y.view(*orig_shape) +++++++++ if self.config.n_shared_experts is not None: +++++++++ y = y + self.shared_experts(identity) +++++++++ return y +++++++++ else: +++++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++++++ if self.config.n_shared_experts is not None: +++++++++ y = y + self.shared_experts(identity) +++++++++ return y +++++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++++ # if self.config.n_shared_experts is not None: +++++++++ # y = y + self.shared_experts(identity) +++++++++ # return y +++++++++ +++++++++ @no_grad() +++++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++++ +++++++++ expert_cache = ops.zeros_like(x) +++++++++ for i in range(self.num_experts_per_tok): +++++++++ expert_id = flat_expert_indices[i].item() +++++++++ weight = flat_expert_weights[i].item() +++++++++ expert = self.experts[expert_id] +++++++++ expert_out = expert(x) +++++++++ expert_cache += expert_out * weight +++++++++ return expert_cache ++++++++ ++++++++ @no_grad() ++++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++- # expert_cache = torch.zeros_like(x) ++++++++- # idxs = flat_expert_indices.argsort() ++++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++++- # token_idxs = idxs // self.num_experts_per_tok ++++++++- # for i, end_idx in enumerate(tokens_per_expert): ++++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++++- # if start_idx == end_idx: ++++++++- # continue ++++++++- # expert = self.experts[i] ++++++++- # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++- # expert_tokens = x[exp_token_idx] ++++++++- # expert_out = expert(expert_tokens) ++++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++++- # return expert_cache +++++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++++ expert_cache = ops.zeros_like(x) ++++++++ idxs = flat_expert_indices.argsort() ++++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ token_idxs = idxs // self.num_experts_per_tok +++++++++ ++++++++ for i, end_idx in enumerate(tokens_per_expert): ++++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ if start_idx == end_idx: ++++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): ++++++++ expert_out = expert(expert_tokens) ++++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++++ ++++++++ return expert_cache +++++++++ +++++++++ # @no_grad() +++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++++ # # expert_cache = torch.zeros_like(x) +++++++++ # # idxs = flat_expert_indices.argsort() +++++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++++ # # token_idxs = idxs // self.num_experts_per_tok +++++++++ # # for i, end_idx in enumerate(tokens_per_expert): +++++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++++ # # if start_idx == end_idx: +++++++++ # # continue +++++++++ # # expert = self.experts[i] +++++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++++ # # expert_tokens = x[exp_token_idx] +++++++++ # # expert_out = expert(expert_tokens) +++++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++++ # # return expert_cache +++++++++ # expert_cache = ops.zeros_like(x) +++++++++ # idxs = flat_expert_indices.argsort() +++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++++ +++++++++ # for i, end_idx in enumerate(tokens_per_expert): +++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++++ # if start_idx == end_idx: +++++++++ # continue +++++++++ # expert = self.experts[i] +++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++++ # expert_tokens = x[exp_token_idx] +++++++++ # expert_out = expert(expert_tokens) +++++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++++ +++++++++ # return expert_cache +++++++++ # @no_grad() +++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++++ # expert_cache = ops.zeros_like(x) +++++++++ +++++++++ # # 排序保证顺序一致 +++++++++ # idxs = flat_expert_indices.argsort() +++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++++ # token_idxs = idxs // self.num_experts_per_tok +++++++++ +++++++++ # # 找出有 token 的专家 +++++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++++ +++++++++ # for i in active_experts.tolist(): +++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++++ # end_idx = tokens_per_expert[i] +++++++++ # if start_idx == end_idx: # 没有 token +++++++++ # continue +++++++++ +++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++++ # expert_tokens = x[exp_token_idx] +++++++++ # expert_out = self.experts[i](expert_tokens) +++++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++++ +++++++++ # expert_cache = mindspore.mint.scatter_add( +++++++++ # expert_cache, +++++++++ # 0, +++++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++++ # expert_out +++++++++ # ) +++++++++ +++++++++ # return expert_cache +++++++++ +++++++++ ++++++++ ++++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): ++++++++ # """ ++++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++++ ++++++++ # Initialize weights and apply final processing ++++++++ self.post_init() +++++++++ self.warm_up = False +++++++++ +++++++++ def warmup_moe_model_deep(self): +++++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++++++ test_texts = [ +++++++++ "warmup short", +++++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++++++ ] +++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++++ if tokenizer is None: +++++++++ from mindnlp.transformers import AutoTokenizer +++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++++ self._warmup_tokenizer = tokenizer +++++++++ +++++++++ for text in test_texts: +++++++++ inputs = tokenizer(text, return_tensors="ms") +++++++++ with mindspore._no_grad(): +++++++++ _ = self(**inputs, use_cache=False) +++++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") ++++++++ ++++++++ def get_input_embeddings(self): ++++++++ return self.model.embed_tokens ++++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): ++++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++++ ```""" +++++++++ if not self.warm_up: +++++++++ self.warm_up = True +++++++++ self.warmup_moe_model_deep() +++++++++ ++++++++ output_attentions = ( ++++++++ output_attentions ++++++++ if output_attentions is not None ++++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++index 3cbf820e..d4c6b651 100644 ++++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++@@ -18,7 +18,6 @@ ++++++++ # See the License for the specific language governing permissions and ++++++++ # limitations under the License. ++++++++ """MindSpore Qwen2MoE model.""" ++++++++- ++++++++ import math ++++++++ from typing import List, Optional, Tuple, Union ++++++++ ++++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( ++++++++ TokenClassifierOutput, ++++++++ ) ++++++++ from ...modeling_utils import PreTrainedModel +++++++++from ...generation import GenerationMixin ++++++++ from ....utils import logging ++++++++ from .configuration_qwen2_moe import Qwen2MoeConfig ++++++++ ++++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): ++++++++ self.variance_epsilon = eps ++++++++ ++++++++ def forward(self, hidden_states): +++++++++ # @dwj +++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++++ # @lwx +++++++++ # if not self.training : +++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ++++++++ input_dtype = hidden_states.dtype ++++++++ hidden_states = hidden_states.to(mindspore.float32) ++++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ++++++++@@ -234,6 +239,8 @@ def rotate_half(x): ++++++++ """Rotates half the hidden dims of the input.""" ++++++++ x1 = x[..., : x.shape[-1] // 2] ++++++++ x2 = x[..., x.shape[-1] // 2 :] +++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) ++++++++ return ops.cat((-x2, x1), dim=-1) ++++++++ ++++++++ ++++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): ++++++++ self.config = config ++++++++ self.hidden_size = config.hidden_size ++++++++ self.intermediate_size = intermediate_size +++++++++ ++++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) ++++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) ++++++++ self.act_fn = ACT2FN[config.hidden_act] ++++++++ ++++++++ def forward(self, x): ++++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ++++++++- ++++++++ +++++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++++ # @lwx +++++++++ # gate_up_output = self.gate_up_proj(x) +++++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++++++ # return self.down_proj(swiglu_output) +++++++++ +++++++++ # def forward(self, x): +++++++++ # gate_proj_out = self.gate_proj(x) +++++++++ # up_proj_out = self.up_proj(x) +++++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++++++ # return self.down_proj(swiglu_out) +++++++++ ++++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv ++++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: ++++++++ """ ++++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): ++++++++ use_cache: bool = False, ++++++++ cache_position: Optional[mindspore.Tensor] = None, ++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++++ +++++++++ +++++++++ ++++++++ bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++ query_states = self.q_proj(hidden_states) ++++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): ++++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++ "with a layer index." ++++++++ ) ++++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++++ if isinstance(past_key_value, StaticCache): +++++++++ kv_seq_len = key_states.shape[-2] +++++++++ else: +++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++ if past_key_value is not None: ++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++++ +++++++++ if isinstance(past_key_value, StaticCache): +++++++++ kv_seq_len = key_states.shape[-2] ++++++++ ++++++++ # repeat k/v heads if n_kv_heads < n_heads ++++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++- +++++++++ ++++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++++ ++++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ++++++++- raise ValueError( ++++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ++++++++- f" {attn_weights.shape}" ++++++++- ) ++++++++- ++++++++- if attention_mask is not None: # no matter the length, we just slice it ++++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++++++ if attention_mask is not None: +++++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++++ attn_weights = attn_weights + causal_mask ++++++++ ++++++++ # upcast attention to fp32 ++++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): ++++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++++ ++++++++ attn_output = self.o_proj(attn_output) ++++++++- +++++++++ # @lwx +++++++++ +++++++++ # max_seq_len = self.max_position_embeddings # 2048 +++++++++ +++++++++ # if attention_mask is not None: +++++++++ # # attention_mask: [B, 1, Sq, Sk] +++++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++++ +++++++++ # # pad 到 [max_seq_len, max_seq_len] +++++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++++ # global_attention_mask = padded_mask +++++++++ # else: +++++++++ # global_attention_mask = None +++++++++ +++++++++ +++++++++ # sparse_mode=3 +++++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++++ # query=query_states, +++++++++ # key=key_states, +++++++++ # value=value_states, +++++++++ # real_shift=None, +++++++++ # padding_mask=None, +++++++++ +++++++++ # head_num=self.num_heads, +++++++++ # attn_mask=global_attention_mask, +++++++++ # keep_prob=1.0 - self.attention_dropout, +++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++++ # input_layout="BNSD", +++++++++ # pre_tokens=2147483647, +++++++++ # next_tokens=2147483647, +++++++++ # inner_precise=0, +++++++++ # drop_mask=None, +++++++++ # prefix=None, +++++++++ # actual_seq_qlen=None, +++++++++ # actual_seq_kvlen=None, +++++++++ # sparse_mode=sparse_mode, +++++++++ # ) ++++++++ if not output_attentions: ++++++++ attn_weights = None ++++++++ ++++++++ return attn_output, attn_weights, past_key_value ++++++++ ++++++++ +++++++++class Qwen2MoeFlashAttention(nn.Module): +++++++++ """ +++++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++++ +++++++++ 关键改动: +++++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++++ 直接传入原始的 key 和 value 张量效率更高。 +++++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++++ """ +++++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++++ super().__init__() +++++++++ self.config = config +++++++++ self.layer_idx = layer_idx +++++++++ self.hidden_size = config.hidden_size +++++++++ self.num_heads = config.num_attention_heads +++++++++ self.head_dim = self.hidden_size // self.num_heads +++++++++ self.num_key_value_heads = config.num_key_value_heads +++++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++++ self.max_position_embeddings = config.max_position_embeddings +++++++++ self.rope_theta = config.rope_theta +++++++++ self.attention_dropout = config.attention_dropout +++++++++ +++++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +++++++++ raise ValueError( +++++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++++ ) +++++++++ +++++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++++ +++++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++++ self.head_dim, +++++++++ max_position_embeddings=self.max_position_embeddings, +++++++++ base=self.rope_theta, +++++++++ ) +++++++++ +++++++++ def forward( +++++++++ self, +++++++++ hidden_states: mindspore.Tensor, +++++++++ attention_mask: Optional[mindspore.Tensor] = None, +++++++++ position_ids: Optional[mindspore.Tensor] = None, +++++++++ past_key_value: Optional[Cache] = None, +++++++++ output_attentions: bool = False, +++++++++ use_cache: bool = False, +++++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++++ +++++++++ bsz, q_len, _ = hidden_states.shape +++++++++ +++++++++ # 1. 线性投射 Q, K, V +++++++++ query_states = self.q_proj(hidden_states) +++++++++ key_states = self.k_proj(hidden_states) +++++++++ value_states = self.v_proj(hidden_states) +++++++++ +++++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++++ # query: [B, S, H*D] -> [B, N1, S, D] +++++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ +++++++++ # 3. RoPE 旋转位置编码 +++++++++ kv_seq_len = key_states.shape[-2] +++++++++ if past_key_value is not None: +++++++++ if self.layer_idx is None: +++++++++ raise ValueError( +++++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++++ "with a layer index." +++++++++ ) +++++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++++ if cache_position.shape[0] == 1: +++++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++++ kv_seq_len = past_seen_tokens + 1 +++++++++ else: +++++++++ # prefill 阶段:cache_position 是范围,使用其长度 +++++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++++ else: +++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++++ +++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++++ +++++++++ # 4. KV 缓存更新 +++++++++ if past_key_value is not None: +++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++++ key_states, value_states = past_key_value.update( +++++++++ key_states, value_states, self.layer_idx, cache_kwargs +++++++++ ) +++++++++ +++++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++++ if cache_position.shape[0] == 1: +++++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++++ kv_seq_len = key_states.shape[-2] +++++++++ +++++++++ # 5. [重要] 准备 Attention Mask +++++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++++ fa_attention_mask = None +++++++++ if attention_mask is not None: +++++++++ # 截取与当前key长度匹配的部分 +++++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++++ fa_attention_mask = (mask_slice != 0) +++++++++ +++++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++++ input_dtype = query_states.dtype +++++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++++ query_states = query_states.to(mindspore.float16) +++++++++ key_states = key_states.to(mindspore.float16) +++++++++ value_states = value_states.to(mindspore.float16) +++++++++ +++++++++ # 6. [核心] 调用 flash_attention_score 算子 +++++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++++ attn_output = mindspore.ops.flash_attention_score( +++++++++ query=query_states, +++++++++ key=key_states, +++++++++ value=value_states, +++++++++ head_num=self.num_heads, # 传入Q的头数(N1) +++++++++ attn_mask=fa_attention_mask, +++++++++ keep_prob=1.0 - self.attention_dropout, +++++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++++ input_layout="BNSD", +++++++++ sparse_mode=0 # 使用 defaultMask 模式 +++++++++ ) +++++++++ +++++++++ # 恢复原始数据类型 +++++++++ attn_output = attn_output.to(input_dtype) +++++++++ +++++++++ # 7. 调整输出形状 +++++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++++ attn_output = self.o_proj(attn_output) +++++++++ +++++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +++++++++ attn_weights = None +++++++++ if output_attentions: +++++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++++ +++++++++ return attn_output, attn_weights, past_key_value +++++++++ +++++++++ # def forward( +++++++++ # self, +++++++++ # hidden_states: mindspore.Tensor, +++++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++++ # past_key_value: Optional[Cache] = None, +++++++++ # output_attentions: bool = False, +++++++++ # use_cache: bool = False, +++++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++++ +++++++++ # bsz, q_len, _ = hidden_states.shape +++++++++ +++++++++ # # 1. 线性投射 Q, K, V +++++++++ # query_states = self.q_proj(hidden_states) +++++++++ # key_states = self.k_proj(hidden_states) +++++++++ # value_states = self.v_proj(hidden_states) +++++++++ +++++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ +++++++++ # # 3. RoPE 旋转位置编码 +++++++++ # kv_seq_len = key_states.shape[-2] +++++++++ # if past_key_value is not None: +++++++++ # if self.layer_idx is None: +++++++++ # raise ValueError( +++++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++++ # "with a layer index." +++++++++ # ) +++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++++ +++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++++ +++++++++ # # 4. KV 缓存更新 +++++++++ # if past_key_value is not None: +++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++++ # key_states, value_states = past_key_value.update( +++++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++++ # ) +++++++++ +++++++++ # # 5. 准备 Attention Mask +++++++++ # fa_attention_mask = None +++++++++ # if attention_mask is not None: +++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++++ # fa_attention_mask = (mask_slice != 0) +++++++++ +++++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++++ # input_dtype = query_states.dtype +++++++++ +++++++++ # # 6. [核心] 调用 flash_attention_score 算子 +++++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++++ # query=query_states, +++++++++ # key=key_states, +++++++++ # value=value_states, +++++++++ # head_num=self.num_heads, +++++++++ # attn_mask=fa_attention_mask, +++++++++ # keep_prob=1.0 - self.attention_dropout, +++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++++ # input_layout="BNSD", +++++++++ # sparse_mode=0, +++++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +++++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++++ # inner_precise=1 +++++++++ # ) +++++++++ +++++++++ # # 恢复原始数据类型 +++++++++ # attn_output = attn_output.to(input_dtype) +++++++++ +++++++++ # # 7. 调整输出形状 +++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++++ # attn_output = self.o_proj(attn_output) +++++++++ +++++++++ # attn_weights = None +++++++++ # if output_attentions: +++++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++++ +++++++++ # return attn_output, attn_weights, past_key_value +++++++++ +++++++++ # def forward( +++++++++ # self, +++++++++ # hidden_states: mindspore.Tensor, +++++++++ # attention_mask: Optional[mindspore.Tensor] = None, +++++++++ # position_ids: Optional[mindspore.Tensor] = None, +++++++++ # past_key_value: Optional[Cache] = None, +++++++++ # output_attentions: bool = False, +++++++++ # use_cache: bool = False, +++++++++ # cache_position: Optional[mindspore.Tensor] = None, +++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++++ +++++++++ # bsz, q_len, _ = hidden_states.shape +++++++++ +++++++++ # query_states = self.q_proj(hidden_states) +++++++++ # key_states = self.k_proj(hidden_states) +++++++++ # value_states = self.v_proj(hidden_states) +++++++++ +++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++++ +++++++++ # kv_seq_len = key_states.shape[-2] +++++++++ # if past_key_value is not None: +++++++++ # if self.layer_idx is None: +++++++++ # raise ValueError("`layer_idx` must be specified for caching") +++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++++ +++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++++ +++++++++ # if past_key_value is not None: +++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++++ # key_states, value_states = past_key_value.update( +++++++++ # key_states, value_states, self.layer_idx, cache_kwargs +++++++++ # ) +++++++++ +++++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++++ +++++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++++++ # query_states = query_states / math.sqrt(self.head_dim) +++++++++ # # <--- 修改结束 --- +++++++++ +++++++++ # fa_attention_mask = None +++++++++ # if attention_mask is not None: +++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++++ # fa_attention_mask = (mask_slice != 0) +++++++++ +++++++++ # input_dtype = query_states.dtype +++++++++ +++++++++ # attn_output = mindspore.ops.flash_attention_score( +++++++++ # query=query_states, # 传入已经预先缩放过的 query +++++++++ # key=key_states, +++++++++ # value=value_states, +++++++++ # head_num=self.num_heads, +++++++++ # attn_mask=fa_attention_mask, +++++++++ # keep_prob=1.0 - self.attention_dropout, +++++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++++++ # input_layout="BNSD", +++++++++ # sparse_mode=0, +++++++++ # inner_precise=1 # 仍然保持内部高精度计算 +++++++++ # ) +++++++++ +++++++++ # attn_output = attn_output.to(input_dtype) +++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++++ # attn_output = self.o_proj(attn_output) +++++++++ +++++++++ # attn_weights = None +++++++++ # if output_attentions: +++++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++++++ +++++++++ # return attn_output, attn_weights, past_key_value +++++++++ ++++++++ QWEN2MOE_ATTENTION_CLASSES = { ++++++++ "eager": Qwen2MoeAttention, +++++++++ "flash-attention": Qwen2MoeFlashAttention, ++++++++ } ++++++++ ++++++++ ++++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): ++++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ++++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) ++++++++ +++++++++ #@dwj +++++++++ # 只遍历激活的专家,而非全部专家 ++++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ++++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape ++++++++- hidden_states = hidden_states.view(-1, hidden_dim) ++++++++- # router_logits: (batch * sequence_length, n_experts) ++++++++- router_logits = self.gate(hidden_states) ++++++++- ++++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ++++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ++++++++- if self.norm_topk_prob: ++++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ++++++++- # we cast back to the input dtype ++++++++- routing_weights = routing_weights.to(hidden_states.dtype) ++++++++- ++++++++- final_hidden_states = ops.zeros( ++++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ++++++++- ) ++++++++- ++++++++- # One hot encode the selected experts to create an expert mask ++++++++- # this will be used to easily index which expert is going to be sollicitated ++++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ++++++++- ++++++++- # Loop over all available experts in the model and perform the computation on each expert ++++++++- for expert_idx in range(self.num_experts): ++++++++- expert_layer = self.experts[expert_idx] ++++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ++++++++- ++++++++- # Index the correct hidden states and compute the expert hidden state for ++++++++- # the current expert. We need to make sure to multiply the output hidden ++++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ++++++++- if 0 not in idx.shape: ++++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ++++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ++++++++- ++++++++- # However `index_add_` only support torch tensors for indexing so we'll use ++++++++- # the `top_x` tensor here. ++++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ++++++++- ++++++++- shared_expert_output = self.shared_expert(hidden_states) ++++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ++++++++- ++++++++- final_hidden_states = final_hidden_states + shared_expert_output +++++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++++ num_tokens = hidden_states_reshaped.shape[0] +++++++++ +++++++++ router_logits = self.gate(hidden_states_reshaped) +++++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++++ +++++++++ if self.norm_topk_prob: +++++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++++ routing_weights = routing_weights.to(hidden_states.dtype) +++++++++ +++++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++++ flat_selected_experts = selected_experts.flatten() +++++++++ +++++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++++ token_indices = broadcasted_token_indices.flatten() +++++++++ +++++++++ active_experts = ops.unique(flat_selected_experts) +++++++++ +++++++++ for expert_idx_tensor in active_experts: +++++++++ expert_idx = expert_idx_tensor.item() +++++++++ expert_layer = self.experts[expert_idx] +++++++++ +++++++++ mask = (flat_selected_experts == expert_idx_tensor) +++++++++ selected_token_indices = token_indices[mask] +++++++++ selected_routing_weights = routing_weights.flatten()[mask] +++++++++ +++++++++ current_states = hidden_states_reshaped[selected_token_indices] +++++++++ +++++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++++ +++++++++ final_hidden_states = final_hidden_states.index_add( +++++++++ dim=0, +++++++++ index=selected_token_indices, +++++++++ source=expert_output.to(hidden_states.dtype) +++++++++ ) +++++++++ +++++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output ++++++++ ++++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ++++++++- return final_hidden_states, router_logits +++++++++ final_hidden_states = final_hidden_states + shared_expert_output +++++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++++ +++++++++ return final_hidden_states, router_logits ++++++++ ++++++++ ++++++++ class Qwen2MoeDecoderLayer(nn.Module): ++++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): ++++++++ ++++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) ++++++++ +++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++++ ++++++++ if (layer_idx not in config.mlp_only_layers) and ( ++++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 ++++++++ ): ++++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): ++++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] ++++++++ _skip_keys_device_placement = "past_key_values" ++++++++ _supports_cache_class = True +++++++++#lwx +++++++++ # _supports_static_cache = True ++++++++ ++++++++ def _init_weights(self, module): ++++++++ std = self.config.initializer_range ++++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): ++++++++ return causal_mask ++++++++ ++++++++ ++++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): ++++++++ _tied_weights_keys = ["lm_head.weight"] ++++++++ ++++++++ def __init__(self, config): ++++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++++ self.num_experts_per_tok = config.num_experts_per_tok ++++++++ # Initialize weights and apply final processing ++++++++ self.post_init() +++++++++ # @lwx +++++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++++++ # self.generation_config.cache_implementation = "static" +++++++++ self._warmed_up = False +++++++++ +++++++++ def warmup_moe_model(self): +++++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++++++ test_texts = [ +++++++++ "warmup short", +++++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++++++ ] +++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++++ if tokenizer is None: +++++++++ from mindnlp.transformers import AutoTokenizer +++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++++ self._warmup_tokenizer = tokenizer +++++++++ +++++++++ for text in test_texts: +++++++++ inputs = tokenizer(text, return_tensors="ms") +++++++++ with mindspore._no_grad(): +++++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") ++++++++ ++++++++ def get_input_embeddings(self): ++++++++ return self.model.embed_tokens ++++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] ++++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." ++++++++ ```""" +++++++++ if not self._warmed_up: +++++++++ self._warmed_up = True +++++++++ self.warmup_moe_model() ++++++++ ++++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions ++++++++ output_router_logits = ( ++++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ++++++++ } ++++++++ ) ++++++++ return model_inputs +++++++++# @lwx +++++++++ # def _decode_one_tokens_logits( +++++++++ # self, +++++++++ # cur_token: mindspore.Tensor, +++++++++ # input_pos: Optional[mindspore.Tensor], +++++++++ # cache_position: mindspore.Tensor, +++++++++ # past_key_values: StaticCache, +++++++++ # ) -> mindspore.Tensor: +++++++++ # """ +++++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++++++ +++++++++ # Args: +++++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++++++ # input_pos: 输入位置信息,可选 +++++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +++++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++++++ +++++++++ # Returns: +++++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++++++ # """ +++++++++ # # 调用JIT编译的版本 +++++++++ # return self.get_decode_one_tokens_logits( +++++++++ # cur_token=cur_token, +++++++++ # input_pos=input_pos, +++++++++ # cache_position=cache_position, +++++++++ # past_key_values=past_key_values, +++++++++ # ) +++++++++ +++++++++ # @mindspore.jit(jit_level='O1') +++++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++++++ # """ +++++++++ # JIT编译的函数,用于高效的单token解码 +++++++++ # 使用JIT编译优化以支持静态shape和高效执行 +++++++++ +++++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++++++ # """ +++++++++ # outputs = self.model.forward( +++++++++ # input_ids=cur_token, +++++++++ # position_ids=input_pos, +++++++++ # cache_position=cache_position, +++++++++ # past_key_values=past_key_values, +++++++++ # use_cache=True, +++++++++ # return_dict=False, +++++++++ # ) +++++++++ +++++++++ # hidden_states = outputs[0] +++++++++ # logits = self.lm_head.forward(hidden_states) +++++++++ # logits = logits.float() +++++++++ +++++++++ # return logits[:, -1, :] +++++++++ +++++++++ # def _sample( +++++++++ # self, +++++++++ # input_ids: mindspore.Tensor, +++++++++ # logits_processor, +++++++++ # stopping_criteria, +++++++++ # generation_config, +++++++++ # synced_devices: bool, +++++++++ # streamer=None, +++++++++ # logits_warper=None, +++++++++ # **model_kwargs, +++++++++ # ): +++++++++ # """ +++++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++++++ # """ +++++++++ # from ...generation.logits_process import LogitsProcessorList +++++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +++++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++++++ # from mindnlp.core import nn, ops, no_grad +++++++++ # import numpy as np +++++++++ +++++++++ # # 检查是否使用 StaticCache +++++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++++++ # # 否则,直接调用父类方法 +++++++++ # past_key_values = model_kwargs.get("past_key_values") +++++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++++++ +++++++++ # if not isinstance(past_key_values, StaticCache): +++++++++ # # 不使用 StaticCache,直接调用父类方法 +++++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++++++ # return super()._sample( +++++++++ # input_ids=input_ids, +++++++++ # logits_processor=logits_processor, +++++++++ # stopping_criteria=stopping_criteria, +++++++++ # generation_config=generation_config, +++++++++ # synced_devices=synced_devices, +++++++++ # streamer=streamer, +++++++++ # logits_warper=logits_warper, +++++++++ # **model_kwargs, +++++++++ # ) +++++++++ +++++++++ # # 使用 StaticCache,进入自定义循环 +++++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++++++ # pad_token_id = generation_config._pad_token_tensor +++++++++ # output_attentions = generation_config.output_attentions +++++++++ # output_hidden_states = generation_config.output_hidden_states +++++++++ # output_scores = generation_config.output_scores +++++++++ # output_logits = generation_config.output_logits +++++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +++++++++ # max_length = generation_config.max_length +++++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++++++ # do_sample = generation_config.do_sample +++++++++ +++++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++++++ # raise ValueError( +++++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++++++ # f"{logits_warper})." +++++++++ # ) +++++++++ +++++++++ # # init attention / hidden states / scores tuples +++++++++ # scores = () if (return_dict_in_generate and output_scores) else None +++++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++++++ +++++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++++++ # encoder_hidden_states = ( +++++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++++++ # ) +++++++++ +++++++++ # # keep track of which sequences are already finished +++++++++ # batch_size, cur_len = input_ids.shape +++++++++ # this_peer_finished = False +++++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++++++ +++++++++ # time_record = [] +++++++++ # from ....utils.testing_utils import parse_flag_from_env +++++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++++++ +++++++++ # while self._has_unfinished_sequences( +++++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++++++ # ): +++++++++ # if _record_time: +++++++++ # import time as time_module +++++++++ # infer_start = time_module.time() +++++++++ +++++++++ # # prepare model inputs +++++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++++++ +++++++++ # # prepare variable output controls +++++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++++++ +++++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++++++ # cur_cache_position = model_inputs.get("cache_position") +++++++++ # cur_past_key_values = model_inputs.get("past_key_values") +++++++++ # cur_input_ids = model_inputs.get("input_ids") +++++++++ +++++++++ # if (isinstance(cur_past_key_values, StaticCache) and +++++++++ # cur_cache_position is not None and +++++++++ # len(cur_cache_position.shape) > 0 and +++++++++ # cur_cache_position.shape[0] == 1 and +++++++++ # cur_input_ids is not None and +++++++++ # cur_input_ids.shape[1] == 1): +++++++++ # # 使用 JIT 优化的单 token 解码 +++++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++++++ # if not hasattr(self, '_jit_used'): +++++++++ # self._jit_used = False +++++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++++++ +++++++++ # next_token_logits = self.get_decode_one_tokens_logits( +++++++++ # cur_token=cur_input_ids, +++++++++ # input_pos=model_inputs.get("position_ids"), +++++++++ # cache_position=cur_cache_position, +++++++++ # past_key_values=cur_past_key_values, +++++++++ # ) +++++++++ +++++++++ # # 标记已使用JIT(用于后续判断) +++++++++ # if not self._jit_used: +++++++++ # self._jit_used = True +++++++++ +++++++++ # # 构造兼容的输出对象 +++++++++ # class JitOptimizedOutput: +++++++++ # def __init__(self, logits, config): +++++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++++++ # self.config = config +++++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +++++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++++++ # self.attentions = None if not config.is_encoder_decoder else None +++++++++ # self.cross_attentions = None +++++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +++++++++ +++++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++++++ # else: +++++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++++++ # outputs = self(**model_inputs, return_dict=True) +++++++++ +++++++++ # if synced_devices and this_peer_finished: +++++++++ # continue +++++++++ +++++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++++++ # next_token_logits = outputs.logits[:, -1, :] +++++++++ +++++++++ # # pre-process distribution +++++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++++++ # if do_sample: +++++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++++++ +++++++++ # # Store scores, attentions and hidden_states when required +++++++++ # if return_dict_in_generate: +++++++++ # if output_scores: +++++++++ # scores += (next_token_scores,) +++++++++ # if output_logits: +++++++++ # raw_logits += (next_token_logits,) +++++++++ # if output_attentions: +++++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +++++++++ # if self.config.is_encoder_decoder: +++++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++++++ +++++++++ # if output_hidden_states: +++++++++ # hidden = ( +++++++++ # outputs.decoder_hidden_states +++++++++ # if self.config.is_encoder_decoder +++++++++ # else outputs.hidden_states +++++++++ # ) +++++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++++++ +++++++++ # # token selection +++++++++ # if do_sample: +++++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++++++ # else: +++++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++++++ +++++++++ # # finished sentences should have their next token be a padding token +++++++++ # if has_eos_stopping_criteria: +++++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++++++ +++++++++ # # update generated ids, model inputs, and length for next step +++++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++++++ # if streamer is not None: +++++++++ # streamer.put(next_tokens) +++++++++ +++++++++ # model_kwargs = self._update_model_kwargs_for_generation( +++++++++ # outputs, +++++++++ # model_kwargs, +++++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +++++++++ # ) +++++++++ +++++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++++++ # cur_len += 1 +++++++++ +++++++++ # if _record_time: +++++++++ # import time as time_module +++++++++ # infer_stop = time_module.time() +++++++++ # time_record.append(infer_stop - infer_start) +++++++++ +++++++++ # del outputs +++++++++ +++++++++ # average_infer_time = None +++++++++ # if time_record: +++++++++ # if len(time_record) > 1: +++++++++ # time_record.pop(0) +++++++++ # average_infer_time = sum(time_record) / len(time_record) +++++++++ # print(f'average inference time is: {average_infer_time}') +++++++++ # print(f'inference time record: {time_record}') +++++++++ +++++++++ # if streamer is not None: +++++++++ # streamer.end() +++++++++ +++++++++ # # 简单判断:打印是否使用了JIT路径 +++++++++ # if hasattr(self, '_jit_used') and self._jit_used: +++++++++ # print("[JIT] ✓ JIT optimization was used during generation") +++++++++ # else: +++++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++++++ +++++++++ # if return_dict_in_generate: +++++++++ # if self.config.is_encoder_decoder: +++++++++ # return GenerateEncoderDecoderOutput( +++++++++ # sequences=input_ids, +++++++++ # scores=scores, +++++++++ # logits=raw_logits, +++++++++ # encoder_attentions=encoder_attentions, +++++++++ # encoder_hidden_states=encoder_hidden_states, +++++++++ # decoder_attentions=decoder_attentions, +++++++++ # cross_attentions=cross_attentions, +++++++++ # decoder_hidden_states=decoder_hidden_states, +++++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++++ # average_infer_time=average_infer_time +++++++++ # ) +++++++++ # else: +++++++++ # return GenerateDecoderOnlyOutput( +++++++++ # sequences=input_ids, +++++++++ # scores=scores, +++++++++ # logits=raw_logits, +++++++++ # attentions=decoder_attentions, +++++++++ # hidden_states=decoder_hidden_states, +++++++++ # past_key_values=model_kwargs.get("past_key_values"), +++++++++ # average_infer_time=average_infer_time +++++++++ # ) +++++++++ # else: +++++++++ # return input_ids +++++++++ +++++++++ # def _prepare_cache_for_generation( +++++++++ # self, +++++++++ # generation_config, +++++++++ # model_kwargs, +++++++++ # assistant_model, +++++++++ # batch_size, +++++++++ # max_cache_length, +++++++++ # ): +++++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++++++ # generation_config.cache_implementation = "static" +++++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++++++ +++++++++ # if generation_config.cache_implementation == "static": +++++++++ # base_required_from_max_length = generation_config.max_length + 1 +++++++++ # base_required = max(max_cache_length, base_required_from_max_length) +++++++++ # min_cache_size = 50 +++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++++++ # else: +++++++++ # max_cache_length = max(base_required, min_cache_size) +++++++++ +++++++++ # original_max_cache_length = max_cache_length +++++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +++++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++++++ # print(f" - final max_cache_length: {max_cache_length}") +++++++++ +++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++++ # if max_cache_length > self.config.max_position_embeddings: +++++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++++++ +++++++++ # result = super()._prepare_cache_for_generation( +++++++++ # generation_config=generation_config, +++++++++ # model_kwargs=model_kwargs, +++++++++ # assistant_model=assistant_model, +++++++++ # batch_size=batch_size, +++++++++ # max_cache_length=max_cache_length, +++++++++ # ) +++++++++ +++++++++ # if generation_config.cache_implementation == "static": +++++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++++++ # created_cache = model_kwargs.get(cache_name) +++++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++++++ # if created_cache.max_cache_len < generation_config.max_length: +++++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++++++ +++++++++ # return result +++++++++ +++++++++ +++++++++ ++++++++ ++++++++ ++++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ++++++++-- ++++++++2.27.0 ++++++++ +++++++-- +++++++2.27.0 +++++++ ++++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch ++++++new file mode 100644 ++++++index 00000000..966529e4 ++++++--- /dev/null +++++++++ b/patches/0003-20261106secondcommit.patch ++++++@@ -0,0 +1,2769 @@ +++++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +++++++From: Pinoeer-kingxi <13022943007@163.com> +++++++Date: Thu, 6 Nov 2025 14:54:37 +0800 +++++++Subject: [PATCH 3/3] 20261106secondcommit +++++++ +++++++--- +++++++ .../models/deepseek/modeling_deepseek.py | 217 ++- +++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +++++++ patches/0001-20251104commit.patch | 1272 ----------------- +++++++ 3 files changed, 528 insertions(+), 2032 deletions(-) +++++++ delete mode 100644 patches/0001-20251104commit.patch +++++++ +++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++index 73773c22..2f9192bf 100644 +++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +++++++ +++++++ _CONFIG_FOR_DOC = "DeepseekConfig" +++++++ ++++++++_attn_mask_cache = {} ++++++++ ++++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): ++++++++ q_len = batch_and_seq[1] ++++++++ kv_len = batch_and_seq[1] + past_key_values_length ++++++++ key = (batch_and_seq[0], q_len, kv_len) ++++++++ ++++++++ if key in _attn_mask_cache: ++++++++ return _attn_mask_cache[key] ++++++++ ++++++++ mask = _prepare_4d_causal_attention_mask( ++++++++ attention_mask, ++++++++ batch_and_seq, ++++++++ inputs_embeds, ++++++++ past_key_values_length, ++++++++ ) ++++++++ _attn_mask_cache[key] = mask ++++++++ return mask +++++++ +++++++ def _get_unpad_data(attention_mask): +++++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +++++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +++++++ return final_output +++++++ +++++++ +++++++- @no_grad() +++++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++- expert_cache = ops.zeros_like(x) +++++++- idxs = flat_expert_indices.argsort() +++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++- token_idxs = idxs // self.num_experts_per_tok +++++++- +++++++- for i, end_idx in enumerate(tokens_per_expert): +++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++- if start_idx == end_idx: +++++++- continue +++++++- expert = self.experts[i] +++++++- exp_token_idx = token_idxs[start_idx:end_idx] +++++++- expert_tokens = x[exp_token_idx] +++++++- expert_out = expert(expert_tokens) +++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++- +++++++- return expert_cache +++++++- +++++++ # @no_grad() +++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++- # # expert_cache = torch.zeros_like(x) +++++++- # # idxs = flat_expert_indices.argsort() +++++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++- # # token_idxs = idxs // self.num_experts_per_tok +++++++- # # for i, end_idx in enumerate(tokens_per_expert): +++++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++- # # if start_idx == end_idx: +++++++- # # continue +++++++- # # expert = self.experts[i] +++++++- # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++- # # expert_tokens = x[exp_token_idx] +++++++- # # expert_out = expert(expert_tokens) +++++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++- # # return expert_cache ++++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++ # expert_cache = ops.zeros_like(x) +++++++ # idxs = flat_expert_indices.argsort() +++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++ +++++++ # return expert_cache +++++++- # @no_grad() +++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++- # expert_cache = ops.zeros_like(x) ++++++++ ++++++++ @no_grad() ++++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): ++++++++ """ ++++++++ 优化版 MoE prefill: ++++++++ - 批量张量化处理同一个 expert 的所有 token ++++++++ - 跳过无 token 的专家 ++++++++ - 保持结果完全一致 ++++++++ """ ++++++++ # 初始化输出缓存 ++++++++ expert_cache = ops.zeros_like(x) +++++++ +++++++- # # 排序保证顺序一致 +++++++- # idxs = flat_expert_indices.argsort() +++++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++- # token_idxs = idxs // self.num_experts_per_tok ++++++++ # 排序(确保 scatter_add 位置对应原逻辑) ++++++++ idxs = flat_expert_indices.argsort() ++++++++ sorted_expert_indices = flat_expert_indices[idxs] ++++++++ sorted_token_indices = idxs // self.num_experts_per_tok +++++++ +++++++- # # 找出有 token 的专家 +++++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++++ # 每个 expert 的 token 数 ++++++++ tokens_per_expert = sorted_expert_indices.bincount() +++++++ +++++++- # for i in active_experts.tolist(): +++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++- # end_idx = tokens_per_expert[i] +++++++- # if start_idx == end_idx: # 没有 token +++++++- # continue ++++++++ # 找出有 token 的专家 ++++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +++++++ +++++++- # exp_token_idx = token_idxs[start_idx:end_idx] +++++++- # expert_tokens = x[exp_token_idx] +++++++- # expert_out = self.experts[i](expert_tokens) +++++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++++ for expert_id in active_experts.tolist(): ++++++++ # 取该 expert 对应的排序后 token 区间 ++++++++ start = (tokens_per_expert[:expert_id]).sum().item() ++++++++ end = start + tokens_per_expert[expert_id].item() +++++++ +++++++- # expert_cache = mindspore.mint.scatter_add( +++++++- # expert_cache, +++++++- # 0, +++++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++- # expert_out +++++++- # ) ++++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 ++++++++ expert_tokens = x[token_idx] # 取输入向量 +++++++ +++++++- # return expert_cache ++++++++ # 执行专家 MLP ++++++++ expert_out = self.experts[expert_id](expert_tokens) ++++++++ ++++++++ # 按权重缩放 ++++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] ++++++++ ++++++++ # 回写到缓存(等价 scatter_add) ++++++++ expert_cache = mindspore.mint.scatter_add( ++++++++ expert_cache, ++++++++ 0, ++++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++++ scaled_out ++++++++ ) ++++++++ ++++++++ return expert_cache ++++++++ ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # # expert_cache = torch.zeros_like(x) ++++++++ # # idxs = flat_expert_indices.argsort() ++++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ++++++++ # # token_idxs = idxs // self.num_experts_per_tok ++++++++ # # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ++++++++ # # if start_idx == end_idx: ++++++++ # # continue ++++++++ # # expert = self.experts[i] ++++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # # expert_tokens = x[exp_token_idx] ++++++++ # # expert_out = expert(expert_tokens) ++++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ++++++++ # # return expert_cache ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # for i, end_idx in enumerate(tokens_per_expert): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # if start_idx == end_idx: ++++++++ # continue ++++++++ # expert = self.experts[i] ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = expert(expert_tokens) ++++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ++++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ++++++++ ++++++++ # return expert_cache ++++++++ # @no_grad() ++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ++++++++ # expert_cache = ops.zeros_like(x) ++++++++ ++++++++ # # 排序保证顺序一致 ++++++++ # idxs = flat_expert_indices.argsort() ++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ++++++++ # token_idxs = idxs // self.num_experts_per_tok ++++++++ ++++++++ # # 找出有 token 的专家 ++++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ++++++++ ++++++++ # for i in active_experts.tolist(): ++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ++++++++ # end_idx = tokens_per_expert[i] ++++++++ # if start_idx == end_idx: # 没有 token ++++++++ # continue ++++++++ ++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] ++++++++ # expert_tokens = x[exp_token_idx] ++++++++ # expert_out = self.experts[i](expert_tokens) ++++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ++++++++ ++++++++ # expert_cache = mindspore.mint.scatter_add( ++++++++ # expert_cache, ++++++++ # 0, ++++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ++++++++ # expert_out ++++++++ # ) ++++++++ ++++++++ # return expert_cache +++++++ +++++++ +++++++ +++++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++- +++++++ # class DeepseekFlashAttention(nn.Module): +++++++ # """ +++++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +++++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +++++++ +++++++ return attn_output, attn_weights, past_key_value +++++++ ++++++++ +++++++ Deepseek_ATTENTION_CLASSES = { +++++++ "eager": DeepseekAttention, +++++++ "flash-attention": DeepseekFlashAttention, +++++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +++++++ ) +++++++ else: +++++++ # 4d mask is passed through the layers +++++++- attention_mask = _prepare_4d_causal_attention_mask( ++++++++ # attention_mask = _prepare_4d_causal_attention_mask( ++++++++ # attention_mask, ++++++++ # (batch_size, seq_length), ++++++++ # inputs_embeds, ++++++++ # past_key_values_length, ++++++++ # ) ++++++++ #@dwj ++++++++ attention_mask = get_cached_causal_mask( +++++++ attention_mask, +++++++ (batch_size, seq_length), +++++++ inputs_embeds, +++++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++ # Initialize weights and apply final processing +++++++ self.post_init() +++++++ self.warm_up = False ++++++++ #@dwj ++++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( ++++++++ self.num_layers, ++++++++ self.num_attention_heads, ++++++++ self.head_dim, ++++++++ batch_size=1, ++++++++ max_length=self.max_length, ++++++++ dtype=mindspore.float16 ++++++++ ) ++++++++ ++++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): ++++++++ key_cache = [] ++++++++ value_cache = [] ++++++++ for _ in range(num_layers): ++++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) ++++++++ key_cache.append(k) ++++++++ value_cache.append(v) ++++++++ return key_cache, value_cache ++++++++ +++++++ +++++++ def warmup_moe_model_deep(self): +++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++index bced285c..ebd7782e 100644 +++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +++++++ +++++++-Long_Prompt = False +++++++-PROMPT_LENGTH_THRESHOLD = 128 ++++++++Long_Prompt = 1 ++++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 ++++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 ++++++++ ++++++++_causal_mask_cache = {} ++++++++ ++++++++def get_cached_causal_mask_with_cache_position( ++++++++ attention_mask: mindspore.Tensor, ++++++++ sequence_length: int, ++++++++ target_length: int, ++++++++ dtype: mindspore.dtype, ++++++++ min_dtype: float, ++++++++ cache_position: mindspore.Tensor, ++++++++ batch_size: int, ++++++++): ++++++++ """ ++++++++ 带缓存的 causal mask 构造函数 ++++++++ """ ++++++++ # q_len 是当前 query 长度 ++++++++ q_len = sequence_length ++++++++ # kv_len 是 target_length ++++++++ kv_len = target_length ++++++++ ++++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 ++++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) ++++++++ ++++++++ if key in _causal_mask_cache: ++++++++ return _causal_mask_cache[key] ++++++++ ++++++++ # 调用原来的 mask 构造逻辑 ++++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++++ attention_mask, ++++++++ sequence_length=sequence_length, ++++++++ target_length=target_length, ++++++++ dtype=dtype, ++++++++ min_dtype=min_dtype, ++++++++ cache_position=cache_position, ++++++++ batch_size=batch_size, ++++++++ ) ++++++++ # 缓存结果 ++++++++ _causal_mask_cache[key] = causal_mask ++++++++ return causal_mask +++++++ +++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +++++++ def _prepare_4d_causal_attention_mask_with_cache_position( +++++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++++ +++++++ +++++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe ++++++++# class Qwen2MoeAttention(nn.Module): ++++++++# """ ++++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer ++++++++# and "Generating Long Sequences with Sparse Transformers". ++++++++# """ ++++++++ ++++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ++++++++# super().__init__() ++++++++# self.config = config ++++++++# self.layer_idx = layer_idx ++++++++# if layer_idx is None: ++++++++# logger.warning_once( ++++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ++++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++++# "when creating this class." ++++++++# ) ++++++++ ++++++++# self.hidden_size = config.hidden_size ++++++++# self.num_heads = config.num_attention_heads ++++++++# self.head_dim = self.hidden_size // self.num_heads ++++++++# self.num_key_value_heads = config.num_key_value_heads ++++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ++++++++# self.max_position_embeddings = config.max_position_embeddings ++++++++# self.rope_theta = config.rope_theta ++++++++# self.is_causal = True ++++++++# self.attention_dropout = config.attention_dropout ++++++++ ++++++++# if (self.head_dim * self.num_heads) != self.hidden_size: ++++++++# raise ValueError( ++++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ++++++++# f" and `num_heads`: {self.num_heads})." ++++++++# ) ++++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ++++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ++++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ++++++++ ++++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( ++++++++# self.head_dim, ++++++++# max_position_embeddings=self.max_position_embeddings, ++++++++# base=self.rope_theta, ++++++++# ) ++++++++ ++++++++# def forward( ++++++++# self, ++++++++# hidden_states: mindspore.Tensor, ++++++++# attention_mask: Optional[mindspore.Tensor] = None, ++++++++# position_ids: Optional[mindspore.Tensor] = None, ++++++++# past_key_value: Optional[Cache] = None, ++++++++# output_attentions: bool = False, ++++++++# use_cache: bool = False, ++++++++# cache_position: Optional[mindspore.Tensor] = None, ++++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ++++++++ ++++++++ ++++++++ ++++++++# bsz, q_len, _ = hidden_states.shape ++++++++ ++++++++# query_states = self.q_proj(hidden_states) ++++++++# key_states = self.k_proj(hidden_states) ++++++++# value_states = self.v_proj(hidden_states) ++++++++ ++++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) ++++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) ++++++++ ++++++++# kv_seq_len = key_states.shape[-2] ++++++++# if past_key_value is not None: ++++++++# if self.layer_idx is None: ++++++++# raise ValueError( ++++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ++++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ++++++++# "with a layer index." ++++++++# ) ++++++++# if isinstance(past_key_value, StaticCache): ++++++++# kv_seq_len = key_states.shape[-2] ++++++++# else: ++++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ++++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ++++++++ ++++++++# if past_key_value is not None: ++++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++# if isinstance(past_key_value, StaticCache): ++++++++# kv_seq_len = key_states.shape[-2] ++++++++ ++++++++# # repeat k/v heads if n_kv_heads < n_heads ++++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++ ++++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++++ ++++++++# if attention_mask is not None: ++++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++++# attn_weights = attn_weights + causal_mask ++++++++ ++++++++# # upcast attention to fp32 ++++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++++++# attn_output = ops.matmul(attn_weights, value_states) ++++++++ ++++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++++++# raise ValueError( ++++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" ++++++++# f" {attn_output.shape}" ++++++++# ) ++++++++ ++++++++# attn_output = ops.transpose(attn_output, 1, 2) ++++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++++ ++++++++# attn_output = self.o_proj(attn_output) ++++++++# # @lwx ++++++++ ++++++++# # max_seq_len = self.max_position_embeddings # 2048 ++++++++ ++++++++# # if attention_mask is not None: ++++++++# # # attention_mask: [B, 1, Sq, Sk] ++++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ++++++++ ++++++++# # # pad 到 [max_seq_len, max_seq_len] ++++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ++++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ++++++++# # global_attention_mask = padded_mask ++++++++# # else: ++++++++# # global_attention_mask = None ++++++++ ++++++++ ++++++++# # sparse_mode=3 ++++++++# # attn_output = mindspore.ops.flash_attention_score( ++++++++# # query=query_states, ++++++++# # key=key_states, ++++++++# # value=value_states, ++++++++# # real_shift=None, ++++++++# # padding_mask=None, ++++++++ ++++++++# # head_num=self.num_heads, ++++++++# # attn_mask=global_attention_mask, ++++++++# # keep_prob=1.0 - self.attention_dropout, ++++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++# # input_layout="BNSD", ++++++++# # pre_tokens=2147483647, ++++++++# # next_tokens=2147483647, ++++++++# # inner_precise=0, ++++++++# # drop_mask=None, ++++++++# # prefix=None, ++++++++# # actual_seq_qlen=None, ++++++++# # actual_seq_kvlen=None, ++++++++# # sparse_mode=sparse_mode, ++++++++# # ) ++++++++# if not output_attentions: ++++++++# attn_weights = None ++++++++ ++++++++# return attn_output, attn_weights, past_key_value ++++++++ +++++++ class Qwen2MoeAttention(nn.Module): +++++++ """ +++++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +++++++- and "Generating Long Sequences with Sparse Transformers". +++++++- """ ++++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +++++++ ++++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: ++++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 ++++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 ++++++++ ++++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 ++++++++ """ +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++ super().__init__() +++++++ self.config = config +++++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +++++++ if layer_idx is None: +++++++ logger.warning_once( +++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +++++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ++++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +++++++ "when creating this class." +++++++ ) +++++++ +++++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +++++++ use_cache: bool = False, +++++++ cache_position: Optional[mindspore.Tensor] = None, +++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++- +++++++ +++++++- ++++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +++++++ bsz, q_len, _ = hidden_states.shape +++++++ +++++++ query_states = self.q_proj(hidden_states) +++++++ key_states = self.k_proj(hidden_states) +++++++ value_states = self.v_proj(hidden_states) +++++++ +++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +++++++- ++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ++++++++ +++++++ kv_seq_len = key_states.shape[-2] +++++++ if past_key_value is not None: +++++++- if self.layer_idx is None: +++++++- raise ValueError( +++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++- "with a layer index." +++++++- ) +++++++- if isinstance(past_key_value, StaticCache): +++++++- kv_seq_len = key_states.shape[-2] +++++++- else: +++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ++++++++ +++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++ +++++++ if past_key_value is not None: +++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models ++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ++++++++ ++++++++ # --- 2. 动态调度核心注意力计算 --- ++++++++ global Long_Prompt ++++++++ if Long_Prompt >= 1: ++++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- ++++++++ fa_attention_mask = None ++++++++ if attention_mask is not None: ++++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ++++++++ fa_attention_mask = (mask_slice != 0) ++++++++ ++++++++ attn_output = mindspore.ops.flash_attention_score( ++++++++ query=query_states, ++++++++ key=key_states, ++++++++ value=value_states, ++++++++ head_num=self.num_heads, ++++++++ attn_mask=fa_attention_mask, ++++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, ++++++++ scalar_value=1.0 / math.sqrt(self.head_dim), ++++++++ input_layout="BNSD", ++++++++ sparse_mode=0, ++++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 ++++++++ ) +++++++ +++++++- if isinstance(past_key_value, StaticCache): +++++++- kv_seq_len = key_states.shape[-2] ++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ++++++++ attn_output = self.o_proj(attn_output) ++++++++ attn_weights = None ++++++++ if output_attentions: ++++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +++++++ +++++++- # repeat k/v heads if n_kv_heads < n_heads +++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++- +++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) ++++++++ else: ++++++++ # --- Eager Attention 路径 (用于短序列和解码) --- ++++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) ++++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) ++++++++ ++++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++ +++++++- if attention_mask is not None: +++++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++- attn_weights = attn_weights + causal_mask ++++++++ if attention_mask is not None: ++++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] ++++++++ attn_weights = attn_weights + causal_mask +++++++ +++++++- # upcast attention to fp32 +++++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +++++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +++++++- attn_output = ops.matmul(attn_weights, value_states) ++++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) ++++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) ++++++++ attn_output = ops.matmul(attn_weights, value_states) +++++++ +++++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +++++++- raise ValueError( +++++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +++++++- f" {attn_output.shape}" +++++++- ) ++++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): ++++++++ raise ValueError( ++++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" ++++++++ ) +++++++ +++++++- attn_output = ops.transpose(attn_output, 1, 2) +++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++++ attn_output = ops.transpose(attn_output, 1, 2) ++++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) ++++++++ attn_output = self.o_proj(attn_output) +++++++ +++++++- attn_output = self.o_proj(attn_output) +++++++- # @lwx ++++++++ if not output_attentions: ++++++++ attn_weights = None +++++++ +++++++- # max_seq_len = self.max_position_embeddings # 2048 +++++++- +++++++- # if attention_mask is not None: +++++++- # # attention_mask: [B, 1, Sq, Sk] +++++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++- +++++++- # # pad 到 [max_seq_len, max_seq_len] +++++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++- # global_attention_mask = padded_mask +++++++- # else: +++++++- # global_attention_mask = None +++++++- +++++++- +++++++- # sparse_mode=3 +++++++- # attn_output = mindspore.ops.flash_attention_score( +++++++- # query=query_states, +++++++- # key=key_states, +++++++- # value=value_states, +++++++- # real_shift=None, +++++++- # padding_mask=None, +++++++- +++++++- # head_num=self.num_heads, +++++++- # attn_mask=global_attention_mask, +++++++- # keep_prob=1.0 - self.attention_dropout, +++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++- # input_layout="BNSD", +++++++- # pre_tokens=2147483647, +++++++- # next_tokens=2147483647, +++++++- # inner_precise=0, +++++++- # drop_mask=None, +++++++- # prefix=None, +++++++- # actual_seq_qlen=None, +++++++- # actual_seq_kvlen=None, +++++++- # sparse_mode=sparse_mode, +++++++- # ) +++++++- if not output_attentions: +++++++- attn_weights = None +++++++- +++++++ return attn_output, attn_weights, past_key_value +++++++ +++++++- +++++++ # class Qwen2MoeFlashAttention(nn.Module): +++++++ # """ +++++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +++++++ # return final_hidden_states, router_logits +++++++ +++++++ +++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++-# """ +++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +++++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +++++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +++++++-# """ +++++++-# def __init__(self, config: Qwen2MoeConfig): +++++++-# super().__init__() +++++++-# self.num_experts = config.num_experts +++++++-# self.top_k = config.num_experts_per_tok +++++++-# self.norm_topk_prob = config.norm_topk_prob +++++++- +++++++-# # 门控网络 +++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++-# # 专家列表 +++++++-# self.experts = nn.ModuleList( +++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++-# ) +++++++-# # 共享专家 +++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_decode( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# """ +++++++-# 【解码路径】针对 sequence_length=1 的极致优化。 +++++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +++++++-# """ +++++++-# batch_size, hidden_dim = hidden_states.shape +++++++- +++++++-# expert_outputs_list = [ +++++++-# ops.cat([ +++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++-# ], dim=0) +++++++-# for i in range(batch_size) +++++++-# ] +++++++- +++++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +++++++-# # shape: (batch_size, top_k, hidden_dim) +++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++- +++++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++- +++++++-# return moe_output.squeeze(1) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_prefill( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# """ +++++++-# 【预填充路径】针对 sequence_length > 1 的优化。 +++++++-# 按专家对 Token 进行分组,并进行批处理。 +++++++-# """ +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens = hidden_states.shape[0] +++++++-# flat_selected_experts = selected_experts.flatten() +++++++- +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++- +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++- +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++- +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++-# selected_token_indices = token_indices[mask] +++++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++++- +++++++-# current_states = hidden_states[selected_token_indices] +++++++- +++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++- +++++++-# moe_output = moe_output.index_add( +++++++-# dim=0, +++++++-# index=selected_token_indices, +++++++-# source=expert_output.to(hidden_states.dtype) +++++++-# ) +++++++-# return moe_output +++++++- +++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-# """ +++++++-# 顶层 forward 方法,作为智能分发器。 +++++++-# """ +++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- +++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-# router_logits = self.gate(hidden_states_reshaped) +++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- +++++++-# if self.norm_topk_prob: +++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- +++++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++-# moe_output = None +++++++-# # 在推理时,根据序列长度选择最优路径 +++++++-# if not self.training: +++++++-# if sequence_length == 1: +++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++++-# else: +++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++++-# else: +++++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +++++++-# raise NotImplementedError("Training path is not implemented.") +++++++- +++++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +++++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +++++++- +++++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +++++++- +++++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +++++++- +++++++-# return final_hidden_states, router_logits +++++++- +++++++- +++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++-# """ +++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +++++++-# """ +++++++-# def __init__(self, config: Qwen2MoeConfig): +++++++-# super().__init__() +++++++-# self.num_experts = config.num_experts +++++++-# self.top_k = config.num_experts_per_tok +++++++-# self.norm_topk_prob = config.norm_topk_prob +++++++- +++++++-# # 门控网络 +++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++-# # 专家列表 +++++++-# self.experts = nn.ModuleList( +++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++-# ) +++++++-# # 共享专家 +++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_decode( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# batch_size, _ = hidden_states.shape +++++++-# expert_outputs_list = [ +++++++-# ops.cat([ +++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++-# ], dim=0) +++++++-# for i in range(batch_size) +++++++-# ] +++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++-# return moe_output.squeeze(1) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_prefill( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens = hidden_states.shape[0] +++++++-# flat_selected_experts = selected_experts.flatten() +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++- +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++-# selected_token_indices = token_indices[mask] +++++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++++-# current_states = hidden_states[selected_token_indices] +++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++-# moe_output = moe_output.index_add( +++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++++-# ) +++++++-# return moe_output +++++++- +++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-# """ +++++++-# 顶层 forward 方法,作为智能分发器。 +++++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +++++++-# """ +++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- +++++++-# # 1. 门控计算 (通用逻辑) +++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-# router_logits = self.gate(hidden_states_reshaped) +++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- +++++++-# if self.norm_topk_prob: +++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- +++++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++-# # 2. 智能分发到最优 MoE 路径 +++++++-# moe_output = None +++++++-# if not self.training: +++++++-# if sequence_length == 1: +++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +++++++-# else: +++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +++++++-# else: +++++++-# raise NotImplementedError("Training path is not implemented.") +++++++- +++++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +++++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++- +++++++-# # 4. 合并 MoE 输出和共享专家输出 +++++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++- +++++++-# # 5. 恢复原始形状并返回 +++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++- +++++++-# return final_hidden_states, router_logits +++++++- +++++++-# prefill fastest +++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++-# """ +++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +++++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +++++++-# """ +++++++-# def __init__(self, config: Qwen2MoeConfig): +++++++-# super().__init__() +++++++-# self.num_experts = config.num_experts +++++++-# self.top_k = config.num_experts_per_tok +++++++-# self.norm_topk_prob = config.norm_topk_prob +++++++- +++++++-# # 门控网络 +++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++-# # 专家列表 +++++++-# self.experts = nn.ModuleList( +++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++-# ) +++++++-# # 共享专家 +++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_dispatch( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# """ +++++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +++++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +++++++-# """ +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens, _ = hidden_states.shape +++++++- +++++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +++++++-# flat_selected_experts = selected_experts.flatten() +++++++-# flat_routing_weights = routing_weights.flatten() +++++++- +++++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++- +++++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++- +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++- +++++++-# # 找到所有分配给该专家的 token +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++- +++++++-# # 使用 mask 选取对应的 token 和权重 +++++++-# current_token_indices = token_indices[mask] +++++++-# current_routing_weights = flat_routing_weights[mask] +++++++-# current_hidden_states = hidden_states[current_token_indices] +++++++- +++++++-# # 对这些 token 进行批处理 +++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++++- +++++++-# # 使用 index_add 将结果精确地加回到对应位置 +++++++-# moe_output = moe_output.index_add( +++++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +++++++-# ) +++++++-# return moe_output +++++++- +++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-# """ +++++++-# 顶层 forward 方法,作为智能分发器。 +++++++-# """ +++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- +++++++-# # 1. 门控计算 +++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-# router_logits = self.gate(hidden_states_reshaped) +++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- +++++++-# if self.norm_topk_prob: +++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- +++++++-# routing_weights = routing_weights.to(hidden_states.dtype) +++++++- +++++++-# # 2. 调用统一的 MoE 计算内核 +++++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +++++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +++++++- +++++++-# # 3. 统一处理共享专家 +++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++- +++++++-# # 4. 合并输出 +++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++- +++++++-# # 5. 恢复原始形状并返回 +++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++- +++++++-# return final_hidden_states, router_logits +++++++- +++++++- +++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++-# """ +++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +++++++-# 【最终高性能与高精度版】: +++++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +++++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +++++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +++++++-# 3. 这样实现了速度和准确性的两全其美。 +++++++-# """ +++++++-# def __init__(self, config: Qwen2MoeConfig): +++++++-# super().__init__() +++++++-# self.num_experts = config.num_experts +++++++-# self.top_k = config.num_experts_per_tok +++++++-# self.norm_topk_prob = config.norm_topk_prob +++++++- +++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++-# self.experts = nn.ModuleList( +++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++-# ) +++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_decode( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# """ +++++++-# 【解码路径】极致优化版:bmm + 高精度累加。 +++++++-# """ +++++++-# original_dtype = hidden_states.dtype +++++++-# batch_size, _ = hidden_states.shape +++++++- +++++++-# expert_outputs_list = [ +++++++-# ops.cat([ +++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++-# ], dim=0) +++++++-# for i in range(batch_size) +++++++-# ] +++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++- +++++++-# # 在 float32 下执行 bmm,得到高精度结果 +++++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +++++++- +++++++-# # 将高精度结果转换回原始数据类型 +++++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +++++++- +++++++-# return moe_output +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_prefill( +++++++-# self, +++++++-# hidden_states: mindspore.Tensor, +++++++-# selected_experts: mindspore.Tensor, +++++++-# routing_weights: mindspore.Tensor +++++++-# ) -> mindspore.Tensor: +++++++-# """ +++++++-# 【预填充路径】与原始实现一致,结果精确。 +++++++-# """ +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens, _ = hidden_states.shape +++++++-# flat_selected_experts = selected_experts.flatten() +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++- +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++-# selected_token_indices = token_indices[mask] +++++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++++-# current_states = hidden_states[selected_token_indices] +++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++-# moe_output = moe_output.index_add( +++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +++++++-# ) +++++++-# return moe_output +++++++- +++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++- +++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-# router_logits = self.gate(hidden_states_reshaped) +++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++- +++++++-# if self.norm_topk_prob: +++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- +++++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +++++++-# # 如果模型主体是 float16,后续再转换 +++++++- +++++++-# moe_output = None +++++++-# if not self.training: +++++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +++++++-# # _moe_infer_decode 内部会处理好类型转换 +++++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +++++++-# if sequence_length == 1: +++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++++-# else: +++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +++++++-# else: +++++++-# raise NotImplementedError("Training path is not implemented.") +++++++- +++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++- +++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++- +++++++-# return final_hidden_states, router_logits +++++++- +++++++- +++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +++++++-# """ +++++++-# 【融合版】一个混合专家模块,内置两种推理策略, +++++++-# 由外部全局变量 `Long_Prompt` 控制: +++++++- +++++++-# - if Long_Prompt is True: 【精度优先模式】 +++++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +++++++-# 适用于处理长序列,避免误差累积。 +++++++- +++++++-# - if Long_Prompt is False: 【速度优先模式】 +++++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +++++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +++++++-# """ +++++++-# def __init__(self, config: Qwen2MoeConfig): +++++++-# super().__init__() +++++++-# self.num_experts = config.num_experts +++++++-# self.top_k = config.num_experts_per_tok +++++++-# self.norm_topk_prob = config.norm_topk_prob +++++++- +++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +++++++-# self.experts = nn.ModuleList( +++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +++++++-# ) +++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-# # --- 速度优先模式的辅助函数 --- +++++++-# @no_grad() +++++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++-# original_dtype = hidden_states.dtype +++++++-# batch_size, _ = hidden_states.shape +++++++-# expert_outputs_list = [ +++++++-# ops.cat([ +++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +++++++-# ], dim=0) +++++++-# for i in range(batch_size) +++++++-# ] +++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +++++++-# weights_fp32 = routing_weights.to(mindspore.float32) +++++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +++++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++++-# return moe_output_fp32.squeeze(1).to(original_dtype) +++++++- +++++++-# @no_grad() +++++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens, _ = hidden_states.shape +++++++-# flat_selected_experts = selected_experts.flatten() +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++-# selected_token_indices = token_indices[mask] +++++++-# selected_routing_weights = routing_weights.flatten()[mask] +++++++-# current_states = hidden_states[selected_token_indices] +++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++-# return moe_output +++++++- +++++++-# # --- 精度优先模式的辅助函数 --- +++++++-# @no_grad() +++++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++-# moe_output = ops.zeros_like(hidden_states) +++++++-# num_tokens, _ = hidden_states.shape +++++++-# flat_selected_experts = selected_experts.flatten() +++++++-# flat_routing_weights = routing_weights.flatten() +++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +++++++-# active_experts = ops.unique(flat_selected_experts) +++++++-# for expert_idx_tensor in active_experts: +++++++-# expert_idx = expert_idx_tensor.item() +++++++-# expert_layer = self.experts[expert_idx] +++++++-# mask = (flat_selected_experts == expert_idx_tensor) +++++++-# current_token_indices = token_indices[mask] +++++++-# current_routing_weights = flat_routing_weights[mask] +++++++-# current_hidden_states = hidden_states[current_token_indices] +++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +++++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++-# return moe_output +++++++- +++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-# # 声明我们将要使用一个在模块外部定义的全局变量 +++++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +++++++-# global Long_Prompt +++++++- +++++++-# # 1. 门控计算 (所有模式通用) +++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-# router_logits = self.gate(hidden_states_reshaped) +++++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +++++++-# if self.norm_topk_prob: +++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++- +++++++-# moe_output = None +++++++-# if not self.training: +++++++-# # 根据 Long_Prompt 标志选择模式 +++++++-# if Long_Prompt: +++++++-# # --- 精度优先模式 --- +++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++-# else: +++++++-# # --- 速度优先模式 --- +++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++-# if sequence_length == 1: +++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++-# else: +++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++-# else: +++++++-# raise NotImplementedError("Training path is not implemented.") +++++++- +++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +++++++- +++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +++++++- +++++++-# return final_hidden_states, router_logits +++++++- +++++++ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ """ +++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +++++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +++++++ return moe_output_fp32.squeeze(1).to(original_dtype) +++++++ ++++++++ # @no_grad() ++++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ++++++++ # num_tokens, _ = hidden_states.shape ++++++++ # flat_selected_experts = selected_experts.flatten() ++++++++ # sorted_expert_indices = flat_selected_experts.argsort() ++++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) ++++++++ # original_token_indices = sorted_expert_indices // self.top_k ++++++++ # moe_output = ops.zeros_like(hidden_states) ++++++++ # current_token_offset = 0 ++++++++ # for i in range(self.num_experts): ++++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset ++++++++ # if expert_token_count == 0: ++++++++ # continue ++++++++ # end_offset = current_token_offset + expert_token_count ++++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] ++++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] ++++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] ++++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] ++++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) ++++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) ++++++++ # current_token_offset += expert_token_count ++++++++ # return moe_output ++++++++ +++++++ @no_grad() +++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++- num_tokens, _ = hidden_states.shape +++++++- flat_selected_experts = selected_experts.flatten() +++++++- sorted_expert_indices = flat_selected_experts.argsort() +++++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +++++++- original_token_indices = sorted_expert_indices // self.top_k ++++++++ """ ++++++++ 优化版 MoE prefill (速度优先模式): ++++++++ - 批量张量化处理同一个 expert 的所有 token ++++++++ - 跳过无 token 的专家 ++++++++ - 保持结果完全一致 ++++++++ """ +++++++ moe_output = ops.zeros_like(hidden_states) +++++++- current_token_offset = 0 +++++++- for i in range(self.num_experts): +++++++- expert_token_count = tokens_per_expert[i] - current_token_offset +++++++- if expert_token_count == 0: +++++++- continue +++++++- end_offset = current_token_offset + expert_token_count +++++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +++++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +++++++- expert_hidden_states = hidden_states[expert_original_token_indices] +++++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +++++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +++++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +++++++- current_token_offset += expert_token_count ++++++++ ++++++++ flat_selected_experts = selected_experts.flatten() ++++++++ flat_routing_weights = routing_weights.flatten() ++++++++ ++++++++ idxs = flat_selected_experts.argsort() ++++++++ sorted_expert_indices = flat_selected_experts[idxs] ++++++++ sorted_token_indices = idxs // self.top_k ++++++++ ++++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) ++++++++ ++++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() ++++++++ ++++++++ for expert_id in active_experts.tolist(): ++++++++ start = int(tokens_per_expert[:expert_id].sum().item()) ++++++++ end = start + int(tokens_per_expert[expert_id].item()) ++++++++ ++++++++ token_idx = sorted_token_indices[start:end] ++++++++ expert_tokens = hidden_states[token_idx] ++++++++ ++++++++ expert_out = self.experts[expert_id](expert_tokens) ++++++++ ++++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) ++++++++ ++++++++ moe_output = mindspore.mint.scatter_add( ++++++++ moe_output, ++++++++ 0, ++++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), ++++++++ scaled_out.to(hidden_states.dtype) ++++++++ ) ++++++++ +++++++ return moe_output +++++++ ++++++++ +++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +++++++ @no_grad() +++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +++++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++ +++++++ moe_output = None +++++++- if Long_Prompt: +++++++- # --- 精度优先模式 (ACCURACY MODE) --- +++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ # if Long_Prompt==0: ++++++++ # # --- 精度优先模式 (ACCURACY MODE) --- ++++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ # else: ++++++++ # # --- 速度优先模式 (SPEED MODE) --- ++++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++ # if sequence_length == 1: ++++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ # else: ++++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ ++++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) ++++++++ if sequence_length == 1: ++++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++ else: +++++++- # --- 速度优先模式 (SPEED MODE) --- +++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +++++++- if sequence_length == 1: +++++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++- else: +++++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +++++++- ++++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) ++++++++ +++++++ +++++++ # 3. 共享专家计算与合并 (所有模式通用) +++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +++++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++ +++++++ return final_hidden_states, router_logits +++++++ ++++++++ +++++++ class Qwen2MoeDecoderLayer(nn.Module): +++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +++++++ super().__init__() +++++++ self.hidden_size = config.hidden_size +++++++ +++++++- # if Long_Prompt: +++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++- # else: ++++++++ # if Long_Prompt == 2: +++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ++++++++ # else: ++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ +++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++ +++++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++++ ) +++++++ +++++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +++++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++++ # attention_mask, ++++++++ # sequence_length=sequence_length, ++++++++ # target_length=target_length, ++++++++ # dtype=dtype, ++++++++ # min_dtype=min_dtype, ++++++++ # cache_position=cache_position, ++++++++ # batch_size=input_tensor.shape[0], ++++++++ # ) ++++++++ #@dwj ++++++++ causal_mask = get_cached_causal_mask_with_cache_position( +++++++ attention_mask, +++++++ sequence_length=sequence_length, +++++++ target_length=target_length, +++++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +++++++ """ +++++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD ++++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache ++++++++ _causal_mask_cache.clear() +++++++ +++++++ input_ids = kwargs.get("input_ids") +++++++ if input_ids is None and args: +++++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ +++++++ if input_ids is not None: +++++++ prompt_length = input_ids.shape[1] +++++++- +++++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +++++++- Long_Prompt = True ++++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: ++++++++ Long_Prompt = 2 ++++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: ++++++++ Long_Prompt = 0 +++++++ else: +++++++- Long_Prompt = False ++++++++ Long_Prompt = 1 ++++++++ +++++++ +++++++ return super().generate(*args, **kwargs) +++++++ +++++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++ dtype = self.lm_head.weight.dtype +++++++ min_dtype = float(ops.finfo(dtype).min) +++++++ +++++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( ++++++++ # attention_mask, ++++++++ # sequence_length=sequence_length, ++++++++ # target_length=past_key_values.get_max_length(), ++++++++ # dtype=dtype, ++++++++ # min_dtype=min_dtype, ++++++++ # cache_position=cache_position, ++++++++ # batch_size=batch_size, ++++++++ # ) ++++++++ ++++++++ #@dwj ++++++++ attention_mask = get_cached_causal_mask_with_cache_position( +++++++ attention_mask, +++++++ sequence_length=sequence_length, +++++++ target_length=past_key_values.get_max_length(), +++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +++++++deleted file mode 100644 +++++++index 6dfb5b93..00000000 +++++++--- a/patches/0001-20251104commit.patch ++++++++++ /dev/null +++++++@@ -1,1272 +0,0 @@ +++++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +++++++-From: Pinoeer-kingxi <13022943007@163.com> +++++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +++++++-Subject: [PATCH] 20251104commit +++++++- +++++++---- +++++++- mindnlp/transformers/cache_utils.py | 28 +- +++++++- .../models/deepseek/modeling_deepseek.py | 149 ++- +++++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +++++++- 3 files changed, 976 insertions(+), 87 deletions(-) +++++++- +++++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +++++++-index cadd2e04..02f8d4be 100644 +++++++---- a/mindnlp/transformers/cache_utils.py +++++++-+++ b/mindnlp/transformers/cache_utils.py +++++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +++++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +++++++- # k_out[:, :, cache_position] = key_states +++++++- # v_out[:, :, cache_position] = value_states +++++++-- if ON_ORANGE_PI: +++++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++-- else: +++++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++-- +++++++-+ # if ON_ORANGE_PI: +++++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +++++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +++++++-+ # else: +++++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +++++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +++++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +++++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +++++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +++++++-+ if cache_position.ndim > 1: +++++++-+ cache_position = cache_position.flatten() +++++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +++++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +++++++-+ cache_position = cache_position.int() +++++++-+ +++++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +++++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +++++++-+ k_out[:, :, cache_position] = key_states +++++++-+ v_out[:, :, cache_position] = value_states +++++++-+ +++++++- return k_out, v_out +++++++- +++++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +++++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++-index c695b944..d8303e45 100644 +++++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +++++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +++++++- # Copied from transformers.models.llama.modeling_llama.rotate_half +++++++- def rotate_half(x): +++++++- """Rotates half the hidden dims of the input.""" +++++++-- x1 = x[..., : x.shape[-1] // 2] +++++++-- x2 = x[..., x.shape[-1] // 2 :] +++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++-+ # x1 = x[..., : x.shape[-1] // 2] +++++++-+ # x2 = x[..., x.shape[-1] // 2 :] +++++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++- return ops.cat((-x2, x1), dim=-1) +++++++- +++++++- +++++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +++++++- if self.training: +++++++- raise NotImplementedError("Training is not supported yet.") +++++++- else: +++++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++-- if self.config.n_shared_experts is not None: +++++++-- y = y + self.shared_experts(identity) +++++++-- return y +++++++-+ # @lwx +++++++-+ if orig_shape[1] == 1: +++++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +++++++-+ y=y.view(*orig_shape) +++++++-+ if self.config.n_shared_experts is not None: +++++++-+ y = y + self.shared_experts(identity) +++++++-+ return y +++++++-+ else: +++++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +++++++-+ if self.config.n_shared_experts is not None: +++++++-+ y = y + self.shared_experts(identity) +++++++-+ return y +++++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +++++++-+ # if self.config.n_shared_experts is not None: +++++++-+ # y = y + self.shared_experts(identity) +++++++-+ # return y +++++++-+ +++++++-+ @no_grad() +++++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +++++++-+ +++++++-+ expert_cache = ops.zeros_like(x) +++++++-+ for i in range(self.num_experts_per_tok): +++++++-+ expert_id = flat_expert_indices[i].item() +++++++-+ weight = flat_expert_weights[i].item() +++++++-+ expert = self.experts[expert_id] +++++++-+ expert_out = expert(x) +++++++-+ expert_cache += expert_out * weight +++++++-+ return expert_cache +++++++- +++++++- @no_grad() +++++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++-- # expert_cache = torch.zeros_like(x) +++++++-- # idxs = flat_expert_indices.argsort() +++++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++-- # token_idxs = idxs // self.num_experts_per_tok +++++++-- # for i, end_idx in enumerate(tokens_per_expert): +++++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++-- # if start_idx == end_idx: +++++++-- # continue +++++++-- # expert = self.experts[i] +++++++-- # exp_token_idx = token_idxs[start_idx:end_idx] +++++++-- # expert_tokens = x[exp_token_idx] +++++++-- # expert_out = expert(expert_tokens) +++++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++-- # return expert_cache +++++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +++++++- expert_cache = ops.zeros_like(x) +++++++- idxs = flat_expert_indices.argsort() +++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++- token_idxs = idxs // self.num_experts_per_tok +++++++-+ +++++++- for i, end_idx in enumerate(tokens_per_expert): +++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++- if start_idx == end_idx: +++++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +++++++- expert_out = expert(expert_tokens) +++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++-+ +++++++- return expert_cache +++++++-+ +++++++-+ # @no_grad() +++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++-+ # # expert_cache = torch.zeros_like(x) +++++++-+ # # idxs = flat_expert_indices.argsort() +++++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +++++++-+ # # token_idxs = idxs // self.num_experts_per_tok +++++++-+ # # for i, end_idx in enumerate(tokens_per_expert): +++++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +++++++-+ # # if start_idx == end_idx: +++++++-+ # # continue +++++++-+ # # expert = self.experts[i] +++++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +++++++-+ # # expert_tokens = x[exp_token_idx] +++++++-+ # # expert_out = expert(expert_tokens) +++++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +++++++-+ # # return expert_cache +++++++-+ # expert_cache = ops.zeros_like(x) +++++++-+ # idxs = flat_expert_indices.argsort() +++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++-+ # token_idxs = idxs // self.num_experts_per_tok +++++++-+ +++++++-+ # for i, end_idx in enumerate(tokens_per_expert): +++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++-+ # if start_idx == end_idx: +++++++-+ # continue +++++++-+ # expert = self.experts[i] +++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++-+ # expert_tokens = x[exp_token_idx] +++++++-+ # expert_out = expert(expert_tokens) +++++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +++++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +++++++-+ +++++++-+ # return expert_cache +++++++-+ # @no_grad() +++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +++++++-+ # expert_cache = ops.zeros_like(x) +++++++-+ +++++++-+ # # 排序保证顺序一致 +++++++-+ # idxs = flat_expert_indices.argsort() +++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +++++++-+ # token_idxs = idxs // self.num_experts_per_tok +++++++-+ +++++++-+ # # 找出有 token 的专家 +++++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +++++++-+ +++++++-+ # for i in active_experts.tolist(): +++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +++++++-+ # end_idx = tokens_per_expert[i] +++++++-+ # if start_idx == end_idx: # 没有 token +++++++-+ # continue +++++++-+ +++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +++++++-+ # expert_tokens = x[exp_token_idx] +++++++-+ # expert_out = self.experts[i](expert_tokens) +++++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +++++++-+ +++++++-+ # expert_cache = mindspore.mint.scatter_add( +++++++-+ # expert_cache, +++++++-+ # 0, +++++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +++++++-+ # expert_out +++++++-+ # ) +++++++-+ +++++++-+ # return expert_cache +++++++-+ +++++++-+ +++++++- +++++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +++++++- # """ +++++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++- +++++++- # Initialize weights and apply final processing +++++++- self.post_init() +++++++-+ self.warm_up = False +++++++-+ +++++++-+ def warmup_moe_model_deep(self): +++++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +++++++-+ test_texts = [ +++++++-+ "warmup short", +++++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +++++++-+ ] +++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++-+ if tokenizer is None: +++++++-+ from mindnlp.transformers import AutoTokenizer +++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++-+ self._warmup_tokenizer = tokenizer +++++++-+ +++++++-+ for text in test_texts: +++++++-+ inputs = tokenizer(text, return_tensors="ms") +++++++-+ with mindspore._no_grad(): +++++++-+ _ = self(**inputs, use_cache=False) +++++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +++++++- +++++++- def get_input_embeddings(self): +++++++- return self.model.embed_tokens +++++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++- ```""" +++++++-+ if not self.warm_up: +++++++-+ self.warm_up = True +++++++-+ self.warmup_moe_model_deep() +++++++-+ +++++++- output_attentions = ( +++++++- output_attentions +++++++- if output_attentions is not None +++++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++-index 3cbf820e..d4c6b651 100644 +++++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +++++++-@@ -18,7 +18,6 @@ +++++++- # See the License for the specific language governing permissions and +++++++- # limitations under the License. +++++++- """MindSpore Qwen2MoE model.""" +++++++-- +++++++- import math +++++++- from typing import List, Optional, Tuple, Union +++++++- +++++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +++++++- TokenClassifierOutput, +++++++- ) +++++++- from ...modeling_utils import PreTrainedModel +++++++-+from ...generation import GenerationMixin +++++++- from ....utils import logging +++++++- from .configuration_qwen2_moe import Qwen2MoeConfig +++++++- +++++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +++++++- self.variance_epsilon = eps +++++++- +++++++- def forward(self, hidden_states): +++++++-+ # @dwj +++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++-+ # @lwx +++++++-+ # if not self.training : +++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +++++++- input_dtype = hidden_states.dtype +++++++- hidden_states = hidden_states.to(mindspore.float32) +++++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +++++++-@@ -234,6 +239,8 @@ def rotate_half(x): +++++++- """Rotates half the hidden dims of the input.""" +++++++- x1 = x[..., : x.shape[-1] // 2] +++++++- x2 = x[..., x.shape[-1] // 2 :] +++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +++++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +++++++- return ops.cat((-x2, x1), dim=-1) +++++++- +++++++- +++++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +++++++- self.config = config +++++++- self.hidden_size = config.hidden_size +++++++- self.intermediate_size = intermediate_size +++++++-+ +++++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +++++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +++++++- self.act_fn = ACT2FN[config.hidden_act] +++++++- +++++++- def forward(self, x): +++++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++-- +++++++- +++++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +++++++-+ # @lwx +++++++-+ # gate_up_output = self.gate_up_proj(x) +++++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +++++++-+ # return self.down_proj(swiglu_output) +++++++-+ +++++++-+ # def forward(self, x): +++++++-+ # gate_proj_out = self.gate_proj(x) +++++++-+ # up_proj_out = self.up_proj(x) +++++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +++++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +++++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +++++++-+ # return self.down_proj(swiglu_out) +++++++-+ +++++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +++++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +++++++- """ +++++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +++++++- use_cache: bool = False, +++++++- cache_position: Optional[mindspore.Tensor] = None, +++++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++-+ +++++++-+ +++++++-+ +++++++- bsz, q_len, _ = hidden_states.shape +++++++- +++++++- query_states = self.q_proj(hidden_states) +++++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++- "with a layer index." +++++++- ) +++++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++-+ if isinstance(past_key_value, StaticCache): +++++++-+ kv_seq_len = key_states.shape[-2] +++++++-+ else: +++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++- +++++++- if past_key_value is not None: +++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +++++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +++++++-+ +++++++-+ if isinstance(past_key_value, StaticCache): +++++++-+ kv_seq_len = key_states.shape[-2] +++++++- +++++++- # repeat k/v heads if n_kv_heads < n_heads +++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++-- +++++++-+ +++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +++++++- +++++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +++++++-- raise ValueError( +++++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +++++++-- f" {attn_weights.shape}" +++++++-- ) +++++++-- +++++++-- if attention_mask is not None: # no matter the length, we just slice it +++++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +++++++-+ if attention_mask is not None: +++++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +++++++- attn_weights = attn_weights + causal_mask +++++++- +++++++- # upcast attention to fp32 +++++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +++++++- +++++++- attn_output = self.o_proj(attn_output) +++++++-- +++++++-+ # @lwx +++++++-+ +++++++-+ # max_seq_len = self.max_position_embeddings # 2048 +++++++-+ +++++++-+ # if attention_mask is not None: +++++++-+ # # attention_mask: [B, 1, Sq, Sk] +++++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +++++++-+ +++++++-+ # # pad 到 [max_seq_len, max_seq_len] +++++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +++++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +++++++-+ # global_attention_mask = padded_mask +++++++-+ # else: +++++++-+ # global_attention_mask = None +++++++-+ +++++++-+ +++++++-+ # sparse_mode=3 +++++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++++-+ # query=query_states, +++++++-+ # key=key_states, +++++++-+ # value=value_states, +++++++-+ # real_shift=None, +++++++-+ # padding_mask=None, +++++++-+ +++++++-+ # head_num=self.num_heads, +++++++-+ # attn_mask=global_attention_mask, +++++++-+ # keep_prob=1.0 - self.attention_dropout, +++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++-+ # input_layout="BNSD", +++++++-+ # pre_tokens=2147483647, +++++++-+ # next_tokens=2147483647, +++++++-+ # inner_precise=0, +++++++-+ # drop_mask=None, +++++++-+ # prefix=None, +++++++-+ # actual_seq_qlen=None, +++++++-+ # actual_seq_kvlen=None, +++++++-+ # sparse_mode=sparse_mode, +++++++-+ # ) +++++++- if not output_attentions: +++++++- attn_weights = None +++++++- +++++++- return attn_output, attn_weights, past_key_value +++++++- +++++++- +++++++-+class Qwen2MoeFlashAttention(nn.Module): +++++++-+ """ +++++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +++++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +++++++-+ +++++++-+ 关键改动: +++++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +++++++-+ 直接传入原始的 key 和 value 张量效率更高。 +++++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +++++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +++++++-+ """ +++++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +++++++-+ super().__init__() +++++++-+ self.config = config +++++++-+ self.layer_idx = layer_idx +++++++-+ self.hidden_size = config.hidden_size +++++++-+ self.num_heads = config.num_attention_heads +++++++-+ self.head_dim = self.hidden_size // self.num_heads +++++++-+ self.num_key_value_heads = config.num_key_value_heads +++++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +++++++-+ self.max_position_embeddings = config.max_position_embeddings +++++++-+ self.rope_theta = config.rope_theta +++++++-+ self.attention_dropout = config.attention_dropout +++++++-+ +++++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +++++++-+ raise ValueError( +++++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +++++++-+ ) +++++++-+ +++++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +++++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +++++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +++++++-+ +++++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +++++++-+ self.head_dim, +++++++-+ max_position_embeddings=self.max_position_embeddings, +++++++-+ base=self.rope_theta, +++++++-+ ) +++++++-+ +++++++-+ def forward( +++++++-+ self, +++++++-+ hidden_states: mindspore.Tensor, +++++++-+ attention_mask: Optional[mindspore.Tensor] = None, +++++++-+ position_ids: Optional[mindspore.Tensor] = None, +++++++-+ past_key_value: Optional[Cache] = None, +++++++-+ output_attentions: bool = False, +++++++-+ use_cache: bool = False, +++++++-+ cache_position: Optional[mindspore.Tensor] = None, +++++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++-+ +++++++-+ bsz, q_len, _ = hidden_states.shape +++++++-+ +++++++-+ # 1. 线性投射 Q, K, V +++++++-+ query_states = self.q_proj(hidden_states) +++++++-+ key_states = self.k_proj(hidden_states) +++++++-+ value_states = self.v_proj(hidden_states) +++++++-+ +++++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++-+ # query: [B, S, H*D] -> [B, N1, S, D] +++++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +++++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ +++++++-+ # 3. RoPE 旋转位置编码 +++++++-+ kv_seq_len = key_states.shape[-2] +++++++-+ if past_key_value is not None: +++++++-+ if self.layer_idx is None: +++++++-+ raise ValueError( +++++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++-+ "with a layer index." +++++++-+ ) +++++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +++++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +++++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +++++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +++++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +++++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +++++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +++++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +++++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +++++++-+ if cache_position.shape[0] == 1: +++++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +++++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +++++++-+ kv_seq_len = past_seen_tokens + 1 +++++++-+ else: +++++++-+ # prefill 阶段:cache_position 是范围,使用其长度 +++++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +++++++-+ else: +++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++-+ +++++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++-+ +++++++-+ # 4. KV 缓存更新 +++++++-+ if past_key_value is not None: +++++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++-+ key_states, value_states = past_key_value.update( +++++++-+ key_states, value_states, self.layer_idx, cache_kwargs +++++++-+ ) +++++++-+ +++++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +++++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +++++++-+ if cache_position.shape[0] == 1: +++++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +++++++-+ kv_seq_len = key_states.shape[-2] +++++++-+ +++++++-+ # 5. [重要] 准备 Attention Mask +++++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +++++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +++++++-+ fa_attention_mask = None +++++++-+ if attention_mask is not None: +++++++-+ # 截取与当前key长度匹配的部分 +++++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +++++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +++++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +++++++-+ fa_attention_mask = (mask_slice != 0) +++++++-+ +++++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +++++++-+ input_dtype = query_states.dtype +++++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +++++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +++++++-+ query_states = query_states.to(mindspore.float16) +++++++-+ key_states = key_states.to(mindspore.float16) +++++++-+ value_states = value_states.to(mindspore.float16) +++++++-+ +++++++-+ # 6. [核心] 调用 flash_attention_score 算子 +++++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +++++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +++++++-+ attn_output = mindspore.ops.flash_attention_score( +++++++-+ query=query_states, +++++++-+ key=key_states, +++++++-+ value=value_states, +++++++-+ head_num=self.num_heads, # 传入Q的头数(N1) +++++++-+ attn_mask=fa_attention_mask, +++++++-+ keep_prob=1.0 - self.attention_dropout, +++++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +++++++-+ input_layout="BNSD", +++++++-+ sparse_mode=0 # 使用 defaultMask 模式 +++++++-+ ) +++++++-+ +++++++-+ # 恢复原始数据类型 +++++++-+ attn_output = attn_output.to(input_dtype) +++++++-+ +++++++-+ # 7. 调整输出形状 +++++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +++++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++-+ attn_output = self.o_proj(attn_output) +++++++-+ +++++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +++++++-+ attn_weights = None +++++++-+ if output_attentions: +++++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++-+ +++++++-+ return attn_output, attn_weights, past_key_value +++++++-+ +++++++-+ # def forward( +++++++-+ # self, +++++++-+ # hidden_states: mindspore.Tensor, +++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++++++-+ # position_ids: Optional[mindspore.Tensor] = None, +++++++-+ # past_key_value: Optional[Cache] = None, +++++++-+ # output_attentions: bool = False, +++++++-+ # use_cache: bool = False, +++++++-+ # cache_position: Optional[mindspore.Tensor] = None, +++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++-+ +++++++-+ # bsz, q_len, _ = hidden_states.shape +++++++-+ +++++++-+ # # 1. 线性投射 Q, K, V +++++++-+ # query_states = self.q_proj(hidden_states) +++++++-+ # key_states = self.k_proj(hidden_states) +++++++-+ # value_states = self.v_proj(hidden_states) +++++++-+ +++++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ +++++++-+ # # 3. RoPE 旋转位置编码 +++++++-+ # kv_seq_len = key_states.shape[-2] +++++++-+ # if past_key_value is not None: +++++++-+ # if self.layer_idx is None: +++++++-+ # raise ValueError( +++++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +++++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +++++++-+ # "with a layer index." +++++++-+ # ) +++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++-+ +++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++-+ +++++++-+ # # 4. KV 缓存更新 +++++++-+ # if past_key_value is not None: +++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++-+ # key_states, value_states = past_key_value.update( +++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++++++-+ # ) +++++++-+ +++++++-+ # # 5. 准备 Attention Mask +++++++-+ # fa_attention_mask = None +++++++-+ # if attention_mask is not None: +++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++-+ # fa_attention_mask = (mask_slice != 0) +++++++-+ +++++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +++++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +++++++-+ # input_dtype = query_states.dtype +++++++-+ +++++++-+ # # 6. [核心] 调用 flash_attention_score 算子 +++++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++++-+ # query=query_states, +++++++-+ # key=key_states, +++++++-+ # value=value_states, +++++++-+ # head_num=self.num_heads, +++++++-+ # attn_mask=fa_attention_mask, +++++++-+ # keep_prob=1.0 - self.attention_dropout, +++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +++++++-+ # input_layout="BNSD", +++++++-+ # sparse_mode=0, +++++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +++++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +++++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +++++++-+ # inner_precise=1 +++++++-+ # ) +++++++-+ +++++++-+ # # 恢复原始数据类型 +++++++-+ # attn_output = attn_output.to(input_dtype) +++++++-+ +++++++-+ # # 7. 调整输出形状 +++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++-+ # attn_output = self.o_proj(attn_output) +++++++-+ +++++++-+ # attn_weights = None +++++++-+ # if output_attentions: +++++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +++++++-+ +++++++-+ # return attn_output, attn_weights, past_key_value +++++++-+ +++++++-+ # def forward( +++++++-+ # self, +++++++-+ # hidden_states: mindspore.Tensor, +++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +++++++-+ # position_ids: Optional[mindspore.Tensor] = None, +++++++-+ # past_key_value: Optional[Cache] = None, +++++++-+ # output_attentions: bool = False, +++++++-+ # use_cache: bool = False, +++++++-+ # cache_position: Optional[mindspore.Tensor] = None, +++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +++++++-+ +++++++-+ # bsz, q_len, _ = hidden_states.shape +++++++-+ +++++++-+ # query_states = self.q_proj(hidden_states) +++++++-+ # key_states = self.k_proj(hidden_states) +++++++-+ # value_states = self.v_proj(hidden_states) +++++++-+ +++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +++++++-+ +++++++-+ # kv_seq_len = key_states.shape[-2] +++++++-+ # if past_key_value is not None: +++++++-+ # if self.layer_idx is None: +++++++-+ # raise ValueError("`layer_idx` must be specified for caching") +++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +++++++-+ +++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +++++++-+ +++++++-+ # if past_key_value is not None: +++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +++++++-+ # key_states, value_states = past_key_value.update( +++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +++++++-+ # ) +++++++-+ +++++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +++++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +++++++-+ +++++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +++++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +++++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +++++++-+ # query_states = query_states / math.sqrt(self.head_dim) +++++++-+ # # <--- 修改结束 --- +++++++-+ +++++++-+ # fa_attention_mask = None +++++++-+ # if attention_mask is not None: +++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +++++++-+ # fa_attention_mask = (mask_slice != 0) +++++++-+ +++++++-+ # input_dtype = query_states.dtype +++++++-+ +++++++-+ # attn_output = mindspore.ops.flash_attention_score( +++++++-+ # query=query_states, # 传入已经预先缩放过的 query +++++++-+ # key=key_states, +++++++-+ # value=value_states, +++++++-+ # head_num=self.num_heads, +++++++-+ # attn_mask=fa_attention_mask, +++++++-+ # keep_prob=1.0 - self.attention_dropout, +++++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +++++++-+ # input_layout="BNSD", +++++++-+ # sparse_mode=0, +++++++-+ # inner_precise=1 # 仍然保持内部高精度计算 +++++++-+ # ) +++++++-+ +++++++-+ # attn_output = attn_output.to(input_dtype) +++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +++++++-+ # attn_output = self.o_proj(attn_output) +++++++-+ +++++++-+ # attn_weights = None +++++++-+ # if output_attentions: +++++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +++++++-+ +++++++-+ # return attn_output, attn_weights, past_key_value +++++++-+ +++++++- QWEN2MOE_ATTENTION_CLASSES = { +++++++- "eager": Qwen2MoeAttention, +++++++-+ "flash-attention": Qwen2MoeFlashAttention, +++++++- } +++++++- +++++++- +++++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +++++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +++++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +++++++- +++++++-+ #@dwj +++++++-+ # 只遍历激活的专家,而非全部专家 +++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +++++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++-- hidden_states = hidden_states.view(-1, hidden_dim) +++++++-- # router_logits: (batch * sequence_length, n_experts) +++++++-- router_logits = self.gate(hidden_states) +++++++-- +++++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++-- if self.norm_topk_prob: +++++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++-- # we cast back to the input dtype +++++++-- routing_weights = routing_weights.to(hidden_states.dtype) +++++++-- +++++++-- final_hidden_states = ops.zeros( +++++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +++++++-- ) +++++++-- +++++++-- # One hot encode the selected experts to create an expert mask +++++++-- # this will be used to easily index which expert is going to be sollicitated +++++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +++++++-- +++++++-- # Loop over all available experts in the model and perform the computation on each expert +++++++-- for expert_idx in range(self.num_experts): +++++++-- expert_layer = self.experts[expert_idx] +++++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +++++++-- +++++++-- # Index the correct hidden states and compute the expert hidden state for +++++++-- # the current expert. We need to make sure to multiply the output hidden +++++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +++++++-- if 0 not in idx.shape: +++++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +++++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +++++++-- +++++++-- # However `index_add_` only support torch tensors for indexing so we'll use +++++++-- # the `top_x` tensor here. +++++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +++++++-- +++++++-- shared_expert_output = self.shared_expert(hidden_states) +++++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +++++++-- +++++++-- final_hidden_states = final_hidden_states + shared_expert_output +++++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +++++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +++++++-+ num_tokens = hidden_states_reshaped.shape[0] +++++++-+ +++++++-+ router_logits = self.gate(hidden_states_reshaped) +++++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +++++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +++++++-+ +++++++-+ if self.norm_topk_prob: +++++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +++++++-+ routing_weights = routing_weights.to(hidden_states.dtype) +++++++-+ +++++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +++++++-+ flat_selected_experts = selected_experts.flatten() +++++++-+ +++++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +++++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +++++++-+ token_indices = broadcasted_token_indices.flatten() +++++++-+ +++++++-+ active_experts = ops.unique(flat_selected_experts) +++++++-+ +++++++-+ for expert_idx_tensor in active_experts: +++++++-+ expert_idx = expert_idx_tensor.item() +++++++-+ expert_layer = self.experts[expert_idx] +++++++-+ +++++++-+ mask = (flat_selected_experts == expert_idx_tensor) +++++++-+ selected_token_indices = token_indices[mask] +++++++-+ selected_routing_weights = routing_weights.flatten()[mask] +++++++-+ +++++++-+ current_states = hidden_states_reshaped[selected_token_indices] +++++++-+ +++++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +++++++-+ +++++++-+ final_hidden_states = final_hidden_states.index_add( +++++++-+ dim=0, +++++++-+ index=selected_token_indices, +++++++-+ source=expert_output.to(hidden_states.dtype) +++++++-+ ) +++++++-+ +++++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +++++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +++++++- +++++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++-- return final_hidden_states, router_logits +++++++-+ final_hidden_states = final_hidden_states + shared_expert_output +++++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +++++++-+ +++++++-+ return final_hidden_states, router_logits +++++++- +++++++- +++++++- class Qwen2MoeDecoderLayer(nn.Module): +++++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +++++++- +++++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +++++++- +++++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +++++++-+ +++++++- if (layer_idx not in config.mlp_only_layers) and ( +++++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +++++++- ): +++++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +++++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +++++++- _skip_keys_device_placement = "past_key_values" +++++++- _supports_cache_class = True +++++++-+#lwx +++++++-+ # _supports_static_cache = True +++++++- +++++++- def _init_weights(self, module): +++++++- std = self.config.initializer_range +++++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +++++++- return causal_mask +++++++- +++++++- +++++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +++++++- _tied_weights_keys = ["lm_head.weight"] +++++++- +++++++- def __init__(self, config): +++++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++- self.num_experts_per_tok = config.num_experts_per_tok +++++++- # Initialize weights and apply final processing +++++++- self.post_init() +++++++-+ # @lwx +++++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +++++++-+ # self.generation_config.cache_implementation = "static" +++++++-+ self._warmed_up = False +++++++-+ +++++++-+ def warmup_moe_model(self): +++++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +++++++-+ test_texts = [ +++++++-+ "warmup short", +++++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +++++++-+ ] +++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +++++++-+ if tokenizer is None: +++++++-+ from mindnlp.transformers import AutoTokenizer +++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +++++++-+ self._warmup_tokenizer = tokenizer +++++++-+ +++++++-+ for text in test_texts: +++++++-+ inputs = tokenizer(text, return_tensors="ms") +++++++-+ with mindspore._no_grad(): +++++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +++++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +++++++- +++++++- def get_input_embeddings(self): +++++++- return self.model.embed_tokens +++++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +++++++- ```""" +++++++-+ if not self._warmed_up: +++++++-+ self._warmed_up = True +++++++-+ self.warmup_moe_model() +++++++- +++++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +++++++- output_router_logits = ( +++++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +++++++- } +++++++- ) +++++++- return model_inputs +++++++-+# @lwx +++++++-+ # def _decode_one_tokens_logits( +++++++-+ # self, +++++++-+ # cur_token: mindspore.Tensor, +++++++-+ # input_pos: Optional[mindspore.Tensor], +++++++-+ # cache_position: mindspore.Tensor, +++++++-+ # past_key_values: StaticCache, +++++++-+ # ) -> mindspore.Tensor: +++++++-+ # """ +++++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +++++++-+ +++++++-+ # Args: +++++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +++++++-+ # input_pos: 输入位置信息,可选 +++++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +++++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +++++++-+ +++++++-+ # Returns: +++++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +++++++-+ # """ +++++++-+ # # 调用JIT编译的版本 +++++++-+ # return self.get_decode_one_tokens_logits( +++++++-+ # cur_token=cur_token, +++++++-+ # input_pos=input_pos, +++++++-+ # cache_position=cache_position, +++++++-+ # past_key_values=past_key_values, +++++++-+ # ) +++++++-+ +++++++-+ # @mindspore.jit(jit_level='O1') +++++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +++++++-+ # """ +++++++-+ # JIT编译的函数,用于高效的单token解码 +++++++-+ # 使用JIT编译优化以支持静态shape和高效执行 +++++++-+ +++++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +++++++-+ # """ +++++++-+ # outputs = self.model.forward( +++++++-+ # input_ids=cur_token, +++++++-+ # position_ids=input_pos, +++++++-+ # cache_position=cache_position, +++++++-+ # past_key_values=past_key_values, +++++++-+ # use_cache=True, +++++++-+ # return_dict=False, +++++++-+ # ) +++++++-+ +++++++-+ # hidden_states = outputs[0] +++++++-+ # logits = self.lm_head.forward(hidden_states) +++++++-+ # logits = logits.float() +++++++-+ +++++++-+ # return logits[:, -1, :] +++++++-+ +++++++-+ # def _sample( +++++++-+ # self, +++++++-+ # input_ids: mindspore.Tensor, +++++++-+ # logits_processor, +++++++-+ # stopping_criteria, +++++++-+ # generation_config, +++++++-+ # synced_devices: bool, +++++++-+ # streamer=None, +++++++-+ # logits_warper=None, +++++++-+ # **model_kwargs, +++++++-+ # ): +++++++-+ # """ +++++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +++++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +++++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +++++++-+ # """ +++++++-+ # from ...generation.logits_process import LogitsProcessorList +++++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +++++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +++++++-+ # from mindnlp.core import nn, ops, no_grad +++++++-+ # import numpy as np +++++++-+ +++++++-+ # # 检查是否使用 StaticCache +++++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +++++++-+ # # 否则,直接调用父类方法 +++++++-+ # past_key_values = model_kwargs.get("past_key_values") +++++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +++++++-+ +++++++-+ # if not isinstance(past_key_values, StaticCache): +++++++-+ # # 不使用 StaticCache,直接调用父类方法 +++++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +++++++-+ # return super()._sample( +++++++-+ # input_ids=input_ids, +++++++-+ # logits_processor=logits_processor, +++++++-+ # stopping_criteria=stopping_criteria, +++++++-+ # generation_config=generation_config, +++++++-+ # synced_devices=synced_devices, +++++++-+ # streamer=streamer, +++++++-+ # logits_warper=logits_warper, +++++++-+ # **model_kwargs, +++++++-+ # ) +++++++-+ +++++++-+ # # 使用 StaticCache,进入自定义循环 +++++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +++++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +++++++-+ # pad_token_id = generation_config._pad_token_tensor +++++++-+ # output_attentions = generation_config.output_attentions +++++++-+ # output_hidden_states = generation_config.output_hidden_states +++++++-+ # output_scores = generation_config.output_scores +++++++-+ # output_logits = generation_config.output_logits +++++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +++++++-+ # max_length = generation_config.max_length +++++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +++++++-+ # do_sample = generation_config.do_sample +++++++-+ +++++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +++++++-+ # raise ValueError( +++++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +++++++-+ # f"{logits_warper})." +++++++-+ # ) +++++++-+ +++++++-+ # # init attention / hidden states / scores tuples +++++++-+ # scores = () if (return_dict_in_generate and output_scores) else None +++++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +++++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +++++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +++++++-+ +++++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +++++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +++++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +++++++-+ # encoder_hidden_states = ( +++++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +++++++-+ # ) +++++++-+ +++++++-+ # # keep track of which sequences are already finished +++++++-+ # batch_size, cur_len = input_ids.shape +++++++-+ # this_peer_finished = False +++++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +++++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +++++++-+ +++++++-+ # time_record = [] +++++++-+ # from ....utils.testing_utils import parse_flag_from_env +++++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +++++++-+ +++++++-+ # while self._has_unfinished_sequences( +++++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +++++++-+ # ): +++++++-+ # if _record_time: +++++++-+ # import time as time_module +++++++-+ # infer_start = time_module.time() +++++++-+ +++++++-+ # # prepare model inputs +++++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +++++++-+ +++++++-+ # # prepare variable output controls +++++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +++++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +++++++-+ +++++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +++++++-+ # cur_cache_position = model_inputs.get("cache_position") +++++++-+ # cur_past_key_values = model_inputs.get("past_key_values") +++++++-+ # cur_input_ids = model_inputs.get("input_ids") +++++++-+ +++++++-+ # if (isinstance(cur_past_key_values, StaticCache) and +++++++-+ # cur_cache_position is not None and +++++++-+ # len(cur_cache_position.shape) > 0 and +++++++-+ # cur_cache_position.shape[0] == 1 and +++++++-+ # cur_input_ids is not None and +++++++-+ # cur_input_ids.shape[1] == 1): +++++++-+ # # 使用 JIT 优化的单 token 解码 +++++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +++++++-+ # if not hasattr(self, '_jit_used'): +++++++-+ # self._jit_used = False +++++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +++++++-+ +++++++-+ # next_token_logits = self.get_decode_one_tokens_logits( +++++++-+ # cur_token=cur_input_ids, +++++++-+ # input_pos=model_inputs.get("position_ids"), +++++++-+ # cache_position=cur_cache_position, +++++++-+ # past_key_values=cur_past_key_values, +++++++-+ # ) +++++++-+ +++++++-+ # # 标记已使用JIT(用于后续判断) +++++++-+ # if not self._jit_used: +++++++-+ # self._jit_used = True +++++++-+ +++++++-+ # # 构造兼容的输出对象 +++++++-+ # class JitOptimizedOutput: +++++++-+ # def __init__(self, logits, config): +++++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +++++++-+ # self.config = config +++++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +++++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +++++++-+ # self.attentions = None if not config.is_encoder_decoder else None +++++++-+ # self.cross_attentions = None +++++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +++++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +++++++-+ +++++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +++++++-+ # else: +++++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +++++++-+ # outputs = self(**model_inputs, return_dict=True) +++++++-+ +++++++-+ # if synced_devices and this_peer_finished: +++++++-+ # continue +++++++-+ +++++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +++++++-+ # next_token_logits = outputs.logits[:, -1, :] +++++++-+ +++++++-+ # # pre-process distribution +++++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +++++++-+ # if do_sample: +++++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +++++++-+ +++++++-+ # # Store scores, attentions and hidden_states when required +++++++-+ # if return_dict_in_generate: +++++++-+ # if output_scores: +++++++-+ # scores += (next_token_scores,) +++++++-+ # if output_logits: +++++++-+ # raw_logits += (next_token_logits,) +++++++-+ # if output_attentions: +++++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +++++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +++++++-+ # if self.config.is_encoder_decoder: +++++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +++++++-+ +++++++-+ # if output_hidden_states: +++++++-+ # hidden = ( +++++++-+ # outputs.decoder_hidden_states +++++++-+ # if self.config.is_encoder_decoder +++++++-+ # else outputs.hidden_states +++++++-+ # ) +++++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +++++++-+ +++++++-+ # # token selection +++++++-+ # if do_sample: +++++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +++++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +++++++-+ # else: +++++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +++++++-+ +++++++-+ # # finished sentences should have their next token be a padding token +++++++-+ # if has_eos_stopping_criteria: +++++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +++++++-+ +++++++-+ # # update generated ids, model inputs, and length for next step +++++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +++++++-+ # if streamer is not None: +++++++-+ # streamer.put(next_tokens) +++++++-+ +++++++-+ # model_kwargs = self._update_model_kwargs_for_generation( +++++++-+ # outputs, +++++++-+ # model_kwargs, +++++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +++++++-+ # ) +++++++-+ +++++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +++++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +++++++-+ # cur_len += 1 +++++++-+ +++++++-+ # if _record_time: +++++++-+ # import time as time_module +++++++-+ # infer_stop = time_module.time() +++++++-+ # time_record.append(infer_stop - infer_start) +++++++-+ +++++++-+ # del outputs +++++++-+ +++++++-+ # average_infer_time = None +++++++-+ # if time_record: +++++++-+ # if len(time_record) > 1: +++++++-+ # time_record.pop(0) +++++++-+ # average_infer_time = sum(time_record) / len(time_record) +++++++-+ # print(f'average inference time is: {average_infer_time}') +++++++-+ # print(f'inference time record: {time_record}') +++++++-+ +++++++-+ # if streamer is not None: +++++++-+ # streamer.end() +++++++-+ +++++++-+ # # 简单判断:打印是否使用了JIT路径 +++++++-+ # if hasattr(self, '_jit_used') and self._jit_used: +++++++-+ # print("[JIT] ✓ JIT optimization was used during generation") +++++++-+ # else: +++++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +++++++-+ +++++++-+ # if return_dict_in_generate: +++++++-+ # if self.config.is_encoder_decoder: +++++++-+ # return GenerateEncoderDecoderOutput( +++++++-+ # sequences=input_ids, +++++++-+ # scores=scores, +++++++-+ # logits=raw_logits, +++++++-+ # encoder_attentions=encoder_attentions, +++++++-+ # encoder_hidden_states=encoder_hidden_states, +++++++-+ # decoder_attentions=decoder_attentions, +++++++-+ # cross_attentions=cross_attentions, +++++++-+ # decoder_hidden_states=decoder_hidden_states, +++++++-+ # past_key_values=model_kwargs.get("past_key_values"), +++++++-+ # average_infer_time=average_infer_time +++++++-+ # ) +++++++-+ # else: +++++++-+ # return GenerateDecoderOnlyOutput( +++++++-+ # sequences=input_ids, +++++++-+ # scores=scores, +++++++-+ # logits=raw_logits, +++++++-+ # attentions=decoder_attentions, +++++++-+ # hidden_states=decoder_hidden_states, +++++++-+ # past_key_values=model_kwargs.get("past_key_values"), +++++++-+ # average_infer_time=average_infer_time +++++++-+ # ) +++++++-+ # else: +++++++-+ # return input_ids +++++++-+ +++++++-+ # def _prepare_cache_for_generation( +++++++-+ # self, +++++++-+ # generation_config, +++++++-+ # model_kwargs, +++++++-+ # assistant_model, +++++++-+ # batch_size, +++++++-+ # max_cache_length, +++++++-+ # ): +++++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +++++++-+ # generation_config.cache_implementation = "static" +++++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +++++++-+ +++++++-+ # if generation_config.cache_implementation == "static": +++++++-+ # base_required_from_max_length = generation_config.max_length + 1 +++++++-+ # base_required = max(max_cache_length, base_required_from_max_length) +++++++-+ # min_cache_size = 50 +++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +++++++-+ # else: +++++++-+ # max_cache_length = max(base_required, min_cache_size) +++++++-+ +++++++-+ # original_max_cache_length = max_cache_length +++++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +++++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +++++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +++++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +++++++-+ # print(f" - final max_cache_length: {max_cache_length}") +++++++-+ +++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +++++++-+ # if max_cache_length > self.config.max_position_embeddings: +++++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +++++++-+ +++++++-+ # result = super()._prepare_cache_for_generation( +++++++-+ # generation_config=generation_config, +++++++-+ # model_kwargs=model_kwargs, +++++++-+ # assistant_model=assistant_model, +++++++-+ # batch_size=batch_size, +++++++-+ # max_cache_length=max_cache_length, +++++++-+ # ) +++++++-+ +++++++-+ # if generation_config.cache_implementation == "static": +++++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +++++++-+ # created_cache = model_kwargs.get(cache_name) +++++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +++++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +++++++-+ # if created_cache.max_cache_len < generation_config.max_length: +++++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +++++++-+ +++++++-+ # return result +++++++-+ +++++++-+ +++++++-+ +++++++- +++++++- +++++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +++++++--- +++++++-2.27.0 +++++++- +++++++-- +++++++2.27.0 +++++++ ++++++-- ++++++2.27.0 ++++++ +++++-- +++++2.27.0 +++++ ++++-- ++++2.27.0 ++++ +++-- +++2.27.0 +++ ++-- ++2.27.0 ++ +-- +2.39.5 (Apple Git-154) + diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" new file mode 100644 index 00000000..a1832dc4 --- /dev/null +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" @@ -0,0 +1,49453 @@ +From 5d88d879c9a97cf89b7f7a00df9534ba2df9e955 Mon Sep 17 00:00:00 2001 +From: =?UTF-8?q?=E9=82=93=E4=BC=9F=E9=94=AE?= +Date: Wed, 3 Dec 2025 16:13:15 +0800 +Subject: [PATCH 10/10] =?UTF-8?q?=E6=9C=80=E5=90=8E=E6=95=B4=E7=90=86?= +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +--- + .../models/deepseek/modeling_deepseek.py | 731 +- + .../models/qwen2_moe/modeling_qwen2_moe.py | 1005 +- + patches/0001-20251104commit.patch | 1272 --- + patches/0002-20251106commit.patch | 3200 ------ + patches/0003-20261106secondcommit.patch | 2769 ------ + patches/0004-20251106change.patch | 7498 -------------- + patches/0005-20251107001commit.patch | 7707 --------------- + patches/0006-20251107002commit.patch | 7931 --------------- + patches/0007-20251107003commit.patch | 8034 --------------- + patches/0008-moe-change.patch | 8789 ----------------- + 10 files changed, 29 insertions(+), 48907 deletions(-) + delete mode 100644 patches/0001-20251104commit.patch + delete mode 100644 patches/0002-20251106commit.patch + delete mode 100644 patches/0003-20261106secondcommit.patch + delete mode 100644 patches/0004-20251106change.patch + delete mode 100644 patches/0005-20251107001commit.patch + delete mode 100644 patches/0006-20251107002commit.patch + delete mode 100644 patches/0007-20251107003commit.patch + delete mode 100644 patches/0008-moe-change.patch + +diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +index 8d004af1..8178fb05 100644 +--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +@@ -234,9 +234,6 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): + # Copied from transformers.models.llama.modeling_llama.rotate_half + def rotate_half(x): + """Rotates half the hidden dims of the input.""" +- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +- # x1 = x[..., : x.shape[-1] // 2] +- # x2 = x[..., x.shape[-1] // 2 :] + x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + +@@ -413,10 +410,7 @@ class DeepseekMoE(nn.Module): + if self.training: + raise NotImplementedError("Training is not supported yet.") + else: +- # @lwx + if orig_shape[1] == 1: +- # lwx moe_infer_decode_fast +- # y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) + y=self.moe_infer_decode_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) + y=y.view(*orig_shape) + if self.config.n_shared_experts is not None: +@@ -430,120 +424,7 @@ class DeepseekMoE(nn.Module): + if self.config.n_shared_experts is not None: + y = y + self.shared_experts(identity) + return y +- # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +- # if self.config.n_shared_experts is not None: +- # y = y + self.shared_experts(identity) +- # return y +- +- +- +- # lwx +- # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): +- # """ +- # 如果 expert_ids 为 None,走单专家逻辑; +- # 如果有,多专家批量处理,保证和原逻辑一致。 +- # """ +- # if expert_ids is None: +- # # 原单专家逻辑 +- # if self.config.pretraining_tp > 1: +- # slice = self.intermediate_size // self.config.pretraining_tp +- # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) +- # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) +- # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) +- # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) +- # for i in range(self.config.pretraining_tp)], dim=-1) +- # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) +- # for i in range(self.config.pretraining_tp)], dim=-1) +- # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) +- # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) +- # for i in range(self.config.pretraining_tp)] +- # down_proj = sum(down_proj) +- # else: +- # down_proj = self.down_proj( +- # self.act_fn(self.gate_proj(x)) * self.up_proj(x) +- # ) +- # return down_proj +- +- # # ====== 批量多专家路径 ====== +- # hidden_size = x.shape[-1] +- +- # # 按 token expert_ids 选权重 +- # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] +- # up_weights = self.up_proj.weight[expert_ids] +- # down_weights = self.down_proj.weight[expert_ids] +- +- # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 +- # if self.config.pretraining_tp > 1: +- # outputs = [] +- # slice = self.intermediate_size // self.config.pretraining_tp +- # for i in range(self.config.pretraining_tp): +- # # 每个 slice 单独计算 +- # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) +- # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) +- # act_out = self.act_fn(gate_proj_out) * up_proj_out +- # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) +- # outputs.append(down_proj_out) +- # return sum(outputs) +- # else: +- # gate_proj_out = F.linear(x, gate_weights) +- # up_proj_out = F.linear(x, up_weights) +- # act_out = self.act_fn(gate_proj_out) * up_proj_out +- # return F.linear(act_out, down_weights) +- # @no_grad() +- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # num_tokens = x.shape[0] +- # hidden_size = x.shape[-1] +- +- # idxs = flat_expert_indices.argsort() +- # sorted_expert_indices = flat_expert_indices[idxs] +- # sorted_token_indices = idxs // self.num_experts_per_tok +- # sorted_indices = sorted_token_indices +- +- # permuted_tokens = x[sorted_token_indices] +- # sorted_weights = flat_expert_weights[idxs] +- +- # # 一次调用多专家 forward +- # expert_outputs = ops.zeros_like(permuted_tokens) +- # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) +- +- # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +- # try: +- # final_output = ops.moe_token_unpermute( +- # expert_outputs, +- # sorted_indices, +- # probs=probs, +- # padded_mode=False +- # ) +- # except Exception: +- # final_output = ops.zeros_like(x) +- # final_output = mindspore.mint.scatter_add( +- # final_output, +- # 0, +- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +- # expert_outputs * sorted_weights +- # ) +- +- # return final_output +- +- # def mlp_batch_forward(self, tokens, expert_ids): +- # """ +- # 使用批量专家 forward(保留精度) +- # """ +- # return self.experts[0].forward(tokens, expert_ids) +- +- # @no_grad() +- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- +- # expert_cache = ops.zeros_like(x) +- # for i in range(self.num_experts_per_tok): +- # expert_id = flat_expert_indices[i].item() +- # weight = flat_expert_weights[i].item() +- # expert = self.experts[expert_id] +- # expert_out = expert(x) +- # expert_cache += expert_out * weight +- # return expert_cache +- +- #@dwj ++ + @no_grad() + def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): + +@@ -561,35 +442,27 @@ class DeepseekMoE(nn.Module): + - 跳过无 token 的专家 + - 保持结果完全一致 + """ +- # 初始化输出缓存 + expert_cache = ops.zeros_like(x) + +- # 排序(确保 scatter_add 位置对应原逻辑) + idxs = flat_expert_indices.argsort() + sorted_expert_indices = flat_expert_indices[idxs] + sorted_token_indices = idxs // self.num_experts_per_tok + +- # 每个 expert 的 token 数 + tokens_per_expert = sorted_expert_indices.bincount() + +- # 找出有 token 的专家 + active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() + + for expert_id in active_experts.tolist(): +- # 取该 expert 对应的排序后 token 区间 + start = (tokens_per_expert[:expert_id]).sum().item() + end = start + tokens_per_expert[expert_id].item() + +- token_idx = sorted_token_indices[start:end] # 原 token 位置 +- expert_tokens = x[token_idx] # 取输入向量 ++ token_idx = sorted_token_indices[start:end] ++ expert_tokens = x[token_idx] + +- # 执行专家 MLP + expert_out = self.experts[expert_id](expert_tokens) + +- # 按权重缩放 + scaled_out = expert_out * flat_expert_weights[idxs[start:end]] + +- # 回写到缓存(等价 scatter_add) + expert_cache = mindspore.mint.scatter_add( + expert_cache, + 0, +@@ -599,60 +472,6 @@ class DeepseekMoE(nn.Module): + + return expert_cache + +- +- # @no_grad() +- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # """ +- # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add +- # """ +- # num_tokens = x.shape[0] +- # hidden_size = x.shape[-1] +- +- # # 生成排序后的 token 索引 +- # idxs = flat_expert_indices.argsort() +- # sorted_expert_indices = flat_expert_indices[idxs] +- # sorted_token_indices = idxs // self.num_experts_per_tok +- +- # # 记录到 sorted_indices(moe_token_unpermute 用) +- # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] +- +- # # 收集专家输入 +- # permuted_tokens = x[sorted_token_indices] +- +- # # 执行每个专家的 MLP(批量处理) +- # expert_outputs = [] +- # token_ptr = 0 +- # tokens_per_expert = sorted_expert_indices.bincount() +- # for expert_id, count in enumerate(tokens_per_expert.tolist()): +- # if count == 0: +- # continue +- # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] +- # out = self.experts[expert_id](cur_tokens) +- # expert_outputs.append(out) +- # token_ptr += count +- +- # # 拼接所有专家输出 +- # permuted_outputs = ops.cat(expert_outputs, axis=0) +- +- # # 权重缩放(probs 形状为 [num_tokens, top_k]) +- # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) +- +- # # 直接调用硬件加速的 unpermute +- # final_output = ops.moe_token_unpermute( +- # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] +- # sorted_indices, # shape: [num_tokens * top_k] +- # probs=probs, # 按概率加权 +- # padded_mode=False +- # ) +- +- # return final_output +- # def init_expert_cache(self): +- # """ +- # 在模型初始化时调用,缓存所有专家的权重到显存。 +- # """ +- # self.cache_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) +- # self.cache_up_w = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) +- # self.cache_down_w = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) + @no_grad() + def moe_infer_decode_fast(self, x, flat_expert_indices, flat_expert_weights): + top_k = flat_expert_indices.shape[0] +@@ -684,43 +503,22 @@ class DeepseekMoE(nn.Module): + weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) + return weighted_sum + +- # lwx prefill 20251108 + @no_grad() + def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): +- """ +- 高性能 + 数值一致的 MoE prefill 推理: +- 1. 批量化处理所有专家计算,减少 Python 循环开销 +- 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 +- 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 +- 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch +- +- 参数: +- x: [num_tokens, hidden_size], +- MoE 输入的 token 表示 +- flat_expert_indices: [num_tokens * top_k], +- 每个 token 的路由专家 id +- flat_expert_weights: [num_tokens * top_k, 1], +- 路由专家权重 +- """ + num_tokens = x.shape[0] + hidden_size = x.shape[-1] + +- # 1) 排序专家分配(与原 scatter_add 一致的顺序) +- idxs = flat_expert_indices.argsort() # 排序索引 +- sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] +- sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID ++ idxs = flat_expert_indices.argsort() ++ sorted_expert_indices = flat_expert_indices[idxs] ++ sorted_token_indices = idxs // self.num_experts_per_tok + +- # sorted_indices 必须与 permuted_tokens 顺序匹配 +- sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 ++ sorted_indices = sorted_token_indices + +- # 2) 收集专家输入(按 idxs 排序) +- permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] +- sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 ++ permuted_tokens = x[sorted_token_indices] ++ sorted_weights = flat_expert_weights[idxs] + +- # 3) 计算每个专家的 token 数 + tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) + +- # 4) 批量专家计算(减少 Python 循环) + gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) + up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) + down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) +@@ -731,8 +529,7 @@ class DeepseekMoE(nn.Module): + if count == 0: + continue + tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] +- +- # 与 DeepseekMLP forward 等价 ++ + gate_proj_out = F.linear(tokens, gate_weights[expert_id]) + up_proj_out = F.linear(tokens, up_weights[expert_id]) + act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out +@@ -741,7 +538,6 @@ class DeepseekMoE(nn.Module): + expert_outputs[ptr:ptr+count] = expert_out + ptr += count + +- # 5) Ascend 加速的 unpermute(已排序的权重) + probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape + + final_output = ops.zeros_like(x) +@@ -753,444 +549,6 @@ class DeepseekMoE(nn.Module): + ) + return final_output + +- # try: +- # final_output = ops.moe_token_unpermute( +- # expert_outputs, # [num_tokens*top_k, hidden_size] +- # sorted_indices, # [num_tokens*top_k] 原 token id +- # probs=probs, # 对应权重 +- # padded_mode=False +- # ) +- # except Exception: +- # # CPU/GPU fallback:用 scatter_add 保证完全一致 +- # final_output = ops.zeros_like(x) +- # final_output = mindspore.mint.scatter_add( +- # final_output, +- # 0, +- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +- # expert_outputs * sorted_weights +- # ) +- +- # return final_output +- +- +- # @no_grad() +- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # num_tokens = x.shape[0] +- # hidden_size = x.shape[-1] +- +- # idxs = flat_expert_indices.argsort() +- # sorted_expert_indices = flat_expert_indices[idxs] +- # sorted_token_indices = idxs // self.num_experts_per_tok +- +- # # sorted_indices = sorted_token_indices +- # sorted_indices = sorted_token_indices.astype(mindspore.int32) +- # permuted_tokens = x[sorted_token_indices] +- # sorted_weights = flat_expert_weights[idxs] +- # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) +- +- # expert_outputs = ops.zeros_like(permuted_tokens) +- # ptr = 0 +- +- # # 只按专家维度循环 +- # for expert_id, count in enumerate(tokens_per_expert.tolist()): +- # if count == 0: +- # continue +- # token_slice = slice(ptr, ptr + count) +- # expert_tokens = permuted_tokens[token_slice] +- +- # # 保持原 forward(含 pretraining_tp、bias 等) +- # expert_out = self.experts[expert_id](expert_tokens) +- +- # expert_outputs[token_slice] = expert_out +- # ptr += count +- +- # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +- # try: +- # final_output = mindspore.ops.moe_token_unpermute( +- # expert_outputs, +- # sorted_indices, +- # probs=probs, +- # padded_mode=False +- # ) +- # except Exception: +- # final_output = ops.zeros_like(x) +- # final_output = mindspore.mint.scatter_add( +- # final_output, +- # 0, +- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +- # expert_outputs * sorted_weights +- # ) +- +- # return final_output +- +- +- #lwx +- # @no_grad() +- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # """ +- # 并行化 MoE prefill: +- # - 一次性计算所有专家输出,牺牲显存峰值换取速度 +- # - 保证结果与原版完全一致 +- # """ +- # # 输出缓存 +- # expert_cache = ops.zeros_like(x) +- +- # # token 总数(批量*seq_len*num_experts_per_tok) +- # num_tokens = flat_expert_indices.shape[0] +- # hidden_dim = x.shape[-1] +- +- # # 原 token ID(idxs // num_experts_per_tok) +- # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) +- +- # # ====== Step 1: 组织输入 ====== +- # # 按 experts 排序,保证 scatter_add 对应位置一致 +- # sort_ids = flat_expert_indices.argsort() +- # sorted_experts = flat_expert_indices[sort_ids] +- # sorted_tokens = token_ids[sort_ids] +- # sorted_weights = flat_expert_weights[sort_ids] +- +- # # 收集每个专家的输入 +- # # build: expert_inputs[expert_id] = [tokens...] +- # expert_inputs = [] +- # expert_outs = [] +- +- # for eid in range(self.config.n_routed_experts): +- # eid_mask = (sorted_experts == eid) +- # if eid_mask.any(): +- # tokens_for_eid = x[sorted_tokens[eid_mask]] +- # expert_inputs.append(tokens_for_eid) +- # else: +- # expert_inputs.append(None) +- +- # # ====== Step 2: 并行计算所有专家输出 ====== +- # # 存储所有专家结果到一个列表 +- # for eid in range(self.config.n_routed_experts): +- # if expert_inputs[eid] is not None: +- # out = self.experts[eid](expert_inputs[eid]) +- # expert_outs.append(out) +- # else: +- # expert_outs.append(None) +- +- # # ====== Step 3: scatter_add 回写结果 ====== +- # # 遍历专家,将结果加回对应的 token +- # pos = 0 +- # for eid in range(self.config.n_routed_experts): +- # if expert_outs[eid] is not None: +- # size = expert_outs[eid].shape[0] +- # tokens_idx = sorted_tokens[pos:pos+size] +- # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] +- # pos += size +- +- # # scatter_add 到 expert_cache +- # expert_cache = mindspore.mint.scatter_add( +- # expert_cache, +- # dim=0, +- # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), +- # src=scaled_out +- # ) +- +- # return expert_cache +- +- +- +-# 放置在 DeepseekMoE 类中 +- # @no_grad() +- # #lwx 20251107 +- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- # """ +- # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +- +- # Args: +- # x (Tensor): 输入张量, shape: (1, hidden_size) +- # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +- # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +- # """ +- # top_k, _ = flat_expert_weights.shape +- # hidden_size = x.shape[-1] +- +- # # 1. 将所有专家的权重堆叠起来 +- # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +- # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +- # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +- +- # # 2. "收集" 所需的专家权重 +- # selected_gate_w = stacked_gate_w[flat_expert_indices] +- # selected_up_w = stacked_up_w[flat_expert_indices] +- # selected_down_w = stacked_down_w[flat_expert_indices] +- +- # # 3. 准备输入 +- # x_expanded = x.expand((top_k, 1, hidden_size)) +- +- # # 4. 并行计算 gate_proj 和 up_proj +- # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +- # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +- +- # # 5. 计算中间状态 +- # intermediate_states = self.experts[0].act_fn(gate_out) * up_out +- +- # # 6. 并行计算 down_proj +- # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +- # # --- [FIX] --- +- # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +- # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +- # # --- [FIX END] --- +- +- # # 7. 根据路由权重进行加权求和 +- # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +- +- # return weighted_sum +- +- +- +- # @no_grad() +- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +- # # expert_cache = torch.zeros_like(x) +- # # idxs = flat_expert_indices.argsort() +- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +- # # token_idxs = idxs // self.num_experts_per_tok +- # # for i, end_idx in enumerate(tokens_per_expert): +- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +- # # if start_idx == end_idx: +- # # continue +- # # expert = self.experts[i] +- # # exp_token_idx = token_idxs[start_idx:end_idx] +- # # expert_tokens = x[exp_token_idx] +- # # expert_out = expert(expert_tokens) +- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +- # # return expert_cache +- # expert_cache = ops.zeros_like(x) +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- # token_idxs = idxs // self.num_experts_per_tok +- +- # for i, end_idx in enumerate(tokens_per_expert): +- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- # if start_idx == end_idx: +- # continue +- # expert = self.experts[i] +- # exp_token_idx = token_idxs[start_idx:end_idx] +- # expert_tokens = x[exp_token_idx] +- # expert_out = expert(expert_tokens) +- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +- +- # return expert_cache +- # @no_grad() +- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +- # expert_cache = ops.zeros_like(x) +- +- # # 排序保证顺序一致 +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- # token_idxs = idxs // self.num_experts_per_tok +- +- # # 找出有 token 的专家 +- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +- +- # for i in active_experts.tolist(): +- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- # end_idx = tokens_per_expert[i] +- # if start_idx == end_idx: # 没有 token +- # continue +- +- # exp_token_idx = token_idxs[start_idx:end_idx] +- # expert_tokens = x[exp_token_idx] +- # expert_out = self.experts[i](expert_tokens) +- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +- +- # expert_cache = mindspore.mint.scatter_add( +- # expert_cache, +- # 0, +- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +- # expert_out +- # ) +- +- # return expert_cache +- +- +- +-# class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-# """ +-# The trick function of adding auxiliary (aux) loss, +-# which includes the gradient of the aux loss during backpropagation. +-# """ +-# @staticmethod +-# def forward(ctx, x, loss): +-# assert loss.numel() == 1 +-# ctx.dtype = loss.dtype +-# ctx.required_aux_loss = loss.requires_grad +-# return x +- +-# @staticmethod +-# def backward(ctx, grad_output): +-# grad_loss = None +-# if ctx.required_aux_loss: +-# grad_loss = ops.ones(1, dtype=ctx.dtype) +-# return grad_output, grad_loss +- +- +-# class DeepseekMoE(nn.Module): +-# ''' +-# A mixed expert module containing shared experts. +-# ''' +-# def __init__(self, config): +-# super().__init__() +-# self.config = config +-# self.num_experts_per_tok = config.num_experts_per_tok +-# if hasattr(config, "ep_size") and config.ep_size > 1: +-# assert config.ep_size == mindspore.mint.distributed.get_world_size() +-# self.ep_size = config.ep_size +-# self.experts_per_rank = config.n_routed_experts // config.ep_size +-# self.ep_rank = mindspore.mint.distributed.get_rank() +-# self.experts = nn.ModuleList( +-# [ +-# ( +-# DeepseekMLP( +-# config, intermediate_size=config.moe_intermediate_size +-# ) +-# if i >= self.ep_rank * self.experts_per_rank +-# and i < (self.ep_rank + 1) * self.experts_per_rank +-# else None +-# ) +-# for i in range(config.n_routed_experts) +-# ] +-# ) +- +-# else: +-# self.ep_size = 1 +-# self.experts_per_rank = config.n_routed_experts +-# self.ep_rank = 0 +-# self.experts = nn.ModuleList( +-# [ +-# DeepseekMLP( +-# config, intermediate_size=config.moe_intermediate_size +-# ) +-# for i in range(config.n_routed_experts) +-# ] +-# ) +-# self.gate = MoEGate(config) +-# if config.n_shared_experts is not None: +-# intermediate_size = config.moe_intermediate_size * config.n_shared_experts +-# self.shared_experts = DeepseekMLP( +-# config=config, intermediate_size=intermediate_size +-# ) +- +-# def forward(self, hidden_states): +-# identity = hidden_states +-# orig_shape = hidden_states.shape +-# topk_idx, topk_weight, aux_loss = self.gate(hidden_states) +-# hidden_states = hidden_states.view(-1, hidden_states.shape[-1]) +-# flat_topk_idx = topk_idx.view(-1) +-# if self.training: +-# hidden_states = hidden_states.repeat_interleave( +-# self.num_experts_per_tok, dim=0 +-# ) +-# y = ops.empty(hidden_states.shape) +-# for i, expert in enumerate(self.experts): +-# y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i]) +-# y = ops.sum(y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1), dim=1) +-# y = y.to(hidden_states.dtype).view(*orig_shape) +-# # y = AddAuxiliaryLoss.apply(y, aux_loss) +-# else: +-# # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-# y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape) +-# if self.config.n_shared_experts is not None: +-# y = y + self.shared_experts(identity) +-# return y +- +-# # # @mindnlp.core.no_grad() +-# # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-# # expert_cache = ops.zeros_like(x) +-# # idxs = flat_expert_indices.argsort() +-# # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-# # token_idxs = idxs // self.num_experts_per_tok +-# # for i, end_idx in enumerate(tokens_per_expert): +-# # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-# # if start_idx == end_idx: +-# # continue +-# # expert = self.experts[i] +-# # exp_token_idx = token_idxs[start_idx:end_idx] +-# # expert_tokens = x[exp_token_idx] +-# # expert_out = expert(expert_tokens) +-# # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-# # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out, reduce='sum') +-# # return expert_out # expert_cache +-# def moe_infer(self, x, topk_ids, topk_weight): +-# cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts))) +-# cnts.scatter_(1, topk_ids, 1) +-# tokens_per_expert = cnts.sum(dim=0) +-# idxs = topk_ids.view(-1).argsort() +-# sorted_tokens = x[idxs // topk_ids.shape[1]] +-# sorted_tokens_shape = sorted_tokens.shape +-# if self.ep_size > 1: +-# tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1) +-# tokens_per_expert_group = tokens_per_expert.new_empty( +-# tokens_per_expert.shape[0] +-# ) +-# mindspore.mint.distributed.all_to_all_single(tokens_per_expert_group, tokens_per_expert) +-# output_splits = ( +-# tokens_per_expert_group.view(self.ep_size, -1) +-# .sum(1) +-# .cpu() +-# .numpy() +-# .tolist() +-# ) +-# gathered_tokens = sorted_tokens.new_empty( +-# tokens_per_expert_group.sum(dim=0).cpu().item(), sorted_tokens.shape[1] +-# ) +-# input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist() +-# mindspore.mint.distributed.all_to_all( +-# list(gathered_tokens.split(output_splits)), +-# list(sorted_tokens.split(input_split_sizes)), +-# ) +-# tokens_per_expert_post_gather = tokens_per_expert_group.view( +-# self.ep_size, self.experts_per_rank +-# ).sum(dim=0) +-# gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32) +-# s = 0 +-# for i, k in enumerate(tokens_per_expert_group.cpu().numpy()): +-# gatherd_idxs[s : s + k] = i % self.experts_per_rank +-# s += k +-# gatherd_idxs = gatherd_idxs.argsort() +-# sorted_tokens = gathered_tokens[gatherd_idxs] +-# tokens_per_expert = tokens_per_expert_post_gather +-# tokens_per_expert = tokens_per_expert.cpu().numpy() +-# outputs = [] +-# start_idx = 0 +-# for i, num_tokens in enumerate(tokens_per_expert): +-# end_idx = start_idx + num_tokens +-# if num_tokens == 0: +-# continue +-# expert = self.experts[i + self.ep_rank * self.experts_per_rank] +-# tokens_for_this_expert = sorted_tokens[start_idx:end_idx] +-# expert_out = expert(tokens_for_this_expert) +-# outputs.append(expert_out) +-# start_idx = end_idx +- +-# outs = ops.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0) +-# if self.ep_size > 1: +-# new_x = ops.empty_like(outs) +-# new_x[gatherd_idxs] = outs +-# gathered_tokens = new_x.new_empty(*sorted_tokens_shape) +-# mindspore.mint.distributed.all_to_all( +-# list(gathered_tokens.split(input_split_sizes)), +-# list(new_x.split(output_splits)), +-# ) +-# outs = gathered_tokens +- +-# new_x = ops.empty_like(outs) +-# new_x[idxs] = outs +-# final_out = ( +-# new_x.view(*topk_ids.shape, -1) +-# .type(topk_weight.dtype) +-# .mul_(topk_weight.unsqueeze(dim=-1)) +-# .sum(dim=1) +-# .type(new_x.dtype) +-# ) +-# return final_out +- +- + # Copied from transformers.models.llama.modeling_llama.repeat_kv + def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: + """ +@@ -1313,10 +671,6 @@ class DeepseekAttention(nn.Module): + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + +- # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +- # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- # @lwx + query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) + query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) + key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +@@ -1555,10 +909,6 @@ class DeepseekDecoderLayer(nn.Module): + super().__init__() + self.hidden_size = config.hidden_size + +- # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +- # config=config, layer_idx=layer_idx +- # ) +- + self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( + config=config, layer_idx=layer_idx + ) +@@ -1774,14 +1124,6 @@ class DeepseekModel(DeepseekPreTrainedModel): + else None + ) + else: +- # 4d mask is passed through the layers +- # attention_mask = _prepare_4d_causal_attention_mask( +- # attention_mask, +- # (batch_size, seq_length), +- # inputs_embeds, +- # past_key_values_length, +- # ) +- #@dwj + attention_mask = get_cached_causal_mask( + attention_mask, + (batch_size, seq_length), +@@ -1869,38 +1211,14 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + self.post_init() + # lwx + self.warm_up = False +- #初始 +- +- # def warmup_moe_model_deep(self): +- # print("[Warmup] DeepSeek-MoE 模型预热开始...") +- # test_texts = [ +- # "warmup short", +- # "This is a medium length warmup sentence for MoE experts. middle middle middle", +- # "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +- # ] +- # tokenizer = getattr(self, "_warmup_tokenizer", None) +- # if tokenizer is None: +- # from mindnlp.transformers import AutoTokenizer +- # tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +- # self._warmup_tokenizer = tokenizer +- +- # for text in test_texts: +- # inputs = tokenizer(text, return_tensors="ms") +- # with mindspore._no_grad(): +- # _ = self(**inputs, use_cache=False) +- # print("[Warmup] DeepSeek-MoE 模型预热完成。") +- ++ + def warmup_moe_model_deep(self): + print("[Warmup] DeepSeek-MoE 模型预热开始...") + +- # 直接用 eval.py 默认的 prompts 内容 + warmup_prompts = [ +- "Hello, how are you?", +- "This American studied art at Yale and is the author of multiple popular mystery novels. First name is 'Hillary'. What's the last name?", +- """Summarize the following text: US President Donald Trump has said he is 'not happy' with his Russian counterpart Vladimir Putin, following Moscow's largest aerial attack yet on Ukraine. +- In a rare rebuke, Trump said: "What the hell happened to him? He's killing a lot of people." He later called Putin "absolutely crazy". +- Ukrainian President Volodymyr Zelensky earlier said Washington's "silence" over recent Russian attacks was encouraging Putin, urging "strong pressure" - including tougher sanctions - on Moscow. +- """ ++ "warmup short", ++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" + ] + + tokenizer = getattr(self, "_warmup_tokenizer", None) +@@ -1909,13 +1227,11 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) + self._warmup_tokenizer = tokenizer + +- # 跑一遍 warmup_prompts,触发路由逻辑 + for text in warmup_prompts: + inputs = tokenizer(text, return_tensors="ms") + with mindspore._no_grad(): + _ = self(**inputs, use_cache=False) + +- # 这里可以加按需缓存逻辑,避免显存 OOM + from mindnlp.transformers.models.deepseek.modeling_deepseek import DeepseekMoE + for module in self.modules(): + if isinstance(module, DeepseekMoE): +@@ -2051,15 +1367,13 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + + loss = None + if labels is not None: +- # Shift so that tokens < n predict n + shift_logits = logits[..., :-1, :] + shift_labels = labels[..., 1:] +- # Flatten the tokens ++ + loss_fct = nn.CrossEntropyLoss() + shift_logits = shift_logits.view(-1, self.config.vocab_size) + shift_labels = shift_labels.view(-1) +- # Enable model parallelism +- # shift_labels = shift_labels.to(shift_logits) ++ + loss = loss_fct(shift_logits, shift_labels) + + if not return_dict: +@@ -2091,22 +1405,16 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + cache_length = past_length = past_key_values[0][0].shape[2] + max_cache_length = None + +- # Keep only the unprocessed tokens: +- # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where +- # some of the inputs are exclusivelly passed as part of the cache (e.g. when passing input_embeds as +- # input) ++ + if ( + attention_mask is not None + and attention_mask.shape[1] > input_ids.shape[1] + ): + input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :] +- # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard +- # input_ids based on the past_length. ++ + elif past_length < input_ids.shape[1]: + input_ids = input_ids[:, past_length:] +- # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens. + +- # If we are about to go beyond the maximum cache length, we need to crop the input attention mask. + if ( + max_cache_length is not None + and attention_mask is not None +@@ -2116,14 +1424,11 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): + + position_ids = kwargs.get("position_ids", None) + if attention_mask is not None and position_ids is None: +- # create position_ids on the fly for batch generation + position_ids = attention_mask.to(mindspore.int32).cumsum(-1) - 1 +- # position_ids.masked_fill_(attention_mask == 0, 1) + position_ids = ops.masked_fill(position_ids, attention_mask == 0, 1) + if past_key_values: + position_ids = position_ids[:, -input_ids.shape[1] :] + +- # if `inputs_embeds` are passed, we only want to use them in the 1st generation step + if inputs_embeds is not None and past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: +diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +index 6566958b..d689e36d 100644 +--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +@@ -63,18 +63,14 @@ def get_cached_causal_mask_with_cache_position( + """ + 带缓存的 causal mask 构造函数 + """ +- # q_len 是当前 query 长度 + q_len = sequence_length +- # kv_len 是 target_length + kv_len = target_length + +- # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 + key = (batch_size, q_len, kv_len, dtype, min_dtype) + + if key in _causal_mask_cache: + return _causal_mask_cache[key] + +- # 调用原来的 mask 构造逻辑 + causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( + attention_mask, + sequence_length=sequence_length, +@@ -84,7 +80,6 @@ def get_cached_causal_mask_with_cache_position( + cache_position=cache_position, + batch_size=batch_size, + ) +- # 缓存结果 + _causal_mask_cache[key] = causal_mask + return causal_mask + +@@ -224,11 +219,6 @@ class Qwen2MoeRMSNorm(nn.Module): + self.variance_epsilon = eps + + def forward(self, hidden_states): +- # @dwj +- # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +- # @lwx +- # if not self.training : +- # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) + input_dtype = hidden_states.dtype + hidden_states = hidden_states.to(mindspore.float32) + variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +@@ -279,9 +269,6 @@ class Qwen2MoeRotaryEmbedding(nn.Module): + # Copied from transformers.models.llama.modeling_llama.rotate_half + def rotate_half(x): + """Rotates half the hidden dims of the input.""" +- # x1 = x[..., : x.shape[-1] // 2] +- # x2 = x[..., x.shape[-1] // 2 :] +- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] + x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) + return ops.cat((-x2, x1), dim=-1) + +@@ -329,21 +316,8 @@ class Qwen2MoeMLP(nn.Module): + self.act_fn = ACT2FN[config.hidden_act] + + def forward(self, x): +- + return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +- # @lwx +- # gate_up_output = self.gate_up_proj(x) +- # swiglu_output = mindspore.ops.swiglu(gate_up_output) +- # return self.down_proj(swiglu_output) +- +- # def forward(self, x): +- # gate_proj_out = self.gate_proj(x) +- # up_proj_out = self.up_proj(x) +- # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +- # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +- # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +- # return self.down_proj(swiglu_out) +- ++ + # Copied from transformers.models.llama.modeling_llama.repeat_kv + def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: + """ +@@ -356,164 +330,6 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: + hidden_states = hidden_states[:, :, None, :, :].broadcast_to((batch, num_key_value_heads, n_rep, slen, head_dim)) + return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim) + +- +-# Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-# class Qwen2MoeAttention(nn.Module): +-# """ +-# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-# and "Generating Long Sequences with Sparse Transformers". +-# """ +- +-# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-# super().__init__() +-# self.config = config +-# self.layer_idx = layer_idx +-# if layer_idx is None: +-# logger.warning_once( +-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-# "when creating this class." +-# ) +- +-# self.hidden_size = config.hidden_size +-# self.num_heads = config.num_attention_heads +-# self.head_dim = self.hidden_size // self.num_heads +-# self.num_key_value_heads = config.num_key_value_heads +-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-# self.max_position_embeddings = config.max_position_embeddings +-# self.rope_theta = config.rope_theta +-# self.is_causal = True +-# self.attention_dropout = config.attention_dropout +- +-# if (self.head_dim * self.num_heads) != self.hidden_size: +-# raise ValueError( +-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-# f" and `num_heads`: {self.num_heads})." +-# ) +-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +- +-# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-# self.head_dim, +-# max_position_embeddings=self.max_position_embeddings, +-# base=self.rope_theta, +-# ) +- +-# def forward( +-# self, +-# hidden_states: mindspore.Tensor, +-# attention_mask: Optional[mindspore.Tensor] = None, +-# position_ids: Optional[mindspore.Tensor] = None, +-# past_key_value: Optional[Cache] = None, +-# output_attentions: bool = False, +-# use_cache: bool = False, +-# cache_position: Optional[mindspore.Tensor] = None, +-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- +- +- +-# bsz, q_len, _ = hidden_states.shape +- +-# query_states = self.q_proj(hidden_states) +-# key_states = self.k_proj(hidden_states) +-# value_states = self.v_proj(hidden_states) +- +-# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +- +-# kv_seq_len = key_states.shape[-2] +-# if past_key_value is not None: +-# if self.layer_idx is None: +-# raise ValueError( +-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-# "with a layer index." +-# ) +-# if isinstance(past_key_value, StaticCache): +-# kv_seq_len = key_states.shape[-2] +-# else: +-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +-# if past_key_value is not None: +-# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +- +-# if isinstance(past_key_value, StaticCache): +-# kv_seq_len = key_states.shape[-2] +- +-# # repeat k/v heads if n_kv_heads < n_heads +-# key_states = repeat_kv(key_states, self.num_key_value_groups) +-# value_states = repeat_kv(value_states, self.num_key_value_groups) +- +-# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +- +-# if attention_mask is not None: +-# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-# attn_weights = attn_weights + causal_mask +- +-# # upcast attention to fp32 +-# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-# attn_output = ops.matmul(attn_weights, value_states) +- +-# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-# raise ValueError( +-# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-# f" {attn_output.shape}" +-# ) +- +-# attn_output = ops.transpose(attn_output, 1, 2) +-# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +- +-# attn_output = self.o_proj(attn_output) +-# # @lwx +- +-# # max_seq_len = self.max_position_embeddings # 2048 +- +-# # if attention_mask is not None: +-# # # attention_mask: [B, 1, Sq, Sk] +-# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +- +-# # # pad 到 [max_seq_len, max_seq_len] +-# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-# # global_attention_mask = padded_mask +-# # else: +-# # global_attention_mask = None +- +- +-# # sparse_mode=3 +-# # attn_output = mindspore.ops.flash_attention_score( +-# # query=query_states, +-# # key=key_states, +-# # value=value_states, +-# # real_shift=None, +-# # padding_mask=None, +- +-# # head_num=self.num_heads, +-# # attn_mask=global_attention_mask, +-# # keep_prob=1.0 - self.attention_dropout, +-# # scalar_value=1.0 / math.sqrt(self.head_dim), +-# # input_layout="BNSD", +-# # pre_tokens=2147483647, +-# # next_tokens=2147483647, +-# # inner_precise=0, +-# # drop_mask=None, +-# # prefix=None, +-# # actual_seq_qlen=None, +-# # actual_seq_kvlen=None, +-# # sparse_mode=sparse_mode, +-# # ) +-# if not output_attentions: +-# attn_weights = None +- +-# return attn_output, attn_weights, past_key_value +- + class Qwen2MoeAttention(nn.Module): + """ + 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +@@ -594,10 +410,8 @@ class Qwen2MoeAttention(nn.Module): + cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) + +- # --- 2. 动态调度核心注意力计算 --- + global Long_Prompt + if Long_Prompt >= 1: +- # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- + fa_attention_mask = None + if attention_mask is not None: + mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +@@ -613,7 +427,7 @@ class Qwen2MoeAttention(nn.Module): + scalar_value=1.0 / math.sqrt(self.head_dim), + input_layout="BNSD", + sparse_mode=0, +- inner_precise=0 # 使用高精度模式以对齐 Eager 结果 ++ inner_precise=0 + ) + + attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +@@ -623,7 +437,6 @@ class Qwen2MoeAttention(nn.Module): + logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") + + else: +- # --- Eager Attention 路径 (用于短序列和解码) --- + key_states = repeat_kv(key_states, self.num_key_value_groups) + value_states = repeat_kv(value_states, self.num_key_value_groups) + +@@ -651,252 +464,6 @@ class Qwen2MoeAttention(nn.Module): + + return attn_output, attn_weights, past_key_value + +-# class Qwen2MoeFlashAttention(nn.Module): +-# """ +-# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +- +-# 关键改动: +-# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-# 直接传入原始的 key 和 value 张量效率更高。 +-# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-# """ +-# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-# super().__init__() +-# self.config = config +-# self.layer_idx = layer_idx +-# self.hidden_size = config.hidden_size +-# self.num_heads = config.num_attention_heads +-# self.head_dim = self.hidden_size // self.num_heads +-# self.num_key_value_heads = config.num_key_value_heads +-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-# self.max_position_embeddings = config.max_position_embeddings +-# self.rope_theta = config.rope_theta +-# self.attention_dropout = config.attention_dropout +- +-# if (self.head_dim * self.num_heads) != self.hidden_size: +-# raise ValueError( +-# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-# ) +- +-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +- +-# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-# self.head_dim, +-# max_position_embeddings=self.max_position_embeddings, +-# base=self.rope_theta, +-# ) +- +-# def forward( +-# self, +-# hidden_states: mindspore.Tensor, +-# attention_mask: Optional[mindspore.Tensor] = None, +-# position_ids: Optional[mindspore.Tensor] = None, +-# past_key_value: Optional[Cache] = None, +-# output_attentions: bool = False, +-# use_cache: bool = False, +-# cache_position: Optional[mindspore.Tensor] = None, +-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- +-# bsz, q_len, _ = hidden_states.shape +- +-# # 1. 线性投射 Q, K, V +-# query_states = self.q_proj(hidden_states) +-# key_states = self.k_proj(hidden_states) +-# value_states = self.v_proj(hidden_states) +- +-# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-# # query: [B, S, H*D] -> [B, N1, S, D] +-# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +-# # 3. RoPE 旋转位置编码 +-# kv_seq_len = key_states.shape[-2] +-# if past_key_value is not None: +-# if self.layer_idx is None: +-# raise ValueError( +-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-# "with a layer index." +-# ) +-# # 对于 StaticCache,需要特殊处理 kv_seq_len +-# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-# if cache_position.shape[0] == 1: +-# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-# kv_seq_len = past_seen_tokens + 1 +-# else: +-# # prefill 阶段:cache_position 是范围,使用其长度 +-# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-# else: +-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- +-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +-# # 4. KV 缓存更新 +-# if past_key_value is not None: +-# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-# key_states, value_states = past_key_value.update( +-# key_states, value_states, self.layer_idx, cache_kwargs +-# ) +- +-# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-# if cache_position.shape[0] == 1: +-# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-# kv_seq_len = key_states.shape[-2] +- +-# # 5. [重要] 准备 Attention Mask +-# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-# fa_attention_mask = None +-# if attention_mask is not None: +-# # 截取与当前key长度匹配的部分 +-# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-# # 转换为布尔类型: 大负数 -> True, 0 -> False +-# fa_attention_mask = (mask_slice != 0) +- +-# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-# input_dtype = query_states.dtype +-# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-# query_states = query_states.to(mindspore.float16) +-# key_states = key_states.to(mindspore.float16) +-# value_states = value_states.to(mindspore.float16) +- +-# # 6. [核心] 调用 flash_attention_score 算子 +-# # - 无需手动 repeat_kv, 算子原生支持 GQA +-# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-# attn_output = mindspore.ops.flash_attention_score( +-# query=query_states, +-# key=key_states, +-# value=value_states, +-# head_num=self.num_heads, # 传入Q的头数(N1) +-# attn_mask=fa_attention_mask, +-# keep_prob=1.0 - self.attention_dropout, +-# scalar_value=1.0 / math.sqrt(self.head_dim), +-# input_layout="BNSD", +-# sparse_mode=0 # 使用 defaultMask 模式 +-# ) +- +-# # 恢复原始数据类型 +-# attn_output = attn_output.to(input_dtype) +- +-# # 7. 调整输出形状 +-# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-# attn_output = self.o_proj(attn_output) +- +-# # FlashAttention 算子不直接返回注意力权重矩阵 +-# attn_weights = None +-# if output_attentions: +-# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +- +-# return attn_output, attn_weights, past_key_value +- +-# # def forward( +-# # self, +-# # hidden_states: mindspore.Tensor, +-# # attention_mask: Optional[mindspore.Tensor] = None, +-# # position_ids: Optional[mindspore.Tensor] = None, +-# # past_key_value: Optional[Cache] = None, +-# # output_attentions: bool = False, +-# # use_cache: bool = False, +-# # cache_position: Optional[mindspore.Tensor] = None, +-# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +- +-# # bsz, q_len, _ = hidden_states.shape +- +-# # # 1. 线性投射 Q, K, V +-# # query_states = self.q_proj(hidden_states) +-# # key_states = self.k_proj(hidden_states) +-# # value_states = self.v_proj(hidden_states) +- +-# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +-# # # 3. RoPE 旋转位置编码 +-# # kv_seq_len = key_states.shape[-2] +-# # if past_key_value is not None: +-# # if self.layer_idx is None: +-# # raise ValueError( +-# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-# # "with a layer index." +-# # ) +-# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- +-# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +-# # # 4. KV 缓存更新 +-# # if past_key_value is not None: +-# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-# # key_states, value_states = past_key_value.update( +-# # key_states, value_states, self.layer_idx, cache_kwargs +-# # ) +- +-# # # 5. 准备 Attention Mask +-# # fa_attention_mask = None +-# # if attention_mask is not None: +-# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-# # fa_attention_mask = (mask_slice != 0) +- +-# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-# # input_dtype = query_states.dtype +- +-# # # 6. [核心] 调用 flash_attention_score 算子 +-# # attn_output = mindspore.ops.flash_attention_score( +-# # query=query_states, +-# # key=key_states, +-# # value=value_states, +-# # head_num=self.num_heads, +-# # attn_mask=fa_attention_mask, +-# # keep_prob=1.0 - self.attention_dropout, +-# # scalar_value=1.0 / math.sqrt(self.head_dim), +-# # input_layout="BNSD", +-# # sparse_mode=0, +-# # # <--- 修改点 2: 启用内部高精度计算 --- +-# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-# # inner_precise=1 +-# # ) +- +-# # # 恢复原始数据类型 +-# # attn_output = attn_output.to(input_dtype) +- +-# # # 7. 调整输出形状 +-# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-# # attn_output = self.o_proj(attn_output) +- +-# # attn_weights = None +-# # if output_attentions: +-# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +- +-# # return attn_output, attn_weights, past_key_value +- + + class Qwen2MoeFlashAttention(nn.Module): + """ +@@ -948,17 +515,14 @@ class Qwen2MoeFlashAttention(nn.Module): + + bsz, q_len, _ = hidden_states.shape + +- # 1. 线性投射 Q, K, V + query_states = self.q_proj(hidden_states) + key_states = self.k_proj(hidden_states) + value_states = self.v_proj(hidden_states) + +- # 2. 调整形状以匹配 BNSD 布局 + query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) + key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) + value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- +- # 3. RoPE 和 KV 缓存 ++ + kv_seq_len = key_states.shape[-2] + if past_key_value is not None: + kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +@@ -970,13 +534,11 @@ class Qwen2MoeFlashAttention(nn.Module): + cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} + key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) + +- # 4. 准备 Attention Mask + fa_attention_mask = None + if attention_mask is not None: + mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] + fa_attention_mask = (mask_slice != 0) + +- # 5. 【核心】调用 flash_attention_score,关闭高精度累加 + attn_output = mindspore.ops.flash_attention_score( + query=query_states, + key=key_states, +@@ -987,14 +549,12 @@ class Qwen2MoeFlashAttention(nn.Module): + scalar_value=1.0 / math.sqrt(self.head_dim), + input_layout="BNSD", + sparse_mode=0, +- inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 ++ inner_precise=0 + ) + +- # 6. 调整输出形状 + attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) + attn_output = self.o_proj(attn_output) + +- # 7. 返回结果 + attn_weights = None + if output_attentions: + logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +@@ -1007,88 +567,7 @@ QWEN2MOE_ATTENTION_CLASSES = { + "flash-attention": Qwen2MoeFlashAttention, + } + +- +-# class Qwen2MoeSparseMoeBlock(nn.Module): +-# def __init__(self, config): +-# super().__init__() +-# self.num_experts = config.num_experts +-# self.top_k = config.num_experts_per_tok +-# self.norm_topk_prob = config.norm_topk_prob +- +-# # gating +-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-# self.experts = nn.ModuleList( +-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-# ) +- +-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-# #@dwj +-# # 只遍历激活的专家,而非全部专家 +-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-# num_tokens = hidden_states_reshaped.shape[0] +- +-# router_logits = self.gate(hidden_states_reshaped) +-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-# if self.norm_topk_prob: +-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-# routing_weights = routing_weights.to(hidden_states.dtype) +- +-# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-# flat_selected_experts = selected_experts.flatten() +- +-# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-# token_indices = broadcasted_token_indices.flatten() +- +-# active_experts = ops.unique(flat_selected_experts) +- +-# for expert_idx_tensor in active_experts: +-# expert_idx = expert_idx_tensor.item() +-# expert_layer = self.experts[expert_idx] +- +-# mask = (flat_selected_experts == expert_idx_tensor) +-# selected_token_indices = token_indices[mask] +-# selected_routing_weights = routing_weights.flatten()[mask] +- +-# current_states = hidden_states_reshaped[selected_token_indices] +- +-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +- +-# final_hidden_states = final_hidden_states.index_add( +-# dim=0, +-# index=selected_token_indices, +-# source=expert_output.to(hidden_states.dtype) +-# ) +- +-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +- +-# final_hidden_states = final_hidden_states + shared_expert_output +-# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +- +-# return final_hidden_states, router_logits +- +- + class Qwen2MoeSparseMoeBlock(nn.Module): +- """ +- 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +- 控制的顶级推理策略: +- +- - if Long_Prompt is True: 【精度优先模式】 +- 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +- 适用于需要严格可复现性的长序列任务。 +- +- - if Long_Prompt is False: 【速度优先模式】 +- 采用业界最强的性能组合: +- - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +- - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +- """ + def __init__(self, config: Qwen2MoeConfig): + super().__init__() + self.num_experts = config.num_experts +@@ -1102,7 +581,6 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) + self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) + +- # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- + @no_grad() + def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + original_dtype = hidden_states.dtype +@@ -1119,39 +597,8 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) + return moe_output_fp32.squeeze(1).to(original_dtype) + +- +- # @no_grad() +- # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- # num_tokens, _ = hidden_states.shape +- # flat_selected_experts = selected_experts.flatten() +- # sorted_expert_indices = flat_selected_experts.argsort() +- # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +- # original_token_indices = sorted_expert_indices // self.top_k +- # moe_output = ops.zeros_like(hidden_states) +- # current_token_offset = 0 +- # for i in range(self.num_experts): +- # expert_token_count = tokens_per_expert[i] - current_token_offset +- # if expert_token_count == 0: +- # continue +- # end_offset = current_token_offset + expert_token_count +- # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +- # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +- # expert_hidden_states = hidden_states[expert_original_token_indices] +- # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +- # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +- # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +- # current_token_offset += expert_token_count +- # return moe_output +- +- # baseline + @no_grad() + def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- """ +- 优化版 MoE prefill (速度优先模式): +- - 批量张量化处理同一个 expert 的所有 token +- - 跳过无 token 的专家 +- - 保持结果完全一致 +- """ + moe_output = ops.zeros_like(hidden_states) + + flat_selected_experts = selected_experts.flatten() +@@ -1188,56 +635,39 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + + @no_grad() + def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- """ +- 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add +- 逻辑: +- 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 +- 2. 每个 expert 一次性处理其全部 token +- 3. 最后一次 scatter_add 回到原 token 顺序 +- """ +- + num_tokens = hidden_states.shape[0] + hidden_size = hidden_states.shape[-1] + +- # 展平为一维 +- flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] +- flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] ++ flat_selected_experts = selected_experts.flatten() ++ flat_routing_weights = routing_weights.flatten() + +- # 按 expert 排序 + idxs = flat_selected_experts.argsort() +- sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 +- sorted_token_indices = idxs // self.top_k # 对应原 token ID ++ sorted_expert_indices = flat_selected_experts[idxs] ++ sorted_token_indices = idxs // self.top_k + +- # 排好序的输入向量(连续内存) + permuted_tokens = hidden_states[sorted_token_indices] + +- # 排好序的权重 + sorted_weights = flat_routing_weights[idxs] + +- # 每个 expert 对应的 token 数量 + tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) + +- # 存放专家输出(与 permuted_tokens 对应顺序保持一致) + expert_outputs = ops.zeros_like(permuted_tokens) + +- ptr = 0 # 指向当前切片的起点 ++ ptr = 0 + for expert_id, count in enumerate(tokens_per_expert.tolist()): + if count == 0: + continue + + token_slice = slice(ptr, ptr + count) +- expert_tokens = permuted_tokens[token_slice] # 连续切片 ++ expert_tokens = permuted_tokens[token_slice] + +- # 执行专家 MLP + expert_out = self.experts[expert_id](expert_tokens) + + expert_outputs[token_slice] = expert_out + ptr += count + +- # 按权重缩放 + scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) + +- # 回写到原 token 顺序 (单次 scatter_add) + moe_output = mindspore.mint.scatter_add( + ops.zeros_like(hidden_states), + 0, +@@ -1247,10 +677,6 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + + return moe_output + +- +- +- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +- + @no_grad() + def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: + moe_output = ops.zeros_like(hidden_states) +@@ -1282,31 +708,12 @@ class Qwen2MoeSparseMoeBlock(nn.Module): + if self.norm_topk_prob: + routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) + +- moe_output = None +- # if Long_Prompt==0: +- # # --- 精度优先模式 (ACCURACY MODE) --- +- # routing_weights_casted = routing_weights.to(hidden_states.dtype) +- # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +- # else: +- # # --- 速度优先模式 (SPEED MODE) --- +- # routing_weights_casted = routing_weights.to(hidden_states.dtype) +- # if sequence_length == 1: +- # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +- # else: +- # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +- + routing_weights_casted = routing_weights.to(hidden_states.dtype) + if sequence_length == 1: + moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) + else: +- # if Long_Prompt == 1: +- # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +- # else: +- # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) + moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) + +- +- # 3. 共享专家计算与合并 (所有模式通用) + gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ + F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) + +@@ -1320,11 +727,6 @@ class Qwen2MoeDecoderLayer(nn.Module): + def __init__(self, config: Qwen2MoeConfig, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size +- +- # if Long_Prompt == 2: +- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +- # else: +- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + + self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) + +@@ -1421,8 +823,6 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): + _no_split_modules = ["Qwen2MoeDecoderLayer"] + _skip_keys_device_placement = "past_key_values" + _supports_cache_class = True +-#lwx +- # _supports_static_cache = True + + def _init_weights(self, module): + std = self.config.initializer_range +@@ -1576,7 +976,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): + + hidden_states = self.norm(hidden_states) + +- # add hidden states from the last decoder layer + if output_hidden_states: + all_hidden_states += (hidden_states,) + +@@ -1598,7 +997,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): + router_logits=all_router_logits, + ) + +- # Copied from transformers.models.llama.modeling_llama.LlamaModel._update_causal_mask + def _update_causal_mask( + self, + attention_mask: mindspore.Tensor, +@@ -1626,17 +1024,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): + else past_seen_tokens + sequence_length + 1 + ) + +- # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +- # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +- # attention_mask, +- # sequence_length=sequence_length, +- # target_length=target_length, +- # dtype=dtype, +- # min_dtype=min_dtype, +- # cache_position=cache_position, +- # batch_size=input_tensor.shape[0], +- # ) +- #@dwj + causal_mask = get_cached_causal_mask_with_cache_position( + attention_mask, + sequence_length=sequence_length, +@@ -1664,9 +1051,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + self.num_experts_per_tok = config.num_experts_per_tok + # Initialize weights and apply final processing + self.post_init() +- # @lwx +- # if self.generation_config is not None and self.generation_config.cache_implementation is None: +- # self.generation_config.cache_implementation = "static" ++ + self._warmed_up = False + + def warmup_moe_model(self): +@@ -1890,17 +1275,6 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + dtype = self.lm_head.weight.dtype + min_dtype = float(ops.finfo(dtype).min) + +- # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +- # attention_mask, +- # sequence_length=sequence_length, +- # target_length=past_key_values.get_max_length(), +- # dtype=dtype, +- # min_dtype=min_dtype, +- # cache_position=cache_position, +- # batch_size=batch_size, +- # ) +- +- #@dwj + attention_mask = get_cached_causal_mask_with_cache_position( + attention_mask, + sequence_length=sequence_length, +@@ -1922,363 +1296,6 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): + ) + return model_inputs + +-# @lwx +- # def _decode_one_tokens_logits( +- # self, +- # cur_token: mindspore.Tensor, +- # input_pos: Optional[mindspore.Tensor], +- # cache_position: mindspore.Tensor, +- # past_key_values: StaticCache, +- # ) -> mindspore.Tensor: +- # """ +- # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +- +- # Args: +- # cur_token: 当前要处理的token,shape为(batch_size, 1) +- # input_pos: 输入位置信息,可选 +- # cache_position: 当前token在cache中的位置,shape为(1,) +- # past_key_values: StaticCache对象,存储之前的key-value状态 +- +- # Returns: +- # logits: 当前token的logits,shape为(batch_size, vocab_size) +- # """ +- # # 调用JIT编译的版本 +- # return self.get_decode_one_tokens_logits( +- # cur_token=cur_token, +- # input_pos=input_pos, +- # cache_position=cache_position, +- # past_key_values=past_key_values, +- # ) +- +- # @mindspore.jit(jit_level='O1') +- # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +- # """ +- # JIT编译的函数,用于高效的单token解码 +- # 使用JIT编译优化以支持静态shape和高效执行 +- +- # 注意:直接调用forward方法,避免经过_call_impl中的try-except +- # """ +- # outputs = self.model.forward( +- # input_ids=cur_token, +- # position_ids=input_pos, +- # cache_position=cache_position, +- # past_key_values=past_key_values, +- # use_cache=True, +- # return_dict=False, +- # ) +- +- # hidden_states = outputs[0] +- # logits = self.lm_head.forward(hidden_states) +- # logits = logits.float() +- +- # return logits[:, -1, :] +- +- # def _sample( +- # self, +- # input_ids: mindspore.Tensor, +- # logits_processor, +- # stopping_criteria, +- # generation_config, +- # synced_devices: bool, +- # streamer=None, +- # logits_warper=None, +- # **model_kwargs, +- # ): +- # """ +- # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +- # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +- # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +- # """ +- # from ...generation.logits_process import LogitsProcessorList +- # from ...generation.stopping_criteria import StoppingCriteriaList +- # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +- # from mindnlp.core import nn, ops, no_grad +- # import numpy as np +- +- # # 检查是否使用 StaticCache +- # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +- # # 否则,直接调用父类方法 +- # past_key_values = model_kwargs.get("past_key_values") +- # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +- +- # if not isinstance(past_key_values, StaticCache): +- # # 不使用 StaticCache,直接调用父类方法 +- # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +- # return super()._sample( +- # input_ids=input_ids, +- # logits_processor=logits_processor, +- # stopping_criteria=stopping_criteria, +- # generation_config=generation_config, +- # synced_devices=synced_devices, +- # streamer=streamer, +- # logits_warper=logits_warper, +- # **model_kwargs, +- # ) +- +- # # 使用 StaticCache,进入自定义循环 +- # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +- # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +- # pad_token_id = generation_config._pad_token_tensor +- # output_attentions = generation_config.output_attentions +- # output_hidden_states = generation_config.output_hidden_states +- # output_scores = generation_config.output_scores +- # output_logits = generation_config.output_logits +- # return_dict_in_generate = generation_config.return_dict_in_generate +- # max_length = generation_config.max_length +- # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +- # do_sample = generation_config.do_sample +- +- # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +- # raise ValueError( +- # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +- # f"{logits_warper})." +- # ) +- +- # # init attention / hidden states / scores tuples +- # scores = () if (return_dict_in_generate and output_scores) else None +- # raw_logits = () if (return_dict_in_generate and output_logits) else None +- # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +- # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +- # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +- +- # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +- # if return_dict_in_generate and self.config.is_encoder_decoder: +- # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +- # encoder_hidden_states = ( +- # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +- # ) +- +- # # keep track of which sequences are already finished +- # batch_size, cur_len = input_ids.shape +- # this_peer_finished = False +- # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +- # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +- +- # time_record = [] +- # from ....utils.testing_utils import parse_flag_from_env +- # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +- +- # while self._has_unfinished_sequences( +- # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +- # ): +- # if _record_time: +- # import time as time_module +- # infer_start = time_module.time() +- +- # # prepare model inputs +- # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +- +- # # prepare variable output controls +- # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +- # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +- +- # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +- # cur_cache_position = model_inputs.get("cache_position") +- # cur_past_key_values = model_inputs.get("past_key_values") +- # cur_input_ids = model_inputs.get("input_ids") +- +- # if (isinstance(cur_past_key_values, StaticCache) and +- # cur_cache_position is not None and +- # len(cur_cache_position.shape) > 0 and +- # cur_cache_position.shape[0] == 1 and +- # cur_input_ids is not None and +- # cur_input_ids.shape[1] == 1): +- # # 使用 JIT 优化的单 token 解码 +- # # 简单判断方法:首次调用时打印(JIT编译需要时间) +- # if not hasattr(self, '_jit_used'): +- # self._jit_used = False +- # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +- +- # next_token_logits = self.get_decode_one_tokens_logits( +- # cur_token=cur_input_ids, +- # input_pos=model_inputs.get("position_ids"), +- # cache_position=cur_cache_position, +- # past_key_values=cur_past_key_values, +- # ) +- +- # # 标记已使用JIT(用于后续判断) +- # if not self._jit_used: +- # self._jit_used = True +- +- # # 构造兼容的输出对象 +- # class JitOptimizedOutput: +- # def __init__(self, logits, config): +- # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +- # self.config = config +- # # 对于 JIT 优化路径,这些属性通常不需要 +- # self.decoder_attentions = None if config.is_encoder_decoder else None +- # self.attentions = None if not config.is_encoder_decoder else None +- # self.cross_attentions = None +- # self.decoder_hidden_states = None if config.is_encoder_decoder else None +- # self.hidden_states = None if not config.is_encoder_decoder else None +- +- # outputs = JitOptimizedOutput(next_token_logits, self.config) +- # else: +- # # 标准 forward 调用(首次prefill阶段或非StaticCache) +- # outputs = self(**model_inputs, return_dict=True) +- +- # if synced_devices and this_peer_finished: +- # continue +- +- # # Clone is needed to avoid keeping a hanging ref to outputs.logits +- # next_token_logits = outputs.logits[:, -1, :] +- +- # # pre-process distribution +- # next_token_scores = logits_processor(input_ids, next_token_logits) +- # if do_sample: +- # next_token_scores = logits_warper(input_ids, next_token_scores) +- +- # # Store scores, attentions and hidden_states when required +- # if return_dict_in_generate: +- # if output_scores: +- # scores += (next_token_scores,) +- # if output_logits: +- # raw_logits += (next_token_logits,) +- # if output_attentions: +- # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +- # decoder_attentions += (attn,) if attn is not None else (None,) +- # if self.config.is_encoder_decoder: +- # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +- +- # if output_hidden_states: +- # hidden = ( +- # outputs.decoder_hidden_states +- # if self.config.is_encoder_decoder +- # else outputs.hidden_states +- # ) +- # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +- +- # # token selection +- # if do_sample: +- # probs = nn.functional.softmax(next_token_scores, dim=-1) +- # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +- # else: +- # next_tokens = ops.argmax(next_token_scores, dim=-1) +- +- # # finished sentences should have their next token be a padding token +- # if has_eos_stopping_criteria: +- # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +- +- # # update generated ids, model inputs, and length for next step +- # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +- # if streamer is not None: +- # streamer.put(next_tokens) +- +- # model_kwargs = self._update_model_kwargs_for_generation( +- # outputs, +- # model_kwargs, +- # is_encoder_decoder=self.config.is_encoder_decoder, +- # ) +- +- # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +- # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +- # cur_len += 1 +- +- # if _record_time: +- # import time as time_module +- # infer_stop = time_module.time() +- # time_record.append(infer_stop - infer_start) +- +- # del outputs +- +- # average_infer_time = None +- # if time_record: +- # if len(time_record) > 1: +- # time_record.pop(0) +- # average_infer_time = sum(time_record) / len(time_record) +- # print(f'average inference time is: {average_infer_time}') +- # print(f'inference time record: {time_record}') +- +- # if streamer is not None: +- # streamer.end() +- +- # # 简单判断:打印是否使用了JIT路径 +- # if hasattr(self, '_jit_used') and self._jit_used: +- # print("[JIT] ✓ JIT optimization was used during generation") +- # else: +- # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +- +- # if return_dict_in_generate: +- # if self.config.is_encoder_decoder: +- # return GenerateEncoderDecoderOutput( +- # sequences=input_ids, +- # scores=scores, +- # logits=raw_logits, +- # encoder_attentions=encoder_attentions, +- # encoder_hidden_states=encoder_hidden_states, +- # decoder_attentions=decoder_attentions, +- # cross_attentions=cross_attentions, +- # decoder_hidden_states=decoder_hidden_states, +- # past_key_values=model_kwargs.get("past_key_values"), +- # average_infer_time=average_infer_time +- # ) +- # else: +- # return GenerateDecoderOnlyOutput( +- # sequences=input_ids, +- # scores=scores, +- # logits=raw_logits, +- # attentions=decoder_attentions, +- # hidden_states=decoder_hidden_states, +- # past_key_values=model_kwargs.get("past_key_values"), +- # average_infer_time=average_infer_time +- # ) +- # else: +- # return input_ids +- +- # def _prepare_cache_for_generation( +- # self, +- # generation_config, +- # model_kwargs, +- # assistant_model, +- # batch_size, +- # max_cache_length, +- # ): +- # if generation_config.cache_implementation is None and self._supports_static_cache: +- # generation_config.cache_implementation = "static" +- # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +- +- # if generation_config.cache_implementation == "static": +- # base_required_from_max_length = generation_config.max_length + 1 +- # base_required = max(max_cache_length, base_required_from_max_length) +- # min_cache_size = 50 +- # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +- # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +- # else: +- # max_cache_length = max(base_required, min_cache_size) +- +- # original_max_cache_length = max_cache_length +- # print(f"[JIT] StaticCache max_cache_length calculation:") +- # print(f" - input max_cache_length: {original_max_cache_length}") +- # print(f" - generation_config.max_length: {generation_config.max_length}") +- # print(f" - base_required_from_max_length: {base_required_from_max_length}") +- # print(f" - final max_cache_length: {max_cache_length}") +- +- # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +- # if max_cache_length > self.config.max_position_embeddings: +- # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +- +- # result = super()._prepare_cache_for_generation( +- # generation_config=generation_config, +- # model_kwargs=model_kwargs, +- # assistant_model=assistant_model, +- # batch_size=batch_size, +- # max_cache_length=max_cache_length, +- # ) +- +- # if generation_config.cache_implementation == "static": +- # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +- # created_cache = model_kwargs.get(cache_name) +- # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +- # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +- # if created_cache.max_cache_len < generation_config.max_length: +- # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +- +- # return result +- +- +- +- +- + # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE + class Qwen2MoeForSequenceClassification(Qwen2MoePreTrainedModel): + def __init__(self, config): +diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +deleted file mode 100644 +index 8de61195..00000000 +--- a/patches/0001-20251104commit.patch ++++ /dev/null +@@ -1,1272 +0,0 @@ +-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-Subject: [PATCH 1/8] 20251104commit +- +---- +- mindnlp/transformers/cache_utils.py | 28 +- +- .../models/deepseek/modeling_deepseek.py | 149 ++- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +- 3 files changed, 976 insertions(+), 87 deletions(-) +- +-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-index cadd2e04..02f8d4be 100644 +---- a/mindnlp/transformers/cache_utils.py +-+++ b/mindnlp/transformers/cache_utils.py +-@@ -812,14 +812,26 @@ class StaticCache(Cache): +- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +- # k_out[:, :, cache_position] = key_states +- # v_out[:, :, cache_position] = value_states +-- if ON_ORANGE_PI: +-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-- else: +-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-- +-+ # if ON_ORANGE_PI: +-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+ # else: +-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+ if cache_position.ndim > 1: +-+ cache_position = cache_position.flatten() +-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+ cache_position = cache_position.int() +-+ +-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+ k_out[:, :, cache_position] = key_states +-+ v_out[:, :, cache_position] = value_states +-+ +- return k_out, v_out +- +- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index c695b944..d8303e45 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +- # Copied from transformers.models.llama.modeling_llama.rotate_half +- def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +-- x1 = x[..., : x.shape[-1] // 2] +-- x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+ # x1 = x[..., : x.shape[-1] // 2] +-+ # x2 = x[..., x.shape[-1] // 2 :] +-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +- if self.training: +- raise NotImplementedError("Training is not supported yet.") +- else: +-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-- if self.config.n_shared_experts is not None: +-- y = y + self.shared_experts(identity) +-- return y +-+ # @lwx +-+ if orig_shape[1] == 1: +-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+ y=y.view(*orig_shape) +-+ if self.config.n_shared_experts is not None: +-+ y = y + self.shared_experts(identity) +-+ return y +-+ else: +-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+ if self.config.n_shared_experts is not None: +-+ y = y + self.shared_experts(identity) +-+ return y +-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+ # if self.config.n_shared_experts is not None: +-+ # y = y + self.shared_experts(identity) +-+ # return y +-+ +-+ @no_grad() +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ +-+ expert_cache = ops.zeros_like(x) +-+ for i in range(self.num_experts_per_tok): +-+ expert_id = flat_expert_indices[i].item() +-+ weight = flat_expert_weights[i].item() +-+ expert = self.experts[expert_id] +-+ expert_out = expert(x) +-+ expert_cache += expert_out * weight +-+ return expert_cache +- +- @no_grad() +-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-- # expert_cache = torch.zeros_like(x) +-- # idxs = flat_expert_indices.argsort() +-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-- # token_idxs = idxs // self.num_experts_per_tok +-- # for i, end_idx in enumerate(tokens_per_expert): +-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-- # if start_idx == end_idx: +-- # continue +-- # expert = self.experts[i] +-- # exp_token_idx = token_idxs[start_idx:end_idx] +-- # expert_tokens = x[exp_token_idx] +-- # expert_out = expert(expert_tokens) +-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-- # return expert_cache +-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- expert_cache = ops.zeros_like(x) +- idxs = flat_expert_indices.argsort() +- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +- token_idxs = idxs // self.num_experts_per_tok +-+ +- for i, end_idx in enumerate(tokens_per_expert): +- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +- if start_idx == end_idx: +-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +- expert_out = expert(expert_tokens) +- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +- return expert_cache +-+ +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # # expert_cache = torch.zeros_like(x) +-+ # # idxs = flat_expert_indices.argsort() +-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+ # # token_idxs = idxs // self.num_experts_per_tok +-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+ # # if start_idx == end_idx: +-+ # # continue +-+ # # expert = self.experts[i] +-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # # expert_tokens = x[exp_token_idx] +-+ # # expert_out = expert(expert_tokens) +-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+ # # return expert_cache +-+ # expert_cache = ops.zeros_like(x) +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # for i, end_idx in enumerate(tokens_per_expert): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # if start_idx == end_idx: +-+ # continue +-+ # expert = self.experts[i] +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = expert(expert_tokens) +-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +-+ # return expert_cache +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # expert_cache = ops.zeros_like(x) +-+ +-+ # # 排序保证顺序一致 +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # # 找出有 token 的专家 +-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+ +-+ # for i in active_experts.tolist(): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # end_idx = tokens_per_expert[i] +-+ # if start_idx == end_idx: # 没有 token +-+ # continue +-+ +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = self.experts[i](expert_tokens) +-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+ +-+ # expert_cache = mindspore.mint.scatter_add( +-+ # expert_cache, +-+ # 0, +-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+ # expert_out +-+ # ) +-+ +-+ # return expert_cache +-+ +-+ +- +- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +- # """ +-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- +- # Initialize weights and apply final processing +- self.post_init() +-+ self.warm_up = False +-+ +-+ def warmup_moe_model_deep(self): +-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+ test_texts = [ +-+ "warmup short", +-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+ ] +-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+ if tokenizer is None: +-+ from mindnlp.transformers import AutoTokenizer +-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+ self._warmup_tokenizer = tokenizer +-+ +-+ for text in test_texts: +-+ inputs = tokenizer(text, return_tensors="ms") +-+ with mindspore._no_grad(): +-+ _ = self(**inputs, use_cache=False) +-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +- +- def get_input_embeddings(self): +- return self.model.embed_tokens +-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +- ```""" +-+ if not self.warm_up: +-+ self.warm_up = True +-+ self.warmup_moe_model_deep() +-+ +- output_attentions = ( +- output_attentions +- if output_attentions is not None +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index 3cbf820e..d4c6b651 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -18,7 +18,6 @@ +- # See the License for the specific language governing permissions and +- # limitations under the License. +- """MindSpore Qwen2MoE model.""" +-- +- import math +- from typing import List, Optional, Tuple, Union +- +-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +- TokenClassifierOutput, +- ) +- from ...modeling_utils import PreTrainedModel +-+from ...generation import GenerationMixin +- from ....utils import logging +- from .configuration_qwen2_moe import Qwen2MoeConfig +- +-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +- self.variance_epsilon = eps +- +- def forward(self, hidden_states): +-+ # @dwj +-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+ # @lwx +-+ # if not self.training : +-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +- input_dtype = hidden_states.dtype +- hidden_states = hidden_states.to(mindspore.float32) +- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-@@ -234,6 +239,8 @@ def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +- x1 = x[..., : x.shape[-1] // 2] +- x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +- self.config = config +- self.hidden_size = config.hidden_size +- self.intermediate_size = intermediate_size +-+ +- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +- self.act_fn = ACT2FN[config.hidden_act] +- +- def forward(self, x): +-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-- +- +-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+ # @lwx +-+ # gate_up_output = self.gate_up_proj(x) +-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+ # return self.down_proj(swiglu_output) +-+ +-+ # def forward(self, x): +-+ # gate_proj_out = self.gate_proj(x) +-+ # up_proj_out = self.up_proj(x) +-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+ # return self.down_proj(swiglu_out) +-+ +- # Copied from transformers.models.llama.modeling_llama.repeat_kv +- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +- """ +-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +- use_cache: bool = False, +- cache_position: Optional[mindspore.Tensor] = None, +- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ +-+ +- bsz, q_len, _ = hidden_states.shape +- +- query_states = self.q_proj(hidden_states) +-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +- "with a layer index." +- ) +-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ if isinstance(past_key_value, StaticCache): +-+ kv_seq_len = key_states.shape[-2] +-+ else: +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +- if past_key_value is not None: +- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+ if isinstance(past_key_value, StaticCache): +-+ kv_seq_len = key_states.shape[-2] +- +- # repeat k/v heads if n_kv_heads < n_heads +- key_states = repeat_kv(key_states, self.num_key_value_groups) +- value_states = repeat_kv(value_states, self.num_key_value_groups) +-- +-+ +- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +- +-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-- raise ValueError( +-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-- f" {attn_weights.shape}" +-- ) +-- +-- if attention_mask is not None: # no matter the length, we just slice it +-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+ if attention_mask is not None: +-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +- attn_weights = attn_weights + causal_mask +- +- # upcast attention to fp32 +-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +- +- attn_output = self.o_proj(attn_output) +-- +-+ # @lwx +-+ +-+ # max_seq_len = self.max_position_embeddings # 2048 +-+ +-+ # if attention_mask is not None: +-+ # # attention_mask: [B, 1, Sq, Sk] +-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+ +-+ # # pad 到 [max_seq_len, max_seq_len] +-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+ # global_attention_mask = padded_mask +-+ # else: +-+ # global_attention_mask = None +-+ +-+ +-+ # sparse_mode=3 +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, +-+ # key=key_states, +-+ # value=value_states, +-+ # real_shift=None, +-+ # padding_mask=None, +-+ +-+ # head_num=self.num_heads, +-+ # attn_mask=global_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+ # input_layout="BNSD", +-+ # pre_tokens=2147483647, +-+ # next_tokens=2147483647, +-+ # inner_precise=0, +-+ # drop_mask=None, +-+ # prefix=None, +-+ # actual_seq_qlen=None, +-+ # actual_seq_kvlen=None, +-+ # sparse_mode=sparse_mode, +-+ # ) +- if not output_attentions: +- attn_weights = None +- +- return attn_output, attn_weights, past_key_value +- +- +-+class Qwen2MoeFlashAttention(nn.Module): +-+ """ +-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+ +-+ 关键改动: +-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+ 直接传入原始的 key 和 value 张量效率更高。 +-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+ """ +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+ super().__init__() +-+ self.config = config +-+ self.layer_idx = layer_idx +-+ self.hidden_size = config.hidden_size +-+ self.num_heads = config.num_attention_heads +-+ self.head_dim = self.hidden_size // self.num_heads +-+ self.num_key_value_heads = config.num_key_value_heads +-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+ self.max_position_embeddings = config.max_position_embeddings +-+ self.rope_theta = config.rope_theta +-+ self.attention_dropout = config.attention_dropout +-+ +-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+ raise ValueError( +-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+ ) +-+ +-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+ +-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+ self.head_dim, +-+ max_position_embeddings=self.max_position_embeddings, +-+ base=self.rope_theta, +-+ ) +-+ +-+ def forward( +-+ self, +-+ hidden_states: mindspore.Tensor, +-+ attention_mask: Optional[mindspore.Tensor] = None, +-+ position_ids: Optional[mindspore.Tensor] = None, +-+ past_key_value: Optional[Cache] = None, +-+ output_attentions: bool = False, +-+ use_cache: bool = False, +-+ cache_position: Optional[mindspore.Tensor] = None, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ # 1. 线性投射 Q, K, V +-+ query_states = self.q_proj(hidden_states) +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+ # query: [B, S, H*D] -> [B, N1, S, D] +-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # 3. RoPE 旋转位置编码 +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+ if self.layer_idx is None: +-+ raise ValueError( +-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ "with a layer index." +-+ ) +-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+ if cache_position.shape[0] == 1: +-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+ kv_seq_len = past_seen_tokens + 1 +-+ else: +-+ # prefill 阶段:cache_position 是范围,使用其长度 +-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+ else: +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # 4. KV 缓存更新 +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ key_states, value_states = past_key_value.update( +-+ key_states, value_states, self.layer_idx, cache_kwargs +-+ ) +-+ +-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+ if cache_position.shape[0] == 1: +-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+ kv_seq_len = key_states.shape[-2] +-+ +-+ # 5. [重要] 准备 Attention Mask +-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+ fa_attention_mask = None +-+ if attention_mask is not None: +-+ # 截取与当前key长度匹配的部分 +-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+ fa_attention_mask = (mask_slice != 0) +-+ +-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+ input_dtype = query_states.dtype +-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+ query_states = query_states.to(mindspore.float16) +-+ key_states = key_states.to(mindspore.float16) +-+ value_states = value_states.to(mindspore.float16) +-+ +-+ # 6. [核心] 调用 flash_attention_score 算子 +-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+ attn_output = mindspore.ops.flash_attention_score( +-+ query=query_states, +-+ key=key_states, +-+ value=value_states, +-+ head_num=self.num_heads, # 传入Q的头数(N1) +-+ attn_mask=fa_attention_mask, +-+ keep_prob=1.0 - self.attention_dropout, +-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+ input_layout="BNSD", +-+ sparse_mode=0 # 使用 defaultMask 模式 +-+ ) +-+ +-+ # 恢复原始数据类型 +-+ attn_output = attn_output.to(input_dtype) +-+ +-+ # 7. 调整输出形状 +-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ attn_output = self.o_proj(attn_output) +-+ +-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-+ attn_weights = None +-+ if output_attentions: +-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+ # def forward( +-+ # self, +-+ # hidden_states: mindspore.Tensor, +-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+ # position_ids: Optional[mindspore.Tensor] = None, +-+ # past_key_value: Optional[Cache] = None, +-+ # output_attentions: bool = False, +-+ # use_cache: bool = False, +-+ # cache_position: Optional[mindspore.Tensor] = None, +-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ # bsz, q_len, _ = hidden_states.shape +-+ +-+ # # 1. 线性投射 Q, K, V +-+ # query_states = self.q_proj(hidden_states) +-+ # key_states = self.k_proj(hidden_states) +-+ # value_states = self.v_proj(hidden_states) +-+ +-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # # 3. RoPE 旋转位置编码 +-+ # kv_seq_len = key_states.shape[-2] +-+ # if past_key_value is not None: +-+ # if self.layer_idx is None: +-+ # raise ValueError( +-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ # "with a layer index." +-+ # ) +-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # # 4. KV 缓存更新 +-+ # if past_key_value is not None: +-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ # key_states, value_states = past_key_value.update( +-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+ # ) +-+ +-+ # # 5. 准备 Attention Mask +-+ # fa_attention_mask = None +-+ # if attention_mask is not None: +-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # fa_attention_mask = (mask_slice != 0) +-+ +-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+ # input_dtype = query_states.dtype +-+ +-+ # # 6. [核心] 调用 flash_attention_score 算子 +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, +-+ # key=key_states, +-+ # value=value_states, +-+ # head_num=self.num_heads, +-+ # attn_mask=fa_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+ # input_layout="BNSD", +-+ # sparse_mode=0, +-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+ # inner_precise=1 +-+ # ) +-+ +-+ # # 恢复原始数据类型 +-+ # attn_output = attn_output.to(input_dtype) +-+ +-+ # # 7. 调整输出形状 +-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ # attn_output = self.o_proj(attn_output) +-+ +-+ # attn_weights = None +-+ # if output_attentions: +-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+ # return attn_output, attn_weights, past_key_value +-+ +-+ # def forward( +-+ # self, +-+ # hidden_states: mindspore.Tensor, +-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+ # position_ids: Optional[mindspore.Tensor] = None, +-+ # past_key_value: Optional[Cache] = None, +-+ # output_attentions: bool = False, +-+ # use_cache: bool = False, +-+ # cache_position: Optional[mindspore.Tensor] = None, +-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ # bsz, q_len, _ = hidden_states.shape +-+ +-+ # query_states = self.q_proj(hidden_states) +-+ # key_states = self.k_proj(hidden_states) +-+ # value_states = self.v_proj(hidden_states) +-+ +-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ # kv_seq_len = key_states.shape[-2] +-+ # if past_key_value is not None: +-+ # if self.layer_idx is None: +-+ # raise ValueError("`layer_idx` must be specified for caching") +-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ # if past_key_value is not None: +-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ # key_states, value_states = past_key_value.update( +-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+ # ) +-+ +-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+ +-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+ # query_states = query_states / math.sqrt(self.head_dim) +-+ # # <--- 修改结束 --- +-+ +-+ # fa_attention_mask = None +-+ # if attention_mask is not None: +-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ # fa_attention_mask = (mask_slice != 0) +-+ +-+ # input_dtype = query_states.dtype +-+ +-+ # attn_output = mindspore.ops.flash_attention_score( +-+ # query=query_states, # 传入已经预先缩放过的 query +-+ # key=key_states, +-+ # value=value_states, +-+ # head_num=self.num_heads, +-+ # attn_mask=fa_attention_mask, +-+ # keep_prob=1.0 - self.attention_dropout, +-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+ # input_layout="BNSD", +-+ # sparse_mode=0, +-+ # inner_precise=1 # 仍然保持内部高精度计算 +-+ # ) +-+ +-+ # attn_output = attn_output.to(input_dtype) +-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ # attn_output = self.o_proj(attn_output) +-+ +-+ # attn_weights = None +-+ # if output_attentions: +-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+ +-+ # return attn_output, attn_weights, past_key_value +-+ +- QWEN2MOE_ATTENTION_CLASSES = { +- "eager": Qwen2MoeAttention, +-+ "flash-attention": Qwen2MoeFlashAttention, +- } +- +- +-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-+ #@dwj +-+ # 只遍历激活的专家,而非全部专家 +- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-- hidden_states = hidden_states.view(-1, hidden_dim) +-- # router_logits: (batch * sequence_length, n_experts) +-- router_logits = self.gate(hidden_states) +-- +-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- if self.norm_topk_prob: +-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- # we cast back to the input dtype +-- routing_weights = routing_weights.to(hidden_states.dtype) +-- +-- final_hidden_states = ops.zeros( +-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-- ) +-- +-- # One hot encode the selected experts to create an expert mask +-- # this will be used to easily index which expert is going to be sollicitated +-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-- +-- # Loop over all available experts in the model and perform the computation on each expert +-- for expert_idx in range(self.num_experts): +-- expert_layer = self.experts[expert_idx] +-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-- +-- # Index the correct hidden states and compute the expert hidden state for +-- # the current expert. We need to make sure to multiply the output hidden +-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-- if 0 not in idx.shape: +-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-- +-- # However `index_add_` only support torch tensors for indexing so we'll use +-- # the `top_x` tensor here. +-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-- +-- shared_expert_output = self.shared_expert(hidden_states) +-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-- +-- final_hidden_states = final_hidden_states + shared_expert_output +-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+ num_tokens = hidden_states_reshaped.shape[0] +-+ +-+ router_logits = self.gate(hidden_states_reshaped) +-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+ if self.norm_topk_prob: +-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+ flat_selected_experts = selected_experts.flatten() +-+ +-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+ token_indices = broadcasted_token_indices.flatten() +-+ +-+ active_experts = ops.unique(flat_selected_experts) +-+ +-+ for expert_idx_tensor in active_experts: +-+ expert_idx = expert_idx_tensor.item() +-+ expert_layer = self.experts[expert_idx] +-+ +-+ mask = (flat_selected_experts == expert_idx_tensor) +-+ selected_token_indices = token_indices[mask] +-+ selected_routing_weights = routing_weights.flatten()[mask] +-+ +-+ current_states = hidden_states_reshaped[selected_token_indices] +-+ +-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+ +-+ final_hidden_states = final_hidden_states.index_add( +-+ dim=0, +-+ index=selected_token_indices, +-+ source=expert_output.to(hidden_states.dtype) +-+ ) +-+ +-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +- +-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-- return final_hidden_states, router_logits +-+ final_hidden_states = final_hidden_states + shared_expert_output +-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+ +-+ return final_hidden_states, router_logits +- +- +- class Qwen2MoeDecoderLayer(nn.Module): +-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +- +- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+ +- if (layer_idx not in config.mlp_only_layers) and ( +- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +- ): +-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +- _no_split_modules = ["Qwen2MoeDecoderLayer"] +- _skip_keys_device_placement = "past_key_values" +- _supports_cache_class = True +-+#lwx +-+ # _supports_static_cache = True +- +- def _init_weights(self, module): +- std = self.config.initializer_range +-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +- return causal_mask +- +- +--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- _tied_weights_keys = ["lm_head.weight"] +- +- def __init__(self, config): +-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- self.num_experts_per_tok = config.num_experts_per_tok +- # Initialize weights and apply final processing +- self.post_init() +-+ # @lwx +-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+ # self.generation_config.cache_implementation = "static" +-+ self._warmed_up = False +-+ +-+ def warmup_moe_model(self): +-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+ test_texts = [ +-+ "warmup short", +-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+ ] +-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+ if tokenizer is None: +-+ from mindnlp.transformers import AutoTokenizer +-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+ self._warmup_tokenizer = tokenizer +-+ +-+ for text in test_texts: +-+ inputs = tokenizer(text, return_tensors="ms") +-+ with mindspore._no_grad(): +-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +- +- def get_input_embeddings(self): +- return self.model.embed_tokens +-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +- ```""" +-+ if not self._warmed_up: +-+ self._warmed_up = True +-+ self.warmup_moe_model() +- +- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +- output_router_logits = ( +-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +- } +- ) +- return model_inputs +-+# @lwx +-+ # def _decode_one_tokens_logits( +-+ # self, +-+ # cur_token: mindspore.Tensor, +-+ # input_pos: Optional[mindspore.Tensor], +-+ # cache_position: mindspore.Tensor, +-+ # past_key_values: StaticCache, +-+ # ) -> mindspore.Tensor: +-+ # """ +-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+ +-+ # Args: +-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+ # input_pos: 输入位置信息,可选 +-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+ +-+ # Returns: +-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+ # """ +-+ # # 调用JIT编译的版本 +-+ # return self.get_decode_one_tokens_logits( +-+ # cur_token=cur_token, +-+ # input_pos=input_pos, +-+ # cache_position=cache_position, +-+ # past_key_values=past_key_values, +-+ # ) +-+ +-+ # @mindspore.jit(jit_level='O1') +-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+ # """ +-+ # JIT编译的函数,用于高效的单token解码 +-+ # 使用JIT编译优化以支持静态shape和高效执行 +-+ +-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+ # """ +-+ # outputs = self.model.forward( +-+ # input_ids=cur_token, +-+ # position_ids=input_pos, +-+ # cache_position=cache_position, +-+ # past_key_values=past_key_values, +-+ # use_cache=True, +-+ # return_dict=False, +-+ # ) +-+ +-+ # hidden_states = outputs[0] +-+ # logits = self.lm_head.forward(hidden_states) +-+ # logits = logits.float() +-+ +-+ # return logits[:, -1, :] +-+ +-+ # def _sample( +-+ # self, +-+ # input_ids: mindspore.Tensor, +-+ # logits_processor, +-+ # stopping_criteria, +-+ # generation_config, +-+ # synced_devices: bool, +-+ # streamer=None, +-+ # logits_warper=None, +-+ # **model_kwargs, +-+ # ): +-+ # """ +-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+ # """ +-+ # from ...generation.logits_process import LogitsProcessorList +-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+ # from mindnlp.core import nn, ops, no_grad +-+ # import numpy as np +-+ +-+ # # 检查是否使用 StaticCache +-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+ # # 否则,直接调用父类方法 +-+ # past_key_values = model_kwargs.get("past_key_values") +-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+ +-+ # if not isinstance(past_key_values, StaticCache): +-+ # # 不使用 StaticCache,直接调用父类方法 +-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+ # return super()._sample( +-+ # input_ids=input_ids, +-+ # logits_processor=logits_processor, +-+ # stopping_criteria=stopping_criteria, +-+ # generation_config=generation_config, +-+ # synced_devices=synced_devices, +-+ # streamer=streamer, +-+ # logits_warper=logits_warper, +-+ # **model_kwargs, +-+ # ) +-+ +-+ # # 使用 StaticCache,进入自定义循环 +-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+ # pad_token_id = generation_config._pad_token_tensor +-+ # output_attentions = generation_config.output_attentions +-+ # output_hidden_states = generation_config.output_hidden_states +-+ # output_scores = generation_config.output_scores +-+ # output_logits = generation_config.output_logits +-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-+ # max_length = generation_config.max_length +-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+ # do_sample = generation_config.do_sample +-+ +-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+ # raise ValueError( +-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+ # f"{logits_warper})." +-+ # ) +-+ +-+ # # init attention / hidden states / scores tuples +-+ # scores = () if (return_dict_in_generate and output_scores) else None +-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+ +-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+ # encoder_hidden_states = ( +-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+ # ) +-+ +-+ # # keep track of which sequences are already finished +-+ # batch_size, cur_len = input_ids.shape +-+ # this_peer_finished = False +-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+ +-+ # time_record = [] +-+ # from ....utils.testing_utils import parse_flag_from_env +-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+ +-+ # while self._has_unfinished_sequences( +-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+ # ): +-+ # if _record_time: +-+ # import time as time_module +-+ # infer_start = time_module.time() +-+ +-+ # # prepare model inputs +-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+ +-+ # # prepare variable output controls +-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+ +-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+ # cur_cache_position = model_inputs.get("cache_position") +-+ # cur_past_key_values = model_inputs.get("past_key_values") +-+ # cur_input_ids = model_inputs.get("input_ids") +-+ +-+ # if (isinstance(cur_past_key_values, StaticCache) and +-+ # cur_cache_position is not None and +-+ # len(cur_cache_position.shape) > 0 and +-+ # cur_cache_position.shape[0] == 1 and +-+ # cur_input_ids is not None and +-+ # cur_input_ids.shape[1] == 1): +-+ # # 使用 JIT 优化的单 token 解码 +-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+ # if not hasattr(self, '_jit_used'): +-+ # self._jit_used = False +-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+ +-+ # next_token_logits = self.get_decode_one_tokens_logits( +-+ # cur_token=cur_input_ids, +-+ # input_pos=model_inputs.get("position_ids"), +-+ # cache_position=cur_cache_position, +-+ # past_key_values=cur_past_key_values, +-+ # ) +-+ +-+ # # 标记已使用JIT(用于后续判断) +-+ # if not self._jit_used: +-+ # self._jit_used = True +-+ +-+ # # 构造兼容的输出对象 +-+ # class JitOptimizedOutput: +-+ # def __init__(self, logits, config): +-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+ # self.config = config +-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+ # self.attentions = None if not config.is_encoder_decoder else None +-+ # self.cross_attentions = None +-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-+ +-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+ # else: +-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+ # outputs = self(**model_inputs, return_dict=True) +-+ +-+ # if synced_devices and this_peer_finished: +-+ # continue +-+ +-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+ # next_token_logits = outputs.logits[:, -1, :] +-+ +-+ # # pre-process distribution +-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+ # if do_sample: +-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+ +-+ # # Store scores, attentions and hidden_states when required +-+ # if return_dict_in_generate: +-+ # if output_scores: +-+ # scores += (next_token_scores,) +-+ # if output_logits: +-+ # raw_logits += (next_token_logits,) +-+ # if output_attentions: +-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-+ # if self.config.is_encoder_decoder: +-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+ +-+ # if output_hidden_states: +-+ # hidden = ( +-+ # outputs.decoder_hidden_states +-+ # if self.config.is_encoder_decoder +-+ # else outputs.hidden_states +-+ # ) +-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+ +-+ # # token selection +-+ # if do_sample: +-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+ # else: +-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+ +-+ # # finished sentences should have their next token be a padding token +-+ # if has_eos_stopping_criteria: +-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+ +-+ # # update generated ids, model inputs, and length for next step +-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+ # if streamer is not None: +-+ # streamer.put(next_tokens) +-+ +-+ # model_kwargs = self._update_model_kwargs_for_generation( +-+ # outputs, +-+ # model_kwargs, +-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-+ # ) +-+ +-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+ # cur_len += 1 +-+ +-+ # if _record_time: +-+ # import time as time_module +-+ # infer_stop = time_module.time() +-+ # time_record.append(infer_stop - infer_start) +-+ +-+ # del outputs +-+ +-+ # average_infer_time = None +-+ # if time_record: +-+ # if len(time_record) > 1: +-+ # time_record.pop(0) +-+ # average_infer_time = sum(time_record) / len(time_record) +-+ # print(f'average inference time is: {average_infer_time}') +-+ # print(f'inference time record: {time_record}') +-+ +-+ # if streamer is not None: +-+ # streamer.end() +-+ +-+ # # 简单判断:打印是否使用了JIT路径 +-+ # if hasattr(self, '_jit_used') and self._jit_used: +-+ # print("[JIT] ✓ JIT optimization was used during generation") +-+ # else: +-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+ +-+ # if return_dict_in_generate: +-+ # if self.config.is_encoder_decoder: +-+ # return GenerateEncoderDecoderOutput( +-+ # sequences=input_ids, +-+ # scores=scores, +-+ # logits=raw_logits, +-+ # encoder_attentions=encoder_attentions, +-+ # encoder_hidden_states=encoder_hidden_states, +-+ # decoder_attentions=decoder_attentions, +-+ # cross_attentions=cross_attentions, +-+ # decoder_hidden_states=decoder_hidden_states, +-+ # past_key_values=model_kwargs.get("past_key_values"), +-+ # average_infer_time=average_infer_time +-+ # ) +-+ # else: +-+ # return GenerateDecoderOnlyOutput( +-+ # sequences=input_ids, +-+ # scores=scores, +-+ # logits=raw_logits, +-+ # attentions=decoder_attentions, +-+ # hidden_states=decoder_hidden_states, +-+ # past_key_values=model_kwargs.get("past_key_values"), +-+ # average_infer_time=average_infer_time +-+ # ) +-+ # else: +-+ # return input_ids +-+ +-+ # def _prepare_cache_for_generation( +-+ # self, +-+ # generation_config, +-+ # model_kwargs, +-+ # assistant_model, +-+ # batch_size, +-+ # max_cache_length, +-+ # ): +-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+ # generation_config.cache_implementation = "static" +-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+ +-+ # if generation_config.cache_implementation == "static": +-+ # base_required_from_max_length = generation_config.max_length + 1 +-+ # base_required = max(max_cache_length, base_required_from_max_length) +-+ # min_cache_size = 50 +-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+ # else: +-+ # max_cache_length = max(base_required, min_cache_size) +-+ +-+ # original_max_cache_length = max_cache_length +-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+ # print(f" - final max_cache_length: {max_cache_length}") +-+ +-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+ # if max_cache_length > self.config.max_position_embeddings: +-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+ +-+ # result = super()._prepare_cache_for_generation( +-+ # generation_config=generation_config, +-+ # model_kwargs=model_kwargs, +-+ # assistant_model=assistant_model, +-+ # batch_size=batch_size, +-+ # max_cache_length=max_cache_length, +-+ # ) +-+ +-+ # if generation_config.cache_implementation == "static": +-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+ # created_cache = model_kwargs.get(cache_name) +-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+ # if created_cache.max_cache_len < generation_config.max_length: +-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+ +-+ # return result +-+ +-+ +-+ +- +- +- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +--- +-2.27.0 +- +diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +deleted file mode 100644 +index d7a129ea..00000000 +--- a/patches/0002-20251106commit.patch ++++ /dev/null +@@ -1,3200 +0,0 @@ +-From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Thu, 6 Nov 2025 09:20:38 +0800 +-Subject: [PATCH 2/8] 20251106commit +- +---- +- .../models/deepseek/modeling_deepseek.py | 379 ++++- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +- patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +- 3 files changed, 2689 insertions(+), 305 deletions(-) +- create mode 100644 patches/0001-20251104commit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index d8303e45..73773c22 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +- # y = y + self.shared_experts(identity) +- # return y +- +-+ # @no_grad() +-+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ +-+ # expert_cache = ops.zeros_like(x) +-+ # for i in range(self.num_experts_per_tok): +-+ # expert_id = flat_expert_indices[i].item() +-+ # weight = flat_expert_weights[i].item() +-+ # expert = self.experts[expert_id] +-+ # expert_out = expert(x) +-+ # expert_cache += expert_out * weight +-+ # return expert_cache +-+ +- @no_grad() +- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ # x 的 shape: (1, hidden_size) +-+ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+ +-+ # 1. 收集所有需要的专家层 +-+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+ selected_experts = [self.experts[i] for i in flat_expert_indices] +-+ +-+ # 2. 并行计算所有专家的输出 +-+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+ # ops.cat 会将它们堆叠成一个新的 Tensor +-+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+ +-+ # 3. 使用矩阵乘法进行加权求和 +-+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ # 最终结果 final_output 的 shape: (1, hidden_size) +-+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+ +-+ return final_output +- +-- expert_cache = ops.zeros_like(x) +-- for i in range(self.num_experts_per_tok): +-- expert_id = flat_expert_indices[i].item() +-- weight = flat_expert_weights[i].item() +-- expert = self.experts[expert_id] +-- expert_out = expert(x) +-- expert_cache += expert_out * weight +-- return expert_cache +- +- @no_grad() +- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +- key_states = self.k_proj(hidden_states) +- value_states = self.v_proj(hidden_states) +- +-- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+ # @lwx +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-+ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +- +- kv_seq_len = key_states.shape[-2] +- if past_key_value is not None: +-@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +- return attn_output, attn_weights, past_key_value +- +- +-+# class DeepseekFlashAttention(nn.Module): +-+# """ +-+# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-+ +-+# This class is designed as a drop-in replacement for DeepseekAttention. +-+# """ +-+ +-+# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+# super().__init__() +-+# self.config = config +-+# self.layer_idx = layer_idx +-+# if layer_idx is None: +-+# logger.warning( +-+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+# "when creating this class." +-+# ) +-+ +-+# self.attention_dropout = config.attention_dropout +-+# self.hidden_size = config.hidden_size +-+# self.num_heads = config.num_attention_heads +-+# self.head_dim = self.hidden_size // self.num_heads +-+# self.num_key_value_heads = config.num_key_value_heads +-+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+# self.max_position_embeddings = config.max_position_embeddings +-+# self.rope_theta = config.rope_theta +-+# self.is_causal = True +-+ +-+# if (self.head_dim * self.num_heads) != self.hidden_size: +-+# raise ValueError( +-+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+# f" and `num_heads`: {self.num_heads})." +-+# ) +-+ +-+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+# self._init_rope() +-+ +-+# def _init_rope(self): +-+# if self.config.rope_scaling is None: +-+# self.rotary_emb = DeepseekRotaryEmbedding( +-+# self.head_dim, +-+# max_position_embeddings=self.max_position_embeddings, +-+# base=self.rope_theta, +-+# ) +-+# else: +-+# scaling_type = self.config.rope_scaling["type"] +-+# scaling_factor = self.config.rope_scaling["factor"] +-+# if scaling_type == "linear": +-+# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+# self.head_dim, +-+# max_position_embeddings=self.max_position_embeddings, +-+# scaling_factor=scaling_factor, +-+# base=self.rope_theta, +-+# ) +-+# elif scaling_type == "dynamic": +-+# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+# self.head_dim, +-+# max_position_embeddings=self.max_position_embeddings, +-+# scaling_factor=scaling_factor, +-+# base=self.rope_theta, +-+# ) +-+# else: +-+# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+ +-+# def forward( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# attention_mask: Optional[mindspore.Tensor] = None, +-+# position_ids: Optional[mindspore.Tensor] = None, +-+# past_key_value: Optional[Cache] = None, +-+# output_attentions: bool = False, +-+# use_cache: bool = False, +-+# **kwargs, +-+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+# if "padding_mask" in kwargs: +-+# warnings.warn( +-+# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+# ) +-+ +-+# if output_attentions: +-+# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-+ +-+# bsz, q_len, _ = hidden_states.shape +-+ +-+# if self.config.pretraining_tp > 1: +-+# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+ +-+# query_states = self.q_proj(hidden_states) +-+# key_states = self.k_proj(hidden_states) +-+# value_states = self.v_proj(hidden_states) +-+ +-+# # Reshape for multi-head attention +-+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+# kv_seq_len = key_states.shape[-2] +-+# if past_key_value is not None: +-+# if self.layer_idx is None: +-+# raise ValueError( +-+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+# "with a layer index." +-+# ) +-+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+# # Apply Rotary Positional Embedding +-+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+# if past_key_value is not None: +-+# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-+# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-+# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ +-+# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+ +-+# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+ +-+# # Convert attention_mask for flash_attention_score +-+# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-+# if attention_mask is not None: +-+# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-+# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+# raise ValueError( +-+# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+# ) +-+# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-+# else: +-+# attn_mask_for_fa = None +-+ +-+# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+ +-+# # Call the fused flash_attention_score operator +-+# attn_output = mindspore.ops.flash_attention_score( +-+# query=query_states_for_fa, +-+# key=key_states_for_fa, +-+# value=value_states_for_fa, +-+# head_num=self.num_heads, # This is N1, the number of query heads +-+# input_layout='BSH', +-+# attn_mask=attn_mask_for_fa, +-+# keep_prob=keep_prob, +-+# scalar_value=1.0 / math.sqrt(self.head_dim), +-+# sparse_mode=0 # Default mask mode +-+# ) +-+ +-+# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-+# attn_output = self.o_proj(attn_output) +-+ +-+# # Flash Attention does not return attention weights +-+# attn_weights = None +-+ +-+# return attn_output, attn_weights, past_key_value +-+ +-+class DeepseekFlashAttention(nn.Module): +-+ """ +-+ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-+ This implementation is a drop-in replacement for the original DeepseekAttention class, +-+ designed for high performance on supported hardware (Ascend). +-+ +-+ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-+ """ +-+ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+ super().__init__() +-+ self.config = config +-+ self.layer_idx = layer_idx +-+ if layer_idx is None: +-+ logger.warning( +-+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+ "when creating this class." +-+ ) +-+ +-+ # --- [FIX] Correctly initialize all required attributes --- +-+ self.attention_dropout = config.attention_dropout +-+ self.hidden_size = config.hidden_size +-+ self.num_heads = config.num_attention_heads +-+ self.head_dim = self.hidden_size // self.num_heads +-+ self.num_key_value_heads = config.num_key_value_heads +-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+ self.max_position_embeddings = config.max_position_embeddings +-+ self.rope_theta = config.rope_theta +-+ self.is_causal = True +-+ +-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+ raise ValueError( +-+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+ f" and `num_heads`: {self.num_heads})." +-+ ) +-+ +-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+ +-+ # This call will now succeed as all attributes are initialized. +-+ self._init_rope() +-+ +-+ def _init_rope(self): +-+ if self.config.rope_scaling is None: +-+ self.rotary_emb = DeepseekRotaryEmbedding( +-+ self.head_dim, +-+ max_position_embeddings=self.max_position_embeddings, +-+ base=self.rope_theta, +-+ ) +-+ else: +-+ scaling_type = self.config.rope_scaling["type"] +-+ scaling_factor = self.config.rope_scaling["factor"] +-+ if scaling_type == "linear": +-+ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+ self.head_dim, +-+ max_position_embeddings=self.max_position_embeddings, +-+ scaling_factor=scaling_factor, +-+ base=self.rope_theta, +-+ ) +-+ elif scaling_type == "dynamic": +-+ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+ self.head_dim, +-+ max_position_embeddings=self.max_position_embeddings, +-+ scaling_factor=scaling_factor, +-+ base=self.rope_theta, +-+ ) +-+ else: +-+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+ +-+ def forward( +-+ self, +-+ hidden_states: mindspore.Tensor, +-+ attention_mask: Optional[mindspore.Tensor] = None, +-+ position_ids: Optional[mindspore.Tensor] = None, +-+ past_key_value: Optional[Cache] = None, +-+ output_attentions: bool = False, +-+ use_cache: bool = False, +-+ **kwargs, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ if "padding_mask" in kwargs: +-+ warnings.warn( +-+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+ ) +-+ if output_attentions: +-+ warnings.warn( +-+ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-+ ) +-+ +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ if self.config.pretraining_tp > 1: +-+ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+ +-+ query_states = self.q_proj(hidden_states) +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+ # Apply Rotary Position Embedding +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos} +-+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-+ # So we must explicitly repeat the KV heads. +-+ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+ +-+ # Convert attention mask for flash_attention_score +-+ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-+ if attention_mask is not None: +-+ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+ raise ValueError( +-+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+ ) +-+ attn_mask_for_fa = attention_mask < 0 +-+ else: +-+ attn_mask_for_fa = None +-+ +-+ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+ +-+ # Call the fused operator using the efficient BNSD layout +-+ attn_output = mindspore.ops.flash_attention_score( +-+ query=query_states, +-+ key=key_states, +-+ value=value_states, +-+ head_num=self.num_heads, +-+ input_layout='BNSD', # Specify the correct layout +-+ attn_mask=attn_mask_for_fa, +-+ keep_prob=keep_prob, +-+ scalar_value=1.0 / math.sqrt(self.head_dim) +-+ ) +-+ +-+ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ +-+ # Apply output projection +-+ attn_output = self.o_proj(attn_output) +-+ +-+ # Flash attention does not return attention weights, so we return None. +-+ attn_weights = None +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +- Deepseek_ATTENTION_CLASSES = { +- "eager": DeepseekAttention, +-+ "flash-attention": DeepseekFlashAttention, +- } +- +- +-@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +- config=config, layer_idx=layer_idx +- ) +- +-+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-+ config=config, layer_idx=layer_idx +-+ ) +-+ +- self.mlp = ( +- DeepseekMoE(config) +- if ( +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index d4c6b651..bced285c 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +- +- import mindspore +- import mindnlp.core.nn.functional as F +--from mindnlp.core import nn, ops +-+from mindnlp.core import nn, ops, no_grad +- from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +- +- from ....common.activations import ACT2FN +-@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +- _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +- _CONFIG_FOR_DOC = "Qwen2MoeConfig" +- +-+Long_Prompt = False +-+PROMPT_LENGTH_THRESHOLD = 128 +- +- # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +- def _prepare_4d_causal_attention_mask_with_cache_position( +-@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +- return attn_output, attn_weights, past_key_value +- +- +-+# class Qwen2MoeFlashAttention(nn.Module): +-+# """ +-+# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+ +-+# 关键改动: +-+# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+# 直接传入原始的 key 和 value 张量效率更高。 +-+# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+# super().__init__() +-+# self.config = config +-+# self.layer_idx = layer_idx +-+# self.hidden_size = config.hidden_size +-+# self.num_heads = config.num_attention_heads +-+# self.head_dim = self.hidden_size // self.num_heads +-+# self.num_key_value_heads = config.num_key_value_heads +-+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+# self.max_position_embeddings = config.max_position_embeddings +-+# self.rope_theta = config.rope_theta +-+# self.attention_dropout = config.attention_dropout +-+ +-+# if (self.head_dim * self.num_heads) != self.hidden_size: +-+# raise ValueError( +-+# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+# ) +-+ +-+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+ +-+# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+# self.head_dim, +-+# max_position_embeddings=self.max_position_embeddings, +-+# base=self.rope_theta, +-+# ) +-+ +-+# def forward( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# attention_mask: Optional[mindspore.Tensor] = None, +-+# position_ids: Optional[mindspore.Tensor] = None, +-+# past_key_value: Optional[Cache] = None, +-+# output_attentions: bool = False, +-+# use_cache: bool = False, +-+# cache_position: Optional[mindspore.Tensor] = None, +-+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+# bsz, q_len, _ = hidden_states.shape +-+ +-+# # 1. 线性投射 Q, K, V +-+# query_states = self.q_proj(hidden_states) +-+# key_states = self.k_proj(hidden_states) +-+# value_states = self.v_proj(hidden_states) +-+ +-+# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+# # query: [B, S, H*D] -> [B, N1, S, D] +-+# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+# # 3. RoPE 旋转位置编码 +-+# kv_seq_len = key_states.shape[-2] +-+# if past_key_value is not None: +-+# if self.layer_idx is None: +-+# raise ValueError( +-+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+# "with a layer index." +-+# ) +-+# # 对于 StaticCache,需要特殊处理 kv_seq_len +-+# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+# if cache_position.shape[0] == 1: +-+# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+# kv_seq_len = past_seen_tokens + 1 +-+# else: +-+# # prefill 阶段:cache_position 是范围,使用其长度 +-+# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+# else: +-+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+# # 4. KV 缓存更新 +-+# if past_key_value is not None: +-+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+# key_states, value_states = past_key_value.update( +-+# key_states, value_states, self.layer_idx, cache_kwargs +-+# ) +-+ +-+# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+# if cache_position.shape[0] == 1: +-+# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+# kv_seq_len = key_states.shape[-2] +-+ +-+# # 5. [重要] 准备 Attention Mask +-+# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+# fa_attention_mask = None +-+# if attention_mask is not None: +-+# # 截取与当前key长度匹配的部分 +-+# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+# # 转换为布尔类型: 大负数 -> True, 0 -> False +-+# fa_attention_mask = (mask_slice != 0) +-+ +-+# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+# input_dtype = query_states.dtype +-+# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+# query_states = query_states.to(mindspore.float16) +-+# key_states = key_states.to(mindspore.float16) +-+# value_states = value_states.to(mindspore.float16) +-+ +-+# # 6. [核心] 调用 flash_attention_score 算子 +-+# # - 无需手动 repeat_kv, 算子原生支持 GQA +-+# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+# attn_output = mindspore.ops.flash_attention_score( +-+# query=query_states, +-+# key=key_states, +-+# value=value_states, +-+# head_num=self.num_heads, # 传入Q的头数(N1) +-+# attn_mask=fa_attention_mask, +-+# keep_prob=1.0 - self.attention_dropout, +-+# scalar_value=1.0 / math.sqrt(self.head_dim), +-+# input_layout="BNSD", +-+# sparse_mode=0 # 使用 defaultMask 模式 +-+# ) +-+ +-+# # 恢复原始数据类型 +-+# attn_output = attn_output.to(input_dtype) +-+ +-+# # 7. 调整输出形状 +-+# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+# attn_output = self.o_proj(attn_output) +-+ +-+# # FlashAttention 算子不直接返回注意力权重矩阵 +-+# attn_weights = None +-+# if output_attentions: +-+# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+# return attn_output, attn_weights, past_key_value +-+ +-+# # def forward( +-+# # self, +-+# # hidden_states: mindspore.Tensor, +-+# # attention_mask: Optional[mindspore.Tensor] = None, +-+# # position_ids: Optional[mindspore.Tensor] = None, +-+# # past_key_value: Optional[Cache] = None, +-+# # output_attentions: bool = False, +-+# # use_cache: bool = False, +-+# # cache_position: Optional[mindspore.Tensor] = None, +-+# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+# # bsz, q_len, _ = hidden_states.shape +-+ +-+# # # 1. 线性投射 Q, K, V +-+# # query_states = self.q_proj(hidden_states) +-+# # key_states = self.k_proj(hidden_states) +-+# # value_states = self.v_proj(hidden_states) +-+ +-+# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +-+# # # 3. RoPE 旋转位置编码 +-+# # kv_seq_len = key_states.shape[-2] +-+# # if past_key_value is not None: +-+# # if self.layer_idx is None: +-+# # raise ValueError( +-+# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+# # "with a layer index." +-+# # ) +-+# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+# # # 4. KV 缓存更新 +-+# # if past_key_value is not None: +-+# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+# # key_states, value_states = past_key_value.update( +-+# # key_states, value_states, self.layer_idx, cache_kwargs +-+# # ) +-+ +-+# # # 5. 准备 Attention Mask +-+# # fa_attention_mask = None +-+# # if attention_mask is not None: +-+# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+# # fa_attention_mask = (mask_slice != 0) +-+ +-+# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+# # input_dtype = query_states.dtype +-+ +-+# # # 6. [核心] 调用 flash_attention_score 算子 +-+# # attn_output = mindspore.ops.flash_attention_score( +-+# # query=query_states, +-+# # key=key_states, +-+# # value=value_states, +-+# # head_num=self.num_heads, +-+# # attn_mask=fa_attention_mask, +-+# # keep_prob=1.0 - self.attention_dropout, +-+# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+# # input_layout="BNSD", +-+# # sparse_mode=0, +-+# # # <--- 修改点 2: 启用内部高精度计算 --- +-+# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+# # inner_precise=1 +-+# # ) +-+ +-+# # # 恢复原始数据类型 +-+# # attn_output = attn_output.to(input_dtype) +-+ +-+# # # 7. 调整输出形状 +-+# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+# # attn_output = self.o_proj(attn_output) +-+ +-+# # attn_weights = None +-+# # if output_attentions: +-+# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ +-+# # return attn_output, attn_weights, past_key_value +-+ +-+ +- class Qwen2MoeFlashAttention(nn.Module): +- """ +-- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-- +-- 关键改动: +-- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-- 直接传入原始的 key 和 value 张量效率更高。 +-- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-+ +-+ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-+ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-+ 完全使用模型的低精度数据类型(如 float16)进行计算, +-+ 以达到理论上的最高执行速度。 +- """ +- def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +- super().__init__() +- self.config = config +- self.layer_idx = layer_idx +-+ if layer_idx is None: +-+ logger.warning_once( +-+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-+ ) +-+ +- self.hidden_size = config.hidden_size +- self.num_heads = config.num_attention_heads +- self.head_dim = self.hidden_size // self.num_heads +- self.num_key_value_heads = config.num_key_value_heads +-- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +- self.max_position_embeddings = config.max_position_embeddings +- self.rope_theta = config.rope_theta +- self.attention_dropout = config.attention_dropout +- +-- if (self.head_dim * self.num_heads) != self.hidden_size: +-- raise ValueError( +-- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-- ) +-- +- self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +- self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +- self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +- key_states = self.k_proj(hidden_states) +- value_states = self.v_proj(hidden_states) +- +-- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-- # query: [B, S, H*D] -> [B, N1, S, D] +-- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+ # 2. 调整形状以匹配 BNSD 布局 +- query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +- key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +- value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- +-- # 3. RoPE 旋转位置编码 +-+ +-+ # 3. RoPE 和 KV 缓存 +- kv_seq_len = key_states.shape[-2] +- if past_key_value is not None: +-- if self.layer_idx is None: +-- raise ValueError( +-- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-- "with a layer index." +-- ) +-- # 对于 StaticCache,需要特殊处理 kv_seq_len +-- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-- if cache_position.shape[0] == 1: +-- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-- kv_seq_len = past_seen_tokens + 1 +-- else: +-- # prefill 阶段:cache_position 是范围,使用其长度 +-- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-- else: +-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-- +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +-- # 4. KV 缓存更新 +- if past_key_value is not None: +- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-- key_states, value_states = past_key_value.update( +-- key_states, value_states, self.layer_idx, cache_kwargs +-- ) +-- +-- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-- if cache_position.shape[0] == 1: +-- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-- kv_seq_len = key_states.shape[-2] +-- +-- # 5. [重要] 准备 Attention Mask +-- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+ # 4. 准备 Attention Mask +- fa_attention_mask = None +- if attention_mask is not None: +-- # 截取与当前key长度匹配的部分 +-- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +- mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-- # 转换为布尔类型: 大负数 -> True, 0 -> False +- fa_attention_mask = (mask_slice != 0) +- +-- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-- input_dtype = query_states.dtype +-- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-- query_states = query_states.to(mindspore.float16) +-- key_states = key_states.to(mindspore.float16) +-- value_states = value_states.to(mindspore.float16) +-- +-- # 6. [核心] 调用 flash_attention_score 算子 +-- # - 无需手动 repeat_kv, 算子原生支持 GQA +-- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +- attn_output = mindspore.ops.flash_attention_score( +- query=query_states, +- key=key_states, +- value=value_states, +-- head_num=self.num_heads, # 传入Q的头数(N1) +-+ head_num=self.num_heads, +- attn_mask=fa_attention_mask, +-- keep_prob=1.0 - self.attention_dropout, +-+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +- scalar_value=1.0 / math.sqrt(self.head_dim), +- input_layout="BNSD", +-- sparse_mode=0 # 使用 defaultMask 模式 +-+ sparse_mode=0, +-+ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +- ) +- +-- # 恢复原始数据类型 +-- attn_output = attn_output.to(input_dtype) +-- +-- # 7. 调整输出形状 +-- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+ # 6. 调整输出形状 +- attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +- attn_output = self.o_proj(attn_output) +- +-- # FlashAttention 算子不直接返回注意力权重矩阵 +-+ # 7. 返回结果 +- attn_weights = None +- if output_attentions: +-- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +- +- return attn_output, attn_weights, past_key_value +- +-- # def forward( +-- # self, +-- # hidden_states: mindspore.Tensor, +-- # attention_mask: Optional[mindspore.Tensor] = None, +-- # position_ids: Optional[mindspore.Tensor] = None, +-- # past_key_value: Optional[Cache] = None, +-- # output_attentions: bool = False, +-- # use_cache: bool = False, +-- # cache_position: Optional[mindspore.Tensor] = None, +-- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-- +-- # bsz, q_len, _ = hidden_states.shape +-- +-- # # 1. 线性投射 Q, K, V +-- # query_states = self.q_proj(hidden_states) +-- # key_states = self.k_proj(hidden_states) +-- # value_states = self.v_proj(hidden_states) +-- +-- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- +-- # # 3. RoPE 旋转位置编码 +-- # kv_seq_len = key_states.shape[-2] +-- # if past_key_value is not None: +-- # if self.layer_idx is None: +-- # raise ValueError( +-- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-- # "with a layer index." +-- # ) +-- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +- +-- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-- +-- # # 4. KV 缓存更新 +-- # if past_key_value is not None: +-- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-- # key_states, value_states = past_key_value.update( +-- # key_states, value_states, self.layer_idx, cache_kwargs +-- # ) +-- +-- # # 5. 准备 Attention Mask +-- # fa_attention_mask = None +-- # if attention_mask is not None: +-- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-- # fa_attention_mask = (mask_slice != 0) +-- +-- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-- # input_dtype = query_states.dtype +-- +-- # # 6. [核心] 调用 flash_attention_score 算子 +-- # attn_output = mindspore.ops.flash_attention_score( +-- # query=query_states, +-- # key=key_states, +-- # value=value_states, +-- # head_num=self.num_heads, +-- # attn_mask=fa_attention_mask, +-- # keep_prob=1.0 - self.attention_dropout, +-- # scalar_value=1.0 / math.sqrt(self.head_dim), +-- # input_layout="BNSD", +-- # sparse_mode=0, +-- # # <--- 修改点 2: 启用内部高精度计算 --- +-- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-- # inner_precise=1 +-- # ) +-- +-- # # 恢复原始数据类型 +-- # attn_output = attn_output.to(input_dtype) +-+QWEN2MOE_ATTENTION_CLASSES = { +-+ "eager": Qwen2MoeAttention, +-+ "flash-attention": Qwen2MoeFlashAttention, +-+} +- +-- # # 7. 调整输出形状 +-- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-- # attn_output = self.o_proj(attn_output) +- +-- # attn_weights = None +-- # if output_attentions: +-- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# def __init__(self, config): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# # gating +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+ +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# #@dwj +-+# # 只遍历激活的专家,而非全部专家 +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# num_tokens = hidden_states_reshaped.shape[0] +-+ +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+# routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+# flat_selected_experts = selected_experts.flatten() +-+ +-+# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+# token_indices = broadcasted_token_indices.flatten() +-+ +-+# active_experts = ops.unique(flat_selected_experts) +-+ +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+ +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# selected_token_indices = token_indices[mask] +-+# selected_routing_weights = routing_weights.flatten()[mask] +-+ +-+# current_states = hidden_states_reshaped[selected_token_indices] +-+ +-+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+ +-+# final_hidden_states = final_hidden_states.index_add( +-+# dim=0, +-+# index=selected_token_indices, +-+# source=expert_output.to(hidden_states.dtype) +-+# ) +-+ +-+# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +- +-- # return attn_output, attn_weights, past_key_value +-+# final_hidden_states = final_hidden_states + shared_expert_output +-+# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +-+ +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# """ +-+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# # 门控网络 +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# # 专家列表 +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+# # 共享专家 +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# @no_grad() +-+# def _moe_infer_decode( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# """ +-+# 【解码路径】针对 sequence_length=1 的极致优化。 +-+# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+# """ +-+# batch_size, hidden_dim = hidden_states.shape +-+ +-+# expert_outputs_list = [ +-+# ops.cat([ +-+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+# ], dim=0) +-+# for i in range(batch_size) +-+# ] +-+ +-+# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+# # shape: (batch_size, top_k, hidden_dim) +-+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+ +-+# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+ +-+# return moe_output.squeeze(1) +-+ +-+# @no_grad() +-+# def _moe_infer_prefill( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# """ +-+# 【预填充路径】针对 sequence_length > 1 的优化。 +-+# 按专家对 Token 进行分组,并进行批处理。 +-+# """ +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens = hidden_states.shape[0] +-+# flat_selected_experts = selected_experts.flatten() +-+ +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+ +-+# active_experts = ops.unique(flat_selected_experts) +-+ +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+ +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# selected_token_indices = token_indices[mask] +-+# selected_routing_weights = routing_weights.flatten()[mask] +-+ +-+# current_states = hidden_states[selected_token_indices] +-+ +-+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+ +-+# moe_output = moe_output.index_add( +-+# dim=0, +-+# index=selected_token_indices, +-+# source=expert_output.to(hidden_states.dtype) +-+# ) +-+# return moe_output +-+ +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# """ +-+# 顶层 forward 方法,作为智能分发器。 +-+# """ +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-- # def forward( +-- # self, +-- # hidden_states: mindspore.Tensor, +-- # attention_mask: Optional[mindspore.Tensor] = None, +-- # position_ids: Optional[mindspore.Tensor] = None, +-- # past_key_value: Optional[Cache] = None, +-- # output_attentions: bool = False, +-- # use_cache: bool = False, +-- # cache_position: Optional[mindspore.Tensor] = None, +-- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-- +-- # bsz, q_len, _ = hidden_states.shape +-- +-- # query_states = self.q_proj(hidden_states) +-- # key_states = self.k_proj(hidden_states) +-- # value_states = self.v_proj(hidden_states) +-- +-- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- +-- # kv_seq_len = key_states.shape[-2] +-- # if past_key_value is not None: +-- # if self.layer_idx is None: +-- # raise ValueError("`layer_idx` must be specified for caching") +-- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-- +-- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-- +-- # if past_key_value is not None: +-- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-- # key_states, value_states = past_key_value.update( +-- # key_states, value_states, self.layer_idx, cache_kwargs +-- # ) +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+# routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+# moe_output = None +-+# # 在推理时,根据序列长度选择最优路径 +-+# if not self.training: +-+# if sequence_length == 1: +-+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+# else: +-+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+# else: +-+# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+# raise NotImplementedError("Training path is not implemented.") +-+ +-+# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+ +-+# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+ +-+# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +-+ +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# """ +-+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# # 门控网络 +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# # 专家列表 +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+# # 共享专家 +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# @no_grad() +-+# def _moe_infer_decode( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# batch_size, _ = hidden_states.shape +-+# expert_outputs_list = [ +-+# ops.cat([ +-+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+# ], dim=0) +-+# for i in range(batch_size) +-+# ] +-+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+# return moe_output.squeeze(1) +-+ +-+# @no_grad() +-+# def _moe_infer_prefill( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens = hidden_states.shape[0] +-+# flat_selected_experts = selected_experts.flatten() +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+# active_experts = ops.unique(flat_selected_experts) +-+ +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# selected_token_indices = token_indices[mask] +-+# selected_routing_weights = routing_weights.flatten()[mask] +-+# current_states = hidden_states[selected_token_indices] +-+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+# moe_output = moe_output.index_add( +-+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+# ) +-+# return moe_output +-+ +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# """ +-+# 顶层 forward 方法,作为智能分发器。 +-+# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+# """ +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ +-+# # 1. 门控计算 (通用逻辑) +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+# routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+# # 2. 智能分发到最优 MoE 路径 +-+# moe_output = None +-+# if not self.training: +-+# if sequence_length == 1: +-+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+# else: +-+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+# else: +-+# raise NotImplementedError("Training path is not implemented.") +-+ +-+# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+ +-+# # 4. 合并 MoE 输出和共享专家输出 +-+# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+ +-+# # 5. 恢复原始形状并返回 +-+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +-+# prefill fastest +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# """ +-+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# # 门控网络 +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# # 专家列表 +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+# # 共享专家 +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# @no_grad() +-+# def _moe_infer_dispatch( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# """ +-+# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+# """ +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens, _ = hidden_states.shape +-+ +-+# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+# flat_selected_experts = selected_experts.flatten() +-+# flat_routing_weights = routing_weights.flatten() +- +-- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-- +-- # # <--- 核心修改点: 手动进行高精度缩放 --- +-- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-- # query_states = query_states / math.sqrt(self.head_dim) +-- # # <--- 修改结束 --- +-- +-- # fa_attention_mask = None +-- # if attention_mask is not None: +-- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-- # fa_attention_mask = (mask_slice != 0) +-- +-- # input_dtype = query_states.dtype +-- +-- # attn_output = mindspore.ops.flash_attention_score( +-- # query=query_states, # 传入已经预先缩放过的 query +-- # key=key_states, +-- # value=value_states, +-- # head_num=self.num_heads, +-- # attn_mask=fa_attention_mask, +-- # keep_prob=1.0 - self.attention_dropout, +-- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-- # input_layout="BNSD", +-- # sparse_mode=0, +-- # inner_precise=1 # 仍然保持内部高精度计算 +-- # ) +-+# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +- +-- # attn_output = attn_output.to(input_dtype) +-- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-- # attn_output = self.o_proj(attn_output) +-+# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+# active_experts = ops.unique(flat_selected_experts) +-+ +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+ +-+# # 找到所有分配给该专家的 token +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+ +-+# # 使用 mask 选取对应的 token 和权重 +-+# current_token_indices = token_indices[mask] +-+# current_routing_weights = flat_routing_weights[mask] +-+# current_hidden_states = hidden_states[current_token_indices] +-+ +-+# # 对这些 token 进行批处理 +-+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+ +-+# # 使用 index_add 将结果精确地加回到对应位置 +-+# moe_output = moe_output.index_add( +-+# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+# ) +-+# return moe_output +-+ +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# """ +-+# 顶层 forward 方法,作为智能分发器。 +-+# """ +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ +-+# # 1. 门控计算 +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+# routing_weights = routing_weights.to(hidden_states.dtype) +-+ +-+# # 2. 调用统一的 MoE 计算内核 +-+# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +- +-- # attn_weights = None +-- # if output_attentions: +-- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+# # 3. 统一处理共享专家 +-+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+ +-+# # 4. 合并输出 +-+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+ +-+# # 5. 恢复原始形状并返回 +-+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +-+ +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# """ +-+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+# 【最终高性能与高精度版】: +-+# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+# 3. 这样实现了速度和准确性的两全其美。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# @no_grad() +-+# def _moe_infer_decode( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# """ +-+# 【解码路径】极致优化版:bmm + 高精度累加。 +-+# """ +-+# original_dtype = hidden_states.dtype +-+# batch_size, _ = hidden_states.shape +-+ +-+# expert_outputs_list = [ +-+# ops.cat([ +-+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+# ], dim=0) +-+# for i in range(batch_size) +-+# ] +-+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+ +-+# # 在 float32 下执行 bmm,得到高精度结果 +-+# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+ +-+# # 将高精度结果转换回原始数据类型 +-+# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+ +-+# return moe_output +-+ +-+# @no_grad() +-+# def _moe_infer_prefill( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# selected_experts: mindspore.Tensor, +-+# routing_weights: mindspore.Tensor +-+# ) -> mindspore.Tensor: +-+# """ +-+# 【预填充路径】与原始实现一致,结果精确。 +-+# """ +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens, _ = hidden_states.shape +-+# flat_selected_experts = selected_experts.flatten() +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+# active_experts = ops.unique(flat_selected_experts) +-+ +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# selected_token_indices = token_indices[mask] +-+# selected_routing_weights = routing_weights.flatten()[mask] +-+# current_states = hidden_states[selected_token_indices] +-+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+# moe_output = moe_output.index_add( +-+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+# ) +-+# return moe_output +-+ +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +- +-- # return attn_output, attn_weights, past_key_value +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+# # 如果模型主体是 float16,后续再转换 +-+ +-+# moe_output = None +-+# if not self.training: +-+# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+# # _moe_infer_decode 内部会处理好类型转换 +-+# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+# if sequence_length == 1: +-+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+# else: +-+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+# else: +-+# raise NotImplementedError("Training path is not implemented.") +-+ +-+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+ +-+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +- +--QWEN2MOE_ATTENTION_CLASSES = { +-- "eager": Qwen2MoeAttention, +-- "flash-attention": Qwen2MoeFlashAttention, +--} +-+# class Qwen2MoeSparseMoeBlock(nn.Module): +-+# """ +-+# 【融合版】一个混合专家模块,内置两种推理策略, +-+# 由外部全局变量 `Long_Prompt` 控制: +-+ +-+# - if Long_Prompt is True: 【精度优先模式】 +-+# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+# 适用于处理长序列,避免误差累积。 +-+ +-+# - if Long_Prompt is False: 【速度优先模式】 +-+# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+# """ +-+# def __init__(self, config: Qwen2MoeConfig): +-+# super().__init__() +-+# self.num_experts = config.num_experts +-+# self.top_k = config.num_experts_per_tok +-+# self.norm_topk_prob = config.norm_topk_prob +-+ +-+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+# self.experts = nn.ModuleList( +-+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+# ) +-+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+# # --- 速度优先模式的辅助函数 --- +-+# @no_grad() +-+# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+# original_dtype = hidden_states.dtype +-+# batch_size, _ = hidden_states.shape +-+# expert_outputs_list = [ +-+# ops.cat([ +-+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+# ], dim=0) +-+# for i in range(batch_size) +-+# ] +-+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+# weights_fp32 = routing_weights.to(mindspore.float32) +-+# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+# return moe_output_fp32.squeeze(1).to(original_dtype) +-+ +-+# @no_grad() +-+# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens, _ = hidden_states.shape +-+# flat_selected_experts = selected_experts.flatten() +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+# active_experts = ops.unique(flat_selected_experts) +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# selected_token_indices = token_indices[mask] +-+# selected_routing_weights = routing_weights.flatten()[mask] +-+# current_states = hidden_states[selected_token_indices] +-+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+# return moe_output +-+ +-+# # --- 精度优先模式的辅助函数 --- +-+# @no_grad() +-+# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+# moe_output = ops.zeros_like(hidden_states) +-+# num_tokens, _ = hidden_states.shape +-+# flat_selected_experts = selected_experts.flatten() +-+# flat_routing_weights = routing_weights.flatten() +-+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+# active_experts = ops.unique(flat_selected_experts) +-+# for expert_idx_tensor in active_experts: +-+# expert_idx = expert_idx_tensor.item() +-+# expert_layer = self.experts[expert_idx] +-+# mask = (flat_selected_experts == expert_idx_tensor) +-+# current_token_indices = token_indices[mask] +-+# current_routing_weights = flat_routing_weights[mask] +-+# current_hidden_states = hidden_states[current_token_indices] +-+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+# return moe_output +-+ +-+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+# # 声明我们将要使用一个在模块外部定义的全局变量 +-+# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+# global Long_Prompt +-+ +-+# # 1. 门控计算 (所有模式通用) +-+# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+# router_logits = self.gate(hidden_states_reshaped) +-+# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+# if self.norm_topk_prob: +-+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+# moe_output = None +-+# if not self.training: +-+# # 根据 Long_Prompt 标志选择模式 +-+# if Long_Prompt: +-+# # --- 精度优先模式 --- +-+# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+# else: +-+# # --- 速度优先模式 --- +-+# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+# if sequence_length == 1: +-+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+# else: +-+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+# else: +-+# raise NotImplementedError("Training path is not implemented.") +-+ +-+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+ +-+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+ +-+# return final_hidden_states, router_logits +-+ +-+class Qwen2MoeSparseMoeBlock(nn.Module): +-+ """ +-+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+ 控制的顶级推理策略: +- +-+ - if Long_Prompt is True: 【精度优先模式】 +-+ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-+ 适用于需要严格可复现性的长序列任务。 +- +--class Qwen2MoeSparseMoeBlock(nn.Module): +-- def __init__(self, config): +-+ - if Long_Prompt is False: 【速度优先模式】 +-+ 采用业界最强的性能组合: +-+ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-+ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-+ """ +-+ def __init__(self, config: Qwen2MoeConfig): +- super().__init__() +- self.num_experts = config.num_experts +- self.top_k = config.num_experts_per_tok +- self.norm_topk_prob = config.norm_topk_prob +- +-- # gating +- self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +- self.experts = nn.ModuleList( +- [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +- ) +-- +- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +- +-- #@dwj +-- # 只遍历激活的专家,而非全部专家 +-- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-- num_tokens = hidden_states_reshaped.shape[0] +-- +-- router_logits = self.gate(hidden_states_reshaped) +-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- +-- if self.norm_topk_prob: +-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- routing_weights = routing_weights.to(hidden_states.dtype) +-- +-- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-- flat_selected_experts = selected_experts.flatten() +-- +-- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-- token_indices = broadcasted_token_indices.flatten() +-- +-- active_experts = ops.unique(flat_selected_experts) +-- +-- for expert_idx_tensor in active_experts: +-- expert_idx = expert_idx_tensor.item() +-- expert_layer = self.experts[expert_idx] +-- +-- mask = (flat_selected_experts == expert_idx_tensor) +-- selected_token_indices = token_indices[mask] +-- selected_routing_weights = routing_weights.flatten()[mask] +-- +-- current_states = hidden_states_reshaped[selected_token_indices] +-- +-- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-- +-- final_hidden_states = final_hidden_states.index_add( +-- dim=0, +-- index=selected_token_indices, +-- source=expert_output.to(hidden_states.dtype) +-- ) +-- +-- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-+ @no_grad() +-+ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ original_dtype = hidden_states.dtype +-+ batch_size, _ = hidden_states.shape +-+ expert_outputs_list = [ +-+ ops.cat([ +-+ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+ ], dim=0) +-+ for i in range(batch_size) +-+ ] +-+ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+ weights_fp32 = routing_weights.to(mindspore.float32) +-+ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+ return moe_output_fp32.squeeze(1).to(original_dtype) +-+ +-+ @no_grad() +-+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ num_tokens, _ = hidden_states.shape +-+ flat_selected_experts = selected_experts.flatten() +-+ sorted_expert_indices = flat_selected_experts.argsort() +-+ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+ original_token_indices = sorted_expert_indices // self.top_k +-+ moe_output = ops.zeros_like(hidden_states) +-+ current_token_offset = 0 +-+ for i in range(self.num_experts): +-+ expert_token_count = tokens_per_expert[i] - current_token_offset +-+ if expert_token_count == 0: +-+ continue +-+ end_offset = current_token_offset + expert_token_count +-+ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+ expert_hidden_states = hidden_states[expert_original_token_indices] +-+ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+ current_token_offset += expert_token_count +-+ return moe_output +-+ +-+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+ @no_grad() +-+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ moe_output = ops.zeros_like(hidden_states) +-+ num_tokens, _ = hidden_states.shape +-+ flat_selected_experts = selected_experts.flatten() +-+ flat_routing_weights = routing_weights.flatten() +-+ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+ active_experts = ops.unique(flat_selected_experts) +-+ for expert_idx_tensor in active_experts: +-+ expert_idx = expert_idx_tensor.item() +-+ expert_layer = self.experts[expert_idx] +-+ mask = (flat_selected_experts == expert_idx_tensor) +-+ current_token_indices = token_indices[mask] +-+ current_routing_weights = flat_routing_weights[mask] +-+ current_hidden_states = hidden_states[current_token_indices] +-+ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+ return moe_output +- +-- final_hidden_states = final_hidden_states + shared_expert_output +-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-- +-- return final_hidden_states, router_logits +-+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+ global Long_Prompt +-+ +-+ # 1. 门控计算 (所有模式通用) +-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+ router_logits = self.gate(hidden_states_reshaped) +-+ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+ if self.norm_topk_prob: +-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+ moe_output = None +-+ if Long_Prompt: +-+ # --- 精度优先模式 (ACCURACY MODE) --- +-+ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ else: +-+ # --- 速度优先模式 (SPEED MODE) --- +-+ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+ if sequence_length == 1: +-+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ else: +-+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ +- +-+ # 3. 共享专家计算与合并 (所有模式通用) +-+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+ +-+ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+ +-+ return final_hidden_states, router_logits +- +- class Qwen2MoeDecoderLayer(nn.Module): +- def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +- super().__init__() +- self.hidden_size = config.hidden_size +-+ +-+ # if Long_Prompt: +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ # else: +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +- +- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- +-- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-- +- if (layer_idx not in config.mlp_only_layers) and ( +- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +- ): +-@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- self._warmed_up = True +- self.warmup_moe_model() +- +-+ +-+ +- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +- output_router_logits = ( +- output_router_logits if output_router_logits is not None else self.config.output_router_logits +-@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- router_logits=outputs.router_logits, +- ) +- +-+ def generate(self, *args, **kwargs): +-+ """ +-+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+ """ +-+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+ +-+ input_ids = kwargs.get("input_ids") +-+ if input_ids is None and args: +-+ input_ids = args[0] +-+ +-+ if input_ids is not None: +-+ prompt_length = input_ids.shape[1] +-+ +-+ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+ Long_Prompt = True +-+ else: +-+ Long_Prompt = False +-+ +-+ return super().generate(*args, **kwargs) +-+ +- # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +- def prepare_inputs_for_generation( +- self, +-@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +- # Exception 1: when passing input_embeds, input_ids may be missing entries +- # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-+ +- if past_key_values is not None: +- if inputs_embeds is not None: # Exception 1 +- if 0 not in input_ids.shape: +-@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- } +- ) +- return model_inputs +-+ +- # @lwx +- # def _decode_one_tokens_logits( +- # self, +-@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +- attentions=outputs.attentions, +- ) +- +-+ +- __all__ = [ +- "Qwen2MoeForCausalLM", +- "Qwen2MoeModel", +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-new file mode 100644 +-index 00000000..6dfb5b93 +---- /dev/null +-+++ b/patches/0001-20251104commit.patch +-@@ -0,0 +1,1272 @@ +-+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+Subject: [PATCH] 20251104commit +-+ +-+--- +-+ mindnlp/transformers/cache_utils.py | 28 +- +-+ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+ 3 files changed, 976 insertions(+), 87 deletions(-) +-+ +-+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+index cadd2e04..02f8d4be 100644 +-+--- a/mindnlp/transformers/cache_utils.py +-++++ b/mindnlp/transformers/cache_utils.py +-+@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+ # k_out[:, :, cache_position] = key_states +-+ # v_out[:, :, cache_position] = value_states +-+- if ON_ORANGE_PI: +-+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+- else: +-+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+- +-++ # if ON_ORANGE_PI: +-++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++ # else: +-++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++ if cache_position.ndim > 1: +-++ cache_position = cache_position.flatten() +-++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++ cache_position = cache_position.int() +-++ +-++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++ k_out[:, :, cache_position] = key_states +-++ v_out[:, :, cache_position] = value_states +-++ +-+ return k_out, v_out +-+ +-+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index c695b944..d8303e45 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+- x1 = x[..., : x.shape[-1] // 2] +-+- x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++ # x1 = x[..., : x.shape[-1] // 2] +-++ # x2 = x[..., x.shape[-1] // 2 :] +-++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+ if self.training: +-+ raise NotImplementedError("Training is not supported yet.") +-+ else: +-+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+- if self.config.n_shared_experts is not None: +-+- y = y + self.shared_experts(identity) +-+- return y +-++ # @lwx +-++ if orig_shape[1] == 1: +-++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++ y=y.view(*orig_shape) +-++ if self.config.n_shared_experts is not None: +-++ y = y + self.shared_experts(identity) +-++ return y +-++ else: +-++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++ if self.config.n_shared_experts is not None: +-++ y = y + self.shared_experts(identity) +-++ return y +-++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++ # if self.config.n_shared_experts is not None: +-++ # y = y + self.shared_experts(identity) +-++ # return y +-++ +-++ @no_grad() +-++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ +-++ expert_cache = ops.zeros_like(x) +-++ for i in range(self.num_experts_per_tok): +-++ expert_id = flat_expert_indices[i].item() +-++ weight = flat_expert_weights[i].item() +-++ expert = self.experts[expert_id] +-++ expert_out = expert(x) +-++ expert_cache += expert_out * weight +-++ return expert_cache +-+ +-+ @no_grad() +-+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+- # expert_cache = torch.zeros_like(x) +-+- # idxs = flat_expert_indices.argsort() +-+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+- # token_idxs = idxs // self.num_experts_per_tok +-+- # for i, end_idx in enumerate(tokens_per_expert): +-+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+- # if start_idx == end_idx: +-+- # continue +-+- # expert = self.experts[i] +-+- # exp_token_idx = token_idxs[start_idx:end_idx] +-+- # expert_tokens = x[exp_token_idx] +-+- # expert_out = expert(expert_tokens) +-+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+- # return expert_cache +-++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ expert_cache = ops.zeros_like(x) +-+ idxs = flat_expert_indices.argsort() +-+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ token_idxs = idxs // self.num_experts_per_tok +-++ +-+ for i, end_idx in enumerate(tokens_per_expert): +-+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ if start_idx == end_idx: +-+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+ expert_out = expert(expert_tokens) +-+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-+ return expert_cache +-++ +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # # expert_cache = torch.zeros_like(x) +-++ # # idxs = flat_expert_indices.argsort() +-++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++ # # token_idxs = idxs // self.num_experts_per_tok +-++ # # for i, end_idx in enumerate(tokens_per_expert): +-++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++ # # if start_idx == end_idx: +-++ # # continue +-++ # # expert = self.experts[i] +-++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # # expert_tokens = x[exp_token_idx] +-++ # # expert_out = expert(expert_tokens) +-++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++ # # return expert_cache +-++ # expert_cache = ops.zeros_like(x) +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # for i, end_idx in enumerate(tokens_per_expert): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # if start_idx == end_idx: +-++ # continue +-++ # expert = self.experts[i] +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = expert(expert_tokens) +-++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-++ # return expert_cache +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # expert_cache = ops.zeros_like(x) +-++ +-++ # # 排序保证顺序一致 +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # # 找出有 token 的专家 +-++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++ +-++ # for i in active_experts.tolist(): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # end_idx = tokens_per_expert[i] +-++ # if start_idx == end_idx: # 没有 token +-++ # continue +-++ +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = self.experts[i](expert_tokens) +-++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++ +-++ # expert_cache = mindspore.mint.scatter_add( +-++ # expert_cache, +-++ # 0, +-++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++ # expert_out +-++ # ) +-++ +-++ # return expert_cache +-++ +-++ +-+ +-+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+ # """ +-+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ +-+ # Initialize weights and apply final processing +-+ self.post_init() +-++ self.warm_up = False +-++ +-++ def warmup_moe_model_deep(self): +-++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++ test_texts = [ +-++ "warmup short", +-++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++ ] +-++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++ if tokenizer is None: +-++ from mindnlp.transformers import AutoTokenizer +-++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++ self._warmup_tokenizer = tokenizer +-++ +-++ for text in test_texts: +-++ inputs = tokenizer(text, return_tensors="ms") +-++ with mindspore._no_grad(): +-++ _ = self(**inputs, use_cache=False) +-++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+ +-+ def get_input_embeddings(self): +-+ return self.model.embed_tokens +-+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+ ```""" +-++ if not self.warm_up: +-++ self.warm_up = True +-++ self.warmup_moe_model_deep() +-++ +-+ output_attentions = ( +-+ output_attentions +-+ if output_attentions is not None +-+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+index 3cbf820e..d4c6b651 100644 +-+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+@@ -18,7 +18,6 @@ +-+ # See the License for the specific language governing permissions and +-+ # limitations under the License. +-+ """MindSpore Qwen2MoE model.""" +-+- +-+ import math +-+ from typing import List, Optional, Tuple, Union +-+ +-+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+ TokenClassifierOutput, +-+ ) +-+ from ...modeling_utils import PreTrainedModel +-++from ...generation import GenerationMixin +-+ from ....utils import logging +-+ from .configuration_qwen2_moe import Qwen2MoeConfig +-+ +-+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+ self.variance_epsilon = eps +-+ +-+ def forward(self, hidden_states): +-++ # @dwj +-++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++ # @lwx +-++ # if not self.training : +-++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+ input_dtype = hidden_states.dtype +-+ hidden_states = hidden_states.to(mindspore.float32) +-+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+@@ -234,6 +239,8 @@ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+ x1 = x[..., : x.shape[-1] // 2] +-+ x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+ self.config = config +-+ self.hidden_size = config.hidden_size +-+ self.intermediate_size = intermediate_size +-++ +-+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+ self.act_fn = ACT2FN[config.hidden_act] +-+ +-+ def forward(self, x): +-+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+- +-+ +-++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++ # @lwx +-++ # gate_up_output = self.gate_up_proj(x) +-++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++ # return self.down_proj(swiglu_output) +-++ +-++ # def forward(self, x): +-++ # gate_proj_out = self.gate_proj(x) +-++ # up_proj_out = self.up_proj(x) +-++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++ # return self.down_proj(swiglu_out) +-++ +-+ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+ """ +-+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+ use_cache: bool = False, +-+ cache_position: Optional[mindspore.Tensor] = None, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ +-++ +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ query_states = self.q_proj(hidden_states) +-+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ "with a layer index." +-+ ) +-+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ if isinstance(past_key_value, StaticCache): +-++ kv_seq_len = key_states.shape[-2] +-++ else: +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++ if isinstance(past_key_value, StaticCache): +-++ kv_seq_len = key_states.shape[-2] +-+ +-+ # repeat k/v heads if n_kv_heads < n_heads +-+ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+- +-++ +-+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+ +-+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+- raise ValueError( +-+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+- f" {attn_weights.shape}" +-+- ) +-+- +-+- if attention_mask is not None: # no matter the length, we just slice it +-+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++ if attention_mask is not None: +-++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+ attn_weights = attn_weights + causal_mask +-+ +-+ # upcast attention to fp32 +-+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+ +-+ attn_output = self.o_proj(attn_output) +-+- +-++ # @lwx +-++ +-++ # max_seq_len = self.max_position_embeddings # 2048 +-++ +-++ # if attention_mask is not None: +-++ # # attention_mask: [B, 1, Sq, Sk] +-++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++ +-++ # # pad 到 [max_seq_len, max_seq_len] +-++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++ # global_attention_mask = padded_mask +-++ # else: +-++ # global_attention_mask = None +-++ +-++ +-++ # sparse_mode=3 +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, +-++ # key=key_states, +-++ # value=value_states, +-++ # real_shift=None, +-++ # padding_mask=None, +-++ +-++ # head_num=self.num_heads, +-++ # attn_mask=global_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++ # input_layout="BNSD", +-++ # pre_tokens=2147483647, +-++ # next_tokens=2147483647, +-++ # inner_precise=0, +-++ # drop_mask=None, +-++ # prefix=None, +-++ # actual_seq_qlen=None, +-++ # actual_seq_kvlen=None, +-++ # sparse_mode=sparse_mode, +-++ # ) +-+ if not output_attentions: +-+ attn_weights = None +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+ +-++class Qwen2MoeFlashAttention(nn.Module): +-++ """ +-++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++ +-++ 关键改动: +-++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++ 直接传入原始的 key 和 value 张量效率更高。 +-++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++ """ +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++ super().__init__() +-++ self.config = config +-++ self.layer_idx = layer_idx +-++ self.hidden_size = config.hidden_size +-++ self.num_heads = config.num_attention_heads +-++ self.head_dim = self.hidden_size // self.num_heads +-++ self.num_key_value_heads = config.num_key_value_heads +-++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++ self.max_position_embeddings = config.max_position_embeddings +-++ self.rope_theta = config.rope_theta +-++ self.attention_dropout = config.attention_dropout +-++ +-++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++ raise ValueError( +-++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++ ) +-++ +-++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++ +-++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++ self.head_dim, +-++ max_position_embeddings=self.max_position_embeddings, +-++ base=self.rope_theta, +-++ ) +-++ +-++ def forward( +-++ self, +-++ hidden_states: mindspore.Tensor, +-++ attention_mask: Optional[mindspore.Tensor] = None, +-++ position_ids: Optional[mindspore.Tensor] = None, +-++ past_key_value: Optional[Cache] = None, +-++ output_attentions: bool = False, +-++ use_cache: bool = False, +-++ cache_position: Optional[mindspore.Tensor] = None, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ # 1. 线性投射 Q, K, V +-++ query_states = self.q_proj(hidden_states) +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++ # query: [B, S, H*D] -> [B, N1, S, D] +-++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # 3. RoPE 旋转位置编码 +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++ if self.layer_idx is None: +-++ raise ValueError( +-++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ "with a layer index." +-++ ) +-++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++ if cache_position.shape[0] == 1: +-++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++ kv_seq_len = past_seen_tokens + 1 +-++ else: +-++ # prefill 阶段:cache_position 是范围,使用其长度 +-++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++ else: +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # 4. KV 缓存更新 +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ key_states, value_states = past_key_value.update( +-++ key_states, value_states, self.layer_idx, cache_kwargs +-++ ) +-++ +-++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++ if cache_position.shape[0] == 1: +-++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++ kv_seq_len = key_states.shape[-2] +-++ +-++ # 5. [重要] 准备 Attention Mask +-++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++ fa_attention_mask = None +-++ if attention_mask is not None: +-++ # 截取与当前key长度匹配的部分 +-++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++ fa_attention_mask = (mask_slice != 0) +-++ +-++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++ input_dtype = query_states.dtype +-++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++ query_states = query_states.to(mindspore.float16) +-++ key_states = key_states.to(mindspore.float16) +-++ value_states = value_states.to(mindspore.float16) +-++ +-++ # 6. [核心] 调用 flash_attention_score 算子 +-++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++ attn_output = mindspore.ops.flash_attention_score( +-++ query=query_states, +-++ key=key_states, +-++ value=value_states, +-++ head_num=self.num_heads, # 传入Q的头数(N1) +-++ attn_mask=fa_attention_mask, +-++ keep_prob=1.0 - self.attention_dropout, +-++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++ input_layout="BNSD", +-++ sparse_mode=0 # 使用 defaultMask 模式 +-++ ) +-++ +-++ # 恢复原始数据类型 +-++ attn_output = attn_output.to(input_dtype) +-++ +-++ # 7. 调整输出形状 +-++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ attn_output = self.o_proj(attn_output) +-++ +-++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++ attn_weights = None +-++ if output_attentions: +-++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++ # def forward( +-++ # self, +-++ # hidden_states: mindspore.Tensor, +-++ # attention_mask: Optional[mindspore.Tensor] = None, +-++ # position_ids: Optional[mindspore.Tensor] = None, +-++ # past_key_value: Optional[Cache] = None, +-++ # output_attentions: bool = False, +-++ # use_cache: bool = False, +-++ # cache_position: Optional[mindspore.Tensor] = None, +-++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ # bsz, q_len, _ = hidden_states.shape +-++ +-++ # # 1. 线性投射 Q, K, V +-++ # query_states = self.q_proj(hidden_states) +-++ # key_states = self.k_proj(hidden_states) +-++ # value_states = self.v_proj(hidden_states) +-++ +-++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # # 3. RoPE 旋转位置编码 +-++ # kv_seq_len = key_states.shape[-2] +-++ # if past_key_value is not None: +-++ # if self.layer_idx is None: +-++ # raise ValueError( +-++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ # "with a layer index." +-++ # ) +-++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # # 4. KV 缓存更新 +-++ # if past_key_value is not None: +-++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ # key_states, value_states = past_key_value.update( +-++ # key_states, value_states, self.layer_idx, cache_kwargs +-++ # ) +-++ +-++ # # 5. 准备 Attention Mask +-++ # fa_attention_mask = None +-++ # if attention_mask is not None: +-++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # fa_attention_mask = (mask_slice != 0) +-++ +-++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++ # input_dtype = query_states.dtype +-++ +-++ # # 6. [核心] 调用 flash_attention_score 算子 +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, +-++ # key=key_states, +-++ # value=value_states, +-++ # head_num=self.num_heads, +-++ # attn_mask=fa_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++ # input_layout="BNSD", +-++ # sparse_mode=0, +-++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++ # inner_precise=1 +-++ # ) +-++ +-++ # # 恢复原始数据类型 +-++ # attn_output = attn_output.to(input_dtype) +-++ +-++ # # 7. 调整输出形状 +-++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ # attn_output = self.o_proj(attn_output) +-++ +-++ # attn_weights = None +-++ # if output_attentions: +-++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++ # return attn_output, attn_weights, past_key_value +-++ +-++ # def forward( +-++ # self, +-++ # hidden_states: mindspore.Tensor, +-++ # attention_mask: Optional[mindspore.Tensor] = None, +-++ # position_ids: Optional[mindspore.Tensor] = None, +-++ # past_key_value: Optional[Cache] = None, +-++ # output_attentions: bool = False, +-++ # use_cache: bool = False, +-++ # cache_position: Optional[mindspore.Tensor] = None, +-++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ # bsz, q_len, _ = hidden_states.shape +-++ +-++ # query_states = self.q_proj(hidden_states) +-++ # key_states = self.k_proj(hidden_states) +-++ # value_states = self.v_proj(hidden_states) +-++ +-++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # kv_seq_len = key_states.shape[-2] +-++ # if past_key_value is not None: +-++ # if self.layer_idx is None: +-++ # raise ValueError("`layer_idx` must be specified for caching") +-++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # if past_key_value is not None: +-++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ # key_states, value_states = past_key_value.update( +-++ # key_states, value_states, self.layer_idx, cache_kwargs +-++ # ) +-++ +-++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++ +-++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++ # query_states = query_states / math.sqrt(self.head_dim) +-++ # # <--- 修改结束 --- +-++ +-++ # fa_attention_mask = None +-++ # if attention_mask is not None: +-++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # fa_attention_mask = (mask_slice != 0) +-++ +-++ # input_dtype = query_states.dtype +-++ +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, # 传入已经预先缩放过的 query +-++ # key=key_states, +-++ # value=value_states, +-++ # head_num=self.num_heads, +-++ # attn_mask=fa_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++ # input_layout="BNSD", +-++ # sparse_mode=0, +-++ # inner_precise=1 # 仍然保持内部高精度计算 +-++ # ) +-++ +-++ # attn_output = attn_output.to(input_dtype) +-++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ # attn_output = self.o_proj(attn_output) +-++ +-++ # attn_weights = None +-++ # if output_attentions: +-++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++ +-++ # return attn_output, attn_weights, past_key_value +-++ +-+ QWEN2MOE_ATTENTION_CLASSES = { +-+ "eager": Qwen2MoeAttention, +-++ "flash-attention": Qwen2MoeFlashAttention, +-+ } +-+ +-+ +-+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-++ #@dwj +-++ # 只遍历激活的专家,而非全部专家 +-+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- hidden_states = hidden_states.view(-1, hidden_dim) +-+- # router_logits: (batch * sequence_length, n_experts) +-+- router_logits = self.gate(hidden_states) +-+- +-+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- if self.norm_topk_prob: +-+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- # we cast back to the input dtype +-+- routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+- final_hidden_states = ops.zeros( +-+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+- ) +-+- +-+- # One hot encode the selected experts to create an expert mask +-+- # this will be used to easily index which expert is going to be sollicitated +-+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+- +-+- # Loop over all available experts in the model and perform the computation on each expert +-+- for expert_idx in range(self.num_experts): +-+- expert_layer = self.experts[expert_idx] +-+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+- +-+- # Index the correct hidden states and compute the expert hidden state for +-+- # the current expert. We need to make sure to multiply the output hidden +-+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+- if 0 not in idx.shape: +-+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+- +-+- # However `index_add_` only support torch tensors for indexing so we'll use +-+- # the `top_x` tensor here. +-+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+- +-+- shared_expert_output = self.shared_expert(hidden_states) +-+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+- +-+- final_hidden_states = final_hidden_states + shared_expert_output +-++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++ num_tokens = hidden_states_reshaped.shape[0] +-++ +-++ router_logits = self.gate(hidden_states_reshaped) +-++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++ if self.norm_topk_prob: +-++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++ flat_selected_experts = selected_experts.flatten() +-++ +-++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++ token_indices = broadcasted_token_indices.flatten() +-++ +-++ active_experts = ops.unique(flat_selected_experts) +-++ +-++ for expert_idx_tensor in active_experts: +-++ expert_idx = expert_idx_tensor.item() +-++ expert_layer = self.experts[expert_idx] +-++ +-++ mask = (flat_selected_experts == expert_idx_tensor) +-++ selected_token_indices = token_indices[mask] +-++ selected_routing_weights = routing_weights.flatten()[mask] +-++ +-++ current_states = hidden_states_reshaped[selected_token_indices] +-++ +-++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++ +-++ final_hidden_states = final_hidden_states.index_add( +-++ dim=0, +-++ index=selected_token_indices, +-++ source=expert_output.to(hidden_states.dtype) +-++ ) +-++ +-++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+ +-+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+- return final_hidden_states, router_logits +-++ final_hidden_states = final_hidden_states + shared_expert_output +-++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++ +-++ return final_hidden_states, router_logits +-+ +-+ +-+ class Qwen2MoeDecoderLayer(nn.Module): +-+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+ +-+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++ +-+ if (layer_idx not in config.mlp_only_layers) and ( +-+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+ ): +-+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+ _skip_keys_device_placement = "past_key_values" +-+ _supports_cache_class = True +-++#lwx +-++ # _supports_static_cache = True +-+ +-+ def _init_weights(self, module): +-+ std = self.config.initializer_range +-+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+ return causal_mask +-+ +-+ +-+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ _tied_weights_keys = ["lm_head.weight"] +-+ +-+ def __init__(self, config): +-+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ self.num_experts_per_tok = config.num_experts_per_tok +-+ # Initialize weights and apply final processing +-+ self.post_init() +-++ # @lwx +-++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++ # self.generation_config.cache_implementation = "static" +-++ self._warmed_up = False +-++ +-++ def warmup_moe_model(self): +-++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++ test_texts = [ +-++ "warmup short", +-++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++ ] +-++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++ if tokenizer is None: +-++ from mindnlp.transformers import AutoTokenizer +-++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++ self._warmup_tokenizer = tokenizer +-++ +-++ for text in test_texts: +-++ inputs = tokenizer(text, return_tensors="ms") +-++ with mindspore._no_grad(): +-++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+ +-+ def get_input_embeddings(self): +-+ return self.model.embed_tokens +-+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+ ```""" +-++ if not self._warmed_up: +-++ self._warmed_up = True +-++ self.warmup_moe_model() +-+ +-+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+ output_router_logits = ( +-+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ } +-+ ) +-+ return model_inputs +-++# @lwx +-++ # def _decode_one_tokens_logits( +-++ # self, +-++ # cur_token: mindspore.Tensor, +-++ # input_pos: Optional[mindspore.Tensor], +-++ # cache_position: mindspore.Tensor, +-++ # past_key_values: StaticCache, +-++ # ) -> mindspore.Tensor: +-++ # """ +-++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++ +-++ # Args: +-++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++ # input_pos: 输入位置信息,可选 +-++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++ +-++ # Returns: +-++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++ # """ +-++ # # 调用JIT编译的版本 +-++ # return self.get_decode_one_tokens_logits( +-++ # cur_token=cur_token, +-++ # input_pos=input_pos, +-++ # cache_position=cache_position, +-++ # past_key_values=past_key_values, +-++ # ) +-++ +-++ # @mindspore.jit(jit_level='O1') +-++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++ # """ +-++ # JIT编译的函数,用于高效的单token解码 +-++ # 使用JIT编译优化以支持静态shape和高效执行 +-++ +-++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++ # """ +-++ # outputs = self.model.forward( +-++ # input_ids=cur_token, +-++ # position_ids=input_pos, +-++ # cache_position=cache_position, +-++ # past_key_values=past_key_values, +-++ # use_cache=True, +-++ # return_dict=False, +-++ # ) +-++ +-++ # hidden_states = outputs[0] +-++ # logits = self.lm_head.forward(hidden_states) +-++ # logits = logits.float() +-++ +-++ # return logits[:, -1, :] +-++ +-++ # def _sample( +-++ # self, +-++ # input_ids: mindspore.Tensor, +-++ # logits_processor, +-++ # stopping_criteria, +-++ # generation_config, +-++ # synced_devices: bool, +-++ # streamer=None, +-++ # logits_warper=None, +-++ # **model_kwargs, +-++ # ): +-++ # """ +-++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++ # """ +-++ # from ...generation.logits_process import LogitsProcessorList +-++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++ # from mindnlp.core import nn, ops, no_grad +-++ # import numpy as np +-++ +-++ # # 检查是否使用 StaticCache +-++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++ # # 否则,直接调用父类方法 +-++ # past_key_values = model_kwargs.get("past_key_values") +-++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++ +-++ # if not isinstance(past_key_values, StaticCache): +-++ # # 不使用 StaticCache,直接调用父类方法 +-++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++ # return super()._sample( +-++ # input_ids=input_ids, +-++ # logits_processor=logits_processor, +-++ # stopping_criteria=stopping_criteria, +-++ # generation_config=generation_config, +-++ # synced_devices=synced_devices, +-++ # streamer=streamer, +-++ # logits_warper=logits_warper, +-++ # **model_kwargs, +-++ # ) +-++ +-++ # # 使用 StaticCache,进入自定义循环 +-++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++ # pad_token_id = generation_config._pad_token_tensor +-++ # output_attentions = generation_config.output_attentions +-++ # output_hidden_states = generation_config.output_hidden_states +-++ # output_scores = generation_config.output_scores +-++ # output_logits = generation_config.output_logits +-++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++ # max_length = generation_config.max_length +-++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++ # do_sample = generation_config.do_sample +-++ +-++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++ # raise ValueError( +-++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++ # f"{logits_warper})." +-++ # ) +-++ +-++ # # init attention / hidden states / scores tuples +-++ # scores = () if (return_dict_in_generate and output_scores) else None +-++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++ +-++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++ # encoder_hidden_states = ( +-++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++ # ) +-++ +-++ # # keep track of which sequences are already finished +-++ # batch_size, cur_len = input_ids.shape +-++ # this_peer_finished = False +-++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++ +-++ # time_record = [] +-++ # from ....utils.testing_utils import parse_flag_from_env +-++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++ +-++ # while self._has_unfinished_sequences( +-++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++ # ): +-++ # if _record_time: +-++ # import time as time_module +-++ # infer_start = time_module.time() +-++ +-++ # # prepare model inputs +-++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++ +-++ # # prepare variable output controls +-++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++ +-++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++ # cur_cache_position = model_inputs.get("cache_position") +-++ # cur_past_key_values = model_inputs.get("past_key_values") +-++ # cur_input_ids = model_inputs.get("input_ids") +-++ +-++ # if (isinstance(cur_past_key_values, StaticCache) and +-++ # cur_cache_position is not None and +-++ # len(cur_cache_position.shape) > 0 and +-++ # cur_cache_position.shape[0] == 1 and +-++ # cur_input_ids is not None and +-++ # cur_input_ids.shape[1] == 1): +-++ # # 使用 JIT 优化的单 token 解码 +-++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++ # if not hasattr(self, '_jit_used'): +-++ # self._jit_used = False +-++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++ +-++ # next_token_logits = self.get_decode_one_tokens_logits( +-++ # cur_token=cur_input_ids, +-++ # input_pos=model_inputs.get("position_ids"), +-++ # cache_position=cur_cache_position, +-++ # past_key_values=cur_past_key_values, +-++ # ) +-++ +-++ # # 标记已使用JIT(用于后续判断) +-++ # if not self._jit_used: +-++ # self._jit_used = True +-++ +-++ # # 构造兼容的输出对象 +-++ # class JitOptimizedOutput: +-++ # def __init__(self, logits, config): +-++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++ # self.config = config +-++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++ # self.attentions = None if not config.is_encoder_decoder else None +-++ # self.cross_attentions = None +-++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++ +-++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++ # else: +-++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++ # outputs = self(**model_inputs, return_dict=True) +-++ +-++ # if synced_devices and this_peer_finished: +-++ # continue +-++ +-++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++ # next_token_logits = outputs.logits[:, -1, :] +-++ +-++ # # pre-process distribution +-++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++ # if do_sample: +-++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++ +-++ # # Store scores, attentions and hidden_states when required +-++ # if return_dict_in_generate: +-++ # if output_scores: +-++ # scores += (next_token_scores,) +-++ # if output_logits: +-++ # raw_logits += (next_token_logits,) +-++ # if output_attentions: +-++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++ # if self.config.is_encoder_decoder: +-++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++ +-++ # if output_hidden_states: +-++ # hidden = ( +-++ # outputs.decoder_hidden_states +-++ # if self.config.is_encoder_decoder +-++ # else outputs.hidden_states +-++ # ) +-++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++ +-++ # # token selection +-++ # if do_sample: +-++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++ # else: +-++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++ +-++ # # finished sentences should have their next token be a padding token +-++ # if has_eos_stopping_criteria: +-++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++ +-++ # # update generated ids, model inputs, and length for next step +-++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++ # if streamer is not None: +-++ # streamer.put(next_tokens) +-++ +-++ # model_kwargs = self._update_model_kwargs_for_generation( +-++ # outputs, +-++ # model_kwargs, +-++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++ # ) +-++ +-++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++ # cur_len += 1 +-++ +-++ # if _record_time: +-++ # import time as time_module +-++ # infer_stop = time_module.time() +-++ # time_record.append(infer_stop - infer_start) +-++ +-++ # del outputs +-++ +-++ # average_infer_time = None +-++ # if time_record: +-++ # if len(time_record) > 1: +-++ # time_record.pop(0) +-++ # average_infer_time = sum(time_record) / len(time_record) +-++ # print(f'average inference time is: {average_infer_time}') +-++ # print(f'inference time record: {time_record}') +-++ +-++ # if streamer is not None: +-++ # streamer.end() +-++ +-++ # # 简单判断:打印是否使用了JIT路径 +-++ # if hasattr(self, '_jit_used') and self._jit_used: +-++ # print("[JIT] ✓ JIT optimization was used during generation") +-++ # else: +-++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++ +-++ # if return_dict_in_generate: +-++ # if self.config.is_encoder_decoder: +-++ # return GenerateEncoderDecoderOutput( +-++ # sequences=input_ids, +-++ # scores=scores, +-++ # logits=raw_logits, +-++ # encoder_attentions=encoder_attentions, +-++ # encoder_hidden_states=encoder_hidden_states, +-++ # decoder_attentions=decoder_attentions, +-++ # cross_attentions=cross_attentions, +-++ # decoder_hidden_states=decoder_hidden_states, +-++ # past_key_values=model_kwargs.get("past_key_values"), +-++ # average_infer_time=average_infer_time +-++ # ) +-++ # else: +-++ # return GenerateDecoderOnlyOutput( +-++ # sequences=input_ids, +-++ # scores=scores, +-++ # logits=raw_logits, +-++ # attentions=decoder_attentions, +-++ # hidden_states=decoder_hidden_states, +-++ # past_key_values=model_kwargs.get("past_key_values"), +-++ # average_infer_time=average_infer_time +-++ # ) +-++ # else: +-++ # return input_ids +-++ +-++ # def _prepare_cache_for_generation( +-++ # self, +-++ # generation_config, +-++ # model_kwargs, +-++ # assistant_model, +-++ # batch_size, +-++ # max_cache_length, +-++ # ): +-++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++ # generation_config.cache_implementation = "static" +-++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++ +-++ # if generation_config.cache_implementation == "static": +-++ # base_required_from_max_length = generation_config.max_length + 1 +-++ # base_required = max(max_cache_length, base_required_from_max_length) +-++ # min_cache_size = 50 +-++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++ # else: +-++ # max_cache_length = max(base_required, min_cache_size) +-++ +-++ # original_max_cache_length = max_cache_length +-++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++ # print(f" - final max_cache_length: {max_cache_length}") +-++ +-++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++ # if max_cache_length > self.config.max_position_embeddings: +-++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++ +-++ # result = super()._prepare_cache_for_generation( +-++ # generation_config=generation_config, +-++ # model_kwargs=model_kwargs, +-++ # assistant_model=assistant_model, +-++ # batch_size=batch_size, +-++ # max_cache_length=max_cache_length, +-++ # ) +-++ +-++ # if generation_config.cache_implementation == "static": +-++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++ # created_cache = model_kwargs.get(cache_name) +-++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++ # if created_cache.max_cache_len < generation_config.max_length: +-++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++ +-++ # return result +-++ +-++ +-++ +-+ +-+ +-+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +deleted file mode 100644 +index 179a9bb5..00000000 +--- a/patches/0003-20261106secondcommit.patch ++++ /dev/null +@@ -1,2769 +0,0 @@ +-From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Thu, 6 Nov 2025 14:54:37 +0800 +-Subject: [PATCH 3/8] 20261106secondcommit +- +---- +- .../models/deepseek/modeling_deepseek.py | 217 ++- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +- patches/0001-20251104commit.patch | 1272 ----------------- +- 3 files changed, 528 insertions(+), 2032 deletions(-) +- delete mode 100644 patches/0001-20251104commit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index 73773c22..2f9192bf 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +- +- _CONFIG_FOR_DOC = "DeepseekConfig" +- +-+_attn_mask_cache = {} +-+ +-+def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-+ q_len = batch_and_seq[1] +-+ kv_len = batch_and_seq[1] + past_key_values_length +-+ key = (batch_and_seq[0], q_len, kv_len) +-+ +-+ if key in _attn_mask_cache: +-+ return _attn_mask_cache[key] +-+ +-+ mask = _prepare_4d_causal_attention_mask( +-+ attention_mask, +-+ batch_and_seq, +-+ inputs_embeds, +-+ past_key_values_length, +-+ ) +-+ _attn_mask_cache[key] = mask +-+ return mask +- +- def _get_unpad_data(attention_mask): +- seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +- return final_output +- +- +-- @no_grad() +-- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-- expert_cache = ops.zeros_like(x) +-- idxs = flat_expert_indices.argsort() +-- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-- token_idxs = idxs // self.num_experts_per_tok +-- +-- for i, end_idx in enumerate(tokens_per_expert): +-- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-- if start_idx == end_idx: +-- continue +-- expert = self.experts[i] +-- exp_token_idx = token_idxs[start_idx:end_idx] +-- expert_tokens = x[exp_token_idx] +-- expert_out = expert(expert_tokens) +-- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-- +-- return expert_cache +-- +- # @no_grad() +-- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-- # # expert_cache = torch.zeros_like(x) +-- # # idxs = flat_expert_indices.argsort() +-- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-- # # token_idxs = idxs // self.num_experts_per_tok +-- # # for i, end_idx in enumerate(tokens_per_expert): +-- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-- # # if start_idx == end_idx: +-- # # continue +-- # # expert = self.experts[i] +-- # # exp_token_idx = token_idxs[start_idx:end_idx] +-- # # expert_tokens = x[exp_token_idx] +-- # # expert_out = expert(expert_tokens) +-- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-- # # return expert_cache +-+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- # expert_cache = ops.zeros_like(x) +- # idxs = flat_expert_indices.argsort() +- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +- +- # return expert_cache +-- # @no_grad() +-- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-- # expert_cache = ops.zeros_like(x) +-+ +-+ @no_grad() +-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ """ +-+ 优化版 MoE prefill: +-+ - 批量张量化处理同一个 expert 的所有 token +-+ - 跳过无 token 的专家 +-+ - 保持结果完全一致 +-+ """ +-+ # 初始化输出缓存 +-+ expert_cache = ops.zeros_like(x) +- +-- # # 排序保证顺序一致 +-- # idxs = flat_expert_indices.argsort() +-- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-- # token_idxs = idxs // self.num_experts_per_tok +-+ # 排序(确保 scatter_add 位置对应原逻辑) +-+ idxs = flat_expert_indices.argsort() +-+ sorted_expert_indices = flat_expert_indices[idxs] +-+ sorted_token_indices = idxs // self.num_experts_per_tok +- +-- # # 找出有 token 的专家 +-- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+ # 每个 expert 的 token 数 +-+ tokens_per_expert = sorted_expert_indices.bincount() +- +-- # for i in active_experts.tolist(): +-- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-- # end_idx = tokens_per_expert[i] +-- # if start_idx == end_idx: # 没有 token +-- # continue +-+ # 找出有 token 的专家 +-+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +- +-- # exp_token_idx = token_idxs[start_idx:end_idx] +-- # expert_tokens = x[exp_token_idx] +-- # expert_out = self.experts[i](expert_tokens) +-- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+ for expert_id in active_experts.tolist(): +-+ # 取该 expert 对应的排序后 token 区间 +-+ start = (tokens_per_expert[:expert_id]).sum().item() +-+ end = start + tokens_per_expert[expert_id].item() +- +-- # expert_cache = mindspore.mint.scatter_add( +-- # expert_cache, +-- # 0, +-- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-- # expert_out +-- # ) +-+ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-+ expert_tokens = x[token_idx] # 取输入向量 +- +-- # return expert_cache +-+ # 执行专家 MLP +-+ expert_out = self.experts[expert_id](expert_tokens) +-+ +-+ # 按权重缩放 +-+ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-+ +-+ # 回写到缓存(等价 scatter_add) +-+ expert_cache = mindspore.mint.scatter_add( +-+ expert_cache, +-+ 0, +-+ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+ scaled_out +-+ ) +-+ +-+ return expert_cache +-+ +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # # expert_cache = torch.zeros_like(x) +-+ # # idxs = flat_expert_indices.argsort() +-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+ # # token_idxs = idxs // self.num_experts_per_tok +-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+ # # if start_idx == end_idx: +-+ # # continue +-+ # # expert = self.experts[i] +-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # # expert_tokens = x[exp_token_idx] +-+ # # expert_out = expert(expert_tokens) +-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+ # # return expert_cache +-+ # expert_cache = ops.zeros_like(x) +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # for i, end_idx in enumerate(tokens_per_expert): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # if start_idx == end_idx: +-+ # continue +-+ # expert = self.experts[i] +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = expert(expert_tokens) +-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +-+ # return expert_cache +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+ # expert_cache = ops.zeros_like(x) +-+ +-+ # # 排序保证顺序一致 +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ # token_idxs = idxs // self.num_experts_per_tok +-+ +-+ # # 找出有 token 的专家 +-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+ +-+ # for i in active_experts.tolist(): +-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ # end_idx = tokens_per_expert[i] +-+ # if start_idx == end_idx: # 没有 token +-+ # continue +-+ +-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+ # expert_tokens = x[exp_token_idx] +-+ # expert_out = self.experts[i](expert_tokens) +-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+ +-+ # expert_cache = mindspore.mint.scatter_add( +-+ # expert_cache, +-+ # 0, +-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+ # expert_out +-+ # ) +-+ +-+ # return expert_cache +- +- +- +-@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +- +- return attn_output, attn_weights, past_key_value +- +-- +- # class DeepseekFlashAttention(nn.Module): +- # """ +- # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +- +- return attn_output, attn_weights, past_key_value +- +-+ +- Deepseek_ATTENTION_CLASSES = { +- "eager": DeepseekAttention, +- "flash-attention": DeepseekFlashAttention, +-@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +- ) +- else: +- # 4d mask is passed through the layers +-- attention_mask = _prepare_4d_causal_attention_mask( +-+ # attention_mask = _prepare_4d_causal_attention_mask( +-+ # attention_mask, +-+ # (batch_size, seq_length), +-+ # inputs_embeds, +-+ # past_key_values_length, +-+ # ) +-+ #@dwj +-+ attention_mask = get_cached_causal_mask( +- attention_mask, +- (batch_size, seq_length), +- inputs_embeds, +-@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- # Initialize weights and apply final processing +- self.post_init() +- self.warm_up = False +-+ #@dwj +-+ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-+ self.num_layers, +-+ self.num_attention_heads, +-+ self.head_dim, +-+ batch_size=1, +-+ max_length=self.max_length, +-+ dtype=mindspore.float16 +-+ ) +-+ +-+ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-+ key_cache = [] +-+ value_cache = [] +-+ for _ in range(num_layers): +-+ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+ key_cache.append(k) +-+ value_cache.append(v) +-+ return key_cache, value_cache +-+ +- +- def warmup_moe_model_deep(self): +- print("[Warmup] DeepSeek-MoE 模型预热开始...") +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index bced285c..ebd7782e 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +- _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +- _CONFIG_FOR_DOC = "Qwen2MoeConfig" +- +--Long_Prompt = False +--PROMPT_LENGTH_THRESHOLD = 128 +-+Long_Prompt = 1 +-+LONG_PROMPT_LENGTH_THRESHOLD = 128 +-+SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-+ +-+_causal_mask_cache = {} +-+ +-+def get_cached_causal_mask_with_cache_position( +-+ attention_mask: mindspore.Tensor, +-+ sequence_length: int, +-+ target_length: int, +-+ dtype: mindspore.dtype, +-+ min_dtype: float, +-+ cache_position: mindspore.Tensor, +-+ batch_size: int, +-+): +-+ """ +-+ 带缓存的 causal mask 构造函数 +-+ """ +-+ # q_len 是当前 query 长度 +-+ q_len = sequence_length +-+ # kv_len 是 target_length +-+ kv_len = target_length +-+ +-+ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-+ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-+ +-+ if key in _causal_mask_cache: +-+ return _causal_mask_cache[key] +-+ +-+ # 调用原来的 mask 构造逻辑 +-+ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+ attention_mask, +-+ sequence_length=sequence_length, +-+ target_length=target_length, +-+ dtype=dtype, +-+ min_dtype=min_dtype, +-+ cache_position=cache_position, +-+ batch_size=batch_size, +-+ ) +-+ # 缓存结果 +-+ _causal_mask_cache[key] = causal_mask +-+ return causal_mask +- +- # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +- def _prepare_4d_causal_attention_mask_with_cache_position( +-@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +- +- +- # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-+# class Qwen2MoeAttention(nn.Module): +-+# """ +-+# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+# and "Generating Long Sequences with Sparse Transformers". +-+# """ +-+ +-+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+# super().__init__() +-+# self.config = config +-+# self.layer_idx = layer_idx +-+# if layer_idx is None: +-+# logger.warning_once( +-+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+# "when creating this class." +-+# ) +-+ +-+# self.hidden_size = config.hidden_size +-+# self.num_heads = config.num_attention_heads +-+# self.head_dim = self.hidden_size // self.num_heads +-+# self.num_key_value_heads = config.num_key_value_heads +-+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+# self.max_position_embeddings = config.max_position_embeddings +-+# self.rope_theta = config.rope_theta +-+# self.is_causal = True +-+# self.attention_dropout = config.attention_dropout +-+ +-+# if (self.head_dim * self.num_heads) != self.hidden_size: +-+# raise ValueError( +-+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+# f" and `num_heads`: {self.num_heads})." +-+# ) +-+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+ +-+# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+# self.head_dim, +-+# max_position_embeddings=self.max_position_embeddings, +-+# base=self.rope_theta, +-+# ) +-+ +-+# def forward( +-+# self, +-+# hidden_states: mindspore.Tensor, +-+# attention_mask: Optional[mindspore.Tensor] = None, +-+# position_ids: Optional[mindspore.Tensor] = None, +-+# past_key_value: Optional[Cache] = None, +-+# output_attentions: bool = False, +-+# use_cache: bool = False, +-+# cache_position: Optional[mindspore.Tensor] = None, +-+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+ +-+ +-+ +-+# bsz, q_len, _ = hidden_states.shape +-+ +-+# query_states = self.q_proj(hidden_states) +-+# key_states = self.k_proj(hidden_states) +-+# value_states = self.v_proj(hidden_states) +-+ +-+# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+ +-+# kv_seq_len = key_states.shape[-2] +-+# if past_key_value is not None: +-+# if self.layer_idx is None: +-+# raise ValueError( +-+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+# "with a layer index." +-+# ) +-+# if isinstance(past_key_value, StaticCache): +-+# kv_seq_len = key_states.shape[-2] +-+# else: +-+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+# if past_key_value is not None: +-+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+# if isinstance(past_key_value, StaticCache): +-+# kv_seq_len = key_states.shape[-2] +-+ +-+# # repeat k/v heads if n_kv_heads < n_heads +-+# key_states = repeat_kv(key_states, self.num_key_value_groups) +-+# value_states = repeat_kv(value_states, self.num_key_value_groups) +-+ +-+# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+ +-+# if attention_mask is not None: +-+# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+# attn_weights = attn_weights + causal_mask +-+ +-+# # upcast attention to fp32 +-+# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+# attn_output = ops.matmul(attn_weights, value_states) +-+ +-+# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+# raise ValueError( +-+# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+# f" {attn_output.shape}" +-+# ) +-+ +-+# attn_output = ops.transpose(attn_output, 1, 2) +-+# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+ +-+# attn_output = self.o_proj(attn_output) +-+# # @lwx +-+ +-+# # max_seq_len = self.max_position_embeddings # 2048 +-+ +-+# # if attention_mask is not None: +-+# # # attention_mask: [B, 1, Sq, Sk] +-+# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+ +-+# # # pad 到 [max_seq_len, max_seq_len] +-+# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+# # global_attention_mask = padded_mask +-+# # else: +-+# # global_attention_mask = None +-+ +-+ +-+# # sparse_mode=3 +-+# # attn_output = mindspore.ops.flash_attention_score( +-+# # query=query_states, +-+# # key=key_states, +-+# # value=value_states, +-+# # real_shift=None, +-+# # padding_mask=None, +-+ +-+# # head_num=self.num_heads, +-+# # attn_mask=global_attention_mask, +-+# # keep_prob=1.0 - self.attention_dropout, +-+# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+# # input_layout="BNSD", +-+# # pre_tokens=2147483647, +-+# # next_tokens=2147483647, +-+# # inner_precise=0, +-+# # drop_mask=None, +-+# # prefix=None, +-+# # actual_seq_qlen=None, +-+# # actual_seq_kvlen=None, +-+# # sparse_mode=sparse_mode, +-+# # ) +-+# if not output_attentions: +-+# attn_weights = None +-+ +-+# return attn_output, attn_weights, past_key_value +-+ +- class Qwen2MoeAttention(nn.Module): +- """ +-- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-- and "Generating Long Sequences with Sparse Transformers". +-- """ +-+ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +- +-+ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-+ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-+ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-+ +-+ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-+ """ +- def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +- super().__init__() +- self.config = config +-@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +- if layer_idx is None: +- logger.warning_once( +- f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +- "when creating this class." +- ) +- +-@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +- use_cache: bool = False, +- cache_position: Optional[mindspore.Tensor] = None, +- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-- +- +-- +-+ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +- bsz, q_len, _ = hidden_states.shape +- +- query_states = self.q_proj(hidden_states) +- key_states = self.k_proj(hidden_states) +- value_states = self.v_proj(hidden_states) +- +-- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-- +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ +- kv_seq_len = key_states.shape[-2] +- if past_key_value is not None: +-- if self.layer_idx is None: +-- raise ValueError( +-- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-- "with a layer index." +-- ) +-- if isinstance(past_key_value, StaticCache): +-- kv_seq_len = key_states.shape[-2] +-- else: +-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +- +- if past_key_value is not None: +-- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+ +-+ # --- 2. 动态调度核心注意力计算 --- +-+ global Long_Prompt +-+ if Long_Prompt >= 1: +-+ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-+ fa_attention_mask = None +-+ if attention_mask is not None: +-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+ fa_attention_mask = (mask_slice != 0) +-+ +-+ attn_output = mindspore.ops.flash_attention_score( +-+ query=query_states, +-+ key=key_states, +-+ value=value_states, +-+ head_num=self.num_heads, +-+ attn_mask=fa_attention_mask, +-+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+ input_layout="BNSD", +-+ sparse_mode=0, +-+ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-+ ) +- +-- if isinstance(past_key_value, StaticCache): +-- kv_seq_len = key_states.shape[-2] +-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ attn_output = self.o_proj(attn_output) +-+ attn_weights = None +-+ if output_attentions: +-+ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +- +-- # repeat k/v heads if n_kv_heads < n_heads +-- key_states = repeat_kv(key_states, self.num_key_value_groups) +-- value_states = repeat_kv(value_states, self.num_key_value_groups) +-- +-- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+ else: +-+ # --- Eager Attention 路径 (用于短序列和解码) --- +-+ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+ +-+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +- +-- if attention_mask is not None: +-- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-- attn_weights = attn_weights + causal_mask +-+ if attention_mask is not None: +-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+ attn_weights = attn_weights + causal_mask +- +-- # upcast attention to fp32 +-- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-- attn_output = ops.matmul(attn_weights, value_states) +-+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+ attn_output = ops.matmul(attn_weights, value_states) +- +-- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-- raise ValueError( +-- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-- f" {attn_output.shape}" +-- ) +-+ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+ raise ValueError( +-+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-+ ) +- +-- attn_output = ops.transpose(attn_output, 1, 2) +-- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+ attn_output = ops.transpose(attn_output, 1, 2) +-+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+ attn_output = self.o_proj(attn_output) +- +-- attn_output = self.o_proj(attn_output) +-- # @lwx +-+ if not output_attentions: +-+ attn_weights = None +- +-- # max_seq_len = self.max_position_embeddings # 2048 +-- +-- # if attention_mask is not None: +-- # # attention_mask: [B, 1, Sq, Sk] +-- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-- +-- # # pad 到 [max_seq_len, max_seq_len] +-- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-- # global_attention_mask = padded_mask +-- # else: +-- # global_attention_mask = None +-- +-- +-- # sparse_mode=3 +-- # attn_output = mindspore.ops.flash_attention_score( +-- # query=query_states, +-- # key=key_states, +-- # value=value_states, +-- # real_shift=None, +-- # padding_mask=None, +-- +-- # head_num=self.num_heads, +-- # attn_mask=global_attention_mask, +-- # keep_prob=1.0 - self.attention_dropout, +-- # scalar_value=1.0 / math.sqrt(self.head_dim), +-- # input_layout="BNSD", +-- # pre_tokens=2147483647, +-- # next_tokens=2147483647, +-- # inner_precise=0, +-- # drop_mask=None, +-- # prefix=None, +-- # actual_seq_qlen=None, +-- # actual_seq_kvlen=None, +-- # sparse_mode=sparse_mode, +-- # ) +-- if not output_attentions: +-- attn_weights = None +-- +- return attn_output, attn_weights, past_key_value +- +-- +- # class Qwen2MoeFlashAttention(nn.Module): +- # """ +- # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +- # return final_hidden_states, router_logits +- +- +--# class Qwen2MoeSparseMoeBlock(nn.Module): +--# """ +--# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +--# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +--# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +--# `_moe_infer_prefill` (用于长序列处理) 方法。 +--# """ +--# def __init__(self, config: Qwen2MoeConfig): +--# super().__init__() +--# self.num_experts = config.num_experts +--# self.top_k = config.num_experts_per_tok +--# self.norm_topk_prob = config.norm_topk_prob +-- +--# # 门控网络 +--# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +--# # 专家列表 +--# self.experts = nn.ModuleList( +--# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +--# ) +--# # 共享专家 +--# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +--# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--# @no_grad() +--# def _moe_infer_decode( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# """ +--# 【解码路径】针对 sequence_length=1 的极致优化。 +--# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +--# """ +--# batch_size, hidden_dim = hidden_states.shape +-- +--# expert_outputs_list = [ +--# ops.cat([ +--# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +--# ], dim=0) +--# for i in range(batch_size) +--# ] +-- +--# # --- 错误修复:将 axis=0 修改为 dim=0 --- +--# # shape: (batch_size, top_k, hidden_dim) +--# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-- +--# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +--# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-- +--# return moe_output.squeeze(1) +-- +--# @no_grad() +--# def _moe_infer_prefill( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# """ +--# 【预填充路径】针对 sequence_length > 1 的优化。 +--# 按专家对 Token 进行分组,并进行批处理。 +--# """ +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens = hidden_states.shape[0] +--# flat_selected_experts = selected_experts.flatten() +-- +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-- +--# active_experts = ops.unique(flat_selected_experts) +-- +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +-- +--# mask = (flat_selected_experts == expert_idx_tensor) +--# selected_token_indices = token_indices[mask] +--# selected_routing_weights = routing_weights.flatten()[mask] +-- +--# current_states = hidden_states[selected_token_indices] +-- +--# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-- +--# moe_output = moe_output.index_add( +--# dim=0, +--# index=selected_token_indices, +--# source=expert_output.to(hidden_states.dtype) +--# ) +--# return moe_output +-- +--# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--# """ +--# 顶层 forward 方法,作为智能分发器。 +--# """ +--# batch_size, sequence_length, hidden_dim = hidden_states.shape +-- +--# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--# router_logits = self.gate(hidden_states_reshaped) +--# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- +--# if self.norm_topk_prob: +--# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- +--# routing_weights = routing_weights.to(hidden_states.dtype) +-- +--# moe_output = None +--# # 在推理时,根据序列长度选择最优路径 +--# if not self.training: +--# if sequence_length == 1: +--# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +--# else: +--# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +--# else: +--# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +--# raise NotImplementedError("Training path is not implemented.") +-- +--# shared_expert_output = self.shared_expert(hidden_states_reshaped) +--# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +--# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-- +--# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-- +--# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-- +--# return final_hidden_states, router_logits +-- +-- +--# class Qwen2MoeSparseMoeBlock(nn.Module): +--# """ +--# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +--# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +--# """ +--# def __init__(self, config: Qwen2MoeConfig): +--# super().__init__() +--# self.num_experts = config.num_experts +--# self.top_k = config.num_experts_per_tok +--# self.norm_topk_prob = config.norm_topk_prob +-- +--# # 门控网络 +--# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +--# # 专家列表 +--# self.experts = nn.ModuleList( +--# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +--# ) +--# # 共享专家 +--# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +--# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--# @no_grad() +--# def _moe_infer_decode( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# batch_size, _ = hidden_states.shape +--# expert_outputs_list = [ +--# ops.cat([ +--# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +--# ], dim=0) +--# for i in range(batch_size) +--# ] +--# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +--# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +--# return moe_output.squeeze(1) +-- +--# @no_grad() +--# def _moe_infer_prefill( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens = hidden_states.shape[0] +--# flat_selected_experts = selected_experts.flatten() +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +--# active_experts = ops.unique(flat_selected_experts) +-- +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +--# mask = (flat_selected_experts == expert_idx_tensor) +--# selected_token_indices = token_indices[mask] +--# selected_routing_weights = routing_weights.flatten()[mask] +--# current_states = hidden_states[selected_token_indices] +--# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +--# moe_output = moe_output.index_add( +--# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +--# ) +--# return moe_output +-- +--# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--# """ +--# 顶层 forward 方法,作为智能分发器。 +--# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +--# """ +--# batch_size, sequence_length, hidden_dim = hidden_states.shape +-- +--# # 1. 门控计算 (通用逻辑) +--# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--# router_logits = self.gate(hidden_states_reshaped) +--# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- +--# if self.norm_topk_prob: +--# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- +--# routing_weights = routing_weights.to(hidden_states.dtype) +-- +--# # 2. 智能分发到最优 MoE 路径 +--# moe_output = None +--# if not self.training: +--# if sequence_length == 1: +--# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +--# else: +--# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +--# else: +--# raise NotImplementedError("Training path is not implemented.") +-- +--# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +--# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +--# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +--# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-- +--# # 4. 合并 MoE 输出和共享专家输出 +--# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +--# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-- +--# # 5. 恢复原始形状并返回 +--# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-- +--# return final_hidden_states, router_logits +-- +--# prefill fastest +--# class Qwen2MoeSparseMoeBlock(nn.Module): +--# """ +--# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +--# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +--# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +--# """ +--# def __init__(self, config: Qwen2MoeConfig): +--# super().__init__() +--# self.num_experts = config.num_experts +--# self.top_k = config.num_experts_per_tok +--# self.norm_topk_prob = config.norm_topk_prob +-- +--# # 门控网络 +--# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +--# # 专家列表 +--# self.experts = nn.ModuleList( +--# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +--# ) +--# # 共享专家 +--# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +--# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--# @no_grad() +--# def _moe_infer_dispatch( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# """ +--# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +--# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +--# """ +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens, _ = hidden_states.shape +-- +--# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +--# flat_selected_experts = selected_experts.flatten() +--# flat_routing_weights = routing_weights.flatten() +-- +--# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-- +--# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +--# active_experts = ops.unique(flat_selected_experts) +-- +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +-- +--# # 找到所有分配给该专家的 token +--# mask = (flat_selected_experts == expert_idx_tensor) +-- +--# # 使用 mask 选取对应的 token 和权重 +--# current_token_indices = token_indices[mask] +--# current_routing_weights = flat_routing_weights[mask] +--# current_hidden_states = hidden_states[current_token_indices] +-- +--# # 对这些 token 进行批处理 +--# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-- +--# # 使用 index_add 将结果精确地加回到对应位置 +--# moe_output = moe_output.index_add( +--# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +--# ) +--# return moe_output +-- +--# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--# """ +--# 顶层 forward 方法,作为智能分发器。 +--# """ +--# batch_size, sequence_length, hidden_dim = hidden_states.shape +-- +--# # 1. 门控计算 +--# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--# router_logits = self.gate(hidden_states_reshaped) +--# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- +--# if self.norm_topk_prob: +--# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- +--# routing_weights = routing_weights.to(hidden_states.dtype) +-- +--# # 2. 调用统一的 MoE 计算内核 +--# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +--# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-- +--# # 3. 统一处理共享专家 +--# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +--# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-- +--# # 4. 合并输出 +--# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-- +--# # 5. 恢复原始形状并返回 +--# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-- +--# return final_hidden_states, router_logits +-- +-- +--# class Qwen2MoeSparseMoeBlock(nn.Module): +--# """ +--# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +--# 【最终高性能与高精度版】: +--# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +--# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +--# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +--# 3. 这样实现了速度和准确性的两全其美。 +--# """ +--# def __init__(self, config: Qwen2MoeConfig): +--# super().__init__() +--# self.num_experts = config.num_experts +--# self.top_k = config.num_experts_per_tok +--# self.norm_topk_prob = config.norm_topk_prob +-- +--# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +--# self.experts = nn.ModuleList( +--# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +--# ) +--# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +--# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--# @no_grad() +--# def _moe_infer_decode( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# """ +--# 【解码路径】极致优化版:bmm + 高精度累加。 +--# """ +--# original_dtype = hidden_states.dtype +--# batch_size, _ = hidden_states.shape +-- +--# expert_outputs_list = [ +--# ops.cat([ +--# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +--# ], dim=0) +--# for i in range(batch_size) +--# ] +--# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-- +--# # 在 float32 下执行 bmm,得到高精度结果 +--# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-- +--# # 将高精度结果转换回原始数据类型 +--# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-- +--# return moe_output +-- +--# @no_grad() +--# def _moe_infer_prefill( +--# self, +--# hidden_states: mindspore.Tensor, +--# selected_experts: mindspore.Tensor, +--# routing_weights: mindspore.Tensor +--# ) -> mindspore.Tensor: +--# """ +--# 【预填充路径】与原始实现一致,结果精确。 +--# """ +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens, _ = hidden_states.shape +--# flat_selected_experts = selected_experts.flatten() +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +--# active_experts = ops.unique(flat_selected_experts) +-- +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +--# mask = (flat_selected_experts == expert_idx_tensor) +--# selected_token_indices = token_indices[mask] +--# selected_routing_weights = routing_weights.flatten()[mask] +--# current_states = hidden_states[selected_token_indices] +--# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +--# moe_output = moe_output.index_add( +--# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +--# ) +--# return moe_output +-- +--# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--# batch_size, sequence_length, hidden_dim = hidden_states.shape +-- +--# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--# router_logits = self.gate(hidden_states_reshaped) +--# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-- +--# if self.norm_topk_prob: +--# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- +--# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +--# # 如果模型主体是 float16,后续再转换 +-- +--# moe_output = None +--# if not self.training: +--# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +--# # _moe_infer_decode 内部会处理好类型转换 +--# temp_routing_weights = routing_weights.to(hidden_states.dtype) +--# if sequence_length == 1: +--# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +--# else: +--# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +--# else: +--# raise NotImplementedError("Training path is not implemented.") +-- +--# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +--# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-- +--# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +--# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-- +--# return final_hidden_states, router_logits +-- +-- +--# class Qwen2MoeSparseMoeBlock(nn.Module): +--# """ +--# 【融合版】一个混合专家模块,内置两种推理策略, +--# 由外部全局变量 `Long_Prompt` 控制: +-- +--# - if Long_Prompt is True: 【精度优先模式】 +--# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +--# 适用于处理长序列,避免误差累积。 +-- +--# - if Long_Prompt is False: 【速度优先模式】 +--# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +--# 在解码阶段获得极致速度,同时保证结果高度准确。 +--# """ +--# def __init__(self, config: Qwen2MoeConfig): +--# super().__init__() +--# self.num_experts = config.num_experts +--# self.top_k = config.num_experts_per_tok +--# self.norm_topk_prob = config.norm_topk_prob +-- +--# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +--# self.experts = nn.ModuleList( +--# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +--# ) +--# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +--# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--# # --- 速度优先模式的辅助函数 --- +--# @no_grad() +--# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +--# original_dtype = hidden_states.dtype +--# batch_size, _ = hidden_states.shape +--# expert_outputs_list = [ +--# ops.cat([ +--# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +--# ], dim=0) +--# for i in range(batch_size) +--# ] +--# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +--# weights_fp32 = routing_weights.to(mindspore.float32) +--# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +--# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +--# return moe_output_fp32.squeeze(1).to(original_dtype) +-- +--# @no_grad() +--# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens, _ = hidden_states.shape +--# flat_selected_experts = selected_experts.flatten() +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +--# active_experts = ops.unique(flat_selected_experts) +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +--# mask = (flat_selected_experts == expert_idx_tensor) +--# selected_token_indices = token_indices[mask] +--# selected_routing_weights = routing_weights.flatten()[mask] +--# current_states = hidden_states[selected_token_indices] +--# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +--# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +--# return moe_output +-- +--# # --- 精度优先模式的辅助函数 --- +--# @no_grad() +--# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +--# moe_output = ops.zeros_like(hidden_states) +--# num_tokens, _ = hidden_states.shape +--# flat_selected_experts = selected_experts.flatten() +--# flat_routing_weights = routing_weights.flatten() +--# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +--# active_experts = ops.unique(flat_selected_experts) +--# for expert_idx_tensor in active_experts: +--# expert_idx = expert_idx_tensor.item() +--# expert_layer = self.experts[expert_idx] +--# mask = (flat_selected_experts == expert_idx_tensor) +--# current_token_indices = token_indices[mask] +--# current_routing_weights = flat_routing_weights[mask] +--# current_hidden_states = hidden_states[current_token_indices] +--# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +--# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +--# return moe_output +-- +--# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--# # 声明我们将要使用一个在模块外部定义的全局变量 +--# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +--# global Long_Prompt +-- +--# # 1. 门控计算 (所有模式通用) +--# batch_size, sequence_length, hidden_dim = hidden_states.shape +--# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--# router_logits = self.gate(hidden_states_reshaped) +--# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +--# if self.norm_topk_prob: +--# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-- +--# moe_output = None +--# if not self.training: +--# # 根据 Long_Prompt 标志选择模式 +--# if Long_Prompt: +--# # --- 精度优先模式 --- +--# routing_weights_casted = routing_weights.to(hidden_states.dtype) +--# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +--# else: +--# # --- 速度优先模式 --- +--# routing_weights_casted = routing_weights.to(hidden_states.dtype) +--# if sequence_length == 1: +--# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +--# else: +--# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +--# else: +--# raise NotImplementedError("Training path is not implemented.") +-- +--# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +--# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-- +--# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +--# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-- +--# return final_hidden_states, router_logits +-- +- class Qwen2MoeSparseMoeBlock(nn.Module): +- """ +- 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +- return moe_output_fp32.squeeze(1).to(original_dtype) +- +-+ # @no_grad() +-+ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ # num_tokens, _ = hidden_states.shape +-+ # flat_selected_experts = selected_experts.flatten() +-+ # sorted_expert_indices = flat_selected_experts.argsort() +-+ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+ # original_token_indices = sorted_expert_indices // self.top_k +-+ # moe_output = ops.zeros_like(hidden_states) +-+ # current_token_offset = 0 +-+ # for i in range(self.num_experts): +-+ # expert_token_count = tokens_per_expert[i] - current_token_offset +-+ # if expert_token_count == 0: +-+ # continue +-+ # end_offset = current_token_offset + expert_token_count +-+ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+ # expert_hidden_states = hidden_states[expert_original_token_indices] +-+ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+ # current_token_offset += expert_token_count +-+ # return moe_output +-+ +- @no_grad() +- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-- num_tokens, _ = hidden_states.shape +-- flat_selected_experts = selected_experts.flatten() +-- sorted_expert_indices = flat_selected_experts.argsort() +-- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-- original_token_indices = sorted_expert_indices // self.top_k +-+ """ +-+ 优化版 MoE prefill (速度优先模式): +-+ - 批量张量化处理同一个 expert 的所有 token +-+ - 跳过无 token 的专家 +-+ - 保持结果完全一致 +-+ """ +- moe_output = ops.zeros_like(hidden_states) +-- current_token_offset = 0 +-- for i in range(self.num_experts): +-- expert_token_count = tokens_per_expert[i] - current_token_offset +-- if expert_token_count == 0: +-- continue +-- end_offset = current_token_offset + expert_token_count +-- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-- expert_hidden_states = hidden_states[expert_original_token_indices] +-- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-- current_token_offset += expert_token_count +-+ +-+ flat_selected_experts = selected_experts.flatten() +-+ flat_routing_weights = routing_weights.flatten() +-+ +-+ idxs = flat_selected_experts.argsort() +-+ sorted_expert_indices = flat_selected_experts[idxs] +-+ sorted_token_indices = idxs // self.top_k +-+ +-+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-+ +-+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+ +-+ for expert_id in active_experts.tolist(): +-+ start = int(tokens_per_expert[:expert_id].sum().item()) +-+ end = start + int(tokens_per_expert[expert_id].item()) +-+ +-+ token_idx = sorted_token_indices[start:end] +-+ expert_tokens = hidden_states[token_idx] +-+ +-+ expert_out = self.experts[expert_id](expert_tokens) +-+ +-+ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-+ +-+ moe_output = mindspore.mint.scatter_add( +-+ moe_output, +-+ 0, +-+ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-+ scaled_out.to(hidden_states.dtype) +-+ ) +-+ +- return moe_output +- +-+ +- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +- @no_grad() +- def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +- +- moe_output = None +-- if Long_Prompt: +-- # --- 精度优先模式 (ACCURACY MODE) --- +-- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # if Long_Prompt==0: +-+ # # --- 精度优先模式 (ACCURACY MODE) --- +-+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # else: +-+ # # --- 速度优先模式 (SPEED MODE) --- +-+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+ # if sequence_length == 1: +-+ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # else: +-+ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ +-+ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+ if sequence_length == 1: +-+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +- else: +-- # --- 速度优先模式 (SPEED MODE) --- +-- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-- if sequence_length == 1: +-- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-- else: +-- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-- +-+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ +- +- # 3. 共享专家计算与合并 (所有模式通用) +- gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- +- return final_hidden_states, router_logits +- +-+ +- class Qwen2MoeDecoderLayer(nn.Module): +- def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +- super().__init__() +- self.hidden_size = config.hidden_size +- +-- # if Long_Prompt: +-- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-- # else: +-+ # if Long_Prompt == 2: +- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+ # else: +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- +- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +- +-@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +- ) +- +- # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+ # attention_mask, +-+ # sequence_length=sequence_length, +-+ # target_length=target_length, +-+ # dtype=dtype, +-+ # min_dtype=min_dtype, +-+ # cache_position=cache_position, +-+ # batch_size=input_tensor.shape[0], +-+ # ) +-+ #@dwj +-+ causal_mask = get_cached_causal_mask_with_cache_position( +- attention_mask, +- sequence_length=sequence_length, +- target_length=target_length, +-@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +- 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +- """ +-- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-+ _causal_mask_cache.clear() +- +- input_ids = kwargs.get("input_ids") +- if input_ids is None and args: +-@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- +- if input_ids is not None: +- prompt_length = input_ids.shape[1] +-- +-- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-- Long_Prompt = True +-+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-+ Long_Prompt = 2 +-+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-+ Long_Prompt = 0 +- else: +-- Long_Prompt = False +-+ Long_Prompt = 1 +-+ +- +- return super().generate(*args, **kwargs) +- +-@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +- dtype = self.lm_head.weight.dtype +- min_dtype = float(ops.finfo(dtype).min) +- +-- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+ # attention_mask, +-+ # sequence_length=sequence_length, +-+ # target_length=past_key_values.get_max_length(), +-+ # dtype=dtype, +-+ # min_dtype=min_dtype, +-+ # cache_position=cache_position, +-+ # batch_size=batch_size, +-+ # ) +-+ +-+ #@dwj +-+ attention_mask = get_cached_causal_mask_with_cache_position( +- attention_mask, +- sequence_length=sequence_length, +- target_length=past_key_values.get_max_length(), +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-deleted file mode 100644 +-index 6dfb5b93..00000000 +---- a/patches/0001-20251104commit.patch +-+++ /dev/null +-@@ -1,1272 +0,0 @@ +--From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +--From: Pinoeer-kingxi <13022943007@163.com> +--Date: Tue, 4 Nov 2025 09:11:51 +0800 +--Subject: [PATCH] 20251104commit +-- +----- +-- mindnlp/transformers/cache_utils.py | 28 +- +-- .../models/deepseek/modeling_deepseek.py | 149 ++- +-- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-- 3 files changed, 976 insertions(+), 87 deletions(-) +-- +--diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +--index cadd2e04..02f8d4be 100644 +----- a/mindnlp/transformers/cache_utils.py +--+++ b/mindnlp/transformers/cache_utils.py +--@@ -812,14 +812,26 @@ class StaticCache(Cache): +-- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-- # k_out[:, :, cache_position] = key_states +-- # v_out[:, :, cache_position] = value_states +--- if ON_ORANGE_PI: +--- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +--- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +--- else: +--- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +--- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +--- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +--- +--+ # if ON_ORANGE_PI: +--+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +--+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +--+ # else: +--+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +--+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +--+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +--+ # 确保 cache_position 是 1D tensor 并且类型正确 +--+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +--+ if cache_position.ndim > 1: +--+ cache_position = cache_position.flatten() +--+ # 确保类型是 int32 或 int64(MindSpore 要求) +--+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +--+ cache_position = cache_position.int() +--+ +--+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +--+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +--+ k_out[:, :, cache_position] = key_states +--+ v_out[:, :, cache_position] = value_states +--+ +-- return k_out, v_out +-- +-- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +--diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +--index c695b944..d8303e45 100644 +----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +--+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +--@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-- # Copied from transformers.models.llama.modeling_llama.rotate_half +-- def rotate_half(x): +-- """Rotates half the hidden dims of the input.""" +--- x1 = x[..., : x.shape[-1] // 2] +--- x2 = x[..., x.shape[-1] // 2 :] +--+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +--+ # x1 = x[..., : x.shape[-1] // 2] +--+ # x2 = x[..., x.shape[-1] // 2 :] +--+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-- return ops.cat((-x2, x1), dim=-1) +-- +-- +--@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-- if self.training: +-- raise NotImplementedError("Training is not supported yet.") +-- else: +--- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +--- if self.config.n_shared_experts is not None: +--- y = y + self.shared_experts(identity) +--- return y +--+ # @lwx +--+ if orig_shape[1] == 1: +--+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +--+ y=y.view(*orig_shape) +--+ if self.config.n_shared_experts is not None: +--+ y = y + self.shared_experts(identity) +--+ return y +--+ else: +--+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +--+ if self.config.n_shared_experts is not None: +--+ y = y + self.shared_experts(identity) +--+ return y +--+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +--+ # if self.config.n_shared_experts is not None: +--+ # y = y + self.shared_experts(identity) +--+ # return y +--+ +--+ @no_grad() +--+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +--+ +--+ expert_cache = ops.zeros_like(x) +--+ for i in range(self.num_experts_per_tok): +--+ expert_id = flat_expert_indices[i].item() +--+ weight = flat_expert_weights[i].item() +--+ expert = self.experts[expert_id] +--+ expert_out = expert(x) +--+ expert_cache += expert_out * weight +--+ return expert_cache +-- +-- @no_grad() +--- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +--- # expert_cache = torch.zeros_like(x) +--- # idxs = flat_expert_indices.argsort() +--- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +--- # token_idxs = idxs // self.num_experts_per_tok +--- # for i, end_idx in enumerate(tokens_per_expert): +--- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +--- # if start_idx == end_idx: +--- # continue +--- # expert = self.experts[i] +--- # exp_token_idx = token_idxs[start_idx:end_idx] +--- # expert_tokens = x[exp_token_idx] +--- # expert_out = expert(expert_tokens) +--- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +--- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +--- # return expert_cache +--+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-- expert_cache = ops.zeros_like(x) +-- idxs = flat_expert_indices.argsort() +-- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-- token_idxs = idxs // self.num_experts_per_tok +--+ +-- for i, end_idx in enumerate(tokens_per_expert): +-- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-- if start_idx == end_idx: +--@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-- expert_out = expert(expert_tokens) +-- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +--+ +-- return expert_cache +--+ +--+ # @no_grad() +--+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +--+ # # expert_cache = torch.zeros_like(x) +--+ # # idxs = flat_expert_indices.argsort() +--+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +--+ # # token_idxs = idxs // self.num_experts_per_tok +--+ # # for i, end_idx in enumerate(tokens_per_expert): +--+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +--+ # # if start_idx == end_idx: +--+ # # continue +--+ # # expert = self.experts[i] +--+ # # exp_token_idx = token_idxs[start_idx:end_idx] +--+ # # expert_tokens = x[exp_token_idx] +--+ # # expert_out = expert(expert_tokens) +--+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +--+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +--+ # # return expert_cache +--+ # expert_cache = ops.zeros_like(x) +--+ # idxs = flat_expert_indices.argsort() +--+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +--+ # token_idxs = idxs // self.num_experts_per_tok +--+ +--+ # for i, end_idx in enumerate(tokens_per_expert): +--+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +--+ # if start_idx == end_idx: +--+ # continue +--+ # expert = self.experts[i] +--+ # exp_token_idx = token_idxs[start_idx:end_idx] +--+ # expert_tokens = x[exp_token_idx] +--+ # expert_out = expert(expert_tokens) +--+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +--+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +--+ +--+ # return expert_cache +--+ # @no_grad() +--+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +--+ # expert_cache = ops.zeros_like(x) +--+ +--+ # # 排序保证顺序一致 +--+ # idxs = flat_expert_indices.argsort() +--+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +--+ # token_idxs = idxs // self.num_experts_per_tok +--+ +--+ # # 找出有 token 的专家 +--+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +--+ +--+ # for i in active_experts.tolist(): +--+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +--+ # end_idx = tokens_per_expert[i] +--+ # if start_idx == end_idx: # 没有 token +--+ # continue +--+ +--+ # exp_token_idx = token_idxs[start_idx:end_idx] +--+ # expert_tokens = x[exp_token_idx] +--+ # expert_out = self.experts[i](expert_tokens) +--+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +--+ +--+ # expert_cache = mindspore.mint.scatter_add( +--+ # expert_cache, +--+ # 0, +--+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +--+ # expert_out +--+ # ) +--+ +--+ # return expert_cache +--+ +--+ +-- +-- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-- # """ +--@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-- +-- # Initialize weights and apply final processing +-- self.post_init() +--+ self.warm_up = False +--+ +--+ def warmup_moe_model_deep(self): +--+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +--+ test_texts = [ +--+ "warmup short", +--+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +--+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +--+ ] +--+ tokenizer = getattr(self, "_warmup_tokenizer", None) +--+ if tokenizer is None: +--+ from mindnlp.transformers import AutoTokenizer +--+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +--+ self._warmup_tokenizer = tokenizer +--+ +--+ for text in test_texts: +--+ inputs = tokenizer(text, return_tensors="ms") +--+ with mindspore._no_grad(): +--+ _ = self(**inputs, use_cache=False) +--+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-- +-- def get_input_embeddings(self): +-- return self.model.embed_tokens +--@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-- ```""" +--+ if not self.warm_up: +--+ self.warm_up = True +--+ self.warmup_moe_model_deep() +--+ +-- output_attentions = ( +-- output_attentions +-- if output_attentions is not None +--diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +--index 3cbf820e..d4c6b651 100644 +----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +--+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +--@@ -18,7 +18,6 @@ +-- # See the License for the specific language governing permissions and +-- # limitations under the License. +-- """MindSpore Qwen2MoE model.""" +--- +-- import math +-- from typing import List, Optional, Tuple, Union +-- +--@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-- TokenClassifierOutput, +-- ) +-- from ...modeling_utils import PreTrainedModel +--+from ...generation import GenerationMixin +-- from ....utils import logging +-- from .configuration_qwen2_moe import Qwen2MoeConfig +-- +--@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-- self.variance_epsilon = eps +-- +-- def forward(self, hidden_states): +--+ # @dwj +--+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +--+ # @lwx +--+ # if not self.training : +--+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-- input_dtype = hidden_states.dtype +-- hidden_states = hidden_states.to(mindspore.float32) +-- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +--@@ -234,6 +239,8 @@ def rotate_half(x): +-- """Rotates half the hidden dims of the input.""" +-- x1 = x[..., : x.shape[-1] // 2] +-- x2 = x[..., x.shape[-1] // 2 :] +--+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +--+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-- return ops.cat((-x2, x1), dim=-1) +-- +-- +--@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-- self.config = config +-- self.hidden_size = config.hidden_size +-- self.intermediate_size = intermediate_size +--+ +-- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-- self.act_fn = ACT2FN[config.hidden_act] +-- +-- def forward(self, x): +--- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +--- +-- +--+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +--+ # @lwx +--+ # gate_up_output = self.gate_up_proj(x) +--+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +--+ # return self.down_proj(swiglu_output) +--+ +--+ # def forward(self, x): +--+ # gate_proj_out = self.gate_proj(x) +--+ # up_proj_out = self.up_proj(x) +--+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +--+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +--+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +--+ # return self.down_proj(swiglu_out) +--+ +-- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-- """ +--@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-- use_cache: bool = False, +-- cache_position: Optional[mindspore.Tensor] = None, +-- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +--+ +--+ +--+ +-- bsz, q_len, _ = hidden_states.shape +-- +-- query_states = self.q_proj(hidden_states) +--@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-- "with a layer index." +-- ) +--- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +--+ if isinstance(past_key_value, StaticCache): +--+ kv_seq_len = key_states.shape[-2] +--+ else: +--+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-- +-- if past_key_value is not None: +-- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +--+ +--+ if isinstance(past_key_value, StaticCache): +--+ kv_seq_len = key_states.shape[-2] +-- +-- # repeat k/v heads if n_kv_heads < n_heads +-- key_states = repeat_kv(key_states, self.num_key_value_groups) +-- value_states = repeat_kv(value_states, self.num_key_value_groups) +--- +--+ +-- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-- +--- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +--- raise ValueError( +--- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +--- f" {attn_weights.shape}" +--- ) +--- +--- if attention_mask is not None: # no matter the length, we just slice it +--- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +--+ if attention_mask is not None: +--+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-- attn_weights = attn_weights + causal_mask +-- +-- # upcast attention to fp32 +--@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-- +-- attn_output = self.o_proj(attn_output) +--- +--+ # @lwx +--+ +--+ # max_seq_len = self.max_position_embeddings # 2048 +--+ +--+ # if attention_mask is not None: +--+ # # attention_mask: [B, 1, Sq, Sk] +--+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +--+ +--+ # # pad 到 [max_seq_len, max_seq_len] +--+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +--+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +--+ # global_attention_mask = padded_mask +--+ # else: +--+ # global_attention_mask = None +--+ +--+ +--+ # sparse_mode=3 +--+ # attn_output = mindspore.ops.flash_attention_score( +--+ # query=query_states, +--+ # key=key_states, +--+ # value=value_states, +--+ # real_shift=None, +--+ # padding_mask=None, +--+ +--+ # head_num=self.num_heads, +--+ # attn_mask=global_attention_mask, +--+ # keep_prob=1.0 - self.attention_dropout, +--+ # scalar_value=1.0 / math.sqrt(self.head_dim), +--+ # input_layout="BNSD", +--+ # pre_tokens=2147483647, +--+ # next_tokens=2147483647, +--+ # inner_precise=0, +--+ # drop_mask=None, +--+ # prefix=None, +--+ # actual_seq_qlen=None, +--+ # actual_seq_kvlen=None, +--+ # sparse_mode=sparse_mode, +--+ # ) +-- if not output_attentions: +-- attn_weights = None +-- +-- return attn_output, attn_weights, past_key_value +-- +-- +--+class Qwen2MoeFlashAttention(nn.Module): +--+ """ +--+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +--+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +--+ +--+ 关键改动: +--+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +--+ 直接传入原始的 key 和 value 张量效率更高。 +--+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +--+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +--+ """ +--+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +--+ super().__init__() +--+ self.config = config +--+ self.layer_idx = layer_idx +--+ self.hidden_size = config.hidden_size +--+ self.num_heads = config.num_attention_heads +--+ self.head_dim = self.hidden_size // self.num_heads +--+ self.num_key_value_heads = config.num_key_value_heads +--+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +--+ self.max_position_embeddings = config.max_position_embeddings +--+ self.rope_theta = config.rope_theta +--+ self.attention_dropout = config.attention_dropout +--+ +--+ if (self.head_dim * self.num_heads) != self.hidden_size: +--+ raise ValueError( +--+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +--+ ) +--+ +--+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +--+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +--+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +--+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +--+ +--+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +--+ self.head_dim, +--+ max_position_embeddings=self.max_position_embeddings, +--+ base=self.rope_theta, +--+ ) +--+ +--+ def forward( +--+ self, +--+ hidden_states: mindspore.Tensor, +--+ attention_mask: Optional[mindspore.Tensor] = None, +--+ position_ids: Optional[mindspore.Tensor] = None, +--+ past_key_value: Optional[Cache] = None, +--+ output_attentions: bool = False, +--+ use_cache: bool = False, +--+ cache_position: Optional[mindspore.Tensor] = None, +--+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +--+ +--+ bsz, q_len, _ = hidden_states.shape +--+ +--+ # 1. 线性投射 Q, K, V +--+ query_states = self.q_proj(hidden_states) +--+ key_states = self.k_proj(hidden_states) +--+ value_states = self.v_proj(hidden_states) +--+ +--+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +--+ # query: [B, S, H*D] -> [B, N1, S, D] +--+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +--+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ +--+ # 3. RoPE 旋转位置编码 +--+ kv_seq_len = key_states.shape[-2] +--+ if past_key_value is not None: +--+ if self.layer_idx is None: +--+ raise ValueError( +--+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +--+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +--+ "with a layer index." +--+ ) +--+ # 对于 StaticCache,需要特殊处理 kv_seq_len +--+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +--+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +--+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +--+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +--+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +--+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +--+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +--+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +--+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +--+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +--+ if cache_position.shape[0] == 1: +--+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +--+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +--+ kv_seq_len = past_seen_tokens + 1 +--+ else: +--+ # prefill 阶段:cache_position 是范围,使用其长度 +--+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +--+ else: +--+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +--+ +--+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +--+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +--+ +--+ # 4. KV 缓存更新 +--+ if past_key_value is not None: +--+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +--+ key_states, value_states = past_key_value.update( +--+ key_states, value_states, self.layer_idx, cache_kwargs +--+ ) +--+ +--+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +--+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +--+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +--+ if cache_position.shape[0] == 1: +--+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +--+ kv_seq_len = key_states.shape[-2] +--+ +--+ # 5. [重要] 准备 Attention Mask +--+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +--+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +--+ fa_attention_mask = None +--+ if attention_mask is not None: +--+ # 截取与当前key长度匹配的部分 +--+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +--+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +--+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +--+ # 转换为布尔类型: 大负数 -> True, 0 -> False +--+ fa_attention_mask = (mask_slice != 0) +--+ +--+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +--+ input_dtype = query_states.dtype +--+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +--+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +--+ query_states = query_states.to(mindspore.float16) +--+ key_states = key_states.to(mindspore.float16) +--+ value_states = value_states.to(mindspore.float16) +--+ +--+ # 6. [核心] 调用 flash_attention_score 算子 +--+ # - 无需手动 repeat_kv, 算子原生支持 GQA +--+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +--+ attn_output = mindspore.ops.flash_attention_score( +--+ query=query_states, +--+ key=key_states, +--+ value=value_states, +--+ head_num=self.num_heads, # 传入Q的头数(N1) +--+ attn_mask=fa_attention_mask, +--+ keep_prob=1.0 - self.attention_dropout, +--+ scalar_value=1.0 / math.sqrt(self.head_dim), +--+ input_layout="BNSD", +--+ sparse_mode=0 # 使用 defaultMask 模式 +--+ ) +--+ +--+ # 恢复原始数据类型 +--+ attn_output = attn_output.to(input_dtype) +--+ +--+ # 7. 调整输出形状 +--+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +--+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +--+ attn_output = self.o_proj(attn_output) +--+ +--+ # FlashAttention 算子不直接返回注意力权重矩阵 +--+ attn_weights = None +--+ if output_attentions: +--+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +--+ +--+ return attn_output, attn_weights, past_key_value +--+ +--+ # def forward( +--+ # self, +--+ # hidden_states: mindspore.Tensor, +--+ # attention_mask: Optional[mindspore.Tensor] = None, +--+ # position_ids: Optional[mindspore.Tensor] = None, +--+ # past_key_value: Optional[Cache] = None, +--+ # output_attentions: bool = False, +--+ # use_cache: bool = False, +--+ # cache_position: Optional[mindspore.Tensor] = None, +--+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +--+ +--+ # bsz, q_len, _ = hidden_states.shape +--+ +--+ # # 1. 线性投射 Q, K, V +--+ # query_states = self.q_proj(hidden_states) +--+ # key_states = self.k_proj(hidden_states) +--+ # value_states = self.v_proj(hidden_states) +--+ +--+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +--+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ +--+ # # 3. RoPE 旋转位置编码 +--+ # kv_seq_len = key_states.shape[-2] +--+ # if past_key_value is not None: +--+ # if self.layer_idx is None: +--+ # raise ValueError( +--+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +--+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +--+ # "with a layer index." +--+ # ) +--+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +--+ +--+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +--+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +--+ +--+ # # 4. KV 缓存更新 +--+ # if past_key_value is not None: +--+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +--+ # key_states, value_states = past_key_value.update( +--+ # key_states, value_states, self.layer_idx, cache_kwargs +--+ # ) +--+ +--+ # # 5. 准备 Attention Mask +--+ # fa_attention_mask = None +--+ # if attention_mask is not None: +--+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +--+ # fa_attention_mask = (mask_slice != 0) +--+ +--+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +--+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +--+ # input_dtype = query_states.dtype +--+ +--+ # # 6. [核心] 调用 flash_attention_score 算子 +--+ # attn_output = mindspore.ops.flash_attention_score( +--+ # query=query_states, +--+ # key=key_states, +--+ # value=value_states, +--+ # head_num=self.num_heads, +--+ # attn_mask=fa_attention_mask, +--+ # keep_prob=1.0 - self.attention_dropout, +--+ # scalar_value=1.0 / math.sqrt(self.head_dim), +--+ # input_layout="BNSD", +--+ # sparse_mode=0, +--+ # # <--- 修改点 2: 启用内部高精度计算 --- +--+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +--+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +--+ # inner_precise=1 +--+ # ) +--+ +--+ # # 恢复原始数据类型 +--+ # attn_output = attn_output.to(input_dtype) +--+ +--+ # # 7. 调整输出形状 +--+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +--+ # attn_output = self.o_proj(attn_output) +--+ +--+ # attn_weights = None +--+ # if output_attentions: +--+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +--+ +--+ # return attn_output, attn_weights, past_key_value +--+ +--+ # def forward( +--+ # self, +--+ # hidden_states: mindspore.Tensor, +--+ # attention_mask: Optional[mindspore.Tensor] = None, +--+ # position_ids: Optional[mindspore.Tensor] = None, +--+ # past_key_value: Optional[Cache] = None, +--+ # output_attentions: bool = False, +--+ # use_cache: bool = False, +--+ # cache_position: Optional[mindspore.Tensor] = None, +--+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +--+ +--+ # bsz, q_len, _ = hidden_states.shape +--+ +--+ # query_states = self.q_proj(hidden_states) +--+ # key_states = self.k_proj(hidden_states) +--+ # value_states = self.v_proj(hidden_states) +--+ +--+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--+ +--+ # kv_seq_len = key_states.shape[-2] +--+ # if past_key_value is not None: +--+ # if self.layer_idx is None: +--+ # raise ValueError("`layer_idx` must be specified for caching") +--+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +--+ +--+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +--+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +--+ +--+ # if past_key_value is not None: +--+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +--+ # key_states, value_states = past_key_value.update( +--+ # key_states, value_states, self.layer_idx, cache_kwargs +--+ # ) +--+ +--+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +--+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +--+ +--+ # # <--- 核心修改点: 手动进行高精度缩放 --- +--+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +--+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +--+ # query_states = query_states / math.sqrt(self.head_dim) +--+ # # <--- 修改结束 --- +--+ +--+ # fa_attention_mask = None +--+ # if attention_mask is not None: +--+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +--+ # fa_attention_mask = (mask_slice != 0) +--+ +--+ # input_dtype = query_states.dtype +--+ +--+ # attn_output = mindspore.ops.flash_attention_score( +--+ # query=query_states, # 传入已经预先缩放过的 query +--+ # key=key_states, +--+ # value=value_states, +--+ # head_num=self.num_heads, +--+ # attn_mask=fa_attention_mask, +--+ # keep_prob=1.0 - self.attention_dropout, +--+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +--+ # input_layout="BNSD", +--+ # sparse_mode=0, +--+ # inner_precise=1 # 仍然保持内部高精度计算 +--+ # ) +--+ +--+ # attn_output = attn_output.to(input_dtype) +--+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +--+ # attn_output = self.o_proj(attn_output) +--+ +--+ # attn_weights = None +--+ # if output_attentions: +--+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +--+ +--+ # return attn_output, attn_weights, past_key_value +--+ +-- QWEN2MOE_ATTENTION_CLASSES = { +-- "eager": Qwen2MoeAttention, +--+ "flash-attention": Qwen2MoeFlashAttention, +-- } +-- +-- +--@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-- +--+ #@dwj +--+ # 只遍历激活的专家,而非全部专家 +-- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +--- batch_size, sequence_length, hidden_dim = hidden_states.shape +--- hidden_states = hidden_states.view(-1, hidden_dim) +--- # router_logits: (batch * sequence_length, n_experts) +--- router_logits = self.gate(hidden_states) +--- +--- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +--- if self.norm_topk_prob: +--- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +--- # we cast back to the input dtype +--- routing_weights = routing_weights.to(hidden_states.dtype) +--- +--- final_hidden_states = ops.zeros( +--- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +--- ) +--- +--- # One hot encode the selected experts to create an expert mask +--- # this will be used to easily index which expert is going to be sollicitated +--- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +--- +--- # Loop over all available experts in the model and perform the computation on each expert +--- for expert_idx in range(self.num_experts): +--- expert_layer = self.experts[expert_idx] +--- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +--- +--- # Index the correct hidden states and compute the expert hidden state for +--- # the current expert. We need to make sure to multiply the output hidden +--- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +--- if 0 not in idx.shape: +--- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +--- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +--- +--- # However `index_add_` only support torch tensors for indexing so we'll use +--- # the `top_x` tensor here. +--- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +--- +--- shared_expert_output = self.shared_expert(hidden_states) +--- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +--- +--- final_hidden_states = final_hidden_states + shared_expert_output +--+ batch_size, sequence_length, hidden_dim = hidden_states.shape +--+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +--+ num_tokens = hidden_states_reshaped.shape[0] +--+ +--+ router_logits = self.gate(hidden_states_reshaped) +--+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +--+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +--+ +--+ if self.norm_topk_prob: +--+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +--+ routing_weights = routing_weights.to(hidden_states.dtype) +--+ +--+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +--+ flat_selected_experts = selected_experts.flatten() +--+ +--+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +--+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +--+ token_indices = broadcasted_token_indices.flatten() +--+ +--+ active_experts = ops.unique(flat_selected_experts) +--+ +--+ for expert_idx_tensor in active_experts: +--+ expert_idx = expert_idx_tensor.item() +--+ expert_layer = self.experts[expert_idx] +--+ +--+ mask = (flat_selected_experts == expert_idx_tensor) +--+ selected_token_indices = token_indices[mask] +--+ selected_routing_weights = routing_weights.flatten()[mask] +--+ +--+ current_states = hidden_states_reshaped[selected_token_indices] +--+ +--+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +--+ +--+ final_hidden_states = final_hidden_states.index_add( +--+ dim=0, +--+ index=selected_token_indices, +--+ source=expert_output.to(hidden_states.dtype) +--+ ) +--+ +--+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +--+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-- +--- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +--- return final_hidden_states, router_logits +--+ final_hidden_states = final_hidden_states + shared_expert_output +--+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +--+ +--+ return final_hidden_states, router_logits +-- +-- +-- class Qwen2MoeDecoderLayer(nn.Module): +--@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-- +-- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-- +--+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +--+ +-- if (layer_idx not in config.mlp_only_layers) and ( +-- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-- ): +--@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-- _skip_keys_device_placement = "past_key_values" +-- _supports_cache_class = True +--+#lwx +--+ # _supports_static_cache = True +-- +-- def _init_weights(self, module): +-- std = self.config.initializer_range +--@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-- return causal_mask +-- +-- +---class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +--+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-- _tied_weights_keys = ["lm_head.weight"] +-- +-- def __init__(self, config): +--@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-- self.num_experts_per_tok = config.num_experts_per_tok +-- # Initialize weights and apply final processing +-- self.post_init() +--+ # @lwx +--+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +--+ # self.generation_config.cache_implementation = "static" +--+ self._warmed_up = False +--+ +--+ def warmup_moe_model(self): +--+ print("[Warmup] Qwen2-MoE 模型预热开始...") +--+ test_texts = [ +--+ "warmup short", +--+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +--+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +--+ ] +--+ tokenizer = getattr(self, "_warmup_tokenizer", None) +--+ if tokenizer is None: +--+ from mindnlp.transformers import AutoTokenizer +--+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +--+ self._warmup_tokenizer = tokenizer +--+ +--+ for text in test_texts: +--+ inputs = tokenizer(text, return_tensors="ms") +--+ with mindspore._no_grad(): +--+ _ = self(**inputs, output_router_logits=True, use_cache=False) +--+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-- +-- def get_input_embeddings(self): +-- return self.model.embed_tokens +--@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-- ```""" +--+ if not self._warmed_up: +--+ self._warmed_up = True +--+ self.warmup_moe_model() +-- +-- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-- output_router_logits = ( +--@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-- } +-- ) +-- return model_inputs +--+# @lwx +--+ # def _decode_one_tokens_logits( +--+ # self, +--+ # cur_token: mindspore.Tensor, +--+ # input_pos: Optional[mindspore.Tensor], +--+ # cache_position: mindspore.Tensor, +--+ # past_key_values: StaticCache, +--+ # ) -> mindspore.Tensor: +--+ # """ +--+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +--+ +--+ # Args: +--+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +--+ # input_pos: 输入位置信息,可选 +--+ # cache_position: 当前token在cache中的位置,shape为(1,) +--+ # past_key_values: StaticCache对象,存储之前的key-value状态 +--+ +--+ # Returns: +--+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +--+ # """ +--+ # # 调用JIT编译的版本 +--+ # return self.get_decode_one_tokens_logits( +--+ # cur_token=cur_token, +--+ # input_pos=input_pos, +--+ # cache_position=cache_position, +--+ # past_key_values=past_key_values, +--+ # ) +--+ +--+ # @mindspore.jit(jit_level='O1') +--+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +--+ # """ +--+ # JIT编译的函数,用于高效的单token解码 +--+ # 使用JIT编译优化以支持静态shape和高效执行 +--+ +--+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +--+ # """ +--+ # outputs = self.model.forward( +--+ # input_ids=cur_token, +--+ # position_ids=input_pos, +--+ # cache_position=cache_position, +--+ # past_key_values=past_key_values, +--+ # use_cache=True, +--+ # return_dict=False, +--+ # ) +--+ +--+ # hidden_states = outputs[0] +--+ # logits = self.lm_head.forward(hidden_states) +--+ # logits = logits.float() +--+ +--+ # return logits[:, -1, :] +--+ +--+ # def _sample( +--+ # self, +--+ # input_ids: mindspore.Tensor, +--+ # logits_processor, +--+ # stopping_criteria, +--+ # generation_config, +--+ # synced_devices: bool, +--+ # streamer=None, +--+ # logits_warper=None, +--+ # **model_kwargs, +--+ # ): +--+ # """ +--+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +--+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +--+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +--+ # """ +--+ # from ...generation.logits_process import LogitsProcessorList +--+ # from ...generation.stopping_criteria import StoppingCriteriaList +--+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +--+ # from mindnlp.core import nn, ops, no_grad +--+ # import numpy as np +--+ +--+ # # 检查是否使用 StaticCache +--+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +--+ # # 否则,直接调用父类方法 +--+ # past_key_values = model_kwargs.get("past_key_values") +--+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +--+ +--+ # if not isinstance(past_key_values, StaticCache): +--+ # # 不使用 StaticCache,直接调用父类方法 +--+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +--+ # return super()._sample( +--+ # input_ids=input_ids, +--+ # logits_processor=logits_processor, +--+ # stopping_criteria=stopping_criteria, +--+ # generation_config=generation_config, +--+ # synced_devices=synced_devices, +--+ # streamer=streamer, +--+ # logits_warper=logits_warper, +--+ # **model_kwargs, +--+ # ) +--+ +--+ # # 使用 StaticCache,进入自定义循环 +--+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +--+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +--+ # pad_token_id = generation_config._pad_token_tensor +--+ # output_attentions = generation_config.output_attentions +--+ # output_hidden_states = generation_config.output_hidden_states +--+ # output_scores = generation_config.output_scores +--+ # output_logits = generation_config.output_logits +--+ # return_dict_in_generate = generation_config.return_dict_in_generate +--+ # max_length = generation_config.max_length +--+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +--+ # do_sample = generation_config.do_sample +--+ +--+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +--+ # raise ValueError( +--+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +--+ # f"{logits_warper})." +--+ # ) +--+ +--+ # # init attention / hidden states / scores tuples +--+ # scores = () if (return_dict_in_generate and output_scores) else None +--+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +--+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +--+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +--+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +--+ +--+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +--+ # if return_dict_in_generate and self.config.is_encoder_decoder: +--+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +--+ # encoder_hidden_states = ( +--+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +--+ # ) +--+ +--+ # # keep track of which sequences are already finished +--+ # batch_size, cur_len = input_ids.shape +--+ # this_peer_finished = False +--+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +--+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +--+ +--+ # time_record = [] +--+ # from ....utils.testing_utils import parse_flag_from_env +--+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +--+ +--+ # while self._has_unfinished_sequences( +--+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +--+ # ): +--+ # if _record_time: +--+ # import time as time_module +--+ # infer_start = time_module.time() +--+ +--+ # # prepare model inputs +--+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +--+ +--+ # # prepare variable output controls +--+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +--+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +--+ +--+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +--+ # cur_cache_position = model_inputs.get("cache_position") +--+ # cur_past_key_values = model_inputs.get("past_key_values") +--+ # cur_input_ids = model_inputs.get("input_ids") +--+ +--+ # if (isinstance(cur_past_key_values, StaticCache) and +--+ # cur_cache_position is not None and +--+ # len(cur_cache_position.shape) > 0 and +--+ # cur_cache_position.shape[0] == 1 and +--+ # cur_input_ids is not None and +--+ # cur_input_ids.shape[1] == 1): +--+ # # 使用 JIT 优化的单 token 解码 +--+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +--+ # if not hasattr(self, '_jit_used'): +--+ # self._jit_used = False +--+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +--+ +--+ # next_token_logits = self.get_decode_one_tokens_logits( +--+ # cur_token=cur_input_ids, +--+ # input_pos=model_inputs.get("position_ids"), +--+ # cache_position=cur_cache_position, +--+ # past_key_values=cur_past_key_values, +--+ # ) +--+ +--+ # # 标记已使用JIT(用于后续判断) +--+ # if not self._jit_used: +--+ # self._jit_used = True +--+ +--+ # # 构造兼容的输出对象 +--+ # class JitOptimizedOutput: +--+ # def __init__(self, logits, config): +--+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +--+ # self.config = config +--+ # # 对于 JIT 优化路径,这些属性通常不需要 +--+ # self.decoder_attentions = None if config.is_encoder_decoder else None +--+ # self.attentions = None if not config.is_encoder_decoder else None +--+ # self.cross_attentions = None +--+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +--+ # self.hidden_states = None if not config.is_encoder_decoder else None +--+ +--+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +--+ # else: +--+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +--+ # outputs = self(**model_inputs, return_dict=True) +--+ +--+ # if synced_devices and this_peer_finished: +--+ # continue +--+ +--+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +--+ # next_token_logits = outputs.logits[:, -1, :] +--+ +--+ # # pre-process distribution +--+ # next_token_scores = logits_processor(input_ids, next_token_logits) +--+ # if do_sample: +--+ # next_token_scores = logits_warper(input_ids, next_token_scores) +--+ +--+ # # Store scores, attentions and hidden_states when required +--+ # if return_dict_in_generate: +--+ # if output_scores: +--+ # scores += (next_token_scores,) +--+ # if output_logits: +--+ # raw_logits += (next_token_logits,) +--+ # if output_attentions: +--+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +--+ # decoder_attentions += (attn,) if attn is not None else (None,) +--+ # if self.config.is_encoder_decoder: +--+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +--+ +--+ # if output_hidden_states: +--+ # hidden = ( +--+ # outputs.decoder_hidden_states +--+ # if self.config.is_encoder_decoder +--+ # else outputs.hidden_states +--+ # ) +--+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +--+ +--+ # # token selection +--+ # if do_sample: +--+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +--+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +--+ # else: +--+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +--+ +--+ # # finished sentences should have their next token be a padding token +--+ # if has_eos_stopping_criteria: +--+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +--+ +--+ # # update generated ids, model inputs, and length for next step +--+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +--+ # if streamer is not None: +--+ # streamer.put(next_tokens) +--+ +--+ # model_kwargs = self._update_model_kwargs_for_generation( +--+ # outputs, +--+ # model_kwargs, +--+ # is_encoder_decoder=self.config.is_encoder_decoder, +--+ # ) +--+ +--+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +--+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +--+ # cur_len += 1 +--+ +--+ # if _record_time: +--+ # import time as time_module +--+ # infer_stop = time_module.time() +--+ # time_record.append(infer_stop - infer_start) +--+ +--+ # del outputs +--+ +--+ # average_infer_time = None +--+ # if time_record: +--+ # if len(time_record) > 1: +--+ # time_record.pop(0) +--+ # average_infer_time = sum(time_record) / len(time_record) +--+ # print(f'average inference time is: {average_infer_time}') +--+ # print(f'inference time record: {time_record}') +--+ +--+ # if streamer is not None: +--+ # streamer.end() +--+ +--+ # # 简单判断:打印是否使用了JIT路径 +--+ # if hasattr(self, '_jit_used') and self._jit_used: +--+ # print("[JIT] ✓ JIT optimization was used during generation") +--+ # else: +--+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +--+ +--+ # if return_dict_in_generate: +--+ # if self.config.is_encoder_decoder: +--+ # return GenerateEncoderDecoderOutput( +--+ # sequences=input_ids, +--+ # scores=scores, +--+ # logits=raw_logits, +--+ # encoder_attentions=encoder_attentions, +--+ # encoder_hidden_states=encoder_hidden_states, +--+ # decoder_attentions=decoder_attentions, +--+ # cross_attentions=cross_attentions, +--+ # decoder_hidden_states=decoder_hidden_states, +--+ # past_key_values=model_kwargs.get("past_key_values"), +--+ # average_infer_time=average_infer_time +--+ # ) +--+ # else: +--+ # return GenerateDecoderOnlyOutput( +--+ # sequences=input_ids, +--+ # scores=scores, +--+ # logits=raw_logits, +--+ # attentions=decoder_attentions, +--+ # hidden_states=decoder_hidden_states, +--+ # past_key_values=model_kwargs.get("past_key_values"), +--+ # average_infer_time=average_infer_time +--+ # ) +--+ # else: +--+ # return input_ids +--+ +--+ # def _prepare_cache_for_generation( +--+ # self, +--+ # generation_config, +--+ # model_kwargs, +--+ # assistant_model, +--+ # batch_size, +--+ # max_cache_length, +--+ # ): +--+ # if generation_config.cache_implementation is None and self._supports_static_cache: +--+ # generation_config.cache_implementation = "static" +--+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +--+ +--+ # if generation_config.cache_implementation == "static": +--+ # base_required_from_max_length = generation_config.max_length + 1 +--+ # base_required = max(max_cache_length, base_required_from_max_length) +--+ # min_cache_size = 50 +--+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +--+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +--+ # else: +--+ # max_cache_length = max(base_required, min_cache_size) +--+ +--+ # original_max_cache_length = max_cache_length +--+ # print(f"[JIT] StaticCache max_cache_length calculation:") +--+ # print(f" - input max_cache_length: {original_max_cache_length}") +--+ # print(f" - generation_config.max_length: {generation_config.max_length}") +--+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +--+ # print(f" - final max_cache_length: {max_cache_length}") +--+ +--+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +--+ # if max_cache_length > self.config.max_position_embeddings: +--+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +--+ +--+ # result = super()._prepare_cache_for_generation( +--+ # generation_config=generation_config, +--+ # model_kwargs=model_kwargs, +--+ # assistant_model=assistant_model, +--+ # batch_size=batch_size, +--+ # max_cache_length=max_cache_length, +--+ # ) +--+ +--+ # if generation_config.cache_implementation == "static": +--+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +--+ # created_cache = model_kwargs.get(cache_name) +--+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +--+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +--+ # if created_cache.max_cache_len < generation_config.max_length: +--+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +--+ +--+ # return result +--+ +--+ +--+ +-- +-- +-- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +---- +--2.27.0 +-- +--- +-2.27.0 +- +diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +deleted file mode 100644 +index bc5549ca..00000000 +--- a/patches/0004-20251106change.patch ++++ /dev/null +@@ -1,7498 +0,0 @@ +-From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Thu, 6 Nov 2025 15:48:09 +0800 +-Subject: [PATCH 4/8] 20251106change +- +---- +- .../models/deepseek/modeling_deepseek.py | 189 +- +- patches/0001-20251104commit.patch | 1272 +++++++ +- patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +- patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +- 4 files changed, 7244 insertions(+), 186 deletions(-) +- create mode 100644 patches/0001-20251104commit.patch +- create mode 100644 patches/0002-20251106commit.patch +- create mode 100644 patches/0003-20261106secondcommit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index 2f9192bf..0546f318 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +- +- return attn_output, attn_weights, past_key_value +- +--# class DeepseekFlashAttention(nn.Module): +--# """ +--# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +--# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-- +--# This class is designed as a drop-in replacement for DeepseekAttention. +--# """ +-- +--# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +--# super().__init__() +--# self.config = config +--# self.layer_idx = layer_idx +--# if layer_idx is None: +--# logger.warning( +--# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +--# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +--# "when creating this class." +--# ) +-- +--# self.attention_dropout = config.attention_dropout +--# self.hidden_size = config.hidden_size +--# self.num_heads = config.num_attention_heads +--# self.head_dim = self.hidden_size // self.num_heads +--# self.num_key_value_heads = config.num_key_value_heads +--# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +--# self.max_position_embeddings = config.max_position_embeddings +--# self.rope_theta = config.rope_theta +--# self.is_causal = True +-- +--# if (self.head_dim * self.num_heads) != self.hidden_size: +--# raise ValueError( +--# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +--# f" and `num_heads`: {self.num_heads})." +--# ) +-- +--# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +--# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +--# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +--# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +--# self._init_rope() +-- +--# def _init_rope(self): +--# if self.config.rope_scaling is None: +--# self.rotary_emb = DeepseekRotaryEmbedding( +--# self.head_dim, +--# max_position_embeddings=self.max_position_embeddings, +--# base=self.rope_theta, +--# ) +--# else: +--# scaling_type = self.config.rope_scaling["type"] +--# scaling_factor = self.config.rope_scaling["factor"] +--# if scaling_type == "linear": +--# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +--# self.head_dim, +--# max_position_embeddings=self.max_position_embeddings, +--# scaling_factor=scaling_factor, +--# base=self.rope_theta, +--# ) +--# elif scaling_type == "dynamic": +--# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +--# self.head_dim, +--# max_position_embeddings=self.max_position_embeddings, +--# scaling_factor=scaling_factor, +--# base=self.rope_theta, +--# ) +--# else: +--# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-- +--# def forward( +--# self, +--# hidden_states: mindspore.Tensor, +--# attention_mask: Optional[mindspore.Tensor] = None, +--# position_ids: Optional[mindspore.Tensor] = None, +--# past_key_value: Optional[Cache] = None, +--# output_attentions: bool = False, +--# use_cache: bool = False, +--# **kwargs, +--# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +--# if "padding_mask" in kwargs: +--# warnings.warn( +--# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +--# ) +-- +--# if output_attentions: +--# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-- +--# bsz, q_len, _ = hidden_states.shape +-- +--# if self.config.pretraining_tp > 1: +--# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-- +--# query_states = self.q_proj(hidden_states) +--# key_states = self.k_proj(hidden_states) +--# value_states = self.v_proj(hidden_states) +-- +--# # Reshape for multi-head attention +--# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +--# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +--# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-- +--# kv_seq_len = key_states.shape[-2] +--# if past_key_value is not None: +--# if self.layer_idx is None: +--# raise ValueError( +--# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +--# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +--# "with a layer index." +--# ) +--# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-- +--# # Apply Rotary Positional Embedding +--# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +--# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-- +--# if past_key_value is not None: +--# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +--# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-- +--# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +--# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +--# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-- +--# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +--# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-- +--# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +--# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-- +--# # Convert attention_mask for flash_attention_score +--# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +--# if attention_mask is not None: +--# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +--# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +--# raise ValueError( +--# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +--# ) +--# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +--# else: +--# attn_mask_for_fa = None +-- +--# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-- +--# # Call the fused flash_attention_score operator +--# attn_output = mindspore.ops.flash_attention_score( +--# query=query_states_for_fa, +--# key=key_states_for_fa, +--# value=value_states_for_fa, +--# head_num=self.num_heads, # This is N1, the number of query heads +--# input_layout='BSH', +--# attn_mask=attn_mask_for_fa, +--# keep_prob=keep_prob, +--# scalar_value=1.0 / math.sqrt(self.head_dim), +--# sparse_mode=0 # Default mask mode +--# ) +-- +--# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +--# attn_output = self.o_proj(attn_output) +-- +--# # Flash Attention does not return attention weights +--# attn_weights = None +-- +--# return attn_output, attn_weights, past_key_value +- +- class DeepseekFlashAttention(nn.Module): +- """ +-@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +- super().__init__() +- self.hidden_size = config.hidden_size +- +-- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-- config=config, layer_idx=layer_idx +-- ) +-+ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-+ # config=config, layer_idx=layer_idx +-+ # ) +- +- self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +- config=config, layer_idx=layer_idx +-@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +- return outputs +- +- +-- +- class DeepseekPreTrainedModel(PreTrainedModel): +- config_class = DeepseekConfig +- base_model_prefix = "model" +-@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- # Initialize weights and apply final processing +- self.post_init() +- self.warm_up = False +-- #@dwj +-- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-- self.num_layers, +-- self.num_attention_heads, +-- self.head_dim, +-- batch_size=1, +-- max_length=self.max_length, +-- dtype=mindspore.float16 +-- ) +-- +-- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-- key_cache = [] +-- value_cache = [] +-- for _ in range(num_layers): +-- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-- key_cache.append(k) +-- value_cache.append(v) +-- return key_cache, value_cache +-- +- +- def warmup_moe_model_deep(self): +- print("[Warmup] DeepSeek-MoE 模型预热开始...") +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-new file mode 100644 +-index 00000000..78f22642 +---- /dev/null +-+++ b/patches/0001-20251104commit.patch +-@@ -0,0 +1,1272 @@ +-+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+Subject: [PATCH 1/3] 20251104commit +-+ +-+--- +-+ mindnlp/transformers/cache_utils.py | 28 +- +-+ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+ 3 files changed, 976 insertions(+), 87 deletions(-) +-+ +-+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+index cadd2e04..02f8d4be 100644 +-+--- a/mindnlp/transformers/cache_utils.py +-++++ b/mindnlp/transformers/cache_utils.py +-+@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+ # k_out[:, :, cache_position] = key_states +-+ # v_out[:, :, cache_position] = value_states +-+- if ON_ORANGE_PI: +-+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+- else: +-+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+- +-++ # if ON_ORANGE_PI: +-++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++ # else: +-++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++ if cache_position.ndim > 1: +-++ cache_position = cache_position.flatten() +-++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++ cache_position = cache_position.int() +-++ +-++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++ k_out[:, :, cache_position] = key_states +-++ v_out[:, :, cache_position] = value_states +-++ +-+ return k_out, v_out +-+ +-+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index c695b944..d8303e45 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+- x1 = x[..., : x.shape[-1] // 2] +-+- x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++ # x1 = x[..., : x.shape[-1] // 2] +-++ # x2 = x[..., x.shape[-1] // 2 :] +-++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+ if self.training: +-+ raise NotImplementedError("Training is not supported yet.") +-+ else: +-+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+- if self.config.n_shared_experts is not None: +-+- y = y + self.shared_experts(identity) +-+- return y +-++ # @lwx +-++ if orig_shape[1] == 1: +-++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++ y=y.view(*orig_shape) +-++ if self.config.n_shared_experts is not None: +-++ y = y + self.shared_experts(identity) +-++ return y +-++ else: +-++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++ if self.config.n_shared_experts is not None: +-++ y = y + self.shared_experts(identity) +-++ return y +-++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++ # if self.config.n_shared_experts is not None: +-++ # y = y + self.shared_experts(identity) +-++ # return y +-++ +-++ @no_grad() +-++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ +-++ expert_cache = ops.zeros_like(x) +-++ for i in range(self.num_experts_per_tok): +-++ expert_id = flat_expert_indices[i].item() +-++ weight = flat_expert_weights[i].item() +-++ expert = self.experts[expert_id] +-++ expert_out = expert(x) +-++ expert_cache += expert_out * weight +-++ return expert_cache +-+ +-+ @no_grad() +-+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+- # expert_cache = torch.zeros_like(x) +-+- # idxs = flat_expert_indices.argsort() +-+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+- # token_idxs = idxs // self.num_experts_per_tok +-+- # for i, end_idx in enumerate(tokens_per_expert): +-+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+- # if start_idx == end_idx: +-+- # continue +-+- # expert = self.experts[i] +-+- # exp_token_idx = token_idxs[start_idx:end_idx] +-+- # expert_tokens = x[exp_token_idx] +-+- # expert_out = expert(expert_tokens) +-+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+- # return expert_cache +-++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ expert_cache = ops.zeros_like(x) +-+ idxs = flat_expert_indices.argsort() +-+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+ token_idxs = idxs // self.num_experts_per_tok +-++ +-+ for i, end_idx in enumerate(tokens_per_expert): +-+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+ if start_idx == end_idx: +-+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+ expert_out = expert(expert_tokens) +-+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-+ return expert_cache +-++ +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # # expert_cache = torch.zeros_like(x) +-++ # # idxs = flat_expert_indices.argsort() +-++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++ # # token_idxs = idxs // self.num_experts_per_tok +-++ # # for i, end_idx in enumerate(tokens_per_expert): +-++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++ # # if start_idx == end_idx: +-++ # # continue +-++ # # expert = self.experts[i] +-++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # # expert_tokens = x[exp_token_idx] +-++ # # expert_out = expert(expert_tokens) +-++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++ # # return expert_cache +-++ # expert_cache = ops.zeros_like(x) +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # for i, end_idx in enumerate(tokens_per_expert): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # if start_idx == end_idx: +-++ # continue +-++ # expert = self.experts[i] +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = expert(expert_tokens) +-++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-++ # return expert_cache +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # expert_cache = ops.zeros_like(x) +-++ +-++ # # 排序保证顺序一致 +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # # 找出有 token 的专家 +-++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++ +-++ # for i in active_experts.tolist(): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # end_idx = tokens_per_expert[i] +-++ # if start_idx == end_idx: # 没有 token +-++ # continue +-++ +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = self.experts[i](expert_tokens) +-++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++ +-++ # expert_cache = mindspore.mint.scatter_add( +-++ # expert_cache, +-++ # 0, +-++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++ # expert_out +-++ # ) +-++ +-++ # return expert_cache +-++ +-++ +-+ +-+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+ # """ +-+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ +-+ # Initialize weights and apply final processing +-+ self.post_init() +-++ self.warm_up = False +-++ +-++ def warmup_moe_model_deep(self): +-++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++ test_texts = [ +-++ "warmup short", +-++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++ ] +-++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++ if tokenizer is None: +-++ from mindnlp.transformers import AutoTokenizer +-++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++ self._warmup_tokenizer = tokenizer +-++ +-++ for text in test_texts: +-++ inputs = tokenizer(text, return_tensors="ms") +-++ with mindspore._no_grad(): +-++ _ = self(**inputs, use_cache=False) +-++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+ +-+ def get_input_embeddings(self): +-+ return self.model.embed_tokens +-+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+ ```""" +-++ if not self.warm_up: +-++ self.warm_up = True +-++ self.warmup_moe_model_deep() +-++ +-+ output_attentions = ( +-+ output_attentions +-+ if output_attentions is not None +-+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+index 3cbf820e..d4c6b651 100644 +-+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+@@ -18,7 +18,6 @@ +-+ # See the License for the specific language governing permissions and +-+ # limitations under the License. +-+ """MindSpore Qwen2MoE model.""" +-+- +-+ import math +-+ from typing import List, Optional, Tuple, Union +-+ +-+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+ TokenClassifierOutput, +-+ ) +-+ from ...modeling_utils import PreTrainedModel +-++from ...generation import GenerationMixin +-+ from ....utils import logging +-+ from .configuration_qwen2_moe import Qwen2MoeConfig +-+ +-+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+ self.variance_epsilon = eps +-+ +-+ def forward(self, hidden_states): +-++ # @dwj +-++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++ # @lwx +-++ # if not self.training : +-++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+ input_dtype = hidden_states.dtype +-+ hidden_states = hidden_states.to(mindspore.float32) +-+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+@@ -234,6 +239,8 @@ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+ x1 = x[..., : x.shape[-1] // 2] +-+ x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+ self.config = config +-+ self.hidden_size = config.hidden_size +-+ self.intermediate_size = intermediate_size +-++ +-+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+ self.act_fn = ACT2FN[config.hidden_act] +-+ +-+ def forward(self, x): +-+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+- +-+ +-++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++ # @lwx +-++ # gate_up_output = self.gate_up_proj(x) +-++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++ # return self.down_proj(swiglu_output) +-++ +-++ # def forward(self, x): +-++ # gate_proj_out = self.gate_proj(x) +-++ # up_proj_out = self.up_proj(x) +-++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++ # return self.down_proj(swiglu_out) +-++ +-+ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+ """ +-+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+ use_cache: bool = False, +-+ cache_position: Optional[mindspore.Tensor] = None, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ +-++ +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ query_states = self.q_proj(hidden_states) +-+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+ "with a layer index." +-+ ) +-+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ if isinstance(past_key_value, StaticCache): +-++ kv_seq_len = key_states.shape[-2] +-++ else: +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++ if isinstance(past_key_value, StaticCache): +-++ kv_seq_len = key_states.shape[-2] +-+ +-+ # repeat k/v heads if n_kv_heads < n_heads +-+ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+- +-++ +-+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+ +-+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+- raise ValueError( +-+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+- f" {attn_weights.shape}" +-+- ) +-+- +-+- if attention_mask is not None: # no matter the length, we just slice it +-+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++ if attention_mask is not None: +-++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+ attn_weights = attn_weights + causal_mask +-+ +-+ # upcast attention to fp32 +-+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+ +-+ attn_output = self.o_proj(attn_output) +-+- +-++ # @lwx +-++ +-++ # max_seq_len = self.max_position_embeddings # 2048 +-++ +-++ # if attention_mask is not None: +-++ # # attention_mask: [B, 1, Sq, Sk] +-++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++ +-++ # # pad 到 [max_seq_len, max_seq_len] +-++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++ # global_attention_mask = padded_mask +-++ # else: +-++ # global_attention_mask = None +-++ +-++ +-++ # sparse_mode=3 +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, +-++ # key=key_states, +-++ # value=value_states, +-++ # real_shift=None, +-++ # padding_mask=None, +-++ +-++ # head_num=self.num_heads, +-++ # attn_mask=global_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++ # input_layout="BNSD", +-++ # pre_tokens=2147483647, +-++ # next_tokens=2147483647, +-++ # inner_precise=0, +-++ # drop_mask=None, +-++ # prefix=None, +-++ # actual_seq_qlen=None, +-++ # actual_seq_kvlen=None, +-++ # sparse_mode=sparse_mode, +-++ # ) +-+ if not output_attentions: +-+ attn_weights = None +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+ +-++class Qwen2MoeFlashAttention(nn.Module): +-++ """ +-++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++ +-++ 关键改动: +-++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++ 直接传入原始的 key 和 value 张量效率更高。 +-++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++ """ +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++ super().__init__() +-++ self.config = config +-++ self.layer_idx = layer_idx +-++ self.hidden_size = config.hidden_size +-++ self.num_heads = config.num_attention_heads +-++ self.head_dim = self.hidden_size // self.num_heads +-++ self.num_key_value_heads = config.num_key_value_heads +-++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++ self.max_position_embeddings = config.max_position_embeddings +-++ self.rope_theta = config.rope_theta +-++ self.attention_dropout = config.attention_dropout +-++ +-++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++ raise ValueError( +-++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++ ) +-++ +-++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++ +-++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++ self.head_dim, +-++ max_position_embeddings=self.max_position_embeddings, +-++ base=self.rope_theta, +-++ ) +-++ +-++ def forward( +-++ self, +-++ hidden_states: mindspore.Tensor, +-++ attention_mask: Optional[mindspore.Tensor] = None, +-++ position_ids: Optional[mindspore.Tensor] = None, +-++ past_key_value: Optional[Cache] = None, +-++ output_attentions: bool = False, +-++ use_cache: bool = False, +-++ cache_position: Optional[mindspore.Tensor] = None, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ # 1. 线性投射 Q, K, V +-++ query_states = self.q_proj(hidden_states) +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++ # query: [B, S, H*D] -> [B, N1, S, D] +-++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # 3. RoPE 旋转位置编码 +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++ if self.layer_idx is None: +-++ raise ValueError( +-++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ "with a layer index." +-++ ) +-++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++ if cache_position.shape[0] == 1: +-++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++ kv_seq_len = past_seen_tokens + 1 +-++ else: +-++ # prefill 阶段:cache_position 是范围,使用其长度 +-++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++ else: +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # 4. KV 缓存更新 +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ key_states, value_states = past_key_value.update( +-++ key_states, value_states, self.layer_idx, cache_kwargs +-++ ) +-++ +-++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++ if cache_position.shape[0] == 1: +-++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++ kv_seq_len = key_states.shape[-2] +-++ +-++ # 5. [重要] 准备 Attention Mask +-++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++ fa_attention_mask = None +-++ if attention_mask is not None: +-++ # 截取与当前key长度匹配的部分 +-++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++ fa_attention_mask = (mask_slice != 0) +-++ +-++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++ input_dtype = query_states.dtype +-++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++ query_states = query_states.to(mindspore.float16) +-++ key_states = key_states.to(mindspore.float16) +-++ value_states = value_states.to(mindspore.float16) +-++ +-++ # 6. [核心] 调用 flash_attention_score 算子 +-++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++ attn_output = mindspore.ops.flash_attention_score( +-++ query=query_states, +-++ key=key_states, +-++ value=value_states, +-++ head_num=self.num_heads, # 传入Q的头数(N1) +-++ attn_mask=fa_attention_mask, +-++ keep_prob=1.0 - self.attention_dropout, +-++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++ input_layout="BNSD", +-++ sparse_mode=0 # 使用 defaultMask 模式 +-++ ) +-++ +-++ # 恢复原始数据类型 +-++ attn_output = attn_output.to(input_dtype) +-++ +-++ # 7. 调整输出形状 +-++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ attn_output = self.o_proj(attn_output) +-++ +-++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++ attn_weights = None +-++ if output_attentions: +-++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++ # def forward( +-++ # self, +-++ # hidden_states: mindspore.Tensor, +-++ # attention_mask: Optional[mindspore.Tensor] = None, +-++ # position_ids: Optional[mindspore.Tensor] = None, +-++ # past_key_value: Optional[Cache] = None, +-++ # output_attentions: bool = False, +-++ # use_cache: bool = False, +-++ # cache_position: Optional[mindspore.Tensor] = None, +-++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ # bsz, q_len, _ = hidden_states.shape +-++ +-++ # # 1. 线性投射 Q, K, V +-++ # query_states = self.q_proj(hidden_states) +-++ # key_states = self.k_proj(hidden_states) +-++ # value_states = self.v_proj(hidden_states) +-++ +-++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # # 3. RoPE 旋转位置编码 +-++ # kv_seq_len = key_states.shape[-2] +-++ # if past_key_value is not None: +-++ # if self.layer_idx is None: +-++ # raise ValueError( +-++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ # "with a layer index." +-++ # ) +-++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # # 4. KV 缓存更新 +-++ # if past_key_value is not None: +-++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ # key_states, value_states = past_key_value.update( +-++ # key_states, value_states, self.layer_idx, cache_kwargs +-++ # ) +-++ +-++ # # 5. 准备 Attention Mask +-++ # fa_attention_mask = None +-++ # if attention_mask is not None: +-++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # fa_attention_mask = (mask_slice != 0) +-++ +-++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++ # input_dtype = query_states.dtype +-++ +-++ # # 6. [核心] 调用 flash_attention_score 算子 +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, +-++ # key=key_states, +-++ # value=value_states, +-++ # head_num=self.num_heads, +-++ # attn_mask=fa_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++ # input_layout="BNSD", +-++ # sparse_mode=0, +-++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++ # inner_precise=1 +-++ # ) +-++ +-++ # # 恢复原始数据类型 +-++ # attn_output = attn_output.to(input_dtype) +-++ +-++ # # 7. 调整输出形状 +-++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ # attn_output = self.o_proj(attn_output) +-++ +-++ # attn_weights = None +-++ # if output_attentions: +-++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++ # return attn_output, attn_weights, past_key_value +-++ +-++ # def forward( +-++ # self, +-++ # hidden_states: mindspore.Tensor, +-++ # attention_mask: Optional[mindspore.Tensor] = None, +-++ # position_ids: Optional[mindspore.Tensor] = None, +-++ # past_key_value: Optional[Cache] = None, +-++ # output_attentions: bool = False, +-++ # use_cache: bool = False, +-++ # cache_position: Optional[mindspore.Tensor] = None, +-++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ # bsz, q_len, _ = hidden_states.shape +-++ +-++ # query_states = self.q_proj(hidden_states) +-++ # key_states = self.k_proj(hidden_states) +-++ # value_states = self.v_proj(hidden_states) +-++ +-++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ # kv_seq_len = key_states.shape[-2] +-++ # if past_key_value is not None: +-++ # if self.layer_idx is None: +-++ # raise ValueError("`layer_idx` must be specified for caching") +-++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ # if past_key_value is not None: +-++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ # key_states, value_states = past_key_value.update( +-++ # key_states, value_states, self.layer_idx, cache_kwargs +-++ # ) +-++ +-++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++ +-++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++ # query_states = query_states / math.sqrt(self.head_dim) +-++ # # <--- 修改结束 --- +-++ +-++ # fa_attention_mask = None +-++ # if attention_mask is not None: +-++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ # fa_attention_mask = (mask_slice != 0) +-++ +-++ # input_dtype = query_states.dtype +-++ +-++ # attn_output = mindspore.ops.flash_attention_score( +-++ # query=query_states, # 传入已经预先缩放过的 query +-++ # key=key_states, +-++ # value=value_states, +-++ # head_num=self.num_heads, +-++ # attn_mask=fa_attention_mask, +-++ # keep_prob=1.0 - self.attention_dropout, +-++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++ # input_layout="BNSD", +-++ # sparse_mode=0, +-++ # inner_precise=1 # 仍然保持内部高精度计算 +-++ # ) +-++ +-++ # attn_output = attn_output.to(input_dtype) +-++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ # attn_output = self.o_proj(attn_output) +-++ +-++ # attn_weights = None +-++ # if output_attentions: +-++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++ +-++ # return attn_output, attn_weights, past_key_value +-++ +-+ QWEN2MOE_ATTENTION_CLASSES = { +-+ "eager": Qwen2MoeAttention, +-++ "flash-attention": Qwen2MoeFlashAttention, +-+ } +-+ +-+ +-+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-++ #@dwj +-++ # 只遍历激活的专家,而非全部专家 +-+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- hidden_states = hidden_states.view(-1, hidden_dim) +-+- # router_logits: (batch * sequence_length, n_experts) +-+- router_logits = self.gate(hidden_states) +-+- +-+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- if self.norm_topk_prob: +-+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- # we cast back to the input dtype +-+- routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+- final_hidden_states = ops.zeros( +-+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+- ) +-+- +-+- # One hot encode the selected experts to create an expert mask +-+- # this will be used to easily index which expert is going to be sollicitated +-+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+- +-+- # Loop over all available experts in the model and perform the computation on each expert +-+- for expert_idx in range(self.num_experts): +-+- expert_layer = self.experts[expert_idx] +-+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+- +-+- # Index the correct hidden states and compute the expert hidden state for +-+- # the current expert. We need to make sure to multiply the output hidden +-+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+- if 0 not in idx.shape: +-+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+- +-+- # However `index_add_` only support torch tensors for indexing so we'll use +-+- # the `top_x` tensor here. +-+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+- +-+- shared_expert_output = self.shared_expert(hidden_states) +-+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+- +-+- final_hidden_states = final_hidden_states + shared_expert_output +-++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++ num_tokens = hidden_states_reshaped.shape[0] +-++ +-++ router_logits = self.gate(hidden_states_reshaped) +-++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++ if self.norm_topk_prob: +-++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++ flat_selected_experts = selected_experts.flatten() +-++ +-++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++ token_indices = broadcasted_token_indices.flatten() +-++ +-++ active_experts = ops.unique(flat_selected_experts) +-++ +-++ for expert_idx_tensor in active_experts: +-++ expert_idx = expert_idx_tensor.item() +-++ expert_layer = self.experts[expert_idx] +-++ +-++ mask = (flat_selected_experts == expert_idx_tensor) +-++ selected_token_indices = token_indices[mask] +-++ selected_routing_weights = routing_weights.flatten()[mask] +-++ +-++ current_states = hidden_states_reshaped[selected_token_indices] +-++ +-++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++ +-++ final_hidden_states = final_hidden_states.index_add( +-++ dim=0, +-++ index=selected_token_indices, +-++ source=expert_output.to(hidden_states.dtype) +-++ ) +-++ +-++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+ +-+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+- return final_hidden_states, router_logits +-++ final_hidden_states = final_hidden_states + shared_expert_output +-++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++ +-++ return final_hidden_states, router_logits +-+ +-+ +-+ class Qwen2MoeDecoderLayer(nn.Module): +-+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+ +-+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++ +-+ if (layer_idx not in config.mlp_only_layers) and ( +-+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+ ): +-+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+ _skip_keys_device_placement = "past_key_values" +-+ _supports_cache_class = True +-++#lwx +-++ # _supports_static_cache = True +-+ +-+ def _init_weights(self, module): +-+ std = self.config.initializer_range +-+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+ return causal_mask +-+ +-+ +-+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ _tied_weights_keys = ["lm_head.weight"] +-+ +-+ def __init__(self, config): +-+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ self.num_experts_per_tok = config.num_experts_per_tok +-+ # Initialize weights and apply final processing +-+ self.post_init() +-++ # @lwx +-++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++ # self.generation_config.cache_implementation = "static" +-++ self._warmed_up = False +-++ +-++ def warmup_moe_model(self): +-++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++ test_texts = [ +-++ "warmup short", +-++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++ ] +-++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++ if tokenizer is None: +-++ from mindnlp.transformers import AutoTokenizer +-++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++ self._warmup_tokenizer = tokenizer +-++ +-++ for text in test_texts: +-++ inputs = tokenizer(text, return_tensors="ms") +-++ with mindspore._no_grad(): +-++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+ +-+ def get_input_embeddings(self): +-+ return self.model.embed_tokens +-+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+ ```""" +-++ if not self._warmed_up: +-++ self._warmed_up = True +-++ self.warmup_moe_model() +-+ +-+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+ output_router_logits = ( +-+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+ } +-+ ) +-+ return model_inputs +-++# @lwx +-++ # def _decode_one_tokens_logits( +-++ # self, +-++ # cur_token: mindspore.Tensor, +-++ # input_pos: Optional[mindspore.Tensor], +-++ # cache_position: mindspore.Tensor, +-++ # past_key_values: StaticCache, +-++ # ) -> mindspore.Tensor: +-++ # """ +-++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++ +-++ # Args: +-++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++ # input_pos: 输入位置信息,可选 +-++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++ +-++ # Returns: +-++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++ # """ +-++ # # 调用JIT编译的版本 +-++ # return self.get_decode_one_tokens_logits( +-++ # cur_token=cur_token, +-++ # input_pos=input_pos, +-++ # cache_position=cache_position, +-++ # past_key_values=past_key_values, +-++ # ) +-++ +-++ # @mindspore.jit(jit_level='O1') +-++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++ # """ +-++ # JIT编译的函数,用于高效的单token解码 +-++ # 使用JIT编译优化以支持静态shape和高效执行 +-++ +-++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++ # """ +-++ # outputs = self.model.forward( +-++ # input_ids=cur_token, +-++ # position_ids=input_pos, +-++ # cache_position=cache_position, +-++ # past_key_values=past_key_values, +-++ # use_cache=True, +-++ # return_dict=False, +-++ # ) +-++ +-++ # hidden_states = outputs[0] +-++ # logits = self.lm_head.forward(hidden_states) +-++ # logits = logits.float() +-++ +-++ # return logits[:, -1, :] +-++ +-++ # def _sample( +-++ # self, +-++ # input_ids: mindspore.Tensor, +-++ # logits_processor, +-++ # stopping_criteria, +-++ # generation_config, +-++ # synced_devices: bool, +-++ # streamer=None, +-++ # logits_warper=None, +-++ # **model_kwargs, +-++ # ): +-++ # """ +-++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++ # """ +-++ # from ...generation.logits_process import LogitsProcessorList +-++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++ # from mindnlp.core import nn, ops, no_grad +-++ # import numpy as np +-++ +-++ # # 检查是否使用 StaticCache +-++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++ # # 否则,直接调用父类方法 +-++ # past_key_values = model_kwargs.get("past_key_values") +-++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++ +-++ # if not isinstance(past_key_values, StaticCache): +-++ # # 不使用 StaticCache,直接调用父类方法 +-++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++ # return super()._sample( +-++ # input_ids=input_ids, +-++ # logits_processor=logits_processor, +-++ # stopping_criteria=stopping_criteria, +-++ # generation_config=generation_config, +-++ # synced_devices=synced_devices, +-++ # streamer=streamer, +-++ # logits_warper=logits_warper, +-++ # **model_kwargs, +-++ # ) +-++ +-++ # # 使用 StaticCache,进入自定义循环 +-++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++ # pad_token_id = generation_config._pad_token_tensor +-++ # output_attentions = generation_config.output_attentions +-++ # output_hidden_states = generation_config.output_hidden_states +-++ # output_scores = generation_config.output_scores +-++ # output_logits = generation_config.output_logits +-++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++ # max_length = generation_config.max_length +-++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++ # do_sample = generation_config.do_sample +-++ +-++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++ # raise ValueError( +-++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++ # f"{logits_warper})." +-++ # ) +-++ +-++ # # init attention / hidden states / scores tuples +-++ # scores = () if (return_dict_in_generate and output_scores) else None +-++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++ +-++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++ # encoder_hidden_states = ( +-++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++ # ) +-++ +-++ # # keep track of which sequences are already finished +-++ # batch_size, cur_len = input_ids.shape +-++ # this_peer_finished = False +-++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++ +-++ # time_record = [] +-++ # from ....utils.testing_utils import parse_flag_from_env +-++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++ +-++ # while self._has_unfinished_sequences( +-++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++ # ): +-++ # if _record_time: +-++ # import time as time_module +-++ # infer_start = time_module.time() +-++ +-++ # # prepare model inputs +-++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++ +-++ # # prepare variable output controls +-++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++ +-++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++ # cur_cache_position = model_inputs.get("cache_position") +-++ # cur_past_key_values = model_inputs.get("past_key_values") +-++ # cur_input_ids = model_inputs.get("input_ids") +-++ +-++ # if (isinstance(cur_past_key_values, StaticCache) and +-++ # cur_cache_position is not None and +-++ # len(cur_cache_position.shape) > 0 and +-++ # cur_cache_position.shape[0] == 1 and +-++ # cur_input_ids is not None and +-++ # cur_input_ids.shape[1] == 1): +-++ # # 使用 JIT 优化的单 token 解码 +-++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++ # if not hasattr(self, '_jit_used'): +-++ # self._jit_used = False +-++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++ +-++ # next_token_logits = self.get_decode_one_tokens_logits( +-++ # cur_token=cur_input_ids, +-++ # input_pos=model_inputs.get("position_ids"), +-++ # cache_position=cur_cache_position, +-++ # past_key_values=cur_past_key_values, +-++ # ) +-++ +-++ # # 标记已使用JIT(用于后续判断) +-++ # if not self._jit_used: +-++ # self._jit_used = True +-++ +-++ # # 构造兼容的输出对象 +-++ # class JitOptimizedOutput: +-++ # def __init__(self, logits, config): +-++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++ # self.config = config +-++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++ # self.attentions = None if not config.is_encoder_decoder else None +-++ # self.cross_attentions = None +-++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++ +-++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++ # else: +-++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++ # outputs = self(**model_inputs, return_dict=True) +-++ +-++ # if synced_devices and this_peer_finished: +-++ # continue +-++ +-++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++ # next_token_logits = outputs.logits[:, -1, :] +-++ +-++ # # pre-process distribution +-++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++ # if do_sample: +-++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++ +-++ # # Store scores, attentions and hidden_states when required +-++ # if return_dict_in_generate: +-++ # if output_scores: +-++ # scores += (next_token_scores,) +-++ # if output_logits: +-++ # raw_logits += (next_token_logits,) +-++ # if output_attentions: +-++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++ # if self.config.is_encoder_decoder: +-++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++ +-++ # if output_hidden_states: +-++ # hidden = ( +-++ # outputs.decoder_hidden_states +-++ # if self.config.is_encoder_decoder +-++ # else outputs.hidden_states +-++ # ) +-++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++ +-++ # # token selection +-++ # if do_sample: +-++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++ # else: +-++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++ +-++ # # finished sentences should have their next token be a padding token +-++ # if has_eos_stopping_criteria: +-++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++ +-++ # # update generated ids, model inputs, and length for next step +-++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++ # if streamer is not None: +-++ # streamer.put(next_tokens) +-++ +-++ # model_kwargs = self._update_model_kwargs_for_generation( +-++ # outputs, +-++ # model_kwargs, +-++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++ # ) +-++ +-++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++ # cur_len += 1 +-++ +-++ # if _record_time: +-++ # import time as time_module +-++ # infer_stop = time_module.time() +-++ # time_record.append(infer_stop - infer_start) +-++ +-++ # del outputs +-++ +-++ # average_infer_time = None +-++ # if time_record: +-++ # if len(time_record) > 1: +-++ # time_record.pop(0) +-++ # average_infer_time = sum(time_record) / len(time_record) +-++ # print(f'average inference time is: {average_infer_time}') +-++ # print(f'inference time record: {time_record}') +-++ +-++ # if streamer is not None: +-++ # streamer.end() +-++ +-++ # # 简单判断:打印是否使用了JIT路径 +-++ # if hasattr(self, '_jit_used') and self._jit_used: +-++ # print("[JIT] ✓ JIT optimization was used during generation") +-++ # else: +-++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++ +-++ # if return_dict_in_generate: +-++ # if self.config.is_encoder_decoder: +-++ # return GenerateEncoderDecoderOutput( +-++ # sequences=input_ids, +-++ # scores=scores, +-++ # logits=raw_logits, +-++ # encoder_attentions=encoder_attentions, +-++ # encoder_hidden_states=encoder_hidden_states, +-++ # decoder_attentions=decoder_attentions, +-++ # cross_attentions=cross_attentions, +-++ # decoder_hidden_states=decoder_hidden_states, +-++ # past_key_values=model_kwargs.get("past_key_values"), +-++ # average_infer_time=average_infer_time +-++ # ) +-++ # else: +-++ # return GenerateDecoderOnlyOutput( +-++ # sequences=input_ids, +-++ # scores=scores, +-++ # logits=raw_logits, +-++ # attentions=decoder_attentions, +-++ # hidden_states=decoder_hidden_states, +-++ # past_key_values=model_kwargs.get("past_key_values"), +-++ # average_infer_time=average_infer_time +-++ # ) +-++ # else: +-++ # return input_ids +-++ +-++ # def _prepare_cache_for_generation( +-++ # self, +-++ # generation_config, +-++ # model_kwargs, +-++ # assistant_model, +-++ # batch_size, +-++ # max_cache_length, +-++ # ): +-++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++ # generation_config.cache_implementation = "static" +-++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++ +-++ # if generation_config.cache_implementation == "static": +-++ # base_required_from_max_length = generation_config.max_length + 1 +-++ # base_required = max(max_cache_length, base_required_from_max_length) +-++ # min_cache_size = 50 +-++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++ # else: +-++ # max_cache_length = max(base_required, min_cache_size) +-++ +-++ # original_max_cache_length = max_cache_length +-++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++ # print(f" - final max_cache_length: {max_cache_length}") +-++ +-++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++ # if max_cache_length > self.config.max_position_embeddings: +-++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++ +-++ # result = super()._prepare_cache_for_generation( +-++ # generation_config=generation_config, +-++ # model_kwargs=model_kwargs, +-++ # assistant_model=assistant_model, +-++ # batch_size=batch_size, +-++ # max_cache_length=max_cache_length, +-++ # ) +-++ +-++ # if generation_config.cache_implementation == "static": +-++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++ # created_cache = model_kwargs.get(cache_name) +-++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++ # if created_cache.max_cache_len < generation_config.max_length: +-++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++ +-++ # return result +-++ +-++ +-++ +-+ +-+ +-+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+-- +-+2.27.0 +-+ +-diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-new file mode 100644 +-index 00000000..22b65dd5 +---- /dev/null +-+++ b/patches/0002-20251106commit.patch +-@@ -0,0 +1,3200 @@ +-+From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+Subject: [PATCH 2/3] 20251106commit +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +-+ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +-+ 3 files changed, 2689 insertions(+), 305 deletions(-) +-+ create mode 100644 patches/0001-20251104commit.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index d8303e45..73773c22 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +-+ # y = y + self.shared_experts(identity) +-+ # return y +-+ +-++ # @no_grad() +-++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ +-++ # expert_cache = ops.zeros_like(x) +-++ # for i in range(self.num_experts_per_tok): +-++ # expert_id = flat_expert_indices[i].item() +-++ # weight = flat_expert_weights[i].item() +-++ # expert = self.experts[expert_id] +-++ # expert_out = expert(x) +-++ # expert_cache += expert_out * weight +-++ # return expert_cache +-++ +-+ @no_grad() +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ # x 的 shape: (1, hidden_size) +-++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++ +-++ # 1. 收集所有需要的专家层 +-++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-++ +-++ # 2. 并行计算所有专家的输出 +-++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++ # ops.cat 会将它们堆叠成一个新的 Tensor +-++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++ +-++ # 3. 使用矩阵乘法进行加权求和 +-++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ # 最终结果 final_output 的 shape: (1, hidden_size) +-++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++ +-++ return final_output +-+ +-+- expert_cache = ops.zeros_like(x) +-+- for i in range(self.num_experts_per_tok): +-+- expert_id = flat_expert_indices[i].item() +-+- weight = flat_expert_weights[i].item() +-+- expert = self.experts[expert_id] +-+- expert_out = expert(x) +-+- expert_cache += expert_out * weight +-+- return expert_cache +-+ +-+ @no_grad() +-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++ # @lwx +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+ +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +-+ return attn_output, attn_weights, past_key_value +-+ +-+ +-++# class DeepseekFlashAttention(nn.Module): +-++# """ +-++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-++ +-++# This class is designed as a drop-in replacement for DeepseekAttention. +-++# """ +-++ +-++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++# super().__init__() +-++# self.config = config +-++# self.layer_idx = layer_idx +-++# if layer_idx is None: +-++# logger.warning( +-++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++# "when creating this class." +-++# ) +-++ +-++# self.attention_dropout = config.attention_dropout +-++# self.hidden_size = config.hidden_size +-++# self.num_heads = config.num_attention_heads +-++# self.head_dim = self.hidden_size // self.num_heads +-++# self.num_key_value_heads = config.num_key_value_heads +-++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++# self.max_position_embeddings = config.max_position_embeddings +-++# self.rope_theta = config.rope_theta +-++# self.is_causal = True +-++ +-++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++# raise ValueError( +-++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++# f" and `num_heads`: {self.num_heads})." +-++# ) +-++ +-++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++# self._init_rope() +-++ +-++# def _init_rope(self): +-++# if self.config.rope_scaling is None: +-++# self.rotary_emb = DeepseekRotaryEmbedding( +-++# self.head_dim, +-++# max_position_embeddings=self.max_position_embeddings, +-++# base=self.rope_theta, +-++# ) +-++# else: +-++# scaling_type = self.config.rope_scaling["type"] +-++# scaling_factor = self.config.rope_scaling["factor"] +-++# if scaling_type == "linear": +-++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++# self.head_dim, +-++# max_position_embeddings=self.max_position_embeddings, +-++# scaling_factor=scaling_factor, +-++# base=self.rope_theta, +-++# ) +-++# elif scaling_type == "dynamic": +-++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++# self.head_dim, +-++# max_position_embeddings=self.max_position_embeddings, +-++# scaling_factor=scaling_factor, +-++# base=self.rope_theta, +-++# ) +-++# else: +-++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++ +-++# def forward( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# attention_mask: Optional[mindspore.Tensor] = None, +-++# position_ids: Optional[mindspore.Tensor] = None, +-++# past_key_value: Optional[Cache] = None, +-++# output_attentions: bool = False, +-++# use_cache: bool = False, +-++# **kwargs, +-++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++# if "padding_mask" in kwargs: +-++# warnings.warn( +-++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++# ) +-++ +-++# if output_attentions: +-++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-++ +-++# bsz, q_len, _ = hidden_states.shape +-++ +-++# if self.config.pretraining_tp > 1: +-++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++ +-++# query_states = self.q_proj(hidden_states) +-++# key_states = self.k_proj(hidden_states) +-++# value_states = self.v_proj(hidden_states) +-++ +-++# # Reshape for multi-head attention +-++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++# kv_seq_len = key_states.shape[-2] +-++# if past_key_value is not None: +-++# if self.layer_idx is None: +-++# raise ValueError( +-++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++# "with a layer index." +-++# ) +-++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++# # Apply Rotary Positional Embedding +-++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++# if past_key_value is not None: +-++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ +-++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++ +-++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++ +-++# # Convert attention_mask for flash_attention_score +-++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-++# if attention_mask is not None: +-++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++# raise ValueError( +-++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++# ) +-++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-++# else: +-++# attn_mask_for_fa = None +-++ +-++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++ +-++# # Call the fused flash_attention_score operator +-++# attn_output = mindspore.ops.flash_attention_score( +-++# query=query_states_for_fa, +-++# key=key_states_for_fa, +-++# value=value_states_for_fa, +-++# head_num=self.num_heads, # This is N1, the number of query heads +-++# input_layout='BSH', +-++# attn_mask=attn_mask_for_fa, +-++# keep_prob=keep_prob, +-++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++# sparse_mode=0 # Default mask mode +-++# ) +-++ +-++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-++# attn_output = self.o_proj(attn_output) +-++ +-++# # Flash Attention does not return attention weights +-++# attn_weights = None +-++ +-++# return attn_output, attn_weights, past_key_value +-++ +-++class DeepseekFlashAttention(nn.Module): +-++ """ +-++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-++ This implementation is a drop-in replacement for the original DeepseekAttention class, +-++ designed for high performance on supported hardware (Ascend). +-++ +-++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-++ """ +-++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++ super().__init__() +-++ self.config = config +-++ self.layer_idx = layer_idx +-++ if layer_idx is None: +-++ logger.warning( +-++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++ "when creating this class." +-++ ) +-++ +-++ # --- [FIX] Correctly initialize all required attributes --- +-++ self.attention_dropout = config.attention_dropout +-++ self.hidden_size = config.hidden_size +-++ self.num_heads = config.num_attention_heads +-++ self.head_dim = self.hidden_size // self.num_heads +-++ self.num_key_value_heads = config.num_key_value_heads +-++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++ self.max_position_embeddings = config.max_position_embeddings +-++ self.rope_theta = config.rope_theta +-++ self.is_causal = True +-++ +-++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++ raise ValueError( +-++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++ f" and `num_heads`: {self.num_heads})." +-++ ) +-++ +-++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++ +-++ # This call will now succeed as all attributes are initialized. +-++ self._init_rope() +-++ +-++ def _init_rope(self): +-++ if self.config.rope_scaling is None: +-++ self.rotary_emb = DeepseekRotaryEmbedding( +-++ self.head_dim, +-++ max_position_embeddings=self.max_position_embeddings, +-++ base=self.rope_theta, +-++ ) +-++ else: +-++ scaling_type = self.config.rope_scaling["type"] +-++ scaling_factor = self.config.rope_scaling["factor"] +-++ if scaling_type == "linear": +-++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++ self.head_dim, +-++ max_position_embeddings=self.max_position_embeddings, +-++ scaling_factor=scaling_factor, +-++ base=self.rope_theta, +-++ ) +-++ elif scaling_type == "dynamic": +-++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++ self.head_dim, +-++ max_position_embeddings=self.max_position_embeddings, +-++ scaling_factor=scaling_factor, +-++ base=self.rope_theta, +-++ ) +-++ else: +-++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++ +-++ def forward( +-++ self, +-++ hidden_states: mindspore.Tensor, +-++ attention_mask: Optional[mindspore.Tensor] = None, +-++ position_ids: Optional[mindspore.Tensor] = None, +-++ past_key_value: Optional[Cache] = None, +-++ output_attentions: bool = False, +-++ use_cache: bool = False, +-++ **kwargs, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ if "padding_mask" in kwargs: +-++ warnings.warn( +-++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++ ) +-++ if output_attentions: +-++ warnings.warn( +-++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-++ ) +-++ +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ if self.config.pretraining_tp > 1: +-++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++ +-++ query_states = self.q_proj(hidden_states) +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++ # Apply Rotary Position Embedding +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos} +-++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-++ # So we must explicitly repeat the KV heads. +-++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++ +-++ # Convert attention mask for flash_attention_score +-++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-++ if attention_mask is not None: +-++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++ raise ValueError( +-++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++ ) +-++ attn_mask_for_fa = attention_mask < 0 +-++ else: +-++ attn_mask_for_fa = None +-++ +-++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++ +-++ # Call the fused operator using the efficient BNSD layout +-++ attn_output = mindspore.ops.flash_attention_score( +-++ query=query_states, +-++ key=key_states, +-++ value=value_states, +-++ head_num=self.num_heads, +-++ input_layout='BNSD', # Specify the correct layout +-++ attn_mask=attn_mask_for_fa, +-++ keep_prob=keep_prob, +-++ scalar_value=1.0 / math.sqrt(self.head_dim) +-++ ) +-++ +-++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ +-++ # Apply output projection +-++ attn_output = self.o_proj(attn_output) +-++ +-++ # Flash attention does not return attention weights, so we return None. +-++ attn_weights = None +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-+ Deepseek_ATTENTION_CLASSES = { +-+ "eager": DeepseekAttention, +-++ "flash-attention": DeepseekFlashAttention, +-+ } +-+ +-+ +-+@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +-+ config=config, layer_idx=layer_idx +-+ ) +-+ +-++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-++ config=config, layer_idx=layer_idx +-++ ) +-++ +-+ self.mlp = ( +-+ DeepseekMoE(config) +-+ if ( +-+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+index d4c6b651..bced285c 100644 +-+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +-+ +-+ import mindspore +-+ import mindnlp.core.nn.functional as F +-+-from mindnlp.core import nn, ops +-++from mindnlp.core import nn, ops, no_grad +-+ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +-+ +-+ from ....common.activations import ACT2FN +-+@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +-+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+ +-++Long_Prompt = False +-++PROMPT_LENGTH_THRESHOLD = 128 +-+ +-+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+ def _prepare_4d_causal_attention_mask_with_cache_position( +-+@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +-+ return attn_output, attn_weights, past_key_value +-+ +-+ +-++# class Qwen2MoeFlashAttention(nn.Module): +-++# """ +-++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++ +-++# 关键改动: +-++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++# 直接传入原始的 key 和 value 张量效率更高。 +-++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++# super().__init__() +-++# self.config = config +-++# self.layer_idx = layer_idx +-++# self.hidden_size = config.hidden_size +-++# self.num_heads = config.num_attention_heads +-++# self.head_dim = self.hidden_size // self.num_heads +-++# self.num_key_value_heads = config.num_key_value_heads +-++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++# self.max_position_embeddings = config.max_position_embeddings +-++# self.rope_theta = config.rope_theta +-++# self.attention_dropout = config.attention_dropout +-++ +-++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++# raise ValueError( +-++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++# ) +-++ +-++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++ +-++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++# self.head_dim, +-++# max_position_embeddings=self.max_position_embeddings, +-++# base=self.rope_theta, +-++# ) +-++ +-++# def forward( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# attention_mask: Optional[mindspore.Tensor] = None, +-++# position_ids: Optional[mindspore.Tensor] = None, +-++# past_key_value: Optional[Cache] = None, +-++# output_attentions: bool = False, +-++# use_cache: bool = False, +-++# cache_position: Optional[mindspore.Tensor] = None, +-++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++# bsz, q_len, _ = hidden_states.shape +-++ +-++# # 1. 线性投射 Q, K, V +-++# query_states = self.q_proj(hidden_states) +-++# key_states = self.k_proj(hidden_states) +-++# value_states = self.v_proj(hidden_states) +-++ +-++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++# # query: [B, S, H*D] -> [B, N1, S, D] +-++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++# # 3. RoPE 旋转位置编码 +-++# kv_seq_len = key_states.shape[-2] +-++# if past_key_value is not None: +-++# if self.layer_idx is None: +-++# raise ValueError( +-++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++# "with a layer index." +-++# ) +-++# # 对于 StaticCache,需要特殊处理 kv_seq_len +-++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++# if cache_position.shape[0] == 1: +-++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++# kv_seq_len = past_seen_tokens + 1 +-++# else: +-++# # prefill 阶段:cache_position 是范围,使用其长度 +-++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++# else: +-++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++# # 4. KV 缓存更新 +-++# if past_key_value is not None: +-++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++# key_states, value_states = past_key_value.update( +-++# key_states, value_states, self.layer_idx, cache_kwargs +-++# ) +-++ +-++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++# if cache_position.shape[0] == 1: +-++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++# kv_seq_len = key_states.shape[-2] +-++ +-++# # 5. [重要] 准备 Attention Mask +-++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++# fa_attention_mask = None +-++# if attention_mask is not None: +-++# # 截取与当前key长度匹配的部分 +-++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++# # 转换为布尔类型: 大负数 -> True, 0 -> False +-++# fa_attention_mask = (mask_slice != 0) +-++ +-++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++# input_dtype = query_states.dtype +-++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++# query_states = query_states.to(mindspore.float16) +-++# key_states = key_states.to(mindspore.float16) +-++# value_states = value_states.to(mindspore.float16) +-++ +-++# # 6. [核心] 调用 flash_attention_score 算子 +-++# # - 无需手动 repeat_kv, 算子原生支持 GQA +-++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++# attn_output = mindspore.ops.flash_attention_score( +-++# query=query_states, +-++# key=key_states, +-++# value=value_states, +-++# head_num=self.num_heads, # 传入Q的头数(N1) +-++# attn_mask=fa_attention_mask, +-++# keep_prob=1.0 - self.attention_dropout, +-++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++# input_layout="BNSD", +-++# sparse_mode=0 # 使用 defaultMask 模式 +-++# ) +-++ +-++# # 恢复原始数据类型 +-++# attn_output = attn_output.to(input_dtype) +-++ +-++# # 7. 调整输出形状 +-++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++# attn_output = self.o_proj(attn_output) +-++ +-++# # FlashAttention 算子不直接返回注意力权重矩阵 +-++# attn_weights = None +-++# if output_attentions: +-++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++# return attn_output, attn_weights, past_key_value +-++ +-++# # def forward( +-++# # self, +-++# # hidden_states: mindspore.Tensor, +-++# # attention_mask: Optional[mindspore.Tensor] = None, +-++# # position_ids: Optional[mindspore.Tensor] = None, +-++# # past_key_value: Optional[Cache] = None, +-++# # output_attentions: bool = False, +-++# # use_cache: bool = False, +-++# # cache_position: Optional[mindspore.Tensor] = None, +-++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++# # bsz, q_len, _ = hidden_states.shape +-++ +-++# # # 1. 线性投射 Q, K, V +-++# # query_states = self.q_proj(hidden_states) +-++# # key_states = self.k_proj(hidden_states) +-++# # value_states = self.v_proj(hidden_states) +-++ +-++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-++# # # 3. RoPE 旋转位置编码 +-++# # kv_seq_len = key_states.shape[-2] +-++# # if past_key_value is not None: +-++# # if self.layer_idx is None: +-++# # raise ValueError( +-++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++# # "with a layer index." +-++# # ) +-++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++# # # 4. KV 缓存更新 +-++# # if past_key_value is not None: +-++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++# # key_states, value_states = past_key_value.update( +-++# # key_states, value_states, self.layer_idx, cache_kwargs +-++# # ) +-++ +-++# # # 5. 准备 Attention Mask +-++# # fa_attention_mask = None +-++# # if attention_mask is not None: +-++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++# # fa_attention_mask = (mask_slice != 0) +-++ +-++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++# # input_dtype = query_states.dtype +-++ +-++# # # 6. [核心] 调用 flash_attention_score 算子 +-++# # attn_output = mindspore.ops.flash_attention_score( +-++# # query=query_states, +-++# # key=key_states, +-++# # value=value_states, +-++# # head_num=self.num_heads, +-++# # attn_mask=fa_attention_mask, +-++# # keep_prob=1.0 - self.attention_dropout, +-++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++# # input_layout="BNSD", +-++# # sparse_mode=0, +-++# # # <--- 修改点 2: 启用内部高精度计算 --- +-++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++# # inner_precise=1 +-++# # ) +-++ +-++# # # 恢复原始数据类型 +-++# # attn_output = attn_output.to(input_dtype) +-++ +-++# # # 7. 调整输出形状 +-++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++# # attn_output = self.o_proj(attn_output) +-++ +-++# # attn_weights = None +-++# # if output_attentions: +-++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ +-++# # return attn_output, attn_weights, past_key_value +-++ +-++ +-+ class Qwen2MoeFlashAttention(nn.Module): +-+ """ +-+- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+- +-+- 关键改动: +-+- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+- 直接传入原始的 key 和 value 张量效率更高。 +-+- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-++ +-++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-++ 完全使用模型的低精度数据类型(如 float16)进行计算, +-++ 以达到理论上的最高执行速度。 +-+ """ +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+ super().__init__() +-+ self.config = config +-+ self.layer_idx = layer_idx +-++ if layer_idx is None: +-++ logger.warning_once( +-++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-++ ) +-++ +-+ self.hidden_size = config.hidden_size +-+ self.num_heads = config.num_attention_heads +-+ self.head_dim = self.hidden_size // self.num_heads +-+ self.num_key_value_heads = config.num_key_value_heads +-+- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+ self.max_position_embeddings = config.max_position_embeddings +-+ self.rope_theta = config.rope_theta +-+ self.attention_dropout = config.attention_dropout +-+ +-+- if (self.head_dim * self.num_heads) != self.hidden_size: +-+- raise ValueError( +-+- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+- ) +-+- +-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+- # query: [B, S, H*D] -> [B, N1, S, D] +-+- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++ # 2. 调整形状以匹配 BNSD 布局 +-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- +-+- # 3. RoPE 旋转位置编码 +-++ +-++ # 3. RoPE 和 KV 缓存 +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+- if self.layer_idx is None: +-+- raise ValueError( +-+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+- "with a layer index." +-+- ) +-+- # 对于 StaticCache,需要特殊处理 kv_seq_len +-+- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+- if cache_position.shape[0] == 1: +-+- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+- kv_seq_len = past_seen_tokens + 1 +-+- else: +-+- # prefill 阶段:cache_position 是范围,使用其长度 +-+- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+- else: +-+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+- +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+- # 4. KV 缓存更新 +-+ if past_key_value is not None: +-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+- key_states, value_states = past_key_value.update( +-+- key_states, value_states, self.layer_idx, cache_kwargs +-+- ) +-+- +-+- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+- if cache_position.shape[0] == 1: +-+- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+- kv_seq_len = key_states.shape[-2] +-+- +-+- # 5. [重要] 准备 Attention Mask +-+- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++ # 4. 准备 Attention Mask +-+ fa_attention_mask = None +-+ if attention_mask is not None: +-+- # 截取与当前key长度匹配的部分 +-+- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+- # 转换为布尔类型: 大负数 -> True, 0 -> False +-+ fa_attention_mask = (mask_slice != 0) +-+ +-+- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+- input_dtype = query_states.dtype +-+- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+- query_states = query_states.to(mindspore.float16) +-+- key_states = key_states.to(mindspore.float16) +-+- value_states = value_states.to(mindspore.float16) +-+- +-+- # 6. [核心] 调用 flash_attention_score 算子 +-+- # - 无需手动 repeat_kv, 算子原生支持 GQA +-+- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +-+ attn_output = mindspore.ops.flash_attention_score( +-+ query=query_states, +-+ key=key_states, +-+ value=value_states, +-+- head_num=self.num_heads, # 传入Q的头数(N1) +-++ head_num=self.num_heads, +-+ attn_mask=fa_attention_mask, +-+- keep_prob=1.0 - self.attention_dropout, +-++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+ input_layout="BNSD", +-+- sparse_mode=0 # 使用 defaultMask 模式 +-++ sparse_mode=0, +-++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +-+ ) +-+ +-+- # 恢复原始数据类型 +-+- attn_output = attn_output.to(input_dtype) +-+- +-+- # 7. 调整输出形状 +-+- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++ # 6. 调整输出形状 +-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+ attn_output = self.o_proj(attn_output) +-+ +-+- # FlashAttention 算子不直接返回注意力权重矩阵 +-++ # 7. 返回结果 +-+ attn_weights = None +-+ if output_attentions: +-+- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+- # def forward( +-+- # self, +-+- # hidden_states: mindspore.Tensor, +-+- # attention_mask: Optional[mindspore.Tensor] = None, +-+- # position_ids: Optional[mindspore.Tensor] = None, +-+- # past_key_value: Optional[Cache] = None, +-+- # output_attentions: bool = False, +-+- # use_cache: bool = False, +-+- # cache_position: Optional[mindspore.Tensor] = None, +-+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+- +-+- # bsz, q_len, _ = hidden_states.shape +-+- +-+- # # 1. 线性投射 Q, K, V +-+- # query_states = self.q_proj(hidden_states) +-+- # key_states = self.k_proj(hidden_states) +-+- # value_states = self.v_proj(hidden_states) +-+- +-+- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- +-+- # # 3. RoPE 旋转位置编码 +-+- # kv_seq_len = key_states.shape[-2] +-+- # if past_key_value is not None: +-+- # if self.layer_idx is None: +-+- # raise ValueError( +-+- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+- # "with a layer index." +-+- # ) +-+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+ +-+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+- +-+- # # 4. KV 缓存更新 +-+- # if past_key_value is not None: +-+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+- # key_states, value_states = past_key_value.update( +-+- # key_states, value_states, self.layer_idx, cache_kwargs +-+- # ) +-+- +-+- # # 5. 准备 Attention Mask +-+- # fa_attention_mask = None +-+- # if attention_mask is not None: +-+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+- # fa_attention_mask = (mask_slice != 0) +-+- +-+- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+- # input_dtype = query_states.dtype +-+- +-+- # # 6. [核心] 调用 flash_attention_score 算子 +-+- # attn_output = mindspore.ops.flash_attention_score( +-+- # query=query_states, +-+- # key=key_states, +-+- # value=value_states, +-+- # head_num=self.num_heads, +-+- # attn_mask=fa_attention_mask, +-+- # keep_prob=1.0 - self.attention_dropout, +-+- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+- # input_layout="BNSD", +-+- # sparse_mode=0, +-+- # # <--- 修改点 2: 启用内部高精度计算 --- +-+- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+- # inner_precise=1 +-+- # ) +-+- +-+- # # 恢复原始数据类型 +-+- # attn_output = attn_output.to(input_dtype) +-++QWEN2MOE_ATTENTION_CLASSES = { +-++ "eager": Qwen2MoeAttention, +-++ "flash-attention": Qwen2MoeFlashAttention, +-++} +-+ +-+- # # 7. 调整输出形状 +-+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+- # attn_output = self.o_proj(attn_output) +-+ +-+- # attn_weights = None +-+- # if output_attentions: +-+- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# def __init__(self, config): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# # gating +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++ +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# #@dwj +-++# # 只遍历激活的专家,而非全部专家 +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# num_tokens = hidden_states_reshaped.shape[0] +-++ +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++# routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++# flat_selected_experts = selected_experts.flatten() +-++ +-++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++# token_indices = broadcasted_token_indices.flatten() +-++ +-++# active_experts = ops.unique(flat_selected_experts) +-++ +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++ +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# selected_token_indices = token_indices[mask] +-++# selected_routing_weights = routing_weights.flatten()[mask] +-++ +-++# current_states = hidden_states_reshaped[selected_token_indices] +-++ +-++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++ +-++# final_hidden_states = final_hidden_states.index_add( +-++# dim=0, +-++# index=selected_token_indices, +-++# source=expert_output.to(hidden_states.dtype) +-++# ) +-++ +-++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+ +-+- # return attn_output, attn_weights, past_key_value +-++# final_hidden_states = final_hidden_states + shared_expert_output +-++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-++ +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# """ +-++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-++# `_moe_infer_prefill` (用于长序列处理) 方法。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# # 门控网络 +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# # 专家列表 +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++# # 共享专家 +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# @no_grad() +-++# def _moe_infer_decode( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# """ +-++# 【解码路径】针对 sequence_length=1 的极致优化。 +-++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-++# """ +-++# batch_size, hidden_dim = hidden_states.shape +-++ +-++# expert_outputs_list = [ +-++# ops.cat([ +-++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++# ], dim=0) +-++# for i in range(batch_size) +-++# ] +-++ +-++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-++# # shape: (batch_size, top_k, hidden_dim) +-++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++ +-++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++ +-++# return moe_output.squeeze(1) +-++ +-++# @no_grad() +-++# def _moe_infer_prefill( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# """ +-++# 【预填充路径】针对 sequence_length > 1 的优化。 +-++# 按专家对 Token 进行分组,并进行批处理。 +-++# """ +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens = hidden_states.shape[0] +-++# flat_selected_experts = selected_experts.flatten() +-++ +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++ +-++# active_experts = ops.unique(flat_selected_experts) +-++ +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++ +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# selected_token_indices = token_indices[mask] +-++# selected_routing_weights = routing_weights.flatten()[mask] +-++ +-++# current_states = hidden_states[selected_token_indices] +-++ +-++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++ +-++# moe_output = moe_output.index_add( +-++# dim=0, +-++# index=selected_token_indices, +-++# source=expert_output.to(hidden_states.dtype) +-++# ) +-++# return moe_output +-++ +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# """ +-++# 顶层 forward 方法,作为智能分发器。 +-++# """ +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+- # def forward( +-+- # self, +-+- # hidden_states: mindspore.Tensor, +-+- # attention_mask: Optional[mindspore.Tensor] = None, +-+- # position_ids: Optional[mindspore.Tensor] = None, +-+- # past_key_value: Optional[Cache] = None, +-+- # output_attentions: bool = False, +-+- # use_cache: bool = False, +-+- # cache_position: Optional[mindspore.Tensor] = None, +-+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+- +-+- # bsz, q_len, _ = hidden_states.shape +-+- +-+- # query_states = self.q_proj(hidden_states) +-+- # key_states = self.k_proj(hidden_states) +-+- # value_states = self.v_proj(hidden_states) +-+- +-+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- +-+- # kv_seq_len = key_states.shape[-2] +-+- # if past_key_value is not None: +-+- # if self.layer_idx is None: +-+- # raise ValueError("`layer_idx` must be specified for caching") +-+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+- +-+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+- +-+- # if past_key_value is not None: +-+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+- # key_states, value_states = past_key_value.update( +-+- # key_states, value_states, self.layer_idx, cache_kwargs +-+- # ) +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++# routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++# moe_output = None +-++# # 在推理时,根据序列长度选择最优路径 +-++# if not self.training: +-++# if sequence_length == 1: +-++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++# else: +-++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++# else: +-++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-++# raise NotImplementedError("Training path is not implemented.") +-++ +-++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-++ +-++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-++ +-++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-++ +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# """ +-++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# # 门控网络 +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# # 专家列表 +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++# # 共享专家 +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# @no_grad() +-++# def _moe_infer_decode( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# batch_size, _ = hidden_states.shape +-++# expert_outputs_list = [ +-++# ops.cat([ +-++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++# ], dim=0) +-++# for i in range(batch_size) +-++# ] +-++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++# return moe_output.squeeze(1) +-++ +-++# @no_grad() +-++# def _moe_infer_prefill( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens = hidden_states.shape[0] +-++# flat_selected_experts = selected_experts.flatten() +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++# active_experts = ops.unique(flat_selected_experts) +-++ +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# selected_token_indices = token_indices[mask] +-++# selected_routing_weights = routing_weights.flatten()[mask] +-++# current_states = hidden_states[selected_token_indices] +-++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++# moe_output = moe_output.index_add( +-++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++# ) +-++# return moe_output +-++ +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# """ +-++# 顶层 forward 方法,作为智能分发器。 +-++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-++# """ +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ +-++# # 1. 门控计算 (通用逻辑) +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++# routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++# # 2. 智能分发到最优 MoE 路径 +-++# moe_output = None +-++# if not self.training: +-++# if sequence_length == 1: +-++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++# else: +-++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++# else: +-++# raise NotImplementedError("Training path is not implemented.") +-++ +-++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++ +-++# # 4. 合并 MoE 输出和共享专家输出 +-++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++ +-++# # 5. 恢复原始形状并返回 +-++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-++# prefill fastest +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# """ +-++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# # 门控网络 +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# # 专家列表 +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++# # 共享专家 +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# @no_grad() +-++# def _moe_infer_dispatch( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# """ +-++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-++# """ +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens, _ = hidden_states.shape +-++ +-++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-++# flat_selected_experts = selected_experts.flatten() +-++# flat_routing_weights = routing_weights.flatten() +-+ +-+- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+- +-+- # # <--- 核心修改点: 手动进行高精度缩放 --- +-+- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+- # query_states = query_states / math.sqrt(self.head_dim) +-+- # # <--- 修改结束 --- +-+- +-+- # fa_attention_mask = None +-+- # if attention_mask is not None: +-+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+- # fa_attention_mask = (mask_slice != 0) +-+- +-+- # input_dtype = query_states.dtype +-+- +-+- # attn_output = mindspore.ops.flash_attention_score( +-+- # query=query_states, # 传入已经预先缩放过的 query +-+- # key=key_states, +-+- # value=value_states, +-+- # head_num=self.num_heads, +-+- # attn_mask=fa_attention_mask, +-+- # keep_prob=1.0 - self.attention_dropout, +-+- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+- # input_layout="BNSD", +-+- # sparse_mode=0, +-+- # inner_precise=1 # 仍然保持内部高精度计算 +-+- # ) +-++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+ +-+- # attn_output = attn_output.to(input_dtype) +-+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+- # attn_output = self.o_proj(attn_output) +-++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-++# active_experts = ops.unique(flat_selected_experts) +-++ +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++ +-++# # 找到所有分配给该专家的 token +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++ +-++# # 使用 mask 选取对应的 token 和权重 +-++# current_token_indices = token_indices[mask] +-++# current_routing_weights = flat_routing_weights[mask] +-++# current_hidden_states = hidden_states[current_token_indices] +-++ +-++# # 对这些 token 进行批处理 +-++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++ +-++# # 使用 index_add 将结果精确地加回到对应位置 +-++# moe_output = moe_output.index_add( +-++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-++# ) +-++# return moe_output +-++ +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# """ +-++# 顶层 forward 方法,作为智能分发器。 +-++# """ +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ +-++# # 1. 门控计算 +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++# routing_weights = routing_weights.to(hidden_states.dtype) +-++ +-++# # 2. 调用统一的 MoE 计算内核 +-++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+ +-+- # attn_weights = None +-+- # if output_attentions: +-+- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++# # 3. 统一处理共享专家 +-++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++ +-++# # 4. 合并输出 +-++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++ +-++# # 5. 恢复原始形状并返回 +-++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-++ +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# """ +-++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++# 【最终高性能与高精度版】: +-++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-++# 3. 这样实现了速度和准确性的两全其美。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# @no_grad() +-++# def _moe_infer_decode( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# """ +-++# 【解码路径】极致优化版:bmm + 高精度累加。 +-++# """ +-++# original_dtype = hidden_states.dtype +-++# batch_size, _ = hidden_states.shape +-++ +-++# expert_outputs_list = [ +-++# ops.cat([ +-++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++# ], dim=0) +-++# for i in range(batch_size) +-++# ] +-++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++ +-++# # 在 float32 下执行 bmm,得到高精度结果 +-++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++ +-++# # 将高精度结果转换回原始数据类型 +-++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-++ +-++# return moe_output +-++ +-++# @no_grad() +-++# def _moe_infer_prefill( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# selected_experts: mindspore.Tensor, +-++# routing_weights: mindspore.Tensor +-++# ) -> mindspore.Tensor: +-++# """ +-++# 【预填充路径】与原始实现一致,结果精确。 +-++# """ +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens, _ = hidden_states.shape +-++# flat_selected_experts = selected_experts.flatten() +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++# active_experts = ops.unique(flat_selected_experts) +-++ +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# selected_token_indices = token_indices[mask] +-++# selected_routing_weights = routing_weights.flatten()[mask] +-++# current_states = hidden_states[selected_token_indices] +-++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++# moe_output = moe_output.index_add( +-++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++# ) +-++# return moe_output +-++ +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+ +-+- # return attn_output, attn_weights, past_key_value +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-++# # 如果模型主体是 float16,后续再转换 +-++ +-++# moe_output = None +-++# if not self.training: +-++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-++# # _moe_infer_decode 内部会处理好类型转换 +-++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-++# if sequence_length == 1: +-++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++# else: +-++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++# else: +-++# raise NotImplementedError("Training path is not implemented.") +-++ +-++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++ +-++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-+ +-+-QWEN2MOE_ATTENTION_CLASSES = { +-+- "eager": Qwen2MoeAttention, +-+- "flash-attention": Qwen2MoeFlashAttention, +-+-} +-++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++# """ +-++# 【融合版】一个混合专家模块,内置两种推理策略, +-++# 由外部全局变量 `Long_Prompt` 控制: +-++ +-++# - if Long_Prompt is True: 【精度优先模式】 +-++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-++# 适用于处理长序列,避免误差累积。 +-++ +-++# - if Long_Prompt is False: 【速度优先模式】 +-++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-++# 在解码阶段获得极致速度,同时保证结果高度准确。 +-++# """ +-++# def __init__(self, config: Qwen2MoeConfig): +-++# super().__init__() +-++# self.num_experts = config.num_experts +-++# self.top_k = config.num_experts_per_tok +-++# self.norm_topk_prob = config.norm_topk_prob +-++ +-++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++# self.experts = nn.ModuleList( +-++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++# ) +-++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++# # --- 速度优先模式的辅助函数 --- +-++# @no_grad() +-++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++# original_dtype = hidden_states.dtype +-++# batch_size, _ = hidden_states.shape +-++# expert_outputs_list = [ +-++# ops.cat([ +-++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++# ], dim=0) +-++# for i in range(batch_size) +-++# ] +-++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++# weights_fp32 = routing_weights.to(mindspore.float32) +-++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++# return moe_output_fp32.squeeze(1).to(original_dtype) +-++ +-++# @no_grad() +-++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens, _ = hidden_states.shape +-++# flat_selected_experts = selected_experts.flatten() +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++# active_experts = ops.unique(flat_selected_experts) +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# selected_token_indices = token_indices[mask] +-++# selected_routing_weights = routing_weights.flatten()[mask] +-++# current_states = hidden_states[selected_token_indices] +-++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-++# return moe_output +-++ +-++# # --- 精度优先模式的辅助函数 --- +-++# @no_grad() +-++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++# moe_output = ops.zeros_like(hidden_states) +-++# num_tokens, _ = hidden_states.shape +-++# flat_selected_experts = selected_experts.flatten() +-++# flat_routing_weights = routing_weights.flatten() +-++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++# active_experts = ops.unique(flat_selected_experts) +-++# for expert_idx_tensor in active_experts: +-++# expert_idx = expert_idx_tensor.item() +-++# expert_layer = self.experts[expert_idx] +-++# mask = (flat_selected_experts == expert_idx_tensor) +-++# current_token_indices = token_indices[mask] +-++# current_routing_weights = flat_routing_weights[mask] +-++# current_hidden_states = hidden_states[current_token_indices] +-++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++# return moe_output +-++ +-++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++# # 声明我们将要使用一个在模块外部定义的全局变量 +-++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-++# global Long_Prompt +-++ +-++# # 1. 门控计算 (所有模式通用) +-++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++# router_logits = self.gate(hidden_states_reshaped) +-++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++# if self.norm_topk_prob: +-++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++# moe_output = None +-++# if not self.training: +-++# # 根据 Long_Prompt 标志选择模式 +-++# if Long_Prompt: +-++# # --- 精度优先模式 --- +-++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++# else: +-++# # --- 速度优先模式 --- +-++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++# if sequence_length == 1: +-++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++# else: +-++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++# else: +-++# raise NotImplementedError("Training path is not implemented.") +-++ +-++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++ +-++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++ +-++# return final_hidden_states, router_logits +-++ +-++class Qwen2MoeSparseMoeBlock(nn.Module): +-++ """ +-++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-++ 控制的顶级推理策略: +-+ +-++ - if Long_Prompt is True: 【精度优先模式】 +-++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-++ 适用于需要严格可复现性的长序列任务。 +-+ +-+-class Qwen2MoeSparseMoeBlock(nn.Module): +-+- def __init__(self, config): +-++ - if Long_Prompt is False: 【速度优先模式】 +-++ 采用业界最强的性能组合: +-++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-++ """ +-++ def __init__(self, config: Qwen2MoeConfig): +-+ super().__init__() +-+ self.num_experts = config.num_experts +-+ self.top_k = config.num_experts_per_tok +-+ self.norm_topk_prob = config.norm_topk_prob +-+ +-+- # gating +-+ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+ self.experts = nn.ModuleList( +-+ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+ ) +-+- +-+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+ +-+- #@dwj +-+- # 只遍历激活的专家,而非全部专家 +-+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+- num_tokens = hidden_states_reshaped.shape[0] +-+- +-+- router_logits = self.gate(hidden_states_reshaped) +-+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- +-+- if self.norm_topk_prob: +-+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+- flat_selected_experts = selected_experts.flatten() +-+- +-+- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+- token_indices = broadcasted_token_indices.flatten() +-+- +-+- active_experts = ops.unique(flat_selected_experts) +-+- +-+- for expert_idx_tensor in active_experts: +-+- expert_idx = expert_idx_tensor.item() +-+- expert_layer = self.experts[expert_idx] +-+- +-+- mask = (flat_selected_experts == expert_idx_tensor) +-+- selected_token_indices = token_indices[mask] +-+- selected_routing_weights = routing_weights.flatten()[mask] +-+- +-+- current_states = hidden_states_reshaped[selected_token_indices] +-+- +-+- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+- +-+- final_hidden_states = final_hidden_states.index_add( +-+- dim=0, +-+- index=selected_token_indices, +-+- source=expert_output.to(hidden_states.dtype) +-+- ) +-+- +-+- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-++ @no_grad() +-++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++ original_dtype = hidden_states.dtype +-++ batch_size, _ = hidden_states.shape +-++ expert_outputs_list = [ +-++ ops.cat([ +-++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++ ], dim=0) +-++ for i in range(batch_size) +-++ ] +-++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++ weights_fp32 = routing_weights.to(mindspore.float32) +-++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++ return moe_output_fp32.squeeze(1).to(original_dtype) +-++ +-++ @no_grad() +-++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++ num_tokens, _ = hidden_states.shape +-++ flat_selected_experts = selected_experts.flatten() +-++ sorted_expert_indices = flat_selected_experts.argsort() +-++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++ original_token_indices = sorted_expert_indices // self.top_k +-++ moe_output = ops.zeros_like(hidden_states) +-++ current_token_offset = 0 +-++ for i in range(self.num_experts): +-++ expert_token_count = tokens_per_expert[i] - current_token_offset +-++ if expert_token_count == 0: +-++ continue +-++ end_offset = current_token_offset + expert_token_count +-++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++ expert_hidden_states = hidden_states[expert_original_token_indices] +-++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++ current_token_offset += expert_token_count +-++ return moe_output +-++ +-++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-++ @no_grad() +-++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++ moe_output = ops.zeros_like(hidden_states) +-++ num_tokens, _ = hidden_states.shape +-++ flat_selected_experts = selected_experts.flatten() +-++ flat_routing_weights = routing_weights.flatten() +-++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++ active_experts = ops.unique(flat_selected_experts) +-++ for expert_idx_tensor in active_experts: +-++ expert_idx = expert_idx_tensor.item() +-++ expert_layer = self.experts[expert_idx] +-++ mask = (flat_selected_experts == expert_idx_tensor) +-++ current_token_indices = token_indices[mask] +-++ current_routing_weights = flat_routing_weights[mask] +-++ current_hidden_states = hidden_states[current_token_indices] +-++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++ return moe_output +-+ +-+- final_hidden_states = final_hidden_states + shared_expert_output +-+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+- +-+- return final_hidden_states, router_logits +-++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++ global Long_Prompt +-++ +-++ # 1. 门控计算 (所有模式通用) +-++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++ router_logits = self.gate(hidden_states_reshaped) +-++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++ if self.norm_topk_prob: +-++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++ moe_output = None +-++ if Long_Prompt: +-++ # --- 精度优先模式 (ACCURACY MODE) --- +-++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ else: +-++ # --- 速度优先模式 (SPEED MODE) --- +-++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++ if sequence_length == 1: +-++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ else: +-++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ +-+ +-++ # 3. 共享专家计算与合并 (所有模式通用) +-++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++ +-++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++ +-++ return final_hidden_states, router_logits +-+ +-+ class Qwen2MoeDecoderLayer(nn.Module): +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+ super().__init__() +-+ self.hidden_size = config.hidden_size +-++ +-++ # if Long_Prompt: +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ # else: +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+ +-+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ +-+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+- +-+ if (layer_idx not in config.mlp_only_layers) and ( +-+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+ ): +-+@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ self._warmed_up = True +-+ self.warmup_moe_model() +-+ +-++ +-++ +-+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+ output_router_logits = ( +-+ output_router_logits if output_router_logits is not None else self.config.output_router_logits +-+@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ router_logits=outputs.router_logits, +-+ ) +-+ +-++ def generate(self, *args, **kwargs): +-++ """ +-++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-++ """ +-++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++ +-++ input_ids = kwargs.get("input_ids") +-++ if input_ids is None and args: +-++ input_ids = args[0] +-++ +-++ if input_ids is not None: +-++ prompt_length = input_ids.shape[1] +-++ +-++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-++ Long_Prompt = True +-++ else: +-++ Long_Prompt = False +-++ +-++ return super().generate(*args, **kwargs) +-++ +-+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +-+ def prepare_inputs_for_generation( +-+ self, +-+@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +-+ # Exception 1: when passing input_embeds, input_ids may be missing entries +-+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-++ +-+ if past_key_values is not None: +-+ if inputs_embeds is not None: # Exception 1 +-+ if 0 not in input_ids.shape: +-+@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ } +-+ ) +-+ return model_inputs +-++ +-+ # @lwx +-+ # def _decode_one_tokens_logits( +-+ # self, +-+@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +-+ attentions=outputs.attentions, +-+ ) +-+ +-++ +-+ __all__ = [ +-+ "Qwen2MoeForCausalLM", +-+ "Qwen2MoeModel", +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+new file mode 100644 +-+index 00000000..6dfb5b93 +-+--- /dev/null +-++++ b/patches/0001-20251104commit.patch +-+@@ -0,0 +1,1272 @@ +-++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++Subject: [PATCH] 20251104commit +-++ +-++--- +-++ mindnlp/transformers/cache_utils.py | 28 +- +-++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++ 3 files changed, 976 insertions(+), 87 deletions(-) +-++ +-++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++index cadd2e04..02f8d4be 100644 +-++--- a/mindnlp/transformers/cache_utils.py +-+++++ b/mindnlp/transformers/cache_utils.py +-++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++ # k_out[:, :, cache_position] = key_states +-++ # v_out[:, :, cache_position] = value_states +-++- if ON_ORANGE_PI: +-++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++- else: +-++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++- +-+++ # if ON_ORANGE_PI: +-+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++ # else: +-+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++ if cache_position.ndim > 1: +-+++ cache_position = cache_position.flatten() +-+++ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++ cache_position = cache_position.int() +-+++ +-+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++ k_out[:, :, cache_position] = key_states +-+++ v_out[:, :, cache_position] = value_states +-+++ +-++ return k_out, v_out +-++ +-++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index c695b944..d8303e45 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++- x1 = x[..., : x.shape[-1] // 2] +-++- x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++ # x1 = x[..., : x.shape[-1] // 2] +-+++ # x2 = x[..., x.shape[-1] // 2 :] +-+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++ if self.training: +-++ raise NotImplementedError("Training is not supported yet.") +-++ else: +-++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++- if self.config.n_shared_experts is not None: +-++- y = y + self.shared_experts(identity) +-++- return y +-+++ # @lwx +-+++ if orig_shape[1] == 1: +-+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++ y=y.view(*orig_shape) +-+++ if self.config.n_shared_experts is not None: +-+++ y = y + self.shared_experts(identity) +-+++ return y +-+++ else: +-+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++ if self.config.n_shared_experts is not None: +-+++ y = y + self.shared_experts(identity) +-+++ return y +-+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++ # if self.config.n_shared_experts is not None: +-+++ # y = y + self.shared_experts(identity) +-+++ # return y +-+++ +-+++ @no_grad() +-+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ +-+++ expert_cache = ops.zeros_like(x) +-+++ for i in range(self.num_experts_per_tok): +-+++ expert_id = flat_expert_indices[i].item() +-+++ weight = flat_expert_weights[i].item() +-+++ expert = self.experts[expert_id] +-+++ expert_out = expert(x) +-+++ expert_cache += expert_out * weight +-+++ return expert_cache +-++ +-++ @no_grad() +-++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++- # expert_cache = torch.zeros_like(x) +-++- # idxs = flat_expert_indices.argsort() +-++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++- # token_idxs = idxs // self.num_experts_per_tok +-++- # for i, end_idx in enumerate(tokens_per_expert): +-++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++- # if start_idx == end_idx: +-++- # continue +-++- # expert = self.experts[i] +-++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++- # expert_tokens = x[exp_token_idx] +-++- # expert_out = expert(expert_tokens) +-++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++- # return expert_cache +-+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++ expert_cache = ops.zeros_like(x) +-++ idxs = flat_expert_indices.argsort() +-++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ token_idxs = idxs // self.num_experts_per_tok +-+++ +-++ for i, end_idx in enumerate(tokens_per_expert): +-++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ if start_idx == end_idx: +-++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++ expert_out = expert(expert_tokens) +-++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-++ return expert_cache +-+++ +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # # expert_cache = torch.zeros_like(x) +-+++ # # idxs = flat_expert_indices.argsort() +-+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++ # # token_idxs = idxs // self.num_experts_per_tok +-+++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++ # # if start_idx == end_idx: +-+++ # # continue +-+++ # # expert = self.experts[i] +-+++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # # expert_tokens = x[exp_token_idx] +-+++ # # expert_out = expert(expert_tokens) +-+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++ # # return expert_cache +-+++ # expert_cache = ops.zeros_like(x) +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # for i, end_idx in enumerate(tokens_per_expert): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # if start_idx == end_idx: +-+++ # continue +-+++ # expert = self.experts[i] +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = expert(expert_tokens) +-+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-+++ # return expert_cache +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # expert_cache = ops.zeros_like(x) +-+++ +-+++ # # 排序保证顺序一致 +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # # 找出有 token 的专家 +-+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++ +-+++ # for i in active_experts.tolist(): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # end_idx = tokens_per_expert[i] +-+++ # if start_idx == end_idx: # 没有 token +-+++ # continue +-+++ +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = self.experts[i](expert_tokens) +-+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++ +-+++ # expert_cache = mindspore.mint.scatter_add( +-+++ # expert_cache, +-+++ # 0, +-+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++ # expert_out +-+++ # ) +-+++ +-+++ # return expert_cache +-+++ +-+++ +-++ +-++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++ # """ +-++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ +-++ # Initialize weights and apply final processing +-++ self.post_init() +-+++ self.warm_up = False +-+++ +-+++ def warmup_moe_model_deep(self): +-+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++ test_texts = [ +-+++ "warmup short", +-+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++ ] +-+++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++ if tokenizer is None: +-+++ from mindnlp.transformers import AutoTokenizer +-+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++ self._warmup_tokenizer = tokenizer +-+++ +-+++ for text in test_texts: +-+++ inputs = tokenizer(text, return_tensors="ms") +-+++ with mindspore._no_grad(): +-+++ _ = self(**inputs, use_cache=False) +-+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++ +-++ def get_input_embeddings(self): +-++ return self.model.embed_tokens +-++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++ ```""" +-+++ if not self.warm_up: +-+++ self.warm_up = True +-+++ self.warmup_moe_model_deep() +-+++ +-++ output_attentions = ( +-++ output_attentions +-++ if output_attentions is not None +-++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++index 3cbf820e..d4c6b651 100644 +-++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++@@ -18,7 +18,6 @@ +-++ # See the License for the specific language governing permissions and +-++ # limitations under the License. +-++ """MindSpore Qwen2MoE model.""" +-++- +-++ import math +-++ from typing import List, Optional, Tuple, Union +-++ +-++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++ TokenClassifierOutput, +-++ ) +-++ from ...modeling_utils import PreTrainedModel +-+++from ...generation import GenerationMixin +-++ from ....utils import logging +-++ from .configuration_qwen2_moe import Qwen2MoeConfig +-++ +-++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++ self.variance_epsilon = eps +-++ +-++ def forward(self, hidden_states): +-+++ # @dwj +-+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++ # @lwx +-+++ # if not self.training : +-+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++ input_dtype = hidden_states.dtype +-++ hidden_states = hidden_states.to(mindspore.float32) +-++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++@@ -234,6 +239,8 @@ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++ x1 = x[..., : x.shape[-1] // 2] +-++ x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++ self.config = config +-++ self.hidden_size = config.hidden_size +-++ self.intermediate_size = intermediate_size +-+++ +-++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++ self.act_fn = ACT2FN[config.hidden_act] +-++ +-++ def forward(self, x): +-++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++- +-++ +-+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++ # @lwx +-+++ # gate_up_output = self.gate_up_proj(x) +-+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++ # return self.down_proj(swiglu_output) +-+++ +-+++ # def forward(self, x): +-+++ # gate_proj_out = self.gate_proj(x) +-+++ # up_proj_out = self.up_proj(x) +-+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++ # return self.down_proj(swiglu_out) +-+++ +-++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++ """ +-++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++ use_cache: bool = False, +-++ cache_position: Optional[mindspore.Tensor] = None, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ +-+++ +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ query_states = self.q_proj(hidden_states) +-++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ "with a layer index." +-++ ) +-++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ if isinstance(past_key_value, StaticCache): +-+++ kv_seq_len = key_states.shape[-2] +-+++ else: +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++ if isinstance(past_key_value, StaticCache): +-+++ kv_seq_len = key_states.shape[-2] +-++ +-++ # repeat k/v heads if n_kv_heads < n_heads +-++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++- +-+++ +-++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++ +-++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++- raise ValueError( +-++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++- f" {attn_weights.shape}" +-++- ) +-++- +-++- if attention_mask is not None: # no matter the length, we just slice it +-++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++ if attention_mask is not None: +-+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++ attn_weights = attn_weights + causal_mask +-++ +-++ # upcast attention to fp32 +-++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++ +-++ attn_output = self.o_proj(attn_output) +-++- +-+++ # @lwx +-+++ +-+++ # max_seq_len = self.max_position_embeddings # 2048 +-+++ +-+++ # if attention_mask is not None: +-+++ # # attention_mask: [B, 1, Sq, Sk] +-+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++ +-+++ # # pad 到 [max_seq_len, max_seq_len] +-+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++ # global_attention_mask = padded_mask +-+++ # else: +-+++ # global_attention_mask = None +-+++ +-+++ +-+++ # sparse_mode=3 +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # real_shift=None, +-+++ # padding_mask=None, +-+++ +-+++ # head_num=self.num_heads, +-+++ # attn_mask=global_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ # input_layout="BNSD", +-+++ # pre_tokens=2147483647, +-+++ # next_tokens=2147483647, +-+++ # inner_precise=0, +-+++ # drop_mask=None, +-+++ # prefix=None, +-+++ # actual_seq_qlen=None, +-+++ # actual_seq_kvlen=None, +-+++ # sparse_mode=sparse_mode, +-+++ # ) +-++ if not output_attentions: +-++ attn_weights = None +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++ +-+++class Qwen2MoeFlashAttention(nn.Module): +-+++ """ +-+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++ +-+++ 关键改动: +-+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++ 直接传入原始的 key 和 value 张量效率更高。 +-+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++ """ +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++ super().__init__() +-+++ self.config = config +-+++ self.layer_idx = layer_idx +-+++ self.hidden_size = config.hidden_size +-+++ self.num_heads = config.num_attention_heads +-+++ self.head_dim = self.hidden_size // self.num_heads +-+++ self.num_key_value_heads = config.num_key_value_heads +-+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++ self.max_position_embeddings = config.max_position_embeddings +-+++ self.rope_theta = config.rope_theta +-+++ self.attention_dropout = config.attention_dropout +-+++ +-+++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++ raise ValueError( +-+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++ ) +-+++ +-+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++ +-+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++ self.head_dim, +-+++ max_position_embeddings=self.max_position_embeddings, +-+++ base=self.rope_theta, +-+++ ) +-+++ +-+++ def forward( +-+++ self, +-+++ hidden_states: mindspore.Tensor, +-+++ attention_mask: Optional[mindspore.Tensor] = None, +-+++ position_ids: Optional[mindspore.Tensor] = None, +-+++ past_key_value: Optional[Cache] = None, +-+++ output_attentions: bool = False, +-+++ use_cache: bool = False, +-+++ cache_position: Optional[mindspore.Tensor] = None, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # 1. 线性投射 Q, K, V +-+++ query_states = self.q_proj(hidden_states) +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++ # query: [B, S, H*D] -> [B, N1, S, D] +-+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # 3. RoPE 旋转位置编码 +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++ if self.layer_idx is None: +-+++ raise ValueError( +-+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ "with a layer index." +-+++ ) +-+++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++ if cache_position.shape[0] == 1: +-+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++ kv_seq_len = past_seen_tokens + 1 +-+++ else: +-+++ # prefill 阶段:cache_position 是范围,使用其长度 +-+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++ else: +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # 4. KV 缓存更新 +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ key_states, value_states = past_key_value.update( +-+++ key_states, value_states, self.layer_idx, cache_kwargs +-+++ ) +-+++ +-+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++ if cache_position.shape[0] == 1: +-+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++ kv_seq_len = key_states.shape[-2] +-+++ +-+++ # 5. [重要] 准备 Attention Mask +-+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++ fa_attention_mask = None +-+++ if attention_mask is not None: +-+++ # 截取与当前key长度匹配的部分 +-+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++ fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++ input_dtype = query_states.dtype +-+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++ query_states = query_states.to(mindspore.float16) +-+++ key_states = key_states.to(mindspore.float16) +-+++ value_states = value_states.to(mindspore.float16) +-+++ +-+++ # 6. [核心] 调用 flash_attention_score 算子 +-+++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++ attn_output = mindspore.ops.flash_attention_score( +-+++ query=query_states, +-+++ key=key_states, +-+++ value=value_states, +-+++ head_num=self.num_heads, # 传入Q的头数(N1) +-+++ attn_mask=fa_attention_mask, +-+++ keep_prob=1.0 - self.attention_dropout, +-+++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ input_layout="BNSD", +-+++ sparse_mode=0 # 使用 defaultMask 模式 +-+++ ) +-+++ +-+++ # 恢复原始数据类型 +-+++ attn_output = attn_output.to(input_dtype) +-+++ +-+++ # 7. 调整输出形状 +-+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = self.o_proj(attn_output) +-+++ +-+++ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++ attn_weights = None +-+++ if output_attentions: +-+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ # def forward( +-+++ # self, +-+++ # hidden_states: mindspore.Tensor, +-+++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++ # position_ids: Optional[mindspore.Tensor] = None, +-+++ # past_key_value: Optional[Cache] = None, +-+++ # output_attentions: bool = False, +-+++ # use_cache: bool = False, +-+++ # cache_position: Optional[mindspore.Tensor] = None, +-+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ # bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # # 1. 线性投射 Q, K, V +-+++ # query_states = self.q_proj(hidden_states) +-+++ # key_states = self.k_proj(hidden_states) +-+++ # value_states = self.v_proj(hidden_states) +-+++ +-+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # # 3. RoPE 旋转位置编码 +-+++ # kv_seq_len = key_states.shape[-2] +-+++ # if past_key_value is not None: +-+++ # if self.layer_idx is None: +-+++ # raise ValueError( +-+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ # "with a layer index." +-+++ # ) +-+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # # 4. KV 缓存更新 +-+++ # if past_key_value is not None: +-+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ # key_states, value_states = past_key_value.update( +-+++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++ # ) +-+++ +-+++ # # 5. 准备 Attention Mask +-+++ # fa_attention_mask = None +-+++ # if attention_mask is not None: +-+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++ # input_dtype = query_states.dtype +-+++ +-+++ # # 6. [核心] 调用 flash_attention_score 算子 +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # head_num=self.num_heads, +-+++ # attn_mask=fa_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ # input_layout="BNSD", +-+++ # sparse_mode=0, +-+++ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++ # inner_precise=1 +-+++ # ) +-+++ +-+++ # # 恢复原始数据类型 +-+++ # attn_output = attn_output.to(input_dtype) +-+++ +-+++ # # 7. 调整输出形状 +-+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ # attn_output = self.o_proj(attn_output) +-+++ +-+++ # attn_weights = None +-+++ # if output_attentions: +-+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++ # return attn_output, attn_weights, past_key_value +-+++ +-+++ # def forward( +-+++ # self, +-+++ # hidden_states: mindspore.Tensor, +-+++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++ # position_ids: Optional[mindspore.Tensor] = None, +-+++ # past_key_value: Optional[Cache] = None, +-+++ # output_attentions: bool = False, +-+++ # use_cache: bool = False, +-+++ # cache_position: Optional[mindspore.Tensor] = None, +-+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ # bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # query_states = self.q_proj(hidden_states) +-+++ # key_states = self.k_proj(hidden_states) +-+++ # value_states = self.v_proj(hidden_states) +-+++ +-+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # kv_seq_len = key_states.shape[-2] +-+++ # if past_key_value is not None: +-+++ # if self.layer_idx is None: +-+++ # raise ValueError("`layer_idx` must be specified for caching") +-+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # if past_key_value is not None: +-+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ # key_states, value_states = past_key_value.update( +-+++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++ # ) +-+++ +-+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++ +-+++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++ # query_states = query_states / math.sqrt(self.head_dim) +-+++ # # <--- 修改结束 --- +-+++ +-+++ # fa_attention_mask = None +-+++ # if attention_mask is not None: +-+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # input_dtype = query_states.dtype +-+++ +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, # 传入已经预先缩放过的 query +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # head_num=self.num_heads, +-+++ # attn_mask=fa_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++ # input_layout="BNSD", +-+++ # sparse_mode=0, +-+++ # inner_precise=1 # 仍然保持内部高精度计算 +-+++ # ) +-+++ +-+++ # attn_output = attn_output.to(input_dtype) +-+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ # attn_output = self.o_proj(attn_output) +-+++ +-+++ # attn_weights = None +-+++ # if output_attentions: +-+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++ +-+++ # return attn_output, attn_weights, past_key_value +-+++ +-++ QWEN2MOE_ATTENTION_CLASSES = { +-++ "eager": Qwen2MoeAttention, +-+++ "flash-attention": Qwen2MoeFlashAttention, +-++ } +-++ +-++ +-++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-+++ #@dwj +-+++ # 只遍历激活的专家,而非全部专家 +-++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- hidden_states = hidden_states.view(-1, hidden_dim) +-++- # router_logits: (batch * sequence_length, n_experts) +-++- router_logits = self.gate(hidden_states) +-++- +-++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- if self.norm_topk_prob: +-++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- # we cast back to the input dtype +-++- routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++- final_hidden_states = ops.zeros( +-++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++- ) +-++- +-++- # One hot encode the selected experts to create an expert mask +-++- # this will be used to easily index which expert is going to be sollicitated +-++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++- +-++- # Loop over all available experts in the model and perform the computation on each expert +-++- for expert_idx in range(self.num_experts): +-++- expert_layer = self.experts[expert_idx] +-++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++- +-++- # Index the correct hidden states and compute the expert hidden state for +-++- # the current expert. We need to make sure to multiply the output hidden +-++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++- if 0 not in idx.shape: +-++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++- +-++- # However `index_add_` only support torch tensors for indexing so we'll use +-++- # the `top_x` tensor here. +-++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++- +-++- shared_expert_output = self.shared_expert(hidden_states) +-++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++- +-++- final_hidden_states = final_hidden_states + shared_expert_output +-+++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++ num_tokens = hidden_states_reshaped.shape[0] +-+++ +-+++ router_logits = self.gate(hidden_states_reshaped) +-+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++ if self.norm_topk_prob: +-+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++ flat_selected_experts = selected_experts.flatten() +-+++ +-+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++ token_indices = broadcasted_token_indices.flatten() +-+++ +-+++ active_experts = ops.unique(flat_selected_experts) +-+++ +-+++ for expert_idx_tensor in active_experts: +-+++ expert_idx = expert_idx_tensor.item() +-+++ expert_layer = self.experts[expert_idx] +-+++ +-+++ mask = (flat_selected_experts == expert_idx_tensor) +-+++ selected_token_indices = token_indices[mask] +-+++ selected_routing_weights = routing_weights.flatten()[mask] +-+++ +-+++ current_states = hidden_states_reshaped[selected_token_indices] +-+++ +-+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++ +-+++ final_hidden_states = final_hidden_states.index_add( +-+++ dim=0, +-+++ index=selected_token_indices, +-+++ source=expert_output.to(hidden_states.dtype) +-+++ ) +-+++ +-+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++ +-++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++- return final_hidden_states, router_logits +-+++ final_hidden_states = final_hidden_states + shared_expert_output +-+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++ +-+++ return final_hidden_states, router_logits +-++ +-++ +-++ class Qwen2MoeDecoderLayer(nn.Module): +-++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++ +-++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++ +-++ if (layer_idx not in config.mlp_only_layers) and ( +-++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++ ): +-++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++ _skip_keys_device_placement = "past_key_values" +-++ _supports_cache_class = True +-+++#lwx +-+++ # _supports_static_cache = True +-++ +-++ def _init_weights(self, module): +-++ std = self.config.initializer_range +-++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++ return causal_mask +-++ +-++ +-++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ _tied_weights_keys = ["lm_head.weight"] +-++ +-++ def __init__(self, config): +-++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ self.num_experts_per_tok = config.num_experts_per_tok +-++ # Initialize weights and apply final processing +-++ self.post_init() +-+++ # @lwx +-+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++ # self.generation_config.cache_implementation = "static" +-+++ self._warmed_up = False +-+++ +-+++ def warmup_moe_model(self): +-+++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++ test_texts = [ +-+++ "warmup short", +-+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++ ] +-+++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++ if tokenizer is None: +-+++ from mindnlp.transformers import AutoTokenizer +-+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++ self._warmup_tokenizer = tokenizer +-+++ +-+++ for text in test_texts: +-+++ inputs = tokenizer(text, return_tensors="ms") +-+++ with mindspore._no_grad(): +-+++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++ +-++ def get_input_embeddings(self): +-++ return self.model.embed_tokens +-++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++ ```""" +-+++ if not self._warmed_up: +-+++ self._warmed_up = True +-+++ self.warmup_moe_model() +-++ +-++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++ output_router_logits = ( +-++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ } +-++ ) +-++ return model_inputs +-+++# @lwx +-+++ # def _decode_one_tokens_logits( +-+++ # self, +-+++ # cur_token: mindspore.Tensor, +-+++ # input_pos: Optional[mindspore.Tensor], +-+++ # cache_position: mindspore.Tensor, +-+++ # past_key_values: StaticCache, +-+++ # ) -> mindspore.Tensor: +-+++ # """ +-+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++ +-+++ # Args: +-+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++ # input_pos: 输入位置信息,可选 +-+++ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++ +-+++ # Returns: +-+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++ # """ +-+++ # # 调用JIT编译的版本 +-+++ # return self.get_decode_one_tokens_logits( +-+++ # cur_token=cur_token, +-+++ # input_pos=input_pos, +-+++ # cache_position=cache_position, +-+++ # past_key_values=past_key_values, +-+++ # ) +-+++ +-+++ # @mindspore.jit(jit_level='O1') +-+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++ # """ +-+++ # JIT编译的函数,用于高效的单token解码 +-+++ # 使用JIT编译优化以支持静态shape和高效执行 +-+++ +-+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++ # """ +-+++ # outputs = self.model.forward( +-+++ # input_ids=cur_token, +-+++ # position_ids=input_pos, +-+++ # cache_position=cache_position, +-+++ # past_key_values=past_key_values, +-+++ # use_cache=True, +-+++ # return_dict=False, +-+++ # ) +-+++ +-+++ # hidden_states = outputs[0] +-+++ # logits = self.lm_head.forward(hidden_states) +-+++ # logits = logits.float() +-+++ +-+++ # return logits[:, -1, :] +-+++ +-+++ # def _sample( +-+++ # self, +-+++ # input_ids: mindspore.Tensor, +-+++ # logits_processor, +-+++ # stopping_criteria, +-+++ # generation_config, +-+++ # synced_devices: bool, +-+++ # streamer=None, +-+++ # logits_warper=None, +-+++ # **model_kwargs, +-+++ # ): +-+++ # """ +-+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++ # """ +-+++ # from ...generation.logits_process import LogitsProcessorList +-+++ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++ # from mindnlp.core import nn, ops, no_grad +-+++ # import numpy as np +-+++ +-+++ # # 检查是否使用 StaticCache +-+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++ # # 否则,直接调用父类方法 +-+++ # past_key_values = model_kwargs.get("past_key_values") +-+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++ +-+++ # if not isinstance(past_key_values, StaticCache): +-+++ # # 不使用 StaticCache,直接调用父类方法 +-+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++ # return super()._sample( +-+++ # input_ids=input_ids, +-+++ # logits_processor=logits_processor, +-+++ # stopping_criteria=stopping_criteria, +-+++ # generation_config=generation_config, +-+++ # synced_devices=synced_devices, +-+++ # streamer=streamer, +-+++ # logits_warper=logits_warper, +-+++ # **model_kwargs, +-+++ # ) +-+++ +-+++ # # 使用 StaticCache,进入自定义循环 +-+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++ # pad_token_id = generation_config._pad_token_tensor +-+++ # output_attentions = generation_config.output_attentions +-+++ # output_hidden_states = generation_config.output_hidden_states +-+++ # output_scores = generation_config.output_scores +-+++ # output_logits = generation_config.output_logits +-+++ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++ # max_length = generation_config.max_length +-+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++ # do_sample = generation_config.do_sample +-+++ +-+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++ # raise ValueError( +-+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++ # f"{logits_warper})." +-+++ # ) +-+++ +-+++ # # init attention / hidden states / scores tuples +-+++ # scores = () if (return_dict_in_generate and output_scores) else None +-+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++ +-+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++ # encoder_hidden_states = ( +-+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++ # ) +-+++ +-+++ # # keep track of which sequences are already finished +-+++ # batch_size, cur_len = input_ids.shape +-+++ # this_peer_finished = False +-+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++ +-+++ # time_record = [] +-+++ # from ....utils.testing_utils import parse_flag_from_env +-+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++ +-+++ # while self._has_unfinished_sequences( +-+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++ # ): +-+++ # if _record_time: +-+++ # import time as time_module +-+++ # infer_start = time_module.time() +-+++ +-+++ # # prepare model inputs +-+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++ +-+++ # # prepare variable output controls +-+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++ +-+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++ # cur_cache_position = model_inputs.get("cache_position") +-+++ # cur_past_key_values = model_inputs.get("past_key_values") +-+++ # cur_input_ids = model_inputs.get("input_ids") +-+++ +-+++ # if (isinstance(cur_past_key_values, StaticCache) and +-+++ # cur_cache_position is not None and +-+++ # len(cur_cache_position.shape) > 0 and +-+++ # cur_cache_position.shape[0] == 1 and +-+++ # cur_input_ids is not None and +-+++ # cur_input_ids.shape[1] == 1): +-+++ # # 使用 JIT 优化的单 token 解码 +-+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++ # if not hasattr(self, '_jit_used'): +-+++ # self._jit_used = False +-+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++ +-+++ # next_token_logits = self.get_decode_one_tokens_logits( +-+++ # cur_token=cur_input_ids, +-+++ # input_pos=model_inputs.get("position_ids"), +-+++ # cache_position=cur_cache_position, +-+++ # past_key_values=cur_past_key_values, +-+++ # ) +-+++ +-+++ # # 标记已使用JIT(用于后续判断) +-+++ # if not self._jit_used: +-+++ # self._jit_used = True +-+++ +-+++ # # 构造兼容的输出对象 +-+++ # class JitOptimizedOutput: +-+++ # def __init__(self, logits, config): +-+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++ # self.config = config +-+++ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++ # self.attentions = None if not config.is_encoder_decoder else None +-+++ # self.cross_attentions = None +-+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++ +-+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++ # else: +-+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++ # outputs = self(**model_inputs, return_dict=True) +-+++ +-+++ # if synced_devices and this_peer_finished: +-+++ # continue +-+++ +-+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++ # next_token_logits = outputs.logits[:, -1, :] +-+++ +-+++ # # pre-process distribution +-+++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++ # if do_sample: +-+++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++ +-+++ # # Store scores, attentions and hidden_states when required +-+++ # if return_dict_in_generate: +-+++ # if output_scores: +-+++ # scores += (next_token_scores,) +-+++ # if output_logits: +-+++ # raw_logits += (next_token_logits,) +-+++ # if output_attentions: +-+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++ # if self.config.is_encoder_decoder: +-+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++ +-+++ # if output_hidden_states: +-+++ # hidden = ( +-+++ # outputs.decoder_hidden_states +-+++ # if self.config.is_encoder_decoder +-+++ # else outputs.hidden_states +-+++ # ) +-+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++ +-+++ # # token selection +-+++ # if do_sample: +-+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++ # else: +-+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++ +-+++ # # finished sentences should have their next token be a padding token +-+++ # if has_eos_stopping_criteria: +-+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++ +-+++ # # update generated ids, model inputs, and length for next step +-+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++ # if streamer is not None: +-+++ # streamer.put(next_tokens) +-+++ +-+++ # model_kwargs = self._update_model_kwargs_for_generation( +-+++ # outputs, +-+++ # model_kwargs, +-+++ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++ # ) +-+++ +-+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++ # cur_len += 1 +-+++ +-+++ # if _record_time: +-+++ # import time as time_module +-+++ # infer_stop = time_module.time() +-+++ # time_record.append(infer_stop - infer_start) +-+++ +-+++ # del outputs +-+++ +-+++ # average_infer_time = None +-+++ # if time_record: +-+++ # if len(time_record) > 1: +-+++ # time_record.pop(0) +-+++ # average_infer_time = sum(time_record) / len(time_record) +-+++ # print(f'average inference time is: {average_infer_time}') +-+++ # print(f'inference time record: {time_record}') +-+++ +-+++ # if streamer is not None: +-+++ # streamer.end() +-+++ +-+++ # # 简单判断:打印是否使用了JIT路径 +-+++ # if hasattr(self, '_jit_used') and self._jit_used: +-+++ # print("[JIT] ✓ JIT optimization was used during generation") +-+++ # else: +-+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++ +-+++ # if return_dict_in_generate: +-+++ # if self.config.is_encoder_decoder: +-+++ # return GenerateEncoderDecoderOutput( +-+++ # sequences=input_ids, +-+++ # scores=scores, +-+++ # logits=raw_logits, +-+++ # encoder_attentions=encoder_attentions, +-+++ # encoder_hidden_states=encoder_hidden_states, +-+++ # decoder_attentions=decoder_attentions, +-+++ # cross_attentions=cross_attentions, +-+++ # decoder_hidden_states=decoder_hidden_states, +-+++ # past_key_values=model_kwargs.get("past_key_values"), +-+++ # average_infer_time=average_infer_time +-+++ # ) +-+++ # else: +-+++ # return GenerateDecoderOnlyOutput( +-+++ # sequences=input_ids, +-+++ # scores=scores, +-+++ # logits=raw_logits, +-+++ # attentions=decoder_attentions, +-+++ # hidden_states=decoder_hidden_states, +-+++ # past_key_values=model_kwargs.get("past_key_values"), +-+++ # average_infer_time=average_infer_time +-+++ # ) +-+++ # else: +-+++ # return input_ids +-+++ +-+++ # def _prepare_cache_for_generation( +-+++ # self, +-+++ # generation_config, +-+++ # model_kwargs, +-+++ # assistant_model, +-+++ # batch_size, +-+++ # max_cache_length, +-+++ # ): +-+++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++ # generation_config.cache_implementation = "static" +-+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++ +-+++ # if generation_config.cache_implementation == "static": +-+++ # base_required_from_max_length = generation_config.max_length + 1 +-+++ # base_required = max(max_cache_length, base_required_from_max_length) +-+++ # min_cache_size = 50 +-+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++ # else: +-+++ # max_cache_length = max(base_required, min_cache_size) +-+++ +-+++ # original_max_cache_length = max_cache_length +-+++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++ # print(f" - final max_cache_length: {max_cache_length}") +-+++ +-+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++ # if max_cache_length > self.config.max_position_embeddings: +-+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++ +-+++ # result = super()._prepare_cache_for_generation( +-+++ # generation_config=generation_config, +-+++ # model_kwargs=model_kwargs, +-+++ # assistant_model=assistant_model, +-+++ # batch_size=batch_size, +-+++ # max_cache_length=max_cache_length, +-+++ # ) +-+++ +-+++ # if generation_config.cache_implementation == "static": +-+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++ # created_cache = model_kwargs.get(cache_name) +-+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++ # if created_cache.max_cache_len < generation_config.max_length: +-+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++ +-+++ # return result +-+++ +-+++ +-+++ +-++ +-++ +-++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++-- +-++2.27.0 +-++ +-+-- +-+2.27.0 +-+ +-diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-new file mode 100644 +-index 00000000..966529e4 +---- /dev/null +-+++ b/patches/0003-20261106secondcommit.patch +-@@ -0,0 +1,2769 @@ +-+From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+Subject: [PATCH 3/3] 20261106secondcommit +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +-+ patches/0001-20251104commit.patch | 1272 ----------------- +-+ 3 files changed, 528 insertions(+), 2032 deletions(-) +-+ delete mode 100644 patches/0001-20251104commit.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index 73773c22..2f9192bf 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +-+ +-+ _CONFIG_FOR_DOC = "DeepseekConfig" +-+ +-++_attn_mask_cache = {} +-++ +-++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-++ q_len = batch_and_seq[1] +-++ kv_len = batch_and_seq[1] + past_key_values_length +-++ key = (batch_and_seq[0], q_len, kv_len) +-++ +-++ if key in _attn_mask_cache: +-++ return _attn_mask_cache[key] +-++ +-++ mask = _prepare_4d_causal_attention_mask( +-++ attention_mask, +-++ batch_and_seq, +-++ inputs_embeds, +-++ past_key_values_length, +-++ ) +-++ _attn_mask_cache[key] = mask +-++ return mask +-+ +-+ def _get_unpad_data(attention_mask): +-+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-+@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +-+ return final_output +-+ +-+ +-+- @no_grad() +-+- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+- expert_cache = ops.zeros_like(x) +-+- idxs = flat_expert_indices.argsort() +-+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+- token_idxs = idxs // self.num_experts_per_tok +-+- +-+- for i, end_idx in enumerate(tokens_per_expert): +-+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+- if start_idx == end_idx: +-+- continue +-+- expert = self.experts[i] +-+- exp_token_idx = token_idxs[start_idx:end_idx] +-+- expert_tokens = x[exp_token_idx] +-+- expert_out = expert(expert_tokens) +-+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+- +-+- return expert_cache +-+- +-+ # @no_grad() +-+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+- # # expert_cache = torch.zeros_like(x) +-+- # # idxs = flat_expert_indices.argsort() +-+- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+- # # token_idxs = idxs // self.num_experts_per_tok +-+- # # for i, end_idx in enumerate(tokens_per_expert): +-+- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+- # # if start_idx == end_idx: +-+- # # continue +-+- # # expert = self.experts[i] +-+- # # exp_token_idx = token_idxs[start_idx:end_idx] +-+- # # expert_tokens = x[exp_token_idx] +-+- # # expert_out = expert(expert_tokens) +-+- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+- # # return expert_cache +-++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ # expert_cache = ops.zeros_like(x) +-+ # idxs = flat_expert_indices.argsort() +-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+ +-+ # return expert_cache +-+- # @no_grad() +-+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+- # expert_cache = ops.zeros_like(x) +-++ +-++ @no_grad() +-++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++ """ +-++ 优化版 MoE prefill: +-++ - 批量张量化处理同一个 expert 的所有 token +-++ - 跳过无 token 的专家 +-++ - 保持结果完全一致 +-++ """ +-++ # 初始化输出缓存 +-++ expert_cache = ops.zeros_like(x) +-+ +-+- # # 排序保证顺序一致 +-+- # idxs = flat_expert_indices.argsort() +-+- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+- # token_idxs = idxs // self.num_experts_per_tok +-++ # 排序(确保 scatter_add 位置对应原逻辑) +-++ idxs = flat_expert_indices.argsort() +-++ sorted_expert_indices = flat_expert_indices[idxs] +-++ sorted_token_indices = idxs // self.num_experts_per_tok +-+ +-+- # # 找出有 token 的专家 +-+- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++ # 每个 expert 的 token 数 +-++ tokens_per_expert = sorted_expert_indices.bincount() +-+ +-+- # for i in active_experts.tolist(): +-+- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+- # end_idx = tokens_per_expert[i] +-+- # if start_idx == end_idx: # 没有 token +-+- # continue +-++ # 找出有 token 的专家 +-++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+ +-+- # exp_token_idx = token_idxs[start_idx:end_idx] +-+- # expert_tokens = x[exp_token_idx] +-+- # expert_out = self.experts[i](expert_tokens) +-+- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++ for expert_id in active_experts.tolist(): +-++ # 取该 expert 对应的排序后 token 区间 +-++ start = (tokens_per_expert[:expert_id]).sum().item() +-++ end = start + tokens_per_expert[expert_id].item() +-+ +-+- # expert_cache = mindspore.mint.scatter_add( +-+- # expert_cache, +-+- # 0, +-+- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+- # expert_out +-+- # ) +-++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-++ expert_tokens = x[token_idx] # 取输入向量 +-+ +-+- # return expert_cache +-++ # 执行专家 MLP +-++ expert_out = self.experts[expert_id](expert_tokens) +-++ +-++ # 按权重缩放 +-++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-++ +-++ # 回写到缓存(等价 scatter_add) +-++ expert_cache = mindspore.mint.scatter_add( +-++ expert_cache, +-++ 0, +-++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++ scaled_out +-++ ) +-++ +-++ return expert_cache +-++ +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # # expert_cache = torch.zeros_like(x) +-++ # # idxs = flat_expert_indices.argsort() +-++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++ # # token_idxs = idxs // self.num_experts_per_tok +-++ # # for i, end_idx in enumerate(tokens_per_expert): +-++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++ # # if start_idx == end_idx: +-++ # # continue +-++ # # expert = self.experts[i] +-++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # # expert_tokens = x[exp_token_idx] +-++ # # expert_out = expert(expert_tokens) +-++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++ # # return expert_cache +-++ # expert_cache = ops.zeros_like(x) +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # for i, end_idx in enumerate(tokens_per_expert): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # if start_idx == end_idx: +-++ # continue +-++ # expert = self.experts[i] +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = expert(expert_tokens) +-++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-++ # return expert_cache +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++ # expert_cache = ops.zeros_like(x) +-++ +-++ # # 排序保证顺序一致 +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ # token_idxs = idxs // self.num_experts_per_tok +-++ +-++ # # 找出有 token 的专家 +-++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++ +-++ # for i in active_experts.tolist(): +-++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ # end_idx = tokens_per_expert[i] +-++ # if start_idx == end_idx: # 没有 token +-++ # continue +-++ +-++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++ # expert_tokens = x[exp_token_idx] +-++ # expert_out = self.experts[i](expert_tokens) +-++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++ +-++ # expert_cache = mindspore.mint.scatter_add( +-++ # expert_cache, +-++ # 0, +-++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++ # expert_out +-++ # ) +-++ +-++ # return expert_cache +-+ +-+ +-+ +-+@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+- +-+ # class DeepseekFlashAttention(nn.Module): +-+ # """ +-+ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-++ +-+ Deepseek_ATTENTION_CLASSES = { +-+ "eager": DeepseekAttention, +-+ "flash-attention": DeepseekFlashAttention, +-+@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +-+ ) +-+ else: +-+ # 4d mask is passed through the layers +-+- attention_mask = _prepare_4d_causal_attention_mask( +-++ # attention_mask = _prepare_4d_causal_attention_mask( +-++ # attention_mask, +-++ # (batch_size, seq_length), +-++ # inputs_embeds, +-++ # past_key_values_length, +-++ # ) +-++ #@dwj +-++ attention_mask = get_cached_causal_mask( +-+ attention_mask, +-+ (batch_size, seq_length), +-+ inputs_embeds, +-+@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ # Initialize weights and apply final processing +-+ self.post_init() +-+ self.warm_up = False +-++ #@dwj +-++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-++ self.num_layers, +-++ self.num_attention_heads, +-++ self.head_dim, +-++ batch_size=1, +-++ max_length=self.max_length, +-++ dtype=mindspore.float16 +-++ ) +-++ +-++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-++ key_cache = [] +-++ value_cache = [] +-++ for _ in range(num_layers): +-++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++ key_cache.append(k) +-++ value_cache.append(v) +-++ return key_cache, value_cache +-++ +-+ +-+ def warmup_moe_model_deep(self): +-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+index bced285c..ebd7782e 100644 +-+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +-+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+ +-+-Long_Prompt = False +-+-PROMPT_LENGTH_THRESHOLD = 128 +-++Long_Prompt = 1 +-++LONG_PROMPT_LENGTH_THRESHOLD = 128 +-++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-++ +-++_causal_mask_cache = {} +-++ +-++def get_cached_causal_mask_with_cache_position( +-++ attention_mask: mindspore.Tensor, +-++ sequence_length: int, +-++ target_length: int, +-++ dtype: mindspore.dtype, +-++ min_dtype: float, +-++ cache_position: mindspore.Tensor, +-++ batch_size: int, +-++): +-++ """ +-++ 带缓存的 causal mask 构造函数 +-++ """ +-++ # q_len 是当前 query 长度 +-++ q_len = sequence_length +-++ # kv_len 是 target_length +-++ kv_len = target_length +-++ +-++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-++ +-++ if key in _causal_mask_cache: +-++ return _causal_mask_cache[key] +-++ +-++ # 调用原来的 mask 构造逻辑 +-++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++ attention_mask, +-++ sequence_length=sequence_length, +-++ target_length=target_length, +-++ dtype=dtype, +-++ min_dtype=min_dtype, +-++ cache_position=cache_position, +-++ batch_size=batch_size, +-++ ) +-++ # 缓存结果 +-++ _causal_mask_cache[key] = causal_mask +-++ return causal_mask +-+ +-+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+ def _prepare_4d_causal_attention_mask_with_cache_position( +-+@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+ +-+ +-+ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-++# class Qwen2MoeAttention(nn.Module): +-++# """ +-++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-++# and "Generating Long Sequences with Sparse Transformers". +-++# """ +-++ +-++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++# super().__init__() +-++# self.config = config +-++# self.layer_idx = layer_idx +-++# if layer_idx is None: +-++# logger.warning_once( +-++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++# "when creating this class." +-++# ) +-++ +-++# self.hidden_size = config.hidden_size +-++# self.num_heads = config.num_attention_heads +-++# self.head_dim = self.hidden_size // self.num_heads +-++# self.num_key_value_heads = config.num_key_value_heads +-++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++# self.max_position_embeddings = config.max_position_embeddings +-++# self.rope_theta = config.rope_theta +-++# self.is_causal = True +-++# self.attention_dropout = config.attention_dropout +-++ +-++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++# raise ValueError( +-++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++# f" and `num_heads`: {self.num_heads})." +-++# ) +-++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++ +-++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++# self.head_dim, +-++# max_position_embeddings=self.max_position_embeddings, +-++# base=self.rope_theta, +-++# ) +-++ +-++# def forward( +-++# self, +-++# hidden_states: mindspore.Tensor, +-++# attention_mask: Optional[mindspore.Tensor] = None, +-++# position_ids: Optional[mindspore.Tensor] = None, +-++# past_key_value: Optional[Cache] = None, +-++# output_attentions: bool = False, +-++# use_cache: bool = False, +-++# cache_position: Optional[mindspore.Tensor] = None, +-++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++ +-++ +-++ +-++# bsz, q_len, _ = hidden_states.shape +-++ +-++# query_states = self.q_proj(hidden_states) +-++# key_states = self.k_proj(hidden_states) +-++# value_states = self.v_proj(hidden_states) +-++ +-++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++ +-++# kv_seq_len = key_states.shape[-2] +-++# if past_key_value is not None: +-++# if self.layer_idx is None: +-++# raise ValueError( +-++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++# "with a layer index." +-++# ) +-++# if isinstance(past_key_value, StaticCache): +-++# kv_seq_len = key_states.shape[-2] +-++# else: +-++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++# if past_key_value is not None: +-++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++# if isinstance(past_key_value, StaticCache): +-++# kv_seq_len = key_states.shape[-2] +-++ +-++# # repeat k/v heads if n_kv_heads < n_heads +-++# key_states = repeat_kv(key_states, self.num_key_value_groups) +-++# value_states = repeat_kv(value_states, self.num_key_value_groups) +-++ +-++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++ +-++# if attention_mask is not None: +-++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++# attn_weights = attn_weights + causal_mask +-++ +-++# # upcast attention to fp32 +-++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++# attn_output = ops.matmul(attn_weights, value_states) +-++ +-++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++# raise ValueError( +-++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-++# f" {attn_output.shape}" +-++# ) +-++ +-++# attn_output = ops.transpose(attn_output, 1, 2) +-++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++ +-++# attn_output = self.o_proj(attn_output) +-++# # @lwx +-++ +-++# # max_seq_len = self.max_position_embeddings # 2048 +-++ +-++# # if attention_mask is not None: +-++# # # attention_mask: [B, 1, Sq, Sk] +-++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++ +-++# # # pad 到 [max_seq_len, max_seq_len] +-++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++# # global_attention_mask = padded_mask +-++# # else: +-++# # global_attention_mask = None +-++ +-++ +-++# # sparse_mode=3 +-++# # attn_output = mindspore.ops.flash_attention_score( +-++# # query=query_states, +-++# # key=key_states, +-++# # value=value_states, +-++# # real_shift=None, +-++# # padding_mask=None, +-++ +-++# # head_num=self.num_heads, +-++# # attn_mask=global_attention_mask, +-++# # keep_prob=1.0 - self.attention_dropout, +-++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++# # input_layout="BNSD", +-++# # pre_tokens=2147483647, +-++# # next_tokens=2147483647, +-++# # inner_precise=0, +-++# # drop_mask=None, +-++# # prefix=None, +-++# # actual_seq_qlen=None, +-++# # actual_seq_kvlen=None, +-++# # sparse_mode=sparse_mode, +-++# # ) +-++# if not output_attentions: +-++# attn_weights = None +-++ +-++# return attn_output, attn_weights, past_key_value +-++ +-+ class Qwen2MoeAttention(nn.Module): +-+ """ +-+- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+- and "Generating Long Sequences with Sparse Transformers". +-+- """ +-++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +-+ +-++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-++ +-++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-++ """ +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+ super().__init__() +-+ self.config = config +-+@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +-+ if layer_idx is None: +-+ logger.warning_once( +-+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+ "when creating this class." +-+ ) +-+ +-+@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +-+ use_cache: bool = False, +-+ cache_position: Optional[mindspore.Tensor] = None, +-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+- +-+ +-+- +-++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +-+ bsz, q_len, _ = hidden_states.shape +-+ +-+ query_states = self.q_proj(hidden_states) +-+ key_states = self.k_proj(hidden_states) +-+ value_states = self.v_proj(hidden_states) +-+ +-+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+- +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ +-+ kv_seq_len = key_states.shape[-2] +-+ if past_key_value is not None: +-+- if self.layer_idx is None: +-+- raise ValueError( +-+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+- "with a layer index." +-+- ) +-+- if isinstance(past_key_value, StaticCache): +-+- kv_seq_len = key_states.shape[-2] +-+- else: +-+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+ +-+ if past_key_value is not None: +-+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++ +-++ # --- 2. 动态调度核心注意力计算 --- +-++ global Long_Prompt +-++ if Long_Prompt >= 1: +-++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-++ fa_attention_mask = None +-++ if attention_mask is not None: +-++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++ fa_attention_mask = (mask_slice != 0) +-++ +-++ attn_output = mindspore.ops.flash_attention_score( +-++ query=query_states, +-++ key=key_states, +-++ value=value_states, +-++ head_num=self.num_heads, +-++ attn_mask=fa_attention_mask, +-++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++ input_layout="BNSD", +-++ sparse_mode=0, +-++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-++ ) +-+ +-+- if isinstance(past_key_value, StaticCache): +-+- kv_seq_len = key_states.shape[-2] +-++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ attn_output = self.o_proj(attn_output) +-++ attn_weights = None +-++ if output_attentions: +-++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+ +-+- # repeat k/v heads if n_kv_heads < n_heads +-+- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+- +-+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++ else: +-++ # --- Eager Attention 路径 (用于短序列和解码) --- +-++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++ +-++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+ +-+- if attention_mask is not None: +-+- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+- attn_weights = attn_weights + causal_mask +-++ if attention_mask is not None: +-++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++ attn_weights = attn_weights + causal_mask +-+ +-+- # upcast attention to fp32 +-+- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+- attn_output = ops.matmul(attn_weights, value_states) +-++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++ attn_output = ops.matmul(attn_weights, value_states) +-+ +-+- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+- raise ValueError( +-+- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+- f" {attn_output.shape}" +-+- ) +-++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++ raise ValueError( +-++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-++ ) +-+ +-+- attn_output = ops.transpose(attn_output, 1, 2) +-+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++ attn_output = ops.transpose(attn_output, 1, 2) +-++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++ attn_output = self.o_proj(attn_output) +-+ +-+- attn_output = self.o_proj(attn_output) +-+- # @lwx +-++ if not output_attentions: +-++ attn_weights = None +-+ +-+- # max_seq_len = self.max_position_embeddings # 2048 +-+- +-+- # if attention_mask is not None: +-+- # # attention_mask: [B, 1, Sq, Sk] +-+- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+- +-+- # # pad 到 [max_seq_len, max_seq_len] +-+- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+- # global_attention_mask = padded_mask +-+- # else: +-+- # global_attention_mask = None +-+- +-+- +-+- # sparse_mode=3 +-+- # attn_output = mindspore.ops.flash_attention_score( +-+- # query=query_states, +-+- # key=key_states, +-+- # value=value_states, +-+- # real_shift=None, +-+- # padding_mask=None, +-+- +-+- # head_num=self.num_heads, +-+- # attn_mask=global_attention_mask, +-+- # keep_prob=1.0 - self.attention_dropout, +-+- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+- # input_layout="BNSD", +-+- # pre_tokens=2147483647, +-+- # next_tokens=2147483647, +-+- # inner_precise=0, +-+- # drop_mask=None, +-+- # prefix=None, +-+- # actual_seq_qlen=None, +-+- # actual_seq_kvlen=None, +-+- # sparse_mode=sparse_mode, +-+- # ) +-+- if not output_attentions: +-+- attn_weights = None +-+- +-+ return attn_output, attn_weights, past_key_value +-+ +-+- +-+ # class Qwen2MoeFlashAttention(nn.Module): +-+ # """ +-+ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +-+ # return final_hidden_states, router_logits +-+ +-+ +-+-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+-# """ +-+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+-# """ +-+-# def __init__(self, config: Qwen2MoeConfig): +-+-# super().__init__() +-+-# self.num_experts = config.num_experts +-+-# self.top_k = config.num_experts_per_tok +-+-# self.norm_topk_prob = config.norm_topk_prob +-+- +-+-# # 门控网络 +-+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+-# # 专家列表 +-+-# self.experts = nn.ModuleList( +-+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+-# ) +-+-# # 共享专家 +-+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-# @no_grad() +-+-# def _moe_infer_decode( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# """ +-+-# 【解码路径】针对 sequence_length=1 的极致优化。 +-+-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+-# """ +-+-# batch_size, hidden_dim = hidden_states.shape +-+- +-+-# expert_outputs_list = [ +-+-# ops.cat([ +-+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+-# ], dim=0) +-+-# for i in range(batch_size) +-+-# ] +-+- +-+-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+-# # shape: (batch_size, top_k, hidden_dim) +-+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+- +-+-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+- +-+-# return moe_output.squeeze(1) +-+- +-+-# @no_grad() +-+-# def _moe_infer_prefill( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# """ +-+-# 【预填充路径】针对 sequence_length > 1 的优化。 +-+-# 按专家对 Token 进行分组,并进行批处理。 +-+-# """ +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens = hidden_states.shape[0] +-+-# flat_selected_experts = selected_experts.flatten() +-+- +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+- +-+-# active_experts = ops.unique(flat_selected_experts) +-+- +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+- +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+-# selected_token_indices = token_indices[mask] +-+-# selected_routing_weights = routing_weights.flatten()[mask] +-+- +-+-# current_states = hidden_states[selected_token_indices] +-+- +-+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+- +-+-# moe_output = moe_output.index_add( +-+-# dim=0, +-+-# index=selected_token_indices, +-+-# source=expert_output.to(hidden_states.dtype) +-+-# ) +-+-# return moe_output +-+- +-+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-# """ +-+-# 顶层 forward 方法,作为智能分发器。 +-+-# """ +-+-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- +-+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-# router_logits = self.gate(hidden_states_reshaped) +-+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- +-+-# if self.norm_topk_prob: +-+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- +-+-# routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+-# moe_output = None +-+-# # 在推理时,根据序列长度选择最优路径 +-+-# if not self.training: +-+-# if sequence_length == 1: +-+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+-# else: +-+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+-# else: +-+-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+-# raise NotImplementedError("Training path is not implemented.") +-+- +-+-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+- +-+-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+- +-+-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+- +-+-# return final_hidden_states, router_logits +-+- +-+- +-+-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+-# """ +-+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+-# """ +-+-# def __init__(self, config: Qwen2MoeConfig): +-+-# super().__init__() +-+-# self.num_experts = config.num_experts +-+-# self.top_k = config.num_experts_per_tok +-+-# self.norm_topk_prob = config.norm_topk_prob +-+- +-+-# # 门控网络 +-+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+-# # 专家列表 +-+-# self.experts = nn.ModuleList( +-+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+-# ) +-+-# # 共享专家 +-+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-# @no_grad() +-+-# def _moe_infer_decode( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# batch_size, _ = hidden_states.shape +-+-# expert_outputs_list = [ +-+-# ops.cat([ +-+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+-# ], dim=0) +-+-# for i in range(batch_size) +-+-# ] +-+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+-# return moe_output.squeeze(1) +-+- +-+-# @no_grad() +-+-# def _moe_infer_prefill( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens = hidden_states.shape[0] +-+-# flat_selected_experts = selected_experts.flatten() +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+-# active_experts = ops.unique(flat_selected_experts) +-+- +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+-# selected_token_indices = token_indices[mask] +-+-# selected_routing_weights = routing_weights.flatten()[mask] +-+-# current_states = hidden_states[selected_token_indices] +-+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+-# moe_output = moe_output.index_add( +-+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+-# ) +-+-# return moe_output +-+- +-+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-# """ +-+-# 顶层 forward 方法,作为智能分发器。 +-+-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+-# """ +-+-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- +-+-# # 1. 门控计算 (通用逻辑) +-+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-# router_logits = self.gate(hidden_states_reshaped) +-+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- +-+-# if self.norm_topk_prob: +-+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- +-+-# routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+-# # 2. 智能分发到最优 MoE 路径 +-+-# moe_output = None +-+-# if not self.training: +-+-# if sequence_length == 1: +-+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+-# else: +-+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+-# else: +-+-# raise NotImplementedError("Training path is not implemented.") +-+- +-+-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+- +-+-# # 4. 合并 MoE 输出和共享专家输出 +-+-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+- +-+-# # 5. 恢复原始形状并返回 +-+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+- +-+-# return final_hidden_states, router_logits +-+- +-+-# prefill fastest +-+-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+-# """ +-+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+-# """ +-+-# def __init__(self, config: Qwen2MoeConfig): +-+-# super().__init__() +-+-# self.num_experts = config.num_experts +-+-# self.top_k = config.num_experts_per_tok +-+-# self.norm_topk_prob = config.norm_topk_prob +-+- +-+-# # 门控网络 +-+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+-# # 专家列表 +-+-# self.experts = nn.ModuleList( +-+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+-# ) +-+-# # 共享专家 +-+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-# @no_grad() +-+-# def _moe_infer_dispatch( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# """ +-+-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+-# """ +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens, _ = hidden_states.shape +-+- +-+-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+-# flat_selected_experts = selected_experts.flatten() +-+-# flat_routing_weights = routing_weights.flatten() +-+- +-+-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+- +-+-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+-# active_experts = ops.unique(flat_selected_experts) +-+- +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+- +-+-# # 找到所有分配给该专家的 token +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+- +-+-# # 使用 mask 选取对应的 token 和权重 +-+-# current_token_indices = token_indices[mask] +-+-# current_routing_weights = flat_routing_weights[mask] +-+-# current_hidden_states = hidden_states[current_token_indices] +-+- +-+-# # 对这些 token 进行批处理 +-+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+- +-+-# # 使用 index_add 将结果精确地加回到对应位置 +-+-# moe_output = moe_output.index_add( +-+-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+-# ) +-+-# return moe_output +-+- +-+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-# """ +-+-# 顶层 forward 方法,作为智能分发器。 +-+-# """ +-+-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- +-+-# # 1. 门控计算 +-+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-# router_logits = self.gate(hidden_states_reshaped) +-+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- +-+-# if self.norm_topk_prob: +-+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- +-+-# routing_weights = routing_weights.to(hidden_states.dtype) +-+- +-+-# # 2. 调用统一的 MoE 计算内核 +-+-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+- +-+-# # 3. 统一处理共享专家 +-+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+- +-+-# # 4. 合并输出 +-+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+- +-+-# # 5. 恢复原始形状并返回 +-+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+- +-+-# return final_hidden_states, router_logits +-+- +-+- +-+-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+-# """ +-+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+-# 【最终高性能与高精度版】: +-+-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+-# 3. 这样实现了速度和准确性的两全其美。 +-+-# """ +-+-# def __init__(self, config: Qwen2MoeConfig): +-+-# super().__init__() +-+-# self.num_experts = config.num_experts +-+-# self.top_k = config.num_experts_per_tok +-+-# self.norm_topk_prob = config.norm_topk_prob +-+- +-+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+-# self.experts = nn.ModuleList( +-+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+-# ) +-+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-# @no_grad() +-+-# def _moe_infer_decode( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# """ +-+-# 【解码路径】极致优化版:bmm + 高精度累加。 +-+-# """ +-+-# original_dtype = hidden_states.dtype +-+-# batch_size, _ = hidden_states.shape +-+- +-+-# expert_outputs_list = [ +-+-# ops.cat([ +-+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+-# ], dim=0) +-+-# for i in range(batch_size) +-+-# ] +-+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+- +-+-# # 在 float32 下执行 bmm,得到高精度结果 +-+-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+- +-+-# # 将高精度结果转换回原始数据类型 +-+-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+- +-+-# return moe_output +-+- +-+-# @no_grad() +-+-# def _moe_infer_prefill( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# selected_experts: mindspore.Tensor, +-+-# routing_weights: mindspore.Tensor +-+-# ) -> mindspore.Tensor: +-+-# """ +-+-# 【预填充路径】与原始实现一致,结果精确。 +-+-# """ +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens, _ = hidden_states.shape +-+-# flat_selected_experts = selected_experts.flatten() +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+-# active_experts = ops.unique(flat_selected_experts) +-+- +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+-# selected_token_indices = token_indices[mask] +-+-# selected_routing_weights = routing_weights.flatten()[mask] +-+-# current_states = hidden_states[selected_token_indices] +-+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+-# moe_output = moe_output.index_add( +-+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+-# ) +-+-# return moe_output +-+- +-+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+- +-+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-# router_logits = self.gate(hidden_states_reshaped) +-+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+- +-+-# if self.norm_topk_prob: +-+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- +-+-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+-# # 如果模型主体是 float16,后续再转换 +-+- +-+-# moe_output = None +-+-# if not self.training: +-+-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+-# # _moe_infer_decode 内部会处理好类型转换 +-+-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+-# if sequence_length == 1: +-+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+-# else: +-+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+-# else: +-+-# raise NotImplementedError("Training path is not implemented.") +-+- +-+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+- +-+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+- +-+-# return final_hidden_states, router_logits +-+- +-+- +-+-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+-# """ +-+-# 【融合版】一个混合专家模块,内置两种推理策略, +-+-# 由外部全局变量 `Long_Prompt` 控制: +-+- +-+-# - if Long_Prompt is True: 【精度优先模式】 +-+-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+-# 适用于处理长序列,避免误差累积。 +-+- +-+-# - if Long_Prompt is False: 【速度优先模式】 +-+-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+-# """ +-+-# def __init__(self, config: Qwen2MoeConfig): +-+-# super().__init__() +-+-# self.num_experts = config.num_experts +-+-# self.top_k = config.num_experts_per_tok +-+-# self.norm_topk_prob = config.norm_topk_prob +-+- +-+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+-# self.experts = nn.ModuleList( +-+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+-# ) +-+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-# # --- 速度优先模式的辅助函数 --- +-+-# @no_grad() +-+-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+-# original_dtype = hidden_states.dtype +-+-# batch_size, _ = hidden_states.shape +-+-# expert_outputs_list = [ +-+-# ops.cat([ +-+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+-# ], dim=0) +-+-# for i in range(batch_size) +-+-# ] +-+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+-# weights_fp32 = routing_weights.to(mindspore.float32) +-+-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+-# return moe_output_fp32.squeeze(1).to(original_dtype) +-+- +-+-# @no_grad() +-+-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens, _ = hidden_states.shape +-+-# flat_selected_experts = selected_experts.flatten() +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+-# active_experts = ops.unique(flat_selected_experts) +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+-# selected_token_indices = token_indices[mask] +-+-# selected_routing_weights = routing_weights.flatten()[mask] +-+-# current_states = hidden_states[selected_token_indices] +-+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+-# return moe_output +-+- +-+-# # --- 精度优先模式的辅助函数 --- +-+-# @no_grad() +-+-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+-# moe_output = ops.zeros_like(hidden_states) +-+-# num_tokens, _ = hidden_states.shape +-+-# flat_selected_experts = selected_experts.flatten() +-+-# flat_routing_weights = routing_weights.flatten() +-+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+-# active_experts = ops.unique(flat_selected_experts) +-+-# for expert_idx_tensor in active_experts: +-+-# expert_idx = expert_idx_tensor.item() +-+-# expert_layer = self.experts[expert_idx] +-+-# mask = (flat_selected_experts == expert_idx_tensor) +-+-# current_token_indices = token_indices[mask] +-+-# current_routing_weights = flat_routing_weights[mask] +-+-# current_hidden_states = hidden_states[current_token_indices] +-+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+-# return moe_output +-+- +-+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-# # 声明我们将要使用一个在模块外部定义的全局变量 +-+-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+-# global Long_Prompt +-+- +-+-# # 1. 门控计算 (所有模式通用) +-+-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-# router_logits = self.gate(hidden_states_reshaped) +-+-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+-# if self.norm_topk_prob: +-+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+- +-+-# moe_output = None +-+-# if not self.training: +-+-# # 根据 Long_Prompt 标志选择模式 +-+-# if Long_Prompt: +-+-# # --- 精度优先模式 --- +-+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+-# else: +-+-# # --- 速度优先模式 --- +-+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+-# if sequence_length == 1: +-+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+-# else: +-+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+-# else: +-+-# raise NotImplementedError("Training path is not implemented.") +-+- +-+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+- +-+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+- +-+-# return final_hidden_states, router_logits +-+- +-+ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ """ +-+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+ return moe_output_fp32.squeeze(1).to(original_dtype) +-+ +-++ # @no_grad() +-++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++ # num_tokens, _ = hidden_states.shape +-++ # flat_selected_experts = selected_experts.flatten() +-++ # sorted_expert_indices = flat_selected_experts.argsort() +-++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++ # original_token_indices = sorted_expert_indices // self.top_k +-++ # moe_output = ops.zeros_like(hidden_states) +-++ # current_token_offset = 0 +-++ # for i in range(self.num_experts): +-++ # expert_token_count = tokens_per_expert[i] - current_token_offset +-++ # if expert_token_count == 0: +-++ # continue +-++ # end_offset = current_token_offset + expert_token_count +-++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++ # expert_hidden_states = hidden_states[expert_original_token_indices] +-++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++ # current_token_offset += expert_token_count +-++ # return moe_output +-++ +-+ @no_grad() +-+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+- num_tokens, _ = hidden_states.shape +-+- flat_selected_experts = selected_experts.flatten() +-+- sorted_expert_indices = flat_selected_experts.argsort() +-+- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+- original_token_indices = sorted_expert_indices // self.top_k +-++ """ +-++ 优化版 MoE prefill (速度优先模式): +-++ - 批量张量化处理同一个 expert 的所有 token +-++ - 跳过无 token 的专家 +-++ - 保持结果完全一致 +-++ """ +-+ moe_output = ops.zeros_like(hidden_states) +-+- current_token_offset = 0 +-+- for i in range(self.num_experts): +-+- expert_token_count = tokens_per_expert[i] - current_token_offset +-+- if expert_token_count == 0: +-+- continue +-+- end_offset = current_token_offset + expert_token_count +-+- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+- expert_hidden_states = hidden_states[expert_original_token_indices] +-+- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+- current_token_offset += expert_token_count +-++ +-++ flat_selected_experts = selected_experts.flatten() +-++ flat_routing_weights = routing_weights.flatten() +-++ +-++ idxs = flat_selected_experts.argsort() +-++ sorted_expert_indices = flat_selected_experts[idxs] +-++ sorted_token_indices = idxs // self.top_k +-++ +-++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-++ +-++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-++ +-++ for expert_id in active_experts.tolist(): +-++ start = int(tokens_per_expert[:expert_id].sum().item()) +-++ end = start + int(tokens_per_expert[expert_id].item()) +-++ +-++ token_idx = sorted_token_indices[start:end] +-++ expert_tokens = hidden_states[token_idx] +-++ +-++ expert_out = self.experts[expert_id](expert_tokens) +-++ +-++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-++ +-++ moe_output = mindspore.mint.scatter_add( +-++ moe_output, +-++ 0, +-++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-++ scaled_out.to(hidden_states.dtype) +-++ ) +-++ +-+ return moe_output +-+ +-++ +-+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+ @no_grad() +-+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+ +-+ moe_output = None +-+- if Long_Prompt: +-+- # --- 精度优先模式 (ACCURACY MODE) --- +-+- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ # if Long_Prompt==0: +-++ # # --- 精度优先模式 (ACCURACY MODE) --- +-++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ # else: +-++ # # --- 速度优先模式 (SPEED MODE) --- +-++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++ # if sequence_length == 1: +-++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ # else: +-++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ +-++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++ if sequence_length == 1: +-++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ else: +-+- # --- 速度优先模式 (SPEED MODE) --- +-+- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+- if sequence_length == 1: +-+- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+- else: +-+- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+- +-++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ +-+ +-+ # 3. 共享专家计算与合并 (所有模式通用) +-+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+ +-+ return final_hidden_states, router_logits +-+ +-++ +-+ class Qwen2MoeDecoderLayer(nn.Module): +-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+ super().__init__() +-+ self.hidden_size = config.hidden_size +-+ +-+- # if Long_Prompt: +-+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+- # else: +-++ # if Long_Prompt == 2: +-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++ # else: +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ +-+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+ +-+@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+ ) +-+ +-+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-+- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++ # attention_mask, +-++ # sequence_length=sequence_length, +-++ # target_length=target_length, +-++ # dtype=dtype, +-++ # min_dtype=min_dtype, +-++ # cache_position=cache_position, +-++ # batch_size=input_tensor.shape[0], +-++ # ) +-++ #@dwj +-++ causal_mask = get_cached_causal_mask_with_cache_position( +-+ attention_mask, +-+ sequence_length=sequence_length, +-+ target_length=target_length, +-+@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+ """ +-+- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-++ _causal_mask_cache.clear() +-+ +-+ input_ids = kwargs.get("input_ids") +-+ if input_ids is None and args: +-+@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ +-+ if input_ids is not None: +-+ prompt_length = input_ids.shape[1] +-+- +-+- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+- Long_Prompt = True +-++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-++ Long_Prompt = 2 +-++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-++ Long_Prompt = 0 +-+ else: +-+- Long_Prompt = False +-++ Long_Prompt = 1 +-++ +-+ +-+ return super().generate(*args, **kwargs) +-+ +-+@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+ dtype = self.lm_head.weight.dtype +-+ min_dtype = float(ops.finfo(dtype).min) +-+ +-+- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++ # attention_mask, +-++ # sequence_length=sequence_length, +-++ # target_length=past_key_values.get_max_length(), +-++ # dtype=dtype, +-++ # min_dtype=min_dtype, +-++ # cache_position=cache_position, +-++ # batch_size=batch_size, +-++ # ) +-++ +-++ #@dwj +-++ attention_mask = get_cached_causal_mask_with_cache_position( +-+ attention_mask, +-+ sequence_length=sequence_length, +-+ target_length=past_key_values.get_max_length(), +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+deleted file mode 100644 +-+index 6dfb5b93..00000000 +-+--- a/patches/0001-20251104commit.patch +-++++ /dev/null +-+@@ -1,1272 +0,0 @@ +-+-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+-From: Pinoeer-kingxi <13022943007@163.com> +-+-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+-Subject: [PATCH] 20251104commit +-+- +-+---- +-+- mindnlp/transformers/cache_utils.py | 28 +- +-+- .../models/deepseek/modeling_deepseek.py | 149 ++- +-+- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+- 3 files changed, 976 insertions(+), 87 deletions(-) +-+- +-+-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+-index cadd2e04..02f8d4be 100644 +-+---- a/mindnlp/transformers/cache_utils.py +-+-+++ b/mindnlp/transformers/cache_utils.py +-+-@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+- # k_out[:, :, cache_position] = key_states +-+- # v_out[:, :, cache_position] = value_states +-+-- if ON_ORANGE_PI: +-+-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+-- else: +-+-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+-- +-+-+ # if ON_ORANGE_PI: +-+-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+-+ # else: +-+-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-+-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+-+ if cache_position.ndim > 1: +-+-+ cache_position = cache_position.flatten() +-+-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-+-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+-+ cache_position = cache_position.int() +-+-+ +-+-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+-+ k_out[:, :, cache_position] = key_states +-+-+ v_out[:, :, cache_position] = value_states +-+-+ +-+- return k_out, v_out +-+- +-+- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+-index c695b944..d8303e45 100644 +-+---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+- # Copied from transformers.models.llama.modeling_llama.rotate_half +-+- def rotate_half(x): +-+- """Rotates half the hidden dims of the input.""" +-+-- x1 = x[..., : x.shape[-1] // 2] +-+-- x2 = x[..., x.shape[-1] // 2 :] +-+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+-+ # x1 = x[..., : x.shape[-1] // 2] +-+-+ # x2 = x[..., x.shape[-1] // 2 :] +-+-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+- return ops.cat((-x2, x1), dim=-1) +-+- +-+- +-+-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+- if self.training: +-+- raise NotImplementedError("Training is not supported yet.") +-+- else: +-+-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+-- if self.config.n_shared_experts is not None: +-+-- y = y + self.shared_experts(identity) +-+-- return y +-+-+ # @lwx +-+-+ if orig_shape[1] == 1: +-+-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+-+ y=y.view(*orig_shape) +-+-+ if self.config.n_shared_experts is not None: +-+-+ y = y + self.shared_experts(identity) +-+-+ return y +-+-+ else: +-+-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+-+ if self.config.n_shared_experts is not None: +-+-+ y = y + self.shared_experts(identity) +-+-+ return y +-+-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+-+ # if self.config.n_shared_experts is not None: +-+-+ # y = y + self.shared_experts(identity) +-+-+ # return y +-+-+ +-+-+ @no_grad() +-+-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+-+ +-+-+ expert_cache = ops.zeros_like(x) +-+-+ for i in range(self.num_experts_per_tok): +-+-+ expert_id = flat_expert_indices[i].item() +-+-+ weight = flat_expert_weights[i].item() +-+-+ expert = self.experts[expert_id] +-+-+ expert_out = expert(x) +-+-+ expert_cache += expert_out * weight +-+-+ return expert_cache +-+- +-+- @no_grad() +-+-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+-- # expert_cache = torch.zeros_like(x) +-+-- # idxs = flat_expert_indices.argsort() +-+-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+-- # token_idxs = idxs // self.num_experts_per_tok +-+-- # for i, end_idx in enumerate(tokens_per_expert): +-+-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+-- # if start_idx == end_idx: +-+-- # continue +-+-- # expert = self.experts[i] +-+-- # exp_token_idx = token_idxs[start_idx:end_idx] +-+-- # expert_tokens = x[exp_token_idx] +-+-- # expert_out = expert(expert_tokens) +-+-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+-- # return expert_cache +-+-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+- expert_cache = ops.zeros_like(x) +-+- idxs = flat_expert_indices.argsort() +-+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+- token_idxs = idxs // self.num_experts_per_tok +-+-+ +-+- for i, end_idx in enumerate(tokens_per_expert): +-+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+- if start_idx == end_idx: +-+-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+- expert_out = expert(expert_tokens) +-+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+-+ +-+- return expert_cache +-+-+ +-+-+ # @no_grad() +-+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+-+ # # expert_cache = torch.zeros_like(x) +-+-+ # # idxs = flat_expert_indices.argsort() +-+-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+-+ # # token_idxs = idxs // self.num_experts_per_tok +-+-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+-+ # # if start_idx == end_idx: +-+-+ # # continue +-+-+ # # expert = self.experts[i] +-+-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+-+ # # expert_tokens = x[exp_token_idx] +-+-+ # # expert_out = expert(expert_tokens) +-+-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+-+ # # return expert_cache +-+-+ # expert_cache = ops.zeros_like(x) +-+-+ # idxs = flat_expert_indices.argsort() +-+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+-+ # token_idxs = idxs // self.num_experts_per_tok +-+-+ +-+-+ # for i, end_idx in enumerate(tokens_per_expert): +-+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+-+ # if start_idx == end_idx: +-+-+ # continue +-+-+ # expert = self.experts[i] +-+-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+-+ # expert_tokens = x[exp_token_idx] +-+-+ # expert_out = expert(expert_tokens) +-+-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+-+ +-+-+ # return expert_cache +-+-+ # @no_grad() +-+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+-+ # expert_cache = ops.zeros_like(x) +-+-+ +-+-+ # # 排序保证顺序一致 +-+-+ # idxs = flat_expert_indices.argsort() +-+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+-+ # token_idxs = idxs // self.num_experts_per_tok +-+-+ +-+-+ # # 找出有 token 的专家 +-+-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+-+ +-+-+ # for i in active_experts.tolist(): +-+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+-+ # end_idx = tokens_per_expert[i] +-+-+ # if start_idx == end_idx: # 没有 token +-+-+ # continue +-+-+ +-+-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+-+ # expert_tokens = x[exp_token_idx] +-+-+ # expert_out = self.experts[i](expert_tokens) +-+-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+-+ +-+-+ # expert_cache = mindspore.mint.scatter_add( +-+-+ # expert_cache, +-+-+ # 0, +-+-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+-+ # expert_out +-+-+ # ) +-+-+ +-+-+ # return expert_cache +-+-+ +-+-+ +-+- +-+- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+- # """ +-+-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+- +-+- # Initialize weights and apply final processing +-+- self.post_init() +-+-+ self.warm_up = False +-+-+ +-+-+ def warmup_moe_model_deep(self): +-+-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+-+ test_texts = [ +-+-+ "warmup short", +-+-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+-+ ] +-+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+-+ if tokenizer is None: +-+-+ from mindnlp.transformers import AutoTokenizer +-+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+-+ self._warmup_tokenizer = tokenizer +-+-+ +-+-+ for text in test_texts: +-+-+ inputs = tokenizer(text, return_tensors="ms") +-+-+ with mindspore._no_grad(): +-+-+ _ = self(**inputs, use_cache=False) +-+-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+- +-+- def get_input_embeddings(self): +-+- return self.model.embed_tokens +-+-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+- ```""" +-+-+ if not self.warm_up: +-+-+ self.warm_up = True +-+-+ self.warmup_moe_model_deep() +-+-+ +-+- output_attentions = ( +-+- output_attentions +-+- if output_attentions is not None +-+-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+-index 3cbf820e..d4c6b651 100644 +-+---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+-@@ -18,7 +18,6 @@ +-+- # See the License for the specific language governing permissions and +-+- # limitations under the License. +-+- """MindSpore Qwen2MoE model.""" +-+-- +-+- import math +-+- from typing import List, Optional, Tuple, Union +-+- +-+-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+- TokenClassifierOutput, +-+- ) +-+- from ...modeling_utils import PreTrainedModel +-+-+from ...generation import GenerationMixin +-+- from ....utils import logging +-+- from .configuration_qwen2_moe import Qwen2MoeConfig +-+- +-+-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+- self.variance_epsilon = eps +-+- +-+- def forward(self, hidden_states): +-+-+ # @dwj +-+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+-+ # @lwx +-+-+ # if not self.training : +-+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+- input_dtype = hidden_states.dtype +-+- hidden_states = hidden_states.to(mindspore.float32) +-+- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+-@@ -234,6 +239,8 @@ def rotate_half(x): +-+- """Rotates half the hidden dims of the input.""" +-+- x1 = x[..., : x.shape[-1] // 2] +-+- x2 = x[..., x.shape[-1] // 2 :] +-+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+- return ops.cat((-x2, x1), dim=-1) +-+- +-+- +-+-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+- self.config = config +-+- self.hidden_size = config.hidden_size +-+- self.intermediate_size = intermediate_size +-+-+ +-+- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+- self.act_fn = ACT2FN[config.hidden_act] +-+- +-+- def forward(self, x): +-+-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+-- +-+- +-+-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+-+ # @lwx +-+-+ # gate_up_output = self.gate_up_proj(x) +-+-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+-+ # return self.down_proj(swiglu_output) +-+-+ +-+-+ # def forward(self, x): +-+-+ # gate_proj_out = self.gate_proj(x) +-+-+ # up_proj_out = self.up_proj(x) +-+-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+-+ # return self.down_proj(swiglu_out) +-+-+ +-+- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+- """ +-+-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+- use_cache: bool = False, +-+- cache_position: Optional[mindspore.Tensor] = None, +-+- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+-+ +-+-+ +-+-+ +-+- bsz, q_len, _ = hidden_states.shape +-+- +-+- query_states = self.q_proj(hidden_states) +-+-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+- "with a layer index." +-+- ) +-+-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+-+ if isinstance(past_key_value, StaticCache): +-+-+ kv_seq_len = key_states.shape[-2] +-+-+ else: +-+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+- +-+- if past_key_value is not None: +-+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+-+ +-+-+ if isinstance(past_key_value, StaticCache): +-+-+ kv_seq_len = key_states.shape[-2] +-+- +-+- # repeat k/v heads if n_kv_heads < n_heads +-+- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+-- +-+-+ +-+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+- +-+-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+-- raise ValueError( +-+-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+-- f" {attn_weights.shape}" +-+-- ) +-+-- +-+-- if attention_mask is not None: # no matter the length, we just slice it +-+-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+-+ if attention_mask is not None: +-+-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+- attn_weights = attn_weights + causal_mask +-+- +-+- # upcast attention to fp32 +-+-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+- +-+- attn_output = self.o_proj(attn_output) +-+-- +-+-+ # @lwx +-+-+ +-+-+ # max_seq_len = self.max_position_embeddings # 2048 +-+-+ +-+-+ # if attention_mask is not None: +-+-+ # # attention_mask: [B, 1, Sq, Sk] +-+-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+-+ +-+-+ # # pad 到 [max_seq_len, max_seq_len] +-+-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+-+ # global_attention_mask = padded_mask +-+-+ # else: +-+-+ # global_attention_mask = None +-+-+ +-+-+ +-+-+ # sparse_mode=3 +-+-+ # attn_output = mindspore.ops.flash_attention_score( +-+-+ # query=query_states, +-+-+ # key=key_states, +-+-+ # value=value_states, +-+-+ # real_shift=None, +-+-+ # padding_mask=None, +-+-+ +-+-+ # head_num=self.num_heads, +-+-+ # attn_mask=global_attention_mask, +-+-+ # keep_prob=1.0 - self.attention_dropout, +-+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+-+ # input_layout="BNSD", +-+-+ # pre_tokens=2147483647, +-+-+ # next_tokens=2147483647, +-+-+ # inner_precise=0, +-+-+ # drop_mask=None, +-+-+ # prefix=None, +-+-+ # actual_seq_qlen=None, +-+-+ # actual_seq_kvlen=None, +-+-+ # sparse_mode=sparse_mode, +-+-+ # ) +-+- if not output_attentions: +-+- attn_weights = None +-+- +-+- return attn_output, attn_weights, past_key_value +-+- +-+- +-+-+class Qwen2MoeFlashAttention(nn.Module): +-+-+ """ +-+-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+-+ +-+-+ 关键改动: +-+-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+-+ 直接传入原始的 key 和 value 张量效率更高。 +-+-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+-+ """ +-+-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+-+ super().__init__() +-+-+ self.config = config +-+-+ self.layer_idx = layer_idx +-+-+ self.hidden_size = config.hidden_size +-+-+ self.num_heads = config.num_attention_heads +-+-+ self.head_dim = self.hidden_size // self.num_heads +-+-+ self.num_key_value_heads = config.num_key_value_heads +-+-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+-+ self.max_position_embeddings = config.max_position_embeddings +-+-+ self.rope_theta = config.rope_theta +-+-+ self.attention_dropout = config.attention_dropout +-+-+ +-+-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+-+ raise ValueError( +-+-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+-+ ) +-+-+ +-+-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+-+ +-+-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+-+ self.head_dim, +-+-+ max_position_embeddings=self.max_position_embeddings, +-+-+ base=self.rope_theta, +-+-+ ) +-+-+ +-+-+ def forward( +-+-+ self, +-+-+ hidden_states: mindspore.Tensor, +-+-+ attention_mask: Optional[mindspore.Tensor] = None, +-+-+ position_ids: Optional[mindspore.Tensor] = None, +-+-+ past_key_value: Optional[Cache] = None, +-+-+ output_attentions: bool = False, +-+-+ use_cache: bool = False, +-+-+ cache_position: Optional[mindspore.Tensor] = None, +-+-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+-+ +-+-+ bsz, q_len, _ = hidden_states.shape +-+-+ +-+-+ # 1. 线性投射 Q, K, V +-+-+ query_states = self.q_proj(hidden_states) +-+-+ key_states = self.k_proj(hidden_states) +-+-+ value_states = self.v_proj(hidden_states) +-+-+ +-+-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+-+ # query: [B, S, H*D] -> [B, N1, S, D] +-+-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ +-+-+ # 3. RoPE 旋转位置编码 +-+-+ kv_seq_len = key_states.shape[-2] +-+-+ if past_key_value is not None: +-+-+ if self.layer_idx is None: +-+-+ raise ValueError( +-+-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+-+ "with a layer index." +-+-+ ) +-+-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+-+ if cache_position.shape[0] == 1: +-+-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+-+ kv_seq_len = past_seen_tokens + 1 +-+-+ else: +-+-+ # prefill 阶段:cache_position 是范围,使用其长度 +-+-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+-+ else: +-+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+-+ +-+-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+-+ +-+-+ # 4. KV 缓存更新 +-+-+ if past_key_value is not None: +-+-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+-+ key_states, value_states = past_key_value.update( +-+-+ key_states, value_states, self.layer_idx, cache_kwargs +-+-+ ) +-+-+ +-+-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+-+ if cache_position.shape[0] == 1: +-+-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+-+ kv_seq_len = key_states.shape[-2] +-+-+ +-+-+ # 5. [重要] 准备 Attention Mask +-+-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+-+ fa_attention_mask = None +-+-+ if attention_mask is not None: +-+-+ # 截取与当前key长度匹配的部分 +-+-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+-+ fa_attention_mask = (mask_slice != 0) +-+-+ +-+-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+-+ input_dtype = query_states.dtype +-+-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+-+ query_states = query_states.to(mindspore.float16) +-+-+ key_states = key_states.to(mindspore.float16) +-+-+ value_states = value_states.to(mindspore.float16) +-+-+ +-+-+ # 6. [核心] 调用 flash_attention_score 算子 +-+-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+-+ attn_output = mindspore.ops.flash_attention_score( +-+-+ query=query_states, +-+-+ key=key_states, +-+-+ value=value_states, +-+-+ head_num=self.num_heads, # 传入Q的头数(N1) +-+-+ attn_mask=fa_attention_mask, +-+-+ keep_prob=1.0 - self.attention_dropout, +-+-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+-+ input_layout="BNSD", +-+-+ sparse_mode=0 # 使用 defaultMask 模式 +-+-+ ) +-+-+ +-+-+ # 恢复原始数据类型 +-+-+ attn_output = attn_output.to(input_dtype) +-+-+ +-+-+ # 7. 调整输出形状 +-+-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+-+ attn_output = self.o_proj(attn_output) +-+-+ +-+-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-+-+ attn_weights = None +-+-+ if output_attentions: +-+-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+-+ +-+-+ return attn_output, attn_weights, past_key_value +-+-+ +-+-+ # def forward( +-+-+ # self, +-+-+ # hidden_states: mindspore.Tensor, +-+-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+-+ # position_ids: Optional[mindspore.Tensor] = None, +-+-+ # past_key_value: Optional[Cache] = None, +-+-+ # output_attentions: bool = False, +-+-+ # use_cache: bool = False, +-+-+ # cache_position: Optional[mindspore.Tensor] = None, +-+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+-+ +-+-+ # bsz, q_len, _ = hidden_states.shape +-+-+ +-+-+ # # 1. 线性投射 Q, K, V +-+-+ # query_states = self.q_proj(hidden_states) +-+-+ # key_states = self.k_proj(hidden_states) +-+-+ # value_states = self.v_proj(hidden_states) +-+-+ +-+-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ +-+-+ # # 3. RoPE 旋转位置编码 +-+-+ # kv_seq_len = key_states.shape[-2] +-+-+ # if past_key_value is not None: +-+-+ # if self.layer_idx is None: +-+-+ # raise ValueError( +-+-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+-+ # "with a layer index." +-+-+ # ) +-+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+-+ +-+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+-+ +-+-+ # # 4. KV 缓存更新 +-+-+ # if past_key_value is not None: +-+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+-+ # key_states, value_states = past_key_value.update( +-+-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+-+ # ) +-+-+ +-+-+ # # 5. 准备 Attention Mask +-+-+ # fa_attention_mask = None +-+-+ # if attention_mask is not None: +-+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+-+ # fa_attention_mask = (mask_slice != 0) +-+-+ +-+-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+-+ # input_dtype = query_states.dtype +-+-+ +-+-+ # # 6. [核心] 调用 flash_attention_score 算子 +-+-+ # attn_output = mindspore.ops.flash_attention_score( +-+-+ # query=query_states, +-+-+ # key=key_states, +-+-+ # value=value_states, +-+-+ # head_num=self.num_heads, +-+-+ # attn_mask=fa_attention_mask, +-+-+ # keep_prob=1.0 - self.attention_dropout, +-+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+-+ # input_layout="BNSD", +-+-+ # sparse_mode=0, +-+-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-+-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+-+ # inner_precise=1 +-+-+ # ) +-+-+ +-+-+ # # 恢复原始数据类型 +-+-+ # attn_output = attn_output.to(input_dtype) +-+-+ +-+-+ # # 7. 调整输出形状 +-+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+-+ # attn_output = self.o_proj(attn_output) +-+-+ +-+-+ # attn_weights = None +-+-+ # if output_attentions: +-+-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+-+ +-+-+ # return attn_output, attn_weights, past_key_value +-+-+ +-+-+ # def forward( +-+-+ # self, +-+-+ # hidden_states: mindspore.Tensor, +-+-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+-+ # position_ids: Optional[mindspore.Tensor] = None, +-+-+ # past_key_value: Optional[Cache] = None, +-+-+ # output_attentions: bool = False, +-+-+ # use_cache: bool = False, +-+-+ # cache_position: Optional[mindspore.Tensor] = None, +-+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+-+ +-+-+ # bsz, q_len, _ = hidden_states.shape +-+-+ +-+-+ # query_states = self.q_proj(hidden_states) +-+-+ # key_states = self.k_proj(hidden_states) +-+-+ # value_states = self.v_proj(hidden_states) +-+-+ +-+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-+ +-+-+ # kv_seq_len = key_states.shape[-2] +-+-+ # if past_key_value is not None: +-+-+ # if self.layer_idx is None: +-+-+ # raise ValueError("`layer_idx` must be specified for caching") +-+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+-+ +-+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+-+ +-+-+ # if past_key_value is not None: +-+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+-+ # key_states, value_states = past_key_value.update( +-+-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+-+ # ) +-+-+ +-+-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+-+ +-+-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+-+ # query_states = query_states / math.sqrt(self.head_dim) +-+-+ # # <--- 修改结束 --- +-+-+ +-+-+ # fa_attention_mask = None +-+-+ # if attention_mask is not None: +-+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+-+ # fa_attention_mask = (mask_slice != 0) +-+-+ +-+-+ # input_dtype = query_states.dtype +-+-+ +-+-+ # attn_output = mindspore.ops.flash_attention_score( +-+-+ # query=query_states, # 传入已经预先缩放过的 query +-+-+ # key=key_states, +-+-+ # value=value_states, +-+-+ # head_num=self.num_heads, +-+-+ # attn_mask=fa_attention_mask, +-+-+ # keep_prob=1.0 - self.attention_dropout, +-+-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+-+ # input_layout="BNSD", +-+-+ # sparse_mode=0, +-+-+ # inner_precise=1 # 仍然保持内部高精度计算 +-+-+ # ) +-+-+ +-+-+ # attn_output = attn_output.to(input_dtype) +-+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+-+ # attn_output = self.o_proj(attn_output) +-+-+ +-+-+ # attn_weights = None +-+-+ # if output_attentions: +-+-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+-+ +-+-+ # return attn_output, attn_weights, past_key_value +-+-+ +-+- QWEN2MOE_ATTENTION_CLASSES = { +-+- "eager": Qwen2MoeAttention, +-+-+ "flash-attention": Qwen2MoeFlashAttention, +-+- } +-+- +-+- +-+-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+- +-+-+ #@dwj +-+-+ # 只遍历激活的专家,而非全部专家 +-+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+-- hidden_states = hidden_states.view(-1, hidden_dim) +-+-- # router_logits: (batch * sequence_length, n_experts) +-+-- router_logits = self.gate(hidden_states) +-+-- +-+-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+-- if self.norm_topk_prob: +-+-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+-- # we cast back to the input dtype +-+-- routing_weights = routing_weights.to(hidden_states.dtype) +-+-- +-+-- final_hidden_states = ops.zeros( +-+-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+-- ) +-+-- +-+-- # One hot encode the selected experts to create an expert mask +-+-- # this will be used to easily index which expert is going to be sollicitated +-+-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+-- +-+-- # Loop over all available experts in the model and perform the computation on each expert +-+-- for expert_idx in range(self.num_experts): +-+-- expert_layer = self.experts[expert_idx] +-+-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+-- +-+-- # Index the correct hidden states and compute the expert hidden state for +-+-- # the current expert. We need to make sure to multiply the output hidden +-+-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+-- if 0 not in idx.shape: +-+-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+-- +-+-- # However `index_add_` only support torch tensors for indexing so we'll use +-+-- # the `top_x` tensor here. +-+-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+-- +-+-- shared_expert_output = self.shared_expert(hidden_states) +-+-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+-- +-+-- final_hidden_states = final_hidden_states + shared_expert_output +-+-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+-+ num_tokens = hidden_states_reshaped.shape[0] +-+-+ +-+-+ router_logits = self.gate(hidden_states_reshaped) +-+-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+-+ +-+-+ if self.norm_topk_prob: +-+-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+-+ routing_weights = routing_weights.to(hidden_states.dtype) +-+-+ +-+-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+-+ flat_selected_experts = selected_experts.flatten() +-+-+ +-+-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+-+ token_indices = broadcasted_token_indices.flatten() +-+-+ +-+-+ active_experts = ops.unique(flat_selected_experts) +-+-+ +-+-+ for expert_idx_tensor in active_experts: +-+-+ expert_idx = expert_idx_tensor.item() +-+-+ expert_layer = self.experts[expert_idx] +-+-+ +-+-+ mask = (flat_selected_experts == expert_idx_tensor) +-+-+ selected_token_indices = token_indices[mask] +-+-+ selected_routing_weights = routing_weights.flatten()[mask] +-+-+ +-+-+ current_states = hidden_states_reshaped[selected_token_indices] +-+-+ +-+-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+-+ +-+-+ final_hidden_states = final_hidden_states.index_add( +-+-+ dim=0, +-+-+ index=selected_token_indices, +-+-+ source=expert_output.to(hidden_states.dtype) +-+-+ ) +-+-+ +-+-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+- +-+-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+-- return final_hidden_states, router_logits +-+-+ final_hidden_states = final_hidden_states + shared_expert_output +-+-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+-+ +-+-+ return final_hidden_states, router_logits +-+- +-+- +-+- class Qwen2MoeDecoderLayer(nn.Module): +-+-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+- +-+- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+- +-+-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+-+ +-+- if (layer_idx not in config.mlp_only_layers) and ( +-+- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+- ): +-+-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+- _skip_keys_device_placement = "past_key_values" +-+- _supports_cache_class = True +-+-+#lwx +-+-+ # _supports_static_cache = True +-+- +-+- def _init_weights(self, module): +-+- std = self.config.initializer_range +-+-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+- return causal_mask +-+- +-+- +-+--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+- _tied_weights_keys = ["lm_head.weight"] +-+- +-+- def __init__(self, config): +-+-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+- self.num_experts_per_tok = config.num_experts_per_tok +-+- # Initialize weights and apply final processing +-+- self.post_init() +-+-+ # @lwx +-+-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+-+ # self.generation_config.cache_implementation = "static" +-+-+ self._warmed_up = False +-+-+ +-+-+ def warmup_moe_model(self): +-+-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+-+ test_texts = [ +-+-+ "warmup short", +-+-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+-+ ] +-+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+-+ if tokenizer is None: +-+-+ from mindnlp.transformers import AutoTokenizer +-+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+-+ self._warmup_tokenizer = tokenizer +-+-+ +-+-+ for text in test_texts: +-+-+ inputs = tokenizer(text, return_tensors="ms") +-+-+ with mindspore._no_grad(): +-+-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+- +-+- def get_input_embeddings(self): +-+- return self.model.embed_tokens +-+-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+- ```""" +-+-+ if not self._warmed_up: +-+-+ self._warmed_up = True +-+-+ self.warmup_moe_model() +-+- +-+- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+- output_router_logits = ( +-+-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+- } +-+- ) +-+- return model_inputs +-+-+# @lwx +-+-+ # def _decode_one_tokens_logits( +-+-+ # self, +-+-+ # cur_token: mindspore.Tensor, +-+-+ # input_pos: Optional[mindspore.Tensor], +-+-+ # cache_position: mindspore.Tensor, +-+-+ # past_key_values: StaticCache, +-+-+ # ) -> mindspore.Tensor: +-+-+ # """ +-+-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+-+ +-+-+ # Args: +-+-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+-+ # input_pos: 输入位置信息,可选 +-+-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-+-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+-+ +-+-+ # Returns: +-+-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+-+ # """ +-+-+ # # 调用JIT编译的版本 +-+-+ # return self.get_decode_one_tokens_logits( +-+-+ # cur_token=cur_token, +-+-+ # input_pos=input_pos, +-+-+ # cache_position=cache_position, +-+-+ # past_key_values=past_key_values, +-+-+ # ) +-+-+ +-+-+ # @mindspore.jit(jit_level='O1') +-+-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+-+ # """ +-+-+ # JIT编译的函数,用于高效的单token解码 +-+-+ # 使用JIT编译优化以支持静态shape和高效执行 +-+-+ +-+-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+-+ # """ +-+-+ # outputs = self.model.forward( +-+-+ # input_ids=cur_token, +-+-+ # position_ids=input_pos, +-+-+ # cache_position=cache_position, +-+-+ # past_key_values=past_key_values, +-+-+ # use_cache=True, +-+-+ # return_dict=False, +-+-+ # ) +-+-+ +-+-+ # hidden_states = outputs[0] +-+-+ # logits = self.lm_head.forward(hidden_states) +-+-+ # logits = logits.float() +-+-+ +-+-+ # return logits[:, -1, :] +-+-+ +-+-+ # def _sample( +-+-+ # self, +-+-+ # input_ids: mindspore.Tensor, +-+-+ # logits_processor, +-+-+ # stopping_criteria, +-+-+ # generation_config, +-+-+ # synced_devices: bool, +-+-+ # streamer=None, +-+-+ # logits_warper=None, +-+-+ # **model_kwargs, +-+-+ # ): +-+-+ # """ +-+-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+-+ # """ +-+-+ # from ...generation.logits_process import LogitsProcessorList +-+-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-+-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+-+ # from mindnlp.core import nn, ops, no_grad +-+-+ # import numpy as np +-+-+ +-+-+ # # 检查是否使用 StaticCache +-+-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+-+ # # 否则,直接调用父类方法 +-+-+ # past_key_values = model_kwargs.get("past_key_values") +-+-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+-+ +-+-+ # if not isinstance(past_key_values, StaticCache): +-+-+ # # 不使用 StaticCache,直接调用父类方法 +-+-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+-+ # return super()._sample( +-+-+ # input_ids=input_ids, +-+-+ # logits_processor=logits_processor, +-+-+ # stopping_criteria=stopping_criteria, +-+-+ # generation_config=generation_config, +-+-+ # synced_devices=synced_devices, +-+-+ # streamer=streamer, +-+-+ # logits_warper=logits_warper, +-+-+ # **model_kwargs, +-+-+ # ) +-+-+ +-+-+ # # 使用 StaticCache,进入自定义循环 +-+-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+-+ # pad_token_id = generation_config._pad_token_tensor +-+-+ # output_attentions = generation_config.output_attentions +-+-+ # output_hidden_states = generation_config.output_hidden_states +-+-+ # output_scores = generation_config.output_scores +-+-+ # output_logits = generation_config.output_logits +-+-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-+-+ # max_length = generation_config.max_length +-+-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+-+ # do_sample = generation_config.do_sample +-+-+ +-+-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+-+ # raise ValueError( +-+-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+-+ # f"{logits_warper})." +-+-+ # ) +-+-+ +-+-+ # # init attention / hidden states / scores tuples +-+-+ # scores = () if (return_dict_in_generate and output_scores) else None +-+-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+-+ +-+-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+-+ # encoder_hidden_states = ( +-+-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+-+ # ) +-+-+ +-+-+ # # keep track of which sequences are already finished +-+-+ # batch_size, cur_len = input_ids.shape +-+-+ # this_peer_finished = False +-+-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+-+ +-+-+ # time_record = [] +-+-+ # from ....utils.testing_utils import parse_flag_from_env +-+-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+-+ +-+-+ # while self._has_unfinished_sequences( +-+-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+-+ # ): +-+-+ # if _record_time: +-+-+ # import time as time_module +-+-+ # infer_start = time_module.time() +-+-+ +-+-+ # # prepare model inputs +-+-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+-+ +-+-+ # # prepare variable output controls +-+-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+-+ +-+-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+-+ # cur_cache_position = model_inputs.get("cache_position") +-+-+ # cur_past_key_values = model_inputs.get("past_key_values") +-+-+ # cur_input_ids = model_inputs.get("input_ids") +-+-+ +-+-+ # if (isinstance(cur_past_key_values, StaticCache) and +-+-+ # cur_cache_position is not None and +-+-+ # len(cur_cache_position.shape) > 0 and +-+-+ # cur_cache_position.shape[0] == 1 and +-+-+ # cur_input_ids is not None and +-+-+ # cur_input_ids.shape[1] == 1): +-+-+ # # 使用 JIT 优化的单 token 解码 +-+-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+-+ # if not hasattr(self, '_jit_used'): +-+-+ # self._jit_used = False +-+-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+-+ +-+-+ # next_token_logits = self.get_decode_one_tokens_logits( +-+-+ # cur_token=cur_input_ids, +-+-+ # input_pos=model_inputs.get("position_ids"), +-+-+ # cache_position=cur_cache_position, +-+-+ # past_key_values=cur_past_key_values, +-+-+ # ) +-+-+ +-+-+ # # 标记已使用JIT(用于后续判断) +-+-+ # if not self._jit_used: +-+-+ # self._jit_used = True +-+-+ +-+-+ # # 构造兼容的输出对象 +-+-+ # class JitOptimizedOutput: +-+-+ # def __init__(self, logits, config): +-+-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+-+ # self.config = config +-+-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-+-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+-+ # self.attentions = None if not config.is_encoder_decoder else None +-+-+ # self.cross_attentions = None +-+-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-+-+ +-+-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+-+ # else: +-+-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+-+ # outputs = self(**model_inputs, return_dict=True) +-+-+ +-+-+ # if synced_devices and this_peer_finished: +-+-+ # continue +-+-+ +-+-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+-+ # next_token_logits = outputs.logits[:, -1, :] +-+-+ +-+-+ # # pre-process distribution +-+-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+-+ # if do_sample: +-+-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+-+ +-+-+ # # Store scores, attentions and hidden_states when required +-+-+ # if return_dict_in_generate: +-+-+ # if output_scores: +-+-+ # scores += (next_token_scores,) +-+-+ # if output_logits: +-+-+ # raw_logits += (next_token_logits,) +-+-+ # if output_attentions: +-+-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-+-+ # if self.config.is_encoder_decoder: +-+-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+-+ +-+-+ # if output_hidden_states: +-+-+ # hidden = ( +-+-+ # outputs.decoder_hidden_states +-+-+ # if self.config.is_encoder_decoder +-+-+ # else outputs.hidden_states +-+-+ # ) +-+-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+-+ +-+-+ # # token selection +-+-+ # if do_sample: +-+-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+-+ # else: +-+-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+-+ +-+-+ # # finished sentences should have their next token be a padding token +-+-+ # if has_eos_stopping_criteria: +-+-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+-+ +-+-+ # # update generated ids, model inputs, and length for next step +-+-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+-+ # if streamer is not None: +-+-+ # streamer.put(next_tokens) +-+-+ +-+-+ # model_kwargs = self._update_model_kwargs_for_generation( +-+-+ # outputs, +-+-+ # model_kwargs, +-+-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-+-+ # ) +-+-+ +-+-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+-+ # cur_len += 1 +-+-+ +-+-+ # if _record_time: +-+-+ # import time as time_module +-+-+ # infer_stop = time_module.time() +-+-+ # time_record.append(infer_stop - infer_start) +-+-+ +-+-+ # del outputs +-+-+ +-+-+ # average_infer_time = None +-+-+ # if time_record: +-+-+ # if len(time_record) > 1: +-+-+ # time_record.pop(0) +-+-+ # average_infer_time = sum(time_record) / len(time_record) +-+-+ # print(f'average inference time is: {average_infer_time}') +-+-+ # print(f'inference time record: {time_record}') +-+-+ +-+-+ # if streamer is not None: +-+-+ # streamer.end() +-+-+ +-+-+ # # 简单判断:打印是否使用了JIT路径 +-+-+ # if hasattr(self, '_jit_used') and self._jit_used: +-+-+ # print("[JIT] ✓ JIT optimization was used during generation") +-+-+ # else: +-+-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+-+ +-+-+ # if return_dict_in_generate: +-+-+ # if self.config.is_encoder_decoder: +-+-+ # return GenerateEncoderDecoderOutput( +-+-+ # sequences=input_ids, +-+-+ # scores=scores, +-+-+ # logits=raw_logits, +-+-+ # encoder_attentions=encoder_attentions, +-+-+ # encoder_hidden_states=encoder_hidden_states, +-+-+ # decoder_attentions=decoder_attentions, +-+-+ # cross_attentions=cross_attentions, +-+-+ # decoder_hidden_states=decoder_hidden_states, +-+-+ # past_key_values=model_kwargs.get("past_key_values"), +-+-+ # average_infer_time=average_infer_time +-+-+ # ) +-+-+ # else: +-+-+ # return GenerateDecoderOnlyOutput( +-+-+ # sequences=input_ids, +-+-+ # scores=scores, +-+-+ # logits=raw_logits, +-+-+ # attentions=decoder_attentions, +-+-+ # hidden_states=decoder_hidden_states, +-+-+ # past_key_values=model_kwargs.get("past_key_values"), +-+-+ # average_infer_time=average_infer_time +-+-+ # ) +-+-+ # else: +-+-+ # return input_ids +-+-+ +-+-+ # def _prepare_cache_for_generation( +-+-+ # self, +-+-+ # generation_config, +-+-+ # model_kwargs, +-+-+ # assistant_model, +-+-+ # batch_size, +-+-+ # max_cache_length, +-+-+ # ): +-+-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+-+ # generation_config.cache_implementation = "static" +-+-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+-+ +-+-+ # if generation_config.cache_implementation == "static": +-+-+ # base_required_from_max_length = generation_config.max_length + 1 +-+-+ # base_required = max(max_cache_length, base_required_from_max_length) +-+-+ # min_cache_size = 50 +-+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+-+ # else: +-+-+ # max_cache_length = max(base_required, min_cache_size) +-+-+ +-+-+ # original_max_cache_length = max_cache_length +-+-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-+-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+-+ # print(f" - final max_cache_length: {max_cache_length}") +-+-+ +-+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+-+ # if max_cache_length > self.config.max_position_embeddings: +-+-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+-+ +-+-+ # result = super()._prepare_cache_for_generation( +-+-+ # generation_config=generation_config, +-+-+ # model_kwargs=model_kwargs, +-+-+ # assistant_model=assistant_model, +-+-+ # batch_size=batch_size, +-+-+ # max_cache_length=max_cache_length, +-+-+ # ) +-+-+ +-+-+ # if generation_config.cache_implementation == "static": +-+-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+-+ # created_cache = model_kwargs.get(cache_name) +-+-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+-+ # if created_cache.max_cache_len < generation_config.max_length: +-+-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+-+ +-+-+ # return result +-+-+ +-+-+ +-+-+ +-+- +-+- +-+- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+--- +-+-2.27.0 +-+- +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +deleted file mode 100644 +index 7217a46b..00000000 +--- a/patches/0005-20251107001commit.patch ++++ /dev/null +@@ -1,7707 +0,0 @@ +-From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Fri, 7 Nov 2025 11:48:18 +0800 +-Subject: [PATCH 5/8] 20251107001commit +- +---- +- .../models/deepseek/modeling_deepseek.py | 91 +- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +- .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +- patches/0001-20251104commit.patch | 2 +- +- patches/0002-20251106commit.patch | 2 +- +- patches/0003-20261106secondcommit.patch | 2 +- +- patches/0004-20251106change.patch | 7498 +++++++++++++++++ +- 7 files changed, 7577 insertions(+), 30 deletions(-) +- create mode 100644 patches/0004-20251106change.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index 0546f318..8831e4b7 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +- # expert_cache += expert_out * weight +- # return expert_cache +- +-- @no_grad() +-- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-- # x 的 shape: (1, hidden_size) +-- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-- +-- # 1. 收集所有需要的专家层 +-- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-- selected_experts = [self.experts[i] for i in flat_expert_indices] +-- +-- # 2. 并行计算所有专家的输出 +-- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-- # ops.cat 会将它们堆叠成一个新的 Tensor +-- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-- +-- # 3. 使用矩阵乘法进行加权求和 +-- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-- # 最终结果 final_output 的 shape: (1, hidden_size) +-- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+ # @no_grad() +-+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ # # x 的 shape: (1, hidden_size) +-+ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+ +-+ # # 1. 收集所有需要的专家层 +-+ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+ # selected_experts = [self.experts[i] for i in flat_expert_indices] +-+ +-+ # # 2. 并行计算所有专家的输出 +-+ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+ # # ops.cat 会将它们堆叠成一个新的 Tensor +-+ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+ +-+ # # 3. 使用矩阵乘法进行加权求和 +-+ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ # # 最终结果 final_output 的 shape: (1, hidden_size) +-+ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +- +-- return final_output +-+ # return final_output +- +- +- # @no_grad() +-@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +- ) +- +- return expert_cache +-+# 放置在 DeepseekMoE 类中 +-+ @no_grad() +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ """ +-+ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-+ +-+ Args: +-+ x (Tensor): 输入张量, shape: (1, hidden_size) +-+ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-+ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-+ """ +-+ top_k, _ = flat_expert_weights.shape +-+ hidden_size = x.shape[-1] +-+ +-+ # 1. 将所有专家的权重堆叠起来 +-+ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-+ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-+ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-+ +-+ # 2. "收集" 所需的专家权重 +-+ selected_gate_w = stacked_gate_w[flat_expert_indices] +-+ selected_up_w = stacked_up_w[flat_expert_indices] +-+ selected_down_w = stacked_down_w[flat_expert_indices] +-+ +-+ # 3. 准备输入 +-+ x_expanded = x.expand((top_k, 1, hidden_size)) +-+ +-+ # 4. 并行计算 gate_proj 和 up_proj +-+ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-+ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-+ +-+ # 5. 计算中间状态 +-+ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-+ +-+ # 6. 并行计算 down_proj +-+ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-+ # --- [FIX] --- +-+ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-+ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-+ # --- [FIX END] --- +-+ +-+ # 7. 根据路由权重进行加权求和 +-+ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-+ +-+ return weighted_sum +-+ +-+ +- +- # @no_grad() +- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index ebd7782e..913a7609 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +- # Copied from transformers.models.llama.modeling_llama.rotate_half +- def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +-- x1 = x[..., : x.shape[-1] // 2] +-- x2 = x[..., x.shape[-1] // 2 :] +-+ # x1 = x[..., : x.shape[-1] // 2] +-+ # x2 = x[..., x.shape[-1] // 2 :] +- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-index d059dcbe..2b217b64 100644 +---- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +- # Copied from transformers.models.llama.modeling_llama.rotate_half +- def rotate_half(x): +- """Rotates half the hidden dims of the input.""" +-- x1 = x[..., : x.shape[-1] // 2] +-- x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+ # x1 = x[..., : x.shape[-1] // 2] +-+ # x2 = x[..., x.shape[-1] // 2 :] +-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +- return ops.cat((-x2, x1), dim=-1) +- +- +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-index 78f22642..0a0ef2d7 100644 +---- a/patches/0001-20251104commit.patch +-+++ b/patches/0001-20251104commit.patch +-@@ -1,7 +1,7 @@ +- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Tue, 4 Nov 2025 09:11:51 +0800 +--Subject: [PATCH 1/3] 20251104commit +-+Subject: [PATCH 1/4] 20251104commit +- +- --- +- mindnlp/transformers/cache_utils.py | 28 +- +-diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-index 22b65dd5..5185270c 100644 +---- a/patches/0002-20251106commit.patch +-+++ b/patches/0002-20251106commit.patch +-@@ -1,7 +1,7 @@ +- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 09:20:38 +0800 +--Subject: [PATCH 2/3] 20251106commit +-+Subject: [PATCH 2/4] 20251106commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 379 ++++- +-diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-index 966529e4..3e05f821 100644 +---- a/patches/0003-20261106secondcommit.patch +-+++ b/patches/0003-20261106secondcommit.patch +-@@ -1,7 +1,7 @@ +- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 14:54:37 +0800 +--Subject: [PATCH 3/3] 20261106secondcommit +-+Subject: [PATCH 3/4] 20261106secondcommit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 217 ++- +-diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-new file mode 100644 +-index 00000000..88a1aef4 +---- /dev/null +-+++ b/patches/0004-20251106change.patch +-@@ -0,0 +1,7498 @@ +-+From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Thu, 6 Nov 2025 15:48:09 +0800 +-+Subject: [PATCH 4/4] 20251106change +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 189 +- +-+ patches/0001-20251104commit.patch | 1272 +++++++ +-+ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +-+ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +-+ 4 files changed, 7244 insertions(+), 186 deletions(-) +-+ create mode 100644 patches/0001-20251104commit.patch +-+ create mode 100644 patches/0002-20251106commit.patch +-+ create mode 100644 patches/0003-20261106secondcommit.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index 2f9192bf..0546f318 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +-+ +-+ return attn_output, attn_weights, past_key_value +-+ +-+-# class DeepseekFlashAttention(nn.Module): +-+-# """ +-+-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-+- +-+-# This class is designed as a drop-in replacement for DeepseekAttention. +-+-# """ +-+- +-+-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+-# super().__init__() +-+-# self.config = config +-+-# self.layer_idx = layer_idx +-+-# if layer_idx is None: +-+-# logger.warning( +-+-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+-# "when creating this class." +-+-# ) +-+- +-+-# self.attention_dropout = config.attention_dropout +-+-# self.hidden_size = config.hidden_size +-+-# self.num_heads = config.num_attention_heads +-+-# self.head_dim = self.hidden_size // self.num_heads +-+-# self.num_key_value_heads = config.num_key_value_heads +-+-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+-# self.max_position_embeddings = config.max_position_embeddings +-+-# self.rope_theta = config.rope_theta +-+-# self.is_causal = True +-+- +-+-# if (self.head_dim * self.num_heads) != self.hidden_size: +-+-# raise ValueError( +-+-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+-# f" and `num_heads`: {self.num_heads})." +-+-# ) +-+- +-+-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+-# self._init_rope() +-+- +-+-# def _init_rope(self): +-+-# if self.config.rope_scaling is None: +-+-# self.rotary_emb = DeepseekRotaryEmbedding( +-+-# self.head_dim, +-+-# max_position_embeddings=self.max_position_embeddings, +-+-# base=self.rope_theta, +-+-# ) +-+-# else: +-+-# scaling_type = self.config.rope_scaling["type"] +-+-# scaling_factor = self.config.rope_scaling["factor"] +-+-# if scaling_type == "linear": +-+-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+-# self.head_dim, +-+-# max_position_embeddings=self.max_position_embeddings, +-+-# scaling_factor=scaling_factor, +-+-# base=self.rope_theta, +-+-# ) +-+-# elif scaling_type == "dynamic": +-+-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+-# self.head_dim, +-+-# max_position_embeddings=self.max_position_embeddings, +-+-# scaling_factor=scaling_factor, +-+-# base=self.rope_theta, +-+-# ) +-+-# else: +-+-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+- +-+-# def forward( +-+-# self, +-+-# hidden_states: mindspore.Tensor, +-+-# attention_mask: Optional[mindspore.Tensor] = None, +-+-# position_ids: Optional[mindspore.Tensor] = None, +-+-# past_key_value: Optional[Cache] = None, +-+-# output_attentions: bool = False, +-+-# use_cache: bool = False, +-+-# **kwargs, +-+-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+-# if "padding_mask" in kwargs: +-+-# warnings.warn( +-+-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+-# ) +-+- +-+-# if output_attentions: +-+-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-+- +-+-# bsz, q_len, _ = hidden_states.shape +-+- +-+-# if self.config.pretraining_tp > 1: +-+-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+- +-+-# query_states = self.q_proj(hidden_states) +-+-# key_states = self.k_proj(hidden_states) +-+-# value_states = self.v_proj(hidden_states) +-+- +-+-# # Reshape for multi-head attention +-+-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+- +-+-# kv_seq_len = key_states.shape[-2] +-+-# if past_key_value is not None: +-+-# if self.layer_idx is None: +-+-# raise ValueError( +-+-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+-# "with a layer index." +-+-# ) +-+-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+- +-+-# # Apply Rotary Positional Embedding +-+-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+- +-+-# if past_key_value is not None: +-+-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-+-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+- +-+-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-+-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-+-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+- +-+-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+- +-+-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+- +-+-# # Convert attention_mask for flash_attention_score +-+-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-+-# if attention_mask is not None: +-+-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-+-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+-# raise ValueError( +-+-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+-# ) +-+-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-+-# else: +-+-# attn_mask_for_fa = None +-+- +-+-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+- +-+-# # Call the fused flash_attention_score operator +-+-# attn_output = mindspore.ops.flash_attention_score( +-+-# query=query_states_for_fa, +-+-# key=key_states_for_fa, +-+-# value=value_states_for_fa, +-+-# head_num=self.num_heads, # This is N1, the number of query heads +-+-# input_layout='BSH', +-+-# attn_mask=attn_mask_for_fa, +-+-# keep_prob=keep_prob, +-+-# scalar_value=1.0 / math.sqrt(self.head_dim), +-+-# sparse_mode=0 # Default mask mode +-+-# ) +-+- +-+-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-+-# attn_output = self.o_proj(attn_output) +-+- +-+-# # Flash Attention does not return attention weights +-+-# attn_weights = None +-+- +-+-# return attn_output, attn_weights, past_key_value +-+ +-+ class DeepseekFlashAttention(nn.Module): +-+ """ +-+@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +-+ super().__init__() +-+ self.hidden_size = config.hidden_size +-+ +-+- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-+- config=config, layer_idx=layer_idx +-+- ) +-++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-++ # config=config, layer_idx=layer_idx +-++ # ) +-+ +-+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-+ config=config, layer_idx=layer_idx +-+@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +-+ return outputs +-+ +-+ +-+- +-+ class DeepseekPreTrainedModel(PreTrainedModel): +-+ config_class = DeepseekConfig +-+ base_model_prefix = "model" +-+@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+ # Initialize weights and apply final processing +-+ self.post_init() +-+ self.warm_up = False +-+- #@dwj +-+- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-+- self.num_layers, +-+- self.num_attention_heads, +-+- self.head_dim, +-+- batch_size=1, +-+- max_length=self.max_length, +-+- dtype=mindspore.float16 +-+- ) +-+- +-+- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-+- key_cache = [] +-+- value_cache = [] +-+- for _ in range(num_layers): +-+- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+- key_cache.append(k) +-+- value_cache.append(v) +-+- return key_cache, value_cache +-+- +-+ +-+ def warmup_moe_model_deep(self): +-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+new file mode 100644 +-+index 00000000..78f22642 +-+--- /dev/null +-++++ b/patches/0001-20251104commit.patch +-+@@ -0,0 +1,1272 @@ +-++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++Subject: [PATCH 1/3] 20251104commit +-++ +-++--- +-++ mindnlp/transformers/cache_utils.py | 28 +- +-++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++ 3 files changed, 976 insertions(+), 87 deletions(-) +-++ +-++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++index cadd2e04..02f8d4be 100644 +-++--- a/mindnlp/transformers/cache_utils.py +-+++++ b/mindnlp/transformers/cache_utils.py +-++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++ # k_out[:, :, cache_position] = key_states +-++ # v_out[:, :, cache_position] = value_states +-++- if ON_ORANGE_PI: +-++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++- else: +-++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++- +-+++ # if ON_ORANGE_PI: +-+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++ # else: +-+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++ if cache_position.ndim > 1: +-+++ cache_position = cache_position.flatten() +-+++ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++ cache_position = cache_position.int() +-+++ +-+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++ k_out[:, :, cache_position] = key_states +-+++ v_out[:, :, cache_position] = value_states +-+++ +-++ return k_out, v_out +-++ +-++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index c695b944..d8303e45 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++- x1 = x[..., : x.shape[-1] // 2] +-++- x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++ # x1 = x[..., : x.shape[-1] // 2] +-+++ # x2 = x[..., x.shape[-1] // 2 :] +-+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++ if self.training: +-++ raise NotImplementedError("Training is not supported yet.") +-++ else: +-++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++- if self.config.n_shared_experts is not None: +-++- y = y + self.shared_experts(identity) +-++- return y +-+++ # @lwx +-+++ if orig_shape[1] == 1: +-+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++ y=y.view(*orig_shape) +-+++ if self.config.n_shared_experts is not None: +-+++ y = y + self.shared_experts(identity) +-+++ return y +-+++ else: +-+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++ if self.config.n_shared_experts is not None: +-+++ y = y + self.shared_experts(identity) +-+++ return y +-+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++ # if self.config.n_shared_experts is not None: +-+++ # y = y + self.shared_experts(identity) +-+++ # return y +-+++ +-+++ @no_grad() +-+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ +-+++ expert_cache = ops.zeros_like(x) +-+++ for i in range(self.num_experts_per_tok): +-+++ expert_id = flat_expert_indices[i].item() +-+++ weight = flat_expert_weights[i].item() +-+++ expert = self.experts[expert_id] +-+++ expert_out = expert(x) +-+++ expert_cache += expert_out * weight +-+++ return expert_cache +-++ +-++ @no_grad() +-++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++- # expert_cache = torch.zeros_like(x) +-++- # idxs = flat_expert_indices.argsort() +-++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++- # token_idxs = idxs // self.num_experts_per_tok +-++- # for i, end_idx in enumerate(tokens_per_expert): +-++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++- # if start_idx == end_idx: +-++- # continue +-++- # expert = self.experts[i] +-++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++- # expert_tokens = x[exp_token_idx] +-++- # expert_out = expert(expert_tokens) +-++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++- # return expert_cache +-+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++ expert_cache = ops.zeros_like(x) +-++ idxs = flat_expert_indices.argsort() +-++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++ token_idxs = idxs // self.num_experts_per_tok +-+++ +-++ for i, end_idx in enumerate(tokens_per_expert): +-++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++ if start_idx == end_idx: +-++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++ expert_out = expert(expert_tokens) +-++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-++ return expert_cache +-+++ +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # # expert_cache = torch.zeros_like(x) +-+++ # # idxs = flat_expert_indices.argsort() +-+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++ # # token_idxs = idxs // self.num_experts_per_tok +-+++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++ # # if start_idx == end_idx: +-+++ # # continue +-+++ # # expert = self.experts[i] +-+++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # # expert_tokens = x[exp_token_idx] +-+++ # # expert_out = expert(expert_tokens) +-+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++ # # return expert_cache +-+++ # expert_cache = ops.zeros_like(x) +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # for i, end_idx in enumerate(tokens_per_expert): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # if start_idx == end_idx: +-+++ # continue +-+++ # expert = self.experts[i] +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = expert(expert_tokens) +-+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-+++ # return expert_cache +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # expert_cache = ops.zeros_like(x) +-+++ +-+++ # # 排序保证顺序一致 +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # # 找出有 token 的专家 +-+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++ +-+++ # for i in active_experts.tolist(): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # end_idx = tokens_per_expert[i] +-+++ # if start_idx == end_idx: # 没有 token +-+++ # continue +-+++ +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = self.experts[i](expert_tokens) +-+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++ +-+++ # expert_cache = mindspore.mint.scatter_add( +-+++ # expert_cache, +-+++ # 0, +-+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++ # expert_out +-+++ # ) +-+++ +-+++ # return expert_cache +-+++ +-+++ +-++ +-++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++ # """ +-++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ +-++ # Initialize weights and apply final processing +-++ self.post_init() +-+++ self.warm_up = False +-+++ +-+++ def warmup_moe_model_deep(self): +-+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++ test_texts = [ +-+++ "warmup short", +-+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++ ] +-+++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++ if tokenizer is None: +-+++ from mindnlp.transformers import AutoTokenizer +-+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++ self._warmup_tokenizer = tokenizer +-+++ +-+++ for text in test_texts: +-+++ inputs = tokenizer(text, return_tensors="ms") +-+++ with mindspore._no_grad(): +-+++ _ = self(**inputs, use_cache=False) +-+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++ +-++ def get_input_embeddings(self): +-++ return self.model.embed_tokens +-++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++ ```""" +-+++ if not self.warm_up: +-+++ self.warm_up = True +-+++ self.warmup_moe_model_deep() +-+++ +-++ output_attentions = ( +-++ output_attentions +-++ if output_attentions is not None +-++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++index 3cbf820e..d4c6b651 100644 +-++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++@@ -18,7 +18,6 @@ +-++ # See the License for the specific language governing permissions and +-++ # limitations under the License. +-++ """MindSpore Qwen2MoE model.""" +-++- +-++ import math +-++ from typing import List, Optional, Tuple, Union +-++ +-++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++ TokenClassifierOutput, +-++ ) +-++ from ...modeling_utils import PreTrainedModel +-+++from ...generation import GenerationMixin +-++ from ....utils import logging +-++ from .configuration_qwen2_moe import Qwen2MoeConfig +-++ +-++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++ self.variance_epsilon = eps +-++ +-++ def forward(self, hidden_states): +-+++ # @dwj +-+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++ # @lwx +-+++ # if not self.training : +-+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++ input_dtype = hidden_states.dtype +-++ hidden_states = hidden_states.to(mindspore.float32) +-++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++@@ -234,6 +239,8 @@ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++ x1 = x[..., : x.shape[-1] // 2] +-++ x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++ self.config = config +-++ self.hidden_size = config.hidden_size +-++ self.intermediate_size = intermediate_size +-+++ +-++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++ self.act_fn = ACT2FN[config.hidden_act] +-++ +-++ def forward(self, x): +-++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++- +-++ +-+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++ # @lwx +-+++ # gate_up_output = self.gate_up_proj(x) +-+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++ # return self.down_proj(swiglu_output) +-+++ +-+++ # def forward(self, x): +-+++ # gate_proj_out = self.gate_proj(x) +-+++ # up_proj_out = self.up_proj(x) +-+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++ # return self.down_proj(swiglu_out) +-+++ +-++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++ """ +-++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++ use_cache: bool = False, +-++ cache_position: Optional[mindspore.Tensor] = None, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ +-+++ +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ query_states = self.q_proj(hidden_states) +-++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++ "with a layer index." +-++ ) +-++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ if isinstance(past_key_value, StaticCache): +-+++ kv_seq_len = key_states.shape[-2] +-+++ else: +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++ if isinstance(past_key_value, StaticCache): +-+++ kv_seq_len = key_states.shape[-2] +-++ +-++ # repeat k/v heads if n_kv_heads < n_heads +-++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++- +-+++ +-++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++ +-++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++- raise ValueError( +-++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++- f" {attn_weights.shape}" +-++- ) +-++- +-++- if attention_mask is not None: # no matter the length, we just slice it +-++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++ if attention_mask is not None: +-+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++ attn_weights = attn_weights + causal_mask +-++ +-++ # upcast attention to fp32 +-++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++ +-++ attn_output = self.o_proj(attn_output) +-++- +-+++ # @lwx +-+++ +-+++ # max_seq_len = self.max_position_embeddings # 2048 +-+++ +-+++ # if attention_mask is not None: +-+++ # # attention_mask: [B, 1, Sq, Sk] +-+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++ +-+++ # # pad 到 [max_seq_len, max_seq_len] +-+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++ # global_attention_mask = padded_mask +-+++ # else: +-+++ # global_attention_mask = None +-+++ +-+++ +-+++ # sparse_mode=3 +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # real_shift=None, +-+++ # padding_mask=None, +-+++ +-+++ # head_num=self.num_heads, +-+++ # attn_mask=global_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ # input_layout="BNSD", +-+++ # pre_tokens=2147483647, +-+++ # next_tokens=2147483647, +-+++ # inner_precise=0, +-+++ # drop_mask=None, +-+++ # prefix=None, +-+++ # actual_seq_qlen=None, +-+++ # actual_seq_kvlen=None, +-+++ # sparse_mode=sparse_mode, +-+++ # ) +-++ if not output_attentions: +-++ attn_weights = None +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++ +-+++class Qwen2MoeFlashAttention(nn.Module): +-+++ """ +-+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++ +-+++ 关键改动: +-+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++ 直接传入原始的 key 和 value 张量效率更高。 +-+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++ """ +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++ super().__init__() +-+++ self.config = config +-+++ self.layer_idx = layer_idx +-+++ self.hidden_size = config.hidden_size +-+++ self.num_heads = config.num_attention_heads +-+++ self.head_dim = self.hidden_size // self.num_heads +-+++ self.num_key_value_heads = config.num_key_value_heads +-+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++ self.max_position_embeddings = config.max_position_embeddings +-+++ self.rope_theta = config.rope_theta +-+++ self.attention_dropout = config.attention_dropout +-+++ +-+++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++ raise ValueError( +-+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++ ) +-+++ +-+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++ +-+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++ self.head_dim, +-+++ max_position_embeddings=self.max_position_embeddings, +-+++ base=self.rope_theta, +-+++ ) +-+++ +-+++ def forward( +-+++ self, +-+++ hidden_states: mindspore.Tensor, +-+++ attention_mask: Optional[mindspore.Tensor] = None, +-+++ position_ids: Optional[mindspore.Tensor] = None, +-+++ past_key_value: Optional[Cache] = None, +-+++ output_attentions: bool = False, +-+++ use_cache: bool = False, +-+++ cache_position: Optional[mindspore.Tensor] = None, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # 1. 线性投射 Q, K, V +-+++ query_states = self.q_proj(hidden_states) +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++ # query: [B, S, H*D] -> [B, N1, S, D] +-+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # 3. RoPE 旋转位置编码 +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++ if self.layer_idx is None: +-+++ raise ValueError( +-+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ "with a layer index." +-+++ ) +-+++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++ if cache_position.shape[0] == 1: +-+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++ kv_seq_len = past_seen_tokens + 1 +-+++ else: +-+++ # prefill 阶段:cache_position 是范围,使用其长度 +-+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++ else: +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # 4. KV 缓存更新 +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ key_states, value_states = past_key_value.update( +-+++ key_states, value_states, self.layer_idx, cache_kwargs +-+++ ) +-+++ +-+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++ if cache_position.shape[0] == 1: +-+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++ kv_seq_len = key_states.shape[-2] +-+++ +-+++ # 5. [重要] 准备 Attention Mask +-+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++ fa_attention_mask = None +-+++ if attention_mask is not None: +-+++ # 截取与当前key长度匹配的部分 +-+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++ fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++ input_dtype = query_states.dtype +-+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++ query_states = query_states.to(mindspore.float16) +-+++ key_states = key_states.to(mindspore.float16) +-+++ value_states = value_states.to(mindspore.float16) +-+++ +-+++ # 6. [核心] 调用 flash_attention_score 算子 +-+++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++ attn_output = mindspore.ops.flash_attention_score( +-+++ query=query_states, +-+++ key=key_states, +-+++ value=value_states, +-+++ head_num=self.num_heads, # 传入Q的头数(N1) +-+++ attn_mask=fa_attention_mask, +-+++ keep_prob=1.0 - self.attention_dropout, +-+++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ input_layout="BNSD", +-+++ sparse_mode=0 # 使用 defaultMask 模式 +-+++ ) +-+++ +-+++ # 恢复原始数据类型 +-+++ attn_output = attn_output.to(input_dtype) +-+++ +-+++ # 7. 调整输出形状 +-+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = self.o_proj(attn_output) +-+++ +-+++ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++ attn_weights = None +-+++ if output_attentions: +-+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ # def forward( +-+++ # self, +-+++ # hidden_states: mindspore.Tensor, +-+++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++ # position_ids: Optional[mindspore.Tensor] = None, +-+++ # past_key_value: Optional[Cache] = None, +-+++ # output_attentions: bool = False, +-+++ # use_cache: bool = False, +-+++ # cache_position: Optional[mindspore.Tensor] = None, +-+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ # bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # # 1. 线性投射 Q, K, V +-+++ # query_states = self.q_proj(hidden_states) +-+++ # key_states = self.k_proj(hidden_states) +-+++ # value_states = self.v_proj(hidden_states) +-+++ +-+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # # 3. RoPE 旋转位置编码 +-+++ # kv_seq_len = key_states.shape[-2] +-+++ # if past_key_value is not None: +-+++ # if self.layer_idx is None: +-+++ # raise ValueError( +-+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ # "with a layer index." +-+++ # ) +-+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # # 4. KV 缓存更新 +-+++ # if past_key_value is not None: +-+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ # key_states, value_states = past_key_value.update( +-+++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++ # ) +-+++ +-+++ # # 5. 准备 Attention Mask +-+++ # fa_attention_mask = None +-+++ # if attention_mask is not None: +-+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++ # input_dtype = query_states.dtype +-+++ +-+++ # # 6. [核心] 调用 flash_attention_score 算子 +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # head_num=self.num_heads, +-+++ # attn_mask=fa_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ # input_layout="BNSD", +-+++ # sparse_mode=0, +-+++ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++ # inner_precise=1 +-+++ # ) +-+++ +-+++ # # 恢复原始数据类型 +-+++ # attn_output = attn_output.to(input_dtype) +-+++ +-+++ # # 7. 调整输出形状 +-+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ # attn_output = self.o_proj(attn_output) +-+++ +-+++ # attn_weights = None +-+++ # if output_attentions: +-+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++ # return attn_output, attn_weights, past_key_value +-+++ +-+++ # def forward( +-+++ # self, +-+++ # hidden_states: mindspore.Tensor, +-+++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++ # position_ids: Optional[mindspore.Tensor] = None, +-+++ # past_key_value: Optional[Cache] = None, +-+++ # output_attentions: bool = False, +-+++ # use_cache: bool = False, +-+++ # cache_position: Optional[mindspore.Tensor] = None, +-+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ # bsz, q_len, _ = hidden_states.shape +-+++ +-+++ # query_states = self.q_proj(hidden_states) +-+++ # key_states = self.k_proj(hidden_states) +-+++ # value_states = self.v_proj(hidden_states) +-+++ +-+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ # kv_seq_len = key_states.shape[-2] +-+++ # if past_key_value is not None: +-+++ # if self.layer_idx is None: +-+++ # raise ValueError("`layer_idx` must be specified for caching") +-+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ # if past_key_value is not None: +-+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ # key_states, value_states = past_key_value.update( +-+++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++ # ) +-+++ +-+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++ +-+++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++ # query_states = query_states / math.sqrt(self.head_dim) +-+++ # # <--- 修改结束 --- +-+++ +-+++ # fa_attention_mask = None +-+++ # if attention_mask is not None: +-+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ # fa_attention_mask = (mask_slice != 0) +-+++ +-+++ # input_dtype = query_states.dtype +-+++ +-+++ # attn_output = mindspore.ops.flash_attention_score( +-+++ # query=query_states, # 传入已经预先缩放过的 query +-+++ # key=key_states, +-+++ # value=value_states, +-+++ # head_num=self.num_heads, +-+++ # attn_mask=fa_attention_mask, +-+++ # keep_prob=1.0 - self.attention_dropout, +-+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++ # input_layout="BNSD", +-+++ # sparse_mode=0, +-+++ # inner_precise=1 # 仍然保持内部高精度计算 +-+++ # ) +-+++ +-+++ # attn_output = attn_output.to(input_dtype) +-+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ # attn_output = self.o_proj(attn_output) +-+++ +-+++ # attn_weights = None +-+++ # if output_attentions: +-+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++ +-+++ # return attn_output, attn_weights, past_key_value +-+++ +-++ QWEN2MOE_ATTENTION_CLASSES = { +-++ "eager": Qwen2MoeAttention, +-+++ "flash-attention": Qwen2MoeFlashAttention, +-++ } +-++ +-++ +-++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-+++ #@dwj +-+++ # 只遍历激活的专家,而非全部专家 +-++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- hidden_states = hidden_states.view(-1, hidden_dim) +-++- # router_logits: (batch * sequence_length, n_experts) +-++- router_logits = self.gate(hidden_states) +-++- +-++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- if self.norm_topk_prob: +-++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- # we cast back to the input dtype +-++- routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++- final_hidden_states = ops.zeros( +-++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++- ) +-++- +-++- # One hot encode the selected experts to create an expert mask +-++- # this will be used to easily index which expert is going to be sollicitated +-++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++- +-++- # Loop over all available experts in the model and perform the computation on each expert +-++- for expert_idx in range(self.num_experts): +-++- expert_layer = self.experts[expert_idx] +-++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++- +-++- # Index the correct hidden states and compute the expert hidden state for +-++- # the current expert. We need to make sure to multiply the output hidden +-++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++- if 0 not in idx.shape: +-++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++- +-++- # However `index_add_` only support torch tensors for indexing so we'll use +-++- # the `top_x` tensor here. +-++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++- +-++- shared_expert_output = self.shared_expert(hidden_states) +-++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++- +-++- final_hidden_states = final_hidden_states + shared_expert_output +-+++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++ num_tokens = hidden_states_reshaped.shape[0] +-+++ +-+++ router_logits = self.gate(hidden_states_reshaped) +-+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++ if self.norm_topk_prob: +-+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++ flat_selected_experts = selected_experts.flatten() +-+++ +-+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++ token_indices = broadcasted_token_indices.flatten() +-+++ +-+++ active_experts = ops.unique(flat_selected_experts) +-+++ +-+++ for expert_idx_tensor in active_experts: +-+++ expert_idx = expert_idx_tensor.item() +-+++ expert_layer = self.experts[expert_idx] +-+++ +-+++ mask = (flat_selected_experts == expert_idx_tensor) +-+++ selected_token_indices = token_indices[mask] +-+++ selected_routing_weights = routing_weights.flatten()[mask] +-+++ +-+++ current_states = hidden_states_reshaped[selected_token_indices] +-+++ +-+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++ +-+++ final_hidden_states = final_hidden_states.index_add( +-+++ dim=0, +-+++ index=selected_token_indices, +-+++ source=expert_output.to(hidden_states.dtype) +-+++ ) +-+++ +-+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++ +-++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++- return final_hidden_states, router_logits +-+++ final_hidden_states = final_hidden_states + shared_expert_output +-+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++ +-+++ return final_hidden_states, router_logits +-++ +-++ +-++ class Qwen2MoeDecoderLayer(nn.Module): +-++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++ +-++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++ +-++ if (layer_idx not in config.mlp_only_layers) and ( +-++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++ ): +-++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++ _skip_keys_device_placement = "past_key_values" +-++ _supports_cache_class = True +-+++#lwx +-+++ # _supports_static_cache = True +-++ +-++ def _init_weights(self, module): +-++ std = self.config.initializer_range +-++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++ return causal_mask +-++ +-++ +-++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ _tied_weights_keys = ["lm_head.weight"] +-++ +-++ def __init__(self, config): +-++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ self.num_experts_per_tok = config.num_experts_per_tok +-++ # Initialize weights and apply final processing +-++ self.post_init() +-+++ # @lwx +-+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++ # self.generation_config.cache_implementation = "static" +-+++ self._warmed_up = False +-+++ +-+++ def warmup_moe_model(self): +-+++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++ test_texts = [ +-+++ "warmup short", +-+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++ ] +-+++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++ if tokenizer is None: +-+++ from mindnlp.transformers import AutoTokenizer +-+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++ self._warmup_tokenizer = tokenizer +-+++ +-+++ for text in test_texts: +-+++ inputs = tokenizer(text, return_tensors="ms") +-+++ with mindspore._no_grad(): +-+++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++ +-++ def get_input_embeddings(self): +-++ return self.model.embed_tokens +-++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++ ```""" +-+++ if not self._warmed_up: +-+++ self._warmed_up = True +-+++ self.warmup_moe_model() +-++ +-++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++ output_router_logits = ( +-++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++ } +-++ ) +-++ return model_inputs +-+++# @lwx +-+++ # def _decode_one_tokens_logits( +-+++ # self, +-+++ # cur_token: mindspore.Tensor, +-+++ # input_pos: Optional[mindspore.Tensor], +-+++ # cache_position: mindspore.Tensor, +-+++ # past_key_values: StaticCache, +-+++ # ) -> mindspore.Tensor: +-+++ # """ +-+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++ +-+++ # Args: +-+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++ # input_pos: 输入位置信息,可选 +-+++ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++ +-+++ # Returns: +-+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++ # """ +-+++ # # 调用JIT编译的版本 +-+++ # return self.get_decode_one_tokens_logits( +-+++ # cur_token=cur_token, +-+++ # input_pos=input_pos, +-+++ # cache_position=cache_position, +-+++ # past_key_values=past_key_values, +-+++ # ) +-+++ +-+++ # @mindspore.jit(jit_level='O1') +-+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++ # """ +-+++ # JIT编译的函数,用于高效的单token解码 +-+++ # 使用JIT编译优化以支持静态shape和高效执行 +-+++ +-+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++ # """ +-+++ # outputs = self.model.forward( +-+++ # input_ids=cur_token, +-+++ # position_ids=input_pos, +-+++ # cache_position=cache_position, +-+++ # past_key_values=past_key_values, +-+++ # use_cache=True, +-+++ # return_dict=False, +-+++ # ) +-+++ +-+++ # hidden_states = outputs[0] +-+++ # logits = self.lm_head.forward(hidden_states) +-+++ # logits = logits.float() +-+++ +-+++ # return logits[:, -1, :] +-+++ +-+++ # def _sample( +-+++ # self, +-+++ # input_ids: mindspore.Tensor, +-+++ # logits_processor, +-+++ # stopping_criteria, +-+++ # generation_config, +-+++ # synced_devices: bool, +-+++ # streamer=None, +-+++ # logits_warper=None, +-+++ # **model_kwargs, +-+++ # ): +-+++ # """ +-+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++ # """ +-+++ # from ...generation.logits_process import LogitsProcessorList +-+++ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++ # from mindnlp.core import nn, ops, no_grad +-+++ # import numpy as np +-+++ +-+++ # # 检查是否使用 StaticCache +-+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++ # # 否则,直接调用父类方法 +-+++ # past_key_values = model_kwargs.get("past_key_values") +-+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++ +-+++ # if not isinstance(past_key_values, StaticCache): +-+++ # # 不使用 StaticCache,直接调用父类方法 +-+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++ # return super()._sample( +-+++ # input_ids=input_ids, +-+++ # logits_processor=logits_processor, +-+++ # stopping_criteria=stopping_criteria, +-+++ # generation_config=generation_config, +-+++ # synced_devices=synced_devices, +-+++ # streamer=streamer, +-+++ # logits_warper=logits_warper, +-+++ # **model_kwargs, +-+++ # ) +-+++ +-+++ # # 使用 StaticCache,进入自定义循环 +-+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++ # pad_token_id = generation_config._pad_token_tensor +-+++ # output_attentions = generation_config.output_attentions +-+++ # output_hidden_states = generation_config.output_hidden_states +-+++ # output_scores = generation_config.output_scores +-+++ # output_logits = generation_config.output_logits +-+++ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++ # max_length = generation_config.max_length +-+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++ # do_sample = generation_config.do_sample +-+++ +-+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++ # raise ValueError( +-+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++ # f"{logits_warper})." +-+++ # ) +-+++ +-+++ # # init attention / hidden states / scores tuples +-+++ # scores = () if (return_dict_in_generate and output_scores) else None +-+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++ +-+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++ # encoder_hidden_states = ( +-+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++ # ) +-+++ +-+++ # # keep track of which sequences are already finished +-+++ # batch_size, cur_len = input_ids.shape +-+++ # this_peer_finished = False +-+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++ +-+++ # time_record = [] +-+++ # from ....utils.testing_utils import parse_flag_from_env +-+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++ +-+++ # while self._has_unfinished_sequences( +-+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++ # ): +-+++ # if _record_time: +-+++ # import time as time_module +-+++ # infer_start = time_module.time() +-+++ +-+++ # # prepare model inputs +-+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++ +-+++ # # prepare variable output controls +-+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++ +-+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++ # cur_cache_position = model_inputs.get("cache_position") +-+++ # cur_past_key_values = model_inputs.get("past_key_values") +-+++ # cur_input_ids = model_inputs.get("input_ids") +-+++ +-+++ # if (isinstance(cur_past_key_values, StaticCache) and +-+++ # cur_cache_position is not None and +-+++ # len(cur_cache_position.shape) > 0 and +-+++ # cur_cache_position.shape[0] == 1 and +-+++ # cur_input_ids is not None and +-+++ # cur_input_ids.shape[1] == 1): +-+++ # # 使用 JIT 优化的单 token 解码 +-+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++ # if not hasattr(self, '_jit_used'): +-+++ # self._jit_used = False +-+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++ +-+++ # next_token_logits = self.get_decode_one_tokens_logits( +-+++ # cur_token=cur_input_ids, +-+++ # input_pos=model_inputs.get("position_ids"), +-+++ # cache_position=cur_cache_position, +-+++ # past_key_values=cur_past_key_values, +-+++ # ) +-+++ +-+++ # # 标记已使用JIT(用于后续判断) +-+++ # if not self._jit_used: +-+++ # self._jit_used = True +-+++ +-+++ # # 构造兼容的输出对象 +-+++ # class JitOptimizedOutput: +-+++ # def __init__(self, logits, config): +-+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++ # self.config = config +-+++ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++ # self.attentions = None if not config.is_encoder_decoder else None +-+++ # self.cross_attentions = None +-+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++ +-+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++ # else: +-+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++ # outputs = self(**model_inputs, return_dict=True) +-+++ +-+++ # if synced_devices and this_peer_finished: +-+++ # continue +-+++ +-+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++ # next_token_logits = outputs.logits[:, -1, :] +-+++ +-+++ # # pre-process distribution +-+++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++ # if do_sample: +-+++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++ +-+++ # # Store scores, attentions and hidden_states when required +-+++ # if return_dict_in_generate: +-+++ # if output_scores: +-+++ # scores += (next_token_scores,) +-+++ # if output_logits: +-+++ # raw_logits += (next_token_logits,) +-+++ # if output_attentions: +-+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++ # if self.config.is_encoder_decoder: +-+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++ +-+++ # if output_hidden_states: +-+++ # hidden = ( +-+++ # outputs.decoder_hidden_states +-+++ # if self.config.is_encoder_decoder +-+++ # else outputs.hidden_states +-+++ # ) +-+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++ +-+++ # # token selection +-+++ # if do_sample: +-+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++ # else: +-+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++ +-+++ # # finished sentences should have their next token be a padding token +-+++ # if has_eos_stopping_criteria: +-+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++ +-+++ # # update generated ids, model inputs, and length for next step +-+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++ # if streamer is not None: +-+++ # streamer.put(next_tokens) +-+++ +-+++ # model_kwargs = self._update_model_kwargs_for_generation( +-+++ # outputs, +-+++ # model_kwargs, +-+++ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++ # ) +-+++ +-+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++ # cur_len += 1 +-+++ +-+++ # if _record_time: +-+++ # import time as time_module +-+++ # infer_stop = time_module.time() +-+++ # time_record.append(infer_stop - infer_start) +-+++ +-+++ # del outputs +-+++ +-+++ # average_infer_time = None +-+++ # if time_record: +-+++ # if len(time_record) > 1: +-+++ # time_record.pop(0) +-+++ # average_infer_time = sum(time_record) / len(time_record) +-+++ # print(f'average inference time is: {average_infer_time}') +-+++ # print(f'inference time record: {time_record}') +-+++ +-+++ # if streamer is not None: +-+++ # streamer.end() +-+++ +-+++ # # 简单判断:打印是否使用了JIT路径 +-+++ # if hasattr(self, '_jit_used') and self._jit_used: +-+++ # print("[JIT] ✓ JIT optimization was used during generation") +-+++ # else: +-+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++ +-+++ # if return_dict_in_generate: +-+++ # if self.config.is_encoder_decoder: +-+++ # return GenerateEncoderDecoderOutput( +-+++ # sequences=input_ids, +-+++ # scores=scores, +-+++ # logits=raw_logits, +-+++ # encoder_attentions=encoder_attentions, +-+++ # encoder_hidden_states=encoder_hidden_states, +-+++ # decoder_attentions=decoder_attentions, +-+++ # cross_attentions=cross_attentions, +-+++ # decoder_hidden_states=decoder_hidden_states, +-+++ # past_key_values=model_kwargs.get("past_key_values"), +-+++ # average_infer_time=average_infer_time +-+++ # ) +-+++ # else: +-+++ # return GenerateDecoderOnlyOutput( +-+++ # sequences=input_ids, +-+++ # scores=scores, +-+++ # logits=raw_logits, +-+++ # attentions=decoder_attentions, +-+++ # hidden_states=decoder_hidden_states, +-+++ # past_key_values=model_kwargs.get("past_key_values"), +-+++ # average_infer_time=average_infer_time +-+++ # ) +-+++ # else: +-+++ # return input_ids +-+++ +-+++ # def _prepare_cache_for_generation( +-+++ # self, +-+++ # generation_config, +-+++ # model_kwargs, +-+++ # assistant_model, +-+++ # batch_size, +-+++ # max_cache_length, +-+++ # ): +-+++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++ # generation_config.cache_implementation = "static" +-+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++ +-+++ # if generation_config.cache_implementation == "static": +-+++ # base_required_from_max_length = generation_config.max_length + 1 +-+++ # base_required = max(max_cache_length, base_required_from_max_length) +-+++ # min_cache_size = 50 +-+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++ # else: +-+++ # max_cache_length = max(base_required, min_cache_size) +-+++ +-+++ # original_max_cache_length = max_cache_length +-+++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++ # print(f" - final max_cache_length: {max_cache_length}") +-+++ +-+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++ # if max_cache_length > self.config.max_position_embeddings: +-+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++ +-+++ # result = super()._prepare_cache_for_generation( +-+++ # generation_config=generation_config, +-+++ # model_kwargs=model_kwargs, +-+++ # assistant_model=assistant_model, +-+++ # batch_size=batch_size, +-+++ # max_cache_length=max_cache_length, +-+++ # ) +-+++ +-+++ # if generation_config.cache_implementation == "static": +-+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++ # created_cache = model_kwargs.get(cache_name) +-+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++ # if created_cache.max_cache_len < generation_config.max_length: +-+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++ +-+++ # return result +-+++ +-+++ +-+++ +-++ +-++ +-++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++-- +-++2.27.0 +-++ +-+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+new file mode 100644 +-+index 00000000..22b65dd5 +-+--- /dev/null +-++++ b/patches/0002-20251106commit.patch +-+@@ -0,0 +1,3200 @@ +-++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Thu, 6 Nov 2025 09:20:38 +0800 +-++Subject: [PATCH 2/3] 20251106commit +-++ +-++--- +-++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +-++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +-++ 3 files changed, 2689 insertions(+), 305 deletions(-) +-++ create mode 100644 patches/0001-20251104commit.patch +-++ +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index d8303e45..73773c22 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +-++ # y = y + self.shared_experts(identity) +-++ # return y +-++ +-+++ # @no_grad() +-+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ +-+++ # expert_cache = ops.zeros_like(x) +-+++ # for i in range(self.num_experts_per_tok): +-+++ # expert_id = flat_expert_indices[i].item() +-+++ # weight = flat_expert_weights[i].item() +-+++ # expert = self.experts[expert_id] +-+++ # expert_out = expert(x) +-+++ # expert_cache += expert_out * weight +-+++ # return expert_cache +-+++ +-++ @no_grad() +-++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ # x 的 shape: (1, hidden_size) +-+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+++ +-+++ # 1. 收集所有需要的专家层 +-+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-+++ +-+++ # 2. 并行计算所有专家的输出 +-+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+++ # ops.cat 会将它们堆叠成一个新的 Tensor +-+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+++ +-+++ # 3. 使用矩阵乘法进行加权求和 +-+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ # 最终结果 final_output 的 shape: (1, hidden_size) +-+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+++ +-+++ return final_output +-++ +-++- expert_cache = ops.zeros_like(x) +-++- for i in range(self.num_experts_per_tok): +-++- expert_id = flat_expert_indices[i].item() +-++- weight = flat_expert_weights[i].item() +-++- expert = self.experts[expert_id] +-++- expert_out = expert(x) +-++- expert_cache += expert_out * weight +-++- return expert_cache +-++ +-++ @no_grad() +-++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++ # @lwx +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-+++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-++ +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +-++ return attn_output, attn_weights, past_key_value +-++ +-++ +-+++# class DeepseekFlashAttention(nn.Module): +-+++# """ +-+++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-+++ +-+++# This class is designed as a drop-in replacement for DeepseekAttention. +-+++# """ +-+++ +-+++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+++# super().__init__() +-+++# self.config = config +-+++# self.layer_idx = layer_idx +-+++# if layer_idx is None: +-+++# logger.warning( +-+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++# "when creating this class." +-+++# ) +-+++ +-+++# self.attention_dropout = config.attention_dropout +-+++# self.hidden_size = config.hidden_size +-+++# self.num_heads = config.num_attention_heads +-+++# self.head_dim = self.hidden_size // self.num_heads +-+++# self.num_key_value_heads = config.num_key_value_heads +-+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++# self.max_position_embeddings = config.max_position_embeddings +-+++# self.rope_theta = config.rope_theta +-+++# self.is_causal = True +-+++ +-+++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++# raise ValueError( +-+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++# f" and `num_heads`: {self.num_heads})." +-+++# ) +-+++ +-+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+++# self._init_rope() +-+++ +-+++# def _init_rope(self): +-+++# if self.config.rope_scaling is None: +-+++# self.rotary_emb = DeepseekRotaryEmbedding( +-+++# self.head_dim, +-+++# max_position_embeddings=self.max_position_embeddings, +-+++# base=self.rope_theta, +-+++# ) +-+++# else: +-+++# scaling_type = self.config.rope_scaling["type"] +-+++# scaling_factor = self.config.rope_scaling["factor"] +-+++# if scaling_type == "linear": +-+++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+++# self.head_dim, +-+++# max_position_embeddings=self.max_position_embeddings, +-+++# scaling_factor=scaling_factor, +-+++# base=self.rope_theta, +-+++# ) +-+++# elif scaling_type == "dynamic": +-+++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+++# self.head_dim, +-+++# max_position_embeddings=self.max_position_embeddings, +-+++# scaling_factor=scaling_factor, +-+++# base=self.rope_theta, +-+++# ) +-+++# else: +-+++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+++ +-+++# def forward( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# attention_mask: Optional[mindspore.Tensor] = None, +-+++# position_ids: Optional[mindspore.Tensor] = None, +-+++# past_key_value: Optional[Cache] = None, +-+++# output_attentions: bool = False, +-+++# use_cache: bool = False, +-+++# **kwargs, +-+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++# if "padding_mask" in kwargs: +-+++# warnings.warn( +-+++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+++# ) +-+++ +-+++# if output_attentions: +-+++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-+++ +-+++# bsz, q_len, _ = hidden_states.shape +-+++ +-+++# if self.config.pretraining_tp > 1: +-+++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+++ +-+++# query_states = self.q_proj(hidden_states) +-+++# key_states = self.k_proj(hidden_states) +-+++# value_states = self.v_proj(hidden_states) +-+++ +-+++# # Reshape for multi-head attention +-+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++# kv_seq_len = key_states.shape[-2] +-+++# if past_key_value is not None: +-+++# if self.layer_idx is None: +-+++# raise ValueError( +-+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++# "with a layer index." +-+++# ) +-+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++# # Apply Rotary Positional Embedding +-+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++# if past_key_value is not None: +-+++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-+++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-+++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ +-+++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++ +-+++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++ +-+++# # Convert attention_mask for flash_attention_score +-+++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-+++# if attention_mask is not None: +-+++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-+++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+++# raise ValueError( +-+++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+++# ) +-+++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-+++# else: +-+++# attn_mask_for_fa = None +-+++ +-+++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+++ +-+++# # Call the fused flash_attention_score operator +-+++# attn_output = mindspore.ops.flash_attention_score( +-+++# query=query_states_for_fa, +-+++# key=key_states_for_fa, +-+++# value=value_states_for_fa, +-+++# head_num=self.num_heads, # This is N1, the number of query heads +-+++# input_layout='BSH', +-+++# attn_mask=attn_mask_for_fa, +-+++# keep_prob=keep_prob, +-+++# scalar_value=1.0 / math.sqrt(self.head_dim), +-+++# sparse_mode=0 # Default mask mode +-+++# ) +-+++ +-+++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-+++# attn_output = self.o_proj(attn_output) +-+++ +-+++# # Flash Attention does not return attention weights +-+++# attn_weights = None +-+++ +-+++# return attn_output, attn_weights, past_key_value +-+++ +-+++class DeepseekFlashAttention(nn.Module): +-+++ """ +-+++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-+++ This implementation is a drop-in replacement for the original DeepseekAttention class, +-+++ designed for high performance on supported hardware (Ascend). +-+++ +-+++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-+++ """ +-+++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+++ super().__init__() +-+++ self.config = config +-+++ self.layer_idx = layer_idx +-+++ if layer_idx is None: +-+++ logger.warning( +-+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++ "when creating this class." +-+++ ) +-+++ +-+++ # --- [FIX] Correctly initialize all required attributes --- +-+++ self.attention_dropout = config.attention_dropout +-+++ self.hidden_size = config.hidden_size +-+++ self.num_heads = config.num_attention_heads +-+++ self.head_dim = self.hidden_size // self.num_heads +-+++ self.num_key_value_heads = config.num_key_value_heads +-+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++ self.max_position_embeddings = config.max_position_embeddings +-+++ self.rope_theta = config.rope_theta +-+++ self.is_causal = True +-+++ +-+++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++ raise ValueError( +-+++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++ f" and `num_heads`: {self.num_heads})." +-+++ ) +-+++ +-+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+++ +-+++ # This call will now succeed as all attributes are initialized. +-+++ self._init_rope() +-+++ +-+++ def _init_rope(self): +-+++ if self.config.rope_scaling is None: +-+++ self.rotary_emb = DeepseekRotaryEmbedding( +-+++ self.head_dim, +-+++ max_position_embeddings=self.max_position_embeddings, +-+++ base=self.rope_theta, +-+++ ) +-+++ else: +-+++ scaling_type = self.config.rope_scaling["type"] +-+++ scaling_factor = self.config.rope_scaling["factor"] +-+++ if scaling_type == "linear": +-+++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+++ self.head_dim, +-+++ max_position_embeddings=self.max_position_embeddings, +-+++ scaling_factor=scaling_factor, +-+++ base=self.rope_theta, +-+++ ) +-+++ elif scaling_type == "dynamic": +-+++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+++ self.head_dim, +-+++ max_position_embeddings=self.max_position_embeddings, +-+++ scaling_factor=scaling_factor, +-+++ base=self.rope_theta, +-+++ ) +-+++ else: +-+++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+++ +-+++ def forward( +-+++ self, +-+++ hidden_states: mindspore.Tensor, +-+++ attention_mask: Optional[mindspore.Tensor] = None, +-+++ position_ids: Optional[mindspore.Tensor] = None, +-+++ past_key_value: Optional[Cache] = None, +-+++ output_attentions: bool = False, +-+++ use_cache: bool = False, +-+++ **kwargs, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ if "padding_mask" in kwargs: +-+++ warnings.warn( +-+++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+++ ) +-+++ if output_attentions: +-+++ warnings.warn( +-+++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-+++ ) +-+++ +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ if self.config.pretraining_tp > 1: +-+++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+++ +-+++ query_states = self.q_proj(hidden_states) +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++ # Apply Rotary Position Embedding +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos} +-+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-+++ # So we must explicitly repeat the KV heads. +-+++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++ +-+++ # Convert attention mask for flash_attention_score +-+++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-+++ if attention_mask is not None: +-+++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+++ raise ValueError( +-+++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+++ ) +-+++ attn_mask_for_fa = attention_mask < 0 +-+++ else: +-+++ attn_mask_for_fa = None +-+++ +-+++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+++ +-+++ # Call the fused operator using the efficient BNSD layout +-+++ attn_output = mindspore.ops.flash_attention_score( +-+++ query=query_states, +-+++ key=key_states, +-+++ value=value_states, +-+++ head_num=self.num_heads, +-+++ input_layout='BNSD', # Specify the correct layout +-+++ attn_mask=attn_mask_for_fa, +-+++ keep_prob=keep_prob, +-+++ scalar_value=1.0 / math.sqrt(self.head_dim) +-+++ ) +-+++ +-+++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ +-+++ # Apply output projection +-+++ attn_output = self.o_proj(attn_output) +-+++ +-+++ # Flash attention does not return attention weights, so we return None. +-+++ attn_weights = None +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-++ Deepseek_ATTENTION_CLASSES = { +-++ "eager": DeepseekAttention, +-+++ "flash-attention": DeepseekFlashAttention, +-++ } +-++ +-++ +-++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +-++ config=config, layer_idx=layer_idx +-++ ) +-++ +-+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-+++ config=config, layer_idx=layer_idx +-+++ ) +-+++ +-++ self.mlp = ( +-++ DeepseekMoE(config) +-++ if ( +-++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++index d4c6b651..bced285c 100644 +-++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +-++ +-++ import mindspore +-++ import mindnlp.core.nn.functional as F +-++-from mindnlp.core import nn, ops +-+++from mindnlp.core import nn, ops, no_grad +-++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +-++ +-++ from ....common.activations import ACT2FN +-++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +-++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-++ +-+++Long_Prompt = False +-+++PROMPT_LENGTH_THRESHOLD = 128 +-++ +-++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-++ def _prepare_4d_causal_attention_mask_with_cache_position( +-++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +-++ return attn_output, attn_weights, past_key_value +-++ +-++ +-+++# class Qwen2MoeFlashAttention(nn.Module): +-+++# """ +-+++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++ +-+++# 关键改动: +-+++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++# 直接传入原始的 key 和 value 张量效率更高。 +-+++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++# super().__init__() +-+++# self.config = config +-+++# self.layer_idx = layer_idx +-+++# self.hidden_size = config.hidden_size +-+++# self.num_heads = config.num_attention_heads +-+++# self.head_dim = self.hidden_size // self.num_heads +-+++# self.num_key_value_heads = config.num_key_value_heads +-+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++# self.max_position_embeddings = config.max_position_embeddings +-+++# self.rope_theta = config.rope_theta +-+++# self.attention_dropout = config.attention_dropout +-+++ +-+++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++# raise ValueError( +-+++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++# ) +-+++ +-+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++ +-+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++# self.head_dim, +-+++# max_position_embeddings=self.max_position_embeddings, +-+++# base=self.rope_theta, +-+++# ) +-+++ +-+++# def forward( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# attention_mask: Optional[mindspore.Tensor] = None, +-+++# position_ids: Optional[mindspore.Tensor] = None, +-+++# past_key_value: Optional[Cache] = None, +-+++# output_attentions: bool = False, +-+++# use_cache: bool = False, +-+++# cache_position: Optional[mindspore.Tensor] = None, +-+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++# bsz, q_len, _ = hidden_states.shape +-+++ +-+++# # 1. 线性投射 Q, K, V +-+++# query_states = self.q_proj(hidden_states) +-+++# key_states = self.k_proj(hidden_states) +-+++# value_states = self.v_proj(hidden_states) +-+++ +-+++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++# # query: [B, S, H*D] -> [B, N1, S, D] +-+++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++# # 3. RoPE 旋转位置编码 +-+++# kv_seq_len = key_states.shape[-2] +-+++# if past_key_value is not None: +-+++# if self.layer_idx is None: +-+++# raise ValueError( +-+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++# "with a layer index." +-+++# ) +-+++# # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++# if cache_position.shape[0] == 1: +-+++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++# kv_seq_len = past_seen_tokens + 1 +-+++# else: +-+++# # prefill 阶段:cache_position 是范围,使用其长度 +-+++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++# else: +-+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++# # 4. KV 缓存更新 +-+++# if past_key_value is not None: +-+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++# key_states, value_states = past_key_value.update( +-+++# key_states, value_states, self.layer_idx, cache_kwargs +-+++# ) +-+++ +-+++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++# if cache_position.shape[0] == 1: +-+++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++# kv_seq_len = key_states.shape[-2] +-+++ +-+++# # 5. [重要] 准备 Attention Mask +-+++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++# fa_attention_mask = None +-+++# if attention_mask is not None: +-+++# # 截取与当前key长度匹配的部分 +-+++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++# # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++# fa_attention_mask = (mask_slice != 0) +-+++ +-+++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++# input_dtype = query_states.dtype +-+++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++# query_states = query_states.to(mindspore.float16) +-+++# key_states = key_states.to(mindspore.float16) +-+++# value_states = value_states.to(mindspore.float16) +-+++ +-+++# # 6. [核心] 调用 flash_attention_score 算子 +-+++# # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++# attn_output = mindspore.ops.flash_attention_score( +-+++# query=query_states, +-+++# key=key_states, +-+++# value=value_states, +-+++# head_num=self.num_heads, # 传入Q的头数(N1) +-+++# attn_mask=fa_attention_mask, +-+++# keep_prob=1.0 - self.attention_dropout, +-+++# scalar_value=1.0 / math.sqrt(self.head_dim), +-+++# input_layout="BNSD", +-+++# sparse_mode=0 # 使用 defaultMask 模式 +-+++# ) +-+++ +-+++# # 恢复原始数据类型 +-+++# attn_output = attn_output.to(input_dtype) +-+++ +-+++# # 7. 调整输出形状 +-+++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++# attn_output = self.o_proj(attn_output) +-+++ +-+++# # FlashAttention 算子不直接返回注意力权重矩阵 +-+++# attn_weights = None +-+++# if output_attentions: +-+++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++# return attn_output, attn_weights, past_key_value +-+++ +-+++# # def forward( +-+++# # self, +-+++# # hidden_states: mindspore.Tensor, +-+++# # attention_mask: Optional[mindspore.Tensor] = None, +-+++# # position_ids: Optional[mindspore.Tensor] = None, +-+++# # past_key_value: Optional[Cache] = None, +-+++# # output_attentions: bool = False, +-+++# # use_cache: bool = False, +-+++# # cache_position: Optional[mindspore.Tensor] = None, +-+++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++# # bsz, q_len, _ = hidden_states.shape +-+++ +-+++# # # 1. 线性投射 Q, K, V +-+++# # query_states = self.q_proj(hidden_states) +-+++# # key_states = self.k_proj(hidden_states) +-+++# # value_states = self.v_proj(hidden_states) +-+++ +-+++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-+++# # # 3. RoPE 旋转位置编码 +-+++# # kv_seq_len = key_states.shape[-2] +-+++# # if past_key_value is not None: +-+++# # if self.layer_idx is None: +-+++# # raise ValueError( +-+++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++# # "with a layer index." +-+++# # ) +-+++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++# # # 4. KV 缓存更新 +-+++# # if past_key_value is not None: +-+++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++# # key_states, value_states = past_key_value.update( +-+++# # key_states, value_states, self.layer_idx, cache_kwargs +-+++# # ) +-+++ +-+++# # # 5. 准备 Attention Mask +-+++# # fa_attention_mask = None +-+++# # if attention_mask is not None: +-+++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++# # fa_attention_mask = (mask_slice != 0) +-+++ +-+++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++# # input_dtype = query_states.dtype +-+++ +-+++# # # 6. [核心] 调用 flash_attention_score 算子 +-+++# # attn_output = mindspore.ops.flash_attention_score( +-+++# # query=query_states, +-+++# # key=key_states, +-+++# # value=value_states, +-+++# # head_num=self.num_heads, +-+++# # attn_mask=fa_attention_mask, +-+++# # keep_prob=1.0 - self.attention_dropout, +-+++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++# # input_layout="BNSD", +-+++# # sparse_mode=0, +-+++# # # <--- 修改点 2: 启用内部高精度计算 --- +-+++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++# # inner_precise=1 +-+++# # ) +-+++ +-+++# # # 恢复原始数据类型 +-+++# # attn_output = attn_output.to(input_dtype) +-+++ +-+++# # # 7. 调整输出形状 +-+++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++# # attn_output = self.o_proj(attn_output) +-+++ +-+++# # attn_weights = None +-+++# # if output_attentions: +-+++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ +-+++# # return attn_output, attn_weights, past_key_value +-+++ +-+++ +-++ class Qwen2MoeFlashAttention(nn.Module): +-++ """ +-++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++- +-++- 关键改动: +-++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++- 直接传入原始的 key 和 value 张量效率更高。 +-++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-+++ +-+++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-+++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-+++ 完全使用模型的低精度数据类型(如 float16)进行计算, +-+++ 以达到理论上的最高执行速度。 +-++ """ +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++ super().__init__() +-++ self.config = config +-++ self.layer_idx = layer_idx +-+++ if layer_idx is None: +-+++ logger.warning_once( +-+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-+++ ) +-+++ +-++ self.hidden_size = config.hidden_size +-++ self.num_heads = config.num_attention_heads +-++ self.head_dim = self.hidden_size // self.num_heads +-++ self.num_key_value_heads = config.num_key_value_heads +-++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++ self.max_position_embeddings = config.max_position_embeddings +-++ self.rope_theta = config.rope_theta +-++ self.attention_dropout = config.attention_dropout +-++ +-++- if (self.head_dim * self.num_heads) != self.hidden_size: +-++- raise ValueError( +-++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++- ) +-++- +-++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++- # query: [B, S, H*D] -> [B, N1, S, D] +-++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++ # 2. 调整形状以匹配 BNSD 布局 +-++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- +-++- # 3. RoPE 旋转位置编码 +-+++ +-+++ # 3. RoPE 和 KV 缓存 +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++- if self.layer_idx is None: +-++- raise ValueError( +-++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++- "with a layer index." +-++- ) +-++- # 对于 StaticCache,需要特殊处理 kv_seq_len +-++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++- if cache_position.shape[0] == 1: +-++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++- kv_seq_len = past_seen_tokens + 1 +-++- else: +-++- # prefill 阶段:cache_position 是范围,使用其长度 +-++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++- else: +-++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++- +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++- # 4. KV 缓存更新 +-++ if past_key_value is not None: +-++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++- key_states, value_states = past_key_value.update( +-++- key_states, value_states, self.layer_idx, cache_kwargs +-++- ) +-++- +-++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++- if cache_position.shape[0] == 1: +-++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++- kv_seq_len = key_states.shape[-2] +-++- +-++- # 5. [重要] 准备 Attention Mask +-++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++ # 4. 准备 Attention Mask +-++ fa_attention_mask = None +-++ if attention_mask is not None: +-++- # 截取与当前key长度匹配的部分 +-++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++- # 转换为布尔类型: 大负数 -> True, 0 -> False +-++ fa_attention_mask = (mask_slice != 0) +-++ +-++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++- input_dtype = query_states.dtype +-++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++- query_states = query_states.to(mindspore.float16) +-++- key_states = key_states.to(mindspore.float16) +-++- value_states = value_states.to(mindspore.float16) +-++- +-++- # 6. [核心] 调用 flash_attention_score 算子 +-++- # - 无需手动 repeat_kv, 算子原生支持 GQA +-++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +-++ attn_output = mindspore.ops.flash_attention_score( +-++ query=query_states, +-++ key=key_states, +-++ value=value_states, +-++- head_num=self.num_heads, # 传入Q的头数(N1) +-+++ head_num=self.num_heads, +-++ attn_mask=fa_attention_mask, +-++- keep_prob=1.0 - self.attention_dropout, +-+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +-++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++ input_layout="BNSD", +-++- sparse_mode=0 # 使用 defaultMask 模式 +-+++ sparse_mode=0, +-+++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +-++ ) +-++ +-++- # 恢复原始数据类型 +-++- attn_output = attn_output.to(input_dtype) +-++- +-++- # 7. 调整输出形状 +-++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++ # 6. 调整输出形状 +-++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++ attn_output = self.o_proj(attn_output) +-++ +-++- # FlashAttention 算子不直接返回注意力权重矩阵 +-+++ # 7. 返回结果 +-++ attn_weights = None +-++ if output_attentions: +-++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++- # def forward( +-++- # self, +-++- # hidden_states: mindspore.Tensor, +-++- # attention_mask: Optional[mindspore.Tensor] = None, +-++- # position_ids: Optional[mindspore.Tensor] = None, +-++- # past_key_value: Optional[Cache] = None, +-++- # output_attentions: bool = False, +-++- # use_cache: bool = False, +-++- # cache_position: Optional[mindspore.Tensor] = None, +-++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++- +-++- # bsz, q_len, _ = hidden_states.shape +-++- +-++- # # 1. 线性投射 Q, K, V +-++- # query_states = self.q_proj(hidden_states) +-++- # key_states = self.k_proj(hidden_states) +-++- # value_states = self.v_proj(hidden_states) +-++- +-++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- +-++- # # 3. RoPE 旋转位置编码 +-++- # kv_seq_len = key_states.shape[-2] +-++- # if past_key_value is not None: +-++- # if self.layer_idx is None: +-++- # raise ValueError( +-++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++- # "with a layer index." +-++- # ) +-++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++ +-++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++- +-++- # # 4. KV 缓存更新 +-++- # if past_key_value is not None: +-++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++- # key_states, value_states = past_key_value.update( +-++- # key_states, value_states, self.layer_idx, cache_kwargs +-++- # ) +-++- +-++- # # 5. 准备 Attention Mask +-++- # fa_attention_mask = None +-++- # if attention_mask is not None: +-++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++- # fa_attention_mask = (mask_slice != 0) +-++- +-++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++- # input_dtype = query_states.dtype +-++- +-++- # # 6. [核心] 调用 flash_attention_score 算子 +-++- # attn_output = mindspore.ops.flash_attention_score( +-++- # query=query_states, +-++- # key=key_states, +-++- # value=value_states, +-++- # head_num=self.num_heads, +-++- # attn_mask=fa_attention_mask, +-++- # keep_prob=1.0 - self.attention_dropout, +-++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-++- # input_layout="BNSD", +-++- # sparse_mode=0, +-++- # # <--- 修改点 2: 启用内部高精度计算 --- +-++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++- # inner_precise=1 +-++- # ) +-++- +-++- # # 恢复原始数据类型 +-++- # attn_output = attn_output.to(input_dtype) +-+++QWEN2MOE_ATTENTION_CLASSES = { +-+++ "eager": Qwen2MoeAttention, +-+++ "flash-attention": Qwen2MoeFlashAttention, +-+++} +-++ +-++- # # 7. 调整输出形状 +-++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++- # attn_output = self.o_proj(attn_output) +-++ +-++- # attn_weights = None +-++- # if output_attentions: +-++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# def __init__(self, config): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# # gating +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++ +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# #@dwj +-+++# # 只遍历激活的专家,而非全部专家 +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# num_tokens = hidden_states_reshaped.shape[0] +-+++ +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++# flat_selected_experts = selected_experts.flatten() +-+++ +-+++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++# token_indices = broadcasted_token_indices.flatten() +-+++ +-+++# active_experts = ops.unique(flat_selected_experts) +-+++ +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++ +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# selected_token_indices = token_indices[mask] +-+++# selected_routing_weights = routing_weights.flatten()[mask] +-+++ +-+++# current_states = hidden_states_reshaped[selected_token_indices] +-+++ +-+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++ +-+++# final_hidden_states = final_hidden_states.index_add( +-+++# dim=0, +-+++# index=selected_token_indices, +-+++# source=expert_output.to(hidden_states.dtype) +-+++# ) +-+++ +-+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++ +-++- # return attn_output, attn_weights, past_key_value +-+++# final_hidden_states = final_hidden_states + shared_expert_output +-+++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-+++ +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# """ +-+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+++# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# # 门控网络 +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# # 专家列表 +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++# # 共享专家 +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_decode( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# """ +-+++# 【解码路径】针对 sequence_length=1 的极致优化。 +-+++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+++# """ +-+++# batch_size, hidden_dim = hidden_states.shape +-+++ +-+++# expert_outputs_list = [ +-+++# ops.cat([ +-+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++# ], dim=0) +-+++# for i in range(batch_size) +-+++# ] +-+++ +-+++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+++# # shape: (batch_size, top_k, hidden_dim) +-+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++ +-+++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++ +-+++# return moe_output.squeeze(1) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_prefill( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# """ +-+++# 【预填充路径】针对 sequence_length > 1 的优化。 +-+++# 按专家对 Token 进行分组,并进行批处理。 +-+++# """ +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens = hidden_states.shape[0] +-+++# flat_selected_experts = selected_experts.flatten() +-+++ +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++ +-+++# active_experts = ops.unique(flat_selected_experts) +-+++ +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++ +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# selected_token_indices = token_indices[mask] +-+++# selected_routing_weights = routing_weights.flatten()[mask] +-+++ +-+++# current_states = hidden_states[selected_token_indices] +-+++ +-+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++ +-+++# moe_output = moe_output.index_add( +-+++# dim=0, +-+++# index=selected_token_indices, +-+++# source=expert_output.to(hidden_states.dtype) +-+++# ) +-+++# return moe_output +-+++ +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# """ +-+++# 顶层 forward 方法,作为智能分发器。 +-+++# """ +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++- # def forward( +-++- # self, +-++- # hidden_states: mindspore.Tensor, +-++- # attention_mask: Optional[mindspore.Tensor] = None, +-++- # position_ids: Optional[mindspore.Tensor] = None, +-++- # past_key_value: Optional[Cache] = None, +-++- # output_attentions: bool = False, +-++- # use_cache: bool = False, +-++- # cache_position: Optional[mindspore.Tensor] = None, +-++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++- +-++- # bsz, q_len, _ = hidden_states.shape +-++- +-++- # query_states = self.q_proj(hidden_states) +-++- # key_states = self.k_proj(hidden_states) +-++- # value_states = self.v_proj(hidden_states) +-++- +-++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- +-++- # kv_seq_len = key_states.shape[-2] +-++- # if past_key_value is not None: +-++- # if self.layer_idx is None: +-++- # raise ValueError("`layer_idx` must be specified for caching") +-++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++- +-++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++- +-++- # if past_key_value is not None: +-++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++- # key_states, value_states = past_key_value.update( +-++- # key_states, value_states, self.layer_idx, cache_kwargs +-++- # ) +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++# moe_output = None +-+++# # 在推理时,根据序列长度选择最优路径 +-+++# if not self.training: +-+++# if sequence_length == 1: +-+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++# else: +-+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++# else: +-+++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+++# raise NotImplementedError("Training path is not implemented.") +-+++ +-+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+++ +-+++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+++ +-+++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-+++ +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# """ +-+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# # 门控网络 +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# # 专家列表 +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++# # 共享专家 +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_decode( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# batch_size, _ = hidden_states.shape +-+++# expert_outputs_list = [ +-+++# ops.cat([ +-+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++# ], dim=0) +-+++# for i in range(batch_size) +-+++# ] +-+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++# return moe_output.squeeze(1) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_prefill( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens = hidden_states.shape[0] +-+++# flat_selected_experts = selected_experts.flatten() +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++# active_experts = ops.unique(flat_selected_experts) +-+++ +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# selected_token_indices = token_indices[mask] +-+++# selected_routing_weights = routing_weights.flatten()[mask] +-+++# current_states = hidden_states[selected_token_indices] +-+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++# moe_output = moe_output.index_add( +-+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++# ) +-+++# return moe_output +-+++ +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# """ +-+++# 顶层 forward 方法,作为智能分发器。 +-+++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+++# """ +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ +-+++# # 1. 门控计算 (通用逻辑) +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++# # 2. 智能分发到最优 MoE 路径 +-+++# moe_output = None +-+++# if not self.training: +-+++# if sequence_length == 1: +-+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++# else: +-+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++# else: +-+++# raise NotImplementedError("Training path is not implemented.") +-+++ +-+++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++ +-+++# # 4. 合并 MoE 输出和共享专家输出 +-+++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++ +-+++# # 5. 恢复原始形状并返回 +-+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-+++# prefill fastest +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# """ +-+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# # 门控网络 +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# # 专家列表 +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++# # 共享专家 +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_dispatch( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# """ +-+++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+++# """ +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens, _ = hidden_states.shape +-+++ +-+++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+++# flat_selected_experts = selected_experts.flatten() +-+++# flat_routing_weights = routing_weights.flatten() +-++ +-++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++- +-++- # # <--- 核心修改点: 手动进行高精度缩放 --- +-++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++- # query_states = query_states / math.sqrt(self.head_dim) +-++- # # <--- 修改结束 --- +-++- +-++- # fa_attention_mask = None +-++- # if attention_mask is not None: +-++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++- # fa_attention_mask = (mask_slice != 0) +-++- +-++- # input_dtype = query_states.dtype +-++- +-++- # attn_output = mindspore.ops.flash_attention_score( +-++- # query=query_states, # 传入已经预先缩放过的 query +-++- # key=key_states, +-++- # value=value_states, +-++- # head_num=self.num_heads, +-++- # attn_mask=fa_attention_mask, +-++- # keep_prob=1.0 - self.attention_dropout, +-++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++- # input_layout="BNSD", +-++- # sparse_mode=0, +-++- # inner_precise=1 # 仍然保持内部高精度计算 +-++- # ) +-+++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++ +-++- # attn_output = attn_output.to(input_dtype) +-++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++- # attn_output = self.o_proj(attn_output) +-+++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+++# active_experts = ops.unique(flat_selected_experts) +-+++ +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++ +-+++# # 找到所有分配给该专家的 token +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++ +-+++# # 使用 mask 选取对应的 token 和权重 +-+++# current_token_indices = token_indices[mask] +-+++# current_routing_weights = flat_routing_weights[mask] +-+++# current_hidden_states = hidden_states[current_token_indices] +-+++ +-+++# # 对这些 token 进行批处理 +-+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++ +-+++# # 使用 index_add 将结果精确地加回到对应位置 +-+++# moe_output = moe_output.index_add( +-+++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+++# ) +-+++# return moe_output +-+++ +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# """ +-+++# 顶层 forward 方法,作为智能分发器。 +-+++# """ +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ +-+++# # 1. 门控计算 +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++ +-+++# # 2. 调用统一的 MoE 计算内核 +-+++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-++ +-++- # attn_weights = None +-++- # if output_attentions: +-++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++# # 3. 统一处理共享专家 +-+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++ +-+++# # 4. 合并输出 +-+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++ +-+++# # 5. 恢复原始形状并返回 +-+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-+++ +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# """ +-+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++# 【最终高性能与高精度版】: +-+++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+++# 3. 这样实现了速度和准确性的两全其美。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_decode( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# """ +-+++# 【解码路径】极致优化版:bmm + 高精度累加。 +-+++# """ +-+++# original_dtype = hidden_states.dtype +-+++# batch_size, _ = hidden_states.shape +-+++ +-+++# expert_outputs_list = [ +-+++# ops.cat([ +-+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++# ], dim=0) +-+++# for i in range(batch_size) +-+++# ] +-+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++ +-+++# # 在 float32 下执行 bmm,得到高精度结果 +-+++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++ +-+++# # 将高精度结果转换回原始数据类型 +-+++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+++ +-+++# return moe_output +-+++ +-+++# @no_grad() +-+++# def _moe_infer_prefill( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# selected_experts: mindspore.Tensor, +-+++# routing_weights: mindspore.Tensor +-+++# ) -> mindspore.Tensor: +-+++# """ +-+++# 【预填充路径】与原始实现一致,结果精确。 +-+++# """ +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens, _ = hidden_states.shape +-+++# flat_selected_experts = selected_experts.flatten() +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++# active_experts = ops.unique(flat_selected_experts) +-+++ +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# selected_token_indices = token_indices[mask] +-+++# selected_routing_weights = routing_weights.flatten()[mask] +-+++# current_states = hidden_states[selected_token_indices] +-+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++# moe_output = moe_output.index_add( +-+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++# ) +-+++# return moe_output +-+++ +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++ +-++- # return attn_output, attn_weights, past_key_value +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+++# # 如果模型主体是 float16,后续再转换 +-+++ +-+++# moe_output = None +-+++# if not self.training: +-+++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+++# # _moe_infer_decode 内部会处理好类型转换 +-+++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+++# if sequence_length == 1: +-+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++# else: +-+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++# else: +-+++# raise NotImplementedError("Training path is not implemented.") +-+++ +-+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++ +-+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-++ +-++-QWEN2MOE_ATTENTION_CLASSES = { +-++- "eager": Qwen2MoeAttention, +-++- "flash-attention": Qwen2MoeFlashAttention, +-++-} +-+++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++# """ +-+++# 【融合版】一个混合专家模块,内置两种推理策略, +-+++# 由外部全局变量 `Long_Prompt` 控制: +-+++ +-+++# - if Long_Prompt is True: 【精度优先模式】 +-+++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+++# 适用于处理长序列,避免误差累积。 +-+++ +-+++# - if Long_Prompt is False: 【速度优先模式】 +-+++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+++# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+++# """ +-+++# def __init__(self, config: Qwen2MoeConfig): +-+++# super().__init__() +-+++# self.num_experts = config.num_experts +-+++# self.top_k = config.num_experts_per_tok +-+++# self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++# self.experts = nn.ModuleList( +-+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++# ) +-+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++# # --- 速度优先模式的辅助函数 --- +-+++# @no_grad() +-+++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++# original_dtype = hidden_states.dtype +-+++# batch_size, _ = hidden_states.shape +-+++# expert_outputs_list = [ +-+++# ops.cat([ +-+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++# ], dim=0) +-+++# for i in range(batch_size) +-+++# ] +-+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++# weights_fp32 = routing_weights.to(mindspore.float32) +-+++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++# return moe_output_fp32.squeeze(1).to(original_dtype) +-+++ +-+++# @no_grad() +-+++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens, _ = hidden_states.shape +-+++# flat_selected_experts = selected_experts.flatten() +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++# active_experts = ops.unique(flat_selected_experts) +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# selected_token_indices = token_indices[mask] +-+++# selected_routing_weights = routing_weights.flatten()[mask] +-+++# current_states = hidden_states[selected_token_indices] +-+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++# return moe_output +-+++ +-+++# # --- 精度优先模式的辅助函数 --- +-+++# @no_grad() +-+++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++# moe_output = ops.zeros_like(hidden_states) +-+++# num_tokens, _ = hidden_states.shape +-+++# flat_selected_experts = selected_experts.flatten() +-+++# flat_routing_weights = routing_weights.flatten() +-+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++# active_experts = ops.unique(flat_selected_experts) +-+++# for expert_idx_tensor in active_experts: +-+++# expert_idx = expert_idx_tensor.item() +-+++# expert_layer = self.experts[expert_idx] +-+++# mask = (flat_selected_experts == expert_idx_tensor) +-+++# current_token_indices = token_indices[mask] +-+++# current_routing_weights = flat_routing_weights[mask] +-+++# current_hidden_states = hidden_states[current_token_indices] +-+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++# return moe_output +-+++ +-+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++# # 声明我们将要使用一个在模块外部定义的全局变量 +-+++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+++# global Long_Prompt +-+++ +-+++# # 1. 门控计算 (所有模式通用) +-+++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++# router_logits = self.gate(hidden_states_reshaped) +-+++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++# if self.norm_topk_prob: +-+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++# moe_output = None +-+++# if not self.training: +-+++# # 根据 Long_Prompt 标志选择模式 +-+++# if Long_Prompt: +-+++# # --- 精度优先模式 --- +-+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++# else: +-+++# # --- 速度优先模式 --- +-+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++# if sequence_length == 1: +-+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++# else: +-+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++# else: +-+++# raise NotImplementedError("Training path is not implemented.") +-+++ +-+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++ +-+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++# return final_hidden_states, router_logits +-+++ +-+++class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ """ +-+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+++ 控制的顶级推理策略: +-++ +-+++ - if Long_Prompt is True: 【精度优先模式】 +-+++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-+++ 适用于需要严格可复现性的长序列任务。 +-++ +-++-class Qwen2MoeSparseMoeBlock(nn.Module): +-++- def __init__(self, config): +-+++ - if Long_Prompt is False: 【速度优先模式】 +-+++ 采用业界最强的性能组合: +-+++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-+++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-+++ """ +-+++ def __init__(self, config: Qwen2MoeConfig): +-++ super().__init__() +-++ self.num_experts = config.num_experts +-++ self.top_k = config.num_experts_per_tok +-++ self.norm_topk_prob = config.norm_topk_prob +-++ +-++- # gating +-++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++ self.experts = nn.ModuleList( +-++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++ ) +-++- +-++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++ +-++- #@dwj +-++- # 只遍历激活的专家,而非全部专家 +-++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++- num_tokens = hidden_states_reshaped.shape[0] +-++- +-++- router_logits = self.gate(hidden_states_reshaped) +-++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- +-++- if self.norm_topk_prob: +-++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++- flat_selected_experts = selected_experts.flatten() +-++- +-++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++- token_indices = broadcasted_token_indices.flatten() +-++- +-++- active_experts = ops.unique(flat_selected_experts) +-++- +-++- for expert_idx_tensor in active_experts: +-++- expert_idx = expert_idx_tensor.item() +-++- expert_layer = self.experts[expert_idx] +-++- +-++- mask = (flat_selected_experts == expert_idx_tensor) +-++- selected_token_indices = token_indices[mask] +-++- selected_routing_weights = routing_weights.flatten()[mask] +-++- +-++- current_states = hidden_states_reshaped[selected_token_indices] +-++- +-++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++- +-++- final_hidden_states = final_hidden_states.index_add( +-++- dim=0, +-++- index=selected_token_indices, +-++- source=expert_output.to(hidden_states.dtype) +-++- ) +-++- +-++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-+++ @no_grad() +-+++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++ original_dtype = hidden_states.dtype +-+++ batch_size, _ = hidden_states.shape +-+++ expert_outputs_list = [ +-+++ ops.cat([ +-+++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++ ], dim=0) +-+++ for i in range(batch_size) +-+++ ] +-+++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++ weights_fp32 = routing_weights.to(mindspore.float32) +-+++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++ return moe_output_fp32.squeeze(1).to(original_dtype) +-+++ +-+++ @no_grad() +-+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++ num_tokens, _ = hidden_states.shape +-+++ flat_selected_experts = selected_experts.flatten() +-+++ sorted_expert_indices = flat_selected_experts.argsort() +-+++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++ original_token_indices = sorted_expert_indices // self.top_k +-+++ moe_output = ops.zeros_like(hidden_states) +-+++ current_token_offset = 0 +-+++ for i in range(self.num_experts): +-+++ expert_token_count = tokens_per_expert[i] - current_token_offset +-+++ if expert_token_count == 0: +-+++ continue +-+++ end_offset = current_token_offset + expert_token_count +-+++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++ expert_hidden_states = hidden_states[expert_original_token_indices] +-+++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++ current_token_offset += expert_token_count +-+++ return moe_output +-+++ +-+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+++ @no_grad() +-+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++ moe_output = ops.zeros_like(hidden_states) +-+++ num_tokens, _ = hidden_states.shape +-+++ flat_selected_experts = selected_experts.flatten() +-+++ flat_routing_weights = routing_weights.flatten() +-+++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++ active_experts = ops.unique(flat_selected_experts) +-+++ for expert_idx_tensor in active_experts: +-+++ expert_idx = expert_idx_tensor.item() +-+++ expert_layer = self.experts[expert_idx] +-+++ mask = (flat_selected_experts == expert_idx_tensor) +-+++ current_token_indices = token_indices[mask] +-+++ current_routing_weights = flat_routing_weights[mask] +-+++ current_hidden_states = hidden_states[current_token_indices] +-+++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++ return moe_output +-++ +-++- final_hidden_states = final_hidden_states + shared_expert_output +-++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++- +-++- return final_hidden_states, router_logits +-+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++ global Long_Prompt +-+++ +-+++ # 1. 门控计算 (所有模式通用) +-+++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++ router_logits = self.gate(hidden_states_reshaped) +-+++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++ if self.norm_topk_prob: +-+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++ moe_output = None +-+++ if Long_Prompt: +-+++ # --- 精度优先模式 (ACCURACY MODE) --- +-+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ else: +-+++ # --- 速度优先模式 (SPEED MODE) --- +-+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++ if sequence_length == 1: +-+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ else: +-+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ +-++ +-+++ # 3. 共享专家计算与合并 (所有模式通用) +-+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++ +-+++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++ +-+++ return final_hidden_states, router_logits +-++ +-++ class Qwen2MoeDecoderLayer(nn.Module): +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-++ super().__init__() +-++ self.hidden_size = config.hidden_size +-+++ +-+++ # if Long_Prompt: +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ # else: +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++ +-++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ +-++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++- +-++ if (layer_idx not in config.mlp_only_layers) and ( +-++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++ ): +-++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ self._warmed_up = True +-++ self.warmup_moe_model() +-++ +-+++ +-+++ +-++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++ output_router_logits = ( +-++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +-++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ router_logits=outputs.router_logits, +-++ ) +-++ +-+++ def generate(self, *args, **kwargs): +-+++ """ +-+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+++ """ +-+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+++ +-+++ input_ids = kwargs.get("input_ids") +-+++ if input_ids is None and args: +-+++ input_ids = args[0] +-+++ +-+++ if input_ids is not None: +-+++ prompt_length = input_ids.shape[1] +-+++ +-+++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+++ Long_Prompt = True +-+++ else: +-+++ Long_Prompt = False +-+++ +-+++ return super().generate(*args, **kwargs) +-+++ +-++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +-++ def prepare_inputs_for_generation( +-++ self, +-++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +-++ # Exception 1: when passing input_embeds, input_ids may be missing entries +-++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-+++ +-++ if past_key_values is not None: +-++ if inputs_embeds is not None: # Exception 1 +-++ if 0 not in input_ids.shape: +-++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ } +-++ ) +-++ return model_inputs +-+++ +-++ # @lwx +-++ # def _decode_one_tokens_logits( +-++ # self, +-++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +-++ attentions=outputs.attentions, +-++ ) +-++ +-+++ +-++ __all__ = [ +-++ "Qwen2MoeForCausalLM", +-++ "Qwen2MoeModel", +-++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++new file mode 100644 +-++index 00000000..6dfb5b93 +-++--- /dev/null +-+++++ b/patches/0001-20251104commit.patch +-++@@ -0,0 +1,1272 @@ +-+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++Subject: [PATCH] 20251104commit +-+++ +-+++--- +-+++ mindnlp/transformers/cache_utils.py | 28 +- +-+++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++ 3 files changed, 976 insertions(+), 87 deletions(-) +-+++ +-+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++index cadd2e04..02f8d4be 100644 +-+++--- a/mindnlp/transformers/cache_utils.py +-++++++ b/mindnlp/transformers/cache_utils.py +-+++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++ # k_out[:, :, cache_position] = key_states +-+++ # v_out[:, :, cache_position] = value_states +-+++- if ON_ORANGE_PI: +-+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++- else: +-+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++- +-++++ # if ON_ORANGE_PI: +-++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++ # else: +-++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++++ if cache_position.ndim > 1: +-++++ cache_position = cache_position.flatten() +-++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++++ cache_position = cache_position.int() +-++++ +-++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++++ k_out[:, :, cache_position] = key_states +-++++ v_out[:, :, cache_position] = value_states +-++++ +-+++ return k_out, v_out +-+++ +-+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index c695b944..d8303e45 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++- x1 = x[..., : x.shape[-1] // 2] +-+++- x2 = x[..., x.shape[-1] // 2 :] +-++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++ # x1 = x[..., : x.shape[-1] // 2] +-++++ # x2 = x[..., x.shape[-1] // 2 :] +-++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++ if self.training: +-+++ raise NotImplementedError("Training is not supported yet.") +-+++ else: +-+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++- if self.config.n_shared_experts is not None: +-+++- y = y + self.shared_experts(identity) +-+++- return y +-++++ # @lwx +-++++ if orig_shape[1] == 1: +-++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++++ y=y.view(*orig_shape) +-++++ if self.config.n_shared_experts is not None: +-++++ y = y + self.shared_experts(identity) +-++++ return y +-++++ else: +-++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++++ if self.config.n_shared_experts is not None: +-++++ y = y + self.shared_experts(identity) +-++++ return y +-++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++ # if self.config.n_shared_experts is not None: +-++++ # y = y + self.shared_experts(identity) +-++++ # return y +-++++ +-++++ @no_grad() +-++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ +-++++ expert_cache = ops.zeros_like(x) +-++++ for i in range(self.num_experts_per_tok): +-++++ expert_id = flat_expert_indices[i].item() +-++++ weight = flat_expert_weights[i].item() +-++++ expert = self.experts[expert_id] +-++++ expert_out = expert(x) +-++++ expert_cache += expert_out * weight +-++++ return expert_cache +-+++ +-+++ @no_grad() +-+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++- # expert_cache = torch.zeros_like(x) +-+++- # idxs = flat_expert_indices.argsort() +-+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++- # token_idxs = idxs // self.num_experts_per_tok +-+++- # for i, end_idx in enumerate(tokens_per_expert): +-+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++- # if start_idx == end_idx: +-+++- # continue +-+++- # expert = self.experts[i] +-+++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++- # expert_tokens = x[exp_token_idx] +-+++- # expert_out = expert(expert_tokens) +-+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++- # return expert_cache +-++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++ expert_cache = ops.zeros_like(x) +-+++ idxs = flat_expert_indices.argsort() +-+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ token_idxs = idxs // self.num_experts_per_tok +-++++ +-+++ for i, end_idx in enumerate(tokens_per_expert): +-+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ if start_idx == end_idx: +-+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++ expert_out = expert(expert_tokens) +-+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-+++ return expert_cache +-++++ +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # # expert_cache = torch.zeros_like(x) +-++++ # # idxs = flat_expert_indices.argsort() +-++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++ # # if start_idx == end_idx: +-++++ # # continue +-++++ # # expert = self.experts[i] +-++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # # expert_tokens = x[exp_token_idx] +-++++ # # expert_out = expert(expert_tokens) +-++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++ # # return expert_cache +-++++ # expert_cache = ops.zeros_like(x) +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # if start_idx == end_idx: +-++++ # continue +-++++ # expert = self.experts[i] +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = expert(expert_tokens) +-++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-++++ # return expert_cache +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # expert_cache = ops.zeros_like(x) +-++++ +-++++ # # 排序保证顺序一致 +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # # 找出有 token 的专家 +-++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++ +-++++ # for i in active_experts.tolist(): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # end_idx = tokens_per_expert[i] +-++++ # if start_idx == end_idx: # 没有 token +-++++ # continue +-++++ +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = self.experts[i](expert_tokens) +-++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++ +-++++ # expert_cache = mindspore.mint.scatter_add( +-++++ # expert_cache, +-++++ # 0, +-++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++ # expert_out +-++++ # ) +-++++ +-++++ # return expert_cache +-++++ +-++++ +-+++ +-+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++ # """ +-+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-++++ self.warm_up = False +-++++ +-++++ def warmup_moe_model_deep(self): +-++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++ test_texts = [ +-++++ "warmup short", +-++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++++ ] +-++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++ if tokenizer is None: +-++++ from mindnlp.transformers import AutoTokenizer +-++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++ self._warmup_tokenizer = tokenizer +-++++ +-++++ for text in test_texts: +-++++ inputs = tokenizer(text, return_tensors="ms") +-++++ with mindspore._no_grad(): +-++++ _ = self(**inputs, use_cache=False) +-++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++ +-+++ def get_input_embeddings(self): +-+++ return self.model.embed_tokens +-+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++ ```""" +-++++ if not self.warm_up: +-++++ self.warm_up = True +-++++ self.warmup_moe_model_deep() +-++++ +-+++ output_attentions = ( +-+++ output_attentions +-+++ if output_attentions is not None +-+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++index 3cbf820e..d4c6b651 100644 +-+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++@@ -18,7 +18,6 @@ +-+++ # See the License for the specific language governing permissions and +-+++ # limitations under the License. +-+++ """MindSpore Qwen2MoE model.""" +-+++- +-+++ import math +-+++ from typing import List, Optional, Tuple, Union +-+++ +-+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++ TokenClassifierOutput, +-+++ ) +-+++ from ...modeling_utils import PreTrainedModel +-++++from ...generation import GenerationMixin +-+++ from ....utils import logging +-+++ from .configuration_qwen2_moe import Qwen2MoeConfig +-+++ +-+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++ self.variance_epsilon = eps +-+++ +-+++ def forward(self, hidden_states): +-++++ # @dwj +-++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++ # @lwx +-++++ # if not self.training : +-++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++ input_dtype = hidden_states.dtype +-+++ hidden_states = hidden_states.to(mindspore.float32) +-+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++@@ -234,6 +239,8 @@ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++ x1 = x[..., : x.shape[-1] // 2] +-+++ x2 = x[..., x.shape[-1] // 2 :] +-++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++ self.config = config +-+++ self.hidden_size = config.hidden_size +-+++ self.intermediate_size = intermediate_size +-++++ +-+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++ self.act_fn = ACT2FN[config.hidden_act] +-+++ +-+++ def forward(self, x): +-+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++- +-+++ +-++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++ # @lwx +-++++ # gate_up_output = self.gate_up_proj(x) +-++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++++ # return self.down_proj(swiglu_output) +-++++ +-++++ # def forward(self, x): +-++++ # gate_proj_out = self.gate_proj(x) +-++++ # up_proj_out = self.up_proj(x) +-++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++++ # return self.down_proj(swiglu_out) +-++++ +-+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++ """ +-+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++ use_cache: bool = False, +-+++ cache_position: Optional[mindspore.Tensor] = None, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ +-++++ +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ query_states = self.q_proj(hidden_states) +-+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ "with a layer index." +-+++ ) +-+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ if isinstance(past_key_value, StaticCache): +-++++ kv_seq_len = key_states.shape[-2] +-++++ else: +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++ if isinstance(past_key_value, StaticCache): +-++++ kv_seq_len = key_states.shape[-2] +-+++ +-+++ # repeat k/v heads if n_kv_heads < n_heads +-+++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++- +-++++ +-+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++ +-+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++- raise ValueError( +-+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++- f" {attn_weights.shape}" +-+++- ) +-+++- +-+++- if attention_mask is not None: # no matter the length, we just slice it +-+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++++ if attention_mask is not None: +-++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++ attn_weights = attn_weights + causal_mask +-+++ +-+++ # upcast attention to fp32 +-+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++ +-+++ attn_output = self.o_proj(attn_output) +-+++- +-++++ # @lwx +-++++ +-++++ # max_seq_len = self.max_position_embeddings # 2048 +-++++ +-++++ # if attention_mask is not None: +-++++ # # attention_mask: [B, 1, Sq, Sk] +-++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++ +-++++ # # pad 到 [max_seq_len, max_seq_len] +-++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++ # global_attention_mask = padded_mask +-++++ # else: +-++++ # global_attention_mask = None +-++++ +-++++ +-++++ # sparse_mode=3 +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # real_shift=None, +-++++ # padding_mask=None, +-++++ +-++++ # head_num=self.num_heads, +-++++ # attn_mask=global_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ # input_layout="BNSD", +-++++ # pre_tokens=2147483647, +-++++ # next_tokens=2147483647, +-++++ # inner_precise=0, +-++++ # drop_mask=None, +-++++ # prefix=None, +-++++ # actual_seq_qlen=None, +-++++ # actual_seq_kvlen=None, +-++++ # sparse_mode=sparse_mode, +-++++ # ) +-+++ if not output_attentions: +-+++ attn_weights = None +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ +-++++class Qwen2MoeFlashAttention(nn.Module): +-++++ """ +-++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++ +-++++ 关键改动: +-++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++ 直接传入原始的 key 和 value 张量效率更高。 +-++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++ """ +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++ super().__init__() +-++++ self.config = config +-++++ self.layer_idx = layer_idx +-++++ self.hidden_size = config.hidden_size +-++++ self.num_heads = config.num_attention_heads +-++++ self.head_dim = self.hidden_size // self.num_heads +-++++ self.num_key_value_heads = config.num_key_value_heads +-++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++ self.max_position_embeddings = config.max_position_embeddings +-++++ self.rope_theta = config.rope_theta +-++++ self.attention_dropout = config.attention_dropout +-++++ +-++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++ raise ValueError( +-++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++ ) +-++++ +-++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++ +-++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++ self.head_dim, +-++++ max_position_embeddings=self.max_position_embeddings, +-++++ base=self.rope_theta, +-++++ ) +-++++ +-++++ def forward( +-++++ self, +-++++ hidden_states: mindspore.Tensor, +-++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++ position_ids: Optional[mindspore.Tensor] = None, +-++++ past_key_value: Optional[Cache] = None, +-++++ output_attentions: bool = False, +-++++ use_cache: bool = False, +-++++ cache_position: Optional[mindspore.Tensor] = None, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # 1. 线性投射 Q, K, V +-++++ query_states = self.q_proj(hidden_states) +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++ # query: [B, S, H*D] -> [B, N1, S, D] +-++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # 3. RoPE 旋转位置编码 +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++ if self.layer_idx is None: +-++++ raise ValueError( +-++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ "with a layer index." +-++++ ) +-++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++ if cache_position.shape[0] == 1: +-++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++ kv_seq_len = past_seen_tokens + 1 +-++++ else: +-++++ # prefill 阶段:cache_position 是范围,使用其长度 +-++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++ else: +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # 4. KV 缓存更新 +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ key_states, value_states = past_key_value.update( +-++++ key_states, value_states, self.layer_idx, cache_kwargs +-++++ ) +-++++ +-++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++ if cache_position.shape[0] == 1: +-++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++ kv_seq_len = key_states.shape[-2] +-++++ +-++++ # 5. [重要] 准备 Attention Mask +-++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++ fa_attention_mask = None +-++++ if attention_mask is not None: +-++++ # 截取与当前key长度匹配的部分 +-++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++ fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++ input_dtype = query_states.dtype +-++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++ query_states = query_states.to(mindspore.float16) +-++++ key_states = key_states.to(mindspore.float16) +-++++ value_states = value_states.to(mindspore.float16) +-++++ +-++++ # 6. [核心] 调用 flash_attention_score 算子 +-++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++ attn_output = mindspore.ops.flash_attention_score( +-++++ query=query_states, +-++++ key=key_states, +-++++ value=value_states, +-++++ head_num=self.num_heads, # 传入Q的头数(N1) +-++++ attn_mask=fa_attention_mask, +-++++ keep_prob=1.0 - self.attention_dropout, +-++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ input_layout="BNSD", +-++++ sparse_mode=0 # 使用 defaultMask 模式 +-++++ ) +-++++ +-++++ # 恢复原始数据类型 +-++++ attn_output = attn_output.to(input_dtype) +-++++ +-++++ # 7. 调整输出形状 +-++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = self.o_proj(attn_output) +-++++ +-++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++++ attn_weights = None +-++++ if output_attentions: +-++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ # def forward( +-++++ # self, +-++++ # hidden_states: mindspore.Tensor, +-++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++ # past_key_value: Optional[Cache] = None, +-++++ # output_attentions: bool = False, +-++++ # use_cache: bool = False, +-++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ # bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # # 1. 线性投射 Q, K, V +-++++ # query_states = self.q_proj(hidden_states) +-++++ # key_states = self.k_proj(hidden_states) +-++++ # value_states = self.v_proj(hidden_states) +-++++ +-++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # # 3. RoPE 旋转位置编码 +-++++ # kv_seq_len = key_states.shape[-2] +-++++ # if past_key_value is not None: +-++++ # if self.layer_idx is None: +-++++ # raise ValueError( +-++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ # "with a layer index." +-++++ # ) +-++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # # 4. KV 缓存更新 +-++++ # if past_key_value is not None: +-++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ # key_states, value_states = past_key_value.update( +-++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++ # ) +-++++ +-++++ # # 5. 准备 Attention Mask +-++++ # fa_attention_mask = None +-++++ # if attention_mask is not None: +-++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++ # input_dtype = query_states.dtype +-++++ +-++++ # # 6. [核心] 调用 flash_attention_score 算子 +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # head_num=self.num_heads, +-++++ # attn_mask=fa_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ # input_layout="BNSD", +-++++ # sparse_mode=0, +-++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++ # inner_precise=1 +-++++ # ) +-++++ +-++++ # # 恢复原始数据类型 +-++++ # attn_output = attn_output.to(input_dtype) +-++++ +-++++ # # 7. 调整输出形状 +-++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ # attn_output = self.o_proj(attn_output) +-++++ +-++++ # attn_weights = None +-++++ # if output_attentions: +-++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++ # return attn_output, attn_weights, past_key_value +-++++ +-++++ # def forward( +-++++ # self, +-++++ # hidden_states: mindspore.Tensor, +-++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++ # past_key_value: Optional[Cache] = None, +-++++ # output_attentions: bool = False, +-++++ # use_cache: bool = False, +-++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ # bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # query_states = self.q_proj(hidden_states) +-++++ # key_states = self.k_proj(hidden_states) +-++++ # value_states = self.v_proj(hidden_states) +-++++ +-++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # kv_seq_len = key_states.shape[-2] +-++++ # if past_key_value is not None: +-++++ # if self.layer_idx is None: +-++++ # raise ValueError("`layer_idx` must be specified for caching") +-++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # if past_key_value is not None: +-++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ # key_states, value_states = past_key_value.update( +-++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++ # ) +-++++ +-++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++ +-++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++ # query_states = query_states / math.sqrt(self.head_dim) +-++++ # # <--- 修改结束 --- +-++++ +-++++ # fa_attention_mask = None +-++++ # if attention_mask is not None: +-++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # input_dtype = query_states.dtype +-++++ +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, # 传入已经预先缩放过的 query +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # head_num=self.num_heads, +-++++ # attn_mask=fa_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++ # input_layout="BNSD", +-++++ # sparse_mode=0, +-++++ # inner_precise=1 # 仍然保持内部高精度计算 +-++++ # ) +-++++ +-++++ # attn_output = attn_output.to(input_dtype) +-++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ # attn_output = self.o_proj(attn_output) +-++++ +-++++ # attn_weights = None +-++++ # if output_attentions: +-++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++ +-++++ # return attn_output, attn_weights, past_key_value +-++++ +-+++ QWEN2MOE_ATTENTION_CLASSES = { +-+++ "eager": Qwen2MoeAttention, +-++++ "flash-attention": Qwen2MoeFlashAttention, +-+++ } +-+++ +-+++ +-+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-++++ #@dwj +-++++ # 只遍历激活的专家,而非全部专家 +-+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- hidden_states = hidden_states.view(-1, hidden_dim) +-+++- # router_logits: (batch * sequence_length, n_experts) +-+++- router_logits = self.gate(hidden_states) +-+++- +-+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- if self.norm_topk_prob: +-+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- # we cast back to the input dtype +-+++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++- final_hidden_states = ops.zeros( +-+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++- ) +-+++- +-+++- # One hot encode the selected experts to create an expert mask +-+++- # this will be used to easily index which expert is going to be sollicitated +-+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++- +-+++- # Loop over all available experts in the model and perform the computation on each expert +-+++- for expert_idx in range(self.num_experts): +-+++- expert_layer = self.experts[expert_idx] +-+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++- +-+++- # Index the correct hidden states and compute the expert hidden state for +-+++- # the current expert. We need to make sure to multiply the output hidden +-+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++- if 0 not in idx.shape: +-+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++- +-+++- # However `index_add_` only support torch tensors for indexing so we'll use +-+++- # the `top_x` tensor here. +-+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++- +-+++- shared_expert_output = self.shared_expert(hidden_states) +-+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++- +-+++- final_hidden_states = final_hidden_states + shared_expert_output +-++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++ num_tokens = hidden_states_reshaped.shape[0] +-++++ +-++++ router_logits = self.gate(hidden_states_reshaped) +-++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++ if self.norm_topk_prob: +-++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++ flat_selected_experts = selected_experts.flatten() +-++++ +-++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++ token_indices = broadcasted_token_indices.flatten() +-++++ +-++++ active_experts = ops.unique(flat_selected_experts) +-++++ +-++++ for expert_idx_tensor in active_experts: +-++++ expert_idx = expert_idx_tensor.item() +-++++ expert_layer = self.experts[expert_idx] +-++++ +-++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++ selected_token_indices = token_indices[mask] +-++++ selected_routing_weights = routing_weights.flatten()[mask] +-++++ +-++++ current_states = hidden_states_reshaped[selected_token_indices] +-++++ +-++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++ +-++++ final_hidden_states = final_hidden_states.index_add( +-++++ dim=0, +-++++ index=selected_token_indices, +-++++ source=expert_output.to(hidden_states.dtype) +-++++ ) +-++++ +-++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++ +-+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++- return final_hidden_states, router_logits +-++++ final_hidden_states = final_hidden_states + shared_expert_output +-++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++ +-++++ return final_hidden_states, router_logits +-+++ +-+++ +-+++ class Qwen2MoeDecoderLayer(nn.Module): +-+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++ +-+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++ +-+++ if (layer_idx not in config.mlp_only_layers) and ( +-+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++ ): +-+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++ _skip_keys_device_placement = "past_key_values" +-+++ _supports_cache_class = True +-++++#lwx +-++++ # _supports_static_cache = True +-+++ +-+++ def _init_weights(self, module): +-+++ std = self.config.initializer_range +-+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++ return causal_mask +-+++ +-+++ +-+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ _tied_weights_keys = ["lm_head.weight"] +-+++ +-+++ def __init__(self, config): +-+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ self.num_experts_per_tok = config.num_experts_per_tok +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-++++ # @lwx +-++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++++ # self.generation_config.cache_implementation = "static" +-++++ self._warmed_up = False +-++++ +-++++ def warmup_moe_model(self): +-++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++++ test_texts = [ +-++++ "warmup short", +-++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++++ ] +-++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++ if tokenizer is None: +-++++ from mindnlp.transformers import AutoTokenizer +-++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++ self._warmup_tokenizer = tokenizer +-++++ +-++++ for text in test_texts: +-++++ inputs = tokenizer(text, return_tensors="ms") +-++++ with mindspore._no_grad(): +-++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++ +-+++ def get_input_embeddings(self): +-+++ return self.model.embed_tokens +-+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++ ```""" +-++++ if not self._warmed_up: +-++++ self._warmed_up = True +-++++ self.warmup_moe_model() +-+++ +-+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++ output_router_logits = ( +-+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ } +-+++ ) +-+++ return model_inputs +-++++# @lwx +-++++ # def _decode_one_tokens_logits( +-++++ # self, +-++++ # cur_token: mindspore.Tensor, +-++++ # input_pos: Optional[mindspore.Tensor], +-++++ # cache_position: mindspore.Tensor, +-++++ # past_key_values: StaticCache, +-++++ # ) -> mindspore.Tensor: +-++++ # """ +-++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++++ +-++++ # Args: +-++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++++ # input_pos: 输入位置信息,可选 +-++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++++ +-++++ # Returns: +-++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++++ # """ +-++++ # # 调用JIT编译的版本 +-++++ # return self.get_decode_one_tokens_logits( +-++++ # cur_token=cur_token, +-++++ # input_pos=input_pos, +-++++ # cache_position=cache_position, +-++++ # past_key_values=past_key_values, +-++++ # ) +-++++ +-++++ # @mindspore.jit(jit_level='O1') +-++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++++ # """ +-++++ # JIT编译的函数,用于高效的单token解码 +-++++ # 使用JIT编译优化以支持静态shape和高效执行 +-++++ +-++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++++ # """ +-++++ # outputs = self.model.forward( +-++++ # input_ids=cur_token, +-++++ # position_ids=input_pos, +-++++ # cache_position=cache_position, +-++++ # past_key_values=past_key_values, +-++++ # use_cache=True, +-++++ # return_dict=False, +-++++ # ) +-++++ +-++++ # hidden_states = outputs[0] +-++++ # logits = self.lm_head.forward(hidden_states) +-++++ # logits = logits.float() +-++++ +-++++ # return logits[:, -1, :] +-++++ +-++++ # def _sample( +-++++ # self, +-++++ # input_ids: mindspore.Tensor, +-++++ # logits_processor, +-++++ # stopping_criteria, +-++++ # generation_config, +-++++ # synced_devices: bool, +-++++ # streamer=None, +-++++ # logits_warper=None, +-++++ # **model_kwargs, +-++++ # ): +-++++ # """ +-++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++++ # """ +-++++ # from ...generation.logits_process import LogitsProcessorList +-++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++++ # from mindnlp.core import nn, ops, no_grad +-++++ # import numpy as np +-++++ +-++++ # # 检查是否使用 StaticCache +-++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++++ # # 否则,直接调用父类方法 +-++++ # past_key_values = model_kwargs.get("past_key_values") +-++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++++ +-++++ # if not isinstance(past_key_values, StaticCache): +-++++ # # 不使用 StaticCache,直接调用父类方法 +-++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++++ # return super()._sample( +-++++ # input_ids=input_ids, +-++++ # logits_processor=logits_processor, +-++++ # stopping_criteria=stopping_criteria, +-++++ # generation_config=generation_config, +-++++ # synced_devices=synced_devices, +-++++ # streamer=streamer, +-++++ # logits_warper=logits_warper, +-++++ # **model_kwargs, +-++++ # ) +-++++ +-++++ # # 使用 StaticCache,进入自定义循环 +-++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++++ # pad_token_id = generation_config._pad_token_tensor +-++++ # output_attentions = generation_config.output_attentions +-++++ # output_hidden_states = generation_config.output_hidden_states +-++++ # output_scores = generation_config.output_scores +-++++ # output_logits = generation_config.output_logits +-++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++++ # max_length = generation_config.max_length +-++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++++ # do_sample = generation_config.do_sample +-++++ +-++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++++ # raise ValueError( +-++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++++ # f"{logits_warper})." +-++++ # ) +-++++ +-++++ # # init attention / hidden states / scores tuples +-++++ # scores = () if (return_dict_in_generate and output_scores) else None +-++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++++ +-++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++++ # encoder_hidden_states = ( +-++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++++ # ) +-++++ +-++++ # # keep track of which sequences are already finished +-++++ # batch_size, cur_len = input_ids.shape +-++++ # this_peer_finished = False +-++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++++ +-++++ # time_record = [] +-++++ # from ....utils.testing_utils import parse_flag_from_env +-++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++++ +-++++ # while self._has_unfinished_sequences( +-++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++++ # ): +-++++ # if _record_time: +-++++ # import time as time_module +-++++ # infer_start = time_module.time() +-++++ +-++++ # # prepare model inputs +-++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++++ +-++++ # # prepare variable output controls +-++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++++ +-++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++++ # cur_cache_position = model_inputs.get("cache_position") +-++++ # cur_past_key_values = model_inputs.get("past_key_values") +-++++ # cur_input_ids = model_inputs.get("input_ids") +-++++ +-++++ # if (isinstance(cur_past_key_values, StaticCache) and +-++++ # cur_cache_position is not None and +-++++ # len(cur_cache_position.shape) > 0 and +-++++ # cur_cache_position.shape[0] == 1 and +-++++ # cur_input_ids is not None and +-++++ # cur_input_ids.shape[1] == 1): +-++++ # # 使用 JIT 优化的单 token 解码 +-++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++++ # if not hasattr(self, '_jit_used'): +-++++ # self._jit_used = False +-++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++++ +-++++ # next_token_logits = self.get_decode_one_tokens_logits( +-++++ # cur_token=cur_input_ids, +-++++ # input_pos=model_inputs.get("position_ids"), +-++++ # cache_position=cur_cache_position, +-++++ # past_key_values=cur_past_key_values, +-++++ # ) +-++++ +-++++ # # 标记已使用JIT(用于后续判断) +-++++ # if not self._jit_used: +-++++ # self._jit_used = True +-++++ +-++++ # # 构造兼容的输出对象 +-++++ # class JitOptimizedOutput: +-++++ # def __init__(self, logits, config): +-++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++++ # self.config = config +-++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++++ # self.attentions = None if not config.is_encoder_decoder else None +-++++ # self.cross_attentions = None +-++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++++ +-++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++++ # else: +-++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++++ # outputs = self(**model_inputs, return_dict=True) +-++++ +-++++ # if synced_devices and this_peer_finished: +-++++ # continue +-++++ +-++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++++ # next_token_logits = outputs.logits[:, -1, :] +-++++ +-++++ # # pre-process distribution +-++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++++ # if do_sample: +-++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++++ +-++++ # # Store scores, attentions and hidden_states when required +-++++ # if return_dict_in_generate: +-++++ # if output_scores: +-++++ # scores += (next_token_scores,) +-++++ # if output_logits: +-++++ # raw_logits += (next_token_logits,) +-++++ # if output_attentions: +-++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++++ # if self.config.is_encoder_decoder: +-++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++++ +-++++ # if output_hidden_states: +-++++ # hidden = ( +-++++ # outputs.decoder_hidden_states +-++++ # if self.config.is_encoder_decoder +-++++ # else outputs.hidden_states +-++++ # ) +-++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++++ +-++++ # # token selection +-++++ # if do_sample: +-++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++++ # else: +-++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++++ +-++++ # # finished sentences should have their next token be a padding token +-++++ # if has_eos_stopping_criteria: +-++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++++ +-++++ # # update generated ids, model inputs, and length for next step +-++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++++ # if streamer is not None: +-++++ # streamer.put(next_tokens) +-++++ +-++++ # model_kwargs = self._update_model_kwargs_for_generation( +-++++ # outputs, +-++++ # model_kwargs, +-++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++++ # ) +-++++ +-++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++++ # cur_len += 1 +-++++ +-++++ # if _record_time: +-++++ # import time as time_module +-++++ # infer_stop = time_module.time() +-++++ # time_record.append(infer_stop - infer_start) +-++++ +-++++ # del outputs +-++++ +-++++ # average_infer_time = None +-++++ # if time_record: +-++++ # if len(time_record) > 1: +-++++ # time_record.pop(0) +-++++ # average_infer_time = sum(time_record) / len(time_record) +-++++ # print(f'average inference time is: {average_infer_time}') +-++++ # print(f'inference time record: {time_record}') +-++++ +-++++ # if streamer is not None: +-++++ # streamer.end() +-++++ +-++++ # # 简单判断:打印是否使用了JIT路径 +-++++ # if hasattr(self, '_jit_used') and self._jit_used: +-++++ # print("[JIT] ✓ JIT optimization was used during generation") +-++++ # else: +-++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++++ +-++++ # if return_dict_in_generate: +-++++ # if self.config.is_encoder_decoder: +-++++ # return GenerateEncoderDecoderOutput( +-++++ # sequences=input_ids, +-++++ # scores=scores, +-++++ # logits=raw_logits, +-++++ # encoder_attentions=encoder_attentions, +-++++ # encoder_hidden_states=encoder_hidden_states, +-++++ # decoder_attentions=decoder_attentions, +-++++ # cross_attentions=cross_attentions, +-++++ # decoder_hidden_states=decoder_hidden_states, +-++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++ # average_infer_time=average_infer_time +-++++ # ) +-++++ # else: +-++++ # return GenerateDecoderOnlyOutput( +-++++ # sequences=input_ids, +-++++ # scores=scores, +-++++ # logits=raw_logits, +-++++ # attentions=decoder_attentions, +-++++ # hidden_states=decoder_hidden_states, +-++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++ # average_infer_time=average_infer_time +-++++ # ) +-++++ # else: +-++++ # return input_ids +-++++ +-++++ # def _prepare_cache_for_generation( +-++++ # self, +-++++ # generation_config, +-++++ # model_kwargs, +-++++ # assistant_model, +-++++ # batch_size, +-++++ # max_cache_length, +-++++ # ): +-++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++++ # generation_config.cache_implementation = "static" +-++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++++ +-++++ # if generation_config.cache_implementation == "static": +-++++ # base_required_from_max_length = generation_config.max_length + 1 +-++++ # base_required = max(max_cache_length, base_required_from_max_length) +-++++ # min_cache_size = 50 +-++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++++ # else: +-++++ # max_cache_length = max(base_required, min_cache_size) +-++++ +-++++ # original_max_cache_length = max_cache_length +-++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++++ # print(f" - final max_cache_length: {max_cache_length}") +-++++ +-++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++ # if max_cache_length > self.config.max_position_embeddings: +-++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++++ +-++++ # result = super()._prepare_cache_for_generation( +-++++ # generation_config=generation_config, +-++++ # model_kwargs=model_kwargs, +-++++ # assistant_model=assistant_model, +-++++ # batch_size=batch_size, +-++++ # max_cache_length=max_cache_length, +-++++ # ) +-++++ +-++++ # if generation_config.cache_implementation == "static": +-++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++++ # created_cache = model_kwargs.get(cache_name) +-++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++++ # if created_cache.max_cache_len < generation_config.max_length: +-++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++++ +-++++ # return result +-++++ +-++++ +-++++ +-+++ +-+++ +-+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++-- +-+++2.27.0 +-+++ +-++-- +-++2.27.0 +-++ +-+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+new file mode 100644 +-+index 00000000..966529e4 +-+--- /dev/null +-++++ b/patches/0003-20261106secondcommit.patch +-+@@ -0,0 +1,2769 @@ +-++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Thu, 6 Nov 2025 14:54:37 +0800 +-++Subject: [PATCH 3/3] 20261106secondcommit +-++ +-++--- +-++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +-++ patches/0001-20251104commit.patch | 1272 ----------------- +-++ 3 files changed, 528 insertions(+), 2032 deletions(-) +-++ delete mode 100644 patches/0001-20251104commit.patch +-++ +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index 73773c22..2f9192bf 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +-++ +-++ _CONFIG_FOR_DOC = "DeepseekConfig" +-++ +-+++_attn_mask_cache = {} +-+++ +-+++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-+++ q_len = batch_and_seq[1] +-+++ kv_len = batch_and_seq[1] + past_key_values_length +-+++ key = (batch_and_seq[0], q_len, kv_len) +-+++ +-+++ if key in _attn_mask_cache: +-+++ return _attn_mask_cache[key] +-+++ +-+++ mask = _prepare_4d_causal_attention_mask( +-+++ attention_mask, +-+++ batch_and_seq, +-+++ inputs_embeds, +-+++ past_key_values_length, +-+++ ) +-+++ _attn_mask_cache[key] = mask +-+++ return mask +-++ +-++ def _get_unpad_data(attention_mask): +-++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +-++ return final_output +-++ +-++ +-++- @no_grad() +-++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++- expert_cache = ops.zeros_like(x) +-++- idxs = flat_expert_indices.argsort() +-++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++- token_idxs = idxs // self.num_experts_per_tok +-++- +-++- for i, end_idx in enumerate(tokens_per_expert): +-++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++- if start_idx == end_idx: +-++- continue +-++- expert = self.experts[i] +-++- exp_token_idx = token_idxs[start_idx:end_idx] +-++- expert_tokens = x[exp_token_idx] +-++- expert_out = expert(expert_tokens) +-++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++- +-++- return expert_cache +-++- +-++ # @no_grad() +-++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++- # # expert_cache = torch.zeros_like(x) +-++- # # idxs = flat_expert_indices.argsort() +-++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++- # # token_idxs = idxs // self.num_experts_per_tok +-++- # # for i, end_idx in enumerate(tokens_per_expert): +-++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++- # # if start_idx == end_idx: +-++- # # continue +-++- # # expert = self.experts[i] +-++- # # exp_token_idx = token_idxs[start_idx:end_idx] +-++- # # expert_tokens = x[exp_token_idx] +-++- # # expert_out = expert(expert_tokens) +-++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++- # # return expert_cache +-+++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++ # expert_cache = ops.zeros_like(x) +-++ # idxs = flat_expert_indices.argsort() +-++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +-++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++ +-++ # return expert_cache +-++- # @no_grad() +-++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++- # expert_cache = ops.zeros_like(x) +-+++ +-+++ @no_grad() +-+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++ """ +-+++ 优化版 MoE prefill: +-+++ - 批量张量化处理同一个 expert 的所有 token +-+++ - 跳过无 token 的专家 +-+++ - 保持结果完全一致 +-+++ """ +-+++ # 初始化输出缓存 +-+++ expert_cache = ops.zeros_like(x) +-++ +-++- # # 排序保证顺序一致 +-++- # idxs = flat_expert_indices.argsort() +-++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++- # token_idxs = idxs // self.num_experts_per_tok +-+++ # 排序(确保 scatter_add 位置对应原逻辑) +-+++ idxs = flat_expert_indices.argsort() +-+++ sorted_expert_indices = flat_expert_indices[idxs] +-+++ sorted_token_indices = idxs // self.num_experts_per_tok +-++ +-++- # # 找出有 token 的专家 +-++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++ # 每个 expert 的 token 数 +-+++ tokens_per_expert = sorted_expert_indices.bincount() +-++ +-++- # for i in active_experts.tolist(): +-++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++- # end_idx = tokens_per_expert[i] +-++- # if start_idx == end_idx: # 没有 token +-++- # continue +-+++ # 找出有 token 的专家 +-+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-++ +-++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++- # expert_tokens = x[exp_token_idx] +-++- # expert_out = self.experts[i](expert_tokens) +-++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++ for expert_id in active_experts.tolist(): +-+++ # 取该 expert 对应的排序后 token 区间 +-+++ start = (tokens_per_expert[:expert_id]).sum().item() +-+++ end = start + tokens_per_expert[expert_id].item() +-++ +-++- # expert_cache = mindspore.mint.scatter_add( +-++- # expert_cache, +-++- # 0, +-++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++- # expert_out +-++- # ) +-+++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-+++ expert_tokens = x[token_idx] # 取输入向量 +-++ +-++- # return expert_cache +-+++ # 执行专家 MLP +-+++ expert_out = self.experts[expert_id](expert_tokens) +-+++ +-+++ # 按权重缩放 +-+++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-+++ +-+++ # 回写到缓存(等价 scatter_add) +-+++ expert_cache = mindspore.mint.scatter_add( +-+++ expert_cache, +-+++ 0, +-+++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++ scaled_out +-+++ ) +-+++ +-+++ return expert_cache +-+++ +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # # expert_cache = torch.zeros_like(x) +-+++ # # idxs = flat_expert_indices.argsort() +-+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++ # # token_idxs = idxs // self.num_experts_per_tok +-+++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++ # # if start_idx == end_idx: +-+++ # # continue +-+++ # # expert = self.experts[i] +-+++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # # expert_tokens = x[exp_token_idx] +-+++ # # expert_out = expert(expert_tokens) +-+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++ # # return expert_cache +-+++ # expert_cache = ops.zeros_like(x) +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # for i, end_idx in enumerate(tokens_per_expert): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # if start_idx == end_idx: +-+++ # continue +-+++ # expert = self.experts[i] +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = expert(expert_tokens) +-+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-+++ # return expert_cache +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++ # expert_cache = ops.zeros_like(x) +-+++ +-+++ # # 排序保证顺序一致 +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ # token_idxs = idxs // self.num_experts_per_tok +-+++ +-+++ # # 找出有 token 的专家 +-+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++ +-+++ # for i in active_experts.tolist(): +-+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ # end_idx = tokens_per_expert[i] +-+++ # if start_idx == end_idx: # 没有 token +-+++ # continue +-+++ +-+++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++ # expert_tokens = x[exp_token_idx] +-+++ # expert_out = self.experts[i](expert_tokens) +-+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++ +-+++ # expert_cache = mindspore.mint.scatter_add( +-+++ # expert_cache, +-+++ # 0, +-+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++ # expert_out +-+++ # ) +-+++ +-+++ # return expert_cache +-++ +-++ +-++ +-++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++- +-++ # class DeepseekFlashAttention(nn.Module): +-++ # """ +-++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-+++ +-++ Deepseek_ATTENTION_CLASSES = { +-++ "eager": DeepseekAttention, +-++ "flash-attention": DeepseekFlashAttention, +-++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +-++ ) +-++ else: +-++ # 4d mask is passed through the layers +-++- attention_mask = _prepare_4d_causal_attention_mask( +-+++ # attention_mask = _prepare_4d_causal_attention_mask( +-+++ # attention_mask, +-+++ # (batch_size, seq_length), +-+++ # inputs_embeds, +-+++ # past_key_values_length, +-+++ # ) +-+++ #@dwj +-+++ attention_mask = get_cached_causal_mask( +-++ attention_mask, +-++ (batch_size, seq_length), +-++ inputs_embeds, +-++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ # Initialize weights and apply final processing +-++ self.post_init() +-++ self.warm_up = False +-+++ #@dwj +-+++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-+++ self.num_layers, +-+++ self.num_attention_heads, +-+++ self.head_dim, +-+++ batch_size=1, +-+++ max_length=self.max_length, +-+++ dtype=mindspore.float16 +-+++ ) +-+++ +-+++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-+++ key_cache = [] +-+++ value_cache = [] +-+++ for _ in range(num_layers): +-+++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++ key_cache.append(k) +-+++ value_cache.append(v) +-+++ return key_cache, value_cache +-+++ +-++ +-++ def warmup_moe_model_deep(self): +-++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++index bced285c..ebd7782e 100644 +-++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +-++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-++ +-++-Long_Prompt = False +-++-PROMPT_LENGTH_THRESHOLD = 128 +-+++Long_Prompt = 1 +-+++LONG_PROMPT_LENGTH_THRESHOLD = 128 +-+++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-+++ +-+++_causal_mask_cache = {} +-+++ +-+++def get_cached_causal_mask_with_cache_position( +-+++ attention_mask: mindspore.Tensor, +-+++ sequence_length: int, +-+++ target_length: int, +-+++ dtype: mindspore.dtype, +-+++ min_dtype: float, +-+++ cache_position: mindspore.Tensor, +-+++ batch_size: int, +-+++): +-+++ """ +-+++ 带缓存的 causal mask 构造函数 +-+++ """ +-+++ # q_len 是当前 query 长度 +-+++ q_len = sequence_length +-+++ # kv_len 是 target_length +-+++ kv_len = target_length +-+++ +-+++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-+++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-+++ +-+++ if key in _causal_mask_cache: +-+++ return _causal_mask_cache[key] +-+++ +-+++ # 调用原来的 mask 构造逻辑 +-+++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++ attention_mask, +-+++ sequence_length=sequence_length, +-+++ target_length=target_length, +-+++ dtype=dtype, +-+++ min_dtype=min_dtype, +-+++ cache_position=cache_position, +-+++ batch_size=batch_size, +-+++ ) +-+++ # 缓存结果 +-+++ _causal_mask_cache[key] = causal_mask +-+++ return causal_mask +-++ +-++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-++ def _prepare_4d_causal_attention_mask_with_cache_position( +-++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++ +-++ +-++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-+++# class Qwen2MoeAttention(nn.Module): +-+++# """ +-+++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+++# and "Generating Long Sequences with Sparse Transformers". +-+++# """ +-+++ +-+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++# super().__init__() +-+++# self.config = config +-+++# self.layer_idx = layer_idx +-+++# if layer_idx is None: +-+++# logger.warning_once( +-+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++# "when creating this class." +-+++# ) +-+++ +-+++# self.hidden_size = config.hidden_size +-+++# self.num_heads = config.num_attention_heads +-+++# self.head_dim = self.hidden_size // self.num_heads +-+++# self.num_key_value_heads = config.num_key_value_heads +-+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++# self.max_position_embeddings = config.max_position_embeddings +-+++# self.rope_theta = config.rope_theta +-+++# self.is_causal = True +-+++# self.attention_dropout = config.attention_dropout +-+++ +-+++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++# raise ValueError( +-+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++# f" and `num_heads`: {self.num_heads})." +-+++# ) +-+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++ +-+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++# self.head_dim, +-+++# max_position_embeddings=self.max_position_embeddings, +-+++# base=self.rope_theta, +-+++# ) +-+++ +-+++# def forward( +-+++# self, +-+++# hidden_states: mindspore.Tensor, +-+++# attention_mask: Optional[mindspore.Tensor] = None, +-+++# position_ids: Optional[mindspore.Tensor] = None, +-+++# past_key_value: Optional[Cache] = None, +-+++# output_attentions: bool = False, +-+++# use_cache: bool = False, +-+++# cache_position: Optional[mindspore.Tensor] = None, +-+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++ +-+++ +-+++ +-+++# bsz, q_len, _ = hidden_states.shape +-+++ +-+++# query_states = self.q_proj(hidden_states) +-+++# key_states = self.k_proj(hidden_states) +-+++# value_states = self.v_proj(hidden_states) +-+++ +-+++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++ +-+++# kv_seq_len = key_states.shape[-2] +-+++# if past_key_value is not None: +-+++# if self.layer_idx is None: +-+++# raise ValueError( +-+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++# "with a layer index." +-+++# ) +-+++# if isinstance(past_key_value, StaticCache): +-+++# kv_seq_len = key_states.shape[-2] +-+++# else: +-+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++# if past_key_value is not None: +-+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++# if isinstance(past_key_value, StaticCache): +-+++# kv_seq_len = key_states.shape[-2] +-+++ +-+++# # repeat k/v heads if n_kv_heads < n_heads +-+++# key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++# value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++ +-+++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++ +-+++# if attention_mask is not None: +-+++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++# attn_weights = attn_weights + causal_mask +-+++ +-+++# # upcast attention to fp32 +-+++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++# attn_output = ops.matmul(attn_weights, value_states) +-+++ +-+++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++# raise ValueError( +-+++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+++# f" {attn_output.shape}" +-+++# ) +-+++ +-+++# attn_output = ops.transpose(attn_output, 1, 2) +-+++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++ +-+++# attn_output = self.o_proj(attn_output) +-+++# # @lwx +-+++ +-+++# # max_seq_len = self.max_position_embeddings # 2048 +-+++ +-+++# # if attention_mask is not None: +-+++# # # attention_mask: [B, 1, Sq, Sk] +-+++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++ +-+++# # # pad 到 [max_seq_len, max_seq_len] +-+++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++# # global_attention_mask = padded_mask +-+++# # else: +-+++# # global_attention_mask = None +-+++ +-+++ +-+++# # sparse_mode=3 +-+++# # attn_output = mindspore.ops.flash_attention_score( +-+++# # query=query_states, +-+++# # key=key_states, +-+++# # value=value_states, +-+++# # real_shift=None, +-+++# # padding_mask=None, +-+++ +-+++# # head_num=self.num_heads, +-+++# # attn_mask=global_attention_mask, +-+++# # keep_prob=1.0 - self.attention_dropout, +-+++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++# # input_layout="BNSD", +-+++# # pre_tokens=2147483647, +-+++# # next_tokens=2147483647, +-+++# # inner_precise=0, +-+++# # drop_mask=None, +-+++# # prefix=None, +-+++# # actual_seq_qlen=None, +-+++# # actual_seq_kvlen=None, +-+++# # sparse_mode=sparse_mode, +-+++# # ) +-+++# if not output_attentions: +-+++# attn_weights = None +-+++ +-+++# return attn_output, attn_weights, past_key_value +-+++ +-++ class Qwen2MoeAttention(nn.Module): +-++ """ +-++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-++- and "Generating Long Sequences with Sparse Transformers". +-++- """ +-+++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +-++ +-+++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-+++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-+++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-+++ +-+++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-+++ """ +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++ super().__init__() +-++ self.config = config +-++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +-++ if layer_idx is None: +-++ logger.warning_once( +-++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++ "when creating this class." +-++ ) +-++ +-++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +-++ use_cache: bool = False, +-++ cache_position: Optional[mindspore.Tensor] = None, +-++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++- +-++ +-++- +-+++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +-++ bsz, q_len, _ = hidden_states.shape +-++ +-++ query_states = self.q_proj(hidden_states) +-++ key_states = self.k_proj(hidden_states) +-++ value_states = self.v_proj(hidden_states) +-++ +-++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++- +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ +-++ kv_seq_len = key_states.shape[-2] +-++ if past_key_value is not None: +-++- if self.layer_idx is None: +-++- raise ValueError( +-++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++- "with a layer index." +-++- ) +-++- if isinstance(past_key_value, StaticCache): +-++- kv_seq_len = key_states.shape[-2] +-++- else: +-++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++ +-++ if past_key_value is not None: +-++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++ +-+++ # --- 2. 动态调度核心注意力计算 --- +-+++ global Long_Prompt +-+++ if Long_Prompt >= 1: +-+++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-+++ fa_attention_mask = None +-+++ if attention_mask is not None: +-+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++ fa_attention_mask = (mask_slice != 0) +-+++ +-+++ attn_output = mindspore.ops.flash_attention_score( +-+++ query=query_states, +-+++ key=key_states, +-+++ value=value_states, +-+++ head_num=self.num_heads, +-+++ attn_mask=fa_attention_mask, +-+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-+++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ input_layout="BNSD", +-+++ sparse_mode=0, +-+++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-+++ ) +-++ +-++- if isinstance(past_key_value, StaticCache): +-++- kv_seq_len = key_states.shape[-2] +-+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = self.o_proj(attn_output) +-+++ attn_weights = None +-+++ if output_attentions: +-+++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-++ +-++- # repeat k/v heads if n_kv_heads < n_heads +-++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-++- +-++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++ else: +-+++ # --- Eager Attention 路径 (用于短序列和解码) --- +-+++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++ +-+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++ +-++- if attention_mask is not None: +-++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++- attn_weights = attn_weights + causal_mask +-+++ if attention_mask is not None: +-+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++ attn_weights = attn_weights + causal_mask +-++ +-++- # upcast attention to fp32 +-++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++- attn_output = ops.matmul(attn_weights, value_states) +-+++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++ attn_output = ops.matmul(attn_weights, value_states) +-++ +-++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++- raise ValueError( +-++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-++- f" {attn_output.shape}" +-++- ) +-+++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++ raise ValueError( +-+++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-+++ ) +-++ +-++- attn_output = ops.transpose(attn_output, 1, 2) +-++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = ops.transpose(attn_output, 1, 2) +-+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = self.o_proj(attn_output) +-++ +-++- attn_output = self.o_proj(attn_output) +-++- # @lwx +-+++ if not output_attentions: +-+++ attn_weights = None +-++ +-++- # max_seq_len = self.max_position_embeddings # 2048 +-++- +-++- # if attention_mask is not None: +-++- # # attention_mask: [B, 1, Sq, Sk] +-++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++- +-++- # # pad 到 [max_seq_len, max_seq_len] +-++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++- # global_attention_mask = padded_mask +-++- # else: +-++- # global_attention_mask = None +-++- +-++- +-++- # sparse_mode=3 +-++- # attn_output = mindspore.ops.flash_attention_score( +-++- # query=query_states, +-++- # key=key_states, +-++- # value=value_states, +-++- # real_shift=None, +-++- # padding_mask=None, +-++- +-++- # head_num=self.num_heads, +-++- # attn_mask=global_attention_mask, +-++- # keep_prob=1.0 - self.attention_dropout, +-++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-++- # input_layout="BNSD", +-++- # pre_tokens=2147483647, +-++- # next_tokens=2147483647, +-++- # inner_precise=0, +-++- # drop_mask=None, +-++- # prefix=None, +-++- # actual_seq_qlen=None, +-++- # actual_seq_kvlen=None, +-++- # sparse_mode=sparse_mode, +-++- # ) +-++- if not output_attentions: +-++- attn_weights = None +-++- +-++ return attn_output, attn_weights, past_key_value +-++ +-++- +-++ # class Qwen2MoeFlashAttention(nn.Module): +-++ # """ +-++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +-++ # return final_hidden_states, router_logits +-++ +-++ +-++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++-# """ +-++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-++-# """ +-++-# def __init__(self, config: Qwen2MoeConfig): +-++-# super().__init__() +-++-# self.num_experts = config.num_experts +-++-# self.top_k = config.num_experts_per_tok +-++-# self.norm_topk_prob = config.norm_topk_prob +-++- +-++-# # 门控网络 +-++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++-# # 专家列表 +-++-# self.experts = nn.ModuleList( +-++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++-# ) +-++-# # 共享专家 +-++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-# @no_grad() +-++-# def _moe_infer_decode( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# """ +-++-# 【解码路径】针对 sequence_length=1 的极致优化。 +-++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-++-# """ +-++-# batch_size, hidden_dim = hidden_states.shape +-++- +-++-# expert_outputs_list = [ +-++-# ops.cat([ +-++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++-# ], dim=0) +-++-# for i in range(batch_size) +-++-# ] +-++- +-++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-++-# # shape: (batch_size, top_k, hidden_dim) +-++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++- +-++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++- +-++-# return moe_output.squeeze(1) +-++- +-++-# @no_grad() +-++-# def _moe_infer_prefill( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# """ +-++-# 【预填充路径】针对 sequence_length > 1 的优化。 +-++-# 按专家对 Token 进行分组,并进行批处理。 +-++-# """ +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens = hidden_states.shape[0] +-++-# flat_selected_experts = selected_experts.flatten() +-++- +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++- +-++-# active_experts = ops.unique(flat_selected_experts) +-++- +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++- +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++-# selected_token_indices = token_indices[mask] +-++-# selected_routing_weights = routing_weights.flatten()[mask] +-++- +-++-# current_states = hidden_states[selected_token_indices] +-++- +-++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++- +-++-# moe_output = moe_output.index_add( +-++-# dim=0, +-++-# index=selected_token_indices, +-++-# source=expert_output.to(hidden_states.dtype) +-++-# ) +-++-# return moe_output +-++- +-++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-# """ +-++-# 顶层 forward 方法,作为智能分发器。 +-++-# """ +-++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- +-++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-# router_logits = self.gate(hidden_states_reshaped) +-++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- +-++-# if self.norm_topk_prob: +-++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- +-++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++-# moe_output = None +-++-# # 在推理时,根据序列长度选择最优路径 +-++-# if not self.training: +-++-# if sequence_length == 1: +-++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++-# else: +-++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++-# else: +-++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-++-# raise NotImplementedError("Training path is not implemented.") +-++- +-++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-++- +-++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-++- +-++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-++- +-++-# return final_hidden_states, router_logits +-++- +-++- +-++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++-# """ +-++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-++-# """ +-++-# def __init__(self, config: Qwen2MoeConfig): +-++-# super().__init__() +-++-# self.num_experts = config.num_experts +-++-# self.top_k = config.num_experts_per_tok +-++-# self.norm_topk_prob = config.norm_topk_prob +-++- +-++-# # 门控网络 +-++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++-# # 专家列表 +-++-# self.experts = nn.ModuleList( +-++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++-# ) +-++-# # 共享专家 +-++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-# @no_grad() +-++-# def _moe_infer_decode( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# batch_size, _ = hidden_states.shape +-++-# expert_outputs_list = [ +-++-# ops.cat([ +-++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++-# ], dim=0) +-++-# for i in range(batch_size) +-++-# ] +-++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++-# return moe_output.squeeze(1) +-++- +-++-# @no_grad() +-++-# def _moe_infer_prefill( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens = hidden_states.shape[0] +-++-# flat_selected_experts = selected_experts.flatten() +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++-# active_experts = ops.unique(flat_selected_experts) +-++- +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++-# selected_token_indices = token_indices[mask] +-++-# selected_routing_weights = routing_weights.flatten()[mask] +-++-# current_states = hidden_states[selected_token_indices] +-++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++-# moe_output = moe_output.index_add( +-++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++-# ) +-++-# return moe_output +-++- +-++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-# """ +-++-# 顶层 forward 方法,作为智能分发器。 +-++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-++-# """ +-++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- +-++-# # 1. 门控计算 (通用逻辑) +-++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-# router_logits = self.gate(hidden_states_reshaped) +-++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- +-++-# if self.norm_topk_prob: +-++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- +-++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++-# # 2. 智能分发到最优 MoE 路径 +-++-# moe_output = None +-++-# if not self.training: +-++-# if sequence_length == 1: +-++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++-# else: +-++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++-# else: +-++-# raise NotImplementedError("Training path is not implemented.") +-++- +-++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++- +-++-# # 4. 合并 MoE 输出和共享专家输出 +-++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++- +-++-# # 5. 恢复原始形状并返回 +-++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++- +-++-# return final_hidden_states, router_logits +-++- +-++-# prefill fastest +-++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++-# """ +-++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-++-# """ +-++-# def __init__(self, config: Qwen2MoeConfig): +-++-# super().__init__() +-++-# self.num_experts = config.num_experts +-++-# self.top_k = config.num_experts_per_tok +-++-# self.norm_topk_prob = config.norm_topk_prob +-++- +-++-# # 门控网络 +-++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++-# # 专家列表 +-++-# self.experts = nn.ModuleList( +-++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++-# ) +-++-# # 共享专家 +-++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-# @no_grad() +-++-# def _moe_infer_dispatch( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# """ +-++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-++-# """ +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens, _ = hidden_states.shape +-++- +-++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-++-# flat_selected_experts = selected_experts.flatten() +-++-# flat_routing_weights = routing_weights.flatten() +-++- +-++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++- +-++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-++-# active_experts = ops.unique(flat_selected_experts) +-++- +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++- +-++-# # 找到所有分配给该专家的 token +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++- +-++-# # 使用 mask 选取对应的 token 和权重 +-++-# current_token_indices = token_indices[mask] +-++-# current_routing_weights = flat_routing_weights[mask] +-++-# current_hidden_states = hidden_states[current_token_indices] +-++- +-++-# # 对这些 token 进行批处理 +-++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++- +-++-# # 使用 index_add 将结果精确地加回到对应位置 +-++-# moe_output = moe_output.index_add( +-++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-++-# ) +-++-# return moe_output +-++- +-++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-# """ +-++-# 顶层 forward 方法,作为智能分发器。 +-++-# """ +-++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- +-++-# # 1. 门控计算 +-++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-# router_logits = self.gate(hidden_states_reshaped) +-++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- +-++-# if self.norm_topk_prob: +-++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- +-++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++- +-++-# # 2. 调用统一的 MoE 计算内核 +-++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-++- +-++-# # 3. 统一处理共享专家 +-++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++- +-++-# # 4. 合并输出 +-++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++- +-++-# # 5. 恢复原始形状并返回 +-++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++- +-++-# return final_hidden_states, router_logits +-++- +-++- +-++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++-# """ +-++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++-# 【最终高性能与高精度版】: +-++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-++-# 3. 这样实现了速度和准确性的两全其美。 +-++-# """ +-++-# def __init__(self, config: Qwen2MoeConfig): +-++-# super().__init__() +-++-# self.num_experts = config.num_experts +-++-# self.top_k = config.num_experts_per_tok +-++-# self.norm_topk_prob = config.norm_topk_prob +-++- +-++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++-# self.experts = nn.ModuleList( +-++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++-# ) +-++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-# @no_grad() +-++-# def _moe_infer_decode( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# """ +-++-# 【解码路径】极致优化版:bmm + 高精度累加。 +-++-# """ +-++-# original_dtype = hidden_states.dtype +-++-# batch_size, _ = hidden_states.shape +-++- +-++-# expert_outputs_list = [ +-++-# ops.cat([ +-++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++-# ], dim=0) +-++-# for i in range(batch_size) +-++-# ] +-++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++- +-++-# # 在 float32 下执行 bmm,得到高精度结果 +-++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++- +-++-# # 将高精度结果转换回原始数据类型 +-++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-++- +-++-# return moe_output +-++- +-++-# @no_grad() +-++-# def _moe_infer_prefill( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# selected_experts: mindspore.Tensor, +-++-# routing_weights: mindspore.Tensor +-++-# ) -> mindspore.Tensor: +-++-# """ +-++-# 【预填充路径】与原始实现一致,结果精确。 +-++-# """ +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens, _ = hidden_states.shape +-++-# flat_selected_experts = selected_experts.flatten() +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++-# active_experts = ops.unique(flat_selected_experts) +-++- +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++-# selected_token_indices = token_indices[mask] +-++-# selected_routing_weights = routing_weights.flatten()[mask] +-++-# current_states = hidden_states[selected_token_indices] +-++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++-# moe_output = moe_output.index_add( +-++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++-# ) +-++-# return moe_output +-++- +-++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++- +-++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-# router_logits = self.gate(hidden_states_reshaped) +-++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++- +-++-# if self.norm_topk_prob: +-++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- +-++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-++-# # 如果模型主体是 float16,后续再转换 +-++- +-++-# moe_output = None +-++-# if not self.training: +-++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-++-# # _moe_infer_decode 内部会处理好类型转换 +-++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-++-# if sequence_length == 1: +-++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++-# else: +-++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++-# else: +-++-# raise NotImplementedError("Training path is not implemented.") +-++- +-++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++- +-++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++- +-++-# return final_hidden_states, router_logits +-++- +-++- +-++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++-# """ +-++-# 【融合版】一个混合专家模块,内置两种推理策略, +-++-# 由外部全局变量 `Long_Prompt` 控制: +-++- +-++-# - if Long_Prompt is True: 【精度优先模式】 +-++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-++-# 适用于处理长序列,避免误差累积。 +-++- +-++-# - if Long_Prompt is False: 【速度优先模式】 +-++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-++-# """ +-++-# def __init__(self, config: Qwen2MoeConfig): +-++-# super().__init__() +-++-# self.num_experts = config.num_experts +-++-# self.top_k = config.num_experts_per_tok +-++-# self.norm_topk_prob = config.norm_topk_prob +-++- +-++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++-# self.experts = nn.ModuleList( +-++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++-# ) +-++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-# # --- 速度优先模式的辅助函数 --- +-++-# @no_grad() +-++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++-# original_dtype = hidden_states.dtype +-++-# batch_size, _ = hidden_states.shape +-++-# expert_outputs_list = [ +-++-# ops.cat([ +-++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++-# ], dim=0) +-++-# for i in range(batch_size) +-++-# ] +-++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++-# weights_fp32 = routing_weights.to(mindspore.float32) +-++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++-# return moe_output_fp32.squeeze(1).to(original_dtype) +-++- +-++-# @no_grad() +-++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens, _ = hidden_states.shape +-++-# flat_selected_experts = selected_experts.flatten() +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++-# active_experts = ops.unique(flat_selected_experts) +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++-# selected_token_indices = token_indices[mask] +-++-# selected_routing_weights = routing_weights.flatten()[mask] +-++-# current_states = hidden_states[selected_token_indices] +-++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-++-# return moe_output +-++- +-++-# # --- 精度优先模式的辅助函数 --- +-++-# @no_grad() +-++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++-# moe_output = ops.zeros_like(hidden_states) +-++-# num_tokens, _ = hidden_states.shape +-++-# flat_selected_experts = selected_experts.flatten() +-++-# flat_routing_weights = routing_weights.flatten() +-++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++-# active_experts = ops.unique(flat_selected_experts) +-++-# for expert_idx_tensor in active_experts: +-++-# expert_idx = expert_idx_tensor.item() +-++-# expert_layer = self.experts[expert_idx] +-++-# mask = (flat_selected_experts == expert_idx_tensor) +-++-# current_token_indices = token_indices[mask] +-++-# current_routing_weights = flat_routing_weights[mask] +-++-# current_hidden_states = hidden_states[current_token_indices] +-++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++-# return moe_output +-++- +-++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-# # 声明我们将要使用一个在模块外部定义的全局变量 +-++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-++-# global Long_Prompt +-++- +-++-# # 1. 门控计算 (所有模式通用) +-++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-# router_logits = self.gate(hidden_states_reshaped) +-++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++-# if self.norm_topk_prob: +-++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++- +-++-# moe_output = None +-++-# if not self.training: +-++-# # 根据 Long_Prompt 标志选择模式 +-++-# if Long_Prompt: +-++-# # --- 精度优先模式 --- +-++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++-# else: +-++-# # --- 速度优先模式 --- +-++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++-# if sequence_length == 1: +-++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++-# else: +-++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++-# else: +-++-# raise NotImplementedError("Training path is not implemented.") +-++- +-++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++- +-++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++- +-++-# return final_hidden_states, router_logits +-++- +-++ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ """ +-++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++ return moe_output_fp32.squeeze(1).to(original_dtype) +-++ +-+++ # @no_grad() +-+++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++ # num_tokens, _ = hidden_states.shape +-+++ # flat_selected_experts = selected_experts.flatten() +-+++ # sorted_expert_indices = flat_selected_experts.argsort() +-+++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++ # original_token_indices = sorted_expert_indices // self.top_k +-+++ # moe_output = ops.zeros_like(hidden_states) +-+++ # current_token_offset = 0 +-+++ # for i in range(self.num_experts): +-+++ # expert_token_count = tokens_per_expert[i] - current_token_offset +-+++ # if expert_token_count == 0: +-+++ # continue +-+++ # end_offset = current_token_offset + expert_token_count +-+++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++ # expert_hidden_states = hidden_states[expert_original_token_indices] +-+++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++ # current_token_offset += expert_token_count +-+++ # return moe_output +-+++ +-++ @no_grad() +-++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++- num_tokens, _ = hidden_states.shape +-++- flat_selected_experts = selected_experts.flatten() +-++- sorted_expert_indices = flat_selected_experts.argsort() +-++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++- original_token_indices = sorted_expert_indices // self.top_k +-+++ """ +-+++ 优化版 MoE prefill (速度优先模式): +-+++ - 批量张量化处理同一个 expert 的所有 token +-+++ - 跳过无 token 的专家 +-+++ - 保持结果完全一致 +-+++ """ +-++ moe_output = ops.zeros_like(hidden_states) +-++- current_token_offset = 0 +-++- for i in range(self.num_experts): +-++- expert_token_count = tokens_per_expert[i] - current_token_offset +-++- if expert_token_count == 0: +-++- continue +-++- end_offset = current_token_offset + expert_token_count +-++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++- expert_hidden_states = hidden_states[expert_original_token_indices] +-++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++- current_token_offset += expert_token_count +-+++ +-+++ flat_selected_experts = selected_experts.flatten() +-+++ flat_routing_weights = routing_weights.flatten() +-+++ +-+++ idxs = flat_selected_experts.argsort() +-+++ sorted_expert_indices = flat_selected_experts[idxs] +-+++ sorted_token_indices = idxs // self.top_k +-+++ +-+++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-+++ +-+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+++ +-+++ for expert_id in active_experts.tolist(): +-+++ start = int(tokens_per_expert[:expert_id].sum().item()) +-+++ end = start + int(tokens_per_expert[expert_id].item()) +-+++ +-+++ token_idx = sorted_token_indices[start:end] +-+++ expert_tokens = hidden_states[token_idx] +-+++ +-+++ expert_out = self.experts[expert_id](expert_tokens) +-+++ +-+++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-+++ +-+++ moe_output = mindspore.mint.scatter_add( +-+++ moe_output, +-+++ 0, +-+++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-+++ scaled_out.to(hidden_states.dtype) +-+++ ) +-+++ +-++ return moe_output +-++ +-+++ +-++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-++ @no_grad() +-++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++ +-++ moe_output = None +-++- if Long_Prompt: +-++- # --- 精度优先模式 (ACCURACY MODE) --- +-++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ # if Long_Prompt==0: +-+++ # # --- 精度优先模式 (ACCURACY MODE) --- +-+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ # else: +-+++ # # --- 速度优先模式 (SPEED MODE) --- +-+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++ # if sequence_length == 1: +-+++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ # else: +-+++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ +-+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++ if sequence_length == 1: +-+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++ else: +-++- # --- 速度优先模式 (SPEED MODE) --- +-++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++- if sequence_length == 1: +-++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++- else: +-++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++- +-+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ +-++ +-++ # 3. 共享专家计算与合并 (所有模式通用) +-++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++ +-++ return final_hidden_states, router_logits +-++ +-+++ +-++ class Qwen2MoeDecoderLayer(nn.Module): +-++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-++ super().__init__() +-++ self.hidden_size = config.hidden_size +-++ +-++- # if Long_Prompt: +-++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++- # else: +-+++ # if Long_Prompt == 2: +-++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++ # else: +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ +-++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++ +-++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++ ) +-++ +-++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++ # attention_mask, +-+++ # sequence_length=sequence_length, +-+++ # target_length=target_length, +-+++ # dtype=dtype, +-+++ # min_dtype=min_dtype, +-+++ # cache_position=cache_position, +-+++ # batch_size=input_tensor.shape[0], +-+++ # ) +-+++ #@dwj +-+++ causal_mask = get_cached_causal_mask_with_cache_position( +-++ attention_mask, +-++ sequence_length=sequence_length, +-++ target_length=target_length, +-++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-++ """ +-++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-+++ _causal_mask_cache.clear() +-++ +-++ input_ids = kwargs.get("input_ids") +-++ if input_ids is None and args: +-++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ +-++ if input_ids is not None: +-++ prompt_length = input_ids.shape[1] +-++- +-++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-++- Long_Prompt = True +-+++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-+++ Long_Prompt = 2 +-+++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-+++ Long_Prompt = 0 +-++ else: +-++- Long_Prompt = False +-+++ Long_Prompt = 1 +-+++ +-++ +-++ return super().generate(*args, **kwargs) +-++ +-++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++ dtype = self.lm_head.weight.dtype +-++ min_dtype = float(ops.finfo(dtype).min) +-++ +-++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++ # attention_mask, +-+++ # sequence_length=sequence_length, +-+++ # target_length=past_key_values.get_max_length(), +-+++ # dtype=dtype, +-+++ # min_dtype=min_dtype, +-+++ # cache_position=cache_position, +-+++ # batch_size=batch_size, +-+++ # ) +-+++ +-+++ #@dwj +-+++ attention_mask = get_cached_causal_mask_with_cache_position( +-++ attention_mask, +-++ sequence_length=sequence_length, +-++ target_length=past_key_values.get_max_length(), +-++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++deleted file mode 100644 +-++index 6dfb5b93..00000000 +-++--- a/patches/0001-20251104commit.patch +-+++++ /dev/null +-++@@ -1,1272 +0,0 @@ +-++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++-From: Pinoeer-kingxi <13022943007@163.com> +-++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++-Subject: [PATCH] 20251104commit +-++- +-++---- +-++- mindnlp/transformers/cache_utils.py | 28 +- +-++- .../models/deepseek/modeling_deepseek.py | 149 ++- +-++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++- 3 files changed, 976 insertions(+), 87 deletions(-) +-++- +-++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++-index cadd2e04..02f8d4be 100644 +-++---- a/mindnlp/transformers/cache_utils.py +-++-+++ b/mindnlp/transformers/cache_utils.py +-++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++- # k_out[:, :, cache_position] = key_states +-++- # v_out[:, :, cache_position] = value_states +-++-- if ON_ORANGE_PI: +-++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++-- else: +-++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++-- +-++-+ # if ON_ORANGE_PI: +-++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++-+ # else: +-++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++-+ if cache_position.ndim > 1: +-++-+ cache_position = cache_position.flatten() +-++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++-+ cache_position = cache_position.int() +-++-+ +-++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++-+ k_out[:, :, cache_position] = key_states +-++-+ v_out[:, :, cache_position] = value_states +-++-+ +-++- return k_out, v_out +-++- +-++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++-index c695b944..d8303e45 100644 +-++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++- # Copied from transformers.models.llama.modeling_llama.rotate_half +-++- def rotate_half(x): +-++- """Rotates half the hidden dims of the input.""" +-++-- x1 = x[..., : x.shape[-1] // 2] +-++-- x2 = x[..., x.shape[-1] // 2 :] +-++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++-+ # x1 = x[..., : x.shape[-1] // 2] +-++-+ # x2 = x[..., x.shape[-1] // 2 :] +-++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++- return ops.cat((-x2, x1), dim=-1) +-++- +-++- +-++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++- if self.training: +-++- raise NotImplementedError("Training is not supported yet.") +-++- else: +-++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++-- if self.config.n_shared_experts is not None: +-++-- y = y + self.shared_experts(identity) +-++-- return y +-++-+ # @lwx +-++-+ if orig_shape[1] == 1: +-++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++-+ y=y.view(*orig_shape) +-++-+ if self.config.n_shared_experts is not None: +-++-+ y = y + self.shared_experts(identity) +-++-+ return y +-++-+ else: +-++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++-+ if self.config.n_shared_experts is not None: +-++-+ y = y + self.shared_experts(identity) +-++-+ return y +-++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++-+ # if self.config.n_shared_experts is not None: +-++-+ # y = y + self.shared_experts(identity) +-++-+ # return y +-++-+ +-++-+ @no_grad() +-++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++-+ +-++-+ expert_cache = ops.zeros_like(x) +-++-+ for i in range(self.num_experts_per_tok): +-++-+ expert_id = flat_expert_indices[i].item() +-++-+ weight = flat_expert_weights[i].item() +-++-+ expert = self.experts[expert_id] +-++-+ expert_out = expert(x) +-++-+ expert_cache += expert_out * weight +-++-+ return expert_cache +-++- +-++- @no_grad() +-++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++-- # expert_cache = torch.zeros_like(x) +-++-- # idxs = flat_expert_indices.argsort() +-++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++-- # token_idxs = idxs // self.num_experts_per_tok +-++-- # for i, end_idx in enumerate(tokens_per_expert): +-++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++-- # if start_idx == end_idx: +-++-- # continue +-++-- # expert = self.experts[i] +-++-- # exp_token_idx = token_idxs[start_idx:end_idx] +-++-- # expert_tokens = x[exp_token_idx] +-++-- # expert_out = expert(expert_tokens) +-++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++-- # return expert_cache +-++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++- expert_cache = ops.zeros_like(x) +-++- idxs = flat_expert_indices.argsort() +-++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++- token_idxs = idxs // self.num_experts_per_tok +-++-+ +-++- for i, end_idx in enumerate(tokens_per_expert): +-++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++- if start_idx == end_idx: +-++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++- expert_out = expert(expert_tokens) +-++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++-+ +-++- return expert_cache +-++-+ +-++-+ # @no_grad() +-++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++-+ # # expert_cache = torch.zeros_like(x) +-++-+ # # idxs = flat_expert_indices.argsort() +-++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++-+ # # token_idxs = idxs // self.num_experts_per_tok +-++-+ # # for i, end_idx in enumerate(tokens_per_expert): +-++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++-+ # # if start_idx == end_idx: +-++-+ # # continue +-++-+ # # expert = self.experts[i] +-++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++-+ # # expert_tokens = x[exp_token_idx] +-++-+ # # expert_out = expert(expert_tokens) +-++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++-+ # # return expert_cache +-++-+ # expert_cache = ops.zeros_like(x) +-++-+ # idxs = flat_expert_indices.argsort() +-++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++-+ # token_idxs = idxs // self.num_experts_per_tok +-++-+ +-++-+ # for i, end_idx in enumerate(tokens_per_expert): +-++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++-+ # if start_idx == end_idx: +-++-+ # continue +-++-+ # expert = self.experts[i] +-++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-++-+ # expert_tokens = x[exp_token_idx] +-++-+ # expert_out = expert(expert_tokens) +-++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++-+ +-++-+ # return expert_cache +-++-+ # @no_grad() +-++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++-+ # expert_cache = ops.zeros_like(x) +-++-+ +-++-+ # # 排序保证顺序一致 +-++-+ # idxs = flat_expert_indices.argsort() +-++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++-+ # token_idxs = idxs // self.num_experts_per_tok +-++-+ +-++-+ # # 找出有 token 的专家 +-++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++-+ +-++-+ # for i in active_experts.tolist(): +-++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++-+ # end_idx = tokens_per_expert[i] +-++-+ # if start_idx == end_idx: # 没有 token +-++-+ # continue +-++-+ +-++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-++-+ # expert_tokens = x[exp_token_idx] +-++-+ # expert_out = self.experts[i](expert_tokens) +-++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++-+ +-++-+ # expert_cache = mindspore.mint.scatter_add( +-++-+ # expert_cache, +-++-+ # 0, +-++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++-+ # expert_out +-++-+ # ) +-++-+ +-++-+ # return expert_cache +-++-+ +-++-+ +-++- +-++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++- # """ +-++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++- +-++- # Initialize weights and apply final processing +-++- self.post_init() +-++-+ self.warm_up = False +-++-+ +-++-+ def warmup_moe_model_deep(self): +-++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++-+ test_texts = [ +-++-+ "warmup short", +-++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++-+ ] +-++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++-+ if tokenizer is None: +-++-+ from mindnlp.transformers import AutoTokenizer +-++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++-+ self._warmup_tokenizer = tokenizer +-++-+ +-++-+ for text in test_texts: +-++-+ inputs = tokenizer(text, return_tensors="ms") +-++-+ with mindspore._no_grad(): +-++-+ _ = self(**inputs, use_cache=False) +-++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++- +-++- def get_input_embeddings(self): +-++- return self.model.embed_tokens +-++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++- ```""" +-++-+ if not self.warm_up: +-++-+ self.warm_up = True +-++-+ self.warmup_moe_model_deep() +-++-+ +-++- output_attentions = ( +-++- output_attentions +-++- if output_attentions is not None +-++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++-index 3cbf820e..d4c6b651 100644 +-++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++-@@ -18,7 +18,6 @@ +-++- # See the License for the specific language governing permissions and +-++- # limitations under the License. +-++- """MindSpore Qwen2MoE model.""" +-++-- +-++- import math +-++- from typing import List, Optional, Tuple, Union +-++- +-++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++- TokenClassifierOutput, +-++- ) +-++- from ...modeling_utils import PreTrainedModel +-++-+from ...generation import GenerationMixin +-++- from ....utils import logging +-++- from .configuration_qwen2_moe import Qwen2MoeConfig +-++- +-++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++- self.variance_epsilon = eps +-++- +-++- def forward(self, hidden_states): +-++-+ # @dwj +-++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++-+ # @lwx +-++-+ # if not self.training : +-++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++- input_dtype = hidden_states.dtype +-++- hidden_states = hidden_states.to(mindspore.float32) +-++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++-@@ -234,6 +239,8 @@ def rotate_half(x): +-++- """Rotates half the hidden dims of the input.""" +-++- x1 = x[..., : x.shape[-1] // 2] +-++- x2 = x[..., x.shape[-1] // 2 :] +-++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++- return ops.cat((-x2, x1), dim=-1) +-++- +-++- +-++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++- self.config = config +-++- self.hidden_size = config.hidden_size +-++- self.intermediate_size = intermediate_size +-++-+ +-++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++- self.act_fn = ACT2FN[config.hidden_act] +-++- +-++- def forward(self, x): +-++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++-- +-++- +-++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++-+ # @lwx +-++-+ # gate_up_output = self.gate_up_proj(x) +-++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++-+ # return self.down_proj(swiglu_output) +-++-+ +-++-+ # def forward(self, x): +-++-+ # gate_proj_out = self.gate_proj(x) +-++-+ # up_proj_out = self.up_proj(x) +-++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++-+ # return self.down_proj(swiglu_out) +-++-+ +-++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++- """ +-++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++- use_cache: bool = False, +-++- cache_position: Optional[mindspore.Tensor] = None, +-++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++-+ +-++-+ +-++-+ +-++- bsz, q_len, _ = hidden_states.shape +-++- +-++- query_states = self.q_proj(hidden_states) +-++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++- "with a layer index." +-++- ) +-++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++-+ if isinstance(past_key_value, StaticCache): +-++-+ kv_seq_len = key_states.shape[-2] +-++-+ else: +-++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++- +-++- if past_key_value is not None: +-++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++-+ +-++-+ if isinstance(past_key_value, StaticCache): +-++-+ kv_seq_len = key_states.shape[-2] +-++- +-++- # repeat k/v heads if n_kv_heads < n_heads +-++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-++-- +-++-+ +-++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++- +-++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++-- raise ValueError( +-++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++-- f" {attn_weights.shape}" +-++-- ) +-++-- +-++-- if attention_mask is not None: # no matter the length, we just slice it +-++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++-+ if attention_mask is not None: +-++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++- attn_weights = attn_weights + causal_mask +-++- +-++- # upcast attention to fp32 +-++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++- +-++- attn_output = self.o_proj(attn_output) +-++-- +-++-+ # @lwx +-++-+ +-++-+ # max_seq_len = self.max_position_embeddings # 2048 +-++-+ +-++-+ # if attention_mask is not None: +-++-+ # # attention_mask: [B, 1, Sq, Sk] +-++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++-+ +-++-+ # # pad 到 [max_seq_len, max_seq_len] +-++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++-+ # global_attention_mask = padded_mask +-++-+ # else: +-++-+ # global_attention_mask = None +-++-+ +-++-+ +-++-+ # sparse_mode=3 +-++-+ # attn_output = mindspore.ops.flash_attention_score( +-++-+ # query=query_states, +-++-+ # key=key_states, +-++-+ # value=value_states, +-++-+ # real_shift=None, +-++-+ # padding_mask=None, +-++-+ +-++-+ # head_num=self.num_heads, +-++-+ # attn_mask=global_attention_mask, +-++-+ # keep_prob=1.0 - self.attention_dropout, +-++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++-+ # input_layout="BNSD", +-++-+ # pre_tokens=2147483647, +-++-+ # next_tokens=2147483647, +-++-+ # inner_precise=0, +-++-+ # drop_mask=None, +-++-+ # prefix=None, +-++-+ # actual_seq_qlen=None, +-++-+ # actual_seq_kvlen=None, +-++-+ # sparse_mode=sparse_mode, +-++-+ # ) +-++- if not output_attentions: +-++- attn_weights = None +-++- +-++- return attn_output, attn_weights, past_key_value +-++- +-++- +-++-+class Qwen2MoeFlashAttention(nn.Module): +-++-+ """ +-++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++-+ +-++-+ 关键改动: +-++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++-+ 直接传入原始的 key 和 value 张量效率更高。 +-++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++-+ """ +-++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++-+ super().__init__() +-++-+ self.config = config +-++-+ self.layer_idx = layer_idx +-++-+ self.hidden_size = config.hidden_size +-++-+ self.num_heads = config.num_attention_heads +-++-+ self.head_dim = self.hidden_size // self.num_heads +-++-+ self.num_key_value_heads = config.num_key_value_heads +-++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++-+ self.max_position_embeddings = config.max_position_embeddings +-++-+ self.rope_theta = config.rope_theta +-++-+ self.attention_dropout = config.attention_dropout +-++-+ +-++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-++-+ raise ValueError( +-++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++-+ ) +-++-+ +-++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++-+ +-++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++-+ self.head_dim, +-++-+ max_position_embeddings=self.max_position_embeddings, +-++-+ base=self.rope_theta, +-++-+ ) +-++-+ +-++-+ def forward( +-++-+ self, +-++-+ hidden_states: mindspore.Tensor, +-++-+ attention_mask: Optional[mindspore.Tensor] = None, +-++-+ position_ids: Optional[mindspore.Tensor] = None, +-++-+ past_key_value: Optional[Cache] = None, +-++-+ output_attentions: bool = False, +-++-+ use_cache: bool = False, +-++-+ cache_position: Optional[mindspore.Tensor] = None, +-++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++-+ +-++-+ bsz, q_len, _ = hidden_states.shape +-++-+ +-++-+ # 1. 线性投射 Q, K, V +-++-+ query_states = self.q_proj(hidden_states) +-++-+ key_states = self.k_proj(hidden_states) +-++-+ value_states = self.v_proj(hidden_states) +-++-+ +-++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++-+ # query: [B, S, H*D] -> [B, N1, S, D] +-++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ +-++-+ # 3. RoPE 旋转位置编码 +-++-+ kv_seq_len = key_states.shape[-2] +-++-+ if past_key_value is not None: +-++-+ if self.layer_idx is None: +-++-+ raise ValueError( +-++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++-+ "with a layer index." +-++-+ ) +-++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++-+ if cache_position.shape[0] == 1: +-++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++-+ kv_seq_len = past_seen_tokens + 1 +-++-+ else: +-++-+ # prefill 阶段:cache_position 是范围,使用其长度 +-++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++-+ else: +-++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++-+ +-++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++-+ +-++-+ # 4. KV 缓存更新 +-++-+ if past_key_value is not None: +-++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++-+ key_states, value_states = past_key_value.update( +-++-+ key_states, value_states, self.layer_idx, cache_kwargs +-++-+ ) +-++-+ +-++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++-+ if cache_position.shape[0] == 1: +-++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++-+ kv_seq_len = key_states.shape[-2] +-++-+ +-++-+ # 5. [重要] 准备 Attention Mask +-++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++-+ fa_attention_mask = None +-++-+ if attention_mask is not None: +-++-+ # 截取与当前key长度匹配的部分 +-++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++-+ fa_attention_mask = (mask_slice != 0) +-++-+ +-++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++-+ input_dtype = query_states.dtype +-++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++-+ query_states = query_states.to(mindspore.float16) +-++-+ key_states = key_states.to(mindspore.float16) +-++-+ value_states = value_states.to(mindspore.float16) +-++-+ +-++-+ # 6. [核心] 调用 flash_attention_score 算子 +-++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++-+ attn_output = mindspore.ops.flash_attention_score( +-++-+ query=query_states, +-++-+ key=key_states, +-++-+ value=value_states, +-++-+ head_num=self.num_heads, # 传入Q的头数(N1) +-++-+ attn_mask=fa_attention_mask, +-++-+ keep_prob=1.0 - self.attention_dropout, +-++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-++-+ input_layout="BNSD", +-++-+ sparse_mode=0 # 使用 defaultMask 模式 +-++-+ ) +-++-+ +-++-+ # 恢复原始数据类型 +-++-+ attn_output = attn_output.to(input_dtype) +-++-+ +-++-+ # 7. 调整输出形状 +-++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++-+ attn_output = self.o_proj(attn_output) +-++-+ +-++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-++-+ attn_weights = None +-++-+ if output_attentions: +-++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++-+ +-++-+ return attn_output, attn_weights, past_key_value +-++-+ +-++-+ # def forward( +-++-+ # self, +-++-+ # hidden_states: mindspore.Tensor, +-++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-++-+ # position_ids: Optional[mindspore.Tensor] = None, +-++-+ # past_key_value: Optional[Cache] = None, +-++-+ # output_attentions: bool = False, +-++-+ # use_cache: bool = False, +-++-+ # cache_position: Optional[mindspore.Tensor] = None, +-++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++-+ +-++-+ # bsz, q_len, _ = hidden_states.shape +-++-+ +-++-+ # # 1. 线性投射 Q, K, V +-++-+ # query_states = self.q_proj(hidden_states) +-++-+ # key_states = self.k_proj(hidden_states) +-++-+ # value_states = self.v_proj(hidden_states) +-++-+ +-++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ +-++-+ # # 3. RoPE 旋转位置编码 +-++-+ # kv_seq_len = key_states.shape[-2] +-++-+ # if past_key_value is not None: +-++-+ # if self.layer_idx is None: +-++-+ # raise ValueError( +-++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++-+ # "with a layer index." +-++-+ # ) +-++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++-+ +-++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++-+ +-++-+ # # 4. KV 缓存更新 +-++-+ # if past_key_value is not None: +-++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++-+ # key_states, value_states = past_key_value.update( +-++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-++-+ # ) +-++-+ +-++-+ # # 5. 准备 Attention Mask +-++-+ # fa_attention_mask = None +-++-+ # if attention_mask is not None: +-++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++-+ # fa_attention_mask = (mask_slice != 0) +-++-+ +-++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++-+ # input_dtype = query_states.dtype +-++-+ +-++-+ # # 6. [核心] 调用 flash_attention_score 算子 +-++-+ # attn_output = mindspore.ops.flash_attention_score( +-++-+ # query=query_states, +-++-+ # key=key_states, +-++-+ # value=value_states, +-++-+ # head_num=self.num_heads, +-++-+ # attn_mask=fa_attention_mask, +-++-+ # keep_prob=1.0 - self.attention_dropout, +-++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++-+ # input_layout="BNSD", +-++-+ # sparse_mode=0, +-++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++-+ # inner_precise=1 +-++-+ # ) +-++-+ +-++-+ # # 恢复原始数据类型 +-++-+ # attn_output = attn_output.to(input_dtype) +-++-+ +-++-+ # # 7. 调整输出形状 +-++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++-+ # attn_output = self.o_proj(attn_output) +-++-+ +-++-+ # attn_weights = None +-++-+ # if output_attentions: +-++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++-+ +-++-+ # return attn_output, attn_weights, past_key_value +-++-+ +-++-+ # def forward( +-++-+ # self, +-++-+ # hidden_states: mindspore.Tensor, +-++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-++-+ # position_ids: Optional[mindspore.Tensor] = None, +-++-+ # past_key_value: Optional[Cache] = None, +-++-+ # output_attentions: bool = False, +-++-+ # use_cache: bool = False, +-++-+ # cache_position: Optional[mindspore.Tensor] = None, +-++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++-+ +-++-+ # bsz, q_len, _ = hidden_states.shape +-++-+ +-++-+ # query_states = self.q_proj(hidden_states) +-++-+ # key_states = self.k_proj(hidden_states) +-++-+ # value_states = self.v_proj(hidden_states) +-++-+ +-++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-+ +-++-+ # kv_seq_len = key_states.shape[-2] +-++-+ # if past_key_value is not None: +-++-+ # if self.layer_idx is None: +-++-+ # raise ValueError("`layer_idx` must be specified for caching") +-++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++-+ +-++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++-+ +-++-+ # if past_key_value is not None: +-++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++-+ # key_states, value_states = past_key_value.update( +-++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-++-+ # ) +-++-+ +-++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++-+ +-++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++-+ # query_states = query_states / math.sqrt(self.head_dim) +-++-+ # # <--- 修改结束 --- +-++-+ +-++-+ # fa_attention_mask = None +-++-+ # if attention_mask is not None: +-++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++-+ # fa_attention_mask = (mask_slice != 0) +-++-+ +-++-+ # input_dtype = query_states.dtype +-++-+ +-++-+ # attn_output = mindspore.ops.flash_attention_score( +-++-+ # query=query_states, # 传入已经预先缩放过的 query +-++-+ # key=key_states, +-++-+ # value=value_states, +-++-+ # head_num=self.num_heads, +-++-+ # attn_mask=fa_attention_mask, +-++-+ # keep_prob=1.0 - self.attention_dropout, +-++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++-+ # input_layout="BNSD", +-++-+ # sparse_mode=0, +-++-+ # inner_precise=1 # 仍然保持内部高精度计算 +-++-+ # ) +-++-+ +-++-+ # attn_output = attn_output.to(input_dtype) +-++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++-+ # attn_output = self.o_proj(attn_output) +-++-+ +-++-+ # attn_weights = None +-++-+ # if output_attentions: +-++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++-+ +-++-+ # return attn_output, attn_weights, past_key_value +-++-+ +-++- QWEN2MOE_ATTENTION_CLASSES = { +-++- "eager": Qwen2MoeAttention, +-++-+ "flash-attention": Qwen2MoeFlashAttention, +-++- } +-++- +-++- +-++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++- +-++-+ #@dwj +-++-+ # 只遍历激活的专家,而非全部专家 +-++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++-- hidden_states = hidden_states.view(-1, hidden_dim) +-++-- # router_logits: (batch * sequence_length, n_experts) +-++-- router_logits = self.gate(hidden_states) +-++-- +-++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++-- if self.norm_topk_prob: +-++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++-- # we cast back to the input dtype +-++-- routing_weights = routing_weights.to(hidden_states.dtype) +-++-- +-++-- final_hidden_states = ops.zeros( +-++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++-- ) +-++-- +-++-- # One hot encode the selected experts to create an expert mask +-++-- # this will be used to easily index which expert is going to be sollicitated +-++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++-- +-++-- # Loop over all available experts in the model and perform the computation on each expert +-++-- for expert_idx in range(self.num_experts): +-++-- expert_layer = self.experts[expert_idx] +-++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++-- +-++-- # Index the correct hidden states and compute the expert hidden state for +-++-- # the current expert. We need to make sure to multiply the output hidden +-++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++-- if 0 not in idx.shape: +-++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++-- +-++-- # However `index_add_` only support torch tensors for indexing so we'll use +-++-- # the `top_x` tensor here. +-++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++-- +-++-- shared_expert_output = self.shared_expert(hidden_states) +-++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++-- +-++-- final_hidden_states = final_hidden_states + shared_expert_output +-++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++-+ num_tokens = hidden_states_reshaped.shape[0] +-++-+ +-++-+ router_logits = self.gate(hidden_states_reshaped) +-++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++-+ +-++-+ if self.norm_topk_prob: +-++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++-+ routing_weights = routing_weights.to(hidden_states.dtype) +-++-+ +-++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++-+ flat_selected_experts = selected_experts.flatten() +-++-+ +-++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++-+ token_indices = broadcasted_token_indices.flatten() +-++-+ +-++-+ active_experts = ops.unique(flat_selected_experts) +-++-+ +-++-+ for expert_idx_tensor in active_experts: +-++-+ expert_idx = expert_idx_tensor.item() +-++-+ expert_layer = self.experts[expert_idx] +-++-+ +-++-+ mask = (flat_selected_experts == expert_idx_tensor) +-++-+ selected_token_indices = token_indices[mask] +-++-+ selected_routing_weights = routing_weights.flatten()[mask] +-++-+ +-++-+ current_states = hidden_states_reshaped[selected_token_indices] +-++-+ +-++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++-+ +-++-+ final_hidden_states = final_hidden_states.index_add( +-++-+ dim=0, +-++-+ index=selected_token_indices, +-++-+ source=expert_output.to(hidden_states.dtype) +-++-+ ) +-++-+ +-++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++- +-++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++-- return final_hidden_states, router_logits +-++-+ final_hidden_states = final_hidden_states + shared_expert_output +-++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++-+ +-++-+ return final_hidden_states, router_logits +-++- +-++- +-++- class Qwen2MoeDecoderLayer(nn.Module): +-++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++- +-++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++- +-++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++-+ +-++- if (layer_idx not in config.mlp_only_layers) and ( +-++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++- ): +-++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++- _skip_keys_device_placement = "past_key_values" +-++- _supports_cache_class = True +-++-+#lwx +-++-+ # _supports_static_cache = True +-++- +-++- def _init_weights(self, module): +-++- std = self.config.initializer_range +-++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++- return causal_mask +-++- +-++- +-++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++- _tied_weights_keys = ["lm_head.weight"] +-++- +-++- def __init__(self, config): +-++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++- self.num_experts_per_tok = config.num_experts_per_tok +-++- # Initialize weights and apply final processing +-++- self.post_init() +-++-+ # @lwx +-++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++-+ # self.generation_config.cache_implementation = "static" +-++-+ self._warmed_up = False +-++-+ +-++-+ def warmup_moe_model(self): +-++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++-+ test_texts = [ +-++-+ "warmup short", +-++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++-+ ] +-++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++-+ if tokenizer is None: +-++-+ from mindnlp.transformers import AutoTokenizer +-++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++-+ self._warmup_tokenizer = tokenizer +-++-+ +-++-+ for text in test_texts: +-++-+ inputs = tokenizer(text, return_tensors="ms") +-++-+ with mindspore._no_grad(): +-++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++- +-++- def get_input_embeddings(self): +-++- return self.model.embed_tokens +-++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++- ```""" +-++-+ if not self._warmed_up: +-++-+ self._warmed_up = True +-++-+ self.warmup_moe_model() +-++- +-++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++- output_router_logits = ( +-++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++- } +-++- ) +-++- return model_inputs +-++-+# @lwx +-++-+ # def _decode_one_tokens_logits( +-++-+ # self, +-++-+ # cur_token: mindspore.Tensor, +-++-+ # input_pos: Optional[mindspore.Tensor], +-++-+ # cache_position: mindspore.Tensor, +-++-+ # past_key_values: StaticCache, +-++-+ # ) -> mindspore.Tensor: +-++-+ # """ +-++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++-+ +-++-+ # Args: +-++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++-+ # input_pos: 输入位置信息,可选 +-++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++-+ +-++-+ # Returns: +-++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++-+ # """ +-++-+ # # 调用JIT编译的版本 +-++-+ # return self.get_decode_one_tokens_logits( +-++-+ # cur_token=cur_token, +-++-+ # input_pos=input_pos, +-++-+ # cache_position=cache_position, +-++-+ # past_key_values=past_key_values, +-++-+ # ) +-++-+ +-++-+ # @mindspore.jit(jit_level='O1') +-++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++-+ # """ +-++-+ # JIT编译的函数,用于高效的单token解码 +-++-+ # 使用JIT编译优化以支持静态shape和高效执行 +-++-+ +-++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++-+ # """ +-++-+ # outputs = self.model.forward( +-++-+ # input_ids=cur_token, +-++-+ # position_ids=input_pos, +-++-+ # cache_position=cache_position, +-++-+ # past_key_values=past_key_values, +-++-+ # use_cache=True, +-++-+ # return_dict=False, +-++-+ # ) +-++-+ +-++-+ # hidden_states = outputs[0] +-++-+ # logits = self.lm_head.forward(hidden_states) +-++-+ # logits = logits.float() +-++-+ +-++-+ # return logits[:, -1, :] +-++-+ +-++-+ # def _sample( +-++-+ # self, +-++-+ # input_ids: mindspore.Tensor, +-++-+ # logits_processor, +-++-+ # stopping_criteria, +-++-+ # generation_config, +-++-+ # synced_devices: bool, +-++-+ # streamer=None, +-++-+ # logits_warper=None, +-++-+ # **model_kwargs, +-++-+ # ): +-++-+ # """ +-++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++-+ # """ +-++-+ # from ...generation.logits_process import LogitsProcessorList +-++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++-+ # from mindnlp.core import nn, ops, no_grad +-++-+ # import numpy as np +-++-+ +-++-+ # # 检查是否使用 StaticCache +-++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++-+ # # 否则,直接调用父类方法 +-++-+ # past_key_values = model_kwargs.get("past_key_values") +-++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++-+ +-++-+ # if not isinstance(past_key_values, StaticCache): +-++-+ # # 不使用 StaticCache,直接调用父类方法 +-++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++-+ # return super()._sample( +-++-+ # input_ids=input_ids, +-++-+ # logits_processor=logits_processor, +-++-+ # stopping_criteria=stopping_criteria, +-++-+ # generation_config=generation_config, +-++-+ # synced_devices=synced_devices, +-++-+ # streamer=streamer, +-++-+ # logits_warper=logits_warper, +-++-+ # **model_kwargs, +-++-+ # ) +-++-+ +-++-+ # # 使用 StaticCache,进入自定义循环 +-++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++-+ # pad_token_id = generation_config._pad_token_tensor +-++-+ # output_attentions = generation_config.output_attentions +-++-+ # output_hidden_states = generation_config.output_hidden_states +-++-+ # output_scores = generation_config.output_scores +-++-+ # output_logits = generation_config.output_logits +-++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-++-+ # max_length = generation_config.max_length +-++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++-+ # do_sample = generation_config.do_sample +-++-+ +-++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++-+ # raise ValueError( +-++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++-+ # f"{logits_warper})." +-++-+ # ) +-++-+ +-++-+ # # init attention / hidden states / scores tuples +-++-+ # scores = () if (return_dict_in_generate and output_scores) else None +-++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++-+ +-++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++-+ # encoder_hidden_states = ( +-++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++-+ # ) +-++-+ +-++-+ # # keep track of which sequences are already finished +-++-+ # batch_size, cur_len = input_ids.shape +-++-+ # this_peer_finished = False +-++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++-+ +-++-+ # time_record = [] +-++-+ # from ....utils.testing_utils import parse_flag_from_env +-++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++-+ +-++-+ # while self._has_unfinished_sequences( +-++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++-+ # ): +-++-+ # if _record_time: +-++-+ # import time as time_module +-++-+ # infer_start = time_module.time() +-++-+ +-++-+ # # prepare model inputs +-++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++-+ +-++-+ # # prepare variable output controls +-++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++-+ +-++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++-+ # cur_cache_position = model_inputs.get("cache_position") +-++-+ # cur_past_key_values = model_inputs.get("past_key_values") +-++-+ # cur_input_ids = model_inputs.get("input_ids") +-++-+ +-++-+ # if (isinstance(cur_past_key_values, StaticCache) and +-++-+ # cur_cache_position is not None and +-++-+ # len(cur_cache_position.shape) > 0 and +-++-+ # cur_cache_position.shape[0] == 1 and +-++-+ # cur_input_ids is not None and +-++-+ # cur_input_ids.shape[1] == 1): +-++-+ # # 使用 JIT 优化的单 token 解码 +-++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++-+ # if not hasattr(self, '_jit_used'): +-++-+ # self._jit_used = False +-++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++-+ +-++-+ # next_token_logits = self.get_decode_one_tokens_logits( +-++-+ # cur_token=cur_input_ids, +-++-+ # input_pos=model_inputs.get("position_ids"), +-++-+ # cache_position=cur_cache_position, +-++-+ # past_key_values=cur_past_key_values, +-++-+ # ) +-++-+ +-++-+ # # 标记已使用JIT(用于后续判断) +-++-+ # if not self._jit_used: +-++-+ # self._jit_used = True +-++-+ +-++-+ # # 构造兼容的输出对象 +-++-+ # class JitOptimizedOutput: +-++-+ # def __init__(self, logits, config): +-++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++-+ # self.config = config +-++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++-+ # self.attentions = None if not config.is_encoder_decoder else None +-++-+ # self.cross_attentions = None +-++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-++-+ +-++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++-+ # else: +-++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++-+ # outputs = self(**model_inputs, return_dict=True) +-++-+ +-++-+ # if synced_devices and this_peer_finished: +-++-+ # continue +-++-+ +-++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++-+ # next_token_logits = outputs.logits[:, -1, :] +-++-+ +-++-+ # # pre-process distribution +-++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++-+ # if do_sample: +-++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++-+ +-++-+ # # Store scores, attentions and hidden_states when required +-++-+ # if return_dict_in_generate: +-++-+ # if output_scores: +-++-+ # scores += (next_token_scores,) +-++-+ # if output_logits: +-++-+ # raw_logits += (next_token_logits,) +-++-+ # if output_attentions: +-++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-++-+ # if self.config.is_encoder_decoder: +-++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++-+ +-++-+ # if output_hidden_states: +-++-+ # hidden = ( +-++-+ # outputs.decoder_hidden_states +-++-+ # if self.config.is_encoder_decoder +-++-+ # else outputs.hidden_states +-++-+ # ) +-++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++-+ +-++-+ # # token selection +-++-+ # if do_sample: +-++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++-+ # else: +-++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++-+ +-++-+ # # finished sentences should have their next token be a padding token +-++-+ # if has_eos_stopping_criteria: +-++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++-+ +-++-+ # # update generated ids, model inputs, and length for next step +-++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++-+ # if streamer is not None: +-++-+ # streamer.put(next_tokens) +-++-+ +-++-+ # model_kwargs = self._update_model_kwargs_for_generation( +-++-+ # outputs, +-++-+ # model_kwargs, +-++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-++-+ # ) +-++-+ +-++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++-+ # cur_len += 1 +-++-+ +-++-+ # if _record_time: +-++-+ # import time as time_module +-++-+ # infer_stop = time_module.time() +-++-+ # time_record.append(infer_stop - infer_start) +-++-+ +-++-+ # del outputs +-++-+ +-++-+ # average_infer_time = None +-++-+ # if time_record: +-++-+ # if len(time_record) > 1: +-++-+ # time_record.pop(0) +-++-+ # average_infer_time = sum(time_record) / len(time_record) +-++-+ # print(f'average inference time is: {average_infer_time}') +-++-+ # print(f'inference time record: {time_record}') +-++-+ +-++-+ # if streamer is not None: +-++-+ # streamer.end() +-++-+ +-++-+ # # 简单判断:打印是否使用了JIT路径 +-++-+ # if hasattr(self, '_jit_used') and self._jit_used: +-++-+ # print("[JIT] ✓ JIT optimization was used during generation") +-++-+ # else: +-++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++-+ +-++-+ # if return_dict_in_generate: +-++-+ # if self.config.is_encoder_decoder: +-++-+ # return GenerateEncoderDecoderOutput( +-++-+ # sequences=input_ids, +-++-+ # scores=scores, +-++-+ # logits=raw_logits, +-++-+ # encoder_attentions=encoder_attentions, +-++-+ # encoder_hidden_states=encoder_hidden_states, +-++-+ # decoder_attentions=decoder_attentions, +-++-+ # cross_attentions=cross_attentions, +-++-+ # decoder_hidden_states=decoder_hidden_states, +-++-+ # past_key_values=model_kwargs.get("past_key_values"), +-++-+ # average_infer_time=average_infer_time +-++-+ # ) +-++-+ # else: +-++-+ # return GenerateDecoderOnlyOutput( +-++-+ # sequences=input_ids, +-++-+ # scores=scores, +-++-+ # logits=raw_logits, +-++-+ # attentions=decoder_attentions, +-++-+ # hidden_states=decoder_hidden_states, +-++-+ # past_key_values=model_kwargs.get("past_key_values"), +-++-+ # average_infer_time=average_infer_time +-++-+ # ) +-++-+ # else: +-++-+ # return input_ids +-++-+ +-++-+ # def _prepare_cache_for_generation( +-++-+ # self, +-++-+ # generation_config, +-++-+ # model_kwargs, +-++-+ # assistant_model, +-++-+ # batch_size, +-++-+ # max_cache_length, +-++-+ # ): +-++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++-+ # generation_config.cache_implementation = "static" +-++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++-+ +-++-+ # if generation_config.cache_implementation == "static": +-++-+ # base_required_from_max_length = generation_config.max_length + 1 +-++-+ # base_required = max(max_cache_length, base_required_from_max_length) +-++-+ # min_cache_size = 50 +-++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++-+ # else: +-++-+ # max_cache_length = max(base_required, min_cache_size) +-++-+ +-++-+ # original_max_cache_length = max_cache_length +-++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++-+ # print(f" - final max_cache_length: {max_cache_length}") +-++-+ +-++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++-+ # if max_cache_length > self.config.max_position_embeddings: +-++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++-+ +-++-+ # result = super()._prepare_cache_for_generation( +-++-+ # generation_config=generation_config, +-++-+ # model_kwargs=model_kwargs, +-++-+ # assistant_model=assistant_model, +-++-+ # batch_size=batch_size, +-++-+ # max_cache_length=max_cache_length, +-++-+ # ) +-++-+ +-++-+ # if generation_config.cache_implementation == "static": +-++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++-+ # created_cache = model_kwargs.get(cache_name) +-++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++-+ # if created_cache.max_cache_len < generation_config.max_length: +-++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++-+ +-++-+ # return result +-++-+ +-++-+ +-++-+ +-++- +-++- +-++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++--- +-++-2.27.0 +-++- +-++-- +-++2.27.0 +-++ +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +deleted file mode 100644 +index 80906633..00000000 +--- a/patches/0006-20251107002commit.patch ++++ /dev/null +@@ -1,7931 +0,0 @@ +-From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Fri, 7 Nov 2025 12:06:32 +0800 +-Subject: [PATCH 6/8] 20251107002commit +- +---- +- .../models/deepseek/modeling_deepseek.py | 122 +- +- patches/0001-20251104commit.patch | 2 +- +- patches/0002-20251106commit.patch | 2 +- +- patches/0003-20261106secondcommit.patch | 2 +- +- patches/0004-20251106change.patch | 2 +- +- patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ +- 6 files changed, 7773 insertions(+), 64 deletions(-) +- create mode 100644 patches/0005-20251107001commit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index 8831e4b7..e7e1c053 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): +- # expert_out = expert(x) +- # expert_cache += expert_out * weight +- # return expert_cache +-- +-- # @no_grad() +-- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-- # # x 的 shape: (1, hidden_size) +-- # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-- +-- # # 1. 收集所有需要的专家层 +-- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-- # selected_experts = [self.experts[i] for i in flat_expert_indices] +-- +-- # # 2. 并行计算所有专家的输出 +-- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-- # # ops.cat 会将它们堆叠成一个新的 Tensor +-- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-- +-- # # 3. 使用矩阵乘法进行加权求和 +-- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-- # # 最终结果 final_output 的 shape: (1, hidden_size) +-- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+ +-+ @no_grad() +-+ dwj +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ # x 的 shape: (1, hidden_size) +-+ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+ +-+ # 1. 收集所有需要的专家层 +-+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+ selected_experts = [self.experts[i] for i in flat_expert_indices] +-+ +-+ # 2. 并行计算所有专家的输出 +-+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+ # ops.cat 会将它们堆叠成一个新的 Tensor +-+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+ +-+ # 3. 使用矩阵乘法进行加权求和 +-+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+ # 最终结果 final_output 的 shape: (1, hidden_size) +-+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +- +-- # return final_output +-+ return final_output +- +- +- # @no_grad() +-@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): +- +- return expert_cache +- # 放置在 DeepseekMoE 类中 +-- @no_grad() +-- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-- """ +-- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-- +-- Args: +-- x (Tensor): 输入张量, shape: (1, hidden_size) +-- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-- """ +-- top_k, _ = flat_expert_weights.shape +-- hidden_size = x.shape[-1] +-- +-- # 1. 将所有专家的权重堆叠起来 +-- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-+ # @no_grad() +-+ # #lwx 20251107 +-+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ # """ +-+ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-+ +-+ # Args: +-+ # x (Tensor): 输入张量, shape: (1, hidden_size) +-+ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-+ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-+ # """ +-+ # top_k, _ = flat_expert_weights.shape +-+ # hidden_size = x.shape[-1] +-+ +-+ # # 1. 将所有专家的权重堆叠起来 +-+ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-+ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-+ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +- +-- # 2. "收集" 所需的专家权重 +-- selected_gate_w = stacked_gate_w[flat_expert_indices] +-- selected_up_w = stacked_up_w[flat_expert_indices] +-- selected_down_w = stacked_down_w[flat_expert_indices] +-+ # # 2. "收集" 所需的专家权重 +-+ # selected_gate_w = stacked_gate_w[flat_expert_indices] +-+ # selected_up_w = stacked_up_w[flat_expert_indices] +-+ # selected_down_w = stacked_down_w[flat_expert_indices] +- +-- # 3. 准备输入 +-- x_expanded = x.expand((top_k, 1, hidden_size)) +-+ # # 3. 准备输入 +-+ # x_expanded = x.expand((top_k, 1, hidden_size)) +- +-- # 4. 并行计算 gate_proj 和 up_proj +-- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-+ # # 4. 并行计算 gate_proj 和 up_proj +-+ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-+ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +- +-- # 5. 计算中间状态 +-- intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-+ # # 5. 计算中间状态 +-+ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out +- +-- # 6. 并行计算 down_proj +-- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-- # --- [FIX] --- +-- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-- # --- [FIX END] --- +-+ # # 6. 并行计算 down_proj +-+ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-+ # # --- [FIX] --- +-+ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-+ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-+ # # --- [FIX END] --- +- +-- # 7. 根据路由权重进行加权求和 +-- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-+ # # 7. 根据路由权重进行加权求和 +-+ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +- +-- return weighted_sum +-+ # return weighted_sum +- +- +- +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-index 0a0ef2d7..2842180e 100644 +---- a/patches/0001-20251104commit.patch +-+++ b/patches/0001-20251104commit.patch +-@@ -1,7 +1,7 @@ +- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Tue, 4 Nov 2025 09:11:51 +0800 +--Subject: [PATCH 1/4] 20251104commit +-+Subject: [PATCH 1/5] 20251104commit +- +- --- +- mindnlp/transformers/cache_utils.py | 28 +- +-diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-index 5185270c..c6cd8757 100644 +---- a/patches/0002-20251106commit.patch +-+++ b/patches/0002-20251106commit.patch +-@@ -1,7 +1,7 @@ +- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 09:20:38 +0800 +--Subject: [PATCH 2/4] 20251106commit +-+Subject: [PATCH 2/5] 20251106commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 379 ++++- +-diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-index 3e05f821..601960c9 100644 +---- a/patches/0003-20261106secondcommit.patch +-+++ b/patches/0003-20261106secondcommit.patch +-@@ -1,7 +1,7 @@ +- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 14:54:37 +0800 +--Subject: [PATCH 3/4] 20261106secondcommit +-+Subject: [PATCH 3/5] 20261106secondcommit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 217 ++- +-diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-index 88a1aef4..8976f10b 100644 +---- a/patches/0004-20251106change.patch +-+++ b/patches/0004-20251106change.patch +-@@ -1,7 +1,7 @@ +- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 15:48:09 +0800 +--Subject: [PATCH 4/4] 20251106change +-+Subject: [PATCH 4/5] 20251106change +- +- --- +- .../models/deepseek/modeling_deepseek.py | 189 +- +-diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-new file mode 100644 +-index 00000000..8d9032be +---- /dev/null +-+++ b/patches/0005-20251107001commit.patch +-@@ -0,0 +1,7707 @@ +-+From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Fri, 7 Nov 2025 11:48:18 +0800 +-+Subject: [PATCH 5/5] 20251107001commit +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 91 +- +-+ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +-+ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +-+ patches/0001-20251104commit.patch | 2 +- +-+ patches/0002-20251106commit.patch | 2 +- +-+ patches/0003-20261106secondcommit.patch | 2 +- +-+ patches/0004-20251106change.patch | 7498 +++++++++++++++++ +-+ 7 files changed, 7577 insertions(+), 30 deletions(-) +-+ create mode 100644 patches/0004-20251106change.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index 0546f318..8831e4b7 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +-+ # expert_cache += expert_out * weight +-+ # return expert_cache +-+ +-+- @no_grad() +-+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+- # x 的 shape: (1, hidden_size) +-+- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+- +-+- # 1. 收集所有需要的专家层 +-+- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+- selected_experts = [self.experts[i] for i in flat_expert_indices] +-+- +-+- # 2. 并行计算所有专家的输出 +-+- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+- # ops.cat 会将它们堆叠成一个新的 Tensor +-+- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+- +-+- # 3. 使用矩阵乘法进行加权求和 +-+- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+- # 最终结果 final_output 的 shape: (1, hidden_size) +-+- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++ # @no_grad() +-++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ # # x 的 shape: (1, hidden_size) +-++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++ +-++ # # 1. 收集所有需要的专家层 +-++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++ # selected_experts = [self.experts[i] for i in flat_expert_indices] +-++ +-++ # # 2. 并行计算所有专家的输出 +-++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++ # # ops.cat 会将它们堆叠成一个新的 Tensor +-++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++ +-++ # # 3. 使用矩阵乘法进行加权求和 +-++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ # # 最终结果 final_output 的 shape: (1, hidden_size) +-++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+ +-+- return final_output +-++ # return final_output +-+ +-+ +-+ # @no_grad() +-+@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +-+ ) +-+ +-+ return expert_cache +-++# 放置在 DeepseekMoE 类中 +-++ @no_grad() +-++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ """ +-++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-++ +-++ Args: +-++ x (Tensor): 输入张量, shape: (1, hidden_size) +-++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-++ """ +-++ top_k, _ = flat_expert_weights.shape +-++ hidden_size = x.shape[-1] +-++ +-++ # 1. 将所有专家的权重堆叠起来 +-++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-++ +-++ # 2. "收集" 所需的专家权重 +-++ selected_gate_w = stacked_gate_w[flat_expert_indices] +-++ selected_up_w = stacked_up_w[flat_expert_indices] +-++ selected_down_w = stacked_down_w[flat_expert_indices] +-++ +-++ # 3. 准备输入 +-++ x_expanded = x.expand((top_k, 1, hidden_size)) +-++ +-++ # 4. 并行计算 gate_proj 和 up_proj +-++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-++ +-++ # 5. 计算中间状态 +-++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-++ +-++ # 6. 并行计算 down_proj +-++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-++ # --- [FIX] --- +-++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-++ # --- [FIX END] --- +-++ +-++ # 7. 根据路由权重进行加权求和 +-++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-++ +-++ return weighted_sum +-++ +-++ +-+ +-+ # @no_grad() +-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+index ebd7782e..913a7609 100644 +-+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +-+ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+- x1 = x[..., : x.shape[-1] // 2] +-+- x2 = x[..., x.shape[-1] // 2 :] +-++ # x1 = x[..., : x.shape[-1] // 2] +-++ # x2 = x[..., x.shape[-1] // 2 :] +-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+index d059dcbe..2b217b64 100644 +-+--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +-+ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+ def rotate_half(x): +-+ """Rotates half the hidden dims of the input.""" +-+- x1 = x[..., : x.shape[-1] // 2] +-+- x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++ # x1 = x[..., : x.shape[-1] // 2] +-++ # x2 = x[..., x.shape[-1] // 2 :] +-++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+ return ops.cat((-x2, x1), dim=-1) +-+ +-+ +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+index 78f22642..0a0ef2d7 100644 +-+--- a/patches/0001-20251104commit.patch +-++++ b/patches/0001-20251104commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+-Subject: [PATCH 1/3] 20251104commit +-++Subject: [PATCH 1/4] 20251104commit +-+ +-+ --- +-+ mindnlp/transformers/cache_utils.py | 28 +- +-+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+index 22b65dd5..5185270c 100644 +-+--- a/patches/0002-20251106commit.patch +-++++ b/patches/0002-20251106commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+-Subject: [PATCH 2/3] 20251106commit +-++Subject: [PATCH 2/4] 20251106commit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+index 966529e4..3e05f821 100644 +-+--- a/patches/0003-20261106secondcommit.patch +-++++ b/patches/0003-20261106secondcommit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+-Subject: [PATCH 3/3] 20261106secondcommit +-++Subject: [PATCH 3/4] 20261106secondcommit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-+new file mode 100644 +-+index 00000000..88a1aef4 +-+--- /dev/null +-++++ b/patches/0004-20251106change.patch +-+@@ -0,0 +1,7498 @@ +-++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Thu, 6 Nov 2025 15:48:09 +0800 +-++Subject: [PATCH 4/4] 20251106change +-++ +-++--- +-++ .../models/deepseek/modeling_deepseek.py | 189 +- +-++ patches/0001-20251104commit.patch | 1272 +++++++ +-++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +-++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +-++ 4 files changed, 7244 insertions(+), 186 deletions(-) +-++ create mode 100644 patches/0001-20251104commit.patch +-++ create mode 100644 patches/0002-20251106commit.patch +-++ create mode 100644 patches/0003-20261106secondcommit.patch +-++ +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index 2f9192bf..0546f318 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +-++ +-++ return attn_output, attn_weights, past_key_value +-++ +-++-# class DeepseekFlashAttention(nn.Module): +-++-# """ +-++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-++- +-++-# This class is designed as a drop-in replacement for DeepseekAttention. +-++-# """ +-++- +-++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++-# super().__init__() +-++-# self.config = config +-++-# self.layer_idx = layer_idx +-++-# if layer_idx is None: +-++-# logger.warning( +-++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++-# "when creating this class." +-++-# ) +-++- +-++-# self.attention_dropout = config.attention_dropout +-++-# self.hidden_size = config.hidden_size +-++-# self.num_heads = config.num_attention_heads +-++-# self.head_dim = self.hidden_size // self.num_heads +-++-# self.num_key_value_heads = config.num_key_value_heads +-++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++-# self.max_position_embeddings = config.max_position_embeddings +-++-# self.rope_theta = config.rope_theta +-++-# self.is_causal = True +-++- +-++-# if (self.head_dim * self.num_heads) != self.hidden_size: +-++-# raise ValueError( +-++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++-# f" and `num_heads`: {self.num_heads})." +-++-# ) +-++- +-++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++-# self._init_rope() +-++- +-++-# def _init_rope(self): +-++-# if self.config.rope_scaling is None: +-++-# self.rotary_emb = DeepseekRotaryEmbedding( +-++-# self.head_dim, +-++-# max_position_embeddings=self.max_position_embeddings, +-++-# base=self.rope_theta, +-++-# ) +-++-# else: +-++-# scaling_type = self.config.rope_scaling["type"] +-++-# scaling_factor = self.config.rope_scaling["factor"] +-++-# if scaling_type == "linear": +-++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++-# self.head_dim, +-++-# max_position_embeddings=self.max_position_embeddings, +-++-# scaling_factor=scaling_factor, +-++-# base=self.rope_theta, +-++-# ) +-++-# elif scaling_type == "dynamic": +-++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++-# self.head_dim, +-++-# max_position_embeddings=self.max_position_embeddings, +-++-# scaling_factor=scaling_factor, +-++-# base=self.rope_theta, +-++-# ) +-++-# else: +-++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++- +-++-# def forward( +-++-# self, +-++-# hidden_states: mindspore.Tensor, +-++-# attention_mask: Optional[mindspore.Tensor] = None, +-++-# position_ids: Optional[mindspore.Tensor] = None, +-++-# past_key_value: Optional[Cache] = None, +-++-# output_attentions: bool = False, +-++-# use_cache: bool = False, +-++-# **kwargs, +-++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++-# if "padding_mask" in kwargs: +-++-# warnings.warn( +-++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++-# ) +-++- +-++-# if output_attentions: +-++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-++- +-++-# bsz, q_len, _ = hidden_states.shape +-++- +-++-# if self.config.pretraining_tp > 1: +-++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++- +-++-# query_states = self.q_proj(hidden_states) +-++-# key_states = self.k_proj(hidden_states) +-++-# value_states = self.v_proj(hidden_states) +-++- +-++-# # Reshape for multi-head attention +-++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++- +-++-# kv_seq_len = key_states.shape[-2] +-++-# if past_key_value is not None: +-++-# if self.layer_idx is None: +-++-# raise ValueError( +-++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++-# "with a layer index." +-++-# ) +-++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++- +-++-# # Apply Rotary Positional Embedding +-++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++- +-++-# if past_key_value is not None: +-++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++- +-++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++- +-++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++- +-++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++- +-++-# # Convert attention_mask for flash_attention_score +-++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-++-# if attention_mask is not None: +-++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++-# raise ValueError( +-++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++-# ) +-++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-++-# else: +-++-# attn_mask_for_fa = None +-++- +-++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++- +-++-# # Call the fused flash_attention_score operator +-++-# attn_output = mindspore.ops.flash_attention_score( +-++-# query=query_states_for_fa, +-++-# key=key_states_for_fa, +-++-# value=value_states_for_fa, +-++-# head_num=self.num_heads, # This is N1, the number of query heads +-++-# input_layout='BSH', +-++-# attn_mask=attn_mask_for_fa, +-++-# keep_prob=keep_prob, +-++-# scalar_value=1.0 / math.sqrt(self.head_dim), +-++-# sparse_mode=0 # Default mask mode +-++-# ) +-++- +-++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-++-# attn_output = self.o_proj(attn_output) +-++- +-++-# # Flash Attention does not return attention weights +-++-# attn_weights = None +-++- +-++-# return attn_output, attn_weights, past_key_value +-++ +-++ class DeepseekFlashAttention(nn.Module): +-++ """ +-++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +-++ super().__init__() +-++ self.hidden_size = config.hidden_size +-++ +-++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-++- config=config, layer_idx=layer_idx +-++- ) +-+++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-+++ # config=config, layer_idx=layer_idx +-+++ # ) +-++ +-++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-++ config=config, layer_idx=layer_idx +-++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +-++ return outputs +-++ +-++ +-++- +-++ class DeepseekPreTrainedModel(PreTrainedModel): +-++ config_class = DeepseekConfig +-++ base_model_prefix = "model" +-++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++ # Initialize weights and apply final processing +-++ self.post_init() +-++ self.warm_up = False +-++- #@dwj +-++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-++- self.num_layers, +-++- self.num_attention_heads, +-++- self.head_dim, +-++- batch_size=1, +-++- max_length=self.max_length, +-++- dtype=mindspore.float16 +-++- ) +-++- +-++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-++- key_cache = [] +-++- value_cache = [] +-++- for _ in range(num_layers): +-++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++- key_cache.append(k) +-++- value_cache.append(v) +-++- return key_cache, value_cache +-++- +-++ +-++ def warmup_moe_model_deep(self): +-++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++new file mode 100644 +-++index 00000000..78f22642 +-++--- /dev/null +-+++++ b/patches/0001-20251104commit.patch +-++@@ -0,0 +1,1272 @@ +-+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++Subject: [PATCH 1/3] 20251104commit +-+++ +-+++--- +-+++ mindnlp/transformers/cache_utils.py | 28 +- +-+++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++ 3 files changed, 976 insertions(+), 87 deletions(-) +-+++ +-+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++index cadd2e04..02f8d4be 100644 +-+++--- a/mindnlp/transformers/cache_utils.py +-++++++ b/mindnlp/transformers/cache_utils.py +-+++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++ # k_out[:, :, cache_position] = key_states +-+++ # v_out[:, :, cache_position] = value_states +-+++- if ON_ORANGE_PI: +-+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++- else: +-+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++- +-++++ # if ON_ORANGE_PI: +-++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++ # else: +-++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++++ if cache_position.ndim > 1: +-++++ cache_position = cache_position.flatten() +-++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++++ cache_position = cache_position.int() +-++++ +-++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++++ k_out[:, :, cache_position] = key_states +-++++ v_out[:, :, cache_position] = value_states +-++++ +-+++ return k_out, v_out +-+++ +-+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index c695b944..d8303e45 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++- x1 = x[..., : x.shape[-1] // 2] +-+++- x2 = x[..., x.shape[-1] // 2 :] +-++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++ # x1 = x[..., : x.shape[-1] // 2] +-++++ # x2 = x[..., x.shape[-1] // 2 :] +-++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++ if self.training: +-+++ raise NotImplementedError("Training is not supported yet.") +-+++ else: +-+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++- if self.config.n_shared_experts is not None: +-+++- y = y + self.shared_experts(identity) +-+++- return y +-++++ # @lwx +-++++ if orig_shape[1] == 1: +-++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++++ y=y.view(*orig_shape) +-++++ if self.config.n_shared_experts is not None: +-++++ y = y + self.shared_experts(identity) +-++++ return y +-++++ else: +-++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++++ if self.config.n_shared_experts is not None: +-++++ y = y + self.shared_experts(identity) +-++++ return y +-++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++ # if self.config.n_shared_experts is not None: +-++++ # y = y + self.shared_experts(identity) +-++++ # return y +-++++ +-++++ @no_grad() +-++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ +-++++ expert_cache = ops.zeros_like(x) +-++++ for i in range(self.num_experts_per_tok): +-++++ expert_id = flat_expert_indices[i].item() +-++++ weight = flat_expert_weights[i].item() +-++++ expert = self.experts[expert_id] +-++++ expert_out = expert(x) +-++++ expert_cache += expert_out * weight +-++++ return expert_cache +-+++ +-+++ @no_grad() +-+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++- # expert_cache = torch.zeros_like(x) +-+++- # idxs = flat_expert_indices.argsort() +-+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++- # token_idxs = idxs // self.num_experts_per_tok +-+++- # for i, end_idx in enumerate(tokens_per_expert): +-+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++- # if start_idx == end_idx: +-+++- # continue +-+++- # expert = self.experts[i] +-+++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++- # expert_tokens = x[exp_token_idx] +-+++- # expert_out = expert(expert_tokens) +-+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++- # return expert_cache +-++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++ expert_cache = ops.zeros_like(x) +-+++ idxs = flat_expert_indices.argsort() +-+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++ token_idxs = idxs // self.num_experts_per_tok +-++++ +-+++ for i, end_idx in enumerate(tokens_per_expert): +-+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++ if start_idx == end_idx: +-+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++ expert_out = expert(expert_tokens) +-+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-+++ return expert_cache +-++++ +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # # expert_cache = torch.zeros_like(x) +-++++ # # idxs = flat_expert_indices.argsort() +-++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++ # # if start_idx == end_idx: +-++++ # # continue +-++++ # # expert = self.experts[i] +-++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # # expert_tokens = x[exp_token_idx] +-++++ # # expert_out = expert(expert_tokens) +-++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++ # # return expert_cache +-++++ # expert_cache = ops.zeros_like(x) +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # if start_idx == end_idx: +-++++ # continue +-++++ # expert = self.experts[i] +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = expert(expert_tokens) +-++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-++++ # return expert_cache +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # expert_cache = ops.zeros_like(x) +-++++ +-++++ # # 排序保证顺序一致 +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # # 找出有 token 的专家 +-++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++ +-++++ # for i in active_experts.tolist(): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # end_idx = tokens_per_expert[i] +-++++ # if start_idx == end_idx: # 没有 token +-++++ # continue +-++++ +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = self.experts[i](expert_tokens) +-++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++ +-++++ # expert_cache = mindspore.mint.scatter_add( +-++++ # expert_cache, +-++++ # 0, +-++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++ # expert_out +-++++ # ) +-++++ +-++++ # return expert_cache +-++++ +-++++ +-+++ +-+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++ # """ +-+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-++++ self.warm_up = False +-++++ +-++++ def warmup_moe_model_deep(self): +-++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++ test_texts = [ +-++++ "warmup short", +-++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++++ ] +-++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++ if tokenizer is None: +-++++ from mindnlp.transformers import AutoTokenizer +-++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++ self._warmup_tokenizer = tokenizer +-++++ +-++++ for text in test_texts: +-++++ inputs = tokenizer(text, return_tensors="ms") +-++++ with mindspore._no_grad(): +-++++ _ = self(**inputs, use_cache=False) +-++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++ +-+++ def get_input_embeddings(self): +-+++ return self.model.embed_tokens +-+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++ ```""" +-++++ if not self.warm_up: +-++++ self.warm_up = True +-++++ self.warmup_moe_model_deep() +-++++ +-+++ output_attentions = ( +-+++ output_attentions +-+++ if output_attentions is not None +-+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++index 3cbf820e..d4c6b651 100644 +-+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++@@ -18,7 +18,6 @@ +-+++ # See the License for the specific language governing permissions and +-+++ # limitations under the License. +-+++ """MindSpore Qwen2MoE model.""" +-+++- +-+++ import math +-+++ from typing import List, Optional, Tuple, Union +-+++ +-+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++ TokenClassifierOutput, +-+++ ) +-+++ from ...modeling_utils import PreTrainedModel +-++++from ...generation import GenerationMixin +-+++ from ....utils import logging +-+++ from .configuration_qwen2_moe import Qwen2MoeConfig +-+++ +-+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++ self.variance_epsilon = eps +-+++ +-+++ def forward(self, hidden_states): +-++++ # @dwj +-++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++ # @lwx +-++++ # if not self.training : +-++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++ input_dtype = hidden_states.dtype +-+++ hidden_states = hidden_states.to(mindspore.float32) +-+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++@@ -234,6 +239,8 @@ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++ x1 = x[..., : x.shape[-1] // 2] +-+++ x2 = x[..., x.shape[-1] // 2 :] +-++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++ self.config = config +-+++ self.hidden_size = config.hidden_size +-+++ self.intermediate_size = intermediate_size +-++++ +-+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++ self.act_fn = ACT2FN[config.hidden_act] +-+++ +-+++ def forward(self, x): +-+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++- +-+++ +-++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++ # @lwx +-++++ # gate_up_output = self.gate_up_proj(x) +-++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++++ # return self.down_proj(swiglu_output) +-++++ +-++++ # def forward(self, x): +-++++ # gate_proj_out = self.gate_proj(x) +-++++ # up_proj_out = self.up_proj(x) +-++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++++ # return self.down_proj(swiglu_out) +-++++ +-+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++ """ +-+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++ use_cache: bool = False, +-+++ cache_position: Optional[mindspore.Tensor] = None, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ +-++++ +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ query_states = self.q_proj(hidden_states) +-+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++ "with a layer index." +-+++ ) +-+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ if isinstance(past_key_value, StaticCache): +-++++ kv_seq_len = key_states.shape[-2] +-++++ else: +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++ if isinstance(past_key_value, StaticCache): +-++++ kv_seq_len = key_states.shape[-2] +-+++ +-+++ # repeat k/v heads if n_kv_heads < n_heads +-+++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++- +-++++ +-+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++ +-+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++- raise ValueError( +-+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++- f" {attn_weights.shape}" +-+++- ) +-+++- +-+++- if attention_mask is not None: # no matter the length, we just slice it +-+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++++ if attention_mask is not None: +-++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++ attn_weights = attn_weights + causal_mask +-+++ +-+++ # upcast attention to fp32 +-+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++ +-+++ attn_output = self.o_proj(attn_output) +-+++- +-++++ # @lwx +-++++ +-++++ # max_seq_len = self.max_position_embeddings # 2048 +-++++ +-++++ # if attention_mask is not None: +-++++ # # attention_mask: [B, 1, Sq, Sk] +-++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++ +-++++ # # pad 到 [max_seq_len, max_seq_len] +-++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++ # global_attention_mask = padded_mask +-++++ # else: +-++++ # global_attention_mask = None +-++++ +-++++ +-++++ # sparse_mode=3 +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # real_shift=None, +-++++ # padding_mask=None, +-++++ +-++++ # head_num=self.num_heads, +-++++ # attn_mask=global_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ # input_layout="BNSD", +-++++ # pre_tokens=2147483647, +-++++ # next_tokens=2147483647, +-++++ # inner_precise=0, +-++++ # drop_mask=None, +-++++ # prefix=None, +-++++ # actual_seq_qlen=None, +-++++ # actual_seq_kvlen=None, +-++++ # sparse_mode=sparse_mode, +-++++ # ) +-+++ if not output_attentions: +-+++ attn_weights = None +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ +-++++class Qwen2MoeFlashAttention(nn.Module): +-++++ """ +-++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++ +-++++ 关键改动: +-++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++ 直接传入原始的 key 和 value 张量效率更高。 +-++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++ """ +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++ super().__init__() +-++++ self.config = config +-++++ self.layer_idx = layer_idx +-++++ self.hidden_size = config.hidden_size +-++++ self.num_heads = config.num_attention_heads +-++++ self.head_dim = self.hidden_size // self.num_heads +-++++ self.num_key_value_heads = config.num_key_value_heads +-++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++ self.max_position_embeddings = config.max_position_embeddings +-++++ self.rope_theta = config.rope_theta +-++++ self.attention_dropout = config.attention_dropout +-++++ +-++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++ raise ValueError( +-++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++ ) +-++++ +-++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++ +-++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++ self.head_dim, +-++++ max_position_embeddings=self.max_position_embeddings, +-++++ base=self.rope_theta, +-++++ ) +-++++ +-++++ def forward( +-++++ self, +-++++ hidden_states: mindspore.Tensor, +-++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++ position_ids: Optional[mindspore.Tensor] = None, +-++++ past_key_value: Optional[Cache] = None, +-++++ output_attentions: bool = False, +-++++ use_cache: bool = False, +-++++ cache_position: Optional[mindspore.Tensor] = None, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # 1. 线性投射 Q, K, V +-++++ query_states = self.q_proj(hidden_states) +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++ # query: [B, S, H*D] -> [B, N1, S, D] +-++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # 3. RoPE 旋转位置编码 +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++ if self.layer_idx is None: +-++++ raise ValueError( +-++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ "with a layer index." +-++++ ) +-++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++ if cache_position.shape[0] == 1: +-++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++ kv_seq_len = past_seen_tokens + 1 +-++++ else: +-++++ # prefill 阶段:cache_position 是范围,使用其长度 +-++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++ else: +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # 4. KV 缓存更新 +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ key_states, value_states = past_key_value.update( +-++++ key_states, value_states, self.layer_idx, cache_kwargs +-++++ ) +-++++ +-++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++ if cache_position.shape[0] == 1: +-++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++ kv_seq_len = key_states.shape[-2] +-++++ +-++++ # 5. [重要] 准备 Attention Mask +-++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++ fa_attention_mask = None +-++++ if attention_mask is not None: +-++++ # 截取与当前key长度匹配的部分 +-++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++ fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++ input_dtype = query_states.dtype +-++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++ query_states = query_states.to(mindspore.float16) +-++++ key_states = key_states.to(mindspore.float16) +-++++ value_states = value_states.to(mindspore.float16) +-++++ +-++++ # 6. [核心] 调用 flash_attention_score 算子 +-++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++ attn_output = mindspore.ops.flash_attention_score( +-++++ query=query_states, +-++++ key=key_states, +-++++ value=value_states, +-++++ head_num=self.num_heads, # 传入Q的头数(N1) +-++++ attn_mask=fa_attention_mask, +-++++ keep_prob=1.0 - self.attention_dropout, +-++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ input_layout="BNSD", +-++++ sparse_mode=0 # 使用 defaultMask 模式 +-++++ ) +-++++ +-++++ # 恢复原始数据类型 +-++++ attn_output = attn_output.to(input_dtype) +-++++ +-++++ # 7. 调整输出形状 +-++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = self.o_proj(attn_output) +-++++ +-++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++++ attn_weights = None +-++++ if output_attentions: +-++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ # def forward( +-++++ # self, +-++++ # hidden_states: mindspore.Tensor, +-++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++ # past_key_value: Optional[Cache] = None, +-++++ # output_attentions: bool = False, +-++++ # use_cache: bool = False, +-++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ # bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # # 1. 线性投射 Q, K, V +-++++ # query_states = self.q_proj(hidden_states) +-++++ # key_states = self.k_proj(hidden_states) +-++++ # value_states = self.v_proj(hidden_states) +-++++ +-++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # # 3. RoPE 旋转位置编码 +-++++ # kv_seq_len = key_states.shape[-2] +-++++ # if past_key_value is not None: +-++++ # if self.layer_idx is None: +-++++ # raise ValueError( +-++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ # "with a layer index." +-++++ # ) +-++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # # 4. KV 缓存更新 +-++++ # if past_key_value is not None: +-++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ # key_states, value_states = past_key_value.update( +-++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++ # ) +-++++ +-++++ # # 5. 准备 Attention Mask +-++++ # fa_attention_mask = None +-++++ # if attention_mask is not None: +-++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++ # input_dtype = query_states.dtype +-++++ +-++++ # # 6. [核心] 调用 flash_attention_score 算子 +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # head_num=self.num_heads, +-++++ # attn_mask=fa_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ # input_layout="BNSD", +-++++ # sparse_mode=0, +-++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++ # inner_precise=1 +-++++ # ) +-++++ +-++++ # # 恢复原始数据类型 +-++++ # attn_output = attn_output.to(input_dtype) +-++++ +-++++ # # 7. 调整输出形状 +-++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ # attn_output = self.o_proj(attn_output) +-++++ +-++++ # attn_weights = None +-++++ # if output_attentions: +-++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++ # return attn_output, attn_weights, past_key_value +-++++ +-++++ # def forward( +-++++ # self, +-++++ # hidden_states: mindspore.Tensor, +-++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++ # past_key_value: Optional[Cache] = None, +-++++ # output_attentions: bool = False, +-++++ # use_cache: bool = False, +-++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ # bsz, q_len, _ = hidden_states.shape +-++++ +-++++ # query_states = self.q_proj(hidden_states) +-++++ # key_states = self.k_proj(hidden_states) +-++++ # value_states = self.v_proj(hidden_states) +-++++ +-++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ # kv_seq_len = key_states.shape[-2] +-++++ # if past_key_value is not None: +-++++ # if self.layer_idx is None: +-++++ # raise ValueError("`layer_idx` must be specified for caching") +-++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ # if past_key_value is not None: +-++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ # key_states, value_states = past_key_value.update( +-++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++ # ) +-++++ +-++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++ +-++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++ # query_states = query_states / math.sqrt(self.head_dim) +-++++ # # <--- 修改结束 --- +-++++ +-++++ # fa_attention_mask = None +-++++ # if attention_mask is not None: +-++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ # fa_attention_mask = (mask_slice != 0) +-++++ +-++++ # input_dtype = query_states.dtype +-++++ +-++++ # attn_output = mindspore.ops.flash_attention_score( +-++++ # query=query_states, # 传入已经预先缩放过的 query +-++++ # key=key_states, +-++++ # value=value_states, +-++++ # head_num=self.num_heads, +-++++ # attn_mask=fa_attention_mask, +-++++ # keep_prob=1.0 - self.attention_dropout, +-++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++ # input_layout="BNSD", +-++++ # sparse_mode=0, +-++++ # inner_precise=1 # 仍然保持内部高精度计算 +-++++ # ) +-++++ +-++++ # attn_output = attn_output.to(input_dtype) +-++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ # attn_output = self.o_proj(attn_output) +-++++ +-++++ # attn_weights = None +-++++ # if output_attentions: +-++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++ +-++++ # return attn_output, attn_weights, past_key_value +-++++ +-+++ QWEN2MOE_ATTENTION_CLASSES = { +-+++ "eager": Qwen2MoeAttention, +-++++ "flash-attention": Qwen2MoeFlashAttention, +-+++ } +-+++ +-+++ +-+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-++++ #@dwj +-++++ # 只遍历激活的专家,而非全部专家 +-+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- hidden_states = hidden_states.view(-1, hidden_dim) +-+++- # router_logits: (batch * sequence_length, n_experts) +-+++- router_logits = self.gate(hidden_states) +-+++- +-+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- if self.norm_topk_prob: +-+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- # we cast back to the input dtype +-+++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++- final_hidden_states = ops.zeros( +-+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++- ) +-+++- +-+++- # One hot encode the selected experts to create an expert mask +-+++- # this will be used to easily index which expert is going to be sollicitated +-+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++- +-+++- # Loop over all available experts in the model and perform the computation on each expert +-+++- for expert_idx in range(self.num_experts): +-+++- expert_layer = self.experts[expert_idx] +-+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++- +-+++- # Index the correct hidden states and compute the expert hidden state for +-+++- # the current expert. We need to make sure to multiply the output hidden +-+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++- if 0 not in idx.shape: +-+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++- +-+++- # However `index_add_` only support torch tensors for indexing so we'll use +-+++- # the `top_x` tensor here. +-+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++- +-+++- shared_expert_output = self.shared_expert(hidden_states) +-+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++- +-+++- final_hidden_states = final_hidden_states + shared_expert_output +-++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++ num_tokens = hidden_states_reshaped.shape[0] +-++++ +-++++ router_logits = self.gate(hidden_states_reshaped) +-++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++ if self.norm_topk_prob: +-++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++ flat_selected_experts = selected_experts.flatten() +-++++ +-++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++ token_indices = broadcasted_token_indices.flatten() +-++++ +-++++ active_experts = ops.unique(flat_selected_experts) +-++++ +-++++ for expert_idx_tensor in active_experts: +-++++ expert_idx = expert_idx_tensor.item() +-++++ expert_layer = self.experts[expert_idx] +-++++ +-++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++ selected_token_indices = token_indices[mask] +-++++ selected_routing_weights = routing_weights.flatten()[mask] +-++++ +-++++ current_states = hidden_states_reshaped[selected_token_indices] +-++++ +-++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++ +-++++ final_hidden_states = final_hidden_states.index_add( +-++++ dim=0, +-++++ index=selected_token_indices, +-++++ source=expert_output.to(hidden_states.dtype) +-++++ ) +-++++ +-++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++ +-+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++- return final_hidden_states, router_logits +-++++ final_hidden_states = final_hidden_states + shared_expert_output +-++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++ +-++++ return final_hidden_states, router_logits +-+++ +-+++ +-+++ class Qwen2MoeDecoderLayer(nn.Module): +-+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++ +-+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++ +-+++ if (layer_idx not in config.mlp_only_layers) and ( +-+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++ ): +-+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++ _skip_keys_device_placement = "past_key_values" +-+++ _supports_cache_class = True +-++++#lwx +-++++ # _supports_static_cache = True +-+++ +-+++ def _init_weights(self, module): +-+++ std = self.config.initializer_range +-+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++ return causal_mask +-+++ +-+++ +-+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ _tied_weights_keys = ["lm_head.weight"] +-+++ +-+++ def __init__(self, config): +-+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ self.num_experts_per_tok = config.num_experts_per_tok +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-++++ # @lwx +-++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++++ # self.generation_config.cache_implementation = "static" +-++++ self._warmed_up = False +-++++ +-++++ def warmup_moe_model(self): +-++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++++ test_texts = [ +-++++ "warmup short", +-++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++++ ] +-++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++ if tokenizer is None: +-++++ from mindnlp.transformers import AutoTokenizer +-++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++ self._warmup_tokenizer = tokenizer +-++++ +-++++ for text in test_texts: +-++++ inputs = tokenizer(text, return_tensors="ms") +-++++ with mindspore._no_grad(): +-++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++ +-+++ def get_input_embeddings(self): +-+++ return self.model.embed_tokens +-+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++ ```""" +-++++ if not self._warmed_up: +-++++ self._warmed_up = True +-++++ self.warmup_moe_model() +-+++ +-+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++ output_router_logits = ( +-+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++ } +-+++ ) +-+++ return model_inputs +-++++# @lwx +-++++ # def _decode_one_tokens_logits( +-++++ # self, +-++++ # cur_token: mindspore.Tensor, +-++++ # input_pos: Optional[mindspore.Tensor], +-++++ # cache_position: mindspore.Tensor, +-++++ # past_key_values: StaticCache, +-++++ # ) -> mindspore.Tensor: +-++++ # """ +-++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++++ +-++++ # Args: +-++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++++ # input_pos: 输入位置信息,可选 +-++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++++ +-++++ # Returns: +-++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++++ # """ +-++++ # # 调用JIT编译的版本 +-++++ # return self.get_decode_one_tokens_logits( +-++++ # cur_token=cur_token, +-++++ # input_pos=input_pos, +-++++ # cache_position=cache_position, +-++++ # past_key_values=past_key_values, +-++++ # ) +-++++ +-++++ # @mindspore.jit(jit_level='O1') +-++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++++ # """ +-++++ # JIT编译的函数,用于高效的单token解码 +-++++ # 使用JIT编译优化以支持静态shape和高效执行 +-++++ +-++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++++ # """ +-++++ # outputs = self.model.forward( +-++++ # input_ids=cur_token, +-++++ # position_ids=input_pos, +-++++ # cache_position=cache_position, +-++++ # past_key_values=past_key_values, +-++++ # use_cache=True, +-++++ # return_dict=False, +-++++ # ) +-++++ +-++++ # hidden_states = outputs[0] +-++++ # logits = self.lm_head.forward(hidden_states) +-++++ # logits = logits.float() +-++++ +-++++ # return logits[:, -1, :] +-++++ +-++++ # def _sample( +-++++ # self, +-++++ # input_ids: mindspore.Tensor, +-++++ # logits_processor, +-++++ # stopping_criteria, +-++++ # generation_config, +-++++ # synced_devices: bool, +-++++ # streamer=None, +-++++ # logits_warper=None, +-++++ # **model_kwargs, +-++++ # ): +-++++ # """ +-++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++++ # """ +-++++ # from ...generation.logits_process import LogitsProcessorList +-++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++++ # from mindnlp.core import nn, ops, no_grad +-++++ # import numpy as np +-++++ +-++++ # # 检查是否使用 StaticCache +-++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++++ # # 否则,直接调用父类方法 +-++++ # past_key_values = model_kwargs.get("past_key_values") +-++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++++ +-++++ # if not isinstance(past_key_values, StaticCache): +-++++ # # 不使用 StaticCache,直接调用父类方法 +-++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++++ # return super()._sample( +-++++ # input_ids=input_ids, +-++++ # logits_processor=logits_processor, +-++++ # stopping_criteria=stopping_criteria, +-++++ # generation_config=generation_config, +-++++ # synced_devices=synced_devices, +-++++ # streamer=streamer, +-++++ # logits_warper=logits_warper, +-++++ # **model_kwargs, +-++++ # ) +-++++ +-++++ # # 使用 StaticCache,进入自定义循环 +-++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++++ # pad_token_id = generation_config._pad_token_tensor +-++++ # output_attentions = generation_config.output_attentions +-++++ # output_hidden_states = generation_config.output_hidden_states +-++++ # output_scores = generation_config.output_scores +-++++ # output_logits = generation_config.output_logits +-++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++++ # max_length = generation_config.max_length +-++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++++ # do_sample = generation_config.do_sample +-++++ +-++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++++ # raise ValueError( +-++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++++ # f"{logits_warper})." +-++++ # ) +-++++ +-++++ # # init attention / hidden states / scores tuples +-++++ # scores = () if (return_dict_in_generate and output_scores) else None +-++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++++ +-++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++++ # encoder_hidden_states = ( +-++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++++ # ) +-++++ +-++++ # # keep track of which sequences are already finished +-++++ # batch_size, cur_len = input_ids.shape +-++++ # this_peer_finished = False +-++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++++ +-++++ # time_record = [] +-++++ # from ....utils.testing_utils import parse_flag_from_env +-++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++++ +-++++ # while self._has_unfinished_sequences( +-++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++++ # ): +-++++ # if _record_time: +-++++ # import time as time_module +-++++ # infer_start = time_module.time() +-++++ +-++++ # # prepare model inputs +-++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++++ +-++++ # # prepare variable output controls +-++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++++ +-++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++++ # cur_cache_position = model_inputs.get("cache_position") +-++++ # cur_past_key_values = model_inputs.get("past_key_values") +-++++ # cur_input_ids = model_inputs.get("input_ids") +-++++ +-++++ # if (isinstance(cur_past_key_values, StaticCache) and +-++++ # cur_cache_position is not None and +-++++ # len(cur_cache_position.shape) > 0 and +-++++ # cur_cache_position.shape[0] == 1 and +-++++ # cur_input_ids is not None and +-++++ # cur_input_ids.shape[1] == 1): +-++++ # # 使用 JIT 优化的单 token 解码 +-++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++++ # if not hasattr(self, '_jit_used'): +-++++ # self._jit_used = False +-++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++++ +-++++ # next_token_logits = self.get_decode_one_tokens_logits( +-++++ # cur_token=cur_input_ids, +-++++ # input_pos=model_inputs.get("position_ids"), +-++++ # cache_position=cur_cache_position, +-++++ # past_key_values=cur_past_key_values, +-++++ # ) +-++++ +-++++ # # 标记已使用JIT(用于后续判断) +-++++ # if not self._jit_used: +-++++ # self._jit_used = True +-++++ +-++++ # # 构造兼容的输出对象 +-++++ # class JitOptimizedOutput: +-++++ # def __init__(self, logits, config): +-++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++++ # self.config = config +-++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++++ # self.attentions = None if not config.is_encoder_decoder else None +-++++ # self.cross_attentions = None +-++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++++ +-++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++++ # else: +-++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++++ # outputs = self(**model_inputs, return_dict=True) +-++++ +-++++ # if synced_devices and this_peer_finished: +-++++ # continue +-++++ +-++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++++ # next_token_logits = outputs.logits[:, -1, :] +-++++ +-++++ # # pre-process distribution +-++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++++ # if do_sample: +-++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++++ +-++++ # # Store scores, attentions and hidden_states when required +-++++ # if return_dict_in_generate: +-++++ # if output_scores: +-++++ # scores += (next_token_scores,) +-++++ # if output_logits: +-++++ # raw_logits += (next_token_logits,) +-++++ # if output_attentions: +-++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++++ # if self.config.is_encoder_decoder: +-++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++++ +-++++ # if output_hidden_states: +-++++ # hidden = ( +-++++ # outputs.decoder_hidden_states +-++++ # if self.config.is_encoder_decoder +-++++ # else outputs.hidden_states +-++++ # ) +-++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++++ +-++++ # # token selection +-++++ # if do_sample: +-++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++++ # else: +-++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++++ +-++++ # # finished sentences should have their next token be a padding token +-++++ # if has_eos_stopping_criteria: +-++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++++ +-++++ # # update generated ids, model inputs, and length for next step +-++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++++ # if streamer is not None: +-++++ # streamer.put(next_tokens) +-++++ +-++++ # model_kwargs = self._update_model_kwargs_for_generation( +-++++ # outputs, +-++++ # model_kwargs, +-++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++++ # ) +-++++ +-++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++++ # cur_len += 1 +-++++ +-++++ # if _record_time: +-++++ # import time as time_module +-++++ # infer_stop = time_module.time() +-++++ # time_record.append(infer_stop - infer_start) +-++++ +-++++ # del outputs +-++++ +-++++ # average_infer_time = None +-++++ # if time_record: +-++++ # if len(time_record) > 1: +-++++ # time_record.pop(0) +-++++ # average_infer_time = sum(time_record) / len(time_record) +-++++ # print(f'average inference time is: {average_infer_time}') +-++++ # print(f'inference time record: {time_record}') +-++++ +-++++ # if streamer is not None: +-++++ # streamer.end() +-++++ +-++++ # # 简单判断:打印是否使用了JIT路径 +-++++ # if hasattr(self, '_jit_used') and self._jit_used: +-++++ # print("[JIT] ✓ JIT optimization was used during generation") +-++++ # else: +-++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++++ +-++++ # if return_dict_in_generate: +-++++ # if self.config.is_encoder_decoder: +-++++ # return GenerateEncoderDecoderOutput( +-++++ # sequences=input_ids, +-++++ # scores=scores, +-++++ # logits=raw_logits, +-++++ # encoder_attentions=encoder_attentions, +-++++ # encoder_hidden_states=encoder_hidden_states, +-++++ # decoder_attentions=decoder_attentions, +-++++ # cross_attentions=cross_attentions, +-++++ # decoder_hidden_states=decoder_hidden_states, +-++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++ # average_infer_time=average_infer_time +-++++ # ) +-++++ # else: +-++++ # return GenerateDecoderOnlyOutput( +-++++ # sequences=input_ids, +-++++ # scores=scores, +-++++ # logits=raw_logits, +-++++ # attentions=decoder_attentions, +-++++ # hidden_states=decoder_hidden_states, +-++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++ # average_infer_time=average_infer_time +-++++ # ) +-++++ # else: +-++++ # return input_ids +-++++ +-++++ # def _prepare_cache_for_generation( +-++++ # self, +-++++ # generation_config, +-++++ # model_kwargs, +-++++ # assistant_model, +-++++ # batch_size, +-++++ # max_cache_length, +-++++ # ): +-++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++++ # generation_config.cache_implementation = "static" +-++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++++ +-++++ # if generation_config.cache_implementation == "static": +-++++ # base_required_from_max_length = generation_config.max_length + 1 +-++++ # base_required = max(max_cache_length, base_required_from_max_length) +-++++ # min_cache_size = 50 +-++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++++ # else: +-++++ # max_cache_length = max(base_required, min_cache_size) +-++++ +-++++ # original_max_cache_length = max_cache_length +-++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++++ # print(f" - final max_cache_length: {max_cache_length}") +-++++ +-++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++ # if max_cache_length > self.config.max_position_embeddings: +-++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++++ +-++++ # result = super()._prepare_cache_for_generation( +-++++ # generation_config=generation_config, +-++++ # model_kwargs=model_kwargs, +-++++ # assistant_model=assistant_model, +-++++ # batch_size=batch_size, +-++++ # max_cache_length=max_cache_length, +-++++ # ) +-++++ +-++++ # if generation_config.cache_implementation == "static": +-++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++++ # created_cache = model_kwargs.get(cache_name) +-++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++++ # if created_cache.max_cache_len < generation_config.max_length: +-++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++++ +-++++ # return result +-++++ +-++++ +-++++ +-+++ +-+++ +-+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++-- +-+++2.27.0 +-+++ +-++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-++new file mode 100644 +-++index 00000000..22b65dd5 +-++--- /dev/null +-+++++ b/patches/0002-20251106commit.patch +-++@@ -0,0 +1,3200 @@ +-+++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+++Subject: [PATCH 2/3] 20251106commit +-+++ +-+++--- +-+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +-+++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +-+++ 3 files changed, 2689 insertions(+), 305 deletions(-) +-+++ create mode 100644 patches/0001-20251104commit.patch +-+++ +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index d8303e45..73773c22 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +-+++ # y = y + self.shared_experts(identity) +-+++ # return y +-+++ +-++++ # @no_grad() +-++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ +-++++ # expert_cache = ops.zeros_like(x) +-++++ # for i in range(self.num_experts_per_tok): +-++++ # expert_id = flat_expert_indices[i].item() +-++++ # weight = flat_expert_weights[i].item() +-++++ # expert = self.experts[expert_id] +-++++ # expert_out = expert(x) +-++++ # expert_cache += expert_out * weight +-++++ # return expert_cache +-++++ +-+++ @no_grad() +-+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ # x 的 shape: (1, hidden_size) +-++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++++ +-++++ # 1. 收集所有需要的专家层 +-++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-++++ +-++++ # 2. 并行计算所有专家的输出 +-++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++++ # ops.cat 会将它们堆叠成一个新的 Tensor +-++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++++ +-++++ # 3. 使用矩阵乘法进行加权求和 +-++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++ # 最终结果 final_output 的 shape: (1, hidden_size) +-++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++++ +-++++ return final_output +-+++ +-+++- expert_cache = ops.zeros_like(x) +-+++- for i in range(self.num_experts_per_tok): +-+++- expert_id = flat_expert_indices[i].item() +-+++- weight = flat_expert_weights[i].item() +-+++- expert = self.experts[expert_id] +-+++- expert_out = expert(x) +-+++- expert_cache += expert_out * weight +-+++- return expert_cache +-+++ +-+++ @no_grad() +-+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++ # @lwx +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+++ +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ +-++++# class DeepseekFlashAttention(nn.Module): +-++++# """ +-++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-++++ +-++++# This class is designed as a drop-in replacement for DeepseekAttention. +-++++# """ +-++++ +-++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++++# super().__init__() +-++++# self.config = config +-++++# self.layer_idx = layer_idx +-++++# if layer_idx is None: +-++++# logger.warning( +-++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++# "when creating this class." +-++++# ) +-++++ +-++++# self.attention_dropout = config.attention_dropout +-++++# self.hidden_size = config.hidden_size +-++++# self.num_heads = config.num_attention_heads +-++++# self.head_dim = self.hidden_size // self.num_heads +-++++# self.num_key_value_heads = config.num_key_value_heads +-++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++# self.max_position_embeddings = config.max_position_embeddings +-++++# self.rope_theta = config.rope_theta +-++++# self.is_causal = True +-++++ +-++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++# raise ValueError( +-++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++# f" and `num_heads`: {self.num_heads})." +-++++# ) +-++++ +-++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++++# self._init_rope() +-++++ +-++++# def _init_rope(self): +-++++# if self.config.rope_scaling is None: +-++++# self.rotary_emb = DeepseekRotaryEmbedding( +-++++# self.head_dim, +-++++# max_position_embeddings=self.max_position_embeddings, +-++++# base=self.rope_theta, +-++++# ) +-++++# else: +-++++# scaling_type = self.config.rope_scaling["type"] +-++++# scaling_factor = self.config.rope_scaling["factor"] +-++++# if scaling_type == "linear": +-++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++++# self.head_dim, +-++++# max_position_embeddings=self.max_position_embeddings, +-++++# scaling_factor=scaling_factor, +-++++# base=self.rope_theta, +-++++# ) +-++++# elif scaling_type == "dynamic": +-++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++++# self.head_dim, +-++++# max_position_embeddings=self.max_position_embeddings, +-++++# scaling_factor=scaling_factor, +-++++# base=self.rope_theta, +-++++# ) +-++++# else: +-++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++++ +-++++# def forward( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++# position_ids: Optional[mindspore.Tensor] = None, +-++++# past_key_value: Optional[Cache] = None, +-++++# output_attentions: bool = False, +-++++# use_cache: bool = False, +-++++# **kwargs, +-++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++# if "padding_mask" in kwargs: +-++++# warnings.warn( +-++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++++# ) +-++++ +-++++# if output_attentions: +-++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-++++ +-++++# bsz, q_len, _ = hidden_states.shape +-++++ +-++++# if self.config.pretraining_tp > 1: +-++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++++ +-++++# query_states = self.q_proj(hidden_states) +-++++# key_states = self.k_proj(hidden_states) +-++++# value_states = self.v_proj(hidden_states) +-++++ +-++++# # Reshape for multi-head attention +-++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++# kv_seq_len = key_states.shape[-2] +-++++# if past_key_value is not None: +-++++# if self.layer_idx is None: +-++++# raise ValueError( +-++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++# "with a layer index." +-++++# ) +-++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++# # Apply Rotary Positional Embedding +-++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++# if past_key_value is not None: +-++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ +-++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++ +-++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++ +-++++# # Convert attention_mask for flash_attention_score +-++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-++++# if attention_mask is not None: +-++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++++# raise ValueError( +-++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++++# ) +-++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-++++# else: +-++++# attn_mask_for_fa = None +-++++ +-++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++++ +-++++# # Call the fused flash_attention_score operator +-++++# attn_output = mindspore.ops.flash_attention_score( +-++++# query=query_states_for_fa, +-++++# key=key_states_for_fa, +-++++# value=value_states_for_fa, +-++++# head_num=self.num_heads, # This is N1, the number of query heads +-++++# input_layout='BSH', +-++++# attn_mask=attn_mask_for_fa, +-++++# keep_prob=keep_prob, +-++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++++# sparse_mode=0 # Default mask mode +-++++# ) +-++++ +-++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-++++# attn_output = self.o_proj(attn_output) +-++++ +-++++# # Flash Attention does not return attention weights +-++++# attn_weights = None +-++++ +-++++# return attn_output, attn_weights, past_key_value +-++++ +-++++class DeepseekFlashAttention(nn.Module): +-++++ """ +-++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-++++ This implementation is a drop-in replacement for the original DeepseekAttention class, +-++++ designed for high performance on supported hardware (Ascend). +-++++ +-++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-++++ """ +-++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++++ super().__init__() +-++++ self.config = config +-++++ self.layer_idx = layer_idx +-++++ if layer_idx is None: +-++++ logger.warning( +-++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++ "when creating this class." +-++++ ) +-++++ +-++++ # --- [FIX] Correctly initialize all required attributes --- +-++++ self.attention_dropout = config.attention_dropout +-++++ self.hidden_size = config.hidden_size +-++++ self.num_heads = config.num_attention_heads +-++++ self.head_dim = self.hidden_size // self.num_heads +-++++ self.num_key_value_heads = config.num_key_value_heads +-++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++ self.max_position_embeddings = config.max_position_embeddings +-++++ self.rope_theta = config.rope_theta +-++++ self.is_causal = True +-++++ +-++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++ raise ValueError( +-++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++ f" and `num_heads`: {self.num_heads})." +-++++ ) +-++++ +-++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++++ +-++++ # This call will now succeed as all attributes are initialized. +-++++ self._init_rope() +-++++ +-++++ def _init_rope(self): +-++++ if self.config.rope_scaling is None: +-++++ self.rotary_emb = DeepseekRotaryEmbedding( +-++++ self.head_dim, +-++++ max_position_embeddings=self.max_position_embeddings, +-++++ base=self.rope_theta, +-++++ ) +-++++ else: +-++++ scaling_type = self.config.rope_scaling["type"] +-++++ scaling_factor = self.config.rope_scaling["factor"] +-++++ if scaling_type == "linear": +-++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++++ self.head_dim, +-++++ max_position_embeddings=self.max_position_embeddings, +-++++ scaling_factor=scaling_factor, +-++++ base=self.rope_theta, +-++++ ) +-++++ elif scaling_type == "dynamic": +-++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++++ self.head_dim, +-++++ max_position_embeddings=self.max_position_embeddings, +-++++ scaling_factor=scaling_factor, +-++++ base=self.rope_theta, +-++++ ) +-++++ else: +-++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++++ +-++++ def forward( +-++++ self, +-++++ hidden_states: mindspore.Tensor, +-++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++ position_ids: Optional[mindspore.Tensor] = None, +-++++ past_key_value: Optional[Cache] = None, +-++++ output_attentions: bool = False, +-++++ use_cache: bool = False, +-++++ **kwargs, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ if "padding_mask" in kwargs: +-++++ warnings.warn( +-++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++++ ) +-++++ if output_attentions: +-++++ warnings.warn( +-++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-++++ ) +-++++ +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ if self.config.pretraining_tp > 1: +-++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++++ +-++++ query_states = self.q_proj(hidden_states) +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++ # Apply Rotary Position Embedding +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos} +-++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-++++ # So we must explicitly repeat the KV heads. +-++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++ +-++++ # Convert attention mask for flash_attention_score +-++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-++++ if attention_mask is not None: +-++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++++ raise ValueError( +-++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++++ ) +-++++ attn_mask_for_fa = attention_mask < 0 +-++++ else: +-++++ attn_mask_for_fa = None +-++++ +-++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++++ +-++++ # Call the fused operator using the efficient BNSD layout +-++++ attn_output = mindspore.ops.flash_attention_score( +-++++ query=query_states, +-++++ key=key_states, +-++++ value=value_states, +-++++ head_num=self.num_heads, +-++++ input_layout='BNSD', # Specify the correct layout +-++++ attn_mask=attn_mask_for_fa, +-++++ keep_prob=keep_prob, +-++++ scalar_value=1.0 / math.sqrt(self.head_dim) +-++++ ) +-++++ +-++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ +-++++ # Apply output projection +-++++ attn_output = self.o_proj(attn_output) +-++++ +-++++ # Flash attention does not return attention weights, so we return None. +-++++ attn_weights = None +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-+++ Deepseek_ATTENTION_CLASSES = { +-+++ "eager": DeepseekAttention, +-++++ "flash-attention": DeepseekFlashAttention, +-+++ } +-+++ +-+++ +-+++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +-+++ config=config, layer_idx=layer_idx +-+++ ) +-+++ +-++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-++++ config=config, layer_idx=layer_idx +-++++ ) +-++++ +-+++ self.mlp = ( +-+++ DeepseekMoE(config) +-+++ if ( +-+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++index d4c6b651..bced285c 100644 +-+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +-+++ +-+++ import mindspore +-+++ import mindnlp.core.nn.functional as F +-+++-from mindnlp.core import nn, ops +-++++from mindnlp.core import nn, ops, no_grad +-+++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +-+++ +-+++ from ....common.activations import ACT2FN +-+++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +-+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+++ +-++++Long_Prompt = False +-++++PROMPT_LENGTH_THRESHOLD = 128 +-+++ +-+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+++ def _prepare_4d_causal_attention_mask_with_cache_position( +-+++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++ +-++++# class Qwen2MoeFlashAttention(nn.Module): +-++++# """ +-++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++ +-++++# 关键改动: +-++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++# 直接传入原始的 key 和 value 张量效率更高。 +-++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++# super().__init__() +-++++# self.config = config +-++++# self.layer_idx = layer_idx +-++++# self.hidden_size = config.hidden_size +-++++# self.num_heads = config.num_attention_heads +-++++# self.head_dim = self.hidden_size // self.num_heads +-++++# self.num_key_value_heads = config.num_key_value_heads +-++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++# self.max_position_embeddings = config.max_position_embeddings +-++++# self.rope_theta = config.rope_theta +-++++# self.attention_dropout = config.attention_dropout +-++++ +-++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++# raise ValueError( +-++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++# ) +-++++ +-++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++ +-++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++# self.head_dim, +-++++# max_position_embeddings=self.max_position_embeddings, +-++++# base=self.rope_theta, +-++++# ) +-++++ +-++++# def forward( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++# position_ids: Optional[mindspore.Tensor] = None, +-++++# past_key_value: Optional[Cache] = None, +-++++# output_attentions: bool = False, +-++++# use_cache: bool = False, +-++++# cache_position: Optional[mindspore.Tensor] = None, +-++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++# bsz, q_len, _ = hidden_states.shape +-++++ +-++++# # 1. 线性投射 Q, K, V +-++++# query_states = self.q_proj(hidden_states) +-++++# key_states = self.k_proj(hidden_states) +-++++# value_states = self.v_proj(hidden_states) +-++++ +-++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++# # query: [B, S, H*D] -> [B, N1, S, D] +-++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++# # 3. RoPE 旋转位置编码 +-++++# kv_seq_len = key_states.shape[-2] +-++++# if past_key_value is not None: +-++++# if self.layer_idx is None: +-++++# raise ValueError( +-++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++# "with a layer index." +-++++# ) +-++++# # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++# if cache_position.shape[0] == 1: +-++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++# kv_seq_len = past_seen_tokens + 1 +-++++# else: +-++++# # prefill 阶段:cache_position 是范围,使用其长度 +-++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++# else: +-++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++# # 4. KV 缓存更新 +-++++# if past_key_value is not None: +-++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++# key_states, value_states = past_key_value.update( +-++++# key_states, value_states, self.layer_idx, cache_kwargs +-++++# ) +-++++ +-++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++# if cache_position.shape[0] == 1: +-++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++# kv_seq_len = key_states.shape[-2] +-++++ +-++++# # 5. [重要] 准备 Attention Mask +-++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++# fa_attention_mask = None +-++++# if attention_mask is not None: +-++++# # 截取与当前key长度匹配的部分 +-++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++# # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++# fa_attention_mask = (mask_slice != 0) +-++++ +-++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++# input_dtype = query_states.dtype +-++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++# query_states = query_states.to(mindspore.float16) +-++++# key_states = key_states.to(mindspore.float16) +-++++# value_states = value_states.to(mindspore.float16) +-++++ +-++++# # 6. [核心] 调用 flash_attention_score 算子 +-++++# # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++# attn_output = mindspore.ops.flash_attention_score( +-++++# query=query_states, +-++++# key=key_states, +-++++# value=value_states, +-++++# head_num=self.num_heads, # 传入Q的头数(N1) +-++++# attn_mask=fa_attention_mask, +-++++# keep_prob=1.0 - self.attention_dropout, +-++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++++# input_layout="BNSD", +-++++# sparse_mode=0 # 使用 defaultMask 模式 +-++++# ) +-++++ +-++++# # 恢复原始数据类型 +-++++# attn_output = attn_output.to(input_dtype) +-++++ +-++++# # 7. 调整输出形状 +-++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++# attn_output = self.o_proj(attn_output) +-++++ +-++++# # FlashAttention 算子不直接返回注意力权重矩阵 +-++++# attn_weights = None +-++++# if output_attentions: +-++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++# return attn_output, attn_weights, past_key_value +-++++ +-++++# # def forward( +-++++# # self, +-++++# # hidden_states: mindspore.Tensor, +-++++# # attention_mask: Optional[mindspore.Tensor] = None, +-++++# # position_ids: Optional[mindspore.Tensor] = None, +-++++# # past_key_value: Optional[Cache] = None, +-++++# # output_attentions: bool = False, +-++++# # use_cache: bool = False, +-++++# # cache_position: Optional[mindspore.Tensor] = None, +-++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++# # bsz, q_len, _ = hidden_states.shape +-++++ +-++++# # # 1. 线性投射 Q, K, V +-++++# # query_states = self.q_proj(hidden_states) +-++++# # key_states = self.k_proj(hidden_states) +-++++# # value_states = self.v_proj(hidden_states) +-++++ +-++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-++++# # # 3. RoPE 旋转位置编码 +-++++# # kv_seq_len = key_states.shape[-2] +-++++# # if past_key_value is not None: +-++++# # if self.layer_idx is None: +-++++# # raise ValueError( +-++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++# # "with a layer index." +-++++# # ) +-++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++# # # 4. KV 缓存更新 +-++++# # if past_key_value is not None: +-++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++# # key_states, value_states = past_key_value.update( +-++++# # key_states, value_states, self.layer_idx, cache_kwargs +-++++# # ) +-++++ +-++++# # # 5. 准备 Attention Mask +-++++# # fa_attention_mask = None +-++++# # if attention_mask is not None: +-++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++# # fa_attention_mask = (mask_slice != 0) +-++++ +-++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++# # input_dtype = query_states.dtype +-++++ +-++++# # # 6. [核心] 调用 flash_attention_score 算子 +-++++# # attn_output = mindspore.ops.flash_attention_score( +-++++# # query=query_states, +-++++# # key=key_states, +-++++# # value=value_states, +-++++# # head_num=self.num_heads, +-++++# # attn_mask=fa_attention_mask, +-++++# # keep_prob=1.0 - self.attention_dropout, +-++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++# # input_layout="BNSD", +-++++# # sparse_mode=0, +-++++# # # <--- 修改点 2: 启用内部高精度计算 --- +-++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++# # inner_precise=1 +-++++# # ) +-++++ +-++++# # # 恢复原始数据类型 +-++++# # attn_output = attn_output.to(input_dtype) +-++++ +-++++# # # 7. 调整输出形状 +-++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++# # attn_output = self.o_proj(attn_output) +-++++ +-++++# # attn_weights = None +-++++# # if output_attentions: +-++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ +-++++# # return attn_output, attn_weights, past_key_value +-++++ +-++++ +-+++ class Qwen2MoeFlashAttention(nn.Module): +-+++ """ +-+++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++- +-+++- 关键改动: +-+++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++- 直接传入原始的 key 和 value 张量效率更高。 +-+++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-++++ +-++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-++++ 完全使用模型的低精度数据类型(如 float16)进行计算, +-++++ 以达到理论上的最高执行速度。 +-+++ """ +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++ super().__init__() +-+++ self.config = config +-+++ self.layer_idx = layer_idx +-++++ if layer_idx is None: +-++++ logger.warning_once( +-++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-++++ ) +-++++ +-+++ self.hidden_size = config.hidden_size +-+++ self.num_heads = config.num_attention_heads +-+++ self.head_dim = self.hidden_size // self.num_heads +-+++ self.num_key_value_heads = config.num_key_value_heads +-+++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++ self.max_position_embeddings = config.max_position_embeddings +-+++ self.rope_theta = config.rope_theta +-+++ self.attention_dropout = config.attention_dropout +-+++ +-+++- if (self.head_dim * self.num_heads) != self.hidden_size: +-+++- raise ValueError( +-+++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++- ) +-+++- +-+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++- # query: [B, S, H*D] -> [B, N1, S, D] +-+++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++ # 2. 调整形状以匹配 BNSD 布局 +-+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- +-+++- # 3. RoPE 旋转位置编码 +-++++ +-++++ # 3. RoPE 和 KV 缓存 +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++- if self.layer_idx is None: +-+++- raise ValueError( +-+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++- "with a layer index." +-+++- ) +-+++- # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++- if cache_position.shape[0] == 1: +-+++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++- kv_seq_len = past_seen_tokens + 1 +-+++- else: +-+++- # prefill 阶段:cache_position 是范围,使用其长度 +-+++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++- else: +-+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++- +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++- # 4. KV 缓存更新 +-+++ if past_key_value is not None: +-+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++- key_states, value_states = past_key_value.update( +-+++- key_states, value_states, self.layer_idx, cache_kwargs +-+++- ) +-+++- +-+++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++- if cache_position.shape[0] == 1: +-+++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++- kv_seq_len = key_states.shape[-2] +-+++- +-+++- # 5. [重要] 准备 Attention Mask +-+++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++ # 4. 准备 Attention Mask +-+++ fa_attention_mask = None +-+++ if attention_mask is not None: +-+++- # 截取与当前key长度匹配的部分 +-+++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++- # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++ fa_attention_mask = (mask_slice != 0) +-+++ +-+++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++- input_dtype = query_states.dtype +-+++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++- query_states = query_states.to(mindspore.float16) +-+++- key_states = key_states.to(mindspore.float16) +-+++- value_states = value_states.to(mindspore.float16) +-+++- +-+++- # 6. [核心] 调用 flash_attention_score 算子 +-+++- # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +-+++ attn_output = mindspore.ops.flash_attention_score( +-+++ query=query_states, +-+++ key=key_states, +-+++ value=value_states, +-+++- head_num=self.num_heads, # 传入Q的头数(N1) +-++++ head_num=self.num_heads, +-+++ attn_mask=fa_attention_mask, +-+++- keep_prob=1.0 - self.attention_dropout, +-++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +-+++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++ input_layout="BNSD", +-+++- sparse_mode=0 # 使用 defaultMask 模式 +-++++ sparse_mode=0, +-++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +-+++ ) +-+++ +-+++- # 恢复原始数据类型 +-+++- attn_output = attn_output.to(input_dtype) +-+++- +-+++- # 7. 调整输出形状 +-+++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++ # 6. 调整输出形状 +-+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++ attn_output = self.o_proj(attn_output) +-+++ +-+++- # FlashAttention 算子不直接返回注意力权重矩阵 +-++++ # 7. 返回结果 +-+++ attn_weights = None +-+++ if output_attentions: +-+++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++- # def forward( +-+++- # self, +-+++- # hidden_states: mindspore.Tensor, +-+++- # attention_mask: Optional[mindspore.Tensor] = None, +-+++- # position_ids: Optional[mindspore.Tensor] = None, +-+++- # past_key_value: Optional[Cache] = None, +-+++- # output_attentions: bool = False, +-+++- # use_cache: bool = False, +-+++- # cache_position: Optional[mindspore.Tensor] = None, +-+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++- +-+++- # bsz, q_len, _ = hidden_states.shape +-+++- +-+++- # # 1. 线性投射 Q, K, V +-+++- # query_states = self.q_proj(hidden_states) +-+++- # key_states = self.k_proj(hidden_states) +-+++- # value_states = self.v_proj(hidden_states) +-+++- +-+++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- +-+++- # # 3. RoPE 旋转位置编码 +-+++- # kv_seq_len = key_states.shape[-2] +-+++- # if past_key_value is not None: +-+++- # if self.layer_idx is None: +-+++- # raise ValueError( +-+++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++- # "with a layer index." +-+++- # ) +-+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++ +-+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++- +-+++- # # 4. KV 缓存更新 +-+++- # if past_key_value is not None: +-+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++- # key_states, value_states = past_key_value.update( +-+++- # key_states, value_states, self.layer_idx, cache_kwargs +-+++- # ) +-+++- +-+++- # # 5. 准备 Attention Mask +-+++- # fa_attention_mask = None +-+++- # if attention_mask is not None: +-+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++- # fa_attention_mask = (mask_slice != 0) +-+++- +-+++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++- # input_dtype = query_states.dtype +-+++- +-+++- # # 6. [核心] 调用 flash_attention_score 算子 +-+++- # attn_output = mindspore.ops.flash_attention_score( +-+++- # query=query_states, +-+++- # key=key_states, +-+++- # value=value_states, +-+++- # head_num=self.num_heads, +-+++- # attn_mask=fa_attention_mask, +-+++- # keep_prob=1.0 - self.attention_dropout, +-+++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++- # input_layout="BNSD", +-+++- # sparse_mode=0, +-+++- # # <--- 修改点 2: 启用内部高精度计算 --- +-+++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++- # inner_precise=1 +-+++- # ) +-+++- +-+++- # # 恢复原始数据类型 +-+++- # attn_output = attn_output.to(input_dtype) +-++++QWEN2MOE_ATTENTION_CLASSES = { +-++++ "eager": Qwen2MoeAttention, +-++++ "flash-attention": Qwen2MoeFlashAttention, +-++++} +-+++ +-+++- # # 7. 调整输出形状 +-+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++- # attn_output = self.o_proj(attn_output) +-+++ +-+++- # attn_weights = None +-+++- # if output_attentions: +-+++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# def __init__(self, config): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# # gating +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++ +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# #@dwj +-++++# # 只遍历激活的专家,而非全部专家 +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# num_tokens = hidden_states_reshaped.shape[0] +-++++ +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++# flat_selected_experts = selected_experts.flatten() +-++++ +-++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++# token_indices = broadcasted_token_indices.flatten() +-++++ +-++++# active_experts = ops.unique(flat_selected_experts) +-++++ +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++ +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# selected_token_indices = token_indices[mask] +-++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++ +-++++# current_states = hidden_states_reshaped[selected_token_indices] +-++++ +-++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++ +-++++# final_hidden_states = final_hidden_states.index_add( +-++++# dim=0, +-++++# index=selected_token_indices, +-++++# source=expert_output.to(hidden_states.dtype) +-++++# ) +-++++ +-++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++ +-+++- # return attn_output, attn_weights, past_key_value +-++++# final_hidden_states = final_hidden_states + shared_expert_output +-++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-++++ +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# """ +-++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-++++# `_moe_infer_prefill` (用于长序列处理) 方法。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# # 门控网络 +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# # 专家列表 +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++# # 共享专家 +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_decode( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# """ +-++++# 【解码路径】针对 sequence_length=1 的极致优化。 +-++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-++++# """ +-++++# batch_size, hidden_dim = hidden_states.shape +-++++ +-++++# expert_outputs_list = [ +-++++# ops.cat([ +-++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++# ], dim=0) +-++++# for i in range(batch_size) +-++++# ] +-++++ +-++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-++++# # shape: (batch_size, top_k, hidden_dim) +-++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++ +-++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++ +-++++# return moe_output.squeeze(1) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_prefill( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# """ +-++++# 【预填充路径】针对 sequence_length > 1 的优化。 +-++++# 按专家对 Token 进行分组,并进行批处理。 +-++++# """ +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens = hidden_states.shape[0] +-++++# flat_selected_experts = selected_experts.flatten() +-++++ +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++ +-++++# active_experts = ops.unique(flat_selected_experts) +-++++ +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++ +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# selected_token_indices = token_indices[mask] +-++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++ +-++++# current_states = hidden_states[selected_token_indices] +-++++ +-++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++ +-++++# moe_output = moe_output.index_add( +-++++# dim=0, +-++++# index=selected_token_indices, +-++++# source=expert_output.to(hidden_states.dtype) +-++++# ) +-++++# return moe_output +-++++ +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# """ +-++++# 顶层 forward 方法,作为智能分发器。 +-++++# """ +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++- # def forward( +-+++- # self, +-+++- # hidden_states: mindspore.Tensor, +-+++- # attention_mask: Optional[mindspore.Tensor] = None, +-+++- # position_ids: Optional[mindspore.Tensor] = None, +-+++- # past_key_value: Optional[Cache] = None, +-+++- # output_attentions: bool = False, +-+++- # use_cache: bool = False, +-+++- # cache_position: Optional[mindspore.Tensor] = None, +-+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++- +-+++- # bsz, q_len, _ = hidden_states.shape +-+++- +-+++- # query_states = self.q_proj(hidden_states) +-+++- # key_states = self.k_proj(hidden_states) +-+++- # value_states = self.v_proj(hidden_states) +-+++- +-+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- +-+++- # kv_seq_len = key_states.shape[-2] +-+++- # if past_key_value is not None: +-+++- # if self.layer_idx is None: +-+++- # raise ValueError("`layer_idx` must be specified for caching") +-+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++- +-+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++- +-+++- # if past_key_value is not None: +-+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++- # key_states, value_states = past_key_value.update( +-+++- # key_states, value_states, self.layer_idx, cache_kwargs +-+++- # ) +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++# moe_output = None +-++++# # 在推理时,根据序列长度选择最优路径 +-++++# if not self.training: +-++++# if sequence_length == 1: +-++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++# else: +-++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++# else: +-++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-++++# raise NotImplementedError("Training path is not implemented.") +-++++ +-++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-++++ +-++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-++++ +-++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-++++ +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# """ +-++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# # 门控网络 +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# # 专家列表 +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++# # 共享专家 +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_decode( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# batch_size, _ = hidden_states.shape +-++++# expert_outputs_list = [ +-++++# ops.cat([ +-++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++# ], dim=0) +-++++# for i in range(batch_size) +-++++# ] +-++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++# return moe_output.squeeze(1) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_prefill( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens = hidden_states.shape[0] +-++++# flat_selected_experts = selected_experts.flatten() +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++# active_experts = ops.unique(flat_selected_experts) +-++++ +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# selected_token_indices = token_indices[mask] +-++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++# current_states = hidden_states[selected_token_indices] +-++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++# moe_output = moe_output.index_add( +-++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++# ) +-++++# return moe_output +-++++ +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# """ +-++++# 顶层 forward 方法,作为智能分发器。 +-++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-++++# """ +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ +-++++# # 1. 门控计算 (通用逻辑) +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++# # 2. 智能分发到最优 MoE 路径 +-++++# moe_output = None +-++++# if not self.training: +-++++# if sequence_length == 1: +-++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++# else: +-++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++# else: +-++++# raise NotImplementedError("Training path is not implemented.") +-++++ +-++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++ +-++++# # 4. 合并 MoE 输出和共享专家输出 +-++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++ +-++++# # 5. 恢复原始形状并返回 +-++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-++++# prefill fastest +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# """ +-++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# # 门控网络 +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# # 专家列表 +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++# # 共享专家 +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_dispatch( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# """ +-++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-++++# """ +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens, _ = hidden_states.shape +-++++ +-++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-++++# flat_selected_experts = selected_experts.flatten() +-++++# flat_routing_weights = routing_weights.flatten() +-+++ +-+++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++- +-+++- # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++- # query_states = query_states / math.sqrt(self.head_dim) +-+++- # # <--- 修改结束 --- +-+++- +-+++- # fa_attention_mask = None +-+++- # if attention_mask is not None: +-+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++- # fa_attention_mask = (mask_slice != 0) +-+++- +-+++- # input_dtype = query_states.dtype +-+++- +-+++- # attn_output = mindspore.ops.flash_attention_score( +-+++- # query=query_states, # 传入已经预先缩放过的 query +-+++- # key=key_states, +-+++- # value=value_states, +-+++- # head_num=self.num_heads, +-+++- # attn_mask=fa_attention_mask, +-+++- # keep_prob=1.0 - self.attention_dropout, +-+++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++- # input_layout="BNSD", +-+++- # sparse_mode=0, +-+++- # inner_precise=1 # 仍然保持内部高精度计算 +-+++- # ) +-++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++ +-+++- # attn_output = attn_output.to(input_dtype) +-+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++- # attn_output = self.o_proj(attn_output) +-++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-++++# active_experts = ops.unique(flat_selected_experts) +-++++ +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++ +-++++# # 找到所有分配给该专家的 token +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++ +-++++# # 使用 mask 选取对应的 token 和权重 +-++++# current_token_indices = token_indices[mask] +-++++# current_routing_weights = flat_routing_weights[mask] +-++++# current_hidden_states = hidden_states[current_token_indices] +-++++ +-++++# # 对这些 token 进行批处理 +-++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++ +-++++# # 使用 index_add 将结果精确地加回到对应位置 +-++++# moe_output = moe_output.index_add( +-++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-++++# ) +-++++# return moe_output +-++++ +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# """ +-++++# 顶层 forward 方法,作为智能分发器。 +-++++# """ +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ +-++++# # 1. 门控计算 +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++ +-++++# # 2. 调用统一的 MoE 计算内核 +-++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+++ +-+++- # attn_weights = None +-+++- # if output_attentions: +-+++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++# # 3. 统一处理共享专家 +-++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++ +-++++# # 4. 合并输出 +-++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++ +-++++# # 5. 恢复原始形状并返回 +-++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-++++ +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# """ +-++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++# 【最终高性能与高精度版】: +-++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-++++# 3. 这样实现了速度和准确性的两全其美。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_decode( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# """ +-++++# 【解码路径】极致优化版:bmm + 高精度累加。 +-++++# """ +-++++# original_dtype = hidden_states.dtype +-++++# batch_size, _ = hidden_states.shape +-++++ +-++++# expert_outputs_list = [ +-++++# ops.cat([ +-++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++# ], dim=0) +-++++# for i in range(batch_size) +-++++# ] +-++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++ +-++++# # 在 float32 下执行 bmm,得到高精度结果 +-++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++ +-++++# # 将高精度结果转换回原始数据类型 +-++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-++++ +-++++# return moe_output +-++++ +-++++# @no_grad() +-++++# def _moe_infer_prefill( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# selected_experts: mindspore.Tensor, +-++++# routing_weights: mindspore.Tensor +-++++# ) -> mindspore.Tensor: +-++++# """ +-++++# 【预填充路径】与原始实现一致,结果精确。 +-++++# """ +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens, _ = hidden_states.shape +-++++# flat_selected_experts = selected_experts.flatten() +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++# active_experts = ops.unique(flat_selected_experts) +-++++ +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# selected_token_indices = token_indices[mask] +-++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++# current_states = hidden_states[selected_token_indices] +-++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++# moe_output = moe_output.index_add( +-++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++# ) +-++++# return moe_output +-++++ +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++ +-+++- # return attn_output, attn_weights, past_key_value +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-++++# # 如果模型主体是 float16,后续再转换 +-++++ +-++++# moe_output = None +-++++# if not self.training: +-++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-++++# # _moe_infer_decode 内部会处理好类型转换 +-++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-++++# if sequence_length == 1: +-++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++# else: +-++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++# else: +-++++# raise NotImplementedError("Training path is not implemented.") +-++++ +-++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++ +-++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-+++ +-+++-QWEN2MOE_ATTENTION_CLASSES = { +-+++- "eager": Qwen2MoeAttention, +-+++- "flash-attention": Qwen2MoeFlashAttention, +-+++-} +-++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++# """ +-++++# 【融合版】一个混合专家模块,内置两种推理策略, +-++++# 由外部全局变量 `Long_Prompt` 控制: +-++++ +-++++# - if Long_Prompt is True: 【精度优先模式】 +-++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-++++# 适用于处理长序列,避免误差累积。 +-++++ +-++++# - if Long_Prompt is False: 【速度优先模式】 +-++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-++++# 在解码阶段获得极致速度,同时保证结果高度准确。 +-++++# """ +-++++# def __init__(self, config: Qwen2MoeConfig): +-++++# super().__init__() +-++++# self.num_experts = config.num_experts +-++++# self.top_k = config.num_experts_per_tok +-++++# self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++# self.experts = nn.ModuleList( +-++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++# ) +-++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++# # --- 速度优先模式的辅助函数 --- +-++++# @no_grad() +-++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++# original_dtype = hidden_states.dtype +-++++# batch_size, _ = hidden_states.shape +-++++# expert_outputs_list = [ +-++++# ops.cat([ +-++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++# ], dim=0) +-++++# for i in range(batch_size) +-++++# ] +-++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++# weights_fp32 = routing_weights.to(mindspore.float32) +-++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++# return moe_output_fp32.squeeze(1).to(original_dtype) +-++++ +-++++# @no_grad() +-++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens, _ = hidden_states.shape +-++++# flat_selected_experts = selected_experts.flatten() +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++# active_experts = ops.unique(flat_selected_experts) +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# selected_token_indices = token_indices[mask] +-++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++# current_states = hidden_states[selected_token_indices] +-++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++# return moe_output +-++++ +-++++# # --- 精度优先模式的辅助函数 --- +-++++# @no_grad() +-++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++# moe_output = ops.zeros_like(hidden_states) +-++++# num_tokens, _ = hidden_states.shape +-++++# flat_selected_experts = selected_experts.flatten() +-++++# flat_routing_weights = routing_weights.flatten() +-++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++# active_experts = ops.unique(flat_selected_experts) +-++++# for expert_idx_tensor in active_experts: +-++++# expert_idx = expert_idx_tensor.item() +-++++# expert_layer = self.experts[expert_idx] +-++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++# current_token_indices = token_indices[mask] +-++++# current_routing_weights = flat_routing_weights[mask] +-++++# current_hidden_states = hidden_states[current_token_indices] +-++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++# return moe_output +-++++ +-++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++# # 声明我们将要使用一个在模块外部定义的全局变量 +-++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-++++# global Long_Prompt +-++++ +-++++# # 1. 门控计算 (所有模式通用) +-++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++# router_logits = self.gate(hidden_states_reshaped) +-++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++++# if self.norm_topk_prob: +-++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++# moe_output = None +-++++# if not self.training: +-++++# # 根据 Long_Prompt 标志选择模式 +-++++# if Long_Prompt: +-++++# # --- 精度优先模式 --- +-++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++# else: +-++++# # --- 速度优先模式 --- +-++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++# if sequence_length == 1: +-++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++# else: +-++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++# else: +-++++# raise NotImplementedError("Training path is not implemented.") +-++++ +-++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++ +-++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++# return final_hidden_states, router_logits +-++++ +-++++class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ """ +-++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-++++ 控制的顶级推理策略: +-+++ +-++++ - if Long_Prompt is True: 【精度优先模式】 +-++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-++++ 适用于需要严格可复现性的长序列任务。 +-+++ +-+++-class Qwen2MoeSparseMoeBlock(nn.Module): +-+++- def __init__(self, config): +-++++ - if Long_Prompt is False: 【速度优先模式】 +-++++ 采用业界最强的性能组合: +-++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-++++ """ +-++++ def __init__(self, config: Qwen2MoeConfig): +-+++ super().__init__() +-+++ self.num_experts = config.num_experts +-+++ self.top_k = config.num_experts_per_tok +-+++ self.norm_topk_prob = config.norm_topk_prob +-+++ +-+++- # gating +-+++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++ self.experts = nn.ModuleList( +-+++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++ ) +-+++- +-+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++ +-+++- #@dwj +-+++- # 只遍历激活的专家,而非全部专家 +-+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++- num_tokens = hidden_states_reshaped.shape[0] +-+++- +-+++- router_logits = self.gate(hidden_states_reshaped) +-+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- +-+++- if self.norm_topk_prob: +-+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++- flat_selected_experts = selected_experts.flatten() +-+++- +-+++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++- token_indices = broadcasted_token_indices.flatten() +-+++- +-+++- active_experts = ops.unique(flat_selected_experts) +-+++- +-+++- for expert_idx_tensor in active_experts: +-+++- expert_idx = expert_idx_tensor.item() +-+++- expert_layer = self.experts[expert_idx] +-+++- +-+++- mask = (flat_selected_experts == expert_idx_tensor) +-+++- selected_token_indices = token_indices[mask] +-+++- selected_routing_weights = routing_weights.flatten()[mask] +-+++- +-+++- current_states = hidden_states_reshaped[selected_token_indices] +-+++- +-+++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++- +-+++- final_hidden_states = final_hidden_states.index_add( +-+++- dim=0, +-+++- index=selected_token_indices, +-+++- source=expert_output.to(hidden_states.dtype) +-+++- ) +-+++- +-+++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-++++ @no_grad() +-++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++ original_dtype = hidden_states.dtype +-++++ batch_size, _ = hidden_states.shape +-++++ expert_outputs_list = [ +-++++ ops.cat([ +-++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++ ], dim=0) +-++++ for i in range(batch_size) +-++++ ] +-++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++ weights_fp32 = routing_weights.to(mindspore.float32) +-++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++ return moe_output_fp32.squeeze(1).to(original_dtype) +-++++ +-++++ @no_grad() +-++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++ num_tokens, _ = hidden_states.shape +-++++ flat_selected_experts = selected_experts.flatten() +-++++ sorted_expert_indices = flat_selected_experts.argsort() +-++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++++ original_token_indices = sorted_expert_indices // self.top_k +-++++ moe_output = ops.zeros_like(hidden_states) +-++++ current_token_offset = 0 +-++++ for i in range(self.num_experts): +-++++ expert_token_count = tokens_per_expert[i] - current_token_offset +-++++ if expert_token_count == 0: +-++++ continue +-++++ end_offset = current_token_offset + expert_token_count +-++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++++ expert_hidden_states = hidden_states[expert_original_token_indices] +-++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++ current_token_offset += expert_token_count +-++++ return moe_output +-++++ +-++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-++++ @no_grad() +-++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++ moe_output = ops.zeros_like(hidden_states) +-++++ num_tokens, _ = hidden_states.shape +-++++ flat_selected_experts = selected_experts.flatten() +-++++ flat_routing_weights = routing_weights.flatten() +-++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++ active_experts = ops.unique(flat_selected_experts) +-++++ for expert_idx_tensor in active_experts: +-++++ expert_idx = expert_idx_tensor.item() +-++++ expert_layer = self.experts[expert_idx] +-++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++ current_token_indices = token_indices[mask] +-++++ current_routing_weights = flat_routing_weights[mask] +-++++ current_hidden_states = hidden_states[current_token_indices] +-++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++ return moe_output +-+++ +-+++- final_hidden_states = final_hidden_states + shared_expert_output +-+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++- +-+++- return final_hidden_states, router_logits +-++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++ global Long_Prompt +-++++ +-++++ # 1. 门控计算 (所有模式通用) +-++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++ router_logits = self.gate(hidden_states_reshaped) +-++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++++ if self.norm_topk_prob: +-++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++ moe_output = None +-++++ if Long_Prompt: +-++++ # --- 精度优先模式 (ACCURACY MODE) --- +-++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ else: +-++++ # --- 速度优先模式 (SPEED MODE) --- +-++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++ if sequence_length == 1: +-++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ else: +-++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ +-+++ +-++++ # 3. 共享专家计算与合并 (所有模式通用) +-++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++ +-++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++ +-++++ return final_hidden_states, router_logits +-+++ +-+++ class Qwen2MoeDecoderLayer(nn.Module): +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+++ super().__init__() +-+++ self.hidden_size = config.hidden_size +-++++ +-++++ # if Long_Prompt: +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ # else: +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++ +-+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ +-+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++- +-+++ if (layer_idx not in config.mlp_only_layers) and ( +-+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++ ): +-+++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ self._warmed_up = True +-+++ self.warmup_moe_model() +-+++ +-++++ +-++++ +-+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++ output_router_logits = ( +-+++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +-+++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ router_logits=outputs.router_logits, +-+++ ) +-+++ +-++++ def generate(self, *args, **kwargs): +-++++ """ +-++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-++++ """ +-++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++++ +-++++ input_ids = kwargs.get("input_ids") +-++++ if input_ids is None and args: +-++++ input_ids = args[0] +-++++ +-++++ if input_ids is not None: +-++++ prompt_length = input_ids.shape[1] +-++++ +-++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-++++ Long_Prompt = True +-++++ else: +-++++ Long_Prompt = False +-++++ +-++++ return super().generate(*args, **kwargs) +-++++ +-+++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +-+++ def prepare_inputs_for_generation( +-+++ self, +-+++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +-+++ # Exception 1: when passing input_embeds, input_ids may be missing entries +-+++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-++++ +-+++ if past_key_values is not None: +-+++ if inputs_embeds is not None: # Exception 1 +-+++ if 0 not in input_ids.shape: +-+++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ } +-+++ ) +-+++ return model_inputs +-++++ +-+++ # @lwx +-+++ # def _decode_one_tokens_logits( +-+++ # self, +-+++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +-+++ attentions=outputs.attentions, +-+++ ) +-+++ +-++++ +-+++ __all__ = [ +-+++ "Qwen2MoeForCausalLM", +-+++ "Qwen2MoeModel", +-+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++new file mode 100644 +-+++index 00000000..6dfb5b93 +-+++--- /dev/null +-++++++ b/patches/0001-20251104commit.patch +-+++@@ -0,0 +1,1272 @@ +-++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++++From: Pinoeer-kingxi <13022943007@163.com> +-++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++++Subject: [PATCH] 20251104commit +-++++ +-++++--- +-++++ mindnlp/transformers/cache_utils.py | 28 +- +-++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++++ 3 files changed, 976 insertions(+), 87 deletions(-) +-++++ +-++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++++index cadd2e04..02f8d4be 100644 +-++++--- a/mindnlp/transformers/cache_utils.py +-+++++++ b/mindnlp/transformers/cache_utils.py +-++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++++ # k_out[:, :, cache_position] = key_states +-++++ # v_out[:, :, cache_position] = value_states +-++++- if ON_ORANGE_PI: +-++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++- else: +-++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++- +-+++++ # if ON_ORANGE_PI: +-+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++ # else: +-+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++++ if cache_position.ndim > 1: +-+++++ cache_position = cache_position.flatten() +-+++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++++ cache_position = cache_position.int() +-+++++ +-+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++++ k_out[:, :, cache_position] = key_states +-+++++ v_out[:, :, cache_position] = value_states +-+++++ +-++++ return k_out, v_out +-++++ +-++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++index c695b944..d8303e45 100644 +-++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++++ def rotate_half(x): +-++++ """Rotates half the hidden dims of the input.""" +-++++- x1 = x[..., : x.shape[-1] // 2] +-++++- x2 = x[..., x.shape[-1] // 2 :] +-+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++ # x1 = x[..., : x.shape[-1] // 2] +-+++++ # x2 = x[..., x.shape[-1] // 2 :] +-+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++ return ops.cat((-x2, x1), dim=-1) +-++++ +-++++ +-++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++++ if self.training: +-++++ raise NotImplementedError("Training is not supported yet.") +-++++ else: +-++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++- if self.config.n_shared_experts is not None: +-++++- y = y + self.shared_experts(identity) +-++++- return y +-+++++ # @lwx +-+++++ if orig_shape[1] == 1: +-+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++++ y=y.view(*orig_shape) +-+++++ if self.config.n_shared_experts is not None: +-+++++ y = y + self.shared_experts(identity) +-+++++ return y +-+++++ else: +-+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++++ if self.config.n_shared_experts is not None: +-+++++ y = y + self.shared_experts(identity) +-+++++ return y +-+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++ # if self.config.n_shared_experts is not None: +-+++++ # y = y + self.shared_experts(identity) +-+++++ # return y +-+++++ +-+++++ @no_grad() +-+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++ +-+++++ expert_cache = ops.zeros_like(x) +-+++++ for i in range(self.num_experts_per_tok): +-+++++ expert_id = flat_expert_indices[i].item() +-+++++ weight = flat_expert_weights[i].item() +-+++++ expert = self.experts[expert_id] +-+++++ expert_out = expert(x) +-+++++ expert_cache += expert_out * weight +-+++++ return expert_cache +-++++ +-++++ @no_grad() +-++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++- # expert_cache = torch.zeros_like(x) +-++++- # idxs = flat_expert_indices.argsort() +-++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++- # token_idxs = idxs // self.num_experts_per_tok +-++++- # for i, end_idx in enumerate(tokens_per_expert): +-++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++- # if start_idx == end_idx: +-++++- # continue +-++++- # expert = self.experts[i] +-++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++++- # expert_tokens = x[exp_token_idx] +-++++- # expert_out = expert(expert_tokens) +-++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++- # return expert_cache +-+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++ expert_cache = ops.zeros_like(x) +-++++ idxs = flat_expert_indices.argsort() +-++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ token_idxs = idxs // self.num_experts_per_tok +-+++++ +-++++ for i, end_idx in enumerate(tokens_per_expert): +-++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ if start_idx == end_idx: +-++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++++ expert_out = expert(expert_tokens) +-++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-++++ return expert_cache +-+++++ +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # # expert_cache = torch.zeros_like(x) +-+++++ # # idxs = flat_expert_indices.argsort() +-+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++ # # token_idxs = idxs // self.num_experts_per_tok +-+++++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++ # # if start_idx == end_idx: +-+++++ # # continue +-+++++ # # expert = self.experts[i] +-+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # # expert_tokens = x[exp_token_idx] +-+++++ # # expert_out = expert(expert_tokens) +-+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++ # # return expert_cache +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # if start_idx == end_idx: +-+++++ # continue +-+++++ # expert = self.experts[i] +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = expert(expert_tokens) +-+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-+++++ # return expert_cache +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ +-+++++ # # 排序保证顺序一致 +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # # 找出有 token 的专家 +-+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++ +-+++++ # for i in active_experts.tolist(): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # end_idx = tokens_per_expert[i] +-+++++ # if start_idx == end_idx: # 没有 token +-+++++ # continue +-+++++ +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = self.experts[i](expert_tokens) +-+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++ +-+++++ # expert_cache = mindspore.mint.scatter_add( +-+++++ # expert_cache, +-+++++ # 0, +-+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++ # expert_out +-+++++ # ) +-+++++ +-+++++ # return expert_cache +-+++++ +-+++++ +-++++ +-++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++++ # """ +-++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-+++++ self.warm_up = False +-+++++ +-+++++ def warmup_moe_model_deep(self): +-+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++++ test_texts = [ +-+++++ "warmup short", +-+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++++ ] +-+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++ if tokenizer is None: +-+++++ from mindnlp.transformers import AutoTokenizer +-+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++ self._warmup_tokenizer = tokenizer +-+++++ +-+++++ for text in test_texts: +-+++++ inputs = tokenizer(text, return_tensors="ms") +-+++++ with mindspore._no_grad(): +-+++++ _ = self(**inputs, use_cache=False) +-+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++++ +-++++ def get_input_embeddings(self): +-++++ return self.model.embed_tokens +-++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++ ```""" +-+++++ if not self.warm_up: +-+++++ self.warm_up = True +-+++++ self.warmup_moe_model_deep() +-+++++ +-++++ output_attentions = ( +-++++ output_attentions +-++++ if output_attentions is not None +-++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++index 3cbf820e..d4c6b651 100644 +-++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++@@ -18,7 +18,6 @@ +-++++ # See the License for the specific language governing permissions and +-++++ # limitations under the License. +-++++ """MindSpore Qwen2MoE model.""" +-++++- +-++++ import math +-++++ from typing import List, Optional, Tuple, Union +-++++ +-++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++++ TokenClassifierOutput, +-++++ ) +-++++ from ...modeling_utils import PreTrainedModel +-+++++from ...generation import GenerationMixin +-++++ from ....utils import logging +-++++ from .configuration_qwen2_moe import Qwen2MoeConfig +-++++ +-++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++++ self.variance_epsilon = eps +-++++ +-++++ def forward(self, hidden_states): +-+++++ # @dwj +-+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++ # @lwx +-+++++ # if not self.training : +-+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++ input_dtype = hidden_states.dtype +-++++ hidden_states = hidden_states.to(mindspore.float32) +-++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++++@@ -234,6 +239,8 @@ def rotate_half(x): +-++++ """Rotates half the hidden dims of the input.""" +-++++ x1 = x[..., : x.shape[-1] // 2] +-++++ x2 = x[..., x.shape[-1] // 2 :] +-+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++ return ops.cat((-x2, x1), dim=-1) +-++++ +-++++ +-++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++++ self.config = config +-++++ self.hidden_size = config.hidden_size +-++++ self.intermediate_size = intermediate_size +-+++++ +-++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++++ self.act_fn = ACT2FN[config.hidden_act] +-++++ +-++++ def forward(self, x): +-++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++- +-++++ +-+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++ # @lwx +-+++++ # gate_up_output = self.gate_up_proj(x) +-+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++++ # return self.down_proj(swiglu_output) +-+++++ +-+++++ # def forward(self, x): +-+++++ # gate_proj_out = self.gate_proj(x) +-+++++ # up_proj_out = self.up_proj(x) +-+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++++ # return self.down_proj(swiglu_out) +-+++++ +-++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++++ """ +-++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++++ use_cache: bool = False, +-++++ cache_position: Optional[mindspore.Tensor] = None, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ +-+++++ +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ query_states = self.q_proj(hidden_states) +-++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ "with a layer index." +-++++ ) +-++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ if isinstance(past_key_value, StaticCache): +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ else: +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++ if isinstance(past_key_value, StaticCache): +-+++++ kv_seq_len = key_states.shape[-2] +-++++ +-++++ # repeat k/v heads if n_kv_heads < n_heads +-++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++- +-+++++ +-++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++ +-++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++++- raise ValueError( +-++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++++- f" {attn_weights.shape}" +-++++- ) +-++++- +-++++- if attention_mask is not None: # no matter the length, we just slice it +-++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++++ if attention_mask is not None: +-+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++ attn_weights = attn_weights + causal_mask +-++++ +-++++ # upcast attention to fp32 +-++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++ +-++++ attn_output = self.o_proj(attn_output) +-++++- +-+++++ # @lwx +-+++++ +-+++++ # max_seq_len = self.max_position_embeddings # 2048 +-+++++ +-+++++ # if attention_mask is not None: +-+++++ # # attention_mask: [B, 1, Sq, Sk] +-+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++ +-+++++ # # pad 到 [max_seq_len, max_seq_len] +-+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++ # global_attention_mask = padded_mask +-+++++ # else: +-+++++ # global_attention_mask = None +-+++++ +-+++++ +-+++++ # sparse_mode=3 +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # real_shift=None, +-+++++ # padding_mask=None, +-+++++ +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=global_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ # input_layout="BNSD", +-+++++ # pre_tokens=2147483647, +-+++++ # next_tokens=2147483647, +-+++++ # inner_precise=0, +-+++++ # drop_mask=None, +-+++++ # prefix=None, +-+++++ # actual_seq_qlen=None, +-+++++ # actual_seq_kvlen=None, +-+++++ # sparse_mode=sparse_mode, +-+++++ # ) +-++++ if not output_attentions: +-++++ attn_weights = None +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ +-+++++class Qwen2MoeFlashAttention(nn.Module): +-+++++ """ +-+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++ +-+++++ 关键改动: +-+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++ 直接传入原始的 key 和 value 张量效率更高。 +-+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++ """ +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++ super().__init__() +-+++++ self.config = config +-+++++ self.layer_idx = layer_idx +-+++++ self.hidden_size = config.hidden_size +-+++++ self.num_heads = config.num_attention_heads +-+++++ self.head_dim = self.hidden_size // self.num_heads +-+++++ self.num_key_value_heads = config.num_key_value_heads +-+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++ self.max_position_embeddings = config.max_position_embeddings +-+++++ self.rope_theta = config.rope_theta +-+++++ self.attention_dropout = config.attention_dropout +-+++++ +-+++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++ raise ValueError( +-+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++ ) +-+++++ +-+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++ +-+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++ self.head_dim, +-+++++ max_position_embeddings=self.max_position_embeddings, +-+++++ base=self.rope_theta, +-+++++ ) +-+++++ +-+++++ def forward( +-+++++ self, +-+++++ hidden_states: mindspore.Tensor, +-+++++ attention_mask: Optional[mindspore.Tensor] = None, +-+++++ position_ids: Optional[mindspore.Tensor] = None, +-+++++ past_key_value: Optional[Cache] = None, +-+++++ output_attentions: bool = False, +-+++++ use_cache: bool = False, +-+++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # 1. 线性投射 Q, K, V +-+++++ query_states = self.q_proj(hidden_states) +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++ # query: [B, S, H*D] -> [B, N1, S, D] +-+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # 3. RoPE 旋转位置编码 +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++ if self.layer_idx is None: +-+++++ raise ValueError( +-+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ "with a layer index." +-+++++ ) +-+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++ if cache_position.shape[0] == 1: +-+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++ kv_seq_len = past_seen_tokens + 1 +-+++++ else: +-+++++ # prefill 阶段:cache_position 是范围,使用其长度 +-+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++ else: +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # 4. KV 缓存更新 +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ key_states, value_states = past_key_value.update( +-+++++ key_states, value_states, self.layer_idx, cache_kwargs +-+++++ ) +-+++++ +-+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++ if cache_position.shape[0] == 1: +-+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ +-+++++ # 5. [重要] 准备 Attention Mask +-+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++ fa_attention_mask = None +-+++++ if attention_mask is not None: +-+++++ # 截取与当前key长度匹配的部分 +-+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++ fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++ input_dtype = query_states.dtype +-+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++ query_states = query_states.to(mindspore.float16) +-+++++ key_states = key_states.to(mindspore.float16) +-+++++ value_states = value_states.to(mindspore.float16) +-+++++ +-+++++ # 6. [核心] 调用 flash_attention_score 算子 +-+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++ attn_output = mindspore.ops.flash_attention_score( +-+++++ query=query_states, +-+++++ key=key_states, +-+++++ value=value_states, +-+++++ head_num=self.num_heads, # 传入Q的头数(N1) +-+++++ attn_mask=fa_attention_mask, +-+++++ keep_prob=1.0 - self.attention_dropout, +-+++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ input_layout="BNSD", +-+++++ sparse_mode=0 # 使用 defaultMask 模式 +-+++++ ) +-+++++ +-+++++ # 恢复原始数据类型 +-+++++ attn_output = attn_output.to(input_dtype) +-+++++ +-+++++ # 7. 调整输出形状 +-+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++ attn_weights = None +-+++++ if output_attentions: +-+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ # def forward( +-+++++ # self, +-+++++ # hidden_states: mindspore.Tensor, +-+++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++ # past_key_value: Optional[Cache] = None, +-+++++ # output_attentions: bool = False, +-+++++ # use_cache: bool = False, +-+++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ # bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # # 1. 线性投射 Q, K, V +-+++++ # query_states = self.q_proj(hidden_states) +-+++++ # key_states = self.k_proj(hidden_states) +-+++++ # value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # # 3. RoPE 旋转位置编码 +-+++++ # kv_seq_len = key_states.shape[-2] +-+++++ # if past_key_value is not None: +-+++++ # if self.layer_idx is None: +-+++++ # raise ValueError( +-+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ # "with a layer index." +-+++++ # ) +-+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # # 4. KV 缓存更新 +-+++++ # if past_key_value is not None: +-+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ # key_states, value_states = past_key_value.update( +-+++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++ # ) +-+++++ +-+++++ # # 5. 准备 Attention Mask +-+++++ # fa_attention_mask = None +-+++++ # if attention_mask is not None: +-+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++ # input_dtype = query_states.dtype +-+++++ +-+++++ # # 6. [核心] 调用 flash_attention_score 算子 +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=fa_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ # input_layout="BNSD", +-+++++ # sparse_mode=0, +-+++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++ # inner_precise=1 +-+++++ # ) +-+++++ +-+++++ # # 恢复原始数据类型 +-+++++ # attn_output = attn_output.to(input_dtype) +-+++++ +-+++++ # # 7. 调整输出形状 +-+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ # attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # attn_weights = None +-+++++ # if output_attentions: +-+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++ # return attn_output, attn_weights, past_key_value +-+++++ +-+++++ # def forward( +-+++++ # self, +-+++++ # hidden_states: mindspore.Tensor, +-+++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++ # past_key_value: Optional[Cache] = None, +-+++++ # output_attentions: bool = False, +-+++++ # use_cache: bool = False, +-+++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ # bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # query_states = self.q_proj(hidden_states) +-+++++ # key_states = self.k_proj(hidden_states) +-+++++ # value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # kv_seq_len = key_states.shape[-2] +-+++++ # if past_key_value is not None: +-+++++ # if self.layer_idx is None: +-+++++ # raise ValueError("`layer_idx` must be specified for caching") +-+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # if past_key_value is not None: +-+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ # key_states, value_states = past_key_value.update( +-+++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++ # ) +-+++++ +-+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++ +-+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++++ # query_states = query_states / math.sqrt(self.head_dim) +-+++++ # # <--- 修改结束 --- +-+++++ +-+++++ # fa_attention_mask = None +-+++++ # if attention_mask is not None: +-+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # input_dtype = query_states.dtype +-+++++ +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, # 传入已经预先缩放过的 query +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=fa_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++++ # input_layout="BNSD", +-+++++ # sparse_mode=0, +-+++++ # inner_precise=1 # 仍然保持内部高精度计算 +-+++++ # ) +-+++++ +-+++++ # attn_output = attn_output.to(input_dtype) +-+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ # attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # attn_weights = None +-+++++ # if output_attentions: +-+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++++ +-+++++ # return attn_output, attn_weights, past_key_value +-+++++ +-++++ QWEN2MOE_ATTENTION_CLASSES = { +-++++ "eager": Qwen2MoeAttention, +-+++++ "flash-attention": Qwen2MoeFlashAttention, +-++++ } +-++++ +-++++ +-++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-+++++ #@dwj +-+++++ # 只遍历激活的专家,而非全部专家 +-++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- hidden_states = hidden_states.view(-1, hidden_dim) +-++++- # router_logits: (batch * sequence_length, n_experts) +-++++- router_logits = self.gate(hidden_states) +-++++- +-++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- if self.norm_topk_prob: +-++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- # we cast back to the input dtype +-++++- routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++- final_hidden_states = ops.zeros( +-++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++++- ) +-++++- +-++++- # One hot encode the selected experts to create an expert mask +-++++- # this will be used to easily index which expert is going to be sollicitated +-++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++++- +-++++- # Loop over all available experts in the model and perform the computation on each expert +-++++- for expert_idx in range(self.num_experts): +-++++- expert_layer = self.experts[expert_idx] +-++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++++- +-++++- # Index the correct hidden states and compute the expert hidden state for +-++++- # the current expert. We need to make sure to multiply the output hidden +-++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++++- if 0 not in idx.shape: +-++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++++- +-++++- # However `index_add_` only support torch tensors for indexing so we'll use +-++++- # the `top_x` tensor here. +-++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++++- +-++++- shared_expert_output = self.shared_expert(hidden_states) +-++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++++- +-++++- final_hidden_states = final_hidden_states + shared_expert_output +-+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++ num_tokens = hidden_states_reshaped.shape[0] +-+++++ +-+++++ router_logits = self.gate(hidden_states_reshaped) +-+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++ if self.norm_topk_prob: +-+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++ flat_selected_experts = selected_experts.flatten() +-+++++ +-+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++ token_indices = broadcasted_token_indices.flatten() +-+++++ +-+++++ active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++ for expert_idx_tensor in active_experts: +-+++++ expert_idx = expert_idx_tensor.item() +-+++++ expert_layer = self.experts[expert_idx] +-+++++ +-+++++ mask = (flat_selected_experts == expert_idx_tensor) +-+++++ selected_token_indices = token_indices[mask] +-+++++ selected_routing_weights = routing_weights.flatten()[mask] +-+++++ +-+++++ current_states = hidden_states_reshaped[selected_token_indices] +-+++++ +-+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++ +-+++++ final_hidden_states = final_hidden_states.index_add( +-+++++ dim=0, +-+++++ index=selected_token_indices, +-+++++ source=expert_output.to(hidden_states.dtype) +-+++++ ) +-+++++ +-+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++ +-++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++- return final_hidden_states, router_logits +-+++++ final_hidden_states = final_hidden_states + shared_expert_output +-+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++ return final_hidden_states, router_logits +-++++ +-++++ +-++++ class Qwen2MoeDecoderLayer(nn.Module): +-++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++++ +-++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++ +-++++ if (layer_idx not in config.mlp_only_layers) and ( +-++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++++ ): +-++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++++ _skip_keys_device_placement = "past_key_values" +-++++ _supports_cache_class = True +-+++++#lwx +-+++++ # _supports_static_cache = True +-++++ +-++++ def _init_weights(self, module): +-++++ std = self.config.initializer_range +-++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++++ return causal_mask +-++++ +-++++ +-++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ _tied_weights_keys = ["lm_head.weight"] +-++++ +-++++ def __init__(self, config): +-++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ self.num_experts_per_tok = config.num_experts_per_tok +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-+++++ # @lwx +-+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++++ # self.generation_config.cache_implementation = "static" +-+++++ self._warmed_up = False +-+++++ +-+++++ def warmup_moe_model(self): +-+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++++ test_texts = [ +-+++++ "warmup short", +-+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++++ ] +-+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++ if tokenizer is None: +-+++++ from mindnlp.transformers import AutoTokenizer +-+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++ self._warmup_tokenizer = tokenizer +-+++++ +-+++++ for text in test_texts: +-+++++ inputs = tokenizer(text, return_tensors="ms") +-+++++ with mindspore._no_grad(): +-+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++++ +-++++ def get_input_embeddings(self): +-++++ return self.model.embed_tokens +-++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++ ```""" +-+++++ if not self._warmed_up: +-+++++ self._warmed_up = True +-+++++ self.warmup_moe_model() +-++++ +-++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++++ output_router_logits = ( +-++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ } +-++++ ) +-++++ return model_inputs +-+++++# @lwx +-+++++ # def _decode_one_tokens_logits( +-+++++ # self, +-+++++ # cur_token: mindspore.Tensor, +-+++++ # input_pos: Optional[mindspore.Tensor], +-+++++ # cache_position: mindspore.Tensor, +-+++++ # past_key_values: StaticCache, +-+++++ # ) -> mindspore.Tensor: +-+++++ # """ +-+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++++ +-+++++ # Args: +-+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++++ # input_pos: 输入位置信息,可选 +-+++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++++ +-+++++ # Returns: +-+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++++ # """ +-+++++ # # 调用JIT编译的版本 +-+++++ # return self.get_decode_one_tokens_logits( +-+++++ # cur_token=cur_token, +-+++++ # input_pos=input_pos, +-+++++ # cache_position=cache_position, +-+++++ # past_key_values=past_key_values, +-+++++ # ) +-+++++ +-+++++ # @mindspore.jit(jit_level='O1') +-+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++++ # """ +-+++++ # JIT编译的函数,用于高效的单token解码 +-+++++ # 使用JIT编译优化以支持静态shape和高效执行 +-+++++ +-+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++++ # """ +-+++++ # outputs = self.model.forward( +-+++++ # input_ids=cur_token, +-+++++ # position_ids=input_pos, +-+++++ # cache_position=cache_position, +-+++++ # past_key_values=past_key_values, +-+++++ # use_cache=True, +-+++++ # return_dict=False, +-+++++ # ) +-+++++ +-+++++ # hidden_states = outputs[0] +-+++++ # logits = self.lm_head.forward(hidden_states) +-+++++ # logits = logits.float() +-+++++ +-+++++ # return logits[:, -1, :] +-+++++ +-+++++ # def _sample( +-+++++ # self, +-+++++ # input_ids: mindspore.Tensor, +-+++++ # logits_processor, +-+++++ # stopping_criteria, +-+++++ # generation_config, +-+++++ # synced_devices: bool, +-+++++ # streamer=None, +-+++++ # logits_warper=None, +-+++++ # **model_kwargs, +-+++++ # ): +-+++++ # """ +-+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++++ # """ +-+++++ # from ...generation.logits_process import LogitsProcessorList +-+++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++++ # from mindnlp.core import nn, ops, no_grad +-+++++ # import numpy as np +-+++++ +-+++++ # # 检查是否使用 StaticCache +-+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++++ # # 否则,直接调用父类方法 +-+++++ # past_key_values = model_kwargs.get("past_key_values") +-+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++++ +-+++++ # if not isinstance(past_key_values, StaticCache): +-+++++ # # 不使用 StaticCache,直接调用父类方法 +-+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++++ # return super()._sample( +-+++++ # input_ids=input_ids, +-+++++ # logits_processor=logits_processor, +-+++++ # stopping_criteria=stopping_criteria, +-+++++ # generation_config=generation_config, +-+++++ # synced_devices=synced_devices, +-+++++ # streamer=streamer, +-+++++ # logits_warper=logits_warper, +-+++++ # **model_kwargs, +-+++++ # ) +-+++++ +-+++++ # # 使用 StaticCache,进入自定义循环 +-+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++++ # pad_token_id = generation_config._pad_token_tensor +-+++++ # output_attentions = generation_config.output_attentions +-+++++ # output_hidden_states = generation_config.output_hidden_states +-+++++ # output_scores = generation_config.output_scores +-+++++ # output_logits = generation_config.output_logits +-+++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++++ # max_length = generation_config.max_length +-+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++++ # do_sample = generation_config.do_sample +-+++++ +-+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++++ # raise ValueError( +-+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++++ # f"{logits_warper})." +-+++++ # ) +-+++++ +-+++++ # # init attention / hidden states / scores tuples +-+++++ # scores = () if (return_dict_in_generate and output_scores) else None +-+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++++ +-+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++++ # encoder_hidden_states = ( +-+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++++ # ) +-+++++ +-+++++ # # keep track of which sequences are already finished +-+++++ # batch_size, cur_len = input_ids.shape +-+++++ # this_peer_finished = False +-+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++++ +-+++++ # time_record = [] +-+++++ # from ....utils.testing_utils import parse_flag_from_env +-+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++++ +-+++++ # while self._has_unfinished_sequences( +-+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++++ # ): +-+++++ # if _record_time: +-+++++ # import time as time_module +-+++++ # infer_start = time_module.time() +-+++++ +-+++++ # # prepare model inputs +-+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++++ +-+++++ # # prepare variable output controls +-+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++++ +-+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++++ # cur_cache_position = model_inputs.get("cache_position") +-+++++ # cur_past_key_values = model_inputs.get("past_key_values") +-+++++ # cur_input_ids = model_inputs.get("input_ids") +-+++++ +-+++++ # if (isinstance(cur_past_key_values, StaticCache) and +-+++++ # cur_cache_position is not None and +-+++++ # len(cur_cache_position.shape) > 0 and +-+++++ # cur_cache_position.shape[0] == 1 and +-+++++ # cur_input_ids is not None and +-+++++ # cur_input_ids.shape[1] == 1): +-+++++ # # 使用 JIT 优化的单 token 解码 +-+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++++ # if not hasattr(self, '_jit_used'): +-+++++ # self._jit_used = False +-+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++++ +-+++++ # next_token_logits = self.get_decode_one_tokens_logits( +-+++++ # cur_token=cur_input_ids, +-+++++ # input_pos=model_inputs.get("position_ids"), +-+++++ # cache_position=cur_cache_position, +-+++++ # past_key_values=cur_past_key_values, +-+++++ # ) +-+++++ +-+++++ # # 标记已使用JIT(用于后续判断) +-+++++ # if not self._jit_used: +-+++++ # self._jit_used = True +-+++++ +-+++++ # # 构造兼容的输出对象 +-+++++ # class JitOptimizedOutput: +-+++++ # def __init__(self, logits, config): +-+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++++ # self.config = config +-+++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++++ # self.attentions = None if not config.is_encoder_decoder else None +-+++++ # self.cross_attentions = None +-+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++++ +-+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++++ # else: +-+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++++ # outputs = self(**model_inputs, return_dict=True) +-+++++ +-+++++ # if synced_devices and this_peer_finished: +-+++++ # continue +-+++++ +-+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++++ # next_token_logits = outputs.logits[:, -1, :] +-+++++ +-+++++ # # pre-process distribution +-+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++++ # if do_sample: +-+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++++ +-+++++ # # Store scores, attentions and hidden_states when required +-+++++ # if return_dict_in_generate: +-+++++ # if output_scores: +-+++++ # scores += (next_token_scores,) +-+++++ # if output_logits: +-+++++ # raw_logits += (next_token_logits,) +-+++++ # if output_attentions: +-+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++++ # if self.config.is_encoder_decoder: +-+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++++ +-+++++ # if output_hidden_states: +-+++++ # hidden = ( +-+++++ # outputs.decoder_hidden_states +-+++++ # if self.config.is_encoder_decoder +-+++++ # else outputs.hidden_states +-+++++ # ) +-+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++++ +-+++++ # # token selection +-+++++ # if do_sample: +-+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++++ # else: +-+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++++ +-+++++ # # finished sentences should have their next token be a padding token +-+++++ # if has_eos_stopping_criteria: +-+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++++ +-+++++ # # update generated ids, model inputs, and length for next step +-+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++++ # if streamer is not None: +-+++++ # streamer.put(next_tokens) +-+++++ +-+++++ # model_kwargs = self._update_model_kwargs_for_generation( +-+++++ # outputs, +-+++++ # model_kwargs, +-+++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++++ # ) +-+++++ +-+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++++ # cur_len += 1 +-+++++ +-+++++ # if _record_time: +-+++++ # import time as time_module +-+++++ # infer_stop = time_module.time() +-+++++ # time_record.append(infer_stop - infer_start) +-+++++ +-+++++ # del outputs +-+++++ +-+++++ # average_infer_time = None +-+++++ # if time_record: +-+++++ # if len(time_record) > 1: +-+++++ # time_record.pop(0) +-+++++ # average_infer_time = sum(time_record) / len(time_record) +-+++++ # print(f'average inference time is: {average_infer_time}') +-+++++ # print(f'inference time record: {time_record}') +-+++++ +-+++++ # if streamer is not None: +-+++++ # streamer.end() +-+++++ +-+++++ # # 简单判断:打印是否使用了JIT路径 +-+++++ # if hasattr(self, '_jit_used') and self._jit_used: +-+++++ # print("[JIT] ✓ JIT optimization was used during generation") +-+++++ # else: +-+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++++ +-+++++ # if return_dict_in_generate: +-+++++ # if self.config.is_encoder_decoder: +-+++++ # return GenerateEncoderDecoderOutput( +-+++++ # sequences=input_ids, +-+++++ # scores=scores, +-+++++ # logits=raw_logits, +-+++++ # encoder_attentions=encoder_attentions, +-+++++ # encoder_hidden_states=encoder_hidden_states, +-+++++ # decoder_attentions=decoder_attentions, +-+++++ # cross_attentions=cross_attentions, +-+++++ # decoder_hidden_states=decoder_hidden_states, +-+++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++ # average_infer_time=average_infer_time +-+++++ # ) +-+++++ # else: +-+++++ # return GenerateDecoderOnlyOutput( +-+++++ # sequences=input_ids, +-+++++ # scores=scores, +-+++++ # logits=raw_logits, +-+++++ # attentions=decoder_attentions, +-+++++ # hidden_states=decoder_hidden_states, +-+++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++ # average_infer_time=average_infer_time +-+++++ # ) +-+++++ # else: +-+++++ # return input_ids +-+++++ +-+++++ # def _prepare_cache_for_generation( +-+++++ # self, +-+++++ # generation_config, +-+++++ # model_kwargs, +-+++++ # assistant_model, +-+++++ # batch_size, +-+++++ # max_cache_length, +-+++++ # ): +-+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++++ # generation_config.cache_implementation = "static" +-+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++++ +-+++++ # if generation_config.cache_implementation == "static": +-+++++ # base_required_from_max_length = generation_config.max_length + 1 +-+++++ # base_required = max(max_cache_length, base_required_from_max_length) +-+++++ # min_cache_size = 50 +-+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++++ # else: +-+++++ # max_cache_length = max(base_required, min_cache_size) +-+++++ +-+++++ # original_max_cache_length = max_cache_length +-+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++++ # print(f" - final max_cache_length: {max_cache_length}") +-+++++ +-+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++ # if max_cache_length > self.config.max_position_embeddings: +-+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++++ +-+++++ # result = super()._prepare_cache_for_generation( +-+++++ # generation_config=generation_config, +-+++++ # model_kwargs=model_kwargs, +-+++++ # assistant_model=assistant_model, +-+++++ # batch_size=batch_size, +-+++++ # max_cache_length=max_cache_length, +-+++++ # ) +-+++++ +-+++++ # if generation_config.cache_implementation == "static": +-+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++++ # created_cache = model_kwargs.get(cache_name) +-+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++++ # if created_cache.max_cache_len < generation_config.max_length: +-+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++++ +-+++++ # return result +-+++++ +-+++++ +-+++++ +-++++ +-++++ +-++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++++-- +-++++2.27.0 +-++++ +-+++-- +-+++2.27.0 +-+++ +-++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-++new file mode 100644 +-++index 00000000..966529e4 +-++--- /dev/null +-+++++ b/patches/0003-20261106secondcommit.patch +-++@@ -0,0 +1,2769 @@ +-+++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+++Subject: [PATCH 3/3] 20261106secondcommit +-+++ +-+++--- +-+++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +-+++ patches/0001-20251104commit.patch | 1272 ----------------- +-+++ 3 files changed, 528 insertions(+), 2032 deletions(-) +-+++ delete mode 100644 patches/0001-20251104commit.patch +-+++ +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index 73773c22..2f9192bf 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +-+++ +-+++ _CONFIG_FOR_DOC = "DeepseekConfig" +-+++ +-++++_attn_mask_cache = {} +-++++ +-++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-++++ q_len = batch_and_seq[1] +-++++ kv_len = batch_and_seq[1] + past_key_values_length +-++++ key = (batch_and_seq[0], q_len, kv_len) +-++++ +-++++ if key in _attn_mask_cache: +-++++ return _attn_mask_cache[key] +-++++ +-++++ mask = _prepare_4d_causal_attention_mask( +-++++ attention_mask, +-++++ batch_and_seq, +-++++ inputs_embeds, +-++++ past_key_values_length, +-++++ ) +-++++ _attn_mask_cache[key] = mask +-++++ return mask +-+++ +-+++ def _get_unpad_data(attention_mask): +-+++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-+++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +-+++ return final_output +-+++ +-+++ +-+++- @no_grad() +-+++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++- expert_cache = ops.zeros_like(x) +-+++- idxs = flat_expert_indices.argsort() +-+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++- token_idxs = idxs // self.num_experts_per_tok +-+++- +-+++- for i, end_idx in enumerate(tokens_per_expert): +-+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++- if start_idx == end_idx: +-+++- continue +-+++- expert = self.experts[i] +-+++- exp_token_idx = token_idxs[start_idx:end_idx] +-+++- expert_tokens = x[exp_token_idx] +-+++- expert_out = expert(expert_tokens) +-+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++- +-+++- return expert_cache +-+++- +-+++ # @no_grad() +-+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++- # # expert_cache = torch.zeros_like(x) +-+++- # # idxs = flat_expert_indices.argsort() +-+++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++- # # token_idxs = idxs // self.num_experts_per_tok +-+++- # # for i, end_idx in enumerate(tokens_per_expert): +-+++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++- # # if start_idx == end_idx: +-+++- # # continue +-+++- # # expert = self.experts[i] +-+++- # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++- # # expert_tokens = x[exp_token_idx] +-+++- # # expert_out = expert(expert_tokens) +-+++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++- # # return expert_cache +-++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++ # expert_cache = ops.zeros_like(x) +-+++ # idxs = flat_expert_indices.argsort() +-+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +-+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++ +-+++ # return expert_cache +-+++- # @no_grad() +-+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++- # expert_cache = ops.zeros_like(x) +-++++ +-++++ @no_grad() +-++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++ """ +-++++ 优化版 MoE prefill: +-++++ - 批量张量化处理同一个 expert 的所有 token +-++++ - 跳过无 token 的专家 +-++++ - 保持结果完全一致 +-++++ """ +-++++ # 初始化输出缓存 +-++++ expert_cache = ops.zeros_like(x) +-+++ +-+++- # # 排序保证顺序一致 +-+++- # idxs = flat_expert_indices.argsort() +-+++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++- # token_idxs = idxs // self.num_experts_per_tok +-++++ # 排序(确保 scatter_add 位置对应原逻辑) +-++++ idxs = flat_expert_indices.argsort() +-++++ sorted_expert_indices = flat_expert_indices[idxs] +-++++ sorted_token_indices = idxs // self.num_experts_per_tok +-+++ +-+++- # # 找出有 token 的专家 +-+++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++ # 每个 expert 的 token 数 +-++++ tokens_per_expert = sorted_expert_indices.bincount() +-+++ +-+++- # for i in active_experts.tolist(): +-+++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++- # end_idx = tokens_per_expert[i] +-+++- # if start_idx == end_idx: # 没有 token +-+++- # continue +-++++ # 找出有 token 的专家 +-++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+++ +-+++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++- # expert_tokens = x[exp_token_idx] +-+++- # expert_out = self.experts[i](expert_tokens) +-+++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++ for expert_id in active_experts.tolist(): +-++++ # 取该 expert 对应的排序后 token 区间 +-++++ start = (tokens_per_expert[:expert_id]).sum().item() +-++++ end = start + tokens_per_expert[expert_id].item() +-+++ +-+++- # expert_cache = mindspore.mint.scatter_add( +-+++- # expert_cache, +-+++- # 0, +-+++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++- # expert_out +-+++- # ) +-++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-++++ expert_tokens = x[token_idx] # 取输入向量 +-+++ +-+++- # return expert_cache +-++++ # 执行专家 MLP +-++++ expert_out = self.experts[expert_id](expert_tokens) +-++++ +-++++ # 按权重缩放 +-++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-++++ +-++++ # 回写到缓存(等价 scatter_add) +-++++ expert_cache = mindspore.mint.scatter_add( +-++++ expert_cache, +-++++ 0, +-++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++ scaled_out +-++++ ) +-++++ +-++++ return expert_cache +-++++ +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # # expert_cache = torch.zeros_like(x) +-++++ # # idxs = flat_expert_indices.argsort() +-++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++ # # if start_idx == end_idx: +-++++ # # continue +-++++ # # expert = self.experts[i] +-++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # # expert_tokens = x[exp_token_idx] +-++++ # # expert_out = expert(expert_tokens) +-++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++ # # return expert_cache +-++++ # expert_cache = ops.zeros_like(x) +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # if start_idx == end_idx: +-++++ # continue +-++++ # expert = self.experts[i] +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = expert(expert_tokens) +-++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-++++ # return expert_cache +-++++ # @no_grad() +-++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++ # expert_cache = ops.zeros_like(x) +-++++ +-++++ # # 排序保证顺序一致 +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ # token_idxs = idxs // self.num_experts_per_tok +-++++ +-++++ # # 找出有 token 的专家 +-++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++ +-++++ # for i in active_experts.tolist(): +-++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ # end_idx = tokens_per_expert[i] +-++++ # if start_idx == end_idx: # 没有 token +-++++ # continue +-++++ +-++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++ # expert_tokens = x[exp_token_idx] +-++++ # expert_out = self.experts[i](expert_tokens) +-++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++ +-++++ # expert_cache = mindspore.mint.scatter_add( +-++++ # expert_cache, +-++++ # 0, +-++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++ # expert_out +-++++ # ) +-++++ +-++++ # return expert_cache +-+++ +-+++ +-+++ +-+++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++- +-+++ # class DeepseekFlashAttention(nn.Module): +-+++ # """ +-+++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-++++ +-+++ Deepseek_ATTENTION_CLASSES = { +-+++ "eager": DeepseekAttention, +-+++ "flash-attention": DeepseekFlashAttention, +-+++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +-+++ ) +-+++ else: +-+++ # 4d mask is passed through the layers +-+++- attention_mask = _prepare_4d_causal_attention_mask( +-++++ # attention_mask = _prepare_4d_causal_attention_mask( +-++++ # attention_mask, +-++++ # (batch_size, seq_length), +-++++ # inputs_embeds, +-++++ # past_key_values_length, +-++++ # ) +-++++ #@dwj +-++++ attention_mask = get_cached_causal_mask( +-+++ attention_mask, +-+++ (batch_size, seq_length), +-+++ inputs_embeds, +-+++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-+++ self.warm_up = False +-++++ #@dwj +-++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-++++ self.num_layers, +-++++ self.num_attention_heads, +-++++ self.head_dim, +-++++ batch_size=1, +-++++ max_length=self.max_length, +-++++ dtype=mindspore.float16 +-++++ ) +-++++ +-++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-++++ key_cache = [] +-++++ value_cache = [] +-++++ for _ in range(num_layers): +-++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++ key_cache.append(k) +-++++ value_cache.append(v) +-++++ return key_cache, value_cache +-++++ +-+++ +-+++ def warmup_moe_model_deep(self): +-+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++index bced285c..ebd7782e 100644 +-+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +-+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+++ +-+++-Long_Prompt = False +-+++-PROMPT_LENGTH_THRESHOLD = 128 +-++++Long_Prompt = 1 +-++++LONG_PROMPT_LENGTH_THRESHOLD = 128 +-++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-++++ +-++++_causal_mask_cache = {} +-++++ +-++++def get_cached_causal_mask_with_cache_position( +-++++ attention_mask: mindspore.Tensor, +-++++ sequence_length: int, +-++++ target_length: int, +-++++ dtype: mindspore.dtype, +-++++ min_dtype: float, +-++++ cache_position: mindspore.Tensor, +-++++ batch_size: int, +-++++): +-++++ """ +-++++ 带缓存的 causal mask 构造函数 +-++++ """ +-++++ # q_len 是当前 query 长度 +-++++ q_len = sequence_length +-++++ # kv_len 是 target_length +-++++ kv_len = target_length +-++++ +-++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-++++ +-++++ if key in _causal_mask_cache: +-++++ return _causal_mask_cache[key] +-++++ +-++++ # 调用原来的 mask 构造逻辑 +-++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++ attention_mask, +-++++ sequence_length=sequence_length, +-++++ target_length=target_length, +-++++ dtype=dtype, +-++++ min_dtype=min_dtype, +-++++ cache_position=cache_position, +-++++ batch_size=batch_size, +-++++ ) +-++++ # 缓存结果 +-++++ _causal_mask_cache[key] = causal_mask +-++++ return causal_mask +-+++ +-+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+++ def _prepare_4d_causal_attention_mask_with_cache_position( +-+++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++ +-+++ +-+++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-++++# class Qwen2MoeAttention(nn.Module): +-++++# """ +-++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-++++# and "Generating Long Sequences with Sparse Transformers". +-++++# """ +-++++ +-++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++# super().__init__() +-++++# self.config = config +-++++# self.layer_idx = layer_idx +-++++# if layer_idx is None: +-++++# logger.warning_once( +-++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++# "when creating this class." +-++++# ) +-++++ +-++++# self.hidden_size = config.hidden_size +-++++# self.num_heads = config.num_attention_heads +-++++# self.head_dim = self.hidden_size // self.num_heads +-++++# self.num_key_value_heads = config.num_key_value_heads +-++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++# self.max_position_embeddings = config.max_position_embeddings +-++++# self.rope_theta = config.rope_theta +-++++# self.is_causal = True +-++++# self.attention_dropout = config.attention_dropout +-++++ +-++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++# raise ValueError( +-++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++# f" and `num_heads`: {self.num_heads})." +-++++# ) +-++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++ +-++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++# self.head_dim, +-++++# max_position_embeddings=self.max_position_embeddings, +-++++# base=self.rope_theta, +-++++# ) +-++++ +-++++# def forward( +-++++# self, +-++++# hidden_states: mindspore.Tensor, +-++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++# position_ids: Optional[mindspore.Tensor] = None, +-++++# past_key_value: Optional[Cache] = None, +-++++# output_attentions: bool = False, +-++++# use_cache: bool = False, +-++++# cache_position: Optional[mindspore.Tensor] = None, +-++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++ +-++++ +-++++ +-++++# bsz, q_len, _ = hidden_states.shape +-++++ +-++++# query_states = self.q_proj(hidden_states) +-++++# key_states = self.k_proj(hidden_states) +-++++# value_states = self.v_proj(hidden_states) +-++++ +-++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++ +-++++# kv_seq_len = key_states.shape[-2] +-++++# if past_key_value is not None: +-++++# if self.layer_idx is None: +-++++# raise ValueError( +-++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++# "with a layer index." +-++++# ) +-++++# if isinstance(past_key_value, StaticCache): +-++++# kv_seq_len = key_states.shape[-2] +-++++# else: +-++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++# if past_key_value is not None: +-++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++# if isinstance(past_key_value, StaticCache): +-++++# kv_seq_len = key_states.shape[-2] +-++++ +-++++# # repeat k/v heads if n_kv_heads < n_heads +-++++# key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++# value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++ +-++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++ +-++++# if attention_mask is not None: +-++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++# attn_weights = attn_weights + causal_mask +-++++ +-++++# # upcast attention to fp32 +-++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++++# attn_output = ops.matmul(attn_weights, value_states) +-++++ +-++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++++# raise ValueError( +-++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-++++# f" {attn_output.shape}" +-++++# ) +-++++ +-++++# attn_output = ops.transpose(attn_output, 1, 2) +-++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++ +-++++# attn_output = self.o_proj(attn_output) +-++++# # @lwx +-++++ +-++++# # max_seq_len = self.max_position_embeddings # 2048 +-++++ +-++++# # if attention_mask is not None: +-++++# # # attention_mask: [B, 1, Sq, Sk] +-++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++ +-++++# # # pad 到 [max_seq_len, max_seq_len] +-++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++# # global_attention_mask = padded_mask +-++++# # else: +-++++# # global_attention_mask = None +-++++ +-++++ +-++++# # sparse_mode=3 +-++++# # attn_output = mindspore.ops.flash_attention_score( +-++++# # query=query_states, +-++++# # key=key_states, +-++++# # value=value_states, +-++++# # real_shift=None, +-++++# # padding_mask=None, +-++++ +-++++# # head_num=self.num_heads, +-++++# # attn_mask=global_attention_mask, +-++++# # keep_prob=1.0 - self.attention_dropout, +-++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++# # input_layout="BNSD", +-++++# # pre_tokens=2147483647, +-++++# # next_tokens=2147483647, +-++++# # inner_precise=0, +-++++# # drop_mask=None, +-++++# # prefix=None, +-++++# # actual_seq_qlen=None, +-++++# # actual_seq_kvlen=None, +-++++# # sparse_mode=sparse_mode, +-++++# # ) +-++++# if not output_attentions: +-++++# attn_weights = None +-++++ +-++++# return attn_output, attn_weights, past_key_value +-++++ +-+++ class Qwen2MoeAttention(nn.Module): +-+++ """ +-+++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+++- and "Generating Long Sequences with Sparse Transformers". +-+++- """ +-++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +-+++ +-++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-++++ +-++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-++++ """ +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++ super().__init__() +-+++ self.config = config +-+++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +-+++ if layer_idx is None: +-+++ logger.warning_once( +-+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++ "when creating this class." +-+++ ) +-+++ +-+++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +-+++ use_cache: bool = False, +-+++ cache_position: Optional[mindspore.Tensor] = None, +-+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++- +-+++ +-+++- +-++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +-+++ bsz, q_len, _ = hidden_states.shape +-+++ +-+++ query_states = self.q_proj(hidden_states) +-+++ key_states = self.k_proj(hidden_states) +-+++ value_states = self.v_proj(hidden_states) +-+++ +-+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++- +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ +-+++ kv_seq_len = key_states.shape[-2] +-+++ if past_key_value is not None: +-+++- if self.layer_idx is None: +-+++- raise ValueError( +-+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++- "with a layer index." +-+++- ) +-+++- if isinstance(past_key_value, StaticCache): +-+++- kv_seq_len = key_states.shape[-2] +-+++- else: +-+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++ +-+++ if past_key_value is not None: +-+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++ +-++++ # --- 2. 动态调度核心注意力计算 --- +-++++ global Long_Prompt +-++++ if Long_Prompt >= 1: +-++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-++++ fa_attention_mask = None +-++++ if attention_mask is not None: +-++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++ fa_attention_mask = (mask_slice != 0) +-++++ +-++++ attn_output = mindspore.ops.flash_attention_score( +-++++ query=query_states, +-++++ key=key_states, +-++++ value=value_states, +-++++ head_num=self.num_heads, +-++++ attn_mask=fa_attention_mask, +-++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ input_layout="BNSD", +-++++ sparse_mode=0, +-++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-++++ ) +-+++ +-+++- if isinstance(past_key_value, StaticCache): +-+++- kv_seq_len = key_states.shape[-2] +-++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = self.o_proj(attn_output) +-++++ attn_weights = None +-++++ if output_attentions: +-++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+++ +-+++- # repeat k/v heads if n_kv_heads < n_heads +-+++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++- +-+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++ else: +-++++ # --- Eager Attention 路径 (用于短序列和解码) --- +-++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++ +-++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++ +-+++- if attention_mask is not None: +-+++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++- attn_weights = attn_weights + causal_mask +-++++ if attention_mask is not None: +-++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++ attn_weights = attn_weights + causal_mask +-+++ +-+++- # upcast attention to fp32 +-+++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++- attn_output = ops.matmul(attn_weights, value_states) +-++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++++ attn_output = ops.matmul(attn_weights, value_states) +-+++ +-+++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++- raise ValueError( +-+++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+++- f" {attn_output.shape}" +-+++- ) +-++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++++ raise ValueError( +-++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-++++ ) +-+++ +-+++- attn_output = ops.transpose(attn_output, 1, 2) +-+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = ops.transpose(attn_output, 1, 2) +-++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = self.o_proj(attn_output) +-+++ +-+++- attn_output = self.o_proj(attn_output) +-+++- # @lwx +-++++ if not output_attentions: +-++++ attn_weights = None +-+++ +-+++- # max_seq_len = self.max_position_embeddings # 2048 +-+++- +-+++- # if attention_mask is not None: +-+++- # # attention_mask: [B, 1, Sq, Sk] +-+++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++- +-+++- # # pad 到 [max_seq_len, max_seq_len] +-+++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++- # global_attention_mask = padded_mask +-+++- # else: +-+++- # global_attention_mask = None +-+++- +-+++- +-+++- # sparse_mode=3 +-+++- # attn_output = mindspore.ops.flash_attention_score( +-+++- # query=query_states, +-+++- # key=key_states, +-+++- # value=value_states, +-+++- # real_shift=None, +-+++- # padding_mask=None, +-+++- +-+++- # head_num=self.num_heads, +-+++- # attn_mask=global_attention_mask, +-+++- # keep_prob=1.0 - self.attention_dropout, +-+++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++- # input_layout="BNSD", +-+++- # pre_tokens=2147483647, +-+++- # next_tokens=2147483647, +-+++- # inner_precise=0, +-+++- # drop_mask=None, +-+++- # prefix=None, +-+++- # actual_seq_qlen=None, +-+++- # actual_seq_kvlen=None, +-+++- # sparse_mode=sparse_mode, +-+++- # ) +-+++- if not output_attentions: +-+++- attn_weights = None +-+++- +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++- +-+++ # class Qwen2MoeFlashAttention(nn.Module): +-+++ # """ +-+++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +-+++ # return final_hidden_states, router_logits +-+++ +-+++ +-+++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++-# """ +-+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+++-# """ +-+++-# def __init__(self, config: Qwen2MoeConfig): +-+++-# super().__init__() +-+++-# self.num_experts = config.num_experts +-+++-# self.top_k = config.num_experts_per_tok +-+++-# self.norm_topk_prob = config.norm_topk_prob +-+++- +-+++-# # 门控网络 +-+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++-# # 专家列表 +-+++-# self.experts = nn.ModuleList( +-+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++-# ) +-+++-# # 共享专家 +-+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_decode( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# """ +-+++-# 【解码路径】针对 sequence_length=1 的极致优化。 +-+++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+++-# """ +-+++-# batch_size, hidden_dim = hidden_states.shape +-+++- +-+++-# expert_outputs_list = [ +-+++-# ops.cat([ +-+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++-# ], dim=0) +-+++-# for i in range(batch_size) +-+++-# ] +-+++- +-+++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+++-# # shape: (batch_size, top_k, hidden_dim) +-+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++- +-+++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++- +-+++-# return moe_output.squeeze(1) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_prefill( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# """ +-+++-# 【预填充路径】针对 sequence_length > 1 的优化。 +-+++-# 按专家对 Token 进行分组,并进行批处理。 +-+++-# """ +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens = hidden_states.shape[0] +-+++-# flat_selected_experts = selected_experts.flatten() +-+++- +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++- +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++- +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++- +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++-# selected_token_indices = token_indices[mask] +-+++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++- +-+++-# current_states = hidden_states[selected_token_indices] +-+++- +-+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++- +-+++-# moe_output = moe_output.index_add( +-+++-# dim=0, +-+++-# index=selected_token_indices, +-+++-# source=expert_output.to(hidden_states.dtype) +-+++-# ) +-+++-# return moe_output +-+++- +-+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-# """ +-+++-# 顶层 forward 方法,作为智能分发器。 +-+++-# """ +-+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- +-+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-# router_logits = self.gate(hidden_states_reshaped) +-+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- +-+++-# if self.norm_topk_prob: +-+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- +-+++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++-# moe_output = None +-+++-# # 在推理时,根据序列长度选择最优路径 +-+++-# if not self.training: +-+++-# if sequence_length == 1: +-+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++-# else: +-+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++-# else: +-+++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+++-# raise NotImplementedError("Training path is not implemented.") +-+++- +-+++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+++- +-+++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+++- +-+++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+++- +-+++-# return final_hidden_states, router_logits +-+++- +-+++- +-+++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++-# """ +-+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+++-# """ +-+++-# def __init__(self, config: Qwen2MoeConfig): +-+++-# super().__init__() +-+++-# self.num_experts = config.num_experts +-+++-# self.top_k = config.num_experts_per_tok +-+++-# self.norm_topk_prob = config.norm_topk_prob +-+++- +-+++-# # 门控网络 +-+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++-# # 专家列表 +-+++-# self.experts = nn.ModuleList( +-+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++-# ) +-+++-# # 共享专家 +-+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_decode( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# batch_size, _ = hidden_states.shape +-+++-# expert_outputs_list = [ +-+++-# ops.cat([ +-+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++-# ], dim=0) +-+++-# for i in range(batch_size) +-+++-# ] +-+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++-# return moe_output.squeeze(1) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_prefill( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens = hidden_states.shape[0] +-+++-# flat_selected_experts = selected_experts.flatten() +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++- +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++-# selected_token_indices = token_indices[mask] +-+++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++-# current_states = hidden_states[selected_token_indices] +-+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++-# moe_output = moe_output.index_add( +-+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++-# ) +-+++-# return moe_output +-+++- +-+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-# """ +-+++-# 顶层 forward 方法,作为智能分发器。 +-+++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+++-# """ +-+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- +-+++-# # 1. 门控计算 (通用逻辑) +-+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-# router_logits = self.gate(hidden_states_reshaped) +-+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- +-+++-# if self.norm_topk_prob: +-+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- +-+++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++-# # 2. 智能分发到最优 MoE 路径 +-+++-# moe_output = None +-+++-# if not self.training: +-+++-# if sequence_length == 1: +-+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++-# else: +-+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++-# else: +-+++-# raise NotImplementedError("Training path is not implemented.") +-+++- +-+++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++- +-+++-# # 4. 合并 MoE 输出和共享专家输出 +-+++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++- +-+++-# # 5. 恢复原始形状并返回 +-+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++- +-+++-# return final_hidden_states, router_logits +-+++- +-+++-# prefill fastest +-+++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++-# """ +-+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+++-# """ +-+++-# def __init__(self, config: Qwen2MoeConfig): +-+++-# super().__init__() +-+++-# self.num_experts = config.num_experts +-+++-# self.top_k = config.num_experts_per_tok +-+++-# self.norm_topk_prob = config.norm_topk_prob +-+++- +-+++-# # 门控网络 +-+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++-# # 专家列表 +-+++-# self.experts = nn.ModuleList( +-+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++-# ) +-+++-# # 共享专家 +-+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_dispatch( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# """ +-+++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+++-# """ +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens, _ = hidden_states.shape +-+++- +-+++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+++-# flat_selected_experts = selected_experts.flatten() +-+++-# flat_routing_weights = routing_weights.flatten() +-+++- +-+++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++- +-+++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++- +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++- +-+++-# # 找到所有分配给该专家的 token +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++- +-+++-# # 使用 mask 选取对应的 token 和权重 +-+++-# current_token_indices = token_indices[mask] +-+++-# current_routing_weights = flat_routing_weights[mask] +-+++-# current_hidden_states = hidden_states[current_token_indices] +-+++- +-+++-# # 对这些 token 进行批处理 +-+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++- +-+++-# # 使用 index_add 将结果精确地加回到对应位置 +-+++-# moe_output = moe_output.index_add( +-+++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+++-# ) +-+++-# return moe_output +-+++- +-+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-# """ +-+++-# 顶层 forward 方法,作为智能分发器。 +-+++-# """ +-+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- +-+++-# # 1. 门控计算 +-+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-# router_logits = self.gate(hidden_states_reshaped) +-+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- +-+++-# if self.norm_topk_prob: +-+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- +-+++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++- +-+++-# # 2. 调用统一的 MoE 计算内核 +-+++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+++- +-+++-# # 3. 统一处理共享专家 +-+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++- +-+++-# # 4. 合并输出 +-+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++- +-+++-# # 5. 恢复原始形状并返回 +-+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++- +-+++-# return final_hidden_states, router_logits +-+++- +-+++- +-+++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++-# """ +-+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++-# 【最终高性能与高精度版】: +-+++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+++-# 3. 这样实现了速度和准确性的两全其美。 +-+++-# """ +-+++-# def __init__(self, config: Qwen2MoeConfig): +-+++-# super().__init__() +-+++-# self.num_experts = config.num_experts +-+++-# self.top_k = config.num_experts_per_tok +-+++-# self.norm_topk_prob = config.norm_topk_prob +-+++- +-+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++-# self.experts = nn.ModuleList( +-+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++-# ) +-+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_decode( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# """ +-+++-# 【解码路径】极致优化版:bmm + 高精度累加。 +-+++-# """ +-+++-# original_dtype = hidden_states.dtype +-+++-# batch_size, _ = hidden_states.shape +-+++- +-+++-# expert_outputs_list = [ +-+++-# ops.cat([ +-+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++-# ], dim=0) +-+++-# for i in range(batch_size) +-+++-# ] +-+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++- +-+++-# # 在 float32 下执行 bmm,得到高精度结果 +-+++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++- +-+++-# # 将高精度结果转换回原始数据类型 +-+++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+++- +-+++-# return moe_output +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_prefill( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# selected_experts: mindspore.Tensor, +-+++-# routing_weights: mindspore.Tensor +-+++-# ) -> mindspore.Tensor: +-+++-# """ +-+++-# 【预填充路径】与原始实现一致,结果精确。 +-+++-# """ +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens, _ = hidden_states.shape +-+++-# flat_selected_experts = selected_experts.flatten() +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++- +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++-# selected_token_indices = token_indices[mask] +-+++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++-# current_states = hidden_states[selected_token_indices] +-+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++-# moe_output = moe_output.index_add( +-+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++-# ) +-+++-# return moe_output +-+++- +-+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++- +-+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-# router_logits = self.gate(hidden_states_reshaped) +-+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++- +-+++-# if self.norm_topk_prob: +-+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- +-+++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+++-# # 如果模型主体是 float16,后续再转换 +-+++- +-+++-# moe_output = None +-+++-# if not self.training: +-+++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+++-# # _moe_infer_decode 内部会处理好类型转换 +-+++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+++-# if sequence_length == 1: +-+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++-# else: +-+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++-# else: +-+++-# raise NotImplementedError("Training path is not implemented.") +-+++- +-+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++- +-+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++- +-+++-# return final_hidden_states, router_logits +-+++- +-+++- +-+++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++-# """ +-+++-# 【融合版】一个混合专家模块,内置两种推理策略, +-+++-# 由外部全局变量 `Long_Prompt` 控制: +-+++- +-+++-# - if Long_Prompt is True: 【精度优先模式】 +-+++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+++-# 适用于处理长序列,避免误差累积。 +-+++- +-+++-# - if Long_Prompt is False: 【速度优先模式】 +-+++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+++-# """ +-+++-# def __init__(self, config: Qwen2MoeConfig): +-+++-# super().__init__() +-+++-# self.num_experts = config.num_experts +-+++-# self.top_k = config.num_experts_per_tok +-+++-# self.norm_topk_prob = config.norm_topk_prob +-+++- +-+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++-# self.experts = nn.ModuleList( +-+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++-# ) +-+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-# # --- 速度优先模式的辅助函数 --- +-+++-# @no_grad() +-+++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++-# original_dtype = hidden_states.dtype +-+++-# batch_size, _ = hidden_states.shape +-+++-# expert_outputs_list = [ +-+++-# ops.cat([ +-+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++-# ], dim=0) +-+++-# for i in range(batch_size) +-+++-# ] +-+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++-# weights_fp32 = routing_weights.to(mindspore.float32) +-+++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++-# return moe_output_fp32.squeeze(1).to(original_dtype) +-+++- +-+++-# @no_grad() +-+++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens, _ = hidden_states.shape +-+++-# flat_selected_experts = selected_experts.flatten() +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++-# selected_token_indices = token_indices[mask] +-+++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++-# current_states = hidden_states[selected_token_indices] +-+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++-# return moe_output +-+++- +-+++-# # --- 精度优先模式的辅助函数 --- +-+++-# @no_grad() +-+++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++-# moe_output = ops.zeros_like(hidden_states) +-+++-# num_tokens, _ = hidden_states.shape +-+++-# flat_selected_experts = selected_experts.flatten() +-+++-# flat_routing_weights = routing_weights.flatten() +-+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++-# active_experts = ops.unique(flat_selected_experts) +-+++-# for expert_idx_tensor in active_experts: +-+++-# expert_idx = expert_idx_tensor.item() +-+++-# expert_layer = self.experts[expert_idx] +-+++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++-# current_token_indices = token_indices[mask] +-+++-# current_routing_weights = flat_routing_weights[mask] +-+++-# current_hidden_states = hidden_states[current_token_indices] +-+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++-# return moe_output +-+++- +-+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-# # 声明我们将要使用一个在模块外部定义的全局变量 +-+++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+++-# global Long_Prompt +-+++- +-+++-# # 1. 门控计算 (所有模式通用) +-+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-# router_logits = self.gate(hidden_states_reshaped) +-+++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++-# if self.norm_topk_prob: +-+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++- +-+++-# moe_output = None +-+++-# if not self.training: +-+++-# # 根据 Long_Prompt 标志选择模式 +-+++-# if Long_Prompt: +-+++-# # --- 精度优先模式 --- +-+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++-# else: +-+++-# # --- 速度优先模式 --- +-+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++-# if sequence_length == 1: +-+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++-# else: +-+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++-# else: +-+++-# raise NotImplementedError("Training path is not implemented.") +-+++- +-+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++- +-+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++- +-+++-# return final_hidden_states, router_logits +-+++- +-+++ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ """ +-+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++ return moe_output_fp32.squeeze(1).to(original_dtype) +-+++ +-++++ # @no_grad() +-++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++ # num_tokens, _ = hidden_states.shape +-++++ # flat_selected_experts = selected_experts.flatten() +-++++ # sorted_expert_indices = flat_selected_experts.argsort() +-++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++++ # original_token_indices = sorted_expert_indices // self.top_k +-++++ # moe_output = ops.zeros_like(hidden_states) +-++++ # current_token_offset = 0 +-++++ # for i in range(self.num_experts): +-++++ # expert_token_count = tokens_per_expert[i] - current_token_offset +-++++ # if expert_token_count == 0: +-++++ # continue +-++++ # end_offset = current_token_offset + expert_token_count +-++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++++ # expert_hidden_states = hidden_states[expert_original_token_indices] +-++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++ # current_token_offset += expert_token_count +-++++ # return moe_output +-++++ +-+++ @no_grad() +-+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++- num_tokens, _ = hidden_states.shape +-+++- flat_selected_experts = selected_experts.flatten() +-+++- sorted_expert_indices = flat_selected_experts.argsort() +-+++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++- original_token_indices = sorted_expert_indices // self.top_k +-++++ """ +-++++ 优化版 MoE prefill (速度优先模式): +-++++ - 批量张量化处理同一个 expert 的所有 token +-++++ - 跳过无 token 的专家 +-++++ - 保持结果完全一致 +-++++ """ +-+++ moe_output = ops.zeros_like(hidden_states) +-+++- current_token_offset = 0 +-+++- for i in range(self.num_experts): +-+++- expert_token_count = tokens_per_expert[i] - current_token_offset +-+++- if expert_token_count == 0: +-+++- continue +-+++- end_offset = current_token_offset + expert_token_count +-+++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++- expert_hidden_states = hidden_states[expert_original_token_indices] +-+++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++- current_token_offset += expert_token_count +-++++ +-++++ flat_selected_experts = selected_experts.flatten() +-++++ flat_routing_weights = routing_weights.flatten() +-++++ +-++++ idxs = flat_selected_experts.argsort() +-++++ sorted_expert_indices = flat_selected_experts[idxs] +-++++ sorted_token_indices = idxs // self.top_k +-++++ +-++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-++++ +-++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-++++ +-++++ for expert_id in active_experts.tolist(): +-++++ start = int(tokens_per_expert[:expert_id].sum().item()) +-++++ end = start + int(tokens_per_expert[expert_id].item()) +-++++ +-++++ token_idx = sorted_token_indices[start:end] +-++++ expert_tokens = hidden_states[token_idx] +-++++ +-++++ expert_out = self.experts[expert_id](expert_tokens) +-++++ +-++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-++++ +-++++ moe_output = mindspore.mint.scatter_add( +-++++ moe_output, +-++++ 0, +-++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-++++ scaled_out.to(hidden_states.dtype) +-++++ ) +-++++ +-+++ return moe_output +-+++ +-++++ +-+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+++ @no_grad() +-+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++ +-+++ moe_output = None +-+++- if Long_Prompt: +-+++- # --- 精度优先模式 (ACCURACY MODE) --- +-+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ # if Long_Prompt==0: +-++++ # # --- 精度优先模式 (ACCURACY MODE) --- +-++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ # else: +-++++ # # --- 速度优先模式 (SPEED MODE) --- +-++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++ # if sequence_length == 1: +-++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ # else: +-++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ +-++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++ if sequence_length == 1: +-++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++ else: +-+++- # --- 速度优先模式 (SPEED MODE) --- +-+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++- if sequence_length == 1: +-+++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++- else: +-+++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++- +-++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ +-+++ +-+++ # 3. 共享专家计算与合并 (所有模式通用) +-+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++ +-+++ return final_hidden_states, router_logits +-+++ +-++++ +-+++ class Qwen2MoeDecoderLayer(nn.Module): +-+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+++ super().__init__() +-+++ self.hidden_size = config.hidden_size +-+++ +-+++- # if Long_Prompt: +-+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++- # else: +-++++ # if Long_Prompt == 2: +-+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++ # else: +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ +-+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++ +-+++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++ ) +-+++ +-+++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-+++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++ # attention_mask, +-++++ # sequence_length=sequence_length, +-++++ # target_length=target_length, +-++++ # dtype=dtype, +-++++ # min_dtype=min_dtype, +-++++ # cache_position=cache_position, +-++++ # batch_size=input_tensor.shape[0], +-++++ # ) +-++++ #@dwj +-++++ causal_mask = get_cached_causal_mask_with_cache_position( +-+++ attention_mask, +-+++ sequence_length=sequence_length, +-+++ target_length=target_length, +-+++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+++ """ +-+++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-++++ _causal_mask_cache.clear() +-+++ +-+++ input_ids = kwargs.get("input_ids") +-+++ if input_ids is None and args: +-+++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ +-+++ if input_ids is not None: +-+++ prompt_length = input_ids.shape[1] +-+++- +-+++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+++- Long_Prompt = True +-++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-++++ Long_Prompt = 2 +-++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-++++ Long_Prompt = 0 +-+++ else: +-+++- Long_Prompt = False +-++++ Long_Prompt = 1 +-++++ +-+++ +-+++ return super().generate(*args, **kwargs) +-+++ +-+++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++ dtype = self.lm_head.weight.dtype +-+++ min_dtype = float(ops.finfo(dtype).min) +-+++ +-+++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++ # attention_mask, +-++++ # sequence_length=sequence_length, +-++++ # target_length=past_key_values.get_max_length(), +-++++ # dtype=dtype, +-++++ # min_dtype=min_dtype, +-++++ # cache_position=cache_position, +-++++ # batch_size=batch_size, +-++++ # ) +-++++ +-++++ #@dwj +-++++ attention_mask = get_cached_causal_mask_with_cache_position( +-+++ attention_mask, +-+++ sequence_length=sequence_length, +-+++ target_length=past_key_values.get_max_length(), +-+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++deleted file mode 100644 +-+++index 6dfb5b93..00000000 +-+++--- a/patches/0001-20251104commit.patch +-++++++ /dev/null +-+++@@ -1,1272 +0,0 @@ +-+++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++-From: Pinoeer-kingxi <13022943007@163.com> +-+++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++-Subject: [PATCH] 20251104commit +-+++- +-+++---- +-+++- mindnlp/transformers/cache_utils.py | 28 +- +-+++- .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++- 3 files changed, 976 insertions(+), 87 deletions(-) +-+++- +-+++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++-index cadd2e04..02f8d4be 100644 +-+++---- a/mindnlp/transformers/cache_utils.py +-+++-+++ b/mindnlp/transformers/cache_utils.py +-+++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++- # k_out[:, :, cache_position] = key_states +-+++- # v_out[:, :, cache_position] = value_states +-+++-- if ON_ORANGE_PI: +-+++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++-- else: +-+++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++-- +-+++-+ # if ON_ORANGE_PI: +-+++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++-+ # else: +-+++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++-+ if cache_position.ndim > 1: +-+++-+ cache_position = cache_position.flatten() +-+++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++-+ cache_position = cache_position.int() +-+++-+ +-+++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++-+ k_out[:, :, cache_position] = key_states +-+++-+ v_out[:, :, cache_position] = value_states +-+++-+ +-+++- return k_out, v_out +-+++- +-+++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++-index c695b944..d8303e45 100644 +-+++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++- # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++- def rotate_half(x): +-+++- """Rotates half the hidden dims of the input.""" +-+++-- x1 = x[..., : x.shape[-1] // 2] +-+++-- x2 = x[..., x.shape[-1] // 2 :] +-+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++-+ # x1 = x[..., : x.shape[-1] // 2] +-+++-+ # x2 = x[..., x.shape[-1] // 2 :] +-+++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++- return ops.cat((-x2, x1), dim=-1) +-+++- +-+++- +-+++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++- if self.training: +-+++- raise NotImplementedError("Training is not supported yet.") +-+++- else: +-+++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++-- if self.config.n_shared_experts is not None: +-+++-- y = y + self.shared_experts(identity) +-+++-- return y +-+++-+ # @lwx +-+++-+ if orig_shape[1] == 1: +-+++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++-+ y=y.view(*orig_shape) +-+++-+ if self.config.n_shared_experts is not None: +-+++-+ y = y + self.shared_experts(identity) +-+++-+ return y +-+++-+ else: +-+++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++-+ if self.config.n_shared_experts is not None: +-+++-+ y = y + self.shared_experts(identity) +-+++-+ return y +-+++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++-+ # if self.config.n_shared_experts is not None: +-+++-+ # y = y + self.shared_experts(identity) +-+++-+ # return y +-+++-+ +-+++-+ @no_grad() +-+++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++-+ +-+++-+ expert_cache = ops.zeros_like(x) +-+++-+ for i in range(self.num_experts_per_tok): +-+++-+ expert_id = flat_expert_indices[i].item() +-+++-+ weight = flat_expert_weights[i].item() +-+++-+ expert = self.experts[expert_id] +-+++-+ expert_out = expert(x) +-+++-+ expert_cache += expert_out * weight +-+++-+ return expert_cache +-+++- +-+++- @no_grad() +-+++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++-- # expert_cache = torch.zeros_like(x) +-+++-- # idxs = flat_expert_indices.argsort() +-+++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++-- # token_idxs = idxs // self.num_experts_per_tok +-+++-- # for i, end_idx in enumerate(tokens_per_expert): +-+++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++-- # if start_idx == end_idx: +-+++-- # continue +-+++-- # expert = self.experts[i] +-+++-- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++-- # expert_tokens = x[exp_token_idx] +-+++-- # expert_out = expert(expert_tokens) +-+++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++-- # return expert_cache +-+++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++- expert_cache = ops.zeros_like(x) +-+++- idxs = flat_expert_indices.argsort() +-+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++- token_idxs = idxs // self.num_experts_per_tok +-+++-+ +-+++- for i, end_idx in enumerate(tokens_per_expert): +-+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++- if start_idx == end_idx: +-+++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++- expert_out = expert(expert_tokens) +-+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++-+ +-+++- return expert_cache +-+++-+ +-+++-+ # @no_grad() +-+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++-+ # # expert_cache = torch.zeros_like(x) +-+++-+ # # idxs = flat_expert_indices.argsort() +-+++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++-+ # # token_idxs = idxs // self.num_experts_per_tok +-+++-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++-+ # # if start_idx == end_idx: +-+++-+ # # continue +-+++-+ # # expert = self.experts[i] +-+++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++-+ # # expert_tokens = x[exp_token_idx] +-+++-+ # # expert_out = expert(expert_tokens) +-+++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++-+ # # return expert_cache +-+++-+ # expert_cache = ops.zeros_like(x) +-+++-+ # idxs = flat_expert_indices.argsort() +-+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++-+ # token_idxs = idxs // self.num_experts_per_tok +-+++-+ +-+++-+ # for i, end_idx in enumerate(tokens_per_expert): +-+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++-+ # if start_idx == end_idx: +-+++-+ # continue +-+++-+ # expert = self.experts[i] +-+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++-+ # expert_tokens = x[exp_token_idx] +-+++-+ # expert_out = expert(expert_tokens) +-+++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++-+ +-+++-+ # return expert_cache +-+++-+ # @no_grad() +-+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++-+ # expert_cache = ops.zeros_like(x) +-+++-+ +-+++-+ # # 排序保证顺序一致 +-+++-+ # idxs = flat_expert_indices.argsort() +-+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++-+ # token_idxs = idxs // self.num_experts_per_tok +-+++-+ +-+++-+ # # 找出有 token 的专家 +-+++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++-+ +-+++-+ # for i in active_experts.tolist(): +-+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++-+ # end_idx = tokens_per_expert[i] +-+++-+ # if start_idx == end_idx: # 没有 token +-+++-+ # continue +-+++-+ +-+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++-+ # expert_tokens = x[exp_token_idx] +-+++-+ # expert_out = self.experts[i](expert_tokens) +-+++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++-+ +-+++-+ # expert_cache = mindspore.mint.scatter_add( +-+++-+ # expert_cache, +-+++-+ # 0, +-+++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++-+ # expert_out +-+++-+ # ) +-+++-+ +-+++-+ # return expert_cache +-+++-+ +-+++-+ +-+++- +-+++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++- # """ +-+++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++- +-+++- # Initialize weights and apply final processing +-+++- self.post_init() +-+++-+ self.warm_up = False +-+++-+ +-+++-+ def warmup_moe_model_deep(self): +-+++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++-+ test_texts = [ +-+++-+ "warmup short", +-+++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++-+ ] +-+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++-+ if tokenizer is None: +-+++-+ from mindnlp.transformers import AutoTokenizer +-+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++-+ self._warmup_tokenizer = tokenizer +-+++-+ +-+++-+ for text in test_texts: +-+++-+ inputs = tokenizer(text, return_tensors="ms") +-+++-+ with mindspore._no_grad(): +-+++-+ _ = self(**inputs, use_cache=False) +-+++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++- +-+++- def get_input_embeddings(self): +-+++- return self.model.embed_tokens +-+++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++- ```""" +-+++-+ if not self.warm_up: +-+++-+ self.warm_up = True +-+++-+ self.warmup_moe_model_deep() +-+++-+ +-+++- output_attentions = ( +-+++- output_attentions +-+++- if output_attentions is not None +-+++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++-index 3cbf820e..d4c6b651 100644 +-+++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++-@@ -18,7 +18,6 @@ +-+++- # See the License for the specific language governing permissions and +-+++- # limitations under the License. +-+++- """MindSpore Qwen2MoE model.""" +-+++-- +-+++- import math +-+++- from typing import List, Optional, Tuple, Union +-+++- +-+++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++- TokenClassifierOutput, +-+++- ) +-+++- from ...modeling_utils import PreTrainedModel +-+++-+from ...generation import GenerationMixin +-+++- from ....utils import logging +-+++- from .configuration_qwen2_moe import Qwen2MoeConfig +-+++- +-+++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++- self.variance_epsilon = eps +-+++- +-+++- def forward(self, hidden_states): +-+++-+ # @dwj +-+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++-+ # @lwx +-+++-+ # if not self.training : +-+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++- input_dtype = hidden_states.dtype +-+++- hidden_states = hidden_states.to(mindspore.float32) +-+++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++-@@ -234,6 +239,8 @@ def rotate_half(x): +-+++- """Rotates half the hidden dims of the input.""" +-+++- x1 = x[..., : x.shape[-1] // 2] +-+++- x2 = x[..., x.shape[-1] // 2 :] +-+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++- return ops.cat((-x2, x1), dim=-1) +-+++- +-+++- +-+++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++- self.config = config +-+++- self.hidden_size = config.hidden_size +-+++- self.intermediate_size = intermediate_size +-+++-+ +-+++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++- self.act_fn = ACT2FN[config.hidden_act] +-+++- +-+++- def forward(self, x): +-+++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++-- +-+++- +-+++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++-+ # @lwx +-+++-+ # gate_up_output = self.gate_up_proj(x) +-+++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++-+ # return self.down_proj(swiglu_output) +-+++-+ +-+++-+ # def forward(self, x): +-+++-+ # gate_proj_out = self.gate_proj(x) +-+++-+ # up_proj_out = self.up_proj(x) +-+++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++-+ # return self.down_proj(swiglu_out) +-+++-+ +-+++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++- """ +-+++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++- use_cache: bool = False, +-+++- cache_position: Optional[mindspore.Tensor] = None, +-+++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++-+ +-+++-+ +-+++-+ +-+++- bsz, q_len, _ = hidden_states.shape +-+++- +-+++- query_states = self.q_proj(hidden_states) +-+++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++- "with a layer index." +-+++- ) +-+++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++-+ if isinstance(past_key_value, StaticCache): +-+++-+ kv_seq_len = key_states.shape[-2] +-+++-+ else: +-+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++- +-+++- if past_key_value is not None: +-+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++-+ +-+++-+ if isinstance(past_key_value, StaticCache): +-+++-+ kv_seq_len = key_states.shape[-2] +-+++- +-+++- # repeat k/v heads if n_kv_heads < n_heads +-+++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++-- +-+++-+ +-+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++- +-+++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++-- raise ValueError( +-+++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++-- f" {attn_weights.shape}" +-+++-- ) +-+++-- +-+++-- if attention_mask is not None: # no matter the length, we just slice it +-+++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++-+ if attention_mask is not None: +-+++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++- attn_weights = attn_weights + causal_mask +-+++- +-+++- # upcast attention to fp32 +-+++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++- +-+++- attn_output = self.o_proj(attn_output) +-+++-- +-+++-+ # @lwx +-+++-+ +-+++-+ # max_seq_len = self.max_position_embeddings # 2048 +-+++-+ +-+++-+ # if attention_mask is not None: +-+++-+ # # attention_mask: [B, 1, Sq, Sk] +-+++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++-+ +-+++-+ # # pad 到 [max_seq_len, max_seq_len] +-+++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++-+ # global_attention_mask = padded_mask +-+++-+ # else: +-+++-+ # global_attention_mask = None +-+++-+ +-+++-+ +-+++-+ # sparse_mode=3 +-+++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++-+ # query=query_states, +-+++-+ # key=key_states, +-+++-+ # value=value_states, +-+++-+ # real_shift=None, +-+++-+ # padding_mask=None, +-+++-+ +-+++-+ # head_num=self.num_heads, +-+++-+ # attn_mask=global_attention_mask, +-+++-+ # keep_prob=1.0 - self.attention_dropout, +-+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++-+ # input_layout="BNSD", +-+++-+ # pre_tokens=2147483647, +-+++-+ # next_tokens=2147483647, +-+++-+ # inner_precise=0, +-+++-+ # drop_mask=None, +-+++-+ # prefix=None, +-+++-+ # actual_seq_qlen=None, +-+++-+ # actual_seq_kvlen=None, +-+++-+ # sparse_mode=sparse_mode, +-+++-+ # ) +-+++- if not output_attentions: +-+++- attn_weights = None +-+++- +-+++- return attn_output, attn_weights, past_key_value +-+++- +-+++- +-+++-+class Qwen2MoeFlashAttention(nn.Module): +-+++-+ """ +-+++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++-+ +-+++-+ 关键改动: +-+++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++-+ 直接传入原始的 key 和 value 张量效率更高。 +-+++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++-+ """ +-+++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++-+ super().__init__() +-+++-+ self.config = config +-+++-+ self.layer_idx = layer_idx +-+++-+ self.hidden_size = config.hidden_size +-+++-+ self.num_heads = config.num_attention_heads +-+++-+ self.head_dim = self.hidden_size // self.num_heads +-+++-+ self.num_key_value_heads = config.num_key_value_heads +-+++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++-+ self.max_position_embeddings = config.max_position_embeddings +-+++-+ self.rope_theta = config.rope_theta +-+++-+ self.attention_dropout = config.attention_dropout +-+++-+ +-+++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++-+ raise ValueError( +-+++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++-+ ) +-+++-+ +-+++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++-+ +-+++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++-+ self.head_dim, +-+++-+ max_position_embeddings=self.max_position_embeddings, +-+++-+ base=self.rope_theta, +-+++-+ ) +-+++-+ +-+++-+ def forward( +-+++-+ self, +-+++-+ hidden_states: mindspore.Tensor, +-+++-+ attention_mask: Optional[mindspore.Tensor] = None, +-+++-+ position_ids: Optional[mindspore.Tensor] = None, +-+++-+ past_key_value: Optional[Cache] = None, +-+++-+ output_attentions: bool = False, +-+++-+ use_cache: bool = False, +-+++-+ cache_position: Optional[mindspore.Tensor] = None, +-+++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++-+ +-+++-+ bsz, q_len, _ = hidden_states.shape +-+++-+ +-+++-+ # 1. 线性投射 Q, K, V +-+++-+ query_states = self.q_proj(hidden_states) +-+++-+ key_states = self.k_proj(hidden_states) +-+++-+ value_states = self.v_proj(hidden_states) +-+++-+ +-+++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++-+ # query: [B, S, H*D] -> [B, N1, S, D] +-+++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ +-+++-+ # 3. RoPE 旋转位置编码 +-+++-+ kv_seq_len = key_states.shape[-2] +-+++-+ if past_key_value is not None: +-+++-+ if self.layer_idx is None: +-+++-+ raise ValueError( +-+++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++-+ "with a layer index." +-+++-+ ) +-+++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++-+ if cache_position.shape[0] == 1: +-+++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++-+ kv_seq_len = past_seen_tokens + 1 +-+++-+ else: +-+++-+ # prefill 阶段:cache_position 是范围,使用其长度 +-+++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++-+ else: +-+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++-+ +-+++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++-+ +-+++-+ # 4. KV 缓存更新 +-+++-+ if past_key_value is not None: +-+++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++-+ key_states, value_states = past_key_value.update( +-+++-+ key_states, value_states, self.layer_idx, cache_kwargs +-+++-+ ) +-+++-+ +-+++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++-+ if cache_position.shape[0] == 1: +-+++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++-+ kv_seq_len = key_states.shape[-2] +-+++-+ +-+++-+ # 5. [重要] 准备 Attention Mask +-+++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++-+ fa_attention_mask = None +-+++-+ if attention_mask is not None: +-+++-+ # 截取与当前key长度匹配的部分 +-+++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++-+ fa_attention_mask = (mask_slice != 0) +-+++-+ +-+++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++-+ input_dtype = query_states.dtype +-+++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++-+ query_states = query_states.to(mindspore.float16) +-+++-+ key_states = key_states.to(mindspore.float16) +-+++-+ value_states = value_states.to(mindspore.float16) +-+++-+ +-+++-+ # 6. [核心] 调用 flash_attention_score 算子 +-+++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++-+ attn_output = mindspore.ops.flash_attention_score( +-+++-+ query=query_states, +-+++-+ key=key_states, +-+++-+ value=value_states, +-+++-+ head_num=self.num_heads, # 传入Q的头数(N1) +-+++-+ attn_mask=fa_attention_mask, +-+++-+ keep_prob=1.0 - self.attention_dropout, +-+++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++-+ input_layout="BNSD", +-+++-+ sparse_mode=0 # 使用 defaultMask 模式 +-+++-+ ) +-+++-+ +-+++-+ # 恢复原始数据类型 +-+++-+ attn_output = attn_output.to(input_dtype) +-+++-+ +-+++-+ # 7. 调整输出形状 +-+++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++-+ attn_output = self.o_proj(attn_output) +-+++-+ +-+++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++-+ attn_weights = None +-+++-+ if output_attentions: +-+++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++-+ +-+++-+ return attn_output, attn_weights, past_key_value +-+++-+ +-+++-+ # def forward( +-+++-+ # self, +-+++-+ # hidden_states: mindspore.Tensor, +-+++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+++-+ # position_ids: Optional[mindspore.Tensor] = None, +-+++-+ # past_key_value: Optional[Cache] = None, +-+++-+ # output_attentions: bool = False, +-+++-+ # use_cache: bool = False, +-+++-+ # cache_position: Optional[mindspore.Tensor] = None, +-+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++-+ +-+++-+ # bsz, q_len, _ = hidden_states.shape +-+++-+ +-+++-+ # # 1. 线性投射 Q, K, V +-+++-+ # query_states = self.q_proj(hidden_states) +-+++-+ # key_states = self.k_proj(hidden_states) +-+++-+ # value_states = self.v_proj(hidden_states) +-+++-+ +-+++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ +-+++-+ # # 3. RoPE 旋转位置编码 +-+++-+ # kv_seq_len = key_states.shape[-2] +-+++-+ # if past_key_value is not None: +-+++-+ # if self.layer_idx is None: +-+++-+ # raise ValueError( +-+++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++-+ # "with a layer index." +-+++-+ # ) +-+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++-+ +-+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++-+ +-+++-+ # # 4. KV 缓存更新 +-+++-+ # if past_key_value is not None: +-+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++-+ # key_states, value_states = past_key_value.update( +-+++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+++-+ # ) +-+++-+ +-+++-+ # # 5. 准备 Attention Mask +-+++-+ # fa_attention_mask = None +-+++-+ # if attention_mask is not None: +-+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++-+ # fa_attention_mask = (mask_slice != 0) +-+++-+ +-+++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++-+ # input_dtype = query_states.dtype +-+++-+ +-+++-+ # # 6. [核心] 调用 flash_attention_score 算子 +-+++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++-+ # query=query_states, +-+++-+ # key=key_states, +-+++-+ # value=value_states, +-+++-+ # head_num=self.num_heads, +-+++-+ # attn_mask=fa_attention_mask, +-+++-+ # keep_prob=1.0 - self.attention_dropout, +-+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++-+ # input_layout="BNSD", +-+++-+ # sparse_mode=0, +-+++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++-+ # inner_precise=1 +-+++-+ # ) +-+++-+ +-+++-+ # # 恢复原始数据类型 +-+++-+ # attn_output = attn_output.to(input_dtype) +-+++-+ +-+++-+ # # 7. 调整输出形状 +-+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++-+ # attn_output = self.o_proj(attn_output) +-+++-+ +-+++-+ # attn_weights = None +-+++-+ # if output_attentions: +-+++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++-+ +-+++-+ # return attn_output, attn_weights, past_key_value +-+++-+ +-+++-+ # def forward( +-+++-+ # self, +-+++-+ # hidden_states: mindspore.Tensor, +-+++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+++-+ # position_ids: Optional[mindspore.Tensor] = None, +-+++-+ # past_key_value: Optional[Cache] = None, +-+++-+ # output_attentions: bool = False, +-+++-+ # use_cache: bool = False, +-+++-+ # cache_position: Optional[mindspore.Tensor] = None, +-+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++-+ +-+++-+ # bsz, q_len, _ = hidden_states.shape +-+++-+ +-+++-+ # query_states = self.q_proj(hidden_states) +-+++-+ # key_states = self.k_proj(hidden_states) +-+++-+ # value_states = self.v_proj(hidden_states) +-+++-+ +-+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-+ +-+++-+ # kv_seq_len = key_states.shape[-2] +-+++-+ # if past_key_value is not None: +-+++-+ # if self.layer_idx is None: +-+++-+ # raise ValueError("`layer_idx` must be specified for caching") +-+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++-+ +-+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++-+ +-+++-+ # if past_key_value is not None: +-+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++-+ # key_states, value_states = past_key_value.update( +-+++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+++-+ # ) +-+++-+ +-+++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++-+ +-+++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++-+ # query_states = query_states / math.sqrt(self.head_dim) +-+++-+ # # <--- 修改结束 --- +-+++-+ +-+++-+ # fa_attention_mask = None +-+++-+ # if attention_mask is not None: +-+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++-+ # fa_attention_mask = (mask_slice != 0) +-+++-+ +-+++-+ # input_dtype = query_states.dtype +-+++-+ +-+++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++-+ # query=query_states, # 传入已经预先缩放过的 query +-+++-+ # key=key_states, +-+++-+ # value=value_states, +-+++-+ # head_num=self.num_heads, +-+++-+ # attn_mask=fa_attention_mask, +-+++-+ # keep_prob=1.0 - self.attention_dropout, +-+++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++-+ # input_layout="BNSD", +-+++-+ # sparse_mode=0, +-+++-+ # inner_precise=1 # 仍然保持内部高精度计算 +-+++-+ # ) +-+++-+ +-+++-+ # attn_output = attn_output.to(input_dtype) +-+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++-+ # attn_output = self.o_proj(attn_output) +-+++-+ +-+++-+ # attn_weights = None +-+++-+ # if output_attentions: +-+++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++-+ +-+++-+ # return attn_output, attn_weights, past_key_value +-+++-+ +-+++- QWEN2MOE_ATTENTION_CLASSES = { +-+++- "eager": Qwen2MoeAttention, +-+++-+ "flash-attention": Qwen2MoeFlashAttention, +-+++- } +-+++- +-+++- +-+++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++- +-+++-+ #@dwj +-+++-+ # 只遍历激活的专家,而非全部专家 +-+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++-- hidden_states = hidden_states.view(-1, hidden_dim) +-+++-- # router_logits: (batch * sequence_length, n_experts) +-+++-- router_logits = self.gate(hidden_states) +-+++-- +-+++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++-- if self.norm_topk_prob: +-+++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++-- # we cast back to the input dtype +-+++-- routing_weights = routing_weights.to(hidden_states.dtype) +-+++-- +-+++-- final_hidden_states = ops.zeros( +-+++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++-- ) +-+++-- +-+++-- # One hot encode the selected experts to create an expert mask +-+++-- # this will be used to easily index which expert is going to be sollicitated +-+++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++-- +-+++-- # Loop over all available experts in the model and perform the computation on each expert +-+++-- for expert_idx in range(self.num_experts): +-+++-- expert_layer = self.experts[expert_idx] +-+++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++-- +-+++-- # Index the correct hidden states and compute the expert hidden state for +-+++-- # the current expert. We need to make sure to multiply the output hidden +-+++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++-- if 0 not in idx.shape: +-+++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++-- +-+++-- # However `index_add_` only support torch tensors for indexing so we'll use +-+++-- # the `top_x` tensor here. +-+++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++-- +-+++-- shared_expert_output = self.shared_expert(hidden_states) +-+++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++-- +-+++-- final_hidden_states = final_hidden_states + shared_expert_output +-+++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++-+ num_tokens = hidden_states_reshaped.shape[0] +-+++-+ +-+++-+ router_logits = self.gate(hidden_states_reshaped) +-+++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++-+ +-+++-+ if self.norm_topk_prob: +-+++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++-+ routing_weights = routing_weights.to(hidden_states.dtype) +-+++-+ +-+++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++-+ flat_selected_experts = selected_experts.flatten() +-+++-+ +-+++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++-+ token_indices = broadcasted_token_indices.flatten() +-+++-+ +-+++-+ active_experts = ops.unique(flat_selected_experts) +-+++-+ +-+++-+ for expert_idx_tensor in active_experts: +-+++-+ expert_idx = expert_idx_tensor.item() +-+++-+ expert_layer = self.experts[expert_idx] +-+++-+ +-+++-+ mask = (flat_selected_experts == expert_idx_tensor) +-+++-+ selected_token_indices = token_indices[mask] +-+++-+ selected_routing_weights = routing_weights.flatten()[mask] +-+++-+ +-+++-+ current_states = hidden_states_reshaped[selected_token_indices] +-+++-+ +-+++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++-+ +-+++-+ final_hidden_states = final_hidden_states.index_add( +-+++-+ dim=0, +-+++-+ index=selected_token_indices, +-+++-+ source=expert_output.to(hidden_states.dtype) +-+++-+ ) +-+++-+ +-+++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++- +-+++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++-- return final_hidden_states, router_logits +-+++-+ final_hidden_states = final_hidden_states + shared_expert_output +-+++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++-+ +-+++-+ return final_hidden_states, router_logits +-+++- +-+++- +-+++- class Qwen2MoeDecoderLayer(nn.Module): +-+++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++- +-+++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++- +-+++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++-+ +-+++- if (layer_idx not in config.mlp_only_layers) and ( +-+++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++- ): +-+++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++- _skip_keys_device_placement = "past_key_values" +-+++- _supports_cache_class = True +-+++-+#lwx +-+++-+ # _supports_static_cache = True +-+++- +-+++- def _init_weights(self, module): +-+++- std = self.config.initializer_range +-+++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++- return causal_mask +-+++- +-+++- +-+++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++- _tied_weights_keys = ["lm_head.weight"] +-+++- +-+++- def __init__(self, config): +-+++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++- self.num_experts_per_tok = config.num_experts_per_tok +-+++- # Initialize weights and apply final processing +-+++- self.post_init() +-+++-+ # @lwx +-+++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++-+ # self.generation_config.cache_implementation = "static" +-+++-+ self._warmed_up = False +-+++-+ +-+++-+ def warmup_moe_model(self): +-+++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++-+ test_texts = [ +-+++-+ "warmup short", +-+++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++-+ ] +-+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++-+ if tokenizer is None: +-+++-+ from mindnlp.transformers import AutoTokenizer +-+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++-+ self._warmup_tokenizer = tokenizer +-+++-+ +-+++-+ for text in test_texts: +-+++-+ inputs = tokenizer(text, return_tensors="ms") +-+++-+ with mindspore._no_grad(): +-+++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++- +-+++- def get_input_embeddings(self): +-+++- return self.model.embed_tokens +-+++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++- ```""" +-+++-+ if not self._warmed_up: +-+++-+ self._warmed_up = True +-+++-+ self.warmup_moe_model() +-+++- +-+++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++- output_router_logits = ( +-+++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++- } +-+++- ) +-+++- return model_inputs +-+++-+# @lwx +-+++-+ # def _decode_one_tokens_logits( +-+++-+ # self, +-+++-+ # cur_token: mindspore.Tensor, +-+++-+ # input_pos: Optional[mindspore.Tensor], +-+++-+ # cache_position: mindspore.Tensor, +-+++-+ # past_key_values: StaticCache, +-+++-+ # ) -> mindspore.Tensor: +-+++-+ # """ +-+++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++-+ +-+++-+ # Args: +-+++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++-+ # input_pos: 输入位置信息,可选 +-+++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++-+ +-+++-+ # Returns: +-+++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++-+ # """ +-+++-+ # # 调用JIT编译的版本 +-+++-+ # return self.get_decode_one_tokens_logits( +-+++-+ # cur_token=cur_token, +-+++-+ # input_pos=input_pos, +-+++-+ # cache_position=cache_position, +-+++-+ # past_key_values=past_key_values, +-+++-+ # ) +-+++-+ +-+++-+ # @mindspore.jit(jit_level='O1') +-+++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++-+ # """ +-+++-+ # JIT编译的函数,用于高效的单token解码 +-+++-+ # 使用JIT编译优化以支持静态shape和高效执行 +-+++-+ +-+++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++-+ # """ +-+++-+ # outputs = self.model.forward( +-+++-+ # input_ids=cur_token, +-+++-+ # position_ids=input_pos, +-+++-+ # cache_position=cache_position, +-+++-+ # past_key_values=past_key_values, +-+++-+ # use_cache=True, +-+++-+ # return_dict=False, +-+++-+ # ) +-+++-+ +-+++-+ # hidden_states = outputs[0] +-+++-+ # logits = self.lm_head.forward(hidden_states) +-+++-+ # logits = logits.float() +-+++-+ +-+++-+ # return logits[:, -1, :] +-+++-+ +-+++-+ # def _sample( +-+++-+ # self, +-+++-+ # input_ids: mindspore.Tensor, +-+++-+ # logits_processor, +-+++-+ # stopping_criteria, +-+++-+ # generation_config, +-+++-+ # synced_devices: bool, +-+++-+ # streamer=None, +-+++-+ # logits_warper=None, +-+++-+ # **model_kwargs, +-+++-+ # ): +-+++-+ # """ +-+++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++-+ # """ +-+++-+ # from ...generation.logits_process import LogitsProcessorList +-+++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++-+ # from mindnlp.core import nn, ops, no_grad +-+++-+ # import numpy as np +-+++-+ +-+++-+ # # 检查是否使用 StaticCache +-+++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++-+ # # 否则,直接调用父类方法 +-+++-+ # past_key_values = model_kwargs.get("past_key_values") +-+++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++-+ +-+++-+ # if not isinstance(past_key_values, StaticCache): +-+++-+ # # 不使用 StaticCache,直接调用父类方法 +-+++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++-+ # return super()._sample( +-+++-+ # input_ids=input_ids, +-+++-+ # logits_processor=logits_processor, +-+++-+ # stopping_criteria=stopping_criteria, +-+++-+ # generation_config=generation_config, +-+++-+ # synced_devices=synced_devices, +-+++-+ # streamer=streamer, +-+++-+ # logits_warper=logits_warper, +-+++-+ # **model_kwargs, +-+++-+ # ) +-+++-+ +-+++-+ # # 使用 StaticCache,进入自定义循环 +-+++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++-+ # pad_token_id = generation_config._pad_token_tensor +-+++-+ # output_attentions = generation_config.output_attentions +-+++-+ # output_hidden_states = generation_config.output_hidden_states +-+++-+ # output_scores = generation_config.output_scores +-+++-+ # output_logits = generation_config.output_logits +-+++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++-+ # max_length = generation_config.max_length +-+++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++-+ # do_sample = generation_config.do_sample +-+++-+ +-+++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++-+ # raise ValueError( +-+++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++-+ # f"{logits_warper})." +-+++-+ # ) +-+++-+ +-+++-+ # # init attention / hidden states / scores tuples +-+++-+ # scores = () if (return_dict_in_generate and output_scores) else None +-+++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++-+ +-+++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++-+ # encoder_hidden_states = ( +-+++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++-+ # ) +-+++-+ +-+++-+ # # keep track of which sequences are already finished +-+++-+ # batch_size, cur_len = input_ids.shape +-+++-+ # this_peer_finished = False +-+++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++-+ +-+++-+ # time_record = [] +-+++-+ # from ....utils.testing_utils import parse_flag_from_env +-+++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++-+ +-+++-+ # while self._has_unfinished_sequences( +-+++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++-+ # ): +-+++-+ # if _record_time: +-+++-+ # import time as time_module +-+++-+ # infer_start = time_module.time() +-+++-+ +-+++-+ # # prepare model inputs +-+++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++-+ +-+++-+ # # prepare variable output controls +-+++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++-+ +-+++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++-+ # cur_cache_position = model_inputs.get("cache_position") +-+++-+ # cur_past_key_values = model_inputs.get("past_key_values") +-+++-+ # cur_input_ids = model_inputs.get("input_ids") +-+++-+ +-+++-+ # if (isinstance(cur_past_key_values, StaticCache) and +-+++-+ # cur_cache_position is not None and +-+++-+ # len(cur_cache_position.shape) > 0 and +-+++-+ # cur_cache_position.shape[0] == 1 and +-+++-+ # cur_input_ids is not None and +-+++-+ # cur_input_ids.shape[1] == 1): +-+++-+ # # 使用 JIT 优化的单 token 解码 +-+++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++-+ # if not hasattr(self, '_jit_used'): +-+++-+ # self._jit_used = False +-+++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++-+ +-+++-+ # next_token_logits = self.get_decode_one_tokens_logits( +-+++-+ # cur_token=cur_input_ids, +-+++-+ # input_pos=model_inputs.get("position_ids"), +-+++-+ # cache_position=cur_cache_position, +-+++-+ # past_key_values=cur_past_key_values, +-+++-+ # ) +-+++-+ +-+++-+ # # 标记已使用JIT(用于后续判断) +-+++-+ # if not self._jit_used: +-+++-+ # self._jit_used = True +-+++-+ +-+++-+ # # 构造兼容的输出对象 +-+++-+ # class JitOptimizedOutput: +-+++-+ # def __init__(self, logits, config): +-+++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++-+ # self.config = config +-+++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++-+ # self.attentions = None if not config.is_encoder_decoder else None +-+++-+ # self.cross_attentions = None +-+++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++-+ +-+++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++-+ # else: +-+++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++-+ # outputs = self(**model_inputs, return_dict=True) +-+++-+ +-+++-+ # if synced_devices and this_peer_finished: +-+++-+ # continue +-+++-+ +-+++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++-+ # next_token_logits = outputs.logits[:, -1, :] +-+++-+ +-+++-+ # # pre-process distribution +-+++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++-+ # if do_sample: +-+++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++-+ +-+++-+ # # Store scores, attentions and hidden_states when required +-+++-+ # if return_dict_in_generate: +-+++-+ # if output_scores: +-+++-+ # scores += (next_token_scores,) +-+++-+ # if output_logits: +-+++-+ # raw_logits += (next_token_logits,) +-+++-+ # if output_attentions: +-+++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++-+ # if self.config.is_encoder_decoder: +-+++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++-+ +-+++-+ # if output_hidden_states: +-+++-+ # hidden = ( +-+++-+ # outputs.decoder_hidden_states +-+++-+ # if self.config.is_encoder_decoder +-+++-+ # else outputs.hidden_states +-+++-+ # ) +-+++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++-+ +-+++-+ # # token selection +-+++-+ # if do_sample: +-+++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++-+ # else: +-+++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++-+ +-+++-+ # # finished sentences should have their next token be a padding token +-+++-+ # if has_eos_stopping_criteria: +-+++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++-+ +-+++-+ # # update generated ids, model inputs, and length for next step +-+++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++-+ # if streamer is not None: +-+++-+ # streamer.put(next_tokens) +-+++-+ +-+++-+ # model_kwargs = self._update_model_kwargs_for_generation( +-+++-+ # outputs, +-+++-+ # model_kwargs, +-+++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++-+ # ) +-+++-+ +-+++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++-+ # cur_len += 1 +-+++-+ +-+++-+ # if _record_time: +-+++-+ # import time as time_module +-+++-+ # infer_stop = time_module.time() +-+++-+ # time_record.append(infer_stop - infer_start) +-+++-+ +-+++-+ # del outputs +-+++-+ +-+++-+ # average_infer_time = None +-+++-+ # if time_record: +-+++-+ # if len(time_record) > 1: +-+++-+ # time_record.pop(0) +-+++-+ # average_infer_time = sum(time_record) / len(time_record) +-+++-+ # print(f'average inference time is: {average_infer_time}') +-+++-+ # print(f'inference time record: {time_record}') +-+++-+ +-+++-+ # if streamer is not None: +-+++-+ # streamer.end() +-+++-+ +-+++-+ # # 简单判断:打印是否使用了JIT路径 +-+++-+ # if hasattr(self, '_jit_used') and self._jit_used: +-+++-+ # print("[JIT] ✓ JIT optimization was used during generation") +-+++-+ # else: +-+++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++-+ +-+++-+ # if return_dict_in_generate: +-+++-+ # if self.config.is_encoder_decoder: +-+++-+ # return GenerateEncoderDecoderOutput( +-+++-+ # sequences=input_ids, +-+++-+ # scores=scores, +-+++-+ # logits=raw_logits, +-+++-+ # encoder_attentions=encoder_attentions, +-+++-+ # encoder_hidden_states=encoder_hidden_states, +-+++-+ # decoder_attentions=decoder_attentions, +-+++-+ # cross_attentions=cross_attentions, +-+++-+ # decoder_hidden_states=decoder_hidden_states, +-+++-+ # past_key_values=model_kwargs.get("past_key_values"), +-+++-+ # average_infer_time=average_infer_time +-+++-+ # ) +-+++-+ # else: +-+++-+ # return GenerateDecoderOnlyOutput( +-+++-+ # sequences=input_ids, +-+++-+ # scores=scores, +-+++-+ # logits=raw_logits, +-+++-+ # attentions=decoder_attentions, +-+++-+ # hidden_states=decoder_hidden_states, +-+++-+ # past_key_values=model_kwargs.get("past_key_values"), +-+++-+ # average_infer_time=average_infer_time +-+++-+ # ) +-+++-+ # else: +-+++-+ # return input_ids +-+++-+ +-+++-+ # def _prepare_cache_for_generation( +-+++-+ # self, +-+++-+ # generation_config, +-+++-+ # model_kwargs, +-+++-+ # assistant_model, +-+++-+ # batch_size, +-+++-+ # max_cache_length, +-+++-+ # ): +-+++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++-+ # generation_config.cache_implementation = "static" +-+++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++-+ +-+++-+ # if generation_config.cache_implementation == "static": +-+++-+ # base_required_from_max_length = generation_config.max_length + 1 +-+++-+ # base_required = max(max_cache_length, base_required_from_max_length) +-+++-+ # min_cache_size = 50 +-+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++-+ # else: +-+++-+ # max_cache_length = max(base_required, min_cache_size) +-+++-+ +-+++-+ # original_max_cache_length = max_cache_length +-+++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++-+ # print(f" - final max_cache_length: {max_cache_length}") +-+++-+ +-+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++-+ # if max_cache_length > self.config.max_position_embeddings: +-+++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++-+ +-+++-+ # result = super()._prepare_cache_for_generation( +-+++-+ # generation_config=generation_config, +-+++-+ # model_kwargs=model_kwargs, +-+++-+ # assistant_model=assistant_model, +-+++-+ # batch_size=batch_size, +-+++-+ # max_cache_length=max_cache_length, +-+++-+ # ) +-+++-+ +-+++-+ # if generation_config.cache_implementation == "static": +-+++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++-+ # created_cache = model_kwargs.get(cache_name) +-+++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++-+ # if created_cache.max_cache_len < generation_config.max_length: +-+++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++-+ +-+++-+ # return result +-+++-+ +-+++-+ +-+++-+ +-+++- +-+++- +-+++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++--- +-+++-2.27.0 +-+++- +-+++-- +-+++2.27.0 +-+++ +-++-- +-++2.27.0 +-++ +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch +deleted file mode 100644 +index 8a2fc4fe..00000000 +--- a/patches/0007-20251107003commit.patch ++++ /dev/null +@@ -1,8034 +0,0 @@ +-From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Fri, 7 Nov 2025 12:12:51 +0800 +-Subject: [PATCH 7/8] 20251107003commit +- +---- +- .../models/deepseek/modeling_deepseek.py | 2 +- +- patches/0001-20251104commit.patch | 2 +- +- patches/0002-20251106commit.patch | 2 +- +- patches/0003-20261106secondcommit.patch | 2 +- +- patches/0004-20251106change.patch | 2 +- +- patches/0005-20251107001commit.patch | 2 +- +- patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ +- 7 files changed, 7937 insertions(+), 6 deletions(-) +- create mode 100644 patches/0006-20251107002commit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index e7e1c053..ff631974 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): +- # return expert_cache +- +- @no_grad() +-- dwj +-+ # dwj +- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- # x 的 shape: (1, hidden_size) +- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-index 2842180e..c9c8c5ee 100644 +---- a/patches/0001-20251104commit.patch +-+++ b/patches/0001-20251104commit.patch +-@@ -1,7 +1,7 @@ +- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Tue, 4 Nov 2025 09:11:51 +0800 +--Subject: [PATCH 1/5] 20251104commit +-+Subject: [PATCH 1/6] 20251104commit +- +- --- +- mindnlp/transformers/cache_utils.py | 28 +- +-diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-index c6cd8757..625656eb 100644 +---- a/patches/0002-20251106commit.patch +-+++ b/patches/0002-20251106commit.patch +-@@ -1,7 +1,7 @@ +- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 09:20:38 +0800 +--Subject: [PATCH 2/5] 20251106commit +-+Subject: [PATCH 2/6] 20251106commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 379 ++++- +-diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-index 601960c9..dcb85080 100644 +---- a/patches/0003-20261106secondcommit.patch +-+++ b/patches/0003-20261106secondcommit.patch +-@@ -1,7 +1,7 @@ +- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 14:54:37 +0800 +--Subject: [PATCH 3/5] 20261106secondcommit +-+Subject: [PATCH 3/6] 20261106secondcommit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 217 ++- +-diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-index 8976f10b..bbed13cc 100644 +---- a/patches/0004-20251106change.patch +-+++ b/patches/0004-20251106change.patch +-@@ -1,7 +1,7 @@ +- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 15:48:09 +0800 +--Subject: [PATCH 4/5] 20251106change +-+Subject: [PATCH 4/6] 20251106change +- +- --- +- .../models/deepseek/modeling_deepseek.py | 189 +- +-diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-index 8d9032be..b2d1035c 100644 +---- a/patches/0005-20251107001commit.patch +-+++ b/patches/0005-20251107001commit.patch +-@@ -1,7 +1,7 @@ +- From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Fri, 7 Nov 2025 11:48:18 +0800 +--Subject: [PATCH 5/5] 20251107001commit +-+Subject: [PATCH 5/6] 20251107001commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 91 +- +-diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +-new file mode 100644 +-index 00000000..bffa134e +---- /dev/null +-+++ b/patches/0006-20251107002commit.patch +-@@ -0,0 +1,7931 @@ +-+From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Fri, 7 Nov 2025 12:06:32 +0800 +-+Subject: [PATCH 6/6] 20251107002commit +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 122 +- +-+ patches/0001-20251104commit.patch | 2 +- +-+ patches/0002-20251106commit.patch | 2 +- +-+ patches/0003-20261106secondcommit.patch | 2 +- +-+ patches/0004-20251106change.patch | 2 +- +-+ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ +-+ 6 files changed, 7773 insertions(+), 64 deletions(-) +-+ create mode 100644 patches/0005-20251107001commit.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index 8831e4b7..e7e1c053 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): +-+ # expert_out = expert(x) +-+ # expert_cache += expert_out * weight +-+ # return expert_cache +-+- +-+- # @no_grad() +-+- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+- # # x 的 shape: (1, hidden_size) +-+- # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+- +-+- # # 1. 收集所有需要的专家层 +-+- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+- # selected_experts = [self.experts[i] for i in flat_expert_indices] +-+- +-+- # # 2. 并行计算所有专家的输出 +-+- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+- # # ops.cat 会将它们堆叠成一个新的 Tensor +-+- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+- +-+- # # 3. 使用矩阵乘法进行加权求和 +-+- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+- # # 最终结果 final_output 的 shape: (1, hidden_size) +-+- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++ +-++ @no_grad() +-++ dwj +-++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ # x 的 shape: (1, hidden_size) +-++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++ +-++ # 1. 收集所有需要的专家层 +-++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-++ +-++ # 2. 并行计算所有专家的输出 +-++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++ # ops.cat 会将它们堆叠成一个新的 Tensor +-++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++ +-++ # 3. 使用矩阵乘法进行加权求和 +-++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++ # 最终结果 final_output 的 shape: (1, hidden_size) +-++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+ +-+- # return final_output +-++ return final_output +-+ +-+ +-+ # @no_grad() +-+@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): +-+ +-+ return expert_cache +-+ # 放置在 DeepseekMoE 类中 +-+- @no_grad() +-+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+- """ +-+- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-+- +-+- Args: +-+- x (Tensor): 输入张量, shape: (1, hidden_size) +-+- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-+- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-+- """ +-+- top_k, _ = flat_expert_weights.shape +-+- hidden_size = x.shape[-1] +-+- +-+- # 1. 将所有专家的权重堆叠起来 +-+- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-+- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-+- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-++ # @no_grad() +-++ # #lwx 20251107 +-++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++ # """ +-++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-++ +-++ # Args: +-++ # x (Tensor): 输入张量, shape: (1, hidden_size) +-++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-++ # """ +-++ # top_k, _ = flat_expert_weights.shape +-++ # hidden_size = x.shape[-1] +-++ +-++ # # 1. 将所有专家的权重堆叠起来 +-++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-+ +-+- # 2. "收集" 所需的专家权重 +-+- selected_gate_w = stacked_gate_w[flat_expert_indices] +-+- selected_up_w = stacked_up_w[flat_expert_indices] +-+- selected_down_w = stacked_down_w[flat_expert_indices] +-++ # # 2. "收集" 所需的专家权重 +-++ # selected_gate_w = stacked_gate_w[flat_expert_indices] +-++ # selected_up_w = stacked_up_w[flat_expert_indices] +-++ # selected_down_w = stacked_down_w[flat_expert_indices] +-+ +-+- # 3. 准备输入 +-+- x_expanded = x.expand((top_k, 1, hidden_size)) +-++ # # 3. 准备输入 +-++ # x_expanded = x.expand((top_k, 1, hidden_size)) +-+ +-+- # 4. 并行计算 gate_proj 和 up_proj +-+- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-+- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-++ # # 4. 并行计算 gate_proj 和 up_proj +-++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-+ +-+- # 5. 计算中间状态 +-+- intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-++ # # 5. 计算中间状态 +-++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-+ +-+- # 6. 并行计算 down_proj +-+- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-+- # --- [FIX] --- +-+- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-+- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-+- # --- [FIX END] --- +-++ # # 6. 并行计算 down_proj +-++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-++ # # --- [FIX] --- +-++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-++ # # --- [FIX END] --- +-+ +-+- # 7. 根据路由权重进行加权求和 +-+- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-++ # # 7. 根据路由权重进行加权求和 +-++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-+ +-+- return weighted_sum +-++ # return weighted_sum +-+ +-+ +-+ +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+index 0a0ef2d7..2842180e 100644 +-+--- a/patches/0001-20251104commit.patch +-++++ b/patches/0001-20251104commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+-Subject: [PATCH 1/4] 20251104commit +-++Subject: [PATCH 1/5] 20251104commit +-+ +-+ --- +-+ mindnlp/transformers/cache_utils.py | 28 +- +-+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+index 5185270c..c6cd8757 100644 +-+--- a/patches/0002-20251106commit.patch +-++++ b/patches/0002-20251106commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+-Subject: [PATCH 2/4] 20251106commit +-++Subject: [PATCH 2/5] 20251106commit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+index 3e05f821..601960c9 100644 +-+--- a/patches/0003-20261106secondcommit.patch +-++++ b/patches/0003-20261106secondcommit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+-Subject: [PATCH 3/4] 20261106secondcommit +-++Subject: [PATCH 3/5] 20261106secondcommit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-+index 88a1aef4..8976f10b 100644 +-+--- a/patches/0004-20251106change.patch +-++++ b/patches/0004-20251106change.patch +-+@@ -1,7 +1,7 @@ +-+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 15:48:09 +0800 +-+-Subject: [PATCH 4/4] 20251106change +-++Subject: [PATCH 4/5] 20251106change +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 189 +- +-+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-+new file mode 100644 +-+index 00000000..8d9032be +-+--- /dev/null +-++++ b/patches/0005-20251107001commit.patch +-+@@ -0,0 +1,7707 @@ +-++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Fri, 7 Nov 2025 11:48:18 +0800 +-++Subject: [PATCH 5/5] 20251107001commit +-++ +-++--- +-++ .../models/deepseek/modeling_deepseek.py | 91 +- +-++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +-++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +-++ patches/0001-20251104commit.patch | 2 +- +-++ patches/0002-20251106commit.patch | 2 +- +-++ patches/0003-20261106secondcommit.patch | 2 +- +-++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ +-++ 7 files changed, 7577 insertions(+), 30 deletions(-) +-++ create mode 100644 patches/0004-20251106change.patch +-++ +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index 0546f318..8831e4b7 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +-++ # expert_cache += expert_out * weight +-++ # return expert_cache +-++ +-++- @no_grad() +-++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++- # x 的 shape: (1, hidden_size) +-++- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++- +-++- # 1. 收集所有需要的专家层 +-++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++- selected_experts = [self.experts[i] for i in flat_expert_indices] +-++- +-++- # 2. 并行计算所有专家的输出 +-++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++- # ops.cat 会将它们堆叠成一个新的 Tensor +-++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++- +-++- # 3. 使用矩阵乘法进行加权求和 +-++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++- # 最终结果 final_output 的 shape: (1, hidden_size) +-++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+++ # @no_grad() +-+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ # # x 的 shape: (1, hidden_size) +-+++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+++ +-+++ # # 1. 收集所有需要的专家层 +-+++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+++ # selected_experts = [self.experts[i] for i in flat_expert_indices] +-+++ +-+++ # # 2. 并行计算所有专家的输出 +-+++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+++ # # ops.cat 会将它们堆叠成一个新的 Tensor +-+++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+++ +-+++ # # 3. 使用矩阵乘法进行加权求和 +-+++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ # # 最终结果 final_output 的 shape: (1, hidden_size) +-+++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++ +-++- return final_output +-+++ # return final_output +-++ +-++ +-++ # @no_grad() +-++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +-++ ) +-++ +-++ return expert_cache +-+++# 放置在 DeepseekMoE 类中 +-+++ @no_grad() +-+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ """ +-+++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-+++ +-+++ Args: +-+++ x (Tensor): 输入张量, shape: (1, hidden_size) +-+++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-+++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-+++ """ +-+++ top_k, _ = flat_expert_weights.shape +-+++ hidden_size = x.shape[-1] +-+++ +-+++ # 1. 将所有专家的权重堆叠起来 +-+++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-+++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-+++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-+++ +-+++ # 2. "收集" 所需的专家权重 +-+++ selected_gate_w = stacked_gate_w[flat_expert_indices] +-+++ selected_up_w = stacked_up_w[flat_expert_indices] +-+++ selected_down_w = stacked_down_w[flat_expert_indices] +-+++ +-+++ # 3. 准备输入 +-+++ x_expanded = x.expand((top_k, 1, hidden_size)) +-+++ +-+++ # 4. 并行计算 gate_proj 和 up_proj +-+++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-+++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-+++ +-+++ # 5. 计算中间状态 +-+++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-+++ +-+++ # 6. 并行计算 down_proj +-+++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-+++ # --- [FIX] --- +-+++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-+++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-+++ # --- [FIX END] --- +-+++ +-+++ # 7. 根据路由权重进行加权求和 +-+++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-+++ +-+++ return weighted_sum +-+++ +-+++ +-++ +-++ # @no_grad() +-++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++index ebd7782e..913a7609 100644 +-++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +-++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++- x1 = x[..., : x.shape[-1] // 2] +-++- x2 = x[..., x.shape[-1] // 2 :] +-+++ # x1 = x[..., : x.shape[-1] // 2] +-+++ # x2 = x[..., x.shape[-1] // 2 :] +-++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-++index d059dcbe..2b217b64 100644 +-++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +-++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++ def rotate_half(x): +-++ """Rotates half the hidden dims of the input.""" +-++- x1 = x[..., : x.shape[-1] // 2] +-++- x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++ # x1 = x[..., : x.shape[-1] // 2] +-+++ # x2 = x[..., x.shape[-1] // 2 :] +-+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++ return ops.cat((-x2, x1), dim=-1) +-++ +-++ +-++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++index 78f22642..0a0ef2d7 100644 +-++--- a/patches/0001-20251104commit.patch +-+++++ b/patches/0001-20251104commit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++-Subject: [PATCH 1/3] 20251104commit +-+++Subject: [PATCH 1/4] 20251104commit +-++ +-++ --- +-++ mindnlp/transformers/cache_utils.py | 28 +- +-++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-++index 22b65dd5..5185270c 100644 +-++--- a/patches/0002-20251106commit.patch +-+++++ b/patches/0002-20251106commit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-++-Subject: [PATCH 2/3] 20251106commit +-+++Subject: [PATCH 2/4] 20251106commit +-++ +-++ --- +-++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-++index 966529e4..3e05f821 100644 +-++--- a/patches/0003-20261106secondcommit.patch +-+++++ b/patches/0003-20261106secondcommit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-++-Subject: [PATCH 3/3] 20261106secondcommit +-+++Subject: [PATCH 3/4] 20261106secondcommit +-++ +-++ --- +-++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-++new file mode 100644 +-++index 00000000..88a1aef4 +-++--- /dev/null +-+++++ b/patches/0004-20251106change.patch +-++@@ -0,0 +1,7498 @@ +-+++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Thu, 6 Nov 2025 15:48:09 +0800 +-+++Subject: [PATCH 4/4] 20251106change +-+++ +-+++--- +-+++ .../models/deepseek/modeling_deepseek.py | 189 +- +-+++ patches/0001-20251104commit.patch | 1272 +++++++ +-+++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +-+++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +-+++ 4 files changed, 7244 insertions(+), 186 deletions(-) +-+++ create mode 100644 patches/0001-20251104commit.patch +-+++ create mode 100644 patches/0002-20251106commit.patch +-+++ create mode 100644 patches/0003-20261106secondcommit.patch +-+++ +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index 2f9192bf..0546f318 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +-+++ +-+++ return attn_output, attn_weights, past_key_value +-+++ +-+++-# class DeepseekFlashAttention(nn.Module): +-+++-# """ +-+++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-+++- +-+++-# This class is designed as a drop-in replacement for DeepseekAttention. +-+++-# """ +-+++- +-+++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+++-# super().__init__() +-+++-# self.config = config +-+++-# self.layer_idx = layer_idx +-+++-# if layer_idx is None: +-+++-# logger.warning( +-+++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++-# "when creating this class." +-+++-# ) +-+++- +-+++-# self.attention_dropout = config.attention_dropout +-+++-# self.hidden_size = config.hidden_size +-+++-# self.num_heads = config.num_attention_heads +-+++-# self.head_dim = self.hidden_size // self.num_heads +-+++-# self.num_key_value_heads = config.num_key_value_heads +-+++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++-# self.max_position_embeddings = config.max_position_embeddings +-+++-# self.rope_theta = config.rope_theta +-+++-# self.is_causal = True +-+++- +-+++-# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++-# raise ValueError( +-+++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++-# f" and `num_heads`: {self.num_heads})." +-+++-# ) +-+++- +-+++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+++-# self._init_rope() +-+++- +-+++-# def _init_rope(self): +-+++-# if self.config.rope_scaling is None: +-+++-# self.rotary_emb = DeepseekRotaryEmbedding( +-+++-# self.head_dim, +-+++-# max_position_embeddings=self.max_position_embeddings, +-+++-# base=self.rope_theta, +-+++-# ) +-+++-# else: +-+++-# scaling_type = self.config.rope_scaling["type"] +-+++-# scaling_factor = self.config.rope_scaling["factor"] +-+++-# if scaling_type == "linear": +-+++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+++-# self.head_dim, +-+++-# max_position_embeddings=self.max_position_embeddings, +-+++-# scaling_factor=scaling_factor, +-+++-# base=self.rope_theta, +-+++-# ) +-+++-# elif scaling_type == "dynamic": +-+++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+++-# self.head_dim, +-+++-# max_position_embeddings=self.max_position_embeddings, +-+++-# scaling_factor=scaling_factor, +-+++-# base=self.rope_theta, +-+++-# ) +-+++-# else: +-+++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+++- +-+++-# def forward( +-+++-# self, +-+++-# hidden_states: mindspore.Tensor, +-+++-# attention_mask: Optional[mindspore.Tensor] = None, +-+++-# position_ids: Optional[mindspore.Tensor] = None, +-+++-# past_key_value: Optional[Cache] = None, +-+++-# output_attentions: bool = False, +-+++-# use_cache: bool = False, +-+++-# **kwargs, +-+++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++-# if "padding_mask" in kwargs: +-+++-# warnings.warn( +-+++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+++-# ) +-+++- +-+++-# if output_attentions: +-+++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-+++- +-+++-# bsz, q_len, _ = hidden_states.shape +-+++- +-+++-# if self.config.pretraining_tp > 1: +-+++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+++- +-+++-# query_states = self.q_proj(hidden_states) +-+++-# key_states = self.k_proj(hidden_states) +-+++-# value_states = self.v_proj(hidden_states) +-+++- +-+++-# # Reshape for multi-head attention +-+++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++- +-+++-# kv_seq_len = key_states.shape[-2] +-+++-# if past_key_value is not None: +-+++-# if self.layer_idx is None: +-+++-# raise ValueError( +-+++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++-# "with a layer index." +-+++-# ) +-+++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++- +-+++-# # Apply Rotary Positional Embedding +-+++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++- +-+++-# if past_key_value is not None: +-+++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-+++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++- +-+++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-+++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-+++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++- +-+++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++- +-+++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++- +-+++-# # Convert attention_mask for flash_attention_score +-+++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-+++-# if attention_mask is not None: +-+++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-+++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+++-# raise ValueError( +-+++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+++-# ) +-+++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-+++-# else: +-+++-# attn_mask_for_fa = None +-+++- +-+++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+++- +-+++-# # Call the fused flash_attention_score operator +-+++-# attn_output = mindspore.ops.flash_attention_score( +-+++-# query=query_states_for_fa, +-+++-# key=key_states_for_fa, +-+++-# value=value_states_for_fa, +-+++-# head_num=self.num_heads, # This is N1, the number of query heads +-+++-# input_layout='BSH', +-+++-# attn_mask=attn_mask_for_fa, +-+++-# keep_prob=keep_prob, +-+++-# scalar_value=1.0 / math.sqrt(self.head_dim), +-+++-# sparse_mode=0 # Default mask mode +-+++-# ) +-+++- +-+++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-+++-# attn_output = self.o_proj(attn_output) +-+++- +-+++-# # Flash Attention does not return attention weights +-+++-# attn_weights = None +-+++- +-+++-# return attn_output, attn_weights, past_key_value +-+++ +-+++ class DeepseekFlashAttention(nn.Module): +-+++ """ +-+++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +-+++ super().__init__() +-+++ self.hidden_size = config.hidden_size +-+++ +-+++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-+++- config=config, layer_idx=layer_idx +-+++- ) +-++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-++++ # config=config, layer_idx=layer_idx +-++++ # ) +-+++ +-+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-+++ config=config, layer_idx=layer_idx +-+++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +-+++ return outputs +-+++ +-+++ +-+++- +-+++ class DeepseekPreTrainedModel(PreTrainedModel): +-+++ config_class = DeepseekConfig +-+++ base_model_prefix = "model" +-+++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++ # Initialize weights and apply final processing +-+++ self.post_init() +-+++ self.warm_up = False +-+++- #@dwj +-+++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-+++- self.num_layers, +-+++- self.num_attention_heads, +-+++- self.head_dim, +-+++- batch_size=1, +-+++- max_length=self.max_length, +-+++- dtype=mindspore.float16 +-+++- ) +-+++- +-+++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-+++- key_cache = [] +-+++- value_cache = [] +-+++- for _ in range(num_layers): +-+++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++- key_cache.append(k) +-+++- value_cache.append(v) +-+++- return key_cache, value_cache +-+++- +-+++ +-+++ def warmup_moe_model_deep(self): +-+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++new file mode 100644 +-+++index 00000000..78f22642 +-+++--- /dev/null +-++++++ b/patches/0001-20251104commit.patch +-+++@@ -0,0 +1,1272 @@ +-++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++++From: Pinoeer-kingxi <13022943007@163.com> +-++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++++Subject: [PATCH 1/3] 20251104commit +-++++ +-++++--- +-++++ mindnlp/transformers/cache_utils.py | 28 +- +-++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++++ 3 files changed, 976 insertions(+), 87 deletions(-) +-++++ +-++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++++index cadd2e04..02f8d4be 100644 +-++++--- a/mindnlp/transformers/cache_utils.py +-+++++++ b/mindnlp/transformers/cache_utils.py +-++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++++ # k_out[:, :, cache_position] = key_states +-++++ # v_out[:, :, cache_position] = value_states +-++++- if ON_ORANGE_PI: +-++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++- else: +-++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++- +-+++++ # if ON_ORANGE_PI: +-+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++ # else: +-+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++++ if cache_position.ndim > 1: +-+++++ cache_position = cache_position.flatten() +-+++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++++ cache_position = cache_position.int() +-+++++ +-+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++++ k_out[:, :, cache_position] = key_states +-+++++ v_out[:, :, cache_position] = value_states +-+++++ +-++++ return k_out, v_out +-++++ +-++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++index c695b944..d8303e45 100644 +-++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++++ def rotate_half(x): +-++++ """Rotates half the hidden dims of the input.""" +-++++- x1 = x[..., : x.shape[-1] // 2] +-++++- x2 = x[..., x.shape[-1] // 2 :] +-+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++ # x1 = x[..., : x.shape[-1] // 2] +-+++++ # x2 = x[..., x.shape[-1] // 2 :] +-+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++ return ops.cat((-x2, x1), dim=-1) +-++++ +-++++ +-++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++++ if self.training: +-++++ raise NotImplementedError("Training is not supported yet.") +-++++ else: +-++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++- if self.config.n_shared_experts is not None: +-++++- y = y + self.shared_experts(identity) +-++++- return y +-+++++ # @lwx +-+++++ if orig_shape[1] == 1: +-+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++++ y=y.view(*orig_shape) +-+++++ if self.config.n_shared_experts is not None: +-+++++ y = y + self.shared_experts(identity) +-+++++ return y +-+++++ else: +-+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++++ if self.config.n_shared_experts is not None: +-+++++ y = y + self.shared_experts(identity) +-+++++ return y +-+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++ # if self.config.n_shared_experts is not None: +-+++++ # y = y + self.shared_experts(identity) +-+++++ # return y +-+++++ +-+++++ @no_grad() +-+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++ +-+++++ expert_cache = ops.zeros_like(x) +-+++++ for i in range(self.num_experts_per_tok): +-+++++ expert_id = flat_expert_indices[i].item() +-+++++ weight = flat_expert_weights[i].item() +-+++++ expert = self.experts[expert_id] +-+++++ expert_out = expert(x) +-+++++ expert_cache += expert_out * weight +-+++++ return expert_cache +-++++ +-++++ @no_grad() +-++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++- # expert_cache = torch.zeros_like(x) +-++++- # idxs = flat_expert_indices.argsort() +-++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++- # token_idxs = idxs // self.num_experts_per_tok +-++++- # for i, end_idx in enumerate(tokens_per_expert): +-++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++- # if start_idx == end_idx: +-++++- # continue +-++++- # expert = self.experts[i] +-++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++++- # expert_tokens = x[exp_token_idx] +-++++- # expert_out = expert(expert_tokens) +-++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++- # return expert_cache +-+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++ expert_cache = ops.zeros_like(x) +-++++ idxs = flat_expert_indices.argsort() +-++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++ token_idxs = idxs // self.num_experts_per_tok +-+++++ +-++++ for i, end_idx in enumerate(tokens_per_expert): +-++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++ if start_idx == end_idx: +-++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++++ expert_out = expert(expert_tokens) +-++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-++++ return expert_cache +-+++++ +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # # expert_cache = torch.zeros_like(x) +-+++++ # # idxs = flat_expert_indices.argsort() +-+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++ # # token_idxs = idxs // self.num_experts_per_tok +-+++++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++ # # if start_idx == end_idx: +-+++++ # # continue +-+++++ # # expert = self.experts[i] +-+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # # expert_tokens = x[exp_token_idx] +-+++++ # # expert_out = expert(expert_tokens) +-+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++ # # return expert_cache +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # if start_idx == end_idx: +-+++++ # continue +-+++++ # expert = self.experts[i] +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = expert(expert_tokens) +-+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-+++++ # return expert_cache +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ +-+++++ # # 排序保证顺序一致 +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # # 找出有 token 的专家 +-+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++ +-+++++ # for i in active_experts.tolist(): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # end_idx = tokens_per_expert[i] +-+++++ # if start_idx == end_idx: # 没有 token +-+++++ # continue +-+++++ +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = self.experts[i](expert_tokens) +-+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++ +-+++++ # expert_cache = mindspore.mint.scatter_add( +-+++++ # expert_cache, +-+++++ # 0, +-+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++ # expert_out +-+++++ # ) +-+++++ +-+++++ # return expert_cache +-+++++ +-+++++ +-++++ +-++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++++ # """ +-++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-+++++ self.warm_up = False +-+++++ +-+++++ def warmup_moe_model_deep(self): +-+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++++ test_texts = [ +-+++++ "warmup short", +-+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++++ ] +-+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++ if tokenizer is None: +-+++++ from mindnlp.transformers import AutoTokenizer +-+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++ self._warmup_tokenizer = tokenizer +-+++++ +-+++++ for text in test_texts: +-+++++ inputs = tokenizer(text, return_tensors="ms") +-+++++ with mindspore._no_grad(): +-+++++ _ = self(**inputs, use_cache=False) +-+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++++ +-++++ def get_input_embeddings(self): +-++++ return self.model.embed_tokens +-++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++ ```""" +-+++++ if not self.warm_up: +-+++++ self.warm_up = True +-+++++ self.warmup_moe_model_deep() +-+++++ +-++++ output_attentions = ( +-++++ output_attentions +-++++ if output_attentions is not None +-++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++index 3cbf820e..d4c6b651 100644 +-++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++@@ -18,7 +18,6 @@ +-++++ # See the License for the specific language governing permissions and +-++++ # limitations under the License. +-++++ """MindSpore Qwen2MoE model.""" +-++++- +-++++ import math +-++++ from typing import List, Optional, Tuple, Union +-++++ +-++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++++ TokenClassifierOutput, +-++++ ) +-++++ from ...modeling_utils import PreTrainedModel +-+++++from ...generation import GenerationMixin +-++++ from ....utils import logging +-++++ from .configuration_qwen2_moe import Qwen2MoeConfig +-++++ +-++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++++ self.variance_epsilon = eps +-++++ +-++++ def forward(self, hidden_states): +-+++++ # @dwj +-+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++ # @lwx +-+++++ # if not self.training : +-+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++ input_dtype = hidden_states.dtype +-++++ hidden_states = hidden_states.to(mindspore.float32) +-++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++++@@ -234,6 +239,8 @@ def rotate_half(x): +-++++ """Rotates half the hidden dims of the input.""" +-++++ x1 = x[..., : x.shape[-1] // 2] +-++++ x2 = x[..., x.shape[-1] // 2 :] +-+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++ return ops.cat((-x2, x1), dim=-1) +-++++ +-++++ +-++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++++ self.config = config +-++++ self.hidden_size = config.hidden_size +-++++ self.intermediate_size = intermediate_size +-+++++ +-++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++++ self.act_fn = ACT2FN[config.hidden_act] +-++++ +-++++ def forward(self, x): +-++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++- +-++++ +-+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++ # @lwx +-+++++ # gate_up_output = self.gate_up_proj(x) +-+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++++ # return self.down_proj(swiglu_output) +-+++++ +-+++++ # def forward(self, x): +-+++++ # gate_proj_out = self.gate_proj(x) +-+++++ # up_proj_out = self.up_proj(x) +-+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++++ # return self.down_proj(swiglu_out) +-+++++ +-++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++++ """ +-++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++++ use_cache: bool = False, +-++++ cache_position: Optional[mindspore.Tensor] = None, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ +-+++++ +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ query_states = self.q_proj(hidden_states) +-++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++ "with a layer index." +-++++ ) +-++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ if isinstance(past_key_value, StaticCache): +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ else: +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++ if isinstance(past_key_value, StaticCache): +-+++++ kv_seq_len = key_states.shape[-2] +-++++ +-++++ # repeat k/v heads if n_kv_heads < n_heads +-++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++- +-+++++ +-++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++ +-++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++++- raise ValueError( +-++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++++- f" {attn_weights.shape}" +-++++- ) +-++++- +-++++- if attention_mask is not None: # no matter the length, we just slice it +-++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++++ if attention_mask is not None: +-+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++ attn_weights = attn_weights + causal_mask +-++++ +-++++ # upcast attention to fp32 +-++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++ +-++++ attn_output = self.o_proj(attn_output) +-++++- +-+++++ # @lwx +-+++++ +-+++++ # max_seq_len = self.max_position_embeddings # 2048 +-+++++ +-+++++ # if attention_mask is not None: +-+++++ # # attention_mask: [B, 1, Sq, Sk] +-+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++ +-+++++ # # pad 到 [max_seq_len, max_seq_len] +-+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++ # global_attention_mask = padded_mask +-+++++ # else: +-+++++ # global_attention_mask = None +-+++++ +-+++++ +-+++++ # sparse_mode=3 +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # real_shift=None, +-+++++ # padding_mask=None, +-+++++ +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=global_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ # input_layout="BNSD", +-+++++ # pre_tokens=2147483647, +-+++++ # next_tokens=2147483647, +-+++++ # inner_precise=0, +-+++++ # drop_mask=None, +-+++++ # prefix=None, +-+++++ # actual_seq_qlen=None, +-+++++ # actual_seq_kvlen=None, +-+++++ # sparse_mode=sparse_mode, +-+++++ # ) +-++++ if not output_attentions: +-++++ attn_weights = None +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ +-+++++class Qwen2MoeFlashAttention(nn.Module): +-+++++ """ +-+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++ +-+++++ 关键改动: +-+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++ 直接传入原始的 key 和 value 张量效率更高。 +-+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++ """ +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++ super().__init__() +-+++++ self.config = config +-+++++ self.layer_idx = layer_idx +-+++++ self.hidden_size = config.hidden_size +-+++++ self.num_heads = config.num_attention_heads +-+++++ self.head_dim = self.hidden_size // self.num_heads +-+++++ self.num_key_value_heads = config.num_key_value_heads +-+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++ self.max_position_embeddings = config.max_position_embeddings +-+++++ self.rope_theta = config.rope_theta +-+++++ self.attention_dropout = config.attention_dropout +-+++++ +-+++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++ raise ValueError( +-+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++ ) +-+++++ +-+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++ +-+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++ self.head_dim, +-+++++ max_position_embeddings=self.max_position_embeddings, +-+++++ base=self.rope_theta, +-+++++ ) +-+++++ +-+++++ def forward( +-+++++ self, +-+++++ hidden_states: mindspore.Tensor, +-+++++ attention_mask: Optional[mindspore.Tensor] = None, +-+++++ position_ids: Optional[mindspore.Tensor] = None, +-+++++ past_key_value: Optional[Cache] = None, +-+++++ output_attentions: bool = False, +-+++++ use_cache: bool = False, +-+++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # 1. 线性投射 Q, K, V +-+++++ query_states = self.q_proj(hidden_states) +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++ # query: [B, S, H*D] -> [B, N1, S, D] +-+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # 3. RoPE 旋转位置编码 +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++ if self.layer_idx is None: +-+++++ raise ValueError( +-+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ "with a layer index." +-+++++ ) +-+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++ if cache_position.shape[0] == 1: +-+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++ kv_seq_len = past_seen_tokens + 1 +-+++++ else: +-+++++ # prefill 阶段:cache_position 是范围,使用其长度 +-+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++ else: +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # 4. KV 缓存更新 +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ key_states, value_states = past_key_value.update( +-+++++ key_states, value_states, self.layer_idx, cache_kwargs +-+++++ ) +-+++++ +-+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++ if cache_position.shape[0] == 1: +-+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ +-+++++ # 5. [重要] 准备 Attention Mask +-+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++ fa_attention_mask = None +-+++++ if attention_mask is not None: +-+++++ # 截取与当前key长度匹配的部分 +-+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++ fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++ input_dtype = query_states.dtype +-+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++ query_states = query_states.to(mindspore.float16) +-+++++ key_states = key_states.to(mindspore.float16) +-+++++ value_states = value_states.to(mindspore.float16) +-+++++ +-+++++ # 6. [核心] 调用 flash_attention_score 算子 +-+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++ attn_output = mindspore.ops.flash_attention_score( +-+++++ query=query_states, +-+++++ key=key_states, +-+++++ value=value_states, +-+++++ head_num=self.num_heads, # 传入Q的头数(N1) +-+++++ attn_mask=fa_attention_mask, +-+++++ keep_prob=1.0 - self.attention_dropout, +-+++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ input_layout="BNSD", +-+++++ sparse_mode=0 # 使用 defaultMask 模式 +-+++++ ) +-+++++ +-+++++ # 恢复原始数据类型 +-+++++ attn_output = attn_output.to(input_dtype) +-+++++ +-+++++ # 7. 调整输出形状 +-+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++ attn_weights = None +-+++++ if output_attentions: +-+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ # def forward( +-+++++ # self, +-+++++ # hidden_states: mindspore.Tensor, +-+++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++ # past_key_value: Optional[Cache] = None, +-+++++ # output_attentions: bool = False, +-+++++ # use_cache: bool = False, +-+++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ # bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # # 1. 线性投射 Q, K, V +-+++++ # query_states = self.q_proj(hidden_states) +-+++++ # key_states = self.k_proj(hidden_states) +-+++++ # value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # # 3. RoPE 旋转位置编码 +-+++++ # kv_seq_len = key_states.shape[-2] +-+++++ # if past_key_value is not None: +-+++++ # if self.layer_idx is None: +-+++++ # raise ValueError( +-+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ # "with a layer index." +-+++++ # ) +-+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # # 4. KV 缓存更新 +-+++++ # if past_key_value is not None: +-+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ # key_states, value_states = past_key_value.update( +-+++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++ # ) +-+++++ +-+++++ # # 5. 准备 Attention Mask +-+++++ # fa_attention_mask = None +-+++++ # if attention_mask is not None: +-+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++ # input_dtype = query_states.dtype +-+++++ +-+++++ # # 6. [核心] 调用 flash_attention_score 算子 +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=fa_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ # input_layout="BNSD", +-+++++ # sparse_mode=0, +-+++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++ # inner_precise=1 +-+++++ # ) +-+++++ +-+++++ # # 恢复原始数据类型 +-+++++ # attn_output = attn_output.to(input_dtype) +-+++++ +-+++++ # # 7. 调整输出形状 +-+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ # attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # attn_weights = None +-+++++ # if output_attentions: +-+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++ # return attn_output, attn_weights, past_key_value +-+++++ +-+++++ # def forward( +-+++++ # self, +-+++++ # hidden_states: mindspore.Tensor, +-+++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++ # past_key_value: Optional[Cache] = None, +-+++++ # output_attentions: bool = False, +-+++++ # use_cache: bool = False, +-+++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ # bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ # query_states = self.q_proj(hidden_states) +-+++++ # key_states = self.k_proj(hidden_states) +-+++++ # value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ # kv_seq_len = key_states.shape[-2] +-+++++ # if past_key_value is not None: +-+++++ # if self.layer_idx is None: +-+++++ # raise ValueError("`layer_idx` must be specified for caching") +-+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ # if past_key_value is not None: +-+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ # key_states, value_states = past_key_value.update( +-+++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++ # ) +-+++++ +-+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++ +-+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++++ # query_states = query_states / math.sqrt(self.head_dim) +-+++++ # # <--- 修改结束 --- +-+++++ +-+++++ # fa_attention_mask = None +-+++++ # if attention_mask is not None: +-+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ # fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ # input_dtype = query_states.dtype +-+++++ +-+++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++ # query=query_states, # 传入已经预先缩放过的 query +-+++++ # key=key_states, +-+++++ # value=value_states, +-+++++ # head_num=self.num_heads, +-+++++ # attn_mask=fa_attention_mask, +-+++++ # keep_prob=1.0 - self.attention_dropout, +-+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++++ # input_layout="BNSD", +-+++++ # sparse_mode=0, +-+++++ # inner_precise=1 # 仍然保持内部高精度计算 +-+++++ # ) +-+++++ +-+++++ # attn_output = attn_output.to(input_dtype) +-+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ # attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # attn_weights = None +-+++++ # if output_attentions: +-+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++++ +-+++++ # return attn_output, attn_weights, past_key_value +-+++++ +-++++ QWEN2MOE_ATTENTION_CLASSES = { +-++++ "eager": Qwen2MoeAttention, +-+++++ "flash-attention": Qwen2MoeFlashAttention, +-++++ } +-++++ +-++++ +-++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-+++++ #@dwj +-+++++ # 只遍历激活的专家,而非全部专家 +-++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- hidden_states = hidden_states.view(-1, hidden_dim) +-++++- # router_logits: (batch * sequence_length, n_experts) +-++++- router_logits = self.gate(hidden_states) +-++++- +-++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- if self.norm_topk_prob: +-++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- # we cast back to the input dtype +-++++- routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++- final_hidden_states = ops.zeros( +-++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++++- ) +-++++- +-++++- # One hot encode the selected experts to create an expert mask +-++++- # this will be used to easily index which expert is going to be sollicitated +-++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++++- +-++++- # Loop over all available experts in the model and perform the computation on each expert +-++++- for expert_idx in range(self.num_experts): +-++++- expert_layer = self.experts[expert_idx] +-++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++++- +-++++- # Index the correct hidden states and compute the expert hidden state for +-++++- # the current expert. We need to make sure to multiply the output hidden +-++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++++- if 0 not in idx.shape: +-++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++++- +-++++- # However `index_add_` only support torch tensors for indexing so we'll use +-++++- # the `top_x` tensor here. +-++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++++- +-++++- shared_expert_output = self.shared_expert(hidden_states) +-++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++++- +-++++- final_hidden_states = final_hidden_states + shared_expert_output +-+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++ num_tokens = hidden_states_reshaped.shape[0] +-+++++ +-+++++ router_logits = self.gate(hidden_states_reshaped) +-+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++ if self.norm_topk_prob: +-+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++ flat_selected_experts = selected_experts.flatten() +-+++++ +-+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++ token_indices = broadcasted_token_indices.flatten() +-+++++ +-+++++ active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++ for expert_idx_tensor in active_experts: +-+++++ expert_idx = expert_idx_tensor.item() +-+++++ expert_layer = self.experts[expert_idx] +-+++++ +-+++++ mask = (flat_selected_experts == expert_idx_tensor) +-+++++ selected_token_indices = token_indices[mask] +-+++++ selected_routing_weights = routing_weights.flatten()[mask] +-+++++ +-+++++ current_states = hidden_states_reshaped[selected_token_indices] +-+++++ +-+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++ +-+++++ final_hidden_states = final_hidden_states.index_add( +-+++++ dim=0, +-+++++ index=selected_token_indices, +-+++++ source=expert_output.to(hidden_states.dtype) +-+++++ ) +-+++++ +-+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++ +-++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++- return final_hidden_states, router_logits +-+++++ final_hidden_states = final_hidden_states + shared_expert_output +-+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++ return final_hidden_states, router_logits +-++++ +-++++ +-++++ class Qwen2MoeDecoderLayer(nn.Module): +-++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++++ +-++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++ +-++++ if (layer_idx not in config.mlp_only_layers) and ( +-++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++++ ): +-++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++++ _skip_keys_device_placement = "past_key_values" +-++++ _supports_cache_class = True +-+++++#lwx +-+++++ # _supports_static_cache = True +-++++ +-++++ def _init_weights(self, module): +-++++ std = self.config.initializer_range +-++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++++ return causal_mask +-++++ +-++++ +-++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ _tied_weights_keys = ["lm_head.weight"] +-++++ +-++++ def __init__(self, config): +-++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ self.num_experts_per_tok = config.num_experts_per_tok +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-+++++ # @lwx +-+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++++ # self.generation_config.cache_implementation = "static" +-+++++ self._warmed_up = False +-+++++ +-+++++ def warmup_moe_model(self): +-+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++++ test_texts = [ +-+++++ "warmup short", +-+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++++ ] +-+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++ if tokenizer is None: +-+++++ from mindnlp.transformers import AutoTokenizer +-+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++ self._warmup_tokenizer = tokenizer +-+++++ +-+++++ for text in test_texts: +-+++++ inputs = tokenizer(text, return_tensors="ms") +-+++++ with mindspore._no_grad(): +-+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++++ +-++++ def get_input_embeddings(self): +-++++ return self.model.embed_tokens +-++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++ ```""" +-+++++ if not self._warmed_up: +-+++++ self._warmed_up = True +-+++++ self.warmup_moe_model() +-++++ +-++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++++ output_router_logits = ( +-++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++ } +-++++ ) +-++++ return model_inputs +-+++++# @lwx +-+++++ # def _decode_one_tokens_logits( +-+++++ # self, +-+++++ # cur_token: mindspore.Tensor, +-+++++ # input_pos: Optional[mindspore.Tensor], +-+++++ # cache_position: mindspore.Tensor, +-+++++ # past_key_values: StaticCache, +-+++++ # ) -> mindspore.Tensor: +-+++++ # """ +-+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++++ +-+++++ # Args: +-+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++++ # input_pos: 输入位置信息,可选 +-+++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++++ +-+++++ # Returns: +-+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++++ # """ +-+++++ # # 调用JIT编译的版本 +-+++++ # return self.get_decode_one_tokens_logits( +-+++++ # cur_token=cur_token, +-+++++ # input_pos=input_pos, +-+++++ # cache_position=cache_position, +-+++++ # past_key_values=past_key_values, +-+++++ # ) +-+++++ +-+++++ # @mindspore.jit(jit_level='O1') +-+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++++ # """ +-+++++ # JIT编译的函数,用于高效的单token解码 +-+++++ # 使用JIT编译优化以支持静态shape和高效执行 +-+++++ +-+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++++ # """ +-+++++ # outputs = self.model.forward( +-+++++ # input_ids=cur_token, +-+++++ # position_ids=input_pos, +-+++++ # cache_position=cache_position, +-+++++ # past_key_values=past_key_values, +-+++++ # use_cache=True, +-+++++ # return_dict=False, +-+++++ # ) +-+++++ +-+++++ # hidden_states = outputs[0] +-+++++ # logits = self.lm_head.forward(hidden_states) +-+++++ # logits = logits.float() +-+++++ +-+++++ # return logits[:, -1, :] +-+++++ +-+++++ # def _sample( +-+++++ # self, +-+++++ # input_ids: mindspore.Tensor, +-+++++ # logits_processor, +-+++++ # stopping_criteria, +-+++++ # generation_config, +-+++++ # synced_devices: bool, +-+++++ # streamer=None, +-+++++ # logits_warper=None, +-+++++ # **model_kwargs, +-+++++ # ): +-+++++ # """ +-+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++++ # """ +-+++++ # from ...generation.logits_process import LogitsProcessorList +-+++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++++ # from mindnlp.core import nn, ops, no_grad +-+++++ # import numpy as np +-+++++ +-+++++ # # 检查是否使用 StaticCache +-+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++++ # # 否则,直接调用父类方法 +-+++++ # past_key_values = model_kwargs.get("past_key_values") +-+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++++ +-+++++ # if not isinstance(past_key_values, StaticCache): +-+++++ # # 不使用 StaticCache,直接调用父类方法 +-+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++++ # return super()._sample( +-+++++ # input_ids=input_ids, +-+++++ # logits_processor=logits_processor, +-+++++ # stopping_criteria=stopping_criteria, +-+++++ # generation_config=generation_config, +-+++++ # synced_devices=synced_devices, +-+++++ # streamer=streamer, +-+++++ # logits_warper=logits_warper, +-+++++ # **model_kwargs, +-+++++ # ) +-+++++ +-+++++ # # 使用 StaticCache,进入自定义循环 +-+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++++ # pad_token_id = generation_config._pad_token_tensor +-+++++ # output_attentions = generation_config.output_attentions +-+++++ # output_hidden_states = generation_config.output_hidden_states +-+++++ # output_scores = generation_config.output_scores +-+++++ # output_logits = generation_config.output_logits +-+++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++++ # max_length = generation_config.max_length +-+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++++ # do_sample = generation_config.do_sample +-+++++ +-+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++++ # raise ValueError( +-+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++++ # f"{logits_warper})." +-+++++ # ) +-+++++ +-+++++ # # init attention / hidden states / scores tuples +-+++++ # scores = () if (return_dict_in_generate and output_scores) else None +-+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++++ +-+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++++ # encoder_hidden_states = ( +-+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++++ # ) +-+++++ +-+++++ # # keep track of which sequences are already finished +-+++++ # batch_size, cur_len = input_ids.shape +-+++++ # this_peer_finished = False +-+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++++ +-+++++ # time_record = [] +-+++++ # from ....utils.testing_utils import parse_flag_from_env +-+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++++ +-+++++ # while self._has_unfinished_sequences( +-+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++++ # ): +-+++++ # if _record_time: +-+++++ # import time as time_module +-+++++ # infer_start = time_module.time() +-+++++ +-+++++ # # prepare model inputs +-+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++++ +-+++++ # # prepare variable output controls +-+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++++ +-+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++++ # cur_cache_position = model_inputs.get("cache_position") +-+++++ # cur_past_key_values = model_inputs.get("past_key_values") +-+++++ # cur_input_ids = model_inputs.get("input_ids") +-+++++ +-+++++ # if (isinstance(cur_past_key_values, StaticCache) and +-+++++ # cur_cache_position is not None and +-+++++ # len(cur_cache_position.shape) > 0 and +-+++++ # cur_cache_position.shape[0] == 1 and +-+++++ # cur_input_ids is not None and +-+++++ # cur_input_ids.shape[1] == 1): +-+++++ # # 使用 JIT 优化的单 token 解码 +-+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++++ # if not hasattr(self, '_jit_used'): +-+++++ # self._jit_used = False +-+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++++ +-+++++ # next_token_logits = self.get_decode_one_tokens_logits( +-+++++ # cur_token=cur_input_ids, +-+++++ # input_pos=model_inputs.get("position_ids"), +-+++++ # cache_position=cur_cache_position, +-+++++ # past_key_values=cur_past_key_values, +-+++++ # ) +-+++++ +-+++++ # # 标记已使用JIT(用于后续判断) +-+++++ # if not self._jit_used: +-+++++ # self._jit_used = True +-+++++ +-+++++ # # 构造兼容的输出对象 +-+++++ # class JitOptimizedOutput: +-+++++ # def __init__(self, logits, config): +-+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++++ # self.config = config +-+++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++++ # self.attentions = None if not config.is_encoder_decoder else None +-+++++ # self.cross_attentions = None +-+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++++ +-+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++++ # else: +-+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++++ # outputs = self(**model_inputs, return_dict=True) +-+++++ +-+++++ # if synced_devices and this_peer_finished: +-+++++ # continue +-+++++ +-+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++++ # next_token_logits = outputs.logits[:, -1, :] +-+++++ +-+++++ # # pre-process distribution +-+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++++ # if do_sample: +-+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++++ +-+++++ # # Store scores, attentions and hidden_states when required +-+++++ # if return_dict_in_generate: +-+++++ # if output_scores: +-+++++ # scores += (next_token_scores,) +-+++++ # if output_logits: +-+++++ # raw_logits += (next_token_logits,) +-+++++ # if output_attentions: +-+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++++ # if self.config.is_encoder_decoder: +-+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++++ +-+++++ # if output_hidden_states: +-+++++ # hidden = ( +-+++++ # outputs.decoder_hidden_states +-+++++ # if self.config.is_encoder_decoder +-+++++ # else outputs.hidden_states +-+++++ # ) +-+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++++ +-+++++ # # token selection +-+++++ # if do_sample: +-+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++++ # else: +-+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++++ +-+++++ # # finished sentences should have their next token be a padding token +-+++++ # if has_eos_stopping_criteria: +-+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++++ +-+++++ # # update generated ids, model inputs, and length for next step +-+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++++ # if streamer is not None: +-+++++ # streamer.put(next_tokens) +-+++++ +-+++++ # model_kwargs = self._update_model_kwargs_for_generation( +-+++++ # outputs, +-+++++ # model_kwargs, +-+++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++++ # ) +-+++++ +-+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++++ # cur_len += 1 +-+++++ +-+++++ # if _record_time: +-+++++ # import time as time_module +-+++++ # infer_stop = time_module.time() +-+++++ # time_record.append(infer_stop - infer_start) +-+++++ +-+++++ # del outputs +-+++++ +-+++++ # average_infer_time = None +-+++++ # if time_record: +-+++++ # if len(time_record) > 1: +-+++++ # time_record.pop(0) +-+++++ # average_infer_time = sum(time_record) / len(time_record) +-+++++ # print(f'average inference time is: {average_infer_time}') +-+++++ # print(f'inference time record: {time_record}') +-+++++ +-+++++ # if streamer is not None: +-+++++ # streamer.end() +-+++++ +-+++++ # # 简单判断:打印是否使用了JIT路径 +-+++++ # if hasattr(self, '_jit_used') and self._jit_used: +-+++++ # print("[JIT] ✓ JIT optimization was used during generation") +-+++++ # else: +-+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++++ +-+++++ # if return_dict_in_generate: +-+++++ # if self.config.is_encoder_decoder: +-+++++ # return GenerateEncoderDecoderOutput( +-+++++ # sequences=input_ids, +-+++++ # scores=scores, +-+++++ # logits=raw_logits, +-+++++ # encoder_attentions=encoder_attentions, +-+++++ # encoder_hidden_states=encoder_hidden_states, +-+++++ # decoder_attentions=decoder_attentions, +-+++++ # cross_attentions=cross_attentions, +-+++++ # decoder_hidden_states=decoder_hidden_states, +-+++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++ # average_infer_time=average_infer_time +-+++++ # ) +-+++++ # else: +-+++++ # return GenerateDecoderOnlyOutput( +-+++++ # sequences=input_ids, +-+++++ # scores=scores, +-+++++ # logits=raw_logits, +-+++++ # attentions=decoder_attentions, +-+++++ # hidden_states=decoder_hidden_states, +-+++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++ # average_infer_time=average_infer_time +-+++++ # ) +-+++++ # else: +-+++++ # return input_ids +-+++++ +-+++++ # def _prepare_cache_for_generation( +-+++++ # self, +-+++++ # generation_config, +-+++++ # model_kwargs, +-+++++ # assistant_model, +-+++++ # batch_size, +-+++++ # max_cache_length, +-+++++ # ): +-+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++++ # generation_config.cache_implementation = "static" +-+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++++ +-+++++ # if generation_config.cache_implementation == "static": +-+++++ # base_required_from_max_length = generation_config.max_length + 1 +-+++++ # base_required = max(max_cache_length, base_required_from_max_length) +-+++++ # min_cache_size = 50 +-+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++++ # else: +-+++++ # max_cache_length = max(base_required, min_cache_size) +-+++++ +-+++++ # original_max_cache_length = max_cache_length +-+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++++ # print(f" - final max_cache_length: {max_cache_length}") +-+++++ +-+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++ # if max_cache_length > self.config.max_position_embeddings: +-+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++++ +-+++++ # result = super()._prepare_cache_for_generation( +-+++++ # generation_config=generation_config, +-+++++ # model_kwargs=model_kwargs, +-+++++ # assistant_model=assistant_model, +-+++++ # batch_size=batch_size, +-+++++ # max_cache_length=max_cache_length, +-+++++ # ) +-+++++ +-+++++ # if generation_config.cache_implementation == "static": +-+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++++ # created_cache = model_kwargs.get(cache_name) +-+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++++ # if created_cache.max_cache_len < generation_config.max_length: +-+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++++ +-+++++ # return result +-+++++ +-+++++ +-+++++ +-++++ +-++++ +-++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++++-- +-++++2.27.0 +-++++ +-+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+++new file mode 100644 +-+++index 00000000..22b65dd5 +-+++--- /dev/null +-++++++ b/patches/0002-20251106commit.patch +-+++@@ -0,0 +1,3200 @@ +-++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-++++From: Pinoeer-kingxi <13022943007@163.com> +-++++Date: Thu, 6 Nov 2025 09:20:38 +0800 +-++++Subject: [PATCH 2/3] 20251106commit +-++++ +-++++--- +-++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +-++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +-++++ 3 files changed, 2689 insertions(+), 305 deletions(-) +-++++ create mode 100644 patches/0001-20251104commit.patch +-++++ +-++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++index d8303e45..73773c22 100644 +-++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +-++++ # y = y + self.shared_experts(identity) +-++++ # return y +-++++ +-+++++ # @no_grad() +-+++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++ +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ # for i in range(self.num_experts_per_tok): +-+++++ # expert_id = flat_expert_indices[i].item() +-+++++ # weight = flat_expert_weights[i].item() +-+++++ # expert = self.experts[expert_id] +-+++++ # expert_out = expert(x) +-+++++ # expert_cache += expert_out * weight +-+++++ # return expert_cache +-+++++ +-++++ @no_grad() +-++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # x 的 shape: (1, hidden_size) +-+++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+++++ +-+++++ # 1. 收集所有需要的专家层 +-+++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-+++++ +-+++++ # 2. 并行计算所有专家的输出 +-+++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+++++ # ops.cat 会将它们堆叠成一个新的 Tensor +-+++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+++++ +-+++++ # 3. 使用矩阵乘法进行加权求和 +-+++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++++ # 最终结果 final_output 的 shape: (1, hidden_size) +-+++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+++++ +-+++++ return final_output +-++++ +-++++- expert_cache = ops.zeros_like(x) +-++++- for i in range(self.num_experts_per_tok): +-++++- expert_id = flat_expert_indices[i].item() +-++++- weight = flat_expert_weights[i].item() +-++++- expert = self.experts[expert_id] +-++++- expert_out = expert(x) +-++++- expert_cache += expert_out * weight +-++++- return expert_cache +-++++ +-++++ @no_grad() +-++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++ # @lwx +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-+++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-+++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-++++ +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ +-+++++# class DeepseekFlashAttention(nn.Module): +-+++++# """ +-+++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-+++++ +-+++++# This class is designed as a drop-in replacement for DeepseekAttention. +-+++++# """ +-+++++ +-+++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+++++# super().__init__() +-+++++# self.config = config +-+++++# self.layer_idx = layer_idx +-+++++# if layer_idx is None: +-+++++# logger.warning( +-+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++++# "when creating this class." +-+++++# ) +-+++++ +-+++++# self.attention_dropout = config.attention_dropout +-+++++# self.hidden_size = config.hidden_size +-+++++# self.num_heads = config.num_attention_heads +-+++++# self.head_dim = self.hidden_size // self.num_heads +-+++++# self.num_key_value_heads = config.num_key_value_heads +-+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++# self.max_position_embeddings = config.max_position_embeddings +-+++++# self.rope_theta = config.rope_theta +-+++++# self.is_causal = True +-+++++ +-+++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++# raise ValueError( +-+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++++# f" and `num_heads`: {self.num_heads})." +-+++++# ) +-+++++ +-+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+++++# self._init_rope() +-+++++ +-+++++# def _init_rope(self): +-+++++# if self.config.rope_scaling is None: +-+++++# self.rotary_emb = DeepseekRotaryEmbedding( +-+++++# self.head_dim, +-+++++# max_position_embeddings=self.max_position_embeddings, +-+++++# base=self.rope_theta, +-+++++# ) +-+++++# else: +-+++++# scaling_type = self.config.rope_scaling["type"] +-+++++# scaling_factor = self.config.rope_scaling["factor"] +-+++++# if scaling_type == "linear": +-+++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+++++# self.head_dim, +-+++++# max_position_embeddings=self.max_position_embeddings, +-+++++# scaling_factor=scaling_factor, +-+++++# base=self.rope_theta, +-+++++# ) +-+++++# elif scaling_type == "dynamic": +-+++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+++++# self.head_dim, +-+++++# max_position_embeddings=self.max_position_embeddings, +-+++++# scaling_factor=scaling_factor, +-+++++# base=self.rope_theta, +-+++++# ) +-+++++# else: +-+++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+++++ +-+++++# def forward( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# attention_mask: Optional[mindspore.Tensor] = None, +-+++++# position_ids: Optional[mindspore.Tensor] = None, +-+++++# past_key_value: Optional[Cache] = None, +-+++++# output_attentions: bool = False, +-+++++# use_cache: bool = False, +-+++++# **kwargs, +-+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++# if "padding_mask" in kwargs: +-+++++# warnings.warn( +-+++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+++++# ) +-+++++ +-+++++# if output_attentions: +-+++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-+++++ +-+++++# bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++# if self.config.pretraining_tp > 1: +-+++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+++++ +-+++++# query_states = self.q_proj(hidden_states) +-+++++# key_states = self.k_proj(hidden_states) +-+++++# value_states = self.v_proj(hidden_states) +-+++++ +-+++++# # Reshape for multi-head attention +-+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++# kv_seq_len = key_states.shape[-2] +-+++++# if past_key_value is not None: +-+++++# if self.layer_idx is None: +-+++++# raise ValueError( +-+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++# "with a layer index." +-+++++# ) +-+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++# # Apply Rotary Positional Embedding +-+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++# if past_key_value is not None: +-+++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-+++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-+++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ +-+++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++++ +-+++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-+++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-+++++ +-+++++# # Convert attention_mask for flash_attention_score +-+++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-+++++# if attention_mask is not None: +-+++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-+++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+++++# raise ValueError( +-+++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+++++# ) +-+++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-+++++# else: +-+++++# attn_mask_for_fa = None +-+++++ +-+++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+++++ +-+++++# # Call the fused flash_attention_score operator +-+++++# attn_output = mindspore.ops.flash_attention_score( +-+++++# query=query_states_for_fa, +-+++++# key=key_states_for_fa, +-+++++# value=value_states_for_fa, +-+++++# head_num=self.num_heads, # This is N1, the number of query heads +-+++++# input_layout='BSH', +-+++++# attn_mask=attn_mask_for_fa, +-+++++# keep_prob=keep_prob, +-+++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++# sparse_mode=0 # Default mask mode +-+++++# ) +-+++++ +-+++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-+++++# attn_output = self.o_proj(attn_output) +-+++++ +-+++++# # Flash Attention does not return attention weights +-+++++# attn_weights = None +-+++++ +-+++++# return attn_output, attn_weights, past_key_value +-+++++ +-+++++class DeepseekFlashAttention(nn.Module): +-+++++ """ +-+++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-+++++ This implementation is a drop-in replacement for the original DeepseekAttention class, +-+++++ designed for high performance on supported hardware (Ascend). +-+++++ +-+++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-+++++ """ +-+++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-+++++ super().__init__() +-+++++ self.config = config +-+++++ self.layer_idx = layer_idx +-+++++ if layer_idx is None: +-+++++ logger.warning( +-+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++++ "when creating this class." +-+++++ ) +-+++++ +-+++++ # --- [FIX] Correctly initialize all required attributes --- +-+++++ self.attention_dropout = config.attention_dropout +-+++++ self.hidden_size = config.hidden_size +-+++++ self.num_heads = config.num_attention_heads +-+++++ self.head_dim = self.hidden_size // self.num_heads +-+++++ self.num_key_value_heads = config.num_key_value_heads +-+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++ self.max_position_embeddings = config.max_position_embeddings +-+++++ self.rope_theta = config.rope_theta +-+++++ self.is_causal = True +-+++++ +-+++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++ raise ValueError( +-+++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++++ f" and `num_heads`: {self.num_heads})." +-+++++ ) +-+++++ +-+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-+++++ +-+++++ # This call will now succeed as all attributes are initialized. +-+++++ self._init_rope() +-+++++ +-+++++ def _init_rope(self): +-+++++ if self.config.rope_scaling is None: +-+++++ self.rotary_emb = DeepseekRotaryEmbedding( +-+++++ self.head_dim, +-+++++ max_position_embeddings=self.max_position_embeddings, +-+++++ base=self.rope_theta, +-+++++ ) +-+++++ else: +-+++++ scaling_type = self.config.rope_scaling["type"] +-+++++ scaling_factor = self.config.rope_scaling["factor"] +-+++++ if scaling_type == "linear": +-+++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-+++++ self.head_dim, +-+++++ max_position_embeddings=self.max_position_embeddings, +-+++++ scaling_factor=scaling_factor, +-+++++ base=self.rope_theta, +-+++++ ) +-+++++ elif scaling_type == "dynamic": +-+++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-+++++ self.head_dim, +-+++++ max_position_embeddings=self.max_position_embeddings, +-+++++ scaling_factor=scaling_factor, +-+++++ base=self.rope_theta, +-+++++ ) +-+++++ else: +-+++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-+++++ +-+++++ def forward( +-+++++ self, +-+++++ hidden_states: mindspore.Tensor, +-+++++ attention_mask: Optional[mindspore.Tensor] = None, +-+++++ position_ids: Optional[mindspore.Tensor] = None, +-+++++ past_key_value: Optional[Cache] = None, +-+++++ output_attentions: bool = False, +-+++++ use_cache: bool = False, +-+++++ **kwargs, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ if "padding_mask" in kwargs: +-+++++ warnings.warn( +-+++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-+++++ ) +-+++++ if output_attentions: +-+++++ warnings.warn( +-+++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-+++++ ) +-+++++ +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ if self.config.pretraining_tp > 1: +-+++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-+++++ +-+++++ query_states = self.q_proj(hidden_states) +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++ # Apply Rotary Position Embedding +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos} +-+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-+++++ # So we must explicitly repeat the KV heads. +-+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++ +-+++++ # Convert attention mask for flash_attention_score +-+++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-+++++ if attention_mask is not None: +-+++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-+++++ raise ValueError( +-+++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-+++++ ) +-+++++ attn_mask_for_fa = attention_mask < 0 +-+++++ else: +-+++++ attn_mask_for_fa = None +-+++++ +-+++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-+++++ +-+++++ # Call the fused operator using the efficient BNSD layout +-+++++ attn_output = mindspore.ops.flash_attention_score( +-+++++ query=query_states, +-+++++ key=key_states, +-+++++ value=value_states, +-+++++ head_num=self.num_heads, +-+++++ input_layout='BNSD', # Specify the correct layout +-+++++ attn_mask=attn_mask_for_fa, +-+++++ keep_prob=keep_prob, +-+++++ scalar_value=1.0 / math.sqrt(self.head_dim) +-+++++ ) +-+++++ +-+++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ +-+++++ # Apply output projection +-+++++ attn_output = self.o_proj(attn_output) +-+++++ +-+++++ # Flash attention does not return attention weights, so we return None. +-+++++ attn_weights = None +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-++++ Deepseek_ATTENTION_CLASSES = { +-++++ "eager": DeepseekAttention, +-+++++ "flash-attention": DeepseekFlashAttention, +-++++ } +-++++ +-++++ +-++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +-++++ config=config, layer_idx=layer_idx +-++++ ) +-++++ +-+++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-+++++ config=config, layer_idx=layer_idx +-+++++ ) +-+++++ +-++++ self.mlp = ( +-++++ DeepseekMoE(config) +-++++ if ( +-++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++index d4c6b651..bced285c 100644 +-++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +-++++ +-++++ import mindspore +-++++ import mindnlp.core.nn.functional as F +-++++-from mindnlp.core import nn, ops +-+++++from mindnlp.core import nn, ops, no_grad +-++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +-++++ +-++++ from ....common.activations import ACT2FN +-++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +-++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-++++ +-+++++Long_Prompt = False +-+++++PROMPT_LENGTH_THRESHOLD = 128 +-++++ +-++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-++++ def _prepare_4d_causal_attention_mask_with_cache_position( +-++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++ +-+++++# class Qwen2MoeFlashAttention(nn.Module): +-+++++# """ +-+++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++ +-+++++# 关键改动: +-+++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++# 直接传入原始的 key 和 value 张量效率更高。 +-+++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++# super().__init__() +-+++++# self.config = config +-+++++# self.layer_idx = layer_idx +-+++++# self.hidden_size = config.hidden_size +-+++++# self.num_heads = config.num_attention_heads +-+++++# self.head_dim = self.hidden_size // self.num_heads +-+++++# self.num_key_value_heads = config.num_key_value_heads +-+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++# self.max_position_embeddings = config.max_position_embeddings +-+++++# self.rope_theta = config.rope_theta +-+++++# self.attention_dropout = config.attention_dropout +-+++++ +-+++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++# raise ValueError( +-+++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++# ) +-+++++ +-+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++ +-+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++# self.head_dim, +-+++++# max_position_embeddings=self.max_position_embeddings, +-+++++# base=self.rope_theta, +-+++++# ) +-+++++ +-+++++# def forward( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# attention_mask: Optional[mindspore.Tensor] = None, +-+++++# position_ids: Optional[mindspore.Tensor] = None, +-+++++# past_key_value: Optional[Cache] = None, +-+++++# output_attentions: bool = False, +-+++++# use_cache: bool = False, +-+++++# cache_position: Optional[mindspore.Tensor] = None, +-+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++# bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++# # 1. 线性投射 Q, K, V +-+++++# query_states = self.q_proj(hidden_states) +-+++++# key_states = self.k_proj(hidden_states) +-+++++# value_states = self.v_proj(hidden_states) +-+++++ +-+++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++# # query: [B, S, H*D] -> [B, N1, S, D] +-+++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++# # 3. RoPE 旋转位置编码 +-+++++# kv_seq_len = key_states.shape[-2] +-+++++# if past_key_value is not None: +-+++++# if self.layer_idx is None: +-+++++# raise ValueError( +-+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++# "with a layer index." +-+++++# ) +-+++++# # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++# if cache_position.shape[0] == 1: +-+++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++# kv_seq_len = past_seen_tokens + 1 +-+++++# else: +-+++++# # prefill 阶段:cache_position 是范围,使用其长度 +-+++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++# else: +-+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++# # 4. KV 缓存更新 +-+++++# if past_key_value is not None: +-+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++# key_states, value_states = past_key_value.update( +-+++++# key_states, value_states, self.layer_idx, cache_kwargs +-+++++# ) +-+++++ +-+++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++# if cache_position.shape[0] == 1: +-+++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++# kv_seq_len = key_states.shape[-2] +-+++++ +-+++++# # 5. [重要] 准备 Attention Mask +-+++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++# fa_attention_mask = None +-+++++# if attention_mask is not None: +-+++++# # 截取与当前key长度匹配的部分 +-+++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++# # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++# fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++# input_dtype = query_states.dtype +-+++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++# query_states = query_states.to(mindspore.float16) +-+++++# key_states = key_states.to(mindspore.float16) +-+++++# value_states = value_states.to(mindspore.float16) +-+++++ +-+++++# # 6. [核心] 调用 flash_attention_score 算子 +-+++++# # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++# attn_output = mindspore.ops.flash_attention_score( +-+++++# query=query_states, +-+++++# key=key_states, +-+++++# value=value_states, +-+++++# head_num=self.num_heads, # 传入Q的头数(N1) +-+++++# attn_mask=fa_attention_mask, +-+++++# keep_prob=1.0 - self.attention_dropout, +-+++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++# input_layout="BNSD", +-+++++# sparse_mode=0 # 使用 defaultMask 模式 +-+++++# ) +-+++++ +-+++++# # 恢复原始数据类型 +-+++++# attn_output = attn_output.to(input_dtype) +-+++++ +-+++++# # 7. 调整输出形状 +-+++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++# attn_output = self.o_proj(attn_output) +-+++++ +-+++++# # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++# attn_weights = None +-+++++# if output_attentions: +-+++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++# return attn_output, attn_weights, past_key_value +-+++++ +-+++++# # def forward( +-+++++# # self, +-+++++# # hidden_states: mindspore.Tensor, +-+++++# # attention_mask: Optional[mindspore.Tensor] = None, +-+++++# # position_ids: Optional[mindspore.Tensor] = None, +-+++++# # past_key_value: Optional[Cache] = None, +-+++++# # output_attentions: bool = False, +-+++++# # use_cache: bool = False, +-+++++# # cache_position: Optional[mindspore.Tensor] = None, +-+++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++# # bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++# # # 1. 线性投射 Q, K, V +-+++++# # query_states = self.q_proj(hidden_states) +-+++++# # key_states = self.k_proj(hidden_states) +-+++++# # value_states = self.v_proj(hidden_states) +-+++++ +-+++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-+++++# # # 3. RoPE 旋转位置编码 +-+++++# # kv_seq_len = key_states.shape[-2] +-+++++# # if past_key_value is not None: +-+++++# # if self.layer_idx is None: +-+++++# # raise ValueError( +-+++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++# # "with a layer index." +-+++++# # ) +-+++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++# # # 4. KV 缓存更新 +-+++++# # if past_key_value is not None: +-+++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++# # key_states, value_states = past_key_value.update( +-+++++# # key_states, value_states, self.layer_idx, cache_kwargs +-+++++# # ) +-+++++ +-+++++# # # 5. 准备 Attention Mask +-+++++# # fa_attention_mask = None +-+++++# # if attention_mask is not None: +-+++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++# # fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++# # input_dtype = query_states.dtype +-+++++ +-+++++# # # 6. [核心] 调用 flash_attention_score 算子 +-+++++# # attn_output = mindspore.ops.flash_attention_score( +-+++++# # query=query_states, +-+++++# # key=key_states, +-+++++# # value=value_states, +-+++++# # head_num=self.num_heads, +-+++++# # attn_mask=fa_attention_mask, +-+++++# # keep_prob=1.0 - self.attention_dropout, +-+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++# # input_layout="BNSD", +-+++++# # sparse_mode=0, +-+++++# # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++# # inner_precise=1 +-+++++# # ) +-+++++ +-+++++# # # 恢复原始数据类型 +-+++++# # attn_output = attn_output.to(input_dtype) +-+++++ +-+++++# # # 7. 调整输出形状 +-+++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++# # attn_output = self.o_proj(attn_output) +-+++++ +-+++++# # attn_weights = None +-+++++# # if output_attentions: +-+++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ +-+++++# # return attn_output, attn_weights, past_key_value +-+++++ +-+++++ +-++++ class Qwen2MoeFlashAttention(nn.Module): +-++++ """ +-++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++- +-++++- 关键改动: +-++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++- 直接传入原始的 key 和 value 张量效率更高。 +-++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-+++++ +-+++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-+++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-+++++ 完全使用模型的低精度数据类型(如 float16)进行计算, +-+++++ 以达到理论上的最高执行速度。 +-++++ """ +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++ super().__init__() +-++++ self.config = config +-++++ self.layer_idx = layer_idx +-+++++ if layer_idx is None: +-+++++ logger.warning_once( +-+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-+++++ ) +-+++++ +-++++ self.hidden_size = config.hidden_size +-++++ self.num_heads = config.num_attention_heads +-++++ self.head_dim = self.hidden_size // self.num_heads +-++++ self.num_key_value_heads = config.num_key_value_heads +-++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++ self.max_position_embeddings = config.max_position_embeddings +-++++ self.rope_theta = config.rope_theta +-++++ self.attention_dropout = config.attention_dropout +-++++ +-++++- if (self.head_dim * self.num_heads) != self.hidden_size: +-++++- raise ValueError( +-++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++- ) +-++++- +-++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++- # query: [B, S, H*D] -> [B, N1, S, D] +-++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++ # 2. 调整形状以匹配 BNSD 布局 +-++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- +-++++- # 3. RoPE 旋转位置编码 +-+++++ +-+++++ # 3. RoPE 和 KV 缓存 +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++- if self.layer_idx is None: +-++++- raise ValueError( +-++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++- "with a layer index." +-++++- ) +-++++- # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++- if cache_position.shape[0] == 1: +-++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++- kv_seq_len = past_seen_tokens + 1 +-++++- else: +-++++- # prefill 阶段:cache_position 是范围,使用其长度 +-++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++- else: +-++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++- +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++- # 4. KV 缓存更新 +-++++ if past_key_value is not None: +-++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++- key_states, value_states = past_key_value.update( +-++++- key_states, value_states, self.layer_idx, cache_kwargs +-++++- ) +-++++- +-++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++- if cache_position.shape[0] == 1: +-++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++- kv_seq_len = key_states.shape[-2] +-++++- +-++++- # 5. [重要] 准备 Attention Mask +-++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++ # 4. 准备 Attention Mask +-++++ fa_attention_mask = None +-++++ if attention_mask is not None: +-++++- # 截取与当前key长度匹配的部分 +-++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++- # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++ fa_attention_mask = (mask_slice != 0) +-++++ +-++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++- input_dtype = query_states.dtype +-++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++- query_states = query_states.to(mindspore.float16) +-++++- key_states = key_states.to(mindspore.float16) +-++++- value_states = value_states.to(mindspore.float16) +-++++- +-++++- # 6. [核心] 调用 flash_attention_score 算子 +-++++- # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +-++++ attn_output = mindspore.ops.flash_attention_score( +-++++ query=query_states, +-++++ key=key_states, +-++++ value=value_states, +-++++- head_num=self.num_heads, # 传入Q的头数(N1) +-+++++ head_num=self.num_heads, +-++++ attn_mask=fa_attention_mask, +-++++- keep_prob=1.0 - self.attention_dropout, +-+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +-++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++ input_layout="BNSD", +-++++- sparse_mode=0 # 使用 defaultMask 模式 +-+++++ sparse_mode=0, +-+++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +-++++ ) +-++++ +-++++- # 恢复原始数据类型 +-++++- attn_output = attn_output.to(input_dtype) +-++++- +-++++- # 7. 调整输出形状 +-++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++ # 6. 调整输出形状 +-++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++ attn_output = self.o_proj(attn_output) +-++++ +-++++- # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++ # 7. 返回结果 +-++++ attn_weights = None +-++++ if output_attentions: +-++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++- # def forward( +-++++- # self, +-++++- # hidden_states: mindspore.Tensor, +-++++- # attention_mask: Optional[mindspore.Tensor] = None, +-++++- # position_ids: Optional[mindspore.Tensor] = None, +-++++- # past_key_value: Optional[Cache] = None, +-++++- # output_attentions: bool = False, +-++++- # use_cache: bool = False, +-++++- # cache_position: Optional[mindspore.Tensor] = None, +-++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++- +-++++- # bsz, q_len, _ = hidden_states.shape +-++++- +-++++- # # 1. 线性投射 Q, K, V +-++++- # query_states = self.q_proj(hidden_states) +-++++- # key_states = self.k_proj(hidden_states) +-++++- # value_states = self.v_proj(hidden_states) +-++++- +-++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- +-++++- # # 3. RoPE 旋转位置编码 +-++++- # kv_seq_len = key_states.shape[-2] +-++++- # if past_key_value is not None: +-++++- # if self.layer_idx is None: +-++++- # raise ValueError( +-++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++- # "with a layer index." +-++++- # ) +-++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++ +-++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++- +-++++- # # 4. KV 缓存更新 +-++++- # if past_key_value is not None: +-++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++- # key_states, value_states = past_key_value.update( +-++++- # key_states, value_states, self.layer_idx, cache_kwargs +-++++- # ) +-++++- +-++++- # # 5. 准备 Attention Mask +-++++- # fa_attention_mask = None +-++++- # if attention_mask is not None: +-++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++- # fa_attention_mask = (mask_slice != 0) +-++++- +-++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++- # input_dtype = query_states.dtype +-++++- +-++++- # # 6. [核心] 调用 flash_attention_score 算子 +-++++- # attn_output = mindspore.ops.flash_attention_score( +-++++- # query=query_states, +-++++- # key=key_states, +-++++- # value=value_states, +-++++- # head_num=self.num_heads, +-++++- # attn_mask=fa_attention_mask, +-++++- # keep_prob=1.0 - self.attention_dropout, +-++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++- # input_layout="BNSD", +-++++- # sparse_mode=0, +-++++- # # <--- 修改点 2: 启用内部高精度计算 --- +-++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++- # inner_precise=1 +-++++- # ) +-++++- +-++++- # # 恢复原始数据类型 +-++++- # attn_output = attn_output.to(input_dtype) +-+++++QWEN2MOE_ATTENTION_CLASSES = { +-+++++ "eager": Qwen2MoeAttention, +-+++++ "flash-attention": Qwen2MoeFlashAttention, +-+++++} +-++++ +-++++- # # 7. 调整输出形状 +-++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++- # attn_output = self.o_proj(attn_output) +-++++ +-++++- # attn_weights = None +-++++- # if output_attentions: +-++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# def __init__(self, config): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# # gating +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++ +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# #@dwj +-+++++# # 只遍历激活的专家,而非全部专家 +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# num_tokens = hidden_states_reshaped.shape[0] +-+++++ +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++ +-+++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++# token_indices = broadcasted_token_indices.flatten() +-+++++ +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++ +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# selected_token_indices = token_indices[mask] +-+++++# selected_routing_weights = routing_weights.flatten()[mask] +-+++++ +-+++++# current_states = hidden_states_reshaped[selected_token_indices] +-+++++ +-+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++ +-+++++# final_hidden_states = final_hidden_states.index_add( +-+++++# dim=0, +-+++++# index=selected_token_indices, +-+++++# source=expert_output.to(hidden_states.dtype) +-+++++# ) +-+++++ +-+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++ +-++++- # return attn_output, attn_weights, past_key_value +-+++++# final_hidden_states = final_hidden_states + shared_expert_output +-+++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# """ +-+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+++++# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# # 门控网络 +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# # 专家列表 +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++# # 共享专家 +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_decode( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# """ +-+++++# 【解码路径】针对 sequence_length=1 的极致优化。 +-+++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+++++# """ +-+++++# batch_size, hidden_dim = hidden_states.shape +-+++++ +-+++++# expert_outputs_list = [ +-+++++# ops.cat([ +-+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++# ], dim=0) +-+++++# for i in range(batch_size) +-+++++# ] +-+++++ +-+++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+++++# # shape: (batch_size, top_k, hidden_dim) +-+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++ +-+++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++ +-+++++# return moe_output.squeeze(1) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_prefill( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# """ +-+++++# 【预填充路径】针对 sequence_length > 1 的优化。 +-+++++# 按专家对 Token 进行分组,并进行批处理。 +-+++++# """ +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens = hidden_states.shape[0] +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++ +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++ +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++ +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# selected_token_indices = token_indices[mask] +-+++++# selected_routing_weights = routing_weights.flatten()[mask] +-+++++ +-+++++# current_states = hidden_states[selected_token_indices] +-+++++ +-+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++ +-+++++# moe_output = moe_output.index_add( +-+++++# dim=0, +-+++++# index=selected_token_indices, +-+++++# source=expert_output.to(hidden_states.dtype) +-+++++# ) +-+++++# return moe_output +-+++++ +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# """ +-+++++# 顶层 forward 方法,作为智能分发器。 +-+++++# """ +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++- # def forward( +-++++- # self, +-++++- # hidden_states: mindspore.Tensor, +-++++- # attention_mask: Optional[mindspore.Tensor] = None, +-++++- # position_ids: Optional[mindspore.Tensor] = None, +-++++- # past_key_value: Optional[Cache] = None, +-++++- # output_attentions: bool = False, +-++++- # use_cache: bool = False, +-++++- # cache_position: Optional[mindspore.Tensor] = None, +-++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++- +-++++- # bsz, q_len, _ = hidden_states.shape +-++++- +-++++- # query_states = self.q_proj(hidden_states) +-++++- # key_states = self.k_proj(hidden_states) +-++++- # value_states = self.v_proj(hidden_states) +-++++- +-++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- +-++++- # kv_seq_len = key_states.shape[-2] +-++++- # if past_key_value is not None: +-++++- # if self.layer_idx is None: +-++++- # raise ValueError("`layer_idx` must be specified for caching") +-++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++- +-++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++- +-++++- # if past_key_value is not None: +-++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++- # key_states, value_states = past_key_value.update( +-++++- # key_states, value_states, self.layer_idx, cache_kwargs +-++++- # ) +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++# moe_output = None +-+++++# # 在推理时,根据序列长度选择最优路径 +-+++++# if not self.training: +-+++++# if sequence_length == 1: +-+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++++# else: +-+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++++# else: +-+++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+++++# raise NotImplementedError("Training path is not implemented.") +-+++++ +-+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+++++ +-+++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+++++ +-+++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# """ +-+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# # 门控网络 +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# # 专家列表 +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++# # 共享专家 +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_decode( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# batch_size, _ = hidden_states.shape +-+++++# expert_outputs_list = [ +-+++++# ops.cat([ +-+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++# ], dim=0) +-+++++# for i in range(batch_size) +-+++++# ] +-+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++# return moe_output.squeeze(1) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_prefill( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens = hidden_states.shape[0] +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# selected_token_indices = token_indices[mask] +-+++++# selected_routing_weights = routing_weights.flatten()[mask] +-+++++# current_states = hidden_states[selected_token_indices] +-+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++# moe_output = moe_output.index_add( +-+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++# ) +-+++++# return moe_output +-+++++ +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# """ +-+++++# 顶层 forward 方法,作为智能分发器。 +-+++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+++++# """ +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ +-+++++# # 1. 门控计算 (通用逻辑) +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++# # 2. 智能分发到最优 MoE 路径 +-+++++# moe_output = None +-+++++# if not self.training: +-+++++# if sequence_length == 1: +-+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++++# else: +-+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++++# else: +-+++++# raise NotImplementedError("Training path is not implemented.") +-+++++ +-+++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++ +-+++++# # 4. 合并 MoE 输出和共享专家输出 +-+++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++ +-+++++# # 5. 恢复原始形状并返回 +-+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-+++++# prefill fastest +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# """ +-+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# # 门控网络 +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# # 专家列表 +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++# # 共享专家 +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_dispatch( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# """ +-+++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+++++# """ +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens, _ = hidden_states.shape +-+++++ +-+++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++# flat_routing_weights = routing_weights.flatten() +-++++ +-++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++- +-++++- # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++- # query_states = query_states / math.sqrt(self.head_dim) +-++++- # # <--- 修改结束 --- +-++++- +-++++- # fa_attention_mask = None +-++++- # if attention_mask is not None: +-++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++- # fa_attention_mask = (mask_slice != 0) +-++++- +-++++- # input_dtype = query_states.dtype +-++++- +-++++- # attn_output = mindspore.ops.flash_attention_score( +-++++- # query=query_states, # 传入已经预先缩放过的 query +-++++- # key=key_states, +-++++- # value=value_states, +-++++- # head_num=self.num_heads, +-++++- # attn_mask=fa_attention_mask, +-++++- # keep_prob=1.0 - self.attention_dropout, +-++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++- # input_layout="BNSD", +-++++- # sparse_mode=0, +-++++- # inner_precise=1 # 仍然保持内部高精度计算 +-++++- # ) +-+++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++ +-++++- # attn_output = attn_output.to(input_dtype) +-++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++- # attn_output = self.o_proj(attn_output) +-+++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++ +-+++++# # 找到所有分配给该专家的 token +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++ +-+++++# # 使用 mask 选取对应的 token 和权重 +-+++++# current_token_indices = token_indices[mask] +-+++++# current_routing_weights = flat_routing_weights[mask] +-+++++# current_hidden_states = hidden_states[current_token_indices] +-+++++ +-+++++# # 对这些 token 进行批处理 +-+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++++ +-+++++# # 使用 index_add 将结果精确地加回到对应位置 +-+++++# moe_output = moe_output.index_add( +-+++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++# ) +-+++++# return moe_output +-+++++ +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# """ +-+++++# 顶层 forward 方法,作为智能分发器。 +-+++++# """ +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ +-+++++# # 1. 门控计算 +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++ +-+++++# # 2. 调用统一的 MoE 计算内核 +-+++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-++++ +-++++- # attn_weights = None +-++++- # if output_attentions: +-++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++++# # 3. 统一处理共享专家 +-+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++ +-+++++# # 4. 合并输出 +-+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++ +-+++++# # 5. 恢复原始形状并返回 +-+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# """ +-+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++# 【最终高性能与高精度版】: +-+++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+++++# 3. 这样实现了速度和准确性的两全其美。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_decode( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# """ +-+++++# 【解码路径】极致优化版:bmm + 高精度累加。 +-+++++# """ +-+++++# original_dtype = hidden_states.dtype +-+++++# batch_size, _ = hidden_states.shape +-+++++ +-+++++# expert_outputs_list = [ +-+++++# ops.cat([ +-+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++# ], dim=0) +-+++++# for i in range(batch_size) +-+++++# ] +-+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++ +-+++++# # 在 float32 下执行 bmm,得到高精度结果 +-+++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++ +-+++++# # 将高精度结果转换回原始数据类型 +-+++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+++++ +-+++++# return moe_output +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_prefill( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# selected_experts: mindspore.Tensor, +-+++++# routing_weights: mindspore.Tensor +-+++++# ) -> mindspore.Tensor: +-+++++# """ +-+++++# 【预填充路径】与原始实现一致,结果精确。 +-+++++# """ +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens, _ = hidden_states.shape +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++ +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# selected_token_indices = token_indices[mask] +-+++++# selected_routing_weights = routing_weights.flatten()[mask] +-+++++# current_states = hidden_states[selected_token_indices] +-+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++# moe_output = moe_output.index_add( +-+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++# ) +-+++++# return moe_output +-+++++ +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++ +-++++- # return attn_output, attn_weights, past_key_value +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+++++# # 如果模型主体是 float16,后续再转换 +-+++++ +-+++++# moe_output = None +-+++++# if not self.training: +-+++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+++++# # _moe_infer_decode 内部会处理好类型转换 +-+++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+++++# if sequence_length == 1: +-+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++++# else: +-+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++++# else: +-+++++# raise NotImplementedError("Training path is not implemented.") +-+++++ +-+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++ +-+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-++++ +-++++-QWEN2MOE_ATTENTION_CLASSES = { +-++++- "eager": Qwen2MoeAttention, +-++++- "flash-attention": Qwen2MoeFlashAttention, +-++++-} +-+++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++# """ +-+++++# 【融合版】一个混合专家模块,内置两种推理策略, +-+++++# 由外部全局变量 `Long_Prompt` 控制: +-+++++ +-+++++# - if Long_Prompt is True: 【精度优先模式】 +-+++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+++++# 适用于处理长序列,避免误差累积。 +-+++++ +-+++++# - if Long_Prompt is False: 【速度优先模式】 +-+++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+++++# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+++++# """ +-+++++# def __init__(self, config: Qwen2MoeConfig): +-+++++# super().__init__() +-+++++# self.num_experts = config.num_experts +-+++++# self.top_k = config.num_experts_per_tok +-+++++# self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++# self.experts = nn.ModuleList( +-+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++# ) +-+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++# # --- 速度优先模式的辅助函数 --- +-+++++# @no_grad() +-+++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++# original_dtype = hidden_states.dtype +-+++++# batch_size, _ = hidden_states.shape +-+++++# expert_outputs_list = [ +-+++++# ops.cat([ +-+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++# ], dim=0) +-+++++# for i in range(batch_size) +-+++++# ] +-+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++# weights_fp32 = routing_weights.to(mindspore.float32) +-+++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++++# return moe_output_fp32.squeeze(1).to(original_dtype) +-+++++ +-+++++# @no_grad() +-+++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens, _ = hidden_states.shape +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# selected_token_indices = token_indices[mask] +-+++++# selected_routing_weights = routing_weights.flatten()[mask] +-+++++# current_states = hidden_states[selected_token_indices] +-+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++# return moe_output +-+++++ +-+++++# # --- 精度优先模式的辅助函数 --- +-+++++# @no_grad() +-+++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++# moe_output = ops.zeros_like(hidden_states) +-+++++# num_tokens, _ = hidden_states.shape +-+++++# flat_selected_experts = selected_experts.flatten() +-+++++# flat_routing_weights = routing_weights.flatten() +-+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++# active_experts = ops.unique(flat_selected_experts) +-+++++# for expert_idx_tensor in active_experts: +-+++++# expert_idx = expert_idx_tensor.item() +-+++++# expert_layer = self.experts[expert_idx] +-+++++# mask = (flat_selected_experts == expert_idx_tensor) +-+++++# current_token_indices = token_indices[mask] +-+++++# current_routing_weights = flat_routing_weights[mask] +-+++++# current_hidden_states = hidden_states[current_token_indices] +-+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++# return moe_output +-+++++ +-+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++# # 声明我们将要使用一个在模块外部定义的全局变量 +-+++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+++++# global Long_Prompt +-+++++ +-+++++# # 1. 门控计算 (所有模式通用) +-+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++# router_logits = self.gate(hidden_states_reshaped) +-+++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++++# if self.norm_topk_prob: +-+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++# moe_output = None +-+++++# if not self.training: +-+++++# # 根据 Long_Prompt 标志选择模式 +-+++++# if Long_Prompt: +-+++++# # --- 精度优先模式 --- +-+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++# else: +-+++++# # --- 速度优先模式 --- +-+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++# if sequence_length == 1: +-+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++# else: +-+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++# else: +-+++++# raise NotImplementedError("Training path is not implemented.") +-+++++ +-+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++ +-+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++# return final_hidden_states, router_logits +-+++++ +-+++++class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ """ +-+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+++++ 控制的顶级推理策略: +-++++ +-+++++ - if Long_Prompt is True: 【精度优先模式】 +-+++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-+++++ 适用于需要严格可复现性的长序列任务。 +-++++ +-++++-class Qwen2MoeSparseMoeBlock(nn.Module): +-++++- def __init__(self, config): +-+++++ - if Long_Prompt is False: 【速度优先模式】 +-+++++ 采用业界最强的性能组合: +-+++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-+++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-+++++ """ +-+++++ def __init__(self, config: Qwen2MoeConfig): +-++++ super().__init__() +-++++ self.num_experts = config.num_experts +-++++ self.top_k = config.num_experts_per_tok +-++++ self.norm_topk_prob = config.norm_topk_prob +-++++ +-++++- # gating +-++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++ self.experts = nn.ModuleList( +-++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++ ) +-++++- +-++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++ +-++++- #@dwj +-++++- # 只遍历激活的专家,而非全部专家 +-++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++- num_tokens = hidden_states_reshaped.shape[0] +-++++- +-++++- router_logits = self.gate(hidden_states_reshaped) +-++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- +-++++- if self.norm_topk_prob: +-++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++- flat_selected_experts = selected_experts.flatten() +-++++- +-++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++- token_indices = broadcasted_token_indices.flatten() +-++++- +-++++- active_experts = ops.unique(flat_selected_experts) +-++++- +-++++- for expert_idx_tensor in active_experts: +-++++- expert_idx = expert_idx_tensor.item() +-++++- expert_layer = self.experts[expert_idx] +-++++- +-++++- mask = (flat_selected_experts == expert_idx_tensor) +-++++- selected_token_indices = token_indices[mask] +-++++- selected_routing_weights = routing_weights.flatten()[mask] +-++++- +-++++- current_states = hidden_states_reshaped[selected_token_indices] +-++++- +-++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++- +-++++- final_hidden_states = final_hidden_states.index_add( +-++++- dim=0, +-++++- index=selected_token_indices, +-++++- source=expert_output.to(hidden_states.dtype) +-++++- ) +-++++- +-++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-+++++ @no_grad() +-+++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++ original_dtype = hidden_states.dtype +-+++++ batch_size, _ = hidden_states.shape +-+++++ expert_outputs_list = [ +-+++++ ops.cat([ +-+++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++ ], dim=0) +-+++++ for i in range(batch_size) +-+++++ ] +-+++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++ weights_fp32 = routing_weights.to(mindspore.float32) +-+++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++++ return moe_output_fp32.squeeze(1).to(original_dtype) +-+++++ +-+++++ @no_grad() +-+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++ num_tokens, _ = hidden_states.shape +-+++++ flat_selected_experts = selected_experts.flatten() +-+++++ sorted_expert_indices = flat_selected_experts.argsort() +-+++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++++ original_token_indices = sorted_expert_indices // self.top_k +-+++++ moe_output = ops.zeros_like(hidden_states) +-+++++ current_token_offset = 0 +-+++++ for i in range(self.num_experts): +-+++++ expert_token_count = tokens_per_expert[i] - current_token_offset +-+++++ if expert_token_count == 0: +-+++++ continue +-+++++ end_offset = current_token_offset + expert_token_count +-+++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++++ expert_hidden_states = hidden_states[expert_original_token_indices] +-+++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++ current_token_offset += expert_token_count +-+++++ return moe_output +-+++++ +-+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+++++ @no_grad() +-+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++ moe_output = ops.zeros_like(hidden_states) +-+++++ num_tokens, _ = hidden_states.shape +-+++++ flat_selected_experts = selected_experts.flatten() +-+++++ flat_routing_weights = routing_weights.flatten() +-+++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++ active_experts = ops.unique(flat_selected_experts) +-+++++ for expert_idx_tensor in active_experts: +-+++++ expert_idx = expert_idx_tensor.item() +-+++++ expert_layer = self.experts[expert_idx] +-+++++ mask = (flat_selected_experts == expert_idx_tensor) +-+++++ current_token_indices = token_indices[mask] +-+++++ current_routing_weights = flat_routing_weights[mask] +-+++++ current_hidden_states = hidden_states[current_token_indices] +-+++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++ return moe_output +-++++ +-++++- final_hidden_states = final_hidden_states + shared_expert_output +-++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++- +-++++- return final_hidden_states, router_logits +-+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++ global Long_Prompt +-+++++ +-+++++ # 1. 门控计算 (所有模式通用) +-+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++ router_logits = self.gate(hidden_states_reshaped) +-+++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++++ if self.norm_topk_prob: +-+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++ moe_output = None +-+++++ if Long_Prompt: +-+++++ # --- 精度优先模式 (ACCURACY MODE) --- +-+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ else: +-+++++ # --- 速度优先模式 (SPEED MODE) --- +-+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++ if sequence_length == 1: +-+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ else: +-+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ +-++++ +-+++++ # 3. 共享专家计算与合并 (所有模式通用) +-+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++ +-+++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++ +-+++++ return final_hidden_states, router_logits +-++++ +-++++ class Qwen2MoeDecoderLayer(nn.Module): +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-++++ super().__init__() +-++++ self.hidden_size = config.hidden_size +-+++++ +-+++++ # if Long_Prompt: +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ # else: +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++ +-++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ +-++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++- +-++++ if (layer_idx not in config.mlp_only_layers) and ( +-++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++++ ): +-++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ self._warmed_up = True +-++++ self.warmup_moe_model() +-++++ +-+++++ +-+++++ +-++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++++ output_router_logits = ( +-++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +-++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ router_logits=outputs.router_logits, +-++++ ) +-++++ +-+++++ def generate(self, *args, **kwargs): +-+++++ """ +-+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+++++ """ +-+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+++++ +-+++++ input_ids = kwargs.get("input_ids") +-+++++ if input_ids is None and args: +-+++++ input_ids = args[0] +-+++++ +-+++++ if input_ids is not None: +-+++++ prompt_length = input_ids.shape[1] +-+++++ +-+++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+++++ Long_Prompt = True +-+++++ else: +-+++++ Long_Prompt = False +-+++++ +-+++++ return super().generate(*args, **kwargs) +-+++++ +-++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +-++++ def prepare_inputs_for_generation( +-++++ self, +-++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +-++++ # Exception 1: when passing input_embeds, input_ids may be missing entries +-++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-+++++ +-++++ if past_key_values is not None: +-++++ if inputs_embeds is not None: # Exception 1 +-++++ if 0 not in input_ids.shape: +-++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ } +-++++ ) +-++++ return model_inputs +-+++++ +-++++ # @lwx +-++++ # def _decode_one_tokens_logits( +-++++ # self, +-++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +-++++ attentions=outputs.attentions, +-++++ ) +-++++ +-+++++ +-++++ __all__ = [ +-++++ "Qwen2MoeForCausalLM", +-++++ "Qwen2MoeModel", +-++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++++new file mode 100644 +-++++index 00000000..6dfb5b93 +-++++--- /dev/null +-+++++++ b/patches/0001-20251104commit.patch +-++++@@ -0,0 +1,1272 @@ +-+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++++From: Pinoeer-kingxi <13022943007@163.com> +-+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++++Subject: [PATCH] 20251104commit +-+++++ +-+++++--- +-+++++ mindnlp/transformers/cache_utils.py | 28 +- +-+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++++ 3 files changed, 976 insertions(+), 87 deletions(-) +-+++++ +-+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++++index cadd2e04..02f8d4be 100644 +-+++++--- a/mindnlp/transformers/cache_utils.py +-++++++++ b/mindnlp/transformers/cache_utils.py +-+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++++ # k_out[:, :, cache_position] = key_states +-+++++ # v_out[:, :, cache_position] = value_states +-+++++- if ON_ORANGE_PI: +-+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++- else: +-+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++- +-++++++ # if ON_ORANGE_PI: +-++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++++ # else: +-++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++++++ if cache_position.ndim > 1: +-++++++ cache_position = cache_position.flatten() +-++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++++++ cache_position = cache_position.int() +-++++++ +-++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++++++ k_out[:, :, cache_position] = key_states +-++++++ v_out[:, :, cache_position] = value_states +-++++++ +-+++++ return k_out, v_out +-+++++ +-+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++index c695b944..d8303e45 100644 +-+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++++ def rotate_half(x): +-+++++ """Rotates half the hidden dims of the input.""" +-+++++- x1 = x[..., : x.shape[-1] // 2] +-+++++- x2 = x[..., x.shape[-1] // 2 :] +-++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++++ # x1 = x[..., : x.shape[-1] // 2] +-++++++ # x2 = x[..., x.shape[-1] // 2 :] +-++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++ return ops.cat((-x2, x1), dim=-1) +-+++++ +-+++++ +-+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++++ if self.training: +-+++++ raise NotImplementedError("Training is not supported yet.") +-+++++ else: +-+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++- if self.config.n_shared_experts is not None: +-+++++- y = y + self.shared_experts(identity) +-+++++- return y +-++++++ # @lwx +-++++++ if orig_shape[1] == 1: +-++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++++++ y=y.view(*orig_shape) +-++++++ if self.config.n_shared_experts is not None: +-++++++ y = y + self.shared_experts(identity) +-++++++ return y +-++++++ else: +-++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++++++ if self.config.n_shared_experts is not None: +-++++++ y = y + self.shared_experts(identity) +-++++++ return y +-++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++++ # if self.config.n_shared_experts is not None: +-++++++ # y = y + self.shared_experts(identity) +-++++++ # return y +-++++++ +-++++++ @no_grad() +-++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++++ +-++++++ expert_cache = ops.zeros_like(x) +-++++++ for i in range(self.num_experts_per_tok): +-++++++ expert_id = flat_expert_indices[i].item() +-++++++ weight = flat_expert_weights[i].item() +-++++++ expert = self.experts[expert_id] +-++++++ expert_out = expert(x) +-++++++ expert_cache += expert_out * weight +-++++++ return expert_cache +-+++++ +-+++++ @no_grad() +-+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++- # expert_cache = torch.zeros_like(x) +-+++++- # idxs = flat_expert_indices.argsort() +-+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++- # token_idxs = idxs // self.num_experts_per_tok +-+++++- # for i, end_idx in enumerate(tokens_per_expert): +-+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++- # if start_idx == end_idx: +-+++++- # continue +-+++++- # expert = self.experts[i] +-+++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++- # expert_tokens = x[exp_token_idx] +-+++++- # expert_out = expert(expert_tokens) +-+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++- # return expert_cache +-++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++ expert_cache = ops.zeros_like(x) +-+++++ idxs = flat_expert_indices.argsort() +-+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ token_idxs = idxs // self.num_experts_per_tok +-++++++ +-+++++ for i, end_idx in enumerate(tokens_per_expert): +-+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ if start_idx == end_idx: +-+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++++ expert_out = expert(expert_tokens) +-+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++++ +-+++++ return expert_cache +-++++++ +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # # expert_cache = torch.zeros_like(x) +-++++++ # # idxs = flat_expert_indices.argsort() +-++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++++ # # if start_idx == end_idx: +-++++++ # # continue +-++++++ # # expert = self.experts[i] +-++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # # expert_tokens = x[exp_token_idx] +-++++++ # # expert_out = expert(expert_tokens) +-++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++++ # # return expert_cache +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # if start_idx == end_idx: +-++++++ # continue +-++++++ # expert = self.experts[i] +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = expert(expert_tokens) +-++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++++ +-++++++ # return expert_cache +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ +-++++++ # # 排序保证顺序一致 +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # # 找出有 token 的专家 +-++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++++ +-++++++ # for i in active_experts.tolist(): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # end_idx = tokens_per_expert[i] +-++++++ # if start_idx == end_idx: # 没有 token +-++++++ # continue +-++++++ +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = self.experts[i](expert_tokens) +-++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++++ +-++++++ # expert_cache = mindspore.mint.scatter_add( +-++++++ # expert_cache, +-++++++ # 0, +-++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++++ # expert_out +-++++++ # ) +-++++++ +-++++++ # return expert_cache +-++++++ +-++++++ +-+++++ +-+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++++ # """ +-+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++ +-+++++ # Initialize weights and apply final processing +-+++++ self.post_init() +-++++++ self.warm_up = False +-++++++ +-++++++ def warmup_moe_model_deep(self): +-++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++++ test_texts = [ +-++++++ "warmup short", +-++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++++++ ] +-++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++++ if tokenizer is None: +-++++++ from mindnlp.transformers import AutoTokenizer +-++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++++ self._warmup_tokenizer = tokenizer +-++++++ +-++++++ for text in test_texts: +-++++++ inputs = tokenizer(text, return_tensors="ms") +-++++++ with mindspore._no_grad(): +-++++++ _ = self(**inputs, use_cache=False) +-++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++++ +-+++++ def get_input_embeddings(self): +-+++++ return self.model.embed_tokens +-+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++ ```""" +-++++++ if not self.warm_up: +-++++++ self.warm_up = True +-++++++ self.warmup_moe_model_deep() +-++++++ +-+++++ output_attentions = ( +-+++++ output_attentions +-+++++ if output_attentions is not None +-+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++index 3cbf820e..d4c6b651 100644 +-+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++@@ -18,7 +18,6 @@ +-+++++ # See the License for the specific language governing permissions and +-+++++ # limitations under the License. +-+++++ """MindSpore Qwen2MoE model.""" +-+++++- +-+++++ import math +-+++++ from typing import List, Optional, Tuple, Union +-+++++ +-+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++++ TokenClassifierOutput, +-+++++ ) +-+++++ from ...modeling_utils import PreTrainedModel +-++++++from ...generation import GenerationMixin +-+++++ from ....utils import logging +-+++++ from .configuration_qwen2_moe import Qwen2MoeConfig +-+++++ +-+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++++ self.variance_epsilon = eps +-+++++ +-+++++ def forward(self, hidden_states): +-++++++ # @dwj +-++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++++ # @lwx +-++++++ # if not self.training : +-++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++ input_dtype = hidden_states.dtype +-+++++ hidden_states = hidden_states.to(mindspore.float32) +-+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++++@@ -234,6 +239,8 @@ def rotate_half(x): +-+++++ """Rotates half the hidden dims of the input.""" +-+++++ x1 = x[..., : x.shape[-1] // 2] +-+++++ x2 = x[..., x.shape[-1] // 2 :] +-++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++ return ops.cat((-x2, x1), dim=-1) +-+++++ +-+++++ +-+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++++ self.config = config +-+++++ self.hidden_size = config.hidden_size +-+++++ self.intermediate_size = intermediate_size +-++++++ +-+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++++ self.act_fn = ACT2FN[config.hidden_act] +-+++++ +-+++++ def forward(self, x): +-+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++- +-+++++ +-++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++++ # @lwx +-++++++ # gate_up_output = self.gate_up_proj(x) +-++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++++++ # return self.down_proj(swiglu_output) +-++++++ +-++++++ # def forward(self, x): +-++++++ # gate_proj_out = self.gate_proj(x) +-++++++ # up_proj_out = self.up_proj(x) +-++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++++++ # return self.down_proj(swiglu_out) +-++++++ +-+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++++ """ +-+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++++ use_cache: bool = False, +-+++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ +-++++++ +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ query_states = self.q_proj(hidden_states) +-+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ "with a layer index." +-+++++ ) +-+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ if isinstance(past_key_value, StaticCache): +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ else: +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++ if isinstance(past_key_value, StaticCache): +-++++++ kv_seq_len = key_states.shape[-2] +-+++++ +-+++++ # repeat k/v heads if n_kv_heads < n_heads +-+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++- +-++++++ +-+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++ +-+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++++- raise ValueError( +-+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++++- f" {attn_weights.shape}" +-+++++- ) +-+++++- +-+++++- if attention_mask is not None: # no matter the length, we just slice it +-+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++++++ if attention_mask is not None: +-++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++ attn_weights = attn_weights + causal_mask +-+++++ +-+++++ # upcast attention to fp32 +-+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++ +-+++++ attn_output = self.o_proj(attn_output) +-+++++- +-++++++ # @lwx +-++++++ +-++++++ # max_seq_len = self.max_position_embeddings # 2048 +-++++++ +-++++++ # if attention_mask is not None: +-++++++ # # attention_mask: [B, 1, Sq, Sk] +-++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++++ +-++++++ # # pad 到 [max_seq_len, max_seq_len] +-++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++++ # global_attention_mask = padded_mask +-++++++ # else: +-++++++ # global_attention_mask = None +-++++++ +-++++++ +-++++++ # sparse_mode=3 +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # real_shift=None, +-++++++ # padding_mask=None, +-++++++ +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=global_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ # input_layout="BNSD", +-++++++ # pre_tokens=2147483647, +-++++++ # next_tokens=2147483647, +-++++++ # inner_precise=0, +-++++++ # drop_mask=None, +-++++++ # prefix=None, +-++++++ # actual_seq_qlen=None, +-++++++ # actual_seq_kvlen=None, +-++++++ # sparse_mode=sparse_mode, +-++++++ # ) +-+++++ if not output_attentions: +-+++++ attn_weights = None +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ +-++++++class Qwen2MoeFlashAttention(nn.Module): +-++++++ """ +-++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++++ +-++++++ 关键改动: +-++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++++ 直接传入原始的 key 和 value 张量效率更高。 +-++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++++ """ +-++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++++ super().__init__() +-++++++ self.config = config +-++++++ self.layer_idx = layer_idx +-++++++ self.hidden_size = config.hidden_size +-++++++ self.num_heads = config.num_attention_heads +-++++++ self.head_dim = self.hidden_size // self.num_heads +-++++++ self.num_key_value_heads = config.num_key_value_heads +-++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++ self.max_position_embeddings = config.max_position_embeddings +-++++++ self.rope_theta = config.rope_theta +-++++++ self.attention_dropout = config.attention_dropout +-++++++ +-++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++ raise ValueError( +-++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++++ ) +-++++++ +-++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++++ +-++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++++ self.head_dim, +-++++++ max_position_embeddings=self.max_position_embeddings, +-++++++ base=self.rope_theta, +-++++++ ) +-++++++ +-++++++ def forward( +-++++++ self, +-++++++ hidden_states: mindspore.Tensor, +-++++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++++ position_ids: Optional[mindspore.Tensor] = None, +-++++++ past_key_value: Optional[Cache] = None, +-++++++ output_attentions: bool = False, +-++++++ use_cache: bool = False, +-++++++ cache_position: Optional[mindspore.Tensor] = None, +-++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # 1. 线性投射 Q, K, V +-++++++ query_states = self.q_proj(hidden_states) +-++++++ key_states = self.k_proj(hidden_states) +-++++++ value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++ # query: [B, S, H*D] -> [B, N1, S, D] +-++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # 3. RoPE 旋转位置编码 +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ if past_key_value is not None: +-++++++ if self.layer_idx is None: +-++++++ raise ValueError( +-++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++ "with a layer index." +-++++++ ) +-++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++++ if cache_position.shape[0] == 1: +-++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++++ kv_seq_len = past_seen_tokens + 1 +-++++++ else: +-++++++ # prefill 阶段:cache_position 是范围,使用其长度 +-++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++++ else: +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # 4. KV 缓存更新 +-++++++ if past_key_value is not None: +-++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ key_states, value_states = past_key_value.update( +-++++++ key_states, value_states, self.layer_idx, cache_kwargs +-++++++ ) +-++++++ +-++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++ if cache_position.shape[0] == 1: +-++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ +-++++++ # 5. [重要] 准备 Attention Mask +-++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++++ fa_attention_mask = None +-++++++ if attention_mask is not None: +-++++++ # 截取与当前key长度匹配的部分 +-++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++++ fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++++ input_dtype = query_states.dtype +-++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++++ query_states = query_states.to(mindspore.float16) +-++++++ key_states = key_states.to(mindspore.float16) +-++++++ value_states = value_states.to(mindspore.float16) +-++++++ +-++++++ # 6. [核心] 调用 flash_attention_score 算子 +-++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++++ attn_output = mindspore.ops.flash_attention_score( +-++++++ query=query_states, +-++++++ key=key_states, +-++++++ value=value_states, +-++++++ head_num=self.num_heads, # 传入Q的头数(N1) +-++++++ attn_mask=fa_attention_mask, +-++++++ keep_prob=1.0 - self.attention_dropout, +-++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ input_layout="BNSD", +-++++++ sparse_mode=0 # 使用 defaultMask 模式 +-++++++ ) +-++++++ +-++++++ # 恢复原始数据类型 +-++++++ attn_output = attn_output.to(input_dtype) +-++++++ +-++++++ # 7. 调整输出形状 +-++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++++++ attn_weights = None +-++++++ if output_attentions: +-++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++ return attn_output, attn_weights, past_key_value +-++++++ +-++++++ # def forward( +-++++++ # self, +-++++++ # hidden_states: mindspore.Tensor, +-++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++++ # past_key_value: Optional[Cache] = None, +-++++++ # output_attentions: bool = False, +-++++++ # use_cache: bool = False, +-++++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ # bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # # 1. 线性投射 Q, K, V +-++++++ # query_states = self.q_proj(hidden_states) +-++++++ # key_states = self.k_proj(hidden_states) +-++++++ # value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # # 3. RoPE 旋转位置编码 +-++++++ # kv_seq_len = key_states.shape[-2] +-++++++ # if past_key_value is not None: +-++++++ # if self.layer_idx is None: +-++++++ # raise ValueError( +-++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++ # "with a layer index." +-++++++ # ) +-++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # # 4. KV 缓存更新 +-++++++ # if past_key_value is not None: +-++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ # key_states, value_states = past_key_value.update( +-++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++++ # ) +-++++++ +-++++++ # # 5. 准备 Attention Mask +-++++++ # fa_attention_mask = None +-++++++ # if attention_mask is not None: +-++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++++ # input_dtype = query_states.dtype +-++++++ +-++++++ # # 6. [核心] 调用 flash_attention_score 算子 +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=fa_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ # input_layout="BNSD", +-++++++ # sparse_mode=0, +-++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++++ # inner_precise=1 +-++++++ # ) +-++++++ +-++++++ # # 恢复原始数据类型 +-++++++ # attn_output = attn_output.to(input_dtype) +-++++++ +-++++++ # # 7. 调整输出形状 +-++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ # attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # attn_weights = None +-++++++ # if output_attentions: +-++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++ # return attn_output, attn_weights, past_key_value +-++++++ +-++++++ # def forward( +-++++++ # self, +-++++++ # hidden_states: mindspore.Tensor, +-++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++++ # past_key_value: Optional[Cache] = None, +-++++++ # output_attentions: bool = False, +-++++++ # use_cache: bool = False, +-++++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ # bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # query_states = self.q_proj(hidden_states) +-++++++ # key_states = self.k_proj(hidden_states) +-++++++ # value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # kv_seq_len = key_states.shape[-2] +-++++++ # if past_key_value is not None: +-++++++ # if self.layer_idx is None: +-++++++ # raise ValueError("`layer_idx` must be specified for caching") +-++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # if past_key_value is not None: +-++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ # key_states, value_states = past_key_value.update( +-++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++++ # ) +-++++++ +-++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++ +-++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++++ # query_states = query_states / math.sqrt(self.head_dim) +-++++++ # # <--- 修改结束 --- +-++++++ +-++++++ # fa_attention_mask = None +-++++++ # if attention_mask is not None: +-++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # input_dtype = query_states.dtype +-++++++ +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, # 传入已经预先缩放过的 query +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=fa_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++++ # input_layout="BNSD", +-++++++ # sparse_mode=0, +-++++++ # inner_precise=1 # 仍然保持内部高精度计算 +-++++++ # ) +-++++++ +-++++++ # attn_output = attn_output.to(input_dtype) +-++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ # attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # attn_weights = None +-++++++ # if output_attentions: +-++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++++ +-++++++ # return attn_output, attn_weights, past_key_value +-++++++ +-+++++ QWEN2MOE_ATTENTION_CLASSES = { +-+++++ "eager": Qwen2MoeAttention, +-++++++ "flash-attention": Qwen2MoeFlashAttention, +-+++++ } +-+++++ +-+++++ +-+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-++++++ #@dwj +-++++++ # 只遍历激活的专家,而非全部专家 +-+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- hidden_states = hidden_states.view(-1, hidden_dim) +-+++++- # router_logits: (batch * sequence_length, n_experts) +-+++++- router_logits = self.gate(hidden_states) +-+++++- +-+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- if self.norm_topk_prob: +-+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- # we cast back to the input dtype +-+++++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++- final_hidden_states = ops.zeros( +-+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++++- ) +-+++++- +-+++++- # One hot encode the selected experts to create an expert mask +-+++++- # this will be used to easily index which expert is going to be sollicitated +-+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++++- +-+++++- # Loop over all available experts in the model and perform the computation on each expert +-+++++- for expert_idx in range(self.num_experts): +-+++++- expert_layer = self.experts[expert_idx] +-+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++++- +-+++++- # Index the correct hidden states and compute the expert hidden state for +-+++++- # the current expert. We need to make sure to multiply the output hidden +-+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++++- if 0 not in idx.shape: +-+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++++- +-+++++- # However `index_add_` only support torch tensors for indexing so we'll use +-+++++- # the `top_x` tensor here. +-+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++++- +-+++++- shared_expert_output = self.shared_expert(hidden_states) +-+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++++- +-+++++- final_hidden_states = final_hidden_states + shared_expert_output +-++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++ num_tokens = hidden_states_reshaped.shape[0] +-++++++ +-++++++ router_logits = self.gate(hidden_states_reshaped) +-++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++ +-++++++ if self.norm_topk_prob: +-++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++++ flat_selected_experts = selected_experts.flatten() +-++++++ +-++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++++ token_indices = broadcasted_token_indices.flatten() +-++++++ +-++++++ active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++ for expert_idx_tensor in active_experts: +-++++++ expert_idx = expert_idx_tensor.item() +-++++++ expert_layer = self.experts[expert_idx] +-++++++ +-++++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++++ selected_token_indices = token_indices[mask] +-++++++ selected_routing_weights = routing_weights.flatten()[mask] +-++++++ +-++++++ current_states = hidden_states_reshaped[selected_token_indices] +-++++++ +-++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++ +-++++++ final_hidden_states = final_hidden_states.index_add( +-++++++ dim=0, +-++++++ index=selected_token_indices, +-++++++ source=expert_output.to(hidden_states.dtype) +-++++++ ) +-++++++ +-++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++++ +-+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++- return final_hidden_states, router_logits +-++++++ final_hidden_states = final_hidden_states + shared_expert_output +-++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++ return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++ class Qwen2MoeDecoderLayer(nn.Module): +-+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++++ +-+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ +-++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++++ +-+++++ if (layer_idx not in config.mlp_only_layers) and ( +-+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++++ ): +-+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++++ _skip_keys_device_placement = "past_key_values" +-+++++ _supports_cache_class = True +-++++++#lwx +-++++++ # _supports_static_cache = True +-+++++ +-+++++ def _init_weights(self, module): +-+++++ std = self.config.initializer_range +-+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++++ return causal_mask +-+++++ +-+++++ +-+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ _tied_weights_keys = ["lm_head.weight"] +-+++++ +-+++++ def __init__(self, config): +-+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ self.num_experts_per_tok = config.num_experts_per_tok +-+++++ # Initialize weights and apply final processing +-+++++ self.post_init() +-++++++ # @lwx +-++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++++++ # self.generation_config.cache_implementation = "static" +-++++++ self._warmed_up = False +-++++++ +-++++++ def warmup_moe_model(self): +-++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++++++ test_texts = [ +-++++++ "warmup short", +-++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++++++ ] +-++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++++ if tokenizer is None: +-++++++ from mindnlp.transformers import AutoTokenizer +-++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++++ self._warmup_tokenizer = tokenizer +-++++++ +-++++++ for text in test_texts: +-++++++ inputs = tokenizer(text, return_tensors="ms") +-++++++ with mindspore._no_grad(): +-++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++++ +-+++++ def get_input_embeddings(self): +-+++++ return self.model.embed_tokens +-+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++ ```""" +-++++++ if not self._warmed_up: +-++++++ self._warmed_up = True +-++++++ self.warmup_moe_model() +-+++++ +-+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++++ output_router_logits = ( +-+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ } +-+++++ ) +-+++++ return model_inputs +-++++++# @lwx +-++++++ # def _decode_one_tokens_logits( +-++++++ # self, +-++++++ # cur_token: mindspore.Tensor, +-++++++ # input_pos: Optional[mindspore.Tensor], +-++++++ # cache_position: mindspore.Tensor, +-++++++ # past_key_values: StaticCache, +-++++++ # ) -> mindspore.Tensor: +-++++++ # """ +-++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++++++ +-++++++ # Args: +-++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++++++ # input_pos: 输入位置信息,可选 +-++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++++++ +-++++++ # Returns: +-++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++++++ # """ +-++++++ # # 调用JIT编译的版本 +-++++++ # return self.get_decode_one_tokens_logits( +-++++++ # cur_token=cur_token, +-++++++ # input_pos=input_pos, +-++++++ # cache_position=cache_position, +-++++++ # past_key_values=past_key_values, +-++++++ # ) +-++++++ +-++++++ # @mindspore.jit(jit_level='O1') +-++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++++++ # """ +-++++++ # JIT编译的函数,用于高效的单token解码 +-++++++ # 使用JIT编译优化以支持静态shape和高效执行 +-++++++ +-++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++++++ # """ +-++++++ # outputs = self.model.forward( +-++++++ # input_ids=cur_token, +-++++++ # position_ids=input_pos, +-++++++ # cache_position=cache_position, +-++++++ # past_key_values=past_key_values, +-++++++ # use_cache=True, +-++++++ # return_dict=False, +-++++++ # ) +-++++++ +-++++++ # hidden_states = outputs[0] +-++++++ # logits = self.lm_head.forward(hidden_states) +-++++++ # logits = logits.float() +-++++++ +-++++++ # return logits[:, -1, :] +-++++++ +-++++++ # def _sample( +-++++++ # self, +-++++++ # input_ids: mindspore.Tensor, +-++++++ # logits_processor, +-++++++ # stopping_criteria, +-++++++ # generation_config, +-++++++ # synced_devices: bool, +-++++++ # streamer=None, +-++++++ # logits_warper=None, +-++++++ # **model_kwargs, +-++++++ # ): +-++++++ # """ +-++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++++++ # """ +-++++++ # from ...generation.logits_process import LogitsProcessorList +-++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++++++ # from mindnlp.core import nn, ops, no_grad +-++++++ # import numpy as np +-++++++ +-++++++ # # 检查是否使用 StaticCache +-++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++++++ # # 否则,直接调用父类方法 +-++++++ # past_key_values = model_kwargs.get("past_key_values") +-++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++++++ +-++++++ # if not isinstance(past_key_values, StaticCache): +-++++++ # # 不使用 StaticCache,直接调用父类方法 +-++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++++++ # return super()._sample( +-++++++ # input_ids=input_ids, +-++++++ # logits_processor=logits_processor, +-++++++ # stopping_criteria=stopping_criteria, +-++++++ # generation_config=generation_config, +-++++++ # synced_devices=synced_devices, +-++++++ # streamer=streamer, +-++++++ # logits_warper=logits_warper, +-++++++ # **model_kwargs, +-++++++ # ) +-++++++ +-++++++ # # 使用 StaticCache,进入自定义循环 +-++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++++++ # pad_token_id = generation_config._pad_token_tensor +-++++++ # output_attentions = generation_config.output_attentions +-++++++ # output_hidden_states = generation_config.output_hidden_states +-++++++ # output_scores = generation_config.output_scores +-++++++ # output_logits = generation_config.output_logits +-++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++++++ # max_length = generation_config.max_length +-++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++++++ # do_sample = generation_config.do_sample +-++++++ +-++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++++++ # raise ValueError( +-++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++++++ # f"{logits_warper})." +-++++++ # ) +-++++++ +-++++++ # # init attention / hidden states / scores tuples +-++++++ # scores = () if (return_dict_in_generate and output_scores) else None +-++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++++++ +-++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++++++ # encoder_hidden_states = ( +-++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++++++ # ) +-++++++ +-++++++ # # keep track of which sequences are already finished +-++++++ # batch_size, cur_len = input_ids.shape +-++++++ # this_peer_finished = False +-++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++++++ +-++++++ # time_record = [] +-++++++ # from ....utils.testing_utils import parse_flag_from_env +-++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++++++ +-++++++ # while self._has_unfinished_sequences( +-++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++++++ # ): +-++++++ # if _record_time: +-++++++ # import time as time_module +-++++++ # infer_start = time_module.time() +-++++++ +-++++++ # # prepare model inputs +-++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++++++ +-++++++ # # prepare variable output controls +-++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++++++ +-++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++++++ # cur_cache_position = model_inputs.get("cache_position") +-++++++ # cur_past_key_values = model_inputs.get("past_key_values") +-++++++ # cur_input_ids = model_inputs.get("input_ids") +-++++++ +-++++++ # if (isinstance(cur_past_key_values, StaticCache) and +-++++++ # cur_cache_position is not None and +-++++++ # len(cur_cache_position.shape) > 0 and +-++++++ # cur_cache_position.shape[0] == 1 and +-++++++ # cur_input_ids is not None and +-++++++ # cur_input_ids.shape[1] == 1): +-++++++ # # 使用 JIT 优化的单 token 解码 +-++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++++++ # if not hasattr(self, '_jit_used'): +-++++++ # self._jit_used = False +-++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++++++ +-++++++ # next_token_logits = self.get_decode_one_tokens_logits( +-++++++ # cur_token=cur_input_ids, +-++++++ # input_pos=model_inputs.get("position_ids"), +-++++++ # cache_position=cur_cache_position, +-++++++ # past_key_values=cur_past_key_values, +-++++++ # ) +-++++++ +-++++++ # # 标记已使用JIT(用于后续判断) +-++++++ # if not self._jit_used: +-++++++ # self._jit_used = True +-++++++ +-++++++ # # 构造兼容的输出对象 +-++++++ # class JitOptimizedOutput: +-++++++ # def __init__(self, logits, config): +-++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++++++ # self.config = config +-++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++++++ # self.attentions = None if not config.is_encoder_decoder else None +-++++++ # self.cross_attentions = None +-++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++++++ +-++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++++++ # else: +-++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++++++ # outputs = self(**model_inputs, return_dict=True) +-++++++ +-++++++ # if synced_devices and this_peer_finished: +-++++++ # continue +-++++++ +-++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++++++ # next_token_logits = outputs.logits[:, -1, :] +-++++++ +-++++++ # # pre-process distribution +-++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++++++ # if do_sample: +-++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++++++ +-++++++ # # Store scores, attentions and hidden_states when required +-++++++ # if return_dict_in_generate: +-++++++ # if output_scores: +-++++++ # scores += (next_token_scores,) +-++++++ # if output_logits: +-++++++ # raw_logits += (next_token_logits,) +-++++++ # if output_attentions: +-++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++++++ # if self.config.is_encoder_decoder: +-++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++++++ +-++++++ # if output_hidden_states: +-++++++ # hidden = ( +-++++++ # outputs.decoder_hidden_states +-++++++ # if self.config.is_encoder_decoder +-++++++ # else outputs.hidden_states +-++++++ # ) +-++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++++++ +-++++++ # # token selection +-++++++ # if do_sample: +-++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++++++ # else: +-++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++++++ +-++++++ # # finished sentences should have their next token be a padding token +-++++++ # if has_eos_stopping_criteria: +-++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++++++ +-++++++ # # update generated ids, model inputs, and length for next step +-++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++++++ # if streamer is not None: +-++++++ # streamer.put(next_tokens) +-++++++ +-++++++ # model_kwargs = self._update_model_kwargs_for_generation( +-++++++ # outputs, +-++++++ # model_kwargs, +-++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++++++ # ) +-++++++ +-++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++++++ # cur_len += 1 +-++++++ +-++++++ # if _record_time: +-++++++ # import time as time_module +-++++++ # infer_stop = time_module.time() +-++++++ # time_record.append(infer_stop - infer_start) +-++++++ +-++++++ # del outputs +-++++++ +-++++++ # average_infer_time = None +-++++++ # if time_record: +-++++++ # if len(time_record) > 1: +-++++++ # time_record.pop(0) +-++++++ # average_infer_time = sum(time_record) / len(time_record) +-++++++ # print(f'average inference time is: {average_infer_time}') +-++++++ # print(f'inference time record: {time_record}') +-++++++ +-++++++ # if streamer is not None: +-++++++ # streamer.end() +-++++++ +-++++++ # # 简单判断:打印是否使用了JIT路径 +-++++++ # if hasattr(self, '_jit_used') and self._jit_used: +-++++++ # print("[JIT] ✓ JIT optimization was used during generation") +-++++++ # else: +-++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++++++ +-++++++ # if return_dict_in_generate: +-++++++ # if self.config.is_encoder_decoder: +-++++++ # return GenerateEncoderDecoderOutput( +-++++++ # sequences=input_ids, +-++++++ # scores=scores, +-++++++ # logits=raw_logits, +-++++++ # encoder_attentions=encoder_attentions, +-++++++ # encoder_hidden_states=encoder_hidden_states, +-++++++ # decoder_attentions=decoder_attentions, +-++++++ # cross_attentions=cross_attentions, +-++++++ # decoder_hidden_states=decoder_hidden_states, +-++++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++++ # average_infer_time=average_infer_time +-++++++ # ) +-++++++ # else: +-++++++ # return GenerateDecoderOnlyOutput( +-++++++ # sequences=input_ids, +-++++++ # scores=scores, +-++++++ # logits=raw_logits, +-++++++ # attentions=decoder_attentions, +-++++++ # hidden_states=decoder_hidden_states, +-++++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++++ # average_infer_time=average_infer_time +-++++++ # ) +-++++++ # else: +-++++++ # return input_ids +-++++++ +-++++++ # def _prepare_cache_for_generation( +-++++++ # self, +-++++++ # generation_config, +-++++++ # model_kwargs, +-++++++ # assistant_model, +-++++++ # batch_size, +-++++++ # max_cache_length, +-++++++ # ): +-++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++++++ # generation_config.cache_implementation = "static" +-++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++++++ +-++++++ # if generation_config.cache_implementation == "static": +-++++++ # base_required_from_max_length = generation_config.max_length + 1 +-++++++ # base_required = max(max_cache_length, base_required_from_max_length) +-++++++ # min_cache_size = 50 +-++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++++++ # else: +-++++++ # max_cache_length = max(base_required, min_cache_size) +-++++++ +-++++++ # original_max_cache_length = max_cache_length +-++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++++++ # print(f" - final max_cache_length: {max_cache_length}") +-++++++ +-++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++++ # if max_cache_length > self.config.max_position_embeddings: +-++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++++++ +-++++++ # result = super()._prepare_cache_for_generation( +-++++++ # generation_config=generation_config, +-++++++ # model_kwargs=model_kwargs, +-++++++ # assistant_model=assistant_model, +-++++++ # batch_size=batch_size, +-++++++ # max_cache_length=max_cache_length, +-++++++ # ) +-++++++ +-++++++ # if generation_config.cache_implementation == "static": +-++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++++++ # created_cache = model_kwargs.get(cache_name) +-++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++++++ # if created_cache.max_cache_len < generation_config.max_length: +-++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++++++ +-++++++ # return result +-++++++ +-++++++ +-++++++ +-+++++ +-+++++ +-+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++++-- +-+++++2.27.0 +-+++++ +-++++-- +-++++2.27.0 +-++++ +-+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+++new file mode 100644 +-+++index 00000000..966529e4 +-+++--- /dev/null +-++++++ b/patches/0003-20261106secondcommit.patch +-+++@@ -0,0 +1,2769 @@ +-++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-++++From: Pinoeer-kingxi <13022943007@163.com> +-++++Date: Thu, 6 Nov 2025 14:54:37 +0800 +-++++Subject: [PATCH 3/3] 20261106secondcommit +-++++ +-++++--- +-++++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +-++++ patches/0001-20251104commit.patch | 1272 ----------------- +-++++ 3 files changed, 528 insertions(+), 2032 deletions(-) +-++++ delete mode 100644 patches/0001-20251104commit.patch +-++++ +-++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++index 73773c22..2f9192bf 100644 +-++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +-++++ +-++++ _CONFIG_FOR_DOC = "DeepseekConfig" +-++++ +-+++++_attn_mask_cache = {} +-+++++ +-+++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-+++++ q_len = batch_and_seq[1] +-+++++ kv_len = batch_and_seq[1] + past_key_values_length +-+++++ key = (batch_and_seq[0], q_len, kv_len) +-+++++ +-+++++ if key in _attn_mask_cache: +-+++++ return _attn_mask_cache[key] +-+++++ +-+++++ mask = _prepare_4d_causal_attention_mask( +-+++++ attention_mask, +-+++++ batch_and_seq, +-+++++ inputs_embeds, +-+++++ past_key_values_length, +-+++++ ) +-+++++ _attn_mask_cache[key] = mask +-+++++ return mask +-++++ +-++++ def _get_unpad_data(attention_mask): +-++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +-++++ return final_output +-++++ +-++++ +-++++- @no_grad() +-++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++- expert_cache = ops.zeros_like(x) +-++++- idxs = flat_expert_indices.argsort() +-++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++- token_idxs = idxs // self.num_experts_per_tok +-++++- +-++++- for i, end_idx in enumerate(tokens_per_expert): +-++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++- if start_idx == end_idx: +-++++- continue +-++++- expert = self.experts[i] +-++++- exp_token_idx = token_idxs[start_idx:end_idx] +-++++- expert_tokens = x[exp_token_idx] +-++++- expert_out = expert(expert_tokens) +-++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++- +-++++- return expert_cache +-++++- +-++++ # @no_grad() +-++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++- # # expert_cache = torch.zeros_like(x) +-++++- # # idxs = flat_expert_indices.argsort() +-++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++- # # token_idxs = idxs // self.num_experts_per_tok +-++++- # # for i, end_idx in enumerate(tokens_per_expert): +-++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++- # # if start_idx == end_idx: +-++++- # # continue +-++++- # # expert = self.experts[i] +-++++- # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++- # # expert_tokens = x[exp_token_idx] +-++++- # # expert_out = expert(expert_tokens) +-++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++- # # return expert_cache +-+++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++ # expert_cache = ops.zeros_like(x) +-++++ # idxs = flat_expert_indices.argsort() +-++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +-++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++ +-++++ # return expert_cache +-++++- # @no_grad() +-++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++- # expert_cache = ops.zeros_like(x) +-+++++ +-+++++ @no_grad() +-+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++ """ +-+++++ 优化版 MoE prefill: +-+++++ - 批量张量化处理同一个 expert 的所有 token +-+++++ - 跳过无 token 的专家 +-+++++ - 保持结果完全一致 +-+++++ """ +-+++++ # 初始化输出缓存 +-+++++ expert_cache = ops.zeros_like(x) +-++++ +-++++- # # 排序保证顺序一致 +-++++- # idxs = flat_expert_indices.argsort() +-++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++- # token_idxs = idxs // self.num_experts_per_tok +-+++++ # 排序(确保 scatter_add 位置对应原逻辑) +-+++++ idxs = flat_expert_indices.argsort() +-+++++ sorted_expert_indices = flat_expert_indices[idxs] +-+++++ sorted_token_indices = idxs // self.num_experts_per_tok +-++++ +-++++- # # 找出有 token 的专家 +-++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++ # 每个 expert 的 token 数 +-+++++ tokens_per_expert = sorted_expert_indices.bincount() +-++++ +-++++- # for i in active_experts.tolist(): +-++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++- # end_idx = tokens_per_expert[i] +-++++- # if start_idx == end_idx: # 没有 token +-++++- # continue +-+++++ # 找出有 token 的专家 +-+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-++++ +-++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++++- # expert_tokens = x[exp_token_idx] +-++++- # expert_out = self.experts[i](expert_tokens) +-++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++ for expert_id in active_experts.tolist(): +-+++++ # 取该 expert 对应的排序后 token 区间 +-+++++ start = (tokens_per_expert[:expert_id]).sum().item() +-+++++ end = start + tokens_per_expert[expert_id].item() +-++++ +-++++- # expert_cache = mindspore.mint.scatter_add( +-++++- # expert_cache, +-++++- # 0, +-++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++- # expert_out +-++++- # ) +-+++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-+++++ expert_tokens = x[token_idx] # 取输入向量 +-++++ +-++++- # return expert_cache +-+++++ # 执行专家 MLP +-+++++ expert_out = self.experts[expert_id](expert_tokens) +-+++++ +-+++++ # 按权重缩放 +-+++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-+++++ +-+++++ # 回写到缓存(等价 scatter_add) +-+++++ expert_cache = mindspore.mint.scatter_add( +-+++++ expert_cache, +-+++++ 0, +-+++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++ scaled_out +-+++++ ) +-+++++ +-+++++ return expert_cache +-+++++ +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # # expert_cache = torch.zeros_like(x) +-+++++ # # idxs = flat_expert_indices.argsort() +-+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++ # # token_idxs = idxs // self.num_experts_per_tok +-+++++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++ # # if start_idx == end_idx: +-+++++ # # continue +-+++++ # # expert = self.experts[i] +-+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # # expert_tokens = x[exp_token_idx] +-+++++ # # expert_out = expert(expert_tokens) +-+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++ # # return expert_cache +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # for i, end_idx in enumerate(tokens_per_expert): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # if start_idx == end_idx: +-+++++ # continue +-+++++ # expert = self.experts[i] +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = expert(expert_tokens) +-+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-+++++ # return expert_cache +-+++++ # @no_grad() +-+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ +-+++++ # # 排序保证顺序一致 +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++ +-+++++ # # 找出有 token 的专家 +-+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++ +-+++++ # for i in active_experts.tolist(): +-+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ # end_idx = tokens_per_expert[i] +-+++++ # if start_idx == end_idx: # 没有 token +-+++++ # continue +-+++++ +-+++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++ # expert_tokens = x[exp_token_idx] +-+++++ # expert_out = self.experts[i](expert_tokens) +-+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++ +-+++++ # expert_cache = mindspore.mint.scatter_add( +-+++++ # expert_cache, +-+++++ # 0, +-+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++ # expert_out +-+++++ # ) +-+++++ +-+++++ # return expert_cache +-++++ +-++++ +-++++ +-++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++- +-++++ # class DeepseekFlashAttention(nn.Module): +-++++ # """ +-++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-+++++ +-++++ Deepseek_ATTENTION_CLASSES = { +-++++ "eager": DeepseekAttention, +-++++ "flash-attention": DeepseekFlashAttention, +-++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +-++++ ) +-++++ else: +-++++ # 4d mask is passed through the layers +-++++- attention_mask = _prepare_4d_causal_attention_mask( +-+++++ # attention_mask = _prepare_4d_causal_attention_mask( +-+++++ # attention_mask, +-+++++ # (batch_size, seq_length), +-+++++ # inputs_embeds, +-+++++ # past_key_values_length, +-+++++ # ) +-+++++ #@dwj +-+++++ attention_mask = get_cached_causal_mask( +-++++ attention_mask, +-++++ (batch_size, seq_length), +-++++ inputs_embeds, +-++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-++++ self.warm_up = False +-+++++ #@dwj +-+++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-+++++ self.num_layers, +-+++++ self.num_attention_heads, +-+++++ self.head_dim, +-+++++ batch_size=1, +-+++++ max_length=self.max_length, +-+++++ dtype=mindspore.float16 +-+++++ ) +-+++++ +-+++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-+++++ key_cache = [] +-+++++ value_cache = [] +-+++++ for _ in range(num_layers): +-+++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-+++++ key_cache.append(k) +-+++++ value_cache.append(v) +-+++++ return key_cache, value_cache +-+++++ +-++++ +-++++ def warmup_moe_model_deep(self): +-++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++index bced285c..ebd7782e 100644 +-++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +-++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-++++ +-++++-Long_Prompt = False +-++++-PROMPT_LENGTH_THRESHOLD = 128 +-+++++Long_Prompt = 1 +-+++++LONG_PROMPT_LENGTH_THRESHOLD = 128 +-+++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-+++++ +-+++++_causal_mask_cache = {} +-+++++ +-+++++def get_cached_causal_mask_with_cache_position( +-+++++ attention_mask: mindspore.Tensor, +-+++++ sequence_length: int, +-+++++ target_length: int, +-+++++ dtype: mindspore.dtype, +-+++++ min_dtype: float, +-+++++ cache_position: mindspore.Tensor, +-+++++ batch_size: int, +-+++++): +-+++++ """ +-+++++ 带缓存的 causal mask 构造函数 +-+++++ """ +-+++++ # q_len 是当前 query 长度 +-+++++ q_len = sequence_length +-+++++ # kv_len 是 target_length +-+++++ kv_len = target_length +-+++++ +-+++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-+++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-+++++ +-+++++ if key in _causal_mask_cache: +-+++++ return _causal_mask_cache[key] +-+++++ +-+++++ # 调用原来的 mask 构造逻辑 +-+++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++++ attention_mask, +-+++++ sequence_length=sequence_length, +-+++++ target_length=target_length, +-+++++ dtype=dtype, +-+++++ min_dtype=min_dtype, +-+++++ cache_position=cache_position, +-+++++ batch_size=batch_size, +-+++++ ) +-+++++ # 缓存结果 +-+++++ _causal_mask_cache[key] = causal_mask +-+++++ return causal_mask +-++++ +-++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-++++ def _prepare_4d_causal_attention_mask_with_cache_position( +-++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++++ +-++++ +-++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-+++++# class Qwen2MoeAttention(nn.Module): +-+++++# """ +-+++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+++++# and "Generating Long Sequences with Sparse Transformers". +-+++++# """ +-+++++ +-+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++# super().__init__() +-+++++# self.config = config +-+++++# self.layer_idx = layer_idx +-+++++# if layer_idx is None: +-+++++# logger.warning_once( +-+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++++# "when creating this class." +-+++++# ) +-+++++ +-+++++# self.hidden_size = config.hidden_size +-+++++# self.num_heads = config.num_attention_heads +-+++++# self.head_dim = self.hidden_size // self.num_heads +-+++++# self.num_key_value_heads = config.num_key_value_heads +-+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++# self.max_position_embeddings = config.max_position_embeddings +-+++++# self.rope_theta = config.rope_theta +-+++++# self.is_causal = True +-+++++# self.attention_dropout = config.attention_dropout +-+++++ +-+++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++# raise ValueError( +-+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-+++++# f" and `num_heads`: {self.num_heads})." +-+++++# ) +-+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++ +-+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++# self.head_dim, +-+++++# max_position_embeddings=self.max_position_embeddings, +-+++++# base=self.rope_theta, +-+++++# ) +-+++++ +-+++++# def forward( +-+++++# self, +-+++++# hidden_states: mindspore.Tensor, +-+++++# attention_mask: Optional[mindspore.Tensor] = None, +-+++++# position_ids: Optional[mindspore.Tensor] = None, +-+++++# past_key_value: Optional[Cache] = None, +-+++++# output_attentions: bool = False, +-+++++# use_cache: bool = False, +-+++++# cache_position: Optional[mindspore.Tensor] = None, +-+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++ +-+++++ +-+++++ +-+++++# bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++# query_states = self.q_proj(hidden_states) +-+++++# key_states = self.k_proj(hidden_states) +-+++++# value_states = self.v_proj(hidden_states) +-+++++ +-+++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++ +-+++++# kv_seq_len = key_states.shape[-2] +-+++++# if past_key_value is not None: +-+++++# if self.layer_idx is None: +-+++++# raise ValueError( +-+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++# "with a layer index." +-+++++# ) +-+++++# if isinstance(past_key_value, StaticCache): +-+++++# kv_seq_len = key_states.shape[-2] +-+++++# else: +-+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++# if past_key_value is not None: +-+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++# if isinstance(past_key_value, StaticCache): +-+++++# kv_seq_len = key_states.shape[-2] +-+++++ +-+++++# # repeat k/v heads if n_kv_heads < n_heads +-+++++# key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++# value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++ +-+++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++ +-+++++# if attention_mask is not None: +-+++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++# attn_weights = attn_weights + causal_mask +-+++++ +-+++++# # upcast attention to fp32 +-+++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++++# attn_output = ops.matmul(attn_weights, value_states) +-+++++ +-+++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++++# raise ValueError( +-+++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+++++# f" {attn_output.shape}" +-+++++# ) +-+++++ +-+++++# attn_output = ops.transpose(attn_output, 1, 2) +-+++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++ +-+++++# attn_output = self.o_proj(attn_output) +-+++++# # @lwx +-+++++ +-+++++# # max_seq_len = self.max_position_embeddings # 2048 +-+++++ +-+++++# # if attention_mask is not None: +-+++++# # # attention_mask: [B, 1, Sq, Sk] +-+++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++ +-+++++# # # pad 到 [max_seq_len, max_seq_len] +-+++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++# # global_attention_mask = padded_mask +-+++++# # else: +-+++++# # global_attention_mask = None +-+++++ +-+++++ +-+++++# # sparse_mode=3 +-+++++# # attn_output = mindspore.ops.flash_attention_score( +-+++++# # query=query_states, +-+++++# # key=key_states, +-+++++# # value=value_states, +-+++++# # real_shift=None, +-+++++# # padding_mask=None, +-+++++ +-+++++# # head_num=self.num_heads, +-+++++# # attn_mask=global_attention_mask, +-+++++# # keep_prob=1.0 - self.attention_dropout, +-+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++# # input_layout="BNSD", +-+++++# # pre_tokens=2147483647, +-+++++# # next_tokens=2147483647, +-+++++# # inner_precise=0, +-+++++# # drop_mask=None, +-+++++# # prefix=None, +-+++++# # actual_seq_qlen=None, +-+++++# # actual_seq_kvlen=None, +-+++++# # sparse_mode=sparse_mode, +-+++++# # ) +-+++++# if not output_attentions: +-+++++# attn_weights = None +-+++++ +-+++++# return attn_output, attn_weights, past_key_value +-+++++ +-++++ class Qwen2MoeAttention(nn.Module): +-++++ """ +-++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-++++- and "Generating Long Sequences with Sparse Transformers". +-++++- """ +-+++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +-++++ +-+++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-+++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-+++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-+++++ +-+++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-+++++ """ +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++ super().__init__() +-++++ self.config = config +-++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +-++++ if layer_idx is None: +-++++ logger.warning_once( +-++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++ "when creating this class." +-++++ ) +-++++ +-++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +-++++ use_cache: bool = False, +-++++ cache_position: Optional[mindspore.Tensor] = None, +-++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++- +-++++ +-++++- +-+++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +-++++ bsz, q_len, _ = hidden_states.shape +-++++ +-++++ query_states = self.q_proj(hidden_states) +-++++ key_states = self.k_proj(hidden_states) +-++++ value_states = self.v_proj(hidden_states) +-++++ +-++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++- +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ +-++++ kv_seq_len = key_states.shape[-2] +-++++ if past_key_value is not None: +-++++- if self.layer_idx is None: +-++++- raise ValueError( +-++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++- "with a layer index." +-++++- ) +-++++- if isinstance(past_key_value, StaticCache): +-++++- kv_seq_len = key_states.shape[-2] +-++++- else: +-++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++ +-++++ if past_key_value is not None: +-++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++ +-+++++ # --- 2. 动态调度核心注意力计算 --- +-+++++ global Long_Prompt +-+++++ if Long_Prompt >= 1: +-+++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-+++++ fa_attention_mask = None +-+++++ if attention_mask is not None: +-+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++ fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++ attn_output = mindspore.ops.flash_attention_score( +-+++++ query=query_states, +-+++++ key=key_states, +-+++++ value=value_states, +-+++++ head_num=self.num_heads, +-+++++ attn_mask=fa_attention_mask, +-+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-+++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ input_layout="BNSD", +-+++++ sparse_mode=0, +-+++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-+++++ ) +-++++ +-++++- if isinstance(past_key_value, StaticCache): +-++++- kv_seq_len = key_states.shape[-2] +-+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = self.o_proj(attn_output) +-+++++ attn_weights = None +-+++++ if output_attentions: +-+++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-++++ +-++++- # repeat k/v heads if n_kv_heads < n_heads +-++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++- +-++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++ else: +-+++++ # --- Eager Attention 路径 (用于短序列和解码) --- +-+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++ +-+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++ +-++++- if attention_mask is not None: +-++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++- attn_weights = attn_weights + causal_mask +-+++++ if attention_mask is not None: +-+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++ attn_weights = attn_weights + causal_mask +-++++ +-++++- # upcast attention to fp32 +-++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++++- attn_output = ops.matmul(attn_weights, value_states) +-+++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++++ attn_output = ops.matmul(attn_weights, value_states) +-++++ +-++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++++- raise ValueError( +-++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-++++- f" {attn_output.shape}" +-++++- ) +-+++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++++ raise ValueError( +-+++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-+++++ ) +-++++ +-++++- attn_output = ops.transpose(attn_output, 1, 2) +-++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = ops.transpose(attn_output, 1, 2) +-+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = self.o_proj(attn_output) +-++++ +-++++- attn_output = self.o_proj(attn_output) +-++++- # @lwx +-+++++ if not output_attentions: +-+++++ attn_weights = None +-++++ +-++++- # max_seq_len = self.max_position_embeddings # 2048 +-++++- +-++++- # if attention_mask is not None: +-++++- # # attention_mask: [B, 1, Sq, Sk] +-++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++- +-++++- # # pad 到 [max_seq_len, max_seq_len] +-++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++- # global_attention_mask = padded_mask +-++++- # else: +-++++- # global_attention_mask = None +-++++- +-++++- +-++++- # sparse_mode=3 +-++++- # attn_output = mindspore.ops.flash_attention_score( +-++++- # query=query_states, +-++++- # key=key_states, +-++++- # value=value_states, +-++++- # real_shift=None, +-++++- # padding_mask=None, +-++++- +-++++- # head_num=self.num_heads, +-++++- # attn_mask=global_attention_mask, +-++++- # keep_prob=1.0 - self.attention_dropout, +-++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++- # input_layout="BNSD", +-++++- # pre_tokens=2147483647, +-++++- # next_tokens=2147483647, +-++++- # inner_precise=0, +-++++- # drop_mask=None, +-++++- # prefix=None, +-++++- # actual_seq_qlen=None, +-++++- # actual_seq_kvlen=None, +-++++- # sparse_mode=sparse_mode, +-++++- # ) +-++++- if not output_attentions: +-++++- attn_weights = None +-++++- +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++- +-++++ # class Qwen2MoeFlashAttention(nn.Module): +-++++ # """ +-++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +-++++ # return final_hidden_states, router_logits +-++++ +-++++ +-++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++-# """ +-++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-++++-# """ +-++++-# def __init__(self, config: Qwen2MoeConfig): +-++++-# super().__init__() +-++++-# self.num_experts = config.num_experts +-++++-# self.top_k = config.num_experts_per_tok +-++++-# self.norm_topk_prob = config.norm_topk_prob +-++++- +-++++-# # 门控网络 +-++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++-# # 专家列表 +-++++-# self.experts = nn.ModuleList( +-++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++-# ) +-++++-# # 共享专家 +-++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_decode( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# """ +-++++-# 【解码路径】针对 sequence_length=1 的极致优化。 +-++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-++++-# """ +-++++-# batch_size, hidden_dim = hidden_states.shape +-++++- +-++++-# expert_outputs_list = [ +-++++-# ops.cat([ +-++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++-# ], dim=0) +-++++-# for i in range(batch_size) +-++++-# ] +-++++- +-++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-++++-# # shape: (batch_size, top_k, hidden_dim) +-++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++- +-++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++- +-++++-# return moe_output.squeeze(1) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_prefill( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# """ +-++++-# 【预填充路径】针对 sequence_length > 1 的优化。 +-++++-# 按专家对 Token 进行分组,并进行批处理。 +-++++-# """ +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens = hidden_states.shape[0] +-++++-# flat_selected_experts = selected_experts.flatten() +-++++- +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++- +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++- +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++- +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++-# selected_token_indices = token_indices[mask] +-++++-# selected_routing_weights = routing_weights.flatten()[mask] +-++++- +-++++-# current_states = hidden_states[selected_token_indices] +-++++- +-++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++- +-++++-# moe_output = moe_output.index_add( +-++++-# dim=0, +-++++-# index=selected_token_indices, +-++++-# source=expert_output.to(hidden_states.dtype) +-++++-# ) +-++++-# return moe_output +-++++- +-++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-# """ +-++++-# 顶层 forward 方法,作为智能分发器。 +-++++-# """ +-++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- +-++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-# router_logits = self.gate(hidden_states_reshaped) +-++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- +-++++-# if self.norm_topk_prob: +-++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- +-++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++-# moe_output = None +-++++-# # 在推理时,根据序列长度选择最优路径 +-++++-# if not self.training: +-++++-# if sequence_length == 1: +-++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++-# else: +-++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++-# else: +-++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-++++-# raise NotImplementedError("Training path is not implemented.") +-++++- +-++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-++++- +-++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-++++- +-++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-++++- +-++++-# return final_hidden_states, router_logits +-++++- +-++++- +-++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++-# """ +-++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-++++-# """ +-++++-# def __init__(self, config: Qwen2MoeConfig): +-++++-# super().__init__() +-++++-# self.num_experts = config.num_experts +-++++-# self.top_k = config.num_experts_per_tok +-++++-# self.norm_topk_prob = config.norm_topk_prob +-++++- +-++++-# # 门控网络 +-++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++-# # 专家列表 +-++++-# self.experts = nn.ModuleList( +-++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++-# ) +-++++-# # 共享专家 +-++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_decode( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# batch_size, _ = hidden_states.shape +-++++-# expert_outputs_list = [ +-++++-# ops.cat([ +-++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++-# ], dim=0) +-++++-# for i in range(batch_size) +-++++-# ] +-++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++-# return moe_output.squeeze(1) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_prefill( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens = hidden_states.shape[0] +-++++-# flat_selected_experts = selected_experts.flatten() +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++- +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++-# selected_token_indices = token_indices[mask] +-++++-# selected_routing_weights = routing_weights.flatten()[mask] +-++++-# current_states = hidden_states[selected_token_indices] +-++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++-# moe_output = moe_output.index_add( +-++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++-# ) +-++++-# return moe_output +-++++- +-++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-# """ +-++++-# 顶层 forward 方法,作为智能分发器。 +-++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-++++-# """ +-++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- +-++++-# # 1. 门控计算 (通用逻辑) +-++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-# router_logits = self.gate(hidden_states_reshaped) +-++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- +-++++-# if self.norm_topk_prob: +-++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- +-++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++-# # 2. 智能分发到最优 MoE 路径 +-++++-# moe_output = None +-++++-# if not self.training: +-++++-# if sequence_length == 1: +-++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++-# else: +-++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++-# else: +-++++-# raise NotImplementedError("Training path is not implemented.") +-++++- +-++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++- +-++++-# # 4. 合并 MoE 输出和共享专家输出 +-++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++- +-++++-# # 5. 恢复原始形状并返回 +-++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++- +-++++-# return final_hidden_states, router_logits +-++++- +-++++-# prefill fastest +-++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++-# """ +-++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-++++-# """ +-++++-# def __init__(self, config: Qwen2MoeConfig): +-++++-# super().__init__() +-++++-# self.num_experts = config.num_experts +-++++-# self.top_k = config.num_experts_per_tok +-++++-# self.norm_topk_prob = config.norm_topk_prob +-++++- +-++++-# # 门控网络 +-++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++-# # 专家列表 +-++++-# self.experts = nn.ModuleList( +-++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++-# ) +-++++-# # 共享专家 +-++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_dispatch( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# """ +-++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-++++-# """ +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens, _ = hidden_states.shape +-++++- +-++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-++++-# flat_selected_experts = selected_experts.flatten() +-++++-# flat_routing_weights = routing_weights.flatten() +-++++- +-++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++- +-++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++- +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++- +-++++-# # 找到所有分配给该专家的 token +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++- +-++++-# # 使用 mask 选取对应的 token 和权重 +-++++-# current_token_indices = token_indices[mask] +-++++-# current_routing_weights = flat_routing_weights[mask] +-++++-# current_hidden_states = hidden_states[current_token_indices] +-++++- +-++++-# # 对这些 token 进行批处理 +-++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++- +-++++-# # 使用 index_add 将结果精确地加回到对应位置 +-++++-# moe_output = moe_output.index_add( +-++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-++++-# ) +-++++-# return moe_output +-++++- +-++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-# """ +-++++-# 顶层 forward 方法,作为智能分发器。 +-++++-# """ +-++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- +-++++-# # 1. 门控计算 +-++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-# router_logits = self.gate(hidden_states_reshaped) +-++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- +-++++-# if self.norm_topk_prob: +-++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- +-++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-++++- +-++++-# # 2. 调用统一的 MoE 计算内核 +-++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-++++- +-++++-# # 3. 统一处理共享专家 +-++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++- +-++++-# # 4. 合并输出 +-++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++- +-++++-# # 5. 恢复原始形状并返回 +-++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++- +-++++-# return final_hidden_states, router_logits +-++++- +-++++- +-++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++-# """ +-++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++-# 【最终高性能与高精度版】: +-++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-++++-# 3. 这样实现了速度和准确性的两全其美。 +-++++-# """ +-++++-# def __init__(self, config: Qwen2MoeConfig): +-++++-# super().__init__() +-++++-# self.num_experts = config.num_experts +-++++-# self.top_k = config.num_experts_per_tok +-++++-# self.norm_topk_prob = config.norm_topk_prob +-++++- +-++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++-# self.experts = nn.ModuleList( +-++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++-# ) +-++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_decode( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# """ +-++++-# 【解码路径】极致优化版:bmm + 高精度累加。 +-++++-# """ +-++++-# original_dtype = hidden_states.dtype +-++++-# batch_size, _ = hidden_states.shape +-++++- +-++++-# expert_outputs_list = [ +-++++-# ops.cat([ +-++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++-# ], dim=0) +-++++-# for i in range(batch_size) +-++++-# ] +-++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++- +-++++-# # 在 float32 下执行 bmm,得到高精度结果 +-++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++- +-++++-# # 将高精度结果转换回原始数据类型 +-++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-++++- +-++++-# return moe_output +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_prefill( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# selected_experts: mindspore.Tensor, +-++++-# routing_weights: mindspore.Tensor +-++++-# ) -> mindspore.Tensor: +-++++-# """ +-++++-# 【预填充路径】与原始实现一致,结果精确。 +-++++-# """ +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens, _ = hidden_states.shape +-++++-# flat_selected_experts = selected_experts.flatten() +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++- +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++-# selected_token_indices = token_indices[mask] +-++++-# selected_routing_weights = routing_weights.flatten()[mask] +-++++-# current_states = hidden_states[selected_token_indices] +-++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++-# moe_output = moe_output.index_add( +-++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++-# ) +-++++-# return moe_output +-++++- +-++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++- +-++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-# router_logits = self.gate(hidden_states_reshaped) +-++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++- +-++++-# if self.norm_topk_prob: +-++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- +-++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-++++-# # 如果模型主体是 float16,后续再转换 +-++++- +-++++-# moe_output = None +-++++-# if not self.training: +-++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-++++-# # _moe_infer_decode 内部会处理好类型转换 +-++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-++++-# if sequence_length == 1: +-++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++-# else: +-++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++-# else: +-++++-# raise NotImplementedError("Training path is not implemented.") +-++++- +-++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++- +-++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++- +-++++-# return final_hidden_states, router_logits +-++++- +-++++- +-++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++-# """ +-++++-# 【融合版】一个混合专家模块,内置两种推理策略, +-++++-# 由外部全局变量 `Long_Prompt` 控制: +-++++- +-++++-# - if Long_Prompt is True: 【精度优先模式】 +-++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-++++-# 适用于处理长序列,避免误差累积。 +-++++- +-++++-# - if Long_Prompt is False: 【速度优先模式】 +-++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-++++-# """ +-++++-# def __init__(self, config: Qwen2MoeConfig): +-++++-# super().__init__() +-++++-# self.num_experts = config.num_experts +-++++-# self.top_k = config.num_experts_per_tok +-++++-# self.norm_topk_prob = config.norm_topk_prob +-++++- +-++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++-# self.experts = nn.ModuleList( +-++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++-# ) +-++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-# # --- 速度优先模式的辅助函数 --- +-++++-# @no_grad() +-++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++-# original_dtype = hidden_states.dtype +-++++-# batch_size, _ = hidden_states.shape +-++++-# expert_outputs_list = [ +-++++-# ops.cat([ +-++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++-# ], dim=0) +-++++-# for i in range(batch_size) +-++++-# ] +-++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++-# weights_fp32 = routing_weights.to(mindspore.float32) +-++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++-# return moe_output_fp32.squeeze(1).to(original_dtype) +-++++- +-++++-# @no_grad() +-++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens, _ = hidden_states.shape +-++++-# flat_selected_experts = selected_experts.flatten() +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++-# selected_token_indices = token_indices[mask] +-++++-# selected_routing_weights = routing_weights.flatten()[mask] +-++++-# current_states = hidden_states[selected_token_indices] +-++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++-# return moe_output +-++++- +-++++-# # --- 精度优先模式的辅助函数 --- +-++++-# @no_grad() +-++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++-# moe_output = ops.zeros_like(hidden_states) +-++++-# num_tokens, _ = hidden_states.shape +-++++-# flat_selected_experts = selected_experts.flatten() +-++++-# flat_routing_weights = routing_weights.flatten() +-++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++-# active_experts = ops.unique(flat_selected_experts) +-++++-# for expert_idx_tensor in active_experts: +-++++-# expert_idx = expert_idx_tensor.item() +-++++-# expert_layer = self.experts[expert_idx] +-++++-# mask = (flat_selected_experts == expert_idx_tensor) +-++++-# current_token_indices = token_indices[mask] +-++++-# current_routing_weights = flat_routing_weights[mask] +-++++-# current_hidden_states = hidden_states[current_token_indices] +-++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++-# return moe_output +-++++- +-++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-# # 声明我们将要使用一个在模块外部定义的全局变量 +-++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-++++-# global Long_Prompt +-++++- +-++++-# # 1. 门控计算 (所有模式通用) +-++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-# router_logits = self.gate(hidden_states_reshaped) +-++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++++-# if self.norm_topk_prob: +-++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++- +-++++-# moe_output = None +-++++-# if not self.training: +-++++-# # 根据 Long_Prompt 标志选择模式 +-++++-# if Long_Prompt: +-++++-# # --- 精度优先模式 --- +-++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++-# else: +-++++-# # --- 速度优先模式 --- +-++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++-# if sequence_length == 1: +-++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++-# else: +-++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++-# else: +-++++-# raise NotImplementedError("Training path is not implemented.") +-++++- +-++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++- +-++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++- +-++++-# return final_hidden_states, router_logits +-++++- +-++++ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ """ +-++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++ return moe_output_fp32.squeeze(1).to(original_dtype) +-++++ +-+++++ # @no_grad() +-+++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++ # num_tokens, _ = hidden_states.shape +-+++++ # flat_selected_experts = selected_experts.flatten() +-+++++ # sorted_expert_indices = flat_selected_experts.argsort() +-+++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++++ # original_token_indices = sorted_expert_indices // self.top_k +-+++++ # moe_output = ops.zeros_like(hidden_states) +-+++++ # current_token_offset = 0 +-+++++ # for i in range(self.num_experts): +-+++++ # expert_token_count = tokens_per_expert[i] - current_token_offset +-+++++ # if expert_token_count == 0: +-+++++ # continue +-+++++ # end_offset = current_token_offset + expert_token_count +-+++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++++ # expert_hidden_states = hidden_states[expert_original_token_indices] +-+++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++ # current_token_offset += expert_token_count +-+++++ # return moe_output +-+++++ +-++++ @no_grad() +-++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++- num_tokens, _ = hidden_states.shape +-++++- flat_selected_experts = selected_experts.flatten() +-++++- sorted_expert_indices = flat_selected_experts.argsort() +-++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++++- original_token_indices = sorted_expert_indices // self.top_k +-+++++ """ +-+++++ 优化版 MoE prefill (速度优先模式): +-+++++ - 批量张量化处理同一个 expert 的所有 token +-+++++ - 跳过无 token 的专家 +-+++++ - 保持结果完全一致 +-+++++ """ +-++++ moe_output = ops.zeros_like(hidden_states) +-++++- current_token_offset = 0 +-++++- for i in range(self.num_experts): +-++++- expert_token_count = tokens_per_expert[i] - current_token_offset +-++++- if expert_token_count == 0: +-++++- continue +-++++- end_offset = current_token_offset + expert_token_count +-++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++++- expert_hidden_states = hidden_states[expert_original_token_indices] +-++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++- current_token_offset += expert_token_count +-+++++ +-+++++ flat_selected_experts = selected_experts.flatten() +-+++++ flat_routing_weights = routing_weights.flatten() +-+++++ +-+++++ idxs = flat_selected_experts.argsort() +-+++++ sorted_expert_indices = flat_selected_experts[idxs] +-+++++ sorted_token_indices = idxs // self.top_k +-+++++ +-+++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-+++++ +-+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+++++ +-+++++ for expert_id in active_experts.tolist(): +-+++++ start = int(tokens_per_expert[:expert_id].sum().item()) +-+++++ end = start + int(tokens_per_expert[expert_id].item()) +-+++++ +-+++++ token_idx = sorted_token_indices[start:end] +-+++++ expert_tokens = hidden_states[token_idx] +-+++++ +-+++++ expert_out = self.experts[expert_id](expert_tokens) +-+++++ +-+++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-+++++ +-+++++ moe_output = mindspore.mint.scatter_add( +-+++++ moe_output, +-+++++ 0, +-+++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-+++++ scaled_out.to(hidden_states.dtype) +-+++++ ) +-+++++ +-++++ return moe_output +-++++ +-+++++ +-++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-++++ @no_grad() +-++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++ +-++++ moe_output = None +-++++- if Long_Prompt: +-++++- # --- 精度优先模式 (ACCURACY MODE) --- +-++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ # if Long_Prompt==0: +-+++++ # # --- 精度优先模式 (ACCURACY MODE) --- +-+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ # else: +-+++++ # # --- 速度优先模式 (SPEED MODE) --- +-+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++ # if sequence_length == 1: +-+++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ # else: +-+++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ +-+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++ if sequence_length == 1: +-+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++ else: +-++++- # --- 速度优先模式 (SPEED MODE) --- +-++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++- if sequence_length == 1: +-++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++- else: +-++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++- +-+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ +-++++ +-++++ # 3. 共享专家计算与合并 (所有模式通用) +-++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++ +-++++ return final_hidden_states, router_logits +-++++ +-+++++ +-++++ class Qwen2MoeDecoderLayer(nn.Module): +-++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-++++ super().__init__() +-++++ self.hidden_size = config.hidden_size +-++++ +-++++- # if Long_Prompt: +-++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++- # else: +-+++++ # if Long_Prompt == 2: +-++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++ # else: +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ +-++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++ +-++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++++ ) +-++++ +-++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++++ # attention_mask, +-+++++ # sequence_length=sequence_length, +-+++++ # target_length=target_length, +-+++++ # dtype=dtype, +-+++++ # min_dtype=min_dtype, +-+++++ # cache_position=cache_position, +-+++++ # batch_size=input_tensor.shape[0], +-+++++ # ) +-+++++ #@dwj +-+++++ causal_mask = get_cached_causal_mask_with_cache_position( +-++++ attention_mask, +-++++ sequence_length=sequence_length, +-++++ target_length=target_length, +-++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-++++ """ +-++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-+++++ _causal_mask_cache.clear() +-++++ +-++++ input_ids = kwargs.get("input_ids") +-++++ if input_ids is None and args: +-++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ +-++++ if input_ids is not None: +-++++ prompt_length = input_ids.shape[1] +-++++- +-++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-++++- Long_Prompt = True +-+++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-+++++ Long_Prompt = 2 +-+++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-+++++ Long_Prompt = 0 +-++++ else: +-++++- Long_Prompt = False +-+++++ Long_Prompt = 1 +-+++++ +-++++ +-++++ return super().generate(*args, **kwargs) +-++++ +-++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++ dtype = self.lm_head.weight.dtype +-++++ min_dtype = float(ops.finfo(dtype).min) +-++++ +-++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-+++++ # attention_mask, +-+++++ # sequence_length=sequence_length, +-+++++ # target_length=past_key_values.get_max_length(), +-+++++ # dtype=dtype, +-+++++ # min_dtype=min_dtype, +-+++++ # cache_position=cache_position, +-+++++ # batch_size=batch_size, +-+++++ # ) +-+++++ +-+++++ #@dwj +-+++++ attention_mask = get_cached_causal_mask_with_cache_position( +-++++ attention_mask, +-++++ sequence_length=sequence_length, +-++++ target_length=past_key_values.get_max_length(), +-++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++++deleted file mode 100644 +-++++index 6dfb5b93..00000000 +-++++--- a/patches/0001-20251104commit.patch +-+++++++ /dev/null +-++++@@ -1,1272 +0,0 @@ +-++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++++-From: Pinoeer-kingxi <13022943007@163.com> +-++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++++-Subject: [PATCH] 20251104commit +-++++- +-++++---- +-++++- mindnlp/transformers/cache_utils.py | 28 +- +-++++- .../models/deepseek/modeling_deepseek.py | 149 ++- +-++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++++- 3 files changed, 976 insertions(+), 87 deletions(-) +-++++- +-++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++++-index cadd2e04..02f8d4be 100644 +-++++---- a/mindnlp/transformers/cache_utils.py +-++++-+++ b/mindnlp/transformers/cache_utils.py +-++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++++- # k_out[:, :, cache_position] = key_states +-++++- # v_out[:, :, cache_position] = value_states +-++++-- if ON_ORANGE_PI: +-++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++-- else: +-++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++-- +-++++-+ # if ON_ORANGE_PI: +-++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++-+ # else: +-++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++++-+ if cache_position.ndim > 1: +-++++-+ cache_position = cache_position.flatten() +-++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++++-+ cache_position = cache_position.int() +-++++-+ +-++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++++-+ k_out[:, :, cache_position] = key_states +-++++-+ v_out[:, :, cache_position] = value_states +-++++-+ +-++++- return k_out, v_out +-++++- +-++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++-index c695b944..d8303e45 100644 +-++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++++- # Copied from transformers.models.llama.modeling_llama.rotate_half +-++++- def rotate_half(x): +-++++- """Rotates half the hidden dims of the input.""" +-++++-- x1 = x[..., : x.shape[-1] // 2] +-++++-- x2 = x[..., x.shape[-1] // 2 :] +-++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++-+ # x1 = x[..., : x.shape[-1] // 2] +-++++-+ # x2 = x[..., x.shape[-1] // 2 :] +-++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++- return ops.cat((-x2, x1), dim=-1) +-++++- +-++++- +-++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++++- if self.training: +-++++- raise NotImplementedError("Training is not supported yet.") +-++++- else: +-++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++-- if self.config.n_shared_experts is not None: +-++++-- y = y + self.shared_experts(identity) +-++++-- return y +-++++-+ # @lwx +-++++-+ if orig_shape[1] == 1: +-++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++++-+ y=y.view(*orig_shape) +-++++-+ if self.config.n_shared_experts is not None: +-++++-+ y = y + self.shared_experts(identity) +-++++-+ return y +-++++-+ else: +-++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++++-+ if self.config.n_shared_experts is not None: +-++++-+ y = y + self.shared_experts(identity) +-++++-+ return y +-++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++-+ # if self.config.n_shared_experts is not None: +-++++-+ # y = y + self.shared_experts(identity) +-++++-+ # return y +-++++-+ +-++++-+ @no_grad() +-++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++-+ +-++++-+ expert_cache = ops.zeros_like(x) +-++++-+ for i in range(self.num_experts_per_tok): +-++++-+ expert_id = flat_expert_indices[i].item() +-++++-+ weight = flat_expert_weights[i].item() +-++++-+ expert = self.experts[expert_id] +-++++-+ expert_out = expert(x) +-++++-+ expert_cache += expert_out * weight +-++++-+ return expert_cache +-++++- +-++++- @no_grad() +-++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++-- # expert_cache = torch.zeros_like(x) +-++++-- # idxs = flat_expert_indices.argsort() +-++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++-- # token_idxs = idxs // self.num_experts_per_tok +-++++-- # for i, end_idx in enumerate(tokens_per_expert): +-++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++-- # if start_idx == end_idx: +-++++-- # continue +-++++-- # expert = self.experts[i] +-++++-- # exp_token_idx = token_idxs[start_idx:end_idx] +-++++-- # expert_tokens = x[exp_token_idx] +-++++-- # expert_out = expert(expert_tokens) +-++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++-- # return expert_cache +-++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++- expert_cache = ops.zeros_like(x) +-++++- idxs = flat_expert_indices.argsort() +-++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++- token_idxs = idxs // self.num_experts_per_tok +-++++-+ +-++++- for i, end_idx in enumerate(tokens_per_expert): +-++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++- if start_idx == end_idx: +-++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++++- expert_out = expert(expert_tokens) +-++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++-+ +-++++- return expert_cache +-++++-+ +-++++-+ # @no_grad() +-++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++-+ # # expert_cache = torch.zeros_like(x) +-++++-+ # # idxs = flat_expert_indices.argsort() +-++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++-+ # # token_idxs = idxs // self.num_experts_per_tok +-++++-+ # # for i, end_idx in enumerate(tokens_per_expert): +-++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++-+ # # if start_idx == end_idx: +-++++-+ # # continue +-++++-+ # # expert = self.experts[i] +-++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++-+ # # expert_tokens = x[exp_token_idx] +-++++-+ # # expert_out = expert(expert_tokens) +-++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++-+ # # return expert_cache +-++++-+ # expert_cache = ops.zeros_like(x) +-++++-+ # idxs = flat_expert_indices.argsort() +-++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++-+ # token_idxs = idxs // self.num_experts_per_tok +-++++-+ +-++++-+ # for i, end_idx in enumerate(tokens_per_expert): +-++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++-+ # if start_idx == end_idx: +-++++-+ # continue +-++++-+ # expert = self.experts[i] +-++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++-+ # expert_tokens = x[exp_token_idx] +-++++-+ # expert_out = expert(expert_tokens) +-++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++-+ +-++++-+ # return expert_cache +-++++-+ # @no_grad() +-++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++-+ # expert_cache = ops.zeros_like(x) +-++++-+ +-++++-+ # # 排序保证顺序一致 +-++++-+ # idxs = flat_expert_indices.argsort() +-++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++-+ # token_idxs = idxs // self.num_experts_per_tok +-++++-+ +-++++-+ # # 找出有 token 的专家 +-++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++-+ +-++++-+ # for i in active_experts.tolist(): +-++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++-+ # end_idx = tokens_per_expert[i] +-++++-+ # if start_idx == end_idx: # 没有 token +-++++-+ # continue +-++++-+ +-++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++-+ # expert_tokens = x[exp_token_idx] +-++++-+ # expert_out = self.experts[i](expert_tokens) +-++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++-+ +-++++-+ # expert_cache = mindspore.mint.scatter_add( +-++++-+ # expert_cache, +-++++-+ # 0, +-++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++-+ # expert_out +-++++-+ # ) +-++++-+ +-++++-+ # return expert_cache +-++++-+ +-++++-+ +-++++- +-++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++++- # """ +-++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++- +-++++- # Initialize weights and apply final processing +-++++- self.post_init() +-++++-+ self.warm_up = False +-++++-+ +-++++-+ def warmup_moe_model_deep(self): +-++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++-+ test_texts = [ +-++++-+ "warmup short", +-++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++++-+ ] +-++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++-+ if tokenizer is None: +-++++-+ from mindnlp.transformers import AutoTokenizer +-++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++-+ self._warmup_tokenizer = tokenizer +-++++-+ +-++++-+ for text in test_texts: +-++++-+ inputs = tokenizer(text, return_tensors="ms") +-++++-+ with mindspore._no_grad(): +-++++-+ _ = self(**inputs, use_cache=False) +-++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++++- +-++++- def get_input_embeddings(self): +-++++- return self.model.embed_tokens +-++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++- ```""" +-++++-+ if not self.warm_up: +-++++-+ self.warm_up = True +-++++-+ self.warmup_moe_model_deep() +-++++-+ +-++++- output_attentions = ( +-++++- output_attentions +-++++- if output_attentions is not None +-++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++-index 3cbf820e..d4c6b651 100644 +-++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++-@@ -18,7 +18,6 @@ +-++++- # See the License for the specific language governing permissions and +-++++- # limitations under the License. +-++++- """MindSpore Qwen2MoE model.""" +-++++-- +-++++- import math +-++++- from typing import List, Optional, Tuple, Union +-++++- +-++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++++- TokenClassifierOutput, +-++++- ) +-++++- from ...modeling_utils import PreTrainedModel +-++++-+from ...generation import GenerationMixin +-++++- from ....utils import logging +-++++- from .configuration_qwen2_moe import Qwen2MoeConfig +-++++- +-++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++++- self.variance_epsilon = eps +-++++- +-++++- def forward(self, hidden_states): +-++++-+ # @dwj +-++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++-+ # @lwx +-++++-+ # if not self.training : +-++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++- input_dtype = hidden_states.dtype +-++++- hidden_states = hidden_states.to(mindspore.float32) +-++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++++-@@ -234,6 +239,8 @@ def rotate_half(x): +-++++- """Rotates half the hidden dims of the input.""" +-++++- x1 = x[..., : x.shape[-1] // 2] +-++++- x2 = x[..., x.shape[-1] // 2 :] +-++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++- return ops.cat((-x2, x1), dim=-1) +-++++- +-++++- +-++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++++- self.config = config +-++++- self.hidden_size = config.hidden_size +-++++- self.intermediate_size = intermediate_size +-++++-+ +-++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++++- self.act_fn = ACT2FN[config.hidden_act] +-++++- +-++++- def forward(self, x): +-++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++-- +-++++- +-++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++-+ # @lwx +-++++-+ # gate_up_output = self.gate_up_proj(x) +-++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++++-+ # return self.down_proj(swiglu_output) +-++++-+ +-++++-+ # def forward(self, x): +-++++-+ # gate_proj_out = self.gate_proj(x) +-++++-+ # up_proj_out = self.up_proj(x) +-++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++++-+ # return self.down_proj(swiglu_out) +-++++-+ +-++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++++- """ +-++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++++- use_cache: bool = False, +-++++- cache_position: Optional[mindspore.Tensor] = None, +-++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++-+ +-++++-+ +-++++-+ +-++++- bsz, q_len, _ = hidden_states.shape +-++++- +-++++- query_states = self.q_proj(hidden_states) +-++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++- "with a layer index." +-++++- ) +-++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++-+ if isinstance(past_key_value, StaticCache): +-++++-+ kv_seq_len = key_states.shape[-2] +-++++-+ else: +-++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++- +-++++- if past_key_value is not None: +-++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++-+ +-++++-+ if isinstance(past_key_value, StaticCache): +-++++-+ kv_seq_len = key_states.shape[-2] +-++++- +-++++- # repeat k/v heads if n_kv_heads < n_heads +-++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++-- +-++++-+ +-++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++- +-++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++++-- raise ValueError( +-++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++++-- f" {attn_weights.shape}" +-++++-- ) +-++++-- +-++++-- if attention_mask is not None: # no matter the length, we just slice it +-++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++++-+ if attention_mask is not None: +-++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++- attn_weights = attn_weights + causal_mask +-++++- +-++++- # upcast attention to fp32 +-++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++- +-++++- attn_output = self.o_proj(attn_output) +-++++-- +-++++-+ # @lwx +-++++-+ +-++++-+ # max_seq_len = self.max_position_embeddings # 2048 +-++++-+ +-++++-+ # if attention_mask is not None: +-++++-+ # # attention_mask: [B, 1, Sq, Sk] +-++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++-+ +-++++-+ # # pad 到 [max_seq_len, max_seq_len] +-++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++-+ # global_attention_mask = padded_mask +-++++-+ # else: +-++++-+ # global_attention_mask = None +-++++-+ +-++++-+ +-++++-+ # sparse_mode=3 +-++++-+ # attn_output = mindspore.ops.flash_attention_score( +-++++-+ # query=query_states, +-++++-+ # key=key_states, +-++++-+ # value=value_states, +-++++-+ # real_shift=None, +-++++-+ # padding_mask=None, +-++++-+ +-++++-+ # head_num=self.num_heads, +-++++-+ # attn_mask=global_attention_mask, +-++++-+ # keep_prob=1.0 - self.attention_dropout, +-++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++-+ # input_layout="BNSD", +-++++-+ # pre_tokens=2147483647, +-++++-+ # next_tokens=2147483647, +-++++-+ # inner_precise=0, +-++++-+ # drop_mask=None, +-++++-+ # prefix=None, +-++++-+ # actual_seq_qlen=None, +-++++-+ # actual_seq_kvlen=None, +-++++-+ # sparse_mode=sparse_mode, +-++++-+ # ) +-++++- if not output_attentions: +-++++- attn_weights = None +-++++- +-++++- return attn_output, attn_weights, past_key_value +-++++- +-++++- +-++++-+class Qwen2MoeFlashAttention(nn.Module): +-++++-+ """ +-++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++-+ +-++++-+ 关键改动: +-++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++-+ 直接传入原始的 key 和 value 张量效率更高。 +-++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++-+ """ +-++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++-+ super().__init__() +-++++-+ self.config = config +-++++-+ self.layer_idx = layer_idx +-++++-+ self.hidden_size = config.hidden_size +-++++-+ self.num_heads = config.num_attention_heads +-++++-+ self.head_dim = self.hidden_size // self.num_heads +-++++-+ self.num_key_value_heads = config.num_key_value_heads +-++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++-+ self.max_position_embeddings = config.max_position_embeddings +-++++-+ self.rope_theta = config.rope_theta +-++++-+ self.attention_dropout = config.attention_dropout +-++++-+ +-++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++-+ raise ValueError( +-++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++-+ ) +-++++-+ +-++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++-+ +-++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++-+ self.head_dim, +-++++-+ max_position_embeddings=self.max_position_embeddings, +-++++-+ base=self.rope_theta, +-++++-+ ) +-++++-+ +-++++-+ def forward( +-++++-+ self, +-++++-+ hidden_states: mindspore.Tensor, +-++++-+ attention_mask: Optional[mindspore.Tensor] = None, +-++++-+ position_ids: Optional[mindspore.Tensor] = None, +-++++-+ past_key_value: Optional[Cache] = None, +-++++-+ output_attentions: bool = False, +-++++-+ use_cache: bool = False, +-++++-+ cache_position: Optional[mindspore.Tensor] = None, +-++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++-+ +-++++-+ bsz, q_len, _ = hidden_states.shape +-++++-+ +-++++-+ # 1. 线性投射 Q, K, V +-++++-+ query_states = self.q_proj(hidden_states) +-++++-+ key_states = self.k_proj(hidden_states) +-++++-+ value_states = self.v_proj(hidden_states) +-++++-+ +-++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++-+ # query: [B, S, H*D] -> [B, N1, S, D] +-++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ +-++++-+ # 3. RoPE 旋转位置编码 +-++++-+ kv_seq_len = key_states.shape[-2] +-++++-+ if past_key_value is not None: +-++++-+ if self.layer_idx is None: +-++++-+ raise ValueError( +-++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++-+ "with a layer index." +-++++-+ ) +-++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++-+ if cache_position.shape[0] == 1: +-++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++-+ kv_seq_len = past_seen_tokens + 1 +-++++-+ else: +-++++-+ # prefill 阶段:cache_position 是范围,使用其长度 +-++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++-+ else: +-++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++-+ +-++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++-+ +-++++-+ # 4. KV 缓存更新 +-++++-+ if past_key_value is not None: +-++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++-+ key_states, value_states = past_key_value.update( +-++++-+ key_states, value_states, self.layer_idx, cache_kwargs +-++++-+ ) +-++++-+ +-++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++-+ if cache_position.shape[0] == 1: +-++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++-+ kv_seq_len = key_states.shape[-2] +-++++-+ +-++++-+ # 5. [重要] 准备 Attention Mask +-++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++-+ fa_attention_mask = None +-++++-+ if attention_mask is not None: +-++++-+ # 截取与当前key长度匹配的部分 +-++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++-+ fa_attention_mask = (mask_slice != 0) +-++++-+ +-++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++-+ input_dtype = query_states.dtype +-++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++-+ query_states = query_states.to(mindspore.float16) +-++++-+ key_states = key_states.to(mindspore.float16) +-++++-+ value_states = value_states.to(mindspore.float16) +-++++-+ +-++++-+ # 6. [核心] 调用 flash_attention_score 算子 +-++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++-+ attn_output = mindspore.ops.flash_attention_score( +-++++-+ query=query_states, +-++++-+ key=key_states, +-++++-+ value=value_states, +-++++-+ head_num=self.num_heads, # 传入Q的头数(N1) +-++++-+ attn_mask=fa_attention_mask, +-++++-+ keep_prob=1.0 - self.attention_dropout, +-++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++-+ input_layout="BNSD", +-++++-+ sparse_mode=0 # 使用 defaultMask 模式 +-++++-+ ) +-++++-+ +-++++-+ # 恢复原始数据类型 +-++++-+ attn_output = attn_output.to(input_dtype) +-++++-+ +-++++-+ # 7. 调整输出形状 +-++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++-+ attn_output = self.o_proj(attn_output) +-++++-+ +-++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-++++-+ attn_weights = None +-++++-+ if output_attentions: +-++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++-+ +-++++-+ return attn_output, attn_weights, past_key_value +-++++-+ +-++++-+ # def forward( +-++++-+ # self, +-++++-+ # hidden_states: mindspore.Tensor, +-++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-++++-+ # position_ids: Optional[mindspore.Tensor] = None, +-++++-+ # past_key_value: Optional[Cache] = None, +-++++-+ # output_attentions: bool = False, +-++++-+ # use_cache: bool = False, +-++++-+ # cache_position: Optional[mindspore.Tensor] = None, +-++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++-+ +-++++-+ # bsz, q_len, _ = hidden_states.shape +-++++-+ +-++++-+ # # 1. 线性投射 Q, K, V +-++++-+ # query_states = self.q_proj(hidden_states) +-++++-+ # key_states = self.k_proj(hidden_states) +-++++-+ # value_states = self.v_proj(hidden_states) +-++++-+ +-++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ +-++++-+ # # 3. RoPE 旋转位置编码 +-++++-+ # kv_seq_len = key_states.shape[-2] +-++++-+ # if past_key_value is not None: +-++++-+ # if self.layer_idx is None: +-++++-+ # raise ValueError( +-++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++-+ # "with a layer index." +-++++-+ # ) +-++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++-+ +-++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++-+ +-++++-+ # # 4. KV 缓存更新 +-++++-+ # if past_key_value is not None: +-++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++-+ # key_states, value_states = past_key_value.update( +-++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-++++-+ # ) +-++++-+ +-++++-+ # # 5. 准备 Attention Mask +-++++-+ # fa_attention_mask = None +-++++-+ # if attention_mask is not None: +-++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++-+ # fa_attention_mask = (mask_slice != 0) +-++++-+ +-++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++-+ # input_dtype = query_states.dtype +-++++-+ +-++++-+ # # 6. [核心] 调用 flash_attention_score 算子 +-++++-+ # attn_output = mindspore.ops.flash_attention_score( +-++++-+ # query=query_states, +-++++-+ # key=key_states, +-++++-+ # value=value_states, +-++++-+ # head_num=self.num_heads, +-++++-+ # attn_mask=fa_attention_mask, +-++++-+ # keep_prob=1.0 - self.attention_dropout, +-++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++-+ # input_layout="BNSD", +-++++-+ # sparse_mode=0, +-++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++-+ # inner_precise=1 +-++++-+ # ) +-++++-+ +-++++-+ # # 恢复原始数据类型 +-++++-+ # attn_output = attn_output.to(input_dtype) +-++++-+ +-++++-+ # # 7. 调整输出形状 +-++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++-+ # attn_output = self.o_proj(attn_output) +-++++-+ +-++++-+ # attn_weights = None +-++++-+ # if output_attentions: +-++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++-+ +-++++-+ # return attn_output, attn_weights, past_key_value +-++++-+ +-++++-+ # def forward( +-++++-+ # self, +-++++-+ # hidden_states: mindspore.Tensor, +-++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-++++-+ # position_ids: Optional[mindspore.Tensor] = None, +-++++-+ # past_key_value: Optional[Cache] = None, +-++++-+ # output_attentions: bool = False, +-++++-+ # use_cache: bool = False, +-++++-+ # cache_position: Optional[mindspore.Tensor] = None, +-++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++-+ +-++++-+ # bsz, q_len, _ = hidden_states.shape +-++++-+ +-++++-+ # query_states = self.q_proj(hidden_states) +-++++-+ # key_states = self.k_proj(hidden_states) +-++++-+ # value_states = self.v_proj(hidden_states) +-++++-+ +-++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-+ +-++++-+ # kv_seq_len = key_states.shape[-2] +-++++-+ # if past_key_value is not None: +-++++-+ # if self.layer_idx is None: +-++++-+ # raise ValueError("`layer_idx` must be specified for caching") +-++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++-+ +-++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++-+ +-++++-+ # if past_key_value is not None: +-++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++-+ # key_states, value_states = past_key_value.update( +-++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-++++-+ # ) +-++++-+ +-++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++-+ +-++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++-+ # query_states = query_states / math.sqrt(self.head_dim) +-++++-+ # # <--- 修改结束 --- +-++++-+ +-++++-+ # fa_attention_mask = None +-++++-+ # if attention_mask is not None: +-++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++-+ # fa_attention_mask = (mask_slice != 0) +-++++-+ +-++++-+ # input_dtype = query_states.dtype +-++++-+ +-++++-+ # attn_output = mindspore.ops.flash_attention_score( +-++++-+ # query=query_states, # 传入已经预先缩放过的 query +-++++-+ # key=key_states, +-++++-+ # value=value_states, +-++++-+ # head_num=self.num_heads, +-++++-+ # attn_mask=fa_attention_mask, +-++++-+ # keep_prob=1.0 - self.attention_dropout, +-++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++-+ # input_layout="BNSD", +-++++-+ # sparse_mode=0, +-++++-+ # inner_precise=1 # 仍然保持内部高精度计算 +-++++-+ # ) +-++++-+ +-++++-+ # attn_output = attn_output.to(input_dtype) +-++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++-+ # attn_output = self.o_proj(attn_output) +-++++-+ +-++++-+ # attn_weights = None +-++++-+ # if output_attentions: +-++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++-+ +-++++-+ # return attn_output, attn_weights, past_key_value +-++++-+ +-++++- QWEN2MOE_ATTENTION_CLASSES = { +-++++- "eager": Qwen2MoeAttention, +-++++-+ "flash-attention": Qwen2MoeFlashAttention, +-++++- } +-++++- +-++++- +-++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++- +-++++-+ #@dwj +-++++-+ # 只遍历激活的专家,而非全部专家 +-++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++-- hidden_states = hidden_states.view(-1, hidden_dim) +-++++-- # router_logits: (batch * sequence_length, n_experts) +-++++-- router_logits = self.gate(hidden_states) +-++++-- +-++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++-- if self.norm_topk_prob: +-++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++-- # we cast back to the input dtype +-++++-- routing_weights = routing_weights.to(hidden_states.dtype) +-++++-- +-++++-- final_hidden_states = ops.zeros( +-++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++++-- ) +-++++-- +-++++-- # One hot encode the selected experts to create an expert mask +-++++-- # this will be used to easily index which expert is going to be sollicitated +-++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++++-- +-++++-- # Loop over all available experts in the model and perform the computation on each expert +-++++-- for expert_idx in range(self.num_experts): +-++++-- expert_layer = self.experts[expert_idx] +-++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++++-- +-++++-- # Index the correct hidden states and compute the expert hidden state for +-++++-- # the current expert. We need to make sure to multiply the output hidden +-++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++++-- if 0 not in idx.shape: +-++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++++-- +-++++-- # However `index_add_` only support torch tensors for indexing so we'll use +-++++-- # the `top_x` tensor here. +-++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++++-- +-++++-- shared_expert_output = self.shared_expert(hidden_states) +-++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++++-- +-++++-- final_hidden_states = final_hidden_states + shared_expert_output +-++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++-+ num_tokens = hidden_states_reshaped.shape[0] +-++++-+ +-++++-+ router_logits = self.gate(hidden_states_reshaped) +-++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++-+ +-++++-+ if self.norm_topk_prob: +-++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++-+ routing_weights = routing_weights.to(hidden_states.dtype) +-++++-+ +-++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++-+ flat_selected_experts = selected_experts.flatten() +-++++-+ +-++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++-+ token_indices = broadcasted_token_indices.flatten() +-++++-+ +-++++-+ active_experts = ops.unique(flat_selected_experts) +-++++-+ +-++++-+ for expert_idx_tensor in active_experts: +-++++-+ expert_idx = expert_idx_tensor.item() +-++++-+ expert_layer = self.experts[expert_idx] +-++++-+ +-++++-+ mask = (flat_selected_experts == expert_idx_tensor) +-++++-+ selected_token_indices = token_indices[mask] +-++++-+ selected_routing_weights = routing_weights.flatten()[mask] +-++++-+ +-++++-+ current_states = hidden_states_reshaped[selected_token_indices] +-++++-+ +-++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++-+ +-++++-+ final_hidden_states = final_hidden_states.index_add( +-++++-+ dim=0, +-++++-+ index=selected_token_indices, +-++++-+ source=expert_output.to(hidden_states.dtype) +-++++-+ ) +-++++-+ +-++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++- +-++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++-- return final_hidden_states, router_logits +-++++-+ final_hidden_states = final_hidden_states + shared_expert_output +-++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++-+ +-++++-+ return final_hidden_states, router_logits +-++++- +-++++- +-++++- class Qwen2MoeDecoderLayer(nn.Module): +-++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++++- +-++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++- +-++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++-+ +-++++- if (layer_idx not in config.mlp_only_layers) and ( +-++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++++- ): +-++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++++- _skip_keys_device_placement = "past_key_values" +-++++- _supports_cache_class = True +-++++-+#lwx +-++++-+ # _supports_static_cache = True +-++++- +-++++- def _init_weights(self, module): +-++++- std = self.config.initializer_range +-++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++++- return causal_mask +-++++- +-++++- +-++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++- _tied_weights_keys = ["lm_head.weight"] +-++++- +-++++- def __init__(self, config): +-++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++- self.num_experts_per_tok = config.num_experts_per_tok +-++++- # Initialize weights and apply final processing +-++++- self.post_init() +-++++-+ # @lwx +-++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++++-+ # self.generation_config.cache_implementation = "static" +-++++-+ self._warmed_up = False +-++++-+ +-++++-+ def warmup_moe_model(self): +-++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++++-+ test_texts = [ +-++++-+ "warmup short", +-++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++++-+ ] +-++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++-+ if tokenizer is None: +-++++-+ from mindnlp.transformers import AutoTokenizer +-++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++-+ self._warmup_tokenizer = tokenizer +-++++-+ +-++++-+ for text in test_texts: +-++++-+ inputs = tokenizer(text, return_tensors="ms") +-++++-+ with mindspore._no_grad(): +-++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++++- +-++++- def get_input_embeddings(self): +-++++- return self.model.embed_tokens +-++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++- ```""" +-++++-+ if not self._warmed_up: +-++++-+ self._warmed_up = True +-++++-+ self.warmup_moe_model() +-++++- +-++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++++- output_router_logits = ( +-++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++- } +-++++- ) +-++++- return model_inputs +-++++-+# @lwx +-++++-+ # def _decode_one_tokens_logits( +-++++-+ # self, +-++++-+ # cur_token: mindspore.Tensor, +-++++-+ # input_pos: Optional[mindspore.Tensor], +-++++-+ # cache_position: mindspore.Tensor, +-++++-+ # past_key_values: StaticCache, +-++++-+ # ) -> mindspore.Tensor: +-++++-+ # """ +-++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++++-+ +-++++-+ # Args: +-++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++++-+ # input_pos: 输入位置信息,可选 +-++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++++-+ +-++++-+ # Returns: +-++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++++-+ # """ +-++++-+ # # 调用JIT编译的版本 +-++++-+ # return self.get_decode_one_tokens_logits( +-++++-+ # cur_token=cur_token, +-++++-+ # input_pos=input_pos, +-++++-+ # cache_position=cache_position, +-++++-+ # past_key_values=past_key_values, +-++++-+ # ) +-++++-+ +-++++-+ # @mindspore.jit(jit_level='O1') +-++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++++-+ # """ +-++++-+ # JIT编译的函数,用于高效的单token解码 +-++++-+ # 使用JIT编译优化以支持静态shape和高效执行 +-++++-+ +-++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++++-+ # """ +-++++-+ # outputs = self.model.forward( +-++++-+ # input_ids=cur_token, +-++++-+ # position_ids=input_pos, +-++++-+ # cache_position=cache_position, +-++++-+ # past_key_values=past_key_values, +-++++-+ # use_cache=True, +-++++-+ # return_dict=False, +-++++-+ # ) +-++++-+ +-++++-+ # hidden_states = outputs[0] +-++++-+ # logits = self.lm_head.forward(hidden_states) +-++++-+ # logits = logits.float() +-++++-+ +-++++-+ # return logits[:, -1, :] +-++++-+ +-++++-+ # def _sample( +-++++-+ # self, +-++++-+ # input_ids: mindspore.Tensor, +-++++-+ # logits_processor, +-++++-+ # stopping_criteria, +-++++-+ # generation_config, +-++++-+ # synced_devices: bool, +-++++-+ # streamer=None, +-++++-+ # logits_warper=None, +-++++-+ # **model_kwargs, +-++++-+ # ): +-++++-+ # """ +-++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++++-+ # """ +-++++-+ # from ...generation.logits_process import LogitsProcessorList +-++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++++-+ # from mindnlp.core import nn, ops, no_grad +-++++-+ # import numpy as np +-++++-+ +-++++-+ # # 检查是否使用 StaticCache +-++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++++-+ # # 否则,直接调用父类方法 +-++++-+ # past_key_values = model_kwargs.get("past_key_values") +-++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++++-+ +-++++-+ # if not isinstance(past_key_values, StaticCache): +-++++-+ # # 不使用 StaticCache,直接调用父类方法 +-++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++++-+ # return super()._sample( +-++++-+ # input_ids=input_ids, +-++++-+ # logits_processor=logits_processor, +-++++-+ # stopping_criteria=stopping_criteria, +-++++-+ # generation_config=generation_config, +-++++-+ # synced_devices=synced_devices, +-++++-+ # streamer=streamer, +-++++-+ # logits_warper=logits_warper, +-++++-+ # **model_kwargs, +-++++-+ # ) +-++++-+ +-++++-+ # # 使用 StaticCache,进入自定义循环 +-++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++++-+ # pad_token_id = generation_config._pad_token_tensor +-++++-+ # output_attentions = generation_config.output_attentions +-++++-+ # output_hidden_states = generation_config.output_hidden_states +-++++-+ # output_scores = generation_config.output_scores +-++++-+ # output_logits = generation_config.output_logits +-++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-++++-+ # max_length = generation_config.max_length +-++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++++-+ # do_sample = generation_config.do_sample +-++++-+ +-++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++++-+ # raise ValueError( +-++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++++-+ # f"{logits_warper})." +-++++-+ # ) +-++++-+ +-++++-+ # # init attention / hidden states / scores tuples +-++++-+ # scores = () if (return_dict_in_generate and output_scores) else None +-++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++++-+ +-++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++++-+ # encoder_hidden_states = ( +-++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++++-+ # ) +-++++-+ +-++++-+ # # keep track of which sequences are already finished +-++++-+ # batch_size, cur_len = input_ids.shape +-++++-+ # this_peer_finished = False +-++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++++-+ +-++++-+ # time_record = [] +-++++-+ # from ....utils.testing_utils import parse_flag_from_env +-++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++++-+ +-++++-+ # while self._has_unfinished_sequences( +-++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++++-+ # ): +-++++-+ # if _record_time: +-++++-+ # import time as time_module +-++++-+ # infer_start = time_module.time() +-++++-+ +-++++-+ # # prepare model inputs +-++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++++-+ +-++++-+ # # prepare variable output controls +-++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++++-+ +-++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++++-+ # cur_cache_position = model_inputs.get("cache_position") +-++++-+ # cur_past_key_values = model_inputs.get("past_key_values") +-++++-+ # cur_input_ids = model_inputs.get("input_ids") +-++++-+ +-++++-+ # if (isinstance(cur_past_key_values, StaticCache) and +-++++-+ # cur_cache_position is not None and +-++++-+ # len(cur_cache_position.shape) > 0 and +-++++-+ # cur_cache_position.shape[0] == 1 and +-++++-+ # cur_input_ids is not None and +-++++-+ # cur_input_ids.shape[1] == 1): +-++++-+ # # 使用 JIT 优化的单 token 解码 +-++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++++-+ # if not hasattr(self, '_jit_used'): +-++++-+ # self._jit_used = False +-++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++++-+ +-++++-+ # next_token_logits = self.get_decode_one_tokens_logits( +-++++-+ # cur_token=cur_input_ids, +-++++-+ # input_pos=model_inputs.get("position_ids"), +-++++-+ # cache_position=cur_cache_position, +-++++-+ # past_key_values=cur_past_key_values, +-++++-+ # ) +-++++-+ +-++++-+ # # 标记已使用JIT(用于后续判断) +-++++-+ # if not self._jit_used: +-++++-+ # self._jit_used = True +-++++-+ +-++++-+ # # 构造兼容的输出对象 +-++++-+ # class JitOptimizedOutput: +-++++-+ # def __init__(self, logits, config): +-++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++++-+ # self.config = config +-++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++++-+ # self.attentions = None if not config.is_encoder_decoder else None +-++++-+ # self.cross_attentions = None +-++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-++++-+ +-++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++++-+ # else: +-++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++++-+ # outputs = self(**model_inputs, return_dict=True) +-++++-+ +-++++-+ # if synced_devices and this_peer_finished: +-++++-+ # continue +-++++-+ +-++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++++-+ # next_token_logits = outputs.logits[:, -1, :] +-++++-+ +-++++-+ # # pre-process distribution +-++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++++-+ # if do_sample: +-++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++++-+ +-++++-+ # # Store scores, attentions and hidden_states when required +-++++-+ # if return_dict_in_generate: +-++++-+ # if output_scores: +-++++-+ # scores += (next_token_scores,) +-++++-+ # if output_logits: +-++++-+ # raw_logits += (next_token_logits,) +-++++-+ # if output_attentions: +-++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-++++-+ # if self.config.is_encoder_decoder: +-++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++++-+ +-++++-+ # if output_hidden_states: +-++++-+ # hidden = ( +-++++-+ # outputs.decoder_hidden_states +-++++-+ # if self.config.is_encoder_decoder +-++++-+ # else outputs.hidden_states +-++++-+ # ) +-++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++++-+ +-++++-+ # # token selection +-++++-+ # if do_sample: +-++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++++-+ # else: +-++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++++-+ +-++++-+ # # finished sentences should have their next token be a padding token +-++++-+ # if has_eos_stopping_criteria: +-++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++++-+ +-++++-+ # # update generated ids, model inputs, and length for next step +-++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++++-+ # if streamer is not None: +-++++-+ # streamer.put(next_tokens) +-++++-+ +-++++-+ # model_kwargs = self._update_model_kwargs_for_generation( +-++++-+ # outputs, +-++++-+ # model_kwargs, +-++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-++++-+ # ) +-++++-+ +-++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++++-+ # cur_len += 1 +-++++-+ +-++++-+ # if _record_time: +-++++-+ # import time as time_module +-++++-+ # infer_stop = time_module.time() +-++++-+ # time_record.append(infer_stop - infer_start) +-++++-+ +-++++-+ # del outputs +-++++-+ +-++++-+ # average_infer_time = None +-++++-+ # if time_record: +-++++-+ # if len(time_record) > 1: +-++++-+ # time_record.pop(0) +-++++-+ # average_infer_time = sum(time_record) / len(time_record) +-++++-+ # print(f'average inference time is: {average_infer_time}') +-++++-+ # print(f'inference time record: {time_record}') +-++++-+ +-++++-+ # if streamer is not None: +-++++-+ # streamer.end() +-++++-+ +-++++-+ # # 简单判断:打印是否使用了JIT路径 +-++++-+ # if hasattr(self, '_jit_used') and self._jit_used: +-++++-+ # print("[JIT] ✓ JIT optimization was used during generation") +-++++-+ # else: +-++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++++-+ +-++++-+ # if return_dict_in_generate: +-++++-+ # if self.config.is_encoder_decoder: +-++++-+ # return GenerateEncoderDecoderOutput( +-++++-+ # sequences=input_ids, +-++++-+ # scores=scores, +-++++-+ # logits=raw_logits, +-++++-+ # encoder_attentions=encoder_attentions, +-++++-+ # encoder_hidden_states=encoder_hidden_states, +-++++-+ # decoder_attentions=decoder_attentions, +-++++-+ # cross_attentions=cross_attentions, +-++++-+ # decoder_hidden_states=decoder_hidden_states, +-++++-+ # past_key_values=model_kwargs.get("past_key_values"), +-++++-+ # average_infer_time=average_infer_time +-++++-+ # ) +-++++-+ # else: +-++++-+ # return GenerateDecoderOnlyOutput( +-++++-+ # sequences=input_ids, +-++++-+ # scores=scores, +-++++-+ # logits=raw_logits, +-++++-+ # attentions=decoder_attentions, +-++++-+ # hidden_states=decoder_hidden_states, +-++++-+ # past_key_values=model_kwargs.get("past_key_values"), +-++++-+ # average_infer_time=average_infer_time +-++++-+ # ) +-++++-+ # else: +-++++-+ # return input_ids +-++++-+ +-++++-+ # def _prepare_cache_for_generation( +-++++-+ # self, +-++++-+ # generation_config, +-++++-+ # model_kwargs, +-++++-+ # assistant_model, +-++++-+ # batch_size, +-++++-+ # max_cache_length, +-++++-+ # ): +-++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++++-+ # generation_config.cache_implementation = "static" +-++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++++-+ +-++++-+ # if generation_config.cache_implementation == "static": +-++++-+ # base_required_from_max_length = generation_config.max_length + 1 +-++++-+ # base_required = max(max_cache_length, base_required_from_max_length) +-++++-+ # min_cache_size = 50 +-++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++++-+ # else: +-++++-+ # max_cache_length = max(base_required, min_cache_size) +-++++-+ +-++++-+ # original_max_cache_length = max_cache_length +-++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++++-+ # print(f" - final max_cache_length: {max_cache_length}") +-++++-+ +-++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++-+ # if max_cache_length > self.config.max_position_embeddings: +-++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++++-+ +-++++-+ # result = super()._prepare_cache_for_generation( +-++++-+ # generation_config=generation_config, +-++++-+ # model_kwargs=model_kwargs, +-++++-+ # assistant_model=assistant_model, +-++++-+ # batch_size=batch_size, +-++++-+ # max_cache_length=max_cache_length, +-++++-+ # ) +-++++-+ +-++++-+ # if generation_config.cache_implementation == "static": +-++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++++-+ # created_cache = model_kwargs.get(cache_name) +-++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++++-+ # if created_cache.max_cache_len < generation_config.max_length: +-++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++++-+ +-++++-+ # return result +-++++-+ +-++++-+ +-++++-+ +-++++- +-++++- +-++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++++--- +-++++-2.27.0 +-++++- +-++++-- +-++++2.27.0 +-++++ +-+++-- +-+++2.27.0 +-+++ +-++-- +-++2.27.0 +-++ +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +diff --git a/patches/0008-moe-change.patch b/patches/0008-moe-change.patch +deleted file mode 100644 +index 349f1429..00000000 +--- a/patches/0008-moe-change.patch ++++ /dev/null +@@ -1,8789 +0,0 @@ +-From 45ba3bbc411b64cbffd547fa3d66bce9545639dd Mon Sep 17 00:00:00 2001 +-From: Pinoeer-kingxi <13022943007@163.com> +-Date: Sun, 9 Nov 2025 00:50:01 +0800 +-Subject: [PATCH 8/8] moe change +- +---- +- .../models/deepseek/modeling_deepseek.py | 433 +- +- .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- +- patches/0001-20251104commit.patch | 2 +- +- patches/0002-20251106commit.patch | 2 +- +- patches/0003-20261106secondcommit.patch | 2 +- +- patches/0004-20251106change.patch | 2 +- +- patches/0005-20251107001commit.patch | 2 +- +- patches/0006-20251107002commit.patch | 2 +- +- patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ +- 9 files changed, 8510 insertions(+), 55 deletions(-) +- create mode 100644 patches/0007-20251107003commit.patch +- +-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-index ff631974..0af29305 100644 +---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-@@ -19,8 +19,10 @@ +- # limitations under the License. +- """ MindNLP DeepSeek model.""" +- import math +-+import time +- import warnings +- from typing import List, Optional, Tuple, Union +-+from mindspore import mint +- import mindspore +- from mindnlp.core import nn, ops, no_grad +- from mindnlp.core.nn import functional as F +-@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) +- +- _CONFIG_FOR_DOC = "DeepseekConfig" +- +-+Long_Prompt = 1 +-+LONG_PROMPT_LENGTH_THRESHOLD = 128 +-+SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-+ +- _attn_mask_cache = {} +- +- def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-@@ -380,6 +386,8 @@ class MoEGate(nn.Module): +- return topk_idx, topk_weight, aux_loss +- +- +-+bincount_op = mindspore.ops.Bincount() +-+ +- class DeepseekMoE(nn.Module): +- """ +- A mixed expert module containing shared experts. +-@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): +- y = y + self.shared_experts(identity) +- return y +- else: +-- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+ if Long_Prompt == 0: +-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+ else: +-+ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +- if self.config.n_shared_experts is not None: +- y = y + self.shared_experts(identity) +- return y +-@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): +- # if self.config.n_shared_experts is not None: +- # y = y + self.shared_experts(identity) +- # return y +-- +-+ +-+ +-+ +-+ # lwx +-+ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): +-+ # """ +-+ # 如果 expert_ids 为 None,走单专家逻辑; +-+ # 如果有,多专家批量处理,保证和原逻辑一致。 +-+ # """ +-+ # if expert_ids is None: +-+ # # 原单专家逻辑 +-+ # if self.config.pretraining_tp > 1: +-+ # slice = self.intermediate_size // self.config.pretraining_tp +-+ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) +-+ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) +-+ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) +-+ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) +-+ # for i in range(self.config.pretraining_tp)], dim=-1) +-+ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) +-+ # for i in range(self.config.pretraining_tp)], dim=-1) +-+ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) +-+ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) +-+ # for i in range(self.config.pretraining_tp)] +-+ # down_proj = sum(down_proj) +-+ # else: +-+ # down_proj = self.down_proj( +-+ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) +-+ # ) +-+ # return down_proj +-+ +-+ # # ====== 批量多专家路径 ====== +-+ # hidden_size = x.shape[-1] +-+ +-+ # # 按 token expert_ids 选权重 +-+ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] +-+ # up_weights = self.up_proj.weight[expert_ids] +-+ # down_weights = self.down_proj.weight[expert_ids] +-+ +-+ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 +-+ # if self.config.pretraining_tp > 1: +-+ # outputs = [] +-+ # slice = self.intermediate_size // self.config.pretraining_tp +-+ # for i in range(self.config.pretraining_tp): +-+ # # 每个 slice 单独计算 +-+ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) +-+ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) +-+ # act_out = self.act_fn(gate_proj_out) * up_proj_out +-+ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) +-+ # outputs.append(down_proj_out) +-+ # return sum(outputs) +-+ # else: +-+ # gate_proj_out = F.linear(x, gate_weights) +-+ # up_proj_out = F.linear(x, up_weights) +-+ # act_out = self.act_fn(gate_proj_out) * up_proj_out +-+ # return F.linear(act_out, down_weights) +-+ # @no_grad() +-+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ # num_tokens = x.shape[0] +-+ # hidden_size = x.shape[-1] +-+ +-+ # idxs = flat_expert_indices.argsort() +-+ # sorted_expert_indices = flat_expert_indices[idxs] +-+ # sorted_token_indices = idxs // self.num_experts_per_tok +-+ # sorted_indices = sorted_token_indices +-+ +-+ # permuted_tokens = x[sorted_token_indices] +-+ # sorted_weights = flat_expert_weights[idxs] +-+ +-+ # # 一次调用多专家 forward +-+ # expert_outputs = ops.zeros_like(permuted_tokens) +-+ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) +-+ +-+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +-+ # try: +-+ # final_output = ops.moe_token_unpermute( +-+ # expert_outputs, +-+ # sorted_indices, +-+ # probs=probs, +-+ # padded_mode=False +-+ # ) +-+ # except Exception: +-+ # final_output = ops.zeros_like(x) +-+ # final_output = mindspore.mint.scatter_add( +-+ # final_output, +-+ # 0, +-+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +-+ # expert_outputs * sorted_weights +-+ # ) +-+ +-+ # return final_output +-+ +-+ # def mlp_batch_forward(self, tokens, expert_ids): +-+ # """ +-+ # 使用批量专家 forward(保留精度) +-+ # """ +-+ # return self.experts[0].forward(tokens, expert_ids) +-+ +- # @no_grad() +- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +- +-@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): +- # expert_cache += expert_out * weight +- # return expert_cache +- +-+ #@dwj +- @no_grad() +-- # dwj +- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-- # x 的 shape: (1, hidden_size) +-- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-- +-- # 1. 收集所有需要的专家层 +-- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +- selected_experts = [self.experts[i] for i in flat_expert_indices] +-- +-- # 2. 并行计算所有专家的输出 +-- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-- # ops.cat 会将它们堆叠成一个新的 Tensor +-- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-- +-- # 3. 使用矩阵乘法进行加权求和 +-- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-- # 最终结果 final_output 的 shape: (1, hidden_size) +- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-- +- return final_output +- +- +-- # @no_grad() +-- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-- # expert_cache = ops.zeros_like(x) +-- # idxs = flat_expert_indices.argsort() +-- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-- # token_idxs = idxs // self.num_experts_per_tok +-- +-- # for i, end_idx in enumerate(tokens_per_expert): +-- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-- # if start_idx == end_idx: +-- # continue +-- # expert = self.experts[i] +-- # exp_token_idx = token_idxs[start_idx:end_idx] +-- # expert_tokens = x[exp_token_idx] +-- # expert_out = expert(expert_tokens) +-- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-- +-- # return expert_cache +-- +- @no_grad() +- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +- """ +-@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): +- ) +- +- return expert_cache +-+ +-+ +-+ # @no_grad() +-+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ # """ +-+ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add +-+ # """ +-+ # num_tokens = x.shape[0] +-+ # hidden_size = x.shape[-1] +-+ +-+ # # 生成排序后的 token 索引 +-+ # idxs = flat_expert_indices.argsort() +-+ # sorted_expert_indices = flat_expert_indices[idxs] +-+ # sorted_token_indices = idxs // self.num_experts_per_tok +-+ +-+ # # 记录到 sorted_indices(moe_token_unpermute 用) +-+ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] +-+ +-+ # # 收集专家输入 +-+ # permuted_tokens = x[sorted_token_indices] +-+ +-+ # # 执行每个专家的 MLP(批量处理) +-+ # expert_outputs = [] +-+ # token_ptr = 0 +-+ # tokens_per_expert = sorted_expert_indices.bincount() +-+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): +-+ # if count == 0: +-+ # continue +-+ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] +-+ # out = self.experts[expert_id](cur_tokens) +-+ # expert_outputs.append(out) +-+ # token_ptr += count +-+ +-+ # # 拼接所有专家输出 +-+ # permuted_outputs = ops.cat(expert_outputs, axis=0) +-+ +-+ # # 权重缩放(probs 形状为 [num_tokens, top_k]) +-+ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) +-+ +-+ # # 直接调用硬件加速的 unpermute +-+ # final_output = ops.moe_token_unpermute( +-+ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] +-+ # sorted_indices, # shape: [num_tokens * top_k] +-+ # probs=probs, # 按概率加权 +-+ # padded_mode=False +-+ # ) +-+ +-+ # return final_output +-+ +-+ # lwx prefill 20251108 +-+ @no_grad() +-+ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): +-+ """ +-+ 高性能 + 数值一致的 MoE prefill 推理: +-+ 1. 批量化处理所有专家计算,减少 Python 循环开销 +-+ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 +-+ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 +-+ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch +-+ +-+ 参数: +-+ x: [num_tokens, hidden_size], +-+ MoE 输入的 token 表示 +-+ flat_expert_indices: [num_tokens * top_k], +-+ 每个 token 的路由专家 id +-+ flat_expert_weights: [num_tokens * top_k, 1], +-+ 路由专家权重 +-+ """ +-+ num_tokens = x.shape[0] +-+ hidden_size = x.shape[-1] +-+ +-+ # 1) 排序专家分配(与原 scatter_add 一致的顺序) +-+ idxs = flat_expert_indices.argsort() # 排序索引 +-+ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] +-+ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID +-+ +-+ # sorted_indices 必须与 permuted_tokens 顺序匹配 +-+ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 +-+ +-+ # 2) 收集专家输入(按 idxs 排序) +-+ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] +-+ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 +-+ +-+ # 3) 计算每个专家的 token 数 +-+ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) +-+ +-+ # 4) 批量专家计算(减少 Python 循环) +-+ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) +-+ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) +-+ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) +-+ +-+ expert_outputs = ops.zeros_like(permuted_tokens) +-+ ptr = 0 +-+ for expert_id, count in enumerate(tokens_per_expert.tolist()): +-+ if count == 0: +-+ continue +-+ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] +-+ +-+ # 与 DeepseekMLP forward 等价 +-+ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) +-+ up_proj_out = F.linear(tokens, up_weights[expert_id]) +-+ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out +-+ expert_out = F.linear(act_out, down_weights[expert_id]) +-+ +-+ expert_outputs[ptr:ptr+count] = expert_out +-+ ptr += count +-+ +-+ # 5) Ascend 加速的 unpermute(已排序的权重) +-+ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape +-+ +-+ final_output = ops.zeros_like(x) +-+ final_output = mindspore.mint.scatter_add( +-+ final_output, +-+ 0, +-+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +-+ expert_outputs * sorted_weights +-+ ) +-+ +-+ +-+ # try: +-+ # final_output = ops.moe_token_unpermute( +-+ # expert_outputs, # [num_tokens*top_k, hidden_size] +-+ # sorted_indices, # [num_tokens*top_k] 原 token id +-+ # probs=probs, # 对应权重 +-+ # padded_mode=False +-+ # ) +-+ # except Exception: +-+ # # CPU/GPU fallback:用 scatter_add 保证完全一致 +-+ # final_output = ops.zeros_like(x) +-+ # final_output = mindspore.mint.scatter_add( +-+ # final_output, +-+ # 0, +-+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +-+ # expert_outputs * sorted_weights +-+ # ) +-+ +-+ return final_output +-+ +-+ +-+ # @no_grad() +-+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ # num_tokens = x.shape[0] +-+ # hidden_size = x.shape[-1] +-+ +-+ # idxs = flat_expert_indices.argsort() +-+ # sorted_expert_indices = flat_expert_indices[idxs] +-+ # sorted_token_indices = idxs // self.num_experts_per_tok +-+ +-+ # # sorted_indices = sorted_token_indices +-+ # sorted_indices = sorted_token_indices.astype(mindspore.int32) +-+ # permuted_tokens = x[sorted_token_indices] +-+ # sorted_weights = flat_expert_weights[idxs] +-+ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) +-+ +-+ # expert_outputs = ops.zeros_like(permuted_tokens) +-+ # ptr = 0 +-+ +-+ # # 只按专家维度循环 +-+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): +-+ # if count == 0: +-+ # continue +-+ # token_slice = slice(ptr, ptr + count) +-+ # expert_tokens = permuted_tokens[token_slice] +-+ +-+ # # 保持原 forward(含 pretraining_tp、bias 等) +-+ # expert_out = self.experts[expert_id](expert_tokens) +-+ +-+ # expert_outputs[token_slice] = expert_out +-+ # ptr += count +-+ +-+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) +-+ # try: +-+ # final_output = mindspore.ops.moe_token_unpermute( +-+ # expert_outputs, +-+ # sorted_indices, +-+ # probs=probs, +-+ # padded_mode=False +-+ # ) +-+ # except Exception: +-+ # final_output = ops.zeros_like(x) +-+ # final_output = mindspore.mint.scatter_add( +-+ # final_output, +-+ # 0, +-+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +-+ # expert_outputs * sorted_weights +-+ # ) +-+ +-+ # return final_output +-+ +-+ +-+ #lwx +-+ # @no_grad() +-+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+ # """ +-+ # 并行化 MoE prefill: +-+ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 +-+ # - 保证结果与原版完全一致 +-+ # """ +-+ # # 输出缓存 +-+ # expert_cache = ops.zeros_like(x) +-+ +-+ # # token 总数(批量*seq_len*num_experts_per_tok) +-+ # num_tokens = flat_expert_indices.shape[0] +-+ # hidden_dim = x.shape[-1] +-+ +-+ # # 原 token ID(idxs // num_experts_per_tok) +-+ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) +-+ +-+ # # ====== Step 1: 组织输入 ====== +-+ # # 按 experts 排序,保证 scatter_add 对应位置一致 +-+ # sort_ids = flat_expert_indices.argsort() +-+ # sorted_experts = flat_expert_indices[sort_ids] +-+ # sorted_tokens = token_ids[sort_ids] +-+ # sorted_weights = flat_expert_weights[sort_ids] +-+ +-+ # # 收集每个专家的输入 +-+ # # build: expert_inputs[expert_id] = [tokens...] +-+ # expert_inputs = [] +-+ # expert_outs = [] +-+ +-+ # for eid in range(self.config.n_routed_experts): +-+ # eid_mask = (sorted_experts == eid) +-+ # if eid_mask.any(): +-+ # tokens_for_eid = x[sorted_tokens[eid_mask]] +-+ # expert_inputs.append(tokens_for_eid) +-+ # else: +-+ # expert_inputs.append(None) +-+ +-+ # # ====== Step 2: 并行计算所有专家输出 ====== +-+ # # 存储所有专家结果到一个列表 +-+ # for eid in range(self.config.n_routed_experts): +-+ # if expert_inputs[eid] is not None: +-+ # out = self.experts[eid](expert_inputs[eid]) +-+ # expert_outs.append(out) +-+ # else: +-+ # expert_outs.append(None) +-+ +-+ # # ====== Step 3: scatter_add 回写结果 ====== +-+ # # 遍历专家,将结果加回对应的 token +-+ # pos = 0 +-+ # for eid in range(self.config.n_routed_experts): +-+ # if expert_outs[eid] is not None: +-+ # size = expert_outs[eid].shape[0] +-+ # tokens_idx = sorted_tokens[pos:pos+size] +-+ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] +-+ # pos += size +-+ +-+ # # scatter_add 到 expert_cache +-+ # expert_cache = mindspore.mint.scatter_add( +-+ # expert_cache, +-+ # dim=0, +-+ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), +-+ # src=scaled_out +-+ # ) +-+ +-+ # return expert_cache +-+ +-+ +-+ +- # 放置在 DeepseekMoE 类中 +- # @no_grad() +- # #lwx 20251107 +-@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): +- self.hidden_size = config.hidden_size +- +- # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-- # config=config, layer_idx=layer_idx +-+ # config=config, layer_idx=layer_idx +- # ) +- +- self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): +- ) +- else DeepseekMLP(config) +- ) +-+ +- self.input_layernorm = DeepseekRMSNorm( +- config.hidden_size, eps=config.rms_norm_eps +- ) +-@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +- def get_decoder(self): +- return self.model +- +-+ def generate(self, *args, **kwargs): +-+ """ +-+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+ """ +-+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-+ +-+ input_ids = kwargs.get("input_ids") +-+ if input_ids is None and args: +-+ input_ids = args[0] +-+ +-+ if input_ids is not None: +-+ prompt_length = input_ids.shape[1] +-+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-+ Long_Prompt = 2 +-+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-+ Long_Prompt = 0 +-+ else: +-+ Long_Prompt = 1 +-+ +-+ +-+ return super().generate(*args, **kwargs) +- +- def forward( +- self, +-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-index 913a7609..6566958b 100644 +---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- +- # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +- @no_grad() +-- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- original_dtype = hidden_states.dtype +- batch_size, _ = hidden_states.shape +- expert_outputs_list = [ +-@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +- return moe_output_fp32.squeeze(1).to(original_dtype) +- +-+ +- # @no_grad() +-- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- # num_tokens, _ = hidden_states.shape +- # flat_selected_experts = selected_experts.flatten() +- # sorted_expert_indices = flat_selected_experts.argsort() +-@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- # current_token_offset += expert_token_count +- # return moe_output +- +-+ # baseline +- @no_grad() +-- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- """ +- 优化版 MoE prefill (速度优先模式): +- - 批量张量化处理同一个 expert 的所有 token +-@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- return moe_output +- +- +-+ @no_grad() +-+ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+ """ +-+ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add +-+ 逻辑: +-+ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 +-+ 2. 每个 expert 一次性处理其全部 token +-+ 3. 最后一次 scatter_add 回到原 token 顺序 +-+ """ +-+ +-+ num_tokens = hidden_states.shape[0] +-+ hidden_size = hidden_states.shape[-1] +-+ +-+ # 展平为一维 +-+ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] +-+ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] +-+ +-+ # 按 expert 排序 +-+ idxs = flat_selected_experts.argsort() +-+ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 +-+ sorted_token_indices = idxs // self.top_k # 对应原 token ID +-+ +-+ # 排好序的输入向量(连续内存) +-+ permuted_tokens = hidden_states[sorted_token_indices] +-+ +-+ # 排好序的权重 +-+ sorted_weights = flat_routing_weights[idxs] +-+ +-+ # 每个 expert 对应的 token 数量 +-+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-+ +-+ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) +-+ expert_outputs = ops.zeros_like(permuted_tokens) +-+ +-+ ptr = 0 # 指向当前切片的起点 +-+ for expert_id, count in enumerate(tokens_per_expert.tolist()): +-+ if count == 0: +-+ continue +-+ +-+ token_slice = slice(ptr, ptr + count) +-+ expert_tokens = permuted_tokens[token_slice] # 连续切片 +-+ +-+ # 执行专家 MLP +-+ expert_out = self.experts[expert_id](expert_tokens) +-+ +-+ expert_outputs[token_slice] = expert_out +-+ ptr += count +-+ +-+ # 按权重缩放 +-+ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) +-+ +-+ # 回写到原 token 顺序 (单次 scatter_add) +-+ moe_output = mindspore.mint.scatter_add( +-+ ops.zeros_like(hidden_states), +-+ 0, +-+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), +-+ scaled_outputs +-+ ) +-+ +-+ return moe_output +-+ +-+ +-+ +- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+ +- @no_grad() +- def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +- moe_output = ops.zeros_like(hidden_states) +-@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +- # # --- 速度优先模式 (SPEED MODE) --- +- # routing_weights_casted = routing_weights.to(hidden_states.dtype) +- # if sequence_length == 1: +-- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +- # else: +-- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +- +- routing_weights_casted = routing_weights.to(hidden_states.dtype) +- if sequence_length == 1: +-- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +- else: +-- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-- +-+ # if Long_Prompt == 1: +-+ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ # else: +-+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+ +- +- # 3. 共享专家计算与合并 (所有模式通用) +- gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-index c9c8c5ee..513dd40b 100644 +---- a/patches/0001-20251104commit.patch +-+++ b/patches/0001-20251104commit.patch +-@@ -1,7 +1,7 @@ +- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Tue, 4 Nov 2025 09:11:51 +0800 +--Subject: [PATCH 1/6] 20251104commit +-+Subject: [PATCH 1/7] 20251104commit +- +- --- +- mindnlp/transformers/cache_utils.py | 28 +- +-diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-index 625656eb..41081b85 100644 +---- a/patches/0002-20251106commit.patch +-+++ b/patches/0002-20251106commit.patch +-@@ -1,7 +1,7 @@ +- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 09:20:38 +0800 +--Subject: [PATCH 2/6] 20251106commit +-+Subject: [PATCH 2/7] 20251106commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 379 ++++- +-diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-index dcb85080..c1392569 100644 +---- a/patches/0003-20261106secondcommit.patch +-+++ b/patches/0003-20261106secondcommit.patch +-@@ -1,7 +1,7 @@ +- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 14:54:37 +0800 +--Subject: [PATCH 3/6] 20261106secondcommit +-+Subject: [PATCH 3/7] 20261106secondcommit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 217 ++- +-diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-index bbed13cc..e548b1b2 100644 +---- a/patches/0004-20251106change.patch +-+++ b/patches/0004-20251106change.patch +-@@ -1,7 +1,7 @@ +- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Thu, 6 Nov 2025 15:48:09 +0800 +--Subject: [PATCH 4/6] 20251106change +-+Subject: [PATCH 4/7] 20251106change +- +- --- +- .../models/deepseek/modeling_deepseek.py | 189 +- +-diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-index b2d1035c..bf224d2a 100644 +---- a/patches/0005-20251107001commit.patch +-+++ b/patches/0005-20251107001commit.patch +-@@ -1,7 +1,7 @@ +- From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Fri, 7 Nov 2025 11:48:18 +0800 +--Subject: [PATCH 5/6] 20251107001commit +-+Subject: [PATCH 5/7] 20251107001commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 91 +- +-diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +-index bffa134e..1bd306b9 100644 +---- a/patches/0006-20251107002commit.patch +-+++ b/patches/0006-20251107002commit.patch +-@@ -1,7 +1,7 @@ +- From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 +- From: Pinoeer-kingxi <13022943007@163.com> +- Date: Fri, 7 Nov 2025 12:06:32 +0800 +--Subject: [PATCH 6/6] 20251107002commit +-+Subject: [PATCH 6/7] 20251107002commit +- +- --- +- .../models/deepseek/modeling_deepseek.py | 122 +- +-diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch +-new file mode 100644 +-index 00000000..ce558554 +---- /dev/null +-+++ b/patches/0007-20251107003commit.patch +-@@ -0,0 +1,8034 @@ +-+From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 +-+From: Pinoeer-kingxi <13022943007@163.com> +-+Date: Fri, 7 Nov 2025 12:12:51 +0800 +-+Subject: [PATCH 7/7] 20251107003commit +-+ +-+--- +-+ .../models/deepseek/modeling_deepseek.py | 2 +- +-+ patches/0001-20251104commit.patch | 2 +- +-+ patches/0002-20251106commit.patch | 2 +- +-+ patches/0003-20261106secondcommit.patch | 2 +- +-+ patches/0004-20251106change.patch | 2 +- +-+ patches/0005-20251107001commit.patch | 2 +- +-+ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ +-+ 7 files changed, 7937 insertions(+), 6 deletions(-) +-+ create mode 100644 patches/0006-20251107002commit.patch +-+ +-+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+index e7e1c053..ff631974 100644 +-+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): +-+ # return expert_cache +-+ +-+ @no_grad() +-+- dwj +-++ # dwj +-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+ # x 的 shape: (1, hidden_size) +-+ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+index 2842180e..c9c8c5ee 100644 +-+--- a/patches/0001-20251104commit.patch +-++++ b/patches/0001-20251104commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+-Subject: [PATCH 1/5] 20251104commit +-++Subject: [PATCH 1/6] 20251104commit +-+ +-+ --- +-+ mindnlp/transformers/cache_utils.py | 28 +- +-+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+index c6cd8757..625656eb 100644 +-+--- a/patches/0002-20251106commit.patch +-++++ b/patches/0002-20251106commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+-Subject: [PATCH 2/5] 20251106commit +-++Subject: [PATCH 2/6] 20251106commit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+index 601960c9..dcb85080 100644 +-+--- a/patches/0003-20261106secondcommit.patch +-++++ b/patches/0003-20261106secondcommit.patch +-+@@ -1,7 +1,7 @@ +-+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+-Subject: [PATCH 3/5] 20261106secondcommit +-++Subject: [PATCH 3/6] 20261106secondcommit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-+index 8976f10b..bbed13cc 100644 +-+--- a/patches/0004-20251106change.patch +-++++ b/patches/0004-20251106change.patch +-+@@ -1,7 +1,7 @@ +-+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Thu, 6 Nov 2025 15:48:09 +0800 +-+-Subject: [PATCH 4/5] 20251106change +-++Subject: [PATCH 4/6] 20251106change +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 189 +- +-+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-+index 8d9032be..b2d1035c 100644 +-+--- a/patches/0005-20251107001commit.patch +-++++ b/patches/0005-20251107001commit.patch +-+@@ -1,7 +1,7 @@ +-+ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +-+ From: Pinoeer-kingxi <13022943007@163.com> +-+ Date: Fri, 7 Nov 2025 11:48:18 +0800 +-+-Subject: [PATCH 5/5] 20251107001commit +-++Subject: [PATCH 5/6] 20251107001commit +-+ +-+ --- +-+ .../models/deepseek/modeling_deepseek.py | 91 +- +-+diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch +-+new file mode 100644 +-+index 00000000..bffa134e +-+--- /dev/null +-++++ b/patches/0006-20251107002commit.patch +-+@@ -0,0 +1,7931 @@ +-++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 +-++From: Pinoeer-kingxi <13022943007@163.com> +-++Date: Fri, 7 Nov 2025 12:06:32 +0800 +-++Subject: [PATCH 6/6] 20251107002commit +-++ +-++--- +-++ .../models/deepseek/modeling_deepseek.py | 122 +- +-++ patches/0001-20251104commit.patch | 2 +- +-++ patches/0002-20251106commit.patch | 2 +- +-++ patches/0003-20261106secondcommit.patch | 2 +- +-++ patches/0004-20251106change.patch | 2 +- +-++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ +-++ 6 files changed, 7773 insertions(+), 64 deletions(-) +-++ create mode 100644 patches/0005-20251107001commit.patch +-++ +-++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++index 8831e4b7..e7e1c053 100644 +-++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): +-++ # expert_out = expert(x) +-++ # expert_cache += expert_out * weight +-++ # return expert_cache +-++- +-++- # @no_grad() +-++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++- # # x 的 shape: (1, hidden_size) +-++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++- +-++- # # 1. 收集所有需要的专家层 +-++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++- # selected_experts = [self.experts[i] for i in flat_expert_indices] +-++- +-++- # # 2. 并行计算所有专家的输出 +-++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++- # # ops.cat 会将它们堆叠成一个新的 Tensor +-++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++- +-++- # # 3. 使用矩阵乘法进行加权求和 +-++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++- # # 最终结果 final_output 的 shape: (1, hidden_size) +-++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+++ +-+++ @no_grad() +-+++ dwj +-+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ # x 的 shape: (1, hidden_size) +-+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+++ +-+++ # 1. 收集所有需要的专家层 +-+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-+++ +-+++ # 2. 并行计算所有专家的输出 +-+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+++ # ops.cat 会将它们堆叠成一个新的 Tensor +-+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+++ +-+++ # 3. 使用矩阵乘法进行加权求和 +-+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++ # 最终结果 final_output 的 shape: (1, hidden_size) +-+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++ +-++- # return final_output +-+++ return final_output +-++ +-++ +-++ # @no_grad() +-++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): +-++ +-++ return expert_cache +-++ # 放置在 DeepseekMoE 类中 +-++- @no_grad() +-++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++- """ +-++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-++- +-++- Args: +-++- x (Tensor): 输入张量, shape: (1, hidden_size) +-++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-++- """ +-++- top_k, _ = flat_expert_weights.shape +-++- hidden_size = x.shape[-1] +-++- +-++- # 1. 将所有专家的权重堆叠起来 +-++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-+++ # @no_grad() +-+++ # #lwx 20251107 +-+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++ # """ +-+++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-+++ +-+++ # Args: +-+++ # x (Tensor): 输入张量, shape: (1, hidden_size) +-+++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-+++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-+++ # """ +-+++ # top_k, _ = flat_expert_weights.shape +-+++ # hidden_size = x.shape[-1] +-+++ +-+++ # # 1. 将所有专家的权重堆叠起来 +-+++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-+++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-+++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-++ +-++- # 2. "收集" 所需的专家权重 +-++- selected_gate_w = stacked_gate_w[flat_expert_indices] +-++- selected_up_w = stacked_up_w[flat_expert_indices] +-++- selected_down_w = stacked_down_w[flat_expert_indices] +-+++ # # 2. "收集" 所需的专家权重 +-+++ # selected_gate_w = stacked_gate_w[flat_expert_indices] +-+++ # selected_up_w = stacked_up_w[flat_expert_indices] +-+++ # selected_down_w = stacked_down_w[flat_expert_indices] +-++ +-++- # 3. 准备输入 +-++- x_expanded = x.expand((top_k, 1, hidden_size)) +-+++ # # 3. 准备输入 +-+++ # x_expanded = x.expand((top_k, 1, hidden_size)) +-++ +-++- # 4. 并行计算 gate_proj 和 up_proj +-++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-+++ # # 4. 并行计算 gate_proj 和 up_proj +-+++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-+++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-++ +-++- # 5. 计算中间状态 +-++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-+++ # # 5. 计算中间状态 +-+++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-++ +-++- # 6. 并行计算 down_proj +-++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-++- # --- [FIX] --- +-++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-++- # --- [FIX END] --- +-+++ # # 6. 并行计算 down_proj +-+++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-+++ # # --- [FIX] --- +-+++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-+++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-+++ # # --- [FIX END] --- +-++ +-++- # 7. 根据路由权重进行加权求和 +-++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-+++ # # 7. 根据路由权重进行加权求和 +-+++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-++ +-++- return weighted_sum +-+++ # return weighted_sum +-++ +-++ +-++ +-++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++index 0a0ef2d7..2842180e 100644 +-++--- a/patches/0001-20251104commit.patch +-+++++ b/patches/0001-20251104commit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++-Subject: [PATCH 1/4] 20251104commit +-+++Subject: [PATCH 1/5] 20251104commit +-++ +-++ --- +-++ mindnlp/transformers/cache_utils.py | 28 +- +-++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-++index 5185270c..c6cd8757 100644 +-++--- a/patches/0002-20251106commit.patch +-+++++ b/patches/0002-20251106commit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-++-Subject: [PATCH 2/4] 20251106commit +-+++Subject: [PATCH 2/5] 20251106commit +-++ +-++ --- +-++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-++index 3e05f821..601960c9 100644 +-++--- a/patches/0003-20261106secondcommit.patch +-+++++ b/patches/0003-20261106secondcommit.patch +-++@@ -1,7 +1,7 @@ +-++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-++-Subject: [PATCH 3/4] 20261106secondcommit +-+++Subject: [PATCH 3/5] 20261106secondcommit +-++ +-++ --- +-++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-++index 88a1aef4..8976f10b 100644 +-++--- a/patches/0004-20251106change.patch +-+++++ b/patches/0004-20251106change.patch +-++@@ -1,7 +1,7 @@ +-++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-++ From: Pinoeer-kingxi <13022943007@163.com> +-++ Date: Thu, 6 Nov 2025 15:48:09 +0800 +-++-Subject: [PATCH 4/4] 20251106change +-+++Subject: [PATCH 4/5] 20251106change +-++ +-++ --- +-++ .../models/deepseek/modeling_deepseek.py | 189 +- +-++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch +-++new file mode 100644 +-++index 00000000..8d9032be +-++--- /dev/null +-+++++ b/patches/0005-20251107001commit.patch +-++@@ -0,0 +1,7707 @@ +-+++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 +-+++From: Pinoeer-kingxi <13022943007@163.com> +-+++Date: Fri, 7 Nov 2025 11:48:18 +0800 +-+++Subject: [PATCH 5/5] 20251107001commit +-+++ +-+++--- +-+++ .../models/deepseek/modeling_deepseek.py | 91 +- +-+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- +-+++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- +-+++ patches/0001-20251104commit.patch | 2 +- +-+++ patches/0002-20251106commit.patch | 2 +- +-+++ patches/0003-20261106secondcommit.patch | 2 +- +-+++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ +-+++ 7 files changed, 7577 insertions(+), 30 deletions(-) +-+++ create mode 100644 patches/0004-20251106change.patch +-+++ +-+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++index 0546f318..8831e4b7 100644 +-+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): +-+++ # expert_cache += expert_out * weight +-+++ # return expert_cache +-+++ +-+++- @no_grad() +-+++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++- # x 的 shape: (1, hidden_size) +-+++- # flat_expert_indices 的 shape: (num_experts_per_tok,) +-+++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-+++- +-+++- # 1. 收集所有需要的专家层 +-+++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-+++- selected_experts = [self.experts[i] for i in flat_expert_indices] +-+++- +-+++- # 2. 并行计算所有专家的输出 +-+++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-+++- # ops.cat 会将它们堆叠成一个新的 Tensor +-+++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-+++- +-+++- # 3. 使用矩阵乘法进行加权求和 +-+++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-+++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-+++- # 最终结果 final_output 的 shape: (1, hidden_size) +-+++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++++ # @no_grad() +-++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ # # x 的 shape: (1, hidden_size) +-++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++++ +-++++ # # 1. 收集所有需要的专家层 +-++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] +-++++ +-++++ # # 2. 并行计算所有专家的输出 +-++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++++ # # ops.cat 会将它们堆叠成一个新的 Tensor +-++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++++ +-++++ # # 3. 使用矩阵乘法进行加权求和 +-++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++ # # 最终结果 final_output 的 shape: (1, hidden_size) +-++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-+++ +-+++- return final_output +-++++ # return final_output +-+++ +-+++ +-+++ # @no_grad() +-+++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): +-+++ ) +-+++ +-+++ return expert_cache +-++++# 放置在 DeepseekMoE 类中 +-++++ @no_grad() +-++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++ """ +-++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 +-++++ +-++++ Args: +-++++ x (Tensor): 输入张量, shape: (1, hidden_size) +-++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) +-++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) +-++++ """ +-++++ top_k, _ = flat_expert_weights.shape +-++++ hidden_size = x.shape[-1] +-++++ +-++++ # 1. 将所有专家的权重堆叠起来 +-++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) +-++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) +-++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) +-++++ +-++++ # 2. "收集" 所需的专家权重 +-++++ selected_gate_w = stacked_gate_w[flat_expert_indices] +-++++ selected_up_w = stacked_up_w[flat_expert_indices] +-++++ selected_down_w = stacked_down_w[flat_expert_indices] +-++++ +-++++ # 3. 准备输入 +-++++ x_expanded = x.expand((top_k, 1, hidden_size)) +-++++ +-++++ # 4. 并行计算 gate_proj 和 up_proj +-++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) +-++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) +-++++ +-++++ # 5. 计算中间状态 +-++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out +-++++ +-++++ # 6. 并行计算 down_proj +-++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) +-++++ # --- [FIX] --- +-++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 +-++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) +-++++ # --- [FIX END] --- +-++++ +-++++ # 7. 根据路由权重进行加权求和 +-++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) +-++++ +-++++ return weighted_sum +-++++ +-++++ +-+++ +-+++ # @no_grad() +-+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++index ebd7782e..913a7609 100644 +-+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): +-+++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++- x1 = x[..., : x.shape[-1] // 2] +-+++- x2 = x[..., x.shape[-1] // 2 :] +-++++ # x1 = x[..., : x.shape[-1] // 2] +-++++ # x2 = x[..., x.shape[-1] // 2 :] +-+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+++index d059dcbe..2b217b64 100644 +-+++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py +-+++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): +-+++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++ def rotate_half(x): +-+++ """Rotates half the hidden dims of the input.""" +-+++- x1 = x[..., : x.shape[-1] // 2] +-+++- x2 = x[..., x.shape[-1] // 2 :] +-++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++ # x1 = x[..., : x.shape[-1] // 2] +-++++ # x2 = x[..., x.shape[-1] // 2 :] +-++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++ return ops.cat((-x2, x1), dim=-1) +-+++ +-+++ +-+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++index 78f22642..0a0ef2d7 100644 +-+++--- a/patches/0001-20251104commit.patch +-++++++ b/patches/0001-20251104commit.patch +-+++@@ -1,7 +1,7 @@ +-+++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++ From: Pinoeer-kingxi <13022943007@163.com> +-+++ Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++-Subject: [PATCH 1/3] 20251104commit +-++++Subject: [PATCH 1/4] 20251104commit +-+++ +-+++ --- +-+++ mindnlp/transformers/cache_utils.py | 28 +- +-+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-+++index 22b65dd5..5185270c 100644 +-+++--- a/patches/0002-20251106commit.patch +-++++++ b/patches/0002-20251106commit.patch +-+++@@ -1,7 +1,7 @@ +-+++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+++ From: Pinoeer-kingxi <13022943007@163.com> +-+++ Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+++-Subject: [PATCH 2/3] 20251106commit +-++++Subject: [PATCH 2/4] 20251106commit +-+++ +-+++ --- +-+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-+++index 966529e4..3e05f821 100644 +-+++--- a/patches/0003-20261106secondcommit.patch +-++++++ b/patches/0003-20261106secondcommit.patch +-+++@@ -1,7 +1,7 @@ +-+++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+++ From: Pinoeer-kingxi <13022943007@163.com> +-+++ Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+++-Subject: [PATCH 3/3] 20261106secondcommit +-++++Subject: [PATCH 3/4] 20261106secondcommit +-+++ +-+++ --- +-+++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch +-+++new file mode 100644 +-+++index 00000000..88a1aef4 +-+++--- /dev/null +-++++++ b/patches/0004-20251106change.patch +-+++@@ -0,0 +1,7498 @@ +-++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 +-++++From: Pinoeer-kingxi <13022943007@163.com> +-++++Date: Thu, 6 Nov 2025 15:48:09 +0800 +-++++Subject: [PATCH 4/4] 20251106change +-++++ +-++++--- +-++++ .../models/deepseek/modeling_deepseek.py | 189 +- +-++++ patches/0001-20251104commit.patch | 1272 +++++++ +-++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ +-++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ +-++++ 4 files changed, 7244 insertions(+), 186 deletions(-) +-++++ create mode 100644 patches/0001-20251104commit.patch +-++++ create mode 100644 patches/0002-20251106commit.patch +-++++ create mode 100644 patches/0003-20261106secondcommit.patch +-++++ +-++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++index 2f9192bf..0546f318 100644 +-++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): +-++++ +-++++ return attn_output, attn_weights, past_key_value +-++++ +-++++-# class DeepseekFlashAttention(nn.Module): +-++++-# """ +-++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-++++- +-++++-# This class is designed as a drop-in replacement for DeepseekAttention. +-++++-# """ +-++++- +-++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++++-# super().__init__() +-++++-# self.config = config +-++++-# self.layer_idx = layer_idx +-++++-# if layer_idx is None: +-++++-# logger.warning( +-++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++-# "when creating this class." +-++++-# ) +-++++- +-++++-# self.attention_dropout = config.attention_dropout +-++++-# self.hidden_size = config.hidden_size +-++++-# self.num_heads = config.num_attention_heads +-++++-# self.head_dim = self.hidden_size // self.num_heads +-++++-# self.num_key_value_heads = config.num_key_value_heads +-++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++-# self.max_position_embeddings = config.max_position_embeddings +-++++-# self.rope_theta = config.rope_theta +-++++-# self.is_causal = True +-++++- +-++++-# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++-# raise ValueError( +-++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++-# f" and `num_heads`: {self.num_heads})." +-++++-# ) +-++++- +-++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++++-# self._init_rope() +-++++- +-++++-# def _init_rope(self): +-++++-# if self.config.rope_scaling is None: +-++++-# self.rotary_emb = DeepseekRotaryEmbedding( +-++++-# self.head_dim, +-++++-# max_position_embeddings=self.max_position_embeddings, +-++++-# base=self.rope_theta, +-++++-# ) +-++++-# else: +-++++-# scaling_type = self.config.rope_scaling["type"] +-++++-# scaling_factor = self.config.rope_scaling["factor"] +-++++-# if scaling_type == "linear": +-++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++++-# self.head_dim, +-++++-# max_position_embeddings=self.max_position_embeddings, +-++++-# scaling_factor=scaling_factor, +-++++-# base=self.rope_theta, +-++++-# ) +-++++-# elif scaling_type == "dynamic": +-++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++++-# self.head_dim, +-++++-# max_position_embeddings=self.max_position_embeddings, +-++++-# scaling_factor=scaling_factor, +-++++-# base=self.rope_theta, +-++++-# ) +-++++-# else: +-++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++++- +-++++-# def forward( +-++++-# self, +-++++-# hidden_states: mindspore.Tensor, +-++++-# attention_mask: Optional[mindspore.Tensor] = None, +-++++-# position_ids: Optional[mindspore.Tensor] = None, +-++++-# past_key_value: Optional[Cache] = None, +-++++-# output_attentions: bool = False, +-++++-# use_cache: bool = False, +-++++-# **kwargs, +-++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++-# if "padding_mask" in kwargs: +-++++-# warnings.warn( +-++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++++-# ) +-++++- +-++++-# if output_attentions: +-++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-++++- +-++++-# bsz, q_len, _ = hidden_states.shape +-++++- +-++++-# if self.config.pretraining_tp > 1: +-++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++++- +-++++-# query_states = self.q_proj(hidden_states) +-++++-# key_states = self.k_proj(hidden_states) +-++++-# value_states = self.v_proj(hidden_states) +-++++- +-++++-# # Reshape for multi-head attention +-++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++- +-++++-# kv_seq_len = key_states.shape[-2] +-++++-# if past_key_value is not None: +-++++-# if self.layer_idx is None: +-++++-# raise ValueError( +-++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++-# "with a layer index." +-++++-# ) +-++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++- +-++++-# # Apply Rotary Positional Embedding +-++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++- +-++++-# if past_key_value is not None: +-++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++- +-++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++- +-++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++- +-++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++- +-++++-# # Convert attention_mask for flash_attention_score +-++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-++++-# if attention_mask is not None: +-++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++++-# raise ValueError( +-++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++++-# ) +-++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-++++-# else: +-++++-# attn_mask_for_fa = None +-++++- +-++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++++- +-++++-# # Call the fused flash_attention_score operator +-++++-# attn_output = mindspore.ops.flash_attention_score( +-++++-# query=query_states_for_fa, +-++++-# key=key_states_for_fa, +-++++-# value=value_states_for_fa, +-++++-# head_num=self.num_heads, # This is N1, the number of query heads +-++++-# input_layout='BSH', +-++++-# attn_mask=attn_mask_for_fa, +-++++-# keep_prob=keep_prob, +-++++-# scalar_value=1.0 / math.sqrt(self.head_dim), +-++++-# sparse_mode=0 # Default mask mode +-++++-# ) +-++++- +-++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-++++-# attn_output = self.o_proj(attn_output) +-++++- +-++++-# # Flash Attention does not return attention weights +-++++-# attn_weights = None +-++++- +-++++-# return attn_output, attn_weights, past_key_value +-++++ +-++++ class DeepseekFlashAttention(nn.Module): +-++++ """ +-++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): +-++++ super().__init__() +-++++ self.hidden_size = config.hidden_size +-++++ +-++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-++++- config=config, layer_idx=layer_idx +-++++- ) +-+++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( +-+++++ # config=config, layer_idx=layer_idx +-+++++ # ) +-++++ +-++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-++++ config=config, layer_idx=layer_idx +-++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): +-++++ return outputs +-++++ +-++++ +-++++- +-++++ class DeepseekPreTrainedModel(PreTrainedModel): +-++++ config_class = DeepseekConfig +-++++ base_model_prefix = "model" +-++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++ # Initialize weights and apply final processing +-++++ self.post_init() +-++++ self.warm_up = False +-++++- #@dwj +-++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-++++- self.num_layers, +-++++- self.num_attention_heads, +-++++- self.head_dim, +-++++- batch_size=1, +-++++- max_length=self.max_length, +-++++- dtype=mindspore.float16 +-++++- ) +-++++- +-++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-++++- key_cache = [] +-++++- value_cache = [] +-++++- for _ in range(num_layers): +-++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++- key_cache.append(k) +-++++- value_cache.append(v) +-++++- return key_cache, value_cache +-++++- +-++++ +-++++ def warmup_moe_model_deep(self): +-++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-++++new file mode 100644 +-++++index 00000000..78f22642 +-++++--- /dev/null +-+++++++ b/patches/0001-20251104commit.patch +-++++@@ -0,0 +1,1272 @@ +-+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++++From: Pinoeer-kingxi <13022943007@163.com> +-+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++++Subject: [PATCH 1/3] 20251104commit +-+++++ +-+++++--- +-+++++ mindnlp/transformers/cache_utils.py | 28 +- +-+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++++ 3 files changed, 976 insertions(+), 87 deletions(-) +-+++++ +-+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++++index cadd2e04..02f8d4be 100644 +-+++++--- a/mindnlp/transformers/cache_utils.py +-++++++++ b/mindnlp/transformers/cache_utils.py +-+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++++ # k_out[:, :, cache_position] = key_states +-+++++ # v_out[:, :, cache_position] = value_states +-+++++- if ON_ORANGE_PI: +-+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++- else: +-+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++- +-++++++ # if ON_ORANGE_PI: +-++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++++ # else: +-++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-++++++ if cache_position.ndim > 1: +-++++++ cache_position = cache_position.flatten() +-++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-++++++ cache_position = cache_position.int() +-++++++ +-++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-++++++ k_out[:, :, cache_position] = key_states +-++++++ v_out[:, :, cache_position] = value_states +-++++++ +-+++++ return k_out, v_out +-+++++ +-+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++index c695b944..d8303e45 100644 +-+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++++ def rotate_half(x): +-+++++ """Rotates half the hidden dims of the input.""" +-+++++- x1 = x[..., : x.shape[-1] // 2] +-+++++- x2 = x[..., x.shape[-1] // 2 :] +-++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++++ # x1 = x[..., : x.shape[-1] // 2] +-++++++ # x2 = x[..., x.shape[-1] // 2 :] +-++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++ return ops.cat((-x2, x1), dim=-1) +-+++++ +-+++++ +-+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++++ if self.training: +-+++++ raise NotImplementedError("Training is not supported yet.") +-+++++ else: +-+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++- if self.config.n_shared_experts is not None: +-+++++- y = y + self.shared_experts(identity) +-+++++- return y +-++++++ # @lwx +-++++++ if orig_shape[1] == 1: +-++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-++++++ y=y.view(*orig_shape) +-++++++ if self.config.n_shared_experts is not None: +-++++++ y = y + self.shared_experts(identity) +-++++++ return y +-++++++ else: +-++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-++++++ if self.config.n_shared_experts is not None: +-++++++ y = y + self.shared_experts(identity) +-++++++ return y +-++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++++ # if self.config.n_shared_experts is not None: +-++++++ # y = y + self.shared_experts(identity) +-++++++ # return y +-++++++ +-++++++ @no_grad() +-++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++++ +-++++++ expert_cache = ops.zeros_like(x) +-++++++ for i in range(self.num_experts_per_tok): +-++++++ expert_id = flat_expert_indices[i].item() +-++++++ weight = flat_expert_weights[i].item() +-++++++ expert = self.experts[expert_id] +-++++++ expert_out = expert(x) +-++++++ expert_cache += expert_out * weight +-++++++ return expert_cache +-+++++ +-+++++ @no_grad() +-+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++- # expert_cache = torch.zeros_like(x) +-+++++- # idxs = flat_expert_indices.argsort() +-+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++- # token_idxs = idxs // self.num_experts_per_tok +-+++++- # for i, end_idx in enumerate(tokens_per_expert): +-+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++- # if start_idx == end_idx: +-+++++- # continue +-+++++- # expert = self.experts[i] +-+++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++- # expert_tokens = x[exp_token_idx] +-+++++- # expert_out = expert(expert_tokens) +-+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++- # return expert_cache +-++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++ expert_cache = ops.zeros_like(x) +-+++++ idxs = flat_expert_indices.argsort() +-+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++ token_idxs = idxs // self.num_experts_per_tok +-++++++ +-+++++ for i, end_idx in enumerate(tokens_per_expert): +-+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++ if start_idx == end_idx: +-+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++++ expert_out = expert(expert_tokens) +-+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++++ +-+++++ return expert_cache +-++++++ +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # # expert_cache = torch.zeros_like(x) +-++++++ # # idxs = flat_expert_indices.argsort() +-++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++++ # # if start_idx == end_idx: +-++++++ # # continue +-++++++ # # expert = self.experts[i] +-++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # # expert_tokens = x[exp_token_idx] +-++++++ # # expert_out = expert(expert_tokens) +-++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++++ # # return expert_cache +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # if start_idx == end_idx: +-++++++ # continue +-++++++ # expert = self.experts[i] +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = expert(expert_tokens) +-++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++++ +-++++++ # return expert_cache +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ +-++++++ # # 排序保证顺序一致 +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # # 找出有 token 的专家 +-++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++++ +-++++++ # for i in active_experts.tolist(): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # end_idx = tokens_per_expert[i] +-++++++ # if start_idx == end_idx: # 没有 token +-++++++ # continue +-++++++ +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = self.experts[i](expert_tokens) +-++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++++ +-++++++ # expert_cache = mindspore.mint.scatter_add( +-++++++ # expert_cache, +-++++++ # 0, +-++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++++ # expert_out +-++++++ # ) +-++++++ +-++++++ # return expert_cache +-++++++ +-++++++ +-+++++ +-+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++++ # """ +-+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++ +-+++++ # Initialize weights and apply final processing +-+++++ self.post_init() +-++++++ self.warm_up = False +-++++++ +-++++++ def warmup_moe_model_deep(self): +-++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-++++++ test_texts = [ +-++++++ "warmup short", +-++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-++++++ ] +-++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++++ if tokenizer is None: +-++++++ from mindnlp.transformers import AutoTokenizer +-++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++++ self._warmup_tokenizer = tokenizer +-++++++ +-++++++ for text in test_texts: +-++++++ inputs = tokenizer(text, return_tensors="ms") +-++++++ with mindspore._no_grad(): +-++++++ _ = self(**inputs, use_cache=False) +-++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++++ +-+++++ def get_input_embeddings(self): +-+++++ return self.model.embed_tokens +-+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++ ```""" +-++++++ if not self.warm_up: +-++++++ self.warm_up = True +-++++++ self.warmup_moe_model_deep() +-++++++ +-+++++ output_attentions = ( +-+++++ output_attentions +-+++++ if output_attentions is not None +-+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++index 3cbf820e..d4c6b651 100644 +-+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++@@ -18,7 +18,6 @@ +-+++++ # See the License for the specific language governing permissions and +-+++++ # limitations under the License. +-+++++ """MindSpore Qwen2MoE model.""" +-+++++- +-+++++ import math +-+++++ from typing import List, Optional, Tuple, Union +-+++++ +-+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++++ TokenClassifierOutput, +-+++++ ) +-+++++ from ...modeling_utils import PreTrainedModel +-++++++from ...generation import GenerationMixin +-+++++ from ....utils import logging +-+++++ from .configuration_qwen2_moe import Qwen2MoeConfig +-+++++ +-+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++++ self.variance_epsilon = eps +-+++++ +-+++++ def forward(self, hidden_states): +-++++++ # @dwj +-++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++++ # @lwx +-++++++ # if not self.training : +-++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++ input_dtype = hidden_states.dtype +-+++++ hidden_states = hidden_states.to(mindspore.float32) +-+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++++@@ -234,6 +239,8 @@ def rotate_half(x): +-+++++ """Rotates half the hidden dims of the input.""" +-+++++ x1 = x[..., : x.shape[-1] // 2] +-+++++ x2 = x[..., x.shape[-1] // 2 :] +-++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++ return ops.cat((-x2, x1), dim=-1) +-+++++ +-+++++ +-+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++++ self.config = config +-+++++ self.hidden_size = config.hidden_size +-+++++ self.intermediate_size = intermediate_size +-++++++ +-+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++++ self.act_fn = ACT2FN[config.hidden_act] +-+++++ +-+++++ def forward(self, x): +-+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++- +-+++++ +-++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++++ # @lwx +-++++++ # gate_up_output = self.gate_up_proj(x) +-++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-++++++ # return self.down_proj(swiglu_output) +-++++++ +-++++++ # def forward(self, x): +-++++++ # gate_proj_out = self.gate_proj(x) +-++++++ # up_proj_out = self.up_proj(x) +-++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-++++++ # return self.down_proj(swiglu_out) +-++++++ +-+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++++ """ +-+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++++ use_cache: bool = False, +-+++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ +-++++++ +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ query_states = self.q_proj(hidden_states) +-+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++ "with a layer index." +-+++++ ) +-+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ if isinstance(past_key_value, StaticCache): +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ else: +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++ if isinstance(past_key_value, StaticCache): +-++++++ kv_seq_len = key_states.shape[-2] +-+++++ +-+++++ # repeat k/v heads if n_kv_heads < n_heads +-+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++- +-++++++ +-+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++ +-+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++++- raise ValueError( +-+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++++- f" {attn_weights.shape}" +-+++++- ) +-+++++- +-+++++- if attention_mask is not None: # no matter the length, we just slice it +-+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-++++++ if attention_mask is not None: +-++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++ attn_weights = attn_weights + causal_mask +-+++++ +-+++++ # upcast attention to fp32 +-+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++ +-+++++ attn_output = self.o_proj(attn_output) +-+++++- +-++++++ # @lwx +-++++++ +-++++++ # max_seq_len = self.max_position_embeddings # 2048 +-++++++ +-++++++ # if attention_mask is not None: +-++++++ # # attention_mask: [B, 1, Sq, Sk] +-++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++++ +-++++++ # # pad 到 [max_seq_len, max_seq_len] +-++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++++ # global_attention_mask = padded_mask +-++++++ # else: +-++++++ # global_attention_mask = None +-++++++ +-++++++ +-++++++ # sparse_mode=3 +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # real_shift=None, +-++++++ # padding_mask=None, +-++++++ +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=global_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ # input_layout="BNSD", +-++++++ # pre_tokens=2147483647, +-++++++ # next_tokens=2147483647, +-++++++ # inner_precise=0, +-++++++ # drop_mask=None, +-++++++ # prefix=None, +-++++++ # actual_seq_qlen=None, +-++++++ # actual_seq_kvlen=None, +-++++++ # sparse_mode=sparse_mode, +-++++++ # ) +-+++++ if not output_attentions: +-+++++ attn_weights = None +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ +-++++++class Qwen2MoeFlashAttention(nn.Module): +-++++++ """ +-++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++++ +-++++++ 关键改动: +-++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++++ 直接传入原始的 key 和 value 张量效率更高。 +-++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++++ """ +-++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++++ super().__init__() +-++++++ self.config = config +-++++++ self.layer_idx = layer_idx +-++++++ self.hidden_size = config.hidden_size +-++++++ self.num_heads = config.num_attention_heads +-++++++ self.head_dim = self.hidden_size // self.num_heads +-++++++ self.num_key_value_heads = config.num_key_value_heads +-++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++ self.max_position_embeddings = config.max_position_embeddings +-++++++ self.rope_theta = config.rope_theta +-++++++ self.attention_dropout = config.attention_dropout +-++++++ +-++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++ raise ValueError( +-++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++++ ) +-++++++ +-++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++++ +-++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++++ self.head_dim, +-++++++ max_position_embeddings=self.max_position_embeddings, +-++++++ base=self.rope_theta, +-++++++ ) +-++++++ +-++++++ def forward( +-++++++ self, +-++++++ hidden_states: mindspore.Tensor, +-++++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++++ position_ids: Optional[mindspore.Tensor] = None, +-++++++ past_key_value: Optional[Cache] = None, +-++++++ output_attentions: bool = False, +-++++++ use_cache: bool = False, +-++++++ cache_position: Optional[mindspore.Tensor] = None, +-++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # 1. 线性投射 Q, K, V +-++++++ query_states = self.q_proj(hidden_states) +-++++++ key_states = self.k_proj(hidden_states) +-++++++ value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++ # query: [B, S, H*D] -> [B, N1, S, D] +-++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # 3. RoPE 旋转位置编码 +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ if past_key_value is not None: +-++++++ if self.layer_idx is None: +-++++++ raise ValueError( +-++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++ "with a layer index." +-++++++ ) +-++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++++ if cache_position.shape[0] == 1: +-++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++++ kv_seq_len = past_seen_tokens + 1 +-++++++ else: +-++++++ # prefill 阶段:cache_position 是范围,使用其长度 +-++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++++ else: +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # 4. KV 缓存更新 +-++++++ if past_key_value is not None: +-++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ key_states, value_states = past_key_value.update( +-++++++ key_states, value_states, self.layer_idx, cache_kwargs +-++++++ ) +-++++++ +-++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++ if cache_position.shape[0] == 1: +-++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ +-++++++ # 5. [重要] 准备 Attention Mask +-++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++++ fa_attention_mask = None +-++++++ if attention_mask is not None: +-++++++ # 截取与当前key长度匹配的部分 +-++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++++ fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++++ input_dtype = query_states.dtype +-++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++++ query_states = query_states.to(mindspore.float16) +-++++++ key_states = key_states.to(mindspore.float16) +-++++++ value_states = value_states.to(mindspore.float16) +-++++++ +-++++++ # 6. [核心] 调用 flash_attention_score 算子 +-++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++++ attn_output = mindspore.ops.flash_attention_score( +-++++++ query=query_states, +-++++++ key=key_states, +-++++++ value=value_states, +-++++++ head_num=self.num_heads, # 传入Q的头数(N1) +-++++++ attn_mask=fa_attention_mask, +-++++++ keep_prob=1.0 - self.attention_dropout, +-++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ input_layout="BNSD", +-++++++ sparse_mode=0 # 使用 defaultMask 模式 +-++++++ ) +-++++++ +-++++++ # 恢复原始数据类型 +-++++++ attn_output = attn_output.to(input_dtype) +-++++++ +-++++++ # 7. 调整输出形状 +-++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-++++++ attn_weights = None +-++++++ if output_attentions: +-++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++ return attn_output, attn_weights, past_key_value +-++++++ +-++++++ # def forward( +-++++++ # self, +-++++++ # hidden_states: mindspore.Tensor, +-++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++++ # past_key_value: Optional[Cache] = None, +-++++++ # output_attentions: bool = False, +-++++++ # use_cache: bool = False, +-++++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ # bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # # 1. 线性投射 Q, K, V +-++++++ # query_states = self.q_proj(hidden_states) +-++++++ # key_states = self.k_proj(hidden_states) +-++++++ # value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # # 3. RoPE 旋转位置编码 +-++++++ # kv_seq_len = key_states.shape[-2] +-++++++ # if past_key_value is not None: +-++++++ # if self.layer_idx is None: +-++++++ # raise ValueError( +-++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++ # "with a layer index." +-++++++ # ) +-++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # # 4. KV 缓存更新 +-++++++ # if past_key_value is not None: +-++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ # key_states, value_states = past_key_value.update( +-++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++++ # ) +-++++++ +-++++++ # # 5. 准备 Attention Mask +-++++++ # fa_attention_mask = None +-++++++ # if attention_mask is not None: +-++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++++ # input_dtype = query_states.dtype +-++++++ +-++++++ # # 6. [核心] 调用 flash_attention_score 算子 +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=fa_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ # input_layout="BNSD", +-++++++ # sparse_mode=0, +-++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++++ # inner_precise=1 +-++++++ # ) +-++++++ +-++++++ # # 恢复原始数据类型 +-++++++ # attn_output = attn_output.to(input_dtype) +-++++++ +-++++++ # # 7. 调整输出形状 +-++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ # attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # attn_weights = None +-++++++ # if output_attentions: +-++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++ # return attn_output, attn_weights, past_key_value +-++++++ +-++++++ # def forward( +-++++++ # self, +-++++++ # hidden_states: mindspore.Tensor, +-++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-++++++ # position_ids: Optional[mindspore.Tensor] = None, +-++++++ # past_key_value: Optional[Cache] = None, +-++++++ # output_attentions: bool = False, +-++++++ # use_cache: bool = False, +-++++++ # cache_position: Optional[mindspore.Tensor] = None, +-++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ # bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ # query_states = self.q_proj(hidden_states) +-++++++ # key_states = self.k_proj(hidden_states) +-++++++ # value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ # kv_seq_len = key_states.shape[-2] +-++++++ # if past_key_value is not None: +-++++++ # if self.layer_idx is None: +-++++++ # raise ValueError("`layer_idx` must be specified for caching") +-++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ # if past_key_value is not None: +-++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++ # key_states, value_states = past_key_value.update( +-++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-++++++ # ) +-++++++ +-++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++ +-++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-++++++ # query_states = query_states / math.sqrt(self.head_dim) +-++++++ # # <--- 修改结束 --- +-++++++ +-++++++ # fa_attention_mask = None +-++++++ # if attention_mask is not None: +-++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ # fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ # input_dtype = query_states.dtype +-++++++ +-++++++ # attn_output = mindspore.ops.flash_attention_score( +-++++++ # query=query_states, # 传入已经预先缩放过的 query +-++++++ # key=key_states, +-++++++ # value=value_states, +-++++++ # head_num=self.num_heads, +-++++++ # attn_mask=fa_attention_mask, +-++++++ # keep_prob=1.0 - self.attention_dropout, +-++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-++++++ # input_layout="BNSD", +-++++++ # sparse_mode=0, +-++++++ # inner_precise=1 # 仍然保持内部高精度计算 +-++++++ # ) +-++++++ +-++++++ # attn_output = attn_output.to(input_dtype) +-++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ # attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # attn_weights = None +-++++++ # if output_attentions: +-++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++++ +-++++++ # return attn_output, attn_weights, past_key_value +-++++++ +-+++++ QWEN2MOE_ATTENTION_CLASSES = { +-+++++ "eager": Qwen2MoeAttention, +-++++++ "flash-attention": Qwen2MoeFlashAttention, +-+++++ } +-+++++ +-+++++ +-+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-++++++ #@dwj +-++++++ # 只遍历激活的专家,而非全部专家 +-+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- hidden_states = hidden_states.view(-1, hidden_dim) +-+++++- # router_logits: (batch * sequence_length, n_experts) +-+++++- router_logits = self.gate(hidden_states) +-+++++- +-+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- if self.norm_topk_prob: +-+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- # we cast back to the input dtype +-+++++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++- final_hidden_states = ops.zeros( +-+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++++- ) +-+++++- +-+++++- # One hot encode the selected experts to create an expert mask +-+++++- # this will be used to easily index which expert is going to be sollicitated +-+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++++- +-+++++- # Loop over all available experts in the model and perform the computation on each expert +-+++++- for expert_idx in range(self.num_experts): +-+++++- expert_layer = self.experts[expert_idx] +-+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++++- +-+++++- # Index the correct hidden states and compute the expert hidden state for +-+++++- # the current expert. We need to make sure to multiply the output hidden +-+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++++- if 0 not in idx.shape: +-+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++++- +-+++++- # However `index_add_` only support torch tensors for indexing so we'll use +-+++++- # the `top_x` tensor here. +-+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++++- +-+++++- shared_expert_output = self.shared_expert(hidden_states) +-+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++++- +-+++++- final_hidden_states = final_hidden_states + shared_expert_output +-++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++ num_tokens = hidden_states_reshaped.shape[0] +-++++++ +-++++++ router_logits = self.gate(hidden_states_reshaped) +-++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++ +-++++++ if self.norm_topk_prob: +-++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++++ flat_selected_experts = selected_experts.flatten() +-++++++ +-++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++++ token_indices = broadcasted_token_indices.flatten() +-++++++ +-++++++ active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++ for expert_idx_tensor in active_experts: +-++++++ expert_idx = expert_idx_tensor.item() +-++++++ expert_layer = self.experts[expert_idx] +-++++++ +-++++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++++ selected_token_indices = token_indices[mask] +-++++++ selected_routing_weights = routing_weights.flatten()[mask] +-++++++ +-++++++ current_states = hidden_states_reshaped[selected_token_indices] +-++++++ +-++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++ +-++++++ final_hidden_states = final_hidden_states.index_add( +-++++++ dim=0, +-++++++ index=selected_token_indices, +-++++++ source=expert_output.to(hidden_states.dtype) +-++++++ ) +-++++++ +-++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++++ +-+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++- return final_hidden_states, router_logits +-++++++ final_hidden_states = final_hidden_states + shared_expert_output +-++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++ return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++ class Qwen2MoeDecoderLayer(nn.Module): +-+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++++ +-+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ +-++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++++ +-+++++ if (layer_idx not in config.mlp_only_layers) and ( +-+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++++ ): +-+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++++ _skip_keys_device_placement = "past_key_values" +-+++++ _supports_cache_class = True +-++++++#lwx +-++++++ # _supports_static_cache = True +-+++++ +-+++++ def _init_weights(self, module): +-+++++ std = self.config.initializer_range +-+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++++ return causal_mask +-+++++ +-+++++ +-+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ _tied_weights_keys = ["lm_head.weight"] +-+++++ +-+++++ def __init__(self, config): +-+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ self.num_experts_per_tok = config.num_experts_per_tok +-+++++ # Initialize weights and apply final processing +-+++++ self.post_init() +-++++++ # @lwx +-++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-++++++ # self.generation_config.cache_implementation = "static" +-++++++ self._warmed_up = False +-++++++ +-++++++ def warmup_moe_model(self): +-++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-++++++ test_texts = [ +-++++++ "warmup short", +-++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-++++++ ] +-++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-++++++ if tokenizer is None: +-++++++ from mindnlp.transformers import AutoTokenizer +-++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-++++++ self._warmup_tokenizer = tokenizer +-++++++ +-++++++ for text in test_texts: +-++++++ inputs = tokenizer(text, return_tensors="ms") +-++++++ with mindspore._no_grad(): +-++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++++ +-+++++ def get_input_embeddings(self): +-+++++ return self.model.embed_tokens +-+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++ ```""" +-++++++ if not self._warmed_up: +-++++++ self._warmed_up = True +-++++++ self.warmup_moe_model() +-+++++ +-+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++++ output_router_logits = ( +-+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++ } +-+++++ ) +-+++++ return model_inputs +-++++++# @lwx +-++++++ # def _decode_one_tokens_logits( +-++++++ # self, +-++++++ # cur_token: mindspore.Tensor, +-++++++ # input_pos: Optional[mindspore.Tensor], +-++++++ # cache_position: mindspore.Tensor, +-++++++ # past_key_values: StaticCache, +-++++++ # ) -> mindspore.Tensor: +-++++++ # """ +-++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-++++++ +-++++++ # Args: +-++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-++++++ # input_pos: 输入位置信息,可选 +-++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-++++++ +-++++++ # Returns: +-++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-++++++ # """ +-++++++ # # 调用JIT编译的版本 +-++++++ # return self.get_decode_one_tokens_logits( +-++++++ # cur_token=cur_token, +-++++++ # input_pos=input_pos, +-++++++ # cache_position=cache_position, +-++++++ # past_key_values=past_key_values, +-++++++ # ) +-++++++ +-++++++ # @mindspore.jit(jit_level='O1') +-++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-++++++ # """ +-++++++ # JIT编译的函数,用于高效的单token解码 +-++++++ # 使用JIT编译优化以支持静态shape和高效执行 +-++++++ +-++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-++++++ # """ +-++++++ # outputs = self.model.forward( +-++++++ # input_ids=cur_token, +-++++++ # position_ids=input_pos, +-++++++ # cache_position=cache_position, +-++++++ # past_key_values=past_key_values, +-++++++ # use_cache=True, +-++++++ # return_dict=False, +-++++++ # ) +-++++++ +-++++++ # hidden_states = outputs[0] +-++++++ # logits = self.lm_head.forward(hidden_states) +-++++++ # logits = logits.float() +-++++++ +-++++++ # return logits[:, -1, :] +-++++++ +-++++++ # def _sample( +-++++++ # self, +-++++++ # input_ids: mindspore.Tensor, +-++++++ # logits_processor, +-++++++ # stopping_criteria, +-++++++ # generation_config, +-++++++ # synced_devices: bool, +-++++++ # streamer=None, +-++++++ # logits_warper=None, +-++++++ # **model_kwargs, +-++++++ # ): +-++++++ # """ +-++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-++++++ # """ +-++++++ # from ...generation.logits_process import LogitsProcessorList +-++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-++++++ # from mindnlp.core import nn, ops, no_grad +-++++++ # import numpy as np +-++++++ +-++++++ # # 检查是否使用 StaticCache +-++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-++++++ # # 否则,直接调用父类方法 +-++++++ # past_key_values = model_kwargs.get("past_key_values") +-++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-++++++ +-++++++ # if not isinstance(past_key_values, StaticCache): +-++++++ # # 不使用 StaticCache,直接调用父类方法 +-++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-++++++ # return super()._sample( +-++++++ # input_ids=input_ids, +-++++++ # logits_processor=logits_processor, +-++++++ # stopping_criteria=stopping_criteria, +-++++++ # generation_config=generation_config, +-++++++ # synced_devices=synced_devices, +-++++++ # streamer=streamer, +-++++++ # logits_warper=logits_warper, +-++++++ # **model_kwargs, +-++++++ # ) +-++++++ +-++++++ # # 使用 StaticCache,进入自定义循环 +-++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-++++++ # pad_token_id = generation_config._pad_token_tensor +-++++++ # output_attentions = generation_config.output_attentions +-++++++ # output_hidden_states = generation_config.output_hidden_states +-++++++ # output_scores = generation_config.output_scores +-++++++ # output_logits = generation_config.output_logits +-++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-++++++ # max_length = generation_config.max_length +-++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-++++++ # do_sample = generation_config.do_sample +-++++++ +-++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-++++++ # raise ValueError( +-++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-++++++ # f"{logits_warper})." +-++++++ # ) +-++++++ +-++++++ # # init attention / hidden states / scores tuples +-++++++ # scores = () if (return_dict_in_generate and output_scores) else None +-++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-++++++ +-++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-++++++ # encoder_hidden_states = ( +-++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-++++++ # ) +-++++++ +-++++++ # # keep track of which sequences are already finished +-++++++ # batch_size, cur_len = input_ids.shape +-++++++ # this_peer_finished = False +-++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-++++++ +-++++++ # time_record = [] +-++++++ # from ....utils.testing_utils import parse_flag_from_env +-++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-++++++ +-++++++ # while self._has_unfinished_sequences( +-++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-++++++ # ): +-++++++ # if _record_time: +-++++++ # import time as time_module +-++++++ # infer_start = time_module.time() +-++++++ +-++++++ # # prepare model inputs +-++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-++++++ +-++++++ # # prepare variable output controls +-++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-++++++ +-++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-++++++ # cur_cache_position = model_inputs.get("cache_position") +-++++++ # cur_past_key_values = model_inputs.get("past_key_values") +-++++++ # cur_input_ids = model_inputs.get("input_ids") +-++++++ +-++++++ # if (isinstance(cur_past_key_values, StaticCache) and +-++++++ # cur_cache_position is not None and +-++++++ # len(cur_cache_position.shape) > 0 and +-++++++ # cur_cache_position.shape[0] == 1 and +-++++++ # cur_input_ids is not None and +-++++++ # cur_input_ids.shape[1] == 1): +-++++++ # # 使用 JIT 优化的单 token 解码 +-++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-++++++ # if not hasattr(self, '_jit_used'): +-++++++ # self._jit_used = False +-++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-++++++ +-++++++ # next_token_logits = self.get_decode_one_tokens_logits( +-++++++ # cur_token=cur_input_ids, +-++++++ # input_pos=model_inputs.get("position_ids"), +-++++++ # cache_position=cur_cache_position, +-++++++ # past_key_values=cur_past_key_values, +-++++++ # ) +-++++++ +-++++++ # # 标记已使用JIT(用于后续判断) +-++++++ # if not self._jit_used: +-++++++ # self._jit_used = True +-++++++ +-++++++ # # 构造兼容的输出对象 +-++++++ # class JitOptimizedOutput: +-++++++ # def __init__(self, logits, config): +-++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-++++++ # self.config = config +-++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-++++++ # self.attentions = None if not config.is_encoder_decoder else None +-++++++ # self.cross_attentions = None +-++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-++++++ +-++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-++++++ # else: +-++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-++++++ # outputs = self(**model_inputs, return_dict=True) +-++++++ +-++++++ # if synced_devices and this_peer_finished: +-++++++ # continue +-++++++ +-++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-++++++ # next_token_logits = outputs.logits[:, -1, :] +-++++++ +-++++++ # # pre-process distribution +-++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-++++++ # if do_sample: +-++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-++++++ +-++++++ # # Store scores, attentions and hidden_states when required +-++++++ # if return_dict_in_generate: +-++++++ # if output_scores: +-++++++ # scores += (next_token_scores,) +-++++++ # if output_logits: +-++++++ # raw_logits += (next_token_logits,) +-++++++ # if output_attentions: +-++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-++++++ # if self.config.is_encoder_decoder: +-++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-++++++ +-++++++ # if output_hidden_states: +-++++++ # hidden = ( +-++++++ # outputs.decoder_hidden_states +-++++++ # if self.config.is_encoder_decoder +-++++++ # else outputs.hidden_states +-++++++ # ) +-++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-++++++ +-++++++ # # token selection +-++++++ # if do_sample: +-++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-++++++ # else: +-++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-++++++ +-++++++ # # finished sentences should have their next token be a padding token +-++++++ # if has_eos_stopping_criteria: +-++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-++++++ +-++++++ # # update generated ids, model inputs, and length for next step +-++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-++++++ # if streamer is not None: +-++++++ # streamer.put(next_tokens) +-++++++ +-++++++ # model_kwargs = self._update_model_kwargs_for_generation( +-++++++ # outputs, +-++++++ # model_kwargs, +-++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-++++++ # ) +-++++++ +-++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-++++++ # cur_len += 1 +-++++++ +-++++++ # if _record_time: +-++++++ # import time as time_module +-++++++ # infer_stop = time_module.time() +-++++++ # time_record.append(infer_stop - infer_start) +-++++++ +-++++++ # del outputs +-++++++ +-++++++ # average_infer_time = None +-++++++ # if time_record: +-++++++ # if len(time_record) > 1: +-++++++ # time_record.pop(0) +-++++++ # average_infer_time = sum(time_record) / len(time_record) +-++++++ # print(f'average inference time is: {average_infer_time}') +-++++++ # print(f'inference time record: {time_record}') +-++++++ +-++++++ # if streamer is not None: +-++++++ # streamer.end() +-++++++ +-++++++ # # 简单判断:打印是否使用了JIT路径 +-++++++ # if hasattr(self, '_jit_used') and self._jit_used: +-++++++ # print("[JIT] ✓ JIT optimization was used during generation") +-++++++ # else: +-++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-++++++ +-++++++ # if return_dict_in_generate: +-++++++ # if self.config.is_encoder_decoder: +-++++++ # return GenerateEncoderDecoderOutput( +-++++++ # sequences=input_ids, +-++++++ # scores=scores, +-++++++ # logits=raw_logits, +-++++++ # encoder_attentions=encoder_attentions, +-++++++ # encoder_hidden_states=encoder_hidden_states, +-++++++ # decoder_attentions=decoder_attentions, +-++++++ # cross_attentions=cross_attentions, +-++++++ # decoder_hidden_states=decoder_hidden_states, +-++++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++++ # average_infer_time=average_infer_time +-++++++ # ) +-++++++ # else: +-++++++ # return GenerateDecoderOnlyOutput( +-++++++ # sequences=input_ids, +-++++++ # scores=scores, +-++++++ # logits=raw_logits, +-++++++ # attentions=decoder_attentions, +-++++++ # hidden_states=decoder_hidden_states, +-++++++ # past_key_values=model_kwargs.get("past_key_values"), +-++++++ # average_infer_time=average_infer_time +-++++++ # ) +-++++++ # else: +-++++++ # return input_ids +-++++++ +-++++++ # def _prepare_cache_for_generation( +-++++++ # self, +-++++++ # generation_config, +-++++++ # model_kwargs, +-++++++ # assistant_model, +-++++++ # batch_size, +-++++++ # max_cache_length, +-++++++ # ): +-++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-++++++ # generation_config.cache_implementation = "static" +-++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-++++++ +-++++++ # if generation_config.cache_implementation == "static": +-++++++ # base_required_from_max_length = generation_config.max_length + 1 +-++++++ # base_required = max(max_cache_length, base_required_from_max_length) +-++++++ # min_cache_size = 50 +-++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-++++++ # else: +-++++++ # max_cache_length = max(base_required, min_cache_size) +-++++++ +-++++++ # original_max_cache_length = max_cache_length +-++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-++++++ # print(f" - final max_cache_length: {max_cache_length}") +-++++++ +-++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-++++++ # if max_cache_length > self.config.max_position_embeddings: +-++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-++++++ +-++++++ # result = super()._prepare_cache_for_generation( +-++++++ # generation_config=generation_config, +-++++++ # model_kwargs=model_kwargs, +-++++++ # assistant_model=assistant_model, +-++++++ # batch_size=batch_size, +-++++++ # max_cache_length=max_cache_length, +-++++++ # ) +-++++++ +-++++++ # if generation_config.cache_implementation == "static": +-++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-++++++ # created_cache = model_kwargs.get(cache_name) +-++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-++++++ # if created_cache.max_cache_len < generation_config.max_length: +-++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-++++++ +-++++++ # return result +-++++++ +-++++++ +-++++++ +-+++++ +-+++++ +-+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++++-- +-+++++2.27.0 +-+++++ +-++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch +-++++new file mode 100644 +-++++index 00000000..22b65dd5 +-++++--- /dev/null +-+++++++ b/patches/0002-20251106commit.patch +-++++@@ -0,0 +1,3200 @@ +-+++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 +-+++++From: Pinoeer-kingxi <13022943007@163.com> +-+++++Date: Thu, 6 Nov 2025 09:20:38 +0800 +-+++++Subject: [PATCH 2/3] 20251106commit +-+++++ +-+++++--- +-+++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- +-+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- +-+++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ +-+++++ 3 files changed, 2689 insertions(+), 305 deletions(-) +-+++++ create mode 100644 patches/0001-20251104commit.patch +-+++++ +-+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++index d8303e45..73773c22 100644 +-+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): +-+++++ # y = y + self.shared_experts(identity) +-+++++ # return y +-+++++ +-++++++ # @no_grad() +-++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++++ +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ # for i in range(self.num_experts_per_tok): +-++++++ # expert_id = flat_expert_indices[i].item() +-++++++ # weight = flat_expert_weights[i].item() +-++++++ # expert = self.experts[expert_id] +-++++++ # expert_out = expert(x) +-++++++ # expert_cache += expert_out * weight +-++++++ # return expert_cache +-++++++ +-+++++ @no_grad() +-+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # x 的 shape: (1, hidden_size) +-++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) +-++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) +-++++++ +-++++++ # 1. 收集所有需要的专家层 +-++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 +-++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] +-++++++ +-++++++ # 2. 并行计算所有专家的输出 +-++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors +-++++++ # ops.cat 会将它们堆叠成一个新的 Tensor +-++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) +-++++++ +-++++++ # 3. 使用矩阵乘法进行加权求和 +-++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) +-++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) +-++++++ # 最终结果 final_output 的 shape: (1, hidden_size) +-++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) +-++++++ +-++++++ return final_output +-+++++ +-+++++- expert_cache = ops.zeros_like(x) +-+++++- for i in range(self.num_experts_per_tok): +-+++++- expert_id = flat_expert_indices[i].item() +-+++++- weight = flat_expert_weights[i].item() +-+++++- expert = self.experts[expert_id] +-+++++- expert_out = expert(x) +-+++++- expert_cache += expert_out * weight +-+++++- return expert_cache +-+++++ +-+++++ @no_grad() +-+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++++ # @lwx +-++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) +-++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) +-++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) +-++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) +-+++++ +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ +-++++++# class DeepseekFlashAttention(nn.Module): +-++++++# """ +-++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. +-++++++ +-++++++# This class is designed as a drop-in replacement for DeepseekAttention. +-++++++# """ +-++++++ +-++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++++++# super().__init__() +-++++++# self.config = config +-++++++# self.layer_idx = layer_idx +-++++++# if layer_idx is None: +-++++++# logger.warning( +-++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++++# "when creating this class." +-++++++# ) +-++++++ +-++++++# self.attention_dropout = config.attention_dropout +-++++++# self.hidden_size = config.hidden_size +-++++++# self.num_heads = config.num_attention_heads +-++++++# self.head_dim = self.hidden_size // self.num_heads +-++++++# self.num_key_value_heads = config.num_key_value_heads +-++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++# self.max_position_embeddings = config.max_position_embeddings +-++++++# self.rope_theta = config.rope_theta +-++++++# self.is_causal = True +-++++++ +-++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++# raise ValueError( +-++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++++# f" and `num_heads`: {self.num_heads})." +-++++++# ) +-++++++ +-++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++++++# self._init_rope() +-++++++ +-++++++# def _init_rope(self): +-++++++# if self.config.rope_scaling is None: +-++++++# self.rotary_emb = DeepseekRotaryEmbedding( +-++++++# self.head_dim, +-++++++# max_position_embeddings=self.max_position_embeddings, +-++++++# base=self.rope_theta, +-++++++# ) +-++++++# else: +-++++++# scaling_type = self.config.rope_scaling["type"] +-++++++# scaling_factor = self.config.rope_scaling["factor"] +-++++++# if scaling_type == "linear": +-++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++++++# self.head_dim, +-++++++# max_position_embeddings=self.max_position_embeddings, +-++++++# scaling_factor=scaling_factor, +-++++++# base=self.rope_theta, +-++++++# ) +-++++++# elif scaling_type == "dynamic": +-++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++++++# self.head_dim, +-++++++# max_position_embeddings=self.max_position_embeddings, +-++++++# scaling_factor=scaling_factor, +-++++++# base=self.rope_theta, +-++++++# ) +-++++++# else: +-++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++++++ +-++++++# def forward( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++++# position_ids: Optional[mindspore.Tensor] = None, +-++++++# past_key_value: Optional[Cache] = None, +-++++++# output_attentions: bool = False, +-++++++# use_cache: bool = False, +-++++++# **kwargs, +-++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++# if "padding_mask" in kwargs: +-++++++# warnings.warn( +-++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++++++# ) +-++++++ +-++++++# if output_attentions: +-++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") +-++++++ +-++++++# bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++# if self.config.pretraining_tp > 1: +-++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++++++ +-++++++# query_states = self.q_proj(hidden_states) +-++++++# key_states = self.k_proj(hidden_states) +-++++++# value_states = self.v_proj(hidden_states) +-++++++ +-++++++# # Reshape for multi-head attention +-++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++# kv_seq_len = key_states.shape[-2] +-++++++# if past_key_value is not None: +-++++++# if self.layer_idx is None: +-++++++# raise ValueError( +-++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++# "with a layer index." +-++++++# ) +-++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++# # Apply Rotary Positional Embedding +-++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++# if past_key_value is not None: +-++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models +-++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout +-++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) +-++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ +-++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++++ +-++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) +-++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) +-++++++ +-++++++# # Convert attention_mask for flash_attention_score +-++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. +-++++++# if attention_mask is not None: +-++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) +-++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++++++# raise ValueError( +-++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++++++# ) +-++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True +-++++++# else: +-++++++# attn_mask_for_fa = None +-++++++ +-++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++++++ +-++++++# # Call the fused flash_attention_score operator +-++++++# attn_output = mindspore.ops.flash_attention_score( +-++++++# query=query_states_for_fa, +-++++++# key=key_states_for_fa, +-++++++# value=value_states_for_fa, +-++++++# head_num=self.num_heads, # This is N1, the number of query heads +-++++++# input_layout='BSH', +-++++++# attn_mask=attn_mask_for_fa, +-++++++# keep_prob=keep_prob, +-++++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++# sparse_mode=0 # Default mask mode +-++++++# ) +-++++++ +-++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed +-++++++# attn_output = self.o_proj(attn_output) +-++++++ +-++++++# # Flash Attention does not return attention weights +-++++++# attn_weights = None +-++++++ +-++++++# return attn_output, attn_weights, past_key_value +-++++++ +-++++++class DeepseekFlashAttention(nn.Module): +-++++++ """ +-++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. +-++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, +-++++++ designed for high performance on supported hardware (Ascend). +-++++++ +-++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. +-++++++ """ +-++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): +-++++++ super().__init__() +-++++++ self.config = config +-++++++ self.layer_idx = layer_idx +-++++++ if layer_idx is None: +-++++++ logger.warning( +-++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++++ "when creating this class." +-++++++ ) +-++++++ +-++++++ # --- [FIX] Correctly initialize all required attributes --- +-++++++ self.attention_dropout = config.attention_dropout +-++++++ self.hidden_size = config.hidden_size +-++++++ self.num_heads = config.num_attention_heads +-++++++ self.head_dim = self.hidden_size // self.num_heads +-++++++ self.num_key_value_heads = config.num_key_value_heads +-++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++ self.max_position_embeddings = config.max_position_embeddings +-++++++ self.rope_theta = config.rope_theta +-++++++ self.is_causal = True +-++++++ +-++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++ raise ValueError( +-++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++++ f" and `num_heads`: {self.num_heads})." +-++++++ ) +-++++++ +-++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) +-++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) +-++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) +-++++++ +-++++++ # This call will now succeed as all attributes are initialized. +-++++++ self._init_rope() +-++++++ +-++++++ def _init_rope(self): +-++++++ if self.config.rope_scaling is None: +-++++++ self.rotary_emb = DeepseekRotaryEmbedding( +-++++++ self.head_dim, +-++++++ max_position_embeddings=self.max_position_embeddings, +-++++++ base=self.rope_theta, +-++++++ ) +-++++++ else: +-++++++ scaling_type = self.config.rope_scaling["type"] +-++++++ scaling_factor = self.config.rope_scaling["factor"] +-++++++ if scaling_type == "linear": +-++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( +-++++++ self.head_dim, +-++++++ max_position_embeddings=self.max_position_embeddings, +-++++++ scaling_factor=scaling_factor, +-++++++ base=self.rope_theta, +-++++++ ) +-++++++ elif scaling_type == "dynamic": +-++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( +-++++++ self.head_dim, +-++++++ max_position_embeddings=self.max_position_embeddings, +-++++++ scaling_factor=scaling_factor, +-++++++ base=self.rope_theta, +-++++++ ) +-++++++ else: +-++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") +-++++++ +-++++++ def forward( +-++++++ self, +-++++++ hidden_states: mindspore.Tensor, +-++++++ attention_mask: Optional[mindspore.Tensor] = None, +-++++++ position_ids: Optional[mindspore.Tensor] = None, +-++++++ past_key_value: Optional[Cache] = None, +-++++++ output_attentions: bool = False, +-++++++ use_cache: bool = False, +-++++++ **kwargs, +-++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ if "padding_mask" in kwargs: +-++++++ warnings.warn( +-++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" +-++++++ ) +-++++++ if output_attentions: +-++++++ warnings.warn( +-++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." +-++++++ ) +-++++++ +-++++++ bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ if self.config.pretraining_tp > 1: +-++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") +-++++++ +-++++++ query_states = self.q_proj(hidden_states) +-++++++ key_states = self.k_proj(hidden_states) +-++++++ value_states = self.v_proj(hidden_states) +-++++++ +-++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) +-++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++ kv_seq_len = key_states.shape[-2] +-++++++ if past_key_value is not None: +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++ # Apply Rotary Position Embedding +-++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ if past_key_value is not None: +-++++++ cache_kwargs = {"sin": sin, "cos": cos} +-++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. +-++++++ # So we must explicitly repeat the KV heads. +-++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++ +-++++++ # Convert attention mask for flash_attention_score +-++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. +-++++++ if attention_mask is not None: +-++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): +-++++++ raise ValueError( +-++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" +-++++++ ) +-++++++ attn_mask_for_fa = attention_mask < 0 +-++++++ else: +-++++++ attn_mask_for_fa = None +-++++++ +-++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 +-++++++ +-++++++ # Call the fused operator using the efficient BNSD layout +-++++++ attn_output = mindspore.ops.flash_attention_score( +-++++++ query=query_states, +-++++++ key=key_states, +-++++++ value=value_states, +-++++++ head_num=self.num_heads, +-++++++ input_layout='BNSD', # Specify the correct layout +-++++++ attn_mask=attn_mask_for_fa, +-++++++ keep_prob=keep_prob, +-++++++ scalar_value=1.0 / math.sqrt(self.head_dim) +-++++++ ) +-++++++ +-++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. +-++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ +-++++++ # Apply output projection +-++++++ attn_output = self.o_proj(attn_output) +-++++++ +-++++++ # Flash attention does not return attention weights, so we return None. +-++++++ attn_weights = None +-++++++ +-++++++ return attn_output, attn_weights, past_key_value +-++++++ +-+++++ Deepseek_ATTENTION_CLASSES = { +-+++++ "eager": DeepseekAttention, +-++++++ "flash-attention": DeepseekFlashAttention, +-+++++ } +-+++++ +-+++++ +-+++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): +-+++++ config=config, layer_idx=layer_idx +-+++++ ) +-+++++ +-++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( +-++++++ config=config, layer_idx=layer_idx +-++++++ ) +-++++++ +-+++++ self.mlp = ( +-+++++ DeepseekMoE(config) +-+++++ if ( +-+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++index d4c6b651..bced285c 100644 +-+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union +-+++++ +-+++++ import mindspore +-+++++ import mindnlp.core.nn.functional as F +-+++++-from mindnlp.core import nn, ops +-++++++from mindnlp.core import nn, ops, no_grad +-+++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss +-+++++ +-+++++ from ....common.activations import ACT2FN +-+++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) +-+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+++++ +-++++++Long_Prompt = False +-++++++PROMPT_LENGTH_THRESHOLD = 128 +-+++++ +-+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+++++ def _prepare_4d_causal_attention_mask_with_cache_position( +-+++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++ +-++++++# class Qwen2MoeFlashAttention(nn.Module): +-++++++# """ +-++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-++++++ +-++++++# 关键改动: +-++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-++++++# 直接传入原始的 key 和 value 张量效率更高。 +-++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++++# super().__init__() +-++++++# self.config = config +-++++++# self.layer_idx = layer_idx +-++++++# self.hidden_size = config.hidden_size +-++++++# self.num_heads = config.num_attention_heads +-++++++# self.head_dim = self.hidden_size // self.num_heads +-++++++# self.num_key_value_heads = config.num_key_value_heads +-++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++# self.max_position_embeddings = config.max_position_embeddings +-++++++# self.rope_theta = config.rope_theta +-++++++# self.attention_dropout = config.attention_dropout +-++++++ +-++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++# raise ValueError( +-++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-++++++# ) +-++++++ +-++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++++ +-++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++++# self.head_dim, +-++++++# max_position_embeddings=self.max_position_embeddings, +-++++++# base=self.rope_theta, +-++++++# ) +-++++++ +-++++++# def forward( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++++# position_ids: Optional[mindspore.Tensor] = None, +-++++++# past_key_value: Optional[Cache] = None, +-++++++# output_attentions: bool = False, +-++++++# use_cache: bool = False, +-++++++# cache_position: Optional[mindspore.Tensor] = None, +-++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++# bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++# # 1. 线性投射 Q, K, V +-++++++# query_states = self.q_proj(hidden_states) +-++++++# key_states = self.k_proj(hidden_states) +-++++++# value_states = self.v_proj(hidden_states) +-++++++ +-++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++# # query: [B, S, H*D] -> [B, N1, S, D] +-++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++# # 3. RoPE 旋转位置编码 +-++++++# kv_seq_len = key_states.shape[-2] +-++++++# if past_key_value is not None: +-++++++# if self.layer_idx is None: +-++++++# raise ValueError( +-++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++# "with a layer index." +-++++++# ) +-++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len +-++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len +-++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) +-++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-++++++# if cache_position.shape[0] == 1: +-++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-++++++# kv_seq_len = past_seen_tokens + 1 +-++++++# else: +-++++++# # prefill 阶段:cache_position 是范围,使用其长度 +-++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens +-++++++# else: +-++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++# # 4. KV 缓存更新 +-++++++# if past_key_value is not None: +-++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++# key_states, value_states = past_key_value.update( +-++++++# key_states, value_states, self.layer_idx, cache_kwargs +-++++++# ) +-++++++ +-++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: +-++++++# if cache_position.shape[0] == 1: +-++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-++++++# kv_seq_len = key_states.shape[-2] +-++++++ +-++++++# # 5. [重要] 准备 Attention Mask +-++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++++# fa_attention_mask = None +-++++++# if attention_mask is not None: +-++++++# # 截取与当前key长度匹配的部分 +-++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False +-++++++# fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-++++++# input_dtype = query_states.dtype +-++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-++++++# query_states = query_states.to(mindspore.float16) +-++++++# key_states = key_states.to(mindspore.float16) +-++++++# value_states = value_states.to(mindspore.float16) +-++++++ +-++++++# # 6. [核心] 调用 flash_attention_score 算子 +-++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA +-++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++++# attn_output = mindspore.ops.flash_attention_score( +-++++++# query=query_states, +-++++++# key=key_states, +-++++++# value=value_states, +-++++++# head_num=self.num_heads, # 传入Q的头数(N1) +-++++++# attn_mask=fa_attention_mask, +-++++++# keep_prob=1.0 - self.attention_dropout, +-++++++# scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++# input_layout="BNSD", +-++++++# sparse_mode=0 # 使用 defaultMask 模式 +-++++++# ) +-++++++ +-++++++# # 恢复原始数据类型 +-++++++# attn_output = attn_output.to(input_dtype) +-++++++ +-++++++# # 7. 调整输出形状 +-++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++# attn_output = self.o_proj(attn_output) +-++++++ +-++++++# # FlashAttention 算子不直接返回注意力权重矩阵 +-++++++# attn_weights = None +-++++++# if output_attentions: +-++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++# return attn_output, attn_weights, past_key_value +-++++++ +-++++++# # def forward( +-++++++# # self, +-++++++# # hidden_states: mindspore.Tensor, +-++++++# # attention_mask: Optional[mindspore.Tensor] = None, +-++++++# # position_ids: Optional[mindspore.Tensor] = None, +-++++++# # past_key_value: Optional[Cache] = None, +-++++++# # output_attentions: bool = False, +-++++++# # use_cache: bool = False, +-++++++# # cache_position: Optional[mindspore.Tensor] = None, +-++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++# # bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++# # # 1. 线性投射 Q, K, V +-++++++# # query_states = self.q_proj(hidden_states) +-++++++# # key_states = self.k_proj(hidden_states) +-++++++# # value_states = self.v_proj(hidden_states) +-++++++ +-++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-++++++# # # 3. RoPE 旋转位置编码 +-++++++# # kv_seq_len = key_states.shape[-2] +-++++++# # if past_key_value is not None: +-++++++# # if self.layer_idx is None: +-++++++# # raise ValueError( +-++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++# # "with a layer index." +-++++++# # ) +-++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++# # # 4. KV 缓存更新 +-++++++# # if past_key_value is not None: +-++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-++++++# # key_states, value_states = past_key_value.update( +-++++++# # key_states, value_states, self.layer_idx, cache_kwargs +-++++++# # ) +-++++++ +-++++++# # # 5. 准备 Attention Mask +-++++++# # fa_attention_mask = None +-++++++# # if attention_mask is not None: +-++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++# # fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-++++++# # input_dtype = query_states.dtype +-++++++ +-++++++# # # 6. [核心] 调用 flash_attention_score 算子 +-++++++# # attn_output = mindspore.ops.flash_attention_score( +-++++++# # query=query_states, +-++++++# # key=key_states, +-++++++# # value=value_states, +-++++++# # head_num=self.num_heads, +-++++++# # attn_mask=fa_attention_mask, +-++++++# # keep_prob=1.0 - self.attention_dropout, +-++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++# # input_layout="BNSD", +-++++++# # sparse_mode=0, +-++++++# # # <--- 修改点 2: 启用内部高精度计算 --- +-++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-++++++# # inner_precise=1 +-++++++# # ) +-++++++ +-++++++# # # 恢复原始数据类型 +-++++++# # attn_output = attn_output.to(input_dtype) +-++++++ +-++++++# # # 7. 调整输出形状 +-++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++# # attn_output = self.o_proj(attn_output) +-++++++ +-++++++# # attn_weights = None +-++++++# # if output_attentions: +-++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ +-++++++# # return attn_output, attn_weights, past_key_value +-++++++ +-++++++ +-+++++ class Qwen2MoeFlashAttention(nn.Module): +-+++++ """ +-+++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++- +-+++++- 关键改动: +-+++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++- 直接传入原始的 key 和 value 张量效率更高。 +-+++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 +-++++++ +-++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` +-++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, +-++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, +-++++++ 以达到理论上的最高执行速度。 +-+++++ """ +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++ super().__init__() +-+++++ self.config = config +-+++++ self.layer_idx = layer_idx +-++++++ if layer_idx is None: +-++++++ logger.warning_once( +-++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." +-++++++ ) +-++++++ +-+++++ self.hidden_size = config.hidden_size +-+++++ self.num_heads = config.num_attention_heads +-+++++ self.head_dim = self.hidden_size // self.num_heads +-+++++ self.num_key_value_heads = config.num_key_value_heads +-+++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++ self.max_position_embeddings = config.max_position_embeddings +-+++++ self.rope_theta = config.rope_theta +-+++++ self.attention_dropout = config.attention_dropout +-+++++ +-+++++- if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++- raise ValueError( +-+++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++- ) +-+++++- +-+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++- # query: [B, S, H*D] -> [B, N1, S, D] +-+++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] +-++++++ # 2. 调整形状以匹配 BNSD 布局 +-+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- +-+++++- # 3. RoPE 旋转位置编码 +-++++++ +-++++++ # 3. RoPE 和 KV 缓存 +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++- if self.layer_idx is None: +-+++++- raise ValueError( +-+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++- "with a layer index." +-+++++- ) +-+++++- # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++- if cache_position.shape[0] == 1: +-+++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++- kv_seq_len = past_seen_tokens + 1 +-+++++- else: +-+++++- # prefill 阶段:cache_position 是范围,使用其长度 +-+++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++- else: +-+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++- +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++- # 4. KV 缓存更新 +-+++++ if past_key_value is not None: +-+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++- key_states, value_states = past_key_value.update( +-+++++- key_states, value_states, self.layer_idx, cache_kwargs +-+++++- ) +-+++++- +-+++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++- if cache_position.shape[0] == 1: +-+++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++- kv_seq_len = key_states.shape[-2] +-+++++- +-+++++- # 5. [重要] 准备 Attention Mask +-+++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++ # 4. 准备 Attention Mask +-+++++ fa_attention_mask = None +-+++++ if attention_mask is not None: +-+++++- # 截取与当前key长度匹配的部分 +-+++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++- # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++ fa_attention_mask = (mask_slice != 0) +-+++++ +-+++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++- input_dtype = query_states.dtype +-+++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++- query_states = query_states.to(mindspore.float16) +-+++++- key_states = key_states.to(mindspore.float16) +-+++++- value_states = value_states.to(mindspore.float16) +-+++++- +-+++++- # 6. [核心] 调用 flash_attention_score 算子 +-+++++- # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 +-+++++ attn_output = mindspore.ops.flash_attention_score( +-+++++ query=query_states, +-+++++ key=key_states, +-+++++ value=value_states, +-+++++- head_num=self.num_heads, # 传入Q的头数(N1) +-++++++ head_num=self.num_heads, +-+++++ attn_mask=fa_attention_mask, +-+++++- keep_prob=1.0 - self.attention_dropout, +-++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout +-+++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++ input_layout="BNSD", +-+++++- sparse_mode=0 # 使用 defaultMask 模式 +-++++++ sparse_mode=0, +-++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 +-+++++ ) +-+++++ +-+++++- # 恢复原始数据类型 +-+++++- attn_output = attn_output.to(input_dtype) +-+++++- +-+++++- # 7. 调整输出形状 +-+++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-++++++ # 6. 调整输出形状 +-+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++ attn_output = self.o_proj(attn_output) +-+++++ +-+++++- # FlashAttention 算子不直接返回注意力权重矩阵 +-++++++ # 7. 返回结果 +-+++++ attn_weights = None +-+++++ if output_attentions: +-+++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++- # def forward( +-+++++- # self, +-+++++- # hidden_states: mindspore.Tensor, +-+++++- # attention_mask: Optional[mindspore.Tensor] = None, +-+++++- # position_ids: Optional[mindspore.Tensor] = None, +-+++++- # past_key_value: Optional[Cache] = None, +-+++++- # output_attentions: bool = False, +-+++++- # use_cache: bool = False, +-+++++- # cache_position: Optional[mindspore.Tensor] = None, +-+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++- +-+++++- # bsz, q_len, _ = hidden_states.shape +-+++++- +-+++++- # # 1. 线性投射 Q, K, V +-+++++- # query_states = self.q_proj(hidden_states) +-+++++- # key_states = self.k_proj(hidden_states) +-+++++- # value_states = self.v_proj(hidden_states) +-+++++- +-+++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- +-+++++- # # 3. RoPE 旋转位置编码 +-+++++- # kv_seq_len = key_states.shape[-2] +-+++++- # if past_key_value is not None: +-+++++- # if self.layer_idx is None: +-+++++- # raise ValueError( +-+++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++- # "with a layer index." +-+++++- # ) +-+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++ +-+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++- +-+++++- # # 4. KV 缓存更新 +-+++++- # if past_key_value is not None: +-+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++- # key_states, value_states = past_key_value.update( +-+++++- # key_states, value_states, self.layer_idx, cache_kwargs +-+++++- # ) +-+++++- +-+++++- # # 5. 准备 Attention Mask +-+++++- # fa_attention_mask = None +-+++++- # if attention_mask is not None: +-+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++- # fa_attention_mask = (mask_slice != 0) +-+++++- +-+++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++- # input_dtype = query_states.dtype +-+++++- +-+++++- # # 6. [核心] 调用 flash_attention_score 算子 +-+++++- # attn_output = mindspore.ops.flash_attention_score( +-+++++- # query=query_states, +-+++++- # key=key_states, +-+++++- # value=value_states, +-+++++- # head_num=self.num_heads, +-+++++- # attn_mask=fa_attention_mask, +-+++++- # keep_prob=1.0 - self.attention_dropout, +-+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++- # input_layout="BNSD", +-+++++- # sparse_mode=0, +-+++++- # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++- # inner_precise=1 +-+++++- # ) +-+++++- +-+++++- # # 恢复原始数据类型 +-+++++- # attn_output = attn_output.to(input_dtype) +-++++++QWEN2MOE_ATTENTION_CLASSES = { +-++++++ "eager": Qwen2MoeAttention, +-++++++ "flash-attention": Qwen2MoeFlashAttention, +-++++++} +-+++++ +-+++++- # # 7. 调整输出形状 +-+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++- # attn_output = self.o_proj(attn_output) +-+++++ +-+++++- # attn_weights = None +-+++++- # if output_attentions: +-+++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# def __init__(self, config): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# # gating +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++ +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# #@dwj +-++++++# # 只遍历激活的专家,而非全部专家 +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# num_tokens = hidden_states_reshaped.shape[0] +-++++++ +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++ +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++ +-++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-++++++# token_indices = broadcasted_token_indices.flatten() +-++++++ +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++ +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# selected_token_indices = token_indices[mask] +-++++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++++ +-++++++# current_states = hidden_states_reshaped[selected_token_indices] +-++++++ +-++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++ +-++++++# final_hidden_states = final_hidden_states.index_add( +-++++++# dim=0, +-++++++# index=selected_token_indices, +-++++++# source=expert_output.to(hidden_states.dtype) +-++++++# ) +-++++++ +-++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++++ +-+++++- # return attn_output, attn_weights, past_key_value +-++++++# final_hidden_states = final_hidden_states + shared_expert_output +-++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-++++++ +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# """ +-++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# # 门控网络 +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# # 专家列表 +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++# # 共享专家 +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_decode( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# """ +-++++++# 【解码路径】针对 sequence_length=1 的极致优化。 +-++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-++++++# """ +-++++++# batch_size, hidden_dim = hidden_states.shape +-++++++ +-++++++# expert_outputs_list = [ +-++++++# ops.cat([ +-++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++++# ], dim=0) +-++++++# for i in range(batch_size) +-++++++# ] +-++++++ +-++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-++++++# # shape: (batch_size, top_k, hidden_dim) +-++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++++ +-++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++++ +-++++++# return moe_output.squeeze(1) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_prefill( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# """ +-++++++# 【预填充路径】针对 sequence_length > 1 的优化。 +-++++++# 按专家对 Token 进行分组,并进行批处理。 +-++++++# """ +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens = hidden_states.shape[0] +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++ +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++ +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++ +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# selected_token_indices = token_indices[mask] +-++++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++++ +-++++++# current_states = hidden_states[selected_token_indices] +-++++++ +-++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++ +-++++++# moe_output = moe_output.index_add( +-++++++# dim=0, +-++++++# index=selected_token_indices, +-++++++# source=expert_output.to(hidden_states.dtype) +-++++++# ) +-++++++# return moe_output +-++++++ +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# """ +-++++++# 顶层 forward 方法,作为智能分发器。 +-++++++# """ +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++- # def forward( +-+++++- # self, +-+++++- # hidden_states: mindspore.Tensor, +-+++++- # attention_mask: Optional[mindspore.Tensor] = None, +-+++++- # position_ids: Optional[mindspore.Tensor] = None, +-+++++- # past_key_value: Optional[Cache] = None, +-+++++- # output_attentions: bool = False, +-+++++- # use_cache: bool = False, +-+++++- # cache_position: Optional[mindspore.Tensor] = None, +-+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++- +-+++++- # bsz, q_len, _ = hidden_states.shape +-+++++- +-+++++- # query_states = self.q_proj(hidden_states) +-+++++- # key_states = self.k_proj(hidden_states) +-+++++- # value_states = self.v_proj(hidden_states) +-+++++- +-+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++- +-+++++- # kv_seq_len = key_states.shape[-2] +-+++++- # if past_key_value is not None: +-+++++- # if self.layer_idx is None: +-+++++- # raise ValueError("`layer_idx` must be specified for caching") +-+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++- +-+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++- +-+++++- # if past_key_value is not None: +-+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++- # key_states, value_states = past_key_value.update( +-+++++- # key_states, value_states, self.layer_idx, cache_kwargs +-+++++- # ) +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++# moe_output = None +-++++++# # 在推理时,根据序列长度选择最优路径 +-++++++# if not self.training: +-++++++# if sequence_length == 1: +-++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++++# else: +-++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++++# else: +-++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-++++++# raise NotImplementedError("Training path is not implemented.") +-++++++ +-++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-++++++ +-++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-++++++ +-++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-++++++ +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# """ +-++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# # 门控网络 +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# # 专家列表 +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++# # 共享专家 +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_decode( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# batch_size, _ = hidden_states.shape +-++++++# expert_outputs_list = [ +-++++++# ops.cat([ +-++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++++# ], dim=0) +-++++++# for i in range(batch_size) +-++++++# ] +-++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++++# return moe_output.squeeze(1) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_prefill( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens = hidden_states.shape[0] +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# selected_token_indices = token_indices[mask] +-++++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++++# current_states = hidden_states[selected_token_indices] +-++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++# moe_output = moe_output.index_add( +-++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++++# ) +-++++++# return moe_output +-++++++ +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# """ +-++++++# 顶层 forward 方法,作为智能分发器。 +-++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-++++++# """ +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ +-++++++# # 1. 门控计算 (通用逻辑) +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++ +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++# # 2. 智能分发到最优 MoE 路径 +-++++++# moe_output = None +-++++++# if not self.training: +-++++++# if sequence_length == 1: +-++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-++++++# else: +-++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-++++++# else: +-++++++# raise NotImplementedError("Training path is not implemented.") +-++++++ +-++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++++ +-++++++# # 4. 合并 MoE 输出和共享专家输出 +-++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++++ +-++++++# # 5. 恢复原始形状并返回 +-++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-++++++# prefill fastest +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# """ +-++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# # 门控网络 +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# # 专家列表 +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++# # 共享专家 +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_dispatch( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# """ +-++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-++++++# """ +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens, _ = hidden_states.shape +-++++++ +-++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++# flat_routing_weights = routing_weights.flatten() +-+++++ +-+++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++- +-+++++- # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++++- # query_states = query_states / math.sqrt(self.head_dim) +-+++++- # # <--- 修改结束 --- +-+++++- +-+++++- # fa_attention_mask = None +-+++++- # if attention_mask is not None: +-+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++- # fa_attention_mask = (mask_slice != 0) +-+++++- +-+++++- # input_dtype = query_states.dtype +-+++++- +-+++++- # attn_output = mindspore.ops.flash_attention_score( +-+++++- # query=query_states, # 传入已经预先缩放过的 query +-+++++- # key=key_states, +-+++++- # value=value_states, +-+++++- # head_num=self.num_heads, +-+++++- # attn_mask=fa_attention_mask, +-+++++- # keep_prob=1.0 - self.attention_dropout, +-+++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++++- # input_layout="BNSD", +-+++++- # sparse_mode=0, +-+++++- # inner_precise=1 # 仍然保持内部高精度计算 +-+++++- # ) +-++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++ +-+++++- # attn_output = attn_output.to(input_dtype) +-+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++- # attn_output = self.o_proj(attn_output) +-++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++ +-++++++# # 找到所有分配给该专家的 token +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++ +-++++++# # 使用 mask 选取对应的 token 和权重 +-++++++# current_token_indices = token_indices[mask] +-++++++# current_routing_weights = flat_routing_weights[mask] +-++++++# current_hidden_states = hidden_states[current_token_indices] +-++++++ +-++++++# # 对这些 token 进行批处理 +-++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++++ +-++++++# # 使用 index_add 将结果精确地加回到对应位置 +-++++++# moe_output = moe_output.index_add( +-++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-++++++# ) +-++++++# return moe_output +-++++++ +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# """ +-++++++# 顶层 forward 方法,作为智能分发器。 +-++++++# """ +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ +-++++++# # 1. 门控计算 +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++ +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++# routing_weights = routing_weights.to(hidden_states.dtype) +-++++++ +-++++++# # 2. 调用统一的 MoE 计算内核 +-++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+++++ +-+++++- # attn_weights = None +-+++++- # if output_attentions: +-+++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-++++++# # 3. 统一处理共享专家 +-++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++++ +-++++++# # 4. 合并输出 +-++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++++ +-++++++# # 5. 恢复原始形状并返回 +-++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-++++++ +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# """ +-++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-++++++# 【最终高性能与高精度版】: +-++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-++++++# 3. 这样实现了速度和准确性的两全其美。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_decode( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# """ +-++++++# 【解码路径】极致优化版:bmm + 高精度累加。 +-++++++# """ +-++++++# original_dtype = hidden_states.dtype +-++++++# batch_size, _ = hidden_states.shape +-++++++ +-++++++# expert_outputs_list = [ +-++++++# ops.cat([ +-++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++++# ], dim=0) +-++++++# for i in range(batch_size) +-++++++# ] +-++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++++ +-++++++# # 在 float32 下执行 bmm,得到高精度结果 +-++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-++++++ +-++++++# # 将高精度结果转换回原始数据类型 +-++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-++++++ +-++++++# return moe_output +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_prefill( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# selected_experts: mindspore.Tensor, +-++++++# routing_weights: mindspore.Tensor +-++++++# ) -> mindspore.Tensor: +-++++++# """ +-++++++# 【预填充路径】与原始实现一致,结果精确。 +-++++++# """ +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens, _ = hidden_states.shape +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++ +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# selected_token_indices = token_indices[mask] +-++++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++++# current_states = hidden_states[selected_token_indices] +-++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++# moe_output = moe_output.index_add( +-++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-++++++# ) +-++++++# return moe_output +-++++++ +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++ +-+++++- # return attn_output, attn_weights, past_key_value +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-++++++# # 如果模型主体是 float16,后续再转换 +-++++++ +-++++++# moe_output = None +-++++++# if not self.training: +-++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-++++++# # _moe_infer_decode 内部会处理好类型转换 +-++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-++++++# if sequence_length == 1: +-++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++++# else: +-++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-++++++# else: +-++++++# raise NotImplementedError("Training path is not implemented.") +-++++++ +-++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++++ +-++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-+++++ +-+++++-QWEN2MOE_ATTENTION_CLASSES = { +-+++++- "eager": Qwen2MoeAttention, +-+++++- "flash-attention": Qwen2MoeFlashAttention, +-+++++-} +-++++++# class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++# """ +-++++++# 【融合版】一个混合专家模块,内置两种推理策略, +-++++++# 由外部全局变量 `Long_Prompt` 控制: +-++++++ +-++++++# - if Long_Prompt is True: 【精度优先模式】 +-++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-++++++# 适用于处理长序列,避免误差累积。 +-++++++ +-++++++# - if Long_Prompt is False: 【速度优先模式】 +-++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 +-++++++# """ +-++++++# def __init__(self, config: Qwen2MoeConfig): +-++++++# super().__init__() +-++++++# self.num_experts = config.num_experts +-++++++# self.top_k = config.num_experts_per_tok +-++++++# self.norm_topk_prob = config.norm_topk_prob +-++++++ +-++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-++++++# self.experts = nn.ModuleList( +-++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-++++++# ) +-++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-++++++# # --- 速度优先模式的辅助函数 --- +-++++++# @no_grad() +-++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++# original_dtype = hidden_states.dtype +-++++++# batch_size, _ = hidden_states.shape +-++++++# expert_outputs_list = [ +-++++++# ops.cat([ +-++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++++# ], dim=0) +-++++++# for i in range(batch_size) +-++++++# ] +-++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++++# weights_fp32 = routing_weights.to(mindspore.float32) +-++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++++# return moe_output_fp32.squeeze(1).to(original_dtype) +-++++++ +-++++++# @no_grad() +-++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens, _ = hidden_states.shape +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# selected_token_indices = token_indices[mask] +-++++++# selected_routing_weights = routing_weights.flatten()[mask] +-++++++# current_states = hidden_states[selected_token_indices] +-++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++++# return moe_output +-++++++ +-++++++# # --- 精度优先模式的辅助函数 --- +-++++++# @no_grad() +-++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++# moe_output = ops.zeros_like(hidden_states) +-++++++# num_tokens, _ = hidden_states.shape +-++++++# flat_selected_experts = selected_experts.flatten() +-++++++# flat_routing_weights = routing_weights.flatten() +-++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++# active_experts = ops.unique(flat_selected_experts) +-++++++# for expert_idx_tensor in active_experts: +-++++++# expert_idx = expert_idx_tensor.item() +-++++++# expert_layer = self.experts[expert_idx] +-++++++# mask = (flat_selected_experts == expert_idx_tensor) +-++++++# current_token_indices = token_indices[mask] +-++++++# current_routing_weights = flat_routing_weights[mask] +-++++++# current_hidden_states = hidden_states[current_token_indices] +-++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++++# return moe_output +-++++++ +-++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++# # 声明我们将要使用一个在模块外部定义的全局变量 +-++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-++++++# global Long_Prompt +-++++++ +-++++++# # 1. 门控计算 (所有模式通用) +-++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++# router_logits = self.gate(hidden_states_reshaped) +-++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++++++# if self.norm_topk_prob: +-++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++# moe_output = None +-++++++# if not self.training: +-++++++# # 根据 Long_Prompt 标志选择模式 +-++++++# if Long_Prompt: +-++++++# # --- 精度优先模式 --- +-++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++# else: +-++++++# # --- 速度优先模式 --- +-++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++# if sequence_length == 1: +-++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++# else: +-++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++# else: +-++++++# raise NotImplementedError("Training path is not implemented.") +-++++++ +-++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++++ +-++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++# return final_hidden_states, router_logits +-++++++ +-++++++class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++ """ +-++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-++++++ 控制的顶级推理策略: +-+++++ +-++++++ - if Long_Prompt is True: 【精度优先模式】 +-++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 +-++++++ 适用于需要严格可复现性的长序列任务。 +-+++++ +-+++++-class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++- def __init__(self, config): +-++++++ - if Long_Prompt is False: 【速度优先模式】 +-++++++ 采用业界最强的性能组合: +-++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 +-++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 +-++++++ """ +-++++++ def __init__(self, config: Qwen2MoeConfig): +-+++++ super().__init__() +-+++++ self.num_experts = config.num_experts +-+++++ self.top_k = config.num_experts_per_tok +-+++++ self.norm_topk_prob = config.norm_topk_prob +-+++++ +-+++++- # gating +-+++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++ self.experts = nn.ModuleList( +-+++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++ ) +-+++++- +-+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++ +-+++++- #@dwj +-+++++- # 只遍历激活的专家,而非全部专家 +-+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++- num_tokens = hidden_states_reshaped.shape[0] +-+++++- +-+++++- router_logits = self.gate(hidden_states_reshaped) +-+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- +-+++++- if self.norm_topk_prob: +-+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++- flat_selected_experts = selected_experts.flatten() +-+++++- +-+++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++- token_indices = broadcasted_token_indices.flatten() +-+++++- +-+++++- active_experts = ops.unique(flat_selected_experts) +-+++++- +-+++++- for expert_idx_tensor in active_experts: +-+++++- expert_idx = expert_idx_tensor.item() +-+++++- expert_layer = self.experts[expert_idx] +-+++++- +-+++++- mask = (flat_selected_experts == expert_idx_tensor) +-+++++- selected_token_indices = token_indices[mask] +-+++++- selected_routing_weights = routing_weights.flatten()[mask] +-+++++- +-+++++- current_states = hidden_states_reshaped[selected_token_indices] +-+++++- +-+++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++- +-+++++- final_hidden_states = final_hidden_states.index_add( +-+++++- dim=0, +-+++++- index=selected_token_indices, +-+++++- source=expert_output.to(hidden_states.dtype) +-+++++- ) +-+++++- +-+++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- +-++++++ @no_grad() +-++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++ original_dtype = hidden_states.dtype +-++++++ batch_size, _ = hidden_states.shape +-++++++ expert_outputs_list = [ +-++++++ ops.cat([ +-++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-++++++ ], dim=0) +-++++++ for i in range(batch_size) +-++++++ ] +-++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-++++++ weights_fp32 = routing_weights.to(mindspore.float32) +-++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-++++++ return moe_output_fp32.squeeze(1).to(original_dtype) +-++++++ +-++++++ @no_grad() +-++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++ num_tokens, _ = hidden_states.shape +-++++++ flat_selected_experts = selected_experts.flatten() +-++++++ sorted_expert_indices = flat_selected_experts.argsort() +-++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++++++ original_token_indices = sorted_expert_indices // self.top_k +-++++++ moe_output = ops.zeros_like(hidden_states) +-++++++ current_token_offset = 0 +-++++++ for i in range(self.num_experts): +-++++++ expert_token_count = tokens_per_expert[i] - current_token_offset +-++++++ if expert_token_count == 0: +-++++++ continue +-++++++ end_offset = current_token_offset + expert_token_count +-++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++++++ expert_hidden_states = hidden_states[expert_original_token_indices] +-++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++++ current_token_offset += expert_token_count +-++++++ return moe_output +-++++++ +-++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-++++++ @no_grad() +-++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++ moe_output = ops.zeros_like(hidden_states) +-++++++ num_tokens, _ = hidden_states.shape +-++++++ flat_selected_experts = selected_experts.flatten() +-++++++ flat_routing_weights = routing_weights.flatten() +-++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-++++++ active_experts = ops.unique(flat_selected_experts) +-++++++ for expert_idx_tensor in active_experts: +-++++++ expert_idx = expert_idx_tensor.item() +-++++++ expert_layer = self.experts[expert_idx] +-++++++ mask = (flat_selected_experts == expert_idx_tensor) +-++++++ current_token_indices = token_indices[mask] +-++++++ current_routing_weights = flat_routing_weights[mask] +-++++++ current_hidden_states = hidden_states[current_token_indices] +-++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++++ return moe_output +-+++++ +-+++++- final_hidden_states = final_hidden_states + shared_expert_output +-+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++- return final_hidden_states, router_logits +-++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++ global Long_Prompt +-++++++ +-++++++ # 1. 门控计算 (所有模式通用) +-++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-++++++ router_logits = self.gate(hidden_states_reshaped) +-++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-++++++ if self.norm_topk_prob: +-++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++ +-++++++ moe_output = None +-++++++ if Long_Prompt: +-++++++ # --- 精度优先模式 (ACCURACY MODE) --- +-++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ else: +-++++++ # --- 速度优先模式 (SPEED MODE) --- +-++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++ if sequence_length == 1: +-++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ else: +-++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ +-+++++ +-++++++ # 3. 共享专家计算与合并 (所有模式通用) +-++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-++++++ +-++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-++++++ +-++++++ return final_hidden_states, router_logits +-+++++ +-+++++ class Qwen2MoeDecoderLayer(nn.Module): +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+++++ super().__init__() +-+++++ self.hidden_size = config.hidden_size +-++++++ +-++++++ # if Long_Prompt: +-++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++++ # else: +-++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++ +-+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ +-+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++- +-+++++ if (layer_idx not in config.mlp_only_layers) and ( +-+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++++ ): +-+++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ self._warmed_up = True +-+++++ self.warmup_moe_model() +-+++++ +-++++++ +-++++++ +-+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++++ output_router_logits = ( +-+++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits +-+++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ router_logits=outputs.router_logits, +-+++++ ) +-+++++ +-++++++ def generate(self, *args, **kwargs): +-++++++ """ +-++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-++++++ """ +-++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++++++ +-++++++ input_ids = kwargs.get("input_ids") +-++++++ if input_ids is None and args: +-++++++ input_ids = args[0] +-++++++ +-++++++ if input_ids is not None: +-++++++ prompt_length = input_ids.shape[1] +-++++++ +-++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: +-++++++ Long_Prompt = True +-++++++ else: +-++++++ Long_Prompt = False +-++++++ +-++++++ return super().generate(*args, **kwargs) +-++++++ +-+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation +-+++++ def prepare_inputs_for_generation( +-+++++ self, +-+++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens +-+++++ # Exception 1: when passing input_embeds, input_ids may be missing entries +-+++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here +-++++++ +-+++++ if past_key_values is not None: +-+++++ if inputs_embeds is not None: # Exception 1 +-+++++ if 0 not in input_ids.shape: +-+++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ } +-+++++ ) +-+++++ return model_inputs +-++++++ +-+++++ # @lwx +-+++++ # def _decode_one_tokens_logits( +-+++++ # self, +-+++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): +-+++++ attentions=outputs.attentions, +-+++++ ) +-+++++ +-++++++ +-+++++ __all__ = [ +-+++++ "Qwen2MoeForCausalLM", +-+++++ "Qwen2MoeModel", +-+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++++new file mode 100644 +-+++++index 00000000..6dfb5b93 +-+++++--- /dev/null +-++++++++ b/patches/0001-20251104commit.patch +-+++++@@ -0,0 +1,1272 @@ +-++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-++++++From: Pinoeer-kingxi <13022943007@163.com> +-++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 +-++++++Subject: [PATCH] 20251104commit +-++++++ +-++++++--- +-++++++ mindnlp/transformers/cache_utils.py | 28 +- +-++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- +-++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-++++++ 3 files changed, 976 insertions(+), 87 deletions(-) +-++++++ +-++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-++++++index cadd2e04..02f8d4be 100644 +-++++++--- a/mindnlp/transformers/cache_utils.py +-+++++++++ b/mindnlp/transformers/cache_utils.py +-++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): +-++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-++++++ # k_out[:, :, cache_position] = key_states +-++++++ # v_out[:, :, cache_position] = value_states +-++++++- if ON_ORANGE_PI: +-++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-++++++- else: +-++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-++++++- +-+++++++ # if ON_ORANGE_PI: +-+++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++++ # else: +-+++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++++ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++++++ if cache_position.ndim > 1: +-+++++++ cache_position = cache_position.flatten() +-+++++++ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++++++ cache_position = cache_position.int() +-+++++++ +-+++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++++++ k_out[:, :, cache_position] = key_states +-+++++++ v_out[:, :, cache_position] = value_states +-+++++++ +-++++++ return k_out, v_out +-++++++ +-++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++index c695b944..d8303e45 100644 +-++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half +-++++++ def rotate_half(x): +-++++++ """Rotates half the hidden dims of the input.""" +-++++++- x1 = x[..., : x.shape[-1] // 2] +-++++++- x2 = x[..., x.shape[-1] // 2 :] +-+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++++ # x1 = x[..., : x.shape[-1] // 2] +-+++++++ # x2 = x[..., x.shape[-1] // 2 :] +-+++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++++ return ops.cat((-x2, x1), dim=-1) +-++++++ +-++++++ +-++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-++++++ if self.training: +-++++++ raise NotImplementedError("Training is not supported yet.") +-++++++ else: +-++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-++++++- if self.config.n_shared_experts is not None: +-++++++- y = y + self.shared_experts(identity) +-++++++- return y +-+++++++ # @lwx +-+++++++ if orig_shape[1] == 1: +-+++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++++++ y=y.view(*orig_shape) +-+++++++ if self.config.n_shared_experts is not None: +-+++++++ y = y + self.shared_experts(identity) +-+++++++ return y +-+++++++ else: +-+++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++++++ if self.config.n_shared_experts is not None: +-+++++++ y = y + self.shared_experts(identity) +-+++++++ return y +-+++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++++ # if self.config.n_shared_experts is not None: +-+++++++ # y = y + self.shared_experts(identity) +-+++++++ # return y +-+++++++ +-+++++++ @no_grad() +-+++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++++ +-+++++++ expert_cache = ops.zeros_like(x) +-+++++++ for i in range(self.num_experts_per_tok): +-+++++++ expert_id = flat_expert_indices[i].item() +-+++++++ weight = flat_expert_weights[i].item() +-+++++++ expert = self.experts[expert_id] +-+++++++ expert_out = expert(x) +-+++++++ expert_cache += expert_out * weight +-+++++++ return expert_cache +-++++++ +-++++++ @no_grad() +-++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++- # expert_cache = torch.zeros_like(x) +-++++++- # idxs = flat_expert_indices.argsort() +-++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++++- # token_idxs = idxs // self.num_experts_per_tok +-++++++- # for i, end_idx in enumerate(tokens_per_expert): +-++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++++- # if start_idx == end_idx: +-++++++- # continue +-++++++- # expert = self.experts[i] +-++++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++- # expert_tokens = x[exp_token_idx] +-++++++- # expert_out = expert(expert_tokens) +-++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++++- # return expert_cache +-+++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++++ expert_cache = ops.zeros_like(x) +-++++++ idxs = flat_expert_indices.argsort() +-++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ token_idxs = idxs // self.num_experts_per_tok +-+++++++ +-++++++ for i, end_idx in enumerate(tokens_per_expert): +-++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ if start_idx == end_idx: +-++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-++++++ expert_out = expert(expert_tokens) +-++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++++ +-++++++ return expert_cache +-+++++++ +-+++++++ # @no_grad() +-+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++++ # # expert_cache = torch.zeros_like(x) +-+++++++ # # idxs = flat_expert_indices.argsort() +-+++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++++ # # token_idxs = idxs // self.num_experts_per_tok +-+++++++ # # for i, end_idx in enumerate(tokens_per_expert): +-+++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++++ # # if start_idx == end_idx: +-+++++++ # # continue +-+++++++ # # expert = self.experts[i] +-+++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++++ # # expert_tokens = x[exp_token_idx] +-+++++++ # # expert_out = expert(expert_tokens) +-+++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++++ # # return expert_cache +-+++++++ # expert_cache = ops.zeros_like(x) +-+++++++ # idxs = flat_expert_indices.argsort() +-+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++++ +-+++++++ # for i, end_idx in enumerate(tokens_per_expert): +-+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++++ # if start_idx == end_idx: +-+++++++ # continue +-+++++++ # expert = self.experts[i] +-+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++++ # expert_tokens = x[exp_token_idx] +-+++++++ # expert_out = expert(expert_tokens) +-+++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++++ +-+++++++ # return expert_cache +-+++++++ # @no_grad() +-+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++++ # expert_cache = ops.zeros_like(x) +-+++++++ +-+++++++ # # 排序保证顺序一致 +-+++++++ # idxs = flat_expert_indices.argsort() +-+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++++ # token_idxs = idxs // self.num_experts_per_tok +-+++++++ +-+++++++ # # 找出有 token 的专家 +-+++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++++ +-+++++++ # for i in active_experts.tolist(): +-+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++++ # end_idx = tokens_per_expert[i] +-+++++++ # if start_idx == end_idx: # 没有 token +-+++++++ # continue +-+++++++ +-+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++++ # expert_tokens = x[exp_token_idx] +-+++++++ # expert_out = self.experts[i](expert_tokens) +-+++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++++ +-+++++++ # expert_cache = mindspore.mint.scatter_add( +-+++++++ # expert_cache, +-+++++++ # 0, +-+++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++++ # expert_out +-+++++++ # ) +-+++++++ +-+++++++ # return expert_cache +-+++++++ +-+++++++ +-++++++ +-++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-++++++ # """ +-++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++++ +-++++++ # Initialize weights and apply final processing +-++++++ self.post_init() +-+++++++ self.warm_up = False +-+++++++ +-+++++++ def warmup_moe_model_deep(self): +-+++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++++++ test_texts = [ +-+++++++ "warmup short", +-+++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++++++ ] +-+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++++ if tokenizer is None: +-+++++++ from mindnlp.transformers import AutoTokenizer +-+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++++ self._warmup_tokenizer = tokenizer +-+++++++ +-+++++++ for text in test_texts: +-+++++++ inputs = tokenizer(text, return_tensors="ms") +-+++++++ with mindspore._no_grad(): +-+++++++ _ = self(**inputs, use_cache=False) +-+++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-++++++ +-++++++ def get_input_embeddings(self): +-++++++ return self.model.embed_tokens +-++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++++ ```""" +-+++++++ if not self.warm_up: +-+++++++ self.warm_up = True +-+++++++ self.warmup_moe_model_deep() +-+++++++ +-++++++ output_attentions = ( +-++++++ output_attentions +-++++++ if output_attentions is not None +-++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++index 3cbf820e..d4c6b651 100644 +-++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++@@ -18,7 +18,6 @@ +-++++++ # See the License for the specific language governing permissions and +-++++++ # limitations under the License. +-++++++ """MindSpore Qwen2MoE model.""" +-++++++- +-++++++ import math +-++++++ from typing import List, Optional, Tuple, Union +-++++++ +-++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-++++++ TokenClassifierOutput, +-++++++ ) +-++++++ from ...modeling_utils import PreTrainedModel +-+++++++from ...generation import GenerationMixin +-++++++ from ....utils import logging +-++++++ from .configuration_qwen2_moe import Qwen2MoeConfig +-++++++ +-++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-++++++ self.variance_epsilon = eps +-++++++ +-++++++ def forward(self, hidden_states): +-+++++++ # @dwj +-+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++++ # @lwx +-+++++++ # if not self.training : +-+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-++++++ input_dtype = hidden_states.dtype +-++++++ hidden_states = hidden_states.to(mindspore.float32) +-++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-++++++@@ -234,6 +239,8 @@ def rotate_half(x): +-++++++ """Rotates half the hidden dims of the input.""" +-++++++ x1 = x[..., : x.shape[-1] // 2] +-++++++ x2 = x[..., x.shape[-1] // 2 :] +-+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-++++++ return ops.cat((-x2, x1), dim=-1) +-++++++ +-++++++ +-++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-++++++ self.config = config +-++++++ self.hidden_size = config.hidden_size +-++++++ self.intermediate_size = intermediate_size +-+++++++ +-++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-++++++ self.act_fn = ACT2FN[config.hidden_act] +-++++++ +-++++++ def forward(self, x): +-++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-++++++- +-++++++ +-+++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++++ # @lwx +-+++++++ # gate_up_output = self.gate_up_proj(x) +-+++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++++++ # return self.down_proj(swiglu_output) +-+++++++ +-+++++++ # def forward(self, x): +-+++++++ # gate_proj_out = self.gate_proj(x) +-+++++++ # up_proj_out = self.up_proj(x) +-+++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++++++ # return self.down_proj(swiglu_out) +-+++++++ +-++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv +-++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-++++++ """ +-++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-++++++ use_cache: bool = False, +-++++++ cache_position: Optional[mindspore.Tensor] = None, +-++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++++ +-+++++++ +-+++++++ +-++++++ bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++ query_states = self.q_proj(hidden_states) +-++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++ "with a layer index." +-++++++ ) +-++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++++ if isinstance(past_key_value, StaticCache): +-+++++++ kv_seq_len = key_states.shape[-2] +-+++++++ else: +-+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++ if past_key_value is not None: +-++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++++ +-+++++++ if isinstance(past_key_value, StaticCache): +-+++++++ kv_seq_len = key_states.shape[-2] +-++++++ +-++++++ # repeat k/v heads if n_kv_heads < n_heads +-++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++- +-+++++++ +-++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++++ +-++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-++++++- raise ValueError( +-++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-++++++- f" {attn_weights.shape}" +-++++++- ) +-++++++- +-++++++- if attention_mask is not None: # no matter the length, we just slice it +-++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++++++ if attention_mask is not None: +-+++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++++ attn_weights = attn_weights + causal_mask +-++++++ +-++++++ # upcast attention to fp32 +-++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++++ +-++++++ attn_output = self.o_proj(attn_output) +-++++++- +-+++++++ # @lwx +-+++++++ +-+++++++ # max_seq_len = self.max_position_embeddings # 2048 +-+++++++ +-+++++++ # if attention_mask is not None: +-+++++++ # # attention_mask: [B, 1, Sq, Sk] +-+++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++++ +-+++++++ # # pad 到 [max_seq_len, max_seq_len] +-+++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++++ # global_attention_mask = padded_mask +-+++++++ # else: +-+++++++ # global_attention_mask = None +-+++++++ +-+++++++ +-+++++++ # sparse_mode=3 +-+++++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++++ # query=query_states, +-+++++++ # key=key_states, +-+++++++ # value=value_states, +-+++++++ # real_shift=None, +-+++++++ # padding_mask=None, +-+++++++ +-+++++++ # head_num=self.num_heads, +-+++++++ # attn_mask=global_attention_mask, +-+++++++ # keep_prob=1.0 - self.attention_dropout, +-+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++++ # input_layout="BNSD", +-+++++++ # pre_tokens=2147483647, +-+++++++ # next_tokens=2147483647, +-+++++++ # inner_precise=0, +-+++++++ # drop_mask=None, +-+++++++ # prefix=None, +-+++++++ # actual_seq_qlen=None, +-+++++++ # actual_seq_kvlen=None, +-+++++++ # sparse_mode=sparse_mode, +-+++++++ # ) +-++++++ if not output_attentions: +-++++++ attn_weights = None +-++++++ +-++++++ return attn_output, attn_weights, past_key_value +-++++++ +-++++++ +-+++++++class Qwen2MoeFlashAttention(nn.Module): +-+++++++ """ +-+++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++++ +-+++++++ 关键改动: +-+++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++++ 直接传入原始的 key 和 value 张量效率更高。 +-+++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++++ """ +-+++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++++ super().__init__() +-+++++++ self.config = config +-+++++++ self.layer_idx = layer_idx +-+++++++ self.hidden_size = config.hidden_size +-+++++++ self.num_heads = config.num_attention_heads +-+++++++ self.head_dim = self.hidden_size // self.num_heads +-+++++++ self.num_key_value_heads = config.num_key_value_heads +-+++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++++ self.max_position_embeddings = config.max_position_embeddings +-+++++++ self.rope_theta = config.rope_theta +-+++++++ self.attention_dropout = config.attention_dropout +-+++++++ +-+++++++ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++++ raise ValueError( +-+++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++++ ) +-+++++++ +-+++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++++ +-+++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++++ self.head_dim, +-+++++++ max_position_embeddings=self.max_position_embeddings, +-+++++++ base=self.rope_theta, +-+++++++ ) +-+++++++ +-+++++++ def forward( +-+++++++ self, +-+++++++ hidden_states: mindspore.Tensor, +-+++++++ attention_mask: Optional[mindspore.Tensor] = None, +-+++++++ position_ids: Optional[mindspore.Tensor] = None, +-+++++++ past_key_value: Optional[Cache] = None, +-+++++++ output_attentions: bool = False, +-+++++++ use_cache: bool = False, +-+++++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++++ +-+++++++ bsz, q_len, _ = hidden_states.shape +-+++++++ +-+++++++ # 1. 线性投射 Q, K, V +-+++++++ query_states = self.q_proj(hidden_states) +-+++++++ key_states = self.k_proj(hidden_states) +-+++++++ value_states = self.v_proj(hidden_states) +-+++++++ +-+++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++++ # query: [B, S, H*D] -> [B, N1, S, D] +-+++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ +-+++++++ # 3. RoPE 旋转位置编码 +-+++++++ kv_seq_len = key_states.shape[-2] +-+++++++ if past_key_value is not None: +-+++++++ if self.layer_idx is None: +-+++++++ raise ValueError( +-+++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++++ "with a layer index." +-+++++++ ) +-+++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++++ if cache_position.shape[0] == 1: +-+++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++++ kv_seq_len = past_seen_tokens + 1 +-+++++++ else: +-+++++++ # prefill 阶段:cache_position 是范围,使用其长度 +-+++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++++ else: +-+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++++ +-+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++++ +-+++++++ # 4. KV 缓存更新 +-+++++++ if past_key_value is not None: +-+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++++ key_states, value_states = past_key_value.update( +-+++++++ key_states, value_states, self.layer_idx, cache_kwargs +-+++++++ ) +-+++++++ +-+++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++++ if cache_position.shape[0] == 1: +-+++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++++ kv_seq_len = key_states.shape[-2] +-+++++++ +-+++++++ # 5. [重要] 准备 Attention Mask +-+++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++++ fa_attention_mask = None +-+++++++ if attention_mask is not None: +-+++++++ # 截取与当前key长度匹配的部分 +-+++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++++ fa_attention_mask = (mask_slice != 0) +-+++++++ +-+++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++++ input_dtype = query_states.dtype +-+++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++++ query_states = query_states.to(mindspore.float16) +-+++++++ key_states = key_states.to(mindspore.float16) +-+++++++ value_states = value_states.to(mindspore.float16) +-+++++++ +-+++++++ # 6. [核心] 调用 flash_attention_score 算子 +-+++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++++ attn_output = mindspore.ops.flash_attention_score( +-+++++++ query=query_states, +-+++++++ key=key_states, +-+++++++ value=value_states, +-+++++++ head_num=self.num_heads, # 传入Q的头数(N1) +-+++++++ attn_mask=fa_attention_mask, +-+++++++ keep_prob=1.0 - self.attention_dropout, +-+++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++++ input_layout="BNSD", +-+++++++ sparse_mode=0 # 使用 defaultMask 模式 +-+++++++ ) +-+++++++ +-+++++++ # 恢复原始数据类型 +-+++++++ attn_output = attn_output.to(input_dtype) +-+++++++ +-+++++++ # 7. 调整输出形状 +-+++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++++ attn_output = self.o_proj(attn_output) +-+++++++ +-+++++++ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++++ attn_weights = None +-+++++++ if output_attentions: +-+++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++++ +-+++++++ return attn_output, attn_weights, past_key_value +-+++++++ +-+++++++ # def forward( +-+++++++ # self, +-+++++++ # hidden_states: mindspore.Tensor, +-+++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++++ # past_key_value: Optional[Cache] = None, +-+++++++ # output_attentions: bool = False, +-+++++++ # use_cache: bool = False, +-+++++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++++ +-+++++++ # bsz, q_len, _ = hidden_states.shape +-+++++++ +-+++++++ # # 1. 线性投射 Q, K, V +-+++++++ # query_states = self.q_proj(hidden_states) +-+++++++ # key_states = self.k_proj(hidden_states) +-+++++++ # value_states = self.v_proj(hidden_states) +-+++++++ +-+++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ +-+++++++ # # 3. RoPE 旋转位置编码 +-+++++++ # kv_seq_len = key_states.shape[-2] +-+++++++ # if past_key_value is not None: +-+++++++ # if self.layer_idx is None: +-+++++++ # raise ValueError( +-+++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++++ # "with a layer index." +-+++++++ # ) +-+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++++ +-+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++++ +-+++++++ # # 4. KV 缓存更新 +-+++++++ # if past_key_value is not None: +-+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++++ # key_states, value_states = past_key_value.update( +-+++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++++ # ) +-+++++++ +-+++++++ # # 5. 准备 Attention Mask +-+++++++ # fa_attention_mask = None +-+++++++ # if attention_mask is not None: +-+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++++ # fa_attention_mask = (mask_slice != 0) +-+++++++ +-+++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++++ # input_dtype = query_states.dtype +-+++++++ +-+++++++ # # 6. [核心] 调用 flash_attention_score 算子 +-+++++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++++ # query=query_states, +-+++++++ # key=key_states, +-+++++++ # value=value_states, +-+++++++ # head_num=self.num_heads, +-+++++++ # attn_mask=fa_attention_mask, +-+++++++ # keep_prob=1.0 - self.attention_dropout, +-+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++++ # input_layout="BNSD", +-+++++++ # sparse_mode=0, +-+++++++ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++++ # inner_precise=1 +-+++++++ # ) +-+++++++ +-+++++++ # # 恢复原始数据类型 +-+++++++ # attn_output = attn_output.to(input_dtype) +-+++++++ +-+++++++ # # 7. 调整输出形状 +-+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++++ # attn_output = self.o_proj(attn_output) +-+++++++ +-+++++++ # attn_weights = None +-+++++++ # if output_attentions: +-+++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++++ +-+++++++ # return attn_output, attn_weights, past_key_value +-+++++++ +-+++++++ # def forward( +-+++++++ # self, +-+++++++ # hidden_states: mindspore.Tensor, +-+++++++ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++++ # position_ids: Optional[mindspore.Tensor] = None, +-+++++++ # past_key_value: Optional[Cache] = None, +-+++++++ # output_attentions: bool = False, +-+++++++ # use_cache: bool = False, +-+++++++ # cache_position: Optional[mindspore.Tensor] = None, +-+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++++ +-+++++++ # bsz, q_len, _ = hidden_states.shape +-+++++++ +-+++++++ # query_states = self.q_proj(hidden_states) +-+++++++ # key_states = self.k_proj(hidden_states) +-+++++++ # value_states = self.v_proj(hidden_states) +-+++++++ +-+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++++ +-+++++++ # kv_seq_len = key_states.shape[-2] +-+++++++ # if past_key_value is not None: +-+++++++ # if self.layer_idx is None: +-+++++++ # raise ValueError("`layer_idx` must be specified for caching") +-+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++++ +-+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++++ +-+++++++ # if past_key_value is not None: +-+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++++ # key_states, value_states = past_key_value.update( +-+++++++ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++++ # ) +-+++++++ +-+++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++++ +-+++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++++++ # query_states = query_states / math.sqrt(self.head_dim) +-+++++++ # # <--- 修改结束 --- +-+++++++ +-+++++++ # fa_attention_mask = None +-+++++++ # if attention_mask is not None: +-+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++++ # fa_attention_mask = (mask_slice != 0) +-+++++++ +-+++++++ # input_dtype = query_states.dtype +-+++++++ +-+++++++ # attn_output = mindspore.ops.flash_attention_score( +-+++++++ # query=query_states, # 传入已经预先缩放过的 query +-+++++++ # key=key_states, +-+++++++ # value=value_states, +-+++++++ # head_num=self.num_heads, +-+++++++ # attn_mask=fa_attention_mask, +-+++++++ # keep_prob=1.0 - self.attention_dropout, +-+++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++++++ # input_layout="BNSD", +-+++++++ # sparse_mode=0, +-+++++++ # inner_precise=1 # 仍然保持内部高精度计算 +-+++++++ # ) +-+++++++ +-+++++++ # attn_output = attn_output.to(input_dtype) +-+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++++ # attn_output = self.o_proj(attn_output) +-+++++++ +-+++++++ # attn_weights = None +-+++++++ # if output_attentions: +-+++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++++++ +-+++++++ # return attn_output, attn_weights, past_key_value +-+++++++ +-++++++ QWEN2MOE_ATTENTION_CLASSES = { +-++++++ "eager": Qwen2MoeAttention, +-+++++++ "flash-attention": Qwen2MoeFlashAttention, +-++++++ } +-++++++ +-++++++ +-++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-++++++ +-+++++++ #@dwj +-+++++++ # 只遍历激活的专家,而非全部专家 +-++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape +-++++++- hidden_states = hidden_states.view(-1, hidden_dim) +-++++++- # router_logits: (batch * sequence_length, n_experts) +-++++++- router_logits = self.gate(hidden_states) +-++++++- +-++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-++++++- if self.norm_topk_prob: +-++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-++++++- # we cast back to the input dtype +-++++++- routing_weights = routing_weights.to(hidden_states.dtype) +-++++++- +-++++++- final_hidden_states = ops.zeros( +-++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-++++++- ) +-++++++- +-++++++- # One hot encode the selected experts to create an expert mask +-++++++- # this will be used to easily index which expert is going to be sollicitated +-++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-++++++- +-++++++- # Loop over all available experts in the model and perform the computation on each expert +-++++++- for expert_idx in range(self.num_experts): +-++++++- expert_layer = self.experts[expert_idx] +-++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-++++++- +-++++++- # Index the correct hidden states and compute the expert hidden state for +-++++++- # the current expert. We need to make sure to multiply the output hidden +-++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-++++++- if 0 not in idx.shape: +-++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-++++++- +-++++++- # However `index_add_` only support torch tensors for indexing so we'll use +-++++++- # the `top_x` tensor here. +-++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-++++++- +-++++++- shared_expert_output = self.shared_expert(hidden_states) +-++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-++++++- +-++++++- final_hidden_states = final_hidden_states + shared_expert_output +-+++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++++ num_tokens = hidden_states_reshaped.shape[0] +-+++++++ +-+++++++ router_logits = self.gate(hidden_states_reshaped) +-+++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++++ +-+++++++ if self.norm_topk_prob: +-+++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++++ routing_weights = routing_weights.to(hidden_states.dtype) +-+++++++ +-+++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++++ flat_selected_experts = selected_experts.flatten() +-+++++++ +-+++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++++ token_indices = broadcasted_token_indices.flatten() +-+++++++ +-+++++++ active_experts = ops.unique(flat_selected_experts) +-+++++++ +-+++++++ for expert_idx_tensor in active_experts: +-+++++++ expert_idx = expert_idx_tensor.item() +-+++++++ expert_layer = self.experts[expert_idx] +-+++++++ +-+++++++ mask = (flat_selected_experts == expert_idx_tensor) +-+++++++ selected_token_indices = token_indices[mask] +-+++++++ selected_routing_weights = routing_weights.flatten()[mask] +-+++++++ +-+++++++ current_states = hidden_states_reshaped[selected_token_indices] +-+++++++ +-+++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++++ +-+++++++ final_hidden_states = final_hidden_states.index_add( +-+++++++ dim=0, +-+++++++ index=selected_token_indices, +-+++++++ source=expert_output.to(hidden_states.dtype) +-+++++++ ) +-+++++++ +-+++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-++++++ +-++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-++++++- return final_hidden_states, router_logits +-+++++++ final_hidden_states = final_hidden_states + shared_expert_output +-+++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++++ +-+++++++ return final_hidden_states, router_logits +-++++++ +-++++++ +-++++++ class Qwen2MoeDecoderLayer(nn.Module): +-++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-++++++ +-++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-++++++ +-+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++++ +-++++++ if (layer_idx not in config.mlp_only_layers) and ( +-++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-++++++ ): +-++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] +-++++++ _skip_keys_device_placement = "past_key_values" +-++++++ _supports_cache_class = True +-+++++++#lwx +-+++++++ # _supports_static_cache = True +-++++++ +-++++++ def _init_weights(self, module): +-++++++ std = self.config.initializer_range +-++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-++++++ return causal_mask +-++++++ +-++++++ +-++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-++++++ _tied_weights_keys = ["lm_head.weight"] +-++++++ +-++++++ def __init__(self, config): +-++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++++ self.num_experts_per_tok = config.num_experts_per_tok +-++++++ # Initialize weights and apply final processing +-++++++ self.post_init() +-+++++++ # @lwx +-+++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++++++ # self.generation_config.cache_implementation = "static" +-+++++++ self._warmed_up = False +-+++++++ +-+++++++ def warmup_moe_model(self): +-+++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++++++ test_texts = [ +-+++++++ "warmup short", +-+++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++++++ ] +-+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++++ if tokenizer is None: +-+++++++ from mindnlp.transformers import AutoTokenizer +-+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++++ self._warmup_tokenizer = tokenizer +-+++++++ +-+++++++ for text in test_texts: +-+++++++ inputs = tokenizer(text, return_tensors="ms") +-+++++++ with mindspore._no_grad(): +-+++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") +-++++++ +-++++++ def get_input_embeddings(self): +-++++++ return self.model.embed_tokens +-++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-++++++ ```""" +-+++++++ if not self._warmed_up: +-+++++++ self._warmed_up = True +-+++++++ self.warmup_moe_model() +-++++++ +-++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-++++++ output_router_logits = ( +-++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-++++++ } +-++++++ ) +-++++++ return model_inputs +-+++++++# @lwx +-+++++++ # def _decode_one_tokens_logits( +-+++++++ # self, +-+++++++ # cur_token: mindspore.Tensor, +-+++++++ # input_pos: Optional[mindspore.Tensor], +-+++++++ # cache_position: mindspore.Tensor, +-+++++++ # past_key_values: StaticCache, +-+++++++ # ) -> mindspore.Tensor: +-+++++++ # """ +-+++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++++++ +-+++++++ # Args: +-+++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++++++ # input_pos: 输入位置信息,可选 +-+++++++ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++++++ +-+++++++ # Returns: +-+++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++++++ # """ +-+++++++ # # 调用JIT编译的版本 +-+++++++ # return self.get_decode_one_tokens_logits( +-+++++++ # cur_token=cur_token, +-+++++++ # input_pos=input_pos, +-+++++++ # cache_position=cache_position, +-+++++++ # past_key_values=past_key_values, +-+++++++ # ) +-+++++++ +-+++++++ # @mindspore.jit(jit_level='O1') +-+++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++++++ # """ +-+++++++ # JIT编译的函数,用于高效的单token解码 +-+++++++ # 使用JIT编译优化以支持静态shape和高效执行 +-+++++++ +-+++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++++++ # """ +-+++++++ # outputs = self.model.forward( +-+++++++ # input_ids=cur_token, +-+++++++ # position_ids=input_pos, +-+++++++ # cache_position=cache_position, +-+++++++ # past_key_values=past_key_values, +-+++++++ # use_cache=True, +-+++++++ # return_dict=False, +-+++++++ # ) +-+++++++ +-+++++++ # hidden_states = outputs[0] +-+++++++ # logits = self.lm_head.forward(hidden_states) +-+++++++ # logits = logits.float() +-+++++++ +-+++++++ # return logits[:, -1, :] +-+++++++ +-+++++++ # def _sample( +-+++++++ # self, +-+++++++ # input_ids: mindspore.Tensor, +-+++++++ # logits_processor, +-+++++++ # stopping_criteria, +-+++++++ # generation_config, +-+++++++ # synced_devices: bool, +-+++++++ # streamer=None, +-+++++++ # logits_warper=None, +-+++++++ # **model_kwargs, +-+++++++ # ): +-+++++++ # """ +-+++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++++++ # """ +-+++++++ # from ...generation.logits_process import LogitsProcessorList +-+++++++ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++++++ # from mindnlp.core import nn, ops, no_grad +-+++++++ # import numpy as np +-+++++++ +-+++++++ # # 检查是否使用 StaticCache +-+++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++++++ # # 否则,直接调用父类方法 +-+++++++ # past_key_values = model_kwargs.get("past_key_values") +-+++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++++++ +-+++++++ # if not isinstance(past_key_values, StaticCache): +-+++++++ # # 不使用 StaticCache,直接调用父类方法 +-+++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++++++ # return super()._sample( +-+++++++ # input_ids=input_ids, +-+++++++ # logits_processor=logits_processor, +-+++++++ # stopping_criteria=stopping_criteria, +-+++++++ # generation_config=generation_config, +-+++++++ # synced_devices=synced_devices, +-+++++++ # streamer=streamer, +-+++++++ # logits_warper=logits_warper, +-+++++++ # **model_kwargs, +-+++++++ # ) +-+++++++ +-+++++++ # # 使用 StaticCache,进入自定义循环 +-+++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++++++ # pad_token_id = generation_config._pad_token_tensor +-+++++++ # output_attentions = generation_config.output_attentions +-+++++++ # output_hidden_states = generation_config.output_hidden_states +-+++++++ # output_scores = generation_config.output_scores +-+++++++ # output_logits = generation_config.output_logits +-+++++++ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++++++ # max_length = generation_config.max_length +-+++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++++++ # do_sample = generation_config.do_sample +-+++++++ +-+++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++++++ # raise ValueError( +-+++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++++++ # f"{logits_warper})." +-+++++++ # ) +-+++++++ +-+++++++ # # init attention / hidden states / scores tuples +-+++++++ # scores = () if (return_dict_in_generate and output_scores) else None +-+++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++++++ +-+++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++++++ # encoder_hidden_states = ( +-+++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++++++ # ) +-+++++++ +-+++++++ # # keep track of which sequences are already finished +-+++++++ # batch_size, cur_len = input_ids.shape +-+++++++ # this_peer_finished = False +-+++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++++++ +-+++++++ # time_record = [] +-+++++++ # from ....utils.testing_utils import parse_flag_from_env +-+++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++++++ +-+++++++ # while self._has_unfinished_sequences( +-+++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++++++ # ): +-+++++++ # if _record_time: +-+++++++ # import time as time_module +-+++++++ # infer_start = time_module.time() +-+++++++ +-+++++++ # # prepare model inputs +-+++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++++++ +-+++++++ # # prepare variable output controls +-+++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++++++ +-+++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++++++ # cur_cache_position = model_inputs.get("cache_position") +-+++++++ # cur_past_key_values = model_inputs.get("past_key_values") +-+++++++ # cur_input_ids = model_inputs.get("input_ids") +-+++++++ +-+++++++ # if (isinstance(cur_past_key_values, StaticCache) and +-+++++++ # cur_cache_position is not None and +-+++++++ # len(cur_cache_position.shape) > 0 and +-+++++++ # cur_cache_position.shape[0] == 1 and +-+++++++ # cur_input_ids is not None and +-+++++++ # cur_input_ids.shape[1] == 1): +-+++++++ # # 使用 JIT 优化的单 token 解码 +-+++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++++++ # if not hasattr(self, '_jit_used'): +-+++++++ # self._jit_used = False +-+++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++++++ +-+++++++ # next_token_logits = self.get_decode_one_tokens_logits( +-+++++++ # cur_token=cur_input_ids, +-+++++++ # input_pos=model_inputs.get("position_ids"), +-+++++++ # cache_position=cur_cache_position, +-+++++++ # past_key_values=cur_past_key_values, +-+++++++ # ) +-+++++++ +-+++++++ # # 标记已使用JIT(用于后续判断) +-+++++++ # if not self._jit_used: +-+++++++ # self._jit_used = True +-+++++++ +-+++++++ # # 构造兼容的输出对象 +-+++++++ # class JitOptimizedOutput: +-+++++++ # def __init__(self, logits, config): +-+++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++++++ # self.config = config +-+++++++ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++++++ # self.attentions = None if not config.is_encoder_decoder else None +-+++++++ # self.cross_attentions = None +-+++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++++++ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++++++ +-+++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++++++ # else: +-+++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++++++ # outputs = self(**model_inputs, return_dict=True) +-+++++++ +-+++++++ # if synced_devices and this_peer_finished: +-+++++++ # continue +-+++++++ +-+++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++++++ # next_token_logits = outputs.logits[:, -1, :] +-+++++++ +-+++++++ # # pre-process distribution +-+++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++++++ # if do_sample: +-+++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++++++ +-+++++++ # # Store scores, attentions and hidden_states when required +-+++++++ # if return_dict_in_generate: +-+++++++ # if output_scores: +-+++++++ # scores += (next_token_scores,) +-+++++++ # if output_logits: +-+++++++ # raw_logits += (next_token_logits,) +-+++++++ # if output_attentions: +-+++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++++++ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++++++ # if self.config.is_encoder_decoder: +-+++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++++++ +-+++++++ # if output_hidden_states: +-+++++++ # hidden = ( +-+++++++ # outputs.decoder_hidden_states +-+++++++ # if self.config.is_encoder_decoder +-+++++++ # else outputs.hidden_states +-+++++++ # ) +-+++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++++++ +-+++++++ # # token selection +-+++++++ # if do_sample: +-+++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++++++ # else: +-+++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++++++ +-+++++++ # # finished sentences should have their next token be a padding token +-+++++++ # if has_eos_stopping_criteria: +-+++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++++++ +-+++++++ # # update generated ids, model inputs, and length for next step +-+++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++++++ # if streamer is not None: +-+++++++ # streamer.put(next_tokens) +-+++++++ +-+++++++ # model_kwargs = self._update_model_kwargs_for_generation( +-+++++++ # outputs, +-+++++++ # model_kwargs, +-+++++++ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++++++ # ) +-+++++++ +-+++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++++++ # cur_len += 1 +-+++++++ +-+++++++ # if _record_time: +-+++++++ # import time as time_module +-+++++++ # infer_stop = time_module.time() +-+++++++ # time_record.append(infer_stop - infer_start) +-+++++++ +-+++++++ # del outputs +-+++++++ +-+++++++ # average_infer_time = None +-+++++++ # if time_record: +-+++++++ # if len(time_record) > 1: +-+++++++ # time_record.pop(0) +-+++++++ # average_infer_time = sum(time_record) / len(time_record) +-+++++++ # print(f'average inference time is: {average_infer_time}') +-+++++++ # print(f'inference time record: {time_record}') +-+++++++ +-+++++++ # if streamer is not None: +-+++++++ # streamer.end() +-+++++++ +-+++++++ # # 简单判断:打印是否使用了JIT路径 +-+++++++ # if hasattr(self, '_jit_used') and self._jit_used: +-+++++++ # print("[JIT] ✓ JIT optimization was used during generation") +-+++++++ # else: +-+++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++++++ +-+++++++ # if return_dict_in_generate: +-+++++++ # if self.config.is_encoder_decoder: +-+++++++ # return GenerateEncoderDecoderOutput( +-+++++++ # sequences=input_ids, +-+++++++ # scores=scores, +-+++++++ # logits=raw_logits, +-+++++++ # encoder_attentions=encoder_attentions, +-+++++++ # encoder_hidden_states=encoder_hidden_states, +-+++++++ # decoder_attentions=decoder_attentions, +-+++++++ # cross_attentions=cross_attentions, +-+++++++ # decoder_hidden_states=decoder_hidden_states, +-+++++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++++ # average_infer_time=average_infer_time +-+++++++ # ) +-+++++++ # else: +-+++++++ # return GenerateDecoderOnlyOutput( +-+++++++ # sequences=input_ids, +-+++++++ # scores=scores, +-+++++++ # logits=raw_logits, +-+++++++ # attentions=decoder_attentions, +-+++++++ # hidden_states=decoder_hidden_states, +-+++++++ # past_key_values=model_kwargs.get("past_key_values"), +-+++++++ # average_infer_time=average_infer_time +-+++++++ # ) +-+++++++ # else: +-+++++++ # return input_ids +-+++++++ +-+++++++ # def _prepare_cache_for_generation( +-+++++++ # self, +-+++++++ # generation_config, +-+++++++ # model_kwargs, +-+++++++ # assistant_model, +-+++++++ # batch_size, +-+++++++ # max_cache_length, +-+++++++ # ): +-+++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++++++ # generation_config.cache_implementation = "static" +-+++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++++++ +-+++++++ # if generation_config.cache_implementation == "static": +-+++++++ # base_required_from_max_length = generation_config.max_length + 1 +-+++++++ # base_required = max(max_cache_length, base_required_from_max_length) +-+++++++ # min_cache_size = 50 +-+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++++++ # else: +-+++++++ # max_cache_length = max(base_required, min_cache_size) +-+++++++ +-+++++++ # original_max_cache_length = max_cache_length +-+++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++++++ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++++++ # print(f" - final max_cache_length: {max_cache_length}") +-+++++++ +-+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++++ # if max_cache_length > self.config.max_position_embeddings: +-+++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++++++ +-+++++++ # result = super()._prepare_cache_for_generation( +-+++++++ # generation_config=generation_config, +-+++++++ # model_kwargs=model_kwargs, +-+++++++ # assistant_model=assistant_model, +-+++++++ # batch_size=batch_size, +-+++++++ # max_cache_length=max_cache_length, +-+++++++ # ) +-+++++++ +-+++++++ # if generation_config.cache_implementation == "static": +-+++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++++++ # created_cache = model_kwargs.get(cache_name) +-+++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++++++ # if created_cache.max_cache_len < generation_config.max_length: +-+++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++++++ +-+++++++ # return result +-+++++++ +-+++++++ +-+++++++ +-++++++ +-++++++ +-++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-++++++-- +-++++++2.27.0 +-++++++ +-+++++-- +-+++++2.27.0 +-+++++ +-++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch +-++++new file mode 100644 +-++++index 00000000..966529e4 +-++++--- /dev/null +-+++++++ b/patches/0003-20261106secondcommit.patch +-++++@@ -0,0 +1,2769 @@ +-+++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 +-+++++From: Pinoeer-kingxi <13022943007@163.com> +-+++++Date: Thu, 6 Nov 2025 14:54:37 +0800 +-+++++Subject: [PATCH 3/3] 20261106secondcommit +-+++++ +-+++++--- +-+++++ .../models/deepseek/modeling_deepseek.py | 217 ++- +-+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- +-+++++ patches/0001-20251104commit.patch | 1272 ----------------- +-+++++ 3 files changed, 528 insertions(+), 2032 deletions(-) +-+++++ delete mode 100644 patches/0001-20251104commit.patch +-+++++ +-+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++index 73773c22..2f9192bf 100644 +-+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) +-+++++ +-+++++ _CONFIG_FOR_DOC = "DeepseekConfig" +-+++++ +-++++++_attn_mask_cache = {} +-++++++ +-++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): +-++++++ q_len = batch_and_seq[1] +-++++++ kv_len = batch_and_seq[1] + past_key_values_length +-++++++ key = (batch_and_seq[0], q_len, kv_len) +-++++++ +-++++++ if key in _attn_mask_cache: +-++++++ return _attn_mask_cache[key] +-++++++ +-++++++ mask = _prepare_4d_causal_attention_mask( +-++++++ attention_mask, +-++++++ batch_and_seq, +-++++++ inputs_embeds, +-++++++ past_key_values_length, +-++++++ ) +-++++++ _attn_mask_cache[key] = mask +-++++++ return mask +-+++++ +-+++++ def _get_unpad_data(attention_mask): +-+++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) +-+++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): +-+++++ return final_output +-+++++ +-+++++ +-+++++- @no_grad() +-+++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++- expert_cache = ops.zeros_like(x) +-+++++- idxs = flat_expert_indices.argsort() +-+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++- token_idxs = idxs // self.num_experts_per_tok +-+++++- +-+++++- for i, end_idx in enumerate(tokens_per_expert): +-+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++- if start_idx == end_idx: +-+++++- continue +-+++++- expert = self.experts[i] +-+++++- exp_token_idx = token_idxs[start_idx:end_idx] +-+++++- expert_tokens = x[exp_token_idx] +-+++++- expert_out = expert(expert_tokens) +-+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++- +-+++++- return expert_cache +-+++++- +-+++++ # @no_grad() +-+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++- # # expert_cache = torch.zeros_like(x) +-+++++- # # idxs = flat_expert_indices.argsort() +-+++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++- # # token_idxs = idxs // self.num_experts_per_tok +-+++++- # # for i, end_idx in enumerate(tokens_per_expert): +-+++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++- # # if start_idx == end_idx: +-+++++- # # continue +-+++++- # # expert = self.experts[i] +-+++++- # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++- # # expert_tokens = x[exp_token_idx] +-+++++- # # expert_out = expert(expert_tokens) +-+++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++- # # return expert_cache +-++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++ # expert_cache = ops.zeros_like(x) +-+++++ # idxs = flat_expert_indices.argsort() +-+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): +-+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++ +-+++++ # return expert_cache +-+++++- # @no_grad() +-+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++- # expert_cache = ops.zeros_like(x) +-++++++ +-++++++ @no_grad() +-++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-++++++ """ +-++++++ 优化版 MoE prefill: +-++++++ - 批量张量化处理同一个 expert 的所有 token +-++++++ - 跳过无 token 的专家 +-++++++ - 保持结果完全一致 +-++++++ """ +-++++++ # 初始化输出缓存 +-++++++ expert_cache = ops.zeros_like(x) +-+++++ +-+++++- # # 排序保证顺序一致 +-+++++- # idxs = flat_expert_indices.argsort() +-+++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++- # token_idxs = idxs // self.num_experts_per_tok +-++++++ # 排序(确保 scatter_add 位置对应原逻辑) +-++++++ idxs = flat_expert_indices.argsort() +-++++++ sorted_expert_indices = flat_expert_indices[idxs] +-++++++ sorted_token_indices = idxs // self.num_experts_per_tok +-+++++ +-+++++- # # 找出有 token 的专家 +-+++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++++ # 每个 expert 的 token 数 +-++++++ tokens_per_expert = sorted_expert_indices.bincount() +-+++++ +-+++++- # for i in active_experts.tolist(): +-+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++- # end_idx = tokens_per_expert[i] +-+++++- # if start_idx == end_idx: # 没有 token +-+++++- # continue +-++++++ # 找出有 token 的专家 +-++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-+++++ +-+++++- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++- # expert_tokens = x[exp_token_idx] +-+++++- # expert_out = self.experts[i](expert_tokens) +-+++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++++ for expert_id in active_experts.tolist(): +-++++++ # 取该 expert 对应的排序后 token 区间 +-++++++ start = (tokens_per_expert[:expert_id]).sum().item() +-++++++ end = start + tokens_per_expert[expert_id].item() +-+++++ +-+++++- # expert_cache = mindspore.mint.scatter_add( +-+++++- # expert_cache, +-+++++- # 0, +-+++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++- # expert_out +-+++++- # ) +-++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 +-++++++ expert_tokens = x[token_idx] # 取输入向量 +-+++++ +-+++++- # return expert_cache +-++++++ # 执行专家 MLP +-++++++ expert_out = self.experts[expert_id](expert_tokens) +-++++++ +-++++++ # 按权重缩放 +-++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] +-++++++ +-++++++ # 回写到缓存(等价 scatter_add) +-++++++ expert_cache = mindspore.mint.scatter_add( +-++++++ expert_cache, +-++++++ 0, +-++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++++ scaled_out +-++++++ ) +-++++++ +-++++++ return expert_cache +-++++++ +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # # expert_cache = torch.zeros_like(x) +-++++++ # # idxs = flat_expert_indices.argsort() +-++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-++++++ # # token_idxs = idxs // self.num_experts_per_tok +-++++++ # # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-++++++ # # if start_idx == end_idx: +-++++++ # # continue +-++++++ # # expert = self.experts[i] +-++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # # expert_tokens = x[exp_token_idx] +-++++++ # # expert_out = expert(expert_tokens) +-++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-++++++ # # return expert_cache +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # for i, end_idx in enumerate(tokens_per_expert): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # if start_idx == end_idx: +-++++++ # continue +-++++++ # expert = self.experts[i] +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = expert(expert_tokens) +-++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-++++++ +-++++++ # return expert_cache +-++++++ # @no_grad() +-++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-++++++ # expert_cache = ops.zeros_like(x) +-++++++ +-++++++ # # 排序保证顺序一致 +-++++++ # idxs = flat_expert_indices.argsort() +-++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-++++++ # token_idxs = idxs // self.num_experts_per_tok +-++++++ +-++++++ # # 找出有 token 的专家 +-++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-++++++ +-++++++ # for i in active_experts.tolist(): +-++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-++++++ # end_idx = tokens_per_expert[i] +-++++++ # if start_idx == end_idx: # 没有 token +-++++++ # continue +-++++++ +-++++++ # exp_token_idx = token_idxs[start_idx:end_idx] +-++++++ # expert_tokens = x[exp_token_idx] +-++++++ # expert_out = self.experts[i](expert_tokens) +-++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-++++++ +-++++++ # expert_cache = mindspore.mint.scatter_add( +-++++++ # expert_cache, +-++++++ # 0, +-++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-++++++ # expert_out +-++++++ # ) +-++++++ +-++++++ # return expert_cache +-+++++ +-+++++ +-+++++ +-+++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++- +-+++++ # class DeepseekFlashAttention(nn.Module): +-+++++ # """ +-+++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using +-+++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): +-+++++ +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-++++++ +-+++++ Deepseek_ATTENTION_CLASSES = { +-+++++ "eager": DeepseekAttention, +-+++++ "flash-attention": DeepseekFlashAttention, +-+++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): +-+++++ ) +-+++++ else: +-+++++ # 4d mask is passed through the layers +-+++++- attention_mask = _prepare_4d_causal_attention_mask( +-++++++ # attention_mask = _prepare_4d_causal_attention_mask( +-++++++ # attention_mask, +-++++++ # (batch_size, seq_length), +-++++++ # inputs_embeds, +-++++++ # past_key_values_length, +-++++++ # ) +-++++++ #@dwj +-++++++ attention_mask = get_cached_causal_mask( +-+++++ attention_mask, +-+++++ (batch_size, seq_length), +-+++++ inputs_embeds, +-+++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++ # Initialize weights and apply final processing +-+++++ self.post_init() +-+++++ self.warm_up = False +-++++++ #@dwj +-++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( +-++++++ self.num_layers, +-++++++ self.num_attention_heads, +-++++++ self.head_dim, +-++++++ batch_size=1, +-++++++ max_length=self.max_length, +-++++++ dtype=mindspore.float16 +-++++++ ) +-++++++ +-++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): +-++++++ key_cache = [] +-++++++ value_cache = [] +-++++++ for _ in range(num_layers): +-++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) +-++++++ key_cache.append(k) +-++++++ value_cache.append(v) +-++++++ return key_cache, value_cache +-++++++ +-+++++ +-+++++ def warmup_moe_model_deep(self): +-+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++index bced285c..ebd7782e 100644 +-+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) +-+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" +-+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" +-+++++ +-+++++-Long_Prompt = False +-+++++-PROMPT_LENGTH_THRESHOLD = 128 +-++++++Long_Prompt = 1 +-++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 +-++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 +-++++++ +-++++++_causal_mask_cache = {} +-++++++ +-++++++def get_cached_causal_mask_with_cache_position( +-++++++ attention_mask: mindspore.Tensor, +-++++++ sequence_length: int, +-++++++ target_length: int, +-++++++ dtype: mindspore.dtype, +-++++++ min_dtype: float, +-++++++ cache_position: mindspore.Tensor, +-++++++ batch_size: int, +-++++++): +-++++++ """ +-++++++ 带缓存的 causal mask 构造函数 +-++++++ """ +-++++++ # q_len 是当前 query 长度 +-++++++ q_len = sequence_length +-++++++ # kv_len 是 target_length +-++++++ kv_len = target_length +-++++++ +-++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 +-++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) +-++++++ +-++++++ if key in _causal_mask_cache: +-++++++ return _causal_mask_cache[key] +-++++++ +-++++++ # 调用原来的 mask 构造逻辑 +-++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++++ attention_mask, +-++++++ sequence_length=sequence_length, +-++++++ target_length=target_length, +-++++++ dtype=dtype, +-++++++ min_dtype=min_dtype, +-++++++ cache_position=cache_position, +-++++++ batch_size=batch_size, +-++++++ ) +-++++++ # 缓存结果 +-++++++ _causal_mask_cache[key] = causal_mask +-++++++ return causal_mask +-+++++ +-+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position +-+++++ def _prepare_4d_causal_attention_mask_with_cache_position( +-+++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++++ +-+++++ +-+++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe +-++++++# class Qwen2MoeAttention(nn.Module): +-++++++# """ +-++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-++++++# and "Generating Long Sequences with Sparse Transformers". +-++++++# """ +-++++++ +-++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-++++++# super().__init__() +-++++++# self.config = config +-++++++# self.layer_idx = layer_idx +-++++++# if layer_idx is None: +-++++++# logger.warning_once( +-++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++++# "when creating this class." +-++++++# ) +-++++++ +-++++++# self.hidden_size = config.hidden_size +-++++++# self.num_heads = config.num_attention_heads +-++++++# self.head_dim = self.hidden_size // self.num_heads +-++++++# self.num_key_value_heads = config.num_key_value_heads +-++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-++++++# self.max_position_embeddings = config.max_position_embeddings +-++++++# self.rope_theta = config.rope_theta +-++++++# self.is_causal = True +-++++++# self.attention_dropout = config.attention_dropout +-++++++ +-++++++# if (self.head_dim * self.num_heads) != self.hidden_size: +-++++++# raise ValueError( +-++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" +-++++++# f" and `num_heads`: {self.num_heads})." +-++++++# ) +-++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-++++++ +-++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( +-++++++# self.head_dim, +-++++++# max_position_embeddings=self.max_position_embeddings, +-++++++# base=self.rope_theta, +-++++++# ) +-++++++ +-++++++# def forward( +-++++++# self, +-++++++# hidden_states: mindspore.Tensor, +-++++++# attention_mask: Optional[mindspore.Tensor] = None, +-++++++# position_ids: Optional[mindspore.Tensor] = None, +-++++++# past_key_value: Optional[Cache] = None, +-++++++# output_attentions: bool = False, +-++++++# use_cache: bool = False, +-++++++# cache_position: Optional[mindspore.Tensor] = None, +-++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-++++++ +-++++++ +-++++++ +-++++++# bsz, q_len, _ = hidden_states.shape +-++++++ +-++++++# query_states = self.q_proj(hidden_states) +-++++++# key_states = self.k_proj(hidden_states) +-++++++# value_states = self.v_proj(hidden_states) +-++++++ +-++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-++++++ +-++++++# kv_seq_len = key_states.shape[-2] +-++++++# if past_key_value is not None: +-++++++# if self.layer_idx is None: +-++++++# raise ValueError( +-++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-++++++# "with a layer index." +-++++++# ) +-++++++# if isinstance(past_key_value, StaticCache): +-++++++# kv_seq_len = key_states.shape[-2] +-++++++# else: +-++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-++++++ +-++++++# if past_key_value is not None: +-++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++# if isinstance(past_key_value, StaticCache): +-++++++# kv_seq_len = key_states.shape[-2] +-++++++ +-++++++# # repeat k/v heads if n_kv_heads < n_heads +-++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++ +-++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++++ +-++++++# if attention_mask is not None: +-++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++++# attn_weights = attn_weights + causal_mask +-++++++ +-++++++# # upcast attention to fp32 +-++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++++++# attn_output = ops.matmul(attn_weights, value_states) +-++++++ +-++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++++++# raise ValueError( +-++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-++++++# f" {attn_output.shape}" +-++++++# ) +-++++++ +-++++++# attn_output = ops.transpose(attn_output, 1, 2) +-++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++++ +-++++++# attn_output = self.o_proj(attn_output) +-++++++# # @lwx +-++++++ +-++++++# # max_seq_len = self.max_position_embeddings # 2048 +-++++++ +-++++++# # if attention_mask is not None: +-++++++# # # attention_mask: [B, 1, Sq, Sk] +-++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-++++++ +-++++++# # # pad 到 [max_seq_len, max_seq_len] +-++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-++++++# # global_attention_mask = padded_mask +-++++++# # else: +-++++++# # global_attention_mask = None +-++++++ +-++++++ +-++++++# # sparse_mode=3 +-++++++# # attn_output = mindspore.ops.flash_attention_score( +-++++++# # query=query_states, +-++++++# # key=key_states, +-++++++# # value=value_states, +-++++++# # real_shift=None, +-++++++# # padding_mask=None, +-++++++ +-++++++# # head_num=self.num_heads, +-++++++# # attn_mask=global_attention_mask, +-++++++# # keep_prob=1.0 - self.attention_dropout, +-++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++# # input_layout="BNSD", +-++++++# # pre_tokens=2147483647, +-++++++# # next_tokens=2147483647, +-++++++# # inner_precise=0, +-++++++# # drop_mask=None, +-++++++# # prefix=None, +-++++++# # actual_seq_qlen=None, +-++++++# # actual_seq_kvlen=None, +-++++++# # sparse_mode=sparse_mode, +-++++++# # ) +-++++++# if not output_attentions: +-++++++# attn_weights = None +-++++++ +-++++++# return attn_output, attn_weights, past_key_value +-++++++ +-+++++ class Qwen2MoeAttention(nn.Module): +-+++++ """ +-+++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer +-+++++- and "Generating Long Sequences with Sparse Transformers". +-+++++- """ +-++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 +-+++++ +-++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: +-++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 +-++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 +-++++++ +-++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 +-++++++ """ +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++ super().__init__() +-+++++ self.config = config +-+++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): +-+++++ if layer_idx is None: +-+++++ logger.warning_once( +-+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " +-+++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " +-+++++ "when creating this class." +-+++++ ) +-+++++ +-+++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): +-+++++ use_cache: bool = False, +-+++++ cache_position: Optional[mindspore.Tensor] = None, +-+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++- +-+++++ +-+++++- +-++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- +-+++++ bsz, q_len, _ = hidden_states.shape +-+++++ +-+++++ query_states = self.q_proj(hidden_states) +-+++++ key_states = self.k_proj(hidden_states) +-+++++ value_states = self.v_proj(hidden_states) +-+++++ +-+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) +-+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) +-+++++- +-++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-++++++ +-+++++ kv_seq_len = key_states.shape[-2] +-+++++ if past_key_value is not None: +-+++++- if self.layer_idx is None: +-+++++- raise ValueError( +-+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++- "with a layer index." +-+++++- ) +-+++++- if isinstance(past_key_value, StaticCache): +-+++++- kv_seq_len = key_states.shape[-2] +-+++++- else: +-+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-++++++ +-+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++ +-+++++ if past_key_value is not None: +-+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-++++++ +-++++++ # --- 2. 动态调度核心注意力计算 --- +-++++++ global Long_Prompt +-++++++ if Long_Prompt >= 1: +-++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- +-++++++ fa_attention_mask = None +-++++++ if attention_mask is not None: +-++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-++++++ fa_attention_mask = (mask_slice != 0) +-++++++ +-++++++ attn_output = mindspore.ops.flash_attention_score( +-++++++ query=query_states, +-++++++ key=key_states, +-++++++ value=value_states, +-++++++ head_num=self.num_heads, +-++++++ attn_mask=fa_attention_mask, +-++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, +-++++++ scalar_value=1.0 / math.sqrt(self.head_dim), +-++++++ input_layout="BNSD", +-++++++ sparse_mode=0, +-++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 +-++++++ ) +-+++++ +-+++++- if isinstance(past_key_value, StaticCache): +-+++++- kv_seq_len = key_states.shape[-2] +-++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-++++++ attn_output = self.o_proj(attn_output) +-++++++ attn_weights = None +-++++++ if output_attentions: +-++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") +-+++++ +-+++++- # repeat k/v heads if n_kv_heads < n_heads +-+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++- +-+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-++++++ else: +-++++++ # --- Eager Attention 路径 (用于短序列和解码) --- +-++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) +-++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) +-++++++ +-++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++ +-+++++- if attention_mask is not None: +-+++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++- attn_weights = attn_weights + causal_mask +-++++++ if attention_mask is not None: +-++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-++++++ attn_weights = attn_weights + causal_mask +-+++++ +-+++++- # upcast attention to fp32 +-+++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-+++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-+++++- attn_output = ops.matmul(attn_weights, value_states) +-++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) +-++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) +-++++++ attn_output = ops.matmul(attn_weights, value_states) +-+++++ +-+++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-+++++- raise ValueError( +-+++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" +-+++++- f" {attn_output.shape}" +-+++++- ) +-++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): +-++++++ raise ValueError( +-++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" +-++++++ ) +-+++++ +-+++++- attn_output = ops.transpose(attn_output, 1, 2) +-+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++++ attn_output = ops.transpose(attn_output, 1, 2) +-++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-++++++ attn_output = self.o_proj(attn_output) +-+++++ +-+++++- attn_output = self.o_proj(attn_output) +-+++++- # @lwx +-++++++ if not output_attentions: +-++++++ attn_weights = None +-+++++ +-+++++- # max_seq_len = self.max_position_embeddings # 2048 +-+++++- +-+++++- # if attention_mask is not None: +-+++++- # # attention_mask: [B, 1, Sq, Sk] +-+++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++- +-+++++- # # pad 到 [max_seq_len, max_seq_len] +-+++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++- # global_attention_mask = padded_mask +-+++++- # else: +-+++++- # global_attention_mask = None +-+++++- +-+++++- +-+++++- # sparse_mode=3 +-+++++- # attn_output = mindspore.ops.flash_attention_score( +-+++++- # query=query_states, +-+++++- # key=key_states, +-+++++- # value=value_states, +-+++++- # real_shift=None, +-+++++- # padding_mask=None, +-+++++- +-+++++- # head_num=self.num_heads, +-+++++- # attn_mask=global_attention_mask, +-+++++- # keep_prob=1.0 - self.attention_dropout, +-+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++- # input_layout="BNSD", +-+++++- # pre_tokens=2147483647, +-+++++- # next_tokens=2147483647, +-+++++- # inner_precise=0, +-+++++- # drop_mask=None, +-+++++- # prefix=None, +-+++++- # actual_seq_qlen=None, +-+++++- # actual_seq_kvlen=None, +-+++++- # sparse_mode=sparse_mode, +-+++++- # ) +-+++++- if not output_attentions: +-+++++- attn_weights = None +-+++++- +-+++++ return attn_output, attn_weights, past_key_value +-+++++ +-+++++- +-+++++ # class Qwen2MoeFlashAttention(nn.Module): +-+++++ # """ +-+++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { +-+++++ # return final_hidden_states, router_logits +-+++++ +-+++++ +-+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++-# """ +-+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 +-+++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 +-+++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 +-+++++-# """ +-+++++-# def __init__(self, config: Qwen2MoeConfig): +-+++++-# super().__init__() +-+++++-# self.num_experts = config.num_experts +-+++++-# self.top_k = config.num_experts_per_tok +-+++++-# self.norm_topk_prob = config.norm_topk_prob +-+++++- +-+++++-# # 门控网络 +-+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++-# # 专家列表 +-+++++-# self.experts = nn.ModuleList( +-+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++-# ) +-+++++-# # 共享专家 +-+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_decode( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 【解码路径】针对 sequence_length=1 的极致优化。 +-+++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 +-+++++-# """ +-+++++-# batch_size, hidden_dim = hidden_states.shape +-+++++- +-+++++-# expert_outputs_list = [ +-+++++-# ops.cat([ +-+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++-# ], dim=0) +-+++++-# for i in range(batch_size) +-+++++-# ] +-+++++- +-+++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- +-+++++-# # shape: (batch_size, top_k, hidden_dim) +-+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++- +-+++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 +-+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++- +-+++++-# return moe_output.squeeze(1) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_prefill( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 【预填充路径】针对 sequence_length > 1 的优化。 +-+++++-# 按专家对 Token 进行分组,并进行批处理。 +-+++++-# """ +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens = hidden_states.shape[0] +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++- +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++- +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++- +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++- +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++-# selected_token_indices = token_indices[mask] +-+++++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++++- +-+++++-# current_states = hidden_states[selected_token_indices] +-+++++- +-+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++- +-+++++-# moe_output = moe_output.index_add( +-+++++-# dim=0, +-+++++-# index=selected_token_indices, +-+++++-# source=expert_output.to(hidden_states.dtype) +-+++++-# ) +-+++++-# return moe_output +-+++++- +-+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 顶层 forward 方法,作为智能分发器。 +-+++++-# """ +-+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- +-+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-# router_logits = self.gate(hidden_states_reshaped) +-+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- +-+++++-# if self.norm_topk_prob: +-+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- +-+++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++-# moe_output = None +-+++++-# # 在推理时,根据序列长度选择最优路径 +-+++++-# if not self.training: +-+++++-# if sequence_length == 1: +-+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++++-# else: +-+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++++-# else: +-+++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 +-+++++-# raise NotImplementedError("Training path is not implemented.") +-+++++- +-+++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) +-+++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) +-+++++- +-+++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights +-+++++- +-+++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++-# return final_hidden_states, router_logits +-+++++- +-+++++- +-+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++-# """ +-+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 +-+++++-# """ +-+++++-# def __init__(self, config: Qwen2MoeConfig): +-+++++-# super().__init__() +-+++++-# self.num_experts = config.num_experts +-+++++-# self.top_k = config.num_experts_per_tok +-+++++-# self.norm_topk_prob = config.norm_topk_prob +-+++++- +-+++++-# # 门控网络 +-+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++-# # 专家列表 +-+++++-# self.experts = nn.ModuleList( +-+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++-# ) +-+++++-# # 共享专家 +-+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_decode( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# batch_size, _ = hidden_states.shape +-+++++-# expert_outputs_list = [ +-+++++-# ops.cat([ +-+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++-# ], dim=0) +-+++++-# for i in range(batch_size) +-+++++-# ] +-+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++-# return moe_output.squeeze(1) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_prefill( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens = hidden_states.shape[0] +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++- +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++-# selected_token_indices = token_indices[mask] +-+++++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++++-# current_states = hidden_states[selected_token_indices] +-+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++-# moe_output = moe_output.index_add( +-+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++-# ) +-+++++-# return moe_output +-+++++- +-+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 顶层 forward 方法,作为智能分发器。 +-+++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 +-+++++-# """ +-+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- +-+++++-# # 1. 门控计算 (通用逻辑) +-+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-# router_logits = self.gate(hidden_states_reshaped) +-+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- +-+++++-# if self.norm_topk_prob: +-+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- +-+++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++-# # 2. 智能分发到最优 MoE 路径 +-+++++-# moe_output = None +-+++++-# if not self.training: +-+++++-# if sequence_length == 1: +-+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) +-+++++-# else: +-+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) +-+++++-# else: +-+++++-# raise NotImplementedError("Training path is not implemented.") +-+++++- +-+++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 +-+++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 +-+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++- +-+++++-# # 4. 合并 MoE 输出和共享专家输出 +-+++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 +-+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++- +-+++++-# # 5. 恢复原始形状并返回 +-+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++-# return final_hidden_states, router_logits +-+++++- +-+++++-# prefill fastest +-+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++-# """ +-+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), +-+++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 +-+++++-# """ +-+++++-# def __init__(self, config: Qwen2MoeConfig): +-+++++-# super().__init__() +-+++++-# self.num_experts = config.num_experts +-+++++-# self.top_k = config.num_experts_per_tok +-+++++-# self.norm_topk_prob = config.norm_topk_prob +-+++++- +-+++++-# # 门控网络 +-+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++-# # 专家列表 +-+++++-# self.experts = nn.ModuleList( +-+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++-# ) +-+++++-# # 共享专家 +-+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_dispatch( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 +-+++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 +-+++++-# """ +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens, _ = hidden_states.shape +-+++++- +-+++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++-# flat_routing_weights = routing_weights.flatten() +-+++++- +-+++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++- +-+++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++- +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++- +-+++++-# # 找到所有分配给该专家的 token +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++- +-+++++-# # 使用 mask 选取对应的 token 和权重 +-+++++-# current_token_indices = token_indices[mask] +-+++++-# current_routing_weights = flat_routing_weights[mask] +-+++++-# current_hidden_states = hidden_states[current_token_indices] +-+++++- +-+++++-# # 对这些 token 进行批处理 +-+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++++- +-+++++-# # 使用 index_add 将结果精确地加回到对应位置 +-+++++-# moe_output = moe_output.index_add( +-+++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++-# ) +-+++++-# return moe_output +-+++++- +-+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 顶层 forward 方法,作为智能分发器。 +-+++++-# """ +-+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- +-+++++-# # 1. 门控计算 +-+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-# router_logits = self.gate(hidden_states_reshaped) +-+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- +-+++++-# if self.norm_topk_prob: +-+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- +-+++++-# routing_weights = routing_weights.to(hidden_states.dtype) +-+++++- +-+++++-# # 2. 调用统一的 MoE 计算内核 +-+++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 +-+++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) +-+++++- +-+++++-# # 3. 统一处理共享专家 +-+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++- +-+++++-# # 4. 合并输出 +-+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++- +-+++++-# # 5. 恢复原始形状并返回 +-+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++-# return final_hidden_states, router_logits +-+++++- +-+++++- +-+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++-# """ +-+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 +-+++++-# 【最终高性能与高精度版】: +-+++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 +-+++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 +-+++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 +-+++++-# 3. 这样实现了速度和准确性的两全其美。 +-+++++-# """ +-+++++-# def __init__(self, config: Qwen2MoeConfig): +-+++++-# super().__init__() +-+++++-# self.num_experts = config.num_experts +-+++++-# self.top_k = config.num_experts_per_tok +-+++++-# self.norm_topk_prob = config.norm_topk_prob +-+++++- +-+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++-# self.experts = nn.ModuleList( +-+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++-# ) +-+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_decode( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 【解码路径】极致优化版:bmm + 高精度累加。 +-+++++-# """ +-+++++-# original_dtype = hidden_states.dtype +-+++++-# batch_size, _ = hidden_states.shape +-+++++- +-+++++-# expert_outputs_list = [ +-+++++-# ops.cat([ +-+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++-# ], dim=0) +-+++++-# for i in range(batch_size) +-+++++-# ] +-+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++- +-+++++-# # 在 float32 下执行 bmm,得到高精度结果 +-+++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) +-+++++- +-+++++-# # 将高精度结果转换回原始数据类型 +-+++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) +-+++++- +-+++++-# return moe_output +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_prefill( +-+++++-# self, +-+++++-# hidden_states: mindspore.Tensor, +-+++++-# selected_experts: mindspore.Tensor, +-+++++-# routing_weights: mindspore.Tensor +-+++++-# ) -> mindspore.Tensor: +-+++++-# """ +-+++++-# 【预填充路径】与原始实现一致,结果精确。 +-+++++-# """ +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens, _ = hidden_states.shape +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++- +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++-# selected_token_indices = token_indices[mask] +-+++++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++++-# current_states = hidden_states[selected_token_indices] +-+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++-# moe_output = moe_output.index_add( +-+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) +-+++++-# ) +-+++++-# return moe_output +-+++++- +-+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++- +-+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-# router_logits = self.gate(hidden_states_reshaped) +-+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++- +-+++++-# if self.norm_topk_prob: +-+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- +-+++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 +-+++++-# # 如果模型主体是 float16,后续再转换 +-+++++- +-+++++-# moe_output = None +-+++++-# if not self.training: +-+++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 +-+++++-# # _moe_infer_decode 内部会处理好类型转换 +-+++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) +-+++++-# if sequence_length == 1: +-+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++++-# else: +-+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) +-+++++-# else: +-+++++-# raise NotImplementedError("Training path is not implemented.") +-+++++- +-+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++- +-+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++-# return final_hidden_states, router_logits +-+++++- +-+++++- +-+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++-# """ +-+++++-# 【融合版】一个混合专家模块,内置两种推理策略, +-+++++-# 由外部全局变量 `Long_Prompt` 控制: +-+++++- +-+++++-# - if Long_Prompt is True: 【精度优先模式】 +-+++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 +-+++++-# 适用于处理长序列,避免误差累积。 +-+++++- +-+++++-# - if Long_Prompt is False: 【速度优先模式】 +-+++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, +-+++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 +-+++++-# """ +-+++++-# def __init__(self, config: Qwen2MoeConfig): +-+++++-# super().__init__() +-+++++-# self.num_experts = config.num_experts +-+++++-# self.top_k = config.num_experts_per_tok +-+++++-# self.norm_topk_prob = config.norm_topk_prob +-+++++- +-+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) +-+++++-# self.experts = nn.ModuleList( +-+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] +-+++++-# ) +-+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-# # --- 速度优先模式的辅助函数 --- +-+++++-# @no_grad() +-+++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++-# original_dtype = hidden_states.dtype +-+++++-# batch_size, _ = hidden_states.shape +-+++++-# expert_outputs_list = [ +-+++++-# ops.cat([ +-+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] +-+++++-# ], dim=0) +-+++++-# for i in range(batch_size) +-+++++-# ] +-+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) +-+++++-# weights_fp32 = routing_weights.to(mindspore.float32) +-+++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) +-+++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++++-# return moe_output_fp32.squeeze(1).to(original_dtype) +-+++++- +-+++++-# @no_grad() +-+++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens, _ = hidden_states.shape +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++-# selected_token_indices = token_indices[mask] +-+++++-# selected_routing_weights = routing_weights.flatten()[mask] +-+++++-# current_states = hidden_states[selected_token_indices] +-+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++-# return moe_output +-+++++- +-+++++-# # --- 精度优先模式的辅助函数 --- +-+++++-# @no_grad() +-+++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++-# moe_output = ops.zeros_like(hidden_states) +-+++++-# num_tokens, _ = hidden_states.shape +-+++++-# flat_selected_experts = selected_experts.flatten() +-+++++-# flat_routing_weights = routing_weights.flatten() +-+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() +-+++++-# active_experts = ops.unique(flat_selected_experts) +-+++++-# for expert_idx_tensor in active_experts: +-+++++-# expert_idx = expert_idx_tensor.item() +-+++++-# expert_layer = self.experts[expert_idx] +-+++++-# mask = (flat_selected_experts == expert_idx_tensor) +-+++++-# current_token_indices = token_indices[mask] +-+++++-# current_routing_weights = flat_routing_weights[mask] +-+++++-# current_hidden_states = hidden_states[current_token_indices] +-+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) +-+++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++-# return moe_output +-+++++- +-+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-# # 声明我们将要使用一个在模块外部定义的全局变量 +-+++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 +-+++++-# global Long_Prompt +-+++++- +-+++++-# # 1. 门控计算 (所有模式通用) +-+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-# router_logits = self.gate(hidden_states_reshaped) +-+++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) +-+++++-# if self.norm_topk_prob: +-+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++- +-+++++-# moe_output = None +-+++++-# if not self.training: +-+++++-# # 根据 Long_Prompt 标志选择模式 +-+++++-# if Long_Prompt: +-+++++-# # --- 精度优先模式 --- +-+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++-# else: +-+++++-# # --- 速度优先模式 --- +-+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++-# if sequence_length == 1: +-+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++-# else: +-+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++-# else: +-+++++-# raise NotImplementedError("Training path is not implemented.") +-+++++- +-+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) +-+++++- +-+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output +-+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) +-+++++- +-+++++-# return final_hidden_states, router_logits +-+++++- +-+++++ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ """ +-+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` +-+++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) +-+++++ return moe_output_fp32.squeeze(1).to(original_dtype) +-+++++ +-++++++ # @no_grad() +-++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-++++++ # num_tokens, _ = hidden_states.shape +-++++++ # flat_selected_experts = selected_experts.flatten() +-++++++ # sorted_expert_indices = flat_selected_experts.argsort() +-++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-++++++ # original_token_indices = sorted_expert_indices // self.top_k +-++++++ # moe_output = ops.zeros_like(hidden_states) +-++++++ # current_token_offset = 0 +-++++++ # for i in range(self.num_experts): +-++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset +-++++++ # if expert_token_count == 0: +-++++++ # continue +-++++++ # end_offset = current_token_offset + expert_token_count +-++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] +-++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-++++++ # current_token_offset += expert_token_count +-++++++ # return moe_output +-++++++ +-+++++ @no_grad() +-+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++- num_tokens, _ = hidden_states.shape +-+++++- flat_selected_experts = selected_experts.flatten() +-+++++- sorted_expert_indices = flat_selected_experts.argsort() +-+++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) +-+++++- original_token_indices = sorted_expert_indices // self.top_k +-++++++ """ +-++++++ 优化版 MoE prefill (速度优先模式): +-++++++ - 批量张量化处理同一个 expert 的所有 token +-++++++ - 跳过无 token 的专家 +-++++++ - 保持结果完全一致 +-++++++ """ +-+++++ moe_output = ops.zeros_like(hidden_states) +-+++++- current_token_offset = 0 +-+++++- for i in range(self.num_experts): +-+++++- expert_token_count = tokens_per_expert[i] - current_token_offset +-+++++- if expert_token_count == 0: +-+++++- continue +-+++++- end_offset = current_token_offset + expert_token_count +-+++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] +-+++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] +-+++++- expert_hidden_states = hidden_states[expert_original_token_indices] +-+++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] +-+++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) +-+++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) +-+++++- current_token_offset += expert_token_count +-++++++ +-++++++ flat_selected_experts = selected_experts.flatten() +-++++++ flat_routing_weights = routing_weights.flatten() +-++++++ +-++++++ idxs = flat_selected_experts.argsort() +-++++++ sorted_expert_indices = flat_selected_experts[idxs] +-++++++ sorted_token_indices = idxs // self.top_k +-++++++ +-++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) +-++++++ +-++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() +-++++++ +-++++++ for expert_id in active_experts.tolist(): +-++++++ start = int(tokens_per_expert[:expert_id].sum().item()) +-++++++ end = start + int(tokens_per_expert[expert_id].item()) +-++++++ +-++++++ token_idx = sorted_token_indices[start:end] +-++++++ expert_tokens = hidden_states[token_idx] +-++++++ +-++++++ expert_out = self.experts[expert_id](expert_tokens) +-++++++ +-++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) +-++++++ +-++++++ moe_output = mindspore.mint.scatter_add( +-++++++ moe_output, +-++++++ 0, +-++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), +-++++++ scaled_out.to(hidden_states.dtype) +-++++++ ) +-++++++ +-+++++ return moe_output +-+++++ +-++++++ +-+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- +-+++++ @no_grad() +-+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: +-+++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++ +-+++++ moe_output = None +-+++++- if Long_Prompt: +-+++++- # --- 精度优先模式 (ACCURACY MODE) --- +-+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ # if Long_Prompt==0: +-++++++ # # --- 精度优先模式 (ACCURACY MODE) --- +-++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ # else: +-++++++ # # --- 速度优先模式 (SPEED MODE) --- +-++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++ # if sequence_length == 1: +-++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ # else: +-++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ +-++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) +-++++++ if sequence_length == 1: +-++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++ else: +-+++++- # --- 速度优先模式 (SPEED MODE) --- +-+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) +-+++++- if sequence_length == 1: +-+++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++- else: +-+++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-+++++- +-++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) +-++++++ +-+++++ +-+++++ # 3. 共享专家计算与合并 (所有模式通用) +-+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ +-+++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++ +-+++++ return final_hidden_states, router_logits +-+++++ +-++++++ +-+++++ class Qwen2MoeDecoderLayer(nn.Module): +-+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): +-+++++ super().__init__() +-+++++ self.hidden_size = config.hidden_size +-+++++ +-+++++- # if Long_Prompt: +-+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++- # else: +-++++++ # if Long_Prompt == 2: +-+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-++++++ # else: +-++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ +-+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++ +-+++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++++ ) +-+++++ +-+++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). +-+++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++++ # attention_mask, +-++++++ # sequence_length=sequence_length, +-++++++ # target_length=target_length, +-++++++ # dtype=dtype, +-++++++ # min_dtype=min_dtype, +-++++++ # cache_position=cache_position, +-++++++ # batch_size=input_tensor.shape[0], +-++++++ # ) +-++++++ #@dwj +-++++++ causal_mask = get_cached_causal_mask_with_cache_position( +-+++++ attention_mask, +-+++++ sequence_length=sequence_length, +-+++++ target_length=target_length, +-+++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 +-+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 +-+++++ """ +-+++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD +-++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache +-++++++ _causal_mask_cache.clear() +-+++++ +-+++++ input_ids = kwargs.get("input_ids") +-+++++ if input_ids is None and args: +-+++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ +-+++++ if input_ids is not None: +-+++++ prompt_length = input_ids.shape[1] +-+++++- +-+++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: +-+++++- Long_Prompt = True +-++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: +-++++++ Long_Prompt = 2 +-++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: +-++++++ Long_Prompt = 0 +-+++++ else: +-+++++- Long_Prompt = False +-++++++ Long_Prompt = 1 +-++++++ +-+++++ +-+++++ return super().generate(*args, **kwargs) +-+++++ +-+++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++ dtype = self.lm_head.weight.dtype +-+++++ min_dtype = float(ops.finfo(dtype).min) +-+++++ +-+++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( +-++++++ # attention_mask, +-++++++ # sequence_length=sequence_length, +-++++++ # target_length=past_key_values.get_max_length(), +-++++++ # dtype=dtype, +-++++++ # min_dtype=min_dtype, +-++++++ # cache_position=cache_position, +-++++++ # batch_size=batch_size, +-++++++ # ) +-++++++ +-++++++ #@dwj +-++++++ attention_mask = get_cached_causal_mask_with_cache_position( +-+++++ attention_mask, +-+++++ sequence_length=sequence_length, +-+++++ target_length=past_key_values.get_max_length(), +-+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch +-+++++deleted file mode 100644 +-+++++index 6dfb5b93..00000000 +-+++++--- a/patches/0001-20251104commit.patch +-++++++++ /dev/null +-+++++@@ -1,1272 +0,0 @@ +-+++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 +-+++++-From: Pinoeer-kingxi <13022943007@163.com> +-+++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 +-+++++-Subject: [PATCH] 20251104commit +-+++++- +-+++++---- +-+++++- mindnlp/transformers/cache_utils.py | 28 +- +-+++++- .../models/deepseek/modeling_deepseek.py | 149 ++- +-+++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- +-+++++- 3 files changed, 976 insertions(+), 87 deletions(-) +-+++++- +-+++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py +-+++++-index cadd2e04..02f8d4be 100644 +-+++++---- a/mindnlp/transformers/cache_utils.py +-+++++-+++ b/mindnlp/transformers/cache_utils.py +-+++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): +-+++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. +-+++++- # k_out[:, :, cache_position] = key_states +-+++++- # v_out[:, :, cache_position] = value_states +-+++++-- if ON_ORANGE_PI: +-+++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++-- else: +-+++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++-- +-+++++-+ # if ON_ORANGE_PI: +-+++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) +-+++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) +-+++++-+ # else: +-+++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy +-+++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) +-+++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) +-+++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 +-+++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] +-+++++-+ if cache_position.ndim > 1: +-+++++-+ cache_position = cache_position.flatten() +-+++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) +-+++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): +-+++++-+ cache_position = cache_position.int() +-+++++-+ +-+++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) +-+++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 +-+++++-+ k_out[:, :, cache_position] = key_states +-+++++-+ v_out[:, :, cache_position] = value_states +-+++++-+ +-+++++- return k_out, v_out +-+++++- +-+++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: +-+++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++-index c695b944..d8303e45 100644 +-+++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py +-+++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): +-+++++- # Copied from transformers.models.llama.modeling_llama.rotate_half +-+++++- def rotate_half(x): +-+++++- """Rotates half the hidden dims of the input.""" +-+++++-- x1 = x[..., : x.shape[-1] // 2] +-+++++-- x2 = x[..., x.shape[-1] // 2 :] +-+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++-+ # x1 = x[..., : x.shape[-1] // 2] +-+++++-+ # x2 = x[..., x.shape[-1] // 2 :] +-+++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++- return ops.cat((-x2, x1), dim=-1) +-+++++- +-+++++- +-+++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): +-+++++- if self.training: +-+++++- raise NotImplementedError("Training is not supported yet.") +-+++++- else: +-+++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++-- if self.config.n_shared_experts is not None: +-+++++-- y = y + self.shared_experts(identity) +-+++++-- return y +-+++++-+ # @lwx +-+++++-+ if orig_shape[1] == 1: +-+++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) +-+++++-+ y=y.view(*orig_shape) +-+++++-+ if self.config.n_shared_experts is not None: +-+++++-+ y = y + self.shared_experts(identity) +-+++++-+ return y +-+++++-+ else: +-+++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) +-+++++-+ if self.config.n_shared_experts is not None: +-+++++-+ y = y + self.shared_experts(identity) +-+++++-+ return y +-+++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) +-+++++-+ # if self.config.n_shared_experts is not None: +-+++++-+ # y = y + self.shared_experts(identity) +-+++++-+ # return y +-+++++-+ +-+++++-+ @no_grad() +-+++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): +-+++++-+ +-+++++-+ expert_cache = ops.zeros_like(x) +-+++++-+ for i in range(self.num_experts_per_tok): +-+++++-+ expert_id = flat_expert_indices[i].item() +-+++++-+ weight = flat_expert_weights[i].item() +-+++++-+ expert = self.experts[expert_id] +-+++++-+ expert_out = expert(x) +-+++++-+ expert_cache += expert_out * weight +-+++++-+ return expert_cache +-+++++- +-+++++- @no_grad() +-+++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++-- # expert_cache = torch.zeros_like(x) +-+++++-- # idxs = flat_expert_indices.argsort() +-+++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++-- # token_idxs = idxs // self.num_experts_per_tok +-+++++-- # for i, end_idx in enumerate(tokens_per_expert): +-+++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++-- # if start_idx == end_idx: +-+++++-- # continue +-+++++-- # expert = self.experts[i] +-+++++-- # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++-- # expert_tokens = x[exp_token_idx] +-+++++-- # expert_out = expert(expert_tokens) +-+++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++-- # return expert_cache +-+++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): +-+++++- expert_cache = ops.zeros_like(x) +-+++++- idxs = flat_expert_indices.argsort() +-+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++- token_idxs = idxs // self.num_experts_per_tok +-+++++-+ +-+++++- for i, end_idx in enumerate(tokens_per_expert): +-+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++- if start_idx == end_idx: +-+++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): +-+++++- expert_out = expert(expert_tokens) +-+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++-+ +-+++++- return expert_cache +-+++++-+ +-+++++-+ # @no_grad() +-+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++-+ # # expert_cache = torch.zeros_like(x) +-+++++-+ # # idxs = flat_expert_indices.argsort() +-+++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) +-+++++-+ # # token_idxs = idxs // self.num_experts_per_tok +-+++++-+ # # for i, end_idx in enumerate(tokens_per_expert): +-+++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] +-+++++-+ # # if start_idx == end_idx: +-+++++-+ # # continue +-+++++-+ # # expert = self.experts[i] +-+++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++-+ # # expert_tokens = x[exp_token_idx] +-+++++-+ # # expert_out = expert(expert_tokens) +-+++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') +-+++++-+ # # return expert_cache +-+++++-+ # expert_cache = ops.zeros_like(x) +-+++++-+ # idxs = flat_expert_indices.argsort() +-+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++-+ # token_idxs = idxs // self.num_experts_per_tok +-+++++-+ +-+++++-+ # for i, end_idx in enumerate(tokens_per_expert): +-+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++-+ # if start_idx == end_idx: +-+++++-+ # continue +-+++++-+ # expert = self.experts[i] +-+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++-+ # expert_tokens = x[exp_token_idx] +-+++++-+ # expert_out = expert(expert_tokens) +-+++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) +-+++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) +-+++++-+ +-+++++-+ # return expert_cache +-+++++-+ # @no_grad() +-+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): +-+++++-+ # expert_cache = ops.zeros_like(x) +-+++++-+ +-+++++-+ # # 排序保证顺序一致 +-+++++-+ # idxs = flat_expert_indices.argsort() +-+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) +-+++++-+ # token_idxs = idxs // self.num_experts_per_tok +-+++++-+ +-+++++-+ # # 找出有 token 的专家 +-+++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) +-+++++-+ +-+++++-+ # for i in active_experts.tolist(): +-+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] +-+++++-+ # end_idx = tokens_per_expert[i] +-+++++-+ # if start_idx == end_idx: # 没有 token +-+++++-+ # continue +-+++++-+ +-+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] +-+++++-+ # expert_tokens = x[exp_token_idx] +-+++++-+ # expert_out = self.experts[i](expert_tokens) +-+++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] +-+++++-+ +-+++++-+ # expert_cache = mindspore.mint.scatter_add( +-+++++-+ # expert_cache, +-+++++-+ # 0, +-+++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), +-+++++-+ # expert_out +-+++++-+ # ) +-+++++-+ +-+++++-+ # return expert_cache +-+++++-+ +-+++++-+ +-+++++- +-+++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): +-+++++- # """ +-+++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++- +-+++++- # Initialize weights and apply final processing +-+++++- self.post_init() +-+++++-+ self.warm_up = False +-+++++-+ +-+++++-+ def warmup_moe_model_deep(self): +-+++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") +-+++++-+ test_texts = [ +-+++++-+ "warmup short", +-+++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", +-+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" +-+++++-+ ] +-+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++-+ if tokenizer is None: +-+++++-+ from mindnlp.transformers import AutoTokenizer +-+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++-+ self._warmup_tokenizer = tokenizer +-+++++-+ +-+++++-+ for text in test_texts: +-+++++-+ inputs = tokenizer(text, return_tensors="ms") +-+++++-+ with mindspore._no_grad(): +-+++++-+ _ = self(**inputs, use_cache=False) +-+++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") +-+++++- +-+++++- def get_input_embeddings(self): +-+++++- return self.model.embed_tokens +-+++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): +-+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++- ```""" +-+++++-+ if not self.warm_up: +-+++++-+ self.warm_up = True +-+++++-+ self.warmup_moe_model_deep() +-+++++-+ +-+++++- output_attentions = ( +-+++++- output_attentions +-+++++- if output_attentions is not None +-+++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++-index 3cbf820e..d4c6b651 100644 +-+++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py +-+++++-@@ -18,7 +18,6 @@ +-+++++- # See the License for the specific language governing permissions and +-+++++- # limitations under the License. +-+++++- """MindSpore Qwen2MoE model.""" +-+++++-- +-+++++- import math +-+++++- from typing import List, Optional, Tuple, Union +-+++++- +-+++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( +-+++++- TokenClassifierOutput, +-+++++- ) +-+++++- from ...modeling_utils import PreTrainedModel +-+++++-+from ...generation import GenerationMixin +-+++++- from ....utils import logging +-+++++- from .configuration_qwen2_moe import Qwen2MoeConfig +-+++++- +-+++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): +-+++++- self.variance_epsilon = eps +-+++++- +-+++++- def forward(self, hidden_states): +-+++++-+ # @dwj +-+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++-+ # @lwx +-+++++-+ # if not self.training : +-+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) +-+++++- input_dtype = hidden_states.dtype +-+++++- hidden_states = hidden_states.to(mindspore.float32) +-+++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) +-+++++-@@ -234,6 +239,8 @@ def rotate_half(x): +-+++++- """Rotates half the hidden dims of the input.""" +-+++++- x1 = x[..., : x.shape[-1] // 2] +-+++++- x2 = x[..., x.shape[-1] // 2 :] +-+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] +-+++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) +-+++++- return ops.cat((-x2, x1), dim=-1) +-+++++- +-+++++- +-+++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): +-+++++- self.config = config +-+++++- self.hidden_size = config.hidden_size +-+++++- self.intermediate_size = intermediate_size +-+++++-+ +-+++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) +-+++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) +-+++++- self.act_fn = ACT2FN[config.hidden_act] +-+++++- +-+++++- def forward(self, x): +-+++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++-- +-+++++- +-+++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) +-+++++-+ # @lwx +-+++++-+ # gate_up_output = self.gate_up_proj(x) +-+++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) +-+++++-+ # return self.down_proj(swiglu_output) +-+++++-+ +-+++++-+ # def forward(self, x): +-+++++-+ # gate_proj_out = self.gate_proj(x) +-+++++-+ # up_proj_out = self.up_proj(x) +-+++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) +-+++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) +-+++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out +-+++++-+ # return self.down_proj(swiglu_out) +-+++++-+ +-+++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv +-+++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: +-+++++- """ +-+++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): +-+++++- use_cache: bool = False, +-+++++- cache_position: Optional[mindspore.Tensor] = None, +-+++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++-+ +-+++++-+ +-+++++-+ +-+++++- bsz, q_len, _ = hidden_states.shape +-+++++- +-+++++- query_states = self.q_proj(hidden_states) +-+++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): +-+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++- "with a layer index." +-+++++- ) +-+++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++-+ if isinstance(past_key_value, StaticCache): +-+++++-+ kv_seq_len = key_states.shape[-2] +-+++++-+ else: +-+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++- +-+++++- if past_key_value is not None: +-+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models +-+++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) +-+++++-+ +-+++++-+ if isinstance(past_key_value, StaticCache): +-+++++-+ kv_seq_len = key_states.shape[-2] +-+++++- +-+++++- # repeat k/v heads if n_kv_heads < n_heads +-+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++-- +-+++++-+ +-+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) +-+++++- +-+++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): +-+++++-- raise ValueError( +-+++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" +-+++++-- f" {attn_weights.shape}" +-+++++-- ) +-+++++-- +-+++++-- if attention_mask is not None: # no matter the length, we just slice it +-+++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] +-+++++-+ if attention_mask is not None: +-+++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] +-+++++- attn_weights = attn_weights + causal_mask +-+++++- +-+++++- # upcast attention to fp32 +-+++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): +-+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) +-+++++- +-+++++- attn_output = self.o_proj(attn_output) +-+++++-- +-+++++-+ # @lwx +-+++++-+ +-+++++-+ # max_seq_len = self.max_position_embeddings # 2048 +-+++++-+ +-+++++-+ # if attention_mask is not None: +-+++++-+ # # attention_mask: [B, 1, Sq, Sk] +-+++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask +-+++++-+ +-+++++-+ # # pad 到 [max_seq_len, max_seq_len] +-+++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 +-+++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) +-+++++-+ # global_attention_mask = padded_mask +-+++++-+ # else: +-+++++-+ # global_attention_mask = None +-+++++-+ +-+++++-+ +-+++++-+ # sparse_mode=3 +-+++++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++++-+ # query=query_states, +-+++++-+ # key=key_states, +-+++++-+ # value=value_states, +-+++++-+ # real_shift=None, +-+++++-+ # padding_mask=None, +-+++++-+ +-+++++-+ # head_num=self.num_heads, +-+++++-+ # attn_mask=global_attention_mask, +-+++++-+ # keep_prob=1.0 - self.attention_dropout, +-+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++-+ # input_layout="BNSD", +-+++++-+ # pre_tokens=2147483647, +-+++++-+ # next_tokens=2147483647, +-+++++-+ # inner_precise=0, +-+++++-+ # drop_mask=None, +-+++++-+ # prefix=None, +-+++++-+ # actual_seq_qlen=None, +-+++++-+ # actual_seq_kvlen=None, +-+++++-+ # sparse_mode=sparse_mode, +-+++++-+ # ) +-+++++- if not output_attentions: +-+++++- attn_weights = None +-+++++- +-+++++- return attn_output, attn_weights, past_key_value +-+++++- +-+++++- +-+++++-+class Qwen2MoeFlashAttention(nn.Module): +-+++++-+ """ +-+++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 +-+++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 +-+++++-+ +-+++++-+ 关键改动: +-+++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), +-+++++-+ 直接传入原始的 key 和 value 张量效率更高。 +-+++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 +-+++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 +-+++++-+ """ +-+++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): +-+++++-+ super().__init__() +-+++++-+ self.config = config +-+++++-+ self.layer_idx = layer_idx +-+++++-+ self.hidden_size = config.hidden_size +-+++++-+ self.num_heads = config.num_attention_heads +-+++++-+ self.head_dim = self.hidden_size // self.num_heads +-+++++-+ self.num_key_value_heads = config.num_key_value_heads +-+++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads +-+++++-+ self.max_position_embeddings = config.max_position_embeddings +-+++++-+ self.rope_theta = config.rope_theta +-+++++-+ self.attention_dropout = config.attention_dropout +-+++++-+ +-+++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: +-+++++-+ raise ValueError( +-+++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" +-+++++-+ ) +-+++++-+ +-+++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) +-+++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) +-+++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) +-+++++-+ +-+++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( +-+++++-+ self.head_dim, +-+++++-+ max_position_embeddings=self.max_position_embeddings, +-+++++-+ base=self.rope_theta, +-+++++-+ ) +-+++++-+ +-+++++-+ def forward( +-+++++-+ self, +-+++++-+ hidden_states: mindspore.Tensor, +-+++++-+ attention_mask: Optional[mindspore.Tensor] = None, +-+++++-+ position_ids: Optional[mindspore.Tensor] = None, +-+++++-+ past_key_value: Optional[Cache] = None, +-+++++-+ output_attentions: bool = False, +-+++++-+ use_cache: bool = False, +-+++++-+ cache_position: Optional[mindspore.Tensor] = None, +-+++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++-+ +-+++++-+ bsz, q_len, _ = hidden_states.shape +-+++++-+ +-+++++-+ # 1. 线性投射 Q, K, V +-+++++-+ query_states = self.q_proj(hidden_states) +-+++++-+ key_states = self.k_proj(hidden_states) +-+++++-+ value_states = self.v_proj(hidden_states) +-+++++-+ +-+++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++-+ # query: [B, S, H*D] -> [B, N1, S, D] +-+++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] +-+++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ +-+++++-+ # 3. RoPE 旋转位置编码 +-+++++-+ kv_seq_len = key_states.shape[-2] +-+++++-+ if past_key_value is not None: +-+++++-+ if self.layer_idx is None: +-+++++-+ raise ValueError( +-+++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++-+ "with a layer index." +-+++++-+ ) +-+++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len +-+++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 +-+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len +-+++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n +-+++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) +-+++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 +-+++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 +-+++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) +-+++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens +-+++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 +-+++++-+ if cache_position.shape[0] == 1: +-+++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 +-+++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) +-+++++-+ kv_seq_len = past_seen_tokens + 1 +-+++++-+ else: +-+++++-+ # prefill 阶段:cache_position 是范围,使用其长度 +-+++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens +-+++++-+ else: +-+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++-+ +-+++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++-+ +-+++++-+ # 4. KV 缓存更新 +-+++++-+ if past_key_value is not None: +-+++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++-+ key_states, value_states = past_key_value.update( +-+++++-+ key_states, value_states, self.layer_idx, cache_kwargs +-+++++-+ ) +-+++++-+ +-+++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 +-+++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) +-+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: +-+++++-+ if cache_position.shape[0] == 1: +-+++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) +-+++++-+ kv_seq_len = key_states.shape[-2] +-+++++-+ +-+++++-+ # 5. [重要] 准备 Attention Mask +-+++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) +-+++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 +-+++++-+ fa_attention_mask = None +-+++++-+ if attention_mask is not None: +-+++++-+ # 截取与当前key长度匹配的部分 +-+++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) +-+++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) +-+++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False +-+++++-+ fa_attention_mask = (mask_slice != 0) +-+++++-+ +-+++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 +-+++++-+ input_dtype = query_states.dtype +-+++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): +-+++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 +-+++++-+ query_states = query_states.to(mindspore.float16) +-+++++-+ key_states = key_states.to(mindspore.float16) +-+++++-+ value_states = value_states.to(mindspore.float16) +-+++++-+ +-+++++-+ # 6. [核心] 调用 flash_attention_score 算子 +-+++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA +-+++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] +-+++++-+ attn_output = mindspore.ops.flash_attention_score( +-+++++-+ query=query_states, +-+++++-+ key=key_states, +-+++++-+ value=value_states, +-+++++-+ head_num=self.num_heads, # 传入Q的头数(N1) +-+++++-+ attn_mask=fa_attention_mask, +-+++++-+ keep_prob=1.0 - self.attention_dropout, +-+++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++-+ input_layout="BNSD", +-+++++-+ sparse_mode=0 # 使用 defaultMask 模式 +-+++++-+ ) +-+++++-+ +-+++++-+ # 恢复原始数据类型 +-+++++-+ attn_output = attn_output.to(input_dtype) +-+++++-+ +-+++++-+ # 7. 调整输出形状 +-+++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] +-+++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++-+ attn_output = self.o_proj(attn_output) +-+++++-+ +-+++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 +-+++++-+ attn_weights = None +-+++++-+ if output_attentions: +-+++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++-+ +-+++++-+ return attn_output, attn_weights, past_key_value +-+++++-+ +-+++++-+ # def forward( +-+++++-+ # self, +-+++++-+ # hidden_states: mindspore.Tensor, +-+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++-+ # position_ids: Optional[mindspore.Tensor] = None, +-+++++-+ # past_key_value: Optional[Cache] = None, +-+++++-+ # output_attentions: bool = False, +-+++++-+ # use_cache: bool = False, +-+++++-+ # cache_position: Optional[mindspore.Tensor] = None, +-+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++-+ +-+++++-+ # bsz, q_len, _ = hidden_states.shape +-+++++-+ +-+++++-+ # # 1. 线性投射 Q, K, V +-+++++-+ # query_states = self.q_proj(hidden_states) +-+++++-+ # key_states = self.k_proj(hidden_states) +-+++++-+ # value_states = self.v_proj(hidden_states) +-+++++-+ +-+++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 +-+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ +-+++++-+ # # 3. RoPE 旋转位置编码 +-+++++-+ # kv_seq_len = key_states.shape[-2] +-+++++-+ # if past_key_value is not None: +-+++++-+ # if self.layer_idx is None: +-+++++-+ # raise ValueError( +-+++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " +-+++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " +-+++++-+ # "with a layer index." +-+++++-+ # ) +-+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++-+ +-+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++-+ +-+++++-+ # # 4. KV 缓存更新 +-+++++-+ # if past_key_value is not None: +-+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++-+ # key_states, value_states = past_key_value.update( +-+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++-+ # ) +-+++++-+ +-+++++-+ # # 5. 准备 Attention Mask +-+++++-+ # fa_attention_mask = None +-+++++-+ # if attention_mask is not None: +-+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++-+ # fa_attention_mask = (mask_slice != 0) +-+++++-+ +-+++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- +-+++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 +-+++++-+ # input_dtype = query_states.dtype +-+++++-+ +-+++++-+ # # 6. [核心] 调用 flash_attention_score 算子 +-+++++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++++-+ # query=query_states, +-+++++-+ # key=key_states, +-+++++-+ # value=value_states, +-+++++-+ # head_num=self.num_heads, +-+++++-+ # attn_mask=fa_attention_mask, +-+++++-+ # keep_prob=1.0 - self.attention_dropout, +-+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), +-+++++-+ # input_layout="BNSD", +-+++++-+ # sparse_mode=0, +-+++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- +-+++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, +-+++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 +-+++++-+ # inner_precise=1 +-+++++-+ # ) +-+++++-+ +-+++++-+ # # 恢复原始数据类型 +-+++++-+ # attn_output = attn_output.to(input_dtype) +-+++++-+ +-+++++-+ # # 7. 调整输出形状 +-+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++-+ # attn_output = self.o_proj(attn_output) +-+++++-+ +-+++++-+ # attn_weights = None +-+++++-+ # if output_attentions: +-+++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") +-+++++-+ +-+++++-+ # return attn_output, attn_weights, past_key_value +-+++++-+ +-+++++-+ # def forward( +-+++++-+ # self, +-+++++-+ # hidden_states: mindspore.Tensor, +-+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, +-+++++-+ # position_ids: Optional[mindspore.Tensor] = None, +-+++++-+ # past_key_value: Optional[Cache] = None, +-+++++-+ # output_attentions: bool = False, +-+++++-+ # use_cache: bool = False, +-+++++-+ # cache_position: Optional[mindspore.Tensor] = None, +-+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: +-+++++-+ +-+++++-+ # bsz, q_len, _ = hidden_states.shape +-+++++-+ +-+++++-+ # query_states = self.q_proj(hidden_states) +-+++++-+ # key_states = self.k_proj(hidden_states) +-+++++-+ # value_states = self.v_proj(hidden_states) +-+++++-+ +-+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) +-+++++-+ +-+++++-+ # kv_seq_len = key_states.shape[-2] +-+++++-+ # if past_key_value is not None: +-+++++-+ # if self.layer_idx is None: +-+++++-+ # raise ValueError("`layer_idx` must be specified for caching") +-+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) +-+++++-+ +-+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) +-+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) +-+++++-+ +-+++++-+ # if past_key_value is not None: +-+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} +-+++++-+ # key_states, value_states = past_key_value.update( +-+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs +-+++++-+ # ) +-+++++-+ +-+++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) +-+++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) +-+++++-+ +-+++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- +-+++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 +-+++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 +-+++++-+ # query_states = query_states / math.sqrt(self.head_dim) +-+++++-+ # # <--- 修改结束 --- +-+++++-+ +-+++++-+ # fa_attention_mask = None +-+++++-+ # if attention_mask is not None: +-+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] +-+++++-+ # fa_attention_mask = (mask_slice != 0) +-+++++-+ +-+++++-+ # input_dtype = query_states.dtype +-+++++-+ +-+++++-+ # attn_output = mindspore.ops.flash_attention_score( +-+++++-+ # query=query_states, # 传入已经预先缩放过的 query +-+++++-+ # key=key_states, +-+++++-+ # value=value_states, +-+++++-+ # head_num=self.num_heads, +-+++++-+ # attn_mask=fa_attention_mask, +-+++++-+ # keep_prob=1.0 - self.attention_dropout, +-+++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 +-+++++-+ # input_layout="BNSD", +-+++++-+ # sparse_mode=0, +-+++++-+ # inner_precise=1 # 仍然保持内部高精度计算 +-+++++-+ # ) +-+++++-+ +-+++++-+ # attn_output = attn_output.to(input_dtype) +-+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) +-+++++-+ # attn_output = self.o_proj(attn_output) +-+++++-+ +-+++++-+ # attn_weights = None +-+++++-+ # if output_attentions: +-+++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") +-+++++-+ +-+++++-+ # return attn_output, attn_weights, past_key_value +-+++++-+ +-+++++- QWEN2MOE_ATTENTION_CLASSES = { +-+++++- "eager": Qwen2MoeAttention, +-+++++-+ "flash-attention": Qwen2MoeFlashAttention, +-+++++- } +-+++++- +-+++++- +-+++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): +-+++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) +-+++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) +-+++++- +-+++++-+ #@dwj +-+++++-+ # 只遍历激活的专家,而非全部专家 +-+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: +-+++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++-- hidden_states = hidden_states.view(-1, hidden_dim) +-+++++-- # router_logits: (batch * sequence_length, n_experts) +-+++++-- router_logits = self.gate(hidden_states) +-+++++-- +-+++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++-- if self.norm_topk_prob: +-+++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++-- # we cast back to the input dtype +-+++++-- routing_weights = routing_weights.to(hidden_states.dtype) +-+++++-- +-+++++-- final_hidden_states = ops.zeros( +-+++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype +-+++++-- ) +-+++++-- +-+++++-- # One hot encode the selected experts to create an expert mask +-+++++-- # this will be used to easily index which expert is going to be sollicitated +-+++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) +-+++++-- +-+++++-- # Loop over all available experts in the model and perform the computation on each expert +-+++++-- for expert_idx in range(self.num_experts): +-+++++-- expert_layer = self.experts[expert_idx] +-+++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) +-+++++-- +-+++++-- # Index the correct hidden states and compute the expert hidden state for +-+++++-- # the current expert. We need to make sure to multiply the output hidden +-+++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) +-+++++-- if 0 not in idx.shape: +-+++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) +-+++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] +-+++++-- +-+++++-- # However `index_add_` only support torch tensors for indexing so we'll use +-+++++-- # the `top_x` tensor here. +-+++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) +-+++++-- +-+++++-- shared_expert_output = self.shared_expert(hidden_states) +-+++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output +-+++++-- +-+++++-- final_hidden_states = final_hidden_states + shared_expert_output +-+++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape +-+++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) +-+++++-+ num_tokens = hidden_states_reshaped.shape[0] +-+++++-+ +-+++++-+ router_logits = self.gate(hidden_states_reshaped) +-+++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) +-+++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) +-+++++-+ +-+++++-+ if self.norm_topk_prob: +-+++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) +-+++++-+ routing_weights = routing_weights.to(hidden_states.dtype) +-+++++-+ +-+++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) +-+++++-+ flat_selected_experts = selected_experts.flatten() +-+++++-+ +-+++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) +-+++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) +-+++++-+ token_indices = broadcasted_token_indices.flatten() +-+++++-+ +-+++++-+ active_experts = ops.unique(flat_selected_experts) +-+++++-+ +-+++++-+ for expert_idx_tensor in active_experts: +-+++++-+ expert_idx = expert_idx_tensor.item() +-+++++-+ expert_layer = self.experts[expert_idx] +-+++++-+ +-+++++-+ mask = (flat_selected_experts == expert_idx_tensor) +-+++++-+ selected_token_indices = token_indices[mask] +-+++++-+ selected_routing_weights = routing_weights.flatten()[mask] +-+++++-+ +-+++++-+ current_states = hidden_states_reshaped[selected_token_indices] +-+++++-+ +-+++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) +-+++++-+ +-+++++-+ final_hidden_states = final_hidden_states.index_add( +-+++++-+ dim=0, +-+++++-+ index=selected_token_indices, +-+++++-+ source=expert_output.to(hidden_states.dtype) +-+++++-+ ) +-+++++-+ +-+++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) +-+++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output +-+++++- +-+++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++-- return final_hidden_states, router_logits +-+++++-+ final_hidden_states = final_hidden_states + shared_expert_output +-+++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) +-+++++-+ +-+++++-+ return final_hidden_states, router_logits +-+++++- +-+++++- +-+++++- class Qwen2MoeDecoderLayer(nn.Module): +-+++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): +-+++++- +-+++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) +-+++++- +-+++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) +-+++++-+ +-+++++- if (layer_idx not in config.mlp_only_layers) and ( +-+++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 +-+++++- ): +-+++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): +-+++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] +-+++++- _skip_keys_device_placement = "past_key_values" +-+++++- _supports_cache_class = True +-+++++-+#lwx +-+++++-+ # _supports_static_cache = True +-+++++- +-+++++- def _init_weights(self, module): +-+++++- std = self.config.initializer_range +-+++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): +-+++++- return causal_mask +-+++++- +-+++++- +-+++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): +-+++++- _tied_weights_keys = ["lm_head.weight"] +-+++++- +-+++++- def __init__(self, config): +-+++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++- self.num_experts_per_tok = config.num_experts_per_tok +-+++++- # Initialize weights and apply final processing +-+++++- self.post_init() +-+++++-+ # @lwx +-+++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: +-+++++-+ # self.generation_config.cache_implementation = "static" +-+++++-+ self._warmed_up = False +-+++++-+ +-+++++-+ def warmup_moe_model(self): +-+++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") +-+++++-+ test_texts = [ +-+++++-+ "warmup short", +-+++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", +-+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" +-+++++-+ ] +-+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) +-+++++-+ if tokenizer is None: +-+++++-+ from mindnlp.transformers import AutoTokenizer +-+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) +-+++++-+ self._warmup_tokenizer = tokenizer +-+++++-+ +-+++++-+ for text in test_texts: +-+++++-+ inputs = tokenizer(text, return_tensors="ms") +-+++++-+ with mindspore._no_grad(): +-+++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) +-+++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") +-+++++- +-+++++- def get_input_embeddings(self): +-+++++- return self.model.embed_tokens +-+++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] +-+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." +-+++++- ```""" +-+++++-+ if not self._warmed_up: +-+++++-+ self._warmed_up = True +-+++++-+ self.warmup_moe_model() +-+++++- +-+++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions +-+++++- output_router_logits = ( +-+++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): +-+++++- } +-+++++- ) +-+++++- return model_inputs +-+++++-+# @lwx +-+++++-+ # def _decode_one_tokens_logits( +-+++++-+ # self, +-+++++-+ # cur_token: mindspore.Tensor, +-+++++-+ # input_pos: Optional[mindspore.Tensor], +-+++++-+ # cache_position: mindspore.Tensor, +-+++++-+ # past_key_values: StaticCache, +-+++++-+ # ) -> mindspore.Tensor: +-+++++-+ # """ +-+++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) +-+++++-+ +-+++++-+ # Args: +-+++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) +-+++++-+ # input_pos: 输入位置信息,可选 +-+++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) +-+++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 +-+++++-+ +-+++++-+ # Returns: +-+++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) +-+++++-+ # """ +-+++++-+ # # 调用JIT编译的版本 +-+++++-+ # return self.get_decode_one_tokens_logits( +-+++++-+ # cur_token=cur_token, +-+++++-+ # input_pos=input_pos, +-+++++-+ # cache_position=cache_position, +-+++++-+ # past_key_values=past_key_values, +-+++++-+ # ) +-+++++-+ +-+++++-+ # @mindspore.jit(jit_level='O1') +-+++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): +-+++++-+ # """ +-+++++-+ # JIT编译的函数,用于高效的单token解码 +-+++++-+ # 使用JIT编译优化以支持静态shape和高效执行 +-+++++-+ +-+++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except +-+++++-+ # """ +-+++++-+ # outputs = self.model.forward( +-+++++-+ # input_ids=cur_token, +-+++++-+ # position_ids=input_pos, +-+++++-+ # cache_position=cache_position, +-+++++-+ # past_key_values=past_key_values, +-+++++-+ # use_cache=True, +-+++++-+ # return_dict=False, +-+++++-+ # ) +-+++++-+ +-+++++-+ # hidden_states = outputs[0] +-+++++-+ # logits = self.lm_head.forward(hidden_states) +-+++++-+ # logits = logits.float() +-+++++-+ +-+++++-+ # return logits[:, -1, :] +-+++++-+ +-+++++-+ # def _sample( +-+++++-+ # self, +-+++++-+ # input_ids: mindspore.Tensor, +-+++++-+ # logits_processor, +-+++++-+ # stopping_criteria, +-+++++-+ # generation_config, +-+++++-+ # synced_devices: bool, +-+++++-+ # streamer=None, +-+++++-+ # logits_warper=None, +-+++++-+ # **model_kwargs, +-+++++-+ # ): +-+++++-+ # """ +-+++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 +-+++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 +-+++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 +-+++++-+ # """ +-+++++-+ # from ...generation.logits_process import LogitsProcessorList +-+++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList +-+++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput +-+++++-+ # from mindnlp.core import nn, ops, no_grad +-+++++-+ # import numpy as np +-+++++-+ +-+++++-+ # # 检查是否使用 StaticCache +-+++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 +-+++++-+ # # 否则,直接调用父类方法 +-+++++-+ # past_key_values = model_kwargs.get("past_key_values") +-+++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") +-+++++-+ +-+++++-+ # if not isinstance(past_key_values, StaticCache): +-+++++-+ # # 不使用 StaticCache,直接调用父类方法 +-+++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") +-+++++-+ # return super()._sample( +-+++++-+ # input_ids=input_ids, +-+++++-+ # logits_processor=logits_processor, +-+++++-+ # stopping_criteria=stopping_criteria, +-+++++-+ # generation_config=generation_config, +-+++++-+ # synced_devices=synced_devices, +-+++++-+ # streamer=streamer, +-+++++-+ # logits_warper=logits_warper, +-+++++-+ # **model_kwargs, +-+++++-+ # ) +-+++++-+ +-+++++-+ # # 使用 StaticCache,进入自定义循环 +-+++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) +-+++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 +-+++++-+ # pad_token_id = generation_config._pad_token_tensor +-+++++-+ # output_attentions = generation_config.output_attentions +-+++++-+ # output_hidden_states = generation_config.output_hidden_states +-+++++-+ # output_scores = generation_config.output_scores +-+++++-+ # output_logits = generation_config.output_logits +-+++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate +-+++++-+ # max_length = generation_config.max_length +-+++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) +-+++++-+ # do_sample = generation_config.do_sample +-+++++-+ +-+++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): +-+++++-+ # raise ValueError( +-+++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " +-+++++-+ # f"{logits_warper})." +-+++++-+ # ) +-+++++-+ +-+++++-+ # # init attention / hidden states / scores tuples +-+++++-+ # scores = () if (return_dict_in_generate and output_scores) else None +-+++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None +-+++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None +-+++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None +-+++++-+ +-+++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states +-+++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: +-+++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None +-+++++-+ # encoder_hidden_states = ( +-+++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None +-+++++-+ # ) +-+++++-+ +-+++++-+ # # keep track of which sequences are already finished +-+++++-+ # batch_size, cur_len = input_ids.shape +-+++++-+ # this_peer_finished = False +-+++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) +-+++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) +-+++++-+ +-+++++-+ # time_record = [] +-+++++-+ # from ....utils.testing_utils import parse_flag_from_env +-+++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) +-+++++-+ +-+++++-+ # while self._has_unfinished_sequences( +-+++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length +-+++++-+ # ): +-+++++-+ # if _record_time: +-+++++-+ # import time as time_module +-+++++-+ # infer_start = time_module.time() +-+++++-+ +-+++++-+ # # prepare model inputs +-+++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) +-+++++-+ +-+++++-+ # # prepare variable output controls +-+++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) +-+++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) +-+++++-+ +-+++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 +-+++++-+ # cur_cache_position = model_inputs.get("cache_position") +-+++++-+ # cur_past_key_values = model_inputs.get("past_key_values") +-+++++-+ # cur_input_ids = model_inputs.get("input_ids") +-+++++-+ +-+++++-+ # if (isinstance(cur_past_key_values, StaticCache) and +-+++++-+ # cur_cache_position is not None and +-+++++-+ # len(cur_cache_position.shape) > 0 and +-+++++-+ # cur_cache_position.shape[0] == 1 and +-+++++-+ # cur_input_ids is not None and +-+++++-+ # cur_input_ids.shape[1] == 1): +-+++++-+ # # 使用 JIT 优化的单 token 解码 +-+++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) +-+++++-+ # if not hasattr(self, '_jit_used'): +-+++++-+ # self._jit_used = False +-+++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") +-+++++-+ +-+++++-+ # next_token_logits = self.get_decode_one_tokens_logits( +-+++++-+ # cur_token=cur_input_ids, +-+++++-+ # input_pos=model_inputs.get("position_ids"), +-+++++-+ # cache_position=cur_cache_position, +-+++++-+ # past_key_values=cur_past_key_values, +-+++++-+ # ) +-+++++-+ +-+++++-+ # # 标记已使用JIT(用于后续判断) +-+++++-+ # if not self._jit_used: +-+++++-+ # self._jit_used = True +-+++++-+ +-+++++-+ # # 构造兼容的输出对象 +-+++++-+ # class JitOptimizedOutput: +-+++++-+ # def __init__(self, logits, config): +-+++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits +-+++++-+ # self.config = config +-+++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 +-+++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None +-+++++-+ # self.attentions = None if not config.is_encoder_decoder else None +-+++++-+ # self.cross_attentions = None +-+++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None +-+++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None +-+++++-+ +-+++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) +-+++++-+ # else: +-+++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) +-+++++-+ # outputs = self(**model_inputs, return_dict=True) +-+++++-+ +-+++++-+ # if synced_devices and this_peer_finished: +-+++++-+ # continue +-+++++-+ +-+++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits +-+++++-+ # next_token_logits = outputs.logits[:, -1, :] +-+++++-+ +-+++++-+ # # pre-process distribution +-+++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) +-+++++-+ # if do_sample: +-+++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) +-+++++-+ +-+++++-+ # # Store scores, attentions and hidden_states when required +-+++++-+ # if return_dict_in_generate: +-+++++-+ # if output_scores: +-+++++-+ # scores += (next_token_scores,) +-+++++-+ # if output_logits: +-+++++-+ # raw_logits += (next_token_logits,) +-+++++-+ # if output_attentions: +-+++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions +-+++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) +-+++++-+ # if self.config.is_encoder_decoder: +-+++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) +-+++++-+ +-+++++-+ # if output_hidden_states: +-+++++-+ # hidden = ( +-+++++-+ # outputs.decoder_hidden_states +-+++++-+ # if self.config.is_encoder_decoder +-+++++-+ # else outputs.hidden_states +-+++++-+ # ) +-+++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) +-+++++-+ +-+++++-+ # # token selection +-+++++-+ # if do_sample: +-+++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) +-+++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) +-+++++-+ # else: +-+++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) +-+++++-+ +-+++++-+ # # finished sentences should have their next token be a padding token +-+++++-+ # if has_eos_stopping_criteria: +-+++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) +-+++++-+ +-+++++-+ # # update generated ids, model inputs, and length for next step +-+++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) +-+++++-+ # if streamer is not None: +-+++++-+ # streamer.put(next_tokens) +-+++++-+ +-+++++-+ # model_kwargs = self._update_model_kwargs_for_generation( +-+++++-+ # outputs, +-+++++-+ # model_kwargs, +-+++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, +-+++++-+ # ) +-+++++-+ +-+++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) +-+++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 +-+++++-+ # cur_len += 1 +-+++++-+ +-+++++-+ # if _record_time: +-+++++-+ # import time as time_module +-+++++-+ # infer_stop = time_module.time() +-+++++-+ # time_record.append(infer_stop - infer_start) +-+++++-+ +-+++++-+ # del outputs +-+++++-+ +-+++++-+ # average_infer_time = None +-+++++-+ # if time_record: +-+++++-+ # if len(time_record) > 1: +-+++++-+ # time_record.pop(0) +-+++++-+ # average_infer_time = sum(time_record) / len(time_record) +-+++++-+ # print(f'average inference time is: {average_infer_time}') +-+++++-+ # print(f'inference time record: {time_record}') +-+++++-+ +-+++++-+ # if streamer is not None: +-+++++-+ # streamer.end() +-+++++-+ +-+++++-+ # # 简单判断:打印是否使用了JIT路径 +-+++++-+ # if hasattr(self, '_jit_used') and self._jit_used: +-+++++-+ # print("[JIT] ✓ JIT optimization was used during generation") +-+++++-+ # else: +-+++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") +-+++++-+ +-+++++-+ # if return_dict_in_generate: +-+++++-+ # if self.config.is_encoder_decoder: +-+++++-+ # return GenerateEncoderDecoderOutput( +-+++++-+ # sequences=input_ids, +-+++++-+ # scores=scores, +-+++++-+ # logits=raw_logits, +-+++++-+ # encoder_attentions=encoder_attentions, +-+++++-+ # encoder_hidden_states=encoder_hidden_states, +-+++++-+ # decoder_attentions=decoder_attentions, +-+++++-+ # cross_attentions=cross_attentions, +-+++++-+ # decoder_hidden_states=decoder_hidden_states, +-+++++-+ # past_key_values=model_kwargs.get("past_key_values"), +-+++++-+ # average_infer_time=average_infer_time +-+++++-+ # ) +-+++++-+ # else: +-+++++-+ # return GenerateDecoderOnlyOutput( +-+++++-+ # sequences=input_ids, +-+++++-+ # scores=scores, +-+++++-+ # logits=raw_logits, +-+++++-+ # attentions=decoder_attentions, +-+++++-+ # hidden_states=decoder_hidden_states, +-+++++-+ # past_key_values=model_kwargs.get("past_key_values"), +-+++++-+ # average_infer_time=average_infer_time +-+++++-+ # ) +-+++++-+ # else: +-+++++-+ # return input_ids +-+++++-+ +-+++++-+ # def _prepare_cache_for_generation( +-+++++-+ # self, +-+++++-+ # generation_config, +-+++++-+ # model_kwargs, +-+++++-+ # assistant_model, +-+++++-+ # batch_size, +-+++++-+ # max_cache_length, +-+++++-+ # ): +-+++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: +-+++++-+ # generation_config.cache_implementation = "static" +-+++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") +-+++++-+ +-+++++-+ # if generation_config.cache_implementation == "static": +-+++++-+ # base_required_from_max_length = generation_config.max_length + 1 +-+++++-+ # base_required = max(max_cache_length, base_required_from_max_length) +-+++++-+ # min_cache_size = 50 +-+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) +-+++++-+ # else: +-+++++-+ # max_cache_length = max(base_required, min_cache_size) +-+++++-+ +-+++++-+ # original_max_cache_length = max_cache_length +-+++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") +-+++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") +-+++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") +-+++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") +-+++++-+ # print(f" - final max_cache_length: {max_cache_length}") +-+++++-+ +-+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: +-+++++-+ # if max_cache_length > self.config.max_position_embeddings: +-+++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") +-+++++-+ +-+++++-+ # result = super()._prepare_cache_for_generation( +-+++++-+ # generation_config=generation_config, +-+++++-+ # model_kwargs=model_kwargs, +-+++++-+ # assistant_model=assistant_model, +-+++++-+ # batch_size=batch_size, +-+++++-+ # max_cache_length=max_cache_length, +-+++++-+ # ) +-+++++-+ +-+++++-+ # if generation_config.cache_implementation == "static": +-+++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" +-+++++-+ # created_cache = model_kwargs.get(cache_name) +-+++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): +-+++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") +-+++++-+ # if created_cache.max_cache_len < generation_config.max_length: +-+++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") +-+++++-+ +-+++++-+ # return result +-+++++-+ +-+++++-+ +-+++++-+ +-+++++- +-+++++- +-+++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE +-+++++--- +-+++++-2.27.0 +-+++++- +-+++++-- +-+++++2.27.0 +-+++++ +-++++-- +-++++2.27.0 +-++++ +-+++-- +-+++2.27.0 +-+++ +-++-- +-++2.27.0 +-++ +-+-- +-+2.27.0 +-+ +--- +-2.27.0 +- +-- +2.39.5 (Apple Git-154) + From 03e95e4e202f66d0321cf2c9824aae4e38488afa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=82=93=E4=BC=9F=E9=94=AE?= Date: Wed, 10 Dec 2025 14:10:28 +0800 Subject: [PATCH 2/3] =?UTF-8?q?=E6=A0=B9=E6=8D=AEreview=E9=87=8D=E6=96=B0?= =?UTF-8?q?=E4=B8=8A=E4=BC=A0?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../\351\230\237\344\274\215emmm/README.md" | 23 +- .../assets/mindstudio.png" | Bin 0 -> 448846 bytes .../assets/score\350\256\241\347\256\227.png" | Bin 0 -> 77871 bytes .../\346\216\222\350\241\214\346\246\234.png" | Bin 0 -> 144470 bytes ...0\347\273\210\346\210\220\347\273\251.png" | Bin 0 -> 420137 bytes .../patches/0001-20251104commit.patch" | 1272 - .../patches/0002-20251106commit.patch" | 3200 - .../patches/0003-20261106secondcommit.patch" | 2769 - .../patches/0004-20251106change.patch" | 7498 --- .../patches/0005-20251107001commit.patch" | 7707 --- .../patches/0006-20251107002commit.patch" | 7931 --- .../patches/0007-20251107003commit.patch" | 8034 --- .../patches/0008-moe-change.patch" | 8789 --- .../patches/0009-20251109firstcommit.patch" | 9078 --- .../patches/0010-.patch" | 49453 ---------------- 15 files changed, 13 insertions(+), 105741 deletions(-) create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/mindstudio.png" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/score\350\256\241\347\256\227.png" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\216\222\350\241\214\346\246\234.png" create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\234\200\347\273\210\346\210\220\347\273\251.png" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" delete mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" index a3cda3f5..b29d1520 100644 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" +++ "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/README.md" @@ -10,11 +10,11 @@ Qwen/Qwen1.5-MoE-A2.7B-Chat 在无精度误差的情况下提速这两个模型的prefill,decode和显存峰值 -![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=NWIwMTIzNjY4NDhkMTI4ZmYxNTFmMWNhOWIyNWRlYzZfeldtbk84b3lhUWVNYjJCZlRtT05TZ0JubU1hMzB0S3RfVG9rZW46WU10NmJsWXFab01CaER4NFlHT2NZRzJHbjJuXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) +![img](./assets/score计算.png) ## 最终成绩 -![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=Zjk2MzEzNmNhYWUxODQ5NzI1NTNhODRmMjhmMDljMGZfeWNuZ0tzT3JBcHBlY0Z1ZnFJWHRNczRGWnd1UWFOaGdfVG9rZW46QVRXVWJGeUpGb2k5R094WmxuVGM5TUdEbmxmXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) +![img](./assets/最终成绩.png) # 比赛复盘 @@ -32,7 +32,7 @@ Qwen/Qwen1.5-MoE-A2.7B-Chat - 通过简单网络来测试,flash-attention对于长序列确而有提速效果,但是在中短序列不明显,有时候还会因为未知波动效果不如baseline - 官方接口 `mindspore.ops.flash_attention_score`会带来一定的精度误差,具体而言qwen的prompt2会mismatch - 算子融合 - - F.rms_norm 不仅没加速还带来了精度误差(应该是qwen的prompt1会mismatch),遂直接放弃 + - F.rms_norm 不仅没加速还带来了精度误差(应该是qwen的prompt1会mismatch),遂直接放弃;对于review中提到的融合算子精度对齐没有缺陷,我猜测可能是进入F.rms_norm前所必须做的精度转化操作导致的,虽然我当时尝试了float32也还是有mismatch - 但是我没太理解会议里面讲的要比较下放损耗和融合算子加速效果,我个人仍然觉得这应该要work,但是却没有 - Graph&Pynative mode - kernal/图复用 - 一开始打算用分桶填充策略,设置 `seq_len = [1,2,4,8,..,128]`的桶来多次调用模型生成来生成这些尺寸的图,为输入的prompt寻找恰好不小于他的桶进行padding触发图复用,但是毫无效果,于是开始探索图复用的条件,网上有说法是需要 `@mindspore.jit`即时编译/`Graph mode`静态图模式才能生成可以复用的图,于是进入下一步测试 @@ -41,7 +41,7 @@ Qwen/Qwen1.5-MoE-A2.7B-Chat - static-cache :没做成功,因为需要把动态cache 换成 static cache,bug较多,时间上不允许,而且直播的时候说提升不大。 - Profiler - 这是一个很好的工具(疑似),但是直到最后都不知道如何使用,一方面是断点设置和信息收集的问题,但这个问题不大 - - ![img](https://kxqaj5kr937.feishu.cn/space/api/box/stream/download/asynccode/?code=ODNmOTFhMDg2NjZjYmJmODgwMjBlNzVjYTE1MWFiMzRfTTh1S1pnUVZXbWdPRGU0MGhSREh5TU05ZkRaNEJCMGZfVG9rZW46VmFIRWJnMDlob1FUV294YktYZGNTNklqbnlmXzE3NjQ3NTAyNDg6MTc2NDc1Mzg0OF9WNA) + - ![img](./assets/mindstudio.png) - 最重要的是这个页面我只看到NPU的free/compute比值很大,除此之外不知道如何分析来调优了,要是能看**别人实际调优一遍肯定会好很多,求教程!!** - MOE分析 - 通过模型原来的代码,在self.mlp = ... 这一行,我发现了有一个if控制流,走moe/mlp,尝试使用走mlp之后,prefill/decode耗时降低了**20倍**,这时候我才意识到,原来前面有的没的都是**次要矛盾**,只要把**moe这个模块的代码**优化好了,就已经胜利了 @@ -186,14 +186,17 @@ Qwen/Qwen1.5-MoE-A2.7B-Chat - 在预热的时候,记录下所有被激活过的专家的ID,缓存那些在预热中被激活过的active_ids的权重(ops.stack)。 - 如果缓存已经建立,并且当前需要的专家 eid 就在缓存里,它会直接从连续的 cache_gate_w 张量中索引权重。 + + ## 收益点 -| DeepseekMoe + Qwen moe都进行MoE模块前向优化,decode直接遍历激活专家 | 总分的具体收益估计是从100->120 | -| ------------------------------------------------------------ | ------------------------------------------------------------ | -| 在DeepseekAttention和QwenAttention的forward函数里有用apply_rotary_pos_emb函数,而对于该函数里用了rotate_half函数。对于rotate_half函数,可以使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :]。 | 显存峰值100优化后100prefill133.4445 132.4821decode427.7311 437.5848总分220.919 223.3556 | -| **moe_prefill_fast** 通过将权重堆叠、对输入 tokens 重新排序,将原本多个小的、串行的专家计算,转换为了几个大的、连续的计算块,并使用一个高效的 scatter_add 操作完成结果聚合,从而大幅提升了性能。**moe_decode_fast** 将多个小规模的、串行的专家计算,巧妙地转换成了一次大规模的、并行的批量矩阵乘法(bmm)操作,彻底消除了 Python 循环,因此速度更快。但是有mismatch,所以根据LongPrompt来做dispatch | 显存峰值100优化后98.4848prefill132.4821 163.8114decode437.5848 454.7424总分223.3556 239.0129 | -| **init_active_expert_cache**和**warmup_moe_model_deep**:在预热的时候,记录下所有被激活过的专家的ID,缓存那些在预热中被激活过的active_ids的权重(ops.stack)。如果缓存已经建立,并且当前需要的专家 eid 就在缓存里,它会直接从连续的 cache_gate_w 张量中索引权重。 | 显存峰值98.4848优化后98.4848prefill163.8114 198.4985decode454.7424 493.2538总分239.0129 263.4124 | -| 通过 **Pad -> BMM -> Gather** 的流程,将所有专家的计算合并为单个、大规模的并行操作Pad : 将分配给不同专家的、数量不等的“锯齿状”token数据,通过 tensor_scatter_update 填充成一个规整的、[专家数, 最大Token数, 隐藏层大小] 的“矩形”张量。BMM: 利用这个规整的张量,调用一次 ops.bmm 即可同时计算所有专家的输出,将硬件并行度拉满。Gather : 计算完成后,用 gather_nd 从填充后的结果中高效地提取出有效的输出数据。+但是有mismatch,解决思路是:在核心计算中使用 float32 保证数值精度,从根本上解决 mismatch 问题+根据LongPrompt来做dispatch | 显存峰值98.4848优化后83.3333prefill198.4985 487.1616decode493.2538 490.5996总分263.4124 353.6982 | +| 策略名称 | 说明 | 显存峰值 | Prefill | Decode | 总分 | +| :----------------------------------------------: | :----------------------------------------------------------- | :-------------: | :---------------: | :---------------: | :---------------: | +| DeepseekMoe + Qwen MoE模块优化 | Decode直接遍历激活专家 | 100→100 | 100→132 | 100→400 | 100→200 | +| Rotary优化 | 用`ops.split`替代`rotate_half`切片方式 | 100→100 | 133.4445→132.4821 | 427.7311→437.5848 | 220.919→223.3556 | +| moe_prefill_fast / moe_decode_fast | 串行专家计算改为大批量并行BMM,减少Python循环,速度更快(LongPrompt dispatch) | 100→98.4848 | 132.4821→163.8114 | 437.5848→454.7424 | 223.3556→239.0129 | +| init_active_expert_cache / warmup_moe_model_deep | 缓存预热期间激活专家权重,直接索引cache提升性能 | 98.4848→98.4848 | 163.8114→198.4985 | 454.7424→493.2538 | 239.0129→263.4124 | +| Pad→BMM→Gather流程 | 将专家计算合并为一次BMM,保证精度float32并按LongPrompt dispatch | 98.4848→83.3333 | 198.4985→487.1616 | 493.2538→490.5996 | 263.4124→353.6982 | ## 总结 diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/mindstudio.png" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/mindstudio.png" new file mode 100644 index 0000000000000000000000000000000000000000..a963d5c66afcbf9a0e3bc28050d7b638aea560eb GIT binary patch literal 448846 zcmd42cT`l(@-94ziV{VVWRaY677&q~a|V%|bA~}hKtMo-I7CSU!w@8AMnJNXVaSN& zFytJ+?eTqo=e+0MKkoYexp%L{+Iu%m@9wJVr>dT6BGgsoaIwg-Kp+sVg8WNO5a{>g z-Ou;$0XGPUd>**n@pz%2eIK|2?^}KV-pM>=^gXp)tUZ0q-K{`2&MrzNfUimAR*#i}Mq0J0~lk5d`Ap<`r13uj1iZN&mpZy^@^&WF;kQ zB|U4Y?#&Y(;g#eX0iop&VcY_%1rWyj{O2Ih6Oh77DQ(~Mty!NyGoPTHy^}MJt16UQ zfc{8QEyE5$mv=Y)gT;H~Q`idicR53o=7vqChUFI8(081(pJP@ve|n#Qb6-Jz%hyG$ zhQUvvBCPu4679B4Q(@Hy;pj<1Gy93t2w#F%0|!nnL{^vf)IBdH=F5gAC+&U<;y|u~W1Bk%rzTd}ON@X0#ZQxO zg0o?93$*p~fhxyEixqgdijcMl1uV`I^_i>K?hjE&`0cj0U_VX?rp;aF$*1DH451u$ zSWJ$(%5!Di8}co1@I?rQKkP^beMqexJIMvrOcjtZ%0X>fOwwpqh_A!W@BZscUnyUm zkAAtrJddR!wc-HFt4RixxuYuwT~5=lHxctAt>#+eZTK5%D>kh5Bn=rl@ngP?rCn%m zgjbp~)ftLwbCORAdTe$6O)o0pSDn+?U!wAvsk;4U&LZ40_&A)^bHY*pr^j!d1v;O* zJJf7HU#pdEcI72gW-ON5DxTreGk)$xOrKT(BZqJzBWH45PYdkJ{H5N`Na}iaj*}fV z<_46t3}10TIio%bOG9)Zqa{*T^!usTh!xUv%}y7Gw9QPiBkX^z=ZzS%r>x|4RPZ=K zElayb4*pf^YSp9{THD)P;O3XhQ8XHjl_tm$xZihv*dR<6=T9}IlMBvb$(YTjBNbw) zo$#lKitY~DEA(#ik{>%lZ%{mtj}d0D=^uG#g{XFG*R&GB?Kv#m+f}7PIOr9KneE&} zV`{sK^^vo=)Rf7+BsDv0Xen=2B4{t&FF?WylGH9t7RhkePPues@G)j@L`7`aO{uv6&ax;vGI(WvNgV7ug%Kb+ulWL{m$(Tj9#=T8Cht{= z|JB&?)Md5b?&P;nuHDztG#hlntX?C-IzB5w@;;mgV9K_eCMgpWf9rkyxZ)crocYST^upf0a6{y(y*6{q zh>2oUU`dPlkw&*r-gYM|I@l@MekM5$58-`dH04ul=;aiGHP>0xS)YrtAerrH*corW z_2l{NO@V~8QkeGjH`-jun9{k+$>u~ zp{jRe-zzj;xBFM!UPbd&>#d;t3!@5{+EqbTOaUNIiHAFDGu&efinuU+7oyZUOx zI_BneD(t&!u6IkIwm7X8Z+fqBg<+rjqSXUD1Y$3J!;!7}+1c?6=!j-(s2J*k&M%ewsJv z7Xn%8E~j21XF3!^sA9jfG4}O~-rFyhu}WItN&DuVg#ZD`^$k*8m9wds&Ezula1V9{ zIi6@j`X(e0H9iR&fBS*H=hxI{rYwKugJQk&`{61o6`-*1tyky$J7}7#*9s#wT$gdw zedzIHm&<7I5M@$aokrGGC6l&&VjC{k=H^YEk~l#WFU_gh`Bk_F7YCZBncPa%TbvWU zU$cH1BUH7coiT-ZS@8{KHT#$A3BWtrB z9II?o;Z9OIh7s$iW3d0tBmHo0X&}yk@Q+w`Gu_(@bYn(b)Ec&Sj`TT5L|ErDD>9rc z2lD!^O>3JwJ~>&QB%r;|@cz}|v95GuP&jOs`j651s1AxhMY8mKVUdt93?tLMSimU4 zmWROOih@LW9>$qR=>>jY&i0(L&XhW;5-gkzT1*aXpT#^1LEhAW8wwOv#Oe0KpJHJ)k#E>(T}{?{ zof^k78s8epeVb{1)^2`xM%-TbxafMhQBp4>*t*~^N7ocTpD$5{T8bhn5 zT+ZKSw16IPMp`xN8c0|-zv>`WrXR;&+jUks`eU7(w6W~Fq`5|YfwFC#fv*p!3c>pM zY)zsyH^qUl>=cHyYS@hz+7_$t@ME`=qMcSVo&AsT)Gi^!ghXCyv)G|`qgi&p^i;q( zm^ewUk6X_3*BHvH)j=^#8roO8o*5c}b(B7IoiVn7WkSs5k!07N%CED7)?}V3(Bc9c zjk#FbUq_A^BHcWicC2*EQ3U_q^t%|&|aY7oy0vv=I zh6xpKiG?GQ%(F)j9l1aJ2-L-iS0_6;^8KrD@SSuxKut zzk$V6pA&CO6bIQ*tGoGqulu^dF;DSFct7#Vp(}@T_c9~k=lm`keo!7OWJ81p&quq% zzug}D(}52snjmh&2iW6wXgG#{^(2B1GX}wCB_E#pf|rA(XhOX-feM*G52dc1A<%-E z$E4lr=l?MznJn~}lS98zb{Wr3#y!g(87waiGb`c9h%8QFibCp+RfC}rcBs*z1M>Kd zdoiR6;1MPXea>VR zh@Gk5dn*;)>oytIX8XH_eC}h{eFA9^IKm#;6eKI4>MkC*>UD|+%g`HW3T^0Rft5S( z6c9?wYc~#8#kBOnvpgAFCIuXDl*Axk8Pwdxlk&$Bc^_+|3dBrvkgkkDW3nTZmEAyC zcd_o8(ZOb1*T>Fws761L zsT+BqY{j)^DKcsQ#4bOn*1|@XLmxe|nK9=EZ&p$7A7lSO z$yc;>?`ks5EO9scFa@^iTBRwYrAN>e{ob%tk;Z13Axq3 z`eSOZV(Jza%fCY0Lv41g9lH664O{FtyxmTiDKn*z<}8#*uU_vHQB?0oWuzG&yHoTW zH~MdftfhPwcs&#=n~;tt9nmLbYnM(d(=ax{&pz@6VcBkPWmp^;*^*$&7qowF{dl!P zzmpnbB370`FY|&{Md-FoM9*kjbnFJ<&8FH>M=U!9E)+`ZrL7BJ)MV5X}?F*bMN5kI>$2-?=Gu~+aeE{^O*1*rzyY)+rAc<7L~^gHoW zs>+z0Q&N&cPOOImQ1z=qEq{k#VgtA|^BprMpQf&|?R>_`aM=8(0U(R&zbc%%fZgzy zx5<3d#Hc-I6E8m@=zT4&G!5p*@m~X~8zDk8# zMCLt-POimu5O!73lda0hNX|_2yBOVUxvt|wxIq@b)asF5$UKB$2;5$? z@b+`d(m55w#IJ%a(mx;aNm-8{2wsOqk8!4%_wtwY8a2w+2bhg@e92DOKQdgS%qZyx zi_}}4FCRg32Ld{&&pv{)0;k6+XBBt_8fHy%h@*aa?w@{nD&2LobyCvc;}rL3ET3|2dC66Pv#<1^7F#h3#$Tp z=1!7){Q@0LeHfW0w{J#(^^$%21LO-LvMlUwNLT{?8+VvfrZ>_^OGp9IGJfO~s)b87zITEw+MJ>Qe8 zonf0slVhb=Uv|6A!?=R@(d|ms>O9x2g``VvfT2 zhdt$ywxaxYBl0MgjHbt*#Ef%m#q1fqm%E2oF!fl6KGCN|ju&g5@6B%DV2*;fR{d8d z){Zs1SUCpd)wj(V}ZT}P+kyobQp*(AKS}(mCLFu(C9c@N%Fsb!B9k`l1QkIj)W z+T(@gB}N#|$fqRXuOAura+>B;pE!^2v#;v=e%Cnt0A93f1Yh^%7TM82%)C45-4m$^ z$*F{_N+hlg(&l`FVpYd{?7KPMaOi~-YMWGHHUo5-Ud}Iqd}-r)RMEU-p&^_+V<>;JRD+n(}I&(q^w;o z;?gu+dcBW&NUM*UZ2D1Au)@u1mguq|ZbxV5jJCc@QNSOL1JpF~n1h3(IQUNl1A5he z-^5o-QglNP42D5oJ4V(`>(u+ys`_ABkw+NbtV@m)9-w_1mH#O4XH6xP@l5j13&=y} zWkz*@c2Jc`qFmXel|B)2SIUW{CSo=7FLiUJj6eGNSLa<-b0w}J$ZP46SqyUT3!-uv zHXaVMd^SP5Y1m!G@)^87EIAKrTYpItzpl_$5~{Ilh|)E&^Dx&Gar z5_12)=kb5UR=os0{CkLh-QS3O06qCf3&`VtGok5O&VP2|_IMD%;@;W8BFd;bG*n(8 zMkzF_$Gw(}LMII@CZWW0vRhjWiiv8mZ+*4f?;z=^ny|u+?p|mT7SRMCS99l4O0#Q#ol-GY~oB#j)r; z#wE3{3g|*$g|}Dgwks2T?;C-1U?;|BB2erAC%@#6Xwap z>~uc{p`gpepYyhBE`4ERhAmafXCel@@*eQYMmbWWDX)t2Vs9E@M#NZ7x?cyxt_WGX znY5{+eBDCgu`OE&pKCoNLNam@X;QN04USsQHOkCQtfrEUMY} ztwc{%2V}Ke3`h=!o_9WQQjEm*+7YKXG|IY2FbzVhh+kYrRSI4!H|HjUyzw&h6x%BS zEiuE7mK#m_^Nf#fXnWKhcX--@|E&q&$r)QNL2J@46j;LYZWfEy1GnkIQeZ) z<}l9Nxvp!%dtETKYQ#@K}x%Z;^W&!hXx$H8_Z2`#~F0GWzuBodVc7A@YV-O_U-cARR4S6$v zd3B3ySKDjroqGdH*TPV4TLEZ~if4j!saZIf<4n@W@Y-#AN7YKYhD8w1j4qR2WO22@y zTf5FY2AI^$Uj+i`Ee5S?>>}X~iGh0sidz#BB_8%h`vv6PLCQ&=4!{KQ|T zGZ&O&u}DOb?5Y%$!fTZZ+ivu^N?1Vl8ME~GueAhtme!c^xb3-je(M!d>O(gS^BrEs zUh5DegluiUm{Flirkc>S3RAM(_o9PI)>n6cGeNn;8mez zPtmTLySHxz9nc`lYSnZVAw_CgV{fJ94&iFoK36?i!AIj!&}^T>M_;vgaZHL&G5s{z zIt6j9Bn!XLDZtEWYMX=S2rI&56HG5=%bf_Kw4yg#n9SPVl{;l7xch1L5{qtYLR7xOst9N>3L;W zW66i3U0IaaAjM%N$pD?;8&q)qS$;=eDKUS>9;ksa1N&kt%EA_jWH{rc=GezG=g6R}CIL~lH zooXe-^CVmJz0JbiRj=W|5A>t!a|nOR)lYp2Es|=EGc7<;k=tdOJ*D_|t;ZLM6X!+B z2x${+iQqo3f1pcB+UsA=5+LE;1m*R&9BIxQYnU~Tkncm#`&Ox1G~Bp0MXDb!sPCWZ zc(nM-_TBU+MeZ*bNd!4N+L_sT>VDlS#u+oQ9n5a;TEf3GIy9MIbF7+QP;Xgj0u3l@ z&eh`Ja4OO---e{@ZkXd59*(wTEL^^U@0Gn8`j%Xf9+)N&oXi%VMo|7Qr#%=b(*c#3{W( zhzPy|y;I2Tp-Bs;z`gu^1`Y^4u=BzwTaJ~yuJl7hV!Pv;j0xH$%+Fr`WBEH^Au){4 zcB#Yp`N1ksMW6f`ZOQdgA}B906(Q3|N}`UBj)_QWUy**ZR0zgZgL;52-aJ=~#nl_|rD z>v4ux-w~O$FZ-pBS2Tgq;e?>s>-CA%&Z8Sq1tyZvbUSz7Y3KUCh~2*y=Mm+G-70i2 zEU7%At@*l5#hS#qNDJ$#M7$-B`7Q=W> zUomXLN1XsLEFR5$iRNQfM4pIc#gEyKA<=D{v5Su4pT^Zq=1{Hqvm^!Q3Wgra){ z&5pQoJ=U6v-@bjILb<&%$%M`983V#UXE>$K;A z)33$FW$ys%{7k5?-d&JQ+qPXI!EsNOC^4$qY~5=f=};{!SE{TB1&p+|X^Z`)^BWJ3lpb%Ua;eu-k8rR5)_IPG_{Lngrmbhvfs-^f`oGc#4omkK(nB8`K+X zZeA>?%Rh+d5%sq%+nzlC{M{hs_Ls&+IX^!U)GX3pCzOQ4_;Ygb^;F)1qylS7azW{K zrctRC%y}TE+v5$bmI6naJj6drqV#RL9C1x2vt#s->lC9OVQ*)3?R&WqDBws!o!lx7 zqdW>_`&)+u@}`r(3fl>Zk&nGYDOolKE@ z|7eJ7eu_`{UU6`TG{R7AW^{B~@NcTrxQdI2lK=C_as?tvIsnpeak$y=U`yw;l@ z=32kJa{2uO_gkJX3fiOTX=0_^pnS%pu+Nn6`@P?u3gCLi*UP#eQEGCFDycZMlz8B8 z>-?zE8*4d@9W}Y^F`&Z#PsMzl;o+a~QKEg`PML9g_r;=2^L}SP8P!EEh1eKiN8l_y z>5wZSlXlNMV*!kx%M5R5{4R41o1H7^dp>1P)D88F0x=K~R=DNbc=A&T|E>2gT0c<- z9A+z=)K=B3f1fhw9Oa#}e?omE@(z&Its^ zEcSJ(EYXs>_KP=ZM|=OdI?#7qz^I~d1^6v;-*t+7_WYBq{V9VFi& zbC>Y=W~wW5kdOv!j6aKXYHN|fCSAM2pDnWi@4pL_n0G?#`QzAm5-j~r^Y&W*;Y>jz z@u`{lTbDU@wL&Ud5cmhOI6GM)9+l8`a~V0gb?hv#;-+4r*~vt^OV+4l;0flwKgNc= z%lLD()%?lhk7=1+I}x!#nlXjU=;@=dcdcLRg2HgbFDD2g=7st78i`CXAE$$NYQllY zTjB>ByM~koT$NIc88OBh)lt-r={>70jylhszj?Q^yu9HMe4t|D=u9@FsbTpWVM`NoLyf3Z|C{i=Os(+7j3t_2(6mtXV`$tiZsT$9Y$vk zjbwfl(Vf5oT}j0FI7D{cL|r?@HO@wf)A!g!M!%N#VqPEZ-Kc3o7SDON3H`Qf$y0Wq z5GiC9pg>B&HKVD5^4|?II$A0V)YpxY;V5ykci7QZb1CB;9!`Klv(wYr&#u5%tFho5 z@e92EjR`)P|4Ok}J~P`?yBn)rLl?02!J+t*PYh4iPLi%FS0I zVHM+(R#1r8pn0tAp@b+Z>bJegrXweFn@=jYdG}wLebY1)rYt?YmL3cdIkt10UN*Rv zy%7;b&rj4&Q(m|KD4#|Dsf_&34y%QlcXO*i!)O=XcX!-B4>Z}E&nwa39Q5963*IsQ zFHYiT^UtD(Ki9UvE4b5&PcKjK?=>BA>IGnTWu;b41&(5zJ{@2d`XEez&ew^#6^Y}l}0A&4&9V9$B?;*7L z&+mZvzWxtc_1ACyuQNUW&rwhR|0Xo)d;%g#(*c}s7(2wR5y-{G=;+wgGSEMK8gYKh zdgpGK&T*A{2gg}Ljwb0m+MXM?wbdV24=ne#lu+|qaBbgTN_4GF0TBGf!l$R%yG~@k zYVLv1c;PthPuriIVg5s%zW6-!(j=3`-ZJchp|>k<+IF3cZcn_f?0sGS==vXtJ*dfs zEQ~hFyxYEdz<<7$17zEX+wsq>$BlX%MvM3%x_LEM1NuyB{a*t-ZxDxyP8Xl}lKrMe z-)t(XDvUMYrd2um)xUmiF*Y)r-y84ZaMDi&$+PO~t+q#t0)7HQKb9eh5O4G;f|r}W zFoSa@CvlVw!UAf0?W^s5UDf6AXa?T~1z0{NDVeTQ=+o0?rsGqBWY1-CHeW|&Ao6+WCuiOvQj*7s|_^wXtS9 z;!xI9+}Y{RVPu+3yv!z0_C-F<^`(G-O|?UpcXe$wv!I|ZHVNl&%>}hKz;iU2#@hF1 z3=NopHL}%RlXgfzAR+6@$!fa;5`ct)FC}Q>;^J+5Usptf9U7le$0{n4+M>^+Q_X7B zA)}+}Wg(cy8(Zre4}vnpf{IF?!O<>FjP2Z2*0c9Y3@T0S$e`{oMMh^v^Gw9%N7svb9|YwW*c1jX zlb0A)>h`d(_#Y)ESKF-Og}$IwfUt3xPzXAW4_mQW){0rO=FE!u6z(rg^WsPY8nZn} zx}Yc9uaom!QLaATn#s=2ZgpD}-ojg6T3PRjmCVn}vkZMl3$=a;&N7?7DR6;TR!T>a zaE*u?)v$;$$;HXP94k;=n;2@h7+@EitCS&T)9XvjhFE~?bnsabyDp-aP1j*IOhr*yIichA*qMio=Td z#xgOY+W&`WzK!)pclMS4S?DS;PRzBjS+jSuu6D8Zb<>Nx7`LkV%|)G_)Ip$FSws}FUjZdL4LPGm)3Y;dd&fdO zV4&HXK{^Z6=hBs?7)p`CR1X^lg_GHvPf2Bk>Mt}Q?`|Ohb4CD`JMuhu-f@l*j75w$Gp-CYkF~IVrQXg9aWMg=}S8M;`%CO!)TC`DAU{LaNYwNY!)ekmi z&5oimax!X7uQ8TH4&$8szCIvlEcCEZWs^ic55d5&x3^UfeEm+h)Hv6RVkd53HRB;O zKOL;eH0$=;H^Y3qiRr1VatYE9es1B<*a7) zhCMwjq9%Td+3NaisA3HxrKo4lwT~YCsy=GL1hS~#+RU#rW;LpMnURqp3O@XD37&rs zMDlf1j)G|rEzPQh%pV(@>pII3wTTG{c>uN)!n38QsGJPH*v738lWcf;CEk`OBjGfF zF(@S=5S7x+;^K?vN(0@aupdU9;7%(`E4bs{>uSG8W9zOn`IUD-Kbu?0s7lWAtqf~& zjAv&}iVz|NGfjT}yK!Bk09ZYL^11ETBMyU>7d8mQ(!qs}wstWwn{H7)Tp=Ta$|7hI z?QLak?XX#Mp`+!YdUQm(r08LHlMFSs!gpU#vGY|p}D7E__!;G3VdT*}kuJpWqOGH3dO%cyX4l)vY3(E40rQmKx^I14SU6jKb% zNO$)qLB|F?0B_hRvEa=mIuIfX(lP-2t9|!0{emPqI=W(;>w=-?`U#nt*(no`$57+z z-=VBf$P69b#^$DuwQh~Bu@^1`qLm1R<`XX?wDq)dD`cXo9c4misAKKz`&op9%qRqB z5db~|LB%efH0a@{CD)fd);c@^UE$l*@4H(Pp*l(4(W16Bafxxe$z?sYaXROK?__Ua>Y9oK3z-0l;#HQZfQr zrYAl-+uPf#3%Eq^^)Pmk=vmF{9E&R39$pS+qe8wh9eX=xO*Nm|DMKS!={wlTwx&|q zkK!Q=(fA<$W@-U}Q8peP1sR5XUb?vQAiDN;VC(dLyw{hM%qGSr2O#yv3zHJX)npc6 zo;F2@DxU$sQjgZM<*0vS8BmAqK~s|j)8a@IjEpX{V~);E@sfP2(JuxzqCV7Ma%000 zH7DL^$&)v>IS&8)MIM-K#iS!E3&)6B@V>=wCx+ci!otO6#->}Ut+V_95ZfCuMnO(h zU`mQg1A{}-m^xk??-f^)?e>P3SxwYor|<79udJ{=#<9QIt_8MLqM3SOaXKY22}$4U z@}fM;7E#fuQPea7uxOxAG|0x){MkYd-|j3Ds|0C-h>VOJo9x{w27zXPsv?h}u4(q# z+GCT#5)H301r8&P>;lO%GAoS<4 z)>l+iCPEVhXhCakjbrM;QW#f30Jhn`Ku=EAyNhNNm53ndFi+SxIaLw5WmFtoB-wYPPB zesR*2pA^+`{*#bK$JF#~MTL7pQe5`oM|Sk=RZC!DyucJ|h|jck>X+ik@!vn7PjRr3ok~ z+9V_;<+1o_cW$a4U@OJ!97ZiK0DX!EG$@ASZ|i9&z%$R63KBEY5J^!g?gt)JpgErV zpkLwxugf1^hv2k-03;%tTM0SxDleUhZA`1(A{~}N{&^3$xcW~6f zkPoO>lbL8V)5Ik@nu?UBEc-1Ohz^x~Dbyi>5@m*loUEdZvH+0@vs3lQ>tC~7CMN0z z=CPV}6$g~xa?*!@67|N;jut(oF8b_Xs>)9v&I0fK-aj5XmFgBWQLJlFqMJ`&(9Mkl zUHCAQNl+gi4&ck{$znk-lZJa58meC2p3EF{zo)YAJ5w=@7WMm8cklc{U~s-HbK4n^ zVUP8qq}wf!kLrsTkR*0f#;h(xO$~{cXB8WZhyex$1}i70hM|AMBf>{xvy<@gaW+X5 zc4VXuZ}3}sX_~8vu4)nO`{E#%I5WOPl$`lpUZRx!mtE-FCQ;fq2SEkUXmgN zLONFB+FU?z9I&8}$jH^$>+Sieoc?}1FE1ZEOUsv0N$g`0i(RAY!RC%zs#$97($bMB zFvsx-B+6f$T0lskupTZY5o^2AK(9-3qZ1g45N4F z1fE}Q&~bH8?Yakfa0MLs?&#?3<7NkvBOL%Odia-*effn4%K+>Vp~N$(Q%Ahw&TE)D z*g&sq6!ql;h|R*rq?i5bIThMzW6L()vM9IN<|At6u^jpnHQ{m&M41nu@3&dx2D zU}QsMBR{`zAzuO8@_|3#A?frCQ#CcUyZ|p@S|&GZl40*+r(rPpl*716pUQ-)Vdrf) zFs8V zZF?s>T?eJguzUA9Ym9KBG__1B9*2ctVveCuZ0%nPOR8lKNE28XIt0#@#Tj&rYQ;Y< zWYbnx*K~{YJkBLGKSa6|12Gl^l9+7vWaOi*udTfE$p;|jO8E3ic$?g_s+LVmObyV2 zmRry5dO6wA>FJW|>z4A(3fKulf#w`YV`rg<2L>e#OK7X~9(ue#gmaef6u){_s`3+O# zla7KN-@0NQ#GXB)4}S8*#`BQgLM$N@3O^nf^lK)(X?xW`!f9Nb)Ar~H|8V_#{_?|1 z31Azk0@w*;%=S`~PwS1>fzqOT{cz`9P;O3oWK@)%qj9>#NhlZV6d7G@R|#GC@uu9( z9tM!mF%x0MWF*_Ay2s)GnM}5JvXW}87LdrYe_lKC=l?~m3uL4?7O%e$fZj}ZIM=dq z(wa&j^|F;JC+|tq`aN@&>H5~ojbl)(dsccbzPLZ|D$44R_PV}u%)ZH0iZ_2L&k{Hc zTb|`c9cxonMjxKHi7d8!3~6VaWY0y;r^4XsFR|X2Zq*twqVIFi9OF)6Tjv@D+Rk9}7LIMTebT{vNRuTJlwHj6Bjp;&jy4w)2^lvd4~1F3n5 zW8|-Y3aqmAILB{|50Qmp(6T)1S?0;R8JG(S?{Q0rptptzvYfT+uvnS78g7L}dP~KuSD!fp7B0TeHyappC>L^^7c-4Y(6~fuY zY(!;9Sddp$73I?mRf63MdHP2a;lXq%vPqtzD#DafBQd6ptdZ~LX+9SgrmU4OQ0*H96}q6<4@KX_A#}&e*ts{6zK1!*9#(<)>{?es=2z+Pb~=>*k4hiUBXY9o>M@qx~prAYAhC*Zj?w78>d8 zq@4yMUn^9wqN3o4=^xi&{>mekpOH6f)i&_ zo5*UMMrLOta+Yu3ii%QvOi6C3ZO5`8f5*g!kcIhYn-XV!anrl}0ZBP8CyoPBPSHJ{ z=dEaOEfcN2ydJH3^zljD4{P5Au7NM^XMJ_E&f3Y|-+8!kFuKZFI9X^PO2Kwm-j}QLu3LHWo{`x4q(Cv>G6MrGLcOlZ; zEPi>W78|F?2rnv?@%vGv$O&IeO4ifny=XeSBW*6|6qtToqZPc-l2d_|HQ$x3r=p-; zYr`bd!SFqkFE+A>)o~hDSNHz7hC~x?5rU|k#JM;@9UR=8^Qd4R6XJxXh`G=bGN~;` zBIb_H)!0Sx@cJL%$G&qvfi*(0iE4zrfK2wuH(l6W;u()03NKG!oj#(0ruRzLU9KAcbs!ibkx_1m3k zu5Vx%7j_PIetisC5*~tYpMU6&5!8ee0eZ;KsdyCpCwfp`<|5^w_PuP#Uy*S z`kfKc7J{fQ3-$mj79f%S7M8Q-EL3To)%EUxw9;wQ`d#bK2eBHZY>T@)IfPX0bQ#1i z?$dsoUOP&q>2gz79mdZJ06l5l(CKrd3|D9NY;y>fTd1oUSln~n%U6zo6FMr#HqrxI;#9i+2=Se(^5 z_CUGaqJPCYziC!hA1CtNU2Lw;&I8-IU2D}BWt3-Vk6)g|tKWhSfeH0zHbr={aI!v) z|KU1Ex^G`2%6?PnCxqxY1UrUUa zn)ZDC339GY`}ONlr&{{Ax(9^ad3?61f`K@{A@0cBkg?o0!SZRuG+I{dvKj5YDy#F| zny+hCp5V7A=|}h#3N+W2srdy+nG5X5scXYM8C9M_zvXYPHQ4x4)H0piIEn@-H&UOR zDbW9_n>}}>Sl#m-YL>uh2dM@&Ym#d~#Z*_@Se(5n`PX8_K9c3TUDFFuVF+$+;Pz+b z@dZ>xWxsDH+#E6H>1%%%A=>N~F5Nu2IL!V4F(WKV%3tfKm-Jc9+&w>PbJcGWmHFy* zXq4R;Io|H&F*^foBD9*5Gj*f$%g<1BR{@;)rzWk%yRLk7a^2kqVm$m~IRF>hB18|m zHj7Jy)?%lXnNcYko4-;Io;|B&;G@*nC;EPuKPgP--fKbAt8cd;GPH>ZYrhY7v`y!r z9l7nJ_N02>-75ptl3sS2D<(#T5_&a+IgX=jM1mn{cNe`JDKe)L%nhTXuo*XnzgCFf z3tRt<0m;f9@qz$0i1vI62ZXpEiGD5^-@17G(|cKVUxSzdRa&5#5+)@jMN2&}50o(et{dWT(W#FVJ`nugk{ps?^QD^~Jo4Jp7zs^_2ii&)2devUpQO|!& z;WhF1SpZy_u@ADjDPbbdHp;pd{l=<;OQW^4iOW!OqCuXrZ`yL6ayBfvvm$^~MecVS zeb9kQR=Py;jD*Do%@7z2mFTOyyC-_x0Qs-(yp7~Ea#lzniDF&aZHy$NQZkI=Xu4{<_SI5<`t(wsY256sCKYpiHIaQIcE-H2Wf z**a&LobTQbmy0nI&0vgABpZAG@Ru=!^wyiZcL`c0|`D`&|X8Hr(pgt+)Orf%39bE2!9 zi;z=4giHo5=1)v1G0DmQHB?T{i&A5EGMc(oPH*&FyAM$2zISgfkW6hwes{j#Bp@u= z#O*9UT_|8hj2E=)dFM7*7B#MJqVwQ6FuV$;kvyYXkfO}qxI7hLR~qajxn*U^c6`QDUkC%m+{V7zLmtXiz*CG&Nr29ToWzBd> zPSv)U`CJ8UD>^mJ%+B71_8u8a#O{xJiIpN@K@wk}+`+A^TNW7l_N@fvbmz_iA?t@X zi-V|Xp=Uv@6nhsK`ZYnyud?h?s%j{6>is04o?+#SOb=saBV;inWW$-6ndu+n@eQId zvEH}ewX7r~icu~o_&Gt6WIjAMr)Z@!w#zfrz45bD{wxU|Rj zYpOpFcG=k-Rs5HOp-ogA8Xs?4f{%Q7Pj}~?)Qjwgt?HkWl0puTj(_bb(^Ez~#0oE0 zj+f5^wR5lS?6jlz)J;uWfBppZXhYt<1@o&=I4`y>aA&K3s_`WfzGf6(y1fe1byy=W zp0pX9bMct;6e3j~H{h1Caft6Zb;qqO{ZcadYRnV^DYraTPtBO%;OL_Jfstw@dL#ly zX?Vf$LLwBJYcg7$>S)WqyMbkDvhc1)v^tj`r_bC65ixSR5fA$E5(@yR{TGRQ%jAxd zE$K8qG2Ju|n6-)FzOz^ko48XK?5;Bjq=m{wuH~bNp&zAn;(CjR zID4N0aket;rJ*U6y%N7DCP$4RS-aNLdG3KIJ-tFs-4ylgg&|^7{WNQ%N(DE0t2y5c z79Gya$@J_7`F!Exih_=%<&65To%M^JtdZJ^ZLCj_;()1VwDIErIW03g>(j?L3n!S& zoMfTX=4{=q zS@HpKh_()jn|mPtsxP2 z?uLebPZbZBA!Uax4@!AE&nwCWIFn==n|TvYIwtu(I}U5TZf>6S^z!VkdVEXd z@Z{(bRU|}_ezM#N8`P>^S~BL9@rwEVz)nG)B@wV?+qpk<!hROx(&j_ z3-#86yx(qL)@o~6P*l=*tzh0xMBrsE&9D7F&K2bKg@$S1dg9_-Sy9oF z1x-ux4*0_Ctc?_@sONKHx4JHj*XWJfYn+!9Y;tV9tOqI>ydIsNpj3QndJ+mKsH?VZ z=Y+2)FC`>=?wsg89lY?%o0EAjF;aHJuh$qVl(wSpVQ^#ry}KVN#KM63!2uAPA%%{h z1O>7i&+o?#G&G9Y*-rd_#JzP`)Z6w4jDaW;2BEZqprnUxP(-A22qmPuJ4fJv($WIb zJ;2Z%BOr{lo-}`V8OLvG!V@^;v7Ljf^TN)*py5zxnzJ z%u;G$l;{JWt&y4Cr#}|jm35VyyR}v#G3t+t~U*{{m9U=7pW84x`nzwa3%N17!{}U7at> zZi!6kXrIH;_MXB{6-z!3<&#B|AD8zUw(V+^$VY3f^gK(JDYfom43>)JbUAY)ku=ux zruIDA(C-{FZ$#WrBv+C%pEP>4Pmou zWDl9Ok+?V%j0{{vkLme}RN9^1nNSUA{YtD|3uqAh&NtFF-yxXnPloG5ie|X>sp^J# zYjfF?4UM%&C94z_f3PZ%epA`Z$k8;WvnjW7Wxs6z&20m??%5OE&)NwCkMRbtPq@6J z{MyN{NyOCi;fm4#fOli8F`A{=hdg$8IE%~NB$Y4T=&8cs#oO_T_jkzOeB*YfDDO>@ zWll&dD+_kwlx}Tm1o8*3*|i|5q%!F1rQRr-=;1-l{Q6T`)ox9Yr_%7Z8aWWmpDtF* z-N}8~vhXXdBzC@Y`2}fseD1>yX-4l6>R~b9wSbWIYo>$2(rdMEo?QzLmVNj0VXe5I zQ{q4{KBIi1+@SAyD2>Q1oUW=D?Cd7{-%2Z}1iu(h$R&w}eB|$r3qxOf#N{H)Wzw5b zo%`AP1E2kYGhdl&A3G7X)l(Iv9~pPq8IpT{Pfq@sV}7&s33GvEULjn+JyeT?j7$ZH z2*6-zMUAnt>>pab^&3hvJor;W|0yo+5DJrxKpe4-!0)~Og5a&P?pH#@Cyu`d_kVk- z|A5P+r@hJ=DBr=65gC2u`yuQ#UtsGLKx*@;MFKA_bk~oD?-K;@U$%t&!4^$ui6YR$ z(wZ;4jSYQzkKpRH`D%W(d*r)tw9U`#dd+eN+yuEmfDOvl2($)2cEq9p?vR*DnF3e? z&`X&xinxO~W-5nS}5MWlq z_-5KO+Ctpj*Y>O?OTud%$yIbdIMyHxxHNPfsh#Ux2SCHk%^$Dy=!ir0hyiSHVwTp> zz;j$-%YxMfDrXz3si{-5uduh^;%eIu^IX{@`vJ8ah178C2`Vc=aZJG$;q51X@?&?O z;%pCx6Nq3fl6Y;62bYUX84yy?1Ih*(V!iXubFaA&;;7z+3ssj&o?$PLA-eQo=H%%9 z;}@7J*O{NBvg}J-so1BV-@jL^Q2>xn&wZ8BYtAusxf!qN2a}jL?bybznSFK+4#SpX zd)C^G2RJF%u8^}ugb(lhNkim>ZGFJRwD&_Cd3ikdtpM_qu07eA5l%G?$ZmEYG%=fl zW5u5q%QUcNDE81NH6qEjE)MFw>eF=CgXy1Q6Z3hSIOKcp=}flj3m`5gCtV!&V^IaS zZ{FBVj<6?(-$PM_qjazdL97O1FeENn#SQU0SK{ zrm?ZMdrxJzPRUm{)^^xZaf$wv#p}rHaKs#0dnj#$@%tDJHM>1F%(Yptb&fhv|D|<1 zs~lHX;xNlM#I_fE!=j3X1ybqWi&?+B96&zD8n(~4gm;i)IJJ@31|Qn~=2xel5k5#V zQgZv|0CFzpQ#lg4$S-@l1LUK1#t(fEsX@`XoSwa2E||$zj)T{%kaNM7@TEFRTafuEW; zV}^QfCX1WXYM8+2=Nkhmk3RzkR87+mxWQla^!1Y3QsO4foOL9K`m-c{7ON=dJy0wl zxk(b}#|$v&nGG(|e$?TI5nW1BMv3KbMoYoZ{%7HA{yBbqMIXnAb}AfYkY%+KPv<`a zpZa(^R#Y8joX@V_uT!%3ReHoJkG=n!@O&RSN@x^Vv zbaG4mfMtz({Q0X!bM}j$DpK?(f_T>RFFq^ZC!Gj9AiPDcI1D`Gkh1Gylcq ze5;SX-WGrIryElc(y&msG&FJi57xR_19;uChk=B@@cGa~VAUQdGMHKAnQqkFnRLW_ zzGemr*9`|7J2_SSVY1-{)>;?}18O~812J^Xi(OgP_ch;uk&e4)60Ncsdt-PmN|u87 zCL6Eaw*R>NC4s-^*7?hD{Y~v+u$Mdl7 ze5N$@uGTB$m*=+&)L9qK+Q@gC>cu`Sjjq-uS%2W|pLD`Wo#|`2%f9H1xugq*{YDM8 zwjWq(9`1(V05`O?du+u%Ej`uO#GN{A|v-@g`GZl|6YE z%~E^mTb#~}e}z$w;MAINl6s27XjGwoKQ6h$g(P{jLe6M|{fz@>=Qp1Ndrb}iTs>wn zOk_;rEzAWN{4RULw}22b!^B%%x;~L{5_smemd5AfJfW%l35av7Vmg|gh?}>>YD_i{ zCgbk@Of^TYwL*^Ka5MYl$F&jo`?l}ydjgjkAwkG_tFa2-(?(}jO}A?rjcCF4uq5%P zk+Za-2OJ+K_74uqKn8h=`}&<6g``({w#0xTeva?8j}d#tK$4zSK{8SGNXX@SPY?E_ zr`V@CCVwzN#FWNY{N4oTSgkkZM72#$K|PxAs6MAEE`QK;VBFzLrKS3ugZboXiqLII zd?nHVpcA+h&x69N)Ng&hwd2+RhEd3i;Ba_Tf3W@FTTaI*PeC3GW-{$3sqM8EHMf8o z8@u2gN(08H`YGl*PG8{V*;=3bwlBwa09LeBg9hv|MSC zWqalOnEC|~;oU*K2Al1f0+F-U%f(3AOCi1~JL(9Li=XBzm+$4}ds;KE`|Q$+g-;Yk zxY5|Qb6l=e3!m(l%RLecALBCJZYFL>b$9Gr5Slq}etgHvt0-#Tpy@22ZyU*!vK)|& zjQDlCHa`=3ghfhc4|we~L6^ITeN^A3Q@&Ecd;P^V*tbbk*Jp>rYNqT?{MOT0nZX%U zi1#u4JVi%nb#47A3BK=je}8``XD1w@P3xjF5U)^lGW!DWRUjDr*aIWou(cA0KtiqJ zfIFtL#+_rlgVsAINbMsg8VwgY6PL8?dSca;Q&dSx^||@t8UVJhJwLFBB4{CyQftbt zvem8iN9>-Qk5?3-$0C7|Z7;(W9#2+4#Y#0}AS}9ohO#tT+uATPGsA%lyDeptk5w7p zEn@3ocyMQ=NVntS)Z2`m5HkELOLhN%Iyuk{S;baCO~dnouje-%HTC8v3BbQ#M|$?1 zoYU*4Gy^+ke}B4pc?Z}+76SUN_2I73dRIQ(;?GvRT_2%hJ8o1+dw#n#UpLmyU}M== zzM9-`0SX69cBALf_Af;raa%<`kUTjW*Rf1UX=Z8u%C-Ivd4J!FS5a?YcyI<$h>Akv zZzbk@<*^R`bNVsaqa1%Jjpz9JHEOt69((tk*a`J(>+jh_j|v{;;10J@kjoE8B)vyn zlj~~6T#NVM-cV50-_j&s=;8^CB62Qb%x$UAe}3n)2P-T@r(wm5!@N}=`zo5$vn;E; z#H~!hII?}3H)IKL^9geT=7}AJtxYZ=0*2NmBkMKXt07~;l`ka5aUx`2sfJT`6m!@- zlobkPhdk1O4UAh3PTO)dQJI9f*iW00_s6uxF-9-jBwT3Rell$i;zeV?QdJq#+f2#x zCV}u+h#PnCb*GA|-N`}4GW|^pCAx%nzXt~;7?X1V;Bx`GXv|R~xbtPx-X)A`Fa5?E9 zmj+6$ldKrIsMM+rvFZTN-oELN^E_Avs>cC*l)S_cvCAFe&)7#K1NkK0-i<~?#BL?- zgC2=VL_oI%Av0M$Pb90X$Nt=M$$TyYPfgEF(qs)Z*17_`vmeDS#Bq@Js~EJ92}v3u zTD=@mN|QSC*<2I&>M!|!AGv5`gMg2(aYCV9eQ$ddkDiz0q@i_{-$VZV0(pJ&To#l3{ee7|Ye7)D7C| z73qlYF)_2Xq$dlR8|D2>#0!|kz`$E;n;kuZW{Hd}@raf!9ii=-2pYtN??rVHQ2<~~ z1s>oWGkPz4m|61Wm_~|0cWI9=TEzsH8$Z7MAwfLZw*d8sGVam{d*?&WuKRO#_V=CK zbCF0R&W>nBk*@4`ZMX4jpMo>&+Lq#h6J0T9Y={gsL2cCy%D83`5D=8v50dfPjJG^g z1j?}K%zX=i;SxTsv`yJ*xLiKR7>@-H7gM?&S||#iuo`D))|^)*9gk%q1eS!Ir+UP^ z=2H{+%{k}nzJXL#UAM!>DL(6?9!>Kw%WsH}pstu=Zc%mshxdivwA!ibj3WoRo`*p5 z$FJ#TF~%GmOT?F+>q{boy2V8B`ehlM6z}0LU+E!uAc)EcXg^@9%`8QBDh@&3DghC} z_wB|v!o5{&NfBBjWBm%I$Q{M0$78JZu3P-(S{@$ZqCBHhu{rhVMaBCRfv*f(cQUDo z2?;@D^H}ghLIsso&eF%d?WpiWUO-yv@}*uh>7HU0ghjrZCjwHXe};+g9-yeddjnw_ z?ziXHymSM>4NY#3mZMvzr;ps^Jx3-c+T4q?z#{CSfD_FGgyN)8j{cLLT$5dl>Q-td}vA zQ;`oHzn6R!+m(afu;{==L&z-Tz^!huq#is1B?t4o^?vKuy11zyo*42I{G?`2oip*O zZ69Iu&w)=4b(XiUYI936r8tbb2V+?P6rCI%_xLh0BEjUvks#wxa)N`GL`-$0%SvI~ z2RAHk*29cL(-0B97twlyO^4%9yX~LnH*Fak14o0PAvjL1E(23z0%vuVOZSS{n5Hl3 zm-w{+vaMhBc}PzJ^!Pp7&%hGZ=??cczGsfj@RiHeESb_s$)tLwt=b3lfY8>AZByqptllltzb-qE8Ult&A9fra49 zvuy~*`DIe>bxt{B0$iH29_$QX!{nyR-fVo+k3?rmxAg=>sHRUelcE9oFnXpCT2wUQ z^0KZfQ0ckbdv`JHo$x4RuVn%1mS+_xkF3;)8V8XKnJL#=n#DIu3A9KchczP)$p^Jlv6tE~ns&2o1V zwAAWkZQb>+%go;5g(A*C-87{_>|m^_wA;^<&dCo+Jl2jp!df zBKL`C@nsJob=K0IjLgnP8E7ZG-F&|G{^39R_*X~2>$vk>PeUy%!C2T$d%ur83T%cA zTaMM3cTGgO3P#Vd5PSRELie6{U{Bo+Ysrr{OhucUbX0P8c@bL6>pYJeT#q^2jk4{V=FE8foZ4jGe zG|>5cuDHw4$AhD))fhJUXA|Qi3a+lis*Lg#7^c~aWSx9!tz%u>M2`?+F;JV8Lq3q0ar}J)D&?^z$10qn3lB9~TDFay03!`3asms7{;6 zeU=0f%Fhma3*jt-CDh=O=uW|yKS2t@aIJpIRhGf7h}L1{O3qKnrxLw%y0ovI%YL`Q z_`b36%|dRUkO-byYbU-&ev((MZp)G(H@yGZ0hO#;p z2V)!NJfDv;pXaGSVm=fLt^cO2ek>>c>c`@#J576Rt=W3nBbo6~`~o%df_;Lqp%}B9 z!uU7t;7R&MeGN8V$&uq_8%5dlkDLDC8LIFj^Sf8>)d>7G4B*#1L019J52L4mep*hc z#~5{0ZQi_bt;y?hB{0`Ntlc0KUX)&v_DN3A@ondCAO+`o9$##|&abg+R(u5}6MD89 zdB2Y-S>YkUt?2_&?%5u}bbXc7U{|}{C=?grn@7}iD7d@VfD79}Dtg=zCJE%nM<@b4!6{C=W&X%L4 zQ}gS9iU_Zt0g`a4aqSzItte{Bjq5|TmBH|?v0Q>^GT@tXA=A;r&ENVfaT->pY9r~S z%833sMDtb-I8+b2Nn8$(@iK^<&4x(bzi~0PNUCaA98So=m9{NzO;8#dsXQL{WMyVL z{>JszSgO2*r`%aM@s+o%-+9rCU_W+I599lI^@MR4xoEi@{?R+-7E>E$>dy&a-L$#C z$CE#rF3<(8%_ri%veBGv+w_|$d^b35I_Kh>qX}}h#&~Grx(^3OZMM%wxTpLXeXd-M z{>sVyW?h|TjQX#v=${K(124ofWre%*Am(fS{h1_haoOsyoM#SebdAcMZipMY0k+Vz z7^C+2{IEOfsBFDTS}hlz+_f#mh1W|uI z?rsK#fiGh5DIOEKWR;Y^IK4&9sCM#AE&n&=;5%dL_E41{onPgim6Yc%pu6Qj#1Y`c zr_cXGy!fMcnQ=<8E5jnZ&Qe#mwjS%MzZVQf)M6F3+%QFMlQf6JeA}zVKCNY>1-+UA zg7tOB{fa)F8cqA|HgyNRTv1YaaS6zSW_jE3!Y ziPqtNSWcuri%mGvW@=6EEoBe8%L8pcpS_~b?yx-zYz&Mw=+e> z{xT1iprQWHG=tZ~+U1EO<+HxnK*L8*t%u|i-c?=>W^MWxPVRCGeR%%}GuhIByTXNC z2JHM-`NO?|kPq$~Oc4WPTwij1@FjO-YrHy0yWt04gO0Xa?s9A|a&ZjwPb2#d)$RIp zs<~XZyvP)9gDemAoTQah&&Wy+cvkh!Z+lR5%6!%}Uh=+nkJsjc@n+6qbXZ#g`hR4;yH&ZO0|qe4O*EoxEK792gHakkD4^>Rpz0 zff#MN(spQUL9y)>Od>$|ob~l>AE&0ew{F5<%%c^?*e|gQBGog15@%ApJW|scsOQEQspY~{)I~O;;5E1HYE!%C=iVKrv??dR%tj8*Xl5Cn{ zp%?L3&C~h4vy&a9LVW{MfjWS;J@pRNeBY*+JOU}jn-qs5{^Y0Lv#F)_ z72+uA+;`6&@;|`2TF2i7s(h2zgyueIeT1IVMF{Vzg0x^ebHI#{h=?@4!4t$WnDj|#q?R|2p|)m$Mrzw?HD#`l^PcJ=nBpwM<#1XhQ& zE2F3=dFTfYVLjG;ySe!5I|egrYx+pJ-PK|PVeF*dZ)b9R`~<7GiwhsY_ep@tY!rI9 z?^?aR+r^QdPXYdLct|uZG~QuCc*sYvy_$(gJI-gauXnb4q?=oxizk#Itl#*&_43M| z+Hyc3cm8b28jzM#Z7VI}n|^oo?FxgW@-|+J5tn({*!~!MIDn}S?K~UZ8br$y%E^VI zT(5nCvRx&pG@ahMxLBZF0(90nAu3hKl$V@MG_}ur&bK<7pQQy@Qqe)0ZK`%B|9*cOe>NQ~Wp1)w9 z8=1f?E>GAXE96p+7ie`|g(d`>T7P-;5T-8YoS$Kjnp&-^XPT2;r&^a};fp~?TEnp} zS;bl)Y&WpZehP2v{%uLMV`Q>NRgX?la~(!e?d>2lVM}Zd>+dgNHC~piY4yqIX|S~x zh@StR5wmMKNFt`mI+<;bjx)EjTdi>(fls_VgAVJ3+sfViwEyuz7x&uQW_E7%QPvub zz7`qS)gxju_TvYgz9bv_cCP)7`Yn}1l&qDWn6Ai;_a-&7_S$RwdWfj+)so2!by*KIo#(U-pD1qH3 zAsewJ?xW3u+`PQ662czhfwa z+6@YI5Rsl3_q{<;==KfQQiCOmsA4XPA0OLHdm|mEOt;qagcktH@R*+o;PFD+ciZm-SHZqQw;8<{8}CX=3k~kI$Vt*OCi~C@6?a ziZ6~S1?jc^1$BW4{$HnGf0t#z*`^v98*cN~JHy{w{*I&<^Z9la#~RF00O?!GA5?py zy4dF&FLz%w?Ns88R@R40-d%BQb;fvn=*ojJ<0U56UUm@^UN!<>FlO-Ed`?lLe_;Fh zq&>C>+332hbhlLVtY*n|%3Y}51tVai_9OBTJtNW{)A6wOU>QAQH&u`EB8J40o-IQa zq3aW9+lGWB+uB${^vEt+3hL9j=GHhpcZseeg2@OvZYcHlr>9SP=1QZcS}*lL;+mrH z#UtI+h{s%}sp@sb8h9YBv;%;L2rb?yXAmm-kFNf6rhnGH&A|~xR#sK*1SvwpJ-3>o z(bLX6AT10&cQlYdaBvVd>3XS=H&GuE;d87jdKl>$gj`_#P`Vx(+fhXguiJHiYdSo} z*?D8|bMJVq&!?jradMB6&sKkIHyYD*fYNQ<(Be9tJU8_@UAVZob`{|JD*_0E$@m8;;NuRXy*j}*?&%Z4k|HrQ-h z5YWrMBr%xJ>JYav64PAI=h*sZa=8-JIK ze-2l#u`4OZ#3r|a|=ONQxKC~9|?Fd9be><$wfeiln1?Xjv;dDLb0uMBIc7Yb0mjtqctVZZ8 zjRH~b$DT9Gt+eMtHjSG^TXie`4DD-z9R+}0H&0R9ijvS(Mp!Aonf=>EdEv2*&CyOGxGasrr?yJt%Q+4PB9`LKNW zywJ=kg=^ELglYYAr2D0g>-O|fLlRKYAFQU!Hba*WnuMlXN#pf8Ws6y9b$K1zxpB5G zcO`O@FRo;2pXoBPrRDlMCB+y50iqcOKuv|x^2NcoY?}6_Xq9GuyLGzX^qW(i=6yu) z(!_~v7(zN)B_V4``I)FP3gj;e$nT7N4-$O$Gt;(4K}(Ww%?bZ6dv8*LSRt&w50Gf0 zIGx-}6|k4Vz^sHAS=HwfdDT<<;D%|OfHuGj++cX9Y}F;W;10WR@G^ws(s{2wD?Co; zO34M6?V!X_Q!yH0In}w*a?C;2sV6QvS3_kQn31QZ7N%cz_hd85?XNPi_Q}s8Qw!$)_W4l-};ft!pa*G~8 z;EuY!-TP%!R;W|$TZ!iV9=UHl@fI@76;<_1tSVIZ+5eD>C%b>X5eY(g*di<5K81eY z?DI#WnI18}L5VEx-J7wk2TwHX! zD*YDGZE?Ry#!agi`IEk_M*DP(z0{`2W`_cNTY;AX7< zpY)P>Fqf(j$4(N-SFWBxBqS>bu4JFOvLC)+eLJ{YEjlf&U6-Vk_euQacht%ka1UO~ zpLc_QS2Vc$9PtNd%pV>*};^S?e2n3X;N0ROlBgu<5T?;dzR}VP-LJ~LYu!M1qyOGCE9@!Gr0FQ8eAn$me z{dvm@ar&~oConp`O5Z-_;fO?F{YWI+(0dPqNmjg5!y}_5UtKY-= z_`xhd-;d8;aq~d?oFyT50ztCkUAwgt&+zBLetqP0kJw-LEu7=v#&^l{t9;oy{Jz<8 zTO6YHdhV&$#Yf$0ba2K!{z=SN%*vNOat0FiUs%VVOZE>#N#+mTzyv6SEbxPQS=~CD zCdR!Tcz65M-+hWW;(D_kLvincu;}>fP=dkkWoKm3BLT3V>sz}Pn?G{l*}dCy4xUdU zgyyGcIKVhW^ht@j3a{4TrE+`{yZ5GJ;ao|FLN}EbN*9AlW%-3Wc+)SvTqkre$5Z(D zhQ79qt1h%^`UQxoaNPSzetYRaUTJu#1&qS&KX*Nx3?-F+JNM1aqR)zme|eh*yvD8J z)&Mtvc@ux`$JBQNN2iVw`cjVW>&mNF?Bq0hIMA0-sB6T}@of~h(j&btdi`2`7w7yM zwi}#FQn(v9~(E(}Md{<9Hu;Pr91~aRf(tSOb(LfyKT&w-&X@EO|HZf=Ax{~m8 z!`2rtyS_>{ZBtUwu$;L)ygN5r+^P9IEzqxk7)BAbn}i=Q(~Ne1Z7wu4v6#@@oOA zdzhWmz0*1wWwkS3Iv?nBgugcgHv$zQcJHn03dScbZ?*2TIGqw8k)|CHl&&{#GprZOVY{TJsX9uc5qReBUJZ%GWYvTrdM#f%3g+UjIjsu<2@Q9 zs4`C3H29xpVkhUj=UiN{gUO9{g~Qw3LIRlRB`EKns1&#>PJh;+zMorTIzs%(wD3e# zDeFV}q{v9R|HP^PLNsr51&*%c%$b#pzPq*1Fosj%_yiN#Ek?U5Dm5A|&l@c|(e^*k z5719v_#|)8+yv74mRmnwR_y!@iSgpa?u)CNXLiY+D&Ca1KVSNUb@0b8lb7;_Zw8!k zx;Q&b21ny~o|>l+5)oAy>l+QyyIA$!LSCET?VxxZYIUJNx;6XhYTJz#J>o_(X~3Ma z&|cz=ICBG|fl-%>6f{npHGm{9!_Gvd9A>mmyx}+at73&ver&X_o5GW+=%mqRhQR?UA28WkLEWx-RX?qH-3M~vYhxl;Y<&e@^JmN`PuxD zQJMSMn1jQmW}Gx*G7@`j_W5X)q+s$-d#V`WEo>Qek{n~0b(6@-X~NvBW-W%5K1rmU zE9tvFSi?qKz;mNQg3%Q)4!@CO93`K->mnBJ6-qV17~P9G{X~zAZ0TI^!NX&!rcL0^ ziR?kUFu1}^9nYKGv16S*491C!333$3Uevi#_=(k+dp#{s4PQI5N_LNZUTg57wX)P+4A8H5T=Ip;sCEn{QBV2W4R-^H9ET=N!vkhBQ_gB=q-ZqJUAo%t1J!_Yt7 zyGrRtZXb0$rxLB~DrgTiYoRH&eKYtFL{6juJkX(LGp&F0Pl1c9R)Y#O{eimvRdoM> z`q^(S_Q?N~ukY>&P8M$lRi(Ij{uem9`mTXZ+mv~(EI5%XwgONI4;MWZ{<%UFk{4dQ zx+95mI}|2{{3D}oHV(x7a)%*`o*Vzu88^lJf9n>(A%G! z6!Fxx;idm+e&>s#{(ap)p7#Z8;Z4R{bDlEwfo`3`8*R+$m^h|&$jN~ zjv7U^4`8Ry;jSIDi0XdA>vUDOnsb^`VeDx$N7IgdA}f<0BzNa9|EObD_+O_CX#UqJ z10}cq>r70-|2h+s;=fKA;Q6oed6xfm%D~8f9eeQqdxI|3c_s%ZpaqGyK1JM|cDx!U zyM+O{-0wML(kkf%8Vaf&NuK|zr(Zq&yMw-9e46bpPSfcEW*U}rZMMpDd#7(V<*NGI zr=)vT`vvF}!l#NpV_x~~QIFlae(_8A5B2p?IEELoHHQtbke-w=Sfs{&I ziy7RzgT;=@Ixw%;I!65Smf4XHT%Mz28%QHHqM@EZv(}}k)rK9LIQqV65 zxOUHHU0h8e@^T7m^)r|0Ep!cAj^wh>&JR!s`@0?i5#PBhQ!B^+cJ_kZLz^+O9|=}A zkcs-ldQ6?F{|(9m*O_zQ^@CIAWHdj#8pb^X1vAKo6Jy$n;t&ydH zJntX1Ci##(#i};5dr?`PzsrNk18x3eU<=ecucmS0`f=;wmM7i;__1Bpo;gA}9VI9W(6!xp-tMJ?cQ)?$x1>ARo|i%?u8 zzSy48a|**doG!RLqdoIJ??dbrJ8zwBe|FIO&lcl+s`SU1b*_W!mC+ACg-VzCNc_}IEA_qSWPlut$ET@~7OVGig zp?&Lyzt{Bq@m>mNTkyE4I2Ajv!o$f4n>la1)KMmpj!J>fJp+|#EM082G8WJWbqd7E zkqn><+8Ap7!kN>anctv*(Lp0d>`34BZye}8ab>w|rLY}0Lne9V0GSLMim zU%x#V$o8pNLRPk%Om@jqRP?sr+f|NCLT~gzel87aXum?&{h;T=sO-OQxcbiz#T(!S z`X(+4NqBrZdplRss|a04eJ=fgjXjRrX}PfJqQ|yCO_%6Cqx93hJ$3;U@pWmHBK7){ z@V|?NsLO3|7#Jv%WL&OgJyEA66_H%Cz9SV*gF>8CLyvut-90@8`T1@gE1=K-427mh z{njvd@%J*5qD_+%Q@)jp0knw1+7oy0laSPSw63s|5N4DeX%DaOiR}O5z{xc~Hy1m9 zxHXv`)RCXG&hfVn%`S9B@Ao~+ADyg&5n_ExV4G7isBsj3ID|d^;FrV^x@qbtPl+|p zRVPApZ=Y#DAoZj$C&$!jYqEr4tpiD`T%?X-DQ?>FwN4Rqej6cn_;qW-yIQypVrv`G zbX?wNHZjt3dU>(@fqcAPU0DqQU${2;%|JDTjC|R~^{)foKK3~HW%!L)M`f8EW(|yQ z{HI>ECn>VIb+0Shmi%qug89RfQel?e?o1bvt2VituW%G|MN?1eb1!a!A7nL<-#WS9y>Ab&Rx4qeXTDY_sTmXoC6O zz@m$*KEz9V>+dsd=X?|AS)09AeSHb@yqojU=Uac90WT+TqErk+X-1w)eZ9SEMqUm> z`gOX$U2;~jZR?6B%AE0Bg$lXbC?3Aj=`4CW_KQJ2radTwU#0U9)VBGL&$jAP94?Nh z2i6uch_fB}!;|Rhk2cy@#%SoNG_%t|r@7337^CGdqSOuMRzu}^y*9Sne`~eeTg%friVT$nwa|Qpe4<6GA^xT_CuK{ zUZ=_I<=1QfF-+5#h>PXV9wML^mYx40SxU{6&ec;s@mLDRtoyP^*Rdd8O-);G{j11t z;j699t*a~k<5T1F|0-FG^ZzlxxpZ;;S99P}KK{AOLO0yCr0Dz9N#Y04wm&2N$fR^4 zjGjp)dX(pMzH2y5ZexpGa~eK)qaMy8-=GFLIU8OnSx67%q=qwRmgY?7O_enXG^tT- z5AS?@Yl=`-sj$0&b_$qJE1p-^3tuqX2ne3v?Wz1nomx?vf7kZbRBch$U~|2RpQsbo zsMJ`Blg7Y-&NHvag)K*nKptx?h0n+B?S{YKB3N`A*r?};EuQ-h9@_F&(ed$0+$|i( z#+3J4cOaAVT|3kP*5SM7DGrS(4TaXds;P?H&_eit5^L>9wHKl|twTBSUB3aW)}IHG z`=BF|lR70h$y>fr8KN!yjPui9gK;)Q|F8PbALZ@6q;#iaHS8Ao`TD=I)V|8}O)|%; z4qM=N@Rkba&_Ns6ag`lBoF13%J33(|I);Ta$t#ns5)| zIJL}~vz1}oag*Xv^x2Uy?JVI(6v&ZhOeJ?Ne(^w09C`Y5EMmn`g9 z#OlpLIUH@xCs$oP?6KkX4jZb^hAq((13ZiTDv7!^>J%K#3gq5#Z|Yv=Vra>;!%8DG z_WLRi%7+4z`m^WcP1{7c$7}6UQNOwbWd_w8*przA!YbGKWngQP-g}(kQNKpaN9Q`b z1QXrbTDn2?bshY+3}3;BRoifFM^jS26^&tlc~kn8JSbVKuFnwfoZBpRyLrvZK* zDcvwH(UD2tI++y%$-DV=G+lY3sb)J)tjNlhXJO9zg;d4&ZJek1MDq(B(groVrh_Z* zWy?@lwh_kR4Qlnb2blU!KKW~OX&?W4-4!w;I+W3HC7lQplB!w!IOqLxVwIjt0MO;2|=PgjO#$3c?I z7PWXBF1Xc~=dGC6mi}&xM&ngTZcU8Y@>E^#rHAllDOEWdPx)9#PrVO(cylK`k|ds~KjVP29t zMU8EBA2!__Z@Z@V0a}??N~=+62ZPHc6`Xoy^&*#4bQfPMZE^FRI zFPYdhW^)IQP=%{at{0s9hL$y^EJsKQUKW*EaCMR5QltDrrWB20HV?Ty)8KnEFIe8c z8`}A#gWAbX8&melEY?oV9Z1l;i$ptIm$E(z+mdBX%z60S=xwbtoJ>KI!m}x3yuzAk!o}82{|x0VSTN%GI`*?+ z*m+jQn1nNr^_`*fpFI10r8pm|kg836(P$wupOCDoDPPbK1}QMuFb3)21`)NUzW8|8 zArHkoIFi%B%Y3~E$RYe!&$}r_lbD&+4#I4?q+I9JUgjB$<(yGg2a5H%t3IWvbX!jo zP%(MtA;k$doWn%*l&u++yOPuEQhxpF(QU0lz)7m=nfI~5T|qgz#pWJR`7s+sYU1~K zmAvtJ{P*%+-uu-h&{=eJ_2RU#;g+4cv+wkX4!#r>EqMqJ z=NQCFJ*^E3j*(`As3?-MWhDx!Xi&%|*eL236x1+7m>#+dOq`B(33VQNFxwEu%*=Lm z2{03#uGffV%6ZZoVbvkZFO*eec`ghf4l0!xLI_ofJ=JNKWNaVOVw@(grXwCLg#TWf z$NDY_w*Yf;7!S099Wk3{t&x`&dkN~9Udf%*9a5aXBoGIuFtMb^Ponl9rT5ZWOVw6A zma(d5>l7%xs=DCihuM*Z14;A&MQ~1Pb$zP~Go9G}HxD^9NZA;o(7$%5YMUUMN!}R2 zLwjpG;e2D$H0|QtPP4*NT8Nq`EekxSv#4A=9ZX3By^{S8RJV|`e3J^qm{na(recl! zZs&nV)q%O8*ytLcWul0h#_tW8@S!HA%(OQ1vzlFp@OcNIRxEe-;2CXieEON`1KHq= zdOEOfQ{R^PG!De7OkEZJXE2QJ{3FL9q{iWH>z6XRx+DF-ctjVgL6T%Dw$=*Ebl?tX zktM7M=`AmLOo91&C5$;UsI#?{_u{p#{Dc<-X#im|sKj(Um(LnIT2B;H*RV;>UEcIf zbiJ7n%DwgoXfE@*GZ7{4E!saB`K}xfy}K8VPwBlhf@Mw)sk9STS&_&G z&4Y!ucOJSiTkWeQNvl1U9U7qi)m0Nr#-7rb)}}mIB~DFj6oc7|X&-3mcy|{`zceyI z!D5B4mZT6nq(X}44e*|R2#Xjdk257t_(1`E#X8hn=SpC9t+bZ*N;6L0$bYzQTrVc(`J`odM6aP#0aBvD3tsEhvi(Lm)!{iSFHd6+k0|>U{rc%}QTBUOA z4Oj*x!w~)78l&ql!C12k&8&5!r+lHLRkfhAoi4R9PYcpkZlPwFkg>azUIK76iePRi zLll+yoSM$TNhm`{Up-k>r$BIlx;`5Utuq?MXJrquv^q!}Qju@ahB!Y}&KazXkHW}6Yy~xlrMvYogMKD;k>6s_3&ONkYF+209|CG5}fNPv{J< zJTUfTWmpRgKPWak-wQ#B9-1?=r37=Iyq8y?15yPRYlWNO2(ZVszG~DFz@&EUSas{tt|4I%F?a!bZHAbG=d_Br%2``g^JK3%E)6MuM!3H@Gk4tMw{)AT;4iOFG63 z0=}8tRh>boawY|5PURD?!cIF@K6*&I0V&>%>_|15@S;wDpaGiMe_T<=9PtytR$W8S zX@?2Sj`@5 zsIXCmRM=T91*l}3!?>fg6@v?0QdER;U#_y4IYW}m?AdJ&V@Fi8gGjiwE*W01=Yl1L zI*WW4AfEzL4Jw;Tp~C;i;j!D1um5#M%MaKNXZV>PJqOFdrPnL-25tF5-guc5i16KU zZD-N%$`g&g`2N0=zahp$%-tuFQs;UdA0#V&`YFlIWoP4iq1x;hm#8e!__#Nz6?c=g zYnw9R$TEwWq}pwrL>cLfu3P(2K;q<=ezoY_MLv}uDFFQ~a>cW#rT=}l^KOAd9YP(l zw4#IC13zv0s-dDslT;I>TWYHj13$K`n?F(`hp`IU7%BwA2kJ;-H9}u8w;6YE@H6r>0c#EguRW(#a-{H7uLIt4Xno4uSn_wAUcANGp zQWs$<2bI`f$>2`%>gu0W<}OQ1)YiIsHFjfJi7+SjTC;9AhS7>zFg(m$TcHBLo4Dlk zvy6hYj>vdtZfFBMtH~gt3F9$b`i6d{*lhnZY31gfVXTt^uz5?y)XA6iW$jRH2sTBj zN9cN6NPZ`M>8^LmdU;TW3XEHtLzq}YB~_M$Y)3ACt*#2Fh>>u>jcI2!7o3fyS{EPs#gA+BWi_PyAgXFf4DPvW# z57YwWgQ9|myi5TE$Tj=p%Cc5kbOQ*tNB6|J%O^HBF@SYpIsyp2(kB@(E*16-< z#_VFwD(M(BZviWxVKrYr2&oPZE68^cQJ2P8kSR*RLfI{jh2SG%(W>Yv-*K1EiSG3| zc`@%E4vg8xa6_FW=Q(26@A7&0)3SzH3w_}G+7OBh_FcTT@X=M7ydC>YrQ(~J{jaJR zw&WmDLDfocEi zjD#Z#(4xd@Tmb9HUOHw0$6R+r`GKB@@JL0g22$cNY4tHo;pcmPEsY1#KE29`S-`@^ z$5&fx_^WH!8(Pk47wasje<6F&)1;xQEj`@CmkGdMz$RRfwqrC7VOYv+SPJA|_e67y zU87c4&t~>{em1k!?jrdVdOYLUs z@ip%NE`H*p2z;>~*JJ#No>2mzK<`lc>605TU1J2pmTfGB;yRi0QwQT(fcwNmirJOERYi4EdUcVSH zarOd{^-rZx2X1e@^}6%Di^HJyjsOrb@qLZ5c#4b%yukm%+IxmI(XDNuC?YB%C<;gw zQBXS4yA-8^(tD9!Lhl`PG^ct9cUX>I~-B$9fVdR6q z24B$9!29Uyn0}_-j?Cg_7_fj!@`yq-j#D|aKh`46#$m!G`2mYpyFo` zK`YP;{dTGZ${ILb=Jc>V3LiL72zM*Uo7MzO=Fak-m)`i0Y8Kty)}{~OeeL6<$I~VF z0Q@OeX3p@gsRn*2FE%|#q6b;j%-e7RAI)Ke8%g}W(Ep#L0 zO+vM#Ig9{cEE7-CdGuO)(5)H39st(*0W{n|m!2}AGcB*L7%-s;PzXQ+@oEvkG1CCT zO$mRB5N}Sl-E%8~_(}u2U{5J(vam8nZZ(tv%+#wz{th5MfGazXYY&aSHqETCFD(Ve z<0{K2RQ(4h4`l}@3Tlp<2e+2ZR+Pu+AsqUEl&S8|UFT#r^S&CX-T(2iPNRIupLhTcubALvj@+YM@bdhL*PFvr8EKp}SSWw}mAi5= zUy7pk?a?+ycXMf=dr#TiP3A8x&x(6A+yppa24zLL=mYc**I+&Don-X>T2)e_9N?8y zxiu^-%))`M3THm#t-9`)r>GI+6t24l_bm->4O`1ERThR|D{2Q)p?Um7$AYkbXW44N7$ ze1_$=5Cd(@E5r_mvtOCJ;P~nBs|@OLyGo4|v|=Gj3fx9vyQ8$)@R~{_>AfPu_;z|r zUun4usyIoRpTjhhHB8Oz=`=u?tjEo2W-71s$Ek>;DpoIJfOxTa(W_ooz@zoj1$dIK zcdD+wq1q>;@;BOFC_4_o%=fi`_%y>DkGcY6RuS3DfV0nY8I8U z)BePi@-`O5YRlJzDv@7)Bof$^B{|OocWj={S?RI2prYD=miB@GDYN$Ns z`R`vg{&Uq!N7raY{PapoOEjzA+NKlnI}Qmy@z_UeuX`L=l9{RJy88>)WznHHr6EsKp;zJmy{#oclJxtrztHF!^twu-ObUYX z%SdRx=+*Bl)l)qxRK?4Ozt0LK&hcplz1|^D)}f79%tfjg2U+;pFLl)|JL8L}C9i#U zKt=na*wV_1)9>mBbI!O`&SkPe|8&QR_Ekkmsg*;yF|cK@&lZ{LmiFZ(=N9y;7rWKj zCC$Q4a8LW>cWQuE?&$%u`_Er5FJM>uS6UC>_$*(Fopq0i7UGU~CIXK$JBlx`t%YG> zy_sA8TB>d0o>uVIL=^tB$P&bdsgSsG^*Q()@hChs9B-(&K8{u*|7!Q@JaT;<DXW#?Y}q~w{! ztH&TZnsGpiMIlAuO{ofrjRUXUl_oLQgI&8dDAy`OiojmMXkWWb~GN5TlfW@Fl# zTusXGRKHZFGM4Em{zI8^@L~>b&Sm`(n3G0#f7OpKK)ev6tKBw|v$IXaD{|b9+r*@L z_WQH!c*}e=#3&dAO8T#noiztXU%C5Ij#jwZcbpl;9&*Ca<@}2aO>QlPTf#>LSq3T{ zQ|~paoKBC=y*>lhYZuv{WJ%F6w$=e}|ban0G`+;vJkhu5l(9Y(qW&cJfrpWoQc)FP{{ew+)* zPfodx**p<;{qtl8aljL?%NSmsm&CM5C4C&1?QM!%)XDsJ6;kTm4h(I!votD<;98^# zr%W~hdJ2h|=v0)xujCWyK3NGgmXy!kLw04Fdyb%9oTUA+YDjq<~#w z-1eSTQMHkcmz>Xq>?D3}BQm~?mDX$j8+J9@*ZzXecWnJkt9tQCviRw98xI9xZ#gQF zKn&_E_}~Op{2|Hh7wzWvOZ{2Zzk}PY`%gR8G;Lta!Yj{CtCrT*41o(5(H_ekwuH^N z#a`=FMilKFk~_dY8yw{mT)7f4Q=i)`sFTu$TU7`}9 zxMWtnG%oF?lcp608^2$0{GeWPn^{1-RcP}hn7X58+T$2M5vySQ8rTKq>xMDVwzW-y>?D`;v43r*f-Exff#s44$rJxB zSuacfhi_<_9wgG5s3?1rk<-)+p_yd~E3PL+fCqoEZIFk~298K2{eYWZYyQlmMgv&8 zqO5p?Wl%X*Uf$(!)!!J73nd6VYO}u9AqcT`;vpm7O|%!?fAcaG>RKI!h6^pCi;tId z?j`y^+zyyqZ`0Gw##=4{P0}7K5V;Ordp!o-(`&9)p-(DdDk_`k`fvXarbcx&P4)&^`eeBoP5yv!?}!Qtyw+7C zqAzp-nRrkB_|UTSe-wk8=p>t-e*ZOxocbZst4yvzIGBfd;cnZivaHqg z@|#g6i-2*LuGs{rl+SA2XlYd(Q#ZTohlDFzfZj~FIB{9u$eMH$+R&(hV-;6QSr7Z2 z*%H!5R2$SrePe|k&Na4*%C(0)4b+9 z_-K2~wl~f&qa($PM)v5K%Zz@;ZU`-OtCB>y>WEC%11ax!`VLv=nL4{V+2T%hF}hw! zV&k)^Q~j$&ZG2;BkS~KbhgDWfcj~i71&4OA!JYBqNPls!h(mV^i@wPa%>W{)OjvXg zf56l##ND&qYfWf)jU6v-NRg=F$eTvKM%)TaTdK7O^O(Sbs&KYE3R(0c;uiB{8XLo?OGdTZ5?bx zsbj$g6^2Ze85|w)i+!GoW*rARh8eX|Nf*DUK^$#Yj-I_5f83?BpY9)0cUg}h|6sF5 z&L98Jtm(&Skyl0OWGILn&yj;u-Vj}HyQmLS$hj>4F~Kf|bFVP)L>5j~NBpTnM=C^5 zQ^<>@aYd*O3(5KtOB?j!pRV@o8m{=2CUE$p?2WRn> z#>b*B7LU_F#Z>)T&R9%p@oY}R5s+NV0xPni4uucHCy|#O9%Ce;&oOfr3^4%J4yUV8^mJY-t=t3s+=LZ zSXP`JG;cG#Uho|fe{pH7ziU6FvROAKPaAv3ohON$;~e;mI$o3=v^0NfY5<;8l)pMj zJ*Fs&ObIv?s6M#!9+7k$mLA<@AC@oG z-Y&(>s8&8*USj4X#+$L5oDg093c-{>)oSJyK94ydd%E<-|N6|vnsWSjPT2>v1hJf zg===zyu$nvK|hHOEIx21Co|4{((o9;KOeY{#lVlnZLwbcU=!4kMD6Kn?J1$&ATnoQFZ*HF~$97&fQr=X|l2743hQRGc3DgP4Nsz>RP0k$K6yQu>V25aP!*~jt->}Mv~;<|luDH}}flWcmqUM-^ZqDTU033KHlj4r_l_Zd7b@Iq|& zHc=a`*SG7gXn}c@V#qsNO!5qg&hpxZFS#vR&rz~|hyaha|7;32(ETamSTTB^d<)Us z-_5~_AEaCbiQOz?Bq#MbXwaes9aQrAy|qO*1}Ud~=69&WQ>Ibm^g~Fw`Qwu(r{~m; ztF>EStGDgFo`6!(b^3!O)v%6{~$QKM|+vu1~# z4YV#dCm{zH9>%uX13~$r9uYx7OE&#$cT#h)AELt^k7d`_=A(7r+4*@j+nOk5j*K;{ z&qDa>>2-n}aU5m5oNcFO6)|5N>m>y2Aza$NsUYobLh={z!-TMGqrjyzn8Qbc)$lmo zIlFxo5eKy4WUxRn_9BI#J!D)eoB5s|4n|+UyuO7#U8J=~jYjsuY zLGTQ`C)Qn#zwy%peo1oK*ZqFcKXu70f1nRsTi|1q2Ck}&7}L@5n3bM68ZN3Toa`=M z{V`*4tt#Ja=8DT~f3K6dh@yjaL>b}JhLYS&p_fx&zcIUH(KR6m!R1n>jArRL!1U@4 zeTq-@%Hho#1bZ&%g1`y_O%g1Wuk4ssn?pXwjzPcwv!1ZM1COpR#19F%w^Tnhv(c=6 zv?xZxp`FXhuZWK*Cd{q(_22ZJnemTDP!2028jd@e*|NNM*<%1r&QrBeE3ha zP;kcGrD_EdPYMId4KqR2wp(4oBtAijqSz!dj zAEED|N4ojTG;{x%QT5j*Gy@SQM|$pOqtpD@T$CbWryfu9f&Obd4|w0!!F|{a1P0n+ zdA@l_!S)O)$mxHXYM+BVWK(LwI1p(OJf{9A5cr{Mp#2(As!F{6^lKY$yS0i)GorAz zqb$w23N}50HsX&S7_!u|>uDwYB@svn&c?P5+~FKOvUXBS0EH}pBnL<7h)hz*F#-B; zw!3=lH4db97fik?1>e_e=Iqga#R~F?4Ky58U#h6Gum~7q5;(}I_`{=X@vZD@QT!6~ zeL~7rlcQ}%;$Hjx5V||cYGXHTR{4n@9y`>NuFoeOiE<1hzVygX>e#`y9}Av*3A5FQ zZ=Wb5*dGe{^_^`JlDh!Kqw27jJ&VE;BQ(5Tg{6kV#o}|J-0os_z8-_uVIO1N8?bL+ zOJB12nSqw&eD`BNc)amEKXgs7NqN0a0)A}H! z;()qf#_uuOXJ%^v-%pD@-A!ow7wdD9&;*6y|MhFU&J6VmodnEs=Hw|u#RrI>V^_R1 zZW}iLQsgKqKlKMo490hYWhq<2B4xaLT#98Rb~g{L5V2IBH@Kl&VXMt0=NG{<&uwh`v&80gaArrx zN;5Nk=P_aTXAgBZK&$-v(mZ5Gt&Q#oy*7hK9n_lE^?CDW%9DG?Mfru~wXlVYlClxm z485SEX?g|D07oHS1l;sO)W2tD{(C~0Un&dPdNjG0qFQDlgoCJ4Yl#2m2uMvo!ryaQ zd2_+Ww?u^=EP2&3pL9I-A za8g6{0-t9ledd`N1>bBo=;eCzA5b1&V*4uTz8}`j;sY zl@Ytr-!}|8*<}b`|I5*q-&PJk?5zt^)C! z$y&XUnYU?%IYYeDZ_w@U_0Hm^=SD(~o})!hlDzR7v%W7)mVA`%I z-xEwB2Ps>*%#_-F?$H{V++(ZsT2FJ^`a1Hhh1f&2+DcMVR+}Vi@l$b?x#6r-b)4Nb z?pbh@lWE^NYvl9d)WD^9bK|C&V3nz(d#+KW*8RgL27pnnh2g+m9W|0{$uG*(A$5O^-o@8q}PutrGW=E z)X69a*R?Xy@T0nR$Cb=ep_fUYg=chADn2t1Vc3Ut$4lqzgVP725>&h=&e8_>L*B*a z#nJ1K+L)dXGUWM8v$SLjKJ&l0(z_H*A;=X)LKfM0buBmtuFETSH4=5v_3x~?NYcyB z(tS_fHbQ!JL12sRLm@$)7;wPl3D%xV`5xqYpy;>H4)UFm+?|K055^ zGxo3ZOY9dPjiIMDwn=_t-I+i%l8#Vd#F6wLlGv2^eMUnZvc<^LGPBzcnw~+dZ1pCg zR|_mF?DFcqOu(Y<*^#47!7!)c*?F~JPC-VPsp9@*?nST|bE4QcoBDHHij2FBbJr4Qn58eIoRvkDLo?st zG36n?KE6w3y}3_H!b)$p@ugh`@6694kl6VaiMHSDEm^3C9E=`f1#Pr;g}TPa?x(>2 zzU#0o{rZ0l&N8B(CZFgb_9B)nG z=XRDMv|_fr5)TJ{jo=fbxs0~MMcN%q6Dj){;3i$_1`yx}v$ zk*OeGBtSLPFReg}d6>6p`s?^=0}5je$8Ms>0IU!G>cB9v(m;M`XA7=Mif^B4y2_wc zxBnvSW53`ygW%L%cD1;LAv3N$6>^sWNNDn+;dKyu3RMcjMd$S*$+@{ikA)d$WS-d6 zd-LPjX}grGXPHS!5k6-P%6Q;(pw1HFU$MGy!h zU?8H^-yxuUU!n9Y(9rhP)%K!Qg3+u$>J%$lADed{g`U zS=xFwD5;0=UF1ey_AD9#fAk=>&cCHPTVldrYM4iT+~YQ6JA{_dexn*&LYuU39RY~> zg`0WOHC`L5^4;F&7BjuN&W{!k{N> z>73VyOf#0rLOaQyqwH^$Rob5N%z{nxdOH=I(}|hgp@_VNGYr2Q505Fn1iTJu3^Vzr zfXjsjS1w$q))Do+E?hk)?jZ{Ls7=tzFE}-QGc^-SJHBZH?JA}vT;8O>Y?(zfl<)9U zld`@17-z?Cev990X4>+b=plOLI|b;qya+VHq2T+?TeBeWicA$KZ{pCQri+r=Yml`z zxKxJ*f`wjCnMuE2{l>4Osb*T^Xg?PoOc&10|IBTVk=g*>^sTzZI))aYJrG*BN3c7~ z6PyUr_mXuO)VJKF_MbmLD#gGvpjap-SHebGIX2S-+ra!Rgc17i?H#Q) zhGtlyvjxPTCm>+<@tr$*naCI#{QdJa-28`3cKRv<#fNZAB1wn+M?*X4oU5Y|)U`4| zY+Km;qyOr+Bbs7+5aI~2+Te=uG7z+o4l==a)XrGxHLmvuuFW-9;5K*@-!8*G8*G>F z&_-;aCGh%bf4F9U`1YJqTz5(KyGGI-7B;Xy9j>Hp2{OdxwsK^LfQ(h^LA}IG?1=G2 z=qaEzODD<1v(K$SP5WEkA`p>#b1mdGoCawyb+U06_-JnnI3e3i+QU2SNSL|LUZ;L5 zyp4IxZ$jL-?!*O0As))^yIr&O77@M*AA=#bc|1xFTK$)z6t-QnjC9_NAE$awXHzz? zYR&M@drp}^iJuv@nIIVvNApN!^W26y23!o%@zK<&&_=r{lJk-Kkr)@BL%&79j8Xh* z_qw6e#&c@h%CFE>-?mVJcFYkY618G2Cv zsC#xY1;JF>Bkl$9kYL0Tj!ewSORhJ;?=cZFO$zcb5I_Ey^!xGHi?^EMGWQH@8dl1W z_+6p^Nou39lTOkkMxG)osVAZH+=7=mhKJ?#V41;4s``7ejnwcVCx0R=dY*AMz8--y|R|W6z%__+BbF$)vV4zHHP;j`Ii4 zBEg%HgdhCoYyHc^-jrD(JK~7;q+nO}1Q*Z-d&j+-O!!jPG z>5`DEG%xRac&3jmX3)A+yb%h`jtDand zJ!(nmM%#tQ4w(I1jMWt1UnuNIsKtf68Z6BewDNdkL*I+;sY}+UvnNkittdbJ9DYe? zDc$6KPy8G&{OuF60<|eG3Vog9b53p%{=SplXARKvOy#c5TfMPp6J^@$2Ko?=(4S1Z zQKr^ZZ`bqk=)Ql<#81cxIoEsP*tp6!Wrv@!mT7AAG0@E0j{@w8@SWxRl);z$<>C8; zl>Q$Z4;iD)?#~{b?ut?>cbnf-$*D(b{s_{)%d^krG_mspA#(lO6qm=N)>R_Y#s?zI zA;aAU1D$e&C(d&#h#)<-vC-(u+lc;1^#l!mle8DX8rGuF!o$932_Al4%n@08-9(y~6`2Pn zUU*8~$Y$G!&GYs@_WPqs{HKcS?brOC!l=~kUTte?2#C&vuj)qj!*fWb{Jtb>&%A1E zf(QM()_hV`veHhlK6!vwwv+o{RQ%T++n;rS{`~pn)o+Q+VQa;@Wpo|1}cT|e-*l|lPL z%TMQOw3_bw(T)p&!0+>}6kZpx^sdG3-iqrV36J)FD{}s{jtD%gxgs__G5@BMgWB8D zHeUI@^0D=}>vlWN6B$@=;h)oE#@)D@m)wFDF}rI@nK1CZ;Xg@hAKS|rxY~~FxLkCU zn7K43_}|qxK(gv#&$aI(Bl(r~#MNk*t-`6hcklTRrYyc!Rk@eR?_lEOl#rf2uUSRV z=}c2zUT)p&mR{{ZZ8WXd06c_ZbUMksMZKjZ>iEGgkwi6cIMeYW+5FY1&6o3L_Zp#= zul4mI$jcF9!sWs#de_3L3kY^BM@-dB8XtEIC|ynAgr@wyxSi1<-bn!aiq)Fo_}{*L zyQ$whzHOx}O1|20Dcdn}=58l%j%@RKZ7cp}f8O<60(n`X+9CL#@wuv82lmU|`*i8q zU+HoCAFpWUTchN#!=Ym1@?`@0&NP}iw%?KBE6{kY$L$@T^S;rpd3T7t61(y<|HON( zvi0j?=~GYpw7!l@)>n^n%Yzw8w7uJ#Br`?=&q_6vRO)^0V|4rTKE2pRmz>4YCnhBU zh6R-A`X;z@h$e5mLSZX9!@zfJdCz%>lRV?JwX)_uGd^tJeMpFQrrH|fA+cm37Rc*@U zHrzcHI%a5FHrO*R@L}iNi?j|y_CT$MW(d_4~9J%Xt@AD{^w_PZyK_wJvI|L+69XD2j?cz~?b|9Nyz=x^cwKac)&mHXp={|xcJ{_8;{ z+`#+4w#x$U<trVK;c1GWqW-D$ z>*FMq&)zs%IVSDmxMH6f{`(>xF$W|@j8rr?kjB$K|Cs5M@r?&~5C3PDeme8qB)jpo z)z9tyeXlxM2ZiSk1ktmrm?dSW;*^94BD}4rJFWh{k!OUrMsomV_Se~|5l~+N6+^L= zSbpjvJ?K3i5k;rHr$tloE<*8PgCk%wmxC88hv>8c`2rwsRiGlBXcjOCry0j5x4unR zuUoB)gYh-r$?*y+c=e8&rhI?-Rw`yTwJe__42`jF2UQ?_7Edo4hDko2S<|e1EOMeiV3a}$w(ZfgaE4gB_ zUi<4VT<`4M4B2;D4mTzrIMFEeF>mkmJwrJSvW{zNLOlx}S~`5$wWN?7dXtrr;d_+t zvhM`7XP~0$4!puISXohZ7Zqn8g@ZS?ScZq$x|&@p41wL5|N7Y{W2gU~LwPx-+nmNs z!vhRk82}?xti$4~~YbIxS6}&d&={^PBWBb8YgfnBmu@8b95l zf70qf+>mjtvs)^w@2!RdM>-VAWZh=_QBOjZhR|wJn7UnWO>>cs(DT?KEBq$)2g6%< z;R23QlqTz_T%8UT5HSmLYAqC-A2-|E+tYFIrr`7Yf)&&5$lKIDZBF;67h@mO6TW5- zzuxd5FPmBG>f-Cq4_Q%nUf(ycAwc%q*!PHdky@?3b+dWNl+8r-xXxAXrN+}>&t<%& zoyD(N2Dk8?XhIZIZ^K&bn;cjB*7jboNTob9dn+s7aVczShxW&D?eEKclvWJ32r$AA zq89jxLCXG^{a+ns#oy}ipXe0*gqk-t#fytpyF>Pd^o>c3A%p#1!xJ6&(Ig;;9F8T2 zb`1~eIjGzD#&I9vp45V_^?XnSmCIbu61o0XukgWbCDJ(w&Ku`Sl<<^OExq?6&YkWU z?EA)rD_TDMyoct@PAn{P*`i_T>5_cBc|>=()u+BL&`{&?_8H}>4WZZ}s1`+kJV>t6 z&FqaMzkHPr>wj|heYIV`eVvEd zYu4n+qjt8LhVc^aK8UB?$%ONh^?Y(E6Q-yySmxIr5HZ=8B3gZOKIB_E8Th8wW6@9b za?K&$+cf5uO4@23fzXl9I*&@%z|;+zTzy!jiG+AZFlQf z4#SV~qxT_Tb}{Xusi9+Y)6oPKM38~Xy_VWrJh@V9zg}yRTL#c|G(T<(aa-T9vU8^M zTDm~DgGPI{(_Xsc{0y`336}bD{&^N4djDVsWtvZt9HJ35xAJYHgsGt2sJlWNL_Of$ z-5#K!!r?QLfxyM~V`k?2lnYeE(Dbz*)o< zspBjM&fnytsV(>Rbs)|Y_Q$S7&f~}OSXrEGZ98Rd(9#xN{!)#+Nr-K^X|-?pVN5In z=8V6xoM8Bvoes(ucyfOe8{oOAqxmkVy833-?a|zR5w}k#z)tgawoc_0dk}jD*Pr+W zWiPWbs5Bg^iS<)b0DA+j6%X}IU1Z{ACM@_=3(Gu+4^RR$dPdI0#~~ul&9T^&QMo=d$W{5Ahk8{BI0DkYHutj z?8%eMHBO+Cy?$9R@&0^@J_)E$U;39f-e-Z8Ll+ml0JXa+Ri4Kh0pb%IFI3$5TJkz? z{(y_@j^#35hH>T%4RLSs2DtK+UEw4+V^eTH)O$L9-vP3;wK6Z4+_trylbE~A| zh``r#(*%2hD$C2{FgBHTRWOS22hP7B(4gx`|z` z%XFTij${A&K$Pe15$4U>+RBevG7+bs1GI!pO7ilLnFq0PudV{mW${`+67J;)g&=<1 zsH3#BjT#8SmAEu>lf1TP?DNP_|04%yzab(tuq3^8@NfqQuHV%~o>;hn z*vkv?L-0It9<9=Z1Fgs6K&^EPxCWeP*JDR!Z4I74XJTo23jvo8Ja(M0SnG*vi&58y z@#aoBwgz8a;=XoEJ-GTsk9h4Q8ppvZyM{6?v6AaJqPhfx1p4~kd3&2*P#_I*#s;2A zqBYf!mq(L+NB8LaVq)h{6EhXP4p}C?bM5Rb^8bPg=Bj2r*`I<&240~5ghh2_XbHF} zQ-I6>rf`P63^0w+uGakrLwc$G7lQ4<2Vy6yMVKo*uZq@C<5P86mxxz_wNt|GTk4Wl z8bkG17zm+;_il|MFFG2-5};Q|6zxQ{71`H!%hBUQ!7ayGhlNGht%!)*u?$Du`(A|i zNMqy4AdPs-7^IbUvtA`LxwN(xd>;85g#(zZg5Ur*W%eCNRZpC_wAW#ImBk1ZfIoKq&(|9S z-N75>ELj-D^5LwrC$?`9MsXf=KXP#^q;|iAm`VJY zrlmDzn22(Arv)%zl6cUu~hhhhEPWe=B@#Jp6o{lmkgjg5H< zv&-ZZWJd?fxnhdETtQGP0boL8E-r3R=-Vf--hE=AGtD7E?>Oc#Z18}GCmM(zk>7Wh z;%LoPvO0&KJM%mA7((On8m7Wo1RQ!0(Bl=`=J~L|m8U)pSL0M~^>Bxd9T`4LsvFHKwoQZ2iRX@*2`X8f_1k0b=d_^WGa(RiPMKuZVj; zX}zct3j^#$IV)1<9 z`wPmCgLM^(Z`4??*~5u*(H48+mL^Ibo{#F<5ioC_l@X4D;A=oR`w(&O z0U!<)aByEmT5R+Sds>;zt9~SSZ0GU<=uBY0O@u+H{jQH`=9A$cqN64(>^_A@+!hvo z7uUB4ruH18D_CK8I`++XW#)7tg5zcHV#{INsX=^M8N=2NQyw!YQaH!^yy3bZ?_f5I$ZjM;LR@Xy@sBQ7{Z zN4D*XwzSe{>U`qhcuVmJo^tiJe>=)-Z7tN_e^ zY@sqBx$2Q0UZAP7%IOl>PO$U1Bmoo7}L~1nQQPb>p>_)vEoR+z(`YdZu)0@^Urk`fRU+ca*hx zXc~4kIn}Jh>eJcT-aR_MCOu25+ttR5*5c6|FZ&jo?w3}NRyzrD2YYTj31zr!b_MQ& zEmRNs(*gO`X|Djp7{%4l;NcEy7l(oEO&JW@6nKTMDJu)w%@zy|0*FKI>KC(2p&8I1 z733Jfe$$w!K^=>?0aBp4`YAxn9K<7^ZDE-+G8xg+H7oY;PzOiGt=48g+A8f^dK>^6 zC@vnl)=5llU+e65o}aU4ZT)HajF{Slnx47I>+|lJyPNbk{4veQ8^WCNPi}o`X&JbC zZywjICU%ylu;Frs4-ATcyj;pLBu>L1VKpSL=am4Gn3#+&V3Y#XXBnx75C9?y$a zk338IdVU@F`b1|CF-7UrQQK;Z$J%(e)cLG2^U2BGrELxoLR3;H2-9PHL8W5x$kZw# zdn`RBd(8b1U=9$Pd|mCTj*Xv%oUf;(gaMhzsNtreM$Ev8m+vwA7-}dLkklQ$;_L^??Vj7PFz(l)4*kyr zC`#+@mzGaf>wL*3R^Xg{CH!`ML;z!)(!JPDYqfe)<2}Q~_getsZE2x$MDQ0x-UeXy za5f_OI*R~>W{ah%W=BKlg+EvB(--<5VP=FXnva@>t{ze;t9%x^%vn_9zx5=g+x!(j zp2z7c%gO2^D@)5yHwkam7y(7&KL3wZ3^Ib_JCs%U%a?+B*UqsyyoIM#+mMpz3u5B? z5a@jmj}X*-i053Mn(pJBxYAwK?4Fodvq>B3w}Yb)*3)cZ3#jfW81z`8^G16%nWTv} zu}09(e7oC(H!+az*%~sSA+@x6^11d}kNJbr@{HOJ9XR3o`fZlroCpq9#wIW4UiLbAsI^O?wacB->U)k?;;^%|^K-Fy0f3-n z2B7UHAD>PcfPiZf@tDKZ;?97`aTc(DotpYlZG+Du?HA2uv0}nNo+wGufXU8Yb#t^D zzf#&G<}J)OiNJd)sjBC!HOrVq#ofL1HN3o;4Y=O$uZNbuSFVRcIeIuB7mEDC#!(nvNxNzX5NG_+L)FTQrCZ=~` z1k2w1$-3_)c?GRet~M`$NL1TItT)J)0`1^!j;#I|)wxqqfd~%H4v;tyZf|CNhfxt# z|AR;u{u_}Na#?ySDs?_fO8X_MV5cE^Q*Uv#HO!Ubis}u2PjL!yZ%+ePgDc6sr7Dg1 zNOE|yERr?ysS@RUP_$Y?(npr>{Y7dg*p4fwi;}6K*`h~4n_Fpop_z4vev%T-Rcp$H zaT*((wDBF$D-@*`*U;7V8WW3#d2%kTe7j*amKcEHySolich`OFqYThS3@zSM4ljqK zYMIkzI77S1(R{A1G8hc2@laJ@``RT(wq1^$17< z4#tvSyA}iBLa3QF*B9m9#8LkB zR(7Tx^g+8fnW65fle=g^9{5p8s_ga#Pfy3Ga%-35BoZf{o_qNtT#3H(#BrZGi;8S}$P70U%c%EHtUmd}=#*&{7V+nLu{CU?_Fl ztM|i3{i_dSG|wT9SHGK+l05Cj{BW>N83qORR(#sn!I5TPal&gW6O_|W;N`2WmB#C5DlvTcq%M+nM-zKax5EW-abO4PEF*OS)W!3QYaojzsH>~)7lsqJ(| znQziXm;1S-%1WexhngS8=-Mti!=~?Oo8O)Adr;bV z+Pb#^_Vw0v*qy}%8cT<`|A)7)jEk~+yQPsXX#o)dr5mJ6P#Wnj>Fy5cMn$?o8isNP zn4weYkY-2;fuTFk{run8=Y2gN&d2ln@d1XpXYPB)b*;6owfC-tQNz{Rzj=BJiM9m1 z6HXWy>3qq`u(R!BO4oP6ovg8aeZBE?!QAC?T3ur;^gVzvm&+_~Mg{B=P4!Ei*172Yk|8iV0v%9*OJmV}4npuNogyd|YUx zqBC-tc3Q4twi+%@gSv8^CW4RB22ZXI3k(wVR|AjpGyl{~qM2p?Dy{@*zvoH+WR_i_ zkgc2rV0r*G!e?!Cwr`%3>8wIM`6Y|Tii>-UgD3%=v9uh^d6f1MYG@oe_sI!BDK(!- zx!heNhAny0mHM0LB zK-SWG3-Esx4Uz%(7z_-vVs=xpq}h{Ei;0(A0AfSm+kUYhiMQhCwE#apj>tp|xK@0n zC~d>%vE`|``v~AM8!s|m@wT7+3{!5USH6Y&v>uj>P)KTR4FSMj8a#2-ACQ=u-KbfZ zF#!!zh#!zwA}4=_PpSnk+;W|~g+Yy^kbBkz#x|GgQ|unkXWH8zK>!SRz+ls=a2sbL ztD{9^43qOg%y+}nv!4Kg7r4nZZ3(_MRX$}QxWtvRuf!W;BrdxR_&&^dpHgEgC-+F-0FnMOf`y~Y zwhVK*U$1e*$x}?&&RziHA9?1S?I_Je&eD3_1S&h*_T-fJM;){i$A!&+292|9f1^RW8ganEF&1TY@e@{9O(*9f7 z<$ac!{oH-+IgmhB0Ae~hiEMprRiu6n=c}d=vTq+txczusa zfjS5yP(H#yii_nwdD+EU-PkDj>NTHm?Btl_ZCbm^?6gyNi1)dOn9R?Gs0_t58X?n0 zXZG*?6z&Iz-~%9a7CtGh!}8jD3IOo(YDOwrroCN;%-puSORncFbFatTu)I&K)=chL zn((1tzu4G(1cVuLS|+3mtgGSSGFcRWZ~yb(aZt`SzdfLgAF1X+7KoNT9jV}INexyB z@MHcQDWDwb<+-q;r;Nu!#w;}1^7#9+c%UkA#0Cr;fO!hFC1Ifq4dLF;HF%wMsOk^W z6dt+qNZfPR)dZ0O?AXJTpXn8VE&3l0i;7*QbB2EMjrRb{k+Uy318Iy5X_D3bPuhmiRrpV8gjJs zA1|}#g$vunHQ!412cD!b@C7+nmQ120qHzycx0P6${faa$g8&P_%3J zZlgtYL=h#>vX$MIu$wuBs&{c9! zc8^)#TC+L(e3{+Q<_VKQqogCDY*{60Vb%E&A5V0NEE}o&fV7`d`IWg8>+YkE4uPL| zf`qWjO4NH#4}cmxlIuHUKn;3xYbEIClxu2(7VL(7caPb#rYDq;P!{?O;Opbf{{--f zbrQ5sof)}|t_A4rF!Erh+!s7TyzcM8hD>{V&v<#G2GD%i(_E>k-Zax*CCa~kAALg! z8XhXN-$j08P(#EgNOFp&rm6YnwG(Yaa`Ne-P5+{s$>|E=(p8*~6YdVxBdgD{ypuwL z^Xk67L#2o&F40)`eUzgeDi2pIrcV|!x?3FiIB{=Ec(=T->< zb+b=T*`UFX8uPz{-sozE#ysoNpj$O*3&oOd0nsk{guZ79i@~j^#N28nh<=#>=yBCC za{Z6=^}lCd6Cznh1R*buyP8+e#K_6Lu|`pPPbLnNn_W<}I%@dDZLR|%_PY3Vj@lG} z-v7ef3AeO_O27$-TH)Ommqg(~Z$?tIyY^{s%PG6QOA-?8BKn7i<~O4 zS6}JZe59F+X0Ib+rW9n07l)_=8n$ln1(l#&B|Szx?xeNFCkl__i#I84#O0Yt0=o32 z>wyMSRw0g8C00`Ex}hWFjuk`AMzTr|i)ZkAVS9=OYdAE37nd-%9BOE)(%kGwr3g*~u%O zgH}}`9y|M7#WfGl22U$8(l~A*zM-+h?-MPJpvR}L0&J0 zh6{9AHzjVP_v!AUe_=8ojtOa{44yOxKa${+{f(<9I4@~`%Y9?4X`AQMul#dc=>9uJ z=QhR89Z8Sd2_gVg)g?=0Ww(=hB^v5|MrHQUG2y=Naki2pk^~F7L!7)yrr)M}zK*#) z;?xqy-eiocWu!NPJKH2YLZ*K$cS1umVr3jDzaRBD5AFv0SyfYlCxecoU zj`WAdY0_tE1NowlByU(X-5g0I_~X5DN_L6U!}KbLc@mmAKD7dJEQH;DfE z<**CI|78-AzFnNuZ-A5&Nj##UPOfIrkqIDyhCja%uP;Ke(TD;nOdz&)8FZ7 z6w39ru_mec{qZEn1#fvvOZ>n9wy?9;DrTo$n!Cvk@`9g^w&vlKPMfTwG9J0Uev{W- zpX0KslF9^pGA;&}v+eLnOoSt_U!<3pz?4=>%R}ee$jB%i(URKQkmGuuZ_{5VDGRxt z$Kt9#E_fWDnE1KP!rtm%tHolL12RwsPui9UU^GM!|F@#O=5GHk!qCGZ-BwXU_T@3} zbh7s1<99~ZhxJ3T5|hHh%0)LAOYwt|%9K&nf4l%Yu5HgGE`oM;PL>(`H*pObtY^Ik zFVm**-jTzkG-1iX7wz}>86L5VduW#ESkp?mYrE)lz!0MN9lzL6UyoCjvCq6wYU3A# zA#u5v@Yc1+EU>(Q!|H{&dAeN-OwQ@_3SoeD;SU|4r0G0;$PP;WiUx-o68s}_z3F82#S42Ru^^NUB{l8G zn9sO~m{NL_JnCOAb(iStaSu~{IDLeaa8_wJAXauu@3EZqF{AjK2W~@GS6~HOjK&!H z32J(bia?)JeVu^&#fN>uuT1o(z2@7#5{I6{fhy2Sh9G^6Cx4La32KfKU9()cQ9LoJB9w```XW*ltm{CdJZ!=T33Rh{pbWuKqx| zEzvNa65ZdSbU_Tc-MCAlbfTwl@^!K3Y`HtrBPua-+NcgrLwh+i>ls?No2R1uuRbvP zs&yXy&n)@hQ{tc3){>L|f4leoM+*MGZzfOXFJ|!P3H~z!#G@9cIR2gNfDJHDr^15) zYRAcvs_25?Pp6Lj{_qRWmcSa$t?Bi0{0*!6vOK{|urdEgUky=$F(Z}JhhN*wdo!?xG-ex=>-dNJ)0~OdRbbuZ4&?5w_lrRvdFJ!9?XOpXMa3 zgP?Z^#^D`5n%xc3t*9lkMkvb4u*)*&JQV}mfe8m?NCj6l5sZ62)Vy=eeqd24=;|v? z6-ujPCiUgtQylXD-DN$TLVvZ`(bsJX7WMd8ayGPwAqL2LzJfqL&MxodP`;jkX8T+~QUT;^lrpMq9?LECrlBhZxYk87=OtZanm)>laz8~=v z>~|(g5A|SRsi6{ATpg7E3Ow6!Me0@(7JX{mz)MUE$ycf+%c1(&u{pAnV)=B1Nf9fR zZxhW+{OQH;6r@I|9%C<}32|#mu|vmSY~C2l2tzeGqE61MC`;fTZMO0yz39G+6Og}+TKJZ5ptyT74LRrodcw@JS5s#I1sG zLkW4OlxN)sB6~tlNw$B!wycq2jO8{ukhvsebrI%XWN(+|^Y~DE;z$imKq4<~oqb@eJ6EDUULE$v z%6y<$FI6kfQsWIw&+A5dYAnE#8|6k)X%zGCJp4;YqY^7;14x7a> zxE=g}Y23f|>Qigt3)6Bb=E#AUjSCNnjTR-)u@so&%FA3F<3xC*AoiYusrvf`jRDXm zgpb4SimE7Lp6gr$!G2o0ul`_uIBo=*3x@=9Qm1IdVog#=*y=nO>}mKnnk<-S)!yDs zTs8zCoVxBor(IV#{0+6j@anvX-v;|B{@w`%&cP;7mDC09&MwQ{4dJxSjx{m0vTUB{ zE#M}&WXSjGR2+Ok?1}udyCy9qA~h1&dyD$zsl9$_`x1jQ4YQ-w;FsXrR=uXJKDJvu zxZ8;IvRNPGpa&7_euSdQ)<2Rr*01DYUgAe~-Ly^%7Zy;jC0ydYXiw~QHTj|yuJG6_ z^s)LV*3E4*mI!U!4D3=#?t|FOVBLvmDrTLbOT*JQM4oxIGZ2a@Sn|Z6BV5~ibKjy9 z!=3zK+3{kqF^@CR{iA*x)5h$pS}tQ~KE9TkJYNv1G@VvGGU+J8&a1aRCq^C!vuq%Y zRT7wPGIR6Z37w4+D|-+c-yk!k_rvd!TxNnPgDb4fQq&e2ITj*0;E#?1-XyZ8&_IIH zYR_PUGpju=1h*PM?6kU#G}3a}6Rz+ef##>WY_D|%8a0*1ogk!Am%cURpUk=H z`bDA05@IwCO-6WcuZg~9ZzoV2+~P>Qnz+5#^WU02*$)V=e*e2!57%J%N(tu=x>1e% zpScPII3FafHdG)mdig6$TBe1rn=v^PLmtpd?G%9J#@bQMFQO^)Fvpt7NJ}S5p-C2UW4HcSKN+;m-)~#DRG7)$WQ=A z=>a!eye>)~iZ^=6(T)-JjOzYj4tDTVq=vWsWZ8Y`qV?(=5j~QCK}Cu725IOm@oN^f zFVD_7K1(XIbHJOQ@4F5 z4?*FrtnQoKhTv>D29oED#D!-86}nR`7w5jCuMT{jNjvFsx59DG*>YR2gdqX6IS#1t zIP$*S4z`IlYy5?AHIkm52&&=j5At00tHREQ8kZNh7lfm$2e&K^L7xv7y}31-u zG%77k$O2dgOlDUp+dWF6Oc%Mv=OYe%kUBJ<^&3au&h!J8p0HHdc0_EqbVPja!cGW# zz-`tS}rInT|kc6DpfP(W;T^Brpwi>q)4TQ)uPG8X5%EgjsQbE#xB zKNAyvW{*D9s}tR4@*4fEx9T-rCtHs9Ney{z#9Pgea{L5^p}5nLAn5YdfrVA#%E3dW zPgY*I56Qh)FnaKoj?I%>>#|df(-P*|Udk@-ha3^=){9jdJTfds?Y69IbCX zjhw)W?>KPD7aU zK-<~Y9_zHjs88#~o&-)3JpCeKaxf*l7`KYJt+;_JIuj?fp)fs%o{|Oae#G%Z*+z2S zC08G&=GKUU>%s7*c*2)1g@Tq*_n*jl4s}s_=aG%BE}soht-BQ-ch!6jy5gZ!}mtq zTP67dXW>~p$I4rkH*|dqmlRM-UTxbl5TTJzqRkG%o|W~F_kd8s@-KwBCh&z$-V!>- zp2Zcn#f`J2_S;MI2LJXwH3Sl?C%P+bT`l8tM?jLg*^yKLoH^R!`MJVbtq+g9pUr5AMZ`83Cg>U8c zE?;EHrY-f1^rO_IGHrgn z9{Rbky_s;ikcYg{yi27+G`w*9%E6YF*V_RBMV&1P+B<$-*u0*8rSXC@a=@bqc43hg zSqB!a@MT!)C<3LN3{a>Sg*z?9@B2R*xoxg&nr&wr0ijAe zP7rYlLax_6Jw?4RPNIhc2nK=pDK<=*!r>XOBKKR#HPWHP9Pel=JaHh#J*i|s0Bd14 zkp<&6hZyL3-b~iT9sY7ppU7TVS5}p7$=F`>Hj0~=ruTS|Nl)R97A&Z5B^)#*A05|dH zdNElIXNyW!pB(QZe>$X%<^lJhiE4&7=4p{Y23cyQt;w^R%ILJ zKG-K2=M3YegIQ+Z_%egfVX&K7!w;GCtI7QgMM%iG)BL&C2F4g_QvQM$VM5zB%kSQ- zr)GN3p?>lz=c_uMwY`nJXXI& zS-_nUKK_8=$`7+E_QuI^M?Q;?_YMTpaWpuBUcmDABMwYN)HyeETRT;Y%qa_xbhxh+ zePr_FFT_^b`WgRS61T?i_cF63PJ6=_cB{A*7UaenbSw=YY;LvoCeX>H2Lh<3T8K9K zoyC+i`^idTINCyEEf0xbX+pcv4Y4)06OBOYzxLFJda?*d3-tx0>JjrFzC;;O^~K*& z7;qwEI3MzYGIPCmcP*p~%e@UhvZZ>`=MLw+Zp0@_u`=rA&3a8AsA7PXZP}Ngu#{ z7b9p$3YMb;HqR`8?e~$DayUORg&UHOS|~P@r4EPmPAjH2lgzH+Lhb36a!q;H*^3~9#{p=dw{n(a~{j}zdzM?Um0jIf@swtCk_w+H)@Td~2 z6}#-&5w$z#8*U+rwjO?{<3s8WK`7?79t#5A7+Z-JI?Q8dp*WhPCG^-vDGi&*L${Pe z#e#;O0viSX`1zd6%*A!Zb6{SP_=Fu0)Utx+9nJQ1_4YKoY_fB+e_mclSz_?TjCzKX zgRN7wXWiCFXAdx3b_W0T6 z&S)zp=~qLXy$O^Op>-CE%&`d8?#xO*T~Y)0rQ6CaDubpAD+wD+Fd*0G)OP~e5N z|8%A8{P8m&bNY1~ak#wrDIbvJ!RZm%;3m>l?nJ zUCZcVBcUyAmOVsVl2(Lpe7udA$GLHF4+mL zW#@#N&fmV&`B0PUJ%zyw=Ma577{bqi`%PPFSdQNzXaXHnF_^9?0#)~Nl$i;&g9z=i zMB$0P zX4-^fZq6E?mt~LE!zsz^En0=62uo_`&nWJ=Ej1bu^&eJmz_6z-p0?G6MzG3nwnGW1 z{;oP1?t|ph-OT`oW0{N?Hj(z+>p4j#)KaGrq5q!mvC^mXYm8kcd@HxND$xWh$ymO4 z{_vU>O(HQvV!f+ShnJpZF4SqKcZkCjL@)x6tmcsM9oK4KyllcTsWA-K??gu?AA!1f zI3WSiQz=}5uX&8U=|qRqoFjn1AiEa?M$?I58ah1LzhQi1UReoZtbMoZs%*e%mda_W zBUfPD#Ya#ma?zB(x1mtA>rc~kK+Czw;*RlB)1FnJs(EWUE->HoU)oCdbIcTETiv3FPf25;1P~au-w|6J|vxLkUiZ!s0NB1mhgFjnz}hPzi~cqkaY43PJD)< zv-Ho71)$W5nzVMdq{VLqoSXW4z;kIxbfN}t&8uT4>vBfwR|?ovD(9>Fz%~ zAj3qZeSS%pTB$a5OqMWD?<><5X2QJw!LxZc*4<7zXI0t&2$f{iA1lXy16g&vXIok+ ztr;qYK*=~-f?B5$X)%jSXtw0fe~RBjI6F6)YhZ$#6=Jw`tPJ~0EC~bWD+gEnnhV-G zbE8yW&J9^DJQsw0cE6CjYccH>SDw2Kpqsj*Zl`wIvx1Ep@i)+4a3)TPFZJ~Uob#tP zJo~oiCNDZ`_~!mAyAlBJFk+qfp8f%iV&k~q*e-ZU8)E=Hi z-J_kV&Ym?!BVvT)bgM-UIWc!ql2{E&+;I}C`&PNF#|dQt?6G}N>$#=#o5Rha#h1@G zKP(-85t#%KZbG{p-!(C$KD=?AeB8(%?)%6d^cp&`c-*G9o_klBcQaT6cS;oKnJ3&} zbmnQ3yLzknEDGRMI36_*`kt*+#ed|uBlS;zY4Z}ST~uD4D1{g-lQ9Zli$S`lXqRTs zGXX}(@m*IhT#j-|bYgB%b7?!2MmGnmX7(f`}6Cg?Upwyx~tJw9g0HrXkUB7 z6ZWAC=qQz};nw%KEQ+&!SO&TMWY*Ct0jRw0Kedq1s`xn^W-nI!IlqU`?qG-WcXhYN zWrzpB{a87or`t|4XSAF;&EH(cjoS)?9LwURlkDaE7D`!k*S*e;5ye$m=1j~OwF zYIS8)fFT(ftYk3E9*f#_a~z@!{aPTL2E|)Gv(Z$~;Lxl<^{<@n5XLCSe=#iWNlH3e zD7c7joK5gfOWB()L^I2tq{WBUC|j<0-jK0=H%%|Qmn~cF0v^?f7zF5rv?hSy;}yOU z0;Ip1qu?|9LGXc7H(RbZaz(EHc$T=L$0}xsRI$i-5DIVbbstNZLXjJ4**T4xQ72oH zxwUG?N~wAsF3TIUU35)on9Hr!Oc1%66gFHqyqY{FGOwQ_$+WuyhM5qHgGOe(mvmQddW;5F&@_lj^BqC`IjV6o)|l0s6K1=fzlV;6q$3U$NfMo;p?BRtN039lu36?={%DKfY~GKi|YDt2!{cxOStNbxHzW z@pNvl0$pGL6q|oO9!OA7Z!{ax0P;L_+)ji87-gtS5BO=x^)F!G(l3tUkG8A>hvFz& zK^G~1E;;rGN@#wR-Vi7R0BI_H%SbEXSFV%N#k1+re3VH*pX;>Wc@4&uI_LA+|BCYt zlk&x|86(;s?D>DQx*%=c^@{A_99eI8*#6uBOCFCYT;HmQ3L+HeL>b!o=$0Z}Q2YK_#XlVB zX~KRmVONDc9`)4KQ&J=?D^#dECnS&=!UYH}((J*EAs*m)7C^$Zy61yhl;SYYJp!0y zGEmNH-<4Kjm!Y#ESk`dGXmG&*UaondC-q#!9@kexxR52EJgt!Bvr7}hI2NC<(3W=O>7(a3XVzXm&YQu|>WWWVm^F=ZG! z-bO5BN0KVp_JJRd0H*@ovsG9o;~4hwaAf=Of>9tsI0-xex%=iQZzOm7=g0?4`{C((e4IudjjRm2n^SQb@b*_Ph7#i79#lnXew6D=$7XI#@**DSb1*w=) zGkV;pIMkD-{MPd?(TDSX6(j#v$yTu7U3jzK&w*#5%_;4R!0pc(gWoKQTl+79trN|4 z`ldIZK(RBMkq=QEEXUF9IaT>9idvE2ZDD16QA;l5#-o8{i%1YIAacdbw2W(c52-@^ zVXGW1iAMXrN|3_O?CC$k0A4E){5w;7Dg=kg${*-wL;m(LtyEdt8KWRaW8=*Vt zpL8MR%k81T>2URgWjxkyr+1T+NSW4Sw}uocLcv?|W8V;sw-myG0N~7A)dq_fTCK8t zX9AB)gP|w?8eT!q{;8{7e#BQ<1f2N)RSD@~o+0LHPQ??3;PTSt4{qk_xi z8&{df zSTd_CQ}H%P^3fJ5I_KZ812k@1d3J**J{OJ6%8!#$pXG24FD-A9b6h1blTK|BcBRia>;Zw?|G&VUV~b``=rq?O~I5oQl=5( zg&V`;yQPBK^V8BdlN$A zvXm)$efHVisar99(hf8pO`$VFNeOnc7W5S@;W}(k(=YoUjvh6*Jm&Q!GpJtckD`A= zV)Wf&bEoXL>(8zWfD*_u!lZ2X7N1#d3XS(#aO7*Z9YK+xnq9#+LgU3RhcrpEO#G{_ zD8|3DS(QXqQ+jx!OZdhg0`KQYq7q5SRcVs6bPz-mXb;WEV#=)Dw!x|iB5UA#&4>I# zZG*?yVjSF`07v(%J;_{z4E1ct@D8!PR1>yTk`2VgBMorm-U7T zML(lWKNa9$!AC|L5xWakJPMGwRRzg-{k=>vVcz3h@U8ibx|ManJ_iG1V ztk&x%&5$43ouvxi0= z^}T^0Y$*4K5juC|cMT~nN6#z1#c~)1N+cGPqO^P#n--@#nSvUB zbu?Vgq1d?+xtH&qUx}6Jh6peM?^kd}Yo!#??j6eMQEJCALl15q1b>~TuA9=@i|`(w zbLv_R#mSuJEU^9@Zoj3Orm%fYQSpN?aSCpqf&%p-O5q?%Y#u#x)cmG}dlDD9v7=pB z-=$*MrPMCNcbsTYL@CGlOQRFavir-ffS#PiKZc*% z0FA1?mPA1fb6pO;;=CvJzgX&>bKkmZeK^JRHcgdSrVac>pX+-a z$5vBy$oX*AZt3Ys8iH7W92-2W1h1~mQUqTPKXhHr3WbGHUmf$Rp7e4CBre~1KYSoJ z4#>E=GWceB{|eZ4?`iOQ&4U(83U=^ig2d)t^NF{gg5aA{joXFbJ5C|bYrf^{bMIGf zr_-cIZp}~H&j{|VzfA<}iY#xO2`(81J8vZ-M9+SaJeh4zdN^x*AOKF86i@u!URvUi z8hji4kQX0s{0nomz2isnjCPl%!s?B<6yah3(o<|1Dzj*Ax~iSus@={3cQd|gym@N| z%R#9JcPpVRv{4))7Y04uOPdXYA$q@q&S`$m@Lomtjd+8(TO4_GgQr7N4Ai4|fxYbtJ0>ZKke25%CJu z3A(gU$%X1sBa(98?LaQ%{yR~GU+U5SSNhoW0uHwF!~SD?i`l?@tDkj)q#NT{u=E1H zN_iKcJNNT7*q8UrA+)$UIIdagK63Q7^oLYvGX<{{P9;nt7}M(N1c7;`vLrU!`F@=; zwNX5h8??Nm>y6_)3~q4kXSm&5UXrcTeMYAGd3SqLWQ*j9@&hrih=ga9^+Q9!=iy@! z&PEs8GhVhcxJ;nTor>yjjJfxc48Zy!{=tDw{LL8i2SwX@rL0VfYQx{(5O;LQR zq?l?M!T2-n7WBYx2{Nm^Lk;BX zEs4+i4!Wl5Gzr4qT`~{eM&z;p_Cm&m{`rvpBm4!p;MB7>7@Qw9b^mJ;} zSzKjrAD%|jfL=P9NhJ`UyE@|Q1g-H&&~~S}MtlQH?++p&ukya~0i=}kCF;J+vDk@P zD~Pz{PrgG}_SF&=$t6k4pB8!t`SVWB*BQM$*))MQpXTXp+8LImoY|RP&=B|tE{`gJ zlJ&3q`2UOv!UM!Q9ui9jipLrLy(VTN^Jj^vBf!jO{npxB!rAfWVgHox&-@@^-Lw^+ zMhS^af&Qgo5k4qi4xa}5j9#?gl(s);wC3KICo=tM&wsz^zlK|MQpDdE|1WP|wCdBp zJXe3L9{+hup5$K~^nYFUh+h1|U)u6tUmbw*|JZR5?fU-8~9}N=r{a&B005j~ub;v4-2P&2pz=n4j@J1R)A^GNrF?Cqgp zL`FbQDTML&cfWuw-I&$p!^Dw)R|6B~IsDxiRj^WtJ_sm8$X*$nW!C}E!R8DM_lRgE!v<%I?1J-5ejgw zw#Pi8%cgBEH!Dkh0(sTcDA@enpN@pVV-ugI7DXsZ#nzALOgL-xueN6zn!BxQ+4EVq zh%QJA>Z-orb-Uw@j0i+er`A_8_vJ28%ntX* zT*rlAK6z4pFN@sap)|g{)6l>yC$Ku=1~$vkOAwMUux6sGYvh!3SKw_WcD33(A3eY< zNb$$u!5R*rFC@~3c%na{bsv=$E}^hF0s4)BR41G$o--ml&yg$Z;qd0)J;S0|{~l1f z7nhLdCSo6%C^Zbz@sJe-3d#l8fTzhY9MLG#+w=cb1$C!wKYEjzF%>Q0)9~GpUYa%9 zH#WWI@Z0t5v^#5z?>+pbB|8DTZ!*$G7R8$zf##}NOfz%f z{*SMJ_pzRw@t`rT@qyOi9e|0scx z&&9Q%^CF4WSXg7VK_uA$5r1?Yv7;)6v~BNW>K?jel@wOQAFI{PxtzDUIK%@UgI*!t z9$mrSv8CZDV+|h$gjpx0TrM3Y(Nq4(xYD;z{-$}Mc#cJ7IxwU&clE;pGxB=6+ zoe$H?T})Y9d6O_X>8Zf4C|19lvO)ZzOCu!I;oW5JQQ1%Sxv$SiVFE5>7o-_YIQ@D! z3EwB8J9Noy;{5kd(z@M0H6K&z;?((KTS(cHqrq;Q^}C7n$d!)8BEV`m2(1UFauY3| zy&=sUXIhryuHV=)G~&TGF;Dz&)(rLWug0&ie@Lnmz|RsqXJT)#tKkP*pEIZ9t;(xx z6zo{=yxAAEKVpNbrgY1qbf#Br8xsfpEf6W-Nc)>i+{bEHFC+}~=`@hr+Rkai?Pn5(QuvMAmNtB+q{(kj z#dHD>b^-~CSb@(3RS6#I);#|)?`c=y-#J5giH)wgLsq=lbgwmom$1+<=pTG`^ScQ+ zhy08sXx9Rili+7_Bt|{}3&7`qH4z>!{@`i9Sr@%y7dHuhK3HTx5+KKT)mQK=1{|nk zPwmxQp9^!=*+e3gQ%hQ0`f45v%1(M0Ylmt_uSu0^os?SF^lIZ>9d+)j;>E?9<(a`P z@lBY&*VVl(qHi8Z)#XNS0|@TdE741IWFTIB6gFpbv$SJh%Emt*+q|{S{2c2bt!^L^ zmNuH#=w)0Zk(bf5&_6f>yUoyFAGkXGW-hNI>8|hYR*|TknFHrWSsp&i{UDU&tjd#> zbeSVz@JrEuQnZ=1x3rPjcmKWR-Hpt@BZn^c@2qn0yLXZlh`>Q-AJ~$d6%Q9iza5_U zQZ4&6_85@cKS$|ZzE(=4F+%>4{)}eQVgJ##zwec)jG$wnHrgL@0+=1Ru<$HZnq6n6uOUpjXV!NTEd;I zB%)pj6y~9QA@oXbYx!Piw0RCA$x-pRqeKW8-|biGxlvKXq7e2@L;AG^`N1!CGfYW# zc!RwWAHx{5uJr@wd1>>Oa(^|p&2|L`Hh#ZIN^9n#;~CpLnW42e&mZpB-5&6Dup^*w zAee3E9%)Sa5;INg9FxA*m}hGylPu3t()8`|Q8fc&TX3OLxl=O`TR8U=PsY?j-z%w? zQ7zF_9<{DpE>XBTeh$9LAQR3KWlELaUw+J-HKqg#zrBJn-fyd6Eo+FjiM_2hGwN*? zjMJo!)R@H6W{osyVjhOyxr(=n{jevmKI3RzwPp=CQL_yk8GboD zWcQuH;PVjaGZaofy zf_-YgM>K!TwMsH+UPeF5{G4e%wWwxxbcA>-4rcERpkKQ%<~;m9{rm=VjVgH96yhNL>Cxlht(YgMj^eH+pAuQ2ZnO^vIF&!82V29i_7@(PzS>YP zZ*IMr{zS}MNpBJ?G>Do-!dJ={e8Q{j;Ba!qylk}jaFyb9tkvC^_(ivRG2L(BRp3TD#DOIln{w19Qf5Bu4wE1f;>dg82EhBg?C zsX@(L1*MlaCN`UO)mUyXlM3*VRS%R;5*lMgy$gr-mncj`KAPFcguHXHefcXZDmLK2 zCqv2b2-ZH{DLqbjK}x+4fTGIX_W0PphTO;gn?k{kbXN11l6T&#;~s`a8Z_Vd5d|qY zB^)WZqd9E6V>`Ux=vS6HFZ~uA{)JlsK7XT&meN2}*ZrLY^@?WGIiN`*1<)LJ4t;mc z@l-qab3@)Mx!ZgjFy}VUXYcQE>p5W_1XA>;_?NU)3S+6`kL@ETH9RA?s6^}eJ4?Nu zzY>GzPuGgpG5kn7cHRL;v6g~UM47r@W zRpU#&iqmyD0I)PWNp4}qFKxf=d@=|8YRCUiZW!N13Hg9Yohj<&)&J%)HYJwcgga!T zqfaQR-_9_d;71?7|Ft=ghn7W*t=ywjnLw0AHiyj4X+pdj);g;3m;%f;qdxKMkx!_q z$ESYPfq7RNV4wfNU>^kXMiDAU9h+#~CwTZtQ~%7j{UxDX?d;t0_s?y0R67jFjgYuVUpzx~ zPSDISPVn)t54}URzwoZ;)PQO1?O;(D;Q9i+rZd72|CA@yNR0cwtBtSPc35zU<+CF6 zN(=~p<>kWG)>Pk$`TcL!~lujk^St1sp3OU2`ccwy}Nx3%)hnZ%76c8*1_q~X7B zDwPos;f7Hi zF&{jhlOWv6{XlW{{Wy%ovJDusBe_C>?6ftxHOKwV?wr zGwGD%WY0}(qTI-bTdNO{_>%UitwY_M0~$_V$gJtIWcDb+i+dH z06zH?LuK?TFR^N4TQKa*d^0{rAZbMC)Yub@mFjBLL(jYOp;0V38t89DKu!|^37 zZ}?~;1G{yYcF?jLry^5i>ixN&|LP{pT-uMPfo76n_6Uh7@pjgV;{t8*}iSYHaD5i zvT_d2mOfJe*V=U-19z>#N}uGUe0ZA6w;+saPHUKMe>m|QyOK1Fn~P^=1;~-dq}>Wv z3YRs!TPNj+x6mxHc^-~O0?kj{oJu<1y#V0NB2_}daINQk z&RjrG+Ih(=oO&I!B;=d~WeGVO*tDQpg8p-$kw3>6y6#h#K*-mbjQnBmm?ZCF|5 zqiT{p7=gq&#P75JK4U7jV7fLt15^~U;pQ>C^v!B9n_4|J_1ortkPiFtZ1McHHHxY$ z(6-)*sjtNT4|8uF6=nDJ4`YBxDuR@PAYIZWB_T>kcS*-c_mGMV?V!@#Nap}Ucf*iF z4oD6m9Ygb8_w#)3`+4qnt>5qc_g!liYcYAQ>+G}lXMgtR?4y-mHfasAZnf*$N3Ad# ziZhSq z;@)@i^UyZeOUgJ_snDyQrI{62c%AA3L0Et(j-HKB(O`BZEcHFjbPABDRh01k?yUn& z`dMTR|0zsHoX?29N_XDu_$#hU8MFq^-~gGen!T=I4fZEHPp#Kbr8>V}9Rhjgtws^U zT-E}3(bP~?S(h~!*6rJtL6(wi|A=xR?=&#?hiU1f?}~OwQ-$|Fl)VKZT5{EEZLmvh zn>?6u@BY)(|KR8U8}I`GSMgj7#{RG_1NA*j^e<0YLwdO4ss$`9H=)ZOXD#U zf2GE0jw3K6)hQTRFEvWtq2=#Z3NS_KeX-Ha+O&(V_NotN@8;=98%t=I3erA4k{;vf zdbO)q74Tvhp4fen!o|h!b>bTwoR!IrRx70DeHf)4E+lW)DB{96c+&mOzNv&uN^nV9 zZ-`Z=t+ZjLgCao}PefRMW#21GdUAiomF$$bbJO;8W{k2N#`j{MG~_La<7pXTtWn8h zy>ecxxzJwy+3)+7DxZyEHBMwLi<$s+l^B1D)9$8mb)$lCAw9S9Q?v=qdPr-({J!G zug)>@nQ`TFukpktsJ;-ZW^4l-k`h-4!`%(rQ(FZa0A5{}NW|7N8iKqV3s60nd9C%s z$30w=!eYDY#hOun#ql{(bhGk9BBSE5$v;2KC;*NE_$)9sVbRBrF_~qIQXrNo&K*4{ zvRdByXV0-hFMj^~42!lAV_#8|fc{F!C_Co72VHb^=6eG11JXqR(4WM?ipT(xfXXq_ z@cdJWsfd{NN<)v4?unvjJ`h=bZ)FqFD%vcQoMkB`hXp&OL;G=rZG=HgTqpQd}V(~L>ic8b>Lw(@lN&ewsi-`pLKdjPO-@AXEQ{fyH8@|*TEZTBp zttjMod#*u7gG&17qZ8TU=9pxtpixRWj6eeM+<_A7EBoi_0ZAeH2=$B@PzUvqY3?W8 zffzOD3lkWSM6Jg#JvMwuIJ4aidn)F~xN+*Al3IE+BqP`CTI@THNf*~%Z}AZxxK1WH zhMlRTfRzMH&(7vaS_!u&v#7h>zWlGn+J{w;8JI4*}Zo=`JR} zwMaC4$T~>~k70_RnDQN-ipC|?XMGw`O0ql@vlXY`G4@>z09yf0<)Pa9!Q!m`=gkBv zd+(3_RFJGh&>4w9zcE1p%(@KB8?r~2O-^&m!*2*SqF!i}c;2br3ncM=)aF25@J1|f zH>GEe9Q3S3b0zIY_@EVbj2eAjT=w{*@8}O)(gUdqb>68)jA<0-R{-`7w%n-QKrlZ* z*2Of&4~5vU;V~gF)na`G9aL@qETUjK1izZmJUlOl>G{-EQ@)F=4k|y`?CT!f^*NPL z{{VMEJuvECVEV4cKJycQ@<5+-uC^6XaP>KDT?F^kSz}K5U}#KZ4}j=N-BfY50!6YVs!^;0MHR_qs2*@Xa+7+1)nS)ew_FkyeM21U>IQ^Ipu^0i5jt&Wi8# z5{Jp`&*k)%#{Ofw-rTk5fb)Jf03&TL37H#dbsIo>oMH*khmYr8-#eb2&)24B(xG2| zld5myb)Xpx#6BKg+R|HN@oKRqp)MdPd2iw_DDeT=uFfOP`mT%2nbr=Z(fI7DL3ID@ zjpkNdtV<^V2KKJd*jCy&XL~Y#YXQnMeIjWT9f@ z{>pRPq={lK;3RbvY$EdQ}eDIOQS;vRyAaq!@CzN@)<|x>gkQi`QPsk|xbiX9iuviZ7ygUO++|VfhPKo#1N~Mj3=`BGJ~rhrZ+WB`o(Du{L|W7*Y*EJb6vxyCVhRq`o^mIN@!+ zdTYHm{AIO&_jOAE0xmSwKa26ZhbLuf&w<*A{iJ3np9RTVt%$2&>aG-D%J+tKd`oz! z^7YS1(Dlb2rCz!@bH%ius+CHX(%tL#>bL2S1!c>T*f^I4S+ZhXXVR^jOgY)z=%NQt{y2s^RkC-3 ze4fnhI%n~9r#Bi8@1?P1a}){#v^&&<6K&8lK-0uZ+b|qQy!SN0h)87N@d|CT>RYzF zv)Ro14-B#`4k6SUUGgTlt@RnGx6#hRM$22Wk5wA@R1-&5BOu{*%c_c`=F{3_PBMN9 zN3`h*2d@J1ApwVWBFv)Z&p@u6fkLOBGXaTcH zGPJDNGGUEH+noT;Ibj|;LOpmm$S*^7~WR7CvC^<7O z8F@D{!GIWRJpz*HCKccs^XjA_W7@$?VkP1ZEkQ{|3u{05DWgrnAxnfUcPNWkFkzW9 zC{Nawd~#GFb8Y9d7$F&r_{rm&w(ktroN8MG;8_u;oj+iPE~yd*xFPL{{;8PdEOTO?VN6gBQXGkfQrXvJA> zNcqYE6y?*>r_@@KNb=OC_}B~cA#F4mt7<_T1N?H?pZEK-Y4_b(p}|@mZE?6WUrfzd zS35HAUJt4HAAS6>dx2}ICg&5S()S5Qvs;QeUv9qtQ}K@E^WO0((xs?o;qk=i7Lk^V z4QxML^N&t7o$&7h#(%yYOv=2s)1V2N%d){H??Kk_*EC>#q$?lf2zmA?%qc^pU6bi| zzt{Hxb5P1glgXO;X|rCr+VW# zfZq-=$bM!IJq&pj%O|X}+^rcl8`E=>huh*q-?gOeNn_1Rm&=VnRr^R3{z%dRuK>$q zOl{jdaxEzHE?^xDS*1}<6|4G;muXp#baC=(3zdn6W9tMovdVbc0}jo?JM4L+?~%Wq zN-#m8s{W_i0sZfz%#wX!cn@_!NID zdt_kPK>-b$vosu#@B3B~P08tG)t8+ay(+5{+2O4GNPpMR)wJ{&C7{_?DMp1}k7XDi zpc74Hj1#ixF`IAJf7@)Kx@p;7C_J#{tgS7b6+!vO@W-G{%Dlq6Ppl!ekDQMB?+g;XBq6)oJ6(oVqeyHyE@PkgE0=aH#ErAgxLB(cb zo}@xc#*_`{-9?i6t#$tH3OmHJ>hX_6hMw8%M&pt`PD-wMmcs}G@53i@^sT)^3cT-! z4yTb6Jr3Hcg~xgnT)z?MI$q*}Sc-g~>Q!8Zg9)lg5?R9hcG) z*>Wk9{+C-ID;%S`-41FIdXbLfkKya56xXc8uW;ZG2YiCCpF&)VOZEKgNV;DCuuutB zF?qfk9+&JYSEya}wFvS(ITo;{0rOz{?fM$rmv6rE(Q2uuP8f=%TY}4_J`H7&U2UKx zz;u0?NS0TzpmyF~3z`0^qIpWA6_lQGK5Kp%t8{=YPdXvI>0@6cm?O$+bdx@}Us&9|gm|2H&%r2a^eMxJW7wfpYA{?(5}uOJy}3fGFs4SG*>ne~ z5EuLD`f?@JJ<|2}n-7~->c8zhh#Qp;j(S!LQ+5Iq9r`rR24baq+ zCMS+Guj6E(b}lkGJr4x#=_(q?f1J5kty^W7?9|j9mdKm0PL{`%(09J-K%XsGnSY|= z^{^u!zgzt6$>Nox1#hCYcyT^GN!UK~?J-outvtO7&9xtJHFtZdNBllCzp$92&*bX7 zW&I+G+85ohh+fVeK!D%x$>SnQ4^lH3Udl|dSw-{8Z#T2XS9x95;tYSY z(#s-;ZPs9+zXSrE=63I!&Wl4{{AV%{CDIc=zCo}i7<}GejQv{nUOHE$@mggj!!P3Y zo{C;3EvU2>uVgOcEqqt(%|XyMmcl5m`6kyPllP=po~d(uP5yVckA&ipr7XU3bboLb*0O6+Mtj_QcpQIS27 zff$evy15%BOeEJ7Go_-+^DrZ(07C z=3?!~dcEN-P|YTa0f_-8z!H1b`m1Q{nJR;7tyZmiJEIv{#G8E19}8cTrGM(?_3z=9 zyj~q{D=Q3oWxHj-e>5J)z89i$?H|VLbub{IXge@%FGJC~keyVC*99Y%8Pk9+RNZJK zA!&xxXI<8rRXhh{Vc|{k0tE_Plf)iE5`Rx-OZ~q@0{-@0B&u5Z`!9cc<9S^@k~0s< zgBhsx-VqZ`S(i5q1ugPGXLttAels_utE-|iHm39Y1_LDS21Y#(&m+2?HtmLHlTHYV zH=JuD*+eh--(WSu8W3JIZaWy&G{%;@*|q70EWZkkNByLJrd8VPg&BG1{9q-Q2xt6w zjNCoVY*ZM9tE|YA(q8Se67Q>-7eG$Mlk$j$Qj6%w1j1*T*cEsG z$gOa38)|3n`{F7@C)P&J^WSHDmy=O%JUU$}R8~=W-}NVCurasLSZyk+w@mjT)~>Wn zQ68wJ8$+lg5p*eMR*ng^vg_PYQLS`4wk_NiczMrC=T3u{x|&`#Zn#Q7V;Cv$ydAYb zhtU7+mT?a1I~e1#u72da-MmYh_P}3uX@@s57e-H_#h_u3nXMx2GvY3#p_kqJC+qIq z`1-l3QQ3j7Zi4sW$|%k8GqA_eBvoTCb7Ux6Xgq5*j^xGto@Ka{kzVfdMPl?mAJ?z^ zYW#9=vvdKX8MrUIvRj!cE|44R#M23F7b-9^ zRw8_@mHtlivrA^{WF%vQrkVe&`c$!qmqHc#S5N;BM&rERvewq~=qgD5~u6;4- z>V42k;rYDExK$Z6A~DzQj|t(XuQ;hFfjY{l&ejiqI125azpfeTC=S?7GXu>AtrN-V zR$7pADxmmod?afn9Ck`t{1LkCSnVTC#D=kapxSjCLedC_w6xp(0~Tz+bS%!nTGa+P z2osYh4FW<=<`3_BMn#=N=xp2(K9O_TKDG#$(n88!$LV~r$zGbc5RO7uNreppdwN%# znaRR{T6gF3o@3PJ0>*KAdSyQi7AlLn`~rQrW?wwigjp0Or+(8h?NaaC8~I&)K8Hr4UEmy;;be8O zYVm^))~7et_H&n#>*p-7HKqAZmwva;b}mG0wkJDD!bc5oN;2z*gYUe$qPMW?9BJ-Z z!Eae{-KYi;r|FfaTt(r=3WS^Zzxrs}Jpx}`Io9zzr5P<=lS!H-GrAOiYR$QL+Y)~K z>{$3PT>g?~C?tS5ei46DJjd3?Rf|{NfvWcH!bOS@^3wcMah2uv`M$5`hbYtyA7b%h zSH}MK1AfrS$@uH^H@UUx*FFe8isJOH6Y+1dw9T~CX!n=g_0D;F0SkhsLr2qM9kV*g zEejamX z(s*C^1{*qQIe@sSmz}x;O$$oWoex$i+(x-Gg%D0j;KE@G9yPfJ&cAcy5pSTncMsWB zwKE0TCFRWaVUm%L!s-#4O`wQqCs6Jh1&)5M5pl^`oFU9Y8@HP-;`V%9ox~5 zS=afE%#BV)%jC&(jtcu;rpxIWc`VgK1c#Q3mr7bVMm;C8Z)4=On%eaf(kp%UF?U`H zp}<}ee_rQ>F_ZF%C>Gv5rNsze0WJBum(NSP*jo1qJAhF#?dc0GN`|={Nx!qND(Z*0 z$5#8F=`d5enA~(tqre5V-{u=kZ!aiDOZG0h+q??S7i4KKb1?QTKy zL177cC4Q0cSD`1ul0WZk=~lzSyDhkZsCm_SUk;k9*D)7BU1u!-N_^%*HaFb#{8P;d z;k{li>}NhG8BhZyON8u40gzq-ED}|bLC|Ad#n3-r4?+a*KeZwNVb{6yE$0} zbA7F;|5RY>glM0B?%53tC?}slXc+F%L5psEUpw3-&~WbCxDV<1uH(DTL@0Sq%;=&g zd>{uE4-t1TA}9b&#>6(*fx{SLrdYUI8Wv9LW}5N@ES_@g-M;u}(W(kG{W?%=_Q^vQ z>2{N-a9g%nBSe{Tg%e;NNHI2%%_HH061=JVm4mq0WPF0@zL%A zx8e1Rd3Wk)aot6Pqa44Wta)0e;QJMqB^)*jvp-GX-B0mFyrU6 z{ezAwNL*e_6yptq67q$I1@@Z9$3Dqq=opvJtq^Txk5TT|4%|3kB0De<*kavid-TEj zV>Qhw)^;gA37Z58UQHyc=%v`VMgnw1o0V=E+~A2~>s2))fzi9|UiQ-K4H(ognlmm? zdSPrnSNJRoKPqaGM+r)G`s`pseDyJ@sq-+lbAJ)3J7+kMG>^cdgDGWhS;?6yk-@W( zpM&xZUyOX^BFN?EcehiJ8#v681IdXM{BoRR7(mSu8@a|#0waTzf|&Lo4$Xw&3mr6k z=8#bXt{0PjQsX|lt{4biL+`rpOSP`6YD?!~^~J1$;z`ZP3O@wKJ4TBoZp^@E0k)XE zmteYIm|3dlysDMnPt_GwMVI#A#GcDnv$tcCSb7%x*TVYcFhXF&$1zN)-ypzg1fKJc z)vAxH`Mv=TdzDdasHT1A`*hIS!18;W#UhczN<@Fv)&#*Lf&Pj<$wH;$Qaa)qA7IG-DR=9GiVR zh~Dm_$D{)0L3HLpT(Z+MXyA4X>_k8&BXiHo@sUR0DJ@K~t*R!H4N|J7v{5+9Ii#2O z)-c*G`?+Hw$yS4ay$^pFZx?DiPHDC>u2zYt;c4@Az8G5}Sy)WFE;+qHpRP7<+@z(V zPkK-0_*2-8*jn6%m$&=uep6^CC(uj4SMTHtsN)PeE%awy_BwU5C~>8x0+V5 zh2u$x2)WSKF;%stXtKB290jgY>Gzj%&r&MDsjN-X9BDw>*msjcKKT5R+*Rv^3n*x@ zC5Z}bzV8@upTdD(X5`)G^!%x#OV%UKN7d2_5Qf(n*V3XFiNY9v@7qFajE9Es>EImv z?reZfX~&5<7!%C+48nlEzFG>$e&pW(>5|Fa^1L_Bhsx=BABI-T>G&K+!Q!$ubl+Xb zcpj8NI0khuRJ^M(c2+$E*|i>i$xt%Ix3eNt_HUFKEo?ZRR z4V=ZAMp^pya{KqFE@o}amW45rjhjOXOl34evbF<@?h1f?&}<5LIwD)$iP6W^_`4@G zhrf#5$=0c6Nr#NM{&-v~^`^4Z0pX`mRL9bO&a2qcf7azhhCHRn(K++iAsfsb>ycq3 zM~fL{)7lrX^7AC5gR*Nrlh)=JtAm^@b0`ymp;m=#;w9Qk-T5Fj+91At=#iP#f(rmF z?0+)xIl`~}-+}v84TDx3&wwf0N+R22ZD@vY3)$BBGv(-(H=dW^0tp$4x>P!LYy^(b z+Agvmi$pnN3X8QOiK*7}Ymi!?1RD3rAv_V!9rLow9JbooUj1$9aPwNiO;#C$vOIg0 zuiYL)g9*SnIy{TZN%(qX_{sQ@uupSG!_5ycjrP#I0vVpyM8Guy0r)u7^mpT2ydS8v zULSCfLq}Z4dQ-)HIVi9tIYCEjer5Q!5qZoQJTV2_Z^c-b`)N2Cr%x3RD+@hpwvezm~3L!sIoK zcZN-}ltxiytYzwkSC+j%5Xu!GPY^xHQ&Fs45UX+@3U_sUU?}-Sovr?QYh(QNpcCdQ z6rbenN_ZxS?jBmYaEsJ^C@Bd?*{{g^1s3k^x$cnuuCWkcZ&Y)rqfulwrLZ-C{qBFx z^($MSm0tpL>#7YJymh4N@28`3-Htaw)jP-uxX~VOi02eIP8M5n{`55R8q+8Ly#*jL zAeA4_z?IQ8F3-s)?7C|a|B=<+cGH7SWQR2zdtf@%sfuh@zio*5sgCZwP1cT9Z#j=X z^V4h|y;2*S(HO0IypV`0u0ixU5J2iqO?w9u zl2s_`t4szB?5G5Ef7xqmYBscPcAG&_9*8nnso&vd*%^PeZJzwjx64tW_0ZPzEOZNu z@zd*t7ltNcV?mqE!@v0@E#1L1b}fguf#{|r2o7HEUgq#LH(+I)8sZ-UX>ymyK-k;d!gpC}^v^ddGRPwE96wQ)$=kw0%d^5gkZGw@;U- z*=Fh<*g!kJws28y#ld&-q_lFpls(C>Ts@U z0ms!29e!C`4|d`kd=O_ds3mahOKwKhyfpT13h~u#*LA67ol6zV*K!jx98$I~E+l1~QBoIcJpt_dDR|6$iJmLq5AgWD|?NdH!+7V7y#F z&EqdiXP1tDb?GY>Y_vyEZ&&jy=2j=~;g}l@%=%!&(yK}@7R7yx+4MP-c!Q&him|X_uuVK&PA*Udask<=ko&Q>7MXsX%(b&Gkt<8WLFGCuO>n;!~&jGzX03*CX3P|lF>ntY~PJ2(u*LZ`>x*kgj;B{y*5}Y0u4WH zORSc&m6H?T7>3utjqkVrxP-w<mI&^qbySdl6xLiaMB;^;5daB_dR*Hq4mg= zlh$Q9f(E9{#Zou!FPk-o$u6D`PD5LBjMil}_WMDpR@fNwkOrw`7KWitD-9s;Hubsc z2KO+=c|#J`u5-`3-Fpcok5;;i-PZrs0)Q=tk8^U1$5P>WL^{e75mFqRX-)av^)*Nx zN!&1IRHE$;Oo@Q1)Q|KBi|d`XBpe!Cdahyk;ytS#qkV7xZq=z}{gS`K)g#Qmc4nQ$ zS`yC5C?KzuY=hN_A5z%gK+S>eyQCRFJGek;2Oq7PkZ(&jK1C;>9S@@I*xHzd3qy`w zeKgT6y@ln&!mU;eU9FkXz<>{HXt!bDD;9g@VqK>I`|53!s(#h|t`excS*h|Ne%a+S zuApDqKu)Ev)p%O4DC9{aVub2-Pfw zxiMbkP7be2KHy9w;Et59mkB@LXL-Fz5+ZLC&n(c*q>iB`q$ zFZvp;5D+c)ar&Y63AKb`6(nL0lHxN*c-s0N>Ltw+>#p_eCGo?bNlw>)=5{1q?>U=? zbvbF>A_E)!k=Z9t`0|xI#>PnP0@IKJyzgU&N;YFm0W8+lWtRHo)@E&)xAXqrOl9KhwPh zt>zUe)g2fem-C4A_mcn!6zLL&G*-4O!KdXK)LVY9Y@IIe9@Kt!W)RsO(kAW$eZ9!M z;8xNEgBT*SnxMB1c=s@=5|)%6q9#H? zuxnxUSguA}g25;QJotn3xm|xCPxad}AnW>)46Gwpnuw!mQ>04&*VE7YwiKn3GPJH7 zf&sLx;*$Rdj-=B;kNlhZR-2u&N(A~1@7fo4Q`|68&B=zqq5VJrvDxl^c@gL}rd2v; zJ2-n!vObpk;N^Ck)QUclq>?kCRkvZo%4b&tR(f+o_Ju-GImLKT)447C>dxxIIgzpJ zk)L5i72qKAn#hwx`SGjhOtD6P713)O8aP6!>Q2JNv&N8bGhFBNbcw@vJ1Kh6SR5^9 zAHphL8@N^=cY=7FPOX8s#s5rr(|MH)dM{rLdFXWg>78P{6{lk#mtQ?h5C_luU0ehF zOR^#ZPK7YX&>OC8Hp{nWTPrd})NO)ICrn)(9Tv#eKVLyTqAmk22l02zYxfHNQ0Oaw z8j^dukJQ-e(dMl7l(hV$+8j+G0?MhSyz7I%K-Z$a<4#u*#KdZAmE2AeISq?i^@?#a z9-}bj=KXJ$j4QK!`YVM5f-q3H_pee^$Z$tFu(jKO2P;#Mr|I+{p0Ui>EOK>OgZsi@>~dw%Mr*skG2`0)VN6Sn3X z=*#4#_0-OvBh#f}jHKseGNzLx1>pT_e_8uED7VujTfRKUUX#}Br-n~cbDllR;jl(V zt)i$6cyn!bko?uG#<%wA-KRTrN+7JYFDAjyA{QSTU7~1|3&*tR*~?rm zC#N%~eKf1~u%YUsydKO9wPs;&jpj^p4S%~h^z*(x*t{DEe25s*`4cFNh|Pk@0FXRx zauC){_Ok<$xlLE~q%?QGU~&JKqr3S$<_pu)MLcXAMwVDZgN<(z%8NY8Q+JfzJCM=| zPUf73)6SNCl?fg@S#w}~kQP!=xoo8-Ov*8@c5Q> zfaPTRK8tE5!{?fhFIJVM-}O$R6c|*dQs0g}UAHDMul4j-je!(`bxum7UE7wriKe#B z7Yh9q;7-PKG|H~=S-A$TSA@hMPIFI!BR-EKKg=#bn>N`?YV*alv+-9q@X^F@e5aTY zu)PYUZ?G@|ZBfa2w&c~f?fCd5U#87YCas!!FZ@90gNbbcIssri04SSX2GDPB3Z+YG zc{~v=6yvHY#vTh8B+sfzJ=R{Yh_V_1&X#Ey@4RlS$C-?VfT>1DNHzoKRe%Hs}Lq?F&>!jr*OJ8maXAd;D9z|;_{1LF~U%zTxs^T`c{iLH5MRgULnWdj84&-ia zz090iFaeVzR(6xbk&^ROBOpD+e_)HiZUX@Hlms4YDI7jg%LWwndRl&~=?87$oJdX9 z4K`-h^kp}gJc(++GGeIL${f(w)e{7ed|;+Jx7Q@90WR|-jE&b;(Xyn7rj{E7ELeb? zV*?@Y#XSb{A&#Mj-K1M#b0J~&CDCpAx-61_X-zvg9<5jCv>F14mCcNs~+hab1bs0cXgMf{{p5TwGO z)Wt|Kez`N5W=8IpttODIwvUX=Zx*80v^gSNxmxpMKo#{`H_VWC!Py&@8P{z|ybD0EYIugZu{QO9@b&I#y!oz{sKB}l!XYuY(DxT!B zhEdtRm=0{uSV2ZUok_1W^sSMqlUXnCYX`x(^u-ohSkxS*<342RbM!T{;p^2|rY6eu zXN`71B9jL}Vczu^MH z>@6=`@Jl~$o+ThPxL2en6kmy_^U#G6FrF#?8qk==sfHG<-?P3LYDe?ryqZ37m5zIm znfFu8Ucg&6{XJIxZl-i_{$JfNyp&wwRsRw*1V47kwZqah4V#occ%1I_6?>?QT>@Tg z`H9Pd%lf0S>ryykEX9&z$3Jgf0R8HGFw&qKEyMd(&0`2V5|Gg?eC=KsDYi1@LX&*x z(4bie#1|VXjY}}XK5FinFT^*F@oc-QWb_*fcJ1yVtYW8|ay;)Oaq=AFj>iwXA5QJ* z%l4Givqy$$UFIoXc({X(GpACpg^T5Vt5*YXU{A&!5S|i;E)#;9tMSQ?0k$m+()>zc zIEB{2tSuU$x9$=>%&r|HuUN~ZR>V!GgfM{Q%K+SRj%$**JpV~ak|s0oX&}YNDWkMP z6OJY{$Vsq;J%gr!z1j!7$!D5QmwUUPW`8NI7Vt8lnpqK6yq9xW?MW(C9R3c z&D1eD_gQusd$-4z>M`_sj1!L zdOMI1`gJT1>C~U~G?V&^-)Sa|TMbeZ;y(UMm54tVY1RNEpy~=#2sC4h_=R}QDdD!l z7}xlDgZg4!JVVCjL$=_s_K~laUU)9-9!H;28a664$0h_pO4x=i_7uq)#asq&^9ZYE zqFu>->kpHTD;ZOXDe7%lW&P?;K>$s>W>l{IHStsLpD&CcGEJKw_FPV0J`=oN0v1*b z{RiX*daNvO^YbYb%dmdOlirh|Il>;=_+`pP{)m+o2S+?FUZ7)TTN0mC$l)?(h!xT! z159`xpK9c}+pb9&R{U23fEL-_RVE)+K$EpC`nWyc57mL=frD1!G9Mz*!NHDf^ zarg$GKU|;cI@2Uf6&KKiQCUemTkhaXD>9tQ2%BibbK1>X>wj?Pr_9q_x~uQ6i3`|* zO^6cd3H&~K-R9pZq8yknw5r$`t&vw-$i@Bq`3?D_M&~Yd1B0meVsk71#Ltn@N@ zM(D)Xp7s>R)J5>R#PDIU*;*t0%&x{@$IkcY(V6HOExC?N+}R+54mwx3)7%V2ij(9x zjj!9=pAW_Ck_ATJCLr|kh8tS9PzBNIbZ_WvGRvPb8>BG2v*MvDCDO$i4vvhZ)OL-_ zgW263ud1-kpK6Cg?eXG^UE**T9jj_ry518 zytSVl6{%|xT{a)i?U$ay!(Ovqd7384Aak1BuRY7fm*OvYa^D*tQRoG5WoMc>8Lk)| z)xBQ#f65m=_mU`9`yZxCsg9GLXD_ym-EgspAu_-Z_z_Zxaf*q<$vul5o4;a!rn~ipF^D zOvay=IQph`(Jo1YJ*EgKKx{bJv^p!xgTw-oNC)YYiZqxa# z3#LI^u6HqKb9?^Kii(5hi|9*rFdj^DUAX?r;9_=pm0Tgp`rdvO+H z$?dt+dgFI>AbBGrDmr(beuKUZ=CelZoHgs1cqaxNEM8AR=b7?wwrqKv;S+o{72nz1z$5Tabj9D`d|Xb7|Py{6Mhm@87lHr+~Iv?l}qZ7fpk3Pw`x<0vGmArn5C2_SuP`gLq;B$R-9szvY!E(Wq z6#q@R;WQsnv}X|7`+lJWA2;hvLU*Ju2*v-;=VQ_TtixBklq)8F>IFVTDZ9WA#!`1mSl_{Qeux}u~1_mv*# z?irZQHBj;Y(~K@z4lx}k<3(N@3uv}I4U5yxii^K(67H{29X(s&gMMP`i7*WR^r!Oe#%GM?FFYm*J8QZ$sqF_rCb=(O@GSQCoiJuI-HocU~plgD;IqyC&H!G2xp98@fqKFp_cQnr}uCR`ue2mh@v@~zRnyg4>E2^ixH z{I8=gpl|+Z^?R`S?|p9vMH^K_SpDh6{~UbtH-YlE_LA=Z+_DMr|Et$-^@#u1sgJmK zN>Jg=+{UJ z>t%}_+;8&$x$VrTKH}4x4cBl!zm7i6VI+{a;Y)k)-~$*)%ktz2J-H;|JVNd)K$#KQ z3RI1|V9Xwy@`c43jr&OD2!cc#afq%%j^yRg%VbF90bIJPaX$MG*xb zG1@bQHcSDz2{bMHG$YqwbgYDjPK6&!cdB?(WCN&Tj-(&{@mCLHaC<)DMFUjK8;Gz3 zs{yGAcvXkviHz7d)<>D5EKrM;*9!(?oE;wlSX7ek-Iy=1yOE2|OWn&e)BJMB4`T`(R@B~F{=%u*EiyCZrHThSHBG0Da0)r6d$ zv@fTadXH8LgKW1Z0N#@H`5q$HLO zd#ytP#(B8me?;Gj3ViD2(0yu^cH?^Xf4LbC)$y)QK0c*h;#g}JGW}iH<9(5q9{Ykn zS}!oxGDuxj-<6IJ;kj39#qLz;oEnu8i_Gw=(IPzFz0{DDY}ckxwZ~3V>Xo@?pwJLa z>Tzi;xa%m50bwnM1ih-S$vV;~eO`jD;*7B@xl~e}#bv4F%%H$|1aT<+Ev-U5HMI>= z(3S2%TQ9 zur=;ft56?l*=?T;RV0?2;Jnk4)?)-XrDUKQjFxtIDBK(%ILH#LZ$0F-J#ecAXvP8lR}0S{o?OfEdW0-C0$d7S>8`(yDh0srEEbBk@U zPWYFU@30VY{A=lgvi1ScRB=CWZSMv_Uo8n?!Q$1Vu5^;0GQmV6}+5h+q^!I&m3QYi;dM0+h0BY{mh>*ULop#*=tPKw%P; zFGls6bdyz4BZSEqN6|Iz^AVkOz!u3XPE`0kW^q75)JYO;;#~68m7S}9S$=7)?3}-b z^(wUJMq4W%SB9d|4yY8j9<~P^+1P0Ko?u?CR+g&k>jf80z$z_7ZjYkBC({i?>^3RL z015<_w!L)I`=JIA65^S~&Cq5#SsCOKMnh@G^R!eP@792oa(L)z5N1 z3j)-OrpVxJz+C`5e(g=AEC_L7-M~K^G-(M`3|n^;=@f0r=L;OwVzlZC!&ic=j=+kx z5!gu${G$o@?9CYMRur|BYcrs-0t8o`M~3>$gD1{Hl0* zY89ZZi_6jPXpaO02bm2f;}e;qR94-Dl4nI@RxCw%Y6r89=lt$#3+kJ-7uU3*}UgjP=5&? z?hCnvyE}fDt7K8nng;tzeJ<9p_KrvUsf1zj(Qhhn`Kwu~-43RYweAUVQRpVuW^A_p z<{^|%8w2|1wJqH}q{QblXDao#c~NiLwuYl$&!6I_EN}jRZ&i;e<2CE}9VzsSm26k3 zj)1ccu^+Fc6!Y7AZJ%f7yI*a#4DWT@3&05RvaR`_fu-fSb3P`egj7HY2ce4&0a|6m zCXePb@Q?slFe$J-)mNvO`!hW^CNT#YGKhE(uJEw-* z*37}M` zRH!a~L7`t8w(>2yv~J>9^(b*D&Chw^Puy*KX^75zdm6xOX1WUrQ{0Pjx}tntLq+Pw z9F4duB*AKWc-S#OdVb~k+-7&8oZI2|p$`rYRA;$Nb5%^M$3&w6lIHZs z#|Nrkj%vRgkt2TZdb&)goZz3^Rz$%NcZOP43c+fYV_KwVRTr!3Z_wu0$3)UP6xXHjvzXsoNr6L__!>F{8y`NN<$TLl*?!l}$x!2+s!sYVE>Edbkdxot zM))ZFoa?a;ZR8Bsu3GReyhlP%6x`{j$6&7Y^G2cFCiz;A$e67R>96~^+TlUBp z#_ic5<<=n!iidPqv!o~+@su~)EDa9C&{na2WHyr;|GdyamExcfolF1QLvyzl{>jX_ zT}ru0dMe}FO*Yk!*KsHlLU-kqDwy{Z2ig(WcRl;8&l6*UKZU!P8KmRnE7Db3navI7 zXwPNPP2hgwO4c>uUn*1Epg{tELxVYshmwC4=xjV6b>4XRWL$^vz8%>&@f|4)9bv>z zCK4WecV)cPdsG?EywGJ4xDe8}AGS6qM<1cexg{Mk3S?vu`wJw9W~?q5XxR^kp=~rz z+3hJ+Ee$Pt`D~mhhvaFu1g~nZf$&Z7J$c0DyItL1Angt6N!8vvw6{hNqwMkChMDOl zLbS1!HA3x^0`Jb#WHbl!uB`KNJNNxRq(AbBi`{B7a+rDK2Yw}d(Y`k_0i|(fUvJ4 z2~~E+0=bd_g3!%R8^aY+W)B<8atqbpFBXU8aRVfb5)N?7HA0z|i%jYS*(SNQo0DZY zyE7vL1ycHJ;kKUOCN~UEQs$wxH?+i0%#|w@0zXfnoONB;xLxQ4gv7su*;+I`*Bk(eILK5nXwoVk8D!fy z;KUI?e74*(f1^2?c+}~2O%|QC93h*>XQOuA{o^cyGiAPWl;Kf3T+pK_M#4w}_=Z-~ z)Y+_Y=$rQTUMBQrj}FtiWcyY5Y%Ea$6Kd7(4PxOUD5B&i+r>++1KPN>lvb;}n@>-! zt2n7xWQQgT@5BLAMhHRd0pl#>Ex2AfkW|AkuSbMNf1#Xx zWckn^=!5KPhn+{jBT@1Y1{$kP;Sh!~TMP-*p#CB^@XajFHgj8a{<54Na>*{Tiq{`P{nhIp#s^ zBhzH`&Y0#?{a2l*$}i8sD)>kv(L1*}e_=L0eJc2!*at{b;Iy(F?raYUY=n^Sv@kHt z^g6%D{ptQf{(;4VL{lZbSQ@O1{|LN0FLX^kboT+ddR+TqLSWfjQd?<~%wQ(iDm~Q+ zH$!8%(8@6Ui~uJiDf1PB&ogCPTEAzwc0P*oGNguH>8ZCRFSu`6rO+Q7Et7MQv5YK1 z-_{?L1V*S3x)2Rb+vc@s1oP%BI&qx^F{Q7>hEB2x))m}KbR-X<9_A$N6Qp^jJQA?~ zBh;ET@y^UIw`>|?LH^U%#4~pzy=_~ze^^|x>u#q98cNkAbA$I8 zI3FYv-kXtoVBfY;NTK&zuZh2+#%pieKbNq^B2&W;az7g=WlYckqDn4QEgb+hL=(it z`HzolJI!mf`{-_WFCYlDBtf%T%9lpi-*Rk9dR0vR00m1mAc!rMhA;jWE>fgPGU4{= zxK#`MZv(jd-v;o&Oixax@35MbicPwVblff}u*E)>Xrc=JtC4! z{V(e!EFLyC=zNS+RJ8%8=^9ilp6dTL2a6ez@KD;|eR<-qgH{70bvY7MR|NX4-m$RP zB?m3x{HTn<*A#Hnx7q3Dh(vnQ!rTYRtA?e#X*k>i_{0GMdtBNSArPweeS-bHDN6Wh zgWuYvA5n~s;;Au{%meP|3GxeeUx(R_;mQ1?TFVa?PbYjl<~oNQnLF5BkIAf4=(TlCa#D1E^BX(_Ck?lr{%Iqxny7)Yo`fIA{Ur_+ES>Ddz~=hE zrj9{dw17wgsvhPf51wQj+P^VU2=w7Yz&|>T!%^+D3W>ZQ=qBR}`l|pV6IYscCbFSNtLg^a1>Ba92=NjcZn|Jj(8 zASb4)iBvL42~YJc`*8K?^_)Ou(P;ThzUD+J-w{YKVy9(^dx62dk=2}UuFb+iJ3$_g zB%@M|l!Ez_jm7w@0&0@;a4yzLRq!GdFYBrnSny%hB9sZuU1hi2OGKzH14XTmP>^L+ z@OjU{$0*5ccns*I^E*&)e;1}$?yBF*vV3!)BMzaA$qQK}=SYYI9qPTA{(u(>UOWQk z7AmJs>~9*MS26T8!e*V6QDWYg31qDw=wMI_P#z=2}ct>gGCQ|q~#W=}N8`seum<>@x;#rH==nTCXb*)0a`p`zp-KK-Udnj-W1 z{7aPhg~X-(`nsuQvlS_#niHMT`8Y(?{*y5JTbEW#_DGj=BOtk{YB0v2`_)L`BaIC@ z@h*YMsb2&|@s35l*PV&kQQln(%#|sjqZ{vrU@Ib||Dfcn`}jG3LFRvtN`)o#xg{x7 z@dCqD9*ll(bl>v$w1Zd{}ORUGWYNMl;ztMnX#Ws_UqaUH=KA_tn9?)K2J(4jY z&W@d^F7T%wW>UckGDDx4`c(_Cwa-dvtWVYyl0^h(*H(FOA@AU+HkMoYFYflxzG=Gm zp#s;2T-og$6$T1zWFygW&j$;>?R)F)oUTgs8{)C*EernO39%RHu!#(s9HjyFjj%}& z(Wt}LiJN6vYWE3WMOYZxe-4M)`6BD$ zdfXPlb;zJGCdTc{np8fzBv4{2CHhHHIq7$V_t*Pk3&12y_>q z-cpPT-IH)w;Uve=P;h~c+em^*oy4sbfJ?jN_ikfRluunKEpLYj^~W&Y7Uq5cHnz4Rb|>2(+Kof7H76F{_|(tdGA-ii~-oz$QW(g!jL9G zO(C7T)F~)|gS}@#BwhGlR1EeB0xuxsJnrU~F$PsENvmoFfD=z`Dj9$>9uVQ;a_mf= z;qy!p2hKkZbh=IYtn5(z6Bi^O(Ek^@=kFBYwkgwW-jdBjtsLR5L%o@fdpI;C4Z#aP zzYvKet9{g#vw(%|KP7llK+VLX%wHzW(2PegN33cnRKZ+($4V2Yt*;QN1m$l&cEb|I zNmaDjroDyhrTZ_?hnm0Anr+})MAKfb8Gb!Dv_gf0@J{rkQQCwb()$nHxWd{>y@gK_ z_SIo4a%cdoNDlHayGC;+hYyEsUTiNV^%De`W)312H9b%!_SLf zKS->=p%dI~I(Sm&F@lyU#l}OK%*Us;2-BuHc?+o|a4%&8G3k>y7k~P+-&hm}wsAR! z#GyEeM?qeV!Ko-q4GxyPKaLK?`Sf+BHvt89T(8rSuZ#axW6&c6b*p*!Sk9GCOzzN~MS={-nR*R+%JdKWGV*nV-EbTd#0 z9-YDGg7=4(Nt5jOMiYfoeX`NH4b}f_mxdEKO_HWDT-b%&(!rY=w^bKc&ghRzcM#(Y zs6Dazw7Ym#<7Tay);=5(_?u8rl2D$cvHXi#x@KfS4mS3aLjz5frK#@$NOBK^1FG*J z19Ps%AZljy$TDkZ&^`Xsoe1<9f5?atCo%)6-`|{4iOYt~*xQ{JQXiY~jTTqb^tPb- z+N%O$WeAMi;X=4njvSW*k;b#tc7PZR#va*b{}ZxiiT@3}W!RVcyA`wZ5-93uZ(DrgT}6?^J4-A-A@BQo4<`jlw$ahTq%x+C z_M?s>_*i!Pu>7%rxanzL1lJzpXFFXwnQ?%L@1r%^y1VKxzw~M{DED(x&*w?4L0=DD z%00w&u+shcTJv=)rydCOs#4fb?;!D=}C0##lET;=+5$e zQ3vQiCBG=?>((Oq>x2y%(XzSaFI9iJiol1g9XEQxBFk(ftA{20wwUR$;<@?aFs=#a z$p~HD82Zm2Rf~(m`X-nr^_MEw3nLu#AD~@wDvg3Wh-?bLHtM?d2N8jjwGpegzXXI) zy9m`Oi?3=jL0#3SPsj72)nL2fXZg?lI8*?nj6vg7$`VFPVGy9bL_a1k|DQTBcK#P{-TP_>2)Z)F?RYTK{Bmwj6N<|xufTp=bpn{ zlF{C~Ch^XU6A(}iJlc}*5p@AiR~AYJOwll3kTW0kxm}6}q&^{1_9J)hV-wMSvaYu{ z=PpE+IrL}fN84TecV_V6UxDe_uf;=IxL2hc1Jtznzp<7$O|ZrcandU94KD(cHSmfq;j ztg*Q22*^3Ul(9*A-aGs8?V$aX;o}&bve>wKkB#b?!Shqbg-%hE8$!R!cvy$`p8sdVU72AD znU62SD3@yNn}a&?AD!vIZe$4|R-sKdI$HZ1t$m$y3>hfa(!Fq+2ZHr+P#Yk&#hNh)4Tx!5>cr~BWpUtq|P}{?> z^SemK&8>qO*a>E6VIkcA264j+fuJ?PAWX|dha6|+|wf|vUWmRJ&>`cQyW?-Of zX3$YQbI66+3g(glP%d~#63SbinO4}g;UMR}o!t>&-lxI=QmH5?`AWKuL= zE|cPd&hqlT_NCvpXcsrP#dK&(q!}W%9p$*^o$WsF%PNaKPygK2nsUCkx7YEL7Z##j zq`r+kG873{Nro;Ch%R(bUtGw0uC=Ub{iQ|E?zH~+4B)_Njrck2A|tMm_;2VP$34SC zR6e=^2*@u4s(GHDfKv*gUvy>XZs`Ba{Bu({C2r*`Z=aHDpAeY$;LD-elah?R%d8lK@%hQjLI2NgkG~@x?Ec|W_GO~%mAg;?EmigWA1t7epgIU z`PMpdT$^?Id$H;uldV5^>)wM0SnW>+7q@a(Zn>x}U}Ss^-Z#%1%H^I$Sy~UlV#d4T zG_xmVtMlH|iQSv&wsxieC`CC=s8LPDO-}KviTc_zo$6-ay0!K1Vr~c#TGp=z8GjFs zo0aU~q*@-IVK3~>1Ma^yg(=!qQP9uuNDTlNy|_ryTEB z9TXcpzHP#cQ~VWSQ;TeW_Dy>H3re6mdqo3&zZ#$?a|Y?$fLBDykr?ot!UO)T7Sp*w zXvmZE=GI}Hh%;IT-vqF}4!j5p{D~IV`TOYv?E@~2XX^GYoH_Bpe~a_4H|?}Yb!ra` zBCe4pXn$jiN1(O6JlMF%gQ62mc{H4#Q)tgzC)P)>S$+v&NOW`+Xy#N$YGCgIx)3b2nOhf@3 z*pU;UFuXawhTi09k`&tl)MbFjWOxM+EBFWu&u_na!s5Ec!ct}=x4>tbwrwDTS(R>T zVD{K(f{&B%`O#16w_t=dT5sKwhtGds#JSEHPHZ5Hzw%j@j0*sv6%SaEq3hzzLJ~T;eWRC zt3+|MQ8PA+Z;)Sk1p|!-2_iZsB+(KbM3BPh=H!%7@=WC&e=?ZadU7>#i@^k{A zhy*yxm|7Bw1r&v5Js>eYSoggh;HB^i`*dWO_?N4*w4_nM>pbd)MnK2}tLxToT&%hD?(us= zgMK;w)I;`Daf$D~e&0$v8*IINfQh&o*EJ*MbNXgW59`EiBy)Pms#X9NWn{0lZx1{+;jj2fJ~BKBqb=E`MBe*lyvLJeYeqd>fEDk@uo^b7r0H z%o=xRjJWQjKS>OYA<-+<+0Zf*!mIrLwW&$vR8hL+W}k~))Dr-qRiGvi;ca$)@R>{F ztsy<3AYOMqQJ?a;>-|wchF9zq8Yh(Jc4=`u{NfNg-s&NXQ_&&*ZvBtU94lv6&I@@= zx=BY#7M#6`qFhy3u&~3EEg11f0Vz=y)CH@AzGqRR6w^)G9fu|6#c6iAkMR?=Zumc= zuol;4*7*98DY=*(R{RL&oHGFIk(9E~+w-dzCI7EEV_{Fo^U%6BHQ@BqGky6$Ex+`w zZ$BSL5BC6{x4|GNor%Iv(~gchTL{aL-?`O`nP8UEv82Aoq@Hu(>}o{hQgo#nWvgjg zGO6Wwqw1kNr58orXDQ*+TzGk@6_HBPH!7kj6BUYc&?HpX43PB03y-+xS^m!tjLmb4 z-h50KnzO6R6M74-h)L_-QXOGII$^n+uAUg@ZzY8GX4Zk{5jzlULU$$i%1*-&UQzzx ztO2&rGGJr?mYi+P&^*7hbS`HuMLr^<^HTy+5mTsVnV+u=sZy0ea#S)?5J832{OnT` z{gSI+?T8ZL1bEkr=15$p=w<8OO({H(7%A?QJE@!#X-^V7Zb9MNs#a-p!e7&92S`J) zWd!c#U~m-bb?lr$Ni51qQKg_ouab!ZB9YOe^!lBpq5T-LBK@;Zc@?PdC})Fase(ze zaZn~5QedmD^W^${-0NJp$XjUNW{qJB&rGAoB}z24VRvi+5EiHX9J_thtGhs4a3J0o zj2suL?W#iON%<>J9ci!VCL(;@!8^675Oa#3$Un@9zFk5dahu!u9qhU?qN%0}y=DXH zll&i{@&((zKQH(N^9=jV{ysfq> zUE92uyzL$%`+GZG5CxBk>X*Tc_Q9!;0Q+m<>e^j9JNF!^u+^CA=485;tJqDmmt)sj zeynqiya|zsx|~Qy@$p_qe|GUo37$TbmT~;5`{&(?g*x+yw{MsGI0R-Ea#8Sch6RMJ zG(5tY`p9peUCsq4=0-s$-sXL|k4@N5AQWn(DwO(H?=fvxDLNu?Y_T6JrPd~k+77g*I&OTcbym3*Xy=Mt~mU@Xdf8ZO- zf@)@aEV_QQ^n?p#V`kQA+c5LDSFx{I^sz>k2~WNCJb9gVkIMg!(q|0+S-WmWy|4xD zcgq5TYA)sGSXaU^Q`Gyr8XSP*xOcHF=Ri}hX#9x-XE>=n;0C)VI5VC~8)#u%#KGs2 zq})gPHx~diXpT$t@MJG!vUMhIG5M2-or48QX-F>lS{tJmaqe>i&3WgeRw9{IpEJB@ zPm``GF$`il$ulw`LLy|DaIUP7<$oXz5#n+_d%So6oFYh)z6`(Neo~yIp2VHeOnAA$ ziVwRo(WZ6~d%F4>dGifl!Gd-1NSjVKtALE(J>Klp;3g&Y+oiCa(WBK`4{8srB7B=X z_9WLVf$Q>FEoT+Zp&$5luyc5fifZ4-Uc+*?9XG{uEWTj+AOR)TaCZA@A180h{q!3e z!P+sA{fjZrMcy0k{kUt%`{`@W5?DMVAjl$M5_HGvqZ_vo(HeUI#*YN7?l?%9O{gxd z&B;kW1>k9+oUQNF`dO4kAaRi*Ug=JE8+wp+eIG>#aI!RAW^gz#BFP6JZLKUU&F1i! zWJ#qCgwWu02%}m&{hblwK@~+-h5QIEn~qsEWp-rqpwrO6j*XdAcfV`APQW_oPAOij z46)StLqylFF2xq`ttp`z;KHS?3}O{u%W-jyX0gcM&%1epVsS;##@&FjE8d2o{_-xL zTi|lWOozCm_tf6WliGur1w09F&$u@kN9&}3n^VewiB^Pe>osSe7e(|T9o)0^@ByD! zq+V;1MIbKy!tQ4-ta@OB6F$CM3xu7UK{mnAzMh}axoLN@^;&rRPidojsvM4ccpF$4 zX2fW5hgUBs>Pu`Nh#_=s?1+4xnX>m@C{6R-*D@iM&`!B=uhgjh{lFY$7LsHX>;v<$l8IE)MlE; zc1N~gd0j_hJ&sCbOZ3#y6?JdI$<;8F!8Z5LG^0+NevbVuk8AR*e0Yt3Uj(u#(J5UL z8`#IXuZKV$&>s{VxwXWg)s4*DhEASz@5b|LjUEy_%`@idaJ}w>Ev3vyq2~eW8LL_5 z2?1rUkhG?Km2SA`jSTjz8fHp3+4g95t9?tVR;Yu_X#p$Uy|qf&kd@_^Xsyc$$-}PG zUZ&0fFKKULxNRE5WUucRv|{{w9raUgUGLq5R+;f59gE9=jeha`*?msTbHsJw*vV8{ zF=zq)M5!gdDDP1g4gy+>VnVlQIa_$t1tDR$UI1AZvKc(PirNzaNA|BxXj$`Fj!@qvuoe}DS16@Ar~4<9?~Kh^o|xOW?R zWMRjP++uydt!iv0si?l63fM>90M4p!fUj(vVN!lcTg|#3RgKR-VZCgO=VSMc2q%LMi3k5sUZ)_8u`CNpKS`yokAM2udtMo)TSvhc{IW6h(mhw*TPSR8 z33&ir4uuxjIn|>3HGNm^R&q)x>Eh%s*Xod^Lnd>^IXpN%h>e4w#zpHJolM&vu82$L zI^eTAO!g$C?GY1fhp!+ZUqdV8Y$cic{u~QdTreRi`W|8%B`OZmXifP~89X10 z<7^~nMN4(JOSv5@Me?L`fp5e;=)|xB3d09Tju|Eyh4V+rOOrzG_(;F_BexNmBI_?m zF04PZFZC)ezpT^Tj|}F$8>^Eezfs+FXjvc)kGI`eFKDh7K)AV_ z0ySVS&n-NYwn|#Uz2TEZHY0G3Rhf>Q{>w}=emnS>Yn;xRcwHsh*k<}( z%8bHc6T(h=hZhYu$1Ka0U6$8ZXL`Lk$@4{WB3_Z+PCW8U^=zFj0-(Y>khBDN-x0s3 ztG$w2(M(cf}l*X2X$8#*Z&0SkU`38Pb;G4b zdWl})7NHJz0}EL`*`AY4Qa+hAc8&`xiG`N1Hb)*Bal4Xw?#k(BQ@D(E%v17bzVnUa z);j9in)*b0k;*S9TjHVXT_C4KAi zm~{K5{V}4z`-}J5^h5JjpRj|W`HtPmTV*C zuK}w`RzjJ}9YuRuZ+@F|?2HOmabwNj;&X6-2k=pQfODyt^9y^tOdPRl5MOuVH@2J` zFZHpC3dHu0ja`Po^OHpzu=Z$_ofe_4$@f`Jo-vOzA%e1tLS?1r`JGNbMkY8JT`0Be z4v9R~)R+}QoRZ3&^!nF9T`_4A%68T&32}`?oc=x+Jc2v6W&CDZdw0wuLb!W46xwXI z+l>#{(+d*>U1Nrq-EUMPgO8}FYrjdtNf!ti3^@FckD+))X&{_a9e zhKFb1^0N$$l=aSjXv#y{e3AY7Hy4Ow?UxJw-sKl33lW9)kVS{psev78e=TBVY010u z*phd(OBY=R<579_D+gMi^)idLQ`O&mlz>vMvwUDGil_cU_|kK*z+GIVrNhQV{M!9P zzrs1V}Wvm=>|du{ebzqy>|9DU+5Q} zllQky$oAI9l{(+xm^z-u*IyimX1n=!)>n0F-L!rg7%euIKwJa2*wXVVrYC!A_NR!b zzcmtaRL?W$uwIhYyJcCoFzH-9GQ^BIx}H##2d}t2DQ8kwHnyG12)9P3s*UdC^3962 zd&u5yRFAmWMD#NxBei5o%g|XP!)>y%XOg|Dk1QeIz$=^#^OmsL7Fh7NL?m*Hw*}#7 zx$)hGK=Xq^2IS^>OmksF=CDpq*jC%dM*{&%60>~v^FiHA^cn3@ZSz~&s> z`-v26Z?nkIfIBSN(kdUoP>t!4Oq^<=QA1t(FENvt+3{ixWs=k@D2;&Iak9T#-N+AG zJT|1ubJszbx$C>T+RA37IE3O?_h@#c3D$Sg(%{qtxUE^rwddZ$OfrVBDeLW8@<@|CFq-Q zbt&u60|4J?9;4f-EKzn#?JG}r=g_?>ClKO!mbZn0DB}6ZISVuF7P!_S6GQe|$S`a5 zDlw1G>I#a8^TjL0GIj)?w<9l&XRGTHx+fUbpAE2d>TY-^DQm{lHhF|(Qrs*X&`}GpI z-_wu2x7xfat}K|WoiNRi#t91D`IcnNdS^bz(1llkv(|jKa1H^+{+WD<$EMkbGM0t) z3cU46-^ps~F-xU>a%=9HrI~Z+1SK=-jCVluRs#(bcdigZg}>4V=F-a}I}NW2ji5tt z)1#XcLrI|A$+k1}wV)!-l;%_E!M0weYNHK_YW1;EZKjRwBAc&j+VxKp#bAxW-(K|a z#HB=Qf1frqZ~E|_#d_moE%cmNhOwfnK_b$+L<~{mH8=s{ED_Jh_?ab7?NXkJo>UOw z3>ITbyw}t~hHVl+o*JLu_P91+G8o~hv}B+j8BdX<1QoLZ4Hn*!H$@)0R^f#(uA6!kn9+P=H6Y z8gW9Q{gRB;)b|Zx=#hWVpjZ+0gqt@d`Y~wjD&DczO4*Tz(O@dMkJX9CJ&!ukD~vAu z0~=5JUu+dhH7jL-60FB zr%h$U?zuVJ1l1&>ExpW5@Xze@QuALK4u&w4mUY^vHoi0%B0Zd4##@!8c!aBb54$<2l~@`f7u1{%0QDKWyMa#NMMi_`su4Rv}TOq5oM*0ry! zlOBoP>fy9?RP=k3fCcEFAkflQx&VItx*NInH2ZFq%KPMq(DU7`b@P`4s^5ZaeW_+p zPCCj3(0)g0ChyI`4w_PP%jGW1LG#q-vETAXda!F%vnVwCt`BsaryP|@_XKGehLywI zhxM9|_MxgO+xvf**FK8=lO>>=rgH(VORbs(X%~xL`t- zf4JLg&f!((>Kx7M|NB6QEcShg$9@fNG4uZ#@IF`2?SOxlKJMR+dz2I^o$bI#_@t

HknFCN>AsnFgP=Y&@-iVE8ev|-PP^K@5SDQF2|s-YD|8=dq0 z7Vdd(X4B1oI0HP^gd<729-7k{6~S}fl1kUn0h4+vwE;Q}7U>t;P7SMQ&>^+6ovmWY?{DP|Le*U@*9QGRFsBQe zpk8y%f_lS*sM<#vifR^;sz;|HfCRq_#gqr-)!C`z<(U?4jct$mcWJD?-v*I7Q#99x zobB8+3&<2IH5{p5aYI$qpb533Ldxk2*UaVdr1m<>@BLYN=^vFD#HPMfI6b)FxM!~A zT`-@=t?vnRJD&j%G)_w_HcW7#ZA3_%utD z&$0~Kx*G0uo70dFBv=N!!nJuTU)VSBmvkD_@B8tQm5cZNc;%%cI2$$C+U{s&XTb`A zVC-l7>TM3nGd6bKRaq*3CvMtqD0JK@=YFkay#;CZZjP4j$bsWZR2g1L^z40)QFRHs zEhENyC*_V}Q?L&9DKw?OR81*e$7Hd%DM66lJpN9#yDatWBb8G%KlIsPulJF>V3ad|G`t)#CgP?67z0 zQQ35tzMxk(8AqY~Q&$kb=ij%itILLwF##lEa$kxcie7)s{95;0oa2T?%dXaZ`l^QL z&+5`7T2M~FvBJIhpFqaof>-<@Fh7O$311?1zANE`2sTLs~*cS#;& z6V2Zc-C;tuJg;1ApTQN#fJu0lBmSNiC0pt*L;_$J?#&Y;3pp8%$2~Iya`~NPmaB)k zjU7o|)Lq2YFQ!)BYyB2XH+o9pb3++8@Z;MLbG*Wv{p*RB<9ZZ_!`N-4HZLA~c7BHA z`6I~wQxIp6%@8xRId^KaYvgmPst86-t<5YI$_q&~}nP@X{W5zff!KYF) z-yM#kU>{!od}JXOj&6O{z?EzcL1gMa;px>X;{|+j|D@s(QxFoap_wx53LyZ4^w^Hh z#3L02C#J^B1C_&iF@CCgj&&M@{^OE96$jf4AC(5@<-sH3o-t;J3g47ow>x@^DXxmS zHJ;Z{6;P)t;Z=+3f_`O0P%X|h`2bwvltUtov3nJba@&c5Vw& zg)yA@+A<4vawA_ybAjlP6TJAWH+fF1Bh4`BzNL@d9qd`TQXU7=rnXCqPAGKln-$59 zrH>%_xk0%S>cxX2w0Vz6JpMeEQ)Py&*0aUjagODU77jQ{ay4|6MoQl8b}|ygM4NK+ zc_!LM+P+?vJhxH!1mx(vUX0^PrNYgVlyD?`ohqX!4Ah%e@Q}bVJa}(Kj(fl3CVb)~ zPC(_;Qwmk%DXxz5poI9ShTy)K)HjK<>pwnTx;9RY+pO&h*a%R3NA=qkOQy6qvbU3^ zgvVA>PLW}DYI!2jYPN6#*+rT0`Hrdqckh=aaDGRx)7X-~6?1}mXTIn!hyCwS6P0^2 z35xi_;5wa&-}l0qeG{v~F~R16*`qqX`cO$uo@15|oCsR~n4q?0A;AX?hW+1v@hNx~ zr1{i-8|y(e9XVS+ul@C<@Q(8~XC(DBUY{qv%}j&N6lM}sy+lLe-S;6$i9PjzJN`p$ zLkGLKn}xCIyROXNZOZful$B~p*iqNTGwp3>8I@-R)&ckW{SFmH#zo=E))F6=_o6Kvot(RLi=uQ04)Eb3IcBFr}?cO#~Y&OAcYh-hx z@}#+!H%ANup+fvP&%6rfHY8Gu0ezD?!4VG;?1v}y3B=-xM4IFS;2yof_GGoA+1+Qx z6TW+w^;@bdcKnQMQNYwsX`%@s5#pQ9S?Wy-cD#GI znUj8xE9~v=+{TQ52z*{K;#ZUy36D{smL|udm0t^-czD82jgyR!=I830n`o@trUV+- z&*$zw;)dbu=!WX-;0EE;@JG#b8-ITa`_{}Dc|L#^TN^7OoU?27chVaSEOdsQ5AMC2 z53Q?`Qn~x9lE&kDrrF&yv!hByWkG>0+D^;vc*4PcPTxEo&AxNELq1EjJabw75 z$pjl$Z<;+f>TlJZ^m_^LI2Wzw#V{?iES^SE<0v)%HiGsEik^S*YyPs6p!?ezmGI%Z zRjrW5LCU?YbjmEH;u-vg>oCY8|0D1(((y6NfvsW6%E19;hXqSC|je_ z>%~!LYHFSq7%YKmF7Gp+Pf)vo3VXqK?1p?!Qaj>vyk3^RL`vW%=hai^BXI+X#^-t~ zv!W@fWi?Sw-79no37Q>k+5WuW9D9{VCeBrp5lNhe2?j~%Q`jJFj@T$}CI4j7Tqt;3 z$0IW+a)#$FgRk6cuEg)RhR5aG)(iGDl{ae9K7K(Tj2l)~*Qw-eKkGHBrkt^ZZ_ffu`h$CMF%ycxrOB{Chfq_o24C#hq_U zI(>^uVk!?W8R2dTSHB@=)rMitMhpG?9(NKx7+N#zS<^8^thaYOj6q!}`08D{6SY zKhHfcm%!sfc$fJrc`dXsc8j}`JlelotBx|=CJl?|G>_e?hF04TFemJ@K^+$M3F^lQ zDYSccs|D@6t_&r+@Ix2m5n_Hn-5OlNx#?hM%&`V4L<;+PtAv197KoC~-WuJQu$EQRS-DYf-6k_lX+cHNRy2B3VAog{)2MaeMm)K? z>di}oK9XjM9MO>K@U|_Rmr44nW(vs&T8SuXXk|1wL$_1+F)x&@?7D?qaU{3I+VBCt zBM>s@K|V$HWll2fjE33>q!YPNZB?1O^LzI^!x>kV* z1srgi+J-IweY!irtKqN05WI1(FQ+;0X9|B60E>|UtHr6d-T?vzt;@5B)+RI!76QW7 zjCCcIO3M_!#La7^9aLMxzbFy+%#f_Om|6t}jVsp=C#mQJCJ>AYR%FuGpJ&EHBY3QX4tcyTwyXw6lHQLY*A*bGj@ ziXVQoT@B4N6Y!_6_oAPi#gDXGytozm>(CkRzR6;ByrhrYuxU~~BNa7g?Kio+Tc7A_ zy&Y50tHW>u`_~jdt#3VlFgS5>Z>IQOG;vw>=&zdC|!D88#jX z?+4wP`uw25Jb7%4rh;Kg65Z&V>3>gGuO==rpvx7`PYf=1RIOtNU?dZQ7}D_EznHmS zX`8x%CS&m|0a^KXvDZdX&Y@Q1xMj6-{BefbAkNZ=rx5R6WM zzGc9~eAgLLEdGlNfVZLyUy)jj1908FuZ{0^D0j&=k($r)sSKSqJRUfFHir^4S-E!d z{dS~1f0I;KZ^C2qCYHM#SWX;!u8#LAJkWgsNRf@FKaHnBiWxswTI_v$p8;={KIfPR zJ3LA_dRh_phx>j$Uf7^?cMEaIh+JMsc{71Y`Dc&$_OKrWLPQhD0c1eBs2et4Z7|=Q zBYJxGUS1T}=Ppvd?d*np%TarUJ62mAKN>?`2=ofm4D4rqNcBKJMY*kVwl{;TtZw}(}1qJX=yJ+G6KF5W=q0)>t{I~co0SB#&NWqUB-+=Bs`Hv zuD)`)ozC67K*IDrbJ&z~m3i=X{+BxBjK%VJu88@1w@OJYa7L#!_U-rXp2ixkR<9gDnrh)1IlJP*n>6KD~dAH`TNhKQu zD5Hylj7aIrE%gs_wdPNr-MJrhpSbqnOJcGp5U10tNO=kiF>j4M>$f`YxWAci#I&MU z;+(Cw%8d(~ydGs!@_js-$F^c@_Lv>^)6oasUoub8>Ws3jF6*#I&3IqkFP&MT$H?P> zXZ7NX*&8LR*c;IQ^o8q0X*u7-ZGS&8rW%1#aNR5|l-{ZV%^p$i+e`m- zw!`E*Z7e^9|2eDH8{JnJZ~u0z?DdhP%u#P&(OCi1FEnwK1a?E@-#IyXrcJkjgpMac zuCEuclBK1k`vd+rwu2^giv{t#htof^yCnMdE~6Hj&=zq_GMy$jW6(-yb^O9L z^5y{w6_@*M>XDZ6-8jT+cgt+cvM1>S-v(sXl1kkx?d{ob2K=w$&inS7+i*=5PX!xr zWF>SKMeAjO$>K^BJC1g~>A$*J!_%v*f@`6PJg70bekv$feYwPX!^mUQTD5aBuHVl0 zIKIcmQyLU=XaGBBJ={dg^if`Ks$K8dUrxoR3>JYRUfd2^EJq8H$-mRl$=)yC-C@1a zxp}g))8uh;6?Zn_e^}&xW%%rw4D4VF)=i3tJsg0YiP&GQFefGZIYrc|KkT>*(xU8DSYT z)7H^ZdC2=r2NP2yoxuEOdbr>JAp5G85^qyd{++yhNAuNSQ}ao@d)C<`NxafiF0R`_ z0T|iK&pkaQ>@s0RW%c!+fAszz>c09R%5VKvKtV|Z5D7;Nq@_CqM39i~?rxDBQb3SW zV1S{dks7)glz_Plk(}l67EINDBuI}|+&>KRHkJ6amUSJV zaPmV~>_Mcu%VHx%!B@i5hu2S&gYwnNw^14 zw3h!)cnlkL__rkHr5$m*oIrm`i`AjED3n`bQCL$6v1RuXs+)#4~CYjY| zdUij2_8`3)lDjIp&1V0cPDwmu!~vG~J$TgfTV57*v=(UkFr8CbJVf;=l37M^)J^jBEcEI09jg)3nu;djOWQL# znp(U5QFl|Mvy@eTSkIP(qnNJAP;};_(P$f)p%?ON^c7CJK%iD$)_YFAh4sA(VbS9Z zy+$&dZygaoRAi7RC0m6&D=$s5T05x#4 ztZDze<)3Nzp}D8XdK6|9SNq><1Q>oOa;6C}b{!^oE|*W|1x&;g!C1betBg~XmKSMK zpzb7~TMicXi7pPAw}Z^v3480|e8_0=Ng4%)Q3S_aZC+vOPp4jXgv!e5{QD*S-rFN( z{&V`|f5c0Cx;w=4|H%x^%mthOoT2$SKQ~ux%;k4k#fkuSnN{A6%pWH8ibk%8T#BE~ z0JRMk2Mq;JQv$hfd0WL6WvHtOd#gR7P=)Cc@YP}|@?BITuf=#o2fB+tJ)ajDItRp6 zN0BTx^H-34h`O_IZik<1a-NslGP5)3$!tsE?m%s|xVaH~ycs=r0SErb%NBAD!KI;# z?G-Fa14!SP&gKkU3f>c+ybyiwK)~b928D0YOYzf9^)F*epWp&NUdDlogyHdlwOVqdMYh+a^97q zbwiO3TEVQiH%>s4@pl|?hQWVL)fBI_WT2gUD5=^P`(9KxK$33sEchYC zFR4p4p%So0gL%2B<>f}0xWYoP#`L~+XEra-j+gh*`1H^S7z)$8t8~tN@6AVNR(!$; za<6wuREOie>CY2VV)zJiA8%7XIKMmhzO=2a&_**}omo?hc3CPqIyk^gMJx$-cJ`1J-RZQN6 zJOo|u#qnkM{WFQx?_xjL|hT-Qa~{cGX=3yvP$*mlA0;*0BaT9^L! zp#lU`Rb;Bxc!UTkv4yrLCY1K20V#;@ct(p313UDQvaI|yv*!qNZkn?x$HyZ{*KoF@ zvidDP3cWFWU|OR|XF11mQG7{^|B-vW3nt*H9&UsQW$Um%sjqZ{?Hw{%VWw%x6oK7O=yX3g#3+qZIpo_82>*!4>} zoz*oecW_{xK7TU3=4L8YQx!OlA7pH%kRsk<=@8SVXm>CCV#s3eUay{B$^f$EBYPG8 zgQ+Sd($-nD-b{YbQ_zimV6x1RZcu@d%J2T<)pmg)04W9>#BL=s;jn$oeC)Q974#-Fo zEb-Rsl54cY+|E?bnqS+oLyJFOe_AyiGlLLfFNv10+z#cqQ{PRIICn|2K6#PyFf=9I zWi|M-o&UUJ%)h5IOMV{szoNi_tvliWtam%-1?;to_GVG(fTCb&+8rPg`FFK#)v3ny z{Q29^sXH=V_u58=9zoXV0BlY?kQRynL$znal;oH%!J$5EiQ>06daKh)98^9$=C9i2 zZogkvaT^_C+kB|@93OUV`APP7EhRg<7H}9_+!=(&tzJwy&y*=SR4&1=`6{7~y5Bnd z9`k()zt=~A9OB|z0G6(39NV04Ixsyr8^yy71`hKp3?Ht3Rq~Tx+T3<(cDpEZPsG5LiTX^_My>NbBq&1 zq4DF^*N+IY@gRHt&)P;?JEyI0ID+~`-jbb>s7a196CyzEb`>MJ9;C4P@ zeAeiBeqheBl;M4954-4QN@e|BmdSx30Z0AYUVBkd=8E$9Yv(AEgcEPkeX6KmD|x3S zM%a?u47mCJ@wqL1EmyflF|k@V9wN!=y;cX!zus9`M-c-RReIG=6qQd4!gBN($K+HO=1$GK{D_(;s=(CfFo_eV)TU*Fk z#mzv|2?^vpUp29n-?v=6y(>bi zQN1@NfBCb?^*g-YgEar;39kv90r-Q^Cfp9AOBL@KR84`xh?oAD-()SKniAiK_0KWIOse&7;~~>D&7BA9Qf6BoTcA%SmSB4^`u8D5b)`>^{V1Z$H18w zS!K?>-=n9auy7{O-`Z*qY0Ee$NtwX4_K)JLGZfMkA`O>0` zf#^h>2W=4kYlqO)TLPzrCKEVP6ZNr&LW+L(gFZj~O3`rG;Xu0yIUe5QWAj(Yl+D z0SBW1dNrUB|9Ym9hYVOxP1m9s-`Lo-wna-f1Z=xs9iC|uy%j64rY)qQkP)X*p86i% ziFMNV_vlu-=91qsl){)GPX(#Z=o|F3*b>EVgX*GAu)rJ0bjUysdBD>wCM4Rf;-F9V z-3p7(teQ}!vJ7|>)c~Nm);&YV_JKl(k*?3bEb?EeME>BEdA>OVW?8+ScfXLKX=(g> zw6s(IC+_@=GQ4|!Lt7=;#C4Ku)T-x|IuNn>oX}T(dQlJ3swzG9SP?@%P|Xdorv3Ip z?0=HM5A_)>>n6flZJ|5~?$8j@bs?wlPU8s)Ag9U1q|@azEw+MLddslV(B(Z7=yq}O z7!W?&Nx+7X+!9j+kjremS=HX)P2;;N@$oFJ#g=7VTpz|%)l02y`DsONS&dGelo=|1 zVSoex=$`5#7V#_57cZ9U`Eoo8a%O5g(E zm_yx;_SogM)mUJI&Oh``>J|jqR#F1-_IQDO$0m%qnGNsa-U+W+(dx1+c|}ABDH@K# zumgSY3f(QSfdLliqF!c(Q*PEyfQkLjQRFI~Dq9+~1`i5+8bsX1L;UwO6rzLxlma~S zEG|VCLA|r4wpgQrq%@!;tcix6wn5V6A&1l*zY_5x{lPQ-lUbpz!=pMKZc zkfw6QtJ^vn)&N**Q)&;4Mwr!2)b7t= zf-=kAd}UPDMoF1}af|K{>>6##&YUZ|&XHP8&tLVbozL=e@2FKAl)pfJ2C9(5m)4um zE62X2?PmEKp_yc#5;SdrLs>#PEiY#j*>+P|1;s+lOoE+*0DY)pWv319 zI;Kg~<;VKRXS4pJWvBxB>g|bw@3Ja$Imxs&vrS7gPinWvwnZJOKvk^$tF1P*J2E!8 zS@40quH;%tAZgHfR2fL~@Z&T0!YNE4FTwno9leE;{0KTNi7PotxF7s8?Ah`yGDF<5 zDpi-u>VQp#0cV(nV{fKA`I3yhI-VSeiDU05c)kJ?n^```A+}Bhr^u=nhe?o!I_Z?A z*Q~stXe7HhN_;QDfIa@-ba2m){}uai*?+qdj@^D;3%fh+H~vbNaTr~{PdsYo(uibM>v(fB|_gXVy3uL}T(^t7ZR`Y1wjp-YD)ffO`C zK+F74ynBf@UPA_0IOLE52fToV84>mYLEr>?wB@1BYikSjWh~xm3i2~@FZPOINIB5x zI3yw;po5oebcqvg)UV?OphF_iHeFY7IIFzSADn$twNiOqY<*v%mtnVlSKHt>P?*(k zx|&eF#jt6Kk@n=!8C}xcJ0zmue-)(Ug5J67=u_GBmSZmv05YT!B$-v341sl5P~zuvN1`n{cI{wG9?cbd)ru3ZjwC7zv}Ci>ty>2fs}`@15$LQL{quBC6c>y<&v74HDlJp5@gQQ3~r zI?G;~A4|?wD%n{{9&J9HHq$Q=mac7>%?MNRJr8D`0OGnv;r9)fcV+axJRx*1xk5L& zaIS1_=xa1EE7~OF75?HB^tFPip^Qy{+14XmylVxv`#v*eea<@%`i|?K#kugfQ?dJ# ztEw)xMf|YnDq(>HnM$*k%EE*+A$+@f0P=ZJv%?S6BT7rUU{Ntye2x!eqh7PazcYUn zsD07_Jx!%OptxSM@*S?Lk#<`!58y0oT3>H`Tw2>|U~q62@sfP-LN0Mr6kt>mLT0SI z^U*dSzY@n$pp%tj#wPdV(0aX-mdMPiFhogOM<=^n`%*3AS)=@@xdgYJNj0@xONJ_= z4{f8|{9r`#yp<2?aZtF?6GP`i?5~6j(bCuUr5f2yU~%ec--PCBEMtlK?;2~!L%<11 zf5dHElDYgjt6pu)XHMK{Bq<9M1Y@_x1@)KOc$eMy&IjMLFl)3lopixQ!XS^{JfYSC zo@K;V|EFioOUk;O{Hs#!J@?&ZFsErbyb8pE>`kBwfvwLSSf`(FN`!Ydb0qw!F3CoTc)itvE-j9$GmfK@s}sR`vm9p|MEl!+~zKWoTGLddli#FJF+y> z7=6;Hzn~On7~*PVBiHGx-`jl0;0IThuX`+C?E+psKuINf_9=Bl;Tk+PJnU;DoA# z7cv#vYWkTvNiMTZ@QxKndKC_(I$zMadoZ34bm1d)PS5nzH?Kvwck#5n>U#|aCH8pxqP{@hh&}&8 z*R@ZvwG=+<%N9w7F8q{@B{lP1=Ysvz`Z+CvoMIA34%k(5G`bI|vG~2>+w)@B`HaFu z*oV9z%*{l7v+(Z!Y8cYW1%wC;mc%^;5~=p7X{v~khsatEydSGFM04#+%5&>_gu+|0 z5XM77W8PJ~uR{9{@`-D7KshlWAdwP_-iIWOm>AsdOd3xLTH8b(z^i*uPiO~T2 ztBr0;aO}28!K?`4^eJ&>)*z^YFP(#gNXjf2-VnR-AJG%XH#kpRaYTt=EXj)qMw zdBIx^CCW`rI{o3fV_~dw4rrw3oOW`G$uUZR5uir-rE}`Mo-ggfsRv%tq0Qw04rHCSs)k102d~)&-wX5>ftB4V>zP{e)SgBE{$^&`s%n^Gr}FA=t84%Czqd=D zPEBZhs=p$T7(NrmID6smiC8OD2{;^|`Wg0%Ps^fli^}oZF`;oHc<`451BslI{=+XA zM18YU80(;Dam zXVUrg9eVgTJPsbU)%%T%#-U3W#HdU3I=MSfda`PM5bZp>=B}Hv7sZeDtz{o}xaMx! zsh(}X)IgyCuI3|oAO+z_Z0^m$>b}VmJAe4*(o;=7hVr~I4IyR?F}nTApLeoCbA#M` zjDyRh{!#_f{OV@Y<#+&)zf7?YcCTt07`KP5}meA;Dp?LYk0E0743XBZr&1gzkMNst3FNY4!8i6Ok zC-v3Htj4HCKT0bwZ~eH1j6n(vH%Zr3K}lfA3!3*)>RoOt4v`xT4akyzNA$`UI;*D9cR8f_89_gEDnWx#E0IY4l4gaGndCg{TDVD}F)TX-c-yk7i3QK50x@2M^}Fe`ED(9bj9pA0k0f*BI_{Kpj2hm6Y>?XwUguaAhYk~z*=b>!JHeA2Tx)PK%Q%~+0?b2Q?v%0#Qynw-$pHLv9lxgNPNIl zszd9_(p+@5Da|t-M)gRZQInGG#|xR9^T8p4d({@=t;H~N{%MrwUR1yLj0NhllJW}+ z!2aF}iCA(S=s(XV!(s0i%IgAuE-#ee7IL<_UA7ZI4+)4t5^sosRCQpUJ6d}x5hstb zc~@S5yAK>C^#7jH;w>vkioXG~z!#TrbcnLSkkwm8fm=W2L#=U7X~)!g3mR)E=A+%B zIX(!?LVhLzu}Ha0%v$RF%+sV)az#Ac%h9X1kbad66IUMY9Wgd0owfLEKa{uqN0c>l z0sOWtY+}7VtPrK!%F44nx-IlQeD&AYA!_DLp2`Ps)9JZ2VZrU0F3sP}hd@2D_lk1L zrFr9ft07BO`r5s3_V^k5K3~nCM+%S~-ntp#h=ZDxSMs_WQNP*6<$80x}bVGlo$9krlo=Hf7 zQ?jb$Tn^=NM(RxKCR~r?y3b>sm@ZpUBoyv`k#sV?G^5o4Mtpe5kxoQ&lchG|#X#(L zyt*UnCz1(7YZ%=_D?08txWNZ-$}g&GU6*7dwRqU{=_U-H$GzYLhz^(VipneNuRSyH z5n~7gulI$PRV*Y?eS{l$WRkU?7#=N-+fUan56E8>x=7qKMCS&m!Dw`acylQt8u*$G z-@WQTcL~!ea6C?6jH}-_Fbr%J)Yf%2SB*CRnuU#}zx_R@*B|kQd`jP#1+wp^#D=+H zYEZQFdn_>dN}l2@h532o~S*#@`iW zq*}ZennNi?7fh>dUO>~!I}kBVIWRIroqFWZp5Hlpf#Tbf*%E8E5QnVM{&-Pfzi%ay z;_}8+*|6l!(})nAuFJ^mGWC%PEj87Rv*xV%>-VC=(Bhw$@n#mkMrw@Ika^r<1)ViSnDL6rk*{-tCw4TQT467$3$bEg%?3yG3Joj{-89z(chn6e z>{>F=0lLeD0Tk~g2|y+u8VOWoX4(1Zn*mJS^GHtCCSj@D>e377+Kcl5DJBGHCn8nI6NO3P_KYwMKq`so<1xcR zu1aI=J8Q|vD&eMS2zZb&#-z=#`M*yIl5awT_42 zi`z_X*hfmsPP)}`vAd}m#IlfczMk4z15HAxFt}(@mK=C$z*{rJ``C;{?_gBaYuv@T ziXd4;^gc>dry0Pgv~nP4kK@CKj9P%)!gtdCCLIB=Z=UH-(@_&1ZZk_h<&^&;B~s6j zV3f`|^6&WmEI+gDZY|Dk7ztBj*>YR#e_cWlQF3CZslAl3JmLpZD7GC2n!&s-tdRIS zuXmt9hsjlcaZR;onul!Jc+pRAkGP|!A^H4bf2;lw=^q-kT?(o#hW$@Vi2{|0pc-y~ z+(1RVocJ_JJ}-(LpY@dhK7%9!giXO_(C|30I`G{!6grwJ2;$&dr%EMEu8g?s3tg4& z>ei>^&T-M=l_%mT-yY*=eunR+7&q{Vw~psb4QpkeZZp8Ch3R;(<2(cu)w)H);O#N#?aB`KI zX^*7ew)0(;!;9!$@lf1EF^q2*1Eu0-(^kzwHD0c1H1VzW--Bs znMM4ykNZePa|PY#jn5$#=J?RQUIa>5$m$)e;Y3v2e`8Qo+=I#Zqw7|BM)O^c^qC`| z6b6lL>+zD4vs4B3+>=4N_r2W1tjupybl^gxDWod2zq-RoDjpv$R#aEIoCcYQ{RWht z3QkZBZIUwUfbhjzwq+uEt3d)Dm@R`9?SnoYZ+yDsrhqk3sCoHQ-COcDRLMk-GFLZglTSEO=cz)k{QBBmA+m2>O1e0N zU>kJ(%@7U!9n$U$<>R$M)Q=?=1n18xu}K4TE8`4U26$YD^*6ubK5@fPjrC zox?`c;BoM7Qb9>|bX-}Pr1@R}Z9VIFKXrw#@+d=WWi-J6rZ| z%f0D)iHvrQExp4Rd+snmP!s;__%iY~wrsg&(_B~citjTxt4eL!~%K z0hs?U1>Ya#x7N}9s1Ad~lIYcLclCdEde;O8nQ_!->HJhS%)Zn6_R5^Xyuk9W zf%&4ue}C3-qsjLEfDz7<@oR=f(`&WIBbAH$nZcL4;@4J92iTS+J>(TBzqg~Km-tuN z{XcfcFw(~hEttLQjj=~746r8O2&0Qlt)%3%RltfEla!LucrBgLTFHmZ|52Tr+;V+( zMasg$ayfO4ajKsm2dd$5JD>*oAMEVyKLlPH4=;GekNq|-;4?gb*mBh3yteHH$8?_6 zyPiy6VDjImFUT8R;=ZN1%J)AN{jz zsH3v@&tm4=1!WlG=u^P$TuEf#-INHfN066_)G5^G$g8vKxELk>iXD`p-v;^hvH!W# zULz<*@?SaPG{Ft>Ei?9fQbK7{r_E{Ks~PU!OftrJ6T;JcAPNj$++qwvV1#IJ&{ zzmSPvB{_jHjj6)^i(mJc{>?Zn^<1=SwfGI4wO~;G4cW+-6gU1xECUG1vva-irU5D4 ztdhJNx0~D14E#Fhe=L{&Cu@%YE2zii)zGwWMxLk0dqsC%4U}fq_)dMXHxZBEL429uJ`n%pw;T0~YY)-8>?-T;)`Kw%^{2fiLMRm7AN8b80CR6qY~TOHbp{s{cv({ve3>LJA7-##P`py3Gv-@!I_2%XCcn1HvpctA5eqwCue#_VIg_%e)^L zj7)ehJk!G^)65rjDIRvo)quqT({H}s6or5C^yaF}6};k5!0W4S;h~|4lMao<$MIk{ZESc-9lSxPtNNUB_FE4PkL5Pnl3I^%o*@K^~gJcw$HuP)bj*9dpU%RA8k~# z?%K;3(5Ktk4Rtxg{KX)(Yi67{fL~&Bh6gx$&+{yNyT48|#8Jqf550ON5*WzB$oL6x zM1qX+r|IHYf#?Omve58K+p@FP$xctRWYXty7eoK}Ju6O^5mC}(z()xHSx|Ec;OhJr z^#A0isUYlVV?qI?hmIiz)lPNNRC>} zUZcVdpe0***gCA(_4+JMw9)R6@O!D^*sM+6z9Xa@M@FX_5ZI+`r$*f8xrAM=n?+8c znN_OXvv&2Di_2bv6C(ZRf7#d1FaJ9}Tm&zO5zvPZ3CTM}rke|Cc->1P#wi{%-k^Gu znWb5d`fTDm%sF}FjMSUS3U4=F=xj4+iDTX(SELgwt{jU$J#(5WN~ZEY#QAC6AGQlH zBaQ~;3Q_a^(bKOTzGT%%n@N2H(!&k2ZRV^1j(%N6q%lMDDH%X9cK#gbI}OAO0Q?B2 zQlSabUBim$wl~Tu-j6iNjQOxVPbVV*Y^NIkN)I0=C!XJ5>^x=$(!>8?)i)-%vTTBR zowMGt0QWOerl+QygJ^ly-wL>XFip~SfThu#8H$YmuAbMF&0Y zC=G@JVzhCL#mA+qvmO@Z=U*tw+m}H{ZIOSy#-BXz|2L1wZR|Jg0yJ5ynX)A!y4H$- z(NS`X;Iipo>EZCFo=N_u1y6K{drSdej}HU0S%SLlA)(0{Cy*X)J-u8+Cv>)gDvXD% zKMX$?FTwz0K87_!l-!mVWVJBkrRq^Pj#3;~? z&sLdvZy$($`vq3olv!{n<&z4!7o0Me5OnIkz2-bj?ljHr7v>87$|+W7Al7N0QNvD0iL4 zkBtY6t5P;oJb-^@X5MLM+J0K=wO$IWvR-*^ZOHr*J8^F74Ly!9GwF2D-*TNO9hkZh zw5dEuLr5~D=}pXP2d^Wlwe|8((AF)Tc0hLcnZ{FoE3#ll>(v$~2c?NqtRcmG%}jnZ znz{Pm>SA)c(;JVRT*B`jz$tB6v2~s+(}|7Mz@zz&)etF&j}FClK7JE!U*{S;B+Qoq z`{wlh1wC<4r`DhzO`C2Kd)jn^qa9+ca3Bx0TvP9ezQm@JZMCSS%%p2omd|f!OuGG6 zp8K9mT?xFMs@?IV+sFEZ8+-qX4nOz)WI#YXmL2`X)a&5OJ)pFRqtA5K@>uC)hzX2H z0kf|@dU8*~a(u*R;2`l@XkhdbQLaRciFK4Nu{@)1>vFodgMUx_VG8(XVSuffL9NfA zlek{WnOpv`;SmMEqdzW*LXD-0>pK=(V#{Y;@hg&}xS9Xf3z0AXy%R{8OH@-Ykp-@&|gQaAI(4YVXh6NBEEdggc&?(;yT)njpzw3%s=(cWoMjeJcmr((@ ziS4hU;J`6K6zHr6?ml;umXbAFcQSEKbJX42Ei6U@MSAzHw4 z1L!r{2f*1f+h99^66W6Rex**HfUed}a=APf)&X3YEMNTKx)L@aS*QvNQ_r#L2hrx` zPVWkN6=HNs)5@#1$JP05t3t~q`p)2Ab;v9dG@1SooSSdO^L}#?pK6PtT_8OwVI+)4w=vAWxuUkR(m&MODnJXXgp{D?~^={*z0lNlE8Hi!)@}7c-UJj zwqt$b9`oH{UAW4H>uS0#xCny->17;P2okcK9teL*2Uaz&DaqS19HY;EM;$B7lifR_K-kh8JVR131i) z7t1^$FT`y4#CN1@nxZc^HK`h4H>sUz&XVV_dL!sZjdqK*;=&cV~ZR{%#! zQEl6Y)`Nx(n*H&r1DL^D;VbLumiVY0dVxTv6F=QgL+vlhT5|lqw%I5?K#@TiI%atk zqP356va8kHmWXfw9~D5BJYTZj<=GZ9PfB)vlclU&0C0X=I|fAZU~T4p-HXz&cugjt z`H$KO9TNvC$Uine0^+_x@)Dl`?`nY%vV9F+>(mpR(nZ6*QarKeynn4bV zSio+m)`g%P^13DFNi{UXSPSVsgS|N1+jgNDI`VP(Z@VAfX$B8#zmOa=1z*-N zYk%1aZ5KeZzSR*C=-9U%4{V=GR1%qm0k+)~SK#WFlt!@SLa2oaLPa~vdIr|Bf*bI) zfNYrp!d!J=u>iPZUm2-XZB)-xgAiB5mJ4%xmP|#OfCN>wI@q z(s1D051#n?PVhcDB2N@*mvMf>*yM+QC@LQ;@Ct%ztCJ_Xx|*uG24L=Z6m;{gv`sGA zCEH-biESwR&6E?AaPphX;9IK9zUM3>naX?4rOuG0ZA@dtN$K7z$k zlOGLk1eea-$@25H8}b@jja8t2s#0wj^$Ma45aakx7tP*_M>uSl%e}d2fyd+1kBm~f z){giA8;y{Y&Bm;NzndCl_XvW((x99yM(z)V0hZ0#=f@sl#|FWh48YALOUu^S9(y%> z)xS8!^X>my<=U@@=E1l0KF*=R*QM2MJ!k1DW=MG2aMW;=?ENQqf>!!txl`})Rurr1 zc$f8Mf0IA`g4BPZ^CI!`yX(}ekLUr{o_8IvQaDQw#{V0u^@`%Ye|MVyG_iT5Tqli^J$T)Mc` zHhu(27wb6ZR7CF2;pxsv4HpUzUeA{Cm_14!O$BSJ&>fH0;&)Lz=;BSQS>8z`;h*PL zC!i3;LN~uRENd5iiLUjtEN~L;ZH_pwQLIPpqVSP}+=>ebY5#!^LgGm4&S+b?xuFh0 zsx;MhuMPwhrtPs3Z|Ny{v>5}o5Hpfi&g4lI$$vz9#HNHi4>vLE7`6EohrFF6keS?^ zrJa7X_`Ny84Mz2?X88aXLL6zMiDGkVv7gj-k9^OR_xN!6_S|XH7?jn0+n$#6k-IMZ z#r}y0@oJ&zrS{LC1!S((DGRA&(gXSxvrX3PG>AwgJN&MX8DGghC}xa2caQ<5@y4H+ zt?pqveL{2fq8HzF{AN;&*XlRqEAAKWjAmv~x;RK_w%q`YD5Q;6(d_TSh-p}&!tXjGNC;YUDV^`faoTf^wHQV5c-@>mf1!D_uEJ%+|BP zMJb<|=&$b-?=l}s`y(0a+^5b7{RO*KWy~=9Bnq?>sbD^1V_;-bYO|@fo^yOrgk^0n z>St1?U?1=|a+U6_US2x!o7l|_t>!FCL8X@G2(#=S?ci_QOOCp{5d?)POglP}7Wex* zpR{ZS2vrTfYIa<&qBl}ba^!GXtFTzauS$vb9l5ToMTVvBQMO*@Y!9?#&}`~O>-+g% zHDfC>b&fI|Kc)Zm0+{s7p$WtoOL%9SR*G=>?8<7^snTyhNUH^gCUX?(4oMZPvBO+Hd4B+hG1kxX_x{_eJ@Wx-OS}im`d$o)0m0g5J|% zquv)_f#TUMmOKMNr*$5%&tw`&?ZmX3*7U*nMrpIz9kWS8kI46EzOp?J&q}dq8)4vK$MesYUW3TJ!B8`s&I&G+xG+y<_>plWES1k?Rx3x1y0zQLjq1Yo)6oRm zR$o}K!i*gxR}T`B60QGh#8D_q`yOfo{|2}ZiKc-}6-BdFR;9oTr*;Ka3B`=-s(06` zPNftE(Xq3wBz?0QVZ6|fed1p_fZwMf9nJ}VW^_1_{OUr^3HXmRIxR^)Ea(w}L?4Wf z5lB6VkbH!oWYrqNV06-I?MrMzUy3_3&#STTvD66?4PH=Ic5r3y1e0pv7@0<`CUWT5n;L}*Et&5Qp1SK7{LrIpJm6x zCa1gZ5I{5N5bXDn&UC7Rj{?A^+7jU zFA*ymn)y_I*068a038Igf9O6V=`nrftxM8%>E6~;iKCY9D?AgZ0{YF9EZ*5!vD&NW zHvnVTeV;}ofHnU)G6pR|QGDSCb8y;VY{LbRa+saOJ@G|^@OLU|q!FsmjWn+Hmoaj! zFjrmWnDnBoe7AJ$Nt4I=`Gb)(|3t&ul`ml^-4RJbX znyE1OdvcsV{q|oCZw3U<>zl)pGkO(^~ONUVw1>#4Eb+^FMunLayAb~?I~bZ*~Bt5!On7rr@F`o2$V zbnL>JXr^lUR2+%5FE937Ne*}rv#Zr2ujU{5OrZo|n{Ae|NXh5{XH4vO@8 zT8FfHdSI0T+!j$Jewffs3fiyn=RFfjCkwunDeh6U+m^Y2mrBzY%@Od+#fVpxNqQ7* z-X;TAWR3S;utO5{jAAohI9$&nDA}Cg=AMV>fUa~RL3G{nv~XyxAs6eA&|^P_Cq~Lu zoAA-uW=T)PHWGPdHwRpe|D?glQt^Alrre*6^cy9)yfWRi(@}$vzksh-tQ)AbCUw*= z_2>nNt5_W`-OogvqNZk@G-?1NSS2PGF`&^Gb;gQH53U@#c=vk@HmU6rh8s8ND&t=7 z;b&9l_lv_t|7g}J%ooisRmInIW!eezv+jNfz9qbGvYa3&{{t7%I?LNNTaCrbqmx3$ z!UTl8BrAoSIFgS_yv0o^Uq#Rl?!esKW~er9JjS~IOwRYE+16~kC+#=C5o1_)_ze9W z?V&WNMIYrFvA_+t{0Dfq{WrTo_I<7MPZ9buBj@aCB^o%-cwSZJD} zyi!KqSf~Ju@#OrBj`)MSu5#&x*$!x;Rl&{mfn80`o^m4*=OeJ2 zAB5l+Q!h)DLpNi|I1CyWih{wk3`Y_S5aprK=TWPQRh3{y>?=8BTeh6SXBwn?Z>=-H+aUQ~mX5(0E8BfC2JXi+X?; z4SqXtuhSbARRVy2*IV6Olk)+V14C!OT#)#6b4^E0yX={YH+JtWxFTws9QbK#*@fqQ znaoamuUtx8&O{SQOJgpgpV^?D=Qo2aJV&*D_Nf|y-L#+rOIfgjS`nXfKV1?dI@j&2 z#celw^wJgT8Y`}JhQPMF3>^70<&58mn-^qADFSqNCzWOz9$G4mcxEwkjLP$HBeTY> zCiqUqvvZWkVe#5=98R@jm%M*!IeZPj_o^rEpZN)(I64;v^zSJ%F?!(Hn8|?q z=6;0nRI|wxEi%TNI4fk6_Zw9Ht~K_Vt8kwCj_0|VLLc`MY==~fvHP=VHiJ6MomX8&hwV$xhY^23gYa7h-WZYW4Vs`h z6mDk>-BtD+>w!ZEMus9=Eh67=f3Jt)6j`YisEV$saMOwcRW3taVX)^qKbY~meDzk! zvm@MwZmoPNb8f3>UQ{mA3Wo_5Xd8Seeg$s2VO+Eb?4{VM#r`pzF)JQXtn=wqz}3A0 zM4^QjqM~xq8nJku?H3)R$P%4Sy1t?16WH>;Z=_gn30LG&;b*xF&lnk|*MmTCf{yG~ zltb4VG+X<0oHh7o?P&?=c|_d1E-m@Xi@nfPY+;_$6dR`D`l8E{Hmz86+heh#}v3L%?y}D{#kYhH(U|r zT&Q4=E)dC|lDT+C`noJ{wn%pR%)dAXudfoKNyiXlr-=f;e#azDg;DNbS5eF?zWucB zywJpLT!l_Kl2(PM224@MdIDUhC>6H+-5-a(UzxfkL|a!f!e@TVV%K z0gYgm%b@d-fjqq6RCgEQad`=M?@>i^#`{p@X8<#ImACDwD$M=CBFArj{+5DA?eAVp zruj$FfgF9eS#UzQ>`UkzuV)H{FB)+)_Y;U@VwgK)(CE8wIRr|z(>#h_=`A&xda81| z>$N+;r-FXJOL6?FcRyseRNgHJ>NvgNj);mm~W6_Yl%cx88MHNnQ z{C}AH>c1x2zkL-2Bt_|(7)W=gfP{pEbcjedNGWW9A|aB}-QC@wG!mm5h9Ha@qsJIy z-|PN7_x*Y9=Rf%Vw!L<3yRI{i_i-HW^L(E*%EIoiCLDgOS1e6_Y$|aT->O{b@akmH zB6$pQGHX7VUpPda_s(I>6<)Mp&qp#6PZDU-qVjBSHO_+DkN`0M+|7Pxy^3VYqJqC& zPkuxwUmiTcKH49?vGsI#tJKm>7g9O*Pu&XnM3)$rs!pS?%|PCFma-|4v`iytLD#gI@;fhwOU1?9e&N=YAB!BJAg9cbUMr;* z8#{HQPGe;52l0mnfis1gyk3n|NtFanwy6sZ5uI*^KK3(4riHLFnr22OPj z3A*wMV+W%sa7}nl(h2g7wKY4-UZXkmZ6zj`Ly7wzvx3wd@%yaF}E`&1!IyFnYUS@6vru_?y}XuauYq9|kgfyV*y4v#HgC zoACFd>2o)hwSFg`>tHu8)+e0yl~CivJvuq!H&-5`B?EZ)O2QUYTcU5b!hd(nLDuTE zblNZAse+mWx}`)k=pV2dI+(Y&H@Btzkd8-c_X`zVST}tDE@PO_UB0e4L(5fyXN7Pa za#QYhS!DdE=vzTjG2eayJ7Vj3dvX1=#EOv^Pf7#Zs}s{jkuR;yT{6=G8O1L*eq0)w z)OWVNc`$2ob4jv@@6$VP_XNiiM?XwoLyFX9EXBI2`sm2f7W#`#HmywV5DjHX2&(cj za4%9Vge!PGl~t5ls~lK*omgq&VpeO~`A(aj{!<|U-g*?IJ(E$EM}>b8*kDp+o~W;! zucv{tk!glfCoSDPxkG@mhAke6zhUIb2ByaV6Y}HYa zk|ta}%^j==vAgb5a(8|ON(~S%YnFHbsOV8|6=|l2f_Yr7E4h{4g9(&#HrT7S53AJ; zTXt+`IaUAU8hfjX)@{KuiYVDuhK%Iv7v$-~G%^<&fNlE>W9sQW> z(Vb7Ydqr$YTQ2VI?1#Zym``@4_*Fwb`*^p!i@zJBUjEKs_eeLz89gm zHDs)Hi0tKt?C$~IYbzui4~xBOz#Ue(U0v{B58=)OfF(p;lR_PPH8OWq*$WG!4t=iA zVeef0GXv07*RqL81_vKB85Or5K3}O|d=_KP{_=RotY|isjpS~3+Z^Aa^RVEgP)|!V zO*aFGVZ+BC({xdW`-wb>zHY;P%IZ8l9+1fVCs}#K0K=3aF(&~S+_ibuHv0b*R=c;J zc|dhzcdtx9Y+SQ!;c~er;07KZUPNT%`;9H`$^8f0P`|VRsC0kLj(ua3kIa#!tlcqM z$}%tSDoysvQba^l7CDSIU2%l$(=#w_F2Q_1J&udJ+LOJ?lkEPUW@0ijGUo2&n+m%) z7fs1BztF`C)TTVHpy~WH8E(>KV`yg9wjmuC_W+D)@(-%B1wGD!uAGSlT(4`Qe`a0B zL`G`D(I8p>2-wA}=*QiX0p#V0ciE_;Yx@v+z=b6>HH~bnhyQrpf9ub57XK%M0GbHk zKPHz&s{P#vVFdb*li>OvNdkvHDetfg%fdPDH!(1m3s8WHjA`Lj=LQ$Q27hO3@kG&e z2&vjv_JU5Y&9<@WJoS!tj)9$^NUd$p3(eu4CumKP7s9$At2H>5)-eEqQ_p@@@fNp9 zt2q)7U2;|DsdT+wf7oeGFgIY1#6gXiTLcN`P#;EO;bjpTkvhgD$A6A}Xh}dtf|KH> zoktB8_;2V6r=^BqrObJ+uTJ{~$8Qd}YG_v`eQ3-K9)I((BYb3wYd(8<(^amJ7THz^ksXE`isN*Ee zwdGjk)%m|hgnMk0%+$$C;hZiUc_9@jyn~K`Th^Hlv=yqNx`8Z0X*D+|2SxW z84+2nR#Ua7JEGfRp zI!!d&^KqkMqh@fr>h==(bvB0Rm+($m(d)K5wIK!G;+7tx;O!HAW+L_o{QRY;(WdX@ zShQy3=jKwni}&j24`CWP{@j+>88Rb8UgQytwuMyo`$xS-h5^Dqy=B_`hNJN#AVtp| z`7QUrs(w2`aOj$W#|!E3#e|5Tb@lO-$N}@xTP+>kS6@hE(iGrS72-WxCPP?cn)$=N zlE7HkKla@8v*v0lO)@L|)c@1bVZ22DwqLL4wdg;$Kayn(ykjfl`=N}Fc^u>R!czZd zDB%r#(6^T&BnWF_5X%k2Ofq)ny9+J*59EWOTPk~rTc%2_XXrZ!-5Z+^SY|xRRS;qcC9@49ny-5=*cKw0b{a`uSZPtp=E8!(d5 zMA>l97xo-);lF4^uMB93+0^m>gx>Nb;Rynmn0Olc?Cb{ntB}+5sA991zXjj)6ToN2 z3NU`F(&WLn6%|`8?Z+;(3`}Qk-#KG9M+C8Ov4_Xq?0f8=exo)uuOUXe?Ro)Us!Qeb z=UgGCNdRwtX)gT<#q$My9U@&`$RF}#WT<-fWDh?rF#Mq??KsXVLW8NHj4qOa&MwT4 zYBP=kopXaK&|bp9KxdJYn(`0+EuxK^dtJLRbUt&!i6Ek48R{tyiFYT6FIL0v7RhZs zNFU7Tf}-ydv76B7_-hoU{SVBp&&B+^rkp=%zOWkBH#{o4VO!x2C}EO_irw?#_bpzF zmlnoESc>Zxs=+dY*|*M1SxKd{iW*`p=%YOMPRexih2o<^URaNw_r(_kz`E<1&VO=d zdTMeU%EcAu+Kn-oo-c$b#2>TJQr+^|vOeG#vsBCPwvbB}|Aga8k`RN&>T8`pKVVym<4ERsUTfJirAD`#CMss8U^ zb`|{Zt@dFAUvDv?19B-#G4z2NOsYXd=eXG=Rx(?bGD!d}Q|mA{Z%}c)>&7gaanWlj7oub>r>CK3qLdH^E zN|b$fcy67`LQDF6N=j{`*u++rD#m%T?d4CeJ*#?bR@`^4C3k;_b7ixHQH_bq7Pg7P zuaYIEI^IN*$04HP@X8h7;IO--g$TZ{vgZfQ`x!o{)!L;oM<%;~H?|pF(e1mLfeIU6 zmpT%$bLR(n7f9#YjN(*B$GLHW?qQLwharyD)_wK zw!6yCBId$Qpo4VhRn|UZUa&lm-SPA#NR)((+seTYj}Brg@2sIunEE!$4R2W9Ai1Nt zoa4|kmT_fC#2G%9droy>x&;plZ$qs}Hxhv#2II{&`i!jfcxPb_y!^*oxzulGzePWD zc1Y6c4`_Ri%;Vp*3~4a;KJUNG)b-){veJ^>3U_}SdGb>XvevNTzU%LeqW2Ns%Yrc> z?~=hb5Ewb&JLuTJdYhUOP?CaQ+Nx|lci4Sx^df>pZ z4Qja4gGuyQ^BzbI@++6OKZmQJq%DlJC73Vuk*-fE2k=5V6N_*ujKsJ`MSa}bjWy}6 z*WR3S%*%MLF!O^3B=g0qBk09pYh(XPokUJl;1AgL&+KBs759njl%>-{i3xWUgsvBI zvm^RNeIkib&t$Xxyg%~CU11X0EX!@jkCV=7Sfwr0ZG?#)IfNUY)%WQm1*K3Pj3@iC z;l~;cM;Xg2zdIqBQ0=zN+Ox(SaaTU}2(Tb@J*&}fS-=m~g;}9a6qeJrF!l%4zVzWy zB$)ly=0S=)2=G@Y2Wujt>}zqIE6B8SLa99xlGS)oAmEH{2oq5YcAM>}!MA{oQuD(n`2c!m8kxZkkMt&rD^g`&<#5~7I zn>_VE(R5d;QPssN42j1R_oz7Nn*{XD5313Z83kY72i=_UppcW}m%h&@&3HeNQpDY< zGq(EjgscLM&DT0}QrFV5(oo*98yxD=D-@&9p(WlHXR5)v{A$28jGe-`^4fmqUsWp& zgeN2|{!zwLeNAC8K@dhSDo7nLp6mTIPmGQaE2`$H&#i6r0Obh8xdkDvt87I?t!u+P z#;teWm5Oj!=F{|>ntEyyRXG4{n(1ReBy174O|j+uV%Em~&_Z?)vG%82Z((b+ zvNM1-08IGu*sQ6EANSA~P$H;6giF4hIeZiszf51VR0KyHe+@gY&8f4wI~<<*n&O7h za1A5}L$Y-xSj-p=)^Z%ky|o=6QHa}Nw0IfYz*tanuiO&T5nukBE#l6~+}l0*?nPdS z=}N2OaRTkq=sut=;y(if5opOJn)?!tiN(3aG8BTR=~*6rH)EF;$>7SjSi8)#J)gW7 z=zO|28~mjuV5QZvv*g}o)>pa@9Lnwjw@HnrYK~OXB1P{ns!?Zzu3J5iF%W`--3s~W z`Op|eRpnc{#VU@}Q=VU*_7&ht`Y(`FgQO0!wgHPd$6ABENa~x0SfkK01BuhjNII-{ zX8&kJi>y;FB_oBQ8A1lW(TH^%!dW3b{;rRgP1B%12XQm zSs)VpHD=lE`HtQ5jed+l9%lpriBw!3t|YnNRLSDWLoUbfq&-_pchMBMtK8bcItwF= zVj2KJ)y}d9Sj~PCFgt6xo#~d{2yNr4sv)Fh$_$ozZTtRImg2#gJzvzNa2Lp_s}1DK z5n-=;17RH$WpRvAJkGl*ah$Wd{5H_uP(NUng^dm13t7?~HYXMFRy_x!Wt{^M zy%+OY1}3i0o}`7s^(xIVGm=nkF*N7!X8Gc1d|$v^uK3AM zsyoVEdfB+0lbV67X2redJlkQd!6*&5~35_iPWhb#e}|jh^uqISpNXE3e_ntcC@!WR5rY#w5-m*PsX9aK`q3A>0pCF`y52mS#I zAL@D*c_oEh&ThQp_-8EuttT1KY;73RHB@k_HGCyLAxWY>Nxg~6^~WIb9ij*Rl~2<` zfL{R`wOfC_rP$+=aWPC|$F{)F&(kjtzApun*wAsaKsufr7c z%nJZ-SR6lqfG?;bHLwz}4-Gw+oI|V4f0S>TZlS;#C$QHNRVpG}qBwiJz#|y@IL7V; z*0;YR-~d&91mo06C$t;0nRc}zfk?O{=Q*xXbN4*%?cBNKn zg0NLoU>zZvy`HmNjSMn_`yil^P)zD*)M?@EbVkIA!f0=m`gz4{Ots6~jh=zvFl&UutWqK4lTt zo8`mzP?U1qNO<2=9SgAoopMER(hJf_Ch3GL%X1VxCW;K4mG2+xi$CTE(JuH3=J0O> zJp`oJ21$9x^36NaMUMeSkmt{>IMAtMaTG>=zHh2(KW0VwOySpnyu;k!Cfwk%HcVq%?r8HW}id)%srS{q(6wp7vo;@-qmeFeR zlLn$(2J&rc)NV`iZ`VJ3+-kGOa1hppnW}phO*!QmU#I}u-?cFhe|=K{x>cM`xbQ*n zX_*lqJUYk+u9vHx_>44NKlA9tOGSLU8<{3^BW3oS(;LyaFk(CwO8Xa^NiBNMk?YM~ z(C{wIPxwDV>CUIB7gPqq!8P|jlWjhTC6BUKw2*pk58D#loyophWV(88&$L>aY)+v#a|n-P!D?>JgQt=4}Md^sTTm8C~Ngvo0zPdx_q9rizaW9x^$& zqvRYoM6ceeXVUc0&je}d?<(lrBwD4uRCA<6u(Xh6FqHW7$-#L0;s@!UHd?6?QdL09 z<7W+1oUtYPFF)eKR6FD1!~q(T>fsY|3_J=F|4!s&cy%$G`fWf_Vr`UaUD!b6B{I}C zM-HK5sX3lc-X3z+z`9n|*j5!kF8B8$ToMyw|0z6XF$GSz1v)Lq82 zmW<42Dut--Q&YNk6^c>aBe(M$Gq-iodM!h7pZubu4J?%p;5dL@SZg0fX#c5fN>3vr zpgFax9?8;<8`TNf3Z@5l4m>W6MmID;Se~vdcKI`5JYv!3C!lF#6ZTD15(TKg0ya1g z7R*7#h5?t>7UtDC0HY-5((naTp@w{ajP^a=%^uhwg#m;LYBi-J^Bc)tAt~}^?;bT( z5APrL2m+_hgtv?S5tms}*i~Qgw#T<=6S0D$4v87VbVG?}@B{qwM&2ht?9hbLfl4c3 zgJGqqlM{i94H4xoch%?%Aj8`L^1@BTe|H&}d;|2dP1UbA?i~3-RFlcVVtePtsRVZv zk5sk?wtba6e(qN=q{%OX0MMFDJ=t&*dJ){Gy3NekgMD#m_*h^kAN3*}W4D@##ieUgK zTd0+nvaTNBh!*V0OE+-nR!hh^EX=+GYhyv~6imtWO)p!_1PigDLVKn(WoGq}a< zFB3|1n4ozi_p2EvgX#TyNkOmq!+&pO`|y0GGT*5)<=&roa`IDqpjn(UGzr5~GeaL8 z_uK^N*mA-8y+GlZ12GYElw24PB6XpC@zKgc@Cfqgc&z)J|Eyeacy<0=wWfvz;_L=1 zL%$XU%UJ^^dVd+s%C05jLms^Z?-^GcD_7R!sK+br>Q`@KCnsVqN1dvO-nu-@{>Ygp zoE0m~XyRG`Yyq)=tEyrbKV7F283QGOk{0U?xLf=+9LGFO(l|VCT$-x8OFb05cX6J4 zArc7yo;Ob+5afPMv7jdctM|)T`}8zoPV`!LMFmW^Of{EQ$6TrJUVlA z5Dxu|!_JGtiL<$XQK4VYuI5=+Vj^^^J;W_fB71&8k{we_^I}nQ-4Fd{-_VmXXa4tUyF5U%CcK&*YsLhhh5LTPs$SM-N z22QCaH5z8}wb70~zM0KQV62ZnZY%--aAAX0*ss)8<;S(H)`(3qgr34Otm4Vs9%rx~ z6s`KTjJL@MXHAeMQJZLa)$~pG)#VRE3#oiQ=IX_@rgHN-rk9EGBVwThfH=F8_WVVFgN^R)Dp7y*K`6hi}SYwI6~{HAuI*n!UcSQ z`xrz10AZ?*nzgRAU)Yuz_*r$~Wt2)*0vmiN&3I#998o6fEi#hhwuael6j%Xw^qd#O z17UDFt_aL!wb@4|X#hSH+MVe%wp{+jk|hX0!~n5|fh`CbB5Bpqm1vRK^G5c6@#;41x;rK3MG0${{^ zOv-3Gl}d9Me^p|@^1F?*+qPs-&H{0vDv>v8Zglxem=jgSi0s%z;wbs#vMQlrQ#E`J5b%y6IWrI0(3+vyeq3M+r9Q^IUIPs`%D1=kND zh~m_)r9Y{#%~fk-N!2ZrbV7D7l;EMC^@dIVrQ^!@j5dfX!VgLEhAIS)DDNiXK;~nU z{x6>n}<_8Tb}PZ#9EHfuQk45jL+WvwtVT|Y1F zOVkWu10Dg*k$T3`HKsNo4-jbk<^9z!pT|P1H;VwLt{*o1K?O?J6;OH)hq(hs%>&cz zJeOrx4|u#Kn%7QCHJ9&0t)_A<6}7as4Yg$`YOhOo#|U8OK4+U`fW<7Kb1F>{%Rp#& zF-4u``ozz%!c+HpQSE=0HXiLMUuVVqlO1E}#T%?=IxAzBL>BxB6&`AQ`o8o2_*maa zlVb(iL5@QUB^Bi?321GAS?x6DiBaiD?{Ow5cBD4*K6Lym?_lr4`RyqKP!?ZXmSD8S zsp{Am+nS4GS2jF#4o6bixzp8e35;-q7y+QzP;d_Fq;mCJxOaBBrtjA`{^okEF}d)! zN$G5*l^=@s%>KAX<_qQ7MU1M}X+p_&mm7QwwbQN_(1q!yuob2*&jAB&&ItU|^sits z9+_r?Jh_qKrQpGu=f4s< z(;v{JVmmY5yqT3m^!9?`<9cbKr%=A9&Y|dL?!+aYfOo3~t z{e$gtlSw96cMTAA5e`bWd6Hv9xoiZasg$`9shkmYh5>Iv9|43{_9J2_?!3 zO{^yg^i`F+JKkDa_J_Lfb22G)nLU)HUvqk~O!GFfYI4??ubGmbYO|_~lWWJNflEPa z`p+TN(~KhvMVDa~@x3cQV!+;JkzPv7K!G=(r(L?=J?@lW@87#S)L6hrcQ3IOdg!aF z@6n4*EVr?k&&sZI7)p(*o~$Sq$+oYe?XG1lz0jWoBF3y+zNF0ZchKEdZdWRIy$6ZH zg$Tmo(j}tCHXfj=phfk1>_Kb!_R2Ezn#nlzGX;)kVCxQy5r|-+x8__8B&GMO%2{#; z=sN}5m4gQTG17}{AWQO4mJtYqvjHsb~C*)(k9qcGyh41{4AJi;Gz=cRNKcoy0e zDDBnmXKi>q!`H-{zn%VQ!TZl;eU+W}+cz_n-espu z$p(F2e-H?`W*MZAvm(aIipyIOW?26O&`wAY_7vdz%M&4&Z_&wKL~)?H ze4!=&z-$Q{WdR@-&>M2vfZ)`9r6}ZrF8ylpa>7V>g?i`0A#wmBdAL{*zj#KH)UBT? za3KtP)y_x#_OxZ}SMN^wPU|DCMsviWxw@hH%PjBh#imRmmoL$l8{l84{~ z_xcn0!3*hMP5W0dKOaqbi*0D-xZeC|UL+z@|4uM8oKixkw<9&D3jRaR*k}GUDB4mm zRPwNMT^lxA4=}_+8Lj+VvLd_ZpY;d^su^;%h=3de^-@&!pB_j;97SXronzN_^^p>M zoW*ku&tptX{TtTN`|XlUdz~mNnN#f9;&yz@Lq$V!+1 z#dbBDJX#DWE%681=|o>IZvm;lFpRKt5v1Wqvw0ohsMxL$Onv_B+8_-Wc(O7p?siql zJwK^@ma3R!JnP!hTlEmE0bMG(dgblu)gI`)XXfeFIN$R(r%+Fyj4yYBiKvXGpoos9 z0)OVvr--09MMZz1hn1I&({0AY_-Pkf4(JzQh~~BsJ94wnzj;ndG2fQ9W>)tI;Ccr< zC%Zg7@eIe@6?vsLhb>Zg=0gbSpz^J1rY2FS42kVSM|5AArDCnU{jy4peP&M8=1}xS z;~YSNJor0K=#~{?PAYVr;Rmv!8X#7+My_COQ0p{>&=!gNjKSFy{|bjYzAbo7oSBgJ z;(MX5i>y=POnFymM_MbE6@~g`#_{EM6}%)f@}&qVCvPJ^BAPF-t!R+UDX^^y5tVR@ ziK9EWzaXYz9wy0{oJj81A4H#o7M@3aaW(2}@_BNw>UU%|-`GPBYt@HS0+~I)7d!4r zEvN7_DZwNTv{|&b1xtFBZ~Frwb;X)umm2KXn#2)?X1d@3Ae+FQ2vAKRC4vTfe*TM*)AJ)&aw@)sRHP$=vh z3M~F3N;!27({S&y*{w<%B)xDb8S@Tm`XrCY9QciwzrsH)-e3?bTGoQ`clpOMqDlC- z+23o{V4%^|nsTl;oOn=$-Bd0((V{*1opYZ1xXi>{1F8F>)*gH(XK6*b*aFc0Ek6Ti zvb+HQmcW+D_Ifx|QeF|1U-z#03`+k8U#JHPFOj00K)nDf-x#Nd)#`f%y(~4ktHDl4 z^uy)h#sjMduy-OA;+5AM^c73SrKyQ|?{^OZDCghsbKz24hn4`hYTUpB@Mc5p(W-6* zJNJ?gNo0zm~E` zoZo&3L{;K|{iQB_D3F|Z3_U#gSj+Uy-qClgx0AyXC6aMAK&E^~VNnk2=l&N==HUkE)P^OMq!cYtO8q$~t2>wuokAG&TA_=@>4+m0G$ zjd3voVh=1%Tz#tcOsTt@+!(dey-+A~yLiK|Yd~B@6oOf<&IPQ-CFGM{hzuX2fiUZl zoLmw*P|tCq`b@w3_!|Q6*$FD3OPBC5?AH{oAlY-E_Wb0_R(cEaybBZzSVXh21`Wg9H-nK+HYIUpGxTKLi>BMsaM&qYR+{o34PBgAae%K&8}ZE z#EHRLKl~hjBlt&2NbXIXi^*i_V=fF)O`F9JOsp^@n`h8{K*E4!em28mBxsUJxpf2V z)qihsIApBwDVKd|^o#BBcgBh32W(LU5&i>BCFWOgtV_s3z=&hTy+4AKS+q|77EpGJ z4v+k2*zavI2Y5dH-*qAWE5J@HtDgIMlpWEReDcs97GPmS_33Zi#ORxn*A6i(BRGectMQOrPdkd&o8=NX1k)cTs9F~WxU08DjoZH z_b#9YFuLhPw7xQ$M=C0|qXg1Uzs#U%GfY6rrM9pDX}P!{CihLVftni5El!zU+Ph-cfZde(1hoE;@!KeJRHXxh?fV0aenSn>ZDxu@YH!Wf8_x@0g8ugq9#{)|-VqQmFD`if40&7gNf#e8Z z)%n&deUB$giGU*Swp$C5VWDIN_pui&5J$7EkUr^%DFO>Tag+ceqdFj++7ejHQmtAbW;p(Xilpp-hw}rmBnwQ$3Y2897fUqq{VYI*ZfN;u@pKnQ@G!pcj1vC8l^%CdX>O4ex={<0OvqioNdNMqA(usq% zh|c{QSwvMH?CwIc3T(%NehT2-<9w+_S)SdZ(g>XLG5Bp?YtTZ{Db$Jc^LM+2>or0_ z+I;!kms1oZpB_EDaa-kH5Z#SBRHqMcY_5s^rjAhy%r8ndy}Yz?wUB>GTAH-Ii}0Rc zil`t|dKmHJFa_0cAhpw+D?pbWrq=1ZPjOC zHORu-)Ny~Y&#>Efji9|SD@SpB2w$E}ml;3oG zN#fV*C!P9q!x$)yi@_&#r9w=e`b#Mxzwv(-RybI}?^BCZroYCstE3N4V)!DgHdn_l zfH4#9Jb+vB{OF|ANg=Z@r%f#jIb9{3;OUsy`Y8Q`VBB4(dpzmB)j5X2xuv7|(JH*f zdoNEW1wzr@oSP}gpm*r-jU^Llth4k(M{LRskzj)4+>}3Ps*KfHD>75tKC_LsH#{z4 zP`}bnDeHRu?#jH1AKO5*SrT~pUgcUhV}DOreN+cZzo^t6+1Mcs?IU#LnKm*fJ9-yx zm&97$`YTCQVaYo>*8J5Z+OxMoy-VZcqihk-edNhDPW=||$6;{-)s0aM{YauyNa3b`;xv1?JfLR@Wb0o$rcAf-juI zb9h<9P;J!X83VcZiT+*@d8PiH2q=2?&lZmLO7#b`E+Su1_5yy7?Lb%gk7n~g%^I5( zpTDzwpXhBBsxReX&vu^9YZ4`6&i7U_Y+JB$7PvurBuCEV{=0g~BzQKFbX0+nb z5+$eWwDxKSE*f{nr!$a4SmQ#|t|LQkN_y9{ZDB+P1QC;WW_cpB-B4=1lH$NGyeF?r zlUk2B+`;HZD9Omb8+^uQqR%T*QoraX!!@dBpkaTe8rP_u^)orQ#zDkbG4UCax%PO8 z(s>D@BMJGwwA_BduHnp*-$?nL@yK-W>0YTRB%8m|=(vdfNfY`>2n1giE+1Z!AydGh zp*oiq|Lj)zr(>(<1F>weWRdp3|9Q;R7^~^0kPT)yjbq+LtVR#KL=ye<(CY&FRfsqw zyOxgiXvtKIqhk(6n9}_5u3h4?j1Gn3^|H-O)2_VA(fgnSTgy`j$3dw@A?2zwcSTTb z8fN-c)(mH4q)5@|_Z);<{4TgzTe?DUUdDp?u(Rz@i?p-#vtK zt|#NS`!5~4YYQ-Fwnn34ID2xejy0nKb(lFA;d{b*e^QzL+GMdOO@%Iuq3OT+xa=eej~ggxn4`s<$P#cA%7mL(or#Z~Wq{tWaQ5x9$s=OTgeSbGuD zMapfl(kP$!8&XEPc<+4?!zrs}KI%o7XY*`htaryIot=VV8%50O1G$e5d8&3tI&tTC zJn3}h&Y6hr7W~-wT<0R_4``o#mJdtI<@+#cc&S}yl3{cqa>{*C3GAR2a_W+0*Va$3 zmS<-+(~%IGrAun^bF~|dKV*@0owP1ANAguJOKDZBc|2c$Vf~p|8ggWnQ2N=se4myt zSxH}Z!GZSjz6M`LeiF_a{;Kfa`Q-!6`2_9LW`2SX0p?3tk|+jN8A(v@i~?Xp*G7Sc z&6!E@fuZ`;clwhO?vS(u{Dy8dJH9kTs54;$I%e@|3#=?qsaHK2{>w?lnxj{ZLw~h2 zPMkIe(?W|NAOXf+IGjNjeoK8tuRFraPso6@(0;;Q@m+7xUJLE7jmiWFFkxrPG;+;S z4>q)O5#fea%K2umTPN~AU2HrxWGh zNgt%#pXEA|fi_TeLO~Oe?H0~b59@e6p}F2P*{?`Gl6Q1h{%uuB-TjZEL{1g_+8L0( z{XbK>7n;3Y`pAb{j;qMXxcqgw(AP=_B7^%E?0u6uku$2#>xk=QGGY7pb#(GdAMH6a9>M3wujIx07O2dt)PIti!0E1zmq;Y+K&7yw{<$|s*}Y=u{o_}o5%RPa zIW_7N&>ShkDntF%iaLpj{JgS^#vK3JKBAMsU%efjLJL~n9)*Vk5NlEL(sD*iEbF0RgH_UAx!$e&;?Y-q zx7W6yA&Cbmv_ll285ohBDG__Rs}`5&5KKL$*tdL4XsB{VTdx_NstI?9|5F{kdpoWv zz8~*BQlxk+_(=rZ-^cmH=cW|i%s2K_MFpOvw-xu-=!(r$v`vVHDujyl<#X?hE}}&M z$Q-g}_(C-tiB#Mc+vgxFl_7LY+;tk1ZwdBM{BKyCgn<|t-2x4sk z1jQ9If+J^?j>wilh*h{vGa{yJwz*Mkcwd#-O)fbKO(@t@-g0i?E}}Y6z_vLQ7_2&q z=-8o6`S7J-D6r3z)a-K_@J^cIoix49z2)jZujbgzNQHF$Ap?d1bZSkbI!^ zmZjOr&_C}JPya$=8;t|Ss)R?M-k#u^#`CnZohkR+4Eo#{!o_VK@3;dO%XaTi4XPQ9 z;?WC&`~s1LXhBl_TCaG5Dfl~2#oY2p0xs^^VqN*-z_FgUTyhRY(Mhx8Hg`)C!3D2s z7N1BnSOA0VNx2id0of z8(rRGDhqJKi@bL)x1_dQUcB5R6g94D;Px9ai~e4YUN$t#Gpg_9by!R&bJ{_%sqb-7 z@(K@`dE4@Mt4>Ws#E9QxnLLmq>Mmt5x6jVi_Equ`6qkiIbU#AR8hSk-621Rqml?f) z&zXQP5xD;0^L(kYh0969%-+jfBMtvbIi9)sc>@IIF@s}6XO;dynh=rqaOZK{FFl*| zS_X`RxcReW;d)`P?=Jr%CVJswJ}(7vhHBbrJALjJHFtqenvfa3*Og9mCk`bq2h`(% zHnU0MDA&IwMf|l~`AD5%cuF``?u%Skn$8{S$&+|5I&)<^&4#Ji?d+phku3=?-wzk~ zRc1}z`=~at*QbIAZw`NNJf*i^C_2+8@9{}3!9G4@UU76y<+EK|pdU2AQ$GZZr@Fm3Vi9U*2B1X&!&N6eJ4swGHdenKy>Wzgi7 zziN6@7aqLBhH$0`pVFu;kZ2_n^}m;MVi(_=9s28eva__0=1d7UUHk1P=0`g`ZGO`o zL`(J4i3-$ddg>2qTWFo9WXf9bhd1$u-qN&Y=1glc+XijJ)lD0$}v{Gmn!JxldUMKS4`D;N+M6Rr0_YCK2b$Iyn-c3Xq zMQxnB>ad2ss@FO(b4F6`{Efn4>u!lBV;}3&2$@13eZZ!{+ubVVu1krfsa^ZO8bD@^ z9cy!L?YL(Z_{7~9SuNN4btj)1qQCm`-b?mEsDWX|-7qW42d;|SM0pp^^l5=^HC`g0 z(BOpPNl&w=T&ZNI$Y`QoUPqV4zCp&rBZ2U0o4e1r&I@xJX=AvjOAMIZjdcHJY!_i# z_1Ds+pMFn-UiQ^|SkLhXPN4?JpOQyai^*b}L;oRmn&nJ>S(_!iR|<=Qbd}f(@();y zdD}*+X8XmhuER}U?Fv0~1`G?Krb%M%Q3>#lL@;h5aUER>4aS?AxD`<`kMrdX=E=uL zM-M<7n@KXkH6gG|IPSP}P6j*8e|^at5UfBzc9@1k%Uap9V^|`Of3g>Z4Q#M^p+u7V zX+J-FNXYE2IqN6pd3Xbd2snwpK*6p=V7Q-UpdoQ^JlQ^MfJO%Ak$y3Ma`g?k06w`{ z=IP^ge5{rQ97lR#eT=@8D8R5T_%%sSj4u{1F&I2}JrRI4uk}H?ya~W^U!Q@leoI%W6)avf zm}99j{m$vGPobSZIb}aYsutY(v9@+23)2EUOi7X%MP3~skG4#ZZxMMFx^oK)Kh?6X zDFd)okYHdR+4WP=XRMl2szNnO4$RM%O99}YQ;Quy}{>9FX0PYI1j6z~L4 zyJ`VfPB6KN6tk_{+rX-TjWnye;b=dx=#j(2dta?Sm(5km9^^<-T1qOTd9UkWo-EHH zwu27dK$%~23`Nz|S_3)6DaA9jv-1rY20s&pz5DoaQvuYoF*`BwD!?4niPWwV<+BLm zr?w>NL|WEdA#i_O-J`*b3;P1dcEA6$R^}kz94~eqda~>Rot&8PAMhJhic47#_lNrg zM9b$~&?=#Rf{#wJHU@tyEVqmQM45j~aPX?T4_{xu58Q$98=$)w=ex4d!HTTUikNpk zoF6QV{={OElhdsn94@~u^Zp|(m?QqPD;FXmbGQ`kNBECV(Yw2Q@MOpJ@s={JGmhMj ztle!u_!=y0+cono6}jm)0G(2o&J4GfEfcUvox>q$&r+%csIeppc%X_>Q*t$rcgbZf zBn6WbP0Il!_}&~jzqcUbQ;ch9JxHdOcbjRAtr zqM0h%BH1*+5Mc<}YbP<|UbFNxc= z-^y)L5Om513$D;thLgFK?+FSCra^n{rY-$!BvTAQOM-O3eVkaVOltOyMY!cHYCO5$ zvLS6rviZxO%%*WF4Awz)u)&wSsB9oL8g(ugMb5Z%oC%lnYg*qhKVGQwksNIxa~a!T z*YcBHPIOnl*42VaD{@? z?STL8=`1{6g|#9vleY~2D?sI?j!LZ!Gu>}~txSLV_4bWEWpj%K9m7Np<8`c~$XlTB zkkE-6DwdzmeSyy%o#S6|qcWq#CwY?Ikh9w);AnNP#UByc_eDj;=!`k3Ma_to&)1l5puR$S2(Vn-jd zbS4BmR;p_c56^#UmMMUyPRZtQL_?4hkAv>4J>h)?KlZRguP#sgF)Xe=ll7ImScmL_>u&g)#UaoXcwl3Fw(37Jc zzk=%UT7eR8Nw~){+&6)M=*Fcs&H5s)eqb`f+UDvRM!F!!R*T>F`RhS{p<97Ld8yYF zebWu0veX4;Vf*eYuE4(y$Y|WV0rRhheqfGZJ{@1w;qC#hpds4be;vVeY?#V#hs=Q> z@Lm?0Ua)0;DJdgwB-?q_x9v|Zd&?CJryxI+uJ_^e-k&*jo-&nYnw%)xh7}HZvOm)P zAzxw#OfSPxqxjE;z;Rk9Dn4vighC84Q2Kr-T`H-angS}@#;%C<{vF4h_$DrOy7%7_ z)tho%({kU`+`j?7fn5@kJa@bb`@OsueT{m=-?wVXd=eM3{Q0S!vG@Lz=eYwbZyEv7 zyy8ZR|KoM^)h zjA^lpc4fb$wiLr#-8WLt0{7qw`}0K^mCc&SH(eI}@xIJ@JwD6;WZbOBZg7W>+Q)B(Hmr~8 zNCbrQz$NWp5mtdbuN_DQM^swZlXi5#}6!owwZ6pH>TIv zZgwSr7ivN@%rYxK?j}cs`1Cn4hP;_ig*MQh%593i9?8*O(I)+c;%h_yERZtGZb3`? zv_;#EU{@0{d==0Y9VZPxzGgED5$=NaGMz^LYZKWy`X*ww+NasH`1e8%B;2Sd=I~sM z+vXTZeX=qGkVkBHQg~=oI4ccVtLUY=gr9kztp136O1SJ(4s)gbtlwm6xlF&J>Q`xf z+>c##9M#yvxEOg@n)WrG)!0GrOt`E(W3^IaFhqPpx<>8H?#)$HxqlcoeUW?6x+as; ze%8`*hMGeK&#o!D7(FAlaKu$qy)Iq3b0bZ7IN>$E?CustLksg-fA)Oqv)KS!NCr|Q z@eK0G3-%Fnshr!i65Z5-W2pEz+XUWvgZPSUv~`trG^y-@qU}UPyEK2#V$S{QDSn@rq#4pYBp-HXV$d&p23?{xao?f8A5=|_Z&03(EAF`kiuX&$LrG-RQ99^>be3WLt^Y#)6gHg-7#kg z>2kTPbhhQX;=4LKH+nKZ-?M1I5P;1-@bdmVE2s`TzY?@?tdh=um;3E?;^xA=x=!RQ zRI#jizXJMvc-Hm}?dtpf1mZ`PxU-d6m4BjI^Z!jkKB#g{TmRNYT~O_$Qb?J?W8hWM z{H_HoVD&*AfVqi9JuK!NB+PCF%!Tt0TT{%Ndc&x$S*z2)fq51Uu7mDp865|HTdX8B ztAs6!4w91fuie+0fzP51ZGa9bnryDYXMP8wnQGd+nZA zo)lFd@A!@5Icu(i!z~uSE{s4fxr!Dd=NcH7yK`eO zFSp?(#eDWheZY!*6Iqq#T-$Kp4k~jE2A(zVneAu4v9H~BaBqh}r{kZs0@VoXigGVgr-`JdON#w@NftSe(K-IHRK_HE3(B#BIL@!M4n8HFpCM@Z_hKQ=vWn-tax|rnzy@Shel9H;COW z-^rz&n-p=&-!IxxILyN&*?!+gn&Hx?gm%MzdjLQ-I-_QDzz{>Tcb3ymn1h(_o{7i< z9KSzmU?N}9-7Xc%67cXxCOW0ZT6s6+qN>TZF9>#~k#yg)@+*x=g>ccZTn( ziWfefsNWNE@0uFNRf&Vw=o)WHHJY#&2=gv(z3d_bwcs2n6GP(u(0nbcf~mj*hCdli zyucbec%)7%!&#(Xb26=$P>(qrpUshgt&)G6CD-Gec25$y_YAXwP_JTKU0PT5zDKB5 z_2}EFleW=dZz|*U-jZ5$21Y;h7iV*m2fX^CC{v^4e5k;h$RmgU;j0Q^Tj46Y@NWEV zA#2sJc-()2QDzUVVmG~MJu(k8=XRC81zejv9#X&0_=S6uQpm(&u&`qf$bl~|MB2i5 zkM^O}R;u3A@MIEiLcigg%1j04oodTDL1z1HjQH56FBBXMAs@EGt)!NRMKi+MX9OKK z^gAgo{2Rw#LhDo!mHmU&q=PfH=FcwVj~eX@=6Z(H_Qu^i3s)g?H^%)(fF~1$Mfz?` ztsEX5A!01=`h*D8$zm!XRx&)b)B>KvUor2EPm4?oxbz>WqcH~Tcq_DcV8!wDKSM;u zUK9N*-T2KBH%=lYfn5hA^U6rmQ)ZP1%lbooGeYJyR6xQ%WF7`f*#PqmX4adj^*K$Q z*KC4JSDt{`?3_Zf9d$J)uY>yOGWdaOGwYCoYRtN9RK#Ec+QwfxW-cac@Z263Enr_J zR}#Et_HHZ2atn}V+e-Potu7Bh*`E7Snfyqk7ww(pE61~WY#ls1xtwfwmC>ZGDFv7^ z0pK;0H5q*cY0M| zGWOCkV9JM#O9a3V5PV-Mct$MA?)mf)UshRbR(sRZcuNI1v4%1z*J+InpEXet*j_nr zCkE-+Z^f*%%b})G12c|krf5oNn@!7zHf_wL_lvU4(3O?Pvh4ls1o@2ccw!|f$U4(d zxhLeg&v?zssRRHXoA*|dpcwr4X`IPfEdcSH_vO+KdD^l(Wh5wkst`Nsg(RzUHzq2T zTz?Zo`=n4&q``h;e6^Z_LNyf)t_H|v==@}J<((9J8##k(4@$&)zO}`|?Qw&&eq%b^ zs{!|hei=m9ujQfQNnn@&)WBO50K5;%fuWonf`%x+?i@8n5^kTsjL}hyJl!T z^?>9Lx{1OGcU?^mUD#@6>X!o?rI66@66^31&#P*)qK-jYZuDEmZn>U$O4_d%Y8LvzKBMM@SfWRNXrEErz}uizNH&~) zj0Deni7uGu_Sk67u3r^}Pl#}eKSo@F#V5ot=w+74Mv{rOvf#5k{tu+hM$(R*+zdl0 z%$&I3T~~}aMcU5zMWjOE;tJa+oNhc_FI8~5dTsJPR^q3a*EGw*`}&c0hDGs z#Z$%n)c7VaVzBcj)E+4WdK;wmhObzb-$SuXXQ$fbxp}`P+`W_^u91O7I+;rx%s(bV zo=|ldAJUJfCw}Ln0&(eA$$X}{!w#Z!J@Zp^7|Z5I8jeP+EnjH zvnJ|niESJjXI?|KFFBs_{1uQgW&ez!`pfkebL8o&56gzp;Gk-zQ=!2G)5Gli0-v)0!p z7|jd4^xswLHm*K0@+BF`-I zRQTnvHZ??vT^^69%yzuc?mQBsC9>;b({Y8hJoWRrCd?TTtXUNN8%7o{+_U@5KiJbI zyNI2}V05{eh$fqicJ7vq2v^!EBPFmafiTcl$DC1~{{t=gV~u`D*_cATTYERD*0i(F zt}lO#Pu$_zMK);-F}c^FMVecOK7tL zJG1g!k5;F~9aV)Le6?tR_M37L4Ve%Fh*O56F25iKGx)g}Z}V)ww&OfU5}ryz&oBz` z&b?3qL3ax5z}I5JHTNFyf-)05F#ht+3r)+%!w5pMBA zZ?G5V&oXonHH5kc6%niz@_+^;1W*=loUgbKII9&1$TiogvOZc^P7`+#Svu0)x`!Vx zQ5_a9xN(K{0XWr@`a%cCV=Ymx>H7OKFzU$yCE)#d+B(Rfs0JJUlxnQ{&9{w?Vx7;n z>xbfX!vK0PzV2Z?iKI^Pcjy;_z|8f(bhDjQuIt=DeKmFg@HUued!IrUw#qdoQ#Wbr z&#dcnO=zs9qR-8SWMb^S!KDt)r!y76>+&>UHU)KH8&_4-(a8g{aP*g~IiUj3W&9m* zfT+1JMZ4FQuRg%e1)bYcs;rfrw+r%l){c1+B&(O2F2A32)#k0jj+(D>f!@~GCI(#1 zaF#1u$g5r{gUOF(lxI{g%CxS(byqk&40#Rd?9=b7XD-z#Ze7IC$JlFI6~=LGONP&~ zstD8_1mr_t>yY5@-^dehwe7scgEzJW8F4^k`u}$)zW7c|1j^SLUi z%=~%PwPzd=* zz_sWWlAQ_C>^5*nrqHvg$Tid}eq}Z=ifxWi8iuWJV(Q_kB3sYx)`6ER%9_RsJZ>v& zrI+AxDOUILq%T;Q->FGZdNEHmYm$1f8uD;zcKQ5y!|%(wT{_D=5R=8yg%dH^4Dul+ zn3Xg2=3>|wfk8LPUHc|&HdpGZ%(t$%{P*sg1R9-b!t8TUR`Zx}k@k%Kbj=DX`0?xt zN2$eg>NhP%-m(sZL%(FRr;5$e98j?tg)`ba`pNvgGj0u_ZrQ@UCK=I^XJmGs+(CsU zKgFrLs74?>w{!i`mF2az>+~aAQXBl@4!_X)4f*$HFdR2&K+}?N$cTweJ6}NG?J$IQ z*i-xbq|AIULdVT_1V5t->;*p*(5d7tBHZTN>@BhLQdMkBGv4q82id}2Zp>D4zU4P% zgO82GU9%LUHPC~K8?tLm9^bjhW(Zm0m~%aVoHP}e4m@X>OaWj3`rJ#{k(~$0XCSf8 zv6c0lDytS5uzHEMw7hCLaOt%^HL2l`0u2G(&GSip^f{u`vLqc6#)y5tmHhaN%Yw|D z=jcj!4~^OM(WWq3XFTG?xEsX0kW`xCiXfU>5RD}hn^oes($nKRqJ*APs^EV>a*;Zp z51H5KHTI*}9#D1`+6VAhH=lhbT`wG-3|gr<0W<0sKStc?8_h}%(l0@DT z){nH(&GW}a&ebsQwvHVrW`n6aiUQ58D)hZt9Tdc5TNTRHDh0ryc!X`(R_*t@l+ zmB+C6%HIsIxsXe^yRhcHlNGlML)}uLsr1Y#ZNV{5yBSgzd=L_3%kU zBv}~djAF)%4*u)|dSW(E3OSf=Yb?(Wf3Ib`p-_MB;0OL)`UFvys5ET^atv{|D7b+7 zN9J_^2*>qq2a(5<@+RK?xWfIRBDgfLbgi<1+|jpC7z;1V zelnbOgCQnG_;cZwcG<$pgFEg`($b6V2AqM#jdR=O9xux($!d=*0(-#1wzq}eNkOW2 zbQ+sh{fy2^RZY5_P83+N^^4bh=c^SY(i3>;8QeuK8&m;00d}ge2=Sr1q}6nwz_H$e z%q+UJ>2F_v5e^5g!35Hlwnca|5TE5ar>5+a6R%IB-^ljIJekd2!+?H@OQaI^El+H` zXJ=hmW_G6vap?xDX5M`QzQcSz136X$L3QPp4{v)Exi;`w=p~SNofgeE3Nk;fU&EF3 zvzOC;H5TyO;H2EsYsojJ2nAZaNG+6LZJ-Z%y})7?p~M3Tdnzn`=mXRTxYRftc>L(5OAo6FgsmQU zMb#Mma?T;w^Q243XsJgEKb+c80cvWk72EGn*oQBk_Gwt6 z6)`BSXE}Z>o1r3D+M~VL!yWYb(yL7P9oVrP@R~yfw zdCcQGuA`So^?t?OzvCRgfcdUxKvB7lilD;Zibdo>;dv9kwR_fCt|=Q&)w0~xPdd1c zgdX4q?W%F=x*mnT5D3grLFkJ3zPy~`-R4&67D3(_@mJdHO#s3Wr>N0s>=KIG3wH}y z9D%?pGV;r<$@j@2*W`twINT07#=>D`(k}L2=T3TvwwqEjlL-A@wK&3iuOZZu59d5S zbKe8#|EHNr!_g%QNZ}0J@-?TJ@U-%bbmS$m3UImez6L%4!ur_fDqm+qtFr z!)ZU!-Jkh%^hyA73f7;vvtEt!kg!v;Su%Dfm?mQ8 z4xJ`R&51C81ppDiP(cFt{rIs`*6&HjTa}X+WBCrsS+x-I$%*`DXZ%o105Ifu5s*!$ z@IdhbC{;W}3InRH>1(eOP3F3mB$+X7Wh9!w(~meTsgn(_ECFA5a;-Rz!KAB&Q#@&`~ zU7;JtaHwYk&&~Xe33r?l%o6q00kkE#vT!;ufrTGD2>(Gn9{zZ`4$Z;y{+UF!*FpsI z^=QbQ-0KEa@~EKGuk`2*&q8axWPnPL;#*f!vm zI)|(Vwr)V7jP++yqOA^pBlV0HNEEARJ+iG9__gLWJS5e zQF6#l!}EF0VjpDGJ>hMo)VUGqHZk zWbUv3E$Gt-Ty3s4Q16v>i8zG6{L1&<)$#K^|0ZIhw2Nd_R>(TNM(>Q!t0kCnj*Ks# zw8^gmsWX!I>u)QZ8(7Ns;%}PRi~*<_r~#}TS@}y)3^9cR`y~T1^6oG=-M6sj%nm_HNRCw#lOYYUz(U#CqDaQd4S)@Ol2z@w=sy)-Y_>X5^i9 zj_G0Wlps)*T6Zw%F5N6vcu-h;4G`~8?UtSkeq+UaxxiE=Km$#)oY9U%Rty;iOlW{h z03}4srFjVur^6zGfu86R+1?8gQjmODv_so(>Q5!X3n@ugIsp2wWO59{4W>VHX|5iVzji z?F;SRW$L**?QJnYVRtyq?UZ#C{tFF|oe*VXL136zW{1nZ95j_x&i7kJra{>=i-a16 zJ3@x1JA6*mo8=zVSMgHIC49@5Bk=jH2!sQlpu>0x2w7}4@zwax%qc*L=Umq1Y1ZGx zY^N7~IQyDq)kyqtYDjARE&9pvVP(&ornO6G+h=!z>g|E5N^qM^+2Hxv;ej=FwuTv1 zlTNVfY+9AP)i5tuK72cjqj|t(JJO`kIuYTZSj8!i_XXrUAbns9?IW-qbe&e*=)PdP zxlQLIZ`zf8KU$~cYAOv5-D{OdC)>G7ooB-!#t)`u)5#XkXT*#WT$);efeZt0RUEod zxDfV4kA>+kGnixbB0wJ#`yHT*hEi2A4_urf1kBT~*W4E00kN@?vm#J{ejd;ta zM0q+tz!1M;-d>xxX}H`tuHlIoySB#gl@T#o-S0*oMY+_dE7;m)rD zy_JYdvjy!&Lvczmh1Ej6t+gSZdUu6b4II95Uvc#Py#dsURmxdtK-{FXKlFnW>05^M zJKOT5O0G^P3z~DXetF@%UZeUXr&{c;z<1KFQR)Rl!u4S#I#0c@xq;YGGR@<@Y6qO8_5F@XZ%-4>G% z@hGwZbh9r~34Qw6aJqJ0I+#l^z>g12KkwyP*!ti4&?agkybmbA4~Mf9$zrON6TX{$C%USQxKeld8u;S)ud;= zAfwhc5zNOGFY`8am%b1J!@=wxzw>J!Qa*{qpnr-}4F<4@zN@DzEjQH@Ciy=0s0N^$ zqc?kwj$_(#+4?1XA--mCzg zNs_984*%d}U_}1y%HyqZEJtH56&PWH@Okok#htcKKbO7N*gKMWC-f;8f`(|qHiwH# z-zB1gw0dBr8dqPZz3L{84|g3XF!*ZIwrCqWneo`CJ@a!2rt&w#C2aCXzhSeEfSEj_ z>-pMBm8QD6^p-k8yEmZ-zfPCOuSSP3eg%cMAGy$J=8QjQacrP>DEH?UqCO~7Pd#mx z+CvDq_UM(yk4AO0^~%tSx7T!%(ZK@zK6$?Ni5wLMT8k=@kfh)5hTa)D z$J1*vk?@jp3Rj$aj=S5K3thN}cS6oIH^X&+isz~2U&b1uN@_@y?D zhM8+!H8e|^XFf5E-co|V5;b{>eO#q}>M5y_2LBRXq{?^IBHVF*w_duuhh&Wv40dS~ zP{TgD(E$LyfZ2*iv${|@UAXNVuYEGxGY9M77z4Ndw&n2Unc;BlB@VotK)5MNAaH`% zgVx{_Ul(Ud6%V4iJ|^+pS2%z%F29)1ECEGa>-t7e2d`Mc*&MgLp`uh-0E{i|&)|Ds zrF!Caz)qvYI;<4jNe|6iRlh-IL73eXl-eFt_WJ#8I_*-Jx60nf4)aa0ylmT1WIJ-+ z*>$pxMlc)_%UltcBXS1&)8L;au4ms~0L?(?5KCbHF>dkFsS)BEtm+G}H6J>WErpjl5yX_()f zJ9>idaJ2K5{-U@OZ+s@u5zV~NaKUokwM74$TWWWa^sMf9=dCvemo}DHFOGr;^ zRM-w+$tg+6JR>yVJ%8!spY2}YpMnuS+s+j0vQ+_6=Ua?(zx(TVRkCxeFg_{RQ?gt8 zxcwii=KpvD!)T4? zM_JB8iwNszY_&W@MTkLeZc_S$W=uZ`_2lKwrJi?E@2{NVcUcH1W2@6>%5&%i?qYG< zbGB~Noj2f&7-(pJG*`Fk<9GWFd&ZX=lQ=c}nkVB`6^5u4!#Q8N^08NbEd33>lEmmf z$sZGB#fP6j3!7N%k}&~uf%@*)fI>Iya=%RzJN?k0`9HdNYshAq@%e`T+{GCBd?)=p zRg37K7Vp3?$%pcD57%6_)=WI2qcynTvj+qUgXbwvzJ%gv=fq#v83nG*U8p1HI@p<^ zg3tIGfxoZfa(uSL+-K%qbbq|2sp-=BQo{iwQSI^A6c{O0zf6dIs#@mi7Pp$_3%kqW zo9a31FLdgevg&w!)(9QD>{yLp9;husI#elle)LwuUA|z3?=0?=QQRBQB!}62B`n0mSjmM?qs1kV5 z?W7Qc*V~yM4N%P|%D9sTrc5cE_TdvF%Z|qDcHaqC^`%*+G>qqUvFaz^m$|VDW=?U89;?@9ZK8+us^{HOB%I^|`MzRO zJDi^2Sao{KIR;i*)gK2bO+}WsV|3|PjvBwQOguwJ9a^rvj{X|{%pTlm5S)DdrHJqaWoWBl{GrP3( z48Q#5=Qc56v5{nhX=hijBRv-=&g`xS{A~%_wcj3c0{M>*aW>?O_dlC` zw>3|WyT_2nX*hoP5u+!C$60|5RtG@6W+syW#?|$$W^xbm$|y{OB5rz0aTwrAyrk^r z@`2Lgu$}IT10CRMsz92Tn%?yE&)R1t1Ha_B0+=G9aLsMmSj=Z8U&*gF)gEIp#x=eM zQog-yTfZOw}`r>_5Qvz|3w=9qiv_Y_?K($Z+QE^@!6m8vMIzdv^O)~mDX36spK>=*wkTW#}+O$gH39A8#b>+qA1 zq6%J8ATH7Zch;(@_|HQ9f7jvv|Fcftp)3C-DabSqNVP4jl7+=BMmdI1jcv)MZSK>O zrTB+WF!F=?+wg<{0rF3$LQ5S&pt*>|#Oe!vPuC+omFGFS36XjUK1aLKB*2%OoP$Kd z@pGZfJ;HD6>;NYxx=(uAPqILZ=2kKvStR@Hg-vUB@?#R|6FRf`aa^z;)YODy<>VCV z&`Zb+Kj)8%T42b zNH_1iMD1JxO6`c9;ZXVQUBe!)o!(WlM3IC#lf?SNb@&YTIu{w!KqQM`?L}-%n~f5v zVeyDGSijBXz^}J`kC3)JQA7tMW7jHEp88|2!(&(o{a^d#ja{36PY6uAWJHuGBKJI> zZzVP*KTu1s+*rM6@5QATyzf(AwhgY5DFH}3ECh*}mx@YzzrIP~d6qc>_)NtYo1a6I zhA?NRse@eo0VkIl?=Q%BUVqErU=*%rSbZUBd*#z96|lI@tq;kGJSeOWSwy)|t!t{) z9@20vhIXh;zzHJldKxlZLV_EFV~!9iOkhwxUMVKKSBuW)_g} zkF)T0+O_{STHe^DSJN$i3B2%hzMNtNE5Y#4zeTE39*wM6?`AR@4fq!uwE!TITM3MVV~! zX3>Wv5ZDLjwh3Owvqo}>=y(2N*~Jce#-;mEo4$aFhjWFxiJtwB>^@dFEzcw%zVj>e z0i>MDKBc&1Z+)&x@!<1N|=q>H14D2Oi*v+YQ5u9N@LfUcjZSN{ZEJE zn!*FqhdBYM{cq{iwy&lxhWcKsjPqb+-#g3Nu@X;4s0L3QjmSGpcHQQ2Wi&Y>{PN@& zVX~S!Z;FtTEq*}*QUNM``r(@%tGZLqJvLL#JXI&Rz=F}~6V@mQq`;`ylc#dU<}{%E z`%;R6KVb~lKY*?+iftM2RpA%DvQS-g*H?`SP-R*5)Hcm1uZ0oyf z-!oQcTLuZ7_9*hL4{RV%A|bY}Qvt%_xS3xqgMz^TkKQotomS~Nw-Ra(38Q$B4VS&2zq2KMQZ{A4-fvRS&!d#?S- z#RNP4Z56CoWu@l2R=!Y(MKJFBdGb0Rz1O}p+E+VkpEg%t=^@>(usmh~8sxK;%q-mX zOFjs(j~cae?2&}rx5E^u|BbwhW&e)G5;8zY{1(vmFooERDwpi1OZZ;)avh7p{w>5| za&E!tRv|WMy5=5GvGzWsmVKW8$70x3$r&BkHz1m)n^?6Ewf#yzu3HJzjTg5Y+{3;# z`G=eJGXmk^k%k&KLZ>%jK050QWGb%w-Ruqoi-W8L#|x9#bV8w7 zhw$VJ1QAOuz00;=YkN2;25zejS)(m{?zln{a6VBv`}%1XjS(UJ%rrx4b$>o|qWgwM^Y1 zWd%$)n?L5^s^Ft`UEdDP$08!2%mU?}9NluXKc2g^J1xpZt+F!+gQ9q%O%`QdmEAP- z53i@hPgLM2X#<)0J%=;M?@8o}oqtRZ)LJ4|E7FA>|#k{YlM6EVht}(k% z@7a;Hdzw6dzwpVypNwVTY`396bNYR~N!wsqu$dKk;J!4v3cc+BScuK~A-ALSd;b%@Vu8X)}E%{xm z3#(6+x4sE+79e6K3px2+s}{F!8^*Y8`@1rcY$F=Y>J>`r?eq))-()czujPJ>#T-9i zP(w1A+Kp@(wnP&#a>DwgDvPkhJA0xMXIGWMp$PZm^ROyA@kkc;Q;E54OrTkt%aUoY zxbzLN0~d(BJ9{L>0SDMm@OQT_6t3}`b8df)X>7WNYIhd;;8yCSUZBaaWX{ZHk(Ju8 z;IG4i2A4~~)l1+$rt4Q;y(!0oT?stb>Fqy%Zti^21rq)T^kM$P6#ni!83~!m7jQ#< zzU*=Er}$&KI@5_jWn42z^38WxymtiFpdCtoD@6nt`glfEY9DTgoOHlWj;%LqsM860 zKL(uogA@7ml6GFFjt@Uxy(eKoCvM6HN4_@!ELRtTmPq~NX5yn0c_9&xOK^k%|8>=x zn33+<>|zhA@eUmL20ZgRkS#&|a&#=w5?0T4zou8)uoobg;#@A}<>vE4B4DYkC|r^E zqdVbUoRp&ifbjKB#HBN+_W3E5Ali&3+cY|v|K2l`8;r&ETVG$=r-XxZOuff|Wo+G$ zQN9RWNnACY-`YTg9>4Cv89Og@*xnS;c~Jgg*Ged$Ip}`NlSj2FIPSL{%0|)Fn_7|g zgtny9*&W8m2)56CrUkzv>0=dJ#!Xtj2nN6D(%?ooEjCI;v9Rf^n2s65k4z$kjvaK9 zqZ892yUN++p>zWF;T0A=)2M{Xw{1u2Eelp=08Z$v~wA%Ucj8WO?JruZnm ztUrjJHfLc^oDx7yw^9X0IMj4KFOA9OkY|Pfo=;d%ZYbDjcbb!VqSo|8zqdWl;^_U# znjUTU$xaI0QWMTt81Rl{7j;mHa@jM5UwJBo(jgk40Miu}YNkhVURga3catBi-z=MA z3eZ6ckAF@(vrq&znEw8;@&e*t(7P>Zi9b(hut=<@Ss)faHN)|8qmmxSt4A`8f&)5; ziKd>v- z`Y6^0Faj?!HY{ltA&NJDkA@x_V4i2GfN}*Bb*88VPdlbvkpv@bJL&XM<>7H_Td3ug zvlsTDC8eK#JWngZzB*~XD6=_deEk$d$J2_6_9~1S{^WZDwJQK^>&?VB9bJW2CXW*c zPd&zrE2v_7U*O-Hx)Nu@<&=$r&-qbD{HT-C^od#}dd0uHXi@NAUI!V8TW|RNxfg*K zPx-s)JLT9OP6u#gbpUI2p)7y3>hPd1{=Lp76#z#`rK5Bc!ewrPZ2x)z))_~W{nWo; zJw~f)u2ajN3Ijedg6IQ>0J<+&opN{Ey_t<@SMGj;zKcg0sqo{`E zUQAR+h6J8Y?fmx@JXFH&kN<$|Xd~2ob++%j zO?t!&^pUzsrX@QySV>P5gqd{=pey0CS!dFTWF9(t-L^j(nx)s%m|%MJ<9oyUCVz^5 z(x7aTK0OaX!qmP?pxijN51}^If6*Rv>9j-WfTI@Y4+?X#Zg@Pk!tqe5KugDY=O^nh zy!uF|b=mVl%naBTHLM2gtHaSfwAFKg(x>=gh@ zEHg&4eOLj2d!Aa#()z`ZzE_q2*vd{PSwK&R21~N+HuXOc0deQ!_=^aY{hcoXXI7T_ zCw~QDQcV~qR&FYuKZIh#SEt7JwhL221#s>7nDdWlFuqRI;C zKktIO*puPPGU<<3dCIG7))~j>RKF-APvx6_k9KMwDZLmkj}yhuHVy=!x@gG_3Gj`j z{ff|dF(r|rU*&mz9^}>qQS_5VUG8jWpA9Z5NbCw~DyDxSV=VjM<$qjzqgY55sTUwWTgBv*Y z38O&iC^mcDJ6$OU_;W_e_zY4dheBzmmo|k9W7Ht(VvW!7-b&b*j>hU)csJW$SDYWs zzkKVV3cdQjK!&Fi2H(t1m6p8d3(s z08!Jk1k%ZaXuuRVO`sX$BGu$q62y$BvZ};qge#+bxoAV*b~*ye&f zWt7KiVy1z7khlqZHZ+ol(d$)h@0}@spFL6hNYEjS#D7NHqJZipBU;02>z9JT zb-*F7p{JV|fe2`t>EE;KZ4b}(JgQgPO}+CP)P!n(XNEQvNeS{Hr+tL`p;2lF*Wa57 zf7oy901nqnEgRu@{}GHthGZoswB;Ip-~p^#{4DiMOUK1E@94$M^%C5;zevsnEuAPB z3L{_X^SiuQy3XhR*c4-_JpS2s!sSp{v=S~Q=REwik3Ab6#HEZVIlhUHl(Jb}n@Rc1j(dSqZ1P3a~c@{leQK6rn0+|RbJ_Qe;9f_3jjIL-)07_n| z#;H&-w{9nZd;nPbfHeHrfG1)67s5777yagxAj&MUFgBaDMn65V3KZ#n-9&!!-|XQ} zOwOpA3W{=vq%mM;)0Y%7D$e1xjIqjC!$3fY`6=Lwo;x(50YW5!bil5aN)#NiS_s?z zD|5A4{c9G~XD_hFvRp=}q;_{f&@)?NG&3K?upQvxYeK%#_|jMNL!;0X=)T~^hHN)u zIhverH{dlW57uJ%_>pnas9hwGan2ZOK@;sxg!ddPN=UID%_0rMHv9vb#> zB%QLH!Q7jBE|Q`xfva6;+)df=L}iPI?1C*?wI+$*(>#%WUTkJYOQeY;U7x56v^P5} z2TqYwGu4G}&bh|79;WV8Hbi0ErD(4SrXoEmemAGjUrvTZ(Dc8+|Hx;^jmuCEEMItt zenKN)zY}D9aO1nUG-PDj)gTcnhWBZ8Euwo{``dY^++y)lD6U8W@DfH#Is#|VM!)6@ zc^2I^y0<+&*Jo9jtIqQ|zMBJx^hOp|Tr>>YidOgM%7$0my*(U{Ttp$#M)qP^GZIe) zdCU`f82jpLVo3}Vt9(~hGPfpJ=-W4cq|9mhDefz7ZQ<4`#fGj$H@pK_M@17H#tj_BO9?n?j;B}-*uxW;Ay*VtFmP(w*zel@q{(T~NgeCEs$vM3I&{y$^1H-J`XnAlJm zHd=0@A5~N1?S)SI4~R`}e21v9*Bs1Bu1yl5IXsQfvyaPfHcDOwi+FH5LM(a!?-&9t zn_aY>EE>Iar6_CWG*hcsD32$(N`iJ+e2Zf~NCo8Ys#Ioja|oU>ySP7lnhP^@@BGk1 z`P6t%G21ix;=8ni1NIbnc%XomO#R?r-mA#~NeE|z%N^(p6z3a%=|aS8w^$2JGfz#f zW~t!sHAyyndMd1tDoBOqE(+u{cc0xmHWt|mTwz{*EBz@h_|;^c)%U2=K;ys19lSJt zHyGUgyFGzQ4m*_syxTC#n*R9@vEx!!#iqfsqjmEz*5oWK#7ix5kOQa}~D4$bS}bTIpeKxo!xVlg$zCFy`axZ+pcW zd2a;Zo3DQ^9HYr3Ws{;T&;fWjp0Q=?)vNLHeKN#Y4)EumlLFfhAVD_$Nb_8n^upCS zIR$8bjn1?a^_oKw$Dy-!ILxI<;29jqXc(Kd_m;qqe^*eY&I)%_QkJ7Gt{0aS%?!wk zeASLvmhE72v`%%I@t}0xGoUr_+Xof6lx>o@?uz&n1T#Wxs;H|H`Q|I4HUqH3$T8Dt z0zyNKatn}9P{I)(x8L|3JGAZ9TDUL75)(OY#g=EkkZHcq1dodhqCHdxY${XT3hDGP2It4Tp%^eI;Nc(8A^)+1i;`tf_`w=X}iOimjX&^KLQyV#4Jb_+xG>9gVOzn@bn^j9D28(Ik6#L=7|7%JRqex*VWlSc4Bll<9pnz6PPPL5yvIrkz10V)m9eXf3X5(uS<8E~2@V2saayn=FelEq-N2<)`bem70@ zXK8UsX%)e-!hGG1Shq==XNyIX4CA`lgiH5TE0KWc8Ajgx3Y;u#TcbpEvZ_4DwZ`i7 z&V(bp+1qi3L~ifE!{WbJiUPGV%S|_2TDw2yZzc3_MyhM|@V`Om44g(oK7P*74?gai z8uG4Ux7W3NvbgPEPpt-cnCw!JChP?WU*E$BjIr!dSju7VL7sa&EhnrbGdPMGjX`R- zLHGd0Jjop!tr}6mSFUOjsc=vB;+Lg;m&@jx-jmWSq_O1S{}u@F4=w@DJ}R2UAnU|sK zf-J>S)5upP7YjI@27TMKd@M@wOTDe)&v%H$A*BPAws^*1l$v1Oozi+c8^AZ84*~R= zmn>W~T%@#yrvr(J30W?!qXxk;1&OKQdhpL)?3~m#0~ly&Q$BQG{#R^6h%`B3Kj(cyyzaw;s$)pio`zun5jL4jv(`>U3+r{-6#4 zdy+&+u&-ql&qj@CizUZw`aWT?`tGCtZ8x^JfVHi zP5kWQgF=>dqJ#faBXn(G7lCqP>rHQ?6YUq9)fCseTz2h3m6JK}^=&gNkinASX!2>J znOU0vlBPr>Og%pdKa1B*a!xA*l+|z4!gHCvYD-xrx#oJcc8DX()=8hU(trHVIu$6Q5>%rczVvPkp+kC1}E2czC zhgxR?``;F*>kM39zSip~U!Y}Xn-6$%)9VZuRwX{Xk}L+TIPpB*tO>h^*V1K`iGl0C7G&-*{=+#Dg4A zF#L|-q{-O=;eS5ZFJS!!(w!qdhXco&%Z52wNlE$T@?VpC1GfIf&Y3_02T%a)brfOK z<^nk|xP>$w6@wj0&zTEE$;}iD-;2Zt$&8kQ!zOM#WN>dgV8=_)iaeA^nVn#<{ZeC` zUJD5<=s06c_<@~_k6bJr-%nh4Qcw^he=NupZWN);&@HdoIsgM5)fKB116_ud{ciJD zS-!q{>7_%QX+>95?qeMe>gGsGm)ZH_aD+QP`b3eM%}6t@Vrq3lBxujXe^@}8+I3mx zl`tQW?62C$WjZ9DH)cGFU?FkfZ$v8)2qAZoydF)d? zkuJk#6?ut?(YhRMKUWS?4`^S7htbzr5gwZYH(yqFJ_cdVvZOT`>0fMr^jQ1;7j(re$3Ed>fo^DNp>svmWW{XE#baQfzd0a}{L`uJ=l^zU z<4(am6+Y)A8p*C#&`R=Yet4%!+++X7#XkOm5ophLlo^P_dYz_tVvqWY&b!gaC%hMT z0M~xQqJs4K&5FqmHu0lSCcu_bL>PidxNfms?57d%1qE=aAzY3Id;KwEL-)Go)8nT= zmo>a%KcRf0!7=$_k%hoBY8xMjbG-P>2`Z9YXZ7Fkz(Oww6z9A=MrS+lIn6hNV`J^o z7gi0bbn<6yRrM%*;8Eswny0(R-qgjeV`<=p)cL$I0RmT@{xM-UdjN)h)CeCOq&(l1 zK9RJvv^tIrykOAGQvKmQqmdo=?Iw4-0cu2vM)c6aXZ1p=)hoE3Ck{LMb)uG1ezL2;w# z`rfUrX)|)7lll&j5#i1oPM{N-u8|R9NxY>+){g)&UjE&6_fc#ZGIjO3^cnx$@g}yk z^3W^w;+$}Lv;nH`O%%+*p5Oqdk2p4E2{}iQa>e1GzmxF3u)jFC)#^VDL>5^IApeWA z*Ay@R2QH`TFTeWQwM9hKV%TBRqm{Z7+Pdw7#E}u7cI8|0d&&U zYko9y7lW(YXm%hJzm+1<$y_A9GwNmC$+4{ z`FW}Vx3#&0@>Mo^c4L{p-cF+j?rb}IR{S+FauL ztC#_wd2KOl@{`W_m)iRW9ToV*5Y%fZHENtMJVu^a6es@|$Gs^0&2tP%L>#T+e zhoGf@cihj1A!6uYq<3Pmur#O0aWU6U>%!c<585I9xnInIAA2}@t}DeaK7#3+9m*m9 zp4~pGBZyzfn#?BazzEHK0IirEZNBGkDZjhtqmzWe)5}4Us~R$x;#tV=0h>uvoaWag ztBh2WQetjY#!tRmUF#V+WCuzj>nZW7dlc&HR+B59Z7%j|TXakO2BHv1Qxe%&w%~kBA!Pa$f;L1Fgx(pV2+QQkhS~@4i%s{C z*h%ZuS=rEJ4C)9Oi+&TYEVFAERe(?nx?SNhU?BR z= znBgeR1+POsd_?78U-Bb4Fp4?UmE6#Nhm55cGMj63z30|FxaJn~*#{LiIIv$ezwR)( z7#41=VByJcxv;H(n5vo(^;6DYC?(@x>7q@%*#A79;t$XElNr`Gz5!$IF6w&`9$R3I=Mw5%-Z z!b-J2yRGo{sx9-)Tg|YH4Jt1ov2@Rx| zS^?EUcHBkngh~d{gQPc7XoPgE9@UQ-`_sP=(2SN?Ub5)M#%{5nSi7+czuqD5(2_XH z{G%|sj`W|1_DrXlY<~zy6b1^s#`~y9n7DM;xaQPjZ1PSfc{(a~_(~q`t5+_6a8w0v zky~oi@p#9h2H>%ff9Jaf2se3&pK=D-oc8T}e#1j)6a&ZhY0{bYo*1ocE>o+p9!FiT z98{GKnTJhZUdbSeR@cs08CkI0(p;OY`c1FfWVl|6WfTRv&gJyI&uo=VZy*9TO$E?i zzR|u4x4%Zori~Df3Y<=j()%IHk&KEBHjT56du!<_Cr9?HI-X!R6O?7Wh5q?f31uLl zGyAUBKFVyb{^Y@ue`d7{(WWcplsi|bvERhxW)$K+&f6c3Vtsd#UV?rE;aEwRT-ix; zP)l8AUvJDgf-)Pgvesepnk;F%R6_QDEZ6UXvJ24Nqrj)$RUUb5^Q=p5Ip^~$iD}r%as3U6LjY+5DTXAJAiXcO>0sAvqE!x}YcS>AXy-S-0yGC@L#XvMD*dV+^?dfB zKHxra#kTohTzR@Q?hEYp0XfHB)hFpMl}GkbcToN(rNjX6$5xq^=5V3LdE=Fu_l^DY z-g~xsCP(#^78O@(H_1T5#@+N)h-UEgF;dmipqkp~@r`cz%_Z7&P)If-<$vODx?#<&UoVuIla%?%syZ9BbI<>&mjRZ@I4AzgYr47$9}p`@Y$cP zQuA2)G5*NynS72^a3drCGLWmwzWThNEXqF8?NP3ZM~<}`Vw&Q9Y0m&_K57u9%iiZB zn{}akbOB=pjvAfS%~qwO;-Yf<56d*DEm*zcAs;&cxek!-_>aR8z|8p1_@4xmW zvdM))&9@)<8Mx=sA0vQkPNn;~&kC{c6vR~J2Xaycg zIZMaITssnTA#;>O5Kc=Us!vmS;bqK3q-<&T0d#@osI%v~Ul&l2zN5>n)L70SH#Wcw zplg7@w-j}@5#}Xkj2+tH1hNr_S*?JZab@9fdy(hF7%AHR2h^?KReeS)H5DBThIhqo zW%ayLd5=SI3CcfkWz1pS-XLaD7+${+P+C_|VY_5n zDPPMNnyqD}g`S%fVmT>W(Mkn+u3su8A6NzS7%5?nbrMNTwbs^u(eRGijcz8c^;B?B zQ700KhfOH7h=ths(jbNVD#qmpF6Tg+QV>h!d}BGLP^isyNhHiY4{hF9C))q;SW>s`+P)`UL4<2lzynmn9l|+*4-H;)YwX)rOsQ_IMsFyPaNV@2h&WvRl1;l z75{`#m|IIy|2V^wfj5VH-Sp2{0HRq9MtSA;n;Rhed65WWWeqQ%)0Ye*HSX@G8zQ8TK|}tLE1B)>gk5RuDA;I z6fW5O3gN^09Wt7l=X+UDK2?Z-rZ+*o&xfs;jU;tV&{k1Y`=R#PKgcx4$KPLF#o(vb zomNy5=TR>^CnRg)t*alPbHvKll2$>?Hq)WzQT#WsgFOrO0AG$y$mDx1Y3x(K8ZRqq zvn0Yvn%BU-cf_4%oHLXUXv-dXYod*0bOT>renm^sF6W#hwX9hD6fCLrZKLe?_edmWZp1obugHR83^%X)F^A!^q zX_%kj+~>>4-79=^eT|>s3?Wmr)wgbmU9c5}5v5D#?~hWXj0Jl5$atM}Yt}3;8ftVZ zH`O$YFG)CA7O%gfM@6mh-P=X%)2izQu* zWlezQZM=I6Ze)sB7^UZ2+t=n!%u z-yVKZ?&Z+KZ>c5O*kcdsNGJCgq^{t2_wxoF{_u@!Ci=cV^Ed0@!|9&Wr#h!Rsg;^v zA*csrH~w0@8WuZaJUbbfJn$zIZz^vxogl%+vKz_?g$qZsAM(}d^+ndpX_ZCnE7|8M zH~?%~k3vx;&U=d??P=~1RvXx&qN)rpOOC=IOtf!#nt3w*-t3^ElnKIFWr6 z3g0!C9dLoqvUv5FNoC*S*A8{hnrkEkU&bG^E>8ZwbnAY~o|Y0bVc4A~O@haNOu{)C z@RQ~grg~ZY-GdOoFXug|n1)Wfih6Ko9B&%aFJrSgshk#`dVga^Sk8QxaO!6=O<4G8 z$zic=)^O_yL8v(Iq414RD-lKDbE3}dKUBuz+kdjfY;x%!Sc~7|o095&)jg52OHASk zgsT_jh}|NT!nuZrimj!R1I% zf6=vq1-8V+E2-s|m{D_KCA;l8B+R|A6C{1JDdbcydTFm;w4#@dd++ZMg3AG_+l{o- zA|zwNvQ{klEJQ7VF~OVHR8di>vw?X}VG0lDf&Jo)t79#W?w%T`DBzMfc0GP|CZP!lRd3jRVG_kyA2EZ@e0a z&tcU-xqNR+;bE~+!s0viwUQthR0Q}9Bo)&w^IvO}GVyPI=MScrqf>$TlbB@eJg`|Z z#kf`7)6KQF;dAPNEgN~&+P?^~Kdq2z6jJ%={gZ)wKHp-~BxRW<%T_juJMe-L7O+kH zL~=&tn;&OK`pH}>B=S((srOLXPkdB)+5AQ*#r)Da=z;++(Ds7|o%@}Z-185j08J2# z9`iF&4-sF`>RT!6MZ8;Pw+rRR>t(LOQj1;qg}}$;R)^?|_i-j-Hdn}MFxuz#gfJ1@ zy2|)wTcL%&v3;>~#F|I`?Nxic>Gj_Ai64?;I_kOdsqV?dZ3F2It-c%C$m-4Ro^CAQ zOYdyTsR56?SLJV(L(EXcY*(_4(ULqEIfxody{LTOp_f1c-S-7_-E(V1(|ZgST;9PE zb#1PzOF;=>A>1nCXBdic9M}%Hh+_peusIoA*2`j#IU9&CtOO5c79AlpfuA74g?*Jz z88fE+{Q2pHJVZ)qc9UK_xqC^JV{(mD^e(5X^Zwnx!TIT{`Wi}Obx{4RruOMIi8wf|3hz7geP0KLJ*?~XpxAh{;TYp@}qi#^v#Ah0X5LNS2)SWwSyft>(sA*)z zN=Y;VS#6~j+-c1ySZFesbEf6fD)0B1oYi>$+4s+y!5-l+*jPrF_fZHofL}L!Xdbm-DBUDVqZ~d-?jc21jZGYYr!}E>a>1x(0 zWXg-9bHp-v-F9qiC~$lat5;3t+&gT&O46O^kRDKBZnDpHk|4MNU4upbV^LY>N1b{< zCxJN~1~ztui9Q`y%}n+O=7aam2(W2!&4eP+C&La$EH@+EVGG!WzQ$Y3ijZ_A@a{Z7H@Wv#iOr1^^~{Tg9-qudMyN-d+ceeSYVVqpc{7~F}v;f0Jvds$AX zjmn|p!{D8s!JL6dM5C~=rx)w6J5fHXK>;H&JlQB=hsisfWxqnTA7Fw@Ftf2K8kV94 zTehOO3!o-GVyw9@#Ko_tdjPo|YI98N`#$x z#q~?&_(IYJ&@62$+dGRHx&ZI=%Irak^Ef$omVr&d^9r*Egg`QJ&fbC0n9GP}yK zRNedpI}?)FX`sR+?7Zfp;g<00g!u+ys1%g2EizxJ$!DKDS~n^0+C%SsJ}e<}rvkt9 zB#@Lg7%WkD+D0!}TEG{aGUb^Dro$vC0f<(#DHZb(W2%{W8%V|P+X6qd$*2`F6zY&| zG$SajPJhu1@#mK_A1!dDU#WVPy3e9YB~n&TX0!w=_FO_v7sIJyr%~w&$3fHcb!(9#(z>Otgqil(^i3zuv(QKOj<=8A$w-8HwBMBvHt0+&$*=DEO zD05A+4_TZ9W>-{;R&{IYt7M^-j{ewIXpWHXJ0gQA?3FOUTnt4>M;et+=a^jqZoCpQ zvWCd~I+*A&ysXL&w#i0S`VFS)IsTqg?_Dw9fiUHhfn3}^`!5O9W)Xo>p`LC74K2U& zXgjrWqDSxHx3{l?r>lmI&&zapuaodQ#`qP8nk_W5V&0XzPq1Es6H(ywz;hHcU)`9E zccKXAg5L?2TM3?GCmX|zvWiahJ06{+?Pa~fnnzjW{l9RW zR+Y1nO97&6KiykDmrBvrJ?I8f+XgE-hw^tDkHWZyDxR(-6T2md-ie*V*mkB8t&oi0 z@?CiqyW`}+vM}`nD(IkU%L2y9uom%Qi-lI-A^s3%z}$`UyLWb>?Y0#I6~Jtho4wWc z(MX>}`;f^MA~;*#%`(pUg?#>v&X`$ivzBN3tTB2y<<|%u3{vhn1Jf8sI91JmO4qCO zfZ|)-r zMPIm7Up2pGWhi24e_BDmvn4DqLXx3DSTzSMhJhZel(RyXv%LBGYNrp#%b<$}GmGf+ zz@MiN(h<2L3HvORh%2oY8xA(MZR?*ZG_h6z+q&5i0vCSG*Q31hZ&MqLxBN7WSj$=` z%gS}%8POKGQn-FQ0YqKklnRKr%%~M#f^mjfVey?(pBmd(j-yWr-wxDm#Z}F4Mua47 z{sDERwa%B`^g2s@-+J^dN)VJcNTXC=q-xQ?mmSC%J#l?P9 z{7Z*Yv6d^O>bNO1$Aw|jC3KZ*VS-&Z)|sKG9gFp945wKn+vcb;?<7{?z5T?VN?n0pSL9+vDi(Q^Rwly@|~yzS=6qq1z->*h3*YA=&AiX1CG z)>IZ$(`6XR+>amWOychB8Euy=pYMZCVXWkYHk1fLMuWysX^Nl}~vDa!oUP;+as zi#)Ho=1mOH=jLkk%97HYEpTmuf`8&_ZBDJcOkHiF>hAi%+b!MKL(wLs6`xpx?KB$G zW_C35iZ(|4mqs=o+!q3A3QQiWfM^0zDOei z#1~0aD$gNy5-Np4VyUd?ermTAovUn|))osivTPg?4E5L!XQ?lK-B4_~c!25Py#L}J z7w(s|40QK@r~LP>hCb4ELhL^W>d1Oya%)Ue`43qAda0+%N@c}7#g{r4r#eA4Sd=7fVD+}by!VV2K{|ySvD1?%jeV$>pu)q^ z@LW)geh}3Z4`~&gqcVs^t&k9|LRNeJ6>$RijCL8OTZ#_@h-t^lsl-aArB zln}mHh5FJWYC|O7moNO%XoCG&N?>lI)r#wk(S9LFG z9I+yhp6mRDT$akwtehdH!o6(Us&vwaaOl(-? z81}!niF{{HZV4o!i4lGCHYlS~UDakfF+^k~vVUY*efnHnE|&9lhGQk((xr^S0k_sW zDr4%$=b#LwIrW;kqdMM_;4bBnvIR5si5(~$_&W!1_H%2$`h08V(Uz`KAJcszS~i_o z^@(Mv(eZ=Xu!&M7(7A0$ok`Z6hfdnw=0UrBm8BV)$BYENrnAE1(@Oy_zQ2rA4 zrS+~UP2>KL9M6D-TS?R6TnjBH@=DCpW+K`qp@~Sl3pE?^F{ml>v1mHN^C#7)1c+s0>HUe?gaP|U!tE2k zK2a&G6F{waP|6Gp5-SZdZGB`PmVEro$h$>B7bU-=Wlsc<0zrHB(Sjz8ye@6ar*6Z5 z>b77(8{;TYenG~REa|0Cq5c;Je!AKM`{zK! zx1-ECX*2Nsp;GI`cN}1OzNL-#cLzZWoqWr@$j*)0h@`byTZ z3Iz3jVTsmA?y7o$?hUraxspm_mlEKOcDlESlF*XQ-=8ddS-hOSXYz4Ar!=o4jINdy zjf5~@f9T4o(%&oBjhD$<=I8j5uC2Drz2oE~I8+6ltP&o;6%wf17(vTM%PMA@DG&%g zobW$cQlA<^5A`e%!CC4M9My9vx*tI{Q2<+aH8YS5eFhlyt^K4k!KC`(>0$`>`6s-l z{~!6Vf3`mTdkJ3u(0%`rIS-AX#`76MPGA2*KY-xT#$%@A-acrkq{)&xnX3zaby3Gb zygVt5gP&MDnN6NN)6hnTj_{}@vfXj!o!@CVt_1*ggV@I^oWp&IXq4sI1D*W4%3NZM zOzXK}sYcn(uS0)+*)=h?&Qq-8GTob|o%`*)EuV}=8%>*h{~?XE{#%J@4=>Qf_*2=U zU&mRX?k%vi-TW0%2vZ)vhW)hqxeHd!Hd^UhqvqiTc)La!c30K?v~|OFdP5lS$2j{B zni!?FPQx}Nq*3`z9~e1LqPNU(``|@Xi{^#zzi%MmflAqo*k^yi4Il`0*QH@tioVA= z_MNJ=PHO*wR{Dk&Wn!y2(IiMlUb?o0RgV{-w4}KI6Ts!J1Sb-MgCyfb2pZ752*DZS z4895|@4ZsWslnNpTe@VaT~~iQ6D@4@&U`eVEc0@dIfMSPL8eYs8(7xUNO+NdH0=R z;}*qqTy6ghPbF125%ZDXtHS0iRjeYXS!~wR8Ew2=l(Qkql%>TwL9g`&Cwt4+3R0FW zDq5RI$*6MRp-lnsg$KoU)vhu2#>$tE_QuO^#dhb!&{f*{Kn>7C_OV7v%NAOFUc|0Q zGP;aqJ|AY;cS2M8Z9GLKvZC7)x1#>vlfhV&%eG(}>e5t!LjzCr4;#I9^m8S?=Qlmy z$=2=Yqo}}k`^rAC5VY_)aG5`kc~Es>+# zpIf4N^1kfNHP~2{N;rg+7BgAD8$J^(>Nt+(cnWMbDg4@s%vs1R@J4JGLX6OzkLJ^# z4yLEviH^{E)c3_y8iLtvx5{XGDvUeR1usf%cA>RWka8nKOj4Z*-%XPk zu1IPO%-LK$>f{mf0M&57r_NHYJU`6pEj;yvcSpVoxV_!a1#A{}8JOnW|xR7lLC#wj0q_L!&CS{>)a`DfPVA z+2V)#=g5t12@9I+1!eXxgmp)NXPO$ek1SWYc0=GAR`HQar*!uLq@N%xUGKu5=XNs7 zq(6|xd<$Ef^Gd8^De_8Wf84UpeTa1eev)-6H zJ@&a846VKfn>4J(jt+K`GHk5Q?qHN?A?}^4= z2HG)1VOS^F-5eF+&J6gPPrNZzZ+deV=BqIH$Kxl?_;?OAHsSX+vU*sDttE5ymWH*o zPp$oC+vhAZp--f@AAiT@8j7UZ>*krvY9}7zb20NuAY0fGZKjzB^u#1tCvkP1ny7*p z(y$rORU{tqhyeoY&6j2Bd5}ymHA3bT_G;-Kj<0G2NCywvQy{9!(TVb!I#2wgz6iQM z3gVvJay?aCMr6j$2Eoq`xQwmAF5BSuxzP;mk<5P-7NO1d>PiGLsXUrw^MxyOAt|fRH-%wkb|9dI zh!^mo3W4+OkB|#V$!#!Qt;jk9vsbRNO`klw;fD{54uZd}Wjq6T^7*io2!7I_hxDcs zJ6nO*JPsYx4Cg^)fU7rbtN7J^0Zc?23KFL3t)M>dz7V^D=FRfGw{#PG=4?xIfZ?F= zhxmryJl~+IKirs-D>?-?D|zkf%Y<#vJrKM6(Yyqjsa`Ozv!(9YIP z91q5VcaXrFP5$+v?iAg|qD(fDZafX^7T{S7-}=s3qnuggltKr2y*k38b2KU@uAi#5 zf47ec(?}O1(L?&@$c=EA22TJiDN3?ohTF`*tTShuAY;O(~D5t87p$!VPjG{guv%}HSz~S3&M-|n&6TGc4koUZ?F?STyc8&#GsArr@ zKg)!EQ?F>&IW!rJ{VTaSQNnJL;-Y!qJxHr(HV;{dEZ?YA;)%oOINc#U;nNduM90zAxvB<3v!e0wBTXH-ii_Y{pE)oT3144mG`LXc+yGM?8} zEN8<~-hb$t<~8bKa+w~Gqdjc&Hl1&xT>i*&Dziz66I@TG66%Zm8(v;_>N)+*K@#j+ z(4!Xm9I?wdJiZf%z!)QNlx*pAF*gQf=rd?DaeB6!0yIYjLf+O+$u%6zL3$^HX3fF^$ZH#6pZ0>zSiP+!2ra>*uc#{X zY~`gbZgm^lc(9oW%o|MQOSv~$P9OwK(|-g@U$N+n&f7U!H8aSYCyi{w=fv@w@fKIv zKgalLe9dG0G@pBL8;+wd9|t~GYFra(X7XDp<|9-l0oujZ+ngLdtBSTQg~UhJE+oj) zAV0zH#E~9rYcf^4_exgla#dgG!K;MgjhD}2g^pT#bhROyqwP$E4Ti3$V{qDmlX;|c z#+XJk-l}mimXO?&@8dUVjosFpdB5;0OyD7-3hIM|=0^rC(*_ku(9WGBMhR0!twU#> zQ?4Q|PlMy~xNvffq58NuWeh<+-?jV34Eq;3s?9{KC|{XK0-PipxNceF?5x1~j09|a z8Wl=2&Ni9J;LSJnm|H&dvc#pjb7!Q1g2c>j4Z@l%Z2F|d_63#w3$7+B2#cX2$QB}W zCLEqAdJtIb2;`V>A>$6bKOxByO#5YDpkp&~LWvWuJ^<1&I8biv>XD)u-hye}=v&eB zI@!rsW8+ihlrKVu=Wdef$UMN)YY3^#lkOJZMaxC%#>jrpf))X2At@E9nuN&Q4@s{E zUfe?xNHcaui!V_HH1k;q;$WP0Nd<=TtVLzObWSj4mA@@$~Q`}&?(BHUdL4= zd?(ew)3BP=E0yJv^_sQ#Q3hdOhJvi`8IH3ce2y1Jcd4=~nS6#DPryWMJ40c^sz2mu zuows{?Hm<-B20^eQq)pVZ&L=yk)O!ywWscuDZ@FyuUd>GWQD)|%Q=N2pXx7uciV+LhGL*oO5%(B|Ewa+)+_aw{*{f5bL?mZ+~gg5_(f4P6Dab^5*$4ncs&@2IteFOB4DZyHlYO(} z)M@MzbGW8K3!&u6v<6n7s4yrEEly66dk=qQ~IJoS5D+R^P1E8^{w+KCG1Gy(O}VE4zgu&a6ezHc?^{ICF-`B5_s z)QUapa4iYqe$yA`S>My^ufuTcybXc{5|}+B*4=G+2mSc{F_<`ZosJiG!x*krA5;jD z(zuOu<~j87dPbHV<5FfuvyoCgtAt}YWr;!G+D5HY#&cvh6$m!v(hp1o-YYgv)t>}V z<2e}lG>>O#q$L6L)iKpzi#NiLj4Crn!+u`hCmed`W%i25H{JensLIW%l~>4YB}B9* zpKSdE`-7b#)0(NsH^!u{kS^scfHe3zWAc5n9J?uje%*2|5{yd)C#z;2^I%l3fk4;J z-d+5YKbU-YKDCaEaF?lnc0CMJuunjAsUtX2ib8>s2Fc9WDdTLV8>ZGYyq7 zBR9T?RmfxPm0RLicRS)wcZg2VGr{(kM~-+6ABVBTPU42g&yM@gS%3_FS<^uca3TG| zTIe!FF88CPdlPK!6vd3+bY$`4!ptfR9E$rWcDW7SwBp~$F3{5Esw%@cvEm?TsXV*j zoN8up_u`0kQM9V9TPHgQevbYImV=X{@uYJG8r!bVEAw%iJL0>L-Hffo0SfT57a#fW z;m>Q^e-tD{+l8ochRgs_ty?sqKh>cOQ9oo!Z^GHF)9u4BM`$-0(Dmh6U6w5lV*`eg zTa(ViHov|@Xt3yshItctWh3$j)}l#lWx|HN0=AD>4xpRF8dJ#z9#G}kXMa%*Dspp$ zx3hNdPJYu^yq+DB-;2;#{JU3480p*p>1mg;FKP4wB>DdLVLV(Kj!JyNI7)#k0unk_ zwX<+J8>*PkPOK3d8(1^Jfs2Xe7m~;)U1yF+0aW?zyyB2GE^2CX8eH;_2jeT^FxDj% zBi|xnMJCkifqhu6o>W?;+piP zvlXrV!63|}74u>D`kr7fQ@6DBlDTy5sp+9J=5 z!?8vDuA=1NINqrU{+aR_=UxDyUiXyvfxecnvMveq?`RUE*>( z;@cb}WO<*-5eZ>rb>V|0&M@h1T1X?58mD@sXkpc@3U4fM3%g=#Ygu3j3vVkLlUaPP zme393pPGZBdEKs47Pp$2nOhqZ;;5?9vaZy&KM%a$B785vEKpFNR6$CvvilZ4Ff&11 zKX*^=*z(CvSKIBoW>5$|We}U3ikxZ2bS53KF-VbW6*76Fo7oaE(CdQRFFs6zI8G8a zv)`A>kITfTny>gS56=4)&!!Yhxo&1fNCO!SAr-P&QG~#fKur5p;|xULCeRAyLJkwa#iL|zW? zwN9K`?zO#V^Il9NRfl%2uD-z1rcRm(DxkVTa79?zLia3Y&Upbe{U>B*_JPDLgRd>~ zHNm@|nApL{bRP=5cC=d%5xAJSdqv$UmP3bTq~gs&!`qQff{}RwOI4&cXV4R#F018o zWHE`RWUn;vNvmgf)$9v2cr{!ut?v^R%%;DjYS>FJ4M*}9aa4;_55MulJcy8S;f~g* zRw9ll`4~rvgpVwJE7o%x_r~p;j2BNnr-3HZ?g2s0{TjaCMBf1sfJyJ2BPylD$D(W* zR&US`Qcu?=xltTLCf=|5drm+JKth3lmcmXN)|1I{dbO8^ zBB!$OWLb)!=TESZNQrPWl%04azSlW@$V5Qo4(pxY|leAr`{FSB9YAC*DGQ4+ZVHd$2l)GxRDaOTm)XDm?PH^E_#u7qxb<8tx2>WiIj zRy4+EU*mj@&aYCjg;xV5Uh=qR_yu}`S>)fm_G2bQ?@JZX@f)K-nCKrY$ zGUIv*ljwos51}KfA#O-sx3^ZyT5)wt2_G7z6t%5NtmZSd=dls8Y~zLd;uFLq2pqBS z>l?8dkl|Jcr5vT6Ti}PIEtMtfVJ0B@QlYD!l^c!SB84*OB-i@7Ve?yra&ho}Ep4fe zN}yedL&uV-NA=W0#ierc!7V0rvTCP|&ZxavpPq^hf5I&?_f4O1I^^XuCP17E`n{&2 zXpNEf5wy_L%t8S$K}<*tc_ltT|3<*IpU|pYr~ll9Ka)5oUl)B)&L`59I+fs>AcHmm z-_?=}tLKd)M-KHDk&8ePz(vXW=0?Mq!CnF)o$SZPyn#|yBrLI%r9Wk+U@% zI6n*I(MUwo5Z(T4sZBE+3ycq`r)Xz~IC=^`Y$41m4gHj4AItPJIy*(I;MonVwe%!}ine_E zXzF+kS3g(_)1F_ICIlj{Evm1X{war)n2)$#sK2E457N~mrp~^#fbv}xq9l9G*)^Vp zEId*6DaXY;_6*6K-~}Vw=eZL|GOgxxG{)()kmd0{ zE$rpl;$fr*drf+gzWg0@hS5{t&|oFi2U2NPC&=9uJD z2QVJlm3w@KJfRyr+;e*E$4Lax#B9tpSZyKrZ?^k1-4zktU3NUuM}Op(S220*v3h5{ zcGbVY%*;F_@p6#xk!@Fm++g zgAoj}-xy5;ka*uJMfBZCXNrjQ@I4NitT@j^*)BK5ed{7?Y4xc(^qjBO8QF~@+CMxk zD&9Me+4jZ!I~(TY^*=QPWm+N%uijc%E z`+}*ihPEl4tDIk4QLmfnmd)gF>%0iozn-Kk6eWlDYXzps7n9VS)9=u+^WALa8%Oqw zIEr;{aOHLNz=Uo_gjuz!n^7|V3n<+gqN7FJnlTO~tt!*ih*$^y*xGU%9ktyIf$F8` zy8EfTt}Oj*YRv5pdM4XyaUI&J4(6dAnL=FHu-wZhs}nXNP8*V2nS3rv@v~)S6nFkI zP>`cD&f=dkKTkTr?M7SakG=B^ebc7y@$kVvWhsp>8?OKW7WJ>m3R<@#FKOa}$(-J# zNSj^VQn|COMHz4gC`9&sOkQ9p2CNNVwD8zlcAV}k7vl9@0k&LgktSNtxD{~3wo!N0Nk3+fu6S(Zsj!XD)<>#+l8#Gv)wdCZ=NeGYUA9rsXkY zw%WEiDln#*LSAf77U!iEa=VGYFX&$Smc>$gf7I5iS-e3J^FCTgUw!_$$Ny5T>?)yz*NCV<&g9zj@B77sP_Vi;`S|MJAgq4p1 z18(|YnVMaTvRAXpXYAs?N8Ee_!EELPa>`ljB%c{ z&-eU$gZaPjJby(Acgr$jUVpq&8#SN2``2B|?QSSASw#IKqrZ2k|N9nLy&e21kfbx_ zCr10r?-j?VDzWu|i&DsC%*)jT<5pNzbbknf|MTA3c*M>~*ud6brei7K?$s*41>P5r znJCV){)LX@{t)kw5b@jIVu4mef(19Pn^ym1VlXHlCsSZ%Oz?SzdT0*?W&U#c;dJA) z)t)+IjD;PxUD0v*2}9t@^5m&fhvWRO$Bd7o=QC|Co@4uNs^k+2_P&{dS(6WMhZf{QSna7Qf@gr$dS- zX>+~(QBB}_X+kAs?5v#BleCIA=d)g`IXlGbhWFs`aF^oc)(N?fyyNbOlrl|rHrZ=| zcfEi7MCWdS(LXK5|51gFgX65V{f&`J81SARg28CF*9 zuJmg(5sR1GGl}#`GJu73!LI=?aBxGW)9oBgoq0VHj4f^MY#q=p1M}5ob~wJPK@;Gg zPn|kzSKjt3e#E3N+ZK1t9*?M5#3#PDjn5dvJ1nKD5jRhp6EHMccQb9CO_@B-Pz%Vt ze*^bRnf5ebglMtJh+i%WZTs>qOmPA64(dUo%@hK2gZ>Bn5aXQ)+{6*H?A4o(M)-i1 zm?Y`ofaj|Y$Ne)US$0}YT8-Pkuvrv!#O-*-1b5TEE8jmreY-y(__nili~6oy@Wte` z{bI08O7k$9~6Uo zWX@I3*RAgMd$-{UA9?-bJ02cl=0GTuE3<=@oa>%LE}#FO$5Ieoe+s^HVZ{wmE-H$# zY2~)}%>*fzt61X`sIfg*BsAutk1-$^Y~LKd@EfWR<-21O>7f9cu^g~Lo`Q2s)}Qyh z_BRG-Qbyn&*C3BP*?Z>`Id^>7dl&PKZQ%x<3%s|B)V6<4=QA8kM~b3gp84Sl3Uzrn zX*`>`=}3Exh{4Ll^(Ig`V*Tj^BSE_8_HdzHC{J;FT%#jGl-z&FiImUPv}<98l@*Q> z>vHZ%y;8d?u3*5%AZ5gC915+q>fvdBj&g__nTwMz+&?+0W`I4fxAV_?UU%E_w{Q>l z_sy_jbZ+J;Dk`$sIwi#~@mp`+5DYY~{jt?#Ss9+cAOAr9|1vVee~pYoz#iG!IjB+J zIj#NMH+qui*Pf5z#nwqD-s8}V(3d?(t=CNdTlh{_p0R=$$i`2Dd+8IP&uM!C2&DcD zCVkOj7)_C-Y<^*I_+ye&Gvd3~6?%6n5;I$Nb)CsSlJhh~>a=xwbT=H}`#LbE>4v=l z2)f`5xWPCMIQ{A-t8jr4a3g3i>wmo5jJA%!L7Uo%f1`~Ns|7ynr(ILAX?bfFM8!mH}BRayZ@C>!^SupO1L-mdLH#cO2xP@m3IXB@UM?+5j!IMqUw(XNTdFwp{Tk{4(6SMkl6cvxpY`xK4_twE73c%%J+p4O=p4#E5lyjp6kaM@vv5ANmVC~#<-KyDP zkCyXzwFmGa{qa7f2*;CI%&%UU_pnUh-DbCpT%wQ8a+gQ=#F6^~ACzc?9-QS^Oz?@= zWFsXtkCcs_d>oE6^I(L9S-{sVCgrGW8kn9eFUIV!&+|%OxAvSTRFuxxce2f}6M#yV zt!R3Ln{AQvR$q?M{oevVfC+jYxOZojh$N;5@$%|bn*LorIS-E=%&acgGI|6iAJEp+ z=#t(VaB^_e*e+o`p9j2Le_tq|Ax|kc@ict{BMC1LYdx??f(s0maWD{UC(R|+(`gsb z7_faVe2FMgj=*@lU{o%>eUt&lZ)7T$%Ce;Co@w)u6l8{gFVhS4EzLl zoYO3zrxYYV85h@Y+8(khw1Dw1uwHS(Re%wN-*;$jC&duzcOoQy$Kh{}9PF*Pi5-0C zC1Qy8Y6ouGwx6(szo^po!0f=6Jba#qX?p2@$VQVhGHXbEG5L2maYXXRVU&Og;|37! zlfLS{+!9>=7NIVr&tb-?G&qW?%K7WAGbO9yf+4JLdB6_6J2zm?prV!Ip`5*`3e8^M z9yD7N{cpA;u(x+BjL|iErQx0+pfKE_%W0DFMHKU!g5~!QRQIFxL;$ zS=>bOfq0Sm6CkvZ)0y=sI`q!jS6mtoQ2*^gHZKMxzWk}rc_IIca1;SDDt|-A9Q;D7 zPP1*e%6?{Kjk$mtV+8NF1a(}~4DS1KEUt%f9|Z&MQ-IRnICVR0x>$mR<62r!=+X77 zZaN_R6`JjlFL#`z&7UaK7;#D_c0RPJYiJA{A3r)5m~1`v89Y_L9eF$;eKGjR@BXR= zz9+-oV4}nCPLKu%83+08^C?8IAM&_c?F3o%0K9|i-*WQbS`Y|?JUp(hzs$(2QIq+v zG&4^|EYA!yht3Nb8cdIKSKqx4{-a^ow7?{QGy5zgW%WQbW@Pz-X>{t7KPbTVk2;c5 zS|XYmH^A_C!6m1*yVKwAbgzo3<^}I^(ubukFX|AYGyXP1)>N=D>p_d*?BoczquYwA z%ZJK-bLRv=eFL3v0UOu~>Moojd|C0|S3yaldFvJxq57eyA9ZzrJQ1`tdD+pb{UFPO zx9q%f=SFCpC|UxjRH zm*$dvpG`WQ$vYocJGW-&|GU4&5>ArVkx$3V-u?mj)YpER$(kd}7SOf_Gkj~U*@FNsITs9nZgV)}M@ zFT0kO??<^^j@lx189bW?uQ6};-Z)Ww;0s90ut;KQ04Oy8)Dlr@Ahk7w(?8<$MHXrU zhm?BsZhXFIRUQbMaJfy61KoH(_9LK9_%VKZ_Xo<^V!OEGxGO2j2XE78#iMT7{^XTY z3*QrR#8%G1ts780;sKLZoF-LTKdvza?oQ4?|L{bw7Sm`y;%BwKxVTJLFS_u0$l2KC zd={JHUf@`rih%U0 zfTDEiRRp942)zb^AYF>myY${mq$MCFD7}SF=p8~25Fp?2c^}_^}F!xfjhjPS-<01?y+fA+kgrznn7#Bsxg`+T*5 zr>uJSciX@&uYnNc)5Orz)Y7`geajna347oYXK0dpBa`R%vG zNjLJ0fiy6o5)yJpT4tJVdXF_BPbhGcqls=QnI`v|1VQelL?Hk;G{&rqy^p`OTnc&K zWGHw4InR?oB87F{_$ark&|xr3yEu-Bj?b+N{1*Cg*R~xlwHT98kM`#LZy1q&!0%cC zq!#5yraw0VHaQ;X)A-&{*`jnb8)%Rl`ygH=yfmP}qTd#u!YE{f-R(dY^A17AE4od_ zm?(x9gU&2o8wL1uz8GMtTs63wO%(LwTCvcCjd7`BEqULG(3XIhNQQpH0XGjfQ*tO( z+kQl{++7P+GCYWK0=E{({b}{s3~$G??Ek7U8muYDa+nr5j(2#8r&?P8y)YiU=!co9XyIHhw@0!?eHGhobPBm1Bcm6 zpQkWOL(`dioI#!1%Fk|B#|Y8sC-T`gPdrTvT))-!f-;a1%AX4A-lpIb5_vK#bVqO; zc!|7DSO}KrfIh_#iqL$^c34&!#&on!7FIf@2d70>(h3rZK+miQq#)A zfOckNc<5s0MnQ2XM|Sj}iH^*TLEB!lj{g&xA*9U5wtXZqv(jh~F?~mM{9x9bKg*|{ zH7+efmcdOFC@T^fUr1cEvH7tZE@dSDNKD7ZZFXm9yyyhI`i#tlLECn~ zw)uCfD-Gdvg0$Q>T%NzQOO+uWGO(0SW+eYuU9V$EFj=#6=fEMu1e1Lgb^JPKbFLQq zb^hn^xup~7bBbIEM(DG-lgpdWDTHShzuQ%4o!zze#<2JO`W?VRl75pOB=pI-!Y#OM zWrm`jIPmHl{9_79Cdd8LXa)NC(lJ|P2LK|(Dffqi8i-%dzZRegV>3b7@qp_lEp6Eo znc<&%lWwi)7V1`yvP}K3new>2<;`u#Aw0cY;Tzo6nlF=%xmUGG5xl%9xM%aDGE-|l zVSedzbRl0`=O*KZ!4?>W3uYX@_dSDe)Vs1Xd}1-9kHeodRn)XRcQ!`-)xQ%HW%qAg z(LSOi+dl|mbqf)sPI0$<+*eQZ0cZ8zmZoX3@4 zRqIOcUseC-1)v?Tu;>2E7Vz~iBD{AVF-JYu?H9g0xN6_HY$3IKiQA8W_@;O~>hr-( zU-(~bffITzrd?OJC3a_m=@~LU0;Kkp*lrq?bTRT4cMvoYsN9WL6EJxD^s~zkAG|S= z)QV4;p(mA-Ub6=~?ss2P;g6;JzxS#5?z8R8&t*Pl6t^b|e09fRX`FBNJZS7ahmMHX z`E~I+i1}FGwnXJB$CVoNIT`{DS?-SC1sJqP^@dDGM|Jyi?Mu`b7eqR9A-DGyUztpS zT}lRG6?un1d_xdjxNP--7Eg6d(7I#W+YFNnm0&>wftc(dAbUP4>=K+9c&LUaDHNPy zByN0c!A4HbE%frIMas%ZYsmSb;Ed6KE4hTQ`lMvv!<3R z_?40&uag|uN@&}`!?ZA9{i)OA_3U_hs1;voH0Bld@{Zpdlr*)nt<4}3-f{An1g1j_&IQS8(yxpCA=iCpxWtwjWD-{+!sEmSTtB%5sY3fGT035mIc{Se|XJ&ED zOGVBFRGXO@mm=^&gJM~|>uts>po%piG9JDs#8nLhrL zfT>tu_+AiV_+UOsi77TbDE%Il!o+}D*}E4nsQ3)+LhJ1t+cp5-s98{}z3ii0E4{3D zIa_;DKJ2Cw`oW_H0`kX2`d1w6VTv~zkD4KFSN0w{);BBtrw!Z5 z>5EkFKS{Si8YfI_yBroyqcuD0hMVW@JJx>mZ{?V*>-Xku8fD!~_Yrw$WZ9-(pLc5M z`&$4gq;k*C!FxH6G$`M@>{$jxl875GSGd$#K)VKDtT}~k+F=X~HFhz8J3gm+w?~%c zLr+k*zw@Mqa2>}+vOwJ3R7rZOlV~N*$prA9*KIF#zPNo{f6xnT@Wr;E)}0WKyJ9-P zyBt-D0KoiXw;6jPg_vwAwhZA<8#F*o3&bCXkE&gV*KHd+Ot#&Efn2uUc064+g{Yvc zTAhftln^!kJan{G8>jTVgjXcrwa|3&C^>giK$7iIL*yIVX>%ln4Bfi1Zn82Mh^`~- zhp9K$rASK^w~zSvOojR3SZ5jr27T&n#kkTCogg_<#!_}#@^J$D%66?WCd01`gwsmn952}evs;zwHVAWdcAo9-e# zN-axPa&@hR>%Kdcb?Rh5$m2h5sNbOocKYqG;75)Ss++jqV{$XHn4@QMH+$%B_?)z( zsMZKdU)0IV`S2*WOK5!aC5N}P)%r;Ly#b85$ev*rkP;c(MU_-1PkF`eRGJ2~?Ykb+ zFI4)l6qaO}hnA{Oct87O+8#8lW9yKZ?l*YJsayB2QFHsXyH_Mot+MhW9&25-%$-kf zylR0F!U+j~nM;rE6l2v#%$BxLe-r|qU)QsWJb5~MQRcoIj`d}GoNe}(OrG{Sh`5b@ z0kHHK-=*xP#cjy45f;h-U`L6=-R$YhA_LkPH?ZREzf^Cm`m}q$(ciCH+?zM3lMCsS z@++xgGaV@DQ~KGHpo~vI1SBJm*c~Ha<#Bs{^ zVz5OGVc|2V)ULLI$lG?gs5L$0c$s;-dwaF-n-s%NF+|%16i&wwQUF#fnKhMtg~G`n zW>nfKYS1%euIpNJMx}{;ELM7|KKIkjSlfOiRg4$&-6aR3&B3@XEKob)r&WLbFXreb5CgY}wj}CNBo+XioWfug<`9v+^K(&%y-xexI)(W# z@;deT%=~KrmH2k_6}8xwM_l*5Yw`pK{!}9VOFq(!6VEn$EMmdHBmCWkCHLXCFIp*f zFDo~#+Cx`mk|!T)mqZQL8Q#;h-FLs2)z=M1L!m2*Fh6B0{B5|!HWHkl@YVF>o6nVY&@nfbD$F9xPf z%A+FVl4mujWF@JlAg=4WWqaLJNvV&Lv;eN{#5{z}*7c{8?5mJMGR z$qcEqdEL)jVNws=Uhbg6@95tg$I|yF4|om8PkLK6N|QTs5xDpaAf2){nr_Lq zov5`RU~Y5tJOD7^+YHH0-;P;T06hY)UUV2^UW9!`ND2v14V3#8q8pbRXl3a`2B6pNz*qZXHmX$3gFRSAo2FRj*DMtSB zdzGJnEUovgtg%Tn{x`sqJ9dq5MoLjTEhR-e4F*ZC7F9jC$LE4(ZUgRYZpu**zR7w2 zN)uh~$0Geu2e0X%bRBz#5Sg|EO!cZ%nwEg8)kaGHgs~a+qFl_xVyT#x*thOzUZy*dMes&v~xA| zgI@fc(Ud=V*Dw7lveqDSJY}~K*%YsB3-PBm@NY0(#dUDRk-}ukl2iSo;M8JlV?$BH z&WmDx&g%U>=;L$wk%_UxLl;$5$_z^0IVEODCj%DGy7%T{-_!Rbo8*H#zE1>3K%PM; zC5~sX(hXlvZ~a)@Jc{;ocQHH)dbyCxY1rhlTsg&lU-3MHFa`|5KEPbe zQl9MKe0Sv*sboCjsvt-Ft}}3niAk5KQPT4)Pk|L$>J*~yzZc1-$D zmF5wvb)9ZNo!yxR#k$OW#yjLFcCaH6(Z4s{8jI2wU4I}3Nmgus!jYTZh=zHUia}9K zqz8cyBtiTF-bRFK0=Iw;m^F1LLZmL}y` zW~rWEliuR2m*l>;cQ87#(Vy_pug>1MuuT86TmS^uh`7Ez<8Tt2r@x0XgXz|0k7W?8 z2L@#3s#6JRZ`mhuq&T&2&@gaD2L9eg4I%I^Wn;1ml?TU3Z^uODT1%H3q*T<+adLDr zP`?@>A=X)e7P$CKn3R_DjN$2DXxJHr$pt)WTpxHdD7-MGN+f3%Md7<`xGHX^=1ecW zbM^+bF{RKp;r{XDrv-SJ+shwLjGb>C^- zAJ?Nl@b-vj&&-3#nSd57!Y5#s1QX$=Odpo)F{%fBAN(LTf{Kbtf%X9aEKAajKmR(pD9Gqo2fK&xbWA}(} zOi6#@USa(G6t&Ux_&)A)O{STb@XIZ(SjET=j_!@7axM{1?xiEh1ATECQf12w8L+f) zX<9Px;ik>2IimWRfXhZRlRr9FZeYHPr?{yyUYRZ=*6+cK8l~YDF{fn*&JgSEks3Ff z2IBLE6d`YnR=;BHLR;oNoePAy5MyBKXB$3ijo`JoR|8BiThW3#S>&y zQgJLObe}xQ_`j4675a}?I>SV0+Ms8KKzAZdA&ExdL|0*z<U$v8XDx9+2{5XxfYk3h6b^D=|lJ3E01~X`Fu?B?RMhnv* za95+o6-*s%=s@l44Js0FIXvBf8Cc7E**&k-;*rrkqG>;d_3d1g__n64!Pb1F=5fonrx zUgs$ifz6FQyH_Z(+cEJ|v@-Tkx-4n|IOm_Iy6K+hH&buy++G)5U45iozV=|3hWiDB z_f^@|^XY=^`k_gZm9SG(FermHj5_r6jOOqlxB&gX%Rqfbm$^`~Drj`I6V4Goof986tB84F*h>PYL3On=q8H%7#B z?irE09$j{|cLGih@fH7Q)lEqZ4Fd*;m<`Z(fewe19 zRpDD2JfcF{jPl0sfpNAEQhdUvj-JY6S=K;%SK)>h&mD*_lMCZ0 zKYnl~imNNvQ1YRjs*6(!L+L9=R z`|7$LQBss?e|8z#yUftG#)b|}meNU>u5D>hJ4eQecAtNFT)SalOc;X5S#7=_VS^Sp zxwfkF{5oV(6o{nMKL&r@n;Ikj^AuBe=J*ON|J#xYXCKH2n;8(_GI;a6Uj(>UQG8^k zcb87=wDIYVNhB~bmtI}zLv<5j=2_pb^2-SJ&`wm%2cdzXOrTSc!Im#-OGk&&hzJp8 z%4hjI58Bb+>w02Is8A4U#xbulED}kx1hbe2O;{KHvILjUnOSG|xHv*P~xT{y!mVRnB@18162XH(`joJ#9kXhh&I; zhmU_A_kFQTPu8q@;kd&AK04TKzgv(;sK`nT0Jd_nICX%$fIeROek`fO26;f}iM9}D zwQ=Od{r~Zmg*>nN6<_{+R5%M5a{NV0zCU`>J`TUW(+yo_bA2d~N6jdf`L0y2%39=z z1{37I-|)27G*r$W_C1z zN!Y<)3|Fe{q*LmE$ykwTFH+AGT2A7kJMW8J!9i{?>iQrj%F@eWy0)rxCjDrGoR{RW7skA{Poa8&2CbrnU)(W6)ITY=<^c5IC zG$>iH&aC-KsBUV%Oss*#M^~AFPSl#$LcHLF`f3%E3As~>)D~JYhhC(pFqRv@5b0#h|F&Ww+){657#*3FKgB~(?j~avAd4h6c;+uVKM8(MBvhM@hf zEDpdr<ey`2}$&;neEi$vgFZvO7z)v${-Iw4W*Anv* zoq_Sb8w}L-%JNd10YltOVOF*fhDI6|oKH#Z6yf4pgNt09!D~vUgVG3Ng2hXpY*XJy zpR6bEh8EF^{j;f@RDHkl1DCs^)c4xEH~paD(mdaH0SBgiaW_W)H5G z0kf73FhZJ~1LOPG?E|$R>v@9DmxA^npOTo#Ov~ZS#d+~YNuQ4ng;>f$1JoB=y-PQ4pI?T)x zRrkM#M<$80$oD_!YqGQ2RF}&_=#OoAyyc?|j$|k@m*-jJye6d>`o;=7F@-U`thRfe z=AtxnbA4@?(R&owX(1h%wHkb`lAk;~)j34(Du+c%^(MiIk|(LjrwGkR{+SK`TyP6P z4I5!hR)Hota??-R=`rS61pd_5bac;j-n1th%1>l}vKuEqe z+W_!d7J`F6D~$!{w&y4pXZD;vXm(-o&o^}`Fs#3)Yj2OkkTW$)jl;ztdd6-ePq~P)Um!ZLdoDgAlQiBYUr&pGQnC|JwG>}j?2MbhsyyU}9nfK3-bI?GwM(XT43@{W{ zwpve;kM$jCmq}?%#?2 zfBatmHd5}V`Cn!k!dJO+i8fY!8Z(Ye-|7bZuJ`M8MJaR=^nEQv9~!;qvO~YUFXcHX zV33}3*P=^nZl%byxalI?)hI{Jz-#TNf6+ppWZRo}FKX$&bA6F33KQ1|iYr-}?K2bX z*beoRQ|D`@uKHg4_kf^vyTj0l9K|i};J?_k7ujiKAa4f|6(mx}w>18A3}n z78XyR#JD}l6RiDW!e{hUtD=XC;jS?GM69{!y(%q-o)Wsda9|RpnW%=D)GY(6&afb$ zY`R&}Cw`bkx%%WbLSQP*`)V(J$6?h>voS*kh3<0>i~C$)1D1p`Vlof3tazTeX~Mum{HX1tpLY9MLa7LfZsaHzg(w+Asau8^z`d}58VEue$|Fnu<*OZ zTu5HoSg^&AHOh%KCuehwk=eviw0G!D@~74W>+lh3;T^ewPQK<(?$|iU3^}mY@@ScT z%NfaON;r6oZY3t)zGe3Jf7;rH0>i-RMQ>49=771a$R-v*Bmwg zLJJX;k@^y75Fc91@*+IarsF+tvdl+1u_C_h=s)7$V-SCgoPJj|(0p$o>UbGb8`R+S zngyGf=$P1|W23lIl*5|2tp3tFiq*EJ!~)!eOSj0*S*J(08&JC{SERe!#>x!Q3>+kv zy?R|D|E;WLhQ7PUPBG7>03z9WQu>;8m7fGH)tOHu6qX$4gWBn9_8aBy9lAv2U3;LB ze0rmMLpuWJn>E1>KBKTUz4e;)OJR7tA~vvuThIUIr&*09<(KkNopL`kU6hQRYH(I7 zF}~@X#bLPF7i7$WbVNxtW?qAXH9vj|o9WGxar8VJ5tP+IT2yPBo6lD2X+1dz`usS| z3Z7-bQoEr$OQzRtu*3cO<~3cVK->67OGS1gzdmRSTmCNOs%)P%tvn0m{{bi3Tnd&2H}~{V~FFjpLC*dC?&a-St`yW=RT8YBo% zDV_fYAz$LPGv}2W_e-bV4kYv6}`keHI>gD6h-KfhifSuLn{{4jX^<$(J9-l>L zwzjjefIKZH+`h!Y4yT~I^Qh-uZ!;PZvmb!b!H``Bt6EUl2s zQsnQ-Z26NItZxSFxF1GnIe;yXA=-p}cZv<`z0xoj&9FB-IC@1z-)m-Mgn10UQQ*m4 zD1ijl=Pm6?$SevO}>Xh zow#3a!@@s*_Pxv}oWkZ8Z_{Q>d2DQK4CqTAih{XK=+sMngVM({UV@dM+-;>h=o-^} zMq^XUcH2=(MN#nu42M1UW|V@4NT0@F_XvOi&}YT&-=WCvEd%74j`A(GvVUFx5dIuQ zn3eVgnr-|YdG`EyF7$1`BbMj4;`|bHNws@^&99U0gQ$0MI<6GrjNmtm^4H9wY&;oW;hhoE;2IPX0m?;INq4bbF=Uk5?dO zk2``vzCFm30BJ9KGZF6<0cpaHFdc*VS=ZS4Fy_>V$M9#nHVgk4)HtcHHyMfA}EY9#sbli3rB4g=}TIr=q6>35%M({v^gvR1~1Qz)w^GSKWm(1o-t**<7^^mL2J>pu^8 zp;AmiL(92T=x6qs$Qk@ z*V5k}?hkC+F2Aw^mnIES9ns5;Z^_;hc2= zS@crxH|MINFHigP}{{yX*J+YVryM zpmhpglzZPdY(@<*0)~!}U#IDC>EYvlv1A!{dU1#1@9De~{?k<0%+%QK&gz+`xKkWG z|8mN>2UF!YXU+SlLc5;K;yY1?O%;xk;8BP0#&t{{3%`>51%tqwVB^E+|9)J*8pnoi z)R~&5!3FzKYunV5u;++G|DKYSHtE6GF_nQahx1n2WtOC1_B59<*rgftCDnaYrsNrg zTt?e}g2=VF@0S@~Vg)ddZ$F>cioxA|#t-Sw9x$I@1sAJVWWGaF~?6~mUmT7-$wf62TL;w0)-P2!u24$%hZs%*BENzc ze$sjb)*AOR%-W>qM98I?a6bKcHoLg$$VomXeKJwfbt<97>9Cr04)gnWID~zF*yi8f z`Tv~HTX_29|N6WCef0mi>i=KAqi$Ne3Z$Qtg(Aq8!QsD&SDaSX0;?ZZPb>NC-`yEM zv6~5lBbd~B#O;l4>ZLiW)~0z}d(%(n6-+ZWOQ~n<^raSsMA3;j&JIjx>ZZW~l1x?b z5+O@!M?L2IKx%Y6BezfLHKCwW@23w-j=FE%6n0JPkgPfJf(htDBAR&%yt_wf6HBUk zj2J_h+{~&FS=MdBtw-%;0|c}RvQRI{olDh#G-Ehf{?6w-w~i?cMIt?A5)*t$Ncz;L zxh)LfRjks^G!Jk9(|Jb0p@N{=&Bm1b@@&2?hmPJU4e0QW&t1>Qr=ON&@d0H$;N8?s ztm>JrYrnVK`JC9cOnilnt8f$*Ra(XziL_s1LMvhFrkn4YW(?=*Vy;|d``Tgmqijn0 zkcpoE<|OVx-~B!O+5VX0#(gCTx&M0#D30GyOSVbz;sma?l=*h-CT41$m|i#2Z=?=ZrAOcAnv}qAl0X?WpRmdAJjU80X zl4Fedl#D2!{%_AIWA*>cF57z>$IsQTpXG}94jh2Zl7I3q{DlMb&vf&B^&VOKgY~wj z3a1}OR35Ipd%-s-_S&fV(OsX0e!FepOw}u~(}e)O$(OR#SqQ4{;m7wlA4)|WGavmr z*0H{M?M9#EDbW2jcVpwCey^l`K}wAlqLzb}S{0;RUVU`;2zCCB=p$=ZNQci2(b9A& zTRfKwxs_Lt;CjwevP|$xzT=`yY&0ubCJzPrmC>V|=+gJ4XEtfFC;@#=8?9BC$Af^q zZmH|z=u!`S0(svn9-`5zY;}LDH1iGO5^miF5;A}n@pzP@xtl+I!FPbMTGW_te?OM8FFLsI?9#ufx# zaU&UKMeuhnB^CN&a|3;!v2gUk=$~VJ4aEI!(2t{8?~68A>>2KAyEolfGZjJSMV=-JC%< zd{ZA7adUuWtR!Olh7lJNN#T=yj_*GTB>K?w`QIe*hlw97j~i%=!!9MtF7}8rAes6z z%DO%aqk%bmUSA~G`&Io(X3`rJ zQDPQ-QXV{W=X{>=pQk&lyZ@ZeTJ8|aru2|g^?7ccjj&X4vrsuWs{UT`EQ~@T_)6t} zQ&Pxt*1t+pd%xd9Qe%S3OKDL(r|p7rkrt z;PG^m2y_LnHQ+xGsD0)oFEH6M{d0K;SyJiIKPdR)B-c19ddkJ{#Dx4FdmqBAK$X2D zAWs`aJ!$A}2$Nnq4vbcqfDb~K)BI%F1Xy>sWi}Z`KetS#09I`2lEPRLEd9-%Jhjw>5aEF!I${$E0~$E~vhz?n_@hJq>IxE%loVS&JrgEFeGmi^XX*hSzbIW& zC9np85OZp6W&R<8w>B;LJyy{wU>)r4#uK^Ao`xJ0#uj2A+72g;C zS~`l3pg|Q1lW{X$RyOL$0Lorf-Q4PNErhP~!%=^L^jLWlEx!-vlZ>Nz%dN6zr zD0)B7VxpJn70tpQWd!ai#w*

*7^bh{F5oIRo0^{qO2#qG6oql`7?fyc(1uA!Qz zH|NesSpIGb12{nGW-#b;hKb6T9`8OeE}Uz#1L8(~V?2< zQ-aJ^$?S}zL=%)nfoJgjgYzNt{&oa|0S)Ue<|+VgoE8Ht$J|cN8-vWs=>W~#Y5O8N z1F!N@zMQq>6x(q2W?Vts0>ur;bM2>oa*sc$N8Xk`Gz%3zIYOXSl%H%wS!PX;K=UEN z(RV4wC>Y%PP-acYlJmq48fXZF)%XhbUA*qwCcBHz2HCQuTJd>m4EQU1*MesD?O8IKX{B?x$2)(u$fB!Lh-YTqI1=OHurB8P*P2_Roa zJ>>}nCc6blNb~V936ZF{*3JV16YEZ=P5-CqYK)S_qI(UQ>W6XdHJ@eK1ivL_MLS=o z0{s%h)m~ZscozQ@u;2h3b#-K>Sr8E4OO2Afg?%c(9zFf4biaZz#Z<*+!cEn6SKv~l zdPWswU(ehMYoDDjMS%f12N234Q~7dYB9ESe+-$J#<+8(SZ`}R^r?%Tfw{MzQ6aUqD z1@hLJv`d~_cRrt*xv;Z(U&5_#fjv28l2{1%x|~&>Q&)|ge95v}+}z>sgEbQI=sD2S zH5T>gf%ufXEfo{P3zoxP_@jN5I$z>`Q0|y(#IH}qU%u~6a)+Kk&Yj-s9%PnABAOfA z2gR(c-ZvVNv;-$UNIv&GoS^vgS@b3!5KdAYMK9==>m9(x;bauMfbL zWA$Qtc$U7ze6gs?hiXmEJK>ySc!9m^Tq^jFtH3@39=-uRpo{%v zmg#FU2k`h@2_M9lgX@qI zk-N&IxjqIM{n224+C1nDBwFRQY&A^kelQTEtzP4_i*Q z{6RP2%@_U2O5Sx@4xmKhP2+h<1U(!4QW~cQJZdF;Cv`l%K=cdrAA>puX@kEQ?3it zu>X0AI$!!=B@YG}HeUKhEPc;bKOEgio*em1>XhqmWcU-S04a70<#j?0J<=hf$k_j!8~Fk7j_rdf zKxai!Xc_0=`D)%y9Z*TJPLHjQsK%rY!xr-ZSHDP|5bl~~%Dcbvv0+Cl<_$O>FyR3Z z{?4?DxOMKsgwMs&h2NA#P51m{&4^3qCsl2h_#=V1>s^iw`AEhn3Jw|LWGy=_@~bU^ zEJ^_3Vr%TvaO7i4fl!N%Wf;8^-ZZ58jyB_C)KRwk;VqYAPU1I$oh?T=@K@t1+Y)xf zJ~!y6oqcfWVH@WHjk8p$h!09#2U{N1C#CE(sBr6r=B2-60y;XyJ@cK-R1^C{}m zi;MO`^qc;{{Hs;2;sYSdf|}7PX9__~l4nUbu7EoAZ0P=K#%}|AAlgBBn`-K*3c1Rl_Gv(SN1)bF#u%6LU15 z>&)HI)4;(7FVvq@0*yD|S`|*IGf0&PtFEZC(o1rlg30Y7XUiZShtCiHIrN$s`5AyT z(mf<8Z}W7?0LJ~W*+MVrfM7)g@nLwN{{yiHG%#h5@$Kew9EE?6=wJ%umY1!Dz$d2X z+X6Nf<9}~T0hXyx6NY0h;1dCKQ?b%d6&QL*=1{FPa`uKllsrnCw}uCf7AZCbAG>dQ(1IQ03x_J&WE zz76t^(CVJyj2|h=&Lz4VaVaHEmRXGnM*O;CdNeS;fEgYqb3KZla46~9zO5~)YveF} zHl8KOC8Vd)Y*J8+8|@)AxHMlEMExl-x9;qBFPsD3VU{c7IN-%ZBJsIn(?lq!S%;)U z4}d?=)__^%C-}Uo92mlK^AStA@65npzV*7MK%L(ZuWK79SYZNdkb5}Zb{rZTgyw0+ zW68OFxwG{5R&}1Na1Ps*h&?b63|2WQvisf9EV@m9tVMZXu=A2Oz@{T=Z^v9(I*DAK zWp&&Tw;j<6|7f@E|8|6A05EHI{Q3~6UJB%iF4O<&G3v;PIa!;lUW$4PZ> z4;>5}dbHen`*1r>(GyhWO!2FzJZB7@``Z%qf$Dkn{FsYSJgzJXK?NF`(%oYj1a>@t z;pAeFAB45Jg$@_Tf=DJ0-@x0T5!{FvIo9Piui9sTaH?J$KWh@f7Y1Z67tMPS_{}#Q zNEKJSOxIaawxPq<)DZv+iFlzMj+&#@XQ{MVN>0N(6w$ z`tUyEhL|Z@FgH+O03gf~DUGZ#!RdQ_P4~Rqq7(ZKcb>ph5eSgRYutS6q6`0(UI_vJlM@ojeFTws5D_xuQWsCSKYv~x8)pi4 zjre!na}rB97N1HvyHrhK*-B@8bvv|f9{N-|_gvZRNs*eudk)wc>ApCwa$|TJCRShv zPtuyySqEDMc}N26h_(fjPcU_*7$#W&`j)r*`ek)@U^8ESOnhlg9yYrC*y1ap#IQ5C zpyqR=l(LZwU{5;v)6ejM&^eovi-wB)`AgBM(duwP^q!eBpS4MlXev;EnOJuZy0VLK z25LV^j@?a$uvoMF&G#q?zj3#M3VpZOaPQ5VLWd^3R%Bw{g36^~`W^0%Zi2q`J4yWH z*?aPc&-m8%c|S=iesY$WpF_TOD?kHi`x7w?q3=W711ynzo{~!>NP}$cBbJ`Gu|%0I zL;;h+l(^v4>0=nSBt_2j(gRz==Y|fheD)gj))qiz3t={%kTT|FK));b0JuFT8>o@{ z5a#pOE0bA|L!{QOfXK zzU~6b3}IiEza@D4hRiI%J_0OF4xbvfQSg)(TerdiBiA&7>9_w%a-~Hik?$whP`%O{gLn&ebk1yrkR*jY9QC_BDkj(_L zQ=dHns?{7Eh_%r|rlpbkU)h#K2wGLn0Un{RsUnFXhct&wxlNEt{9eXbV#c_RHP_+i z+O*@QBx)@)*J#l0Uo9K4IT=t} zwokM-j$eQ0Yq%!Ps9OfRv!fk-9mLN~>vOWcBF}P9M;3s~slMI3UY*h1ErB4kmCbu< z=L!$(Z%&k?J=eGK+4>SepiTyM2!NcNeyek*%qX}=EuTv;Cq|)~nqu8!_W_!7U!H(# zC1OBBYdg|DE-Y>tg3xCzmKJS}2W`;+%ls_{VPwy9drW$_H1)QqVrd>oFK{@B7hptS z5I7@x4yeu-b!5J3KS}-wsIP}LEcHv{(;mA`<(kkr^ma<1ZjTOuYdM2Q!UlYq zILdZzh{0}&2LX9Z&cG{Dolzhe;|`VP3}o4o&wW>_BikQ8tNq`^wEv&P zB*CU23y>cJe7!3T5DXQ14sf(4)tH+M0pA`P1UU`zdG>zy>@r>Lw)&5)(pU)z@YwaT zd>(SywGc|pXDvXb%z7;xCsn`yPag<*o}l#SJ%?xmi2z$aINp-G4MpzrI78~w;Oy% zsY70TJKjoZP&n(IK;g_$)UWt#eYYAn4i%(%m+i@^Jyx3p{q|VLrv?oV-(encIkt|s zQEbKROFB2+z4F*`yLTTW3N8hKb82 zo?qkO2f5H_p6dCxJo*5u0$9v6iz&?JmKcJpY%z5*AT-!U1Nis{$xQObd;$W+GeRR< zDI~k2oWYTLUX<1<$lUs~4U|)Te3pi()qyi^^gOMj^J6R!HjC43I&eAc-@jZr@d|~z0zhc4pwpJ38rkO zz;Q15G_Gi@lB0}Y#O_C)-v?fq)-CTKfg|#;={yMK%tX2Mx*AF3sQs?m3~v1P##gFP znXw-GtqJXhpqRH zYHEwNhXqBYE4>M#f^=I9mv8BeVR(D7R)U*7(81C&;^a>b}OqvMhB^x zw(pE^Wq6ESo8d}OGp<3Pb3!iKoG2&8+oS`4M7olnmhM)1v;PK|0XP#oal%!WN5w#=u+*cval)E+gsh7en-inR*O0uEZ%)h8+ z>lF%(E&wMJuu}qEHW~Rk2V)ht0qdv`Rup}58S^xb-ugDCUA$ps`i(G(gHA`{=FTtG z0si0afTXZ`^FK-94{c|G@{2bs`}WPN)kY?t#uonjDt5pBJ1XlS=0_l5r1$Kdq8O4K z`1oJ=Cf(MR|26elYq^44)qOP#P9`qW82XGbQY#IunZu)6%3?LCs!P|tD#VOjV2FWk zkv;+`YoM#lCfF>psHA@cM#*W|n{^fa7)4LJBeN>SA4bV>WW`BomcUxRvib;qcFF%( z@|kZdF%0wsor145Mbt^b;34g}q5Ws5%!HlNkY3=JFvzQ|7>0OgED+ueD7Wjt8R}mq zO{5=aaM@n;*Xyr`j#z~lS|{`lDSkvP|72qLpRLZz4-Ml) z>ey$sL}&$3D+5<1blK4kxCzH5-H@SO#tMY+Rk4Dp(a31U_RgMmg0}TuyySKrs6ZB6 zYo>)Rx)*W2{M}c#YMt#SffObvz!~%PSVq5UXZ1MBp;M(2(XpVf1}es4RiGKv5cli<%h1HjWAf|hc z&)0}szZGr+6Z%@L-h8qE0Hx7*QZVKT^HyNgh;v6k zP&9cWILX$rF4nnbi~F=taYa5fPh2KV`InWd`WB{Uygimw>W93KHHYdbI?2N1%=iU8 zz0UVJ9agRVx#3P+qI{Qusk!&3=HWp5Wx*UlC#Q6O)RVFF4t8*=oq`@=LbVN0t-G-N zg1#ZQ4hSa$fD8!MLQcfugB3dmUvG|J3(&F102_*k)#}~}gM}-~8%gtL)%)kCue7CT zX_iFDy&tSJ9(Z4zT3D>-bnW32gNt0D$jkw74(uPjup8v0hP=cU_~T!s;A085ku_fTSG@Z` zZ*KW5eB3)YC3refWj|OZg2sM`)0+t(CQ~+c5@r+ndW@rE!LPG*G%9Q)_O>Kvbwap; zc3G2#hx_lDJ7IzxXC3=IXQ!Rd5o-%mepwW!Y_X52N#d1gC90RzrgOj(VrQNOHQ@>u zF=UB@8-A_%evN$aSfr_k<>^($-WivbCF`k#(56~(9xWUn+-p(NN>UQ|Dyd3v$R$Ks z%HDJ2rAITb5KWYJF_M&xPLfFF{$qNNvV;H1gfpu}IOdvSr46UodM47jIof=?_^(*d zCs6?9RKzQMTo(La4xkSfnN>hysc||K>t&)+y}F)~Npg~fyt=v2zjl28UiSRT);BIv zqP}wPJmbpS>k4XXcR8VQo!x*v)zWjXUO6LW+FUKX8r27!l@Ir>uPSI5=zNgVHx{C^ z+3X1l`=mFGb4zW#II|vDSt8O%u^ndb-SWN2FFF-xC$86ioS?!-XVr4AdbSF^V~U$<=)9!??jSz?wJDhs!@=Z0FPQzL6 zkAFU{(|7WDcvdBo4h4l2`rLVAQVGv$=~kDC5MPP@_Z!BTZTL)zbpw~|;KDKB$LWoh z>^H#IgH+|=ZW`ZMOL6rl1V6wc9u!R+Dh$6 zl6+yEpHg7l)#2)jY;aVhPuSagZmTJc7ci5*J7=*1ZoN^6gZm!(!^K;+z3zyPb_cEs z@4AbS;qzk}A8HyE-@YAj?uAn~)E>Axo;vY>bBsRz>;oa;V260W79lX4LUUNI*8Pfq z4^hIQpjI1~C-+r4b`)anbHHz__8(loba=g1E&NaNoE>A0omwk@At^3n?#KX>_IFtdHXG)&Yt!cB1Gozh^k+X{QczdoM_l`BoK6}ww+tZRcS zeeG)AF-x%b2Xp#vN9#pq2+^43 zZg5sYN9hNdL1W9qj6K7}-o94I)&u_DQ#6xC{gY%(s|X8Co5tmxTPX$N`(AiZUKe6Z zsTV3&uP+%l+qmLy)mn9X7tS4?HJ7W(LreDfBCR61xMtn=KKaM`B)Bx%;Ec zgo1eApg36QYJc1-fr8Ku`>^2We;0#$A0kHoqc8qV+x~mlLLK!N2m9~g|M~5=7mxog z!vFcdf7dABf&QC+R6zgF-`4y6Gn(N4c~LK%e`Hwy&x@Yn{?qaGzpwHCzrX?BvgFSH zNZ(!!Q7pDLKkRdnIcu1kF}ACh%lZsIDf;)4lFfOW6xsTRIXRjT-xr@PTf8={ABkjb zi0A*M8od~z6eZ7&wE+*6bZn)pW#`L%mHph4X%D$U7jP-B36L969KtR6mEJkV0shdk z@)d1&U}>B8&4;a46K>Z6=`m#^UmjefZX&25{_ zbK}nH!`|xrf>!YiFJnR@%NqdM(dcpLQIN^mW_E*8OGd7Za9%-`^)n;@5=PEN16lI z?2TG!WX8$V^+T)$n40$R?hj^9+v=8PA@uphE#=hNJwzxc1-{SuTu7>;oGeJ-1Mdt& z(-1*Ncke{fyuzk`1HHL<9T}KQ*0>h$IK=qu$AIhD(;5@4h)m0}o?i?m zBjO%Ku<84mixFSV;eo7E>!+m_33NhiJ7b}n9+4}r<%PWWFF$E=UZf6|xj@@&>Bm3~ z@#CGC$6A@MTxN|-W=jOH`<~kWe#v+L5e+Hj|28&{oj)fy{{pn!{yXh{??3F)RCRyYU6>QUeSzWw8VAbMkbBXbf)&$f~hy!)My%%4kAP8iJ z`PnTG&MloQ>lO|$BDt@f^#0gUhI zywT@yP%3FOja%ysb$@uxpOMWbr8Tc}G51u$1i0I5MuFBt0!}feB3`%G^Ieg*H68W) zFcbvM3*?Dzs(j*x6X4>J3Xb{A*QYJ#f#p`a04L@s5 zzl5Z+*%@WM5xH5pT>LjI9c);G_SO7CD$8BRs(rCzf$A~1Mi3KT@;5ewyJ#oMFCSYE z`#I~e?KG!rI}KdII22)iGdI=No}9#AJU@AE2jIl=*tS<%z9$tS(U6F$^HpJrK})*y zN6PD7CAPWL4L1QR7LEMc8-u7d2`BvImNo-#EL*#_4xbJltC^l=oObCj*n6$vsaN5B z&Bdc!%vlEzVo<;7IYwO z#0oqL=^;{b`@vP`yIXTP!qPo#o$oEAwtFAFUPeex1-_T;}L5xdbcI!<0w9%AL z|H+BVYESC%%FjN8-1b2a;$Ye))e6*l)Vxw0*S~SKdt0(r^xjFtgt}kJxSo%D9N6W~y&-r0GzWI(OaMZTh(wept z3m%7J3b72HV`keh_us#MpSbLe?g)~i`M^tzH4Hk82+KC?4k0*(X!FA<#*?D?gD1NS~?{RA`7_CZbz8iokS7A!FWO=1rmCjS|(nh&n|6b1T~hHK;(aiXIwFV z2#=wuW%!Ky>KZ9~qZJuS=bD_$(%lgb9IlF|e2OyR=_PI1Rce^JIWEPQ%4aex#%b$wzI@Ap|= zKzdbuA>QO$IbopAcb9) zZLx10qkn2{K6qv@G36&Jd>eRA8-EA_fi+t9d_O#~s>+(62nYtrT<_wl%JY05y7}Hl z%I-kx>YXSm*k8AyGM9DW#b(U>vmAVT&#OjGF=G{tkTAM!(=bXLI{d=j>DV3ODGbCL zd4Y1yPhxA)NL4TF97S%Ph8RX85)hzprLj>OT)O;eWx5M+wAwY44Tt(cw6zQa1G`fX zb#7zSf+R0{pM>$lT*Wo~M$qcse2D3W(fc0!Yk5smjtcGaahsv2^7(NFQW~v#CQV{#@{R zDkh{=cfj0o0*H;DmoIib9w=2qUwb_LrS1o+-%y)NXQ$!7dQgNa_SMrEXaC63k#*XSkR#gE0saXW>~67c4wiiN zs`s)>*qd%zaOpM zyN9o7>gnv3{QV}2?>&#Vq@!r(F~8{dc$SBD?y%DSh!@Ym2XW7C5%v}DG3|1=s`G2L z0HQjh{y7VL<6(0w{LXI}k#~RP@sVr;$k=CSU~4X1d~9Zc=9JyPPVftEy<<0(>?Jba+SF_3p>`HwpB@dE3cL|Agm)D709o0Vp+Hl4t^wNMTr1q*t1~{>X zJ-af}r}FvFk*%IB?yUb-O!uPyihaEB-L+L{M;R;#fS>4OcHTP8#lTdyhD(JXqxpUz zD^G@@EP9*ahuXg1BijyXpt%uHZ3F;W?O**@x9f#`lrc&o3=9863*e}U9)KH0q`4;G zp7HHnTB-i9=TzXl)wy@9V5l8<;T-lZGd)-OVT?fOQc$>{JYYe!{yI{AFQ&tkYo{*J zXwAFq9&oNah>%))sHibxGlP_)*whgSApi=r?+`-s4TragULO*oTO25Fz^-Ls;>{aY z>(?%`*=xULJ!VP-YUp=AXjV5&Fx6X5C}>$G-G0Is3D}smGQm359+w(~Jg3YL7EJqQ zaH;vH>uYBE$Aw_Zr3&5s-Lq0gZr#LK_wEV5Nq(x{)kHuc^Z7e;+zlu)E<#PbTOEJ} zdX-*z;PZyrEH_8R${JiyPyZl-BM~>AmJi}B{3~xZu--BWQqz9^LHg!3bss0?XY0)A88vx>SPjXVAf0 z|1qBLRnf-1 zqcQMC%7CVtEUI4$D9_0IJwF=&2a-JnA#8BVM$hFquDdBVTh}CfaZ*?eScGOk%}i^) ztqeq1+)4r;{r-@y3t?S`_A#dgPKJ-qdn?N(fo(lp|$$_@JEitp?z5Rl7WY(wN zk)kA5O2JT|9zN$fW2W}`-jmL9?sqBd>*iA zJMN~oKyZXRvPb7o%b!VgM1aHNlY5U6%@Di1qW`pQKYQ;waVh_*wO$p+N;ub?3tB{R%!y335) zap~^PThGYVv(nv5;4MAjGZ#m0yw26q72s0P{!}+_zx_2>kjEUc{M=LrhgP&p`jqH^D((Vq zpEl5HA@3nRzyBe!o}bZiDX17n-uma_gnLC~$lKA$BjCajpv%}G0ZOVSUf{3 zk&mlan~c&XTsPSTYE{y8b-s_^dsu7W)1-kvW6FJIAd_qE$`LD}z_yLs6>gnq*vJb6 zEkIVDFFQOzk_3fCZ{YwtIE$qBA{1&r=K{7r#8WK!$j4_6*5y5dV>Ue7p2N9|69~RN zUudG2NF|j9-3AtNk3Dkug$GyP#O>RDgOz}L#={s2X%KPhJ?@!A0M>AzrQY6^Tx*)O z{+XE89c0cxHw8KKNWvJ9f@ZEC(O*+FcT06sys2(R})Fzp6RUKZZJ zdA9m>INX{Cf*!~-Hy>V!szk(0k1UjD3iP1)oS25=|KNF->g=AUk|%B$0LD~fX^HO@ zt;xCi>Dun%!c4iup9=us6%b?ZzMwCmU6G&l$ z0&=Szp;gRV)M>CFD{o+rg19Q;{qvET(Km!!7mmnLc z(?2xP)1Oy|-tKtDFV>xUbwj*PCw$B2>kqY@WBQ^}u^0{bu z8kW9#pCb_gBV!o}bedc-oQu|Q|1Xi~*Xyr0(olxZjSf3AWPnx%tZKhr$(mFS@vUf? zDiJMM{iK-w@u55ikgt9G3>*jr4NUXr1856#@!ucRt2u(EcgXKaki;|EkCr8h2!wp3 z?g8r;Be|=2CtPY)@1zCn-V zaD#SUGED8SZ@T95fL2LJ4&J4QN|CI#5l@zx*taS7?9Gw_B?gdMu4qZ8s}6cA@JlY> z8UYmv@R4K-XDwbNbc0_e!SX~vX~^ov-rRg=YVd&=^&wH9u3($p9g;_mUt$lYgBB1t z&b>he^}Il-9r&Tb0IF+lcJ%&@<>b(8hQ${CAUZRlYAFv|^|r>&O1G4Lcxv$}EAoV< z!Mf6MI11^|i0aaA9f3iS%{hIU%M+P^I5MXhv-}?CPUxLE02@t$QIM+-ekf&?9qlgD zAs(wO_ub-64;k-LQ+L*1r)9ESyQ{#Q8;DHtj{E39)+D$TyE>X5_62IKGa*@+w|Vk6 z36_^YSnN4tIBJg9*PIQy>~?1S@`-`4kyWf4onv=ui8TN%I66!K<=Xo8u+y=pF+Xgi zWTxtLS2pgp`GVZJkMH=*9`NR2@j9Z4^LdZUouKLV-&!b9ilHN+EDS0wKOTf|9?E5uisnU`|Fkk_2* zJzLGI&>ZK~q~<=ys^^OC(@uTlY4ctVsdcJykYoK^BOkx<4g}3AXmB zGf-jLQ-JBlJ_IELWV?jVW~&U=U#XI`Drsx3>V5-pR+v&8J|O1ONdj;+d1kW%X&%2R zkB*FZR}DUBl0b@gq9LP=_jXfxXWkV&cO!;{&T;gqw1N4SJrs%m`6$R9=T8|SI_OC# z8fM_oq1;hLP$*x@GXUO&ZXg8kY_bV3kd03*N-)7h#wn8kloSsbnzGpodPdYAD0Fa-Y3sEnCVubyZLo*#jRGJ!5ZYbBf^GOq@qsk1 z%Hcs~ROvGg-OB9qZ>(f^bFLcOO@lLe7JoI4UA(WCry8=xb} zasBDZVNNp5{x~a?uS=-D)3_KjQ^i>E0uKl}BNdgOo14l!m)i17e1?i%?eu*S+^y#N zq+{c4wJ*J!T*3Gi$V_+ScZ-4LmrjB*(Xm$`P;KGa>F*yZ!K9fV?t2B|L$;>Su$2>; z<~Gao%jJAT3^-$B^Pbh8rV*QUAQx5BU6YdX(|S9ebykG zmTEy`hmL`v%nqBXt?Xl${ZFM;t9dRq0O-NV5LjARZD|siTw_~Qkm>3eNBI1%49FW< zu2PCJ$g#CQJ)1hZ=36T%AedT?nyiw1V*j_VuiDFt9%{oBO(TloNSdsRJn2Pf=U4;z zAyqba^n@;kW0gaG7Y^F3tXR`f_A1j~=UO(fL2EP~HSbE8o2#uK)~a;jMfm@K4qD=l zC!)pAqry2WBtS6Es81xcn$$CuK%SL5TE}%C0$m-mq5c#4ocrLJLL_*~ld?GANunL) z6>JTisH(Lz*5cKt!NNs_78@kU+d@}O%=TZl?9$Z~uYPSBJiBfj7pN=(J-aI`DezmE z@rv4z+lG76MjSn2(`ln!K4gcveL6AqBU9OqTi<&Qk0yVnpw?V>;=I|f-a+&LP_Vt) z#H|WsX!*fI5!YWSv`s0<^DChnE!yWbRwA$@HJtf`t6!(Ol*g+uMkj-QTKC&4Sb%hU z-J!@%Q`@7u>dQxqCnxB&k@-x&bM^`<=SQPJA-%RtScEdLFdqeAu3zihl`L^zj7P8S z6~k#@rZWr}Y2}%9+@#W#F65Y$gwcVRYlF#9!H=yD6t<9(a(QVs)Hrv+uGyQhb#R*-CGA88G!-;W>e9YInJ!Npe-frY%awN=JxgB>D9`! zoT)-UU1A^R+i&SUTkT8Wb1|!bp`a}i!A7S|3B>73mm)`{e#Lm%0s;2_Q+40}cVqOx2GoZxZaXj)0>&GWnMeM*^U1wbE1m? zI%{Ml*?{DKT&iTR%yixIk4}R|x_ZC`mCQ4eQX_2t&CEEOE{xB$E=$Z*gM=jTNz}|& zLmO8yK%go$hH60<2!ML#=#glu!U^I&pfY*wT%?<%iAQnTs$TyLL^Qg@|UP65c{RVoKeBs}-gI{L+23Ur?&_6A4N0P58WS^ZcK zs5gMQa~dgxs`-W50669Y9-PcJ-&Mx8Rh7Gq^X-XOjHT#jPMjOQWgmZXG%4)Wqyox? z?hieeuSa{Uzb&Fyo|-MhhiLF*LOQxmb-^xJ=gjn^>C9Rdm~w#B?&YW;r_aHQJ*6$c zOZIZ#236iuoBGkC0;{8x8_F>Pq2pGt8NHg(567O6TSv=Q8gFph=RU4B>_ z&A5+S!K_hmOIX<8mB;K7fQ$Atm!0b}qBY?=(t{(XW^%yxZTe$A` z?=bh@cor^9xJSw% zy-3*lbut;xF%zdV;@Ta1BvtB83ux6rG9^o~et;;Vj+&Dwaq7xXXsw0f>5Xr5T;=m< zH9Z{Ptu3C>AV))77kVQYdrJYpQ=r;Y0Pg(OIDbZRdP|kBp+>?z^$CN@aw+_}{@)3q z-~9hi_s69IJ0)cP$x&3OXJnXa0YTCldr49Ow>cZpsB>bb);R2VcwJ%@V3wvv^`uW<6k{w2Re6b1}-wuEj0N(L?{4XN;3;|&? zTJ7`xp$31$y!Z9(LXPMQZE5*9Rx%KhhmOh&BG_&BJ;l<*W{lZ75`^$5f9seCOR2ZIKMDb(tn99w5= zHqtx-*sNcmg}j8>hD27#bKPG6_x@wLa%(G z(b2*Ui4*hjHJ?K*aI%o`5Ul#|b_U#ZIcA8x%qJk8``CPYj3kJpS95&pwvOaImUy^B zEGOp{M5D)Ww$9bN6=kWYZmICO%A4;o+?jh0w938b(~okUC04J#^;rVD-hRHFB=J1; zbOP)=Et-?RyHxh*B&-FkUDHcibIf`lA}63~2kp_B{)J#48v2Y&h_Xk*tXjcbw!El^Vc%|A;Lhn=8CN=6&1w1L=}K zLLST5N!QBci>P!Azok&+>}H{qqYDL86(3iYd?*zc+{$a$Q|9XO(owP7X;^xW_1OuFN8XRy7`kPy6%@vyXxpz@FyBt;$ zi*Z_#ic*ZVBjhV?(0J&uy9DD4yK;U7U-|fzUcbGW4!b<=p1|Gv09riD)Eix`bd2+_ zM#v2UluM=RhOw?%K6W{aewcq)emvT}g&Zq{I`Y+d>Z^8T7Uv*>tHUky|!pxfDS z2EBDK8bt|6ctJ`9jJ~n_Ld<29Xztv3m2W?qH^gqfGxiG-*6Y<*5?xZ}AmuV#QKkN+3D<`>~TIn9%M~bsF5Zx{E6!^n{mLXn*m&=#P5-$r3Cq&=$pcVk z0u)^|8}={PhR}0m03XHN!2yOcs3z}H9iJlh@xAc%rzvW z6|%R`w;Q4pIbT+TpT&AN%_aYfN!A|$QwD}Jf-W@wLh!)*1LxIisBw7z*k|JEudF$O5ttgCw0GVZ#I&7s=1B{l~q3MQw@)JNF}+^HESZ3=6+ z??WEs6}E&*Di3>5mUOK?QdJdV+3>0wAADi&r_` zTO$EP?Vf(2-3ZEZM|4R=V@*4COo$yoHWNgH#-yiEthyD(6%x}l{$zl0(i@?)Wt?)S) zX}_bO)?M%er0fto4**4vWYNVy0&8iX`J_jY#z9N??oQRbsY^HN1Pw-gpS=&@m9H@+ zvJ!-pf~R_AsyJg^ZzlnY({va>VOP1Y%mfq4SafU>lKPwS!*J@d#Q=>SfYR|Acq7uI zf~_tVewL(Z3{K_0_V`4uSVaD3OZ4`6GW4nF>!V43=X+WS!ry1@pl{~1BBxAc#P9*7 z&)^o0<5EG1u10>1SwzaIoF)g-ew0_OJE+PYw89H_d|TFU>8Tg^G<5ZojM#=6*LEM` z{6qJ};k7$z23D12a8s+T3sksMo)G=xP!8cjp)p8&QLdW(CSGJ?H@tQ;o$mpKE@CoB zlUfyR^CG#wI`kWY&_&;*q9&GrSI|-pHlAq?c^F<}H#s9Tj@Bj*X&%XGy~9Hh@)b~+ z30m^5{MtbWWTuSVd9jArzC^UG4dj!sTrjwa;vJx<^e(%EC!|BjDb%;!Z7Ky>rr;#f&m^}8mtoikKs z0AwaSE!Mk|N&w{O55fi17eGKe=Fan)LmYZCdGG!&@%vBupu}Esw%`rNRQyuL$1vYF zeOynZN3t6HSd`d<@hbze{m=VGcyPyU?NGNWmfCHxiF>|p5&lFA;moy{9GdQ@-YFdB znr#`7n&i(=5C0Br#*c}CgEYdIuM%iaAk2B}{y4k6@KHK9!?;mV>U91f~F_S>c+B03jcb7Cidz`<@4ex_c)k!4CO zpA^(+4+o@jIaFivye>1m`}Th>Lru)lPc=pql<262S*TuP_C7woLV<+t zy)7LTh6iJv5%bxq@2S{)v49b{D`ww|CgtSwA0wY3@l`DE%Z;Cr zmVo4nAHhN;^o!Ffb@> zX}VWr0jO?jK6;$lId-5P&yd{e%ja|omz_4Cmf<%g!q9cvhb<|iL@d z3fTMu`^;ma#Tf(crG3vTZa5Bsc5cMp63w?T*Lb%{fn&E4v~n2C(xivCb485&&9#L2 zi7c!s&ZeeV4Nf$&0E8`p#M124EG3f>#E)Z>4B1QH`e@l7l(c6Z#Ha8QDc#{J1oP+* zQ!Gj3vNX@fCxNsZBYlxtofe5T0>GrnGBQ{N7iX2YtGmQ`1%>BEqWYb7a9ja+we{CN8f2 zQsOGc#%m2j8djD7xp-B}CDPzEq9agT#6$cE<*r^l4KI;CF%8_Fa`ojnCBj`107K;R z9W=ItxE$G=JyggGH;n8)2Ss-}!(P^WNcpFEYBpi**1x?|*G=#n8Q`)udi0IC&LBZ+TC&jL&-Sc)EJ26$TT6iHcC~)LHa1hM4HtF; ziAR7*b}T)$C;fMchplTx@E zd`d9_A0!oCYJHHYjecWWQdZxgYPs}FzG7*JviJRb-uEC5%B|P(&t{~!5}&D|joJA( zBwoB1`$`ZACJVBqSslMkqme@wnRpkZXk#P(IIh_EF6Gu2#1Ras9+0Gq#k7{`=$C4B z|FQcfN71(Cp8|FDY6_(l;bO;B-UAKtcUtUt*5i{tvAhy z!r3v2`@dwFE3gw2$-%DOPk)y8*it0S`9WF$(Xy;}X%oVK(4zZ{yBUywbj!mAs-L)Z zYa}3S#{v=|zZ@(Eiz;d7&y`X0(ZuYqa$FdHkW+Ciq&VVu`(>U-xIoaP3iX3}ACptj zwxi#yliJOlAd!T4ZBrlXgv{@pib;g-3JQW`xdrb)zbpAGzxfVk)Xd?{i!+&Oes*cv zzZPWy?7jBprd_@A#sq&3<6zjE+v6y5(`Aj4o{)iJC?GW!w6)uLTa^CyM=1r}20h+C zXRYL~*=Wk>W{l>GoQz4zm}F^<-I4%!(<&5ME@ij2O7Th+Rt!SCB?HiZGi&6SpvQLb z+p~(1D4AwekqKH%kP1ZsI|oUiTsBuTIgVO4&XzC*_iclZGTj3K)2Bfywb*)DPy=hU z`kfE1y;&f@Th3c3@m9g?ey5eT>G(&aDVOnQzMWNj$81(I=hGwfFB6_jnpCu|^a7^!kU{!9mIMKgl=ULOxfvMq4!~HilhumL} zmem*sGeTMoFK2EJe1Dace3iV+VA{vdo$rblzP7)lxw-N^bv?`2YiOO8ynHy!U#evA zTVJo!_^a<3={2Kt^V7@u{0iDiXDrY_6>`?q{o*jY{vZsQsd^dadtRueuivl-jbzwZ z*}L3nkYS@(gkP7=%^$`h;7f8$;cM;z%?2ah9~JJtee>W>OsbR8*lo8|Sv8}lH}iWp zx;{(#SJ2s;`HX&VA|;)Cn0r>Q#75yUf5}BbaIgHk zul9A9?@65d(lO}##I?SA&v&hcBJAFcs*M2M$|$J5Ngs=7=(D})+dDWRoilK;A+dwI z>-%DyXWgzeX0h|q^~X95p7VlU$s!_C=W26jczSNvudf#*FWS-7;%DzppZ|+;D>Ws} zosHg;xVhJWv1Qu2{*4Pr&HlH+2R?8m^%%IfZNSj*xwWORxcDUfvt-$m6)OB-3kwV8 zXa!yYv8Mfw^`a)>iikykBx>!Hbo$^O;#+@*>FRoMWm({68&#lA21i z*uJ3a?sv>WFcan^a`rq~_8tl)=(W>3-m&69K|s20kLBm03m-y{1k9G6SmfgeN`axd zi2RlQn@#w_i|Z+0hk2~yBtyeN;WcrFYg>I3=D2rUe<{vqgZkUMrc*Hx7ezv1>OzI@ z+9?ff9fsn)tMA8LtdE{`jomeuZS4U^SXkKQ;OU&9v$geFm6cHU>6}Lr`SXau$-%~A zqbqc+`&LLqRaSN4b^TLFol8o1c<=M-7yo@@wV3~>m++H&P`;y%6I^~ztXyO$BH`lh z^`=E=6;zB#RPmiG+vJnXOWuWEU!~`;fz-Vj2de8ToRl*WjbhBZs|#pg#(H%Yv;K8DWSt&6(_Gh7PH=RVwaX#c}&@>w%&H7 zU*buxm1m=LAgm2K8>6f&|1}^e*rMn5Mb4LrgQ1D5P<3kJ52B!{Ft5pj&hN83p5rk` zgBRv*T_7$#>PGk%_kLd||7B-s|KjM_wwPWm|EH5flhBMB-w%%WyX6V`HfK%?A1r=4 z6UfkLxckR=^YmK~ow-77t*zw7=ZFn(s;Ko9skrYwm8^FHatFd!zD)nYp zI^@5RXa6Jz+7^RxauX5sAI+S{`k=&k!A zd?XYy=|36&tmAt|%h(sTANLWZ9jB02KZ%CTh`n2reCqB;#x6_GzB=c-EiNcD^%WuS zO;m9$7_s@K@WWJtT-O5K&spbLj>}upVi!?P+z=DpR=2)Evx(~9u%CAm)&uPO1qo zLN}1PM!7+Dy#9ov6>CdnzWe)=)~{4Z?QdCB@6a4+7TENuI|JOw$A1tbdff+*s~i?p*(+=HtSGVy%dJhM+ocf8{ll$dT`sd%}oNlG2Z=>{qw)2ml# z@N0-E+G9^;r7m9I5SOzx826s`9Rz1F>@4;+!L2+Mg*WUFGxwUiy1KinuIjF;rA#V0^2oZDOY3J`rN9!5I-x$oTaLE^&d|9s+9w%G)j_n_j?sXY6%(Yyfdm@XdUxH^~ofY?1Ze z;J@Ob6Qml048bMh9kcjOKG=&o`3~jdY5Fz(ZPDMeeM-OXt>is-_aaI4!RvdZWnAd( z=OMkjW5o6_0ap@WJX3_zk=}lQ zF1X{q=N{DE;e}Cf=eWYvv1&NhKop`S>I)Bx4_(8``_~+ciXajdRuX`t1lj z^^4d)_;fydpQ(&lK@+@y%2-7H&iXXxVg4a3WF{>iMe39~oXLNNZ`ck;JlAC72^Jf}tG<+2#t z{(74m-svAQ{QIGfTdS@I^&9+u{6-vKq~pxz34otHpQN1MyTfqs+B)TrkFR&PyT&xi z29U|Klj1q`H+=7YcH(0q4z$^O57cQk!MwzA50V^nRr`87>_i##5- z{qoNW+bHFJMDN<=gn(@15Tt31@73d$Ooz35xGMGg6Q=;ji{L-RH*=_Rg(PRj%ifCjO9eZu_>Ij9o5{9D|^( zndHU8D|=%H;z*WMg+#4u&rrujIicd7-p4TJ-cEywl5?ymU;)c8cWX3CWa>gKO^{&?nuy1!Zw zowavLA7A^&@pJcO$Yl|I*}s+EI)`J!O!pP9cP>hII>!d7nr0Gn50>QqM~t$s?>trK z)jRT$@TG_M-|kI&EPQl4g$`pv2U-FI9smo z%6~dWNA5s^K=#kr*b6D8E85JRPQT4gkyl$GSAud|mJ9-WD?hihCwr#rFK}Td#rfr% z_I`41&RA%9kFCGe#cY|c&VMV%;cx3FGUY!AX&mindXaAjnDu<32kCg({;XL#@w*3n z#AXp7KyG0itT-V^kPtl+@XN-y7kXoJTgY5$9 zI}gV+6!fwZF1o(!S%v_s{)nm)2$yTiXlVH%H5o%M`c zpHOSVSGZ@Eubq-Z#6k?QhR_*Xdwg)u4v z^Y)LV0h`h2pdln$I-~^cXq9ntm!vp@@-S#fG^v?g4%jn>Ur_iNC}LKHF;O(2!{`3p z+UDHIZ^a4r=&0?ffcza+4?T)#>yB_@{BOP690Ld?py3T5dJfUa>EBa?CI?Ojb9&W& z|3I`eF!;r+X_)5R{}3oRBp^3?i@9JrG*gR{9SvpgKPFIGL2R!Lx;g=Hd7>3Ld_o7> z^+eDgI6hueS-DwEyZzHX2A>_o2l}j-%{40|y(<2WWSQCA^0g44R~bwch}Hc6qpC`R6Bp0_;&PH4=IeX|9n zXvn!Wn>9F1%)tTdmrT%(6h?g`7Yp1KaZG}|@C!nag6OeIlw5uZ?p9x8g<@3|Ei-Q9 z_e&-Sj#s6AvgN#%osYxMp6Cv~!+{H-u}8~Fkh{A4k3ob`=z5zt6Yu3Gxg`NxMeqFf zsNii~$?&m=Vb(m35_Ane${(CLMYoJ|zJ?I&9f+=txTaAos%Y% zPR1My?&;Au4|sQv|9~;LqArQkN5mD+@A9P(g~>Z$aMN#2Q;bpWj-y?}k+(}sYe}dQ9b_MfT)8^QJ0u6BwM~Zod*mWGled5l()mrL#dEtla?rXK#6N`e#FFa5mCHNQ zHyJtN_3qUsB0TiFu-SHh&79$Tt#i8PM;p@*^QsNcQsf_hd&~T#@fbj3k_05HbUQx^ za6I8}TF47X^GC3J$}AknxA4wEc@$Z~(~^w84!9B}C|Ot+pD5Bt#8KOiz3V06g7w1@ zxFUya^aFg|kR>@okdGq|^w!_8%kd#?WI%Ye#~6}^-iT|L+$P?kUyKhb2K#M5eBa(M z{atXnm<{VmaAo4^dAzvyY?iyGyxIn1+Qup_vJm+b4uF(jC=H0HMh{;V{o+E22Q03T z>!yy==v#|ui&#cZFK`JmNN~~*juN2{w6&pI7{6XeGojEqXwYt<2F;1b*3QHH=4j(l z_8C1d?#`-lj`RJX0d23t!gW9BUNp1D@PuH&i3}55$4Z9 z&(4u6O*j*|CiK;f0t_@-)&S+aDupM6@<1l0ut~?jP2tj)LpVtlbT-=V3&vj`&u;=c zd$M;=UCRF;o`*hwlexUXfF9MIGk?cq1FOuglYf~3*I!>oPA8UE=06E%IU1f{d=bBE z$7H?FF39uVp0cmACHd)1=oqyAN7aEPzji41W-#<5nK10#DK-sowE2Y1j%8_|RPMUJ z=lEvuYr=dni^@);O|q7IiuTo4)TlZS2>W{;YEuI$N?@+Ck4_7mvFOHGxF9qwW$y>V zZNV7$1_%0sX>8jtcoo(nG5)*y2r;S|-&{W?r+ecv%q!vlyu?xFhvUCeic%ZJ9~$|n zB4BYTbj@7GF%BjAl?CG$8OAT1?SW{r&{MeiPVgOmvbALzxrX^46{TZO#X5)#jENBW7V+~0X>9r-?UM!e$1R7|=5>`A`0&rCqTppD_)5c+Q>*Z6h( znw@wc6GwTPoYJvGnVdE%;-ZI2Bah)VoE(yxOW?lP6or{YwMm?=T^^v|y8_fP7#;;&{-8Q* z1q~XrXP4Q6{JVOzZ*qL==2dhwQ?lWv7b*!q&BE-nc(~)fID4>egHGC4z{FB&iKdgy z@b}Jj_oHGQXWze?PsLdh5D9kkIk^*kp1B6Sc!u%VZecj=;ckW~N@1wAjbg{xJkHNdfHxCW zBQ{PeGViVR#4c@0dDLMxaUQqq)%h)nykb&F8_R)_W01RZ<)lRTId8X3*T+vkdoN?N zwhk#S9(e<~S)zXHptqW1{>ZGAS6nsu0tj)HT${ zoZ+PJhCPFe%m3ZhTa}89{F)uT%d!eA<0)*&Z9udP zakm>A>LrU!wdn4cIqm7wn*;HAbs0YU6tm`bC)q_m}@) zJdnWyg!o@$xO`!~3kpQwc!8vh%_3=@`Kr>jP(9z2rf!!fxQ&Z%F3|8nhiuP43Fp?K zn)!H(f;L%bFvCBN+JctJDXf(|C&wwhphqi)^*X(FpIc|S318uRl>E#Cq?Pq)a(Z@IouYQ#(s z$V&{M7dYWsXeyJ9xHn_yls7VQjRF5?h6(LfJ2-jrJl|w47Em`2C~AU|+DIv3=yKIkGRd16{WE2Dc2YopO z8x?8S15WMl6R5UQURyYkAEK990*gw|rrlN5;4dU0azb%PAb%pD($3ShTY%|9vyc^q zYKF3xeKO5_Y-%ng>!1SwU4Q6jK30(nW?EP2qPiTB7RC{>TCH;~<;KEpMCBJIKV?1^ zt3c_|6udeLcr%h)owv>F@pV8b{Aa~BF;UPbgb7g4cH@gdz^f!t+FjWw!>bg#2u{c) z5Kph#5GHMK^Up5;URkf;dluI!VJ>Q6x5$Z1zyZ z;6n0gSO%~Xi@=q9;>d-k`6l%g=6P{qcoK!P+#UusOMUWo`>d$;H=MU|@h{}393mZB z8A|XWZT9sIlF^+r+wGklgGaw) zzsz@;gRpJEg_4p(snZGyUysG+4Y=B-x9~3XE`@;0k7={ zP42o}TZy7EBoVtZ%WBhdDUc9JuK5s!;RBUWDxgBwFOIKlqT!McmL~ljW;Cuz0EH6! z|E60yeKldXv}6?I=Ck*Y&SNIcF5s>v^eF@9>r1=uLo(t{uMiKee#}%rg}nU5Vfu<@ zAurx^yw~}T6)zuFzswn8&R7{W;`PZpQS~|B*(r;2`lp#3 z`9Lgbsy+28uL(Pb5n}N_GV4-0_Jes~M>PCVM&u?vG9~RfC4E{+1`QnXr==kp*@~+C zvTDAs{vpTtl9Ie(&HP*{yiJoett8^NIXDUricKG`PM9ng^y`hx!1Oo=lx%ye@WCr6 zzP^9YZl(F$D!b>&^Yu^M_W5Pmf!G10iqsK(Sh(MQN8;V^%s}FT{wQnoAFe*a^Hoc| z*@S=}ni{5E%Ae}lB!0UKrp8z~lLk%gl~uW%^8hc7Ydl6x4!+?|1_{6hE>SVt2)gr{ z?W+^MAKSuzlPP96sTr<%-z?%qPyE4(^BO1Ymex1;IZyhC(fKKuy61@C4k$csi_O3e zsSX(r=d1i&+c@P{U2v1x=FTc;{ZU0IIIDnxoLMe1*M^JMg0Th&?*D?0R_>3g@G@BD zN!IP<;%LviqwVd9QE=qI!5jp>uh<#9F9On?`}=N5N8BGW@4Q~(K@bC zWAB{RcX>qHODB6H=;>9K)lK#Y+9y~8Kw_5N(Nz+|oxx3QkkpsqbXgc%p2z9f%}^F1&Lz0Ry5_U`Yjpv6=Xvyy%h1XH*^)8Uf#kfL|F3DNJeQxQV@?}R4m9Kgq{SU5lBOb?1XA_!yubPE5` zK?kV$ujZ@7j~GkxK1S+upB=HRzwIi04wPR;sjTS0Y9;1_S@@Kb55MGy;T&f z1Jo3gHjD$H?0F4$w_P35T&gUYS_`DVgM*?bb-dgq%tT5#{bWMf?86azllySxTG{Qx zt3JmEbR!lvcTuB{R7P)mufJMuky%lZJ7$)#HGH*aFni&5v)r6@$FOGNuM}r?m^1t* zo7Gl{QcBEflmHdAe*9%VY_m<$Y?T-e6TB=Ak$tZX4_b z$?oP$TC)^);_V|!!G~Ap*j>>{W`6BdPuj3f*f5>V{C-*^Vj@OCcKy>NjG}rB#hrZ^ zm*s^2xY!*($*Sn`|fGBM}ykj9P?cu= zx5U;@ZU2d>?I21_+H`0&WuPBCg2}=2zy~~7f1wTA14M5nKgxYXuCwk*B$Jel2KJ*2 zWRT{m9algzj(EG>T#W)=tAC}@IkO3+1&12SJ;h#w84;r?ajc%97<{Fajm3Kp3$yKV z!fiE|5pN+IzUrIQ(})!+eA@kbnyzSMLeVd=Pv1F0+&Utmp3;Mx^bF@)#&E&MYp|#% z`*c)^*9tyl?SA<)91CIZ>)kU>4}ZZInJBI9Q6H7mFNk0+fu>??l?M&0JnKA9+NoWe zchf+d{{7pn@Wt5ULM$X7?F@0~PA8eTr}ICQsoDMgar#t4hf@D+W8NYj;K2{$Bj^{$ z*o%txEow`dqlkYL?d?=HUb2{)fa8=WXS!rE zXz(m8{iPbG{*7YD7)vca-;oY%&Lskk=e^fSIcx3+)+gj85~mY~Wg0qV!P~hD)5yf& zgduAWHM%Ziwk_po{Y`Gegu^*t9*s*FMoo}-lI>dt&~)}KQ}p6LHd2t5&~CukU+8aS zWp7lMLxtwd|7*{bkbmvPP;S2 z?m+xE0avp9;&MbbnVtXXnl&g-{W)FMh-3U96(!!&^DHQO`r{rf!PVl6k2So$Owjvy z+E}eu3Vc!2o@+zhOEL1&Z(MtmcWS~W#nUgQ`;8P5wR>&iAV#kEB>l4l3S`wk{6^@A zooMwH_&;|65VaGw^RRd=nW$)>rs^h`mhD&;)`=xZKMbB>OPM|F-zgg}Ss<`s9$Ax4 z*LIh$>Y%`iKK6-M!QknTBbpjbs`sG}0(?IZmu>Wd#UZUQyeF0!YAQ0~%#;XYFxTil zT=N2KcF$ERS=1Kuy!^Qh=TlQ;zJ2$;RV8`_&yw)li9X*L@=o z6{gHBqXz$l1*J(CBKW1e77+S9Uo{sM+3dx^#<0^HVw5<2!#?G*Nr?V39LnV(w(ES` z4IP}2(Dvy)%>MlLD^x6wr6~=8lrxq-#TD&O1VFRA?p9d%AXoc~8)4vGtwW|AGwBA&fj@t~%>nXhZYQ5|WusOYl6-zjc;MQBVxTjXp zyYJ+`p<$2G1*HoX>1Ij#s_H#KNDtjy1FZ$^_w>pycSUu@ zSKZfVBr-BsD>${YD04Y>=bUaSs^##;0GT1CQYQZ}< z32(~&NvXY>euqJ`xyu1jTTyeq!bX$=p%Z@WWL$RUyP#U?rKP5{yr8S2A#I7z-Vs%o z6=PbWvg04%t;c0S+C$lhx`+-21=W!!SL2xTXt2Uhw(E3^fmc={=74_P2vjh-*s=Q{ za6`)$96Nb}&M<_uf8WJUDy%8 Qdfh$DKJ60*zgtc0;*A_^_sbANtWM}qj?F~BAq zFG11O>W2`pWCzLgz5cLL7Z$G_-kqY9y&jin35tyBv}U-ew57!~jKb_{!dQ&PLI>Hx zC$Waq0t>o`UqT(1jU|YDwn*%9sWGKi>R4%rO#6k_otyFzbI2HM#^iteGf~~6`{`_X z)2S&zM3Ifle#OTt=*TK z9jlC`NJnBpzR0(|sU~-<6n3`+*i|>$e6hGwd;l2I39j!|MrX5KQWRmZ*@Z6FhH?#{ z*^?DGz-FS8#GaF6C`Rhbj$X0!uzX7of$#l2n~iwGnRGwb=hu;0jq?bqH$txXxWb`V!|8nSR1YstSSrO8!)Xi&TQ;&2P`!>v_2c z+6w);&6Xar++~m=u?>9AG(wQ|uJ9G@p*{0y^if4CQ3mj;`Ly$T(8~QLReT~+aQ9mt z>3>)0;@;|X2eoEQJg?83_0=CAZOR~HH>uAHJ3Yr^N+S>F{|EZ8KuZ=YAV0DudfkIS zkj`uccX#Uu5@mb;Z|4jc072hd)oNG&$7isvOd;Z>{;;#bF^|?%CfpSpQs7Py60)ok zq8~>?z;eoMr`U@(n-L$`ZU50$XjnFS4*7;tch{w#NiJa|cYk3dUz7K=YzO9L_62Y% z5QKtW&2Go0)>183z>9Y(@Ol1yI-r>M#~!1-Z`zXlS1@{4t2Nmvl-*QS;YWb&JwWA| z8TebiGYAdYTiN!mDKisba1k)bypyb78ehlbr8dt`tC1n8k*z{L)6w1%`<|R9K!7eU zhgMs)1G*s+CP3Bg7Mnjc(CrE!z#)HrnUPrL%ykUTjiEu5yN_{n{0EUdUONZdd zz%Siv=hn{>oE1h;?(vE;f{d?bdLCBiU1MW}*=4Mc9)5Eu9qerRFMrQ`7v_9!v;z!z z5&!reV@THVYEESP2Wnp+dXiDz&bYmC@t6H@iXp$gW@J(jBY~IfJWms@z49w{Tbc-S zHqC9B_cUNl$GEUrZn@6Gcn89c;Pzt}3VLE!WCkZ#E)dKzMPyxI9xwD75N4=|(k;*A zofnyviG~0GHivG%@forfJT=L>E&ihJUBO)$7G-W`xqwA89lx%5Wd&YwBuY!x2V2Z-XVEB zcul`nd2agIyCCa#BS}rbMnlK}ryVtn{Y4V0)9*KUuQrvD8tDUE#z-D9V)03~)(42# zLB9^XcSbdXjcN^UtztIRVu+G1)|uKL1!XF0a5k9MoHZI)@Ef@*BKm(=Um*;k!SLho z22voEYT-||NP|2sFv1eRCbrU}b{D`QK#*$2URs&Hy*bD37C~5I^qLjNXF`(5lONaU zKl+~`CdL5F3~zyX=S3)za=#;V4%Lp^`|k;2KN$%K0-1=*L zgM3HG9~z9V%!FDSS#94r)_b{olVJI%<@d$-w(+^+D3HM8*SK7Mxml$_D`{KPG5yor z10H@s2J}s{Cvqh1wz8LA%OZ_VEaYYr8joOf6IS*D~vP zABY(&!0n}z!3HsQ>Jk~whh124W%)L&Vr1b{u`A~_sw)QA!|7XA?Q+3|$~*UvK?P`& zu6v3gUr9bf=EaW7;~d2@v}K$k%JpCQJO1~~rhopv(Y}lttBz4u2oVv@7X-Kg{c@Ti zyde5a_b>`N8AdwTfjC-gq;8dd{&*`lyUS=j7Zi^o%RU1NlU>|ey5^d#9*%U%B?9P6 zwa%-&kQl#TnEz#TX#Y83(+S$#cy6F3WFNccjO>h>xAzT!3%2JNS6z2Ij=NpLq#P@Y8-&mz`>6=vuY^7T@G@ z!Ki!Ft_GM^)`l+5q?If5-%j+$dkJQ5DBm)hVzP+Mko+3(6p%5%W<`z4QWLP?`b>ab zbI*qdk(BAYYN>bcZ@A!nR8%jNFqnOXe{6D%gFfYIGqVHkn7=)uRfcF!sT%3J-^NNv z&pL0C_O5`DyPxfO=lul0kgV@4`M%c;BHZRzd?=ik$_)BYuNYtz2)cpUqe?mMA+XR_lHRCa^RdXep@ zuYYp`t3BTKif~FWI+3O1bl6bPma*m$oAm~g# zUYoAc%*P5-YFQIpuMJgf0VPOWzw9Uz9~=#Bf^Q~K^l8US3JrmwlRY>o6;+5VxSLg1 za#AzMqXm&ZhVZ>R6>HXxo(dMuJ06dKmq{~6-hazi!D%n(3$@yjkuJnDEYz&|F&BSb ze2gI~IRL>fbmiNlv2?K4$sYKg&)X9q_x#w*_#srH%3m7?SJ2J=Clfe3ZYNqYu3R*A zRZu?I`8OF+{Pa7P`LAwF#5#{UTjEq1XA}Wf#`M54mhmG&Qp{yqx?2COOaExw!K!C) zGqk9s=-lPW%fOFLmhP^tUSwrFu+UtDe6q7?Cd;#$-JR+S9;7WG>;Ur>4woA&vEq4d zC}mflWApf|?81MgKjCara61hP&}d{xpEO~3$4kM8L8a!LeEmCO02d6#@{CACl-BDi z?;NRY9`4W7`m1zc*-?z*?)*DX*KpAn>4BX%yIF8xX#3(z zDG7VoZZQ(I_9YwfV!f3enXZ;rY#3U62ES^m?5>Unfa4L>9LffxuPVwqFqRAG;r9w! zbITB+`Hmg-61-c4qr|eX)t054ZD|W8h|0i(0uq~`P%Wqvtq9Fke@nQ6Z3f^lCnzCk z&PV#_R$A>V928tD`5LwWGAyfD^%t7wp_bTV^Y_>(CRfJ@A*7e(4{tZ(h<?2>yLISk1P?sCB)}r;Nc%<||uWa@+N=kdqpvV)3=ChQ-?KxMqrOClp z;(ZP!dh`wUh%2IBm15y`z~ej2ahI;1t65?Zo*N;E%oAsj7( zi;W+E+G2P@WPV`n*-XA0TQUS!Jslu?8<=kpj18BHKO2Xc3?IWJ$<_v6+Mh7p|Fs7t zLgul@ANlAwAp=peQg+X)d>lTNN3j1LQyEwz&pv59mRoH*M+xVw9}1XRMoMg z78JJiRa1qk)~Mf-)IfqRm1s>P8>%9jkHPw#Y6na(Q3*t zZp~LM2OF0E>pUS^hGU$tzP7yDG@&J5580wwSc{Ikn5KOB>!#bF*HEeda;ef&g}boN z(+9F+j;Ym>Y5&;xik~>pN2r25d{o7PsS1q>E=4N0L@3U7hHz%HX@4_X z!RuLMp_OMj^RR>T>I0KRm0r~{B5jOp{$@OLeEYPmFW$ zkF-^^rjdI6`q{iI(T8SQzv=kgPjp~qdcWVnj0J^`QBFGJ27mN7O{B5=!pP%(FILhT z_y#&OOIWF#gsOm{q)2o4)JnKRIgxuzXRs}|KaAkB8R;N0S$wLfy^-(z9^LsSLN_s( z0?&?-M;GWUPfsd5=QAZ@`sYLT z&&{%lF|ST(R#V1(O){V%e-P+@mmLO02wX7M`>4oF)A z+;zU;zDvt;87ywWQ@AdsBUQy_!H`KlGLk6SU(lG?4M{TNE}r{d zQDZ)8q)_w9)}7F0J)b=_w*_0#Dl|Lg_GRxHh z)_{N4o?$E5h0&yqTJrp5dt*+^jw4eHu!PfBNgHi@#hdXa7h-o`f2!yNV)vE9f_2Dhu>Z=@ntt*_?xd9_M+;w!F=Dp04paJKk8 z1xwAr=M+a)*c^f2x(;p{jvv}YD0KRztu}@mk;j)?_iVdy(M0QX*d2~2CQ?F_c+3n3 zh#4hl@uQO7*heIPO`u`&S(G?(Tf15^*8ZC9t=LYn6Hra`Au=NEu|ucAO&684Zt8_Aafh38i&RoX{GzS)G~$u(>=4qR zRlH=ZQ@Sr-N`qA=+A`-;K`H3q3)2{>(H*Jts{h$YjhQ@p2{7?GCznivSsLNv=a-?V zGHit+Vx$eys9OyQ>l7gKdiFjBm$96=TQrCK(+n{qJ8zqZSw88#curzSRnTUq_(nf+ z7A0MYJJ)Da?=w}3ut_HyC2*H^4&( zrLa-Mn|PK!ckn$=N5aOE++F3u%yU*xc!3m2d@`cmc_+*WE`_4xPdxT_K@Eluuh?&j?!8F$D?Gr+$;*E8HGA z6pMDY2x+Vt@a7IB!Bo$~=W3TM=8BvAX~I^A%R(O1|DB+sYhq%cUTf@OIyr2sQpntO ziQOoIX5hAhA}K;vQ@r6*4Kx@CpUjMb#$ z$oH=DNCvCSbXEtR#@^0a7Hk;}A9Ypnf#o&+Vk)FoY{V8t^da5%34BIb&VGdi23)q9 z0~#hTm!E+4QbA8f;Yti8yEl_F7)ZMBW@^kEcFFeX_qC<8wL6P`(O@^?A z#loLQ|jg>{k=s2j@xsPFfHeS7N_`AusIFUep?l6u$}QkGGzHA(iB`#A=p= z?T7LUx`9^@;MfBV4dSaNxWHH_K4^j)m)H852+fG`whts0L$6?yh#9Rm2>1|`jlIeY zl?w98aybDOZIYIjL~JSwTav~~iel&3m0vzc%e zs*pOIl*-9XHHSCL#Z8yS8x3$PWqyXi$>LM?H9BgysD;cl?qGBIEya)lWeodcWjVs%uIL80?1|x z76vfodS+JVHH@7MYm8|O(y0uZ2ozU58E-wJBEy94ymt;XNs_i0k~ES>FJItj>~Dv8 z9rts&j(wgwk0B^D%?te!*?r;OyJT4SWkS59l}MgQWSyJbt%&1YK-b zm{tng9Zv*#Zb!&Z-FU{dsoZ%eon})fBIn-L0N)XygQ-SGV!VZk)&NTM1!r);kO&@Mocifuy;F?`9A*|6&afSMtR2nq27b8k@-l$cvuga+>Of<~ZB+v=2jB&SW%}$^A zltwLm%zCcakiSq@p7=-fW+0O;LvvYj*?t(A3R`B6rqAc9aX)hY2NjHl;O~I$3wgy1 ztunK%u3V7L_fyX*={DHk_UFj(oW!Y9WG?8fN(0z3(~ZE85mXbSi(aaP<36w^ZU*O97Cg|6V+7wOCqD~(dGLl!&IjxvIy2B;ST7FZgo^aT9tsn@9v{!Y3eqP86d$0+3U zxw2szKC$%LLe&*O6Wm40{=ZY;)0-Sh^SR3-;?@dwi}4fKW2%U8D7ZYgG;+IlVTLWN zh#~r4?XaMP;%6kXA7sGmq&3uGhAd-ZA$8RM(Z;*IPqTVIEBrshi9k!fyjwuo@Z$gL ziU0FyaDTj1Ng9GKs&F%y%1c;}Ak{@*KjUGNwAiMO?RjH@vOVoPO0ki`pM~F3cPS-s zagS0Tw@t%qhq2P$Y#VQu@_eG}>n(@grBpx@$H+aki+Dd0)b>fzY8IE9nySBYQX}ZqgH+MZBd^Q4w&3q{h@Ad2Ah!?aa(TzF?^t{Gfmh%Y&(95_``8}~ zsMHLUXg09t7uwrd`OEBR&bS+&Gn73Am(=6;lthQYLF7PC^c z)qPD{LHybo`cU96R=)kK=15(Vp<7DaH_)DP!;^l=Pk+Lt_N-v$W}d}9<#j5j|bra9*oQwfHTAdg3`y%w>;t-5ff z8P0_D946)W{Cb462y=OZw#vnVg^7>U+JoDd1b?}@5bphi_!WHlAM3MS<{DW zpZA5SURp<7DV@GDEXDBpL=2*D!~E688EFS`Oh41eFic4-@RM0SY~R%jpDI43;xd8Q z%wt@3(hmqE1k z53I$h{x6BW&}NW_acTdQgpp-F_w`nXD#@3(hN`A^wQ-}~w<(zMub;azQn>eC&`oI1 zFYMmc^&KoZ|9GZJjOZV4fXnXOwX$PPeC1sT?!Lph;nn|=?{%h{J#k&A`YY48^Hk0k zQ~C46$tN#dAPGI1zJ5ARO;Sk8NKV2S@d&}eO;iDcVUb0=yqRFF&fezzSH^{CM>TKL zYWUm(Q>(l~XGia_(hK|TG_-{zDvu9rcS-2k(`RmQ?hI?l{nm2oJ9EH3+L;#t)0}+! zc?r*lUr$N`+7w-)mKSEHi|N4jS2Ee^L)fKds_Vl`UX-d^@|z3Cs(C2Es#vL!JFe42 zp7oXTS`__W2{YfMxyd}kGbHff?B|=|_LlW$lUzL9VUDEnw2XPi2J}Y=R>PSYvMY%c zcyG{6+GXU=tWAu<>tag1`P`^^kD~YsxgB!@kG{qN(#HoL)jXV~n6*(nZV2Ih{R7z0 zM$dncOx2>$=@xlka{AQP$N#vQ@z?c3)nl5x6*SFQN{{PyXR7b^F7JB$Igq?hfL?dq za;RL1RwMOn2y05>ptMn8`hCB&)wds&FXt+y9^qJ`swWs+Q1u!OsmzgcDb@J-=38an zyJy5KRf}IPc-3~KjYwV>uRi7QY^Y1IwTgJ%jNFIhxRmJqooCf|h~$jY=BKj!MC3;S znZa1~5Vu@ICw=RMsV|SE?tc8-mCGSvTtAWV3q5)c^d_uL0ZKYGy8~j;L1P68d3u;G zWkrB>e=e|ISJ#eHwBHK1Z?#?JtbZhgJWnQQj%UHy2&BB?OLgF{@^B_#`b@$#C#bH? z8Z?e4LHu!@qV6&N#7#(d!)H37&T2D)RS&vl{2vTlpJtIZt@ zo8n97=nu$V6W8^0U0*Q?89n?k3I(5OM{U&QI324eW_pUh4d0quGEOhhY@AB@KOiuk ziiN(KcHnItzu1;N^ZOMkx(P^|*M%b|dywKGaHnmTxV_j8YUP%$+P&iU1XEsJbo*v@ z^bTrWGK6^KZ4*9Z&S~yDx2HP`!tf!cun*rs5NzKUwPFxY+`Bq* ziv(vk*Lwy8#Y(@Q%IdQDyyvmV=4ZyQsyPUQ9C`&MMWdq-Y+!B_foLJZ}7@^pn- zvsse4Q0TWY)ej;|p6-WI)Es>T75PfDttxJ~)+yf}HSK7r)e#mu;JYiGy}pL|PMwjS z4(0s|=}F6Z)XSV)egmT}z_r}1-%A>OdgUrlOEPF(oyPC;%_b}7P;$n2hCw!-io#y1 zk6$7X1E%>XcrgokZ{*7Emx)?h#GxDtQ}$XN@B~MQ$Sb>pUBD0c7n$$+$JV!CSjehz z^4R~6yLWJoH2B^}V>=sfw6SetW83z`wzILFY;4>1#E}G>>C=5!{GbJ!4kfM9gWP_CGoNWD0PN-u$xg!XyK0s%%;V>b>pq~iR~#aM zS$_hplyMo7*2~H(1@Z=sPbdeNrJe80TXPUpRk6*3=#7KO+w%@0r%c)(sAVf$V$6`- z34s-w*p~5~TvfksqcpIFwzf$c=Q!TP+O8 zHJdvvbt^2Q?~Gs=XETvtn@Vc)@_s+-+dnhQ0$cfsmq_vP>e2}>4U^t`wvNaQa!Hz( zoVWFgAk>*P!yllS75b>G@#z(I>5ayh+$Whc+~Jua<~pB?h_<)?2eMt60Ysrxzi3CV zIDr~<`<_;dH)O`SRNY~i|K*>Tv<1|FE6FsgbK;Us^`etX#yD@AtkTntcxs=+`-ud9 zkjpqCet=l2SA0!y97yAO0}uZ|eG*KWgPvUhop12)3X;_pGts~0sf2L+Tt1Jeh?X$P z-rDZ5IujN?*9e@Fe-Y~Y?xyLDe8S>;mPTUZ(aT6p%h*#VNt>BD)m=8hFZ<#s?Q$pY z5N*5zTXybqiO$84$0=mrr+-19ABNczf-aq1pjlhgV0a~m8#!x#ajigRzv@7G zQeVb$caq169EmEi#YRwWyQcFr9fEBKv{F{`b4hM7WIn&3dekbe=Q3LR4H8Msdc9%) zWRgXaAqEpI?5k)Kh~sj^RmR_N!8>)oVcRj3q?|V zJ;XDq2V5ZUp03-TOi7MHEBS|=h0eEPl{g&hrV=;%29$Q&L^(pc!Z%GQ782}}tOg*% zqQ0yY0VMqJ1-392TVApjISRT1)#wu0|JXKe>!;*9A+2`~Mh@I?m>1aeM-{X@CT@JA zfjo+$Z{0zFA&LxIMiUlubzf)4UWT$sU6IZY5$=Fw_!nb$a?*0u!5DMJWrq_23S707 zqV@ELsC5ez;lzx*RukPXvd5ij)Vh98smFj5p)u`+BBL*zyBjZ`pxyM_Jf4Fk09rPX z9p^=s*ITA@V1S)JvHddsCU3&F_2hXx5(z#En*DC2P%@~|?}aENC|UB3BU#T%^`6g(@dy@w6kaGxE6 zH|EdCIgVyGpWt8ISV&B7DIJg$7zrS1;?#3IA#stt2UrY2;#MiHDa#vf?<3ZDsXCTW zEifq9*Ze|vsLT_Z*wQ-_SUA1&AwaKvZ*G6`rJN;`$U|RApML!83_zI8V%uwrjmaxC z+b4tKcl;s{Q;Iaks1f_bCkdIaO$6F#P8W$m3Ps&tFfoVZr$MGYvk@$vwM^%#com(Q zZHVL4e4C7{-m`5H?`bECI*59;BG)IET&XqOJUWl4!c~{)k*5LbwlzsDv}J6oKTi+~ zd@GyeTn#WpFq1!;CX9p4z;)!~W;S@Wu#FDo?TKcG6HmPu#y=T#863+}j3?QdI@-f8 zx8*;B&>P|U^ezi7S3N0b19Eh}`D!yA@LYXK^IqgmAK6%I!WOp7MYe)xZZ0v;WZ!S* zGIL)1T0~FsJ%H7HJe+F;4O+Hd-9iy&&;LVsNl~5WF$Kit+TTN+uEeNaLxx)iVRv5s z=0c^}d=5VGMrKE$EdeOLP}ME4wF05Mpz%k|8JoP$gG>OjoL{6VR37>2jz$E3SQr<) zopg4GI!Zb42c>O070nk?!c*@NMqKLzZb^^u$%A|p$g8o;l3>KzOv@-7^!|T0UzD<9a;RZ8( zVv)<=QzsZjVnf&Yba$wIe%`-|&ckJ|#ugdF-7zR4Fpg|}dGw?DR$yQ^Xu5=L{1bDW zmTn=#XPojS9g^EX`7_e7D&^SO1p~xTbMWaBe4FBL#@kBWYO+E-tsVX05!qW=^Utmd zPbfvgw|#?MYXsVqQ`+KLjp+jK0o_ z;Arv2G}++&A`t&6AV-co0`ah=R`TU!?}Jdh`f_{KK|o=tha;M_;lool;xNZk+Qjdm zUh~oy#{{INqzwMSDLamrKl-&sEUt+>9$?3M+@mM@!MuE8kL}9j zU10C~LVI98G_W3v!QIRHI=EdGa0e=Yb%TAM+JPX8s|#OM%7f z6y@f*D$X9jINQQzpx5ok)G|EpOF#&6zqVk7VxiHR%hC9 znP@L$^9_^v==I@*cO!@43ue?T>xR!01Xu(65zfE|dP7`A_WzQc{vRLfo5~;r|6APp z|98xN2}=*+d8N9yOp;t_9(|lB$sok@@}{J zHvRD=w%XP92OB$9hm?{+m7Ik4kz+_tEg0R+JBI+|Il0s86Ko$N=m^6))8e-+fb;jMt1*=;fQcR*7nSqoIFD2maCPh$13(4Q8 zSZWm0mZGq%aFtc;hLO~sna!O~nwbeMH;-`afdx%L!bjwmlpeTM{Z>%CDsh&XNm;U# z1K$0|Zx0hm*lY$%X@E%rZgJjQRD@$or{G-DjHNt4qnw^<%m+&%yjsXki~?hxQROM= zE>4*Rg!;dX%vv!KK+w|;Ml%GWPewpBmTKF!bN^+MRmK*y+8J9x?xvXJ_4HL3R5mbJ z?y02?Y?l*P4oCIEXfk+mS|O#rKT&WUMlA3`n$G>G`oJW8LO!n&J$-A^d0x48#2q9Y z*^l#t?Wjcu?@@2L=*b_4w#oRT;bq+D72L)&TV`0|iUeHM$ShnrIaxK1wp(QSiaw%; zyT$)S*MIze6)i?ep@#vL6%|kt;~)!EFv8s%W@ixGgxDbc1LYvS13Al#*WxRe2O)B7 zEKo8YNsmrL5C9pvrO0tO)=|EtAaH8NeX3%0*h z0E}auZKZV*{vf)hSYNN27~&;B6ebMkS9v1H@;fsMZ5`H1xv1w!7{Z}~Cpx5>9RLockc(wMR* z8+cSva2ValuX9i9M+Qw1aAhnDtp$e8COJq$=M4V8hI>o=d(Knlb=W)59Mk$^m=K04 zJrC#a!NdzblDWf{+9T2Y#t#(;g0{2Qn)`3q7f&xkp*|w_p6d{bG6P+D>QmDlI$96x z4^(t~o*1K(Z!Ul=m1Y##4eJpiT-JY8n<=0xVhaoi{iRnv*w-($>Fq9=UgIkli;--k ziZY5}CoYY#IGG;4joX;l?3A&bpPoH4tu`CZ1^SfiwjFF!w@^!c7(vk66na>LcS zgd>tTpPR_d-dH-TdkHB!1Jw3_9{MCjT_*lwzg8WDm-T(qPNKD9N@SK~kkbDM@?#Wl zPkp4@E15*vu&3iEjl~Gmw*T6=Qk|TTwl{KeYgP*nf2p%+=rXq0?>pb{V>j}Zbz{^PR&q7Iud^&Pb2I$rrgZOP zg@it}M8DVNiX?mtgO&0Nwdf;o%;*kUR``_d(nOwJAeyl+_PkkFSGn`H(fzT3 z8CK&hUiKCyosJm}&kP)wCir%*?jJ9y0Qoq#ftp;2kyVEL^L50=Yspk18RhouLOs>XhwV?oMI+g|byq~rx;g}+51^2nmZ30&4c zRX30&g5+$X9%sIzv68VC25&_L5J}y5eJUl};ma^B;BJ0#vz*_lU#GOT?%Ywx7jrRV z8+$MAZ+GkM06oX+QXeUEa0XU=!+8G!8-H5){oN^?ot~;gT$SiGU}$4AmVYLu7Qt%U zFVlb@=o;3Xzdsu+V~~bgoA(?{X2@OmwwSunq9=ilhy;HgTmSauHa*k0;rsHL**yUG za8!X0_UmF5svQ?GN_)gC;DB&i^)<9XwVJkKbuvkOdDl0y<|^iX#UP3%lK$PZ;A>;B zd?Z~n^$U|E0zU(HS&i4%yLdu=YRR9HzrJhWAejY2k6-mr->)ORwTxJF;j5{{`EJp_ z=R0@eJQXl;*z$s2G?tQF5BYs8U{j}yEYGK3LL&RGk^aSc=P7Lb#_cfkrfotG4=|ToWs;fb9)VSQq3j^Z>ShI z;wLPH=%o`Sv=~|03R#+s+FF7UMyXbZNm=6Pus|kV{{f^37ngOdohObSvqfB}pIdj{ zB&&}9h9Fa$8>eEB61UgT&DoEQ@d9O-%R>qIU#qPVg`pdS3LIi3JGCj6s*Q`;gw>TY zZXD{Z%Hxj;hpXw)WrH}XRaRFmndWSnfv>Db_e-PXpilPob=f1&3p|c=EoJUW*ytqu z$+x+Txz%kzoeB)L_HAqX!R=f%=d^XKli89gtbc6lu$s_zn3#D^ZksMi3dD9Y?PPFN z4zfeD{qis~1fMLF4^B{>Y3%(yQk7*^!2tUuA})(&4iNLQqM=6q*mSqk^ku38fLId) z6;zHl`Gv&-sT+ zJ>b=A&-ct!rp0(BQroz7vD3lnr74mv25E$I!OqEd>p!E%JuV24GQ7C>eQhNMuoPS&uRq$+ z&!VX0&eZLP7B>DkTDj;7=N73V52Aw#s+wK7;8;GpIBUkXW?Kb1uS&YA{kMcxjOa;b z$!MQ0%vY)2m1SLkm)=>-OF zx6Y|*`beOSmx%9ipn+s2Dy}Ej7jzo}q|6%!Osam08MkqT0z2bjMLMO$WkRpNInh#1 znJESX>=(`{V_WNvqjx> z7>9;M|4V~Z@`8ogb_l1{M^%NpohV0_T}MFOABCfp_zs>4r5)XS&E#8#R=L#4Kx_jA zrRop;Ip~)uxriAq!tP7tIqV(Rb|0nLtVM3tz0i3Z_~iT|`(7w`he-X09{mabRx&mn z&0;TeE!fSjmio2XNvyIS?b@Vuyw5UI74GlK)1m@r)^Y^I`0FJ77Wnn@@|ic|X)SOT zYIg|PHS`H&(>Pi!nflcX2}brkAteSE1;T14n+a7am+ZG;yJjMxZRkjxmPe|?l+goc zE%Ktz=tIWzcu9HZ_Q4%PHXyfzN=nW1)j4@cxR;PWKA*vOX0^0I<>iHE2aJ>H@@B1w z5~jOqv#?{92!EOXX`3eY&lYOmlMi=R(Ch@vP^F2XRDCyQf^LFIs*0cLn8j$*GX=h! z3{RA61}{4biJ|4+EGI6W?RZfoU`j(MB9_Yje zc5%=LVLFv@1-?L0Wwi4X%3%o(i$MD>8P1Lfof;I{8uTOSOx;1~Kyrg3?3)2)3?R?t z>^^9DJ`G0u5Au#Fsnlez5|lLD&Wk)2Y5g0u3`JK)va!ti?7{)Me`od+;G9HVter2k z?2*i07W}hg0{*u$wgFQ?8^&HuHK*?E&&P>#NEurWmg~@|CHp^8wy(P^G$)Uo?X|?6 zB3%cuwMo43-c7mXowfRr@e&$fbOdBX&jj-Gi=~OpkYsGi?*&w5MhI?31Z#hm=0K{& zTkGXowMc1IU@)EjIodQxzxKY^TR-iysAeSM`n{=ulU$M;j}r)R;SHhmyuVD=tvcQ& zj+h-E>G*m#g9*wf_8C+N72sx4Y5SAs?ge*xxA%v5gdV#FvtCKYNm_jj)zIi@I+k%J zn7a+0+I&6u<`MP-oz$~@67U)@GgCG9iahx(kFv<~->PP+8~GVROZ$g>&t~v`UzufS zBs@J3lL-aViv^5lk`yju?<^Fn{QoOeBd7c*WWvM;=!`0pMUWme4=UK_xpL+kklLjF z?FLwc*A@p0@)0#~@er<^F>IGB%?eh9gjSoq8Cub*%*lSg@akCn_2$~tN={Yc?HBx6*!Vn;ZNtG#J zQw*)00*5=H2}*n>btv`@eTP9*&{TO>lE4`Ai9!^WETKracRMG{_P`1(25I%;yvb7j zJSi*AK5|Ook#ZgvO+&o7fQ=j$(+L8e>!afS-y>Z8o+wT+99r7=>&K3Xih!%D#{ms! zo0RiCmI3*(vde&cr7>Z~hdZHtSV0<9y&PRP%)rek)O2LIhvrF0mHE0Zl$G%i_1PRcWDmoP~tx!bw7DlBUeZ7X@ zGAYMNa(f>eMEq$LGZrp5C>6UvmyLn23B=HR+@cvdeynaJDnpO5pKg$-J7FP_Q{B2yDvKQkw5-VvH{(x`JQtwGWL zsS1BxeOjbEd|?JHD*2;qoX}jIJWwqKPG4*v(txU`>cq&4u_RX~fxWRnF)1&S{Z5~L z*U@q=NEQGR`k{M7j$tJ%EL{ZrZZFc`URot&Wy4Bi1;5GuJE*=(w?(9phBa(;C`g*U zrp_~U8eRVqNV*$tF2_@<428)jgXX_1Xn)kpB9syqRNl{}8_3ZRE;O>x#hKXrJh6#( zPd&O}m_(sH>Q&|d3JMz5rc*g<)H!oEr4t%k=e?&gN52TDyE6zyw95gD7LG7enIG0f7Pd zfiZf~qEa(lD2|WLy$gqNy+MMD4Y3D!p)F0igtqe`f*XcKWnO3_i6>xa%?d7;sAM$2 z#Vi)=0zQi>#1;9ger$`89?nQNOZE&lN!iaAR3Yro4NtQVqH?z!Rds$P{hDi%1&zyi z)vS9e-FXQ+4hdvDjY-^O$(#+U)Hycf(}u(MsFtP38{;!|%7Pr8Xv<(FC09}*$~xyf zG%5na`m>$ss$Jde-`9YVm8Iwu$^BTmJ>3klH0_}F60|jefKD50wbGCU-tg*xqDc7| zkBz%dZC3U~7!^Oo)r>&Kh%P5DTK)rN<7dLP6)36T)`rg?rWN7k>7BDzK>FcJ$>hSQ z0@i*PrU>i*zna^|PWI0c9)akkIe(^&fVu{(_Z6hd5%C+jD8=v7@JagoeBK2Um?^8? z3BR&joWKS4pyDpfGUCLeAofMeoT6nwWsWk79OYP-Mp{zk)cap|lH4-Y+H#c)9r-MU48x9*;k!@Q!Do*BHgUfbZcx=dc&Y&r0uK>o3NUnYuE+ zJxY_SJynW*qLPTVY-St>f3aE5w2l|HqtxD|i0uCDI7|e=IIMml?i)mF3i)}$ebfya zHA^tz6O9BI%5dM=8L@Y4j!VUn}6O+JYioVatrWk z8v6R&Ux}JYe&MsLf^P`_~)9v0)`hhY^u_-Tl5ALq4V(N)~Ug$2skSA03054-5 zGKhz5PBxIG2X@B!WV1Nl7V|%C(hB@ijE>P0BBp6`HUg-#DM8r7Kja~-qbBKYG$C)vgZ%FB~hhwy9u_Rb6|-I>K1(E{OOhVxlY&{uVi zEq7L+oJjzO*^92wPoAo$5U2`N@@&J@)w=OIa*;Rq=e9Nu9#PBu2x`p zB&Zx>OlPigMgNDpMAa{x)tW>r|2ik)@rLqjA3ZSOI{5PQprxDz(8aYip4>r;D`B_q zeEc0~!ifFsVvj{$^=8yOp7<+W*3EUyy$V9|?lKo02mo#hF_pZo6+0GG*@n}U1h_36 zsY=_Q27xTECgt09EBvZJ+ChnW4W@x?oKFl>K`>Tln7f_v(w+(BMslhq z`0Vv8+BSVm57(p2e1U_8wjC={W=`CL?YzY`h6-H5c!@CgJLG2p?l~Z#Oq#f9toVn^ zd z0`Bl1m2^_%^wziMeeiL5bZ^@Y$`JE3VZ}eRq;b305QV~ksM1vesad%PgqoGvpsjRb z9hyDeze0r8BiN_E(x!J~8H8ObToJBx0aE*dIg@$5k@`B7Ol?O@8)kP85>Y_rL{Y2@ zjPrn@+Y2brQ{y~eiHtqV%;$M2Fm{TY`!bvO{wf3*7;h95$j`Ls4m|Iyo%%~z-j8*x zHa0^=&FVD{LriI7p)`l^Zv-r*4f6S7XgV>vJ?1sB?#R=jM={vA#uI8qJ4C0t4q4jG z{KsJfqzN`kdF$Y7s9L7J*j|l!^cNGtu2xS$k zX=J5NM29*f|7IFYuTa!B9f9zAliw|mAx>|=Vm(j{{}>!8FS#w>}Q84G5aaTBcr#0aXgnl(%&AN2kYylAIQ>x zBptG|g$9C`m?1Ie`zN+AJjKY0G z1Xlw-cHuNS)T42W^oMBA(^$H4h+dwBq!~_T3!re)e@k|JtQ&(cjBUXbhi2x)FCJ>` z>a0nN^#M)&p9ZWns{;{be{B_HJSKres{_)>7}u7suuHx6Gk0pvxz!Xsfj>q+<>e3r z6O3A!ABcG~iG?khCIk(r8M(}t=+_IDTjr>qieLtv+9j4bk_xFRX5O6%a0i>wDJnk( z-#5V}>FL@lfwW2+L2nI%LLJuQ-YPoYHSySWu#EdIKmtUDT@>`LOCFncQYw%L?V zM%<#76@AkkccZM1wcsohp|46jKt8ULPOx2brvqr)#wI;-#1x}d!75U~R?4-ZkReFr zlP1e_7`O-lJKsSB(MZY`gE(?xHCNW20H7fNN-^jxkg6Oec__6#%X@fU=v8`CR*e-Wo z5`ghPP|9*J?E4)?DReq&!-OCU)ZvC!3k$>DnLQOu?A2M3v@;MbuF@MBAV2@}D0rXP|S3zT0%S3G{t<3T`G|Ts6PEWkWb9+qAX)I58=?+#yCl+?PwKQX%n+i21(po*!X@g^2LXTn9|h;slG)vgR7_gx z_@h&#`loG}1Fr?FzYj@ZbwKG<lW!)?m5C4#pel zx-ol!+3-9_=auv@9)Gc)}60SO=w9+Sbab{n~1Vf7(4&< z_r--JSOsi_J4RP?APo3M5o|Dq$RO?w)(DK1$Ji8Uo?5}#^lqrCS_L79!sC&{2d!aK z1d9XQWDH4o3_3Rc)Y|~)$r_>ZS zKqz3Icm-_mSmt9)IxT@XnK4$gCk!veK8V-U4Od=`s$CtmzZcdqsT1cY2<+GjG*T}_ zwPq||otP;1Y%-xDr*8q*w!1ZDh*|h-Ym(UmxU>B9hmr-#ABdt~_mQSkP=T*#G6#dv zC1F91*>7Mbe25%`6b~X+@KANUFKELJff3KUuR&bucfMRiMdHgVg(Rb|2?|mZYrL+z zzw$7*U%pEAb)rYo$KNafrWaZgw$y#9G{gP%q1klC%cQGi5bBlK(hc;U&BCVizo!~X z8938k;rgWhUYPvVnxfDv%>&f(Zb^6y1uas$`tGw;tCEi{sl5<=TvSJ7Ck~no!&K1Q z`hud8G9J^-uvrz6MRoT6Wf5Aor=fv3#DEcavKJbPP7j38i>IbV=8+qn+RB}`l ze0}?JfnZTm!ZKJ;55u4*3i(Pm{Th5mOUl#RIkW1Wi>44J`{i6WwjXx0-veKM&web& zQOm3svWV+SRmg1abfp5mItpi06>}vCH_tScEvJ}4mv63nH(!UBQxi{9#~r_uVcZaR zUwf|z>ZlAm96YH-)jN?2qlQ)~J6KxvwsNi-hf-%hU4;JqXkR9RfDIkW{L%wAR?P26chRueXljXB}Rh zR#g+A#zqqPP77Y94Ib0B; zxIL!QlbVLbFI`nL@*mkuv)SHSgfL4nss$z-d3KQoHJLRlNq);I#C^VtQP)_TY3W<$ zzQllkYipFiJ+}r`f3oG?8@G^7!9mu5!l|2!-gOEVMyWgMxJ!tQ6?=vOPDFW)5Fl!h zLb*%W$fUWMX%EUwfmp)Z?DI@5{U+PSw#3nLj>ZYwA;iEqpirw%Pu_5vI`9t7&=|4d zBtH9^)B)_{j9Ie4b9%E~a?og|SNfMTu;@(C&YKI`NFFZ5#!Q^Aa<=XQ%E^)-QgZN0 zbY|7Gai@H7PoK6A?gC7CVMXe!8QeqHHJjh+(b6KXT#00@7Ktv>ZHn?nXM#Bi0Hs<@ zGL>ZG@CQrkdaW$Xveiq9(XJqDnaxvvxafn!;lBgJ5Ta-fEn~fSYf%+CIa`z)p7mj1OSp)uHOyo+C zsLw1VUB168SEE(Y7U-rT?ksW~)~w@3w7X1#TAN!ocza#%f|HaRx7mG8C`$To z{uRfwL}T$<-cBv%sg=S^=5jmN)Bk%#q}gIY%PN`blT?$|u_VM~vs1$=kt$-Ve_5Eu zM7jBtx8v}x#nTt@+o^7Ljht1ZJ^r{nv$%Vv*J99}1}DYNMB61RLcBF2pEI-M$a2Cc zxt310h)#rMrf}Sj4^pF?D-^Y*8>C>*-`OT(g{JfTJjH6t5XuyIAY06<>>>g;9C?{G z6rd?Ap+=J$r|=jdB?2}?Y~rI`a>Go)8|$?SXDY4~s)nIu4Lw)FJG7LsCwY5s4^8qU z@dSl|Ug3#;4gqU>v28Q6xDV_g_e^L3fR-Ns?ZUB)GCf~X6wRd#|G z!MKIjvN#=QFP@L2E_wRzL^0Wh-QgF|4Xs9d+>AvlnWo~lE+fS|jPb*?e!jbOtg8K? z-;FU6-aV>*2qMmZ6&CC5;qJnjjyr`?oDv4s#xI^UgsQNz}5EOt~A z3R-CEbgf9PO5UVo7LJ1ojUA1JAf?bSeMu!cbCF7{>hN!meppG=?kNB=bb8W^n=p;7 z7aW>~HkxQ(8|#t<5(Z5?$VVQ{BJ7*Q_-9#D?~*FWZ`;MoFO<^Hf7#IVV<1-?pjVn{ zcBPZ)JW8nR%pGPW*Q*#~vRL^qm97lgZfq%Td+?;R1ny#mHHwUO2vT2kAGa&YI(QA` z|7`TVH}zFL|``3c_nt>~OBN31h-BtjnT+ncde3@$$7$%}lZR zw~Eyw$+any^$s~|FpA9&7~2gZTh>ym*0Q{PS&zy*L-twG%)@llrP|CE$*5;_0geG` zU?=D7tQak#aa1Ii%^CBva>|@ilqU2n5v^rYwM`v-m?kHiX=VMqG+l@XWZ0$KaEwy{ z3r8$E9cd+u=kfF$bxIKxSScDB$hZF)M--N#R$qs3Ba&fFM}_R-?l|)T+xW_51z6V9 z`pWVjYIQ>01gMo}D~!qz6;-aDZ!ruu)cA%S(%5y-7Ew1>o>N_vXi>F)P}y!0z{tMQ zMYP0uxXRg;3m_C5W$^B3ma1u?2S@@yD%Yx9P=K%={pD1EpcbzqQlfW zobMa+@2J138cLlB{9EdMrG~y<98`N zLgoC?s2A4gt5^4$uPvg;ObnYc%^eocfRDl%{)tr4q@e3chZ2;D1ti`RM!7z-#{mwhzN8p!hqyD-hYoLo)hRS?>Rr-Pw@M27RGIp5s6S7-(=)>#-2kuPl{p<2o9uqM0o*F<^UhV`_X%lpA7RvaG0jD!C`wL+&}# zYdl=D%(g0`I2q~rMn%=7koG>{dz0EjckB|&)t9XT)^3o8HVX2;ra0hDqMm0FwLsY; z_ZB>n0K~*pOSfl$!lmQ7JN{cM(4KCo^0bj`marxG;Apl4vBT^RU4%20Up0@hl!G}y zp+uMDmX{rP+G)e@Dh(04t_mxUimK*SFB@~)`YgyTN3En@vNXO}O@C)L;T|fQ>Md*I zHCe@Ur#fl+!CK~dIql8jx*1Z{C#H;up;OmDS9HH3q4J-O_fj<5tXR5~e`s~=23Bgz zCOITi`BW@wJy}K;iDXTu`Y8Xn(`c$W;!tuP06;zYhTf`Zx1?x}eikXKwAZ)h4b!*W zec>e`F?}+NZCDB5DmJJ-WnO=2+I7&R86pbAP`1?+Y#pU9nNg{glXI}I8=gm}ePr$t z&|oAslO;0FSaV+MWTy|)+9DLgZkcT|Ih;X6E*r8O{j6%b)2Sr+YvIvw)%sCAeOt9K zunytS@BLf(ws}9>v?Uper99DUOAN0Je_c~A!0NzZkxnsoyL;0zg{`c}6~PX?r9NKA371Z{IkR)=s-3(rTcbXxR&W3~^;Y!@SkXvSuSvm8#>N4{tay z{(*0_oXff1jPWd??zV({adfG+sWyUbxX-6cv#vV~bM8JWRw1MC#J4=DF3YWku(qN0 zF@wl@4p<*QsD*KYx<9|p-+;r^sBAO2xa72*t3;ud2d?mCCWzrw8Gn=tv$cv% zgmSn`pT`lEp;8b45JLz7Rs=e9g0res{F?ZNZ$lmXKd*jKyHfq3nebdL6)S4tQ!9!q zqBB>d2?mmdvVVnjX=GMbV@>|GDGJ*EP`NUM7Yn|SmTmc4!xLF&6h&32clFw%fG|r2 z%fZ&T8J`2zr78pF+0gB^Pq&YD{pXV25Xx<29j!%9CoL#SsNC8(_&H`8T!U&TS~}92 z)o8;s)tReThN_r>zFr{$Mos5JNdhfXr)y^D&PcHHXr}~mD(BMZ^^bVt5!x{3(7K1! zBX)xZG_4i9rpgkLvgOO{A3)?*fsltsd^28JbP2m>bn6r@$1RJs2_=|^w1ebR|6|BGHL^{a%7w}71+@NnlQ1<4H?A@c9l0#y7aQ>&NyTdlK zqlgJbH-bOMH?4#5m$6*@|55N=*xO$JrKqOs5?CTP=Q|Yn(fxWKL+*d-p8ucU`>zu5 z-Ww)x2E@THAac1L!hPA3y9?Xl7&Vr%vQ)J!&R=8S{)>E~I+_DL9+<};$hGaEqXa$) z_C848Z_?WD5mj=3yTu*6KWnwr8Dr})tjpOv$h{mAxhMVJ56GGUS`5>ckTQqQ;P99XNAC+)tZ z?cncjvw-Xpfnn#Tx}~(d(zCgS?FWv28nqD)>M}gse6YSi;p?_OYpHwH=VKJ)PY+@A zzwd;SR>qY}dWJ=}B)@dm<5LJ&2Eug+<=vv~%~a$pS(VU8HEheR zDD?MuBO*x1y}}s2%oSm5Q7*lc^?M)X`IAfDhq-12&zvpq*@X)}AjuX9y+ds`B1wjU zO06}1$o_TBH8Y7`AnO+5=V4-u<3^tytObxN&EQ_n0%h?*t*L&(olLQ-%hFE^L50Zp z);=lkz%w0|tMiq(>cs`5(ZqwLIl?%O{^4QjpZ(pDXii0<8B%Q-B*d&nJ4xHRFvtGK zC&SP7m3`fWOfAWK)&)A0h$+?$UiD+cI~_d4{IzDdr3BH_)h6ls$CG0C-9d_P@2qdm zbPNCq`D#OdwT=oS2Cdo$k;B3Lv~|?_m2Cqc0N&!Fe<<*tN&5I}>tQ_Yvg?5Gi!GzZ zJJRBCSoic@Db$G8<|g#_Z@Pxaqn%?Irw$OtU$Hv7a>DuW|C=AN$WxiUy zw)?zP=s60%4G#K@@F+?2>%P_IJP>Hy-W@QaFnyb7;QhLBtE~jtX8Om#+uKd^ZQ|9e zSJif^d+60v1AbHF1N;7NKqB)Z#C0&EtSY0uPyd+E^N9itw{G`O$8pMjvm?;1v99FY z&Zu-jyKwJ#05(-)z9!y#mkO7`C&>O4Y7msFAabDV`^Wf%vg5Q8vVI;1$E01(k_rF8 z+S7HH3DEVk!X=*Txte?DuY!Bo91rg-1Qt`W~374ihkj)Vn z-(5!9_9}13qw#qeVlZeMxqe%(BA$%g$H1NT%hJcuJPzpd-_ikS#ZfM8bbht8so1w3 zWtq~gW#i7_ZUnKhh8Svb9{UMC4PwnW$h3NAwbtGPJ4C;lZ`LXbWuG6BjUhPiH35MGXXzll) z7}<Dz;WTrZ`Nko*pf}`lw{r{@B&v@4YucYqmUfPS!lk;!OPAKk3bx>FId1me%!7 zxXP5DA6Qx9PR%fH-RYL$(u3ZifksDqv^?qz&S?jQ&bF{QpG2PFdM8ahWU#S$m)aNC zan>U-MXu{PCiDq5hjQ`z)3iVqW$qjPCjSM#dxtPctmzaM)b)*DETxQCU3b(evDqV* ziTHg(JZUoBb$#hJ4y(n2yCtoXR`TiO7dpoA1?k}6lmI_Jd~s2-5AR1< z008kX)XM!z6!=9w!LoI9Kj_|1(s9i%Heh*NG>*(ND5B9=OuxvThqLVQbuU7z|H&R2 zF@4(-GjyO_6?%NJMifaCN~!kl(foDH^JMNCJM#+oMqjl1)?YD(W^}@D#mXirv%Ao! z840X;Ow(~et8ULI+91Xy(f?J~+_RHowLz}$?(5`9sbE2X6SB?Q#y8aGX2yEY6`L2T zrR=r9Vg01@-SL1$u%g7nEoGQcvJCuB@FbW8J1!)3l4^QZ^lpJsceO?C(Weg66_Pbht3JhUqI&wI)E zLlhttcXLMUY3CQC`{ml@x70CA_KcCzPu4ycif!}F+`Rf2|JQhxXWH|f_Ei4^acuhq zYp}@l2PS@*&&ByG9mo-2+hl-Cg#XZ?RsVh*I+y3U?@oI_y;CXTl?Ll#mnz|JPz;hm zq2pi^Sl1#(SM8*005&}I; z3daF509+b|KtDG&a+gw}=T`XxEmo`!F5i2%e;y>-?{nOSd^$&m!ws~-9M|noK{grF z7jwKUMP=_dcG5ll(p#_(@99gA$Gu@bZ};ji7q@C2zBOuekG&1YDei$8Eo~CgO*TTjMLXwu_U7L`@24PZ-;rIs2s}_pIpCetQ&${jv;~1p`ZO%H)jzS zU-{`LO2YB)o0@4#%&N^n5$HPZH<*;_L|mTw#cYUPsrlO?zo<#q@3BnC24I7}b7hAL)s652!sF-n8uf_(|?Hv4P^|C)P-S z*Q@WLeAx&_qg674m>1#dfIJG1Y8Z(McPJ8Hr?kM9!ver?qAPoPYQfARs)3Ft$ed2zQbx&(1Pur9=J0eD6 z?P)ajTP*#jdHlP0^FG*)Np>Z1A2v3|^6f4HZ}Df(@{0|1`#s;6l>njGg_Gyrc7xXq zni@GMgUEJ6k4UOOKv(+oFZ{cPHGP(S{k~bYHJiyTy~%CfVh;~g%ynKRW!d^mSh+=m zyFG!t3io=puDGSjQ0Y#EZvucrz}5upFwgcE4K!gfW0|sqe#u5t%$Vy=_BWvMdd5ti zIn6KXYQiCZ(6IR~rxa=#$9kTZt=hvvCDEe#n{7NzXV_)a9{X$3kvg&k z60)GWT7yndPBpoEsMSHc$VB6?&Z@{&QBAEh{uU~{rKsvx;luY1Y4l}lAHJ^x_3rpP zt8KzwEY{Oq#Me)Zg6q#u3qocRYkP}jWKP~&5d`Ud71^hp&k|=9WdAk(bVAjh(Mmsj zb&h{p8PnI2?iL7o+RL5UFF)Pu7wIduuMADTq@z-@x7MDThVONrr}ip-h`dhECvvs3 zVVq2B&w4~l$%h=g5T60p0VJzHBtQ&}@>_%+Ow|)t%aU=H)=Q_=}#t4I1y{GFF?|ucw$2GYv zP=waQF@cy@pXiUYVOq1wy9at;1Lt5%;90}-9g*g)|nY{D8ib>pI!Bd zL%cpej$^aw?DObX{4^lYORyRnZ#m%cxK3(&KcyYmnfGedGAdQtwr6E)jy#Y3${gON z`(2lLe-Kd9Wyh;%*@Pe#}G3!Gcz-D z%aFZM961Niym;8^qW&BiGz?;K8YJ>Z3m4?uoHjmM>F*DGl~4VR$ zvrODsU4pxRB==&FCy#Ssy9rR>0#tG#p>sUC6nZ7;e56&o;^;w`J_;cyNBSH>HALhK z7sFxKpkW@l=KwT15qS}wi<672o>cIGZbkBy8$Yy*jq__(g@)ka!A$vV<9+QDYsu;E zQFg){D+^!DXDy~fKm}e|{3&Acc={k>KHuX_+GRqC} zoqUHk&X`%p%I$sh9brH|3(qraZbDHB%2 z6p-rf8-N)mnlpcrcdqrq_ASP3m*O%y+A{Cn`Jk713w7;tfww8Ey$zd5jHz0hC9XB+ z$1{DH;4yu`^#bf(Q2U9yPQ*tCn@S_!_k}G+&C=r#v*nXi}fY5?h-k;;l*n*?JWo)<3C!k|`Al zz#OMJY`GBuS)&6uO>cS`A~skh9leQ$&IDT?JlK9EGUSnJ@Xl)+;u7$04yKTy0z$sY zfZzQX9`8}A-Y#`$O^8Rdo)3{{^p6WViXsGLgLH3jz>8I)^-+cSH!IbpEMQu$NbKnM zHi`c=tcd0a&$6}6sw93VU$3mv5u{>WU-e(&-C-!oPBNl(MC1g&y39i2{wq*kb*(T8 z-9(<=0WlEoPn~7YtwPb_EQ1#iwPSwQ)|W@t)c0 znB^OKkI#mY-H{Z*qQ>){W!UZ{?ewq?j22^T{RV>9rkB5s&*D;;fOF#g?yoO@*vHO& zV0|p3E#I+ZiXmkS8|+q!BW9gK0{}fuKPfG@L)KgMXh@>!@msev*232e9TLn76kLp9 zKejax_k!S3tqR?OS7##@g~GFXA%$y8pN*HNjCXCxj)ZW&w-@UfQ%gu5a%cHO64`*E z$As-rLPn>qDk?pGd37$Q#^K@IxN4|v75@{M{IvxxL5cirTmykN8#vxV5lb}Zc}ErX zfU4U&;C%;G#I=aX8a{TZhOg9=Fk@9}#b#^oiFw_!95wIfsV4=;`{+>4Woy`hx>+%Z za6klL-S&@u#X+X?2upg;nu3d%OIGj`2<5(gRt54C$9T{Xl#R~bv<9uJGe;k7R38RF{Sc|fiW#=ZL$+@&cLlrrybV3+1t-$Pm>c->vI$* zE8NdVuZc(dIiTXc-V4pFPKm=e6G!&S=LCw7q06Bw{@`sdSTM3*N!Ac5XJ|ux&Vlc;m8N`>0Cp9uZ}lf|ImO?la{mFZR{sd0p|#kF*h0OEIG2XHS*} z(N^ZGc{mWh1|3<5#}c|oo3S^$a_K1wA@eHPU-aUoX~JeGnK2%49IC(m8tSGE;}9TKYkQmkKh-)Im6a|N?$5C$Ec{7M+po4 zn%&P*`dvM10^|{SYT^|szQudM{>WU>q%!OC#OAwQ%5FYGP;ASOy*bbeP<>~qRj8ih zzhiiOq#@cyEY3}|R6r3q{wwmk+1dtdfDrf3sGAY8VpEvt$J)}3Nf*;=fJE?u+4 zHGz7B-clbOD&gi9O>UpI^eY7PimbcmzV`GljTprbt=(HNK=eFX%pTD4!_jp3{MCu9S8S_eX2si4tWSOV!YB7JQ1={run(RPl8S@lwq1x&( z)utPj@DJ}EJwB*t*P~3-5!56#N1RLxh)xk56(^#oSRs!R`DFd|_%y4GkNRf!Xm|Kt z1jM6w;I%@RRg{{Wu_qB1%ku|G(bx#rc z2&=Z+zong=bT(AVII7>QD+gYhY)HShYjZ1n5-vN7RDZwFH?>!a&U?6WwkF?As=sh? zvQH@9Kp3tYnlS34e7|S@M)auASF@mMPosxwhlF1qM~W8Mp0)(|cxV z168e@Bvc0{n)?h7#OyE3!BQFb1A z9A`{IYTxPV`jPi=w6l^}##H{i|1(rCg3wujCC*G@&EdE|?&6KZ&b1C_!H4?&2XuGbtZ5P7`d^a_Kmt5owF4XgTpPiG};|5)AprF1O9u81_ocl;-A$ z$oInP$AsE%_78d{dOG*vF$7Z%8Khq+LPXg`dc2c~@wW?;)VzvQ)lTV`Mg`T}2)_}N zZ7b|;Z5CSNR)4ubK0?s9CnqMs zM>e>1eBC;P8|UC_r)OfW*_w9SsHkI1_FuGJoqMxb_oly^#d=xi9Z`%zf!PRK-Fs3B zHfjemJTFwSmrl~h#*Uc97D#L-IV*oG{P&y(yN7(aL(p6IxFEbHQ}(lpOi&)f50H49 z9wZjvIU_QjSPlf^3la9njhFI|bnhf$+vJzt{8mn1_-EYCbX;>(avz_n-r=|yYbEei)s3c3JJ^}vS7T!n6E7pQ($zTh=@wYYw>Kp~mbl=NIm|iTkid8$ zoW*O7O16y{q<%6~YtURx8TxeWp2{c)sPiA+qqu-Cd&acd3PP6|P<8_8iva}ma;ocf zJp3M+A1A1>e9)(QuKMOInZt|Kpz|e}_+}IA!k46c5UqI7CPC)L; z=v=PggyFhs=GAlXT?CnI*0Y|CxGM6T=OI(!Rh3s>Ec|Og02q+@n4#kmwsn1G1SmHJw76Z-}_a zlzlUL0`g|=GWK}%4*~g;4MLzL4iR0h*Ah_9Rl|ZOb90sGZ}j@;BGoT)P^iv6Xsafa zhEMEh2Ay611?zE!+`I?o_}n1K`jT5xw!g4SMAqu)J(GJgX5IzQ?ywa)fl?jj^i}?B z8xH}h_=0cz@C82A*`w)VoN|9yPd=51!bj?SFzQTgAKsyy4}vP3T}f%+K{8s%@|?1c zXoy2-XqM$FVyVESqx?fi5}?a!2lq&OpU*Sl=o(tJN;{tAM2MdKb!7UNt-9+iCqHkQ zm0W){N#RDjUv+ekv4I~GFy+Fhk8)Vy3LZa7Y>EgdJ>JM%gZ`5SUf{3qP4Hitb zbaYd>g|{DE;sLK^)mu*aQM`H7kq!V`vB8UTMrH{bR!WIC3TfXK31qb8eu3Fc-$ zWTI1!Z_cRXZeW-CKMUp8RQ7eePRw7n=VQ}P9AbF7zbamJJbi3c7OmqszI9vmsZyI6wmc;Z zJhE>!W5Z>X%`#Gt4@`I@2|QCj3ZE4d=+~rV8GS(=yr^~+@RV9(ES+#RtoTmDM`=m46AMfrZT7+|h0+0ptJ&L2E2 zeM75^EAX?H@)>tlugks2_-qc&OQHunLh4 zRn-}%Y+jvyTUuV^++Bdp!CBI+%+zH8I;Kc%dmkof>QKsE)nwKeE2#VoU1_p1$T?Ub ze27ue?3-uYnC&9G2m}E78#r1g6MNnnyE`G05AEA{2H+lW8)VLQ1&=p(5M#{9EWp_bJ}KQ_N| zFfzeF&H{3sL6QOgIFnlE;mo4USJ5mp?piLwwam+R*P@qt+e;$o^wA#o_gvg%n&to% zLJ^h!kAmEslB$Az{YR?4f0{}W%b;zEaFo~#qCwF9Vh(;k=5YtM;=+H2A()#z&*yZMbEm!j)6a<*em?wEfH7f2bp- zjaq>Fmo#_^yx)owty5et#cYV64zHM>Ic&G@Tt8^w>1rXJ*syW4CJ(SL0J1?D2q5;| z)W;Z4Hfrd7oHtW4ZQsHf<6N?HWpH?!EkPK58d_J{)BMw4G-|_Aw438vu9x{`ybapW z5<7|18SO{wkalBnpAo|*DM|p^e~Gd~nojB;iuP1T5z&ugP;0jyoxmrf>M_iyCUt4{ z&bt!utwD_7Y3EbWP%Jck%l}7)Ljkiz;;Pn^Z};JV32*6uPLT>+lkDhY2hXZS<8VHP ztRBjIbsUu-(9pVp+mpyzAC4r}cl)AQF%tVoK7lFb@7v80=+}%lzippfUy6YK-?WTk z|7WBMmfq)n){@G&GcI&IIJ~_6+h}Oun@^F>C$37szn^!t;K8EEdR92p z%xN0ul<`8BYB5SFy0T6?y=M3SoeSW}J$x!)Ri*(kH3>YJhc@u3i^~p6Tt4=dSG_qg z&FXPa>GkX$Q#t(Tg6o9AK!?$Jxp@0p{l>{!n8`o)$3d=@^KSiDk@kYFAA{F%F8mGS z1#y$tU*Yud{OipaBp)CHQd~)-@$)Mkq6($&jb;1?XUbiSVC^InXpRsE~;Zehx}?^$$y5=iG!yxgxdA754>ytc8QD}U(Ve?C$TyIcGyk{a4+QssYnIx z-Mm!-s2Tk7^*Z_CQi=TKW1-a-rtL4P0EIO^OfJTuDkdv$7)$`MO`64cG(w)Du zeYJTLDRTIrs%<}+Bz!N0(?R-iT+^wm(;l#{9S#Rn%5r@WF_LjPa~9=Yk0xEJ%qq=> z13?~%S#I;; z8hC4I=xsL2)QGOIrKD8&`F|Rj3mgEJ^^lO(`CYijU!YPBWkb!VdtB)Om92j856|dA zV9d)%DN-OtW3XYHfFnaoA|eTx8f(_|l;GlQ_T<;*r+o?B8%(QTR2Q_Cql16QoXV+B zOTdytEvxV}Hb}qGkIC1461*XleVSWZh4+3!!rf?_`QsqgiHIaC zjD`zFEvQDOhh@N>TDlAz-X-FH;+uGRJZ4m6?OA0QI_oT%$u9!R+N15uS0o+lDQ zhODDm5jodUX3@9P2SKSwXs@Suf^W4r4X!i}CVb<8LNm(KS{|Q zcTm#?m}%=jFba7X7Ka9lMqd1$F*fMKc7D;?-h!6QN~Wk(!o!|3kiedi4z8$VigVRG zlch8a))!CIyojWeDRY9$p2=&D2375@$%$+NL=1Pw7bm5k@NLK^-Agel*gS~x)1m}{ z`22cgigvH1qnU`tjBumr#=l6pmwLLgPNt<@_Yi@tES9ghtQ1rl)N`B6R9TH6fWbWis&8{6x@wyDxza>>?XViR1lf>xfmheWAI^OLb`ZV}N%J48+5tLC

nQo5&RP z-L|sbxJv{=0OXy9bQb2`9K4YgR1@V5;lP8|p0yF6M-9!GPY#3?Y0XG7d^x+Kv%^O+ zwY7wVhk!MXl%=B{B(X~UE+ZN$2~#~H!V#3e=)oy$(FG|K#24dVO)9Nx@13E=mac_h zKO0$Dz+rU$9kHfZ=GQA1EP8&Jb#N($8MZCNR4w%z-v|O!{Mf6l==W~L4Ft>3JMs;^ z_S=-q#hKk4RT764`A*C~Nv$S?WWN%vCkpm1c7iVGQ6nV&ZVa$D;LyH5Vk?F%By*^^ z%-pjr#9%3|4k3`$_)ws+SBj)8)tpO*I-*K0yxBv0$ z)M~>TUPAPoXQ&jDnc@3-gvWq{>=u-5afmn!`N;m5q{>!sMUg``O`OYccmBCwkDznSv+(u9G+V*_^i@Fnf3l^k4N{Op zD_MuV0*f64c=(_B_aYz+2u4mTGUZ5=RQ2B8b_>gT#|%y#@3{Bs{$bd~!%+;9j0+;2 zZGrSJ@90xR>1;l_axK|_MJHG4_r>lC$;D17L{X_n6#h8h-`SfTNwFOq<}*pn>Tzxi5jgwh ztZ5g9e2WTH6HXuQq#_=Tovd6FGl2o?M`;tw3CuR@V@(}|`YgY`Gg7&2H@`t+G)E|xKu({f3I7L za)KDv)^PTlGOfs~n^mG-LYSbCDKGog&Z6vjH8w?}-KTN(>*XtX2xrNvgLx5n)&~#0 zRuS>ko`e9e;G4NtA{+7LLB+mB?;sd0QWO@G7+%u7PBsiCZ4@jvV~XUe+fO8I>Rx18 zQbkA`f#zbn%}Zx85*&KNP3Nuc#sx)KY#kJD`WDRNCBycw@FF>V%^w73vtUe{?e`Xc zpBe;5nMF#6s?F4rO&>ZG$kcTmRehmoZ|b1Ga+{k;;m+6E<-?LlLvm&F329U_!7kCq zR`RHQyM{>?^~j4i816Xiz%9|^mAR`q{w4!OXmGRCKw>c<+a2?Fa<2`UGp+$#rXi_Z zFojGrZq&oq)}YS6alnv4oLv_s_);`U8xW`sQ~t2?1sbzzhE;)p4koa1w*RhnUa584 zL_nKTzPtL<;Jj}bG4l?x#-_oU!>76zL8M&rCjF|Ls1~&n<)>>!yTjIJqdR#c9rUDO zXLq$bLd~>XI{s<=qx$yug$-c9l%ivaZ&adlZI0n<`6O?Q-h?zQN}L^GFB+B(3(71B;B~KXYq*@rWSD)xlq?og zbGF2G@k|=9YNmz@>35DYpC;FTa@0m9BH${L=!ik3vx+B2b6p0(B5G8eJ>tb|Zr}_5 zOtpjWVKCN+p$vK}&3!$uzQeQ&B@dApVmP*;$SKWtRUU$L3P}$L5@~Xwzs~DIS%DZ^ z)Jh4{{^QL>S_efI1KsoWtQ)qI#wne`s1MlEs?RKTV2`rl8Nz)X0*j+%Tz=58~AtzD{kN%~ze? zuXF-qFC4>?pHr{&T-x-I85L4zI)J|oGL4-gO~$i?p}I9uIuZLD1XT@0+V}I9@lPts z(sSCR$!Lc@;&jW>(A9h&Wh!-eWx9`GX~shW=N}X9y%ZR%a7ji!XB?t5MecoZX`#!F zr1hr84b5 zYutR>W)p4iG5x7mGRKUT(Spn3et3Aq5@!@=yQW~tQrrjQeg{`gzx3|et1+}+EWw0V z*M>tBDqYFY&1Tc8qNwsv1=nHjGM#a-xFenJ_R;xHTSkX1BhHpR__FF?^Bb+GDPs~? zMEmaeMM4&y-VN%iiYC3&!*3jUC;@6y7`1$LL$#E}sef)*GU{gSpN zluXGM_@T)%!%`+Id`@?k;W~_6S=kzcbtl9xO(}R6CG{K;J)5s!jCOq^!j9d!B;FcM zb1(_V@;jiEdF;SUI!!rkF#(urEU%ofAu-0v<|0;4hWj5S#xmKB2| z?^S(9CKGi#hKF;Br*?$OES)~1PP9TfR11HQZ4tI_q1@Z!I+yAvSUy0vLVNEuMVw8e zFK+jDwj>kE7tt$9yDKocLLOhvJ=#D>bT-O*u0OQda-Nib6<5wpQ*-iHsGHRYjVjoT z`Q?>iN#Tapd{@|Aq60%J17uJ`Lu1sVJYDn{k^@7-kTZ(Tkd;=A{S`LfS*&f4K|4y7$mw$*z&Z8s534(T@O2T>ve#-Ln@6c!_tyy)zXgNKGC=D zuGq2*9Ul3$xK79c)sx;V5k>n54&~kQR*;}kN z`ltuIHlsQ{ciLT|1>nY|Ct;16_L!vZkNa0+ZWN_%5YuQ2&~R={W;ZQu0OLtv1{pjX z8jhzAWcSBGtzi9Z(;L=MqAQ80;wB}!Ht|Gu$6`!0Occ9#^ljZQ$CryhVSCO3{K#rj zHC9Ws>MfFsR^?Xy?h0tVfiOAh)(|x4*+0$acrb4`jCQbHod^+UeiFrcR6TO%gBhfV zUg+B=vlmZ4?>rvEmhp(bUeb0mlPdECvbTayCDN8l-TMbmj%w#oK?8JM^9J(!=A(_pPyAg7Kyvr8q~C!}?YJNoN=XU41~ zlwoRUSzrHcT-Yh@JkC*vG07_JQ$jnGu)N(WMAw91-@}D`HbSl-;v|6bjuLFsWw&j)pJ!ZAgFN1?)BWK2Z7q?}Vrpop%NfeF6c1Nj?pUG{d}C`-(!M!c z9I?#VwQwrpXe22YHe^VucybV1GMI9-F;W{RG#f4bBMK4rJi`uaGuKx**d100+D)=0 zmx9ZeP+IxVN3vM+O4?jw!~X8u`yv>29bY9ly(t@JOK|8GB=h2a^6zQ~t`h7pVq3~a zh%gLT`PGey2}gt*C53D0e6Pkz_lR@v+N`aO7>A_gGpG+v&zMxj>YRyINY;w5XPDd* zsixS_Sy$!`$$pIb?iylqp5wIrsLThQ_Csb2eW4qYNsYb=ALIRkvr&^R3O+e3?&+#^ z15~mzEdt8tO!K$OhNW92qQog!YSH1qAs#t03w94Sd<_>#Y~ys5nk?(dKPz#G>N=iF z1v%*rFx2HZYJD|_RJ*;yt5-ys*wih?p<>;Zp=e%bHG1`F37Y%@QQN<%_*_@OxVp99 zy6LWqu2dRE?=8D`%oy*?h3_bgPf)7Jvpb(z8bj1m^Kvi#B^{TOIrw$y^K0qJlCM2C zOB$F8MW$VUhr>*>ErMJjnv9Pg%XT(ZTeSj%1lxV`x__iSYPn-Ok}|F42^IS~L(ai~ z!1%T-Jay7Aln!Bt=^((~%N6T)WH7n*L()UymT{E#GP%~(ZW9r|h#qfaQ>iIK{$P{J zg0s9wo%q)JN9=pkiX9>sTCZi_J zl+*37Z7LBQh6u4du_nJMM}o4U$$Z*^VUrswf^E;3^~_k+vJc!ERQMsB$R5dMk0OP* z(sh;iJ5!%qY|<~qpIs@-n;N5gvhDT#Qb+(+XzEDoN}+dthC;Ukd_dT1z1OSt$e2VS zd5Ko0_KVd33D!jOD(ZaC8tlna##sAtWL0W^gWO!Ds*)IEm52+Qsay5D$(TVY?iIpN zB{#|GLX^5gL2_itI+zhCU9YQ`T^bU5>-gi@A;i#SNU#~$C+25Dl@ajks@y8;QHGVL zJL6$7Va3NPtL#75`)4M-e3H;LCfHjTaWV+_;S|TNF#msEf*nl$4^LOXs{ml`=Uw4P zH;g>T^8fjV|NHkvinzmB^X;SIAeU3_jD5WqbKD$p&PkJyvlEfe&iaXVws}tb`6*= zbkG;TXQj;9aA>3orO}o8Y4twUA#00-=$9SboDH!DJ^XCvKWkvbK{fVynNg2)PSxcf za_RAj_T@maJG#zA+=eSk!{OYw)*2J|>;yk86ft^y?SF$8V(@`h=x+Z;dh0cz*GU6< z7qw+_^{z3t`<+qftfr2rvxGWSgFqbNy)@F5ob3hR7Q_-~^$<>60N)0#8*J zh&l*BQeF9kwCYv(eE1Ze(;oSFGdlP{b*~-nx_%9;v=6ED&(_B?RwORFuBhG11%8FH;qi=pQ{@*MBB<2&gV%cLr!j{o8L+KgnrejL7T z%-IH1xh&!qKO7H`pVRK|y`|f0#`j>bAmh=;*^N2u&QzI^`$jC?=EvxVPcX^}r=~ zrtBZpvaKJgd*ho(nKv|_HW~FF9W5FU=;0pf)eFd2bv%B$z(J2an)+5cN%BGT`FfZd z4Y9XHcOry#XZO5*hJ_7GaYgdOr;vvqp`e91Vh)d?{&HxRLDu*MfR`Mg(JRaIL_=T# z%cNh9t51r<6W4i6m)ZZVbl?ZPk_9oVU7N4jR9Of2&NgRFVgWbny_8zczO>QBFR9F? z@3bQpPZGebJu^Q1*x-K9Hu zcMS2>w7eIBLRZ`Ego<04M^QO-Ev$?dAd*Cp?_h(`o4OdwoYbghb0-ik6dm-ZJl_@%9C6SpTs6--uTU#~OGRM%5N zqM7&DjWt9$a`-2sBFgzhKuisixH=LFIT0S<&Yi(=j8Q=@S3z@vw z%)|LG$@<`9?{*O{3*ma#dK<&0@vc?I%O2!y=!QDIa{yvkXm2>K>mu_NK*Cgh2G0(Z zyDtD@mQipH>`%8}cL|i&204FYRQu@w_~tqAl&Z)c9Wo~M;HkUwzH=hm6ILSXo76uE zsI;^;e-G>q5K0TUfkC0T@A-TRz0&u$JjN7O2RI?|Cv+dv&)zlz(=k=eoDF!QlEL31 zpXcz#;=W8J9VdoI33=o-IXy9?5Qp6i$RY5Qx2?VkQKQ}>R(LwQ#G;aUjpGh;#Y{*- zFnX@9ADuJ$VUj}1ozhQMOn=vY7i4|7q0%SE@rl;G(@>jkVUAc>!kLT3DBrSV>q2c{ zKDn;xby-JM#h_kSVpjHyYQHqNjzh=qs?#tM46S}43NMY{X{HDvW=fb5e5Zt8KQ;&hKK8;P3!f#nL?gK&)Vs{p{nCzF?Jb%KLM-u7jB0 zhU;|io6!o5KLI6hyJSQO&%}}=)LqQzuS+G{k(w)}16b?-vEdk`ceuUEdo<>zJ&zA* zuyNTs0`7Xt zKgX|77<9Wlc*Kka8)5Ets8b2m*2%nGl3bgWUeW?&=4$tQJ~WVf{fU$NzFEZA_udjW z&V9m&d&bR!_jrwa)|VGpq(%LXcmX~Vd{xdDrqFB{kiN8{FCJ=GEp{jw9imun7Y_L3 zf|Hy|yxT6d?TWhFICCPf_GEykv>t)1L6zyuJ+^~QK9hJRC4^gCmqR*%eMD^Z;Kkcc%H1MUW7<{ zjVp@KlFEBa8yVqA2&0xltbI^8kHDH8pQz7D-q423M<1*_?&$P%|CyPfUhIEtn~zJ^ z!hcIG>lJYS!t8PJ&Et`NDr$eI883wRJ=61L_Yt!loyjQn3c&YCmaItVl;BA$fxkb> zz0|1H{bQL*us2v6n_DU`?5PEW(Y*S2O~~!9*~1>yN||W>Kt%83LgemR%?O)2(tZKs z8vL^S#(7pB(`U~9$sW%lJFl`Z`Gxf|$tKsraR(og%%X0X6$B(OB*@?R3viwc)YI_s z?a|u1(r%o9^sxWrOb^IQZcebj|0>vQla}#B6^kV}GVerutg(w7gXJ)6aZ32k>3+s4 z=CeJF`?bK7Y!9P-VOL1!m$RK+81X}T;tImCqAK&#%vXmQd2HAS#aX^Lm;;;Mk|vM1 zjU)_=Nb24E-k%NiZBsn0l5}ySxF!U@Kd3OyyZQdkA=>YZ(#AInJnCG0W)`()%i8-O zW&DmDso>q*7i?(q`xaCClFF!qF`>t&-@E$}OuhNio!RN3`u28QS5|i7_xVZDXxQki)x@R^^7ED@VV?7aNUbq zyXpJO%?AfT+3@!>Px3-BzP^!0(*U;F(o{(x6h-7My4oM$bN=#nVSYFXg@hYiE`OQF)Tjy4-P~ zp@-GQsqs$V;^F$-TVhQdrqi;8w1*cyVe;pKCl{Xqb%=q!SbS`&twxvpzjFcd9NaFT zI;393r|lCM+SH3(ht0dXH#y(i6gT~HIi9Ha1bau1>ye4e?iaTwKcpa5pluz)q_BR(G${PPclM3&hwD!#7SSO<=DCoE+ ze5<#(DKmgd`;MSre7l1lz^d z)xHrc-|3eh8jv$yls+7M2F~*Z^SV03LCmR`569`P;uZtSN$4DAD7Jc^34hdWm1pas&=cj>%CbO2c1Ig zw{*I;xKLNT%NvDhkEm`om9}@AQzp!hhDR6ElinRk;53Q$4%?9LX*$k;f<=y3W1sAS zvprjxQ8b!Kn8g?$O#9z=6o-2r=L;8Uw5~A?*pa-Vwp*@L#rwLaO9W|tV2hkp*|P;HO!Cyluq44xI*qT;d$9e!UYh>ROkdfe?!h_77Shzldq>$Q8}p z@1)^1FkImmruEcK8v5c6EzTY7y8sImhC=%&eB+;~J#W*Fin%1(x<$Z%Pa))L0~L7F z8u5qxPq>if&t%Kpx$dvo;CHKjnJ^;-mW6*I1`H-h=fAbj=Tq6$?&p5aid8Jbq;Zw4 z=)K@wJ6>V-5a)vZ`e!?bBPO;Ym7wDJ;RUu~(2ca*ctA&iFfgb5w%eHc>V+g;i6H(* zCj>Frr)xX%OZi=L0B7nuKNAexl{YbCC*W7Skpkhj(+%L&ru8U>Z_PJc7N~HPp^74M zC`XHH71)&B4Xf~TbOH|B-@E+S@4Wc&K`!SwFO3#olFTS+!gyM+wWw1b@iw;JV|0;h zmf2q{1jCdWQJLuNyX~cT=7(^IzI`stNBYM55vFu!Xn6V5i7;T5P?rSK$l2x!1+Idb zs*%uN0to$(%&^*tlb*SA$8qKXqr?G8ZO!36A5!FYQT)uoe&$?*VSy>wI!z`C?FDkk z%+X@p5&5KTFIYuALlKT1xt1@qH*1^YxT35^M_w)5Q4ADuD`ZRzA6-syV`%)1Iya{R zPXYqHm~4m=)h4&|x6C6ytxBv=*mc|58gjv?6{;%4H1~lg0TT~&+CDgcat-~iO^l(+ zgEUGD$}z@OSk633&PCQa3+O@r7sb>9kGlYP%vo=R0v4nDwTE=x*J6mLq_6AxVcr2KpR44vUZ16KY)5K`KnTH0gBHzNIP=cv(4VBnW zg!Be+@I*)H(M&Mb)wZfwO7(|Va_AS$(tbPvJnI8>wA)`gKZbN+ zgB+b#V5SoJ`xHpqJ#g+0*ozF4mSAA_jj>kN+%T+;arp1?1o?SfVs|&R)PX`27##7i z&uGbILR^3VoJ$227Qs8L#2Lg^R>y1w2=apnGt@mHy23Y+Y|B?*hpZp4`(pqBfkYEv-E?<@5L-pVm=_4~7SZV?(e4NVrVt$)D< zRo&u#B2O8{_E;$0*P|dcK1dgjran!k&etoQb|AbfLOgKIKH-UCZg^7I&t#?9cUVo$ zy#V8(HLLja@QBC1^>S9q9j;QB(aYHR`vZ;08NA@o@Pb!%%>j*L`;)TX$q{3}4|c9z zTAdg22s8GUd1=8YdCP&S;MoZGn-s+pq-16>@e~VW4(9cqIld3o4?}FHXmmdP<4iJy zLq*l6E%M(4QmBTK49RaeTbE3vbFj_J!~An2IP3NX#ShRh)&kJ8-9A^r&Db=Adrb-W zVp|v>L*L2tK9A`|Q*Y~g*Zef-r!i?(LuN)#;}Q>znCd29(cM>X6)SC>9Y}% z&;JmPWu;~B}hX_K@8#Bw1Z3*c-{UjNOVpo+M2*2*}7C8iWP$W`8 zHfR?}jqC8D<5zpanQs#m?o5C0KigaYb?gH+ zW#AH3F=NqHamb*On;FNN@lL}dTz1%<{GxoOC5fSeXvSj?^9wJ#gT9=bAA8o`i|Kqpm$q8Yz|jJq=|Z(|zF&@=;i{r=8ROTY^5!VVml z$G1Q_|CG`?z(qZpRQ&e_43b z7?Tp=zW{E;gmMddENz7v&|-tZ#G33Ao(1Fui((50?ZMBFP|mN632n(Ly(UP^mO1*H z@aG;ID=A{|bB)+P7nUaVOED^UIwO>dIf8a^rxivW}%R(nT^|71Kd=h6ea1y)i7h zfg~f$^iz+La5{b`K`78XF3}Axq0Dy0dv1FJ>{|=R%td(T9QFC~-z!9+P^rw(a}YBs z{Zrht4F7y94pU;u;cG5tETc|lqx!`S(-Qv>;`MvViBgWqmM}&&!SfSscHVDdx3GHu zu#~abg(p16WW5&o*y_2ro$Sl^^grBGN_zynp-+!I120-c?iQ5eWT4HSMAwG~Nw#Cc z@xH{?z>3~xM~$curIsbBTHafQ22K}fdy%Q(j|OX`Qjd&INx0$p(El{`s3_xJU?UCLAgjPPqNyh06NEsp;$o0qU^h>+-M85kql)0wZQZa=1 z$RR-azu0@r;JBJ)OH|BcF@wcqOBSSU{@3V_yo>Gh~Z^$L9Eol2apxXtbm={hu$^e#D=wi`Z9wJJx}DDco_zgd0t=PuV}0R) zN#SlG8t~_AqJOnk;XV=z$phJ_epPTI-ZY1*b0AIRp;|7;;*P?p59?A?)_bMmS3X!Wjk=ti}nQH+zZr0?h}|A&|TT zFK(-kH=hYLsz4rEuK(JF&$Mk4hDan`f9&hPGs|5*Lmqe2xNAoblrNA(m)MMqqao-~ z`^VU0d!tv38@Q+i2SyOis73htz~b#pc|Z2lUU!K?%1isI zW~U;sldUh-A-?^S-J7AoL*GD#yc`Sn?H^C4-ZZY6Pi!(aMfvxe+J#o&U*;%utk=Ip z%*X=P`r%(Of@BQvSA>bPwz-k7h(LUp)xDbB?i@?9?MyMn9+f?dbW8@*T>-M#AQ{4; zqB8ph%kl?Gwkl=YJ*+l9X9%jC=g0h05kl4!Gzal`O(vLR#QJs0-|wGqUb}ec1}a#$ zPf+%)g7vds-94b=5^Eo`Nt=l`6p)G;ex!S#Kb?%cxCnr^9scJz)Bkhe^j{ymWVyj* z|EI&I|K%)+Sor^l7f>m|Yh~#YW^hP`HU8`P{Z(`1dmR=)Q7R;&RM?Uw#niGqKQQlT}<{yO#hZvRGY=X>XDY@ z2qn7YsdI>PeDjM9z4^AkFkea`xcIH0JpxQkC%GcN`AF`5BBu2w{b7vyu{&@SwZOfi zu)+qMlxmtZ1SPkh_+(wC-oIZ^It$cvTy+`aSBn#jes(@~qF3g39J%Fa5AxsGDt^`j0Yiymjt<0P{l^R6nVf7viECSVthC z7X-7Tuj2(_Ios_&qeNW{`31&`lwfZJXNHT~!T&GqUjo#JsU_zst_0Zqy$u~Fj!z>wh7y!|MGH&A@?w(V6n>>J57z zz=YxjKS^ljf#M$1yqW&c8e1TvOZ>?_Hl9(q0UfHjKPoqwu=TU(%h=qRn!K(V!d~xD z>|TlZ;inumnE7fzJdFMg>u|SMtdPk%^LFp)TN1)w)Md%yT_0*Nky@vf2qMC+=x9~sRxX45R&4Qq~q7U*x zOYb)mhpI7Z36)LYFR(-f6Xpj`{6&prX^EzZHpqU%7LORU z$m2$YEAd}{MIXLA3VUk?b@yZJxmN=$qgd3lcTLl-AtR5^&$#)%% z(}>AZ&{u2vL%Ly}_&~`KrHaf-<+7CBgm3OK_tmMRYah>o~3It$M zqs>8(_0b7>!|f=d1$V8${En^2_s1^Xms?h!Cnlj2Z8~aj-c#(OGrDd3Qvk=3vJpxr zV`3EKR~v!WR`}io(6Ev6kff=BHiK|BJ4>aBCOb#ZX9`|_wQrWcdm6K(l&`*SJ>pVg!=59!?y2 z2zV?;Nj+zXR$xC+SxpZ|D3zw>xwO2(EXUI<5P>^X6HD(9+>eQ96xrlGAqi5?vf7v@ zD4b8+AE0F+<#Eg}3z2KZlUOkze(iS)!}cGy3mRj7AOAcjzZoEf(m%(|=gq46OAU3? zdqgG>S&InA5@N5ByHHTwRLs{M4~FFyI;WU?r;`11KUnrsEZNn<@>{1hP%Y!E>KH7b z*h{dyC*5kscq1Mxl|U@WOI&z=w)={}9REd>5R*q$D(Ce%^%w)8B`qlakd#ja!p0p4ufvU{m#By=U;%q{}s!* zp~NB!fi!j?q_50aW)Sdg7>J=JyFMAfDi?tKkBuM^MHl28^@j~kg6^v#EboDl2qhGV z!W9Ukky?A!-Ud;<6beBC>lH0tLmpI=Zm>f zhbxJv%_Bnc*62 zX<{Q^h0V74D&Whr+MI|r4dPc$BPfW)bRjpXeGtSBrdc+HjZ#`o{X7>qRR>u+(fi~+ z8?tkGX9btjvP#sm$$;f?bXzhFFBW(q8Npi8uppqcw@E2w9N_p&0xb~R#)%1ti4FPajxbY-1 z`?h|sh=2P=IfUl204J5++PDGOF9%w&R4e1?U5n>ok8_62bh`Tm=`1h9@x@+HbojaJ z_&`ySY6xj)tXB~7m40x4`g?3@;aTw$dD) zUhG9mH;KL=>L%kXtU{~k0>7vstm?LZSCY#b^b4ZvYtTPdG$jg4ZbjAZiHQzge9fWs z5}}{Ym_DKFu{LxwUU8nEl?~`iG2TowrPTYafIFh_!Y%O^n(%AGqNNgoH=xG2!t61( z)0myzv3y0EUpmOK*3W!XB*@FqiNyX1?I$QiBg3kv%4~Hb$Qrvu{Hur=y4prWEkfxu zL^&TSoQNI7O#x}-xpta2smLq!aW=f~PXxLoJ%+Vs&a;AlW0|55!J>vA>(6K+Kyrz! zUW(y6gxNa+kaGV?Ld1^njjEib1Otrif?U>{U@;q#B?OlznkKTKlgjKuOVNaEt+^ujLAGZ$WjgPu5Z2Ao{=D2*}3{o}^9xU8xTJ!u!H9^0~Mcfd$UPO{>F+)mAV}AYI zpZ%~bsyEGTp;>H16&EF@Lushfccar_EOTPXNPv&^(3vKVMQUFu{|=fP8jvr8^^nIJ zMaYv~g|d+@#AYM7PZplAf&v#-<03HI_$s>rf-a&P^3ju$KF4~))nO&uvo$LTO0*NZ zyy6X&-2(30?UshR?OE{Hc#u@aR(VQ+psT=uXhGId{b%knWspd#;eF-t70by}lke@8 z0sl4#xN~S&C}>H1x`%l!;p$$9N<9#EzLl54z#<0S60=pn9-G9O3K@vMmocgUgW(JH znv04ch013_ngePkN&e%=grGY%T-e=^hxVIdSKex%d7!ngtw}Z+%T}!bv^1nV-U`1! z5gFPM>ZLR=t|@PVJQ=9C2nHFl8)0toYBLhQ(<+b|Z?66xZ6a|7QbD2tA?JqivfOp! zhwac}Gt_VACiGDUAt33w(4I}igB`fU6#%|CgD+}~DTn@Hy%3h0$^e%IxE_-c1UX4w5ELAmCRFpqLn=&A?Izy1c%NBC_H=$zT zyWvfscnqi44eY>$kiW@1jqSjX08x{vS?wojSa-#l@uO(M^2LP;;UG+N|9B{=EPJ0V zZM3Y<4R;3^G z3GDQO*Nz?sLPa|~F8cjPLCg`fn{vyyX}zn=|GaM+5vC#V7lBwl8KD*^!_OlujZSezmGkKrnPhf5wG%*o( zh&;l4Yw&#r(~p2Ccg_&v394*#0YpPO>23a(TQpSW44gjv%(f%hBHFo(PvWGST$CID z*2tfh9%Mfxu<#sv?mj51%yB{*>IoZeh>@u|0E&s#*n@sE;bY#2ZE?<^nAYU*ni}a_ zfk2McR%h6Sid}S9B%O@jl6UF_^`H#beCEiBP46EZTQTF=#YZxHT{6VtR4!C)E=1$2 zssgc?olH+;bO8$%h||9bYB_HYEp^_ICWmF`TUF5KO9z%Ws-ZbQ*-v!`P(Znuaxsh9 zVlGtV_P8u(`_p(c7(0ki>)Cb8POJBbYqPe7qYH3)kiE4wz3HJ0y{cTntWiwPIgM-6 zqPS36YB!W@9ZGLln;ZP2UQ7QJti9KcG1zc@Gk=j#d`5GBuIZ~OF#fhJnLUKk!ek*3 z+eziko$6@v)}Q8$KELHl;hP&|HC>SCziWuCNzh0t5bm%WQ96fd{LW}OJFWbg*5$ju z^X~kgcLBQJtt?#_Cu{&Z{)2#;H>U^EA~XTU${`pl#RE7xm3~K+24>{Z;foF4@*f$C z8dRcDS);?yw?=n!E@bDsgkJD$&PZ!py_TI_?^k=kX?gZbvO!^MnM56a9bHFHdH>IU zQcT#@zR{fibz8^K8BF7bl}0hyMVh`=u|>(5f$^rA-p+KzGI@X~=e*C2c`8xC*9~!^ zAsWv{i)`MyGnunA^Lv96d|>~<5yEm0cNzBGsAp}Iu>41e$BR>Dn~B`PKP z@W;$;il_U>U+b+Cjx;N3+R*&HSL|a0BuFFy}F-N13j@gTU zZi|@dshso7D0^uwpnJ)s=*I1#dJI}lBih4w0)FEvwlST)9ye@?Mh|`28EB{xr5SuQ z`mC+clJCE;8(D#q$+S+0x@=2ZDDK~-b~2yPr|VL3S-}rzGglB>OM3kA1>g?`I=t?L zV*<#X(wXxkXhmjp&O(A+_YHV8+dcRX|i;JZuy%l zBNrMHEETM#m0Y8{_IpKR1w)`>Zw?K?AoT+us%pad9-O|eW;eQmWX*u^@2u8S&n3@5O`r>&d55y zaRyt9jvt9~?T;M3QGr|Hx_p)^8HvNjbZ^vO^kp(PC@Od*fqaozf?39ge8(ABk`_p@ zW)sHjonv^xZ44Yrf47M~F%MWin_E*QmAow-MyMEosdD`M0q2~myW}xgr{Azei4%f! zc>~oGWw4u6alC{-{Q1;!VP;`kLbmLWU0rZo#0)(`*&M74w%oGdx9mN0PwTWP-C96jmE@*xRxdZfYa}imO<<*0$xR;_ry~6NtEc9KR1*y6 zHQV}kMgZ#jPucv=XhkyEn-h13F1O zqym+0-fGpW8lw0j;RHT!v+3ZWaF#d0MdO}o4L1w{qop&nPkbrSM{_Y7VI99pHf*l! z*D_`8R$FQi@7lWr7#9^#qjvV7;iK5f842s4A=-h-z$CbeKbr<6iMZC6G-)}%;ApRW zot1injG=Az%gP0X0Sb^p4J*jy45cEO?yzDI&J1MrghZ_RBK4>ka>yk z<(fBo)dThUnl2dAY9Omos57$ei__-?&%+gyDN zZ{%^mY-Cas;gPKXdjoeqth|3Jj}%5JSR;8QVJZ z3f+aXsr{9A{cTehx?@Qs1V8JRbP*`s`|~@1?_Q#6Wu*mYgF^BQx1x_?KiC2@{c7jM zB2_cQ(7VNQ&HM#v51bFfszeoL?1Ik?uD)J+=gUT%4$L*9{ye$8x+%d$;iE=|$nOja zB$v5!cV7Z?%l5Tme5#5S=FKtnNn3H-X9RYs2Jh*TNs2Gc%e|4!buX7zyaW`-LEKAn zK+?WuPfPT;Mt0fi5cm+6jZbgHIqt4wgKgDf>h*Qw$|q}P6udlh`ZAt#zdObgJGw@+6ZkWU^k^p0k}4C!=-^b ze$7H~9HBciP$dPt(iyV}oQ6owv@<5*g|iQzduw=j#VfBlSTO^IqkjXb+)6Ok8vTqI zc=1xp69X?MC!GWm-@2LAFvAev3zfwTH9P(M=?5^|0DhkF>jt>kiAmkIl6Rz%wq}3cmOs>_AG4y`yOsczUkv|;pjgFp$jXa|$ZrOk( zM7J1vB?w=<=`-1GC*a%>O$^!(5Ved|a+d}bt9ySQ2vl(=Lu>p=*weib3~4ea6O1@oub*!)Fz>Mc6ZtkGZZojiFRfVSmG&G zK?6axM&x6$WeLZnY`eX0LBk#UFA=qB!YqO@n6+Vfn`e6YiPq^&Tl|C^5J6{_GW_(l zVi+0Us>gS@3Vg}fg@ea_tiG&2cl-N!Br!c zyn0|fQy*vem|K@lW3|mwF<1;q)fa`2nQoJCR`b&;!f-RUa)>IuDB75X7+o!3C|^pP zx?Rltp}9`L(nHMMh0rjydWQyfWJ?Qa5}4X&ob@CoC|`E2-JKpoaOix)>zh+;F!|w~ zb{#Sp5rGzbtV|Futy^D*2i`zf_9)th2T;LD@pQ>^vy(V)*4k!(;fd)a@W)0yWwMkd_RT0g4fhEKOJ zz4MHZo5xHEiC|hxj%dmV_cN(MGRT#yT58Y{g#UT_u+n0;g-IdP5&N#$XZ;L>%PbA6 z_?9nbEFsVuwsN{1sZ{T+x-xd7JUzfnq*zp&(4xuKi?>v_@$Ce`c7*YOI{cXm3f?t< z4Pol>8rb{9cV`tqJm&8ZAU$5bRg{5r-LgemY9%6|4zh@*1bL32{BZI>cf^*S;0@DQ zOqa9lD(et$)?AN~R0WQ7K%8*dM(-eLS)9Yfn$yYyo3t|AJdE20aK7jCWDP3AD}KUd zJ%T<1sRR1dWq*pP>Pz9``{S>2<}8%H1Ns%NMS@bEmn;EV0@%U}d@8tDi*xWnP zTE^yllcZupd~>Vl?OVg%;Uo7h2hFwYNIiBfj`++%N>(+abH7=pdEyvpKx<>)+ezaS|v}^L#2{*+F=+r*PU+`1-r5mN& zGxaTU-N~%0vzv%g*dE9Y=ov}W^ZLWu_jy$88e>kKW-V46Cdbt+&09e+@Ra;ek9 zQsUp+MH7v^5UKG%!i?~=0WW83oEJ{1@7Hfw&O>_!nI}q0@fOpdlu_ygkLV(j856H! z!^iF`>ff#9g607i&Em$YT4e2C;v+8}cI@xLI4aR7G&*{u+J~~<+faQhv~19{x`OY$ z&P|w-TW_@_L{?RYM1+?3J7K$)`lw(U#cYhD2sYOGi@z8H-zTGQwb5B}u{r_zFPuxY z)0-blQ>^+?WEVWEH7lBu$XJZqg~Ik92-e(!v{KbQ;&~q^m$ob!CR0~hg^m?$q>->Z z&$k7Ui+2e|{*A#F!sUDus4A$yS!T``cI!uOPSE-Cm#(fXgJoL}m0ZqfwJtw3WI2qr zJ^!FY$Em(%OzZBD`6>f-ueWdlasJr>EO zhbTNMB_an{@?q}XAD;4o+Z+b^Z92emdiHMh9@9GMt5{pYpol^ zNuFGN--;qr)1hwl3^H^(QK(9(^_31w`IdL}WiPj~Zy^MPHto!VORqqhzEfXE2l$Ou zotn+>YCi7S<6Ru@inP7c#C&6U0|4)QRD@tif87=gEt+47ZD_j!+6@3{*$U!?_J6W4 z{zpDY3C;tqUnM@^zP1Nz$9*5VwEu6GLr}@e&gYl!A1_*;AJkiK8#4E=oqHw3MwYv? zB`Aha7Q=-ayl1;bT@T04*LA~$oX+V43V5RrSyF9g5?g&fOVah zPsehw|C&}y^h-}Lq<{qVBN!Px25yIV3g3jbzcgC@fduPmqi zs&I(!GhCpMyw$3UH?`?5ncC~Vwe#7=I(~Zn8Qy5`FMooq;BbZJ^9oapksnFva|%%) zaEof@bL+EL>lQq=3*Fy1EPwU*IlT;2N;WxS_}n&|Akk9LtfGAx`*29LMeJ0~-dX&^ z^-;45%X&}kLFsm~QWLQErAAA5S>WS>cEjNl_Y8}6=_6SGtgGkN%@cd!m}ZDNJNIqX z0@J4E?|MJg^S#uE)$39KKWq1oa!xJTH@D?lJdK4qyUtsHYy#z-XVzUnRMtUwOpMwl zrwJFI-m^*yAv)v{icqZTrlvB~Gx=Z=0DMCHiTqd31CsNr(@`-Og_jRe*ZQWKGAPzZ zbb_85>45jG56fk1iv)k^2cQ14PPgT=3B8U{p4nJ~^mVPw<$=PAJwt?4JSux_)cM7q z?}W^g$_!P4*8@$K5<#1FHrqi;zYeg%YbJLivHs9A#!o1nA6~}em5g`?gzYN0PLE2f z=X)R}UTQNkKb4`sNXm6D^8mIYoWMzRo-3H++9gyNk;Rbg~$=uafxAZzA*LJC!pbL8=H=pe1UHAc!FUlRZ zZWcT}M4vF)vPOpXS#J|8rSimHZNJ?)BEK-oYt9=jMw@G25_eeS6Wqx_&()7jyTvWF$!q?>f#r@9TP0cvR z7{1B67eozS%dY3}M3u2AI_z`c(?5n&XBO$=kTu?foiN1EgP$Qw@WFN#@L7AcZqm8e z$^3EVc#p53ousjYZiB-9VC5pv+Q1UoeeKXXACl(rNi||!8T-KfX5@3>M>x~fsVv9K zv3tXFpXgU-Yl(g}hPR%e0MwJbt2zTte>he9Tz(Ozg>3@&ye>RgYr$+$4=1Jlk?oUc zfYCEL2P8@0-PxWr{@qflit#q4WNA=a;&mhEwtSN@qlTUSS6k^%1aUMkjIH_lO^(3X zU!IDY+QSs4Vo|$hz-7`rSZWgc*uB*G=_R^&2monjs$X^NvN0-P3}<$eX6Oz3fSPT( z4!)8kL3=|)XGP1Wb)U$f`Hix8;`v{0vHs`U>oZ^~C~j5&nwJJbZ3grCFH&KrMW zoQ@##0jLVzyA!fBc_}!wnp@idg=uyJ;IBh)7=i!a@!mz?-*Yd0CoL>HMb>Q z2RU#0ADo0C{TML{?$216VRq4#OKy8pXK#8Qe)q0|lK~pz7o;f3!qvMu)9!C;&f`fJ zr%UcAK{-C?Chh(wPsc-4FLRqeD&iY5bLx2AN9(y{F*SdiWBN32>)P*0N`L$ig*5Zd z?)xpADNMf24b%?RwvOv|ifKtypD+kv_`E|*7VsE301%lLxojPHKNpQ4NqC31w3nz# znUskLj_quMSo>|&G5$lY_HAh*S)jeTwfP|SkScT}d$3h>`~V9f zrnR-Cqt#olvl)+xfIVEw7*pKZhT7nczZA>pl-6izqT2^D7b^Mjr0A zmR^0vb*1-spOH4>8o-WSK9k2zX+kCqc-EW>2rQp%=Ax>vxQ>`j!l{4FT=$PwEbNh> zr?OcGW~_&k^xB(83xCv&;Jz>Fz-+I(Y#F^RA!H1oT){?poWJS5%>@pOb?)XY%gZ7h z)eY=trwT`5F=g=W_Ne>vyclvV-wcvX7_4bk2_X&lyI9lvzY*FU5Ypg=Z57{D;M6`> zp0k6z;+ABl`<;QCz`f^5hs`4^W`q){Lp7y(OV1LG_qWb zq+nvac1@gijS)^zr8K_}Cw&BVdiWzdQKUw(&pys5DmD$#H7&2)H0>)l=+4rh`!j}v)i8@By91||?pV7nM6Um$3r-3w} zm@2&I)#Y05g98^FTCKLejCbTyiC$snj&mB?rxiO`L``^x-*G}w*tA}Ic$S+}!_(D^ zwHdFk#yZ^aAoB4j26w5$)=oh#Vi>MO8!AE8c8Mmt(RGGsZRPvBZou6yC!L%FY0>2y z#2B?=`|#f_n%ZTYZ+i4b7nk${n;bpfo3r)vo$~@yVn4=5{{k?3oUz#y{1$YZVMKiF zh?35erS?PzzVjkz@N2HgE5e4e3W#0})+lVcC(MoczU>)sVi-y0OMIMj7dN}C9MA84 z;w1!?IHxpH6OiAn|Dd$N1{h(WY#igutrl)ev1q&Mo2&(YVo${>o$tlSb3USjokTsB ztw(dl^Y%1sFlif2e1cE?1$fA;(-}LAJi&&1z%M>g6tB#&rqQJr(Vh$KT3#Ro=-=6~z1f6p~VVQMqYP!=&t*m|)QJYcp%rfv~AB zBl1v2VVY!1VvW!nCxtiZ4ewVQ7b(G+&dEDjA2~*3=L@o})_LyllXs_Kj^8}51R1pp z?_TOVLtaO3BG=H`xNb?x^@D%jsC?#orlLx_Y&u@6ZX9f4c^={RqEJS*a*o1`dpXux zOuhQK`ew${c;8$ah6{~^?D~oCCi9*0Y1)R_7ZPZp-&d2;z_<;r9FB$Pv0y= z9#lCE$rGUOeUa=E=XFmGtMU=PR{7Rz3a&vCySe^TXN%F~b{a~d=!ld5<+n#P<(w4$ z2hU_wXSae0a)MGaPtVNC*k^h^5A?VYL5{U&LetatF`sK>1d|-ts=Td|yDkgK(*(>; zJW>Ljwz;P*cqfM~&i52L8(Rh+2#e*k>!s(EDT2GJN8GMGdDm0gU*EfY>5@3RHTRY` zHI8zYH%sCIQ5~Z%#ZkQ7`i*=V2TUe99;rYZ3jS!P--YDV@X;R($EnnwqSG@}*opF- z1qGPX0%20<{9~B4PPc^zGGk=kslkAfnLbfYR7z&_=0xNK@e-p~h zdp0zIoj=?q$#Y_euJ2)(`dq>FroHVEPLT|#rvM^?KWp5wI;L=3=|)mHhFA_%RAYuy zC)HL@`oQV3{c%9?4Tvy667rz2C=j(qn(Bp|DBSIl55Vh`t!V^x700&Q(}qk3!EpJc_e;hrXu9XsBr8Sk`Vz6KFzve@|-+r_en&(uH+Qt)V=tL-Ko&%$qf=-U8`gMpUT{Xjt(jI$qfoj z+MPBf9|m#=V)o&nz1Q2rUtEdYzvpO9;&C44C`zMWN?@9RKzOiBqN!?yycUnmKgV2^ z@`(Tq*O;koLgGF;+hlc9XV~50+ZYyw4|Gu*7ceq$^Tq;ZOcVf}Ktlq5Xo{Uu_P~J( zty_rPxE^jzLMmBC+gW1v2VUmf!@i<-c0Z^{k65?=nIX=Gl3TL757>ED?W1Bg0O6?2 z)j*Smu@1XsrHh^z=?W-^Fr%rvAd|Xs=jIGNsA;GF%LNC$8)m`F5<#2|QCZQ19S{SV+X1DKZ^ z%WG#0JQ(4aMz>@C&4y@*2I|USpK0ICabgh(ZTuZ;-H*Z$#8Q*`zN6H4w?iQAWTdJ; z55#Q2ABPo+Rgsu=3)Y4zr55EMt9HJTH8bpbC$qsoJ>OD?s zvd?%Fi7g!}pzQ2J+kaRsEPy?}5n31dS|f?qEc$l2S$En-uiTvNuis=lra&|#eZP>n zO`uC!hd-pNfy|Wx<+IOC?h!gEtJ%f`FiD8(xn^AtMP+rSndj))+oElRD(sZq{9cIv z5`94U^>!NqGt)k;|G8u|i?c6Jx<4(7OzsJuU<|dlu5$7cIfvALkc? zfe>G9rzpbeqrTGV)m4X{`+NiT33i9g3lo?RjAxnTi8Y5`6kmhgXfV{MxU_!lq8j5^OtLMxz*gn3}?aJ>+c^hh(8@q!Gu(JNfHV~N{N9H)6z_*WMy?%u_z6D z?W0Xg1~qQ`?b(`;uPZTW`Hd7K=bFN*fo(rpS8UXwLI3Isd0ZeYPmBsGBzKydP?8;p zVYG>k?B5Gv>M{fK_44KxnyZj6&P;xTBxFOem!)+ zBqa8uF3#)NplYF$VSV7VUJW=QzHCFd+(E$Po%?;pvz64;%ZHMbeKbpistX=y7Et#_ zE}5bOl(2Xb(C_@q*+59!T+Ly4VXYf6*{9+XWpX9OweRhYCbPTKs!(S+ z&R$;_Jfg?OhzjUrmx(#^>8` zglWNE!htckhf^CP6`>f`Q6iir!%^oH&2|K{Y>e=^;JR325}3tvJj%?=U}oz!zZu>Zer?-&-PL=PG1Qb= z9&lgT8_XFJwslkEgJh>0wGC_5qB6|K#C{N{Mv8i!ty+WS?#bO1b7{IK#l6+>jeHr* zkW9)ks(1CKo{Cwv%|cfwGQrzI5r5pn*BIakD>4>UL!zZ2H2Tb5q*9MHNu zL}JZ+ktWy$wC_(#>NN>1!>$CNyX??#a#*bp_C{84X3=ynDmfG2~^B;!-ZzKG)Cs z5N+04F3#nmzm+=7q)je~@3HDtwGFwHC2(MSVl>`X&?+5GbK#p{Fz^xEA6D`YmZnnl zHJ?Rx8Kg>oV^ZgYGz4#5*7$t%d)%*k!z2ElIU>95NVfB+j;cGo0)F*ZFC6FFwjv$j z1NQTc|AEV?iWxfptn+qgfot>m`Xg9x$oF&2|F#4Wi1&V^0Qdd>V8I8EVxhng{x@#? zU#xqMFgh5`Kjiy=^S>vG#6QKS|GU57uDSm22mc$9{UfXYpNZ5B^#u%666;p-1dan( z2y`V_)G?%8H)Od6uU*pj%rU%XH}g76@XD63Er1ECY5^O)x7ib^nmK>E!ZaKDg|n#^ zB9CIt&Okp9Dg-2}GoEQ)QnwTsUVs65Hy+uXjiJ{O>0jX2S;eV(Enh8C8JW5>1g4l3 zPB`9Sd(8{DY9KCKK72}f{5|42iPf|L-%A$cYV}SU6r}f!APBGf*{l^yx6~faDK+|N zoi7>4p-FnL4&;Q+@bV*WgtyTeEdJuRl24#Dvb^Y_ITe71C`Iy387nI9D;um0j+zwP zb=x^dnB_~DfKCfX-#ydLcJjnPwID$nM{?PIE>$Z0V zW%3Z(PpgKtuc1r?TA}im6Cpg8LwZoRj7&7`;1ptYn+;F!xV;jPow+co;;bk77zk<( zlS{C(sC7&ED4fyaCOM1nR|68CP?bSUQ=DegC-sTB^1e5Q zF{?H_2RGh1!H^%-RyVs0wSK;gTBGio~S2_ zuIaLB*z4fKo=?v!*-GW^0nBT^M5u-lSd`VXR0C4|D6 zL4fI&cNd&>Oa1GiGa@@|xL$s(DpI<&W5Dc6{{pPXP0B^^npJs#d9$PyoDhPP&Y!65 zE-BF`G%OYD){S7&y24@^7T1`@WuY2cOLJ05CJji-`kpDCEyaak7fj}**sJf`!&a6Y zg5TMhig=50bYWj>UP;x8@VX-^!gX`Ed$R|z?j(kWR`0U?=uu= z^yY9CbU;@xc%Bzi;RMtK{;mt3HQ;I&RH?w-V{5FvXTtWue7WBJ>dus_pk=ag70chd z*OpQ>XwSzq=-gpSFbp(D)w*K}z$GAA2b4cxWn~s!4m2@}^7t;OS1t~=3sblLTD{U! zCW?3Ln2nsP)N*>P3}z1xN-koTrFi*9zz;Tn`_1O(tWBOj+{gn{mTn2}9|EmPlB*T! zg>Z8LdIA3ZVX4GL&f<_NbNLx)fnmpoP5Lr;T#DZwr!>AojiXFemxTlvb27eN>4fe|=z&p+DTF<>ed3N|kM8-K+vpzxYR?Kpd8$u7%UGBdec4R{q-4mqw; zAhs}7VQ(ZcR>MfaPz|q!R9-`>GUWN8dj96w;N!OttZ}Z@3%`H@F4=B%myNv;w(umI z>&Q7Wt^iaMCB*yJpv2;41#6IN<6vg`oFOF*K94SV*b%#MH?>tPMUs#E*ySEn&uH_R zsEC`dcacNC-nrvcU?P2$tm-L|vb0))?b7^5lP6`l`y)R1&Ko5~{y3x{21i-7m*-dwZKJDJ$mhJe`D2!E%j(l(VIo zLzv%IVYN#MoBK6H*d@sQ2a2)IOIZh!$VOB^6$5kxa>at1j;&+1Z?M2l@1^2vWi$jj ziQbw9;}I{UF*fdg-l@a#gfcA>FQxJ=M%T?g=(QyjJs1QrlSBsvuBe7W5S&?B>UuB? zDs})pSU>QERjq{u; zEIDuvs*Q|nl1-=>{? zs&J+-ik%vx?A~}@Y)^iK>5MG>5AxnJtgWqE8%A0vrFe@=p-7=^AaVRxOYWl(`KkeK1b zq3G(WitMFmEqmlrWy0fE22F2#8$MECU*O}95P80zhxYmLufkVW= zYFFfG090j!GIh_NG;Hs5oZ&^sE4m!W*E+i2*aHmr{@`D~KWLs8H0p>yF7%y#(q%&e z8~FZNrL+mByYaP+RAz$W`yCf|8Eh2yidhj=n%ZlhH@!;XTOD2iv!yj3zdXCf1l`w- z0U!W~d@`PABA(&Fwo7dQ$>#Ua{>7HkRu%&KWhQzl`WndW(^s&opAk~w+&&d;B*02% z(uc6^)#6VK@k>4ll|(x8Hnd>!I)+RLuG%jhsKvac*i&~JXuBtkaVy&iD>QxFI1YPs z8Ur>m>%h3iT2=Frz=@F7$JV8n@#__X9n*w|$*q>p-Y&Y@?wR0Qkho5>%h$)x5VTBp zi$rCXG5=^n2LKtUen5h)>2Ku&ed*clJy(1&Ib%BZ1}oy@@yZA}yHL7e+1yx0Pd|-j zzR47hIiRsdPx(5ev#GMuvTZpS9vvUQ6c4ddv;=ZIa~4u!ddQvgkSxYJ*^5k&_Q8JZ z!S=81#{6gKS06|$#wG0BMFcN{msxj80-h6(-3|=jrkc&grT3k@+`nRKH$cRjXiH* z`BUN}iiT$-58?&or5omE$SMB18tf$-6soKXaJ~o~$*w_SUY93F|JT(&a@Eh(<~{;z zyBS$&D?(0Dx3m58LO*uFxb3cS++nNUp~!tHKErYY z7nL8PFZkj@STr$)pJXJf{Nwz7$^70eM1fc$OH2xM1ZwAguLHUXoCe%Hy9dTl?U%Tq z5k6#=JCEqyZtsMW(J1#zGarZ;NKQOm9&ecH#lE?mWz~&ms#)KpFd&bQeo{Jd^5@trk|KNW%6*N#p@?RH+Pp`Ij`R%UrRz%LHN!%F5|p!~J$7_l*SMuF zr_f9HrSpzP7VMDJjWr;cK-zahyP-+eM2+O2W5D-K;HmWNX5h+Z=em>w2m|}cawXK! zJrUn#J=V7~r;bS;(+#{3AhKWfA44{zpOi-(UG_TUU(!t+KkUwF<5TW14w;G2qXyIo zdX>oz@8bbho@U`|Q{{qE9rhon-E^k+lqZTehfVi~RFr0jFJA*|*YWljvN=Xw;3m}z z1EZKEb1CwgS{Y5B%G#ap(}1fnc8%28KgR{K;#p$PukyQHcC7aLZsA(u=-`Xl(eOeq zE;m?YW=__6$-de#U|4$SaP+vRP4P(i4318)Vi2Y813hS$mtw50E{3%9C63a`wGd(u zLR~gn$2;GaJ))J-_~8xk)8}ncuT}W-tS%$8BXnQ5h*j(8~fg=Y>d7(h!%( z;q9E7GG=?hTi%wrCw`nmp6|2kzTc|b4slf`B&xAZZcx=|gLXdo*u@Q{Mp(`>7JP@; zY7rKh+`MZQC=g3m;FYh5oT8@m^6i^GHMDj|<*l>a%8Yoa5NTnSY{B+tvM~!2+Vv6& zc6}7|NB~!^Tt~F|`nYPs#x*NI*5I2?E_|YuO*^jh_Ch#nx-TSkb&)|N;pR^^Qv1OF z0oRIJjf8^41$57OJ8(@}@NI=m^|PTNxWXuJ7i-8Mob`MP5jh7Lx?Ryjkga*2Q9VN5 z8*L?M2Q${@9y4#wR*f0na0NNrce1p?1>dz^x5T9WkUF&7ULq^a$dV{@!Cc80YMi?x zkvR8Kv2WxK&d9jt0V=X^9T}Uf6hcRnxmz`_=4DKX3%<#=qzVG8d-JW?mcA?p-*a%u z$med=TIfWB9o&|WBvKBFSB5-lT89_JC7|_rjKGUh{2PMSoc4>R=G(^f#jt`<&4k_w zmgLU@t~48`@hS5d95vg!VEfGaW+jQhPeB{J1+zq6?<8t*fUmm}_=k$oMj;35IB9!h z7CGKd0-&p3NX-M8j|)OC-U2wwc_%Q}U>ywVHq6g^UrXd7a&HI1IGc^+LkS+_%8hiK z-HojF{-j0Yb`p|XFQP4YL9Mp6hK8P<7{sI(-y3D&FanNHU^euKKoqU*40^a%+n^Bx zGNTaR0D&Qc963r#bm3%fhJclw?^l&emAB;f~ z1(`~Ze1ao3{7|(%o@GTOL-O7v0>6vH1n039& z((KQeHe2!2I&=1H$)47i5jQtP#pxtm=?@}mHpagcedik~7=^tF>HsL%7;mE(_ZTAd|SN?ygG z&|IDt)IC>Z`=!j)QF8dpgQr|hio;rrNOL)TI9SU#M9Q}Qy>NZ^h|W$(rtxj+jx8tY zA0ye(uaj)rb=af8W*oA_w*`(;vWi8{HcqXliV!t?7``k)3(%PO;gZ&;QOcP1zlqJ zXX#PX5b<7#=|l|q$IC~^_kp!3^s=Exdeb47_P+H!;&vLWmgpCe<>ky68_#GVt?J@8 zg;^}?uMSwFpJW%Cb8rh-@kz|dFcsXm{JK5WzQ>g?nFCoAUr*X$K=w3DEL6=(H%-r3AP;F-otHOvbQkyffpV}M_L`OXYrWbS*3Xuh+AN$_2tT_=`gBLl6Rzjku+a)?q zh(4CoiObkkpKef+y1ljxI6F>739b%Ti%XAXFZ42)w5qyXvcq>vpoSBtM}&Q9ZlF+$ zGxmI5$tytX&ev=QpKg)R3d-2MVd-_;`G$e6 zkc0Ux_?>N#SktrY5&p9|e5Eel!8^hIn6?2zm(K{flXoRin_?Mand|||HkZpiJOFqZ z&Z7J^rH1CoZdbFBBl4G${WS?q3~fDSd+AH~bh<7r*K;TLx+a8}56n4+-=l7}0oN13 z)8?4b%S@Qf!T_`1eFl{~dVMi8Tz`hRuIIxJ`K}&m70C76GplYr#wV_;2=>)FY* zFwW*Ld5cbgYMH}3P)C=hyz*SO>f`%c5IuvMroNFw$g*tnJAq@n;Ft?{uigYn-oQ|B z!=am1X6rL>im#gO5n*jAXkGa8?a*!DTrSg>8>%)1)O8MR-;)*uuKLQQI8IPVTaVBF zdJc+EsJN}?tHO|Klx+{o>Lb!spA((HQ*Z@@{d^$>b2Tsh?Xr#T$SP=~46{y9H~nT*wFwP~{2`g`VZ z@a(Ri`bV*XvZ5L@E_Q?k%^HGC@4aEh@G2}w#1xna>||bBGutx|>Jf6+=iyv`IkC@G z<>K|;WqTTjzY6;?AaI>jZuin{-}2Hkv64C$St{8T)cWGhENc5n`1rA(ehkyGBa;aCD=g@MM`w)kRu$ zgRm{p1$uO)d{nz@XFztHf?~oP6i8~8cS1>LVfIHdi3RgFZx`yQ1hAOM-xfjDBsZtX z($PyXF|78j2j%+(a_|B2@cH3@Y~(8t4M<8-^I(wdoTWhPeBXu;Y|GNIM{ljOWZ#VWTT;@h1?q(t-eK||Z`?x05@&pj@hze3V2awq?AwDH=r zzLQnCy^!BzISH|r{bBdfE6D3~MUVbMM#qOZCZmId$41|rtLa+rQWrf@5SUxK(q}9& z;&DZ=F}f7qV5vz7CwL__dduqiegle1+#uXKnap)6_59V)FxD;S$eY(rL@cnf;qku7 z5%)`U|7&&KyY#_i@H?+`fA_(UMWt!w!{j9r7DSc25s8NO@hZ97-$LLYa;ZdB-<9Vc zE-@1ziU)QB@!A429pEyX@dwu6N}O?*gDw_EE`GkUjQAqrnkL8eH3seLa|s!!SofS= zjP>cG)%nWeR7{jmNS_$9G&1(}B#rB!h)K_Bif`e}-h0msx&LYw7NU@)7=O zKFZUY|K?J2A+4%vILgw;w8JU~@nbFXl;kpK-(l(5+l>-$_3d(L{;X!(zv~z6zR83g zUvW$V6B!m^1@wi9XqjHTIJ0XoQ#?08laJKsDc-RZb6*ji}abI_MG$(wJo*T}oY9apif%J#EiQqF1WvQ>iBpQ$*!d1Xm8jyQ# zp>a5(tSSuO-DY+&S%@*u`rg&Wf5ifb%QSPqo47vsk}M-MS>i{-?(79ZhbRyDa_8A@ z#QKu2^>#M6cHB~CA7OV64)2-n^l|a$wFCJr)VUnB_rA!ZX}Rmder9ozYUv(?(C*l> z60P!((FGv{(_M>)UJEmon@x~8r!)0d;rn$N4-;BUTE~SQ2^aS6Kj9WIKN1a;H95Mh zv+Uw5sl19AitQ#wI%*sFoLm%2MM*TSPyOVrE@EPyxAHI-E9_%=Y}ABddk(gzt!rUf z_(`#x&)F)FQ2E|wb@fZ1l67><_;UyN2i9|p<4}`z@6K)?18v`YY|G+@ke@{al61Cz zgbxp?&rwtctuzS8-%C_Ozh{$aII68qkgOP@1j^Enh2L0|ERH$7u?wUrlrf}T>mics7Ryw*FFxrv2A%wtkA4xuBSbjC|~`E{F&T)z%MJ3B}6`}I%? zY`_yq+bbcm*tR5SU{Qr_FM?LyOJ(5_P*X(m-Hcx`Omg@bCvQ{)vGitV_$Wm@A^s&6 zIhVB#fJ{1pt!TAcPCBE1LgpdZb+?PhBJ2^-k?QcbTL{G|!NF6+BK0Hj9 zz(b^ziPE9YJlSs_#@8prcI}KveynB6=y~J1Ra}3{<_`*o*CKHY%iEMEp|V`RrsqeSUH-OfS2#f71)_=&VbEA(AI^fx&(-&5DQJ**p#l>`lRdLdSA z?ibeX;{%B@JSDamX`^auYy%TDJ`0fZJ;05;2~MB8-^_Tjx*PI&5RGB#eJhsh8_0xJ z`=*#d4O)b$w`*&bXRBrDPzT-qM$se1;u67SqF3TQF_mF=PU(*Zha=8C-S019EoRd6 z(47~9E*lDVxt5#1_=79EtiMwqa4{4JI4rzsxneCXFcSdgTmcE_5Uq>fYXZxnTIEZE zGT0ZizoW=47ie)}3?7@wC9u#-Oeip(AEEfAE_eZayqtG#woKnK!rUG?!cch1Xkc1QF{}N0Gh?JQ_|m@|;81^kGZ68pm%+C(=G}slZL#6g~XJwQBb+kxa`Gsm~UddzZpr&R3NUaV(g%m%Dx~ z^Z1k}JMOH3`aZ)t`GnAs@#^UEs26oa=$jNy@@<3L%Gi2KrPesa2EEE(D&)XN=2CgR z{5w*)`wN5~ zlu~QHv`aP~`@VW959Yo+QNy&&NjY~XwG(F747 zvDaebBR@EwcpDu)M1%9AmW31W%&Pg}RvsBR<6yTkvxDDZ|IQbAx4AhIy)8z!Y>H-V zY**q9c~i%mWeH}$7`5&Knfb+mGn;g5EAj&S+S&K+O0#4HBgjDWUWghiEst0!Udy?rpU}47x3W4ZLmX`q# zx~d)ni!c;Az{tnG=_wbheRffk9j*oSkV4;^yO_)UpOd0=h*j#wqbMIsM)cry5Sn@t z5mj=1E?-lg6GRwtOB-xLqTN6QI&AE)aL_jDZQ{)*Q%RNk*y<4tPWBKd#ikG=Rnw<& z1v0GA;5)*q>c*0W-c#`a%|^5wvdvdj8dEMiJu6jFel8L2o9vq}wbl=7 z@f7?r%X?WWe-p}}2d9i4c>KCLz7Bg`0icHYG4{)87RcXwG5WBg_ld)bVs%9Z`@41* z-Zg?(&!2&lP?3KI6w;&e%B?Tyrwqg;uQeekrh(bZX}m zz+ogLASv}c7_D|qYeK}d{g(^#fBbIdGRArOPxqxK*(BbJ|8Rx=?ISrbi5lnA|H%HY zuZv+dIB%5xaji2K-VOa{;cVzHBs4fa-st76fPa=1-5==QgV)}QQ3yHGXd8x3?B(dN z{H+!L`9}k;0WZ#bS(ZMJ^M2xjuV9e7@HYFusSH*Lr*6l|N1{55+g_-s!178H|4pTf zp}+ZY?x+?E2Vq|?#g;R&RC4)mQuUM6Z{ocO4Kn$V2-=MG3)$A1R{2M-m6TlDy=idp z-lI>BI*a8J5sTySf733{oTbB0tsP(aWKRC*)m8KVyCh*JsUiR7KKcs-07pVxdPD<+ zMl2Z7{JYUn5U;^;xAjjx!1Rc#rL6nF{?~@;!+3&_V;CwO+QpdvwbSi()g9)TEl?o< z{J##(HeUPF`xpBKw?aj_M_SN-TVW`A|Krp9swv?I!ANZ*)69i9shH?Rz1#}*U+2^S z7iucE3bO!Oaz1VAu+3BX;a&;LtoKK+ajq{PYSs9}0)u8!t-mK>{h3M{Q%YEOuSDyn z^qLbzwajbXXJr{9k?#gQRJj|E$`DU@Re&12;ojfa8}^Sp{E6Zet9&-tys5;o;KoMX z&GH+MYempkZ=s&voas)wUmXueL=7eKIPNn!9r22NXw2%9!X4PhFWY(c=&JpTB%gCN(qRtICe=1aO|>nRW=2%8B1uJg7MDML6|EFyuJ^P zUz~8O8%`VY3w1bhr8TN*KgJ6*0Gway@=@^T65^{t7I|vZooq7sdz3urz+%LPqmL{h zq>#dw5?&t*0n3Mt6&|WJGOsBYrQ3#O%p$wZz8jl8cGS-TzkuKtZEoClctEBbwZ-_C z546CvGJT%44o}7T`luhwkaSLx#1`@t^Eci8a6+nnwAEkZ=>vkxGQi!j617^t0;n0o zdV0XG*Tf7e#Jwrc&p{O#53QJ8o09Uu7>D zD-0tL#F6YDEH3X%Y+rkiv1CwJ>by59@D;CkAsRY0MA2grQ-5EBT8mdfw((e*-auav zkX=~S4De~Z7rRW^VVrG+pTPHmZ+4Z_+4_3^GdkpL|BAP(b*DW>F2eq z3l=A$;i?$f79Goan;s)mP+z5_D=;O4;;HpVpL!P&XqI-DW=p1g8B`awr0=StqU6hc zVw~5u7D$X7qfFbZg!ceXXPdj6B=BcKJ*fZ%Va5%bf3(az+`Mu$3I`8X?`&Wh202Ap zULmFe4*UB!!^){5d(Q!RrxKoBj}|v6p~}-;lKS*+Nwb_I=Q8_ z!ioV?mWp@{Zb+O~jvyP9FmpajGyFkB>)Xp}l$iypQ2hTh#!g3H8Rm%ph>I;(_O&UXnby3UedhT@r{vn;eHBxxjd^ z_!^9}0+8OV2F{SQ%(9gp;zmC;Dkce4f7|IORmj8T(Wx(={VTsGo<~V7Pu(y5?cTfX z2}aB35lD6#q^%mhxZ&L4wIyXWEmRVjUedO5{fF-%E>-Pmd$ngv*a5}SAg|bc6AvG( z$3p}fO~a+C9vi;JsN+%lD6rD<2i^yR9{|UfZ;L4n0Xbn1P9+VbP|4!ySkwD__Pn-u zz9vaXyP!#onqJp5O-U4MD#yLB(E3It3E#`Az|{CB-!@91zH<(zk52pDotX#EWNK@N zKe=o~f3rbcXTb~~oB0!p*Prutg2&zvJ85R|I!;{zkRJ#1K~Pn5sJ4H+q-u6}ZPjwY zyDrC4?i6mOo^0s%0MH(zYt7N4#TtsfooB-Y@91+f1LHz%i3K>)8$~Vld!>&zoGGn; z{}vfzX|Wn5CD$r=%+D9PL!sFU0mvEQE#Cjkg=<*Cz&Vu8(WcNXV*%$GYj>uk0`pU9 z71T4mZscGi?jc~zizU*~%j~81vhC;V-I0W@BO3yMm1>%T{y{A(_o*_=UpVlH07-j( z4wTp$0<)j>Mbz{H)g(36YRfVj{*qJI)}{$>MA_o5Q?n)|XW=hog%;2>?^j+Q5$s-S zIK<*(&%DOlkHSiJ_d#XN%QOKU$F>)#8fj~xF?LyLKecQQ@;?|-bzTzB+jE<3<3W_bNNHZyTpat=~1R{+Q-r`XV1#@w(q%@m-pEKoFdWu=|9;2H6>{*`yaHYq_nS#ulQtmOv}|r^gprOzLAz|((Wu5 zj)P`UqMr+*=j^{R7&;{}=$IH5bhEH>ePa^PN>Yat0*<}?Hz06*)bom<4zpu&q@Y7f z{)z^KnLu5e996Qh`4-i_B%Fb0gs9JJOkgT+e?a(LAa@vC?goW5e^robJrkYJ4@|eu zk*qba?qvV^d9i^dRBX(&ni)3qOY=UeVO!vb$7GiglXNhT8S^_ zQ28un4n?aKNOrWn_O9+Uc|idE$NuQDPt~cY$&`bEUdB^90^i2f5aXEsC=&k6pE!+DWM(gQL+Ltn|i-_Ya{tShc@tX8Jx~|)*9;C1tzxE^3Uop0Fp_LD>cgo}n zc6{iDB$S}ffs-~YGB4r61#DNVVVObs!_)q)lY2))!OUZI(kFhA)~CN$#-0-vbgWd- z$lNyWHRMFSi>4b3Rol1i+p=%)j7?p8_}E z?kyfrCqzxnX=4TitV|ON^?O<7v=^dUHETp7%FIFdg|)@4AE$((Npk#xpVWX(U?wJY zt`)Ky<#BM78NWHp;kjnUP?&#|y4BnoLx_Dp0=2|Xw!s9o(h&Y&KTpyoy!ff>>t{>8 zs!|)&dO=8bCoxOkN-GnC^2xjOi-IUwIoO8h(*p}WO#*{9z*G~<6Ug4FZ{=}jKwxle5gDGF1I--% z7~w@#+aQSPEmazLBmb-gWMBp4WsEt!`i0;azKluC9?i zdbQZ%ONq(LJ}l>s z&a>>*NDp1>DAnv!lD^P`^ySlgTYFK))|CMf3FhZ66+|A6m`^uqbw%)MGyawO@dBnn z$wCtBDGPtCLdlc!0 z+(-g~zAh9EiRL;PcLE=@9czaWf&931JV-l7R>b;NsxpQQG5fauM!k2|Td*D{z@Kj7~YhI)bnA1!ptBc~pLb(2REAkAid_7dk~nUPa>!rp$S%X?ZW06$G2&!A?VzrI=-@w_ivh?Y?fxUyV3rgVkjVCAxkPNd*_{jufjX4>X>z2KbE zwKa_08EYJh5fR`inNv#ViX~(cS=;o^@o7k+wr_*X)V^n9@}b2t9ER2ZNyO31;-P_n zmk#g&v8Y=|jc0bX=cF6mRr3?C_Z79rF|lpYF!2|vX!cJ03*q~qV_xHj3 z=H&eOTt?3Tw~G6?TCm4kW@M?%pd|PGhg!9%yJ0Ln4D)ItUe+l5;+BlF#2Y(=BWvB@KQwCt<1 z4~dkyAs3ei+DI%l{Av0RZpxR06>(+keJq`#*5cVyU$JV4(JAPIT?zC~B5l!HsbHY_30XW$^By5gup(@lzB*UD{u1_7p# zdXZ`4RRL)#fm%qMzd{$*X8cnw=gm+L#;y99gCZA`wI^gwt4GSQ;aMck#fYg}*@?+J zYAqiZBvtKTf3dloJfq8xyx=*uEg}(IZQZ3Vq8XJ$(ZN-eH&61wySZ)T}DQ0-YJEqKzo{Z4Du235?TQ#^Ldjq zeoI&e@N8{;zN2BHyf=xyi3MbFbtQ$oEH0cn4-2zl2?}lHvvw9Z5!|hVHy!>_?Nl`3 z4RaaaEAp>#NfHx}06uX2i^3zsI$7g(mfte1Zx%-SU|D5r-HVa(K|W+53QH=She<<+ zn&C@jXLfRad#3>^*bs`N`9Q()SXlL4kOPC2OX;~< z{l(Kn__+}dcIYOXn0xD&Ibr+Mf}~nM;!pl4F7@q{{8n*AZkoyt?CuW6>VnsoU0=KHn1!45hfe% zA*)#)w)>%2Fl#ZgPl}+Ie`^Tv=VUhd`2)WS`M}B`gWe=>)&shH7hnI@?~hcvxWkTQ zu3P0M$r#yMKrF7>Wd4NA0unMSTb}H0v?pLDx02)=zm_MN#OU#n^?GJ3t|}zx6K~@2 z{jB$=lVv#qCK>G!tSJ??&=x-pmCZ4OnDlw@$3600twmd2KL8xuD}m>g`EQUMHbx#2 z_F++%Q27GA{Mm!B2G0=Ke&*2HpChr1${NdA`n7?|uGgMsh6gEhT=*g1$DiYjTuI8u_azjkfC!6!=AD5@F%hP$iFC6Y#p zQEnbSc~EyL1O^veKS@jht?Mk7yUxpW9YoZ1CywA-|?nZ0b%o!)yYq zZ7*_j4GX{Mml9#skGQ5H7nZTu8?k!9fHCu47F`5c0?sB#Gwc_bq-47c>E+(KpBqMX zq4r44@>yLh*n@@aNH<89`0M9*F||UTWU@+5FsPzcqhg0lKDC#0RguTW&XnaQS09PCNiS^YgfOct=#nb7!dn-6 zX6dyVsuI5T3Cp}!Tejgvtm0u$@Trt%pcgw{CHt>f0CuEQ#)jsZJ;`14JTaH92^xOt z(pC8UU<23$3CWL|54l$ibVqG_6hBK#FeSA@{*<44BmfVnnALb`#-oe~CfP9L3c2wbaWf(bt zV5}Ybxyy@Y^Ia4wQfq8=-Shm(lXVvV&(ouM5-xWmgSzx+mO)P!r1qV1- z2V|QjmB%(6womUS2-V9j6{7c9(e3!>)(h>Z*<>-SjX88K?g}DAf8?L6D9eEJt4un!~n z%gu>`4XyPMrF>DQ3_6@=C>w=O4`DZ0VrtO_H0@Xl!VKL4?=)IlW2`m8|03y8h zO2ov5#P-=l0tO^d<55v?uzajO(p{Hsjlp3hkJ3ON8{e87PEz_>FvslcYYPVNZySy& zPrH?0YgGy9 zSg$pvwVu0c>Jb5WV0|hn)2@rhmaNS_%{@i0oEekf8n|8mZX8J&d=8)XNO^iE@H?U) zFee_b6FX{_wC$#BNYt4a?4fAGREgXQ%X|1*}GSeUT&RY3M=+b}DrmIhR$P{@U6DPJa*z(xht@qY+yMtTpad#+}%f5xZXz&k$Q(Uv(!zQJ2pWk!i732DNt; zZ2QfbE}YRl(GY3NSyLKG)0fD#lU~$6scG>o>#+Kl;%JQIQcZZRx2%|UK0EebQnRx- zVyvhuu)-eSHy*u1R0(m9r!cH>CuJa*ed>3xGUq}LQ&)UxU}n&%Q& z?-U#Tu1Rp{#Ao@uBAh7K&G=L0hNdvQM_p6YHBHnRIOO6!h2^twZ%Hbe%(oRTN;-+@u`NUOg~d1M#cSY&h!u8o+8GPj0hlH5rP12q`L)ms(@2G@{g$Z^qyb0?RDZTsrN^BZ=8dmQpwiN^ z`Yk%>v2zof+%6OKBNN|3ZtojbFOul^R&_gv<-j*PUJCtWXNUF8R!1ADrzmEZ5=QZiiou<$#o+wrFuUV!O03iEtn$%UHfa+D!kwmgi! zz>mCesX!lK&9S&GJX{8V66efIda+Sh(8 z-3)EH(iSMym4_@r62Z77Rg z+NWPB5|7f&()uf(9X~dy?MY%I!gGit`GO2|TY)bM7X+Uz`%8?twi9G0&9?7_VM6jl zz4UKKFW;SlVSpXXzESCqyDM%F-*&~-N!>kVJa&!3Z1e0Hm_LyPeP~`GgjX=ni=IkgkO+#PddTZECL+Rd(9362|-ekBBNvAju6JygLg3xHk~4xY@+L z!i_IoUD8r$yXObV(;Iq>zr6ME;oa%#kK$5yHv|!$;8o-M{yHzpl+ib2{U&^u2IVj0 z5mr(X#R3tBT6JGQu@LUxZb<-t`$N=-l`p6^U$kxu9QPk36P^yJ#HoIcoecF$2YgnAlC z1^5>0$8F!L&g>bVnXiF!pGwG(&;SnwmWljrMIS$Vtt)-xJHUhbEFPj{dnzxe3o2+S zdLko)p=8b3Jz25fnuXt%@m5z~+kR};zL+PYZ8;I}l{9hmeRSvjCQEV12^jy)yxdU# z2597H)w|WRe)Z&jb0z8RwrM|*;n8Hw_i3$b7o1-Px1YCX9b) zCy$oDjFD02{}A`jt@p^^o&l-&UqO~M-!(ot;Y^qS1Z3~;(&6;3+^l3SVTAXyUSEuM zg%LAh#$b4usf$oM5tW-m@I--xX0ulUHuG5j8H1*Z*Y()lpwH8W5Q840qdFF=S zr!514j*OS@j&0(SpWQ}QqT4iH#+c_`i&>HjGMGOh;Xjs#)Xp}b(E|t}YzIlKS+1+i zM&HhxZOtkeKO39MlT}PAj%&Hq=5}fAj18&@t%&z`hCDV9M7}o&P9x{}sl?KRM63?QZ*d;4R?>Q2z05osYQDmKW66 z|HIkZ)OwNH+UDKfu^TEaeSyPSfS;dshcOiPy)7FmGT{6(a!)>1FwH9oO!j=0gv+}L=5Z7F@&5=-lU%sj;`0i6}wrpEkkHei({>78AS)XrNdsC|0lHpg7T*ximU<>%@cxLsbFLFW#~ zwhD$m??NRyCxccVtsR|1WB7ok0kkfZ@D^Q}jFB!sgL2kiyMnO<6V z4!1xnDsbOm6;n;pK7hm4vvO!g{P*teS$L!-IW_x6u57WfX>ij^a_U1Nl29Z6NUUVz zOwU2l^9J{@`RMii9Y4O`me`XOz#|vu*PRcDkKvK2Dqij2@I%mFAFAWtd8iXH+KD*n zJEK|;R)>h++os8kvy(6ij!x1jEjDo`pt$%WCl#wrX(Rh>zjvRLXNdx@Y9fu{RA1@Q zAcMRe+egBT)t)DUDeZ}1>WW9n?>1fI;-z3u&987h#%=hG)|5^ydpkWwy#=cD3*X!C z7YpCgZj>2tyA&*quop=y1F#|wUWu3b1-S8LCo!aDnr3Lf^`&TKc}ZdLXC(iJFq%Bg z#;T&#B`Qt{k^56lNf?aMLcY5c8Xjii$_t7VVXRz!?|c8{Fo-H)F&oyp>5i_fnoT1cgL7``ai&+c*}-l{X=>uoZ6GdA^AV6d;23mxa~F~ zUHDHDh31JQb%|!5`us7-a4U}o-X>3OYU6;O7By)|)fSAd>DkQqW7S zcZ}sDnbYCeDCwv|)gWV88u&#fd7VDDTdVTqUR6@5d%(I!?UeVgB@(Cx;(hVKcL>s- zXW;2ZJj;Pk_9PSQ@6o9|Nk#h$M;zs`TXUr{@mX;XFY`0-SVs=*v9_`r*8yE=uw&}r zB^vkVop)Mq$KLFiZ@4#TiXJ4}9>t&X&iQ9p?a6;23BCNdow#3IeJHUt%ld6!w08wl zkX71V60xStbb~m+g;>4uLM4vpNh>b4s{14QJ_pF!AvlZkxFC zfdF<_z_*&y`|o{3UvHtPy5Cb5e3NgCs`tnqx+9JyD4 zW3lWe_H+&B0x>2ubn(5CZzOxeo^Ulx2gj;|yMW&Vw?<_(LGJ}DeN)-4srr4y-vtK% zaC!f=nEh)>Bd@|L#(%7C|N1OswZZ@LKX0(BDfSb*_*<@&$Rv``zvW^l`RZ;TXkCYV z{P%>IkDakB&0W2hX-oN0e=qft_#UplKtb`$PrZjdu45{HFE98$T>JL>l;cDx%o*?V z-*P(M3unO;0Kgh8wXgD#O`dPH zRPx^|FX%M?iOsyCUE=1JY(Hl&^pDP8M_>!6D%CEleI>8qh7S6B{>kFQ45z<(7k3Gz zb_54X|Fie!$cwJKcTe+NlzV&KU_tC#1B7bkuwK@q==m9k(i~X`;?|gACvo$&{i32Uh8mdEXv8 zfmeRk>F&Mm1jH-T9-eV(nm+aOVyazukxy}2>jt&E3GOU?tzYPj7WV^trK|fTKieXe zK2dWzcoWW%av7-9vS&7|)98g=xCTa>#J8V1fhWf9*Ulf>(-2sfi+wT2G0G@B<9w4e zz0A4mf`2!!xhF7B+)x{wL@j$9qTq6czlI^RL^5>(T+W6Ny2u2a0L zX)~c3-hP2kESi|e*Mq-}a$;>vG8Yy52pLX)Y;Nq5m(rb@Pwy__)RyzjLj10#vm+0R zq_IcbEF1QHEY+9>aP$Y7mJnWFY<$cOogpZsyW<#Ef5e;yT;bYvq4e0fQ=ffaEjFqlxd7rz$vZ|l^&$Dk^XPlkq&oH_p@4~voBbS-s@KoUJ-NXY8*r>GbOnSj z=Ach{9@T+KN7tnp!k1h+LvobKMqQRlzw3*KKBk>sgzh92cb}rL#nq_JjfzPuTvEU; z4UDvRYC6NQ%@_oxvd4Ff-6?ERPR7m-A)OsckHGkP|K#R-^@h*R2WCQbP2DY}Y2!wO zQ!XV3fh!Et>c&x4?sk3<-~767Nx&_F-;S|0Em)99po66M*uhhP&x>f{eZeuD_c}!0 ze;7ewl)rR->sfT^)stE%J@+4&-!D_(FXaar!TTv^!UXWgC2#v|;!#a9|E1HH7cGYn zx25A7`OAHlW$0Lwi#gvZd{h4KCgWu8Q>Z)+ZBAxm*$0F~L)j7gmezAIIkoCa)Lifj zTAfy;kUAj?`#r_u%U*l5k@<^PxbS`nYi&@DyK^LF{;Em#|FHK~QE`3WmM|nZ!3hM1 z;1-M6kbT6fWp0Srz`jNzrSD0{hsc9>CtbD8l1Cb zt-aRVbFZ@}y<$#YkX`)zVcLS&4H+*nkmj%%eTJg$yDVu$P~6=VBwxHKhqICZZkr7* zd(LHK8zfz27A zQABVAoTRlZRP1SxGzI4St{u$AKT{g`{-|k|UT{f_r~C*7b}75BlTcG1SKtwo@Mwmg zjODyVmwI!0yCO7MnouxW7cv|g&Z??nAMe;QbG1Y79ul0r! znR|k=OCWzz9zrc+E%)CbJTCbYn(`EbG+ivkGY3%|bW&9au8cMd(Y@lI#p0y9luUsP zhjVCklpRyFj#PK5hVJSbNO*>%Ix`9Cvz(N(;k&4Kbm|Hyw_Y_klk!LAah(c$fU>D8v+s-f>qKy|dR3_;H2R z-%_?CTMsN#kPZC-<~ujuy9rcpolV~!zTb_fu)d!>nbh~AjE-fh)PCnj&J+g|XA|$a zU!iEKJC6IYvUnfY-TBVx7e`Xa>Tb8B)8RczE<9ppJL7+_&P%(DMcGL!|G?GV2{lWBbiehT^yD$mKM2FJXf{o@t=p zw}FvS-c-}K6v8so?Fy#ApDia2hkwo~TI+GX&#uxPHpBZ^2)CH9f2Z2FYxEf*vsEj4 zl69)1z^wWscCR~8zk++acVg;JkoV}($u;laFs@*o<@U|5L-J}brY~~<$nZ`qje@kq zADotyK^kAU6rq6_Qv0lc;-+V#ivetH`z+caEhWjJw887svsuZf_7*ur4+gbq{U?_3 zmSVK%tdcg#5CF<`n&-^|{3_|^t3^HMgPoNo7ldTG6*)5n-g+s!J@diDHm$6gYA8;2oPXL)Ar6n|VCiQhw=OWH5OD+F4d7C?~IDSGeq zWnzpLz7`W1p>0`(9`N!ebU$bWsweL~nt@g7+sL7j`iARv1^-*$yE%G<=n@&14a&cQ zoA14n**Xp6ZRCjb73I*;nR>o>Tp%1}MC`>;y-kq!W)-#5+Zq;t`>-uY%&ovWdRid+ zmSf8B7NeuoAKpPiNtF-4Jf3qj_#=w)R|sL!b1w_sus<-Q$FtxfZEB`m9g+ym;LKDX zFRmaJ3QJJNO_PcH1ycAYSMXxVT*zNd8~7wnO!HW#2%?#v`6RC{NG`c2Ol6431_e@N zMvi8cP1H1tsPRS8pbfFqzXx*FkFX&%DeoT7RI^QI_=Q}J6~G6rA@lldnr^Sp4ekz8 z*}mUINM2HNsLlWbzg!(!z#B=Fw~~6nkk752P|Zn^fqJ31WKlKG1&expzN%OU^+ zv*r&iHiwZA9=~sOcRGS%EVnYqKw*1Bh<`1dA*T8Ph@6W*@JBm)`=eE8RLun-E90Vk z^y7U_=kL0ao;8)AZDsh%fHxHFlk9OOaFnvM$0nlW_89P%Rdpsq!0oXjIQ%VX79=5@ z9+v1S6ze?|U@njQk02|&9$T>1EFkND!_72g z2=xC`HX)VbM^O6z@G2a?@qTfHd;UNF$qsE1f$5(=C{F*U0r~%1I@I4Lil{g}ID}cZ zBqWLG>u%F*!xMlpRG+-MFL-BE-X=<>m{ZGn@XceM$>k9#h$ z3%cU{3$rd?+6l63_Idb$7d0c3*2cWKVH(f*6{wbrW)ShofNcf@ccR3%ASu zbmV#VOKH}CcF2!#ledmi+`dv9q_t)}{-Fub(E?7AS#bFVAGX>6di2A9DeJy*LHkul zclyDBX1Zt8vL3U`|9)5O}dEBQLX zHZ@>-lea_ulQ<{`^NlA2%^%R~x^HEYk2UPuQNQ$M^MwY~s5M55L0l@PN3=0e{9AwC zB5qYOZ(QBQxlNno4KafuG)fMA*}D0w3>vtN)mp25SfL>$f$(pEg)db8RSU58Fwj{s zD-_)E0Rsc0s)&7c-!TyY2+MDfD}AE$^|rvRAM$3Eexye-4*u~MpEjYq(<^J{1JbQ1 z5~W?&`fc?%t4sfl0bf4|)X7|~jUsqVJaZX9AKrofwVMqJ;s`i0Y7Tf#=)8|;zC8cS zJpVIB{G=g742ECr$>nME68=HG5ma$H+tM*0(Ym+Xmwqq)8(p$G@ZfpNzq&^&c@`zINxS*P+xyvcsBNYD^8Nbmt2rJHJPrz)~Ya1xR!dvzvEtKd>p0x$5|y^s*PSz4e{>n03kVHFI{UTha48WMlK@1&)Wsc66lj z-r9ci{!5N;hr;N-YO4<1kIP--d5fh=iBEXOSpvigdBbT#ab9!5HwT7G=kPQ-`>%?@e=^~yoX8|LXI2@LAsxAI14~xy$_fius`+@9m#f#1Qp|AO4ohC z{b}_Kj7b;ux#`1T5Y}IreSZ9^UXx_6>v(*jn;&W!ubrfrvqJqgf) zM5O(mL}YDT`8dsF1cb(KFuSCuPNWjGeQ(Sdx?rpqtAS~7&7aHGm_9Bjz&~S(7_iyb zBT2ceQ2Ld=Dwd+i>t>le-T&LpIRk;@kla$y!`4oE;32wNEXS!shvG!j`#mC(Pc7NRm$?6Z_vMyW?zj6@y5qB z12@e^1{2)7{WVt$UWH(NF*TTOvIr7Q!mOYx*c@g$+OlC*rQF$Fpu9>MKu`Q)2@xA%RL56U z59U=%)>!iAfKB6t3NU2k6yN=RDdX8s!UJu>)s{3=0(^5eTI{M3=YQRUd|pvW+u zQSu5kE|L^fvqbzt>EbAxRP+t*^-ZL?d#aBclV)lY6*a{ocRUPl17R}S+Q|tN77rIB z@hWUJUEJNaa9aNX;j!Olk*BFAZ-kdXGsD&~b|i{GM=fh@><{8q>Hs2I&$nC=IbR5K zPbq!qrwmIF##{u|_QGYys*ldnmP+9Y@rQD{`)i(0cduqf{RIVE#>} zaTo`#>yl$}lkqohk>#iYV>^`Z1qHZ!9gSP?SWU&G>Rdx|h158;%r@5}NW{Hx2Cc6@ z{S5H$s*rC@`a(bu=6UcLb@9C*==*mPh7FcSSlC4x(A?RPV^D58BUMiK0V!iyLIYUuAhvn_D)0WiazwG46m5l z?XKnsqngCWdq_>~e-dl_e+$_BuY==xKFy4ON%cqLvn)sd%!vaVMR5U;Q6nE5Rt#yw zj}S+|oghCbcXoqfC8Pk2XIhyUeIvbn^~ScMLj8FOIrNNa6WO;-^<>3+uewt6VrFGn zy|s5U4rQ5ci4j%c< zP_`U`)s}p6Vk$kG?%1O%)Pp@!);hTAva=IVM(!`btBfv367!79OZnGV!j{WMZAs}K zs7!`oJMFd(G0F#NG?g=Z(ptd@Z%HgXcSI_kvz=45Jp4T%0WopqX_1C}{P#y>ZgrKc zlVyp`2ZX+ACr92TCY^m-cPN_GZZRYZfB;G044^~ldi?$HJLC1aJ5nQqw3;@>(kp&h z&-;mGUJk(ap9z$M zjw?`l)J1>N`~8b|#<_|^{R0f+rb%91)XpXhkO%kqCGhUWKJQs_`&_>FabW)2o$({x z`RRFT1x<|$!=HMf@8I`Le$0=To~_k6TMzRrjiyN-whbb6JcKW*O$*QGGY=grK2&_wTNk7&g*JYx;`YEJPJCCPzFfAC$P;UDmS^t7sf&?blGhn z>m8$Y#CB9M-tQb2*C732^JjT^LqU4nC2Q7^<237;WNy3q>Ov7bWkx^cK24t!4-GWK zw*l`JAH$I)5NjHC)|{}6t8`o(L=;<-*xsF3T#kgV0Kfi)u2_|>mQnlNX^6X7&lYt9 zwlSOlsCS!GV`B{vXHI>=Iy@kF~bq_vUNVPdq_Cy~G_=oAg|lu8Rhz2lerDrFCsa2^-M*90sn8}F-Q|tpwDORnRJm6RNA#gR0CngF*Qv_1c)$1UR1f&?9d#tOBh> zt%!|s&vnfsDZ6o#09*<0?_uY=YcV5>10|_02duqgJS61v2qAL+PZ3k4Vn%7B2+8@O zE|45|Thi5bFZahq_p;%Q@7R+1f6mxyMCNTy0#&bojfY${>j(QAm<{fN_} zi1z?|4@T;@^*t-Rm!c?Dd*c(f1Ub=I?C*Nik9D8i9pAiG8M{t5RAY|Y%z0DKgb|wi zanCj)>h{SdJCv6<@zcbnc|u3=QnEZMA0Y~-$+$1PzM?N;0(s-pL{Hv3ZDmm<+C|2J zS8q$=x5lUrIfb%==9v|w=}RpSgKPR19F>w^*;nCr2gV@ZGn(;|kZ*^Zbr0zb{CRqp zQmbC|J~VWCD0fwk!35%Q9jzpogvlHuFRa6e(x?&r(T<$ba3X`OLy$ifBlXyhhd&Kp zt{>i%RJ!=2nR%zY@#E%>%APZ~(27k^a71Y#W%uPB2?r{9dZ+t1L%N z(qS}bSW?WHty{;u^YQb~Hr3?KSkU#Nd;a71M!PDEgp;t^=`}J51f{vy3{lad>kDTC z9y1+{>@j_z`JzoBhQPiBkC8JNzr{>ozK@1GL1AbC42IlS9pwCj4kdWahhq~6K7DR` zN#FveVUmpFdXEZet*1@Hq|V4t*P38M5fOd)AQp_Z&4AxiiwW^mIY&Gl(`{}~FvX2u zA3t$f9ZOu-r#FdN{6AvupPVDse<64yJh?)GNBEyPdesy)zO^!shAOGpL4$=ksI#re z;_WXkO%ah0Hhm>5So6T|6ok9so*EjG9}o>l&Dg#q^dei@^oW{@51i( ztxk`vf6vKDGLlue`QlBMMmPdgcUSXQ5vC6==bc(R_l{GpUy>SekrBJhOa16=WPqP? zmw1Zt@?-w;>Qr$MW&pY1@kAW4(*H0*H@Hs9;VEoW^k+1`w~yNPl5*<^`oU~mD$N=m zbd^^d7uo1;b$4_+u#k}C6?)L<<^IgL;&c`y;m&G))Nv-iZ`DR{nlu+`O?o<$He)(8 zVfeyVLi};_k{Oh{91CwQVDub+&v|w!s^v;D<4R2ih_zpk(G$Lud+Y}W*XM^cP`KWX zV`Sfr*HRTjFc;eOi-L~kZ5XDBC4s2C{1)>Nj-*0F!Z7fAqtgk7**0Wfv68eP(&;i; zN0WCeX9gT6s2=-xWILnH#*m*J`S~GgwASZZQFo1-0|&m?J5713kGnS?*h52ee2LO8 zUy_ocf2QY4eo^&&-uj=D@ox{1{wI11UT}hT^D?A1s&tuDAU-KaR3Y)}?7wIHztq_U zjigm_En}yw0KfFAgQkZOuAI!wXsqZHEAD`$k)CsGHSkt;qflCdaC7`fJ(esUq$&FN zqOgNek)s3V`}6I}?FRSo2$vjX0AA>~iY6u8G=VT_}VaxG249 zr{$5EL+*9UwB?^e4K&UT9(OTH6?5Dmes+jdwrunqx<`l=JqM!=+MqVhIUs5X!U%_- zw2{nIOuMe4V2K^qS}oV;J>M1U#z%9Mv7cM0NoX^KP$aspC}pdgSkEU{%F(1oP>+iP zKxt^@&@h{R&U}yESA_*Q1j+J4&DC0jxE}`J_&<>Kx!)^Px4V!2vQU2%f}P>rW$NWZ zI=1xpoN>pG%`U!IW6fuIQA}e`UEt5%=P&}7Z>3#l*L&4DU2dy*I&vg| zXL?a`#lvl2kct?;wbb74uw-=+MC9fGVDd5l`ld)y*TeOYkZSZ5wk7lW0-FWsSv6E3 zZ%ul`DK1k$7)TRTGRK74J4O3&nw$$@b3mzTgf234(ho)wJ;Q zQe4-`)GYT**A7xMS)j-AP7pZOqwIqp!y?yomV&$dE(&Zkzys$A0NkT|~(+T=UOCZwL zejIpXv4buq0`67LBpjshuOGyecsnjK#tm9$km2W}T3ykcqsE(baF%^r_A zE`Ge4hS=E<$aoOW0PLOG#N$SypjS6;Kl=(TD5z{LSYzbN|JGI?B8}XOwQ_N+>K-!W&BDAkq3Ut&(l{vZ5yK@8S zV@Jb87dIqB_Zf2-Q8cBv;fKOo!0w#1V{S>iWcXYbruaK|eyGA_C(muTzL6go(2k+I zKcTINg_JZV#2HlKgN|tgT}L|Rzx_Wl7N{P_9JMQibRgxk0%6OQ8E$ok9!7tsw& ztF0F273#3}-!kLTNq$3cg9HNRR&-5lKCIr$^O4yiQX}K*cuRMIf=dE)AN`}o0!|*! zwv_)@$wGeWjQ>#n(NTgMQRDw8{z&s>A;j{;rrCMW12j}PeiI%mN zxFF~T-Lde|cho8iiyb-0`JWf0TPiO>O=5j(;R^!ot??OT-OS@q18XRc_MmC{?{e^I zh4&H{akXiGmPmvVd&j-Y62ttaY}VNepi}S=AFg@!Cko?^SivBoYE!3x?UrA1SQdx8 z$5^-6pr+a#x!M@PcJEIN&AGp@ovrZvgUE}oR-J&jT?!)Q@myX-Q}P~%TcB4JrK54R zOTtcBJL`+{BX5p77GmPACixtw>>s2m_(1M%`>$giT-G!0EU;&_hm#_-l@r>D9smrb;IShjLZ!E1I~XLxL1qT z>B~L5fXU5>}Uo)>ugPk&nQ5oQgn3Bb&N$;6}wi{ zRo?rfL2Q0J0};6zS5jW8#);w*Pm%bImK$ZgsQ6%tV5`_wi7b#+GoIdK`y1g=^XE5D z*4Yo1A(AXZJ^o(MgCjXNk9$|m!JWkkPaJIztY1zpJJ7V^?22IXN-GsIO9x4|!&0rz1+6#Ud&8BpS

%bOQw#Hm&Tknhd0cg(RvhNN-u{&ZsT zrL>z4eJY^{M1H)qa3I_o-!Mpe3Vi*e6)Ty{E60Lm%s7_QQ%t3C?KvZuW6{25>tlxH zJc?&vRS2Y~jUBA2akD%V*2Q5wjTC_I3(4A*!I8TWe^79&XEhD9TTtDYTc5%=K($&R9L?@NF ziV+U6-ElPm8o)*8-IGA~g94g1`1t19OT1a3XvX+f+{A@wAe8M+EFAcRro>ct){;af z-{ndO-Ee4rbOv9s97zkoDlX5_F5^`JXav!MbVlsF&zA?QeTS0UM+v!}jsV@xgK4hg z;9HJ2hMn8n@Jkh%Bf!{?7)k<7TJQKgjJmi@{3E<0)_YLWfFY5}G$eDTws$#cZQ zHw=+bP$NZQbV?%8qGFOqaqnxYOxw!u4}0hTl+TF$Ogei1LCoj}Hm5uM!;HeAwt#c% zr<&f$UV20Rxf8E+ZhXimZ8^goOnD{r!&zla9DCO=PwV9AjCdF$Gd&B7x7 z7^NC0U)Bw5=~=uYuc&Bft9p5vn1Htf_$c4hjRhw1H%lCZJTSB|f(T5+9=fvk)<-&b zN<3Uln0MwzM1r197Q};pB&#EeJpDRJ%W#0i`}KzTGp5%bHIp*E=cB6^rWf z2A6@xziZE`1yR|oUsVS?lT}PLcF5&h1vc5l5J;Yi;Im(q9Y{DgB@92{HB^EsSg4G% z@*Xd{patjPNI@c9oOhP~w-XjQ(lJG_hJnxDM-pnE<2c+j%5ev_Y=0(nxd=x#+ntQK z7~Jx_*?O-bEX7279zfIld!kisC7i1l3i@OFso42I}8d^oM; zq|CKSlH}aTW%Ise=jrK>iKr3IkyT{hu@(7UyZGGu9AQFWW2-5lr?Yx0FL=Kz{&6px z{5xv&&ZnEW=Mmt8ui>MHM(;yfALW@5_IjZq+1428T_Y8==`}2wbh(J=mWsg!t&9bO z5;7(nP~dZ}XdPi#(kv#^u_Z}ZtY&%V|HUTwMcLQ+KQf1~LD;2-nL0=-@ODt20Lrv_ zw|w24Z%+xcPFm^Gk-%gdJd|wQh2B8;*Y&+q0V5tHj{X5r5(^FwlR}pBVa5V26fV2< z-;5vS()SOZvD`Naoqb`ckjn`2rayq9Qjk91R}&64ldb(sO~0lC);M3b2E12X5cx>-dSsXRLm<*G1oo$IrOM zL+HNu!e4E~1Sqq5(SWAYo?na^T|>cb9yZw%PIqN69r_cCedF}f`daVX?Mf%#8inhL zr84quzxs3s3y(XqfL^MX!LD2G)$KAzM^qnUfqgTh%@5Ah8?iJN$b_(6qxGSBN5U(8YI%d8a#C0NCYZ z5c2d+ogMfwr%hpM&SyBsnX1 zt2@Jk?g}%H6FPc{U5Bsc78|;xTq(AXmAl=a?QXT8S zaJ_QZwYe^9@UEg^YQ&O_ra;h5y$VhRFxh=~^Lg;|9<<>d1pENqmd`j44M}u1Vb*TD zU~k`kd!uq!jQKDFV-6?_t@{&^!7cj4;No6*Z01uwc;)pg+x}#bb9xz;Ca+-p04n)_ zjGQ*PD%wXGOs=lW3Cflieo}{ z9xo39k#MXr@6#&+$b-R`ow>zeTN3epckx?gLj={&%H2rv{@&c@r`4+(Rq&%WlBf0l zYJ70+aSyAiuy5W8dcb%#!ml~}R5wax!33hqBwzPfKzRzug`)>?)vQS;@5vwT9L?1W zh#BvZTrCS6V#qT+6Gc00oSZNQtRZ{`Iv@7SvOw;|t~NI#eqGNq{@bqUL_3&iSWpy4 zje_@eB}bTxHV+V$3FG9?k5fW%>js>TTk2o@pGVNv`@v*nfux2atLbB@c652!=IH%Y z5~XAKx(cfNAnKnQ+=@!k)!I?ZJD!v`E9!$&+Xd7~PU2$y#z(z=9R^148^ zvi7~OG-V8nuQ7o!BqWB~Uw@x|=n?i>Gvr5lF{=*T%}i$n5Za}QQk z-j_7%n02hvD#6slI7(0n+C@qo5ijN%X-%|^b-&!Z-Y+OreURyx{99M2q|FE2E7M1e z`y+0}oDqc;tECHpvXB&v0%v=vaTzyOlq|hk3(?tnoR5WHHR41*_ZYnB$1JIR)xAm~ zbL?-fJ9Zg>!E-^nO}lda$n^2O!R^#xjV@cC zcPJ?o_y7~>Tz!QIw#wAr-m8l`_UY3}I3db5nuu*md^sx=pKC}j)dKbDS4ObSc(uOb zKRM0Z?neO~QCF{D@ZnxTFDL1KGM8v z<;Co(;q(AdZJ{jG!DMU}jpYb>wAL2dv^l>R6{*}r94dEv=YjQcbrO5O_KiN#D_-YW zCt>%DUaO?z+>w&?vm^R6x4G~fC2k)>T%Ja(j&sF;ElqPMU~tApvqocvM5@hu-N9Ze zL5}Mkz&cP(x240ihi2KPqV-8t-dO3}8y=}|P3k%>b-Nl(+1crTae}3NmZfcn=XICM zsvoeL_`R0VLw-Q$SL~+Ya7Uo?$$3AOC_n{(Hp5ksWb4dJ9gn<)EH3CTEO*9fxcL>` zpJ0dN*8>X<^EXp5@PsV>*!4-z^3|!5U|YlQwT*HS^Q^)Z$nNu#F9^4g9S5r1Sdn}# zg43oN?)Z6k3_qKeM`WYbgXLXDbnmj&RuOtg38^!+WXC17S3AyJ8E(cCr3R8LO^qGF z)IHU2(=n(4|463GdaWA+LRp#ovP8CH_2;ICipV@xV?TOH<{dYAp6S;K5({}24skm7 zcR&OI7sirOVf5$ObR?Ypw0M|vM~_dE>_MtfWoE25bE6mO`wY<^{^Pn}#Otzw9v$(| zIc4s)`}mbRfQ`JKHzXj26W@|4Bo1r}8Rh1@J5p3RzdOQS;?5purx5yFVY0%xIS9L* zY`NYJ%f{@LrvqVs)1pP1PrkLLmx!7k#gp;n?~dE_x{XXYs);O?ogBko^|09ZlB$6F z<_$GGYyzz`1NXMDz^RhZ0jvB)(>W9al#&y7DGAQ~#64cHXC}Wm1%qCE&s7ACU$6es zB-enNg&J_MZM<-r>O!H>$m660FSe>nkwBS8jGRm;!r@C2_GmF$Lfj2<*tuaSM3+OrbFz%89dV z=aw9}t9&T;iB*+9ZK9MIuL@Ea{^j+;;qI(V1@_G+=k6={LDFP&ndlz`R{op0q@If;dvN{SVGGQ z^~C>k`$lQ)PFm@M8L9i7Q~`V2_;As!$%ZO}sIdKw^ubAR2j@{Q6~ z)y9^o^rd~tXX?CcVRR4g0(pyBP|mZf*B_G&oK_FV%A_Gf*89%nbFRA&u_Ki`E|63r z55*?KgI0h(t9>%@BhxbW8r_FxBcPSJv8dbfN!bs1UW=L94mpwfr#ps}(2XnP>q<;X8YjQe$&lG%Z^y08-3Kw7VQzhRnhgC{CqS_utvz|BYzwI~w}`p^+2OL;j=ZQ0NE{ z!vD{2jl@!5_!nRJPg?TdUPbmz1%dECcTfX(^5wbNf$i58Vg(N`IWQec<2*h;|L&1-aq{^BhHW z@XDhio}QkHxU>M>V>u{fn~A^zO0bx>1dCD*j*P<^@84m=b%)HCUZ?ICsIYZJA1peM zZ`~8j;y3g)&t&|fN|U1y0rh$v=)@WnMGuxN%Nm0Wp$h#SY@5opEw{8X591}#O#$QH zZXz-Cfypgr%w^WjmAwxVLiqQ?17Fc^wOE~+4}^9<0FAd^ozJ;DkmvmZhX*JwZM50N zoV4b#pVydVYXwH{O=r#?(Tji#pb8MGE}}x_y70~mi_y#T-{-C(Nk1}d%6wwW_!6L` z4isg-B1)rk*vF}8@D8Q4U;4Z!k0`t&Y#A|#zJxVmp9gA1qSRd!Qqd4Ox-?i(`uuw1 z6yDh$)6qwv@9W$o%(%_fHOMBJjbYCV)|1*NB;m9(%f|8{G&ix<|R98 zpe)D}K3F1$w2XHY04Q2NM|4eBRNmn^QUenchR>QGc2jx1M#pdSBktA|zyYWDw`CLM z)C*0bGGb=383&5|U-FL5cW9-q9AbR!ED-?yB7+K3>sl|nINE^cw@=o7(>UK>U2d$T zvwE^@mg}Wg1B;zGD&3MdOcT+(rcxJF8QSf1gI2vfMh0FHC|8?1uGK2AElg-!`SQ{u zV2mDB$L;B)S7#OECJYJEAE*DjhxR!+se7Fb9|z8w#N7IkgnZjBHYXtmCDhpJlq`Ib zKHA9>KY!Cv+FKHxAQeb>igv4&?iCIz)MJ`w4~&L#%;lG>6JK3vaP3SU ziIi>79)dJx2i;6Cd*4~K-5EyQb$ZfW^ZYqOf3UZ0Ir@NRk8TWN_yNi{@@3gRG>l+A zHlqPVca9Y}x;*N_?Npn^Qno)>jKWM-{|nfw16%ldWQ5$n89XA}Z5ixb60AkNJg}JdjX2chrOv-m} z^oUIt76=ds+DJx%l4WYb;4?uQ*e?Lw;WiReTud)=9GmnZ*Nena?}PLm$Ai|j!*nWr zRoX&}P|$5Pi>``araqN8u>O@v2u)? zB%_Cj%eQ9R)ZL&|$3BQ52rr9|&Zj2Kr1{2XOryisOwEc=yaT&S-(p!q}j#g$t@xF>j3cA6b9@Wo=B_0OHoxS0{@wVNWUh1P_**5H3AIRgr~m_{$e&FX*ltgHyEv>!wm$ zh~jOk0tSn}YG}BLxUW?EIJ9}jt3NWk0M4{h&xTN@*tw=^|sjI+J( zqRKYpwIAc=UK+pL6OYCzGoNs^zN!`!zJJZjH?%szQB!%Dbn%m_NTW?olFBYiy(&B+ zCN#PTh3qK92yo5nrp89Gb+egMr{5;yAAkcNvxLoAkYn>iz!uvN(%+_bqog}YfH(xl z$b(3I2t4S$n+Oi$sRz(d9-6!C?CdEV7cd5;wPnfp5N!&{qW3zk7XOu+F@}ie@rY^62{LO%dw%>VrGG zDys=V7EUc6*jEQdsIj&N%+5^Tv@}Q*pxk$-#kccjxNE2u#_^&CK7sHM^4gClf!j!M zT`X6t@9Gz|j;4a<^j~meOUz=q%AGT`rK1GP9pz0?WgRJXrNQ;Y} zL)(Mm;`_v|yn;T%!|iYiNNZA_E{EngvUMKuFX%CcGQH8QN_JP7J?ndrB$PEc_!g9B zG8UL!y1^yYNtgKpxH;4Il!)s0ByYH}jEHuN|2NM0dL2-4B?kt@6$A#eY_oxhy5twr z>35aun$OQg+Ab4(Yj$#=+R_M?7DkJZ$K?RW`y6~6uUz;gH8D4?V|o+ZEf+#XCNz|` z7A|0yv$094x@PEk`N+}HgNQoTqu65xZAo-+Tz&^n;T;k@*^g|$nv&r07w$R*dJ=3n zmuJL~=iCsKS;u?)%IV1N>SpwMAogl_CZz#jgv{SyQ!YPL-z+qV)8OqJLL*Q9=&uis zRzF?Z#vN@=sdf#X_X2(jd)n%@6h}OTNC!R>RKl_y2ugn0u7;-di98hbNTp=2Jk7afascZkdN%Wt35apb*ZHqN_AG=_7vTb4WOpnN6KXu`n|6D zj0wVemWP8Cu)s!6_Kx6Kih8|n`F5cP7=HBLZHV|N4bDM@{@0RP9bPf9)JPv*N z{B`w$)`7h7m)SDTO(E|~m`{$xvgf?NZ!Rn(s);k{g5Aj`*T~U9>No(|U`BnId zA=T8XC;)$%&o0VoXR-2`SpC(5QEMU&LY%Jidp^-&_}p0fFBiicrugFZ|HQy!d8b!O z>BfPNfCt0L&id1H#?IGg+KeHDANWYQthdx4=SPdb)2;B;A1tt+lMT&C-iMAQLwah` zhU6j^A-4%Phd}tHfPzvVZw(XCJYf%~#DaBOfSHW;VEELKEiNur!;vK9P0ml8^IE6p zs6o`HaV09w-58$9hgVd~Pd2Ztb9$Q}aQo#r0fK`8m*lD1^kge6g4nP;)?M1z*Fh@d zhIEI@?zuZTr+@tjzIs`$457llCC`Af&9!I6$u#}pmlfV$O&9=eV>)|SynGVg``A-S@#jmG}idJufsWn>q3Lic*V z|6D|AfDp4WhEB?)p|6Pms{wye?d9OR^+rTG0XBbxvWc#kf*<^%Dk*ZxS6HS z9)3i6R3rx(nqJCAwv85ljYM6BTM#ikd|x*D5*>z*aV94#aO*mP9I))(T(sw3>5WxWr zQTi6TK9de zp4|(!+fAg^dQio3Y){L|6|_ECp&WvQv*Ld$MqO8) zOwuEN;m#O(fj0XdyQOpN0vfs5AXViM9L&-v*8JJokbMlOAz<>m_7BWJHa(TXz=dQh zW$6`oeI6-}fq9rg+K7~eN18MMru(rTBkpFS-@(EX_uX?Eg%20xO=ay$%E&*NWK`i5 zS{k+?*mw<;NsN?1G{%@Q6bqtT6TB{aX^A_EV;B7h%OB?u`o74Tn!F;wGz)Zq56RKB z)DP|jUPPCs6X2c`p_%2;h>eiuP;pC-r_Y= z`c706z0;Od1G=O&#(R&JVHYCGZ$PCff7y{$_F@(dR(TEuGLu1<3mPSQQ_JILU}!vB zJ0_Feu#dZ3oKpz}0ehX=>_^7t-KQN7_rMtch5>{P2(ktUn<3?KMok~=EUfOj5hvn% zd+RGG;Ga%!jj2_P3vH1H)xVRi^nWh58s?TETofNn%HwEBs9_jUx>DQGFpy%!BZh8} zZ-lAq1vCPs){^#w{=y&DMN!{s>(K@{4tORvwvi2(W~ljF1DDzvc!erBHR!a3I*LQWOPvLAx~g z2BkLh_ijj0u8PXzn=*eeHfR>oIp$q$`Zy*3<`v*x{8a!>alCW#p0He!69a;yUwsyL zD&+@K8}($r(G`Yc!VK|m?bwNLu0sEd=WJ^~DMwh*_uJKcS#h4ijbrMgh@^{@bDeiR zc+$zG__W>S!Wjnz2B>DhhmgXEQp}Sjleb9#ZmwF7s@RTCdg|w^MRMPvcHeutNfV0k9NJfE;U z^z(Kd+5NUVLdAg`t4T6#dtdLgJkErXIDRe%r;w!;R1&ZO24E8CNhKVe^IM%{Sy*`# zoMPOwpc4%YnUs`iO)Z^Av>@#qo@&>8n}hw1mVZkTp%W{cgHw!oGW?|ZaGp@HZ)bP7 z(4Hr_dD#4!G3pzlGtwmq)zBfcIj>&oK?HlSyCF(41I=2f#zwY314zc^oOz()lDwcn7IfMbt1zbJl?V(8)MD z{mY)A45A$_?&Uq`4_~ve2yxGQDuXV~%A|PJeKL-=?SYkGbmhz$`vQVmDHfF=?rARy z?YBCzcCvii_!EnTRL7AJN3EBj+xEF6a}HSn@yxQ3AIkd59K(%XIDHCszpZ&^KVy$nRpI~oyMMu*~vvRHuPPozbJ%jH^ieip+hUlY{3=o`C3d>K< zexs?zRg*zoa4cSt1d2%aIO|yd%^GrR6yx0Ez%cg z>zuD-?r6M=qTE4yS%_U@Zb$xa@Y+^NQrlnMXX)kq3NMk!EnSo7#05+nybk8mo2E@? z%_D9@gHQVYA^nO53m`OFw2iV|K+BLRRXgsq1=?GCmLE_v`E0i81x8ZxtR_Ge$+a4f z{_3i~Y%Bsxw^MdQ*3#SmpOJ_4x>>!@<4n*@1TfuH`R5@!x>PJeG%HKu*5eb;d}uS` z)~LvL0$|l+xLg&SvCT^>zhalz0L}>$xsCT-SmcIo-QmUTUG((i8NPv;C-4jlFc9$( znEO8b3!G**uN(gU{Mu}M(ADhKaV zA@%BZ{btFwVcR7#Cz7XlP)o6T-B-Fr#K{H77hC6jB_3V!zh&`GYE9=wAoaB(i%-W zMuE_8T;z(i9Y5iMft1?F*OYeG0A zFcJ5xIg#T3_|r#@sR+>Q-Mh~10~$czRCommRmB168^}3>%HwM;pHD;_4A+8Evsw%P z<=GXM{vq=YQTvmi^)wo!wkBs~1qe`WPMK=n4u@Ca zSLxA28NJ05Ux%{^gARvJ3$;K-DI}^+Dyiph=5=2Qo(*`UR~mB%Myx2gwvNPQ25%!U zSg^ujKhq667_b!79U|4dGL);5uE(E%Nt)?nJ5R9$sp&fGF_9MmWb|KR0u+`~*77%7 zPB*&Poo{@$sD_CE?w2z1JD>Fjypw}>YHD-Ue73jVzxj|*`LAKpi>oLn20`eRXU%mZ zk^-BLiL$A8gLWJ_g#9EB)BEc&A31Cv=rKzMp6$WS{916^)5M1idVhfnEWS zG*!`b{uqo`l&1Yx1$E)D(-pkc0$|6%V)zqbn=7*^I}%#7 z(?3W7?WpREj2&#QiZRwX2oxGX3sNi5F#t;KUv6JU_M|_fGz?5)NN40);q})~0s>Q^ z77c5lSX)2mky7&ca9L%aRpJgh^iI4b_F*Q@C%J|}axQ(~PKz_{VJ2pA0I1na`Hg@R zCPO{XHyCwrz6wn6{-X&&niy#lvH$_PE40`rpC6E8ALYNPWzQ6z8w3D2ATc4+vU;2- zrOcLPs5*HzG*eg`!mO;LeXDE$dta<53#C0zKUo55 zZGXeOY&6_mO1O6H%`Ta0pxdbw%-ErX*pqUN6E0kD|VGr z3u3`OB4GinreOE7Zh`a$&`4x zK#RrMxi@HAzttJ0{`ks z8Bj6+;{!5Q@9(qHYoZf6eAhRP>C8n*=IFk<5kLzfv?LS)hzAIApgW|BuKNej?DxHR zRs(97VF-<{j1{I#s=UrC$|V7A17h#Nbpx(5F{@Efz{l%zC{J(k{u`WY^C0a4KF0vF zKPdpm19HQmBt8KUDr)$q#2-EN6~odtNjp{A{y;PW5oO4MT0z~$56}rKD~y3!b^s+0 z*@rxk05!pF3l*^K;;bzZsT{I>>3x;LfRWAv+}YCg3IG(p16d5LseB>?*l7(`%6YbkGN?J+y<%`|#h{%&dg>Qh=kKsrq^xzuJ!$_mACGYhJFX-J`dnkYoD#_W} zVpR9S?V&}2H#(9!HFca36$^X`fzU9Ix-e1Pw+E!W67w?h%b_Hnh@+ERdFfRJbz^@l zuLOWhw1lv?B}WF5BTGxmDxh-ao8kvBgtY7Nx3siO!qbT| zXXRd=zjxG>{nP*(U0P&4R{6B1C8YQzue-f8vQw?f|9$j3AhwWiNk9;_;Q&T)tvGmbdX>G=rcBe}DIr#%4ZlE9 zzwDARF*lde2hht_#aF2*PHxEFN1{hExWwO239$yJfyT=DEWk(3l?Tk3Cu{mynt4{r zf!es)f>vyYA2uhcCKKUDCjs<@($aVZJiX9;SfnJau-{S!08{~4ap4b0>E6%QmRiWw z@*J7s&Aifm=|tEoB6EY;E0TVqi%BMmGr*Hs@kz1F2=J_Ofr19S^~w^sWSJI4s3S=} z3BuXHlClZY2xm)Bb$1j9JQXYokp6S2q5zEoa6jGKSHRN#?f_pa_6Ke^)~zZIKnsz* zI*_SZ1jy?Fx?vT-ctY-ZuqyT$9;)CeU&%LyNzt}P?qD&CLLQB9w|H5reuZLmDnyt{ z;{^<3qeJ?A0V2?*1LncT63F#y!^qbp#Tu~jiDaV%J!yce&j#`^FH2oXM=I72WXwN6 z@Q?TU5Wer%R|GBo0a|WrcCTZK%%t|*hI@}EVyD2B0u~48IOG7g2a-bvIAHrx57ohn zhZ2J+1!f$3rZ{snu!?bj_?k3e7KU0}ePsqr#mEO95XV-)Xdaz-ixTjD?V(^w^@Wy( z2O5v4j*%rCP;P)7oJw<#FBk-{lz^y%FA@H?^S;7Bo8hQ3)tUT8`ZJhMG$MWJy1Wl? zfa1GdqeqYod_I)M)nP$9lg2L~d;2zI_kgt^UFs5da1p$MNQ(Kf_=_rC&K%~C5&vsF9t3wLbxdBSRuZdfeHs(|f{d~L}85;zqp z>zA|E!LNspne$bjDNiEkdl`uwkNiW@=D|pup|wO@E>P`KTiU%xT>Vbjj*;tzw5ueJ zt7eR`X5TTIF$-F?KNvJpyA~Mw-!tdqmrp2SAz1? z5H+)E9Am_CFqH}QSF^`w5f+(+4@1TgVd3>iI;{E|N`gtH(n8&^+FP2Ck2B2lxs4U2 z(l8{n)s)8jy)CmBK{Rc+g~ZALeA#pJn*-Z)KDjK2R$IXhd#@E2?f>^^l8|GZGm zkTbxamS(3X4Y!FRi%&>ojeUzY^GF==iAegYcM>8rMks65ztqeug;ydUJl7?WX9>h{ zgHK0VZNIO?W}1dWM>@DnN->iBwgwj2thc%S$>Y-kr9Pa?W*(V634~Rcl|}ry)%OL5 zujI5lL6Wm})nIIz-I(HACKBv39CkSI-RzmY`bI>7R{8RG6Rx1{lI=OwDP7&KFka@R}?Ip+Ki)3`K@w8cBt}3LhIByxtkusR| zFZo5B^*XW=B;}#dLn~fs=`Qlqz(Z|-l_c48zT%YKkQYC$Sv<$kR3ZqMS7Z?Yco-nR(%qcKzH6Y+i~$f>KD1^qOntD@lF zydg21en(;H>yd{HS_2}Sf;Tr5fs1}gQeHMu(HxcjyqLzEvweU63@O=cxj=eT8X6$x2U*D9r<&7TDY_hDX=L1n)G>;BZ9_Wd)VRREvL`26=SL9GKb8PB#A2 zhE>MJAesjgIm?jAxCM?lE*=s=A8N!t#=e|T<(^%J7G%wVv19%m3(0lO8uVoiBB}c` z?s3tQep_7S4|R_CY2sMJ%p%=T1y?o+_7D!v&@{TyPCrR83jw0}gT5WEW(Efeh!r32 zfz?tZB0Qs%nMf{o?kHqjp#*0UDDLF@?w4F#3i+MLsU~hQ+X?#iH;ULNFquh7*Ztl1 zLJEW}q1f02ed3Z$MdLcP$Y{2%#ei|QdC8&F6-gcLV}gs6Kk+M3yYS=hHhs_4igBh^ znzY6AEP0BQ<8S9wdofc6io3&+(|ihq@^TDlW*eo}maw8GE37g1v2)5TlJ;;^J(GE1 z`jX%UOaZqkh9kM@(+v#xicd2uiK&-*yT&-00#h=XqBB#}LEdl!cVla4$Zt4Mb;twD z@vylm;Q1Y2k~`u)rQQzOi!gGyh%p@`^ODGtvI=c2Qq>xdR@+c}!X=@Ay_v_^b?Nuo zrO<0DH29yjq&FC{lUfwu^E7zwb6T3jtfqKRi5F5hyBY=&C4c$Co+4?R{Kr6s^!wfB zRTg*0HW6pw?Fl08VuO&usQ)_82Z*Y?r^|}&#}qylW1`L@2DV9j#Gd-sJq+AX$Jj)elXvG`2SydIbcK(f#J*3f zF1xWz3-u7k`)Wuj;t8Pj<^vKxLlAhg7$S1Ykb7lmAxy!}tc3clM1nsQCnla_%-8S9 zF8WrfX3GZ5`(13>#OPA78p?swSO1uqSRc@Jzy{c4Ma>F_{q31 z!lkbU&8s20P=v;fDukP+jAEQR(gfKZ(Q*=(J=xrwX3F0_62uF@(aZ$zrZ*+IXpbI# z4eZHaPhnPZf?H^*^FViARlRuHUGfxLiDrOdta$OLG)7X9*LRqd0>t#6TQWCM!>8TNt2ZJS0`2qgcy%G zD)&-^K~7R>wQZ`!HNlau^t>#RzTAS$9hXl&@z9ALGvMhshEv4aPy=8Ia;v=|d19yz z;StYYJ$r6V9Ew8w>^|p!Qr?texZz}0gpdm>iCD_zLecGB5H^p=bWfX9pt%@61=%li zSJp^@M&pNtDiM|W+-c&7Q`#2x^=hKh8Tuyk1Z$e?UKF0`QC-qu`tp69k`L_zwWh*+ zmxW5iqS+k6Bi$~Q(yeTrqy~^3V+N3kVM+Q6ybF19_qHjI#3R%?(QPMFcwDi9f(@tx z^IZiPA1uPU8ZgF4IpCzYXd$43tmB++<1JHm2{{l1LEumhtUCpg0u16KJMoZ+J9#D^ z?{*P5DpugL#F6oaoQa0A<6TseVt>WV-(PTsXvcf!3hs&2A>y}tt?tqdJ{kVOUvOp+ z6|r+r*ILlN)Y&mGLP0U$;YFDM-M272?O5_#qCn6Xz3}C@xD2Pw&_^R*R5(n8MKk z$w|QAY9_+P@XL~Vm8r~7JkJ2{miM5J6|z8am&kiJ!aS$Pv_sX(4u;tR(<{^A{_M<% z(J+VN(=21)<}LA5-OyoP5Dr7tuuKD6EIoNZEHNxc4pCuc5B1<~9E{?d4oq>1Ck;utizV*VbU^bDa_ z7Z%!Ogq5mpFXSq*zqwN~<;TjBAJn(q<^^BvymK&&s_wPXzTB)4hxOdbVi)o#wn@a` zZa?cU`=bdn!rhaC?nmR#XpEqbC`1%VbHyVMEle$)cqDb@SQhkfMV!CJfw*H^~zAbT~@E^WB*?KN;*-R@lM9tMuP@HvK?&Y76H; z_5T%_@{viHn6e*YY{*V4tT*{<_RwA;6k&J>@ocVGKGGECw8}abAv{&MMo=5&IzQ!e z11KZ4+vM{?QepBSLnh;*{#dV6Bue_y1G{4rYfTJ))LEvwtPMC0@`AHH{FmH9D6WKl zoYR3JomIhD&HYg(d){I~Dkf_hVpIwDlm*Y!Dru0F*ZYfr-?&$N^f_Z92swlOrrBF+ zSRvOWT02731)osai9C>awBW8{yq%uP6Z^1yIcMf|2zpY{nC`W|2eW~WzTa5fJguL4!ysa{jJPsLj>QBl< zL)HGDak3(a!4e42D}9*<(Xr*90i_+e-4l170&;93Gx$6+6W_uK$X7=v@It%q1DD!j z5m`QPyGWGhC*c#VwDB;ESNh)Tu3ajV{FxXgvxSVzo3WfdwISM&$T1l3dvmwAQ4V=@ zkg-9o!`-1$-=$Z=3%}&?V2LU+^fQr&6e>C08HF1f$r12Oa3ba6T~DKsm`|MEaQbCn z41`8ysQAcWTMx>V-efX``}K94em7_)|DrhmdtkP?pQ)db>CAb+0FLaSfi0C4BSS@2 zQ`AC0Fjl+&3vZsH0LRb~w@C~3ZrsCf8wNp(oGe=Of=_eK95yM$3e_9~ahhVt_}F^3 z8;~-jU(3)7EkT1G*(p2eNd1~Gc6kO=|#kA{HCK!2yaJ8SC^LS(S=ME3&lm6htrc^fsj<{`%VvR8nqhVqofhUvKF&X?EFbVf`$i1i{;ICruef$b(Rp1-CMmB3NCpq&UOjcb z;_y}`9q4+cWWAqm!D{ET0jF(Pt=Gl4u3x#NreZzK@O{8;gTNcCv%L2u(@!gV7VcTW zLIpejjw5JqVzs;1H_~lc+csx3u686$_+Cjm;heXXof9Z4=XX>jcD|+2O)ojei~7E^ z=VH`p{GYu5P>9Z2hZfMbjHv}zP@Nx`%gY+>?gKJDkGW8^P6_LBU3z|=7w&wjHtW5) zQ9NBU6Wo%#S$r?yzb_%bLin#bPOND)yCd^?AAZmD*;kjknSK`2BdDRixo>O_T(N0} z7LR?r0lw`8gYIbHXSWn0U~LWm zadmca-5o}!duY+FFT}kw?Xt46;n{x_^|tG2d-&Y@?eFg&Ux8Or40c6p;#eB10yT>J zKKcq>n%_e)CFOAQ_3&7kZevhKhmT0k6s$JMmULN7&hNXEg&Ox6$Vl7^0!~(+CE~%1 z<%Y<}p8w`elo)Ai?V>HNp+P0m1^?jhFMsqF3EQTiXOD{JXUdY&Qubx<@FO=Z@czJ! zqT>=uR@>FT$8)9Zt4n>oy}iZ>uTXNjx&xwo*92hqr%MAdxuSG-5dm!2(ld)LFV@t$ z4e>EC@;Xb7?j%xRZO$uaW!kl=@9&>tUddkj+i%8QN6&p%vOM2wLIcaRn$xAR7m(o~ zm)?eTtX^()d?B{jvP(*0WxY;`0*LhQ@wtUW2mTAIRjvy!D9M#1yHmFb;{rf_F#ZSbecOm*K67LktWh<4#(uKNlEpN^Ujb6^)dx42*Q1$1{O#r z8gLk&?P&Ypb7LFw0=fM0MpPWPyFUTJOdid4!=Uyn`@3_EjK#X~e2Fx(2ile*H^wh? zb6UE|wCKaSm9J|~bVs`HU|e{C?k%mh8X37c`Ms9}e8u+OF6`?Z#&smKmNaU~rxTs* zifjveF3@O0m-UK4$6w>q;@L33OW{3SgBZe?*06&298{hv{yItCq98KKH-i(kq1aln zX}IXQ9RA2s5KZ`?gd-EKnfy5fnkbgcR~SL^MvWz|rg}I9o#n~nH(^$4Nb4R$nXbNn z;t(n|#XMVQ;Rf2~zqx!H_atwXnU$>!N@)M`H~7Drk=gjXpKY9)9~4aDa;H3hpP45{ zV8zRl&g&2w+fK1gi#!6~rMG@&tU$0b9>vMatxD6+eW=?L-i29ix11zKTmzoRcG@h+ zVq$3{;y`_SE3Xt=RC;Wu9enSG%`(LJ943m%6kL33{WHNkFM_vbf9S64-lkxDX7qg( zA@+V&%6m#ZkIuCIgEA913ut`EPgc-|R@9niz^yt8D>sqVn*xpXCBqxMp$*&7BILUF z`@9>O+I*1%U&QWlDS?xZAaREB{GUfRGMjmRhy9Cz=DwNnxnJenDL#H>8QuHauy!r9 zO&WeqFTwP~s$1et-_M_DV^g?xzBR#4O{$-5gwwi1C*AEpBQl(`UWR+I735{vuQPUz zn%aP>D%$ZxD>wTxKXZ9ML~Ey*uD{Rd3Oe8te%TmEi>eXR?MmFF;ii9sw=t)yYVIOY z4!FRN7egh2WQh?0(jA1Y1W^h_$`xQt(oOH=1l!5@vaOU9r~=+dh2qR3$aU7ceObtd`7}Le8BKZ$b3;a<`Xi%sf*3E;7cj|!1crWxFe5EN@w|P9Bya%$OA`*z@&9dfX*c& zJiT3TnPDZ%Cp3yxINh4Eyy=XS@{EqN?~AFsu9jD}-pfD|gaEbg)OZp1dG(tI6p;vB zj|d$mnR|tAL6}Oh6?|aJN%fmEco2S1+@Vmzc&FsUU56#P;)uAVXOrg)xG!Hz0<9%# z1DR9*Qv>yUv$r%RcH(@D{P5F6B)hgUKG_yWaJH=0*GN8L`i?A(dQf>xs~hK?VX7rm zb2dCT%U(rIgU=}j1-aApl$&&Q};GsG*T^PfVr9m#Y%aIgh! zbcqG_nHBL<&P6dyf`ys+LV+>4(tG52L^QE`P~CtsIyp0r_WaC>d0N~qA9 z|LMfo3fyMFDKpE*U~d#ss%I8A+6*34$;AG>B1Zy9ug*Wsgd?6R9Jex(ZOp_|Ny)&m z&gW_ob8LMjhX#p~mhS9V)>4g5%VLu^#qw!t``VJ~W-we$*eDsn`BBl)rReEoy9v$u z#Yv6UWqHlTF%{5(!MWf2Ro|D?7sxnBEf_@1cJ))Cr~c3gBpA8uHQWGuZsGLby8U`9 zV&XQuD~j#abT4wmkTy8zZ#*-Wr4;o@ES#A#BvBmbctX@2$d$ByQncaltbuHgQ9nnH zNoL{i5wgLYDCc1N9CxY*Ui5kAx{sHszS z?aU@$){ZvB@D=xkH8dvOJCPHg4dQ}7k!E|{w8=1{XCunWFxnV6Q@hJu38F&h)Ocmt zL4~f^a2essKa(PI&No$Rchv*!TK4QO_J(Zx$-iuSI>;L_z}KIc$bFV@>4b)DY2H9(W}0I96MrmPpu#03O(~KS_IlXoNK6B>v#*lac2Rg zsXjKn9AjBvr@Grr8h84T5lK#X;d&*|bwq7Gy1W*xDqo4;?>6i}-aDQcs3~k+BXSY< zGa@xmWmD0HQEBm3t?%)186ilIrgJkpt!vuji_WV~XK{$_Uox?i_&d$CR<+$779;$w zK9D1cNE2|4E`vwDD@Yl?HMn*Cc3UUEwZh+pldHx#4D(?!8wv?YB`MVxC5Jk88Cm1` zG&S29YaciJSXtQJnwWc4NxfLX-z=_d`F-ue;dJkEl~vLwmFJB4MDlR(g487Ah3CDm zKc<~{;f>8oHsJ4I++*&+hGkQbO!@p^(BFEsu7en_FH)t}+?2kTWj7T?3y+&D-&JUA zh-EP8_J71Q-`I{M~U>+L0119lTh#mWSQ? zvf&H67o@MATWOvoqsaDUrLT>XlRjJ%<%4k^uT<(b_R*lU+a^GREZ78qkZ;{MD08MX{2-d}iVV3yF|V(%NYi5OOW+UqP!(|aQtXi04CL-ypKj9w5W5X;dHxb%>+0%fREqK34Zl_UHUwgCd6wTT~1+Sm6CO~j-~!8)(;Bjn<;75(B13ns@^bvrk6h+ z(o_{ka(q}@q1~vbqr5$JEWd`9;@~y8C5x5Qil7h3OT7attl(SgU8CiW(f&&yFT-F9 zWkuO#C6RB2msyImJ0po(O{RFY3N`R;#k5|ETx!(H3Wcot3I>BqW1_5Si;*K#M3>i> zLkBf9(g>1NdW?vjhjCRnAme9F3#xf~ZR|X{_9TT{a`*NOgFX=)8h-Z)wwR31)|q>= z=F+=PNij53Ll39@7#u+w4yNaC+}7a-4T_}hW|wq*y7R+Z#(0dL_{P%J2@>^*f`J)Q zDah2c!0|6E-A+D;iChzo9lP7`>eQXAf@K>cYK^|*yD<4dD&7>v5~y@WzL9rStGe1| zfRTl1h}Q7Im#`20JBWOEbt8*nF~6*Sw8HOt^^J`h+h`WLCbupIz~&cWPSV?{eVq7B z6{!XLX!j8kZG^UlZbzw(8kHL1TVWRiY21GuyjhGLNT5wWk9a0;tZh#-!LPiKL`si* zo4;DE0{0{y3Iz$r8~?sdY5^3OUA`lB>0M+EVoV+sNlln=gz;s|-x=e090y$nI`uXtRv zSFeAEEY-ay_sU1Kq&!LIpO1fBE z$cT~}V0D&^`0~z?`%l zIUhQL38Z0fOK}uVF-+~7h6(ICH5c#Hp-*%}MZp3d=d7zpP4JyKyZNJW7+SCR;rm#P z@+HdwR{NKTe34|1_f5-Q#73VTqRV3*Vy{jGV4Sw07p`e;I%>x?G2`7o5`I=_$O~4r zMXtS}@re)js$}tq5OU0e*#8R7vSW>TaP2veOc>*DoALU!Ne&oKO%^ z&U%D;UiDk-h&$~IeO60%0VP;#FiGYSmpgTNk;9@KqR!l&=8I{Csj>u3c}GG(X`=n+ zF{gs87d#hgp`1NLbLz9$NoO?RhnD$TL#w%Z*S)6O#mfAm348LsQuFb zF5}SlCK2u$`AVB`J|a79u_pXQ}SkfOOJpcuMp)9 zX}~)%kSydZ`=S!aV)G1L0EOtGkJt7;)@y^2}odCdaGVl-JU@tVJdacnva1i)Aeo2 zEK_MSLgG{txg}`}-zSJxBk$ZyXj&`O773NfLJb*ruq6=WhkHznjx@$O`?F%scamy~ z-?RjjNKvLr-d=E|aOaJ1!fQ|kY3lVJ_FOL*#+P@;aY8P|v2B##{5Ul))Ef{>^4MgV zIM7*=Z7+lnkKO9^P7V;5tPS~7h`Aq?HGVCoLpjH}e8=*t;1SUM`AHf`a?)K1a*F*$ zXrig7+Vr33bCjxcz?yJPp=3l_hviK5Jg0zQJJ)>}zMz zz7)@-PLlBX_uK1WL$SP#iB$xWv3<>pw)ia;DH%A&@7R$^;rJWc5xjFgKonZBX2(bt zM4@7SSLc@KP-hoC3o^o2R-jO|R6ExmsXfa2ggI#x55p&+32}<`1Nc77L!DkTDb#6? zTdJuIz5o0Vnoo<9)K^v5*-d3=`35%M&LRxR9KmiNS&r98@0JE=NXlciCSA zk01!dn`%Ug9lrEEW80%5+HAaurEm14P*ThmMLZ^ao6BbJpoB>OD!fMtcQon?tDbb| z@I^?}7>OgeqNqF17WZ*nzpnzeLfGu~yHP)dYKhv>S~#V5T;E<*J|?N-r5;%xar9>_ z2SadaSfp4c>IZ8h*YoC?`zxlaVIPZn($5mfsLHrfgOi#ixtp!~;WrUa%=hTdK~Siq z(`c<^R4LOVTcb}gU`2r#9>-r(6%F!jkwwFxF5RKjbj}4kpdXT(HBu3zG-DP+~ zz-LL7IQ-5~$i*3!+e^vkX6coy7;56cr%6=Y*s7>rsF4_6HF)MaN)Hn|f_BEUzz*S; zge>5#1O*ab7Ynt%D2W@!xIzfCiH_pzofgNU^R1``$~S)^Y6JK7;L|u}Pd~7SlFMaL zD4k>2JKOGOO|p&GpuoVcZigj12Md2;LBQZ7@@L#JPfH5!k@8$H$dEj*JK2dX(y=SQ%S=w~|N| z^7~r8qIK?%tS2>3>lGa=_!!QG1`bw_nlCI%!a~J*`Myka##uQepHc8?pr{0nHjr_64(Jm%FR1dd`Wb_W& zhZvC>K86q9KEgfrG@>hiX4KLmEx%fD)@f5~s)1VB;LoL`|nI7k5ebupqvaspBX z;SQxGg;*F`y6CkSoF~ue;NdAJ%-Rkl)D8!GIgbkskO9 z4*r+AS8|VMav!ci#tRS5o660*UN!YI2V-O>DKT=GO+O_}ejP2)Y=%ip+vgGHkW~Ld z9;$KCB&U)(PPCWA9cb>ZZ5T246ACp`-lRnsxD!q~VrL0OTBu-nk24n-<*=ixFPFOM|FlBZ9K-(`@`)D~= z+)KF=J6NpwmyF?hTs0gw_CetSrKS`f?kq=}#NrfRV#7sgiB8pnc8A$T_~0{DW-l2F zEF6eWDFl!iPu^5-6rVPR7QaoVQasqO8@4i6N;mcQqI0|n;GK6QjYp@^D!(ukG8CX(Jf zBJ(a$NT8eQ3Gm^4*mV>lT3A#lR*po0Ol^nC#Byy&xGvF|KA>##uyGJ2jI!wlEtrIT z?BpR4xk8MGES0ha&x@3r{GNkznB*-BBAt)4fpNX0@?T+I`%?n(Sx*f$4aYBBs(FLl zFSw^#kTV-n*#^ip34IAByyry=sp?40p*+&<$@wS>W%zS2-{jK19$Sa+Ws#J{&-|gw zQL7H$Qb|}H&yuHOBO+k1fsgCXTWw5@&G=z>OP8!sY(X1AbuHBTDpu*Pmq3yNT+f}x z2#0lHSb}^*tk;{g8!Dd|W7C1_O{oM{`LmS5%5;r9VLdJ1#DZd36O86PEqfN>PeO9X zzcABNJtylQMnn;0Ght*if+-3=b4|aK)L>lOSD4jJqF`{61QapHZzWQ94^bwzSpIX+ zw9>yk@4^3uX#H$23~bN~UB&{5)A;RZOa3(1bt9F|5xp{yvP2=%ij2U%O|_O=a+$vq zSHWAN-IB)nL7kv@VV=Kh0^(p)Wo=Tc7%}rWZ#U5XsxsFCM@oVMP5no!ynv$#rnn%T zz#WBXKKQJ>07E_EtoV=AJ3<8yvZD2*hL{^JDY|tYGfx0BgP(!eV;$#k8&xvpDdCTo z;yI0wfQ8AKW3rXz46N2)BG%srByq*W;@wpR#R&;;f4j%x6PuITXtx)M@j5?K$z(C# zz-g7T5VcZj1>mD(F^TuhQ<2YZ1f=4MWG%bWZdF1ACNM7F?uDoHl$GH9toAw`Yn1<* z`IKCZf^p}*7;|NS8#FHOpPMg?Qlc?o-mcMrNP`byL>aGPQ72Mm4Ywhc^Gi35-EBJ| zIIomEjH;`uc$igQlhSlJQ*RZE_ds*c#hYaZWwM*mGR+I+G=e>oQOo_^FNmN=Nld>J zb2kAmsY5)j;%M2?lhzUL^mdddNcDcEh$d$j1p|terF$ACDRObukmk@{UGqet#z0?!1U_U&5&+oIY9{I5GwB9pDq_1^}>mon$uUjBKEjCe~U3Kv-09XhLw=)TT=0#~OC zS#_%0$)`kY#j)7-N1<4^D`8tWnX<&CA&J;O*zZwvoGj)b0ZlU4eXOAd+)ka$I{~1x z6tpoqRb-^=GGXddBfDwGNrWYT<}S_1NYjt2@f#SI!{s}ptOlMuQT4Ioe931Qce|Bu zs}P8o7fR58Qdd)jEjOm!NogKX2*g)WY6X|!sPlG+)+kRo9`uN;P!lkDd8iW8d&Z_v z(gI~15zxQGg>pDXXd*51`+8@Fn<%7X@lzdv*PFard_kU;1nA4L_t`GxF@}`{ku#5M z#!5Kt3kJEfKRoJsJ%&nbfYWZ-#BH_(vh80|7N2J*1Rs(?#tqJ(@b#3aD%xvs{t->d z>SxL_e%SXzIyoZi)sf|9S9xG_=EJwS(VV)lz78AbIfR( z)!JbY&z9^Fx(GF+t&!*3KUGvCzTX8`jTxg1qx0If5nP2YhPi2FBy>pMCbSfN?Mfs? zuiU@BQ%f$)z?W^n7mvLZ`Ybg=3y|0Axg_%O+FGkSh^KV=I6J}8Ew4} zKf4aXA*_dtWRbPfxmLI^7m4GB8@thYALAzmgZZ!Bs0Hajl7xT88>}-xYo8$>H-5Os z6dz@0&c$it;PiUEicfdY0G$-ME*`c%G(uqQ?&IEL?CJ|zdrMl7EA7tjarwh!F^SoJ z>mkRj8U*hNDRGMVZ6L*N3vPg>{eBY zUOSUr3v9exRp;lUGlMldvw@x2?sc1K9WI9W;_FSV=}`>-w{@_ioDm|ZiOQt9Fkl11 zSM=^*`u<;fk`C^~fmYv*y@y8Le_hl6)jR$F|3v4B0ei$&QC9CV^?gvJw$^pXE@iCT zcp3H9YTp}%)rSB14g7z4ZBYw--n3OCM=|JmuAf!P@@{*9)n7}UV`gIodeZ-Twg1!0 zG+XBoMvIpBrZ~#6Z$AQ7W~2E8wc~2+;uX^J_C#J@XFQnq(nmuxb;>y-W{RBeaq0^k z0{jF;&6oTp4%lskf;b)$N(~WmCM1nGkWI!mD^)2-;-c02nF%| zR^Mu1OWUW`-r-qRCuQ^kmN<9aU{ z@4bG|`>~{y^uVV_;I3-**RalgHw%qw)Yn}XS|4}#p1<&Z^wCYw4qQCMh!H=HGpZox z|DU}855BkkR$%{+8@W!dO$-*#n}0gB3o1E4Gp&9`}7&;>mSt4Y`l%T+RY`TJx1An_#v*|Xfn?(TrGV( z7sgAFy!!Axc%KRaK@@Jy<%D~-pSJP5ZuL2H#f*?R5a?`zL5yq{?j#}zN_GXOh(OIsB z^&1+Z;Qc`N?N6j)t363*XIfc4A&DDvD9Uy0@__xHg7o(ft;qFRXurla2ciU2R8tys zS>p9e@Y>dm7aPvj$m^Xf-d^VgtF1HJ_6%GuSElc2=`_f2;E#`@yno-n1TVg?!UBT= z_P+b^^o3b^hijt~Us}=qqIF|Pl-CXMMeB(+ zu=B&f%Ih;2u=xU*acZBetYF~SZ?BHOV{F<{3#D1r>%m-6qff4!ai9L4+;vpklOYXp zeE6_@r%xOE{0$TU{rDChE@CpPXDLex_UyHL6Sz=s^Z$sCkuD%UE{jbY zfNdccJccP#)z!L^7Y=kCZDB9?kI-O(Jw&g2A20tOdvEy_*RyR6LkJK&XmFC??ry<@ zdvFNu4#6P_1eZYL?(Xg+SmW;Q?$UVovw!!!_kGTXbM9a8eCRQHY^_>VHP@_FbJwCR z*KILTdiC}F+l_~J9v9nLLR?SFb9#=2jg7%wkH=jZihI{yek-?Ug7n5YeS6T)wDIw%u=--=G{7D+Pxu2)I{4)@c1(jy3X^mj>T>%EC3(dyN zh1Qb~sl9((Kgij$j#S(NL00n{O4wVlN876{_5*|tb?Ji7wx=}&u#nk^{-d! zuuX)nME97;$T28vV&YBZF-};JGUMGvUd;HzU2TCpAw^oUA{603z$M-%;K^@BLkl-8 zUG}}3K~ngY3y;0Nvbs7#qWP$7x?X#8l;lVLwCB50EfCVU6cx^1@NuQ_dFRUxT2t#!|%mzr8>If2`g@RxA|DuLrqUzk%@;}&*8iiLGq(< zqv+kM*w4v6YcLTJ=p+{VV?kFMNP+%8BJNM87B6*LxK~0CH+5XxJQH(EE=%4`%BTebMTrh#fbZ$K%opRyjPHdU-3P2eow;2(c;(6q}Q%$si}%@u9Om{9XS;4 zlA1kJ%gIKuA`+qATgCy2>T1jw!3k)o3uBwLyh~Z9tRgV4q z$4A+)rSsc`qu_0|YIht3~G0hgrHm#@pTD&Vp;I1Ua^5iD>} z7>9-}_fe2uG1=Hsg9h0E+m_V)5PjMf-Eo@7#w0uHbtbKBH^JqeofW#AVYS$QItV%M zb0&pQ@JX6VAUOIAf5G%1&v{x8t^LrNPl=}W+$COzF_?Yb!7 z?-G8v>F%KDr=e{*pAM+rxY2sKfFeI^zF=k6C;-kLaH^R#BWr%h3vK6rh=zrrJDmUg zEGt0&nMroUZ#e}{DUkGYbLE0 zzl5ylSR5Z@3^l}kaV@=Og$9jM{W@*ngN=(iKYp4i9@{i%0Oah3Kl(=L;)(aXv|IM~ zYxtkiFSHKXKq*Ra=a|0A;1s8b;#NnLc=&r@55c;(@%J)Y>~ogj8!wtH!IT0r z2E(>3kUM&&$O(sl+2!e-fQDeWRZ_h23HiEjV@$fGOUGMhXX>r6FvGL}fBGFck=$|4 zhdDF^eSH)!5Bl7N6$9SQ5&twxOrOY`G~cHV%-zN2$VxWqWHR!P z)lPqfYAIQq>8%-!A_}2`|gI3`aCfM4_-^g;^W=CT)~d{ z<x9#E0I;B-jy&lH? zB|T)RY<9M@|HT$XS;UA9nBQwQH7`^1|~F#SX*8Wf@oUsYmEju6(M0)ko36_i0*7Uq7nZ za6y>g7ko6TD#Uy0>|8Lf?B^mt32ZWFTUWKdlK|3zU}!(zw~mjRjNBNetp~KmFg}EO zBz=@dW-?5^EOy7QGPo`A@EG@vwf)PRbqsDqXQ>Ec(!@37PcesO_r+A&vz+%OgrYIF zvGOwb`VOh#x>d%u(2DAS z5zzEbT>4SLkS}%fz(3?#9y^2X;}Z2QZW}m4hBNX;CwtR^eGf3b5mXWrgUh#1Wo!p% ztXOvp0>;5gc^5_W<#sowo*ZGG{iKhnV!ewfN$)I#Dl7VY&()G#L%auFr`f3ZS^Re) zkTfB|(@VXg+K~hkwE}*Rfn9k^7q}I=UQ!bG`;|ea<59PHW;iz{LW<`}J z@m8H~G~m(OwcPO4dN?OK+p8Pn7-h22uW3l{v~rOXJ-!JSo}Fi2g~c>&t{~qTr9#NR z`8o*n+=!5j^LRSa@60s!yc;)bskS?9eR6c0*aFE2Wu{zt22gi5;AEyjJga$d~2*XvZ$g@JUQTMMTvB~5=xlC~7|8h)k8x?=X;w|pPXL+?! z4?c8GVw(hd_iBmWV`a?&@Sejv^u1w}D1({A(3qq;T$ZfV%jR14Jy8Lh>!CsPhDwu0 zJay|LU#LbRV2;Bwg9bRK_ogmqDy+OVX-HaEz~(Srh7vm0V?Efs5$t8NL8f__M*KiZ znSQgHI494l(RWJw1M|3R$!NtMqDOW9ZjI&_Dcd?6(PF%6T+eGg?`0r&1e0q9hT}Zi zC8wP$#rLzeC-MoYLg^M9YE6DaMo^Gwl}xf9mVaxbBh zYF(ndQcom7(p>*k!m~ML?lytc7$BFUS65QeD%pwR{bW0>sW{~PBN=N<48aS7zv57M zc^(urQL?OE86qcgFQkE8K&4zx`H}1^qIvq9kj2c>f}V1zpTH$fj!m{(^RB-5sd_XT zVZVW69}zoXkM{YPdKm;%KR-G6QWZdJ&+$&5T1CVr zkICBRZN1`=EsN@e;Jl>F%rpn@i+p)Mn%mS|<5Ps&Cw?6;xBkb5TjB*Vc!>f+>o>vT z^8!|gQA%MZ`3I#m#8SWUWHr*X?M0M?>FX?5ux%*(o_uAk3pELP^r)JuX(d;Z)l89X{~9N!*ZC_RXL<054EYrpbI zk7(sl5&iZ)@ET8wn|V^xeoKjDTvI~>dA$ASkET)?#4z+rRw0ES2!7L6oz>jT%q0ZeZ=ie7HVh2R33iJU)c0G zKgSnhLzX@ls2@3pvv(?LsgXi0SkR`H#2#6JPO~2%N4gK=Rd78_3d9=pX zvkcP+^fXP=ZIC0}lMN*34FAVS2bjc)zvD{~lT@*+Nb)LW+idbaBKRp&{5}fuZT=)& zB2J>C2xw!Y@)RY<7^+^iYdXKO`)Da035H`(;+qY$ZGX7=GMQqfNf>E_qhMB;kPyfS`sCVd~ z!Cx%0^joMz$d@e@t}loAqcRM7`L7tP!}Nb|G-c!ebgbMY(mU4WG2QrLaMcuiP##Q} zz+yd|&#|t|Igz|eBR!wr5m&HMb9rRnt}BUW$CCA<8a`!di8~`=5Q^SEhjL`ZA`Qj8abBSKn^@D|pWNyC81wy8^5K`%2=^3)mh^a8 ze7KV8;z{N*yk`tf^K87`!m=o`!y~e=E^PFkv?ZkYfbsn6zEzQ$-oz9hDQR92 zr+|L_x}i*V6>FX!X_QR&z%4nH>^mF8|%-{T^U zMz-EPFR@fTH1V*QK<<2us_68cQ`tLqjnG)6j{b8Ay8be4YspzTudZxa;xbR*Yeyfa z;l{%KiBU^>K5~l|2_sG#dhhs04#{6|ZA;WFX2o}wR*mZIOLH~n87{98Be|$F2qHEW z1zHtxikN25g-C|TKJx}ww_<01A=wBi4@T3No@(6AKkXGL%+%1vct@{;qSTXp28hAs zYoo=+8t6x{Che#W;;`(cy!=B8t0oF~Eb`Cw5!$Bu%R)RWDopK*(va0m{u#FgDH@YG zd)|w12m+VLuvSU29$rsJMfEmYmk+byeHTcx;!(v7kxDods7-Z2u`~-}tZNej=FEn< zOLkLHB>ATdR*L1iZn@My>m|!6hh*W~>?bRN^G}Ia>qnzDrZSvsvhb{fdC5J6K?C{f z4PiYMw~VtWqxECD^^)1&CYc7mxrJ+WgkA;Lv;k?gwMx!*-4pZp;~bE2=@v8WO-q*r z030$+dQY)6x0HX(&v32tY$M*dXQwvRUdf|w*eoKPN*%DmTBr-E! z#FQ%@UAv2eFP^R&GA0M}8$ORs1XGxxk_Rd zxgbjLq18qIy}D0~RZd_u0y>_OO0TV959S;j&ttfY8r5{CLVYH!{g(+@h0<5gZe)}k zDhxRw9(pOStzg-DiyTmBtn{tCMvoE8%@kSmMh*~-7fl{ zR%yXadJCss9J?OQsk~yM{yJ(I)dZ|+%JNJNuO3%}^40sfS@Y06h9NJHflsU~NX-qH z6}ulw?bF_`w^2;rlJ2&k`yWE@vQpftb2);etqU;ZoHM`Z(yF8~9wQ}|=yWk=sul1E zC>`8m%5O3mtV)TRv9V z_oMyoQ6&Dn({df1C7DDKAyU02NKGx9j6QevIDeEwAWCKFy~0p^9wLBgQtFDRK^df~ z9|EIWH9#DN#&<{WQor!Y|HhlV8@#i@ur^aMaK9wc1DoZXMg71RQj2~tRV+$Y1t~lj zFa?k?miiU1x@$ztdd5cEuxlyP=YNo`4u0o&I;>CfSYyqEiNhGh@SvTL43UFnJg4h~ zq0H5m?|MJ*SCOyqQdmmif0|Y((+-wptib(+bY9*ag-Wr5V)9GZ{`Xw#%!_;ELQz`d zq^|2a)s^sHu4AwPa6%xefn+{-56 z*xr{$P0<_c7LkqOmOgsdQ`bN$70-XKXP#CN=7VI zPMO%Pqs#T@)%eY2?i2Xe>MT^z)W0FKpf$OGjIK6{1;Ied+&I*NS1REZ09)yCIk{2% z1;0Pjm&%4MdraCP*7Imr23!QFjlxYgzDY%DSokzkl@m%&GaTd}5kBXMk9f%} z)WIz}nUqS^M|$i^lq();a5`ZydddXO!7VS2jt!O^%1tc?9U1zy>HLs3Q)w<@;eMrE zEVfr!N@kTMr}o}{ol5^8o3bmZ2wl!IYM5_T6MZ&fJ=HoR;S57y_j|MICop`Nnj7e& z95o%;q$aXeJT;Cu$a-MoA6Giz4pVBk zGEh;!`10a#t+q(A`08%dVE<>tDcvHCX`JCrQhyVthS!Q+-}jBE8;~3A9Eo`xC$Ko%sOlW#Ey~p}RMo*J{_X7XwGllwpAkkuQRQ)lphSJn{;=JBhxRWTy%cSYZ7 zfU6_D9a>(}H4I?xEx%vxZ&l*O68JEOeZ9#IJh;`V#cE^8&bv&xJvOULdq+t`0)&$< zBEPA4bc``2?Qcq6<71g_PQyIW(Nf)VL3}G7+YF$*$)|0wtrb^|jj2u6?AK`B0;L6} zI7#NzlD}$v%V||KeadszQgvAEiJSfU$x4k7`LJj@u#{a7jc>2-v?Fm~RHm_=$)i@Q zDJ$qD;Y|6>hsNd%bC^5xYu`UmtVm@weIO|Vsn&nz9)kw_8HtL`fRe%eESUeiEKhMs zg!0d(@Be+eH{|YD)$?Bj7<9GkUNoQp3^D??zdZ4$xJttLzgPXszlErO=IkbS9Faz_ ztuEx}){VoFI;LRWgCi_|kH_Dx@y`2fDD3C%Fibdm-JQ(mkB}df^__Rh0d`eBgK-3C!rr;fK&lG-Xzn;dR*0Ym zr|WOAO$nW51bc#ff)bpffo;7|Irj4J<4XinL_Z6!9kfep zH4rrw4QT7E9X>~@+!-2d66rv6Ooe7))@!HFkqw^C!!1GyeqpNn?*rx^^;%0Yg7<>% zs7%j3?ESPY?~BN{ds(!y+@gXUa0yu9>c2erRhh!xL6hrSk=_|ujtC8eoq+TYo_~th zzEtr>Yv-kp2}x)*rJz+#sA$45k!GF;{9?P+8J#JK!Pp`bkw2z|VRfBCk!RvK&fNkU zclGVE_>)9;4sPk2yC4wWIH|y1sG4SV+u?QrTL_wL67ma@IS6Tp?Kj&_(2FQYsb|> zfebvIcYmQu86Tf!Kno$X=RPADsN-GA?bXqo4_uq0yt3gD2Tp)?t9_gGD+CLq@g*#Y z^xRpXE_&*ssfql8r7dKFSfAA2a-5RfnzOFM!tpB6l>$-ECUB_I(Tg7Gc6iQM^wDX^o9dLys=pdQs93AElt26SmFS^Lk?cN=H zXJsBNYlFD;aGlo3UPD-OcP%IZL#g6{rso>`O%alS^hR(&ApyI)Bl`enWc1o~8 zb?o@r)>i1)KU^5%zh(vd{Y7{C*vEoz!#A$)^Y!#zS3#KoArk4`7`si3`sc(Htg0ID z(m2q-3#OduqNSJX0a25MgT|P+hu=MTA7hpaS3&8+9Rqr&6sk&dp^H#_ILGF|CVwI1g!f`u5lG}-+`l(pD)Jc{|9`Xq z7^yu9-e5wHLeHXWbsOujo-=kO%gj9+r%ds7-y*UAJX%g=7nY}w0jf{?9rKn#8){Mf z7g(4J=z2(Fn8Qs3w$P%6*S962R5o8w_Q&#;(u|4(O`=?S1%KIk>z{QhdT!y(O~7ln zj(N_0?aufhnyW(=Ht8(sbO;`4E?PW~R99wX`jzDMu_=b=Tu07%Of9nW;KefSI(jp>{oZaRT3Q>k(RNVwb6_(b$H7rW?cRCFj!7l6oow!Rs z7++UWXwRvq6nS#o;RL9?fIUR&urfZnUP=8N7Nw?VmAJYLFaURy*W~dbM^_h(@2NMO zPoZM4R7&K$+~3+}3HL|(5S#zmJxjY^ZfCI&G$P{gvi@V%Rd;O~(3E&(amE%Uv9EF} zu(`5)@b%c1sf+C%<0Jl69mE4Z*)5vmkgOoJ+?AQH%ls+i;pC5&tou;fDFv=8Ou#1Z zp*y=TpxI;J&x3OIxjM&lR_M%E^`RLe z4`E46cCk1pB>R%ZFW1&0UhkW*nYdT*gVmX_xz~kjSPZ^KCPzaY0%ihXSD&g_IX$#T!`Xei?41rbt|e38x28T` zUL`OulerZ>98EOcgN8(NN^Eb^Botb~$u2S1pt*G+7aynP`+g|2{!!UFe-;^HNhEQN z1t!B3K!RYhI%aPKu=K_&b&BuY;xe}@>n=SX#}iQw2MP8>dZga9RjHFv^b7kYnjJEj zCdAXbrrPjq?XB^Q_0kK7!Z@WtA23^FYB17xm#S*k2d`Rw>^nzcvpgWK-36ckM*9;s z0*chmYoB{WgX{s1V!dd;#<{H@S@wt`gD+fFSMZ@tL941JX8yfK z`z;-4_S1~~w^h||4XJ7Rmw3ZCy@-3*E?Hk*=T=k|u3(_EhH(0)cUpG#ZXClRlOVGx zRexEvMDz_@ogQ9dMfZI7lbC|$Lt9|SM(|3OKw}W>y9kO_W_5dz(2Eia^!(X}hCUiW z?~1><4}Y`d9FOIEqNf5zP16Aa>et`&ii#~_iPLNh1nrPAtP zjba<^!^pt=1N4vAQV+TFCc$=Jkn*ix?0jl~6t{tlef!$P-)*-&`dW+V1kWF{p8m@m z0(=0TbM7g`#So;#ROYa=)gjz;-Mep1ar`ZyIDg*qLnkzhUa2!OW*fLpcSmTd8_P2C zdVVxFbzKl16Z1&*Z6DBYZcuKSFM7o2SSzbG;d5F7YL6J7Ms3Bt&mm2Fu zEjwLO7r&~ebUQz}-MT~AM$R!|MZR4RmKt05O6HhX%RN~;spr7R!YU~&h;k87@F^m6 zrO`omtj1e6nIPR1C4Lxx_`7C{_vEWl7S8FDAxiq-^PGLTUoLrVwH-(bTePb(D5H|v z>n^)vK4w2hkuLYsh;z5d8}8Uv7XDFp_tv^j()0L~%&02{*LI!i>@ThvGV*=zk_TZe zWo10NMOlm}4+)nHFxcwmiMBh=)+awR0={=XL{VfB=DA%PVK9m{{lP+JeB&f5_xc+4 zF2n9rP;~Xq{MD*COS=+h1&7_~b_+Bq1*_7P(l-DM2FMi=I@8sqknLq)fbsKnAW|2% z75gLM?o*x3(0K6pR;F((L~mWJVMy-^aBxJBWz(4{)TVUvN5e;i+yZ%YJR4F}8NwmO zi^mI6$PfChf-m>Cv6zPn>*F*ccFry@8s^!bbR)hfgT0Cr?0Y0Nfka5{!4BvdRyNeb@3|D=mj~bY0cQIrPtUrT#}Ya zgk7mf@8ggWxQxfM@SM|an&*0wyH+|xy+uNyh4sSUBT?@!H*slrU{A5|-^CT$7J_!L z9M?Gf)BCQrebMbU)A;Qo9^044%#C`vIq`C(zMSE{TeC2| z4m-Ify^6lbc!as{yPn@)H>+o5U-vhuW)K(0+}Sp)9=h^TZjbO$cXHnoz=l~BuPU&# zvZC!GO)Ue5%lqW9NNrQBAWIi_SbU_uk=kr37hOp>iP2r+8xZVKqxZ~?nY?2Fp0v;s zxnsFddS*3vh9N*+cX8i5J-0syqh3z=?OQiaJ7PKuTyxRTxK85B4Za-5WBB?H?DYA& zf@>VIHz?kj&MCKxJ7xJwckvGlj_{Lm4%<1U^zXp0lqkP7fC^d&_z-afm&AB7#n@;p zP&Y&6qlv%^w&z80@!Fx47mGho`qTFZWZ=);*PA)B{CkB*MJ&sC3r{*n{rBOr20A}I zZTMH7*#E9v``Wt2F~U~hnXu@y1k@Q)% zy5GVHD?`kK_;u6Tb5t^#4{0r&%A1X13h6aOD~GgRGZ#y2=p#_{|KUY#FQyFTJ<@EZBm@Ax zcM@*?{$mi{@~6LT=(X=}{l;pd`e6ROjNRa_mrs5(`~&qKalA{S-G4si`Rs3z=HcPf zRlkS}yMBFbX%}4I!2RIIdw!u~!W+@~%px)OhI5VH0r6c^hd-PSB^a1z1(!l9d-}a; za03I!xX8++dg`oVBDa)8!!9Gbz77e>_Kz5UQ#`{M#9#toS;;JOCme=(*Z)(eF3b6j z4XzcT(3lOx{|kT7Xm z;jyy2ilhnKL<0mvLN>2&A$9Vz+Si>~99?(s{nz2^mvluH6lyI*E$hZFBOqm88>)lB z&M`pl104lV87nGHn?O{*8|&v`QA+!GLRVU%IxN?DM@RR|UT5W=dt}+t81xejz_9N5x!GjqEetP}EBE?fLy=T=gg>rfweZ zV7EKnm4u+IokHUUI;bjVyRaM|NiSCr<8;YiNZ1aX`=fL7SdeqgGs)+ko#wJ{KFkN8 zFZ4t%hJ?C(A6(X3x^7EBLSLlGqNpdV_sIO(j86w`wRlS^B&N@b{m!Ckh8(caII-oD z@LznGy^S?%fAW=5oN^)^*cL0rVa+Z^cxdoMJ((MXhtM1_GUG_xF7(y;5BI={lhi=e zr_1qh`omt?_#|504$QQ9PY}0jBtOWk!ruWY>GPe^*K^TVyEs0nJfJhx=MPfOT#}AV z0)&kajx#0WUIN=mP0Ut)CG+T5PCteo@`@5l$-48e4SqB$lzhRi`<~Nh&MW$9SB8OR zE~0RtHB|=LSn>8+`e@ft^t>@K%oa{8=HpPm@9K+1-bpUo9E*WZ7noVpy3KE9xf|V{ z+rH28Msf;Bt^e+EMq@DO(|_q*iiELA+!{d#-j%nYpq%pm^`rfAjpnn&(RpkxV zRtyDgNfkCr=Xl|UO;5bXPlrcf1;`46b5rU3-m^)uRVK_m(*8N}9$o@>`ukzzgSOwD zcfQUW69u2|C%9cw+p__f%QMHSmOsU8u^Q^%u$x7T$k%p$hS_unxds*LK@*4N5aR`6x`1v5eg%W2Ef#zy~q9THyi`FaSC#;k~5*uJetQ<5GC7NcJ%h@h!1 zH)KkGt>7GG33;~Zukkrm1-xDdavAyG->YdSq}6}cjD1Ak(N*@+@ZKqAKi=g4$)uUr zzH<+>tVBF(o*I2Z&5Sj!^7~G7j@M-sfE5rTMZp&0`SIyhfIXCvd_f1H5uWu{k93&W zQ;^AsarXJxMcfufCBQ3Q+u6M9l(<<9UIJs^NjkEhY>2E#(B>GLRL62`j(mk6j>ODd z(2P$tbSY+$n~kLOvJ0c5yA<}6jJ(d|mt}`$SGY!@Bd4R#q*Jbt&nbJ+ULs>TU4BdK#M;!huZSUb;5;3^L}PBUMldq&=-f!5F~>c?(3x+R8r zcZu3LtT<|?i=@9ly!D?FjpV=P+I&Wg92!bA9y=w~W2`Ve&}#-+hBkcEQeMo8=1Ac0 z>&y9lvEi&S(a(u`(gY@QEdqiQ>{fjvp2&O$zCT->!WyiI4opZ`e9HX*M%s*T7@W_$ zC$J2ACAz-+ekn`B{Rlk~LCHL8?{=(gMtxgp#DRq8Dcu>V!Wf9&&ibG={pGTQXg{Sm zMc*v#y1g!m4(5zrYn*1CqW)lc>IXx6DJ2b76TMh}2?Vjmzu{&YJ0PX-`F4O*aJ5W2g z5AXx3`x_EXqp3)oMB6nceZ#BqY8zLel{agXmKUtg-%sZ}qXdxo4-{M}I~At6or=bC zbTuCd1V%iIbfPWTV}#^4v}8w`2I96A&Op!YFMdbCgZ`G^l_44Lzv!%;cEXcHd_eCp ziD`(DIKwU-6c9Lys>>Yc_@n2eUfN@iv)so`@J|nQtT$={;da_*arb9;gh-0MSCwR) z=QvZyK_5MZujQ_%@HwGd@@@|?6IB{nfrA_v*FHg<>8klQ#QiO8JsJlHb);7V6`Ez5 z_qrsuu{&K)BqE1NDR)ic%$))DzC*NBh_P(O2>?~SW}t7o0%HXTz;mynV%k~7mLnPW zKm0+~3o-S7oQJr*ye7<+mLG(#zvx{)ZWsc*ft?jfI9uZCNWSq+BygeEY1nfbh+7aZ zwhk5x3|K*4JDweGkqiAO#djVRcsg+|?7-E(TBkCVRPYAkkm+X3azI2Io*BI-dz^Hq zWJUhjY8$>^#P++ynrh?>Wx7}9H&>$hbW>PA)LT@$#>>u~;R!Dx-*Y|r@1H+Bq)KMv zw~cKtEyi*DP$6sXuI*!pDjO3C(Am#&rb)-kC#twalV2`sHtT&zrQ}2mw~R|J{iip2 zO&d?eN=u4}DFF-Mn7j=`~1s;glU^kR2#@F zQshhSB64&WIzr(Jm{IRgDLSt!Y5N{3x(X4i5}p(3U!@;e`1y7vrRcMb?M(%Bq@FsJ zfQjETpE0AnzFr@@9{}-OaSXPDIDQr*#+F#hX5(sd;)!|GDw+yC7JtYsmIV83+P@Ee z&kjH{Suz22fm;TW)M-{$vl7g1QtevlgU8gkd9Vb^0Z=@=(PZ7Q>2H*^!{9? z^U9&C+jtl1DiRjig@14P-r4P5V-uLI=9Q^Nh;?L*$Rj~ng8 zNPbA|Twk59*fS>_3sTQ#)aCNj^HV;=b+05cI*_nDr6Z8KM90N=i#knPVz>xgqn~^D z4|>OSoePbA6Qb-%<_t4mD5u6UOn1mhfuM_Ue|7c_*|jZ56{V-;lEDI5+}Wj#AMF02 z9ZdDx$wOT7!Q4>Fq{(O-So4fTh%~{3s1vFeD+ONcEuSogw$;NAV=H(Cj(CD=B6+l) zBR)BH`Y(_hkEG;h%(1f#q`r`Ti&0vJKM*y@{Qs8}{$;V? zZ5N8jv!L|99{HbzhkuS;o@X%qSF){tcfq$0I&gIV?gVh`MEoBj{x8cI|8m{Jum2Vm z{>!odDO(NN;Hyvn{lT6`w7=)h~D^w%3tP0$@j>Z~)oK?SF`=dq1b(k@( zj^{ebty>yhwWI<1XfP`l>+2*!<>nT#YlUgl&?KB>qt?yMNhKhi2XEz&E-%ycb-Rw) z{FF(|563kkV=+&GAb zvGaO6f(*W5Wo63@LwBQv+}6=yiGF#RhDG@&s2Uzg)5|9<(LV zgltumx~OvLeS8`1&#fZqtWfKnQLiISyP0SYC^h8e&9+Frw?*st$Lgo)dvJGEZVblc6lS$S0NHnM^vQNE^!m&=a(-XxD|MW2OOY2e-Iqvl=18$X} z1fvwYVL@8so2788T>$~BKskD_RQ()@eWeAxU2L(}7t@Z(%G;#{XMW?($v%P00|!6tDtL3sJFTvBPHE;}FQ9_sx$ugLwp)4MEhK)8Qf%HL8usNmXSVzFO?0 z#La?C>vH)h3DOm;aXdrIiOe(byZNC-#F7kD6-j@p*`H}DX0i#ZCQhv4bkS9-gl$TY z00ARsNn9>np9tC>$FZzr*&)t5wW;i=oR4zOGy_dQV)tMZFQaWU9!3cJu!BPM-%86&-dSK zA)c>x)s;}`@WR%X&ez9n^7+>_ceXRME_PL>86;utI6%#)mzCrV(a)LoHcJ`GPi|FLoJZI`02EwAeQe1I^17_G(j&J=lvaU)Ka@_d zd~0)9+O%*Ab27EVKMs{vGv+pYzb5A6WEoJ}H+u20W{J3Jjd2#~+yg&h=Fg%r?p~Rp z5yfl-mA_?PrhTECI6%!}lHRx>&sKoSS$}^!st3h?ESq`xT|Y^~G9pgAUB&`5w<(|1 z`w4oqNrB3oV%h@1r)YOcZhGUj+m&x3y6^3H!!k70hQH5PG8^W*%-=cK7Ga=Ha%9xs z{^9G$(=7*=bqIYwvoGqwly%4G;reE8g!DaJcG*3*fU#Z>8;zSz*mL0Wvsy+i;aYy| zoa*By0!X^FWsmW(Ji}<%p;RY=tJHup-3O~dQ6Eu7nGukqfgnf5xu#c_A7{<{z~I+s zk$yN^Ar5T>)@u%oNGBP0v+_Sa*6M1P&8x1`c&9y8fEww*h^?(rbL_H!sJ2Flkf(&2 zz_9+)3*V=uIVLSi6QMcoc ztOC}o^@J=XLHF@%>*|ssMcEQs{Ezq(lgs;U*&bc>8d0}Y%8d3mA~VDpy}&9&+)`fD zgJ@!0P=3ASypd%e`b8rTNHABmUL!-T9k9Icl!si+o;Yto;@LB_?TAs%Inw6#N?l+{ zgr=mxoWnWsbBL9(LyvJ@W}SlvBOT`JaJKE7zzY?%}f%CO7vhH6#XFNcV+ z(%KRd7!zm5g_@nxSl0zrE$wWtHFIirNUO2tYRG+k4dB#;CP@K)L;kRNKv=@SDp%!{ z?xY+r#nu!k{bl^ATmlG69B?JdPO+3~ewf5=>XvuYu$5P8$ovu-CQ{6MH>q&U7u

Lq09WXP~@0d%1nas*~zpn^tY~lcXualmhi|`2w{@JHU2G^ zVkknpdF9-T)0^iZ}w4SH;uFjc^K17?GLZ|7mo*kx2 zaDnynytBsL<6X`I6^oJsTh%@zRI^m*7#iv`3p$LrE+&T+Zb=qeqtiD9$c3!<3FXE3I3 zl@hM-y(pQJluwd9Zp&kxww7BeiN)L3xjNdhIpG7hXZW^D3e#hYiu+w9SkQ0OEJ3vY z`W8=?DD&O1Fj(E!$CRzBpuc0dXfy; z^5P$?bo5S`6QzL>u!3C)+&)nfL7p8 zqVk6{u2mqKffVSJroZiIo8zDZR`>f zg+K0JY!RsFCh`5vE34xdzgK4!l|H2o88?}CUve5`meS1Q9-s<;LqSp{@crpRJV|?{ z*1Uc3>~dq{uh&7q%G;?Wi}|Ls#U%%4{tD>_ArAS-+r4(YyfHX0QNKR-)Oz0GP{B`1 zM#S=u4{mIjBhYCbv?m?9J<7hac^>_FG8xm$<5tlRLCp!}x^vx#H}7xQy=w{>MgOxrF;8r{lBvG=X6gn4&3gTws>g3&<^JR(ECdWAqo(u^4yUz zHhIlT>B{sk?7eqrzI}fKu>=mUK;+1qUgF!v%$W7e!2;AlO}}Z>5=nXP0G)>w~%%UGbvH z5qsX(nxkDxguJOJ?N7_3hAr7@+uLvO{l!Gb&1;XWRrYalc;vA-`A)U94<{xpHWt>p!XWfd9Ehn${ylvY3?a(?9YG~wztgo|4ur<{v6ssP!h za`kxBpEt2g-Q8mYZZ+F4ox4eRRjy)VvU64UI$?xgWWZad$(Hf)Z>{ZmnOKXQAn0qo zJ6A@@);Oz^lD;qzCt*~}aWy`#qpMSK8maTnw1d4-pmt_PUijs&;4XVY!p^L^L#1nw zkp`jn-~-=GD|xiAKxFpeVJB(nL6JMAvx$rD{nXi6ypDYb@q`3CUA;l!Ymhux#Zal< z3ps4HZ7@qncJsvRd4_2dgXHsy`@0m{tlqTr*9D;ttI}65=RIU3lus+J%{@Iqy!Icx z&V}=#n`!s1C4Kt^wLiO4I6b#P6B~z20XO~r?io$H!aiA3GhWC^G^l$ghP+R}0~0?6wl7g77>rhNd%3fVvz)-KgbFzg_&&NzJMgwsN0!PxFXH{0 zo1+D?u#FJHu-7#mZGi%4BYt<- zzb zb?L22LHGga4*wN6Iww(}P_v8aW0E)#dp$Ji44~bzX;gVXl@bjm)54f~WPspKMTyML! z*({{^;TA2kdd^O|%jt>^&e~FZZd9KSQ_}R0%JR0H0aW=w{P*nOPiX&7fm(A;xVaWI zJf*CD`O!E&-{gZidp?OkJf#dnEBSEIN7Zf&jQbKDRLe0HKKInP7{5pr1G2Rd^tr{l ze>$ZbO`{;fT5n(SG37CYUM*s+A*>5$iE7g}+ z=d8e$R0U8bmod*H8#>0m*>ty+UGrg;_9B{Fyf!oP#uiTMWY2|scYZ|v4BWHSI4A^rzA}y0i?YU9t_uOc&~^QVC1;2s?%anp zr`3IVU1E_AQ)!1L|UJrEF3kq|7Lau_)g4KaY}tCoTb3#M!dE%a9ukY86$1 z<1M+=XHLnxb=ilSp`hm8FT_+~VyWtL_+1y2wbBN?vW4M%{Or*YtMSTL(s5hYcX(fI ze;w+(Z%s`86mml^?tshZH1T%Bv-v7TLLY)g#7GWOxLWr{Uf|y>{0A=zff3Ev=^mI$ zu-(^L);vBiKJKy`B@*u3VkrrDW4c>M{z0OWNTPF+mU5Doo{5!zX;_E=yefsC|)~;qXpsK<4_kA88iN&zYrMRwCv%1^Mlt0?CiNu>1if8NDe!G%uJKfl$)6E zc+0xgD!A=^YgDO?@Nw(#GMBxKk)hL_J*_yqKSzG8ne$;KcN;~_^(5kAP4+ge_!8o2 zBIF+5yu$&r4uxMzG3+5DV_3%Tr^;Lx1=MBx`Wlck%}Pd)K&Y{bfOlwbS{)rGC0(mI z+AMbw|a~>pN%Igl8GKI*jm2!N zBN}*A1^g);EZnwQ-+-kC88d~#=H?`4vfQ^~V~q??dlVB`odBh9x! zpe@2PMfBT9cW_2c?kWPlr0ym`&==qSblqxpH&p>wuROcxt`q(KM!OItkoAw%dgy;7KF_HgYNAtsIZ6;Z6j&a>i%q5N8>4&9V?9xlruwb4YU-KW7NH z0Dd%D9HqD=HvrT9=@}6dDeRDIZY|5nz_w=5Qmg3Uq8K{RQ(l=U(fT;6RkcvtFqNOtjB2U%fI~nidw7?XpXbq9N*s){M~xN zsG(6=z!SBuHmG0U-I#7$>tHegjKX081>41kYn8 zHL1Y))ID>+mAq|rW@yHoZ*lerFb?IA9SiQSoo{@1N21e_4$T%JmK-tS0%Y?j6)?qt ztgeQ=F(YSY3O5Os`O!SL6k#9(qt?^Ak*)uoxyW_C@NM5@O7094@zw#-R?|GZmtm4_ZV#3bhg-fQtTd$zTwjC3Kq^@(%S60&a}s{G zE?M&1dZtlBk^LMB&rdsEE(cfeD-rtAk$MR)z~-h)R^G3bse+A)5cqB=a2AnSTh4Yv zbX?-ihSN+Z(kKvlFtpmS_&rpV-8OY9*}dOd=G`^)CCEXanaGGZTMq%=`cY6!@ILo! zk3bneCfycF2TZfNO|Zt>Yr2Kp!jt6k8A<6aURwcPTN_ecc{H@sx5NtH-4?yAjjZn1 zcP&}&t%e4H;NTLO?=s>d5pYOgG{Q87?Edhf2tGa)7Y?meW;)q!h+{E5Cu_#yxjvw6 zR#ScF=O(2`b&(PPWPdLCjY#WbC+=&vqD9me%-2xGZ{?kWTy70x4v$tM-$PSy(cElW zV3y;bP1bsvhd;^_Obxv~(^?|}BBPoZK6~3wm00HK-;a+u#OSsA`eo=39-lEhX1RBT z*=1O(v_q6tqPzOv>{N_~4n2ftt8`8ht0b}NW#k$;xQe`Op)D2s2&V9FbfumC= zSh6tiIkV?&G}UYBbgOIs-MJ(gDnSgc-Fo2MTnsmlOmm}K9_2F8@^F_c>wwQs7iAOU zHXCCs4ATMA?>4^y21QPp!T|GID-qn<5OsIKYIwKt8BpQybK4$PmN2dFtFr7Ho49(0 zQmhK)Uxff7us+NvWsY!}C4{mt|Hi1PaSI*YC^;EmD{mqF3DEK%@ZFB7N@vz;!hu3O z;Q%|W^+c0I3N}nXZVy@W#R26o=xu!TnW9vqiG42ml?yJC+7<#ZAZ{xf9xXi{Pz=6{ zZZe8KGAg*B6B;0{sKD>E+ro?+&l%!{(bCr^o)S^4|0)E81x}VlPGQnONIEQfgWq|8 z3t)eLr3kdc36KWz`x%E2v8dzShGB*%m^<>;^0MR}q73L(T&oDnhk=p%yS9pg z?Ezb_jYfg9=eMe^$uEAy-aKsm>G%a{Z`7@?Vse1B~t!Gjn* zHnCBB7|~lZMoz;V;k-@QIvt35L~Z^@yH2!}w5_G_&~;zp%SH7Z#4}HiG+XQhpc$?{ zJuHoyzPzm^+LmiioK$%u=4r-!ZT4HUxJ~J-8n)c^kPP9PK z3p}cz|9cpZf*q~3wNHA+2MVKDmmV>z579St` zWWWa6-^uW`e_6hpX{1Pen=zohk3{a-$xD; z&h{D3zcU4g1gS7Gj1s>(DftNpGSc7R#}CQ_Q-Qp{SQ$3A-pC<6)$zcLUh(4rh?OeK z`Z8p{y7XoZHN{CNk>&eq4FCbnHu(3}reTd(P zOO&v&A{*+lwYw-`+`s*7p?cwv@o-pHD9OjQ)7>@j3^}|#>v!YXD&!`PkoY0f))BEd z|GqgvwI_}^DG+egQsx9RXYvyKw$xR;07pq7Pw3)QlsjE>mM`v|I&dm3n(hSUP=ip$ zqC3Zb|EAurHXkeOn%b>i7P5yf*sE#a+AJ%W&00LQ@T3*{w{w}s2H>#Mv|C_&7b}Uy z6jPt;xtUu;4&WD!m|K%Csr`94U$JZ80lOhn6%E*`K)5H59mDG=i`A4N&p&jxW_kHi zl}W7O6IDvwV2zNVBaE0apzVgoWAqF)IxENv3pgtsWO8^8Ug_<596L^cYDn-qNZpf` zRm2Pz$xGS-?tu>_0hjEBL!(cN{fg(XT=6>XzCzs}>L{u){vauoot3v6lv(Of~E zmS$@@02|Ni%?q)aEU%Pw;s_h;oAIRdf&SAkA@%s%K75>Pm2>TT&A4}P4L@rK+{gGl0<}&1k z`TVS;*>A)(jDWueQZ%2K2%-ztc|=v?8UM*qu(HtBdOCyQeqW-L*I5Q@d3Ia`Ua zU5akBJDpS*KOQyeRMQE{nNYFn2vAlLvmGPM7joOYzRA-MlkuGngD$bwZ9MShrK^8> zZ7@d@zyuid`ZfOGA+Zz}$!T3VI(t^n`Gw(w$NI31Ru?5LdFOVG4X8N~p%Mr$88by= zvZr2Ei%6beAo8w%j>x(X&-Ui2T=mXL9nfh0J%Rjhs}90-1A)Ds|1HP+@8_}n`2WBA zH(C8_{mcJXa#=T0Pa=4lcks!c$RKraQ%=6lFocpfQ3yKLMg4E4XuB9df3ouC=G3(| zos!bZ)+I^pKa=|u<%YW6|2IEEUORpBq!fEacG~^3IVSps_jQfrg~L*Ng#V+1 z-(&6b>ztu~$bta~@gpPtHqhyrXld?Gw2EU*aUA>egyyh?Gt8>%fwA9SU5-v|rsb6W zrsC^U)WTNy|9Sz@_B#PE+*#xh9lxzu!dlHfscY8AiG3?tfr-8TNV<2P@$FK<=OX zFedp}q1zRUB>M2yK;Y_%m#XdGODg|T5V3MP&;m-p`w4k9P5pc2_Rb5)Xjzr%(k(#~ z86~B$hX;HBuy<%XA6M@^nw_k@Ln`9y6m zNd$Xlbsw_59vFmguIr*kl?)69L9aqqkIJny*=?P+prNh2EWAGY!+;CfUzUD8GAqzu3Oj7s`^zr1SrSckfpOV8u1I0< z@DiT3x*mNoDHex-n3Ow3!Fm7mUeX^j9Za;DUDJc^l=sr-0zTnbzF^<>tK$=3PiI!!1Xkwl1;6{*Okvkh z%N(H~b8VQX^{yAq?|sHE=fE3c;5*1z&i%WW2ZvY6fE5R!;n=ekkG+NhbWBX= z1^uvls5qA6?J9zl)DVo)m&>*;h%VhWO6O}ZRs7gq%j23mZj@w|*17Y6F440;TNWZU zVU+&tB}0NFu#2?$2K)Q>d5rN!j#eI94e>Laz(p1t%0d=mDZ5S`*4;1G)HKyE{hth9 zpVLQa`lm)ID2t|-Lxny>IILJzC3Zf3P2-vDhVf%-yA~}j zs$O2<+;+gRZZ~V=C<1Cs)wt{Gcs;)-%N}Kc)X#X zjpEr3kf>yO{=@bN@@+?I6VixT1u$yQ{Z&jyMa!Huk|GdS&OvhF;th)6YTiJO*x zJZNbbE9?N2ghJeJ|LPRXrWtj$kWW78E3*@5?i8zN9q2kCB7qN95I_k%);2@oQef56 z5a&mYWjAn<<@A^LKJE8c&2M$HZ9gL67=JevCBzv4TpA~8 z8y$UnYBK0V@BV4cY_gmBynp*-iGmER^5(wt3}gL}@%;L43>yv1@+@?T$IXUJVt>kk z?REAQtnrJtnbOVBe;V^E6(|KG_B|=W6p+7%-{{N9=sQ2^%WC~_@O%cQXHZ&}7Wza{ zjf6;z`$>}K(1jF~Avx7VqT$ZJzbmkQeJw_IR5+_e#P)F*h&?_E@%hn9C2p70Jw|tUuU^Mc*m5=#Xjzi%wC=RqiH*&tKsvkkD8<*%B(k^0^tT{&7q%LEmP0JA!pfT1>MZrW+y5i* z{LDDpog-wh z!x@?V=cj{=3ftFMxo5iGqei2oWf$_x!X!tIMWK=tD@O-IE+?S((k! z#i7P8eIjWz3hn=~7Qn#-*F@Q+GXSgk8E!W!62>tU@zPlvXf%cyXVlQNb$ho{%---0 zFh(i5Abz1X4ui&nlW-xC7=DWOeR&1M9c%paLt+EAu(hQ?q1R3!;qPqt)$F(V5JCel zpb9|h>?N2g)Ixc_C=e2h8NHvuS{*@&) z@+b99SxmPGpM$}@uIKV{+m^K+cnN@-ZYs#{6mpl0FkJ!L9N3XLSOfapLqUrAG6?}{ zaF<2&m)vPan9Fk^{35|30`;eulikcFg!$)>X`qfp18 zq3KGXr$ET1iZfJow#UQe<)Pnce`vvNnlLVI=lW)IVlp4V_U2%2o&bI>(rAqR1JLb2 zl!xjpgfqXeih(@=pwhuMkvTT?XPb&s@Moi3oa|@KJ$j5O!B*&56Wiz*`93*h4>3ZR zfR%;JYp4m}M(oQw#p&rV)0Oqu$e!cIJAN z6-%55&-cb;%6}r*acP-%~-^R0#Hit`5BtY28-EvGso2F5y~>o zP)`6|jC5FE_^ZvC5x)fCn| zZD=U?piA{`!l@#D%jk4x5vcn%Lrk-is0A9)&o`fH_Ru^EOj8C!;NU*x?6(3bMlCyv zX7QAH2H4e0&hkkG^x1wA@^pW@H*ICyhYQj{3r}gRd@EL;IftJYVeBu1zuRs4Hu=l7TD2axVCZWyXE8r-i0jrH z9Rx*lmv!6&&!MjrdWO%>yLb_s=`8pkH@XFVMZQu2j-#N9U8`@>!pY0Kha>1i_bGa- z65v<2Ld86so%!4PH7A%h4~cyE-Bw z0hO%wG<-L2Bzt9?bFibMeJ#mpr>GL*c|W{?pcdIwj!+!noZYOwNDoXnF#B&DL&L!9 z;UxQzMZmxYMyL;{+gubvCik3L+4E<=a#UsB7J)DKxq5W|EH3tXQIA)R$g#zWcgpO`hSxn1 z_?b$05+r&{3nL1Bxx0!8quQK$-L4XFeK_si0D3VzF@gLZv^$@RrrLih3O~UB^wZw& z!2LB~6&Ej?-S=|Jo#-sP}^{a>2K5}sbb8_%LUJ9tk6ua66w>TYk>W7Tcux%s&M zEtq3Tj@OcJ-x7sEeV*Dhe{lE~@$cdH_F3DVxaOqv*e&(FbD6-@8nx7fB_yl!4fzHZdE#LvXngzSO>&?@YF zj)_1D@~*r)cs&&G4fAS=DRu5sZ1WXm1FrgCWZBOAcLotHEG&rLy>?u>YnIj5e{VY% z7{3J4V{$KTKkE)ARTE{vw~Jlq$gEkN zHs!;jw6?nH1Qfb!{%B*n+TGgF-~(BS(^LKM?*sgwv9(FzLyH0oyvNlL$arm8hxGV_ z&HI9?^IrcC|0ZFn&WSnR#c9)OXj6WE=FQ2%)j#n_{(|pDG4PC+5h!AL*g3DORnzr$ zEO5cWK&zx30k+VuthR<4I${=SE9M3mh4`=A3;qNvZqUJ}EcB{@ka;(@`bPGr*Vo6Z zjk>R@v_KTR1q=iRyjSIuWca){1=j7l^kf~|Uc5|@pq&|;MCs1f7ph^6ns2-vTgThZ zx(bYmOg3Y}D`hl1(tSeyvu&MW=RPsxUle}SgRIa)_nQ|3)m{^v8Jhi$vivYawmKnW zM4iG+erf^wE5~G5Qy7g{XV&)PL?d|}N>1kyHl9ezhTYtB`ui_+>P%idncCaVKc!Ol zJ>E6M(-@3JbM2bV5JU5DRa~)Dd0tp&#s`c{D@rxMuy7ANRok%xX^s$Ov%KT($6TK?uv2;eoY(uy6q5UVt z+cU#f0>{Q%-{ujoNOnKGJqsmBTzECJ+t2lDk}vE&@k;jCd+dI;3wDxaOq{gWOV0H~ zgrLy!UxWFT?}j`a5|{EA-FnPe(y^GK*`YWXl`7>JEcS}b7QzBaH)+?oC==2IPYqRR z5`ij-`AeZi3QPv`9M{Q^d`>`+7V_x z?vqfA&c>5Wt?Z$%_^L9EkZE8%^9Y)a%3h=^vFIkZnyJc&)$UUv4@1m!3K8Hd2s)q~`Qqu1 z6_LJ6hS5}t2o);zReh7`bV$pOyNCgvlM_2;H#31t70$<^-c80l>L!=k=Un*KArkiyxcTw9AH&~cE zOfaSWwR#*_CQvrtQH_d6Dr%%>PFf41iQ+|A`Ruugz(ZbuA6o* z2&d{4y*A+8?Eb#!RX}u}>)$=;FOD-790N^WXVs!jP@ul7hoR>>Y;pD}s+RKRa0Jzu z7KfIj01xg>m0EP678HLA{~{Oeg@8p0Hm4fYlo4SOtHY9Eo*=0pmcU_mUc7nH9(>G? zcI_YO{QA_klTD&xiWBKoU^|>HfNB6SzS-ijl7Bs-bL*kY`$;ahEr;h(57R};A@^L68o6{=~GF!(?u({sTY_d zHCHuEEP?a=$dAZt47F7cznJoYROr`eW;Y_HRy0SRlZ0sRRsXgLkA90*Zn0V+DFm=7hBp2MJ8wxGV<+EU zJcNhUS)SK=;*9>ZbyU_~p^x8Zv69CV<3LGvu)QADmlAK=h_uO-d17Zx zTyw0=#2TU&X)Dil)$~Ksbow2$OlhtinFJ2gw!oo?N=reJ=&Zr5SI@M;UGyB-T3AocMShI>H<$1W5enKALFAn~!aT6*$ zkc4TThF!L@v&)-FT>{q@g)XXLd}SD_S2-P1^V#N{(e?E=~(9-)*DIvklNo{Xpn~>!Tg4~&)*;LVCc;=X8q!2ZSYAfH)#A^xg z;;~F`fuACxkSP~0_FHa>w}BQZLA&ed1)cfs9JSTYzJ*==;j=^ct4==px#Z0@74mLb z10m%joAgno~HdcyNP7-y%Sqhl7y%#B6eHdn#zyr zS(09?a@zjU^06l9RuyiGUI*m*Rrm?ns0FKxC%t4XwD+)-SqyGKFDlmgVU=L8LtVC%a^5c`8?F41UcAoyq$r z9szA7R1kiF(z2M({%-#By8wEGbzMZex3r(QY6Gj|`>T0*$j1yd4Pq4K%yuadm&P%@GCqUtFIs-g*4Zj zlM@P>uaXx7!vswm<(b1h(YW zr%bNEAW|lXO}L~my6pAk$tx{y8lFrCG&&bskL3{^T7%~BD}`QuBDr}ab!XN9W}ID5|F}>2;@a=d=qs6c^Di~5|BKJcsE}o|4G5d%i$8mFhS)Sb})ktfr7c0ZC#@C z7@Tj{K*ErD^w!HXs=0ew$&`_F#;o#Xu(jzIc6!zq1m2IqT-3_PDs8-RsvJtckxfNm zEuvt*cDQ6Xvy^EnRc#>z3I38R(Xv!Fu$lE2{G<~?6w8!+t<~{C@gjoenCp|SI)~Bk zVbhy=VG)L(zhD!v3oQ^b3MR;>M6o_9_ZK@QP3bv@LZpK@wMtw|?Bi|Kwk%*;82-;{z8ve|@s)pbI1W)nMT` zGU~fQd8pV|G`A7L#0+UN<89R{!y8KjMJI$9R9ggAsir(a3%~t@bkMb`qq2| z!^LvrDZG{Qe@T^M=(Vhy8&)p(S#Kzk)P&ULbPt<+EF*B^{>)*vEXg{kSxgPw z{x9+~UZY-3Mjj_xE!*^OK8O?u>wj)0U?P>s)zUFEwKtZ1<)s44cpnqDaqv~gY)?XD znG(;t(-mE?_J$&gCXLs>&pnbNv=R8ANn%dKpp6*%b0V~19NzFee^g+iLCFbS`=DZK zO#kTPll!K5BEX?Cm+u|=UnWM)$o2a%KHQtqwdA21g4yTz`-(T}H-y<$j8qZRKj5T$ zKIi(xU|5B+SuuW;k<_BHGn?;lX-FR8Jvp(ZNiJ%lm7*+(VLZ-{EU!v)$mQrukGvs_ zPyCE5P_n5Dd1T}ya8(YD_>r&q9uyGHVgnCrX6*gq5ad07HuPNY&uv2PVSEppef*u0}gKq*TxF;|XycEe6HKH6#3@PECEVZs91sx!w8%JUIP%O4}s(&+`G zPnYq27I$L&3RZa-t!Vtru>mrJKO353w3-MHwcru|!db0H!EM?c(SYX6xA!RVM<4!c zB)@?i0t(G6kh5>uPsV*4{>Z~5AmUFxLz9ypLpU0cH&3mg4ol8~ANQEfIZ-?EwFy`C zlM2E}Vbw&_##XcRfMC=gbw2S56cq3%?rUV$a;2c>-znHE+h;>(sh$nD#dUxAj+z<{HSa>`dn$7Cb+t z&?9ANQ38K_gpV@gAP(*@sv;V?S++P^YjD3?<88ZLwy`yZGx#y`IhFA(Q1AULlTI#l z;Nx5fc~KT=YAg1ZqAj(Q5_uAx;*R^PUi@1jM3rG(@G_29cT$CtbMgC#)M(5K_?Uku z*~_tA8U0q$zoNBUC)aEM?)iW@1u8Nl{e;ymPn>Ma%(GBj<0Pb*Yirb@V1{T1f#W-yd)2YDZL zMo7*P%Er2H(Bx8%sP$>J2>SQPTwYX`pNJ{C`8wyz4uKKu^U zbt^#*aqP0cXh{hlS(ntd`+va_DsY*+aK$}U%i2?)*DAt{1q@nXfr8GBQ&x}KSv!)Z z;mI5Q5C>_{Y=2{r0FD0mglG1;xNS`3`gb%|+v`gPt4(~B@&7r@!f z-a;m^QxG*)4pLG=^0u%YD=1#I@=l`ly0B>~>X7OiZlAeHahjX^*p-&pb+#V7nt zLjQ2fTzXiRco`%~kHnvUB{RbkW`bs_+d_n`K__0A49sAue*m^-7!ZbwoHcjZE{V~%{ zUaP#M3M%{2a-3SXf<-s z)in$|WAPfr^xFm4%o1XP!x1pPMit8X<^%m(pLxcTgT#pk%rk;8AbN{g4W7dO=b=mR z)k8t%Q>uvLWPJO;r26!yV z>}6~=RwYA%aj1V*{I=xv8X%}I_p>>Bhc!Fw$HMa!`XYN5YRm)ql6b1bQ)Ua3EU6%W z$~Y0x=XF-S>VU1SsMmMsIRE{Kmz6YxAzzqV>x0XAq_NSc+BNLD>5!$!s(UzIcuf*n zINcBKx@8fD#yHF@%}o@=oP4HF@VOi_gv-@pto_>Ob*tyiJ0cOZ;?NeNW_`>EMS8ti zX}=;^xO};L#>Mj6(f##@$-e7MeM$%h{NBL}u4&Z=ZsX;=56EoR*_~LxFcZY>Fl--* zk&zy!E%03CO)RIA&X*fVvZa_BIflF1XhOC&p0KBcA>>nMAejNkwg7#CFm%E}v&nlw zXyXgMr0{c{rFNqWQ&r~zV&mltWjJ_k}0Mky>O=p0OYJ&~%Oi%jX=e9ddRDId|_&dI|hQ`S60Z z=)KmUddn>+IB57hvElTpd%kn#;LA-*TOfQ5rd@S>UazC{p6LxaZM(4mlCD5o*e&3% zwDN?Cc-Q^)nF)Tt1|I5Y3?DDd~+Ih<}Qp>@4o5w)54I_Re&;oi2x(fnvyf;WTj~_8(Opas+ zTStcf_hVe26vhMg>3XEtbE?Y6anQ9`N5@j{Zj1_g!{qOO#A0|LYk2<<{YXqfvE%>yg#e+f3B|T%DL-&6}Qy)UoNJC z3g`H3LJrU&<8_Zfm+)uc)2=FLXQbHgwta|uP_E67=0A=Up+^L#c zwSTg&PJGx7#?N7jcRZpDwBl8;)g0a7oY&E<=plrAPvcdNaM3V!-PLLL0@)FVce1}& zQ^a=T9ISv&Fr`mUw*EAsh>UOyw1)#Qcn4(ZxRPCH3zJ4PQCpLAg`6pjnO^c4{y$qn_Fv=YW+WxDnv6t*sYGlP3 z`iZsO*&s&t0)E=ETTYOOg)yG7F@e<2{>!4o3w|uf&52eZ0yZ=zuPeGjd>5X(6{>@&f2+2VH1aYXq7~CD#yxq39#@iPM{9Y}?pNzHOG1U#cq@NQ; zx~N?yvv`3vMm)9PFMM!zOi$l@&T-z}J4W+e!0(lye`Kl#>fwx(Q=3WKj`zck`L@K! z=%3*Cx;c1EOq=LS@hMt|3_+ggID{=6m>~SjXCHKI9l|gS=!`a3Aj$WSRS}TiJBVH} zlGnm_v!&?q<5WYaEL!o>wbPLzeM09ZjAY`{W;o*&Lk^^-pOAiJx}1!9hUMMjCk093 zYilJ-c5%WJMjOF8}Ei_ z-ILI2!Y`iFZ788WqR+O!DI7|Gbqfy%Y=?o+y;Z(XQkj-h%W_m4>@-z?IY}r((tTgI zNj-m{uA^h%Gkv8lu;tz1C3*|TWY}bpZSG1f^ayQuVOd`9?psCOLIe!Qrv~A#X)r>p zCSipT(Mj8gXk&QsVgy>}y+ENbN;uQxJgai+2a+R&%@o_wmI!yhjELE->*d)oO$j{( zLB7&VH70@RTRO)CRVuzJGw%7ET%KcW3CBNL*LVw((q0Mp(iCE#^)M>+Lr5E=w4|^q zJr`lf&%Tj)x)$TG?j~?_9*BdSP0|A~2su=mTyPl=e~{dZCaOMS3@P4RnX9^snA58g zp3hH#Ay5t8b^gAvPPN~&@AGzRoO|-Ye57#ZH^YX>WX4A0uoj5mV2Bc|U^Ov!=+&-L@OfXsC`2@t|GVxvg@a^!vlyCS1mx_A+?Gt>J$Y3E2Dj51Fp;^ul77P#qBhxq z&pT(|oZtD%cS_GWL*73q6pmog6(tcM38~;qk&A6IUh;if??G=({D$Y+fa4!ft4c>D z|5quB;PA{j;tOJxnW@}r@{$f`b3&!&hl`(7;WE4Af&slA*(T`dh+w(5W}*ALOoIWY zh^(etsSxj5>FBqxi5$B*3nk+LF4cu^tVUhBoX&6((M7RTd(R{+ni1|9#%F$*COn);%!t|*CSIl#Ureh_fi{X6ye|I#u0_}dXe=r=wK;i)Tzfjn#yOnh3zH{kK5DyOIH0L#EiAkg?=ddNXI zgYN68p+%>Bn5uSg+>goT*CvH2&X1kB(|!Rpb0DgpDL3UbhU@n=kA!ygu!-9?fH0j>Zw1(ct%UFjD7Gx zS?{P2Ln&|>W}$n{n}5 zsi+DQ^{?;#^6M!0pWEv-hYt>&r3OLSPgwR7+$m#qq&KA&yN4e zehj487GXZh$1u^PaLdE8{ukrB`m?B=Bj<)6@Xep!*ua(bU(6%zvGJnyHlK0|_JugN zvC~Pz{j@dW{~C;)K3vEXu|0(J&x$>D-5GuRq$#!sUf?P|M%9bDwc))~=KWassAvAu zG~_}!p1#JG7Q@{iSty=nJkF?nJCS;1PV+~{vBFQ5#+|$D@7+0#8Gl1%dh$s^qUG))<(YM&ry7I5bB~>sSifCh;j~gmXWP6viJTyoo(QGYJ zr(!pGzE3;jk6&tuGpbK%CEJLct%vVwLXgv16YbbbGIKkJGew33V@q20?@csM*l+|& zs}aMlyz4lVQaP!7Ge%E(;{Rgfp3SAX`sMbw5DKJEqL4h*N{6`34|ZwIJ>I~d*I|K%&f8-id@tU$d}G;?p;9)6<`2=mrz$btq_q;3-B7o07B&)#Kv?BI@@N`k-Tm7?*4`x)>cz{!?XT;LcG$0KNS83 ztLZBzWc>;0yXo`ymMx}>KYxo(hHbk@+WcO|FK&rODl-(VcPL2>AU0-Io*01e6dqHS zyl6E9qc%KU9xzw&OK%ob?tZ&({heYUr$|k9*SB&r4C6`Jys9*il)8P?$kUg$=jLXdqk9VO9fkmGTgw`nj?TqdCt^=uR zhoeZZ8+dB3S%YH-U!tT~aVG*k4{SUYMjk6y;;N{9rg(>qt~-vc4kvCQwn0c)+7DiZm?hZH)yH zpuvjyp1cNzWbjF{6ouLeS`ERca{#4_TUVXisaBc~n&H;?PV{|9kz z85CEvt&0*55+Z2u1b2cpB=4fimJ?Z-i;O$=>a){>ePu%7TVAM2ukf1zk z`QYiWrupDO!vn}UzLJ1wX++d)MG(+a;-09b^R*5rp4Q>U)=!-^MHJ7=gD!DghZN6YRrtj^t~Ia7M!@lwxm z&sFQCL9OY7YHr3_gh0R=FjvrWg3RYk) zo_>N2czjxuH{uoXPI@{_@BTJMI)0e9Zct5#yHcHuu5vTAr9CD>}WQ^(Yw@rP-93tbXSOls9_T;fejK@z*cB zCARl!0S{`tl4&`+w;FIiC|RQ=eWGc8bsMzO=t@*h?L3**@FWyd@`@^TV&wF$*0vxg z+#ZPKltp=k6CrlQp%*^s?N+*Sj5u{`qM%H5kDb2Qe)#z+i*7K?<9;arVqu*l{>QwDO%>Ae?6Y_f z4O7@N0_W)&Pfnrp=y{YjlO#pT+`l~!=L1<@zSMm3C~rEz1h5-dDD#Vq*X z&7`>*^0B_!O3oxl-*qA?rE5M0a|(gF5yF>a!zi;py_)1e7)mEKFyPzfR&Sbfj27Tm zImm;j;js14X94`!8p^)c?Drn3Fd5$<>y}g#^7v2`3HvW1PK6iG-Obo}PvPZHMC$N6{Mk6m+$^sforHSM8AML00y}C@{KpBvaXrxv6`?Ov zC|QHLs)l|zr5jfGKdWHDZU$5IyEeb^@fZNB=Os)Gl+vi%W;^Y?e7pJV8wFUN`XJ^B zjfQ>h=cVhMse=xgoL`QH%JXw+kkqvu=9MpR(j+l_t#uPXbWCA#rYi~Z#>-f99#BkY z|BLErIs}kft%ul{def=MEG_buU#V9Gc2&tw&jc1S+mhB-CJol5Fo+=4I&Q&XPc{8g zs9HVPgD+(g-dcS#)1+u;l#b znBsDb@~#u>>AB5e0}3}k7xH|=W;tm9N-*n*8DE+MXzi2x4yj{u%}P!3j1P7E)IB8C zYYcc#q24>dbhB;&TMz71~dQk^hTqPJ^Wc0W{)^Ux&NQ?+gj9*Mg9F^Jx^@J<;PRzQ6Q%!!U^gF*Os z9$?C~Od6%!7dBfgySjLII1eIEz&d z);oobE9yW3U3Md*WATgci_&fcJQVYhP_W!LArR&^tJh1fi`docfh9I=X-}yBUXaN9 zu8F|kYsqQiovjyoy(7gqZX$bm-N0k%YQdW|u1bwkx%Q#~8J|6RPGT+f#XRRHpFPw) z$#hEyR~aWZ*i;!=k>9awNNoro`ja~roym5$5rdi^Q;Sot9^|OocUJ8)UP?M`)0N1v z#^Wi52D#*~~w^Y_N~fZpLGe?y?tpSDP{kmM)zOL^9o{M*7i09anDJt1z#$S2sr-rAp== zD%|xgz!So81fmrHxXzIL^V&R5a`;-m9*t&%q2HOd#CBH!!6sI_^2yg3uVhwuHxO-y z23gCqaOKV;+=#8QJZSL2rRS;X+$B`Hv@OQ40J|7P zrJfhDcEmazJP{}82_)D;p5{LETCryL^VB?An#8i+6xA0FL~PWfExZ)pg#^^?U%0h% zDyN(o`fjd%YdN_ilVE0Iu_Fb7Jf~OTGx_1r5Oj1a$?x13omSMY*YssCB85@pa*4Ju zm!7aJ%=fX!3{H58VXlod+u{e~cA1co&zobdy=)jjRG28*M=rHGsuv}!g_JHesf>lG zK>UC>OKkGX34_vYSw^|IoQmwACuM?SXM&xo)#~ZGiD>(dU3-}oRiiuWLDu9NInf7p zHZ;rPhbSo%@~UFS5vSNcCRe`~zAIJTUR62nihR1MVJ~D;;o#!8GQOB>c`htw-0wL? zdfs&p%|c{e+T-rCQkNrasO3%`#GdUYtf_D}zcT40Pxzb3y|{hJFGQlm`)1uCi^#&$ zX5_VYLF$#(=|ZC{OpN%&iJM$w#z99sV9(TK2kjOu^IkKASpyTZU-kGRyRs%HDcQCA z@%ssMR=7O{<+XYH%~C4HTvwrV{5;`~f^y~+3U32EFeO}|@-deBJh>&BRziNFzXeSU z0^`eGRqsEPqB#`AHCUUt4{WVDw(Dk~E{auHWeY^D7|jN(^8H_Clb#?mQmNAUfO)T& zuXbX@i6IL0j_&#Cv^k|D3u*OnCs*^T5yJ0(F9rfWuL*#Mz~cNpSQ;b@h-dg2R~j9a zclLU^8)SL?R@E~M+-^DNx_3F&SF{!twp}&8ya9lDemqS7w}&7bSROBF$V-|<7xUfP z^^X;XMg(d+3=Vi>4@$ZJL;@sQV?!dvvM3Iy7EAQ@NLKJXk-9kV)j{tQ;g+dQwv1zE z-PF8Xz`?{1f-*Wu4tnl8dg+^Y)?T1k?|YZ3k`|FAo@}&Up!OYB@-^*4x;o+=(dW^< zs~kYYEuyVn)@BOw#osjA){e0O1I2^S=JS#)@}?ZBedLyg7Wz|&neS5@;98%lR}*nZ=N zvZUC5Y4%@N=|K-M^i5q|UH4j?9)%LJHB2ZajS|4?D~)HyWXYfW`+fd>;X!eb8}Vy> z1G$Dn+WQU}GSDj; zkF&pf>0ev>oT3oxi|FNyXZPRp>?etGY8ebiF;ldJ4{S(q@jgNC3= zLFStTaF3;G0~%mOQxD%bg?RWS#AB}dhzIE*{lbx!Nr&cztagn$YY$;ebH;7 z8*y_>xvUrx`ET~^Ul)Dr8=IaOOMt?(8kXi%kRO%0Ohj)n2}Xo08p_J-o0^*L^qF!Z zs}AgO$=<#-afer4@=;!x@s=1o`X#oAzGHAT04v@_c+cB3oFx}uT!^B8!=TYAnoXu} zC~xW^V!6t%QihHe+#vo>c4039pFes^xlDPBRGGZvK4Y7}H+QYg3!3e_@sP*Il4t|= z$V8SR-g498%FBnhZ0dn5PK(r3nFxT58D4g9yxkS>Oz))^IeiI;684gkaZBy3&;C7x z8FVB3Z9UqMXU`X%;fwjJ_x38A>!7%-E`bB5*x(viO4MAob{+Y4AUjeR_nj{q!}o^L zC-KkB10;R=eZ$VHS~9`U-w?t#_b@0f_J}U}C~ql9c}rH8GN~=*|J&f~_deZoXJL%P zDII&$+ZK-zQhQ7HAHc>69L2pp$KK~oOG`!~=iThb0~&kFWkTo3=}6xXAOl|>IXn3$ zkdsi*>UM@ORO~6*PbT;7FA%!>9WM2kM6zUat|`-TWtSd|^r zoWkn{Q&6y}VL4Fa@K1L;Uq2n2AIJ7H(*kxs@rkmfgKD&$QFR*!r+}K$T?Ac+-4UV= zmII3>jbs?xE>{oIeC_O>;h9D3j58sxAuzIb`Q35oAVQy%jcBHF{WXnzrISJ z|HgR#Y;1JYb%X7{k7~0qi^A?4Q<+&|#RE2z(YbM9uO{ zmfzP-kieJVo`7kbABX#*cqy;=ARUx93qhB&{)=lscdj+wdsK7}PlBzcAdZ^YwpaWW z-C=$Sm(sUI58~B$q?f|8_+dmNx6*gUFeQOYu8wxLFj4Oc+c?S|gh&LYzg2H(;7ih9 z2oN<1DZlI*_19a%ESW|UJ`)^(GKCyJyZv-N;T|u0Ax!^9N&E-tH--{@e}|vsi>aF%U$$HAet&(_rUI071*xLm;$Nfn<%nMpejHvMNP{>u zkV1|hD=}(#YozzzH~ZSAUg84Y3F&*b5qPK|BB4p#{cUu6>%+69+P7jH)6yK%0(SbY zz5pCteQYEk+V=j`k{I$8cEs}DmlU80rW#UyN0zJQzT=cc%r4ndj?4MTpt)j!~bl3$gb*g{!^Ve3Be; z5|4uEE9a7YpS_58x{o5~0#OEh%SIu+Y=64;03{YzJm-5~A$@b-&!6%1PWW}DlpOB^ ztX!UQm%GIXK*-`agWfUmif}gq6My&ppzr1k zke86KMfg{wxq|5JCnAO3S6|y@4^8Mz9z#83Br^bl?N!r#s~h9*M?m{Wm_f}{sf}U6 z8WNY${PL+vJ?GatmC7^0Ex0S!!V}>za8!YE*zV)pH(%QbS)fT(-u$znFrTOb{y@x5 zeqXW?qT*f(_7r)6o^3G?+78{+;V?d7@-|JMb4Qlm08{f>!SdC|Nh=Ca)dlBu8E8X zQ|U;zSue*A{=nexPX0g#V}3L*-?Pk$0+@@jQF?RSrNr+cjW2f}8`)!&uRfzYSyrB; zlKUq58@EiBW7f@T#o5tWB5|>nE3(&5Sqg^CP42I@AracA~Bh& zo%85k7vUon-ahEx3p2X8twXxF^wa^S^)u}VNpCT{%l6f<;SsZ%7(di?D)HVb0j8%8 zNBrG9-PEZAzQhNO?{5z9z6Att_P9&jEW{dSDBVIMpPkh9fRN++l+>;6^S!He5z?;6 z0uGs6J5hif&+;qKr6MK+@d>o#5*f+;4mZ42D>gE1182T~3^60Nu3o zWsT@@9ktx3lHvFr0R#K}gp_Irwf>M>TCP2aTztS4SGifkKsE3*qQ25K_lPv-s6X}uW<5(cbBEueK^J9B98$p??Q+^u$V~8>L70a1nWBc613=y1+AkW z%LzX8(y1KR8fFQS$ej}x{2c6Rgm-lzHm@=e^4?cAHE|e!ffU zGZQCZM^$DxJ_u>KTw)jj8+Be>==E#Kru?j9A^r@>XSuJVfxD6ha)0A zD`tbnZxan#Y)?xL_~X+T`mUT@dfPM{Q471~x*9C;7f{G8apN4e%}J~0Ih)8~o54!9 z?(!>46hts$Zptxu1iqAv<}Ch8K*kd=^I1_Q6dlu0YUe}JlM9=D?}Pj7`5em-l=SPj9KcQFxe!`YL!D`%N+H(8Qw#O+O_{1epKW*L%B+#}#-v zZQvUR`g|<(d0EyVt?Kh`g1oNVKF{a#WrPY#Lc?Zl`9>4)e8r3#Zc+D>HyA#kt3i$olQ z@BTp%4u`*^cNTrmk&0D5C_$hmDHg;yTt%NSE}%j|oBoh15vF)7QST!w<^G8{o1jBN8h!uJ8C^pb=k2kRXvm=mcz1 zEzx^5Wid1Nbp0H9{JHI&mn1zUuo_C#=Er%mCfCXWepPC*8D z`~mi9+j(248nebB5&9ZrZWpz-wY3%J#lgYf<$FzuwvqWEdOa?>au{ub0nGB1Nnq9M zFhHI5VoU?GR)ZqRRw1{Ekq(>t^~iJdMZZ>XVqfD$Z836(&o7o{{PY=^AKAayC&N*< zBJy?5iPK;775w}j%kE`esSQ(s6=X+O<+wdpsEz&W zeU_Dr!(L?e-Gm1K2L#X;hKMr=H>=gENPZJ*GNoZOnR(Rozb9rU>vJXQdsH%U=9)OG=(6ZQg{@|Dq_gHrQA)YXs zFrtyD(^D~D)iCnrM&dCw9wiN(84hIm-E()EuNFU=R#(R0hc+6Y+-}A-jXVY%#p)wQ z)j+`8GZfyuVZr0bj`R2GR{oBo)caTo5$g2_2CtyDsjlwXh(+BoNgS0+3e|O&C>>&b zg8MW);B7+amJoEkC2HrzZlP=$?e~zAcprBpwZoTq61SDfGTL|{8S4`&!o*Rr`}YeU z#F2tAqF#EE(0|Y8#F?iHgtetyMT|tV^q5JQG+rc`#if8_2FYIugNf;%SDLCsoAMf0 zuy$K2V)=_j=}TEH)Y3TjN^Ro4r+a@?Cx0>GYSbk8W<837Co*~x@w06en6m&k11ytd$ThGe~Tn(~{A)ViM*dnGXWJ1e@CGtAUQQTb4&} zx_v2P&ORrP>GU^*T9IDAc%BG5NsFJYjY}ze9SwTL1IW~Jp`xPyI65{@84@ou?@0o` z&25^4mgS69_?s@@5B|UmCRVh0PM6%r+8TOtSWOd>@9_1l>#=Z|$10Vgx0wj_d_RAx zj-t`%5RD8u4OVHx8N<|y=LPP4-}`XihLA@n zM%uB+9GCYL>nNJ+=HIc@RL-;tmc-Yt5mwQeZ=0wIm6`vzBh>blAi=So?W=7}fiwO% zUVb4}?RS}d!LXFPy7Np%WYvxezs-h;li8NyFP7HXDV{d}z_o|AURQhg-rbaAyk~0s z*?TL71WG z2W|@|85K|mXt;3H$=V}}rQ@2V)^r@^-=x-JS5#qkWTSpWzwq&7Y&3IIft?b}Ir-MI zS=nM+sNtP;XlLEt9WAU@R66~jNn<&zKXP={oL3r}#cAS$T}u6QDP{Zoe5a&4AxvK6 zD?6@1G9!xR1xslwfm~OCTU^Vclk|#XrZ_mclHfdw0UW!JzB84ZY<+o`V>*l#LR3E{ zE04~eT<^a0;!yu8Ug;`>Dx7GDrK%1aTvBvsd2gM<#GW%r>o1U_3rXUreFi6`<|0Xazj8c}IX!XaQ%_Q-+#EsG ziZ8?xP2oWT{Z=v^ou9{;v-&RZ#_(cXx#HmNtF&! z?`(TP`DQRR&j4mw{niR8#kOYH|IE=58djz^($nKMs^|Xw4dNo~w6~F|O8H`jKsC88 zmP|;_bMG)WVyuKz3&k7L7;74sUXTDk8mc#bBoT9I(hwiK{V?>n?3EmE%uAn#^<$EC zqZa*nuQNg8M-$?{MTvR=8G2s`;`?z?_+xAb1hSPZ*l46WRXS~Lg*#<*Y);$muF8rj z4{W(4EpmJ0d0n{Q`3lr3BHql%d?c>Q*ca(|g%zocm0wTGBK<2U8Sk?<&642^hf6dZ zx%*<*N($729k|Ube04)Q_@wO5-zp6(4}0z1BF>kpYuTr$v)!9&Gj=<(3? zFLBdHzN&Vz_Zj&1e$G{d16g3@!X&4PB9t$#z+kN!4VRyE$n9Y+*&sE+sl(77}GuVay9Yusd`5&>KpyTSMJFDH9|&CdoCMN1pPkS_Yq+- zvw{L_VD(RJtEP(QBiy2v3z692dEZs+XACHIjvp4!DFCbPEyehIA<~gcVaLT>;W^Q|> zfMe`ANh>Vvl5DdT&ofyrGie^CaE$(%K7y&h&pG-#q)U2m4i478Tl zH}8@z{&M^H`ifnCygM}Sj^DdN)r|~Ml-3_Odv|GRnfhj{mwlw?cB&3+zj)4+z^)@B zX#oo}0+!lzSK`qZ`=fPX-L9#OMxNJKz>ce2z;dJP)@;1*`6spJBoCYe>rovk4c40k zpI1~emjAutXeca&Qp=$aT3hS%ihqiZ|0?xg) zeYT-7Pwu-GzflORLEZb2iaCuNf5oaw!n>N#l*GMzB`~AO&3_mom?G&k8V3JX=;2;l zT#6JCK^;%*Gvxqc9eEEkEPx(fcQ-Nkgvzs>004# z!*CsOFpeaP&8$6YR?R(Jr`?e%`RXmPkVGR=MmbxO$~VHT*u;e0%8%nU2@}%TOM0A_ zMNxZiVD=_-C<&F}tR4P@bd0cm0`8*vI^7#d+2So7qk!ky&!gI*)bMuaj>`S*0C!hX zhZZB{geM!Ep5OiY`j=oOcd2;5&*(4EGtI{r1v_5oRfMxSl!oTjyBBBPi$q!v2<~ez zVI62ET9>NZwaz2#K)$C_@Y#jgZvEm?JCM!%C#Cmy_@0>(DMjgl?_x^#r4Nd1-rPRM z4U?vN7h|-td+C1Yz9o=xjPhxVO%iZ!J5E9rrZH5LQCz|!l~OwD6~FVEHgXckAgFsQ zr3J^0m|Mj(3zL*4cKig={b~#*hI3$OeCDSV zea_Bxr>o;!6FjDb&g^#^&je5poE{$0szLW_C!KQ?;R)t4MXVS#@{u35y-lZ|3*3{4 z$<{W9s|sAZd~%<^Go$u?1~>Bcf^NXrbw8UvEu+ua;7oN|$8teUeNI0XMr#>a?Qw*1 z(8Q?g+2l8D)fX3<@$Gu26!38q>z^7|EWaB@;^h15XcM!;1?!WFMr6=qS>E+I1wyi! zyxwYva&RKr4NU8DMwXeLpqL0f9n!$p3&U9ExDnmCHNi~TCYqz4ZcOly zYfL}he5z}d3N-7p&0wPLWNw}#@=DOgEq`68XDR?Mz>QQfhvQ37E4?hcD%TknPq6qd zm=h5XkJhp*AtL&JB_oOR<1C+w#TgN zEOxNAc)%4n0A{h|*{m9o(A&l`4OgCxK40gs#zwN_Yh<4II6Vw8yMEf+V@o=`y2*%s zE*-O65kF5)We}i35WqH>kSl1tiXT62z|g5jEeTEzJHeNTL=ks66+KLOc{nI_0r&SrPb_Yr7`0fWp7tq3)9)X1RHUlAd{KG>B<;} zq6&G-^kY~qUk6}9vjwK>Oj8sB#vaK8$_iS`fNeZ8ld#NiqV&ko@gKY3(g~f+k_9K& zaAyMRamn0TR~1blKB})chwna-R`?E{S$j+#g{nvgTvAk6C+UpJrAL#bLki)D{Ixb^_;K%eDjkdp`s(O}_fE zDNe$3%s#IUBJ9i>3a4+(>nD^Uq2W0$%3UeUie~$fj`NHORe}KoxIGzzEE;(o3v9ha z1rEthE(5Bc?TP>&q^Kfu-9&|P*AKGUnMts0JLC>;_%~TK#BJ;lo@>3qXH_ZmHEA8n zOA2J`w6eLqvpWR^&8?}KfxfATy?F)oM#WhJu=)ePyyYl(bmra!QpOFoFf>NV_O@=T z%X(6S9X@1N5Cs3ERdb6-Oo@d?19nJJwO26^rnuWmCug3(toLPb`AF5aytNI)3L?o9 zQ*sk}FpI0wley}4Ta#*sKS)*-Hx^!rXr+-36#D|sUMrkseHoa_CcC{)J+Mct+<$8# zYRD-!V_x*tE~sCpt@4b7shd$U5?lqnsR}JJBTdLUkh_G3tHZrVdUH*Q1H-_DoX%rr z52!6pwV+Y2f|9cKYJz24UE^pjUrXYDk1LQWR$E+j{0>w79iuK#l9SS$g>m@f<|}pb z^?=RJWOeg`r-c=;HRf+p?8UK*N|Tc%)Z*`~Zp4;@&zFZM29Dd?+D=S}{a2Q7!{KaQ7E!SxRXkk^E@0 zY(im9jhxs+n2!<>PGF7Kh`w}fz_by0EI6T5dLN?#_5~e%pB($Z9-q zXto7dV4eBNF2cT^uoY#$*C1rytd(H49UGoeB)d7>bThU+TjdH&pJRS;<8rJXuc5k? z`3Nx%J>CFzc~m7*Y?f+l6!1iaOIV;Z?_vDMzh8}EcYop8&NS)i)Y&V#kkAtZi=RNz zmudEpQ#w82U71|>w@i$z;~{D`#&4VrJk;vvJ!21A_RR?>Sx2v-8kS?<93A&nRaJL_ z^KXk%?0;nIfZFue*9km*XC5VUdMNvQZ<<=z^mG&TbKvN4dKa-J`P{%ryCr2$vMJGW zF)rO^T-@HX5|!RW`VMVuElZ!hn>>nd4E*INW>Wl_s_o|YSQRVH) zCS?29m(_Z!X+S6D-s@QMWTNUkJR)`%rrEcrb92+Th8zW=H%P{Lca*KL!{WSC^Sq$8 za*>6ByELiDf_q>j$A7#(1_sR|ER2+PNm7wfy+!wm1@G%K$^y4V7~&c{hlaEnX6hSM zR4ni^F)^K9;Q88Dpwlw;ry_l?jvMS`7tucz=FZ54PRE0MpV#m2yOHXnDhe7j-4?G) zAD;Uj1T-F>bj*3shtHgB6WU$5K&4H1Ker&`0zL;FW1mbTGWg9D{*vYQ?2q8T?_gl0 zieO>f`Ky)G2YigXf8B}t#(;7EuRBzjN*Ir+1ZKdyd)WW^fhCPa79NHlmoafK_AySj z$Cc&5*?&IWuln5`D!;W}EbssP*2Rkz!e1CM8nPvWy*s}D`3b*df8c}UyA>i3^NiGlI|`Gx-TIT-I=WMMq}>y96d_#LXh?qEoN zzx(d*J26pQqnH@)IJr7_oG!3vf+PdUgBPu|oUYIMqi8LDeu|kd;I)IRhtt}f%`iny z_=deKiW)bd;?jLJGwNybb@atcCkA5HQc*W(O!1B%*)26E_3dmMt=dPs{im9OBz0{Z zWRN1UN2XdnGhp#L8nGDi0;QaqcX{K~)O(50M0&}JoP!3Yo2roFoOh~qcF=-YsE@#n zq^bDbtG6$k2iYHVRCzDQ3Y+E+y?o0AcY#)9#n>^kC5%+b0*DcAjLgN6?=rdolY*A*nF z^FFkhl(WMVUY<_&1Ps{JORz{hjfa-uQh0&*rzF?GNzqLsf5E1!W+Vsfy4jQZ z)q>j*HDi$r$fjC_`L2^nIb4l-3{#6`mF0ys01Usm5_D!){hGQYqcNh)A% z<}B!2#$p`fhEdhoN|gA?+$CM-XmLHerhb6+aTi^}OHL8`ueks(Dspb3O7X{)UGIu; z$tZIu=X_Vou9$DqH&EX97+fDoTrau4d_5!3K~Qxspm&4TW-`4Vo2K(Qkv7U*)oxSX@HiSRF zq?TL}VO9^qWiW=JlaF=AdU8&FfwW4*!)zt7Hm4w*&Qgc1=_}_t)=u0$z1Hgs9#vq|sh`SVwNK!hGcPAD zWm?7V2tkAr5Rl8T`ka1LSq~qepb4x{+(MQLEuLkuUS-DOG$P9-k0w4tW?PuP0WGOm zuZWS$-dMFX=~>PM3P{qW688D_Oe}q}E+l_)PE^g{M8wfqi*5`)e}nkwjz8w`lS zemB{mngFAUlhAJ}i~#si42-3{AB9F9DmI{f0tU8EHOV$C|7_V0+A45loz0lZYW+Lm z`q90=L-o%_|NQTNktm-%@cX+B%_}Sn9nGM#{gWj=gc799`$d{e>j4B|FHbDWJ8JcJ zV}&nwe-S36rv$#B7{TS{45yQmF(UAi;hoS&+Y zk4r9CvZ?Q`6@Y{#*9VhX5w*yWu0+^^Y&6m2UMqotV_nD@@!R(pbn8i^&-qtGaIvbFZ5}6afkX zO4QP;_(V0!1siPqT9%-uwgdPqgVJ+^&lHmW@@-uJlFnepUuS z>gQb2I-)Bg@CkeD&adxzUPK#oM#^$l`Q^$1^-vg9ZxQEqQBjsAESn~{(PNC$`THE$1-A*j zo%I03HfD3Ys9LTs92c`9V6^~LPLTv>(jf#hxj8wh>Vtq+DL zSU#A-9+Yk0YMr)frxPuu>H|UN?Ei(~-&L7B_;p{a(KBtM?WTxk+i>9(>O|D%XS@Hy zqKX23!q|j_yr=4Aqs-;Qk#;E}`>)qGHnt!Xm+@E;6J_5w<&oBBlU%3uz7sty^)5*X zEs(eB-`qIZ6(LvMog{qVunkD1RE*1ZSZI7oY_{m_Hv4RY)p2F&6i{JZp{&A9Ci;x9 zf~aSkT(7OVP0h_2L?Q1MZ+i4dxGNwm(5Zw{_=)exTS~W-r9Y-a%I94a$>&q_ay)+u z-Uds_I+$;=zWDtrncpZifkiLxu#)$$Zu(i91?=nrcqiF4!vE3J{yIA|bMuK{ox_cl z1~g)^JTd|wzb6DuV+B7WZ@TQUiH}Vy&03kg`vG1z$;K?3EgAj^R+a{org2G*EvSW7 zucDHYx9T_Njd&UdLh*xxIPwC%S-hWbfii%HHWNQPWVFq5|%B}js$9tjlL#r!=L_5h^T(r`ZWs5&G^mTr3z4=S}cU1sK zKxPOB`v4*s>TXNhIUbA&c#bX1Uieto&e32Y^G{dR z1mch|Nbihhr{wALYaK6Y<^V4U=&NtL#JdT+^P{*U%UCOcR$9B^f*d%j?mg=2Z0b*q z?2Cpep7gO(#*SB5`%c0Y&Mx*skXKmF29U? z`j>KtUGXvKlI}e^cIgkZ+L_I*#w8yv=gZR=p_r|;0F_*fY`Y9>J9Ks96B1bUEq&RU zs+BE?V&meOr zB|;ZRmv8Pq*;>@Fuh%Ib@>PjhD@IzqgS3%|9=vVTWBWsGyrodo6$)MO)Yw!^{pp3c zOeSRYkxa*r1lXbdxCefGx_mk+?8#Yfi7YD8R!M<4Wvyl)4&t!`VJ}kh&=(>?^47sm|A$cdXk>S{3j1$0E_PV5>u#;dCY!6}qCOsdm>L!F#Oza1 zYZsxv#E%MNz>wnv-V$jKZ&5jx4EY+-~Bjh@d*KmcG@@)^p?bVT^ zFIpj=1L)+5iC_v661nN_JLOH=fdWmZ8@p=y`kCj6-ai>*>3OEz6u@=nx}T$gP)IwT zzZ0{qVsaf4nXn}(daIRfOjB4@Bm<}wa~`}K?zz?O1suFs>wpx}K5HNNC`t%|Gy}!1 z&D`{xKfy?@Q`P7Fm-L(C#kdTE-&^3n#HV`sGR9VVXFAs+mUmAY$!q0>;lS3vKOrCR zz#~}UAM903Lei|#LVBWX%-u(c4BFlt?Jxgn@4sf6FG2+9=OJ5@7AZBN)zUhD(OnoA ztY7cG`|Ia0V*Zyn=P&>HpBRd$7Te}CgQ~i}Ogi;}g)>hoyimuP=Ltsd^pJjr1#yh& zM04c1z@H!HcdpC#4?B^QR@Mg7zudmT_!mE;paI}ILc+B_Ffhsiz=KGUM}NLx(JuD{ znxg$m8VjF@L3@aY@Xt*Q48Ju#mx%_#{YV{*zhCixAcOwjZ~gP{|1Nj(zpbj9m>(Y9 zxqw|!GF80Rh^O zWxQ6@aEvD~)2kNFZFHmgkI;hZ#5?V|*%S56Dbo|&{X;@3^7@d}gO>b!u5R7ha{qWS zjEE)74zgO8!n1x8tX<{On4WPhlg3Mlo-cP<`t@G;?E32ZZ8Us;s-}?0a=ob55nc{} zI68T<^17s?qf77EW;e*~>5Z701GSs&SJM=EF((@Wp^SnZgq@KL_OQn8!U08nf7gsR*KVV2rU}k2v9^ogl z&>*dluL2us)CM9`^Mc7k+^zDMG3C?kc@_Km=JvWLoxk_AxDxm<&U3yF!^e5(3QIGX{Q2eHim{6F+oF=9 zv5sy;7fPqqTu)cGzD&<$QP{12*xI&^%s4X^m9F1^MwzN=_6qNi!GAd zc4IZ*c}Y*uwo~K0z4z$`DYs+CbVFMri(VO!7P$HK|8fIM+!74;!CbC9tB#Ax**fo_ zKkRZp;wjmpTYvX##d54*FAk}orXAdT2Yq{)deE?ME}zb=FPm6eP>%uz4R)sAXe1g4 z>V+=)lq(6SX}}n)C0xf(olgL&Sfu)kGzpjWH?U8g&axCuU1>{;@(AQ2WWm7a?c&ak zwR{FRnofbZc+y5B%+A-l{R5C4GX-4-C#j#J*Wyq>TKiz~F>)e^(z^t6b-1|N7r<9L z6K^#BQ_YP~?%E*KaK$%WyLo}^6}RGGD~p?@+!q+*G8Jm*`6rb18<)8A(;5`* zHIqB82A~bJ_fa`e8#rE*HlYyT95)NIs+5^E$|nbz*(M93J4snEO3KQn&Z?aBI8Blh zyel}%epS2*w6eUcQcDKH1<6}WzdE}`j{%c51}-kK`6lpaOfGb_PV~G(*43ELel}}+ zoJ|?Vrl-)HG18c-FzUj3`c3DRAs#*^6_zW9XXBA^r;FF2cLOI*;r zNC3thT=)01!iRUosRN(J8xaGeKyXGAXXh$f>D*j1i-|;!7kKOK78!x^(?+3T z=x9rcdd-2q#(UF(eyFi>l|tC5VQg&7Wyw%K&HK?YZlOu?~zaDc1)#HO#VrWU@yl*fSYSU=sVt3NEP{14s`^ZUf#{k z*VwaV7eO<49U%mjZr_H~mQV5glF){3w+fTnnaOTuDlSshQga=cHWGWpOcHK=oKTuS z*}M5igFvCSz3-N|1Jp=Q$bOeEGytPzo<0Z6O_zbYCtJQ~IRvYRL1)ktHEHQW!j?t` zwy9M|oBxNi_ke10>$-(wKcc{~015&Y5a~*lj-p`b9qB4P^co0N5fBm30MdJ}p?8SN z0R*HIA~p0D0#ZT=?e3iNeuww_?-+Odi~-3=p1Pm4SDACJ4LKkXd6_<2#g%d?j8v`; z;Er}}mbKBya<66%;u1zi$H*u6Eafto9$fu^Z(3|3GQ9Fl=f4h*?n)H!`uV%->Yx^Q z4j53_Zm!p3@qZ9Te5J`d)w&YWKwizT;c(3De$jn-MjA>OHiIJN6%-h}e}D*fmGKsCYu5+r?WaaxHu#WC0W`8l*SOi6nGz|d7!u-n(cqO^ArRw zr=Sz8|Tw zhqsp;%R;0zUYwC;L=&y++(LQB$@qbMN4uci^1{zbWMAn|TR(={Ce-_8+g3 z5u~H3u!B}~l8&-+C9tP#H(gy0?)-z1&PEbvoLlc1(umYyCbxd&;`4}ql$H*2T%{0Rde~9Ql0JD5_;Nvi8<*}zDBVQO&$m!4DP!L{ zn#=;PGR^F*fVju1OcD@pAP`GL^?ztY0}U{l@Ndv_jO;6 zrWbLF8P~5Z_&R;((>q!ogd91;2&7^hSP(b%vP$yVN|x$d)-_hU=d!MKAURKcIx?a^ z_3Ht-0w>}}_6KFRz*A*4hn*UL8Wif4DVuaNux2;)=9qABa$@ds7mj)u2vM zPfYKf=~rZ|@YRkg$b$Z`!+TD=08=*y%q}`Qypw1YrqVVMtzpvQjrq|bXy&vp9~SsB zM`hIrJXTjb_R+ zx9toC5d*(TtQ$-99v+;d`Ry#GL0^aLr<3i*IdZqQ(6O=C^~zS75FP=;E7}5%!m}_Tcd5^u_%xo-eY4Et2{a0wmjUR)1PgaF`8gQLrc?IiF4{F zlBXRo??_{ZCpyNiW}}Xvpnc-wD~%+>7JQWaZX+WJ2~@Sq$VFsEW{0HZChTpp{^nL4 zX_!#8Bz1JK2ga~){{&k0B17IS|8ubBcEJ*wYv(|?;p9a>@!VPop}7)vT~HldFk)*r z`W0dK;6$NbgR;h5m4Sop@j?*SKlLvT%xtjTyK~17RWEc@-a1yNqXIILXBu6}Ij7F8 zU!@2VF#UDY6+is6L#H6GY}gc)G^VAdn>#g?o=rTCn|_IvCjs4uhd9&N8YZ4ND5g*j z>#XgUJR}!TRsd`OvC3+~?~so%i{e5EIXfZFrB$&)r&|-*Mk*^1N(X2B^07WSu{s8M zz;!2=LL6GV3<0EACL2!QyKuw*Ys$EfMQ{;%D^9lAe!Nr(AYtU#DYn|cJmS-PW?lj5 zn?nJ zL?L%uRL>wy>WIkfShZujTCf`} zKL+`bA&^BMDeAU3ltJmKUavH4BxBwkmkUA;cBs*6gSKIP=?!h$q``K>syKj|JME8R zCKUCu2L?0`wy@y~Knwyhjl3t>{}ikG?Rp;3&*z_xPF z>A1RMYLf{r7Vi@ibma7kw_ZTS$9JMU%(>RUNz|Y-QfuSBY5Plgsw$aJ&z^sJR{ixS zCA0lNV7PGQ3pZn`=4LtK9~!<&F7#k6<9}bc*ZGRPlV2*BeKG< zs9`A=K9cjq$-?&Fe29?6G5N|Dk%duoH-UY5v_k*u3sqE?@5ToJQnyW7GP-`G3cAzF z4&m;;A9f@bJ#tKZ6ZgDNr;!X!=vNUQ<_js_S;XIDtw?f&SbTURaqOysRUh1KyRYBc z)T7sOEO&%F6y#|6_oif#4Al3+#FK^b!jPg^*Gwnpve`9vP4|pvtG>nZcqtMK?Z!0o zLA)q@=^hkhxISp(pp9_LwGoBQp(2Tr>g2;6PYxc*Jhr3t@M_7Hu{8`QX=Zw#ou0oQsHbDK^Dcq&z;ZdWF(nYoBagfOpVFNLRI=59V509r@Zn#5# zgvae!&&}yVCm9y$S;A1iOPoFm|0P$3{|xE zKKs+#m!N&;I0x_X-;tq-igH|=ba;M_Y7UA7fUX=E!2#cc%^-Sy#M^OidtHTW>Xcx@ zX#FiP_TWc)UB2yey2d%Edhe$~?Q>IAZ*Nsp>-2(lp3N_scNLid#1ye1<$Z z%lcm1IrN_~%I*LVRa=Hab{NI!lK{R?;umXlW&`9;PEK`oF#oYc7~(Zld~(9-`;Q;0 zq20Iz_vjASf`#sMkV9AxKaJ4R)6Q~b_8I-4`}c*D&kKGUFW!TQQ_j*vEsf>NrB8bj zmy3>eCS|~L1xh!@A7U9;+;gtvBlkTSeaf!S2q*X#jcfA_R*OR4-2N!SuGNd&3Q-)k?%^P zG@ovXQz59y1n`{E9RkzLwO>!|b=?2ZZ@2O6lt$D~2;>dEq9gX#rCz%V?<)z4FT z7dwUSzxI<2yOzZ?R<{&8Zv@Nw^w$vm^B2i@#xL877)Rvb5Y5XsH_8k76Zp_1tN7S-gnWtizt#3RBg_yXWYv^()jpyU zn;^b&^}6{wj&i7y>x|56z+8K=&$umReS1^?H;WDJP$b;5@9toQzV=tRS|}Ma@!{`F zn^-pVYEFn)G2UD%QyMm2*J%(u(NTFh@o|JD_w+xP|JQ2*{TV+bD0R{_EHC$3vixP; z|FenSZ{TY#&b((VkqZJ{@+_B;Wh< zeWlz(Y6_n$p9obQ{bRp>?&!xGWzGrtK^Q2ki974(d~(&9{ISEFjmo*}i?PUz@?~Yd zKEi2>e_UHz_7DZ+!iDv?d@IPa>QUt}_hhoE`IWuM3Y=9c{K#_V{e`QAbM$}1+0yF_ zqA-JyE%*v<{%^$k&(nfG1CLJJ{OA2&um7*&gX2E(|G#YTmqLaqP^22)0vv3juH2 zE3pL?1R$c*&r;H-Xxis-&hq;WG`iU73O2m zy&C`}V?z9r)cu_4$-JYScpwvWu}`qfJkP`~jH^R^Jk_gVr%xO>jR67wv;Vnl@TQ05 z`H5c_K*E*^U|%zHQ_Icomx7hg6^@!kD`SBWmP}$JGeB`nYD4>~fb^}u$k4NVp%AWV zU_eIR`V8TtdhC$%_2hgv*GF?sAm@%Jy(!TlRZF)Y&|&z zj~YvDAQ$)YL`1iH!oRNt634^|N&2$rl4qPOsdxBQo~*2H31ut{7q*jk(Np^V%;luT ztypK2aUM4R^}!w)m@pZEc}ddH(BPB18&f1aw-`ripC29iuUe9YA3vO(X)c~MSwnm7 zs~|JTc`8U1#m$n*Jn%4x29*Rc0rwt1(LLT6J}6JL26?-@oSaA&!dR8D)c$ytpw*On zLzT-=EOj-FeD@D|7m6<;tT)bcs%j)I2+VA`n~5@>?`XbHnOV^miPI^xWa*napqJf9 z=J&>Ey_eOuvte7$$Y>|5v1F`1DdkqcXwh^Ix5kPGo%BimmO7DpDqa_+SNm;J^|6yx z?!4*4tD0X{t+#w6qvJmF1VwZ-KMB!O+s)&+k{D>fQ)R%;e!i_)fKgtd+wxN*ERVxX zx0iNN0X9^USf#)*RtAY z?GBxFZ!^t*re~n14Sh0`pDXV|k7NmIVz<|0z-a1xx0`4!>Fh)G=MI!_niKA*VW+tt z2qna!vhtqJGcPq6JnOe^E14Y7du5dM09=$f`#5cP^Q;pmt6<^K9&|wPKGr~Q)$IRrZ31HY9 zLHr;du>F_{YYUKi6z)0r1HlC>ai2saq%#!N$tH@rY2N(3CoD0Uo!)jlVjrkE$O85D z&Q5J}*+76|mG2a8m%qZqRadY2&g;uV*A{$}z4W$mo>jGQd|p)kAURCgxU8HaZDXh# z_iVY1OXLA)|H@`hhI%|3pqcn!w&cN92B}cnqt0|G9gn90OuMu8EM$;{jJ=Sd3Bb{S z%6PcW9-VZ!$G^TChW6M7A_idHzWrBJBn!I|#jIMIH9!Wq*=g@B>?Sr2C|Vf!ggnDw zFt6d=A)bTj*m)Ys<|nOh&zqRVWoZ|PAMTv?e{`@Kp@Iz_D}yJ4dN^~u0Rh_!OJuXH zy5)`79yp`Xg>Bcej`}p7H6HU}(*=x>mNt{G^GJgQ3Y7Du!{$k9=)$3oYF92No^()V z7ePF!tMo!H2tuhlc8`bdWLm!4kIW;%xA%mMeHy{x$y7Ds;OhE^MGZtdq`JcACi4J5R>tJ8w(Ae{%bLtiDz9n3}{MrQM+9WkLIa z%O)~~&XcYKSF6wbI^XhvJ{S0HGT)&hY_A3AjFt~fx@;?b*ZT!F+B;k}Epz91pJ#yY z`X|1g5;Xml^85p!IJ*xwf<%3%+-U$Vb1=)PrLLX_m?CxaxwWP3RY?~#PyLJ1t>v-o zU3;#+)O|0*gTMMI^U3{`t6aKw#fH8={_XeQ_K?!gO)qI)eRSFpRai;X*3YjTCzVwY zE4RkMN%UGhr&#qlI?ryz9P4w#C zPUzn4Vxi`16q%39*B|Y){r>Udm}fV9W(`s9I8Ey@%GBBDm33G7*U?9_KYpbmCJtBG zA3S&zFEkHxUeTt!)?<7}?xMAjxz6>6Hx0LaR>j_jg#l5;-sbFkEsD~!iwsp`1&>bk z#}lc`hFoylE66s}zBpHga@V~uDoi6BcLFmoz#+liiIU*v)(Q;`r59P>r2h7H$Nj0C z+~P!PQ)A_-&-_T7s2P&n1TyLI=ZT% zZ3DYH;$5wuoSVbwab(-*devdkm?zDR8_U~=9rVJo=4HBd+bj6Z(PL4wmPrhfWLok( z@_}Vae3tS(8{ zch{aXw+*6V?qyCUg;cKooV6aQ+xA=8T}cEdbzu~D`2z*kI!oAE5lg{IY`r!1l!JBi z-cxN!d0qOP%c;g-35QO;{jI6O4DXh?EyZ~8hOk>CkfDSnW@`-D*At7DmX>&?T?gK9Ojt-~T6`z_{Q6M;>YMjlj zfA-uWxeU?ixr})g;)>f=ccFdMt9wc68c1S1PO9h%8}=n%EMTMdpR6N@%;^1!`thu3 zO(f~=(${k<8b&^^0ih79TzD*{^^#*{=8MmQS9VH zD2J#|{o=yUm>6^K1|yV_57F4ima)+L=(P+51x4S&r#mdr>6Q=LfWz4$h^5SAu$(z} zP62tA@Z`}~ zjn4Co6lDQ|p3DmB@=nOK^z^IDaFS*9uj2d7t1WYAxf{j{`3I_L`a*TPq{Dr36Pu@2gj?6j~guXFD5&YxS-ThI_e+Ci2CrtbP}^ zoPa?4O1`C4(v1Zkk%+6VJI{>Dw%C~VltK=;X+z8nMU}3!Sg=31Wpp<}H}K`Xd*6o( zP{4uqngskg+S79eK-Gbv7jk(gG;gOR+agUv%ZLSnX4ej-kVrwh^1*_!lz{$>~`;Jmn6ez|&mWg*Wbgmn-9UqChedqS( z#Hno#E^$Tc!6K}4H@d(i;OYFu3zZq^6?C+;x%Hz!8;~{a)32^kJ`>TcY2ftO%`+g- z%p5kdmH0k9r2jMb$o@(LY-_gG?yERqyi5h4|IbI#^B~Ja9b@bEmC`3vtmcrbI^`W38IiyjSohe6~B0#?}CE4Q4i@R1WreyFx2M%q^7 z4v=L9}ruAF3c9Vkd!^5%NzPp?>(FqF!*AjCZ7GvfvQa)Qn6&K&P1!@3w4R<~e zSJP#~K7zPAND%YhK7Hnl)z|%+_OV(+-xbqhWu!g4<=3zK#xIKQ+@Pcc9>{Jjj$1%; z+xN&WAYf>7fZumptf>0Vdj)yG{-6Y{ller*9&9lRJFZzX#sghS9#nbs=#i*H%UTCA zGFm4kZMu1BL^xIB_@rN8njiNm2#OK)NaINsk%4>fxe-b^=8zrii0EH%hsR& zY=!&&YK=?QR>L0EJuWWAh`V9YV@9dpB3jR_j`*qu2cM@;c%WKQQ88R~Zf#xkgnKKX>JPK1m_++hAMl-}9GJYb%e`{5+HLIYpBWoSK>|bL zX=u&+w^tf9Fc?9gu*z=l=n#a>{Py&zENu5Zrg0Ah7D~$N38@UU+&v#}T?do2C~}4+ z^u_}Nl_R1q8AT^2dO-XUcVAKgc=O>J>u@gGkU;9r$cD7XU>jyeA0`kelf4@aF3fdf zx_$QV*tU&jrA*umWfFW$TUOIFU4bq6cAK*6TE^Z=jY8elvKGti)z% zXwU#4r_IdV7~+L_JYBeWhEjZ0UQv}24iWy89=kSxSioC-2BR^&NYQ#>ypwc#mMDF<93 zP+Rnu=SH}8kHKO1N=A?q#Ua{Qp32HDgjV4QJ?_UGK&=i?SSPVReE1yhb)fBC?-C0# zHtfpYV;Pxp>+K|3)a)$X!V{}lzH=7e?_b0W4laUnwD*m5(cs%Ht-az$@*hKs0kJ#0yf#702@NSZWvkE_*nM`ssh{%vSn$_)07M2saQYyH*#5wCA zu;A`kGFXTkSe*ZLc#b(A%qzf20kkM9r(m=QkkTEX4cr?;lgLyhh-%!*0=`+m#xjeP zc{hK;G8Lpm*T)>jDoURZ9|_?t2~km696TbaAaaSiFKqT!uqj98fx}e)l+>$%h2kV! ziA2Opa(JNaeF$M%Y)Y8Zx&RpPN3z{1f@)%;F1w(9ln$b(vWjA$HN?XYkZo7u z2aWXxCm|^;2e=Rv`9KkDvz9aOUOLqpa7d_Sq4y6&Y)8v8o2HM=nFYxP4v2fZRvE8` zx_a8UJHEkU5Pkol{yH(`7(>UKM$8U)Qq3j*4AFkwKyC%M{m>mW`N%TSn}S{ zK`SH7c)iU+5Bzox<9y(b@kH_M{SL&+ZnT(YNdGL%u@JqR+!|kT(0z1Rg4q?*+w1l@ zH0G|9+WghsEz(*`WT>z^XfY>Ppy@#H9GKfPab6L!?HU_|C4Vl!6+^2=tDlh{Qq{Nz z8dIxR>e&BT$XYBqJ4teT&uJ}4p&S%3iS%^;wz4k)CKN6w<~uyYwjQL*!sb3$(UsGt z+yTW5@G_Z>CbDO0ul{>`h=eNHD#6uie(g1R&2!2UQ@84y^7E%;9G?hVPv)lhxO20w z=Z?X@cajz(?%eSUr5z^MUEerS`57}$T3h&HLA_(fck#MxcbhDtfytxZsK!9qLn28! z+MntkpojZZq;x(C-XkA0&`~tTqI@$V<$}xh&c6@+#baUZ#FsKgNo2N;k*@}k7)CW_ z@j&0K^112emz40AF{VN+5g3-}5%#JMngwLf3DENNC`Dmm_h%~gzmH@52t3_tw*y)4 zU%h=)I19cHObmx&5ihTm#uTo*i!YykhYI9}pIGktQ~LeSo_-;!-6Hc_H?~%!Qb29i zKwOzp23dTYP07~IFC4qltuIDV_|tk8*Rs((bH}v2en83=@d_mPyLYe@6MUxvlwWf;$8k7=i__zeicYU!fB*ObEnZY`bqwZkdJLQne>(*dkB}P;0Z0%{u3yWvzoNO&e z0(K0tB=VMLMF)zf-(c0vc+oedR>$ltq?7p3@n>2B4huT$S0Ydz_J=o8gi(}-Jb}2H zFy@X3?DYBXqqAnLu!Xs9?o-5?-hk%}aRwK0a)z%VFW79x>$MiGa`o=}h8g)zk5S=_ z)znhIzx!pEbMZ@nDKF=Fm-7(4-xrQvR?{?!0&W1VzTZJcy;FDG$kTUANx z>6f{c!qqluNeu%pdWIXYi#_KyfXWwzzZX}{y@7#N8s_Ch+w3c*c}HCcVv5i}G)v;+0eQ+-9Vfi>K?fuRyikXJ-|GlFR7FThGUi4p#5| zpv{}*95hIQ9iLham*gODBo>dEqW-)TUxiBpa*_PrMVpf3PTIU6X3}fmGMDg-qK;{r zM_z|CfF3rA`Jk?jYGUa zB~R>WLfg0|IC%zq`(^`nHbP;sNpVQ7ru&)(UTF&WWBd zpWXVsLy7}b>(~3Ub}CeUv;hUVx{=Y>-OuNb&BXrDmwBe8HH=NS!6v}KQ^ z^DY}o7X-o4K=BTP!75&1L>q`*Q#^T>kDa_H{u;6#IUvm1pm2_g>O7%utybgNJMJD< z7m=WJo*q_6gWfB3DH=LQabfa)E}z>5by63dS)gJ2W=`0dbJhR2rH8)JhToN50U6$3 zT7ZTe44%1>_$s=~VyUkeaSzpcW#MwIroH|#mk&u1Iq8QKA0MZQ!Q0}GKNtZyh$?b~ z0G>Cc%XPzVcw+mk)0x_f$1Y!5_zvVA(0zZObvs-8foIZ5Y}V|Y^s!AAh+MCECSFat z)yz27nSrnbVqOeFS*Q=cw)6NygBhF5vCXCNwcL=Hm;qE>;90+rlT!2KZ&NnvzdG{W zPGf<-T!Zr6-cbnsd!HV)FWP~Zmm^4P&VMzW`?t+$<`k+UXCA7tz$wut zf8W}|u54IGP09u_!P)Nk-TG0VkpBCif4?AZp7cBZ$)@z_KYP^w=hKDPzx~(hH9e;; zLH=Qx|Jp9)+-=A|t@3|;B2E3|zZZL9dI!1rKZhm1llc9F^d`kdQKPz4^>*EBo^))i zr0shqpM}@b(Z?=5<;-VkI){>2A=lEC9a15=+2j=UVR^G8+-q7}sm`?Y5RXM;C+~RJ zhFO&jZC1pibo9`g&CShurFnX1&d3o-@v94e1ossgee>8G=~ww~3KL-`j$C|_}*l84i0q{i32#u)|TKB0N1J-cWF3F^-M?gqZTWh zJaJM^p~9{y%iN^-k0Xz}0}aZ*wwc-qQM(KLMo^iO7&Fp-cu*NO$k%j2fj1 zT4VA4;h%*VcysO3?t&T(r2X>B%6(Evo@yX9wsvH+?QHOFIWR0ygD|?k^o7|(k;$U52WlIs9LR7`t<3NC^8!5${@rp${`;c>reb8 zL%krybP$!7Z(W;1D$;oFQm+e9B6mSB_`&qx&9nSw58{c3Q{i5=+g{tsp^OspKof~w z62wQw#H6h0nh?{Oy^N36I*dE;L=wqJ$T_em&O4T6I^k<;nK$U{GAkR>t2Zxs?sBxf zs{K4hNg>+1XgE$!DK;V9-gg5J%JJS6iR|Q;F!@rZ>LVsy{{qnJTnAe51qGao8-4If zWPWjJ3C{`z&Bi_Cn`mZb+g$52YriI-dI6Ql2JL+8Cyc9p7AwOyY@pydS2@C24&Z1| zc7z@`Lx)edAns>Dd;y)?iygC^f)6a}o$i_@>Ogt`HV&SV_Ooa2lDt=*g5KR;I5&X) z069>5ugMpw3F0bzeD7wDS-HYi^@HKMRlfwDXjEZZnmjL>6exM*6o4ws%94%luGNfS zQ|bYu<4pjG>1%SS#!6Hg@RcBy3C(u^Tysd*q{3&~T&ev!ZkYtI+Q>cNUmI7r)beNKGA(;B2tOuUfT+KC`D8z>18kv2LYn#ngeEXRqcK5NNntBSCkoM%J@ zrQXu+#!YKq058Q8JI88e$itn>&Yoygg?ZJhBy5xokP*dYbyMgFk=X z^L!d``}Q|)Td4OHvJM*5g3Y{sY|wy!r;Wbh@81uAEEx!7L&yj{gOJ@|r!Y_79|>!n z-aA9D- z^b8EdmFsf-mE*qeeD)`(BasOiqqZXzfIVw&ZdnT29!ozzT^2{I(&Hvo*?Y?X=*jA! zrBknN(&e{hoUcQ;xngpje-IUZY+Gue_uFIc>bWIt_*;BVQH<|o#RmA_e<1*Qs&FET zg8W34-$GGi?&AkH%zg9CXV(K$31V&)L&Ct0L9>KBd!Z^9r0&Gkz~^Tu9U=FS~B zN&PC@Xdqsui8kbDx)0WONxdbn_zwH2zW+pRiHU66s}jqmEbhXtAz(fL+M!}@+qMQR zkyR|Y#<12?p@hV}$h4_fnM4J`FY#jzzn1#n?3WD%5)d&^au0x0i-lGH+^BNs5Zhf@ z03yI_Y;1In^zHOmprquZgY+;a&PW#SLxFgsFtTDW>a>gzK|G~sAGplEU*Ow>; z9>wl38J=$nG+zpcHtMJA&>;<@;n*rc=N8R|gOw>E>)^2Wefu8H`@>-~fY91XG!*hp zj=5$iiC2&M9EFcQIuOIHxB)#R+aC770$PSp^6;QJ(4m~Oby!~)P9 zd;-qzvJ~5Ukt&iJK3~-+C0eOIWtz<*b}NKk@Jhl9!bb-eyla=80ivaD20pi63u^QM z_#iR^*zuU}>KHv>6^-osoqZ-Vb%3Uj2l}4SFNf6h9;8=L=$IiQXs%!HovbK_uV}n~ z|2`^FFso_f*hnr&E<@b=BHKhLC@I$h4uIPM_r&S^Y(D%16@Y{Y*X1fJ?}B&%*e%d% z$YZ>tf@~+a+5Rkl>R93}V6*VieA|0#mID1NktLLoUk{+?#yiR|(RM)qYl8VOQotr6 z0>v6RSKzRJoF6h2J^CZTCgAO9dcy@GPN^)i2~;$3oUIR}?$kuYet>^y7@kIpG-R5p)*lKpa1kQuCvEjhPwd=h_! z{_<_m*jBevFViKPJ?^#VU1(Hi1gw5(+;c^=q2+!&i=Zd3h#^mnfn7l0TNif;vW?~C z#icd{I$?$ORq9(xXtTfmZkRXH!m^xIqScLbtxG_Z`vWAEC0)}h|OJZcQ zNBw2*nmKlTBR(utNdb>t&Ka4SLYf~;Pm%83y{p%KIIF_l1*rFY08tkXmRWBz0k^AR zJLoOB!Ny9AVj)Pm7k7Geu?h%ihEf);7Fzn^q^&}G zVGLq=aGwo4X|s%%xK$S-2-ajUa%iBDl7WFCv2=uU^9Sy5Xo|T1CY?Z^T{bLsluJzTE^w4{7iVgF zxb50>A2j~r)RHA1knm+k*-mv08XJo3u{ZNEc|g*Pi#M#BuY>1R7AbEZ{JI706LdzY z*ED>C53I#iFqC_A2!Wnfe^9y>f#{hDYOUoh3Mg2=Ol-fU13=53%OgyOcMp!&s62>QTt_s_6ZMXbz=EEHB;=v%`jc73?}IU)?rctR z9xZ8p!4IsVdFwaz%1o?%56glgMNEq?Hd#JMz{VY%qq34l_;%KjgKq)xiy^BjcKAIb82g|WCY;>TV2N5T>|qWqH-Q&}Wr5}wr?f>lZ7y0E=cG);{KFWp!-;A?XMxfSarN8F$>w7?piT)F%bD6bfUoZJ+dT9!2?O*S zAXIQ`7FaM4=Pz8)kdg7{{cudwee>6FRp#i^Wz834+7(xa*QXb_-_T@Y8p@d7(jr@^ zTa`LgA>p~rDK4%!UDoRAK3X;IqXHZRPvQeDFtv6JLx4H&p?#~SrsKa^w^D>?K7XFk zdq?ix+Dw-n^6K?VsUUWdL;3bf10fFnFSe!5{92(+H1;X58KC9^WTSxh(pB^psIWN& zSTYQqfmV!tzaZUFWp|EXOlEZSN?@V(!i`50p&o?ZKj@fjJ3BnCnpC6 z?JE`AHC*!G)-5CDY=u^pvnzlLPP@V;FGxVP`5fH+z|00H6LJSHO!PhJ%U6GTQI~fh z>Z@#bFtOD&q-lI@AMnfvheE?BQK9ONIcF` zK8t3_WMNzUBUl~Umw$u+%aN|27gNvqrYOZblSjjD*>dQyEopv(w{56vT1VKni_9tt zzV!N>mUg=W|l~69pa@#%frH8tNZ7unEJvJ>;7u2ia1wcu__4?HW*Oe z5R$9``(Nudm^ zkR;qFIVMSpDIf>H6dLFxW~uJ z$|`h0E>XeK@*Sv8`ll*DzL$89TVrnGgEHf(1Q3xxj_cA98WI^7S3s7Fq?O8pK^sJm zd$q3aU=r)LK)l|lCW}mz5``zokzuRjK6s zfdI&QjHJZ)m;XyX+}9zOG=i^wsv3%#|75k~;&c6b;* zFKEA_RSVlPE7j3l>a4=`w<<^x@zF`+#==~+gD5aXuCPMcCBjlK(#*8j4IPuU2Lk~0!G<)3S z$jOa#h+<_Ug;MezsM8E@MBQDfFEoF<@GrY{%S%rSjgEd5($%?0YD9;*U2?2V;CP7O z>M@%*`Mk`$0n9EPMMaksg~h15vx+^hWrBV?Iu-!vVwrqT7xzBa2TSG{xzD;rMJ6o# ze112Ai#A^|@K)1;$j~zs_D(n!bW5<4fg~rMYVd3p^M<0WKuWzCK3^+zYfnqqLAsVN z=;GkCFG92@^H@suP=U;t5Pp3UZ&)YA=b_*(3z8v29A(Y7n=IMY-5vQaM*IjgEsdK2 z0FqnHNyEM~+g*Fiu!WCb$-qZ2C^&#uE4Nr}l#cG>Lscy%j}g5`#-@)YIPQey#T#w!9t$%O$i{`PqBJ$;uVR&j) z;jxoLdTv0EFW&|E|3akz{?pShPW;0@%Gf~>4OY857)JZxDe3u4KQiq$+wTv9g)(@d zAn=2M|89rsS&yef@owf>c95o1g9?c*wEJHHs_9!%GTJ{ z(bT-hce_3$|M-SpQNKf$PygT0aJ60~%rUEnG*bKqaeJ()`cYD^RB`LQydm1sC*NBs z5X#>6rw-#vY3tJAs;}J3N>!kHo;#`a&+0~hjmbhNKsHmd!MN(}e^vM1*wQCUPXt91 zxwp!O9s`(2>LV7f1t@R8^e!cLqtm?is7a)11D;sL&axQVkf(t!zb`tuj)?x-7y$5h z<;0hbVi7o4F^x|RIMv5>@Kx&)6(!q3Cd5v3VVH&;N)>=YU}u0H4Qps1&8`K$4^eTk z8as9zzohUxe%<{H_6?{%-nz50Z4)-HdF;xW3B?PV41>%fSVR|y5!*XBAL96q4@<*E zP>E)LOPd}pb_@YLR(@$-o`#7Hu0#PWMY$MPQwG4l+_b67l=L5qz)^;a*{k}ek3R_n z?@2$yiB7m=573QtTYs=PBqi82TBU-cb(mk!Qc^0^F3`)a8rL0ts`X!cUXelSAW2yn z@Wn4*Z>1WF8o?>3C_cShlE}X0DZwNtAbQg9Y)?MxU808LrB-LM2avXiwehBjlaxcw zd^f%eSMu@W{QH=G8}K1ah+f0tq&~0PqE`bO(Tz#mNCj}8>i?Ta{0$=SRA&gDCU7|l z?hSy5E&4B{6|!J{8o~;4{+iJ)zy`4z*k#u*{};B~ki^<^#()bfY6KP&;4;rwyc=;* zgG}y9Fd4rqsaiHY4h35pR-J)di6FURdms*K6Z1F7&(I}F_@T1wlSmeJE=j2DY2VyS ze?M8#ypfZBYBKhJ!QC9&`w9nk&o3|Vk-tu9b*hyB1w&Yw5+&n_v}ze7&K;v~g(WU|SVZ^xFJuigQ!>QPbr=af9rsw90UuAa=(KWK&=<%2H_ zt{O{SUQj>TVs+X|{|_??XC2t{F{i}q$|Kh$4Fx}6qbxM!j=tog20+He6mEUjxiB)^ z*mXJ8zjC?kq~AR4T&cCRZIRm(1CD6+Xa3fMoDFvI#>xA|mxh^z9w8%Vexe+(HODXK zOLH(}Lia9p_~3+Q{Ht$;>+PJL6$+l~*5xnlwNe!7EI%+^D-sb|%PWL0w!tr)G*k|s z|CCz)QH`T2E3chC{{_6^wjo=j5n5@I$+PS*>htLu%6yWDTi)vR*AJ_>@`|^#kS1s( z!xmh>;mp^j8NN3HUrH*Pwg=P7+TiS4C^g_ay{WB2b3O3pOvQ< z4G=(jBg;<4J%_kj8ogJ5E%^Ny7*)8x+Prh371VjXB*tK-5zfiDF@e!ZL*qG?CB*Cxp z5+w4fji8l{YzAz}&}UK#>f~9Eu+R!_6hkMM20noujLVA{8jU@?qjqq`mbBryo9DU{ zl)6$=nG9oOlBk-AL2-)+J?C#Fs(G1y6gSGvNPDCp6xnVxy-tjk6qUWgm%jNE9cFnr z<8GGmHn)CwQCL}_g*tekr~LlYAbm4I1`d=PY_WxpCQ5jRSZDt-Zi_L$zxQ%bY0#5A zmA6{NP$q(ucD7l3$1t%#HY14rC#-K}U+-yvK8E*JMzc@V4|(-sOa>M$UNkgR0%pKx zj7?v!deFGB^J7ZoqmRQWGpc6P!RpDsg?xL=m6zviHMSyeE2cRHlZXhCvnDNE;~p`{ z3fb3lMWS!_S=C?7(t?eI}IszRw%3^lBqT2|^p_oSjG{QaAA zoS84f$@pIZtmoJ8tRADcKh@hXY|bz*3smfz#llXT;R)H#RE=0wPna5{jhoKgY>y8o8wOsnmmm1{agPNwVd`Z|-&Dqrw6&YP@vFTKb_z z;XD#G_j5(e{HNVwKKCb`c3ZIMS@Iv_>`;3B3eB~yeI9S2z*`3!2==OH2rskw7E6JE zoZI)HRL<2iHisuYSp{Qk@`U9$20Aa{{%wZ~9uH4GSJ!*|?(rMNVR(5_^Uau%Y0TGe zqC;;w<%6v$>Io8hY!Og;iTmGB_D6sG9uTDGlC!W`;GOTcrb;c9>~oy)&^ zY%poW~R^<1^&x_jo>ZpE;MYJPRjO+$Ft zosgi6VV9?F1$j%O!*12G%F#O0K~G@H$_8~wTIOg6EljPZjr{<)pn!mws3G2h=Eh4x zWp?P>y1kRyj9djePYV_Nk+yMyMd8NinV&BVg}K!xQ_W{-#3DV5^|Z@UIobZ;5D}`z zAcR)(HPbuNZ&_eoTvL~~DW>SS(O5i*xiTF&)SY#wnX2%oolbDWBa^dFIXJlPi~Mx5 zE;W6Y8#cymMD47@Ubz<2{zl!~)`}=L&|Lx+BG;{xoz9c|+-aEmeEtDtK3MZi-f?EoJdx9-(DcBv#7uIB=ssYPk3v7;W?Gncgd*l|7ZaW;JVrc9M*{l zJ80@?Qr6o^upenoi=FJFjP1(yIx$iYo^*4w+f@#AL>qTB|HUI&;_BAh)+>N}(jy}UI7`%Xmeb~d`(@GY7p;7j$gaBm5{b)nSuJpzS^8S6MYeJS! z2+@3r;`bhQmdIGl$tV=&a$LNgR*)Bd6>me)!73X5{iDt~o9U?0YU4-aXPZUB+m~J0 zzw56G%UL`*tDtoluSxpmi~EjbI?yCFXEQQyTQ1y>I!&#r-3;u6wqwe6cTaR!x4_B_ z1p+Dxq4C!3NtbFyndXD8H06S0QkUvqAXTNH{y%Je2RK~axA!CxC4wMPq9$4ro#-uH zh>$2zMhg;A2csK?=+PpgMT-a`7(tlPMjtJR-s@n(V2nD<;5+Ykzx)55`{tg9hjE;J zc3JDU_FliW*Omj^_l_30bpwOB%VIGXos>_8H;CUq59glH7+O48D zhRa>$NBeXya%hWFAeH16xW2*X9~Gndq(7fuLtSRAMi5&$!?tm_KgaL1X0gWV zFcP)vwxOzzEnd|rPka$ho$=9a6wkuP^Yu4>Uwp$i>H`nij>9smLCH?*Hz>U37Hpx>z=kW~mZay; zt8tgk`|+&Z+KGGuuhadsqvx>Gb+hg|(BMRbevQsuWsVf_Ar1;yG*GIO6_O}qTPQz}MPBs9XF zPZB0kcsA69CtL;SE(xS&?%mgMyc`l*-=U}xN%aT!$>9!q^mU-@-sSyHXGhD_3A-Ck zZebAx;iAGV*OP!VY>Zz(r<@cG4J8BO3^;GfTY-L9!J__wN$Uc^z=|Rrs7SjLfa*^< zc}t&?i+}pTljltE-p1XQ8-_vU@bQhjy?kZ25?ylt`15BUW=$+heuiDzP+6Os<-L}D zwdN^Xja~9X@x~Pf;S9l><5Ne^f0t@O*Pq4ZCwPZH9m@S(u3a`)@G+u6rVEQVu#JCS zDr5#2xQRuMxJPeNsX`aaPk8_bw?`c8b9gs>|;+r}?+By%0@@^F=HqcKY>aSYJ zBNAjj+)obptSvwj)`L%;BQ_SE4ZPC%mDff9XFd2xIM7`)QN|m zvv+jJHK$aOs(@I7yUo;vxXeIZo>M334A12-)a$Xxf+|VSx7PvvTGf{-UDGFwfwjYi zqJCb&sB!Vn!6GtSK~(?)`GTX|%Qqj79++HVb9@oj`=zSxIAK)n#pLHkAWO~qwZwnW zO|-&R&ih%IrP+%m z=Z2$J!@&%pR9OUWSPV8s3hg~c)(Q0zc2sx9zk|K{ik8<4i!0=VUQik5iDxbsP;jXX z0xpt)XBIs+H9(LfDpDPsS@JAplGzux*<(fVc?8)o8L~mMu4dh#^x~RE#$UkJw*^>$ ztpNYyGSxH#(Q+$UH(y+je<~KJK2}2-m(6LKQGCBL z?3@g9Sgp*n2TrO;*;kUEs8#ruMZ|#pZyE`xA04vBtZuq$oZiDMXC9VpOvG^Yd#&s= z=D$vT3LaZUFp~9x7mzo9K2!!S_ z^-Gc`H$!9pmHU{&@9P}BgLBEwq~4Ns z`Udvl3ASgY6A_+q;$n}@b!A>&RcDyaq8~?J>v7KQOB}*keMUiVKT-2k4gBW!d?s<| zmCVfq3tN{|Pe!$xbE8TZmh+L<;`a;bD9IYx?b>a`t_$D@+1UMjMQKwP5=FA9pc#Yr zSm_OPS&h*UD9%;;^y+Ix4slplSE%|n<%LMQY|Yprb9w1*T+h40KR&m`#(N&h3c93! zN;RF9H&+R6$K|j0KJCb<`zE_^>mvoxqdml_7u)I+#H~(NrL@_bv3&hkcSUuIVvNYt zO%{1Kl`Gh-P;%ou?UNDDr(9yS&4q@@Nkz7(r9pV$iVD&vwW zK$6^&&r2TEuN=8@$3q$v=YIBU1VKL$P(JVBClS}OmCWs`pB0wLc2y^e!TnF8uQcvS zSY24{6dqU3T*xQ-<}T=<`>gdoAm=7J%%yEgG#*;nZ`VT>@s3Zn?nAhio1$5=LwNFB z1YrqqU&pojj<%fBr?ToZ_1_p5AHGKt*6jK*^4TgCANoNGpE8T3Cv#R*HaQo6cd;C2 zX;oje|Is7(ruvyZ_Rc32Z=dZgXxss|EA;%VyS+wVGdr&njS$(35waDH{g$5`sV(*!h&K3NVy7j+%8<> zgv^@xsi@rPc8{9E?hkkV)*a25kD}fWSE+nYOro0v;@eOo+(>AtIL;j?%WzASrj^?vvcCg8XQ2WQF>lO&Ui(yvyA zE88BgKJhPJ03hl0Y5Dcg87neAp$lF$57psUm zWHs*2=?jX~{aUS~+X|c|ymqP$S*aCBeOmt)nc4CIt2t(_F%ksXUOM9iLr2*{eemP) zy*k|We_y3G{3ptPen<3*Gmwsu$Iee$!F?MlQ{za7!~U5Ujwy|PZYlbEs%2FH=a0${ zobAw;iHZGuR)nu5?;co_27D5v)s2=V(Xf;A85KdgLP$6M02QG|zcaY-pW#qvQH^fv zjd@crk2`aW?Hc>q&RnqntKt8tvAPb8rK5<7AnZ|B{+S+4}WM z>AeuD&mK!bUs@8qbQAxh>7Ds?8y(qaKJPTaYv+SQ!r2|8+KTMeGU>(L zX&}jTp{d3gLvoY9i2YNcv#rJZo@cfOebf8ZZk5;8c$mv;3-35B;e`|EjiUlh@ z=nP}_clRyBjSumqsRcLntg>QG`(9uCSit!H*Zzk$cralpsziph>;TF*-SW0!gk)GI z{nPbj?tjMf|NZh9U$TjmEn`_OqQYl|rAXg#SZm+Yjp+O&g=vG}WDBaCb}tga=hZ6L zr*8kdn;#&~js4;RNoty;TM-0*4p(`rUuX9o#lfpy=>3oJ06tjW-KD~z-Zw)3g|+{vSA7tZC>puIeX zDc3p&CQzT!t>mk+JAUYstt3@aTp{}bIh%k=5a)_KY_~dmYyjgs>euLIM{nOa-O7(N z#)bX6iVq$vegAqsIMv$sM?#%zV|JO^_R_-x@zi3l&`V-N8`kzmPYdP$yThJ)!xGVa8^q*2p&UXAu#0W55%>zHzq1N&5ScwJ%c zhZBSTt-nhcyCn{)KR4%pbpGrdZUchGe~8b?>Q8d&`Kf61_kV9|e*r4ZzZHl++G3p! z+Oa(Hud^I2&$j=wx@qT)_@Q0&mrY-bdZR~_A%jFhAy#6Guu3G_}#xHC#tjRh6&-Ed1)pTwi&}jR3ik?2l%+;S~8{1 zNG53!llVvynNY)9a>9l7ha{_GH6XSDJS`G?UA4?l{PiUghb?XEax2?kf{Ap*Hfc0Y$DhSAU&{SLHg_T^hS1|Jn7n6D!Biqln^n1X4_9()5Al(T!+;$r*&j`iG)g7)RktLPP z|7sI+jkm)YYkJtq&S~4MMTVqkti};(ZX4}`j-PU!d@7vv#pKq`yh8S(Y|d%nsG$De zqKUW(%+@(1qbsYfY>)}+q*Dc#VnZj&Tq9+nU;}P5Qx znTvvxN1eOQ`7+0Y{vyj2uL6ADJL4?~c2V^<)BCplB|a?w^&CQbuT%e)@tMw9a*`m+ ziE`oN|CG11mUljwev)SY{VI39YT$b_cNVUcF%>P*mGhh*TmtRY5j%lcE15M5)a^qg{45&^Mh})+@su|5-gH7SFY_fvP_-!7hDn z$Scl2Lv!~_{z7z=xz;2!>!S+8$-cUdmCB}P?t)M1)>fp-V>FcS0b6R>ouwC-t5+cc z$B00|&GE3}VfUT5j-!BxElm~o5SpEl6%Z^G-by?BjK+i}-weua!O7y|a-Ird9{Ky8 za!6w-I~>Q4Fbwr*SV)`Sm#%N-M(9Us(*`RHGFyd)aJzngy#2J!ExS1dpY~gVHv&lipYD*9Nn|HTjyWPhMe7%0KG4wbB9|v7R`Q%rlh2zt@L&pRBKi>4&nZc!l8*2R-I>-nQ(t z@DIDd1XnZhIO^Jrle6V9~ z`pM#^c$nl<@#UeW)?H+h|B%GGA)U6Ijulrl_|Y}a`FCxS4SSKes2ZLY^F*8)76Rd80I&6iM2i|-d`@w9k6Z7dt)K^Ry)@5nhUQPTqAcb zMytp#5V)JdnTr%QFuIq`$m@r^z~2@jYt6hC-xDY^R6uvT?a}ttro2Xl&Zj)j^m<2rT+Q-x7%qmtPj^}1LDI4^D__pcA!^0_>$J15mZ%!ZB)+!Hu zoh?^zMH=EoU0WNt`+^ z731<~6@5P-PCpADql0K9exhpRzE8nISNq=A2XGb0hH^SXe09OiLNLlxwQ~WHi>gN7 z7TnUBM1Yaz|A|V+;3VDqqJGK*_2t-MeRe>M7Is9tFGGmXv!VC13IVR(sXgD zNn*qRXrDEK-!AECe~rTE!yi8McRg6cQrz~1%uBs+jBUs7N(0X+I5%bYS(UpCqjc6f z=7uYO_ZOEqP@11|ZDDGqYA$yvthed1V^6HH-(feJmImb$k)Xox?HPHSqNQ+cpB zD!mR1Uvfe&e1AFJ?|KaaWdcSFr7ylYRD)7m21bHwtw}dkiheJ8wfjAwJ;JN)$L{^} z30{fy_9_Dd6YaR+-Rky(Jk+Jv5?7GVeHh#Np-Wpnd#uy7%mFYq_D5;5LZ0#=jd?lN zPJ9T%$nINJJb?-GDbn?5dGpJ1aRTPWv?%0-TqoYtIZ#@gj3lITOsU+{|g?$&j{iplk%XTbO-Y?A}YU6vf?LD-HOp77i{} zFNADW@3O3OxAASXalPr-wJSHsAKjy)VuFCJrcXT>FBoxK%V>Qln_~=_lahDDC2*X+VYtgS@(k4w$+)jP6U>al?l*pgPhjqBqI7k~dw*@H|5E2PB`H08pw z0WDVu!O89JeK^>NaK64dlqILJu<<9M2jrCd`;*fqgdjkibO|HtS=H>ln14wP6?El_ z!4+)x{HiKFi{sSM{M6PkB~^19NQ9)S2mRUZBEs*xA+Hm_V$6UgmK&|8hK^-lKaAi> zGrubPJHD~1YU3V``<(i!j|I)kFd!{x#(5v-wp}~y$?$~VYIU4%Ue@@8%&>qCoGTiy zDaErIvG2PzSxRcG;{daK za-Wsi@4R%)h@oPdG2P-ykP9@e&tWk@ zE$?7+3h@rXKNA{rMbSREEo$gFu)lN>sP8I}YE}cSAi$HR1N|gG2RWMBGu7IZ4^b+kAL`4 zdFJGvC6s=w-`7&Lxw3%?>aD(iu|S-YDcBwha{!=jz*WbGjjd-KG&B>TF&UTouIHX) zs_^&#(@nnEW;+94stw=!O4CY(cB+ymE}xILw@TjgdlAp5Om;MWRJ?(t*1HiW8aI}F4Y2k^wNcov`ocn7+u_t#a84sX z;{MZH%(&(F;qPtb-1ZlRliH@HLluUZkqePva6IQ+E53U^&fA7<=VSEL<`kRoX5T`B zw>YE6eb{M3GqZYFIhdFqjxfw13wrC(RoLt}_cOIwl3(rPZ}>Zlh3#@F_ZthdNhTk@s#OoJEt zb{4g&N*jp$yD64B9eziHsm-3RwmNF1Cx&e-Tx6p2_opUYwmmOtB+ZyU_I0-yN^hPU z;5#Mjq)j}S-gnlsQ4LMKu;o8R)T~Q6>T7U0ml3cC&k5{jQMZJs%(~fUcD#s|bq9Cu zA|K zT^|M;|H;K8=li&JCRn=8b^dWRjMQoD zTtb$(H#r`s@E&{t0tf(d80YbNVK#1QOS!*DcIYllvft4bL0u*r$GI7GdT`eE5ne)y zbU%wk7nCcXTkF9g#4Yv`4Nq|_&#R{DV2M~`Y zGd5GTgr(G6NVYS@gCaZ1*!|M6}#1C0y)V(nS(YpnfDG+O4OlrbQcIP1mQ9K6!8XMH5X$~@NcL7wu3+( zj8elasviQps2XjQj@(urBFYYCn-3k0;U9NEpYp7drv0bPj-3BdD~N!ieF+AyjfJd- zc>dVcyefP`B-@YtckFgvKRT=f8Y@}c=pcUYtO{u_iUhUp}K=ApPfpL9Rts$ z(XFb3zaKyQ=FI1kI}&4k_)Wj3L;JI{ywo&D4o+Y)q$cLCm3RH%d3;u|(*SvP$JMp= zC0|2>+7kVFOW5iqKJt)8p`k>X&!9wCAJ5cbHX%sE>Zn^o&iQnyP`y<|V=vtc@U;K? z>J-hG1&!Ni><~`GA$~)6C68T>9MaO<*6-Fi>#5OR7>+zwIjdD?SF^vrr~CHbk4Hsi z-iLskbA8aq>eY~`Zo=t;rr)1Lh?wi?G_LUu$3q6P>s)m*;i&IGM)mgNeEsFF!kmQ! zjeUC)C-#HGA~?D1Mshc|FHy+yw07)+69hMb+us27Rt2FN6LtZoX3HM&vtEqWh;9lDuYhwST$*M(km712%CI6u_BVb zzcwvYCVsld(xCVGK?HwtAgqIoj+;Bm(lD9EkXd;5X<6AGrsMeo=5Xdl zBDk?<|FV2Kn@s^WFslJJv>}t29>ZeRef6=MA#E}YVu*(gHAO5X+OFIXa;N2<#!K3Kw7~1FSghJDeDCP>({TdKb$(eA1Pc^Sg*7{ z&_xr+u`uAcUJ!{gH`mVuo6{)NFA_s{0lV@4ef9N&0e7WnL?kuc+u(_biI~m0m#tD{ zE5#ou64N~2<|)hb%jNb5eG!55;fsrl8o@Ah)AY1?EBO3Q_;nYTxSWCWX*vB06!KPL z-tN9<87Wi3DEFDUIfX__MMYCM$zi-hImYWrj!Jwh6`R!TlDs7qc+CM?1-uw3&WwQfL&b!te zKEgaw_&u7kEDnLGMKba@m|VX3w_M^1^+3qS91*|zddnTK*6+sB1N zI{oGphPSETEAv8OCndzRlz>5=rZxg+={)VZ} zk;n8P@So;L-es4AhG?*BZAb?%+{QgXIO#QZ{6;@P1Jp!h#X3*bhV#?Xvd{V+M2yYc zzeG}PJPkL}0$-;cud!7DayXk#zYt3SP~)rq&F>?(nkwiy@O#SHxPh6Oku8V09^xzn zE|Q>n#KCtwe(J2{`*%ajZvUfz@MDG3n81L8tyYa)D%IVsyn52$-6Q{i8emUIkzp}x zi~Yn3q)zN=T(F0)6p~hRNMn>^4T``q)p0%61+uVepN$2+olN~_53+rRvYqVx{Ua0_ ziC#dx)DMgL#ZFGvq5V543JuxtL#nwRte$EY4|1kEB5|R>jge33LX%JZ!IbaPK!>9D zx?+;SP-=3u;jQ+Sr(6?2JC{Aur7ostW&)JJ-|poX|8!)hQimi(qw$vy97jfN)~eO7 z#ZRSWm<6MJ%ykQg8ym2VJ;s<{Z{g=1UO#ngk%dJ9o$ttX-k^k z`Be93WA(L-&4)Qk)|{33FPg2kB)y#`12`*+xh4A zd~1To5W`p{guts#9D}z~o$l`SXipjQlCGXEk{cu8r$ZWOl=!>ryzs@FL3p2&`HGgm z9rk`gzX7R+_7RC1r)>`prcPVVob*T@Bk4DH=HMF&8P&%CD&;#&!H?$Q7W*VBNwnv{ z!*ISsY{E{W#))rhHQz~7g2KiD8BQ7~aLGB^o!c(bJKcCV1al>UAPt14uoZ{~(if)Z zqd*Aguk#(rLRLEA_S^e~{uklguHVxQF5wL|B?yW+5wa~VGB9rW^hC^(}Q01W~r;)<#+@m_`pSK%vCTPh&f~`!FLWx1cU%eX(RcIJpVVY zJXm@Ms9evdjLc6fd3iU(LI>hQvY%}iT*ls4^x%-DDH*7*L# z3svp-rL}7VI&_i?_e&4cdS2>XaXY+#nOyd9n|EX$SXK2#48ka$iBebQ*W`ob`@Vsv zox^F}MppiD8yh2R6Z#js&L-P&Q{tg*)wd=|iUDH*G?-myW)`TyhV1654yciv??UT5xpEdWUl+!g9U356FV%2OC*<%dm z-Pf-9&BF7kCD}2~ul&kO<9omhSWRQWot5-BE}pD*Q?{L=O2%%lFSfGJiX^$MM3l3v zZZNTTkNbyZNOnxIcgq=MFa+F65Pi{XM*GzKAiiIY+Pq{DgL6+-rbce5 zcKA6-i*cpC8YrJw2^$e;X&ri2(m^btMm|IG=EO_|7%OI!vG8QX_0(5Muw3Ut3ONn* zIM&s3_bGjg9A|}&?^Kl=hq$}@IqCgscd6`L%r{fjseA^X7&?C=r>j-MwT>^tv|0jL zxXs`dCu-<6ba&cPqq0|d{QJQx6-@5(LNxV@+;;f1RCKdV@f`4#Z;Jph6|k;uqd$ITq1EBmTf-bTFJv#hAJuJc1wOhNwQW zOX!z~%|#j2SW0*hWs}NEwKd$7FdRSP|7_iMMaPzErj6X_SYsZ@T}5T9Lmd;_b(h?e zFB!Z38uOOn-ErHyWXn_itUW_4+bC7Zd}*<~Jbmz6 zoY^1x5?A~*n`2tn_3CGFkvELxs(aiDdS5WGlQB-$W2XeY5oh~?8-Lroda=1(;>l?h zkF7z_;oZV-4cJ|C6&P+ot-o2Qsd3WKE^DTSN!SxHJKs+s*3v4u4SGZ zk#(bO4;)itW}aGe&ueuyJmt;Hi-*(1AQjUnd}(bQzU1B&f~Pt!y_(8&x_)NwcY0gj zY>Y!v^t0nae`1XmI{jL_xr6-f`xi$-sIyB5B4 z&t_y_%hQg^a?U=j1jKcx!Mi$^j-m)bg;($UVV<)YUH_6NOTUvV|=>Mw^ZoCqhb_nb6ip@{%*=$WQFaZ^6=I`7sbYR zc=50h+HrafA-SPLrybNYc~!z=oV9YC(gUt^my;1Q6ur+SkGwr^2>7W1mhw~Ps%pvT zJs?O_Ls1n|M9eE@E&IRrs8cjJmFIfpp$o_`ep=5^v7+zhe~3rs8W;pdCQFdW_8N!f zDHx<91jNnF3oiZq@Ii2Rqp&Fzj_f4)ANsN{|46Dv7c_YkDP7}4E_3_;vyT_8-^D@0 z8RZX_%nB4h|3&MY^Gbq6lHxS57U>)!x4tF(|HvHU#z2L+@&aP%B(M%nbh6ll4lDuycdz z&7(nY^v+rrYNv%ou0FQqelR3dzi)Tn|4hqqA+bWMeqd#Hvg`W52^|YERwDlu`ktqA zO51*r`JmN+7r1C^!oYXqX8kgP$C`WmP{vX|`pNAM9(C`Jf9$f`X23+~%}n;jYlF+h z4?14EMtvBrYtmNuo$!}ncjVdu?$)@swqWW~r z!CKU)1cl#>8$~G*fy7Snfo6v_EA`;k#GnrKU(XQv z&rx01T@kSW|HxIl2(x5}7QQf!1(*UJaBYECOoli!TI#M1LCP~f*91^-z`@7XLwiB) z3oCR!q%oIXhR(buK0_SKv;wes*%H=;7dS39_XTCbM+1u^hI3S|q)HZh_3~UYbIgIW z0fwltApzhVj71P^l{>F=WXhd6>n2-(Oy%$+LF&(t#RJXPQH;8Yu-3i^tVOQ& zMsus6Rh5I$HrUfMVrw7|waOAbMrh*OthYXfKm7IprR_4|O>l&^UtKT|iDJ}QWbm0M zaG0POV4OO3M%o`dET_Cu<2FMl0oAgBC zC)Gx@(z+cp;DQ8*;cL7q>P%46x8ivqK>~^ym+*6a*vZN-@z1>A$z%2|1%jg~s<_!i z(^Cc=B#gwki1Jfb+`nl8UaN%|Rza%<&Y_7#ie*AneE-7smA81XPmVE5S(f-+H1)m8(JOuBT1>QjJTsO%`As0Xm%8BTeV+3y=J(1GzWX zn0PT<>-_Fy3<~0wUmW%zDIB%AH}>Uj)#b9RP6Kjx%riRTBW1e%i;FIsHKcM$%{N7A zo>6UmXLL@l&L-3f!5q2E8r6byTfglc*0boJVwm4n_Lu4xp8R>PBhxfB9_RA{jnIId zE0g6xO5cyky0z$K$2Mh-IkOivsSrQ!g>3hN*p-Xmq-RWZysv3Qq`ZOgDwsCmArOu+ zkIGl(Mbt&0GGx0;BrxsE5TcB_|6qgT?ifsw-m#hii0^ys@?;DmE~d%_*m^~HoOdD& z+n16MdP%88MAy1jmQ>nThciGv&@3QQPr45YnWNvW0)A32u{b$%Z*Jy=W`}yu>xkmX zB;xp+nutL63uR>tveV_L08dZ0?_N3*4JLdw6D)$E1!P`zwRrInty#46ECj|o$&wbI zFaPMchi2=MGSO#Ofg8VaYkZXj$aaDi5WmEe(eCu*WvwjHOIy3vmx8cLM3t9mO}zN%C;}3Av+pMp;C!{R zH-BvBzGL+LK!XW-1t56f2TkvAOjhwn#+GeUlQE6I>Zcv05?fQgqlV6B;uhAJF=mvR z(9)Q;GcO;ImJ|UWd<6RW@j@#}APe|nShhVL5Jd;(k24^)a2 zX?Sw5D*=={U=6Z-7)kmQd6f@X!+j+J7$ywPPdn4Wo%lw8^2R3;G`zEM1pVQA>s_h{ zllHkvL&gaK*|5SLLC6#e)oY@8+ei?$t=MG@iNm3`jm&Fi6?vbPus&TXPkP8Q#?;|N z{Z7s~A8j%QV9>rH@4YGWxfl1oM5~@W#w{v#FUyR!Z#2&CV~D*~Qt0Xdi-pRIKqR9C z9Ly)2#Luz*=BPIXkE*^rJI1)6O8ZDzYxc*>hYJxjh4Ufo&26m5(f9~1>SxnBs!e}f z)`V-2HLPon*Id5@H!JnpI*FbE_2-Y+S1QVhx49ir_m{1Ig)i>*J&U4uW9dTWg*B|Kyjen@ZpnWY`qYWd zY@JhxO?7p7=)7}EBEYrOJq3X>u6M`+S;VttPwM4?3AHc4BD@Zd zD|}N+Yo!FjG##opy68|;*Lj-=HE&f8ihlts9DyMc6_2lzVu%x-O;%cuw5tV>u9T-U zs2UIRh_7Z-OT{UUX@?RkB6G}QM$NmN&MqVQFN-# zMHeDpf4(YjZUCkOAmqo85XFLu1VriRME`m7!J2UC^-IvrPd71HSD4<;g!xS=igC7mVOBD6HeT$#0_Wie@zO7g6=e}yQbZWOGT z*kQ-glTuzRu6&z^J&w9w=Q@JQY=Vx)v5l3#xi7@`ECzfWq6JQfsz^fFFbrk4@n*!~ z(a<$y&%)ZZFy;q{1=C_SCzM&k>LI0I3radcrKP|3MGLfkOL9I8a zE-|b@4Yy?q(qprCslTx{tlGS<{#HDYp#xY{zQ4dW^c<4WhVNR;sRAk98Uq#`mS7u) zMrW2XWa-OBf92q&R)848t?In};~U(#WiI4T16EGsQJ2&89B=h3C?g&H3TKm{t3R2x zv2uEf%@o{u<5E_%6cCt~V7)Ur&4lAdT*c5Y$?pR* ztYL7ICd6tZ~iK;QG|ParA2VH2vWs5^e*1OaZ7lP51psv za!1S6vRAX5Wdz1|09dqr57QWAK7DEZ`;$gm^|CU;6|0dl`#X9rtNkIH@waD^V_0Qp zvVMcX!Le^966qq{A1;C;%0#!rDmnIb=wz)NoJGA~U!CMwND|rV_|((-$MZ~p4IB8I zd|blMd*TaNAAn9(#^RmDoBQt8_lee7QM>bSKfOXLzxQ)p$s~HK)009YI(yj!FnsP@ z`Qa`X!3bMjoO^cjob2ymd9$b9h}f(s+{3N<0{3?>q6*%HkwvbqF#e;%BvetiMtvDd7?x>9)m;u?A2}ukxQzeVmoAhq5V*@T!JsOpavE zVQ_AVLL4p@7Hqw#MBFX$GCpwE3p^m*KnRR)Q5EDR_&JEvRRE4kGJdta`R-Us} zd+N94F}@;8sv&>n+O6oB0m+ z9>G1WZyNxZy9jt5c*``LRGa!z z7B!Bx)ZR9<0u}>6ILq={ld-T49RHSE)rw3D+jo?J77M@>k3xKM&(8rdPAy?ejE zH0GvN_m=}TD(dGuw*R-D{2w>X!+(PQ7dmF?77%Rzw~O<-ajta(tDE@YKWyJ78sAs2 zAuC%++3lwkdOiOC+fn}W2J_;dpdan;@>v?mWccupQcI?>n{s~_>rp!RkCS@zAHGqX zJeAU$L8>|38lsR-NjtF9Q{v3}I66z?a7Y7NYz3zcCY;z4u%j7}>q%X1yW`R)TLOHx zHZ~R4Vw2Ye{w{n>9n!$MrM+9GzSa6s`_;*InR&fH0nE)|fth800am zRQcbr00&Gbv~vfC<$T8)T=C#8^}h?-tGS{6qGTeILNuY+BpS& zkfbsCV*?j-!p^DT=SZAmQlq?SC5&y%gtiiwY0$MO7Z1lz0MD4m^~iU-$o|F8geV~9 z-;RH#c|$E@Rk!NQJX^KqsEm>57HOAR`1|4BlX;=ZDW#S@i#$P9I588ny>~%!S#M{Z zLU$e47eGmtSUrC+U!^9%ng%VM86eD*Ioa>)mr@%uDik=V?6d3o(iSY~pfRv3#@Cs% zm|nw*PpPhS>GX7m_6yg3FtsicqjPEVEbYuQph+)E9+eEajw`PzO;GVie%Pg&HmxaP z8BsDxo1ds*{g`*8T;SG~5sAo?%yPPQp!e~$Uy<9V*7hQ=CmEgLo+6dggO$;VItu%l zFM47Q4PekCS|hG(+@GpS0fkO+wTw@b3Hd*|pB&Qd}x4dROo!j`7R-2a}TXek1A2|1tw!>uSg|Fw@dB3R+@&-imjK~1`= zn1b3Dx4Fb_)0?|2&F(`Udf#iKC3gN6BPZDk-XD1CfpwEPuXaqjrC2b<_Mz$OEaC`=LYJ+HsKR zAy12poYd5G$cj%1?79N^*YHiu7l*f)6^?rd1FWL}Qq+c4FOGNdvE$p{;u|fWMqZ;6 zaf=eD#&=)uOY1V~LOrbgpKukwktaPrxxh0vJN#&xjxa1@?wW!!?ANb0Q!&1iNE5{Z zLYh`iED|U)jR*1gY1OOzO-pTxiEe4|qkvPlo?4p9wE zi}T$KwNY`L*HGvVH2I3hYZpnv7iVApbt%0BUJT8s%E$?>-8HFpWZ`U`;2A0DU%1!( zQdhWaN-YXD%s`%MkvetV86$Fc=z=TA+&b|@_5+P*Y(Pnd+M;mLLEze3`z#(Ojv zP0P^*^KoflSo3IBQEYydv!}2kqc5+0K7wx4=?>Ep+dA*6E6-*m-mo7%>M@Idh%msh zsN{T>;)vfFAm+hfi;Be?GA@NvOmuCrm&!h}otf#dJE$E{fCeU*BXxc9W?t3GTZ~OS zE}2?5*Yqq*ub<1FrjE&J5Ogc`gUvi-%(Y)M0+)Yb5K-4u<@}>? z!n!uKYAv^Qm`>7X(mfT&Q+9tL#QNs#U0ffa|$S?X%s{1NY&vte=k)D1{lH5hk)J5ddE?@Z`#%r&~JAndbo#_S-~BKa0+YIikI4U z0?&q6KJ^M@>VVR5-FVoW`~s`~Eqi`~&aq!^vA#RhZW4;*Cy*x>l!UJk0iRMD36z7WShC-7(QEGiMSM ze)#(3Pz4e{HuL6w1Xc4>mwO!tZVE?U3wPeXf9bq-V2XftxI7B3@(u?ugM`^GCZH=i zB?i1)&>3}|eB#HRLpbU)*B?baYiin4dwV9&M`K$}b&k;S)V1xQ`CR+<^vPVr#n9g> z!_5ayMMg_j=Eu$1Ifr-D(`vI@ff~S3yE4Ta-370n3p^0xp!WBo@$U+0Ag6`68^Buk zEx7yf2MyOBy|?B^6+yfe4rGR^4zbB&q@k)Yf}7sM&4m0H(F8=!9Ls&6#f!~S*@Odp zJT_oFyOLS2J|FG;lPdtu#qS(O>pI-ZC4H*05qhPT!BWh8*$LUE_Ru@Sxe(J>yMJF| z`gQgDZ_H~|`)84Ln&$(`AH*kJ$s*h?mhp(;1uu}?We>c}Tt$1) zg1fPa5)9KsItp8~O=R9Xad+vNyqD7V&D^gNU@@@{us-EC9prCBR}OLRe`*zIOork% z^O?#XPaN|Q7IfamzVW7*9HHOT4DXRAgZj{FEIBB`aJ+5)+|5oA@795~TJgVpaCg4H z=ZO>t$Ikk?dIn_*16}lQXS=!Yr8@<-Wt9$N!XIgtYr8**^SFG};#Qd60v^({Un|cp zi066mRCDRs9?Dg@LP|t#1ef$qSc-h?PE`j*;WTaRHx@6@PPY>Qm}~75Q7#G!VMAT4 zd7(q8?%cMs`;kmOw<}6`uY49{(`Cd+;^(@2uMS>hFBf!pA2Q2O62L?>^_NlOhKpQt zZ{q4o`=3`fkC!${&lP_ngo#M-_pVwDa8oH78+}}V<6SEZ&o+l|{c(uceByY3YQ?@2 zQrnJ;=*ck6)3;!eY_BARy$<^-oYfnu6P7HuU$0X~*lcM@w~$)S7!4abJ}$2AIY-D# zjlS|{peD@gJf3KGrlX&O`!T|EMuZh%OP0&bTYbw&L!l%=BRysC}f^1)Dgp{<~#G!x|tz5aRJx92=r z>EZ+3!Pd`%9|%C(@I#C0wd%(LwGX4t=vKSqL!zbC8z^#g%`c61S=k}IVzu%NVk_kf z7k<_VWN~#b$f%$2U|xAOUD?H#8Snw|R#HZs`_b*?63KZQ@qtY~Nud_OrK?URMrmBz zzR_)kw{QHhm^o3sd=i+Z`n=fB>7w5>o5>9OHubIH-G^X62=9FaA2t07k`yMAlm%WX z)Zu8%hPJciy|nQNPd&bCICbO|Xmj#obj8j)RjM~@*vg*)nnHr%8Em_ce}2A6?YEAf z%^Lq|XXPYHK~c$Rubt;|x#^^7ZeJAORC=avqt6!+stGxdc* zW%_XUuS(9N7hWZpVyTNuc%$N>a`Ps;i#^lQCJO{Y#l>#8Mu=g*q)Ms*?f(|G&4 zc$DATcn(=*BW||!3GM0GEVNTRl*xZ7RsqH=G-;<&a2;`X&sFC0v%s zxhj5D&pr{k{y6qq=kEF3i&-4+5rz08M%q~C2O=`36mCXxDw4O1uOkx_z7FF`xEr#G6Vfx zV#anizmKef+v;UWW$NWRegTJr0TMnXewo2e(ID>~pa!Mc`&n6)c|q=MIEVtNMRu6f z&uf!=zhAt!Bw^c5;N? zO6BrBU`Q>%F!xf*R8sHW#h?gM`ElAroz}6Cn2a0P2x;kxQfK3S8^TB&UMV@(xvYn4l{m772u?h8i& zIw}5&;0A%Lx}U`<88DX}|LidPJXzSvbAAD()Hv`ZBW!kw!EEKFEedyzHN zmg%{wNe(wFPQJI96@AEp@bK<4&JkWeF3IiTMhEW9EzoJ&b|#WQXZb|3D`$i<_-W9pYjTYq ze8}CtGxxC)fr2r1wAd3%ZX<58*L$_8e4mTS8`0_;f+usQ9b#Fxq?ZCdP(6&zXFaK^ zB^q?$up2j?E#7ZCH^+D$_N-ACIgMIPbtF%TAp-stiSqReyddF9Ua_8JX>y)_ypq63 z^D1@^SStWOuA^@UG?MPeuz35KNwMM-)T;E(INz9;_WKwYO>6FlG?@CO2nJOCBc9ES zC`CSc57@9*k0`;@+UX@Z|AwZSJ({@*FDEf7#a&H8$>B0tsWNt+bjp>57VT^5P!(?~ zD%pOyMCquZ=~XKnSIHB)qn;b?W#yDsuN{cE9ki?Ro)l~N0u{V0TI`_k_DRL`eaw)M zLLNz>vq$Ddp~%{iVnV5jn(Ui*Lc&bar3f$6*phU9;3(u3~}FWyL1iV4TD1YJ&&uOcW*B7f|W z{!2e9cl|XTRf@52#{AMeSx!(3$jkL_>SYFscSHT-c#V47(>SWIoRMI>}1rW zm>dQH^)pFvpOlSSk7YVpT6(Rv z@Q5p!+^eCY&&)0eB-MiOLt^|ekqoL;C!36Slk+Dt%yn6cAT_M*1Ss7Zm{Gkk|5Ug$ zb>ZSC@zJI*Iez~;k_GeYFFPI?c_GgGi)i`?c>H+!O8{js8hl4wiuydqxPHzR?yeVx zPbO0u7rtHe1w*n)6i%R8-Ce-il#f=angX;UB??S&Vv?41a^aM{644e?%O_!3W8I$> zg9{bIw3yjpI5JwVPEYP*Ok~ODt8aPE%Q8OhWUvMZ zly=<0&;_9KW;8SD083cvRLWfEuEecYp3Yorh4F`G*by^i>`loacSlTM6ogys`ei2# zcO2>&_&2*Sz0Zsx`4qgh`}l#SAl-#<7EG;ZMEUsk-T8wa zJm3@n#q3(8#lMO?k;yW9)MMrs^pp4-N|fMbLbNOnWmNvzYhqx$0JZ2LSc#+U#Or*n zqGN2k%l|M{mX1;@{?uQ|4qQr%X)H%mtXxVB$O1V>)6|#C}HuZKT-CJYITy238 z_}`eAd#lXHL956XCz|1^*P&mdWz?fzsxyVCrg7dzL24us z=5Mz;n%9c+NK;?VU89uAkUVn*q4*ZUFD0Hio-e)haa6qXuGL^9*` zTeniQ*mL7|4b=333m1qpg~#(T=hztR<2eqVxET8ImC)9=E7UjN0a~W=KF28|Y}i;h z0o<|{hJ-3~-U7fFG`z52S|`YQD}QHwbkyuo>l?d|k)y2#2tzz02D0T>KJV6#A4q-_ zqUNVj2l4_dCHD2A@NmBzuj0{9AN2^OcbtLhC;_Zkm~F>%pZ>x9$O?zC2+C(5SejxJ zL_ZDsPRu}w%bQO5)i}K&z>xnnFIisYV1D$RF@l|vJ4KPe@${^E*1Saj=CwK?%Rn|s z@Ozr1v`Ur-_-KY{PgdG*dsxU){%lA8o) z@%@ck^8nDJzelSw_Pw#g2s^}7fhbPJR?$MaDoOyr3Rgoa%={tfcjK7*ewZXg`>Des z2eFd&nt+rj%vn4cSZiAd-*KZdNK%xseB_FJREmSn9Jd&pI~(|0f`XStz{0nO#2@K* zVt_Zey5(|s-j+Q*eE=3#rIx?dbI&ml? zSnfyVo&5RpSW@IbB~ZI|y4{7ABnKBOUN}3==R;gH{T3Ix?$96AtXWKcO`!nypYpr6g@Zjp|;@?8Ro@Wh&;mEMJ>N;MIlN)~f zJ)7n|f6N(sHDx2 zgMZIYDlkF{@oI=-Qhs1utLZ78B}^clu`ECdK$~WnkKXNZNyM{0IP^6#yi;5$rP_C;b<7(vRdA zC@lneS`nhVg^DQ9hp#Q%4gPzgNi@5G;# zpr-tUJmaNwPS^9{%7hC0zvl^(P`n8C-!4!9B6#o7Gtouw4DMb_Lf;VZ4Q2rm=@4U* z7VhB)Bjrju$XJlSHMm_v;Qmv?~spb@mbBqry=B;8@;4TqfnP`6Lv~I33tt+bt z5%*LNd_uL^L7z=0)?S!1dAVA>3cMUUD23R9JAPhsRkD5cLiP-`Bw8gFOOaOEPhVEw znO)itAREZ#(!^dR=jVBGoM%o?JN` zSo)m2SfW$=%htuY$HWBWG(G_xM#{iDQRqq0#lQ!YFYwI##fE|2xIN8d0PZdkr^mnk z+{*5fP=_cc>2%DEd;`EjrKaapu{$Seb!WX$oj{gksZX>i@C14h4v~~BMbD15>BSSh z3o{*ZVJ{U}IgwfothY|xss#l5*X-P9fxsjYh;^ho?8s`?SVeCsqpO7Y7^YuJ%J@RO~lD+q_g?PATAeH z(Ne{3n_V00c*oL)!K)b6kFpp}3`EE?PKDHq4PvaZtw z`YK1PV{imJj$jMfu_(|Q{D8Fl*%wV1#P-_wWyL&ZLT?eJMFDmxBK@+78X6QcS6>X}SNf30j`Om0hCSioatY zm24%VFDdSkvhd1fcsJ^1o1Wu6^l=7FER(P#cgK2*b{1TpVZ4wpnPvefH+uGfT7prg zDiWx#33}x?9h4rTWKzonT#A47UUPrOcrrlm3a43s7I0BfT*XH8$ExY*c@PHR1VHNo zF-Wh}5~Ocsg{&`I6B#A!rr0+!=+exE6Xw%XUEh=(`m-6rsWPY>37Dqpp&@BD5i z5UEtEon~?z0{{;o|H>&&b#7_`*@Wm0wR!{X$?6K=_h44N0TGg~VjrjS4il82)3aKt z)#^>aeF=mUd0w(*0QIJzu~BX_@AA%&-SXBQzl1=|L9=cf5CF zU27W$5<5?!j&B4-_}UN2jC>vR%YXrR57{;R1;tqmifxW1eAab)o8)z#`pB? zm%P1xwt_@1F65nnHiI7(_~PTd1Fvg4Bnp3~G7U8OY;pIkc0ejahssFG9TJ1zz$ztY9_0UFb>FQ zk{c*?=9j2tyf~}2Rpq9BQzCdR61n{_6MM_ha#vQJv|guiphG9=T4DmEc!&;GJ_}vXq|8h4N}fD6ThmjR9Vu`VG;X9 zJT_YiFU7|FYXY^amDYQ}E?+W~KLK|ssyB81-tYueC*sy?H=cM`4#daUPDFKhC+RQ) zcMunG`4DTntLB_l;eJ)nsBgC-l5jnL?~z#Q^?DhWz}$_Pp3QEnKtzGLrp6HQ0Q1yg zlpQGE2X+qXRi&$Two;3O3nIYN0oVnfS?$}QIU{8Rl_`K>?oQPR)ilDEPesRdJM7Q4 zeT}(k0%6CBh0}{#J{1VodOgD?r0Trc>e~|XsFEAmDZ(vaS;15RmDA1-IJ?61y4r@| zdW|3urTSRHYW3i}_o>Yu&@fIH4o>7Ft^#26+u=Yw;!CGA0R-(i$8enO?3yXdb7DZo zt(x?;D$>OC&1&VxsYHMR2t*(AK$rL9J@Nnt@Z616trR1xyThCL1Q;v3AbzoZw4bk! zRZ?Z*J-a|(lW>^_&DXK!mSy22ZGeG6=dQjapSz8^_!cE4tH$_HzINR`~8<~W9IgkS!D@%sT6@nP^Ee#PPTIc zXk;~hn3#z12*{p91hAu;u`a%3%4H{?46XqQrMzi|`?0F!I>ps&^=fqkUi|%fE}ucn zUFh491jO!#M!05ARB^f(A2@G;Y`MLe4CsrjmSaHI4e1_wb$j*iy#PFtk^#{w54sLC z;{79tv*^THb)bDw9k#x_m^+|?%Gc&v{7bH0qiz7`*YIfi1sDI!dTmxR?&ITQo&j0j zrM@~63`8y{&&H1&=kG5YxkMMh2tQ4xblL&69O!UBO#~H@y3tzXV|VoXaOJk`o!oPC zATGiAkg`DIH>wkKjUbkrWwG6QQl_av8O!ADu_z39*E>ustdJ_DP_`KN8GAf#=@^x( z^QdqJ5YCQ|0XoBl?RTZSgS-#uGdC&yV%6SIsWL`}BEkMr*E4qbf^MoS!F$sTspls!?-(+S~q+&M{=z1H1;-vfiZY=6X+xMD3bYfmd+ z4MfJ&1(6jJM!g^Ag?8drS+PX+q5&^Jv643X6F}7AxAMIm_L!pj=U(N?sv!ZHrF_LGmv8IHq_S;kXp*w*aU082UAt!%q<2ICfN3PJ_x?O#U z`qie&+hX$(=z<(n;gQi!P7G?FXisz7R!w)(W_Rmg-_K%$3?xfaAm421D74@50^)Jy zGUvGN%K=SIK(G4^Q$!S7?C#=+fs9}qBTfrzoeuML5~p18@=1t`rR_NK&D~P!yyfL_ zE*xR>UZJ2zMYttgh7bsOJ0{Kn%1jta9XKaDYsP!Cw?M5HNNPJvYY%%gqR=IVW{(o) zfy;w1NIR}^W1dzxg=#e-9_w=nxm2J2!c&)MqBvTM5`E)BZ6=h6xTH(a2jX_Ot+OnQ z8c-`T)HP*FTQWt;-p6Cz*^R({u_P+MIacW>kX6KVuB=#tYIwmvm| z_>scsiQq(;EJC;GmGIrgoEB>UQ0~%$jO}b<6v?>$nsc-gdihZ4H}`3i1i$ZJx z>j|)99gi{&CvK6>N+b+WSBpA}7{KEZemC~a& zXsxTc1Br^v>tdIag;gv_rAtfA{Fpw&j)t}?k36U|%1U^EhPWGnjFwgdbx~kD%No|Y zWw#mQnLv#v$HEi1vm?{VQS4}4UX%u33Xuy_p!d06hFpG_u5^yef#scUtP>t^u^#gtfmasf6`?^+sF-1Uj2*h92QK71Em+_0fz=qUyEV@`4Y_bA8)-#rdC4<>_t=SNrTWdh24ILh3?#K^iB`G6)@Va{fMNdfA3Oo!vd{UM;DOmy9&!U4K}{} zf$})GWEE!L1fA@FWEl8+O(**O)SJn}Gg#lI79SK19DF zlj^WXx3@Gb{rxRAVN==r5&4uXuVD6-2Ji(7e7)-A*Bn)XGGD0|tRB2aLAr(Z-6ByF z2S*y3h$Q5~y^UiIRCSYOesy!GKysqbEw*%Y6US(bM`6-1`|^uluG;k)5lS=0c^)x3 z93hFls?^+F4DBi+U>jK~u*^R{ zia_R3zJI{nrATVujS?I3v;3#FRaHHwL5ey09Xh?k#RL^eaVieUJ3TLFf?6WoW7H~N z75r>m_Jn!tx;WmntFPJ+jNKYUQc%?-3I)VlT;&e?`oFn$m7^=bc>oc_2 zJE2i{X^5u7f*u#?BgYcM0O4mv3Fdhj8`)fm&p&Uf1ijOS8F2o7(%E&-R4Xeaoo${U zuO>Z#$j@(jYt~e-=UnNilzAUd>1(qNM^rb#0ghJOopDgC#FrJXUkB`R-&n0q`jY(QS^wd2MXChKE~(YYn2S+hkQ7EpC(@V9+g z`Lf=Js6_s|Um7!7=5R^0awz-NBCGt~>+r>?ui1;)axD7L#rTEA+NyuohsEVvaaok> z=9M7yt&h_1j*aSS*zSqD`hN01x1%i^?T88AEuuY#SYrwFSi79y>cd+aHJann-lVaa zAr+|-=XTj9GAIeDRB8VZwkt}0ENNiln(4$RaDJ-0TBd0oj6y;j>vU@9wtZ;RW;qbZamD(4 z_p?)Sl!HjV_mQZDuu@fB#s^}J7}f#iOboOmN$us7MQ*z5F&y3O_%WISE<7<&V;EV7f1O^x*40M;|N{a-Kf4Rv5}Djwd@DBZb*xzFX)rf*ix(-F=#qx2y_;F(ey$X% z(PLEg`HM#HZ+*CWDf=hdpO340Cwd`%ai|3$g>w^pA)OXJ#L0k5ZeP_N)vQjeiDkso zVg*x<5Nk*w8Scfv9zO2tv^HYUxjzjx}hfzN;JT=HnD3g&5i`3OpAo(VVTwf@+}r3s9yAmn>+J zz<250FdvjyKW{+ItKYc9AGFQ3Zc@d*hNk4%S9QBlSc$Tf91yewJu6MX%| zgkm-szgtBj^0N@XKTPXc>oxD{ZJ=x?w)zjM(%w~$X(#oJTa)okWIkb@$LXZnO}#~Q zr;nRF7U++^@F$jUc;q#IO<%yldRrfX1l*Mdk21v}xkCffv?SU#J^kG>CLE?;v;Zj% zFhp6@_PEdW$!ggY4-YiBORULEjr3p@FA;2vzOC1Zclg%kf_eg|i-n{t9}TnS>tR9| zH-$rBalINo!e!?j=J}XRcG=Wq#hiZaplqqf+yPBla)McpHNb4!YFR!!oj#neMs1Mn z)OX;>dBUA5N0=_X_NT&I|!hY3|@Ugabad^gf#s~hKB6s2he z7SqHxHVC58eummv(ZM$yltrA|R0fk|DrZJ?Ic^wZiJgymBa?N#?C}|?riL!j zOcHo822x*Z6FLsW4*2fep!{T$0Rq;Nu1#r=M0#Aq;A=3R|*xmJgvylcAG@?Z9U?|y#3TQY%BHTrf_qYy!1pLd6M z%Nx|KJXqwpjsC!gnKP^?E2^4KQ#o>Z}GJX9{Pl;$_c)zQn-?RAX25KaW)f;n%3(4m@8@;Hw? zV7C3h{|5HyZWpwg*mM_VQO2QSN4z6(UOj6&d_Pz*=c$l-Na-C~YTw)e?`84;{$IN&f1(WU?$cq02(Lzc|&S^1C!N zPmkhW;+-$~kox|)d#yj=zu!IQWZUB1QcJR6?!`BItqcL&E-C)iJK}dK4!5Bhu*VcX z9gh-Ril(7sVg^zx*uQTJX_x9Ti%c*i8@6*gNaiPYKAFsUS|;^N=gX9qxqoxcNFTEF z-tG3frQ(+~h=R904jbc`UMDg8?Wj@qvYG8F)ww*$dak4pP3xmm^KmxaDH$O?(^=1G z#IZ)W2J|k$J}0Y z{K5u@i=@4l{H6~)Wo;ro>)x|64*H)oGg`O|&(42D7}_G|PB*`gcJ$hpI2zQxZoUbv z^e<;`vOk#0btI*S_TAj^-&;oM4pga0ww~-^2tZ%>=22*JVp_J{yp<#dkh!@qjrn^F zjslt7q>i5DS{u(Db>>!ef$O~&3?Gp0->c*&@@{w>j8VFx(^fq>g@%&Rwj&YSy>X#WGc4>YJK=$CLVd<8_0uUoTXX zirGFF-S6`wgf<9B;sv3e6-8CfJ-Ey71OKGwsC&c_6zAxBygZh4n60@rZ~nsw+sUy` zxh_%Hgru+pLi{CWRvN4UNTLPr@ zlMLB7Utx9RFL@N%2FlMj=beOi0@!8X+QMDEs$N}j3~pvAs$cf4_sbo3uyWk8N34-@ z3gVC86*+cHVBIQli99g*{A09}gM3#axwuwO!*gd!@Unk5$by0ku~v1DJhG;oJ7D;I z>~Pb#SBC{Uz}B%r=9m^=3eo=Xr~Mh#PQAgbyWY&=bQ7hEt6{B%L7UyuV}4&}#EKkw znSPEKrFrdv!D)|Bv}3Piy(XsMkG7G`7W-KzesaXvR>TK8XU0>fG7?OL@!6+g*QJVS zA@@($x6X7m2ug~yrLsy^*iZW+t0*2X@lIDuoXqo(sJ9bc@?xy)c{-=}is>=@a|cO0 z^e8I(?7&m+vR~>l#Qvi3NN_XmLzlzK+0n{h&(G?@GP_zpLVbMybt}u>Bu#o@d0jSPNFSx03%l!^tk!&BtjPQXp79b4N;4#t2!j5UWi3 z3iXoxpZ2|fYU{gBL%iW682hyXH;!y{l!)V}Crj;`kDRuK$eOQPtr*V`zMz~wNy}RG zwIS`*=OGc?mA->f`LV-(!|$JyG~yf3?dveya7Ku}Mj;vG;%s}8thp4mXp5&vT$_C( zc~$Yy_zP=WvMLd7ut=T@gpAO+LocP3Sph4AJVp>OI#-J9$JbhqC?OAILkI zhQTRH^}gtIflVXVvw>+>1OorvNa-L9D z{h-ZAtz=-Tn(+nQ=YKsD!pxCviMRf8Ls{7AShz(c#ip=%ade*_J=iCT*^HBtCgG*r z`1JohY)ZNsA(4JV>wkG+n--AiYqR*DKbz0wFv$K{fPXF~L%=?Pe=YKVT~^>c#{6Gy z^D-m)Pp$HQe)Wegx^LM3+rgR=uWtWuH--6r{AZf~-T%V}|DSaHMo`mm=-q#o;Gc{5 zBU#4(ed%(jubk(*UWjQE{{7~7{l_#BBoWe{b(Aj2>_t)b?%?X5b)1&_*B8}eO>Zm; zA1YD2*nWqckh}c*2{Q3G3IAG$xtHUc^K}E)?+QYRy7gaQ{k=L%St@^(sQ+5_E?*-L zQLik>aL31^mVNQQ*o3Qp-q(`%-*ymx)DGMHwd2o>xqY({ZS0`~M*Z^d?i1Hq{>$$6 z=uR5rCfDcBny-4 z;QF7@AL&^Y;6M2LzZK&(r6K0-5i+J?`k$dBF`dWDT$e1lrpV8cHyaOF1EnFh+ivCC zk)s06`j!Pz5C8t0i=oxO1|v2~0jZM~#{1 zz($&IY{PKJKMPyy_^%rgH-#)m+sijXkJ}IU?KzJu^CzLBN^o`GabuZ4XTPKNzXBe|@K0$J04+`Lx(Up`+epFi2@ItzLLrJ_w=(hS&wykbf~*BJ07u0%B*>Jr37BMK5` zMd0o{{ny6wcQ?#r|J4mM>J$rV_`XLSrK7E%NhI8VT?kAp+WJCEe4y-n!_dUe1SqbX z1KwE$OHzL?_Q_NH8&{;wzc~ZbkIK%+3>Vi18TIPiyoXoA3F*G*lx(RKdX(wL^}?hM zi6u|ESDk3WpXbzsFhiRfcTNmfQZ2Jq4ul54{v2@BE40C{0(v5YEgtvOs6c+ts8;h} zk0R@WGW+CD!H&O{A>97#@UP>MmM&fN!&M$B7*Tli9QM;P(JBA78$s+hDM3T+waNBS zV*2MCeh3)<^l{_A?Q)U_L|tL0r5sw$9x>njO}5W+%von0@I$Gyt)>ge@%2AmCk16Y znwJ}hiI$-)PQ*#vRdB6;<$3?rB6{ciC=j<f$ri61 z?(jC6=|y#-FWZ{nWdZB>=Sakr<#|ZppHchS>c4ztlYo(uiCliOhp8E-BW9_ofqSds z?-R)@dDcXB|w_So&8k zBsheQUN~cxy4sv>^x9}lJwX(@)f7r7$N0vUtpvDs zdaU^Iy?psHrGFLTuXyC3Ps!&)mOYTxUkq=m`R!JQJJ$TftlFDe-s;5;ugTA4Q_=Ed z)R;t=vhfOYDqm&4^k%*hLm=lFpI3R_Towl9cb>8n>$rjggYa2O>4P#;?ukh5i6|Z| zW&Js(-a?*S+1;)W#+KrjoBLe5Cls#YhqeC+D$Ac{E!x!Hd??FqDp0Sr(T>D>d;ZVa zyzj#CF%aG;cL%gDb!{!{L~4q(b9v&%Xc6ym&a*d#H2pt`Z-Id3uuZH)aNIO^59W@5} ze$HUIbYD=NG91d;zOWqSsY>2CBcX-$Lxqe^-UpujeXorp`5W%fwYntXdT|5ltxh3J zv6uIB)wK#a0$^iM+RD?n?GAV?UM^g>La;W;f57ARRq~g|2QptXp%I1*jaSW>^F3EH ze?`uCwzyH9=$x*g?s52Vp`JVY%TTS7jx42r;?Yv%{NR&h&RxjI)YO;gY4ig|ifKDNxfniOYvU2p;<2d2mABSU zucmn@38L?xu_La2PTZqE+%Y;oFP(M#5&r(YmuP6zhF2d_xZeNv>(dXvzezz}au61G z7wY(PPT~a-(o6j`)P@`V4i+JMYo}z5NBB1S{3Q@6ZX#r1D>LnjY zoYUYkZ2kT43S+)pmGRYUC6oOy0SXRQ<)61kv73c_qjYCdvY$^3cmWhkj*R&s{ zH-ea4sIRp9P)?Lxc6{+6{*aGTQV}8kLZ60PMA#F!Mpc^+(1l*lR*x7hORNdK4Kr2`o8&}y}TnYsD?fe`*gwY`{P zy5Pc!5-rB@Arw8C^piFyJE_cUrFgRMWCJCi1Ye%_bfd+XEHTU#{<(60I&?D&?^c(+ z_z;Hy2z*}YJ5u#C_j=d(@zVqOG1q%>=Y+6AN`hwcb1`?|$EhE`bF8Q$`B_-JHxia-Kr|fDFrk>Eb-|#|6W$gi@LhCn<%O3v{=gJs zg}Nt4SJUXW_0y-PYVQvFTa6$Iqlfl*432J9`#kim+tjgbMs6K=4AYn@WW*>z3+GTO zFA6ggrwO_KD;sh;;(^aZETr0nB-7Q7s}to4X+lv8>rMBpgsg`?Jf>G*$+sp=_;gh1 z)34)K->^QmOipCZ`Iq4(%Nyt1hi*+>1{UC(Lna@w^ZWkW_F`or7+Hr1<@=I$xJPpm zACCuwDqTjY=FXh(tgXX)#4W{Nht(Q##RP=oQu(L&fXgsbCOr;I8Sn zodN%V_T$3<;nlQz$o$XNG_`AEDusd1sQrw{)mhrd-*+XmL}%~$nzeIxM$>iJH~8!= zXVjYaK*W8{GOtWqQDsBhVby)k2)AO&2B~@;>wk!;P23lI2lD^rFzvV&|8E~C{G@NxQLuClZEUTvY=e~VaKV$ z)dj-~mD#QELo_&vV{xLKZ#P5ZdHqo6_G!?&H6^XqG?$t)q7)a8yN9UsV{=CjTZ)uJ ziOCX!oCl{CkmK&UkY>yrXeL{=c5B-F&S9I>1?!X2K!Tv3h6uM31Is?;hi>Dt8CLu^ zLaKs(>u<9)EDL&3+N#`n?pd$cm8Gn=%Lws%8PFe~2RXP2IV8+}+6t+ErP6bS(PeXj zcT#l>ZIR%U4dC2H<{^+)-FdwJVA>+10Kw}V*)rJ4OV+B0{Y$PKc7sMETRmSX3Npq*_) zsg5mJ0mO)!?Iq&!eW-C|wX@eDs(Z{&4>9LZl+a0{dUwW=T#%~EqY zz4T$rQj<=^h?^04e zh11OrSZ_9wUt!V+?@$Y-$L#Fv@UH68T~q{M3At~y)}J6uAd_n(h*WF(z~+&-ImWNi z<6q_dEPrU0b;f2!Fh?Gfka~i`mYqtrgzr%;=*!-IKNDK&=^HbqWH}2D@E{d^)b2V%vSst#M{L}l#`1Yf!9YuPlk$^{Hw zij;m8&929qcPb^&^3uoGgu3cCa5l-%|J0rV?c5k54^E~?`ZDF5vgCUUUe!0xtbNr9XXF`O*0X%g}*(=G2#~H z&V1zp&3g8Dz-00K$3RIiep`A-OZabWOJ}`+1Kmv5@p@W?f0vqa|}Mzw>yuxWZqz9I$~78ZKV zaPheCyy8mwkp88S_T(fuU@&r=n4DZr&Sl8@>`jTz@=RxJDCi(6Zv zEk@1b}^ zn~*TyyE0`sfFmk0asZ7z(Dq(v&cngf-nL}uaX3>|C6YRVJ7|m>I zM$PIY!qOK$apuxo4R_S%Ffm#^l-l*G#8{~>@SCdIeqxWq7D(50S_qqab)LRj6w9zW ze&+(KacPuJlil$RdUJDR7?1*5E+3aS@?*Q|0Q8KDsSD3VIs2py%5&? zn-a|VpYxu6I1?NxMTZX_E!*PdI$(!mT`?9lJB_XPhMwRzBFkAj4SwxLMn&x&9(pvy z;S#a(Grey(OMnl_|KphdxmY^)s>j$Niim(x@@zQkRDUq-3IAk`D7BBU_l1WhCs)pY zov@KGO;`6Wz|i`2EE`||Ep6XLQv9S+U+x*iaG0i~xOXo;-S?dktCFaFHt=sQ_nEO0 zoxN=_N0o=%9|feYTEBTBkWe&5IiIn5p?{gh$%%!5&$7Ur&zXMX`MJ7DnxOI0hrr!a zYErl@4L$v8t`S6RU}EK!%q`%S{>{qM8BheH=*jpE`<^aN|;$qxF+ z+H!v}gY+Hp>_AoCBKAvQW)(W5AtTXS8;0(!L5KPNNN<13$uFoMkgydR;r4m4^GYW&<<9WE94CPx8cS6NBe|a>dm*(FLg2YSvSCtmYa!x=_|SN+airb z!4h3H6_w_+>_F!ZzRHS1J&`mpX%*(EYBd-Sr5tC0fJ-~~lefiPAL8RG&DJZTlJ-X# zq!4{!G3gQ3Q}4Uv?!lpfCJC<8V;iIJ`U_W?PE{A>j0f!eBc?As>6J4FVCn^YZe2#5 z;*ycR+l`gdoCVOpQkEPsmUzuogW|?*T}~YISv0cTltyk81W;0((JF=8uZV z32c&dfH|zsbRUR5T+sZQn=996ICi$8C&9gDz|h*Jk1kNilzH|nc)FfdSrn2`qS`RU z+`sH1LL(Jtx+U3cyyFp-UF=Cu+58#W%R%?ck!7M};D*4}s~eXZ85T#o1>9c~Q4K8) zE$9^214Z3X8wY17PmQ$X!!rwvkCT@dBruxTUUj#1eZL#%ce!Mt0&O(2#4nuZv>~*Y zY)#SG&Xgft_B%gy&)sS@kGI&Io*HZg&o3*^w|Ea62VVtaxtPF25CwhEX1=;=8U=Yq zQPD_Q83NZ1(Dg!1NGVR`KUJvzq?!1=K2tV~Fl-!BD)JtDcq|ey3+FWz(OV(GVRumK zyLG5zn{A<`)pY%>xYZRBfTn*x3`ym6pSlmbFdiiFIoMPJz#v5cNm|XqG;hB1)w_~X zX=RxhQ9^-{t&rZb50720+T|LAYAW-_(&g!l{%9ABTPSpfC)f34bz3bR>~bMdwqISA zdqzrh(}4weY$_}zUMV=)ik&nNtvjjvPW^J|h>L1uHa;t(^!NJ!oG1ectKHvzR$7fS zJY+S=5PM=6_*zcx3G`@lIXA8#-s4fG&%wrC;HNngOB0gOqPMx)3Z$DJ(<1ul5S(F* zn(~V~3Ymw75L?|W>6b5!`R{u>{r0~WlyHdXTI~eYi!>B%MvGex4L!NtrwBC9`uf1| zoO{pxe&6@qZ~uKiJ#;)^AooI{?P<&pme(1o_Uym;QnXy!*vx^mnGbiJ0n#eK$~fBp zEkhPPoORJ$KMc^VE`D{39{E$LBbDmwi(MsgQSKdOM`UlY-pJoEe{XT~f`rO-C&0em zT#mFQ`%Y?mFa|K7&Ri#f0gRslz#nj?%n42WFtcbn%`RRIEwcW`v_Cw3y0}Y17 zuRuLC2bDwD`>6dZ5E0R}Sa-t70w_eznc3vgRCw7kJsqW&oeE{iUN!f&@@caz4aYegjg`Lvo;pX1RQC$9pmrt>D{*`>Yq5_Pr-@1miAMT~dy=cgD@tYg-=@u~>XObtgz*S12x!pk2pgu0b-K0F=Lw-M}L z#O0AJ4MCv1{dY$@910y^IFb8m8*(5FcucvU<_ITt^Nm^-QdyCb&!w80hmBnNvNe~- z0s&(te@LgkD|fTp@?F`O>W%_l@L<3#b)&Et=gAVj+Pd1?0CkJPqZ(2@$EuhJ1|V{j z;%ijLgNGkSO!qU)p%_;4+dy@;P2_P2m`_kwY6Ah zwvYL=^cy@7gcYD)YsES{C&0q+cv-()(x>o!0q3apS5XCblhx*b4`5b>Mn`8V(#0VD zvV(1_iRnW7#xZuDj1l;Vm(C@25-EgtVAVmWrDfISpB(u#xZevyb^uQ$YtFyKqtAkr zqB-7MB}JndR_d+$06DNHDaUHoJSF3C-U;u=+Q;cx3vT{dDbyi&0Fx5hV4lbr`C=4R zc8Oj}<;5ZpV<%X#m}zV?iQQ`J)(rQFl<6-|ge|Sg2DeeKt&2@egt>qv&DT`4oiq+S zHE=o{5z)~-2INym2LwQcjw6RZzazERz*p`DJ$pxhnjwf^(rA0t-w6YMfWny5EN-_QZTz}thZEu>0$*CMGQ}(g-=4#-IDw-*Ed*`=ub;E zs^igr3FlN6iDkeC8uM7<4f2aebEM-`(v_>5w9eh|mIQJGW(&e+Q~5x%s%^Ez8n!OE zELh_~?tDI|dkC4Ck(}q-R51&|<;eh^;B1m_+^~B`x(>mB=(AT#Sc~d)32VB;f+PBAz*M-Op;?h5u*OK$62KR( z8J4>jcH#VHjdL>&z~$lNhsCx`b^HsYID>J79J6kq~E#>aWl$zi+!ra)Ob;a|>=afnLq97j`m3W9vo2 zl!)SvZ%2ioo^UX`q6TCj(J+3>PS-Z~DIiQHh>M1`vbSC!gA}W~K6yD(cr&N=k`5(q`%;B9f>6C#v+2ZxMGT4O7fiB&m$sQ zB*aCm*!UaadLUtTqxjYz)M`b9&clO5-|XP_DD+5tqu7!>2uPRI&8>4N{RV`~X+^#y<*k$1a26Wiqy({1Ai_ zBa4<)g!e52>gcALco|y)3-JgZCGPyhO(RuOEhx{o5UtTjK}bTkrlqFOT^P z`^S!PkxzV%^AQfu#H2?+^%ni|&ks#JWL4=}#_ZY$THwLHSh>mnc@6f?1FYOAJNNB; zqd!k>F~Z9I@t@BdY+-_zx3a(Ts{S*q0YP{!pTEzT@Bv?4@HX2#`QHOS68mCw_d7GP zH!1@Gf<7MW{~8iK0A#+R_bfsp@Xb;I`M3lT!V~{@MVTIgmojP9oV9@*in3QJKY64t58Q0KJ56pr1^$- zL-1<9{UA_@=HWkkfd~5#la;Kw@ZH*=2^-ASIb816?_o)ilsfKAOOd%4eNYd+q`3FI z!g!GQd*ksBZ4XXaPJDPh%b?LL!#KTVQ8kahyYvohDi5(jmfGC_wyE}8?mtSlX+wDu zQQ~#|yH%AJ6@IVj(keg2u8*OYfa4kbv5%p}M!o?#f-Gz!`$J$8K=tC%yZbx>)>#`t zQ&f@1dB46wQz8!*_u_~9qI8B(=R$i6ge}DGpWoG5MGdO3a_39L5g5Pf1ed=uWCUxL zBw=v6JGY+verVJ~dv7g4KOnp&7oANeB!DW4+)?2AXe`iNeK!cX0-u|PjhnUk3JO-q zs^cc%)!jUp$h|*GX+qX|UdFw-`-5EA2WRDPlb&66wvPu^puu#hH8or%XVZGlOa6o_ zrQKwY<+C128KCOh6zbMHUY_hcB!Bg1dt35F!IrJ1?QV(OJqgo>qxN{x{t>;! zC}#Y(I`*@+-HkUFzYpG4QAP4U6&tjWE`I#k_mQCUg2!pxa!LIzue;P(CZ8IV;^UTO z^;~?M0a}P3KW@d^NB3VF!(KPM_v2To4MyQw$&%#3V}}dXu|s&x zo01Q&t%V!K`y!-_DFBVVtBv1B>%{)>*PuzVZ5JYe&bPKNdX5GK_rU@9whW8APDNXE zn|!TVXn8lIB6%x!`^{h(v1t^0J@v4V?zBcD0qq$&^Xnoa^tRi`&WO z8?N8BtU}-!C3Rc*<->w*!>Ix@RhDD%Qex66TCeq7H?uajw5=N}J+EEwA0W5r-(E~6 zQkPcG3SDn1dI$5FtYqP{Jyd>iPllC%Isd7Y{Zj&}4}nF6-DGZPRh&zOL4|l?W9L3A z-biF0zD+>3yK(tdu9C0HVV5PUi$aH93=zEC z6W`S-HXa#|)6nDLS5UWsM_3&Zv;D;SAXB5|{Bq$oLTEXN3WS)TFE*;Hc$>9p)J5mK z^Unvg`qx_;0+cVMeJPkvy803OmjcN#D3^ExF!QADMb?5tnn89Ii>c|FXT5+{uA#!5 z(g)_&9+oILwu7hWSef`Y|1H77kR+{-1(rVyda*D|Yd+ZF&>Vd#f6|x!e(G%kiiOVW z*P@gHD`6bcJrWFLBh^jEEDY0)Zq>d|ALx2rA7d483%iM@d0yn<&N;7yE8{{ae_fw0 zBxF2eg(j}bqL(nfkiy+Mn0MO?|0OPU;XbMBq-)cPxa#`&#Yzlqd&9KOXvAnn6x;Z$ z{DZ?0De~G*>`NjG={I4{kRz7rS)V77JS`37*VB_tv;|^L3m(snkge-BX}Jtv{%s24Bf9C%bBXv*aCX@WShw7qFqlr@D zeB{OJ95D_VIfzni3}!#xgu+=KfoZC4(lxanfcPXY@;p1~DF31vgr z6t&0`E=K;{nic*B-qS}*m54-6(Pv)^yZzT8fh5p={Tt)@NG80Om$vHak8CalAu)_$uci#d{(S{ z{ET`e=ED#nThrNcl>7Oi&|K~N^`WR_haXz4Mpy1v2i<|Iji>VuX@%&nX5SA>T&Fxj zQ}GQT>P9ZmdD+cXA{9IKm3e~`-HUX{AZ(~Y>C+4Qfo*-y^R-qJvxXhaw*A%U(^nGG z@|g7*9Oq!>TgTp0uUUB6>s1BV?hf};*ik1+F8EUtFDMrGcWA`}LWwt%SEF{2SrtM# zo?nylIZDSoMU@REWH_sN(!O5JRBgSm`(54+?4g{}%^TTy%U|mnZYz`*pkzeMaL{=+ zsp+K4-cqhJ|9CE zXTHT4)zTZ9Ce-;}U_LW^_nI&r|8%`oj>9f>nt%jklXQs1Tpbs)R2tZ78i_BAYzW50 z03?{WUgRIbGS>`p$C7qz;=#cF0kWwd<&KHeEi|Fo2rD7{O`N-Zw?FjBGG!#=$BZ)% z(skn;lSumZ#Pu|<`4vz|PCzCi^UXz}drCVCLU1kG{E3j!?u zx4ta+d$Q(5FI`%T%UJdcUXt)GLSH0Z_9U=EA6reFld(m9r8vzqX!!HkB9S>$pkS}6 zU|PrtWPo(BurCsbM7+V?!(|I$g1emkZCLOU7r7B^_*AHj%-zXPM1U8K z;C5e6!0vv>#nv553O&7^{*3=IJ(~>_buai3$o5ekgO%K(m9 zq7Ru~4Iwv@5Km`>#7nJmlU;$AnXX<=Tpa%vy4c8+BKJ_m_wS848`NZ-x_n$U;yRDy zYO!Y^5iDxkw@NP=i1ny_FbXlS*hnIMCvq?uM7e#`M>4w}NaB^vQB?> zI4jdy#Icj98oCa6vtN*0P%VeW-fJm$WzK~aACO)gmw1sbIu{w7a7-FP6T_Y*#Yl1b zvvDu0g}%jpP!b~nHW>yKJ+Km|qspwUkn^@ z8_$1z7HtCMQtd0$R*Ve$<+}EL?9M#Ir5CdFCbCKVc>Kmcpikd4*EnB+>S4E$ z9%yU=ZPLhqgJT^W&6P`Em zWCaDq*WfXio1hIs0dqE=Stkddx+jb}0utg}JeuEY`3Jh+Jb*0z;4OnC?dy{nwln#p zico!nn{5*zs<2CWw5{2+hdqe%Z%o-}DedqbLfhT+dHrZfy**YH&JY8C{-h~~@I%iP z6_@_061=bhzO;y<9-?4VX;}Y~##%{LJEeBP^y~{cXa&6M)Qg*{0m~rv~H0 zRkITJG|zjJ`$C^9#11x72eW;`@jk&|@zOFGQuf7Dmr%H>iaMo88L6{w{#{)EHDmBh z=JQpO&Zk|LnX?$9Dtb@$-i?ml@`=MO%eyGR5d&yCSVR~7E+OnUBU{{FJrYf|?IOBm z`gD8ixHrF(_<4=j>nmif^N8|Z-3i7HfM}Z&L`u-vc`qu);kxM4 zCM(JDtEfGe@lEvip%gxjQ&>PiIr=(s`l_wwcr&ll)~qxZz~&e{gLSvhe&xBsjyw~J z_u064vPyT*1rneH{%`?glZV4XX>Z+xLKC&XO!vDUTYr5?5qq@}MsMFA&yv1y&NGHt z+`#Geka5A6#Ovx{hus=TMM|!=!eq;iQ5U_Ue8YgC1pcVj7f>Y>D-oezxY>6@V3(LyeAK zKGdaSCBE-BG@KNn)Gm+vy82>DauJEd=AO*=w+XuO>Bg55vb7x7GyE5k&^OOk%-a40 zhB3Z(=bp2FY-ptE#GmAbi2zltiz?aQnQyFbL^gYL>hT$00M-c_51#Ueb2nV${(c=uiZ6k|yYc0* zRSm>>^C4RU1v)PP;L9^{2Z?#rguZ$<9i&aBZz{y}T)t_d>IL~>+<=L;3!TospF$;d zRLEuFujQoXU9=VNdslm%zKGQy|EEzbfCC3U!6Lq3`R>GDK%#O+!Qsi$znOiSvjY%P z)r8yega(SXOn<15)9&o~w0X&z6wa{F?bY8_hoM)vnx%&FJSWvPpLFu(a&JK}VyGF~ z#}q4}LS*f%lShpmdWyAk?~%aKcW2O3sXTsub937Ks8*GkZYOGgSUW8D+bLoGBrJNG~W?L8SfGO5nKh>&=mEs4s09ELSv z-+6A%R!SP6vaf|bhj&&Yg(3>$QcxL-n{O^0!*=3cG3~7Zl;t_Zvt9!grOXOalRWRS zwN@*qft!{f(8cvQfl?CYExzCmWUXDrjuf%Fot9fn&~ueJ=m{C8GT-)SO;6RqeRO1Un(Yj7^4n<$t_TE@tUmV|a2XN6lt#w$V`C_8!f=PxRyPtlfKlF^% zY5}g#?)s;3siUnf6VZ{6Z-f(xvi(6Pf!mrl%2V?s3A>f>=dwyy=<)-Xd(CVTtdJ}b zlT?@NetE>q`>vul9XN_n2EXJ$%W8q^^1G}IuNKxEEyAzs@ zvS_Tg(;j0EN^-9GrQ~ci*z*Foa;xR>BsIG*qCpJ@r^l zQ68$^hF>%~d8?KZ4>cZdI|MKnOrUvA7VDB$I^(7x#jtVfm_|FVrIQT=rrp{xhaMAl zgspgLe)(*C1-c>eUC<5T%YQabJ}19DAlLe;M(97Rc>c@jofLx_tb=}uT!;-z9yB<* z?aUx~PLolpiww&|HMkU7;}ZNbLAW4v9#N(_49a^>$_l7=Lala@fkxg96sQ}a3N9YJwKaKIZh0q zv~+vgyPzrI?r>_-<95VhDqB=h6u_fjJ4Hd~5VLh3Aw*t%7;%L%b>*GgpbG0@GJk=p z!9!@eCsjEX#(|#|Pp#az=<{&d7BXVc(YQrqzsYEBr^<3yl-S>bQ^)fbM|q9`0Y9MS z{jh-3`C38=;5`<8Edb`o8AQaa!=&!Rf_hmtcz9NmjQHLKD+%1M$IlnsRr<(^>$VsN zz(l#vhu%4<(+h1RHp3alb)suu8TA*U8RcIjv2jCV+Rx>Bf)7Cz$sBIuv9bh|6WQV3 z8vpIpA(?gkPSut7lrILa^u6B8HAW5<`0;+$`T<1RH5eeqHzBP4M14p}kbh*j^<+#q zpuO?^$pbDEuP<<2XXQ*G^v-h*lX7A1Za8aQwXFkat7`O>U_3(qxWbgJkU2k}7~wL|p{ zKOBX{`O0YW`(39rAB5X?qjDLMpL}Skvc_sVkY&(RMINArpV&mC0QnBSr69wdfv6*Yzd5Xrrad*`O(N=i~=u1qHF?|H}^0PvO>)xqCS9vC&)dHux6roWq%!b zv|SUG@MtJ6BWyZ%b~8CFdLv1;f@V|mohYuU^w;8{hQ3yLo=xhFS?(8NLWhWf{)-Qk z6A(;GIhSdp5K!hC;aZ(fAveK7myo%#w{m?iLWpN8RH#-jo>|X_onpU$@g7|(!=5J` zr8vR(x`B*7X>4m*M>%PpybEQa1;g+!HnI}*h`$VK)zVP8qH%3#-IO$*yt*gq3>vHx zuCN2yyl=T6cc??)V=x#rARSz*q6YBDvu=Bh5jrDLgrLln9A(dE#>C*n#o!Azm;153 z!beE{nyGt;JtB~+{W;II~85a?B zZ;v`(?xVVASx3+0Q99j zTo6&QWT5eaTbqJNLQL3B2ve~$NTes*Q)9iPM`yqFZ+Xk7zbmORcmWWli@Q!`+@Oq# z4>CZ@7Hnh_JbPSQaRDFhi(qVm+#w@K$ad5C%?Tr-TVbnX#bRmq?Ly-%JBS((tTVi3 z2S1erp*>WM`lo_?b{sSijX1`M_`o=IRyj*8YNBphyQ~auuA+@cE?WbG@tc3o1CSffgo%dq} zt~b`a8w@r913yn82`iWIe92f~cd@C4pX=Q0*@i;b^Ww!sbW}meOG}}%)<^6^YDLLk zPrJf~l%zim|*ZZx(^r?rpGo#lj@rsBM z`X{(?;S2C#D$36IWQztaJZMsS4jQhY{qyd_E;ws6N7I$&66UC6_TNPSolqD>_1kno zPsOIq=30IrASD9VPr;d#c~FWkE_224W%#^ zRa%!9G$YTK4w4{8w-!2dm3DjR6)-;K+1fPsw2O)BYkD87Ew?7Rd{h_wqq;&RZ z>CSAQW~!#B zc`Zb)V%=ULkaii_18O-^MR=p5Nlka|>n|YmretB=i31ENZY7@*AIpjfhdF^HRh4Nc zXk%u(H;EcUpve2ZBh$PEd7l3lUA?mC2I>$O)dQaUrVnwBve-Nr2scC7n&{QoAJ0{q z6QGR#=p@i*ei%6`kdeaYv>a1hvmCMt27MCtOzO(Eleu2_7syfL)#uU54=t)g2*9C~ zl>lj$Z2wc(W3)8p&X`%7c626(p7#+2xS&xp@Q3jQsGq+R4{b>-f(kJW@+zNa4*sd9 z)$J?DJl+}z#$XiQ--YEqA5jUYKGm&8%ke*`6r#IK*>k7!YcyTIlz7!E91Ha)(4!|6 z&-ku-BB}peRD~vEVNus1o1eOXJ!Oy1E=R}fJDO|gNcB2kb>b*Tn`&iycjP>dO*DQw zg+X+xt#lV`JEQgM9W57TA><5u1JGBLQ&sM=1`p*ExzI!$$xc4I3!FoILmFKOrE+ z;M@Mj?C%1jBt>2kc{*RLklzx18V!H^eCr#>E`S;)%n|y28=$#JfF&rf&+n0KJ^(;6 zXtg2(wIwFp^$4>H(=ZDQ$d?8{F3hLp=@!st!xlBh{9Bf(Y=A!Ha?XO?pYLyEi6(Os zD0C%4e006%C6p*)>YmJHrzmuNWbkw_vMSo%Gxg?tJ&g{{%nN~%I)(GrjI#Fggct@- zheIEbR0>DlAY*lY zN}3=+{oK>PK;w1KwO0vLHm=2Fv&Xy)eK0b2NhZFyw$SB>JRw3zp9Ha+3GeNl5HJx8 z;^r{rbd(Jtn-q?%xe?yl?Hq^{m65Qh*3eVK~$out|kw|#-cE6>>tzQ8qnY>HBE-$ zs`PH98kSKTQHxe5H1r3%6^wrBd7|Dbvm~DFs_&-=;bXVS3jvC5RU+CJi89O`FioIQ ziVjAMnF@S^>J4p>u>6M$00YuwMi95nkb1Hz0;INKfm55INWlyGR6m{FH(cY@sS$x* zpsjetRdNVD_`Gms3|)!qTkeEu#Q&60uGKi-5B*Ox%z%(Hw+}%-=OZcR3`{r{6~Kd@ z8i+vw>BdNCx?YmV51L{>5wpbdIFA#Qxi!TSlKqeU$o5ECvMgSc1xT>d<$-*)Owf!e z5CT~#q2K&6Y2=J;S4%A?PF$F3V;u6)^B3QShP6vbQLoOvs zC71}ysHNAGZ_oy=_1KlygNLM)=MbWiU>a$wI3w@zGXI(M4EdbtZE80d>@m>M0wS3D z6mZxY_C<|k9E5rOrpp@lwi~5{K&;JR<;nKQ(kBnihZH{&JTt%+`|?3Kx|v9>rh${X zN`F3Lu>4q=4MCj?{=Fv;|HzX2Zo{F%NG72&b&?%FO05nrgFfo79HDDd^%>8PsRV9{ln76H~*xXHsZkt%lN?RnZQQBe>bIHjYQA%tE)LwJy8+EI7# z^ZbD^GGE^P7t&!8YpoBR{QbA++F9VKV28n8=aVx)I+1UuWYgCqG zTAE(*QKkOtPpNupit3c`T-g$0Up#W2$!0ZRuD5%Ikab3-P*O(Fm(@)*2SUG;p?v4!uDH%V~s^Opuoa=W$ zBA(Q9E0VJB#M7{+_EVAgWtD%Sk^*x}@L+9R4Ncj2SU1(Y9eJfsNAgyYXAQR)O)=(1 z@X}zNmJO~^3I|%5&tI!KVqT-3$j_D{C`7{S)t{!)vhlj)Q397910swsHNb;aImYB8 zVkoKz5Nj}ONU)@Wf0O9gsYW0JP!R=IpOvjmW&Y_9Drg*-1~GXpbGuP8-^&NB85e`Fiy{BqOA*y!L@m|nTVv}EqDE9L3kZ{#0-SoZajf0+%V8iz3+4tf zpZeuR9&F&b>3j4kS$>_~D)@n=JVqlBtSo7y<0Au&Fy{Xy_4;;2=!S~=e1r|i0?TEp z_X_18TJxOU@GU5(t5q&Hwy{em97{tc@0tBvq*b26brE^!ac*S4gV!Y;^_xJb6ESy` zZk;b|suKTMAVb-(DKSj*xiBg7s=*b^{sFq53Mp<;r9>guZbcm0(u?I}qvFdSUm#jq zBtjjn#^`?!tjY&rhu7G!UK}O1ARHE__6U?Zsyj(0sSFYj+eV^p?&^eH-QKw=*&M%) zYS-l-8oV-n&Tbu&Qs)o0gVY8qrU?&TCoafGx;n4>A5j%E>{X?@Lwl>9^&d`|r0K@? zR24Zsz1p3t&rE0YJP2)7g|Vn$Suv5Bm8FT|)b4(55y!Nt3eDXP;}{BhKzhA0Rt35@ zS83?5vFFO;1>{~%7#26O7wdRDU^E(3Tv_?HKfWeAXfYn4c9I6c6cvo6(1$4%x_6cW zY&84cqktLs8Sx1Lj@{2R>Yzq>Z2mchi|-`H`Ni)NYx08xUB{`gt{C1lb4OXh<1cc5 z)FXKmn4zgxT>jhR2~`ODC33^jsN9pFE3R7tefKBSES~$2-fRY$rim=4ryh2cgWnKdP+eP|2gOI=BSy zOA|PeDXk;L$f%=~<_<~2h=^}>>Nh*qUiBWyEQ0g>tL6z)TkgkIfA}mIrB35t4J1wc z0O2^Y55i=>51pQ-dR}aP58ps0Pl!(hF$S_{Eg;62J*PJ1^e>c;mH>c5F@`sFcX*+5 z6q9yE{mBggAKze3wg`BEz+YF>CiI2VN5BjqiRUT<`_$( z$vPu(m>)c;S`*!i*S3NFl9=_G7?08ei2JK05`dnmTPkkOmHykYPpRX*hpX zAek#Nv8geUhB)H6WIGktcR0-EGE-oqYLhS|!4*=e0n#D^cql0-4&+ZnIIi;+K@sDS z9JwTetioDyU>^DeGY@eMdxkfHLDGhQSnQYxSD)!;jl{`5Q80~jSAVaB=?SJ^Mt1n_ z*`e(*`}qsw0}f2P#bO&@So8H{m%6b(=u$rN-2i!?vK{IMj0Dsp@i-d!pXiDL5uNlB zR|OVpE!3Dn+O076NaSM7yGm24rdLTmFw*AyO#l_9ei2$qV)vqk8nKW&URr{c5EZFL zO`$n}ZTF@cT*}#*cR2(yk}$d)cds#uFnOER%*7>i5w>b!V1x+glJL3G*|Neq3UwTtg(iPL{gL&!}@q z9$~_dyiQj7m(!Dy%s-W8fHS9n-&9JFDH2Pkb8mNbVSD8KbiwO39c&x4m##40$e9%xMN5%CHFZkDGwsj6aUrn+1BW@q7sCGpP|(Ifn~FblJx z7%vaA7|=>$VT->{5fUJ8lf9(;X4oB}K97<$T?lc%>DmkPEVTjgiG^(q39q)tAUeuu{?VE%w{p z&H#H;xKwf{IS`|=Ec5cN^Kw$nQctG5Zm(y$G4}R%kay>MJ8|oPBQLpGw?rbBg79Y7 zsd%!hpDausVu#)i~kTdW0Qar3n*8>xTZ=ytXrJyt7Gj+TLfaJJlH_x}Ce$=ahtb`< zZW!$w&c5LE^t%Ei2Yg{5Uwlu-5lK2I*eWa^ijI&{%fFk+e`p?%J=wF6XMxqu^+y*Y zSwaM8O-a30AfJQu_XfH#$_=|QN@FzXn~>=q0A9`Wo6DiG$!Qd~8Q+nUY*CMtlRj2V z#|CigA`UDfWi4W3cq6L|`x;5IQTxiiD8DMM-<8$2T)zhLg8?@CldcUR8&BY=eL!xy z=M-@HY7$s=hUYVae%mX(WkPYurds>rALvbUj$9Av;Gp8Alr zX0fcFb!`1C?`zZ}*%g(Nqxbf{oN`T1F>{q*Y1tWuy2z~lPST&!X4*8TmJX|+21VK#xo;c4hVB;Z*XWRLdCMdbmUVH=zWsF(I!?*VsVR$P64cw}Vu zw-N`evg+|#V4jeHg7mB?Pj*&4YZof@JOjCJG30Z)&)R*wGI$``O5JR7m(!2CLFkSTJ%y~>2H>#VCVmemhw-giAV4T`uEH6 z0{~^}By^%5;Om$^5M8*vIV~^4udhGgrgxAaB1lh( z41F}p=5jZh0+{_>V_ zEw?A|&DT5aWjg|XQg#GL)ICcK$3Z6xUS2;(JCJ~=rVISLyMR|(J^&{n+kH8h?pL6N z*nBpwj?z;on)uA@@?FLLuX6AXD-pp9P&)6jPpzpjS=o4+y1n(DCpV{*1B;v8NJ(bZ z40Vn>G(!);PYg@A2{7F-^FDBO?{8#Ac9G9e@PO|83k)Qj;_IkISFo0-zFD*w=%ly- zJxNpss1uqYnxwOui*QIJi%*2`>w(eQE*db$x*^h0gk)@1ez%~7zgYBrz&x1PP6Fy^AZ#l&~QnGZF z|GE6^6PGzjhb|>(Ro>R+4`0hNmso_nA`pXr5!fNuW^I%7CY>AQ9*h})pnGzuywdQz ze9|nPBoxAwQ4cU;Is&i!GOZwjVIRzQcA5-%8d{(1B zV2mw4=!y+AmTn;vBI}nG6A?=nE+!L&ywg@&mE58nN6+bbFFio?>CaNVM5YD{M~e%i zDNc8^EwW}vbatB$ibyP!za zKnwF{gUca9PI4LZ3ysXXQ!~H`&m{0Qd9ycf2;lAizG}oi2P-FYhvoUM8oqzs-v7FB zqXna-v_6+^Swq;-kyXiI$nCDU{`~>-i`EhZ$lm>3GY@&Z=HKtol!Y;~vi1eL|9r~8 z9znMJp9dt0fR~e7WeQmTGq4sY<;$YKzq-R*=V??*w|5?z|M^}`7v%9E*dheO8lqsh z;;p|2?>}E{fQuk&@(lL!v-KiZU06HwzbD>e2l-y14S9QCvh=@R`SZR*5isxbFO?*9 z6UKj2{l9;xz|gS@AJ;_b6o~%23qk*Se&I9Nv@JRGR$vQy8!CLQ@xMm!jUKjY^z7<1 zf4^JKUwA_MKW`Ze15Y30=bNbhpJj{3%x(Ic1^%JC^~C;l!D=ExIK268UBS&J(Y{X6 zvj20v37_0MzZ5gk4z}h{l>FU0vEtviq<}Q?jn_@X6T2Ouz=W5c4&Ua zKU?^(qateH4V#wmy@Io3WLKm+sW%cjl||Le}y0Fw9U-y@y)9DI7{ zaT(+P9@V#5WYtUXE$l3?C}81Gy8iD`-7&m)u!G3t?qlx!?-Bi-URXd2_bWW*2;Wz6 z|9zwEZloW;XN5H}@}K>mQwsV=57=&8^Aut+bHS^5{hu{MgXJB#mRJ}qVT|Ry_5Odo z19ojM%C5{=!~E~CyoLWESHK^>QNmD_guEg8|Lm3~FGBt=K)&8xUGosSJ%!yU^1+(K zeCRD9W)KlI8uAKwB|$BH?&=$hr(|^#`;3_nGy8B-4$R^`Bn`in|hY0 zy+LB#H(P3yq{`yIk;vWIxmYliwBwl<+4(v3en0Sk5Etjq8^>nTh-Ow0XO1;47QYI! zd;y1!^o%m;e6k+6hyY6eUeCI_s`%Q;CJ$G)x`9&4$VWmba9^8l8*0>VBcg}2BCs%g zSP8estdf^V%UV%;oTXZfjT_mn^8Us94ty;{xvJ62A4I>Eh-=Y;q4bNt?}aEg<}n++ zw2W6jnqsbQ>s+fuPKLZ_SppFCtKnDQPd^mpoNK%YRjXF}eOvgo-`?BC87!YgRRsyT z9keVs_xZrZB%h?zai5)fRRZ$|Q}>pi{~~!x`stm*3)BXB{dM*ado&obtbNRaQrG9) zK}zCC4@@Olji#4J;xDw9mZ2vP!!txCK6S)xIjkcyEC37#Gy8+51D(Gd(g^TQU7j}I>h5Zpq|l_0je0GDYi}o@cRJ{B z<#KOKq-`Yr$X(Vn2cvp`ohcsxk`imyC7=+LJN#B>XkyEo8UR*}6qw9$!H9-EeFb!1 z)Hv&(-yrsy6dHV#rq5;36}=8Z^^*bn=Wd_xvGh>C20R631zNb?bz(^7>EiJO*Tm{P zfB?b0VUQh2f>nqsz}Y!AR>0%dcBP3ZcLUmoOU7>3SV>kBiF4oyddW^5nrPNpMqHtx zV7vocN&ujq-m)9%{9tpv*vtvsD6C=*H0-j*3){$4e=Nu||FfYbPEe>kJHFIoV+JT| zoRE@&(vctnJ8`Xy!jc1o*dNuRBorf$k`z4a?PPbIZe zR5dCEDQ!nmmJiN=Pj)|-p3H3wZyUDiZl(0S@>Kh4Y+gy3_5s|W+9MSOmLd^;{0VMe zrUx{cN6*3?0fMqM=m~Q}lUeBelfgk5-YVObu9v>Uu?^Adaoj(!Ho-X@@!FWOlRWJ_ z3fE`_9YW4%)6uw!qW={T*d7{psD4{VP~=%r`&c9rF3l`HwbbEfpO^zqbU!9X|hnk(xbX;Uz*#CM_&EGWJpk%#!@8V;yn<-18uvU(Vk)D_Tu@#|;tulk z+XLUk@L75{e>NPFF0ttQhZ!=mbWS;}um+1%OfD+obUZ;RV=dku$2 zve+*drDV?0&6q}ipP(=dl0U~f0yhmOdd@SQNocYPgM9J|MWZuJg%@j9A`Z)@rX8hU z&E}3qV1Eoglq%o23HEQX_B?$2!f?C$Hg=MKRW$3_lfFed(Pi`E+AgcQmLEHnrA3m& zl|=9OC5f$nq4AJWsjOk|wpaOH|t(fWYOyWJ(jv;o-O;#eFR#uP^QJwX2F%BeS^PlzuPNN>sXY033s(hlX_na%nuNYPFTdQ-xEGRh=LzErk#B?KNz?w&o4-dM8KZtPg*aK`1lxBV+&9DZu#8 zxZ21VctOVypfA+4-}Z3IUQ<=D3e7xh%35Wyg17O?Im_M+8FmM_5j~ue$gZu*L;{^~ z^~icbCZte)hpp-nr@!qh$EQg^S-=|pU8WgtaG@!^5NtTX4?(0pJD|Pjp^Btdj+ zj>}pc2J%4H#RcF%*mZc`w<0f5|DX@G%|=-H_!3{R7ZdrT7%r29E&K?ayK*x9SBNdY z3PK+GYsqwp(>jIxu=0d2a92uo6+yfnqOw6b)~G7%c??N0>4jdX+1aQvlIf#K_u%E0 z;+%!~LWcC+b}IGQv;ybn-tcyGh`kVp^p{mwJs}OD5V6Fe;)4sFNVbCA6e6T!xq>G! zs-vTl$d?ERQ}n@m&6^j+Bpn2Jg>1T$n|jd93do)cURz@?9Mu3aVr42-+@uNjelwsU zk!sya{N5ItQCu7HVsNDhV%bwq4B~n+ zFvZhe<&Yi9nGCguE05zp&#z#jOX7O*B^K><+k?U>^_^T+pW65_t{rYm}J{IZKfl4}Ci_?ycNR*PS93 zK9|&Wr`rKL-WQK67jw$pm2(JH(J-6d_Xjusjw!YhtomuQP!^{fL7;F*G zV{wFt1S*=Zz!eu@!hGb_ldnQntHWB*btok10_@lN;1a+YOl|X%xjKQdA*sm`I#B5+ zBF|)x*}noqs+6{$g6Q$42SBv8s}X^$vF_w9wyhsey<*3qru&yxsNi5V@hwMc0icNK%q#7CyI!Y5X&4w z5t9W;r6=QXaHN(}565*OZIL;q>13dua&25zE5H7-#{e!w_y)L|;?m_BvPN+B?OO;X zVx$0Sdx+PsE(c{`wU>w&k`l92#w%dtN3C$k8cOE8=nWLWLEBZ}!=te1P46JK&$gu$ zDMQnm4w%wOlYyxEwuc+0_3;(x3H+KAC0w(~WDFoy$~tta;_91an2~dh`|6e+6%7en zVSXEtHBlOi97A|7aAkx@iC6!P9JorxDVb?F{~kXu41?PLCbrPSWR|^CQ_vu|Tb{_$ zKGC`&Q60}LB9zF6kn>einW(iAI*-SbW0|aO`x#@q!}A7uzf83;E6sTl z+4jw^Do}xqtdpZ_?Kr0N=NFkmc+;p;R~^>-$fRw~BP}kB%TJ^gYqUC3n52!a0XuDd z-4K(1N>IgaLWd3sRQt5oeA`sPmM6PFSk&I$Hf){EX8dgQ-N(!DZU{rl%iRq>cIB2X zf|SDSr~xJ}Wj&32Tf){Ast}o|HKZv>Y{F(D#H@_DGq*g(tS!Gv3)Frghq_U8DS9bw z?90FpO_|NSP#Fepd#AiLfxGEg;jTQTd*FlN{^;PJi zb~hKdQXs_PaN2Q{9^)0xHN-&j4uBA1q)!mmo{6lU#o_XvBPT~OyYV_A4J^`Uasz^; z93=`>_f4_(eK|xl10+&Ku_5j}bjahdHTl-jn_fjvACSYa61BAS05xdbb_D)wfSD4x zMT{CPW{^v;vj0fqnubxW1fZd{u@Qmu#$|-Fu zR_h&#N#qst31G0HHhguKl?_r>^s`M1>Iy>9A2Rm$IK*G5GpD~CimeQ`vC!pn+&1Va zolk}&GmCYLFeXmK^NMN}uml&+NEMLMj;eD9p<%=Rno-gkF z8ns`9QJJWSbajzYNJDeeFIzreVRw6d`hG0&?(S8<;=PVa|8Le??n$_)=Ra-HOUGf* zIq=ojp_uSq!VA9zQ_~NJ9Mj!(hna)kA-ZEHrmy zec;I*jE4Fh$-9NJ2FYvh zNH*$3&l8E~K)r9wa_+A!!E9-+^L&*zLi3jb`$46Ap8|UjF_agOt7+7#cp)yWJ+jas zsO`mM)5Ij}>LNM><*DS($AySlh5QG#^QUrwI4VYIrP!vtujLX_k|i@9HHz67gMU#G z^R$O^NRO6w+F*t`lEsz_OG=At$8ljx9)&#qm-OIPh8&TRC#dregNq=ng$OK=;G>=D zg)Z*Vif1jKv3_MS4V#MDx|%vHKdy2l`XqzPOx~A_0`4ap<9D4fsznZIT9IH`$TZBw zS0T8gvN$wYS(J^cn_wyy6*VMG&hRv{rp)^kiKm@wMqlCF>ypc$xQA8pTY$F2Tq7f3 z`EuayfDI^92$Ovkl{fEQiZLC1lo8b-!ZU1-p+Sc7Hfx%iXveW3x4Y_o_CT&oVtApR zvhS>3g2c~rL97)5yK*Bf=BV@6@t_`gyF?H@1nd;}Xz%4wpDaCOm*MxB z0*WQhu-fV}b^xX~N%BTlUH1qpdpv4oP zfhKKHsWYGx)4z3)j(IbB3OMyBZhr$$kS63}exe9IQgo(v9mjc?to9iKZb~{H)r+Rp8h`36RRbG?Jj2tord3#}4zO zdE_miqD7a06+O;B*MYjc=>(k)EzUvB?Lr5&zc?I z-rqUrFMl{QE8g|4dY=2f?vu;APp^*>>Tgr^6SrFRK{2>(`IRv{BB$}dV|QDG?|wQ9 z`@jTI3bnV-KrpT_H7ecw-Q{s$AW(w7Br_K{yRO>G`7Xm0bPbRMOme|@loL@W&see$ zyYb7bmOcm@iX~kFYT*z$|AT^8{zCSaXi)!o7^mUS*${+IM%nnQ&rv z!j0kId<$kE6w<9*;|UXoWiG$dW``610kmf?-!qs!PiK*)k2Mp9PvMa9Et%emh2=z5 zAMHnA#NJi5vqkl&qiS6nwk&IEQryv~L4?SZ0T28>B2r6I0_|;b&A2?8@<-)$3dAW4 zpH?&uh9oM-9s?0=k`uRs^4?{+`86jB?i1Qa`FvYnIYAZz-w05__$}^7+vKKOG_nc? zK$60I8(?KzHXUtw+2N2({aUM5%v0tudQ{L7KDxPugBjT@%mzD+94tqviD|yT`~Upbu|rgalj+RmeA&O;iDgLn>>KFH&mi$G~no&nw>^U7y0sP zfZ`YCz77R9B4oJ@@@>WsD1J-;*F_vuWU>T(R zXZ*GnuqQ>cQ7Y&1pv^hLwX}?l#2Z&IBP1XvUOUj!sJlyMF+A<#A(h5O!uPp*9N9o( zSHk^9nlPy#y`rw{F#eYp!~0wZtqBgxVd3cQ9@PUwxFiWoK8Lty-J^$U%!<&=lX=we3mDV7;5-xJlis3h$b zaY6bl7BWn_h9mw8slylV_16I5NB#{97PqFnJrUtSw6tH$J=h;|=ELh>ol$R}f$ol} zt-=J2mFG!gppV81WVT&jr(S^?pAPlj?&KL>kOJAyLOckVg z84BvH%8|Tr<_ZeR-`~oos6^cZnm%hp)`&73_~SX?ZpfvTkb5G*+8nVdbB3aXdO}sL-t}oWPJZcXsoqfA zPQ3*EBYbmK%6jm_*_?pG0a69p^j@Mov(p&~WWuo8-hu;Y)d9le(#Aks^WejDu8sIs zX{zv+Wh6#e!Hvtl0&ArN68HWz5{HyruW?~nVVm*&v9Rq024-?0MiqU9ZQ6n(T4R-h zjB&S}EsFJfYEMRb2YuKAc;NrF3Q`qd4$lxUW>_E|2xi7)L5vva;hhf7~#ImF4B+aE`LDgqlm5S7iR9Bf9B^r(3&Cj1c4ltvIK;J}M5XG7Q{^y3tNBeNw||*Y zw|#KRAjtYCBL;mRF&8h!+XtKS%|?eU+dk6OcXQsq&pnBTOs$M3HR9NS^9TPf!57ZgaD4ijc|p zBHfVfS@3l!MLFyXSl;%=!#<^E1}{J}1GWvDkR2UGa~O1RCM-t6K@fTbbS4izxDJO7Cq%LFo)nZCDKR_~@N zYH(ifJRKT_%W@#dh|V13XqNCig-EV?M@D&bC^Zj}M$1oNV0wV1I>hlh^co>$v4H0% zq)~U;P?hhq>FaJR{a-Y3N@vV??EM1EMNYK82#o`e^@!6($shzQ_db9eC_*f2h;@)L zB6wlIvTP;F>U3e6P$WM6Dc*RI3qO20@VR?D5ximK{x`TG!!W2Pmz+A`6TSj)%N$*vy1_T1YIGb)IhXZW<`GrE5cdE=H--yJagCE&A>E4o8d4G58;sC*-i1tUC>O#;KD+NQXq=I zwp!&KDZAU9Ct~0l#gyuP`r}okwr}FnxsYk zfK6uX{p~1Eh#m9_Kvb!Ub6x=ijn^jDQ}U0QVJJkZ^zEO(_wVKtOA4Fk0z29Xyz zCkd^F>)DNCo1p?3`qM^IE}kissTG%eYyqli2`s+8Dh{Y0!H}=W7jXMqyNd#?Ck})N zIhL6^g-e}q{vwTx>#W{KT1!#D=3xcLy0jJfS#zsMNK^kiZ6*i(M^AZ(YOqdtJJM;w z)-2EMqd?|Jrk=d8P5Sih8w^8wF*{G{B`(p0GG>pnwQt>D3Cu_u_b_0ojRp3Yer3J> znN_D^Nb01H_{9aOjHq77%EO+;(AXj2HvpYNBa?B;63$rBz$Z$MqHFTgmr#rd z1T3H+d#&6QlIj8twbZUrl%FrlZ{0IjG}pQX1}1-~W3efT6tyWpqy_TpjF=nBuT=+7 z*j%8x=Bz5Hq6FPWovF9n;`zgcB zE5_r3q8TOaC{ho^w2d}r!w)1}^j{H$7^{u}!B;21fz)$#+&#`on-n55wl2?(z`>mS z;#R_42CCfDQ|V%(<4Iu4-oRB4KQmM{$r|Fau@$T8ms6y#3OfH(^=48=PN$4UolFu} zvW0)|+qPt5(D%jKOzaoZKLz*(Ub(x+ee=GdXLSGsl7N~{7Em@c{FG^2@tbg$fGoA8 ziIfgvQ27kV+Jnms)|!Y12)q)dSpon#U*e4_eQ}ZMdAR_VWK+4^V1>Te+A81&m)kXx zv}nhHaaX1pEjG>FowRk=RPS^dTpvj+zv`%y-zzu!`V5x$m8=RE*e^@a>PGg-rxdu# zcB8?$i0?$~g-SZ~+^OhE5Z=-FK5O#`=GdL1v_0f?T#}}cC#9@g>s#50wt@&P?J@#& zgd5Y}yyGx4-z>0qG)8t~y~m`e7HhYxj27-S^J95L6yF{l){r&0YVRPSA^6)^U2lgH zl^+HuQK^J>i9{Auma*eFz+-p9%fOi5eoas|WhRyx1yzlMtzS;VX1XniHQ~mu5?dE{ z1LbGbn08JFDNoG7mcHpp@YFlzb44Tl-u54^3Hd&&_D*BD+gbU#hb^y%ON~9Wt7lMH zHt2-O1-Yox34A-Y=(QujVlEO~d^tY^8Mn72RK(wa6a5$JEsR3Doi_zqQCMKV&^EeIn;YVFmkPC`{)+GW?#pua)ryV&t4oxlZ zs!5@w-EGB&wtb@IQ9+}LHTI5VwoLKR&HDnjAiyF;HnuIAh^y(Z<&rB$Gak1#vwqwMNc6R(>3T?$ZfzUv}7WW_$Pc~Yi2sv@!} z8fkgao3{32Hcuu%oPHuaHA&)@MEm)srUk}bM4CiSMZ{EejMXms}SkmftMhp4{&7%Th zRArz2n771Xd%r{=tgmu+Y2R*1s=Q@{L7t5m4X!WV@TU-gLAU9)_|fLWxyZ4MW@M-5 znPL<$pVafdKqxBtEnOMmGZ-hZf&;>31D!C$kem9Z6+g#vo2HX_=wl(zO;;$-6$5Ur zWqu?Czv<;^hFyZ}OXfU1MjYpAvc?s!Rgk-~?ayqD=(oG(24Fxhh1beZ#x2p592Jhq z%^53EU^MzO!SmbQq+P_r*=PX8QMd=aQUNIGrSCq=Zbjvj2;6jtlz5j*SIdt1;XIFi zQv;s)Czec+0JzD^D<-dG{G&J;XCLz3r!&|2k5Imk{W8_Yl3jowa_n`g^8}z^{8fV% zY_hxUpZqJ;|BFO|s-`ILo)W6^9e7VkeR1eKp)qDNfJr^^5U@`Yn8&3^O_lS#%!N5Y z0x%D1zRMEWTt=L)`%=ig_b%Z6&Ipe@1t$jbQK;QLv1Jwtcj=ThC+TBhe7a0}9e@AD z)r>s&&Kza!*45sRG$LLXD>t*7+MeAV&4cw}g9UJal7(he(*y$L!16~Fj_)*bbDE8$ zW%V$x@Uzl14mRpIapatX0s!q~voL=M#x;k01Nuo=^|b(B3_YAjY?z&w5~A9aKJ~BM zkm@a<+|bi1)m#^vDGvsd6ZxEu)RnZDkDg(+xLX||iujAjioy@ci7#UAroPRA;0C3& z^=Ij z?)0dY+e>GH1;*LLKh`(}4M7Ex*L}aN?NRhw7zpWT{CA54oRv11zK*c`UW%Ihw#v^7 zzX5u2o_t}*qJp>!I=5wPExQ7yj5>ob;N;~%4OT}!kkDuRq{f{q{lSlYFX~W}d!pgx z@X1T|NQ2{e9yOs}z&k9YA4U$4lgLlNcrbddahhCH%3z8_yW9sspG|y^;1+oyQSSF2 zpL}+^le{E5)5@olmFgAI@pqKg$*PVXqvp+)>CzsGzlPxmFoZ6-XdOGTkUPGavUU|bVkp*Gfj zZJw70mHT;=kO8F=L8do!_hgaYvhK?C%oKPS1rm&F*3oSZ>So(4LU+nvD7L5k^MfzK zqEp67stqO?(sd#4IlC4B;A=_0FBuibyqfD&68M-wj+CyvW1W^ZVHmFTg7^LLZusY_ zay-isj7nd%VfNOEVBZwJA3$Tw!{5@Zp5rI~?JsXpjc%n0Tnh`!0k7T0o_i+q11ZYL zCHOO3e0ipLV(~#0x9ED%n71Jt3XX$)cNn1#ac&N>+!q!<9x`FN3aQJKvM15sI1xD~ zW+{OjkJQOwu3Am_yM|Quc&P}@KW~SDs+~wMgo87Myv{ITJ@(#>xY17lu|roDxG?G77f{-i-0jO^5Vx=|hkdc)H!{_}M1po4Y+Nw0Mq z79ZmwujXyeFCM0F-b`_%W<3vI&MofrII-#E-?G0yq%vtwApbs{OgBQ3t10R;mCLI- zbVoU)&)yRhb3SXol9)!GCq?MBge(YG4u;ZllL@m7pDRZ>F<>V9>=sbR8etlHPVr`= z>j#l#oivr=B*pD{m$oSzI%#!8PXhC4YB9zKc&~_lGL+YDW78qweFSVZY8)xeH+oTt zh#;bLaGU`4*jX*nY}`7!4DN16DMmrv?d(Hu)jDH-MvG+=Z49Kz_7i!rac!4$U{8I# ztE((C|GHb)^jGi6)M3~m-vnlIZamQ~Aq~>8XsfrDhes&ja*G$9q(=;>S+5U-Hjfrs zaFbAv0NbRlCwb|mM8U?Dw8As-@XcA>j46Ow!}a zR4sO1NVz1JkN#-8bZRH7r9d8^GnK{~6cDX!8asfnb96Im2Xq?+*pqwZu!wevKbIhmoGUEH?dIpE&%0JQ+3<-fPbUW zJKGn1$?GH|Xq?0udl$C7;<7s%);ElL7-eh<2p90 zI|V9~y-AzC997@hT`uv=v?V#2;J0aIVw$7jToayUyqFadjLg&JVc`jO5K3Nqm-36RE=XT#K=YI4=)E!?Jbr?$VL8(DpxBCN$=Gci1;Tlz zf?#Zdqus7ZWib;$!JG>w^2)T+pvS@w&u_UGr&DeKR>!n7LJ^Wn_S`zZB{4uzp3)LEdF-3i+d!hFewK)}iC;YtMXoQGz5i3bc>2Jnk>lng_WYLGk|00j0iQB%CGY zSVqYnmgH8js!23grMp$-V@=1g-72kABb(dsoOrdBqK$|Q*Wfb`UuAdEeDrF*(s0k% zy?O6|kdjlel9#7bJH4^4lZs>SV<;$kQ2m{iSC)dA{x$g<57SU=!2?=j{Ma8Nxjlf} ztc*2ex*vzU2*8)OPUX^k;*{}Rm*7a*&uOM^UtqkDxKGk|@VF2E1elY*gU|#qOF_sf zL%7v9E@n%rAJYAQGW-IXwhjEX8R4riTIkas$Rn|Cmghpp(n|>>ui=|~r~~F0a0ANl z7MN_Gx|O?vsm1&+Fa{6&E8C3HGllnx&k64?eBHNO@d?E%;S0Y5u>{m!-;SP=`TU|5 z_Ica15Csve!$wd+H{60MLR6%H2M(%jWmxmlv=G9?=h|^zYxB!GWmT}D9q@LeB;RRa zT_J*7o33kvoIkOwXS+9iTRwRXVM#w>=q>4D`I7Ba8aJQ2$mC>~;3wa`|7E-6yreK! z_w5{Dy^t5OqB_-)&}lL{N7!jnp;r0^D{t6_jh5hHRF*l|Hh?7kGVQ9`qGROnQ%U~j z=uc+R&;tTDlM1kU&WeSC)&X} zbCSDqD0^T*7ao`ABc*7|&hDPJ{bo9^3_2Uwn3f%CGBIy%jkcMPfgTyn9kJ} zI#8Tcob2#Ey9D}}O?7p=kI*WulF%-l*yf8^o?QbWMg=-k1T8&>@iw3>v;4Eo`fKd< zlP~d4#MV}(IN#u%nLFkA%8{-^AfTWgIKkU*-n-YOwn>5?X@bi0d5jXGUtCIbF~2>B zkTsr^=;5D#ENJdlKUC=jV6d3Y`GCQY&4h6k#`iQEv5lU<*nJ4^GEM9b9rDHT=iMIe zIe~%s#EFT{-b+gXDTc?w2oC66X`XH(vW&{(ag1L2?yee%(TZA~!)OSyQ2;I=ZS=jj z^6la#5GDS24oFYos%8w*-EDjC7az)p7vFTI_f-KO_)-^q5Ko-VWEz1C*?GQ3vdp5+ z#)O(o*o` z9l&E<&@IOEusGYHEL3~fOBLsmJWY7@PS54bazI?moZf8aJ9Kq7i?4AkZYW{^a7F%} zh@|V)bCYWH<6=yC?$-bzkJ=-EIrZb#fGtfO<;T_6EHv9IVeF0C+JV{e*;owBF681pH~t3JyMYNu3KGlowBmgIq3<=+CXu& zX^{A&rcwwQp|y9CZjG}ws7tDTp!)}JW`_TA?_RvVQFBG|PlG`RPkOSQr(mFH_5Bl- zk5h>DWo8`kuEbMEmNsE~JO{Y$PP@oHA9H$?-(~s+HER-WjqK2azYew2oNKe?#&>jE)dDiBw8-TZu4VhPY2KqcnzNGmGD5$Vgl?NVv z$VyZvD+hDkLYo5U$x7Y>=Gso_NlouCUB`O(_keR%y~#DMzK1uitmxo*=4P^ z8T{bvIUY}Thrf|?$DG^egk@+30b(JjP9$yILbR>jGwURij{GhZJdCGuBpSWz6?60Z zHhSSyClG@(^7_4+z39XkX_O+)uC@#jqi`lT-Md@IKmVef(M!=s zL?+Y1OiIy2^rMF$89aoyDcQ!q9PmZ;j7^`08H>a&h~ZHsJHBQUqDgVceaAj@YB+t5 z?*PD{#cf4*&J4rdOv=B0rG2U7pobf~xe{9g?y?NDToCbu^eoy!Wk^duH3p!BxzFQx29ua=_?f3p~J zIt)Na1`qefB{!Z(B>AoZR_1+s?^8V}c@Rl^)1_kz{tF!^tjyI+>oXyA% zP8)?GM0o9$BRo+am>Br_BAF5wiq_{hjvq!L^{ici9s%n#x714I=#X)(F za6{%{3KEtQ9YvD+#!wx@K_*e`YCK|DhAX+DFx63OZzF|n_$t)~-&`1byg6S5Avk2= zCtn9E9g`K)?3g@)cg7X(XQNUv!k~GnCoe0D5S`>S5kFIav$k1-Pv!JlEo8^I=;Yg3 z41piWXzR56_tFRzk>viEy_Dm9P1OCni|-_2V{~MZ zmM=6bEo~be(MZd>^}F&lZxw?LTlZU)qiAAuIk9T+s<;DCqMj-bFnIR|0R_jBq&(Pu zoll6%d$&NY;AjSD{o5K;3(_3s-2>=5Q*57w>$lUeb_(5WwlD*ff#f8Ujc494rSpTN zpz?W)A6s5Su5&hl-RH^K9m%QaaF%AACDi4l0`hd?i3IqOi{-3z6I#^g272%MoG{b* zzj&o-&N{;p)uuH#hM25rhOvM(1XYA%Xcs={>Wg{IMw8TMSI(m)v61L^DTX#@Tu@Kv zGMe*yxd6Dy^l(+fe96?YZLrjccx$2l?+Mgj6d&RzO-K;^?%?3fzSOM8F$dE$gBjCF zzoY^<1u(pn9HYJF9ad7omen#fVcJ@%pLsdWy4)sGDue~rZ{T`W@&s%z{ob}~y_SUO zO|n{{vQB2U7-;N^NhPm9sqO=WE(#%S4`LhPOXAMOoNGT5`Q^oizNS& zTAf#wG7UDPvcip;Iwjm{N(F7h0aks9XiE{>nefB1HYzKZ)>v%$7BMTB-E?}s&JOR6 z*SeLjT|&}+(3V;<*+dbP6k$-2D*uAT4JJlhoz*S~S;TB9o+<7*;9nX#0i%w_*KAhn^Ms98w6epE(Wm{d5W)Wz>#2U5nGIV7k!QPa{ARV7!J z%j!_pLVC=Vd5Y_1PxPgy6#R|xVU+uFH)TlsFKXZ;(@l;!5Wiu-)_j#a?U5wI*=eF} zlea6~8xk@+jTM@`E~7(nqmyr1&wVpPcm>|* zlES@%11D6<2$nqbe!rFSu5_O>u^4k7PPpW{e?{j|uhkQRgZZIsB8V zRy6|AmD3X$xjB3~Kst)=o76Lk;1C$C=eoCA)qqjx{4jGcDE!0Eq~CW*Iykt`ZI}mJ z+LM3WnjWYm8?|ia)t>H-@jVkCC8<~%N(0BtifEZp)%m%!ziMLOML3(%!8?958{5bi zjaexZHq}IKdfP_%{oSMjssxywODMm9zf}AXHxS4r5K5gcE$m~;b7|M!kafXq2^Urf zh7C%`7-6yow-Rnv46x+Sjd4sA8NDkQFR+(iSCX&H5fJ|-vQ($C>U4XX2@zO~K$_!c z+z&0qG(+fLPcJnz#JnpOqk*fS@-+B>Uky_EI&kE!@>r=iyT)eHf4pPPQ=^+n{|cAd zBByEo(lQnk97>qD8czkLpuhkTsj|!fZle#5n|%qy5b7H{52-o7Mgn~FwNWzKa?*16 za@@!T%hF;qlMWrNNT5tN^(#5lM$CY;m3*zanp6h9oY+~(+t=8-PgM~dQ}Xt-P=<71Iw3-$2xPQ%BC6i{25lLUxQ#&MwlcBi9H5r~l^iYMNU2(`pXjs&U4e2HFTk0nQu>a#;Fe zKgZPKRT-Ct@;O*IvLD$9x7d0DPH z{+-s7%=0roP1B6zoWJomly}{`nCPFjL^&J`2U|%r4UW~{K(fH*n71#N`jYgLgIEOu zEX}@J;_oXndhk}sIlsHmYXuma;|)(U_JM5I0J&${L8|iHIo^Iss|e5093yuM%J@>% z9j(DRrVJF|Li+-A1ND<3ZrZmX1iIRVvdZciQpu$k7<1TpTkc(Z#F`Fm%>r<$X|LOl zU@%3^_+?W`5ouTh^| z1$ptj9~Nf6pw{&Cg~V=G>R`Ub!O00iCl;HwT0ilM+DJU`*7)1-k16_lfcgkE`C*B% z7D|>J14;`bCdus4vJ9jLHEs6h*&Zx%U^J^2HgEYJ4&$vI)6jd;h-Fr3g3>J51Y)z% z5r5!60zS9>)-8rYUW~b3=u^EIILEv!V=p4TXAB}qL@qSTJdqY&aX-K%qPg>p!-Gg{ z=MgaHoM@zMwUyaKqA}oCMaAYxFvm$>QV#caIlgWDRu_~s+pg;HO`R<;UM{S}MxO}L zW@Rk?&1Xh9b78iCEO_re1qJ%qu6uy3t%?muw}V%5fZIT}e(oU6?c^!r?izn!@{LPE zcrOI(wd`dm9luhlt*qkGFrs%fS(=eOIh+6f>TY%kVmniC`Kq73F}Qx@HewoU0vIekqv%2FW5F%;&_-fE2@A;ILr3yru0=P z3@o^}+%;{Nh}ew;RiJX=}^Ss;&6G4y`|7@2W{pUP7avKf# z&1Mg+^RCf@G8#hc1ClN_!6&`2$nU+~R9}>w3|}+&&e`IGA3ZU{u~) zbdl@J?)D`6K9a4x79#sfqahT-B7E9eYM=IgE0P4<4>XcF0wf)mj{9I&Mup!Ww}b0= zZC@nTmiMSg>ubZM9Rc$3goCQJ;8f)|zRyrhrJUz4so1CanllTW(4dFkJ!f#Jq;8&k z{IWSM9oBL}bXvQOxQ51FtppAbg2x49dl>?Fqa^DQQu^uy#|oC2x-Y_8 zcxZ?c%yADd#t)}QS_M+ZLhp}eVP5HB6kKXPX&5RUze;>AUnz;r=Y&xtTxIOW`SQ() zA*TF5lD7lNHKF6N(B{u+%82r`ktp=l`vMxL6Qsu}7~f)%E_&7ciWhtF z+31oLE_?wK9+ao%8qp&s^}V_t0C{KmC_8OlZIg|$L=3k|e}Yf$1%YkqtI>WrQ2rI< zJlzuep~sjDyznzxKp{1Bw6q#FmoW~+ZfoHD2`2=geXgQX={WNZA z%GRL=_pOMV&7@kpZ2h3MnXhL0^)0ufmHKU;rLQrT{F+V|H}Py#D_mCeXPkhy<5Rd* z0hJ0m*b#5OAoC7IKhT2s^{bb-cH?B6Az{8zvM**uWXr$wICOD_c@dfgpoSIS<|c$f z5&DfJSr;`BU5{DCp$air2K`e$(hpR=kd-eSaZ1N8xeaJd+DSBtJMH(0u@O_77%Qew z6HH#9SUw?g`&Z*N9=RPY#=)ggd)r&8ctnV+nsWhk1lvVbzOvfV=l#}xnGNzn zops($QAi%0G|_C%KE#!UPJ`YXA{Ac6twN)KOqE1dpek}BFonnwz#ZL>-JgST{O(9L z%;SLdT=y%qPUUM1lIWc*&yeAS?CFqoYt$L#-(|C@) z%;;8B8^Iqd`t7odSPa%%JojyUw|DK3_zlFh+@aPNIqMA9$DYgcWj#Rs=qsV7r1|QdHB@W28sK++(hVUH8GdyP zElv{d+@7teiT<>@@-^DZG9StBrS~%@Zjyo~J@sBJ9nYtjIa2S+KQFkpl(83*)ib%yQVm`#7i zu9X2%rsqXr=mtOv8eo3?reZ3%MO3h|xOjF%=nT+fI58IE0D6qx5aJ+kUM7L+EI1qv z|GUY9TM zUIWo|Sq(8YP=4osFgvt=V^WRzU~9G{&Gs?x=HLD*`vf?J4*)h22Ndr5AKuC&CX~Q3 zW2LQ&5|R-qo-q1U`)f&cq_Ftdnhm754mAJIr?@bd z(2*r9nzBgwUnG}5H!=>IOYvt=8w+X((|>m8ucc3-LPv1&rM4VEk}D<<4z`B< zk3Bin0$CIN8M6J)oU+e>Zo&o@y&SIn>fek0YZ_&f z!|J?}7ZmxfF7p4tbIjnO`vgx9WC?W2{_o+yg_(yEgXa0^7fYc)tg2uCP^6&aSw{@z z4e@Fjzxa~@L=ckt?*smh1Nt#xu=e}E2PqpBW>N9)3CocIB?4v4YZm;KQ3&1O6`3Oc zKf}`;Ap!|C}`2QO-rNq8iRK_e)h{IpM`0sz9 zXFp7_9T8Sl#s6Nnq*)|%pCaiS-N&U|`+L8dh3y8d4eE8=uWaGjts{ZwD3gFkLFBD_`y+Yu= zKooKey(%fALoaGk-jKi2K>u~5e#eW8OhNP5rXO*ap8faF|M*MKANIZ-*z(s#ihohm z|MhF&*aOKlz>kq;Js4i*{e9y8`^pUi{kWe+o1zlfn1AuL{@BNNvCuKYQwv7x#?*60 zT=4zxQN`G=)piwVHow=@qa%3PQ#9>!4Np0cuy@oYipIr$tc={!?ojU(MOHVX1ORO;zsTHSpkgsi@-bjdXP(|v` zaJTUi|G3XFUbg1QyFtvLr`JuXuT1On}(yqFv-u*_%x_SH25-MU{~ zt!tc_H04>IP-(wDkbsT5O?%wt`AkWsOSp7*Tp_cnp0?fx(00@3yM@yA04)y1%BcKR zKyFwCyv=#<-;M&c)?Z&#-Wzz>`St+_)-dtiis-2)Bp9jB=;;jq>JOy8Km|09fk!}B`}r3o_HLfd(48at z!t6^2J0)mT54j4o*d3)$r77kQ5Nn5PH@SFTp;TTPye-s>H|1#YIWyP@#tInfbngYy zxPv#r&TQL6Uw=#)cc@RXBH72YMQb&CV3}`5>fwHX=7obh!F3DK3yeYs)Vc$@7kp{! z&Iz;f3GJ>w_95c%-g)177)nppVE>KS(x(S{bJI|9a1fqO<;*>HEVs>Fohqy>5$^$h zaGyo{Bdd8wf~MQJcf2EG{w+_^dw?Hl6Ge?^c%FB)@0m`K|E0YjlS_J!rJj^OMhw2S*i^ zmIDF(1ab67(dlK~UgM0Ft&(2>YWWiYL{5GYf@yHc_o~_tUKk7o3f)_>u^=!8Z|%;Y zvJLQmX}Sj=CG)by`5i))a&!fP&{(6YPw|fcvuQw?-rc3ta1?Xui1f>0ce84N>fwt) zB)k>|nrCJMOFi8c?UJBDfQS7k_WQe7x4x%hK1#h-dO)UgtyIBJFc4vrz&4s#p?e4z zQF2=9fqXm;IJx4suMCd>QHt0hP5|H}Ei-n2InezAfv0euLH8U2oN$VWLzFUmC}bn! zw%k{Y)={>gxG+i52f?rmcSz7#o)L_WA?ZHcWj+uI%Eg(OD+lo$M8n^tMWEI!VHq$! z@M46qkhT#F7Dq@40EDG}yMY%MfZsfQAQ?%0fm)Xnq`I1+Ywx>@A%Ijmf8FC-raX+;dICh2s(ULG-X1hms z(vvU>v`$=OecHZ2HeIC^q7=^4-H4crcZWxRCJ+OTo&>~1=+k@AkV*pt&kCHd6tK}M zj3owP+4{I;hRnFSHRpnFe7XouiSJ}Oja}p3ug&o1$%g~KgYF*1u&;m46S?Q?J9q1p zHRH?69_MP%!Z*|gF%HXAWFOui5=R ze&+1ox>4rJ8W92{+bRjHnz68wnZ-@0$B`d4Cu7UglUN5%a%a$=S1umyjWCz5NinDv zb~{fon&*erc!%q(2df}l8lMVQ5u~-6pr};ii7RdWb!1h!52_L9&6t0FumeFD(Ury%{ zlb!1eF6kh=1s@m77Vlh_76wY~V`oc$G{I(irjt+Ys8OPr$eV9p2H#*K*WgQ9I49lZ zR=e*eo^2(Td2$KV#B#j6JkU8bqmb{Y9ws7;Y=`5|>IqZYU-g336+Q@Uvp z(#+lVFL}FMyUQi}Q%moxG__hjuX*EWNO=|IUHpGW*J)v-0k?VSu`6i4^D%I2M@3*K@#)fQhSV=mQe!6r;k>d-P{U)|Fz( z?q2q~Dzz6kjju&0R7|gGlab~j4s#&AU--e|O@M6f_lGO72Pfd(W8QA!Ewh9y&(Hp` zwy!#(Tyk8al8!}G3&nCF^~H5_jahg!ANKp(q&q~0b64*bj)uoa3WaYGVfi13>BMhQ zpBpOCgstwmMfg-Z;nv%95$?E+b(+1RL%$b>pncHRiurm000BDz<^{;+#RKpr>jea) zzS^dcs7M%BAtOun_@UZ&_IcOINsi1r{R8zZ3R$p<88k(t!nt;D%L!jOfqb}Y^a)^p zmsuv~EZAX-)nd1$r-D#b22t9<)5_rreIGs7$@CBpXsErbkaf{%-+a=CTj2BE*m5+xGgKtK}{A8H&J*;)wS?*a-mgPl8wKClamZ4#ckFhf&a&TIeQKY z+=|-fI;N)lTsP<%5I3^%a7ry$MdV(`nLbjzIHA>-AMtOIP0_1~O;Xj#zY(I>?=Af- z7?BH~o#+A-H^`=w_?HEWIJ&xeNUW>TKaThJ0-Qp|cLx-|E~2uF9&HLFXk~N`TCh{r zh|oVTuvP7jg?X^Yf#nnu7D~|X0o_53JhS}{e`oO$e-JQI8E46tt3vxU0ckV+OWc)o za7|j38Plc4jh{I^^6Hbdv}41ZBapG#7|U;n8zWKaX+QR5lE!qDA+lKX_ml}lXhf5} zN7}8G($x#kBSy=}t zcR}4`};2Rvn>g2Hx$_|~o&t5=1H;`#loC0e!`nI)EB}I5ZS4CJJG69-TA+ z6-VS@ycD_lMK?51y$_f1opLUw?3N*rJ&sc>3fC#&q(?jFo8F&hg*-NsZ&p^&GgO^2 z<4cBSzE=<`=UTM&2B|;oD;c84oJbpPCBxFF@Tk6|sfQs|mg=7~<_uK!sW^tHgpPTk zOh@tC_E_K^vuR2Msfm^VyX+@CiUJnSVL|$YU*=u2d4*;BN&&B{6pCEB+*`L3z21}$ zmGUSjx}9RAgNv3~mJ1ctS+te&H#ol2qR&_BGHz(bvN)$jY`y1HppideRIq{Zv@IiQ z(1V#xOK)`r?NxMT#UM&tX0CF-4|32LMIlph%TOTK5cS|Cx6RFOkt34XObU3m5-Wcd zdj{}H(0F{!KB!$x1v(Qcha}fS{Nl9w?76 z^V&k*=MJw-EI*Y-@-fpm6R$DHpn3HOMQGxb20{kuqp+I$>_li?==zzjj&CnV$*8S` z*DLhnnvL{1EQQV(M4jy;(oqVq8bM@VBh`7#HdP8X(p*fkeGP>NZ=|)gQo%LDYx9R{ zY}0@ac6PAu`z7F|Q4}SdC9@f=741ws*D3uh-2HWT3dC!58I~(bKD!A-FVMlX##Q`X z3pfOLnGl>gfbZ@lN^?1ZbGb5n61;!CCC2?V`R55L(9Xr1TwZIxXd3 z15;N((u_Iq@WjKl^j%%A%G*SQNuNceY;*2*KKFG3%Kav(x3Fd<<{QNvc=u$TazFAN z4waP2QoE(e@3S|IQW4E0ZRL-NO&f-|nx0HXyj)d1>OMCS(c+AEi!Xw>M3iB3D$k$H zUdP9c!FgWK${}zIpE(L=lH(x9me*h_%SV=h66Cp2;I(sUi1f?o*Mdza*d5u~CiROXxz!~7Cp2G2>fs2uN zN{+o#65F0~%o(`nyrDfX`vYsW(^|l&u!d04f2uI3TRqQZBk>K0k=H(Q+fLM@JVAy) zWlP8v`~_t+V!NBdp|bDr%P10-*WlC;_ZJv~!h^?c)R-YM|T?1r)8l=2ywlb5qwNm{;dkl|vP zXmygN;;B`R$aBY-sei%@|9K}TG_dOs)%1O@ZBj3SdySyx`_mrxsb_J7L(p9dcA|*` z+?y;16=qe0-$otve7-;&+WVPYtK~E_rfD~Tj`tbmlh)D?AicA)!n^zlP=|~lP30rB z)SUY6&n7a1NOTAc38Z*g9_E*a_F^%P2^hK8Q|CWC-6$mIk3~H$ue8K}8N`DR3w2FzJ2l?4sLV z0s?XBk3!dROf!Fk*;a}7Kze>37Nj~(zq>deK(BJ8T$fiIM{UK4jd!frLm(%#wN(oL zIFzshC6$3DTdPm7q3S#tnKlZlwS$S}F5pukly4ynECvymkHbu?8ChS_RdSK!SWC?$ z4C|qm%Uvo6V}Jiz%2~Ez80a|ql)29YTX8L%!so_=awEwlQWCl2fUEr6fLh;r1}q`M zc3exDY~=oC&CXvYSuXZp$LP~ZicH?Gvr44Mp94(Sz0#_YZQVojfm=Ao;8COb!WN=p z{OiUrL=x`@wZ%&l+OS@P@2c&=j~=^x`l+=bHxV9YQ1r2 z6G?1r?;yp;{fvD|B&%2@Or&@7V!3c?7Q?<8ZjEZ$623_DFsY8 z%V41+!*)VS10xch77fMpm^?&j@xieBZYO!z-yjn8BrNM;Zukc#3sX)0bYM+N7Ddyv zf-%Iy+rxxcfa{}*Fh9u<%S|+jfHa`$31TLfDo3GbWIH~0{DIkr(jXsl4G`fNJCZ)9 zwV_?(0`<3qRGYAadw&lg{ZWr)o8^E%$W1*4X=kuX75^6H)EqI7x|+7J>|*s14` z3s%&gP1=R~*oXlgxz(8gOFpN!nXD-@dbRAdc=?hfnMUUM%wBr)0TNI2@^e!{7OpJK zV5mtcFG|uW*+kEgO1Mp98PxDkZ?K%B*~qfuHbA!bnLpJAOENEfQf8j?Fiwx5hE?u z<7rWOzpGaD5aG-Ue_l#{OJa_H;dj0ifp^X?`83SmGVTaU1PJZnGNn}IMne?3}{(vQenmMM%R9)QcP2&kS zL&W&oP}3fbT<+k(^(Iw0>JOw{WuX&)x>9W%yu+NaumZp76pyey~`z?d|RRsU&NO42v)>NP0Ya`pAi?X6>4*z z&q9g#QdLo__1W(LqRS(v6z6q@J-mud?nji}Ubl7Ka2twL2V@>2cPD`U6-hd0J$Wym zy<^AuHVwUfiT#^BElp4wDR76PEN=`mj(9W6yl1Am;EYg;y5mUYw*S;~T#wbeGYclg5Vn<4 zdF|)PXbI=oa3dk?T!bZ8CJc1Eaw-8pwL~UAKN1rZkVGPlZc-aB*nA~3r-d8VW^8uH z;0!5L8~!}}y2 zoyiF0p0Cb6d1Oj(p$AFJ+$L~lMJ%ZD5?NeMR%{Dr{T8(!VY`a&Nhl1#9WC$C0I(Ct zv-UhbIIEF@U%mge1$~e)!elICd~2u`YgSskfrHc8g36{zTf1CiEdul7DCm>C9P4qc z21_%Ul5lkZrTO~OwP$AYIN78=@>tKlHtLbWP7Up4%v?1gB zB>;Q2iG!*cbiPP1B1tU)~z6(3=+5(clqwkC-|%8exTZ`!HYt1ze(c1%iO-7*MpqbL^wZO`z&SpM68IbE;i zq`F6A@zEO^)pM`ZkTvR-@5+Rmoz6yR@L3iU)3Dx&O6d70>M%GKuF$?G-9ja4%uP==$dXurWuHU^7j$fY}k4rohl7 z7?!co4-Kpcr1PT*1n~sNS%7 zmd2cspt5^aO5vq~mEzXXe_cXyarJ>4fy2T)gd7kNh+uN_w6!5j{*p+@iN#kVZ|gjW zb8{l@A=d~#uY8h;@Cfib=iY`B`rHW26?JnA`Xv^2KpqLu_ATZFRM)y>t%!-DiLn%6 zrft6+UJyzhO6o{`+Avz25sB`W)S*O|$C-(8sFcR+t-|OnfBIo> z7GQdLJIYv#3cd3YyFe9?lC5V^2s~o4G^gFGx5eQ6rD0eOf?U_dmURLVZ1o!9lvf~B zG6{A~*?Ra|(-IMEJ9F&2&dsr#bTL|=JLIr}UeGdTx>yNWnSJ!VC3>aVXF=kfeZSWD z^>WznHOOIgkA1%aaxjFw^ZMu^`dvDfd7pe#mckP1Z21)S01XoXZcdxHl0J2B4GmEM zZivrf1LiNJx)Ue^3G0*akq;=t`D~$Fw*%zybm#`Yp9WqS7Rc*MI}{1a*buOodTkmR z4Z3rxzaJWaXWvUggMF|{h{jQxk?YpgxphIYw<)!UnT5c=ZjWXEISOT;xq3YjdIwb1 zqKwl)lyJD6H!o>np{a_G#dyLy+6RGi-Q>%z3_lf67pd;W1WNu)FtbwOc4A3OvMnpu zZCxkwhMwLEI^l6F-RT1&4qt+g_J4Sxl7Dt9|G)9u&QdYjLbY>Z+aY5UUtUDU8a~$1tLgA$yKG!zXS<<@moZ;=mWY+=E@8kd!#E zw>K(7n`Vn?c5pjr&;tDgaBvC|*ljDfZpZh~5=PkuN|MNV;%W1=pJ_oCz3$ zk!q80DmZ1xoLyGOCf01#G(aOFa~^wzz8gp{b6rK3kt&TGo*xG@Okmf*?P4lt$l6a8 zx#)Vfa|46Gt#g*!TidX^veis%IgfAwPV?MNR4fUIF1yIuZ?FSG3cYg*xajYUcH?}v zz~3ve+J`m=zQ1qpW@GTdpC%?~AHx)1D`eHA*}nR*oD@P4nl^7KyTH8n=+R zM09^&TIjs_DZGI0z`{pxWSbZhr(>&7l-#s$68cw2{a>+wM{0t%K}fyG!{wx##wqJ! zJD^&vJdk8dY~mhY!;!=xpeB6Fof#9T@MT#o7h&Try14OrGj8#Un_(GA!wVR8t(X8vWa=Zyjf{P`7)m?H0sLmM;C_N!=W1Zt0(g6 zMd!(obJ$+sED$?1sSClgk!`_6Y?afZ*1{Kgp+sd=Cj-&iQ6>=Y$cpUEsJKx*!G#hH zciP(lV<=&Vl%O7|_rxE?h^*2S?cC_Oe?}BBa9R9bew;W+K48wxIm;bZ;dWzWH;fD+ zJd`?fdPEF^tTvGteUW{75nl-w8jVh-6posk_ZXpW_h%%i9ZqFxj{dMEKSjftFh^k6 z;b8$6oqxQmP}|MaB!(28Mala%jdUJG<;LHP2!ZGGd;w*Vt(7{J+|Y^$i<7HP`+BWg zj1_!(4y(Czz!E&xHR}wZfWx8q!DfCIis3a=Q5gb%m-l9v#57iU2){mSVkTiBf4o@t zGiQ5b{I8w|`7sH*xZCq)00=ihm|g2W^8;9O+ODJ)$yI5qY1m8U@yiELTxuHBs4jbh zG7C^}zYc+-5$q3^L}fU$f)`ARoJNQkiYy||)LIk=>5kl@ZA_VGrQv<{dSe0HOf1j|c3cHw zSnE)dA{Cyj%85>iVeIR6TK#7V8p`cn2#5jh_Px{$l()UE&vE!80 zh5&YKk7#%*kr8h~2o~|JKjQ<+xK_XR>Vn0(VuMv2Up&Hm)@qEWu;5)Ic{)YvV^1Yn zEho;5bW&omu^CK=!!r(neKZ|(u-H*E2S4cDp_~8-f&UpB;se z1t@c?ewJ2Ic}hlU-0Hc^l&8F{wb?R4j*h6l9yITDsKiW&w$fMi zT$pCroM~z&^W?;{yYI4h6$d<#4##PwZ6xNJF#4SrgM?|B9xs2tUSba=Gn&j0oDd$4 zsFOcL(4F}V8Mdrc_M3FFBnN|u<^GJGZ z1(PG=59}n-KuZY`JZBgEZY1d`Bkr&eH&A3zLmV2xVZmg!|9$743S=raG zO2co5&B*Aq%uw300ZTnU_$kzd6Fy$mZjc4YleUDD?cTeSP+bg&QIK?e4(ng%tLOlwuI)fzEwA8^-#`Lt>P@0y2RDAypVC9Q%j}?3~Vm$pG z8n8|2QIMP^ct#MO9?!s7OIQt^bi@*zGA4;3Rr5^OBQad~@1W83Z8L51}5Pabnueqs09-*&!v_=IhsQ{e8z^JkwM_eCc<5uah z3m^Wx#F^oO5l3;n*Zspo+6LmR;?g;#ZL6{)oht*T>>M^ix48)XIrk~pA+0)_2$s$# ze>$n2P%P8o>r3H|FgqX{pe+0(kvlQ_C(@SBH%!Zfm;;%m9`|5x4ZLyfxb<*U4=S9- zL%t`DL=$wEVvDNN9{QM7sN#=w)!4|5p~V~oD=_SqF4lNaQA_;UMIGnGT$(bJ2o=qZ zXLc!ulq{sM`=#r5Yd(}DjzY}2l+mJVFv+V8ePP}ws4KK!*@$b#lqa)F!TEuhVw2=p zX61J#2*9IPt>n0p>>$F?fWj+F;5kLOR&`Oqwc)tYm9>97JT6j9j-)ahB~Vl#-X+2L z=&y%{9>?^fKcGF#9rsA>}*9KGl#+jc9FbTaMo~m z*9CA6=8cun5v1?Q8g?JJxbz7+OR^kVtP?pMkBd_?0*u*IpHTNFK12t!LpN7A9`l~8 z*%olyvyDfq&ddTH^?jGDsE)%5$&YVZz;0y9=@8AznIR43Xfdorqksc%gEzDgXDSeg ztr{z+{PRAj3GLgOgO_#Yw`;0%pIRR?a0BCo3?J^SjQq zex1nhTe^IAm%~j*@4EE9B*A*4{62hxS3tKH79fKfi&u@OW{xYOta7(zb3_^W88r?CvO&sTi$hg$5(`hUBKfYT9>o z$~`_lGg&+P6zza*@mWf%b0!3~J!4@jl`o$8SBb?B6666twff#c;IbWgWEXdAU0kfq z6>Jtq^)awvL%xXs_=2LM?p>@ci3$~`{O{7 z*ycEk<4>l9A&28r33QjHy%ZiR-@~ZOUgTe`l^Z}#-I_vwNW$m%uugVo#->=w=>jjo z1yb*a5rN*NUxvz|it{^VrP(f8Ur1J98?KAJ$T^L5Zr=%bv!~B!wn}f!ibD82tNeD! zP+9YWRUI*7iX8r~rPI7`9x?&bkd3!rUECnW*h|~oR zOvQ)*yGR_d)_XNCz zL;TrVaAn@ly~q~Tk+m`GB>)RTMf*6KIZ2??cqjL{HI*<3(tmKK7$vo;K)4~X0kiugO6C!c= z(FT+_5EMLY`U-5Pfi|aiU(6W?oS3o2p#t3j5$f^0rT0k`;gz1JwOLi?!SmDD+^H;^ z?=h90VaLVb;#p=Nn&B>Ik55(-`(;HUXW_5u%lBv!81CtPf9#cT+E1t(RaTw0WkH%w zQzL7|$5B+_`~n_M^2yP>d5}NcRM(PcIS@e6cJI(~Z7s4xD zd@Wz+=s$Z{eBQP?dE@w?ValCU4lC^v%9s4k{ifBa-I_!$uGlhRFzyl;r_M>=^|Xr` z+c8-*)`hZWE1OAQ)H+07KMJy7M!3e2g=S#%B7{wM zc3^2CZzY7yC+3GLVnA9YG3EaXqjEj;jlV4(Gy3!Pw2)ET`zca&zQ75H-wBP=dab{^ zfR%GP%kcWX&%B3+&XzRPrE|J7);pNAU~iAIJOXX?jw$=T(R~k42=e`{Q|#F?6e#oS zYF37U?X?F&3irF?7@h07L+DbI9P`LBWX&t)W-nF-ZOwB0@Ew=l3OF*98lN3on zAXC)HnTDCI&H^bq?sp)*q2r1QOFpAhgu+Flq}};h1f$0@q1_T`PcQ6!hhkCV+-;OO zZ=ans1C)UfMXDcIyOsm^mU*@_=17{^xR*7= zW+6kDi9fJ8(G6LT{LX_%?S7V9>WPN_Fd2z3?nvkGh9&QHD84~?Md**mH0MeT(5jXg z$U%c+3&4f1a3-dAl<|~M=xayOA(HVbp2$!W?<)lO4=j?TQ#^taoibt<@M-8skL|IO=R(*s$*#ZZIOJn!eqq27o0YEF zckbc+wP79` zyRIiQC~v@>m;x;mY_fG;3}ZetI$P83Es=JP`{( zjDl2g&4`#(E>Knyz`#~(uF@Pl6oTrjlZm z20&mZFYHvCMXN=MZl6cj2CO=3c8t@+)0aSzMuY|bR{$uaR{be|T?9xV|2q4;r!VKa zPcR3BmTy@#wMM4~HcHq-id7NQ_y?3hwTiB+jm~+iDyhQ5{vHu}YdK zJAd!pzVYU%DBOBivU!XV+%Y5vc$mHanB}hcRXw)d^sW;g_hpy#n`Why3LZ?OP}{|d zd2Th=GPFbBr$`70RJ8ClZPo=2`>&Qasvm2q79Fr#N$R!wU4m6IGr`S5II&GA&vd5E zyg)JYCfLBFcLU7}JEo0P*IbroQ!NO#xLt~52VY9HoFCle=8bsKcBF2U;A8PUxpK35 zCe<^7V0PnC6#fLtg<~_bKmJGCD+?LzpF(|EqiRJ0JanJIvQ#>jS~c__!uY+$Z`=h; z7Lf)d2;GB9^1om%`!Klmz^_k|T+ddv`n68$nHnNM@qj}&nr{p2uzOSOhbF2 z-$9vMXt-$hGeEdSIF^Mp0Gk3b@R&l9SwuK?)dm{DF$cC8O3Dq;W)|53QWRevEn)>e zg*YVlhxdW9%4X{M?iwhqDgzRFNIH>+Uq_3ECnA5`Ui2_0ESKquh8%u9@-yj};gE>3 zL_)tQtB7AXJd7O1in5?r4$F!!yW0n)c8^W`fOU_2$t->tj(*zqyTSBJMA&nmx58-X z3h3i)HXO%p!cS4)Dx9l$?2gE&9`QFo6NE^~#yh_c8H$=afJ@wjf!UP4*mG_^-vrB( zBzptrdXK(&bFnqz7#)O?`-FN zJ~n}Sl%5nQn=tLrVzqT`z zBF&68m&h7nlljlpwTk1U*)%{cn;J_l3j2gGD^Vnzg2=TTBZQH!eZ@^P=ezZS0?#?C z94GxCiJa<2mr5T;1N9_nYz$e(E$SdiJ~dq0+xk9SqULq1=FzLRW)Y=@H1^D;GILroyt&q! zj)KX}Asv5mr#FH~ia(_SF+hihO3M z6JepxK9K%BaY4Lk*o_C%ehEDs(8idWoSe)0pRU3_^pTT`5{>hUVt#-0EXbYz-W?T( zN&`|L?_7edwZa;P2+rm-{L&a^m{lGy=x0#8cmnw5$7VRqo%JU(=^iHxmC*K>JVQFa zE&AehqSe8xG*ny{7k`P|LR)%Sg%&|DgAFTKzg_~O?0zQ9t3(smwfS<#>f=`1ph)S| z%a1E9hRV1(7APB%j9r(enO7#I?nX%N3c~4LoL2EF_uN#CVI~hUZKm+2xFiThE%M{c zWLa?JRj^wdNrn)ZO6VSwB(ZpL`+>`TUJ8@JLqCNVFduudx;``^&Hqfa3#?JMC#G^F zz7=ioaJm*49ei)L%j}?W>@5+cVHz8a`7-#f3fSWg+GuRN6F;j1a$2(GJ0~gN-^7?h z*a^oA^11ymo%g0$37ZGaf(Az>9tnM|eF>z`{n)M97#r>+W(si}5VzRR9kj8>`ePqlBW25IPZpL~9u+!my?b|YgdHV(8Rys608+f^5rvl*Dcj(&bT>oT zAWy5|H$i@ST=hKP?0_g-=ZLbhh@HWx41If%R6$~9N=01OyF~T7H1RxZ-jeoc!L6B< z%df2q6sa#Q^dISn;o~(m2qseu8I&*f1CqQ#MD<7G4t`F&AzqQT{FN0dC(dX_vCQSa z&Kb?s2_96)eoOn74gSz;c66g&lhKaOECRND&VW`kYFC|TupL^3`|dgjoYC%>Ae6ti zt-Z#`zK2frPk1PT2;KeAH@p>E!F~rt>7_ym-y&A}6c2#{XlXOP zA4~ZKbq?*F+cS*Vdjc+R30ExiDvY4|r6g}W-+I3x%_3S`zlaH#7#QkhpvxvV0%J2) zVJ8-N9T7ky?;u{@2*d-jzObj`XpKIH*3xcRy>jTe7V%LOi8o=T9TjDYmE~oB6{4o~G6&cnCoXhW` z-}HKr2qb^5+9I!ZUV9Hu6(@z9_JtV-PVFlf_b0-XTVbJkk!1n8Idw}Z86To}=<_Q^ z)b^~7r;B0%!jGyPJ5f_)zfO;C-jXPKO2dtT>Yh9gnU5^qxQy4O<&82Yzfy)PA#hR; zpoQ#rlf>;tTZn6&{YDT?VB|$Hh2@i^8gvWuOg>FVG6a#1!?LN}KO zR=;e!gWM8|(OHie7NWo!lb%8}LwVWJ$Yj?G>tVsbMvM16f z!HQG^uiqa)kU%s8$@ft@a!mV6A8B8_>(19z#~rR$Z4g40w@tCn?l2W87g~jlKtC>0xJo zCg@5QH?lcL1?SG=^8L>6R=785dOON4vKZc9KFu^YV4C7!Hyby_FL0FvS!ux=AazfFY%(0Ab{)R@EvPN|GC~k+Ni3`(1jYP|%_t|5WLS zESEi<43yNBmR!&^jNzI~i8TP%Fq|qYuAnL%DNrt*$)!bH&>d4UjicOcN`2}p@(HSJ zb@XOS%`wVncp^O^a~TCa?x6Y9;KvnT?uJ{tV}3_!h{cENSXvB&b~N5#mQs@BI_iB> zg|)2-|B8}b;XDFqStqXc$wqXEoEYiWf9cV3^=4EGO}vG}LgBulHb_Q!G%EQ}h$|@e zljX>B_MByK`eswTc<6za9BWfNghw@Pi@cN?snZN~kbb@DJV=SsHT#!_8L70r)Ol9U zYH+zXVLm3>haKDXd2K@y4%g6>TTbm6!%I;UqCL@c(ls_X>2k5CR_5P6rzgik)@^gL zF2=}_a@M1YkYXOONK%9<2MjISY-co&+R$ox{0Pj_rF zQ=ZIbgYc}SX&p$RDJ_nnBEweLZF5T@bGl5NMAsyY1$+|l#G z^IAyu^2|@S9n&%(iZtmxM4g^B&WiXxAQ+c2#dze|;3(!;z1?>~Pc)gr^RpoQ%+QVX zn9*T*%KZcBvI;~c2bbUKb5IkjropeJ8+^@U@T^h<(-mbQj-GZV_>yDe8k?6 zeBR;FP~vT2^T2WR)l4B<16;^V#?gGHkOXY_cUv2A$(|Q~(Mr`Ya^`Z?q#r&hKErNM zu679@@AiS?UvqD0{$19n$szvxdPYyZFDAlMR>V<5eD7xq-d-67idsiIHLBrb-+i+< zXC9%0<{4=2w-|bm%E|IluEq;?3wW$HNamBA^t6q3`@nHPmxvB&>Jui?Cw`9jVOOq= zN49s>q%Eg|>a7Vy!2N}+3r%NLlCoWKVBp50k38?sUAE%CVgUrQ{p7BmF0PmRk@(-c za3yM{_@5|l@XNdhcYehYu0^!es3D;}cTh;kZ|?dD0H9iEv=YhH_rn91??&5`;e|qy z+-773xmkY{)+u{d6oD6!OV#d-lk)#*YX#2Dy^ z+_#(Ux?#!a-S)!*Ht8kS(k;o^gqS_!%oHl_RlLlQgx)Z9`9LnfFjNR{j$odocX^4_ z&c8x(k*r!Ye?%=XT|Bg_n+*xKoFJ7vEm^f>F`mNckQ>iBo(2-9M_=bAWijdzZJTtz zRd_vrd&>8SaeyVC&P6VSjXPQvr-j>5FYwEV^C;2UZW`C`Uq8rnxmC&bgmXl+AY;KD zlj`gN-Se{WC1|&?FX!_Z|`1#Wy^lg3{l#8;drA9L;4B>;Elr>NtKyX@V)^!@GG z*)58L=F(Bv(`t>~HG>3sZsxnmoUCkjILo=Q8a?_d#Ga67;E253?Xpx56?|#6{~>=d zk%U6XumL}7B8%TOKKt3UD0+L0rtI~06bqb6u`xY&Co-m8^tBeKlfWMOh8Dl+aax^r z4;J2ExU{S+m#foJ3hnqRN)%3=JDd~=6Y>U2HhVLTe>f&{_;({pfSI-j+L+1H;7?M7 zFRV83IMjhZ_Gn!TowAlChw zwSuHr45T;&+OCCb4!Zga`hPZoQ}d$f2jd}PkFRthJd|mN{#C6-A`f$jgMuw#{LBxf zg{w-$tD^CochdP)WH#p||eq+62EJ#VeuecwUp7!izM9^I{WC`UHvD|_%D47w*a zVdtP73G6QBPF-9ILduj+J*AzaI1g7mt9k_81o8woDXRl4zggzov6KNmEfWD+Yqvzs zFFq+IBJl$mt7G-b4WFsQPiN|(a`_^zovXaa={}}CVn*ZSV8q2k{*Y)&t5I$rp}6P+ zsl)une*{y#d+WIe9vDgJyw?HMtdvg9LpU@#7*70q{gA&@f(EHOorGlc3{N=YT9T&v`bk0|XQpEn<}ToHe)2bc-!Bk_CZI zWh+rdFMa)TG4XIYUN<8vO;0=2%VIu&PG-})$bhgEUx$9{Gxe^UWd8QB5W$wtW8ZCf z{XCL;rXvSb|1lv-N(0Peggw^Ai&3G49D!bFFqu%H(4jA$GaT9V{5I}Y%&M&MLsI)k z2-0hBDGOs%Wb-w#Z-QlL#!?UfZCbET2-@f30E9v;xgZ39avkYVv3b8dKuUj zVo~gd89lbCJIHYh{FzH{25fjI{c#~Q1mz|@`R-X+;`tC_VO=+>!X<~t7~WT5?h}Cs zyqQ@eEtd4A!mrgovAFoNW!6WKlZ89L+d;-Ng@aq8J%U$aH?-L-MG%5|3f5hEqTT%L@^_ zlHB+l!KDH7NrncfWwI5*lIt<;{=!i49kX4Xk@#NzfthRGD>?ps#6p);$3y4kxo@t@ zE#Ysb4Fb5;#J)9wUTLa_DJ`4dlA=!Af;!7SbJ~qZh%Z3x&k05h>R>|j$xEeu;LF?l z*N1Ppd$)564+OI>%%cUFOJ6o02RS;3iND>9(#R>MP=?-`-Vj{6K|?O4EvWGyqbdwJ z%h8k{1N$3O9m=HQ&kUuq3ILz6tOViWf!x4=6o$aV9VxX-Pu80~RTm4yHk1@zAAnf8 zN}X0xDE3KFHX0__?`XjneWFigBp-AN`O=F$%OB5ZiY?AxkDbEw6n<@r-4#1fzs( zviwp3l<)Yjy1{3NsHj2s6@v%V;mc@G)Pum}yrBLXevR|jxQZcApUEI9GQq?A9?zty z)V|t&>B;c#pC!pxXuHTF6-!o5LH?W0@-1Nec)^T+l6H)o}WNRev4Mo5?7Z(7ZtrV&uf4uRqRHI%wR z{r_D4=L=_0U<(pNjNw4ziFHZ;3qceZ(QOy=73jI>*Ps8k`#+gFTXDj`u;}Ok<@bM? z?Y};}5(7T$b3I4?mcA7Ax267`fJ|V_j+t-iKq#&{0+^)ezdrdF@h8w9=oYkbwpG%< z`l=3LrSs>nKW_0!g9I3}VdpQlr+;782kW0<{cji#QQZqLeeEx4DFgp(v?WBa`49YO ztNR0RwZCQsuEk$h{9)>Jzl!9Ee}uum08M_Ir{Ki5zx?4Zx8Ox7)4YHIOIE&ssWSZs zjO*{Y>U-;-Ek%V4y=N8kpWglr?7_bWwY>aQ@a^CK`rFHcyGuZwrgB`UL0R^O__Id- zX-|V(&@}UF$KisQL-@Eb^R$2eDlke;pleAtR~m;IrvDmKH!+0tAKcZyU_jb}psDgx zm;2957;vDLb^W`I`}d$hePb&}{ihRsnno?V2Ebh1hJnCrgia_giT>yEf6WsU7#S4= zVA&A|Au(|8Cx9>%-NV8b@6)Ym~!C%a-t8Q$+)FX{$J)4HlPCfMIT&-e$NBO z*7eWX{L}Ge8VI0en!UH^hKqez<$z#j@b7c#(f5Si4P;fUj$l`F) z#VUVYcBT(uW%Ad=brTW;mlFo)>;7k0-P^%}6D@z-LIfGO`yh9+Rpb2AWQs6082(nE zrA7m~Izlj6>%V*qG#h6U5LU9Dk$}xSBYp4wF-d>_@>_9(>ZF@DQ;i}2<5OV0oP#=k z9(VCBkS*%}7=;qT1s~{|EB?+2aGwkjT=-A-|K$tN7@(e?j4reRvuVJ9QpW$^vkI`- zl9gW>H~|HdKOVTHD3o`g^p}m$!M=jl=2w5myMO#I616Pj-#q~b-L9bBuK$fJ{I`dn z(St4g3qK_S%o#qh^3-3W{a2@gyL%+^%mCw{fjkjydO`00*bP9c!Qk6y0*!mVSl#}A zm;W&=5m>0EbbF|z`qKaS6cAemR^Kq)F%e6ChUEWg{mc{OOAj(1!D%7x1;E&P|J$&A ze~h-2T65ZmE`6Mye7AS2m=?sm4pjZuF)!&nxsYhx|8vFvnHqyrn51CvfbQO;8<2Ju zAR5w&VSdd3?E1OCpK@tVF0)oaDH{Oh=?rJBoNowK zEN4!T4E_`m3rc2HpO5OZS)YmMtST@Y5Aj!1+Mp+Rg!j`v{-{keFu50}48PZx*EytXW#W*}u)D zxfTzy8Qe!?HUKNeR!SD3d|)B_C66-3CxqydkN=US@($pI9E1w|+6GLy)YHwX882%) z?Q*;;$MOA-l*dja1EI}vF=DH6orLROv4Da)i<0^k7p2%eVxO5hkh%k4=7EDYKq&chGMnXirBE1SR+_5}5GW-j2VY$$Fa3CdHN zPumnB<2Ra4|XZxUsC*x)XpPNL0B%M2WgFyyHez`Ntb+?t`*CO3+ zWMKHCX-?h~NG8gpXRd>O?=>Cuxtcwpm~1m{iAycYw*01r@~RVb{@$JH08~2)r1ksK z5Y%6CvJ`?K12is+Ry3S}x(4Ys0~pa(Vp>fuF@)Fd2afzJcl0o&D$=liY*3>`#!^HO zKvv`LXS}E>TjsS5?f2%hq?)feHe$=cj*2+TQ$F{N%^pi8VmpY|nkB!=6}+Xxb#9V) zq-k)W)nm7Le^=XkM{35{S<<5TU=%RKmpu6+dkJz$fi!_g1oo3udOx-o!i*15K@XaB zYM+Xwyt~QrGJ%)_a$A|i8%OJ1X)~T~+9co%bWYV24j{tf` z4%2DyTI@FQrZ?I+o`($1vRRO3>XkMF!`CxralMBxOHy(ayMyC!K#EcJ>KFYB9qGZ> z$#}`(wg=|X7b)Oz_I(QXAT=DXEeC~#J^lA@EN~eNxODdDi}OA}a=$v#aKIM?fgkBp zO)Jo)jU`Ng7Hj)E|6VTc4$$YFsC^{VYx5tD=X1$n*XvBITOT_g4PNQt?aoDNo02t=0z+xb86Wx3P%O_x62|;6dN^v6xLtW2Al^lB z9t;T--vE6OPHq!IY{Vo0%35`Zo7DOaSbAZsE-x3Ejm9Ive+yVIHY_G>$5~Uh?XAgm zF_D0gq8CA6skEpPs(6roOdd+Y_d|g_P!ujXQ}P<*03*+5MjHx}=*R7~$@Zccfz0lC z%41qlkBI`RxAm7f%B379t)BN8eBB#TK9yeT$&+O8 zJ#BWdH3K>M=1KRFVG%gf`>Fbb;clFb0g3b!BA2f?J>@DF)&?16|gMDyh&qeMA-PU%&SZ#svl{}{Uv>PKPYoNbsijRM)YPr zOIxzC3m%YR-w;mb5gkiW47;gt}r!bH3~y=yONCljq#V zi2Th1WQGo5;uGo^Qdz>h)1E=}EWc51JrKVxO^PZ!FhM-i&jus^M(3T=a1~m z+p=-v(Ny489F+UKet6S{lSYh0lB(l4@*x5>1V4@#as$&*HG`BJU082;B?uhtGk}d2 z-E2~i4X;<*w9Ztiu!igjndqUxPmeZu+G|lm~Rw8!GDoG5dFcuM zhPhAi;IE=FDa$s9+=m=_7^BkCe|yS#GS;=7;!{0pBs0INSRjPz*M@I{3=Np=@GX8J zq2ux<$&3#-`|olegXmpi@bw0#x-=ksqs01aV*<&I9fzQt3n|~UOEIct>EsP^%*LUF ze}VnBQzv0cs1mUp)WH<1Hog0{AegEc@e{zY$_yuP?t#QI5WaXZ{E5XL4@!*dWb9B! zEtj)|ZhsdQ&fByj9?_hHKd(d(@%0z5E=CHO6D&4>y9Q~bWM$c1@e@)))`&#cn7(bo z$g2MpKy5tviv44q@->MWZeoNu>JV{VsgLqfC!K>{`o}_K`L-t*16MmK zdI}l^?|IXl3-K4a^PeB6qwK-8vmjH)Obe2^9Mim4#`H>(LZ==8VgnP&3Sim zM)`_11g>F$v3?%F6FA}!sp>68YM5b5sjZUO0#5YTV-bI$QO&wIWZ02x8WF8el}wf}8MBqpexv!d6!VOs#bi}Lbyb2r zv;q;RQ&j5S;M;|F z3gx~@&U>OA;TEdk8ni?9{Ysc{S&WYuQh`UY_&h&5OtNT(;d>&Ni$#9OmFV*W6j=Wt zJ-i=E3HB#6$1F7#Y{QFnL>uWd8L9OdcT9O@&&-k_fPd;Z_Rgj!u>yYzzWS#ejmT~c zxBMQ5Z+DEs7rKUXf9w7#`Kn7>o7G8}?~eUV*A6vdx3$Vy1{=Z{lw=|^oTwv+wMJq0 zZN1T@cDvVvPqR9mSN>PA!@;QuSzMmC!}}s3%c~%BMSHytDdaf9%0lVF&x_oI=pXE* z-&g4*^VnSNqwPD>QotL2wLoQA9afq7T}MIhc*R+i+5~fz=-Gd5_O>ybRJC08sNvNL zMk48&<_iOBdOd-Y1P4}|zUYg!~Z7x5>EIPjJWLE-)&S?VdSzs!#w*MkwMkWuQ`ngytSyP{ZO1LS)qM#xMSyK zb}QBra{Dk3yc>PC8*7;;Z91S?rY~XTpuGELlH~rAVGc04YLKU}QG4D1kK!{wZ$@P3K*^24Fp0=B9N>P&HJxYc{|FzIRd&nU1~y@ra8 z`!8BKPFOuZy@5f;wfU;#05xz=d+@MFFGds)mu@-XsD&*;-*?k@;9;>c{anaRVr#Z+ z#@k39>lE_hUvb2o03Poqi60RpK1sI@ucN#`CFcsw;9-V7Na-Ty+@D@_4#VuiM+muk&f&-Qa;Av^%ZyhXIRFIn?5BnUxeyRj z*td)}QJfA%GqX>EVLbqgTGDEX_uiNdx0PA!)w;qmZN_Eg1$@iDK3U>G>Nn*7d zo|?1NiY!_`0Cq6dSm8{!nNUc6lYC3MW1%^_{>aCu2e$bWyc>&zb}Kv+cH%T3xNyS$6LM}YCHEJ$bATe7gGHE-$+ z$*x#2HKLDhuWHj)7~++%OrGM`NT_3uJ64f=XiPs0mGdf0CRDC#I7}NlsmZC#Jbv^j z7^cr`FfPGe%@N?+Ev`4FyxMf@MN4kPLOzdr;?tnbkWZoIOH7D^m&k=(F2+~I6D?L4 zTDo@|N>`IC5RGU2jFRH>+CkF!_|3Eo!Lk6WC2mc-h36+6wrr!CBsPy~&-zx>CB+{u zyL7hoFiEZT97h|QzE`#Bm1H=JfL+mv>mG2g0R_cfoSZ2?NhqgXKn(ZtP_6jRZls6D z&aQr0+pI4QldhRL>~=sHE{vtaJcYogi~j(1l3w4Fc)8Vp&14Pf6P8mY%j8@_ z-W{SawGUm3iZg*(A2Mfm1{^|s>^Fjm9lmkS-CqwS>;X~Po23L#s1E{GFq#~E++4K& zYNC?pp37U)i#>;y@Y=g_t9&Dj0V-%Zb?rcC;I4p>e0WtfOSM*HxRGOaN8BIy;BFGLIwP3x--@D zk$#vx-876NaZZGzcXk}Qvq5V=M2(n%#VdWytlq2vX!s=aX4n z6Meu)&m!=V-BzSAm*q&t%O!8z+c!9z_HCF3aX%JKd+Ca;XUrvkvx?Qi9}zEGi#ZS1 zruWV|KBT7PXR?6ar8*l7%bvz^R^!w)t(!fG;&KF>i*ngX7->8vw91$EbdLeAlyJtE z{;Upa76GylTjdLX7d)MpNoa}%pQ(6<-*MS+CL^m5Ha3ntnA%R}f%P2fPHo9ITXh`~ z^IFfojt)>}b{eXSL$vuR(ejiTAJgD@P)tgAKB)OEFVJQn#!^tM4CKZ7Ty!INbkv{t zI~Ngiygw1IznNw!gxTPzz8S#U@-O9&1~lFVJmXL_s0uT%%O-3Kkt+CK3HUbLsUJvr zun@gVjb7_HfnOD+#)Zby&C#As(^?J;!5O77-ZlE##*UkODQffaaYxad=%o^T_`V){ zx9$?RCIH9zg4jn>@iQfC=uGDiV&1 zFqQ6DVXdIPTc4NBqC%eYy@Fr=9#P*B4jxRwV!AGyD{iinF-dv%Zs_Paeaoik=(v8A z%|n>rr@#0Gc(UL~8$7gdXMOi09%h4xKSZj+e}lBa;Bq>{eu7ln;b8a?j=LmTx9n`z zragrvmd;N;x+w$A?)8GBN$`YAl0QoV`p(kKV8cusMRo16K9|XNJXTPtRvm+8AF$Z@ z+F@#Jo1Dex2z7~4MkNQ#9Qhenw&b7hLuyFSr|qrTU{2k;2!D$p8HLbE zy3oi~7|_^bD~B4P04Xxsa^?vRXDpdrDZHgc8l!A8ezA$~ipR(2m0953M$XCYP;n?C zGy8io)+pwuPzSR*8}(|HF8TFZ+_klsA&6XTA&S;SX^eAL~AMjmdO0z`0VM5BY+PL|0l_@86ifDc67LO>jPHzF&3 zJ;2Gak_R!?c^CK1QU(2}G<_wAO2um)3GTnLQK~e0C(l9(r>w@NEU1b&D{bl-2n~gh zW#m}1WZ-U0X8#s2p*WVLLu@QnvBMb<*EpYcH+MA;k0f><-Dt+!sQtGjU|qh*MDJ4v z+prh)4PIJ=PG60GMPqRxhxkuNo4n~kqBp~we9iYhJPfgpG`#J(MEI$jI*$M~!;x{Z zQ1oQi`knn59j%>rJWZ8My?)D0#p%pI{S0|#0Wi@OzJluRI>E@oL*^)S(ba7^Yt+}^ zNBi3geL|e-=km&lo8QPY2OWJ8ypPyL%@oYp`A8s1+Vdvg+!c_LG6t zHkbFucR7XC58xjn8t`z|v?orx52ITjtWc9vrpzDc9FdCZEZA2|Y=WVWzQWHcUUzJF zKCgEEh>x9gBp%(a)8cDM{Q)g+x5L_&zINpL(Q-h#1u19CJJ+?=+?O~szq=^T*Sbx9 z2e&lbcB;*$5Hp8!I(>!6r@|=E%H7k2{9yEERn6HXDAi#1p$kD&eAddDTKGLs8o>D1 zUMCzAOPPTpH8+o5etC;^QpFaD0&!wGqDrCZ=WoM_hXB9;b(%6MC1PV=k|BtG5X1Q@$;J&PaPd6eyB^{@D$`ZZrrtc z;0P7)zOxfZwDGpsPV#*ot+vVOY=J^>tA{=olvqA6hALG+SHtv5f$$D!y)rSQrVxsi zOFA}spkIcS@|!JgjGiy^#eKXodmx0ydB|GNZ`Yjhyp>LwOJIJqq?y{m7VN@G9rkXA zLw$#v45A=%-h{;b5g|kSRB%7 zESuBd>yPE%5LlT0fl9QIAN3C3u_=E1B|9-g)&1c{&+$F^Gflh*C{>JT3pYQe=~$)m zjRw-Hs zfgpA-A#Myz*;n-Qhjt5;2I;#Nq28CE}iDRQM9yFz4}Epf(al{c0g z2Y|Yql_X(IUjCd4p8QUdv|nN+gx?e9RQ}BXc}HUw!QPPI9lsBuBeDM$sRFZWz;_B6 z!Y6$W$dV4ec;^C2Ia_*OK%!>|Qo%`4Xd)wo%nf;Oz@sxyl%=@li^PAj7CgF0jY3bXD zQLS~*K>ABObs=>y*Zq-q~p z#%IY8^U2^1(KwGnhY)%tC2>%cf_ISy-6{o%L=9L zOfp!*h<_c+Z_HjG?wrΞTFDsKSIAN*4#;(@5U0eD~TuX%77xIepfuj>CT1lxNIL zFyU*?({_?T@!fYesR+f#5|!HIt0bHYd;%Z0)_&r>O1mA(Atl9M`K~8wZaWPf^&LA& z!GWH@Mw3c1>Uiez#3SrB3`Qs1f&a%Zid=QPFp(Nfu1chc$)x3J{6w;5-YZUb4T&va zb#1ibC>lK?5esS=>`R)g^M?B1);#6s{u82@lm<>h7W!lhKorP%Y>Maqk!8I?*F#zB z(QK}WOn`}K3o%IqIGD2_vG&fv2E7MFDZ?5obrtkGy7}Bp6|mM_X|a<(q2tz@w7#T|e50f;Que0QiWDgvdnc!0Dm!%gn8<&X`Uq4rj* zX#qEsoGk)r{b4Sm79-$L3NY`trpuha+-`TZMCh~&Fr2sm?wpARYN(37e%~E`*m)%g zT;L1Vtw6=*Oa&`O@a~J!lXNee1;@9refIyT0Tj zckV?i>kdnN^-cMcH5WmvEC&?JqvgnEQ7L7?+xREQ6i$#w5OzXVhsLGCV#@sf z!3j==_2-ZEsSl%Hoj^pRi5|@?7yM}C$!!*|G<=-3)%Kbj`Rg@N-*|V?!&fN;sCdiw zv}#$jyD_3bruL+<%C)3+t}Tf5FOHqT($KBGVW?p%-A0fub3}%D7^C?N4!x)PaOgx7#Rd z^OSiT73ug}B^Zyl$joQ|nG4vkgF_ZD)1e~y(*6^u#8p&(4WVnru2#JFRHL=zRRF2_ z5+XBOhE*#8uZ){KH299n)qE&mxx1z+jc|hPnHm2hy&Rc@CZ1Zcrqy_W39x5b{Bg1E z8}GB_ecClCU6*oI2#>aH#tS}oOZGiK4LvFRJK!0eHeBcP*68MQ-g6a^!s{mYyZ%cHQ&T@B=$z+6 z8&@oD?BQbX=uMUr$5duI+jk^486YqB{4f3)jW^-aowyvYE*sX`*;@9nM=I8q40M*? z!#n^*q$&Vl!Dboiay1*zx2rxJHR*c&VJKKbXMGg(`;xseheV3x&r$=e4d%6!U6QWi zf&_%5LooH8&Kkd zuNUQ5>GnX$Qcs+~6btFvs83c;E(b|z3VJ+hx>V?LuV!)FU^EK)$3R+kKm&5pu286) z{L6~fyxzLzTV267n<<2xHpBh9LJZ;9lTZGIR;H_Ti)m!C;5*?*v{lh_&NplTp-aIK z;Ynnw9o^g^ROC(Ov^uwAF_fblig&xAU1Q0*+~|w`!BKP+qe!2Ykq3s(Qg%Y(R>DMsnt-X_2c&{tO+M%k zK&jGjr0HLIDH3rCYeTQ-g0A1ax6}-!N+HW+8D3yycVD||c5yhy7vP|wRtDY?Wi9Th zKg7Mwdo2Kfg&8BeBtH$YyZG_SVvigXCioIzDKL955GvG>>x&#lBJv9vEyhD z8&nenh}171(1^h~!2AT54YC%~kH=UFhZ&Y}STm_UPWwySDr<|6rFF==eWbg?aZN&G zui>n;CkXDG>~_6kL)i%|qRR%eTs!p7b>vEIPf16L#wCw~62HZL<~hd@B~b@;L}Gi; zGS#3Z3!N?euFRsRXgl4(bDZ9)UQV_N56eWCdHhA>vP&kB5!{-xf^{uTKf+l3UJDhs z$%K{>@Wo877A)wdGIYOW5a7|u0tC=Z=UHVk-V6G+muNNpEJ%5}w|%Eb>6+lfulBMt ziW)tc^JVST_?lq1E;rCPf8;9vnp;RXgZX$L9FG>C95=72HaD#|GW5_Jt0VwZhPZac zscwk$@g)pR7~e7{%-({`-6VAxY*Mx|0aCz?8dR(ODUT*8n<$JG(5~G}u z{2A<#o<-tDzbTh)g?)IH9b7^?$GB0cINKM>_GJq?5z%#(!EdIRXqtADV|BmQo-arK zW@e~{<~1ME0FAkthzYeVPTU|SZx*S}nz8gDJ6P3=gRj24>{FUf0Apx(6Md3N!DT5# zZ{TNzCXf+(1e73DGGr#VsjmMm+NRUX$|MFX_Zv2FY_`cP`7}y6NB39ym%UN9KSPOG z<$!NIiN^3=n@q_vYPI>nj^zXem{IV+fFgt-bI7Dc(RcHUvOH%|6@f5ct-3fh0~*_i zm)HYFFI{(Oi%^ttB|~c@;)FY)yA5VDJ%xwxYt@CguyCrMEkO{D#JS(*qsVPa0$XX+ z(~cFnmh34o6KzP2P$EtS_ep8bH6BZd;3o2ptEV6wDYG_lSH`%`lK9gs309_diy7LS z-q}kwo^Y}rM+>s+3@*v5TmkOU=$$6>vY&tg$W~@8)<6nS_v)LVd9u+ujdHBPEZ-kGccLg`05E-rzC}b#mJ>EeH3WKG90Z41A6?j;5?a8g~u)9uHoj z{iak_;Vc$!Ssz5odL|_~_Jt2yHldt%q;{S$3wD>V=-a8TK}o}b%-4%(MFg1dK@%ex z-AhBe0LN;WHNm@h7(bc$;1!MC75lpbuV$N(UtbsE(w=ZBvV`Sov zcgU3%=0S(t*gez*Up46aBwef0epW6SRFHP`kVXbKwW%=uJgf&*4)}uns{BLgBDIgQ zLNKxBJR)w5%LpA2_fz&ka~?&JcxPAuY97_vyDiG`vtj>bqTg&U<)}>)^U;=kcD5_^ zk{?##F3lfktA6Ux<&H~w-!9v(YGrWlKZkt_X`G1v%|iT9pTjyk8h5}n{>WAMv?sCo z)H1F+__r+qW~ojn0kX7fJV?;L=tbqo&k8gML>z<08YG8D?Zog%?HBAG9sRL4VqP(@ z&F54>J)-h}7BoDaJSe4*Y<(i5rotax&L@|l~*?+=7rjvf{EC)(XnIT>c z4Nszw8f1BSmgFTr(dmO-YKg{On`!Wg<|nuHYU}`qIu6T9>1k&0%3je$3WYCL_{QVL z-Fq2@*#T9zZuY`tHPK}g@RJ{FH4;tPD>jR@7iFV@R~kZem64z&g8I~3rOAhKdb-F|UB$vXTAf{d3--C{c?Rmhs5XH% ziAS^G4o1Fs8q%8dD@ekpSIDF>R@Pr`J1q*ph^YR`1&CPE2$_<}JmLNXLw5VB4>&4E zH94L&xcGCAaYL#4`mW`d2+>vY`0}rb4w%}Wg*PAKh?a## zLLs?6~*JyAhp8sB(rYi9 z3^ygKN;f$f*DWQ+!-#5!SrG@f%$he34zqci>@5Rs=RDag9$!^-Sh0m(of{T6FPlBE1Tafd*--9yVHR%gaQxTN7?A;Wp_ zAX6*ouKsuC(prJD?oUr}uPfujf*O%|8j}|08sp?FMcz|#2D}q0j0%FuYQrmTb;FxO zQKqYk#>=*Dkj_!Rpahp18@_0~ov6;|R?>sy6tQk!KoY6EfGxzEC2CM9=@A`PYD&Sq*v6gJ{<}Av_NRbc(u4;Pim62qrSt9TefiqAL3AD|GtOvB?J^0G zuC@q`4NTiUB>URuDG9`pB>mIo>3&2-^0K^f)!S2b1sroPx z_t+op_Q%r7a$QRG3H9zOGkiUwJA$tca$eU?X11=&OP-v zpvsQc-%lxN@<=gFRHP)l>CqRw%Yb*_j&H)DQ87nVssz^1=6=fyO-P&x*D$;l0jesr zxUdg)?vkCW7Xq5<1maS9gcy>KkJ7H6-=4a>a-p`PGkV{!NgP)eGbK3aw22~$KyD-M zi6$KCR3Af>z$q{px_&7S4_^I2sCy8K4$p^Hn@2aiFkG{R7X$@oUj18cF6>n>}>AMRhF=X{+t(Ztw(>q#W9nAuEa1>i`dmLkB zq4?+F@>6<6P6 z?YjOe0X9oH>cxtaD#Sr+(P844E-s*iz|SH#bzbi`H;4kEg{1i$$#WPin+bIDaRNJt zS~}JbR#)HV-TZ!C=dn{OFi!~VqW)4CtZ4bXuU-UH0<&&x71!~pPyd>&)-!`bP@1B@ zQu_THt=<1*j)@y2~vp7b)=*y0k#1n+Bzz^}mQd-i5z0%+Skh(|O;;90M#&)=To6QDYS$*TK2VJ14n^WMM88MK3lo?BCRT52k^BhQ1qa-9ZBN?B?Un|8?Eqm4B@>^mRjwgF1Sh*=YJ z*8o@(_F9FplJd0y3rVOyiTzcRtiWN<@L9bTs$?^aZnaF_8yjtyof^6xK^lTF`t}sp zI>vcOUEZP=o$}P}2N7<2e^xD7M=v|Bhh&<)tVAhRx%hV;>u(pU?TVj$J%bz-4Hdwx z$z3nUZt1HGT9qm1u@0~d6SPAxqFj4!7}ZxXtQIqRGVZgsiRxzFx>SjS?2IN<+;U5` z6dy;vmXa`G<5d&Dzi#)FbCK44nx`f&&rFri{>lh_~5<^>`p zT3Sx+YU>*B7gmi(Iwm#KgRR~Q{07RfiBa20`Ykt(96+-Tg{fxFZav~xC4$3rAa@^O4dgy}7K6jr>x%>fghF1h2W? zmsi$y5Em~P_$mvi8}3mNyRb_nS@L9MdXu-S6XFCkHVuD4HmcB}!fD3(ra)=jKWAQ( zELLeV-?VPN@q5p-by#tq%__vH4NJ#}4vNG03;Ocb>3o)+gQd{IGG{TY$rCN7#pKtC zoHsKHaW;$L*#w1ox|XVjHuEzIrsI)cnu%fx8N3?PfwlLi;Bgy&yn&%>(P=B=PFrxs zsPbw4i1NBKHSUrM_tVTwR?(qy&(0aGgN(Vl?`Qi=#IWd4f@T!|ID@MI$$XA5C_!Zp zijY)mUZ;&O5$V_M{SMS6d!-XfUm}HvplzjmKkuB}a0Sj7E5od14;YVU-D6s(+iUyD zw`gDJhVY{r_8U7huP(*nP~O&ACP`o-cLHm)$*wO{%s}lIX^#H=a}IC0Uc>M? zsVE2YA4l(kEEcC4so_voWh76xK8V+Uy(D$2JPgrw<8CsI zAcz2p4>g*T(Pt<3tB=!}mAvb@sA6@XcGlawsGoIbZm+uqxV$uJgD)MOG)Z)^R*I7| zE^fPjIIj7_L=A1xw1o#No%_F#_JrMT!pu6+bKl1wewn|n&pGo>2Htt*q?Tc+rJakO z>W)lvxOEGWS-884(Sfx~XDu_pygz5};449Xy%Ig>4BR!4D3R|rnP-==S(qM_QC zO?F_cIF`i>+9MhYXY0aU07Snik=BntEavZDTL(+)3rk5RktI5;6M1bFMRWi|*`I+} zP*JV7kzak0S_ZAAL|Q*JjvL0=p9ov&J`aJHuS{@rcPlG&W7=O#4E z%`ZPe~ ze*g(nn-Zk_RT(j#JkgLQ-JvD6=A{8&21nnup59zelVzdPQ@BHyLrWE-SszWio6+t1 zQJ75HFo}I5vW#h!zRWV0Oi;BZGwfJLYGn45C35>(F-YY|Nf*ESCC+*3VB5*)_trB# z>*Mzb>4)*4VJ zNiZjjY2@Rm#vGv8F3b?_2u5@A-7`jy1zj4~%QA>FnQCvd_&M-v!=KK@Zs z26qY{Bu8vSVxTezegU*jO_#*A&JYL1Ep{%u-?AT5E%)=D_UOT;z>?g^6d=D28E*AK z{h-=m;)}Xp%X~x~&$u2ajlrBX*by_O-J4@M{d1<7aV->-mtd2@vi*)FGXbyc$PoS5 zA~xHzdwF}^OG1z&S;A0(^$~w@r2YwjgQ1azyCuDun=MAyn&up0ASBW2aaI5N`$zQ> zKN+?M6eIh7jSZ{Whk?Da=PKBa#Blfpa!y%9o^u2!bczg2u2F+%Ku7(uj`3Y-T3o)0 zDRb`&(M>wj9G>Zg%kGp8LN&t}IZ18YG9tH%6!On&*CqWWW5x&bnxk~qu(D9d<|3k} z<-cE|`qn!RSS3`n*#gF2?D@uZ73k9aGrtBm!27~K-1+Oclc`xUUhLOB{J2lm;a7at z6}eN>8=uwHjizq-8CdD)6D@gkg~79nLyxZ@OXxtlTtnfkjc-dK@Mb7W(@ln|%n=<1 zP)3AUnD5nHStTfd?As#y_`yD^*)^TeVDu+w`%@&2)qpzl1?xf z`ZH!1s=T<_)C={W)YDTxD8V6O`Gn@C#GT>bFREE#pRB3H@s*97R^I&7dHlc+H+%5< zvLQS*r0c!~re|?WLCO zD)eCJ(7-_}u%P^&*e%{7|J}Er1H+}oJ)FZ4Z>F_CB;As2(MIeNlRRXf_kM>+$f>6q}9E8dOX2Bq~%FX1j9TNDa{V-$^G`bhMp!JXW!D z_=U>s9kToko}6>h0U1;}87G9jnET-QlUy(3JE?I0VaR z@D0rvQ%M~KNTR^?LJ<=Sr0a3B9NHJ?K7+Dy&q692|6JFbW!aIdPsbPLNOK|qBk)3^ zkB;UQlo30&l2GDo18i4wOXy+W!RuJ1MG0h~lJcmg0oUlGO5I!evJcR_Fns!pxI&^j zOZrL*QX^|`%jw|w7$-3!_%DgV&??3lM$HqzB(mhTBf+K2jp4^h_f9R3CM=|K76*U~ zx(M{P9&BUI%Zdgx=o(R0QCF47Er7f3T z6qaoMyD3_#oFee^iZq#5zVaLgFRFloJD7Ex8XKa*`w%ZEbu_C^3aaU`jFWUAb;yjU$(Yf~ML}QxAF(ZLIi~Zlr~zwFdd)H9rV3VF%U|oJxplz z83nbVZ*;-pn)my$@Eyq6jqKzE^faRaub+emyC)9^dFRj{wSI;p_yF%o@i5r1%lxz; znz=rg!K{@gt@<|ui`2n^niNSB@R%wKX>OLVqz#cC=qzqBbHcgFGfb#+BA}MFp37Jl z;Gv2BEs-A?D+ALce@uUL->D2r#<|`p$h~>~7f>3m_jCE0E^P*3d^w}knQfB^JJIn} ziU8x|={aXmW%)XZJh9-XA=95T=}CQ*HsX*ewbL6!bGFyvc`Wql8R}D*tPkRw-ywLV z`E&(YhK?u>$r40a^6jrBc6`fLUwX+e|Dy}w%AbZkdSZv0K2l)7f6#ck1}OO2kGIX8 z>n9u?Fd1<0<|il}BM~a5a&D9q$(SnlkopQyE~%H1>Qc2>P@-Kb&T5`SWMoM0uUyv5 zHkt#ZK{iKoMhu2V>i{wqDB!2X@r>GagVdb96m!{Ie3D+JlG)tV_Nu1G4g1)F*Msp5 z%HoKx&u)}{j(4K)F(6-utNrni8kpWelRrBfZE{21I&37#Sr`pPsfxy6FCUAJ(BWPr zX51^O*1Q?miZjmV6^JA^g(GCPdb|Y@T$ut6gs%gu>p*ggfB-kdH(uKKfo~pRc3TGh z`352nnuW5<3xR+j*9iKD1#hDT2Np>u(I=<9Sn`0YyWgifmprxYw+89GurKWl)4Q1v zMUyQtI|2cQKpYyMbtbkWxyr~1yv;Dl5BAb5Oc)c(Cgpo5W4CTNO(U9SvKbUOwmmEN z4n49&9~-BXJ0!u0G3tZbOAjWKIgx4#+QoY1?9hY=yZTMG1NbH!AZdSeU&H_E+f`U2 zJAJvDg;RySRi+_pz$vU?o9jcT^`6cY)dv^-4S%^9V?*Ao&TtBB2H3iZ^W>`nLGz`+ zCnj|B&}6HH6ZKsokq}-SEcH;lE4*RRZNppCJ1#y?k+@Z2`VUVYd5_uTpO;Zx$&qi- z7zb1~tB5%N&5^Tv`lX!FNGAo} z$G3F)-3qS7u3Bz_Bo5-q!pxE?KiMuG6q`K@vam?|C-SxvG%xK4v; zEeS&uVl9V-(b9onGj}UJBiR8JY7H3)%2;=i7A%XV^d&1F0z}087m*!R6W+!PNqVB5 z>SOgWreU;_=!|E$J%g+SjX9qtp?^=m9m+ftb(Uh~;$L2qt<4TlRZ;+uyr-(TbWm~^2BC+%6qth-(_|==PN{s|x za7CtpdURF*LLq<@%q8w2AEMJ7DRETpCQZX`EWv>m5{}yluOC$Sn9bQQ?>O<=(ZzD< z?Wy}fk3Zhx0a4X><@h3T@uyGR;9K$_~%myLcd7nnE9GUPDR;(FTK1t}Jj)-|J|H5Tb zprPOZC^B>YmZ5cC%_f>ZpBT7P_%($Ar(KFX`jH>;1Az9x$>UN5Nc<>?Xge=}dPJR7 z-+@{7i^~rnHT+TMfu0{dJ{$6fMVAxKXs08Y1`fD?rBEKdXH{cM=-4{tK~{`oqoT*l z5REBeFYVO9Sfy1gSre_En5hXC zC!$p1UDS-!Pt$dh*(B%L!pyZQtu`2#){B7qRm5cFkSxmKBHxp_9nL|%NWE@fM%uuY zeynqkK;@m1*c_-bWvO=J))sszBJ1-OLAA0vrPRJ&Hu@yTClqZAv3bUL6Cbhey)65Q zj}#Z*9Lk_bBUBrA2#;y;@RDjWAz_%Tis+;_4*jn&7=gN}uT#7E6+u?B!EgvP%_#3= zXP8eIYE5Cc$9*|%QkuS@P-|uA=i4>SDEkZzwDSuigk{PDz`!X-T7X*Qc27xZQWPMt z=`Z@~VJ~@gDVy1wVmp;8@7^k&OUl(e;Lr}|QZxGrP)?=TDk|oby85lITEzfGRDODO zJ{o`g(yRAaZ>V%;2_;m)qyUng?Y%Xc+Myx8H z9z5tMw&dv^P{vjd7$A~XAnwuAK1yjQ9`)P)u9z7e0t&sooAQb>iHkSFIUkasUA4j- z28L8>_(nhl+OUYiq4Kxn>GJyjrlkeE2Z%i>c0{lT39bfl7$)iA4n zA(G4{79TUP=>0(2(;G4-3b0_$%W)4w28k7U8bo8}NQSH$VeV&9+k z{)2)7AY*2J2S$whw>$a^{{v$JC~PqRg$*P}&>Hh6uD{R9jRAsjDro!4+x!-jVf{bH z{1-C$Q#&X| zv;Xcf23Q@b{Dwb4sz4&cnXC8*R?9C!{KsF+v@THy$}gaSkT>O@aY>;BgH@>94*v?h zbAY$^pPM0J1B5ZxNTAC|pmoRpH22>>yQMB+`3KnQuM>crH#!nG9`ho*`u_ie7fQiFawIji%I7}>GFWKq&wuBr z4Fo`lpIgE5_d!G6i^u-^7Z5C33W#03_8)9Cfo!a@76|`~Dio3RKgt+}bGl*0G(%WuIIxv4>5Y5i9%CP(>93(8cZPwlY4y)>e_|ad3%ncLRd_`Gf5Zm&17Xv|=u>+7n$dv_9 zeAXd_8UH!lpR5;{10ZK7JdhE|(_{UAiHnd+ErlQKtjvg^ALC&0-%bCCJQ={c2_lVB zu~#Ae3H|RM)}cXId1#brQIcl#SPZ~@=O5&^u3r#v+jA=)Z#@v2fF0tT_kZW1zzNXb zS9OM0V7Gtm)-{`}L-?oZu&!GOz3_HRPv`GEFtf#EsQh;`kHkO{-u<|m+xmMu(2M}$ zoIKDBa617S?{%MdocL#05&Y1+kCJAqkStowXOa5n*1!I8;)i}*Hv#Uk0bvgQ|DWwA zhQQT4Nh@9YpB{(w>08K=$r=>Bep9N`{GrtSi80_Wo~ap@17cWLL9FWpZw{te@7Sb0$NE^5os(v_vxIV^S-}uc>}=Q0|%@XyNps}(LCHw z`s7@^0lKFLzVRc}ex(n(KLEJjYZ8DX42j2mo(X$6H`B!bvL9jwyIBOQf>~H61$JW~ z)MwD5$*d5>A@CEB^R1yoLnyAs;`v|y++Y9dLA}tfq?9Ye&7V33xquVae876<^MUAH z3%ma|%3vc*l$8^Sy_x?}k!wJ`sRN`MUFvZX?Bmn*-`h5!^uF5tvRNL<^d-~B}|v*)l(+PFWI z`!(U!q1BR-0sffYw)c|redgk8lJWUbtD#B%Z`}W{tTT^?dVAwIlI<#rY~9NxOQS*t zNoGivX2@8wCZnXTrF2C~mR#a?tD9!VHkL{DE!&k}V(3w`JU(f!R;OeN+X-HFl^9!0 z(<+yp1i)pR%FeCLdjWf#xPZ&thfWCF{uxfXeyf1~`_)CWZw-Wpgb~Fj7*DnFdaqP{ z{jTpGN#S`+Qvze8%}iY92jG|S+tH-)O4fmk$As7d86-BKSa;hVrAKnc88{pJFSE6>Rh~<3&4c7xP_uLiE5~5P^akj_ zHe+lP-7G&dtSCGrro_ShT%T{gFM9c}-q4*iuy-gL$%Oh-Gf~>-Cmx^jHSlBSoP{@X z)9Oz21J^!WUdMsnZ8!!xb;RmVpHhgLQxBo1?s^kUEV5FHkQPaJDEn=O2L{p@y>Bo< ze%b@ezWySL!s*QCq(Xms)>h1B#Bz=44x_iRQX~`#fVmX$0Dx1LC~c1(O+TXto$3;= zJyEMs#{aaXEnD|;vGgoJJ!WZoXR^=twV~#`Ys$_|u;dP*5wGe%FTU&Zj385n>0rxq z)2QhJS^NRfxuD@hm!mQ_q7O$c&H2G8cX1Ytzx5^htnRDceS#wn&FmwzZg+w~+3te5 z!VmMW0ySFl4O;%11v<}Rsb+vHVV=ok{%kV_8y*+s#{7@O9tfIwBufO#oMPV`sSy?6 z)TK&OgmI{dFpe?5_H&>dK#rwP+DFme8JPCt4jHo4=}R9RUk&2%$IcKyI3ph`Pd*I57o%pt z=PAOrC41or(5w}|K-L#bKax5agj!v-d4M_FeIPMhrI&{-wFejV=KxKuB zqRH(rjwLy9dM`X!w`=>!AX90-lpkX8=>*AMo!r@LTB4qj)qz1xIOCazImTv`zJ5kh z%^-|@<#DOXVjo7$%4#QyqVoiunq~b=?vkz4mRM+ZE-(U?{+L6wWw3wuI`-lnU>-i+ zuZt>eMX@}{jIO=z7v5fy%`{r1IC~RpYo>?{jFkN3c0%|}VM%-Pp@HKeT4y|SJQ!wP z!jfBw-;J%Y!@vds`%Q2W6OKXuy3`GALgKfK@>4(7PG$mpUi)san2xtA&m-)JQq3QA z{OzgR{D!1*)9u38p!2G4Ayv%-OTb810~}H}3*(*C3d}?ehI^5ndJp7iFv^bqVr?gv>`f0WRyZxI=s@zc|AjP4%63u-pu{ad^Our``ctLyV*Ak>hp`%?(=y7!Jc%P%@~9J}a( zj|kzVr57%-!O=DlMfmML&!RLD*w_z(`zBki3wiJNkDzoG-Kd)dMw<~d$EbpN@I=ke z>Cx(G`Ln5@-;@8ovPS-WUJpo>&lmus@E8xypmyt)+S#~Wb@%gaOFfz7o~@6AA8xQ( zSg_P$zsB|t$0%<3a|0sX+?HG$<~*J~1p13_?^fXQzsgD2=;DNwuFgw1vVZMV120lZ zjyHM*Eb#@=>Q3@>hvD-*02XCnBA_5{d$65I&R#u{d>P$}or| z>0H;sSmqx;X~}hl*))x%^=q1Rf&+@`AQkIoL2+x|oDdiWe54yHFflN@Py(2PXZzcp zYA_`d!Y3Rlequ*F=PQ#pXyvL#mR)!=Nge?`Veeh{i<&+dLoi(GC!D}|KvSlx(UKub z=7*RJCghL^H`L_6F);@Y+~W~5hh0=6D(c?GZLTFpJVS%qhq_Hbu2_Cw)s`f2y{ZFd zOEit)cIZJQOpF$N>wq0%k$dX$e4CYp^Eg%+BQs;!*Vj9)0?$8ZX)uvnlmf?gQ0WFd ztFu8w1(z1E0I78pqtT~;7>a(jlSwZ0G2r+o!uV;6nmafk;d2#Urny6CswjkEFW8eo zNTwsVS6gm0L5ykAw-q?ApnEIts9sqK%0Vck{q!|lcc zL%NRv%vQ$+D~6ujkvq>ixYIfjp|g4M-3qZ8IKmOH#}=L>blC3M(V&R%=62?z=Rm{w zW^PU)@IV`fDGHf2{74qhZSGoL1Q}Fm*6Zp*f9uLW0jic#s9R2(DKLYV%}5rD8=@5) zzdM#s8!DHdM5sOQ=XBRJ&*)^q={+)^qbKK1ISPg@^n+W@={P9Y7>WK2k_*VB3YzR0q#2-5i3t!#|w;YH;r0>Lx9b3vX5W`8r zRZeTrX8pjArh)wR#KfCk``?!mV}6@JycrhZ#_*YfXM@_bHJ;;eb3Snjm8j0PK zn+%hNoG6%w;DH&XZHy*kwY~R!M zTAGCH40>c?1mUeIVKi$`2ww5<+mR91#-g(7EjVglxidUiAH7=1zKU-2z`k5gxA6{t z?_S?AACp(S>!l47JaBwrhgJ`5p0fO#*;HXpUmRUr(*Ez?znXeRR&*XxHil_j`qj4uJ$GS2lYoeHd&(~Xh^ zZ)F4)<=`HDbYyXU7Mejx$`xq- zgPq8^w7q1}XU?rHEQ#x;`I|VIB`D^0N?V@kPTNn!H7|u@PZ||nMLs0)Loiu+?%;)z z;0?%qJK~|G_`Z{+Wo5HhzK+%_$&VLGaJ|WgyCjBEHt*O0??0)Wt}@P5!~+`s4|~Os zPJ1>y+0o28X6xo%FV9}iIkh7V>aAWapzJzIFv9N{HQ8u;R<-ni=GFls*}w?swnTR& zXuTMxYe@!4_*sXSPqps72k_scWliMY<8F%4fmr*-ei=li0sr+2XWCH7bU+VB12r91 zgfQfFt5D9+ejF=sPLYADXw#l^fC-~lh~Tk7O|+$lT=J&&^fnoWi|b<nt4A zs6B2OJ$4%}=@i$vyD%X@Z*h&g-@}3gXC(RgMN=G-)2T`vB=0Qv>yo9vPn{4GkfyIJ zRlAQ@jGeZ;Ir6>NLdoCYgYJgCYhRTJLE>IC`E@!!4}BI`;i7(gtGPxzTHuR9I%u~d zlZ5#Ae-buUQlxTGfv&tyG0AHOP-r}Uy6v!{#(fsVdbLG54Z#xiGJ(=kr0bPBd7ZLft zH#Oe31R_gJeWX4Igm%u40uiq*@|QF|zE~Z;*HM%B#(#(Io8I2E7ga1SpfY^%mia|K zegQ!d$zM(3FPr&q1Z4-m@K*ooHNl~^=FJPTZaVnJxX*7CbO$eXx7|xK&A(VIy87RO rR54>dK0Y4LpUUS8ei4Fo@fyAj-R+)ZinMd!6+VoGt@&*;uRs0|hlUqc literal 0 HcmV?d00001 diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\216\222\350\241\214\346\246\234.png" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\216\222\350\241\214\346\246\234.png" new file mode 100644 index 0000000000000000000000000000000000000000..00a37c31246654696ba8e85b2047358e05424020 GIT binary patch literal 144470 zcmd3MRX~*Q*Ddx>5csKtfPeu=BQYQ?A|TyeA`IQ#s0gU^z|h?=Fm#uIf^-cG9nwR0 z!+A#ipK~t0tM7alXKol~-g)0A_Fil4wVy8vauWEsjl4b}+SZ z+`QH(1TGRZQ_*k~wTBuwn%P)CQZciFf)882C8KXtjy6WI07au|6S zL&|N|yYWlfQbwefja+zvzNnHX^Xf{`8K{7NkeT6sHDcbDWc)#32l-cavPQmM( z*GWml9^4?)Oi+xPrv6S-GWUq1J}LATAr!gpz;4^!b$Y_enh9Q4e~%UCy>U-@F5aK_ zI5-lI)~^3~KST5n|CK*)U)*Sx`+I@hhobv`-o97Gk=n6y6+S77@+1^h|ZsX8~UO6x@H`h7LoM80yl@exaI z!L4uCIyyTCVYJdj|Nh(8*P@=2%FQ$FS!&!J=OofrqhZM%pX7aFR&G71<*?Wl9v`or zSaG`DMqTcCY>!XLr{a5yupxz5D?&yrE0!WTKcLZ>q7 zx=yOThl2=pHeI&`_z4D4_)vrL&zvyL=dZZsgJ|f9T6ubUvL0!WO552LSJ+ImO5)zW-TyRHKxvhmJ6^W1{va-dN-%G0 zYfBu##L6lso?iSaMw69~Ps^r$Kg!d~tK7Jo{B%wtayIEcmkhPO!L-?cuvX7>iBqNJ22LG7h7UTITUO#JKO{B-}Y@pAHRKt9d$=L&{x zAqCEEZ{EB~e^^;ue|d3A&Se@lR%xR@=Wf$@UI9ikmQzkdLSlY)w2c|a3=}%w8ZqM?oSAAwape8JJh;t4#(Rx zQepLhJ*xBI(iQBU|7b=VD{?I z@TaQCQls{JR%2!D4c1mWySwFHr_Pz#+5BnWjdQKWE6k@G>MN|qvp`&^Rm+)~* zt3#Bt<;Oa;-2iB`$qmTi1awKLay%bix$@X5K}&_zh{Ou43Cy?T}C;lupn-6j1ux!qF7 zy+2I3Fry~Pa2ghkqLhx|5GJ*J1y9eqCr_TB8htLMIti%+a`fumS3uzA?`U8LT&P#S z*Ggf7(ohEPP=)%!k^>*E-9mvdK1G8hrI7@2fxU)HUeR>-zX0A6F+7J}bh`$YZl!ecE$ZwtW1dfUAPM z!1gC>RtxU+iPkJcY47a~%CCYwAF@_fQOR(b_EH!1ztd0YGLafU<^D!dQPF+YkMtkG zKk>rKPGB~=vbdOaIOQRaL?Tblc6-ac&rfhWlie1{D22UsK+yGh9ZssRdk^Pprh~6m z#%&r^B_+R{Z(qAbV7fEk?i82PHoCVmC;|RgadvjjM6NA_k_C&`AN>jp){=Pr`jh)U zhE{TCZ%+lhKS!|MCy<085C}c@724|B+SU5QDP?hSaleJlhEpU+0dqnzM5)K2buhKC zp76y{x{i*H<9az70F`tAAvG99>E1@|rrH=C1fn|E7K)Y^KBtkFm)CWfbTS{z3g)(% z%72#R`brH^-eD*R63Bfi(FLHR+5YCV{!{TF1rQ8n-skRumuI_9d^UeW<{<$^X{kwX zQXH2VY|3>GUpD5s2tY5@&L|e*`}fm*E>EYy1W^2&b?F~JUOnXoyl`-tot?c3Ruwj6 zu4-zUL&B_)vbALkLVk7A=dw16MLR1~IvRrw2e2JXl$U6Ip91H13D_j`B^0tn<#N~T zBugJR0=&!jK3RDtCLv+5HPdvk&DyE$A(!x?c=&b@D@xm@KC`Y)V1E@Yzdl}x2LGxq zS@n7)@w;r~E2fK^jg~-(Tb~HJZly%A=_5ggsg2{RGOm8S_!Y>OS5%|`cD4|U(10=! zIXQ2py$%xryl1*hxk;dPT@pbsuZ`EA?R0i@cMkzHEd-#ED;7us%b@l-Qv^#g3<8*u zl8K2)IyW~LGngHU&08=zDaLsCMQ(P|09gMFdo_DXg!aDC|2D( zkBw>x7?;b&1n#V6^~8jZ)ksllM$!>j`0-<`#tAWu)I}hW zHBy9vdkXi}HH3YRJN1bkJjemyV|ISBPuxnGDBxC7S6`2qJOHQ=HEKDWe6jaT@NmMx ze6psTrhRf!_wuyy(jxiM>V)+UKv6XvorD^fO=Nq-vvFTJ$(rNEcnGv3k}-5)X(@Yo z*%Z5O2Pw*o7VQF-BhMw30N=Fu0G)z;+0qA;uSL?>*w~z}>R%3oJr?D%nG&EBbXPe& z+*H@l(107dC0XoahF}?B8&`HZ*$T_f0FtOUI+iIZDV2c?n6z!dCysL2oYK8KK{w`2 z99D~{Iyiw{Gg}`kAFKDM6grx@=CwbhZtvtY1Q>~)ho>@&wNUTQP2Vip_$(|B?Jjl? ze!2gQh?=_a=g&9z4__3uT2)$(K9MA6fA<=0*f0fhTdU5s7#n}YBqSY!gUZ^K)^HjS zZiO9ICkN{Yu*~$&o@Ebk5E2r)xw(|6cpf{f&v1lVnd1b=|+7rGBPZ8dRp8VwB% zaWCxNDMves(8`D_2rql?(~lzVu}1N!wp`i zjsOf8(F``n47shv!ikg1{pkz8ex(%^J$G|=AFlCsBfS7rs9G(+&tE$QiHwXydBo?% z$k3fPULIbSxoqkMKVVOj8`rV>L5$ro5LAvoh}BiJyn!0X%*v92c@1RkkR&ekUU(zH zu+`ixk9QV2+S~g}dxa2Sp&FY@^u3S${tDTddxNIk3^ujv-1U0{(mUJRxq~O^;=wQ) zudSwAC&x?4Ow9Zai$TU+F^?s2@$i&;lLRTXEjfPya*?2N%jgll=Hugo#rwWgQGn%$ zGEWejPC^v`M0Ip^;c;S6J?g-=c+d9s`ynrE3?ZNDI;S)e8+ zCN|nK6K`E33EKl|4+D4yUh(D2m)sIO54WwUdXGb?&b+w#zfw_l6}G}JWHvzK=kJfj z!tn6$IDQu-2PbDaxSd}>023XZIG{bPDq911MA!Vn0s|dg)>z@l$OzpFMAi}plZVAy zKR>^d)9Fj(uV3#ZVE{GK!~)9zSMY6ic6ZBu`0#;|m37!8*((bMyO${BrFDFK9IWb1 z3t&$*luED+fE(`8Q%T{oEnh$pkJ#7>adpo;Kum*;jO{j-O?aUwDJgjX#d6(ZPa>0V zlfsGyC7)xM3imZnH;dtc0VS`?^TQ!I{~Nyp=Y7^7{p1KmGe}h+NcvJld>xnj9s^OY z3ZkK$&Bet9eR)28nJb?xq-xy=!T_7gW9O9?=pG@(Zt89zfYV}P9u#PniVL6bMIftw z;R0fjr=p^AKNvOH?_erKx#-12eJRl)*?)J{OSX*1iahil+ z)1X)QQfIuvDk3>K8I%hLYokxo`*YOwy;QGUxk6mwv630R_x-*+y^xR|mVAKBkqjY(3PMbRr6VcSvG*)aFFy(z}g`MH{_IAxq2b0roH_s|a zOJ!GA*K!D|b$qa1>t#s6K^5t3f z!RB-!P?)(u4#WZe@QUA23tB+dU>GH#rSZwD7pI%((NdFg=XI^jjEtO&jMq1A+<;}M z6=*R$dGZR(kho%{FVz<-8UZ=GC-&m=OeicIrKs!J`AvCj3KnP-zbY;+#!ma}{2bzW zWIO3|d5)Q=cF454dhHq(GFAY#E6B(M!C){Thf{A+dn(+glfl*>M0f8RP|vBr008aJ z22f)L<%Iz%*Y543^;gF%3WtgeS{wo6n*o^yFj3lX?o&Vj`qb^(pFlFKiX3+NYW`_2Eymrp2s@;$tqtLB!3P|9ejo;il& z?=o;*jUE6mf6I~F&hilCLfQ+@xBAM_f4@z+%T+~j>id-1)`*f+%psrb&*v{ExvHSg z{$8Ua6_fJMpZEVCO!)tK9}lxwQvh}X*{r0emjrkpS~}`@u%<~&O3LG8hNHj)vZ&m& z|J6-jk*TREb3o?6Xkd|-@6BSx$F9PA;fKFhG|;QBx%m~y{1rg4jP&$j{iczPjg3L# zhJI`~RgnPPPa795d%L<+tgVrCb#*8pF94L@_52-@w&6OcqQQ5?%b?N(^brAm3Dnm7 z+?+!(0U-{lAJE|7XZ8*b@YvW~P%7$R5wzXN95e8HCv440Y1hg|ErCA~R#H+zL`0O4 zl9JLYM1!MqFJBD-7y%B0JrNcbcG{TGyy=^o+xh#c-xwh7BeCF)K({!pjf98Y|3DUPFA~M^`6D7C;E9O^AOQz-DRERt+*^PAz?Qza1O(O?`#?|RO>|K zb}--+Jt1Oe0Q8^vs`?47Irb#$5h4B`|s9Xe=mFoTnxC3_Sv&e<5Gwd@9%55vfQ&LtY29i7_<&mTksNpI0t8i@DY3S(k%F0yGIu5eR(ZARIW$C;mP(UcJ6$o}lUv9M!9*`vC z^i5z#Q&pt_cESP!EKu+=b!MesV`gSXfxA;=Qn-P6ob0bY%gv8J0icS&XIEBIS}PFt z{rIuWab>`Gy8IlKtK8Ass}59Md`+Y{R``O@cXM-tI5E78DAg*!mMee}=2uomQ47DH z(U)Jvpc1$^-}LD;FV6#V1Yf8JuyuZSJde6t z0>;S11f;Gy7#02r$Adc{br3EEn~j$ZC8ec9)ecMUHcXPB78n9H8&)@BjW7JMTQkfP zlbzzE?Y+HdXteI|_n|+duhZ#^Nqi=kpwJdVNvRq*8N+T!gOxm@^;)JqG3=r+;CRRY zStkxC-PzS@sT2GjNZd$}NWjpVbX!V1fC3rE$jn^kGhOfD0xFG>Mq&589FVMe6NRIe zQ6QbvNS*!t{Mv_xhAz%QIqS5y+NU4a&(^j`FsMNR+)oN* zgstrsbs8~~0;!OYkd&O98UV7m8lvCBj)UI}NCfH*pi$1hA-u1SU%v)SZ>6BVN5*MP z540gWJ3H74&2)J!$Gv4mP-nJ}k8A7JIv4zFyfNL__$r3%;X@q*H*s-)?~CIl^20_8 zG}1g8Yem4REJjN*L7{43gWBeDv$-%4=^k&FMaIo z?aNp79J(Qx$=XWM5Ih}Bka#*SF0P7(#v66#-=oE(HW@7CDrY~Wa+HE{7D_`W(G=z@ z1DP09ZUnXx4Dd_j<>f8WD!}3+4b<3ptTtY10Bhdl;vE4cKm!kEr~wX+)ae~ULUo{o z;BDg|4u8$fL98dLa0_@W<}es7L%eq2s;mMCMSc5lZGG~rvp1LV;^G1W=6f!{!NI{; zzR1GJnCIZN;C;M6Y6Yx10C5Es`&z)$0|}Bnm@Utl^z1?bj2|GhBL>Cr*A>T(-$YMO zFL=(BTLs{gcqo;Z$eh*98#i)orW+14CnyaLHzu(NeSQjpt=))&ftgtb<_N01>Y5t0 zqu7VQj{>?I0KGN*wCm^3oC8;^*6MgQgee0SHDI58zEHgI$VlJ^_2SMNx%Fu#e{%py z2%rw2Dgx|A-kBg)tshSN6oPRgfMUkx3)|WC_r~Ab+9W_S1lueBMpPV=p_w91Vd3f7 zh<0WGJFtfw#R>qjARv%`eEh&;x^2S)jNqvzZpcI^r7}8(2@9#;iLI0#ivv zI9XcehX5mM;X+DEstk~$R;czM-`%@+4~q6ZfvhbCe2SQ?sLM11hy>)GML{~?S~D{< z?68>1xB&3y)2x6hr+lO3T`@g3x0*yKmFiMYBA~0Z!o~|Nuamvpt%Pw)?gITrJ&=6L zZ{Mb1eUt#F#9MKA)YQp=92?JnhfhyNlR1lmX2ptqTtc?^OSg-LruZ7V}0)9Cl&%?B`N`jbIO&0Q6nO+n$U4Rj39IN2S4zTSfburaxoYh4a-`EQOJ zuVsdgfp&&RM`w>)*J8`4j;=0LYcK`yAKihUBm?v~pY68@9ux{CK3bTbIBs5AM^URXY+jm@@@-1CzQ%fOiQn_6 z2qYRaFfk_W+9-GS_u(qsdZu7^z`ihc9q2{u*l~9PTW1wf*Z_-;js~fRDm9V!u;@5H zJw(?ug4z!r8A&fJ+}Ql!BT1l`l4h7YaKdo!KTq5o*0uq&kFQ^?2ei6Tv&u?KN7n(& zB!pUl5`c!Wa*J$WqUC`#bmCM0Mh3i5k504S!&l(JC*ThZ45XxT=1V(u=amRSYr=t{ zrUDF}(#lsXo2_~qAPd1dyeuEDt*J3z?n}k`kN^chU8#mGIAr6wfrmTRD1+65=335Ki9H7W_v$YD;5LgozOl8u(ljQ{KhfK1Z`BPXc7#n9@fC|?f z7}5uCD+SgfJDR7N3hPS7yb`7dRA9i2OzrBDS)Zs@7Ef24ot>=#v&29c?$6|d)QYe9 z(+c~IropKKYdRH{vA_qg6FL3*Qt9LGGL%M09p{^`SVSYMw4pa|59 zxs&RzU%z7gT(EZ*pcppqPxH$tiS1cs0`=wPaj#EgRFts~)>qC0C0Q;YcF@nsWnW`f zS5Qc42_R%*XIBKM3Cjqf)?&$+q zh>}pCyNpcN4OwQ;Ycot9npDJEkH9O%+Gi#`3H<&moNkR{|7=S@xz#xL=^XGe(HoPsOa(sPJO^0g6*L!AH~87u8&{E6=wKdk02b}rp)hkA;!|BTlbpS*F(Nq8;I?eyQ3kIf! zh0OgJK*UaO)?YfjE7w|QDgNE<07cr1HLeGDLh@UXfN(&Dm)uI)U+)nD!qNlL=kK4X zn#&ql3!04!8yiTFMhI||-(~Xxun{_b{#2;6o@BJDHE6jB=8!WmcZ-k;)#Q7N79PC5 zGCTViU<73h?zsuz^J_OC%}2h?glnO23>i{v-1U;hbY z5G^e&M|}6eJ}8#4_A&?y1i&IGd3iLky3S)@LjgaTk(;|>EDq!W)`b}Lg@GVm0eA+q z<{$tptV}J8N?KW20k9BAlM4V63uU12_}#BrMfW!7fo545Xu|?Llm@Ixe4t46`GG+| z^?%L+T+QLPvDwUG{k_5OjWa-La7nNJ^}y1U8D8oIH<& zMN1`Sp)Mum)PI}n?mbp_ofqKh*vbhMuYhNUAZQ)sYl>;uh{0{)=~O=zH{hE&ycRhJBLa_BiZOFQCees5`t zJ75ZcPgcpXb?iY@f?8lHRk5|qeG<6*TNR=ERVv&L9&VHi1+Owo>}UWWtZ$Jd0mZ5+ zWL>Ob#6>=Y#i7^y06u`oRspFBpKJ|%$eT7<>%zpzsSJvwYU{>xXOHOJow=&2s-K`3 z1wEo{pleNjw6>Z9I~TaRgP=vrqOOj{Jr(dcu!I|ac6W1Y2Uo%zLB|^RJ~3$Kpj~@z z`ZiIG#C02EZA*?=8K7ugGM%12eTps3L5KhiUE4~MSY2JMI_%y6Of4<+l>}L=sGFPq zUH!esKwhi?xB@w$o$d%_Q>W5v{r-I!G|jLsSiQyR@oF$=l}}9y($b0nsylF^yz2^T z?G~TwZv9B+3KJ4XkWM zFgqYw_5kkD^YUtfSWp0;3<3TJ*8sgJ2Vz}HRu)ckO7j}KP&Wx4=74_GkG3`o&>;L8 z8VY(cle8fJJrc3--;>DCcI1h$|MmSg)`MhaWvyAS7|%-(_|eiL3Q`FL&RC=aTi6V+ zxjFDo;isKoRNzQMItrJ|1Qgt$a?ex8!ydN*eWWV}*(%)LhQ`Lm>3)d=n<9*=BP zyGurWHC6P%PFR_Kqc;Jb0BC_=n?`_Fu+#^Z6HxPDET;*o!OH{GH~@5VUcKlCa{}rF z0~Q4&FQ4~Wl{GkBAPG%1jpw!^x_3{>Ky4pT5eB#{d|oGISe)&4-kh6D7xFrx2WlLY zZ^(Z^`N8r6(5Pmi!+4`lBd9@DR^8j%+p)@erpedBvu7Mg9&q+yJb*Yy4nXq|XqRh+ zo?@-i8iE-akp9>t&tm;eJ@1X%GRwd#m4Xxi2Jndi2*F=fOJgkzP-bU?#)kOHKvHz{ zGmz&a31=okO>q?Al|^#$DeUBvarBMEs?tyy$QU^3_&*( zY5|o2o42pz_lcbM?lmbz$}nv00wORMKzvAEduk890Ja%rP$!R5h5D}?038ecOk)81 z3czIohS5`uUDXcjItVD*}BV2Edw1phypPdVpif4T1zXjM+ei!jP-KSK)<5x_=@p5`YK*X`}@k^l>QbuNzib{7L&{*fxuY5EqLY3ATl5UxJLlTl|fKKpFfDJ zVg7x6!hqOq{!UG15-it&8W2E2cxWhLK|Ib2O2iN&7(B>!Ahwr5msTC@vk^yZTn*Qs zpN@&)^LJvY4s^(|1_(H7^r!L#n9JmO-(V!Z51{fT)@VM>% zICE$0T#_Sh*diy9&q)eA9t&69#)F`%psJ%Y5edEWGO-G*9O%UWp9U-b0L7tzJ$3cZ zX|uSO5dn$62ZMt%^AA#1_wRe07gr7b<4I?~9Qjk#;(TNH_dh1TE$*lP{4U}D(@U#( z4zA$JYeUP&92YIazxI&iR%K)q6aJk`N_bUuO-H?7t$6}p z)x7&r+j)_cD6gvQQ_6o19n!QrFXn#*$D{NWTopq-L<~%TJ7e##JHLV1$Z`LBix*d1 z)OmnLY;W34M=yV@DPGdrp@Gwu|NYvQ)mH*f5*C^}B*r@hj&yYOvZ>hFeMvX}JD-`? zUjwBhlYO>pN_eWECbMRU{9#T$ZI$`uxc@xhHld3PSrFBZrJ3cPA*2eLJ!Bq%7{AeL z^Y`%Nw=dOI)g9*Lx{-#E7X<l({v>b!rW}uU%KA1?hq=MZp&g z*Q(#45!q3Vp4ao&{thIX&MX6*4wVRWq>p*ClhWFr2>G!UVSV`lCH{87>J>|=(_fO4 zoa|azl9!5mWm6oGW-~5^c(Hc7?GC>2Z|mA840|>I#$t}+%|`bUZcfzdG;j`^{rFBr zoK(x#KAKuhNI3Yh#z{N>nG8_O=2Ds~FD{rSW~ zHXgN5qh+MHR=2Gc)~_uu8=vz=zdfqz_0vsHIn5T~p}f&+7%$yc>I4>5>d=jB6PPtx z!h#fYCfs{-ll!S8|F2(}7S`%Us)a)pDg~YfFtt-T6heMKoS=?c3K7Dw0O8Z!yYJo@O}fT*?=?uAd#g;imQ3 z$=OJX$h^DhnZ;0vIuGdH_4ntn(MEAmPQsi zm6p_~T;y2rhi80Zxy_MSB#G}?!!k~?5=)u}A9l(YBA@#tl1&5k$*v|0MPm|fT5s~J z*c4p8Y#ERCvKX&@^VB98IU?mkg8$W%R-8H*A^k`RPdiVoflCjsf{YmAQL;g|e4Sik z+YCx;t=>#rtrX5!cXgCN!eUXn6TV22)|bJy*P7r^!Fj^8)0<8-x?R`U{ZZwI_1EZb zMChwJ<#PMkTB#0U7q^f1x?sE;c=g+3Nng=An+V8!lLyh*;$Z;uZRx$EEiox1TJqTZ zdq6zbWrfz1DY0f`(FH$kXGYGrJ5$OB5tFe>hrO-@Q%&nt7JW8`^_GVVlG!fzICBS!yOa1+#M-iaEKKweU?Ab*1=m$n1x$65EzKQEHzOJ+tE_ zt4)V;|DY4ATKA%LU1V-)l=_*66A$BFDjGt|EJlHN99Mk7J;Rb~@y0ZkJLCk#?5boU{sWN_>D z(Asv`a0{)B)SsshuFg#2KU)`{qbSxSrsCG)RZ!L=qqWjO;s?k@m(O#WFmxv{Iz(w| zwkbhbF9%m%NKxT8{IHMU}0F z^YXb5i@N1ILa`2wgP7calj&j6Kz;30V(rT`z4H2vH1V6$|N7m0u(usB2-(e2sJGkP z{NT@W&R@U1$f(IDn0~Qe{cX1{waKz$tVYMCYy7b?nhJL-Z$U7-JO5LCF0P`_B&Ft$ z?v5v=U{}V=*DKG*ahXfz=NUWSh0~&+NXun)@seovScsDeuq)IXqaGuQGtKw&4dQBY zvs`BB@HLNoqXZNxDbf};nbeS<)tYmtoN9RAehC~NusKx6{Susb%e(%C-!0-3?^$pv)74yYl&@bZt zEV1G$gq4+VF}CVp%W#(t&%sagrEe_v)@8%j)fFVTP^hxC7lR zbQyKoJ0-UmL*+3vo7udfbmXvC+ElWj)D^volDET%fUMVrgUUP!F05k4R-WEr;;ktA zN7^v8vpMRqm^s8_&Q`{{v+ggCUyD*bNclmLUhcZ$LkvkAPTYSdJ@Cup#C*8otIbFJ zhR3*}HR^^J9BQI9JOg-MJp?Qdcw`(5tzhlybfU;t5)({~EH;?nV5NEeXBMwrB2*F( z84&5RqopZaubIOlhn4{`Ey{;EwU;^vbfZzln$(wu4bDlI-VNP2eiMq9MWtywjSjv+ zqWL9ID|Fj#PxZy_iS6Jh(QYGZyo$Ne(r?5D$L8e1e%FQBk_Ol*g385Tr5x zLE@CFwcxls3hG$TLgjhcE4cq`McZiWAD3@O*t|@6^-&Zv zMBi=O!d39qTRe4Xa;wbJUG5syXXMdafQ)#_s0n;x8=+BR?Z~L7?R9TLCmO29`p|`! zT;0hiZ912D=G*;R0-Lo`f`fN>PE3Dp`THd{RLTrWB%dsA2V`69?*$l{p2D+G?kw~c zD{)a9&m6ib%~{EXD&CZ=1rN6&4m&NUQfUH>#!r$KStkmAZjcnXo>F8kNLM7#WD+Ny zb{a>poKtky2&)F-!zRvAm&ZPsT*iA+yRE$L^To$wnnY=j=lscq9&H)a0xwnW%J=w&F5xpBx?cyBkCEJfBLeY*6HR@0xOM_I;J-?qsIE zQD&xjcz!L6X7qryA9W-5tGjpPIeTXkZCuR@Rneq+iKrWM5T4;D?T?h__4hBzQge*8Z^jWQ=*bsnDXsO;^=3ozA?AW!iMu zqtH}D<&!Zgor$932BZ0&f?wPnSa(`k%l6lEqmWkb5*<a3!NigA}>qx;|P@k>0V4vMDT(Ud)QAUv*EN$p-7 zU0Qo4DJ`)-Bx7L{{D3nLGfVxo>ucg+PAdS5IhMh_;UJF=Ei)2Hm~6*LYA9)!;@a`} z*Z@(yb;OzZ3mdF*;?|{pMnS}DHO5Ol)9Fn&ODKm5gGZF@lpmAdYfup78{WV5b()OG zvyC2q#R%37%gU&t9IIx~a^MlO*bw$N?^rE`%tkGa2P0GOB(`GC%L|$Ef-{0t-95L` zrNWH5B~_an%A{JNH4ZX5l#C(>elR&lh}~3v|El&@$NUB&$Bdd@3T7#J;yQ5dPRFkB zG)o};XUGQ*Py07ddnS9cmbl)cZu${WW!@bsYc8c_MGq3n4|EtlCf{CsC0n(ffcAeE z2vcVpm`Go093C~2lnKkMS~M9bu?Ww)Ff3(G+~^1LJ$Gx&YT2=tPN3>F1?h#TL-amc zB7(d)K0AVu;>6<1K#$k+ZSQ>z6}QP7E&LWuHkX7+RdN)@pVUux7u|fHj^O?(cN{n` zpZ;_vYZ@$#&oQx6!p&^6+@}b$G^TA^S6A~M^&pR@mA$Ra-8m-oGPS9(z}~e!s=>Kt zAedMCgaArBq&DX^A%k1Rs&i$(U=R_I^i?mLf#(5@c3xD-Mt1BmEWdHc=c4ZkuOx>y z?dlhr^@%b~vne`1$IZw3Z1DX%C${i`u|ss1U+>SIX5$-W+lntyTr6MmQvK6Zyack6 zla8w!e()q2Q|0pd)6P<`#8hGqxjfa&NH%5Rv+xh3RVMMUMpwB|87<`e6M(#K5;nmI z%r9%@od9`rqcmAus!Sd|Y6bT+W`vLY3yrT)2KH`i5Nbc{DtsvOYpqs4%|i`+mg!kjcOV zb|xe}(59cPhT6p=Zg2=$Y&g#m$0WNm&hU39-e;NI4iSpuPO(?eEYe4)+eCLWy;2Ns z^!leF!`!5qml~)06t4Pcw9BK+FizlyS+hwf9%aY$#KT}W$)e%hylL(_{}g%+=$A(**+Wez4aA`ft0^5*Du0J5l48;Dsnm36vx+iOLe)|4UHC3J!e#W3xe<#NCT)sZ zCARMbwf;@)OiRS09{TAP53RAPn9_ftn0hnNQf%o{b>j`B*;3^&YXyCEE7b-!Ewh)N zvalq4r2@T=sp>Gqr}(%zr0f%{wzjtr2dS)lwKP5W&4BUZrwnf6p{6V^j|LN{M?3Ai z-K}dDr{Xj6q53OBO?9^-@=Gie3nk2#1>auS!oq9hoBjxh zx6u0nsQlgY{fl~b71@Dwi2b*SA-!9aKcb5DI3;8UHiM_rAx@5EsB^M{lf9{u0DGl! zLxlHbf?|V}ocTgj_6D7p+{H@-ow|ndqHJgRliEu0JHpwjW+E^(pDN9)$EtT0R3(ND zS0RV9l4BZ^|N8O4xKWOk#s_I4 zgKV0{9J3SCg%_Fwjr|i#Q9n86o$C1*byCrXVVl0aZ(v>1asz$rOM1)r9FBFmoI(Z0 z54m;;SQ2*=jJ@x%8KP}(P>-huvrXiNYi^oelwR}pd1@&+(7`*mzgOZq$~T#Qyji!w zHJ7GY(bNA^xL5rZ{v4modGbLVV=j6(S*;48lUOFJPk*41a?&VJ)Escz;)5!;)+>3_Lbs6dR)|rJTc^ijtXT=lou29xP z`nL;g-V4{fHz*|*>`DnMdP!!FjB1_CuC@t9Ye;70pvy7vJl|`bJ*EXC#t|~&9?K?Q zv@_azZ3-Sq#xGi|WgR#A=npD9CRlFO6A9NW-s~CdxWnK$@M5251Kl&CMMnfP!VOYb zwz094NqkD=7O;mZ9puVLS1m5xM2*By#a;GW`n`dv6AJE#pVtjOB1=&$>+4tWQyikvr(mKjdBR|T*gWl1VFk9TC96O$|FSzjZg>6>7SGxt* zfR=00xjB&iIA@@j?L;>UhRmlsWA zb(#*zo^ni9*Y+>~E8ZM$MjrDE?luxyv{@ya4KX>9O;2RJFS5laa|umK;+AR>+t%8e zNj$0Qw)EP&9IZVq%E7JMzT7--*E6Xb)eo515Set+IAM)AaCn0!WA#%u_^V8@?(4hNz| zTs%lhe*J^vqz;K-B3ta@?dpozdUqbn|b zQ4`qJqoamo5lO87!LRTNVo;rXT*(*W6{t~^NxnPhF(S~&Yg?!sajq|OT>k-b%j51_ z93sQ#Y8G6JX7Z&*AMY-G`0@Ig`u*xTXw-xER~UzBJ&<)l=kI7W@konTMd4dRy~P=| z*RL5~zb*g$U0|#WlVSG2^W zJyUpC;~{RdPZ_Om!UMCqmD8T28^r8A+)ABUmocvlE188Wv|$; z2&%ttBYaWSR+>vj65wWOq4Z(Zm6bwi*Va<-O%b6u>OW@zUTKj_XE_PdAln)Zf1c-$ z?n>O_gF36h;0bcjw@>1uCElKW_*#wVy}Xj*D)bKhOXXm^R>*@r#-zwy_hA&p)~e+lB^VBok zEP1=f@?~@^gk1 zFO6nog3bXo@>?oJu}nM;i(TOAdX^I^&>gx*)GU8KOAPm7E>3wantO#Lxt^bZSS4z zxU&53k66ClBHXCB%+}Ku@A1EB6II#!QbstCcF}s%m%K%JgoI#u6w#vV)#4-QM={D}#v^sleT|xZ1fIM`Tj#~vH$Rk|myHXmj zWp^C+rF~?!{#@p^=E^aOP?l!m-5yov>KN2rIiA08!}11ir+4?w`1mj=Td9@z5;LYk zW)Pb#f~9KkVs4CxC4=_Jnd?YpVdH9L0)UH6`WO?-jCo||v+HG_#ojJs|0>sLLK# zl_zUSHok0VV%Ck4>zO-LJTb?O?WK7~>3)MhsHygxZj)3z9RqmyoY`&mgRV(9NN>$J~gb%Gvm$N)WceB=dBUU!TexwNGJAC@wU%9LRxlj z4RLPxt?P0IPT{d(q)cxeuiXfwHWD>})R1>guxM+r&viWA=izg5wU9>|BX?AcyK@gV z?)X2CS}Z?UV-|9by|q_m=kd~IG?zF1{($HECA+zk?+e{)j^3~9` zJdaGtvY88OJ5=D{OFmD&{+2IE&e5H_qV%o%BR{Lvd$hiipI3j(WznsQ3=^rPGhNZ3 z)|AsbAUdOjs+0sl8WGuLkD@;ov=a$ji-Hy^?!2bom~)$dSS9DjL`6A%AQ8Kh-MP<+ z>hJaXG~8lPzwy&#Ay44p^?xHu61GwqPg*EOeI!V4*}Dr?Os@Xq)4A39&y!k=yh}w; zw>(stlPjujVX7euQ^s`=TG#bq_T?qF@>2X*mVn}?2%P4nXB_Ql$;<*}AuHT7n02(8zI$96Oa?eX24oXSRe2inq$F z0m+zmdp%<7DW7Gyh!nD<&lI^(S6Ib{7MQ({FX`8f{h_%ufpzt+@s0Yv37ZrstJ&{}%zj}Vy7_H!Pu$;LiF}*Ji`m~C? zKDvFP25)>!zi39|d90uI=Q)-A3&aOAz1p9~Z_ic{e z$w~hHDEH~~-1^-HAt;8pGTO{Vr!PvP%BX~P^#4QAnALLXzElTZouFdBySUjl7H+_(m(6{ zJjF-@$9uid^N-^cHrvZD2$`RpFz|<0`xxy*%0d+iBJ}D zTym*AtdyNqt?(hNl+u+@Iowvo)b-jmH^&KU$~L^v63`{~_X`h8Y=s4!>f>mAruZ8A zAw@!Ige3ExIsSp(-J#WyeVLn@LT0ZQTT5sRaixc=b9gU=+!dZn8AA9g=YMVC({a2` z->VNH%u_D+5-P1YESj3townS<`^NE@sGNa9*F{y?L0kg=4xMS(#|yti@_?J7-5UNm z^r9bcR}UqpXxALo*{K$}DRN|*T+zQT0&w;t2UkoO!LzKXuFu&&8C?wAQ{oX7!OqSF zTHld`#IMxa*v%f8mcvnJHf%^AA65BoJ03de=X9zf=c!-GA2eagUhn1WKW93%cgvFX zY|%K!NJj=~wGHI+DBQ|@FwePzF&!c@T9|#=L{2rl@kobX^^4sB0poy{bShAz`8o^r zJIU3CPuXR!kglr;F;VNvh@9{>w{1M)5b&$p@*7d2rr|@s>~fIp#?&8fO#GUaa&hWpZbc=KL#A|(y%|NWIVNnX&YkV-oPItn4bAb)zS2o)jJo2T(|~(%AEr_N z;{C@Xi7y*Sw5r6@2>N!9;IM-Bd_l`^%WScUULx#q?D$u8Lbo?*&-YP1tsCo3nT@oa? zyE_CA5(w@V+}+(ZxCD0z?gR)fjk~+MTLXCWU(bU+{R8<@~ zv75}O=0#;JmL~4TMMBbL57L(D*Z9IytD5)J!5tWik*scfl9gPLfaO6-plkTR9ot{* z=AO>V4u2&0p`V$IjO zgGbM>eO7eBv^Q2LWbfn0NIN^_yaDAc3Yn}gU3RfynW_&0KB?SMyA|4S1B1reG&Qu- z>Mi}#jHLPlhq!^>K*$~E(vV2>mS1A+A+V;%Ymtq?p|RS9Z>MO73@d!FEHZ@USu#%X zO;KO__dHG~!%)fKGvsipi7gzvi9dDLbv;6gIW0G)RTB*kv6b(r81$gLnXAk&rUTVM zg#N|%ICpgtR^yMVXB@KTB4}A7IArAdun+ZL?bs^XcUshPq~h!6s?ZLAqVygVfe`g2Zsb2A8L zSbRKy67{}7Rg}y}`W~Z+{!rvLQwo*zIYC50|4s?zhp7w7@fzz&&=@P%Kew}cqqL); zCNqD*M{rW4xkn-Id*VH|I=3#Pn#Iu`z)5Ld(&k&E+PyQZfW4eV=c=F%P){VXT<}(h z-ISxm5;juaH}3xCPcGN|y+E{<@Yt9H>$hI^o*m)YiOlKh&91H-!k&#jz~6w7Xy@Dc z6nSN3KX<+NK+57?^jf}^zs$*5i86KuE{?CN#9eu}3Na(kD?ybApKhNp7%lixnIBy) zx2^FmFFyPkA9&|is@^-W$}q;q_=pD{qc8Cia_GvJi)`M3&PP`{c_$1dgJFAuZv6}DvqBCKN6*~E ziiw7-ZVk4@4wP}vu>%WX;183+y_KMlKj1FGc90xEf8f*?pzDmsYE0?@b(tyXUV;Y! zJNbriUdkia~`W*_ZJb%cP%j$cs>Y*!iYi{xHtJym5&0 zdtaQWSan5SMd9}U;b=}fOgd>ndRh2rer4?^sLs&-s9o+>6tgBlQ!*Re|6G%58l8A; zILKwa?I$aTsmeu{@%w0-fJ(c**4iok9K9iL#O^fPD(CkP@h36O?hJ)fDpMa)zvEe_ zZmr-+;i$0H-f~jB!B;|~6C7r#evM9Y0iNfWjC6xwgmT!<1Qo-)spS)8B41W~MI+P6BG0E@P`EU1zEpw6pI_U@Ole7=)s_9H|Tc{7CA~1aV=iGnU*)fsFm`t+9l*LULH@#GCr^&)w19 z>Q1kw%W@8GeDu-KJ{P@#1V4G;r;C&BO@6hY=bO;lc%vX=b2nIs{y4Pk^2K+*;N;~) z*W*D`?Uf$nq5#>mym?C0EKXXI<3Y*-s9!AAc<8exgr`4a z!|`Z=y-+O9!5Mt^eHK*`0H%zWN)$$YDcTy{Ulm55Uoku5xj36axWozZpj^)XynUZ1 za*ZYu*w2I-zfRS0Js?AgMpyzV>Z-`2KkYH14~SZ!X?o5NQYT0)!V9uagl(8w^FwKd zzGXKc#f6F9zW99s)Y5Xz+u};3P zmnoZ@Y58Ld$}(cjYSBVDmK|6*&>X)MB9gNLsu;|j9{_S&XeiphV(TB7{n55 zo7%n0CZvqi(E|i!OmNPfN9}hf-tXfeATk3f@AR z)-98Ac-M6*k1ptlA`JEdl4;X^x=YGgD>9oP)M0Bpf#t@8O6Wx)gsR2ZkocdA@4`m^ zhPc=|BU9_l)QGvl+k{Epc-I)VyHZ^~7_|3coisJ~?%Lw$$Bj~0QF?vc>O;_-6Ourr z8Bz#4b`D2X;SGEDsMRM9{F{nX6Z!SK`3$*SQvNu=?V}6L?iN6=p?O26yI9W?}o*V2B9t)Y+Nz=dW$)466XUloWD_5EhN9 zX36Q%&Xqbu|1ODC6f}I1=b}Qo^ddbmrM?UH<@oZ&SkNGXv{n`PcVC=_voj`^qUhJB z7+JCGZ%MCqJn=CFZ-3*#y7T;V4#~L-=Rx!FAk?2BxzTmVF+3Bxqyd^mJzCZzSo^lL zXE)j&Qxmk7%S4i;-|w#B&c50|AE@Wj-37;m?JJ;Nx5W1z$UTNyw)p4WXvH1r}0nAm9U8rdYgH$1^t$L+^UwTle+U9Hx&g0PR%+|?aXeuQe^ zBdjr;i%+%x*YCGd#Wc>*WM;^rDo_0HzYmQvgtqikEJXnaI7h!)0KaSEt#WA!;D#D5 zPhyvlk3&s#>i@<`|?P%T5Tte0Zlj)+sDl9~r9if~{jA)OAY%|;nWPx|sXP5R(bubgsa%#3Is#bcOm z!O<~UQXB532S^j9>XMl+c?cF5wKK(%Yr$*887s{H@YVjufU*s`j&;Pm+~-8$EopcT zBHlQ+lmmFgyR+m!-x3r~ivJh9V75Oli>V@#xC%vN@{t2?{ z!OVx@EU~GNL@6O|SOIZWoSvvQHUkV#zS_Gxso0J(s!LBBZ5`AivRO~r3!`wt>zEzm`AmuK48VRnCJ5wE6<3^ zB9rb7v}^z!VoS2{xd$#6AW=$iYJ+4cQgKI$qn@+4&5rQVB>5% z0C4v0LmXtmK-lq~qlWNY1oJY8z@0#(;0ha1E(KsEwI{D!d-v#?UDwqC7?q7_Za=#| zmHi2O%C56gjuGIE4;%eOjeD%}veLDguKI#sFQw4uwnGkocmwyGW+YgJ#kq?SE22es zp$^3t-osEm9XXi!LjUSqlwVa`@Z_Fj=m97;&{gOvbH7g9l{SscBA7Mzt_ZlHx;XA} zN=Q^(Mi+AUhf4TFr~dn z3sEw<-^~hoeeJ*#CMblk?3)Yk=w%%jUwPn*M#j~Q*ErM~dRpt2cC^N!xIXf-y3GEyWA!f*7Ab0x5Kn#xpJ{z;`)!1Be~U7$ zupq!5C&Z$-o^*z!Cb@m1JshOYb8qM!xFk|74f0!bkm;%?9lk-aQ=pqd0IdIw<;Hq> zay{6DO{@Qt;ir$omh(-+9B=-IpiOii)AASlF0s~%x(XuzI_3QPNEW1fxFplzpAWi! zkRed+URYr2k+I^V0eZQflQ>SQZ&C{4C|(Gae&(i2yEc|Wn>VfR&X}0mCZo2Ie!Wnn zabmP>0ROlqlQ{W7)tNk+a-u-5FJE)pn*5w&Rr^EpWEHUSpp!=#;8G%Vax94t9{ksbKIU1^{4zw=f z1Jw(%0{!p9eYp8H6=O6URJ(fRwS=?EKm~GOlP*4dff3Bb;h+}@+`mO_geQ7Cuee!D z0pAi`lKHg9rqk*@ZQWWzxuH+&=R&c1?*8Gi`x_4tmKT5>A3ZfB_A>c)$(fd?$_&6r z$1?=gWVXdL(7Q1S^qYB#U~>4b=A@VS`4#lg`l;AK0rg&Pmbfl#1ndLd$?zlGoSFh7 zHCm0X5lbrqKW?GK-`4twf=$V|3s#@ZkcR;Cu8Tn)_0Tw6LsvfW(X~8l43nF(Q*>&` zQ7G`Q$aWeCNA0-s$!gkUhLfF_qh@jbnTyz+bT!d+A+O1tD0$v_*Ru@5SB|&sQMN9l z3gA=h6`>ko!E<*KSR-}>?B^$OO3aTe#)x2CuKt+>lH$2PMha2=Yk$ZBz^9MS+?8R7 z=u&?$l$E9BX^g+?ErSJx2@iiFmqZA>i;JDoYRQReDLkJ$?(o*x;L(-WjnSQFVzU_X zA$)L1PF}q<*8Zx?Y$N&z;(EY47)UV1D~Odf1Gcw5gzna6g_u8ld5B*^ZXb`FTQ@k^ zc5(T!l#5C?@_f(6@OA!S=m9pmRl7I*X~|C1yiY3rVE8jP_dj<|6VcJRP8WoUd0rPY z7qzzHviQA0(5fGydIIWjg#bQ3JgC4RVAcDq-kinckOElIWxXbaHPlY8os+O%io|)m zb>8K3J8y+WI!fo`ORH02T%BN(j1KYga72C;l#CiGPx(B4WvI7>*S9p#^ zc%O<@($>EVMvvcEt?*77kD+dwdF0LQmg%!q!c7?sHgGbg<0dw7W!<-Uv|#bwAFkTb zdwk}8)xjF_xX3PYa~7#FAtYFH=4B%^heSi}lJvVvwV`ce%3QSH9G^uvjq*TPWRuV{ z&Lrby3&K$uh-+!=kLp8Pc>zb{@Q4?m@FUzMkH5ulX)A%hv&UXPl3qvErE}tg=(&h% zA>TlSC3WA6`+hzsO>=Uk3?} z+m<>wFj6OjH!bkownlyr+|D1ln9rFWS|=>}8~HLjD~aINA10*mboV#BU*ZVE7%%P7lq#{6eMLSd#Y>IY?)eUtXbP*Wd`(6W8zsv) z&g3Ko|L?VSZdzUS-QZPD#pWA(+8<>-9kCLoms{)wUl{_zwNq+? z5cw*a^1MEh3>;$1PfWL4P5!a{`mu~&Df+(sM~ui{tZK~iIGa%HW>z~6cD-xs2_M-o zW2ke?&{WV5=D3D&Q0M!nydM+$YV-za3QJ6C$o`$#KyvTR0iJp1yxIZH?Ypa4{jWs5 z8*k@-J~Jz{rk~m->ztFty#5l(DN_^3R1REN?hMw4O)djN zCj^ut=w_n@^w$K0O-G(QnCPOC;i}CTSux^1VxpX^M>B3iiC=ZU4D$FQ^Yb9v9Q^NI z0HU*>Eu%#$py2QTSl~3)#l;ln@bBvgQLM$B)n`RsnNwg(TPRHET6!tbn?uKgj3Iaw zs^orP#gd&O8A$i<25!m+DU&1j6dBcbJAJ{3(&e5EIePi$%gX9pV6R1X@HMYL^R-I7 zNOyGGA;4@II|dnLMM9hpu~|H_E>XANz#XRFl5k3?K}QYrIeuu}X*s`OXNIVz#E~?P zu_Ih6f=#{Os5)W5gv8?+Qzy`s{ts2h>8khAmOFq93MZs6Em7J0+eaorQ-^>zrWI*w z!Sz>@H2G70iDRGn%GJy~`6XXQX-&XH^Ydov@!j9IN_AO%#@+M1j%pF7k2;dF@D`jO z@r2#bC~1hLCYuIPEOeXA1lXch%#OowY!tk9duT%lIC$`e>s~^Fp#KNf~yO7f5h}q?e5$7?SuP7POWs1X3=~Equev`Sd z%>((=ItBC535>Pp3U$MA+Uzo`{f^+oQ1gvXVVCQGwYsi<&Y=WSUmI8XZE!)fa7HqI zaqR;lKcj|!+1oNQ8CFvrPqJK0Ms_t+QzxM6YH||ApW+u`WgAps5azB-pR!X!?W6l8 z`S;IW;y6wZ{|T#qN#rPp%@PS*r#-^&*VexyZLfog7qUY{nT@TV@VBFIMI9Si%TusE zaFuiy`jWh?a0fs5!)$Ea&X%iy{`8M@JwM}$AzCt}1aem5(A@2Z$>Ey1a=&&LRlS;! zSe@|5`-jJ$_;~V#;+cE(BHQ9+VsaR2LhFRQ6hI20y~&&nI0T<9^X4u%ql--I-@=*J0oquL5Q z)Z-KO&bddaPfl1|hM2v9@~J92QqU9P8-)_k(T7MeS?{@07kp3Jsx^>iyJ8kEWDPd% zB;XI}@qwKCO2Uw6q}8w1~{>})c8$)H^nsplhD+Y$CH@2MP4LC(zG)BK_g zrT%__aeyx)j71R2>FV5p1hRF`tq5LnuFnyrZ=9sJzSyzzq?8Cg6tF$<@Snzv5Iz%D z(Uk`gwQNaD(ry*>91Vg-8>}-5oBsxQ;DLqQ@tR@HzbYdEnqmU-_H6&e##({}1VLYp z8;#Gb9uhcc=OQEBggRLkV9JO& zo!tCbF(0=$?Nc9{t_&-)H$L6e&*|wCVg5iS{zMaVM%VNI1+O$eIxCUj3(}B#1-k*Mz(K0L^hSH%ER` z|3tYDiDEiNXdlv}+I2Jes<_1_@t#K+w*OZW^YYJA?K3xN*jE!)YKmX!V`-*Kl|fbl zB~@2NcbQNjFd9z&VWAuY@}-+Wlzy>cbt@y$YtpONdwzXmTmxxPQ?7~E$#1o z8M@KW(uq9ykE2-eels{+Ga`kZ|6)f97+CNUuC3I+9H<3w1eIqh*q3%nK+nY6n=CDe zFR=RBp0|qB!j`-?f}5YSS#^2&>ukfA*L&TL1vIC+G1m_`T7B0&^*mKvTYrc2`9)*; zB6cwpnT&2Ei%;lFkw>*P#*D}sN-Mq;;S^WF34c>&yKRuaZAWIOPZ4U;wiqFKdq-ayF-1~{V zVgke#zr%_>4m6l}5WCVLwzRccedB>Q?HZo(x!&Hqghenxi6VDVW-mS3*B>O4d%|JQ z`q?Z&KqxhFOI=M8k?^!BHKU;L@nE$w7a*qEc9r|@^*?3ycoUwM={-oYT3;{%zHJe| zZ+@ji_;U=+*OmcWOX#fg096dp)T6_H(6z+$qdgKTaI*x8xhC@oVX~Q%Ff53 z_`?KvZme#g{zoXc$^DxtdI|6x|LPGaRB&y=1FutIw_*^CN|k8M_L~6M^|x|WHS=DQ z;6;HEMW7b5%9%IhS8uBPg%Kr|B^6@BgJ z;sRo5(b&5~CO7+uWlQQCTi&ros?4Z;^cGzB4d^dtzqs(4g&Zi9SduVJUd;cy67Eym zmAR~xBhB$IEIZ(`m=S$)>0kpj#TU7Veb@Ud>W8#n4WLH?9OCkx8&Z5=yycqWN7Y1L3uG}>8;L#+C9E=r1VK+7+8gvvsFyaB&dSeI_(T-1c!cyb?2J}aQsh|dnd*Q zMHR1c9D1w(O5OoIz>SFFaa(EO0%5GtW!B;&s%!pyHS>+ju{ltgyXtYUkW|+EFin~O zWWt`vwXc1ftvyjo*>GE>a-~RN+?T51;s$Y|o7fVl%C^fEN_$6w>G7CGWTyK$tWefT zGi6gaaQ0{Dw6pcY=V3cv;AVLUE)z1r{o3suznt=IPy+JB6a$dsk2J|4X|0z@M$Lg1 z;CV%Sw1xYwI+oq9gM*`bfKWnH1V|h86aHj#dL9*{vRIR3^52O$YpaUgB@X=T1jm0i z4SUCpA0f~h(3i1~CT->`O&a#6ma;3Ir{t`IK_2ZPb04$D#i!xc8Wdza^yS@zJSE@N zb7vt%z``p42Ax~K$?EGd%bm9HqFWN6lnjlYQ*pgF@iN`IXDe-3aKyUzFeDzYYW1T} zEGKx@ZXLv(xS95kQF|}nR2N*Q?cEu{^d(yS$C&MfQs+caq3NjP1ateu5Hdx3AsRZB zjpE9^ZZAu(ykSOINd&`ReHH0!75{I~iJ8h(bAx6e_CDXYwwmB4#D|8kraxIBv3A4? zqJ(tCpST_fF-NjquC0J6MxrSS-ZdCD>>T5J0$OCP(`q{J0z7coscA!5ZT2sfU?(&c zmCwrN!#dkgke=i%>>)_!D{vR2=_Z3rZnG&TdH?VpDz20CI!p>I#K`YUie+FPJ#ik? z-t$z(*^XKFkJZKl_N+P3{Gm3xum+TKq+|^K5q1P5L^GJ7iP2@V5^tCd3D~1?$eGmt z6F#{=X`QmkD781!m2y4pO8lK^eUY0PepT6TU_1OAg?+p#pmz@`R&W>S&&;p|@7YpX zspD5Nq>QZl27aDfUsoP>o*PYkqa-Gz7gN(toN43M2Pnr~I>G;(!O<)(O`IwHr`_ex zI1gMde{03CgvFN*HzpvNOrp`!u4RJyRj{f{pXGl)o=rPYA0WPgyXon#L@1aNd_Ujz>R7`foSwRpUHLJ>>G*5P>OJm=k#iX4;#(`_*a?3W`| zW72xqEEe!B>=S;H6~DsJly~r9LzMdcn?y{!G`w*s9+WJQqewUP0d_zsJ9 zJ?v)7enuV`(%0{kpSZjMSk?%B2%r$OXLXAW8=%ADs_YLqciyiS_;EVSNa3og-ALJ7 z@*RK7NW4s;r1Z`|=lEgmO>rLUCo3wZWr1pMh;fb|59^$gfSqBu_0~8_{+jv=3EOKw9e8+^RmQ@w}u%(H;e3}W2 z`VtbCent!f%VNv3i1d7S2QK>aJz8CVT2vyp=7`)=h`eE%A^ascjqa3geTXGpPL}S! z=_YC6z>oRNoe;7%S^uy0l-}_8^_no=)GZyg#pX?sEVN@&0J=Y1@A3Y}Uo|9(uutOu z&ktvPfBoL6kK^__vH#<@6x|3|7dZdxlK)(Mhx-42{(s9uWVzuQicG@LKH`6i5+bL< z?KwbNlb+gNP+Im!QC+hH62$=e9@$l3z8(T+KF#1P6y~LT~8!h9`W)(o#40R0qO5p1dTZ%YFCaP62JcE zYX;CKkF&R+ZZ?ks*s^snS&E;8@DW~j-^Z=rC@nai>5l2OKTTEZHyK^`PdgC;BCs=` zT1@i%-noMt1Et3#LVW?0)86q80ZDe!FzM>FF*xkAZJn{yYht{@xcXuR-BV zKNk^kfD=Ve&t1d>u`0FR`Q`%^QVJgHli|Na}@q)Dp`-TYUsIev9t95N@WBcx5C zS_d(M*5-jOYxYKd6U;-~Fgp?X(6nWjWpCZCa8mLEK-netamj#38-DC*c+Tf;rk}T! zt)jj$opDOHy%NC_K3;I4=1SQAV+C5TsNtx(`APY%>YxA=PvC+}wv8PR*Z{QdLyI-s zoYy#`y?&}%8~1>IIxL>I$`;N48kCKl;#hq>L%Furp*c|gzp6T0M%}1($I`<0jo%g|vk|mA9Q?$SbEG^a zjzJLDy8sNI_*7tT{KW9n{GSh)IGP+iR-F;ifIqp=rSwE)# z0MXLE_z$+JIXd4J2e2X%^Z>2kBQ0D ziT9LQaufvzMiF2`0|np!(>DBI8-c`lga@(>*#+DQd943=hoAiAI0e6xxu|upB@Rr| z-DI~%wM_X9YFUcxZfpbglC`i1Zmc)kQ2z0&A9SxnQZjg^fK5}l|Rbq)? z&<#DfC{G%T>}0in+&s@xci%$i2j27NF9b{hrn?49BhYd-I&&0AIDZhulNIj%(Gz*v z=*XM?cwp7FLj?O$lOxKOLj*Yv-knw36|8u5*9-%9A*N1dEP4NDOYpJWdYuNvUDW3)_u45J_Jl z{B(a~x;F;ybhpA*5#@1fJ9&)%)?+_RRs4ec4UlLhuzI?1B^2_C zK)!d7ad=k}?df?c?z&LI7u!C_`tRNr=@@-?f#W~y2)EA?aDAhKvB<6W{_bBtR&W#h zj=Z2rycarMYbFC5OyS81!{?aVHP|L%M}=Z^&0%y6-GbPCwjS~2zw3GPw^7-NVEAYG zJdVWPym_;_7m^4O`#AFU51+`x+##PljH$b#$4q2{-k_7`R@>V8~x<-oP&s02iM(GT(D6u$_iH^3B#nN#!rYJGGM!m>@nlW}CoUt>%x*+j+qAd|3aG7jK(#E5i z8I1$qGBP|OZ$+6+3%B_d12=jmMG0Ua#c;*}&)|3XV^MGls^L56EbU`i^vM^@&-zy6 z$cK9Eu--QPG_2u}^@Q=fEc*PqEpj-oG=1aQh4%<3d<2U=4{uL7zwe2jEOr5&svp;# zYiiv;E=K|Hy}w-ZU3O7dp$G3Dr`bgJ{lIu#Ys0lCzQ=|iJHTU+!?(fn8ycDczWwNj z^z&!o*;&!9ZKC|8K*$WFmb&{_Rj6yowAkAzK@Gi5uRmIk5GZTRg1Xk@SvRYeFtAf? z5JHmk%$oQQLURv7C*2^?};qk@D#5}otS=an$G6=70}5C9!`+v za6V(peEQfNclYmQT%2@Cj(Z!ELJU-1mD-5n<=wTJLVLYiauvK+ryi znzz>)+(f|Tc9hm#_eVooiyWrY(}1*|mB?qJuVL#z@9Pgs;MIO1Bj_UmAguG4_dGYx=fh`r zwU#|FRzg;9o!;o6`z_yC&wbM!4QMJn>iu<8?+I^9*4bypn&NC}+j;mmwg$42_glOE zf&U#!T@4fUU6ypKtsi}yZb!5#1T%omYl*i=vVQWV_cw zk>hFFKHqPGgw4?%R?-Iz+=in#Jnc+N(Hu2PlbPDM>L)9nQTInV)ZY4KuTDp~*RM_mVP*L~ zN$?+;3G4b`XhITy0tk-|Kbcm+VNQgPEMj~2ZN-qD|LZ^s;Vts8=h}Z&V`5=J!h7O zF>J|M7u)|%QAz!F0-7x?lL5YL!ai`kuPuMqFNL0Zz|~+kbkr^H^a?dct7euS2-JaK zhAnBe5qMlT{n6#{N|qN89SC?vFgGvp^7iI*rj1u%8WKp*o+%uVeb zYf)gv+!(zsc&X!M7+oD1D`EP_gq&O`{F)S@wvU;!QM0k2km?a^ zVdoySAQ%+ML~E4d^W&J8#7<7WWacsg@EvpVn7IY@buAF|60N7#Efyvnf%Ogf;X)5Z z4a=(whxBEVKEyU8$-ez+Q?WtRV5CstJAN-mY}C0@K}SX`+frhi0|o9D?-ijhrLvcm zEzHMjzB`U)u;IzKmq&K`dPgcpEi51#+_`@X)?&>sSALI-q&XD!kVrz_zrAS4Qy`kZ zTP-W|1b|v$BAvF1Pa#rHleh}7xhd;904s{~OIm2KG{<|xCxJAxo52@#UY5Ls*~1q3 zE6=2!nl~$aeM$nKNtM0??oG?I!@u5*BwCh4nhD#Q^3iH(jUv<3@EI!D15{sU$|mV!Oj^z14X~y7dqXc77mnE~ha zD|VaD7mZCqFC)Z|x#B(gp4)269`mJat;T2f#4ZADHVamedG$&zs+8xVy2A>5iWD^N zsJY;`vo+vmBT$OY9#r7EW`@^0K+N{wr3aV&%U!#@lJGMeTh|i|2S>f0$Qw#GxcN5?S?6?X zy{R_nQ~+Wy6xIx!#4sK5Z{(^2G$FX3NwSY?ZpiTG&PFGW#`QgyEiBtdbe~b`{2St?Mb?LUbP+TG`Rg)AbAb&;rWCZA zeTet{sj0!o34z$Yw~V}aVmpz7ZGo~1OW!mo&Og8?<1Qlm$#=mUpYg(sS55@|nEi$C z+1P?7(2b&EX=E`cj+qgr%H@h4~iae)YnNS0>~pja;?PoL#aC5^2>H&7he)Rlh*1;dzsF_t~DSJ)4aG zc!@`Gc#knpO>U5&U3qa{Y|7)xt5SO59&*(9oUIiY-ZeHfc=V38 zRJ|Hrr6x)nXH9r#I}g#q#>l$m0Lb0PcGk&`(pJu#O|O*9dp{=VxSVmb5hA!sE$@k1 z-n&gzGg`xVAiiCp zbJbARM6N)Atrsnr)riW*ctlTDYL?>zIr<#C&L->5fBmkcS137%xjQ8C6DP<6hZHzerZX4+9IIe9Lh>aqQCki|0dj+sYkTn z-to6CbKI7!tiePhu+Q>LWIoTp`}8|R3qc2JN}!Hy(Ow7%5V0pqNz)a45Y>4*7sKpz zyor4YAvo^sQNBMKIino%Ieo0j;rc;ef_`?9udB%0~&;#fksL!p`9<%`LPwK^wK??Kz0oH(4T;im)q{< z=%F4inT}G^4XK4FLzG1HwX{MZJeefXkJE(C=b>|<_W?Di;$pu+IkxVe?|AwHLkdw4 zNF;8M<;Sg-Lv8>UlUB9SuAn2RO2_lxN&;QJS5CU&3nByoZZXxLa65GVabB-~`I=i{ zR2FU>XVO8EKkMqnv^td*W=Vi|X5Aed8Tk*-)llkQ!ecskU_nX=V|BEtqd*r;YapCA zW9%4@KZS`BeZ`Tkr8P6cwIyaVlK{u`lYo~~m$llLzKA&5F-%#PxzW1ZOt%Ymn%9KS z#GcrX;AeuXY+|_rM=lZ%+y<8U_)}c@g`F8}!=$cI0vB^M=+>34Ma397r1sp4CxK9t zmelUh*Cuv<)qt~DheR|Ito~~iXY|le{k{`C8~B#wq}wdRt97r<>}@<{WoYO>pTia~ z;oOmFg$7_szT>1U8AKvLcfHI$o)Hm{X|Smcjf*=;lEJVuFwz&!ZT5{4JFD)&^@~E; z$}L|oBp4>|-QiY`d+g%*qnp0?W+__l(K8=>`Xb&ZLFBjEtcNdVh4)om@4CNwn#nS) z>i5DwJ^5`U69gZmzK&e8I$g1!6=F{(QD|4QFPN$ck0eT$#NK8XPGXN%CX`CCL`ARP zY6-E>G(4K^x3#)bWOT0tOG7(1#N|UG2}J((x?oE^2J3Nf<8S;rxD=Xk>yoe(F|-@$ zS*#$_io00$E6^+^NHwSV3P$9#INE>Poz2v9uiqXuT!_9|D!p-OpZ0b#Y3_aIg@NE7 z-_c?E3JI_Tqit)EY}qZCd%>-(h{{(5#k3u7%3@z0RR2oG9z?&4cX;U_kq9Kwi+(km zE}WT>a|W0PQxI z$C3!viVYn!S2NefCeT54vJbvMXgs|g{`t>4>wPfo9_K@Qk0M6q&G zDr74ut|d%Npj~wz8#FYyK7qKV_ZBA>9@icP)lycGgD__kYUw(fZ9GfB7ar1*2RX-u z?83h~8AGpq3^5e62&SmM1&Kn(EfE(>(80jG1^0i&=?Jm*)il^3-k!qnn5`*fRS_W@_ZFn{Y+x3 zLEE2PTWg#K)n%_ah;$Z2Pp^>J?&2pEefVh0F!!eOKvW~~^}{fLmeS?Gi2ygqQ7h1o zX#0Gs#rPiaLC{%q5db@S+P=T+=`vggv}@QLA4hFjR{&#iRWKMjkDMORV(O$9)sgn+ zzds>jVFS#D*ev%hFE`D8dV@`2xmgX>vR79-Jb{{zfzJ`+^h`{LN7nO1c7TufZIk;) z$3+eV%y+9JnMN*(y>2V0EwoEBs#P0(v0_F2yN4Rmh z+2M2aT}34Z9tq0padWUnVEI2w?JnC0J%5T(oSM>Q=<6l5BLf^D0F6|c4Yg-I1o-^T zdo`+%sVSbMw73}UBy6!RbNj4vjGh73Pv~vUQQL;JaGbt2{oUNFPVs5c>wue(~U{ z4;0G(e%`Ed;il5S6lLy6_uGF(c3VsVRbfOa_LQ8DCI-vyE+Nk9{=K)Y($rq*ST z-qx><-I>&F2c~*S@g63(C)nQa;XA8EI(}_CC0fjBv6mL>L!YK*=ND-9*zo1W(SJ6b z{|T@m?te~(z0o&gRT$KF;q;a-w12I2RjNc#C+75)*!0DYj6-KuGTM*pL4wco(B)sX zrM~Z*EJzc79fvYxTGABZ8#|RzRNRAbtk4L#TOD-mcWNxBOV3ln!Zi_(yTK3Z$)th| zH*!#{9d^p0|8WlnP3nKAWequAb%thgIhgha9=RSwMhuqNLVSft2rInZZuVylx*Lq! z1IQL~{=F@B4%q$rg^i7kwez}2@e zNM$9BxET1Q^;;{UVg~zRyMlMuq1h|w`K&>}+f-HzSds3u(p7r{lD|**Mx>?H4VhFO zC^B;6gLfD-v;wUfG|&edJ7YSNA(+XlDknW~Z!d~F=`DHmiVH{QTY{0hE$@WO>mL^8 zd%m4%PX?R5&)@Z@W07H%{x9y{DLSvN?-!13r*YDtv5ls&o2E^}E4FRhHXFCGZQHhO zd#^t4eLs62d`I8W9(x{TjI4F7xz_w!Ki=M1dg!0~6o^YDh2Nx_9t)u8Uj-szkw()(-i#ia$pe45{*9u3UVMEcE>TT~}ATl3mV#vxVWo5$exS5x4a68Q&Ub7p%X|$q*%n z4Z%a<=)SvKJHV+wm50GI@rmbXjzb6qp9SO!=23p3gxC_)Gv=_z28WH zdGcKUbXX1xZG$vs-fc>pJyHN*DEDp2)_dykl>hF_7H+%ogeQQ@Xufj6Q*E>!rIWshy0ynD?X-Lf1>=fB>xfuX zCjPkUH0#wb6-RZQw4b>-%B5ulh|{@J2Uj2*^tQiyRI;F$TV`1=IN!FOlA~Q19GtFd zh`#t~nOhX?vN8fe6ps@oD(kcOo0~rMOWZty9{jnfwlYpkM!az4GAY01M_jHrY!hNO zmR0X0VF}dE%q3X#>D347=QuuPzTBbh?r`D$NSKBw9nLsIyMfFvOQ4FID2hMGW+Q_N z>nx@PS&&rLG|!mIWygc%)Uzec^8hL4ar1@L@Z&zfOgeq0dFz+1Fxvf=9{1%a(g&*k z``}#4aH`mOjLU%ZObtA>=51W$A?D5ejEJ;XkGfnh(k9 zc52}6wXzBPl(sxG?bJWhwy)zeS^mAF5aRo7r_aW!8!56@(QqUd^SZyGy+Y4efr=dAe}bzq>LE`n+X1yXy(SUj{(QP&J$EcLxhI4X1XFXwST!m$jVF z+rt|5=46$%;r)yx-M$xYiET+1o0Vby~dn`oIt>rlk5i3VbWE zVPE#qXOMc?vxqvTvxg2Xr9dFdc3B$UN-I_JxA-34M#RONJNK97m0>&uXcdis5-sj5 z4#&8T&CT1QbT~k25ro*( z!@S+5j=xfBIG(K|dYd{PW>py5BAD_E9TTp$Vc>a3Zf!{=SZhXZg{_s-u0QXifY1EC zh5jU<6z^AO+@ZA|ZRnUtIa<+pZz&A5r_Hxx)8mh{wb#=cyyk+NwgeUZ^g>jggkH?_ zV2;~IC`_f>se~irR-HbuGLMxH_LW>+bSEzd3`6n$gmqeSU$69M-J6fnR;^4<2$3oo z|2@A#m%Mp->-)))nvc_6{7whaqr5q%7c5%R&Ggk7(Q4Y+XWDHHvcg}hKYU$O&;O31 z;Z6em7|t?T=i8MK{Emf`ZvOyW&IkJaZyAG+3p5?}x(ceJ&+r6Npw!Qj<$U;S-C#%% zFsnlz`^(Pn-41yc=E-CSxE!%`YBc6lr`se<{Nn`W?_mcA!w<*nZ)@6Fqsp%%-__KX zb{zhama4M&dFt<_2qFR3_+_g!dFi2_E01Z-$-K2X%0_^URjste22f>l-@U=XprD~I z=UN_jcK5D&`;g5SuN+oqoD>o!#^V7Q3gDef-k;%GZ*&MpmGh}oug<;p)Z80odPEZY z2smE`;5zS(dznsn0zSDb5`9+W-~YTm6P>z_>=}>BqoecKtuP`?PtmY5Xb|#66OUqH zRzmeMyWC~D99ntpRW7pOPtPN$gnV%LF|2jRt5)d<^GBoML%GFyU?>9SMJWiiG(wY$ zUfI31gMZ7kNjz0Gq<9F*>iy{6s>RKV_xGaw0s81y5B1l_D^|Dzu26!4a8fkt#osRR zGtJ3r4y(->I|cblz5zv(^D}_c*M!#n@BoK2EwAPZE@!%!iZYehGFW8}G+)E7*q#m+ zHbdggX+NQGqaTzkN`Ka8*YVtVVod4^<;;ZLpdTo#K!7064LO*y7FD)_Y-7sF!=R~; ztIS7hp}^yCRB8Pb3IqOHt}9!E9Xchon$OxO^drbNZeRh{Go}*hWbFWR=~Q<;^NWVc z0nu!^jt(XhU2Es>&s_QBeqKI!`T|ye&*j94e{-pkCy<@k|dPzH6uI$71%iR4$UR z-md&4@me$G{l)D6sA!%s{#drK4yN>EPwMwAL;Jc-H}0?+;5-Ku!%5s@=bd5Eu(W-; zpJ}YwT_-qDziMc}h=cinw`FRrf75D$K{tk;RvhsO4SXKy7Q-ThYJvh>_=X^==eoR# zTnOJ>y)^#xVV;qal(i*E%}y92eC*7-dizxpzVPZAh?XYwh1^Tu3z)eLw33cZKl#X6 zp}J+MyHGCG_$|Qy48^aHvf!0#C~cuvarZVv_)^2~DQ7zG&P?BCZ~P*9v>R5=wZX{} zm15l^7J^s-gRw%x9J;Bn5di)fT7G|Xn*S3I-3U2r+fpByIZ}AU1)9H5sHy|hL2$3F zXWX;*=C+Ik6SG{kiK79*IiI%nDX}D&b#_CMYm>P^!aA8}xysxOCiw;ZS?y_?L6E>aK|g8Kl`yFZ$%~+^B9eHwM~=Q^+SK`e$5ltyjE( zH{bKFgK8{K3rE`SEg>MxOJ1BEG&_GiF%}ITAu-mCjn2Lgd^Eyw1eYyE>62 zX>tZr$|Tjgc*qskJ^}uFt3d@06UhHB(UfN`73gJ@Erm~ob$5q^gjiIyyWJmMuO}wV z>T8{HDg^ifDTmMR;E>7pbL}=jgy%)5JoC~)ayiZSuS*e_jAbSBu^9^jrmr?fjeTRK z&nPSJ+ZK`ZOKta2fg;E71vNWAlppz(kuZ`~J6Uvl{RjIPmUCFadmdTY|C z65l{Aq5A=9dpJNekKgx|sT)k~=%f-Q0HfCj9JHZ(S8KyxR5UI$=} zmh37Es|>d}Q8=DBjT6uE8?eF=zwjFk*K&IzyEiXfKufpLSzbXe@_+AW`_^Gv{09Cj z7upMP!}jYGx7&cGPxTGjMp$h%t;7pfiWq+0nYHjZo^C1@6-0}BrDHXOd}up^No?)O zOPM#Mr2=?@Qo}h|BH}aV5|PGZbese{P8YeK+1!di#B<~!>2=)f?DR~( zhRAonaAnBlUYl?dwnUbQw=j~mC@`>gxWLE{!w(G?a~a8!W;Hf{W8&;{bNl=q{LwO3 zsSyjq1gF*sUoTn_UtG}^W{cYHOVft!ytE{j-ugCWF$=qW5HBnY>!OC+ViM9qNsXbl z%i(6Pmr-c|tm4D^71=`h4AnMWgSe9+H9wesG&F6!Rik6;#t)#BEcdDp-gL*Mm+%+o zTdx(6FakN#wk z3vvB;9*`aE-o+l-==@chEO<T|Ud*d;lK?P`) z=qkxO+?ri?aL&DO`TULO4d)1aKQxE#7Jpyxf!f$O9v7sr)4EL~9#Zdr9%kj<0e<^n zg37FyXemM35dA@{56yvASozU}_nY|O+ZO;pgR-#*B<0`rQSil@f)Wqv90rbV2|XRA0H(9kG)z_ae8-h!7{Ft6BxwbZewd z!hm@>z1`JDiQH_0CfLw8ZsbS&ngh?7Z9Oefvc6a3rp$(_Utwqg!&t7Z8^}&8I5YGt zA9uX=VQ8aBn>&8U$V}-Iec_XujC<5Q7f<%{=fHqv`I}R$bG{&3LA~91`S6S4Wac6h z@ieU+xIWqBjtB8~xyY!3ASOKF*Wh+X2)#gwnj6~4OtY9I-8B_~HOD7z^MsJ-0fGZ{-XNaOu`j6p z7}%T5(&4f|BH{|QURCpC1`?ggtS%T}B%U?CfGwA5VOFW~Fk8ZBotR+LU+x)FT~fk3 zZF5P!=KQJ!*l8@cfdt1(OzMdpBB*bPfJ9V%yfkfcJe@nKY=1uE%XHte*&A^|3)1si zz1!LOxTg7l0?F7KPbVG@CI9_P7UyZ=H2-s5@hFXjMnLOaXUS#%zqJ4*e4cVfLPBqE zFY7Axbxqk*Z?EfY+jZ*T4bNw8msh5)Z1g~Y5miA*EQwLm<&4SebyI!=>6;Jvqj1W4 zLB&SGF7pu!aT*q2KBaCSQ?}I12(P>{@vqrR5(8Udv{fG0lS&kt zUt@_l@IDGpD|v70IX?7}O=Amd=)g2O8_^iKzP5u;amTGLLz;Y&xvJh=iui0QPB(n|=l$uD z6MdiV*<*9HC(!tVP9bP?uKb)MJgT~y)P}=Kp!!2tmd|lCJf44KN>^~$hm52{QaVaM z40EK2A38Z%nC)-#%z~CoVHiMWafNdZ%BkD>H22~2?aw~%gX0v!Y4!%2t7Bpa#*B_z z5d_r=o&+(bf-9daWI7aS6mUGKey#DMRI|yNa=pn|Boy!y8QspTSe1zf?=^H;&kN!$ zxA>U418f`v2mOnY)GaBHa*nu`#LQSgUh6$>6&KWWHv~59y`ly}XSWfn%h&0l^HLsE zUJ2&ACiIi!`j`=a+yM<8%8HJFQqy_-e%0A_L_A(+c%xy@?^ihLD@c=NXR~GJ-wvkZE-*fRedt*xM!N-% z{7IMaotz{^^uHC*^?KQG9#F%F{`RJs%xz-Z{Kg<(z=5CtH&zhAwXWM|F`eSP z>P*hm=&#A=q?<~GNiDAngr;2GSApsxgdjbk_v<4ec({w3?H+|F-;7NA8_J@0<}G-E z*V__L`4z@c$MgIWklN1Vd1w(cct6Otp5%9NSI{#J<$97%Q`A@GXqm{Ltq$S557X5@ zzxs)lqmWW0Jr43&L*@A<_x!>KD$Bg2NpqooU&LByOCrL8%3EiH zgmAD|7!7TywCRK8=v~JqTjP3DBuCoR!0PUXO4QdJnDCuH#d>OQUk-N_w2h%Q6T#dC zu+6#5kusmu=~snx_0(|q+W4eD8w)-4MH-r?2bKH+ta!mWzY7&?s;5O3;KDm@MZFXK zf`aJOOzAstOm{zgenW)$s9@vo7eUIPdyf9LTPJ@M`K>Uy5dA%@Z@v)|2Y7*fQQFo; zx1zmvNDyB{Ol*dHt+X8s3v|es`O~!vg;nk`A$#FI^o6^A{mGZPAtWm_N=-6CR6-9(knL*& zxP!m-&uzahQYN?A*l2Lr;%zeMmkdQE9Y2$6)B*7|akSYfrY|6zu707eEJitl?ZA($ z^}F91#-%TMa~oJZu|&E`BQlHCT-HtYn);Wb2Au@FEk>7)j^8R0Hrif8@NJO0pE*!2 z>R;QGs>}#_*U&{Z?T{}cylD6UREG{hbZego`OLE)-I+K{px9b zh=up*M(_qB<;0}c+c0elv0|C6a57(^M5tH$k~bp|XdMpwcVOR#idvT?mk|4Hd_boCw$*U3e*dFYQcBZ+6P)H+brs-W>6C3K0^>K0Vxd`aFWFOv(n zf@d$ItNafWA{y4h4vJYoj<%(x#rfh_(7{UMZD)?m(_QY%#U{-8e7e>Pc{1I0xvBLU zn0mcucXp>=<9?=_y8COMRGiDZU;fMbdu#9B&L|Qg-e}R}&+7xd_Unf?F)^`|mBzTV z!qqN7@2wu$nT`R)wr{y`$(cZP;N^DVRhIy?*bM@_>G|7$f~p*+!P6nmms7&g66$mE zy5gX1uCWt)zCpDc{&Isr=HpA(a-M{du`7BUAuoiK46p*f_y8t0re5uv*C^IojNy>B ze3g`^0g-SN7VCoc(QB)M4L(CoRJqWZC8n#~oj>MT0feIevrJFd&$%foMGYLKpD)kH zdzwk~_5DW1L~-7^pmhsa7#J&b42_;J*4CVp4?fj1SN+T#8OxFsFQpTTK~_bPnoyKjff;C(>$*teKc->YG?a=a8dctPYh18ah-t`c__MbC^I%c= zB;SlF&lY;90I_dE;#MCXA3k`OQllraYgAVssCywXVZKMhkyE6O+t_38R3srDsxJ3v z3QKw7xOsF1BbCfo429%Ey)<)X_I*EBS>La|-#{Pd8@}Hvf|?%gEL$N&;fh;y@tn`} z*VG1=9lBCdxwVn^Tl9HcPHUrT?`%)sF;|n3A8r4F_t{+Z+5-pue^$(#B4BWT6<^@a zeA(w7iM3B0_eiOFQ)=`Og@b_+&L!T-%9XbT17erA=gd|{P}PRw%2ssiz5Q0D2J^j6 zmoudvrXNzVD97|{zDZCC+Tv0OK+~?uUkPROo23ylv=_3)^x9#jty&OF8k@HKM9U>{ zj{nU7WiG&SU$T{tn^do~>-5i7(1D#SWvG7ZVC>(7S&sB9YXXKATq3fAOBeS71%(ERA&KnOG{J7 z_1R2sOX8Bf-NZt_v+g&QxZf*=gNz8$NQ~i#onHj0w4I*;c~2_-#OUbZm7K|85kVY- z1gFuvWMyk3>YR#YsoiY<5tGTHU?19S@R5~B0+tSPiyC9K;d5-z_G3^l>ltrcMVSOZ zr~ay{WCeF(-M` z6|#Kd%|7cYX4H<4EUz!L&I+Lr3To|#z$Y^7CM8+%*6D0_`Y6ChPi9eB(jcAPenCa?-WE= z^+V4|>ORA6EzgRtEL3pbEB#B-{D8Bzl!1=+B`>Bsum=Mos0U*an{d4x+8sU=@AJEB zLcgZIj&tW@pw#6UAuk9|>g%bwV|Dn7)fEfm7TmGr=McnsmV*O>4T6cFin9tbC)~)T zZWsIo@~4gWM;u#(C-yVtVOiM5^_{J*jtkq^xF+oWe47Qe3|rL<4&E`5xuRn3j9Qr_ zV2fg3N4{IO*?f#q-n9)4{-0ID1SPS(4Gqh} zHO-4rwsVpve9~#ea3wn%Ga%yMPnf6q>F?=^b?%>6a1Z)eJZNex&m$x}?(Fb&2qk7} z65x>WemmQuX@f#;OqHDHpR;^QzFBAvMr6-6W3SG9<72FcM%zuXivs=h*0tOk$bc5v+goz@1adSy#|97MM@Mwt+i+u)eaC zVM)ysq7|z?r6J&eDdEVAg7k&)(C^X$!K~e4X$TIlW7kGU1se+cA7y5*|Md@$bg$!4d$)~8p-jnz$@ zP?$Igi-E#x+E)4WvZ6f3cczwIsS3Ue38p7d=T;2F{$QF#~I7Psu-FVGuI} znY|-nJjo2>6M1lq$P?EH$UZHe$X@pY=LrVM=#_9z#Eu!`Q5vC~}Bf^j`<0``Pvl*G#;*sv6%O6&P}WAg~Vm^HF+ zw-@9-j%OG~psR+D$?C>nivpFUENSlWU-|M#_V=Xwhox`g7y0pFeOk@&eu#Z!OzpqQ zpqfjb439MC4#?1B9Fo@1uG~5|^QH`+tRFS>b|UHeAWOvc>FLRCs=Xe~#=#3>nF4ak zKDzD3Vb25P?KN^b(ncAX4bxwXh9TsqYEXVPuAsHwxHY-rIV_q3iCyvE4OD&u*fps> zA{wMkJfb_Oey!3x9gzS2{j6glSmt@xm4!cv0VK1SfE{9FY zvT%eg`|gM`fe`xla}3TMk%d_&mg6Sl@gMW8C88iF-0D9dqqV-;mN#>p$~eotK=hcJ zKgmtcG!n(K7gxO#uIc3eEFo^-fIJ#CcdXR9U`Z%rwJg!t^*_|FGsjkpb#&4v(my3@ z?#{}MRzV<0lkObBL3r?!jMdsPT(7Cf%NxMu7HSM1VoNwhU#$tH)c=NS5y9iJ!%+%K z|LAPG5Zv}iMtt0wE;h^=ebqUv@4*%mboN0aIN!wXx>+>QY5@@kG4N2^EEKXu&kAx54_3xbJgv-g80DT>XJyLP@i#8ovAa_bz+XA&5!KL4m_Pvw`9e>l?W#XFrP-vRQ3 zS^xR(=^O#KO`*^_m2IthG3ide3d)4*bhRXgs(NNSWQuW=uM6NzQ%H zS^wt+fb5-Obcbtwas4vx$o22fdiJqLXnNu$he$okLN@aQBL4cE7c0X!Epi@EIPJ5R)Yfr zSuaEM!DbJ*Q=*>Enja%>FKT5?LOzCOm>gz47bptmGuxkE{#wbCluCvd5ua9h@C^#|g6j^?- z^%vvIffz@2(b5W?W4ZYs+xat8->WcR2bx=+2fjAdI9>+U-6;HfAY;6L_p;GFA+dk8 zv(VAx1q*Uvx|ctI=v9|wDhWYzjNDs;;=7w2#~^Xun$^@TU^Za>ZmBu$C~;diHRF_0 zUl-qD!o+%he+l(8!wFnmANO$LWn?22=RYc42MozNj4G!#-s4Yvb0PGxP|dGHe#QKU z*<-2|phNRkA14CZUi~giDk|wb++c+bKEa~?_;R`yDfqP^wDCRZ^EY|)d_iGRiJ))p zGqALDJSuTzChO@vj!o9D2fs>@RFuEf+^J@^7@^;V8D3s$S(44{@06S7{CMR{fDHSG zU?tuDW;wvx9v;|Hz@BEzi}R`+<%V0?=ifz?3cS@OV6_Y%MN(PADYuS^3yU6?zW4I zTCU&)*AzQ1Umx|Y#`iZm-Pc-wt+m*=A1t*ZT1Wf!!m2x6Y@M2)d@9cL zOl~x>P}i=${oh}Q1QIwI`rmmp6hwTQc$^kEm$4pUTBgTp+CzYl)T*H8F@Q)4K<5nmzhp+8)hzRG6MO1E*PIs9cf}k zyC8Gj6LNO6B~8c@8g4D8oNjV9T7IJbFRNwwX4ua8M`rZiyj8sJ7Q;ofg&wx2j@7dK z(r_GiZ34|0JB*GcS)UhQC$r_n|4`5p`#MkB#fXw?Hb0!)%AJyvt9dxJ;+vkqq%*gk zwfnRo9s4dA_JsNa*b{%8Ne%H<4%faO-@2Z1s_)D9`>Lr4E<4yI$SM z9SgzKqkd&t-T*0SkzHU-rX7+IbVR$G)Z}brniE zx-&K=25DvVcTY-u;>Hkt@wd@1FXi#S)h)}kXV?@BqIt69Q@@sh|KLc0+I~fB#cvNS zybqa>zvHe~xGi?M!oG*WJOA>2yl~2@D>A&O!<4;%-&28}G;yoH+qg8dgsy=Mb`*8Nm zI(B{8{j7gKNlPszwyreW>AU+9mldRXw($%pad~($8g*QOn9P>UzBZ5Ng*Feq)h35-W=~?=u;o)K3JSlm48HBt zQ;(^;X2;X5WvljH!|Ki5+iqy~vG4SkEkFeYiBJOW*(aKupORnQ?)Oi4)Q0X*-#WXa zg!=!i))GL5WOAl?y?CHpb`I@eqMP7a;_?iGwssW|;hOJQ&YBs+6N!yRwnRZ*n-*2P zQ9Ers(+5@-zieDkWI*mIwd=Jr-~<#T&b}r|Gc9I6&=1WLtKBtu#eQ(eF$U}3)N-*g za>B{Hy0tD#J^30rdsI(%+c1cZJNq^VE~>fOJ@VKz(>%E{$Rw7M7w0+VCPXLGrSKZC z$7m7=nm4z}=*&+2_-Kf|`Df7(N)&_3krs3N%b)UIobpc0FJ&IF{bh)q)7vfA$Mr66 zqoR>WO&R?_WZA5jt=(=PMo{C9C4|pxeg4Zpw2f>Ad)1&zG9^!mW}ScSZ)k^oomQGi zDWw7o+&Sl(G%SzSP)ZU2dtjktZO4Mbc7eM9gexJ5NP=SVWFAdIWH(L=a@FB^G(?Oj znH+I_VbpD&djS%UH?uM-RlJB276eQLliP(Xw?}Wt<-44b?KptBPwx?uG>?tx(ycc_ zyz62sS+%m+;dV7}V>l2H| zlRe0JGoWd=+GyNxGDnP%&oMkZn>02qPPNWlwL`C5rP8-B%QF>7_CrkMN}zf^4r{TR z&C&DP8@~oM937d0JZ^tvx*mOEY(A-?8lRuXmH`0NUhcaO_1xLdai$EdcJMSq-fccW zibPeKDRJKIM%nET!g01`2$yVSoc2iFwFyR;lzT`UV8#&zpr-Z>k1OOwz3q|XX54>f zR87m6U`t^c9EIimefKp!O++{#rJ-_AwEhN+(43OWg|HwOd<_Yi%dKu_yQB_H`T4QE zOZaX}D5iEEag%ck(^aR8UsO?;jHg^k#^#XSlZ6vT#-ihT1b#(=t`ke@rNWu~Sc}4U zM3fUZCFVTIlqnJww5wo-a0y_heI{)6Z6&0o?P#P38=2M_8IUBex_d3}5*|~R8L+`q zfz4xP{DoUkC1Ac3t#CFaE4#@i8CCuHs}Xsb6eYvTvDXM2NTb=VGAR`v@tlxb3j~|S zxwDN@$6~4PP6v?zQ)V&jay!beF>6S!hgqrm>*X^z^s;hghfT0ld$Q=a@l^koF^xoe zwa8W^dPSR$cPM@oC?{er^eN{X69Utc3KetEp)Jq0Dp!V_DBZpcf-#_9AZejmr3MrD z6rwKYoy3Mf62hY~D}r+5t5Q9LD4hj7zhv5X2XPlN7ZTo!K()#loJ3mOMys}ws|-uo zRr8*AW(=i9Ki^iy6R%fH%WX2l@TlL7g*CNrwH-90Myo;c2Ps_#WWZUdGp50mm0HC% z7J!vghBlCS!1F1E>N%0Q(6O?7AtBd`zI_>sk(8lL-zit`9YPM6@xM-Ww)|}y`@Ba- zzkUc~_t6HPa|DDN0F82T%=#NCv$JNht{S`xvB+xQj)7}sr+iHRk&|mL6L{^~FD@+5 z&8PaUmfCN#Sg~%YW9ObO;Nvcq&#s)OY!xU{1`X>$0w&T@Qt-cj|8~C`q}H$-{azRyIp81BHl} z*NWf{eCJRi_O!#?2`EM=2-{%w13prnqvKfu(?+zs}WCj6Q$%~Wf@vE&c zrf;gMi(7ZKiHw@k4h|en>mE+4RT3l~@r`!-tj-tOmJORQg6}VO$|WkD5nOAMmX?gb zhVPf}s)iZ2_l4ehfTp*T6Z5^16cJBP!Sj}D6rfGjl!XLPo%|OakPD}dK<~*(l{e5L zAe5y}wK93M?nDMOU(4e#ZeQ+m*_PY`*${@f@P_K2wR(=3ua{qmz0RRog_;6{kj|e* z4Zn--G?xvDtt}A}{!os!lsCCNZrxmw9}8k^V2xmbj~Q*(hDa{yS&bIg@cyxu^J9U_ z_wi3>&pkUh%oqdR;M#F-EOQ9HkR17yIUV%2q@WCFF5A)FCdxy3^=~vq@_E8iL zC)_vmbaE`x*Mn+W!S(9oaRnKOFk23>t=5dHcZ64jmw6KNSmX+mRCT;a^nR?)2ZbR< ze$gR4%U#ZI`Gj>|Dw0<}LR9=@Mf(bw83lpj+9lO0Hr%dlJof~S&mJt+ry2@n-5stT zwB4@tIbb9Vq;7J=Whjby{12pul}2}(WOMLUnvQz8EzthV`>HS(CUP+vmH z#gbY>2l*0P2O`*6?$@O-sbT0qfZTV5ZS+1}^0`4SRoxpJCpRlhE1C&%zGw#k(67!k zd#{=HHfol6*jWRUC+gagwTH6Ag zhX)#|g^zgnafNC8o~Jw~dIkB>*;h}J1mEk7qOrz!t;OXp+r8{@wQq({XKug>72!XV z^6{my==CvMt?>iYJE`rKG~Bz@8#XvF0Jae$LA;r2R<;RD8rjZtJKKzvVFcIi3&&3v z^hgEBSv-nZGfCXS+iEbDQZ|FP*7C8_4;DO`+lCzA9P?W%l_vX7WMpJ3HvNsHYkz(e zC$pPiBqSuP00n5=geZbf2f5zQ@4&1c!<~T`?hJVhOw7aiauTcdSI4ma+edeL^}29j zKNy=)j(M=JYHCA(JrEocLdM0F6dj$qplohVufTr^<@I)*^>|zEh1^kRv7F53#3=N# ziMj}wp_`i=noyZuZ!T1_$)A;#ajaS{VdAJ2EICQ44M(>g_p_+k+2jB6aFUiM&xT_V z3eg{v(q|56z8}coyy6%_{1LyewYYH)pbkC^2X9t(c%B3>%5$c8yVv`pxsB_-Rql!i zzR+B>*E0S7Q8>_Rw6T=Z`vk$1=+8a{Q!}yJycFTF?TS!L|FPiXtblP=k}0Yf-$}zC zh92b5ViQg6hEFTUJ*2UnTs`wZ&#tV%7Jeoc^sGFB5M6FrhH(OHJ*=#bsT2hR90CO` zZ{!X<=2+_kcrrzBNF>+_x6JT(A_HXobOugKhwLfxpN`erp$<#CW(g6LAoaqK{VZ&g0h9Yz3m*Ur5?eW6= zhuZ13gB%V$s-HVIacJE->^@Gl z4B2w+>RjfnK68(cjgi(48pgx;i%)TJhFoIC62$8%IT5_(%39x`l`VnT_tn?O_SHbA zy*l;VU83tfJ+2}d$stnGlxq-Yt<5xyR)Q&1=Slmw>3Hy z@kAvPV3VfL)Ez~Nxp46CI{RZ84#O=Vrr{(eYF5@Npj(+^b>Y1I-IJD<_6Q>PZyPlM zXSf8hOP+8xCMg30vNzCXz@g%KFhK_93MitZm!IFiSGeAqTDCpz!7xzWpDcC*y@-pg zZd@`+j22b-6yDQ^J5w@DyQ@y?20$PIS*?n*uukky(`ofb{mJZ4t^B5@RIS&epAJ6q z-YINGx}0e)+YwwvM1!_~vjrxb###DzQ<|?vW9f_9X?uV^A$HyX4-E@@Ts`-k*Pp+a zVQP)fdOa~JeX_vyQ*(qMV-ea!J%7u4v}h}60j3718$j#7uSWJ0iaZ_k;53iB3JvRd zw81Lb^ZsUl7FXA9KwNyQByj!6^6;nH1mCEA4K_#z*(@C&GtrvPFx`1bLHoPkry@g$ zgezLOYBiHyce*P9!_Z)A59#c^kNY!TaL0se$ZVD^*a>)Faa)+>6O};FRAzWOqY(^R zyIpOiU{+tT21Dw=G~1EWm*--BLMZH3O5y+&~R2B=~VByQoxLPMk!>7Oi=KeSJAMap5A8mY*0^nx{x^KLfChKh45 z=AH#`ua`*I1v5R=EBQKOqeV6D?RIBo*kW$nFkYRCuD*i6p@J0MK3SQ)n#FGRwdr7@ zbW49bs=!fqsny_YZ9t9R>Xp?C3y-gOb!Ct@Hs5&;Lv@E=m-K%%yi%cZg!ASe~cUaAvgVTl6P$8BNqVx0v6 zaTK+7@o<`DQ%FCFS2}}wU35&0u(frCR{hnJ2_$=FTS1+Hkpcx4V}Vw?r~73uE|`YP zZi1)De5PM)tReLmFm1p#Dt#*b_3Ky5(@`;@eUsYCo$l8k=3|4eCrfoC2O{ROc@UZ8 zl;dX|MB0R`*#&OLfg0(vF(30$sAv~Dw(Y+*j0ejewnw&aINwyr!bmvJ`O z%o@;0@mpQ|LPr>*7MLay$=V^UFS^jEY-ZyZa%{NqU`8b~hr4m*Y;!)WH+x)6Q&eOU zv$6sUQO=finuVY2$(`)u(}zVzr|n@$ChkoBd(eZ0fLP-h!G>F4w9fJv>w~yNhg%Kd zd5Sn)@k)FWq#>eCCzWic;YjCau|}^|K|6^0X2J%4W*Lc&?3NKZ%85nOe7jugZ%WAt z71|d~tIkJGrHw^^e4Jxh!LnTTBcOW?>Zy599N|`z92=rP!KN}7!59aU+fwW8Zyck5 zW>3!`9+3E?=H#D?8n)yvNtp=!m0k^vnGX}Nd zT0m$qYBtsIOe}s6o9Yfi#l*tu0cwxV>J==)W|weE0ARma7?33yk28tu zPY&NIs_~!S?;x0V2r2ZPVVqE=oPk8x{_yi6)tq_RW9UA^5T)5DZDI>+5>NGh)zH}; zVx9BdUTpPp7^Pr9yfNbUger$#4IKKRoa`>4*FDs;k3#Ke6bpSD{zY~om$2#~jZI8c z8s_WYyqi6#h7cA~_x-bJbUhbNPavBVF)zS1)RimUHHq(5p?9oQdg;bTbUK&1n(&3i zQn%JJM8-f&1qUBt<_XxQXBcAm&RYE*tZ^TU-+PqUNOZ?qD0g^C4b1^;^;G=fk|jS1 zArhn7lm@a((T7^F+pUGF#1~df7K(2n=Z3zyZYs6b;MYGbtyJy^DsPA---$O?n-g06 zZr4|k2$wl^?!6nGkQ9G?c4mk4G7e+~;)@x`N3~8`#EoJwHx=lZs@KiPR+KPQj4aRf zk1Wt>hvpD}hA!vQWG`|6(^kFzp~|}j0#&p(Zb5%%D$GuV$}Y<}*d&TkJ7G19!2tn3 zS1&c|BC_CE>+=u!E5U(oY%H;FrsOf58;|rsV%NAAdpP@#tMy}e&aE&|4rPzb6O#R7 zhP>c~v@PuTY9_Nh7>@|w-9C+IDzw4oU|@t;{#tgiyFG0w-wnjI!ba2W+nK+b5~a2q z?XMX^ptuNXF+5MQ@ET)_Tu{wA?l_1MJ-X!_5eA`HjOcg1$W{@BI%bvw@rq36EqJiA z*88=l-P+EvC(SZ5VEE2trCvXdUq%oriD9QNw0}_0^QgV&#fC7TodK_JAMm}jAFp;- z+|FA+Qc=OkI=#QWny)m(0AWX>xWnVutC{a%--5a)^}VYcWAg!0gq#_*e+L}26u@Gg zvDm+UeCQ^51DipjzR=9KJdxwBJB2=L6^eV+l(-1mXA7Qp{tA#_zqQ_abXfI=19XwA z2~kt3u@$kiJqTNn3;y|ziL5mDa=_rR{xPgegPdaC>=@4Kcf<+h3+}o&FSHTJKy(tt zci|qrGjj;AWbEHTGC1b!PMce_2 zL6>Iq@KC)S&C9WRNl0>U9#pHlv^Gcr#i}(UXcs)AUn<@eYXu7ivvoPVs&!m14T86>hU=;eKyS=%U4}^Y0v$Z_@idkE-u% zW3D20x4YV)Cpk?5om%Pz?u@`0(>xz$FhTUOe!;6X!Jrx#p@k30o6PJ07p)WrkGa8- z`UElgWcSh52EXAdF|Khf}Tv&Tvr>$ELGt_TfIy>*B7^&2m z;{y{Lb+`K>kf#vYF`woszlKds7b@5t&3qc!>$oog!r+h`rC-3G?$5_dRIC4bDiq@3 z8+&P1Fxo(K>dUE>S1wRey<#^mpmoXg2xM3BI|jRC3)U`~30Bc%-A@e)K*h@ck$?|V z@}4ocU6_N(f+0PYMkEZdl<4bxagF8sarIVgW$+ z1P~yGh&(+z-LoxAKKIg(n5|Kn$q@6n9hV}8k~iLw5rSbVQg{wQa{7TP>zaIy8ASSG z6i*G$(i~5g@1i+LrB)GoqZEPSnBHQVl8Weu>1C<5DnN~VYq-M(E$;MSc)_B|OAGQH z`ue!-(}Lb)7-c|gn0FII9M#EOviLtc9YguicvwRY`UFD$2w+)^{L1cG z^}{=9%G65kXWxV!Qw{FzU)v5lXrXNy{+A6>WRx)#ja6SFZ$Q5ic(cr~I&#F-@3=#fjNRG2~wVb<)Kwz3j>s;sDP) zM=ASN-6K+zk+I$bySyzUikdy}hHpC91s7*+XE`v}dac5|R}qSca@m z7a_w6)AB>Q5#bmRkz)*&*15@^8~w<{@C=yb{apcwQj(Gz0k|OJin6-qJ3y?x-0?Xa zVeRhj%0v>3U@CJak zAppeP^!-jrJZ>hveRj_g9Q3SCuFZ+twn1xxV3(){S6!=>ABvSkcOc?PX!pKitLVXE zOdGpXAOC<1frpK3u>J{w9k63L(61Ls)|gN&#?{YV6?U5COsY#NB`4n=~G3UTP*Fp9Ax@}us)u^0ql1*NLf}1K)t2za zPD@tzy6g>D^)SZn4MzMOI=G1+Mj>$c@3~Rsh$Y-nO|STdysB92#zm(Jg@%Wj_=mTr z*(AfN!Em;els-VI^95=8!b4JYfLVwqBu&F;qikbf;=caJV`z0bp@o(?(T2G*Hjv^_ ztrvz^qS0+`ooPQd{a@_8Wn7fs+CM5HAWBF{hje$RQqqV>cY}0;lyrADNJ*D; zcY}a*w=@i$a~6C5f6w0M&G~$u^Y)y1qcQ^b%&dE@>-yFueK8j4ZNBr(ju1Xtq?eh# z$iMklo`qBWi|RZ!-4_IjCvh`FW~(vG((TITJC0x-Mb%NK4pTJL7nPFB)i>Ql!Z9o# zGoGkhRlRxp>Wk~c{wEF>I^-<2w?h<<7tQ{6MzWAn_tE0ZPZcbI!sY(m(E}s!UU?VV zUJs7OV;QoML;@~5>CSuVq|9Q z!jD)s%&85iMt3{qrdaZ_8gL3`!znCf{&V@VNnK4Hz3vAeWoCQB$aV+w_p@mJNbQ8s zLtgVI8BF3PszUQ-L{%eSmSBEX-utmw!SOOh*>kN^3od~u+Aq2NPHn#&xxDrKlf95| zYd8{iQ!UnNhZH0#iL$&!Jl>n8-Zq&{4-vI1%y$ri!EVYlmvXz~{01mvU!6Q(9`VxO zHipXb?{BkO5~`MaeMdkz_}b;MFQ`vi`aQNPhBd7ImaWXqmbfgraOv&ZN=t3cj>M)u zYgz-ntHA5P8R7eG8W}wFEQ{%{VjfMaJI%Uu(aLOb33mO56qOn+sR;@JLrN(5)TE`z zzJ|Xi9;)ENGtVO4R(aInf4`sEiGnsl8C|`;1@RAeLmf~Q18VJRZ;mTb1+Tvk#8Mg5 zs1FPb6sFqzY@OHLp~`e$ud@OG$1_c*IU~<~BDVt;uqSoa%xDjRz_&Wi$Z|D4n6YHG zo!)6R&=3H35e#gEPTu_6yz%jIBqIJnPq@ZJr^ZVn|l7qN#4ta^~@ zQ4>E~b?cTnudwsWNku+iQ*KO_P5OpJ?ka<`+)GS07Pm2*@G-kjYVkz$o4?vF{`6<_ zRQ+1$q&FHt*Q$ArdRqPSu5bZ2OIhcYV6$!nQ;rSW&FYJ+WWBC+JogK(2%=6$Oqp73 z7|OutZ$lR;)`eLm42LG3qK!b zo2C+Czg-XPM%zuSMKnKW?o~L#4&-VMtgvQmw+wz~7KB{P(WunL*D923+`n`ArnCw0SqFyM*sS)*(eX7ik@>B-v`I)yvQdB zzZ@!a%8)@89XiadZ*sG~H5;iOlzKuETV`!MyrVo__u6xm&3~M$5j)>U_|;XszUaap z#4rv^ag{yG1c_w{|8Qabp-Q|s^Nd%V)#eLcw5R-FH-r(JF+wn|el`=ptl#Amp_GdOOzTt<-HaH!se}^}4e@?IdD+n+$b{A7B zXB`N{>^$}^Mn)8`u^A2j{8TtUUTN!S1jroBA4iq)bVQHtAT(L0RA|wkT56Bm-GZAA zRL|AA$6kDE!>#}jiPu#7p<;5&uF`1et-gT)EPMa4b= zHLSOC4*8hA&3jYNBHGvq9go)go6$)JC!MgGL(@K#r2GgC(!|_eKJ6CUjA?RaP)l`V zvGpt6miCGdy>BZos2CKLnP9#R@S$Okh^w@H4F1A1;IyfrO4zfiupv=G_+r^+V|G8^ zz72;IUqJes;836T^Mq^Bb$p?7_|>^ok!cjv{J*wgsO*=YYXUn*=mzGe5mkO;9@PsL z<-d-bkz8(+Jqf`c#!dd-|?ye|)j&K98+1ws@Ra--o-NN_zhDJbhuA1DM*q)qP_n8n>1eS zh^AlPTAo=)k+puLC=2s(Yd(6sV?tvynZpY5?Pk)cX3vq_9$ahXvN>P zFnk?+!^+xH3wTt+EQY|679K;d*-%6gOVuK+&2ly2%MZ*5%_cSL1i_jz8NyZCW8~$3 zWQMPGqF6}V1+ta#_cT(=w>-yODM{hU(H}m%gKID!s9E}8SAsS8s$aXdXge9A4&|Ll zxiebGIQtjjB@FxS% zB%f+l6=zJmuCv1PlY(D zKgX4&^(xU7+*mt|5C|@wBh8qVk)lP{B8KY?UmJ!z#)NuoX&#ue8ZaqgQHT^%({Rhq6NdZQJ(rFcL3i zLvW(y>(iWG4Cov>WW_VUCtN`^DH~!+QO|Jd&pa(Xy~OTL`8!cKO-f< zYH3}P7(Yu)vv@SRU+nYYJ1<8{9u!z+YE8Dp!0Bsiwn{mGLFs9pf;Rz|$LOct1nyp3E)$w#ZZz!J4XgO=OTLO+x zfh+sBL8O4Vwl)z+Eo9uEDK9V?%eV(bTt1*9nlGj`9JR_{zkZzzh`gz(DNG=saXhvf z-m(L#%nxmz4=#X1bloc~z+u*XW^2pw;igohZlcA_Ie|e#c_zjVjNx!NoUgH>A1jwO zSC0z_3_QEtFSoxtth1k#<+7bpQaPP7$<)Jv7rGgc@9*z_S-)~HSA7E+Vp98Mil)4; zVoyxKX@LW{y>68lCz4lA!m@1V?`Fmq)6moJuCj|q6N36(E4x=Z(stlu+iV3ioZqf- zYX;Of6POt&N=p9XSW?eY{$x_3z|!py+&z))lpS306X%DU&(w6OvD2tJR$l$6=o>ae zk}h?YhhKNq&ow((x+9Zf%T52$0^rIHqNaMHb6v=gZgeo)aarl#9gz~`QI?FKhNVV{ z+qoMKwTZL}`E>p@s-NSHvX}TfFF!z>OA_ zsS__(5$9oa{`f;b7RO0zB_YtvbQn|Iunp(0-d4;!f6}n1_(u#Ss~^^p;xyl+Y3#n} zhf5n@k1#kOlwE$lH)9d)iDvyOlS5)jmcTtxFw&mV(wZ-)q#6{mKeoDNM&4Jmp2xZh zz!vG|kD4x}EZtYql&uXuSVSLhhj%0eRNF7_4-e+e`qW_ZgSbEUu6$5;Z-GoC+|Fn= zSmzpG(|P0=Fgu?SqyCm>Gn!mv3ct(O7|^Z@_8~<$T9wc*lOlI@cy$V69BX~SgKp+K zy4DgDj6yasF<_le^9!3ZRfqWS8?WA>UKws^OX*e9`}y-U6H-V3lCi2?|8koAnjlvJ znMF4$j@;f~Q7G!PWxbB}wS7jg^taidV9xt*1d_U(sigx}Q#E^IL`2}A2XZgxC#@2v z)uM!%8BN{tH@eAu*$uFc$`_6{2Vxg(x{w)&+7t`yx1cisH}|+b*8}VB$-5pD^fF&A zeQ-FLZNK+<>Cc}e0MP-I89Ltl?NK}QLI(;qV=>7GF?DCBYSeS0|MnKz}w}d z10i(@nxJ9zj*c$0?S6~N{dz4BiGVu_q{p%VZtZ&?6AjH;^_bvBS<}&a>)nz0h4t6L zNYC>eO4xjPp$B$^bRGHqQQ~WN!Ihf>m*2bVgFpL2(~1AbyI^M0TBiDAKIm&xBFcUb@*w~Ha5n;?;Lv_iLks@s z6$j3LgyEZRk{Fl(OZfNWe{s40?_T8p2fvGu$Zo0;T?-y=W32;&?`9H1n0E;u+*|RE0i|6#u*^zp5}a&?l+ldaX-o4+x>B3 z?ho!lkXf_ViKUFlKX+0qvghjAjr&gZ#)pDXx*Sp_C(q5MBSV_wQMrv#I48O1?^MpS z_bVUbl%J{G`z1na!wP5oo>5gotzL*^G8+A@iMLo(m+Hqt^K;i8WFaw}CD4|82nx(x z>}N42CY8&O%~=smC>=7?Z?jg!`)7qrq?OYp{7gu7ICh?kG|1q(<{&Sxg%rbc+XXTOLmT7N0WS;r? zCJ%Nd5k(ogqB?uU$W*>4?OU6?62tz_i`-1jd)kMv++6Q!ETlxrHr5IX?3S=XSPw{jax?=U*EkpPhwJ<`Ewvp?sD302}|K+(SIo+_2Vg5ecz46f~ALV$k(kJrTbVc%3CaJ3`@)8o+SC+FsC)=dW5`{)HGfvgK_Tv4! zwMxxIRv&K_hYM*wkidf!9-3?4XRU%q3Sn^E<~BmORJ6NK9qBFHpg4g33VNXE<8j6!`0?0^Oe&y(hl1zzt)}7=BME3`>M3eZn_D;p^G_KL6E?-Is{3WAwzjSud?T`Oi7DnAiM=3^`)WOgAL%TCY^Px?f z#f2D_jVl8n@Liz;pyL4GG@rVJ-i&!&UvETe)Er{#xIcaUXGuwO6uoD`7ZnwQx&h`S!A zH)PUVq9Uctu>;7Obj1rjib<*7HOQOLZd4vne;l59tckFXI zS@&quHLJer&sc1TuA?NmYp{>t^r;;0ura&efzEp5BbEp|v4l~DmH#Hr($;GQG=f=^NPXT>V zc0-3@_?T+d9yJJescl>FmKnyV*c{uo#^D!2;_0;5m#9_!^SXUXcce390dXvKu?aZA z4dXjDV<5D?xh|pEqo$7ogM&Q${B)q(kBdJ%R*9Wrg)LU`^t|nq zvXedeDOVVohXeogs<`CwkkJUA+9n}C_6LrI7|Avd0!fu3Im!K+Li(D@v}tCW<59h- zJ&S{ncQJzYcr*$lN}JD$S}BK678%d1?P&Z2k303mP}-p^d<1p zN9%Kc!e23O=yIOz7_Gg;L=BfRm=3f3)5hV7x*dHnif4C~-ZZ9uM-dHK*^oS%x#lIq z7Ex2fjTE@VNls4wqB5iT9s?cS<)E^!ek(!azILU_nMvqwZmfT73?U1o)q`3$5KMnd z^jtUs3lx{lxI(WB{+SFV2Eb~EQ@J9G%Uc^eY7tNfvH-ydb2bZ<(qQ!x+N6)|ka2Rp z4ej}ySa1t8(Zj%fa08|(+B9f_eTUtZkpKlju|RGMFi<8vAy|t}3zo80uPeNuo?x_9 zA0Pu<*Zi^f62>wFQpG~w+5$Ae$HzxPPR@9)$_OU1MZf1&hUcvr2*7{+9#9n5TTEU9 zfcboy2Upz}n3A;z1f5A-*4kQGjiCR0;_Ue^{wUr9S*mHp1F$E2X=49A*>`Qa%#b^> z++SH%8J5HxCwQ>a<%nO|cEM~Q7>EkJrCS$i+YJi|a=#*kI#7fvV|`fB7)biW^czvP zHSGh&QNbiO)q`^h?K|s{JIwT2{X_H<#HQXt^kwMA27jiOaAK?EBKxWx^<7dkbg4fx zvxM|~%5`!6N;9c)Z&q)yAZwv!181pCkA!`*>VxL;pX`X^Z774r$6(iIS1UZ7Lce$2 zGaFC^GR%C+N*xQ35m{^!GXSdauiEroiP)Aan5fU_-;O!z#=lfPb9=04S z%tQo_bn!RDd%1}JcDL79BJ+sBProzE?Y>1I`G52@aJOYWx*l8C59p*1KrF3>1vVrO z%>5zyubdVJzAd%i7t@eumXZOe(*q32J$CQ~5il?CjmtwjpCep4CVzn+)15cf&>}p= z1d%jto2wE=L5EvUr{So1iQlHMR`qOLoN;htPK)WU6VCg6YZiK&x zqtA^bv1A0;Acm6+q3@}%Q0$h~g(JG!c}e_j5nGt!5A`Mk-0c#L^fP^z=ZxM>^ErM%laPtvvK4hEc@xS65q=_p&cj1%q7#8nwPq zP*4El0YF1CZGc9Z*=|$f$;+>q@b+w+Mx~@k0ny4_qaz&;4^J4ePzEp(Oy&(Wa7Du+ zpgdzX9>#%3C0ZoO1V)C3z!mU)pq;%3ztW=Oezn7QIE^PZo=&CKg-9-qy90EpL!nHq zv-#5n0NHsGkpjH2$!abqQgCpw!5)|2g_G{Ms0LVJXroH*^bhu;YaWjZgr2Fj(USY?37McgNXTDe?%1WQG;1L!G&77n+;AFI0s znUwteX}gmJGnQ?-9nU>uJl%@DR^U$43t$7KMxAG@(+3H?PPDK?%FPGaT2UjXH7dPo zto5zk9m1{z&J?z|5`@sAoi_ZsuBKrJTK)Wp?=nVt7W1_P_}*SmU|_fppxz5*<`g9# z)-s&rQITJl-OUNF^kp>Wwcd+icmSlcl^D8BEO@m$zopdLbu9Q-T2A6}WcaJJXOVk^ zv&n4?2D-bDXTIk*lXR!M$J12enTOG$Z+BR&PLaorelK+P4jdovz#PK-#k&3Qxz>SL z;!B3Xy9wu_fs1Mn>&NQHTg6t9xwb3WbpE-k6Q~84$78m}4V~L^JF6UM9jJbLSKE|2 z|5H1q-0C7et)y-{^su!jQub0d1ER5#GA8#Yufb*7aAmWw?V|MlSF?78{B;OaJ9X4) zCD8-`2Zc-?soEJwsr;@*){h81tr!;&jmJ}|m2xWiOIgo**_DiM13`bdx9DzG#9DV_ zsqT#1YKnWRJoiEmIkxP2rCUoG+wiu|(rI}#-hwNC{HvMHd(q>x@uGBbAdkNs@!VLnzJ=uUv<8BI3EWS7(>Xo;7RcW` zbu|$NUS4F&?!rZ8_3OxEaP^1aWdvg1)Do2?t|(Hpmex2j)OABsNGDYBhd!qc79PhP z4>k9+&Vj3>l+W0^yO0j`2+c@4NWE`&U%3^rq)(@kcHh@Y(RLO>%DvAtIlB~E@X(o- ze(Wxw^JvTXe4OzmwGQerz~upPSozW1zMC$cHCwkm2d+cW^U$s7vE_Ogab?-r)9rHE zN#f#=G*!@}5%uH&I2=xuR8(ddbefz>QBRh@$%S0o2GCCT@7Aa5J&r>W1U&2ebJYb* z?CgJ3KrvY7zS{0BSUqV;bFs(*I2{pT1hK7dM;GT8| zar1>5wRT%lIzSul@%Fgc_cwC{qxm&HVxXKOR@B^XA@Z8jZ@q$9sFq4RCw@LBK$>3sZJXx8gzIjF@JU1?~ z#t#Fv;qIE&)FjTqsT&R*-_+Qa*751?lH%z&?V8QeotFf|tis(~9R3Q+KwPYDkCBe> z=Bu^M$K}o&oNzpN+fSJmUD!Q$u6(ZAm8Xww462Z+E>fPA;`#DoU}cJ~Ux4FgO2?X% zv{4A{!zb5YX&4M~*jLMkvqL0y?Yirosk8kyiuE(WAIN>JgQ%*shwv36T>_}YKGei^ zjtw$Di?m0L$kB};T&}p&j>28|z1FSKzA@_y=2UT0JGf+PkPe7gAghn4>BmY$aCD#{ z_?JzGqtdRBW$4>+&7TJJMzfM0Sro5ResQB#%@li7z+lPgjm248DH^}7dg&I8{SC?R zL}_O)uJ9v&p-4{FU`K|$bSi;Y#ZzNqJ9}8qJkG^ZseR?d#v)k=eszm8cF)$c|Ckpl zr}jCF_pEZf;t-$6<#UVCVS!-$7*TyG+Eao32f0|R1*H~dy-!DF_m|;0)GLCPw9=OJ z1>gG)dofcxbJfOJOa{4cHO5R_;6O!8KaK5DNuC?c*^Xy}{~v>h(LV-}YLnm((v=XsBhviBRC?$7+|XWpeRIcZQZh1N6#A*L4B-Exe;XoHfpzs?ljZ(xiSPqsLb$k%0SWQ! zNpN0q530Xs=irE$x+(wpQ}<$TYSH~{1X*3?sdNRF-71h(4P17EMu}R}hIn!Xu}^|0 z;I~TzzF#iWf%!Bb9soC*%2E~348#MOsOLqo4jQ_h6?N}_j1ed4|8YiypPYctpE5QW zJk2?qPqVhR28KWd0X0lI5c=g>xr&Ou02%A`nq`e=>PkvIvRq4lidLLhe&X|e=g0_4 zOt@hZa>Z2HWywO-65%i|00UkP#$xG=~| zX|sR+e8G!MLh}d-Fwj>=h&tENn!Rol!%QGB5shTD+2%fCS_B1Z##-s2wPVPSn)1qg zoF7z6J%3xhJte5VJ^tO{KO=*3^hFYFU*o>4b$D05rDwC*Gzyfb6^MXXrE$dZ4lH5U zBa9O7eFo)-$Fia=?l9yu_*&NdrJ=%_t{op!%JL_PIRlD zFbw${#IUd2N(vF!%z=S=R_^z@bb5O2XsBCC@Bs>Pu2{brhP~+PP!e@(0$`}9uhOix z1N*Ncz3mrxAyOo?9RA%IBhCYtZ%e7@#=>_8*?W0=5g3d?b!+Jb7^_o&3{(RV0FY6=L_xXSSJoD?%=$3@Y4VW3 z)NXO~{Zk19B82w;qdlRl`bsuoX3$jBJZlxaEMd@aqBPDDpX?<90yRH(BUN?>^A z3A4Pqsw*Cj*8@79kHF>%Y^@uhN_)0CNLPL?-E>J17Z=B#8o1UKh(J()c|T@ft*;3^u936l(es3NkNJ@%6o6HJHU|4@Z|BJp9sOWke zRb$3QQTlb&F{DeQYE~V7oUD%}tD!Q6c8P!y_EE^qnHXCs8R~Uk^Y^5A=8kBcPiXJc#~ZDm^Dcs?`*6QXV|CPm*I{kvz_r|?%!U+qdB;SljPPheehOp z?|BS?926`O++1?I+^ss%ug04Yt!J-12G0nMmb6)aolrnAu24sqj&F6)5 z(og4`82k!fPe_VLLgpGRuO1jF&$(@&O8g@~VVjWr(iC-0W5RZ`{^ZNhFIH+=2ixbS zmNgTNy^aQu>h&F1qQuwEgp(p%u6Hq{nqu$wd3|Uo*&ZJ~-=mG5#hW9h;%3-CP5L6OfHITV^=(4KfRt>9_Df@pXX47dxDlY0 zMVC?Sx_fhE9d##(el}yg=HO=Ik^TYFG+>CO@iG!g2%B2q0k$lMs(mwjXa7sgN_tmX z-F9phOH+i$OkdV&u}3|#UJGT#w26-<9xg6!TtWgmF0NWb3F(QliVY(<`HSx&fhg(D zD_Afc;2coV@LX+&88&8fYDx$95UbT9cT7wSHA5*B|Ir8T^~+bUiiY2o{+@kL^}W2@ zIuaU1&zrBa)-g{F37ZO$84A%C!}_pX1c5-Fy+mrR5EiDA&oJDdE(L+` zQ#V()G=CS2k^4Sp?#x6|Mc0x=#=dW_*tuRM#G*n+yJ3modmPvw*y0-5SV%azt*xeG zJnp@w%G~Ji{&1l}DU>m?C(lO1;cTF4frE^~$;2~ii@IuFNM=5PK&)e(<+sr%MGe<1 zz^M6AdQFw`_@1J=%BZ&yb6JgG%%07J`IpF42BFDJpQQV@?$mxWXBO&ydXdpyVV8;ZmtdrMQNHnuP%-HVj*l7cj^a zG4l z>Q;pP+7Mf7V)>l~d1_x|q~S^HZi(qfwbA(hsiufO<0>bG9^g&ZjLTfNRmQ7j2kg86 zE)=$z$3g=L%jEovILx&|s)2gR1$-iYiudVrGnCCE3^2YJJ>oU=zrGIg8y1xJ9KkOz zDELiZH0AvM0J)>Fc$g1fcs4J_a=Bb7TF4#EKINYw9nK@q6tz=)LBH@R<8W(0l8R#I zAcXT>cZ7TiXUXCwanG!q0FE_NXR8CsWgeCr3kD}?wC%X}PFZzKbq;PwwOS2#6U0Sr zflW33t?c@P!G7AS9NLYb^cwuL4)hOgR^RGR6af^x0a{o1)nzwr?WyB`v;fDYn%Zaw zCA^!BKAD-sz-7m8U2*oL!U3QcP-jmUIXo?Qej-#Lttrv0H614fCjYJwEV_E&jw`Mi z8yX^GU}Vr^qCBGA?cV;_nhV znpxunSuVqURNCZsh8R9^q&$r`^-g(}kiHP?ncS zuSGJ|Qn^eE_+^BT#2v|#?QRhKwL#bj44fv;D&4in%)*qiS7kj$=5=sM;cH~q%lLGY zxiP9Pf2)m}&|lg8N5HeissM%wL-%{M{4EvD3wSw_?fGw)w(P00DA?B@#|zGGXwy?G zE<&lSb_74O#wKZ{$%-Nf$rK0>EJn*Ize^&6N7u5Ym!`HQ`_b>ZFj~}5ZF8|3QiFw4 zX2Hs8&(dtsja2e8&647)D1YX$hUj!)IV)Wos-C4R4pq9LrQIq$8``uNNj7q~<sRC&eNm}$e_XOdnX_LD3{f0!sH3Ns59MZk^2dq$}~ zolI@eY53!wx420~c7B=LQa27p+p(_v8b(uHdF31#el$#symX|`!IBB>&7bw8=& zvboDqYNwy}^x!4WrK};APoWkLS~iT01F@PK2o4+4b*?iHD#Ka(wBEY~?`|Qw#GB^b zB3zlU34Cv)F?A-)UfJ>O1Gck!ZW4b-ji>HHZR#0B!y<*Ob-yr1m`&zKXJm}RqVl(c zJRHr2`#@sPXFoZyjo_Zfm^K}-E?POa+&!!X5=#K10>F(7K;t|`TawA^81is;HCbz^ z@puv_WCL_P8%|BGM-5>6^+io$)_ZPdW;S)HqC#<=DW!#*2in#jzkXp>Hyj;@0-+H6 zV6`v|nBFlMhRc@3V^8S|kGu{flyD?$Y_VHIi5S3x+8vCMx>!dcopACL--CwM0V00J zHxHTK`X7IN8RQWWAF;FgM-uXR2l8hF8tGq+;mU5uMjU2&zlcmq`nilA&Mb;)8*#GkvOxOukq`fN$4$?xXyBTKMEE?r15?1*qWdn>am5!$eutB;^Cpj za$?Wxe(r7Q^UZNlPEM9d0-1ZZ<2R^rZ?~f^JLrbfGJEN&JO2va5eZ`*TV(Jzl<1iB zZD9z553I#La%vqz;(UoM2p-8;>Z@9Tre)oo@ zum--lYDXF#dK**4;`^sXY1(xz2v@VFjj|*f!)3N_@JC+Z?97hJLOg}OZq55L8tQ=V zz>yO57~#EyFLuS8R@jb@-xKcL!BHY94T7X(uKSbD;5)S3>Jkt(y80mf9xU-aHKado z-_#nuwV&pbc%Tmq;uAi-q#4UUkNbnBOMkY?@}2Vnfi1PFWx%j=6Ap(+Zi-e@U)N<2 zgQi*55qe<96XR;1P3WA;jP6PJQB|OWGS=&bLBU7nw-w!?6+if_f-0DyBC0HI)BFE(&_+0{5=b%BuoVlDp0 zU%%ph{P^*0XMrD>ascc}rBQ3K1~kPb@l$}^ke)8n8UncI2Ot2B{Rw&p7OzM5s_JUB zHcx?tTFYXBTTcLx3SO_mf)06iJf|o>{+5h9d#|G-OyC`ZVoo&vXXS(Cei_hOkj&9nCP zovUHhY!Dtxz_krq?~GIJ_6*BTX2G+*1Ok3ae1#TXPQlgjUh&~;++oiPCVoBxS`7FE z)WSM(R1@7&`Y#yMR;Mx#T^7b|Czz-!*z%%$?NnhJ0e=`t^-w}c^={9aFCl0A90Jkx zHm`NxttY4&IZ9EGdC}3;akFUQcb#bOAd<$pQqN{aBCx#^=q$!=hg7{mN0TF2yT-7|fmZ*$f>|C@zuK7EBhRd7UcA)L!l1v3S_CaLP2_Dzgvy&CQt5q>!NG+{4$3DG$U4v$ zBMk1Km3~q{6nOV88?pRk{bxTln49TP>Iee^gZcLDXJzI6U3@T)u+&0v>i^h&z*tP( z$NQV45#w*?f5$gV4uVY}}mM+NPvjt%X=I-mye+yYBOa zc5M@wzAyY=M00ng26G|=E#1vg8LZbrdyvwovo{@RzPx^y5L0)g$Qmi~H(dnLTqd2{ z-i`OgP3I$jRV0Cl>8E%S;{h}XM01r;T;li5B#j5d+h#gxl=zqyPmU)3_6+BGlU2SN zuG!01Id`qkXO_D7tOX^wRz6D!EW6Cy%l_^ZzAJWz;VSDqi)W94Dyb!_xjY*S;Nh>$ zzWjEzmn@@(JpHJc%)&?zn&_>d-}yxa>#WvzuUP+xrvl5Z%)8KDh?J51=bk~w{Ggbf zF{Ahb^}02p^(sO-qP>QfMD@n?v}pf{%a8I>Xi`3>vO0Ew?wFmyQQCX=xGMe4H1|Gb zJL&JR#Ys0OMf1pZF|JNhth4J$EFL7{k~F{3tLK`gT@Ak)<>z7J1auU#Vmux`uZWS= z^NuAY%T<5>*v%!UNug_o9i)>^m5bPxZ*goy@%w9l&vTSUe$gA1o8MfoYkxa-exLn|%5h52Gf*R>{e*o|K>6hxXt?M%-!MufE~g`O{H=(!16U7_Nr;vJZq z+uw+Us}C+3GNkt1Gcd}`^u};2D2m-_?o3ov#0lx`MI?J0k9CRVb>4F|sv$LQ<1naYiZPkTF$uKS;AYh_E zK2v?4ZaL?P1q0D`U-toXjr@o3jn()MIFNl}0NEUH;V0T2O<%~p+64qycTbOkrsgn! zR{AXKmZswp9VLM+1GB_4L`PmjY)7rw<*;_o>pJ$vbrY*=O9kGhVVMCfn)0lrNT9WS z&BL``itMx$eT<%A_}#HamX@23yZNRR(ZayqAFa3)vN4^cpFVZ*M=f0^%W0C^5&Tka z&c#TX3B1lRr$#gmd$QIGb9ov3tO?~iF5RX9+A3Qw3nF6^J(52AygV}P!MC2jirD#} z`9}uAB#OlNZ+1=;&Th}v^FZ7R+ zNgyOLsO+(E(_{>9KGG%Rt3Mvn*argqDjT)Vpw@RFY6Y`c-u&bW7Gg`}w6<$h=yZfd zC@7&f*YGU2!9zuOJ}p!qd+JE+=Fz+?T+=|Ny);B3aZye3s$LrPlvS4y)ZJq zAO7AamfiQAQwc9IJ?rmJeADo}kvW+?bSkw}LKohiH%g8~NmUq_n@!9W=(Eemi~F=^ zAUrHM*YO6Y7Kd_)Sl;DOo!c&Y?$8OKc&3y2$eBRWP! zccIQY=kANDYCpi#X{f1J59?OSl-LW48d>F-uqX^*F|nhrYJ3|a`K3NjH)K}!Igy`2 zi87^$O8@edFnP=XRl9^$J6BE1=tTJevqwHFfOVhAa zmeQyk1}59DuAE4nU9)K}Zp+6>Yj>tc-CBqLV^#lP>0R|)3s-!c3|IbdRdMkQs*Mx+8+9 ztls0ijSc>Ok})e2)%MQ+553CPjRhvJYE7?q-;j7_YxWk5g!D@V0gKhzYNzn@d;P3o zk;h&!`uSBKC$#g|0?s2iNA6WskTH(}i-}9h=Z*0y`J?hWlDEr{R$PCzl3SAG19S|% z7Pkz1tNl^RttE2SSHphc?%%sV)ZP|kQnT53?2|fQxPK1>FlBjYR zdIMr7fj#G%pgBc$Er4I zwKzl7%qZ81JpAh619qqSTIiCK?I-oxzZi#^nX0aO<$GbL17x~xCu*EB19uYt=T}w5 z1)7ev-jC0AZ-sht zRh-8K*Q}e*RM}e@+MBwV^W_Ig?t0j@llgsSn6CT{`e@+X@8ov86?P)F7AP_hdM(zo zZ(h4lzYG5@r|ixxx17Nrarcq!p;D9-pUu6M(w;h}U~y=aN}Foq5TX7sR@a9KVWzbi z(q6z5cRI;vvodjiJsX6J)z@05*sso$O2vtQ{u@37g>T5~{AM<&#lt1s=M6j@-@zuP zZ>crryNf01NL~p~OCnsVPMo`0Onsjs_pfhFsxM<5G%4|IyA1ZIJ-H%1@pYi}6OU+0)nI;OheY3)fcCgty#*w#eQgGSOn7Lg^-MHmi0DyzS3 zvPpdQ0$qT%j1_-!F`M`l*7%3LvNuD&PcQ~1fz5iQUo7*sW+ep~SNsjsVHKXb zp9BnRi1~WCLKWuMm;c7i+ul~%($>~Cz5~=q&6T5^q1a4i^wj51r6RaTK~y9eIBBIn zeR}g`WovD1ZH7MH#fy|zS6Ab5SrHf+8EtNEenB}d8`yO~fAi)&0G*yvSzIpXOm?SB z@d4-i%wqmmFkN|b3~;h%vBuhg^dL>oo2__z*T*k-u?q75&U)&rK)7WWI8A|<=(Uc= z1y+FGNwLmD(X#I5Q_M8jbNayLc~7ZBVb4>ZsKe<}4VWZ;S5_dLDVs6|wP^F@ym|B4 zC|~oLA)4=G_^DUfSPYZ2sqokA(Xi(Z%N}KTxX|m>P?f;9%e>!VE;0XHxp0MMV1A3r zAzdIS>O4qdBu%T4BNA*W^&X;sQ9dkknQ~LItRy;V0)I|fag5V@+171e;+l;3s>*IX z)*(g+rOU|~Q#sZ@>gUVnoaOp+R8GQF;)!F4M%*{meT#oQ2kXKUdU5SRsjvbhzZGG{mWj4b(esM(;W_#3SL1+(5 z5)r8Q`}m=71Y4_OMj@MMuf;VwtOHI_Ls1$aK z8$j+B)9)0HV|-*l!A*)@hSK?Y%klcMKCq1eUv{~6s^rl@AhV2a>;#}283@EuC4v^( zo}B8p>bp320dsA)_a57FwlWWjNTURNmof;Khg-NA)1$yENTlL%P552GZ*mGZ@@gaD#0KJ$M(6sN4)4Ev*SfS5n;8 zTgmXyeNE-mZ-WdIm(Oy8HcNvv4_L6hmX0AAu){R}UK+jqg+!nRm`BwL|6W$769&=1 z>7>jtbh*yBj}>L=L(^J3n{PEL{GId0+PU>Ri}}g9k`a-Rdz1}-81UIo)-kOhdc&f- zcE=1BAv;yIIoZ;Z_yt+PHW;aXsX;)@=bU+i{<*I|UTRr&A$e?6%e0e|BzlKzV_;dY z&#Df1#CzLiB%m(hkzwPSJ`)nMe2FPe8PmTz@!PWAbXF)cQNmwnTPFkoIfpRuftmT- zFM+_tE`L=%KMbz*xB5B4MbJ?*e7Q~orr&J#0{d(Sh8~COq8j*X4|iyM)F~B3b4g%q zr?@~nnYYQp_O|}@+)G0KgxQYqsaK@9NfiNpzgORI*uOuVAs3PYka0^_5#5(kqUC&9 z{HIi8AVri0A`qZ=0&%c}Aj;kU@!?LTmd_PJ4hxUOeSF%|GUhP@!fCwZZETodzomZy zguj1(24rKyz^tmwgKY|^a&2vG{>xbdpqaczUEMEl%UEAu|JqkP_T+vB&0leg)%SEh zXFyL5fd7KG_y~XwDhdjJ;2#VHY3q|9GbaEEw=-5=P;KWll7OcZXh?bh#MS_I6a0si zXJz&4ub$MEx~X#Myf9m5J{jeQJ?EhSO-XBrLPp)>YnXoPQh7%p(z@BF4UY{g; zbK$ouhhHDDxxv|vZrC+nbflxbT382Cy9W!9pj^^nJ~{u#mn?!-SJ{E6B=x`y6=vm`~HrCP$1Zhz5>ujlN~ zue}a_X;XF&;ro;*GgFWO|3O?%z@t_5;fxi7(*j+uYmYmKFZ9L>#5H5=l^)(*x6n=4SK;{&Fs2gd$L|W_kEujAH~a;pZMc!9=9$Xo3XrL zLk6>cqmJ(slE1cc=SR67C+RNO)|QbEFTRYZpR=OEmgC^U>)DtCB%l~@!^LDF06Et`=jRA{{_PC^Hju*TN^7h{u7>S@;x)tLwSbnch;J2%Tk6?s1N0X{8343kzDKI!pcrAG75*s) zj)@(h!8b@#7HsRu1fzwhrqK(gO%-S#|8b9_u^;#%CKd}!Fbt+Cjr#}?KLLpE zFJ=VV*9|n$m;1a*(>-8e>)HKJW1?HfB@iL)3Ina@PkH{oDkx_(nCg*0QU6v{1OIcm z4I)mp3FAAe;Y3!Rj-9b;I3Z(d0j@AD-F~dt4fTnv-_8)oOBlz^g;-p&S8afweI2Mm zZ7VJH`y_bbSAyd82&?Y>cG+swxsF3UH;c(qJ*KlDejkd0k%}H?^VlC1j^-}Y^jSuf z;Wn5nPzig01z%Z$6ZHnMYQTqQicR5=jiPA|mvgV+0RZ#vDt+{c_76(iyCgXx(nvH| zzng8qqoTbiAy?*s7PT55zYRNzZntC6L;fiYW|!6X@y|JZJ-*Or7FI0dy%`znZ=(Lb zBW_HqIlZ!f|JS)J7)}eyd%(I&2T;^2$t^B!RyLyZy4G>N$tD~WXYxM(n1bK)gBuw@ z{G@MKR5vO!FoI+PL)W-*M*qf`Kiw$t(PVe%NR_c?(`p}N`=Q44&duB!;Ry+s%@=pa zuVM6wIz4%keqwauqxT#pt(X?`cc`3J`@ma3!Nm?#!iiaz+XBdij62H`+@s;fdBbjO zXU1ChIe-Eg#gQ{}+GSJhjxS*Lql*-$(MO0?$$L3<$F&f~;keg8R3>AwFw~j0 zs^3S(pJI>F{|#-h9_#~z@>adJ7H=xz9U=7qO$-%;rr6RU(;Wc@3~hYT%-QIN%P$2y+|pe^fSnBVK+KJsZ~`Es7UK8jhpHb*cN_bc@olwqcmtHy+=)FuO$yprjLJn6oxXUi z@@%2+z3_D86G#Rm-%f{+=mH8Lt&hctCs#(t_|snwFcbg&Nz7PGLVHK>HtTLLsf#Jw zSCtU$uLPcPUH!UH{9O^_@tQx=&i&-3_5m%wDIRaw{h}(j0gu7 z2Uv~0eP6>OnZ2Oajn7L{^+m@#>!BR(YD}{wlV+A*D1n7IXg0z@rS9@lmz;V$DKrU* zrf=CPbB=5#)85cQ9;}=J3YN2THfhctuA4-K67O0`RJGi;!4ad=ax~ptzdzBoNrV}d z9p3+od>omMfN4svKkz%*>J5%Hx!*9-Cfdqr?wOMOt0oyHJO6HC4i7<8;EQoat0*K>$Z*uTwnB(@Fb&I zJHV`&Hj+mI3jCwd(UCr`xJ!YutPXoL7rDwtmb01CX77=3{Qz18A#!Pvv^x*B*j<)% z$5&pMSgs-d6l zfZ#Q)(>-~Q=tBexn~$i?E`?8Mi_KTX{~QQ7KwV^1z(tMQ&vYG)83rn z%P%xplyZf(Q(Bp;QKT4XL=D4aZ_E|A%QvK>Mzc09Qi5>Jy*~w5nJO>5NaQsrhXnfi zsrpz6R~2FutVpDcPBTzhS!FP- zjQi9iHTK1wMqezqr*sBwSm+X*21Ev0#D*Tq2@f17ufoyfNC?83K4yD093<>IF6n>h zboO8-l{GNmTGInl(y^NBxyH)ra5R7SNb;10Lnb-625*>=B4b@B{~M2clB)mcf6axr z!3-qa79#eM`zfE;#E`NmZH%!WF3THj;EOeAc`_6qG@1K+wAp@pjd-k#KI2d5gRRby zjkYJ`XpdiAZ)D{{F>?kGsrq>}BSG&aCZEz%!)kX_Jv+`Rw|s>|7Q7nhiMfB)z1pR9 z_P985x7y=1W5XXJ6q~DE4Bu8Ef|kiE{rcWmx!M0MW@jDSR;!CpVc{rlSSJp9K680h zdIasz_w2(vIoEj`Fyq*QlYF;CE2Ft97Li&ObWDu1S0n7V!M%%Z57q9yp4qI z=Rl$twVjmtmW#GnENTA{yfJ5an)fQb!f(Ru#kRH6|5Vdu@&Cj?g@sX+m_i-7wA**R z%v4C5`GUKL5-?TfaDrE3Mjh+cFAyk5{M|5D3I1KWGHIBgJW@_ew2yfC0=9i`k4G;0 z$z!5+`3-2nw)#I$>=|+Lo+aN3+cS}M_Wi`JA#~o%7}seTSxXF?w=+URr#dL!5nE7( zj^L6{FB}K)$cRKt|L;&_i~pT<{QqBqtwyUWy2sU)DEGUCwainxnuAPL ztgk|&klz6odN>%V`n;i<9@(~z%i@Mi2OfV7hRv(i+3X>Z=vABlR`{7>@kV$P{zte1 zeytni@_Ca2U~XglT+od5by|C9f{x;_=r?Gbpz%kZ(++mZL{_iEVdWtS9CJ6JApXCv zkznyHJ=@voL-ip}e>>YzzQ^hq)&Rf>(f^(c_Ew(pjuwH~20Gf`j()KXL@afmCgn~a zXm%E)@mLDdwa4zoW%BERA^%0@2$bocs+j>2{yIq30*t5k&3Pqq*l?kO$Kr%qK>dB~ z{nM7YeMb)l1I(l%UO2u+XMyNjtf}L>EiD~}e2a0ctfv19sT-F<<1=TcCW=uKlA)P$ z_2+aEm`36MgIeL>J8bqxzgaJwie-`<|*Xh1#HO|D6ce(KR>z(W_%75HPq?{L_9?B2%n zvn%=T47a^N+a%1!&1GUVKP2L_t05R~+L51k&-ImaSvQH$v&)6+#oyytDCv(7Oa^TW z=!(7UX0cTR6-stnK340;TEoUi0kwW0E|^^T+O+LSxTvx+2G+t;@&$Vqt{om6+=y%7 z!)8Fb@_jKfiRLvF#>{rgs9k*Y5om5m=M^Qi- zYTCx03=y@rlv-F!%e*R`da~fS?5q#W{^ITUD2jLNS5yWWqKVMDeKFK>x?%{IOP{JD zn>>7Qas29H!9rJ)?=aFO*y1zfch}W1bh9f8_SvUQv70D#yE7y2y!H0IZ>f7`oZ10) zzs7P}lAZdB5ZzyeV79$%JX`tZit3E1%r)ooU;6Ps>^CQL?+zKk7z(mkXL9;eQTP0x zOP@Zun|}@o-|YRh6?E7wA)|LD4ULQ!-u(W3kav^s*9G_HP^qr9VbTYIDY+#keFdH7 zhznrAcqca}SpTYH*)Tmq^kJV#di9&ZzK0W@rX=jw-yybK+@Mrz{=gSNpDvDc3hmM(5Qe*-c+9{0FkJ9!wX`k6 z!L&>pjV=`v{i)*y#tLiPdXE*uX`Dya*7lowryl2$z8wb`3;}aK+Q&j?3KCdd&1)FdudVzwMfx;wktDk=HlM?0hExqUEbyA{>!U>iP$)Q^PQz_i+js(&(-B=zQc zOb%P0H%78}Q({t;_@AGp)Kdy3b(sG4IXH>TY_(f3YSWut!CTC8eB=`{c%bKx4_jzc zT$)R})ExAJeHwI?b%#n?PTjU-QmIGzF~s=N-(o?V*M1~gkWp)h{&Tw7!AWU;8=bxg zAi!{hLs59p2kMy$dz}6mCK6PH0Dv5Xyv}M>54@|Z8;_Y{&35BRJ=UFx!PD?GLP z9}A8)1B#(!`uV5ZiByJEC?vh)8Kmo=qUMST8{d8ba0YrnuMM(?Ay@r0;ODTzk zn7(YC&1>vExz`<=PNUWA_I18M>C4zzYPz6}hIt+zE9)<4X4*)SqTnQp|8=zWc{gZk zu&KG{_Tq7v&#-!PdTumJd?KqGu{t1)2>AZ!{kvynOm3f)?$PxhUWBxa%i?;&>cx~d z5B|MViU7*T?V6`aC{{Ii*VZ?6G~#w^9{@;3j8-BWweK;HcWVhqn;>%aqck(P^ zaJs_mUj3Y$Jv`}3|sJKZD9kOKd;Yc!J?2RM{p(o)?R*ZJfS~ggtvQRILNJ1> z_NpqN{?_NJ^TpfS@DvrWUw1n`$>jSr9*l;=%Fw4nZbUz2Iw^cmPq~hmiaIa2OFTh` zeY2lKI(N`5wZ)Bdlev?BkGdd>UFX!R%jDED7H}nwWct7}x|%dR_4U%Z&xbE597z-B z;5~)O`dX21a*C*zT>ZP{?dR%lb{wy=0L!f#Qh_SOR%S}V0QAxIr2bDJ$q9ax_zmH- zic_?XF=!XCU3m9Cpc1Hr_Js2I5zK%W=4E)b-!;jk4VO{|JoXpY*Cl?xqACTN>k+p8 z%pI|a?W&QZo7Tz)_O=xwzv%`Ci>{~M#;f|vfx5epIN9xueUnBr0w{uedNF^NlO5U4 z-Rtg}0fV`V4})9emX@`qYrgWH+j-#|Pe4O%P=A)ij7_bX88x!}q7$=QDK z;QXT=g0s4SfvgVSp-t`_5=6HheL0wPc-@|CXiL!NXsL=H5p{C{CQg$w%pKD2CLg!V z ztzM2y%Z3zF-Q1SH9%|e+yoS$LID|`LYaGl$A*c_Ctgc$5u-UA|7q7t&ko&H1c@kk@ z1xxoHOF*`|-9KzZuNx1Jr=!B6Lq7`tryW#-jAu$0L!Tz;)skoezi%+hA5)m{XUtqt z_@27K0D_T_pnP{4fCq!!P(Jt53}BJ;Wi*}E>eN&&HtV-nq&Yr9jjx+(EhepVUSBzE z*HcD882CHq=lYk|FrS~v&}*Pp^O-43R5u-xRI=^ds+F2CmTTx4H@iO5kIiSQ!CwFSRC?%^x8y9EPqRxDX?BU&Jr5^)t#U? z?yr-snNbx*>EiptD{URxVLuU~uceab-x6rjNFtDy>+*Z^-CAvtf_)_3ZX9etG{CQy zm8*wAVXvXc2V$X^Wg1z z{j{(x#CJK@$%=!bW-_RD-E%L;_>euU3()IR(=pgwbMEV82B(R78e!hBH)`5{;zr7j zaGa|}L#UN5o<3@sJ3M3VNT!_(I5(QiV7bzB7o>_Z+Qsc^Qf$*S#DF1$pUOho@W%1@^~WL#+RK&+EIZJz<|qg z=*}KMnx%pD6rP{N)|=no6lrr!X|GorJ2Y-U6MGmItQ`t)FYO^%~p`j~RhNRY~Lp{!#3GZ~+J@RT{e)?)1AgG;=+;T9#2)0%e_~**w7Ia$=w92BY zw7qxFy|s4%&3qF$e*c+U?Ec_!T3}Eg@8Y0-eC%)lbPkMndRd@{WeRA2F^P0qb`;W(z}!`)l*N0>twlR-zML8te8m97U|*K={8 zrbYLJTYkQ8n>$&R&hL^77~JZ5SXf|O*I^3ylP(vf#~v_NnFPv)852Il?la+ z9$I@3itWbt=AQ|SmBjLik28j}GVgL-v=qH4QA2G+Th|_;4IYoi%8SRqVSsE_;5!so z)--ocNf)J=KqSqp==Ua1a78W0#m&r_g4RtIO?Ect9v&5&Rh64#O!G9d#}(C9L5d`a z1L)PidRV^aJIh#*Xc2|Nquzafg%B!{qaYEi>$e#$FS-q;y9galj70FdOux1|II5`| z=!GvN4LR4lA1duze6-hK3MUsPC|CPn?lglCl(d`fYDW>wh>d|9wqI)ST`ar_q~yO8 zcRul!GBM?593U_hJX&~U{JoekZ~=wi{X>T<#Ph**c+seM<1Fw~f?T@QSuGOGx$V8` z&}YXvIj`S{3&ad#UrLXD#-N@4qA$e6GL=b&*I`ogHCUIIXa9hL%v#f>sqzt$8!5zUI(o)>1);qA7A|9FTCE$+(ZT=KjI1GFp{N+|c@9x(AC zKq50dzi~lDd(kCZ&PN_=EX@1yte*$=a27uU3tt1C;|}{?~hB*yGv2 zlFi_f5;}2lg48&EQtW2iRXL+Ph)5n@*7(Y0>!T1c3CZ*L(ib>bl^A|BnxqXGiNc0) z2t$vja-*Z8`D|+{T~YWW3_bK~YP##{63Ev7a+nTHU=t-3Nya(4=#fUCW8WSwVEuzA z-;BG)k);WVz})Ga)Eyla{Pq;f?cVO-V@FgZ+-BMvW_P;WDK0CEfg{`jhP{Dp%W=)T zS*w+&3;ss+dhqF1C)|b)0FeM8dt2`W+OW^4^SC+>{EGQ-YuXq3W=pUA0`^BAKYoO< za-_=3E7deK3|0)h_oQlabOU}(Og&_Dn+@kKT|FMWV=_qS!|{b~5>SV}g}$EMPX-e3 z1=ywH{EYgGU4^_ax#y~k9vSe@yiDR!PK7W;YHT0XF>f}wtiGU zUCb(X`imM~<4!bUvQ`hBK!y?2eR_r^2APpkc?~z}#rfF;PU%N^MLrro?bWJRgl3Qd2QwJW@^6VFs4+4!(Aop}-D^yTnZ}6r+aT9VfbN4uU+b<#} zwAO+Bzq-zs9JJ|@BFRg2YS!)i2f+qY!VZYae@k39t8<37W{{qM`=9=nBc=>}TL2e{ zwR%@Z;2Ot8F|ps;D1}ddbW&NaRD`4X7{lukaqszJ@M=-^HQDUYNaOYA32Cy!dJ)Sl zfzn;V#Iuvxq49PlnGek)NuU3hq%dPJZyowlmN%xVMviz~Vb!6zsrZAS`~3F>KD3tvY^g zz1xAvTrcj_d~-7uw48RiHc+uEE>lKUG<4mX3YyBPEEH6VLapw!m%h!ec`0gu^8Bv_V`*Zzou%>Zh2UwJ# zy5g@M%tnF75iUg&0@eG&O?YSC7sqN4AX zK>lavsCl-E<13%=-AsQLFX-HLFX<9#y-c_^T)0ScC$lyy0*KJ-9@Bm3D>kY*3B8cT z6N!ZC=r|x-^|_C2?d$qA&4#~XT6MU2>hF?`UdIEqh`-_SRIVS;6AU%s*V3)WoD>h| z@qX_OLSU%e+{CU(yY*1^#EnTRc~`FLxS&&7%4*uK7OGN}51x5v`@~^yqBLBTY(Ex8 z;&i=fR#P<`#J`YQ@8QB=vn6FdzJ)O1MrIkjo8NnCombaovvl)3zhZme^RZFr!tpo0 zVJw2>*}{THy7G$R;x=64+IMDz?9>^fohMF8*m-r<5mIR9c*(O{n6lR@czv{mp-~ z0F}sKG|a0`vzET=!C^Ck)mbD(8v7W0;;J;pvAP9=EFZs&?oE9Secsh|_RqXkMp=ZT zkyAgcn~KgrEu@PqiDjx11Qnk=1OGO9uCXtV-3SMApCA09eA zoN3d_MYhydeYxWDeY#=>k;G1IZwSM@3}xf8J+>X-U=D_%*z1Z}JV6qL`L%h!d@i*` z#&-EvuiP0dr`u%ydU#)Ua^88rhS|iQ{EMC?ag%S9d0(*Upos9>oX@0RIec9_?i z2v1d|_rL+>?|Z!6!&+QraTx}yAy9zQ&H!?WH0L63oQM#D`&rE{hq3|!rzsEcHkQit znpElm+PR~g7CdV?7i47k&X4=X{kf$-f!@{2WlKseHr(bV@Rjc`Ci80V^Lx;4wH;7ocnfPA*Dy!oM{e)C(>Gf{{OMqUtVWsN9U78Aa0V;&R{Y@)sAO zf#~^-N~ldaXzbSG-~R$>QaK4re{%9YP1F*`GhVj&%t%jLb444x?&j_8n=n26fdTU? z?)y@B9OU_@gwKY_s*ZWd^g|=jxB1PjUdRs>E-cjBQD6TPc;E)PMJDm|4wH&T|<=`07vWaT)a;SZxqbEqo54fG= z1|OSSYS{XMzofePbwf8@78bL2A*dP;<%OW%AA6@X76bJAGeIYB(jMPDLV7OeP~UtC zJ^C~66eh|(XbRUjI^v#2J041keg%{UygG`y}#uBV8DzhF39?iB`VnG_dg&_0j*a0yBMY4L|4h# z8^T>4ZOi1JvQ1bl9o8&lh#Nz<5*F&nE%j~OskS*7GFBo#A_$)F8YC*(OOumm@GGSj zyROr(8c$K4LvL!mkl8HV?`RN@vCKiwPics`Xt(LR*00`~LX&{F1^gy`i(*68AF;V#a zElz}fWLk0frq#^FS4oaTG@;wK4$u4QWenqmwSDjg-oe8p*MbiHzJIdHO(Q-(F=2}8 zx^Pzg3>}bSP{CLM4N=s|lg3ZsEo7LIf(K$wEI!gU+lX?`Ldo^(vNqyfT0l8It;Yu~ z80(6;?$x+I45RB~Gc!ptzK2-Va=Y?hz>H(GJ%@d*trhBbHs#6qDs%Zg;%d5H(1p8u zD}7IXSWG*)JHe)=b4_odox_RkDl46bHDp6Bi)gRwy9|YQFmr<}K_C3sp?jZ{lGKJP zds9;?xKF^L>yfBo&mW5z6IuQk92QeZ+ewP(t4%I+{~&3nC|E9To9xyf2n;+Sk1zq% z^$Jk<^7i^w942JANA&_KUS1bHQv2Uu*-A*Em%J6Zqu~D1D?RS1)E;~8gE2D6VE>pG zOn*>(5fTGC#vY36KP^X@Wc8j)3%WLL`(+y6AOU%1o7y^4JmUi zBf|MSN*5CDBk%UM!fyGylj`8~iK0bDe!eC|NwBt6r3fABC4w0{6lnQ+~+D zjC_O&q}A@Xi8(NEf8L)~MV)=0iD5BtTgcn#>CjKf8wXPqI~l1t#07(M+Q3UKkE#{aP|yklrK^q`Dgd^Z!%ILiqed{)kV+@j+5_ek$?NZX zD1uG)!vDD8Lm><{9o*y%Fl20>H_9L4Dpk7-D`Kw^Iezi@G4EuL+&=Z#=<>*FB0irV#PLCa&;@* zG2+tg8v}BPcV}+r7)YG4D~^hy;r{*iF-_S|B#on+;Cq?{$uOhMH@3I<7Dp}bl}ta{ zX2I90-n$`MM`zc+f=aFCaNltkO~dDe-v~ri^jc1c^)dS#rxK2E!ha9V;B#|upJ%7! zL?UBYJiuF7|FAKCM3vrf#@%rT0$kmi(^Dhaz<3ewXv$bGccQtgML+PpYiyyu6mRbU#bpfgDkd z9Owh0xXJeXQNL7Lrd%k68rc-EVW=xE&xDaNuM*xB&wPn%_r)MfD>`ZRVJO+ zRW-Fp1OoZ@0mbO#$|=UPO%JZCijrKDqZu1dk9i-@CJV}DqP4Zzj;=17+J@#Oly8C2 zG#fDeG}a@)1INex(&T0jy7Nq+(JBxyk-UNtG{t3aZ$FX4ff@+zh>rDKm$7L{Lk%T^w#gMPY888CNAxh`8N1-hJQFSI$M3%R+2nS2j_paYKE0w5WZqmrUW1$0{MW%};v)6@vzRNt;W3mb78fRTuAsP0&*hYDO#}mY>~o-ZwwTJY6op zdQzreeneQq_z(?G!__HsIii9t_>Pi=$7_q)*LQuJSZeC~dLTfm1cwhmnr0c+p;OgL zT!?9(6z zsHKwj`J(XDWu}q?Ht$6k=Dgg!_y{eoPVl5tJxzA`YsAF`FAG}ZOIY|Eva*LV)x4Pt zw$$h8`5DmLY$9B3KciT=_P-%TN|9*49q)AD7(E65ICGQV#{XvY?XdVWx=uf9W3tOj za`x>PR~barnJ28E*RbR)HCBc0-_rdwb6zy&eovYyUpgtWr*bE<+gUhCtuHf{Vg?G$ z&GpIoZ;nf*{B#ft&IMl=WahZ+_J&C4u2SvRL3)O&6>bnnmP6~4-I~v3(ayoaP}UyB z!;i`UbzO;LtE8VJBR}=YDfkwi!LsqB#$YUNSI^43k$2lBrUehxZ6^${V{-Z?S1`1H zXM10`bFO5$tiC>x_wVM|@sPxp&ZDKpRK{XziJQ1oE#6aau;JCyKsDL&>G?bPxVPc}|kUD611+07z5-Tb%wW$V;Fm*;``=Zc|KhcU zwk59NUlElL|2<#6i21!S>5!EHNMz(U)7WJRE_B15ABn2GQ-xQrW(PXoDHL~;6>8Ft zG{eG>ne?uxP_S+eUSlShSAW~liis7>KPNS@U_|uK=5y`m2u`tt+S_Xa@n8Z*hnu-_ z0>XfPEZ<+P4eR#X^78VPelA$o*MHx#+v`OPM@2`&y|L4;-daKvd??N&Vlx>q7d#wa zOi_MWnD5^80e}orhaI|}EriClhI0hl`H=W4$}35EuDFATVw{MiP*z}Vfahngjc4GX zT*mvwQ=9UtCa3kPhSKWl-TSW&H?XeN-Rn<%c-%82|0|&36(YfUg>b3KFzWD|7{5S6 zY3>IoW_bvTddTz#zrV^(TJg8qq98sSKc2SPxRZSGTMFoYZd9SNIB@fL&{o;06g?wg zKXVx$6^je?=^2_Cl@JY8gO%>=!c2oZctdAM%WgBI2TWW3bQsZ@K`pr(vL@W{%7=kp z)8bdJh(Czsp(ufg$gD+NDw9L_J@EE&<*aN%qQj#|*|8UUW~>T^Pn9{}zH@M7D-A(! zO76pB=%+Drauhu=|{d_zPF~GfUfy){yZi3==_W`#DV{1^_`d zyZ)kFHU=2cCEDjfDA4xRtZ$fakwu69v_tryUAyQ1t=Z`l+NUWW&Zg$HP6~XWBgb3r z(Y=X!HJi`UhQle8{6agotq%UvhQy`r7`f_wk##7X_QWws7TYIZyrcc7#E~_BKmJD(cC-5AjUf79R&V2eijMrO=@Gif$g#910kS0LIU&lRy|*zLKz`Bk}2XBPE;EtTfrgz$#g>ZiYEesDGICp=eI zSPil>Ge{LNX(5j#DTTGlHtU90N7!5i;Hz9}fxf4JlGUJ0;i1~MzES8SP$LqFbKe4; z@eWQkvpPRbHtah&bh4%U&Zn5LkoHGoScf=+C|0`*%6-oQ1 zX5sABd^^lv@V47>R575B#GxxJUSs?K>J!fm&eOPNCD#mwJ-6LR4)5;fh~^X)j+Z9^n8S{4R+OoB#-p05k%CNA6E`J#W9W2b-*Wni7ZG2!N_=Y_utKyu9%F-`}Y|J!2iL z9{Mkzp7O$MwMH8p3c)LE1`r@46GVD!U$BnxTX6O+K&0x+$Mg8I_vOY+{?tYD-0Pd< z;)}g%23tDvj$?&P*xfN6WKH3T8z#xWw1N@acn{NwvF&}L&(%Lo*_=WgOf(;y*Ipbx5p$WQz($Jjj?M@_X>pRv{UgC7srH^X_if6CT^&tp=y zS3Q0i&24uWZpf{{Gs0nsXL>bTxh!L_PHZPcB(gGmeP8^V&`5P5_6KZ(RmUtC@HNf8 zEtbcHVO4VOxnGsS^v%_uQHx&pw)^Wpt*PKd(a0_p1WIWcrV?)(_*Mr&d8a^lLvH@X zrKJgDKh9AOmRY3w_PYC7YMK<))~2+L1wdkEu{ie;X|s+cjof4!^}#@+f3XtCq3b^U zBAhyrPJ6|%Hn340+4lZgQWmQj1rAwMkNtYtBgcV3J*<6%@1Vz0gdGh?4NILS)3e3o zC#B)+<6@m!r^7wC^{qbN9ht_vunNaN0sy^Te%eR(-a?==jt)OxE$HE27N*YX&n$H2 z*V&DYjaS;@$jexuuJO9~0I%S{z;FMUr?cdMX{Lp4b!jE!KJvgksKFhGPaqtculLy) z>B*O!so^y>6n`gk=4)yOi|%L(7p5Ip9jf(*climN*Qw52ti5J)mzV-S2t0p z0${+2U-=G26{wmuaEi;M?*s`Y|kUHWGlJT|c=&`|y0_eIUc ztlO_fgb3F9G=~1zzy*xo<82;8GX5|4H{C)Qhw5h!I%jTiapM^Rjf4WUbFIw6_QZF~ zFY}EY5$Bhv##--xD$KG|QpN;^w2zK3_1poiEW@*5Rl+wmu5h-&72e1~mEW&?(k59G(-4u#m!h?WRQsmx&Zh&2W+IG;FmylCKCVID^Av|mX0wbK;5Ok6q} zFrLj5<)>4fSPMhI%x;hG9S@;MWOVM1H?ISKGmA{w%^OPB7uVf3UoD&c4}r60dG)iQ zy^`=5N*xi#oP2thI6aF;-cN$}!~&#AWp?E5u*#<73D0RbDU>#io}NeSyE43a6N4vF zq*R)$v+b`+Y{DM3_pTCDn;sPi9SAI3RL-yy%Qwalp!97DA|N9(UZ-Zm787S_peK+d zw-$ZLXx~K`GejPtQN6h1_94&e4QD#*$_EMT^XKq=1_v@K`R`WYEn1h0>(8eIeOde6 zR1h(Goaeq_9%aMj1>;4Zd;fVf%{BoqVfbllU4(F zBcNko(DPbIiibEk@;C?1pNvb;5`wF7}T-~Em^42Pf zn$m)pE&p5y=iZ`A-|cJHGBMi@ppHn}W4aRd;0V@H5LYyc836(}5R*w8jQS}|J*kze zE(i4wG7B@NC*+?{huO`GIceJFBnTzh`*`Yy{lZ@iF}u8cd~Q?k$OJ$W4x_8kcQ8=+ zL^eqD0yd*cs|d-(0DaBtHE_0bd@e1Y zsp4Vmtf^w5*x1;X*48?3g5=f1g&)J+%Jy4g#oP4t^;c`jrrqIK|m|m)%(w}gRW0+isBD@ z_eUFPK651(-@ddlO^KKMI%>+r3b6ZVaxJ`JiRcd@WL~7gV`HBFf0A8q71E%!BHqTAMDW#+v}xCdUH4bveI2O~>1TOlPeQ+;hbdKUNk^(R?yHnAy_ zEIBr{3y_xK%w3He)@|c}a`lDZvC#ccB^Fv29Z7UvC#lc=dvlO2l63+Cry`s?ObL-LF zMXKRz^qh%5GZ%;v*e>L)MO+IG-`p^Itmg8_*Ch`utAL!RtZ}o-hK&&ueclrR)BQ^Y zg9}iC?_CeN0qxB$*l)Q$9J1p{R1MRln+s1?VMqFBi?UH}TJs$h3sX z0j=`2HJ|R1eW(w9>l;r#)=OxhUUbR~Ke@&Ao&gi11836n3&pAp4&}f$<=7E{*lwp{ zK0=BhMKI{&odjO-)K?^3HPQ!S>x%NaRu4Z<7RC4{d~0Clg+Y_HsH}Mx8%nU>F(6MeJ)&Hp^J6u>{|>ksDxq zXm1Vd^x4QI@;q&DW{+63G@C_RxUDXS&ZyfL5pH39-Wsy zs~#%tBwQ~kx)X}f_-7OS&rT|EpkATxLS}d}fZcdgupa$Wp6MY{6PBpq9UU*3EWuMbE1FsEFrVEPP(TMud4BQ)#` z554!c#BjrZNCtB7<-Lz2hov`l5X^Fhl~!=Hf;)3`a&4lm#%@JnwX~|L*Qt0@)p6b= zBc)G&y32p3Z6RsU4Cb=`nmhQ|7HRFZXa!V4-_{h2|{x*pW_qKJG^t*=eWi9e_gjOts8|S=^8nsHy*im z`R^>L2|uXWkn3_^3<-QrM0P1avo-wFhMb|0QT8MvDb?Qd2PlS}{?C$9Cvn#4&nOa! z<@Np_?0r>NU0ate1Pc({Jp_UWg1fszaCe8`?(XjH?(PuW-QC^Y;jUDxbH1u}|3W?N zhqYTJdpSuX;FUFwyHK>RgU*cX&f(Oh zPth4S)b#7`DfJvS8rq8l?t-B!0g15wNiLS-fP2cr{ z5IeJ|_+b|s``M8ukf8)AVU;?cjY=u2?Psp#rVi6JUQ=9b&nOyl`;t;Y%65ORk5;*& zBQbg)O|&~&KG~Ahw2@Zj2?7(>F67-{Cuoeb6CsKT~8bq|DLf1#lGeSom?LT9t(;3 zBcP$5v;V6dfUH5?%WQtu>970hwhH3Px-#U~eb9aE{Wsn0dd-4>zs*IGa{Ru%K9etC|@=4ZI^ za67Us9b=+WWPd!O{o;TGu3$#C+Lquxy6X)==u{l6iR4hQVK=4a+^vwzX!**Fe_`PcAo1Hjq)8U^yRjfpzdv?lYj@c!bPI*$YDfZ4)1CcR@iwfnTwBeGpY;)*?udd@ zaBZsLi$V=fP3_?=+u}g2#hYlx`|Qa(z#-vt?s<>~&UC;5y*+J9!aW6ezIquIMQ;t9 z?j{`#GY^OMGVEb^RJoToa$>mhyd+hbPeCfTKS7}ytIx+inR|_+io7W2l>lk{bgt^~ z(#!ruageCtdYSQ;^+XOr1t_NLepiS^NUUJ-=1e~;cV?_=YO;&1Hh+{rjj5Xzq}QP50DRXuaEIkB)n3=Nc#{H zW3?U}_mZvw9%O&?2Q}5em4{k_j z6$&l`^3J9^z~>F`u}Zk!f&bJWNXrMi=!!q2^Pc2xfcTpG^HWh<6%eU)H!41v^Y;Yd z$W3(d3(BX34D3SHG3bzwXHRlPd>@G^UVWl=MK~>F;rP;XcThCQ%_Df${CoVc48fGY z>!%^=@Pfa-=h-$EJeIIs+?0XUv_>=b-m1!FsP;M?N`xU!9~RD#LaxeBcI2@k+=jx$ zoN%??TWi%-YeNUIEV+MUr;^WKQxL}jG@^n9^a`~L8KK0j^GEmrZ;!RKSH%A<86QO; z5;kkC4t#?yu_=T*x*gyR(~=9oo!M2e(eHi8Tw&28Kh%u1rFeb{>RB0`#_TboyU$^neBh6H@ZhgBJs z&3NIoA>?}Q&CkZIi4p{3Z7%`*UZ5fKHdT|}1!f`S!;^@}1WqYImB&+>g`jgjPdt#W z((2D`3DPxHb`M!o=~17FxYB*Qe@MJ;IREB|yQBU|tJ84le&l8Y)R zd)YECQZ>|>Sn2amFeL2Vi`FhYbUc_~)O#uMfe94LU%RS=bLi(gzyC5UW%jwzpLNwp#fw1Tg5xPAR@djPtB-?r4?!E-_87M#Nt(EDCRHXJ6fa za8FS5fg1Nq_8l&`oF==NJD^GD(kQh$rNG;wtJR+RjY*Q&iO@)s{^uDG2g2W*bbU;x zGvkv5_X&;bSE`rBXq9~)fMk1wM4gc5*hA_>0lg)dhc~M2cqpR;CjUqYb9^m)@jA)xll!cSQlBaWb_FzeUSCHZ*{C3~ICFt;nR za0sMzHE40}=YA5~;DTms7L{fnnG!*hi_>s4Qm-;R+PT7SN-*y4%vp^&eS*5X_HX)D z(eF5`kIX5Q1CK9t2}cM{0YcHB7sz7stEZ|7gk!CRYvkr$Y;*uGX}omu#8dKL9~ZR# zcR!Uh1FVn>PORN*ty8y+_zLSF8)j80Z0H9$rt1w0>qXBnF)Zq2R5hXbiJIL#%ohy# z5AbGJ@VwyjZPBy0rH$38o(!f-^TX~ceuHD+L)CVwcH_!Jl2`i7)R zIlyYuPf<&g44J*5m}3Qsh)piOomi1DIHIbV6uUW{LeVKbn4&KAD8cOVkl~3?@=}E@ z%Tkr*1@Dl6NdoS!m|%=V8)jCi-0hVS(IQ>eM=n>&KR?-V`8^rvEm;Se=_#6=b8f2x z3bqOOMjxx-S8&l(@+>H`oSiu2_|v^j>d!U=6=Y>uR}MF+V1^Zt5C4s80w1w|mjRof z*cx&HKxgY%ei96KpV}bg>s8IHh;-L%AkBP9Y4{copt#JQJpOAc(zVOSUfJG;u;taWPHhW z$Rh*eV^WjH*VE0&pICzY2!0R~&Jyf}#h*4RR=+7PL zRD$p>B6A*fQ&{5tPR6spOd~m!XLy-?ntU=*Hxg@LDYA0ZCjU`T!Yi1IKxln$C-hKf zhnIE50`Zi;_}ZP+sQv8Ev%M$N=9cY301>Zo9^I=(v81No|4TObe)%V|W$?f(L2~iG`3+*a+eIqScH5tBwZBuykI&}MZ>@Qx zu@e%!lSx~OE&+qW##kF`oyy6{P?D**r8pI$+&{iCzJs*GmN}u4Zv@xIye4RQeF+1b z<0n#J@&CthZcPuxGO9D5*5I_>rsz9_Tx2u=YJ54_(xL&s#WGLSG|??N`yG}_LNdL7 zmw>$*MezV6um27|?_h2?IynD6vcS7d+dy3Y*FzEdVIuzD_5WUv)sJ{zJV{eejlh4G z5m>KR@%Mb~!Vhi9|NXwM7u%ZhM|qlXLN)CFSY}9YXovq~g8+Ht|E~YP^$5Ah+sCFi z6vrF;$liHlHm*2%HBT)reg54J`H%VfJx+Y9T z397xF2XjdO{H9(<|Ci2}PrZMpUk$jHAPmdw3yVricjD$or?z``(NJaQ4u!AT169s0 z{4pl&Dxh!fjgr|dQ8LIYtIy(FF;CV4ZB-nu0Unye9U?FNdS`4;CUKRYg|Cf-TXo9bbIy0{=` z)I}X0Bvv0go+RTm`&e0KEsYiwD#hH((hoVh*g#Z@{hPk>`K0G?GwQ8(#570hJAUcL z%2BrZ2<;UVL}4BaGV&(KI3HtR_IV%2P)y%=YGdwivaPCv9n19`@m5o9Du-v8ju*O;P?d+#@lCfWpR|{--r>TYFRs&uF8K(iW$X zevL8@9SkR-uH|M1u(5<5ea}TdqnC~S61p-CF|m}T_!OaxB{A9GXtT*n_pf5`uAj~| zF^#DADq0t+FV3X~(o2ax{QXC%IK;H4L(4ph7^{{ycIw+h-lxh|TVnGU7NQ%zrsaow z6-Gl;p6c_qzKs^w4ssn2twJeQ{h3VkEF4*Q-J_oGObx7uPQiiF!`*_0I8k{rIoUY~ z6Og3u4y37}>lv+RoG$tn#llUWnxO`*>5_BN7Me^%eEP4S40TCK{S3thGHQInLQm^Ogd6;6cw*QHh$Dn7S}Tdedjhpw}d# zaVOW~%znDL8-*CX_@OuNT2VQ$_#OyQUn?v1=b{_np7Pq$HXIas;iL?=h&v;e0>=!e zGt~#X6K}3kZ?Ie6+v2P`DH${ojwAAGDc-DF_s+&<(FpQOV3;^NV`>~mKH3BNs=t1R z-$4r^W%6tYRgkXiU;Ul>)NR{A=IQv7Su!Z4A;<(Glpv)_*7ZLiE2O_U`++*1Ff4+e z#SfV!bJo#fL*p9M=(p$t|IzV%nsSCrvx4&(jv8(bds?EKXRB8!e3DL$d}yW?jA;UhguE89Lw|$mZca4UtHT7)>5Rc8Msr;ZSf2k$2O0?&~-en+{xRo zrE64jja1y@^fRK^55z+|eA+x0T@?>pdeYWP$Y-{_-BVe8Y0CJ!{A>M?`; zCy0JA|GN>SRwNVV8w1m@$?2_5``!vkmL36f2tpdCL7Gav$AS0;i8>G`nHi#L-0Ax* zY$=>?lPA7vVBYq-oh@PUtx&cL%LbQjT!uUJWt~qyDYeA(R>G1QW&#aP$7{7Y6$9pG z2E%jTR6UvG4vGhg9McOA4=e>XED|5z@ewZJjgBZMqHvXk?_tGhz&T&auE>ten5-XX z_uwZ@FT<(YxRa~bH!+%hEC_z&H4~<*a8VBZKHbCgux@MG!>ZxFCbm+inLqo_Us8ws zy8%1r1B;^8VHPh>BNU3ReDQ>7O{qK7knE{C1^tRKm(ioS@Bw9l6e3fd< z#M5fMr@H3KoVX&ZM;vQxu-hIvya=`)EVnP$#+X!&?NP+>Ri#z^Ij_W4FdMduRGEQHIGb?EFx>$7_T?`C%mRIDBb|? z!xn_)9~YIDiqCP;KC{tm$OUl(;Dj(s-o`UXK8S=W68U2ZlQv>c?^8^-F??M<@HZY@ z_+0olzgbCldqmd{-kdw3+i;*1WVSR1a_sw1FW{{HBo1^Jh8F);t&RQF0QE_}Ql;1D zICyuJJ0=eXgHp@6!oJ|_iB4iSfc&mYs)a-cCq3^^oj7R{B=~L3+J^Dc1**MXCXw*o)y<~~0VC+8dh)@8t2A$_jAFecmLUyi>*5!5`P#Zi~h&37x>Lr2i{ewsv`RXQ|QcHbvBsNZLVwYR2=jOo*RJjGTV??*1Kfm>?gN}pV{nXWl-nuu>g8EYEkOz|9-FNH ziOhArK$B{qgyoNsTIkYya_E{CsvNH0vEzl|U@P3o9ZK`ox@cY==4LDy?q_ha_leda z=Nk~;%pZnZc#?2RQHCVsB$jEJWlzkhZ6-VCEaEEmz@N1<}NB->R1i+n!g} zn(h@)9$LOZ10cABym#$6VTK6yMn?7siy((=Nl7=|qUsUA8M?0UDE&nLjC*<4G zv1-$NKXRm0d3w}dF56$6e0tq082Q5A*C{1%WF0v&bIY3j_*`e||%+G~Z~X`=@QX={ravsd1hCSejb% zHLay~d+21BpYUmeqc5Z}tg}Kf9b7Q;b8$Z2FzWYUc^u6NIT)lds8DF<6J&4=Ea#0U zgLtfdP#bpEf8H^fdr9fexe;VET*H33;=|$(>TK1|e5 zujyBHA$XkbJfl!(H?>}$mB63+=G2rA5M(Y6$c-FTs7kIlQ1(eM4V7x4rPyK%975xr z7;>_%#S^^$8r8=sknFHY8G)f>YV^k(5PUy~h+eFaG;f!*fT(b^#Y(=y%v1nl-mf|g z9(^WXtA_$U``eqP)*mBD-9#v+N`td`ur!fFF}E4@-6moZuJ)+n=Lx5CW>~Uk60Z@> zFiH^XIggMHx440I8VgRa4l@`ztx-K4A^zv=!l3wjuqnggMrmP!N=a+X-?*dNq25gi z&bii*=FABFhD`$G@z6r%L_&1C7=VJj5GN2eM;*{7G+!4LRI7ET@QJ7@9 zU0?_`+LCddDF4=ELGWL@q^yJGZU1xxwL+0jeR$`)%`?WP4UzXchbX_O@V8!SllN7C zXhLN_K!UQxS6$!Wr=BwlP6XUOq z_ear~6=V8anW;1%C!S+r49+de+ZCCEBLr{p1`jyo?#zkNdKfb5@x8r;_y+p5dGjph zwTjmZk(wm*6>;0I9P{q`iQ50E^r>Nh*b-XMGAwP+*lQ2@u@4c3^M+a45u~@fWA=1> z14rd*O^5=QPnGgOC%B=A^4J!%Z3!3F=5CeeEUj&eFW%T6&B_@__#6PX@TuOY+aLgm zKKbKPB!*2%P8Px%k|?+xxV{zm()A6UtMz=;?tU3W@P>r$JhD%IgCo=6f=aL4ottBgUNE8@6A28clfueHTFQJZtK|9_Mk+%$n+r$+39Sd%)w^U5! zWnb)d9ydYK#F)p<;F|`!lFY3eJZnC`h?w!`)Hy|~&Dw8@?qI$IA~|LUEMHgFhkqPyoK4qHOD#^WbKA8DMLSs*N54Fen zHNRfY0wS}$Gb(XIMR+Qr-{e^X%kAP5`lxH1gg7oZd{kcG@mS;(H#vgZMMW>@99n)8 z2O19yN;!}!9b{8W92$QotrxmYv9%~erbhUMl{=7Bq#ja>vb|jdy~V8>9AnIrY4x6I zz=%uyxCSk97{Dd#JMA6WFeekmlOIyiiezCvfTt96^9zFnmF&RDoH=Z5+}ZZgyD}dc z2C}H^h$_?7dWogeb3~zA*jL_RI{TtvYulrzfk_Dv;iikx)Jz17tOd%gnxiM<2<{Vx@xuH6sWn^)L z7fxbDrX9c28X&`JImdN{S1Kt;+Vk=Ij{whv^g-2SYBZIbH3?B=`1ON|g4p`0`n8xe z2x2B(@Y4kMGGcF-O*x)(TWtx+ro+{oBk6dx21h@auc*n*+dHT!Sefge{=5fD5iHv1 zV7c@Cb9OqlHbq8nnh@n?)9-?rFRW8`*q#Rhzmp!xUb(b! z<=!7V!T9G|n&`cs9hH%DQx+s{+ugLyD)iB)iyh4Td?5O_Z@a4Fu$HW!3-##_w!TRG z@cqf?p`MBUt7*u;vp`415f*Fb{2Tq;;ytFxz+-4bq@^%nDHm1O%44-~mz}L*+(f&W z)rlHMV>cdFtK?DOnC`he<#ND{90K~)u(-d|92rh3T{1f3V8vOIRiT*0B_wD+Rckhbp?Z=|196WrV@Eu9LUN(l<)0JPe4QRye zf3Eg73<@`UzOC+s!`3H3MtpVsB{Z;VG>e$=!;I{4Fu3w1*&1Et$Ih9soTlbDpkpMe zsfllIf2Z{d^#$_}a6ObK8mnrdrK&ny>2XJyCPfMYuI79`$5!y5>}{0|az(a&qJP*k z!|}ACL@kftjf(+oV|L8VJC|fuTN2Nw5iR9*7sk%lJ(`tO^i4oV@%DTzRT$sdPcRf+ zFkvE<86efW#fF5y=<4cC|=I={ZQ33ZkAlG&;o8 z!sKpLA)y;2V6phCW9JO?gw=I%NFgy&0oL4z$gcgGKQmo8&bj%^QR0nWpy%+Eb2i3!UIr#7NyG zj^{v@^hx~23j7!>t+>548~;a7`-Jb?+rw9D_$6|S4|VgF&U-e>1~Udf1XnA9TWzfd zKdXagiCV!Ih@$O{YJAt^koSEFx?iflbGJ95Csr<4FBeRH7Ub#9^Y1bLUS)@InsW^@ zL69aY7mg#yu_qiBJ8dduP#!3ZrSg`tW+Ftu;kH2xV~l0wD!0$79TuOe=|fEjAe;)b z;E`q^g_qsG*Yidx$0v#^y{~@ zAf87i)VhTd^u>woS8-vUyq;25?<_Z^S*2yD?Bqn|6ELh#Y9E{B8<{8Z8~k?5#E4NU z)svFK?l;c7aac2w_sRc`>LC+!BeD`Lfn~`ZK_ndU)8*E<((P+NZ z4XBMl`}2k2bEa^)KUZ0-Fr7NEqmW2wPh@b8(cyPK?c;y|w+&?R`lh5GEYV2<_wWIO zAuBuIU26BB5b$@e1_+GCQWy{12>yPR?F8q5S!FxG!aNYOV$3qYXX}kOcXbkHV>x(>OH( z{I1u|1}4zs(&8fGL;W_n4YQUq5)yLlowbUcTL+HSp>nbjsO&*+tV-zA(6;M#%OkF8{=P zEQ=wS*#u3hJcMs%=KiYpJ7k{y(vdEj)|F`~yle8WjicWT3ro#=t_JQx$xStKG+2_} zpLsM<$6AAp$EZ+%>$H9MkHZC>DWaOLw7g>h*_dZUjSrZDPwG#5nf|6?a4btYqBYX< zWCEI6Zaa9qrn_Mq-bp5(>C2eu^!0ma^9=ojzGr;vpxQ0^c%@^_gB2TzZBJp#Ie7d4 z1>t;e7Ol#Z5fwvo!ezBsx3NeEYwRhZk@MF=##p9#+&j#Ua7{6g8_Bn$?++~2)H*b& z+zmT&;d#;%-&?P^WHI_{@6j~ZcJuTXK9}W6l6kqwn z_ppzyx^QR=I>EA#uO1Sv{)jhemc-BS$1%ifI%R2Ns)zREQZ~B-h{hmdXbsZ z#WLP|#K-6`CjAD)sQ+@|z&w@fK<-!X!VDLR7zkxRthdQ}$LT0P5eORgwP$Z0+#?A; zh)4^$@(+VaGd2PB?A;;K%-#2wi-Qo#3J`5z&iCy6{4d}>Ky#O9!1xO!wN^(OAowek zK)~UI=v6FNNN0aM>!8&Wf<_q>8mb3~{ze-BI)f6aOyMY`QdSRZjyrcQPp7pLIp-e_ zogeAIK&-8sClN`>e8-FRGS&Lyv`W|KTNA|!^&ud5HDC4PfPM!5)OI^9u1tk4CMFgX zTKaL95fjrbC(YUC`8wcv`LglBWfO*><%T3EOGZt-ul2sC1(L*K{jG7sgPp+r770HR zDztBUpA{7yy~g28wLg+zqC}ze+lBjA#IJxJA#i83O0A(_K_3Px>Q5_~W!B1aqQTt` z$hnOtXnFpHyu5vtzwH8#Dw&@_F)>|r%@g#VLg?0x)x@@gDy9qnhHUDx1jdZgz%Kx&JbHioc1co+$0Zvgt(`~@>);2D{2*r z|4SN+;?R}ckNZ;Ct###~n_+^J;DPxa5{|0PO5U$?}dgzP#sQ0Ctv=A$4U0X;d%U7x z)W`)Z?@N94=_EYB{nXAyh7lcdN#gn|Z=J8ty9?!fwPUn_nN4QU)FgdAhKqaz?zL@B zwq2CVEn;AI7PH#n((T5D07^mzHp~9OWOaf{3K`_C;N9X$Hh88%>$%pV93YV~%v(D?FNT2i9i>hPm+%|Wd;Y^dG!fzEW6?EZA=ugO0?O?a;F)4FjH zC~4TE0sdSM5Q-n8Q<*5B&ArWVJrAaG1ss1d&`_KbCI|;-1I71zwPwJE`&%|i>U|$+ zX<-BAbPWBS_$`uEJ5}De4)X`FPwRPG5eW%>Af0P_Id27iwO5z+S4Kv(&J6-*t?YOA zW_jW=C?3Xv4$eQcExN4@rTP`ngaU!TeugLa&{xB0AYbHazWe_& zE10;$R(Ey7r9jXV_uFI!?KZAb9UG&cN9AmfktECKa)Y9jjiqtmbthbwIWIWW!ypXz zKdmmFqNmrZQx=jWI<~!W;7m}Qd>A7xv^-%XE%8Y!0u#*THIX&rQg4Xn6Gq%?k*Em} z_yBdRDT(=+B|4jJqEgDG7%2sX3n#0R%NI-J*Xd9-g~ca2ZfNJ~c^TpOmz z?8DUb@UX^EGeDepG>JBM?m8(66DV? zT9$$rGZRw+prC#U2%6UbNsY_Hxuk@|mv)!CZsSe?}`WZdBj zg<&GqOOXqBW*&?t^9vqM6X%E+8u|VKXE@HXv;yK%(9Hf%gvfO2^Rj2NhMtVvbFHF z*BNj;4H(;0P~k&f$JAY0*A3~}Sb~q29&Ut_n65e86V#JBmA;H-c;K`#Y-Q3nCi^-x z_mN_k3*?v;pc4_u>IS^sKl;TI!ST|wJ(ev-MkiCg#>aaJg;pG4D!s8V$8VO>$1u4d z7AVI@O@ta{W?n8~a|d~fb!iBiFQNqe=A2S!u`+kufvtbcvYXDHA%lAenqSh-*5mE= zQgrM~<}rc_>ohd=Y2wz9luZulPwCGpkUKNu1Lw~ovocl#b+-JmU;Y3?X8ocTUMzxM zH0^)M#SU>nZFt2#LcV%zve8`3%fWt+EFhhk`r`~?6Cjd{ZJxY=%T7Kp2$_yhK;rnu zF#0{9Ue0HVUE;!_z&KS^4(D2BhIDLvrII9&u)B2vlkvs9P5^TWgVYG!k2~v}sC!{I z#yR-9b2hML#lrGeY!?bXjEdc6_cO?XhNDL5+1UOwGWdqa%Nc9r8YW`fQ9)|TkH52z zvBMz?XNKM$HYR3^<%UK^c6NtiE%42N1oQTAiYVsCa#6dtRFUOpyAJR120AeW0xk)_ zN`N#{veX@$UNO~B_5FFFbN_HEchzCVqDh^4cx1#G@Mg|Wdx=L2l>|KjaO{;grt%sZ zW8>-UM^m}NwVjnW&RQO1L#N^zaEO--rQhh@MGM{Y*?!r3iZrnhw2H9AoOl6OT#yG z&$o}ncp2sSs&8Eq{hX|0g~&pjUXd~3@IHs4IJ1TB+%=}JEb~+Y=hxenpHNYb9uA_` z%6eKtEY%3_0}^UG&IF^M4TkMaBwprFf8i7$n^s-`2h`wk^H;2J>lljzi?-LQv6MbD zFu@j(%s}zye_wp?n_8|J5}D@c=; z?*{)6Zeh%VS~@UYi80KDz~0y*%FH~3P3c%Ww&jb7lAvW58dgle@s1rSrz)_n3T4po zl_IO&Fxpn@W?#TmA!-;~rrqJXJzRXPsbQrWA*nmeb$TMsz-07O`L#^lG74*eYoc$? zHiHd*3UYm(z;1nq?T%I8pE%wUl@^a+0wGR75kQf^*y78X84Ep$Zg}3t1LD?)OKNxd zQ(FrocTMDcg5)^3S}hhM9^-ln%z3pQ4O}j<5tC=rpZwN|L8ljMpzgSum6Ha&-1J2d zN??X$M?q2hfH<*ZV{<%r8`nMn&^PD1IVGT8FrPbm1;m>DdUQx6k`PrAfMR32FZAhg zCsNIT6_{b_wY{y2@4SnIx;HJV+T|*gEgcXPCgeMFA)H`pwqZQ zH1cwHBCMmMlQ+Nr^t?mxF<7lPkTQ4DHcPw+E#n#nwxI~3f%SUHhc1!Xi`bsI?TAlw z5ZOnS@{UQKsuZIaZ3Q6S>JrD~^<1#@LfN))P-)>>#8UHfd>U`}cE4oU1fP7e$;fz= za! z=W%dFbb^YQ3qn!ECB{Y3Yn&$ZY!Anc9@)I4yj_g2dMl+TPy(T6{{7`+fH!!AcO6G( zc8T?_ex`0nKSDa+JUKQezY-C57RnvBh!-JZqP>WgWLlWP#2od$`b-Xrw@+ZdC}5A9 zxK>!bU*3(KX`2J47rPoxJU3DW3KU^Q47Z6}wHs%yd9=C5heW80fgyCX^9-1K#i3b# zt=*~_XhSiY{XyQP0gt`%c^aF0huiLzQFcX!yh=v^$p%CA(L~xsf=y1VvAUHF5+_-2 zMzl2>>d{fIj*VcT!fpM;(6|Zyn@U!1ib%T!E)gVgZsk?Y&5j$41FLd}SfsU~;mPk? z5Wg$;YB=d|ALl<`P2#?JBt-@0Jlt6O*Dan8nIh>cUA5w|jC5zw>YkYiIhNpF;3K?O zI@HvT1{7Bh$wEam3h7j)KPa(&Hl@;+7Bl0*r(SXw>R*X80MehhxVX#99t7KwAB_SD z1l%6DSt>ta`ww6d^iN@!2xlHVQ^#_u&8q+z747cl{8?HtKu1BwfOi%pOvHmvan{n5V z2!M6w)r4yEt_LRiPE04vU}%M%^mr?_?Zc1xV2I>QkYMu=%QSQj%q9yJk&SyM4>CBJw#*);38>58b9ca$0_DTmfdYd!Dgcn z65G9A!>M7o4n-@Ilam+_smd)+^Q)zF#&2&$YBAJ~zfuvjb#IRW@`IjO4eaH{x~R8c zg;%Z0oKI$FEX>RoI;HjDU)zTuz#q)@Z@AFjqO8M=?REGVknXEp6EV<;j%E&iexGh& zjh9!GPH2mZG+H5LfqMAX9{35HJ{e&`u+{_3Ukz_jJQbI+U*Nfjp7()(;azpIK%dr( z7k#!Eyptl^$^vv;@}heGxq0_X8N#}83~3a7EqBP03(1H|>UbQK{IC}JpV|@LB@x&L zfv;0S_dT^D(m(ZR89JKc!(44XLuq9djSh`lPSXk=aY)V)I-Fng`o!}x`OfB`V^E*= zyldhMYNG92;uaJzUg+a5%l(OYNCSs6*&R*eGktijKSqLbwxlZDUE{C|t{r4BW3y`m zEdqsl8iBc#HxX^WsGvtU##XO+2@7i^=8!Oo`UMGKvN?jGDdlEq6&kX#Q>LA< zq|(1sRF(|e#{fI@0Fcufy#aWSo79cE;p5ZQW_ldv-@7HH6*-n0&)*I?U2e6XuJ@cc z@R=;vNB>rv$> zo5Ai>l7R|Xf4$8aGhE2yu2if3Wfd^+=7S8^$s@bJV9?^?Oc$7LJ!M+Sb7TjBZNZTxR1W4u zhg*Q()e@|~i$Pu|&>JJ6ITbDBV0_a;;18RvOnfsOr`)G22#gQDYy^RC5Zj%m5R z^jNQr_Y!f)yuk5e>tKzk%meGvQE8<;&SSqYP(7m%JFp zVV@iq>vJ}NBzSzwfmKv0wTN0=iQIF&>#;H4s>Ew9qMfs8SHO2AGh5;VdHMkWfd}b- zV>$*l9+7iVMN<=g))Cb52xwaYIe(Q-4`cuwPCtOrq2Jxj%*+sZkNN*3)qh#qE&Nij?kiOp6*EYw;s?1a@4yVfcI9`lq^H#TqvIWwa_|SHI z?)#?zKL3SBQ&Y2bVr0+8dH~PS>UfrX-Sgv35pSJ_ap;sI@v*}0?8|`1=?x-~*h5yk zgt@?hf5L?c5IvjU9mrCR*wBIwfzIW^*ZpHW3&u4Z8JDY?>Y$E*!lAxVyk^ew!6wW1 zstvb?f-32DqV1}ezJy)hfy%#&QiA--k=dx&3jJMq(1k3}PR^CPox(KSXZjk{rH~RE z+X=6bPF}6kU`#ZN9$PX|QI9`(W|9?BULib2qdZIW6tYiBT^Mb2r$(f;VyNh)KX?Rc zGh17S8lrrBN^rfCQ$VxK|8&51$oQc?cq9gUUmp*#a4RH%z+Xy3W}6+1RCaE8{-xh^ zY5;=#aWyn0No@YGfEVHJ(}32=8Tl~uA zPv1?GP-Nat*SARgGHz3UT>s=%PXeInJIiXfzg6lNoAZg@)obtN;vIQ>c2HW#uz-Xy z0aQB@6gCcAZuFdn3g3Plp7_KZ#~ZRMZ);_EdXhSLL(Tv4FXHDj(%c3>uM(J;a_Zok zwa(KgpYn&}p^H=+5(4G~@k&V_x;j zA&Z8;JFq#NB&4IKFD@zBZhu(cxnHrg{1DV21_pZpN&B4+exE*_t&Ldh>9}b`d<%;V ztmr(BNF3`aaCF(4oXzF?gMhW`2VJAar@MJoK|7B?<22hU2G zv(nxcf82t1Lo1R-oHxpSf=p8%0w>*|7&G&V($RY6t#u38S|R1txz#50QFVk!exa0f zr7;x7hdgRIz6J*(hhJyNKI2n~?;{wY%qV7Mh-G^%cCSJmsAD|j$*hF=SWQJ7EViel z|EmA)n+Z+8P;K&bqw{?6vwM^_oN>jhAe&O@9)8+5NWad}7?|+Z*T8q?FU$I2;aHZ8 z9;+)Fey)YFrlxd0$5M!w5VBKxZ3TDk;+)&MMk<&U;zm&nVn`S{)L{=Z9M?=wGLlE5 zPGqoVbwXU}sCH}+#}vh79K0hS22C8w8;`Q3`Lup-APFVQQ$H@7-9e_p zF)gZato#YC7b@newR`Snni%{KwQwt*S_sO=T7=qrU zT;dTmaeq*&UzMC1{=~rulaTr>+_G#c@pfdXC?kG`YVk1798`3*qu!a8l{%@1hy$g$8>ECJGiS zLTtQE0)digRfnj+4)<32SKKeib{I9Uaudl5zkHjR(PflNdGjbAI$)0BgS^9gk2#3n)RZQJjxEgj8?()`{u*E~=6TK{ojA1mh!ezZFr#!WlscP{13d32x9{}}L$!_t;Ztd-RbnQq!(nN-`68O%iMebG zSK!izKT2hZwkW490&32i{AU=S~# zfddSe15$s`*gOgx-p}>fuN91zv7MBD4u4Q>3JVxLjTd_Jn}JfaHEwu#xaq861>|uz zRvPoCHc#q`DFZeT(WDFvzCQp1XyW7J)41GI054btWN9uyZ&|}}4Q;6-E6ZzjRUH6v zt5?p{?dKFM@Lw*sdPdXP6M>%B&6?weU;-dDv+jDq`FwXW5QIeJ)xe&u1q0Z%Mf*KJ z0%vc$lIGCgj$C1;2h1M;(jdARDO0Hp3NHTeaRee~R8-U-V0_$o=Ev*7AnmIMHv-2c z2v6+;eag;2B>nGI7BO*gl@>d4R#sN-M{M>GeTPqnftQCftff!nG97cGo$sfT%$WR%XitWCl@BJk6Nt4S`B$~| z?LGy~n2TiT@haG)f@Q1U&Ve&YHzH~pG&JhQ6E^Qnvw{;Z`}AfaFRcg-9&*)5@z1GPiRs}SZ- zp!?O%q8grz?^(ywC}vJiBWtA6yvr0qW*>^9$`nJG@7oBBwWU+M^GY}-a@KWK=ew{` zS#X%>&bNAQ#)ZOIP8S++CMPmgj?5fz&Vt`pX`dfI&l^`urHC1p?vqh?96&N90NuJ# zlG;iE%02|t+&+gD@Lvej1I_NNu#E?~XF5E>-#ds1QRUcXats_5oEINuMcqmFt&gU_ zkPhhIXTOd=Xz4s)0rONw@2KFMod{xUR}u}$=QF-)(>ACi3xW=|5*1g4n+4y0f&Ev zHY?ZPK+Bi9sHRJ6eT~v4s;>k>!h!VNvdTLt2BDXiZ(AX5T}khP{&C}bKNShWRdnOp z&^1QduG!t;XXG!SWJy8>^v@V@z86s8%y{o-PgEPnP+#jPGmRYf8aqJbNNf-4D!Zl- zNOsO*hDS%kz%hXSGp>E=T&a>UHVo&}ZY&Ef{Cv6E0MM`o06ZwqkC%U;`y9d`N6@#e>)JX_6r)TE+^%yR=^Zf07on;8SV{6NrfPSg9HPb z<81VXquQ^JS3sTPa(_Bhq0tJ=crlnQk{wNBy(v40OHCaCU`Eg+R$If_Gb4b=67yZ$ z<4?R6CUDFFQFwayyZPjbatlkO{r$va7%^mI`b0NtvqXEdeYDz@p4?>NEvd>$fBl@d zB@fibE3mosPv%^UbT`$o&$H$QZ$uEFdJMOR*$2JC-%;I+yX>?mTm(Z|=K^$Gkxsa6 zV%$7c-y)&sCeAIWeQt@@sF)qCqwS7;Dgv=!pZCXPzoCZp*e2!H%hmM-8)-$U`eLbV zAcf(lE*}S$7dVn2aOM;q#*Y0z>fSN1u5Jz6jnkNo)0mCzG-=$RNgCU>t;V*|*tTsn zwryKyW`FyA_jmrCzi0kS)|v}r4Lsv{?&}_nBdrj6;mVd`WFF>FY<}(`eUYMOsogL8 z(3>=CgY_B)&Pa(l9cvPM50m=2DU&RQ(sMrj#(d8nG$Mhr%l=QUCKxep-fJtgaxy=s zxnotlBnPzidLNtM_RF8(WjrLHRAQM@WfH?DwXyjV_n$cz)lWKi?>69!8Oe+o3}%d{ zV(QA?Uy-0GuQR?`Ng4H8>^~wCk$-6$^4*mnS(4tb%4|HL3ShCtAdRhGjd-nDjl;i@ zHLHx`@}y2CF~`aY3*MNwe5t}P-|mmO46TQjY8lAP?n+7vE0X)8bRV!yiQJ4Np1%VV2oJs5%iJkMYh;{;C zjTB+SoQf~KqDfyoE_78$r5`3?HoE~y=sqK(Y}GZ#ji1WXg@*Y}Sip+SZthSArzaV3 z457Gfe@v3ck27BkIQ~i?aFYa`%VaOb2Afq{W>f4Lms0s1vMCMMb^h{o%2?xC?UgM~_+ z=+oko5^O&A2TrYyABl9)@4$f#UNUlWs9zOJH?Al62ecU&?Diyq!rY}8?cd^JYCtIE z47`}Qn6Vf1kk{+H7a)KvUOcYbT{5fq=+tB7xEoep&$vJ`UmwVW8LfIv<%$Ef=Bsh; zbKo#-$EW-A2_O=Prl#f{(`Rp(P+$W zfsq$Pem0ws-3T8rHf_W%CUS2le3I?h4jxJnOJ}NE8F<^cRIG68HZyvBVuH`fXDqL% zpN;jW!{H^LcG1sOY#Q>~1Y_d*ra2 zHal#G1W-8rsW@$YOvv$s75(k^k{8im#we3AlD2vK{kvTchpka=juN4|VrOV5zP;Jp zqDo)GEwAQl@BH4e8~bjRvwYm=d4sBr$%st8^FbPo?OLnpP{IT1po``+21%cFv8Q)v&R@w=PyO2gB;@s_s9u`(q^ zY~#kqLxRJK&?+6dnRAy$PkxQqluZmwUSWNEC`|l5ur#P&#wApVGm}L4Ji>!#MIyFD zRzNwZfuV69kvz5uQN9))ddUwriM8T#E`BM~*g+$K3NOP}D*atlVL?f=} zk*#$kGp-uuH&DE$3xU_uU{~xepK`Qu_`;tpYch(C!q_`dY5w>i}bWJENFGCuNNw#9VJ6h}mZ^Qk5Faol}lQz06%KhV=O!Y(6Vt(1i+`nNI z538h-1n(%1o48y0RR=wJ65*#8*?+YOw_ZqWx@5el%&N+*2&bRaU62P~H>!=PE>I5tz5bD#V6Kc{@X?heEgkI({xQ zF1FacQNDsBTu>KP{OuEJk7t;|PDZ4bHX?KI@l_TvJ!z6@_$<^~!>Bxy9Q4j2hAt+G zaX+2Es$z9GK{7Ldz{x$s=wER3)8#f`8uD=2>y%@3{{!~`B8xo8?p8M~8j`lTs$of> zi-RE*Mysv!D%^2p&ZMk{nic9Pa+tUbV1{_(Pu3KIb-qC@eovV{p}FK-T(36VUn~?D zzKcT<5eSCad~~PpY6nuS0T#9hPwj{LB11R`(CQ2G7#loT{lE*ioJkI3$E*L3^?doy zhZ&IF{!gCibY*2`E<~BP$VrlDENg4EHY8h{pqe62w~#8wI-d85FC=3onq=WFCEemo zvb5(W(caW)VPCt3J1pOdO}>0_>`LZzVdROS&-LtNyuRq6fzD_MmZA%S85?OYhO=`a zn=|g;PSnef2;dr-$nk$CCB**znPovBqdt(I8j@3$?ozyL`CfH3#MPDksY*NMGwQ&5O@EEQR^xbt61HmjF*W}+eL0Z^9 zqnfjYua3ZULpUx{aV+WLJnMYEO=D%pMcT>H_kiK+6TGu9S*Ufr<66(wKZSZyXNELW z`jsDNyZc+e%h3GO;2)`J? zq;|k|Q+aULBUPTRVp5h$7|d-?bHvsbW;Nu;>e$OSI#N6_%8PYKgFEP~=wO!TxR&c`iVH1dp*6U3)n2(XIPDA;A*UOj9z@OiaMbp6*^VpXVIf$hPQkq$Na+|rJdJK6G>bH%SJ=US8sDe zypo-8*2f(ameC~r@9S5*5voz0eB&Iwp_7WYH@z4Zk4T{Wc!V)!bWz3d)?PO z*L_~gl#0^0{HS^)qx-ojdiY?q2J;xI<%!^o*x!nT+3&Cro`euI0M|TyaNcJ_2Fu1G zjjrkO5JAlDUti!*MAo^`Slxer5B#1riM-Cz=mtOmM_>}I+FyA6(k|-{H0A+uQ~v~E z!5-~Z{ep|xpVS!`Cq|b>e@9mNijur~5&xuzy<5?L!vUf<0r5(C`?TMhx$Do|P={_h zreXE7{;F1(IV4S1Osa{YF;08(Q|5WTWKBvQd-_eG*>z+OrYy=7JVN7Ga5#d_Y!=#3Nq=AdLyE&_+}WnXSbhQ>*xR8Q*k?2YK45dGi;<6nQ^0 zA*x{(sR}5#Ob|UAZZCqDAS{;JRnY6t29j~iw}!aMVAvVz>xMA-Y4L7lRqh}gvrw8w zwT%<@ru;P_CHeJl@b>=j(J!NX*UU}U9oc2{9L?$Pz98nXL4DCfx6kLxy)zq3J250N zJ3@EuQx>mblNOHL50MsF;a++8PQCO6W9q%bW+6MScSf})={5*ODI8m^AwY?3VGzrT z`<|+JVA%$Lzfs~lj{F&td~F^4`+cDGDH1Y#5ps#}x<()yWAaQp#uHM_EIW6|q~4)f zA5J}^y&!pg0xQolYEBo6KCIm{Vz&M()SZ@5&tH)uIWkRXA124Am*3qWI5c`94XcwH z|9+P(g24&tOdz@zbmTd>wo4ycnA|7nmH90(xKFYm9$6%cn?CiwUt6f<|Aei87|3@{ zU*uUYGvMDj|FjV6K3_QW(!8j`f%2gk~$<}+7 z)BB}V!u{KT76!3JDJ?#V8sNs=VZjNB5JreDA{>(AOc`4ZWA3c(L5XIKD2zVYBsCEb zaweI4031(2#6e~3B$6Bleha!QiYG+0)S2;&#fqjDCmwM|J)=NnN=zO&rL$E5dP`^+(SG?LV1Gz&t1`9 z1au1ixhd%X7jM<9a2P?8@zf`Ag*f+nrhQNQ%hv=tJz>v7$abcz;vj?x{yzB*Cs|os z)o0cY_b7w+wy(8!|9tzaJh=s`)k3KCJDLo4cSx7O@9^AhYC})E@~u^?qQ-7%fyKmg z{0(6gUL9y!K|1^W>*d;g%^rX1F#p?P(e&~E-`m_)XXaty_Ss-GnJ2C>=#y@MltCAq z8r7-2?l`O?FLPk(G)Cp0U1HI?&*rav@Rk{Ogt!s43uOTM_G6#nt{IDtu}>S9Incmt zSu3dU-9i?PUkQ4!&Q!QWVXv#6dsvl?K3+faO;}YeVfEVv{LQz ztwre-Ru>dn`I*1g5>qBh_dt{PPdmKJp*s#k{v032G-R+jlP?{*N3n6xGF@>gZ9O+U zX(-c9?C{etGoyUh$5d;;v}r8f^<$00IMqLP#B5Ko6VU0?Qe@H_RX%&##c{;<5m4&c zTs3nxy;COVixV{o?X20mfZCHu9`>U*?^HnR$-3w{QGG^d?hiBQYVN_OkNf42lzpnU zcq`7yO6xZh(sQ!ncvsaQ8(*z6sT`2cJ2)&wM@{2*pa$6V{D5n8dBIck%plCm`<;i>~e0!G8c$KEfOZ}q%Hm$Z?)_B77u9)#U3C0$;l>9z7 z5T}&5&8a?)y)K=Rjv8_N2<_si7K$_Kr^d;1BF^S1CsjA^w|_qCyp|G~POP|gPiTei zqy(bUvsrhY{N5Vc{{BY8-FCZw{aW+p9-L{Zy|}G{PIdmzNIJxT<=*whvfq#8aofkz z?U}Hw4AXh@;uBASR6m}ptjReVr=whL4=C;E%u*RqzDpN+y5oEPc7nx(Ij(~&Z0_u4 zbZyOQOFWX{>0L{NfQdPtb}Mnd+0Wy?M=@602=4b#U?hXVrfvlWyrl$4w#R$%gQY zawiCzD#o8`0xk*&)^9pSE-tUgw^uTN)T{OO5^9wrASqX1Y#iOw1%A_F<#ChI<^~}! zJH)nl;(8dw)O5J05Y*JYvEFLedc;m-3EOU(5VhhUAof4A=*Qau@8roECx1pCXjd0^ zYg%#{|3v6c!x)_okzO}jh6$35W15Qih)A&)(*BJ@53{Z7iGVBsJ##KD6M?8l&GKGW z6`V1FgLrBA(LA!K+=Tb0HNPGwMcvL>AKrt+X92sl@lcptRj`rc-}dchF)C08`P^9F zq*~$YssGjGe75tr( z{#>^PmveL1`g4kv=U&$wl3J*zeyRvvR_JSq7i&Ll8VhG1hxu)68V~KxQU@1wJT^BJ zuMlgU>HZnKQ>Pz;*XE(lfJSXu{|~`Cog$rvNCv^ly11TDi$a74w!RE$g4|q3RNvfG zO{y7U&E_Sv%Aw_?>kOXdvmwW4EBfuOd36?%kh;%}nn^tiLvYH|whp%`8ayrbkS#cv z;>DX6%OlPg9ifDd#0`0frVWBDD@^Fm$c2m1eZvhCyd5E{UzIPMzl;%&T?kDXN|V;t z2gU!$Bt4(Do0qjCjf9@=XWCCLD-+cDx5;$T9VHszkO+EvXK3eW6?r@L}+g3kK54$1T-5|z4#yBIh2v@?Q3$$+$=3e+5^#32$#rr59~q zGcDq@_0CSwS-D;bT|v|8^KrU%eiws`7B|tSQ|n79*pI@R)lZbyj;oEtsAEcjfMg*)fPyhr8$02!-uG(@;KK~5beyyh(NHh;kGWb|L*@D zE3t?r&Hmwm>7W;TTwK_Y3+5n3=DlmJ z3WLkeF`NDBnI+f=JgNe#ZlNiw=7>QC@3mg|)c!P+_sBI+0^u`azDS3Qwm_{NAddOD z{TOHM#>Fg>$t!&>R^DQOXcx0x$g9!|y8nz<*qEksc&+lbzR1JX#+#aM0b~YA}#QJw{6I(0vqQP9FLLO#siif1`s3Ca;i2^(*_>Rk8PcV&R3*wDVpzcUhM7C zKuY}qP>;eOAq|+bIW4^rygxvHbd{Err(@vJ?g zB47!&o-TcGY@!kCdO`ZilP|wVLI?!~Cf>)(i;wg8+BzPIT<;Q0uFS%HzwP1xA5s_ zs|9|OTDtQvE?Y`fgrvb!*sqwUS2K5_Xjh-MMH;WXs0opAgQ4`Hvh$v(GdJ+}I|ZJ) zC8w*_Y)88+p@r%(##sgm54tF=C$QxY6-=oRCxl zNMoPU4|Cy7E=8u}?~|Je*H)WKU0Oh7w*N>eTS-#rxx&IQrrXyx6SlWK?zE-{wv_7V zq0h;(M>CSt^}P|@H3Up}DSUx_Px`TTru$Mb!5f)t10eiIgyq-pVsw?&)l44}b)d2d zST(+H@Z0W)ef(`WnlxGLAuEgWeo64|GkplL0Wgdk0C=@K!2f1bQV)%YrwuUf=8M&k zAhAP25)!ZzJlF8R3H2T|^`a4YI)AeSEHt+PCEr4amse_JI;UMgZ?8~xb~a8$Q4t_r zhzXo0o;1{{zjerX42iZU2^~Mai#FDQ+uiY zKYAkVJTT0ho*9wEx)~S1;sdSEouZ#pd>(yVa^a01{+ewpUvjCOH^PXhh&S6XOlBr- zEow<_rmA2?^Ap+q8qf(F^hdw)SpTKr9pmf1qF@-4pNG6oeQ+ap@kBj$c02E-gKN@* zcO!dpx3Lr_?2su;lcbxoT9HuTf!_K?YL5C=x2kjJ%G*6(_Y4!43cpA+Z z?uLa%;%N! zWpCEKwSZ{Y&`7y7m?<)K%g?{m13&vbGqi1ev0vkZ%-wRiH_&u0NXp`j_#MsmBpWcG z5$5ECNT1LX^H`+3{H(TS?=`9%NWoYs6QxIH*?PVoqkub>J-TDaFZp|J-Mbm&jNdT! zH3KPjR$%Etr+4;oK`AQI%#N+g8V;^tNzg?!7H$V9#7k!IBuDZ+1>ssY`iRGpUjlvf z((C=+%l*D(z36ri3uZM&OKT0$Q+^PSJH9b8AJiemrfrB@S;==BS3~` zJmGfUCMqrt^M0%KK5PCAh;LL{of!cd?)b)m!{G>BhW8fzqvBaqUVc{V;Rk3KLb>5TxgNivDS2f#PuJ{&6mm0Pl}FBsqv6QXDVR8J*5A9@H-KYtbm zJU4*vN-13s3s85VfPqHs^f4t3js9l0KL9#`l($|Dvn}(!zg({**tiW&oV!z=m=_fX zm1gpflot!=^SyzXq$>%928jj2zi`*Myg5*KO7@a*2HkSo#5b&RIahkOu6jz=vtZCk z4j4jxl6x80clOLleQ{?e@zkz(?`t2G(ri!7;`C9urzPAquT$54SItmU@xERKi$P-+ z3o?IJC&rYXKYKQGbG?u1n1-IuWpvP2FVXHhUmERi^URK6dYDWvS{Pqj?++GKEF0`^ z^MF3XwOG(wYNyL&PF!j8ERFT*_E;A;F7<@L*uA)^xV~&w$4ia-yI>WT^H}Q22B z-o1o!*rzfhJEq?DP1e&x>r2e|&16|7biofBOD7Zt74hdu!KuI*uT}dn3F5Q(0+r)6 zpXp=c6*}HzAFFdiSRSZ^%UEnDi{tlwji&Rlc@D9b{Icl>s!{6PC=QXZ5Pqe{*=V2x zzPN|@$u1b>Gc=an$2OZ%k9_?T89&*O>-4k9f_OYbqPZGthX3j-?!O+@yzskUsrB(&iI z1|GB3vA%8W(JW1TqvsM{%&q>V%fH}*4{4gXPfzyQvPOmHLKHao7khlRsgx}&*7JbT zy)xDZ@ywajqJ7r%%yU3ywQFmu@rdqvqTidzYd!di5ElEqaMzBcE1%@F-Xx0}u?c=$ zVR4Vhvu#l(Dl;hjp@O;i7%MfIuVDjU+`@j)6=k*%($hx*So_t-gTQ!UGdsV2bOzX| zq0v!@jX!~3wgPZ1HvrRxO`zPrQ$0cl{;{S2kbF#H;y@ggJWk{!@`Zts5wq9ZOaG~8 zl+TGKW%xb2%U7wbQw>O^t!@+?S2wJ+fFDuRF%h~SVkG0;1 z0WRjkwGGhse5@oXtY%!i=Kyl=ydBPt#cGPqzG4jyl9!kFp>`l8CH?vH=if$8aa~>F zNZvcFj@MJG$3qz}*v_-n7L|5)P5?B-ce^_wBF}&G+Xl4DW24E;fHGp&spAyghbi9ckyMQFO+VgZMfdZVN2jEnWiflVZ8v**wFAM;^q84phGMHkeL9Q@0;!K1 zp7!n~+*$gQ;>!Q#z0PJzB4SitGq}dTh5U9*;CTHVlyw-4v6ueDQUO z0FAbRmzFeq8jUlIOf9cu95wa5(t8*VPvB;&?PhX@7-G+7Vl}X?sT?quBId&=ypnf( zTDd!t;E$hloH|qOm;b?>yiz<Zi z5uYbvelt_Vqi^lKWG*xeUbc2gsJ1?85nAlhC!&e_c7C@wIn41$KkM-H$@;&7HyhFT z=AP@3Lyl;%R+Y8*)SjNOUM{RRRocNXx>Y$H#z^H%LAZP=xd8;!qu^rAjM$<_`X}Hi zc=P&QZ1#lY@YJufM}!DM_dP{Ud*>Yv{#g& zsT#s^R2o9CM;X-6XB4(7jiQkQK^`XGT^vCN2^GC2f#dA^WyiHniHQlw$#O%*;vH}lGUx|{ z9N_*6u=J%`9T{p?ZFMSrJ3iphNh~I)&-dp>w@0(5id}_{JU&p%jvwSq>`J_acb4{`44)HNCF$-P93bJv7Cv7n&@wApvVM!{-OeH1G z4EbQMZ-kj{ zI&5|n%;S?45(S%$|CO0PH0bO%JS+h}s+>))uzq?8-`c`1>kL*JD{ljDeEk0K2!Nrz zElx;%1yZCe&x~a`Kd!fzsu3b}h+J$#2@8s*uJLiTt!Iv0ssPze7E z9Agx}^p}1}T`XfoneDz^%IT{An_)=3Hlu)v0$HZVc#|A`PRa&9LQv3KH~&wu`y_yq zx!xNytK+4i2?LA`Qs@|K-0m&Ge7>bBBmz6S29C@!KZp#xcn2tOlx10>4%4Huw|#GJBgyIV4I0W26ek8T~7>bVQ6 zYXmH)TqbDs#!wE-WO5CsC92of`ZkND@6l1BrIVh1KL}jelf6k0W`}DXOBp`hhC)$a zKKlK{%q`7EWzY*Yc|@2Is_hemfb#TG!M&~zR(B>^zej*47^j7FKHd_`;K4B?@bjMk zeknlE^%GVGOcN2GozRfbCY=^)@@i(cZ>M6UPr;zuF$Y|2dP_?F`(Ny>Drt_z#y&q3 zjKEpo_G@AR(rMR(k_UBI10n6!y?k^RiV|%{3 zVOY?y@=$tgU=N5G+m}#+CCc!(?MTiGRr7GN;v4}D@_!ZuFSJ1)-N2J#<;3xEAQkv{ z$oiZ|RMi&`LEb=;B(2Rw5mibwwP%kx(HlJ=zB|9exTc>x_IBE@8jIK`bk9Lvwq=f_ zlxYT4*-ax<0XbG4tT<2GPyFOG;M?P44vQH-pO;Pe&z#mtD|$QA;zkeH^79W~re5%_ z<3KUCbaN(F_cacz@|C%Bw#Oju_bZQ(Akj17Q7LK3y`(Ujpb=>1Bgy}aQEBarC@L!t z8fAL)7s}?~G_XQgS@_$rr6PC(5n=7L9y^5Z`pX`Bss+F64W(fXHo(UN24+Jii#0>T z!&@~Id}jB`B}x^Bo!+3~v)KrO2wb3a9WPgO(CMcNabNg7(<&<&%A1dh6`xstm9qV+ z1WqEGLB3*h0QPGjTU%UOVufEPZgu4h_EkPFYw0ikxGj-UFjpI@wlE^74BPZkTH6>p zhZpqEAST94LLqg=VMt}HlQpp}YGT*^sbQe)j@Xq2s)LfW{9w@G^gH>7&qT2=1dIvi z=V*F`tU~XXSApyk?A(6QE)#n0HFasrqV!RNYis)(|57?Jn=ZpN&`N1XWP&N4I)PTT zx#~S$_Keuuwq7=uFUj(#?ap3|FC6c+C{CZfk@oY|apuUvaalu~8rbnsP5Rk5V@I_y zRMf79R5QHJII4z5Uv--ImQOEzo$IZ7RSb!B`t?s)S6mu4E_sN)B1*m*?vtdRW~M z88eIo$D)qgV3OSkw@+^qBRiG=__aLG_Jv|NLtxL|&DidKSXcqrMd(HHLUjTN!E9q2 z(TO?NPP2f$Jqtimow1zo008m?|2q%B;guWeV7F8Q7P}93F0ZHQ%*Lmq zlGHkW_Xh^G2FqQ*`e;DExzQ_hxrFl8ndcNL4k45VAHbWvv6&4oJ|e}BsZ8I{kZs(- zb#D|q!{bgR%ahBe;;Vy&cS=vGmnT`ds`em|=-e-9t=dz_&`9df3ehNSw4jEY!+)!; zkQ2%sVCG{#Dq-l6ys_NIFW&U)HFBu?sT6wI1llH<8S9eeC7kJ(RG9F}j zNmUUNdUNU5gRQ&>1FVMb=M}sDO4pZcR9En>8Yea`3_;yoJjrpFj^?l4Ky1?yHxtM; zrH7qI$7XpiXwFo0T57x2wf1NIJ^<<7k{xubavf`_TL|=?4K~DAC_GJ4g5uf@=ly1W zy`%~VwO(Lkc(37QXj6o*KrYTTMvhZMmMbo3=nVr$kX~c3(ic4Yifrz*wi_B^rES;g zI&)-=uY@5(hMr$l)qN;Wrl)Lr9cRG5A6wNKt@GkUpj+=pKbNO|!xUYB;C+1Ylp^x9 z13P4nN%hauIJSb~b0Qj#(S9wyVkeK~q24@+#B^Jb-g7O`|6{jn zx#9@`lir4QvTUE=;I@Hq9oXB}x~%}(_~_{9E+7PP(sVt{GyGaQKQ{+#lTiTi?*m{D z)bnA$Mop~?_+PmLF)=`)G|XNuslvSKnv`VM*4HNhEERlB)9HO_P|&BWc4~BOKxs^) zQvDn7b|aTg8w3n}-~dk_tq+e!?(^0jz)|i)l1%>++w|?%BFqflN@t42XX{(@wZRG9 zyE-H8iBd^RT-LxUyZNYFb&TX98K(U6M7J_xV&3E;oyJm>-1=E)J#Z3b{;t=&{552# zJDoTc)7kR9JhRuFX89PL{IZ0#*^KM=(BN!s?H4Cx z@(JTK(I^Zkt=Lg#!)5?;cC^wTYX*az-(|R1a>=!S=xac50OdKKr)MxsZI1GvHaGez zbjl%u)N|EsxAWP;@LJT7NmPYF8YAXa4zV1-dg0DVa&$OXsF5Bk+P{tW(`a_Q5bdqz z#^<}`HiM9G7=QXXrv&c;9U9XM2NXZZ~6eX`)N05*WM(k~w7xdo1I+kTv$ zT*=p>+SOz_7L5^w=!UO5CbN_0uVGu@ELK!Z}_B;BD&K&4(!Mw{2i zX$YOWg}Vss1|4=|+J^eSGYh`^e24JPN_?5G8N9~ol^WD{v>os=`hIF93v49S6A&Gs z-Xh@!H8TC|MvhBc9AT1kmQn(nS#xTm>MHhrs(<$1j#w6N%6G{;mGFW)nb_=`Pk+#h z9ZK^mM_uBO6`QlFrgEkYJxKD;b-46UqD#~IaOP8Qiisu>t=8=eF^pt|pD!;gG_G3* z)-e$S1A~@pi@CYEtG08GNB|+W1M(~N#YZQrE$pf6R(rVX4^TaUNYL8Anx$`KB(JVM zT5qxZFrxJ~#c|VC7XOz%;f`7Yn8~Ex-(DYq9OmSnC&F?nqTBmyg{QE&IZemYY{iFE zJ+;mY=%mZ=2_1Ad^`9Qpxfhin{p5@oE|2!!nU5Plp+h-Vnq`86#GIJYH7a-PD{t}q z9gs`eygy}U)Ah_8GmXlkcu(%R2V_F}&448ei`J=-QD4Q&DoA!=P=MNjQCXA=t+s26 ze^FV4d^n=vFnr6JP3(?J&aMof(7vN(E2P$mEU7)7rDOF(#J~0%8vfK$_xRJfCLX1y z>>J4XB?uJI?rDwS+D$fGv4L#j&BPX@>nXmhG(pRU**7dx;Mi-t<}0T?_ENC&qK>qG z&!#-IFn5xsgXbC0fPKKFol?O6NmTI%rm-_n^2QAKp)4n>~ z#mq0S(J>L3Ma3oz56@QHh(KGf zKljNDXbHK$3_aNWIlvmz$sWGD$WuARUYoLwGgV4A`z5p2Up5fUyeVR4Ei@tV1C{w) z*3_#mmQI1ETnNbk=>0Xl+DY!+9~S@|PzCU8AL4&tA2h2Ryzr6L{kE)Qq@wbA6^wBG zlg$BGE&zIWvC)PEK+g8(JKh{p_hrPzP2jn3fBX9bwomX`o?j-R5WfB~($}}&#CD7f z2$=$Q{`IajfGre|^#)k+MES(TMD}`bKCe5H ziU;bTa;x`D&1UE(E#Y&WI!_nnNzFdvegEHchwa+QelQ90Y~5_eZO)b{yAjtz=^2jz z9T0N6W05MxPg3UX#kj$*#w1$*(CJFl@b;;lnh`Fl(5F)q>8z+lM8Jxp9Xz zcZtsHXkc5B+4*8!rd#f|wMyRZ9;v8DN{DC@8W$1_rpdF*6tk)Jez|0MvP64gz00+I z5LJLpDPV92*wW!*;+Yl`oxB|6M5^-E1D;?1Nnt8R@?H&+06*+Cuo}gF*u|Y*y`&(( zUAVtJ?uRccR%l8oD=Vk^Z2Qmrl*J(+m;jQoYP@R~IQW2t4}dwgFX{p14v;0p-)^@@ zeE>+A%dN|OQUs1*w zt_`eQ5{)}4WA@gNg_a63A71a*E3fntZrr0m=;qKLyLKF+g_A8oP)Kk(R;TRb50sbW zkE(_E1be<7KkEzGvyi+0diu&}{6!F3!4s^&C&@T_*2j(X#|y_)k-36^IYi{9>qIy| zwa*T&g)+@$c+X&q=Q{4g3b%ElpFwxyUqy|x&tn_7(_%Gsbl>3;0O0bGVHuBfK&^5? zH~15HPAOJFO+?FizfQfXqKFKxN7hnCW6!kfWfpNUfx`SGEcdcif?4L$XM>Kd&Llx~jg z=PPCc6tK$>Y1H;@MH_BX9CPfwoRpb*_RXp(2=Nc~h2H!xtO^=gMp znC)0k&S3#N7}>Mu8X6oVWMYc>C@7T`f5in-hd^NzSYybs;Mr1BZ2^cJWNhyN;y=!x zA4x$iFJ8dK$+G2ZppueOjqR?)hkGuN@1HxV@_4=jWkFQ3Qik!^>J2_cOzrrILoMO; zcAaTAM3z|V`T~SihJb6R`ATDz%EICG<3hU!H^2#6Kn;PYsHgxSlO{5O1hIDdL;V;*v5kFdp~VBv{2IV zb`)iWYiEt=l<=x*Aiq!Q@~nMhm&&z3E;ZHrE0QkTfm1)2*e!H&@${Lm4HB1%vN)4t zjnXdmUIg(YFXC<|v`w9Ac>GCZK@v;&ap&{~pwPUQ0(To z=Su&e@NfcxLpJ#}g{~k0NJck(A0VUID;|3NU zn(BTxA@PzYJy-VCxLP^N$I6QCxZas)bhz@tQv-=x3S;Z^GU-zOVIaIQrF9-!qW*H+ zk{{RySvup9ODYE}1+)8Exgj>4KjZXh=8Q_MJ@JIdq*g8l`ZJs;w?|~7%8Qw~OvuPk z=uLgn9n%geIma|XLp=(`XmMT#w|+WdxMcW>7+AJS%}BD*cX~Is?x$mDt~afEKX7;` ziElvydw}VMb~rsPj&0fX+h;vpW}Mqh$S9ur8vB>96n2@b3PLtvAw)T?_h5 z=i{Ag6@DF88i!VZK||(^T%{$xKSc_S%RViskq4-!B)nub>znm#X$L4!;yczfJY55k zB{8bfc4V?SeOo;j^n$C=uAx8?_U?9?lRenQC`5X+9&yZ$=>-lkEm z^nQ4724>brKt3>eRuxIWpYN#g^&c(3H~^nE0-gT>Ao123+bS!gfYFc+Oqm&}md!B6 zUSv0+fBt0Gw%M%%FuA(t<1WwbjQj@#3zEj;Qhn*5(r8Vjqf^FhCoBvJ@PuUKkh6ovP$d^+#v zv3=#;5Q91JWKwpO-^3T2stFy6zPH@K&akNDYfeZSzkI1*m*wJx5slc%0$mawf(;J^53#>@N6h(b^GIUbZ)>xTnq+HuX`{Y=G} z`pdYj=auxaVR(<(SB;(4#*pqK2t!)UZ^IzWFlYs_M?=J3UWhV$>fWquzDG>-Z`8p5 zrPaVU$+i0VGKk)Aja6H3}vj;bNg5blMkmuIPda9 zeU&n^HtuDWiOnt(XT6A6De{@^eAfbhm#^O&@oPavo^-kV?L_BKL78(rGwF%EzMqJ0 z>tK&)5!4de#Y3~n$&Wd@ih9<>kZHWCR3*;pkGvNGxsGD-{A(&L{%}UhxXq9HQBiE4TnZ8AwzwyfhI0$)ZEK`zudiFTI zysE0IULyyfcwk#@ecVe8hlBly7d{-1fh7BfW0f~3_`^pn32=ZOKL_CO{y+15eSJV3 zDiui00=~*PflduO*N;Kr;Rb_IL?1PQ53vC}0bg2fE-45wdv|?SQt!gGYMoRTZvyI8 zPj{!w9EtmVs1m{^fY;=QUHOOK@<%cRo6DXu)BOr2K*xI7q9Hmtd2!DZ7#SAod26~& z3<{p_*Zl82fRoWEN#xs&FJ1?kf4Whh=f3bAent;&hf6}KgWBRQ-)NRq{^h|0!A_To z-a&~{v09P^Rj#I%*xj4O$;Fa*Oa8Hzj&BB7y$4qB78c5qb#DHVuk$@d_4xy4YQPdY zw=cTrr+5f0G95!G_{yKJ17Nmm(w(k>;GtBosXfvs<2-ARIk6AivB%&y3*M9sm6Uz zIG5D=O}fjCb2QQJpU_nDa#c9o-|KJaeW%klp)d86Pq)gbxq8*;-@}*xq=Dev9(LQAQVP$ z@t{a@x}&K%q?EGU@po)-0*shpv?oM$os2q6RCg7N&t>*V@+>yrQL{SyN7@fOkh}#{Orj2m#7U(p8{SE(h9uS9 zdFCY0{GOfgTyNS}x;GvmMqg9tQ+yUhmR3W9mg~cPJMgj=wtIL8eRi;&qS%Ov95UH~ zjJTVgl%(NGQWyTO_~c6I!JQ)6Lv*i=-b4E>YFKUUem1`J|F_FNY zp3$oCd(*@xwIs-a`$V10t*+gbo{To>6$7ZGGLC$GhU@&_cJjmZv+Ld2*v2)NDxyX$ z!-7IR>we`o?6Hso_>sDH=yDfpi0TAxmwH`nQ(cYX?3?zXJI(tWt|0uBY?r>3T+NdR|Uu-|brWtVB-2_yiSjQ z0reOlg^&Su)qtwU2q;5^+xaLeD4<|q^aK8w!C~;PNEaP|tG}#*!t9CrX%mn%OlC2W ztJz2R@ap}rgg$FIZ%4e(WYmKK&Km$!2wwqDxmXku5o^F9@dAkM>s{UeKe!9PU;P3g z&@&cZCd&RIhq9X%G=G)uDKKk3_n>u zeQ}blTwmu{mo3>v$%fv>O8uP|2?QXZ(pHs=Z+uRmbQJ<=ypV3@6dk?($(8WeUuu8nb_0Mnt>A^M5_MQ@{*VL?ZmH1^6riG3?kfNiF z%PY>*Ytk^&Wc6t@g^Fi|I^)N#l+uJ&s=Wan4@>U%|6&czeSO2I3Kf*A=huV-j{DK` z;#pWX97FfDm*3v?p&}%^b&LCtmn7Goec0~rn_w^etvT7I8SuW3znL0^7%qhQzv$cI zcHk?z40SBnU!VRRp6c6XE2h&^@pm}Oh9+`>n+jK3NE{jpZOyW@q33Zi_5(Tk5oJg`eiXcT zU#*OgMQ0gbShasYa{iJVEpnvhrVqysD(k>zC9BOmT!!CAEa;jnu_mP zOEF+GXdW5qzjEEjBl!g?Ei|Y%B)00Z-cBwP*de7^QLDG~66d4a@1(1*Sji;xp0|G< zm-OYf;8c!jI$3oXhF-siLHGqO)Xe*#m>4ZY9Xt_M(Rmx$VbGi3JGEM6Mb&+UTlQFB z`lIvkvWTE#qh(L5B(PAmy8DO<$ytRfHn|x zDQ@oGH&lNbSUDl~sR)ee=T`47WNbDVvh(>;XlJc2)4=98@da+F9|p zGcfgS=q!D~Xb zV**Po;A)bzXRMJ|-EQf>!ZQq#E2GHdQL)$|D}~KI{e$$l(*pbgT=xUsL4MFE$0n&$ zDl`do@I#V39<@o5RCR)P1*)Ga*gz;%ggI~6>(qhI0+(9Vt9x=;m?P=W%C)hGwmkUI z^G25oW1?j~cwK3U0Ud=YH-8&bsh3{DPR`q1DhTser6nQb0<>qe6Px}Hd zWYTG(e#r6OBvc7OSa3t~=0sYV|FlV>ln$V1hZ0E#)v-2Vhc95G_IU6Ev9#5)X;{p6 z*2Cz-RkMo3i67GAw`VEJXhoLnUE3g}>cE;m!^dkb_5-OYhrgGmSmT1D+7?B|H2q@! zK((kPh>VtB3DS2?(8*CSI#cm{u<&c#`wKc7?~PG%RKK4%s;k7{j2>SjmfS1_(C;l3Q7j>&dV zYE%ucXCDJOKkjiti9Ki8vJuT|zIHsoLYi)4p>Eak>;kvAhp77{qpP&!F8bAy?~$x0 zDc+thlnMKx-spPEl;TYcx5$Z>&_aJj$%yaQe@8(y-3Kf*!69t$w^KiuJe%8>xdu|L z6;k@y+LE4VI76~~(S`fffSX~H;;0w`b&KTUoW2)7LC7{{cWay(8$-CYM|uU?t+1># z_vxB#X(l~;R9>5UQwfY^V!&J%z)!m|}>Zj;*R- zW;I$vlNp?r6zTpw;-Cn* zstESZB=C;G8sC$I=P>B!hLtfqDE=rwX3J^{t~J-8XVh97-|-lRH)zgK3){JKA-z)L zN9%TSyHJeA$ z@4_FlKneubWAh2JbL0Jj%l6F3hBjym@&HbAA%C3le`qv!@sAHz!Xr0Ym4&^mu&SV8 zva#I$|5W!@L2=Z00M;K$gbq2;9lZ zKMVVOmt%kK5mH?Xy?0e{SRz^=U5g}sn7$PZ`DH@_7;1n=J(VZ zvNO*8hAs$X(*~=iH1zCEjYp?v-9;WW`ot`Wk6hq7^O2`{Agp`)`m%iZV%z|7Mp{7m_3^#Pe^~-;G1T0Jn$R&^OP6-Y4)Fm6 zG~>)i6(6_gv_NHZa?g2SR?g6P8yI5_$w$L4CMjt{TvFJu z>f1JP;UP^)p4WewhwXaW8sz6M{&CogIRpY}k$i&0A-(hT`$BJ?L@Ha%=uG*L``7I6 z-`ebCy0g+gqY9N9SC$vvHUORFF&+N@4vK0TV03yR@K@1sU(EN%aph+DR?&<|FE%Yn z7P1Dp@sIpkpMb);U!TbT6Y9q~`e)Ib{Wk;6)YO=bk&+Kg{aG35ea07_>@RsI4WH^a ziQeWgDFL14u(h48E$g?<*I`;Mo)l7^oz7c`pi?{>ANulJ_Kh8lW_VSQ^Gm$?{e@OM zG0={_AXZ=TnjDlG$x1U^#agg)su=cvxfP{tWlLeT!Th6j$8+{_NSmHAZwy^wmlwP0 zZ=lC!!1VmZ_7MOO9$05+egB_r#11IGMyc&vQi{Q=M&3zjL7xl5Jdg6ri6O0)SlN=4 zmO)Zs80Zn}3qZlUGep+6B;DXkn6=Q%p74~VPK)?l5?MEo1kS!B(Roa!?uA1!=s$^MLHP-gW*yFs9K+rxV&_1OnOL+_;>MQCW-5 zI^IemEQ!fAgU2MuPz(vSpq*P)$N!QJA8=aV>kTGdL($@$d{wmrPIos$3))$NlO?(( z;{T<2Lec9am!v%wcmz=${Fjy%_ci4IJ6-X=OYr~o6w&|Q5fS&m_5J$#Zk|Q$1)_hh z&R%Dn=A@x7wf&gupDz}KaDD&orZArW$R}!3R;%UNYc8LA9 zz_1-IU9*-YZl)>(T2^GE#fcd6#--uA9~r3q@75FchrZ1 z&?sgrDS{wOA!oGIMr>|w!O`J?DT=;kTen^-^ZeV2bpB*3>ZDRR5IJ$)vnHN1VUWpL z=E{eC;`Qxh&Bs$|ckTXd0Ul;qEkT7?dNSwJ$;>Qw4mg~|rl25vKBXvChHvl6991+y zvv3^MQ6{@P(&eDF->`E6Cn&~VTJWe~G`T(SYaoFgGR#!dXGOC>!_+D(6*)~gxD9@x zN9Np02rd_10wjDS_wl6T%z}kO$#pl~g(p$OS%}6$9>yJR&wXyjSFdxT@aL4xRcJwE zsDJ+4eby)lJv~_Pn;8xIgwdSPU3t2|40`+*pKNr;R!1@#Z^XC`=>kOYF z>p9J^$J2@c;@4*}ttdF~VcV2S|Mjfy(g63|m=bFb^XTe}X3WzZ_v6y6KwEdXoOxfK z>lIrE5Jc48sg(rsO(81v2wfv{W4iK%?w-s7>z}rHK}-C@#AIE^MJ!L8HS1z#v)+oS z-KVhS%n80?h&;UMa1W;$q=jps?XNpc_~n4n^|BROlH--uzi(V1eVAUc?Sm=ZsP_n} zhfhxEGQV}+Uyr-L_aCp?Q4D2z?C&gT!N3VFr^GDINohZ_E28mdb4FP5u+ZSYXr^$4 zG&$y{U0vJPHLRJgs12mYuqkRRbyoc!hbaFB2p@pTYvW}j2=p7vbFaFL`Hcp1Z|iqu z@Y)%rj7=b@>UDoNO``sM(bgRR=ygMJUwdB8-a1e=z;Qv@E8i5)p%e%9;iMK@3w6XUOn zp}veM$u%kkJxYbIyP~X^SS`0U*AC#lRtLdFTCYl(IbgK96zAY(gFJmFkQ~poVr&e5 zCz4buZMD1W5U}=)tsS@JFJ~eBkZ~=!ivF!z#s5zT0&jAefx}n%x^9W{{+^E?l#H+` zIH{y?-Rqs|_`7^EUwWKvW$^nOd;IkQ4UdBh{AIR{ey|Z@Rnyu*otmnfpVL@lB0c_C zR{4m;!!Y9Q+ul}sRwIn7{xgKm3uTzqS4*Sqc({YM`d9MF#V?#Q)<;<)Bd0F5(o#6X zcm&G2!vooJe?r;tO$%;YA_n??oUVjIi2f@SVO2AMSX5PUAnE#cr*yq1hBHHJshuk3 zX+W0o)5D+D!->|1tM%A7zZ-3Zp!a5+W`V!=?w45_I89i z8#0P+jmum3`D3hI)Y9Dy{Axip-u@0@8YkNJ*Qw2L!%0P49se@rUy3-W$pihdnEjfL zJ|uw-H@0KvbCb0IGvz(t#ORDt4ruIkqGo5=QL{@ZN>@s=QesnkGB*Ym(%G7vE|yai4@p+2+%$aR=(R3j z;PQm>zmu3CavQ0CR!>rXXTWaNN7|U@FO<7smr6?^ztd-;PFZ!W#m{RGy?u34j{|9fKKQA*32ztO}+S=5j7>YC!j-Y62!=&e3EOAQx@sI%8!-m~er zRVLoIv&UOVZp{rv-=hJX-=sVWaRbj|u%z0pn9Jb;HCh>lC~`7J)FI0P%X9D%9H&(e z@sWOnG19u9DdiyM=+YWFsJ1!CD8rtcx2w*OL^SHNtMO%L0co=H zBEh%1R}5MuYog-;G(|o=mvd z+;uJEyM3Fm|G9)$HK5bl9xoOQ9(AupPo1GO)M@fTeS1a3F~h>llikl1xLg0aeJz&% zIlZl^je5k5Ag9WVw$uT>{NC*ueVJn8vb*5zFOnA%qg`Bn$k zrWnHADNe&RtZ&=9u+D7kTg!i`PgUSMv2iRwl*@`zr+>8Sg0+U@&;K>_U?D>FiZ5iR zzR=IlSkH|Xu+*J(MGjtIULHwjTu{jVNh0`0SjvuBFNb@XgqLILVl>(#9Ch}s#HwKC zl@#R3mTasra|sSNOnTxK^w7Q;jk(7lvr;OKb1dpd{I8iW5F`oA@zt{VD+8aurc9?6 zP^{~Un1Rtg?wqM|o#&3M;VvCT??+=MCwO?QDquji^3x-@($DZZj5V?A7$yPnCCRI!D%VIpnHH{HQ08U4{9$)XU_p5>eHc%@qnyn-Y~O!3EtYv^%HlZ<Y2Qz{M=MnLpm%gv58%Jl3gFV)Eqwst zc0={OWKHW_)oLAHzMpxNiAhOzW2_4dfG5CLhmn*dSZ9c&H|C@tNSIFPRBI}pd*;y{ z?Aa_RsgHkAl~ONeOeQVpNNg{!MN8e`*e{m_m+0nvz+_=v#9DVTc&^J{HO`K zO&CdNxw{jm*Y(TW)3|*DfMm~9J6-jW9yT2%vCd|LnohdeEn^F!pNS zr{p1XXbAnDC)Q#O7*enEl@U&C&JaYpb!o2I&I!?@?2t67iRJWSOEY$QVlRdi@P3X9>z9#4(^=7cxBc0v)fQfNXcWRp{$sUXY^sL4=!uxD=Y##n}&dQ zZn>bsY%aZ48!$>>?)Se%CGZZY;!&w^A}rpmxu23jI9;ZOx~!UfxdV zA8uJj;&o%e4!a^@C|r2M?apB~-^>}LrOJ-)NJm{t$85qz3i zk?A~5Zr~3hWFy74GH`s4k3ofUJakIy%2>fBOkQH-d)~bF-1zZSK6k7-UvKp@+Y*Cj zBDIb@`1{X)-J0W9>0sDn?(3g!Y^vShe>q22bgan~P}-TjP8~Fib4Wpd>FxSOJke{l2xw^Fm^ZfVW-h;*F}+iuw`2GR_OEhh4)wb*@MA*WS<=P)FJWUD>{o!x z6N9m)FTiQ}fP_B^K(zTCt$USFQB2mmTKt#(_3rr?A;W#32=je4s++1%VF?itYEeCH&&0&u8@ahI+cBLY=_ zV5|%$bMjnP1O@L!d*EEYx#2d&OT|07$M=`dHHv%$A+pravMuV*{uc|d+yyK-_}XGB zJNvljMox#F6=!nYlKF0)UQ`b8)63O2*imc|yFSnS=8e5v4N>AQ8HL(PYxgFLDs8?6 zEH7m3k?a!#Doo(k4UTw`r&)c0Dd8M{13oBPe><#sdAsPC{SM=QaawS%jb=rcb6`r? zK{O8|OXllP9`zgmt_(S<24j|c26dKC55J)(&&wAR2_DB@2jAsoW3X~2rh>~5gA=&d zK`=38|0TnUGK`;?qNt`-L-d=P$z0&%ne~1Mp0=4A{*l?v>cCEu}o6Qws z*9Kqq`J_Kq5RFqk$XOzdgR&2HFb_5Xd$p}<8u5qjOmgTD1lGTs<|G-WBpw?WQ{#=~ zzBC!|YlX@xTC(+T!h4P`G;t!@W_W6O-SIOcpN^R+NRsmg)B@>){k`oJYDxO%=D z*E|?vSQF3c4Eq=xLvo9%)$cZRsevK)i9lG2jnTNwzm#g^$I*!0$sHkuxkALsw2>2F zT}&>U-Zx0#GiuemHUN;Z->vi>y5FAo0d_ONA4e`gdfHX`>D>z%fJNEgo~Yxy{sFH7 z9K``ID7TZ^!GpPSOrSeyYSwNtfe#Sg#2By)fkq$#Hd7z~#ZpjI>;d#HH~|y%_Q=6^ z)cJormjS4{^S)cD%naZ;Bd?(F`+Zi;MGv01{;PPQVyWZ`}_}x|=a7BMtMJR6C zcsX-9thh1vTm=YH49bw`KN1sQLhJyy4pe_~qE$F>q2#bP{~KIs%I+hXrd z5oiB|*Eim2Nx~P_pzyhEp_(m#^*4JYnxbVMJgP4r5!uPetb<>MP{$FmGmz<2?o-2}sM@YJ^NwI+igK?Fh3(mj8ME^RLTmM( zaC{N48sD6PVP@3n;7goAobO5x@i^yO&5u;$GO8MKJqcR;7Vtx#4Vzuutscibr>Y3? z5eJKtvmI0+Q^+Tv{n<1m`o|G(7T{}IE%e)^1*=_LklPv6ATPR-u*zlqSJO&_Kh6v+ zvq?zre^bM*0Ymx48%{^` z+d%KfA?&J8F=MeZ5AzGIUMZiSHZM~tuzOZKnB+2^IquqEr9J>tqr~l>3&O@|hWxiR z&`$f^=Cy`qldfU{(qJ?ykF3V@`N&Bo`kG)(1*zxxm&~#2xU6dS-BA>0p94tk4_6qC z#_F0Oh<&$58ksP6o7ffzHmbOi+4_<8XwWvbFx+|_URVKr9wQ)gzc(WC*6;yTrs<#Qvzc5SJ_Oetls({>XAFo`ql3q^T9KG^+> zH{e?yHf-2q-Taw=;5C?*jxGv7{wKdP)tzxMFk`)2E8!&oc@V)OhQ}^Y8PXJyajm-)tgU!m^?cJLlAltsHfY55S)Q|}~0{4ig z@dQ(wuNfrRQP>eU@Yl+cx-@9z+-Mw6zp9oHwkHK00V;NyDXsmQEO`uMEKv~zE6jwS z@{s^Fph89Cvx;)+G{p}E;hYm#ilZJ)$ZY%`?N3Jyj=^h_j~J?1$%10(P&({qN-61~ zl(4GfNjP_vF(0?+>HE}4ex&o@l~-{elEaf8VJ9Y!U#DTy`N@GDx|0Yapw239mLF!a z7H2xjx2yzA^JkTQmYqro4K>WYnah(s8CbS{DrLD%iyl_dVSyJ#Q3B>9U95BSspvNZaKiS%U4IY zJfEYN8SZBmFqvoDHX`2JZ@GsbHRDe3-2|;=UF67@Z6T3TV^38!~fit zsNWj<=7P2^*qM>JxGQt(?(}{O!B7rFs1aXpUldM_-5~ZEaz!U*_=61Oclqsn>XDZ! zyd@Bk31|37pG1UU#f@89VNub$ttDUt-s(J)1(;dN0D%{RL}0x{4q|$EcsRIuOctOB zEUtaKF6J-n>gpn6Se~8j1Ka~#{|-~E0G8x+bv+KUF<&^Rtv$YfM+XCxF&vsFy1Tl5 zR%$le1Kj>bP%4k}2^O%t>;MRG%f%YMbAUX}_yp(&Xt}I8SpoG#2|10Ep23H!%EDW8Wo1?UbGtOK!bF5R!TS?> zN}>i2N4RyJ_cw<~5jrasrc>L1k&^8FUcR-VoD&OPt>|%@mHyrsNrnxP{Sc?{gAI4Z zha?hpOR#aW$&Va?HW2 zdb#P^9cr&0fEyTd=S>^058XOp1!0x2C9*qXuT7u0qRR23XS&Q;!UTlz zKFLmUoofN=Rfy2&8nW3{Zv(3BRPGz1@7wgBA7o2=KQtgNis zeMtiArC4!fn;&a5D!*{r7wmSl(P)SP=9V7dT+pshx2GV$dfE+O)p6QwK?T4d?Y%o+ zfU&R5x={=7uyS~DE$1Co4#W@;IAtjwJadVA`>00TBl2Rrd0T2~<$&~W*Qiu0OrwFx zsT-oyhJE7;+Z;*>Bx;^vkRaZqX)9w=jF`|HAW>6m0j;_0E> zPSuo>PpX-g$OxN++UL^oC5C;G>T`wWg0hwkOBm{;Ku%GW5`mPLuiod9s?#5Ac9Xqb zy*lz4wTN@hh^v%*ne8X8!tT=)QGJuDL(|8_numHxY>1f0D01r~S%0=Ci7ydq=I?z=I$%h?wa zu}IS;47d)_kK@=;n`E$jfPN1BQ#2HmO?T&uM)120Z=vG*-Vb0KrSW>Ov1LR#?2q@K zSO8Aq2v`g`WST5_?S68;fc<6Dpx?L}HelD6MNCg04M-B@&KMA5m&eI~Bx>^{Vw)?0 zWDxB75zy!jE3B=JduQ+i1ULX*sp>Zak~MT5kXP3?7_^gcvqe<0!N`>pL=G$$ii^tt zOM)*|D)JK{_--SRo2w(piBwl1paoGziEY^VS@{}#d=4~ZdkB37y?H#FV230@>3WB* zXkTks;o_qJY73eMJo;Yl)NAXv_2GsSS0d~_{AL)IGG76NE9t|XIeNn!=d7ig+P|L3 z8~jWLK`P16rSZqnvh*tnbx}88VHE&2aYMZq8BqZS)A z9}`jWY?10cK1iSbFrDzI>i6x|?gYV9Jmy5%^&_F0aq&(`S>G$&{(TB~uUq*t%2w@T+p zVB>c#f2v?EHdoapNULS|^)VdIE$T(2b$PLoc>JHAHWXXfvls4~bVSXi5nE@zcB(Y| z4Z;oKh7Fdf(1x}Q0+FM^U!K%YKwL};BQ;vaK!KL!TB9dTSpb8Zc97maps(W!lprMK zsBRRFA2 z#e3TISv-#+4w#kZ@-kI>)FJFpJM%jpB0J(ZFwKMITX;K33v?9L^(0VH z32BIhW$ulSr&?f2T2tX*N??ka*Ug1Cnxj7xaMrYcSzM1*z}SsnKlLs4pNNa8T#Rs3 zBIit4sTwT(!>bl~g)pfaV`RxC9Eo$0Qa^F_>t)@1-Bjz>mPb!am|<>ieJm?kz_sQF z-*Z^ej4Qh@hlSBy9L&{!TJ=-pu&sPk= zGK_;d4k~bemW>8+As#b^u&j(?F^=TE{O%f*9bTRrLi(zAc5yL{TsJ5;yh)hzUWs7b z^DDtA<8b~Jebwu!GDkz}&2hDbX#vmI3~3UCK+U_CE5@>4e))xN@NSq0@w*xWLd;EJ zdgR^26QxZ&10n$x{AX<-1*HRa<5vLQ_ML{*WXbPwODoXcNJ+A_wbinhZb8o}`v~L# zfMxGRKc;St+tr>k`g}%4#sd(p?-?s#Cr$>`%LuU$>O3Fp*{oNGfcyXmL8}Gb2B`y8 z>4wN@Ak7%h;tr8Q=j|sUA_9js34(af)(fRm01_}fxlHQ290ByN5CZQZK-N-RQBlzq zm^M8;tjto<#qv*O;IAGhQsW~RgWa78(MfRu_Gv4L+OwG&tCRR|IG7#8tM33KX=LD10x zJSG20j=n#Y)}i)v0-p!?G8f(6>`;g-YjV4N+>vUHc}#{EJ^wsa9${azBBg|=8QhsW z`raXSfzG(Ab|%i%wdk`vHpe~^suRsVWQNfUw;1y zUsVnsZ_;H7bqfAMd&nIfTyG$ZnyE5Ys<9;;Mya?_0xo>bmKA1Iz~B&0I(^XsJTHj{ zmD`F~oVXxAeah0I6?NL3Z6>wJmz_>^-}2uAdp)8%B4?j(@TKM}^YlWZnybCy-L|<^ zrD5c|Y=Z*ghE?ndG9p+Nl`TcMk~6F-TtQ(5DFKXTOIX_hu6ziX*UWO-Lp^2#5zH>M zYT}53?MJNPBjHr9i$Z6xbiebVE4)`E1ybbBp4HW!LY`FQxNAD&LSE=_(|3XlMcb{9 z%Q+kuKZY5LX*h22IVL?X_`G0R%vb!lzu4>m5ap-0qD2;f5`nWj5On*1xMf7^y0L=; zGyn4@5~u40A^*$aH*O;*z#cebYfHbX{fsLZ5f2`qk3$ZDK=~aVnE=WB3h*0lH9GS6 zPqbDkS_4Te>+)K;QW5w+Joi`3xkUt-(c5eZ8L z2?|jxE*c?t<0DRi&O(kN1(?)+(=Sf-BJ38i{OEK#3fK)SUhF#}>^Bn~LuD)*=6oye zeg?UyDSa^{D7mK+?v0z=-Ed+{H(;YjH=o8&yj^rlaG$tL&12#VZcNMG93v*VkWq{C zoe-wvv3>flR;2I>6c>q=6YxJt`P1FLw0WmFWv2&p{8g3o+o{HP^Tq_GHr!pUc^9UQX3m<%Wzc=^A+pZ=AEfs2G?NL z%iyH;Rl@^RbCI5I;t7T0I{-#e20@68Wz5*1n<$z%--R9tyslB)+}ui( zsjb^joA6DX+O+7;TKDnCv-u|g+tvr4xAV8g6@rHsZ93xmdJ-S1CKr%^%U6e@g0LPD zpew1Wq)X>s3IFU}2EjD%ZM8l?J$LWvM2J0U@tF|u+MVuC;iDsKc&k;vQ`)amTQY@w zxrzr0&)yru)fwMuN5q3Dskzp+O0?=cmoJpzsASR&y#0XpZ>11V9S-QkZ4?i^@*sS) z6&bmDD?OUq<3>>G;8UisLi$&)8XeTKdwi|1`0zlEJZ~`cF1(H^Vb^ru1ks9nb%p&8 z=+HdfIbaCtwAz_tg{b6TspNvV;L2W4`ic*RQ`t&EH;^l(<-acDkY+K&(z#)bb|>D+ z^1EoXx&ng5h+2!@udq#BC`e5p*Jh*!ymoT?8SGPD%x6B)Iio(CSE1C-$A!{D&`=2dGSIMNw2hibATdu5> ztGU;KR!eI^c-PMoYR{^%5zfRblhSfCUk*HX?LmD;RGa|tPrI3waKYy(Rxy?L9QKp4 z$aL8GeQhnOy)Xoxzj~4QBUQ=^3knJf|Hx~`=`8`ZfmO4OseEzp`!kpB6|)VXtMsnC z`A+%rdEO@Q=;$&oSe$&gss#LkfTBZ7OG^p(APrT#LV9}o$WR;!_i0Rr*kCl?HqZ*S zYCd1AH4wP=wXN3a!~)1-KvhmW*=W8(eI|baP?-y_sGvPMK7N&vrW^_$S01;>lrW|&tiz7-nmmI%_F(* z3V0AsJ}+bu)s!Y$fI8RU_xu4K2NZ`vu?fe7Sq&CAfgE^UuMniNC~(hLLJ6_!1u`w3 z4f{+v{>NJEQ+@#uc-bp?09RwpNvAukBTBLKD7ku53%_5v7vthDL5mW!f78 zu>s{z_JdzB#}_s^4{~@)`gztz1fqbx%=egcO~IH1FN*Jx*fZW^!i7cfx}PKl{Icq0 zp^4ICBljV3o3wS%gg;#24@cK#2F&K$rRPi$Zsya=hX_4=g-Uz``%O-${(1W+X|B|K z9t1@3^DI@7v(Hsh4yMni*owkzDgQ2<1{^rKr&6vN0r!X+fv+MXR+2FaqtYeBkSzSF$PpQ zpyHF0F<4^}^y-&vIljCXiX-uP+gA3KfhOr)_nQ~si2JTktIR@0{P(F{qt6A+YA&Ug zt~zmf9m9A6pAOhfS`(%Hl0^DB^BZ(W*6;;EwdSSh7H_+GXK^jG0A?`sjv;6_|J)|J^c8IA?0Rs=Wkk$w+x z*!1V3x^%C3b%g%B|gG*?<;Vrf`QdU-OKpR+Ub)Wzgd}5YV3KcH_ zIs122;~4L2lB1)e_gX40o2A}ZIxA2h@df6!>OwC;L2#heN65;W1n5Ko+qbq3CARlS z7vZRmT1H~8v(qKH+}>x{W>ZK#s7vULQ09J`&|m6^tnP?y1MF2nHi?rSTN7$Lz1O?^ z2vkgbF(F*C)J3M~x)zxg+Sa`m>HrR7YF=OQ!qO`a-GgQH@d1Jrr@?g5RIXmvd5wGVIc z6XDI&46DpBtCx9{fwni2Tr`|B(F%>jqQ9$C6jDq4-LawN(;SBIXNKoie$xL~#B2c)E= zB%Otu4&B-?r7SRT-gnJiPiNcjCG_T1`;UOI166pQEuh;qfFQ>Aay{b-48C{5J?}`D zR*fDwAgT_b_xk)Wn$6#>lmQ@L>H(@e;qy~@)q{BR(;Kx&O}!{d(r)7cfZVw`=u{K= zk!Qm*_laDwk!*2A#rf1<@$|RXHP-HTK8QK+vWff4xxj7|7CDb#K_$mouRYNz>?t43 zuKoxmMm^u;;JXvF4vU@(U-Zg?0n^5u72UquyUYSU*V^#nt?eN`Oz;o!KhW4+HsKHF zV8*~m$Qu~z6iBbP?bR#jb4_Cs{%(V(JBiT>4>FsDH_ou1l>Juc`J0-Ym(^~k>I!>3 zXVBI!HIOJnJH@vM-?Ol3AN|CEUSuZicBn$pE{)f+y6gI4j!>F5n$8h+t?YEXVUrXs zSo)#tNC~?n+4n<1pC49KO1jx3jFr!K3qLe%uGWcy_^$52cu4UQj>AHAlO8TBHuY_~ z)AJN+5_P1nV%;9>xBLY8tJ}h?zeFA*VTN!Pa&u5ompALxVmXkX&b(UqF8>HdOl6AR z_Fby=&;It2jsNhX_xXCkUbb4gAz~(nJ)NPnfwa%GIFqWlLoCY7k(@K%z|)VWbh5Ze zza#-`w)<31rQ{Z5`XPfcrrz;QdN27(TEK$gN1IAaTkpfoP8ZoH2@03E4<3ishou}>nx?KSG%2-$SmwX;nwooS7u$#7&Da35Gk5l^v*0tFdTm)+w10_ zT-yG#Xh+~J`t7LY%fPqMU#rgM?FIdhznCIbU~YMzY`wO~d@;6dX$4!(16yxF#$;GLt@aX{Q~yen#-(g zY)saxE%Y%P#ow}%-`;?RG$9#T&?j8BcglL+%pqV=&L^ErYo`9O0q7_KKr8?}BBZZ> z7a;a_Yx5@N<>l4B-ER-bjx*V9^~ueF-hlQz1JDMk+K#IX07an`OJ>)LKgn+t*|Nbk${a=6vRG3FSPAF+lI>(mfYjLShd7hLd^ z_ce&yeXTNhWzqPZDGdl6>Jd_xiuXtM;c`t&#ccYGh(9LB#1v;fCgJFfu5qOhkh+_}v6lcp7@XAThcD%|`{7mQW!RSE3gIiKxQLh-?0{n{!M*<&~^C*4S_!LN_tld`SV}k=aa{*XfsjpBBI} zp9`6|5{VL83-da0J%=i5!j80~2G(Bn3T8VINwV+TKXe;jw>}lT9hel1jNRis?D!-i zBmP>?G02gNFc-*}JxM3L?LwYJFqAK&a>p>C3@u(9DC_i-=8wq%i~ORDH=U%y20WqN2q zp~@D-Fy)1{A@~Lkltlb&r2cov(7m4#47!qbvW21CMM!m!aD z>1NP~IiVoL%@S~uJN2a8L6&&#fO1f1JQR3ht(=`QiB)MRtHeOgzP;WM`Q4*!2&Yqm z+nPwaI2DT-n!&se&G}zZ^hyQb=WUIG(2mA4e{~Lq<*_r^bl^+E{qd{@dnz`sD=u-z zkWd;|q<|cP>cFSRZo^&v;^mv5w`ndk(%Z5&h z3mO4gPH`SV+TM10oC2nz1S13jX1#Gz&F`3A8jo&i0zsemC9!A8D~$2rhBc;(vD)37 z|4jwZ<$tGbWX8(;aYV-&)Weha%`ZUqn8spRU)v!-$ia*A(<$w7a0d&<>UZW+JSvBS z`l|g~%X$BDi<-i#z_eHMw~HgeSq?lpQ{7hoYtyHsBZaz|SqaFBQSAzoNeo=Kr;kf< z=~8EOjN@bn{kw~UarxG(KW!K1G*zM<6XN)|J}&W>^tLl)m+3%CW~Z;G&n@p18b;y8 z(;+v2SB$Y)F;zjSo?o{iIy?`NNnvzNV&8HO zgPq!CDOLXOv9TZl6oF{OZ@1On+%wE~>EY+wG+cbtek~&j8vO(%Ft=lBS#UT&bR#U| zUY}B@H_xr+*dm5OCCzeL)V)HkLQiPpks-Ao?>FjpHN>WynT*dLIf_D7vD}W^J~*vNNU2tfqd|$ugDmvzE@SxZJ=~Gn3S}PujZWW2s%B{Wy3r47yWi zx+kcZ#;(Pr4rYGD_}~4JJ9+gsl);u*qY{{Aw%p@s2!?k@>cHYapc=Y?$FZ_`7RPIm z!|T;V0^eV$uw_@mG%=hvxT<;8Bptv=-S6=SZGw=V3LzP>a(O3w_0>r z#5No1nLH`F=jHYjVRjfqJkrrqxYggV9;8VdYNf%8&;T*+Yn*7hFzf`Y*FqNM67EKO zDJu__6;~nO_lbhR2`>{CD9K65Ia!n4;P#|C8eGHw{pXs2Hyeb+i&R>SV>f&}CPy5Y zON1V9ER-xTUWuxS%^xP*wF6Yp^Ap$-e=**+Lf8tE_afBd=S<2|m1` zX-wW$9$frdFJW;kjpComdYDp<;=4~ z0*ic)$s#iMxJLviNZ(m5-ri3}r-D|)#CpDI3UTvI9e#Cd%( zGuDOo z>s_H*#03*3R}wogfPu$_RL+7I(5HUvYpNs##_y1<^?+$?j=`;QCj53m>uAKhNz}gL zx9j*sbENTUp(Y}uK_$(S!`qgh!|h1^Sd|D4-<6e$9C+mksE4J{O>FOfz{t$SFIQ108_wPY%)fm0qb zkYpdA`ydDf{d;1PFPc zmncg}`5KIaU9;vWghk5#>!MyC-buwb*K_(KLqZ9rghYr1Q0TL+9DDG{mJmQ=i?R7= z8R`ygJWExcly0xxemKiCNBdchHTFk-1tmU#-{XgLR&u5a@==VBK_BFsv# z>n(qbUenUPv}5~50^3sFh(cj;{f(}G&+O`~fuhcUk&L0BF4F69b9Qk&@#_~s+O^k) z&tJQZhFWda)~jC$vxSuQHu&pp9IS&J6DkT{VZx_oMOVIT)h!})dufI{(&gD6egOjm z!>YA$3S;Aba9fHP4-Gx!8L`6KwE)j6b9rFq38vu!2fo5&mkb5GfMG!!QM|u0z`Mk; Wa!uZ=bQ8RPloXW{sS?uj|9=2f-XKf> literal 0 HcmV?d00001 diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\234\200\347\273\210\346\210\220\347\273\251.png" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/assets/\346\234\200\347\273\210\346\210\220\347\273\251.png" new file mode 100644 index 0000000000000000000000000000000000000000..9e5b9d8e09c07abae7e363df705b15bfde44c9da GIT binary patch literal 420137 zcmaHS1y~%-(k{V*L$JkNLXZW5YjC#!!JQ4dXmCq#f(2V3BoKnTyF0<%1HlOn!SxQ` zIsbpoz4r<8%5W_KHYrU_-*uUPQp#Njorv~_}6_*`3kwNky=U3V>ENg)3B;B||xkbI{gVcf&( z+mJG2VH#9M6IB^HN=go*a8-Pw%#Y-Z;dH2uFDRztHlB9ocdxjPEo?pU5_%O+`h9JX zDk>4wi5bP(8xpRb6%A7Old-DlyFxH~Q)-zYdSQt3q?QV&^gaC4J9Z><1V=qL3EoEa zQ$z%jPahPi;IR+}euZYqF~cK$y6240bXuWSc5#6tM(~kO zAUh;l5yHW8T5kGoB*(cXM&pi1d?YHg((&Rw$69;-?)~`<)pg|ZD+bf|Y8XSk-?4RZ z142J^s+zKo67shCINA%r8{V3@w9VFT8j~&UqDG_c1wIWd2*he1{INna%MR}OfR#%r z^eOG+;7@>_m=doL-1&!XRbq=3!NH+jaVTUmW%^(A7vNNV{WMByra5kx2m)uXBJ~cR!imUSU0wNJcL$L_ z`31thkmtLItT5?N2`7QGz(T_>hbIVO1|wcBEmk5sd~{Tp#u z4BH)Dgc?O$+!6(0R7Md;p_9TEe7Y~urhGmf?-x$wSzB>av(3bEwZ4ceF9 z88zGTIC(MmK}5!j<9(iq;sCyL>kJVny!~}4hGOtwn|vu}AzDGl=U>jv5iZ6*XP$lu zed$2B52xkuJoP>LVcrFqHD)EiD;%7eBRTuVd=f8n<9R@w+gbKLA_A5kHJ zsm_xlF`O2u@5}TU)=iPPOK+Hm&loUbps|9LoKv zb^Wcl$S!9{Q$ePu%u`Bv&rPS(h=Rlg;7}N9oo;>p>&n+pt1g~>))WQpP%uSn3akdjGUw1W^ITK?tE_2#bC84dR`<{ zl%?BHIbe8^=>x|Av3~ITviD_Q9b-rHBqqwHmVcY?#Bbje zj%SZ&6cp{`r{oG6)Vx0bWcL)$OqauBK!3w&Lu7+{qvx44Eg!8UEi59J)X}O_uqnp# z&XcAIw@KJj>r`aB;w0e2`9y1*cWa5e6>}HE6eFA1mXptFMEJEu=DBQJZ*0u9^dhGu z<{pL_rUi-JOIz+;ZcZLFj#!c+UIDua=m6Il?+}lnnfFX>f1t^V4GoWunN#N)u|i^( z$$EL84x_ekc5rrjHrjYmHZ(hJpl}d(;N!qq(ogQ3&lQ6&2d0y1lFqf0v?D5#AZj*_ z{jsJ$st3yTEj;Fqza~^^#!k`{AQ$9Jc}@xKi0|m`2=Z`|il*f8X!2;;hE?0nlg&p~ zgZ5Mfc^g<7a+?~OW_&YzuAjF*A6AQ1)66Iq%oTJXLnE^$-xuU0rzHy~BPBD6 zs3!W#>AAi&p3x#u$GFXMj`t|y^69faQc$idYM$*&xuF8#| z&d#Z6x!Zwsv%0saYcISZq+y_GsHwlBop#VHpvR>eOYn13yia%BFXCVP(-pIS0umdv znx{odS{lh4og3XeM9FQ0-*CBJ&xDA)74c3h5+QG9S>jq^Z6^24_VsRg(em_e<4*jZ z{7!|h`it$5-B37O4xCQwhxcy!;Cz_2aX($B&1J!NDDnF37*E=c*Y~RD_QI5uJ0~&5rEQwq9ZxZ_fCJNc5-!JQde4@ zhV=U@e6Y34#81net-R5eE*!J(W-rVrdIWlgJ_?ifgr<yE%%k$4ehh?3 zdYW3BDy=kRwq|O|k*~@`rwQMAmy4D2)o}#H$0n82X1R%v1Xke*sQ-K&u3n;4nabz0 zGAoB1=fp_YcYSujG@;v z)FW~f)5h4U5(yOqvptmz1gm5S+2*Nb*6`LNF9i-|S~N{EF2o#LT5GS8l(l+u1L(dYe?vW1E!&+-VICcAJf#JSP9- z{c+y7H(b-G%-tiM6P-Z&j;-V8hohEr3&E@=%f}WycqYhp5 z@{Pf{A2!3VIvCVY#vn&~T8%<&v3Ak)_Ty<9bNP$nH^>_Mv&OY1KZ4yzgGj;9<0Xxn zSxe{N8_{~5dUm#$v${qQP0i}a+9Kom)xEHJmHW=Sytg53QDfvruJW#~8a;2)a*uNJ zW+AmU9;~_rxqEY8>#tXaKZx>9@lQa1b}XpZgLXA{ea;l-z|&_z2R5_Ai$>=K>m{UE z>{#&=w5Kn(y-3PQ1W4jY=dpvyZLN;&1P)VH?FCbmQkb!SQ|S68dActzH`4~mct*=d zyHfTGos~kI`uA2>qmB5VIe*_XzxeSo$6tk9bdcX?En+x%qr%VvJJl@BhVQ_)xp}ha zwDrKQh0@a5zM$ooWBSqe&93aTvb_-SrGbDe|KV{Sb0hO+iL@cK1?I5VeLAs2&Kt;^ zx;W5udHklWNTp`;C-nkTU4cicU#Vz z?LCz}5EaGL2_YAcxdnm+m!sK0i?733!=f&r`@GZ5U=R=3_qPAOWGeMY_$wF!zBXj_ zDZ2OEB%S7EZ5nMlU)*TAJtfO5jQSK-0wa>8M_+hr=Ne#O2<|1L0oMthi6x{WbwwD2T1;QoZ;=i$MI z0GwA1f-%zR?>sokak$WP#H}O_3712bX}^69LK-1b+pxRX)I_X1CAF_|Eq}b+(c|-r zjoq)oPt*cPqXQ!Jf$-uZL|g8yk`mktAdLct_=E%w2}nHwJ|a&@|2HlD5y9kNY?H|H+N4`wiiL((q@%GdMAI2{}37 zs&3{4f!H})+P_1&*--*+pgPFvIK#ma(m#Bj$i1dJ1l~VwrJ?;!TS-yC%pS^SVs38= zVRMH%JiG@^$Xx(PLLu)=sNA8pcFqFs!ZiQn5CGB-x4|@2|73Y*BTS>Mq)H`W?*yUZ zVPj`wrx8J;qM{OVGJh-ZT2lJo#erYKG?wq)IS7EkZfm{l7&uogq#V_E4bCcOw7M*S`z@_rre|6aqhV{l7%_QXy_D~0o?f%1z4bvgP3GO3T5TTHRvw@`6|sg}=glot;;9Z}zdTIrt< z-pMf>FQypzZk#6%2;B=Ay=l6kWW%nZycQH}+4QTKt=7=gl!^%n4mO=188IFl9IQ5O zgSWAnF(EWCntx!Sy2dHey8Z;F$sjJh~7>tdhW0ay6$S9$oZOGM`lzjEq z6Akg$R3QQV2`y_v1rDAma=ZL7P22dBCx~V&@CsgD4W}n3_)W^OeDL|^pCBnBo6p#c znuR)Hbd^-a@7&yEzM$}t1@OYp)13``6`bA}PP2EfFy`ydK#mDIEIiLS7f;yg&JjB?OhX0# zx3eO?wb5m%vP2UJyjsiH1?2IL>Xq*yt{bzQ`KNLS+6-3P*h3{2jy(*Br+}dtAKHJ-3wR z8xElf(owJ(iUNtEj^rGQa46@Gf?)4ogZ|V>hhz$5xVpH@VfM>1P5l{=mE9N4h-l&9 z#d}j?3YNxz2!2OZW*6bOF+Mp{^xxF^RNpcaxKR2Jb~j(5t1P}m!abs{J|vHM z)jxqEYne>a!bC|9P0BlSpZ-`P5%5OTt**-MHg?eG7k^3%H1WJatikgFVdr8^Yk%|@ zkF;^rWQv^;&%%328{H?QS;o)&`mh5#WaQY?WSjj8hX(!odb&i&Zjb0)89AQ=)fMPg z?ZA>aywLX%?GS~x782;jmXVtT{cgh1#gRIECu=H=E7a}Zb3o**lZ+POR!0?DdaTDe zcbt52WY<gt8@mT zp~Lqd{(2wIQYbN1>QFT3cQVi8Hjh44Z4!cca;kmZc5YyK31O&Q*e&1guEx;~marr? za2P;mH9r}Zp{!y|b*IV3ns4^;I-$M7_4(*l;eCI-R1@$Nn^0{r<9@fqtXr9Xwfi}D3F zI%_3nc{Kv(mR35~NH)6Fvm_hf>O~uiy}l1ddYj8jGp8B$bE;`xxeIL>^ryVXO7~*2 zUfX#&ras3%7@@1(k;{Yro3qI;Eo_HVh0iL|ZU>Z8g>=fkzo2=fZ?iJSZ_rtPu%{qm zH|H!dTuBNpk87A^c%)|59wq%Kv3gO)rAdrzI>l8)qz8*HnVptoHZnzJO|THKi_jDy zNENKWz&?>G6#!|I5JF3R@(c&AaVF;Lo-ycHz>f3%xjiPDN!rl2086fvJEL0efn#wq0R^1G#W9zQk;7pyuo zUEXM}ip%(9g#4~*j`}mTx+34RY&1KFE@S)cI<}=7(wd)dBT;%3<*4`O^}XzE@b63a z^Ra$#3z;}N4f^=`GP&#a1k-gF*w>`N?Vx%*N2LefCy#QGu*~;w4)5Xq2SJ>qQ*m_f z*l#f+j2V06GX!po)Pbw}?`-9ngB0|9RetA$_J@IkCab;SOJO6gMM{`L6vY)f2(rDV?x>r6mHH9GgAIIXgD2sG1 z_M`e>c!J7`q$;=dJZ?U>t&e*uyHXlG?JNsX8a?jJN{}C(9X7(r?;0o2ht#XRz#NoS z4RLIY%d+~-P++gOGVXtRJ-4%iNqUmFh^^mZd&^c@x_<5in@~w|jn2DE znhC~3)Rr10e#fbZltqhN^)9-vVhTF0NUE7D^(4l@+8@D2p|4VbxlIVuCzSw^u5w}3 znr(}PC*I?`#^e;KjnXYy8fW2D@b$H6Uuk+XMH-~{|dT`{RhSwvArH_^> zJxfR#?G(r)-p{3#twR5%OV=O{m1*CEONeXB`2w>(E+3YeH00JJeC8lAoA>s z{bK-(cX4r{Zl-EyuhKS3Q_*BFfn@W-udDp^mHVXdxVrl(`LCs!rCy0EWXURXq&A+T zscFjIv{oS569f`S5@h3Ug$^7Q(-~@F&l5@qtPbX!6|uBLk_w!(vI}ZoE$;oT~-0#D!cB z1sU=$9dzp%)YK>x2w0N(RL8IXQg%!~G9AknW_Lj8!&fb9E} z4{A%Rf33_6*t|l}s~04Xbrt7H+{kxn_MHdZ6)D{6CO-V z_hnz=Ut=V|MJ)I|cmO|_Y%-M?AU_lQ#PVZmqJozKn*Q2K!7Z!4_=ol94P7O(l#-%12J>61Vqa}oBe0vMrX8oQ=2R?*UwBq$ zt{%9G5VSmM12pVK;tGp#S?AJzYQ}X5O}pv_K2SB=I#4Pr2Sm+^LHaVaNzgQ&eD&rX z?$9ZpxSx=b}JqXG*DX-av$3 zR9wtf&CSg{H9v2|Yrpu#JM;XP?6Hy|DG7(78)4H48d5{!d{ ztIL8St4oBb^0Uzk{c*mf{GF7A{;WkQnBC4!vo$&8eec}TeH7*;haxQuWc?zIqwD$K zV;?GbGrT08bb619c-r#JUD`^j+6C4ULX64DyV$)(qF7&AY@!Md*NS70ktOoB*HIr* z%(U6#bGR;sEMWNdbJ>-qId>ZIa1r#C8u(U7G|t2L=gN604eh#92jx>r4YvC*i^5`P zUQ0&THBl>l6TGKtsy9@)*o%=K2)p3k^6ubj_rjs%>b7{3lp}{O-wvy## zf#}t6^*5KWNM4Ax6;rI1g4eXvwdnn+Xu^K#iCZT&#sR>aEVI7PL1>|GqYtAFrIa&n zLs;pKB%PX`wz$2$rH+qasHOW5#HW+;`7;gqii*`I$M#4TAQDTncQGK`v8Q+VYV+3=Vd2G*8^8x0=B<4TpB#?KK>eev>kT@C9IKmL~BgB70c-FVu)(UU7_w}ZgWT;^ zA>6D?c9NusU+{U;;x-l03ikdTllGB+ZAu9OEojwQXt_v)59<7>W&L?7+F5B*<@|ZC zI}6YC^jFH1&#_N~qi2l@`P6U!%!XR3q27QjC(ZPvM>@iwWITn=>RwP;RTTv1KjKdn z3#0Qbm9qVza+$CgIr{S_6d zFZ)n;`s~eBT@2U1Hs5wjRBJvA_wT(*myFg)%UdVae<+L@TJ$ zXGrLTGdGuiI#v-62s^Qb6mWRQNo3i65Wm5Nah9?`MjMh&2vdV>fLLM9{G-0c=1 z{YE!NK4@ioJnk9u^-%yL;C7$!ldoavN^r=FQ2(S5x|RhHcQ1^nKdSxJKT8_p{xcS? z;`~gap}uXw$RR7jtm|t{#fv%f`6XpPWtK|Qn$K(2$((xFQ2<`%$2?*T9T4M(vB>>; zWxOoRruZb7s@*LRHygXb^VB+&CQChZ`A2(Syg6ASB8+QY+t5lxu@sETXUqSl ze*(ax$pRgHojBHEExRI&mA0PUZyHaurYcXK>>TW)q_6kMELD(hV(a>CcHwB{TW`}< zYGlLXKZ46%A)zRyjFelCN?-SD-I?L115O@=Z8d>0j3w|gW%@4sYn``u<1fg8gtnNu z7@}P2^G3g*yi6P=9G#gJxpv=1^;idh86Bk%LmZ9^2Z4i$+U>1$%FV-A4I69uAVf4+ zH0bL7%Vn|ISQ3ZSBH0{f^r9?*j4VevLPrfkdf(XVq{REzYBg`3H?}vHV62Yfs)y7) zP)+@8l`ONSK3}zpb~**8nm3=aRgzWT6)R9IUYjZ{|CFxteRI^vB>wkV;>ok;ajli8 ztZQJ8szZ-&_9Iu<$QSPO50fXR(j9&SIN{_iwFn}5qk2+xIFcIDGU5k%aMEEifjwSe z@`)j+6V!>@`Kdn<3|g8LZ!84Z-U=Xc2o6CrMLTeA6AWVYSHM*aB?S{M;z=yH&mNM4 z30ImOmc=`j?od*q!WDOh15sMB?ddMcvkXt7V5roEioeu?U{-}|MaZZiO zx_Ab7?%)>FXMe{WxhNm32y~yfwfOxmfv$ceS0n0#iKWR${O;7o(cbpXXRl`I8Kx+0 zxu5il98S0i&~E=MiBVlB$6^xhNO6RuhD=STgc=-^T8hHBx3?F}i(&YjOyUdBH|(+? zhZ||4Y+?ypYO-KoUztK`j8qmB{%}H8G;GkAk~q2}BQ{agC?`LF>lk{9%4IyRBN=uV zjYM3MG^cf4;=G)p$8G@Y$Flgvqf*u~kHxj?F`Ca2ew?jR^?Rd5zczhgC8L6sfLv`W zicx)O^wKkpL+`d7L=MD~xUxV(Um3}^lpuo9AU=RPQ%>TE04@9xJfdAW`-tkcZJ%=Ug(l~rAh9MX1y*Sz~|Aj0qkWdmY zkPwlWP?Br~@DwQgK<@fOQj=sDhGXzWT+N_1Wb|Ma?;FMI+tMs7E|-~$ffp?`zeZPF zKFBC&8U##MvsP3O4$qWt-i?tE7Q3adm%C6%SYBE*c#P&??KI;ayc?Y`Nhra53~jJ` zJYeKx?`QjWF>-Wbzh2$6#?d$B1WkD4pu_idg@abVzCZzcM8kk&h_U~NG^-BnxDo$P z;x``YNYqcGVH+D8-$%r6+^ZeF1j6zr>wvGLl;hDmaj)(7p!o_c>=QizadH>U*g^Hu zD!Pm=V3T!QV_v7RE)+iV#rsm0{BIF>iy@2po-+OM1+Shxf;KcB+&5~eB+fGVi%=@} zuZPKQ*}ovMfE;mPbl+I?s#&tJ6jGUgWFU{Z#o~SdxJ4-I+yEypm+~V#yI{t z@@NW`-iBqSLDbdkQUpiz(XgnfC>+(;k=RCKU0vOFy%WrQ#Qz ztf<`lq8d+-=JsY$=~XrT8$pW?EooV^+q#LmaV=?i&CnQhjsOwp4DpW8U5rcGO{=Fj zdE+dkV=^>hEzqjxMFMRX?PEY$9f)2b_`V)0gd(eh?OZ14ge;*<9Z8GmX2J(=#V%^! z{eJB&4fg{aXo6O{_9S2?*04Tm#$Q%~ER-qqMQ4A5k!<)urUDsR!1I8S_6aj!X)d1e zPDi}IGZ}E&@iB_;3Z3j>pMwcgD(;wUbgq-Q!Aqu?f3z+&j4|J6&TsiE#9^N4SET;A z-n&X}a=o39netlC#C&ITq*aHZouE>YsKm{X069KQr+tKMc%cLYxa6UnRUmkj1M_52 zKVP8DVmHrEIe*9mair0Eji z@0d~Q^_eP=&V*j7$L&og-b!mlko%aR4vtpn(@|hvNj-b^v?a)7&QIo(+<`?~K(N_t z4lovt3*3Ow_;Vu4>45!Q*Gjjn<1@^@rkkF0Ap=r1$un3pw}B)QIU{6$E-|&_LEQ-( z%4jJgwK&&1SdZfon%Oi2pQbj}cws-zrM*Q1cpS3{#P3hxQCH9o%-hBT-E!&d@w4y5 zPh=^WvPO*uEZVM-GDOt<%WxUT{pr&K9pp4j=9iWzm&P3fn;ciA8=o>L@iL@m#h-OH zcRW~0lI9C8LnasJL_LG~*TCY0t;<~$Y&-akA8$XWjc#4X_9mMPM>fym=$OlmkDpBc z_5SQ}b4y0nxFDW-$R)0U!gvUT{Y2{#L|G4bk{^?B$nJoP6H>#`j%7sme+vJUg2lZY zAQ<6MTC&{o*1o^+Rir`d0fNPg*ry6;Ns@#sYtF}kCUcy=4c?bd?xXDMNx;50sFU7> zyk#<3Lms8(`hj`gkD@B-Z(J<;Hh(B?f~)Mj+n0WpPX#}{D!i`9nDV$_ZK%HQ0jPnY z>n>wU777Viq`2Q#&s6+?$G8E-2c{U;HIyNg%)NZgNHMzhLvLHW3DDZFAay@_>MXj? zD_Un2bS6m4ffsTi+#c09AgfZT_~ozc?xQ$_YMU!v;blq`T)%5g#jjL(0F^;9>II)< zsJOR07s$e>3jqZC^QN&1>M^w;d>An~F&IFCCJ?YG_>*)VCWZ ze4Yp_AvURiHrhAJ+zNjR>)Pq_+1^f<0OWTKLFU9GQ(qqy`%Vz76D0=WLjXZE?qZJr z#fQ?d#YGB3diQdwayoE)nPePYYi2EK;+wPe=@UV{Mx9t*%u6a2V5bC!Cb4=5hXJt% zjja_46D#yU5zZIjJqLalha5J)*Xlb|ZE2~}v28hL&Qbt01@-4IsC(U8S+%qc2jZPe|>Hi=~%oGj3ict@?w`p-0E zee=Tgotov5)t`JDX*5;^CvktEV4aSX!Rn8; zg3K(At{!qEcpzTqd+Uk9DrZj3(vEulM*(e^Bb2t#N=#DBnEH<2`37l;Z2^ap2FoA) z8pkAwGvc;TE}nXKy+2W6iyGXS0`hoGGN&8?cM4#PFpdW1bcG-o!#P7o#yb7E<z0l-0IJ$L*2`$^Ri?)=#k zDQ`_Ri$=l#w;^fzxh4zP6oJl4>Na}}GO*M+dyjecu6x2n%hhk%?#sOWE%~PDHhTf| zP~^4nV6xWnBT$Cs)3c;1arEvpE{GPc*dd+RmhAZZENOpZ&dsLYI?QQ`JTP6I`EqG6j`!pN>rXXd{x5C zI>{rnMPbO+Y57Pviu2(&12FIjG4_AoJ`ZI>p_RLNE@PYykD8b!5aCup$3Y*)+IWw| z!H5dBoein={}d}p*oenRQHDGWbNa*n6Z5yPyb-Vvl2`O)WW^`FzIYZ1CSyB_=l8Z~ zhqC6X4t;l{YItw1%x$lGRsx9oYtmB;c(}F>#J8JGZw)}47KB1(m5;;?p{-p1SVLnR zQEc`n`EfUsP_UbvIEpFiIA0bwrKri+wHA(R>`W|3_t2a=%3CNDtos63c2HPPR%VjE zO0l8WoUEFvwG60&UVp5Dde1fKQM@aGk?iaUSRnk|<8iqCeCp;_+d5`q(6dxl+|klW z*j5Kl&0TZTtPqtI({_w z<{|FsD=XAjylbX8psXqiFS1(in5sWwN;z*upQ?4Jo8-C(KRD7QE-qh;vT^;>2eG+7 zY;e8YyYC4O4k^rU{&8OSoGCraY4TCpu#W-2|L?773|Y)ZrO1fwsUzvWWwN%cYeluF z3$13Q`?xW{**9gZ4}ID9KmjS_?|jy`)iI3>=hgnx_q51v@ij=7q^N#zx?Y;}mNmnONCm!BZhFI>V*{|ND|N=KF6rKX`f)B4G%K9g&#xl(C4_As_NabKmy zM$iSmG+~6(B`P8O$%ojymZTRYeK|OP&a1cvE9&c*3$r~wX!2HTE3J|hoVvL82Fazp zJ=nNZDcr1x<^;n&*ZcGK@7)$Vd1xp;<~7&_0LYQyJX+Gy1H^~}Cw(quS=m$hw?COR z3#(l2iea8cq@Tert<-qvJZ^MvHQyOz+j0#Wpm4ZmS?%qa^ja0kGQ9cq3@BRe8l{+;{en8omo-kwvB|pG_T8;?t!=B~??AB? zg9Z=>vSx6(lRdCZS;zY&;p5+PzegEBepu+0U@eHC1+o)=kT818ZN5Uxeg%Y7_`iOS zTt9U0x?(Zj<}qS0DeR%~ow{nN`!62)Lr_BV$Tk1~H3F;W{e}5<1cX4dCLUW+0Gj#Y zUrX&TQm=y_M4gup9M4eIu^*)ne)%#2>$1KjV(WhG4`c zlE%jm)u(=qYW!#O!#P1n(%8%lpSb1kot?N5r-K4fA zoG`Fa;_orN!~fVr--C&)FKPupv|azPN*4av)qrV{#ToEmOXmsHK)IJ%A0HX#=MSpz z>d?s>paL|w?*q9pL}k5ePF~>@(!b$bkr2TU>_+2Vq4G&iK$M?H!|ApLuLrYY^1ktI zdD>dQGEFJAc`vOe-|_5iVQB_=$q^%$RyJMUnsa_-QquQ}v7#nL72TZ;_k6 zF$-!4*U_AK5XnDL6@Bg)mErLpsL}W@-}xv`-&%DN^{|GBH47h{3t%ggE`ZW(ZLF;e zYig!}-QfHgvjEze8yUgT;}m3P$MK*fA*ug~A{Zbqkz#Lai?I<95YRn0Jq?-{JPjdM z$H1sHNg1r3frfhwlRDDsw}tMc%&hx;H0ftmrB==)k3?M;XQm}{5i5mX;*6>lmMxC5 zU*az_%AL<$Ds+?*46A+}+rko26Xf563?UNKw=U4(kU{4y(;pfb$upv3SSvpXj$A6Y z^J3O0%^;AJpz2&BtR>V{FXLB>%#1(gsY^OoREshCm^!D$qal_0#3g*JNXT6AazBls zt?-kiMyJLMzK6`DCIOKJe=5d-sdvpslJzu2{krGU-efV-rdn@Fe^92>f`Y@IS(zJ^ z^Jtk1%ZDZ(s+Zjo2v9;48)x5E@p~%Eta5nvp6vA2URH74dHdq7zI*2}uTRn`;))bW z{IgfcJd&>LId5|Wrd*~pOo8n-erIX5OTm|u&?+UoU?AFx>>XVV2RM?($hVww3H<|Z zh|>yFXbPq`Qb3x6*(Y zfyY0j{8NvS%LT`Oe1-Zol<4RTTh=}G+xj*zB;F@Hr|PdSznIlMDxMgvma&? zGFN39Je{#HFPfhCly38tR9TU?&e}8bKp4sT-qo)KowbAQEhgqAB-OsIq9FfFp>cb= zHpZ0Hjqy(KT7Bl(GMY(0!@{mQtj zUTu76S5ObvhL)FLo-1|jiGEDbpK5~0wt33S9HoTH0+VR;Wo~q7!PASzqI!XpE@DB3 zgrCGNsw>=FySHOngh?7&{Bh>L4wrI`+OaS@l2mnSN!mRwey}k`=04s2jx+j+k#qM_ z?zb|}U4x^3A3c=C$n~PLrT4m*LX9QI;(3A#r-y=2Zh?jyUKnE*`Fvq8`j11Hd(4VW9-$^V$KV1D(jC{ z8q|Fcy6!pyoyt_Dj|WyY&cD{e*qa7q=jOVf=$DgCgnP{p-obI0|bQE&6Q_p9ESg3`J?zJA~$ zt?vWVcGje2td((&u}u~YG|@xc&k1JY_}1Cj9Khbg(-w|?`x5yLGq3%u-+!80)i$~Odb31QVRzc{tjC?>JG9e-6PKv0;Y6XC ztaF&tlQex;u`wd+x8};VrH3+LIyr#&RG@5>>Onf5YGbnArJxglg*Td^jcR3T}$~c;ak&u)fnS35}W;TeTE)&;w7@{ zo`#yXVgdXaB3jNd8|s@Tq7@9c@UUbDsVf^vJy1yR`m;yQ{ZVJN`+(%!Ki@Tgd*_=M z9_A&99W6~lYM)5Ty+~Oq!TzJe)=iPwu6^fX*fN6T&jW9L_-j`JX|8e8~fwvj(%_?0dphuEQN z1)BLmtIp*tWcbdo-D_))m{064l-|klhA>goTPAFx4oi#;E>S6S?aPNSf{EBKM7O@H zr>YUX`KrVq0zp3XzETr8U=A2cJaXX7Q0?yCiQrKoJ@doUkxdY?pfQCZ94qP14!TB$ z5Q$pEyrrJqvgzzm5?pPYG1CeDVGW zaUkTt5=iMZuB;7*(v_Jkyss)@Y3oUMX3J&zvO3;8y3zn@$vSQo-L6=p#y@`K%p9t8}2y)@%0T!rUj+xl>zVY6;zp(>o(u$ZaUW??{3DoX@~8qO4vu zq%*lsokX+;k9Q9&2-+An>-<8VGNu4cxiAZBjPKT+Sx>ROVT+$0ms79?t%d)gW0E3e zG6Z|}I<6=mEfld{nlV*B8wyILBm}nF7hC+UZ0q#EPc8+EUxaYLB5Vnx+$&s@vc&o_ z)3Rc+K-3-tDiPKIVC0Rv9WUH{oV^?(fWDP zXllQ9BLr2Uelgl66Lrlh-gb-eVTk7P0dO5D^GoR1=jSd zgpX}9_Stv5x*X9MoSs##cz{m|y=o@ZZ5-sPuGUr8OPS=o#J3dj(X&^HJ~ccjOPMaw zOY`ft41^nBTSl25S-@YqPKFg<*PrS)y!Gyu9S))cU9N*EQVUj5$<=*#E9mJ^I&j{r zS+r^`Ax}HO@m&+8)YYwZ&jeNG_B%x>ndG_|SuzuS!qFzt-O%3gM|!iVz7tcEVqh@k z@jLePXRium>$azZdpLy@T2fuS=xOghzq21zJw5MCyF%|5Xm#c)Pb#&W0ma>ywn#j^ zJPBR<>JSldzS@pS!E3N!%RRxPtzZ_;J;f~z{5?qMcm1dp@0Hz5>&PqoXwus3`fL;} ztrS$v-Gk<3A>^w<`%j_B6r8pE401bsO?Ay5AR}m&9j|kM0*=UKX48j zQqyUc39cLGJqB%t5LMU>`w65vbf#5x?CNuHZB2I`ej_J5Mgb9OOI|yeI2C$F8V4Bf zLEfvB4I+PM@mq^I0-Bd63trzdvLfA&*;P6>{XdRu#c`E|h`=}W1xM8*=D#xC6C3tn6z zJtaE=Ie+`-uNdr2h;NHV+Ks;`FLs5?^2&Dw$zt4K+||5k&rXQLIOzUGj_1v0A!qeW zRQk?+`={up3R<9?Bc3QI0iXrphJcM=!!C_PR-5N>Be<&wNXOMHU%A-E673d@nXB!p zs-?mUU6#dn9Lz4N&_TPq=^d`i<0>6CaI0teQ0$p#Y8YteIv2U0fX`x!Xh1LL5+RQ zloYz~{+%VW%#zQr0GnZiv+|cTq21Wi9otDP>`f>(NA#_hh^P%Hit_w4FlU=rWuLs_ zEd6V>({;6C7Y2tayV-Zsx_i6r+VyL8qc;=n;1K6Z58qVZRJakI zXZ7i3*)v)WDcBMh{g0dGE6%2)r_*nWGlSV!Y*M~44%B)3Z>Vjwd0v0>Tjm#x0aecW zdfaGt`As=1eI*??8b=H+(dNA=akIXlWkFG>BahS<@mzHd{r(x_O)NCMatiCgQu?Gt zTZ1$0j<$kf31d3Of%fk9;GTp&&v_(4KC=JA*48xF+M3&}rNNC0jm_?Hlo__dK;_{v zj*(}CL-Kv4J8O63nSAhXx|bq4Z-1Q6FrvU@-kkX^kT!UGa?#k^1!cJ0b=f+-ZA9KZ z|M~H$-VMw2wHDTfmS_VWz)_7|F_b?E5b(a%eXZyrgb|dyv^BK_W3~-0*e3BO29!JC zBYxtz*?%+g-_g@y*L>W`=R38B(dvQrg#n)a)#l||FlRxC$CQaSLSG3ZlJlXIw`IqF zNNqY_4%%86vhl8QMx;YqxzT^a`(8(g(X|5FFdmXdr>}_L{eDjq?bi3PFdmTidVgpDA$+c+OSc(-m)n zaUGKK@9|-P_9(sWHF$i@F;>vd89P{MACYdet2-)6!HQESu%0hj!q_)Sl#f_IO zUYE2j+5~CS7Ko4k<1K{Y{7G4}c5}_zyV|2aF%G5UBR$#}WPHnbMO!3Y#;s_F(Z(Q$ zz%Xbxy*<;a`Y5p4)V|D&I@`}mh1KT6}g7vCu(i}$3c z&qZy1zu>o6l4k$_KmbWZK~(H{C?sQn=#;g8#%{*Gt6rCk-Eg4IFs5)WBVdXzA|2xo z+~Gxgnd50K^z$0+k+wq`J>GxdLf@yoMg5n=)yg>N?SwYiYV0sA5N}+|52MZ@GR$hu;@5ZO*=H__O zI2SnShCQ8V2p<=HTs(83=uZY;_b1u=eVUhY1d()d#MsQs%$u5;VoSU!Kg6G2ZrqM2uT~iAB7g$N*WI~J^6%fd;U3FympN(T(!!L*Pq!%7g!Gb z`qy^!z4vX<4e$+f=GcQTy=a@Ce#%yC+3a2jt#_(fajf=j=9v3$CBuf6;IAB1?vv~zgQ)VUYchQ}WDw)K)NT)is9 zzKbVMx_ENLj=b|bJNmnKBk%QoetG0;uh@exzF-Sht&H~g`N6|>a_`6X$#34U^Df9V z_<4>vDRVk6jg{Y-^n1G5#xQ{aXv32A_P~kK!_J$UZpnfw4V$?xS;YQN!I zwQ!ZKn7<+n2^en?c@BJbz%F)P3PTd@iKhoyrh_-_GOk_Ru)vKxH>}*OZdkg}zc z=@mDiopYI$}3McR&j&f>Bf`&UNaw^L z`gIr?JzfkAi`o|3#w8nFtXt|v|GA#lkQ;=%?K8h=KYIRX7y!AE_pB8Ac0S5t+_?`a z-;@Obg?w-Ewz+1}T3bHfg?Be7ASlpAj+{Lb1|j3SI^p!^GBlHVI(1*s3+>mKTp>Z=skVRS{PI?IBr_D z$u|0m;cbo^V>{gdh!JN0XZ!6_H=M*d*UOmN1X~QaYZiNdTd+F%8*P_%J))iqg`~8p zE-+MKC_dx@EbSYk5A{0WPx!)e-Qsn&HrBNv{)E8h@uu;7EN-NW!H768lwR`upE!Tq zj$b$)c}i`Ke57&|<0{k%Mo;?WQWuO7gEuYR6h>;|ri~mt^NC&X{+Nk3F3NcOW=Zj- zvXj@f-o|Jz^izaq@)hA>@8#i%=4;ORv&)}(+@INJzG5V$Ey0WYVx**R&|WurKU>hc zAdHx#LtCRPTme|@QYi*tPZ+J9#JE4r^ z39;#tkEM(mjPKMZV@dRZ0>^vNx>MOkJ{4^>`d^vtHJ|oWKf?nr%Q@{Jv57W`aE8Fj_{nk9G4aBKdgQu{u>*k>k07_mc>B5)FMS~s-&(v)tn}-G4K4!A zi))i4wlMZ({V;1IT`uU+PT-=WfOm(&2i%bm3U?NPau7cgFa= z@3VdJKu?{Ikubn^hLBHR{>npNv48XUzlr?O_TD)7hW+-?Z++bKw&3Tgh5lgGn#F7F zr5!KXW9uIc!RNZy;fE(av^^*H*uk?0Lx`&JbQ6QpJn6d#h=2&(YXY-cXo;INzu-Y& zX=(ANBIf!{pBqY9I7#HU@x}jy!NPA^dNH`&@F%TYZ0YuCSrAA{Pk0=YeoKb^HlNa^ z)t^R2*g@c7Nf%3m<-~)h4Dqs(27-&0$`OEbst>jpHKo(7M6 z^(!IBtZ>hkrq;IbP&oF%2loCy|7!>{^)9lG5L&n}EUdh9W6%vj#E%EfyrpggT(TrQ zXRe+IVvi8qE503=A61ZVwDc*_|zTc@0jMr8s8dpkdyf{;sG#;k!<8AL0Up}tU z!lwPOW*Iz)3lYT+O6lg0kI#iVW`Dou1@EU7TesM{M;^8vFMQD!tzHv({B-w6 z_S=8_r|?uk#HnR^F+WFbdh$to=&N70#cS5s+@(vrEg`fVwgbQVwTpdkL>XyYPyF#8 z*~8uj5oGY#`Ru?ac4GH#+xz7pZIvvj~+i>+1Ljco}@%zd4Ah-p7ypdPBi;d*)$^ds^jO6+55-ekDG0bO&A=B zD;vGZk0m?G4X!6Ied_gh%nqG790n_dKe9}_EN4majuv*jvX2&? z--wT-sY48D7?qa0DEai}r|qfDPsWsgh))O=pS!{J)jh99T~ddsO{98U=KLPp@R)61 zwZp}>P2t&s!8)Zwd&m|)BFdZ@dTzSNg2C>W`+w=}`z^cfVp_dvyI{ z_RxAiAGp!3*F7kh#*6ev<3TFa7yW~LHn=hK^Q)h`;qkP+f9ySb@7TM(FWVpGcLt^G zQY(TE0^F0EpYn2Tk0*8Uj-$PO;zsRXANW<&Cr|5Am$Z?zM4$G2tn$8rp&w%%;>Z#g zMd`ETpOuC5G2BsZ`XFQDzEk_6y&&Z9#54;Au4W$}A8|w0a}PWhb+yR*3<4m>XZ4vl z$06At-=B$u{&chNrXTY3JY(VBlY7G$hH(~y8gUR8>|1;q7{u~N);${IDaQHmaw-Td zl#xD&&~)hYLvD0C5(dX}E-++FA?XxHVVt$R*qzI0$;U!Pq&8NvxA>L#S;!AhTVseu zjCyeOgCYJd@iB#|Z{U{|HMn*#yfMHpGUU1ArnD_ z@9@FUO&I=>N7BbAN`E=)ZQ}TaW8uMh*4tk?*TncoXpf)#agN;KTa53Bj0lOG?^2n_ zXL0dk_!ncCx50WZ4~E{)u4HZdm-~Jh^}#umdM1sulJ8YMo^D&Y-P3x&<~Pp|V=Vc> z;O)`ILj2SzZICwp!SN5mkc#n`eW~A5)+QHJFcj0acdp$TLJMsKj|*7GbX0{lay7LP zZzsoGC_n2We0t{?c`Rosr_GENSJo1l>a8};Guu}?2d~c!{&2!#7xm}VXWl)uB*IeG z5P`jYAUY*OIzj;b^rl}|G7gM)?x7CJ5B&*mA4Fs7{0p0&u%$lMVra+92SfGl6T9v0 z!*7QulIosg`3&=ycfM?2-1fy7JMb>U7>@9o!jfs)Gif0Xao+mW#m{{%j-U4HF3vxU z@w9*H4v~Oy2+GXu(4VwC|<~m6uMsLv_bkh`KC?ocG2RO2Ywm#${0(#Kqy7%qF#4x zea4>J^pveyl!^I_MIq>TxahZ$5s%6Lb3a7;p-wM#T#9SHpYHi-T$j*yNfGXhCG?Fy z{K6mFzkTN4h9HT+{Ik73v!Cz(gxaX=NS{SO1VrFo5t!9N3!(-m7{m`mmW4hg3nEB5@sw6V{sfEofpNamr;xeo zQ^;VXM<8Lzhf5bSc(5So%%ATUpPp9!0;)6};>M$exUczp9Y`nTH6M5Bar#~!0qR6T zOL?p+6^eFyUkWYWRxnEQg7Ix%e9j*J%9nf!r}beJzjEe`3oW182mkUz+xN>~7KN6_ z|KJ<8!#xUC`E*dc)8(R18+OD!4&MFlcf+FquY)QVNoa@uhXEtT3@%LZnCSEOZLD|i zg{{x*av@-gE#I`k&K^BxXOA4Uqwl_Jr#{{ndBQ7!<0BrABNpPj(-Ie=S)h@j^WP>A zUm?9X!sE`zlFv#NdCmWDhtGu^1fSsK;nVSHU;18*3J#tQ`_sMy}C)B$ZjpBIlX-oH}c=+n%38^Kd$#|tmmBKMTyy(}j_+-+}t=R0=J>l~p4 zL5Athws^krR#~!sT?lYYSBKED_h&!z_4UYOLsPRo{ioly$G-86n34>Ss?+=S+p+iG zx7|PaX-ps1=noU%Ep;cMrO348LzJhaj3q(t%2#gWxPjuv6V^Ld?+ha;Pr_lO!~np} z=uh34^`09%`HaC2&y8Jf*!}WDUydi?8(k3L$-3Vje#;GjyIqv|i038a#-A1%{b`I6 zaBaYiJjStYp5`BY!9^`MWZ@+ePYJq_<-OzY*+0Df5AId7FP`=!KAzB~S&bPtO>UUB zmtP94hBnROUnkasvcsq^ZrE8cN<6am5qoai^I^1xDTX(U85l+X<%55*J*W0~6U~e_ z+(;u#Aj&+s=}CKT>+@j*Qif$$sr#M?A%hXoj~> z+@u%URJutfZa$NLj0G5eA9VvTdEu#Xj24I8sD`obXZwC;zxm`hQHRtyoHKenU0fXo zeGEJpHPX7$g_ilP^TR;J^c_r5bfxo({nkYm45B=(d#&ec_z@hDXBVjJl7WyX(`3RwczU<8TKIr_u`_4bv-@Nhvx)HrQ@*Z{L@#2ZZn1mq9 z_=bpsC!808%tsv(M{19h7vnHSW1eu|{pp8(uK73&$M9erVWG|#vv+OYWsm!V20Q&c z6;lGa*b?&WQMRsJ_V-183CosthiHMA%5@9<1y4;pOJU0jKi07p;=mC7#D*uL&N+81 zc6Kayv*5w;>jS^GAMgHgoF`~QEX0}e^5@V0x&6sge;Rqk%P>n5!PxnEI3DgxeT}gQ zj}W|vxQ3!0X@i_^Xs7g74D1`+)8s+FPJ8x&XMLP~{+1Bw>631nrV$pYuP{DQcE;4V zkG^9E{Cvjs$#oyY(|MS-vBJHCc=+O(t(llkZqcs$+PRwvh!a9OL2PMp>DD3lAdNG~)=~ zdBPC4>^D)1=<%%bXLw?AO}}9z3)|9n)_?LO)1I zU`RQ+K0)B3ZNKr!8}^;wekbadb0&R({>kT-hF1HNU4LT#{TKftuE!8se&lWS)sJ3{ za}*ELEMKrZ9{Bmzli%{|(C56L&5dhF^2|e1oX;+JTTjOd3W7tpJh&Xu6Ams2{=}C1u8_)fkHuWN*<=)mF?e@METDVa6e)+)j&)P#@ zdN~Z;cpqH;>`Zv5eDK2`g;xQd5qPvb_KmOGcK1+N9ZXz%U5?%p%c*h63WvaNk^N4)dpvri7%sr~yS9pWNC82;mm+1oP_#MnSY zV=XC+hfmW+aZ#RK9TNu!x}n6!A1=ltPX8q?o=IbrC8bH9ia0$!K1aS%_(GoIDM}YI zh(9e6HhoFkI6S+#3owP>QyP5DM3F3?>GxD7%E12YrL=`|cpmT~p^gwYR_)kk>)fkl z`R0u_;3A9Pby$6~&qwd$KjH*@5hAFM>mAoZprNec(b9g^#fyD*=q5(i_SWD3gPku3EwudH9vD+LCpCQ|6vdOk0Q0^6~3Gw~v4RizshBLd&0g%O3su*IW!-?Z>s)>HP;> zX!*eQxaSq&%Mn^4--Z0A&yZ{=~H3*zI?r1%t+i?)kFk)4g$Xh5-nJMZUbLd|AEu zjk4dYW8h^j>Brm{{j^5O9h?Vo=4-IxZ6X*Gy1FSJC<$aPcN1~;9g zo&C@H8sSJc88`UMGFtv(N6DXPe;#3GZ1DFWc)7o_^DFMz^I~{RFkO*H=0>kRyXaHV z@NSzQe)vOw0)0>92LlU+0jAZ0Ax3{hC5#00MF+5MAv0ju!8=FZvDfy# z8UhGU;ZqOE;4v2as8dHk88DJy1i=uO<7jp>1 zilG}(0b>m9i>G{f=g_t5ex2FV;c?)BFdU*625P(xs9%gPr~QfD4~}_%FL-Np+~|ln zb4{*TS*cC&17(YGic;k2HR{uOBSJ?x3l*N~s~B0y2jb4+!CNPr7apS+<=W)+K${{C z>WXslr19Uq_21&j|HL6XK1&dx1%o`^X&8=pD))*T^Wnjht%T(`k22yVeO5dQct^oo zJ`KscZpef`1|nF|ULNv#!dq**8w8o&rq|ix(Gus7k}{>Qypx0J^e~8Hl>FrLPr{21 zvG2`K-V9NUYs|>M=PNDnqs;Ux#6702Ltw$%Zo|^`(H6-QM(U4T0RGnx|1}K8JpE7I zB%><^|5i6nVZh}{^cS|j7)EK{@xnA|cp>23hqwU~3~U(2Ld@{b7`o|?Ou6zae}b8* zi>L$QN9aMwdcrv~Ml5Vu6m>+~4W_gLI*u-{(+n~F zP~ytAm^X6a?CFo>C##06E(qrVV;sracl_d^vVIVS^FLW!h^Gi+*;qoK#Au9n9Z&10 zen%h0(AU+|73IKK$U9OPzj(5oaql1B{>SiWL;Qm$*8cJBOK>droST{Sm05FhXBTZ z1irW3TM(g=>D94%mHW@^&!*P$eT6JU zn&c=g|K7?!x{)|iXS;lN0 zw7|RbrW@)JFKOH4n=Ncf8_(lC9{0OP--|JUypm`7G}E5VVuEp72fZ=N7bj^4(W?5WAzJuCtt8FF6JG7OihTeNIB_Wj0yB@&S7u+x$IN- z#$k+S%%=Xb_L|w8c&OxAKi6`N%Cx%lIf_R+q(AV^6W+%{y*7B-OMIM1@I{>bxz{68 z1R`z_Z(eAjeZBnP%Q1!QGOw%i9p|Gy({zqSh@KW4X#2E<%Uu^kjQf>~eTe#{9xZtC zeEX?y+n+!GXK_wz^!M@cu*+-i)w0(g{NcKhv39e2=KS#&zUgIqrYN+~*Ad@$`$Gc~ zTB5n>j|hms3?eWqg%pMXxoF~~$LI9^mnA;^&mw<`Hu12KMw)tzIJhYzUh=}0aO5EgIC!^^ zA1zZSK;1}anL5Xb#6TaquZ0#DH4s{Me#wQFbv}==8?g~u@G98-qaVlAO_|Wr;zsU* z(6TL?It-!Zn7_m2-R~o`e0oc0@$n%ohyoaP@MgK`LJNlX$G-73Tlerowr0nU5LtN7 z2A&ItfBSZLImGl&T*$Mg#W|0;_&)YV=s51yjetMe>3eP+OMS(Xw)r%&J-7Z6Chhla zmYZCjvWWa|;T9&M_Rqz`O2Wq1lHUrc#f^&}7cr&0v9FMR>}M}&6!udV*it`C5w>9E z3R}2xr7d>BB6~+mHkWbB0)M*QUrxVp)#?z)@P6s?as9%{Pwl!3-a`WjFVs(WylV&1 z;--7a3=J0QG36(*;QNStk)`Z6-nmof35G7H+1aqDTE&NDQ6H`FeY&m%graA0!&Bpsn2Q66kqFo zIt`wh{*oIcm(N?_28hOZBAO=yF=p;L`EeLaFn+|Xb4i(LZL#=u zY+~raa0wfxG{NZby?4JCPX&%9wB(V|j`L}z-b0vZ<7^A+D9EYI+ zqb!C>re0zWYm0HuDtz zYkOZSTWCr3l$BLMHAJ~u3VTrg-L5{)LbJnx2Fp3=oY zgwZ74XVC8N1am_whANEC^i9f!0I}T-POJT0E*QfwGT|xnj+X-g3WGY{3@M+fz7p${ z?`T(rwx9i0Xhpu^*!afhg6CE7xO{}U&ry$Fztjby8`B|u*~QLpJn{`U4Ej43ye>%& zV-E(nGwq+-ci#F=82&NPr0eLE4n|81^Gp+iFhaX!>Ldgjrct8JWYhm;agshOMj#A= zJpGShm-fTT@ffLymz62F$S3avz(^Oz7s``HLHY~zyWXF&Mf}OF-RjOUZ*^-SsazPD+-E%4SvASWCPZ32<#@r{`W9^Ph%nv6jwE*_8iIeYbd7Bv#!dQjJAYQTO{CtDg7TzPYYsR=Qc>BOhGp0!M zG5C^utPuA#e+Yo~w7|z1cp#kMErGy5o*3Kr_}GdkB5de~ac=i~BOVZzV|eHW0qJx1 zR6(G;>LN`F$2pR=&v}wQ!?h4XA#s1=`TgOCKMs*R3!lcgLYpYC4p;?Ve5Srr{qsGI zLB+PjZ?qBr$A9*xV|Xb&eVwT@@is%ay4-y^9vb<`y@P1q=e*A#p23)N9qb54`)163 z-^DokJOV4{gzUUfuxI&kl=GWU{INfv@=SBh4Y{e*L7SI@;;XnE`lJWa!o>;^Mc=MSKiL(*Ii!E zh;Uq&5Dt$`(uO%>3_{B{pZcbYeg9oNynxWcc=j`&W|s0YmN1qvKK`M5k>mYG-$I;1 zXkl#Px*IW*Jf!-dZnTJi2#COR5}36@3((Vv5-}ust8jD2#U3mB(}f5pO;(J>2r>vW zh$38o;FZ$e-fovJT?$VaeoOHa9xoY07Cc&r55a^f!4OXnTlkLQA8!}J5HCUw@vtAU z<+8sE0O3W;)Co`r5?ZFtaUwCKzC%At{fmCXXZFpsg*>JH&lFuu9aa=t)~~n4Iicl4 zJX-u68Mg^7;nC9S?`ZLtE924feSg9EiJV7^_ixhS1{cw!&fmq+?;Z^|ulLy2T~FK2 zFTZT7J&u)Ix7wLQKE=!@2O}N46T+j#^TjP7o*caBeTXs3Ka=m$g>)ECn2M^}kE`?X zCF1Zc?|MP#8Fs;dICz%{27kg6ALS>H{zIf!jN8+RIQ@6VO^p1>(LE*5zBTI(;$tgiK^v`Eu0$O%L~>KYqjKb*)Td5yGvjD;WR zQ-*Gg_U<1w>4!vGSJ)Ipc++^p8Ec0YDyTm6>(j#8#7B40Q6F9T4=UPYPYw4zq zE~Y(fJ70du7O!3F<#7X5OdU3g(BjjbA+!)~-m;}3@`Xps&qonj3LY)Ij|GpG<1V!9 z@u|)}KYVA87TOO&KVu|YRv5C-UpZf~@|h*=i!XQa55@uL=7pBW)@2?o+-xJXaFd!B zTDSpSw|ITr)FLn-v>=9H^uVKqn==f7+?1{r5KNsykd9;9so6BGqP2aC6-0+Cu1>X15*{CliCIf;RV=_a0b;|4eN z82m2y4LJrGZqzYgU^IsVPkAD=yo%7`Mu&1dTJmj#I4}Sq{C)kAue%XwyEDn2h9_^S zPIzLOC%Jj)Z(eAjjWeF@aKjOGMN6O^FpU}0$;6A?Q(Isx^KlXb6^5IK+}j94I%Q_6 z4m@+d_uluyxRuq=Eyk(+K%P3oV25#or&sG3XPqGn`}=&%LxjRO410{bVW=y#-|^aE zGN$lO6%0mUVA}L_7*i=1?TWPEe#V8QzjL95cUeSV@N|efwGNC&&$|H=pUi z^SfjIo+a8;c!)TtqXrAb`b$ZEFee`1r-Fu-g}QG(Q}gAd%ydqefOT6 z#F&#<5{)f3Y#>$;6i~qe7DN#gEYa`zjkPCi5R8(T#PjV1`z`xjd#*Xh9COUIo@W&6 z&u8dj*Za<9At%f=0K9>%zLNt`mp=KdD4F*6PQyFNm7LP9^F(cks*xH{rH5e(Llf6R)0ob;ul~&lj;f z&lGs5#?UJP1EmpjE>Cz-VF(Mq8y8RadlFq??zKzK`edH`?(n<*v#n%X_au40TZ3$H zG76+$WZBuSzm!d4V@m7N*LG;1*S$4n?)BZe<9Uw21MI8&05Ao(v+r!V>#H+D$H)YD zP4BHa5twAkK1o?81(~#AB2?ZESPeJ^N;Ysg9nbg=4)}U>kv-(153V6?4s!VU)ZGC{ zt4(6u)|Z#LEaETY*l`}SkK~_gup4y5{%|@{I*~KXJo|c22-{8%Un~1r_XPo8$?Dvh z=Y?FjEa2;ujA3GVpR18#UzO^uMK_Qk*&i4ZQtcP>Lw`h|C;*>s6cI#^$BiCWkv<|B zFR;!+#JLX??ICgzDNB~@1e-2>=y~Y}!jG=AtJ31#7oe7xzP-EbbWN!dcZWXcmpO4b z#~LFuw%n^7$W8Y>TjJ=UkH+rR+C^f~{T~J3W~UGyY&m^ZGfqSafZKW6H9se0e~|i* z%%%OaUSE6nwcejoo8p1k!HNz^^ivg$@7u%HPXe=UMAq476$l4JBSy_9JEGpx2UWO! zO&8IUC;>LmxCakoKzJe~5G2+GLeTsmQ|XL8i(Es3*k`Os5qSV_^G*Lmh}e_qu088Tv5;XSyp3;U*854>^Xe5l;(RuHMf92eNufwSP4;yUkccrhp$PKy znY{d6ly1o>U#xwyc8_SEN8Fklk)9$Yew$RL_R6{^iXMUJFX!n8Qp3F{M;(zZWI#vBj-DV_$=zr5-F?7WAr{}BG-^xHRg=7TVYvc`S1G1h=8@v`+acTHKOJE zB3csBlC@4we+s#FcWnIgxgRx8jcAds_so7oOA#&oNY!V^Zb-CH+2Jviz)%AJLK67@ zHPO=M!G`9r>1@9V?fyFWw*nCYn5nTcdUVoa^lPoS`I6$Rks%^fT02M?smYx0yT6HM zac-->qFn$qf(0StGk$k%H|)-P`|Y=jbaAe>Tfe$q4!?dI36S5AX!$l0``gc@um5c# zS|%e}k~-{-co@Hu8ZCbi(emK8Ct9+$5d-^W-5(fRGGff=5GHx}PgKjqSts_U%|E5m zlqr6WtdT0}m8YL6!UcGbh;Se=wr<+o+maN9TQ_eik|74-E0bmp)P^}A;QN82lj0*O zvef=SYV5U7Zq!43eHMb^mhchma!iZmK|*w~Vx2#(J~ut>l5VQlP&A-~B~AG&LG<0t~hWn)kMtqRj0e_U_! zoRd?B=i=Vsnd43A=j*9&qPEHF%a%nZ8|z$zPxIWWHg^_Een%vEjfkA}+5D2t50c8r ze0`WYFQ0{U8hu2(?T(w?n|V%ZU!-_F0_kN$%S|`!B6ZklDRLj8C4|uEqz-E}TBHs8 zN<_=fM9Y&QTJHLKqQyM7{e>fenq|m;>lUzLuBFLrsWN@92P<><4>;OG?G8kX)L|zk zqNPu?05w`cZ4dPxR6C+&dqm4Nq9qY6MYMcP>af!Nj173_fe`QyPy_%0rKGozp29;a zZ$l5P9+tmCv}BDr+I@kaJ+P~Zz*7`hfS~YR2ZAV+`)Pr89(n;6fB_($W3ZiiW8jMi z`UWL6Fo>tCv=2ZLAg{;`pswP%yw(Bvh@n3|{l|K+ZA42Ozw6bjTHetPxSqoUw3o5) zpqC2b$~jk+TmbpfmOPN+-ar|7sZYD3F)pA}gEHqQu{!-J80dA`=Tqi=`B z>-vZQF`{q)x$1fp(GoeE^VWal0RX9L>BJM=n0cJP9Npm2>Aa}{T|`QlKPgRkiUN%Q z8sC&?A#Y;`s3PDHTjxvXls<|^0K)LR1|T@p2)gH4sl&DtExbHCTPten$mrMj0w>Hr zkXFi#vbO=HEIG0Oh>eNYyfh)F$2%6_J~m|Frv5k(7eESq+u<-}5w?Dl7Z8OW(givU z40W)*o44?A-c-6y4?hn;Y>xWV-c4Gaz2g`CtNzArNZlvphyvo)4M2>iG?}=jS~$F& z{~FN(P~-(Hss(s-Z))>A`r>1~Cs#fhFz1t!KOO3b7X8V+a=)WwTd!>$@A_+uoep;A zckgch>hsP4I35%;;oBst+Y*aNFTL}7yw?IZ=Un>rpzrct_q~b<6qjl1<6|v zz!oXw0HOo&fp*rV6o3G&6#+%nw30sOrPP=K6av9ZhTC+;Z`%*5(?zt27TQ`m4LFoe zYjz?Gc{+<6lDY{|B)uQ-+Su4pU@pLu&R!6ma4{)=0O;nEEkgVPjP;q_vX;iD(7H5c z2sRM{>enF4=@B5W_6G|i2Ysf;h_KNIWdF_2NXY`+GoJ;B#+K4&g}tpYb5M3O{pnii z=3RU2sCps5{8K|hSmWjwQN^qL;ehDu1$o#H>i%F*&(g1gWb?iLkwqWq6qQkRSz_Y= znh4(Y(V=F;T;IWlBJdOe7dgUXofkTiP87s_A)BSGRAVc5WwoDJr`Eo4)!4eG>Z^X} zC%ppf@+v+jxb(IkYdw~1I`Wo&Nb4kGOeDk^$DI-H^|NXWbOhi|r`Tn4W)2WpKbn7Stw-b# z0C;Vpdf3*x0@ebUtru;#dx)&~t`B`X=ojnK+Osa~lZYwfskNKGtKs!=>NhDa|9KK@&=*zUK0=%31tHI7d(F?B8q+u=*PmCxs% zIAd|Nu@Tqj(Ld3uBF!3*kF+pmX#x>RYMyY^&}aRy7l~kMd%fr=`duyOVokPi5fNC& zBH->HNRT$pRqj)idbqk*k}D~YsNli4(NvTX-FZuVov-`+EKb;zUYQ_*d*;}1CHZo z1iHE@SEum%{3)kY1eALr1?WX(SjWh6cHZYj28v+Yz820bo28cA?^8>~y41e;B5!uy zoSc%j?G1=7?JrMi%llG4j4s-rkVgj+4WW1)yG76ARinfV=qT%>=S0pOzNu% z^?pm^vTq2S(MiEbMw>#SY*@QCHHKF39oqr7xuDJ)$Kc@2S-yeON{Kq^++U zd5>+$MiC-wbP;Dq9(!!>h^bRbE+g`u&l>B7$YAq^0H-0N$ZXJuZB!R zw3thM+aYz>Wta5ECZfeyMcIgGsnlVAzQ4v+sl%>M>aZ)i)M5K2YR4E}PTiMhzb189 z(-r%XJ!Z<3DZP^;iz7k=iL$isAzNO1?X}+OTzBQlmAzL({Ba2IzN5ze6`}n7-tds- z0d0_IkvhzS^B~azq-d$bJg_1YDs@=S1M>0M7SZzfUr8NCCT&arI=Q}G3n(%%=`%bi zO1Y!3HQ%ko81N6N+Dcz7EYd=^UkID}U3}LVYcsr}lv3}FLrQcGA-5*emEJV&|v&olE710us^OO{Qro(qVe^>9( zWsmlrT(KzN(bnF6{fI}OqgzD~$8NA4t$)?&fiKm5H-!W5_J54Hebzi8IYs82nZnMM z@+P$+e9>WZjquPuf(1E8FV)4k<;hzL z9OLD#E&2>3LpoX;=O;ZK8+c+;Yaw}FO;p7lAxov_6S>6;-sJ?fWBp5;mS6 z&by|(+^u(W_h9NDAbrT*2kDhLZ0odLyS}HR`p~V5WKM`aA;*EUoCwqxu^&l8CdKA0 zi*BuG44{N6v32&AOJ-kEkxBH;Jg~+0FMS~M{#xlI(x-Dw>DEN8%ZydjP1%if zNV{b>VpHr}$0ceM*p7Ud6K4uUEV4lAwLhgmcBdm-W7DP2n^=87mIDCEKsLX?5hlQR zpuW^~0Bma%C`%@;ORnTn#AkF}FrMJW=)0Z11Cs(j3DYn8*mmun9{Kg+51XI><>BU z0$EFC2S~Su?C*3P!Ra$w<7{OAIVFUGKD-pt>A{e}e@eSwN3_%hTqAl@Hmk2E1lINs z32Ko8u^7?G%r%G9ec#Fxp%mx z_H3VJ*B)B_P|c%SPakF;IjH!oYizeuZMAf|M7Nzox5;{HqDS0QAZ|r$In3WpkKR=G zCG>-B9n?$bX#bEz<%&F`@9rb$2XeHt6MGdoQo517+4N7C=W0;L_~`^Er^zu;>%{E zHj9_qEdiWi~<;o=v_I_!zxsnPOs5gTr9f#`>fA791U zC#LBA(Ru9^pzw=4WCP4ay6nAgbTsctSta5G86sVl8&@~0KTL##x*_XAioCIWdGD3Q zPZx;qCO1Z+?8sgL^AQR9JbuRU6|FHgh1a_%i-YQ#MF0f6m2zuSMT)Esxv)Gy{)1*NC&wRwQMcNFZ}S9>$OcYwh^mW2^pEL`xAZvopsbS~AB)w7h_5 zxhqjbh?YdOWX>AVQvK>$s;_=GX0n$O?1*X8ss_uE$4=`Vn0}E<&3nMFP48`})MFxu z5LWa+q>;YA`b>zHU;idV%gVZ5BU+}PetPfZ%e&NJqY}|#T$MWP-urr@TC6dVMW_Gx z`re!?FE27}Tk5c+n|tQ2M6@h_u_A@`$~C_X=+(yHv-;g4F6Yjj+q>kFOM1r~b4>5> zT#Lm6v-?(q*Xo(1?~- zM6|>cPAV6mg9l+A=+X^I_f@IGLWW3d1N@L;=h;`|0UVtwV5pa`3IVn5LFb1FjOt6R5gMHfAL3c8oX#raRXHpW)N?TIU+!f%7XYBYf6XH2P zt33Io+>ldTs+A&I0;B0p( z?4D?ml7}8E@(bJp9zK=&5ldcOQXtvB`*peB5rx)(C=`8ZeJcU)=F4{7?)C>?jm67c ze3t!Ds;;sDN%sZx;WIMRtJq=f z_?*686T(8uH1&&+L!w_6glORv@49R;LWVrp0cm#HTv|WUA>H=mZ5?r&HHL^KztMdl z)4HC14n%w!e4p>JK|3 z9oarTDKi23QVbpyQcmQEwNUmw_OQvYK5HO3y7u~g$o#hVoWq#~kajl zvm6p0jmNm6)aDUkhVW6?{+NE9B6U4Pl_I~8E22(-x=0LT7paFdEtzBt+ty|NRyt;l z5pvg|eEqWY^x_jQj;_q9yM*&e8_WtKn-_1C+XWl1dEnGYQx>^f< zW_P79BSSLL2PA33Wustm= zN0**#ZS65La&KWjpyz5HDMW4${U$_9HMY|iV>FNYKRN5xyxX6pv$ChUp8|Nh_p+|Y zjSlIn!<>ADbnH)ivyY;x*gsKs?%B-2TXBqdxKFe!Nt7Rg^niU1tTm!fCq8D~{`#?B_lcIxwJ#YXEwbK33?XT!B+Vmd8vB`b!hyiPTM;c; ztL|rNzfWY3^l-=xdmV=a_dBA5oGXngoA%Swep+NHyD0Ub6rWEes<2Ya9qJD^lN#Ox!!&n>C5xez015t(Pz8l?+rv{cQm)kz%|=L`4G z?HX+?!|y{03?(p>z&}d@L!xCDlDDyV?R?mH)LUt3!$h`t57^ymv;e*R?(jDr!HsC~ zAT7lhQpWd>;^pl<=~<)4jwy>+WrRkTMVS`H1daP)~M7FnV2Jz}Mz7y^U?*n#>#bW@x)FA_pp ztiJ>#@F3uYX!!)uvSMZL^^gK97e3j0?~OMqomcL3bN*))xpKtxV|$YlH8J_**%f_) zB(V+v`v@E_^-A`Qee-~Ud=PQ+*2}N-RxVmpq=?!oQhrsct@Lm7ktxKSzDu=rWNMqJ zk>YU8N<&tJklHGUo&fcE8JoW$aS$!5pL?#p>x;39{urMWUdNwxR>`GEjf}BH(G1!> z?HD`oK?F?~ZPT?Q6`FcD>e}f4iYFFrv+3I`XW7udooLY~4;AF1?u!&*GtWJzHzTi6 z;|>b}bV%l|KYt-skVj-8nuYv5@F82YV~%$wTF6d{x8u${qc``;D?rCk!a|Ccx#@pFCs}hm{N+V=O%I_SA%HAE z@R8)CmI}f{xpRT99xgr10>Ns3OP@W&^Blh<(FDjQDLihCXR?QL5AOg{p0EG}dDFiM z(UQ*Y_Gk~zQcEDe5MjJXpMU-NY8!x#6xrc1xYl3?!g$yh#W89`Y-r?ib<&}TVlci( z;-Sxb5&(V5lv4{Ro;`VXdHsKdXvw(Az(0qFe{)y>D4Q67%Ul3Uq+LM*q(G0Ho)|>-C5f|CN3OxWZevQilP0Ba?iz3j{_>vZwT$fh zb^7~tL<>L?p`;)=aBImMOR8}1oA0g78cAeg#;_$d4S2LGxGZ{Ua|$yXkM%$g2zIcK z`njvanA&-Chit4_sqBD@mAWpG7{J9RQlsUummV*B1DrEAWZVFAAe=VHK*}&OnUeYj z?`(XlqFKn4?u(d1Od-9NWZg&&Lsu#SD>`B=@XkiINIS$Xu}O0He-zsR$h|uR9$->h z9O;(Kk5q5~LcsO)iDAnDhtOZ0ru7A;X`Z@zWyrtY<(YkQPyDof{)klq}T2nUpotqQj(w6V(EIMs7VG znbU9jqffleE9x*H?Fq-sswke%fuKp#sUPem`DvdH(g)<2)JOWQJ`>UhXlM?{!7TwKq1Y<#>jZ+{HY${f*5vr##`=LPU$`I0eMb z9pD~t{#^D;fUz}X4#|++5E&o>M$zpR>k?g?^VH!IQHoGBzt+P(kujaM4&RE-1AN(9 z5f17iwR#93J*mlSC~Igvy!t$VS^3VkW!+C5*+yD;GMd5L_n$GaXK04}pv zh+rxG0I~Kqw&1|nN;XZT0R0zjXDyr=!g5OX$Z?4ZsT6aOiN0>$vbp9(|D?h>iR#^Ultgzc%k|pfy!BT4IxtVTX)3G-G_K#^^doLT$Fl zHa3(ag1;Gu!`iDk&*01#9YY+c$92OQH}w9W^Zr+epo1%V@s337+?{&`DZ$(uxTiWU z(Lf?H>Gh}(jrxwDe=MXY{a{<#o?90jei=$&D1o5_{>c&;5-q!!0GrX^%PBSo8xK-M z+AxF*@CoSwm})gzkSuOUYBR~_j!Zfo9>F4EwCOxAWQ$Z_+7r!!XhGbx!v?1fH)Wz^ zoG%gv*)k+rnw0kJJK2)!kZ9R6`R&Q4=<@fKXi8VlE!QO z>u>ayJ@im-)uMQ$1~dnB1KHJAIehBWieixB3fTR8A`jk66iippJsy6s9S0{}ln4$f zrj865f;jn0!0v+*x#4d>`~43ZnR+Rk68Vu@4eQ^n`YGgzSYT5`xQOgfFGf_$#&_PW zzcm{qI%9f3`bo1+>P<+L2r@^6iTWlYRYZSCw}ljv@=EHigJX|Gntb@~`qZC!sDSzp z-g>L*qKJf0M@7N-iD^@Tb`dh4rkyRRjj|;v!an&pMJe+;kdW-8C_DJDvGx6+(PN4X zdM7E>mfrtBZ`s2OY7EGfqauTeA*hfP(rJ+g;^xEjQ~EJ&?RQW>lBC;0KuAqSHk%Vo zsJ;o?drKwg`>_E+fAf*q|GEFf*RC)uj&GvxpXb_c|oR*XkcgJIR*7f{DHDQY z-w;V6hi060cBPa%XzZcA4R5?rWXp=j7xq?!$TQ#jq@K#Gi!biYJm>7*=%YfAL?5Lf zdu{0pz2yrYsk$!SBgg7<#_#ai(xvF0{pOt1jJff~8++3dmD7k8HuKqMQ?T>tr+W() zEa)v-nA!4_~*p1z4f@vwMrKn0$vfRTxbPPrjogfjw=YGZdq zi-&WO1^@(wWO;O-7{CJHv@xEK-S@6`3Q$eXd#NTAjy@!*D)igKFz;BN%RGGLUzeUk zKS6aKTzT{JEC+z_G)AUah#r7%e(dJ*mX+?~grpm~bk=17nocNE>Te-hnyknMNTMc+ z2jelRzoAe#?_a<_;6aU?y#aO^W1$JP4C5u8ej9HU+6{mwA2@j;lmIQa$Aew!5AOro z5d}c3zecoJTg~fOdt_yPc?2Uqc+M*3{B4Mq27Vg1G3&SVMsx&FMBsojI;NHg5b(N? zalkj%28>E^w>zSx^e}R1htIUb-^#YQerkTA;A#1~QjA3<%~k+K82@ z)HT2xp8@y-UXB=bM1jjv+W;g1FJ(`nTL?M;B47hx0!*d*=Ci9m!xNSqnkRo(tApM zGvFqI2srHZO%LspnK1*Vc)qhWS7bb*2!LGx<@*y&#f|{HT;F=67lQ{KlrqS=KQ-R> zqFHzrKN*rn(Qq~hh<8<@p^Oc1y1horcA^Df*sbx@I!VeXbxVNF=N^A*MkWYUHb*_!Y+94N!wjqiv0$!@mA_7HZ z31Si8%MJn!^|#5=nm;Wv00y$%WC{dEvaoB9rQQy((0Z{Z0ss1sEYgmnEB5_qMQ|W< zmZwe%vcq^LjGa&^66Z}jr8hY?j#Yf`gLf-cA@67XW(R@hY-Fp21f2IbDc96}c`2j; z5SGreyA2@y`Pn}&p!D_iuh$-e1Ou$99b^8y*aLdTn_VO#!pk}VJc`8G)j>Ynj>vUT ze`@XK9rkhmkON3tVDa^dOkz)flt5kfRQ*=NnG^kqnO(KVt;6fH*%u3WpafL~EX`iHo)4@>RWt;>$o z5;=s-L3#tV%`e?!o20|LICh>bWuFkcq86pMWb>tNvcFxH{sN%)%DQO4u(7M}rQkZE z<sFLOm_k<{$e*br9!W=~zed3}*-YGIfQ@}dJ$#vL58#eu~B zQ_>sR*U5$~>C}ZMU04w}G$+P>mFJi2Pi~9_YU0?2tU7gVdKnB&d}p3TNTU zMcekRM$Ue!O2wX8NAeN@BB{mxlQQMfhs&>HzB5iQ0_e#l4q zgN!nF`s;n^_m0jQ6N$?i0ug~^6q%`}jPDS-7@weHuGr?-Xe&qUFj&w2T_jMOzgy zoqJu5EZRU=uxD(mHDT-sK{j&D`ZXmpZP^Fx2X-j;qZ>0GdWqnBEB6hEI(=wzHCN`6 z9c(pPM6`@ciaj>c{Q`SueC}J3ry`o9_Cx?_+d6kG=V+HrD`G6?(=m0H+>2Z_@2W)9 zj;KBK(b(e0+2ZUc>^s@IUt@FFhl64#I47}v_DcH*8%rz`ue7hD6K2 zEZ6{QV-DyB6mDL;47ZKf3pnq6KRXvF)zX3au8qM1(E_LfQq{)SCs~j!qFRJ>#fqqD$QpWC@V`;}8}bQ~Vqe;xmYw+zsZ5BRP!Onh>lb z#*EJSV=Cg~lcWceN^Hu!xxI16q#n%CqG#6h5EIg2Nf!ku7b)TpC9>ti&DFM4TvBu) zAVf_7@{u9ZVI47bYFz`tqE^h>mtG1nv#dxL=d0%-a^l$e^NUp4|DfcO_8{=s!iow} z>qEU6ppmxK6cODs^^|!v2EadJMv5(@$6wr3vt?;}^byIGp4uW4W*py}aKg;UD&S}A z+H?2Zo3vN=6%pZc&L>$GW2SuDvxR%;MD0dHi>`l$^Xr7fb&AIry81mdxR{BkfI|JEu4O)YC$W zoRo3Qtn_2ZtToRqsd@YO!=xVzAvZev0s`!qL>M8d>_)!RZ z`$|Rs4D_8$^uhkqh?W~}xM2sPg}p?yNE?P|S(Hd2?|HGQJMVqlUG^EVg6QV_b^wr!qFL5M`mj{0`<#`DdkP2x{9=w4F z#@3>5*chMZtqX7h90TzHu}BJOl^#nJ&!c&Hze_tbyJVyY5iO;M>9hMVhy3k87$7;)z_NN{{Tr3SNyHNmcO)E{04QAluMsWV zd%WXM?b_OK06^6FxoDQ!L8sNclR1zS(elSccLC6y$G*_5O6L__MYQmyHV^a?h{<#P z?f$yZzIk%K+9UvVWXA_7M$QZO&ZqAz0C7XW93Yi*fu#Uzkx0le>7?l2=B!0viO)#+ z1PDh;iNuip&sx;yuZw8W579!r`GLVac9ANsNgU?Zs}+myYAU>6QBMHC*OPt-=tCA{ z)MRL!J0IF{j`alWQ~v<5cSZ`(lNmdrP_c-HTT@3x!~!7S{PS{qwSl8;G7FT}90R_>kVxE|H5UBGD{J zY=nz(kT*R8PV(aaNmBj*g8`SwO!ZsM*thnI-bUx&hi71KFFSjnRAS z@VTVj(qAMLKo|jmETj|WMcphBV4_S!N+>4J-Xp=JfD#!hA`bb9R20!64W~W3=@qaWp{PdM-qCAzSo$d7@`^Zu zyz=hdc^tTW1E_GPb zG5SIu^i547_g30xzv%WsA(@a#fPXc;^m$$8(pq^i(NE@5w3)qzuC#vF54-rnRyvR; z+Gkru7tINAXWluo7#qS5S;Q7=yR)&~>Y+F6G=jstBMt38+Ox(F%2@PU%_sK3eF
xgs&l<|-Q@>6Op zYmy@kGKO=Fs0{ZqQlU!i>3)*kLEu^^hzI&8bu3vMkH{@VsQIwAIXbYJ2pac<(uPeO zJ1O^~KM8SoZat5(KkU4FOV>7ch*{|e-8JgJeawBZlz7jtS(5e^m+t#4yz)%AJdKq+-o5GT{bpLxxwDiX;sz5Y=G*1YWN{OOCp}bt7wPiZmZY!9+OAYl?1NX|R`?8p2I+#J5V@fq$4M7m zP=ta=i#J1xNVm1@;fH%KKlxOhk8IHfB16h4gn>Tzu5W6vG$Q203(hZML=?lDFRiEu zoV6=fRD^@injhm*%jNj9&g>nR)KSu-jZEZ(Xoxk>K3D0_-bjrO(MPH;(qkoQCXCdzBp!8|!&cMShP)JU3z@YQ;@QhgU0Lqj8p6Bnb`Ub}8bLP+k ztUdr6dHw-?)Ik7bKAiL=04pAXa=S~XCJ#PBV>{7uXMm08y1vls{M+5(dJg?2Yd|Y- z0vN#a@sofe9+2;O?%sg84;Nr6<&Fn-9J+DXvjt6d_6@W-VW8S(e z*RCjV@0$777RV+AhZ+Jv9)LG+6o9I&lAsIpj{;_lr9Zd&>!Dp_MMWumi-?wOIug5K z9Dv4I0sYPoxHUC}-GMg1A>#(H$jgp&P`ibE9Dsjy4*-q81Trvhj=^D3w}Y4y;UOj7lP@nSLJo*YCZ&G?CZy%z!3{)`x<+1eRA zkVQZANi>W305};Z0E=G#;i*4V`}EP8qk9bi11156tR4O1al0s?Tb=ya2GalD>;O>TSwJr9l}?ZgI|G~r zTJpqyJgL9{%l4G^*=juvvpc^y`+w9NiWXcFa8HUVp5NBFsHMwN^X1sAC2|A&>EozG zasW~h!`7~pP{2mIZk>^jNCs=#-yL)r*a!G*eeXt>UPTTfi;#gp`({tbM4gc*Ry~oO zB9S2>sYGVXj(nx=u@>kn*)k|14Co@6iO#50rU62MnQWaXALJL}Q<@v|Lf#dnkUs40 z*g z3gnOi+SWJ{gC5fleP~1r0u||P?vZz7Es96K{j0gpb!tybKPo~Zdh^2{{-8YGS7co- zU9&VcWMS=N$VU397`-TyOS6wuilXd!^y&7+w-u3x&_QDG@K+~k;-Ql&YK87_VlY4U zB81_CArTk6_;Bf@XaTm%{el!?vnS3DIeumB@$`yKW4E78dP{_u)ME$~2YH!G^K3s= z>&@O|-*gU}Lcgug0)wNsDRSDlc=WTA^oFejcptR?fkm7kYRO!EH%=4)X$1B@ zxzB61Oh2@9L;&tf;viuU7m4^$QX|^`=)jW5z&gJmQ8nfVG0aAaI3Wvl!5V2qEdd%wgLPeAbKJaf&hz{!VYL8TaVs zSeyEfXf(IEs;MXZ{zVoh#)?s2b0PR zIP7SffuEh%B3Te7$PoaK8_;$*-;Jhd7OBG;(XxJha?^$^5vAhq>bJO-15xu{(%opw zwY`u%es>)tjao26qNPcB&%W19MsB!L=xWL z`N(An~~H{b5i&m=`dmD%&Nb#W=S4q?z=zrMizEt zPjtBKR{u(6b7H^PhfhM*j7ii4(&X4v=U3Fp-T`$tZFsM@GWAMCl6)9j^yyY!tliQ2 zEXv}T*(X=Dg+lHkCV>9x(7f{WGu1Z2NK^~ZAHjgEP{YN|z2DUzQ6yZS2%|jn(zoL? zu9GjjtkRQ>Jmk=%Bum<>j0X|%=8G>@6o?dAC!CjB7&B*7#LycdDjxauud5GIpG{4q zkBAphA)hA#L`tlOfBDNI5=JMI=(saZuLvHg&=3+&{Qj2Sg8%&0cS5w3{4-|PK~&AV z`l{X;KmBQllu<=?y!2>7-yeItB9YdG*isvXYzY0}qz)qwY39uFh|!}Gm9#0cc(V7% zZ*K00mb6_HRATg{lV`qm??-oP{=7e^Uw+TgX`(aiZ}rwLzqI`dsc#>epQxs*uMRML z%rV#T3ruRz@Uw46BUiI(vW;fLSSZk!6Y<6zGe_kmhsqG{W zvg+1T2NHNd2kf48{F!-hpH&`{0KrF+iUdf>3(&(Yz*<_UMzjD+e!uYdApts~h1X|! z5axOw3;{Mkiz*5n01NrzK@uq7L9!k+Gk+f9eGZs@Qb0b}@X)%GrQXyTj1JQzrk!GogmH-6q5gj}CyML+r2Ap+B zeFFeSl+eHOo({Mtt=@|v75?J*7k+*`Wi-5U2ser7UHhl0^R` zq6N_snE|`aV;zx)`T6|I&wFE&w#u3Zmd}fqG_nuqXs+v-Bq~F1^baVtE}qLkQGEw? zOJ`L;R_5G+O4<1!lUnO()42dc>Bpocq))(D>sV9;QWsFCK1}!A6R-7JH+d#XU(Ef_ z&;CWN7paJ({=0X{z2(6V3|DvJC-KTQRzPfv!XqEAmRfC5h(!7)LJ!~#p0oDNxuXGe zLq_Gk)u{;E8^Azvz;M+v*ck;Fm_?f)T9byY{iHN3yhX-W0 zzIlryNXZ>AnNRlH+PUTK?PhoBDlnMt*yoBE2#G~k)Oi8oYLkA_2$n~m^hbg53qqv+ z`1Bw5)KEdjmQ3?EazQu&YR#u81lJ-9wnGXacGJ2gYx@Ho5QSwAq`UfpXtB=OX;D|A zYt$(NK3^Ff`OzsqDz&oq>=6KG>q}Ij`bXyN!PKWfwAe>P+n5_fga`ytK|fCVHRP26 zK9geFQd9wh>63L|YuX?C0w6o}YykHkZB;)d5f)ZV#@FlysoTHk7Tbo@o;3ED{$AWg zOo%$s7wObK{`8}YtUyqb2V%I9!HsY>W{3TsE?yctfy{P&I*Sk!orcge7xWY;-y&KN zE$&^^1#3i0vsK3ApLs?2tDQx6)poPr9x~#P5JDXQZavdy*F{z!5D*VYV@?zzmcGf+ zz;tUgmoBo-xac85)7&8>?HeLm?ep44@aS*JJKweAzJlIQ&wi%95Mtzv*cfZwT(Ajc zOY$=t#ZksR148SiH(x3{z|Lq3v0zR_2VOq=in7_9ZM1K0?CXe=4?`;E^3@M|Z!W#` z-+Z&dbnMZrAN8k@mK+HVPN8@8bKINR6Od-^Ymg?KU=Y<(nps2IHBLIjK5=re?&$=D zwC*b+G4;57M+VzFENiJ>ZcGRpkz|{)Un2Bg>wS`a!2N@@Pp6~>wI)TsxEIjh zC*l~P9ufO#9oWaJ9#{yxak2l>FIuzjee_<=;ZU$wkvaAPY4)tMeSVmT`s}}=&5RF$ zU1q8@25HvnNBw9#v&kyn)Mzrj?{PyIN*3b8yXrVWV1h;>FHYRu5<(ZhZ-i|t_&aE@+ zoU$LWE%llFZSd&Y>THZ?kgch6BGO^b<(F0ThR6<4C99uDwA@#bB_c3@`#xJovrV}M z(nC~?NE0c;riEyk6v76XBW;+dl$W1+x@xNEzvD~ajV-T3^EoN1MAC@50RC@Ct&^w! zbXQL;5b38x=pYObD8?j3l?WA)6jE008=JOoZsaSfBIkIJk+y3>(t=5erpC%av9qrv z71}H5%ZraZ+CwC%&w}Wgd47nN8OQfF=UT77u&np+fBa`}-Ktgf-n3Ir>78)?`BftZ z;qyXj>Hx|iQoOc_t1a*@;fD3%!k+f>weTb``>Qt z?VV^E@?QGq2daOfTHal~rfRK_l?W3l1*Is+Aw5&p5SxTd67fsS=+6bZoK@J zbJoDQ9x6SY0~-_smIM9FJh%f^ceR&_RtO#0_@0_--s3<>hA&y0Z89WUU(Bx zgoo~!KL0WY9$vi=Isjr|2Cs1E7FiGg2SCV!Pr3wvCJ@DgIj{iH!sAnV4W9eJaS!o; zNMNS(ksTiJd4vL5q`E%cmopC|-! zk$y!j3*ZV+gI-qZuy_}XXaPnl;0t7QG`V!w+y1rso^_oe6rmwq5n%C@$PBQvfAj`W zqJD~+HqtoJQR8*ZMl47xBXx;l%4!|F`tGY$Zvv3SqqgaveCztcOO-b)ko&mQRXJ+R z_yTSK(ZFsV$6bAoFKW%qOqx4H%Q=Zm0N(iqmcG zeWxq#pG~28)PH0I(2LB?owgAsfLvf3FpbAIVp7^7vKHN-ElW>PV8Gq>^h#|%-Y#Q5 zk#hzf&@rj1@>%Qeq<~z~rqJyPX&aFM++G;44oC)+b*`gyudjdBHW^E|LU#d}?2zaK z0GxC<>*(#_D@OK!0Fd4B`g>=_02<+?tPTp$OvPIzCld7=l9c2do-O&26e=8BM#BBkha71<5|$({jG zMerapM7jV=t$Q|({p*OxZVzy6;2|&R5&H!kN4!Np)C@64)-j+_y&VOwXGOnX3pt=I zwpD>;GVE$;oK(>-bXLkKas)zd{Ago+MvB0SCdv9IYak;cQML4=aJUbNd_cQXyz-MIZEA|B(X5Pe)cH;?H|(1Ucx-&PVBX-n4cY5)nX? zGM7kEMBtGjZ`2C|Cev*|IwHTGd;2=zvxpWc9&bp>OnZ|wPueC|a$wW+4FE6An7@sW z-Eb{{Is(O9tP6=FGPA3-^I|^#MG>k<5bF&&gXk@yW#NCNouu$h)Sfy+oCMgomAS@4 z+245Mv&BV*HKHZsV0#|#+hdWh?1&mCY|r(lURS>(PI>Z+RHbtW3;N1C-%)GYY-D_8 zM{8fS4!0lV?QitVK8*-b+k@R?1FU-hy3|lWY4$#1(Rih#Vhik-zKcQ-4chKyT$dNW zD2Vx9nJZBj%R)dPsTPFnV7n0l2x)--*{RQBtySu<%!N6%7mHkI>sjxcPSn1ZJ|3JJ zAoNRGHRPgb4K|PtxGyn2ddY4{rL{OtCu)s}IML_Qt!mhp9gmWjL1#RE zL4#yr-$k9QdV5vY{+}zwm-&$5ihRwB>+3i9uu0O@Nj>G>M@p`_X{!+}2oNOvzS$pr z25G%2YaH;u?Df?32zkf;Y1e-`=>FyE)S5GXHTcN&)sS!IXGzvX?KjQd>r4LW3*DCH zl|5#MpH0LJ#~W)@Dn;v^Y(?$4Z$z9RJKayVxwL-NIAOn}Y-2;%Jt@(wv(CnSnfspS zYVOS|f<|PS8eCEea=zg7p-<-DJtA_KZd-F~7rEHWkB?oMcKESX=TNj2Qbd|L^@rGG zHqhSYzRLdMKG*%Qy{OW9<$jdzh*UeU@902gBDt(z>l~@d_T3k|y6nwmu@7(7Js5p* zSUdKWZ8j_Fp}kK3bC;fZt2G$Can82$^4ad3%s|emIjk0kUxpGGN?<5~e+>x?iI#zR zV94D}AU+hSZ@_RF@EGw1&LdyyAuyjsw(P1#i_bV`XQD+^3POb^u=Yi*IM4gigNaVp-hPG10DJ~NDH8JMiU;Yah_=F)fa7GZ=e`&T~nTVGp^Abkal>bskGI`^-7 z=v!M)KrXLVkLX{y-~T9i^4>*Se46?!NRk<6pWT~%*(F7g9G!Y3uZ0|0vt(&Sm8|SX zv;gcqu+WP-`ZfpAqF#xj?Aky~Oqh9mMN=S6)Cu`G1ctO;1<-rC$l(D)dqow|`0>3jqyNhi(emK`{lC4fn>JS+lj%u&HtV8z#m9ysejfj?-{tTB@tqPa(t5ep z{A;i6o%L@wkY=S=69u&Fp@)0#gkmRj7=#rOx$Npo@+N7Iv$ig_yaZo3R08sNCZ&*YfP7ej(_MN9WPgT0&q1xVMKLZK_urKR_%z!k2#_fx4FYp0b1hf^I zAx)85D02fi0k;hzzqOKzoabTslWV<5H0x}nMfn2{him6F_5v(nLl8)6ghx( zfLehy0W3rSsbhoqC=cL(hrkMl`7F>mp4JMdUp@D#njcX9>-)|xiw{b>iZ zj1B;2uExZ}*<8Hv<_qQddVfeAV12q?>juyXpaO_1?ksmdc^I==Ez*eqIRVCsH9Hs4 zLI;1j;FpyKr@WAJP4cLJ`3@KYx>{q_4-$)v8|XrAezvZF+dxL1!1@f(7h$H}2~d(g zI>=W0?eH@o3&|on$mb>mEK108d5i?|ABt)QebAwNPAp@5rq+rGMT7jv*AT z$=m~v-_M-zmIrt)ik;wLY;F`5=DB=%iXEFz-q6n_h2QTN{Vo8NngDCkURN`xj|LWD zK;O>`Xh)Xv^8~zOD*TS$QD4R$h?*chh96ZdB>n0ZnIqh*}0mC z0s#ObYRw;DREnwFL$vVvM&@+7+Sw2GiEP7R@jEPPG zVCjsMS}zbOEaeC)wBSb)*z5k;qs^AO^#Tl!=;0nPjkvX@oqP- zNSqBJVb&$x-0zZB@^>K(5QEwtlry}O9>7>ck0>osO2A+` zV(mJ1Zr6Kl9z}upS)C2?WVZp`)}1tL=27ZFX@5GPwTraWMn|-K5~Ah8KGBk>*Vs4* zz?BX1V#5)7`h&0#wTR$QEMDZ5v{%{&67!x{{Q9Ge!F)-NX1=XGIs&LxFx+*3+0w`P z&Vc~XY#spA_F-hDhy-InYCSBX<)skckr%SUTqCB{7Lv{p2ycH!eBK+v4=62-nFBCR zpLUj|-`hJx5>MTUZ17UgOJ} zW;^xcfB^I&aumFulKo0e5OaWV^j*ym5ewFoKC}JoJ8+(i*&gkfcQW#GJN(WbaSU-i z_dmvRerj)-bFyIH-QSRf`4zEKHaeeUvD@K2?YW;K3+q=(Rx+QG^TvQ8m+%i5pF%e>mF zUVi80S_g<^DGses_kwJK`$qO`!q^G54(KnE4Z+Nouwh6igzfQ1&4?a}N_8*5j_y@< zu_9WcE*XkcDJ!{@H_C3z2~~k<7>#`-7-y`e-hY*a%hnZ9UReZ8_Wz4jyLO zv3<6?--lm^5*SKgD1q;r1cpS*PBUi1*cpU(iuW7HEu9wt3gC?d5siW*01`p%zCnZx zg2!$&)M#m`!?XjS0_X#;TIwo*gOp<0@OkYvq6OjOJp_!17T0Ju;MyA=-(CW4{C1w> zA<^>f<@jwrVg3AfAX@5%a;Je0K1gi)Q^xCPH;jJX<$3;)1L1+V5Yh5UK=|X&J}W7< zE~(UEUm;pj_#MGPhg+L&Nb6d;E@DL8mBS}b?#;gRk`%U|(^K07F$3TiP4U^MLbtmI z71yt$|FEE;vV(abDKaO&e-yj()E#$L$^=*o1Nevgo&SSs*ukV~_IdIg$RUc;V zl}R^n-F1ZcC#keXN1yY}6DZF>lD&Q8ZSJ#VLv)vt8QUT6Pm-&=c)B=sGbD7CKF z002M$Nkl$t#sg^1P7MjEH%<t2x&@YX}A2k4PWA;Cjggn$$< z00$5Fz((MQhw1GQaDdXhFQ2#h*pdfjqz|G6Ai;YW5F^3?xWW@yx(}ZLPyr2*71BI- zI3}}`lX^>@e*lMm@TlZ*$h(-wrU&r?Lo&xD&jI=Re7YA8E)#PH9XKGRQmn2Pl_nW=jgLck|dqQ~}qd#d{|pp~xDMV1WN_U3Y1d z$h~wfq)`K=t)rUb{LMjj#_44(SYsm5MDNJkkKkJUj%bBd0UPfuy=ESdwTKa)DO;7cfof@{WLX32f3a@KM}K)(dm%)~7APzN2k0-d;M2c!&OZg@ z4mOgX0mZ*K=NGl^k;CTQ8a^m`byx@{vhY&;Tm)eetto0xX937SJZZGl{As|QHuTNh zACj6#*4c4K9#ce(NQU{=&U)iF369kIKutB8Qt~m zjp{V8Z^#a0k?0mt3Tz9K4jC$H5QvHN6WIZ1|Lqeu_ildd=CVWdq#gS2|MvI3EMmx7 zLewCc)vRGdfv3Q2*R=*k7qmlvx3|&W?Y_H)c?3=aKc!nj3RJXK^p5SN3x7;1D&F|g zH6b(2d7DdioBr|2zkdF8MLZ!y*pUwdLX#carjJMqpB2eS&SMkVgJ_{U)|Ec+a0kBL zod`cfB$82tgZbtaZ(isY+opC0@Y_7IMfAhLy zI65E%>79K-B*;taUaB;3*1EJ;0C@KAFIn%^{{g)-S7gfuiKIf(nkV~SMeD?lObU@j zhKFXn_9punqQ$=UYVIYZ-)j2OWJg*gNuF$s`vJCy-9z$C&w6!VWiPRpBM-DI)uZ!w zH5dABEYeoe3DIl%Z0|Au)~QHO^Nt`zdW*m^Pp_|kBkjLf_Rl)1sFt+n^ZF#p%iK8E z7)1lqF(joi)Ozi&*(gN4=bByz<-Uf@5V!U|dZoS#5*$fOugTASrh7mnuxlW!-w&~| zam&WYXaC-zd7n*Tr@ER?`xQZl(CccnEbKj&$VxWXxY$OiOwIks5S!MneFT}rHY52x zYm%40A*?4S%8brhZ|)DY?cVY?u~BTCdkF+o%>kLFFUS~saIJ-Hvgn9b^UCIm=DK9| zB^B8 zt4Ux;w0zBs*hB`8+O$KkwA5lee!b_$8c8Cp7cXB#iVef-sH1WN6)$h+c_C(;?}p#s z{q8gGypvG-M6Eank#k7W-ZYP5B!KqY%jb94kOn(B{Q0j+z`PELmcJ^dzv~I~^}hqr z@^=-=Kx5m9mPE9K3=z??TcTwkqNQw9^tssszx!EqjQT3GF1aXV%}JF;%zL6`K1%_0 z1=8zW4+=Znpzid0*}yMTD`jKv`TOp#Xp0ZuO%CCZzatMlq$k4Ugrxa8Iw`+Kr54q` zAzvJyix5lkb=Ubz&ix`q-8~$(^kVxYngv-TqGio<&sTb_SDt>hB6xNrTHdeJVGI8A z*S&QsS7l67`$WqHAtojisZ>PE|M|a_;%wBo!;+3G;N3+R^`zoDFhtAaeWC^VB-*EB z6&dA+3_TBNc@F_3RhqPA(yfh~Jh`HHUVkB}8Im&Xjb$&E48GunAA(C8b_b-E&$}+- z$7knUae0kfgw2xs@9!;7w9bZg>nd%V2%m~v>Ju$$@Kl6Oh?cR52qLqW7cUOc@|)gU zt6r`-(#JNxCES0|m#@2K>D@o5AK&$5JTPY#OFw~?JQ1%y?fObX0^}iopvTj%K3fmX z9y~oPd$9LV&I27lDxHo8Qo!ha&;Pkn#Q;MTs|9u-3%V2+ljD6jy?176zW{>){=5&j z2Alw_NDbyQpGU8(83028hA%kd&b$CU`p`f+B#_^MZO)q#@PhX?;z3Fm=j-Di z7XP6<_j!Kv0xgg;pL5V3FP^c$_R|AM>I<*QWdZs`KLF_q#Lo4A8^B9ZBvJ?g6s8?H ztx|vhV*svF3F(tDy1wyNziX{__i}WaOx;6$W1=N}M!p%B7vKfK2An)E>C*h};FT|$ z1n9T1@70Su11?H=CGrKx=Rnp0lYHmFx;a2H5C?bx$ZTuRcmMp2yZ}YOA5lv{e1w)c zdNyDi;1vvE39DWAO0RoZBz_BY+ zL|j4cqeEyQwL}9*rDl8u=H)X!H#k3?Y`!QR`JdfwZ|Wx@TJ(+1u{CT@MF7U$drw4==(WYG78hAW=MYCyDY8G(A<}JNsx(_c zIBN~uy6~2YYOx-?O8=tE#_6D!1cy*LI(lKx#qn!sTZRDS5nl*u8J_xw*5}q`dUOwfL3ctM97TnCFaC_ zQ?kq2lDe=``ep4}3qFTr1$fh2c7h#e|BZnyyzR-`iaZd}XuRyJ`Y)UYkZmhcj|{m; zrj=GMpfOzsVw-z9j%blS%vc-`_Q`(x!<$K0bYP0lBjH2_u?=TuP1~E*q|pat4#Lcs z+q`I}$;{8~AP;1R7w#hQYu{2Yn4^Hu# zGQafZgOA^@eN%z-cQ?Ho9o$@`x^vXXu=gRk)%RlWrL<~M9duBX!o10+l#P{&OJo^G ziYH@>=`=8(9ck-9-@V!xyH%gx`s+V`=-aj2zwf(W@9={UFM=P@cEZs!%dXH#Da1G# zumcE12NO`=>*p@|L|>8eV*|!>t~fh-U{0j56Uk%m7bWT5h7D$W%*8i3wv9pG^J-&2 z9-Ta;i$JAYGonM$Lgrs{tjc z_g%BJ)4wWoHf`9dkX^9y%SP`eXeM`+-q&ttp#;| z=|mKM?CcN>bApWWy?SL!h&3PMHuh!$;2E2e$Kjc69}Jxw z;oIhQNVI%IVf~#>vmS;-%kG7_jcAcN4AD|W<98F$vJ0Z6T8@lkw9597w{*F8Ii{jqHopCKZ^PQPRp1`+;JV$Br-C}%IQM^U-|}D}0z<8ow_bU*B1c5J9GUjD zt*>KK%SBtrg^eMNMZkOJ~f65?nZ(empMEieD`h?chZ?Ys{fx0j=RuV22$UpyQFYP^8&JpF%s z+K;QI2I9nn^{PZ<@N@A!U#_Kz9JmX94WU-*pfv9{%;s11b>Wo<7myL3mQ%E0LG3 zbPdkY-s+@*0kTy7>*y)~Nt^l%P~eR&%>|J5?GN9shk7Y#6mt%|3*2V|gM17pm- zT=^cRjR#nu=EiC9o)vLIM!;wu|A2E~381s-4mp##2X}qorOxyAgrtMvH48*??X?+~ z_Q_QAj5YvmKs}z9WI|qV1waS1D)P!p^^JU%OsaTphy#Z`rv1CYOVnjK!z*g zy$Z-c?38zPWQ(v+bH%t3EdVQIkMuWHEH{9jXaZyfupRj&2Rd-Wm|cfAs#GFC5J0JX z_YU%Pt~mmfnkVUxc(R&1*Q9&q&Da4V`p26YNQm51hlB?>y?i`iinLfQb&JoFn|}?2 zlX?cAb#{mcp4`BFs+|)joLKV% z^a7e9){sf1&p96yjb!n8WCcL$-_QEDq>Q_|bij3h%I4bq1M$5_e-LIOPUr}L7CGP? zz#}=32c0&*QU$IHsExq6YTi{vngFt-4+4(gmGnS>wYKhBf6TY`fwhPT5qJo}CtpcX z`RETFqW@$--j4d7AK48r1P~kaqkzePaJub%p7IK(^TuDBbcyb-E5awTXor5#0RX*- z7JUXRngb*>eJ6`0v27y5ki@ui>f!5J8fWP%BGU(v+pB(^Rt2whnL`&wF?en~*=73Z<*Ur1Pyxr*zV)~BM^iYH0 z^?>;7q4bFAEg?|MC!zp3r0@D{EsGYpJ*2`1A;Gk7z0m^)8F|@P*cd?e>9Of-9veW1 z0rx!c0j$Q>Y|XCLM3a5{Zf^Z&XUSAO2jt#y8M_o}2nTWp-~x#0wizs&PaE<>vnOme zFL=>%KWkFzGCIxXN~y-?(y2MeBr-1g zH7!m6NNwq_)Y2FcQg2LjP_zT`gdE7jdNYszcILlTeII)oM+HQs)N%-kH?wZYkj+8D z0)Wp+8bvl8xc_(v0uf5q0G)0&f$U0de#p_*MhPXtK|I;Z>_?)}s)kr}#aNL~(w9kt z#ety_pXQisMFLGf>ex6)oL|M~>4-z>w%eb&z3i3G*dyuD+{lI9v_E>0oqG(2_UNT` z?tZ{H5q0#Op4cDVi>!?eWCz(H_SN3D?_Npi=^y>;|0}G*{3pe5y`Ku@))QiQQr3e# zANeyUYmK}_R6d$?ds3AmdFcsR*@vplv`tP(j~Nq=FS~_gLNw^RwXkUAqE1#tM6azm zsKM>#OS4b@++?aPeXpZ0XXi~!#0N)(s*jTEB3A7c_b$0NI+6&(%nzG<)FDUaUhCwn z4MfYa(Z_@8USdV=;Y1`go1azNEsX9;-Q`M#biT8n=5Bo)O+@s*8C_uO5GdquP;4JU zP$UXlEPX89M`l}}qK&L=>uN>px`-Th+kK`qo}zdWi5z?EE23h|xhOLZH`2ekmp~pq z{?g;MuI(v^aW;q3106+rnFrA)oNnB!n8!Pw{$p(1%DNwuW=@|oyzc0RH!$+sL!h2;L4k!73_Ez*J^MUW#>f+0er_VTikd59gX(c_IEhdni>)7P0a2;-`uX(1v5sF4Y_P z0!-B&k1|E)5htG{;sV)lQ0&vOsVy>Xeo}`dHQ50NkL+!H>+Om{w;FFg2gPwKEqiSYQ^h?ehyXvuigj5$(7%WM%X zI}t5+-Q9b0*^5=E9$EB55iNb0715FhixFc+7uh13W&V$@ttcL{T5|sbz2yrQbg9GM z4bf5&Esbc&18bs_@cxlZlpk(9YNfb)$i>?R(Re@ALOg$2$@N zSRzOKk52hfz_CmE`u|0Fe*%liggpH7IS-#67J23YEF8QarB?cQYZny)#FUBwfCH=t zkV-Y?!5Sds;nPF@^0ji>ztRDuMNTMK9B=f1Zh*iaC(6ddcGYM}dat{a=1b}yz=?vU zysZ^=7V!Y&km`bmtCzmkLvSAQ9f|_$H$ZoLv__YX2{<6qLPW^}NdW^al6uEjffx?J z1#kthgJ1z90w+YttPb$c+xzw>Z!eGIE_F;E(sPg-8S0N~10n%x0CAv`zQ~0RsMJ1L z7f_S$AV29eUY~TnfdKC14GBrQh0eUEke90p_G`5ly40 zcOzN=cy~RQD7FEj#kI`;md`#eFinjBsYT8T@XLc8AOfTTYz!W*S3p+eEn2F;ihf_q zqn(}vMC1v~Lmh}-1W|xP^V15w1J0%ejNutC0!Vr`as>JV8-d*BtmY^7)^!Vn%C%~q z(nr7&VyjCX)`8?~JP-z81B~8~NCJJdz6ZgN))pPow(HPwky1RPfy(O8@K!(f`16Wr zQP7#UcmslK?#)AhJ48%Nl}EM!sL#IWhz~>~(!jcSEMB`nq;`GRHreLT(Z2gx-}Jo^ zE!F`*^`co9RRoXq008{Y1-}Z|{l{9TI}j~tTYG?0B#W97K)@^JTv3rA04RNVD78ZX zr%To>DLo~t2&)dCCmT9DlJp%Ms zEcAq|k*lJ8x_V15_O1=tapRdc)*3ReiX-dSeE|+dmH^ZMX=TTAZPzzX^c_epJrjBI zj0bc-xb(pyNcGJ@p6w`-Eb9?4E;{GB`PUaP{AEam_v3+&Z~&l*Xg~%4?LF_1OPix6 z8$|HW&iq;FGcWe0YxLAw0@U9gG6EQkupN^$XTbJGv>=J~8@Yqe4NK@(U0GkyR)F)cH^Ny?(I6pn@hS<4T zMUEr25Xn+(0rBa%I!om2XX(#G)Btubjm<{{ndiHoxvOex7@J5g_K^&X-)Hy!!QKVh zBM8kC!U$;kVInQ8cXk4)i^w+?y4u!co3Gt{*S^F2n9Cc^xS=APq<}(%+#iAjNr%Ky z^8=_ZRZw}?XEI%<5iM%gh>j9j#|ePVr_~2(G54F}P{5|CcO;?(`H3tx z&us2*Q`d}BfPGW#)Pb}{2px0I&a^e4|IG$@zkcr9A=hE! zrHUg5^Cl94F4K|QvJWoJer=t2&wQ2RK<3VU1M=No!nuRah@cQnaLbdo^!K0-DoWv^ zq+C;5#yyIt8}r3RwEa}S^pg#2wsrr#_b;8YM|RU9s8Plpj9C;q`3O? zlKYEXW%tYh-D%^hP7Qp~cKto??W)iI=4D)D#OVR4;C{ne=P=muM>%k`~c`=s?oF38qpxHp5l zvuxXWjP&RfbZ$h8s1rm>tD!PTw1{RImlQUDeJ_W0M5*{pBT#tyIn-o9w1{X?keI=D zp0s307HP(|-B{HphF`v|1jv6#w0v8s{cY!4Q$wO<_cGiS(V~z&;^F9-GkUMRu)Mcs z$W>;)nOsWI-DV=BS|kUwF@@f9hvcEyU`i;q3l zTN(mDt(Lu$3ZR|{w%;#To2}jY?$IFa|K+oIW5xC%8pecVKHxxKj|BNTp@+6%ezPa|5=C&L!eqA>lD711)gH(Tnk zDgD%8OOrb6E~&$o_fv-@jRrgJCOk4B7c$u|gu8k$r(AP&?~EIMQVoBehomL<-Ct?Z z-d*!r?}HFg>e~4Hme_LX%8r>crwFT==bjVt>8Re8jUQHewnu-n>l!WO_+1^=XnO#$ zzuC7$Roat)X7(b^1ok=6o4&XAJ3aZXdiLG6+r|wH;7#sfRmvYgkOxeE_s~l=j_yIZ zQvrXM?g{}=juiPT5Wj(>ORH3Oq?$#z1G=Jk|aea0s~D9_l*~ z?g!;*C<*~st8XGp4h%q2mJMY!N|YN5Hq2waB3H64*#!u6STh|M*cGk^k~`Wp z{oDYW=2h%>GV?om0(IrtS9JS=0D9K4=p^9wO^@ExyEDXq_0{IfXT4m5J^`cDOh6t< zkHphg+8nhac;i}sBCo7^o}&YPGNI6T7DenFZdUCsah$KHMT+i_I+qHmleE7+EE&N*jeY-7L-7?@#Tn7Q}O z+|TE|zwD(6Z|(#b<}w3Awy^L zFS1VLCMT*jT8u@yt!!-*(FudAb37a2dWA$`bM8ze!@u42z3L~&OM@vjov1Wx)F2=A z9h)pFgO2EXDGXuFsjN*>hmBY-F5(;E^V?j95DUgW8+_EENiQ4wEwT!U%Kki)MawH1 zBPruVNg}x6>e@y2@qa!AhtlKhIG`NQL)Cv^$|dV%aVU!>5IG_0PoI4+sp1f0`Vjuc z1|31h8K3S0=BJ|%fwqD4XdZUO5V@`yf@qmb+ZzWj%>Qft8 zU1Qv5xG&K*eMg7W71bu!c2QtRmfJ&&)Z~SvbVHz(z0QG6C+l75ITtCf546Lz6B#B& z71>3)Ap`Wu_wM{&-OI3BO+MF9a?xw4-{^~LK6&-w-A~_D3)72eL>j=jZQACz+1uC; zTdq&!Bd>d(W>dYx+ROcd#oVQq(?=pM5RU4m^RB_JzuKlRzm~Oryjy>cjr@6Z`oZ9a zEMUjkc(9jj6Y+@cd&y5%CZF|tjlIa@wXO z6=5s2;*Vl~=&Q*Y&a+}_mq(w7h8@{%te?Ds`-4|VBvXb@9E{?5mUD3aE#5I(C(RN$I zslM#NKjif?QQBl?J~l6#zGxX{JlpWMwf5~nk$TY`fsVkZ9Dz=>q=Vaqul02^#EZ^*0~@37B5-6<(Qr(wOKLHZF9;WrFZRur zt;ycsw~qk)I?=LyoOb6C^tBT$^CLQ&XgT+WEX4hl5G@y9G!iXw^sjvI;o<(D{wzex z9ktep)Khf1y_61Tq=$v!k4iePOTU`%_e7CcWZvTJFFo;OrL(%{fBm4o9~65t+m^6J zMUaG8LE4mTv0GARAqkvw)TyV8B+eC=*IFs>yz=Vs`$T>`dFNe4N^IJ=A*8~Rl1(%V zB4hEgrNc#E`qFU4pMIkjtrr#Z-0wrQ{O)&SM9Yq>(XtyOTHqD|tBpvHK2h`RkBi;{$~Ldgzy#Dg4ya z!&@)DG%Q}eY>;+MbdclEyzB&>)Nyq&adTr=Pw=-Jsff-azx7= z{?$Z>NYzoZa`W(Ot3GcJ)Oj?_jyNL|u@EPahh)xv4~!n_%{1l+6Rq&8hkrHvdmL-7 z@6tXX7Nl(9#8L0x-uky$1pK?@$ah{jV~Gt%;V0%gM)SV5T_d zmzn&^Xb<*dM9BCs+j^j`f04ULi^#!<(!;K8Kk#PktWfE9V2Eg@Z^xn+b`;(H6vt zwOPKJ1<+x~!TyUZy2?@M0sXl+PRntS_O$JTv>E1RQ*sb7)CvOBvW`t6Sy#aXsryUsW8EW$ZEThd!A=^Q?_Bep zn$eC7xHUuzl1G^qQ8(+jIq0jL$m^XW&a70rVxv3iGbA|w&QJP0PUwobiF4RmIcfe( zo1N2SX+(=M^d(22HXA?2&67!GhRnGA@!N_pb6)E=<){~yg?n_X#%A`*8E?!PBbS_d zamjLH+Dk`YOT-R`_`m%2U*Zh@uWCm{l|;z;mv?n&A6$$(vvL14!rNn#U zr6+6lGg-97*4m~F`=o~Xw#bjzlBTcfu{C*}10-YAnB<6N7aTY0$QackqDABq=O{;? z=m7-KUw-y4wyz=45YZm^8<*;M-kl$=Fhv^$36~7n9yhtqa!wb>J@h>eNlEc{Lke4dX z+C2J|-oO{hV(eXh`em^{*H$V=&R>rFXJ31EE24#cu}+BekbiUiU)Pv7HaV<+kjNXv z2j?)yC_AYhQ5VQZkqwP#p{vTWw$FFtjHah_y#`P)BHKuW-dH4y`VdhfvGlnyid1En z=ps9)-7vB>ovJUh{8nKa{YO$2Ayvsp zH^?T17$Qu)oW_VIv+a?N(wvF5u^vW64;^trBCWnvq!~=i-lv!Bxod%lJs681VNQh2t*FwHGlv5+bM7 zegw-5(Gv2v@QD67pDt`6TE0@`qy82N_x+@nvt|wQO3J&>XB`vQcjTCBCRvmPm+Ag7 zR`qWYEwMqav5qw%!NyrW5-lNZ6VW0?oO_JLu>OY$KJ<%7-k)bpJ31hRpm9GECL>YX z+F1>l`qT8#@3xXVc7!dRmYdA(+u0b`{F`&FN3>X9O1~jmYHgWZgQa4#&pnj;Eo6|$ z4CL!C9?n`;NkM0PsGHrI#-5(&*Sbzl>SWXDf5WrM>b=?3)|bAI{h6_xdx`gQ{oIh} zs57!Y)Mc?#;~IE3R$p4L%XLV4IZ>k~fsjf2{`$JVDe}wOE9{)qc_zIW2O@Sv?y~L8 zhB&9vr^Y6mc;Xsi;*au8CSY&0t@{5Dc_vJMYENrZ!@63OcEMZOCII}|m48CG`NoSaS;sz-*1X0Z=p zf(1vOcsZ6qN17mI%nX(a%;tt%gi2%0ktdFCX~UeeV#SIgW*C3vy^^&Rq!4p186DrZ zT|)qV`izs%yMEFK2(P_)BR5&;oOc_2Xldv6W}P5gCtBv$E@XxmGDOO$bFcech>S01 zA$t)m=an&t|_BGd(Ob7>Qv{pXwK&{gv zb=X3#(NcJ`p?PUr5iN;mIVEe(TzpefhlNO58#4Tbhm)Eo1k$QUr4Ew{Eslz`%as8Z z#$^5!0VTrd%8*Fc{^eg*DzyU-UOYVh>)#HK2XC0HPXFnjNl2|0Wun1@{1ra^IvXw3xl=?{v1*dnp&pIKga_oVr6+8Ync+NTxlxBx>nsX+Y8R&tQ!-BHsi4boyeWHY-2hiZ6z~uhO^8$oKocFgyG!rUHT7B zFf)-kZrBfHh{WKGRfgHA8~}&_@^aSv=e_?~?J;|@5f$2G8z082n0naf(3Yy=n9N3o z52w$x9_bsp>ET`3+Qy z>bDU09Z%g+*m5vf1I0{PP6f_IGH_IIEFoJ?h#cBh<7XmkH8g)XV_|YgW?Sp`b0fEP z5p3k3uZ(a50McTp9tbfr<95Hvt+0|@8*|)NEIX`WruLuOQU^%lm(#q)= zT#T8cqnl=^i$372g0W3`g;8XKAHqsMa>RIXWNEh)Db`z{gVOBLAG&A^jZU=Od#7C6 zSofG}tnSw%XVVo9XyX{6z=2%aK(BA_0ng3bxk9LC#3;C8~?VT@=LkCyQ9aiEfG}Fnihp+efGW zIQA_V{m;Mo=OPiLTqKvialC$GyS^M-EIk)8XI0sfk!bnNW4{^R-8iOST)y~_^yBHn zpJ#DrIoZw?WH zcl1}=w7;CosSjbm9+1<_ebH3NKjU2FsT5zvu}yn^mWc;!H3Dy4`rlX*W#S&0GEhcB^Z{wZr!5ZNwx}j`M`dYU(yT_g(R75gZd+gM#@gt&&F20;H z$l_jPb*?$C8P;rB`P#}#Uxt(&b>t5r$P*=+wp$~Pj<&ux9`&KIWBiLuVk433`dE3^ zIuU6`W~8gN?DP?AM6)>Fq!BXHMYcoqoi*+Bm;Nawp6IJ~Qtc)~>ksexKwtQ+-5-#f zZhG0YNf|JAt#jRBN3J^as=5}NsDV^4(ZLvIV;z4+#-S91t|8V>v7XX{A^DI~ju(v) zPtl|wwhTHEf}>7mGDxnK@8x;HI5I(@+D4}QBM&S+OSpcyCX(H;WH#>Tn+O*=b?v#= z4woi1D;{)S3K)9CnTf*ME(T5>dszBH~OY(l#>FeJ?n->o2_Px))nUpUye{>`G@V^2s&c z{jJFYB3hKs#+cM$YyDMU_Q0QUylnCNm_+uu&wZ|l7LoflzP9I{d#*^9d+xbs@T{XR z#&NL+gz80h1Udry1Oods(c+>|H{BD5y9)+m%h}}RyO*sU+fsW;@8!i&&kzealuiZV z;?>S+p@~v~Ay1TN#xSBqREuM+)#Bo%Y{wv45Ht4q&2g`0f$0|cqU>(`3*ue+a))c_x|S(hldj(AxcDQ zD!S~}y+xada7MjmZE`!%h~{Dlq=tZ_92-y%0cmsIAA#JIAEQ_+0H4}h!)NP56v7Pw)Q|j z^Z1|g%2Apz$BFNuwj6+QG;xw~xSC097QD1JW=)%!X(LC|Jq`k&Ibu1OI3PH|IPPye z?+fJsNCO-ngjO?~>wsk3P z%L6>;3nGB5wwbc>xs_$FS*O1og#%n0E=}g?^2aue z^=JY|73cP`aY$YshvqkuLW+}_Y*M$F`E6|;v*B%(Pe0BZ&K9H{N2c^-97AThADr2r za@li){vx^TMeVfuM~o4Mskq=1g}|9N_M_i9qb&SvX7o3Lp>+y4gGDTHz*ujB9&$)- z+_Z5^p9D(xOPtMSTXV4M6AQFkmxeCTDeWRp!=W0LKfbHqIW5{m>L89({mD_sc_d2X zA8z}Ha@5jg#~1c_=S=04frC0YwUIMOM|xKE)0Onou^}v^q`NllR7a7IA%Qq_MM`ky zTeoABbBk@Q?`bPpIo+kix$5++Ll&o-bz{t?M_|x1Bwo$JogA;Nv)gSSotfU>`m3-C zF8W$q5pW#OW2DCgMOxjT`lM6hWaosmA(7fx_}1jd zUKqj!(TgbII5!sV3&~+jYlCYs{XAk4*(G(F@vR-A`9zE%4}TZi+WNk=T|N5AHQ#te zA|ZiA6NGAylR6WdlIrc|OKvW*$MM#WaX(NI6Vb)jLU354s%`9yNg6kte?vvOl8qz$ ze}!Z*`NFlq^_|n6ov1z!{!J!;>s#$(OXvr}PPCAS9b?CuDvxG7iOBfR+-tawj<$8Q zSr$>}aM$Mt#?G5;@Xopr{)xcR7knb_)lJ@5nl)vPNlHZovht-{_teWzBr@)f zif9!*==#TQ7{fP)0HbfGM&I;1qCmRKpGV)2q0TqXj0dD#wQ$l-`qtXraMNal5@Nt- zx`9}=c8#bQHqv=EgatW7yt!|nd*p^E+u{DpSVwA!dLjq?yD5=zNK$Lhxi>~Q+#W(1 zku(zRV}yIFC*4+hgk)e_Tq{KfyH>%GjUy|&scw@>=!mw`HEkd_Tj@U1#1F(IQV4N^ z5IJGR3B%(drHohoc}wg(ohec$W7}ufcO)V^g-|qE##prWl`*k4(QBejO}G)&b9o|@ zt~>X-y3a821=c?bi6av5rL1f8M%IyHFINO_Z9|SCO{6{~vv!SsvP>3En;d5X1yaab zIL4QhZ6d`)1Q}zu#Xi7!A&3^3lh4a!W`2k-eJRA|*S}ur|ED8bMC?EI*ki*_fBMtm z?z``<{&IfbIszSmj=-)Wuul>#?ZVic?T8K0CgnWKO@#;nWJn`h9FO$iR6>RzPZ(=2 zzd26S3GxIP;(Qwe?Hsd#8FOSyIref+TiCu4Eo4!)NEzijN4b7iHnN2itZyG50e#l` z-h0i^=n*X+AJ}b_U@UZ^WqyF*!-0?V*qm_jQq2VoM7>Q(gKV^v0IEX!o-8nT;8W(=)3&YV#C3bM^3lap$ zvieD{r$X#xQO%99S!Bq|!uCrJJtXO~POoT{w^rw7J4DZmiR$n!^%Y`3x-09y93N*q z(&F`mUB8s5jg?tT=B<}jkLj^;W2-z78)sZ~MXk%SEaZ*NIDFxehjV<0mYIl_3o3or z`8VD;oRH9NQ6;M$eq^}qAOB~`daTr8A#E=F;ujMob3)nd$8P`iaLeER-LT=^H7Wnd z;jGVIQz^=<0kb4z_#NN<-f-*R|3h$1c>U)0LO^A~(Yy{jCNqau96pjCk^it`j~R|Q z>7-iE21&I!SiYHbUn?JaaCm$5%XR+y!Rfu&;dftqtroafCj4!%J0q#UkYpl}PCW0t z;E>tKFRUt(1uh%bt|?Lre$tbjo`|5MPdlwrt9fYF_XwHW+Zru@B%#rZa^PTTh z3bART<^KEcAMU&FzTwxu{`GKI2(?cm(V{QQk+H=eGJe;``A2UN2uP2k;|xbONI_!8 zrUz&LFOV98UC=nHB1~e?S~l91><$DxZVi zLo|nnoaP>wkpXQMCMUXw?UFt6Z4rI^#1Z3PIZomf=fv_LF3pY$gj1z%_ zm*ZL~&eP^ z=LpaS4k1n`q!Y49dLk)++!vCIBZdR7ZPRkw*JOv4jpIvx^RpZGoABfb#%+|)jRp%R4#IVmG^E) zqt$T^nSBetHrw~~I6G^WY@C=JagLQni8I$a9~?@eY+C!wR+jcmq!h;%=PaG(4NOw(9+lKJPm1>l+MJ#ME{T@`f+?+SCI&>By|r5?>}b!6pQ`BhfK%_0j7u1rp-tYlsmuL22Qo(@ z2mEDm%$apAs)U2c%zKVgPH@o+?fmIBxAMHt-rw-hTRyjO%W&=ZK2FAEA$ZIzM+)mp>FzlD5SsejYz z8r^D7ND8@CZ@!Rz44Gs6O>AGPo7wcMf8&h%gX(&QFgr0s6#D>=N7643Wr2C?*tnj- z@AyQHh_tvelLm}U_E)6Dy=PdH&$ljYK|w`|C`CeVqI9H$P(->k0i{ZlEdxlbU@>^0=&q+?E>qL=vikS=G5{eFnBVb_K$3uyBpo`ToaXoNNDZArv{ zSuFF%=u@%UfqY_5UHK&C?%O z!%ZHpM%S-Aa>s_P19$@CjHN*8kecDM6-L>Wt1pg^6(ZHJS`psz&p>>+;SA<%Y-In= z1rUZiUl@0sJ*^S6hqeU%RBxT~lz(!;sG9Q!xEhG6Jmp0};S;fTml`BUrEw9tE`ZVU zwAN&7jLjr7MnKtth|;HFI6WWdWeuIn+zG#T!<};YBec0AQK3GeV|af&()0D9Ot$OQ zz>A@~oku3$9l+ldj8#1>a>T3{geTcr_P%|9Nm~pMQ4*#2JzM!U&sLECGuWc_`qh0S zImMPXx8q0SMOnLPINl^x3Uf=CrfG{9sEl=_l4s<8v$dPESq0UwO2q9+sjiI3mz8%| zAu-(N;^BZ>axw%7#7X^{e(4KJ`9=ztF1uV3~ioS&>3mB zJq;&$PTRtnl_1!S9%KJT@dleyKK<r#XM`z*Qsm@gyo%~#QHD+KBbcx zXpU%mI(!_;?{9uSJ=uY#%vG4K=E_$7)@|v3=)W@S8M~l!dooUJPA(?_$_sI-^Qq-o^zL{axE{f zKB%Lx?8$h4?|Yy$cLyVWd=LJBTK;1;>pguzp&*18+`Bj zTvLHGR=$6Y#c-Kj+??Hz@!>FdXW7en8%Nf0$K$ zdYys;k>#=Oa*0adl82ywKH;2*H+aPQ$zVDxA1i?NWXlB&hlFY}7K?JQ+TvRe9)GuV z%s88HG(QQhn-sKR4@WoKkkhOyGE$z&PDe?~YF7kqZl2{GmfKS|J{K~VF?zk#%yyqwnB|D2&x^^+#R^#E-&UNG z8|A1hKnCM`2TIyv??gZMjF9gG}hb7r4&; z@$I^tBnVZG8`b)OmAMxZVCOu+)eu_$!ewJtmd9n^$(c~!iH5n{^ML9d$p#nJa5ooK&|NP)QV=P-)`}Zw=~N4_$ce+R3bU^M zZZs*E7J2b)R1M?+og`3YiIQpW+;aKnJ+@Z-X#tzu^M*nEb7DGUkx{*;1TvkV%QG43 z<~2!6;tdWTMT2S73TSRr8jHw~_mcM7-`K_zXf1(TIroN-WrFaU4f|oJE99_7f+W;} zm8c`(85son-nmU*{uXO5>73!vvj9yE#GATq?$RuU{VtZ%U|e9lhjJj)V67@mF3l}U zOcrrlgyzYdHd|7kT!l-QosWHxxM9TxFQ zdQ@sf|C{&&Xim_ti~`H`iBBPb&oo(Jtt>gsoivq>)vB?`FRB zRTvM8k6)GBj&&4oa-dWz#JP@SSuOaZGj}33^92W&e$$y&SxI(ENOnkQobm{D3SvO8 zBr3b_PL%qrZ(>v_`6tR{V;-B+Khlz+5AV*ZMX7fU+^Y$;{|GWW-i(gxGRGZOvsye+ ziF^sfPmS(8?Wy7MpflS07S^LU!u-Qf2ENhsp>FYEI6bS3RD7Un{+-%Tb#1-a1NWDR z#>Z)Yqkuc=CWtb_&fmc-cy3%Z%X6tPbytqaLOi#Sn>1}_^g_tk-|PSC0Z7-;p}70&y(JO z2C;R0y<%iIvj|bI`(5y!YHr-7hcDz7D>`yhtSy!+gg^-xdMyD&k5?8jr`RC^JNiQA8X*JCdiX)U{u(i(5t%C!uQ>YW5^Ip4SyT zJ>mOUz%aTEj;ZZk{J>87^|u6H1XI_5dq&^hOeM<=ghu`DQ_ogk>VO0Je$hVa%MTAe=dZg3wBFzKKtjV5V+yBBo8IN!4Yla8{D zmWz`vRVJgP=Hh&q(Gy!>AohlibNE2}JpO9RBlXquE)%cA3C|_(gZrhkB z=E&kZE>8pKBhGC(aG@gXF+f)|`R#@B11m!|#pIXTg$r_M7q;Kg9D^A7ksCT}F_sh9h2pQ>g4h^E$tMH#=8B zz5EE=+wAPdp0~z*@+8JpV7@qy5z>*ohw*r!H~1yuHUdKTqhHS2rMx`_Z|w_q;*qko zgG$hAqvaH$TpV5^a1aV!0JA0SB-#0{i;i;6iO|kbv$z#+C7*F5W<7m$o)B2I;acj# zsFHO2dmut|=ek%9!n3hlVL*cEOay<;8``z3djT|&ftc8ewaZH<;9Ca1HfnTns36>A zbEO%{x-kfqU-$ShjS7TG;ry2sWf#C#oNy zYV9dXD$nnTl#YfQY8H`t^fB>zzHb=Md0NKpr@CF8N2TE61MOQVr<)@a9Zd<&%nP@e z#dC6zqJT1#`qsEnehn|-nk^C~;SGK@d?v5lsHXgbjmP1665{#hH~UN3&Xn7AftxGm z$+=`2mWw3$;eDD4Ru8g-`D~%e`vX5jcII*sUK~$I2-iqdk8Xsyoi*b7oaFmAmT%PR zMqIV=Xh8K=%0*YM4_4Y9XZ?-Vq$rvLOk&A_=YucIyr=Wy!1$;Yh;X7)6xpAFf?(Y>2yb7AMhlcm*HYQYhh47=pR5tCH=$==CYybW=SHRqJ5i{ zONnOcbZI_cp~_&kSY(Shhw&(Ig_0)X8NQ3M_j`<);3z7_y=1Ch;Zx?w(o15_?zVT~ zVq6%RM?K2}Lt&4TeuX6^%FQ&f=(WB{*&n#6M5cmTnOY)!$%&HNYRGqf)GL22{sbjV z%hd8x>&jJ9H>y=Hrwl7k+{VZyzy*`Gl?J0?)}TVfS*0K>8eV$ zQZ({m#UP}ghPCHi4TlA``%h1S9Obps0NTTn-g2 zdiVCR{yVDL-u#>TGKZJJDt@PiXYFoN()1t3@`A+EW4&MnBq8$Z7jsQtKs<&g1n!X< z7KSwYwqk931nC!I#4(UXKVKkk)FI`pEr#6p$CYF0JBu24x8?4`tk;{Z9V}ryFr>8% zqmn4Ju`z+z+C7A9#ELyc`pRKNxi1>9!0CJW+om(iHd^?rt2Ur$pdk`lI}fpk35J9e&{P^AWov0q zdgNh4Qs0Wj<@LvbH(Sj=;l=UZvb*yrmJkJzVKMA?nT4S)Qp{C%R1HZ^Fc-FXXG$sQ zm72+`o^39P)fQHTe>OqB8z#3oUGiL%-Q=OC7KdH|A}0BSvX*P_r|ztTWnNV`6DxfE znYp_WvYQh1cz&PZG!A!r@{vcn{#ywqDIWo>&93t8E$k~U1z+RzVw#qth7Y$-H|%Su zIGyVQ^O>ewpKaadb#eYkL2AA?&FYcY%kgCIajQVA$Gvvq>-b>4Cd6{^Vi4{)ty9u} zg?^!!N4U0!mK@?EUQ*7~lmD8kw_Mh)bMkdxEAjVD1uwz{cMuT~S|#@mlGMfx#%92y zzX142>gK18>GfV>UPBl zVL;5jw6(9~!!{F>wE1prEL!#Nh}8yIUmtOf+dTMbRA#Bgiq1a;?|%1@$}FZRsg36# zL<_iW7i3gqduVG!L32G#;w5&k30Yaa<#}MNi{gkQ%SVX#QP+syPR=a$`pQ^g9RMCn zfCDE@G8;6VNPN8crixF?O}e6 z(iRjX>+^V_M@B+gy}4$X<@sC?Uax?9rb#qtJ&1~hHa zct|)qD${GJsgi^0vG-sCC1~j<+Ar$#wti+avh`uW6|TlJCfn6B&YkVAcV;{su+vil z=Px_p%BFrllH;aH_3VlL{SqYx!`v z!5@^HKgVxZz9a{90Um2dxVHIiin2}b`XBJO{TsmeRdACJ?9KX?>ai%|gZ3bP0(lyFt zkD2@Y(IYIjo{Sr<3SWSy*UK%PQKO5wOSdNnrkXtgMQ^)!V#HqpRS^cIkij@ z6P@%|Ri>9Yp`j=D)1I5}$XGDxFM6mkC@SUuQ|sM4W{zB0T^dk$<9gggMm_up!J>^U z_umT0r%gcTS9F@CiV>@{6tI8ZJJt0dc66*ujeh)6`BsTW{cUQOB})x_a7(zWGr-XfQR>|1^$a&1^hQ`xc=J zvxBvvBv@v(ASXde^dg6T|zB{u!LP>ZugmL{}B!(R|8YV=MAS5%A zcSnYq%70xsd(PQ$;D@}F_07yLgBMZh)lP|HiZbd%iFBjZl%pHX`zJ-Xk48B*+c9|> z>?!r{QY3hGenEqN-GNi`66pUU83aeC(#zeg4C9?96)E)rR}(jyd79DueY@V+Unun; zN(#|q3DarTt&Dh00-LqhRz$L+Mc+5AWjg;DpLFZl=pQ&;YBCu?N_C~*K)muP=4y&V z;)m>zPq7R@M(;X~BhI4xfV-hO1(%bDIZ^RoPnlGSJYM-MO1<3#P~DhUf6D8rQp$<` zGk`ATlK6GAJmA!5M&X-2W&=Brzu_(_+@3&!Tm4DPw5B4GPwBA-Ue|t0r2|+o3temJdcJZmiDmKHU%CrMJ8#@ca%HDEJiXZ+ zx^+_auo&N5y6lI|7V6Ygcgv+L62DZq^FPdYtKf6CbEV61hA~?xT>6bUA&k z?8>}w@s9O*PBVyXB^kGUV&(OzJhq}#y8J3~*5OgJ0%=tGN7E@*%13~0w*92xw*<=9 z4M~UvY+FVMv``+@lsFkZAH|sG)$_nAJ75>(F$mZjU+s`|DuKu zl^JXt=|x-5RKL93i9_3|QXaX+lo{m{&G=fm+v}X}mTX>B<&#C5Xj1Vafn0X|Q|~0a zeMatoTQq--2rSX_OLlg1mUwjKI;GS(MX42mVecEF{ir0q&Z3B!;+#y~L3UK;vc&uM zP=q|?*&OsXt^J)~+)`0Bi}^GuOWbYUDLQ-7l?8WZRA2SYMSTdG3fhwE^d6>R!(Bv< zfEtv=LxeI=(lgKcX;G}(UuQ?)QcrL!>e)yDGUN<@D!Jrru#%@IM)ZH+Gy=U!^)|zBG33IR98-aqKjrBhp&%Zpuhb;c;|C+!F+oo?RmaZ?0f?A7{RisZ{mb%hg~{ zhDVZSuDb>{>z8u<>mA?R64XY%@{%R6F-*DIWQp*QadQaStljRt=YMeUhI(~atB4cf zsqF=HQs`kkuW4V&`?dLMY8%BNlYAmb{uoI!ev67ybfnUNsz~4-5<75v%iQ2q;BQyN zL#1qg?cV4^gP5r|GUxcY{EYgmfo_qV`)>H^Ajao9ua^Eyx26D?-Y3k#!Xe&2Z=%4T z1X|9QqwTxXo())Ynf84X<}Vz|4iJqVz8ehdoMu4D?9H4rOG@xc{c(GfK0birmI@63 z2wEfEcEML;XV%L+M~ghXAsOvp{b=_1W5sP|%G*$KsGGIg8PNV*CzeJ^SZsIFPyVYHjn?vP-oQ&5vF8EsfPs)2y?5t6&WC0yF?w=hHaU zU%m`yt2}rDtnhs}aljwbl;<@@6I; z`*PS9j5%D0hZ>^XB1C?^q;vy_VFfjq6K-Np&MD${HhlrM$}B#I#xjORr`#F&e4`la z+itYO@byJ6ULiQmJv-`_Be|j!ZmVdT+Qbvvj=`yS`*etGt>Y4^jAnklt2d1prb0HpJF7Gy^ zDRd}zU~yt{Hyky<$}iqf@x;7MnY`<5; zS)f>0eEicmE-ICi>y2Sin7tHSWwgzOTYC{Zba;A=RRhY-`+W9gW4=r+WD5 zeO2*w(`c%s0agidr5#MI6`NKCADIO^i7{2irer|mK+?yuaiHhY#M$6v5whmzW&Q`f zHhK^Vq3pPcJCus?JI1;@zrkRW=~7|5`kq8>i%^c)t3Zk0+;KN1aw6VhoeDdPMJ%at zL^h5~=VB8J%46{z<>>seU&--xqMb=K#!PC=uks^b)`TSpva=3|_T~6>CgVS?Dv7F| zm2~!Q=TKKKxlf7cwg}rl7?!n4VeH=DegG@3Jo(x6lwqPI5vK9&N1>PcmBGiL)fWM| z*DjA`gGtTE%YWyWPbMKBidF(4EL_dKH+5j9hRPVB>4#M8b;_FObqFid;KrTL%AsCg z#WED?+fvVzCX^>r?A2b-Yq?ySBPLgGwLFYtHDe{ub;Z(Ic?K_LAj^6!bKmKm9{V@E*tM%@%f+TfHN@8O=uGk9Ky)*d@jIM&4;(|>g`7vE8hn^~oKOd(EK2Yi3~MNe+gHv1SQnPK6Bnxir!{}}=C0uS)lfS- z%D8o_``Jm$?Cp*xS$-9(1@sQl3o}ibs!sArwokaX@|ol~M`XJ=w9_FiHae+C#Kf3; z0$NJ32zS{zuQl-``^|`*q@BE%Vbgmi)Ux>U!8UatS<@7`gPLGM;FSv${1HX1}#I zxeTHk2R1Tz;A39e?H_ghHC?> zp0#Aqwt2p_%-Y>lc#zJBo5A6nT-4yjZp4`))CO%)q3|ft;%H=cD%>n#yRU_PLL))Z z!I-eqh=wuCK39(AXix2VAY725L*B(bY10+=&Wwa5x`$uW$J)#Z*}~RmK&37F@?K)$ zh-D~`#Iw4Hc;+7H;Lgeo)%?hJ7P-#5G#6jrY3JP8L_D_1NXTk|wvtfWJs{2JuG5Y3 zjLJ56Jo2ISxi7o73Tn-Zhb z0qR~67ZE>xL$0F)&mI46f@IMUfiRu%MhCx*VuHMG`JOZBM9Yv5B~T3_>G*|{QUv?P z^E0lE7dGXvWlURfXL@EUS7~nH^P_Bt>H3ik{K<$j-~)#Z^pmBeWpD8VV#`?vr2Ze1 z2Z^1nf)+Lb)Cv%IbRHzHU%vv?`2JYw+Ljp62{vby7?g(0f=3DfU|k~HURVLSwBa;l zPb+Xs8JjN6W=4aE6o2%ghK){hTPb;<;b)kX78&B=yX&h=E$4RP{0Qj~paY6>Z41FF zchU^4 zllV(IfDE*zCuNs@U|~Fr=xLY4hFc24RMFyFC~2#-_6WwrInpBuv09PxQU>lpeOrEs z4l7l4V223`2#I)Pvrd8QQFP)?nw|m-(cyj_R5(r2V`DF0ZNl6rG^^46Z=^R4dUUOd zWIhZ%byXy$M->rD7+tYGC}Xma+sYA~&JBIaS$$T8Dqc?)yZ2@G^pD?}J#;jWtg=ev z>j$y%n_e3;hMj{>(fDbSw27kCvG92X_w=#P?j;sFvorRiaV+Q7?p#<|%*;ez^4&N1 zePY9m!Jq1Zuq+1dWlnUYNu#@uLxa7u(vaFb^b&O3NJ$BgCjJ2V_EODjJAQPaNF+Tyu>!6F`?*=`|LA7H#1D$*S5 zh84*{+J_5nH-*L3Zkf)tS`AEY_=|Z?MGtp(fFN-Hq8$39=ativc8wj6?RxfE!Z*94 z@}-5Xr<@SF1_^294v`&hj&+?Q?1h~Gu}RT=5jv8{ocgn9&dGsbC48!`+RLf*Pmadh zvXV&Ncug#G{`0Ud2Vb6g>(Kx9Z3It#8&hLD z{_^iSfAIPrD&8&fu&MRAGQKv!%J~`TunC2?8;ykOhDzaYVnm#BioI? z-iN1tqX9F+0VVHDh;60}D@EJ3U005Q7>b05440sGzWLC*PN(MvS%C28=u_p#p`i;2 zRJ-3qH0M?*2D4%Yt8e64h&c(b6WU?vVSnNGu)7|xfvZ(}fJ~4#w z92Hr;ogBET5it_*bl-Nc{bavnn?vkgOI7Xo{G=Tn>%kUW-U)si@nC99-EVa+zWBTP zPvAe3OO~wV5~WO>dyWeHryyX_y;R*!w1z41%vHcy`{}Tp9gt2@i31Tve7??}(XB{6 z`s^BNVa-y+tdvJ=fE(!J4_5`yGy0}jt+?~!xu4MIeOfQIXz93+_KVj|rOW9y$YhWy z{7AjwA1c3RuzSCOtg3c;krAAM}`C zeYG<9+4%r#h1SK@8ZlJxTQk*F5G*uc26~FEFqhKH%5#drgdyX!?5@j7n~@p3!8LD8 z4vYxtqT-`A33<@x1<`qKaVJr1sX8Hn;OTG=a{>ia>qvDT{eB;&NtU-K!}+po^30`A406juU!^Z&YdOOSBD| z3@)gNbxiG(%I2h?f=*!L)=wC6VcxmIR6Jq}pu~@`OG-M4+g&}YwbMzm;&XUYM3~Zg zM2dDq_8$uknS(z5i{PpafzCKzaah^-_^O_1nW}R&TR|MqdF*%maTo@A6n)NXZXS4ff37~!1$mi~BGqj8zJ2xCOu|X;NBX0ZIHLgP4aW-w1Gc-P=t6wQZ|mo+OOk~( zY>n8qY4KC1k2jkyT&??u`y3D6V~ARByt5KX%)`2UjSJ+>Qbu*RrX6oB`J-#hHHxAI z-WWKx{ixjvJM-T>C$4=MGVqSk09G+)@t}SvVcXWCXxF^wB_@FPR%y=@w7_xe-N{(R zD+qRDhdRKLE9E`fNm1r?_E_C?`u(bjo{T9O+Q(!`t|Qa!b-q$C+sW%+>N$Wkp4&y) zzT1bt`2$_g`}8`rsd_gzR(drK^|dWjBOM<0t#1U&{_n3P2wB3)BpWxJBYHX51jF2Q z0Ts66+S>u6=*E=|UoSNfAwRKw&;DR^4f$SUp>c=gM-vrqD`cS(s*(d5u1)G`?v1XT zrl3?%RO=W@e(9s(o(Gb9EAH&u@8!2m{SB|B3w`SW-xnj-Pa-Cy`$N;31s!3dE%sH@sdr#^@S$UhD0abGacOF!?!%M6aa#kv0bL_OBhQCnTPvqBnvc{YR6^m5> z7LgG?xrl14dgVd+QAxy_9vlAj84vnPeD7D3c#eOmyH;~y>`(6@O1kGX8ZP!>%v5YQ zps9!`_sL8d;_B6ST-hz&xsAyrgowUnR(;@Tr((c6&q1jJS?sQeN*34uqo} z!kc&bowjufr9)!_P?168?g>hCIYCGpsXqzppVGZJ!RvyHIz$;`gU-NMm?zS!5)UzF zQuGsoC~b&20EAoBt>A8I!6Ea5jof;sPr?%FPCxaTOm)^7jaDg!Bn**$LaNp387J6{ zb;{HKCM#`huru-yodc7$fAV-z^D6BmZv8p=#XkXuIQU#w2BX zoP1f5=SUE#j_cD2OKu^cP~6=w((u6KNxHqJ8gjG70w$x(6c8^a;-`1jVz%CR-HsR1pMFl) z6Ge2a>2OHu8G58I?E|2Xt;8g9-vN2 z*mOOqB%G+#OZ?HirJa@7#Y}pYVNbT`w4x8fvjO~O;gV5pJ*hF}z&LB0ZjCXLmz-dfwO1O8CwmxrX8NZP5YO+Gi9$HS^T4VY&R2D?)$}3_5<&_$?*7%*_ zB(g&QhM%9QnUS!@o}+SRF3>}Kd3aI3>s=u%##re_ca|;6(CySj&QwZ$S}XjcJE8V)n>D9)+BFHI>*U8Zs)i>d zevO87lalRWbkVC**mk`2LixhpCq(i0Ui)h@XSZbFWOD_h(3GKZo;Mqx@J7S>`agD( z)`6yrGcn^PXTc6RcdDR=WXFs=&oh@Y)=W{? zgy|q=MdJKy`ZDQ=R%qv7V@AlclGPmOt5nV%@3HI)kkLLvg%{cf^kNP&A*|PIF&9Wj z++6E$vecj;wq0#}rE}&Wr#U7Qt2)}67#q{&;Axke!U1H+sW;aP&9qyQzlQ1Y0Ug!5 zE4FibfNY;2-<(>9_6Q|GK*5NCFF??m^ZSJX=hg$cUc|b_6X0A}$(&oDyEVM$X5CsA zAIkY;YyLy1^Lx*os=0q|uNK!f_>Q1Qf_I&b;K2`SPaE*nu^2m_4fyv@>GkHGIS0?A zjdF-bNydpPbchSXy5~=cOH;AsjQof@D`nKyycqX$un{+O5sI-`_ znP-p@Gy;0os#n`~`sryT&1~ont;S>pmYv-0a|b>=xwub1vp*TO{tmYuk4mT*xUkH& z?Xpet&$#&$1?u1qy6p70=?1HNKSGxEvZ$etw&GYD?+?~gEsV8bD&@WAWSH(N-q!Z+ z6$;2bw*Md^3%ja;@PEI|0IWE&+P-jZ*#og2Of7@ zodwRZ3~Wec(=Tlolp+NTDifmFj7GB=QSgY(xsSRmh0>61CU3;Hj z39Q>ncOq$H5<*z}=y*=M2Jk zPJ(UM+Jm7B)(}T4P0L~n|LolypcB|>9CiQsk((Q{ws5MFv{GhTxSb(w$s!|0H`Cf;lsSWY-B%|zc_*)MOK}{stTUzY&TIW9 zY)jT&0FM@Ei}tAhXx*9CMn^PnGza3rSUw(_h}A{zws>w}bT{~Vr9zjGeMFgYqa9g$ zom-y!<>e55JFhrY32x|qX%;K+t@7zs9*g8kF4t|*;~A@C;XXo5Ox;j0mB+ffv_fPY8h!;5{7gq zy?+980&}Ec=%E(mOpyjcBx9r0t!01#KfA3P_+qB{B*oNm=m5m3`l7X0$TVXQ^>cqH+;-_XJs4HkBMI9}w`*>EzwUnFckE_A+AwJXn*Do=MP~yQg!&)%2H!a_ z^;t_XOj}Deaw=&sD4+Z#GnN}Y-R!5`VdAOu58L6E0^xX;GvBetn@N0=21&@OxQ%3- z-m|VE9K!C+nq4F~g2<1VfF^VxFcQLuNw{C=IiT0=kIUK?n}J!DTWs6eNP1Rw&XmHB zDzclAS?$m+_rnzA679T_1&yy9K=o(NdFoRpxS}BxYG~SGW>~^A9}hBG zNA|S|BQ#~|pkDJn@U=8k0*AtYAY4m{L8R^SxF5bfUPjCfFCQJzj`3MP`X=pVv%QP> z%Wc8%CF~WK;nPR>fSkW)<_8PoS+)k=1AeiDo$?hySAQrs39H4wl4<|#aVm{s@xU(W zH$)n{xzt6a2}tEN66G;&arZtM6bl!d_s(NAYHMuZS{hYJ4Ny__H3Hr8BO`3MyFdrO zUO(nP&;zJOJ&)vid-s1o;Z2g=GLp9ID;#`%Ff-b0iTG33gX7;e zM*m+1>?e828mgJ={4XJyY5tvtRICesP?g-jWXSP9$xx4SF+4r+=#F|P`uzvr*1!2n zQsD$WpaHP<)%M2(aq;*g2sy<|f(Ii1rltS-Zf2&6pZPzcySc_1|Mg=v*T3ZQsv)ge z@89H-{XfX%KlEy5{ ze{n<$_2c!U{1>8-f9L=If~fz5GWFfR(j`Tur^0t^`iZ*L7xuNt0HlhutfkTj{<)j-2k+(_l(onSJ^&pr6&e zJ2d7xjkjym8uWCo%Gzr(zctS^9j+M)=22sl7FMhDPikxySV*v-R-86b{qF!a74R2H z7;}^=HiUWc`}ODBWVB?Ghd+E5g0Au5>Y1Vs>BR_@S&Uve)$$qE7ti0r5%Rp#c9{at z&HZtM6`dH7V25ZLe!c;j^Tptw_9yFuXB}ewQuMuc1{)q9kzg~J&^lPOVF8QNVErRF zukr)Cgd@GBV`1SZdY^MtFIKZG_8V+wTGABMDB*Os4Y1HlUA^~5H1n0K_L5j=QGgqx z7!AwjVu#27b6xy*g=vxy0GTnabt3h7NW!5}qQNN3NNRPcvN~wNk>p1!VTXH&WOyFo z6f_c<`m|^`>qu&F{kvLgy`%3g7EjJ}{Cq8C=q%F$J>wi+{SEkR!{1WXC*bx@#2(XY z0bQmW9OoJs`21I@j5l0{3k?|5Ee))nvjf~g%c~z#xNx#mWAywYjZDCyL#q2B)rqIrHdOy+V4{%MC=x#?n~dgj^9rd2@affypa+>{}H&<26> zbf2YOZ}oP1U!Qpuvj=@?$<<#jT7;_D+dB;#3=d^-Q2YRxJUqIP!!Iq6PFUQ-1y~={ zFI5EK+y6V8cOWO2nHtL& z`(>)DKzA`k$4-M5kw;*b9$g*Ut&!d+83k>Cy%oK1usdJE>ZSZ%2)Uas=v@zid7SKe zS4RPmxj4Xy9cuWrb>}pd(J_7MM9_8VPxgYMV~y2H6Us>K&Sr} zsqKG}N+xN;$ZP{_&$szBGXMvB*00Ps&-gl%H$m(+8Ze&or-F{AK??ySKN<5(=6RjH6KFt|`L3~y< z-BLmqVjUm3veovMoqjgEz{B%L%63$x)9}vVp!+yOoOL6mbisJeFSL^$W>M=?d}mUi z^Ge%%qtOrV8(D_n9u_*9a0Hr%iJIR?)WNVZaH`sRuo^umW|23~xc7$46t-M#z7_QZ zo(LzSCzCi|G(W*Utg$&DJ7_~2GAyi4{Vc__z&TUhrmlr?bXY^?#6#x_KYRT; zbM^L@wX7fB=qb@lj>>E`Qva^u6|nVrVIfJl+uLI=yyF*GmOy>fKGY*q;;CfmVv^L@ z?Ftfe)Cd1bT&iC5?@5#kTl;y^PCm~xL5ygeY?o?lMh{}IeuguVhVudIAtxRFuBR^p zYRyNnODSF^+ylL9|2uB$Klhz~<0IjENHB(c#P)YSa!-J%$sX>s71B^C?^o@xqgHR4zKpKY720U3JAI3XxL+Y{U~(y}>^oo$2}BWRiw_9CQ{{rmyt{3HaWaD7dH)LHo8ub^bpWajQQ^)1R}E8TmNL59$q#i zB_l~#OX007x5;{@UmZ@(pL@nQuIr<}qj&v+#1j&znJm?FgVXq#Ad+jzOX;H;^bo0& z)9Gtue-#;lnY*j{_NIEJTvKJ7LsOR=3D-)_DEyvf_`y#_8Oh&$FDv3b<(O^P|+5%Gw+b{Ur zd+!vC6EdS^B-<`VRE?-Y{2sglu_%pl5%on`y$FleX!aIAE7l2DAW5d+(ri0DKfK%- zBIobK_Z_s-LY6-o!Dy<+r$m#}n_oe&#UMR^#piD@c8`^BGT0OD3`S&(b3F zpicH5r>>zwjA*FCR6PQ5v)SxcxB@ZH^Vj#v_gY3=GRe0`-hzW7UKz+Z_&cwB7Pn(R zq6`PnK{e?$B$vCELf~=ajPzk*6CHw*Cw;>3q_UjygUtCENvQ8Bx5sR+Qua$g<8l*) zfqnF)uh|;j!$Ffp)Rg=b|6k80KjtAnq*d!(M91LfzMKCH#KV4^CvvFv(d{-W5=!W; z-K?i%p6AW(kQN?q@1!9{`I8rq;*6RdW!*e{4U*@ad*~n3I`@=&Ss86Y0Rc1ZB%4Wu z1HOq`qux{e6+59k_kcYOSD~OE_DLGKjVTVb6hZJ4bfGS@hwG(jVUyXWKuO;jN4PL% z)`;v%K{btZ!#D`5 zx5Hz3iM4OR?j_RKH-?W*u->`<5UNcdpd zoaBeV>9r@WN7&Y?hn+r`V(^1(9R9Ft8xg%J?vGl6a}<3L^1r$qBIjB=mrDbPO5*!Q z;RiX__`}1}@NgwPDZ^nwlBl{kke~QTQ@nwy-eZ zh}K3V0@2kL#9hDalyXud;^gGK0dKV!%jUgZz4ymAqQ+#~k}ER5OP5B(;gvEfc}X61 zb=$GXY5Xs?-ZH4|@aq;W6pFhS3+`T^xCJlnf#UA&QarfRQrxvAKya7h(%?=hR@~iA z{`by#-+Rw{K4g9~`H-1slI&;gwf0_1=wN%Zgie)JNZ#BQ+TfuhN74b^?*ZDys)u6H zDtY4eS-n0*S%zNvG380*csvQw=*~w&{QX)kV??4%#4f5gkAZkU#=6*4yUkk;uGnP0 zeAf15wb(i2!jyP;PKs*JYlFL?tUgXwKYE}o-Tq(ETj!h5pj2DC$+tLpRNJ`hV}BQo zQncT%KcDPK)l#ld-Q|SiN=4Pg%y@jJpf#R>aW*_QQmJbZ+i++i*A-8OZ`od4eEtk3 zvg*nvq;l!Jo`?8d?g~d^{&-okx;+(1dczJ^mo8g^a`^t8F{RiRvQfZYB_|)Bu=;<6 zxNFI3QuvLIyxn_YQoH63PUpKK=A4OB7AO6-j{3KKDeyU$wzpbXGdicc^-TwL< zLWJj~epG#&Bn$mF^?bA8^6{u;E2>{j)a>K&v@+YYo!_*uVnEXd!sAr(zpFnu)Gss< z{mTbiI!U6uqTY|S(F;X_0_xN} zU&bANQgPmoekJ(fzU=DjdAe3IEpxU0nwC!li~CKW=(UcufVCHR;z480$^4hJl8N>x zG30su>`Lr5osxHblrtzy*u11V8W-K-!(yteaui>pki{Lf;8W7(U^4ROEi{1w% zO^cwH4)UnO$yVo&!Lp85h4MPuquGb+6v5|zrlLhWRu7tW^+5H{X=US4tGEo|+AVk6 z%CWzNgy9}}#=cvTn}r!TIJe>+r>?hJK6;kQ{ekon;^KGQ8jM1~shohm)_#5)V?LuC z=KMIdu8JevN3ELEyFI!LfzPr~g68U=z}mNMI0A^3cBW z;(WIhE>W#O1Cu#8HvGTA;X8U72z*^2fkKGhZvrRb@({fJC=UgQF1s8{;oyZapZD13 zTL9~866lU{eby^c31Q4lHwCNWB45*%s%0ktddZx@4qR5Or^g?2*5eywq2r#pI=-)$ zb$Gu+a&7Myxlgna1_$6>0z6mbFgmXLsnp$^DkEK_mPvRA?w>+=@{CZllL)xoJ$+)n4f+ zS?m9#$VWHhCw%E))12s?G3*bfy}VMp_M_dx-Uq4zLRPA%nf9iR@*g@gB!Y}Yd>%S; z$Lx|IguiH!Hw^SlgeWLG#o=T4_)_PKU+4pZc1A&T5gjuQN8|ryfqDBEdk;ollbK!* zJ?Ra_-$>~U)nUg3cFQkI*N(MEBj=XD0+!RWtt5 zXU#q{JHb*y$&pA2rT4(&82(eNCdR+tFAa@=_^|=U-b%3d>oh_8N=@BP_MoHM${S>)z7 zUp+GV&GCm59k)y6P?4 zEK1{~az-Vl_2m@4?_(u0rj7vnNeByVF!I8a5aCGE9gk!A>B(U#+SWktSnT%BL&kRP z8#se4Za78wsI*xZOQQ9ai5i~Z;JNuV%3-q2nyqN()aCJUa}Vbk-XdRmNpEcYPipMw z8ucg*7-w9GGWC(mQGvL&e?nlgMja{I7Jrh4%P>xEv%qBl%nNJ)fI5IhzQ1$5#Xm&9 zT#SRRHM=~`MqO7u?>UN)VHa|mm7YbV1slg9R~m;Trw_#Q3Ct_UL?>m4BPzNNTWfMQ z!>UPT*Uo}Ett+uKE9IfKt^dln65(z$@m=SARd8?cBzS(zSXc0+Yj1*}E>gH`C0h}L z-H(%}lkN#c=$qCfDR>T+ri)R#a|-}Cu|wlnGQ@lMbYCNOVKnaxKe+@Y)#J5`gOsmB zP`ZWFU*9#S+(r`Ofsye_qso5#P53@0Nz&o~&xF~tzh()-Gs#B~M3Cwq2-#{1>j}vI-T0i~RqO-T#H*@Qv>Y(5Mm&a9PwG)n` zs~x9pJINna)qUIdlBHI{*dSh1tL%;?x>JNfKz6gUS%UF*d+Iwa#iLK)ZF@gF1`sdR zOsCN;3Yb&-NA2R}su;_!Na6>-=aY~Pj$Y7VHjTtj4)XK9{v>UrQIGC!2>}{AAX5sn zMi^Pque~IE4!uZBsu*IJwgWDQfyd4ZE(gwwBRt>6Xc}OgE&eft7k zGIW5ln9La&$EPInm$KnVtGLWx=iS11#1d4_Ax}5B9JfSxVYaQDcpMuY7uoQPgzlif z%`uv-4Mi&0;{X1zuLcjhUY43r!h2KhNg;W55#Fo8=*F920xmHg(c~*LT3)#AYzhb? z@`di>BpSkbuEUFY%hjy%VMl2UG$jl}@!S-iu{-V@aMi?$Vm7^uuwhqkrFRCm3<#j^ z&~swwC&AZnH(2LV0>(X~rud}JkHwA8Tnm!@tJ(nm?Z~`xXcl*O7~zy2_UEAW#pTGC zz=65{Qh;_ecg;87D}sN7bRTXs$J*k}h1ud6T#l~P>H^0kR$p-7ihG9Y$=-jF)BjDQ z24MZi;;RVd{YdE38K}+EPBsf*0IY^q5D&RKdW?Fw3)TaHv0F9ew1_*`3nuySxCG{F z`~BYJkI`0{9(4L2hcixl=fthxY@?Zt(4dsfMz=k;O^=M<3DI2e5UTEdS;*``~#CbD?77U%G-EzWh zWhMPr;V0|K{a49Nm#Vh0euzF4~ZY_qx=Tlt*I{yuY1u zCAIH;BY2b}M8%hwd(WLUUk4Aph#IB>j?IOB7T$^&sw zLoji~WFQdZEni}{8%1z(o&K{if094b^Z>@qx)TKKJ=^}AV2d~6zIAVcO6Gp=syLkK zGv6)N@i18}i!FBZ*FyH&(JdK)kh}Lov(pcD_gWRCF+PKSx$jUPf(bf%`ObN(R-Vd! zVc4|#;pzXSRqu-G5m$jiqlcTEmExRj=`m2GMaqP*aG7Yel6j32-8xXy_0n)CLp8Ch zZBso~+@@1$Ku;&WFoJcV{8GG2cdDXaaMyRQ#XR!Zhf)HL0PKxcrQLt`+XNg7SR`v7 z6e+!@6&s`as;9(7RxiU}DX6+iOR)_FjBXlBu-nfs4T^UrQi zn`?*^|F);*Ede=12u7NXKj*GH4{AA!GuYedxQrtB8xh2#`i~@4=>TRJqJ#Hl%lrVG z36pbsYSL>kn0fxaYIU%6wk~m<6>2oOd=v+e?RlcXXjl;Daid7Y_D*bTZQixj%XF6* zZ_VifQ*_gad>p=27R&Jrc5wI&@iM;5ShLqY(bqlr15KF8CkTe6hDXbeWfFI7e>kL; zs=G-LtX`G-G#)9-7I5?UDL^xLWNnxcdu%cO$F>@|;nHMyXH$%l0kP(DeYOh~=ktMH zEaCEN!CN+^%^jwEDqTiiMd1ppnBY~YSM z@g*ROJJ}m0K8ZNj0`JY7%YE)7?T9MeL=a@Lch@jA`d(m&B-D2UC*)(t#q6RrL*t7U zv$Qnj+|P7AAGX1V6y;K%`JTdn`$Wt4k6eidiU$0;VKX_|!hhbR4dZMHGcao4wVjxXg{G{x$GgEsF1FizuE78yxS7!daNdCDz3#Yh38cw;B<|2F2m`OGu-`61THOV z@~|SUMbAfs$-*t6y}ZYv0F&dcg`l>z-nGd=G)D{fNkt@@l(aWA>5$XAiO}LRomxzP1URtD-fdc9mv7UQK~WJ@DImhvYy^TYB> zAD$}`1%iCfe?Rink@D<|7$&oj(@+7Wh=@-;B?EgcxzrG55CBxPM%w!>3{4{vF*J;p zF|#5))RN{EL)U`Y)OoeVOajct%8DptUh(w?LLF`!+EPdrtUnWcq+GL*A0m!u3YX`n zhEDe=M&1YijUi?sgkaOXBJ^sc>c~KoNg|#A+tbXP(YAhT7aWqQ-@Lxn6hIo4z?l|> zlH06OT4ALfyPU7Qcy;0FcU1L7S_@I%?X;|t_{R`WGx&@9BQw)7`-|)roH6?FYbXm$ zNh2i#0}gjwpOvapus#b4PcQJgCRGt&l(8*}=^tb2xHf8g$7GU6IK&SpCpI}0$^cys zv-mb+6)aGx9b3$Q3uXC4-ptB35rK~YB#M#Rr1w=Mk%I{iPS!D2RRjhkDcjiss1FvQ zvTOQejfcX~Ub<~3v;2Z~GOuMXQ6+zT-Tjh3ztQB)@=IC8TtuwG`k0E=B6W^LJ-I#$WnL&gAI+iynW7Lkvf#vPRX2^z zR_5TKsl~-tIA!U^Y{C6)bKEFne=bz|L`~mNo>V9PFuFX+#f%v)kBMUidaGYlxGHtd z_lY&r?fTlxc7mC%RNXZ>X*+cuc_4Kz?SO47)YW`VvaF8EiJ}i|9%_x}Bs?6}6RBB- zO(sN}j*(qYGj0|$K?6h{GD$nQs(-*>K^dg@nt-;V&X8h6^=r0Z)C^QR+FvFlpE0o5 zGrC+sVP8UhIsV<+$>ZdUh|{#*!CUI1BlH#xnILj)9jj-Q5C}v5=$)U3rJ$h3rFSp> zV*X---a+{m=xwg|Bo^+{FF(&M;REKSDLeLfmTcQfrljiJ65~G8__U^M=W|BtJiMYnt_MVr7OP_nA8%uR=`!1~LSjuvgb9(J@sx z*zW&!y-2;-%UPUbg1M)yve3GZ#*SJV>QYs`(^Hf!zAj#%FBR;NwKlS!X%AO%fd$tEcV8EEQgSosi+6HYimkxAaKlN z{^i$?etht=$^W;W{6E#jR~pZ?h=Jeji8%o$INchB)HjK`mRSt|0;r;LOGcboe5K$> zrb8kf&XOv`MkPcX$K@22hbN5qhO{8U)r0`3lvoPX4gLv2IQM8dxwd}81?$Da^A7|n z(0?6Fp`LH0#@mGHB*&``D!`dZd>-%=cfU)lF&OVV**FF5F?a^ySk&z+(Il0SITCX3 zXT&BebnR5;&MNPmIR_sMHg2_KsJo*I^Fkv6Z*f9T(@`BKi$ciGkr48&1{Hni<;f&A z#<;PgGq#QF6C@wmcPj@qAgr39()0mRXF&+2w2=yW7^0kx6(Ao8n-25mZ2G{~6Eq)o zjjOORrX z+Do;nf|z(W>I=?{(H!%?#P=xfrPo(ku5zQ~;MfRM($X(a!vpoIKHK59{3&SVIMmoS zp4dk1zP;MQxZZ?d7I0nNH6Qe+gN{on9!W(u0G^3X(+H5Oh;D8F**H_#NKp5IKL$+c zPLR9UxT1hP9AEJ_<}s8U340GM=!a%Njsz6VZ_pReD=eGtb|UF1h3m#CbtC0i{{Wc% zl)zNuwRCdLbaymFFyK9@5Zd@CY?2 z^2qgjFJzZmIr~p=A(VDgTi7!BfPxLNeDnbc&r{I@MPQt4jFc9uX#XJcQm!pe+;@1G zb6!;&S`ZNzOFT?Y$bqZh-u1LR|A#|%%)@LH-yJ%R+W&zBSD_r6Dzj+Er0iSqWIG;5 z;NE6Cd>gd4t*wn)id3UzA4!K(F#c$^<2a2rntxD;aBs}}s66DGyi!?_mGpH)GBe0H zxv2C*4zKLkC=gae!I3MLYc}u)&x}O~MEt}!0z-sN*K^f~Ksg=lS1sY0C>^Pq(D#)EM*_LylT&S9=*zdh8@4@jQH=Si;tbTg~lDb9$W}ZDv&QI1Xzh za0TQ*C1*5A${N8Mq@Q_&OLJU8% zQxMIA1Yq^JMY65hstP(D|E>zpHZyVh%MB&?OI4(diy&Q2lhgFdN@m3#ON16x*?nEs zlmfXF9G4jZR=WSdLZEbly>Jz3QarE5Ef7FrtUhb3PkP6;l7z-XRS1-rXNMVLe7Sda zo#!IG{)X^ts{!A<-|NtS`@Ix4Z}lXUDW*npao)NT0(b#%G3lGxPsvg1ROYczaJ}NC8r3K zq0EcQ0D{QABRxHZI_`CF3Wf^0J*C)tf#*k=HIA1AWGNb=-R?{83X_}qodWKnoyBuk z*XJn6%jqkmgM87L3|7=W(`u2|{y4;1DI(>6l;M2inf?At`vIDO1^-O3lc3eiY_X(+ z^rtue(gg*kfLr;7u$Q`@~tW+mCpTR(E`317nDWtnVZ2uoK2?4A+Yu< zDd4U`M1&4`Gxu;*vk_ucz=yk$k319T{$ZH${VeQFb&-iOl0`1=!u8~2W#gxf28a*I zD)VY}G8d?>$u~O@n7NG-V#G)@?KcRgF?ypuC8htG6;07AE>}^h$@;)}6KeaHGvwqX z`p%XxtI_7#o<-Plso0wZm$#EREZhj?!wVk4pyA&K4XyI`ZU|Cu1WboTd93^8kMX~u zz+JSlL9aA0bt4G1e(G1NvRx{jxXL^rYvH2g`D}|_S_-8YEe&9Fy4U7<%Q*56+cSc4 zy5`p2I8Q1@7?Ks<#~HPGq$XpeTMd#ze_J6}K#pbzjqFc_b3REYKB{>FC3aa5Xh$Z!<04X3(IUYf}KrU_8q@tVK!xAR}(B2AWT%Q_(CeWWa!8SSe6N`(M>3s*b=K9)`lAwU?#JMg?T1*wK8wVlr_^@kOpIiJ;&($zMf->>joX-L3q*9ahWed++)MXuQB(Of8Z|7~;0{@=! z#R9y)6<;g#a>k!>>*ohEinQod>n_=FY^3RXJb!kk19=Lk6Q@U^(J@B6-u`-J>R@iK z!pEbiP|$ULWe@B=4cO6KO2y^SEbm|r`{Oh9!ug3*oCcHS2Wmj}E$KaiTVFA=#0czS zw1%i2+mOmWYTmY+My^6qHyM?xe&|rYP0oN0DV^Weu7I>rm7ZEr)et7TLS%PN`ozv^4m%h@vj` zo#tBe{+)A!a*r$NbY(l5ao|L3!&_Y1#aJ(IAJBzgO}{#Hyqg?WBz`4e<5(;{jGN~* zQrCB5S6r@CSd$^Y)L!L;DU)Zat zy5srkiXVPbfcDfCOI|sW6q>7AkaL}Zw9ee9@FxUkLb5$zk>lghB}A=N3$=`RMA-4h zp>{<0=$;!SWpbh$oou(-A>{b*N2Yy6th;46>}c@~ULd82hnzGN>`G&*^d_hMj84%E z826I}X-cE^N)<}c7>J!fF1R5F#VujD9jX{fvyv4M`9uio%YA?!!?Q(3yfw}w)TvR? zmAtQ1-si92@iI}Ovf}T0m+l=>GtP>nf{zHUVS@xDt7lAJaQy>Dy{d1vr^5Id7enEe$9#ZQj~6HFy(Bu$e5!gmGw ze@;TQ83Xh}$OS^K>I|Im0?RE4IA(EpevYWH1Xf{^Xc^DWlL<+7+vQ6NBDkP$zX^4I z7^q{qWiy$?ap*^5;Eaxe??Zkr-B$O)`iHpY6}rO{n);gd7G=~m%3$nymTv`DUy_%u=N2|0 zVETDRhc8}{V!HioY@aqQfN|jWh8VaU_Fr?jlf{u<>oWL?)GQ#Es!j|*WsO){10qoA zcnhl`$HSZTSy>anlRh0TDV=5KFCiO2e(sQ}?8|umkuJG^Xd&8$rBF~k4wyHWQ#M;w&3Y39BO;NvhhMJ{&Kg~!=FE8xZBvROb=4`f1>wQMpr@< z(tBC(cqM|oV;dAcy2v`4IH~Ouu2k@*+Ay0p>qo)CQ7&UgbD0QGo;UL{gAj6{q%Q_< zZ6=sqrc9W{p=K#SYrfdxWpR6dBUD&EWy5>B?O`BlSEtR|SkEY8WlstwzW6E$9k0q| z2m>ArR`2yCZi!5IV?MH{b0^a>+W>6ibN{cxF7s*8bl&u)ct`KULW=lrLTTE*Ip|ACA0IFWK}N4~p+A&a7x z0mm@B>;A+BKS_fnA%E1c_oaJ6bJ%b>?otT;wdGSsBk`b0#;Z$C62S}D@2L=8303G7 z->|`)hpJd*EHPpwl1!^b@N@&lQKFxtALp1cTK!JXcWU+mrJir5OvDz}LM0+y8VIHc z5xkaDg4OH_N(J!uV`|XyWR|tnD)@AeS7-=^5kez%_W#hxp_*1B75?^rCsK&vE&90p zg8tXsAmVvCYp^)eq<@`Yx5f*V8gU(0xPo0h%(OT{GqQlAS1-)6!U)1C{j6!E>pZW8 z1iKCDmfKn}?}Zl-iCEOI;esh7@4e2;vRR@=Ydu*=PE#Fik@M_Z`pdxsbVJU;xL-3~ zYf?$2^NAUnY$dajNc+bR)r}TNIR{dorN4ad`4D77J&RkxFdD66&_$L`zFoig_ye4Y z^T-IQ$o|5Qgx(UL#j~ajOvWrOwGOon`_#hPhdPHofQR5q=M7{hV$(w#m?|*Kkb_Vf zb@S9UPOsLBL#}8grouKlvi~rb4tJe~pguN=C$jttKKRocVXEF^M5!RobmbV?^((b5 zL7aS_&K*M*N9xtB!8!{cAr-NH`(htRWGVC+>J8aNv;mW{Lr@1Skel=_!PAID*q8n1 z6#z&BV)#8RNIm@9gCs`S6tjC>!M3f7%DhR;ZsG%tMo-wi>F1ND3pN(^d9ri&&yUi? zkVJLSgl`|EJap`wuzvggBM@*}9FtPl|4EW7zc|kmb$5^Eh7juNdFV~I4G4C&6(ch! z@<{cPQzf5lM!Vb{Np&ci6_VkQ$J*-OO}IuGkqRbtUxIXv@Xyd89t_So&_)ZaJPf|W zm8QL4ox&gjc~nU6c3nr!GR0xZ(5*vn_UR&Ubj`~0rkAQ|YfpZ|CD3reP=kp#dJfOi}azq>J+)xvo_ zzEOkGb^9q#bP1fP&~!!xRMv(*J)eezcmm_XUeo)Yq_y%oc=o&)g7b?z7#TME<}S^i z3$I6~>ggy%{d!Qt3wJV<^d-i$IrKfHVkva~P#e~YX%-wgEs@8T+OlxHC)E8=U~52Z zpD|{oe*Why2`dE0Fwqfhj65?si=R(64+?U>$BJGcM>~J0S@<;JUR+I=(-AmI$YsGc z^sSVXHT{QtGbm)93Pk>Mt;PCTicG6gymR@?UZ3&$QzW6lVgk%(`1{O{QA7IFGRZA+ zBi@hk83(2I>KQ3gjvGv=s9~sclYW$GT+)pA91SIjQZQbZK2bbckE5Vjt_tsASs7z8 zs_mPm>o7@eyX+3xxu^wk{zWO%8s--5O&94xu_Hum6pV{tG9a zAp!mXmFMa4*n6nrjo3vAl5Gs>jm;;CnzSJ34r%&cfBJookYPOE~s#B;2ky- z$X}-VwTTA@UXiG~I_Ji}hnU9@qhcWkLl()5%L56}7x7&gdp-d!q3#Ao>dE*RBqs{g zETu!-JDfD62^V3{kUJs2voC8xs_Moh2py;#zvIOSsu06^j7Zv0tJJb`>aWQxdR;D* za50_N+lSZ-DlEQ)nI*qLGOwGWHB|{=*Z(1MSw~A-@mbJD_uKP>13E>L;NP$wAwPOJ z`(%G3S4O1~0(lLStjZGjb)i5nb;X9>^^-T7R)Il*>OtX4-ovnLSlC-5RN*x8h=Ql3 zt)Wki(yloGzTUvFZpGkO7GioPaI=;Y>cP-M+!t5OB7zDOx?!B*va01e;(;(w(5}YDMCmO`)`Sr_N?xa%hnDa)vjbN zl_V+RA?a<5_+@_i?5)5&Q<;o%Nx>4_s$|*j*If4*ZP5Dn&+k|AbT+;V|BGiY1yrbW zO4eV$2U>o%%$&7(mkl6zZ~WQp04p_D#x~ib#A;*xcSlG*D*c+-&o>{DP!7@9gbAs( z4SL`GWgV0N@6A@9~mV~libAfI(k9eBA* z%-!sHxc;8MR_J$2<$yDd*k{)QCI<7(QrGA#GoDWT+D)@B=8-M%jF3*Lyt0Ux3~)R_ zk3@?Y?&5dodkE=raUj2lIA&f4^yteI&eJcTxrPmwCVtZ^mDT-H(@!-Q>HcmGW8MKb zZX89xHCJ%?9LUEey!%7?TfG24z#u7ilq~{3`jT=^PqbFOfCmS+4L8Dc?7Q`G^C)_f z(D9FfQ%()o#Zkl5_wSK5>tou`j@$Ks&3+t-@ya5Gn=F1LmSnC*`N*RtYgUO-QWkY} z2dJV(Xjrj9=yStpinRDwO@?f0x@^v{m*n@qls<$Q1Et4)5jSiJv&_j$L(|D&^`C&s zpXYq(EMNAbC6p8D1wHdzmilsI7v5zS-<+6(fQBpAJkn)5)YcJ^+$JyllYi&`IP{}vRdsa2tB>bAZ zrWuTpnjOkaaGexRYu{}B8e(Y}s$gm0uuz|nS?zkpPP7TbKZ zR!hMy@@Yq=Ao~Gs$xHdX&0T5IwQBs;mK@<`aKbQS%c23fedb0H$sg`1=Z}F{hLC`5S&=%k0mH2ND+W- zDvzXCy8{npOs=#3pT{jV%h(Ejh$HV<5Y~6z&j1Z1Q&3QN-3k=YP}`N!Tc|aiY0{-D zBv9{U)>8w)?BuO=Ug$}}V1-MD32 z@gP@v7(~z)enuP~7mW0XYHbiAWMq~jWpilv)(w9+FG5WmX?jujH%;*hymdB=+y3Mi zWr0S8K@RQ}s_k~(BivqVEgteRKIWOjyHqukKxLYT;&H^07_3x?CxR>b@Z+C2ag!<* z0r^Kpzy0Lkzx=RNyETayJA1x|LM3e&p;QTTtR>TcZr}VmZAgBBWm8b{FylJSsNC~dgYwAf&R$G%oIb+IJX9{k*sAfD**s??X2Sr zFv2iGO|4{vS8OM-_h(U=_h(d(p8LxQDtdNiF_lW{(2|(URH@C7^q%MTppC4~1;QOW zGQmJ2W<9uQJv}2(f=1uhtm#=mQP}y8;Y*0cJh`SX^!xjFhY)J)8vf}m`5o&@YCW#V zi5!UM==50)_`O2DHRWRldkaLgy=V*IF}xL474-l?=zqaXecR4G`;pq^ei4-#mGYwT z3Ota-@7zSl!8I~j(|gf%);TZ!^=ka7e(=m4S!RSYZR$k1WK||a<2b_yZH}75v@f1 z2utPhctYKl-!NT6o_rez`BW51UC6Wew$A;tPn>jpfi+Ffn|o=+;2LY~oO9l;Asl5U z5bu|fP37ICm`n=9pJn>DYP_(%?{(3WhH(OGv62RF(tGUKb5@A|?2!&m%A6AX#>aIN z3{X9=68L^mbuGUtQwI9hfxi(}7n}{l5>52tuyIJUqs|PGKWHbfBZ}Y_397=9qKX zeqy9sB{n26!_2(bgDD(_6CjL?xAv&JjG3hVNOX>nZ|v##ln76CeZez)aTy!JVmlMj#rQIU^Y}-F!FAGohJUL;MCIj z>`dmaHdQ|-;G~qOWYdvx+)`E%YgCp7qI~<0 zwkdD5WYHy!6MsL*5}9wub1z*^*wZ!(yw(G;r~RTZfVOxjzjeVG~WT% z|EAdS)OBc8A*>=Db{30wFWoA2Z{2=*^`a8bd&IukyxZJzP$_IwcZjab5o6{5w(Hr< zeyty;e+pwt-04MpV4|`PC3ZfUY?wUORC( zrZ4!@X0N~mH~}M-?q+6K=d&J`mf0)`>|&2UMAJE~R;5>?I*>b=ayybH3B7?EIBz^O zB^Ze#7-FlGE=6mzo0Lzz{tYfKTT(XZ(vRVw1VeHd$B&Cf;`%(azJh`?_v<%lQO17~ zwD%u~Ra>Gp$pXK2xx{vu8p+_A@1+trxZJKt=EAbFabwa#r6%q#}-i8nFsEPIRaFE&ympIr|!?VT~cZ} z4GQ8>oDY{yt(TBT^#Wb7)9e>yfL>`L(Y-u{N-4d z_+#qTDn5e=X0ebCMhwGukjSXd+;Apm{vf-LZczk4yXzmGtVaUeC%k z5QrhfOZ>7cXtp2n`W1PDh8!Zh;s6|@qT@P|hoz1epHwV}nhy<9NGeG}Kk4MtXshEK z-~Anf7@$+JPXPjIx*v{0SO4$}X!8sIlq!)+Kz TRq_jztYPnmd^d#mkC@vHG&KU zD$PQrOVli&D%C|?fodz@krlC&*-BrpW@^c$!@4SS!28GGa2{g{qKGU~(Yh2%T4hVv z2kC1qc1^?;Zv3WV1vsF+xSpU6wxp*b4)|ktn0iUq*yLvy+1-c@p_Q5Ypcu30V3mMU z>8q0IOY+?+g#O|rRsl~$ONhzSl-EV#dn9En08BO*i$!v;4M4dC6SgoabJm;!lp(Ej z@)#{XqYoGX?UbXkZ})w5h&{?O@q&@xv+0Ieq*Ip$FV%CKr@BuM3uDJO2R_+-^tgi< zuQj+t)E-7Von6`EgO7tp& zY#*Ca55k?^7I)pW9~W%DCy2>pWhu0`x9`(86H*;uITYBf*$RL;SGr+Sr3Ym;k&mS5 z=yAS6FZHn)s}USiFF&^9xJfh(1<@dn8?!6Jc&QG#Q&jb6uN1<3UOr7Tpi|^Jw8d@m)mCn+jNAy0hvj*@5NOam17LNzb+x$I#c{NBpB@&%Z@FL ztIF^9$OD5o;_9xgut5aPD}wfNY&?!M!r%G#7h~4<{HjfAHdf`44_{uA$B+zJ5g+8B zC1<>WgqGIiNN-~awHzuqFF>nYFzGz}bTTgwo5RA0iTXL|&r-%&9x5tjRg%XuGo{;G zHOfK6U+k<4ajbuobp7%gMJcO zBDi(vpi0rfVrY~rK0OYI@ob(GxSrMd9gkzC@tr>n2nzplppP7~7?dT6vwXF61GM{3 z(7pWTYoIY9IDuUAt2ASZP4p@Q@tw+)$u!Qq2oT@T`eW4fK-(Hykrh1L(Y@CwV>B{H zW*foWNd^j4Op)fy%B8JLR9cc1(f6;DyTCwC07Q6JnY`4)?Qj{8q*|XZGcuds@#8r# z{gBq^1`I&7tKaKYUpf}0n=JvA>z(zRD~6Bs|8Of@aYPQ(!i0*uwN zWoBEBY%{RPj%DD+WS-uJxHuKc(Fl$idTa=kj-E;NCk*!rIGQx1TkItnOb67ok)pNs zww1_*Cf`r=ZNv^3P}pa@-Em*pffTyQEc*Id=+{zkFGllpMcDeSqh>96d04u6OZnNR z&&Udxu9W;FV^-!$XK1!auOZQ;s3N^+%jtCIG;->;Z#7ZJDtuGlsfo)dPj9)XbJnQov&(H=E34~~i ztS?gJqehcarW2YJ@=KTD*~b@V_;wS?9}j2@HGJ1&8}&$=ujY{{&vsvam=Xki;jh88`z@o0nizR{vRR1WjiY<-qHH>3;dE}@%g zC0(wYI7|b8BP|X#Y}WwvzFSk z0ivhqQ|O?f{fXY6ylIO2%JHRJ2kDJ~!BuFNlr-*Bv|}h~R=s5m^2-ogs1Op0x7v#M z{WvN>0UTQYYm^@3dEO{e(`wBK9AidavWjEie~UB_M8KiR@Eb*eo{(ajOX@Ssd%uW9 zZ2sk$deo=B_oPb@IpHDi^U=ePmoH?I!E*Qr>~+|Nw1*JQAUVR=56cYZn5*iHB|rL0 zJX;MDiL-|YVbaNz7jz&?m6S$8DIdhw>~!0}RXKwzuPEAvN{TUJWzXkRN7x^I{ zlvyDnJ~b?kuNQHymNUd!Ros}V(=uODk|Ih@Dq^u~K9dN*^WUsT-nsRH!bJ-uYOPYd(HztqJ6>(V8_KDp?J6L{b|&F1(&FvXFnr7>~gV@!JU2qRd_G-Z&c3XDU-F zBw&q?b{gjbjs@n*+sVcy7OGW*PtThPH}}%!ohou9-IzSQ%@j2^`y|Soq-A39dU7lD zQv{4ToGB7RytJzbyRg}@z4u)iDs23rbFcX>^3~|cL`mRPnpB`d2|eSd>%Up49o{91 zuJvm#$sJy+S;D0-F(%#}FFFYT2!aJ4${t`I<-&HE3SxWm?me}X7ZAl{p~P3P@75GB zMNy4${d!Q+QZV3N^jm(I0-OYsk&oUGdlX3#RW6OlB3(}Jw~BKd9DbF?&Qhi|hA7GW zx`3Cx{xCF6I;H1nS{IevZ zO{Zx6EXAzNximB={1-d&r;epUOz_ zGyrP_uP^wB8NGI4L1BdI#@bbBc^~wC}B-`^E8dZ2!ezh>>qKn z@gpu*$f#QyiG>(fh4*K+JpCR_@K#A$uzSmNN&S#|?n4g(q(MBgZzosGMnax6=;b3* z(A&EVaH=SLeA#I|DlF&@Xon{4S@0!?QnuRh;TlJ!u-Ua3B-D|$nbhk zUB3jfL?;7B9Bbx&a9$>yq;}!WA%NSgnNu=iFgvyaK=W+fsI4xLky!YRCp7{qFrv3B zl8w#Q(kzx_bRvOCJAV}c69>tKY2*Xrj&{{mKc{=?8}Yqsn3v4KXjq=!*v6_NHE=pE zu6S=GwtBP+kkH@41yr=9Zl#rfw5wrbzxt4%>jFPaK%(Qi;{hf6FocMlp>#aVXM^YH zm_q$kJ>0AqB?i?_euI?6+b6}gw=98QX+4M1B+zPVFOiDxf+o$D$6Vn#bX8i&n+bLS z*Id_r2$QYMx!6TMJnB`Fm zfbOP3!L|6VyMc0<0z#oj+Oc>M8VAfv?#pZ>+vP>r>U;s9 zTa#$AxRW27PB-E8yVS+&-}|~8Qcj$?9pb}&bb9*!Z9dg(-*H--X@=$2M{s zUI=s}4iajzd8XitS}gM|rb#yFsUa97b)O@~A1w|HQ2Hr@19pm-MXW?HQ5@hvCZA zs1csA$~PPA99^+~vW4cySq>qx{Yz><9V%c;YUxhYU;!J9etph+0XO@|X-VMLmsv4ak-^N}e2oR#Fl53Gpx97I{~myjR47aqg@TPHBWuPx*n&Z? z+$d-2Dm`O85?n8~-hYmoN>!SfTt>4^3O?TryoT*b-%+m|0Yt42#i%ikXMOHEt z0IXBj-`&P;drOTWl%Fp)OnkI+{yiHw`L3{JFd_}1SE$0OoR|$-y)MsI$NP~)h<;_S z1sA4{`9e4TxwSJ%c0o3W5wXX1+bS?-2sS%OJbWD(R%JriH2_49Q$e9c-*k8S#Ot^r zl>P%fu)l3?{2L;o_jme6v-8VPcSLoM~+0| z5M#BeX$D!!sC2duw(;JyN?<0O*ws{>aOOMFH2uxpF*O<=P_J@^wMiCURgqO-bwEaq zjRi(F7rU4r^Ay4tc!j{#6rJ$QLh~@Zd_!m`EA!#?Oo=G1k_9P>aafBic?}ucLiJdx z+X6Wt8nr=$fUPsd3U@!f0gE)J7K=w#adaxFEZ-KVhuI{T2^hp|KPvn&iuGP!=zD(z z)hv)P_3P7Yw+w%&^i;p))h~Y}cHQ#i@I<5@Xp4C3P%FZI(6D*N0U4K55U6*J>ySOl zN@&#qfPO&Y3>FoF7=!dX)=yVUCFf~!P2n{<*7FbP4HdIHZ=aJdZEZq~{5W7w^$Dh{ z0`*b>#c`*c56;Uc8PGYBU?F!y!%BaTW#WV~6I0#7!&Yk*_;Wzp}9|1wXo_D$1alpXPS z=&Z|Ti`h;4>uKv2K0Mv;-IzHd0gQB|6!ulJw~uNcOrvsLGND1Sr8z34Acl0eT?r^Q z@dI}Ai|?m&9A^Fj=jF1%R8Ui{H*Zh}p9A~-9s_6-;HJ1gcvP6shV{TmNIvvtU^Py( z1cHuI=GCu;vT(}q1U{MV5<`$k&SJbUl?^+~?La)hPu}^X1-Z%`+EI9S5X>F>1b3+a z!wgeoYFa?@MiG2?CkGhmS4tZ|N)6N}0_QRteT*X{U`rrNHSN#n!-p8x*OA8TahgYI z66&UYT_a#q-_j8T&KDKO8F^_Q0D|&G%QN{`<&$@OAA0Gr?qM+I|N)4msBVHT?aex9WRX$oWhaM z=y#h80$Mkhoo{Xnn%j*9tI&fBhD3aum;2l}MNqG6^_VdoQ2rWACm@x?C6biyle72sFlxjz#lfLMY0FgUJmVFE<#$u+SqZj7~M}GUf#Xu$14r& z_^!^+91C0$U;Q_{<3d>u3_7s;j=nksnr-;ttcTOA37s6m)DRsvC02odO_zo|fA<3# z=P4g^UGPd?hj!K+{%H#f$>g%|*8V}poCz_u(GUVsDMg(}r*v-DpI)~$_GTEH_MA%1 zZP4Nhv8KwvoC;)YnNftaK;dN>C3B_Q)t>cGSXuFYY)S{~Ow90-k0NcmiJZ3;Ee)&?Vd24_Hg`vNt4FQTEfn=+4P04r3RvWjrh4d7= z6P;x6JFwJU!??}h*)AcNYhpcuK5z}5Q2Rtvk_^uXraumWIR@1;J|<&M!7PFk;)<2N zMn~Q{Z{7eM7b_J2VV}A(yiMW=&%ckRq~c0l3o~n=(imFVb4i=l*_JMuI}N%WXek_I zo=ry^-=j}No2gOhx#j-~Frrrbda>rqO#d-3=oWtrXuFo>0L*wDe>uee~p z`whWl!61Z0d4U@(^Dy0nlE9g6%Dz>HEe)iB#hAvbxgbSu1>y($0qdz5+-a8)` zuC4@XH-V*{9N$7oi!oee;fr7elB3cCY2MOIO|xf#ri^1wdwQ&wrc30!C4one=Zd{= z`T8++K0Cj|?w4O>itC@7;Ek@BdV6uDm@aEeu4SmDL-FQagEo&$N>lrKB;iJXMl^3H z-%M$ECP$~$gtjZ+(TX_>69nGNl&S0+D1Z$L^fwzvY&OtVbh1llnOujN;Baw$TRZOM zLZ-Py!c4tzC1y77lKQh-t~d{8cj9XqH`|{XfW+AF#yjK~)#18KSEDKP(Pk$Qnor|W zbq69HGIvI<1_k!svjX}6jJsp$^=5_Q%1ZW_&sYVKErhBIaj`7d<;%r>%WQ=aex(y{ zmZ3v;;an1^Q_eao5t*^EUd)LDL zn}Gj+lfbDk61eb7yhPi(+d?Wu47AhUm>!-JT%+IZqhxG2bVzu$6}=1bM=7KCcEQh) zfDu%k(MwtIDI(Xa&w}jNjF62932gYC-pI7P@YM;isCp=Dr0iYEffF3t+zA0Au8I(4 zztjpE#A}#*H3Z-^$XMP2seM*pA?%+u6#^1i{m=~XYni28@G+i|<;?tb;$u$UpwpnC zZIO8&xY-PGYNLS3x|cOcwvL(9IH7|K%*e%rR1vXk_)><>>Y6H=0iV#`5RlQ?4!i|@ znPtYrtE3xdpN?S>Kb z9ZZe&B+A)AIr6;XM|UN)Ge9O289-#vCKI3PuY|^8jX?esE`kH&k7O#8dp>UZ70hB5 z9hXVCvN$YTRYIq|z;bH-@Ex(%-Z%6CrIb_s@2(lei^%>6ic@DGHkjHWW6_pW8)X~i zl|@&Elhepn+sbTAvjicI@JFDa4J?V4<=ur2-<2F)-dT7g6`7yH9Az9xh35C@9Kz*_}b* zm^SfI_~OZ4O)#h2(g)|vkC+KwAm~PYJ$(Hvmd>KopB%!oSGh)62J;wcLbJiC%9WI$ z3kD8g85e`QHE3?6R5^NYnOb%ZgZL9l ztM8!Ib2Q{0%<&@=6H`f-bky{FD87^ipGk9{Rvmg>D{T6m?o~|}?&jt#<~(E$cwvSe zN-y)n?8d3BYf6#$S@9?Xg!yE^*4X@Ar|RxT-8U4pbsbe)y(UNM$A|S7tUPF&+?TH1 zqoRWw4ymDWnehn`P_F@&+o#PFRj?$OAhImLRzW}2ZUu*Cmdj)?f|Do9N+@Z;r$VTj z!S%{@(PMPC#MA4`j-v7X-oA0+V4B_gMiS%m;1mv#EO%>idAHI)x0eM|Iz?FiL53EO zSR2;^=9~mi3Q>CK_X{QCx!h024`)X-mS&~%PGk4ZC{u;m9O2_+XfVZw;MWc|YT)^W zIr(>|mlPUjJPe`u%cyvCF&#?5PW4@_T9ub4(s@cV3Ny5*^TpaiJ*rRiHjb0|QV`Fq zaiZCG>%KSox9aoy@)8ZQJrkAlUw6Em{s|)GQ{s5CM5f&aQ*wvE!ND7S?Y}ctzzJ;K zhGDtTCJu@7+k*a%E%R<}jFBLDm;pHs4pCx{N}nmQI4Hly5>%E2mn zvpkwI>XH^BWhfXHo^BWhCnz#?KEnBRvPcI8`d-EY>44S!G!crbzOapmQlzwBYBXDE ziUv%2=f+G6sZ1888xk;?XAF|-Nzdn+M{khwPWhxcnEU0&qMOvml##-Ofalk%+vw$z zYi|%<#L)I$%33R4_a|o|l%RNNM!EU=$PFdm_`k>$S}fLm_O;@cE~p0ok`H~2z7y_6 z0}P9=;<2el9~=9EJX&)k@^Q{tLx|%;O4rwfN;f9A16b7K@KuZElSr-k&+`_#G?fJ2 zGM!fkKoHiV;GuWv866-Rp$&EwmM)71H8ijc)oO|?Hd>e%4GvX;DkMLqz)>_~;cT`X zWs`+c>q~=mOtK$tlf;bYEMA5TlkrVx39(=h+b`LI2%>9pmKfDoNlm?E6XaC1kR*2v zlWSuO5`INGo!;|peu^b~+{{TD8hGx&VrW^U>a8PAh)yN}37JT-cgTj10wM78L>`YU z25hwy_%c<<37vaWV6ufuA3|ySBOD*?p7PDO5hi^1V-OUUYXG0c1zP%HJrMbuiTixx zHjdo*1sghwg*6@3fPz_7`&5n9DhnQT-ZOmRM}QtQ8-RLw#b?!-f(2i-eRDs;T9z;* ziTlaHAKx z_Ip{_sJY=6f4_9SS^>0TzJt1lLeMNveX%(Qu2!WJvu;< z8W26p5ONMXcv1no*kH$E5-}kl%nwY1Ckq%Ua5JuOF4y}UM^xEYXKfkkwYmH1ydnWi zF*h{j@?NQ+_HJ;;apU05v0gt4fXgoPAzYrGhi9}hhn!FeNFawzZfAG&Wes0EZF|9Q zF;>6Ho!ZrP6;+JJ;%)yp8plT;>d`z-n0hq}U)fNcJT6bdzIfEbr!y5O7ey99bG0aL zXp2zVycJTT4#{=&vm;VRkR*}E&*wtI6{{nZzzySl9{va(Ul*VDa*7W$+!he|AD+cL!AszH<%v6q0x%^A%hM^t zVf@N3l!v~S3iy}ZKpZ#;Mj!=htzjDvlrNVDWJ!_DDEa_Yu6%>E_4?3*nGtd5PpLR# zm9N-*)JLklc+5E|W=wfAzq?rye&0j4CL1S{6fRfI%VF zbNHlJNuiP++zfnlAKwjD?&k3_ysHa~unM3G+$zuEQ#3t#T_$|(`_leyP@D&WR$~h9 z6NW;FuE!R%CCVD959iKg&{X=|``t2?AXbH;!w=-aEGXLa)aQt-^MX`43>Y{#-@n-A z;4esv{L0Ir%SSRdDH#1CsL-VDeb&`5t8v00egrmWV6|X_X=_Af`$iDnl;NEI3+Tzj zHy3A?lKO%4a6x#fiWJ`MdySfl+i?G548o6Oz`Fv99z5HHg}1a`K$SeK2ZDrkj#PYz z(UviZxN+N+%Ry4#@EDKcFX?8~AEX+U1A(` zzGYMcS)C(cXDvl5CZ#j^xzxJGl0~aK?>)^Wij-z2vkGsoog16Ys}=&`rlqsXzJC;T zyG=;+#qlKiOodCN>Rq zx-LaQ!$<`rk3r|UK)U7Y4j7`1hT;gm5cO&Z+}@*fL{WrIB8e=R^imk^GsuQ8`?dl_h~ZL_xw5-0-Q8T9e3kOErjd~ekKn}VWNTD@i`dXQK( zKP;Jm04rD6=$EpZfg15Au3^yN&Jefl@ZeBEaW$QuZ7Tsk#TxN*^G4prZL{gdj!KLB zkG1h!<&mlL9x##Y)+mqjj>&SL&u`PpPN?P z7Y_%Q?7<`()9ZPgvy_p99Yyivtj_9NJap(2Uj=p_kk)f*s)DU_nw`ysKdbI|IP~xQ z8_fAE`MaDEsG0=$^jc!;7GACzPCeL28e{Slkt2zotT&#&p@w5iesrS8>5TZmgpwkG zCsCDpaKZ;n6g)9Iz`sOp683$O6>^icQ$7uSB-z zOterK@y*d94^}&w5_nQNn4yQDdz|8Lw-mQ7*APnMN>!(CqKX#%XevIkjN=WnA!1*0 z4{FioblQCxO*d@Frta;J;W9x$2!y*3V%GF7rufZneXg#_5-_p)Y_6`F(1@hS0?UXo zr}CjPP<6>arqU6NiiDX|ndbl%xNjC*o|vMuS!v#kM`kov1~{chWY-QCSS0?+dDix? zXHAHs30a|MQd527$|Ctay5x!+5AMy!3Xn0Clb|6pQmGo9CQB9aFySm5(groR>atNY z;8eH@?vL>6X?&{pYSZO@`R-iKDp&Znj9emSxWjpmLDKsEt%#Ac)G5S>#K9uN*eRUB zCC|hx*@sHl7uzOR)hYhMu_OT_Vv? z=j1izvx%6IyCu00zG$0gFg@hlIdrmt1~T0io+%+=ZZEf%g3F%2-VJEKPqc#XU94(< z`xdbFR}dN&PijdAn88}?SLlW_c7mD6-H91DEq#qrk?-DTC7aoV=McLE37L`d+dqnN zN=&2xa5RkI@1_gfdRiLl+0^6e(f0ZMjw_c+IIlCA`-qVAB6p@{ zb$d__sYa`fWw!f7XYziH{~SMTqXG@ug^ZC$agSutNTaK#I899i{lOc%0tX~TlJU;Q zdwzQ?|mJ^HKfs#dR^2`Jr>B_lD zEc+1^K9E*6_xAPmP;tlSQ!(=;@Rxlmfk1n-b=|_k?JEDLpNrvV?aY3=`zOz0hxc}E z-x|Sg+G?42bDeLscn2YBX7XM`38(3N{;=)-GmLi_p%Fxr^@AOJV_-P&YZ3JOyn3(3)4W5#r)zPw~er#GV%-5>?8Y0~G)xjS(6tWynKhA=glUv=DJ-zz5HAY9@ z3M|W$q2|t105#GlirKoU|-~^ z_HqZYR>u106zj>KD}d$@H?=9e&N?0}3WNVl5&17pl09ssT<=2U!lDn|-wMziaWQ^} zlSz?KWJ(O)CKCDYosB*qkx@ySoPR#v+9E!P7&!k%=?Ru%CnVtIvhwW2k0M8rpfH8& zMF@R*EV}l4xTZTESm8wBg8TIt;1rIK6z@{7e#wXDdzspW;M*wq$Fz+>(2}Nn@P1oL zb$tH3;th%%Qj!0xC{grBCJN2UY|IJP`GIIc&Vl2oo8*9cRDEOl-)WNf%|-8z*!?ug zE#Jrs@C9D?DZh3`exRMj)D}y{BC+*2#%DTC@eW&IBp~2KrXP+kynemL&)T&Lwo$QK zsa*`}TNxVlu@ItHher<632%SWDw|&ITj14EPLtO0FUOG&OX(KIzzbTT#KCGw_=bQ? zkNP9s*9Jlt=n@-Rqox`zS^VyPlXCV(7P1}!l1Dv&N3-XVqs^aXjDvMOmMCTm^<`@3 zhKff~JrRF3EsOL+y#j1=bpec2XJeKFG-j_?igW5Tw!HGRuGN;o3k3l@*Cs=>>xc91 z{zZC>t?s&Tx%a4(!OKd!Z zSj%pV{0lN5sp|mNG~?cZ`5#pmqd+W^Ff!o;B5k{;vimX-mD?GWgU!bv!JZ{_ngX3W zNBcb&v}NucJaoZ`Oqdo)(M;Cv1b{xTc$N6;FwrVdle4g-fu9;^X4-JIDIP&6{&_B) zqTA3?pcP`osv2yW9QD#G z%#F{*T#NcQ*66m}k*J&-YINpAyF_O4o8Q^eBrL0(g$c^U2Qvz25UFo-pPn4dyBP|% zLn6`;w;FM$YV}_q@G-)~Wz42-#|bX|5rkr-Q#5#SD>7SDe;#81=+H>7_+4((d=5~r z!fL0jBB%t2ii)nJXJkc{sNPEG2rxoRnX;SS6VJ zeaUID``CTtsUJI<}3GrXF-t|H==ePb#RL6fvJ-gVDa)E`&sHi}ETS^-LuLO*l zbf^*sFn!qkqjntj@ekUvfKYfXFd@MJ(z_hF7TO*d4J6vj^|5D=?N z0?ui6uFD9gxScj0IbU)bmEZ#qUX@^aSy)k;(5ond zwX5#Dk_|K`?HoioPaXAzVwS(z0g`V;Q~&;01FeN z%sQ+;B;K&_`aJvo+*ETuR^ZbJXFJ0J7QbF=^1W=pS>C==L=?IZiE24_YqaqhLrvBI zY0kW6ze&)!Xx@rTF?RQ74e@YG&T* z8amubg+XGtdr&_OyVI9-F09z1F{85GE9Kyjm4UuT1W>J`K_gpK%46Gyje~-Mot$IF z!uwI;8mpN(-Eg2N+t|Zd)HH;u)HSMIi<}{slcwni-+eSWqKFP5dnjn6s02Hx2XnDEPDoB=qn?5j;Z6}8{#E`)_!V9W2gPQ^OZq1}XKjKhmD47eo z@vnu&aUrssf>o`>2(Hp>=w-B?)qxCSsL1yqZv7{idY9iYX6*{4yDES zSkvc7cwB)%;L0$BL^{O!YwiIX8^bG2%pYjU<|lj(J#E!y2N&PP7k#ADd{IpGH|%?Fo#d>mt2*vks$Vrj z%+p4_hE|TN{>}eMj4GN6K<_#;ji$6ig1ZguM=V692{}3w^IJJH^7pImL-~bfopu`` zAqZdBH_ja?YY5pI_SS_e3hTYpwm7VtX=%=Jf+yoBAhY7bq?Pm{Xr`*}%3;32gons1ZS)weNBH~iUw zsGoAfWp8z=q~tKBxhiU#f>Hv(0iwZFM&s{nM?{O!ns{=}$WnpPrT}nEJ&Gp=LATsS zLfL2m`agpy-1iO!z@4gxmP`}b|b)AaHB}{y`kd;ZEd1`7`?Vs!TixB&x zs!6Jr5xmDVsx0WanmbsMyO91*f$0oX^<+u~g6<;RrgfL{>dO=JI(DcZ*Jm8xh4IYo z%t7kIn=ZUAX9e&nxmgIrC?{xjQ`TN-a_Rf2KK)X*JppdBezU_L4~m@w1y+8E7Kl*= zYM^xCL#C+(u@3wfX>o${pNE1Sz2|j$CoL4`udIA6@Yz(U-@X_z8_8=)cx@5X;k2%K ziIHhwrS|j0(#T_#y5v-IU`Mz{#Aeo#W@+)PI=5kv@{Y<=+Vy4+o~{1KL#nj>R`;AORPMZJo(a};mpUGhj(9ybgK~g zI-u%nWMV!aambY>)%59*$l|u5{>f4oX zIsMw$6EAbHiR=Gv&Fx)6yRhksrqhKXhw1SQ0|obO?0B%+s=h^X>O1(U7B(p?CnKW?j@l z(^!B$`Ib^anK=5^k6hRe$l0gup@ANo(E(`-0}XGM7HN4%An+3byvc#h=Sgud!u^75 zj?0`5^PBXwqIXQmVrhv|G;~a_*~>pKa@7f191X8I8QQ26Bhn#CeX)u5onyCS40OK# zhNE)V7YeZ`Cy3Z8oI(jkhES6Y(#!aY24lKOFBB{Oz>bN&?+$R{e6_>bSv`nwVBV9{ zS<`f|fBfNmU6rvZ8lw=CQ*oaB;iq@`#E!ZdPMXHnEu3Wp8Vp{Q6$m+sZCEhe?o?}^ zm96G7v1@`=#d0oE1z;9aSmW-H-~~N0Ty`Vey`{0VPIhvNsoHvWZ}G;jz3v$a#Yls!Az`Z)#J zB_^4Wu8Ed%7E;*$PIB`dJ705ki*TI+qL6}CT13pr%bx?WFJ5h!+2n&q4}Vj&XLvv) z1o!faDS;o5etw}0s+8Osl#InxM{y%Y;{>g0WAFRDSLHtCl$PEVHY00xsZQbf7!dp7 ziY58MLhJ)DHb*T%-?NQ7KKQtG%u zzm@-RwN>CS`9thQ)RRzJIMF{scO8-L@dHb9t>9N~5o0X<0KhI8iE-5NQzKmqxIoHk zC&RT(rz#WpA9;5&91zL}=cmesvt7QqA#NuD`m( zAQwL3(;{{4T4l@PgQ`Pr;wKuM5cuu`S9rmwQ!rG-OBppK3Ww1u?$3ZGBU<(~Q^xBl zgiPc+NwMFi+5_H6OVcmY6c?6mrScTjVX(+SoN`**S?1WP<%l;L+46TX1@@z^aAFGmpSSTxwM+ZO;|EHXL%Y_v8t5 z(j!SD$ws*-{x@1wP78UyR_L`TJz?I$F#c=>Cv$1$%ZIlbvsPnQ5Tc#1ukRrKq`hoO zg7wqhHqVfTS>X_*zS@L_b9KAbrTvZ0!u7Cr2c{8O&5J3e!gIuo*w>vYOkut{iVQ!qo>=L{XI@P9j~D zKN$FaG8+vt_5P5089lr1tB+rIt)a^Kg#$7(QdeJMe;~0e@CtJEnEWl!^8;#`?lsTN zp}!o@@9|*CABnt3hoKUoM{;Y?o}7xj{L_}XC+a3Sa&W(lIvnf$SoToIAv?ZCoX+@bJNoaO8ni=tYwtxC=cd@Bmk&0To|WT87l+B88m?=xkGD^X#peo4V-hAG z3lQp9_NzD}FHjSWiEri{u2sa!PAVUO^Ca@tbmhjA9&x@2N%qOXWsLes61J=6myA!> zw4S&R)=i~L@w=fp@3951-*bkbSWjD|e^Dl^Ab-f;uI zfQGc!8}#QHQg{;tGZ%EXeIJW{W>!=PIyAD?iN_gVR8Z7bCW{a7pExgYc(eW)xx(l} zBWx>5C9)0__c8D}s#m3b=rKtgO)u#ebvw|!J#Ibe8BS$o^Pr(K^1nGD!W_qT|DUvd zpdmSeGT7=voqt4$|JUz+r~^9j4}H5$V#ky($55e%CYVok`TD^QYVRg`bP`wqQ?%*p%2Pd@#5zF9i4 zKPerULSaACCZ(h*bp)uH4!=5*eDTWT+RM8XcRy@DC{d1^r@h{eYl&r2xGqqR$PtGs z&-FcMFr7IACW-AjCi0emfaY$uBiX_-Z{pnl-}Ob-o0& zq@g^mjvouZC1uZR0EF$dA{xG$a@zed)MmVKk^#sMcUM7#;UiVY~f5A&V zs&IV+o>a%7k)R|2Xwe|*2Q;=8guIIT`m1Xl+k!|k9BUupd$x>{^uKz2z2r? zkHoFG(Y*b3(am!1Lh)r(&$4~ajq$nt=xWPugyJSRZl-dl+|8UCZOUou~l*_>GtOolfnqh_nyeXVUw%Zcs?I z_1q7~y~%>f*;%tZ!?GtPY+;FZ*Iogkf0~jV4Po!tnJA;yKjuEZzBS+qsl8A1FR}D` zQ_s1OzA6&wc|#QY9-l695xY;V_y1^yNlg1ehkmT zcHk?`=jnflvdIGdWV<5M{6gnh{mEd91N}{%^yExr+RCP%A)5tK!AXPvH!JGQ_VY>% zQUUhaacfk}DiCUL_~&re-SLp``mMcdCmq;CS-J=zF}Q4KNjhG5!OJ(1W2=tPi}vTT zEa5yr_5;#Bb^8>hwnTBI7auI{#nA)l_POhZAGFt`@V&VB*SlY;b*BP!A7MK%L676? zM6So9f8zw-2g0m5@d$1X7m0sO5XT!Huy30oaY$GsonP82f}!d7^Xo#jDxbYAaj8hO z;#StWUkP8fgcA+;aGA1Fzl{JE9IG=bL2*<4uzB?#?JAmsdfIFvRLA?ZV@8(d74IX@ z?!F$rxb^>7Ie$Az!d}vVF<~3F(BIdU-D4Mvd+R8xcluimzDxbI{x>mP99_8&f2;+iDPVElKB5x49Ah`mX$fysb}`2MV|(Y}Y> zVnrnoHUkf|b?R=3-YaOaOALPxwLCX+3_pXTr{d45c|H!9KlwjE>ggv z>Y3d?FRb|Y!t_gP5YxYJusQD)EaN!7xpD8GFzo+hy#Je~+${C?Yy3W}5S4><_@sxS zzr7=4w^+7Zd)foH$gyBb{Fr}S!M_dnf4r^#Qt;-l;}93-;UA%eTh00k{(Bvt#~iC6 zv7NQTe<(-(`$=@`6)gM1Ik>LzoS6D=b2Gw}oGYfsZ2w{@{;%23lJ+orF};AG4{JOw zY`xp^)C??74;;-^X<05>GAfUA$Y$$ zyYu0fFez!k_0>~!S{CKu|E+&^BEUNI-J6ev*c;uj#@_Be^(OYg&t_v`LA$%V+t?Y) zy52iDsNG#dXv8iWdoH70{fwHf*(D(|n7zjM6LImuy#2`kIt;Gczxq4siCzNR{&kYp z%**qSUF>XqIGM)eI0a*?0BCgvhdeH|5*_^YP@ zO8>Su34UK&vX5WeyeKdN53}dVw5ts#+{G=C%j~)ilf40O3j)WKjdvIAxJLRYVqGho zWSsdg^Inigu{SP61EVQ>^^(e`5El!DJj5XdN1Hs*6L-9qy5Vx$jF(hGVZpRa&Ch`D zl;5U~R;dU%#wlwvhW~xI{lC9;%>?~TPv^1go&?#oKi#<`#?(_TJhnl z-3qsr_3mBJZC>Uj!3$mrs=ybqi;sAhjjA?s`9kg3YV(>k`D_nnheYcayW%)FemHFV zHU+;)6MEVtoMJ94>Ec_iKSr&1B|tqWX}v`SN+Haf%BJK4O7T1p(9s1g^czpWT|kh0 zg^dWV;O}|0!)$XVjEOK;n*iL>CWb$m6paJBhA4RAX#cl~r-|~frJ{fUk@oD);CBgP z>CIA2`u!K$S61F4k5bU%3aoXA0K)ITcFng^9ccfV1#o*gAOhNFOtPXL{740i4@5?$ zR)JU|76-khGM^gDJ$9^XeLo?#@m|biXvPn*e!Z(L&^Z3$(e)4=R@F^+Te% zS=d@frkw)23^u}5k~Z!}2A&qb;dJLgpCNorjC#X7LoGXGlPegLeq#FBh=(3LnaSaT zl<{~?TPZKhD+E&>8k>zL$koii@UVHfO7!lv#<6>seAH&qoD0NDgTjEoChwG@(}sTm z>HoI}mMZ=4%pBN&tik=v0LxaHIK$cVL)@MK_-t=|hJH&GxXQ~iF){NasN&HpW)BCjRRygz53RD0*n&@oI2Cnh))>UKg zog9%>-`%ID=g&7TCo7rznd7U+_CDEm7_XAKgd#EUr1HsdR(4y*nVoR(P-&5ZK3zk^ zEl-v{)#v@F!_K=BvtgU}(p%G>F9ASX^W`_tFVI1V`rtF2z2m(>5LC+2%{AxurWaJ} zRc&!R3lu&fiE$9H56?N&8-6v+V{dz1C)WSuoBM%%_lDeGI#89002TQkAHVI9b9E+z z)aU}SV`zyNp!g!waEg29VnF0tS<&h)Q^Y5}mebSV21SYHr zIkY;x+|Ty7PU;Fe#1UtS4X@vJI)a~!52r~~`<~vU$2MB8UVz?=*nRz!idgymg6Qxu zvB>@I@=L4pb+%6{L2BB9>sVimY&58;Z3Rv7gWEA^-dGt{kK?2(aAL%YEpyINi%CCf zYeR#tV3uVBS?fw%Wx45{*$}>K;pOF}1wVYc4I8<8!C6avgNJZ{RH%~Rpd}udj+W$=M*~9WcVkv&W+zie|H-5S#z6!OYA^3BzpJbO)T2udGQ=O% zB@KaW#CKnK>POsWmEgjQ50CEOaM`)o<~Katyg3Q4g!6u7Mbcu>tCn0Z{#FXzCUh4* zm&I+Os!q1bw)!Bm4)mj;pFL05lz9Ed|2dTW$vx}p-~y-(v>sXWTqT6L4wdk2yG8z- zXYy!u+B6=)N=*oSKr|)8Dzn2_Q?qAH$aNFtIkWIO%X=Oo1o=<(dcrqc``4=hkMS0S zj}3_KLPN1D=xvg9jGm)-{p++UtGy;FYC@s&Ki8#CPC}>_BVl8Mm3d72_~PN3Ty&$V zc3nhLT)KKA6G1Pt-MNI#AF7lkSR^+Pbrke_`s`@==+fiiXfzNph^0@9zhIiM1H49P z=c#}L8jb=taF7X^>|&8Om;zrcu_;EsmX^m)0lz+dc}(3@w4Dy&))n-K8p~BP?~_H= z(k?r(55)id+UK}-w_gc*&|#dTFXrB{@r_*AN$r85)wx1RBBN`Qg%t3P?b{T~#~6`+ z8#aHN?D%IF4<-l5#?JAV^^>LxH4fVT1&taRVrM~T-EqHB4_l%v0+y(_v&@Gkh|rFPPE!D z8JEND_h3Tp?noVOt6mt#KyP@3Y+@1ETA!Z_$Zro=C%O@|b*_Rv?9Ny@mNC2l1gY_j zilxfSHFnFk%X!a-xojbpZ_%uVoRbxKrNv^z;w#{@w;5(!tYHRC;sc$mMvFy5c3)-T5YWG~4n-0@F1ANpe&9s-uc< z|2}-thnO?ge+^`BzOFx@VO|Uuoy}1pWZr-{|xt#-3sd{pFEpfx(H*s-1Dlc9Zr? zWqJH@CDmqdWJ|A_F8AgKvPbN)u8TyQ5cY`kx4YeZGFY`6i}1i!6~$Y7(T2W@S{+Ym zO1V4(WZ-Oga5mV^|2g^yRw_n(@_Z$J-6i-krbKaTO5HxTRCx&&)oC|!;^%;gR0oviAc6jAqG_0W7ByyAOO8CmHuNmKN~+Ib(EKgUWje(*M5R+#bR_4yjz zWKa85{dK_2(9#xA6bNVTlBa8TvE2?<%_Rk6OA2<_wUR$u{OrwTd=@%9$KqaJJbBv_ z=2gb)^+QuR=YT>R>UXVy!nS2A84z^9jfJ-0a3%=j?h}OTC;7a33fg zEDc*LJqJ4;E_p;cOf7dFBJ24-C?c@^Oq*T$X3ldoMntseGGYl`M>>b=K`*;W>wJ}Y zs&Al51=rh&4$u=8Zao6Jv>pk0_KUC6+`2vnMRWrN42l1HGC#6^3RVaA!wLJKh?DQgPj zw@SuzOj`CsGn8w06*~>%*ELi9<-*|g8$&$D$$|-uQ(!NcqSGEWCL zepZTxK4W6Q(*!F~x8c8mQM&#wC|+O;c(7G&rcF$(>0|u0_jfufIL$6q@Qe zvfjF)iC$W^Fp$$~)#lpG}2Z3Z})m z4ZMC1lK$$3Acr39_D4m@kr9$f8i%B(Jl~BqDa&oyL`ihx)wU{EmBj}|I|jSy2a)SV zpG|M+FK-U+Buvj}vYJgY14iC%Z!@Xhf_!tscR|utiHvdrUhnar){7t9?dmReOjTk2 z9+>5fpLam_5C5Qw@TJ4@dQpzSq_VwFW&8LlpMT|hi*NhE;?3hy=oD-krP{0L6U$3^ zk95$fBxE)$9ruMua*>q(Mb}@(HTnMU1MrB^r6|oP0R=`%k5)nu1f)hwOuDCv~$93A)k-S_?I|NT7O175tYojb4hQSYPsWp;l3Oj6TjSVvNP->Xmq z?js6EICAOz3%Kx%HN!dA)`C%Cdes`#vJ~yH|EGsdA{h&8{VRFs5oW0bl1H z8#BGq8gb?Kc6!PvLs4M#iNhO%vx?o)r2kT3GrVGY-474ELdq=Y&Awpt8(*=t$y}|L`#Gz>QVje>8Ocww6 z(OZfR`-@U|)(XLec|9d$x_Bo|nLCjS9R4neI?vV#831%Ue1C*>a88VyE>4b7yp_$OYA_u&r05S5n0m4% zo%KCo%y#`DLYm>>_xl{GpVn6go}ID|r|whd-ws%%(-JW^O#}(7SDQC(2OA8MPy;OS z4r8k=weO;94j#ZdKeeeAlt3OQYU5<<3g>ol3}Q-uH&7>Y5Yp})shS~7p>DAu>gqZ$ zq|K)kCw5(bybvb$b|)fHOU(3W4osOa7G>rud@^k;}ay{{d&&$5?*-QS)&cq8UxSXAp5l?2XULjN@ofJ(7ZJTSR4%enTl#A<1_M=dlc zfx9O-gk66V`+;H#j7uumV8YD2_noR?SA<1?5AZ?vJG)hTrc-h=6$c#G&4lVB+xKs$ z7)k~iwYS;M$O`~;-#p<+NQxu5c3RcVctDc4>lFk_r!hWzc8&!$Jq;x@yv=2@mhF%4 z1WD{Z1c4X*xMPxJ7G`r_?u=#$VVABTrMr7KOUWCZq=G8p1GOKFhpAD-$99oA3j`!v zO`@wTg(ve4d;4p+jrnHKmVBG(*K)Ws1Jq^Afe@7ip-l2)t|MLmsnoGcY>lx^xVxvv zr9O?uagei+)R)RoS9k(|v&Z;@_VSm*c(WoS1(`a8zy(OEqkcxRG0ViuB8uGr?osqS zEz=2+4U9=j76O4M>FFOyki;a-X6L`5gJth8CvSeA3ppGQbdq;QLh5pj*F$10Mcqr0o#_Jj z3x(VB@#a}4jH{QUnMckteta{WQd*o0=Vhk0G1dC`NMUvH7jqpiU5Dp$-XwgLDfsMV z?l(-?69>*7qdrx*;bcnaAe~jd?+sg z67%6SK8*I867dc%Vo{Kn=)5pg;f7#3V$KvGETk_QJ+z0VDgpzk}HYlWK>G^(%s;;1uc0F2JT zd!ehsdPdN7!rA^rjh}B(OyLYGUG?To!M>WLq!S)G3r^T{G!r9;-P%YcGV_5xT)%1R zUIi?^jB>|p6lAs@_>h0JM%ehISapG(VaF7XcQn`EjZ^w_XuhIwENtZ-0KnjApIb47oL}) zKrydTc(o1Fxloj9X@VA1N#6h5V-1a7@$2VMAHGp5OX5yONP7dtQX*WuLO zYqzu-CV`_P`3kZdiW*e4bJO1Ij40kfN?BSE$6CU#>uh=l^!7{=w;tk7@I+?f6{oJm zp)GZp(!Iw#`AMJG1T?gfS16}rb>v&4Sme&l1B~`9qbf68u^T074+McnN%jt5UKGRa z_0PuUVrwFOVK>Uwy9uq;6-2#0gV>B1t*?dv;{-G%i8CT*u)(ioQIQ@>=qBM zPbF&XW?C;aX8(=3jZcPMmfJ%C;JP5G-*%_u!^afZele<-B$3|3^3h2(3uW8Wsl(=a zxKKJQuxBV)f>QslQt042FC>9202+WpJM3b!$?spUoVIEXko}P^{l}UIYA=+``}k=q zPTrvU!K@Q?45+f#b^WQ^G((!c(0oDyb>B(Q>RQIJ<@@IG@BHfFyFuH=5?-TCPXe9= zgnSw+u#ta9ND-Cwyf=pZfXEEryb>VnFi52Rox;FA$`&VO)r1zi8M2e21bMMpDIa$C z%yr{wa<3~HB7kTxxwL24zq8}jGGpL_th?GI)Vw=ZE{9bA5To>NS)O)mjw+tVOn#ot zij-Ug1j>MfHNsuR;1s^HX{r55Gq`J=bbO4n?#%{=*W;dxUh>!DBG6w7X<~@4!J$9g zPl|sQtd|G?oDVJQeAgpSA#1Lr1aTRNrEk~2GKKp&P#Jb(+uQcrjpEWn5wB?YD4W)( zAJHLnQ0Z&2*fj3Bzh+OGv(cR^K~&hvo5@kfeMuCiC|0nO54vK&%7nEnUl2N=CyMIcEd;VZ4iY}81c5_d(@9tq}o390E-tAA->403|E9Q zJy;blLbp}wF*qiYHAu)_$aV| zs$r%rk&rNO|MKyx@#XdBBOw z?|5KA3ce(p&d)WAF2T(kusF>rh49-o1d0(m_GyyM@2}?zmHDMD)_pe|yI`)LLIrqcJsR6VuQ zC^?l+5YA)EMtp*~#_1gps|`5=@@l?euP3SaIOx4w3=r^pPWYG920@q`g!d4amQy2H zfW{B4iiq5YYdMn-CgZBDV#)851`sST!wumd`Di~mxp|7ueSfH&TIE}3r^QlAB}?A> z_vi)y&{5A!Clt^TS?JLGfM}2<%j9{Igby;mz<6T7YOdw`1$a8 z>b^J1ph|0|$KnM_-DNXf#$#N4Z`zKtD*taL$XkJo>U5RP1Vstit8xhqFSQ^ea98vhx4BFCCE_Jl-A^3u z?3l1Ia4b>SJj)5j%{bc?I38T8Q^@@C`B}o+$Z=5d?UqJju2qCFDx*sVC|5Hc6Hklu4X-8AeKPUlK5GS^U6^ zXanPzu`o?iVYEH&4|+^&H5Gtu2cP%b%r80&Ef z#SHP1JXRQ{?;66j`gU2$JzVR%(lyOj{<|^jV^|yDi!5==+1{I4>h1g^lKk86t~J=3 z)Y)r}#-Ozam{SY{Uorfp#fBK5_a(zP<#J{fnJzx#&|i_)t#p{O zlac`iZ}HZx54W{p=`2OSB}((cGsIaP+)`C^IC!2Vp6MH!nP)sF-bvm-tc>Iwuf<+mPcft6#cx9RIatUz{8bbX)c|mHbM% z9)1vx?b6!A9oonTXgt!Pe0;TCkTF-+tq_Q{9AHQW?ZbtxVj-Pxy?vH9?#&XcC$5Z^ zXnO6!|G4r1l5y3@NoG@fMSqn_xb9R%*&UsUgvFrB9mea$A9d9rjJc_7+o zu%MNhT#sSob)O1HYad!;A$>2`lFu+PlPmJaAQ@Qi9C7`bwJ7sv$G(HdQq05)f;4%XAl|zrDHTF99NK;YEDiNwf8Sj@G(|w3RBC#kP_r5d=si2X-wyMEkgpe+fN74| z&aYYtqOq8sVz+K-s#AfU!JQL(naMx?cruqP@ASbxSv{jX*wlZ|gcSbE8EnyR5n4WE zTBA#`!AWE#%$;ER;ye&5t9)6r171HP0ZCT9v|M`>XPyy7VKE7eA#A9=?uc2{ObEo4 zrkQ8^N(tIqGhG-OaVajBGgB0YS`N2lA8O5=BSYOqY}Q_-6?*+hm*}LN65?!qARo5e zw!AVSR_i&zqUcYw^v!!h4WBiDHM7X8Pb0{Bg;Te#PzhVTdvzL%!nWSI{D3w+^QbO6 z)QD5If)<-~jA~q*WLt}XZ{~v;$2{Rkm}1C7i-EO-pNP922Ujf3-Al!qR;gp ze)W0$`~|*3gs@|y16)KTQyR>dc?;^>oy+MU!&lAghYX6klC5IVJcO1USvHmWH=4zB z@Rr>jN>d{b=j+~e68W6Zd(Ky~2U2Z689tu2$(mMTLaa_6XY`0yKj4cr&Jrg5{h9cr zIIp&~?-fz8a;baIF|h4(1d)B$IAOKRhhnhSMcK4aqMizK0fqvQ4`-~9@vR->HEDJ< zo=9ZTXCJ$3wudmm?hg;=aRpKyOAmj@$i#Ix(S=bxuxh7CFpY_2%fi8F019v z97MN!t>eoC8>z$<8`~IRoeJ{0ENI2^f#~x4QI1AP_pZqatsE22z0r6oATNg{K<1Qm z9@_Q$`fDh$Lr&x-q@gFSUB5Qzh7S%gqgfY{!#+P&p`2$6PG^iHE~RdKOlCi~D*lS# zLq2blF}cdw9GxNC$BxjfL5hkZ>t$?|p^yUTDSXiF$6kj2 zeZ$^eR!q_DjQeC5{WsG_$UZihlv+3mg-hZ=2*^0A(?1$A6j2TH>5^x?*da~6xmk9q zu3dm|fv*l>RF$*1rjp}@;y)g#PggJ-BwGpCm8FW2Edr|I8ny1oJc&$CKwUcZ5( z>_nJxp(|1^!~~l)dI()@4?M|c>AsW}ssL207^MfHB@`A!Kt z+tw^>EXsDD;;S}t`}ie_<>nH?E2D=LSlML)43JL*VA0omSl&IfIQDI>Km7D2a$nBG zihe9#{t>Dd8E&erFEk;%L#Y9pc=c;SB~8BR15XSSHo@OmHjb7u+2xf-&^>5n)%NvP zeKy^4-xc+9xvW)8mwdnEL8A~eOx*kQJA#+_@~E4sAv1;;nOCN#PtZ!h8(&4V? znaJzj{Xm|Vp?yCJZg?oUU)j6pgz7|2dUN$uR#K#73^FvEs2rYQPg>6Pw?5i3Z?zUpdi4ABw#HQvSFH+gSy~*{hr| zcACw%MF|TIC%KJhpjncqM|v9L@y$5J1&Z80^cn8@`#biL>R4D~rTo9aIFH(OC87Y|zqscZMa?T}Ht=e8L_ulb1=-yQwj(*5|n zQ63VRx-U?ZEqNQYLR%`G$|-L-Xj0X_PQ(m`AH9sKY*p{=$L0IlkyZoeeT@8;=XL63 z1FBj+Dpz(DHFl2Rw{9BE)6P`DZ=Ts?nJ?+N&NFarH3D>D-Zg9D#yvR-A4?|+?Dj`0 z(B%H7{I0M*NNlimn=0ST=bD9T!N43y6m1M+qv;Sx!g51c^b`e^jp@bai7=U7T2O1H zY|LRme!p9GYJLzRvFDoBJV}tCtFN`49c%Gk%7roK?;)J^1!ESL%iv+g%jP&xMu#C@ zN|*30v4wmVF7zVWkPF2iA3m5R)57OQl ziC6b4tDB!b{$I)(1yN)zHWu`Wmmr?`KwH2IGAemEmLWfxt8~p9at%sm0s!umd{5bd z1<7B!NR>L4j(AaG2h$|Y=(9kH@jaQOxeuK?InM&3v7liEf1Llkn~$~~{wA!Tl65p7 zePX2Sr4+C0r;a}0C*8RlSsLP0S)X>rhdn6!u0uu~iB>g1gc5;3B;;|feQ9!2h2CDD zhrQ{Di{IhmQvn#@F@X#genC{}9hKNA9b!Gpa(Cn++d{e!PX47*xKj4opy(WNzW%4( zbKjd!d-s32QJsN-nMWMM*s(nUNp+4e8E{X<$pI7m+ck%m4L{ix+c*ug%{0B!lx<+D zpp-=~zN_pWHcO9ja5B&=yu9LdV)#pJ@SI-%mtWY#vr@YTmw8<5j-_8V6f<^$IeU8N z8yH_0Q|+?w5nbi2uBiNauGD;xHa5_B=Hcn3&Ujoj-gFVPL#{9O%;~{#d-m<^KL&!} z$#RttVi`M}>opN$cv$$Cm3-sj#>%QeDQ|#9$fw*E%v`kXjbo*5zxY1Q|HP1d_9ci; zK79~V{s^R_+`)Fhm}zz!G^8sT{jVF&mlvKC@hycg>w2M|7@m{gFM0b!CoRw)^Y?|3 z;ULq+6Gotu*{^5frJL2!;Z<#PY7x^r`J&T(sr5fqn*2g7j1pvt7k@;oww?SuCLc`L zAO)qvp_=%^E`NU$&hj`FMPb{(gOXCwpj&B>4Ao%v~lZk?uqc zCH_K8rqUH&!2oeE^vozLtzn{c8;M~Pi#uWN@tW570jK}p=PU#}*ZALyUFc&u{;)ip z($6am0_CU52j61XDz*4^MdZH@6Tkp!&XBWjQ8`pIzk5PNiJilxwZA|52by0Kbz@r+ zIwj@&^uG5V&m?0t(tA=xnG;ltPTIof#A`=F200fOLo1TW9)vN|PfH6{p!nUm3$Pg1M63-y;X8se zc)7FFUWOD@!NPv+8`QALtw5^fmDeZh*F&;uQKihL=T8WfOOia~Qe8DGG{_S~9%ah+ig5!I{EB=Ea%=7$Vh?pG1jc*`v&}MO+yRcCyN*Ney!A8T2d7j~TN;P<13X)5zHGI)Ps!{$o_x7-Im zoNM8m8eR@_4`&@6SM`9@A%o{|=_z^!hq%r+b04;zT0iWVWo)YtZ z%1Er%MBTSyPo~c=sfni1Rq}y|4HM3aq6W{t)jC4t%~V2PLTT6=-Nn00KB5En0m>>N z4rW60I`2o)4wth|GsO3&u56bOmXgC`G2|Bg8vnH21{sqF^J?}>lwQ|`3l&m~U(;}o zoXUA7FB?URmEDx(x9z;qms)x3tZ70iRfy&X#Rq1V>6_uM1NhSFP`$Y?%`A?%>5F6~ z(%(8Ju=pT*B(0XO&_puTqt;Jf=r%RlAIh?>(x3M<3h=gncp&XyI}b?%JAJ{>cI(UY zhZ3xdeUtZS)6B8ZbZPCK_5K8_U?1f8Gf#EB@MDi>sD2Crv|fZ6zcC`8kUPsf0%6;N zbu*b+X!00hty$-;CvB;OTJE0 zvVQq-HNen=NzB3YK+M|IUaE)zoFOtHV~F$PLHB>6@AsO_vZ^hqzgw|iv=MYB`Dyll zh(ypt1hVzMdC^mrs0tJdDkScX&nYlHK{&lntLCyCz(o#&_6i&MCKG1?PP8)}X5KAl zS9H?W8JYnc(geLuV3HJj|0NkCZC_veS4AAZ+8jNJ7#YsoIHGvM>UDd9 z4PK>SIUrcB1%Tms6m$;hjX2*?u9FJkSkQ{)is5IJ1yOTc)3I+ygPWrrW($nCeCBt; z8d;EmR3+CADlSfy2kuNaJwZuc*x(Z;Y$Oi_jOF~*UPZo#+fGe0Bju7Xa+)p_EbA_YLEk(lJ_$nhT2 z{Qzq{G4j#2r0pkJ#s%Rj)nvFd&3aehNRzCp2l!Ma$>1#ldi3pnW_`+RC(ZSicv7gv zo-kW8oas^75#pSA4~<#1xKJ4L!l^WIxI5QKc{M#7!i(u=ccdII4naDyq z#fJgPL2Q7BkuS%wB4agx8D?|=5nev=#4FZK>@i$jMB^5_8WNEf2pQ6BtJ+EY`Z;1G#jDTh zptQT47Adv2M5;-doVVR8-B$F??v;}XjnrGTLKeNOhR1!}L{?Q(81KNA2^|H6{0H|w zi|7~Y@U*h#sYB;pJ(tsma%K~=exan2Pf(N)K>&EX&EnlLgBKD2^}C`xmww;uHZ9j5 z-Eu!q_ek)U>Pi7Xs+RoWSSkD(?6d5Do2!uZlUA4v3OR^u6uXj*fPnbl*~s?u&82mm z^jeahJbOgzP`O{qJ@+>E>faLHWyILR+~2zQ{PuQbZ%>j_sXgys3fW>+y~>-MtFjXW zf%#gB4o%3{M+Eh}d5rGA4Qh(($5)EQwfep-H&lw=T|i6AWp>G1zVL~@LM?wZmd>FWRs{tw z4=t)G@It7Fh@S}t%C^$CYO=BL1z{uSzz^!@D3LwPizB0$Iv^K~VR^$D$l}LVye{;6 z!O<1t@;8cCL~C8ldAh&i`&J&v@g|VKwp@BzHO`M{SQd#WfEp>-?B$&1QeJN{4}#Z9 zt{>ksAgLT1i~xbpf#4t>c$F9|)sf!slGu=ENe(6i@uKnZIdMNf#^L~2&+z`3cl2nS>tT0 zP8SU*Im|qrlaO&81jYu#F-PtYHo`C=)uMYcQAFWb=rD=TG;`L{cyO+gW&!|kh32zM z-vP0~mAX6$N~)o#LCg$(^H=F6+-G2ZI4izEqm=3B&Hl9SJ`KQImOp^l0<}afy9hpa zOVyU_o#0J#aB4zMtq=aJCM4qYIGh6h6BMd3AMVZg&v-jN%`;wT&1Tn&nUfO;L#}~h zCjYl!EBo|t^%+JK3lQ%WE<2UUAYuLT%Q~mo5p9#_YfhjmPEqmxKr;#*hkXFb1XJH5kvr+j@v z$Df07_P5d-<>_`byrbifw|hZ!zqIO1j)689Y3W_Kh*LG?D-P2Z*BZcY@cF{>$S23( z$8IZISDUi^1w61grHJ;I(qb^pO@6todUOC)C>bT3jha-#|GHb|h~3;mxaMohK*4zk znIV^02V_XU{x`-MK#1IZE}t^bt<-NuM>=_QdfR($?wv7nhVF0LY)IJqK-Cp*IBFOj zv4ckNExt+NM+5=)x~D&f3ZISD=uY=_R*BeMy=ZXWs^44i?`pm{D??OlW&i7I!N1Vn zx4~g4`lhPX)uz(9X7? zSIFgWk@R1k4gZB+LzBS`2; zgJ%r+yq$L#4(D9rkO_UL_)`wIDc4`X=b1GaJG=86SC>i9TEO&vz+bvdLEG?l_eSen z-~u^~gqj7$%>%(uZvjfGzg}m^Xe{@bQ~(30#Hm1NkSjbKHIG3s-{pSkVR{6n6sbdw{Ypt}`TG?wWr3biR zWZIoWE4eU%3Y3i>YHYj*&BhP)r%xAm?Lq^7KViJJ{O_Cq=PS%@{59HgTv?HXgfkJj zA-tg_wZIue)u>GX6p2+LR=hdo)**Q%OWPjx0UYWYLa2(zaomt6t+jZ&cpZAQk@HA) z4C6FsU&b-r?NYw|%2eA^Z-XOPp<#C$NRJ@mU0Mk5&wVLhp9FV`#R;J>;|9L&$|4-~Icajb)D&M^Jc|GD8vMO{dpIE8?2z`s z9|IK86o#@?`(Q7e9Lmlrlx^3w(Xb%X@bNLm65N?71RQo*O)I8TuvirOYIoX2x*mgv zWOxIsp$h$wwz3pHY#xH$;FUXhg0%O)FRfnlZ1Hr;TNT*KVviy;L+)?LRIWKSJu?$A zK9`tK2SD=%mA#Stpv)Tk#`Ol5u1MNy-%FRYXA}hb%8LmbsH__8LiOq8j&MY-^oF3| zVi?G8>A^iqj+D5;b9=s3U2~i=fKiIm{wHU%Mm{iana;80Wp5fcV{L5BT6?}-HSur- z+A)TTZf&}|G2h5xXR@@$DyV2n%pIVfCgR{C<&u>wje7g>;eXK+8j;wj=dd>Ywysj^ zF+Q8IQnfK@+K;1==1Q?B4Y8i{uW*<8h;gDQ1oE0pa)i7yE#u za(vql;G7w2UQ_FE5l5GUulQ)#^hH+78UK5@xQ=@2UlU1WLx=!j7dRN+fasS7;hgLL|cZcia@A5R7jTVbO2U3&3>~`^qvCatUJwE@-h`qOBkVeDN z4_!3u|H=FF05iRYgO2bI<07SKr1`DO%x}y)zPE)f<5&o^$iAkZ)CPbmS?G^vO@EHD4{A{u^TolAlu(F^w>^ZwtxQXwW~ZT z1WPU=a(h0gJmg~LG;ML_)&*&Hx;BY^64fnhDrJ{c?0asUR*Jh_rAj5~%&ZBy)b&7I z$Ykn+$SGWEA`_^4BReA;)+_Qw^kU5>e(LBx?Fpdn!TKZ9r~(2XhtZ0;O~n+yGLI=S zt=}b_3{TH}`DWvA9eiAN(Y5^ra5snP!&47)?+T*+?1QwFQ#H}dd59T8eD6ZTk66ng zOCKo=Ty;w;9v|Dc*@b^_>lHezhiH)bp{pQf(&ZDsUQguIw^E4gtB`w45*9y&Pc6=@9Te!jRM;a+qX(daX(}4jB$R{DRvD@ zX*(9b7Mo_9?_x~3HwMhZWg-cu_WV z*Q3Ebp26u-KQy<`Muz7NK20lT(K-SBnbZB5)Bd0K7Rv9a7IC;^WFE)+fpYTyLJMt= zF(!Cb!8c|-tv$hogeL9dyt_Pu25Vay|#em1ppc#Luwpo^N zM}qCFmMLU-sWvf%i6iJ|@(WM+ZpHh=?OrY6pnzk4X4XX^x_BPP9d0$A@Tb2v9)Gp3 zAnDf$NOA&z4IgFNq2{w+wj3`Ws;;oopnpVRmmqGsMj6M;LQ~a^?18 zNHY+aW_9-Q-<`C@n{_OVJpjC?E8z7xINWO|bIh;!OQK$F_=~YnVlsIE8~FI_sn$0y zd293ZtY%nFc}_pX@n3jO}z_ZegLNieyT!L4z2ezCE* z&O*h;{SnjS3Ht({EVI)VO5`fB?5dwU1vb-3L32oBoAr0SKy^fYRzU> zziG!DwDw5O39uEP8`r}0tPY#qwj+H;Hb3pvm$m$CWF;l?{NuM*@7`+{R%Q`lI=vK& zr;h{MI)TvwkZO-Ht!3oovi*s52E*SdOigb${AVyBVMN=h2k})uTPWl?@na$q^33U9 zx|``sv0c6fIim#0T~HSLIXB2l+zJ9Qam6si(t}J*PM~wNo!=pIebKNv88ki|HLvQb zA6?e$wZJi~#XH}taGcaW_UF%^Iwi~~DI1ClmA*Y4!wfG(Yx&3ix7)(kGX7VMKH1oY z>KP61Y=DL(Ms}6fUe&Q_`DM(D1v<%+Ml5bLEWf){P&=4-ugRPBUryw+rqa_6)O>ah zZu8Dky+?mv`p6R3V;m}c_s4u>*`qP1W!ioJ|6Krl7t!(w_ZKnuWrNrmHD3Y_)w=E69@11 zCLMOE8QmPkho17JertRT{}LNvahc|X$(GDsFx|g5ygTgn@kaLSm)K}@6HvUGUQC+YENYLX!4x#q1)LBZ0jNHUzBw(-~ZC)JnYBA0p@VR zIt}%w_?^=Te0w97oDM-uh9p_<{=9`tug~sS3xnO)l%}tH3yD_P`5t>YyFEAc1gxOZ zyIqu_v8F=Fd}w}Hh=pCt!Y+sbzjBydA79>|IC(7h#5Erf0bbQL7am&X7|(1-bGeax zlvmG869vX!HTw^A_g6-Htu{w<VY-SU+-sVX{D4J@d8Cmb`ploWRr3^>Fp=)#;VVU4Q72VimKe*tYox3 z{kn4r4ILbc2X&@u{V&4(%k)eURj6F5y~E>6bgBPdHoI5V@d5eJ%m0kAbE^J}vGW7# zx;&VPTacaKgGcb$6!((~c-kJ6bSAK`8L-Hx9=}nUtUG!|jcvfs!LFQeg*Qd=M;{x; z@{Bi(N>&wP{A2dq540N~J?ZBAl=Xk{$(=;0P;~2*L7IZGM$8I`)-V7AU(Po|Gl7|y ztW!)ZRG26jh^x-4|loWzz5>?W}g&v|mb5yOCP(L~RD#_SpoBI)Eti$@8HLshFsFB4=>Ac?Z_B3#mGQqI?9&knTe@L)F z{f+2#UADhESe$y9owVQ@*>2<_m8m2ITq)rGadTRKNu;F96QB*S8gpGr3K)Mm`T0iD z=>MDpib}xm(f0GUyQS%p8|1=x#TQ| zY*&kD?Yi9E-V)>QtpAEn(LDISr5E)ys{HyE0PBv;Op*fz8&yS8O!%@&z^zQ9iz0H4 z=B3!aV!7W~nE5Ag)!4x)Cg!CG0Pu%P3w&1e5(B1iuBXNdzi<9$P%XeTyt^}GTLHf0 z5zkOee74-pmo*KXwq3JtJ&bN+QkFA$9G|sl{(~{QE|Bbt`gbIkmoo(iF(1K_ii%aJ zOu_}v5_;q(gLW|yV#?+6XN7OVDxzq^_8Aww6GjmuN^^hJyGdVEy!F*HP*m}~Rld)F zSr4bVccY(ob$F=H+%qM{FlRBqO^)z);*!nY@%vSZ{8j}D^fOohZrXv9oouBAPCRNc-}>x*4X4bfp} ztc|EUc!yxGH(X(#Zl?J0)YC-IF%~Z}tbyp>oz0(J!QJZX&FGByrb)>w`Yuslrz1ok zIW0l{Cw!$Z6Xrj!WJ zuT^89Th3vI|INjR*&3%Tg~AS^FUfaN%R5IUn%)1A1;%9b=7=C;*}$U&0ep)1YD-_6 zEC6BH1Ayv1Db7Rsy)t&3S6}sK%zRShuTNEHZ)g2eJhJ}Sjl_x*4hy<)2cQ%KlH^X+ zQKXS0f-fw@I7{);itg`xwgAj<lH6cBF2L0Uw(goHrGPR@E zIc-zP$&aTmv>dEDL0mkhnc$|;ufE_(L&>Ux(=zix@Lmufk-)^C6+0kI!Kd!H00lhn z@}$V*DSZZ6P-~NOsn2;JQ#4g!72&YhJb$)3ZLzT$_~N!Zr+evDXX=05Mm;)9srcZJ zadwSM!eL~hBG#nz3>{*B!+2;L=9co$d(GmxH)Ts*XPGu{-b9mDmF6JPMU!&<05Dqv z(FF-+j;mRpIvDxkgv0u#0tFKNXcJ9xSz7aB2D!y7A7;EHZ>}Ps^ylm~`4(^&P=9ET z1KtcGaUPa@*i(1|{v=@e)3UoyOh;ygAU+TaN2;_|!~)K7^RUVJHD3IimV4on;2Dn; z`EZkD?`?%rd@fJh!n5x0ChPwAL}Xl>P=xkm_k0UWv^HDiPJ{GD=N`99*I>$3S5bu5 zd|^h+<(Umn{F_B&7_QaStC zXjcHA55@+E6IvAbimi@4qmdgP^=$DTxo;58P<8#E3d`2>tX2+}%9|y8nX*sE6|5o8 z&ixG*G?@Hv$~&lBCXEuhX%`KQTuZC*xA~$&|KB|y|KGr44<=HE*^AMk6hBAm?45A7 z1s8PAYoh;iMPiFXe|J@rG2`Vc0}ki90|_bw$lPA5rM zwECTMzS%+d31Hx^F?w0}GW#gRgTJIuZwNtrHULZdCelWf-ErFAhw(`~Ntdd}D!E_NF%M zg9|BpRE;u-2mYkIn?97^5zbZ@9s{wiH7#McG(Z=nR=?`H2JV<9hQAH0GHr0YZn@C9 zG^m~t56!XF`N$hU|8=8_wGL}2uNFxXF_(b zzUeG2Jjp!77w=T_Pr8@X9~-G?Z4A$Y_0G+#)km4`knt@~=}*&}ztm@AS$E@2zK6~% zzI1g4(l>_+=P;H$x*RYL@-!;I?Y!IT zU8W4|0i}Q>rNRSQaOE8ik|b}SwR=>EBZDz%BYXlOShzT1at7}z%JO!TbXiQ3+&*U< zoESmZ5O;0P9^iSSna?L$c%v2zy$+5`cHG8GRfGFS-en6&_moS{thHm!fYfA4Gn>MU zxqe9dm{H{Pa{QT?iDZu`hSvG~w6pP+PVc%(n3!?X5%>7&_MT&I=KpqN)wWuLP50d; zzZxc5Y#{gUHZ3H}EtHtjRrqa~i$}#kop*k>`9A)S5ad;_O-E|(EqVLZ+}7O};}3k6 zyTb)N@Kc~{>_h^3nwsYKSEcv1GWVA##(L-IH|CPhHRwgT#%FUB$n1vgpE{7E@4lJ* zm!wG2f^po2nU2XB`qC)$-bQK%d(d(2YXLBidyc)IO{#Td+=+k;Ddont?PORr<~~F$8NpK?~HZMJ*I-&UgX#^Qs zI)-Lwlt$@pkd)40Xpm0nZiE5p2I=>BKmYsr@ZPu2C-}`AIOpuW_TFn<>sr?mM2s{& zEvtLS`#5xuq}3Zu?EqkpRoVCZmX{;KM;@eH&Qo^FP1A75qQ&jyL2a$&s25f8=K5K_ zRlJMv3HQHH_ZqmT0pM~K;gEh{i!Sv%=W8Nbc|hSMpwAmwW#XfQ zYDFgMLh%awe*lbbsWY$9o$IEbF)X6c*94HNL^51JG2^DQ8N@Ob7~NRtZB zxiQGSyT1}QEAaQoGmqCkb5q5d<$(4y+GL)mJswtb?vijhvbjeCXZuc5EVa`)>Hof- zgOH{E^3T)~zj~=gB9k_D=9!y)=K%(Im$Frq0b{K&cbllHpa?dl$iiCgyi53B<=*q# z2$BybPO(Hx#X_F?_DpOaRNa#$l)f{zI5hgq13Gk7@Y2C`Nj>!}CR7jhLJ&5aKk4N^ zOWBOi|5g*955JFaO(6sbQ|~0jBzS=8c!{#2ml*V@HQ$0QiLn0MQJ-R zC}4zwX0$HO1*mfG#^^mK<;5y zo$zhjoz|T!4v%)@?romFLh<@m-A<)nU-SvZR0Us|M&`Z@<^ zLFF3K`gd+WcHh2>&b{DJ=?Z$)r9B|~@`cvDp#8R#55rZyWy)XXjPk->b(3y*vZ{8ro?*{{I<~ zf56rM{S?M#f3M~ilH*W~`7guZ2JQ|lWz&C+{C}oDBI57WvfIBA&;L$)zbX)P#LkL% z{i6i#|6-J}0J(Ii|2IYJf03BUt_S`;Ynoj2znDk=QBULK2DV5tvdp|>MtOk4@JgHv zxVz!mZ?1pM>r5`9sS=2{qF(#PfaQil)VN<>h#?^mnE|b;N6|4NmFP4&k!&WdJtmJ|3Ef#8k1zC z7}iKc0cX^qSbGv`;z2_GJQ<{j0#8rRr0ndYTc;X4sGFOvrR6q(FJ?q~H8C;q zO;*QBgHuehBDNK(TzNvYYF_5oNBdS*bBU(`=o9+E&8R6n zN?JvPHqWKqEBZc*YWzt`Z#sCto?gdk`~xe9%9d9Xqs+pljc7Bv`mTkmH*U+PE0JoI z{mn{M;g%u!_YJ>8{No-u*XD7%e=ix_0$_^YZtn5KZN0=sD;A#WDpuZYe(PF9BHL>6 z{_|f(g-POhTV(Z0_}8yX;YIcJvvW<&&A#GbG0e|#)!lN=c_Poe$@a{H(4Y+`Vx`D# zOIG5<1QnYm>`oZ^@&?vU>T{K9y~m?~|Sn z4KoljGF8=t4yxCG^y;o(R2@rqVV^{msL_OdFsJYO&&dD_A|Zh)7B`Ds}mKWk-lr z_&h1UHQVE}cyal|?E~cdrDiMIZ@Q8dIV2d6Mxc%jxqh?fp%sv1W|B;`!Khp_I54gK z#;=7aDM?rE%+0plZ#j*_06tTyb3f0iO}hh#`oK?j6|^R6?UZ2fO z&&;6NE@6mDXWk2bt3JOx|3d^bGBkh?X%>D^S17LIHVmF`W{}s=<{j@<&9E|08;v_b%uJS7`GY`E#J-?yqjS%; zlfGY8qRqn~-~gVyf1mW-(U;#%RER&b-PqvZV8w8YxR%g@{%8mPvgb`lx))-g8l^N{6;6;|KDdD}gXp2j`x8_}}&W7ZkQ@hE4|Y!qBPn6U_d zLg;ffUbhwf^k0i2>70-zi4Wlh4pvngi|z|D*611eU|)st|jWgsuf8Y5&XPmDY!cykAG<0kZc>bZZ=Nmf&1uqkP>f}x>?#>6a?-7&`x znFW>^mtMa|G6`FL)kU+Nhn0=|o9vtUpK}UDB_BRn0xJ`s@zK=J6OkuW!+N(FARp^i ziR&muYm9(^U0>a#FKu(!$$_BK+RE$WOrv}^^`Ir?ixaeS^5+o5B*Vy*;IsaYl^ zd4lY$ZVp*KqG!_lOX=U04QWIrqR7b{D59dx6u{q>u{|q?eGdht>IF-67PUHfdS^d; z8iYRFKLsQ=L!*ftl7P?W_Y+y|woVa;nL(Bl^H0$Ze)LAB^-UfYbW?btbK6@<*8xqn z;?iY|T=?qYKaz$yVD3ig#^Qd>g9%55FIQkkQSe1E$fjANM{!@p8@EM7>ENXEPKbf+ z&U-6(|M;nuB@_6c%n?TA|58AvLFUYP@?p>nue_}6Ty%RfN!Wx!=^HBggrYnR$xMl-pD08vp15&_o@9)VlresB-yUxhQw|n;$^er zenaM5^$_$q@tO&btf}l7(&8w52m9i&D}^fG?DktF*C2@|^KZ>&v)II4U5quFgn1dy z+b2Be*3TGzH6)`Qgoi&PzZG4`ZLc1*oSa{<@Ic8~rn9aGG(gg;&8k~8E9Qb<$w4L^ z{7ljK{U9DCcwoEnA#%Hpa1MbnmhpSnH}wrV6Is#XHJEbIx-L?ASwjIBhp4GYE~(_? z_n(kJ{0>poM3~`rU1=KL=0HBC$s~Xm0Oj65=NVVuEGDnQ=jOwgadhl|-|iH$|FYes z(&V}c{v!Oph=>2jtun^@FTPoItz^5QMX3gx-)BtYr|9vR{C7GB^J7N(U7p|M!#~~{ z>0~|oEFX&HO1P^L8p$^wCx;0sj65TZNbqRP-_uz9ob;NE$(kJZWrfQ|u@TXywo2sR zRDh(RPV137Ye-9dQeaWC^5qLq4}6!D(y1PQcJezM8z zpX2~x{f6j%P7dhd>C?DDReyQEL4j5I`9Jv)&_t45&*GmHjtsgGc3-T(eF6#vSk8hF z9HDTVX56~p7p;oOuOlqWXq2)C5A9rl@2nKdlwh#5SJNnfZW^T;LjeDx>j-g9iN@1B5JZ12V>~@I0AucXm7ocf)H9iyXq}IMEJSsYSA{ zFlgR_YzTI_U2!jn-XKn0&wo$=VbFY|*|Hc=*8?Ob=Lr3i6b4tF5X19WR@cCkm|u8{ z!wN8{L-5;iVLq5v*BfJR!wVT_(WB*hKUmMrr#+`kIgZ2bu&3S)hQgoG7K|Rn~D0{}!YiOuxanx#PQmP$Yk6`Uv%o3foyM^0c zcIX@+gD4ffK51TmIEGJwkK11f4*q8S_dGDl`CAeLsqoJ@7>=mX1boKmF+lp9euFeb zI^ygJfn;PgV?E}@6FQwrL*PVazo=7a)>i^Dr)CYr9&r+wwoo_-cBlrx6NO z!IvjCIX@ZaCHj2DcTfRkoQQq4Y6}3JLxah`^8Inovyv=rILc&5ko9%UWSKJaP_*U) zAUjew)mmlW|E*_n81Ak#Kf*H#XR}vd9gg+>^jlEIis#daS>M#BI!B1TcMx`$rsbiF zt=+x&VZlSmk5l;#C%)-Ogz!eA z^Hl@D!X-l*`z54MigLh{W+@?l^V;|beARx^_s38wX~uLvvCjN5o}UF$ATczNXbyMr zj)z=tmH3{!g-V9Hft|*cSy|}KNXHJia0HtEDU7AI#>IO4E-=aL9JylCt8~*VBVUNL z^9ciK_{nD^c$E{?95TAKo!$BD^EJ-;AEbO0K<1Kd;5ccaXH8c08(`KhIKKmRFd^~H z40S&nnO0Siw%8^uTDNr;)m-?iZfu%3CgGaQA#Zl+xWGs<%st*~JldUTZ7 zAjX{mxzlApQo>j*I?2gvWSsq5Ak;Gpi8<)qlDr>*sxV|TZO42_GWFf6`-sTrAb9W1 zTrURc=Cy|xfniD^JaD=I@`u>=tLtR_&TTQbTrOm2fw(?ul{-u|K^(@v_E0wM)@~+&gNP^;n#R zFO3vkNTq9r=n+@cIwT+&Bd3=okw(z7_=!{n6ACsbih$(0zWo4iVF7&7bD@_sdEs>g z*&6CA3QzFMmoJx^3r53=xMZ&VFgS6Mz-`&k)hm{n+afd&_ZWdt1O+JG#&W^j84z>b zzXOf~CH_;A=usq4IEkQ@bpm5wb@`&*9yJo^!X06CvK=n==mVz=`7Nj>=vE@~zECf< z6DMg`S7O4=xbymz-2QIk8C4Qi5fbMUn;6G7KgOJ{Lln>h#bo|C+Vfr3@5;3&gJisZ z9}|fPNpuZO+Pn*m&mDBE$}abQLX{h}D)u}l8B|;CMMryg1bb`Q1lgW|FciO*I}Z|4 z>%CQXdbZYYTj_Oar%Gic)(GZg9A(cIgIMBFQ~_d)2R8*n7@Q1_xf?(5f2d3bpRG{G zp!9bM#8Y{#dl6!asHTW5vy_w)b(VHwAUx66)>odeQVzyX7h0Ot8lUqqP@#>gjJX*|$AFy9*+R_CzGm;5f3CwTQaI;^S%C!R;L!0M7qQxDlx4nqVLs z%Uz|Ex0t0>`r*PFt#nvTwdaPu@EXBJ;k_*h`6jQ+$n8)>GwV}_euup6Hx~`b zJ;)Y826|NwGJv1?MFh-_X-rx7^YLC!nCU3p9q+T}^Am__wylt)GnX<41YOpD zbJ-G4A$=Z)+m{_R4ue&57;W>fBglB{9`3IBG;KqUFt;)+1#@0s?X9Rj?~dIIxBWA+ zb^~MAKga)*$9}}brgUYct79P!h;v@+5?3ul{kJbPH z-m}zuv)_u|B}jst#EhiKcpZ+tPR9M_Wc!@oF0QNwAc4?{TbL?jV;s3V$AAV$jZp$~ z7t7)8)e}%x>CQcv{I88Db!vmu4HMprd-R<}YmKHcI7IRsD%2dPeyRa<5w~btzg}D$ z+)ifC)$p%Y0X=Ido!2g$&CZAy?|{k&R0eg9VN{S8^oh{vg+4PIYn?7@l3r#D)9dLG zM~MU??|9+i4C(7J(E70!t0$up!iWX?wCt+LFUL*d1XdffeeAU((S%@$(Z{--lAm zpRmNEwv8&(qPXNru3Z;|NgKu~WesO<`XK70cOK(L2VDw}1=g5UV@Z>DnVdvgj4fLG z5|cx+(hA2ON;JvmLFuZb}XYFHNPn~)GSx1eJxr{y~<@ao>gT$*k@C& zwsoaVTQp}4!O$o=&kFQLw+qtxdMex4DwD9an_Y`{pTj7`K9YaBw-r7{=v+S5UadU7 zr_^0ILY4_qlaSn`aUSVdqgYtBuU}j~XJLI_fejD97$#W#kvwVOvf*#)c zJzku|^A6uO+w2!;x}UKmpVh)!NB=fPDlN$>F+eNd7O{2P3L_r;Y1ph?l9X8SEWqk5 zFKnYLdOsjRrpT{IINM4=T0w)4U3)Q^P50waGDp4V*_Pm!U!>vZg5H;9zV{ciaJh}7 zHsKG+A!$|$^q76;hOx_^=L~f(@>p&kq*w=1FC*&d&3$5nme^=3fsbGga7ZNmh z=qSkk|Ko=T6c0VOVD!Vr=0~NzyQ_>>x}s+2W|$@>bAK4P*IONv3Q}$pe%8mzC*GZ1 z&pM4U)V-@2ZEO<+(-_X39$gTea2ulpSdnpR1!~7!-@IW3J5G)2US+Vpr_rU6R8xJE zA!02o_NO1mr**|8QbLH$_e@zw>RpELY$9dk_mTdS$4rRtqi-`CLLpWXnq%m1?BoCk znsxHe>;mb#8^n;^R!O367#lvF0VSQi49=1^vd0!k8co>a*BWCG$*fvq#(eWA{P8S% zFgy+)6!Rm7OOpfyL$v!0^^T;hLmh^~RqP3M62p!i3uV4y+2z1Zn%!#qMh?Y|EphhU ze<24HHRny*+ur9;ln}|s3=BH`k`CYXuVU6ROr3~D(D9+i^1gF()vaHMrVF`Io3E<4 zC^8(O{LtQk1G&?tSBRhDm~FAYk!yf8?Od!J-k1;FwXix!FMRuG!hhkNa`S;6VN~7u z#fkDX;85_>?-&CC=k5Wj6AJ!P?fLFTmxQaWujsX=a9u%HoKMXI zC$gzhILTx$!*769%O94e*?cv(GTQEFu3j2OAs-)MWx^!$25}QA8GID*z9i7MuD>=| zgSxspgn(}!)>-0vId2K~y@ldpdm0=B+I=tIx$Sfy4IshJoPFLH*Aq6>hPLX}IPq}g zZe!P*5zFW4l^T{**fM64$p?7guju54@UYai@c&nv0X|d}Kz<85C!@*GKkQEE!jfZ% z;^B+BTWENPBN~PmB@jict5%K2hm{-r;r42=Y6C}YzL1J<H?&2#VZpLC=$F_ny zju>`<%ejfx2UhPI!%vvrfp|BvxXly=37)3TWQe3UquD*~KwIABMQIO72Zb&c|SgKwBw zwC{Eyxa~d0@Sy>KTu)Swd;LS!o27S(1xq*6IcJL5b;gR3CXJ}Z@kqyr`Fx(ZxsZY> z=rJIoSgV1%(|($Tg4olVMp&&hVNWcKsq;&9MIEovdz>Up=(1F3-Axl})~2byoqvNs ztwXz85v2-KF{msWfw9c1^5-pH4wel;Y-DcGu~FviVwsXjFHH(fT)(Mi`Ir?fmD9Go ztOsKc`a9i0)E&pSWuCe7du`WVW|LXCCGH5xt$Ge64&ix&K0cC^2DJBPh^URe(BU#!ZuL7;s z@xOxp_pJsUCCdreD+bb^2y{5&#F3+Z=QNYM33j5Be?c3-6LT6AWepfq=*w_^nBn?UITnON_ zXy|C#uij&C)8NS)t5(Dx435VUg1*qaWMmkk2>!tWqGpbhvHh~}?$jd;TZ^TwI2xf5 zE0G}a;_XNbf8m~N^T7+G>QDDxIvaQzSh?oXJxmBMGSgQqtkN}v-oH6--s^}P`07I5 zyP1kcE9RX><4s^mepNGsk<=++T*;i?*e0_JD7;+0_O?jL~DAIGsY$Hf^2@)6=A5K&e#`wkgK%AYA>0I+wL|`s-3<*u1 zAc0e7T1B2lOKy6DQNaIc@v~TzYt;Fmug{%BZeePrN@~ROy;msS8n80m+TMva`EHz1 z&U-Uax$)EDs%1DNJ(DDdZ%R76md)nIv97(yDkYIn>c+aFl(m4N%g#*KzxBd7@jjgp z#Ov1ca*s~$Q90iC#5}Jv#BFZNAP9$tTpdG@2g_gG?tdOn|Goc+)S{z|=jQZr^=+9L zZ_<}0#B&5c3nW|@8sTwo*c!}b@8FU}C+&|R6u1`p!KUb+C?SOc7w^DX>p<$-V^|eX&XGlA385wlT^5-I|g9i5$hmCiVJt zkLQuf9)&`z>UCr& z%-Vx?rS5KU|0bPA(qz4~a_*dnReC|5lI}gFrxwtoRgPUQJWRkLbT{C*IGt}=$~}kO z`qe0Ae$}pJ-u;_W>5V`|i=W;{^Cj-90}FJ&tY#Ur8@N$=u)b0(=og8wsU*b=OHo1P zGjroKaDCxh_3iKHDy=ID1?zxDcMtoTqy7^#ACWRdvN0W;Ka7pQJjE}_J_h>m$LO_$ z4hO{SBypqC1?f5N+rF=h1#8;oThZeF95lLD9L(S<$s2?E-F<=F96dVOfyvXI8^MH7 zr!G@5_O&k8|V+v^W=bSt?{Oo02D;dvI@eEseQ%DO8*=H~aNBoet)u?&|f5 z0K<#jylAl+kRE(a5MM+}0cPeK>m?8H-dl{`2g}#I47MZpcXmm6uY*(1HH`}QvJP&& z(+3niSou|){C)rE=G|{>JX>sQpS+}sJGBhT+I!I^Tzm2xElz;wzwP0FxH3O@{3Y-} zo?Q`>gauCqerskWNWUqC6;Ia1$tR(`Ap@U@Uu!<0((vA91?N zqVti>>)vgYRzscfqpX?y#tskE>xydep#u$ixJEl_5I2JnhB3t!M7`JIWFQ@=#XzUn zFZ^_8eKVHgu*i0f)_1GzPwsnW2dwU)LqNW;)yU7>GQlj%tAD(+6^|+gWGp#Cw(EcA zdFXz`YzJ-)OQI8Qv`r3s8I~lqg4-i><{c~nr4GUiN;p1IhX3fF7ASct(y)@_Xab|= zMoUzCo8z3y1S2w46VfnCbS|0VFt9nBwhv2sk-5l_i&%WTIeMq;(%7*lYmTd%Kn#%p;ww+<)7;r~gf6Wngz5uh1 zrxbAfA*JdZ<@TJs)op8@s;r*-s5@-Ul}O30$1&Z&$fh(YqGrM7Cz3KE_7#z$%64BQ zGyIFGw5!1Zyf(S%U=ad>JK2V3wt7sm^SL&^kX6?9uJx^2Acg)fuI+zp#bW{Ra2XuB zUjUmlBYiXDJA$anO^G~#qibHVb(25l+-8_jy#2&{6c9UsA*P`uRhTh#O@VlBaLqk4 zSRmOU+PPYk0(p&8dZ^3fs>=O@Lgc4(_?2)y&QdT^l!wZiwaQ}aAR^HO=U0EFR!Y?g zkENGka&TqqFHgONgp5ozgXTo0%qxDv8Jh-qv?-7?d2Lb$?wAGLzR65eHM(CA7P_A(tWahoPk7l3JN zX}F-DZc-0!cWv38J)ggGIFVlY-r@Hl&U3tr{86mZCH%p z?PzS|{G{>VLxd4eaZc@-ENW2JEH;nhiv3REGip8?`hyUjJsVnaix#Ug1R@U|5AMr> zc*fgU=d~Vp)W-TKmNyhZn5zs+QknZv5N@GtA2#X3Ct1vU^#uMFCL$Ci-iko>FDJu%Zb@KzX7evSfTHPMH=82{5iiAn{ z0aj<(j~|sa*54Sl^azUONxqE?c#7TU~)vlZ)zOZBO_P_&C4**p(Sr(-#|ABk)k2n!p~z zDX>E0{0DJn&S0{oGnS!r^(Woqx&xh)7zjR$=csm`ZT-`&8yD0WFwA2)wwYu3A*>2_(vRbSf{Uk4d5^ zYY7&QUW1Q)vMyB-7Z)N9pe3oOu%nc!PUn6YD7r||Y8pJ?ox2=j|5@v0jryDbcRe1Z5#W?(!-oG?FXSX861 zk#OE7?>Q(-1a?adE6y!zWgLYs#%|0W?3K?R41orfp$3Y{P?K^e<6U&W)!SlrmH)>Y zeMf^QJtP{^Sb4_kD~sslNRZpnnZ8pb`-LL+@-O&6YaCXm>KT^c6ACV2ng5l-w_+-M ze@b>c*Sq6+u)xa*t{Sqm+geSx>vl3m|7?T@dfH`ErO0IzXw?|_+66B+xXPqg9Im6xfLQXWVLzGUSXRPJ&A=sR#48fFz+6X{J0-V%FkN#HC^;PlHi5MjwR z?^9VkSr3IC{&=fn_)Ua48QJVt#=Oc_3_q$H`rZqacRJrs!@~L-=L^)-z&e{6S8k|1 z{b66#vv5Q7nQm#pQ&gwr0O*0Bf0i?B6>4}Vz?qAH4d!VvAGRLh(9JiORiaj$`v(R1 zq;Yz~7c|#VlMWWjSZqKGAp$0_cV)i)d&>9?#|8UpZ_e z5-%K`lF6RCof+ckcsc4+H0vTSz&k*Xp&@y@&gb_o_Gb!rr)*?`kbJj-vRlpw(P9Q! z0o+(SwN9yZg3WwLhrDt=8OW2b6Bt1CX}h~EQ-kE}!*}rX{s9gFC+H~4-$0+|bbvM~ zsu(@nX)KSAc65-G!QeFt=?J%^_glEc*D6NFmseYh3P+Q%&MkfTp%6lzxr*ZP%8`p9 zr;2(QOki9xj9UwoKyA~f-?{eP5v83590H|Jt$Yhx>c|j8ALll?WJ?F)^tk?TE8V8K z3Im%e7mZJNLRbvXgU}Hogejc+Q$k&QM?ECmx#n(BR&U4YdIm;t#cOJBasT)tz*4V4 zDMw~KEhL7=OKfwEBSHlk4CNYYbzdnCttOjtkIf>+t4z8?_h(bAHOp=EGqpgkS}F0w znZWGfR?G?RqRe0z>9yAz&Rm(V#96I=VZ=tGuGn;Byo3AOSuQV%{RT8jGr4u|8?6s8 zs;7NUa_fqgXVWVio4(iHht#KLfbtk4>YWBbnueMO-NP#!ZNto9KmFX`YFR{zDtycc zouf@SuI6N9pLD^UO#Igy@~%AS3cM*zGF_;8ZpE))@ZnfHQnG1pR(mhQ8YmPT4|l}SA8X40c^y%|D^-|ssC4z6c_lS2BvgRP^MKzfK$IyDe*Na7ji z69757Bti-oqYvBCij%3e7}IqfUy#ER*^8CYBT?rk+zc>q5Irf66H0t?0F9j`C#XD1-Q|1vj`AMHeVF~GiLv-W%852tmoT;&bHl5 zqxsYN)tWQ~0!Wag)7w83AEI*^FezuUb5SKpJ56s+DaLb~XrU?xD|!0N*{l?4VZ;bU zlRTW=AtTJm8_XT4lq$DRe_BC)8*afupCo_g1%^`Z&jog!zTQ|v{hZd6hZ4&~NHshDI^ZIo2TceQv}v(kYDh_O$IHBYl9c`!U6&oM<<8x#jA?6%wClK%+jL|#$_FEs`_0wYd3PXXLnbYs`%1~xnMv+ z7>NCWHjAz)VZ;fvZWohw1EVxx?S=enm-5zv>8`H;TKyI-?#7FmkG3}eDrT)x%8E~H z_v=e~JOz=z*V`~@xopm5cm+Phpe(#ydR1v3o04@otu%hREi9-Aaj zYxk5+%q)-ZAZ!cYIb$WaUbw`vn`Ugz-}aEeFYep*o(h6Tu$7~dU~PFCZ>zb@2MN=K z+)93GXOTtuxQ{u^oB%nH#Y3QzTz+rVSo~6#S9TjgOQ6(UXH3=gSXhwv*)_)P21hC6 z@Xon3+rUB(u3r~P{_=nTA7lhc%q`eYa&D3SU4>QWf&DM9(vr;IGRsOKC|LOZ4S_(R zi}gqPJ`$V(!e`I8^R-nZGWwWXbOy&ex$r@AYkr5mF_8W9pW~hPcfV4N#5N-K)J}J4 zz!Hdc{$AR!q|m-2lfz9iAm9wmsFL{kD_`<`VkDLL%2xYbT&cWb0s(4b??dJ6;z+bO zTsC;P$?G8?o|N8Mu2`L!&SZSA(wi!t1)n#Vp@#~*H+9%b2NrY94c@$TySShOuW)w8 zDQEEaEmYPCoH#jl$?6P!^YTA$U~mpR&9jl!E0}?qJG7dIuLAsVx?}3w_7&q$*HHsO z@D#BTPrPMTMe#ALKnb#Zw0eee$)J%I!|@CiI|#GUat@uV_t~>na-!;;C3%yYE%&4O z038`vk#Z24aEnts<#Nq*l*6h$2pd}kLw)t<2)j<}yA;vMlMjD(SyfsbBgae@0l|u9 zk#1>BTk*X&C5a45Nq^cbe!kR$S;c%*81$Az|ae)0PbKT{#wSx z>zt2F>F5-lL+5po8X-pFd3Ni5yBk?^R5(b-n7N4cD(|`GzvtiP7y+>g%nI^W2Dw^d zBr)-*B`<%2S6fZr%<}TE)^ru(;=Hk0VKC1l-Xek0%NG0k%dmQ{q#OhQx7t0!r^MYm>48 ziMKCFl)})t0;S|T??AWbY`@Q<8_lP2Ye{sr!jjDM?A}t12jn95j)clu-mQ<*fW2ze zOvRmRKfUYEJ^b%*?*H$0di3IZy#A83+1*(zqmr<@Ycm>-T#QVHtln7}PB7m|;s2&X z9jj3ihiJt`9NvINgt^{2!I2u@g*}aVa!oK0@H8rORA4C%5=MniW+!!LoDQ1r#ScA@ zzWdYT;mL$T%5%)HNiNq*{GI8K*Z}!KVfipol-)4y!bppw-w1Nsj4{E@Hyr|Q}h+@`&h%oYE1SwB9R6`B+J_g=6 zee6>aZn_X2#SD~`xr_85F3;4+KWBPF>T;WONyX5>xILJ*+O`Lw7Ig{;rLh(Ap6QAn zj|%ULVs`k6%~%Ya_M^^AVIC!A zVTWgwoi5Ki0@BPA``N>YcehdzH)O?$Vg&1TkVU>`$Mt}Cjxq&RjK6c;lL%B6u)c>A z=9*hKLKeNI+m=0g#OB8;#!M*roVy{Svq5dHToW4sy!+4ocy@i$W_A&|vKd#U0XMh2x=PUuU0<`j2Y>qxC4H|XNq(uw?67Ap{&ksMAi z5GNmRM$V|`oa@42uHKJl5z@i|K>p+S{{{KGQhW+S-EzSL`QlBAppbxwRWgPN$GIQ_ zAJdyWVoOzlU|z)a!{iQ+iliFc>rMclbVE8sYaTtehKV1Y_ZgRzM+M~5V`%0|?W_w- z%W63l?ppmbrUV(eKtG71dl;ZXmcO<6N{+!W<9Q!KKo$dIj5N*;3pyM3AjdZUOyq`oHs50Q?Y2y4PSCGky%re!3g|{21%a z@3hfM({t4-AFv6?P~U3NNm5LYY&~9wZ&zBsD3(t5UR<_hdbVdOHjeCcyW}z|INs@W z@~fv+T1f-ulPe;!>U)(*04lpseK}rZuiJS43xT26*3#J(dayEWw+ey4Kl!%ivVq(m@88@XKB!2=v2BB?%ep_y?|OKgQbF)E=t@77)8DM) ztt=H0daPCw_dXtIW0UC}PD2}fGHAoADU>@#pIk3{aSO8O0+1D>Bx635ULa@YQONPF! zvU}V{gUSvP+evm$y-WMu{WHUDPajFMP)*MckBGmypk(r;wi*p`uDl}dvddo5X`TH0 z&Ku2QAYV-kC#LTm4%t;Ttw1oMVX<#MP+4a)pRL~UC$tfuyfJR%LD7`Pd@V&+4e28$W?NGuENaLZx5Yv zUP7P#Gu!5IitkbQg; zdtqU(yFA6E8OOW|A|67K|K*1QT>a--MmWR~@5!DT3=1deX~^2J2`7rA-t8eGRJmINf5o;u}@b2%YlIlti6q_^WjO zMXum`A@{X!1VTYuoqUM%iAClac1icd@J)$%ZkfT1v@ZnHy$gm6&!O|y;@v~*k*iJj zNg=Y#?<*Rb(59a^>ZsMkc$>$a_GtC@hN)}J5x(G7m9C$WI&3el!xsTdPgSWuGL<(L zd0JgnM2nyB)$7~$S52iI9b8TXKE|QDw2}W&`H-(>!=yWB`7;ecRFfn%M!;?d2us0L zuPp(j@}6Sv#UGIoKAim7mx{6W7U5eFRc=E=`vq=557Y2z=(XhcOkoi3P<{F3MUu8{ zv(x0lLgH-M(7U?~Hz>i&$ZMV$a4rC+{mVNNr$ot8$PHfaiVfDpstL|u+6O|EZNZxF zrvm*GQ3TkQWH`K0b6sy|++S6T3ZvktzX#9uEpAMlwYm0G3kGw$f`8O318!_&RQaz&>1`;irbUzaRkN1^s?WrL4WD{cV>lAemS!rqZ>*h(j2s;AFvVZtJd#6CrA)3P?3)7Cz}#z zias`E_L-Z1dMmR=ak>Pev4~fU`bY)q6e%L4Sas#F1zLECGZR~Frh464_;vGR_UnB{ z^l;eB+5KCA{y>a$DA;64P1-r?(&>{}Qs6Y4`XgGL3>|$sSHT?d2zxcZ5gDZ2C$G0j z$X-`(PkFJe@+VYj*}1%+FRhNBZyDChDhj>+ir&l@JTj$2dz(piB4ra@^Up3hcV+(KV=EP3H}EzZ zll&d#S7~Fyum-#niIi%U*w;a*lQNYNjW}FyNx9`Q?2^8C_}e*g%i5BBjp4&*GC1Ie z5foz87&nC8(dCl{$<9VzvbBGmk=XSedq$?IFVqS7_Bn@r^-f#lc(?p=gnU+16{{Pc zC2`7r&>5FLzFzNZ7xzJoEUcJ3KyMS}8r;t#bx5+D(`Iu&0Q;KPKmm=wX^>h4^qy6P8I;B;E zg06$zx^!qAqA-C?L$~e%NJ1=MhWT9EA7Kmo6>5sJseRzbZp3xrrs(_k`Q_J%E^;P~ zX&N5LdT`h>4B*j{D9-eW3SSAL>;{~Yqx@<{F`-T+1>6G`LK+#|15oENr!4OYU!SPI z``{4`_!-)eKiGr{ZQPj}<}u?&R)aJljoU`Tu32*xoA}<@3Q#`i4;taaC!oWdScsXy z6!U&@o3TJY-R(K}Ey=&17K`l5F9lcecX3Z8L+k2c3vSF}OROW2jZM~=v(E{h=?41F zNM&WCL3A_e@|Htplst&qnTN(DkF3a{^X7Bn+lzoR`IX|6VPbfJe7CZ0L4p9CCIu!w z8|+>immY(*XRhXgua5lJ;QKda8&9p=EleE^cZ`fqDTXpEpDNAJLPx4`QS{ocx!T4= zu$$vSuM|UcJ?A;SH?CQ{x9e<5$@J_J>ug;7d!r13n}r+}-50@jLwvymw-wVk6nj-K zWY+SrOvSpZ#AVGoUH`pfUHwh)cpJxOnA{ssVOh?#?&F3WJSKw zNW$=+9V)C?@k0>FClvFJn_}4Oh?JwZFf2Q00q>Jxn&*cxqKI>nrh*D_E+dLvV1HUGL_Lbm7*i-bhwhQxh_|F4)Sv zKqw0Dy3n)X@fds2yJ#6o(O|+j3e~M91ifU5jJ)id@2{f#*LLmygCO89lI!2^?!oZ= zgHU0$qWMJZneR}3&2?WJ;JQRz_S;zX)>wipT{J^-x{&Ysb=Jma1_KMs4E%S!&=RuC8r3`1bB*_@{h-j8Bn^BTh;QP~2htk`GL`U2}t9Gc=& zwVCo?+Agc&KOKZFC|{mJ@?CtRaceRi@*L*xY<&+-8y&762C+ICALNtDJOM2vTC4fM z(*Do}HEyCyjbK7nW^K}@kIaUIbo*Z3(MJcFn^|u~FN~Yj>Hc||`e(lYk|9f$Z*GP9 zp55y7U0#tOn*RorxIwsI8bxNV#rEFnGTtoYTdSheI;kc&XWja+)mVsM zOZSN0o+s!(xyi~}gvM(L*AHCBBaCSCN!L5nD$bdL$hSkW!%sXELwM83{$(P85r0Pp z%PPhD^#7yoEu-RWzJ0;q7Tn!~TL{v)y9alIySoKw+}+*X-4aM}cbDMq5PY8Z%$zf8 z&3pg%ex6UPUW@Lodb(=YuKkmZ4i9|mAT3Zz9OSe52n~w~AypG4vCSdrNv&VZ0>znz z9a{aL<5>x$QSy6@d=$-RM}weKk`Wfek6tg z6I1+CCYDLU06b+rb&qEb5;57{x?jUWTji2KR4Da>FZ_>5D=YjzlYl1+2`@zo0N^FR zE?^K|V}%3%`3d~j9Tbn?pV1Oz6{gU@|JVCi|Nrv79Jhkt`$jhRZ%F^Cu=zhzCk1A9 zJstKi2@O2ikFTtP)nZ6z%#4gpfq{^nt|uu#zlXnU@TvK^qRPhqc~b(Z$z>5huNY0_ z`8-`S7y|rWvB)dl1b!=FIQv7F+x@;Q%*-tb2^gK8HzlraZamA7+X<@752r4=|7kV< zceY~=35+l&+$ct2pAz722oQ+g0Qgh>PmmSD|6y|Z|E!FEHo!~#f9(gj|L5h$|I_<~ z|M^GY3p@h>dXVWM9OXYB^HN@^{OGXcv4`)=xT-XFrAhDxjqe-CYNF?9)ZOrE92b z3Q5B@;?uqP%;SrT!-w?(VXeXtWKzEaeVdO>$glVZx)ZQ>hLJmZu4^X(*$*vy=c97v z{DOJ*4_~6A$F}$*`DS2Ns>YvRq(YpyD!aS?u6OKl2X5yXvQ6FZ4Jaem>$E?9wn^I` zxm-*8I5z<`Vmef=gwhX()%(7;#@#vIbMN#zOiJl*rXqM@|M;xo>{^WUx-cxhHR8Pj z*W6ov&HW#WK9hL=JUC-0lJ#+-*x=w`byZapA|yn_s>a4y9v+??3lE2Z=>=d}3C=8K za@l*W+7DlWluGYor=?fawY87a9&T^(yfrjZNppX>1>HG-;+G>MBZ+Uou2%1fOXuD( zOfjpSobBq~-IeV~n+%D(n~NZy)a)))`o)u_aXc`@r1WLadaA|G@Z6Sw7V_i^?;`u9 zyqq;Z-07Lt3(q@cf%QyC5#|TGCh(mOGZv z^)tSL-@_@Ap9Q8egRu z=~IfV*>x;;zC#--E;_gFF172ui@SF>s`g;rB!65O(c>C2!r-fF!p(<$-j0S@Cm@7az7sUnPiyhVJ+q+~*sQ~nUTnAZBhicU(bIZ6YmuU) zgs4yOBuVXX`o8w&HP{if=5pQ0?e$&rK{FE1g*KLB_z$6Hs1MULZgs{{^{F<;9h1m< zr6{?^9RmYd9<+b>=1VCF;cOl-l;s?m@8~48SR^x-{rhwl6kXW|KkS-iKcrdP2$(IK zxO{vy5oAHZO=)A=P7#hx?&5WFbB$rUa-8hsnX|`)jS+csm~GMRY&<{wcD^RGa#nKt z=N_{^;q+l;VmtZdR;a$DZKpbQj0&5Y5+pax^PV8x0)DHr%N-x!u(e>DtYzA4PbHFlTx@(ur(}Ki zaT3_iJ@JjzpuhP65>{dStpY%&gn-o|KZQ&U7njarv|Hygj1k48>i+MM^Pgeou8^V= z*Vog9wkED>Ea{GC*QVB;D?VK4)84(CYpYSuJ;TlMZ?SkBGPZ=K)TXBDs=D762?)=g z<}15hr+UEAJH_6JyJMunxoxctGovl7O`6`amFrXSPQ&#}i`iUqi|0pMvm6)V=kPcW zjNY3l_x@~mK`5TJ7;dk)lgaIi;_;B6hP3Sp86zW@zlofXvW|%1J?ZNLs$hd}_)Ra`6@~3_psjmA4oRc|csKF4LX5=Spn#24K)4~zF*&R&@hN8ZR2KjY2TeMdKZZ-Us`y%9!B`Axzl_6!v) zHZ2ccZA-6@-m3Jnx{dP}+YH78YgOF+$LwMk|L64n?|jt53^?Jaxo)|~*4j%ZkA{k0 z1f^ylT_%Xh+n-7A;*KwLLV2&w-;i#UE4}tF;CBAm%>=BmK^FMhW988XQ=&Lqm&EFM z26HDwGBI|+BRh-l=P$)s1Eop|Cj~PJ`ZUY(pip}^_vPwL75%l5tivUf8&|pm=<}nV zqnL|nOq!*oqdQKcdmB(s?Q%JP=h^Wi(!zqvZ~y$<1zCfWU6*~0=iihP-0^JAcUGA# zCgF^mEint6ylsw!*&CgT6E`SypX_AXzjqk6)tKp5i;--pF2dM<^=C=Cuk=dicCV42 z5%BsWM&pW7wP`rmV%tbKzPP*8L%tU}_oV_b-6)u&#eHOVJSRH4vH8>0zA(RGcYs3F z)Ch;V3{=L&C|uhc`s^ug@slk{1$2+IvYbBk`YshqY$bh zH~;JM0Lr4oM5w-4$NOg!Z=uZ7lEF*4gLH>MF$U17^`^p zJ`EX4Jkh9aAJ?hrX{V+r`El^G=<&vOZ46(a)$XtzFb>V*$OfG(riqT>cdvDN>D}BT zxbLeehv&Fh*QJQ)m?B)`kgY%7VJ^1Xc z80`>4Nocy|CSS4-))%Rhw{`OeN`$Os119(0i0_ujeDq>Z9d8{~`1{q2`eN@!MAyn8 zf(b_ZM3z&02lf|U2X5K=Bs;%mURbq0je5y6cGFwIHe-Gsl4ReTakah87%^x-JNK!5M31Tt=`JIvXfqOP1HYF zoIgx)p4luFGY-yiO|bVgrMEo9IHNGs4E{S?n&brXhuvuZr$V}HncUfU!ItayAEW-c z=blT(6@?eCL0XaqtCPp(tjIu&l+1ku*!@#RF|%v_%C!dXg{=5$Uw2S6bz_LDs~6}c zoy=EV=@jftDiR7rD)HC{==X>eiX<}UNN5FC#C`JQ91CD4zuNVu%kp#acIxV}I&Sb) zregZ-AFG8>{KHaOdF&6}`yxQoAYl^jx1Gz08cQ|k3E*}LK8&6>=A0M7xNCH4^uc`- zu>$plI;|Zy%i%cl)Tp@nEbsGz+XbIPn?>x``-gvfx)H6+@xK@2{zGKa8sC-XR5?zn zzIY$M_jzk?=Hvu@f(b?xFK)2eU;&l?GzMCWaXRnGWcD}(yEZj1*I%@h+ipIhphJfz zuP1^IFUg)dbzlA19z=RQag>5pqspCU^150Eq_g$m9*DzS7xhT5%R;sPNJ6p|y9l{*2dWOeq9FUUs3Q zFeN?jg~|lK&?R(Cpo{$J3d;mgN)QB`vLnl>*K?9JjR*S7$Kvr{phnw)iev#ros5jr znfe{K?ZbLG&%y%;6-?8KzD0et+?+{rC}N)%|wgR*7<{NU+qt#OEPUr7&c3v zudO^JEDj$bq0%S#ZwwQ>FN&Z2rMY?c?Ku0_Z6iNI8i##h?X~`B%53fb?$({$#2<=O zT#j^InyM?)(w(o3`xwi4gYmy&aC-hzFGliLEf!4pTbO2W7Jo|s=W`E->;G$g`Dc_M z&+G=1q=C$SH1UYQsl+Q4Hx0=935q!Lak%xX4rxnF+XV`wAy9fD<_q~lJShoW_DC%q z%=|<}uo%SD_m=8wJ_`>kt80i@i3OQc$L>evHRhWHxBejM%^2UxzGA%2uOr}Q9 z4o!F_uTEK>!i%RDj*(l)Dld?hj&OUsPrc`cLFa9>b@vDJtsCje-|FnO=F#C#Lljx_ zadI{<75JY+phfpu%A=P6yNR zIXGYiJWB%Sb+$WPGXzjzDpkzKRJX9vLN*X}zeBBF6Nl4+@cfAmq6v1gN&QS{ z+|EyJMW2|r;Sqv&vK_37_X@jgkHysKoWtv3MbRO_ia}dKzN#f?Hrz+#zfh_S?6XzI zT*6&CB8fbNX1nbxp!U%6o9m{@uVZ<9XF2K&W2Qr7&;zzw4x<-s7c%CbCYW6)wmbb_ z^h{>Z3E5pUFohC}_z2Y4R}e5Yn)&jG&x4C=bU`bPI?zNi#7RtNh_WRtW`mP(?X}&y zyq1*C@g2%9LfjMnN7Vex`~n835kkw)kGKN<-fY+i9sAvzD+z#o0AWR`*5oota==i~ zoM?Q4QO5IwOpd|gfrp(BUjNK>Jg4clNG7#YsDsJa&8+vs$`$b{|1ZMHk2m-tt}6+@ zYo_Sw>0Dowr*-+9RKaVW8bU@oTMuVPU+&HL)ty)9r0~emk;3~ke=@eI`fmX%@M=F@>};>*B$ng? z6nvH)9pox+$Eo%zvUS&t87@n!CazfKyN>fA84T)^cKQ9Mj}p<0cP-Z!3M+c5(9;R! zDnAMe8+B~$ng=Rv-S$>ot?e%`>Z^RT=~3}+<6VDABi!M8+vyBmqfB|y zadmN7EiS>~b(cQz<`VP2FZ;{R?$VFX@vmCk56p-k?9xp|f?*E#^}=;RE>TJB2`}^^M&CwO7IYZhFjWSyL6;sBD3CG9|O&qtV)>c0Xb05y#-MKxk0w zd)@(4`Bd%JL2sD$w&^6Uutok3z!3^R<2@_Nx{aeo`~0H4No?)p6Q3xKRD_AWt3eJu z^97ybC1Nu?g{3zR>ZMGN-+hDF1`QvLK#%y0B7jU#qM|3&W+RfNt)f^35%s#`FLS~N z$;R4gFa1WIy<2+v6wwZlBqXtO@`t_aMN6f9_6eJ;+dU^Ajj{{tjpt~fqcbrmn3MTJHdV2 z<)UjltB`|FbXXSog9PT|CKEV9g#2dB`=b3bRH%Rzx#-%!kj{RLR4eE z+nzKC@U^ahc5wCe$Xq2*caF$_sn0H0UaM;gTXc7ZoQfz-bmM!k{czbw@1ZXP>3nn) z8DxwO1olt4f~zh5FSFWC?SXBJ7n4R=_%oFbhwCoLrvq}C`_L3;0cZvGsfsxb{JFkn z|B!x%@#<(?i0*10*X?bWS^UnOaS9L$_AXm9!Gl_zHJ$-H)Rnx?Tv8R22i)%@)7-2v zd_gyNj7;x)Ec5S-o6~;rhFkr77Tm5Y5{hYLB33db1%QxPdLC$cAvM>)+TbU4h2iYal4AtA;uoMcllYMZnU zqmVAHQ2r?pbt~ucN91R=Nw;sR)I3po-PO+IHSUV8Cf+NOva-h+IEgvlHH#tRalVSk z#=#k!{IVpDPRYdq1iT**opGvbNa&x%mnL#S!XB#Jr~ z>sxM<`dwD3U~ilSWO3;h&rW3Ppb-GLjzuGb1pY=KNqCc z#HPnv(|%4~LSvCb7WzXKP|lt7PKE99FM>=x!HI!IH7^4qq~$Rh?T$N}O^XY+Kl|W1 zZimyrM5XcSGl3Yb$Q_RVzAyeyr837AI>!;AozZgr@`DxYYEEna7mI>@z`>VnTc1;4e`7(9IwTz3}i zbUgbB|J@h>JO(UTWmy-2JEol2b3Jv``eSi(kikN%ss-6~gq%Uo+J2Sygg?;OBg;Ww zvUN)nJHEj<^?Yby}R4L9hG5L6&50m$PrRd?6<6vo`k78 z!O{m7);AF(N!`(QTjwpYF5`EgHAp-&e)=s)HD$RMK+dPWFQ!3mH)bmx&1SZuN|OZ$ zcV%@LIiP(7+pS}z3PoBqG-`b)Rd$82glarQZ%23n66=(9Q9SLI>QR3R2}GXPOl0-P z>wNx3!T~LlR;bhB!Oz{QjNqa0@s5-O#guP9dfguB*1b`Eu}HFFNA6BS_)^cbR?BFw((RLq*{aWWLVe z!7yI2P@6yq?vg9xwb5JtVW4T@XnZcr+RO8BEh~zR`vAqW&x19E=@4lUWLTU{nbeQ4 z@$k>%Fvnm<$|j;+@#Y~wq~|BeA=nN#6X%U%@g_nm0vmuppBge-svAAe6j&>?bEb3N znaUQ#y`zOM=$vwJS)Tr78ew?c|EROdURk;59~_4xJ<(U$Q0e??W5&iOA%Z zN8&y=ezDt^qp8DNG+u3a&MNK}nY)@hd+K3val*U~>tf%dxFOSuu``P~zrOz((&25b z?rSrLEb+4XfkQU*HsxrMBr$KxH90nl|OB zWNd-4ymfJb{vZiHWQBk(H*J;vpxyK^ZzL!;k(`2&0HASsZ%5byX2SPn<1~|BR_0R( z)gd9uU`6qa2%D!y#Ai>(adEjl1CF6F;1X2T%4|XqTI?gywl_D;SvXgZZZvEbA#8FR zZ&GWtzB*@O5-pV=Q_4iz&iu8~`Ql#C-QZO-o~BKs0(aN$kZ6=4r^VUBtUFzB5ic8E z?Y8aIXuIs%$7yXpgQsuXUgKsCjVyy}Q>Dv2uFjleW}PzWa2=&Sv%m=cbg^_BRky2J|UBEB3JVd^>OpU{+JH8j1U; z%gtu);Fe=g9wa`T0$M&Uf!H0y8_uwl!2p{ad83aL+&><%1<9j6F==j@46Ge2f~Et% zo@Irh5VdowFFSrqY=x#dOUQj4>lbOK^d5D0`eD}`v}x1C?)0NTgbr&ukf+eA&qL>S zrBa>WFOVY#w1cpR5U8o}GClS#l|ge4}O+?N`ENgPTzlEzM9tMp0eLv*(OMQ@#? z=M=!_oFiT2qxneQwuG3xM5u;99y`Pl^yj<5)5Y^OgS&3xD5Rud3%h$JpKbg2qP>yu zwVNtu<1Lb(NYdHU->-K^I_R)k=LAlcm$LqSUcxPH#7x^g5|y>olp|~5%ntlxDye;; z`4kS%?Z-c|i4VvZW2DJHEIZ)?qO_bIG1=@P(m6rv_e*d0fs3cVnCP&p#Igw^xD5WG zyo+~{9$C!~t19bp2yO7Gy@135WAd3P&XDk(OKh~&Jq)MD0=ejQerKSZI5L|kt-&In zfe}75BynlDR`)lW;B5dO{}{Pcb`%yRjR>c@&k28CImmiBz|HtXBGL)AJbSEBFW2Xo z{8w5(8YQ4=-u{-u>i4y1LWF;IVXg>t**rNZQEq6lS-`d705j}?P6Yl)IXSYINRxf| z%|=h{SDTfB3fIOa$?_k9-H#DfD|Cskpylh-Ui#gYP?C%o(6DA*kj~0l$mx{&08xRX zjc<2opU&mAtb&UO0g1*N-adnKu96+C-y!az1yY;n#F+P$Ck-hbTs^|kug1PMkzJ41 zCf-HZWA_4aW43QeeMZvF&9~{N4=k?t5R&^dsC4fioMV0Pi{Y(;l~uIl2n4l%_346U zPi!MLVr%G~X?S9;I6y7aWF`yeLFP8IF*_}?gZvZ{w6Tr8MN&9@ z_T6{a6yCP$d>W2X@dd$C8sHR>WC#*)CxtE!`KZHUVZ-+(A>?}32ZZNuH{|FLi(Skk zv-+If#7WWRf1JfM<7i4mz?dVM8Vh6f?)p04jK!mbxbsgBr`S=;h5uH6YsROR31<~f zhr-bi^AiG}V%67K4ziUi5zH&S@@hV~9P8T@BL2L;@T*@N80IwILc~0^I@oO5f)Ea} zWwp1oeuy(c!|A-wb^LrVG(4KOh4&e!Ov7w`aoH+*bJ9SY>a$7I6AYW%D$l&z!rGWy ze)0%s1;vVdTI0!Cc8lo&0?@*`?M&^NTvCY=f`gv&>8WAmW#$-kUl21}I(Hl@PrOim zX}I1n>A~)nWK2}gD|e6~Hh`XDb2DKKup0Z_BuvnAwI<2z@f7cu{uhvyC|MZJD_aua zZh$gJjrJ*yHr&BTTPUK4kDmJ|k+MLqwPLFjErWIQkdh8p<}qmS`5 zAO(vl6+sU}EOLiOlf{roY}7Cx#LCOZ0>CC8D?^!+ont-~z&Jr3T-{=i+Mahbj}!(5 z$>jI%ejlj!@_BZ1hfLKNedNX$AFZtU`+aH%3rglMa|4k0Zqkn}3B(gc7_dNoXVn2y z={!Ggw(q1UnSiGQFUel8Dj*XaNMH5|QDc9x+(o*wIDc2t)+OXM?w~=U@xh?S&ssmz zK2V_dnNEL&J6$OrRiRsRC>~xq+U?Xlqe6)>b&@{CV*APRhQI|=p0vmLQ##$*ZaaLZ zJIgzA*GHe614#y6Buv=qG!;{agx!5A7Do&G1ZN6`;!L!?w9Svuue_1x5wIJ8me*U2 zcm$3JbgCK46L`&)${0nQ7i%j4ga~?$iKILqLqTFlAlsbxMgOZQy-e(i2&zNwWnOFn z#Fs~LjtQ9uiw$#?qNm|_$bpl$zi_DHFpqP9ZlUKf_T2@e-#>rc9kXT5F6$-Eapz#n z$rYnW_h{%dP2%^7-W@Q^ra5J0MY2Wr!1Jj4*G@m^sf`8ts3J&6_NQD3R-p!J=doEg z5ubSL12BBJXofX-vhUMlFrZDiaWE{)A?wGZfrq-@IeoY44rLAsBda1bIvgSwwO}8r zUto|YX{r8tgJ{n9+GcGWvGrl0IlgdSaG5L#cXtDjF^nH*T8oO-exuK|7=FssGDb0k zGQ}8SgBR<&WAj&HCtj=mNNwDuR*1#QFM~griPPcHo8?mZP+p)N4MzY z4=5Q3&jlz;xR@&F4#aXALd{gFiph%0R}1z6xm_zrTsA)xdG+2Moryx@f>zt??@irS zF)lk0uqlA$zwRDmo_Ksy|G?IbRNpBub@9S)lyEMed(mMTCXlxU2(56&(@g}uje3~k zD}uWibvr+a546jUyJ;%CiecpqrBzI6#Lldg<>02Bu8o$jl`-MH}QSi3$$28;JaPr)byCbrdX3IJhjPRuclt_<3PeDDd zhgt-B#SY=gd+J;rYob-jVLpM-9rWT8^g&MMJb+V?Vg-9Eq&UmEFfdNLPCnKb}~OjCWu5*=F48UCeqOKPow6zODFWM z-2Cg0@)C`fLhu;WHp#T~H9@NjjM?Ygb7b;E7f$lzKZiy|{?=J7=j7FWTQYs&H;=7b zb^qIi#XfkJX)9^l?=I>5i{L#bRP#FdK7N4J4&`5@ky6wE^8^c{w7*bT2S|ieaJ>1Q zWW4Z5DPcH72sUU;z;Vd5tFk>5GXvcK2)B@(MlL1Kx;c;K?(2eyRllI|bH620(t=o& znhe!06QaV+=>Nkpc#7{4?Mewfg~Qx`1j4 z9J-zQR>t(VnBNoMK|eO<%TH|j$YMmY7j$TNDx`KocvCR}+%Q-g?cJ~tKF_Pow{Zkq zCzmfdpp)%frP^BzCPWPm5o>zsb8_v=<9rFjS^~id;)6KsQ|?wrlH3##7#bdhZ9+r@ zb(#d2K~NENjDu6*6b?q60apCVAaeu}Y364|@lm)}59<+eHh z{^0JEFv1oZ)Gbf`W2kdxePi30O>UF4Db~>_oAj%oysx&^3E)TtH`Y@N5c}jcl0Rm9 z|Db$-rZDf>gxXvBsb*iK>Y)YhN8EOjZI@+ca9Mh95mkv4>f#OuWA`jJCGPfEt!XwJ z1E^&C7{z9@6Y695PAaT%!fd@`vNBCwIY?qXeidz<7dFpU6&_)XZos)uz%(|vpb4Isb_ zSr3Xz+h?pB`=hT&u;tCpYlfiwPZD&bIOy?9+2PJ8iSclsguRFIRhTfo1zcxUX>!jelY*D6u*hc|)P9wR!C`^WIeOYd)k>dz;oyuKfmXTAeRr!lv|Sq}Ac6w@w& zGuWGUc2y8v?CYHViZC;-RQBvaaTTxAhQO0o?d#didJnfy-%Hx$4@&v8HqgInG$eos zImKdF5(NVW4qo!Dt{>u$0@~a0e9DE`4=uSgHjyadu}e7YqK{TMQs`l+Vq3e5@ivC3 z4{NzeU(lG3a*9&&iJ)a^wXc}JI;Vo*OO01dCMswDw^;vMT}mpdL6B@1wTSY# zTm%lFa3==uiy#N#%FI7Y{JBEchLV=pr00N&@l&M)vr!npnIKt3g*zW`4+}G=p`cGn z?)iM`nr?@J;}?X7hTV{UI?bg2PWzoI;73_y?W9P->x+3!QG1C}6c~k8vaq7waB)(|4{#gRLJAd;8Gz4^B2 z&#RT+6Yi9jd6o;5m1xrC+ocHKEU9Q0g0^|os&QPVSy&G<* zlxGgJ?x&?*J??-rDm{(=oW|u*QXqmFrios)1C2d`$%mp*hdqpT57o!d;CF<0)LkN@ zQ+)R%RWvVW0}o+mJ+a7rx%wnfS&ZG6R`^Bri`Yb>Ze(TSyWWSdzQIW@;*m+@bFJ9o zqoJhjfpmKxoIcmO>w@EtLsUhn#s9M8G%in8D<5{t?D0?6zGEZWEtckL7Y^i!zjIz* zDd#@h#qonL`y^fyek-&90&UsJkc3!L-hHD%g$PZnY4d%m?PvIH2@%8*pM(qL_~EIu@Q)Yv8Q#?6KfeL&65`h|+2Is4O+wdMh%d%FC7z#gCxtl8KY$&mq$^ z!^^@Vh`=-FBK6^SkqMj8wjx90k^hVUquc&{S%qpbIzgxNo!|)%kncOku@FNwc5>EXc)FaCqdzBKx>J|2cC$3tK@; zA#y0RbJQ#~hm309V7(hCY!FgFDbdO4vl@&9Ymr=sb8;TzzQp0Ybo5qL%&3v<27>C0 zf-3cEq=tX^)?f|a@dl)-pA91=!%4(E95JML+NG2@A7puux!?iCd#K|;lfWeMCIT?Z zL!*RH3YxZc1<@9`o&ps_a0wHplQ02gOsB8q;jNX-7qfIDRC2jr%q^&NhzP=F!dT?4bMTEaB1+N-zwv zN08i=jA@A+65f*wa4YIMuf%1h|FZ76-7`3GCU^*q2v*CMB9j8T8=*x0IKIx}`ijGJ z&1yj_k|sJ%v5*!S_YDnlOt?c4^GD#5-H5P0WkKwvUfd_zj+U3Z*C-213oN5~8cA8$ z1WXdsWbB&bjN&P`#QkwsLNqgQDlg!~dH~r`<4?#mZcMe8I%yFh5Rdx}s`tItrfZ!NGQ7RHq``f;YxnPf9qx{O}?CqBQSs}ll zOapTo(z5N3Ef$kj3{DOumffzk)7_Xa7Sr4)U=^V_7_Pau$x0ir+_phG6PHaIp_^*U z>5uZVzT-Q*&m|@sOXEjlFhn@X;Pc0Xf0aUELf4{!Smrp!n1~t@qzzB$`c=XEI$}&c ztb%J{yw#w$UNyf=49`OM?f#MZIkwk!p<9E$6cHpa1XZB_$X76#bVIp5`)uRq$Siw^ zGJXAQP}OvhqL?aTQiUndiJH}{*t^VvJ;8D`^g?}x&DVLuj-q5#q*Y;(9j`D0B zK<}hg-0*X)?=EFEGo|n!%Dm3?54Bh}DlHOer!7G0%qrE3GUbd}_DTyPvu+JpL7=M`6C$-A8mL#6!BbU?9Fi73GH_5kLqfVf`^I z{Dagej+7fuN*i?;QY%3!-r8wEcN@3qiJHL#i4Y~c$&rP#jWqNI55!NdP20~R)Si0+ z#@OMRD8+eI;Svd@hK9R%Wt<`6txc6C7b$w|CGI~Hyr;G<8g7}zT`QHe)p z{obV1Hk)+w9)svhCRRC5*<~ES*VR=ymif>pfUXjyq|g0Bi(NTQC@ik@Pv17Bn@6Ib z7BdZJyP^XcO#?JGQh zyVyWnBupOy2K@DhWQd8!Bm?n%@sC@yNrgHw0-P8bqC;D{_V$*IXQV!cneQ*SbYHPj zRBVSmuNqp)eJHf~T$W~vTb{Uj9h=8Yn=7>*+d8n0BbrGi_1rFkh+U6(5oaJnw*f~8 zUF;ch9l%~{srCJ!S9iK^sq}@tsk;~Xg-l3^LYSMC<{#ckJlEb+^MgY$_wxu~!ddK~ z;X&nVFYJ;OhL$(JHn{IM>a+79_4Pu-lEwDu_!P|yQQWOs>$u1)-tx@Oy}{{so?Lpn zRq^`u@ASp|*Gh49Kdg(gSgyHr zeGNr{V#$$olfr&O{^DMPF;=U4^p5ifj>3|+Oeqjf3^99ptuwFX+CJ-x6Oea>Se{e;h&`%rpF0Kki1@5lF49LAbp8fmmE z`&2x0QtVMzHpO@27K6pE_;>|KpVArhhmAe1%pp44&F7>s<`w;wnqO9H7upES|NNy_ zw0nOXfV#ii?~mE>+-nzJhoY11^cC3$$kJ`T+g0z~A4j2=gsNV)tCR0f7E_m~g3&AE zv`t6t(ECRK3+>~j(%ri_#WDpiY&KupQ3W#tBm8ER!cXM{YG%y@qwbrUpAsyJu^1?` z>KX5pjeNqG2XIR5()l*ZU>9Jdpq_})~j5_mg{MPfiOjHx0Lt7`~W8x#b;P{E0pVU14wO>gAO z!ZsNi&hF&fZcl`?vTldBCi=DeCyiEIC_S{BJ`q#jsC#!UaP22%2qlSHKd79RKry#o zy&PH!4TWvmQ55vSrHgLRH#Tc!9&)HFMvr1(iJw0@NP^f$6sI+O7J|dJ>%7-@{L){L*3VZcpvUg`|qOL9VkTI zhgLX!V%tSD$}hTf+WNTD*EL!1$xcfi~1tQVHfjEE-(&=XK@=a>@ z1I>c>Z!OlJ_=4t7-0m-l4!-KX*{Y~#Ejruv0*M7MzsFpIStx(u5xO1r8?Tpf)o0$9&k=%AMzNU26}}Z+>UN*qbCJb8CE!2I zdg+w&G3|c!l;jo#=Q{+{Ez>5zh>tEJl1hYeVXQnzYI+cw!pV~p>XC!-tnd>S6tgu? zcBg=>X1Xoa!m(9p7{@tW{5iws~OxL_qM`)|+yN`d`(J90cIza%7&?DCpwsZ+7># zmqw>bCR2eunSE)5_Q{4;MX51Q;3z35&fwvbqSK+$4WKeEV}bAWtFANq6-Ei%O?jCAN)53iU}_748^ADqWd- zH@WlAb@^9?lr0}7Rb78~hgc22lzLZ%kK1(@360%qD0Y6$2IRjX!v~OVtzD8JzyW=a zzVawuPVFv3T8dKjH&4F>jD~r{JLL4bhWaVC!)Vxe9{#M+(Ol=wwss%5y5ZL5R5zyVjHKSV!Y!8H6zQ~YvlTSCCJKA#7aJkVKWl|w zAYgMU)6?dpDxj0cedl8^e;{?BMnWX6&1*&RP0DV%>w#it78W@|25Cx!4+&4R`z||W zRv7pEUUFZ->7#k zqP3DZOP%77Ae-e^t?`ggHC-z|~&rK7vgo{m?#{@aA&*7yV0MSJ5f>;cfObDB*V)LN&epy3O1acaB z>Ey>elX-6QqTjv1!A57!b5(H%$@^txZPn7NY2ab|2VVkM zS=)*x&_Qw}8pQQYDI+^5s z7Ej`!zg9?U%O@-@jE#OUxi~4WxCCSk#I8^f8(7|8etL-;Mjq^zXHuaAQt5w7U4ktKV z^9BhQ5)CIDD|e|KX49t#GKHLj=jQVG#xh0P<;G;^es>${#RZ*qcW%2O%TUO~=b4mS z@Z69w-~Zn~KA~t11;vrWpIC*qzCUB}Z6T3NBH)N76m{*!G196KU_}LdsmG#U=BmWX z%0rUE4#QFj5^gfgH3$`%$kR3I-?w;eSO~`n+Npotj%>yRG}If+$j;2%(=*gxVs8On zEQ64kP-c@t5n@5zJs9N7=ufH)!)kXya2{WIJ3Wm&Mm}V^ie*Kxgx*b*MLho;1g+j* zQ-R+{2%M5I;m2;Y+>_2p@J&HY0b^tyg36JFN4RtTK^yXv(9^2(3m0Nl!}*WzUyT2XW$;SXau!2T^WzHCOI$(# zyw$2+DL7Tm)TS1t#w_)SZZA52ZMA+lovhf$M;U8Llx0|jWyJ7F2sxIz4F(!@Lbkrg z{v|RZP;Ggri_8Jl(fdTCjpfXvYS9e_hQ;+CarM;@=XVfHXsoSTg#daY~CIN z(e0>u2MIm3ja;+pHLLE2f6&akRjRgKtcFD>?|*xEJyZWnq!RK)OWmK6iv)|9_KQ+L zl7($zNmQQy(f0&a6)~hD7~!YsW2n6WwqmAdrxRsaz-jhbnbMLBdWpn68t2K784t=2 zHMs+fW%TNz7Z|stt}-*JNU6O^`n#t?BMu z5*8!LBLok7vrO_z@?J8zIwRqT zG3Lp3;T*KQJWeDlyd)N^$Y6_Y&8L7Lj^eU*A|$h-`~Bf5gV1(dW+0j@izC=9$;ZzS zDm5U)ht^1U8bZ z5LpnJTZqgZmy|KZ89I2hM^O+p_oZKuVA!^qSG^k1Szp_{tfg7E{sV}!+}81E?ua)$ zWgd-bv`8-9Qd3WfzOn-oQLd8TGP$ZH)TcQuy$;4;;OGCv)>}q30k{9d(p}O$22#=; z5(5UK$LJy5B?6*!cOxlCj~EToASodYjub^&Nx-ecxwh%MfdtD1_8S0eU&W&IzRgD)5<9?ppYeLIIvvzC4=C%OX8^9%Ia6@26 zVw~#BG{5v-GX~w;X4Mf=4|$e-SNf{r@~J(TK4nzmN21fW+=#0W=L4szOz#=0Re7q= zp~x=IzW|u(yI=;t$|=4&{d7#f>Tpt_ z-1r|f2RmKVj!bjGZu14IbuesWdJS_0I!~46}P{EswvS) zE}A!Ws!fO=0IbvQY%S1XR|<5dZLjhv_>o4;yP`+`RK^SB0t)bFZ4jk*JhUem>c=;y z-Zj2{WrUrZyM8$5)1LagLMSFIs&-ZxgqQ&-rF89+8=?6tDX$HsrPfQ8tMH9O4SOGR zJDc^@ILG9E+rqzM#zbI*B9r1vVit}j=%S?oFLl#62^imRW07o;U-rC#)XAu5 zvficoX2*&|JGLVRY{Jv2r^D}Fk+Up?YJCvA-5#(jI<(0PfW)VDf}9lhi!7u$_@L2& zI42xFJ{ zHH%71F-a+idh{ti4U2MtRm8=0k}XV|c9hD^ix``v#at>2Aa)NcE&Eim9goR@pg9s0`{6h>zS{Xp8twx642bF z5r5tH^MS29{@of>HdtNv7ZaG6=Zn8av*MURyg?_fzO!Ahy9J7478!o`@ZM4RF?{$m5Gygo0>55gX?0B zKThT-gog}W*!fCj*glO*HbaNN2N&a{(#h0*Z=SoPAU=h$gToN>UP-|DS3L0F+Jj1> z3SsIFVPOqjGo}ZX-s7F+V{HI9;W!cy?pu|sAeoArAj8C;EogefQ>U#8wYw+p-ui)! z3GAu=63Tvb7F{IQ>8!IM|M~JaDECSNKvc+MK9;R`pEE)XK(*dk=zyyz70_p?*##4-td{Lzof z^Cq`;q|yF#p>b_m#Cp3+xFW9jzVvCB`!JbvMlti^;Fq|n($W>t1e%;5c^ZbTyXm@v zdS%FPs&zX2&mt6Bs)^QI)^<|%$)2>1!t!G)IS2o`$V>>?%`Sn#0NqMnZU6+H!{mi}1T@ckTWs6&tT1M3wRP-3r4Ki(o`X z%svM~6hM15 z=Sv2L5&!ZgD@*_wFVvYc{`5bnm8Ih6HU*fy{j*Ac@m!lt0k-aNI=g&q>raX(iqpUX z554a&)L0Ymhgm13Rbh)`VH1J(2pUT5=l}z$p)#O{ooaX0=)lHps8rycQl)f|-f6gD z-4oA(_Xmjy{sJDnKj?g9d8rb)xKlLBO(G-a>C;%5i~@><59NJVV78#v)B(awV(-3; z#*-^g_gqzWIMgB|-URgd^+_LZLw40F*B{69|FB+g@9UUnjh`Fy&99J_ovj3H z)j+aua$1SJs2zoCNcaE%Ty=Mmz~8?b+=6{f+5)2~jLD_Rg~0F-aT9Z4P( zR5ibA$!RgA^1Wxtv&q6W+5hAEKyM7<=tp5T@$^}8k|HJsMmRn)Spx{K0bRtrHKu;T zW=`iqHJTMT)_}?q^jK2H5QRCrT~Tvm1Y7i0o-v_(!c^389J+vTrV)>zwMnfr#|25MW;{E{5?WnbXao+T>t)}3XluqpT_ML>^ zZ5`Kk8Ed5O-^f`Uueho*h;cjhY?A$r`po-V=NJ8BxjKe2Ga3qkiTyCLN^A7`jKg@d z*RXA&8FrAjSpNZN`%U&Qx8Y*Q7b~AS<|_(?XW}#}L_9yB+!G<&)a>v&<+*CzLt)O? z+NF>h64iJvmo6dJHflh0MnlJZx0hDua$ptfom!fT_Q%@Z`~0&0ja;=SEVrx21hUvm z>O5c#zPUOOr^@U^ABt+*O#@O5U{<$yeLiJ0{~jQPF;i#_~$8}=+MyvDyjusRdH zFANOhcq*<`ql>U(-K6bWe2-Z(CMDO=&9fvKJgE+%2q3*^P!%MPAZvvXhlnt5lW@nF z`<#IT8~GQyMYTiH+qvwPav_?0^^@7v^(;O2c^Thkw{(YN+kAO(;czEWR#Re&rG@lZ za+wj~*4E)}VK&^c3ev}x)sxhLARCTUKviIPV2)ZjorZdtU*!mG`|cCEqpNm#4433( zW1zfx?6}l)9Qwe@hEoz(;zO~DuL7>a4fFD55zV53>f^Ko%nOna4QmT2e`GxlZJFYn z*vyy4GM^6~)b|GC%{D1(un5OZA(wEzRDat6lX;By zun1`#SGMHqj@Nf*WevMCAC5mzk+TMOOz*!YkHn+dc+~p9oF4Wi5X5^o&stAoEj$9j z$za%tSee0#RMPjZPLu4t^)+7Nzs25V3jp|;@JRgqFgfBAqf5?~h$rIYwA^>Rt6)MC ze3Cw|N?bguxcx;YDzUw8McC0WV(Zte_j()aU$ME8z4%et48G-D;od0C{#Fz))eN!d zZ+~8MJZMZ~a?>1!RRVho@NlunVcahZz4Tw#8f)krheamTg|?|${CB4 z;X|!mpIoZ=Cqp&Rx%{yv>AGIu0~G={x{;;}A&r7H*@Spe#OKBhQInC9(|(C?hqF;O z;KU9*h!$23+7MdaiP-twImN%@mA8Ko*nLvO6956x&g-POF09u5FW}+-g&|&O;!DlG z&*SfLqo7jsQOXqz?)U(UIk8W)!j@d&028SY*b(`l$(k+6!(tuv-Y3i@mc*r6#mwg~ z0WJWjlUkKXY}OxG%rh|2{q4nT{LnT^ezKVLdzQ)TO>{_X$;I)J=|N}TlK3#oxE-;W z>Dd^InIKY|Ch%FRiBp9uX$rwFuQl6}?F2zSciJ##&0PdQ0y)Znuu#mOt4H*Pr>4z> zh60W?J9N9p)V@bN~~p_ zJnU7~gD1?g;L1Q6nxMbu4f_=#HsW%>18!M=9wD4eoq0)PQ~EV3P=L8|5KlJ5^zCQu za~RcGsRGegOf4D*93sp=sho&uScZ(-KBrF(@zRSTIxFQ&$AEJbXxiWG=&wVhKOlknL^0>^-$1b#-fc9M~64Cd|xYY^g+{CJqg z$4zydY%Z$U&iVy8)D7V++v4aU)$Cd75)^1pp>`Zv#Erf0oO4c}saCZvYX5$;$rK3B z$^k}MDU7Aalr1Y2zuD;n>Bkl2uo=?cCS|7buy`{HTZgAPQfAsD`2!=+-Vre4UPEO} zxc~1+D$P=rt)zT;IjtO7;{(TLoL19Qb9Yk{bMwXrBx|XUiAruQLASpP#w8q%od{oNITEI90ts%D?4*KmBlRS*uEVP+;JW5VC!xP7uBEEPpXETe{Iq!Z()k4kEd- zlJu5dbs?nA{Oxb%cY7wmQUkl0`z{oD1NqaE-G${o@8IYz0i^zV_mnuSl!3IV@TP0< z*w#0!2S5aMPA!jbwm>xG3i~zVd_xvvdcHAi0YRHhevq{=SEBT==NMJu4>QI#p7XGg zB5A%hrqlBIu4PuA843(w8sbO&V55hCh8UFQZN|vY7maHQ8~O zG3g_6RnjN+?5YIaN@eKT#7CR1Da)Kb2>e5$=AyxKwJAtvC`$2T(Tioau1a{{cIEz} zmE?M;xhyOA{mXCSbv187_eY2m;z6~#Qo{zmiI%E%L0>?45S$H~Wk(dbz~B_XHNzga-epGw6I$EQj8AU_$t zj9sU0ufk6XAqn5DA?joYOl~$8zOlz{y3HmPvx#x#xa}x!qEIZiy1wbXHk;pHa^5GI zQw6<*;&)qmsw3V>}HNGZT@Yx7Q_0bp!~I zpP_Yy$3j3!K2Yc~FZ!Y5ZS8y)z@FB*Q~tKX65b#y!_c@TJQ9u}2{R~$?Qw;3Qb;dY ztV=_9T->2j1)k0OK_z4HdbNved^GjKPViKHTOh>coO;c8x$2q5xG!bpwbhc~c~&NA;NhPx>1 zYt8N=-1BW<4Rwc?l8*_%&4rAWAW4G~k4NQ(pB<}#ql@O_LI4|2L%F7D1PN>!EyMEG zKVr)Oj5lB)ab`t=FWjh+h1A2`a!iQK$+SS{OHz*(`>CcvdB1u*WuZ_M@QBu=^<`ZO zB6gIRN)jO@%QFxK=aVp29Z&&Aoh$FRH;Ot`29I%TevdYX@aN$LSrk>u`2^ODzl(hq zKW8cLy|5sHfB=;fe^KC4Zw{OE`$L#BED9es1-0|%%ctqdO&8=(6_{N&#~%>a7`kB; zC!>-xn@V|y=K~*JwfnfNPU_1up7_NEOieeZDP^*zzpg7k6-T!}2HZT0`c}EFT=Qz| z((+?P29#2J$V-#A;S~4+^4-vQ&25Zu-^HM^!M|*UteVd59Io@R!)Yu%#bE0r@My8D z?^iDNTg5zy1xn{e8g6m=em?^siYu%EKuxdG2jO9_JOX_t(^rYLyZ*t2#-qz9BEB@b znyaVf>!&_RaUJwH^p21M7}ANu0VR-KaYba?m777yh?TK)$;6x^$f8IBwEXYjTvp{J zDpHdK(frw2T;1gKS0Dd!D+52JTJF79nbCnDBjN|nE}prVW~;%>4c9r}d8f2}?Ds_HM7H#R`H4|$d+Wpu9@MJ! znlOcAs(tUH?>m#n`QSW{!?SrmVs$$WA<$HAaUIdyYQw@ZxE*N~S1jdl3T7p2pFTt8 zMa*af-3?4U5U?fjr&2iY83U_6K24AG|F!rh_q`*-KpX#aynCFUNNcUjeliMXpQrB+h1q;_x!S5<#O@wBb10 znpLxNhWYf*KmXAuv!u>x=&3vN&Vv25;2V-h4 zhKIgY)I<}HG2DFWGbN>KP}0^D_vkTT1HKbe4Sy`W|5u2PQtCiB2< zLGHJD-?p+z4|q2ywXA1Dpw2lw;gPw#lsgl8-@mW-(ONRhDy;|^Xp_5G4R|_Fg8D!l zAR*#(L2pYq*T9awK6NoRXmUVC>GK;sSExj($ z4PPPWoOr&;D(Q_*CE0HM)AGV*UmXlfFR;oVlc#uYB~FeMB1rcQQU2RW~|XH zgKqqcWv``}%EY)?$Hv)ZYjpVBVC32>0M(D`*ce{wmri2MBy`@c0xKF#=GpF&oL-!) zJw95mITo6cu*QmL;+6O8txY+zyjyv%ZB}>vb*$xJCVH{UIa?MnsJtM$sau$Icn2kt zuRu)m?zNEOT?|kf@KTvuHd?_<>b}F$_zX2@qr<~9%*_nQwkk1ipn8@Ukl+vLi@L=OLJ{&b`bwLjO(abr0$i2hTyUI6 z5{@rI^3>sB!V_o{{hav+G|dO)f*hZlU>3re%&g`Xq<8K9nG+`PfH)OA@=x{C8oBxW z-|N2|di|dpdig~#;bhToPuMcP9Tc95NqkOliOt}I74{xNvEN2+5S7kYGO&d>l~>U}eYrjpYO0-crvpN@zCXCWMxY zJ#W_`FK~V4+)ZUFN{#1&Wq_F~bmb^%?n+R8ssEaI&i=gcUaaQ!disxgGqf}cE`qRA z5)bRdz`+6eg|D=3V-|HA$h$O`4U5bt1m1M!p{%iY8bx1n5Juu&rs7tRF^SC;ove1e z*51OZH=N9)&3e32%kodi(C3Px-J8-mTq(^kRnrRlbe9rrr@@9|2~fD^JG6MzlWGiT zTaO~a=fclWPoRP4iVQb;4AQE;P4y=l!m*9QHn|>AXf$wb^8gR&nRzE%hbhH0yH#8s z(H1w(kJ8K~rd>?@r1d_S*-@Eh^UU#vMH>p5ShWNK##`@hBpF65^jDWw6)gx9wL5D0 zKYt%?_JVabSX3S~{R*g$j`HZPyAgrW@dPa%dv2DQ(9_pJqzRGD+xcrbRRsSAQS1ZR z=Tu2MA2rGLbPvMc_*D8sOH&2hTOp}lJnC#Nt(AYqTh|^zJpBxbTbeN?gd1`%_^|>V zvG5i?&ZB1KPPJs`imlUcR8B2J#QAPZSzV4;GkJekH#~kK0qV$$+SHkNQD6~oqxsO3hRThlt^8V+w29DcC9@mMdAKo?sz44 znedQo@m5z)s<`CB>}`I7KYJHK)yXOGTs3E^s&TLu|7NA^+rW6$BPfV8{GbvL{3+OR z>D`r^l&_XE*{9Ip;MEtLnKaI=zQk;b{24Zr)GEW>dz+Y54212vxk9T3{Nh#G|7MNx z1|EBQ&9QYpmDOR^HxK{&HXcYTl%zRbr0xE%XDONGF0P`$CH?nGoX9gdt|%HRma62v zbk&%@atd5E7B8M9)-l;^u5^D=-I|#_ZVarKHh1KRSuZ%C>^&yf#y{7Dc37?bIluXG z0tA46cAwm>9=~r8URx8u+C+J;9*=B8+U|e%dTzAhRI9O zRuSzV6l?>1t(TjRH{$oo-HIKi{wn-n{`nYKSavJ((z|~Y5c?sLxnXWvd)}o3c}mPA zm3_X3&y*qcsUnd89WK(6;iS1^@cqCYwiZ~)Ak>@5uzU4EQ=@vps)`xCaUO(Zn_0_y z2c!#jWZxD4_rTkt2i_~2K-1E{UDq-mf2LGE%q~oY%~wkZ`&|7qW-vMXDuh}!&+A?( zH^A-qxvl|UD(4B&%Pd^AwVK8-^H`>vA~}SH%e3uHpTyhu3hzAE-uXm4S(2HZBauqk zvtOu2YY+%lN8GbU3#aJAyC-bredwIirkJGchW6{Zi;@|_o(S`XVj)!%;v*OPx1m5J1URa?$jj2kBcV3P5GQ7(CO6hzc?g<1#4R zO2p@f)Ovn$h^~_M_UU#p;|}HByjx5vDnSuTs%slDMJKU3p2Nn@SiY{7bcS_z+GG5< za=pRV*1{Y4_#XFgCYwJImchno3n%b|LG%@N4v@Q=9c};jeGkQPbn=hc&W#<}6%u}_} z2Nw9EM)fr75{nCyH@&w3X`+Q@V{+{9H!SqOpFCi<-8trTGrna1*LK(Q(_d>QzZ39y z8S^}myJ%Fp-*3Mpb>I230lhfToX^LAriiuV%&6I{!>K7vG!+9eRT4ZJ&9A046oxrO z#qT*w8$Ea)E$^ZEc}U61!I$Kg9G}~?q|>3~NBYkGdlVbv*7)D1C-+0qG-e9TjxN## z0X1d2QKa9n_@y9evG4pFZv$47&AL_w?sh)TIt05ixglnOB#T7j6;Xx><&n}F-^2|f zN7>S}&EBz+{S9uFWJ>*q&=yZ>z!12yX6!k@ShOD{lNh$@nvXn5H#))-oNl{D(k!6V zmf5|cn(h7uHn6)@9bhdowsfam8PseU! zybU~EZ_qvn0sQ;W@nwX-HkvSc{Aeg=)Zcz_FNO#sR+&Pul*3i{<$t0GhY>8r4>Ez6R-YCqeA9pVUrcvQgR;*NO zop+}dkK0w{5YovG2dDzz<=g0Lh0;tB$XIxuMj_5ur6+P0t^|N#S04=bb2WedyXNq7 z!YL2`-Np5UklT0c=I?b^ZDSgBGGOfG@BM#{#!||@)8uIa$4BW7UfYV(T%$?*`;xYw zLiQtlO>96)QaW`nher(koKz#X**uTRw_iCvc|uzBFEneF*5A5RLCIMRXS|Tyl?vW1 zjbD?$Uq9T2p-d(-Bx*gS zDHwfS-{WFO+#W%q@D#$>iy!kQj6-=lqAIpg`B!L~`%${>$Z7Csz=}Ycy5-$%2zOF; z@Ry=oK6dCsVxbk!I6!gXp-`dCQ#o?}An3SgE)CSKWyp3j3sVx#4x&OhSab~MjYW$W zn;nYeDO^xcWvxBKGDMjDohE-tE{NJ{q_R#_KS;AV^BD+5Ig$%`@*`&VEn3fP0_A=j zh!4N;+na0H!t+3?j&o3e25T?Kc>18+pclpecAs9C=qK$MT#WlaI z$;2A755j{~hSXMl+IksmR&xpt3ErjRA2Fu%Hs*>UCV=M*Ti;GKE8v~=-abh{DxDb1 z1i(tiqXA%hkp!s-Vat+?pf1>5j7c*X=qg5vH**gyq*z+o*^R%TF67EzVw(?L?%egg zx?FoD0mE*>D=8D|@$LLjz0+p>^!5o+EJ+s2V1R|nxT~vURL8erQ^LlGk*N>7dzV#7 zSi+bXRQ1M3?VKU$f8RttHgDc(O|PYn#k7D48~-(7th*uupBnagW*s9(sKJ>|S4tlDncicv(Q_XX3ycgm*dCEHHCyU*SGo zsL8*C0wY_euT zNU57_(xxJ5ULwr3#^|pkpFHnfT#ErWXhfcWh_>J7JAUf8aY8c%|F;F^Sn8W zycQxl^XlsA#0U}mxXU=8&Z7h04ep%J_>oIjKEz8SoICtLY$;0oR+9Q?m1i@&(`W_nES(LO_+Hr zo6uZ$i3r=#f_R#pq)O^?_ZNj_PePmh80?857M@oU5BnjpZdE%;RGcuPMpA)7X9HyZjk4fUt8Kxdv6f8d7+eGteYAOLO`vl%kV!G;&`dB!B(E zWh@C=)zS!5QTKzFcuTqOic18Ozn?>R@!2LE2(&}*~ZTa7Dk%Bg;1naa*Jy$>%3+;?fnfiZ5Ihl+yl z`0(`7M#&fGdtOhkc%;WCSluOlElqUtL${!;)cp`E*(M|B*U)^c)gli7Rw{TMWWh|F z`6+5xU&5c3SMA`7Yu{9x$Zg^IzF*mIG7xeU6yq&h?s&cKl|C=pPivC>u{H9ra39-_}$3gHo94*eF@mx zv?-!FjaT6BSNYckZVdFm)#JiNXtQGT#Mjs?WpQ%;mi5c3`G3ph$_`{e?7Xi6Bg-#x z3ulcP(-~3;NfUq5{`*t^;HzDkH9c`O%q*UzaWw9(Heg(3nI4;ka8lS9KLB{}4Hpfu{HMnvoD?>dN=*#( zchV4H8SJb=)21UKe>yI+jh~$lK1c?SV6&`lW@9aKl49$Y~ zGQ|HPti-|Nmw{(-auVn2Z@Y(j&vP!84y0#A#tm4+QP>oFpfQpCFhy)1dhRTDn8zd) zn|0I@xXR;@CQ z)y_*4VsmF40|?o^Oa-DGK|U~H2dx~TcRD$o;Gqd|*TmN{yhGYQPAFwA5GaLew?6<= z$QrIG=+|ty23ENQ0XFI0I5a+kn8}KeGJ|a?1`26+Di?JF&U!g*4mWImzT^kfoI3Q{ z?8-s^1}z2d)d#a=-5m#XLo&(;uh%duQM4eE@rs54&XP-%N=y=y`!x@ zve|^Pq7c^F+b=jli@ARxJW=*qnwb?F|PQGQoF0bQ?@cpStj%#8u-^dsMe8>I!^KQ z^ccivEiO(?5w;k^*ZpI&6#fJ0k93jTG|ML$K&L&;f8W?G^6SL->NKzpglFh`m!~vR zy{xY(D8oF7Ksa@-0EIN-m6VVQ&#=O*wZV?&~JQ*P1f%kTaOsjeC`lwSe336 z_SS%f+oR61*M4srn%&)N?IFbt4D;u_3YAIaU$@q6+C5CrNdC>$GcsQ5S8FS!_NHb9 zbXEg@u}sDP9Nex2C2WRM$aawy5cFOwb(`k-Nvq}lh22UEzM9Vc_g*3^W_l5FssCHVN2jWG5 zxyty&SK9$pTxQmeTRlGij^`jvEKQOgR!gH-;o(^2@-|uh9t}1B(jQlzev6Rc)-+#4 zq^*ubRo#SI)%NN9g>CLPTMfT*wC6DHkb!eBFxW9D?G>Gu*VISpVhv{^N|3tDx@c$C zkx_oFe&^4Z-M+K(yIjB>&jMKq$m8aJiH}o=!~TlRKJ8^Xpmif`IF<_D+aNM0=0@xm zO$JLqsP`+6tHoduC*U|iC`vaBUtJuIuiB7uu{Zv7giH2?Ln)bs=36#LDH_(w>4Eyy zcHMKh@1bQU@GaR5?cGNwKj<0Dq0~YI zq~ucXZS>-!%{C>i0KWU=ljUIM!L&)Gj<0xy$ALk}G6zG~pIZ<8Ab1eG`{|k`Q)6&$31Mz^y;mbV`%Am7 zM_CSO3@@>7ejTL+np7fZ^&)bk3H z_n24hV76tjG_gK|qCA>)_)bbu^u5(xbYMGnRAYfwQkE-V+?hY0f;wVgn%`&XB?W(3 z7K__0%YS&dUlD?Ddh7|G4(*6@wPEdN&9aXrnfL}BG3lok#e?{pg4Wda`D%>)xzUjI zEv@qeSkY2Wh3q6NjcsoZGMwfg#5U@-%K z@4as}1t+Z_x=);2>eXty`$IC5#8O}hKrZUXIPPUOT!~k)rGv9Csd>SB zZ{k=I<%5=_kl!cV(yJ>o);TWvys#s4p?bCFEqI$#-wz~LMUxZCu31`*8|U2O$ptl^ zZg{u@FVJ_Kc$w-Z1A#M+9**f7WfPU}oprI1VcaP>XTm^iS^EgwNXf-XlKQ#=esN83 zB{wS_DSW?ya=llj*$oD72?J{Pe;?-S)hh};6-S1(_tu?{E^VyDA9%Fy1f9jQS?3@7 zYxktg+0Ko+F9#{LZLU+6C4X<~8qNS!M&K&$jJ7ThwZ85T87);I7L|w4xLg$p!1?DA zZUmJxpC;Q%LfFoW7pM-fH;D33&a)K(rzAie#`*bq>&{@`z7A3QF{<0uY3ae;TS_jS z$RuDM8?RrW_O9$2&RoZ$Sw-^Ye?LQHF@5iHoz@e`+NhrBSc+ZGDJzDBLI_E>zL&qT zsx-V(8pCBnXL=57#|L}nuT*Sy<#Aqo%TmPI(k{c~#}@g`n8$6>5U1O`t(omX7V|_( zX7=A}x}Se%YrelJXl(7E2>kWn{PNpg_tvTwf@cvsob~Ht#q8hP{A`w%-CX4cfxPOc z^J)Jv{b=&${H(F^%EO$Cpl&G7k?ZZL*ujs-4Zpmb>uu@yilXf*Q{pLAO$gsGDg|>4 zohK@mFUjmq7(`v8mZoe@IqW75H*nlytfiLlv(BTS z@P9+p1PSY~*0>f2AqQ!CPUNZumEy(r-?|y!GhpU>XUG$};TaNlr8}A8qJy1Y=h_Ud zBL6C_@Q-3agfP$1c9^FU8tWW~m2uu~-`@`MPqOLItG<}&qkFzt-f;H?Or)JDrfsKr zzYH(q_)bY0bl-yHrSk?7tfX+|>u|sU&j9atdhTF+zXe2U!Q$iccX_>rEA9M1B9%%W z6O8~2$loY_g_xz*-IatD0GGQx#@mhV|H9t?4P~ux*cJGM*FP#f)Yv*DhMOUD zC8xVJwDl~Z)???m`tj!P>o-J8a(*RI(`~?E1nV( zD$G7Un5bTg_?cq^@((}7QmKSW<>v}iQcIu@2|`Nz)UkDCr4GV+%g$9Bgu1?FwSlO6 z`9i2$m>t_M9iiZbj9J2oYKvIHu@rj4S6?Q0$~f4g&|w-rx&Y2&RxqwSDA=H?@^)#= z3n{-tzm7pK2Em2^;iQWn+subQD)p}^qh5d`c5?Rvyk$?Ie<3tO8lSUp}?Arhou zX`Dd&NYCqEVX8{z?w@G{99q?vvmo5DB=}V)naYL+JxY6{8kFP7uv8DvG`R{U1PA@yrhM2Z(dYenhO2TAs5|vaad0+%=|-3LWL?2 zd>`D2j-`nibV%d71#4txk_zShBC0dC%g%5>r8sxaF#u+uKVFlJq<|PU-_)j*d9#fV zM=e*ae=~NCZw#vP3%v9;=Ss<^<)p0%ii@pPX+G`<^L|3T)a`dVa5}Bvks>Xb!!B>e z^oHLA}CHyWlvE)5ZJU8jW$tq0+cpB1CTv`g2MSoh>i7-N^REUBk4;DvtnpcVXZ`a6n zCW2s$!fk;R4u1;J%H5O9mH|04!iP_!0Zl&(P*=y3aamkZ56TzgK3cBR`$fOu;tg&M zbMQ%G*M~#hB)+00e_jVmKAWgt|0D^M(=TGh)UY4fRPh<~@tLBq!Q=eTvSDEq99*$l ztSEjq`vsvgV`~K;O|hIqr39QQHd0ZDv=zth94(smSVK+r;fh;3NdQW3m(d zSolp-0V!mVh5+myXKJD`3Q`1dX@*!-H7vy_;ZutMaD04+D#RR$j-e;~XX`knnIB8TqqL$Q!%vp5 z_P)e^KSKy60mVOh_|2}5L6RjXikJyIl({?SEF=1EgAu&nN0mYTkKO$jR-^ax!Ssz& zO4@iQ(;n2EvG`UT&HYO5JFyJaJvk~_2jxX}3uS_EV9c2@nyVl*q@QgMm;|VRf4Cxg_1MsbOeT zoZ&Hlx|>^4?!O+$?p0^r;e!2*Wper%G9p*dqUpTv#u41uctLM_T{#Ao_AEhZtr4Zu z2*NZvp09`R54zn?dSr5XRHTI@&c8mu-L+yqfx#P}GbBDf zH*0$(-V_xo%=4vyYF6p^@s`)+v1IPmWP!vmmB4wDv9=uMveINSoPx=nS0e=;)@<^L z>O~yN*#mw* zEK_sK4LO|hP(0iaNQ`)y3j!$#-AvSn*FT+)DP|$UCemOMLR)xpkg(eKL*7+`Vr9P> z>RZXt)hnndqNX6^Lwp*$`=SSTS=QX9^PNLi6>MlLhW;AM=}g}%q4zI&UQ@BQ>yW?5iezp7pc&J zgoVqFHAw^b?~}~@7a=O+iS(k0PNrRN-G8BbD6W3$mAaleq8oHnh^n%T?n6VTwI=hc7Nh)x2@MWQ5&yk(TBxPuzf4&9He1lK`H_d5 zpLa>_?eJN{2A4u3!!ztS+xSd6ZWNeTW~xOS)E*2TD!%^rOkWM98TVS9r^cy%?bPuI zaW>PE)h>$jQ_aYjaeY?zi-ZqsU#FmnN~7`_L>io0FtE+d3@u1yiC2_SNlYUUN$6b~ zysV4O3PJAIIbbFVSajgmPo>x9aS|K_mp_&FOS?TkAid&a4$6bUu3f( z_BPsVI7#>y`yadVGiLX5sq*S%ysf7^rMU(pc4+H0?VG%zc1oVLTaOTI@p;T zQH%j&rMkEOCOj_#GNk%MN8i7>1D!gb*pkbQHo~));52E+!+vs^3`R9_w#_jZucb;@ zDd=o%PS7s!GB2Z`b=Mq6iyiDED!A?$;rpW1{08_rO%j`!Nn7cMi66yg-S~Wu4B6KN zki%b{=Z1p-@3mdHG`q^*PME$kd3MQiE*B(17TB|=)tss{|GMe4Np}f&PEth!i#=~G zd6q1D6Q65$@{xD`B~ro%J!XO7MWOi8#foy^Kf3u{HCCPY#uR?GU?(jk1KW_s2lSYV zp(M=-SlyOrgR2Mra-kk3u5TJt;-j}uxuPX4Ey#_kfBDx!vfwzo1b52KVvdB#KpuVQ zycUff=!O@?5_HZ03Ykj_Dt@DHhWHuUpqLqk8E)7g{XYCHx8pGpIy5{}bbuV^POrHr zzN_4O0Tu&l4g}PIvA3!DC+PRaAyGJKpI%>JSI0CdwHVP%*NR|9b_(qZI;bSzCqSZA zO6;5P6L`~f@5}c0Do%;c$*}QU02y)!3uO|0V9SUWg}xw9CXLDT2Z;sz_|1yUpL(r= zaovjim}yV-qtDw6T)1^_61v0e`W21?E=Bh%m`EL`B^lx{aVuTWyI(aT_=AhLxALa- z13;>eu>{AEfI}uZ^HQ>LRb_ZOTcoK0?{4GibDOQ~15{<+ccvWg+L zS~cx?g$ua(r3+yaP|t+1?b)*RpJ9zS>^*VSvHYjz2g}KY9ZufrGEHRQ9AOWH$R}@u z+i!9T8|zG;h&HoQfoJ(<~ zL&+a_vYtSU!^;K-z^*pXH~$aQV3^ zWIM@=gsY-SHM5Z0uH)^HhwST#rS@cHs2o?w$%iLAo4YL=4_E`EU(@;e)-3my;6C{= z24OEhJgz*?QgZ*5`VU%N#29uXibn=ne?2g#!S8!?4wd$^p%bv#D#DMvFARBw{0>b_ z^-An~#vL=o@`9=^!6Y=b3AofcKl>14p|T@maIwPRo3lT*zFLlnJwk`xi-hi+J&G zRDz-g5>Vc&CV1S%Ipc$s&M3#D;z4wbcO*h}aw)3Ny)y>EZoZUJrio}Wb>5vIn3s~_ zG#o9i?;)k7icHcJEZ~ysKhkTp&PVTJ@}i#+r;j5F4~=D{o1evkB!KR8Eu7HS!aNbmC+Lq_a5Os^>KvQXOf8{VM)n&l+O(WVQrAMejr%Go>iQp=g7D=TaMS@asHyzx;{fXF`sb@&--G; zgRfD{HXR%Zs#LrYdT2dp;xFtdKTH4d)Z7wTh_zGea0_4JDSU_jH-Esf@CHyzkKBgb ztA5U8?P*f<)k4$a_mP%gBE?)ZJbAl5zWXH%H6bRaSFO#%UH7z5nV!O`sQELFQC2LA z=y;;v)w)Nw5a2(N(kP_{Fs0$C-PFk6)CVIfP6>gyUiSAl=c+mdG+)OYgOjZfCcQ^w z&J_N8U+KdSb6X$ReaM>8-XgjTB-j|&1HyetPu?dV z?9rB<*(U}g??sft*yhqHiXyDon3oOj-!p!19_sl&AsaNw{fD-tqWZIW$ktOIbx1aN z?0E*|ix}b#xNc{ygT6Sq6X%t<)?UVN>ae;yw~M^h%&*FBIyj34`2S(=EuY)^mNso$ zW@d<)9WygiOo=&WW@cuHnIX1gW@ct)W@g81nHfjFQ}fpIo^$3en3`|es*+0DwRU%} z?)$!$?X6~R32*nc+RT&`92DiVLo*tE+m;f&PH`n?Re<6mpft$RA8-gcte8`v2kS!gO76f$cGu)hA zhPypXbu@xq2KLfsydR_}WqW7D%c)hyAQ~gdBr0`ID@guIGUxiq^vfaV7s-zN3wyu9 zNm&SOFs2O7Y2nWPU?TO8ctz@j14p2SvcPw(TG7(7AOa@stiSbtdcQK_H%zvtJ zbp-2@m==Q|!IB`rEUOVvlN{-r?54;DDO69fwn|g*4ZFxPsdEJ-MCTok-=QquM>zdOT z>zC|Lxb)v;v*xV}YhSV#;XwO0$JSry{IL(HeFJWOMxZ@bl(`tal zjXx{en1w?bme?ap?SvLx(R#TwD8>s05g33Pev{i>(}HS9Y0-Ut)ga{ni}||P#Kf-T zQ|KAjZBI0D;8j0n;9zyvOPRpodr}T7kVGvZ?O>-j$W`uvbGB~=F8+)j=|E;%LMIxk zen{(u(K7Zf^erweu!mRh3>_n!0u3g*Y}P@8CMQZ|MHxfxYv+0ov`Dg&=; zOIV&_+5F8-*%yQ8(`ey3MEXTDl}c%;sKjE%5CJmOEoGlL2&K`~Jc;4@t{RA0fC1Qv za9UMxeR-69pfetYQmK?x0=`wf{C4WgB>eJz^N`n%Bf2uI1#&G3ZeYSCM=?{{+CawC zhXRFf)oC()1_vp<67g3LlHa}kBg8t}AhUUsOK1G&f@W!=r*ZmRiHK&-3{QE3CFK)@ zGzF-AqavXH!;*C_5D8y1?V0}dG{}ltMp7_kJ1{4HdxZ)~J@fR)ddr93dhMr?40e{7 zYamj=qWWTlhe%mLaD7?D19Iu`)tC0Tcm2Mf5grY^+I8c2++J0>rIUsvEb`u=j)Ulu+qJs#X5noa*Z8W_GX3?$0hX#;<>YZObQX ziM>F#|Bfh__*;Rg_-jEqb@+@iKfBM-SHzaY$$>IiJJg zP%HuVx+@4NgofOiBasQXBevRu^lWs@@)68KKWP)-I$#^}q3c)-aOm#@1DV8zqZAo3 zSo@Ig-|t60b(GsVP1Ln788lGRJa7+TyX%?-!YpJ#+UePNA*KHtr_Cq1Fc5G z-;p3&`!|G@6!c71AeBPH%@u|1*T%F@YpT0#sWsWA^_ZyWGcYJIG5ONo24!Ko8B+n z$nx<({i1`w%nqg?o-@n?)ZxUciKUj9Vk(42{g|?=ca6e#Ubnjvr6i;V4@nL-(Z63f zNyr+(CeHT2U*1U5&x82!zvhLtF|2R9W_2!Sy}yv9!HUI7XY5wbdze8kg{s`dccq8$B=N~mSG2; ziivBZmPp=<08uP$QFD%5zxCcFHm2P1p2DH^!8ft_p1PnG6hFW>%9bE0@iH$Fkyb7X z$8)0+eKjHB#{20^BpQ6Dek4H~mv)+`q@j2TStxlFeFpz`oMBgoR;yq=t>n(AV1TR> zcL3@LWtcOfgam787Xv5i%|Bx&Tc^Nz*l2+RmCFQ=30F8GCxJ!pfKOd`N8yU_wE)kk z6;Jk z)exg^w&1e9i`ojx$$-bOCxs5diX3W{K0xp(da{=+S78cs+|FA>V4gm`fWT^&KS49< zMG#eCXIdL`Y8=t`?WjY4Us5@PYr#~565VGsfihQS8L3N_{j@D6&?J2hzw2?!YZv4m zcK!aMT$8SQ<@Mj%m$H9tY9f0cti)*S^BSfH&c6m$lSx%T4V68-j zdpPyh5X$)3#TOp^OyV|vZx_UXsO7woJp)_Ln&ocZvizb^o;K`Ry>3v zyK;VmIHa#StMoCx$qu&VL?EDz#f4f<-3Cj^RFS~9ZHQBpW?7+ohbz?~_GRb_9s7ZZ z#H&QN1W9x!NzX#;$sfpIfIzB1VFFoz+KoqT?x!3eytXkzFw?EGzkuw;&uJ?qw?qfl z^|~+L2S=li$C$wt4Ck{OipBt+r50)a$1)-ri>$!1e^ke-zMfo(8<<(Pn{6+{v%k0Y%bX6o($jy(hU|420gXMFv+Ky8wje)?)0{au8MYI6V=xe ztvvOLwp~Mn+_eGE2ZdV5-S*d(ja1OZe)b!Mo!SV438yASGM|(?MJ4GKv_%wz0q)7P zVON@uixw@9a$6aL3QF=!;UEzxgg+C=ytJw60PP*R6CR}=A>8K7c@j-YWC(S^cSc`1 z&NEUH(XwOy{t-~tD=|7o5cOBlQtU*F-E|*&WAyy)gF6>s5PD{@*0ZTJG)m;|oF&N|O2C>2m2E;-iExpg);M*~+cNBt7n35z z%9G_Oy^K5sBlsC-Yj=OTXi&9~V67B!s zRjm*1amzA4;n(qg-a|mYRuhklfq6j*ASWjfZ*khPYJWQ1cY*=BU!Tyf=p{Jz; zN>?X+wfo7~2shyAv&fRiimw>DVN7U+_5L}Qp#I}Ij!^91&=KqrXv=n$>1?wI$OVWv z+|lK!!wI(w!HLC3&{I4 zXqXpqh98_|#m5o+o^cGq!?0EOqWVF}8mz9y6dB^r`$YszJ%=;?GMQ1<3XQ(R!0|`$&NX2lZ1t!=AYY1FdDpb7iNH+Ko+i^%WZ1 zGm<^h_P!uNcR2*+9z9r|(>eWRSVunIkMNSFZvSozUH9)1dpLUDwF=oLsY&4ln0%gh zC5xt{n?#n_x)AMe*Fe?x`2Y560YMtgye~3(A#(zH_A7Ip0<+l-*{=<;UbYr`maim_ zDd#iU`8NA_t2q3=a{Bz-ho5z>)-RDv)?)_^;WrzVJwayiGe?6S@iju zDs1t!wr%klwJ{7@rN!SWHG>zHvf2RJ0J6&>^LXBzpZBsQKjNw)IW1pQ;uz5OAJP!# ziEXdUdabHy4i>a>YpbRf=nGJ-=Jw+k7js2H-b@n6#rfnB;+VmKO){6eAf>!a^KeVd zGg^tH^^1*6CVkD9VKl#RBvJ{l2Wl+D*h*6* zPS_PEwrnFl4qM^?BBlO=Un5}feR_q+{(#r7j;Z8_`e6%%MnykLO8OroRD0yZOD8c)3(|uNHvO%!T1s;3O9&m(PxQ5s3^m^cyD3O{xf0|m`qY|*q)<(=%^|5! z*Fx;`q*u7>oZ0}-i|&;cUmlQK@FmhH@)G54>o-t5ZKSOurYZk&0)vR6T_4BpF1FiZ zYW~T{>M8=J=r-sj@~e7@MT6*lV5=HYqRCvZXu`NVo5F3S{<=_0n%6P~;^*rDc0p0mK?`5Wp_#pjzK z(a{51h733jV8xoe?B{KeCf0(0BFX7QZ%}f~N!iVc9uC@!9EJMr#i#N6-0+hIo7iuq zM(yq;)3Gmrl?4~lU=<-6JbL$(Teiqb4dLonSz3NQrjD;8w)^92`>P6tM`RqTt6{sj&Qd&zGUPTSh*e?s!W?1-9!l-G zfhT{FsV-V~@kL%~Y2u_7gSS^CX~ZI$e(~I{7>qAb?E_u|WS%9~nr;-!KJ)LSSB!nD z8cVQ)=bYm6&W7%-4vz?}c-f+uCU$coa*M2K)ZR&%J4q6Xp65*XI168ET?)H2%qR=X z=2c_b{Sg5zAUWkhoJ=pD(G%%(Ia9AU%=KrWZr9)Sm;EeK zup+sJJH&Fy#RwOW%OhX(mZo<8L~~KczU`KyGU)LSk}`@pj3szPukvwcwGyl1ai+eI zh$rAY4iG#UXr&QH=e4oe)!QGc;V(cdVXL+49=PXf91(XgY*_6AYk@uC>Z`AICO7Qp zQ_EsiyqG)d8+gL`4OnmG`H?#H?Y<*ub7RH6vR^O;*pQvlw&Bd|8$jAUzz8V-(lG53 zTTGXfi-|mlkWtxw#dU8i9X(AEwypWXl6Wt@o)A#ho6|!YTN_TO)T`6(DM4FcN}fhW zbm);%wRtb`lCVkon>?u0w1A(a5vMGHh`ga;_r~`A? zncDSF4ozpfG5 z;D|u%1SsYYB|ZRT@m*A7o)F7U0T}zySwCb}O$xO+Y==-(r)y}thq%3**vJKZH2ie< z7Rykv0|eb0>qMQ1@snGyMq0VMl5_yz+G87y@@rbS|L;HK1;864Q-;(z?nCuH&+@o_izo9=S4 zqsVFuhEBd$NNf=NS!W`9aEpE=BMddOc)_@K4MIcsRb)s-A<-BnAJcsUyi&A8+vuzH zDz2lmxPh9TAlN=5*V}v-JIv0I2ZHLx*>q-T;lrwrF}yi^>tLnMro&P1%rNXOfHbkHxoQ6M-%0zRs~)yUe+-7{ z^M%%Zln5|zdeG<*vlE6drYGoMs&Y5rSA-)ok#_*4u*#c3(}ZsCK>r#Y?n|=Z$nJec zO&k8pH+bieU5DND+ws(dsrQG!9$O)6^;<-#3sOy<+`3xpn@{dsQSCefdajTJ0cK#} zRZP$Lj82>MaVyh3xEm7x)rC0z0>W?IAyV7jGl>3^#j?LsYkCsjnl(x;McA}4LH2mz zj$AlS)Fq{|%d?x*O6L?8PaC?R&LvJg&@Ra*^~O1?N={sNB5i znRyPFrUgJYXwS~I*Gu}(Vli|^wm$y<1a9&~;DQi^S~pbjjB<5U;{TPb-5=2G^A1+C z#fW=-$UuwA2k-!!2HK9n3Oi%`PrvqRf;3JF`(pKpAkB7qLx02Izi$xsKa8T_AvbR= z&=(dynt`6Zt>FzX=l>*`{Tu0+AYkjCn7-A!>g5sfkKg$bbl}7P`y6OZ{$Frk0wn!= zoTj0w^q!+1<9dKJEAU;rMu&;dZp2-GqBn4}i>_UWpu4uLye;^HUFAo$U5Fxu-2P9? z<;j&4v@)Bm1bx^Ag#CA9yM5n96&2$*AA##!+U>3k4ULWWJIB<^klojf1qA~a*MAQu z(+^)ks?SSUXh9zN%kw`yYM&qPNVgzq9hB{IwO+FqLd2%W2!;<-#p~UPQS;uOVL!h? z=9?ehyKi2O$5BjAGg<%vc>Hm)DA*?rZuN80En3}yFHO!{<3!Bw=wg@tWv|`rA!Xb_ zW&W&UEVIY)|A^54LVE9f{PS#mu-uYoIY7eBfpSoLqSVefXn_y%Z)?Hx->hu{O5Gaj zA>en%I;8#H;$Z*QPNlc4zx`czx#zaK^#%To1&(o6W2yG}KvxkL+uUO1};G*!0; z-0%*E_Z+_OCBp*kSXp0rVAavKHr$%;jXO?O7^NXzmaspIl3QP`g^aow=J2-BS7Y=J z^VcvB>D2${MrbP-d853nS9N<+)Va3Zwnu$O>T|3w+emK zIL|p^X0bol#Om>h4?Z3mbS_Z+j{nP1^iNs?ih-VFyX}mZFc4V`92Ydu1)n3|`f)WC zo-B!2mJUK11`Nl5Ei;lc!4{o$_J3?HZF}#0)_Tplz(9$@5Kjt?FwfFGc@TemVNoWD zA+aOtP4XOcHfX2z%1;h{Kx2ymjv53Dwt46q4RDGT{7^GWs`s&1=VJc4GxKxZ_T@T- z@ZUqr|Go-<`6t5ID$sfJTsCy0t{mTa%4Zqg8pK>m6K@@E!7`(Obi4S)lBJ)AV(|bIrNc!8(QS!kguOn21Qr z&yDldwJYzMD)`qflCAz4+%o3fVOMebs?kTPnN9Gstdkp;m<5qD{n>!^4#tP;T7um< z+b8eyR%xri}`Q&{)7<*aPU5LxQP1QUgg7jWPgS5u&K&RpWm|} zgj0y1=bmoscA4{#NuX=mYbee8ZEZQE6VLL; zn+@&RDa4_ZuP58<-ylk{4!qzkv)?cn)lo5N<{Mg@ZHAy?P7$;WkZ_7b+pTbE{}Lj^EvMhL(6D!pvo`4!uW z5&fHkJeQw_{2#sG`=lr=WN8EjNd?S@&1aHl(~}nVq5u<+f>k0_ir>D5Uj=^0(`H$asydXt-+fsu zCkt7Fha@jhxA5@574mAj!o>o$gWy3UnO}7GNV}S6i>WeE5>6@i5g0p(VT8=Pc<{a( z+%UGAVWp$yuNhUf3vy@q9?i~H4f9Xx)6EUZ-SID@h+7b>);{S?Q>g1@5U)hG&-?QR z1d|&DjlcbWytZ>qZrpcVPPxKMViP&r&hkwoCH@TpcQEUI9Ln2gN$po@6C6N5Q+sV$ zZH(r~Y^2lA0(P(K)cr_*^Tv)*3|f%iZG>@0wb?gd51Zr$9146{2zrI0COm!opE~Vd zhgS9YfB+M+)HO;y6rB^dcWLjm{OgKON(p!mSAwgj(zh_ceCnp|aho0x24+hKyuu4q zT?-xVOuSIdNb0v6>z6_q!u4UJ;OaZT2F66ck#*P7G&DA-+sEmj&afDrr+~H^we#v~ zDeq+3B4f#P*a{R+<_-vXIF|iO`Bs|Pz_2c~x3KPyA)?+ zL?TXl)W3|g9jGHv@`shrdxZ*KcVSSWa7(8>fF%nktivzqeJFcHhUclwD!m6}t45#C z`NMsMY-aGr8b=;7oFvZBU-jkn@$mbHV%hv`r8!8dD`gMi&_OPjo6`B*fJRawQdjui z1M8T?*bO$KB&dGs@62R5Dg|J)@!Z=hOL%h*1Ty;F7XuFXTxDpST6P}o6+EkubV&$N z>rw=jtG{XqON5$u7m>S3gKcGi@nQWHM+*(_Sr2U@-aCA^?4pq4Z)$pqSH7y|Z3;M3 zggkf;HY=zg??tUXFAx`nTUhD0nrg_HbRy;zF-;Nu6qy%YR=ISiBp&2c3gKki`JEWWbewfV;-bwWJDs zv8;-6#&q>@0cBI=v%RLO(y!6yMZ*P=`8G#L?bc1Ymm7p`uD4vAM|O9k-e*O?d|#dy z!dK`9Ej)5AU3sI)ST4g*oK4b=+5R_8W9^+$yg}2t?O3a{umLTyF1u4~mLpG`BRG5= zo9p|qKZN;C5nq_sHHc5Rtqp3-bc^%nG6n>B4g1mp-WEKneC=vfYqFmw)Z1_QG#>wR zXuV?q+R8{A5275rh>Gld)1zN`BFVBn8PD_qZ$>$P>C0!D;<_75Uv6<}cA#%21Blj5 zyEdb|Jo}WQTwE?my7|1G>5zjB$u&4fPwso(f7ZHP$Sayi5?qEcPh2VJdqg5K=N$){ zQ=nv@U5?@ybKrbOa?l;yJ>R8KmJxS{rhNZgCMFph*?``ZNC_Hwv(ZTTj$1B~A zYsXy2I;LS79Qf2t}|UuhpToSBC#fIoq$ zJ3K-9q*8Olf``=|pa>i5N?vE}g!IR}p=8BRg5!EMH4X~StEzNRK~4;z!@>A(>eiv@ z-~G)WZp!$RSQ&^2_NGbC>y}!2b>2JuJHf#>N6+`dX@?NzyO6J8{j{V{J6(@u3fh=R z2(mZ8UweJU+y=)B%2==IHUffG%Y6Mk-*W(hVm5=7hc%z+Jlb$S6YL@a@C7XAz$H@V zRNVInudcmXuYPGum-?0@S-aD#WIzd%$0AC zv?iPjVJ)OX`->c*tCV5+B0D`FGwRA_*_bp?t8O9B@R6c zGxD`rqPG9oDF?C1k!ddK@m&b+J^f<7Q$YX5-A& zB?#-2z(%Y{l6s8_A*^zVs@vDNIwbUWblq-kuwAW$1M=6|>4R}wZ|H!fG#QQMdyz9d zf`-34KTo(#wrS3&-uHsvn2Dk=P8NBXckUUtJI{Rg>(wGkM?+&XQg=TMFXHB!`kV|U zLZ_8){Q7yGb-S!(s4}Ar*IP1Rj~a$lG`oheQQc4zQNwL;l;l*=`O?QV&}1=0IDfb< z){o0sAd-IS4N0$jP`}7{vn_xv(2b4Lqs6ZNNo!eLq{3t~7Wo7G`(SOwXIGBT*VI-u zLAD8TOceFrA9LGPac-un^V5QZ*Rp&sq1c*R2Z9^sY1p^D{WTc}ts`-}{(drJeB)>G zY_Ev0cndMsD=MUmSy{zL1vUYNVRA_Yss@@h6ix%JL=27g!D=vpRA^@>#QNImPc6Qt zB0;dNPq()c4<^N3pIo7pFtYm9L)9rxXZ={4lL((6U-#b_pEI$J&kR)_&>l~W*z*~A z?hMPR!^Ls*cVF<2OuW2Xih`V~jEWQCiLZHXtSH-MuH(aCzBHkdq-zZfHaSnNH|;@6 z4-XDG+=Pc(p6_0x*N>&v!4*DQAp3;8bp}qgzIDCYndtm+xjFg}ET|nDEhAnwc4nv( zHVzm8#6y9sx!7U!s2U`7J>88xpA56;zG(fTECE>lJ&OQ|JwSk{<1gM_5rLk*xVx3v zl#3sg(9;Oqe|N^=U1SlKMOQh_^Q*8*h+hv72^|4#stxE#O|JI`nK%8MGV763LQ{Yz z5;9G}DOI7L411hpn!dDFStP~Gdf-)n2k?=myE|C#^X3Jn5N{Js@$vN(wueqtPl&6j zXnA#Q{m(BZ9e0~uHfk4J*T67!k9MZcdn(n1 zwhuXObH6QefvcfT%+VAVflZpXdmDoPn0RXOd0kXndoUkze5ph_cQf8%dinx7bdNN5 z0uDB`i_fj%QLVc{rZK2uSLkh)ipUoz%%w$t(tFtBS{fVYkfTnTPD6v*esr{)Nx@xd zMkdS(#Y`YT2bqybbA5BNEx@=UST4!}zS@@|lP(Fr?0oe~rFOAhKlR}Ya~_`W2iuZg z15^ToNeeohm+x}O+lj8jYC#Sc7D4t@c(~*Ln=t_umdgl!>A#8k&Hsj!XaAS~sj{2S zA0oToyJ8*Ff_{v7cIzg9IzkW2Tf}9N4)pe8WD)ke13Qq^10f9C#5C&bK6&LeMnkxt zT=+_b^Inv%ZkOBgC8ZJE_RESe&M0!zUKwiXHRnpm_kqQ|a*cN9+vVgA-W_0B*&~GL z&?`_7rRsM$q|rDJ`|xL!3#-enq0#dq9%Eg8Q?lVw1ZW#x}#s`CCFFg>k+x?(bhb^U#dDdWzQ} zZI+{!H7aI_ItiO+QxqA@{gS9SBcb%VIPpOYSw9Lk4=+Il&Qu9(CprY6dBk9OUjAnK;yZrPv`u&iAIVow(&5T=M+^N|?u|0BwscZM2aNv9p)`^Tlk;W&Ki{<0GTKnjN{B|1Wm0C;%o>L6tlRlxndTJ^o9 zjggIYFEXs-&-SR(RKse$nUJGUt;t%0#nK)O3ek#_l2U@Kw6ueTg#{xIj|V@VufXR7 zPArjtBMPygbANxot-Zbdo5Saa&#Slh{zzMRczDds{QUf8g+{%#*O>cKpWC?|h?Au1 zDk&L?kBe(hPEOtg1Q|oE=Z}70SJd@S2f?DYr>5d^IURoYqAJSI&&S>;c8V(Da*2+N ztT#L5<>WMVG5h{4cAi*^uSCn+yJIlJV|ah7)9c~QsDi836!|7BI=V6SRsHql^Z2w9 zSGiac58;s`E0JwZXEzY4%6pZS%kQ0+i;i#oN4nr-*4& zMMD8y%e>h0Bd+q%23W-jnS=sAFO(yjjGBGF=o!_yyvtx@Qz!KFx|fXZc@@H9iU|yD^w+{%nEG_`y$_~Ty<6tv z&(`|wZ9K4emhjvuu;gbe*Vk|4F=%aor0?2&6)N};l7cS-ncMe`_{SHnup@nse756aYze(Hg{P*o6RLnPcgw;#-dXznY3%!s8for)WuJ)up<`%rhh4JE#KbB zdRVVI=!05zzDa8Ld~3lc!bN!AoA&HYakV*Sv{C8cXk5FE1S@XEyeUGk2eAzs0xbAV+vb9t;4GO zF)X+Gsx6nzwgYUt^;254YDW|f`Bao9=5uyGbFYaB^~vr(oR6tN?{2(0erCn(XE;qBA-PaI^O#x0dn%T#2tSLdF)0Qu6ItP=^P^Xt0H>_NaZat?iu1Ab+ z&m(AD;zCGBR-L%(Tc_Bi+12FPXv;|z^eH;p*4NrCibi~tKi9w4v$)F{wV!>9JX1Zd zKx|~-VVgf~GauLD4@fpLV(`s6X38&;8 z!ct>W;d$qEb!l~qm<_gbrRyTJRoOScIQ`vJT~{bu*=h86(SEU{s@Ck5J+v;JJ}mbB zju+{W*u=H+y5HT4>;0Cc*7ce7b{FbrM%wT)dmpQ9VUVoVTC%`U&gd$8FC`si^weII z%y*$=)V}W1vIXRIADvvl4UdjCaEG>;zrFi1udV7?N0*Vgt4dGn5P68xw)gfu@b!Qn}j3^Vm^)aGl-IpHD9vk4h#dlRR+%{-4_Ti-u1i-kOllr&p|r;jvYd=7gD z>|CH zt+5%l`$v{*L-4nooJ*CJxf8aNjny!ikI*VQpHs+Q{ORFevz9d|C+(jeS#*pu8`-~0 zd>1b>POUg*Bfq50i}mMMJzmuCc@~WVd7Vc&w$Ghyn!PHs2EttYDx9qA9zwH&(O}6@ zzmY@<;|4GT3?Tbp-&)9-fyIaoaoi^0VBp513lhA-2sv=u@xS)|t|wsdYEX~b;2f`o z7}K3?bb2k{(>j111D^W!n6_51Y4@weMVuh+uJ`I(fyMsdKHAokW|-yVPPrFjt(bwIG`zL)DK*FQ*8U z4ulaNcW^tnN)#S1NhqN3mbOhi*L}F)V=QhEm%c!g?ZIow;Th`tC^}RBM&{Or&P8#5 zNf*zvl5b=x9~;2E5d4c}R7XpGSaY3MDTHa5+Q(hU_uQCpjj#fFkhdj#RM-*Q^B zgwCWE(l5yPM~&}a_jll+dU|Dr)Lim1m{OsvC6e)II^S|aYd>#QVjN)5SziBb&C?4SyefHj{dqC|0Ex%rG&(ifoKi(a(0<@26vj`Uap8{+PVD*%brG;6X9Q;I%l}xqk(V#@bf_Vkj`^0w4DO zfIQT{TLV3A5#43dGxKZx28KR?ho<0QZisjK8Be}GkWW-Q=%}_u36)@l(kI}@i*83T zLn9Ue>@G5YxnSe)7u<1#2U;Ufr`X<4arCHsM7|WtCa9LWV_FZ%W z{P5V#it*hw>pc?F$>T_MjPfOo8b4-81Gt-QkP}sqra}WrSP$L7`T+JnOYCHDn3`%~ za0k>o9Pdi}lLaGJ>;17`eyz;=3RksSseVzPcr%y+qtIDv-p<5XfeK?QTaj1OL*JK- zP_y<93{bsYu5R8(ueDOGz?08lkBbI^^Vs;LBcv_3;?6pVn}|a&Eb>x)gJ6JwZtw;& zo@YUtD}Hx!{!a0*S!q*cODsHIiUY0rM({Y@b?I)J6+?WR*Zl;pr!3FsTaqmmLrpQz zMk(XjeQO%6#muEAESwgJk=o}7G)-yWSh(ogaO-`)Pj_nDH4FjA^=Cs_T&9)r7p(xN zABU==*HJ$oi*(YiNZ7fX%t1fs2dI9#jHW3S>YtSPp^_&!)C2?)vy(vj@0R=EY-V1x zC#%DlgJZgTi8Xq;DMy2I-Tsa{x{XHzX8=HeY31PzHRwIr77a;@zI|UVD|j14Sl;Ya zR@>alSJSrpjMJMuox;bL+0X7;L$DM>3bV@aHNwDB2E+1470#mtCk12lmuCT#rxq{V z1{`j-zNAApob#$5#bkGz3d88bvBVJZadw5Z6SV?_v8Nbn+^aRUX@$kRDFH6rha1Wj zp-@t2kGS_@6?>T#mwTA8{So)10uR}!$MkV@BDb5Dk)D$>a`qlP^(toITmUs%8OMrq z_5j|2g}f@Eg@e1nGu!D|&&B7q&Y0O}fKhL3^s!VOp})4KOx}GlY7dLT#>ivW|!+paFU&$%XR;GsxH#4mAmvRhYLK}R>e?s4{e~*PlWB?)QoZudZ zKos@9aF4Q!okQ@Igd9k6w>5~St4>O!M!}v?GvTWVuQ6ODK)Ie9-NQ+hG#JiUJ;E28 zekLv(RciM!*0cq^2&TX>~j2J}2lE zSHyOm&)*i2x?AZSlhSnR97_F$?|#TLb))KNU<_XF?F62mC3@yZLhq@UdhQ9RVqHMSGoJ01U~ zw*T}SMNg#$>$mh%e&c{ck(<4#qYRFTqpnxOX4;QA!vx?>fr6%wDAIflPdj}X_d98AF9K(88 z_dm{xtFce-bL4W@6al(jhOGfS9aJhfZkGGBOe_*}7yP;p#{p=T@cGF6aiA4KeuzMK9f0mBRho#U zXQ2ag1QV8$+itPH9`wf!@Me~@Y_e{u)$zN^u^GlK_y8k%cA}%FqByS36f~MZQ-^C1 zfbOA2haff*G;8Q#BwNk?Iz!7rw~oc2t^dyHt=4@HkkHS&YiMk(jaTH#*)~_lavLPT zBQ&=CDpHotC3QHD){9Gc#mIA1VKkrh)IJ_Ob!0QhZ_C~A{y z=XWh&WRbe)TgGW>@ybuTwUQ>QiNm8WsGiCboNH$y85V6`8}e}JLS$_Htt?#SX>q+yz}rYRWZYx(-8qj2OUT^6-gg&pXM#hQzU;k&A+SteYxqE<4I#d%ueC2wt#&qfsEQZ}+e z-vDP{)^66bw-oH(EFhurroh-Ndg0w>PpG^JtKh5Uy}nM zLJOICa5Oc?%hDDzYnM#MA;v;Ab$*$o&fSip=P3E}C=n0_=8_kXEQNF8iDhSgj5e;X6Bn8 z7k@Q)|2yF;7>aN0FQ$X|r^1QwNYLaZrP_uj)F+xak(>o7VYpZ4>U`b~r^Eeixyi)K zNBj#mL}mB6+2y*<1I2japzW{lnXju&PF@eVK_nsV<&9|4%-9!As$#(V0iiBmRkc9~ z5{T6@>)NlZ@Cb+$`7zB{>$aPn(}WKduN5~H-j&l^L+jY9Gzm`QTVC=xnxzSM!O#() zJ@QqdKlX!c>#Pa+Wgx}k9#9KOc!ZuFv&V4@eU}NbL8{E+qU>kH8FsB?!bb1*hwdm4 zqDt#V<4)}&gU2?jpd}d|_E+JI^Ino~_V7^QOx$zc?1w2lvDu#fd-Vr-layeuoQNFC zGoLy=#a{g7o{O4nq*ao*%GvL7@Fzz-H-D=MWoTCY8m(iu9=U?exSiA4H~-xL$VGun zAq8X?(=Oa}v31P}cH3wEd=fq3O?;}eknhC88#JTiIykb$%CE(`3yHzkyEvykMyjsK z5IF&(0QUfETt`=(%I3glYE;MvsERrmz$5y+l1|skx5e}U^U_il3z4znEY+?6bYCWn zEb>SpNWhXeDcpadRC)kRSVuJ_Tm7Vvu0pdN+!LxDm2<0ZV(f27*j>F&q=l)xg( zdQ=&91g9&vlHH^Qm*AgUl-paHrP(R%RC6q29RE>zV(3uKQf)Ip=;iueq2t-@VsotsR1ejWq;_Eg>ENXXDeHSaaVwk7EK~ z&0MvEe-(ZNqRIu)hUhZ|I+Vn@emnOP`vlo9f|X>u(nJi(zJsEICT(j~mcNqWHaU;!({l`9n(T7F4-6|%1r zu8_3%Q|>F_A~qE=na+%!!9TJQWqA*XH!}EV99&`~so{T1($D{vB!yTF78xSzVZ)O^ z1qd=pm~B}0E$Jq|d5dC^$Tn}l{xEpkG=8;!%+)tq>n+VVc$?V?n;#%wBan{<@uHZcMgBEYdtJr#K1kA#OiPh!$)$Pj(yH;QSCZ%{ zRu5-DzlG0M|J;Sprv|pbKZg(3fgTs5cQ4*r3E}uZ(KYv&;<}d6WoZK;ZZh@~H>d+< z*HTCvU)?@!LqM{8m!L%AjBl!Y-o}3}SQx|YjhP6pE~WPv zTloD~oS5N5$DwJaXUoP7abfXFT=xUW!t7#Fz=4Yy1+d!u!uvzEoP`FXv54GXg;LI> z?(1B~m!ASj6~kM^n(8WPiW!Q-|4&;>ybBy6IMqcwt@)!eFn7*upls)lME$7q1a*S5;Yah0!LoZVXY z1ax#E9;Fm*F-j&<{jQqweFcx?w^H*zs{`>|)HnOeJdO?(oB6s6FqkM&^8KVqe$bcf zOJniY!`D5fBEtqNlaG5e+`U{K!MsHRfjrDDc~x#=I(YH?-M|W~SR8X_C8KzKyY2_j zs~?6vM;ic-za4+TvRfNuhe3zKr9B_y7`WzEU)aO44Gm~ue}N>pQmnGXf9LgsIQ!vS zk5NW#2%Oq%IqAj`C_ug5`BfNFXA1N6-SNtIh#CHRs8yV@qWeq%$)=-#b(9=54A!@s z&5h*|)|TojCQn~07S`%ypndeq3K{prigs>n zoBn4WrS?PTk0eD|pm%Rs36gSkh@*yu_{Fxou^kB`4h4DJ@(~oP`Ss@_Jc17k-T;pl z6&0+qY$?TDUs)`-z0(&~Kr(x2bIb!gC)A5W#|Wz2{^hZJ94=8&RhVPA$=-UQuMvO5gFmhg`Uv^r-~za<|YcWuu?v?Qx{) zdYfBIpSOShAY!(44c0(u^(L~Ay&x+T*Tr>3>9hj$y1bICs8^?iz(-=X5I4)K43~EdKFl|0!rmC zj?%$^VZ`BX`PE9mTB;6~pO2fXqd=;hl4?=9hMvh}0r8ixwm=SqG+ z(ywG(J#9=sv3y_P{Y?fU&99T+q(8nWtI57t|Lm^b(mqZBuW1bF0Hf03tfZ;9Wb%(s z+HQl}0X77Sn7}FB>b~TPsch3y%J&8Ge}6d$YJae+r6>@ZrFy||NeEtaa=Ga=eAHi5 z&&hpOlEe6yMIA!nE|3Z!DZZY_W5@2I(*FK~M$kFxh{8-}-HUlAK+=Z2cz02~^uEZc!gAUDMHhXlEO{;_tH^ zv40pBBQSQ^2qQsGAFm#l+3SX0|MB5k%Rn=Y%har=HaI^EbFr8=Vu_2lMHAOMt|eXj z?t7glmkhoDRr155*UCVp2QA}yY!;yu+SeH`5@_0d8<+Izx8spIMKDV!8^BhDuJ?Yw zsP^XujAitTg~$DL`s4Z7UG$#1#NbnH#rL>fveHev9#6@BbeV;jA>L&$(r#N2y*0BL zq~lsobA{e`4m$d_F! zFzr0w)jcw(%<{WZxGTfUsJLI!oJgOJ8cWQ}HdZnniTCGvh%86GGjEhSnen2TYkOAr zcaD$hZQ!-z%-b?+L83}=L-y>oY$(r08ZwgN4@Q)OSpSJj8oikP)!rXH|%m2i-%vkR_sjW*$`L^YbdBriV#XtSvvTS^-78w(jxczgm!6rA`&6)|E$7j@K$esDa{g%y`t9<7 zJv#jYI@*2VbvVw$`?7PjRDyPOaKsT?zK>bOu{DY5P}LM|wO7Pv%)qbL8A!!iL2`Q- z=k6g5;JC!(iuNpro7w;(nnea*WVZwpPL_V2#NX8`FYXEN$EmTpT(53CEL8@FF~6V* z{q=)h9h$h|B!oMQ;JHvXpw0@fsQC^ZG)`~)) zJ7j9=E!HQ#IC9oLkR$v9({RWH4mY@W!`{c=y|$hq4Eq+#Hc(=3Q~jFg&E;(T=l6y6 zwbWzKi;#0XfNFbkZjUAP#9PzpxBr>l;(`E6Bp2&oapoY+n}pke=GN2)oBXs2{kop{ zym5JL>vTTLV~C^(*OV)#`+%E*2B9-tSoAW%&|B7k)HP|J!Q3k7_HSiv60Rw=Bb2P~ zn{=i?YkJt)iHjWrkGC3y4*GNa2SL3LUKKG)pDSD9Ecrq9+#)v3O zFcj>%d5xMd>wl8mF&P%`x3gFm0jQY9tMe&zM+PN1DO8zb)JpOG4|z)ML)Pa9^UG( zb63V(RbPs<=Re>Lj#(z10oajE<0z<~F@+iDyNW%{4oUz^yoBJ5j5{IUqvbz3I8c*; z!vn1$YHiAGtW&4&Iy~Z0>|N3l1>C?7%Y&brIfj z)bqZ;Jkej*SsgS4j%Fcda#$x6(2JVAPN$0A)mT-=5m{Vv*6I%{S=Bh|8TgA^350nv zLZd%z-FTMDf~h_XrNm|+u=hV2m;+s3FGH*M^1?(CSlX_0UoWZ;1x$}GH*=@_3UzvI zPzX__Y%cJXvQp+PHtZ+LEd#RmT<6^OAM`UBQARn%OWMXpk~*)tCd@F)L%5j+tq;vY zoI$k$IVIeE!jaXKTlLDjIVxzAQaEw%qW}J3-?~#P5qLBlDyYVRmqf-yJI^VU-_|-J zd19t#0zYFO4BJpy+9BLl(BT`UE1_R)LZw_gnDQ>ofhE>m4WkiBD(WB}k=L$?1c(F$ zwtJlHlNQYNm^evt!)@wiy;87$q3yL+y=~2fD#a49*7e}c1yvm_Ta-pv8aopUq*vl`OE}cjB zdo}=zjtb(4LAyYyPD@e%E5bw-isn|%MRJx&eU-J&Aj{lHT*cuIbZWj~?+LQP*=0*5 z1QNfRvvY^AOB#yp{n+D&UR8aJW*J3A2F=NMjV@qGB&wjH!RIopD#z}*N4a<0g@ShDREdp9q#?o9>Sed!N6%sVcO+*(9&7D$_!uM!0>KJOOW{bnzV z4p;bnW&fs0xRg)vh<1`pcThWXM}73X)0&hHjN1Mrr8lWCx({u#M!tgbRPpO2YLZCm z8HEl%&bFw}Z5usjqi3sb$w*4awOlJ4xJWCyIoLR(Bl3YvJxw7*Li}z04D(euW6R8< zTO)kwokEq+hI*#uhNZN?y93G~KOB7L0Ez(0o(K+RHR%eXkHH^7DT>$7{S!)pb6Yea z)?&z|#gXY2`8p2p)T}@@Ze+Rx>}gq-6L6If80X<;6kp^*w9c+dvi_z-(r93BmLq%d zE6ADuJu2ZHjvY6Dtm0z2Yl?X@ph0av{Zu}Dwd9wN-Cm8NJeQgg?L`xJ!VTm4Ap(JT znerP8Sat|P_Z=gJ9qt9@l(OVX;~L=acKkMaUpWu{M2$4|M!o5NpmJW9$%W35_tJ zF+Vm>v1+)B-};-weRISo5IKQ(Wx(Cq^Y3`1$$EX(Z$+RC16MLJhN(lSS>|&>A#fNKZ|p62sMUM63}3A6oZsl+ z!(l&lK$)s(yRqlB^98R#z6iMe8Ma=&NdAI0tqVK0UL%ik$>ncT)z3<2kK3?jKr{?w zHI&MCm;@^DYGw>q2zP_mi9u--{jz^hNIKYt)eNM0kZb+iYg!VQHkMJ;n1HY_Yi)R$ zhNl!a*7qi|I4%zC60^#qWXug^rJ1=K5ZF`)4-OkL2 zU76ttSLl1NlI7}ehNnFti8Mz#EUhMafT<%MujK|yvC`+k2n1bnVm7vQU}O-1_>tnp zP`}2}Cv0wyBP)*K!d}kQn`W{evvaVv=^{7Q^dBs?^(J$$__ltRq1^U=RB09EriFJf2}PK2QFSW1xfXU&#_;UG~QSb7w%wtCGt z5mcS0%O6cmHqS2(7+;X-uOgJ}fa3nrUlX%8W@&dCFUO?;qnlune^8aR$R z3`c4fnAgGO1^j6kJ+P-+F5rH?O7LbTM)w1cCebkB1=h$yoEfAzFxhSJm7$aqm1hdV zrr+5XPL*US1ITnDa6V>}n4tMZfCn_x-9f`7TH6zJ)&%<4x@FG=?4jwUZ&NA&lfSNb z^*vJ@YAdPkGi`qZ{w;`g=EOeXuYtz^YM^#vTW)7x88i``oJqK*Quf#}u!?B?A&j)F zk=3ihvdk7WjRmwyu%QGWgC6vdK{u<*6SHEL2BrR(=)5~mN45aC0+CYzIKy+_{$XfD$kC!LzsaToH6xVKeV zVKz923^Zm~m;8v}n$_VT4}}FokZSlAk~w%-YJH8n2dwLiP0Xz|H8a8YOho0X2)oSa zIdV9#q>ouZ2tM}km6kq4LC$BngvVFyAA|Kh|FiND^jvG*lK=J^}#Qa+n~T^hxJxxclSd0XPdmEZUWxftbpD+fwk^+F$^O2j`teV$|iGisU1UNzQSMMRf5a$s}D`s z_b?&MDH+X@B~0HnwSUmtfWug*X?)-PB;hvbMnx@=wN3V%dD6B8KW6N#)@c`0+tX2^LLO3cs@6p_& z-S8oc;@3RDW7dU~3UW!PL>e{6Y2nJEu-DWakmU}a>)~O$n786H6ig=H`t2tFoS*+i z9HVX!g*;>t$R>!?LMLPRstCghHIIFj97838Jv-5`sAiNmD_i|1WEh7aR<`Vn28sC} zDy7zNiM#lP*3!&fI|eRhf`}n2^qc3FHjDrX%lm}CLjCMGG(VqziepzUAgg9^25BGU zq!vgFA4-}??@;qm-{>G?a$*4k<-tDD5_}j@ij1cfsz|#2*!JTS7e_x!8oG|8c3 zr2i??lv=&gnI(0yo&_`u#HR&eX>jj4Ulp$3W2r(%`sFthU26I3119js)!PzV(10{8bw%}vooB8t( z{$sov^luZUI^tX%45B`1ha>xrS(kJ%-uI*LIs>F7L z=YK7c@J%pVZi5)}Mm}A_0>A0@*HwfMM5KOTVo>sfN_dVf`m}oyHa|A&>rCz>0mx2U z!VBGJwBMhu*U!&FpcEZ0Tk3Ma|DknT`>Ry#6hTepNT-=5DWyu zhe*nh@VLu{MSr(lL57I-Y^w&6IVdjMX7v9%d=Mx%G~_$8i{Zrl@23RbKVDp$II;`$ zdc%6KtcxK{iI{+$NCU6mH*MEHX3s=_7{(g4g^4jcBnnZN^eyC#k=?NLSaGBpaZ=&I zTjYB_wG07(%!{oOlON^H;!Chn{E+_GYP-kX{6r1zc~i zj56}qSXG{Vh3^L&PKaR*lpEl|pvofm46r8Wo_GO~_@tbBZ>@b|rKBa2kimiC87X2u z#@>p%31VO0DdXDz!y^4(EN=*DN)&zwOhuDId};k_a0ow1zyuRDnJ@8VD-M6wab|*> zIXw!%f-aHzjSTz;Sd)j$9Ke@+de5S*DuMYoy#BnBo`WLf6X!FbgM8GWWhj`3kGf*Q zFx`m(%u_ggz`uR$#nr}CK!?dD`bR$7Y)lFGgN*IzeL$u8AqO4V|u@6O@k zFSz`am0QI?gdkUev(U3P`V(kZ92U*f zU=J8*eC16i*sD3RI{e>Zrwmu?u=Ymsj3|0WQb6!iDoUr#7XZ&#zbh+`rS!SIpVMce zJ=Pil{ggR4RN7%TZv1Xt^{ev{=p)l-?S*_O?LMk+@`Uim!abD?D8fac^uZ8h@k`ZR zIh4~aX*Q1pF7eQI&>CaOkm@+Pha;$pN#I{8V}x7X5l{I)9PB=QyDVwE5_TCPO}c`c zj|*eIgR!l!R(JI@`CF1X02!3%Sef#uLvnz}2Pq!^3}oGM;s4DT^I&Z?bI07ksHR|| zU*vGs2f)6>KSF-B;Ckg&+hk82BD^2J!b4D$98KXIUhYMgsiq>}j{*SMLTWb{S+Q*9D^Ix=mTkMb564cj^Sl>z zgb`kCl5jQkYGWJzsh{lr4?OA8QPv0*8|`YPbOHN=mB7ugtOtt}E^(I$=+qClS?JU? z^qPm9aomR&PqM~YKa^h6TAh4|T7P`(q#Y~y6&oRti4w*}{r+J_2jiR>p}xE57a}34 zO#JkzPR_I)C@=a{Wcn;8l+M2K=18SRrK_8c-;v@~fc@4Wgy7Vp>WwN7e+nP4K+t5e{-xxusI-Wjzq#G-qISoe{B;_oyb+!_{urgTjP7xzv4#x_7Qa3AKxIc zvezz0OkUUpAKjXKF4BFwH_lsJ>L28Vy%On!_b&A6PrR$4-uKm4S|l)z8?%eq6tUgk zxQ(Ev0>t&+tT>$R5E$tqr9l3`650QmEcjpY1m}GG5X=*0dJ@XQzUf)w>BY}$eVIAv zwcbJ?PTe#=$?(pn`5|wb?#=TeyzZcWoqXPDj^t=s-tilpqxSNt%>Ad)&L%yjjFY^H zT~|v--lR;i%tJ@s0=`qlG&Pt{%Xe=|biORSYf^Wx!IX3Pzg?;QmuIj&H!JWGd{TUtpKM zo`r$PT3Ld<##;uOF~@D>2u#RUSwag{!c)X&z0`v$FHdv>2|S9WW(i)`n|PG+cw{8V zcB%)qKP%TSwTZI*z8iZb@#oj8_x<#eub)P=e_WHhnxQ=} zk{*>R*SuBCYg5eqRHzO|EViMgOGd7mJJUPaAi#9W-ZC3atSkjLZ|;u>DR?L6DM2=# z(!p_k<9O}u2n>Y9M`7FvMSCNHFfQUmZZ2GH?t8Lp3VMD8N!2lZPTw@c&k7vD6XLt^ zWX*`Decv(SDP2DDP(Q^2ByO&yb-WPVYZ@@fh{?vcrm6ps|x z=ss!VNDMp_#o@p7OL0yK#=p!J*>){;)B5{#LN?GY9ckZ-i-l2ZL@^tYus-AatM>^cwp7+P zepo!X+~1U*q^a2p)(SiH0_3K>3n zHq=^{R{Kz(SBYF7N2#3X=n!EDe#QJs{s7=UKSh9EnM!InzkiaGHX-PwpuICW#U`~N zNWcNZmxeN&E3q7Cs5ctEGSw*~kyM9Qa2F*mB(Lvi3Apm>-Om}v(KMr&-q4lur4^HE z&3as+-7f@!?VxrsB-Ol5O1)d|C5_|Ao!oz?3=YyIgu?6 z3_Fz%Muu+V{(Pcf%39i0i^R-VHeTN0U9v{5Rzu98@<}n^_Iq)j&RmjuJOU^ zBF)5ZixN6Ke!pzfk4*jmBO!g;3jeiN-|jlJP=*_Y!20@R?QF&^ijGVBvD8Q`77R@u|Z>+?Y@c12BqxUfzh-W)y6PR>SK$}9fpL`5CAtTU^9ZZ^i z`<8^Je2P+e?;=!Vmv&t7xiq7Lf{7dRXS<0E4=V5prGnkyO=eBl@V;e4( z-naV&!J$?Mws|Nfed#y`9zX@$ZQ{Z#xX$eW<>gKQT&u`gOALb5) zwIj;~ROeD=y!n&MtE3uVD_VmL9up|m$8Wa!;JPccG$?!hN~C)j>C52Kx1MWi+C>&W6!5MJo5 zyZFoUCNG0;4@b7XJf9RD4g6K6R+{E#zx{Tw+xyI9HSWH@BIQ_^SVcv}u<8oi60#A5 zbZXp~J&K4GVDESfHs1_2{B|tI@bdPnJMHfT4$s|frFu9e5W=)n`Mp||VE=*HLncU= z-PrqUKF&>Es$9ZV-Hqg#WN`#lbBb-vb9ReLBogli2aiL;JRk_j3RdTs)7a(e<@R{;qx2iHezSYt|4~(J{!3MAotQTQsh~WAV7%Uh2&F$B^wN~o4~dL)&JcEa6LJ3 z-=(u3NiS@qAV@MCJ>9h&$8OGMM>KxCJ^ehM<*PQkNg+`tuHL*)m+uorgU59phF335 zlj=7P^-@0+JY92Z5ib_qB^a#%1;Ob(}tn&vw^5^=;C+%B+dc-J!c$LvXc z?Ie(G`^$%cjgY{8_U3J~>Fo$iL$cIgYJCEbYn_(A+|P!V-Xc)-5i{G+A(h?jFOU89h!2%rc$&CuQb|O#MQH35I{K}R`Y<9lhT%+Tb?U)=*WEEk6j06 za;^XUDqwy+A&%D!sd%*5)|oIJT<0@(RmB#XoKx0)2*Q0I4Wr5(gIh1DS2(WVM)MT$ zZS4e7VOf3TRujuCWjxClfMf?UOInaquS>dpW#HF7yWhWe+*oz`yTJU$Zl^IYM9~{W zD~NXx6?my#?^h4vr_%q_cjhh$3MpMte#7R1Tm8T-ckrvN{IxE1X#_2}cFDwLc`(3@ zzX$$|*%n{FQ)uDF|7jn?`208N(a;)i%Q7?WZ8d|(3vrBQ?7oBUAkTv#=xAvs1OhQE zy9za9>S7w8yv(qBYgn==)^W+U5pq|vscWG=ImY$7%Kd{%6N!#F;$h0S1nzJe0j3Xp z=v{zJVc=P6>T?)289b#|n%bK5RgfD;)|>cI3^h+3&8knPvG!BDhz2<$V97o_@cd;m zcKI8Y>G)zBo4S>)pVhRJLE(x_hm^T@AaJF|3HS2d@&Xkt#=3Wci7y7q`IXZ=U_Ruv8`Y&p+v2&Wb7VewY|1n5hTm`mqK1K0oK!O zdi`UgUi3Z0I^K6OsLreNZD31~hw4fUam}^$??C(DSLjXhZwZ*>`1cOAX(j}2UFD;m zma4f;x6_#2z4TYfv&q-IlR91pOtr-Cssc(IZ|~$aY772*LG!=U*ZWVv%tK_9Ym26f zN^w-Yp1QT&hYUQxV}m!PK&h-sgi{%rA9JTj(RP;7NV(d{avvM%E{Nf*Wy8Q4tJ0DE z$@S69kzs9v9`?U1OuV{k^7ocDasu;d?BDNs_rd4dvWx|G7a39-|wJ0XW^EVaMUmbduiA*Z_!ws^9I0#XmAXDeh9!V%i&j_G{)~;g)d2qM4wVMN0j$fG)eu-tzw2kST^oOHZrGNblC2kct*DE zHC8;|lmPq(?as)m?rx4FgYqw87*))+a#B(N>=uc&j2L12;Rcx=-q#*~crx}%7CyTb zv_Ea(!s_8@kUS878P$iyGw}qOG1C11AsrRtZQp1OQgZbhRw^n?i;wOERX&!_WHrHXRKP@(j+=!2WK zdaOYdZ50k*YVkI8?IoDmdY`f<%>^POW#Yv$#?;n@-08dHeCa>#be9jTU;0Dpif25ou#7zB@C5_QGIHn1S>A7^-%5hgK6;d0snNO1E+%BtK&lGJ`Zy)fRi4w}bChhe!#9&mA zxKn<;XT&BILm1e@uSxw5#$Qm$rI`C`J#s~2 z`K5}{Ue%Ch0uSg}_b8P#WkB+a4*9HXJobZ^;Ft;Dji&PjX1Xx*)MVb3<@Eih^~J@1 za(iz=;}+b}Tj7j;D%cSp@GE^)Q27_OowoRus!ng3HkOUBt(lpXgZ^4^sSX+)|GAM( zOk@j`%3h)I_|&%5900CkC;OVX@A8NrOYuDpOyx_DU;Hx#@4nUWGe-UrB6NN*_yb#? zjRpfr{fwcKo+Kmi{6Me}Lk=}otbE(*I;(EX3|78K`zOW7vD9vOHA;@wAYme?jHxcb>?7{=4`zSn{q%0ee>~4d zkDG&&Eq7#jKOv|$VC?7?wH5=R#%?sM7I|FIDP!=DkqQtnMOIQElR9Q0_Q>$@C97}O7_*z!1wCO=<{3JB;H4d=OMIqs-XXA28V> zdBl&x8l+&z9ClWm=dtU11X|SR{_BY#)6tj%hYt8wwfR~dh?z>4TBms6I-{A!unrLW zsJ@lJ6Z)A2I$Du8@nwVwt4Yg5)>w{_A=>DzOeS4E7!_u#vv#plY43Rx?zaUBwz4#6 z=s5xzo!Z6%-op3|ASmmLGzK)At&2y!K<#^aEoq4Y4R8j|VuF9Rk&xL6A;M$;R6exd zo>ab>aMP=7#w3g{Y5Ut5lMeXCMW&gk#QhXfYPS~7*o&9nds2vtD|SQtlcMjTkHmD z59xON2xdYVU+GvLved^H$c#M5CkQ)ZwEsbb-0h%ZCJkt8?9jo+d?y^{q5QIYeJ&QN z$RMO{B-_HNaX@45M<25Xi?Dabm(*jSXVb<<=8Syh#wyV>caAF>pEpaMdw>=oI|J?G7sK(sfRbXa(lVjb+)H<<%BkEjr)J_o4d~ck_y*v(q(6Z z{*=7#An60W78jg1At6%K-Lp=U#jBtlDg<@?(~+xT_T}Lcj=#x_pJ(xU0)R)oe^3T^ zkslh5kyO3wd-5-iXRdx>Y$68GGBw5~VkjxpNjRhh^SmSNVf;y`;%d7m*+FEIdx&pentf8gG6jhHuu^{-dA%$pB9X|HW9`5=LzsVM_2DO*UrLCq1GR!Q zX_Q7Hqw&<>h0Q@dWOQu$Gkm0rfgxlNrGJgF#2{9wm&?i6yCJe{4n+KK%fkh&{_Sml znSY@IT}Wgk$|m*LFnhKa*XO?w@E6uM<%0XB8FtX)2iYs5U9HQT09 zsoX@{m>n0dQA%{Ir07@%)>qBb$uPyP#EFPh6OgRsFs+u9NA>94Rh$<)#1jk2N0Z&3 z3**L5%Y3(ONq4CAEn|1)>Ls#nzQ*p<*k0??=1ZaMn-3o@zreie5L7WdbU$dRKQlJl za2USjTKMYG(az~wb|hVxAUXd9Q@aRdNu>=I21^33;pW1Rq<;Ekpk3#+1oX7)l%)U) z@^@BsxA&L&9!3nZx9AyV3dE}-&J%hDF6eyL&5nzQp&yXbhQ+h$z6mkyf?~@qmT9`U zMl$g0F1Ita5c+EhbZh&;=lSCflv_pdnXn&n8f`qT40pc_y|jxC7^iztv7iZhf&tJO zLFQ*3?o*PCc$mY;Pq*!@Y6K<$wLLEX;ph@kH_`RLNsE33)Be{ylw3*mHi4R6l$55k ziM16H?wrKNlm1|N73|WnU)-bY`inbP0T+CotIYPu63D_%eADY3kQf!;?%+m8lW|1H zcX5Wc9)Nnhw2V@`lQd`F--}SBSju(PB8s( zLo3^5?2$2vb)7@!nPNq;?%SpR#3HW3FSQm%Ez4f`YtOTym9g3ZRS^8ERHWc%yxdC% z#v=EC`@08R;kba#PJBD<3!az>R;Lsw^lzKwkyAcLgTHQe2Od5=JL z_*L8aED?ycG#4%?9Mv<*<^BL*xDM9+#Fj*HC!vo`e0S@^wIj7cj`*hz#-Cc3*H0F3 zg6qESZ2zL}r|+51$o7J5^XEa%|1*YUIbvwMHJS`YKU4bO52zauJP2lSaq%QcwhwXc zBK_oihs}{^-b+#oE`sV)ieVb=Nvd$KKO=lf^to+-9=!h|;r5?%>%{+0 z4gO;q|1UK_XFb2bOw&nvZ)sAQ37Z5GyJoMG_C^`E_Jt5Z=5^9r@xB-aP~edt`=#fP z$6Z}Lfk|t;HAS+K)4V-we)f?zzEp~n%L+?zd5zrj@2s|PhGbfW3yMw%vT=s+a65K8 z*_57ge95W5o??4TWUP5mWRS{=7-MM80D-m@#pXyWLU%L82J~qHY>`P{o#gSET7+3I z%{s7CIkWTRBK=jDjESvvKcMXnIh6Jm9Rf;r&uQpzMBKgsZ-J966wfXeP1xnlx(2nIhSZxPeVNHGH_5U<+< zy@2KszB;LOxjLv|YA8;pb)D1UR~>09p7v3? zvV&1#a)G1{!+cFTsU9xf99-+z5q?lWuyJ!7adY(cIqo~?*^Ov;9sSGZ?G7|aBYq?E zHeL3YdjFGMt)h>DC!dqj7WaDFaumqmHT8RG_dWkl2MO;U{XujbjZfp)*QM~6 zYS~wv*FG1}d@+L|Kg9S*cv349ZrCFSUcd)5lLo4>ue=55;nV!u``>Qjnk3imCKT{q z(;}Z=u0^_kz;8n`6OjA`1-msU0QhPU{2B@+=NUZIT_xb;diX zkB!!H(q?dd4fbVf7|&gLWS-P*<`GslPEwDbfJ#!WaZ|k{P+@6RYX+78BHS5jpH`gP zcJ2U)rW>Ws@-e9t+scrN*{x~8^FTz>1CS}^0{1PZQ0T8I{aoTBgJgkl)cgbg+W0ZZ z^eJDH(f-{gwQaHbR4kO49(M@zgF{@SQIk|2a=S_`@#5PqLKG7xd`07;;_UWUuL!Eb zgP9{3y9R{o-hszZF2P={jT|P5nOyx4mNno?w?_DXTA^a*l6ZlKe&vt+Waw|xv~In-eg0^Z;krXC&k;=g>Nmlsl0$+{ zDVKtNy1wdyA*Z1CL;qrN_4azDvxDn@R>f(sDonoco{3xw-_K(FuZPB9+^}XqpC=~w zRT#TCq&P5t!M9s`oJzhjcNFjA7;DTq;60eauwM4dKN-g~IG3F^g>%(oP#z)dlofR+ z|BVTE_WL*YhPLT*AMH*{r}g~0ulU1uzp&6WKbVQ#N>MFO2~j;ELkI`ZVO&nV3a(z{ zt>a}xZp|6vvjCdh+U1k$^Kux(>2WN-Y$pwmTpMlk9^o<6)y%_1 z0z#>-It_%Q=XW#r+F(N@Rp3LG_2XVv`fC5*`iI9^y7w6DDp=|_E;^S>H%rT4^}DD& zI=N3DSLe+grYvM21GB_U{z}JyO2_)?EPGVbZnz6 zNu>jam!`!vzyBau%*#_u5M14BbBy!?AB0%;50YEp45`0OGSvBv5B@*)-nuJ}uiF~M z-KBxxjR$vkcXxMp2?QsNySqb>;1C>wySq!UV8Pwq{ypbC<2=tj_Y>T4|7jalHL7;) zUN!ezbIrxndrQL0+QaCl_jjX}WI=U;pwo(Eq>3aViY*flF6+}c$Y`ODsM-1soh(HC{@>efBosvQ5W2<3`52o%;n~lfn2T9XI zzHF}>Pf7DU?DaZ>`v;61bS9W1GuTeO`Tb5GQ*{nx;!RDxySfySue9h)SO|d(_EWF% zQ@xZu);9B%5u%atNbFpiosGL#K&1OV(2AdoK+zT_fv*t4y6B^T_&_mi!BIObo4IV< zwX_@Xbgu^O1{muU;AUS`n3R1FZ2JgdwK6H2@mk?rK;{y+RrQj%|1k&I3?b$*Uh+o4 ziu=AyioU|v%_wAa%K_%&6|z5%X1%>u>p|Pe{QZJZzDh1NAu6l7Xy}N0W2N#YHeNJT z=J@uA{({hxDm`=?^aKeF!q@${WXF?zH~)k#64d$0H5x92B>nN0{=V=-Rb)KlMFh*} zmydNDK^Gp<0w35qFana%6HbY|5SB%<3<4rnA_!oyt;O`&U3mE(T=u}EbSIH_jy-t( zdXSw2Azn1u7~G`(7Oo1la{cj24WTzjbq65ir!Z-PXhO( zs5Zx>5H^S>#1M?=L&0h1PRh?TXY%*9v=+ff46;yAT4$%gjoq$@?aeynI2bT(vm25PD>Y8MNdR8O2-(#>C6ioS&svX5k zZX0Me7RspNU$grDLl4`b9vO>@mcH?6DHAB^S-Di(@z&l=e;#ZMjAFf>xCC_%hA)b` zYq3=7HALi%B-nM5 zYQdf&6Q-;&9|2g_j{n$mq8`^tv|Macf%%hCMG{7{_z5OP*PG=U+!?UxKulUArqnnq zB-&frZM?D^%c0aAbx;~3K)R&q{`K^iE*5DIbbS0jg`lZROo_R1z^&i0qXFR4@+$8jR_J@eI6{R}Cxzux?HxDmfC z<dqArw@D)xKe~CkT-kj~XW!KdoEVkmlv58cbp%3u&Y3_HGSDxx}>S zN=7d^5h}uz@TP?QxZAZW`;$P3EopIs*#PBg%SFgK{b-rE+>9|+K`=LR z9(f&+NzV?Rn?mG6?Z6boWq}gAH$q)2&HaSosH|$)PqU&2TKx# zS7&C1C3=E_SYz9I#wEB-5o5-tFnmMsZBR292CyrBWIGCK-5WwOv2)Tu=noB(4hovH zQpb2ja3guXxtz8@GP1`SiVZ?bjY9{Mu^R>pIHrv{AHB5z#`lG$ODlz^5KW#ujtIVJ z%{j$#^TiZOR8{!Sl=VgjKHc#>Y{Mpzjlvd#A#vOS@*f*FRImw1I@--u8OjMMJA0qa zlM7Ta6;u;MCnk2#=k=EN7=#Jvk;=3t!fUsYAVx!#D!#bPVWb$qkZQ~RpRhaMI1a+A z{TG8-%1R7Y-X>Lt?cxd3a8m&Kb4i0 zt$UKZ=lBi1iuVR(pR{|N5zn$R8vHNrr4K0+H6QS|_P9Loh_OxUWnma0xu|Mlu%b z^#;>g+?$K>0=TET21VLS_Fzv%W!{-0HoEli(b@{bQ4*Wg&ee%++an69;drpeFIf9Z z?P`UkxS+Wn`!QGHoYyANP4Q&>a=*^zx}1AvhCTaBJC3N)JF#x)8w|7u*6S_jfv|hz zLNj)H(Ca+a$zZyv&S zIJoZ&GB&%A?)AHAFMCHcv6ypTw3BR32yZlI?~^4ng)Giu?c$(z7<5^Nt5tFU8i016 zgGSh%9+|nP?`AxdF8di7Y&5WB}`d(dIoELbX|KbVZK#Ov*zRly4APp~fU1STKmBYsy7~Snp~4SdT?}?8AQ|$0Z8>Nkj8cj$cMb zssl;6Mi$L!HcFx4Q(T;yK`dWhv1=oLp<|DKSA4Bp>4wyUAt{ZjZj^}uK$HwD?9>=` zx|y8pP_YxtTtB8CR|x8u#VB_@d4pS2T(ehjw9%Ns;)m3CzIo~9S8O>HJPX^K8vYhb zB5C8csDsdxOXVp$%Uj|a4PR`F_I)4_k>H2y!=6cuQadW&C&Fob8tW}fJ_9_(& z)draN@;N7q9c&+M$0k^<4SBk8WH2K_PjHuv^!hFO6a+8eCMO^B6Px}}CPra1k@Zvs z%xnRo?`@7#E%3}=HfjvDv~O%?R~p(e!)0Cw+(*I~u@F+2Ni!VkbuN)11wsaAnGorR z^xuNN0npc=S=77T+P)n_w~EKmZVyl9MHZTgq^yRMGB2UuXPT8x6DsGqc)r}r7s-D) zF6UhpJsS@Uz3>FyU(iQ&FF~{nG4a^}t8st!ofpAGm)j%Q1G`q!V2j`X+tw+|- zRyb%eQr$XlG4CJ>U=$!(vY4511$3Taa9EY}PEJsQfJ!maIs!zb5RS*;G-)zKyl^T; zIS@r`qt%+uU*{p(P=Zwk&TsLIvs$ovP)#D?r5U-_(lQ|U4lDi=ejJlJ61n)b122e! z((;jTAeoe_=opf5wl^bZ`VTLv<>tq~)R0x$!0zq-ufbEW%(bJg+{CMfM z<4CAxq+v8Ba8w}l;RveQprjKL`;0A91*FTTb87+~>J`pD?)JnnkL!l}QBG$DS&x;vTG-VnQ+=u}0E3UVPztW1*%(%vKS8+j$< zcKM%FR;s*MY{u@mn}!K2S)VK<=~WWi&D;LC^>^2*3ffUu{pBBdQTlfqe#)i}S4FxzhBt8|(+H z$w1Qj2BnLZ+iq8twRCR;t?0mD_wt&CtOe=J#e=6T#xbM@e5@snX7LY!K_^>F=R(q^{I>u40a8;db3rL$d|JG4=GOwrIi-W^*dWbd z%DKp25KS_CRZqMk$X8%rQT#y4FrVmS9@%pH-ULPj$?B348y0nW=yGSk!e zJ^XJJs~qxJ3dh%neJY*WvdUN&UG4h$o~DdC)XJz4bEMF`Fwpa?V8k>2J~3!kdAP)3~&!<{~S zHf%F=l4kT^d61%7t4TLR5@bD!o?;q`)HuYb_oxA;!J0k#PkTxvo6HWkQ2n|q#dHb? ziZj>1;TZ#kfl0RScWPPw4J4sbO;?dA# z?3j#6s1KZ0eJ?Vl$K44S2ehbkLR5~>M%-~VCf(g(;WQCf#4DZJhh%>AbP}4;BwR`h zMOy&L^k@4*Yh=!nUVUyR!Yvh!$k9_}h#pX)Qk1uKHzS!H>-27D@F_<8Assj>stu!P zAOJD=K^Q%RE8uw|Y3A)+aBvtJJrpF0g|;I$*iH%Q4K^;H+zx%}^Qtp~STh;N_m+;f zZA!~@l{-7>v>b#NZh?OZopk#;tYlBGKpOA`bz`p8vWjKsweRur&Eb6R%fK%Y5nM1v ze$HZB`TI{J1-$5W4Qn}UbvH!w3rt%`W+5Cv5-WV{1NB(2*P8`H2B01e90XTsmIj=t+Jn~|V=Bk>IuzCA5*=H-m zro6tCL8%YgkhLkl5tOj7ui%Mb%6`@1Sci$)w<=pB8rX)Bf`4b2j zDcTmgkN+5v3TC7Sr0TsvNM3S9;uWx!JfE%j_MY<^Fb^p-CcSPR+J@w#-d)${v{I}^ z`w=!6KjuSNZK}6r!*V$U7O!<$*`S*Yw~aGW`+AAiuEyO%GoYZwM-kvTz{@~K!&)Ia zUFIM4oezHhsJ?%ESH%4-b+L<&|BDS+|Yo~~SzmWG!FcMw~t3Yp_A zvs>P>|K+QltA1Mjg%SBdN{IAKicTVsz4ZBauxCNahP7g6~2vi~wm40j4g)ksgz?g;Y1J2f>`s>wykXyGpDg<~lG22|Qn zKS8YjafZMH^*^ngFZU#NLMior$D24kx0zDGF~h)fS7VUTNwR`Q7&nAJ6KAFX>n+cwG-6~=-+DkS~WaVNp1~$?hq(5=BgKV|v%o=4pz+39<<>AQA6X10#-J)E9l=(o!iBILb?oh%e_-H= zH_u*>-(D7NiqIev9;T&oq8Y%IK&}iU!DZYbM>)?bz4)9I6hA^}DcjwPApM$g75d#S zlLp=~w9h!v9ELFN^+8*TBoU(PE-y#<6FdN`nO)1F5ksX?_)DSS3G-qrmL8HDQUe5y z(A2ZN^^~62&AK*Ic7Imc#FG+*0Z|lI3ikeBJ#VrW8k(OKe`^BBzn8vN$ITmSQar9M zx6NHQE)WMkfo84nT>xnx7`#Ak0GN7<;lf_!_Q8k*G_$j-al497nPLXI6 z-E*V5@-s3toaum%4Djm+U}z&EtM(b(<00U;!=WKVC_j#Tqn>%TYQ$o+B|SiktGUWDZ5$ zj7;6Fz%Fj0wFzS7(`342J}Mfl#`qsu(dm7Nx!l8C&RlbNXJop+4WixC`BeU<)aqD0 zeNhp_F0l;DXlHv>xj5BwfSomwg7#2K~;vTjg-N+f3{&|8p3J?JZz*Y! zegyj6OmY{>?des!L(yECU^1G75QALU8lZA?GSeB3^=04#=9@>h()Uw77UAvt+fz@L zLyim8e1y^DGOsIei1TTW`P@yLM&PIHb1oWj-|Z5y=6`*xsqx;rzM}smqA!~==Jp(Y zFHVOd&qE*&YxEDxxh*&5lHpb3vh5_gFIIqcZ@!j#R73lM0D%ySe;x4{iO`@p@-}P#({VA>0 zzYPKLjB4Q1ec{NHS#|{AfPjs*|{jkv4L zsFfiUsz+F>;t2Yv?DQKS36lq6lhNB`KP)>h5kk(cEI6EGW6>pR(QPA2jO!BubdnyM z@59H!-8V-7*M~XPopkFZC^))eOgY+;Kt=V(no*!9li?4}kiGdy>Hg*}>)3^&S_TuT z5w;PQGbYtI91V#W*auEpU3~MALRGaM5uRMERk4~BdF@f>%mj|3A9fthI3%hkRBi_x z(JAItD!&Ve^r-sasL<_lkTqLMhGthH6FQ94txJq;uXrdzjWl(5`z93enW-w)cqm~} zxtz5|*!v;|uPpi8C#@qT#)nA*iMG7=Vp+s_2S##1Z?t-3+A2*BOPmOv$Vaq$za^r3 zj{=DxrF$@4VE*;yy^HC+8wtqyl|Y`r_$qTK%?*O1=Z^K7uW>xv8{(?*JCwpVgM~#V z??<>AjCFtAojLeT!s zuKhB|M~)=kaZpW2`+!h%O}62PWM}yTt``Lln;qbsDM4rqaOdE zBq}vEM!YKxV3BnUou>*V<{=7OTz7EDV(OGJA-MdASJ5cF?Q}VG6-k*}&UANVx}S)r z!#@7-U6EqfCjKjo8{DMzTi8tyCl(`adw^JzNCQS?#P&C@nh~a)?tJ;Ko4aLW{fwfE zZ#IzeFv~ER1|`7VpvJd6F&oUJ6K>Kf#2GYb8d;QiMu%+zb8%2x>@Kp}f5+iqCs6sUDV5jC0u>NYL*0rD?=ZZO zO8YPmJc3zF6v$n%)+LKhzQ*N7mpX`~UUj>rbT*l#5&ParRVNjbatxwQJ@V0!mUC}m zQ+5TsI-hYD_WybKfh2i<_#CRm9JxM91HGC^69OYSjt6w-ibhy--xq6TXfSgCZ+P~a zP@RBwS`zmpvfEA_?LvoN?5D#9Z3YJ{mf(p26tby-1`Q8-gJmkJ?He8v@DqeVIQ$H5 zyD=J{L3ZMp;&G{TNd2Mh`-Y(lS={zXi<0;)gq*0~wSpZ&h(^@~X-&vlt|dHBl8cUa zuEqg#j?p-BAbmn=AXVn{S8s*k1VYRAs?ps|Z9M&m7K`p1T=sz!>KNuI!XfxOEIa(O zGJn9BqdGpjTWx_wyY3JB%;$;DOzs~X4D&f(2v5+N|?ifQ|YJOScfn3d# zRm|tdsSzj)eL1Jma;uQ^HJBxqcazgmGa6-ev7+YdGl*mQGe|HCxiq7FZ-|=Uo+w)kSURbQ`$ru37sb+&PXTD zqJ>LVos}=g5{J;8M=*?ZTV*4LQ%7f^XR+dAJ%``w1zz@CuMP>_V6VeS->$+1Ed6}= zV0yOxSB#|4+dmVj?wLO=Cmu-_{lk(|2jYt=?61k?GmrBYz~^Dx`*|KIC|B#8i_19Q zXC|nrS9mE3;8WNQxu^HQkHR2oQkXiwm)-~mMS+^ZnFz5nny>$`sGmqpL7Hw6h9Su86L)=z(OOgy*hxQu z7<3N&kS$%{Tb|D2Jir7&q6O5nRo#YbJM>)%HM{s_)H_7dS*ak4o~(SS8Y}o?LiC7o zTqnK*Bw-tn71{G-@0hNtbtECms74*<+iI3fiOK z_3wCo5h=+^3~bOEx>=U$-5L;fSeDzaLh+YBFG^1O{qQ-*~OF+juhliOKiC7R1HG5{RlY2cGQ@}lQUR0N()W9Rx(&vJBpnaCBs5 zdCNv4IsB0VpGF#bfki<&pwnmTVbg;r6(tJZJ94EFPBEpFyvs=pM%zCuIKbLs8`bh^ zTqJE=$2f^`ZT$6>9ADN1RA&^03m|=mZFvFBw7&Sv3 zKuG3@K)ysZlIs3h+59EJzqAW|$kobhH!9@?4RDpWtG>sJND79`| zjhl?z9Q6Zw0Gny--9BBvBOe@zLQ0g^(gRihFw4=1g|KoXAU0dc`UJ_4AE8+gR?<6| zn7%rs8pZ-$?d%9~8qSLj3Pqp9b8lnW?55VrPytiC0ku++;~a*3ysfT3WjfM)df;r|AQ|H6nnq2fEC)OwGzC~acDeZ;~=+g)zP;t$SJbrUyW z2C^Xq7a64yU=S=2VB|G=_NT}KavIf`&kvTe4Y|<}oI0Npm2+wJyh#_C+dN;6-pPA( z_-5yX&z^m8Zot9sz39P@Xc+%Q%tpy1BI@_?Ji|1!kEn|UR?{FH^NlJief{M%z;k0?#K5G6H=W;qYkUxwvk~~dh17X3HhVqbL`3@*+y&nqY*JoLFAa3;DugII4Qp` zoZF5ZrO-+aL!?-c*GNT-9mX0Xw5GxXLogLWs<{c-FbD9DJ59wH!@IA)`4e)8cP zEr=D!o(^TJ>&_fQ50d80S|T#$`Mq?Nu+}mPR1DbnJEayc3%VxzJT+|tu&%2mx7W2= z94K1Q(7vpdGM)jOKB1Xth~4^*>N{sRwlEtbYD>lty*(3l<*@;2;~hdF{0b zm6DINWEMTPllqD&D{E_58<0v-24MEK7+}%HXw=&|KJ&IB;2-itk)??5M2VH&k&IE& zIsVx&pBJzek0QQFU5sEi9eupn!eQ*QTJ3NSw?$Sfk}M&AEUarVRa__E*G(LXxO9px zh1wG!C~}1!PF`=;&uHKa-ThRW;91_S)&0@E^q^qq(RgiEWV719Dc&T36@LQTj2Rl1 z8-XuYmZ%=uHbo?z3NF|$V7g1+DE zcHh;nAXC9Cz#pZ?mkBYVUL2aR9Rw`di4Ni*^GoKT=RD+OHwBxMSjtDsg2VW5>S=k_%L0bq@<)T%$b=mkuO9%KkEK8dF-(CPRAQi6rNb<;HG49nV>*uf9%Q2hb%yQ@Ir3SRi z3Kdw6-`PQ?6rg3bRpciP>Qwn)1ZFW0Rw)DP9zw7eH9P2rk6A71QhzuV9agQH0xxdJ z_@J&$$1Ypay$qD!UB9uwCc9mf+EhjLLc#UpmxJ20hma840b(TpYsol$42}wQ0HHNT zH7{2e>7!Xhf}mup>R7FF)aqG4Z5jD<$bfyK;K~=-&s2xf4Mx-ZRfPuk_}QqH`R;Kn zQ@ZxM&)8266_WF!HzIPT$FFZgl`diNbRc6}o8n-(nHT$#763}DSf|BL5Q^Fh9EID|Jin($-}Egxr%n+R35>x} z?l6yMJa!!oXL@&w)yn+iJ=7a5o?do{NF!CwkT)jn8uSTTTA^cs@-{R7b;O#tj0(m+>uAxI_)Fi$0rgWBOentW&bS z5V?~TXVm%Y?$)>{xe9#fOn}$X?8rP#PM1=HCQr^9A0q-0 zX;bI|Ceo!u=b-5JGMXsT(MT>$(6qt=3s8+r7R~Ruts>{dS?kANX|=7CjJc>Bff%3j zJ9{aHO{}(sC5CHJ&z-7pB61wU!quWZl2RlI`QStSL=6G<(nB%^!ZuB&(bf2@Ug}CBjFy$e-d4k$bVHzZn$hE z(?q&jzMAE--V%C>Jr+u@Hl}ubxeg(l2`= zquJ99Fb?{Hw~JiO0GTtQ7M6x#^lW^^&&$e zha6ENtv^_FX>mtH-M%;IO+7xlbTFw_qGCU#`R5@>g{_W!J3uZAmBh6yblI@sJ(T=S zm$RgaDQcaP40F9_D!1Yax&L_uyE`T^DwNA`n2dnT`EyD-OzPmm+*P5WIkKuKe}Z_~ zVj!3&Cc$7rc7tFzYtQv(a(60sl7a2}ni!Ku#F%1TDD0)M^pmW9z)Rn_qD7LjpVuun zA_ge%iF~#^8FoT7t>90Fu9kzvxCJJo&L{GH z1;paFCQ)4k^6NncOj*#|1f~q{2*tg)U*Vq$`%b*dlwV;v?An$$lk2FJ!kNgj>J`L% z)N!0gct?~`Y0E~wlO$<-LUHa_S>62PLJxm83FG7m#&}1a*B{?VPgzu6GK+1Hqm8uHZ6B7dtPU3cGb^YOiD^w&uO6!plCH zcQy5Uee`Y-S1Pv<8xU;s0Ag6%yvQc=j$k|LaSt?e(kBQUth-tA^Oj-Pb8LK!q|4&H zY)R}IGTDi&`JeQ>`*J3EJX|k-FAQedjn7ZW6Gz_=S`M%rwAOiPFe@Ko_l+#zUuWIg zAC|opI3eI^5#5*TUGMJ?M#>Aj-e2GH7uNsL*SC-U`37O#84;(zQ&wYAg38GAO*$rt zUCMvgfQ%Aw3cY9<6+vTSvCP1G8b+VVlOcVlpskpNH|pgf73x(Hj-he5K*NwabS#TJ zyz-FYz@2mJ!8OI&v1PG3>Alw}wQ8})Tm<#{RjZ1LauH(c z>?~LWW5wvX__vpWic!Qw$p6 z7%C1Y$OSYat|%cyAU02|P6Bzz;*NJli+mg|<}mnj@p<;OPlXC9+D|+sTws(ThSc#y zSya}QmXVozVput=iij3S$yr-1>?X&?oVkr*u&+_U=QSZy(LZJ_u7m>yOlZ^+@@(k6 z%*K?2NZFFvT%opVs|6B;g3luO?0vU}*t0!fe-+M-6b&Mox198&mwIJenC`U#n(X}= zkLY;BFxZ^R<45S+4Rkc3T_BbDNvrIaM7R8%k#F+1DFx2@v$|M}{DvEcw!^&sTH<|0 z+gjJaOp#*v^Acq6nc|MmfPTs~0zpiTvy8l=oh_4(>=zj|nbwGD$W0Cr6VIHgE=FF= zjL97km7$Hl5@`_`5_)!i9?EH-BN7H)z#}`3bzOrQDP>>cv?6WlMI$s9csdAHdjA%m_nl2AEnNTit@WnZ%C@#=G$F5}2 zsx_nv=}5gb35&VMtC-O`=%|(`;EYQ83{`2tEHPG>i|h0vcXnilC-x8&pSB%{u!3wD znGl(-)UVfO`JVwtNbc{w(5cfx2JkwBS?HMgeh0PWRXX(=a%bjF3(*!8n&piWKFJe_ zUa>8A`koT>X2Lenr8*L9KDt}}51VxXZ_g}71B?&I37@U+f}jdt(ZC)u@}GI9SJ}$~*r0w`1ELS@~V3>5~H9{%imPUAJFaG9n zZ80X&bNK5-VagKe`cw_~8s_*H%`#_SFu}C)-vm?sf06KCo(4=S?Td&qr7B&j_uJ{p zpqG^mdEiT+7aY-~ctyRZ4L)%ZmDLa_H{EbV*TB|0buxjf`goB?*$mKV41~)eMyi9^ zi9`AvR63-x{^0uNH3ea!`CwK3kZsIOkj9WU!L5DSTofn7U`r!2#Fq%4GH0}27ZcZ& zzm8;;8=-#*I7JPpoIfiz%rQENX4Ep(BK!--=If}>$E?)u3Y1RHVv$OgEvT zwcrF1PtiFq8Ndi)3UQ)`>@Or?2&rc)r_ZRt857r8O)L9(-Ao%+NHu73DBuX5s zQnir8ZJVN;q?K@_h)|yZU1eXw;Kq-F>$M!N#VMB6p;i-U^B$dol+W41+D{r=9FoWu z>d6{L%aV@;?P9p>Eg%aTogCMRSg018IVsT}zU=gtO*|9P;ph7+Li@#V$*i0ekf{?%WWbgwU9Ncf%PeDV%R^5WlV#teerZSjsB zL(JKxRUS#>y5?COn1@mpPx;oY`yO_`@wQ&KdDghwOy^T(fV>O=2nG9q9yVQya@0rMMHwJmYd8jd+}~)QI5A2P|U9y2tjSwjLNX4`RGGYaa+^HE`7OjpTR$OIWSGRG_fAIy=av?J)QH!+%>s#3@k znD+Sz$T`uXvdVFe|2VrDl!-22Ty_B>nPi9~$Z&)>fl62hsX5zU?eYUKKReD<>5iE; z^j`06U=+IEmGl)6^O~8SqrBnJ_cNb%VQRSzrA=u(GF6x&JZZ5xR^~qQkX*<=NjedT zg>P!CaHMcj=7nnnZ$vC3L|{B~82HoVtPvfUhI9s(hw4;dsjC0L@TJu!o2Es!g{xpQ z97|ty3(6{ztfb5-j%SD=iJ%@($(AYgz4Js8)G{YOFOTlz2ATNUg*4BXd|E+wjbq7M zy?}2{?X6(U8?68NZa@cw8^tM+GK{PSQv0j92cE=o2{tYdEI?KWscRli25G$LP-!?v z_b6RX<$R7Mi&6nu~*s2(2 zcUgLr!WH;s=?N~TgA8aaea*U`6V^y`&p4#lmjb4+dzZYJU!F4FG>Gbd`#S^93W$cN z;T{glhsR5}9tHz`Pd5>>8vHCTfo@spfia&MB(0Sg~oSAI6#i5RH z)#Uc|%-pjw%iHgae~h-z1NU=2QsIb#VA1p$_g&%$1chFS1UV%E;^GDO0?io4Ph#Tx z2|NT$4KDK`+^pZx~f(4C94gv8{*cwxo8 z5?sGG(S0!D8%yV%oD#(XLnk0bq3$bbL`sN7a6|~1=puBz=m;BOyoW=T5sEC{HA|}2 zMDxaHhmeelHVvu}q$hn4y&kd`22y%?!EQYj|~6@8@Pyz$lZ$ zA-EH|(|5PAh7SG>a(SKXl#kdwZl?`=qkyM_IC1l z7-{O~`1dxbmSF!s^7}3@M))vbSl_2|W1d^uLHb9Tr*nL>e(Ee2S~t+E>;2sorUyI? zhh2Tr^fPXhngrxQzmD4HPc2SDeY}Qdrr5n~R&b5aB?1BY>`bk#^}hs(XmiX2-|EfD zE>9c$gKwH1XY!d&m+qO)l%njal0;7$6ux&e#c7*AVlr9~Ngztdh~-iGv%+T)IXk+o z?d@1c-t-IYzee7>bg|PeSTRw0`F%Ag^IyRmx9+maG6}q&bCmoHB7bdBG3SNVoK(5P z{fTlya9yQi!-~nCHY^#H=#*aa&5C9W`??pOxAf8q1R8x|rW5{5OD zf~!R}w6*oi>y(AxIEVziXCm|w~Ibf_$! z^9m>B{OiZnR7I8l!`jMs=!yH*k{41sWvBy$NlppVMw@iCHg(|VLhnV2x$iQ;w<9Tv zj`C+qODll$KNI?Jd^x3ZB zD3AXvehB+^HnKxZf4zJ}0y)H^n{*TLFagRtv{l>sxQB=Q*0HCo2fJMSzQENwJaEcZ zeQy~j(&XacUMXZyH3QnWp<`x7ry9%kls3gj#BC1z2pFHo@;z{80q)lRCzM`A;F(Q| z4f_>HjcL%`FyZHEGn%=H7dCc z7qh#J;i>OVSDMIukH^|F|4s{SYz;T8{J1|eS3YlGPMc@da^wQUbIweLRdLzOk7R6o zZ5Z=Vk9KU{^I1OjK07<1Ww0{F zl3jDaP{ZOrmg_N3s)x0?3gE}Tlb1mhmtqt<6645%ugq3Q;4^=9J^H=?zL4QfeGl-_ zXvh-I7cPcu^U2BQ-Zpv98grcq9s;#Jy)2VHv9|YrDi%3Z)Gp2MR6Nf@-TaxqiANN^ zWbv^!F+{P9xLS<5c?bMiHf&DN^7lDy$$l?UlWBeGZ`S=ecoZ+qC@^YGX$^H+o$9hq zlxtU)pYRCjF1r4Mpb*gp5AYoB=k9aqSO{)~8XJNQ^24Ej&Tnu--~FG7T|ff^Y`b27 zfw7P>xP4ozrCOR@-rI9sjlKUFcV_6+PyvO{6lkXp`=VcQDO2i&Z_8GoI5=F;3!jQ>2#`>K$ z_y#}rQY0vrT;|zR%KN?bewpF_e~hgHCiu^m*>{>M91p$_w#VY%65~7Oe;NOE*#F%2 z-@oX&dyo2fb_ri`?mZ4?+3nN{~s6^W+Eu@FiOLK+dofdeD{LYj?k*ZHz8B; z70lDf-<`t$WA+FA*SDUBupn&4ZNl5F%ZE_#r*H9{|Gc!d{Q0*m#Q)v)Ki>AG3^ztcGo{M@DPvPFfcgiE!`Fo9r@xaS$fB11*N%LwrWGC3$*X=oDA#o|y z_V##7LjOe{nalEI#4^fdfM3h9xqwU1`vFSxg2zCY;6%I!DVxv-1UU|(y^Z`Yw_2& zww#)UQ?QwrBX0#f;nuIOuTgH`(Me#ImX@@9{YlQ@P7B=@8sdm}=UC!(q+jx-J32_+ z+sKP))ck#a7hN?R$Xc4W(P%!Fe9iOdy!Zl>2NKCK=knaM2%Ho3-yl8m{SIy{oqWbV zDL1soL*Q#~R>l>Z`-g?^@9D~QS9h`es{h}mj5D#W!NT>i-*HCGB&_2xk^=98zOgEl z$*h_BtKv8Rhpe}ZiYwZ-brU4GJ5++ZyIbMzZo%E%wQzR}?hq`v1_@5#?(PsExZT?A zzSrJ6`~0oHYt>w1j^6v%Da7rL*#cb>Z}_16pBzrDAa}GIuKf{2u|S?@#XH`E&3d)Ro-7-^dDiAyeJ&e+S=+ zBj|?7dLu5D+78m?5Um2$k=?ukcWd`KTgsI*ZCg-&$UT=^27NMhu&pZj4j^DB;vQnK z^~)2z)in2XEib=kb1f|g^1d9;Ibj@$Nrxy%#PD(C{P$FNKj~vE5o5{h{cci+1r9Fbh zzVJVne}s2#A`^kVP!#MF_7YQ1s-iAiC|z^+?&^aL+LE2eFQilzm)atK*;(k<=zFGp zNVl@<|I5+phHjn!n81*W3%Uz`%kP`|viYw4L#7m&_wKG1e16ZyHN-l(S^2>AqsCD8 z@jcxRkj!4`+49|WkdmFclb0h@;@(RI7af22AEVVR&U?voV-D*}bMTq^aB{4cK>qTk zo*w_{vDXrgq|8T_zb=~s^~DW;*xUbpqT-Mz`a~L^y$0C_I1L<&kE@9;ABDEhcodaR z)uT**zA-Ta5Ll=2Sbft=cKIrCy}bM}ENkIdpRF39w7#hVr%NTfzcy#d+I2@eWY1b7G{%R^9QN)t4^`@ey_Y>? z#jgR`Tiah!N0Ys=xR-4kg?+csljsKq{;H5qwy=UgjVFdy4qZgL+1cmays`&>uim@7 zuqCDcWE!cPxyJ(7shTz&QSTNO_;em}ajd^nq)Z&&tY4a03`fMH>U_jDL3FiK zo{AXHs$rj>T0~B6xZNmjYj-?Y4N+nxLaO$};JGF*pR$Ib_0Lsc1sGRTDSHp%M=l1h& zMMBOiM*6<40DQKGncD*lr@UV<{v1vD zTd9`T$m1$S)1|i?>xaGSZPL@Mpfq_l`Y!^lMrf_Pqdh=SKEF&UPkE|sae|{WZ>A@= z>R!B;7Ajn?Fm_-vS5oOb3h$U2itCc_Ma5+TE#+6QMj^&pwjMC2pJG=z!8FxJ|7ArM zEyCLG!grzPj0UgU!+;p8_eK75;dH#^DpEK(xKQRMkPkq+YiKYrnTsU147&bSeurL6vEds%hjQ zmu%ub`_Y;-Ic9kAvU0nM>2TN#>{FrFX_GRWsiak#hAG8#oa@nm!iseaXKq(qpuHdR zKty3d3B`nAH=TSQJ2f^oUc_?__dwQHeldTAxD`GVH^UL4oVo|GbA1K+I1LD*Cl=&m z^HM;$%E?`UmpnG4V4s6x0N7Zd8c@~F0Ot}=fpak+Kqrh-2la@T{@SNY1AdrT$pm_M zP-#m4KgyW?h@C2B9gWw?C>xCD+!J^Fv1hoZWA4%IlXr~8 zRz<|INkbtnqh8deYh|t>te1*U^9F0e;TQHDwXHHbpi}tF2!!DsKO7Dgw*~)27=@)4A z8hkV7@w+QK%ROE?(1?)vE17>c8dST-3A1smu4;5IkF|5#@w8Ac2iY=lG-2SaTG=bq zb1^iQvU4)KI+)7&z6?x{@sLHZ(&LYt0V;_7+VKcjO|%OF zRxSPVn#5n8;=a4W#@3(=CB6{VYQ5%3pYFCo-ThK6D1K0~8GYdY|K*(iGo4PML5AZJ z(>@eY{k^|~Ibn~4j_=FFTMvg8tlG3gX;Sr+yso7$ z_kT=lL_Ru>rQAb5J>K}UX(C33AYcJiRWL3e%SURVIF zR%;Dxtny11#G)qwd+ipy+4V{L$STc`^f#>reE0Eb`Tq$q2i7e5@%K`trs)`PGaDct zJzOF|q6NC0hsOzTyBaMno9Mdqj_2CsiRnFe7BqVGBJK&OOH)6P;Dp}R;S2g@N{#hL z#gkb7a9z^q|7MV&^{9+lidPNWu<204qeJ_OdE(5A6+dw(g(qdfWn5j$nMl+mthkhw zHJ)w*z>PQ^ll_S&cDLdxSrWugfp`y$WL#|;WodVQ_iOWe6>qC|?8T7MkD;+$N<4R- z;y@H=G@z$gq}gFN9Gq;6qU5@t_`z4{-xko!wZ;7hQeq25qa6^$0xp8Koe?@02*T1sd6$*aRir?(KM~_I- z8}o_%pW+E77T|K`HURw=BUnSW72gb{rPE-@2eKRFoIhdJUtIu+?En!coL|-IE&1JB z&(_23j6sQo@0T6vYg7iM4EI#_B`xLljVOAR8SkM87tMdBPOjVUp8waE!CUfo6mA?U zcjU~qVu+gq%kmJWV~N*PT#(@CPoiut-Z1GrMH9Ogwc1XQ|EPlaDZc6k1xGofFQ#U` zsc!?Ij$(_-n8$u5o*6py)AG>uVH&GdQZeY-FAwj0Nb?Z2j)7H_|F*H5_tF0W=|5`v zeKC3+XN3ZZjuVYiZ3W9PwQ(CqEH)_79;?xoh0&*@5>+3TIK$kP{CD$YNAASRmXlfJp-ly z7IzgVk;+0oy@@-%V1{55ns4Y*|~|r+KJXoMM{Le z$G#KUt!pn7ZN3`CaLL?zWny4VbM;T=9aLMSAc!=pqEJ-#hXcMB1bh$J?D85tBv|65 zPq#H)Ma7v(fH+rnN{wEDitm(%+P!Z3(#!8JTRF01RZSxV{SOR?8ITBH|17tW)?*cd zN1k}-$VLCyx>@k|c1GXYTt53uoT%6d=IrmXIPjmrX&AaHvDXB%u;oI7h{h96r_?Vg zudHI?+sa09&z?7(SN+zNs;eq-CtcfJv&6ooQ#>D}He-iV3+?S}3;}!1${gOZR$c!5 z`aE7NfX8!Jas%Ax`b;)YhfJ2{i5|RU$hzb`a(m*!7o9Eb_2PQUhS$lE{|9Byh9_!WG)V!H=Jt194cliA{Ato^j< ztJ*Wc0X*E@8Q59o-U`pIlCzPOG-qx%t@EF7uH_Dk2^Gj{x6{e2+}HSkcD0@EFnQpq8U6`2^tPJPChntahJCa z{j{5XWcg(m5}*_dPhP6tX0U{9)4bWWf$p$n-RKHx#6TL#d0CGIf;Mdh74&%R5*jfCFq3iwt{JH7c9n2ExE-B9&?i|7Jy^R!mDE=Ek zS!z5Q*3eM|tBb{Lg*85kCTUS9@OMf?|NDFH2In)+OdSA zSDMllg_3?E0UVp%#HWN;pfm2X{-3(EBeNYfiropB!_HyTK~2Wza#Yx1Cu-3g@nqqg z#K|2InX2U)RYjfs4-RUsZ7zH9EOozF!}L!bqU;Nj3z<$CX~*ys_ki!(l~Ww)H50Ir zxT^&WbxRJL2wHVp2tTvM5Gn^a8#>edI{)j&{`ZyiFA$6IddRc{K*19{Jchuct2pds zn*1iKvnbuPo!Kj;@7xUxuYZ$4x|_j0!{;?E8UR>gzG$29ca>jK3CaW*IfK24FU<3q z-=sbPAPT#1z#DJ!(}lQPVB*;t>pewVtwb0oDEeF&rdctOE<^sEk~;sZrY%?p;zYpWCRTFxs_0t_Jzl$_znA^@sO4-Rk< z;i+B2t!AEJXZlkP_N{&g$Gf(Pg5M-*%`O)gR|Ccd&kVPGPD*`&L1#pCDDb#|ty@E% zQNZL8$6*Bx_;SM*rk|Zsm!qWug(~KpbR9?hQ%N`@f@|^#lxsRK=oeA*r1y%q#gan~ z(T_t70WzPBd8I?NH<}-u^fey~%=WcnY=(S(6m~DIfk@m;`t?pKsWI{YFvif>0tdqD zZ{8jlDDVd1Gq1lo?9}bs4SVEXE*vpOvhd(B$&iKup_+#VvkoK&wRJGqASB5OEqKDn zA1cKud)5!4{B&PIc5(wP8yME!FJlXz+CP!Pj6(6696@=E;im7L;vF5V-!X$$2#2j> zaCNh)Oos<}3ltnp*E8H z=wZv#8`z7XlGw3FrHK8Rhr8zKm|C)}ATxAc@a7^+n%ZzY|sB>|o zXIAH+(Cz-b{`zYOn#?(Oq)j?Xz;Y@#BPcVta`WInQ~ zk2TQHJzwp|L-v~ZJVi-SG~yAsf{lrsBcBQfFRU^prholLM)LctDb;}a-3Csk6IrQArAa%7=n9WpREh1x9s^Y87ftpcg zRSgnpT*s|+KYc|R=;*obF*dF`I4MkoVz2- zVgb`RD{6ff0$2e-3C$WwtbGIOY#%+*Hu1zW#e8S|lQbH~~e5sAQ?I?Pf^BLwjBcu7SC^giSb6xDIA}lXiC8 zJw^2uYRj0bZPn!*)&m*kQbDZFMF%I&=W-|$fe>myh_>(;qG{*hHq30ZX$@eF$8z+7 z#VB4HkbB-6a2P9S9AGR=F-D5Ssv%%+qkVHoWW2KGctZOMbx=2gbaa2t9&wrVqm!X2 z;Xt&${6Xrp^6!c93curcR_}VUE@Z$s!chd&W}?AEQ$`RfrKBVRB#>Z`5VOXq%BU4KQ^RTyZ6QU@tcDJn_n}))khHFQlxR1X?8#U(cX;1$i_Jr~V0Q zG!|&mWeG8o+H83Z{5|XvdY(LVGC9*!i!F%!<-C|1nknl4>##UKyB~(rz<=_mll}oS zfCt?n65at9;=sa<+Ee^;EJmSQ7Fkc5BP`6Y{!ccOo{fM30m0+%ZEwWhmIG@Ma@u@J z^a>Z;Qq|;144w|}yg}#3KRgUWDPT7+j@2<1QW?%}n!M)p+}?Rxo3p!hW?{GBq|=jf znBG!fZz8MR>*AqJZt}_Hdy{nR4jFMmUIk)Af)S#gXFftFnikbg{XV?qBkzZVw<3hkYGb3Ifr>{$9Y85uVk@qI&9*H}oi8H0A<4Y;eL$R`)cxeO_L$pc(9*?meJ{b_$&=WlJDsBRK;;V$Mk(|qkKT4lJQg= zzFq`d**o_hn)B0lgl~yxG`>_(C7Gle(=T;Q2*SR;pttmLC3NVTA5*#Wh!FSER%I`7 zJp8R-|8u9mr2-=qQ{(#iq1Rt+fmQ}PZO&nqe12OO{2WZv8CcE=AoYkYAh#SDjYa?#~ zdU)XbuE^tqk_WX7@g22FPe~z+8vNUz|B<`;jp3Dt;YAvS`{G=u5@p;=B8=$%OIS2s zCK9)0*x>4^ji3o|9a;r#J{owPjgg>LPx>T5qVJLe;NjaDf0G+)_!Haq{ zHaN_PW>GK{wJuqa(gPxRn`rn`IeIiH+ShxxwtF{@qcAK+K~j!SFJY`adw}3=8XDZ0 zMz03mb-ZJQcjr=cMllJG-1Gte1VAI4yr4+C39oEM&y$5V(~VhEDoe53PqOu?_6^^G z%CG$W2gSos2(4I+lvAk;)qyuy)tv=xBY;$ahN-z1f^znjV|{jC<<^n1998`7t|1wD zt3pi!ZfB7-PPKzhgXQuL?Ok`a0E;!OgmPsiInB2QLt94s6tSOi`wB{&?&sr`?3D={ zGQ#RRd+rv^ODZaOS0)?Vwi-8hXojk|H{WafDKO})!K=-MBBWg;y5P%AEby_+M|NH$ z?y1G-5dRv}y(z!B%h>9xnhv&xX&Tl*;4C+{ z!L6Z^#e%^gnIbM@cfWC-VYt) z7nwW~7wA>UM_B%zTx>q>7-+jQp}g$0nQiYJYu~1Uw^vKa=SAu}rd0#)Dt@jJxg4te zv#y6A-tSF;r`8h!f|gwX(C=>!91bQO--w7y&2$grs->J1 zwH@)x_~8_i+5%i;y0_}PlRb&sa3c>1B&PG-==sOwU{ViJ#(q5&%PyrOGWrZ?_#D77q_UEvSf z#yXQ{adMsaoq^_2PEiVIr0Mnns0{+*e3Rb3H_;NI<#!I12##Ch%%Zr==R%d6+V^3s;Idt7qMaZDfSh~V8fc0 z4o~8a?-y#G6bifEecBMdY1jc$8G-STeijo5lXM&Z!D13?QS&DNi55Co-01ywu6rl{ zqfJnK%YkD7^QB&J@r_|<;q1Rf!7rE)ZQczFI|>(@F_DyK-MLFD4Ut=mXV26S2k`HD z2+9Sv@c>n+ZmjS3}(81){D4>=ixV-oS! zBk&dKJ4D}~GhSXQ&T3U``iIjjl#VyFZ8sQRYaC`nR-&xlgk5Bn1PxQp3Z6iELDQ~< z%iGRp@wd^gDxXnlish0Rw{pc)s}-983Pp$6k+$}ap$R4G8iRNhoKce)x^$N4;qmfx zsfnS%Hp!k?y6_P^ib{ZvCI$RzLP}UEgFYc>kGKa?0*a=LJ(Vf7652ikJ$VX8NLB zP7j4e+k4Jr*EgA%G&t@tn?EX3FOR(6lz8n1W$p*L)YI7-P^+=lCUfVhPUS8ZiCl+t zo|eCR)>&_4@f|q9U~yW7eRdD=1vVB~Vaybnoov5}OUUVZp7-wj*fk!~BZ}Gx7mJoS zw>)PL`%-wO{44R7X^Tq{i)YnTKAM(d9;yb} zAdB?e_QD*P1l;qh$fyEcd43^?eYBLDC4CtT0O!!U9zyWF&Y{vIaI=Ct{f|jG2fKv> zi^MF{b+a8tel;#gC9+voM@FH~(R5!lR`%p(nTFYvl>ZjckjnUP+z}bCUA-$rms2<* z^Q;OaaNSg;UC@{)i{ckP%&yjpFQK{=K)?ot01;|?WR)Qi4)a49A4Snz=k*d^m%q^k zI`{WEWr^f9B&HsZ)pc>mT+Ioq{am=KaY$D>3Nwx0gNN*jLE3XAvJiSu-Ka*>M0%CY z#W0O7ZVoh{$owJyBbP?kN2bXP9J4PTf4E$25U5(ick!4f#NyGh9y9M{Hu(FopEdnx zui5=}K*>-GUI0`~NDegHr!lCONC%`MntaU2?d1F@??>f*AB_sp4W3qHFqKR|tk4ni za(CA5HEHP=k-4J`DeGzA)lXGD7&)OI8xFa74nBC%=5wq9y^(Rlt<~w*AsIOcV+d^( zvH0l}mhWyZEV3Wa3keB_;4dk?lmQ zv*mdHK+gB`A92Y`l>bo}Nf#eX7y>f~79Y36U?lEtV0b-!Tkwbq8)?(2?&6@uO{C&t zJWnuyI`?f8<<)?n`@-v{Z-()tOVV=QBeEpWU)J4dp{UJlV)t_dS6%QN+7>>6?k+2r zHWon~Aq>rUoy2qG>E`6H*f=It$dY+e8NO9=yt^MDF~erBsN32hmQfD0me|e2-d0Hc()8hWGRgTkFI%BF48y%^Dng_hkm0#BuJ!1 zyfLgXku{Mqk%5zw^H9FfuZwiJaEebwQ4yORFffT~hDtx|A$9uQQeC$ovk%uO4`O2* zf}}2n8<;tr6d{%BRZc1aHOMOY<0F^w} z%*`Q$StgwxX2)7|@b<7TZA2OI2%|P#piElm(d{Bm0iG9d)lPvcs<4wK5g<7IY}e&; znhG)nazdYAA|T(Yn|m6VuMF{yn}FpFge%1YAM*i4iR2!F8EWIW_S_koWg?AF4ZSDt z+QN6c_+>w|d%qSRUSu=bkz)+pTS=AsqrjpokTO>y*{p zdD0?y>b4$B<&*-J5!A0=HZ4;>byZIr&zt(oH^kprm?mAyvWzqn@mGkg#aSg0!WO|6 z)tOw_5SiY@uxwg-Tyl{G?cdxQ2GBBO5Q6Fz(A`_Bm~z-mfjT;w6@PUqg!o)CKSiR{ z8m7Pz;Fl+nIg?nBJXjIm$^3AEDy$*hnvT20gW~0v_A{)z_=i;#RK80v6{XrHpUbH> z7O6nyzo@s&7&$=h@?880C4M1>(&Juq=F9*h2T zxfSTRQw@)GFYU@9#)q$h^^!CJd6S*7N3~%~H6gFyXfSO^n0gV$coZf85t1_Cdf3|hA z(aEqFX-Eg2`Q|pUGrWtjBDY=xHxtDJXL{L~$kyYSMXkw-Xc=%!=B6EPg@F;y{@e&p zO0JtK7d~wq7JL?01j}%qD}5cwymI8ce(&GCO2h8J7Sw||79~JX_FY@UxG)#~1%@_g zj3n@U#7WsE#UGuqQ;~im19({~jzot8A5-Z9X3o;W74mWQ*6#KSr$vWfaQ`>I;3RL> zhmZuz^GJ#-?SraPFCYD7)$COz*bX=9_9@+TtXh@4{M}wnW;>`nAQj8eQJV(pyPDF;5qI0V5X!7#5o`+~4H90tjU6Jm%Rv$@30$22`!43GYen0nB$YUD?GsJR*l zFTE9fdiFZqsc*omC6I{0oS(w26~>|J^>a)!P*S!Apj*Q-JxF*-zAEQX+-g~jFi0ak zdlpMpOM`s(1yqc9q)1wSUzjd*s@kAGBkYJ5`SyEaI3QydX%;q9A)T?UG^m<3+Pyci zNL-Ct{*oZwhRlvc_tM@jhV9?ANnA#*fc|0p@Nln2D8xTYu50xvHSg7bRpNbB5J(o zx|z}M0zS5_I+YiW_9w+D<2N*rUCWj(W))4FM3l;}x;}3F* znPDbbkY|0}iF`x>+qTP?t2`)R{TCw>PYeP(u2_)D7u7FC2kE=D#flBoTJE1-NG!DF z4CWz;vznPZh2Vp*6JK)xj6A-|O}QG!EH|X;-<0EV(_(maoxF;z2uRCmPb0w5 z>BCXD{nvj}g?UL8ua-;qmm=cCNtvM(wQ`4_5ak9_9&QhQA69<6UP~12-2eQY^5FY% z=D4%H)ho?cb<4)|FOhgg#)_jizp_$Uy!~nkIJUXn!!+i!EC8@Jk5t+JERCvoIw*uo&Qn5>)fw>P zV4DX|EW%y1$*f+wp($YW2{l-hAKjQnAX6{VY8mDF``bL%fULPT&72Md;F-EBV~JUk zztOGN&flXZc#&QW4Le$eCd_j1h4gB~zF&4TLFU&O7KDQVH5^*U_$l>C*P`YLtp0Km zVE0`y=mQ>uE~A8r6tI$pRW8nEz)FNwMA?T-E$AVVC~we4qK}0EPwOWCq|8J!nAa?^ zapHx~S}`@jI8>{%(?CKz>dJ9dnqk+IavreO`yo|>gwQ?)OmVaRvDU6)D>tZRoI6PX=4A7_#2phB`Ij%+QfM{@@hBCeuQg zJ1>yUR9?R6R=^sq6w=YaYJw1(YJ@Yn)OfA1%*-rO5$I;o@i685d)%7|@e*9X%k?%& ze(x3RT3D)dy4k7AhEcbh#p8B!sc~iQ^{u5+qm%yrKYU>(nd5+MZH?vXzs5AW9?@2! zGM}PmDBy(K#nLD9zZvSX(!YwIq9ix&}pH8m^=8}fCg;sshs7IipUPtwp%-v^ExQ5?s%3>z2BHAA4j36>yD#96) zXxvF?b)rD^yrf=#aww%!v;nk65$#p|B-YW~neNa?1ZA3{VLf5=9tgwq^6A; z@Gm3|qtq*I4?jKOj1%_OQT{5r>&M7;J`XyRWC~VO4Rq;c7eG<+P%=e6`L_+Bno_#1 zME_T^?my0~V2Cq|uxy?9rk#82T*)E!8=*C(SR4i-vJ~ndf;R-cMxsDsX!3R6k5)Mn z69OBWc4Ai~h$I+Eudo=+V{FQ>Rl5%e{D2%^T>sDUmHl-7e~+xEe;vVCjbe~LBdT8g zjybt5KAc)pi7Y0NL$oQgh!O%v1M4GkmQD+QEga~e(v7B4QZ>2K<(+C>Va>pPMi^#` z*N1RPzX1C85&MGWQDg79K=SIF;bI?e<~(%6c47)~yi1dXb+&Xd24e2DyYrw|0=){a zH!^##N&1o2b)wYa6Xd>GbucPUpNgkXf!~S5^D5-m-BnaT1o|*BpsEPsXV5Aa5^Ln^Z)rR>F}*6^mseqt-eO zo7<2e6px7-Ya4?N`Hi#*Ue9yy)%$=@)qybS8;YvFpmH8=PiGu_>a5#NS)B2;ZJl*q z99Baro+AA@P8E?lPHR;agMdyhxhcm-COs7R)MolYXwT9+e$$6ttf>H>{0M=L5FT?q zUF-tEmoM3GSTAFq^;~L&UF!%w3<+lN?$?pHE|u}(rG9T-Y)pRxj?4*kA=FdIspXUJ zl~wK$Vb&KN7Y%W`y{$6Vs5>Gj{6v!x0h4d(=@~I*KP-5r6MlX?^Q*m!sHzM}<9;3T zRgfjeL68|~OCk>cYe8dubEgNt8zPiqdoY_+yht&;&t%sZg~XbfA-ssWfF)Vs$-i@b{hij1*2z69@_@8$jR{abobVMb8S zqDWysjPbE^HVzdA_aft<^evv&WmhTV*1uOeT~Mkq3ODkQbWC`e%)~k~GTX)VJ+PwQ zltPhu9!DIWmlIWPA*c#wmIMLR5dmulg{8Su}bqC{%Sb{_u14(fhR}h$ z94Be45iFpe1PE~$gDDoLAC%*U%@BuZ@2sa3=px@0m2fer<-oV=>$?~D1^F|JXOa2u z?Q;!@TqMY{3$f}Q4|B-l=j>fYsFjY@PWNFi@{FoU{G4=?`WUT^cuejo5$>%3o`p%B|rEiHI`md)j2WDG(%J{mc9yxtYw;>$(^%{TY?G= z`HkuJt91r>+AEu9tpUBUlqTCE&@vG8;IMJ(Q?=eG(wkJ z#ma1ih|Oz9`bqb46A1zUD1mA-;aZB+uSg^OED_DyMxgYIrGdaD9B25EVujDW2QxUP zjf{gROjLdhRpv7;`WK!*mP~pU=HOzhm0O+}AdyM~wDkaM3Y~>sk=qFBNRa>re=#TO z&q>urc*j$z?x3-dL}!}ROHPjL+i!j+TYKW;EJz6M4pTx%jDMDuKSH#)!Ze7x9K#?5 zb(VtB)le&}h7{ilr=5D(z3x-p)3M?UDUfmYmnTYH$i`LB-ea-ieSjXK->wmNKC29Q ziQ=R7C#Yjhk>#)+!wg9-xYUUQ}^~DE+Rsm&$QC{@N$||U899;&WFfZx6{(F z;PPy_4D&NH{y^oXf*YPzPshjSk5?lky+sN(x^UE(Q)0utm!vA+6+Oq}_2mL@wo%Eq zWA3fYqF7~{uywH$Hn}^$Paw(9JPC6sfChu|nftm=WF}CAT%AAth^inWw7(C=&Vuoj`RVr_BL~9LmECv>- z#DneFWi6#`5>Rv}yh=K<<4pOu9LtjAK1QrZ<07XmvkZbke4y3fQ|AL0Qv*^NqUL z9g#fw90lCOnuopWYFy6!DBpjy5C;25V_`Fep@hOjMdJcRQRbQx(6wk37wO#QB@d9j zl}O7;6lyoyOu?>^xT%a`qo@jc z;jyff6zVW%;vDC(G zO!~D5MggAv($JW*V)w}VP%TPiy2=}``NDp90WV7iKA+@xanlAYcuq--QD#0rg+32R zlkj8Bo^?XdpG-8tWpvI0;z(~45W!&SQY>vJ#$I(LN*Ra)v}&Dsj&J^Jc1o{3kuOO3 zI;Vuy7=2A5zI64`GMq|L;a5+56Z+Rq*w?c(Npr%nKlbf5zXj>Y2J}z1zgEYp-3^cd z&aW!DpBu;t^Uz~oy4enT2=sSdhG)N-S+}=gHAfK#f4r{vjDHS&J}z@P^Sj#^t#?mI zRYm0Xkc0WUl(x@T@KnTVl7|6A*59Q~ab3Y5G5*)_U(ump%V@|DXz)Ad3_nt0IsO+u zNVeg8Z%*Axdwh7eXM}QCW z8xkxG?rUsqW(kDYt{8RS7^q|W2;gXKqM#rju#=MuUfacytZr77rehxoZHgO`T&>zz zq?d;r({8U0$g6kfp?!VHUv~IgCP6Ka1@eksH4Sds$&I87Af+^^^6GX_>h61qM};t= zf1(j5ks2{>c>UptX&$Zq8Sn0&uUlDiXDEwC#djx|y)Z6~P#{~9?H5c4UHSMG3WGtw zeuMXF;H@*xg=HxVGe~@i)2IKnB{CyImHF5vH)zZq8!cWlk|eJp;(Ol=zxO=@d2ao} zjRMUeo)=+R8~OsCOXRlnq09?J&Cn%EL{*Pcka;RBw95nhzCFq+f|(_AwCka)P=d$K zD!M_3x?n_PBu|c9=h|;-b36B{==NEFKujoheLo2(H%td&qy5_7&x05tb6;qbeP@yk=?q*ihc&{=H{k8YPAl(Q?eHk~p^nX1_8X}PM@>?La~-saaI?ApDFi+;11_^% zw#%fc=gBBUs1xArSF5f>Mu=}r#3hhO=aid7igI5aBF35tWLmA}%IbAxp?ly|?f_ej ze*pL7k_Q9`7H^nez7IdxTYPpS7!ZJtDRu0z9z(6~2)M5S&b}G!1^&Tizo~cnyeZ4)IABAF^M!2t|Lrff+0kC>x@@s zia4pb&$Z}uzK|w{BoFiqwL@@2`}n$eb4sEN;$%v^*6CF9Tu>}EA8l+?P-e&{0I!8x zMXxiv-IEpu<7o=$H)cFwO28OZWZndBa45D^P8Eu2>u(jGi~P}zF@HY$DV43>Qt?F4 zL6=JJg`dht8`rd_be+r2cyr7ie((5j0uhF{n1A@WxXuv1*gT7a8XgmJjNzIY1yl5= zn~w93e`*-!IdL#yT{$HU2fEFdmcqZde!n&jc^-W{NP+kTD0XFeb;Ed%qJCUe+G``@ zR;bI058dcp4xkAmJJJu5tt}7d?sIRH+qjm488>g&;x9bF`))5f0oePa2MRpXLZ|^Z zRb7mnP|??2D~rh` zWk!Wmq2S_lWH;k|=*Roe*jZ~5it&5cM?E)K>bi8&@LLEtC%#j~`eIpY3q=<9R_6#F|6CYnFLn_H=a>pG8-B zddi4|pry?sxs+e%ksy|@GAS~TzBzH_ZFl_Nm^UReAl*0v7ZlSqw1{n zn)~JmJpc4D#|-m(Ix0Otbm=OT2QSy^gxjy1pQT&U9i8!y&#pc%@jk$iA_O{uPEjv+ zx7_obg$&mHv*IKs?}Ap>OKmDFaV9z5Fg>IIqSei1jf+<~LAG?pw+J?Z1{}C*+DElW z>Da|o{wyV#4bkGL4n`}To{q1bj$=Mh%X?nH{U#Vi2-IAN(cY@N>X9_rJ!VO~bM5gi zaf4a1nLi^s?lH5GkJwR-yL2FQOI%&kni&@x3Cn)}WF2f@(8f zO=da!)~P_Ig&hsIYw!l3%+*m2Er<;=tKjpQ7)jS$L5T{2#t#W%QQAi*%^&ewiM`FC zes!*Djeb;KnoLTk27#po@22p#=ZVRIF?=L2%Ct$?-_l@OTo}du**Zv!N;I>8qCbrp$qLhe0BuPZ^xcF)}?Cw6cTuqcA zAO^N*yP9lF1v)7CZOwV^y(L#o{Ssu6rY7uU}<;B2#J2nml=VetRbkNuW*ax4yFWjXTW2) z;1`0eLkIWr8po_>mxr&FC#jJwpi_L``>C5mHY`u7 z?!N=9m|2JQwG)eDS1?A%utA*+G!KotdK(I_e_+lZDXX&+xUyAL#JwJM|9fEcHj8aI z;S1im=>0&O9DPe)3Ku%UWb|{qeu>{Qd5x!6t7zG>%JYA{9=<1vzxbE22oQ~3915Z`2vJ9dsj?+d4_K#qk|X-ZBbv0dV3`m?-0uHf3v<5=U+4@sx#9je0fq- z&N)|WV~k9ZZYuC7bkuhw<=`4~jt3rSRFcZdAuIKqfx-5vx+Zz^oZQ${$O|Y3_jCd) zKl$TSePS`3+1FAr9c%I(c@5zGsf@EG@2C#$Dl!N|1sFeGhkum)!&fhoFz&%WOJBSf zcP|1!q?`(MWweIwvKEzgyHiGL#L;sC=1AIk$dYf8P7LZ9;sN=f10Lq~2QSE3kMgj+$0~HfbIh~-D#we+r}?+unS-k*e6?w-(LLO60YWvc zA#smJES;G1d*1r^8x7x}BRNtaNv6II(TF{0NvM@qO0K!^DtjT$VCn46@X@C#LR&O| z$_Gn8sQ0^DkL(lme}zB`0{_GN01*ivw_iumr96(c?1xLUCLJ_I8UH4cZbOK`-jO82 z1W^NiWzZtVL9Jjqc3h1aOL*t0mVuA`%Iq~;)TqUgl_XTtrgPE7-(zwFveQ za``Asf}eBWdj&b&0!w+PNb$!G2J@R;N_qc6dN>726K+3NX+e^z_{8d;_B%;D5#kpe za2TaVXNsmp$q;@gaygp-y|%FpU#_JNYZ!kRFu~nh3pNCh0o03bQuJDMH}*;TJZ_X2 zlE`xth4+#Ayw{Xn=kmIGi>GBMMkRkjk+M6ORD$BjAc#sR#Tzw#&#*mN3q$5%uEHFr zL)4_(V^N}Z`taqHi!h<0Df8Pyoh>gyV!zYWU!UEP;~jGU@YFX%QC}gXNdn!m;RJ93 zaf&3C?CWW>1t_s)3>qLjy1Uf3>Xx{7F}niw_G&(7WifRTae4I~xaTj{E?T@6Rr{|$ z1tM?EdUM}+hTI(LZAhuVoTBBPRv?a8h7ca6N201Dr$QI+TB|7_h*!|hV|2C1@M1;b z!gCCzG**N-jo>NeJIJEILji%Jh9V5>Dm1@zcbtv3^p;VqS#ADh{$4ibcZYmbsgQ8s zcQPx8sJ(h_k|!6JlfnR;ebQgyXa~B)Eom6a|0IE?gD1Uaa|hQx|C7A~tJ~++hMS%2_@MhL2m2zo{?Ji>@jeV)O@! zzKTX;WXIe>D6e6CCf_J1|N1(tz>5|~II38BuC|PB1zlBVIjSTSguDHw1_IxU5HP;@|ux8v0 zj+pj8@h1KLaudhOoVXiu#Th@9$4$_+OH`Knex2_eVz={J1L3qor7Js?;j0KHiiK6> zbn7bW&?+k1U`V36V;;fZ#@oUieWLpT=VOxpB>s}iJ(|BC=O54YJ$SWM@I%=5YRB+T zn*B>U0x$@ZghB*{vS@?=8I_`QZD?h~-|+FWd+V)pv-4`xb?W+L4qRxz7xpdO+XP~)IZ|>KwhE&@W4*_hT<3uk~>h|!{5KG8Tkgxmu zJmI!TRd)*`ciYH0S5Ytm?bIr~rK5%Ji-o!Gaw$L~r^OzDD(D@k2@DyqP4_R*&7KQh=ZOg#WDMJek-Q8US z3=IP)CEXzc(mixH(m6va-6Aa=f`F8Sl$6o}67T#szK!SiZLu{Q90#mf_r0#`JXx0B z!xZIP%sLPTpYnue_zGMxrpuP?7{hRp*-lw(YBrZqvDv{f)QM1bT2n_lA+MU9G$wjt zz?mB?0sm)rzMN5rP&-+RdxC5x$-~p%$36V#JWNwYk8UI{Cb^(QFsIiY%mc>t47%W= zH|kr1feyQc@UERmY&K}+MF5}d-qorF0<%bn?A(q)Yqc1$#_^(JTWk9+Bgj~D78WQe z$*Txp1joK0i0$0 z8DaAc#B+$ZEpF^gp2;~$5#(#Mf`M^j|>zhFpyFDcX34E6*RoHng&Y#IisRD!Eo*rlBzOe z&ySihR;s~;=DwYsF3iAqlWrFydEe-Z9p*u)O4na3uYS6C@Y*SNHP7eUOZcPJbmZpK zH(9Zq_HYAyh_bG$Op+J%2Z>?hPz^|eGRwRV+!vx+f z*&J^a&?|06*L+jUIx_N3>$EY+(D%}&POBv*Obmh-{*{iDhZ9d3bEq$oj1Pim9y%2Z zfQh8R@uH@9t0@WVp7b8^mQ@@#4+LML-59&P%ymci{jpuH%eFxP!%3^dO%L~=t_hQM zPHd8B&cbw#tA>G92-RAfeSb)DfSh-0JqO??7!_*YmF4W!`jtv(l}-S_^?LO$&q&@v zsy|}Cu#gFzl~&6sKE~41MQlnRBHZxh{EV%O;y=<-Muz2+`dNVW3_NYY_qo9AZO>-zR?3v>T3@WV^0 z^Jvi$=IRAmc7_h7Vvwfg$dAXDb@0C!`BJ1jmw4y+p{cOjj>+sM^o{rTdde!=rN8@@ z3LbQ;*Rl`8I~Di?4x5m$MQ(CBQrC51T5^w)zU$xh-*ZQB?UdH3X44r>!GIieV)02f zsp9GoSZgDX(;!&iC-k-zgj1XYLqn3p;~HqFMlv`Rwbl%I)9gkq=_f%yDt@EbhtnTdgcPnWP<%IM6 zGYI`#$Wcj6I8TWq=5?jcWDS-BKa=3|(uDDP32GdCJ;msshy}|%LS9w|Kt36x-{Lo_ z;))G-D?&h=1rz%wWlA(zKUHnCYj|T-=0X*!X6+jqB}plf{Y`9=vA6Ztkv?ZHT*29= z3OF`4BVBuh)jPUf3hK%1RBjk-&M)@p5cu11DFmy{H6+ynFj_0Uz~3n*e)=iG5skAk ziFh+u_5|u2$?@|~FdDW@&6Yo#!&()7aI@~EHCHd2sv!MSGHT9xjyna3)Y@0dIcfIz z(r23RrOEejTr=ZeY!FIZN)TjH6&um8)PbEZ--6miN&V?2{g}m>vVk8IiU(sB_k12fwn-=qA2RXVT+I_urzPE;iPd?EVq*9QdUF;B?8q_tH&0&ajg4+s;Sl-NYIbw)!A#!l|PCbbhF)UnmbN z$PU=MYGyhxw$vh#&8pG*0 z6U7F*HdyJ@lmR-U>>JcbhxL+OlHEj+3PT}X!2|w`oN~4;Q)9DOb-c7&SJs=@U_aO* z1QoDG;cMy#8`=Yfp8mH>b?&teEM&!|B#jg$i=VdfAT|gG7k+};LFCWten+-Sef6GJ zuv3uFw`AH@wj^x@tp0eZYurA#(uVH6AN(hlJc}g@$G7{H%u_cNEJd$`+5)GWH4ow3 zKk+jRRn3i+cz^&~xF{yY<^WmP7(l0z)0D7hV0N9{7Zb#CN-Z;yR6@3_kFEwrlIyV< z!+F#s#8O%eu=e~C;qma%wkfX6JNY7EMEE!40Wo5xzv`te;qXDbccn&KWfn zMM5lq>K?FK@*hTTN;;H2id}aKzQUptNfaL^fm@N1NG(T&6rU%jf>sco0p+~Ki|VeX*_`8QKf?0>C7fM=mN7%d@GklvDsac7Xpoc3|eWd z;WJ*591sdRl?vI}z)5edAQI1q-2*)ixpWWeZM^IOm(>{h<+YrypxnD2dB}vV6z_E~ zO>A}qM_JUo6|7~{ouyFn`1v?HaL15$Xc3sE_7OIcGhW@OSN`mQij?R22lD`-Cr%a_ z^9yG%h>P7UPnefj&+k}%jo&9Vl)bD(aA||p^*)B-cf+Yb!o!~KH1h2SEXH`7`8#gI zIXhafRQ9NC;rGIXFD?+tO8yI;O|x_BWFRtAb=keBhhen`pH5X-w!AXj?G4!)VFha= zCRiv&XWO^&vuXirdTicI)&e~KZq<`PpAp}amjR!XRGQWeRuyY{!WkpM^Gn}hPfj5o zkBUEr0##TOB6rOCDI4C>36mu!s~RXTTGHVB6JPk9$DM_5e0Ze#bx zqj^e|J{aV-)-gNJ!=5Ij)-ptS$Vz)f+fakJ zR4N)Wiak}Vta#ig0&`w=%RNmR`Yn=+vU>$ECE+OXqviw|6WTa3%$S{zk{rZDAH;%6 zoT1_3t~`lT?0Gt*kb}hQp{*Cghn*GNXnY1diNGr*5dNYsdi5HVWH^Y~j#`}vyIuy1 zCK1Dlgri^a$#d7cNi&{2$O7znj=+!rd)i&p+05=0yPP!k?!C@#LrSjzRW9r``6u_) z4F?o!n)r~oCiU+;0$r$?=4u7VH6T5oi>N79Ms9+6ZT@{`^hjRd0yYhkq!av%$$vpP zl6Ah9E9>McY&cH7M}sl>moZ+{RnC=JH)Gf-dpN~o;JriHyolRjWg65c=*y!_cMs=4 z+_i%PQDQV9oo)4j?rzKLUOO-M?PR7|y;gKPJW5zG-~YvaCil4mQ5#iZm>HshIB4^y zU&rrc?ys-YQ+Y6Ar1)GZ=AsnBqr(&})4)3{jfj|5=z5TY@27VyGUu(#0B{{{5A0e{*lZ=X`d zYnp#RzTOzSX646zwQ4dJUWtpTRLZh_W7!CNxjIbLZpr_UB$gb$gdtKn&Sn3*_{lat zK_G)|7&1hnIoK;$7WIq=V`xCkyd5mHPK=?i<{jN+{d7k^xe7PXk*qec%&wbb%Y5PY z{w_fnU4ZMWmK>VtkHg;8pBLR)r1 zWLOyE1DOp8*0)~4F9XY60s7qXu9~p%8Z`~_WlB^M`d_WlE!tMZmRR&eOM<*0o4s*X5=TcZ7yZwU)bcU->t^T#-k7jR0fFFIq$ z;T0SP$RsA~WNgRH#GT&j1q1`W4|{ZB^S~mFoki2AP33N&PQRVGuV{@*qZxp-kO`7u zpQqrmC=a+}bv^m=#os64vI?zrmab4!aa4=yj53g1&t0>5s8>$BCz^2LmHLN9b<+09 zd7&5RlYhO~kZmhnuQ7lBF^%Vi%5-b0B!J5JZay^Vs`zU3Jp&8`CA545Mu@zu$^z3s zaiXac=k51tsI%5Tf@?tjkQM1YDQ2jPeg=s=UKVERHrX2G*PD5?VV^$AJ>t%wY4?&H z>(#>~ha7C~mFRPZa6L!2+mU~ZY}j-cs?T%hf_8GI(HXT`X@?BO-J`tB%n}cO79g{> z(WG6arXwy*Q@C4VI!W6El39e+i<{}W#1XYxE}PLAbRp5I!#tD#OYLq9@DU~` zQ&u=DR&FWD>npcA`n_E-Zuv$YSDsiNzIkxIPvKxfT}^}AQa*g6*~1{G5$p-0dx3Fo z?CI51o+juKGil8A+^z*g?e8DKbDBK+@SV}P=jCAosXECbXJ~S}c@fR$GEbbJDPKbt zaBtjK^&RMVFi^ImB2#gJ8g8EC>9Kmd!t0Nyb2L;k8B7V51p9!wz#)(HvE)dy@`n^J zD_M*( z(=vak9PPDFA8vxWQNQte-(tJrUKc0&M$Q>R?VQ(G9;4LtcLtRP;D7@KYWOm-e@ypHs z)gPcjnqvP8X~tPGV)uXX&~3&jzkAz16@^yZT*e!Y5pK(1M@g~xAv0~QND7QTJMd_> zLpo_oe;sAJWnCDx`9us9sHUK&93^jynw0SUU63X$bln_z$u3hezUL#VhC&OMh#f{k(}}k)=ZLt?JJ_clSddV zs^LL0*hMzfA$kYH{D-Viof?!h0~V?PA|7S^LAis5z-K+CY_%{1FV8^_f@n{1PO0cH z{=hm&`s|`I!yR+^tSt{d{V}qnEDsD4CUm^XLN)Cmm);+OUO{yh z8ATLYHCt0A9N_N^2Qx#`GaV5h8E=S;-|Ys~`sbFkbg;PM6vRaRq3Y&=N<=Mx0v?h< zR8h8@=YcKrrW;VCu(mA`C&4^5yM}IgNlwY!roCClFPPO`;Qk7k z_Q~ORH|RblPIS#dBrH#$o+@VjIS(~Ik~876kM5sVRfcruZ1}!1l*|xfiz@S+2=?_G8Ef0zwwRqQO2tk{p+_%* zUw`=nqFu>JrA9GXX5OR#-P#)c1;md(#5v>Yb)7ROr@NIIh*~P- za<5@{K`)wndSoB+lSAx0OKa!j0D!Chfn4OdITdLk#qG23CS#|bkn~bXg?DW%h7=t2 zr;fSYaGRJmhSA5uIfX`^aokJ}zXZ!AZAq?qzWTZXJB}XD=wBUrdDVx5m55QtrWSn) zgf?2e9t=tBIPyIrc!^wxWRxhD6q;Qx`hpm3o{{O&CENU%tlAwM9KE~G?zR-?*bWjW z#!JOv_qxG{1MEiR9yItPqfaI32oPZ>LeE3a@d#DE6Xa5F@_qv3(Gnm%&+RSh|(2hWC7zk|j^ zH%0i11g?lPxkD>eHW(VUxlX7%N@NG*AIq={t^Ln**Ih}pzqQgbze=8a7cR!Jcmi9`yLQ;ZbgA$g0lGe3b(ESdq0zP36?8y|wvwbsD zV7W^j>r)rwNWuW0V`Vkr+`%z$;uvPFF`K3p+c`tejfO?PdrL4y!5vH;3=sj2sjjlU zr%)wTMij~hj>Yz;EIzylK0l!=tV2Q)b*@q*U3-fELwrir``_r?H$@c6Yk-5@oGyyr z6cP0oXL|d4pM&&1V=|s=EHO;=EgKU>?iOQlFkp5d3a6gI5Jh>>UMl*|z+wb8IVTIE zE|n)fB)<;Nbkd8bMrG3rbXCg^9$P4|`KM8tO+dp)1IGvosE~p-x=yZO3)R>UBOM9j z*%Aox4wX82zaBl)LK_L#m3}^3=0#VXQh@~Vu;7#-4@Mgu+mVE(?g~b~3sjR7R@h6sw2N;cpQ<4R3O`R#jQGveodQz-WTI4%oj5nC!R zf+6CRfjznA%lM-IUkVLe(nA1|2zXd|M$f&Zz=-`R*Wu~Hdex|aN?5*(1CH<4r-Y&ZV*p!vW*!o04t5Lt0-l%tdj}$=~%%-&(0phXD zESULXhg9JgaKO&V5Uo=fz8qT3b?18e7utMoLZhQxnfxNIE%v6rFQmsn&d{02=}(`N z#Y(wyeD;>QYI{Idng$|5swdF*a{K8~j49i?4Co0?^~L>OQA7&e@i*5i?Y0>nd)hPB zZ$dc%tDSG}yY>WZ>>&wzf)@^lb4Yt#)Yc7xaq#e~{Na(m<|Z>py-R;OWQODe0w}3w~F6bj6T^0jlnVe%|Rr{hM zq<ALbTf4f=3shBWGlxl`7yh~10g*Hr`?7C56*Vv?cf6c@N?@( ztLPcC0G!0+vDq%L;F8fB69^vWFLwF$%oMy`ZPEiFO_c*FL>>57=;MHhkS?2@(b;Hs zm2piR!D0)pyS}25JhA$8Fxxuf=KZsf(JvNX2M zp~@{9`FrKI(@(VU*ptf*kOF02gak)_-;19$ubGEtyM$&P)YDXIgrV5_8NZ|m=Fpxy zRnN7SKDV08ad7ogM2wE@U0fH;+wD^04coWOv$7|h22u$!PmDgFYjGo9j_S9Lts)kY2~fXkBXHVjapy;` zej;_X=-vr(60a%Fz_Vld7EMq%%ym_vr!>f#X1L&52_~T9S0I3dt$%Pq|?TTQFE`ofX?@9Tu-FMMdU9XUk%T|w(V9roGzi{6vY;! zQ}b|N_+gdoX$K|J^2!P5rL*utbncnR5HSiaY7Too6?$R_EO?)*iXLtR&CXXh_Ez10 zHQm{7%19_Q+|MnwfA|77{HsMc7%;k~)+@Gqi&hGlRotYfCzwl2yT(O-8)PBMWk}*o zB(={b@T~HbS{mRJFf+a3OytJ=Ol!Q^j?o0H??`me{)=L_T~xT@_cXnuIw0Oo3V+u@ zjYE&qo8Xb}`@1^1w%7It>PnH+m{Ej?Z#$Xp1ZW;0wyJk+6T(GprH2hZ$Q8^+K($>k zC^r&@^x2;gm}L}t$v__0Y#G={WDSnWFTMTNO@=7Ki&`MNz;H)if`_ zCQu2?Qm>}2m2uRS0OLcFqaMS}ygYcpbeS$T~-M*T2NY*VoJPu8U$x~VjLbt z{F+I%z&GEHb=4^SW+(lX%H1zd*A|ex&%9)6pZ#}T$)>gPY767YgKWUb)XH-u=?sRe zh%b$*8ildh1W(sRw+|(CbKEE5?ayPtHfRT8R@rlnQ~B=9m_uMrQUkdcfvK|W>21N6 zan*H{QU=NqUKfc${685*_-R!*7kux=+9VkH7-%Fg8Q$uW6dIU(5;|w;LT)E|bR%dK z>f(aJclcqcI3)u11xa5)(KnU+tljo;DN%tH!&@DS9Ljk;l1UQVvqoO*Cp-}wB*8Xj zO^|uAWEuyjJ+K3m)JGAL;(t89xZSE2k;XCVA@Rj9($YgEA7^_L=xXOSQg3h_k) zEn6Gc+CCOV`BO7QYg8<}^^Yk3NpfzLt`7fh@o4n%Dhx#XzKjdECL_;z>YS6wL4~6v zbmt})^*yWk<9Si4sy?C*+^4fD8t-^(j)uB-&2We;<1w0?5~C%EdoR`f3Y8uz-+dJR zWzFmKjA|x^$92#bXwJIlRiWrk!$C|M>p`1IMB+Qs?Nt$XpT~q8_fkJn$rFy{&IA*a zn<80PxE62@K0_Dmou8)(u$yEjg!LWMdE*#akk-A*d_1WcMVM_BGu4xhl0iY_vBd*h z5*2WTt&4{}rP;G+FTI3J7UyK-aZM^N7R#9R6nfAb+^m)3)#t96=ez)wesh819obr!=AFisbf7&)VWbYD2AbfZD|~Pls>X-FBC> zT%lC4BI^zzr9KJ-5Giwz@o-RNiS_iuzsl9Jr<xQxaf|;NSbC}d7m*B0rQaVDaUD=5>;EYY1Q_WRp~loUVxCkszxIan!D8D&`IpUS&o|)E7SeqiSU-m1DjhW7JtrA^lnNT}_4j<@`C<|=Ph!t9t*gTdYlPcAMT{Rwsmi{3Y z>GkRNIHnx@!r9fV?%^a>1K-zGFm(G+(iW7wiFo`vuo&KV=(C} z%6y9uJ4$z){gryxRdN7U6B~jMoh!EJvaF;cDL0vK59rj@vy@tcLbG|rlnw4bDFj#3`A!ht+r^|ctN(Go& zjXeB2rF|DQ`dbumi)Hk@KEhzt^#wFb?>(`OC>>lKdN0$jXhk$X$4bIt(%J~p7Z0^t z8DwToez0Az-J5R<+>_&_e%g+pUvWYzZw)4x(0B2iDtY6j84AfbGks`h5YK6Z zn)(a~QSG;u7Mt>f-QKdzz);|ZTZoJ7hOXs(4d5`9g()|M@c&~05KC_(6T!;MELpy7 zS55x)K^#aR0$@(>vf}}ez`*Hm!#JsaFKkFzp;WWEe}2~W`cUKJivP;J-b?O6YiIJi zNrks)sxkE(Ht(mMW;^QjY+@G#dvOPQmF$V59-8^4y=I`G+Gcyf6LbAjVdiBD0 zD#37hA=gi|GZuu7;ws-Fc6I*Sg~9)y?FT*V_hF1A1iOY5N-@6UjYDxWnqZ%>Re|q0n*cxIw@I z^@Dpa<)dgBouqw0jUxl`^p6{iR@CJ~33~R<6z$uAx`T264}644gKhNh&)90TQ`PF_ zhFO?HlEEFK^PD1_c)jT-lPECry1WlG9~u(y#KXiM%m6 zP)n8hAJQfWPl0+;65@>aoK+t`{pLIR{ZKKeIu%5dLOwmxB2|zfwLNM^SbT=f>!h3j zf0$q}nU^iI{nr!YxNOOhitV`VHEXc}{QHiOEml;1HSNx+EjGKU)2n<-oizG zI_SMQk_2msbxe~$At=m9tsupytVkAexh&YJ3@AM>3%Ed(Fus*j;21l7{1uP-jg!Uj zP>)u+*fkcZ7Fr4S^3(Giqde9VqvW*^ zTt1Lio=SBE9XBi;V} z&_OOuW|um27*`PICr~IrI?CShI3|VGCZ2u83p&JLBj|Go(hr*rUD3J~a+uqW-5?yA&b4XHhT(gTkCGo+B|G7GZJ0Bg z&tynC^}Dzv&0EpEJff^h6T~aSj-l z_1paiXtqr92}sUQ9WrpDsVfCNLL!}&u`&@}&FQhzm5Y@-x1e0D9>BkemC8y8jB`55&wEV+^mM1j7x@QnDO~ z{(%e(Wjv10o`x!+zz3XYuY&e!nfC^C{qLZEmSK{qd$r03RZyL)C|IsSONn22L7+D4 z?<&z#Ky_w2UI;THk+CF&0=j*WK#mx3$KWGfi5Wom+yhn{gH3dkgeNw(%jHHsbnA@e zghsc~v~wG-sF09sjQ^tB$YxlEL2y-9u$@rP|1{M5iFIs4W6s(iqjz=*o!Zt&>l-sfp; z$l*+Uc!Xsp`2RAq-eL&;Px|#5{NqQ;`6+s_DvArhS)~Kk)2%EK4R^KLRYbj~baX7e zY}@kIZDuCpaitlrkVr-0QmZ1)^%^kBfD4JJkK>#0tpi-xOajoBY?d-@#!(8=XFD0} zaBx&|t}3HR3*SgSPzKZy&JQNW%Z2~N=6HXI3|p$aG-ua=stR}8Ci?qm>dDivVEQhU z6Ur0JmTU^i5*d@n9@dMJ^r=JIu|-y&9NrZr{^?)?zAWhaFf|&c&cV%@B{@FlY^bGA zU2$Q*P6vxwzAct*j#du-?#Mu}#E2vuEC;q?hjW5>uplkMWBcV-wh_QC2Cy!A zgM2qA3ki#vDxxL(_3vj5d=2rPv2dZZxXnniv4$d3Qj_yVEZ)VDE_L+}#>%!(Zbiji zqdV4>(Hi>2BY~IQlum2*ARNcl}{*fEwwjT$+JM{Z=Hs-P9 zK`n?FD}mUtJ-YH~Xk3dQd_yE)>9h`U)N{D)?o?^oj8h_Bj3tM~pf-mH!hC92%t0Xk z%a$M(ox+rlm<`t2JIR!F?>t@H9>XM@K3+p-l~#BS=>4OZ%8{Y|-!qw8#MD~&d7VfD zfPZfuBPS_e8&gf0H47?g=~q7Hq3nl5iOVB~hB*0WvuA~$B3Gw>mBa$NHFQmYWl;gw z8ix9{22NTd$zu+-eg;U1gna^zNo~X^09$cX2oo1S_oXDGrX9e3(w*EuB3n|UqY>y8 z4Kkf%i^U%AXTy9BaAg=hS0Nv+NpCr1rz7G@J($>OU?EggG~03#f(&1r^=j?V zZV&H^)we6}LpVi_7M6&VIi^Z{Ldbc|1j%vk&^A_`C_aRW; zplJeuNuELWTPu1Q2fDLW65t{ka2aRwX=3GzdZh>lBhS1ga8hkyfp=g4rjFbL{A(nB z%XUi71-vunWSf(i4H+enmbq5@0q@0%{)>J$BGty#H{Ho&QG)@Z%IPP{!i9r7NO)?V zMNP4ASgR1uB&m;3sLaIM`dgiW^gCY2c!;u^EWXOOz*h)OT%>S*Em_Z4lU^E_$RNfg zV}o{Ioau@Clyh>P7GC87v@i#oE!qKKAnp`- z03<-nG2q;ONSE;F7|3xfNm54VdkI^i7}n9v#SI@a?eXX}3CYd)W;h(?=-~ai+PEj% zH;IHNVtYOVu^+$sR>w&Fmc=^Sz|C)va(&=;-Y}&O>0QKkT-Luc)A{TGt({y0<);Z$ zR%q6VLmP@C3WZRO@H`+4Ep=1nhDHlYYbY3#E?Abf{(8O^<0=^qzX^B+SwE_EaH(^g zu3isy##alBc9`_9&Y=90t3Ryrw11Q!QUrt66s%JI690cn1S;6Be}=5E+^BhX~N|*3SaB0iRT%Zi-m) zailitWD{gJfzl2GgHQ{XW$l(o68{#}w?1XNvkigspL4#W5Hy?tI(~9x%BY(bq-C@` zqD9W_eI0u&Cu3yawcBpMdqpB^RA-nRK@=g+@HHV6y<$tX%~k)ccM8x2hmN%B%X?-q zGN?*YxZIc7lhw?_qYtr>47goGPBcz5v*yW1pK!&MdU667fcb?Z3sL#!u?!uioQmk% ze>v1YDhDvZ%u9)v&--I$Nvfe>nqOQy%7Pu5w=YGVdO>B3OQY8Nr11JK{o)qKna8Vo zd(N4o;EkWNO|s6?MmFJ|uu_s6RkuL^A;yx!7$o@q@?dWyoD+kH6!mOP27HrBQZZ!Q z^7SrU?RI)bHsk#mmI|D>pp8XDJHj(s#Va%3DQ`6UXS^p<2p9vp8bw49{_ZXYjwymy zlc5Il`$_@0Q*E6JLGnaghFATRiZAiQf480pZ)dVS-ekrc&-%Zx?G zr{wLW44yir9H%$@VDL_0RuNY)$O&w6>b7;m{&oK{?hD9#9&qcZuOE6Gs_6rhW4)Zr za;23+Z^o)T{rdXo?g6j`ABYD9S1z>OeO{=5bRv4T$`wu6xlF}Pf~hHzKc*_;qc9Fg zB%|g__Jtp~`th$*IqlzL3+!FI-b$9_xnB>jf*15VM89)-fzC9d!w+D^Sk&HG(p!q{^f+r3$EwI@abh!g1pN7}iwgI64V? z!U0RsRPtZ*AF%97Q=wVqCgOaSdC4Yd^zTJ*@3#*WpY^moy0%y#2-JZ=CQB*8#`>9_ z&e3zP6rPtjmyYFp)K@745m>lOTgrk7Dt8V2ltmk$bD#K?q$|?Kq=>c*v;Xmt^P&nt_drNVFT)BK7L&LwpX8uvugmN5X?k0%g$S`|dpnW3j%R%YpI6s{5e47eOq!d7?f2UgazJqR1F5`%)E?1m&=CEM>>sQNzf?T2hpO`XN9tEQ@ zFPS8|dW!~_MV)Sxi%NOSY9QXQ&cS%RXLDf+MJ3MR{5azjb4iV~cF_sXu%C_F(o36w z9aqRQZHgNtXXg?Wj;!??P(nC7%ezya{|VuOvEN#de{5ue{tzm&t0duJXSn0AB-jd< z!{`r*jl}i$E*TGtEY&o?h(WG7!uswm&m?QnBi!yq?b$7*%&L*tDf1^QHY&gUWK!XH zU!d4auwG0R12~84S+`2&r#}XI6#5tods)tTF z7{Gn6~{5$7Z2 zdHU@#=`(f4xmMtFG4ro6{-|YO#>pRYpq}EGSUG`|)q|`&=B_87SqBQ=xtPk`>!O{T zk{jTOhv6OM?A0ryox@}EX2}t$1vTZ7_CLkf2#6{Fv^uqm5gjYULR zvOTOk%<$RWJT|scb%URYJLqIr5?;gvpGJi7*VM1`3I2k|b`Qj`o1ltKDHJtXuED+9 z-(KW=wof84PqoCj=m-8pPsd4Adx(yY_sHiEd% zfft$ulHWTwO`k?;M!%L-c z^SRTQ*_&{8+5rOS+7Vu4w)2gPys#1Jw98NZdvXUz7zQ&0cSO5xm==bm80L1sbx1l) zHHSoQA=03&uuC&1r$z&NfdfjcTEn+5I=+|R_Ex=MI!1uFip5kTzq>M3I{xW#M=545 zo#*kQC-tZCCQa`q?}t#wblyUReL3x9aD%&>V( zxRK;;Zr|_ae<*ZS&-89^mEWki(7VwyZjj%_0jwA28&z^Pj}*aAQd$49GKs4^b?YL1 zfZo_O#2~r)4>>GMH=g8Opcu z?F6>c&csjvDu|>Wn*E-Tlg@zcqhQrYv0d}aIhWS>Axr5cbKMp7gH6ri4?SN&hC@^M z4_D$k7z=F`R`?a!9YMqC61G zI3*3#bIw?psHCxq{;vIblyp2A`_|(hiGDXF6*C+Iw$~f$$UtURw@2wTvKYr+p7QCs zxTWPSYS6pLizVMX%#5F%RZkP=OK?=u8$2(5S|bTQ-JQuE{|#Ruq*E)4vPn7OgwDj{ za+D$zcLEY@t|a%}3ssY$WF1=jbTHm#8I2sG>w%~L%9Mk{`sni<Lf`N(+VZyBbQ z87UT}qsi6=Q@lo>1BT7sd9k5jXaCO2>B@kD%ziDWva5|{+FNh9c%OmoZnkE8%s`QM z*pFT#%orMA2e6CoM1#!1U=^F$+TZ@58zsm(hf`LZ0aKKOIG+r{jPjjScD%izo!5cv~{Xy`B z0z9=q5v<+5^Ice=0sCfw%_AWqVnCzZI1<=zso{k5i5k@hOJN?qkc4t6INZ}7YX@Xa zT&FU!Q=L}IPx0&R08a)dfOZ7FbIoOM`&GXHbZR1QHNpzQe0<&%nR7kkLJ}%m`V@YX z(D}BJTl>=3xdo8M#O|f79xPAIJW9|lJLjgiihW61pmK#44ikSoZ<84>G%H6>2;X-dh#FE@TLu>H;(N$&GBXU#I@KxJDaBw6uN8NVC3t>U?=7H5u zE0ykw_z*iR0-qxjU^7>yte$X{&XEG9f(k{q4LtYTzz@(Iri)d~afqbFPf1J7 z*BvZC3YJ>Sh*?3ygwad}To~!sUTzp3AjVS4h9J!VVY!^-az~hDGS=-N`#1zR%ACs4 zyMBwxNaWLi!Ps4>2VBT|NIVHM@Y5IfUgV{N(8rIxP}+(Up2#XsY^I7w7xR}?{z@yt zvGP-%`TME#A1m`F(f2p+2o~uxg}-GD5ZKL9rg?WpF=HuAn&oJNW>d|rNzgp|wP$AOQpKm1$7=9q^P+bLa1 zVazbi8~ZU8*4Dln|Cmu>0&{SC17od4T9Hyk`Vu{S<`u+FPwPY@8SN--v({G1R%D)) z(_iL4SO{6ce7q+t)w~;DD{gbm?RI;)VkBSmBK!HThS-S!Rf$=m#QlriB)liaUQe34<#e6tckMJ;2oI%LU(&L6)7>o6=4L;OD8XF#Oft=8gkZ)#hB zO&mvN7TC(%cU~)UqH^^ToRtQ+*ej3e@TV+N6_r7|pKq8GKnzl?G7ubfx)p1zPmk%p z_FbB=HfKyS0LNA5twzfaD32i4bZ>sC^rp=l*WyNE@OoIa=(itF_ zM0?QyX$wYumy#t*^EFs7G5VQw{D&Vm*^{r-KO()j5>rczGKG$IWSZS(JT3+mlq1R_ zI+Z>*IzxB5q2vvFBmjaBx9ejR#t{zF7wS~)nmv5_$|fb5%b^fw!8GNg$RDyKy6m^Q zqYd(r#+C1v#*9w zk!IUcm2!vvSJFYp1_*U5i^8MgHNbwRRARI#|4L&EAV~~Axdq zRw6NbK6)xMm;6EwH5@;GrxU^kr;)VpVm1LmXZUypQ?TY>3QYxdGWPw(a|8tB914A( zZO@r6o-y!i_7y(kL<5Z7}4xFjFk10wN4+_9$-jR5rnHeY)leq~@H z@3<&@*2d$^s1VxC`&CnL7eKEAp6COm8HYgijJiZhI+2K(!JmGR@Q_!Ks)XzIkpjKP=XYmR3}Nk6mPdH%nO=-s;BKfFW?DKA40{e6R?*S4 zW+2>>lpHDb{h%GUp=VYl|NAqQF9F3wj&in>AxIXfhDTKDK7a*6(Hp8^g`35XY! z@j_uJu-Kbuo7+OnO0{56Z4;}T<2E{FnUZsfy z-mLF+rM~fXS~@@0WRe-e*fr~3ke#ggGOqMgPQw)3^iJ0RXL&JJ43sH{Hkt67LC{i+ zE%;Z8CxAt>VBI-vPLTk{S(Uce3O|g@65X0YwMipp<+4PN5pxQK&*Q?eG)^`8H@07= z(br~ck?N(oj2I-1k^$~$GGF4=C%Z6GphkpqDLq0VLSC`4`6vt| ztezHSEd8C3m>aRqIUizUpp$I|a|L3Q#z5ZJ?)0bG_Hr59D+`uEqnT+_r{%0wtZBb_QteQ@Pdw=%RL-$`H{&Hg4TW zs+Nj60-}Npf?v1#zLM<++iw?TJCG~u_;! zI1?sN-2yx?%X?Txha6u<@Ry}>J4aHrMzlI_ah+ND8h=Fl#}n~CjuUJC`#5oei~ln} zHuFzoBlFo3tG__m>qe$SQx1{DZl^~>Hx}df?qOO?iaO?T#eE1Z9*@n=>8wh znV~Abp^CZYr()52XQoZ$TS9x{V61tl0JHC6G(MF!00bXp087Ich1TNbB@6<^_j4D^ zb|_i1jl;9S8_xLeI3-5n(;+A@#%Qu=IJogiQWXas&wak*hObPrjRT(4v!$b6#q;A6 za^drSw`F;=h#$sdq*W^7KueU{H&M~o zxD^}Ql#8xr92*n#k4YcWWAS1>_|P*juyStV)k=3dgC2uvH#Hk@DNIfqY(Qsm*wLRD z!q>Uf=Ky82LPRm7Ad(fxStw7o>qu=z~)>8M>T+AqL{k8c`!Po9vPtN_fpo$ym0AUotz5tTvUZ z8gNCH*C|pNUq8SgSFy$tn7qopAp}Pq!_}9YXe2)sebWhn3iSkL$?HqBSE0PumQ}yR z(saN7WVZV`v-0iMy1R4}s&934>Sdc378~j3;6<0i4t7)qX@TEDvNh(F48L$kqP57L z*fA-{mP3Or&iwF2H45){`;(nV*S@$_(O|9C>W;B|`NzY??P?zcHTXItxgozw zgBDRYN83+YZhysRZIUz4GTzbCiFXFFYFSa=z=^lnSIpIPv+t9B6aeVoXY(E^}lkB3BZO#eFZi>3ow_Z?D z5O;x~WWUd?*i}5a1df;*e!vD3sn>1~H5r8mX9>gRcNXqeTsPuJ;26GQN}zmnxtku2 zv&i%kL;5dd1(E zx5sezjSU!2=n!IGVaID_Gp71YXI-9Ao29P`WSN-}$Hsvej%^v(xGj2{Hzxsx7ho)M z`1TtcbBUrV3zk!nx#ZL5vG3U+jXTVY0nMwPA=cJmpXes;$|F%8-R zQqY1wxj6cuq8BQGSi`v5^CZ=D-yO~{%}EBcK7<2^E*iM`ugJ-7lYmDFY~~~&T%ZgC z6x?#(;75^-r{Iqtl*pyLF=k*XsHzMMe%3O>Az2_Db;7XsTb&Mjo+_+FZ=<-1PN}l) z{e-AfJg{FQO$yQTB`PlYEG7Qb3Rym1p8%jFX!Qa}BRbb}{z=DnQuCE>qU{5l)4&C( ziN!H7#_qpASzo;n}mfDyAM9@P+Sc!NoYiZ5FZt*8<|r$2iZEyCUop}K~iZSXlP zN}t3x)uv<~{u&9IUED;CDgB78YZ-E9LDd5{Kg$V5+%R(x82F6BiMz9@y z>TwwAnJ_sQjF|JY)G}O?++A4B2l^uUQl(^hv2yL(b+>)*Jj_qh4Rmz;Z6Gh7Ol7t{ zlc4iYrD$N@%GFMubbk9v_{C5TxN&x;Nav78)4`E^b38Mobordxp-O6|P*-=(pr{~- zI1>jKBDL553~^UUM&hm&t%DaBv;j<7jpL0tpB*w;R+=6IwtjJ_mhY)b7A^&kh{=f` z*Ph9f4*AJqK@9xgujB$?&4kyDAH{{415)QqDfDhTvo#u2Ba#VDbM$uHwE^bxPLy1j zaM+JCR&R?W3tm6x$m8k0fWPqR@OLT(mhxBU z;p4T5!M@T{Rtt1th8`GVF;kFx`n_hyBfpvDx=Mf0MrqTXPP3f+dZ1DrU?uv_o_N_W zwhq(kq3;&uAlKWpb=OD%(XNWNTaV^tSuf+R&-j^Dd&o6RdEa3$IrwSfg{kko`t0jV zpm6R>y&09;xZ6u)%*lCJ%2Q$5t}I`EcZ!3xr2=Kg=Lr8!GyJ(52y1{C@Hl-hRNj<| z!u+~W^sLBYDwraP>NVc|cE38k(r7o^js4`0IDP&8av^h4vbF66IudOt`%9-+TGs7EJk6ndL~ULa?rtdEAPZfuDUUY38YtJWH73Dd6ObW} zhVix8SZkCaS&mLZIojLW+7b7ctHfxW=jV>McQSnglL{O^ciB65v2ZMFCnfWS;x?8Q zaGGAc6-!o3_;&HzEjHI!47GYuv(Z<(0gd9Q0@TbsM0zuNtJ;nX$)*9SD2)VLkxqvU zZKWTVJ&1^0V?LOaWfnb|W+^+*>(gy}w) zW#FI86MDSSoBzz+Jyd{upOQrF6^+7>1q9SeI6F`Ki8x~!r(cH(8@NZxgy;dt%0RNv zL*ZO@OButpwHOSza|~d(on%z!jqlYKNz`z-Ep*=mWiyfz;`sfa?*1#Sf+40Tqq8p1|SKcw0dMiUBlx2Yz(t5nThmn=AGCU3}j=J z-kIO@r`_vepT}p^%b(^G0aH*Up|NwCE+{JSh1XMYCujq8_QVM=8x9neec*1%C3}KJ z1MVI~MshobkPw~~MyYV%b@kW-tw_1*`c_CUixr<0zIAuAnd`%BfkBktFyyBFcFQn9 z;N>FYx#3o_KgebX9wUcYg&I;yA^EK3JEWpVcRS*h8@EG%unX+k}2Q57BV+r zq` zs*Sn!Y-HJ<{s)`3na7$a_&6#ZP3)@2pI()+ZsH#L@tm#nhoCw7@Z!n*E#JN58o;2) zX$YvW!&L{kATO1cu3OR6T1ry14v`H@zjXcGK;nG0M*%-FkkC4zm*`f4D(YThGAVEF=z0a@Qdgv3Tf7UAx|XG1+1|_AF3sAgLy|t(ZDFop;V(P{mZx&9;Ya|QCC}bg zH{bC(>0kTM_)Nl!o;n{kV!LLa5Q@S#!?$^iHk%u)xZLwoD$|QjesH-J({rTO;FQ-X zFl6M(BSN&{!GF)AGDL(#guVn^%~@B^k%af(6xnI3Sq>k_`SByrZ;@$@B*U364-Zu3f~FVr3YdcSr_|%T3r2OgrKB z*6UgTEV(@)3)y>;I7H$PTNYxZ&nHm9#E2EFZXLk7sCGubKJS3}%g;yEzi9-Kjj*gW zKvd-*D zy~@5;RRF=3=_evuR76x2ZMyc@Q)Q{FDU%(T*VN!h-_NhISg$8SILVU<%$dW}sUgf% z%)Gwl;J?9=g8UbvHKG2nia!6P`cB?mUOtnonUuZ$!r;U**;@OIw`hD~Smed+tVv^& zG>J)*AKzt`WHX&4FsWd*$stx4QlKBErTK3^(~l$C)&BiY-QF`)un5&Omh-MSfd}#~ zVrh(+Lm$SKbMm;GR>1DCSbr7)vMtUCAPQhady#eQrv?B;B^SlDdH*IlE>tQ+AKj+N zuMbT`z3^b+yzM9RjfCQb+!YIq2b?XCNGL+Mtu_ylHbFV!#rk5JSqfBl~Ng26CIutDmp^dprDl06mjMyC&KX46K*jv`5u z%gPJo(E)%s|AgPi8U$JMGQyNR9oD%A8RTTmAcrHP5RQ;Pqp^oKE9RvUHZW_MM_KFE z6CnLK1BlJb=M5RiPco)w=H(Q-5ZZ2LKTkc68Wj5{IeVW^eFph$t6a|pzOEw)jj&Qy zg`rsx^8O)&FxIgW-;3htH^2m}PeUe4h5TMHiYO6k&;9*R#FBh3` zhGmYLeE4&5%=?Ih_iGR)RT1Y7T9AJDH3CGu>{4p}r%*C>rF1=h%`_LtEW6Eh*J{b_ z2+OsYFxnq^YluIY+vI~%HUD2z<8~lR23DiAh~fr*q}fo?^f=)fL`3 z-XHPbPuF_aP`i6mc<=ArjxDyE@hk8^-FOhimfb>Cw3ck$kkhA)10Is@*{=8J6YKWN zeLk;Q+Q8L24}D)f;b9#eJVyAXSdfCEs%=@Fyp@Me#1U5{5tC}SjPxC>yS-S1a~!+l zqgPUO%-epfMK+(mxCX_bL`D;N)MOI?>WqTL@#r-i1g^fyg~)`l(MNLg8Z_B4`!0@K zsJ9gPS>7>KX-O9Z)8m`Rb~#GyQu!(L56p1bz&U>mKIpoLfk$j0*-oEn=kbrqSQO;2 z5MSbJkPI82LVpRo^<04M4NM!JTM4IO?sm`*^yVk|E@#r4%Jc|M)&F{t*+)$jx)t#O(r>L%t(zP~(Cu(kc{3@MQ_Q+JTPw)9CgXb>tjJy5bs`3K9T48IccHJ>0?%(%D;T%0=ZJX=93C_0;XHZJOxn^O-=lCnQ9p<;z> z1?an~V>)^f5e?{xVGuDOvR-V3*uB`0FV<$CKBvuDS72y3J9UwhWqt$j0s5NhzrpJ4$nxS!X8225@)!|*(Iewan5wD5p_ zqJRM}9_g#W6;3T}pgV;3ycw)`e$bMjN|R_X9w|`Q#<;q9K?)O>j?LWJf4b=S`_-#x zZ?Whx?18qYmrq^nqvH7$qgbQp#nLgMNkof;pn;dUmkdik#Zn|(CVfa5gD^g5Bj6z+OfXat3b_UV4{VW??5Zt4 z?r#0$^e*%3pL5hAg5QWE_YPvVWN#`HtHNkYCHMdoRlVK%8{&oH&re-cQn#N}vKtZ> zIs&{D0R?d2pilfji!cEqqm4lQmwA#&{tkN^(_*Fm;O zQ3jO1KStcGiaZa@agkFO&^Lf|;!{;njD!IyNm4)BAP59Z6p~{Yry=*gxa`sdTL*i# zf~+@?RbGYu4~Fu|oOu8#reLNNnXVRkLAlCMZxx#E=pRyizH75iq$)e|iDqFkhJr4kL8; zMjU#7W@cqgln-aUsN-mU%=0v;Xr+02zRcZ>-d;l46Ld(VreISw#Tj9Ro0pB14R>SC z+;b_KQ@++9@A>i9b8W_R1PbQWyEci)wknO5pkOCEqz^6yngSjd#)tv3asTm23mij1 zjUVijvttO=OR}Td`S*~s-DPEP>En<{dZ4^xw^%!`!u0Bs zhy;febB>2ZfR{$mKey?cOnDp1C)fV?Lxs>Pyf=I9i_$U1%7V|f&=7v;%XA+siH_XE zV9YRb7>t$6iljUX=&NDtiX)tCuJB1crU0GEQi84ci4DU51YEWwje))WiGcz1)E@?W zG;^(i$WUN+?EtMWi|eO<>11Os>~xl>px7ulk2o6MA!;bGQaw%K1mQaV`R3YdY#9Yg z5nJ1;gt;1M5&KhH)WNsvI(S)McM3yn3uiae53f0)lgwlS;cVV?KC70czg@i*`qh#v z5u;GdaaJM7Ry_SetWMQwNg!ektX(t?T%U<_!6qK2?LJdEj6i=)@r<4lp(~+A$#l+3 zXBsOIfcZCI&Oi7cJqolcOGZnNd^N;n-Ly08-%^X7XrOo#@VKo)xNayL@gT9hjag4^ z0Hvx7bb~4E_dR`SFm4E)cCv(4q1EFM&>J^~{D}22o*&Y|I;F8zFv-99sD??Y!UO_~ z@DJUJ+<8x)-yS&n%rV1hAg4t(Xk0}YHZO!MR3zf)(_yN9Rb^%LTErfw6_ko8JV7ZH zZBoU6_T5lm{D2ftnLB}c03*;{fdSV<&!YgHdzwp*_xhWN#Ef@&=jlY+J+1W%1uVHO&?bY-3 zDsG+=l!WA#zXJ~=hr^1JlwCwQ#6P4j85-i|MPNLFV9KV-`==XbS)(k7ATLO=w-TY) zr|G8;m|beU=KxOV2-YH-`B>nrunGcKX8b+Jk>gfIoh7s7;4P~rMoiAHhhq_%wA6l1 zT#1+x9WE^KF+PREHU;OCdM2y7zx*`)3=%v%&FOwIC2K~Zyc%M3lq;+_yA=dj)P-r? z5Z+4c%;_5M0}f&kX67aXx%DyZCn^cOn7@cwu*qb@wo5Xd{x$XgB68oY(x4$3yXikX zQi1wis)5d9DIrmWtB6`54A@P&l;1V|uzd{(-U{j%Wo1aHOCf^tW1c4%&eNdrM%%`g zY->xMe~Ab3c^uZ3Xcd!j|D7{v>otX)hB4J@gdhN_G(_pt4za_qW3tpK$}WGB`;%-i z#);VMsB*I31484PlX+Kaj73V+W=JBKLK?z932ZJ|MBTN*JwL~ z*mZ8&gXe0%A$vkvc+bS3jun@RMbdJ?=r>{zp6I6}y%M+e=f>T47aq_AU z5~A=X4w62Ae~jX`hCG{P!QAqJ+@`c3E)_p97*T`W$0JbUm4-n(2E7^GM)CA$acm}=}xkV0FE2dJlkKs6|33XN`p?j!a>KB&5( z^H)QLZqs>&K*yGCKggL*b-Djj0$-Bk7mPSU2}*tI$`wUJ?|N14oqqXH*lM|4YmtJf zz!COQw8yDL%!*jgDWl`)UwEceGw>zoEyZAzgC&dQ;I)Ct`i>64F9&&Ean`}!+wCEC zv?J*?UC#{sz`py2w1%~Yp}JK~pI`NFt^YwGy5t2s_Cq{S!6S(t0ZEZu_U8CfZM<7o zE=*}EhE+@k-p-h&_|jh{4!Q&b;CdDz)~LjjPU{!Pa@i-E? z{I{B4@1At&lj=^YybdJ+jkxu(ch#UDIz5JGaIu$frVerrD#$Llg|!;l+o3oJh3jSR zOg&Ye64qhxOpcqNZ>lLDUXN3Ys z2c`z<7r|&2q~#+J=v=gHp~nI~4f@-b+cBx8E+Mne!&#O!yA)*hQxR6cW(H(%Ypq1S zeXwhce&}sdlterl0LG$7C9tsq%Lp+1q7r14)Z5*@-xV|%N(0Lsafv3kZ$J4A37YKJ z_|Izyds7>#`YOxig+1@?zQcYtP$|$s0;scY2$m&XzLNhS^N?5}U|ET1nmt#rn$(kT z{gNzZ)`epOzy*i=SsU^J#U_)j2;(dxx1ryKeUVu84$JzbMVdt99rgSz808fsDGXV*RY0gTozjFkBJv*qG(gDnBz zI0rk#qi*wyt=&p(np?!%Laq?6`*5H2t@!UB#~+4UDb#=9v0`8xPkhsx)FZdORRop- zZKGdEV?I09D?uWm1sE9_i@RUOSSmJo5oHTYw8`!3ePLWAOZgivee}iqJf#9% zq>R2s2IhA8B=R58k+8Er@c~afxyG|^$S;)}Q=jkuxWV)-d>-9*T%F2P9t3?pBMCyNhafjq$l?%6fykv$0`mYi z>er_g`nVDa)m5T|Jsr|lR3i2fe(e{NhdM-11e8`SjZ5~LfXg11!23ihWZT6A&W0WS z-mLCAKhpJqSy3(T|CnpabA`ldJp=&XYq#Q8)EydBH?cr(jt1TP#-^(WX* zhqZ3A&wQ#A<78_m)h)3Gw3DB6nU`wXQ+iV;&8p}_gGBDWjYlDeJ#xa_G8H-^5o92y zI}k5%m%D(;qHok*eb&7C#p8XJbY`S4zr|NA47uw|#M9;fIXjoX8iLsdOyU|1`v1{A z2j zJ)gr$!(rp1QS4{NA5`9de%BjIo+&2}=kPcsOdJ|uv3W1)(5#+Ml0yH>Jr`^o3Nib3 zBM5Ci@I*u58!0zfcpkQlc>JS`wgNY4h=Rx39;+&)hK-#vRY8$L5(-1N+76$!`zg{$SBReh8y6Kq7{07&y`F_@ z6O*Z{f@Rga?}Zz?UE4w*Q!i-mIXiB3Ph~g-E$dlIlA!^UNFk39lbwzp&&(iZk4g#; zjtU=MmZqp*(i{bVnvQLS#kk%<&6{ng%4Hf3`;MtW>O!9_FQ{!Ht6@eKj6?LCrF==- zIl#&3X#T8}Z_PRSnVOmO3=Q>S8eyg)84rCEw?omI)=2pGux+~SdHuGDgHprf=#(8y zO%grCLpwf(Xcl$r`rPwu8z3Ri9+|0Iz<0{C`bCse|aMe=hv4nAX(cS zRXoeEhyi~k+_Gjq<=NQ$2yLMuDoxKgF$2d)j&e#vu{OeJuQnedPx*VHPx21vWW&DS ziKdm*F#OtQtG5ki3o%wAi2AMa#!wIVqwr=TG&MD)+zuw}9lEtjg-8Fu*@?78cw{u z&oB%6UqkHd zQ=5g;gfkt|^z-J?LXoC=bsf;2*WyI2t_s=NgeN5#|2Z@{M9L{k%<tA!ge+E?~7eLehWxs-1ASwk-#P8Nq5I$~+u znW|%07eG5pW z>dISPH(V}MS$H&hMcy0qKVLeUIJ>A$P?ZMil`>G1Ta4xMxfP|PVBMIOS60SyJFZ!O zWbVy@FDj}0;ydlPIrlF$Ne|{BE@T0|MPb=|dZ_Axzqjc)mGklgi6 zbw2IG(W{Tj*kMP(KVvXhY%_GXnIojx#MS#)!g8Kjm6^smr^hI0eXTyi54CnUM(OXAgvN79JVCi zF7ppY{hr_vARX(P``vnl8}JE(aZ$wUKOVM^_>jgGWTEQJg~Tcu__ab>wSNyn<-59H zfg~+Kk|~1xv`u|bj~_~B3zHfQz&rw0&Z|^}n6jxG8f0t=NrX048b(eT=wb>hGyQo2 zr;9cP?)v9llwe)A!~Aj}NTu+vO<8K!STsQ3SFI{$xo5dPs+ znJ;q`JyO}&%-N(L|2fQPUff4|Wx+%Oi+!IEf+WBcm3&te`~J-F4DvtkK|Xq^)KK2qUr1vJe_S_@1!rj>3Uy*45argB3)i9jvJ2QM8jsEx|8wy_TH$|L${vL%02+*PSecvmzwREEoGw!T{VubK zknf=Km%s;=jPsvn{VeLMnv2aW1oeyO-`8l~K$zE3#S|vf9nprc#s973wD<>2#u4AL z_%HwgF;d@1Y)LBYvp*31$Flmrd@@Kb7PJhI^PfRSnA@509qFGp5Fvm(D<9ozrvKxC z`maCk`Jcb!_-A#_?p?O?iGg_Bsc%BJ|NnpY|8ZjmQg8|1jKA4lthfJtZN5CUarCWl z_~xtf_pzXEPtC<=?u}!Cy=LTp_Q`(|o`xeKeUylyzRyqCDg3gYqw&AWw||YZ01?O_yM7^5xRAs6hY}ho zDm3&PY3Z|-l?*H*dffS1tNF?A9G)|FxFj-jkM1_QeOVFsSlyrX^+jT#2vW^Xn+Yqc ztFvc;M5}L=0_X68B}k%rV4I{a8}kr?BTe|z11k9%;7$DhZLDuS{L>QM-RyTCh2S8g z{3DDincu0T0$I>XEJZ32SLVV>Pa6GxQZ=y<2@LKQraB9O_3o+&+*aX(J7t zuta`o_X_*5;`vZrJAd*Hiy#Rr`A$7{(xLBUtjdAU?>ynT6B8ngU>PEb$uxUr6?+0 zci{Z*Nbm`?Be4Ot-I(wc$c`tuP%gJicpd`MsE%8vlz_!iPXG=8SNiiI7uAqMd9U@s zjvieOyG5%P1{O7qQJI&?!nc-nBAMfGKlVoLW0C=3X&4(ayc|?|v+(;_<@>_*#5Mxt zvwS|pMn(-$NqTB*9LxDhYenVd-+jejIHss*)w*LDh{0cp>RiiM6zk&A~ zFWnW4WA}zMbd+#dvf_}q4+<-`ttfLoO^@*N)rap7&s~2{)+hZRJiXw(($5&Ic7oVx ztl&l^$#K-+kS~chTPLxG3D|m(?Dy-N@2B~EYuFZ6d}H}0vMArC8Ui*MZi$C+U%vbl zGT(2pzw_5KHKnVrM)BdaiXgvx`V)>?c(W_lc}^_C)rq>}wDCuZO5=3rLiH6%n_eBr z-UD+@H4z$&d4OPY`<3sT(^dhiBI>~eDjx(KKr7&$$B#QfDq<=Luz9R9&h`2pF5!^! z16Lr>)-z)pipHEGE9+-nvVy6nre#*UMKq>zaX)2YajWjl3W+OfY2@jvUp#c*Lp$4y z2tYbAQt4M(dJb{S{%_P-tKS^gwiYJ91|IH+;Pdm7-KoSo?E>QDG`;rQ*aVs^swJs4 zUPr$-q69tN@d^q)ez=5;tD&)LGL+HM(25;Y znu8?ivK)`{R0*vec|9ux-8p#Akx9Szl$YaQPEWJj9|g~Qql15CgXnKr1W`Qh^mg!M?9gsv}8ygavrTpc&U(YoIN2a zVcneBPI=AwH?$L_)Ana>P1|gGP=tl`+MJt@Gn?v#rYwxZtw=&|$m$oT>p6OU;Z?RD zdYvmsnd566Q}td>yw6Yg5ht3zRRA>HR>&|S@2L~WRK(rSeJgZy^vKj)^Is3@rd9)t zB-O1BFOE?fi5DU@Yu-i|`c$2adF$SYSLJQ(#9cJg?-BzF?5sZseV+`YudvfoGc7av zNt33bZKq@z(RD+*y`*g%j{4(A6Z+5rav42|#S-w*?$ z{<~wh{&mWB38C#oXRMC1^@g{&Wcf21yb!cR2j{P9xJ;4__DEL&n+O9R-a09vGL#$E z1OgshX-<}-GknF)&nKX*7I?&LLMF38_<+ZJzV8}hS0t78sxKT@bLRGe^FL&vxjbGB~^(VM; zk=i$XcygU=kwb-qq@w`0EXEyl8r6EZrLkGr!j*K$3l-g--RhgESd=ZVL%xq`RP-G{ zvS)sluzew42^j?ig2Q8OtSDi10-n`I4Wi0bsUWqiDVEW=<@B5Wh|TxZks=RiZpb4_ ziXL!nV=ds!cAG0;Ws%}DA?f#{G3Qm*!yjuGQKw``3wUiKZBxEImYd6`MaI(57_LBJ z0EEwiP54ew`0ati%~d=|GlDrXSu4Dt#b40>R)QmVT$twlJDkAWK(x!iGDst zY`zcwuAPHYJ_pz7+we=F=|Oc;UZx#*{u;kI%x5fvFnV@+hN#Q`rMvH_SjM3$jx}K( zA@}P;3qqFNb4JY_yZ^mCW!53(z+6eN*tRFgLbZKcvZC1rL|!+=Je^|9waD?&B(yCk!HfC=AsOp-Qv7@BmntsKjEdEXB?YTFg9%-e#?VfKx zraIdWWa(m1Uzh(UhR^0X#c`N7C*HI+BAPK_^s3Q?iW0{eo>5NL@ zY`VdkGeV6#+|g6ksQxJSD!&wcvLN^WYdmUgfLzw~VfxXw$z1dokyp9~ zQbCItLYR}I{Y$v3Vg}msEPxBsT-Iv>M)Ve2y=laA{eApm9}GRO>@+-~hxdsVCFukY zS7dT8w0emuYw-TbTc^)D-Z;LpIh}F?t8qPD8tY7xcU@F`_QJy~M9`jPqb~m={}L{% zg7cQNDEQWp?`vxIo@JqOmYCTdogx0i-%v7tvn6pFXA+2cs02xi5d*V9cwe|f(QR*{r>2}azav;5uczS13g zL7AF7{#;mnII5BjXM25LRPX!5#8iu$7q8E5)DM?AUOE^1)DW8HZ$N{EFCiA}DT@9y zC4@Vqv|c^WR(kaihTU}Bnrc!3cIn>daC!={YVibX!6U}xZR(4wX2sfW^&v0XGJv5(gP}oUCVHi#_VM1X^KD~Ci5I8gXUV#w~^-eWM>y=-dnUo zBCTvmbEE&L?*B)D9XyN^Je*-2-Al7iN)ZOU`~-jHijy1WP98MC?(%cXGR8XikwA^j zN^}!pSd^5~L-=5cSfQ{hT0niXr9_xJp*=@D4>D@|Q3$5b*B4xP&VN0BJzPS-k*C4&gWJR^~@MzOk zX__;omDg-}xHgjCW1DJLJNV`t{WvtDM$Fb8b905i zC?3r>#m3pP^}!jaS&OCwYj4C67Pgd*g1#0q1@E?c7gCbE{i$@`3dQJm4(!ezpZ!Qb zk9B?UTz0BSLc>-zd4CFJ-ihGmY~H2TQ1w1vqxclohLB>Jn8wI;vJep;_}IBZgk)$n z5#dIw9CjZuRQueh-Jf!hkBPiU1fd2ZAqqe1>`sg6Bf4cWW$^Eps8%$}#PAHYs>I+U zkwo$3s6PV~Dva;r<{ruAWl?2TQF$Ccq*rQ3_$XxRG5oAnx9ZFH+C-H}kBRdc#`xTu zdQ$SNYv-0}{AtX&us|lwC_e48gkE6Av+_=nW{Jwhxj1z%FcRY;dbbs}-&oPZe+VS3ZZgdA7`Y2SO4N_D z?{?aPV5joqW+(^l3T)<-b~RM1SUajg}W(6GE`$9;G3>-+>2g;1mFj=%OIy=@c>c3 z5A?uKWw)$qE?6f`^0&p3(Bo$QmAOFa*VADbz!mTD7MH{Aue5AWZ->o^`4^*eh!Ykf zxZ9|0U;L>v45<|uzj%4)vmsc)+|!X>f1Y52hYuk1$l;N5U>ggmBPi!#_^m$HZjJ*Pqroa{uB>OXyw zC#8W%tV+;B-Gc>}-X?tINf$^EEJKvHNYWv1V_b`4D0Q>1$G{wN2`;CwMrSvD5I%RX zZ&Qx{plVy;SI)v4349Ogkm!2fDG*OhLt6(Oh{}J{>iL$(o-$;wB^CahStfyX3wju9 zQKwpQhib~KkP7SnP@*ki$*Eu-tlH<^rN0YP-VaQT8&|c|f`3!M^-!lSM^conlm7Zc z*eB2A*g~HNM@lLnpl8(iXTX4wO~G*y<||54>gUtzsF7!Ap_<#TKD!-q>YRPK;fUH|la0i`<}0t5eC`R8iS4TXqHUgQ1m5e&vmTf0 zwD!#;ZBNh8njx7*G?>wq5x@>U=NtpO*tDe-lS9z=;vBv#h}BP_XK7X_Us5;_9A`<$ zp9~RA=4V*J6=HHw#FkEeC5;H%$u=ME3r-ZN6FPzagNlVJ>-g}h^lWMPx{_3})|sJIUhB z(E^x@Q68@d#Rdp^fI3pn!U(jFaX|#~sYd>G_%LvP!u#s#L>p??nWx1C>X?x-Sm0xq zimf4Np(0-=-~BmAE!=-@G-<{xFzs4%xW4UE?mP55DkD6Hn*j-Jn&X!SG7q)hE^NJl zB$CYlQ12^_;6Y0cCi6J$&;ic)5%NSTR+ZVHYiGVwMYjJ%pa~?N-f<2eZtCh@=Wott zST?A1tK8*wj8)>TsRq{qAG6q(c(aE`T;k3x&nS6W=;+61ZH=r7(<@Gu=v#sh#(xjU zO01Wev5MDaN#@u9Omt+&`71aX_C6CG58T*p;{+{Cis7!u86?IzdhZ4%tZZl( zRgj@&0EJb_HKcwX8UI6WX&uSU?jL!Z-E3bz^CQ|h=}XwV_=bCdFm~G>xtkQ6l=zpD zq4c$S$LR3amJ(3w$!WdL)H$-{K^(m^ItVH^v;83ex}DR={;SKaeL(w z`}+dpl!dH`j6fnZC$d!8`^sSXB#05cO5{pV6UVSL7fU*&{me7>o+u3}F@+Sgy zJ-!a@Hp#CZ4NYIaMDcxIiYY zoG{(I#2a)9%r-&3Ioe-H4|Hm1D+b;Ngp+8h8hbrnwlv+Tu%*=fjJLbe5*Z;Q z(TtSFPH$Cu_?|4nEP%t{PYCZZ7W_PGI^4Bi`!~xk{`MEf!k2`-s%FA8rk zP2z=o#VK?+DE){owci&FDsT8W79r!oa54TdJC|?kSnVaPlt; zf5>xkO6U;cc$(jjzuu7?yP?oiZBRs2_aLL<(w?oxnWmt;`A~8}Q9@t=5iw83?oV)u zq*h+riTaCuBd)>IpzfrAW2+iog)fnAp7Hm{7o<)x?d%XT;_yJBZ&3baXKpY}fq_rS z#$;CITkzga(+B5m!xMoAx#o*=TA+WsM8`-y?3Js7_kYz3yHsJa@v|INJjbXYJm7*d z<)B4_$QqT4F9ql+;MXoQ>3{s->dyG;aa~_t4Qkg`ECF>p==1tK)+S>meF*EiqSgg= zR4i0+!-RExc(_wYE=o&ZNx}KNFrV`-D@%}_U8zzg4YZlyX(oUU!-`^0GDDlp>brJ@ z_|b2)ZO-oVoiHHK(z-=WkYr&{Fdd;PB!;|~{Q+RCS>a5o;DAM!LTU_k zzb<4E?Vzxzm7@YD=XvrSUn?#W$05#3c6hsWw$|OPYV3~EWwVEFH!nSGw4l_s&GF!n!jgzm&4=yAvOiSXefTzfeu6Ph7#}rhet##E1 zGvufH*bf9)BaAS~VI{7`X80Rsv9iLQidYEo2tzCOiXCM7Guw$y{f{j=i@k$2bXC~} zI$A9)b(%XU#?xd_ZTeO;AC5eK|0YZ>kr&~X0vjHk9%VFrFe`PI&VTCq)R2o&5a=uD z*Vbt?kT@r&w+P^?(0jAO`M=fQ#=*v3~)iqKwTgR$hmY%{2Oi-hYdI)=A3UX z%Qf_o18xs>K(RR_Fo`@j)#%c1TSxvwE;w-FgkY@@En$yjge>5D01hYVGq;PzPo1s7 zB6^;M-R@0fn5_fJu{_}2ENTA=q(u4Ne^bhZZ0`JCShN=qm9Y+LIAxoGvXqMe;TA55 zLI>CRyJ(h3fjHeF?i?|(UU{ExF4sE}%z_=~DZtgFyPVPr$m02q<^ zLXP1$sMJ4lA6E{|ZO%JbB%$vHrdy~4W%9Ozm(4i#AckzUAq1sy51@*%INJE76ci(yM|C0I;FcqfuXw_ zL_nmwyIXpQp&RLxMx;wg0qK^8@A2Jxzu&d@_5Key;0O-ZTF-s|vfYY_98~(f`7K>@ z+-N(@P*g;?LUl=_#?Wye-^lvU(#u|+5>JW16k{C%E2b(667lAIWiq0X$pJW z-m!2VX8~m7G}#ZWK;c819wZgkGuxWhtaKGbZ7f@Red&{3*-Fy7$X+}i`cZnH+Q*cm zgIc>$8E4AOV~XAJjP$#=^U*2G`{#Nj60tv$D+CAOe=mCpShI;jF`1eB>q9LOnnq~g z@>wcwVadB~6EM--p;nFF&z_g%Pwz>U;?&#NOM-&xM7mf!x)S0JWfJ+L%N1iR_Wtar zOKvOkdpvyNhaGLM%V(6kMe>c8e{=1Rk{NkL!_Anq+gXLupSPxs)|wmm;};(VSIPbu zgx2a$_tW*}{{@cyH&php2@oza>dG%zJ|L-s8f#^~p;rW~Ba`=R z;3(T6vs$`&dz95pIgCHzKl3XUYRFxXn=s)0g<0D~b2lsQ5idNr*%RyzQB5S$Hrg*3 ziSOPSa}&~AsYFUs+w)ep$~j}4EhAib_{wlys2%d@xU!pUwIN+ZxE(&zEu74X*nP1{ zOse)d+f-$hLD~-(D0;0xDZ9)Xw1sq21YFC`wC>Z#SyKvYct7mBc&e#|();G(b3KTq z;_ar4-Up~pKw{8|dMLhxn+A0P?oIVNT~PSY#-v!fSs|0g$jZwgge!C)T2^i8QR~_u zsBbnsC8PsrnUNG!=liQnXBfO~|DvgJ|y7#vBj z##0L4$XO_Pn-@V|hZTqP^oLwpei`6JZVl8GM#HL=Mvy|Qpt>qh+H2?dz;`aiP)p_T z&PIiaN4^9X7yqVRC&6gTE4KT`vfY4kDAcpc?@>Dq z50iIF{#k~OelTRO$)45#WK@xtAhK}nRV6Wvy)2P4z+XBXj(nvWJq*9AX(rW0h9Cv1 zx1TAf=Mjvvy~XIII_wcq|7yLnquWg*SLm-u!LWcTO>87a$;z5l6r{@E@^gL#16Lge zIXKx@7KY9mX?=(z;E~x0?*}9+!q!GY3C^td*9=iai2{33M9@^j!wztL#H=Z_(I;zX z>ilx%?JEuTN@oHM#we`OA!N(3c+Wm3}F4DQ84P9@An?6f;xM-$AwA#F0P5gd}c2WtU!?Z4T*h4Qf+o z8R9oPl%pfxwjGzg$D;P}lENCA}|0k%pXDjI6;QW(+1WAph ztiS~E^(%dkiF>4%bfnPy06a)9$m2rPLOvc^E*^uE5V`BJdKAR%^S*#8h2>^Iy!kW> zjJ-F!C{HchGc@){_hJVaG{oZE)n6_IGwU6{s`KX8A_=~@j!ctD^FnL*81w__VKRr` zw2P+d{;z+Yjpx&C?O`bE78Pp(U92MY{LN7dL`F%n(PC#i3(@RX%^FX#dj#qQUt-X} zwBr_>t-~B=oEdCtJPCB#|K#@hf-MEN{2c% z%cxBdWM|Pw0vWJ863)w&;B(E7dCws0gLr#k^@#(E{u4j)bE|P4?z@H|6`QWV`gd%@ zMme#}6J z8dD=?T`B|`a@X4%^@1|=BMBKNBIZxj{i3u$`aMD4Wda!-R$yu5PTK#YZM=a!VR58rpd^1rt+wh{D{um>X2 zasm+F2)Wv2g+1&Mq*05J}3_D82fxQA~xGArPx>rBY zkN^2m3!wwk@PM9LyctGGrhQC58F!Vjv1pu9#wQ*5Kd9R*x2~k%Fw*gdC4;$0x*dP& zJ1XsG83~VpvXKw?LKJb|CSb4s1GEobqNVnJDAt4q=%nfzmPuNr5Rp$wlGOwynR#P< z8hLn}-6$#vvrRuig;!VSM<^w}A0bIfu}q*MXrC!PzV48_CuB|s-fQS+)gIL76FF5z zfBLKjxkU3#W@+pn{b!M$Qwuuu*mJ@qj*K63Zw1@IVZ#~ZmsuFLr`eRLx^qn#Ut=I0U z7u){We3<8#`B`>SW+i(E?A;v@0Cds4ovq}KC!1yWX!M~=I$9sf1L|@r8TMmBQgKY+XbYv1h+)BoS5j-g4LgWLeqivpj^-$Ps5c2sAyt>(%&vno>-TJgZDqONC zMI|`Px=X&-b*;K-*2m>g#jq^T>~Zc1MLL>*b(ifjG`>ra>@$@O>V14J7j>$$Otcn7 zF6-U^jvq^MQx`nzil4TySX_FH+Q^vL$j?imtcpc#v}R;uII_WX6k1j_r>#!HZ5(V0k~(L~@x0C7)9hn`|3*FiQkAEg3qZtL(4;;+tQ^`J>- zq7sxo#y7H#%Jzf@zg^j{!(J>3Ou0BM{yNY#(EtC@#NAJYmdDe@P@c-UmZv-}JS5te z$Pz_9+MocwqgDG50K+;Old1p5B_2pe3u~Tg!q@&bkv0_t1lUYrs*EO#$PVHJ44H>s zq<7vF!U;Wo(VUw;hg0j4v5e24`Xx`JI^m^oz#3@3`gAA*f|NHVo9#s1BPr}S zG;v%O*5Ju^UBg5gI5ELnf7q|ze-!aN&2N_4xUBOkrm@E?`ywCM zyc4@1brgnq-ZetdI!*Oi-WSTwRfV5N2Jzd85Z7aqJlXSqBrXmRBF>Pz`!zL7@Fr3? z2u8O}A+7XFT_sZv(=K(Nw?Z;+Ru-nY@|W5l1EsO~bXa1lUcCTBG3O9ZL-3CQ~Zps`rGh$;u=dwz0@0G3ATy z0}!VEGwD=P9_7;tIe8!R>58{x7Xb@Sg0^&{iA((T z=>F>q0Mk6^v%78jLF{9OmBWuOdS9~ITw(B%*2}CNuGHiHf7CPot*7{}U80Ab2+-1e z$XF)%8POYYxrRE9!}8`EXdnMAA{h?z9u=#aC*EXdLFAdSc|ce`rkff&6R8F_3{`4G zvJJ0GlYA!ZQ730B^*&mk_Qf%9E%0(t&Jq57MNklrt`t#Wc;cD?_;V7+F^HLy3yPlg zeHsX51fVZ@P7Yi(am`ZUvEsej!LtU@1Sa92k@?ODae*#!q~7e$Cs@|0A25)SAZ?qz z=%Ftpyb!mLcCB$E{s0xGSF_1R}tlHLpjC(dujNC-h4O0~~tAJBylR z@?uGE^RdWAB*Zg5gU<2Shb$Q^;>V@c_*cd*z_7GwgJF;=vaofyLa(QD$wk zd`Rni`rG@dX_BnodVc)g4__mK%+o}xIt!){uNiGBu^}40+%(cw+3~35SpW}1(u{+W zfEt4kv^BfYpLQNqH%)vHe}{Wk&qSJs_Vr94!wmGCgE-yTf6`d@P)sc4?3S8H4f7JZ z`=&jwvYhz_4^Y5Ir5j_|OP2uo;5X~VGlc{;H{Zp*9BxY27QZAdR~b(x2A}iVF14=> zQ)tL0$UDH$UQ+LWY5(M0X3#pigSZJQhDrcoba*i@`w%-w>8N-pS*r>zQc?)Kv^HKg zY^J8eV7itA$|tuMX>3?YWt6ls1Ja&`IT}nV;ciZy5Z-Cs3w>LWk3ib0$T%^4tc*NV zH8vt%529JlV+~AWO`Hk24%b2_p+(SL12r-5v#Nk*jXwDfyP+K#KHKL!vkPwYNC*{{d5bq54z z+*dp7P&FF%BtfMMS6Y+0pJVn!`Q3B~B;M7PIzvF^)s5TO4Ymy5D*Y_W1 z;^6OHzb)(~Se>y45sS)s2I^TYtgT|c^OkO8{IyVa91r8>c99y&5269((Llct=vfJ~ z^D1v=&B+QDR7z|IFl}bUY$$Zf2y*RnoJTY-X%I^s*ek}$=Dxnec5ae}m+wS5Kydeh z^Tslk#n@_aBQT&*R}tdkbBIP^si$p*oMy_tR(|sB$YHSlAXOm$=uq%xgh5tM;xOT} zv}Yqw%C5G=uw5o=JPEOk*5RR_S~6$aDG?eo>j==I0}s$sgMxR6 z>!eG%G;&7fmTpaNuPxDu_0ae=RA3s+8v1kg3uJ#i>cj&KlETq0FxBSWyQD@_8n|-A z*zoVE)ceJy)xdW9EslJJaaTPL6Zob<*J!r;8SRGt-VD0Nil}MhPH7vco_s0;sCzvm z2Va(vO>0~DE{CaiPp$J-_-T&ikF@GsLMfHjXo&adkZ|_S*cx*&=(3#VNiO6z?jj~V za;p+;JkLiM`t1^(aabw(1%y{-2w6L1x6f-M)7-|GB@Nwdhet+L+iJ(&Rzskf2~3Ak zJ{lfQHaGR2#S~g(wOg!51qD_8!dl9yN#{0^(qr8nU7{UPg~j#!#Wn(@UCVPgtB;4` z{I0=;$c3+^_KW6}cE?JK*(wvX$CHpjbnKImEe*C~9zbJq1{J*V=|NUFH z2htZCh3rj|fRQv@4t)#^d`2+G%OxOSV*rGF9t_3|B<{LZR1Lf4e|nETG>!GD1K8I1 zkUYS;r@l0=bu}7(+UEHEJi;}}&9P|}krlftbx>YsE7#N@qx!|bhP{^c!*Lj)(`3VX@_U(S^mdU($C6pRUDtR4@r% zHA;C~M(?xuE?_{|tO(M}^0Z|vA|h0~tG5);Z9~t1{@3c2)xQ@+(oRlx&0P|GI*Qym zW{_4MD!)8lef=V(*Ym{ZAz)KJRP zr99GR7b?WE5KNKV~U8dezPz(X**v0NB+bN3{urpmComWXh21jKA z#3Kx}^8OGXAbi8MkGOQIBn4$OKcb}EIfTdJZ^FEe$RL&Taj9U5sB2+Iml#4ZCKzrIj6>G8Xbu0Bp*?H3=3B}; z_y7$NFt?_2B$+(D+Mg_Wm(Ai!Vmpw*UVTw!b9=KD-;WiG;D$bFb!?8GE2!+MCE}#z zA!EdJpsf|!mL*`NEXKo+SN#zaE{#o8?H_ZXR`=;IM#$aH-;W(|5sH~E_&dlk&m@am zM4FGVo{>ItJW-;F<~JH{b4C6)Quj66EkSZhV|(8da(FeH#7x~bH(Tl?6iHw^>t00Q1r6eq!d%WQf(N zdEkq4TKf>TThx}9_y$cAV_@=A@ryC$3{y0$WgeHTxQg%8X|s>-ht32pIDP7j-I^jn z`_Kgc;nnyqMp*M`s$@~V4!lqLEsLrk<1mts%Bgs-&11*3PE$09)^H9PqIzl@h=j<1 zaN6(pM`Sj*JT+_^h%R%p}h$cYz9ta+>XpM;q_{`&54D zY9Iw79Vq3h`?e_&XFU=H15%%rN>uqSm$Bq0uIDS~5H5?2l33Fz^?yCI06IW{KQ;YE z((*Q%J);FxF+~#oUY;S0Z_M;I29OWyd2vNB0#3ov_}}p7r!{4GzX-0d-t{qlYr{ec z7qgKcq~rAYA$wzU4ax03Iy%}YiWfy&^SPMlr_*aR`Ko0vZP#(9EZNAC z5=E~*n$+u;Xn1bg%G3c?aH%gY!6W68u^5GojPL?^>^}VS9&A3R-!^KmR`Abt6>vyf z=?i6^n7chaH9gMY-?*CDF+E!|3$`t^7wftO9yCJ~r0h}BbK#a?71b#M(&n^Lk41rZXB z_Kr@nFSnyosBlxCvN3}r~2SbC0L21Y}#I%x9Uh+al1?Rch42&{f81z};vo{5fT zwOqK}(W1iw3om0)a#4Wg%aQ}QktP$rli1YZ3%6@`im7D;DFaq;4KXOh;eSc9Z0 zfm%5*!cU%Qx?W-& z|45HLzHiSv`JdkKQ|^CGy~dF!X>_|Z+jG`{7r;!8V*;+lyG~$A0K?F?^lLianBd31 zLzMn|@Xr5KpqK2yXi;?vC!)kCg$v1j5g)XvvwU?v1AdVU=X1z1a^H&}#6sk59H`Vp8re(p%c;bT z->b#ZJ%ICp#>i{Z7{|kPmp1B;SK6NrYBq*0=F+@HqvvD$_&)U_Ya!xII@=zZB09 z7tHU&0f(Ui?>89&z8BQU@?A1t`dxfCXp$6~6nlxCF(rJGD0n0UVNv~{QGVhEtp_*? z+mPvzez_PEfxnuCet~5Yy}?81((k+F@q}%`-15btXy0-z?W{|tQ`J;Hukz5z_w4>I z=r36Y#{c2o-C}k6B{-cNYpiaAJjzZ0=Q}ME|K;(%)4u1M?*3_3{XX5yAMTL>L8KC$ z8f+)&G{rb8fOO#9Spxtn$G{L!mIWByFr@LG*a-Gv18xFcR5SB6z9kptb>`Ee@h3xq zHV1=!0uHd37Ibdbb;I=niniLcZtoq4v59j}80Lmt^}2<3J!OZwkq1}4)&-)R3i)OU zhzvp92qVPh6n(^1 zU>lPo9dZ%q@0 z&4|hCAx6Xa%y*b@7?6m0l;?}KCL_Y7a<3(k`>p^zr|SG4i2`9=y2@?q@N}jZ%00#Svek@#E31 zq=Wx-5S$?<6$KI&lGhB)do?7Xw?4zUSk!1Bn0_IDmRb;dpI7WX?MxxmmFXTbOMfXX zX13gL7B_|Rs|b#+i7IiYPa}^6AoBK>knP~~S_fiXc>LtH9>t(kjuB6E4 z>_1fwH|PF!EeXqH`i8SNU^frx%vh#FCNnR?M;-K%(4DEbt^pFQKNJ}g+`W)pW4`3; zd|2FV5qf*XeP`)PW6;XCF&0*y65F?%z~QrmP#?N0Jn8=3lLVQBi@PLj?_8rdhKk6= z;n42uC*6VK9hyGrYk)5bO)7k-)kRC)eakX?GSheYK#|By`z|gQ?fK?Gb@5H#Yi8H- z2W>9pRH#4;W5?}7RIQ(k3?tKQQxTe4Z{wKz_GPP)J<7Ln2j92uaXxwH^(?U@Clby1 zR)I0MGxzO|N(8Km-sN5n6Te<8Ppn3V3$w(S-*^4LhVlO`T|Y+t2cQJ$Jv7_VtMiBC zzvz~#&-b7qkunJjpjq=e(i}~B@9WBc_kO_j1GKYK28_TJ?lkp%mM_b5!aIQl#?V`Y zjK}cz+e_>g$@Aki6eZvVrfRtt2tEJ?jC+aGjXe8n@Pv6u&~%v$FuSrN)gu>q5MUrp zl`@jVEF)RUwfKmzF^1}BeaQU|sk_v>$&Pqi`^iXqP&O6KVLXKf);OwB4tbCQ6IU}) zPQT6^RbPk=bsE9`bRBYWt`}${a5jN;flKq61TFTPikM>*?!MEKJD4X@USfii>AXz~ z4Re?T>W~^y25y>L+G6OV@yrl!j)^q_cFXDfSqVz@I=fIH%9U0%fybv^9n#pv#bs+} zl~#P#_r)*kczbE)e0%OZ)%QIU>vVQ3KE8Bsle0*IsC~hm89$RHVaRw3)Gz=09&sj+ zu{uDo3#FZ^zr5XrRo^ycE(E`hB3|aFe@-bf_96B9jpcjcXWKkzq#sK1Q$!H=aet*fZVVqA@b=Pfqsy* z-&<-Nzar9KPl}oktmqcAWy1m?4%NhrK#s^NV|I9 zT)Vr6VRs9fNu&Or(2VW28EibQ_tDGMP^?`p-{i2j>H?)OROa;36wj%5GMb$F7x)n{ zOZBH;@!{sO(oX&+oS75Bc{(WS+;H@4kFlftAqTu?9unTJ|09xzjrs2wB8UW{EB+i> zATgI}9{6OOz>XwZh!*P9nPfzd>ce=GLd3t>c_U;;6hYRZ3;s|R?ldV`pC4oDtpibX zSnpn^9uNFylIX4U1^yJ#MP3{4FgD^}9HwGkUeh`Iny^+Sq80Dt?Kw`}YuTW^4^bq;l@B=87TlLvO^l)Hl%vFRnR1{m`XkI(D@APGm!84TppuJzobTtadk>K2 zqEa<5a+@ry4xnV~Gt)JV7G)gDRRvEH6yq%H4L$q_m(k_nlQc+R_ETaS_tnLW zbeP1xg}>xK;f$9)x6<;)3?z!7Ia>U-$PSy!)>O~J3f(f#JV>{vdr<+ACjg$zPPQlgDEA* zy;Ed6Nwv|r!XbB?l9-bkBXx1bftFFLwV5j_{iTP=Z;$r-xBfro4(f)*?)%2ma(>Z| zimZgi=ZX;-#C)u102fG=#>~1{m?cnfq^xa|(kfD(3^0%7yZN)GdJ8bhu_P{dp+iaW zD}60(s`A|lv~SRFzva(dG@saKF?%EJii;}5Fywo=fIeMla+Qv-n8BYmlI8dAq1Vk+ z7$?4;RsAOon*2T*9a8(ozTU+A?m!Ty#L&1|_n+QDKP=#@?GmjU^?VwC??Hk5Y0ydEW1%l45CvnP58^)m?8P z6N_r-Px11ZYISCo8Qngf87&TxgDIx)Au#BaFi^OS)-nt^dXDh;ij$WN4M*cmI!bs9 zPLM?;Wmv@eS6xN@up0EPw>Y>7B#plEZo#BR>^FMyh-0~!E zcQArtdF&4HRT)s@2a1sWsUoT!pLf^hu2m|Hz6a+Aj_cj-Gna6T_Or`cpj`x@MG{!E zdS5G1e$ug+&A@#(o<1>yUHZ9iEpk%-qt~qB^lMzaI;U3#k{YM0^b-O`kl_CG@ye&v z()19wvF(D!Pk#Jv)eJmMh9D>ZuY5k|SDgCzUovB~pV$OjKFU@<7>>q#%w*rR9K8I= zk$MGA3cYpe{>aIb*8%<(m108+;j1ji(g$?hrIVk-%`tbS*C|k`9fj8s?7~hK7{A}OD9;^qKP`_RC)d$SqxO1C$0kA#>!HyF(7Hm=Hyb+y z)Z*RdQ`SUB+afvGoCvx+?t4k9ou6ZUpmyU!13tzn`xwZB;{ujjwoPC^z2YAdrmpAf z5=wJ~CzFi|N|ny?gb}aqkOWK(BOdepm*p3D{l9U+ZRbN|R2-Wd?HMJr+)sELJRg)Q zWh+!{t28DxaW!Y6(B8#103r^VxDHsGr6gNne?&rs;tc&JirWa_BH^Tc>7)Ek>VA3f zzC!ee_MJe3L)-@DROqsWUO|LEE?cH&D-9j9{@!?-3`bM=812*0?+bup*xNv}nyi0X zrwKlp`1|}&$8xxfbJRx-MdUI+L>^9zqjnQQp|&=g(It#QHsif zn|vd@P9;M&SysO8Inh9x$^;iKLJM|&|0&XutXJ{^Ql@i->JT|ZvDd{>LQU36k!#cr z?tpzVInWkw)Cl1|5-HH#>j&$J{_-17VS-_rcXHA%52tKAUUnQNq-WbcvEF<=r3!7d z=r1Q0OY?L^D^7u++skY^=;%_C>3t9H3}5LZ7E%deZz@PaSX*?e0AB@n$M$4(=k$nf zPOOReY=jV0u2bzYJR0~P{fC{U{J&nmOd9@=75=~R9Rlegnr$Q=CX1-nB-bg(>ujUuIRgk7Yr^2J5 zelyyLfzFP&8Bqv5U#+OVasJ=~?2F$rTZ2V$1Lu~5*G;I-FeM7a(Q1{ESB3QS-zC|$?Mo|^V1x}~pqXBOD zLKP-@%CWub-Z!tw9DK;9rF}*`-Jp}PRsRkr7|I?n5(wA(p`}NldLQJiN{s;Qj`QQE z+fa$L)OohrSz|^U`J?$aX0|Z@o(CJWPs)MAJhO5QOgr{iz}UPDuf6`&J;tz0PM6g~ zM4jG2%H6+do$UPGxJ&s-$g`mPA_(#UOYw1f=`)uB6$yc7!PBA+uR)HG8Zw4QS_9Qh zpOgli$jvBNR@r6#x-y;Rb3NlC+TRP_n#k+ zD3wg-ypKCb3yDX_k&RO<&q;2Nvv!QpBH5o8o&lEb!Wd=+N8NUFgN_KN-<0Hk{83Lr znV2Mb5F*)r)16-l1U+bY)zfX12`hNPzJ48SSyLZc>}|QjUC7!=7x@0;k=RY)8yYSN zG-JHHi~}!prj#j`2&w;er{qWdNdVkl)kGRP1PyY|->*wrUaO=&`{Ga>jTFx7)MXli zYTnn8`^my%SzNI|T4JNu`b@Z|Os{j4Kn!*-iSX%sohFfcEs8)9@fq+`F7cDOpJ2Uf z|E$l%8G#4Fhr(m8lfRdL*pFZ3EYr1$=7;Mw2+l@&WYJGj?p)n0td%+CWw^*EUaPk1 z4erx=$(L%Cr&zyJ)SR#UWcRN9Z^k)@Hqu2GJ_&aIXCJeZVU)1o{2PtK3f!{wt@v0Q z8X2++jw$KrSi>!bgh}5rsK@;5Y={h0U@qz*YH|$$65{a&$Xnu@U0BFoCXl2mT-jmx z-Whw>`w(%Ik3&O9yvo26-2$NXjMhyHI*a+Y{Kun&o{bR%p1Sm7|ClDg^r}igI?aTg ziVw|w*j;nijbcw>B3X%6;dDX1@ERsO9-g&!Tk$90TEjR~vwq;>nhbzCMOiZXcZ#x(gV zcfB6gBn&6M8c@{!}my^7F!-)Q`b~6&G4jyc?}l=eV>OK zx^0HO@88l&z4LIejkXLF^}XnYmlJ$vJ;FT7Bm9tGT4(Yg;xp{W=x#VQ!CsO z&xt589_f59a(QHY1XF-^+dj|0>51hQDrL2^vm|i;^wf3bo)k1&;eS%60wSxP~p`|;o-K#!%gDX(&Q zy_ws;5S&Tn^T;%5DB4BvBx!8na> zW7t|>%3Ix)B%ynZw^*d%B1C;)R@^&4%z@QbSlIs7tl!1+tdc0QQR~2cdo*>+ANy2U zcB04cCt`fmRpJW0YQe7^K?Tr#Gt?H&kFZqSaRO z^KiV@(%jJFU3v1YYq2;!4Rq}%6&AIaJ-ol!PrZgcDuN?*d3>0#DKYRmE`of2v#rRl z{pm7OE9R=G5(b{tN@Zl#b@Z^kpjBdjTVlW2&mJGr?bpHnMUeUR@~x;Zf@RVbu}>NM z=D?b;B{Zq?YF^*;SDTw`<0&7mzkVPu|H)dn7)25v{!MVhe`2b31A5Gr8c*Z>o_bY! zLM{!F0+V|rlcFM>TF)^*uSO5yfy8_J_&5Z@SzqacKm%Z9!rFY>X1MC12c9;zC;1E|MK|E=|j1}N_vm05PM8+j)fu0i4!t z1YTTr0f7#Xg!+nr)-{DK`jZZ_Vu8$t>vCbdh~lJMLDM_i#;iWi#Fvy+tg?7h*rOic z3|DA&lVjc+u^v|39a@srGW5UnKo+fhMyFrSqRtM<8V#;V=lSO6RIU0wDQ3}a;ZG=weD$11*sLmH_ zU~3Xz_nS?uAx@x|=PE+1GpNbkmF#VtKFpIUEH|8gJ8#mhm!{$+?$4J`>em<3UUP>D z8{oO2`SuQl^H_K^jvLnrC%CZC2`L?Hmsf?qL7vga9zNOmVX7~NBV!_G?68jNfPKi| zD(j8IDqOQ*^YdV#rbPrO(0ic`$c9scFbgerzvloi?X`#B>GjJY-8^b>6*?+Lof75L zyf_UySEbarsx}S#h=kipaF0{WS(N`cz!ii2aI z^4oW@)-?F>-8W(_+9{ykC|(14|M#eYy(Zv4GP3{9xkBSW59oPa&J<4pdo|vw63IlK zS@K2vHEKByMu1HqOJpE|=q~>(z|Vd`QIJx=#u<2f_E}=2FSpGm-f4Jq0&z;x7s+FK z@=<^VmZwggiT)4nAJm!rUuV3YR~{5Jmq3*MDtx@g{OX{)UfkE&vKY?&`7w3Y^or1RIr%F^d^u7BeZ=K!WTV(SOh1|MH#@c$ z-v%Y@YcJCelc>A@;f{#>^02q#txkuRFAe0E4apmSt}ajuyIzs%oTZoLg6AwV@R4N7 zkTou)mPNgf_#;R}7P_&tHH6PjJfVH=P@t`lmnoO>mXMZYCrUn(?i=Q#_%~phSYbQ3 z9S~rQLHk|7DMcRk7xtk)z1VW?$zmLW!R1m9#5xcb_>7pvx;0c`&ZFjX$LuB59t!%t zY1|hc25U%GB7e*-+odwM+ET*8VP({nw?y8+izNCe14o-W-z&`{o9a7=q}7xE6{>!$F-|glRodft>UjPc{_U^Pl`ha^hiWW8Zy2vIdy0Vy~ikB(@Q!)T{4^ z{#?k=>HrKuWX}jzm3147A;}Yq@v$&28=MlFAo0DY0hT6HFWb29@%KBv=MDSwIs1kJ zZjCVU=9>CC`bbcDsEEJzyV)$&g^vGLkp16t^L1#CxeD+=KiDX%tcM~xBO?4ey<6Of zNSxx-@v&Wt=ZEd1 zj|ly)aFz0fYE)(;ABFvxR#Pv9y5Vqo#OO@l1^7(C#Vo+BYEth?1(2{F-q&GP;U>1bC2WbmN6ar%=VdgU)^7+}w&9zB>7xhky_= zjZ8a=IIz!7kG@-!rh7e~ot7J2C(e;@R~hPboOi;p8R4p0q^#x-#EZnM#4e_Mw_MrfyR{T3N+-e{5(qwbK*jE}c65 z{8UcK@_NfDmY~9zuqdVY+K*FE&o?;0REXxh9MAdd*_73bcpqnn z?(b)_1v-pqH2y(6272SdW=`KYU#)&D6CtOwUa z1J~m)lBi38Gvj-T&K@G(G;@CLpLJpQA)Aai;8twybd78qHgVPBsb9w!{ zge8prSg8()^0kX*pEZeV3xyk;bQo*o?WLh-Q63H+Rj#`>7HTL^w(} zj*(O&#my-lpm%=XHPdn!Dxx{7ftYN2SD$f_y-_tJ;kdNNzh zgdWk4qm{O!K_=h_weks#w><7=EyaX_u4o!#$X7HOVITtrgS%LKWajK2fnmzQG;@a~ zL`8RRR7|wxGaZu%wZAdTD;_SKOsr5Ri9h11;*DR@vMiP9u?BDluLPi7 z;-li9$B9P}BYDl=+mdMIflIPJpP(`{`86SF zu)lV8?Kou1g8v+X{YJ&uN|BkBD|Ppl-;g)lYpSub7iJk{GO9_NP)DTN_~u6l3`=1g zMaCL0JcFsx_lCQEt>FPN*ciV$rSzy3sKsAXX5WwilM-ln$I?*-H=&b=|tlUZ4 z5!L8VZ=;NvYQNw8rz8$s+4Z_NdN(M-2<}kNjdZ*jZJYpwwao{a^H7n!Ua$gKZm|&Y z4b|8!L0mD8a=zazP-d`#LFJ#8&B|$)q3cB6FzH8$gcGj zf-9;-5*#cj;f{D3dKY##y_?I6S5WH{N7`}Xb6?ZO9!IJWcrsdOo%{A`6oR}!^IEnT zoL$OewjI5%oZ8!oM;jJYh};AkMNPqmrAf9XQh`gT8M$RsSJ8rkQ_Q}^uc6xlMbMLZ zhV+}I(?R#1wC~TY-uzY!BcOmiwpk#JDX1HmD8?g(r&z>9wz)`0N$I4pm2qC&_0y#?7RUX#JgD{nDpaDFk5p9n3Vl;(+~(Vg`ZKM-#uvuH|RCD(9kS1j3W3a#IRO;KZ6M#_y#qaGe^vX zq{ycruMzo72CXT#!MfN)(3FI}oJ_u0u%Ap4b#AOzMU>M8E^b#cs>!_l>pVz46OnE& ziNFFHEq^ubdl!=^7;_{+MBcu`!9+^TZbXsBZcLRHWm2xb2C0<3-;&oC=TK z9SVM<=kS5IrU-_GZwnQy=z)fJKEynD?GS~IWCg<&chex}!QBi%n1!MIN@xvw6oCed zwUAf;Q&DMO;GOzT-t>Z)8KkpJj3^D8DZ`p?7Dz~HdK$G}I#Dl)2vLHom&IS2Cw2&| z`mF9d3j87kvc8%>(`6>9a@PH|z|L)qJoA;2k-1s$f7HIP*$}zIo|>MWB@j)Ba_pAx zC8|Q%3GdFKC0eVQ@^}4zg6IFYj)004iAt$36aof|=ZD0;_=tfx^g>SI%D?Ni z)}2H4ERSTHym=yap}cq}*hnvE6wI@l{|kx3#*^bPEdhPsB~s@5_U&@1ft7dR6I^w4 zpGYB6jiUZjybut%60`iw#yi4;Jg=RpSO{xdj#96Pdej{M?0(kt3Ud zl4u|2?Qp>enzB0az`L4R#$VRomv3{*gTemYDWRONNw-2nh6$Ir)RE93Qyd=bc(;ia zpXDofL&3EgY%mpaBDkbrEnwx7X3VL(V{2>r-^f0KAFH41ZiI0%0lRu zEkN;J)?vbCr^bO65|2>5hMovCXapw27aD&T=_$N>>He9l^`+L&cC!sn$-G_)E)+Ml z(BYTUkrE^lNCh7PI|SEe9o}3r8vUc(#?4DW1UtCTBs++_I2KbTDmA!FU*gRy_0$S7 zSAyBm2qNHEvB}N}aja+UBiy0rzf}(cfgrI-qkCcPqm5tn(lCAabPFLty8P9wZ7w&! zzS;VmAK*eh$z1(^?44C}Tv@ZGTWFER%*nl}NT{O!q=MugkZzRzrZ~FaRM#x-oyo?xFKHj(@2wjX z#!8b%EG9lwm99P(r$SGy!R?up0xZ$frON{Ki<>?U1R9SZ`$!Gqz;@8Kg@d>5fMW_dSvtkEz;-CI}UGW3+{{|JYrw29F(dnw>Wlt zUyM31&8-=eR*x&2HsLCqY`6I?%HlL7VhO(z`8?@!Qi|b2V%r~QfpaoQ?+q`a1{HAdWn!keu#BQ^!S36DEP3=tvuW*%zCo^>z9TyQxvEJSkMhvqj zdV*hKXA#9!L&1fsR(BMmt38YV1l763mTw=sw7AP;08-;xjk0S19z^g!6;*0VVr*iu zvrb8R9_KMH;c61>D;&xa;y>w1bO75sRHFW$dQXuiw&VsDK*K<$Yk3r5K*cVi1fD4@=|2&pXB3lHtl z&DFX3n#G|8e7C63^OcLpIXe3hmI*=RZFCHLF|vxafHYP}4w!^zw?W=vU0Q{ZNMfb- zSVbg+!_w4baK;|ZzB7gKj-q&Qj{v{?HZ~fKuSrdWz^f}c0&QOZ`HsZ!j!iRlRl0K3 z=356eL6(^Jl9!fTJVz-CT__{377~{Fn^|Xn?9SZFT|}Buhp7;!ufC|s zm-FuKtCT+}5=DIcQq@T$X<3(s%>oDGtyDQ)POA2EJ9_D6KF-{4hDjW3+*&7nh`!Xj zBFOmU-o5e@Ip${{&8|E4KO94TL_zo9Wko5+t-gn8roE&fAaNZ@ji%UQo)L-M8cor4 z;PfuxBzb!0*!`5qbSMY#{DTB^tWsW=uG=b~E(3DG^F+K})OcK}h>D*>)9lh{Vesgb zaWI7)2EOiQ%P#?+k_acWgu$KzuB}JPKl+fx=S;CORw3Dj?{*FFAdKM4Mz6?8I@HGa z!&1CSB@rdXX3;^IEzC61-yi^naETCSW@UBLykQjDZoY6fH9Uvych(U5`)tdmB$GnS zJ}u*#rlFV;Q)D?c8&t2DAndXY;s8bT8FuU37q0euN9EZ&sXXwRx(g0ljdW%}RCIBY zXdLrX&s_u6ue^ExxZYD}wyE(tz;k8N>UtoAv2?NQvZ`Ll;!v95eG30Yk{3Jn5Qr;7 zQ&>2?{gvo2EyIUz1oFaj>6DUKJJ~;5C{!A?~Yr()MqW#Ug6-K zf`f-FAyw31R>O&WK=*|Htq$~3VYLc0>cIm+*isKFSb14-L&TG86o%Gr<(Fjg*=JT~ z{<0C*u){y@c|u?feJMG}HItfs7TOhT)CRmT0}|7S5cbirdxF=013Cy;M7VD1HJ;hSAfxYyfUtF7Cdw*)6pKR*B%?9dPY*2Iv&0ZF$2^=N@$ka+EHbtpC6 zi2)IXk|v#kd8@qCfW77>dEpu^a&6*M-#9Qy`)PIh$e*!@CK1jk zqVph!DIsLuLe+8!6jfn^^bpEAq388vFJX=5F0$^D4T)7{S4fjbtZ3puCy??f9O-(O zZWw$OvO)x{W~u#{L3=TOyIGgKm%J^$lfB{YT=$iqC;p$2-XnsABlA(R@%qvg#n-|> z@@zi$yu$L)H)9`g83i6LQ2ugM2d?wkJUT*KZvICoio(ee{s&h`=!H4p4jDcU?F` zwnnRv5a5lHOJczw!O+`5F)^FeKP&qxvdXa>%&25=cxh28q~UTaG@57g45`u=3rkZ@ zN`fRn&k$uHt-*ZKWFS8|lY9?aww3NXL7B|0=C0b>r_7jFJb|3jE&u5kYRrAj2tUaO zDWqSv8F&O?yX2T;ayGS%dn{*Msr9)UocIE%oF)}my%XH}Unq>mW9Kd37j_E4$Cetd zPxU^zd0&6e+W`a)Xg3HgFnKljpZC$$9?h)9Z80b1)UnHO|1Ip0MS>=kn$ps-iNiws z{Y2h29t9L2fBh}|bh&6w`HzsD$jNe?#t2pvi~{;V$xAU?Q8C3ZikLY;Faic@`5C&+ zd(w3Ej!dcXPb<0H65A}6#+6#UE@?mJLxXYx(L;8YiNGHSNbn<&(|2kxF@KN8%XeB~ zFjVRUsFVar?|OE#-c{*}PZuV`9OWVG>s)qa9XK3P+2RPGU;j^;Rq0v1nHUgqlw&;z z7AkmXp+Y@r6=WV@<8T~D^P`kUDyMJ{Cf%{Kv5f}P;)@~1p+D+t_o@Aj_(A|dAsiud zxPj@((_4Szo|ir2r3S6sJ=%R-G|j?dGMyni<7VV-1)AB6&k>*4G(*@VOQu2>_U%T@ z%*ze-M@@@lG*8;jDRGS~P_wGbKD8P!S6V-7*(hCE+*Zc*fR=CEItw`K6IitBlB>9~zeT>6T3JJ&NB1q-RK5 zhTY?K_*)a^8&_Ghg_P(?ew<5;3e%HN0~0_2JL^C{)Gz4k(RHTnEgqNE?VXTLiwXzn zk{EX}0p6z8HW?h{Cz>bf5U>@kC<%2*0vQG4hdq0JH!=8AZvFSeoo;c53I3`9OTL)% zoPtH(SMUM$nE7UNCUcoRL*90UoCc*Q1c&fnE(ok5E{e>t3U-APVUgZW>H-F@F0)?^ zYdR~C@cZ8K(DQYaF<8_Z=a1|=&s;swM^Krghl)JJWA`$(UdtXU`E|e99O?u+&e=C` zybr~a#>HsV2?I))^#nW~&XIm{C!*@rxSW4J^$?sJ5-F7#MFweicsT!Pw$rBul|LQS zMynSW)nsrM*VZhjOeUW<(_0Es-iZH2jJHDyvMA{r*wXTafkE!J2VFaVD!W6`v3`S( zZY?s@3QHPH_|lV(Yd_mnQx<01AqPT=^IPmF3K#)hk*G_}=n)!@*ED2EWQ?+Js!W$F z$L*=4VDxrlrjMu$of>`ZsTSWZ7@`x0AqGX-YA3F*i%wYPZLqsKQb`H>oxE)4)ut+6 zt_&|Lm^c{B&u`r`yWGy)GC1(D1e*Bv<}rt05tEj8qxXPFdq|TDr)z|0vG?y&gkXy> z`=nF;2t}|#D*JrxT8eEoLhU#xxJ&a>KJ)y4{nJYQhhRYgo3Lc%p%xb$(iOBTmj?uy zLDY}%0k2*y8K_Cy#VV-Xy)WGID2q>PPd=GZ+qZ2Xdy)2!jCnqJ;=)yYfd`w_=84+l z_r_Lps$enhaAeT)-L`#|^Ds>2#Ebt^>ILsK_UoX_?&XW!2k3BC)-R64ozGe&ohpn8CI zHS7P$>+iyJr12qX?-7ZeWj-eQ5ibV(kQX-TO%xkx)sd+vNb%t^2I z^k{8sZPTuDNPff^@4O_2M9sy3gh574Kz5tF>_mwFS+f#lkESocdfVZK!XYmzPJ21x z2qG8}z8i_=AIunj6g)o7E1PV2Iewbh`BEdz%A1k#i$juvqxgpjiV~9|qlCeT{+};> zh#^U5ZXm$$GQ7P&_w`soPKG3q*JIrGj4;apMQGPaird^FH3TYxGTNqPjK{Fy-~ZKr zKB&7uj2M+67V~m>&CpNq?=MIGpD&laN%j=l`_iS{Ql?vaM{qBad?#V6_o&_Fv(e6X zKe1P;R6uUm;>4}MH>i{N>&;hTubs#9vhv(hE6_EzkMAqJTFn3bWBt9R z9A3~;q%wLvoijJ9_x5dGTDBVJLeO$H!N6(=zL`Yvyu0J`c>sB)zd%D@*bVL36_A=CKf4C`Y4Zc8qDuwZw_klAN4lilbq8e$t)$HF6!7 zaQ2NR62{(mSjIQ zdoyY@jXcfS?&4E;6LwWk?8o?-&oVYros+AZ2P$V0nB!%1mDrz9agv|UYBmRZie&ic zRtpgWr&@mek6Y<~-$BSA7?L3Ac|6lh(v=v$cI(`;-McQ4Iny1=Dx!7J?T)jTmm7?q zGiVnOX1mUzo^qiwxCYMmRYN==!-T`La*|wQQ@(2dxmLRwcqCtFakhoK@46rMp=33P zf=LOeyJq*9CEwbDetkXPJ^r>hH>1BKX3>1ax(+f@K*R}j{YJ;MY_`ZSVA-&NRO zrG3({>3DQ@Hln=fy3=b7n?F6j!5a|#STw59(vp&vbI%sK=>D;mRQ|e646Q}&e zDR!LC(sF+^_O?IX&>M8DL&Dyuq3>beP}Ga@naZ+$V9Qvfv@X%aZ3(ZXdM`O-b-L%v zopv!Wt>H3HShMY5t4;QCa^CMmLF;@M{pyCNZNI?8NO!o5ANysMvR?*Mj|F!W^m zZ;v_BnFLJ;FMkCAPJ&L^t*3yg)RQ3m9efa|C$*Q4^Lh4lN82ptHfh?Wd9cMU`lyx{ z8~i5@2z=IerF&_@Ad!EdtK`}JQq}=wELA*DQV;+qeG*pz2-ohSG=_%!av`}ep=j15 zOQ5^26y(+n%0L`~0=Zr3%ibNMK+39dxI-2i)xnZ?E^1Nx=sLC(Tl|t$h}&2 z@dzNEBZ-3qHquz}=@P28xRmi1je(wZf2@AQ(6uW5&k}V08%1;-g3gOQ1afRkgQQ`h zm{Wj~o#GG1G#mgK99fmWKZ9&ruS4_oD;(+6^$!S z_|CNdW&tF2+RR$aF0na#$XUPUJxeY6C26n6(PsA9oZ(k0y5ZmyFetky6MuJlmY}(d z%-Xg;U*&)mM~=ot3ci&eFW+?58Pe2^o%!g&7T?;rB(!rl3;xoKptH~p=b2Li`8MCY zko^In7lY6XS?*oa+o+=2MDdmgjP6KRu1v1zTe>UdlX^V%O@?%}73pwmD|^79JEwF+ z*OVnh^9%+%fA%`@&S4F(T#kt_HJX}j86RiS3=GdlH!QnsT3>ha!uY3fNeZq&40`^~ z6jJH!qEUfO(HzI}Ds;-;2ICHi+!SRoZS6GVv`6~dp>D1cRZHN@HUYh>&$ohM6}&K* zqe0PC2m1FlD86!!lQ7uL7WDhkW2@_C$5E$$Ng}~g$Z@IrYhc(6|JCqYWplo8kZ{q? zAJd#xx8gOADVIpE=AY^wo>}-$j0t_O+d+^C_Lr#TZE`oZ)s>uZTQX1F0^Z z9Xk6XT&|J%KG(e;RQX^xc`h@;1d&!Jvjb`XRpTs)jNitOglms0NAk^_Gx;8{LQbh} z<=!sNn~(Sf5@Sl~iPp+kOpPaL-!hqT;s8a)OE#z6KI0)(2%LF6^08mPW&5mBE-Ey% zCm5V^yNo~h+>~X39HctvHVLJPnaZX@_6Ha~XR8kl-wJW@IPCN1SxA9S7Y43PIw$;= z9Uk~9UGpOh^uYXxaOm2^(jeb(q|8 zNY;oKrDfd9Pd~ZwuC=q%E4<3c3jSD|MDYVY!&V5pg7YW(e?F^a^McMU`CLOZ+c98V zuu_W?saBbk7tB58`DC2O#G;&%``%(Pje?DeNAkg*H1i}~&=SL6QD0!r6=U1+_TgJ- zj|z^Err);~_#gWn9%0B4Gkz5uT%dSfrT^Xn-ISa$TT9@a^_@!mx%4y%28IUS9dhYB zLpOsM&-}CAH-%Z;cGHtU;EyI@2O;k5`D(`i!*DcK&bKv0g2u+aP#r0Us)h;2V!__+ zdjwC-R=zWbvkYX8JC4Tztlt*mfyru1KZa%sT%6`GEmzu>AT9j)=7xfU!LIO-mnk#? z@eI7*5Rj>?TY>gqqT`6VVS|;a(Ru+Dc)yu83a3Xq<{RgPr7Vhs{(e=d@Ft^T+(*(~ zT+?R_h#ll&=NK}n(%3Ts*z9n6yJ({ghQ68c*$p%12voNW*TOa^i+*RQts6;X=mO4L zpW1NgTT8_JIPD3Phl~ckqIsge&HRKy3E(br)2>QR#{A)P@ndY_u%1OpZd*2V*ssv< zH#X|^jIZ&qymMw#%r~qzu5#GS{cHiM21fe`ks9|LY>lKDxM}EB!wwFIW4dSE6VPJuJ}I{%D*`OjAU2QkW^?#9XC0UdM;2 zS%Hhy>kzySb{}oh#hLc_D-pL-U^Bh(o9yR~koi!gp%A?k%P%_K{g$qwfBg^_8{5#o zud%3|B%8UZnQnLF@elG-8K$AHxhZodvy|>5u!7DqKefn9sNwM=jTSz!inDv zmh=j=WWr4ZTeZqVi1%$9s^S{yktZ78I;KXCzw(>zmmYJj?mL2E{=~9t)6t7Pj62R8L*qFGsCN`$3(BOO?TdJulTBMbQT$*YadVtd?ZpO|l!fF3$1>?R} z7fdlfft*GC@jlLFftC^CX)xR9>fsYjCr4Vwzho)fJt%GlAVnndmj=mB{GU(`b{J#N zP%EgT=bfCl6zNP0{$|Ijl39@vOGNWj>u7M!?h{xpt4G5#OeqjEz14Pe@b_FG6x+CV zib$mdi6OBQ_)5a2&r8Sk^9I>pgYjl9p6!IJi zg(ZDp&g8RSH~JLlW!0DHDw`1qr(i@eInP4o8)!d6BGI^Pk2yE)b#$$A1tXKW!aoXr z$fk1)71RG>syo7|$mWxg&PZHzxr#e~enPWcsb9)kjPgQ{06Q-stfE`EBtV(D_xr+L z0Q473r^z>bnS=Yn-XJeJ=w_K}an#@G?P9j!!6&e5txQx$0m`ai3(90tI75PXPWkO; zV`vrNcgA6k4_@2xuzVI`=pGM+34{YGU)r`--%@oG07d<(646?XrIiydZtu{i$7kpd z!D?+k{d=oAE*w*hFb)*M2BylogeOYJ-A+OH_|Rn{yO||&&78t>x&gB>(lpMQCiOMsGB^J$iG z&bMyi=_H7pckz=Ua14oy2?1-ZM(;|dN9iY9@<5}>@pmm&*E%%aJvhhU32Z7Rm6e?y zN)@do5br`~=~RL%J?HS?0B(@ylD z3yERAW-JoY9+FtvoMWyy3^s}kmU-6*9wbuxC%h>M>}wnN{#y`o0|g;eh8`T$Fie;@ zVXkKBeU&lQC)_Z^Dxp%vr`ae+B!m7eF9%_}ERQ%6WCrh-0rnsm3gDD?5UI!rKt;u9 z58WeV2_L;mBzIcDOdyUalgw+mNG283qz~jbbhm@-=0jybirMP`c6*YdGCT$>d(UGG zZhN84kj@6g{rp;?hFgx3nII38Z1FEIat{r|9W7-liGL~*grAglIH@^oofX?IL2Pq|dSN)yu@&LiQyKJ5* z%0He&_q2G$8V6B0gck^`fnL;9=q6`pWy-jz!jeM#k;{H&*t5Ek!cbpqYWB3kimq}A z^XB1-R%rk}Nh?!T&b)PE<%T?ahMDIQxRoUZeU9_fi)`nY`x!fgNnAg_+4MrvkcFUs zSXG~Y?$RN3iFe|Sx`wFEagu0ef3$VgyEAp|w^&wv)l)64W2m~}YX3OJ6ODY)EPi8; zGlHF&wYH8k)M`veV|bD6X8*cAbdgo@&V%vLrUn0t??gC99ymM`Lkd0sA^)5VkCsoN zrQi1psp?+Fd%H!bka9SAG&s({P)XeEeRrO}qDHw5i6RJU(q^6kIKhXECr0}9@Vm-N zJ)P+?UNwCkYZ=?i0kT9I7>!^|0$y{WeqKo&%B64Z6#s_ZaG$qxPTTLP$o^~j>|{6; zxUTnSRILutQ`M>Z!l-WDGeTtlpgtmy5Kp4h>Zm@;Ip!r#Svpd#Ew+Ufk_~aEQgKZ2 zD1;Bm)#R52pe}cmxLqUWO!K>E0z)R-$f0~SunZ6$6{H7DVFWCA_gxB zAKZSv*%uI4oYZ8?yxTDCgS!+UrwP=rrc6|*W~hQpR_)(^&-3v05;nj(e%H<0g`3Z| zslCN|vC=clIOleo4UcTBB5XT>i{F=^+Lt4F=5?vk{I1%sT|}qJ8G3WjdqC$RWNiqs zN59yPwQqcCTW?8f4V@Xs;|APV1c5C=W1L^sMeJ%3Es9#*jVJJ_(u0wu^yUONi0SPGt{M&L8Os0D}-7%{nO zdETqV{INi=+5m^cfSWH)u5=m>nqhVEOm_;K>E7EYfxe()@N?R_up3j6 z=N2CHK&8&Z-9Ex*AN$%7m?VReQ`cZ@5gFWZqk3eABk3$aj%OWh^M>x?j zrFg0;vuBS2D=BmaYR{C;5BP}R%!!rwuXtHztv*faoW9Om&5#@*ZJd zF^S>AAON&pzz@H3u4UQ;hyX!E2LiEjj*8CSMd_KFK^gdUur0C@{xAA%!r%{(i5U_x z1bUZdbqt>AIyn9YU;}XSZef@Xv2X8qV^+ylLuCrt*bak&>BtsQ31kK^3oy>G4B~uK zd<8CxW5=JzA%TruwHt^LK;kH@^Vimvm835|1FX=l!xS~-M2J5+DP(m$JW}X5Ve%H1 z1PZTA-d&_yB)Hi?<36zQ;QC_kG^7K>v>)e8YrOO7vBG&MxL?8H@?l=}botgc-SqwC zQ2be>EYd(bODs&MkxxSNT+*>*YU0O%p9Zn8iHA;0wE8mV&8btxBv?EHk)#k%0gggf zMDrSvR`D16;rp}dw=;Fp@zEiymp8o&yt{R}pzDA?em?{M5ErfiX$}Z!5emE&dJP|qh{!t*&@;D`GGu%HckbDzd8F%pr{tQ@xD3W0jiF@=ewpp zg1i$u(PnldFm)8DG&Mzm`w?+8RaHQ)DPbVB6Lv(3C$T1svf2+d6XjI;#`G4)-v+DS z%QS&Y<~e~t;%a!3A(%C`!fAzlOQConPSVT*PKIywb$m4MW?U>vWe5*6+WD*Z?Xp~uNca@`BG>{EW~$D z&A;UViG-Y|aMPWeKw{3LKPgOLMQXG#+oG(E;%@+v4-&;ttVFMds1c*X=&|#&iD!Nu z!1;=SyAB5tyvlQ?)k{tKCQ-&=1|05SHH7aGwJBxID=0A*m!X}j(s02nK|h0IMo*NS z|GJ3*x@J9+-y`e$`KQCY4UO5r)MULXyzC|Sd_SrXPBT1h!q^_@G_5t3yvi-@0h>vf#?O;C7)q9%XVw%W`QW_`Gq&EqryDw#Tkz%hM(zjF>=qN+E6` zYW|bQr9M60+JS#=hc>H2W}D?x_(A3QlLhx>G0=ROuhdI&|4IU3FWk&BjCSGADKrBz z=}drj2)a$WhFelDC!ZJkDzYIsUU=k^$4HK3onIvc)(86W!Dql(MAnv3R*yP={B46C zDLU4n|B&0DQ4Q*7Dx!BC`RF4|l8MT&M5LruZJPZ^gCi(DMP?`^6V89tUZ5*Oow?D@ z2l=;-_?gA04`6^$YYUXZyIK*L=NZT(BBZ?16&2PolDNZ|FyLY7Jw9Ax4O>FxPr4D` zG$!Zbs(T)pS46H)J@kqzjF(d-EEVtzGGT*A+H&m-zjVHhTH9{;D)tYG-6s~TV*?<} za#YAhI)C;0D;oBZT1s13Bn@|}?P*!oFY@jceswv2-dE-U`gGj%P*3$9f7`_J=6=s@ zod4A{hFveZh@KSvnE5Tz=fe>xc#u;*Va@n`2~+|Ev*d9QEu1l<^7T?Y-UE1M7aTNE zYXZ15GtEexmN9v5?kBi3Q`+OHHHQya6AGa6#qMFU9=szkYsA9)dUUb)_nvrnzJ*Pp zNPZHX*7raDoWOhLw38KqMxKN>%D+!laa5{F-El`0}+hocy*dv1`QcBTtj_RKlI zo>nZybCX*u5_@JB2?EdS^Lf4@4fe)0oM`=MNNDkG|Hxe;`;J}_OEK)kvy->P%gsWf zg$iG@D~?m?PDPJxfq3#v;#rF~)LCt;DpBdL8FXbf#h9aCpA=i1=Mr!P$F<@yxvAGQ zwZK`*5n==%ME0eTcYJrEOleiOHg_^bCcIwy+1P|brF9n)aGY3;M2@A-bw0W4GYgJeZKd&tT+lv|B{iPX1**InQ=vdUT zd{rjjf<;gVseXHfI^bbM@lMPWpskQVdqsL+^fq~)CnM{^OCPDpz&eJT%=yPNrthB5m)C0y|p>e8G;b$>`_NW zLaqJHS~O|o&P5Q!wdAcWVwP@eCgT2PsS@2%u~Zv4J+Bqkj}9t~Ip5B%jghbD)aYq{NIHk#Z|m6{&&^ka zm)XOZg}qR?&m1Q{ua!;j3NUG63IQbOIL*@nw3rUnSDCexP<0d}o4AFA=@fwQcE+j5 zC;zy;?9aC~cF)f-@6n@zI7kAAF(U*UbJCb1H+R_{B=^@h`;12Az9GIj+2iAzr;2ep?18c7l z1fk!+oe?9&lK%z;P>fpC4aduR+{zmL`dF-y0uzv~@h!`)6uHMMYcztHBXMLg}k-w zoiJ?|a5)?^Nh}(f0S;EY^Ou!(m28^8R!aLvy6WVwymqz$c{OKzYX^t3;Sf33BC{Pw zqqN?_i42Y@)nqT-Je!?wM^D##i_6PqUySH-==Ax0qNb9jIpW~Ob8>QtyO$=ok|2UG zM7|2eqnMt8Aq_s9lSwJP>A^$ZW zXqLx>&c?Y(@)xHf;3pRQMp@8lk4a79pPqAtvygddbfUxxY3rEb0QKy&^Kt~qP)9!R zs^OehR@KaxR%cA!~)5uCeatKHS{pxO`qfSf( zd;xI_1Ydz?FLG&2yUHq^yR(BAPYzxsy=<8SLNlnD>d4H<9F(iR4D=k3%{S{>Jm#IcD*G~)ezB%$;>UD z>mnU}&~P1$v?Z{xe^ZJbMt}**HG_Tpl7A2|j~e%7pZ_?)kSM_@DGaj^e5+w7L@z=4 z3maTvUlP=eU3AKZv7MJigr)7J|{8TZ|cHQ06J~uHToGrbCgK>Ab~=kgG!0bjIh?j zqLvz;XEB`~Bh9DGTD&+a!nAaaX0w%Dh;0}IMV$%5-Qj2Wi!>3fqD_TrUPQef)>5?`do2G5&V+_%R z53ogWk01m_q){cvY*y0&{p7ST^O2lki+VvcR+v9t3raPhoJapJbHU|OI)|%0psCJA zncZ4+fQ4wssLfzE1)+UoA}4{^AY7S73f=lj-LrDx!=KwX!cvV7N(#t+_?z7=MAo zQs8AefDXcfGJrFH$n1_{wqcUihaBnUF^_-m$(ZCmS8*1s`i(CK&P}*GF7*frkl<8WJ$Jb!=J`@J%9j@`C<{6E1dae_Wip8|580DBnHKxwQ(Si zjNPw4Fci<=A4zDG0|NsW%>lf5LHG=&Ng~8SB5;rKmQcHpLs<7BF_=D52wWoG!)S0M z5yZBoL=Hf~T|uCei;KrEEPTb~dqbXq-Y1`k*ln!v6bT;3-%FCfx9(q{@*Se6Oeg*s zo1@9ve36V*hzh|TWzw^ALN|xa`dK-M9D>YH#HPxw*CK-1{t|H@>CU2-SB%SH>wLuT zE+i=^0MY{_uFmoE>}7?g9Fr2PT76>aEr)l!{uUwcsSzf;$u20otnk@o|5T*X+V7q( zimcRqSYpySX;FRfetDMs!|af3bVyE8TNMe!CL_%#d<2ml)fgzo3%MMBn_OfDg~d-q zllmaajLb6Ll-w@c+s3{z@?$rV*e7gRAR_*teJ%krTvF23I5v_2b(`6BjKn9Kh3|9B z&jBNu>#l-aS`Tfl#(!btJ6N(XwY;qVxJB}{h4+0Zzv)Qgri`Nx1yOksX&=X7W=I>x zN1L`DTE>X<46yE7!1bL)9bK*}s(*jx!+3ZzKAHNh%E@+JyF&?e9}TZr>^&dlJWiqV z`*stFQCOttw7IOPTYwmprPnwj*AH(i1)rh%%ttpmr?*$>HA=cfX+2yp`SkiSF;Nt^ zlys0wDj9S~ZS;-bNYyr9y_$~9xT=gUHsCaZ$+}de=1DYRxb9ONfaqaHf4IG8vd$Q+ zH!gk?5q^k!)0jpNZk3b&p<0{L44uc;Atlh>$oIJ{p5BVqW}s0`h86}qT?2#!`QygD zSNCDr#ZFQZPb#DuedvXE=DvQ>Vezj6PdLz@1fwHEW_WJVP}7dH{2a1XZ^I3|(#uxW$ZlV+Wmn1r-Z<8KBQny_W9zpJ>o~|hI3Qhut59sCg<;;Lw7-pK2FU`# zfQSc%TqJe;#b3zfr(FWeJNn(AjfrMXBUUrWBXMfrkIu|g!QYR0l6gzI$G_#& z&whs9<5oB!ZyBS+9Pdhrp0-cnZEGOMN`x#QC(DLG!Curl??b_(>#N_H$VC0)L%E#5 zVbF;de;tbEH*w(X5Z+5z?7!?d0V01;oM9dEsvRW3K+NRL>~kM*RG~qp>@i1B63+A- zMh-*ERNEQjjt#QQPw%?AZ0i5I1Jfz+5X{FmSa zD<#=nGPTZ7j8;jnT2|r>7-?UhZ<+UKrRW$w*ZpYXBxv5W2rN?1xFsphH}FwQHx z|3MeK) zw6e(#-E3W6IM|0f!R<}I*J-AHwY9gZ*fTO3>EgVhA-kmI%VP>lIeP;~0*U5ytc?Lc z!FF_YSzyHrFLn|j*Bzv=l3z2|0mR1j8AlCf7rb|YD#}!Tb+@-~e1c|dSHV%Ozo(GU zzZoY|wUt99D>lL;(80yK$U80;wf-FWQMq778 zH9gg1L%VzZ;-9-PuI5jz%8dWSdJ*px!U&lMa#QGu%9IHkEkyoI)6m@cB)fftY{)qN^Ge$TSd1s6SxobHaCoX zReC+p39HO@Ito6GVlZH<7vR{$H*tkylau#f-_Fz+F&zG!VUWzy^ulYwEZvVMm_MAX zt^ktI;ZJ#y(3)<$VW%R&dv=WS{EVXK71Fl%WmVm4(G^np$X847ke+e%V`_0!5e+4o zn%rfcjW^Xx)oajZ5i+nM?_?xPZy$~n9^Ec-Zc|^9EKL>3;cq|fM7oRCaB?L6amJu~ zFA_B8FnbunH18!8at8B7UTxC8kmslm%lxJ|P6TQX znXJSz#pFz)=Jiv@nG$!1+qGmrx*WN<7u)vh5Wb4_??V0~?Ke2icYl-u%{01vI;dCMD%mg^op-wbt{TUKcW3#Z@KN6I=!@@g z*Wync!bA2oA*EVe`k7SCM}&g|UKbd(|BJ4(3X5ZlzWjwi(1gZAXryu1;NCbSxCIUF z9xPZO-MBl!U4pw4+%3V~3DQX8KAr!}d@~RCKJ{aNRduTNIeV}5TbIooVw3qCUfnOG zU(GLnFjk6FvCf0d@VE*OycX~%g%F<)wA{-L=SN~HzD%lkZ{nzew(;o6_t-1=UnT1W zqGaWdnZgUc)CHqr<#ec2tsiat3?L*2$*I(2D|C55OTW4{GB3+|ZyRlhmv;+bDzV+Y zg1L}yqEZDlsMcoNU_eG;i_v?3`{S=sbZB&sU$x$skKILuCEZJ(JRP5lpwiP1v&OOv zWl|3`blbhcF>p&9a%0RLTtar{ptCItez_(3c;XgRUz8(boVEENm-E^?aUAFJQmk4;j{o{z>=`ceha+}!I}+qE9HT*wx;3zbG6PMw*6DZ;YHTSOwsz;i0QIX3 ziW}1Smegt(W$L|8iRxZ0CkbnAq8z#PRI{cP4 zQk(0X^DWu|C?vL8jaOU_&`>HnjO5~4w&p(FFmmg!CcuY|Bi%$75-(bU*WIRk?yZ|Xc+<@TSwu1!g)&ZTdm(sZRH0w5ezHEEnUajHuJBRwvJw3>< zPS`h6d_KEIe4oSdOZ|r*tk+WK%GG`|qVnsHzIk`sFF`1CNg`vJ^e;`&P?h3vWUPv# zp8Tz#Jiw5o{Tf@XM=5Vcd1I^V2FaJ3Mb%gJ4PT zyXi4AK1R|Hudn%(X7fT6$De1Ei@-kByv0Tpf{2{cz*2}l2Vzp7H2mt%Y zgk5orqZ6eQR~1-l*cmZ`g)>nKq2IpK9@(#&RRMM=dE9yPZndf%*HGCQyc z04vW7?=|Y*VI~&&7FLn+5+N*>!HQCHI%kPMrtkY3XO?ps^DOW8m%#1-6nL{Vrf-2h zEracq|5+lB{lf08a)p)usKy@-OeyK4tHX!00xeasnNrK~y4oULSi)eY0xm`T-dKGN zBb7ZuH=PFZYoq8C0{wclKA-?jI^ei(czsAjVTLyUdW!tMJKq3nxX+6|O22EF_MO#a zy_os(CkCNh7cb382n(vVZJXO3CGXO)4-#1CT<=O3K+1aZ$~{y%w(98k@rlCzY(d&D zsbJW?XOBcmZ11+s2KUxg?g}fHL?^ATyO@KlchL}>%|WWrmFrzps&SU&$O+-OpW z9z8|=`Y0D{>`azRtN8jhE?0j?bx7N)S{UaUiWksEM-xI{G~g~U1$rZAccKtI9R{65 zVqYv#hhOe#BNWpqx@n9q!5yiRn7!RUe^xCu8PRlKJ<{7Hcbme5i(`y+^~5vRolI}< zBXKq;7GFz^Jdn&ZI9Wu-vyAM{-M%$T2fU{Vcf>CQ#^A>=6+5Bx$e4HyRtm?e7Hc$M zokUW-K}7asp48`b8K;#ZR5C$6r!mIS_!@!mEj}^P>n9WgI?bi zB*;M46~N`RV!{HG^A69KUG5vM!-?E6qNy!tHMEM`FMF|X9o|X-Ka?|`FEmyAb-5;7 zf5<{h%YVPF5yqs808c=6F5Cw;*JXkxrUI`;wcpx++I9Y&%@ksj%r{&Mr4i)n$_?DN!J&2g9W-h8a~lYA%+05$SjRAyv2x_x9dsf|2q{6NidFx1GQWn{>% zk^6c=J*6n4V@qAMkxn_&E^(NwXHQ4P0-h?}QF(mqaCmfdF~RmWr}NVm8L96ead(Oe zYc`NlLQnx(&-_+6_0}p&n^Cu)E-VQdx(W6T3`6QeQF=C@Vo}kFsmcIbF)Vg7DmQUV zqid?AiC-1{y&nq`iA^C zplUgcd5i!}KI_2lblNP0)UeBN8no>!VJ?AuH`bb#QT*qiBcDSrS$_&aN9I)@3vwX; zqguG{7(|2eCa3Yu8e#<0pplOxDN*gLYm=d_K!=0$dE2vySvN?kP25i-iP)?AZsAp8 zCMnvlEy#gi^=(!A`57B+V&}GfOmblz|Ae6yL)Bbn+MXv7rYyWOMsn|~lW^gHar>T? z$9I=m@dUtPIXI^yCA#VpdjW?J(rObH8{iHTxB6 zBrAlfZ*JLW3|37*d+2cG0UqC|(sd}7Cv;3g|Gm3SkN-a#Fd?HDkdTvt^ZJ{hoAnXy zJ!h27pO|}I9}J@bo8cTr4=r3i3x(}5hqT3JR>9rA44Q+j$N_kZMfP`JQPy<_kz5U5 zOD3?LDf%ozR`%36%nYfgT(bUI9k6esrm zxl%SMtf2)g`uA0L=87UyLOHc6$&hvqXO-XnoAUjyhJ2}ErL4f&MHDyCn0(3J#@s=% z0GoT3>Uey)yUB^DGByqga&=s>y&7zzuDt+u7H>nKRyGSyq{pJ_ona~cR+JHAGE-|2 z&0wH1_L-q>7Fxj;OG6c+f%E=h4zX$E&KM-IBz$$rYyRY8t5p$ifAS>VZ1vi6^Ip~E zDQe0R4Vr^;sq!XyYeYv20DQk#Ua~h zy&0SCV#i|{!@594GuX$a2i&w)VEm;3{7)xM@4OohNX9UwyA9}Vyn{{eL8qO_&ndYY zt+hjY+t_J$cB&nNS7D;M^j>i>!84_12$NMBoipTx9N~;pV0IJb980u@jz3!lFjS(Vk{`~U9V?Qfc!-@~U8iSv?Cndj=%;+Uz{F{;teWw2Wr@ZoR6uf7y zc{k8DMg#JK+WHX_dPrq9OB+kPoqTJ#V`MG1vIRqP#rSaf zCrr1st}B$Ndrl(6e%_fyRVeh2WI#M4m{rnW`BTqIhnVBBKu6fWuli?3#13kQmm4Dh z`3(22kBeq)iPv@xvH(K}CoPveMn?mxeBV(_;uz0CdvXQM;!8mk@6LGL(++%+uE zPY0uVfQiM#*&LcEw}Vt2ON)wWkWGFj%&{ouYvi^Q=C@xd5}SWoEP{Ryy~{hg&dbT9Xu9k|)-NYmmEy=+wSP^ey#VkOR* zABVHgo5j)r_=?r8e4n6h?|2aqFcB@&3uhY&I$2DcRJ@(M)DD*dR?ka9dSBk>GcsEd z&&hG15F0ciz2Yw7Q}jj?Of)0v%>L-oOTMxvRA{CZMngy6vqv1g?|z!v-G)lqnHMVS zj-dxe#0cLN>wW!dytO?@npcOlmv?BON2Qgc<87AiEVEQ zdHiX{HVc;8f?mo-bT4_8&g>3r(FOY~J7KFVN5QgAl8QSWrzh8Z9^d%(oX#|gkCuE{ z7?nTS0j$5J--tT_+z&}2@grVBC=L=Xm);tjNqDkPOO|=#t>7!YDi|;AQs)Bqk zJTU~{nU$f>R%yJrWOh`sVI5OL$QiH$CpTgRAcHNFd{*shzT%r!pOea)eH5QW3_UQuJ&`iHF3!@S&`VT?R@v93m8iJ1v8Tv7DO_<)iw!GNa}h43*47W7 z{ZgkEyf}UC%SwE0A@X5dWEZ;k3KGt~EprPlH62?LMl=5WwzB8ty@Ylq^-hUjo ze@h_r)s3ml9j8IxMuw8ldN+b>Z7m12WyhQpC=YijV*2uYHvhRaCsuN2NP9~^U!)BXDOMlE|MrS} z`2Ak^@JP9jO$Mvqqgz_(G8gegdNnVUW(htnq?8>YAIl$s1P-Y(L8Y=lm*#ygNVeN? z^eY!;xwqf)f1FxOz}k&CvRozsobqS6J)}T=qwvVTu|@>rk@wtE9CE%V+}(`mWYWYu zKx2Pjo_@BC*A*+K-7N0ZBpI>I+$Cb1CQ2{`pGlHgio`d-n69lM2vUl2Ouyml&qU;6 z4hf{grvp?dy@ZTQh#2aK?BgdhR8)w!v!9o=?~fGgeM`^N!O`*(v}y?EgphCxrc_QL zB~WE|QWR|)Q2KU}7U>Z&6u)evQ=JPLysf-$T%r(L+(!(Em$K_fMDCWP!adw6p1;~6 zWU=1TY$Rasc7`2{ElT2%EU^{yECt-pH#*5~8SQ_V060UmE|AHO2!|dMyFx}geAG_6 zNqiEaiCliV&m*fKkobTc& zv%51hE|-PAYpz)iOE12!FiCfHGy0a8xO*?SMsE_deR*OlieVjeFor$C1S)OluK0I7Kql@ky1;rP&}Bq`u}Hg{6E+J&-h5hw!g#j z$&ueCF3d+@Kptm^%TS3=gDL$R3h4722o{RM4Ucnkx>pQBJeIyCh~|cY)i0p1`9QJ* z!8TK@1}{T^oDdSM>$4lvU~l=DoBb!OS%m7c?mSqX?=R%$oT7|RxakXyglNQ&N)tV8 zHoc|%RBxz)q2J7Sy~ljaApWW)Y71{iVgFq{{T0L6U83S}{M4KymZ9Gi(eow`y{<|e zP`7h6%3)>Kcu=Kjy-liwOBZ8gs8LP8v}I2%pzSNbny}hgwKVC6TD?t@b3rx1YY(pQJW{{2 z8c8<#Ep!l>fQS$u^KrJafoZ=5UcCO#fY%AeMEgd<;r%XJCz>WE1S6+kr9b8(V;RI= znYnMA`*SaBFiDw!&NgFF?ANdToEr3JLcC0RclxZAn*;m-q0(1=W)iOxXlNGmOVzaF z+p27W@QLR6Sh_dX2$PP)o?6Dn6$u0I(BR)}MXxlrgmH{V9yDxNavaeaj=NTI8Ptir zV7%z!8r0kRx0gS{VS5wfmv`;H+QMkBO5eIfWzSP~VjKpZ>Tb)Fq`g~B8cU;U-Shmz z3$DQpJZzu1_N?(b4H6Eh+gyBuiwE(bj?gUt7fSd5fgO$xP}uRCdZ+f z?C#m_wJa7X(tP}NW!PBM=>HR=YK|bvE(e`SLjYz#EW9_|VlOM*0Gz_ndnv8|s(Eab zQ3}JYH5iTwr(%D24%)tPqXI=!ycS1F-U6F+zy91kDhx zYk>*kqe-?2SAv5=saq74@%41NHXTZrCJMq)*iq}u>doEN|4?ikqQ`LDlKy&TC@@!K z=PP!n$O2rbaByBJ^u;;iCsvvvhyj|@%9k{!!JVsr!tf6?5duW(k>pSW@i|Gv8>VlY zcFBNo4{uznv3z>ECbtPU>N!EMB!y0QwB4%@>kMYe(VPQopggh2;Ewy<5TR`mMB ztJeq)$;rj&{NjODa%DW2f3Q$%7j&Ff6l1y_zVI-frH^sWkINbnuWtm52$&$m^fnrt zz;mfv4x7vKgK-!7r_en~aNucB&j)hR%71!FAY>qzVK>tKbE~0Z4LGn#1IyXR=>{y5 z3@Ez^ZQzQw_n6+5W>Z*UkXI^%A}@f)V-~-wd7EFdXxFbpD5h^Gc!6B$hiK7mo=Pzg zt61IsXCk=;7E))pck5G11QK$Qa%t=tLRyb^IhU2?*!i?zf)g+O=W!+CEXtFkHb_rB-$d?azn6IyV?WsGT-u3gHNwzfMvH>?HFD6{oz>Y z!0+sVFS+(^hPWHtJZc~BjT+xHTUEt+17#O-pGAnVA*2VX$@zC(elES@VAz~Un$)!2 zqVFi7c8@OKQ2Muv63g>lko(t=m+1C0gnIRa5jJc`8qM!-jk&Lfv%~V^ z`Wwkb?OgNGKi-^c8_266s4WN!=h76b-qH5z2H!zy`{_k{wj)qxCmPb4~gmHLUid!Ynic~0@wc=4{tqa1dMd#Lr45|bxQMti+% zGfMqs%u+2fK`JpZkvuGJf*`&1$k{R?|w(Rf~_P* zZTFLUFgTfqac|gJ#3l<66~iARd4c5xiH*@^4}C|iv4m1$6O|(Y_F{ThwQZ%g0%l7x7){G z`^+s@q3!v`+i9`RTB}%Jd4A-nf2zW__G1p(EopS;W$}pDaU|`(Q)&DJ(S^^Xu#DU# zZ#&rQH?X*kzTW1JCR+e?`0H<6fZ#xh1{7IQH+0T^+q=!TMun<%;^Pn_-onO61v&!+ zVt@=0rr?k)R|g?!?UY5lSNMqe_h0Om3G9Dkw?+?w+qrfgB&ES7Hy>?hbynh}hnu<7 znc}w^j*XsY#>1^7e7W{+_v|iD{EJldBO4AMamuXk-E9}@m*~dQY}J$5wA5}jTRf`W z$n9&~V+qmb%C8TVn_YQT>$eqG(1yS%m5hQFOd6A2?45Om_DisYkTgH2x*0`#`~ZF& z^iuW@JNPT1jdH((-4COd=|pZ{kv;ppHBptDn)2Do$A$Awr>sKCCP6PlEg|XNUwA+C z+6G}3EP88?;fRB`?)6YSUq})`$+_tmWtL(%5o?Wr>#mNqbxAy{kHeY3uV24RK79D# zt^*&1NSiRsNXGE5;M{58`(GDf9X(kWN0a?)56e<+rB(Vd(>r^=Z17+Z{P z2icy(nSUd`p*=SLQKP2}?M?=R3(=NCrgbzzc<{@$a&LC1cqv)O8IuRI64|$FBZD&r zn~F~j$;p}L%Pc^|ZDJ=>oZlTlpU_KqXknL5E9-i%BKpfS+OaJppV*qP1>_M7< z6|=p-E$fr`(p)03d3s1JZG(wl#t*4}izw$yQ~|rH*zMyJ|K_#t1^+bXt9-(a^SX*X z6xe%r5#W>iICX$#LO+Ucd(?_~Q!94X_>yL1vEKGofh<f!(tolrkZ%LUM!9 zOW7r|lbO-j*H@e3KL>f_=)zM{SmQ*mf%rbvJ6NnnVLcm0`EG0k2k?TFHkRTKgL&Py z$FFkQDOJExM2jDA$oe0rw*(q;_vD{3|7VDLD}~hh?>w41(hC;6{-H-`+vHx z|GO9U_&>d<=mFnRP%8+6N~0Kc^)plfTTB~NNBI#>;P>P5r^M1}@H4{J(mY&_oLYU* z*jXKwCu@I=MzM58CX_F=0Aou0_1#AAV$qiOc_&$T4%7IjyYEfTG;3P^OZh8lhO(cD1(l)Mo^Lt$dTsSFepZ79=;OCUm8t_N*^iu6e(Gsxa?cmsy zYxSC70?xMBcFMDk8Op7$aC9)((wtHMV-#3ImRrHR|Ewpz33P_ zL!8LgRcdGu%xx)3-SMl71yo$3Mrwp`@=DpUvX1AiV*or@oa!sZmUw8b^hVXw&6JWY zUY8W9uG?E>aNQ>~G=4ldvn*4TkpdYwiWFWit}mg30!7Cpd@GBK@!9<*YH}DfhFUIy>rQaoU+B@qw?e!B` zQDV@?RsUyS?fTQq?GyY8=>6~S5z5s63cFqBlvxj@(|s863R|WG;v3mG3qs#q1vnW$ zInAH$B(wKHuq55OB}kXiaKSnp;OcCR98pzE6={73!hV?YF9ADowG9&reBt z9v;a&&CI`I5N>$BtVEQJs;V*;kgnhPA;jwT6dxB~C~?lLGSd8+fb+!=GD&=f(Yjf` z11_s9VTY}qdla(?JS?0=^8V9vpJI|h3_I=F7Aa!gro)AHFJ+B@( z{>crQh><%@~#4TZz~pDgO-C~;Xa(|Eyk*r8TaI=qV3XIVDEKxV+} z>_M!Q8~M4YN+X;!=G=|j&s3DsCd!7zvfz2!;N2i zD#Nc=FrjJ>1O5-PKg!E5ubKRbb4BkDepDW1c&s|(T_nII&#vWf(h_+8YXX_|5?PsM z3a_IH-j~eY0@?4z?elUm6c5U#+7M7;7~jh8RvqmkHtj%h^{2F;R6@Vyc=vAm8Y*ff zRIIM@LnTX)zm3uyEarndKqLoTEMoKBv6kZvrM6>gG5sI8^bu*je94N|dcMVQYtKR8yshd}{vw?8nd zwbwAlm*pCNkIgbaZ1ul7H6~7QHPZoQ%gDtg;9ro!8jqUi>XY-tL9xD%zi8F+M=wP_ zue)UJMS;Xd1?oLM-g~0+?5^s;2qC1VN5!@_#4J^PmCxP%%B9W)W7RMreGExIB0DFr z%zYED^;!|d4tCH)x_DjN)7@c%jPc6NLqHpj6jm6b{c<%Q^oo?6b&BeQd)!)%NQ1Ii z*6<^@UevVzO0g@f3 zW*T~tZGe*35lyO3`JxOX(mt0G4Zux){3#%=BQMD~Lc0>YD-pNRcD=Ke^~j+6)Tl?u zO67vcr6heW_F?>h0MN;y_;XL;%~EXR9N&3ym)hlM&>%no>hj07<=QkjHgPaDN1&C> zwZJC_N~r^zVPStv1-#SX;zjmuWVOQ{H?%gf5KssSd`etLJWTvkwx5K#^AX7ADp!AxZaJnI?iW(Y zQNO7q@o9;--%5>_StoY4y${Qs`W}y@*ZEi+OuN+5xS&+8m_g-zAhq0HXOqNA6blIp zeGo&zGK>6d#oFMOaqVP-;oOgoDEAOqmmYoXkau!Gmlw2<(QIa0hf&iSoGSm*>&+VT z5eF*W8=<()>)CQWOaHsySx3bQ{^jN6u~tay|H|C?sSF69PqNxj6z+X#^m*iB@1r*E(vjJ`lcj>=Jo=&TLhTH){EBresBU=03T>53{H1C#WGDdsFU0fjhe*SbAFPn zI)ws@-aM+KgVorlfU5phuGU|+?@q#Dwq-n#IEWGX_ ztSYm?1=?sJ;M)>Ue=l~gFMpt?Kl9g*_V%-{T<8i;3^#RJ!nYPaxR+hoy#v4rQ@M@*SmM9a}u zo)vB(#+1w0T>;=H=kwK%aD*^EE*J_XajJo(Uxsir#i*eJx-ngH84x_mjyt~e+1IJB6rXl8l3U^jwXfN~OgrMYom>mhda-F3k97uKjf-@+SAc|kW zUrMM7!Gn5HK9A8c8ddqO#qwrJYVb#d0&Hvx{Jap|5@U_hW~OR&Dy9_b)9kMj)~X#G zjFagWLIN)Y>i1Lv(adnnNKy5gE9GqR5M4$bgCGS|VUSeHKrPjzN^MMbW<8D{&q3oC zlUO<6<4VD#Rars&ThdaRMMt-qa%I!DU7OUkXOmRcR0JJIs!3u~|2IFo^{Wg0mNvf{tN|_iwx72|& z2}5weUJ{UOr2wZKCy_b{Zy-WlAnL^X8wV=rgHe>g&`5ApscixsE3hMkfUTWxkN!0K=AKO+kRi$JyOl|j2nqP~*hpJsZNC}jZAT$_$ zCEo>l=*BWkd1@`Z{kKyEDmX0W!G@{72hv1n-VPJP`ZxU7@)p9eE=W{!q-K9>m)Ch= zdIY)NmHBxoHC4Yb@-t5+lqbHK{^H-P#X!0QR|*J3!k;xy=KxksJov)zA1eYP*;R%;E;Wji zPFvg>TO`y2ZiNIW%hB;zG9!5?J}?RTqJxtJcBa1ruAhi|yCuHfg$JT-vIxtXtKj`i`tM6&ge* zG`X9pKq{)QtXC$*hGxv0^Y~@lEf2_jjulj?xMWJ-?%d5cZKeDGqr1NRC_c!X0BRdh zD?FRNht%D4`ZvnRFIwt$bSLFBqi{S-yk1DS_9iC z7kqPEl7!nmO9zW}aWHu0&4%4(V)-Qx63of~3{jCsCxrWMV3c{@jkDD5p!z=o!(@xS z4t5~ml5UBQVItT%bJwi_9G~M708!;xCHgK2olx`L#WN}VXp{Vn{MV0#*W`yz=C8ag zDVg3-g9f+=j#+{MCThhjX-1akbxOMoT_U%>GS=euh0g%Ao^ihvhbM4EupIS z+vg$`>jJFZtm?enXr%=d6oa^B0OS$JW`TtNyH=qkwOJQ+$S7JLi;qzkGV;Fq_898Y*t20~VoTY!eQf)Xb_|86g@j)zy{lH~X6oTnxLa9sX;+gMTF#umj z|7war#{mbttplzRv*lyd>Ve{lU8S@MIds-(5dPtz{l1=~PNYknekbqI(DteZ;@X&CUHT*c1;LoD~Zx-25rv zFOB(P5F-JvSYnV9Y{lWyqQdn~25%<*en+50ypSnDBLF?jj@Fps$AmjPVSUiD4dUC4 zz?`oPtpu8`h7#e(yXkI&J>Vp)t(7Hg`QTV^(g!tuc6jr<>7*YpU?XYTP!=)?e&L#= z4dW;w%m_DNk_jps4vz-%FK&3Lsziy~W&3TM{IhDhpLGhL-cd_zdwTFG@w0gs3}J>^ zfqM6Ophl!xhKaoQc3Z~uPG9b&FG>Qk3(;LMb|aD-cnBz--)+E})e5ydkJEs$?CUay z1R+_*3%fZ>W(WX?E<{jha@yD%B2fCuE|1kNxasTZ*Crid%eV7fUcq{`0wx40}a5FDm00M+b3Xegw4q6+jmqnVIvoTZr6izC^0e(5aED( zSYN-rPCV^P^+QHb5+3LO>FOqii`PJ|Oq;4sB(GL0xs#FZIR@6Tov-+bk1WWbm*2|E zi0v#Nt1>C(0{_m<^ey+ik4y}#Vbx;Q+OfSI7PIfY@2)%XZ#iaKqA0Zgl#67h?R*}= zAU!xa-@fWgOr+AytljDKTN+9@wn>cd9%XtnnpLAwm!MCzduwj!M=v z{b#@X63o4IX@z^`dFg)j+M=+?QgG>b9i5jqCEYdVA6y6kJthS2rhKhJU0aseJOI1! zwf>4mkg#|~ zhRG@$1l}_Bxvx78_wKL|1wZ816rRR#p>|UV+$6OyoyD~1eKnBdTDp|Fc=~vC>p+7T zJhb!MQa`p$Oo1+<@qyKpPPv-HYK=bd&Xr0pIC@Yu6%T`lYL<32f{FC-t=lSsj{SMb zfp3v4HkqAL+Ujtb;%=S$l}V#MZ>`a3I?v!9rZHQkLbi>ah7vvo`w_FG+ou;WP(uV> zv$6w`OL9z1<{nK7eL7*Ia4!}HRyiJkavZb=x=pKSXc>XOw6;dz0|IS_l-J7zan{Il+|K9X~LcgOafg-^T1fCR>WG0o*CyOXK!!u4B&T z`Mm;hFtqd&ick(vXJ20T;W1M#oL!;w2Dc1B!N)k>6;Gdjgwa8K{zwmq@PkA(8MTK| zFxWnE`tbL!!@90x(UP1pCw2JH>*>4&g1^+_cuQBF3k29hZ<`$w?l|QiQ?fbVh?EZY zhC9Cv7Y@~*SIQXUbND`KLKTQoemLyWJMaDZ6aYNhaWLkv0=O+42h1I48>O4$V7Y1@hTR^E>68AdTH!`mWM^_J0{`^R<@ zht)bGi~0$3P+QM{;Ayz~{oR=;l@Cd?7dlK!Bg7-47neur-U2!L8>7K8j%hk{@8gE@ zJjWgMM%rkGFpZ;a!tw%wfxo=j7rufxroDfl(Py>AQ-!J+r|~(ulk(a*ZQpHqqc=Eh z{VeqS4@uMSfS=r~bsSO;;y7Y9zpo|iHh6|lzmeo}-dprSSFzj{zLkkRsJwiiFbhg>a+*IW!3=@EVfd?`O89qw!~0YICkk z)2cEQe?@agu-S7nU#Qlx6OtfZU&X&24ta>dWM3W7Bt1zLYrQYKAW1-^z zEQvgti6y0k%wb>Cq=+G+{Fa{kh=3_|z^p3j!NQSfSlib|26CUXDzG2dkV3 z@jh;QahdMp@u?efqVrA9d@fuy5OvixR8BvOS^Obe*x%60UTF6Q|Ty2o|J;{?TMdSY2*g=#0&{^4Dww#;b;eg%uDj6`%0^)ac27fw2OvJIO`va0l4sUS z&BI~8@<&=Uk6a^+zVC{3bBAg@Dft(4Q9FWp)LuY`IPor#q^80>!v}wkr z$3P>5Sy)-gZf*bjcqUJm8A|5sTh#uGhHyLjhc^)n?sf8mpx}{F!V!a+VZsXC#)j2sv(y8X7iE$0N(fO9EvaUsD>o)YxEIsP_ ze$4-YXe*ZCI@sXB8q<7lb@8B_LiP~@2OrFzh3?D$S`v@^5* z2$JI!k2Rhg^Wg+WqT%SbI6(eSw;!)g)*PGTJzwFx>u>+jJaB*ZFR6xq6Wi#C0~qS8 z>{{+5mdKD24H{ZqR$h7CJvgylZ(KS}+-^J`AB0unU?Xuvfk%E;k^cEOy2DV1P0&t% zC_FpiB)8C=8$palH9)o#-@=ZA(f^h`W}_vR+8k_N0CtFs4<6Ck8T{&*IFAe}aUhZS zzMn}?JhbqO&IRs%_`$30oz6J1ODW(jc23ZpJbyIVS~x?{%JwB1RnSS}O6{rP7R}Ln zRnOXeOfvWOvzi?D2{$-07 z1!kgtA0@|R*j11Y3o2) zwZ`KkH^S>qj{I<;|Q@9xVUaFG`1A zgP4k-fi^ePsY@|v4}tA8Vl_(}JDfE-_~0UWn2m99`#( z?us(X>CgM8af~jojk2-t5UP@ztW;vh8 z9)CI<^x4{79W5Mx!x42-+%e`;HT=5n_c7Dcv)TvbC<3QDOLs+QB;pu#;1&H3q8B|W zVs)5BD&IJ#U%C_(Jq3%2757ZsDmh-?nK3N>nF+CM`e%y>nv|3A&*&{;U9j#(|ERAz z@#>y7(sZ=5bN}|%RarkwI3XAd)D{#jMY9)b38q*T!(}~Hv5qee@RU5pr&j7Q{2^F> zW^uKu;$LVz9J($YLc$!Kc1HM{cJ|%(Nw6Og3Xt0W(vLqSjhX-+D$KfVduF2KxwaG{ z(x|7moROr(&ahwVg|AZ;Oe$M7C3fHS95hjKjE%81zj7}LZ>f4A;$Y1CN&Bl>bc9a> z*UQ?+vqIGhH0sAHnjLj>-_FNFUZ1+ml7>7Od8ty6aQd{f(p&Xt|KOO(sh+0~i5%@< z+f}yMo2%?RSmx1~ANw>RM{dO@p=;w~$aT+c&U?>HUyAW9x8VbSm(S?dIj~bLN4KE2 zM{QAJS2d2*)4ggaCzHt1zjz_v1&_;p27V)nz{|y!m8xEl60_*6``s_CwGc9E>2xG9 z!%KV8%zqbtIR33H=Eg^W%$a~Ab3vG_BKlti0w-FWK-cdp=Q^6(1McICf3n_TTp>0t zx(})6e}6_PEN#njs)9KiGN0e$r*~Ycj8`}`GI#F82Ffb(5i}>t+DUwx8hfpIC_kw2 z7|;ciD!to$D9C(>?W=Io2yIhxXj!7-`cS3162j5Z5e1)@Ew06F_8Ph;J2^v-ChATA zJ=HPi2w&=Twpw?dFNlrS9*J7$Nl;`$dYlqKUEebQgFRCZ*JpQv}RyQxJ{|9)0UHQahok+l+9c2)9>9xx!w@lS1OWttwb zHMW3u6o%JUt~aY|YVOQd8l6~-ecw?~R5VamkB4GXi;awD2(j{EWwHSGMGFqYqW@zm zJ|s(t$*TPHMjDHP6B$6FYUS0NeWgaWtxcf&1!>>fTL_S8z!|aMp~5B|0&g0S+=;x| zZu|N5%0)r@^5YXuD39T`#PU?^{AO1EQ>ovu_fm@$cG_rW)V<+-Y^kVd1j2Wf`3fa) zvgaE6id^&>omfjgMJ~#8qZ!jK1LQvU53Ld_P%>qQ2{>Wl!p95|pp9i|_&{8&G1QN>vj& za*|%vQp!n1E;@Nr|K_|$v`F#bg+hTAx8QEUU4j)aE`!woumu=e>r*yp*DJoi1HDPxW~2M`*li*LU7fBGH}<8cWv3Nlmx5tu-s_#p4PQSIrje`hWkH2_*}!M=Q-Vse^MJ6B1<;&)1RcOG^b7;BQhP5u9eS^dwkj3*oHOdx+cp&$?gDe!6@Q2pM=X@6*e9d zN3ZF$=@6qYH!F_HEAIwY|v^~v(*wtH>fI=O!+)Fr)=H8lg}Q6K0J^@6%b;l`S5LiLBRF0-_cQUNWi zGgiWnGqCtT^GEY+j98V)1TGl%-E#v5@x~+irJz5D4Zrg|_b|-)9U8?wBM6V$M|$Y9 ziAh^XS~H2KL>+ourjrX$Z4H{e(ruKz7i%ZODZ3{!zRRAXeKuh)D%B%X|KT;HaR155 z+YVtot4UHW}m$-i?_BV&B(Jk zyp?9Nwpoft1_Y)|k^GqS!9dcX>jiXu-WjHKPJUdB<2Ong{n5Q0j+q1{r8wpb(``8X zk>%;rTesYF3#`hv1%FS5kyyCl5S6YQ7W_x1gq#3iD{NAHhpzc8tvy`7R%Nywbf@@G&Qd&Bvg zpCcgs!yntMZ1pn{xsCrV7XC*u{3)2-@a3%aT}lVJUrzOGl(iIl=51YeivQ=8$!lNNH4~6?K%(Nk8&xy980}8a4s6ly!_55%9}F;Q#xM? zMc3hTE97LS-ojIitL{5H_w!LS6!ltpohxZB?!8dZ#W#5!QE|L}_ZfF%d2=3i);RZi zdbVX7pK|G3CGK$Lr-6{r9O?0$ag(fwK%PuM$W-97MdiDyHzRyEd2a+fij4Qp5>K{h zmqYn9B+VmeO!T1W7`acXqx7!nc6+L(Ht7xc9vAw~KayhQZ5Pi(O!IuCJk)N$H9ZSj zEfVx|-_ThL`|-b37WRPLy@PEOsl(n`k2Vik?=AndpItggqma7%E?5;KR2l6G z^@lz0t+GI33fpSe#o1=drP)*RHb-dH9mbFL8L(zT4`o_##Qn!6zy$_$P#9H0^BH@x2d!_*660v)I6MAb0 zr6Q)WumE3EMQ{2;!5#PC(%bieswW`}t8uu*Z4Misr}HB5Sa>lFs`~^|h(+K#_f0`L z9+%Vc`zN?U9$u37BRK^Sv~>6D_lk3(H*d~TiXBQ;XAkUgJv^okOmHoKA4nbcL5F1* zjiBQL*qMW(^E)0de|IOp59;JMg%%T3AJS*h9UGug(x^^<_9^c}3k!)iFDl0*cwAy4 zb)mLd#yE6gP$Rgfc;@LI%R%?gA&B9}UbRvN!-Z%r4`TOzA9Y~Cr<$%zZ%)~8EIKfh zGjxe2iE%w%NWr;5%IlQc;O6@z1uFPDHl&T@f2MJNo^r_{W~Cq>fJut4=(knp-@O|@ zT2_0-G1!7Vo?@-%*UXJ4dxs3a?u!%OvI|J+io!@Nr(OGWdm-fWoWqdRyxXrS9E))@ zX1~eU?J}hXa6~XD3tW8C6Z+1eSH_}bjTwbbM)%1mP@^i`boXDMQC^Z#pt58!ga4f0 z-i-%S=TYs-b3Lx?GxPTmmJYi=jQ8seKDmzqhe*G?q33kafrya38C0?xrOt()Y3fxD zlc`-Zu~aMBaEwbpaQiM9r^ohJkJJtp9Ovp*+tN)Vq7y%kA{|c}E$+N^j6ZJK83+7| zd!e(fcC@@RN|&VTdYWSI=v4ne(5D62&=zU#Ay1$-UuzP-XqTE{4XREE&fK?jI9zG9 z%&0SLT0eZ-gD|mD7l_Xs%iU~Mn?Au8^EghXL6rMWzZHgEy9w)?X&7rHiPgc<#oC40QH>XcTMF1O7ZwYjVqIX7&YK76TxGAt)P{o=E1h(8o&G@Exi-@r<0|D z!NdCXxl-Q36&yTJE%wU7a(FXk!nejG<9(*u4sCzXarFX0^60r{Z9_8$s*z}0mTIhg z+u3_;yDb)QeBMW28s>DX+`su(M8XJq>gMHb%8(bZbb3ymnVxKoiXBsyyuBi5tvGs9 z+b54Bdc(y%WoQpHDA^oAkfe-un-{6@ycxi|3m#6prkL&}XO z&A&F(0}!|-`OUYa$>lM9DeCEj9x$o7=9F(PIux^{I#INPwp*DX4lG+ zl{5w@xeK>35~Ln~&?Ib6c@L*v9^amJamQlluGy6bb&hM@;8Q8PDJq=i$K$k~Wp? z`Mr8bG|F;>Z3$yG2I0-W8&lkTY4L&R($^$`quZQUja=V`xBtv=CTDcxj&>L9p4M-7 zh%3et@e8oqo!#zw+i`_7a&~ZkdLiYfZcBH%f{FDZF*2XRJbtaqABLL*{OMKA-6sla z8~8F;KhpEbYK@4)SmZ}f+!q1m#!-({6^iW3{s13&Z_VY3$9>7g$R>v>hTa9HlAXaO z*x}@NEz)35>g$9N@1kJphFBaqomEUK$rn%pj~3qiyf^80o9`NI?tO(da%l^O*fJ`E z8(q#?j+VmB`v}>#@fp%%%`YQ^5}My_zP)%o3XP?TR6}CI#6B|Yxc+(PPOR}BBI7O9 zn(ic4+b!}(^cBLcC|h4^gQEQ*>jst8d#KXzI7E23UPUI<2h?{6oG5b2QNOQmheIDq zl7x=Z+s})##jWdcId&50V7HRDOJAM@(v2p?S%Tha4A=MdlS=hK64{U6xX5<#tl}~i+wrnw)A8rR_j|`D4o+sQOMVGYH zP2yt-+nzTF{ms++ou1XgOXi?5OTABgPi*}=-Q{@Y{)jArZ$A6bxkxos-h1Wuy{w0Q z9{Q{)KB`cQdV1u9ov**1)0A`p|EWjeer;-!fzgVSHvi;{*cBzFs7JRX1uqb$Zt8ju zSX?FPb(Z$X!K>F+2VNh&q_$18e`EOCb&Oe5(AgOjY~TLq>$bd&f_q1}5b0tr{-g0D zQPjzxBK;a74sdExY`sQj2<*ZnwJ^*pl29aZwd>pCU0XHT#Pt3$?`V zd|NbV>gWZFc6%`F*nyW%D7skWoqDj`YE|n@)YJV~+D<;k_qZ00*sl1=nMvB4CutCr z>^fK`R^w&LM7b@a>uekQ*D^yZWdk+PF8JepXOwogwXUm76^$O>VCtTr+fs?E&=^pU zN65rI{hZpn_)84o;i5dCAV#6(yLs`#&P#OFp#z%oHT^LA_W6CMmFZvVzHy>Mg;(H8 zhAfi!pZ`UYzB&WMQ4o9+Y|!FwdZLSOa0Y_!G%gDXY5_@qa)e0Ufcr4PFyA-q#QdXi z{=~q$+9W5osS$}yQ4iP{%3E9?6PfYURGe5}&~hyM#$I~{A20AB&dFR&T-%JJlBilr z9m`oO8ZG4?1FW#APG2)A%FpDw>gkc-V~BC-fq9=EXe`+=mM0$?!DJOjsrUk~qKQa8C)f zN_sP85bB7rjjbh&rlY!j`k*U&Jv3T}sV_d&<@N7Y#q9wU)&v@K_*hkvlGF}?F|&!q z$D#QMUZYX_;(m@p)QmwR)EvX6zR>zrYA((w<;48=`PDOyBQ3W47sli@xhCIEjJICC zwjQ(|b*EI%;Un4a?2S9z?+Vjqvxv+rkb;i7Q@x;P!N#Y#B~X=Av90)k{$YNKgCaK$ z=-8?*0i_2^2TSGu`7)zD3vvd@z1*gwf>xwjp0Y%?70pstv@|%q+??eS4wztUIQuEQ zG}#ce?iq9%y?8rNe9H62Zbe?k2G{Wim7@`v`vfjAJW* zFMj7&xS_u(xm;@Mh)`RW;Rn_i43(nuWiD3mZ=CdEy^xs^-Rd4p@fnI0i5(&ael)U^ zqPpv>wX}b*NNV*FCP2~BT2r~%ox6@``{c(|ot7%5zE4zkt~(XRE|I&$%{QgzhY?Zv zcRA?7RfS60f%Ze&|HPH6}zf^%5Bj7pwPVQ?q zU(zd@nP)z^6T@Bn1fZFbPfO1eIVcWa?n)6M=y=Cn< z`Ir{>`_F6LEBwc@+TU+9&z)66Y3^Gm-F`Xggi8hBq82MPd=3u;6>l+C-pOCfE3l_g zq`t6H_y{t{iL)ArhZck`>2_M^yT+)MTes+1C^EicXNs9+rf4-)uv z&OBxMGxM|Hr7wy$0qtpV%#pTrA)GW;Hz;L-|ReDpZ-d zP(>onh1;KqmHY3^bC()Dx{b7qhC61&ehLLS;CW&YdI%CLQpDCa83|PnZ{rII@Z@-^ zo3VMA{MLo;kVCs0UC-c6iE30#MY(u-kj=369JZ?&9nX&C(Fs_n7Uzc&lnd z?CSerjlG82urqS?gJ}1hVrTc5!8X9pv~-3DwRsxHdJwKJcKV!v5!3s9uB7x-uY!h=&!(^SlgFWF`%(49hyN+4ez2sBCvnYA_Z{M8C z0A8f!3|DSpV)fTz_;ulmM7e^%K?U7Xvztq4IHue!oLX=xr3toNi-lS~3&($h&oKyt zMh-3#kM(f9-Xo@Gu0N%?n0ruP#%oZZ5G-j{Mw|^CKP+^7wcGNfvb-mJ+pJBQLC3hK6-{V z-#gWMF&e78uXIPun4j=ZD@zvJ`VMaTq`1YY!DYvukP>P3PV+uEb2p=rUZ1%fE1CY( zB@U_nF_K#B$>Ul{IL|?@nA~WW8oQ9iSiA(?+CrILTG&g4D+JpBN=DCXF7mWLzjmCT zNHy_320|V`>a0;}o6^Ad{XWQy${>?u;a+-NX72;tT$*W1bkeKhg#)hIrMK|3aHnW- z&9v70qG&EZsrz9#&eM_@Km%Wz5G5P~9Sdb@kqH3c(nhHR%&>kjao;T_Ik$U=JzOAK zI1PZ3+wE8R28|CM_+YC&CC~ZR12nHn({?^dcv;Je_dMNdy^+0J3og4bxfRh#;!TV~ zf+uRWeOJ5vGWTZyTo$uc;_;?udvmM#4NVPhYO2#owsWH>Z~4U4&S1MieJ^fyMBm)U zXo1>+16#35YR%6_n!u(ZD=wG{XL|ED!L(2BvmE4>PGb6&du3_lfOPJ0j2Gb3A8ezZ zrPdNhGovPZZ{-YS`~$r1#4ZO`)QyZbY<=o;8cJ)yf4ZBq`FumOMXd6(1#RQuIWI{j zEE&yIgv%Qjdi6NZvk_?da0<$^BnjwHpZJ~Do z<@Gf5QozTYe&G{ec;k!YMV#fa?um`%+kZR5#XdipCdfLe_(Z z4M%6j4JOF>B$y&x=O%Xb?5NnswRWY|g=# z`@4_mM(_}kp1Bt@yMo?^j!kxZu0i*KUWbFCD56fHy=>sx!^*v%=jditvSXWfe#;=- zpi(+0;`Dzbu;>B%lqYzwjG5!vOWNlu>R&>M$KHB%J^a1)mgH zZlgLtI079;o;B)54o-ZKK<8!SY)*iiX(xRrf^KX^0!7I0a%kam`j$D5)r|rAncvh& zfW9lj@8(W^%1oVDSSbh|;fX7{a6*vY7p?q4a{HNCcEoTGA_-u_)Hmm!St?8doczkugAeB~hCQB)|EKL?0;r(#u6Q?~Dz8m`B-_+iHk8@^ zrc=+z1~G;`!;Fdycrz@M3(FR^%nhg)uNdq2My~Sixdrdl&GGC4P5naCPcNsp7MWIi zXK~>-hkUm6z0iyRAt#aAiS2dVI5bA44a{e;8sh&Ol7H${b1@yO;&HDHsUFM8Tv2^5 zRaKGyOCIgMKBRfkCA#FDz>o$2MR5zu6M#Qq5ouu+v^~|B8755dL>=M z(sKKC@D2g=%dwqkn{{5ei^Y#T&Fx;%Giwmlp+;!j> z38*YXh8!YVmMUiW4KMy-+_V%|QNpK=)jJQT@IeQc0UuP8yWCPt6{EMYbHy{q5Ed!u z$`@@OvQp&4CyGwL_;4zW-!bD`H(%L@fvKJtjDIO( z19hOPIc0XAyYr}M#;W)U!`V8A9dr6TD8)LIiH%T4z*v3!H6wijY-tC`-~dQ@?)TTQ z5o0((WtR#jZ@xnKVRJdqpKMDa3wB->O_WSgA#Z=_wi5q;6E)ppN(6b6sxwx5{Yu?f zWJ;YVZv)T3%Pw*|4?fG86%_lYeWfL|o4saaCItJq%HQt!v)pv%2H~$DLw!1PM-L#N=O#Id@ zckEi{iBPgCZ;YVLm2zw+F-na0ly8W#J#EQo_{$v%Q7p$_+zl5 zGYog%qAke}xQaz8pPN^=&?azcslzX0aWH2~IxE!tj*KP3F5Ch zLB%A?7MEFATF=%LUBH&bhXsAh@b8u$+Io3L-~HlqL(A*nkux+0FoLAE#5uoT+a<4r za{zp#_T!-4Q~cGemwIJFxzV#cBD@!zo8x z=>>B?STexlFK2f8&xU(L0wlwM%wKK1j%zK-44ZJaKK8hpyr2g!e0#zQwryN90E9wh z4VoxPW&Oaf-kTZ=t6_2CNePbn3onr;V%cV}?+Y?H?nnX!&^W((u?^7heg=9=_$y?@ zJL7}WbJ-8ljit}ybLY~8f5v_#C=}R$^M(EmXE=!PioWbRijor*XVU*5e5IKd*0)So zMk%J5C-(g4oHn)3>L%*=;&e%8i7VO*z^Jfp&pj`KPliHblErbpHW7AC8)DV0#-3Ms zF-2UNFDXYYj;`{mZ|=xA2bJ$q@k(aMuRkNisvcTWA2I*1MnHVgx`D*#`c_{xD$d28 zv!Zjb$@4%(uc!OIj7_|`r1$0%$beBj&kc8B>)X~qpnx9ZmIW%x?6DUgcPzVtIN(5X z=^EkFhXrNyld5Xt&NT6s0{`9c5miSLHYU)#Ha=Lt?`cE@{4yaJd;!f9V?O!fp~kK% zjzev|4PeBF%p&FFj0(m&wgl4_TYGL3+PqO!Bcxip!!vKAkt(%8cNPQc%<-c&=budP zGbL5#YvOaHK9lHIW}d_<{9(Dw3e#qqE2FvAtXS;L<7 zp?Z|OZowryHuvZ5b1nM?3;4J}@ytJKG*Uwf!64R1zraq3#rLtp#9G$4!;}kJDuavKcaS7f1|9;6AYG|)+riu42x z;W~a~GE9GK)5IFG^>=%O6+Q8`o9|=W#ZYEP?Khfp%Jg`R<)ei;6LFtGw~qbm9k%TW zQc^@RtjO)3Svc=)oX-PD|9F1W@xl`aJ@mmXOkAc6br~HcgVT*xRK3<7m3-tOFNp*D zTy>UbgnZe>JT>8IfY46)LkG6`9rU1bO0C$tbnK?3Razj&?4`o0Icwz z7zupF(UuQtK|ae)yBN8a8!2OvF8*cQtjkqD4MuKp!LJN#p!brGLnbzde_J9JKs%~B z5PVd$i*-~W0KSh??0LAKK92Oh;XgO0eUn%vMZI%;^M1*kxHRcVJG>c5iLkU(ut`#k zfu|Lmj$b)E7Jq@KHIeFrvSxVCAox&6q0I5qs`8==kDc;dK@|9+H~KH*(qMMXihIRW z>&xoaVI|+Fn(XGcw?$j-C&r`jR@}m}&gyYRN;!a&Sk*{C^O0c7F58n_*b)~F>l(W$@+xG!8a3tFur)`)#Gq%M!n6q# z8vE>peQ(HFP4}5$lDI0e^!TK*OBQW1zO)F}UjE*@*!5gg?!|c!%stsiV{D~Um#G%? z{C!}w(nbS(X!ta98w(w()|OJ!Eu`?ke4DSNyK)K{!o3{GRgbOh#wal7JuUXUq!gp_ zDmRT3@1c1n@cHC={xmN$Q9`DJ|whItubCjy_hL%i<-* zKe822Tj36`&(n~C^Xo((r2%X2H$7>(K~^5tG#jCvuT{<=N`hln02+uZINy}_g6t+T z2jH#2@3~5UKZCq<;c;AaPQt7?T-e?9*e2hZT)}^FIZc`eq$imL5cZJo(GUk(!dPm% zbiv{z*mfSTjg9vRD5GZl7=N9t%cMXJNjNf@hIQnMF-Reu)~G6H7R;RLrr7zjE`VMV_i?5b4VrTlX_N^)FfJ9Wxg77}i67@(L8n;EJd3O&i(lETbbCXo zQSR^F)YFX@o&36fm2r|JFwu3x)DYOGCzZE<+jD;@{OAZ_2~b*u=t-$x$LX zMWQk*b=PTE(*0S%3ZY_V_(C8nw@U3x`w@`9B5jflven{&U$~15frYjeKF@omZZqM9 zv;A}ilC!yOOnBT8-gjjG-TF*~nX8OoanIwWrR^&CCRa#r!3Ql2tt1)51TzEJ#CH_D zdF9fqj0T3|crDh3ND{)>@IAf>!rRR4xNKi{2TL|E=`DwARw zdkD%2z$~RYG;mqMsURhsWLRfR<3v9@jEOHa4l1rs+o?DvGd54WKLhciORsVLu#?_7 zeR4qj64#LHRZkj+kY3yQ%zyZ)|Gn_fCnh6-kSWTD`f zUkzpgMNhhr991-bV#1qghP9BxE~x-P8HI1a>VETNPAGeaF8}U3eyY765tH;{bVmWk-etGZztu#`v58%%ie46L>2+ zfQ8yp^ei_x*WLb`hAVp_XXSF)AYa#1UAUc@J38z~&P`AvOYm3g23bUOEQ8vWo{1 zt*o+~*jN;iQL;O1?VEjOhXkGZYdiB;wqGH@{Y!U+9@nA_ ztv4@Gg&M5{BX%>X56+e{#>rLP$_sDr?v8AQ1d!^#az3v8XKdrMIO6xT<&r?v2HGBO z&`k61IxjGQo&L8_!LqW=cVk=3&zCl~YVfo%In6^sH=623SsNdtP?_&ZK3xSY0PcPL zDDXyVDf-0v>!gCjihXdRg`r1~JU9##3!br1_+A#$3UevIV4 z8^j3rV&tNR=7N|w#>mIvKA|7N#KYc~YEgar!~IM-BC^LKwObZlGq08KmYfG09R1!5 zfCfxFV4MbVTMxtpj~ja*RnCA#P9j3~9a7&<`Q8>C#3W{GePr`U22rnPlx~ARb%-1{ zVN!;RJa1Bz(t8AUZffbiJS7gn=t9ro!6h5*B2qGcZATR$flZSkoBlDr((v{JFT&~A zUHQndZG>)60DGx3^jNI|zomhq&61I1h5T^|rOxC94g-f%GxUbN*Y$P;k&<9L4+48rs{jJvgr{75Gf`Y;(Axnzg+!)uzQ+t5_}gsugj}8NHZ#U& z;1fJ~8(6q_1FMfU7#sLdY6u%AwNSA!e_*NvW)YJFyA6}f5JJd>A%ChokFm2kj)3B? zG8zm;gEY<6ufY9x97sof>IT~gCRpvm+qPSW)nYvUhG~Z)C$g;!#et5M?YBWI>iFE8 zR=a+1akV+#EPb9^LkUtR65SD^fIQRqO3#z8p@xdy2$RU#6=8-;^cSf3>p{`sS<2eT zvhghD*cZ^vl59f>06%~t#WG+mPJt);i8s}W4UZ7aN@&^hoX=T8xE8AQiqo2&jw8;k zx=cJ)UH{VFR@6)z>qJhMHq04NGi*=N;IF3e39OvXNM8sll;OfjkfT-3nIjAMbBe$i zT~yiSJ4K<)9g5p2@@QUf>B_MbDurp@gJ(F*nZkPJB}HIwXIOvkHvm z%y(J62yX!^R8ffif&IQ=v29b`^!fXmG5+qiQPf<2q&X54XQK-3H5#=8Oz28a^uQHB z@!NiUA>+AC`=hS{7gCr+y`Rs=bOWj{$u3)*gRBG0d4gR#^ZJB}Es4?bLMJux!<%}< zzZTgU1;4hB9b?}nz(Mf{mFFmOYIwduck;b5AKZ<#9cE zd!O1(fY#9jX?WV1YSjOj+yQzQg{K{V!$%j1qBZ$eSi)IX;)mw&?9;r<%U3Pm6{*5d@A0@-YaTNdV<`1Q74cUve=amA zsS>RXXASM^J)}DbfBEZ8Es*64Gyrvj+lT|sQ;kN{A;&B5w1HD}K7@Rs_xX>HX~mm` zOmw#V6gbxpRQ-GS;q};+0^ryptAami(U>~>sB2}q3sw3jNb9n)x;+D@@ zmqzrPCLK63zsgO#RU(4MQ$ae86NBeEeuvKXIL_Ux;a5GnVZ^ztSa_+pRsgmq?$5r&;6m!If*lod#a_->9$x=L z$3_s&eItCimCm&TxOM;kKj6Zp%WZ)=uQRI8M+*&DIVzt0v6fG8~@9;u(kv$M$;otKoIgiH6F`Hztx& z5GQrv5Fluc06+WjSUZ-%M77Q$Ya`ei50vC~*Hx$IZo}nvj^Rp5{N+v1Q-F1Q9f*wU zTT@8T)%r6ia#^d*ep^Dk5SD}F0sP$YkbfAA9T(7B55BYkyX#&;hHOyrf7NWBG7j;k z-52Y)q7cT6hgfet;pdweDa2Pg_JtXFmjLfo7nd!4Rk@T z$3jtfH9e_EQ$wjU0MHuV_O`)@WDtCp1^fmYr_U@c{dRpzCLZq7Owt=!S`FR-@ko(R z#LZ$|TY;noSkQ2Zukq=fxqQ3bLK5*KXwedLb0T$O*_N8BTTt}ferb!4bJsDEn`Ok2 zAl%iOp7cdrjyUg_k20d$xaz$;&K~jn$7O?fXCW@oCvX+X`~IMse7kVW@bidoFZm4y z6vK=JekmFD(k_8rnq(ZT@wuVxg1O=folN30>S^tcJ?_~H4&*08hKM|-Ep9ZK*qz?3 zz4T>@q!RQ+T98`*9HR7tF)|c+5VBNR(0}CynP^CBdqs zA%22_bcQ?L9DTgJxtm@7;(4&<<)j2ST_5%dpJWmSpwN$4#6JMPXDpWC^wN?<8VV*J z4Ra`Y-Gu3k+?EEe6%B<&8S=un#C#tzenvWGt!|=B{Q=xh?HAg}&aPH73BjGm`g_^S zUtPRQdgme6kOQv=q!}eSYTk zr#itnJ+I72a(QQ2p82<;%rEQmTqfh=*D* zuxW(6CSrw$WWTZcG1BpS&=Jnp*NjQLYju#A*E0iuh+s193G5spx@``Aw%b;(TL7hx zL?I>4Lvibw2tshp)ogfsudDa1o_elSl)VHYUOzBvnv$RQYNPO6<56Cm;`H`WZJ20e zNeM8ZD2WP5oo=3&2UMk|)ql?1+13d}OZolXAov*Elk|uQ4Qm>?n`f3mJE{y@2%;)( zCAq%HOnJ!3SI>MXag=oC-ZY=ob2+{x2V>t~avu{Yc1FGVj)TZPx&HOza>TF9@P1pc zQp1na$D{2?y`PXf<>3O&N9vvPS$OI=o;LKkUzBLDTiL+vhL`ff;yQhOEA9KP+iLHC1B?73ad))r^+F; z>@hq!;3ZRSl*1PRbtxzL$-{_$!5!Q1P+YprkvBdXwIM-%t!u&jPiY>y@pRv~eJnxZ3 zW`noxc!)WcE<+QEb6 zx=O3&2X1TB?6z zR)_(Qj+H-R9J*VtpU6>&iL`|&y zY5HKW>lWNWOBvxpGdRYwVCE1uzAhU%Q1DdDof?8(k!IZ{xJO6_?^u|E3|bUAjc(m?5$F*)>sy zB(pG^u~d^1&+km;MNs`#f+I?psVVJ_(p-N4?PMN73By=0eL=`70^x`-1PTiqpB&Z}-$xgYHzn9O20<{ud=2rmGq$>Wr*!@4-|bO-^0z(cxhLjswl zNDEiBeqfZY9F^2!hrp%x(TTT3aT>>oA5LEE09Rs<4Ls4J2gix_;$18^iPuCigD&3G z;hjf{A_E~xZybtfg;r-IvFVrYe9s*D(>A~Av8xl$VC*nX9i~YKTQeZZbx-d@0vY?O z^2WW8i7OL+Q)4;p6=NE5+!xxM#m%8035DpLe@vlt5C(hP3jEffIhsluE@YQ^4=pCo z3|_+hCH27ogp1oF-^ac;g7 zGO@>}F<>qP*WQg;!PBgyW@^iQyQNyyO7c8{*%ri9R;$i~*+03UenQTY#-jaAyu;VX z<(ghEI3M$rC0$a9fWN}lIu%F&*Jse9pLX)kP~U1zW0#|8sq%=-0hsXsY82Inz0N`GrIy;SVBwwVsVk^R_)ihKG-|{34yhP;) z<@M+@Hw6SUG%6x2O|J+yxwh?}>tlx+N{uRCZ~_dJ^pR>Z@npU1PmqL6T7f*XBbH5ezLo>8@73zX_PvRNOW#%!guP5Pp5mN?TiT1G zw!d^MrY$4^xr*BY|ki(vcqlr{ngLor+J$kYj*4A7ga_ZBzcib_qmArr?POZb=Ol{p*8(2u+hYsNm? zX!&>6yM#{!*yVs|F#!G=EIjv^ zDeqG0EWVErI=0=;d_)?@3#|q?E0k09`^b^qgmMpa!wdNFq80QoiE~#I8`!qS4y9~- z{0{3^RJR;X@{$z@IJ|I;*#YWLsmDK7WB=RR$Ja6#Yv{LDcE&pPhPI_&EeaQ}CsyA@ zF$xny7oIYZ)X^^k|j?gD9Ez^q_G3ZOCuYJEBU~t>*e)zyE+_-Fn%YB-NG-uzK$iMG=A{nfJ5+jZz zttuXlm&~_e<0tb2Sj<|lX1cnCQ<}cM4l;&eCK&Os4wQ3FU5gx!V29wP8NQ=~QvjHI zu*CJ_ITT~IaFigCV9=j7vgGzlsJ>%xju*i%yp6zT=;lBn75dEQp21B>Zg?q>JRvNhxfRE`YyqCMqt6vb7&9~ zWOgkNx*qs=a=!MoFVwR?^;G(nsfT_AlX)`G>z{huRIvmmpj^9OSJAJ**fc8F%x@r=@qVl$Z zGSIP+d~+*1QC^a^m(C3MnOcCdnr4xb^9Z|#y-5!bK1R>dECpV_oUBt|g8MK`=G@T& zED{%13M^Zq@l3M=_=sBL$2ALlanrGsi8wy>hZK`o{A(Esw1bN>f-Lr|qWu!H!%l2) z^D#F0@w>5VgRVen(OJ!XQZy>k7*?9n(B0jsQQ zgP2cJjKdv1yQ?%$bwmcPPr~0|KH2iL4yW7FEhke$-0$ID&(XVvU?!SMEO|PA?ZXpq zymu7Sgp1|juTa5F{f6daVE=iQmkEEu{?rwCMjo$Pt#zxh@27Zr-KNi*)Zm!&jZG_V z1I)2-pAzA}$7WrU+X5LH5j)-@RcaZJ@H;<)LiZ|U^h83((L>8^w^vZ*d-XE08~S-X8^Yj1VAJ?_MyD|7E?ld zvg!{0!0tg6*8hvCxBiQ|kKTkSW$2+na_CM4X&AbsL53C)q(i!4knWa}?ohgs?x8!R zrD4b+m-~74ySx9u{PKCtd0*$M^FskefBdM>1e$0A)Q^21@b8Zh0PCUW>GID?QRj9l zSo=$7A=lYYJNu1`AWsDoVAewHd0bhN@=q6* zEJsH^SBpX#9@fOAZDB}uvo?ZauT2gtM=h00d>t0Ock%lVq0#q;*PTB|R|;)XRXlRl zwnsf>ZL(ea_o#CS0b*eglYJK`7S64Tv=y>Qud$3Kbk-u-*s8e<7(hYvglCpK;SL>m zf#ewkihD!BVZ85hWxJVkUddlILw^ohBD=hvWTyVV@%faO_`EXTosZN8;ATTv8{W`9 zo+{pkab_ij&AL;j@%Vn%wKt9(^NPT^dI3+qJ#;WgPx(wzGAfsY0nNfQtKl|pSjZ3* zA+HP&l)DdmA5CZ6r1=(#H`2hpKz`~iAHW*}Rz7_%HcasUk!VUXttIS;z}Nd@$q0Jl zJZnsbW9N{okG1`S)_y>`I-nQeQ@z`~J(h*6^L0pd`z}ePUJQvRgCpakBI6*FqChi0 zZTvZPU|NyioccoJ$ybv%^tl$<2@TQ!`V(4eN#S+-X0u*bAVEt_%SdVo4Z+L3(-8P@ zz|r3Dy);Fqz_N5qis*4_kK8}`cH0E0HKzCt=;#l=P)g3om6uMtWGsu29f9UG$02Bt zO-pF+m*>LD&o`&G%9i$0ry;Pg@xBNC*<3H$s%EJq>8UNniEj;Ss#zvmOfO6pk#?Xx z;tpG?7l|Ff5G0 z*dDatq=AqHTDd)6n7znc98t`c7yA7+9v<6@JZWV$pOTeR(z3KJ(;ft_;P%3~2(4D{ zR0!cB632naZLGR+iU!?ZMGh0zSVGZw;Px8}{*Rer9t%*cGf*)`P(-}Y(846musdU& z?Hz2r+0Os5+~#Y?7dPFnRe?Az0{+^+mf@iitTOXbgb9FO7N92WoG~onVu7PyC+RdR zHOb8)`Ysm3Z9p^F+*ci<8=WAO2*7@R$QW0jL2=(%pW?@cqKd_wi+8!mQ|EsCug=M2qOnuR%|FgUT{NQ@#0pgk3Zgp8DKsXoDRq zl-y}j3{q`U!%~`H;qQMw$Ks$dekO#17>M=gB_wQ;MJaa4S>>i)@T8hdLw#6Np|}ED z*u+grrzJ`#vxkdJo}ZZ~_n9q7TYW7IrFD4mPPcxgN$+$6gh=H591|a}jG+=K6C#FH@i+; zf-7Ghr*wN|ViG*w$?DKn5-?_LBA&aC)y5X>@}D$U1v0LqjR4#AlHMJflLvM4!nn9M z${FQ9sw*gV(H97pj%Ph>jxl{jP0mz48G;T@#hNU=3HQWef`$5*EFVp2ys-|&?ASfd z-cVCPZ_5&pWEX_1o-B!5C|QZ8pY6IimK5IeCMfkjRAw;A66uK zdqw-XkCxWSb{IP=Z=@584@|us{ z>b&q59B1kr-;E4%M>3|a1fQrg>a^+IYQ5*B_Qp-Io&HH#a?JI)tO=Cy8CBo42Df{| zj~ubMP$2x>q~&#&GUU)y@mBf>O$}{Hg(_=;%HW@d?jJ7w!X564%8=b0ojp7Sga6C0)?`mg`% zBYheF(*XBiu!0TEpQOif)%8R~_P8Of|4d}9N!_`WLV@_H&~;@vt07sy&6~pqJKq>k zTFk_I&;sqJ)A#kOZCv>i3GY}pr~mgx8h{7SC^I0B`~y_+`J3dpku_6voLNF+<<;z34GwpRk0a z`#)qHkg(b=YznS_3(1#*Bp?V48V#EB*M}*j(A*j(zz($K7%tjuQl2qOJetN(pKKX& zu-Rg)h2H}1BG2U5#Bd;MM2~QFVFTxZBTPlYDESQ{b z2cJkGC`R?DQO@FDro2;4vAP7Bzr9}|kGSndC$8^ReYlPX=}s)i^=44NX@w|q5og-z zOWuNCkF8q1yLR?OH`6YhOpl{iH!{SZEk|e?;1qUllgg#_qK-Ik^3Vj2tUbc55UJ5(U_3ODpa`a@F$_OPmjE1NHu{nNA4q}BGeNG0qqTrP zhfqvfV{hLmc@(Xo9TeR+FGyO-iuvSKg_OHeaCXWOD&GhR&<&Kc~Q3}SH zk<7+{75=pr(5L#ZYaf6e)7Ibm5H^?@%(>@X{WvMHA&xfD^~KTCX12k0HyBjkC*X%B zifsn=xsz`W4_30(#Cb)}Zq3#>{corU6^4scwOo2QPO-{=PG>6jj|8NuE_@fnaoHs^ z{!HNeU*m+3<=IMUygU*0Mly1Lzy4T4dD;%V+n@~YtT|8xNeVm+cDKKvu6Us{l-m3PH!w$RIt{FZ0eFL;^gR7}ah z#6ay@okZ0;9;z+#Hn8hS8;n&zDdXgnaJY@SGkrT2*_q4N+Ok5>9QknQ+x{EzIpZd1eV0!hsf@H0MBMYo=9-(Fc`bj#`=M{wWr)cd&ijn{yoB*^bPl~m4y~0W{r6kBnrtN6n%qnjr_cvy~T!0t_c4*2-KpD1eOrRu3McgyUF;Zw&v&lQ#M@Z=19*=T* zn}oX9xgkC@FU!#{UftF>xF}Po?5o@;hNbx8u;y1oof@D(T7z`|%X->kT7I=5-tO7Q zYIglc7-J3ZQ!_m3wzb>%(Q`Rx%pfnf`A^$h1?LXqY!2=rKQ!`e31I)Iy9^<`%BSt# zyfu$zlSJy!qLLR}C)GsXF(Ys{Ax&eQ1^VcGC1yS>ipIPP7sW_~PI${38vu$)w{gB{ z7w~RkZ{(g`nzp4$UuL+_>`3{VRcbvuj%n>iEDk{?ZQ5dRw((fC-d8GluGQqwI)h@) z5W{btrW+*45xCS+RvU)68T4L*vaEnQrcVphTYCxKC*J7sv3fA9rTH}tP3L~}NnA?t z<>%64yauK#eNrgNUqCky1)Yw?_ydEk{O5e5{>_TEzrbp@@F>$5Gr z_G_r<)U~*O!r%2S!F#&AhtHW}X`P8T%mPj2JlNY#RIDbARRAL3B3}1DeFr*OPZv|6 z0=0KK+zk`i5X;X;7q9-ni+|8)B{hEcgssKg%J`@4^?5LS<%G9`GGn8UFg*CHQhB6q zIp8}JmUn?aeIL3pZko~h)hy`m*#&Luo-9=PdZe4S;{Wxrx8>Lm-Iwc-ON#{He{UUphG`$~mKGP$o#1=7AIrECuuklLb^no43n(M#J;(#!sa8 z%dapDpk&J)R{6e9tBDGI#-mH!HhA2n#Qb`Ye>~7k*o|+Mqj?6$+vS|8M>3Uwz4!QcL~((NqH5d%A(t=4~Mc z6v54emt_Hnd)=c8{WLg5FK0{nQEvXSj*UPpd3%qJm!jEZu-qhn4e z9eRaaA+HTZym=|;MEim_E6E~#+qm9t>B-JCUgq`^fIo=1B~JNC36I@g1@8Qpop$^X zQphNG4fEZXWgPVD6XiOyq74+z?E`?^qzW9W+LRT@s!Ynh1R9lAJB8GQ8q$~u^`N1J zxr*`v#QGp3l%gufrET`7TKx{K>3Qksne++W05eb0G(q$=8q8`L4?ch=q55Z3*U9J5 zkQ58KJNk#Z)yd;3I8vBZ=)>SqJ}W=J-OU!VojH1LwX>@TcOJEKM(@twFlH=cFhi>) z_ArC34hLjgcjVr5v_Nqf86F)DVK8J)Rm`xhm$`!1bzPW-gJT2tEW(Ke84n0`hxJQRD*BDX0Tx4*vXcEE@)vzKj*S|OkBW)>dS8u3QAN#I`>TorilM z52O^af7f|wQd)HDq&g)AXCi7CQeGM_V#9cUwG6AK4h(i*!FLlBem!FuF+TRC?ji?~ z;+iw|@Za);a(w%k3LZf=n_+>adD`#XpXv{C_mfCroD%gX8NMBLE%Hut4uP98&QJ*M zPW*|+JOy$8c7bC`d6XKqEr1JL+3OUxX;bmR3!>pPBMieElh9&PkE6#U7k9-p!EEuZ zVTLnM1BcI}>j*g-iO1yM#qWd8CujI0%H~@?ei(@Ed84m8oPEInkF=h4*u)~L2%#xP z1k-L-F|>7*36Ae)z*y7dQ-tswc1KX@AQj5FziO^&So(9a3=w$vtx_E=CEoTd8wSa? zEXW@Tjk5W@IN>7N#_O+YR@Ir?2mYN7lLQ8k0r5QfH+mAk)#F6FNdWgb`RLtq3MOhoMpVCd=Pe9^cMlBI#t5A0EPo}Yfso;>{kP7iPFxqQYbMq{P8*;-c%v1aff z4jM96I_{*>8fceVhTB)xv4=@vaq`Fg&hXepf45-r%spv?z~{1%b~+XOAaXKoHc|Xf zZ-TI!LeDX`#V?@ctIzc}CT_%HFHT%f&9n)+c%$P%fK($4XJ zulg72)~+|+*hzd~!v$kyyNI9lHdb{XlDLHg8^~Ia!o`vp}O?x#MtAUT^qgDm4 zcn(nG%u4v9f7cu_DJrGbt>`YYOR%W9@@0CcAXbjesK}gPgCJWQ8Cc9*g-oRL+T~K8 z&%nE=qqf9P)cd3R*s#!dg>lP4eNUIQnKeQh3a*l@m&zX|0+NhVd!{2i29sdrLxllrlD$57j`pp5s;P_|&XI#HwAI5orJ0|$W z!(doAfIqx7)*a2A>*GZCVkzSeV`4BVrbK6WZ81g7S=VP1#-8ae4aKpWd@a!TqVw?J zMp?YbJKxtO$3Jz*;>Lo+aTAN)_q&8l3H1ymph9Do-t`~&bQg5e;}8>&9csea^rtk0kx z+Eq`bHK7se>^WoTN;eZ_5gewX2<;&upueV77T+n!<)KT`MLy?0bcfga>Z=fWD z-UJD-IQao=Pu;&pQgqkLf{>);otELog|!G4Gp2v?=cc%Kf09e+zr8C6NlZvZ#n^&{ z*?#9iXRhgSLpC)l{K8T<_l8l2M#M8hlvSX`oNit%{LH>XVXy=}#S-dh1l@q#wp0ai zP=C$F&s6h>hDv)5+75c*_5Zu&|`NWK+7eG4@}!^=}fU@`vgg6wq{%jG_PTlqPEtu zgY+x3Z_b4*Ll3+atTdX~MT-EzAp@A*uk2|5(pqbe)S__+JkI;K_IP#5=Jd?y2`g^K zh?|t=EK@~Gpf@y*^t@KXm0^G?$fUe%LuSpoK&G@CfYh#2zHm)=_zQt1I&^Kd$R0}q z9r+Sc&=VO2y^Nwl!8iMxhTh0X#K(Bzu9UjT{-_HB=&=3FW$j^${Uhc3FJ7N9a&h zn+hFKH=`Ld8|{XW;JyEXSVh$eNwQWTs5QiDkrV$V-%`ChBo(4{4}y|}nFH9WbiaQK zGGr*g(>u%2xGl<}BLsOWevVb(QPtZ{ls9h%m;%@UPE}-UUOwgF93Ci=9PzkBspCCgTU~m{#Tof?KgKPy~Duf5BhXg7FTC9;yMwd(#JRs%l zms!K>fNl|S;@k(r|46?`m2t)kLKVD1=vus=tBizjS%*I2CqCb-Kdk|H05(V1-rUk7 zF_yj-4ac1{cI~-{cfmvs%HojAf3!i|@XBAKc5CSY~;-gWarNxH;>wu8KRSjLk zFQ2JVFc+ju6RapSZ_42Hj=>C~RTQ6DKKpc$IM;Kd?|{@xDWo>^V9&!)l%tIZH-#== z*ZUPIM#E6s(Rm!(DF#~C0w<(=$?Z*70(fxTU~4B>xhZHZzp`EaeZWi45n)4T!UWhJ zDvvK5)gpBE< zQ5PD9ajv{t7rbf~1HPsES;K{q+N*OH_ovR~V2br-;Pk1)vuRP3n>L!?Si?6{#5dg} zSEYtk6qd~$#p;dg`-2M^2M@X1JKp5`Z$LN=E8ZAQ!$b9v&Tnte!!vT?Ufu(X@e9h) zMJzd`pMSHRw{^SCv2)RWHE2>h9}yoZzt{rb;EuYA;#uK&-@^WE(Y}C-1oypJL^T`D zfU`kso#Hs2s0z?`G4*3?++hX#TJeV`SNAjOY;5Qt{AS_!M)|t76#kw58U3Ee=Axz7 zyshzLKD73eQM!DB?U;S&9)P8g;3vLc&_A$31sle;4qx)?dSRh!9DF`pGS#U|++(5JO}#ee+^TYfz_7%LBBPm|eIi}c202$;f=l^^1q#H@#9C;Re4R;V6> z?_b;iNL@IaM03(F&jqq6joK^AUBUY}Qt1d)Q*u^3)HcrdKuQQB_WGP((3xZj>)?M) ze!T>p)SnPVy+&;$fIcWsuIgBrB@_)FDT$P-nq>_YDX6*^Z>y%MZ}~x;@WVwI_wja0 zd;o{yY9P#b*xA(k+H5C*6g0oQlir2MjI;`LN2eB!TWMtMKHH*(;kBU`ai=*n>jU2o zk_MegEpR%|yfkj|`+=x|*#qe|!Ek1rg`| zJR9#aM^>YSlI(sD=U|h^fxhV}f|s$`IQYnhEQY8{OAL+#`9eX`0j7pljMHB50AWP@ zn4$E7v|&0={!}uof{!;alO^-p;T;Yk`~#aI@u&4C|FqEZGm z>^3m)E5~*^sc|@UTN2 zo1!?YWB&s!w9lcfs>Tz*nUnD|cu+xtoV}J=U+*P`!eA}_{9--bUaTF9jB@!E~0+8l^JQsQJmV9T`R za?s(VFXYegM7!{pDzW%?n31V~vu*la5xd$udBDhkmwa5y0Oc4Y^1p3jt;M4*YWjVZ zQR`g$^V5BnDC8D5c<{IDp>C+ujVH`Sh@#Bu3|| z9k>Jw(riwB+_@Hj9_bHQ30$@Q3B^b#ag>0#j!_!}cRx>kt+rh~uoXGFD1PD{dNanp z!YvaTNfjCslw0nTsC|yZJVxkd3C5qpi)2Xv zIt(@1fO8_{GX~1L{Yz|c9SE#E@zL}Th^?xQz|nQy6`ryeGm zkuLhWqeDJRWopWUocylOU=bxm-(z!pvt^ViI;F*IioNkRN>k)~U1Y-)yTlUoP7;YR ze#yjt_uw-LsoBCBqe;R_l81cAS*S#I_Ei_w*NW#3c*0;-IZ~yPD4Y|xh!=)6X*yIc z)a1=I{@C*MFA?6HKgy3z`l4VWB+B&y0y&*Kns4V)W`|Opx{qrF`bYQP%XG(h0O?A)k-HU+l2{TCb2il6>`r zJT!)-MD>*c&%raRccYc<>tcx3IC2mKE_Ch`K@w8S-Os|g52#@)Dy+P*Z*M=Ohjik|EsNIB}itiGsfnr zqfY*VF7cX#?4zk^VBo9QB)8~wT>(}@KN~Yv@92I|P&DhsQ6ath>$qC6+n+1e8lS-a z>Q~UMWY7^#4La0HP=74-&16ufQ;*8nX5d>Vu{_ykN$kH8^-(P$yK#7Op74&6fp^rX zd4>8n|FRdqN!C$j9mia=iHp-hV&=T$os6C%KLqY0yZ(-=J5AI1SZSdjTOl!yA5kh& z?0wzJtvQbR7bP?D?_jwb3>5}z-uOPu53ZbCbD;I-Dz~{`w+R(CJ@Nb&eID}Gr0TvP zum#t>q&RRrV2cWQE$|Aq>Z%vmNAPo5Rr#3CcJ^zkF(e{HNP`m)IOiM%*pn4=v_-Nu z!Cm}2O$}MUn+KqxDP#RAnxV?1?b=ior5xgV7w zTBs0fE~f}tsdvsP4ZuJC_dF+2s!)81MJ`)dWA}K6vRfZo8O^+d|MO|v(}g`}WRyRJ zXc8sLecW3kv@N5-_QNNaW(5Ix99jI3ZLrn*;{JjkaSQOFz$WQ^Hx#z$K0KM zM$(DiHUTc{Fy2IFo#q){ZPEUIj|1d;7HJldjo%-#1-YSPnd1&{zxz=R*+lW}2hPK} zE9%D_6GoY(+MR*gDZDbmvZ{Iz#~Su+=fg%Qp_{FIJ?Z3c_xY0-R8O( zUA|8J>1|%&U7FvU z?`}pr$*2df>Ve#@#_ych@v3o{Aw?guAQ34U_AUFgd4)SxxwiPiqaL%u)0yrjzj6*8 zbVVD^&RYFfCrecV9H@oYyJ`jevkt{f^|MGt=2~3SNK$d4?VgY$LK^T-LduXZc*R?% z`lcI4Ol-Y3l8@fXCDibqQ7Izxv~E;TpvX9|6%J4hZP6uRO{i93S)*ELvLelr;?dS2l1qlC4p11^%x*jvUU;E z^T@@0?0`e0mCo0ctECeEz)TnY}}@c{pV$qXk9RPuXJH+DA_1s zcYcnWh7P#}-)3vBJ~|6}_lbXK03QSJ%D0n838ReBE&g7vsYut|y?%1-<$>_JlJLJ3 zRS*4DOxr|9yEjR|iJnqv|9o2@O8iWjxAR*~xvrq{3E4GS?CWL0uzauo?OM_{Ht*|h z*w8X+n4kN6tF%Csjk&T6R)YN(NoSe$*e!G1GB$+lwhS+6K&?c_t$ZPm=V>sNZ!(O& zaH@sFlRAZI9bdwi8R4f)Owl)4RHZH6(&tpSRr%JwoUr5U$dbabu5jd&7DQ% z(rI!+;wen~B0)p9`Zx9RRC6d1?`>w4ZH(Ex2WdjKC*Aj0EZ2n(-Gxk^-VhJ!w9^<} zZb2a&Eo=x8@`KDZ4NETn+2-O#OM~ei7YjNfDCS(AVe+82NFg~YUiN` zLt^5&Lkg4ARqEfmP3^p(F$s*xDcdSs7iT@6hk|hCWM6UKNrKk9&=N2m=v^tMT1H2c zLV2jMZwZk|EHg&C>up9PcWkGj?#A>(^o(?Zx}_#Q;z&vu6l?Fz)Eq>8Z?Oo4`6UTJ zlrLV=kXcdaquuZM3`ZFts09cmK6|&Ud0}aN zE;_${h}7B0*NmdM2y%62WAb5SGO)JCtmJWg2vPtGK!j^xBZciW#Z60RhhwUyaMw%P ziX4|>v(Pzf(5yTvALr$e0gb6+t1ZztucIm0ZsaRpo)`zU`X-y1j{ci<8yT#hS#*}Z zU>8MT2m4R{38O6M8KmmeFY}M1jFIdzNjy3Z42pv4{;FW8c3e(QGwrEL#Tnp4&$Fa5JqeO_%VtEI>e)I0jHwS|-AZU4; zC7lf3=iQHqT(?`88VQJ(mg%Gd{LH1~bjhzu=WQoaGyvZMItvHPnLd=oxodqHFZ<>H zuk*Qe=`}Ddeua;;uc6!N91A)g>EG6P$8bqD`lb(QcL}XSgUg42Gd$eXv-GFP z<6rJ(I9$*5sx1k7IfmyvvuBC5G%r4CTUz4T*~yQUXVy6Q9oH_NT`VE!#ZU_j@#;`c z7VVQ3hZ*(_iRiIpA( zP*)ElH9cWnELG8`GwyUb;w$DiU>Vgm&?~^~^T;*}{X*~?tAP_++~K=7#mcu; zTo2=!aZM-E zTizIGthSxaU09fG%&{=PxjlN2rmD2vs|n|Qs0d5haD~Uif~}B0Ui?KcbF)aqOYBq# zSy~>6ml!1D9pgp`bTaX9Hx#5*M`-G8;C9t&90qFL;>OGE zE=H_HB9HAH#wR^`YnXvrp^PhaLEJF+P+N|D%w7;V#@Z8&lo9$rEV7c zq-^$(sUt=+h55}ATh1)EHbVpKuL-?QneiUCqwn`-1K~;rXNWjGEw#d=78njlAtn4RPWePxGjy zhw`8c>E{02_7h4W6R4YX-C*L(701mZNTzOIx>;h=yWc>8&78}J9z|~b%Fe^+;H)Lo z(IBi-i$K6;1!Ksa^c|LLB#dMagn#)wl&q#R$Zlgy5Bcs-Gyn93@L0Z%wl_{Dp&b@x zXI-g~%oM$-LEQR_qqCHX62H% z+k@>Ife2C45e*PMDsM97?v^#nwj0X7zik> zr8?4XSi0d|Y?t_EeDgi$PUQXngSlV8w?7tkSjs~F>Pqw1Z(4JMxMpog6~|Tq$eL*X z*anoe>eaJumnVlWT5or#hr!u{`zWR<`F-^`O|Nz_b&Bj^VF9`y=pk|Z&VPJ#{(OCW z^nM?85nzS*Bp<Cbsr~J@j?TNidKc0AOat@>^su^%&%(K z%H+pAul#4fap^ihc%AEU%i3K)D-|n#9^=)6&}Z0ws>z(l{rAJj!S}45;>P|*;iA-s zl{hurD`fS-Z6qD$*PRyB>Kn|DwWxGQ9S{F{#9nQtj`@HrdPL(e0~B*Rw*k3RNM~4j z@9Z`n;y94$r|l1DKaXCko@#R?w6{tFwy!#OFTOkn^S1D^*IuquKDjp5YImF!pP*2n z{J>yjJVz9jXhlsYk@K^O&Xg~W6v&!Y);u;#xA8m+6VEFgnS^h)51C&K<|iA3R3Zq1z{UjI!;jC1#HiCPi-ANnv3mCtcY8p%lUE4o z@>&0(Y+Z%-&MZTSiIK%w&Q~l_rrs4m1c{8Y#J7+122nfAj1aDe4Lmk^EFC=AIQ7ro zZj7A}=Qgs2>%|lDuZoD^ZGwh_KJU|G=Q^25_%9*Ii_0M*5j$8yG7lMxsDEtmQo$9###drhmsOhztTgZoN46TE}oyyi!Lw>$?HlwE|gW$ znp?*tRw7R4t|N>=tK!aUTK-+NjCB$E2&AFC>Xf*Y9{Ur#vov{FPWoVPSIJT*R9dj;BgigQbzBQ9X< z39|Z%}-;9ic+^Im@OngM`&jXs8dlmP3nA-dp^Q_6=Tl!y-s5QEqoUkuk21QiJwj0${dgAI$($!k17rl_5Kqu#`0#%QxXvtb9)d;N zy!{ z{yzJ$?V$-asNp{={M#>0*n8vK2;iX)tbPVE%09pe_N4VbEUpvZ=Z90$a9VBh-;-Cc zkhj3sBc3#j-{P}wlklDE6gpPBEUif`o-M;j5YcPTXOHlD?qI7)rZS8Uf4L<^&!6qi z_ef7W2~s6LM6_iAT*uo654N{0c^rOA@Dg(OkLj zOkwD=%8Zjaj4-Epk(OHj|9Vyf+n}Yaorb6 zvfqjB#t1u0cU^6y;H(zo$4XUcF;>l-G9%A3X>Kwj;u4hh#R9uy@>iu)uEx)>+>6cP zVDtZ<75U%v$hE$yWZhA%?Bnhil>JF+$ycHCt4AngaZJWHjD#5ALO#)o`7;VUTIfXx z3L(MhhVMMcNEAug4cZXuMd@0bA`E=b`Hf6xB*2RGg7OOQng+-zZ`YCE68GPf&gWaT zbeXK-R@C{cU=o2VIuK5uiT;svN%0>ibYz=+(X-|DlEsj#!jPt$N<*4f;`)mnQOrMX z5GK+xKdKK*WG|CW7y8y~=MQ?sLz=LNsTY#vXH?CiuHy>d<8mVdEnoa%%-g@Kb|9Pv z8Jrh=Wl%kE#`OR*O;yLGKN$l7e>}bZ@Iyq26DTN;6M)Td{h$=^HX5>M8>T&1O}oiS#w?(rzd;oyv7A`!-ZmeCE9a^e2!&wdD7HB#Dj)tc5GLr;36^_(%(% zGo0D~0*`GElM&ZsrI}M{Cd!On`a7kFdBo_On;+9obK!}IEF%LBgBprfXCtgv(b8b7>IQqF3y9gebGwEh4e*;s7m7E z4yuI$udFo0cKaCF<9`=5O~7ZP=I}UoRmdl~pz-`D;2erg`S zV-K=*XqMlj48&6g91}8UCQXP=v#Ct?VVr(b8l=Pwn!gvBpc%ro-o}yCGGUdy{nb!@ zN`Xv=+$T9ZqrUeZR^?=~5odQHf~0T4t$Kpv`^MFfC?8pwM$!3Vz}UZSkU%jzc|JhR zRRgu@d6ls1jU-#<3?oJ?<0yF?5eqx(`>PO2kmN&^-QeF?8rbhhdDsa z*n&w0@>9S!(03?4%ER(T%s&shjY8@X)ZEFU{HM9jHq%UK>Zj=xdIb;R>e`Lk@@!6L zprg0FvntYLh!vqG_T=U%A*%NCSvQSt6Q4EilhV-n=OnCZ2ak5=D>Ak1ESYml`S3c1 zf;^}A0r_zn@#rY^Q`W=}rRo|<9J@)Uio5i2S(}?L2)ZDQQik^VF)xh7C9_@ZwxY&s zP($cF)LL5iTz~kTK=B$&;fGy)_nNyu9VHG`TLZ&C|Hbq7NZ|+70EKNdw4IaWZJR*? z#@XuLFu8CrC}=ePRFT5Q=-|HXcX{S5;LO-c5Hnqv+2IT%Ks1cAAJ`8H>pE zX-;71n0JE~_xR2h07bOS={;qNnD?+v7Jsa$BveY!z|-vFL?9oLQdE{W z>Zh0R$9vUl(RTk%n#$GwZd@Wpds6f2qWVGP`jCjz9Pa68cRCEb24|#Wfp>pVB!nx4 z)5+81;Kko99L8Rq;Qe9E8P?H2!6^Ob3%b^opLDj_J>pJmRbEE~I(^w1XTQ1nr~XJO zTuo8&qcc?4-Q;WV`u~HPZwG2$m}+vf(beABXwBzE*5iI$C$Q@j5u5!}Jd>Nfy($?| z=RMs&COL8iLP_ZM^&m+nDx&>#$*t-l)5L@co_#aB?_-@Vw>^lyMY)F?lV6LpjQcW& zYgZ?Y`%KVfMkXPr?9Qtb35l+3Hk9343tMXcvXY>C7vn~QgSO|LP_f!vZ@;H6w$yO6Jb%fB`B%(hZWQIlr0Jzvmo$&$r`vh5pHqN` zr>1a1A*!0NGM;PQ{e`-O%MSrh+AwjYc35#tvH#8e9sGdLZR&XugP9S!s5oTD<WM zvLWm@y5J?JuHKFGT*|VoeQu%4l}NZZUirX8M6Dw~!!=y+_vavf!g<_vCq$6 zv8m>5#nSHLyQ4abF@sl|)h1662g?H|vF@HGs|-;(_I%yNq#88?=@gI`lUBQ-r|jwI zHZQ)t>Itc*e4Oqnv?;z|4J5Nrap`iRKNL;)ZX`*+)=T6fJ*+NX8a=R9L}qG%Mbt90 zqiLK_T(vXgIw4WUJ_(XD%^Z=!U)93*^sZQ;%=;Pkd@Oc1Vshp79h(je0VeERLaXu& z@lTFAUi@oo$~S9zm>yg7$V4hi&#Z_zeg4)mL+m3k zt-i8;AHD{;LRC=}DE}Gt$0qdEwh40gHH4*H{aVg86gR?Ibu6M7xj9}}N2INz6 zVSHwh(1rLeqjZANb$Bo5tKN=`2b|$X5}l0D=I{osFjupo@rutoU6<|1wmZJSyxr2H z1i9ZA0^?@s2`xgl6xjH>G+ z-MwDIy!kh`$XKN||pB;LUedF4p4*i9?Fwrqi z^5}h2Ej;T{EdDYdM(V_OYR(fu)hKP=v-@y(WQE&*^!#76|{~|0l%*gyD!{YbccAA{`dcQ~S z=@u-Gm%j!f4-^!cEtJ3m>ML~ zhgbT-Z02MC#gMBvp#ETAt`)}Tv}M>{^VG-yxs7_3CF1DQ0I{5zEBytVAz*%gZb$z#v@4E}+TWvwez>(WC9sS{AgG9E0sucp#+kag z%th~xDKZ3lWoakI?88AIe$frWp9qswt)lkp(8W!3px!Tp= z5q;^58bfQ7yN+wpo$B)4&4#`6?1=1y-^3zkq5*jlzH>Wf39(6$=}Ld65EH4fI4kaY zJ(~Ub0-mda(_~*!O1X{9$|(2LkhuOXAviWRqWeDJd7j$0u3Me6b@u>ek6l}L_|@yC z*SYE?W`4Sx*A<~%mc!V&TD;LSS1(Ui}-AcDqv;O-I;XhNJTsnzlm5sA|9+VJE_lAm`cEDteB@ zroDj{cfqndT{CicwsdloTKvxlo25K#P&>xDP&N_up{}?@C19{hId^g51fBpgvzs|K{$zvv~;m z5yH27bl8#A#*^=T!}x7M+24!ydikSy4-6`@WdI%k6_r4;TQ5#DsqLc1t}Kl4Ip_Tv zB2sA9M2UoL<&T;-q6LQ=`s+pb;>lRVMFP87t(ldW5gx`$VRd?h*{Zh!Wf|3PCZ~#^ zLel~f9+j&jb>0*(lO0&WTx`DYoZxf7%p7WKaL8psFQfxSk|iW?LYY8P+4KRDEocvW z!2F65=F;Bdg-%N}ZmBy+!Z0F{<*s$wE_O;RdEEZ7=|D07X(nKAWS+Bn!dM$+FWbxh z*fg8suSP}*(C_>ocSKS=?V^WHhCyrX@*5VFKFMgu#eD)c6NpIfv69K-_yP1O@8=E7qzOr+XhDx=ETp6K3iMbhh> zib#2SkFtz_jUp`!#+;%VzKePIY%b{`vOojla%V^~IzFKq;j-N~8Ab-s^Ruw9pxKY6 zJ&g1{Yvvdwv3rBvJk;N75S@vkQO?e^#PLi!kA2mEtda$Nj~MyIWgR4f`lLltfwf4> zK-MjpR>_;+&(t=z^1rxx%b>X0C`z}nV8LCR#+?v?d*kjN=wQJ;xF&(%?rtHtyAvR| z1cF;b&_Ltb==3-D&Ye5+@2%>p{&A|#e)m~>t!J|ur-h2WH$0Z-v^79Fw)@kPKx6`i z>KB$wq%-l!Oyvu;uCOhRPH(U7r6eeFy(Bn{zetRR0Jnr;gPBWN-Al@U@5eq9^vAC! zNT(Q1Tb+P#AP7^)^4eNuaoSV-?MN0|c_kH<0R;n)1ahP{Mr|%QR#cg`6U0H+*-)N# z^14udy}@t1g1#8P%ki_WT{!g6jOj?}*6&4pcORJVa#?A58JYiJ4eRW25VVNFLLrd5 zmNAzuq~}bp`_>zK7&ohN@YYL?o3%JFudH{>Q9w!bY0-UW{I04V%!QRIDCz+0PmIEd z^q7tw^(F-#>%Kuxet+*MwpwgrQ*jrRn=5%S&LG^h$e;hsc1GIztlWhQuD`H9Zr`I> zN5@ORa&JE|F=N1Y<1HR+vpi=L=W9*Nz?aUnKOL8_$3a+ed~-L-6%W;3|3}Pa`V#*= z_^Su{|E`wghMQ-k<`vL z=t_SgNKbv+Kpt~(r=g@scb%jZ4DZ6x-K{+=R2(S^A17I*sNAHh&8t3N^?%TFPJCQXWDe%K02I7_52Y!x zh7aJe*o^$?_Hgd%0S7KuAFSOHT0}LVXl-G$&^?7>Zeyp&zv`mV29vW^L>*(Q&4;{p zl1ZcE*|BNg$q%Lm_}haGV~@noSD`m~_fJ4&El|LGrdAhc+u0ZTn45qEtBTQ*bLoD& z5)_%E3u*n+v?KbEgje9tN^NGsHeh88m+R7(hf_3*I+V8qenHQ-y^BQZ4v%x?sbsH= zkY8Mn^CeeI0*x}Ct2n_HGUW#NWGs`}Ccf&SCdF?H8OtBJ+Ji=S({HDgqzJJ*&b!{kW@ZY~mp%EBM3^D!bbqseYb<;oil z{zkv;@n;X8sKgW3b&_B%3?7pIT&&huv{tx||5lZ}Ds@`SKsjsu)LOw~WgzrWp^Yvg zg-2Lid+HR%PdIb2sqrFedDlsQmg9ZczJEg+W-ENO;_w&9g3wzbn07Z(1THY{D@u!l zD;F~}tFf40rRhm>BYdbdp5scFdx5yJ;}qThjoNnOrHeGL>XP9Frn}(tSn@)Anu;tt zY*Jlrkkuk~aF<}vlO|h0PlYTtlp8o|C_I*b*c#Z27hL_a@RG#>$Yo&2=&`X>_43d$ z%t@m;Hb5mVwyh^JszLei?zf({<5pEjKaOlA_@WraF4*U-h}^%nZPcHC@ffBK`VA^=BZ zsSt9jK9e~iRp>wgqfBY(O_T?Mv)6jAsmBV?x7TZqgUaiL)nf!FcDD+h{*sE$pHJn` z+>F70m#&}f`?p4fFCR*uHDw8?RzMIb+XebiqTj@yJ@vsFrrvv=XkMM9N*>cHx4Qli zvlr#+nG0W`o8|(<1|`YWjE}!6gOCGuR|@n2j`2X1@KL)H)@X zUwO6PU~P@JYZd!91cZTa)*nB0w~OW26676zYji2w*>;TG`GCzB89)h+T&H1*#&
xG z^P#iZP-m@2hU(msEYAe%yNAH;J}B%m!ae7VNAkqa8J;IwudEqG>*q^_?YmV5mM8X! zv~6!98y%+TjPbc02L;^OC}gmTu2&c4`R|W6hZ$%lZ@Wt4E^g@oBGdCMSiDiR=Iv!| z&OMuZ$bW`enpXHgBvqIfgqUn-0c5z9VPPuh#@;&p-Hves_P?rqV9%*MazeZk+OD>fqDp4!Aeb`g8%N=86h?DKi_9Hx?0N)jh` zrJf|}FjF_?ZQh$&q4wtUd8+Ty6KJ0=Fb0w7Pf}OJmCmC0>$hn5n&z5Tjz_x}dBlbX zS4+h={8op#M!ZODjo>kHv-OM z{_CdP!VQH6&2b<3S9FE~)&^A~_`JujVB0H#KBvD)KS&NPhS&->#Y!A`-`ykTXjRqs zyD5tiq>LDH(ZApK%Yg{-$=zceONU)h65_TDCxnlU}L}JYb-3ihldzu{q!s zy)SD>8ZI`?O=<8;{Dx)kz2525@##FAn5#7WN|+_o81d5>3ZA~=o1juif8J!I1scfX zem9*|_-^Wu(X8jPW@Ed)1^ArLI&TsnHm&3AyFiLLp7>u)Ah?f*IQ~ZhZOHNEyKyz#)+|4D! z3Hi-L&r|u=!@XF`DR4J%$Q$!Ie3E^p&-}9?^i~RUtQ<1#VhKAZN$+~sq4Urho{`$% z?Bx(yx9d`?#rycAi1!w5_GnT}ec8usDuBh1XM+Vq$UV+ z+`uyV<$tjNE|Sq-_?!&&_&SYqTO3;79)Htc{1wf8Vj8rvi}r}}xQIqg>Ktw?wGr;^ z?KEBUn2Sc82E8d%KDG-_ExxW4W#QQgwD~02$@h6nYBSdkT2p2jgfBPlEz<<3n8#rZ z@|*tyG7jnm&J=M=am=G7p((j(dD%mNU;uA{I^!>wyhpW_XZ zis3#AO%sSI_f8^`JH0V2j`__X708%S9cR;M=s4TL2ZC2soI1@4_(yM?guXlbN0fwM z^Sba#I*(GSn^(4pyfra`^!N&|0NYir0ThPM9b&3f56659o^{)$_JLhSs?X4WUu}wH z3GeIZ3Ha^o7s6ezCNm-e+PV9NFXs==GsKPo{#Bv6lk5~)6w>yDEv2r3&fB$BztgP5 z<;*!hi!$R$Y)E;vfDd77-Hk9t>W$`69XGSboGfej{i0v6&mP`xKJeLBV}%tnptRtj zjaR7dC4c9XvNa~teYrmaBk&vN=2v$dB|~nO*mlgNQrqP`-|YZ&(M4OtRYN;g8-D(8 z@z0PK^dAk};&n`9m%L=#*`Bsg4zXUep9^03wd&rK3MQ4Ftcs%_-Aup1-Sv5K9M9@L zk&q`tPs308?XMsB)puRWyY7QNe0_LwGllCe!YFl)y?hR9j@@(55-E~xZb%hUuev>Y z4GufEdw44rhV#WQ8ZUGhQ{ieqf8}Sc6n? z8I6NzisWYtKHdFuc#Z#82uUN_FQP+_^D5B^7p|x!Y)m zo=PMq-D0!dyDu1QJv#f}EZ7S{wB;)wHaA|IQT#7o@&8m8iklINtuT1JnTk$!ya`|) zYm3DBu1GF}M=)(5GR-UZ@bTp1*i0S)R}Ip8RugqHd(u-5vYDc|cw~K?-X1K9MMG{^cvfqpA6Kv-)Q$lvD9_w!DzA#TlB%$=>s@h?3zV$|jD9(Ok*@nf< zmoVFgCdI}2myX8uhS{_6=eyWGsfIgK?r6T+&X%Mk9d#jE)r0cz;OFQvYSM3K!riS` zV=j91SKflo7;IO5rEFIcy_?6(E*L^&d3N!nzdrM$z{VNUX* zjV^NRV>wG~wL+l$?&)FY{-++Pur8;?vh%ENN2C71T)j}Tun1QN@W^hF-%CcBVgfZs ze$bDk)mq&XbzkI}*`G zi1=)f6W_^5g38NoK9SeSW8QO5qrDz@c3n$$sWY%>w+NqW?wwqd*oES0B;@4>gE3)8 zta%tz#==+Py351w5=_ZoB)4i7xLH|hI};p#v+V1xIBx{1SZiEzoQ;bGVTn5m)hi8V zw7R+MKXyxGli=7|VqP5T^1e<7U{GXre;c0zfHz-jLZ6VCr`SVwbnBq^wq z4-N!NUO-94${46`jMjZ7#O=)3T=Cy|si0I`9~*tBR{)Id{`)G!(e%f&`sI4)3~0Cg zTWj)UWy6d^@<5sGj1YtCQ-&B<(u&HDr;85mooT1(HPW*?cxc5Tz(DRQeYGC1)4y7AA1U!w`>*bRTHzEij`miw3-V_Jy7Xn)GDX>r|fJ_5^4U2=?<~ z@-qg%0&MVCS2?SS7MRAxxUs~w7cL%NtU%E>nX%goTHRABKlVsQDA8}=qHsvIlKNZ5h z6?K8HG-ETcb+H4AOjIfbWdcR}U<%j!uK9pxY<~f9&@+hHZ{!^k=t*qR&;?T)_ zE2LVu)RfC0QX+88T&zdG)M{$(#eBN`_jQVjKpAy@!iR{oY@U1bT|E)ucAsw%X=Rj4 zRzfqgu<~c3qx%{Wpf?O+@`qBrkSw!HQESgeiD|g_ z`q+Eu_BuY-ADJ(l|CusyAAyK1f)^@JoDRg`-&xk;GR&Tf0zAjH;$6m>1BWYA{B>Sq zCSPs0f&*kFmAlW^a!am$17)?x$x2L>1;dBJUGr$=a_Sr-oNwt%D&n+5{p>BlCr((A zL;4G-H*1Rx_EXZ5hSI_a*QUc@Au*Tt_drR!8AiOMpcI@3Qr)r}3XgWD(49+)M#qH) z`V21?D_VEuwzgFw`@{-4TZJGSz61w+E}JQd#*wjRlSO&>mFnrwT|8#*-+}WVWo@qa z-#i=kZrM8C(CO`L2J0&=)zCyxITbA= zY?@AC+edgfw}Bw~FZzWOw2z}2L@P&Rsf;vRC_ecfdzqNOnhV`+#r#)uP=SEgCz#=BomU&?t7WgI?@fp6#pvc#3E3Hf7we@rf&#*oG@ zVp5vg<@1$n5xu!1BCI&N+DPK4VV$2~!`{E$z##bQD3#Hb|NDc@`^-mFt)?<>ixE;^ zlTcL$x{!M$Iv6((_F0e~;kt`i_m2@$CwohZZ8)7+8Fa%S?(H+-8+oBdH_YG`SQcwS zkx-#Tk(C?y_B!UzkE)@)N$GN;VQ7335A%8kum`Oc<)@)DIyse#O1fgRJ4=j+9d3d> zHJ>Cb8`6HqpYZ+C_3jOX62cr`Rr& z%c;2UzrPy*5@#yxUU^~E?T%VEI&}Uvr#V2yDDN3+xpy0Tgm)wSZ2Z1fL*Z6kt*wtg z!+W)((k_ttZ1-9`-($=*g4;~}{*`#v&I*JHo&K6wB{TCWY-LBCTeLWvU|6v$rMg3?ho zUD-5Klm41$VVWk;?AoB!eYf>xl^xi25i=W|N`yQt7Y$u)`c*oLmtBZ681JRnD+v{H ze_CpxG>Pm23Jjey6#lttZ!W(c<=b5MCiD$t?2$LdEJUr^NE%%7?)5RyRg2dxYd@w? z5xq&BU+EVxu6P)y3?)x7V# zD!1FgL}n)|qp}|6(s*Q+I;*59$Y03{%%)U*Kw28U`RUoyFL~GO9wtB-2Rq!^tV;$8 z7ij2;5Q~P~H;n#dmP{%Hp1P5xeN@b#Krx#JVpi$Lqeg`fpCYK`zPE$;T$*Ss!6rRsx!;<@2&Y>~`zVT`~( zG^5?p^Jf+#>-5HiB2`pJJMOCfk3rgs6g}U_iyz;#&NYe-GRjTyW%^?@abYSc@)!2B zNa)GjOQtYAnjDqZh%Dx4nPn3RbM+9-8{nGXg*=1Nu!_T^68J)kP-YsKWU;^KzKjF6 zQE;!1@SxY74K$9+7nf&}_Uu-JUHBu98cSkZb4x=xbpj+a+WIGsV>cHabTy=^*${KG zlBvlG;~|%aGFZ!jm*VFL-Rx z5OCo@Y>%pY1b*Q{B^iRe2aE*gIwm0-ll&<F#U0pX4lCziCT7VC2 zkPuYubpaCSt#CndZca_J2hk|Kd3U%6((B_go*d!Y)KzC^W2&y&_o!2Cp=Rfbx(-)TV$2ySnn|akq4>1?lE5E@5@QrY9op}Rm&CUU=L@cIo-FN zSy?wi@n+PW%`e_$)ufn7;`to1ucU=)Xo7Sr*Tm>6#*a~8v_X7GBCN`ah@^!09TJ^( z$`g7;9{$*Lp=%wdp5fLiqgKvs18${ zg;b<($8h$+swVY_ANy|wkKihbjkNChsFiUj_;CesL}hg zCwWr1Y!BXyZL1K;^yLH`cE$Xv(dOu;yx8fE;g`IVm#B2$MER&?d=9iqe|&4QP{k#x zO{ry>>$%>FsM7iRGqWtl-1l3Pp!cK(cY-a@CU@(^Kw>p(oj{w7PXX`UF=FGt=*3YT z^4N3VB(H`Yvzms+i*8q_Ctez3v}5I_YkGi>Xvi8W_r5=nSl$RzKjz`B-bw$oU>LE^ zga~es7x5gov}BV_`M({jWXba>y49XjtyN&#lNhm-1wK%(-20L&Pg)Hu;H+Im{=w!< z|4!b=nJFrDHW|`;5r(neOkYq!P3VX&W}WZM{po{g*v;tuZeTA;&S%IXXXlR#tC7~X zT%FZOj}FM+k#K))k@=}SlYSqAu{EN6g6ILLtuZvu+X1fb{bFs8pLd3r-qSJJ{6WW} z-lRf&B%q}epCikDm#U*OdwZXLvDTjCIoLK-@vo)F+i8B|hy2Ks?mctwP+3zhGI9uJ z4m$C|9v#_!*)`uT(srp@mcCNe=lgBs6Mv2D=s};$NaUH_?;mt_edWaf)U~2V9Wa=g z@m54g^R4VUK$FBzH3*sRghl)+(o`e?k$k!~3Y0Q2pP#}(EQD;Ar^RFCbj--Xkg{5& z6y0RWu8Z$!D!y)?#M>8LT+-&o^k)3Y|Mqx>u{szQt8lbTj$QuC-QD1wKSyio`F`(H zJ2*_q3#U)~Wez)vGZ`pLo63UHNADQ2caPE~?n-M9fm^MdP@_`NnISs|l48BCM}JM4 zrM7{Msx!WhJjhRr!61u@Pf_+@<-W93uMzl*XRaFWu;A(pCT&+Ic+9c-D{j&Z=0V)wOM+HG5ULqpNP@4`Bt^+0#YtV`!l$8z^B`y4aq#P2&6 z{Z|C6e3|j6(lwY5oPq(mJ`UC1mKmupl}T8mFhTj9uAucb<*eiD3#%aiBK%S#4`Ml9 zTFp;0)4*~xzOoU>^HM&?tN6_n=*39M@30T7mtV2uT67lva97&Ek*>R!+gzHB4&p~$ ziNd+k0j5tFYAZL5lR-92$dkraq{~yCRpLqu72d%qdx2x;&w-9Sq9x`x{ZpnV{J;XMHcp*slUPr2UxE?VRmlRnYP?4S4X>V72jWYGV3&Hu7;S@B+lkx$<+XH-PuovN!Qm;l8WH1Va0uI~TBShK1wN-&c!h(Dmpy|K ztFc&YAp78GHNV`yG6xO~K&n^d2uN*ty)M!qtLOLM3>QOo2l zo}=yrr;I3p=920a8t9Z*wL={$dK+JDyZ~R8{Ta8qLVj&CYbfQVbI6#zCl2xROTj4i zKb7r9dY8T$hAh*Ax7Ex@22EPOAm(hI`bbkkilnMC#La?*Ht>i&#i40xTW90aS*%OI zclivnMpSY2nG4ko#1?t5_CdsuLHhGiXPqsAcko=!W*mwunqjM@iGGKEHsaa;Emz3j zi&zZ)mqrddyZf>4YxiQ2pU>rJDdCE}f(z_>Bqo)Y`Ds7!QRnn~yNBgI(d_M_KL_%M z+MpA70y)p1dO9%ypYKOvE)eyV+n=b^`kxPS9lC$v&-=-)&hDL#{tw3+=q9Zsz9sqh ze2;YkxPkwsP9;{C3~`Cu4St08Wl@lfHibfp`-OsF8I&wH>~boHL6W=JrOj@?vct*Z^}A!pwF`7shEWVf)WCqtt=8@Yxb4{KcEwO z2>e$w&&Tldl`%dy)yH?~i@z@=t&v{u0;J?M(TPJ2Ci^>|^{rQ%<6b8M-W$@4Wgz0m z!`c;^<$%uPigZy_9~L+--GaZAS7*-ds82jQ(+eP}|5Ywg0eSnD@z#J>DfK?!IDr~) zO*6_fSc;v%iB0)<#O^+ zIg|{QD6Xs7q8(fz{=V8cl*?1qn$L&?Zk_ueMl_0|HBSiZ=JofKSk;1!1mYEGCD)_I zRUcfK86si3 zvu$rco!ly{%oQw_mYi^CHxJb0J9LjGn?p4tfXP74%|hZ#5nE)^iZNp zk!iU%H5_*vB>rp_Y+KuM`Be^-!>u7fb7`8vw*AFWx=vHqNg4O`cLJ*6ocX*$;|syR z$h==MWa;RbuY=t}o_)1aeVz8hMOEaaO)r9oW&JHwp8u_}Z8I;53~*z-i6XvSUirB| z%(WLHK;Z!zqD~|3DdH)W*K;@CTT7+jaUcucQMpJU|Jqs1C__hYdIuq}s;`!4H!~@V ziMf1CVG^81_m_;Qurb+{tG}TG!U6f<6)iaED;z$H|7C%CkpT<8xHncAi?@p(k>a*7 z&7~4xUQa z;ulB>>}-@*rIBRvuvc1s{EJ3r5Jc@89eUnzyev6LIx3nT+_T*}9mc(8RO*!__R+n3 zp^9w5xu*b!=NmPmkV30liQffWCxw~z)HUwsn&#_`pJ!cuMw^oR0-%&fyYlG1qyo_k zVt*O(JT6<86t0~wbc~n(i;&~nLu~POD~QaCRqZLv2bz9O9&O%zBK=?@An)oc@aij~ z$(;HT-uXW-g#n{~r`u1j%&?dF*+Dg|>~KD;8-DetdwyCHVqVS) zXc{Jy-=bwKC$mu9zy;;tqeMh5HaF-} zb4kaORaD~M6naLUX{&UE+#;MHKPJQrcwA%EXEnxWs+FDj!YI`wgx=n;Srk&GG|Ro* z?SPM=7_Em?2UQKqGEzy~$%#%nViGn6f4~fDrhg*#ixFQYuLlV{LnKzH9=ah;@;-DO zDgVl*2cB(;g2)ciY|wwrx`Dcw^WIjwkEue0dWi)ri)_iogTN7=0U>b5B4&3ws{|mB zxr>Zx_?w`?l|Zd=?i+WP^bTw92=W#_eQ7K#RQL5Wjb?hw*%KxM*%^1wh_v=gaMJ`k zvW>yLUT1*EX9SyPaddzlL9tg6n^s?uohg#MZ-oFa1$%c~9%b%?qS6p*7q_tSvmrcq zA)HPmiJEJgSs~ih`%r?{(t9Y?ASGWz(V1eq;Q7!JT5}>LQdEL%j=VIg#4nKw-3ue* z3)VL<>}t<6!QRGJDJCbnhmGqtwYApsTHE&zoBHmib|WAi<#SvT3h>zW%TIeOMQB^} zF@_H@)x2+(f^8Hg!((XivCATL@oS|DRcpFE@4T^6 zt`9RK(~szv&*1HB$kixp__Q~R98C7g2YmuXvGp4#66*<01Z*V|$&e*BSJ#&q@C)|J zP@$q~y`jd-N0K(djQ-Q{!Yc>SRgKru+gZ^vQkX1gR<9H#Nl8X>r=dojO;qLYnsoea z!-SXq#WToZ5)t;+cQGCFO}>iYql1r$>7%_v3+g2GMej?S zSn;7$@sfT+oqoHkjns!#S+Wiu(i+kf20-S~t-RI?_cmRD3dEC9GtJh26I2$p3vd*N+G^o1?+AK#?h4)@(Bf%F2OI0JIA$MS2OkRGTz?L7W zJfR717CVKzIJ~ArGJ^M1XqtaNkl!1xym%3i?}bWmEE-z`=?r=v^>5&FRjQODNSab> zej&D*S#VK@!zwJT5$9+5t8>i31#zU=a>|tIo~ga-az5E6x~iQwg#hUoemNBe<1&l9 z2r&3y0PpY7UYJ*X_4#e!;vJ&AHakimU+*UbiZ*;xl#1?p|eb+DzQ zjFZ%wp^GC3(2)Hpt!G`6MYd7|q=i{J!l?+azlDx+{1vpyf+HZUMZ3*7we!aJ;X6S^ z4RE2NFc+PyjPNGyl89Osaot-z=5bE-%^H)27+obGod+smd)8GysPmk98mroW&ZkHlBcASfSEq)La=4=3J5F&U7t6d3|8 zi;IVgIDqE2HM&w}q5n>Fwz$XYF$nWS9XA_C&i#|D$`DqedEW)3HEKi=XixozIbZO4IcdtT zmf+oeBr0K+!@XUpoL-XX`g7(b;cK@*Np+WxrQR21lxu9s#E`lpXB;xqdMGdT_q}?m% ztDd9rivdQN6ivAE^F=u>ZKq?o-cAR$7KW=*Xy+jLPUo=vT!m(Y^N?INS|Fb$SoC)2 z$bUNAR{F78Lp*yUpwV$qy+)^00^9N%or&vDc()}ufR1=M4|(?~z$`x`BH~nwDVCD2 zgTY_>mPI(_=z$t{~A}<*_Vxm%Gw&?D?N~L4c58G<)q+gsV(m}#k65s ze^@Zd4DwRT*L9R7JM{b5{WU{nLIfrsul&Ls(1>`7_$;ZFbej}!dY^v<0~Y!Wzhf4x zI9r(~2%rbZaRSEhBXy7~kTPl~)sHF(NniWeklpRw3NB8$$POtO;W49Upth+|iq`k3 z^xgl_;#+&LaA!o@c{OXtC*|kt6^ybOYTiMwlsX(kY2tyKQMx{cO1mhg8VvV$5X8r? zE2s8c`h_hzS?QLw-PRCx@`6L}sV8-> z9f-m9?tN#ThHW)01$_?r`uL2jr3L|@J5UTi=l|IEiB9N)&+LJcxOrYrH*8fdf{wTx zS1f@=YHKeTLo{#+Kk1htGK}EukfN{Cldc4KZ*7NAgr)O1w)mDHUgIu7=Op_0l4E4lqaK9v#lt%R=FI?L3+LRTj1N94>HdIA5)mp zZSo)AXUT{wJ&KYQ7xz8E7f^n`W z?4!IFa|+*kr5D>S!Hh~3ym=I1j+!Jxl~|EtIi%gg=Ha-gFzQC+{dP~!d4(;2+&XE1 zdX<>!9}7flcrM{2Z`n$&KVSUbSR!8LKrFr8vMM`YAL%RbKj72kMo?o{c>j5SR@n?D{Ex(EQKWd*ud2>(EzR}mWagb>drFJ^$_U|Xj zKk}2#)GwoCg$1I-D!y_#Eb3S{5l6>@x+1j7zap&-nh z#s<9w3G2#hWGwl^(t8irZ%<#e@_}y(a!g;YKOa9Q-82BhUX}F!lI&;~bN(T{w)anzpirqb!1jo{4PB%ZcMfn5Lv0&rtoxuCR^FJ5!d$o((>!p!Dw9X2 zxe+8DIyj~$M_2A5AJJq7GG?AubU#XwnUoc$&spjOh3vS@kePReDz3tvbqhdtB6_NQ z1$Qb0RCdYK?^I+qMrVpn$WE1#?pXCh|6IU1_oM2GjmNW|4|sx%l=h+VkddRW1t^q) zMc-48n(4b$R`!`Crrv)HvSE4m$>QF9lc)Kz#gaZjQ!U{?R;m(~P;`tx(&;Cl%x6aZT% z;Q1-eu}B$-ROgV+Ei~&QFL-636;uFGtJ9$607ajoJ?{%OF{+C_;jJ;`6iO6_x&#HADd9#%YIDQ_ z6t8Az+8YHhCdueLx*ojt{!w&pW})rSV})rt0^b76UOxU6_}zToJIJi?w>x)O+!=yh zJATUWDbp10ME!o)`SG$+xF+Gqv(P;FZPsPM!~pehK(f^(SRW@$qTWG3H~^{qdzVZrn2<&OGl$T zGBaPA+C-v2TLnACTIucVU4~J|j!8Dwb0S&ob~sx{DI9zBgt)!~8VUH1^FQ-UJ4K4* zAXP|n_+XqoqA*;-nuV5LSot~lx0hvW4Ms1ih-z>aG+r|3efjpz5Ac~*pq76Io?;)$ zs;NYtfDsWv^Md%%H{Z^HCkOaLZdEua?vK|F2zl%8pQ4WO`-+-wP(z33_e=bz4-WC$ zMQyzMlbBYo8*JrLK4G*k7A0vx81C;G2H#ny*GLA(+1tB~3_3&&8oFNePyBNvRH7M6 znwD)7+ZsraU1 zqQW=-03A3??1%qvm46ZWrM z+2<2{w{@x?p0TIPca27W-O`Y{IOfEN?H-mDmQ>O$*|qlbG=B;iCXOiOxj6Ax_7Kij zvg3)y#~kiKL)_82=u7+FfL>Sg5)4cuIxW%pFE zR9R=(_he+lS?y}puWmP7LvzYB8t7&B&35_ndCER{=b}HH>v>Nwn(_QtmzHo-QoC3b zOF-BMaAA8jH%;?>+k4wES6D96qO)XnMzPnMtKJHx0Y|5dKgmYFvG6Bci2hrwnBPRz9wsCj1Z!)MM7*l6k=`+bR@KQR z1&*i4f8Z)dA7f&-ihS1%8(YbtDF$gJ2n>CC*TAtj8Hr^n@7_Kl$H%?%n|Ju5g6GiJ zi@^*Gmlk2Tw?J2Y(by_muhEc~_mE;RwaUG$3S08*twPXwG{VcIL66w63?7*}ACob|{ zr5$$%1XlOqvwT?$VXI|9-JrrGxc!*I;K?9|t5-#GOL@y$&2&Z~sUg+1 z=m4~wdMUxv$Aw(3LlRly~~6aqx^aD}jUM-7o3>E+ufe`0ur> zOiq&q#8N8yOO9%Jmf}p^7RY9#JmQlvPRz+^J>1a2&& z;7)>s_BN)=_#zm5r3pq6!aGD^@}JL*7&_j<<-s05-dp8fn3+B(pI(P-xcq|N#44m3 zTL+s-Hc{~2OtSV(cwQz}pi=C0A3_s?zmq(4D)XznQ4sN@BqI$I+FOUI z$VQIWh3#B(RN#e`^4@^SjK*kTMd2&sk*vj{HXkl4G;7q<6w9{2x#Rri0$p>hJdnwf zXuVAcG|1}N$q@1(M|kvf+T5wR**QnMb(KOu|NT4upx6vP=#i)k0mnMO zX=j+kn$aM8k{A4bmn82s1x%A0VzEjAVyS>J#+B|+r3x6!G?k3BE)A|pzh>}WvQinV zf2EDlLkN);F-EC;34GDfXpCK?ngQ=JkR=B2d8kh?7tE!m8Q2capH^{uV$Oq_2`=V5 zhl**&8rrbqZN&(Zbl={;aq4d}zRj&veX$rhnKkdR$i2bucQd)a7;%VSpl)elP)%?M z!1!t3UH>inb_I5cQ5gL}r$P8**W_?IU)tU@VpyE;Fy!j`|LfBc4Lh0T=U~O6lFS91 zB%ZGrX=xs40B8u(^oDnvv~(zda&1Se(TDE6So|lHZT?07B{Zx8+g<;gjt1lQu^U@B z_)$Vg!XKMF9g(45UcKH{ez%Ay=EFt@9FKWCNy4EY-o`1$kMeya#^=TxTN))ovGSIU zXxEFvgoJto-HxOmR3!t^)__FKL!lviXdm_+c)p-*a*L6Yppjs0O8%PT7ezi~a2)#f z3F$r_3lu^JxN%YMNs8rwaw4D+*df3>HJnDF0(i>~n$hL#xR$ion*a6Ba1nP9pi@(a zm`f#T;$iM`N@*H7$LNnvd>0?Q(d^P=XMoHBl8J)+16mAjIcV*Xc&>!p(%xYfF$pVL zR3fima=L_fagG$%gVr<@C8C zO=Qk{S>evd4ADh;F|9H0;#kiR|Jimz`CMegkI4?#er8iGkgU|+PG&bDlvJFCH)APL6gn)OqPrLk`91>q~ap6%c0 zJrDsE|M+R~fhDEnho~aET~XMTa)0sYc2TdRo9Z6JD%p(tfxX1M=-6prHDkEIP`Ss) zArLFB6```r0C)EL!6@BMIBC9_U za%mDY-Dv5PANzKb`ok;gH4?Pw-1|!8?>ayqHKPsmy>f?7NoCj+j4x8FYIu(=(G+i3 z@+1>T>R1CNvdSxjc+{pT5f5l@b0yVNJ6)X4T`rzW9T&ggz)neRC# zIEAopk*P6___GWJazG%9u*D95pa`qJ2{8ktI<9J9IKbA#SX9 z^&l+PiKvSXYUfFRw-F%KzzUfn3nbGX6(U>8yWUg0pS$<^KD`q2E<0UJ1OpKIUPp|) z-pAqE&6fP(iP}t~@#iY&ZCSvCUf>qhQ1~IzXNLD58mmR%IN2GZb9{U1VM$S@@8ikK zyTnH!U##q|VFrTXlfjndJbO$H6V!S(*)9u|M-gkB0mo0K#LPfFs?;OfXXE2Im+|v= z4)$|36e?a6MEP$P;WSX!9@o#N*}JsMMzYArea1|9L~pAnjhv+Z#sI}@Rsu9|x!=4d9zu0hv@+ovL>_t<{#?k@F-`_0~O zS?gWvS?gK*eJ9|mI?0tpKt;_PU;`-qyP^d5s)kFh6kYMC0R!9aRY@!Ph=&I^;eTHW zZLFvI5$mF{^7x(5yEJksSiUBzn|sS!UuQLZz03I16vo2X&nSk4X%qC5{EF{D*A+od z-QhsEOfe}3bw=uy`9~k<3>4xk#>AUWGeSQlFQnr2sMf8Qamf2dQ>sb~c`Xf3U^CjG z1*AIT0zT>hQ3{iK15>1$Vx!;57T@=+7Om?lFdteB^lPOWpDH>Zn-YFwL-s72?S-(> zO367rULvWIP36%}w`Z=AqeBd>vnnbflkQ9k=+ndM`owRrS=%O)(=q(Y8Y8*nFU>d}Q!OiNBr z7T3&1!hH>Pw0sV0XwxP>(TixlQ9^hNMR3>Txs7L2=PJKcp}XIHRHNX|JKMuyXxOFM zR63ga?LbS&W;KJ`qXdD+*{b{**NdiiQ%=BS~(Hq zkV)^;sM{V=xE3Dj2ah7q-p$nv$s(Uigq#A2K04Y5e~j%8c2r`bKjDAcsCWCy=*e^y zVvexoJPWc0 zoje{q%5cL@alWE?3Nm|6ek#BSAM6#?4c2z7i&&4Q0V44MAgMk zG}ovcQ&S||!d>7}E1G3ALiyC2gT;1CuHPj!hSgq#}Y1OR4 zf_9isAOMLfq#jxzZo2Em-?m*^5Em3yGnXHHq-w6o=4Oi)3d!wmC9wEVd~?_b+lZ5&zxri=33e} zah83lyV`c`da1-UpGJo#r7EwS^#rEW?FuJ`3NP53_8Zk9w9~#3B~2JPC!Q|}?zFn!w_7NZ#v7~{3W-_U<~SEZk@=jyR@%82NheUe?%;iv^Ij(B z=WP}<1}}GdTwleiwV!qTAA|gv)))iN6`(vmBrRrkgJ~n7_v~a9h9EYes1=es8nb~J zJ?m0t90<0bH>U8EpeA(^>~ii<`@`~C(B1MswiuNlx!8P3Z~r^b{l>3I`8r-#7612F zWoaMx-}sg?aK`wjW+=mO`C1olx=ZY%+3^GOC@s|QiWYbRjT4MvKf7T*(J?u7M-Ww`pRpg)_ zJtbvpHfT7?#n!4p#>`2sj4+O*iTU})FlpUshN<7Txs`Y?D6WwZJ6_k6u}gKOvlciB zT@d*8eBr5JP={s{)(112hReh%b=F9G3Hwb8TTt=_|4yVeS1iJ+o2%XW!_!7ob~Q2R zYr0;{Ag*HJLB`%_*&AEwVg|Zk8CQy26=6*_5wQ>Wtk>H>ZE`;CNHM-dx06EEo56># zQ3;<&dBOw`y~i+eRlo9;_?hrIJ2 zoL6IV>^X``eVXAo0_;U#5F|b?bf=%-LGzb|iZK?gH!oOCkLDT<;}x#u`v*yL`><@t z{qiZnO@YXAy;YXzqYpfO@hGy#S->eltTj$SlS#>9+_|Ye)jiJ~l>wmAYWNj(Kkn*y2hUBQT zoBm{;KkenxKqjdpum_>2Rf?s#sPCJzEfSN3g*Q^M8}S7XU5DFsRDe9qX&$d2!CceE zN;_*^y?nKw74X%gr1br-?^t9D8IotHky2f=} zqo(y8CR!ImjSqdJTYczG_TA~5QFqal3v}Grhek&i=Q--?My=&lo|RN2^%NfUL^Uu$Xf7>FxUJq9PdkqJ6Bnk^fJy#H=P|5Kz~C{27Rl0cH6 z%(C2Al zaw70ZVU%}1Hh%OSqC(Bxrhm!mSfNW-)#8&KqgAxulp$SyybmJIN=Tnvr24=20vK^S zL=>;ws3=ly9%Fc#duZgdWj_?ZlVAs?7cqVCX>(apUw+l;)eNNn)GY7A8BY9^go zy>E3dT8(X;(BZFN8(yS|VKSQs4w2AdwYVifgl1sFe4&x{k2Rf@rQUm^5%u5!)sxvG zHfvH^XKJIrL=V3r*FJ+bp6l-}*!zz7=%YNY_z^fRwgwfG zXM)ra;AR8$$a`8rMAvdy?$=9twEqI-S3FUmx>7flVMJ5u?QvrjAKgE4OZf0;RJPw) zG`T62plHP zVH({WmP5(d4i2By7Tn+1_w}#A?biFg(uX?b$-}-zNV9$Q+&R_Iy2*PmK7r7AnkI0| z_=fupr8QJW*B=G!Q5sj4ZI84xj(#)G$tlTIWN^lp$=5c{z+5!nD>qe+_L}`6YP-OC zmqnOMf=k1+cTl)9w7jGWHl7J`dzc%DekL8#P`#9m2aRE1sgUuIcS0WMo}Zj`xcQ?6 za(pb{)64rU3`4nRFT1$>ddp|lc8lH}*>S;teO+tdSzEcXh9j_U8eTu_I4iQz6*hl> z)#8bCIKOgQ<-*)FRM2}jNmS;#tVJS=fMR6!_Uc)c6;N8|13h|B$p}@c%h|lG2^=)e z(`1m#_+;FYxFzSs8afQWJ_kVu+ z_3&an`|wrLRk@b;r;;EYMKRw;q8BjNy6wy>m#@u8#+DH*Gfv~Ji{66}+vlQo2pcaL z0YxCzG*6Vl4f-@D|)~i8Reu{O{iqQ}e^^jBpfc9=g<6SK0Sw zJ^I$WeMg~B`>?Td*{e9GS1O%O+b|MHu%xzhvxH;eYmQ6PtB z8{3hl*c!+Qk^QpsqY?a}yQXTbnlBj40-lUQe{u*z^1Ofz8{=!qUS>a6piACJYBq)a z;^Ssv7@Dr16{*DgKDtpTvC!(!@u>bq`@1gLr`#Zb(8hZ3-+#`aZ4bZFOYxM>nASHh zci*8YXMx1*b};JJJb4J~d<##|i6gYGc$_MEee7b%^*|yx1ilPL}uMK!^O^H|%kzwtF>f_NvkI0_~_5M}$Yq zX1HDTldtVwwc8foG}a=0ijU4rlYRMC^#nbcwwkdr{nKZ5Poh66UrsqQ63WQmYzdj7-S^JfIw{0zwfPg6xOD^#!dpGov zRsGppLce?OuHfCg_wIWo{!NG=bKxG3SMoY^K(TbA=gb_Qb7QQ=>T2R0q*l++M}~cm zM=;Bq=GOdkhkRNEu;kO|+P5xk=f62|ZpPc5IR8es6FTgw%?K}KCpGuAa;B8W&@SzSIR4SKH4fPk{bI=b<7l7tG6o(Oqh{F_==H1LBa1^%QfY#$Z<$1fGSpT+ zM#;HHx2bp%-N`wmj@qpZ2$X~#ZY^saXRHl>Wu3K;T=Q8Kr<%zr-dZZy3iZU)60jqP z_3gVSv~S*b3f2I<9#rYV?KjR<3+mXi-P^Zh@+|3e z>^VHXLWWIKgzB5Ra%=bX8CXapb037;QdP|oW7n8reF7efF(OaQ(+}wn8Y+v!i*$F0VQ83J6p zbwl&Gbvig)skIi|`DwU)`N6}zgWZZ@BG`tDjjJbu|)U7VypbG9Q zd*2@E8@(;j>P0tIkwN z6gCQa8jkL=SAA06PM&YN&C9+#%X^^{SMaK2-=&-1J~-lWJMEsNQnTGF+hmnebh7`_ zD1RN?BQW!)Ja{J3*SG+B>Xlo=IhfBx?V?gpF3MfXXXO7*B(hmUB($-r%}0Cc?QT8A zEaY@Q*;my%HTLPRkfwM;WRpmlOX^z{Od*G>-c-Hu&X;7r;hI6r#=Kj?`1YqFx5=4$ zgk<dl55o%hj1-)eJO(;XyPV?u4Aus~hg-2ZhZJRU+1Oeb-3RkmNH}_7$y_ z3mINn!lsU^Xk9v!nO2{h@+(?vkJ!Y^JLNW3%~MoI^0bG2Q{R$^+1yerQ{E_GWi;^C z^7iY*DWL2t*qf&3_E*?E*6e?;CWVhsNw|ffmgQ_iNmUPjA=Edlq4A_kw$9JQQ3>>O z%bFv1l}_ttccH4u^aBTYYljlvL^!93W(*7m=`Jok71cl3$svir=wTpD1`17_g5s|a z8}#_u8nIhWM?-3RqkmMN*Lbk>>Yk=}?>0>KT6~+r9I1HQCz&{QyCnDu3(WClt|10} zv&QX0P%0_&Ihg2V(=yF|WW+z#?6ZBm>);`S(q(QkwBL6!B>&-?RmqPxAxYNL;T8rM z{)32t!^a^E+mNBYB9y|YDnoZ?TUzvaM$t$`p(OVP9TWEU8UwcTVKXY*XI{9OSHymZ zVRXAzVdU?wY_C-qgEWD21{dK@-;vY7zHik5=ci~@I+6rnLWr7SUFw$ z;{AWxDq!&Io>v3njr<}!0vGXCzu$+k>yC|etP}R(&Q^77@+C z@EgU7;Hy#0FM{c8?1vJcMvm_tP|~(K**P=CCaABW*5rJN)-Yb5*77TG%7?}s{m^Uz z-_3ngTx#(WL;>5n7TvM7m7*8B(Q)O?qh6Whn@ITgN9P!|`sP;qXq|=yhNfazKwFfi z6W(>wXIHNN^8%;32g^?iXK;(SBBi*C`6dFo1xdmH!YmxMq`&swN~C5#$7-qx?|FUH zs6JGiz$@5wuA)`qR9Uon-lgNc32wohr~?PzXA`YY4P9eY7PCMWOf(h?(UP1zfQ6y7 z>LPU<&XqA9llj9Nwnv16o?+|J&-kvtPA5glmZ*LHc$;cl^9R*`bq45`|1jx2m}MmP zyEec2yY%d^(QM6p)7bRv5ogRvYbEE92y6v=-c0$gUk^8HJ0|K)A| zslHCv0bQN{r4M-ly!==w5V$P;?Zrtf&(Qt!7QOw_fX??c-K}GWb-pRDtgW>=_LeC0 zROR}Iy}INgbvw;p{1Nc*seuZ*-`lD8{XZK2KbIadV5m`i7M9F_W-ZNpa3zE|_vCAJ zJxToF;^LyuoxrDEK>@<&fFzxh&3(9}A-tQ&LO^WY)3n?+xxmO=6~`_k#KXcGpwmg6Pybty{_VvV$~FERm|$f)iiEMJ!0+1ibtcHA6e~2=L;ue#{G*|b zhro2%xqVohs=OlAU3tse?o|~&n1P{+{AdsSmx(SR0Pw@XZ{?Xa`$RyFY44`Uky?bB z3K@#Q>)FTXzlxO`Xv*19#k~%AJjst>TpXs3ubA<>K(~m^=H(4;Pi1CBjzt4pAwlDV zLw1RlwEsO5|5T_FI^c2Fd80*fTaIVwGImr{{phAtw}=9}7P~oncCM&WR>&VG%I3I! ziM5WcYnM;zx)m@vv~~!Q2u|?ot0I@NyrNMRabOG9F*~4_Wi=?3`tePvh{m};NJPq; zltwE9Lou^$g#_=679E`t8-ZMlk@{xy9!^6MX%|I?CPoF@a;ZxN;_nJ9tNeLxR+*1Q zc%LGZR4dgRRU2&Mt6Ju}0CC0{x$cYs>NofJl`x5_PWc-*2IYwcH}m!H8Qc?Y>D|L+ z_{QRf+waURJ$j0xipDFpzbc~^Z5;ATBtFncb*uQ6!RY>dP5!qs_)qXWLu<5%s}VE| z2RyXil1=G=XeNC%iC_(R)G4^*qu%FgO>To2^QX=r+uQ`U!otGQot>Qn zx|ttT&QrC+o12>@L-~%Wcq`-ki67ER8R~;-lZPwz_x7&udP~?tbai!GDzy;?#%sO3 z#3ahU-Yk!-cSj#g2sIAcTS)NPX_FX< zCEmIgKBV04f4QX54}lrdo@R5o!KoHD;r>OWddSj{oj(lwplbW5TDdVNZ76Dd5MrCW zXuIX^bhU;YX=69IqPy>5-;plc;+GFGxBIbvo7!_DKf=fLBtxq1{MSd9<&i>8`!lZ} zTx%0mGsb0{_T1w^S`Lw6e}SAHk=k~L_kbNM#fgv4rMh={qYk?Nqas% zKIXT5^9m6n2SXvvNoQxiYinx;B#^FJSy z1SBEp4|Fow^b)^*MV6M9s`+OgKk1@&S+x$+=E5Fql*h!xBy*U?E^DGQL5VdrHT&Ay ze5DK<-DTpy65lu;azaW}Z3sGlna*u&IO+Ohpi5{-5q^?Y@tHxi-||sZ_6ARrh?Q}} z)7LEoe3c6=VK`$Fi0{t&w}u_L>Gp@9;gPJ2)tiuo&Vb7d=yy8`0Ea9T!)2gX!{BOK zJx@eLXhkAE{rEXJ_}Qq=)m9wV$ID9?CRvaqE2V6x-A6v5R+9N$F4}Y`f|1iuz2ev2 z-kx$CxO79AJ2{~eP^>XL7y)#6xso9;{zXQKEWWje@ongo2AlKfN(qNy{{bSEwnEbD zb!!wOrnEvS6xW1_9#d_|T~Jv71NK!Fx@#GYCmYLYbTR z!)gOai&jf?NgZ(~`vw724LRB^!GFHUgc6VWUn7uyo7IsNDr*uRa&rLXsUj1*e{c}u zGss&sqQ#rFN)^lvGQSC_tTxfES3+yxYXrJk9;VCEC!wKEl@uUroh^)OEuZ-{A86lB z92qu=0DO3~WzrCsa*&nVu`y<>G8Ta-9w%V3C=xB2xpw07d+0HdYqs#^KC>G456$ow znH%CQ{4F!Yg+0Vn`TP5mCgrU@M_BjiNGUq5l(^eW%G<(xsfk2$TnCGx5wl@SCSg($ z(p1{LttGNDF>S)Hw1l%>g_vXUc63X!typa-LoA{6L}SBU5yb{0aMY(t$ll$)iVk?5 zPQ%$0DKZZ(Td)}jH+W)+&{|WS5NZ;$@jGZUsl)jDt}!W03`pLsN;|je@;&q-fk<>g zblWio;py4hlF5eDmx^G55-=V2@>Hx{%sQA`w8ApGsI))lebP3Gpu*Lwqsfg&9-HtM z_MuI!Wc)i@TZ7oe)X|x}jxI+XIR4fzy}7-X!H+wgR>qMl0`Vu;oo8-BR?7LhVmgum*V4!fVOmN3G3Yew#Vv^doZ%p)Gv|0lI zJnNNk89nAu4s~KC3Hku4UG9<-H$f0(qTABN|tR2X`B7dPcgc_ zzJ4;vSvnv|OqVm;HrSXtlIaBV>$#R?N_hYTKq!s0XZ#i`B`2q=!U}04Q-G;WjVDDO z(SvD4)|b-Fn|-$@2R&sK+XLf2>mq5Kkb}E@vox3LE(E~ItPT-_R}uGQ<5=O(`qYf! z{Ob7OK`NTSO8=&7!{?C|A9lx`D?(P6jj6@Ly_FxB3!Ig#DOOM zhss(ie97zJ{vvy3SU$>DRDL5Vy^Rbw32keJvMoJ(pz-V*3BwX4I9G*abPG1_i(6yI z?n|No|5gRFudkY$(Iz@vp3OT)(4^VP$C{G;Pb~7^sNq5z(uV*%OYZi-^?L#QxkU!TxShRcsk>y31Q@k-+|W78_w-A5tooD%Zg*DP){ zcoua-eaw5I-j=<-dy|C?JGH1EpLv2_J(Q&!7nG2q55LR+KU)I2k2(vX~V0Bd$9R390ZR;Tdr6wP$o8{s%o+dGDuPPw| zTSk#ggxzUy-H7gXJ>ChxO_A7*N;dG@??zb1_x!=yoQXbbxq7ln6jS$Hx$;tkgRJ_V zdf|8JL+3Qi!vdn8s)jF_t5bGXSdU2m&aBa-2w$J5viEUE?4)O(&XWdG1@m{N%VvZv z8T8fg+kELQc*UyXj#$(935*ikmM*)@0tCm!8@y7Aw$R8?F968Xf(q5xdYAV`YoK1I z-f2eO-h&c8o@kwA$Af9F?t?1eOl;$Z#}=A=_y!Q}-hoV$jxu4y+ifO=V>=di%WIdJ zhpX4^vM#!v*&t0sM4fX_R+|$fPEp48TMa+FnFyaZE`Ul|i|JbIXsSqjMQgh3#vfrK zhMWU_`Y$4<0%ky`@5%j8A=Kb^s19H#;@#@A7D-KKnCU~B~bV-V`5pWfX z@H{+O?23DXgrFc*$YDRPgL+4q4n8=Pgo*e^By`~00mvSSXr{&ogIxFYmiyK~8qN$N zY0V4`f|8O&P{g}r!;D+G~ie^pXa3KCZ5ch>!yCVSxjx)?Bu5pcz1g}>)Ul4u)gF&n+L?_Ljx&g zF+Ws9&;R&}9==LT5udU@JT?Ry`9Ry2n7iNv+}(;E21z`laYED^p-o>aXPO#ee)W~K z%}K={Ld&MV!435O-PrsKU#H`)AK0MD3-l@6+1Ak_{QS`%O39xBO?eZ02 zd@!85?=959ofjVrn3RXz-QCQjeI>@865MW+4)S?c>Mju?$+|i^&AuSby{}Z_LJ|^; zdDq*)e=r6Lw!)NR_@AL_J_n$U{a{8+X; z-kR6;T8w0^6RD&PA=lJavgpDjpJ!G6e?;Sq&V-ZW-%WZX+v1XK)*J`KnZ1z@>epNoS)7W0(=qk&1vrDB!S2sSm!x+aM=?8@<9*^*RR{K45Cx z)xAJ^&PF2CQr!oQux(!;a6^^GXSrFUGlLdp32HZ*bG%bHeaKD6wvUY+?{j_j_mA~=ohK=xmn^{=~h zm+J?H=uF}nnpC!@eP6kf3ZYUFc!WhDMSjBsk%bCa!K>%Y?>X)bN+Dd4Y#;j0#;qcq zidNioNYmYT9#x0nIR#in2x}b1jN_= zS(SKObsd~`{l2wCu0q$7Kc9`nW-O?A_9jfz=t>DNn5wp#uMODgG8~TN;YJPha(yPN zeT0OnqP%=*Y_Fg4QEA)9Y&$jL=l|;azuo!AlZ|RhDH6)RMrT@)5fSJxBJ)_F*2kIi z6&#Pl6eo3OLj$O^Q`p4RDZ}w-w98@I=uGX1sJ=#!8vpti;IAiiHkv~pc-mA;AWM+4 zjO}CJe#ap>BG7gp&oRpXu1u&lrZPo*-(%=dS?cu@v`rQI@eS}9U9$!Q``VvPBtk;s zmGRkWy7=iB4jYpi0RJnKz)DvPB2xoqRvB&OHssH5>p3rfh+)8t5#-6TotX6qRDbNE+7>V$|6yua3mh{ zM@iq$IM)XkW`gZ%12O@7?@&d9<~&Ve)=j$#@oA?6^$n}Lr3*N0G0O0!uJ^~1NWT_- zjjxq|u~zzcUK}toxQI8UR^Gg3i(`Jk^B*B0YvHJld$JQWIV;OF@hjC&3BQC8M94$} zww5OI)dOONoR=Fw_wA^I*{tLUu@jO$YJn$ zQL(=LjgOk^fX79$OlaR@ijoqXoi6)n)XiK&a5SOP-q&)p@g$fTJ9zDyAb(Qjaw;&8 znO~#3Jm9ejros~p2=VOpw!${*%VN9jF6M^CU$jXMv|f|W8kXB@jQ^1LS)Vnp-pkG0 z5*aKTS6>NKMn-0~^gvsUJQM`P*VrYBoL(*A2Lanesj!+OGx8Cq5AfE4<;M1>_7QP}TP42$qXUB`? zUguWBPg0WRUj{~cC}dfEk$cyF<1d=>;d?4EWX@=m3#pb{FMX#6oYDD*@*V68uXe>s zh1tzawFfmTE;z6+K0jcO;=-wdSsfg`w7&yC2;CI8d?1S39=szFZ8ijdznl&(+w_oN*E} z496V+2kRky<~ucHBS;vAF;d;peoyRBXO82(hh1epTk5@>#;1+!Fn!a}l2r-tdhq}8 z8~_%4<^|-;KX2-0Uy(^^@oW@>slUohmqqCU_&L;VoklnGY_=s-W;B!7v7z{i?5D|s zOj(A-@kctpU21rf z?es-X>d=jaKmQf19;7eoY92Imasp{}+!qJI%fZA2AraOf1`9I@gX-Xi2axupWio%i zFc0Rc01%|*K|&&QVX7u%@G zSR;_~Iv}#P&6o~kJ! zdQfhxG^UuKyU}9pv9`JM3Qlo7m-|sCh)dlP zYXBf0y5aEvi3t!H6J-q62hl+43neCM$2P%M-o`ExMNz zni4jn^Jxra5(v9QW1wmVP?Ziwq@f3=z3Cl~wGG~YcscL)uuB5~f2xZl##}p)|7`JK zr%-2&&=DWrg-fWnh2fElPg|*1lTW-is3>?WTQIH(%knVT)7!A>J4R|bMU}Al*|6Pf zq4Bol^z&GgsNBh18?LZg5w@*Z;mfqveKFD%$Gky{T(Mp@erW*JFaPwYcKuhU9d>X^ z9Dp@Y5;c*+CLZ%cMohUVW{df}pFR0Y=lIO|ynxrqvgW}_oiJUR7P?;w-QQ)*RZMhQ z^x|&|#EXMh@=92WJ2@y^91HH$Cy_qy$p0i(41x~k;Kl6-R2C@OEUN6$-ZXfvdqEF{ z0Qq^qQX3s)G*eOaNMO~Xjq1_XBq?{1s%eZRD(Afx^$X^2AKOkz2z9q=^-dqZ8f!c+ zf9Z7=EKJuFQ+1pjVBUY0a_H)P_!yV^L@~eA=Xw?6z`!cN}>hYDk%-!)!y{a)kjR4)3H?Sts4mE*|q+2fPz3qZy%&8R`X!t??JY4v4>VG_KxYE;u! zD|bl_Mtu$s#>5Zz#&*1?skp3R!aZ@9=b)}$ zhv)*)QyLV0KRC-=tSHAPX8h>`v!o#lo7WT0<$2e9@mKr#!NC9D|PH#L(<8;4;S zlBgWp!K$nl%jA^qi4TZ@@<#2IGK~_O;@>SnE?>^?MA?RbmoFXg75@oNAZ4C=?ur+) zL2t0U=%^zA;S*z{sLkHHO7({s4T)^yX(5x8qewL~-$$m1_?t6bU|}SEM~b z(3G&HoBSmm-@pv8N~~my^aefh{fi%A(iGZN6y7TF0>+~CHnJGSmWVbG z)rPK#^N)w@PPCUAXwqs1AQxIClj1b31K7Q#-6}6Mc^gG$P@=AMQs(xLW=X<3RQ|x{ zY?nBm%c#3n<{zJ$PU+`; ze5yDvH+tKZLYP*}DH1 zG*Akx7P=WR?gj~I_v??1xI^SXIzwnpup49!=vHAZ4g0Dv34Fv7Kgx<>dpQU}-(d$`y;tXAsE7r2wive!S_h z;Q|1L!M4hcK#H<&8tk8%bb`6Y%l?fz`44GtZ~;_nuUoV=LJ^o%?n+yx&tLaa-e zApl*nZLDHY&=V=lVxg=Q3g%+0E-a!9r>>S8bxg)kb8tnGIC z3{^ZNgm4Vp2>}^!Qk(IXWS#)CfFOmv4%H%&!qRkyBFyv|A}?wZdr9?}d6< z)_sCD82N#f{`ti5e16z#;WMS92MI){Wo@kHmcd=Wk@$|8vg}eJuBxYhhQJ`$si-E% z=B6R7ryGp$Uu)&x?<(1WHm|*6!6UE85D4LNFlaF^Srfcy^p@V*EHms+N%LwZb&j(h zCMB)ZAA8nHY;<&+PHg-AqSIN*miAQzJBQx*M~LxLEYH|4rqoh%5`(haLe2Hu3@Lx+ z5*4M)ixPgE{c8pM$K7`oKwMh+Nh-NTb#)U@5|#8^zKE;?`5GHb%g!B-G~>)J;@?^v z=2kqF*=dy36yB-fHiFg#1qB&K;t${F@7x$Fp#!|RUgW5NR;IK2moi$&8g0WsEy?se zb9c+I*9v3WQR9S&*P%YD(aT$OB`dgSbgsj%B)?m7|NEk>!P$|3r0bZ7w?u;4(A>=J zznTZ_3+=AH0}}Q0^t3TEYrS#fMl-4uy`@!WX zX?3++3-vDBOWyltDb{x3*NDyUVpVKCSf=s`E=C!87S~gMaNEEgr{9g}^&jnMjlnBwAM()U23nAU$WZy?lQA-bfP1_#wDETxh0W?dwXK{~cY$QdY5xZo=feaJ z2#rW#Hin1#(z>4SjCU?*>5zhX2qhcb>@F=W1o1{a)XlI81kobOigLggfb8orM?^(C zm8^()a_b=T6hhL?Khc%9Q*F8fd96{nWy|59otdg1bJ*0%dLC1I7gb)IUH_+`0LbXsThwiWgNcinJI-g0T> zywdmB7$PqxC#?-cx?tKR#fauFksOf05FKS#M9CXxXMU}=R%Up7d^|wOj3A>qj9#P3 zLx~NHjEF`ucJsjT?Kgh{Wpcdn6V|d)5q_%Pj@Z{|$6^Ykm^L9~dMADHB9ngCiT`rD zx{d_=48Dm=t93Qv8Ub<8+iz7>LP4A$31r?UsEJ~QR73y;z98H#F)fXxAb=`GK8_pY z$p1W%5%ud=p)L}zQPsjk?f~U{^*xd0{Ttx0@OVvtZhyfGOFT$7cNK>v?u5tNNfvNG zC`8D`1Q)2#H!tsqjR635rSgc~gC8Z;$;`eFpt9Ys2a1@@{=l94oInP9r#dpB`9)7v zBYbsa73l?)OnNRd6crK8^cC{x*lK3cy{xbKurcPzq@<~Pus*a&ijjo#Xty9X)c6Q( zBDs~(V&~pGX7oY#^bCNFDIdOPKXi1x99sB?+$Kdncp>1dOQ;bN5+aSx{AlLjz*`7{ z?CG_|zU+Ec!f<@JJTAn~*6;mg?@M({&4Zj_wz- zYUp4r%TN`!Bk08iyBrWs%vBuP9_$;B4`yRVyA}v^euBNqeU;z?=Yyq)1RS78=tZ2_e)VK_049lvi9{-b(YlyYJAIHGnTM-xU5%$f!d4y!#ODos z1q0RFU#lb{avz*c=^1o;lsqvL4`td)U*r{iJ3BEf;6135`_Sf~$avx#0l3@LUQeeb zl$~@#$$pC8>uZc0V`_q28zQGFKP{l$>|e9F!O8TVUS1 zFjaS)V7y0fU~pge8Ea;0<)b__8R88fTi?xl9UFU_4h@oh=9eqQ&MHaR5A42F$!hZ^ zwXhaeqekW{&)yAInV4hZ89S|ODvj$Tu`yqj0c=A9TYDf?od7IsrlvY_pb*IeKuy)2dT4Q@TFH&gT^I75~VEtvbPrsE8bLtw)yG zeJdfy;#&2U0_TlKCC2%{7bNucBOxl{WJ^|m#J$r<9u3d81cUWmU4Q93I$x3MMoOku z#f7=qSgsS17`OYpC=COuUjAqZ+@5_YYmxew;DLc`)JZBrBx{JhNX0sstD&L6bp_1~ z%w&C>!k0`eu1C?r;>H z%UCWGede{c-xYtlT83)diJhw6JNKT$Dni?KDvK7&P2QJm*{ZQB1;5Qc_1sGtI27_d zoLf7gZ=J#47QVa)1Rw*7@Vq2o>MmV3GaUC61L6&glFpR?!l7Hg)@{@?I`x)RxBu=;*{|Qs z28|9=OVcNhlI!vzGfDENsDJ|=wOi=JU^Z6}aU*kC+oyDXOLq*^7JsiZ)(2$9- zNfVM!j`y79Nze>l*2tA+JSp5*mP*&Tyii|{A7HmC`K#U;U{{iAD zD%s4`>B2-M?euJ*3s7l1#V6nWNG3EJ zD_4Z+Ef@4&<()YHnPvd@LQ@qtnwBHZ`*Zxxqbo&xhMG(zo`2zCU})RwYf7wo-+{3$ zd^m9Ed47g-T^}tiiJW28{`$x3|%ba zE?+sk)2kz%`~7*!&bTqd6t7Jf!=6JP!1>}2z@wnSJXV~poA8wC*6?@$cmxr@!bN&D zNbvF95qW2-twb-v#i;9RyX{dqbo(d3e0d^LCiuvZdzkXG`DHPf$K*kpPnS0rZ?ey_ zcJgrx%JyIgqd)wp-dNH6bD3Vp$>GW;jAniNkUz`&8ll89WgO)qN+@1Fx+@SST zDG+%jkP8$#)6vk}WZ3tFW`x7_21^Uxvi6Q-_D&uNEtndV-qG6*(dN8z1@!jib1AI} z8YBQbe#*(q%S>3QGBkn==4Q%NT~FJs^JmY>tnsf31K$9=vpmskOK$^tD0-p$-A?O? zW-yds5JXHX(RVGjr!gme5!+_c(l6e2SO4@(8~&zmvEz!fhlg-|G6AFecSheM6YLyy z6U?Ib>#Z44A5}_Z{qa)r!7o14=wf$5%c1hj)WRW{|2xy=@!GrZRCr$8u>6ILs6&Ma zJhBEV8M0t5fUGo;NTB%zJIW|6<<;yc7I|CI`igZd2lxvQG*0KG>230tHyL zi!#Mm{|G#)(jXl+rwr&r*%`^6hRh8a5^=NT-jwo3Y2N*e-djJm82TTZoHTO4j<=)z zY!LN3;YLls-ptu2zE)!^Zqn{Cf05%A|Db==ie=D_1a&K$goRwuEP3+uX*DB~ji%25 zE>`2Z0Xd@E8CQ)EasOlzA>t=M{UL-vQxvGPC`{FL&ssJmu%{1K@|%ZlSTf;rc|uY7 zfqEv2Ww-^G-Q8ilvfNB_>)e#7mKJ5)e8$UA4#*8+z4I{1?eO>rAf7Nc*!9rMnW;D9 zV?_RHn3L-62Wp8L+7#hWV12#tH^P2r_2M;B=$*X*)jP=b^~fBTOYxd(!^Ifp7^bcTa4MpGN&qVLWs*M7LADB-ej(@71*#w8^@y@?dr3Me#@IDOhscc;``>H! zE7mV~i8QH}eZvstMan?-Q4kS=*+_HB=CWoi2});8D+g>I-P2=yI^W;7D>1W=q%*n) z>3GK0<6g@u`dYS&)Si5P9WpOin`0&$aS%ocHa_W3kOu0#T4O1}ys!&Ajq3QWnMU?E z1dvV(qy%C$fVu*>>P2KSO5Q>q%(VJVnV5lURtO+(Es75coTqnGrU4j!Muq?X>N@vm zDDyCmS6iDU)K*lM$+0L*ZrijOret-&DYubJ5n)KV4aRLK8FP%4Z8Ae5mky&eE_0C@ z3<@zZOolOJTvn9JFftayyKnuo?K!-Ez306D{Qmkq-{z(;UCy_p`MRQ{<>1y5qhLJrws_K+`UdZRAIo5omut}pC zO5fN?Vr%mRgI~nPRS*v5*)}R!+B%j=^T*bD>bkH!YpGV%{2}*+dtD&bV6j-QT~PO4 zw=l;<(ux?b_m&YB=;#}a<#--q`LO}uW@w#YVOtA=sK$&Y^M zuo0hKauvMYq z&knK`8p(b%GIFTpID*>QFB$%|o;>)u#>UbvJijBMH)ixze?o0F^Y$!G&#g3|U-(Qj zE_tLeKA1F3JkjTEaZaPMer#g8p+6*c2M2{cZ-jgi5j#wjoc80plzhB3`yp$(MGSPS zoT~LLbAkxn5mO|itKO4F5co8i!^2pm)n9+bmOqZPE_vy;k1+_II%;mX(Oz9Tf`m?RI*~z-;%+4- zk#$7#j^NgvsN>G*zGuaZ*}F@^h7Txe(3hHV7Zt1$SzF&X3(5|CeV4pcM;Ol_Bi|A| zq&8%`>_^ReGZhW5hzzNRJTD=?x=yS2>RuF_sGmIr;E#E^8gVYdozzN)Mdsw4S33o2E=4?9vxS`9V<>+M2 ziWZRHl#$7ONbJousORdC462r0Nk!fb&fVb?d6<-faSzdoOZF4c{weR0@Z|}lKO}K= zU~WMsi$7pu*XnpwMZ14vo^zVHO-py3r-bV+0%>2HZ0WCqkelt6N{4>9-8f@iL z4mw#YY+`UbY+u%JpAPg(^JG(@uB};szt1fR74S=jGEHcS-v*H_c?)*h2F-;mb9;wH zn=5A~i!4Sx=W)~*uh=<*oTb#tXIQ9#=ob-Lig4mo%CPf9vXq&FEO@>a#RYkkjlYnNB;N3tf@ z>Hl?8|9@MQ3#h4tACvWkPJ$BLDrzL6#jJj{te6SRV-(0&U-^e50Q!pzj0cNV6EJ0}?~-%fzKW T{rGST_*}5Q_(%DfYf1kCSEFaL literal 0 HcmV?d00001 diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" deleted file mode 100644 index c23f7201..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0001-20251104commit.patch" +++ /dev/null @@ -1,1272 +0,0 @@ -From d61fd429337580809fe74a59b1dfa81b91094dae Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Tue, 4 Nov 2025 09:11:51 +0800 -Subject: [PATCH 01/10] 20251104commit - ---- - mindnlp/transformers/cache_utils.py | 28 +- - .../models/deepseek/modeling_deepseek.py | 149 ++- - .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- - 3 files changed, 976 insertions(+), 87 deletions(-) - -diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -index cadd2e04..02f8d4be 100644 ---- a/mindnlp/transformers/cache_utils.py -+++ b/mindnlp/transformers/cache_utils.py -@@ -812,14 +812,26 @@ class StaticCache(Cache): - # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. - # k_out[:, :, cache_position] = key_states - # v_out[:, :, cache_position] = value_states -- if ON_ORANGE_PI: -- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -- else: -- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -- -+ # if ON_ORANGE_PI: -+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+ # else: -+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+ # 确保 cache_position 是 1D tensor 并且类型正确 -+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+ if cache_position.ndim > 1: -+ cache_position = cache_position.flatten() -+ # 确保类型是 int32 或 int64(MindSpore 要求) -+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+ cache_position = cache_position.int() -+ -+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+ k_out[:, :, cache_position] = key_states -+ v_out[:, :, cache_position] = value_states -+ - return k_out, v_out - - def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index c695b944..d8303e45 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): - # Copied from transformers.models.llama.modeling_llama.rotate_half - def rotate_half(x): - """Rotates half the hidden dims of the input.""" -- x1 = x[..., : x.shape[-1] // 2] -- x2 = x[..., x.shape[-1] // 2 :] -+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+ # x1 = x[..., : x.shape[-1] // 2] -+ # x2 = x[..., x.shape[-1] // 2 :] -+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - - -@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): - if self.training: - raise NotImplementedError("Training is not supported yet.") - else: -- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -- if self.config.n_shared_experts is not None: -- y = y + self.shared_experts(identity) -- return y -+ # @lwx -+ if orig_shape[1] == 1: -+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+ y=y.view(*orig_shape) -+ if self.config.n_shared_experts is not None: -+ y = y + self.shared_experts(identity) -+ return y -+ else: -+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+ if self.config.n_shared_experts is not None: -+ y = y + self.shared_experts(identity) -+ return y -+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+ # if self.config.n_shared_experts is not None: -+ # y = y + self.shared_experts(identity) -+ # return y -+ -+ @no_grad() -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ -+ expert_cache = ops.zeros_like(x) -+ for i in range(self.num_experts_per_tok): -+ expert_id = flat_expert_indices[i].item() -+ weight = flat_expert_weights[i].item() -+ expert = self.experts[expert_id] -+ expert_out = expert(x) -+ expert_cache += expert_out * weight -+ return expert_cache - - @no_grad() -- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -- # expert_cache = torch.zeros_like(x) -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -- # token_idxs = idxs // self.num_experts_per_tok -- # for i, end_idx in enumerate(tokens_per_expert): -- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -- # if start_idx == end_idx: -- # continue -- # expert = self.experts[i] -- # exp_token_idx = token_idxs[start_idx:end_idx] -- # expert_tokens = x[exp_token_idx] -- # expert_out = expert(expert_tokens) -- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -- # return expert_cache -+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): - expert_cache = ops.zeros_like(x) - idxs = flat_expert_indices.argsort() - tokens_per_expert = flat_expert_indices.bincount().cumsum(0) - token_idxs = idxs // self.num_experts_per_tok -+ - for i, end_idx in enumerate(tokens_per_expert): - start_idx = 0 if i == 0 else tokens_per_expert[i-1] - if start_idx == end_idx: -@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): - expert_out = expert(expert_tokens) - expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) - expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+ - return expert_cache -+ -+ # @no_grad() -+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+ # # expert_cache = torch.zeros_like(x) -+ # # idxs = flat_expert_indices.argsort() -+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+ # # token_idxs = idxs // self.num_experts_per_tok -+ # # for i, end_idx in enumerate(tokens_per_expert): -+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+ # # if start_idx == end_idx: -+ # # continue -+ # # expert = self.experts[i] -+ # # exp_token_idx = token_idxs[start_idx:end_idx] -+ # # expert_tokens = x[exp_token_idx] -+ # # expert_out = expert(expert_tokens) -+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+ # # return expert_cache -+ # expert_cache = ops.zeros_like(x) -+ # idxs = flat_expert_indices.argsort() -+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ # token_idxs = idxs // self.num_experts_per_tok -+ -+ # for i, end_idx in enumerate(tokens_per_expert): -+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ # if start_idx == end_idx: -+ # continue -+ # expert = self.experts[i] -+ # exp_token_idx = token_idxs[start_idx:end_idx] -+ # expert_tokens = x[exp_token_idx] -+ # expert_out = expert(expert_tokens) -+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+ -+ # return expert_cache -+ # @no_grad() -+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+ # expert_cache = ops.zeros_like(x) -+ -+ # # 排序保证顺序一致 -+ # idxs = flat_expert_indices.argsort() -+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ # token_idxs = idxs // self.num_experts_per_tok -+ -+ # # 找出有 token 的专家 -+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+ -+ # for i in active_experts.tolist(): -+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ # end_idx = tokens_per_expert[i] -+ # if start_idx == end_idx: # 没有 token -+ # continue -+ -+ # exp_token_idx = token_idxs[start_idx:end_idx] -+ # expert_tokens = x[exp_token_idx] -+ # expert_out = self.experts[i](expert_tokens) -+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+ -+ # expert_cache = mindspore.mint.scatter_add( -+ # expert_cache, -+ # 0, -+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+ # expert_out -+ # ) -+ -+ # return expert_cache -+ -+ - - # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): - # """ -@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - - # Initialize weights and apply final processing - self.post_init() -+ self.warm_up = False -+ -+ def warmup_moe_model_deep(self): -+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+ test_texts = [ -+ "warmup short", -+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+ ] -+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+ if tokenizer is None: -+ from mindnlp.transformers import AutoTokenizer -+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+ self._warmup_tokenizer = tokenizer -+ -+ for text in test_texts: -+ inputs = tokenizer(text, return_tensors="ms") -+ with mindspore._no_grad(): -+ _ = self(**inputs, use_cache=False) -+ print("[Warmup] DeepSeek-MoE 模型预热完成。") - - def get_input_embeddings(self): - return self.model.embed_tokens -@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] - "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." - ```""" -+ if not self.warm_up: -+ self.warm_up = True -+ self.warmup_moe_model_deep() -+ - output_attentions = ( - output_attentions - if output_attentions is not None -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index 3cbf820e..d4c6b651 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -18,7 +18,6 @@ - # See the License for the specific language governing permissions and - # limitations under the License. - """MindSpore Qwen2MoE model.""" -- - import math - from typing import List, Optional, Tuple, Union - -@@ -36,6 +35,7 @@ from ...modeling_outputs import ( - TokenClassifierOutput, - ) - from ...modeling_utils import PreTrainedModel -+from ...generation import GenerationMixin - from ....utils import logging - from .configuration_qwen2_moe import Qwen2MoeConfig - -@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): - self.variance_epsilon = eps - - def forward(self, hidden_states): -+ # @dwj -+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+ # @lwx -+ # if not self.training : -+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) - input_dtype = hidden_states.dtype - hidden_states = hidden_states.to(mindspore.float32) - variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -@@ -234,6 +239,8 @@ def rotate_half(x): - """Rotates half the hidden dims of the input.""" - x1 = x[..., : x.shape[-1] // 2] - x2 = x[..., x.shape[-1] // 2 :] -+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - - -@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): - self.config = config - self.hidden_size = config.hidden_size - self.intermediate_size = intermediate_size -+ - self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) - self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) - self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) - self.act_fn = ACT2FN[config.hidden_act] - - def forward(self, x): -- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -- - -+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+ # @lwx -+ # gate_up_output = self.gate_up_proj(x) -+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+ # return self.down_proj(swiglu_output) -+ -+ # def forward(self, x): -+ # gate_proj_out = self.gate_proj(x) -+ # up_proj_out = self.up_proj(x) -+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+ # return self.down_proj(swiglu_out) -+ - # Copied from transformers.models.llama.modeling_llama.repeat_kv - def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: - """ -@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): - use_cache: bool = False, - cache_position: Optional[mindspore.Tensor] = None, - ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+ -+ - bsz, q_len, _ = hidden_states.shape - - query_states = self.q_proj(hidden_states) -@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): - "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " - "with a layer index." - ) -- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ if isinstance(past_key_value, StaticCache): -+ kv_seq_len = key_states.shape[-2] -+ else: -+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) - cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) - query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) - - if past_key_value is not None: - cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models - key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+ if isinstance(past_key_value, StaticCache): -+ kv_seq_len = key_states.shape[-2] - - # repeat k/v heads if n_kv_heads < n_heads - key_states = repeat_kv(key_states, self.num_key_value_groups) - value_states = repeat_kv(value_states, self.num_key_value_groups) -- -+ - attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) - -- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -- raise ValueError( -- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -- f" {attn_weights.shape}" -- ) -- -- if attention_mask is not None: # no matter the length, we just slice it -- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+ if attention_mask is not None: -+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] - attn_weights = attn_weights + causal_mask - - # upcast attention to fp32 -@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): - attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) - - attn_output = self.o_proj(attn_output) -- -+ # @lwx -+ -+ # max_seq_len = self.max_position_embeddings # 2048 -+ -+ # if attention_mask is not None: -+ # # attention_mask: [B, 1, Sq, Sk] -+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+ -+ # # pad 到 [max_seq_len, max_seq_len] -+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+ # global_attention_mask = padded_mask -+ # else: -+ # global_attention_mask = None -+ -+ -+ # sparse_mode=3 -+ # attn_output = mindspore.ops.flash_attention_score( -+ # query=query_states, -+ # key=key_states, -+ # value=value_states, -+ # real_shift=None, -+ # padding_mask=None, -+ -+ # head_num=self.num_heads, -+ # attn_mask=global_attention_mask, -+ # keep_prob=1.0 - self.attention_dropout, -+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+ # input_layout="BNSD", -+ # pre_tokens=2147483647, -+ # next_tokens=2147483647, -+ # inner_precise=0, -+ # drop_mask=None, -+ # prefix=None, -+ # actual_seq_qlen=None, -+ # actual_seq_kvlen=None, -+ # sparse_mode=sparse_mode, -+ # ) - if not output_attentions: - attn_weights = None - - return attn_output, attn_weights, past_key_value - - -+class Qwen2MoeFlashAttention(nn.Module): -+ """ -+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+ -+ 关键改动: -+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+ 直接传入原始的 key 和 value 张量效率更高。 -+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+ """ -+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+ super().__init__() -+ self.config = config -+ self.layer_idx = layer_idx -+ self.hidden_size = config.hidden_size -+ self.num_heads = config.num_attention_heads -+ self.head_dim = self.hidden_size // self.num_heads -+ self.num_key_value_heads = config.num_key_value_heads -+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+ self.max_position_embeddings = config.max_position_embeddings -+ self.rope_theta = config.rope_theta -+ self.attention_dropout = config.attention_dropout -+ -+ if (self.head_dim * self.num_heads) != self.hidden_size: -+ raise ValueError( -+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+ ) -+ -+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+ -+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+ self.head_dim, -+ max_position_embeddings=self.max_position_embeddings, -+ base=self.rope_theta, -+ ) -+ -+ def forward( -+ self, -+ hidden_states: mindspore.Tensor, -+ attention_mask: Optional[mindspore.Tensor] = None, -+ position_ids: Optional[mindspore.Tensor] = None, -+ past_key_value: Optional[Cache] = None, -+ output_attentions: bool = False, -+ use_cache: bool = False, -+ cache_position: Optional[mindspore.Tensor] = None, -+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+ bsz, q_len, _ = hidden_states.shape -+ -+ # 1. 线性投射 Q, K, V -+ query_states = self.q_proj(hidden_states) -+ key_states = self.k_proj(hidden_states) -+ value_states = self.v_proj(hidden_states) -+ -+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+ # query: [B, S, H*D] -> [B, N1, S, D] -+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+ # 3. RoPE 旋转位置编码 -+ kv_seq_len = key_states.shape[-2] -+ if past_key_value is not None: -+ if self.layer_idx is None: -+ raise ValueError( -+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+ "with a layer index." -+ ) -+ # 对于 StaticCache,需要特殊处理 kv_seq_len -+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+ if cache_position.shape[0] == 1: -+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+ kv_seq_len = past_seen_tokens + 1 -+ else: -+ # prefill 阶段:cache_position 是范围,使用其长度 -+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+ else: -+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ # 4. KV 缓存更新 -+ if past_key_value is not None: -+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+ key_states, value_states = past_key_value.update( -+ key_states, value_states, self.layer_idx, cache_kwargs -+ ) -+ -+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+ if cache_position.shape[0] == 1: -+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+ kv_seq_len = key_states.shape[-2] -+ -+ # 5. [重要] 准备 Attention Mask -+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+ fa_attention_mask = None -+ if attention_mask is not None: -+ # 截取与当前key长度匹配的部分 -+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+ # 转换为布尔类型: 大负数 -> True, 0 -> False -+ fa_attention_mask = (mask_slice != 0) -+ -+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+ input_dtype = query_states.dtype -+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+ query_states = query_states.to(mindspore.float16) -+ key_states = key_states.to(mindspore.float16) -+ value_states = value_states.to(mindspore.float16) -+ -+ # 6. [核心] 调用 flash_attention_score 算子 -+ # - 无需手动 repeat_kv, 算子原生支持 GQA -+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+ attn_output = mindspore.ops.flash_attention_score( -+ query=query_states, -+ key=key_states, -+ value=value_states, -+ head_num=self.num_heads, # 传入Q的头数(N1) -+ attn_mask=fa_attention_mask, -+ keep_prob=1.0 - self.attention_dropout, -+ scalar_value=1.0 / math.sqrt(self.head_dim), -+ input_layout="BNSD", -+ sparse_mode=0 # 使用 defaultMask 模式 -+ ) -+ -+ # 恢复原始数据类型 -+ attn_output = attn_output.to(input_dtype) -+ -+ # 7. 调整输出形状 -+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ attn_output = self.o_proj(attn_output) -+ -+ # FlashAttention 算子不直接返回注意力权重矩阵 -+ attn_weights = None -+ if output_attentions: -+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+ -+ return attn_output, attn_weights, past_key_value -+ -+ # def forward( -+ # self, -+ # hidden_states: mindspore.Tensor, -+ # attention_mask: Optional[mindspore.Tensor] = None, -+ # position_ids: Optional[mindspore.Tensor] = None, -+ # past_key_value: Optional[Cache] = None, -+ # output_attentions: bool = False, -+ # use_cache: bool = False, -+ # cache_position: Optional[mindspore.Tensor] = None, -+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+ # bsz, q_len, _ = hidden_states.shape -+ -+ # # 1. 线性投射 Q, K, V -+ # query_states = self.q_proj(hidden_states) -+ # key_states = self.k_proj(hidden_states) -+ # value_states = self.v_proj(hidden_states) -+ -+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+ # # 3. RoPE 旋转位置编码 -+ # kv_seq_len = key_states.shape[-2] -+ # if past_key_value is not None: -+ # if self.layer_idx is None: -+ # raise ValueError( -+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+ # "with a layer index." -+ # ) -+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ # # 4. KV 缓存更新 -+ # if past_key_value is not None: -+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+ # key_states, value_states = past_key_value.update( -+ # key_states, value_states, self.layer_idx, cache_kwargs -+ # ) -+ -+ # # 5. 准备 Attention Mask -+ # fa_attention_mask = None -+ # if attention_mask is not None: -+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+ # fa_attention_mask = (mask_slice != 0) -+ -+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+ # input_dtype = query_states.dtype -+ -+ # # 6. [核心] 调用 flash_attention_score 算子 -+ # attn_output = mindspore.ops.flash_attention_score( -+ # query=query_states, -+ # key=key_states, -+ # value=value_states, -+ # head_num=self.num_heads, -+ # attn_mask=fa_attention_mask, -+ # keep_prob=1.0 - self.attention_dropout, -+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+ # input_layout="BNSD", -+ # sparse_mode=0, -+ # # <--- 修改点 2: 启用内部高精度计算 --- -+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+ # inner_precise=1 -+ # ) -+ -+ # # 恢复原始数据类型 -+ # attn_output = attn_output.to(input_dtype) -+ -+ # # 7. 调整输出形状 -+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ # attn_output = self.o_proj(attn_output) -+ -+ # attn_weights = None -+ # if output_attentions: -+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+ -+ # return attn_output, attn_weights, past_key_value -+ -+ # def forward( -+ # self, -+ # hidden_states: mindspore.Tensor, -+ # attention_mask: Optional[mindspore.Tensor] = None, -+ # position_ids: Optional[mindspore.Tensor] = None, -+ # past_key_value: Optional[Cache] = None, -+ # output_attentions: bool = False, -+ # use_cache: bool = False, -+ # cache_position: Optional[mindspore.Tensor] = None, -+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+ # bsz, q_len, _ = hidden_states.shape -+ -+ # query_states = self.q_proj(hidden_states) -+ # key_states = self.k_proj(hidden_states) -+ # value_states = self.v_proj(hidden_states) -+ -+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+ # kv_seq_len = key_states.shape[-2] -+ # if past_key_value is not None: -+ # if self.layer_idx is None: -+ # raise ValueError("`layer_idx` must be specified for caching") -+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ # if past_key_value is not None: -+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+ # key_states, value_states = past_key_value.update( -+ # key_states, value_states, self.layer_idx, cache_kwargs -+ # ) -+ -+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+ -+ # # <--- 核心修改点: 手动进行高精度缩放 --- -+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+ # query_states = query_states / math.sqrt(self.head_dim) -+ # # <--- 修改结束 --- -+ -+ # fa_attention_mask = None -+ # if attention_mask is not None: -+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+ # fa_attention_mask = (mask_slice != 0) -+ -+ # input_dtype = query_states.dtype -+ -+ # attn_output = mindspore.ops.flash_attention_score( -+ # query=query_states, # 传入已经预先缩放过的 query -+ # key=key_states, -+ # value=value_states, -+ # head_num=self.num_heads, -+ # attn_mask=fa_attention_mask, -+ # keep_prob=1.0 - self.attention_dropout, -+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+ # input_layout="BNSD", -+ # sparse_mode=0, -+ # inner_precise=1 # 仍然保持内部高精度计算 -+ # ) -+ -+ # attn_output = attn_output.to(input_dtype) -+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ # attn_output = self.o_proj(attn_output) -+ -+ # attn_weights = None -+ # if output_attentions: -+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+ -+ # return attn_output, attn_weights, past_key_value -+ - QWEN2MOE_ATTENTION_CLASSES = { - "eager": Qwen2MoeAttention, -+ "flash-attention": Qwen2MoeFlashAttention, - } - - -@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) - self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) - -+ #@dwj -+ # 只遍历激活的专家,而非全部专家 - def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -- batch_size, sequence_length, hidden_dim = hidden_states.shape -- hidden_states = hidden_states.view(-1, hidden_dim) -- # router_logits: (batch * sequence_length, n_experts) -- router_logits = self.gate(hidden_states) -- -- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- if self.norm_topk_prob: -- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- # we cast back to the input dtype -- routing_weights = routing_weights.to(hidden_states.dtype) -- -- final_hidden_states = ops.zeros( -- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -- ) -- -- # One hot encode the selected experts to create an expert mask -- # this will be used to easily index which expert is going to be sollicitated -- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -- -- # Loop over all available experts in the model and perform the computation on each expert -- for expert_idx in range(self.num_experts): -- expert_layer = self.experts[expert_idx] -- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -- -- # Index the correct hidden states and compute the expert hidden state for -- # the current expert. We need to make sure to multiply the output hidden -- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -- if 0 not in idx.shape: -- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -- -- # However `index_add_` only support torch tensors for indexing so we'll use -- # the `top_x` tensor here. -- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -- -- shared_expert_output = self.shared_expert(hidden_states) -- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -- -- final_hidden_states = final_hidden_states + shared_expert_output -+ batch_size, sequence_length, hidden_dim = hidden_states.shape -+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+ num_tokens = hidden_states_reshaped.shape[0] -+ -+ router_logits = self.gate(hidden_states_reshaped) -+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+ if self.norm_topk_prob: -+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ routing_weights = routing_weights.to(hidden_states.dtype) -+ -+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+ flat_selected_experts = selected_experts.flatten() -+ -+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+ token_indices = broadcasted_token_indices.flatten() -+ -+ active_experts = ops.unique(flat_selected_experts) -+ -+ for expert_idx_tensor in active_experts: -+ expert_idx = expert_idx_tensor.item() -+ expert_layer = self.experts[expert_idx] -+ -+ mask = (flat_selected_experts == expert_idx_tensor) -+ selected_token_indices = token_indices[mask] -+ selected_routing_weights = routing_weights.flatten()[mask] -+ -+ current_states = hidden_states_reshaped[selected_token_indices] -+ -+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+ -+ final_hidden_states = final_hidden_states.index_add( -+ dim=0, -+ index=selected_token_indices, -+ source=expert_output.to(hidden_states.dtype) -+ ) -+ -+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output - -- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -- return final_hidden_states, router_logits -+ final_hidden_states = final_hidden_states + shared_expert_output -+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+ -+ return final_hidden_states, router_logits - - - class Qwen2MoeDecoderLayer(nn.Module): -@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): - - self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - -+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+ - if (layer_idx not in config.mlp_only_layers) and ( - config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 - ): -@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): - _no_split_modules = ["Qwen2MoeDecoderLayer"] - _skip_keys_device_placement = "past_key_values" - _supports_cache_class = True -+#lwx -+ # _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range -@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): - return causal_mask - - --class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - _tied_weights_keys = ["lm_head.weight"] - - def __init__(self, config): -@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): - self.num_experts_per_tok = config.num_experts_per_tok - # Initialize weights and apply final processing - self.post_init() -+ # @lwx -+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+ # self.generation_config.cache_implementation = "static" -+ self._warmed_up = False -+ -+ def warmup_moe_model(self): -+ print("[Warmup] Qwen2-MoE 模型预热开始...") -+ test_texts = [ -+ "warmup short", -+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+ ] -+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+ if tokenizer is None: -+ from mindnlp.transformers import AutoTokenizer -+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+ self._warmup_tokenizer = tokenizer -+ -+ for text in test_texts: -+ inputs = tokenizer(text, return_tensors="ms") -+ with mindspore._no_grad(): -+ _ = self(**inputs, output_router_logits=True, use_cache=False) -+ print("[Warmup] Qwen2-MoE 模型预热完成。") - - def get_input_embeddings(self): - return self.model.embed_tokens -@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): - >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] - "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." - ```""" -+ if not self._warmed_up: -+ self._warmed_up = True -+ self.warmup_moe_model() - - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions - output_router_logits = ( -@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): - } - ) - return model_inputs -+# @lwx -+ # def _decode_one_tokens_logits( -+ # self, -+ # cur_token: mindspore.Tensor, -+ # input_pos: Optional[mindspore.Tensor], -+ # cache_position: mindspore.Tensor, -+ # past_key_values: StaticCache, -+ # ) -> mindspore.Tensor: -+ # """ -+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+ -+ # Args: -+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+ # input_pos: 输入位置信息,可选 -+ # cache_position: 当前token在cache中的位置,shape为(1,) -+ # past_key_values: StaticCache对象,存储之前的key-value状态 -+ -+ # Returns: -+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+ # """ -+ # # 调用JIT编译的版本 -+ # return self.get_decode_one_tokens_logits( -+ # cur_token=cur_token, -+ # input_pos=input_pos, -+ # cache_position=cache_position, -+ # past_key_values=past_key_values, -+ # ) -+ -+ # @mindspore.jit(jit_level='O1') -+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+ # """ -+ # JIT编译的函数,用于高效的单token解码 -+ # 使用JIT编译优化以支持静态shape和高效执行 -+ -+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+ # """ -+ # outputs = self.model.forward( -+ # input_ids=cur_token, -+ # position_ids=input_pos, -+ # cache_position=cache_position, -+ # past_key_values=past_key_values, -+ # use_cache=True, -+ # return_dict=False, -+ # ) -+ -+ # hidden_states = outputs[0] -+ # logits = self.lm_head.forward(hidden_states) -+ # logits = logits.float() -+ -+ # return logits[:, -1, :] -+ -+ # def _sample( -+ # self, -+ # input_ids: mindspore.Tensor, -+ # logits_processor, -+ # stopping_criteria, -+ # generation_config, -+ # synced_devices: bool, -+ # streamer=None, -+ # logits_warper=None, -+ # **model_kwargs, -+ # ): -+ # """ -+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+ # """ -+ # from ...generation.logits_process import LogitsProcessorList -+ # from ...generation.stopping_criteria import StoppingCriteriaList -+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+ # from mindnlp.core import nn, ops, no_grad -+ # import numpy as np -+ -+ # # 检查是否使用 StaticCache -+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+ # # 否则,直接调用父类方法 -+ # past_key_values = model_kwargs.get("past_key_values") -+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+ -+ # if not isinstance(past_key_values, StaticCache): -+ # # 不使用 StaticCache,直接调用父类方法 -+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+ # return super()._sample( -+ # input_ids=input_ids, -+ # logits_processor=logits_processor, -+ # stopping_criteria=stopping_criteria, -+ # generation_config=generation_config, -+ # synced_devices=synced_devices, -+ # streamer=streamer, -+ # logits_warper=logits_warper, -+ # **model_kwargs, -+ # ) -+ -+ # # 使用 StaticCache,进入自定义循环 -+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+ # pad_token_id = generation_config._pad_token_tensor -+ # output_attentions = generation_config.output_attentions -+ # output_hidden_states = generation_config.output_hidden_states -+ # output_scores = generation_config.output_scores -+ # output_logits = generation_config.output_logits -+ # return_dict_in_generate = generation_config.return_dict_in_generate -+ # max_length = generation_config.max_length -+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+ # do_sample = generation_config.do_sample -+ -+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+ # raise ValueError( -+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+ # f"{logits_warper})." -+ # ) -+ -+ # # init attention / hidden states / scores tuples -+ # scores = () if (return_dict_in_generate and output_scores) else None -+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+ -+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+ # if return_dict_in_generate and self.config.is_encoder_decoder: -+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+ # encoder_hidden_states = ( -+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+ # ) -+ -+ # # keep track of which sequences are already finished -+ # batch_size, cur_len = input_ids.shape -+ # this_peer_finished = False -+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+ -+ # time_record = [] -+ # from ....utils.testing_utils import parse_flag_from_env -+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+ -+ # while self._has_unfinished_sequences( -+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+ # ): -+ # if _record_time: -+ # import time as time_module -+ # infer_start = time_module.time() -+ -+ # # prepare model inputs -+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+ -+ # # prepare variable output controls -+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+ -+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+ # cur_cache_position = model_inputs.get("cache_position") -+ # cur_past_key_values = model_inputs.get("past_key_values") -+ # cur_input_ids = model_inputs.get("input_ids") -+ -+ # if (isinstance(cur_past_key_values, StaticCache) and -+ # cur_cache_position is not None and -+ # len(cur_cache_position.shape) > 0 and -+ # cur_cache_position.shape[0] == 1 and -+ # cur_input_ids is not None and -+ # cur_input_ids.shape[1] == 1): -+ # # 使用 JIT 优化的单 token 解码 -+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+ # if not hasattr(self, '_jit_used'): -+ # self._jit_used = False -+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+ -+ # next_token_logits = self.get_decode_one_tokens_logits( -+ # cur_token=cur_input_ids, -+ # input_pos=model_inputs.get("position_ids"), -+ # cache_position=cur_cache_position, -+ # past_key_values=cur_past_key_values, -+ # ) -+ -+ # # 标记已使用JIT(用于后续判断) -+ # if not self._jit_used: -+ # self._jit_used = True -+ -+ # # 构造兼容的输出对象 -+ # class JitOptimizedOutput: -+ # def __init__(self, logits, config): -+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+ # self.config = config -+ # # 对于 JIT 优化路径,这些属性通常不需要 -+ # self.decoder_attentions = None if config.is_encoder_decoder else None -+ # self.attentions = None if not config.is_encoder_decoder else None -+ # self.cross_attentions = None -+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+ # self.hidden_states = None if not config.is_encoder_decoder else None -+ -+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+ # else: -+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+ # outputs = self(**model_inputs, return_dict=True) -+ -+ # if synced_devices and this_peer_finished: -+ # continue -+ -+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+ # next_token_logits = outputs.logits[:, -1, :] -+ -+ # # pre-process distribution -+ # next_token_scores = logits_processor(input_ids, next_token_logits) -+ # if do_sample: -+ # next_token_scores = logits_warper(input_ids, next_token_scores) -+ -+ # # Store scores, attentions and hidden_states when required -+ # if return_dict_in_generate: -+ # if output_scores: -+ # scores += (next_token_scores,) -+ # if output_logits: -+ # raw_logits += (next_token_logits,) -+ # if output_attentions: -+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+ # decoder_attentions += (attn,) if attn is not None else (None,) -+ # if self.config.is_encoder_decoder: -+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+ -+ # if output_hidden_states: -+ # hidden = ( -+ # outputs.decoder_hidden_states -+ # if self.config.is_encoder_decoder -+ # else outputs.hidden_states -+ # ) -+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+ -+ # # token selection -+ # if do_sample: -+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+ # else: -+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+ -+ # # finished sentences should have their next token be a padding token -+ # if has_eos_stopping_criteria: -+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+ -+ # # update generated ids, model inputs, and length for next step -+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+ # if streamer is not None: -+ # streamer.put(next_tokens) -+ -+ # model_kwargs = self._update_model_kwargs_for_generation( -+ # outputs, -+ # model_kwargs, -+ # is_encoder_decoder=self.config.is_encoder_decoder, -+ # ) -+ -+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+ # cur_len += 1 -+ -+ # if _record_time: -+ # import time as time_module -+ # infer_stop = time_module.time() -+ # time_record.append(infer_stop - infer_start) -+ -+ # del outputs -+ -+ # average_infer_time = None -+ # if time_record: -+ # if len(time_record) > 1: -+ # time_record.pop(0) -+ # average_infer_time = sum(time_record) / len(time_record) -+ # print(f'average inference time is: {average_infer_time}') -+ # print(f'inference time record: {time_record}') -+ -+ # if streamer is not None: -+ # streamer.end() -+ -+ # # 简单判断:打印是否使用了JIT路径 -+ # if hasattr(self, '_jit_used') and self._jit_used: -+ # print("[JIT] ✓ JIT optimization was used during generation") -+ # else: -+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+ -+ # if return_dict_in_generate: -+ # if self.config.is_encoder_decoder: -+ # return GenerateEncoderDecoderOutput( -+ # sequences=input_ids, -+ # scores=scores, -+ # logits=raw_logits, -+ # encoder_attentions=encoder_attentions, -+ # encoder_hidden_states=encoder_hidden_states, -+ # decoder_attentions=decoder_attentions, -+ # cross_attentions=cross_attentions, -+ # decoder_hidden_states=decoder_hidden_states, -+ # past_key_values=model_kwargs.get("past_key_values"), -+ # average_infer_time=average_infer_time -+ # ) -+ # else: -+ # return GenerateDecoderOnlyOutput( -+ # sequences=input_ids, -+ # scores=scores, -+ # logits=raw_logits, -+ # attentions=decoder_attentions, -+ # hidden_states=decoder_hidden_states, -+ # past_key_values=model_kwargs.get("past_key_values"), -+ # average_infer_time=average_infer_time -+ # ) -+ # else: -+ # return input_ids -+ -+ # def _prepare_cache_for_generation( -+ # self, -+ # generation_config, -+ # model_kwargs, -+ # assistant_model, -+ # batch_size, -+ # max_cache_length, -+ # ): -+ # if generation_config.cache_implementation is None and self._supports_static_cache: -+ # generation_config.cache_implementation = "static" -+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+ -+ # if generation_config.cache_implementation == "static": -+ # base_required_from_max_length = generation_config.max_length + 1 -+ # base_required = max(max_cache_length, base_required_from_max_length) -+ # min_cache_size = 50 -+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+ # else: -+ # max_cache_length = max(base_required, min_cache_size) -+ -+ # original_max_cache_length = max_cache_length -+ # print(f"[JIT] StaticCache max_cache_length calculation:") -+ # print(f" - input max_cache_length: {original_max_cache_length}") -+ # print(f" - generation_config.max_length: {generation_config.max_length}") -+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+ # print(f" - final max_cache_length: {max_cache_length}") -+ -+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+ # if max_cache_length > self.config.max_position_embeddings: -+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+ -+ # result = super()._prepare_cache_for_generation( -+ # generation_config=generation_config, -+ # model_kwargs=model_kwargs, -+ # assistant_model=assistant_model, -+ # batch_size=batch_size, -+ # max_cache_length=max_cache_length, -+ # ) -+ -+ # if generation_config.cache_implementation == "static": -+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+ # created_cache = model_kwargs.get(cache_name) -+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+ # if created_cache.max_cache_len < generation_config.max_length: -+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+ -+ # return result -+ -+ -+ - - - # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" deleted file mode 100644 index baee9388..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0002-20251106commit.patch" +++ /dev/null @@ -1,3200 +0,0 @@ -From dcd6fc7b6307db27f23087ba3958949eb52a9beb Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Thu, 6 Nov 2025 09:20:38 +0800 -Subject: [PATCH 02/10] 20251106commit - ---- - .../models/deepseek/modeling_deepseek.py | 379 ++++- - .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- - patches/0001-20251104commit.patch | 1272 ++++++++++++++++ - 3 files changed, 2689 insertions(+), 305 deletions(-) - create mode 100644 patches/0001-20251104commit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index d8303e45..73773c22 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): - # y = y + self.shared_experts(identity) - # return y - -+ # @no_grad() -+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ -+ # expert_cache = ops.zeros_like(x) -+ # for i in range(self.num_experts_per_tok): -+ # expert_id = flat_expert_indices[i].item() -+ # weight = flat_expert_weights[i].item() -+ # expert = self.experts[expert_id] -+ # expert_out = expert(x) -+ # expert_cache += expert_out * weight -+ # return expert_cache -+ - @no_grad() - def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ # x 的 shape: (1, hidden_size) -+ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+ -+ # 1. 收集所有需要的专家层 -+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+ selected_experts = [self.experts[i] for i in flat_expert_indices] -+ -+ # 2. 并行计算所有专家的输出 -+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+ # ops.cat 会将它们堆叠成一个新的 Tensor -+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+ -+ # 3. 使用矩阵乘法进行加权求和 -+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ # 最终结果 final_output 的 shape: (1, hidden_size) -+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+ -+ return final_output - -- expert_cache = ops.zeros_like(x) -- for i in range(self.num_experts_per_tok): -- expert_id = flat_expert_indices[i].item() -- weight = flat_expert_weights[i].item() -- expert = self.experts[expert_id] -- expert_out = expert(x) -- expert_cache += expert_out * weight -- return expert_cache - - @no_grad() - def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - -- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+ # @lwx -+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -+ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) - - kv_seq_len = key_states.shape[-2] - if past_key_value is not None: -@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): - return attn_output, attn_weights, past_key_value - - -+# class DeepseekFlashAttention(nn.Module): -+# """ -+# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+ -+# This class is designed as a drop-in replacement for DeepseekAttention. -+# """ -+ -+# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+# super().__init__() -+# self.config = config -+# self.layer_idx = layer_idx -+# if layer_idx is None: -+# logger.warning( -+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+# "when creating this class." -+# ) -+ -+# self.attention_dropout = config.attention_dropout -+# self.hidden_size = config.hidden_size -+# self.num_heads = config.num_attention_heads -+# self.head_dim = self.hidden_size // self.num_heads -+# self.num_key_value_heads = config.num_key_value_heads -+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+# self.max_position_embeddings = config.max_position_embeddings -+# self.rope_theta = config.rope_theta -+# self.is_causal = True -+ -+# if (self.head_dim * self.num_heads) != self.hidden_size: -+# raise ValueError( -+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+# f" and `num_heads`: {self.num_heads})." -+# ) -+ -+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+# self._init_rope() -+ -+# def _init_rope(self): -+# if self.config.rope_scaling is None: -+# self.rotary_emb = DeepseekRotaryEmbedding( -+# self.head_dim, -+# max_position_embeddings=self.max_position_embeddings, -+# base=self.rope_theta, -+# ) -+# else: -+# scaling_type = self.config.rope_scaling["type"] -+# scaling_factor = self.config.rope_scaling["factor"] -+# if scaling_type == "linear": -+# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+# self.head_dim, -+# max_position_embeddings=self.max_position_embeddings, -+# scaling_factor=scaling_factor, -+# base=self.rope_theta, -+# ) -+# elif scaling_type == "dynamic": -+# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+# self.head_dim, -+# max_position_embeddings=self.max_position_embeddings, -+# scaling_factor=scaling_factor, -+# base=self.rope_theta, -+# ) -+# else: -+# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+ -+# def forward( -+# self, -+# hidden_states: mindspore.Tensor, -+# attention_mask: Optional[mindspore.Tensor] = None, -+# position_ids: Optional[mindspore.Tensor] = None, -+# past_key_value: Optional[Cache] = None, -+# output_attentions: bool = False, -+# use_cache: bool = False, -+# **kwargs, -+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+# if "padding_mask" in kwargs: -+# warnings.warn( -+# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+# ) -+ -+# if output_attentions: -+# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+ -+# bsz, q_len, _ = hidden_states.shape -+ -+# if self.config.pretraining_tp > 1: -+# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+ -+# query_states = self.q_proj(hidden_states) -+# key_states = self.k_proj(hidden_states) -+# value_states = self.v_proj(hidden_states) -+ -+# # Reshape for multi-head attention -+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+# kv_seq_len = key_states.shape[-2] -+# if past_key_value is not None: -+# if self.layer_idx is None: -+# raise ValueError( -+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+# "with a layer index." -+# ) -+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+# # Apply Rotary Positional Embedding -+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+# if past_key_value is not None: -+# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ -+# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+ -+# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+ -+# # Convert attention_mask for flash_attention_score -+# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+# if attention_mask is not None: -+# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+# raise ValueError( -+# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+# ) -+# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+# else: -+# attn_mask_for_fa = None -+ -+# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+ -+# # Call the fused flash_attention_score operator -+# attn_output = mindspore.ops.flash_attention_score( -+# query=query_states_for_fa, -+# key=key_states_for_fa, -+# value=value_states_for_fa, -+# head_num=self.num_heads, # This is N1, the number of query heads -+# input_layout='BSH', -+# attn_mask=attn_mask_for_fa, -+# keep_prob=keep_prob, -+# scalar_value=1.0 / math.sqrt(self.head_dim), -+# sparse_mode=0 # Default mask mode -+# ) -+ -+# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+# attn_output = self.o_proj(attn_output) -+ -+# # Flash Attention does not return attention weights -+# attn_weights = None -+ -+# return attn_output, attn_weights, past_key_value -+ -+class DeepseekFlashAttention(nn.Module): -+ """ -+ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -+ This implementation is a drop-in replacement for the original DeepseekAttention class, -+ designed for high performance on supported hardware (Ascend). -+ -+ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -+ """ -+ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+ super().__init__() -+ self.config = config -+ self.layer_idx = layer_idx -+ if layer_idx is None: -+ logger.warning( -+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+ "when creating this class." -+ ) -+ -+ # --- [FIX] Correctly initialize all required attributes --- -+ self.attention_dropout = config.attention_dropout -+ self.hidden_size = config.hidden_size -+ self.num_heads = config.num_attention_heads -+ self.head_dim = self.hidden_size // self.num_heads -+ self.num_key_value_heads = config.num_key_value_heads -+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+ self.max_position_embeddings = config.max_position_embeddings -+ self.rope_theta = config.rope_theta -+ self.is_causal = True -+ -+ if (self.head_dim * self.num_heads) != self.hidden_size: -+ raise ValueError( -+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+ f" and `num_heads`: {self.num_heads})." -+ ) -+ -+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+ -+ # This call will now succeed as all attributes are initialized. -+ self._init_rope() -+ -+ def _init_rope(self): -+ if self.config.rope_scaling is None: -+ self.rotary_emb = DeepseekRotaryEmbedding( -+ self.head_dim, -+ max_position_embeddings=self.max_position_embeddings, -+ base=self.rope_theta, -+ ) -+ else: -+ scaling_type = self.config.rope_scaling["type"] -+ scaling_factor = self.config.rope_scaling["factor"] -+ if scaling_type == "linear": -+ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+ self.head_dim, -+ max_position_embeddings=self.max_position_embeddings, -+ scaling_factor=scaling_factor, -+ base=self.rope_theta, -+ ) -+ elif scaling_type == "dynamic": -+ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+ self.head_dim, -+ max_position_embeddings=self.max_position_embeddings, -+ scaling_factor=scaling_factor, -+ base=self.rope_theta, -+ ) -+ else: -+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+ -+ def forward( -+ self, -+ hidden_states: mindspore.Tensor, -+ attention_mask: Optional[mindspore.Tensor] = None, -+ position_ids: Optional[mindspore.Tensor] = None, -+ past_key_value: Optional[Cache] = None, -+ output_attentions: bool = False, -+ use_cache: bool = False, -+ **kwargs, -+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ if "padding_mask" in kwargs: -+ warnings.warn( -+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+ ) -+ if output_attentions: -+ warnings.warn( -+ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -+ ) -+ -+ bsz, q_len, _ = hidden_states.shape -+ -+ if self.config.pretraining_tp > 1: -+ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+ -+ query_states = self.q_proj(hidden_states) -+ key_states = self.k_proj(hidden_states) -+ value_states = self.v_proj(hidden_states) -+ -+ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+ kv_seq_len = key_states.shape[-2] -+ if past_key_value is not None: -+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+ # Apply Rotary Position Embedding -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ if past_key_value is not None: -+ cache_kwargs = {"sin": sin, "cos": cos} -+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -+ # So we must explicitly repeat the KV heads. -+ key_states = repeat_kv(key_states, self.num_key_value_groups) -+ value_states = repeat_kv(value_states, self.num_key_value_groups) -+ -+ # Convert attention mask for flash_attention_score -+ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -+ if attention_mask is not None: -+ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+ raise ValueError( -+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+ ) -+ attn_mask_for_fa = attention_mask < 0 -+ else: -+ attn_mask_for_fa = None -+ -+ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+ -+ # Call the fused operator using the efficient BNSD layout -+ attn_output = mindspore.ops.flash_attention_score( -+ query=query_states, -+ key=key_states, -+ value=value_states, -+ head_num=self.num_heads, -+ input_layout='BNSD', # Specify the correct layout -+ attn_mask=attn_mask_for_fa, -+ keep_prob=keep_prob, -+ scalar_value=1.0 / math.sqrt(self.head_dim) -+ ) -+ -+ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ -+ # Apply output projection -+ attn_output = self.o_proj(attn_output) -+ -+ # Flash attention does not return attention weights, so we return None. -+ attn_weights = None -+ -+ return attn_output, attn_weights, past_key_value -+ - Deepseek_ATTENTION_CLASSES = { - "eager": DeepseekAttention, -+ "flash-attention": DeepseekFlashAttention, - } - - -@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): - config=config, layer_idx=layer_idx - ) - -+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+ config=config, layer_idx=layer_idx -+ ) -+ - self.mlp = ( - DeepseekMoE(config) - if ( -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index d4c6b651..bced285c 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union - - import mindspore - import mindnlp.core.nn.functional as F --from mindnlp.core import nn, ops -+from mindnlp.core import nn, ops, no_grad - from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss - - from ....common.activations import ACT2FN -@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) - _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" - _CONFIG_FOR_DOC = "Qwen2MoeConfig" - -+Long_Prompt = False -+PROMPT_LENGTH_THRESHOLD = 128 - - # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position - def _prepare_4d_causal_attention_mask_with_cache_position( -@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): - return attn_output, attn_weights, past_key_value - - -+# class Qwen2MoeFlashAttention(nn.Module): -+# """ -+# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+ -+# 关键改动: -+# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+# 直接传入原始的 key 和 value 张量效率更高。 -+# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+# super().__init__() -+# self.config = config -+# self.layer_idx = layer_idx -+# self.hidden_size = config.hidden_size -+# self.num_heads = config.num_attention_heads -+# self.head_dim = self.hidden_size // self.num_heads -+# self.num_key_value_heads = config.num_key_value_heads -+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+# self.max_position_embeddings = config.max_position_embeddings -+# self.rope_theta = config.rope_theta -+# self.attention_dropout = config.attention_dropout -+ -+# if (self.head_dim * self.num_heads) != self.hidden_size: -+# raise ValueError( -+# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+# ) -+ -+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+ -+# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+# self.head_dim, -+# max_position_embeddings=self.max_position_embeddings, -+# base=self.rope_theta, -+# ) -+ -+# def forward( -+# self, -+# hidden_states: mindspore.Tensor, -+# attention_mask: Optional[mindspore.Tensor] = None, -+# position_ids: Optional[mindspore.Tensor] = None, -+# past_key_value: Optional[Cache] = None, -+# output_attentions: bool = False, -+# use_cache: bool = False, -+# cache_position: Optional[mindspore.Tensor] = None, -+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+# bsz, q_len, _ = hidden_states.shape -+ -+# # 1. 线性投射 Q, K, V -+# query_states = self.q_proj(hidden_states) -+# key_states = self.k_proj(hidden_states) -+# value_states = self.v_proj(hidden_states) -+ -+# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+# # query: [B, S, H*D] -> [B, N1, S, D] -+# # key/val: [B, S, H2*D] -> [B, N2, S, D] -+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+# # 3. RoPE 旋转位置编码 -+# kv_seq_len = key_states.shape[-2] -+# if past_key_value is not None: -+# if self.layer_idx is None: -+# raise ValueError( -+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+# "with a layer index." -+# ) -+# # 对于 StaticCache,需要特殊处理 kv_seq_len -+# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+# # 使用 cache_position 的长度来确定实际的 kv_seq_len -+# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+# # 临时解决方案:使用 cache_position 的最大值(如果可能) -+# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+# if cache_position.shape[0] == 1: -+# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+# kv_seq_len = past_seen_tokens + 1 -+# else: -+# # prefill 阶段:cache_position 是范围,使用其长度 -+# kv_seq_len = cache_position.shape[0] + past_seen_tokens -+# else: -+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+# # 4. KV 缓存更新 -+# if past_key_value is not None: -+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+# key_states, value_states = past_key_value.update( -+# key_states, value_states, self.layer_idx, cache_kwargs -+# ) -+ -+# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+# if cache_position.shape[0] == 1: -+# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+# kv_seq_len = key_states.shape[-2] -+ -+# # 5. [重要] 准备 Attention Mask -+# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+# fa_attention_mask = None -+# if attention_mask is not None: -+# # 截取与当前key长度匹配的部分 -+# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+# # 转换为布尔类型: 大负数 -> True, 0 -> False -+# fa_attention_mask = (mask_slice != 0) -+ -+# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+# input_dtype = query_states.dtype -+# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+# query_states = query_states.to(mindspore.float16) -+# key_states = key_states.to(mindspore.float16) -+# value_states = value_states.to(mindspore.float16) -+ -+# # 6. [核心] 调用 flash_attention_score 算子 -+# # - 无需手动 repeat_kv, 算子原生支持 GQA -+# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+# attn_output = mindspore.ops.flash_attention_score( -+# query=query_states, -+# key=key_states, -+# value=value_states, -+# head_num=self.num_heads, # 传入Q的头数(N1) -+# attn_mask=fa_attention_mask, -+# keep_prob=1.0 - self.attention_dropout, -+# scalar_value=1.0 / math.sqrt(self.head_dim), -+# input_layout="BNSD", -+# sparse_mode=0 # 使用 defaultMask 模式 -+# ) -+ -+# # 恢复原始数据类型 -+# attn_output = attn_output.to(input_dtype) -+ -+# # 7. 调整输出形状 -+# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+# attn_output = self.o_proj(attn_output) -+ -+# # FlashAttention 算子不直接返回注意力权重矩阵 -+# attn_weights = None -+# if output_attentions: -+# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+ -+# return attn_output, attn_weights, past_key_value -+ -+# # def forward( -+# # self, -+# # hidden_states: mindspore.Tensor, -+# # attention_mask: Optional[mindspore.Tensor] = None, -+# # position_ids: Optional[mindspore.Tensor] = None, -+# # past_key_value: Optional[Cache] = None, -+# # output_attentions: bool = False, -+# # use_cache: bool = False, -+# # cache_position: Optional[mindspore.Tensor] = None, -+# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+# # bsz, q_len, _ = hidden_states.shape -+ -+# # # 1. 线性投射 Q, K, V -+# # query_states = self.q_proj(hidden_states) -+# # key_states = self.k_proj(hidden_states) -+# # value_states = self.v_proj(hidden_states) -+ -+# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ -+# # # 3. RoPE 旋转位置编码 -+# # kv_seq_len = key_states.shape[-2] -+# # if past_key_value is not None: -+# # if self.layer_idx is None: -+# # raise ValueError( -+# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+# # "with a layer index." -+# # ) -+# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+# # # 4. KV 缓存更新 -+# # if past_key_value is not None: -+# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+# # key_states, value_states = past_key_value.update( -+# # key_states, value_states, self.layer_idx, cache_kwargs -+# # ) -+ -+# # # 5. 准备 Attention Mask -+# # fa_attention_mask = None -+# # if attention_mask is not None: -+# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+# # fa_attention_mask = (mask_slice != 0) -+ -+# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+# # input_dtype = query_states.dtype -+ -+# # # 6. [核心] 调用 flash_attention_score 算子 -+# # attn_output = mindspore.ops.flash_attention_score( -+# # query=query_states, -+# # key=key_states, -+# # value=value_states, -+# # head_num=self.num_heads, -+# # attn_mask=fa_attention_mask, -+# # keep_prob=1.0 - self.attention_dropout, -+# # scalar_value=1.0 / math.sqrt(self.head_dim), -+# # input_layout="BNSD", -+# # sparse_mode=0, -+# # # <--- 修改点 2: 启用内部高精度计算 --- -+# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+# # inner_precise=1 -+# # ) -+ -+# # # 恢复原始数据类型 -+# # attn_output = attn_output.to(input_dtype) -+ -+# # # 7. 调整输出形状 -+# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+# # attn_output = self.o_proj(attn_output) -+ -+# # attn_weights = None -+# # if output_attentions: -+# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+ -+# # return attn_output, attn_weights, past_key_value -+ -+ - class Qwen2MoeFlashAttention(nn.Module): - """ -- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -- -- 关键改动: -- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -- 直接传入原始的 key 和 value 张量效率更高。 -- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -+ -+ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -+ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -+ 完全使用模型的低精度数据类型(如 float16)进行计算, -+ 以达到理论上的最高执行速度。 - """ - def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): - super().__init__() - self.config = config - self.layer_idx = layer_idx -+ if layer_idx is None: -+ logger.warning_once( -+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -+ ) -+ - self.hidden_size = config.hidden_size - self.num_heads = config.num_attention_heads - self.head_dim = self.hidden_size // self.num_heads - self.num_key_value_heads = config.num_key_value_heads -- self.num_key_value_groups = self.num_heads // self.num_key_value_heads - self.max_position_embeddings = config.max_position_embeddings - self.rope_theta = config.rope_theta - self.attention_dropout = config.attention_dropout - -- if (self.head_dim * self.num_heads) != self.hidden_size: -- raise ValueError( -- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -- ) -- - self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) - self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) - self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - -- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -- # query: [B, S, H*D] -> [B, N1, S, D] -- # key/val: [B, S, H2*D] -> [B, N2, S, D] -+ # 2. 调整形状以匹配 BNSD 布局 - query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) - key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) - value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- -- # 3. RoPE 旋转位置编码 -+ -+ # 3. RoPE 和 KV 缓存 - kv_seq_len = key_states.shape[-2] - if past_key_value is not None: -- if self.layer_idx is None: -- raise ValueError( -- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -- "with a layer index." -- ) -- # 对于 StaticCache,需要特殊处理 kv_seq_len -- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -- if isinstance(past_key_value, StaticCache) and cache_position is not None: -- # 使用 cache_position 的长度来确定实际的 kv_seq_len -- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -- # 临时解决方案:使用 cache_position 的最大值(如果可能) -- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -- if cache_position.shape[0] == 1: -- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -- kv_seq_len = past_seen_tokens + 1 -- else: -- # prefill 阶段:cache_position 是范围,使用其长度 -- kv_seq_len = cache_position.shape[0] + past_seen_tokens -- else: -- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- -+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ - cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) - query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) - -- # 4. KV 缓存更新 - if past_key_value is not None: - cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -- key_states, value_states = past_key_value.update( -- key_states, value_states, self.layer_idx, cache_kwargs -- ) -- -- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -- if isinstance(past_key_value, StaticCache) and cache_position is not None: -- if cache_position.shape[0] == 1: -- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -- kv_seq_len = key_states.shape[-2] -- -- # 5. [重要] 准备 Attention Mask -- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+ # 4. 准备 Attention Mask - fa_attention_mask = None - if attention_mask is not None: -- # 截取与当前key长度匹配的部分 -- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) - mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -- # 转换为布尔类型: 大负数 -> True, 0 -> False - fa_attention_mask = (mask_slice != 0) - -- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -- input_dtype = query_states.dtype -- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -- query_states = query_states.to(mindspore.float16) -- key_states = key_states.to(mindspore.float16) -- value_states = value_states.to(mindspore.float16) -- -- # 6. [核心] 调用 flash_attention_score 算子 -- # - 无需手动 repeat_kv, 算子原生支持 GQA -- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 - attn_output = mindspore.ops.flash_attention_score( - query=query_states, - key=key_states, - value=value_states, -- head_num=self.num_heads, # 传入Q的头数(N1) -+ head_num=self.num_heads, - attn_mask=fa_attention_mask, -- keep_prob=1.0 - self.attention_dropout, -+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout - scalar_value=1.0 / math.sqrt(self.head_dim), - input_layout="BNSD", -- sparse_mode=0 # 使用 defaultMask 模式 -+ sparse_mode=0, -+ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 - ) - -- # 恢复原始数据类型 -- attn_output = attn_output.to(input_dtype) -- -- # 7. 调整输出形状 -- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+ # 6. 调整输出形状 - attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) - attn_output = self.o_proj(attn_output) - -- # FlashAttention 算子不直接返回注意力权重矩阵 -+ # 7. 返回结果 - attn_weights = None - if output_attentions: -- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") - - return attn_output, attn_weights, past_key_value - -- # def forward( -- # self, -- # hidden_states: mindspore.Tensor, -- # attention_mask: Optional[mindspore.Tensor] = None, -- # position_ids: Optional[mindspore.Tensor] = None, -- # past_key_value: Optional[Cache] = None, -- # output_attentions: bool = False, -- # use_cache: bool = False, -- # cache_position: Optional[mindspore.Tensor] = None, -- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- -- # bsz, q_len, _ = hidden_states.shape -- -- # # 1. 线性投射 Q, K, V -- # query_states = self.q_proj(hidden_states) -- # key_states = self.k_proj(hidden_states) -- # value_states = self.v_proj(hidden_states) -- -- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- -- # # 3. RoPE 旋转位置编码 -- # kv_seq_len = key_states.shape[-2] -- # if past_key_value is not None: -- # if self.layer_idx is None: -- # raise ValueError( -- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -- # "with a layer index." -- # ) -- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) - -- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- -- # # 4. KV 缓存更新 -- # if past_key_value is not None: -- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -- # key_states, value_states = past_key_value.update( -- # key_states, value_states, self.layer_idx, cache_kwargs -- # ) -- -- # # 5. 准备 Attention Mask -- # fa_attention_mask = None -- # if attention_mask is not None: -- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -- # fa_attention_mask = (mask_slice != 0) -- -- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -- # input_dtype = query_states.dtype -- -- # # 6. [核心] 调用 flash_attention_score 算子 -- # attn_output = mindspore.ops.flash_attention_score( -- # query=query_states, -- # key=key_states, -- # value=value_states, -- # head_num=self.num_heads, -- # attn_mask=fa_attention_mask, -- # keep_prob=1.0 - self.attention_dropout, -- # scalar_value=1.0 / math.sqrt(self.head_dim), -- # input_layout="BNSD", -- # sparse_mode=0, -- # # <--- 修改点 2: 启用内部高精度计算 --- -- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -- # inner_precise=1 -- # ) -- -- # # 恢复原始数据类型 -- # attn_output = attn_output.to(input_dtype) -+QWEN2MOE_ATTENTION_CLASSES = { -+ "eager": Qwen2MoeAttention, -+ "flash-attention": Qwen2MoeFlashAttention, -+} - -- # # 7. 调整输出形状 -- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -- # attn_output = self.o_proj(attn_output) - -- # attn_weights = None -- # if output_attentions: -- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# def __init__(self, config): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# # gating -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+ -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# #@dwj -+# # 只遍历激活的专家,而非全部专家 -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# num_tokens = hidden_states_reshaped.shape[0] -+ -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+# routing_weights = routing_weights.to(hidden_states.dtype) -+ -+# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+# flat_selected_experts = selected_experts.flatten() -+ -+# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+# token_indices = broadcasted_token_indices.flatten() -+ -+# active_experts = ops.unique(flat_selected_experts) -+ -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+ -+# mask = (flat_selected_experts == expert_idx_tensor) -+# selected_token_indices = token_indices[mask] -+# selected_routing_weights = routing_weights.flatten()[mask] -+ -+# current_states = hidden_states_reshaped[selected_token_indices] -+ -+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+ -+# final_hidden_states = final_hidden_states.index_add( -+# dim=0, -+# index=selected_token_indices, -+# source=expert_output.to(hidden_states.dtype) -+# ) -+ -+# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output - -- # return attn_output, attn_weights, past_key_value -+# final_hidden_states = final_hidden_states + shared_expert_output -+# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ -+ -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# """ -+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+# `_moe_infer_prefill` (用于长序列处理) 方法。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# # 门控网络 -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# # 专家列表 -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+# # 共享专家 -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# @no_grad() -+# def _moe_infer_decode( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# """ -+# 【解码路径】针对 sequence_length=1 的极致优化。 -+# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+# """ -+# batch_size, hidden_dim = hidden_states.shape -+ -+# expert_outputs_list = [ -+# ops.cat([ -+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+# ], dim=0) -+# for i in range(batch_size) -+# ] -+ -+# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+# # shape: (batch_size, top_k, hidden_dim) -+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+ -+# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+ -+# return moe_output.squeeze(1) -+ -+# @no_grad() -+# def _moe_infer_prefill( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# """ -+# 【预填充路径】针对 sequence_length > 1 的优化。 -+# 按专家对 Token 进行分组,并进行批处理。 -+# """ -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens = hidden_states.shape[0] -+# flat_selected_experts = selected_experts.flatten() -+ -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+ -+# active_experts = ops.unique(flat_selected_experts) -+ -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+ -+# mask = (flat_selected_experts == expert_idx_tensor) -+# selected_token_indices = token_indices[mask] -+# selected_routing_weights = routing_weights.flatten()[mask] -+ -+# current_states = hidden_states[selected_token_indices] -+ -+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+ -+# moe_output = moe_output.index_add( -+# dim=0, -+# index=selected_token_indices, -+# source=expert_output.to(hidden_states.dtype) -+# ) -+# return moe_output -+ -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# """ -+# 顶层 forward 方法,作为智能分发器。 -+# """ -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+ -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) - -- # def forward( -- # self, -- # hidden_states: mindspore.Tensor, -- # attention_mask: Optional[mindspore.Tensor] = None, -- # position_ids: Optional[mindspore.Tensor] = None, -- # past_key_value: Optional[Cache] = None, -- # output_attentions: bool = False, -- # use_cache: bool = False, -- # cache_position: Optional[mindspore.Tensor] = None, -- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- -- # bsz, q_len, _ = hidden_states.shape -- -- # query_states = self.q_proj(hidden_states) -- # key_states = self.k_proj(hidden_states) -- # value_states = self.v_proj(hidden_states) -- -- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- -- # kv_seq_len = key_states.shape[-2] -- # if past_key_value is not None: -- # if self.layer_idx is None: -- # raise ValueError("`layer_idx` must be specified for caching") -- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- -- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- -- # if past_key_value is not None: -- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -- # key_states, value_states = past_key_value.update( -- # key_states, value_states, self.layer_idx, cache_kwargs -- # ) -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+# routing_weights = routing_weights.to(hidden_states.dtype) -+ -+# moe_output = None -+# # 在推理时,根据序列长度选择最优路径 -+# if not self.training: -+# if sequence_length == 1: -+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+# else: -+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+# else: -+# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+# raise NotImplementedError("Training path is not implemented.") -+ -+# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+ -+# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+ -+# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ -+ -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# """ -+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# # 门控网络 -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# # 专家列表 -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+# # 共享专家 -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# @no_grad() -+# def _moe_infer_decode( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# batch_size, _ = hidden_states.shape -+# expert_outputs_list = [ -+# ops.cat([ -+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+# ], dim=0) -+# for i in range(batch_size) -+# ] -+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+# return moe_output.squeeze(1) -+ -+# @no_grad() -+# def _moe_infer_prefill( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens = hidden_states.shape[0] -+# flat_selected_experts = selected_experts.flatten() -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+# active_experts = ops.unique(flat_selected_experts) -+ -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+# mask = (flat_selected_experts == expert_idx_tensor) -+# selected_token_indices = token_indices[mask] -+# selected_routing_weights = routing_weights.flatten()[mask] -+# current_states = hidden_states[selected_token_indices] -+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+# moe_output = moe_output.index_add( -+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+# ) -+# return moe_output -+ -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# """ -+# 顶层 forward 方法,作为智能分发器。 -+# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+# """ -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+ -+# # 1. 门控计算 (通用逻辑) -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+# routing_weights = routing_weights.to(hidden_states.dtype) -+ -+# # 2. 智能分发到最优 MoE 路径 -+# moe_output = None -+# if not self.training: -+# if sequence_length == 1: -+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+# else: -+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+# else: -+# raise NotImplementedError("Training path is not implemented.") -+ -+# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+ -+# # 4. 合并 MoE 输出和共享专家输出 -+# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+ -+# # 5. 恢复原始形状并返回 -+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ -+# prefill fastest -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# """ -+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# # 门控网络 -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# # 专家列表 -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+# # 共享专家 -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# @no_grad() -+# def _moe_infer_dispatch( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# """ -+# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+# """ -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens, _ = hidden_states.shape -+ -+# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+# flat_selected_experts = selected_experts.flatten() -+# flat_routing_weights = routing_weights.flatten() - -- # key_states = repeat_kv(key_states, self.num_key_value_groups) -- # value_states = repeat_kv(value_states, self.num_key_value_groups) -- -- # # <--- 核心修改点: 手动进行高精度缩放 --- -- # # 在调用算子前,手动将 query_states 除以缩放因子。 -- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -- # query_states = query_states / math.sqrt(self.head_dim) -- # # <--- 修改结束 --- -- -- # fa_attention_mask = None -- # if attention_mask is not None: -- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -- # fa_attention_mask = (mask_slice != 0) -- -- # input_dtype = query_states.dtype -- -- # attn_output = mindspore.ops.flash_attention_score( -- # query=query_states, # 传入已经预先缩放过的 query -- # key=key_states, -- # value=value_states, -- # head_num=self.num_heads, -- # attn_mask=fa_attention_mask, -- # keep_prob=1.0 - self.attention_dropout, -- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -- # input_layout="BNSD", -- # sparse_mode=0, -- # inner_precise=1 # 仍然保持内部高精度计算 -- # ) -+# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() - -- # attn_output = attn_output.to(input_dtype) -- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -- # attn_output = self.o_proj(attn_output) -+# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+# active_experts = ops.unique(flat_selected_experts) -+ -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+ -+# # 找到所有分配给该专家的 token -+# mask = (flat_selected_experts == expert_idx_tensor) -+ -+# # 使用 mask 选取对应的 token 和权重 -+# current_token_indices = token_indices[mask] -+# current_routing_weights = flat_routing_weights[mask] -+# current_hidden_states = hidden_states[current_token_indices] -+ -+# # 对这些 token 进行批处理 -+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+ -+# # 使用 index_add 将结果精确地加回到对应位置 -+# moe_output = moe_output.index_add( -+# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+# ) -+# return moe_output -+ -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# """ -+# 顶层 forward 方法,作为智能分发器。 -+# """ -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+ -+# # 1. 门控计算 -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+# routing_weights = routing_weights.to(hidden_states.dtype) -+ -+# # 2. 调用统一的 MoE 计算内核 -+# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) - -- # attn_weights = None -- # if output_attentions: -- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+# # 3. 统一处理共享专家 -+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+ -+# # 4. 合并输出 -+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+ -+# # 5. 恢复原始形状并返回 -+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ -+ -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# """ -+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+# 【最终高性能与高精度版】: -+# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+# 3. 这样实现了速度和准确性的两全其美。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# @no_grad() -+# def _moe_infer_decode( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# """ -+# 【解码路径】极致优化版:bmm + 高精度累加。 -+# """ -+# original_dtype = hidden_states.dtype -+# batch_size, _ = hidden_states.shape -+ -+# expert_outputs_list = [ -+# ops.cat([ -+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+# ], dim=0) -+# for i in range(batch_size) -+# ] -+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+ -+# # 在 float32 下执行 bmm,得到高精度结果 -+# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+ -+# # 将高精度结果转换回原始数据类型 -+# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+ -+# return moe_output -+ -+# @no_grad() -+# def _moe_infer_prefill( -+# self, -+# hidden_states: mindspore.Tensor, -+# selected_experts: mindspore.Tensor, -+# routing_weights: mindspore.Tensor -+# ) -> mindspore.Tensor: -+# """ -+# 【预填充路径】与原始实现一致,结果精确。 -+# """ -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens, _ = hidden_states.shape -+# flat_selected_experts = selected_experts.flatten() -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+# active_experts = ops.unique(flat_selected_experts) -+ -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+# mask = (flat_selected_experts == expert_idx_tensor) -+# selected_token_indices = token_indices[mask] -+# selected_routing_weights = routing_weights.flatten()[mask] -+# current_states = hidden_states[selected_token_indices] -+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+# moe_output = moe_output.index_add( -+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+# ) -+# return moe_output -+ -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+ -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) - -- # return attn_output, attn_weights, past_key_value -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+# # 如果模型主体是 float16,后续再转换 -+ -+# moe_output = None -+# if not self.training: -+# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+# # _moe_infer_decode 内部会处理好类型转换 -+# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+# if sequence_length == 1: -+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+# else: -+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+# else: -+# raise NotImplementedError("Training path is not implemented.") -+ -+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+ -+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ - --QWEN2MOE_ATTENTION_CLASSES = { -- "eager": Qwen2MoeAttention, -- "flash-attention": Qwen2MoeFlashAttention, --} -+# class Qwen2MoeSparseMoeBlock(nn.Module): -+# """ -+# 【融合版】一个混合专家模块,内置两种推理策略, -+# 由外部全局变量 `Long_Prompt` 控制: -+ -+# - if Long_Prompt is True: 【精度优先模式】 -+# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+# 适用于处理长序列,避免误差累积。 -+ -+# - if Long_Prompt is False: 【速度优先模式】 -+# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+# 在解码阶段获得极致速度,同时保证结果高度准确。 -+# """ -+# def __init__(self, config: Qwen2MoeConfig): -+# super().__init__() -+# self.num_experts = config.num_experts -+# self.top_k = config.num_experts_per_tok -+# self.norm_topk_prob = config.norm_topk_prob -+ -+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+# self.experts = nn.ModuleList( -+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+# ) -+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+# # --- 速度优先模式的辅助函数 --- -+# @no_grad() -+# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+# original_dtype = hidden_states.dtype -+# batch_size, _ = hidden_states.shape -+# expert_outputs_list = [ -+# ops.cat([ -+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+# ], dim=0) -+# for i in range(batch_size) -+# ] -+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+# weights_fp32 = routing_weights.to(mindspore.float32) -+# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+# return moe_output_fp32.squeeze(1).to(original_dtype) -+ -+# @no_grad() -+# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens, _ = hidden_states.shape -+# flat_selected_experts = selected_experts.flatten() -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+# active_experts = ops.unique(flat_selected_experts) -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+# mask = (flat_selected_experts == expert_idx_tensor) -+# selected_token_indices = token_indices[mask] -+# selected_routing_weights = routing_weights.flatten()[mask] -+# current_states = hidden_states[selected_token_indices] -+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+# return moe_output -+ -+# # --- 精度优先模式的辅助函数 --- -+# @no_grad() -+# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+# moe_output = ops.zeros_like(hidden_states) -+# num_tokens, _ = hidden_states.shape -+# flat_selected_experts = selected_experts.flatten() -+# flat_routing_weights = routing_weights.flatten() -+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+# active_experts = ops.unique(flat_selected_experts) -+# for expert_idx_tensor in active_experts: -+# expert_idx = expert_idx_tensor.item() -+# expert_layer = self.experts[expert_idx] -+# mask = (flat_selected_experts == expert_idx_tensor) -+# current_token_indices = token_indices[mask] -+# current_routing_weights = flat_routing_weights[mask] -+# current_hidden_states = hidden_states[current_token_indices] -+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+# return moe_output -+ -+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+# # 声明我们将要使用一个在模块外部定义的全局变量 -+# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+# global Long_Prompt -+ -+# # 1. 门控计算 (所有模式通用) -+# batch_size, sequence_length, hidden_dim = hidden_states.shape -+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+# router_logits = self.gate(hidden_states_reshaped) -+# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+# if self.norm_topk_prob: -+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+# moe_output = None -+# if not self.training: -+# # 根据 Long_Prompt 标志选择模式 -+# if Long_Prompt: -+# # --- 精度优先模式 --- -+# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+# else: -+# # --- 速度优先模式 --- -+# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+# if sequence_length == 1: -+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+# else: -+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+# else: -+# raise NotImplementedError("Training path is not implemented.") -+ -+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+ -+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+ -+# return final_hidden_states, router_logits -+ -+class Qwen2MoeSparseMoeBlock(nn.Module): -+ """ -+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+ 控制的顶级推理策略: - -+ - if Long_Prompt is True: 【精度优先模式】 -+ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -+ 适用于需要严格可复现性的长序列任务。 - --class Qwen2MoeSparseMoeBlock(nn.Module): -- def __init__(self, config): -+ - if Long_Prompt is False: 【速度优先模式】 -+ 采用业界最强的性能组合: -+ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -+ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -+ """ -+ def __init__(self, config: Qwen2MoeConfig): - super().__init__() - self.num_experts = config.num_experts - self.top_k = config.num_experts_per_tok - self.norm_topk_prob = config.norm_topk_prob - -- # gating - self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) - self.experts = nn.ModuleList( - [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] - ) -- - self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) - self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) - -- #@dwj -- # 只遍历激活的专家,而非全部专家 -- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -- batch_size, sequence_length, hidden_dim = hidden_states.shape -- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -- num_tokens = hidden_states_reshaped.shape[0] -- -- router_logits = self.gate(hidden_states_reshaped) -- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- -- if self.norm_topk_prob: -- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- routing_weights = routing_weights.to(hidden_states.dtype) -- -- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -- flat_selected_experts = selected_experts.flatten() -- -- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -- token_indices = broadcasted_token_indices.flatten() -- -- active_experts = ops.unique(flat_selected_experts) -- -- for expert_idx_tensor in active_experts: -- expert_idx = expert_idx_tensor.item() -- expert_layer = self.experts[expert_idx] -- -- mask = (flat_selected_experts == expert_idx_tensor) -- selected_token_indices = token_indices[mask] -- selected_routing_weights = routing_weights.flatten()[mask] -- -- current_states = hidden_states_reshaped[selected_token_indices] -- -- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -- -- final_hidden_states = final_hidden_states.index_add( -- dim=0, -- index=selected_token_indices, -- source=expert_output.to(hidden_states.dtype) -- ) -- -- shared_expert_output = self.shared_expert(hidden_states_reshaped) -- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -+ @no_grad() -+ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ original_dtype = hidden_states.dtype -+ batch_size, _ = hidden_states.shape -+ expert_outputs_list = [ -+ ops.cat([ -+ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+ ], dim=0) -+ for i in range(batch_size) -+ ] -+ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+ weights_fp32 = routing_weights.to(mindspore.float32) -+ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+ return moe_output_fp32.squeeze(1).to(original_dtype) -+ -+ @no_grad() -+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ num_tokens, _ = hidden_states.shape -+ flat_selected_experts = selected_experts.flatten() -+ sorted_expert_indices = flat_selected_experts.argsort() -+ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+ original_token_indices = sorted_expert_indices // self.top_k -+ moe_output = ops.zeros_like(hidden_states) -+ current_token_offset = 0 -+ for i in range(self.num_experts): -+ expert_token_count = tokens_per_expert[i] - current_token_offset -+ if expert_token_count == 0: -+ continue -+ end_offset = current_token_offset + expert_token_count -+ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+ expert_hidden_states = hidden_states[expert_original_token_indices] -+ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+ current_token_offset += expert_token_count -+ return moe_output -+ -+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+ @no_grad() -+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ moe_output = ops.zeros_like(hidden_states) -+ num_tokens, _ = hidden_states.shape -+ flat_selected_experts = selected_experts.flatten() -+ flat_routing_weights = routing_weights.flatten() -+ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+ active_experts = ops.unique(flat_selected_experts) -+ for expert_idx_tensor in active_experts: -+ expert_idx = expert_idx_tensor.item() -+ expert_layer = self.experts[expert_idx] -+ mask = (flat_selected_experts == expert_idx_tensor) -+ current_token_indices = token_indices[mask] -+ current_routing_weights = flat_routing_weights[mask] -+ current_hidden_states = hidden_states[current_token_indices] -+ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+ return moe_output - -- final_hidden_states = final_hidden_states + shared_expert_output -- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -- -- return final_hidden_states, router_logits -+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+ global Long_Prompt -+ -+ # 1. 门控计算 (所有模式通用) -+ batch_size, sequence_length, hidden_dim = hidden_states.shape -+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+ router_logits = self.gate(hidden_states_reshaped) -+ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+ if self.norm_topk_prob: -+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+ moe_output = None -+ if Long_Prompt: -+ # --- 精度优先模式 (ACCURACY MODE) --- -+ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ else: -+ # --- 速度优先模式 (SPEED MODE) --- -+ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ if sequence_length == 1: -+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ else: -+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ - -+ # 3. 共享专家计算与合并 (所有模式通用) -+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+ -+ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+ -+ return final_hidden_states, router_logits - - class Qwen2MoeDecoderLayer(nn.Module): - def __init__(self, config: Qwen2MoeConfig, layer_idx: int): - super().__init__() - self.hidden_size = config.hidden_size -+ -+ # if Long_Prompt: -+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ # else: -+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) - - self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - -- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -- - if (layer_idx not in config.mlp_only_layers) and ( - config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 - ): -@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - self._warmed_up = True - self.warmup_moe_model() - -+ -+ - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions - output_router_logits = ( - output_router_logits if output_router_logits is not None else self.config.output_router_logits -@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - router_logits=outputs.router_logits, - ) - -+ def generate(self, *args, **kwargs): -+ """ -+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+ """ -+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+ -+ input_ids = kwargs.get("input_ids") -+ if input_ids is None and args: -+ input_ids = args[0] -+ -+ if input_ids is not None: -+ prompt_length = input_ids.shape[1] -+ -+ if prompt_length > PROMPT_LENGTH_THRESHOLD: -+ Long_Prompt = True -+ else: -+ Long_Prompt = False -+ -+ return super().generate(*args, **kwargs) -+ - # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation - def prepare_inputs_for_generation( - self, -@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens - # Exception 1: when passing input_embeds, input_ids may be missing entries - # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -+ - if past_key_values is not None: - if inputs_embeds is not None: # Exception 1 - if 0 not in input_ids.shape: -@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - } - ) - return model_inputs -+ - # @lwx - # def _decode_one_tokens_logits( - # self, -@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): - attentions=outputs.attentions, - ) - -+ - __all__ = [ - "Qwen2MoeForCausalLM", - "Qwen2MoeModel", -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -new file mode 100644 -index 00000000..6dfb5b93 ---- /dev/null -+++ b/patches/0001-20251104commit.patch -@@ -0,0 +1,1272 @@ -+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Tue, 4 Nov 2025 09:11:51 +0800 -+Subject: [PATCH] 20251104commit -+ -+--- -+ mindnlp/transformers/cache_utils.py | 28 +- -+ .../models/deepseek/modeling_deepseek.py | 149 ++- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+ 3 files changed, 976 insertions(+), 87 deletions(-) -+ -+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+index cadd2e04..02f8d4be 100644 -+--- a/mindnlp/transformers/cache_utils.py -++++ b/mindnlp/transformers/cache_utils.py -+@@ -812,14 +812,26 @@ class StaticCache(Cache): -+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+ # k_out[:, :, cache_position] = key_states -+ # v_out[:, :, cache_position] = value_states -+- if ON_ORANGE_PI: -+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+- else: -+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+- -++ # if ON_ORANGE_PI: -++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++ # else: -++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++ # 确保 cache_position 是 1D tensor 并且类型正确 -++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++ if cache_position.ndim > 1: -++ cache_position = cache_position.flatten() -++ # 确保类型是 int32 或 int64(MindSpore 要求) -++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++ cache_position = cache_position.int() -++ -++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++ k_out[:, :, cache_position] = key_states -++ v_out[:, :, cache_position] = value_states -++ -+ return k_out, v_out -+ -+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index c695b944..d8303e45 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+ # Copied from transformers.models.llama.modeling_llama.rotate_half -+ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+- x1 = x[..., : x.shape[-1] // 2] -+- x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++ # x1 = x[..., : x.shape[-1] // 2] -++ # x2 = x[..., x.shape[-1] // 2 :] -++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+ if self.training: -+ raise NotImplementedError("Training is not supported yet.") -+ else: -+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+- if self.config.n_shared_experts is not None: -+- y = y + self.shared_experts(identity) -+- return y -++ # @lwx -++ if orig_shape[1] == 1: -++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++ y=y.view(*orig_shape) -++ if self.config.n_shared_experts is not None: -++ y = y + self.shared_experts(identity) -++ return y -++ else: -++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++ if self.config.n_shared_experts is not None: -++ y = y + self.shared_experts(identity) -++ return y -++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++ # if self.config.n_shared_experts is not None: -++ # y = y + self.shared_experts(identity) -++ # return y -++ -++ @no_grad() -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ -++ expert_cache = ops.zeros_like(x) -++ for i in range(self.num_experts_per_tok): -++ expert_id = flat_expert_indices[i].item() -++ weight = flat_expert_weights[i].item() -++ expert = self.experts[expert_id] -++ expert_out = expert(x) -++ expert_cache += expert_out * weight -++ return expert_cache -+ -+ @no_grad() -+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+- # expert_cache = torch.zeros_like(x) -+- # idxs = flat_expert_indices.argsort() -+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+- # token_idxs = idxs // self.num_experts_per_tok -+- # for i, end_idx in enumerate(tokens_per_expert): -+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+- # if start_idx == end_idx: -+- # continue -+- # expert = self.experts[i] -+- # exp_token_idx = token_idxs[start_idx:end_idx] -+- # expert_tokens = x[exp_token_idx] -+- # expert_out = expert(expert_tokens) -+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+- # return expert_cache -++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ expert_cache = ops.zeros_like(x) -+ idxs = flat_expert_indices.argsort() -+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ token_idxs = idxs // self.num_experts_per_tok -++ -+ for i, end_idx in enumerate(tokens_per_expert): -+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ if start_idx == end_idx: -+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+ expert_out = expert(expert_tokens) -+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -+ return expert_cache -++ -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # # expert_cache = torch.zeros_like(x) -++ # # idxs = flat_expert_indices.argsort() -++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++ # # token_idxs = idxs // self.num_experts_per_tok -++ # # for i, end_idx in enumerate(tokens_per_expert): -++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++ # # if start_idx == end_idx: -++ # # continue -++ # # expert = self.experts[i] -++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++ # # expert_tokens = x[exp_token_idx] -++ # # expert_out = expert(expert_tokens) -++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++ # # return expert_cache -++ # expert_cache = ops.zeros_like(x) -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # for i, end_idx in enumerate(tokens_per_expert): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # if start_idx == end_idx: -++ # continue -++ # expert = self.experts[i] -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = expert(expert_tokens) -++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -++ # return expert_cache -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # expert_cache = ops.zeros_like(x) -++ -++ # # 排序保证顺序一致 -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # # 找出有 token 的专家 -++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++ -++ # for i in active_experts.tolist(): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # end_idx = tokens_per_expert[i] -++ # if start_idx == end_idx: # 没有 token -++ # continue -++ -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = self.experts[i](expert_tokens) -++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++ -++ # expert_cache = mindspore.mint.scatter_add( -++ # expert_cache, -++ # 0, -++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++ # expert_out -++ # ) -++ -++ # return expert_cache -++ -++ -+ -+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+ # """ -+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ -+ # Initialize weights and apply final processing -+ self.post_init() -++ self.warm_up = False -++ -++ def warmup_moe_model_deep(self): -++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++ test_texts = [ -++ "warmup short", -++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++ ] -++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++ if tokenizer is None: -++ from mindnlp.transformers import AutoTokenizer -++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++ self._warmup_tokenizer = tokenizer -++ -++ for text in test_texts: -++ inputs = tokenizer(text, return_tensors="ms") -++ with mindspore._no_grad(): -++ _ = self(**inputs, use_cache=False) -++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+ -+ def get_input_embeddings(self): -+ return self.model.embed_tokens -+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+ ```""" -++ if not self.warm_up: -++ self.warm_up = True -++ self.warmup_moe_model_deep() -++ -+ output_attentions = ( -+ output_attentions -+ if output_attentions is not None -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index 3cbf820e..d4c6b651 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -18,7 +18,6 @@ -+ # See the License for the specific language governing permissions and -+ # limitations under the License. -+ """MindSpore Qwen2MoE model.""" -+- -+ import math -+ from typing import List, Optional, Tuple, Union -+ -+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+ TokenClassifierOutput, -+ ) -+ from ...modeling_utils import PreTrainedModel -++from ...generation import GenerationMixin -+ from ....utils import logging -+ from .configuration_qwen2_moe import Qwen2MoeConfig -+ -+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+ self.variance_epsilon = eps -+ -+ def forward(self, hidden_states): -++ # @dwj -++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++ # @lwx -++ # if not self.training : -++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+ input_dtype = hidden_states.dtype -+ hidden_states = hidden_states.to(mindspore.float32) -+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+@@ -234,6 +239,8 @@ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+ x1 = x[..., : x.shape[-1] // 2] -+ x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+ self.config = config -+ self.hidden_size = config.hidden_size -+ self.intermediate_size = intermediate_size -++ -+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+ self.act_fn = ACT2FN[config.hidden_act] -+ -+ def forward(self, x): -+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+- -+ -++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++ # @lwx -++ # gate_up_output = self.gate_up_proj(x) -++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++ # return self.down_proj(swiglu_output) -++ -++ # def forward(self, x): -++ # gate_proj_out = self.gate_proj(x) -++ # up_proj_out = self.up_proj(x) -++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++ # return self.down_proj(swiglu_out) -++ -+ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+ """ -+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+ use_cache: bool = False, -+ cache_position: Optional[mindspore.Tensor] = None, -+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ -++ -+ bsz, q_len, _ = hidden_states.shape -+ -+ query_states = self.q_proj(hidden_states) -+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+ "with a layer index." -+ ) -+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ if isinstance(past_key_value, StaticCache): -++ kv_seq_len = key_states.shape[-2] -++ else: -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ if past_key_value is not None: -+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++ if isinstance(past_key_value, StaticCache): -++ kv_seq_len = key_states.shape[-2] -+ -+ # repeat k/v heads if n_kv_heads < n_heads -+ key_states = repeat_kv(key_states, self.num_key_value_groups) -+ value_states = repeat_kv(value_states, self.num_key_value_groups) -+- -++ -+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+ -+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+- raise ValueError( -+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+- f" {attn_weights.shape}" -+- ) -+- -+- if attention_mask is not None: # no matter the length, we just slice it -+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++ if attention_mask is not None: -++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+ attn_weights = attn_weights + causal_mask -+ -+ # upcast attention to fp32 -+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+ -+ attn_output = self.o_proj(attn_output) -+- -++ # @lwx -++ -++ # max_seq_len = self.max_position_embeddings # 2048 -++ -++ # if attention_mask is not None: -++ # # attention_mask: [B, 1, Sq, Sk] -++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++ -++ # # pad 到 [max_seq_len, max_seq_len] -++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++ # global_attention_mask = padded_mask -++ # else: -++ # global_attention_mask = None -++ -++ -++ # sparse_mode=3 -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, -++ # key=key_states, -++ # value=value_states, -++ # real_shift=None, -++ # padding_mask=None, -++ -++ # head_num=self.num_heads, -++ # attn_mask=global_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++ # input_layout="BNSD", -++ # pre_tokens=2147483647, -++ # next_tokens=2147483647, -++ # inner_precise=0, -++ # drop_mask=None, -++ # prefix=None, -++ # actual_seq_qlen=None, -++ # actual_seq_kvlen=None, -++ # sparse_mode=sparse_mode, -++ # ) -+ if not output_attentions: -+ attn_weights = None -+ -+ return attn_output, attn_weights, past_key_value -+ -+ -++class Qwen2MoeFlashAttention(nn.Module): -++ """ -++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++ -++ 关键改动: -++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++ 直接传入原始的 key 和 value 张量效率更高。 -++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++ """ -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++ super().__init__() -++ self.config = config -++ self.layer_idx = layer_idx -++ self.hidden_size = config.hidden_size -++ self.num_heads = config.num_attention_heads -++ self.head_dim = self.hidden_size // self.num_heads -++ self.num_key_value_heads = config.num_key_value_heads -++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++ self.max_position_embeddings = config.max_position_embeddings -++ self.rope_theta = config.rope_theta -++ self.attention_dropout = config.attention_dropout -++ -++ if (self.head_dim * self.num_heads) != self.hidden_size: -++ raise ValueError( -++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++ ) -++ -++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++ -++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++ self.head_dim, -++ max_position_embeddings=self.max_position_embeddings, -++ base=self.rope_theta, -++ ) -++ -++ def forward( -++ self, -++ hidden_states: mindspore.Tensor, -++ attention_mask: Optional[mindspore.Tensor] = None, -++ position_ids: Optional[mindspore.Tensor] = None, -++ past_key_value: Optional[Cache] = None, -++ output_attentions: bool = False, -++ use_cache: bool = False, -++ cache_position: Optional[mindspore.Tensor] = None, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ bsz, q_len, _ = hidden_states.shape -++ -++ # 1. 线性投射 Q, K, V -++ query_states = self.q_proj(hidden_states) -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++ # query: [B, S, H*D] -> [B, N1, S, D] -++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # 3. RoPE 旋转位置编码 -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++ if self.layer_idx is None: -++ raise ValueError( -++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ "with a layer index." -++ ) -++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++ if cache_position.shape[0] == 1: -++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++ kv_seq_len = past_seen_tokens + 1 -++ else: -++ # prefill 阶段:cache_position 是范围,使用其长度 -++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++ else: -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # 4. KV 缓存更新 -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ key_states, value_states = past_key_value.update( -++ key_states, value_states, self.layer_idx, cache_kwargs -++ ) -++ -++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++ if cache_position.shape[0] == 1: -++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++ kv_seq_len = key_states.shape[-2] -++ -++ # 5. [重要] 准备 Attention Mask -++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++ fa_attention_mask = None -++ if attention_mask is not None: -++ # 截取与当前key长度匹配的部分 -++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++ fa_attention_mask = (mask_slice != 0) -++ -++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++ input_dtype = query_states.dtype -++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++ query_states = query_states.to(mindspore.float16) -++ key_states = key_states.to(mindspore.float16) -++ value_states = value_states.to(mindspore.float16) -++ -++ # 6. [核心] 调用 flash_attention_score 算子 -++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++ attn_output = mindspore.ops.flash_attention_score( -++ query=query_states, -++ key=key_states, -++ value=value_states, -++ head_num=self.num_heads, # 传入Q的头数(N1) -++ attn_mask=fa_attention_mask, -++ keep_prob=1.0 - self.attention_dropout, -++ scalar_value=1.0 / math.sqrt(self.head_dim), -++ input_layout="BNSD", -++ sparse_mode=0 # 使用 defaultMask 模式 -++ ) -++ -++ # 恢复原始数据类型 -++ attn_output = attn_output.to(input_dtype) -++ -++ # 7. 调整输出形状 -++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ attn_output = self.o_proj(attn_output) -++ -++ # FlashAttention 算子不直接返回注意力权重矩阵 -++ attn_weights = None -++ if output_attentions: -++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++ return attn_output, attn_weights, past_key_value -++ -++ # def forward( -++ # self, -++ # hidden_states: mindspore.Tensor, -++ # attention_mask: Optional[mindspore.Tensor] = None, -++ # position_ids: Optional[mindspore.Tensor] = None, -++ # past_key_value: Optional[Cache] = None, -++ # output_attentions: bool = False, -++ # use_cache: bool = False, -++ # cache_position: Optional[mindspore.Tensor] = None, -++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ # bsz, q_len, _ = hidden_states.shape -++ -++ # # 1. 线性投射 Q, K, V -++ # query_states = self.q_proj(hidden_states) -++ # key_states = self.k_proj(hidden_states) -++ # value_states = self.v_proj(hidden_states) -++ -++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # # 3. RoPE 旋转位置编码 -++ # kv_seq_len = key_states.shape[-2] -++ # if past_key_value is not None: -++ # if self.layer_idx is None: -++ # raise ValueError( -++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ # "with a layer index." -++ # ) -++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # # 4. KV 缓存更新 -++ # if past_key_value is not None: -++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ # key_states, value_states = past_key_value.update( -++ # key_states, value_states, self.layer_idx, cache_kwargs -++ # ) -++ -++ # # 5. 准备 Attention Mask -++ # fa_attention_mask = None -++ # if attention_mask is not None: -++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # fa_attention_mask = (mask_slice != 0) -++ -++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++ # input_dtype = query_states.dtype -++ -++ # # 6. [核心] 调用 flash_attention_score 算子 -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, -++ # key=key_states, -++ # value=value_states, -++ # head_num=self.num_heads, -++ # attn_mask=fa_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++ # input_layout="BNSD", -++ # sparse_mode=0, -++ # # <--- 修改点 2: 启用内部高精度计算 --- -++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++ # inner_precise=1 -++ # ) -++ -++ # # 恢复原始数据类型 -++ # attn_output = attn_output.to(input_dtype) -++ -++ # # 7. 调整输出形状 -++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ # attn_output = self.o_proj(attn_output) -++ -++ # attn_weights = None -++ # if output_attentions: -++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++ # return attn_output, attn_weights, past_key_value -++ -++ # def forward( -++ # self, -++ # hidden_states: mindspore.Tensor, -++ # attention_mask: Optional[mindspore.Tensor] = None, -++ # position_ids: Optional[mindspore.Tensor] = None, -++ # past_key_value: Optional[Cache] = None, -++ # output_attentions: bool = False, -++ # use_cache: bool = False, -++ # cache_position: Optional[mindspore.Tensor] = None, -++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ # bsz, q_len, _ = hidden_states.shape -++ -++ # query_states = self.q_proj(hidden_states) -++ # key_states = self.k_proj(hidden_states) -++ # value_states = self.v_proj(hidden_states) -++ -++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # kv_seq_len = key_states.shape[-2] -++ # if past_key_value is not None: -++ # if self.layer_idx is None: -++ # raise ValueError("`layer_idx` must be specified for caching") -++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # if past_key_value is not None: -++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ # key_states, value_states = past_key_value.update( -++ # key_states, value_states, self.layer_idx, cache_kwargs -++ # ) -++ -++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++ -++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++ # query_states = query_states / math.sqrt(self.head_dim) -++ # # <--- 修改结束 --- -++ -++ # fa_attention_mask = None -++ # if attention_mask is not None: -++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # fa_attention_mask = (mask_slice != 0) -++ -++ # input_dtype = query_states.dtype -++ -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, # 传入已经预先缩放过的 query -++ # key=key_states, -++ # value=value_states, -++ # head_num=self.num_heads, -++ # attn_mask=fa_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++ # input_layout="BNSD", -++ # sparse_mode=0, -++ # inner_precise=1 # 仍然保持内部高精度计算 -++ # ) -++ -++ # attn_output = attn_output.to(input_dtype) -++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ # attn_output = self.o_proj(attn_output) -++ -++ # attn_weights = None -++ # if output_attentions: -++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++ -++ # return attn_output, attn_weights, past_key_value -++ -+ QWEN2MOE_ATTENTION_CLASSES = { -+ "eager": Qwen2MoeAttention, -++ "flash-attention": Qwen2MoeFlashAttention, -+ } -+ -+ -+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -++ #@dwj -++ # 只遍历激活的专家,而非全部专家 -+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+- batch_size, sequence_length, hidden_dim = hidden_states.shape -+- hidden_states = hidden_states.view(-1, hidden_dim) -+- # router_logits: (batch * sequence_length, n_experts) -+- router_logits = self.gate(hidden_states) -+- -+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- if self.norm_topk_prob: -+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- # we cast back to the input dtype -+- routing_weights = routing_weights.to(hidden_states.dtype) -+- -+- final_hidden_states = ops.zeros( -+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+- ) -+- -+- # One hot encode the selected experts to create an expert mask -+- # this will be used to easily index which expert is going to be sollicitated -+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+- -+- # Loop over all available experts in the model and perform the computation on each expert -+- for expert_idx in range(self.num_experts): -+- expert_layer = self.experts[expert_idx] -+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+- -+- # Index the correct hidden states and compute the expert hidden state for -+- # the current expert. We need to make sure to multiply the output hidden -+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+- if 0 not in idx.shape: -+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+- -+- # However `index_add_` only support torch tensors for indexing so we'll use -+- # the `top_x` tensor here. -+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+- -+- shared_expert_output = self.shared_expert(hidden_states) -+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+- -+- final_hidden_states = final_hidden_states + shared_expert_output -++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++ num_tokens = hidden_states_reshaped.shape[0] -++ -++ router_logits = self.gate(hidden_states_reshaped) -++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++ if self.norm_topk_prob: -++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ routing_weights = routing_weights.to(hidden_states.dtype) -++ -++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++ flat_selected_experts = selected_experts.flatten() -++ -++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++ token_indices = broadcasted_token_indices.flatten() -++ -++ active_experts = ops.unique(flat_selected_experts) -++ -++ for expert_idx_tensor in active_experts: -++ expert_idx = expert_idx_tensor.item() -++ expert_layer = self.experts[expert_idx] -++ -++ mask = (flat_selected_experts == expert_idx_tensor) -++ selected_token_indices = token_indices[mask] -++ selected_routing_weights = routing_weights.flatten()[mask] -++ -++ current_states = hidden_states_reshaped[selected_token_indices] -++ -++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++ -++ final_hidden_states = final_hidden_states.index_add( -++ dim=0, -++ index=selected_token_indices, -++ source=expert_output.to(hidden_states.dtype) -++ ) -++ -++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+ -+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+- return final_hidden_states, router_logits -++ final_hidden_states = final_hidden_states + shared_expert_output -++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++ -++ return final_hidden_states, router_logits -+ -+ -+ class Qwen2MoeDecoderLayer(nn.Module): -+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+ -+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++ -+ if (layer_idx not in config.mlp_only_layers) and ( -+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+ ): -+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+ _skip_keys_device_placement = "past_key_values" -+ _supports_cache_class = True -++#lwx -++ # _supports_static_cache = True -+ -+ def _init_weights(self, module): -+ std = self.config.initializer_range -+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+ return causal_mask -+ -+ -+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ _tied_weights_keys = ["lm_head.weight"] -+ -+ def __init__(self, config): -+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ self.num_experts_per_tok = config.num_experts_per_tok -+ # Initialize weights and apply final processing -+ self.post_init() -++ # @lwx -++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++ # self.generation_config.cache_implementation = "static" -++ self._warmed_up = False -++ -++ def warmup_moe_model(self): -++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++ test_texts = [ -++ "warmup short", -++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++ ] -++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++ if tokenizer is None: -++ from mindnlp.transformers import AutoTokenizer -++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++ self._warmup_tokenizer = tokenizer -++ -++ for text in test_texts: -++ inputs = tokenizer(text, return_tensors="ms") -++ with mindspore._no_grad(): -++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+ -+ def get_input_embeddings(self): -+ return self.model.embed_tokens -+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+ ```""" -++ if not self._warmed_up: -++ self._warmed_up = True -++ self.warmup_moe_model() -+ -+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+ output_router_logits = ( -+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ } -+ ) -+ return model_inputs -++# @lwx -++ # def _decode_one_tokens_logits( -++ # self, -++ # cur_token: mindspore.Tensor, -++ # input_pos: Optional[mindspore.Tensor], -++ # cache_position: mindspore.Tensor, -++ # past_key_values: StaticCache, -++ # ) -> mindspore.Tensor: -++ # """ -++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++ -++ # Args: -++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++ # input_pos: 输入位置信息,可选 -++ # cache_position: 当前token在cache中的位置,shape为(1,) -++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++ -++ # Returns: -++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++ # """ -++ # # 调用JIT编译的版本 -++ # return self.get_decode_one_tokens_logits( -++ # cur_token=cur_token, -++ # input_pos=input_pos, -++ # cache_position=cache_position, -++ # past_key_values=past_key_values, -++ # ) -++ -++ # @mindspore.jit(jit_level='O1') -++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++ # """ -++ # JIT编译的函数,用于高效的单token解码 -++ # 使用JIT编译优化以支持静态shape和高效执行 -++ -++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++ # """ -++ # outputs = self.model.forward( -++ # input_ids=cur_token, -++ # position_ids=input_pos, -++ # cache_position=cache_position, -++ # past_key_values=past_key_values, -++ # use_cache=True, -++ # return_dict=False, -++ # ) -++ -++ # hidden_states = outputs[0] -++ # logits = self.lm_head.forward(hidden_states) -++ # logits = logits.float() -++ -++ # return logits[:, -1, :] -++ -++ # def _sample( -++ # self, -++ # input_ids: mindspore.Tensor, -++ # logits_processor, -++ # stopping_criteria, -++ # generation_config, -++ # synced_devices: bool, -++ # streamer=None, -++ # logits_warper=None, -++ # **model_kwargs, -++ # ): -++ # """ -++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++ # """ -++ # from ...generation.logits_process import LogitsProcessorList -++ # from ...generation.stopping_criteria import StoppingCriteriaList -++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++ # from mindnlp.core import nn, ops, no_grad -++ # import numpy as np -++ -++ # # 检查是否使用 StaticCache -++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++ # # 否则,直接调用父类方法 -++ # past_key_values = model_kwargs.get("past_key_values") -++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++ -++ # if not isinstance(past_key_values, StaticCache): -++ # # 不使用 StaticCache,直接调用父类方法 -++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++ # return super()._sample( -++ # input_ids=input_ids, -++ # logits_processor=logits_processor, -++ # stopping_criteria=stopping_criteria, -++ # generation_config=generation_config, -++ # synced_devices=synced_devices, -++ # streamer=streamer, -++ # logits_warper=logits_warper, -++ # **model_kwargs, -++ # ) -++ -++ # # 使用 StaticCache,进入自定义循环 -++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++ # pad_token_id = generation_config._pad_token_tensor -++ # output_attentions = generation_config.output_attentions -++ # output_hidden_states = generation_config.output_hidden_states -++ # output_scores = generation_config.output_scores -++ # output_logits = generation_config.output_logits -++ # return_dict_in_generate = generation_config.return_dict_in_generate -++ # max_length = generation_config.max_length -++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++ # do_sample = generation_config.do_sample -++ -++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++ # raise ValueError( -++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++ # f"{logits_warper})." -++ # ) -++ -++ # # init attention / hidden states / scores tuples -++ # scores = () if (return_dict_in_generate and output_scores) else None -++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++ -++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++ # encoder_hidden_states = ( -++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++ # ) -++ -++ # # keep track of which sequences are already finished -++ # batch_size, cur_len = input_ids.shape -++ # this_peer_finished = False -++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++ -++ # time_record = [] -++ # from ....utils.testing_utils import parse_flag_from_env -++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++ -++ # while self._has_unfinished_sequences( -++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++ # ): -++ # if _record_time: -++ # import time as time_module -++ # infer_start = time_module.time() -++ -++ # # prepare model inputs -++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++ -++ # # prepare variable output controls -++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++ -++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++ # cur_cache_position = model_inputs.get("cache_position") -++ # cur_past_key_values = model_inputs.get("past_key_values") -++ # cur_input_ids = model_inputs.get("input_ids") -++ -++ # if (isinstance(cur_past_key_values, StaticCache) and -++ # cur_cache_position is not None and -++ # len(cur_cache_position.shape) > 0 and -++ # cur_cache_position.shape[0] == 1 and -++ # cur_input_ids is not None and -++ # cur_input_ids.shape[1] == 1): -++ # # 使用 JIT 优化的单 token 解码 -++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++ # if not hasattr(self, '_jit_used'): -++ # self._jit_used = False -++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++ -++ # next_token_logits = self.get_decode_one_tokens_logits( -++ # cur_token=cur_input_ids, -++ # input_pos=model_inputs.get("position_ids"), -++ # cache_position=cur_cache_position, -++ # past_key_values=cur_past_key_values, -++ # ) -++ -++ # # 标记已使用JIT(用于后续判断) -++ # if not self._jit_used: -++ # self._jit_used = True -++ -++ # # 构造兼容的输出对象 -++ # class JitOptimizedOutput: -++ # def __init__(self, logits, config): -++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++ # self.config = config -++ # # 对于 JIT 优化路径,这些属性通常不需要 -++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++ # self.attentions = None if not config.is_encoder_decoder else None -++ # self.cross_attentions = None -++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++ # self.hidden_states = None if not config.is_encoder_decoder else None -++ -++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++ # else: -++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++ # outputs = self(**model_inputs, return_dict=True) -++ -++ # if synced_devices and this_peer_finished: -++ # continue -++ -++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++ # next_token_logits = outputs.logits[:, -1, :] -++ -++ # # pre-process distribution -++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++ # if do_sample: -++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++ -++ # # Store scores, attentions and hidden_states when required -++ # if return_dict_in_generate: -++ # if output_scores: -++ # scores += (next_token_scores,) -++ # if output_logits: -++ # raw_logits += (next_token_logits,) -++ # if output_attentions: -++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++ # decoder_attentions += (attn,) if attn is not None else (None,) -++ # if self.config.is_encoder_decoder: -++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++ -++ # if output_hidden_states: -++ # hidden = ( -++ # outputs.decoder_hidden_states -++ # if self.config.is_encoder_decoder -++ # else outputs.hidden_states -++ # ) -++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++ -++ # # token selection -++ # if do_sample: -++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++ # else: -++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++ -++ # # finished sentences should have their next token be a padding token -++ # if has_eos_stopping_criteria: -++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++ -++ # # update generated ids, model inputs, and length for next step -++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++ # if streamer is not None: -++ # streamer.put(next_tokens) -++ -++ # model_kwargs = self._update_model_kwargs_for_generation( -++ # outputs, -++ # model_kwargs, -++ # is_encoder_decoder=self.config.is_encoder_decoder, -++ # ) -++ -++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++ # cur_len += 1 -++ -++ # if _record_time: -++ # import time as time_module -++ # infer_stop = time_module.time() -++ # time_record.append(infer_stop - infer_start) -++ -++ # del outputs -++ -++ # average_infer_time = None -++ # if time_record: -++ # if len(time_record) > 1: -++ # time_record.pop(0) -++ # average_infer_time = sum(time_record) / len(time_record) -++ # print(f'average inference time is: {average_infer_time}') -++ # print(f'inference time record: {time_record}') -++ -++ # if streamer is not None: -++ # streamer.end() -++ -++ # # 简单判断:打印是否使用了JIT路径 -++ # if hasattr(self, '_jit_used') and self._jit_used: -++ # print("[JIT] ✓ JIT optimization was used during generation") -++ # else: -++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++ -++ # if return_dict_in_generate: -++ # if self.config.is_encoder_decoder: -++ # return GenerateEncoderDecoderOutput( -++ # sequences=input_ids, -++ # scores=scores, -++ # logits=raw_logits, -++ # encoder_attentions=encoder_attentions, -++ # encoder_hidden_states=encoder_hidden_states, -++ # decoder_attentions=decoder_attentions, -++ # cross_attentions=cross_attentions, -++ # decoder_hidden_states=decoder_hidden_states, -++ # past_key_values=model_kwargs.get("past_key_values"), -++ # average_infer_time=average_infer_time -++ # ) -++ # else: -++ # return GenerateDecoderOnlyOutput( -++ # sequences=input_ids, -++ # scores=scores, -++ # logits=raw_logits, -++ # attentions=decoder_attentions, -++ # hidden_states=decoder_hidden_states, -++ # past_key_values=model_kwargs.get("past_key_values"), -++ # average_infer_time=average_infer_time -++ # ) -++ # else: -++ # return input_ids -++ -++ # def _prepare_cache_for_generation( -++ # self, -++ # generation_config, -++ # model_kwargs, -++ # assistant_model, -++ # batch_size, -++ # max_cache_length, -++ # ): -++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++ # generation_config.cache_implementation = "static" -++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++ -++ # if generation_config.cache_implementation == "static": -++ # base_required_from_max_length = generation_config.max_length + 1 -++ # base_required = max(max_cache_length, base_required_from_max_length) -++ # min_cache_size = 50 -++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++ # else: -++ # max_cache_length = max(base_required, min_cache_size) -++ -++ # original_max_cache_length = max_cache_length -++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++ # print(f" - input max_cache_length: {original_max_cache_length}") -++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++ # print(f" - final max_cache_length: {max_cache_length}") -++ -++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++ # if max_cache_length > self.config.max_position_embeddings: -++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++ -++ # result = super()._prepare_cache_for_generation( -++ # generation_config=generation_config, -++ # model_kwargs=model_kwargs, -++ # assistant_model=assistant_model, -++ # batch_size=batch_size, -++ # max_cache_length=max_cache_length, -++ # ) -++ -++ # if generation_config.cache_implementation == "static": -++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++ # created_cache = model_kwargs.get(cache_name) -++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++ # if created_cache.max_cache_len < generation_config.max_length: -++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++ -++ # return result -++ -++ -++ -+ -+ -+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" deleted file mode 100644 index d64b7f3f..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0003-20261106secondcommit.patch" +++ /dev/null @@ -1,2769 +0,0 @@ -From 7a37d9be16fe823c251701c26bbb20cc09f9922a Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Thu, 6 Nov 2025 14:54:37 +0800 -Subject: [PATCH 03/10] 20261106secondcommit - ---- - .../models/deepseek/modeling_deepseek.py | 217 ++- - .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- - patches/0001-20251104commit.patch | 1272 ----------------- - 3 files changed, 528 insertions(+), 2032 deletions(-) - delete mode 100644 patches/0001-20251104commit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 73773c22..2f9192bf 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) - - _CONFIG_FOR_DOC = "DeepseekConfig" - -+_attn_mask_cache = {} -+ -+def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -+ q_len = batch_and_seq[1] -+ kv_len = batch_and_seq[1] + past_key_values_length -+ key = (batch_and_seq[0], q_len, kv_len) -+ -+ if key in _attn_mask_cache: -+ return _attn_mask_cache[key] -+ -+ mask = _prepare_4d_causal_attention_mask( -+ attention_mask, -+ batch_and_seq, -+ inputs_embeds, -+ past_key_values_length, -+ ) -+ _attn_mask_cache[key] = mask -+ return mask - - def _get_unpad_data(attention_mask): - seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): - return final_output - - -- @no_grad() -- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- expert_cache = ops.zeros_like(x) -- idxs = flat_expert_indices.argsort() -- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- token_idxs = idxs // self.num_experts_per_tok -- -- for i, end_idx in enumerate(tokens_per_expert): -- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- if start_idx == end_idx: -- continue -- expert = self.experts[i] -- exp_token_idx = token_idxs[start_idx:end_idx] -- expert_tokens = x[exp_token_idx] -- expert_out = expert(expert_tokens) -- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -- -- return expert_cache -- - # @no_grad() -- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -- # # expert_cache = torch.zeros_like(x) -- # # idxs = flat_expert_indices.argsort() -- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -- # # token_idxs = idxs // self.num_experts_per_tok -- # # for i, end_idx in enumerate(tokens_per_expert): -- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -- # # if start_idx == end_idx: -- # # continue -- # # expert = self.experts[i] -- # # exp_token_idx = token_idxs[start_idx:end_idx] -- # # expert_tokens = x[exp_token_idx] -- # # expert_out = expert(expert_tokens) -- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -- # # return expert_cache -+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): - # expert_cache = ops.zeros_like(x) - # idxs = flat_expert_indices.argsort() - # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): - # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) - - # return expert_cache -- # @no_grad() -- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -- # expert_cache = ops.zeros_like(x) -+ -+ @no_grad() -+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ """ -+ 优化版 MoE prefill: -+ - 批量张量化处理同一个 expert 的所有 token -+ - 跳过无 token 的专家 -+ - 保持结果完全一致 -+ """ -+ # 初始化输出缓存 -+ expert_cache = ops.zeros_like(x) - -- # # 排序保证顺序一致 -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- # token_idxs = idxs // self.num_experts_per_tok -+ # 排序(确保 scatter_add 位置对应原逻辑) -+ idxs = flat_expert_indices.argsort() -+ sorted_expert_indices = flat_expert_indices[idxs] -+ sorted_token_indices = idxs // self.num_experts_per_tok - -- # # 找出有 token 的专家 -- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+ # 每个 expert 的 token 数 -+ tokens_per_expert = sorted_expert_indices.bincount() - -- # for i in active_experts.tolist(): -- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- # end_idx = tokens_per_expert[i] -- # if start_idx == end_idx: # 没有 token -- # continue -+ # 找出有 token 的专家 -+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() - -- # exp_token_idx = token_idxs[start_idx:end_idx] -- # expert_tokens = x[exp_token_idx] -- # expert_out = self.experts[i](expert_tokens) -- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+ for expert_id in active_experts.tolist(): -+ # 取该 expert 对应的排序后 token 区间 -+ start = (tokens_per_expert[:expert_id]).sum().item() -+ end = start + tokens_per_expert[expert_id].item() - -- # expert_cache = mindspore.mint.scatter_add( -- # expert_cache, -- # 0, -- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -- # expert_out -- # ) -+ token_idx = sorted_token_indices[start:end] # 原 token 位置 -+ expert_tokens = x[token_idx] # 取输入向量 - -- # return expert_cache -+ # 执行专家 MLP -+ expert_out = self.experts[expert_id](expert_tokens) -+ -+ # 按权重缩放 -+ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -+ -+ # 回写到缓存(等价 scatter_add) -+ expert_cache = mindspore.mint.scatter_add( -+ expert_cache, -+ 0, -+ token_idx.view(-1, 1).tile((1, x.shape[-1])), -+ scaled_out -+ ) -+ -+ return expert_cache -+ -+ # @no_grad() -+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+ # # expert_cache = torch.zeros_like(x) -+ # # idxs = flat_expert_indices.argsort() -+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+ # # token_idxs = idxs // self.num_experts_per_tok -+ # # for i, end_idx in enumerate(tokens_per_expert): -+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+ # # if start_idx == end_idx: -+ # # continue -+ # # expert = self.experts[i] -+ # # exp_token_idx = token_idxs[start_idx:end_idx] -+ # # expert_tokens = x[exp_token_idx] -+ # # expert_out = expert(expert_tokens) -+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+ # # return expert_cache -+ # expert_cache = ops.zeros_like(x) -+ # idxs = flat_expert_indices.argsort() -+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ # token_idxs = idxs // self.num_experts_per_tok -+ -+ # for i, end_idx in enumerate(tokens_per_expert): -+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ # if start_idx == end_idx: -+ # continue -+ # expert = self.experts[i] -+ # exp_token_idx = token_idxs[start_idx:end_idx] -+ # expert_tokens = x[exp_token_idx] -+ # expert_out = expert(expert_tokens) -+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+ -+ # return expert_cache -+ # @no_grad() -+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+ # expert_cache = ops.zeros_like(x) -+ -+ # # 排序保证顺序一致 -+ # idxs = flat_expert_indices.argsort() -+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ # token_idxs = idxs // self.num_experts_per_tok -+ -+ # # 找出有 token 的专家 -+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+ -+ # for i in active_experts.tolist(): -+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ # end_idx = tokens_per_expert[i] -+ # if start_idx == end_idx: # 没有 token -+ # continue -+ -+ # exp_token_idx = token_idxs[start_idx:end_idx] -+ # expert_tokens = x[exp_token_idx] -+ # expert_out = self.experts[i](expert_tokens) -+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+ -+ # expert_cache = mindspore.mint.scatter_add( -+ # expert_cache, -+ # 0, -+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+ # expert_out -+ # ) -+ -+ # return expert_cache - - - -@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): - - return attn_output, attn_weights, past_key_value - -- - # class DeepseekFlashAttention(nn.Module): - # """ - # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): - - return attn_output, attn_weights, past_key_value - -+ - Deepseek_ATTENTION_CLASSES = { - "eager": DeepseekAttention, - "flash-attention": DeepseekFlashAttention, -@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): - ) - else: - # 4d mask is passed through the layers -- attention_mask = _prepare_4d_causal_attention_mask( -+ # attention_mask = _prepare_4d_causal_attention_mask( -+ # attention_mask, -+ # (batch_size, seq_length), -+ # inputs_embeds, -+ # past_key_values_length, -+ # ) -+ #@dwj -+ attention_mask = get_cached_causal_mask( - attention_mask, - (batch_size, seq_length), - inputs_embeds, -@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - # Initialize weights and apply final processing - self.post_init() - self.warm_up = False -+ #@dwj -+ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+ self.num_layers, -+ self.num_attention_heads, -+ self.head_dim, -+ batch_size=1, -+ max_length=self.max_length, -+ dtype=mindspore.float16 -+ ) -+ -+ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+ key_cache = [] -+ value_cache = [] -+ for _ in range(num_layers): -+ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+ key_cache.append(k) -+ value_cache.append(v) -+ return key_cache, value_cache -+ - - def warmup_moe_model_deep(self): - print("[Warmup] DeepSeek-MoE 模型预热开始...") -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index bced285c..ebd7782e 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) - _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" - _CONFIG_FOR_DOC = "Qwen2MoeConfig" - --Long_Prompt = False --PROMPT_LENGTH_THRESHOLD = 128 -+Long_Prompt = 1 -+LONG_PROMPT_LENGTH_THRESHOLD = 128 -+SHORT_PROMPT_LENGTH_THRESHOLD = 32 -+ -+_causal_mask_cache = {} -+ -+def get_cached_causal_mask_with_cache_position( -+ attention_mask: mindspore.Tensor, -+ sequence_length: int, -+ target_length: int, -+ dtype: mindspore.dtype, -+ min_dtype: float, -+ cache_position: mindspore.Tensor, -+ batch_size: int, -+): -+ """ -+ 带缓存的 causal mask 构造函数 -+ """ -+ # q_len 是当前 query 长度 -+ q_len = sequence_length -+ # kv_len 是 target_length -+ kv_len = target_length -+ -+ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -+ key = (batch_size, q_len, kv_len, dtype, min_dtype) -+ -+ if key in _causal_mask_cache: -+ return _causal_mask_cache[key] -+ -+ # 调用原来的 mask 构造逻辑 -+ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+ attention_mask, -+ sequence_length=sequence_length, -+ target_length=target_length, -+ dtype=dtype, -+ min_dtype=min_dtype, -+ cache_position=cache_position, -+ batch_size=batch_size, -+ ) -+ # 缓存结果 -+ _causal_mask_cache[key] = causal_mask -+ return causal_mask - - # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position - def _prepare_4d_causal_attention_mask_with_cache_position( -@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: - - - # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -+# class Qwen2MoeAttention(nn.Module): -+# """ -+# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+# and "Generating Long Sequences with Sparse Transformers". -+# """ -+ -+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+# super().__init__() -+# self.config = config -+# self.layer_idx = layer_idx -+# if layer_idx is None: -+# logger.warning_once( -+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+# "when creating this class." -+# ) -+ -+# self.hidden_size = config.hidden_size -+# self.num_heads = config.num_attention_heads -+# self.head_dim = self.hidden_size // self.num_heads -+# self.num_key_value_heads = config.num_key_value_heads -+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+# self.max_position_embeddings = config.max_position_embeddings -+# self.rope_theta = config.rope_theta -+# self.is_causal = True -+# self.attention_dropout = config.attention_dropout -+ -+# if (self.head_dim * self.num_heads) != self.hidden_size: -+# raise ValueError( -+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+# f" and `num_heads`: {self.num_heads})." -+# ) -+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+ -+# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+# self.head_dim, -+# max_position_embeddings=self.max_position_embeddings, -+# base=self.rope_theta, -+# ) -+ -+# def forward( -+# self, -+# hidden_states: mindspore.Tensor, -+# attention_mask: Optional[mindspore.Tensor] = None, -+# position_ids: Optional[mindspore.Tensor] = None, -+# past_key_value: Optional[Cache] = None, -+# output_attentions: bool = False, -+# use_cache: bool = False, -+# cache_position: Optional[mindspore.Tensor] = None, -+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+ -+ -+ -+# bsz, q_len, _ = hidden_states.shape -+ -+# query_states = self.q_proj(hidden_states) -+# key_states = self.k_proj(hidden_states) -+# value_states = self.v_proj(hidden_states) -+ -+# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+ -+# kv_seq_len = key_states.shape[-2] -+# if past_key_value is not None: -+# if self.layer_idx is None: -+# raise ValueError( -+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+# "with a layer index." -+# ) -+# if isinstance(past_key_value, StaticCache): -+# kv_seq_len = key_states.shape[-2] -+# else: -+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+# if past_key_value is not None: -+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+# if isinstance(past_key_value, StaticCache): -+# kv_seq_len = key_states.shape[-2] -+ -+# # repeat k/v heads if n_kv_heads < n_heads -+# key_states = repeat_kv(key_states, self.num_key_value_groups) -+# value_states = repeat_kv(value_states, self.num_key_value_groups) -+ -+# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+ -+# if attention_mask is not None: -+# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+# attn_weights = attn_weights + causal_mask -+ -+# # upcast attention to fp32 -+# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+# attn_output = ops.matmul(attn_weights, value_states) -+ -+# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+# raise ValueError( -+# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+# f" {attn_output.shape}" -+# ) -+ -+# attn_output = ops.transpose(attn_output, 1, 2) -+# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+ -+# attn_output = self.o_proj(attn_output) -+# # @lwx -+ -+# # max_seq_len = self.max_position_embeddings # 2048 -+ -+# # if attention_mask is not None: -+# # # attention_mask: [B, 1, Sq, Sk] -+# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+ -+# # # pad 到 [max_seq_len, max_seq_len] -+# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+# # global_attention_mask = padded_mask -+# # else: -+# # global_attention_mask = None -+ -+ -+# # sparse_mode=3 -+# # attn_output = mindspore.ops.flash_attention_score( -+# # query=query_states, -+# # key=key_states, -+# # value=value_states, -+# # real_shift=None, -+# # padding_mask=None, -+ -+# # head_num=self.num_heads, -+# # attn_mask=global_attention_mask, -+# # keep_prob=1.0 - self.attention_dropout, -+# # scalar_value=1.0 / math.sqrt(self.head_dim), -+# # input_layout="BNSD", -+# # pre_tokens=2147483647, -+# # next_tokens=2147483647, -+# # inner_precise=0, -+# # drop_mask=None, -+# # prefix=None, -+# # actual_seq_qlen=None, -+# # actual_seq_kvlen=None, -+# # sparse_mode=sparse_mode, -+# # ) -+# if not output_attentions: -+# attn_weights = None -+ -+# return attn_output, attn_weights, past_key_value -+ - class Qwen2MoeAttention(nn.Module): - """ -- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -- and "Generating Long Sequences with Sparse Transformers". -- """ -+ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 - -+ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -+ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -+ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -+ -+ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -+ """ - def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): - super().__init__() - self.config = config -@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): - if layer_idx is None: - logger.warning_once( - f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " - "when creating this class." - ) - -@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): - use_cache: bool = False, - cache_position: Optional[mindspore.Tensor] = None, - ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- - -- -+ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- - bsz, q_len, _ = hidden_states.shape - - query_states = self.q_proj(hidden_states) - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - -- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- -+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ - kv_seq_len = key_states.shape[-2] - if past_key_value is not None: -- if self.layer_idx is None: -- raise ValueError( -- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -- "with a layer index." -- ) -- if isinstance(past_key_value, StaticCache): -- kv_seq_len = key_states.shape[-2] -- else: -- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ - cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) - query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) - - if past_key_value is not None: -- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} - key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+ -+ # --- 2. 动态调度核心注意力计算 --- -+ global Long_Prompt -+ if Long_Prompt >= 1: -+ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -+ fa_attention_mask = None -+ if attention_mask is not None: -+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+ fa_attention_mask = (mask_slice != 0) -+ -+ attn_output = mindspore.ops.flash_attention_score( -+ query=query_states, -+ key=key_states, -+ value=value_states, -+ head_num=self.num_heads, -+ attn_mask=fa_attention_mask, -+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -+ scalar_value=1.0 / math.sqrt(self.head_dim), -+ input_layout="BNSD", -+ sparse_mode=0, -+ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -+ ) - -- if isinstance(past_key_value, StaticCache): -- kv_seq_len = key_states.shape[-2] -+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ attn_output = self.o_proj(attn_output) -+ attn_weights = None -+ if output_attentions: -+ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") - -- # repeat k/v heads if n_kv_heads < n_heads -- key_states = repeat_kv(key_states, self.num_key_value_groups) -- value_states = repeat_kv(value_states, self.num_key_value_groups) -- -- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+ else: -+ # --- Eager Attention 路径 (用于短序列和解码) --- -+ key_states = repeat_kv(key_states, self.num_key_value_groups) -+ value_states = repeat_kv(value_states, self.num_key_value_groups) -+ -+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) - -- if attention_mask is not None: -- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -- attn_weights = attn_weights + causal_mask -+ if attention_mask is not None: -+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+ attn_weights = attn_weights + causal_mask - -- # upcast attention to fp32 -- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -- attn_output = ops.matmul(attn_weights, value_states) -+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+ attn_output = ops.matmul(attn_weights, value_states) - -- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -- raise ValueError( -- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -- f" {attn_output.shape}" -- ) -+ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+ raise ValueError( -+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -+ ) - -- attn_output = ops.transpose(attn_output, 1, 2) -- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+ attn_output = ops.transpose(attn_output, 1, 2) -+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+ attn_output = self.o_proj(attn_output) - -- attn_output = self.o_proj(attn_output) -- # @lwx -+ if not output_attentions: -+ attn_weights = None - -- # max_seq_len = self.max_position_embeddings # 2048 -- -- # if attention_mask is not None: -- # # attention_mask: [B, 1, Sq, Sk] -- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -- -- # # pad 到 [max_seq_len, max_seq_len] -- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -- # global_attention_mask = padded_mask -- # else: -- # global_attention_mask = None -- -- -- # sparse_mode=3 -- # attn_output = mindspore.ops.flash_attention_score( -- # query=query_states, -- # key=key_states, -- # value=value_states, -- # real_shift=None, -- # padding_mask=None, -- -- # head_num=self.num_heads, -- # attn_mask=global_attention_mask, -- # keep_prob=1.0 - self.attention_dropout, -- # scalar_value=1.0 / math.sqrt(self.head_dim), -- # input_layout="BNSD", -- # pre_tokens=2147483647, -- # next_tokens=2147483647, -- # inner_precise=0, -- # drop_mask=None, -- # prefix=None, -- # actual_seq_qlen=None, -- # actual_seq_kvlen=None, -- # sparse_mode=sparse_mode, -- # ) -- if not output_attentions: -- attn_weights = None -- - return attn_output, attn_weights, past_key_value - -- - # class Qwen2MoeFlashAttention(nn.Module): - # """ - # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { - # return final_hidden_states, router_logits - - --# class Qwen2MoeSparseMoeBlock(nn.Module): --# """ --# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --# `_moe_infer_prefill` (用于长序列处理) 方法。 --# """ --# def __init__(self, config: Qwen2MoeConfig): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# # 门控网络 --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# # 专家列表 --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) --# # 共享专家 --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# @no_grad() --# def _moe_infer_decode( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# """ --# 【解码路径】针对 sequence_length=1 的极致优化。 --# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --# """ --# batch_size, hidden_dim = hidden_states.shape -- --# expert_outputs_list = [ --# ops.cat([ --# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --# ], dim=0) --# for i in range(batch_size) --# ] -- --# # --- 错误修复:将 axis=0 修改为 dim=0 --- --# # shape: (batch_size, top_k, hidden_dim) --# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -- --# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -- --# return moe_output.squeeze(1) -- --# @no_grad() --# def _moe_infer_prefill( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# """ --# 【预填充路径】针对 sequence_length > 1 的优化。 --# 按专家对 Token 进行分组,并进行批处理。 --# """ --# moe_output = ops.zeros_like(hidden_states) --# num_tokens = hidden_states.shape[0] --# flat_selected_experts = selected_experts.flatten() -- --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -- --# active_experts = ops.unique(flat_selected_experts) -- --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] -- --# mask = (flat_selected_experts == expert_idx_tensor) --# selected_token_indices = token_indices[mask] --# selected_routing_weights = routing_weights.flatten()[mask] -- --# current_states = hidden_states[selected_token_indices] -- --# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -- --# moe_output = moe_output.index_add( --# dim=0, --# index=selected_token_indices, --# source=expert_output.to(hidden_states.dtype) --# ) --# return moe_output -- --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# """ --# 顶层 forward 方法,作为智能分发器。 --# """ --# batch_size, sequence_length, hidden_dim = hidden_states.shape -- --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- --# routing_weights = routing_weights.to(hidden_states.dtype) -- --# moe_output = None --# # 在推理时,根据序列长度选择最优路径 --# if not self.training: --# if sequence_length == 1: --# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --# else: --# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --# else: --# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --# raise NotImplementedError("Training path is not implemented.") -- --# shared_expert_output = self.shared_expert(hidden_states_reshaped) --# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -- --# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -- --# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- -- --# class Qwen2MoeSparseMoeBlock(nn.Module): --# """ --# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --# """ --# def __init__(self, config: Qwen2MoeConfig): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# # 门控网络 --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# # 专家列表 --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) --# # 共享专家 --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# @no_grad() --# def _moe_infer_decode( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# batch_size, _ = hidden_states.shape --# expert_outputs_list = [ --# ops.cat([ --# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --# ], dim=0) --# for i in range(batch_size) --# ] --# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --# return moe_output.squeeze(1) -- --# @no_grad() --# def _moe_infer_prefill( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# moe_output = ops.zeros_like(hidden_states) --# num_tokens = hidden_states.shape[0] --# flat_selected_experts = selected_experts.flatten() --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --# active_experts = ops.unique(flat_selected_experts) -- --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] --# mask = (flat_selected_experts == expert_idx_tensor) --# selected_token_indices = token_indices[mask] --# selected_routing_weights = routing_weights.flatten()[mask] --# current_states = hidden_states[selected_token_indices] --# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --# moe_output = moe_output.index_add( --# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --# ) --# return moe_output -- --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# """ --# 顶层 forward 方法,作为智能分发器。 --# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --# """ --# batch_size, sequence_length, hidden_dim = hidden_states.shape -- --# # 1. 门控计算 (通用逻辑) --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- --# routing_weights = routing_weights.to(hidden_states.dtype) -- --# # 2. 智能分发到最优 MoE 路径 --# moe_output = None --# if not self.training: --# if sequence_length == 1: --# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --# else: --# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --# else: --# raise NotImplementedError("Training path is not implemented.") -- --# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -- --# # 4. 合并 MoE 输出和共享专家输出 --# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -- --# # 5. 恢复原始形状并返回 --# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- --# prefill fastest --# class Qwen2MoeSparseMoeBlock(nn.Module): --# """ --# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --# """ --# def __init__(self, config: Qwen2MoeConfig): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# # 门控网络 --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# # 专家列表 --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) --# # 共享专家 --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# @no_grad() --# def _moe_infer_dispatch( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# """ --# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --# """ --# moe_output = ops.zeros_like(hidden_states) --# num_tokens, _ = hidden_states.shape -- --# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --# flat_selected_experts = selected_experts.flatten() --# flat_routing_weights = routing_weights.flatten() -- --# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -- --# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --# active_experts = ops.unique(flat_selected_experts) -- --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] -- --# # 找到所有分配给该专家的 token --# mask = (flat_selected_experts == expert_idx_tensor) -- --# # 使用 mask 选取对应的 token 和权重 --# current_token_indices = token_indices[mask] --# current_routing_weights = flat_routing_weights[mask] --# current_hidden_states = hidden_states[current_token_indices] -- --# # 对这些 token 进行批处理 --# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -- --# # 使用 index_add 将结果精确地加回到对应位置 --# moe_output = moe_output.index_add( --# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --# ) --# return moe_output -- --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# """ --# 顶层 forward 方法,作为智能分发器。 --# """ --# batch_size, sequence_length, hidden_dim = hidden_states.shape -- --# # 1. 门控计算 --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- --# routing_weights = routing_weights.to(hidden_states.dtype) -- --# # 2. 调用统一的 MoE 计算内核 --# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -- --# # 3. 统一处理共享专家 --# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -- --# # 4. 合并输出 --# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -- --# # 5. 恢复原始形状并返回 --# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- -- --# class Qwen2MoeSparseMoeBlock(nn.Module): --# """ --# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --# 【最终高性能与高精度版】: --# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --# 3. 这样实现了速度和准确性的两全其美。 --# """ --# def __init__(self, config: Qwen2MoeConfig): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# @no_grad() --# def _moe_infer_decode( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# """ --# 【解码路径】极致优化版:bmm + 高精度累加。 --# """ --# original_dtype = hidden_states.dtype --# batch_size, _ = hidden_states.shape -- --# expert_outputs_list = [ --# ops.cat([ --# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --# ], dim=0) --# for i in range(batch_size) --# ] --# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -- --# # 在 float32 下执行 bmm,得到高精度结果 --# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -- --# # 将高精度结果转换回原始数据类型 --# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -- --# return moe_output -- --# @no_grad() --# def _moe_infer_prefill( --# self, --# hidden_states: mindspore.Tensor, --# selected_experts: mindspore.Tensor, --# routing_weights: mindspore.Tensor --# ) -> mindspore.Tensor: --# """ --# 【预填充路径】与原始实现一致,结果精确。 --# """ --# moe_output = ops.zeros_like(hidden_states) --# num_tokens, _ = hidden_states.shape --# flat_selected_experts = selected_experts.flatten() --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --# active_experts = ops.unique(flat_selected_experts) -- --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] --# mask = (flat_selected_experts == expert_idx_tensor) --# selected_token_indices = token_indices[mask] --# selected_routing_weights = routing_weights.flatten()[mask] --# current_states = hidden_states[selected_token_indices] --# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --# moe_output = moe_output.index_add( --# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --# ) --# return moe_output -- --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# batch_size, sequence_length, hidden_dim = hidden_states.shape -- --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- --# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --# # 如果模型主体是 float16,后续再转换 -- --# moe_output = None --# if not self.training: --# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --# # _moe_infer_decode 内部会处理好类型转换 --# temp_routing_weights = routing_weights.to(hidden_states.dtype) --# if sequence_length == 1: --# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --# else: --# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --# else: --# raise NotImplementedError("Training path is not implemented.") -- --# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -- --# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- -- --# class Qwen2MoeSparseMoeBlock(nn.Module): --# """ --# 【融合版】一个混合专家模块,内置两种推理策略, --# 由外部全局变量 `Long_Prompt` 控制: -- --# - if Long_Prompt is True: 【精度优先模式】 --# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --# 适用于处理长序列,避免误差累积。 -- --# - if Long_Prompt is False: 【速度优先模式】 --# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --# 在解码阶段获得极致速度,同时保证结果高度准确。 --# """ --# def __init__(self, config: Qwen2MoeConfig): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# # --- 速度优先模式的辅助函数 --- --# @no_grad() --# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --# original_dtype = hidden_states.dtype --# batch_size, _ = hidden_states.shape --# expert_outputs_list = [ --# ops.cat([ --# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --# ], dim=0) --# for i in range(batch_size) --# ] --# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --# weights_fp32 = routing_weights.to(mindspore.float32) --# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --# return moe_output_fp32.squeeze(1).to(original_dtype) -- --# @no_grad() --# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --# moe_output = ops.zeros_like(hidden_states) --# num_tokens, _ = hidden_states.shape --# flat_selected_experts = selected_experts.flatten() --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --# active_experts = ops.unique(flat_selected_experts) --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] --# mask = (flat_selected_experts == expert_idx_tensor) --# selected_token_indices = token_indices[mask] --# selected_routing_weights = routing_weights.flatten()[mask] --# current_states = hidden_states[selected_token_indices] --# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --# return moe_output -- --# # --- 精度优先模式的辅助函数 --- --# @no_grad() --# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --# moe_output = ops.zeros_like(hidden_states) --# num_tokens, _ = hidden_states.shape --# flat_selected_experts = selected_experts.flatten() --# flat_routing_weights = routing_weights.flatten() --# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --# active_experts = ops.unique(flat_selected_experts) --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] --# mask = (flat_selected_experts == expert_idx_tensor) --# current_token_indices = token_indices[mask] --# current_routing_weights = flat_routing_weights[mask] --# current_hidden_states = hidden_states[current_token_indices] --# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --# return moe_output -- --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# # 声明我们将要使用一个在模块外部定义的全局变量 --# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --# global Long_Prompt -- --# # 1. 门控计算 (所有模式通用) --# batch_size, sequence_length, hidden_dim = hidden_states.shape --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- --# moe_output = None --# if not self.training: --# # 根据 Long_Prompt 标志选择模式 --# if Long_Prompt: --# # --- 精度优先模式 --- --# routing_weights_casted = routing_weights.to(hidden_states.dtype) --# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --# else: --# # --- 速度优先模式 --- --# routing_weights_casted = routing_weights.to(hidden_states.dtype) --# if sequence_length == 1: --# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --# else: --# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --# else: --# raise NotImplementedError("Training path is not implemented.") -- --# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -- --# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- - class Qwen2MoeSparseMoeBlock(nn.Module): - """ - 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) - return moe_output_fp32.squeeze(1).to(original_dtype) - -+ # @no_grad() -+ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ # num_tokens, _ = hidden_states.shape -+ # flat_selected_experts = selected_experts.flatten() -+ # sorted_expert_indices = flat_selected_experts.argsort() -+ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+ # original_token_indices = sorted_expert_indices // self.top_k -+ # moe_output = ops.zeros_like(hidden_states) -+ # current_token_offset = 0 -+ # for i in range(self.num_experts): -+ # expert_token_count = tokens_per_expert[i] - current_token_offset -+ # if expert_token_count == 0: -+ # continue -+ # end_offset = current_token_offset + expert_token_count -+ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+ # expert_hidden_states = hidden_states[expert_original_token_indices] -+ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+ # current_token_offset += expert_token_count -+ # return moe_output -+ - @no_grad() - def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- num_tokens, _ = hidden_states.shape -- flat_selected_experts = selected_experts.flatten() -- sorted_expert_indices = flat_selected_experts.argsort() -- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -- original_token_indices = sorted_expert_indices // self.top_k -+ """ -+ 优化版 MoE prefill (速度优先模式): -+ - 批量张量化处理同一个 expert 的所有 token -+ - 跳过无 token 的专家 -+ - 保持结果完全一致 -+ """ - moe_output = ops.zeros_like(hidden_states) -- current_token_offset = 0 -- for i in range(self.num_experts): -- expert_token_count = tokens_per_expert[i] - current_token_offset -- if expert_token_count == 0: -- continue -- end_offset = current_token_offset + expert_token_count -- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -- expert_hidden_states = hidden_states[expert_original_token_indices] -- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -- current_token_offset += expert_token_count -+ -+ flat_selected_experts = selected_experts.flatten() -+ flat_routing_weights = routing_weights.flatten() -+ -+ idxs = flat_selected_experts.argsort() -+ sorted_expert_indices = flat_selected_experts[idxs] -+ sorted_token_indices = idxs // self.top_k -+ -+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -+ -+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+ -+ for expert_id in active_experts.tolist(): -+ start = int(tokens_per_expert[:expert_id].sum().item()) -+ end = start + int(tokens_per_expert[expert_id].item()) -+ -+ token_idx = sorted_token_indices[start:end] -+ expert_tokens = hidden_states[token_idx] -+ -+ expert_out = self.experts[expert_id](expert_tokens) -+ -+ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -+ -+ moe_output = mindspore.mint.scatter_add( -+ moe_output, -+ 0, -+ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -+ scaled_out.to(hidden_states.dtype) -+ ) -+ - return moe_output - -+ - # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- - @no_grad() - def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) - - moe_output = None -- if Long_Prompt: -- # --- 精度优先模式 (ACCURACY MODE) --- -- routing_weights_casted = routing_weights.to(hidden_states.dtype) -- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # if Long_Prompt==0: -+ # # --- 精度优先模式 (ACCURACY MODE) --- -+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # else: -+ # # --- 速度优先模式 (SPEED MODE) --- -+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ # if sequence_length == 1: -+ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # else: -+ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ -+ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ if sequence_length == 1: -+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) - else: -- # --- 速度优先模式 (SPEED MODE) --- -- routing_weights_casted = routing_weights.to(hidden_states.dtype) -- if sequence_length == 1: -- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -- else: -- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -- -+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ - - # 3. 共享专家计算与合并 (所有模式通用) - gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - - return final_hidden_states, router_logits - -+ - class Qwen2MoeDecoderLayer(nn.Module): - def __init__(self, config: Qwen2MoeConfig, layer_idx: int): - super().__init__() - self.hidden_size = config.hidden_size - -- # if Long_Prompt: -- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- # else: -+ # if Long_Prompt == 2: - # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+ # else: -+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - - self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - -@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): - ) - - # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+ # attention_mask, -+ # sequence_length=sequence_length, -+ # target_length=target_length, -+ # dtype=dtype, -+ # min_dtype=min_dtype, -+ # cache_position=cache_position, -+ # batch_size=input_tensor.shape[0], -+ # ) -+ #@dwj -+ causal_mask = get_cached_causal_mask_with_cache_position( - attention_mask, - sequence_length=sequence_length, - target_length=target_length, -@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 - 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 - """ -- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -+ _causal_mask_cache.clear() - - input_ids = kwargs.get("input_ids") - if input_ids is None and args: -@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - - if input_ids is not None: - prompt_length = input_ids.shape[1] -- -- if prompt_length > PROMPT_LENGTH_THRESHOLD: -- Long_Prompt = True -+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -+ Long_Prompt = 2 -+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -+ Long_Prompt = 0 - else: -- Long_Prompt = False -+ Long_Prompt = 1 -+ - - return super().generate(*args, **kwargs) - -@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - dtype = self.lm_head.weight.dtype - min_dtype = float(ops.finfo(dtype).min) - -- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+ # attention_mask, -+ # sequence_length=sequence_length, -+ # target_length=past_key_values.get_max_length(), -+ # dtype=dtype, -+ # min_dtype=min_dtype, -+ # cache_position=cache_position, -+ # batch_size=batch_size, -+ # ) -+ -+ #@dwj -+ attention_mask = get_cached_causal_mask_with_cache_position( - attention_mask, - sequence_length=sequence_length, - target_length=past_key_values.get_max_length(), -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -deleted file mode 100644 -index 6dfb5b93..00000000 ---- a/patches/0001-20251104commit.patch -+++ /dev/null -@@ -1,1272 +0,0 @@ --From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH] 20251104commit -- ----- -- mindnlp/transformers/cache_utils.py | 28 +- -- .../models/deepseek/modeling_deepseek.py | 149 ++- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -- 3 files changed, 976 insertions(+), 87 deletions(-) -- --diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --index cadd2e04..02f8d4be 100644 ----- a/mindnlp/transformers/cache_utils.py --+++ b/mindnlp/transformers/cache_utils.py --@@ -812,14 +812,26 @@ class StaticCache(Cache): -- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -- # k_out[:, :, cache_position] = key_states -- # v_out[:, :, cache_position] = value_states --- if ON_ORANGE_PI: --- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --- else: --- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --- --+ # if ON_ORANGE_PI: --+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+ # else: --+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+ # 确保 cache_position 是 1D tensor 并且类型正确 --+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+ if cache_position.ndim > 1: --+ cache_position = cache_position.flatten() --+ # 确保类型是 int32 或 int64(MindSpore 要求) --+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+ cache_position = cache_position.int() --+ --+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+ k_out[:, :, cache_position] = key_states --+ v_out[:, :, cache_position] = value_states --+ -- return k_out, v_out -- -- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index c695b944..d8303e45 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -- # Copied from transformers.models.llama.modeling_llama.rotate_half -- def rotate_half(x): -- """Rotates half the hidden dims of the input.""" --- x1 = x[..., : x.shape[-1] // 2] --- x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+ # x1 = x[..., : x.shape[-1] // 2] --+ # x2 = x[..., x.shape[-1] // 2 :] --+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -- if self.training: -- raise NotImplementedError("Training is not supported yet.") -- else: --- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --- if self.config.n_shared_experts is not None: --- y = y + self.shared_experts(identity) --- return y --+ # @lwx --+ if orig_shape[1] == 1: --+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+ y=y.view(*orig_shape) --+ if self.config.n_shared_experts is not None: --+ y = y + self.shared_experts(identity) --+ return y --+ else: --+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+ if self.config.n_shared_experts is not None: --+ y = y + self.shared_experts(identity) --+ return y --+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+ # if self.config.n_shared_experts is not None: --+ # y = y + self.shared_experts(identity) --+ # return y --+ --+ @no_grad() --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ --+ expert_cache = ops.zeros_like(x) --+ for i in range(self.num_experts_per_tok): --+ expert_id = flat_expert_indices[i].item() --+ weight = flat_expert_weights[i].item() --+ expert = self.experts[expert_id] --+ expert_out = expert(x) --+ expert_cache += expert_out * weight --+ return expert_cache -- -- @no_grad() --- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --- # expert_cache = torch.zeros_like(x) --- # idxs = flat_expert_indices.argsort() --- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --- # token_idxs = idxs // self.num_experts_per_tok --- # for i, end_idx in enumerate(tokens_per_expert): --- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --- # if start_idx == end_idx: --- # continue --- # expert = self.experts[i] --- # exp_token_idx = token_idxs[start_idx:end_idx] --- # expert_tokens = x[exp_token_idx] --- # expert_out = expert(expert_tokens) --- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --- # return expert_cache --+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- expert_cache = ops.zeros_like(x) -- idxs = flat_expert_indices.argsort() -- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- token_idxs = idxs // self.num_experts_per_tok --+ -- for i, end_idx in enumerate(tokens_per_expert): -- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- if start_idx == end_idx: --@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -- expert_out = expert(expert_tokens) -- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ -- return expert_cache --+ --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # # expert_cache = torch.zeros_like(x) --+ # # idxs = flat_expert_indices.argsort() --+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+ # # token_idxs = idxs // self.num_experts_per_tok --+ # # for i, end_idx in enumerate(tokens_per_expert): --+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+ # # if start_idx == end_idx: --+ # # continue --+ # # expert = self.experts[i] --+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+ # # expert_tokens = x[exp_token_idx] --+ # # expert_out = expert(expert_tokens) --+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+ # # return expert_cache --+ # expert_cache = ops.zeros_like(x) --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # for i, end_idx in enumerate(tokens_per_expert): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # if start_idx == end_idx: --+ # continue --+ # expert = self.experts[i] --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = expert(expert_tokens) --+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ --+ # return expert_cache --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # expert_cache = ops.zeros_like(x) --+ --+ # # 排序保证顺序一致 --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # # 找出有 token 的专家 --+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+ --+ # for i in active_experts.tolist(): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # end_idx = tokens_per_expert[i] --+ # if start_idx == end_idx: # 没有 token --+ # continue --+ --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = self.experts[i](expert_tokens) --+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+ --+ # expert_cache = mindspore.mint.scatter_add( --+ # expert_cache, --+ # 0, --+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+ # expert_out --+ # ) --+ --+ # return expert_cache --+ --+ -- -- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -- # """ --@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- -- # Initialize weights and apply final processing -- self.post_init() --+ self.warm_up = False --+ --+ def warmup_moe_model_deep(self): --+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+ test_texts = [ --+ "warmup short", --+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+ ] --+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+ if tokenizer is None: --+ from mindnlp.transformers import AutoTokenizer --+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+ self._warmup_tokenizer = tokenizer --+ --+ for text in test_texts: --+ inputs = tokenizer(text, return_tensors="ms") --+ with mindspore._no_grad(): --+ _ = self(**inputs, use_cache=False) --+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -- -- def get_input_embeddings(self): -- return self.model.embed_tokens --@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -- ```""" --+ if not self.warm_up: --+ self.warm_up = True --+ self.warmup_moe_model_deep() --+ -- output_attentions = ( -- output_attentions -- if output_attentions is not None --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index 3cbf820e..d4c6b651 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -18,7 +18,6 @@ -- # See the License for the specific language governing permissions and -- # limitations under the License. -- """MindSpore Qwen2MoE model.""" --- -- import math -- from typing import List, Optional, Tuple, Union -- --@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -- TokenClassifierOutput, -- ) -- from ...modeling_utils import PreTrainedModel --+from ...generation import GenerationMixin -- from ....utils import logging -- from .configuration_qwen2_moe import Qwen2MoeConfig -- --@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -- self.variance_epsilon = eps -- -- def forward(self, hidden_states): --+ # @dwj --+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+ # @lwx --+ # if not self.training : --+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -- input_dtype = hidden_states.dtype -- hidden_states = hidden_states.to(mindspore.float32) -- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --@@ -234,6 +239,8 @@ def rotate_half(x): -- """Rotates half the hidden dims of the input.""" -- x1 = x[..., : x.shape[-1] // 2] -- x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -- self.config = config -- self.hidden_size = config.hidden_size -- self.intermediate_size = intermediate_size --+ -- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -- self.act_fn = ACT2FN[config.hidden_act] -- -- def forward(self, x): --- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --- -- --+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+ # @lwx --+ # gate_up_output = self.gate_up_proj(x) --+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+ # return self.down_proj(swiglu_output) --+ --+ # def forward(self, x): --+ # gate_proj_out = self.gate_proj(x) --+ # up_proj_out = self.up_proj(x) --+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+ # return self.down_proj(swiglu_out) --+ -- # Copied from transformers.models.llama.modeling_llama.repeat_kv -- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -- """ --@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -- use_cache: bool = False, -- cache_position: Optional[mindspore.Tensor] = None, -- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ --+ -- bsz, q_len, _ = hidden_states.shape -- -- query_states = self.q_proj(hidden_states) --@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -- "with a layer index." -- ) --- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ if isinstance(past_key_value, StaticCache): --+ kv_seq_len = key_states.shape[-2] --+ else: --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- -- if past_key_value is not None: -- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+ if isinstance(past_key_value, StaticCache): --+ kv_seq_len = key_states.shape[-2] -- -- # repeat k/v heads if n_kv_heads < n_heads -- key_states = repeat_kv(key_states, self.num_key_value_groups) -- value_states = repeat_kv(value_states, self.num_key_value_groups) --- --+ -- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -- --- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --- raise ValueError( --- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --- f" {attn_weights.shape}" --- ) --- --- if attention_mask is not None: # no matter the length, we just slice it --- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+ if attention_mask is not None: --+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -- attn_weights = attn_weights + causal_mask -- -- # upcast attention to fp32 --@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -- -- attn_output = self.o_proj(attn_output) --- --+ # @lwx --+ --+ # max_seq_len = self.max_position_embeddings # 2048 --+ --+ # if attention_mask is not None: --+ # # attention_mask: [B, 1, Sq, Sk] --+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+ --+ # # pad 到 [max_seq_len, max_seq_len] --+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+ # global_attention_mask = padded_mask --+ # else: --+ # global_attention_mask = None --+ --+ --+ # sparse_mode=3 --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, --+ # key=key_states, --+ # value=value_states, --+ # real_shift=None, --+ # padding_mask=None, --+ --+ # head_num=self.num_heads, --+ # attn_mask=global_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+ # input_layout="BNSD", --+ # pre_tokens=2147483647, --+ # next_tokens=2147483647, --+ # inner_precise=0, --+ # drop_mask=None, --+ # prefix=None, --+ # actual_seq_qlen=None, --+ # actual_seq_kvlen=None, --+ # sparse_mode=sparse_mode, --+ # ) -- if not output_attentions: -- attn_weights = None -- -- return attn_output, attn_weights, past_key_value -- -- --+class Qwen2MoeFlashAttention(nn.Module): --+ """ --+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+ --+ 关键改动: --+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+ 直接传入原始的 key 和 value 张量效率更高。 --+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+ """ --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+ super().__init__() --+ self.config = config --+ self.layer_idx = layer_idx --+ self.hidden_size = config.hidden_size --+ self.num_heads = config.num_attention_heads --+ self.head_dim = self.hidden_size // self.num_heads --+ self.num_key_value_heads = config.num_key_value_heads --+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+ self.max_position_embeddings = config.max_position_embeddings --+ self.rope_theta = config.rope_theta --+ self.attention_dropout = config.attention_dropout --+ --+ if (self.head_dim * self.num_heads) != self.hidden_size: --+ raise ValueError( --+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+ ) --+ --+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+ --+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+ self.head_dim, --+ max_position_embeddings=self.max_position_embeddings, --+ base=self.rope_theta, --+ ) --+ --+ def forward( --+ self, --+ hidden_states: mindspore.Tensor, --+ attention_mask: Optional[mindspore.Tensor] = None, --+ position_ids: Optional[mindspore.Tensor] = None, --+ past_key_value: Optional[Cache] = None, --+ output_attentions: bool = False, --+ use_cache: bool = False, --+ cache_position: Optional[mindspore.Tensor] = None, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ bsz, q_len, _ = hidden_states.shape --+ --+ # 1. 线性投射 Q, K, V --+ query_states = self.q_proj(hidden_states) --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+ # query: [B, S, H*D] -> [B, N1, S, D] --+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # 3. RoPE 旋转位置编码 --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+ if self.layer_idx is None: --+ raise ValueError( --+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ "with a layer index." --+ ) --+ # 对于 StaticCache,需要特殊处理 kv_seq_len --+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+ if cache_position.shape[0] == 1: --+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+ kv_seq_len = past_seen_tokens + 1 --+ else: --+ # prefill 阶段:cache_position 是范围,使用其长度 --+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+ else: --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # 4. KV 缓存更新 --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ key_states, value_states = past_key_value.update( --+ key_states, value_states, self.layer_idx, cache_kwargs --+ ) --+ --+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+ if cache_position.shape[0] == 1: --+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+ kv_seq_len = key_states.shape[-2] --+ --+ # 5. [重要] 准备 Attention Mask --+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+ fa_attention_mask = None --+ if attention_mask is not None: --+ # 截取与当前key长度匹配的部分 --+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # 转换为布尔类型: 大负数 -> True, 0 -> False --+ fa_attention_mask = (mask_slice != 0) --+ --+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+ input_dtype = query_states.dtype --+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+ query_states = query_states.to(mindspore.float16) --+ key_states = key_states.to(mindspore.float16) --+ value_states = value_states.to(mindspore.float16) --+ --+ # 6. [核心] 调用 flash_attention_score 算子 --+ # - 无需手动 repeat_kv, 算子原生支持 GQA --+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+ attn_output = mindspore.ops.flash_attention_score( --+ query=query_states, --+ key=key_states, --+ value=value_states, --+ head_num=self.num_heads, # 传入Q的头数(N1) --+ attn_mask=fa_attention_mask, --+ keep_prob=1.0 - self.attention_dropout, --+ scalar_value=1.0 / math.sqrt(self.head_dim), --+ input_layout="BNSD", --+ sparse_mode=0 # 使用 defaultMask 模式 --+ ) --+ --+ # 恢复原始数据类型 --+ attn_output = attn_output.to(input_dtype) --+ --+ # 7. 调整输出形状 --+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ attn_output = self.o_proj(attn_output) --+ --+ # FlashAttention 算子不直接返回注意力权重矩阵 --+ attn_weights = None --+ if output_attentions: --+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+ return attn_output, attn_weights, past_key_value --+ --+ # def forward( --+ # self, --+ # hidden_states: mindspore.Tensor, --+ # attention_mask: Optional[mindspore.Tensor] = None, --+ # position_ids: Optional[mindspore.Tensor] = None, --+ # past_key_value: Optional[Cache] = None, --+ # output_attentions: bool = False, --+ # use_cache: bool = False, --+ # cache_position: Optional[mindspore.Tensor] = None, --+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ # bsz, q_len, _ = hidden_states.shape --+ --+ # # 1. 线性投射 Q, K, V --+ # query_states = self.q_proj(hidden_states) --+ # key_states = self.k_proj(hidden_states) --+ # value_states = self.v_proj(hidden_states) --+ --+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # # 3. RoPE 旋转位置编码 --+ # kv_seq_len = key_states.shape[-2] --+ # if past_key_value is not None: --+ # if self.layer_idx is None: --+ # raise ValueError( --+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ # "with a layer index." --+ # ) --+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # # 4. KV 缓存更新 --+ # if past_key_value is not None: --+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ # key_states, value_states = past_key_value.update( --+ # key_states, value_states, self.layer_idx, cache_kwargs --+ # ) --+ --+ # # 5. 准备 Attention Mask --+ # fa_attention_mask = None --+ # if attention_mask is not None: --+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # fa_attention_mask = (mask_slice != 0) --+ --+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+ # input_dtype = query_states.dtype --+ --+ # # 6. [核心] 调用 flash_attention_score 算子 --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, --+ # key=key_states, --+ # value=value_states, --+ # head_num=self.num_heads, --+ # attn_mask=fa_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+ # input_layout="BNSD", --+ # sparse_mode=0, --+ # # <--- 修改点 2: 启用内部高精度计算 --- --+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+ # inner_precise=1 --+ # ) --+ --+ # # 恢复原始数据类型 --+ # attn_output = attn_output.to(input_dtype) --+ --+ # # 7. 调整输出形状 --+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ # attn_output = self.o_proj(attn_output) --+ --+ # attn_weights = None --+ # if output_attentions: --+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+ # return attn_output, attn_weights, past_key_value --+ --+ # def forward( --+ # self, --+ # hidden_states: mindspore.Tensor, --+ # attention_mask: Optional[mindspore.Tensor] = None, --+ # position_ids: Optional[mindspore.Tensor] = None, --+ # past_key_value: Optional[Cache] = None, --+ # output_attentions: bool = False, --+ # use_cache: bool = False, --+ # cache_position: Optional[mindspore.Tensor] = None, --+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ # bsz, q_len, _ = hidden_states.shape --+ --+ # query_states = self.q_proj(hidden_states) --+ # key_states = self.k_proj(hidden_states) --+ # value_states = self.v_proj(hidden_states) --+ --+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # kv_seq_len = key_states.shape[-2] --+ # if past_key_value is not None: --+ # if self.layer_idx is None: --+ # raise ValueError("`layer_idx` must be specified for caching") --+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # if past_key_value is not None: --+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ # key_states, value_states = past_key_value.update( --+ # key_states, value_states, self.layer_idx, cache_kwargs --+ # ) --+ --+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+ --+ # # <--- 核心修改点: 手动进行高精度缩放 --- --+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+ # query_states = query_states / math.sqrt(self.head_dim) --+ # # <--- 修改结束 --- --+ --+ # fa_attention_mask = None --+ # if attention_mask is not None: --+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # fa_attention_mask = (mask_slice != 0) --+ --+ # input_dtype = query_states.dtype --+ --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, # 传入已经预先缩放过的 query --+ # key=key_states, --+ # value=value_states, --+ # head_num=self.num_heads, --+ # attn_mask=fa_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+ # input_layout="BNSD", --+ # sparse_mode=0, --+ # inner_precise=1 # 仍然保持内部高精度计算 --+ # ) --+ --+ # attn_output = attn_output.to(input_dtype) --+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ # attn_output = self.o_proj(attn_output) --+ --+ # attn_weights = None --+ # if output_attentions: --+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+ --+ # return attn_output, attn_weights, past_key_value --+ -- QWEN2MOE_ATTENTION_CLASSES = { -- "eager": Qwen2MoeAttention, --+ "flash-attention": Qwen2MoeFlashAttention, -- } -- -- --@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --+ #@dwj --+ # 只遍历激活的专家,而非全部专家 -- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --- batch_size, sequence_length, hidden_dim = hidden_states.shape --- hidden_states = hidden_states.view(-1, hidden_dim) --- # router_logits: (batch * sequence_length, n_experts) --- router_logits = self.gate(hidden_states) --- --- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- if self.norm_topk_prob: --- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- # we cast back to the input dtype --- routing_weights = routing_weights.to(hidden_states.dtype) --- --- final_hidden_states = ops.zeros( --- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --- ) --- --- # One hot encode the selected experts to create an expert mask --- # this will be used to easily index which expert is going to be sollicitated --- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --- --- # Loop over all available experts in the model and perform the computation on each expert --- for expert_idx in range(self.num_experts): --- expert_layer = self.experts[expert_idx] --- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --- --- # Index the correct hidden states and compute the expert hidden state for --- # the current expert. We need to make sure to multiply the output hidden --- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --- if 0 not in idx.shape: --- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --- --- # However `index_add_` only support torch tensors for indexing so we'll use --- # the `top_x` tensor here. --- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --- --- shared_expert_output = self.shared_expert(hidden_states) --- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --- --- final_hidden_states = final_hidden_states + shared_expert_output --+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+ num_tokens = hidden_states_reshaped.shape[0] --+ --+ router_logits = self.gate(hidden_states_reshaped) --+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+ if self.norm_topk_prob: --+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ routing_weights = routing_weights.to(hidden_states.dtype) --+ --+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+ flat_selected_experts = selected_experts.flatten() --+ --+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+ token_indices = broadcasted_token_indices.flatten() --+ --+ active_experts = ops.unique(flat_selected_experts) --+ --+ for expert_idx_tensor in active_experts: --+ expert_idx = expert_idx_tensor.item() --+ expert_layer = self.experts[expert_idx] --+ --+ mask = (flat_selected_experts == expert_idx_tensor) --+ selected_token_indices = token_indices[mask] --+ selected_routing_weights = routing_weights.flatten()[mask] --+ --+ current_states = hidden_states_reshaped[selected_token_indices] --+ --+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+ --+ final_hidden_states = final_hidden_states.index_add( --+ dim=0, --+ index=selected_token_indices, --+ source=expert_output.to(hidden_states.dtype) --+ ) --+ --+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -- --- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --- return final_hidden_states, router_logits --+ final_hidden_states = final_hidden_states + shared_expert_output --+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+ --+ return final_hidden_states, router_logits -- -- -- class Qwen2MoeDecoderLayer(nn.Module): --@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -- -- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+ -- if (layer_idx not in config.mlp_only_layers) and ( -- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -- ): --@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -- _no_split_modules = ["Qwen2MoeDecoderLayer"] -- _skip_keys_device_placement = "past_key_values" -- _supports_cache_class = True --+#lwx --+ # _supports_static_cache = True -- -- def _init_weights(self, module): -- std = self.config.initializer_range --@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -- return causal_mask -- -- ---class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- _tied_weights_keys = ["lm_head.weight"] -- -- def __init__(self, config): --@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- self.num_experts_per_tok = config.num_experts_per_tok -- # Initialize weights and apply final processing -- self.post_init() --+ # @lwx --+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+ # self.generation_config.cache_implementation = "static" --+ self._warmed_up = False --+ --+ def warmup_moe_model(self): --+ print("[Warmup] Qwen2-MoE 模型预热开始...") --+ test_texts = [ --+ "warmup short", --+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+ ] --+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+ if tokenizer is None: --+ from mindnlp.transformers import AutoTokenizer --+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+ self._warmup_tokenizer = tokenizer --+ --+ for text in test_texts: --+ inputs = tokenizer(text, return_tensors="ms") --+ with mindspore._no_grad(): --+ _ = self(**inputs, output_router_logits=True, use_cache=False) --+ print("[Warmup] Qwen2-MoE 模型预热完成。") -- -- def get_input_embeddings(self): -- return self.model.embed_tokens --@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -- ```""" --+ if not self._warmed_up: --+ self._warmed_up = True --+ self.warmup_moe_model() -- -- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -- output_router_logits = ( --@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- } -- ) -- return model_inputs --+# @lwx --+ # def _decode_one_tokens_logits( --+ # self, --+ # cur_token: mindspore.Tensor, --+ # input_pos: Optional[mindspore.Tensor], --+ # cache_position: mindspore.Tensor, --+ # past_key_values: StaticCache, --+ # ) -> mindspore.Tensor: --+ # """ --+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+ --+ # Args: --+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+ # input_pos: 输入位置信息,可选 --+ # cache_position: 当前token在cache中的位置,shape为(1,) --+ # past_key_values: StaticCache对象,存储之前的key-value状态 --+ --+ # Returns: --+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+ # """ --+ # # 调用JIT编译的版本 --+ # return self.get_decode_one_tokens_logits( --+ # cur_token=cur_token, --+ # input_pos=input_pos, --+ # cache_position=cache_position, --+ # past_key_values=past_key_values, --+ # ) --+ --+ # @mindspore.jit(jit_level='O1') --+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+ # """ --+ # JIT编译的函数,用于高效的单token解码 --+ # 使用JIT编译优化以支持静态shape和高效执行 --+ --+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+ # """ --+ # outputs = self.model.forward( --+ # input_ids=cur_token, --+ # position_ids=input_pos, --+ # cache_position=cache_position, --+ # past_key_values=past_key_values, --+ # use_cache=True, --+ # return_dict=False, --+ # ) --+ --+ # hidden_states = outputs[0] --+ # logits = self.lm_head.forward(hidden_states) --+ # logits = logits.float() --+ --+ # return logits[:, -1, :] --+ --+ # def _sample( --+ # self, --+ # input_ids: mindspore.Tensor, --+ # logits_processor, --+ # stopping_criteria, --+ # generation_config, --+ # synced_devices: bool, --+ # streamer=None, --+ # logits_warper=None, --+ # **model_kwargs, --+ # ): --+ # """ --+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+ # """ --+ # from ...generation.logits_process import LogitsProcessorList --+ # from ...generation.stopping_criteria import StoppingCriteriaList --+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+ # from mindnlp.core import nn, ops, no_grad --+ # import numpy as np --+ --+ # # 检查是否使用 StaticCache --+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+ # # 否则,直接调用父类方法 --+ # past_key_values = model_kwargs.get("past_key_values") --+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+ --+ # if not isinstance(past_key_values, StaticCache): --+ # # 不使用 StaticCache,直接调用父类方法 --+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+ # return super()._sample( --+ # input_ids=input_ids, --+ # logits_processor=logits_processor, --+ # stopping_criteria=stopping_criteria, --+ # generation_config=generation_config, --+ # synced_devices=synced_devices, --+ # streamer=streamer, --+ # logits_warper=logits_warper, --+ # **model_kwargs, --+ # ) --+ --+ # # 使用 StaticCache,进入自定义循环 --+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+ # pad_token_id = generation_config._pad_token_tensor --+ # output_attentions = generation_config.output_attentions --+ # output_hidden_states = generation_config.output_hidden_states --+ # output_scores = generation_config.output_scores --+ # output_logits = generation_config.output_logits --+ # return_dict_in_generate = generation_config.return_dict_in_generate --+ # max_length = generation_config.max_length --+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+ # do_sample = generation_config.do_sample --+ --+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+ # raise ValueError( --+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+ # f"{logits_warper})." --+ # ) --+ --+ # # init attention / hidden states / scores tuples --+ # scores = () if (return_dict_in_generate and output_scores) else None --+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+ --+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+ # if return_dict_in_generate and self.config.is_encoder_decoder: --+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+ # encoder_hidden_states = ( --+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+ # ) --+ --+ # # keep track of which sequences are already finished --+ # batch_size, cur_len = input_ids.shape --+ # this_peer_finished = False --+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+ --+ # time_record = [] --+ # from ....utils.testing_utils import parse_flag_from_env --+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+ --+ # while self._has_unfinished_sequences( --+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+ # ): --+ # if _record_time: --+ # import time as time_module --+ # infer_start = time_module.time() --+ --+ # # prepare model inputs --+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+ --+ # # prepare variable output controls --+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+ --+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+ # cur_cache_position = model_inputs.get("cache_position") --+ # cur_past_key_values = model_inputs.get("past_key_values") --+ # cur_input_ids = model_inputs.get("input_ids") --+ --+ # if (isinstance(cur_past_key_values, StaticCache) and --+ # cur_cache_position is not None and --+ # len(cur_cache_position.shape) > 0 and --+ # cur_cache_position.shape[0] == 1 and --+ # cur_input_ids is not None and --+ # cur_input_ids.shape[1] == 1): --+ # # 使用 JIT 优化的单 token 解码 --+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+ # if not hasattr(self, '_jit_used'): --+ # self._jit_used = False --+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+ --+ # next_token_logits = self.get_decode_one_tokens_logits( --+ # cur_token=cur_input_ids, --+ # input_pos=model_inputs.get("position_ids"), --+ # cache_position=cur_cache_position, --+ # past_key_values=cur_past_key_values, --+ # ) --+ --+ # # 标记已使用JIT(用于后续判断) --+ # if not self._jit_used: --+ # self._jit_used = True --+ --+ # # 构造兼容的输出对象 --+ # class JitOptimizedOutput: --+ # def __init__(self, logits, config): --+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+ # self.config = config --+ # # 对于 JIT 优化路径,这些属性通常不需要 --+ # self.decoder_attentions = None if config.is_encoder_decoder else None --+ # self.attentions = None if not config.is_encoder_decoder else None --+ # self.cross_attentions = None --+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+ # self.hidden_states = None if not config.is_encoder_decoder else None --+ --+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+ # else: --+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+ # outputs = self(**model_inputs, return_dict=True) --+ --+ # if synced_devices and this_peer_finished: --+ # continue --+ --+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+ # next_token_logits = outputs.logits[:, -1, :] --+ --+ # # pre-process distribution --+ # next_token_scores = logits_processor(input_ids, next_token_logits) --+ # if do_sample: --+ # next_token_scores = logits_warper(input_ids, next_token_scores) --+ --+ # # Store scores, attentions and hidden_states when required --+ # if return_dict_in_generate: --+ # if output_scores: --+ # scores += (next_token_scores,) --+ # if output_logits: --+ # raw_logits += (next_token_logits,) --+ # if output_attentions: --+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+ # decoder_attentions += (attn,) if attn is not None else (None,) --+ # if self.config.is_encoder_decoder: --+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+ --+ # if output_hidden_states: --+ # hidden = ( --+ # outputs.decoder_hidden_states --+ # if self.config.is_encoder_decoder --+ # else outputs.hidden_states --+ # ) --+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+ --+ # # token selection --+ # if do_sample: --+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+ # else: --+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+ --+ # # finished sentences should have their next token be a padding token --+ # if has_eos_stopping_criteria: --+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+ --+ # # update generated ids, model inputs, and length for next step --+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+ # if streamer is not None: --+ # streamer.put(next_tokens) --+ --+ # model_kwargs = self._update_model_kwargs_for_generation( --+ # outputs, --+ # model_kwargs, --+ # is_encoder_decoder=self.config.is_encoder_decoder, --+ # ) --+ --+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+ # cur_len += 1 --+ --+ # if _record_time: --+ # import time as time_module --+ # infer_stop = time_module.time() --+ # time_record.append(infer_stop - infer_start) --+ --+ # del outputs --+ --+ # average_infer_time = None --+ # if time_record: --+ # if len(time_record) > 1: --+ # time_record.pop(0) --+ # average_infer_time = sum(time_record) / len(time_record) --+ # print(f'average inference time is: {average_infer_time}') --+ # print(f'inference time record: {time_record}') --+ --+ # if streamer is not None: --+ # streamer.end() --+ --+ # # 简单判断:打印是否使用了JIT路径 --+ # if hasattr(self, '_jit_used') and self._jit_used: --+ # print("[JIT] ✓ JIT optimization was used during generation") --+ # else: --+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+ --+ # if return_dict_in_generate: --+ # if self.config.is_encoder_decoder: --+ # return GenerateEncoderDecoderOutput( --+ # sequences=input_ids, --+ # scores=scores, --+ # logits=raw_logits, --+ # encoder_attentions=encoder_attentions, --+ # encoder_hidden_states=encoder_hidden_states, --+ # decoder_attentions=decoder_attentions, --+ # cross_attentions=cross_attentions, --+ # decoder_hidden_states=decoder_hidden_states, --+ # past_key_values=model_kwargs.get("past_key_values"), --+ # average_infer_time=average_infer_time --+ # ) --+ # else: --+ # return GenerateDecoderOnlyOutput( --+ # sequences=input_ids, --+ # scores=scores, --+ # logits=raw_logits, --+ # attentions=decoder_attentions, --+ # hidden_states=decoder_hidden_states, --+ # past_key_values=model_kwargs.get("past_key_values"), --+ # average_infer_time=average_infer_time --+ # ) --+ # else: --+ # return input_ids --+ --+ # def _prepare_cache_for_generation( --+ # self, --+ # generation_config, --+ # model_kwargs, --+ # assistant_model, --+ # batch_size, --+ # max_cache_length, --+ # ): --+ # if generation_config.cache_implementation is None and self._supports_static_cache: --+ # generation_config.cache_implementation = "static" --+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+ --+ # if generation_config.cache_implementation == "static": --+ # base_required_from_max_length = generation_config.max_length + 1 --+ # base_required = max(max_cache_length, base_required_from_max_length) --+ # min_cache_size = 50 --+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+ # else: --+ # max_cache_length = max(base_required, min_cache_size) --+ --+ # original_max_cache_length = max_cache_length --+ # print(f"[JIT] StaticCache max_cache_length calculation:") --+ # print(f" - input max_cache_length: {original_max_cache_length}") --+ # print(f" - generation_config.max_length: {generation_config.max_length}") --+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+ # print(f" - final max_cache_length: {max_cache_length}") --+ --+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+ # if max_cache_length > self.config.max_position_embeddings: --+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+ --+ # result = super()._prepare_cache_for_generation( --+ # generation_config=generation_config, --+ # model_kwargs=model_kwargs, --+ # assistant_model=assistant_model, --+ # batch_size=batch_size, --+ # max_cache_length=max_cache_length, --+ # ) --+ --+ # if generation_config.cache_implementation == "static": --+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+ # created_cache = model_kwargs.get(cache_name) --+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+ # if created_cache.max_cache_len < generation_config.max_length: --+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+ --+ # return result --+ --+ --+ -- -- -- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ---- --2.27.0 -- --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" deleted file mode 100644 index 25b442d5..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0004-20251106change.patch" +++ /dev/null @@ -1,7498 +0,0 @@ -From 60df5bdc79368911a03b9c034b11b7437df753ca Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Thu, 6 Nov 2025 15:48:09 +0800 -Subject: [PATCH 04/10] 20251106change - ---- - .../models/deepseek/modeling_deepseek.py | 189 +- - patches/0001-20251104commit.patch | 1272 +++++++ - patches/0002-20251106commit.patch | 3200 +++++++++++++++++ - patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ - 4 files changed, 7244 insertions(+), 186 deletions(-) - create mode 100644 patches/0001-20251104commit.patch - create mode 100644 patches/0002-20251106commit.patch - create mode 100644 patches/0003-20261106secondcommit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 2f9192bf..0546f318 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): - - return attn_output, attn_weights, past_key_value - --# class DeepseekFlashAttention(nn.Module): --# """ --# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -- --# This class is designed as a drop-in replacement for DeepseekAttention. --# """ -- --# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --# super().__init__() --# self.config = config --# self.layer_idx = layer_idx --# if layer_idx is None: --# logger.warning( --# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --# "when creating this class." --# ) -- --# self.attention_dropout = config.attention_dropout --# self.hidden_size = config.hidden_size --# self.num_heads = config.num_attention_heads --# self.head_dim = self.hidden_size // self.num_heads --# self.num_key_value_heads = config.num_key_value_heads --# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --# self.max_position_embeddings = config.max_position_embeddings --# self.rope_theta = config.rope_theta --# self.is_causal = True -- --# if (self.head_dim * self.num_heads) != self.hidden_size: --# raise ValueError( --# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --# f" and `num_heads`: {self.num_heads})." --# ) -- --# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --# self._init_rope() -- --# def _init_rope(self): --# if self.config.rope_scaling is None: --# self.rotary_emb = DeepseekRotaryEmbedding( --# self.head_dim, --# max_position_embeddings=self.max_position_embeddings, --# base=self.rope_theta, --# ) --# else: --# scaling_type = self.config.rope_scaling["type"] --# scaling_factor = self.config.rope_scaling["factor"] --# if scaling_type == "linear": --# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --# self.head_dim, --# max_position_embeddings=self.max_position_embeddings, --# scaling_factor=scaling_factor, --# base=self.rope_theta, --# ) --# elif scaling_type == "dynamic": --# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --# self.head_dim, --# max_position_embeddings=self.max_position_embeddings, --# scaling_factor=scaling_factor, --# base=self.rope_theta, --# ) --# else: --# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -- --# def forward( --# self, --# hidden_states: mindspore.Tensor, --# attention_mask: Optional[mindspore.Tensor] = None, --# position_ids: Optional[mindspore.Tensor] = None, --# past_key_value: Optional[Cache] = None, --# output_attentions: bool = False, --# use_cache: bool = False, --# **kwargs, --# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --# if "padding_mask" in kwargs: --# warnings.warn( --# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --# ) -- --# if output_attentions: --# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -- --# bsz, q_len, _ = hidden_states.shape -- --# if self.config.pretraining_tp > 1: --# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -- --# query_states = self.q_proj(hidden_states) --# key_states = self.k_proj(hidden_states) --# value_states = self.v_proj(hidden_states) -- --# # Reshape for multi-head attention --# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- --# kv_seq_len = key_states.shape[-2] --# if past_key_value is not None: --# if self.layer_idx is None: --# raise ValueError( --# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --# "with a layer index." --# ) --# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- --# # Apply Rotary Positional Embedding --# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- --# if past_key_value is not None: --# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -- --# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -- --# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -- --# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -- --# # Convert attention_mask for flash_attention_score --# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --# if attention_mask is not None: --# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --# raise ValueError( --# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --# ) --# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --# else: --# attn_mask_for_fa = None -- --# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -- --# # Call the fused flash_attention_score operator --# attn_output = mindspore.ops.flash_attention_score( --# query=query_states_for_fa, --# key=key_states_for_fa, --# value=value_states_for_fa, --# head_num=self.num_heads, # This is N1, the number of query heads --# input_layout='BSH', --# attn_mask=attn_mask_for_fa, --# keep_prob=keep_prob, --# scalar_value=1.0 / math.sqrt(self.head_dim), --# sparse_mode=0 # Default mask mode --# ) -- --# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --# attn_output = self.o_proj(attn_output) -- --# # Flash Attention does not return attention weights --# attn_weights = None -- --# return attn_output, attn_weights, past_key_value - - class DeepseekFlashAttention(nn.Module): - """ -@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): - super().__init__() - self.hidden_size = config.hidden_size - -- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -- config=config, layer_idx=layer_idx -- ) -+ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+ # config=config, layer_idx=layer_idx -+ # ) - - self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( - config=config, layer_idx=layer_idx -@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): - return outputs - - -- - class DeepseekPreTrainedModel(PreTrainedModel): - config_class = DeepseekConfig - base_model_prefix = "model" -@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - # Initialize weights and apply final processing - self.post_init() - self.warm_up = False -- #@dwj -- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -- self.num_layers, -- self.num_attention_heads, -- self.head_dim, -- batch_size=1, -- max_length=self.max_length, -- dtype=mindspore.float16 -- ) -- -- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -- key_cache = [] -- value_cache = [] -- for _ in range(num_layers): -- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -- key_cache.append(k) -- value_cache.append(v) -- return key_cache, value_cache -- - - def warmup_moe_model_deep(self): - print("[Warmup] DeepSeek-MoE 模型预热开始...") -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -new file mode 100644 -index 00000000..78f22642 ---- /dev/null -+++ b/patches/0001-20251104commit.patch -@@ -0,0 +1,1272 @@ -+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Tue, 4 Nov 2025 09:11:51 +0800 -+Subject: [PATCH 1/3] 20251104commit -+ -+--- -+ mindnlp/transformers/cache_utils.py | 28 +- -+ .../models/deepseek/modeling_deepseek.py | 149 ++- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+ 3 files changed, 976 insertions(+), 87 deletions(-) -+ -+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+index cadd2e04..02f8d4be 100644 -+--- a/mindnlp/transformers/cache_utils.py -++++ b/mindnlp/transformers/cache_utils.py -+@@ -812,14 +812,26 @@ class StaticCache(Cache): -+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+ # k_out[:, :, cache_position] = key_states -+ # v_out[:, :, cache_position] = value_states -+- if ON_ORANGE_PI: -+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+- else: -+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+- -++ # if ON_ORANGE_PI: -++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++ # else: -++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++ # 确保 cache_position 是 1D tensor 并且类型正确 -++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++ if cache_position.ndim > 1: -++ cache_position = cache_position.flatten() -++ # 确保类型是 int32 或 int64(MindSpore 要求) -++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++ cache_position = cache_position.int() -++ -++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++ k_out[:, :, cache_position] = key_states -++ v_out[:, :, cache_position] = value_states -++ -+ return k_out, v_out -+ -+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index c695b944..d8303e45 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+ # Copied from transformers.models.llama.modeling_llama.rotate_half -+ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+- x1 = x[..., : x.shape[-1] // 2] -+- x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++ # x1 = x[..., : x.shape[-1] // 2] -++ # x2 = x[..., x.shape[-1] // 2 :] -++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+ if self.training: -+ raise NotImplementedError("Training is not supported yet.") -+ else: -+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+- if self.config.n_shared_experts is not None: -+- y = y + self.shared_experts(identity) -+- return y -++ # @lwx -++ if orig_shape[1] == 1: -++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++ y=y.view(*orig_shape) -++ if self.config.n_shared_experts is not None: -++ y = y + self.shared_experts(identity) -++ return y -++ else: -++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++ if self.config.n_shared_experts is not None: -++ y = y + self.shared_experts(identity) -++ return y -++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++ # if self.config.n_shared_experts is not None: -++ # y = y + self.shared_experts(identity) -++ # return y -++ -++ @no_grad() -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ -++ expert_cache = ops.zeros_like(x) -++ for i in range(self.num_experts_per_tok): -++ expert_id = flat_expert_indices[i].item() -++ weight = flat_expert_weights[i].item() -++ expert = self.experts[expert_id] -++ expert_out = expert(x) -++ expert_cache += expert_out * weight -++ return expert_cache -+ -+ @no_grad() -+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+- # expert_cache = torch.zeros_like(x) -+- # idxs = flat_expert_indices.argsort() -+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+- # token_idxs = idxs // self.num_experts_per_tok -+- # for i, end_idx in enumerate(tokens_per_expert): -+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+- # if start_idx == end_idx: -+- # continue -+- # expert = self.experts[i] -+- # exp_token_idx = token_idxs[start_idx:end_idx] -+- # expert_tokens = x[exp_token_idx] -+- # expert_out = expert(expert_tokens) -+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+- # return expert_cache -++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ expert_cache = ops.zeros_like(x) -+ idxs = flat_expert_indices.argsort() -+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+ token_idxs = idxs // self.num_experts_per_tok -++ -+ for i, end_idx in enumerate(tokens_per_expert): -+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+ if start_idx == end_idx: -+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+ expert_out = expert(expert_tokens) -+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -+ return expert_cache -++ -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # # expert_cache = torch.zeros_like(x) -++ # # idxs = flat_expert_indices.argsort() -++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++ # # token_idxs = idxs // self.num_experts_per_tok -++ # # for i, end_idx in enumerate(tokens_per_expert): -++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++ # # if start_idx == end_idx: -++ # # continue -++ # # expert = self.experts[i] -++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++ # # expert_tokens = x[exp_token_idx] -++ # # expert_out = expert(expert_tokens) -++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++ # # return expert_cache -++ # expert_cache = ops.zeros_like(x) -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # for i, end_idx in enumerate(tokens_per_expert): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # if start_idx == end_idx: -++ # continue -++ # expert = self.experts[i] -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = expert(expert_tokens) -++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -++ # return expert_cache -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # expert_cache = ops.zeros_like(x) -++ -++ # # 排序保证顺序一致 -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # # 找出有 token 的专家 -++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++ -++ # for i in active_experts.tolist(): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # end_idx = tokens_per_expert[i] -++ # if start_idx == end_idx: # 没有 token -++ # continue -++ -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = self.experts[i](expert_tokens) -++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++ -++ # expert_cache = mindspore.mint.scatter_add( -++ # expert_cache, -++ # 0, -++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++ # expert_out -++ # ) -++ -++ # return expert_cache -++ -++ -+ -+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+ # """ -+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ -+ # Initialize weights and apply final processing -+ self.post_init() -++ self.warm_up = False -++ -++ def warmup_moe_model_deep(self): -++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++ test_texts = [ -++ "warmup short", -++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++ ] -++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++ if tokenizer is None: -++ from mindnlp.transformers import AutoTokenizer -++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++ self._warmup_tokenizer = tokenizer -++ -++ for text in test_texts: -++ inputs = tokenizer(text, return_tensors="ms") -++ with mindspore._no_grad(): -++ _ = self(**inputs, use_cache=False) -++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+ -+ def get_input_embeddings(self): -+ return self.model.embed_tokens -+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+ ```""" -++ if not self.warm_up: -++ self.warm_up = True -++ self.warmup_moe_model_deep() -++ -+ output_attentions = ( -+ output_attentions -+ if output_attentions is not None -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index 3cbf820e..d4c6b651 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -18,7 +18,6 @@ -+ # See the License for the specific language governing permissions and -+ # limitations under the License. -+ """MindSpore Qwen2MoE model.""" -+- -+ import math -+ from typing import List, Optional, Tuple, Union -+ -+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+ TokenClassifierOutput, -+ ) -+ from ...modeling_utils import PreTrainedModel -++from ...generation import GenerationMixin -+ from ....utils import logging -+ from .configuration_qwen2_moe import Qwen2MoeConfig -+ -+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+ self.variance_epsilon = eps -+ -+ def forward(self, hidden_states): -++ # @dwj -++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++ # @lwx -++ # if not self.training : -++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+ input_dtype = hidden_states.dtype -+ hidden_states = hidden_states.to(mindspore.float32) -+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+@@ -234,6 +239,8 @@ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+ x1 = x[..., : x.shape[-1] // 2] -+ x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+ self.config = config -+ self.hidden_size = config.hidden_size -+ self.intermediate_size = intermediate_size -++ -+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+ self.act_fn = ACT2FN[config.hidden_act] -+ -+ def forward(self, x): -+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+- -+ -++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++ # @lwx -++ # gate_up_output = self.gate_up_proj(x) -++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++ # return self.down_proj(swiglu_output) -++ -++ # def forward(self, x): -++ # gate_proj_out = self.gate_proj(x) -++ # up_proj_out = self.up_proj(x) -++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++ # return self.down_proj(swiglu_out) -++ -+ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+ """ -+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+ use_cache: bool = False, -+ cache_position: Optional[mindspore.Tensor] = None, -+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ -++ -+ bsz, q_len, _ = hidden_states.shape -+ -+ query_states = self.q_proj(hidden_states) -+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+ "with a layer index." -+ ) -+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ if isinstance(past_key_value, StaticCache): -++ kv_seq_len = key_states.shape[-2] -++ else: -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ if past_key_value is not None: -+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++ if isinstance(past_key_value, StaticCache): -++ kv_seq_len = key_states.shape[-2] -+ -+ # repeat k/v heads if n_kv_heads < n_heads -+ key_states = repeat_kv(key_states, self.num_key_value_groups) -+ value_states = repeat_kv(value_states, self.num_key_value_groups) -+- -++ -+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+ -+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+- raise ValueError( -+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+- f" {attn_weights.shape}" -+- ) -+- -+- if attention_mask is not None: # no matter the length, we just slice it -+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++ if attention_mask is not None: -++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+ attn_weights = attn_weights + causal_mask -+ -+ # upcast attention to fp32 -+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+ -+ attn_output = self.o_proj(attn_output) -+- -++ # @lwx -++ -++ # max_seq_len = self.max_position_embeddings # 2048 -++ -++ # if attention_mask is not None: -++ # # attention_mask: [B, 1, Sq, Sk] -++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++ -++ # # pad 到 [max_seq_len, max_seq_len] -++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++ # global_attention_mask = padded_mask -++ # else: -++ # global_attention_mask = None -++ -++ -++ # sparse_mode=3 -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, -++ # key=key_states, -++ # value=value_states, -++ # real_shift=None, -++ # padding_mask=None, -++ -++ # head_num=self.num_heads, -++ # attn_mask=global_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++ # input_layout="BNSD", -++ # pre_tokens=2147483647, -++ # next_tokens=2147483647, -++ # inner_precise=0, -++ # drop_mask=None, -++ # prefix=None, -++ # actual_seq_qlen=None, -++ # actual_seq_kvlen=None, -++ # sparse_mode=sparse_mode, -++ # ) -+ if not output_attentions: -+ attn_weights = None -+ -+ return attn_output, attn_weights, past_key_value -+ -+ -++class Qwen2MoeFlashAttention(nn.Module): -++ """ -++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++ -++ 关键改动: -++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++ 直接传入原始的 key 和 value 张量效率更高。 -++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++ """ -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++ super().__init__() -++ self.config = config -++ self.layer_idx = layer_idx -++ self.hidden_size = config.hidden_size -++ self.num_heads = config.num_attention_heads -++ self.head_dim = self.hidden_size // self.num_heads -++ self.num_key_value_heads = config.num_key_value_heads -++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++ self.max_position_embeddings = config.max_position_embeddings -++ self.rope_theta = config.rope_theta -++ self.attention_dropout = config.attention_dropout -++ -++ if (self.head_dim * self.num_heads) != self.hidden_size: -++ raise ValueError( -++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++ ) -++ -++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++ -++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++ self.head_dim, -++ max_position_embeddings=self.max_position_embeddings, -++ base=self.rope_theta, -++ ) -++ -++ def forward( -++ self, -++ hidden_states: mindspore.Tensor, -++ attention_mask: Optional[mindspore.Tensor] = None, -++ position_ids: Optional[mindspore.Tensor] = None, -++ past_key_value: Optional[Cache] = None, -++ output_attentions: bool = False, -++ use_cache: bool = False, -++ cache_position: Optional[mindspore.Tensor] = None, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ bsz, q_len, _ = hidden_states.shape -++ -++ # 1. 线性投射 Q, K, V -++ query_states = self.q_proj(hidden_states) -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++ # query: [B, S, H*D] -> [B, N1, S, D] -++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # 3. RoPE 旋转位置编码 -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++ if self.layer_idx is None: -++ raise ValueError( -++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ "with a layer index." -++ ) -++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++ if cache_position.shape[0] == 1: -++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++ kv_seq_len = past_seen_tokens + 1 -++ else: -++ # prefill 阶段:cache_position 是范围,使用其长度 -++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++ else: -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # 4. KV 缓存更新 -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ key_states, value_states = past_key_value.update( -++ key_states, value_states, self.layer_idx, cache_kwargs -++ ) -++ -++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++ if cache_position.shape[0] == 1: -++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++ kv_seq_len = key_states.shape[-2] -++ -++ # 5. [重要] 准备 Attention Mask -++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++ fa_attention_mask = None -++ if attention_mask is not None: -++ # 截取与当前key长度匹配的部分 -++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++ fa_attention_mask = (mask_slice != 0) -++ -++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++ input_dtype = query_states.dtype -++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++ query_states = query_states.to(mindspore.float16) -++ key_states = key_states.to(mindspore.float16) -++ value_states = value_states.to(mindspore.float16) -++ -++ # 6. [核心] 调用 flash_attention_score 算子 -++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++ attn_output = mindspore.ops.flash_attention_score( -++ query=query_states, -++ key=key_states, -++ value=value_states, -++ head_num=self.num_heads, # 传入Q的头数(N1) -++ attn_mask=fa_attention_mask, -++ keep_prob=1.0 - self.attention_dropout, -++ scalar_value=1.0 / math.sqrt(self.head_dim), -++ input_layout="BNSD", -++ sparse_mode=0 # 使用 defaultMask 模式 -++ ) -++ -++ # 恢复原始数据类型 -++ attn_output = attn_output.to(input_dtype) -++ -++ # 7. 调整输出形状 -++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ attn_output = self.o_proj(attn_output) -++ -++ # FlashAttention 算子不直接返回注意力权重矩阵 -++ attn_weights = None -++ if output_attentions: -++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++ return attn_output, attn_weights, past_key_value -++ -++ # def forward( -++ # self, -++ # hidden_states: mindspore.Tensor, -++ # attention_mask: Optional[mindspore.Tensor] = None, -++ # position_ids: Optional[mindspore.Tensor] = None, -++ # past_key_value: Optional[Cache] = None, -++ # output_attentions: bool = False, -++ # use_cache: bool = False, -++ # cache_position: Optional[mindspore.Tensor] = None, -++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ # bsz, q_len, _ = hidden_states.shape -++ -++ # # 1. 线性投射 Q, K, V -++ # query_states = self.q_proj(hidden_states) -++ # key_states = self.k_proj(hidden_states) -++ # value_states = self.v_proj(hidden_states) -++ -++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # # 3. RoPE 旋转位置编码 -++ # kv_seq_len = key_states.shape[-2] -++ # if past_key_value is not None: -++ # if self.layer_idx is None: -++ # raise ValueError( -++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ # "with a layer index." -++ # ) -++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # # 4. KV 缓存更新 -++ # if past_key_value is not None: -++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ # key_states, value_states = past_key_value.update( -++ # key_states, value_states, self.layer_idx, cache_kwargs -++ # ) -++ -++ # # 5. 准备 Attention Mask -++ # fa_attention_mask = None -++ # if attention_mask is not None: -++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # fa_attention_mask = (mask_slice != 0) -++ -++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++ # input_dtype = query_states.dtype -++ -++ # # 6. [核心] 调用 flash_attention_score 算子 -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, -++ # key=key_states, -++ # value=value_states, -++ # head_num=self.num_heads, -++ # attn_mask=fa_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++ # input_layout="BNSD", -++ # sparse_mode=0, -++ # # <--- 修改点 2: 启用内部高精度计算 --- -++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++ # inner_precise=1 -++ # ) -++ -++ # # 恢复原始数据类型 -++ # attn_output = attn_output.to(input_dtype) -++ -++ # # 7. 调整输出形状 -++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ # attn_output = self.o_proj(attn_output) -++ -++ # attn_weights = None -++ # if output_attentions: -++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++ # return attn_output, attn_weights, past_key_value -++ -++ # def forward( -++ # self, -++ # hidden_states: mindspore.Tensor, -++ # attention_mask: Optional[mindspore.Tensor] = None, -++ # position_ids: Optional[mindspore.Tensor] = None, -++ # past_key_value: Optional[Cache] = None, -++ # output_attentions: bool = False, -++ # use_cache: bool = False, -++ # cache_position: Optional[mindspore.Tensor] = None, -++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ # bsz, q_len, _ = hidden_states.shape -++ -++ # query_states = self.q_proj(hidden_states) -++ # key_states = self.k_proj(hidden_states) -++ # value_states = self.v_proj(hidden_states) -++ -++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ # kv_seq_len = key_states.shape[-2] -++ # if past_key_value is not None: -++ # if self.layer_idx is None: -++ # raise ValueError("`layer_idx` must be specified for caching") -++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ # if past_key_value is not None: -++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ # key_states, value_states = past_key_value.update( -++ # key_states, value_states, self.layer_idx, cache_kwargs -++ # ) -++ -++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++ -++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++ # query_states = query_states / math.sqrt(self.head_dim) -++ # # <--- 修改结束 --- -++ -++ # fa_attention_mask = None -++ # if attention_mask is not None: -++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ # fa_attention_mask = (mask_slice != 0) -++ -++ # input_dtype = query_states.dtype -++ -++ # attn_output = mindspore.ops.flash_attention_score( -++ # query=query_states, # 传入已经预先缩放过的 query -++ # key=key_states, -++ # value=value_states, -++ # head_num=self.num_heads, -++ # attn_mask=fa_attention_mask, -++ # keep_prob=1.0 - self.attention_dropout, -++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++ # input_layout="BNSD", -++ # sparse_mode=0, -++ # inner_precise=1 # 仍然保持内部高精度计算 -++ # ) -++ -++ # attn_output = attn_output.to(input_dtype) -++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ # attn_output = self.o_proj(attn_output) -++ -++ # attn_weights = None -++ # if output_attentions: -++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++ -++ # return attn_output, attn_weights, past_key_value -++ -+ QWEN2MOE_ATTENTION_CLASSES = { -+ "eager": Qwen2MoeAttention, -++ "flash-attention": Qwen2MoeFlashAttention, -+ } -+ -+ -+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -++ #@dwj -++ # 只遍历激活的专家,而非全部专家 -+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+- batch_size, sequence_length, hidden_dim = hidden_states.shape -+- hidden_states = hidden_states.view(-1, hidden_dim) -+- # router_logits: (batch * sequence_length, n_experts) -+- router_logits = self.gate(hidden_states) -+- -+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- if self.norm_topk_prob: -+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- # we cast back to the input dtype -+- routing_weights = routing_weights.to(hidden_states.dtype) -+- -+- final_hidden_states = ops.zeros( -+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+- ) -+- -+- # One hot encode the selected experts to create an expert mask -+- # this will be used to easily index which expert is going to be sollicitated -+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+- -+- # Loop over all available experts in the model and perform the computation on each expert -+- for expert_idx in range(self.num_experts): -+- expert_layer = self.experts[expert_idx] -+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+- -+- # Index the correct hidden states and compute the expert hidden state for -+- # the current expert. We need to make sure to multiply the output hidden -+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+- if 0 not in idx.shape: -+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+- -+- # However `index_add_` only support torch tensors for indexing so we'll use -+- # the `top_x` tensor here. -+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+- -+- shared_expert_output = self.shared_expert(hidden_states) -+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+- -+- final_hidden_states = final_hidden_states + shared_expert_output -++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++ num_tokens = hidden_states_reshaped.shape[0] -++ -++ router_logits = self.gate(hidden_states_reshaped) -++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++ if self.norm_topk_prob: -++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ routing_weights = routing_weights.to(hidden_states.dtype) -++ -++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++ flat_selected_experts = selected_experts.flatten() -++ -++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++ token_indices = broadcasted_token_indices.flatten() -++ -++ active_experts = ops.unique(flat_selected_experts) -++ -++ for expert_idx_tensor in active_experts: -++ expert_idx = expert_idx_tensor.item() -++ expert_layer = self.experts[expert_idx] -++ -++ mask = (flat_selected_experts == expert_idx_tensor) -++ selected_token_indices = token_indices[mask] -++ selected_routing_weights = routing_weights.flatten()[mask] -++ -++ current_states = hidden_states_reshaped[selected_token_indices] -++ -++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++ -++ final_hidden_states = final_hidden_states.index_add( -++ dim=0, -++ index=selected_token_indices, -++ source=expert_output.to(hidden_states.dtype) -++ ) -++ -++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+ -+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+- return final_hidden_states, router_logits -++ final_hidden_states = final_hidden_states + shared_expert_output -++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++ -++ return final_hidden_states, router_logits -+ -+ -+ class Qwen2MoeDecoderLayer(nn.Module): -+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+ -+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++ -+ if (layer_idx not in config.mlp_only_layers) and ( -+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+ ): -+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+ _skip_keys_device_placement = "past_key_values" -+ _supports_cache_class = True -++#lwx -++ # _supports_static_cache = True -+ -+ def _init_weights(self, module): -+ std = self.config.initializer_range -+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+ return causal_mask -+ -+ -+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ _tied_weights_keys = ["lm_head.weight"] -+ -+ def __init__(self, config): -+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ self.num_experts_per_tok = config.num_experts_per_tok -+ # Initialize weights and apply final processing -+ self.post_init() -++ # @lwx -++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++ # self.generation_config.cache_implementation = "static" -++ self._warmed_up = False -++ -++ def warmup_moe_model(self): -++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++ test_texts = [ -++ "warmup short", -++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++ ] -++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++ if tokenizer is None: -++ from mindnlp.transformers import AutoTokenizer -++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++ self._warmup_tokenizer = tokenizer -++ -++ for text in test_texts: -++ inputs = tokenizer(text, return_tensors="ms") -++ with mindspore._no_grad(): -++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+ -+ def get_input_embeddings(self): -+ return self.model.embed_tokens -+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+ ```""" -++ if not self._warmed_up: -++ self._warmed_up = True -++ self.warmup_moe_model() -+ -+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+ output_router_logits = ( -+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+ } -+ ) -+ return model_inputs -++# @lwx -++ # def _decode_one_tokens_logits( -++ # self, -++ # cur_token: mindspore.Tensor, -++ # input_pos: Optional[mindspore.Tensor], -++ # cache_position: mindspore.Tensor, -++ # past_key_values: StaticCache, -++ # ) -> mindspore.Tensor: -++ # """ -++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++ -++ # Args: -++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++ # input_pos: 输入位置信息,可选 -++ # cache_position: 当前token在cache中的位置,shape为(1,) -++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++ -++ # Returns: -++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++ # """ -++ # # 调用JIT编译的版本 -++ # return self.get_decode_one_tokens_logits( -++ # cur_token=cur_token, -++ # input_pos=input_pos, -++ # cache_position=cache_position, -++ # past_key_values=past_key_values, -++ # ) -++ -++ # @mindspore.jit(jit_level='O1') -++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++ # """ -++ # JIT编译的函数,用于高效的单token解码 -++ # 使用JIT编译优化以支持静态shape和高效执行 -++ -++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++ # """ -++ # outputs = self.model.forward( -++ # input_ids=cur_token, -++ # position_ids=input_pos, -++ # cache_position=cache_position, -++ # past_key_values=past_key_values, -++ # use_cache=True, -++ # return_dict=False, -++ # ) -++ -++ # hidden_states = outputs[0] -++ # logits = self.lm_head.forward(hidden_states) -++ # logits = logits.float() -++ -++ # return logits[:, -1, :] -++ -++ # def _sample( -++ # self, -++ # input_ids: mindspore.Tensor, -++ # logits_processor, -++ # stopping_criteria, -++ # generation_config, -++ # synced_devices: bool, -++ # streamer=None, -++ # logits_warper=None, -++ # **model_kwargs, -++ # ): -++ # """ -++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++ # """ -++ # from ...generation.logits_process import LogitsProcessorList -++ # from ...generation.stopping_criteria import StoppingCriteriaList -++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++ # from mindnlp.core import nn, ops, no_grad -++ # import numpy as np -++ -++ # # 检查是否使用 StaticCache -++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++ # # 否则,直接调用父类方法 -++ # past_key_values = model_kwargs.get("past_key_values") -++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++ -++ # if not isinstance(past_key_values, StaticCache): -++ # # 不使用 StaticCache,直接调用父类方法 -++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++ # return super()._sample( -++ # input_ids=input_ids, -++ # logits_processor=logits_processor, -++ # stopping_criteria=stopping_criteria, -++ # generation_config=generation_config, -++ # synced_devices=synced_devices, -++ # streamer=streamer, -++ # logits_warper=logits_warper, -++ # **model_kwargs, -++ # ) -++ -++ # # 使用 StaticCache,进入自定义循环 -++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++ # pad_token_id = generation_config._pad_token_tensor -++ # output_attentions = generation_config.output_attentions -++ # output_hidden_states = generation_config.output_hidden_states -++ # output_scores = generation_config.output_scores -++ # output_logits = generation_config.output_logits -++ # return_dict_in_generate = generation_config.return_dict_in_generate -++ # max_length = generation_config.max_length -++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++ # do_sample = generation_config.do_sample -++ -++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++ # raise ValueError( -++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++ # f"{logits_warper})." -++ # ) -++ -++ # # init attention / hidden states / scores tuples -++ # scores = () if (return_dict_in_generate and output_scores) else None -++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++ -++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++ # encoder_hidden_states = ( -++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++ # ) -++ -++ # # keep track of which sequences are already finished -++ # batch_size, cur_len = input_ids.shape -++ # this_peer_finished = False -++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++ -++ # time_record = [] -++ # from ....utils.testing_utils import parse_flag_from_env -++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++ -++ # while self._has_unfinished_sequences( -++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++ # ): -++ # if _record_time: -++ # import time as time_module -++ # infer_start = time_module.time() -++ -++ # # prepare model inputs -++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++ -++ # # prepare variable output controls -++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++ -++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++ # cur_cache_position = model_inputs.get("cache_position") -++ # cur_past_key_values = model_inputs.get("past_key_values") -++ # cur_input_ids = model_inputs.get("input_ids") -++ -++ # if (isinstance(cur_past_key_values, StaticCache) and -++ # cur_cache_position is not None and -++ # len(cur_cache_position.shape) > 0 and -++ # cur_cache_position.shape[0] == 1 and -++ # cur_input_ids is not None and -++ # cur_input_ids.shape[1] == 1): -++ # # 使用 JIT 优化的单 token 解码 -++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++ # if not hasattr(self, '_jit_used'): -++ # self._jit_used = False -++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++ -++ # next_token_logits = self.get_decode_one_tokens_logits( -++ # cur_token=cur_input_ids, -++ # input_pos=model_inputs.get("position_ids"), -++ # cache_position=cur_cache_position, -++ # past_key_values=cur_past_key_values, -++ # ) -++ -++ # # 标记已使用JIT(用于后续判断) -++ # if not self._jit_used: -++ # self._jit_used = True -++ -++ # # 构造兼容的输出对象 -++ # class JitOptimizedOutput: -++ # def __init__(self, logits, config): -++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++ # self.config = config -++ # # 对于 JIT 优化路径,这些属性通常不需要 -++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++ # self.attentions = None if not config.is_encoder_decoder else None -++ # self.cross_attentions = None -++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++ # self.hidden_states = None if not config.is_encoder_decoder else None -++ -++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++ # else: -++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++ # outputs = self(**model_inputs, return_dict=True) -++ -++ # if synced_devices and this_peer_finished: -++ # continue -++ -++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++ # next_token_logits = outputs.logits[:, -1, :] -++ -++ # # pre-process distribution -++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++ # if do_sample: -++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++ -++ # # Store scores, attentions and hidden_states when required -++ # if return_dict_in_generate: -++ # if output_scores: -++ # scores += (next_token_scores,) -++ # if output_logits: -++ # raw_logits += (next_token_logits,) -++ # if output_attentions: -++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++ # decoder_attentions += (attn,) if attn is not None else (None,) -++ # if self.config.is_encoder_decoder: -++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++ -++ # if output_hidden_states: -++ # hidden = ( -++ # outputs.decoder_hidden_states -++ # if self.config.is_encoder_decoder -++ # else outputs.hidden_states -++ # ) -++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++ -++ # # token selection -++ # if do_sample: -++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++ # else: -++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++ -++ # # finished sentences should have their next token be a padding token -++ # if has_eos_stopping_criteria: -++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++ -++ # # update generated ids, model inputs, and length for next step -++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++ # if streamer is not None: -++ # streamer.put(next_tokens) -++ -++ # model_kwargs = self._update_model_kwargs_for_generation( -++ # outputs, -++ # model_kwargs, -++ # is_encoder_decoder=self.config.is_encoder_decoder, -++ # ) -++ -++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++ # cur_len += 1 -++ -++ # if _record_time: -++ # import time as time_module -++ # infer_stop = time_module.time() -++ # time_record.append(infer_stop - infer_start) -++ -++ # del outputs -++ -++ # average_infer_time = None -++ # if time_record: -++ # if len(time_record) > 1: -++ # time_record.pop(0) -++ # average_infer_time = sum(time_record) / len(time_record) -++ # print(f'average inference time is: {average_infer_time}') -++ # print(f'inference time record: {time_record}') -++ -++ # if streamer is not None: -++ # streamer.end() -++ -++ # # 简单判断:打印是否使用了JIT路径 -++ # if hasattr(self, '_jit_used') and self._jit_used: -++ # print("[JIT] ✓ JIT optimization was used during generation") -++ # else: -++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++ -++ # if return_dict_in_generate: -++ # if self.config.is_encoder_decoder: -++ # return GenerateEncoderDecoderOutput( -++ # sequences=input_ids, -++ # scores=scores, -++ # logits=raw_logits, -++ # encoder_attentions=encoder_attentions, -++ # encoder_hidden_states=encoder_hidden_states, -++ # decoder_attentions=decoder_attentions, -++ # cross_attentions=cross_attentions, -++ # decoder_hidden_states=decoder_hidden_states, -++ # past_key_values=model_kwargs.get("past_key_values"), -++ # average_infer_time=average_infer_time -++ # ) -++ # else: -++ # return GenerateDecoderOnlyOutput( -++ # sequences=input_ids, -++ # scores=scores, -++ # logits=raw_logits, -++ # attentions=decoder_attentions, -++ # hidden_states=decoder_hidden_states, -++ # past_key_values=model_kwargs.get("past_key_values"), -++ # average_infer_time=average_infer_time -++ # ) -++ # else: -++ # return input_ids -++ -++ # def _prepare_cache_for_generation( -++ # self, -++ # generation_config, -++ # model_kwargs, -++ # assistant_model, -++ # batch_size, -++ # max_cache_length, -++ # ): -++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++ # generation_config.cache_implementation = "static" -++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++ -++ # if generation_config.cache_implementation == "static": -++ # base_required_from_max_length = generation_config.max_length + 1 -++ # base_required = max(max_cache_length, base_required_from_max_length) -++ # min_cache_size = 50 -++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++ # else: -++ # max_cache_length = max(base_required, min_cache_size) -++ -++ # original_max_cache_length = max_cache_length -++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++ # print(f" - input max_cache_length: {original_max_cache_length}") -++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++ # print(f" - final max_cache_length: {max_cache_length}") -++ -++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++ # if max_cache_length > self.config.max_position_embeddings: -++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++ -++ # result = super()._prepare_cache_for_generation( -++ # generation_config=generation_config, -++ # model_kwargs=model_kwargs, -++ # assistant_model=assistant_model, -++ # batch_size=batch_size, -++ # max_cache_length=max_cache_length, -++ # ) -++ -++ # if generation_config.cache_implementation == "static": -++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++ # created_cache = model_kwargs.get(cache_name) -++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++ # if created_cache.max_cache_len < generation_config.max_length: -++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++ -++ # return result -++ -++ -++ -+ -+ -+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+-- -+2.27.0 -+ -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -new file mode 100644 -index 00000000..22b65dd5 ---- /dev/null -+++ b/patches/0002-20251106commit.patch -@@ -0,0 +1,3200 @@ -+From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Thu, 6 Nov 2025 09:20:38 +0800 -+Subject: [PATCH 2/3] 20251106commit -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -+ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -+ 3 files changed, 2689 insertions(+), 305 deletions(-) -+ create mode 100644 patches/0001-20251104commit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index d8303e45..73773c22 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -+ # y = y + self.shared_experts(identity) -+ # return y -+ -++ # @no_grad() -++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ -++ # expert_cache = ops.zeros_like(x) -++ # for i in range(self.num_experts_per_tok): -++ # expert_id = flat_expert_indices[i].item() -++ # weight = flat_expert_weights[i].item() -++ # expert = self.experts[expert_id] -++ # expert_out = expert(x) -++ # expert_cache += expert_out * weight -++ # return expert_cache -++ -+ @no_grad() -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ # x 的 shape: (1, hidden_size) -++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++ -++ # 1. 收集所有需要的专家层 -++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++ selected_experts = [self.experts[i] for i in flat_expert_indices] -++ -++ # 2. 并行计算所有专家的输出 -++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++ # ops.cat 会将它们堆叠成一个新的 Tensor -++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++ -++ # 3. 使用矩阵乘法进行加权求和 -++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ # 最终结果 final_output 的 shape: (1, hidden_size) -++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++ -++ return final_output -+ -+- expert_cache = ops.zeros_like(x) -+- for i in range(self.num_experts_per_tok): -+- expert_id = flat_expert_indices[i].item() -+- weight = flat_expert_weights[i].item() -+- expert = self.experts[expert_id] -+- expert_out = expert(x) -+- expert_cache += expert_out * weight -+- return expert_cache -+ -+ @no_grad() -+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -+ key_states = self.k_proj(hidden_states) -+ value_states = self.v_proj(hidden_states) -+ -+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++ # @lwx -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+ -+ kv_seq_len = key_states.shape[-2] -+ if past_key_value is not None: -+@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -+ return attn_output, attn_weights, past_key_value -+ -+ -++# class DeepseekFlashAttention(nn.Module): -++# """ -++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -++ -++# This class is designed as a drop-in replacement for DeepseekAttention. -++# """ -++ -++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++# super().__init__() -++# self.config = config -++# self.layer_idx = layer_idx -++# if layer_idx is None: -++# logger.warning( -++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++# "when creating this class." -++# ) -++ -++# self.attention_dropout = config.attention_dropout -++# self.hidden_size = config.hidden_size -++# self.num_heads = config.num_attention_heads -++# self.head_dim = self.hidden_size // self.num_heads -++# self.num_key_value_heads = config.num_key_value_heads -++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++# self.max_position_embeddings = config.max_position_embeddings -++# self.rope_theta = config.rope_theta -++# self.is_causal = True -++ -++# if (self.head_dim * self.num_heads) != self.hidden_size: -++# raise ValueError( -++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++# f" and `num_heads`: {self.num_heads})." -++# ) -++ -++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++# self._init_rope() -++ -++# def _init_rope(self): -++# if self.config.rope_scaling is None: -++# self.rotary_emb = DeepseekRotaryEmbedding( -++# self.head_dim, -++# max_position_embeddings=self.max_position_embeddings, -++# base=self.rope_theta, -++# ) -++# else: -++# scaling_type = self.config.rope_scaling["type"] -++# scaling_factor = self.config.rope_scaling["factor"] -++# if scaling_type == "linear": -++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++# self.head_dim, -++# max_position_embeddings=self.max_position_embeddings, -++# scaling_factor=scaling_factor, -++# base=self.rope_theta, -++# ) -++# elif scaling_type == "dynamic": -++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++# self.head_dim, -++# max_position_embeddings=self.max_position_embeddings, -++# scaling_factor=scaling_factor, -++# base=self.rope_theta, -++# ) -++# else: -++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++ -++# def forward( -++# self, -++# hidden_states: mindspore.Tensor, -++# attention_mask: Optional[mindspore.Tensor] = None, -++# position_ids: Optional[mindspore.Tensor] = None, -++# past_key_value: Optional[Cache] = None, -++# output_attentions: bool = False, -++# use_cache: bool = False, -++# **kwargs, -++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++# if "padding_mask" in kwargs: -++# warnings.warn( -++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++# ) -++ -++# if output_attentions: -++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -++ -++# bsz, q_len, _ = hidden_states.shape -++ -++# if self.config.pretraining_tp > 1: -++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++ -++# query_states = self.q_proj(hidden_states) -++# key_states = self.k_proj(hidden_states) -++# value_states = self.v_proj(hidden_states) -++ -++# # Reshape for multi-head attention -++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++# kv_seq_len = key_states.shape[-2] -++# if past_key_value is not None: -++# if self.layer_idx is None: -++# raise ValueError( -++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++# "with a layer index." -++# ) -++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++# # Apply Rotary Positional Embedding -++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++# if past_key_value is not None: -++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ -++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++ -++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++ -++# # Convert attention_mask for flash_attention_score -++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -++# if attention_mask is not None: -++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++# raise ValueError( -++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++# ) -++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -++# else: -++# attn_mask_for_fa = None -++ -++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++ -++# # Call the fused flash_attention_score operator -++# attn_output = mindspore.ops.flash_attention_score( -++# query=query_states_for_fa, -++# key=key_states_for_fa, -++# value=value_states_for_fa, -++# head_num=self.num_heads, # This is N1, the number of query heads -++# input_layout='BSH', -++# attn_mask=attn_mask_for_fa, -++# keep_prob=keep_prob, -++# scalar_value=1.0 / math.sqrt(self.head_dim), -++# sparse_mode=0 # Default mask mode -++# ) -++ -++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -++# attn_output = self.o_proj(attn_output) -++ -++# # Flash Attention does not return attention weights -++# attn_weights = None -++ -++# return attn_output, attn_weights, past_key_value -++ -++class DeepseekFlashAttention(nn.Module): -++ """ -++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -++ This implementation is a drop-in replacement for the original DeepseekAttention class, -++ designed for high performance on supported hardware (Ascend). -++ -++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -++ """ -++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++ super().__init__() -++ self.config = config -++ self.layer_idx = layer_idx -++ if layer_idx is None: -++ logger.warning( -++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++ "when creating this class." -++ ) -++ -++ # --- [FIX] Correctly initialize all required attributes --- -++ self.attention_dropout = config.attention_dropout -++ self.hidden_size = config.hidden_size -++ self.num_heads = config.num_attention_heads -++ self.head_dim = self.hidden_size // self.num_heads -++ self.num_key_value_heads = config.num_key_value_heads -++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++ self.max_position_embeddings = config.max_position_embeddings -++ self.rope_theta = config.rope_theta -++ self.is_causal = True -++ -++ if (self.head_dim * self.num_heads) != self.hidden_size: -++ raise ValueError( -++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++ f" and `num_heads`: {self.num_heads})." -++ ) -++ -++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++ -++ # This call will now succeed as all attributes are initialized. -++ self._init_rope() -++ -++ def _init_rope(self): -++ if self.config.rope_scaling is None: -++ self.rotary_emb = DeepseekRotaryEmbedding( -++ self.head_dim, -++ max_position_embeddings=self.max_position_embeddings, -++ base=self.rope_theta, -++ ) -++ else: -++ scaling_type = self.config.rope_scaling["type"] -++ scaling_factor = self.config.rope_scaling["factor"] -++ if scaling_type == "linear": -++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++ self.head_dim, -++ max_position_embeddings=self.max_position_embeddings, -++ scaling_factor=scaling_factor, -++ base=self.rope_theta, -++ ) -++ elif scaling_type == "dynamic": -++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++ self.head_dim, -++ max_position_embeddings=self.max_position_embeddings, -++ scaling_factor=scaling_factor, -++ base=self.rope_theta, -++ ) -++ else: -++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++ -++ def forward( -++ self, -++ hidden_states: mindspore.Tensor, -++ attention_mask: Optional[mindspore.Tensor] = None, -++ position_ids: Optional[mindspore.Tensor] = None, -++ past_key_value: Optional[Cache] = None, -++ output_attentions: bool = False, -++ use_cache: bool = False, -++ **kwargs, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ if "padding_mask" in kwargs: -++ warnings.warn( -++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++ ) -++ if output_attentions: -++ warnings.warn( -++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -++ ) -++ -++ bsz, q_len, _ = hidden_states.shape -++ -++ if self.config.pretraining_tp > 1: -++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++ -++ query_states = self.q_proj(hidden_states) -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++ # Apply Rotary Position Embedding -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos} -++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -++ # So we must explicitly repeat the KV heads. -++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++ -++ # Convert attention mask for flash_attention_score -++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -++ if attention_mask is not None: -++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++ raise ValueError( -++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++ ) -++ attn_mask_for_fa = attention_mask < 0 -++ else: -++ attn_mask_for_fa = None -++ -++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++ -++ # Call the fused operator using the efficient BNSD layout -++ attn_output = mindspore.ops.flash_attention_score( -++ query=query_states, -++ key=key_states, -++ value=value_states, -++ head_num=self.num_heads, -++ input_layout='BNSD', # Specify the correct layout -++ attn_mask=attn_mask_for_fa, -++ keep_prob=keep_prob, -++ scalar_value=1.0 / math.sqrt(self.head_dim) -++ ) -++ -++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ -++ # Apply output projection -++ attn_output = self.o_proj(attn_output) -++ -++ # Flash attention does not return attention weights, so we return None. -++ attn_weights = None -++ -++ return attn_output, attn_weights, past_key_value -++ -+ Deepseek_ATTENTION_CLASSES = { -+ "eager": DeepseekAttention, -++ "flash-attention": DeepseekFlashAttention, -+ } -+ -+ -+@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -+ config=config, layer_idx=layer_idx -+ ) -+ -++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -++ config=config, layer_idx=layer_idx -++ ) -++ -+ self.mlp = ( -+ DeepseekMoE(config) -+ if ( -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index d4c6b651..bced285c 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -+ -+ import mindspore -+ import mindnlp.core.nn.functional as F -+-from mindnlp.core import nn, ops -++from mindnlp.core import nn, ops, no_grad -+ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -+ -+ from ....common.activations import ACT2FN -+@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+ -++Long_Prompt = False -++PROMPT_LENGTH_THRESHOLD = 128 -+ -+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+ def _prepare_4d_causal_attention_mask_with_cache_position( -+@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -+ return attn_output, attn_weights, past_key_value -+ -+ -++# class Qwen2MoeFlashAttention(nn.Module): -++# """ -++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++ -++# 关键改动: -++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++# 直接传入原始的 key 和 value 张量效率更高。 -++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++# super().__init__() -++# self.config = config -++# self.layer_idx = layer_idx -++# self.hidden_size = config.hidden_size -++# self.num_heads = config.num_attention_heads -++# self.head_dim = self.hidden_size // self.num_heads -++# self.num_key_value_heads = config.num_key_value_heads -++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++# self.max_position_embeddings = config.max_position_embeddings -++# self.rope_theta = config.rope_theta -++# self.attention_dropout = config.attention_dropout -++ -++# if (self.head_dim * self.num_heads) != self.hidden_size: -++# raise ValueError( -++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++# ) -++ -++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++ -++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++# self.head_dim, -++# max_position_embeddings=self.max_position_embeddings, -++# base=self.rope_theta, -++# ) -++ -++# def forward( -++# self, -++# hidden_states: mindspore.Tensor, -++# attention_mask: Optional[mindspore.Tensor] = None, -++# position_ids: Optional[mindspore.Tensor] = None, -++# past_key_value: Optional[Cache] = None, -++# output_attentions: bool = False, -++# use_cache: bool = False, -++# cache_position: Optional[mindspore.Tensor] = None, -++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++# bsz, q_len, _ = hidden_states.shape -++ -++# # 1. 线性投射 Q, K, V -++# query_states = self.q_proj(hidden_states) -++# key_states = self.k_proj(hidden_states) -++# value_states = self.v_proj(hidden_states) -++ -++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++# # query: [B, S, H*D] -> [B, N1, S, D] -++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++# # 3. RoPE 旋转位置编码 -++# kv_seq_len = key_states.shape[-2] -++# if past_key_value is not None: -++# if self.layer_idx is None: -++# raise ValueError( -++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++# "with a layer index." -++# ) -++# # 对于 StaticCache,需要特殊处理 kv_seq_len -++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++# if cache_position.shape[0] == 1: -++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++# kv_seq_len = past_seen_tokens + 1 -++# else: -++# # prefill 阶段:cache_position 是范围,使用其长度 -++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -++# else: -++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++# # 4. KV 缓存更新 -++# if past_key_value is not None: -++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++# key_states, value_states = past_key_value.update( -++# key_states, value_states, self.layer_idx, cache_kwargs -++# ) -++ -++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++# if cache_position.shape[0] == 1: -++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++# kv_seq_len = key_states.shape[-2] -++ -++# # 5. [重要] 准备 Attention Mask -++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++# fa_attention_mask = None -++# if attention_mask is not None: -++# # 截取与当前key长度匹配的部分 -++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++# # 转换为布尔类型: 大负数 -> True, 0 -> False -++# fa_attention_mask = (mask_slice != 0) -++ -++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++# input_dtype = query_states.dtype -++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++# query_states = query_states.to(mindspore.float16) -++# key_states = key_states.to(mindspore.float16) -++# value_states = value_states.to(mindspore.float16) -++ -++# # 6. [核心] 调用 flash_attention_score 算子 -++# # - 无需手动 repeat_kv, 算子原生支持 GQA -++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++# attn_output = mindspore.ops.flash_attention_score( -++# query=query_states, -++# key=key_states, -++# value=value_states, -++# head_num=self.num_heads, # 传入Q的头数(N1) -++# attn_mask=fa_attention_mask, -++# keep_prob=1.0 - self.attention_dropout, -++# scalar_value=1.0 / math.sqrt(self.head_dim), -++# input_layout="BNSD", -++# sparse_mode=0 # 使用 defaultMask 模式 -++# ) -++ -++# # 恢复原始数据类型 -++# attn_output = attn_output.to(input_dtype) -++ -++# # 7. 调整输出形状 -++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++# attn_output = self.o_proj(attn_output) -++ -++# # FlashAttention 算子不直接返回注意力权重矩阵 -++# attn_weights = None -++# if output_attentions: -++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++# return attn_output, attn_weights, past_key_value -++ -++# # def forward( -++# # self, -++# # hidden_states: mindspore.Tensor, -++# # attention_mask: Optional[mindspore.Tensor] = None, -++# # position_ids: Optional[mindspore.Tensor] = None, -++# # past_key_value: Optional[Cache] = None, -++# # output_attentions: bool = False, -++# # use_cache: bool = False, -++# # cache_position: Optional[mindspore.Tensor] = None, -++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++# # bsz, q_len, _ = hidden_states.shape -++ -++# # # 1. 线性投射 Q, K, V -++# # query_states = self.q_proj(hidden_states) -++# # key_states = self.k_proj(hidden_states) -++# # value_states = self.v_proj(hidden_states) -++ -++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -++# # # 3. RoPE 旋转位置编码 -++# # kv_seq_len = key_states.shape[-2] -++# # if past_key_value is not None: -++# # if self.layer_idx is None: -++# # raise ValueError( -++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++# # "with a layer index." -++# # ) -++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++# # # 4. KV 缓存更新 -++# # if past_key_value is not None: -++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++# # key_states, value_states = past_key_value.update( -++# # key_states, value_states, self.layer_idx, cache_kwargs -++# # ) -++ -++# # # 5. 准备 Attention Mask -++# # fa_attention_mask = None -++# # if attention_mask is not None: -++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++# # fa_attention_mask = (mask_slice != 0) -++ -++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++# # input_dtype = query_states.dtype -++ -++# # # 6. [核心] 调用 flash_attention_score 算子 -++# # attn_output = mindspore.ops.flash_attention_score( -++# # query=query_states, -++# # key=key_states, -++# # value=value_states, -++# # head_num=self.num_heads, -++# # attn_mask=fa_attention_mask, -++# # keep_prob=1.0 - self.attention_dropout, -++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++# # input_layout="BNSD", -++# # sparse_mode=0, -++# # # <--- 修改点 2: 启用内部高精度计算 --- -++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++# # inner_precise=1 -++# # ) -++ -++# # # 恢复原始数据类型 -++# # attn_output = attn_output.to(input_dtype) -++ -++# # # 7. 调整输出形状 -++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++# # attn_output = self.o_proj(attn_output) -++ -++# # attn_weights = None -++# # if output_attentions: -++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ -++# # return attn_output, attn_weights, past_key_value -++ -++ -+ class Qwen2MoeFlashAttention(nn.Module): -+ """ -+- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+- -+- 关键改动: -+- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+- 直接传入原始的 key 和 value 张量效率更高。 -+- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -++ -++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -++ 完全使用模型的低精度数据类型(如 float16)进行计算, -++ 以达到理论上的最高执行速度。 -+ """ -+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+ super().__init__() -+ self.config = config -+ self.layer_idx = layer_idx -++ if layer_idx is None: -++ logger.warning_once( -++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -++ ) -++ -+ self.hidden_size = config.hidden_size -+ self.num_heads = config.num_attention_heads -+ self.head_dim = self.hidden_size // self.num_heads -+ self.num_key_value_heads = config.num_key_value_heads -+- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+ self.max_position_embeddings = config.max_position_embeddings -+ self.rope_theta = config.rope_theta -+ self.attention_dropout = config.attention_dropout -+ -+- if (self.head_dim * self.num_heads) != self.hidden_size: -+- raise ValueError( -+- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+- ) -+- -+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -+ key_states = self.k_proj(hidden_states) -+ value_states = self.v_proj(hidden_states) -+ -+- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+- # query: [B, S, H*D] -> [B, N1, S, D] -+- # key/val: [B, S, H2*D] -> [B, N2, S, D] -++ # 2. 调整形状以匹配 BNSD 布局 -+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- -+- # 3. RoPE 旋转位置编码 -++ -++ # 3. RoPE 和 KV 缓存 -+ kv_seq_len = key_states.shape[-2] -+ if past_key_value is not None: -+- if self.layer_idx is None: -+- raise ValueError( -+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+- "with a layer index." -+- ) -+- # 对于 StaticCache,需要特殊处理 kv_seq_len -+- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+- # 使用 cache_position 的长度来确定实际的 kv_seq_len -+- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+- # 临时解决方案:使用 cache_position 的最大值(如果可能) -+- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+- if cache_position.shape[0] == 1: -+- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+- kv_seq_len = past_seen_tokens + 1 -+- else: -+- # prefill 阶段:cache_position 是范围,使用其长度 -+- kv_seq_len = cache_position.shape[0] + past_seen_tokens -+- else: -+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+- -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+- # 4. KV 缓存更新 -+ if past_key_value is not None: -+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+- key_states, value_states = past_key_value.update( -+- key_states, value_states, self.layer_idx, cache_kwargs -+- ) -+- -+- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+- if cache_position.shape[0] == 1: -+- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+- kv_seq_len = key_states.shape[-2] -+- -+- # 5. [重要] 准备 Attention Mask -+- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++ # 4. 准备 Attention Mask -+ fa_attention_mask = None -+ if attention_mask is not None: -+- # 截取与当前key长度匹配的部分 -+- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+- # 转换为布尔类型: 大负数 -> True, 0 -> False -+ fa_attention_mask = (mask_slice != 0) -+ -+- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+- input_dtype = query_states.dtype -+- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+- query_states = query_states.to(mindspore.float16) -+- key_states = key_states.to(mindspore.float16) -+- value_states = value_states.to(mindspore.float16) -+- -+- # 6. [核心] 调用 flash_attention_score 算子 -+- # - 无需手动 repeat_kv, 算子原生支持 GQA -+- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -+ attn_output = mindspore.ops.flash_attention_score( -+ query=query_states, -+ key=key_states, -+ value=value_states, -+- head_num=self.num_heads, # 传入Q的头数(N1) -++ head_num=self.num_heads, -+ attn_mask=fa_attention_mask, -+- keep_prob=1.0 - self.attention_dropout, -++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -+ scalar_value=1.0 / math.sqrt(self.head_dim), -+ input_layout="BNSD", -+- sparse_mode=0 # 使用 defaultMask 模式 -++ sparse_mode=0, -++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -+ ) -+ -+- # 恢复原始数据类型 -+- attn_output = attn_output.to(input_dtype) -+- -+- # 7. 调整输出形状 -+- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++ # 6. 调整输出形状 -+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+ attn_output = self.o_proj(attn_output) -+ -+- # FlashAttention 算子不直接返回注意力权重矩阵 -++ # 7. 返回结果 -+ attn_weights = None -+ if output_attentions: -+- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+ -+ return attn_output, attn_weights, past_key_value -+ -+- # def forward( -+- # self, -+- # hidden_states: mindspore.Tensor, -+- # attention_mask: Optional[mindspore.Tensor] = None, -+- # position_ids: Optional[mindspore.Tensor] = None, -+- # past_key_value: Optional[Cache] = None, -+- # output_attentions: bool = False, -+- # use_cache: bool = False, -+- # cache_position: Optional[mindspore.Tensor] = None, -+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+- -+- # bsz, q_len, _ = hidden_states.shape -+- -+- # # 1. 线性投射 Q, K, V -+- # query_states = self.q_proj(hidden_states) -+- # key_states = self.k_proj(hidden_states) -+- # value_states = self.v_proj(hidden_states) -+- -+- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- -+- # # 3. RoPE 旋转位置编码 -+- # kv_seq_len = key_states.shape[-2] -+- # if past_key_value is not None: -+- # if self.layer_idx is None: -+- # raise ValueError( -+- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+- # "with a layer index." -+- # ) -+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+ -+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+- -+- # # 4. KV 缓存更新 -+- # if past_key_value is not None: -+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+- # key_states, value_states = past_key_value.update( -+- # key_states, value_states, self.layer_idx, cache_kwargs -+- # ) -+- -+- # # 5. 准备 Attention Mask -+- # fa_attention_mask = None -+- # if attention_mask is not None: -+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+- # fa_attention_mask = (mask_slice != 0) -+- -+- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+- # input_dtype = query_states.dtype -+- -+- # # 6. [核心] 调用 flash_attention_score 算子 -+- # attn_output = mindspore.ops.flash_attention_score( -+- # query=query_states, -+- # key=key_states, -+- # value=value_states, -+- # head_num=self.num_heads, -+- # attn_mask=fa_attention_mask, -+- # keep_prob=1.0 - self.attention_dropout, -+- # scalar_value=1.0 / math.sqrt(self.head_dim), -+- # input_layout="BNSD", -+- # sparse_mode=0, -+- # # <--- 修改点 2: 启用内部高精度计算 --- -+- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+- # inner_precise=1 -+- # ) -+- -+- # # 恢复原始数据类型 -+- # attn_output = attn_output.to(input_dtype) -++QWEN2MOE_ATTENTION_CLASSES = { -++ "eager": Qwen2MoeAttention, -++ "flash-attention": Qwen2MoeFlashAttention, -++} -+ -+- # # 7. 调整输出形状 -+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+- # attn_output = self.o_proj(attn_output) -+ -+- # attn_weights = None -+- # if output_attentions: -+- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# def __init__(self, config): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# # gating -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++ -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# #@dwj -++# # 只遍历激活的专家,而非全部专家 -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# num_tokens = hidden_states_reshaped.shape[0] -++ -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++# routing_weights = routing_weights.to(hidden_states.dtype) -++ -++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++# flat_selected_experts = selected_experts.flatten() -++ -++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++# token_indices = broadcasted_token_indices.flatten() -++ -++# active_experts = ops.unique(flat_selected_experts) -++ -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++ -++# mask = (flat_selected_experts == expert_idx_tensor) -++# selected_token_indices = token_indices[mask] -++# selected_routing_weights = routing_weights.flatten()[mask] -++ -++# current_states = hidden_states_reshaped[selected_token_indices] -++ -++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++ -++# final_hidden_states = final_hidden_states.index_add( -++# dim=0, -++# index=selected_token_indices, -++# source=expert_output.to(hidden_states.dtype) -++# ) -++ -++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+ -+- # return attn_output, attn_weights, past_key_value -++# final_hidden_states = final_hidden_states + shared_expert_output -++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -++ -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# """ -++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++# `_moe_infer_prefill` (用于长序列处理) 方法。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# # 门控网络 -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# # 专家列表 -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++# # 共享专家 -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# @no_grad() -++# def _moe_infer_decode( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# """ -++# 【解码路径】针对 sequence_length=1 的极致优化。 -++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++# """ -++# batch_size, hidden_dim = hidden_states.shape -++ -++# expert_outputs_list = [ -++# ops.cat([ -++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++# ], dim=0) -++# for i in range(batch_size) -++# ] -++ -++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++# # shape: (batch_size, top_k, hidden_dim) -++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++ -++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++ -++# return moe_output.squeeze(1) -++ -++# @no_grad() -++# def _moe_infer_prefill( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# """ -++# 【预填充路径】针对 sequence_length > 1 的优化。 -++# 按专家对 Token 进行分组,并进行批处理。 -++# """ -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens = hidden_states.shape[0] -++# flat_selected_experts = selected_experts.flatten() -++ -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++ -++# active_experts = ops.unique(flat_selected_experts) -++ -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++ -++# mask = (flat_selected_experts == expert_idx_tensor) -++# selected_token_indices = token_indices[mask] -++# selected_routing_weights = routing_weights.flatten()[mask] -++ -++# current_states = hidden_states[selected_token_indices] -++ -++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++ -++# moe_output = moe_output.index_add( -++# dim=0, -++# index=selected_token_indices, -++# source=expert_output.to(hidden_states.dtype) -++# ) -++# return moe_output -++ -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# """ -++# 顶层 forward 方法,作为智能分发器。 -++# """ -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++ -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+- # def forward( -+- # self, -+- # hidden_states: mindspore.Tensor, -+- # attention_mask: Optional[mindspore.Tensor] = None, -+- # position_ids: Optional[mindspore.Tensor] = None, -+- # past_key_value: Optional[Cache] = None, -+- # output_attentions: bool = False, -+- # use_cache: bool = False, -+- # cache_position: Optional[mindspore.Tensor] = None, -+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+- -+- # bsz, q_len, _ = hidden_states.shape -+- -+- # query_states = self.q_proj(hidden_states) -+- # key_states = self.k_proj(hidden_states) -+- # value_states = self.v_proj(hidden_states) -+- -+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- -+- # kv_seq_len = key_states.shape[-2] -+- # if past_key_value is not None: -+- # if self.layer_idx is None: -+- # raise ValueError("`layer_idx` must be specified for caching") -+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+- -+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+- -+- # if past_key_value is not None: -+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+- # key_states, value_states = past_key_value.update( -+- # key_states, value_states, self.layer_idx, cache_kwargs -+- # ) -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++# routing_weights = routing_weights.to(hidden_states.dtype) -++ -++# moe_output = None -++# # 在推理时,根据序列长度选择最优路径 -++# if not self.training: -++# if sequence_length == 1: -++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++# else: -++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++# else: -++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++# raise NotImplementedError("Training path is not implemented.") -++ -++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++ -++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++ -++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -++ -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# """ -++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# # 门控网络 -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# # 专家列表 -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++# # 共享专家 -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# @no_grad() -++# def _moe_infer_decode( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# batch_size, _ = hidden_states.shape -++# expert_outputs_list = [ -++# ops.cat([ -++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++# ], dim=0) -++# for i in range(batch_size) -++# ] -++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++# return moe_output.squeeze(1) -++ -++# @no_grad() -++# def _moe_infer_prefill( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens = hidden_states.shape[0] -++# flat_selected_experts = selected_experts.flatten() -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++# active_experts = ops.unique(flat_selected_experts) -++ -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++# mask = (flat_selected_experts == expert_idx_tensor) -++# selected_token_indices = token_indices[mask] -++# selected_routing_weights = routing_weights.flatten()[mask] -++# current_states = hidden_states[selected_token_indices] -++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++# moe_output = moe_output.index_add( -++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++# ) -++# return moe_output -++ -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# """ -++# 顶层 forward 方法,作为智能分发器。 -++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++# """ -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++ -++# # 1. 门控计算 (通用逻辑) -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++# routing_weights = routing_weights.to(hidden_states.dtype) -++ -++# # 2. 智能分发到最优 MoE 路径 -++# moe_output = None -++# if not self.training: -++# if sequence_length == 1: -++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++# else: -++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++# else: -++# raise NotImplementedError("Training path is not implemented.") -++ -++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++ -++# # 4. 合并 MoE 输出和共享专家输出 -++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++ -++# # 5. 恢复原始形状并返回 -++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -++# prefill fastest -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# """ -++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# # 门控网络 -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# # 专家列表 -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++# # 共享专家 -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# @no_grad() -++# def _moe_infer_dispatch( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# """ -++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++# """ -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens, _ = hidden_states.shape -++ -++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++# flat_selected_experts = selected_experts.flatten() -++# flat_routing_weights = routing_weights.flatten() -+ -+- # key_states = repeat_kv(key_states, self.num_key_value_groups) -+- # value_states = repeat_kv(value_states, self.num_key_value_groups) -+- -+- # # <--- 核心修改点: 手动进行高精度缩放 --- -+- # # 在调用算子前,手动将 query_states 除以缩放因子。 -+- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+- # query_states = query_states / math.sqrt(self.head_dim) -+- # # <--- 修改结束 --- -+- -+- # fa_attention_mask = None -+- # if attention_mask is not None: -+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+- # fa_attention_mask = (mask_slice != 0) -+- -+- # input_dtype = query_states.dtype -+- -+- # attn_output = mindspore.ops.flash_attention_score( -+- # query=query_states, # 传入已经预先缩放过的 query -+- # key=key_states, -+- # value=value_states, -+- # head_num=self.num_heads, -+- # attn_mask=fa_attention_mask, -+- # keep_prob=1.0 - self.attention_dropout, -+- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+- # input_layout="BNSD", -+- # sparse_mode=0, -+- # inner_precise=1 # 仍然保持内部高精度计算 -+- # ) -++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+ -+- # attn_output = attn_output.to(input_dtype) -+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+- # attn_output = self.o_proj(attn_output) -++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++# active_experts = ops.unique(flat_selected_experts) -++ -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++ -++# # 找到所有分配给该专家的 token -++# mask = (flat_selected_experts == expert_idx_tensor) -++ -++# # 使用 mask 选取对应的 token 和权重 -++# current_token_indices = token_indices[mask] -++# current_routing_weights = flat_routing_weights[mask] -++# current_hidden_states = hidden_states[current_token_indices] -++ -++# # 对这些 token 进行批处理 -++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++ -++# # 使用 index_add 将结果精确地加回到对应位置 -++# moe_output = moe_output.index_add( -++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++# ) -++# return moe_output -++ -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# """ -++# 顶层 forward 方法,作为智能分发器。 -++# """ -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++ -++# # 1. 门控计算 -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++# routing_weights = routing_weights.to(hidden_states.dtype) -++ -++# # 2. 调用统一的 MoE 计算内核 -++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+ -+- # attn_weights = None -+- # if output_attentions: -+- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++# # 3. 统一处理共享专家 -++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++ -++# # 4. 合并输出 -++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++ -++# # 5. 恢复原始形状并返回 -++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -++ -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# """ -++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++# 【最终高性能与高精度版】: -++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++# 3. 这样实现了速度和准确性的两全其美。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# @no_grad() -++# def _moe_infer_decode( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# """ -++# 【解码路径】极致优化版:bmm + 高精度累加。 -++# """ -++# original_dtype = hidden_states.dtype -++# batch_size, _ = hidden_states.shape -++ -++# expert_outputs_list = [ -++# ops.cat([ -++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++# ], dim=0) -++# for i in range(batch_size) -++# ] -++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++ -++# # 在 float32 下执行 bmm,得到高精度结果 -++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++ -++# # 将高精度结果转换回原始数据类型 -++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++ -++# return moe_output -++ -++# @no_grad() -++# def _moe_infer_prefill( -++# self, -++# hidden_states: mindspore.Tensor, -++# selected_experts: mindspore.Tensor, -++# routing_weights: mindspore.Tensor -++# ) -> mindspore.Tensor: -++# """ -++# 【预填充路径】与原始实现一致,结果精确。 -++# """ -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens, _ = hidden_states.shape -++# flat_selected_experts = selected_experts.flatten() -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++# active_experts = ops.unique(flat_selected_experts) -++ -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++# mask = (flat_selected_experts == expert_idx_tensor) -++# selected_token_indices = token_indices[mask] -++# selected_routing_weights = routing_weights.flatten()[mask] -++# current_states = hidden_states[selected_token_indices] -++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++# moe_output = moe_output.index_add( -++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++# ) -++# return moe_output -++ -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++ -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+ -+- # return attn_output, attn_weights, past_key_value -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++# # 如果模型主体是 float16,后续再转换 -++ -++# moe_output = None -++# if not self.training: -++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++# # _moe_infer_decode 内部会处理好类型转换 -++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++# if sequence_length == 1: -++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++# else: -++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++# else: -++# raise NotImplementedError("Training path is not implemented.") -++ -++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++ -++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -+ -+-QWEN2MOE_ATTENTION_CLASSES = { -+- "eager": Qwen2MoeAttention, -+- "flash-attention": Qwen2MoeFlashAttention, -+-} -++# class Qwen2MoeSparseMoeBlock(nn.Module): -++# """ -++# 【融合版】一个混合专家模块,内置两种推理策略, -++# 由外部全局变量 `Long_Prompt` 控制: -++ -++# - if Long_Prompt is True: 【精度优先模式】 -++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++# 适用于处理长序列,避免误差累积。 -++ -++# - if Long_Prompt is False: 【速度优先模式】 -++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++# 在解码阶段获得极致速度,同时保证结果高度准确。 -++# """ -++# def __init__(self, config: Qwen2MoeConfig): -++# super().__init__() -++# self.num_experts = config.num_experts -++# self.top_k = config.num_experts_per_tok -++# self.norm_topk_prob = config.norm_topk_prob -++ -++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++# self.experts = nn.ModuleList( -++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++# ) -++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++# # --- 速度优先模式的辅助函数 --- -++# @no_grad() -++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++# original_dtype = hidden_states.dtype -++# batch_size, _ = hidden_states.shape -++# expert_outputs_list = [ -++# ops.cat([ -++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++# ], dim=0) -++# for i in range(batch_size) -++# ] -++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++# weights_fp32 = routing_weights.to(mindspore.float32) -++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++# return moe_output_fp32.squeeze(1).to(original_dtype) -++ -++# @no_grad() -++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens, _ = hidden_states.shape -++# flat_selected_experts = selected_experts.flatten() -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++# active_experts = ops.unique(flat_selected_experts) -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++# mask = (flat_selected_experts == expert_idx_tensor) -++# selected_token_indices = token_indices[mask] -++# selected_routing_weights = routing_weights.flatten()[mask] -++# current_states = hidden_states[selected_token_indices] -++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++# return moe_output -++ -++# # --- 精度优先模式的辅助函数 --- -++# @no_grad() -++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++# moe_output = ops.zeros_like(hidden_states) -++# num_tokens, _ = hidden_states.shape -++# flat_selected_experts = selected_experts.flatten() -++# flat_routing_weights = routing_weights.flatten() -++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++# active_experts = ops.unique(flat_selected_experts) -++# for expert_idx_tensor in active_experts: -++# expert_idx = expert_idx_tensor.item() -++# expert_layer = self.experts[expert_idx] -++# mask = (flat_selected_experts == expert_idx_tensor) -++# current_token_indices = token_indices[mask] -++# current_routing_weights = flat_routing_weights[mask] -++# current_hidden_states = hidden_states[current_token_indices] -++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++# return moe_output -++ -++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++# # 声明我们将要使用一个在模块外部定义的全局变量 -++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++# global Long_Prompt -++ -++# # 1. 门控计算 (所有模式通用) -++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++# router_logits = self.gate(hidden_states_reshaped) -++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++# if self.norm_topk_prob: -++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++# moe_output = None -++# if not self.training: -++# # 根据 Long_Prompt 标志选择模式 -++# if Long_Prompt: -++# # --- 精度优先模式 --- -++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++# else: -++# # --- 速度优先模式 --- -++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++# if sequence_length == 1: -++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++# else: -++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++# else: -++# raise NotImplementedError("Training path is not implemented.") -++ -++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++ -++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++ -++# return final_hidden_states, router_logits -++ -++class Qwen2MoeSparseMoeBlock(nn.Module): -++ """ -++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++ 控制的顶级推理策略: -+ -++ - if Long_Prompt is True: 【精度优先模式】 -++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -++ 适用于需要严格可复现性的长序列任务。 -+ -+-class Qwen2MoeSparseMoeBlock(nn.Module): -+- def __init__(self, config): -++ - if Long_Prompt is False: 【速度优先模式】 -++ 采用业界最强的性能组合: -++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -++ """ -++ def __init__(self, config: Qwen2MoeConfig): -+ super().__init__() -+ self.num_experts = config.num_experts -+ self.top_k = config.num_experts_per_tok -+ self.norm_topk_prob = config.norm_topk_prob -+ -+- # gating -+ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+ self.experts = nn.ModuleList( -+ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+ ) -+- -+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+ -+- #@dwj -+- # 只遍历激活的专家,而非全部专家 -+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+- batch_size, sequence_length, hidden_dim = hidden_states.shape -+- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+- num_tokens = hidden_states_reshaped.shape[0] -+- -+- router_logits = self.gate(hidden_states_reshaped) -+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- -+- if self.norm_topk_prob: -+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- routing_weights = routing_weights.to(hidden_states.dtype) -+- -+- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+- flat_selected_experts = selected_experts.flatten() -+- -+- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+- token_indices = broadcasted_token_indices.flatten() -+- -+- active_experts = ops.unique(flat_selected_experts) -+- -+- for expert_idx_tensor in active_experts: -+- expert_idx = expert_idx_tensor.item() -+- expert_layer = self.experts[expert_idx] -+- -+- mask = (flat_selected_experts == expert_idx_tensor) -+- selected_token_indices = token_indices[mask] -+- selected_routing_weights = routing_weights.flatten()[mask] -+- -+- current_states = hidden_states_reshaped[selected_token_indices] -+- -+- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+- -+- final_hidden_states = final_hidden_states.index_add( -+- dim=0, -+- index=selected_token_indices, -+- source=expert_output.to(hidden_states.dtype) -+- ) -+- -+- shared_expert_output = self.shared_expert(hidden_states_reshaped) -+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -++ @no_grad() -++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ original_dtype = hidden_states.dtype -++ batch_size, _ = hidden_states.shape -++ expert_outputs_list = [ -++ ops.cat([ -++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++ ], dim=0) -++ for i in range(batch_size) -++ ] -++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++ weights_fp32 = routing_weights.to(mindspore.float32) -++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++ return moe_output_fp32.squeeze(1).to(original_dtype) -++ -++ @no_grad() -++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ num_tokens, _ = hidden_states.shape -++ flat_selected_experts = selected_experts.flatten() -++ sorted_expert_indices = flat_selected_experts.argsort() -++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++ original_token_indices = sorted_expert_indices // self.top_k -++ moe_output = ops.zeros_like(hidden_states) -++ current_token_offset = 0 -++ for i in range(self.num_experts): -++ expert_token_count = tokens_per_expert[i] - current_token_offset -++ if expert_token_count == 0: -++ continue -++ end_offset = current_token_offset + expert_token_count -++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++ expert_hidden_states = hidden_states[expert_original_token_indices] -++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++ current_token_offset += expert_token_count -++ return moe_output -++ -++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++ @no_grad() -++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ moe_output = ops.zeros_like(hidden_states) -++ num_tokens, _ = hidden_states.shape -++ flat_selected_experts = selected_experts.flatten() -++ flat_routing_weights = routing_weights.flatten() -++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++ active_experts = ops.unique(flat_selected_experts) -++ for expert_idx_tensor in active_experts: -++ expert_idx = expert_idx_tensor.item() -++ expert_layer = self.experts[expert_idx] -++ mask = (flat_selected_experts == expert_idx_tensor) -++ current_token_indices = token_indices[mask] -++ current_routing_weights = flat_routing_weights[mask] -++ current_hidden_states = hidden_states[current_token_indices] -++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++ return moe_output -+ -+- final_hidden_states = final_hidden_states + shared_expert_output -+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+- -+- return final_hidden_states, router_logits -++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++ global Long_Prompt -++ -++ # 1. 门控计算 (所有模式通用) -++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++ router_logits = self.gate(hidden_states_reshaped) -++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++ if self.norm_topk_prob: -++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++ moe_output = None -++ if Long_Prompt: -++ # --- 精度优先模式 (ACCURACY MODE) --- -++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ else: -++ # --- 速度优先模式 (SPEED MODE) --- -++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++ if sequence_length == 1: -++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ else: -++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ -+ -++ # 3. 共享专家计算与合并 (所有模式通用) -++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++ -++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++ -++ return final_hidden_states, router_logits -+ -+ class Qwen2MoeDecoderLayer(nn.Module): -+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+ super().__init__() -+ self.hidden_size = config.hidden_size -++ -++ # if Long_Prompt: -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ # else: -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+ -+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ -+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+- -+ if (layer_idx not in config.mlp_only_layers) and ( -+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+ ): -+@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ self._warmed_up = True -+ self.warmup_moe_model() -+ -++ -++ -+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+ output_router_logits = ( -+ output_router_logits if output_router_logits is not None else self.config.output_router_logits -+@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ router_logits=outputs.router_logits, -+ ) -+ -++ def generate(self, *args, **kwargs): -++ """ -++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++ """ -++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++ -++ input_ids = kwargs.get("input_ids") -++ if input_ids is None and args: -++ input_ids = args[0] -++ -++ if input_ids is not None: -++ prompt_length = input_ids.shape[1] -++ -++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -++ Long_Prompt = True -++ else: -++ Long_Prompt = False -++ -++ return super().generate(*args, **kwargs) -++ -+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -+ def prepare_inputs_for_generation( -+ self, -+@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -+ # Exception 1: when passing input_embeds, input_ids may be missing entries -+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -++ -+ if past_key_values is not None: -+ if inputs_embeds is not None: # Exception 1 -+ if 0 not in input_ids.shape: -+@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ } -+ ) -+ return model_inputs -++ -+ # @lwx -+ # def _decode_one_tokens_logits( -+ # self, -+@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -+ attentions=outputs.attentions, -+ ) -+ -++ -+ __all__ = [ -+ "Qwen2MoeForCausalLM", -+ "Qwen2MoeModel", -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+new file mode 100644 -+index 00000000..6dfb5b93 -+--- /dev/null -++++ b/patches/0001-20251104commit.patch -+@@ -0,0 +1,1272 @@ -++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++Subject: [PATCH] 20251104commit -++ -++--- -++ mindnlp/transformers/cache_utils.py | 28 +- -++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++ 3 files changed, 976 insertions(+), 87 deletions(-) -++ -++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++index cadd2e04..02f8d4be 100644 -++--- a/mindnlp/transformers/cache_utils.py -+++++ b/mindnlp/transformers/cache_utils.py -++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++ # k_out[:, :, cache_position] = key_states -++ # v_out[:, :, cache_position] = value_states -++- if ON_ORANGE_PI: -++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++- else: -++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++- -+++ # if ON_ORANGE_PI: -+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++ # else: -+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++ if cache_position.ndim > 1: -+++ cache_position = cache_position.flatten() -+++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++ cache_position = cache_position.int() -+++ -+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++ k_out[:, :, cache_position] = key_states -+++ v_out[:, :, cache_position] = value_states -+++ -++ return k_out, v_out -++ -++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index c695b944..d8303e45 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++- x1 = x[..., : x.shape[-1] // 2] -++- x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++ # x1 = x[..., : x.shape[-1] // 2] -+++ # x2 = x[..., x.shape[-1] // 2 :] -+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++ if self.training: -++ raise NotImplementedError("Training is not supported yet.") -++ else: -++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++- if self.config.n_shared_experts is not None: -++- y = y + self.shared_experts(identity) -++- return y -+++ # @lwx -+++ if orig_shape[1] == 1: -+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++ y=y.view(*orig_shape) -+++ if self.config.n_shared_experts is not None: -+++ y = y + self.shared_experts(identity) -+++ return y -+++ else: -+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++ if self.config.n_shared_experts is not None: -+++ y = y + self.shared_experts(identity) -+++ return y -+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++ # if self.config.n_shared_experts is not None: -+++ # y = y + self.shared_experts(identity) -+++ # return y -+++ -+++ @no_grad() -+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ -+++ expert_cache = ops.zeros_like(x) -+++ for i in range(self.num_experts_per_tok): -+++ expert_id = flat_expert_indices[i].item() -+++ weight = flat_expert_weights[i].item() -+++ expert = self.experts[expert_id] -+++ expert_out = expert(x) -+++ expert_cache += expert_out * weight -+++ return expert_cache -++ -++ @no_grad() -++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++- # expert_cache = torch.zeros_like(x) -++- # idxs = flat_expert_indices.argsort() -++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++- # token_idxs = idxs // self.num_experts_per_tok -++- # for i, end_idx in enumerate(tokens_per_expert): -++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++- # if start_idx == end_idx: -++- # continue -++- # expert = self.experts[i] -++- # exp_token_idx = token_idxs[start_idx:end_idx] -++- # expert_tokens = x[exp_token_idx] -++- # expert_out = expert(expert_tokens) -++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++- # return expert_cache -+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ expert_cache = ops.zeros_like(x) -++ idxs = flat_expert_indices.argsort() -++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ token_idxs = idxs // self.num_experts_per_tok -+++ -++ for i, end_idx in enumerate(tokens_per_expert): -++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ if start_idx == end_idx: -++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++ expert_out = expert(expert_tokens) -++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -++ return expert_cache -+++ -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # # expert_cache = torch.zeros_like(x) -+++ # # idxs = flat_expert_indices.argsort() -+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++ # # token_idxs = idxs // self.num_experts_per_tok -+++ # # for i, end_idx in enumerate(tokens_per_expert): -+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++ # # if start_idx == end_idx: -+++ # # continue -+++ # # expert = self.experts[i] -+++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # # expert_tokens = x[exp_token_idx] -+++ # # expert_out = expert(expert_tokens) -+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++ # # return expert_cache -+++ # expert_cache = ops.zeros_like(x) -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # for i, end_idx in enumerate(tokens_per_expert): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # if start_idx == end_idx: -+++ # continue -+++ # expert = self.experts[i] -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = expert(expert_tokens) -+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -+++ # return expert_cache -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # expert_cache = ops.zeros_like(x) -+++ -+++ # # 排序保证顺序一致 -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # # 找出有 token 的专家 -+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++ -+++ # for i in active_experts.tolist(): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # end_idx = tokens_per_expert[i] -+++ # if start_idx == end_idx: # 没有 token -+++ # continue -+++ -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = self.experts[i](expert_tokens) -+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++ -+++ # expert_cache = mindspore.mint.scatter_add( -+++ # expert_cache, -+++ # 0, -+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++ # expert_out -+++ # ) -+++ -+++ # return expert_cache -+++ -+++ -++ -++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++ # """ -++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ -++ # Initialize weights and apply final processing -++ self.post_init() -+++ self.warm_up = False -+++ -+++ def warmup_moe_model_deep(self): -+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++ test_texts = [ -+++ "warmup short", -+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++ ] -+++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++ if tokenizer is None: -+++ from mindnlp.transformers import AutoTokenizer -+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++ self._warmup_tokenizer = tokenizer -+++ -+++ for text in test_texts: -+++ inputs = tokenizer(text, return_tensors="ms") -+++ with mindspore._no_grad(): -+++ _ = self(**inputs, use_cache=False) -+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++ -++ def get_input_embeddings(self): -++ return self.model.embed_tokens -++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++ ```""" -+++ if not self.warm_up: -+++ self.warm_up = True -+++ self.warmup_moe_model_deep() -+++ -++ output_attentions = ( -++ output_attentions -++ if output_attentions is not None -++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++index 3cbf820e..d4c6b651 100644 -++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++@@ -18,7 +18,6 @@ -++ # See the License for the specific language governing permissions and -++ # limitations under the License. -++ """MindSpore Qwen2MoE model.""" -++- -++ import math -++ from typing import List, Optional, Tuple, Union -++ -++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++ TokenClassifierOutput, -++ ) -++ from ...modeling_utils import PreTrainedModel -+++from ...generation import GenerationMixin -++ from ....utils import logging -++ from .configuration_qwen2_moe import Qwen2MoeConfig -++ -++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++ self.variance_epsilon = eps -++ -++ def forward(self, hidden_states): -+++ # @dwj -+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++ # @lwx -+++ # if not self.training : -+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++ input_dtype = hidden_states.dtype -++ hidden_states = hidden_states.to(mindspore.float32) -++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++@@ -234,6 +239,8 @@ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++ x1 = x[..., : x.shape[-1] // 2] -++ x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++ self.config = config -++ self.hidden_size = config.hidden_size -++ self.intermediate_size = intermediate_size -+++ -++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++ self.act_fn = ACT2FN[config.hidden_act] -++ -++ def forward(self, x): -++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++- -++ -+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++ # @lwx -+++ # gate_up_output = self.gate_up_proj(x) -+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++ # return self.down_proj(swiglu_output) -+++ -+++ # def forward(self, x): -+++ # gate_proj_out = self.gate_proj(x) -+++ # up_proj_out = self.up_proj(x) -+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++ # return self.down_proj(swiglu_out) -+++ -++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++ """ -++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++ use_cache: bool = False, -++ cache_position: Optional[mindspore.Tensor] = None, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ -+++ -++ bsz, q_len, _ = hidden_states.shape -++ -++ query_states = self.q_proj(hidden_states) -++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ "with a layer index." -++ ) -++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ if isinstance(past_key_value, StaticCache): -+++ kv_seq_len = key_states.shape[-2] -+++ else: -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++ if isinstance(past_key_value, StaticCache): -+++ kv_seq_len = key_states.shape[-2] -++ -++ # repeat k/v heads if n_kv_heads < n_heads -++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++- -+++ -++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++ -++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++- raise ValueError( -++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++- f" {attn_weights.shape}" -++- ) -++- -++- if attention_mask is not None: # no matter the length, we just slice it -++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++ if attention_mask is not None: -+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++ attn_weights = attn_weights + causal_mask -++ -++ # upcast attention to fp32 -++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++ -++ attn_output = self.o_proj(attn_output) -++- -+++ # @lwx -+++ -+++ # max_seq_len = self.max_position_embeddings # 2048 -+++ -+++ # if attention_mask is not None: -+++ # # attention_mask: [B, 1, Sq, Sk] -+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++ -+++ # # pad 到 [max_seq_len, max_seq_len] -+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++ # global_attention_mask = padded_mask -+++ # else: -+++ # global_attention_mask = None -+++ -+++ -+++ # sparse_mode=3 -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, -+++ # key=key_states, -+++ # value=value_states, -+++ # real_shift=None, -+++ # padding_mask=None, -+++ -+++ # head_num=self.num_heads, -+++ # attn_mask=global_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++ # input_layout="BNSD", -+++ # pre_tokens=2147483647, -+++ # next_tokens=2147483647, -+++ # inner_precise=0, -+++ # drop_mask=None, -+++ # prefix=None, -+++ # actual_seq_qlen=None, -+++ # actual_seq_kvlen=None, -+++ # sparse_mode=sparse_mode, -+++ # ) -++ if not output_attentions: -++ attn_weights = None -++ -++ return attn_output, attn_weights, past_key_value -++ -++ -+++class Qwen2MoeFlashAttention(nn.Module): -+++ """ -+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++ -+++ 关键改动: -+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++ 直接传入原始的 key 和 value 张量效率更高。 -+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++ """ -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++ super().__init__() -+++ self.config = config -+++ self.layer_idx = layer_idx -+++ self.hidden_size = config.hidden_size -+++ self.num_heads = config.num_attention_heads -+++ self.head_dim = self.hidden_size // self.num_heads -+++ self.num_key_value_heads = config.num_key_value_heads -+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++ self.max_position_embeddings = config.max_position_embeddings -+++ self.rope_theta = config.rope_theta -+++ self.attention_dropout = config.attention_dropout -+++ -+++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++ raise ValueError( -+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++ ) -+++ -+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++ -+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++ self.head_dim, -+++ max_position_embeddings=self.max_position_embeddings, -+++ base=self.rope_theta, -+++ ) -+++ -+++ def forward( -+++ self, -+++ hidden_states: mindspore.Tensor, -+++ attention_mask: Optional[mindspore.Tensor] = None, -+++ position_ids: Optional[mindspore.Tensor] = None, -+++ past_key_value: Optional[Cache] = None, -+++ output_attentions: bool = False, -+++ use_cache: bool = False, -+++ cache_position: Optional[mindspore.Tensor] = None, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ # 1. 线性投射 Q, K, V -+++ query_states = self.q_proj(hidden_states) -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++ # query: [B, S, H*D] -> [B, N1, S, D] -+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # 3. RoPE 旋转位置编码 -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++ if self.layer_idx is None: -+++ raise ValueError( -+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ "with a layer index." -+++ ) -+++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++ if cache_position.shape[0] == 1: -+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++ kv_seq_len = past_seen_tokens + 1 -+++ else: -+++ # prefill 阶段:cache_position 是范围,使用其长度 -+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++ else: -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # 4. KV 缓存更新 -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ key_states, value_states = past_key_value.update( -+++ key_states, value_states, self.layer_idx, cache_kwargs -+++ ) -+++ -+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++ if cache_position.shape[0] == 1: -+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++ kv_seq_len = key_states.shape[-2] -+++ -+++ # 5. [重要] 准备 Attention Mask -+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++ fa_attention_mask = None -+++ if attention_mask is not None: -+++ # 截取与当前key长度匹配的部分 -+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++ fa_attention_mask = (mask_slice != 0) -+++ -+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++ input_dtype = query_states.dtype -+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++ query_states = query_states.to(mindspore.float16) -+++ key_states = key_states.to(mindspore.float16) -+++ value_states = value_states.to(mindspore.float16) -+++ -+++ # 6. [核心] 调用 flash_attention_score 算子 -+++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++ attn_output = mindspore.ops.flash_attention_score( -+++ query=query_states, -+++ key=key_states, -+++ value=value_states, -+++ head_num=self.num_heads, # 传入Q的头数(N1) -+++ attn_mask=fa_attention_mask, -+++ keep_prob=1.0 - self.attention_dropout, -+++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++ input_layout="BNSD", -+++ sparse_mode=0 # 使用 defaultMask 模式 -+++ ) -+++ -+++ # 恢复原始数据类型 -+++ attn_output = attn_output.to(input_dtype) -+++ -+++ # 7. 调整输出形状 -+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ attn_output = self.o_proj(attn_output) -+++ -+++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++ attn_weights = None -+++ if output_attentions: -+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++ # def forward( -+++ # self, -+++ # hidden_states: mindspore.Tensor, -+++ # attention_mask: Optional[mindspore.Tensor] = None, -+++ # position_ids: Optional[mindspore.Tensor] = None, -+++ # past_key_value: Optional[Cache] = None, -+++ # output_attentions: bool = False, -+++ # use_cache: bool = False, -+++ # cache_position: Optional[mindspore.Tensor] = None, -+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ # bsz, q_len, _ = hidden_states.shape -+++ -+++ # # 1. 线性投射 Q, K, V -+++ # query_states = self.q_proj(hidden_states) -+++ # key_states = self.k_proj(hidden_states) -+++ # value_states = self.v_proj(hidden_states) -+++ -+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # # 3. RoPE 旋转位置编码 -+++ # kv_seq_len = key_states.shape[-2] -+++ # if past_key_value is not None: -+++ # if self.layer_idx is None: -+++ # raise ValueError( -+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ # "with a layer index." -+++ # ) -+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # # 4. KV 缓存更新 -+++ # if past_key_value is not None: -+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ # key_states, value_states = past_key_value.update( -+++ # key_states, value_states, self.layer_idx, cache_kwargs -+++ # ) -+++ -+++ # # 5. 准备 Attention Mask -+++ # fa_attention_mask = None -+++ # if attention_mask is not None: -+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # fa_attention_mask = (mask_slice != 0) -+++ -+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++ # input_dtype = query_states.dtype -+++ -+++ # # 6. [核心] 调用 flash_attention_score 算子 -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, -+++ # key=key_states, -+++ # value=value_states, -+++ # head_num=self.num_heads, -+++ # attn_mask=fa_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++ # input_layout="BNSD", -+++ # sparse_mode=0, -+++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++ # inner_precise=1 -+++ # ) -+++ -+++ # # 恢复原始数据类型 -+++ # attn_output = attn_output.to(input_dtype) -+++ -+++ # # 7. 调整输出形状 -+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ # attn_output = self.o_proj(attn_output) -+++ -+++ # attn_weights = None -+++ # if output_attentions: -+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++ # return attn_output, attn_weights, past_key_value -+++ -+++ # def forward( -+++ # self, -+++ # hidden_states: mindspore.Tensor, -+++ # attention_mask: Optional[mindspore.Tensor] = None, -+++ # position_ids: Optional[mindspore.Tensor] = None, -+++ # past_key_value: Optional[Cache] = None, -+++ # output_attentions: bool = False, -+++ # use_cache: bool = False, -+++ # cache_position: Optional[mindspore.Tensor] = None, -+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ # bsz, q_len, _ = hidden_states.shape -+++ -+++ # query_states = self.q_proj(hidden_states) -+++ # key_states = self.k_proj(hidden_states) -+++ # value_states = self.v_proj(hidden_states) -+++ -+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # kv_seq_len = key_states.shape[-2] -+++ # if past_key_value is not None: -+++ # if self.layer_idx is None: -+++ # raise ValueError("`layer_idx` must be specified for caching") -+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # if past_key_value is not None: -+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ # key_states, value_states = past_key_value.update( -+++ # key_states, value_states, self.layer_idx, cache_kwargs -+++ # ) -+++ -+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++ -+++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++ # query_states = query_states / math.sqrt(self.head_dim) -+++ # # <--- 修改结束 --- -+++ -+++ # fa_attention_mask = None -+++ # if attention_mask is not None: -+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # fa_attention_mask = (mask_slice != 0) -+++ -+++ # input_dtype = query_states.dtype -+++ -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, # 传入已经预先缩放过的 query -+++ # key=key_states, -+++ # value=value_states, -+++ # head_num=self.num_heads, -+++ # attn_mask=fa_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++ # input_layout="BNSD", -+++ # sparse_mode=0, -+++ # inner_precise=1 # 仍然保持内部高精度计算 -+++ # ) -+++ -+++ # attn_output = attn_output.to(input_dtype) -+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ # attn_output = self.o_proj(attn_output) -+++ -+++ # attn_weights = None -+++ # if output_attentions: -+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++ -+++ # return attn_output, attn_weights, past_key_value -+++ -++ QWEN2MOE_ATTENTION_CLASSES = { -++ "eager": Qwen2MoeAttention, -+++ "flash-attention": Qwen2MoeFlashAttention, -++ } -++ -++ -++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -+++ #@dwj -+++ # 只遍历激活的专家,而非全部专家 -++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++- hidden_states = hidden_states.view(-1, hidden_dim) -++- # router_logits: (batch * sequence_length, n_experts) -++- router_logits = self.gate(hidden_states) -++- -++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- if self.norm_topk_prob: -++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- # we cast back to the input dtype -++- routing_weights = routing_weights.to(hidden_states.dtype) -++- -++- final_hidden_states = ops.zeros( -++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++- ) -++- -++- # One hot encode the selected experts to create an expert mask -++- # this will be used to easily index which expert is going to be sollicitated -++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++- -++- # Loop over all available experts in the model and perform the computation on each expert -++- for expert_idx in range(self.num_experts): -++- expert_layer = self.experts[expert_idx] -++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++- -++- # Index the correct hidden states and compute the expert hidden state for -++- # the current expert. We need to make sure to multiply the output hidden -++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++- if 0 not in idx.shape: -++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++- -++- # However `index_add_` only support torch tensors for indexing so we'll use -++- # the `top_x` tensor here. -++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++- -++- shared_expert_output = self.shared_expert(hidden_states) -++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++- -++- final_hidden_states = final_hidden_states + shared_expert_output -+++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++ num_tokens = hidden_states_reshaped.shape[0] -+++ -+++ router_logits = self.gate(hidden_states_reshaped) -+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++ if self.norm_topk_prob: -+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++ flat_selected_experts = selected_experts.flatten() -+++ -+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++ token_indices = broadcasted_token_indices.flatten() -+++ -+++ active_experts = ops.unique(flat_selected_experts) -+++ -+++ for expert_idx_tensor in active_experts: -+++ expert_idx = expert_idx_tensor.item() -+++ expert_layer = self.experts[expert_idx] -+++ -+++ mask = (flat_selected_experts == expert_idx_tensor) -+++ selected_token_indices = token_indices[mask] -+++ selected_routing_weights = routing_weights.flatten()[mask] -+++ -+++ current_states = hidden_states_reshaped[selected_token_indices] -+++ -+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++ -+++ final_hidden_states = final_hidden_states.index_add( -+++ dim=0, -+++ index=selected_token_indices, -+++ source=expert_output.to(hidden_states.dtype) -+++ ) -+++ -+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++ -++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++- return final_hidden_states, router_logits -+++ final_hidden_states = final_hidden_states + shared_expert_output -+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++ -+++ return final_hidden_states, router_logits -++ -++ -++ class Qwen2MoeDecoderLayer(nn.Module): -++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++ -++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++ -++ if (layer_idx not in config.mlp_only_layers) and ( -++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++ ): -++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++ _skip_keys_device_placement = "past_key_values" -++ _supports_cache_class = True -+++#lwx -+++ # _supports_static_cache = True -++ -++ def _init_weights(self, module): -++ std = self.config.initializer_range -++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++ return causal_mask -++ -++ -++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ _tied_weights_keys = ["lm_head.weight"] -++ -++ def __init__(self, config): -++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ self.num_experts_per_tok = config.num_experts_per_tok -++ # Initialize weights and apply final processing -++ self.post_init() -+++ # @lwx -+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++ # self.generation_config.cache_implementation = "static" -+++ self._warmed_up = False -+++ -+++ def warmup_moe_model(self): -+++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++ test_texts = [ -+++ "warmup short", -+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++ ] -+++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++ if tokenizer is None: -+++ from mindnlp.transformers import AutoTokenizer -+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++ self._warmup_tokenizer = tokenizer -+++ -+++ for text in test_texts: -+++ inputs = tokenizer(text, return_tensors="ms") -+++ with mindspore._no_grad(): -+++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++ -++ def get_input_embeddings(self): -++ return self.model.embed_tokens -++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++ ```""" -+++ if not self._warmed_up: -+++ self._warmed_up = True -+++ self.warmup_moe_model() -++ -++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++ output_router_logits = ( -++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ } -++ ) -++ return model_inputs -+++# @lwx -+++ # def _decode_one_tokens_logits( -+++ # self, -+++ # cur_token: mindspore.Tensor, -+++ # input_pos: Optional[mindspore.Tensor], -+++ # cache_position: mindspore.Tensor, -+++ # past_key_values: StaticCache, -+++ # ) -> mindspore.Tensor: -+++ # """ -+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++ -+++ # Args: -+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++ # input_pos: 输入位置信息,可选 -+++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++ -+++ # Returns: -+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++ # """ -+++ # # 调用JIT编译的版本 -+++ # return self.get_decode_one_tokens_logits( -+++ # cur_token=cur_token, -+++ # input_pos=input_pos, -+++ # cache_position=cache_position, -+++ # past_key_values=past_key_values, -+++ # ) -+++ -+++ # @mindspore.jit(jit_level='O1') -+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++ # """ -+++ # JIT编译的函数,用于高效的单token解码 -+++ # 使用JIT编译优化以支持静态shape和高效执行 -+++ -+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++ # """ -+++ # outputs = self.model.forward( -+++ # input_ids=cur_token, -+++ # position_ids=input_pos, -+++ # cache_position=cache_position, -+++ # past_key_values=past_key_values, -+++ # use_cache=True, -+++ # return_dict=False, -+++ # ) -+++ -+++ # hidden_states = outputs[0] -+++ # logits = self.lm_head.forward(hidden_states) -+++ # logits = logits.float() -+++ -+++ # return logits[:, -1, :] -+++ -+++ # def _sample( -+++ # self, -+++ # input_ids: mindspore.Tensor, -+++ # logits_processor, -+++ # stopping_criteria, -+++ # generation_config, -+++ # synced_devices: bool, -+++ # streamer=None, -+++ # logits_warper=None, -+++ # **model_kwargs, -+++ # ): -+++ # """ -+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++ # """ -+++ # from ...generation.logits_process import LogitsProcessorList -+++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++ # from mindnlp.core import nn, ops, no_grad -+++ # import numpy as np -+++ -+++ # # 检查是否使用 StaticCache -+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++ # # 否则,直接调用父类方法 -+++ # past_key_values = model_kwargs.get("past_key_values") -+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++ -+++ # if not isinstance(past_key_values, StaticCache): -+++ # # 不使用 StaticCache,直接调用父类方法 -+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++ # return super()._sample( -+++ # input_ids=input_ids, -+++ # logits_processor=logits_processor, -+++ # stopping_criteria=stopping_criteria, -+++ # generation_config=generation_config, -+++ # synced_devices=synced_devices, -+++ # streamer=streamer, -+++ # logits_warper=logits_warper, -+++ # **model_kwargs, -+++ # ) -+++ -+++ # # 使用 StaticCache,进入自定义循环 -+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++ # pad_token_id = generation_config._pad_token_tensor -+++ # output_attentions = generation_config.output_attentions -+++ # output_hidden_states = generation_config.output_hidden_states -+++ # output_scores = generation_config.output_scores -+++ # output_logits = generation_config.output_logits -+++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++ # max_length = generation_config.max_length -+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++ # do_sample = generation_config.do_sample -+++ -+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++ # raise ValueError( -+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++ # f"{logits_warper})." -+++ # ) -+++ -+++ # # init attention / hidden states / scores tuples -+++ # scores = () if (return_dict_in_generate and output_scores) else None -+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++ -+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++ # encoder_hidden_states = ( -+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++ # ) -+++ -+++ # # keep track of which sequences are already finished -+++ # batch_size, cur_len = input_ids.shape -+++ # this_peer_finished = False -+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++ -+++ # time_record = [] -+++ # from ....utils.testing_utils import parse_flag_from_env -+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++ -+++ # while self._has_unfinished_sequences( -+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++ # ): -+++ # if _record_time: -+++ # import time as time_module -+++ # infer_start = time_module.time() -+++ -+++ # # prepare model inputs -+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++ -+++ # # prepare variable output controls -+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++ -+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++ # cur_cache_position = model_inputs.get("cache_position") -+++ # cur_past_key_values = model_inputs.get("past_key_values") -+++ # cur_input_ids = model_inputs.get("input_ids") -+++ -+++ # if (isinstance(cur_past_key_values, StaticCache) and -+++ # cur_cache_position is not None and -+++ # len(cur_cache_position.shape) > 0 and -+++ # cur_cache_position.shape[0] == 1 and -+++ # cur_input_ids is not None and -+++ # cur_input_ids.shape[1] == 1): -+++ # # 使用 JIT 优化的单 token 解码 -+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++ # if not hasattr(self, '_jit_used'): -+++ # self._jit_used = False -+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++ -+++ # next_token_logits = self.get_decode_one_tokens_logits( -+++ # cur_token=cur_input_ids, -+++ # input_pos=model_inputs.get("position_ids"), -+++ # cache_position=cur_cache_position, -+++ # past_key_values=cur_past_key_values, -+++ # ) -+++ -+++ # # 标记已使用JIT(用于后续判断) -+++ # if not self._jit_used: -+++ # self._jit_used = True -+++ -+++ # # 构造兼容的输出对象 -+++ # class JitOptimizedOutput: -+++ # def __init__(self, logits, config): -+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++ # self.config = config -+++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++ # self.attentions = None if not config.is_encoder_decoder else None -+++ # self.cross_attentions = None -+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++ -+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++ # else: -+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++ # outputs = self(**model_inputs, return_dict=True) -+++ -+++ # if synced_devices and this_peer_finished: -+++ # continue -+++ -+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++ # next_token_logits = outputs.logits[:, -1, :] -+++ -+++ # # pre-process distribution -+++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++ # if do_sample: -+++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++ -+++ # # Store scores, attentions and hidden_states when required -+++ # if return_dict_in_generate: -+++ # if output_scores: -+++ # scores += (next_token_scores,) -+++ # if output_logits: -+++ # raw_logits += (next_token_logits,) -+++ # if output_attentions: -+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++ # if self.config.is_encoder_decoder: -+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++ -+++ # if output_hidden_states: -+++ # hidden = ( -+++ # outputs.decoder_hidden_states -+++ # if self.config.is_encoder_decoder -+++ # else outputs.hidden_states -+++ # ) -+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++ -+++ # # token selection -+++ # if do_sample: -+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++ # else: -+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++ -+++ # # finished sentences should have their next token be a padding token -+++ # if has_eos_stopping_criteria: -+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++ -+++ # # update generated ids, model inputs, and length for next step -+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++ # if streamer is not None: -+++ # streamer.put(next_tokens) -+++ -+++ # model_kwargs = self._update_model_kwargs_for_generation( -+++ # outputs, -+++ # model_kwargs, -+++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++ # ) -+++ -+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++ # cur_len += 1 -+++ -+++ # if _record_time: -+++ # import time as time_module -+++ # infer_stop = time_module.time() -+++ # time_record.append(infer_stop - infer_start) -+++ -+++ # del outputs -+++ -+++ # average_infer_time = None -+++ # if time_record: -+++ # if len(time_record) > 1: -+++ # time_record.pop(0) -+++ # average_infer_time = sum(time_record) / len(time_record) -+++ # print(f'average inference time is: {average_infer_time}') -+++ # print(f'inference time record: {time_record}') -+++ -+++ # if streamer is not None: -+++ # streamer.end() -+++ -+++ # # 简单判断:打印是否使用了JIT路径 -+++ # if hasattr(self, '_jit_used') and self._jit_used: -+++ # print("[JIT] ✓ JIT optimization was used during generation") -+++ # else: -+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++ -+++ # if return_dict_in_generate: -+++ # if self.config.is_encoder_decoder: -+++ # return GenerateEncoderDecoderOutput( -+++ # sequences=input_ids, -+++ # scores=scores, -+++ # logits=raw_logits, -+++ # encoder_attentions=encoder_attentions, -+++ # encoder_hidden_states=encoder_hidden_states, -+++ # decoder_attentions=decoder_attentions, -+++ # cross_attentions=cross_attentions, -+++ # decoder_hidden_states=decoder_hidden_states, -+++ # past_key_values=model_kwargs.get("past_key_values"), -+++ # average_infer_time=average_infer_time -+++ # ) -+++ # else: -+++ # return GenerateDecoderOnlyOutput( -+++ # sequences=input_ids, -+++ # scores=scores, -+++ # logits=raw_logits, -+++ # attentions=decoder_attentions, -+++ # hidden_states=decoder_hidden_states, -+++ # past_key_values=model_kwargs.get("past_key_values"), -+++ # average_infer_time=average_infer_time -+++ # ) -+++ # else: -+++ # return input_ids -+++ -+++ # def _prepare_cache_for_generation( -+++ # self, -+++ # generation_config, -+++ # model_kwargs, -+++ # assistant_model, -+++ # batch_size, -+++ # max_cache_length, -+++ # ): -+++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++ # generation_config.cache_implementation = "static" -+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++ -+++ # if generation_config.cache_implementation == "static": -+++ # base_required_from_max_length = generation_config.max_length + 1 -+++ # base_required = max(max_cache_length, base_required_from_max_length) -+++ # min_cache_size = 50 -+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++ # else: -+++ # max_cache_length = max(base_required, min_cache_size) -+++ -+++ # original_max_cache_length = max_cache_length -+++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++ # print(f" - final max_cache_length: {max_cache_length}") -+++ -+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++ # if max_cache_length > self.config.max_position_embeddings: -+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++ -+++ # result = super()._prepare_cache_for_generation( -+++ # generation_config=generation_config, -+++ # model_kwargs=model_kwargs, -+++ # assistant_model=assistant_model, -+++ # batch_size=batch_size, -+++ # max_cache_length=max_cache_length, -+++ # ) -+++ -+++ # if generation_config.cache_implementation == "static": -+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++ # created_cache = model_kwargs.get(cache_name) -+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++ # if created_cache.max_cache_len < generation_config.max_length: -+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++ -+++ # return result -+++ -+++ -+++ -++ -++ -++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -new file mode 100644 -index 00000000..966529e4 ---- /dev/null -+++ b/patches/0003-20261106secondcommit.patch -@@ -0,0 +1,2769 @@ -+From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Thu, 6 Nov 2025 14:54:37 +0800 -+Subject: [PATCH 3/3] 20261106secondcommit -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 217 ++- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -+ patches/0001-20251104commit.patch | 1272 ----------------- -+ 3 files changed, 528 insertions(+), 2032 deletions(-) -+ delete mode 100644 patches/0001-20251104commit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index 73773c22..2f9192bf 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -+ -+ _CONFIG_FOR_DOC = "DeepseekConfig" -+ -++_attn_mask_cache = {} -++ -++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -++ q_len = batch_and_seq[1] -++ kv_len = batch_and_seq[1] + past_key_values_length -++ key = (batch_and_seq[0], q_len, kv_len) -++ -++ if key in _attn_mask_cache: -++ return _attn_mask_cache[key] -++ -++ mask = _prepare_4d_causal_attention_mask( -++ attention_mask, -++ batch_and_seq, -++ inputs_embeds, -++ past_key_values_length, -++ ) -++ _attn_mask_cache[key] = mask -++ return mask -+ -+ def _get_unpad_data(attention_mask): -+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -+@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -+ return final_output -+ -+ -+- @no_grad() -+- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+- expert_cache = ops.zeros_like(x) -+- idxs = flat_expert_indices.argsort() -+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+- token_idxs = idxs // self.num_experts_per_tok -+- -+- for i, end_idx in enumerate(tokens_per_expert): -+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+- if start_idx == end_idx: -+- continue -+- expert = self.experts[i] -+- exp_token_idx = token_idxs[start_idx:end_idx] -+- expert_tokens = x[exp_token_idx] -+- expert_out = expert(expert_tokens) -+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+- -+- return expert_cache -+- -+ # @no_grad() -+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+- # # expert_cache = torch.zeros_like(x) -+- # # idxs = flat_expert_indices.argsort() -+- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+- # # token_idxs = idxs // self.num_experts_per_tok -+- # # for i, end_idx in enumerate(tokens_per_expert): -+- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+- # # if start_idx == end_idx: -+- # # continue -+- # # expert = self.experts[i] -+- # # exp_token_idx = token_idxs[start_idx:end_idx] -+- # # expert_tokens = x[exp_token_idx] -+- # # expert_out = expert(expert_tokens) -+- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+- # # return expert_cache -++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ # expert_cache = ops.zeros_like(x) -+ # idxs = flat_expert_indices.argsort() -+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+ -+ # return expert_cache -+- # @no_grad() -+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+- # expert_cache = ops.zeros_like(x) -++ -++ @no_grad() -++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ """ -++ 优化版 MoE prefill: -++ - 批量张量化处理同一个 expert 的所有 token -++ - 跳过无 token 的专家 -++ - 保持结果完全一致 -++ """ -++ # 初始化输出缓存 -++ expert_cache = ops.zeros_like(x) -+ -+- # # 排序保证顺序一致 -+- # idxs = flat_expert_indices.argsort() -+- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+- # token_idxs = idxs // self.num_experts_per_tok -++ # 排序(确保 scatter_add 位置对应原逻辑) -++ idxs = flat_expert_indices.argsort() -++ sorted_expert_indices = flat_expert_indices[idxs] -++ sorted_token_indices = idxs // self.num_experts_per_tok -+ -+- # # 找出有 token 的专家 -+- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++ # 每个 expert 的 token 数 -++ tokens_per_expert = sorted_expert_indices.bincount() -+ -+- # for i in active_experts.tolist(): -+- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+- # end_idx = tokens_per_expert[i] -+- # if start_idx == end_idx: # 没有 token -+- # continue -++ # 找出有 token 的专家 -++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+ -+- # exp_token_idx = token_idxs[start_idx:end_idx] -+- # expert_tokens = x[exp_token_idx] -+- # expert_out = self.experts[i](expert_tokens) -+- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++ for expert_id in active_experts.tolist(): -++ # 取该 expert 对应的排序后 token 区间 -++ start = (tokens_per_expert[:expert_id]).sum().item() -++ end = start + tokens_per_expert[expert_id].item() -+ -+- # expert_cache = mindspore.mint.scatter_add( -+- # expert_cache, -+- # 0, -+- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+- # expert_out -+- # ) -++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -++ expert_tokens = x[token_idx] # 取输入向量 -+ -+- # return expert_cache -++ # 执行专家 MLP -++ expert_out = self.experts[expert_id](expert_tokens) -++ -++ # 按权重缩放 -++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -++ -++ # 回写到缓存(等价 scatter_add) -++ expert_cache = mindspore.mint.scatter_add( -++ expert_cache, -++ 0, -++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -++ scaled_out -++ ) -++ -++ return expert_cache -++ -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # # expert_cache = torch.zeros_like(x) -++ # # idxs = flat_expert_indices.argsort() -++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++ # # token_idxs = idxs // self.num_experts_per_tok -++ # # for i, end_idx in enumerate(tokens_per_expert): -++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++ # # if start_idx == end_idx: -++ # # continue -++ # # expert = self.experts[i] -++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++ # # expert_tokens = x[exp_token_idx] -++ # # expert_out = expert(expert_tokens) -++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++ # # return expert_cache -++ # expert_cache = ops.zeros_like(x) -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # for i, end_idx in enumerate(tokens_per_expert): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # if start_idx == end_idx: -++ # continue -++ # expert = self.experts[i] -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = expert(expert_tokens) -++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -++ # return expert_cache -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++ # expert_cache = ops.zeros_like(x) -++ -++ # # 排序保证顺序一致 -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ # token_idxs = idxs // self.num_experts_per_tok -++ -++ # # 找出有 token 的专家 -++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++ -++ # for i in active_experts.tolist(): -++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ # end_idx = tokens_per_expert[i] -++ # if start_idx == end_idx: # 没有 token -++ # continue -++ -++ # exp_token_idx = token_idxs[start_idx:end_idx] -++ # expert_tokens = x[exp_token_idx] -++ # expert_out = self.experts[i](expert_tokens) -++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++ -++ # expert_cache = mindspore.mint.scatter_add( -++ # expert_cache, -++ # 0, -++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++ # expert_out -++ # ) -++ -++ # return expert_cache -+ -+ -+ -+@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -+ -+ return attn_output, attn_weights, past_key_value -+ -+- -+ # class DeepseekFlashAttention(nn.Module): -+ # """ -+ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -+ -+ return attn_output, attn_weights, past_key_value -+ -++ -+ Deepseek_ATTENTION_CLASSES = { -+ "eager": DeepseekAttention, -+ "flash-attention": DeepseekFlashAttention, -+@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -+ ) -+ else: -+ # 4d mask is passed through the layers -+- attention_mask = _prepare_4d_causal_attention_mask( -++ # attention_mask = _prepare_4d_causal_attention_mask( -++ # attention_mask, -++ # (batch_size, seq_length), -++ # inputs_embeds, -++ # past_key_values_length, -++ # ) -++ #@dwj -++ attention_mask = get_cached_causal_mask( -+ attention_mask, -+ (batch_size, seq_length), -+ inputs_embeds, -+@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ # Initialize weights and apply final processing -+ self.post_init() -+ self.warm_up = False -++ #@dwj -++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -++ self.num_layers, -++ self.num_attention_heads, -++ self.head_dim, -++ batch_size=1, -++ max_length=self.max_length, -++ dtype=mindspore.float16 -++ ) -++ -++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -++ key_cache = [] -++ value_cache = [] -++ for _ in range(num_layers): -++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++ key_cache.append(k) -++ value_cache.append(v) -++ return key_cache, value_cache -++ -+ -+ def warmup_moe_model_deep(self): -+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index bced285c..ebd7782e 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+ -+-Long_Prompt = False -+-PROMPT_LENGTH_THRESHOLD = 128 -++Long_Prompt = 1 -++LONG_PROMPT_LENGTH_THRESHOLD = 128 -++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -++ -++_causal_mask_cache = {} -++ -++def get_cached_causal_mask_with_cache_position( -++ attention_mask: mindspore.Tensor, -++ sequence_length: int, -++ target_length: int, -++ dtype: mindspore.dtype, -++ min_dtype: float, -++ cache_position: mindspore.Tensor, -++ batch_size: int, -++): -++ """ -++ 带缓存的 causal mask 构造函数 -++ """ -++ # q_len 是当前 query 长度 -++ q_len = sequence_length -++ # kv_len 是 target_length -++ kv_len = target_length -++ -++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -++ -++ if key in _causal_mask_cache: -++ return _causal_mask_cache[key] -++ -++ # 调用原来的 mask 构造逻辑 -++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++ attention_mask, -++ sequence_length=sequence_length, -++ target_length=target_length, -++ dtype=dtype, -++ min_dtype=min_dtype, -++ cache_position=cache_position, -++ batch_size=batch_size, -++ ) -++ # 缓存结果 -++ _causal_mask_cache[key] = causal_mask -++ return causal_mask -+ -+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+ def _prepare_4d_causal_attention_mask_with_cache_position( -+@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+ -+ -+ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -++# class Qwen2MoeAttention(nn.Module): -++# """ -++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++# and "Generating Long Sequences with Sparse Transformers". -++# """ -++ -++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++# super().__init__() -++# self.config = config -++# self.layer_idx = layer_idx -++# if layer_idx is None: -++# logger.warning_once( -++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++# "when creating this class." -++# ) -++ -++# self.hidden_size = config.hidden_size -++# self.num_heads = config.num_attention_heads -++# self.head_dim = self.hidden_size // self.num_heads -++# self.num_key_value_heads = config.num_key_value_heads -++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++# self.max_position_embeddings = config.max_position_embeddings -++# self.rope_theta = config.rope_theta -++# self.is_causal = True -++# self.attention_dropout = config.attention_dropout -++ -++# if (self.head_dim * self.num_heads) != self.hidden_size: -++# raise ValueError( -++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++# f" and `num_heads`: {self.num_heads})." -++# ) -++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++ -++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++# self.head_dim, -++# max_position_embeddings=self.max_position_embeddings, -++# base=self.rope_theta, -++# ) -++ -++# def forward( -++# self, -++# hidden_states: mindspore.Tensor, -++# attention_mask: Optional[mindspore.Tensor] = None, -++# position_ids: Optional[mindspore.Tensor] = None, -++# past_key_value: Optional[Cache] = None, -++# output_attentions: bool = False, -++# use_cache: bool = False, -++# cache_position: Optional[mindspore.Tensor] = None, -++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++ -++ -++ -++# bsz, q_len, _ = hidden_states.shape -++ -++# query_states = self.q_proj(hidden_states) -++# key_states = self.k_proj(hidden_states) -++# value_states = self.v_proj(hidden_states) -++ -++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++ -++# kv_seq_len = key_states.shape[-2] -++# if past_key_value is not None: -++# if self.layer_idx is None: -++# raise ValueError( -++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++# "with a layer index." -++# ) -++# if isinstance(past_key_value, StaticCache): -++# kv_seq_len = key_states.shape[-2] -++# else: -++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++# if past_key_value is not None: -++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++# if isinstance(past_key_value, StaticCache): -++# kv_seq_len = key_states.shape[-2] -++ -++# # repeat k/v heads if n_kv_heads < n_heads -++# key_states = repeat_kv(key_states, self.num_key_value_groups) -++# value_states = repeat_kv(value_states, self.num_key_value_groups) -++ -++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++ -++# if attention_mask is not None: -++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++# attn_weights = attn_weights + causal_mask -++ -++# # upcast attention to fp32 -++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++# attn_output = ops.matmul(attn_weights, value_states) -++ -++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++# raise ValueError( -++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++# f" {attn_output.shape}" -++# ) -++ -++# attn_output = ops.transpose(attn_output, 1, 2) -++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++ -++# attn_output = self.o_proj(attn_output) -++# # @lwx -++ -++# # max_seq_len = self.max_position_embeddings # 2048 -++ -++# # if attention_mask is not None: -++# # # attention_mask: [B, 1, Sq, Sk] -++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++ -++# # # pad 到 [max_seq_len, max_seq_len] -++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++# # global_attention_mask = padded_mask -++# # else: -++# # global_attention_mask = None -++ -++ -++# # sparse_mode=3 -++# # attn_output = mindspore.ops.flash_attention_score( -++# # query=query_states, -++# # key=key_states, -++# # value=value_states, -++# # real_shift=None, -++# # padding_mask=None, -++ -++# # head_num=self.num_heads, -++# # attn_mask=global_attention_mask, -++# # keep_prob=1.0 - self.attention_dropout, -++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++# # input_layout="BNSD", -++# # pre_tokens=2147483647, -++# # next_tokens=2147483647, -++# # inner_precise=0, -++# # drop_mask=None, -++# # prefix=None, -++# # actual_seq_qlen=None, -++# # actual_seq_kvlen=None, -++# # sparse_mode=sparse_mode, -++# # ) -++# if not output_attentions: -++# attn_weights = None -++ -++# return attn_output, attn_weights, past_key_value -++ -+ class Qwen2MoeAttention(nn.Module): -+ """ -+- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+- and "Generating Long Sequences with Sparse Transformers". -+- """ -++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -+ -++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -++ -++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -++ """ -+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+ super().__init__() -+ self.config = config -+@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -+ if layer_idx is None: -+ logger.warning_once( -+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+ "when creating this class." -+ ) -+ -+@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -+ use_cache: bool = False, -+ cache_position: Optional[mindspore.Tensor] = None, -+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+- -+ -+- -++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -+ bsz, q_len, _ = hidden_states.shape -+ -+ query_states = self.q_proj(hidden_states) -+ key_states = self.k_proj(hidden_states) -+ value_states = self.v_proj(hidden_states) -+ -+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+- -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ -+ kv_seq_len = key_states.shape[-2] -+ if past_key_value is not None: -+- if self.layer_idx is None: -+- raise ValueError( -+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+- "with a layer index." -+- ) -+- if isinstance(past_key_value, StaticCache): -+- kv_seq_len = key_states.shape[-2] -+- else: -+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+ -+ if past_key_value is not None: -+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++ -++ # --- 2. 动态调度核心注意力计算 --- -++ global Long_Prompt -++ if Long_Prompt >= 1: -++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -++ fa_attention_mask = None -++ if attention_mask is not None: -++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++ fa_attention_mask = (mask_slice != 0) -++ -++ attn_output = mindspore.ops.flash_attention_score( -++ query=query_states, -++ key=key_states, -++ value=value_states, -++ head_num=self.num_heads, -++ attn_mask=fa_attention_mask, -++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -++ scalar_value=1.0 / math.sqrt(self.head_dim), -++ input_layout="BNSD", -++ sparse_mode=0, -++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -++ ) -+ -+- if isinstance(past_key_value, StaticCache): -+- kv_seq_len = key_states.shape[-2] -++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ attn_output = self.o_proj(attn_output) -++ attn_weights = None -++ if output_attentions: -++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+ -+- # repeat k/v heads if n_kv_heads < n_heads -+- key_states = repeat_kv(key_states, self.num_key_value_groups) -+- value_states = repeat_kv(value_states, self.num_key_value_groups) -+- -+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++ else: -++ # --- Eager Attention 路径 (用于短序列和解码) --- -++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++ -++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+ -+- if attention_mask is not None: -+- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+- attn_weights = attn_weights + causal_mask -++ if attention_mask is not None: -++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++ attn_weights = attn_weights + causal_mask -+ -+- # upcast attention to fp32 -+- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+- attn_output = ops.matmul(attn_weights, value_states) -++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++ attn_output = ops.matmul(attn_weights, value_states) -+ -+- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+- raise ValueError( -+- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+- f" {attn_output.shape}" -+- ) -++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++ raise ValueError( -++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -++ ) -+ -+- attn_output = ops.transpose(attn_output, 1, 2) -+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++ attn_output = ops.transpose(attn_output, 1, 2) -++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++ attn_output = self.o_proj(attn_output) -+ -+- attn_output = self.o_proj(attn_output) -+- # @lwx -++ if not output_attentions: -++ attn_weights = None -+ -+- # max_seq_len = self.max_position_embeddings # 2048 -+- -+- # if attention_mask is not None: -+- # # attention_mask: [B, 1, Sq, Sk] -+- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+- -+- # # pad 到 [max_seq_len, max_seq_len] -+- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+- # global_attention_mask = padded_mask -+- # else: -+- # global_attention_mask = None -+- -+- -+- # sparse_mode=3 -+- # attn_output = mindspore.ops.flash_attention_score( -+- # query=query_states, -+- # key=key_states, -+- # value=value_states, -+- # real_shift=None, -+- # padding_mask=None, -+- -+- # head_num=self.num_heads, -+- # attn_mask=global_attention_mask, -+- # keep_prob=1.0 - self.attention_dropout, -+- # scalar_value=1.0 / math.sqrt(self.head_dim), -+- # input_layout="BNSD", -+- # pre_tokens=2147483647, -+- # next_tokens=2147483647, -+- # inner_precise=0, -+- # drop_mask=None, -+- # prefix=None, -+- # actual_seq_qlen=None, -+- # actual_seq_kvlen=None, -+- # sparse_mode=sparse_mode, -+- # ) -+- if not output_attentions: -+- attn_weights = None -+- -+ return attn_output, attn_weights, past_key_value -+ -+- -+ # class Qwen2MoeFlashAttention(nn.Module): -+ # """ -+ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -+ # return final_hidden_states, router_logits -+ -+ -+-# class Qwen2MoeSparseMoeBlock(nn.Module): -+-# """ -+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+-# `_moe_infer_prefill` (用于长序列处理) 方法。 -+-# """ -+-# def __init__(self, config: Qwen2MoeConfig): -+-# super().__init__() -+-# self.num_experts = config.num_experts -+-# self.top_k = config.num_experts_per_tok -+-# self.norm_topk_prob = config.norm_topk_prob -+- -+-# # 门控网络 -+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+-# # 专家列表 -+-# self.experts = nn.ModuleList( -+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+-# ) -+-# # 共享专家 -+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-# @no_grad() -+-# def _moe_infer_decode( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# """ -+-# 【解码路径】针对 sequence_length=1 的极致优化。 -+-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+-# """ -+-# batch_size, hidden_dim = hidden_states.shape -+- -+-# expert_outputs_list = [ -+-# ops.cat([ -+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+-# ], dim=0) -+-# for i in range(batch_size) -+-# ] -+- -+-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+-# # shape: (batch_size, top_k, hidden_dim) -+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+- -+-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+- -+-# return moe_output.squeeze(1) -+- -+-# @no_grad() -+-# def _moe_infer_prefill( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# """ -+-# 【预填充路径】针对 sequence_length > 1 的优化。 -+-# 按专家对 Token 进行分组,并进行批处理。 -+-# """ -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens = hidden_states.shape[0] -+-# flat_selected_experts = selected_experts.flatten() -+- -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+- -+-# active_experts = ops.unique(flat_selected_experts) -+- -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+- -+-# mask = (flat_selected_experts == expert_idx_tensor) -+-# selected_token_indices = token_indices[mask] -+-# selected_routing_weights = routing_weights.flatten()[mask] -+- -+-# current_states = hidden_states[selected_token_indices] -+- -+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+- -+-# moe_output = moe_output.index_add( -+-# dim=0, -+-# index=selected_token_indices, -+-# source=expert_output.to(hidden_states.dtype) -+-# ) -+-# return moe_output -+- -+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-# """ -+-# 顶层 forward 方法,作为智能分发器。 -+-# """ -+-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+- -+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-# router_logits = self.gate(hidden_states_reshaped) -+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- -+-# if self.norm_topk_prob: -+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- -+-# routing_weights = routing_weights.to(hidden_states.dtype) -+- -+-# moe_output = None -+-# # 在推理时,根据序列长度选择最优路径 -+-# if not self.training: -+-# if sequence_length == 1: -+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+-# else: -+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+-# else: -+-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+-# raise NotImplementedError("Training path is not implemented.") -+- -+-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+- -+-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+- -+-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+- -+-# return final_hidden_states, router_logits -+- -+- -+-# class Qwen2MoeSparseMoeBlock(nn.Module): -+-# """ -+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+-# """ -+-# def __init__(self, config: Qwen2MoeConfig): -+-# super().__init__() -+-# self.num_experts = config.num_experts -+-# self.top_k = config.num_experts_per_tok -+-# self.norm_topk_prob = config.norm_topk_prob -+- -+-# # 门控网络 -+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+-# # 专家列表 -+-# self.experts = nn.ModuleList( -+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+-# ) -+-# # 共享专家 -+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-# @no_grad() -+-# def _moe_infer_decode( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# batch_size, _ = hidden_states.shape -+-# expert_outputs_list = [ -+-# ops.cat([ -+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+-# ], dim=0) -+-# for i in range(batch_size) -+-# ] -+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+-# return moe_output.squeeze(1) -+- -+-# @no_grad() -+-# def _moe_infer_prefill( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens = hidden_states.shape[0] -+-# flat_selected_experts = selected_experts.flatten() -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+-# active_experts = ops.unique(flat_selected_experts) -+- -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+-# mask = (flat_selected_experts == expert_idx_tensor) -+-# selected_token_indices = token_indices[mask] -+-# selected_routing_weights = routing_weights.flatten()[mask] -+-# current_states = hidden_states[selected_token_indices] -+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+-# moe_output = moe_output.index_add( -+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+-# ) -+-# return moe_output -+- -+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-# """ -+-# 顶层 forward 方法,作为智能分发器。 -+-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+-# """ -+-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+- -+-# # 1. 门控计算 (通用逻辑) -+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-# router_logits = self.gate(hidden_states_reshaped) -+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- -+-# if self.norm_topk_prob: -+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- -+-# routing_weights = routing_weights.to(hidden_states.dtype) -+- -+-# # 2. 智能分发到最优 MoE 路径 -+-# moe_output = None -+-# if not self.training: -+-# if sequence_length == 1: -+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+-# else: -+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+-# else: -+-# raise NotImplementedError("Training path is not implemented.") -+- -+-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+- -+-# # 4. 合并 MoE 输出和共享专家输出 -+-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+- -+-# # 5. 恢复原始形状并返回 -+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+- -+-# return final_hidden_states, router_logits -+- -+-# prefill fastest -+-# class Qwen2MoeSparseMoeBlock(nn.Module): -+-# """ -+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+-# """ -+-# def __init__(self, config: Qwen2MoeConfig): -+-# super().__init__() -+-# self.num_experts = config.num_experts -+-# self.top_k = config.num_experts_per_tok -+-# self.norm_topk_prob = config.norm_topk_prob -+- -+-# # 门控网络 -+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+-# # 专家列表 -+-# self.experts = nn.ModuleList( -+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+-# ) -+-# # 共享专家 -+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-# @no_grad() -+-# def _moe_infer_dispatch( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# """ -+-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+-# """ -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens, _ = hidden_states.shape -+- -+-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+-# flat_selected_experts = selected_experts.flatten() -+-# flat_routing_weights = routing_weights.flatten() -+- -+-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+- -+-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+-# active_experts = ops.unique(flat_selected_experts) -+- -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+- -+-# # 找到所有分配给该专家的 token -+-# mask = (flat_selected_experts == expert_idx_tensor) -+- -+-# # 使用 mask 选取对应的 token 和权重 -+-# current_token_indices = token_indices[mask] -+-# current_routing_weights = flat_routing_weights[mask] -+-# current_hidden_states = hidden_states[current_token_indices] -+- -+-# # 对这些 token 进行批处理 -+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+- -+-# # 使用 index_add 将结果精确地加回到对应位置 -+-# moe_output = moe_output.index_add( -+-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+-# ) -+-# return moe_output -+- -+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-# """ -+-# 顶层 forward 方法,作为智能分发器。 -+-# """ -+-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+- -+-# # 1. 门控计算 -+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-# router_logits = self.gate(hidden_states_reshaped) -+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- -+-# if self.norm_topk_prob: -+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- -+-# routing_weights = routing_weights.to(hidden_states.dtype) -+- -+-# # 2. 调用统一的 MoE 计算内核 -+-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+- -+-# # 3. 统一处理共享专家 -+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+- -+-# # 4. 合并输出 -+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+- -+-# # 5. 恢复原始形状并返回 -+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+- -+-# return final_hidden_states, router_logits -+- -+- -+-# class Qwen2MoeSparseMoeBlock(nn.Module): -+-# """ -+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+-# 【最终高性能与高精度版】: -+-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+-# 3. 这样实现了速度和准确性的两全其美。 -+-# """ -+-# def __init__(self, config: Qwen2MoeConfig): -+-# super().__init__() -+-# self.num_experts = config.num_experts -+-# self.top_k = config.num_experts_per_tok -+-# self.norm_topk_prob = config.norm_topk_prob -+- -+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+-# self.experts = nn.ModuleList( -+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+-# ) -+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-# @no_grad() -+-# def _moe_infer_decode( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# """ -+-# 【解码路径】极致优化版:bmm + 高精度累加。 -+-# """ -+-# original_dtype = hidden_states.dtype -+-# batch_size, _ = hidden_states.shape -+- -+-# expert_outputs_list = [ -+-# ops.cat([ -+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+-# ], dim=0) -+-# for i in range(batch_size) -+-# ] -+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+- -+-# # 在 float32 下执行 bmm,得到高精度结果 -+-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+- -+-# # 将高精度结果转换回原始数据类型 -+-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+- -+-# return moe_output -+- -+-# @no_grad() -+-# def _moe_infer_prefill( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# selected_experts: mindspore.Tensor, -+-# routing_weights: mindspore.Tensor -+-# ) -> mindspore.Tensor: -+-# """ -+-# 【预填充路径】与原始实现一致,结果精确。 -+-# """ -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens, _ = hidden_states.shape -+-# flat_selected_experts = selected_experts.flatten() -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+-# active_experts = ops.unique(flat_selected_experts) -+- -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+-# mask = (flat_selected_experts == expert_idx_tensor) -+-# selected_token_indices = token_indices[mask] -+-# selected_routing_weights = routing_weights.flatten()[mask] -+-# current_states = hidden_states[selected_token_indices] -+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+-# moe_output = moe_output.index_add( -+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+-# ) -+-# return moe_output -+- -+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+- -+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-# router_logits = self.gate(hidden_states_reshaped) -+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+- -+-# if self.norm_topk_prob: -+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- -+-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+-# # 如果模型主体是 float16,后续再转换 -+- -+-# moe_output = None -+-# if not self.training: -+-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+-# # _moe_infer_decode 内部会处理好类型转换 -+-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+-# if sequence_length == 1: -+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+-# else: -+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+-# else: -+-# raise NotImplementedError("Training path is not implemented.") -+- -+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+- -+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+- -+-# return final_hidden_states, router_logits -+- -+- -+-# class Qwen2MoeSparseMoeBlock(nn.Module): -+-# """ -+-# 【融合版】一个混合专家模块,内置两种推理策略, -+-# 由外部全局变量 `Long_Prompt` 控制: -+- -+-# - if Long_Prompt is True: 【精度优先模式】 -+-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+-# 适用于处理长序列,避免误差累积。 -+- -+-# - if Long_Prompt is False: 【速度优先模式】 -+-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+-# 在解码阶段获得极致速度,同时保证结果高度准确。 -+-# """ -+-# def __init__(self, config: Qwen2MoeConfig): -+-# super().__init__() -+-# self.num_experts = config.num_experts -+-# self.top_k = config.num_experts_per_tok -+-# self.norm_topk_prob = config.norm_topk_prob -+- -+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+-# self.experts = nn.ModuleList( -+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+-# ) -+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-# # --- 速度优先模式的辅助函数 --- -+-# @no_grad() -+-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+-# original_dtype = hidden_states.dtype -+-# batch_size, _ = hidden_states.shape -+-# expert_outputs_list = [ -+-# ops.cat([ -+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+-# ], dim=0) -+-# for i in range(batch_size) -+-# ] -+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+-# weights_fp32 = routing_weights.to(mindspore.float32) -+-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+-# return moe_output_fp32.squeeze(1).to(original_dtype) -+- -+-# @no_grad() -+-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens, _ = hidden_states.shape -+-# flat_selected_experts = selected_experts.flatten() -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+-# active_experts = ops.unique(flat_selected_experts) -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+-# mask = (flat_selected_experts == expert_idx_tensor) -+-# selected_token_indices = token_indices[mask] -+-# selected_routing_weights = routing_weights.flatten()[mask] -+-# current_states = hidden_states[selected_token_indices] -+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+-# return moe_output -+- -+-# # --- 精度优先模式的辅助函数 --- -+-# @no_grad() -+-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+-# moe_output = ops.zeros_like(hidden_states) -+-# num_tokens, _ = hidden_states.shape -+-# flat_selected_experts = selected_experts.flatten() -+-# flat_routing_weights = routing_weights.flatten() -+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+-# active_experts = ops.unique(flat_selected_experts) -+-# for expert_idx_tensor in active_experts: -+-# expert_idx = expert_idx_tensor.item() -+-# expert_layer = self.experts[expert_idx] -+-# mask = (flat_selected_experts == expert_idx_tensor) -+-# current_token_indices = token_indices[mask] -+-# current_routing_weights = flat_routing_weights[mask] -+-# current_hidden_states = hidden_states[current_token_indices] -+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+-# return moe_output -+- -+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-# # 声明我们将要使用一个在模块外部定义的全局变量 -+-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+-# global Long_Prompt -+- -+-# # 1. 门控计算 (所有模式通用) -+-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-# router_logits = self.gate(hidden_states_reshaped) -+-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+-# if self.norm_topk_prob: -+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+- -+-# moe_output = None -+-# if not self.training: -+-# # 根据 Long_Prompt 标志选择模式 -+-# if Long_Prompt: -+-# # --- 精度优先模式 --- -+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+-# else: -+-# # --- 速度优先模式 --- -+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+-# if sequence_length == 1: -+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+-# else: -+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+-# else: -+-# raise NotImplementedError("Training path is not implemented.") -+- -+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+- -+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+- -+-# return final_hidden_states, router_logits -+- -+ class Qwen2MoeSparseMoeBlock(nn.Module): -+ """ -+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+ return moe_output_fp32.squeeze(1).to(original_dtype) -+ -++ # @no_grad() -++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ # num_tokens, _ = hidden_states.shape -++ # flat_selected_experts = selected_experts.flatten() -++ # sorted_expert_indices = flat_selected_experts.argsort() -++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++ # original_token_indices = sorted_expert_indices // self.top_k -++ # moe_output = ops.zeros_like(hidden_states) -++ # current_token_offset = 0 -++ # for i in range(self.num_experts): -++ # expert_token_count = tokens_per_expert[i] - current_token_offset -++ # if expert_token_count == 0: -++ # continue -++ # end_offset = current_token_offset + expert_token_count -++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++ # expert_hidden_states = hidden_states[expert_original_token_indices] -++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++ # current_token_offset += expert_token_count -++ # return moe_output -++ -+ @no_grad() -+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+- num_tokens, _ = hidden_states.shape -+- flat_selected_experts = selected_experts.flatten() -+- sorted_expert_indices = flat_selected_experts.argsort() -+- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+- original_token_indices = sorted_expert_indices // self.top_k -++ """ -++ 优化版 MoE prefill (速度优先模式): -++ - 批量张量化处理同一个 expert 的所有 token -++ - 跳过无 token 的专家 -++ - 保持结果完全一致 -++ """ -+ moe_output = ops.zeros_like(hidden_states) -+- current_token_offset = 0 -+- for i in range(self.num_experts): -+- expert_token_count = tokens_per_expert[i] - current_token_offset -+- if expert_token_count == 0: -+- continue -+- end_offset = current_token_offset + expert_token_count -+- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+- expert_hidden_states = hidden_states[expert_original_token_indices] -+- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+- current_token_offset += expert_token_count -++ -++ flat_selected_experts = selected_experts.flatten() -++ flat_routing_weights = routing_weights.flatten() -++ -++ idxs = flat_selected_experts.argsort() -++ sorted_expert_indices = flat_selected_experts[idxs] -++ sorted_token_indices = idxs // self.top_k -++ -++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -++ -++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++ -++ for expert_id in active_experts.tolist(): -++ start = int(tokens_per_expert[:expert_id].sum().item()) -++ end = start + int(tokens_per_expert[expert_id].item()) -++ -++ token_idx = sorted_token_indices[start:end] -++ expert_tokens = hidden_states[token_idx] -++ -++ expert_out = self.experts[expert_id](expert_tokens) -++ -++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -++ -++ moe_output = mindspore.mint.scatter_add( -++ moe_output, -++ 0, -++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -++ scaled_out.to(hidden_states.dtype) -++ ) -++ -+ return moe_output -+ -++ -+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+ @no_grad() -+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+ -+ moe_output = None -+- if Long_Prompt: -+- # --- 精度优先模式 (ACCURACY MODE) --- -+- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # if Long_Prompt==0: -++ # # --- 精度优先模式 (ACCURACY MODE) --- -++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # else: -++ # # --- 速度优先模式 (SPEED MODE) --- -++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++ # if sequence_length == 1: -++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # else: -++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ -++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++ if sequence_length == 1: -++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ else: -+- # --- 速度优先模式 (SPEED MODE) --- -+- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+- if sequence_length == 1: -+- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+- else: -+- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+- -++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ -+ -+ # 3. 共享专家计算与合并 (所有模式通用) -+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ -+ return final_hidden_states, router_logits -+ -++ -+ class Qwen2MoeDecoderLayer(nn.Module): -+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+ super().__init__() -+ self.hidden_size = config.hidden_size -+ -+- # if Long_Prompt: -+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+- # else: -++ # if Long_Prompt == 2: -+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++ # else: -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ -+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+ -+@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+ ) -+ -+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -+- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++ # attention_mask, -++ # sequence_length=sequence_length, -++ # target_length=target_length, -++ # dtype=dtype, -++ # min_dtype=min_dtype, -++ # cache_position=cache_position, -++ # batch_size=input_tensor.shape[0], -++ # ) -++ #@dwj -++ causal_mask = get_cached_causal_mask_with_cache_position( -+ attention_mask, -+ sequence_length=sequence_length, -+ target_length=target_length, -+@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+ """ -+- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -++ _causal_mask_cache.clear() -+ -+ input_ids = kwargs.get("input_ids") -+ if input_ids is None and args: -+@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ -+ if input_ids is not None: -+ prompt_length = input_ids.shape[1] -+- -+- if prompt_length > PROMPT_LENGTH_THRESHOLD: -+- Long_Prompt = True -++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -++ Long_Prompt = 2 -++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -++ Long_Prompt = 0 -+ else: -+- Long_Prompt = False -++ Long_Prompt = 1 -++ -+ -+ return super().generate(*args, **kwargs) -+ -+@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+ dtype = self.lm_head.weight.dtype -+ min_dtype = float(ops.finfo(dtype).min) -+ -+- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++ # attention_mask, -++ # sequence_length=sequence_length, -++ # target_length=past_key_values.get_max_length(), -++ # dtype=dtype, -++ # min_dtype=min_dtype, -++ # cache_position=cache_position, -++ # batch_size=batch_size, -++ # ) -++ -++ #@dwj -++ attention_mask = get_cached_causal_mask_with_cache_position( -+ attention_mask, -+ sequence_length=sequence_length, -+ target_length=past_key_values.get_max_length(), -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+deleted file mode 100644 -+index 6dfb5b93..00000000 -+--- a/patches/0001-20251104commit.patch -++++ /dev/null -+@@ -1,1272 +0,0 @@ -+-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+-From: Pinoeer-kingxi <13022943007@163.com> -+-Date: Tue, 4 Nov 2025 09:11:51 +0800 -+-Subject: [PATCH] 20251104commit -+- -+---- -+- mindnlp/transformers/cache_utils.py | 28 +- -+- .../models/deepseek/modeling_deepseek.py | 149 ++- -+- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+- 3 files changed, 976 insertions(+), 87 deletions(-) -+- -+-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+-index cadd2e04..02f8d4be 100644 -+---- a/mindnlp/transformers/cache_utils.py -+-+++ b/mindnlp/transformers/cache_utils.py -+-@@ -812,14 +812,26 @@ class StaticCache(Cache): -+- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+- # k_out[:, :, cache_position] = key_states -+- # v_out[:, :, cache_position] = value_states -+-- if ON_ORANGE_PI: -+-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+-- else: -+-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+-- -+-+ # if ON_ORANGE_PI: -+-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+-+ # else: -+-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+-+ # 确保 cache_position 是 1D tensor 并且类型正确 -+-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+-+ if cache_position.ndim > 1: -+-+ cache_position = cache_position.flatten() -+-+ # 确保类型是 int32 或 int64(MindSpore 要求) -+-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+-+ cache_position = cache_position.int() -+-+ -+-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+-+ k_out[:, :, cache_position] = key_states -+-+ v_out[:, :, cache_position] = value_states -+-+ -+- return k_out, v_out -+- -+- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+-index c695b944..d8303e45 100644 -+---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+- # Copied from transformers.models.llama.modeling_llama.rotate_half -+- def rotate_half(x): -+- """Rotates half the hidden dims of the input.""" -+-- x1 = x[..., : x.shape[-1] // 2] -+-- x2 = x[..., x.shape[-1] // 2 :] -+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+-+ # x1 = x[..., : x.shape[-1] // 2] -+-+ # x2 = x[..., x.shape[-1] // 2 :] -+-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+- return ops.cat((-x2, x1), dim=-1) -+- -+- -+-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+- if self.training: -+- raise NotImplementedError("Training is not supported yet.") -+- else: -+-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+-- if self.config.n_shared_experts is not None: -+-- y = y + self.shared_experts(identity) -+-- return y -+-+ # @lwx -+-+ if orig_shape[1] == 1: -+-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+-+ y=y.view(*orig_shape) -+-+ if self.config.n_shared_experts is not None: -+-+ y = y + self.shared_experts(identity) -+-+ return y -+-+ else: -+-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+-+ if self.config.n_shared_experts is not None: -+-+ y = y + self.shared_experts(identity) -+-+ return y -+-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+-+ # if self.config.n_shared_experts is not None: -+-+ # y = y + self.shared_experts(identity) -+-+ # return y -+-+ -+-+ @no_grad() -+-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+-+ -+-+ expert_cache = ops.zeros_like(x) -+-+ for i in range(self.num_experts_per_tok): -+-+ expert_id = flat_expert_indices[i].item() -+-+ weight = flat_expert_weights[i].item() -+-+ expert = self.experts[expert_id] -+-+ expert_out = expert(x) -+-+ expert_cache += expert_out * weight -+-+ return expert_cache -+- -+- @no_grad() -+-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+-- # expert_cache = torch.zeros_like(x) -+-- # idxs = flat_expert_indices.argsort() -+-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+-- # token_idxs = idxs // self.num_experts_per_tok -+-- # for i, end_idx in enumerate(tokens_per_expert): -+-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+-- # if start_idx == end_idx: -+-- # continue -+-- # expert = self.experts[i] -+-- # exp_token_idx = token_idxs[start_idx:end_idx] -+-- # expert_tokens = x[exp_token_idx] -+-- # expert_out = expert(expert_tokens) -+-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+-- # return expert_cache -+-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+- expert_cache = ops.zeros_like(x) -+- idxs = flat_expert_indices.argsort() -+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+- token_idxs = idxs // self.num_experts_per_tok -+-+ -+- for i, end_idx in enumerate(tokens_per_expert): -+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+- if start_idx == end_idx: -+-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+- expert_out = expert(expert_tokens) -+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+-+ -+- return expert_cache -+-+ -+-+ # @no_grad() -+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+-+ # # expert_cache = torch.zeros_like(x) -+-+ # # idxs = flat_expert_indices.argsort() -+-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+-+ # # token_idxs = idxs // self.num_experts_per_tok -+-+ # # for i, end_idx in enumerate(tokens_per_expert): -+-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+-+ # # if start_idx == end_idx: -+-+ # # continue -+-+ # # expert = self.experts[i] -+-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -+-+ # # expert_tokens = x[exp_token_idx] -+-+ # # expert_out = expert(expert_tokens) -+-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+-+ # # return expert_cache -+-+ # expert_cache = ops.zeros_like(x) -+-+ # idxs = flat_expert_indices.argsort() -+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+-+ # token_idxs = idxs // self.num_experts_per_tok -+-+ -+-+ # for i, end_idx in enumerate(tokens_per_expert): -+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+-+ # if start_idx == end_idx: -+-+ # continue -+-+ # expert = self.experts[i] -+-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+-+ # expert_tokens = x[exp_token_idx] -+-+ # expert_out = expert(expert_tokens) -+-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+-+ -+-+ # return expert_cache -+-+ # @no_grad() -+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+-+ # expert_cache = ops.zeros_like(x) -+-+ -+-+ # # 排序保证顺序一致 -+-+ # idxs = flat_expert_indices.argsort() -+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+-+ # token_idxs = idxs // self.num_experts_per_tok -+-+ -+-+ # # 找出有 token 的专家 -+-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+-+ -+-+ # for i in active_experts.tolist(): -+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+-+ # end_idx = tokens_per_expert[i] -+-+ # if start_idx == end_idx: # 没有 token -+-+ # continue -+-+ -+-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+-+ # expert_tokens = x[exp_token_idx] -+-+ # expert_out = self.experts[i](expert_tokens) -+-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+-+ -+-+ # expert_cache = mindspore.mint.scatter_add( -+-+ # expert_cache, -+-+ # 0, -+-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+-+ # expert_out -+-+ # ) -+-+ -+-+ # return expert_cache -+-+ -+-+ -+- -+- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+- # """ -+-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+- -+- # Initialize weights and apply final processing -+- self.post_init() -+-+ self.warm_up = False -+-+ -+-+ def warmup_moe_model_deep(self): -+-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+-+ test_texts = [ -+-+ "warmup short", -+-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+-+ ] -+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+-+ if tokenizer is None: -+-+ from mindnlp.transformers import AutoTokenizer -+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+-+ self._warmup_tokenizer = tokenizer -+-+ -+-+ for text in test_texts: -+-+ inputs = tokenizer(text, return_tensors="ms") -+-+ with mindspore._no_grad(): -+-+ _ = self(**inputs, use_cache=False) -+-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+- -+- def get_input_embeddings(self): -+- return self.model.embed_tokens -+-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+- ```""" -+-+ if not self.warm_up: -+-+ self.warm_up = True -+-+ self.warmup_moe_model_deep() -+-+ -+- output_attentions = ( -+- output_attentions -+- if output_attentions is not None -+-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+-index 3cbf820e..d4c6b651 100644 -+---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+-@@ -18,7 +18,6 @@ -+- # See the License for the specific language governing permissions and -+- # limitations under the License. -+- """MindSpore Qwen2MoE model.""" -+-- -+- import math -+- from typing import List, Optional, Tuple, Union -+- -+-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+- TokenClassifierOutput, -+- ) -+- from ...modeling_utils import PreTrainedModel -+-+from ...generation import GenerationMixin -+- from ....utils import logging -+- from .configuration_qwen2_moe import Qwen2MoeConfig -+- -+-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+- self.variance_epsilon = eps -+- -+- def forward(self, hidden_states): -+-+ # @dwj -+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+-+ # @lwx -+-+ # if not self.training : -+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+- input_dtype = hidden_states.dtype -+- hidden_states = hidden_states.to(mindspore.float32) -+- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+-@@ -234,6 +239,8 @@ def rotate_half(x): -+- """Rotates half the hidden dims of the input.""" -+- x1 = x[..., : x.shape[-1] // 2] -+- x2 = x[..., x.shape[-1] // 2 :] -+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+- return ops.cat((-x2, x1), dim=-1) -+- -+- -+-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+- self.config = config -+- self.hidden_size = config.hidden_size -+- self.intermediate_size = intermediate_size -+-+ -+- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+- self.act_fn = ACT2FN[config.hidden_act] -+- -+- def forward(self, x): -+-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+-- -+- -+-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+-+ # @lwx -+-+ # gate_up_output = self.gate_up_proj(x) -+-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+-+ # return self.down_proj(swiglu_output) -+-+ -+-+ # def forward(self, x): -+-+ # gate_proj_out = self.gate_proj(x) -+-+ # up_proj_out = self.up_proj(x) -+-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+-+ # return self.down_proj(swiglu_out) -+-+ -+- # Copied from transformers.models.llama.modeling_llama.repeat_kv -+- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+- """ -+-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+- use_cache: bool = False, -+- cache_position: Optional[mindspore.Tensor] = None, -+- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+-+ -+-+ -+-+ -+- bsz, q_len, _ = hidden_states.shape -+- -+- query_states = self.q_proj(hidden_states) -+-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+- "with a layer index." -+- ) -+-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+-+ if isinstance(past_key_value, StaticCache): -+-+ kv_seq_len = key_states.shape[-2] -+-+ else: -+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+- -+- if past_key_value is not None: -+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+-+ -+-+ if isinstance(past_key_value, StaticCache): -+-+ kv_seq_len = key_states.shape[-2] -+- -+- # repeat k/v heads if n_kv_heads < n_heads -+- key_states = repeat_kv(key_states, self.num_key_value_groups) -+- value_states = repeat_kv(value_states, self.num_key_value_groups) -+-- -+-+ -+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+- -+-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+-- raise ValueError( -+-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+-- f" {attn_weights.shape}" -+-- ) -+-- -+-- if attention_mask is not None: # no matter the length, we just slice it -+-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+-+ if attention_mask is not None: -+-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+- attn_weights = attn_weights + causal_mask -+- -+- # upcast attention to fp32 -+-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+- -+- attn_output = self.o_proj(attn_output) -+-- -+-+ # @lwx -+-+ -+-+ # max_seq_len = self.max_position_embeddings # 2048 -+-+ -+-+ # if attention_mask is not None: -+-+ # # attention_mask: [B, 1, Sq, Sk] -+-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+-+ -+-+ # # pad 到 [max_seq_len, max_seq_len] -+-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+-+ # global_attention_mask = padded_mask -+-+ # else: -+-+ # global_attention_mask = None -+-+ -+-+ -+-+ # sparse_mode=3 -+-+ # attn_output = mindspore.ops.flash_attention_score( -+-+ # query=query_states, -+-+ # key=key_states, -+-+ # value=value_states, -+-+ # real_shift=None, -+-+ # padding_mask=None, -+-+ -+-+ # head_num=self.num_heads, -+-+ # attn_mask=global_attention_mask, -+-+ # keep_prob=1.0 - self.attention_dropout, -+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+-+ # input_layout="BNSD", -+-+ # pre_tokens=2147483647, -+-+ # next_tokens=2147483647, -+-+ # inner_precise=0, -+-+ # drop_mask=None, -+-+ # prefix=None, -+-+ # actual_seq_qlen=None, -+-+ # actual_seq_kvlen=None, -+-+ # sparse_mode=sparse_mode, -+-+ # ) -+- if not output_attentions: -+- attn_weights = None -+- -+- return attn_output, attn_weights, past_key_value -+- -+- -+-+class Qwen2MoeFlashAttention(nn.Module): -+-+ """ -+-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+-+ -+-+ 关键改动: -+-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+-+ 直接传入原始的 key 和 value 张量效率更高。 -+-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+-+ """ -+-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+-+ super().__init__() -+-+ self.config = config -+-+ self.layer_idx = layer_idx -+-+ self.hidden_size = config.hidden_size -+-+ self.num_heads = config.num_attention_heads -+-+ self.head_dim = self.hidden_size // self.num_heads -+-+ self.num_key_value_heads = config.num_key_value_heads -+-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+-+ self.max_position_embeddings = config.max_position_embeddings -+-+ self.rope_theta = config.rope_theta -+-+ self.attention_dropout = config.attention_dropout -+-+ -+-+ if (self.head_dim * self.num_heads) != self.hidden_size: -+-+ raise ValueError( -+-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+-+ ) -+-+ -+-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+-+ -+-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+-+ self.head_dim, -+-+ max_position_embeddings=self.max_position_embeddings, -+-+ base=self.rope_theta, -+-+ ) -+-+ -+-+ def forward( -+-+ self, -+-+ hidden_states: mindspore.Tensor, -+-+ attention_mask: Optional[mindspore.Tensor] = None, -+-+ position_ids: Optional[mindspore.Tensor] = None, -+-+ past_key_value: Optional[Cache] = None, -+-+ output_attentions: bool = False, -+-+ use_cache: bool = False, -+-+ cache_position: Optional[mindspore.Tensor] = None, -+-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+-+ -+-+ bsz, q_len, _ = hidden_states.shape -+-+ -+-+ # 1. 线性投射 Q, K, V -+-+ query_states = self.q_proj(hidden_states) -+-+ key_states = self.k_proj(hidden_states) -+-+ value_states = self.v_proj(hidden_states) -+-+ -+-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+-+ # query: [B, S, H*D] -> [B, N1, S, D] -+-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ -+-+ # 3. RoPE 旋转位置编码 -+-+ kv_seq_len = key_states.shape[-2] -+-+ if past_key_value is not None: -+-+ if self.layer_idx is None: -+-+ raise ValueError( -+-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+-+ "with a layer index." -+-+ ) -+-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -+-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+-+ if cache_position.shape[0] == 1: -+-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+-+ kv_seq_len = past_seen_tokens + 1 -+-+ else: -+-+ # prefill 阶段:cache_position 是范围,使用其长度 -+-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+-+ else: -+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+-+ -+-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+-+ -+-+ # 4. KV 缓存更新 -+-+ if past_key_value is not None: -+-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+-+ key_states, value_states = past_key_value.update( -+-+ key_states, value_states, self.layer_idx, cache_kwargs -+-+ ) -+-+ -+-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+-+ if cache_position.shape[0] == 1: -+-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+-+ kv_seq_len = key_states.shape[-2] -+-+ -+-+ # 5. [重要] 准备 Attention Mask -+-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+-+ fa_attention_mask = None -+-+ if attention_mask is not None: -+-+ # 截取与当前key长度匹配的部分 -+-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -+-+ fa_attention_mask = (mask_slice != 0) -+-+ -+-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+-+ input_dtype = query_states.dtype -+-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+-+ query_states = query_states.to(mindspore.float16) -+-+ key_states = key_states.to(mindspore.float16) -+-+ value_states = value_states.to(mindspore.float16) -+-+ -+-+ # 6. [核心] 调用 flash_attention_score 算子 -+-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -+-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+-+ attn_output = mindspore.ops.flash_attention_score( -+-+ query=query_states, -+-+ key=key_states, -+-+ value=value_states, -+-+ head_num=self.num_heads, # 传入Q的头数(N1) -+-+ attn_mask=fa_attention_mask, -+-+ keep_prob=1.0 - self.attention_dropout, -+-+ scalar_value=1.0 / math.sqrt(self.head_dim), -+-+ input_layout="BNSD", -+-+ sparse_mode=0 # 使用 defaultMask 模式 -+-+ ) -+-+ -+-+ # 恢复原始数据类型 -+-+ attn_output = attn_output.to(input_dtype) -+-+ -+-+ # 7. 调整输出形状 -+-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+-+ attn_output = self.o_proj(attn_output) -+-+ -+-+ # FlashAttention 算子不直接返回注意力权重矩阵 -+-+ attn_weights = None -+-+ if output_attentions: -+-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+-+ -+-+ return attn_output, attn_weights, past_key_value -+-+ -+-+ # def forward( -+-+ # self, -+-+ # hidden_states: mindspore.Tensor, -+-+ # attention_mask: Optional[mindspore.Tensor] = None, -+-+ # position_ids: Optional[mindspore.Tensor] = None, -+-+ # past_key_value: Optional[Cache] = None, -+-+ # output_attentions: bool = False, -+-+ # use_cache: bool = False, -+-+ # cache_position: Optional[mindspore.Tensor] = None, -+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+-+ -+-+ # bsz, q_len, _ = hidden_states.shape -+-+ -+-+ # # 1. 线性投射 Q, K, V -+-+ # query_states = self.q_proj(hidden_states) -+-+ # key_states = self.k_proj(hidden_states) -+-+ # value_states = self.v_proj(hidden_states) -+-+ -+-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ -+-+ # # 3. RoPE 旋转位置编码 -+-+ # kv_seq_len = key_states.shape[-2] -+-+ # if past_key_value is not None: -+-+ # if self.layer_idx is None: -+-+ # raise ValueError( -+-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+-+ # "with a layer index." -+-+ # ) -+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+-+ -+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+-+ -+-+ # # 4. KV 缓存更新 -+-+ # if past_key_value is not None: -+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+-+ # key_states, value_states = past_key_value.update( -+-+ # key_states, value_states, self.layer_idx, cache_kwargs -+-+ # ) -+-+ -+-+ # # 5. 准备 Attention Mask -+-+ # fa_attention_mask = None -+-+ # if attention_mask is not None: -+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+-+ # fa_attention_mask = (mask_slice != 0) -+-+ -+-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+-+ # input_dtype = query_states.dtype -+-+ -+-+ # # 6. [核心] 调用 flash_attention_score 算子 -+-+ # attn_output = mindspore.ops.flash_attention_score( -+-+ # query=query_states, -+-+ # key=key_states, -+-+ # value=value_states, -+-+ # head_num=self.num_heads, -+-+ # attn_mask=fa_attention_mask, -+-+ # keep_prob=1.0 - self.attention_dropout, -+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+-+ # input_layout="BNSD", -+-+ # sparse_mode=0, -+-+ # # <--- 修改点 2: 启用内部高精度计算 --- -+-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+-+ # inner_precise=1 -+-+ # ) -+-+ -+-+ # # 恢复原始数据类型 -+-+ # attn_output = attn_output.to(input_dtype) -+-+ -+-+ # # 7. 调整输出形状 -+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+-+ # attn_output = self.o_proj(attn_output) -+-+ -+-+ # attn_weights = None -+-+ # if output_attentions: -+-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+-+ -+-+ # return attn_output, attn_weights, past_key_value -+-+ -+-+ # def forward( -+-+ # self, -+-+ # hidden_states: mindspore.Tensor, -+-+ # attention_mask: Optional[mindspore.Tensor] = None, -+-+ # position_ids: Optional[mindspore.Tensor] = None, -+-+ # past_key_value: Optional[Cache] = None, -+-+ # output_attentions: bool = False, -+-+ # use_cache: bool = False, -+-+ # cache_position: Optional[mindspore.Tensor] = None, -+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+-+ -+-+ # bsz, q_len, _ = hidden_states.shape -+-+ -+-+ # query_states = self.q_proj(hidden_states) -+-+ # key_states = self.k_proj(hidden_states) -+-+ # value_states = self.v_proj(hidden_states) -+-+ -+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-+ -+-+ # kv_seq_len = key_states.shape[-2] -+-+ # if past_key_value is not None: -+-+ # if self.layer_idx is None: -+-+ # raise ValueError("`layer_idx` must be specified for caching") -+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+-+ -+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+-+ -+-+ # if past_key_value is not None: -+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+-+ # key_states, value_states = past_key_value.update( -+-+ # key_states, value_states, self.layer_idx, cache_kwargs -+-+ # ) -+-+ -+-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+-+ -+-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -+-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+-+ # query_states = query_states / math.sqrt(self.head_dim) -+-+ # # <--- 修改结束 --- -+-+ -+-+ # fa_attention_mask = None -+-+ # if attention_mask is not None: -+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+-+ # fa_attention_mask = (mask_slice != 0) -+-+ -+-+ # input_dtype = query_states.dtype -+-+ -+-+ # attn_output = mindspore.ops.flash_attention_score( -+-+ # query=query_states, # 传入已经预先缩放过的 query -+-+ # key=key_states, -+-+ # value=value_states, -+-+ # head_num=self.num_heads, -+-+ # attn_mask=fa_attention_mask, -+-+ # keep_prob=1.0 - self.attention_dropout, -+-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+-+ # input_layout="BNSD", -+-+ # sparse_mode=0, -+-+ # inner_precise=1 # 仍然保持内部高精度计算 -+-+ # ) -+-+ -+-+ # attn_output = attn_output.to(input_dtype) -+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+-+ # attn_output = self.o_proj(attn_output) -+-+ -+-+ # attn_weights = None -+-+ # if output_attentions: -+-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+-+ -+-+ # return attn_output, attn_weights, past_key_value -+-+ -+- QWEN2MOE_ATTENTION_CLASSES = { -+- "eager": Qwen2MoeAttention, -+-+ "flash-attention": Qwen2MoeFlashAttention, -+- } -+- -+- -+-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+- -+-+ #@dwj -+-+ # 只遍历激活的专家,而非全部专家 -+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+-- batch_size, sequence_length, hidden_dim = hidden_states.shape -+-- hidden_states = hidden_states.view(-1, hidden_dim) -+-- # router_logits: (batch * sequence_length, n_experts) -+-- router_logits = self.gate(hidden_states) -+-- -+-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+-- if self.norm_topk_prob: -+-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+-- # we cast back to the input dtype -+-- routing_weights = routing_weights.to(hidden_states.dtype) -+-- -+-- final_hidden_states = ops.zeros( -+-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+-- ) -+-- -+-- # One hot encode the selected experts to create an expert mask -+-- # this will be used to easily index which expert is going to be sollicitated -+-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+-- -+-- # Loop over all available experts in the model and perform the computation on each expert -+-- for expert_idx in range(self.num_experts): -+-- expert_layer = self.experts[expert_idx] -+-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+-- -+-- # Index the correct hidden states and compute the expert hidden state for -+-- # the current expert. We need to make sure to multiply the output hidden -+-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+-- if 0 not in idx.shape: -+-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+-- -+-- # However `index_add_` only support torch tensors for indexing so we'll use -+-- # the `top_x` tensor here. -+-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+-- -+-- shared_expert_output = self.shared_expert(hidden_states) -+-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+-- -+-- final_hidden_states = final_hidden_states + shared_expert_output -+-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -+-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+-+ num_tokens = hidden_states_reshaped.shape[0] -+-+ -+-+ router_logits = self.gate(hidden_states_reshaped) -+-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+-+ -+-+ if self.norm_topk_prob: -+-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+-+ routing_weights = routing_weights.to(hidden_states.dtype) -+-+ -+-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+-+ flat_selected_experts = selected_experts.flatten() -+-+ -+-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+-+ token_indices = broadcasted_token_indices.flatten() -+-+ -+-+ active_experts = ops.unique(flat_selected_experts) -+-+ -+-+ for expert_idx_tensor in active_experts: -+-+ expert_idx = expert_idx_tensor.item() -+-+ expert_layer = self.experts[expert_idx] -+-+ -+-+ mask = (flat_selected_experts == expert_idx_tensor) -+-+ selected_token_indices = token_indices[mask] -+-+ selected_routing_weights = routing_weights.flatten()[mask] -+-+ -+-+ current_states = hidden_states_reshaped[selected_token_indices] -+-+ -+-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+-+ -+-+ final_hidden_states = final_hidden_states.index_add( -+-+ dim=0, -+-+ index=selected_token_indices, -+-+ source=expert_output.to(hidden_states.dtype) -+-+ ) -+-+ -+-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+- -+-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+-- return final_hidden_states, router_logits -+-+ final_hidden_states = final_hidden_states + shared_expert_output -+-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+-+ -+-+ return final_hidden_states, router_logits -+- -+- -+- class Qwen2MoeDecoderLayer(nn.Module): -+-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+- -+- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+- -+-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+-+ -+- if (layer_idx not in config.mlp_only_layers) and ( -+- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+- ): -+-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+- _no_split_modules = ["Qwen2MoeDecoderLayer"] -+- _skip_keys_device_placement = "past_key_values" -+- _supports_cache_class = True -+-+#lwx -+-+ # _supports_static_cache = True -+- -+- def _init_weights(self, module): -+- std = self.config.initializer_range -+-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+- return causal_mask -+- -+- -+--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+- _tied_weights_keys = ["lm_head.weight"] -+- -+- def __init__(self, config): -+-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+- self.num_experts_per_tok = config.num_experts_per_tok -+- # Initialize weights and apply final processing -+- self.post_init() -+-+ # @lwx -+-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+-+ # self.generation_config.cache_implementation = "static" -+-+ self._warmed_up = False -+-+ -+-+ def warmup_moe_model(self): -+-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -+-+ test_texts = [ -+-+ "warmup short", -+-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+-+ ] -+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+-+ if tokenizer is None: -+-+ from mindnlp.transformers import AutoTokenizer -+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+-+ self._warmup_tokenizer = tokenizer -+-+ -+-+ for text in test_texts: -+-+ inputs = tokenizer(text, return_tensors="ms") -+-+ with mindspore._no_grad(): -+-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -+-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -+- -+- def get_input_embeddings(self): -+- return self.model.embed_tokens -+-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+- ```""" -+-+ if not self._warmed_up: -+-+ self._warmed_up = True -+-+ self.warmup_moe_model() -+- -+- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+- output_router_logits = ( -+-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+- } -+- ) -+- return model_inputs -+-+# @lwx -+-+ # def _decode_one_tokens_logits( -+-+ # self, -+-+ # cur_token: mindspore.Tensor, -+-+ # input_pos: Optional[mindspore.Tensor], -+-+ # cache_position: mindspore.Tensor, -+-+ # past_key_values: StaticCache, -+-+ # ) -> mindspore.Tensor: -+-+ # """ -+-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+-+ -+-+ # Args: -+-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+-+ # input_pos: 输入位置信息,可选 -+-+ # cache_position: 当前token在cache中的位置,shape为(1,) -+-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -+-+ -+-+ # Returns: -+-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+-+ # """ -+-+ # # 调用JIT编译的版本 -+-+ # return self.get_decode_one_tokens_logits( -+-+ # cur_token=cur_token, -+-+ # input_pos=input_pos, -+-+ # cache_position=cache_position, -+-+ # past_key_values=past_key_values, -+-+ # ) -+-+ -+-+ # @mindspore.jit(jit_level='O1') -+-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+-+ # """ -+-+ # JIT编译的函数,用于高效的单token解码 -+-+ # 使用JIT编译优化以支持静态shape和高效执行 -+-+ -+-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+-+ # """ -+-+ # outputs = self.model.forward( -+-+ # input_ids=cur_token, -+-+ # position_ids=input_pos, -+-+ # cache_position=cache_position, -+-+ # past_key_values=past_key_values, -+-+ # use_cache=True, -+-+ # return_dict=False, -+-+ # ) -+-+ -+-+ # hidden_states = outputs[0] -+-+ # logits = self.lm_head.forward(hidden_states) -+-+ # logits = logits.float() -+-+ -+-+ # return logits[:, -1, :] -+-+ -+-+ # def _sample( -+-+ # self, -+-+ # input_ids: mindspore.Tensor, -+-+ # logits_processor, -+-+ # stopping_criteria, -+-+ # generation_config, -+-+ # synced_devices: bool, -+-+ # streamer=None, -+-+ # logits_warper=None, -+-+ # **model_kwargs, -+-+ # ): -+-+ # """ -+-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+-+ # """ -+-+ # from ...generation.logits_process import LogitsProcessorList -+-+ # from ...generation.stopping_criteria import StoppingCriteriaList -+-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+-+ # from mindnlp.core import nn, ops, no_grad -+-+ # import numpy as np -+-+ -+-+ # # 检查是否使用 StaticCache -+-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+-+ # # 否则,直接调用父类方法 -+-+ # past_key_values = model_kwargs.get("past_key_values") -+-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+-+ -+-+ # if not isinstance(past_key_values, StaticCache): -+-+ # # 不使用 StaticCache,直接调用父类方法 -+-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+-+ # return super()._sample( -+-+ # input_ids=input_ids, -+-+ # logits_processor=logits_processor, -+-+ # stopping_criteria=stopping_criteria, -+-+ # generation_config=generation_config, -+-+ # synced_devices=synced_devices, -+-+ # streamer=streamer, -+-+ # logits_warper=logits_warper, -+-+ # **model_kwargs, -+-+ # ) -+-+ -+-+ # # 使用 StaticCache,进入自定义循环 -+-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+-+ # pad_token_id = generation_config._pad_token_tensor -+-+ # output_attentions = generation_config.output_attentions -+-+ # output_hidden_states = generation_config.output_hidden_states -+-+ # output_scores = generation_config.output_scores -+-+ # output_logits = generation_config.output_logits -+-+ # return_dict_in_generate = generation_config.return_dict_in_generate -+-+ # max_length = generation_config.max_length -+-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+-+ # do_sample = generation_config.do_sample -+-+ -+-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+-+ # raise ValueError( -+-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+-+ # f"{logits_warper})." -+-+ # ) -+-+ -+-+ # # init attention / hidden states / scores tuples -+-+ # scores = () if (return_dict_in_generate and output_scores) else None -+-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+-+ -+-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -+-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+-+ # encoder_hidden_states = ( -+-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+-+ # ) -+-+ -+-+ # # keep track of which sequences are already finished -+-+ # batch_size, cur_len = input_ids.shape -+-+ # this_peer_finished = False -+-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+-+ -+-+ # time_record = [] -+-+ # from ....utils.testing_utils import parse_flag_from_env -+-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+-+ -+-+ # while self._has_unfinished_sequences( -+-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+-+ # ): -+-+ # if _record_time: -+-+ # import time as time_module -+-+ # infer_start = time_module.time() -+-+ -+-+ # # prepare model inputs -+-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+-+ -+-+ # # prepare variable output controls -+-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+-+ -+-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+-+ # cur_cache_position = model_inputs.get("cache_position") -+-+ # cur_past_key_values = model_inputs.get("past_key_values") -+-+ # cur_input_ids = model_inputs.get("input_ids") -+-+ -+-+ # if (isinstance(cur_past_key_values, StaticCache) and -+-+ # cur_cache_position is not None and -+-+ # len(cur_cache_position.shape) > 0 and -+-+ # cur_cache_position.shape[0] == 1 and -+-+ # cur_input_ids is not None and -+-+ # cur_input_ids.shape[1] == 1): -+-+ # # 使用 JIT 优化的单 token 解码 -+-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+-+ # if not hasattr(self, '_jit_used'): -+-+ # self._jit_used = False -+-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+-+ -+-+ # next_token_logits = self.get_decode_one_tokens_logits( -+-+ # cur_token=cur_input_ids, -+-+ # input_pos=model_inputs.get("position_ids"), -+-+ # cache_position=cur_cache_position, -+-+ # past_key_values=cur_past_key_values, -+-+ # ) -+-+ -+-+ # # 标记已使用JIT(用于后续判断) -+-+ # if not self._jit_used: -+-+ # self._jit_used = True -+-+ -+-+ # # 构造兼容的输出对象 -+-+ # class JitOptimizedOutput: -+-+ # def __init__(self, logits, config): -+-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+-+ # self.config = config -+-+ # # 对于 JIT 优化路径,这些属性通常不需要 -+-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -+-+ # self.attentions = None if not config.is_encoder_decoder else None -+-+ # self.cross_attentions = None -+-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+-+ # self.hidden_states = None if not config.is_encoder_decoder else None -+-+ -+-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+-+ # else: -+-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+-+ # outputs = self(**model_inputs, return_dict=True) -+-+ -+-+ # if synced_devices and this_peer_finished: -+-+ # continue -+-+ -+-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+-+ # next_token_logits = outputs.logits[:, -1, :] -+-+ -+-+ # # pre-process distribution -+-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -+-+ # if do_sample: -+-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -+-+ -+-+ # # Store scores, attentions and hidden_states when required -+-+ # if return_dict_in_generate: -+-+ # if output_scores: -+-+ # scores += (next_token_scores,) -+-+ # if output_logits: -+-+ # raw_logits += (next_token_logits,) -+-+ # if output_attentions: -+-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+-+ # decoder_attentions += (attn,) if attn is not None else (None,) -+-+ # if self.config.is_encoder_decoder: -+-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+-+ -+-+ # if output_hidden_states: -+-+ # hidden = ( -+-+ # outputs.decoder_hidden_states -+-+ # if self.config.is_encoder_decoder -+-+ # else outputs.hidden_states -+-+ # ) -+-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+-+ -+-+ # # token selection -+-+ # if do_sample: -+-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+-+ # else: -+-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+-+ -+-+ # # finished sentences should have their next token be a padding token -+-+ # if has_eos_stopping_criteria: -+-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+-+ -+-+ # # update generated ids, model inputs, and length for next step -+-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+-+ # if streamer is not None: -+-+ # streamer.put(next_tokens) -+-+ -+-+ # model_kwargs = self._update_model_kwargs_for_generation( -+-+ # outputs, -+-+ # model_kwargs, -+-+ # is_encoder_decoder=self.config.is_encoder_decoder, -+-+ # ) -+-+ -+-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+-+ # cur_len += 1 -+-+ -+-+ # if _record_time: -+-+ # import time as time_module -+-+ # infer_stop = time_module.time() -+-+ # time_record.append(infer_stop - infer_start) -+-+ -+-+ # del outputs -+-+ -+-+ # average_infer_time = None -+-+ # if time_record: -+-+ # if len(time_record) > 1: -+-+ # time_record.pop(0) -+-+ # average_infer_time = sum(time_record) / len(time_record) -+-+ # print(f'average inference time is: {average_infer_time}') -+-+ # print(f'inference time record: {time_record}') -+-+ -+-+ # if streamer is not None: -+-+ # streamer.end() -+-+ -+-+ # # 简单判断:打印是否使用了JIT路径 -+-+ # if hasattr(self, '_jit_used') and self._jit_used: -+-+ # print("[JIT] ✓ JIT optimization was used during generation") -+-+ # else: -+-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+-+ -+-+ # if return_dict_in_generate: -+-+ # if self.config.is_encoder_decoder: -+-+ # return GenerateEncoderDecoderOutput( -+-+ # sequences=input_ids, -+-+ # scores=scores, -+-+ # logits=raw_logits, -+-+ # encoder_attentions=encoder_attentions, -+-+ # encoder_hidden_states=encoder_hidden_states, -+-+ # decoder_attentions=decoder_attentions, -+-+ # cross_attentions=cross_attentions, -+-+ # decoder_hidden_states=decoder_hidden_states, -+-+ # past_key_values=model_kwargs.get("past_key_values"), -+-+ # average_infer_time=average_infer_time -+-+ # ) -+-+ # else: -+-+ # return GenerateDecoderOnlyOutput( -+-+ # sequences=input_ids, -+-+ # scores=scores, -+-+ # logits=raw_logits, -+-+ # attentions=decoder_attentions, -+-+ # hidden_states=decoder_hidden_states, -+-+ # past_key_values=model_kwargs.get("past_key_values"), -+-+ # average_infer_time=average_infer_time -+-+ # ) -+-+ # else: -+-+ # return input_ids -+-+ -+-+ # def _prepare_cache_for_generation( -+-+ # self, -+-+ # generation_config, -+-+ # model_kwargs, -+-+ # assistant_model, -+-+ # batch_size, -+-+ # max_cache_length, -+-+ # ): -+-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -+-+ # generation_config.cache_implementation = "static" -+-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+-+ -+-+ # if generation_config.cache_implementation == "static": -+-+ # base_required_from_max_length = generation_config.max_length + 1 -+-+ # base_required = max(max_cache_length, base_required_from_max_length) -+-+ # min_cache_size = 50 -+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+-+ # else: -+-+ # max_cache_length = max(base_required, min_cache_size) -+-+ -+-+ # original_max_cache_length = max_cache_length -+-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -+-+ # print(f" - input max_cache_length: {original_max_cache_length}") -+-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -+-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+-+ # print(f" - final max_cache_length: {max_cache_length}") -+-+ -+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+-+ # if max_cache_length > self.config.max_position_embeddings: -+-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+-+ -+-+ # result = super()._prepare_cache_for_generation( -+-+ # generation_config=generation_config, -+-+ # model_kwargs=model_kwargs, -+-+ # assistant_model=assistant_model, -+-+ # batch_size=batch_size, -+-+ # max_cache_length=max_cache_length, -+-+ # ) -+-+ -+-+ # if generation_config.cache_implementation == "static": -+-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+-+ # created_cache = model_kwargs.get(cache_name) -+-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+-+ # if created_cache.max_cache_len < generation_config.max_length: -+-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+-+ -+-+ # return result -+-+ -+-+ -+-+ -+- -+- -+- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+--- -+-2.27.0 -+- -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" deleted file mode 100644 index bbe6df27..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0005-20251107001commit.patch" +++ /dev/null @@ -1,7707 +0,0 @@ -From ab47c0478530d34d2b48200af0453dda94d1ec18 Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Fri, 7 Nov 2025 11:48:18 +0800 -Subject: [PATCH 05/10] 20251107001commit - ---- - .../models/deepseek/modeling_deepseek.py | 91 +- - .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- - .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- - patches/0001-20251104commit.patch | 2 +- - patches/0002-20251106commit.patch | 2 +- - patches/0003-20261106secondcommit.patch | 2 +- - patches/0004-20251106change.patch | 7498 +++++++++++++++++ - 7 files changed, 7577 insertions(+), 30 deletions(-) - create mode 100644 patches/0004-20251106change.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 0546f318..8831e4b7 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): - # expert_cache += expert_out * weight - # return expert_cache - -- @no_grad() -- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- # x 的 shape: (1, hidden_size) -- # flat_expert_indices 的 shape: (num_experts_per_tok,) -- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -- -- # 1. 收集所有需要的专家层 -- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -- selected_experts = [self.experts[i] for i in flat_expert_indices] -- -- # 2. 并行计算所有专家的输出 -- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -- # ops.cat 会将它们堆叠成一个新的 Tensor -- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -- -- # 3. 使用矩阵乘法进行加权求和 -- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- # 最终结果 final_output 的 shape: (1, hidden_size) -- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+ # @no_grad() -+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ # # x 的 shape: (1, hidden_size) -+ # # flat_expert_indices 的 shape: (num_experts_per_tok,) -+ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+ -+ # # 1. 收集所有需要的专家层 -+ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+ # selected_experts = [self.experts[i] for i in flat_expert_indices] -+ -+ # # 2. 并行计算所有专家的输出 -+ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+ # # ops.cat 会将它们堆叠成一个新的 Tensor -+ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+ -+ # # 3. 使用矩阵乘法进行加权求和 -+ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ # # 最终结果 final_output 的 shape: (1, hidden_size) -+ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) - -- return final_output -+ # return final_output - - - # @no_grad() -@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): - ) - - return expert_cache -+# 放置在 DeepseekMoE 类中 -+ @no_grad() -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ """ -+ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+ -+ Args: -+ x (Tensor): 输入张量, shape: (1, hidden_size) -+ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+ """ -+ top_k, _ = flat_expert_weights.shape -+ hidden_size = x.shape[-1] -+ -+ # 1. 将所有专家的权重堆叠起来 -+ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+ -+ # 2. "收集" 所需的专家权重 -+ selected_gate_w = stacked_gate_w[flat_expert_indices] -+ selected_up_w = stacked_up_w[flat_expert_indices] -+ selected_down_w = stacked_down_w[flat_expert_indices] -+ -+ # 3. 准备输入 -+ x_expanded = x.expand((top_k, 1, hidden_size)) -+ -+ # 4. 并行计算 gate_proj 和 up_proj -+ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+ -+ # 5. 计算中间状态 -+ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+ -+ # 6. 并行计算 down_proj -+ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+ # --- [FIX] --- -+ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+ # --- [FIX END] --- -+ -+ # 7. 根据路由权重进行加权求和 -+ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+ -+ return weighted_sum -+ -+ - - # @no_grad() - # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index ebd7782e..913a7609 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): - # Copied from transformers.models.llama.modeling_llama.rotate_half - def rotate_half(x): - """Rotates half the hidden dims of the input.""" -- x1 = x[..., : x.shape[-1] // 2] -- x2 = x[..., x.shape[-1] // 2 :] -+ # x1 = x[..., : x.shape[-1] // 2] -+ # x2 = x[..., x.shape[-1] // 2 :] - # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - - -diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -index d059dcbe..2b217b64 100644 ---- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): - # Copied from transformers.models.llama.modeling_llama.rotate_half - def rotate_half(x): - """Rotates half the hidden dims of the input.""" -- x1 = x[..., : x.shape[-1] // 2] -- x2 = x[..., x.shape[-1] // 2 :] -+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+ # x1 = x[..., : x.shape[-1] // 2] -+ # x2 = x[..., x.shape[-1] // 2 :] -+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - - -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -index 78f22642..0a0ef2d7 100644 ---- a/patches/0001-20251104commit.patch -+++ b/patches/0001-20251104commit.patch -@@ -1,7 +1,7 @@ - From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/3] 20251104commit -+Subject: [PATCH 1/4] 20251104commit - - --- - mindnlp/transformers/cache_utils.py | 28 +- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -index 22b65dd5..5185270c 100644 ---- a/patches/0002-20251106commit.patch -+++ b/patches/0002-20251106commit.patch -@@ -1,7 +1,7 @@ - From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/3] 20251106commit -+Subject: [PATCH 2/4] 20251106commit - - --- - .../models/deepseek/modeling_deepseek.py | 379 ++++- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -index 966529e4..3e05f821 100644 ---- a/patches/0003-20261106secondcommit.patch -+++ b/patches/0003-20261106secondcommit.patch -@@ -1,7 +1,7 @@ - From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/3] 20261106secondcommit -+Subject: [PATCH 3/4] 20261106secondcommit - - --- - .../models/deepseek/modeling_deepseek.py | 217 ++- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -new file mode 100644 -index 00000000..88a1aef4 ---- /dev/null -+++ b/patches/0004-20251106change.patch -@@ -0,0 +1,7498 @@ -+From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Thu, 6 Nov 2025 15:48:09 +0800 -+Subject: [PATCH 4/4] 20251106change -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 189 +- -+ patches/0001-20251104commit.patch | 1272 +++++++ -+ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -+ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -+ 4 files changed, 7244 insertions(+), 186 deletions(-) -+ create mode 100644 patches/0001-20251104commit.patch -+ create mode 100644 patches/0002-20251106commit.patch -+ create mode 100644 patches/0003-20261106secondcommit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index 2f9192bf..0546f318 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -+ -+ return attn_output, attn_weights, past_key_value -+ -+-# class DeepseekFlashAttention(nn.Module): -+-# """ -+-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+- -+-# This class is designed as a drop-in replacement for DeepseekAttention. -+-# """ -+- -+-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+-# super().__init__() -+-# self.config = config -+-# self.layer_idx = layer_idx -+-# if layer_idx is None: -+-# logger.warning( -+-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+-# "when creating this class." -+-# ) -+- -+-# self.attention_dropout = config.attention_dropout -+-# self.hidden_size = config.hidden_size -+-# self.num_heads = config.num_attention_heads -+-# self.head_dim = self.hidden_size // self.num_heads -+-# self.num_key_value_heads = config.num_key_value_heads -+-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+-# self.max_position_embeddings = config.max_position_embeddings -+-# self.rope_theta = config.rope_theta -+-# self.is_causal = True -+- -+-# if (self.head_dim * self.num_heads) != self.hidden_size: -+-# raise ValueError( -+-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+-# f" and `num_heads`: {self.num_heads})." -+-# ) -+- -+-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+-# self._init_rope() -+- -+-# def _init_rope(self): -+-# if self.config.rope_scaling is None: -+-# self.rotary_emb = DeepseekRotaryEmbedding( -+-# self.head_dim, -+-# max_position_embeddings=self.max_position_embeddings, -+-# base=self.rope_theta, -+-# ) -+-# else: -+-# scaling_type = self.config.rope_scaling["type"] -+-# scaling_factor = self.config.rope_scaling["factor"] -+-# if scaling_type == "linear": -+-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+-# self.head_dim, -+-# max_position_embeddings=self.max_position_embeddings, -+-# scaling_factor=scaling_factor, -+-# base=self.rope_theta, -+-# ) -+-# elif scaling_type == "dynamic": -+-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+-# self.head_dim, -+-# max_position_embeddings=self.max_position_embeddings, -+-# scaling_factor=scaling_factor, -+-# base=self.rope_theta, -+-# ) -+-# else: -+-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+- -+-# def forward( -+-# self, -+-# hidden_states: mindspore.Tensor, -+-# attention_mask: Optional[mindspore.Tensor] = None, -+-# position_ids: Optional[mindspore.Tensor] = None, -+-# past_key_value: Optional[Cache] = None, -+-# output_attentions: bool = False, -+-# use_cache: bool = False, -+-# **kwargs, -+-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+-# if "padding_mask" in kwargs: -+-# warnings.warn( -+-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+-# ) -+- -+-# if output_attentions: -+-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+- -+-# bsz, q_len, _ = hidden_states.shape -+- -+-# if self.config.pretraining_tp > 1: -+-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+- -+-# query_states = self.q_proj(hidden_states) -+-# key_states = self.k_proj(hidden_states) -+-# value_states = self.v_proj(hidden_states) -+- -+-# # Reshape for multi-head attention -+-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+- -+-# kv_seq_len = key_states.shape[-2] -+-# if past_key_value is not None: -+-# if self.layer_idx is None: -+-# raise ValueError( -+-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+-# "with a layer index." -+-# ) -+-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+- -+-# # Apply Rotary Positional Embedding -+-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+- -+-# if past_key_value is not None: -+-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+- -+-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+- -+-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+- -+-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+- -+-# # Convert attention_mask for flash_attention_score -+-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+-# if attention_mask is not None: -+-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+-# raise ValueError( -+-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+-# ) -+-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+-# else: -+-# attn_mask_for_fa = None -+- -+-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+- -+-# # Call the fused flash_attention_score operator -+-# attn_output = mindspore.ops.flash_attention_score( -+-# query=query_states_for_fa, -+-# key=key_states_for_fa, -+-# value=value_states_for_fa, -+-# head_num=self.num_heads, # This is N1, the number of query heads -+-# input_layout='BSH', -+-# attn_mask=attn_mask_for_fa, -+-# keep_prob=keep_prob, -+-# scalar_value=1.0 / math.sqrt(self.head_dim), -+-# sparse_mode=0 # Default mask mode -+-# ) -+- -+-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+-# attn_output = self.o_proj(attn_output) -+- -+-# # Flash Attention does not return attention weights -+-# attn_weights = None -+- -+-# return attn_output, attn_weights, past_key_value -+ -+ class DeepseekFlashAttention(nn.Module): -+ """ -+@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -+ super().__init__() -+ self.hidden_size = config.hidden_size -+ -+- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+- config=config, layer_idx=layer_idx -+- ) -++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -++ # config=config, layer_idx=layer_idx -++ # ) -+ -+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+ config=config, layer_idx=layer_idx -+@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -+ return outputs -+ -+ -+- -+ class DeepseekPreTrainedModel(PreTrainedModel): -+ config_class = DeepseekConfig -+ base_model_prefix = "model" -+@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ # Initialize weights and apply final processing -+ self.post_init() -+ self.warm_up = False -+- #@dwj -+- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+- self.num_layers, -+- self.num_attention_heads, -+- self.head_dim, -+- batch_size=1, -+- max_length=self.max_length, -+- dtype=mindspore.float16 -+- ) -+- -+- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+- key_cache = [] -+- value_cache = [] -+- for _ in range(num_layers): -+- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+- key_cache.append(k) -+- value_cache.append(v) -+- return key_cache, value_cache -+- -+ -+ def warmup_moe_model_deep(self): -+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+new file mode 100644 -+index 00000000..78f22642 -+--- /dev/null -++++ b/patches/0001-20251104commit.patch -+@@ -0,0 +1,1272 @@ -++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++Subject: [PATCH 1/3] 20251104commit -++ -++--- -++ mindnlp/transformers/cache_utils.py | 28 +- -++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++ 3 files changed, 976 insertions(+), 87 deletions(-) -++ -++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++index cadd2e04..02f8d4be 100644 -++--- a/mindnlp/transformers/cache_utils.py -+++++ b/mindnlp/transformers/cache_utils.py -++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++ # k_out[:, :, cache_position] = key_states -++ # v_out[:, :, cache_position] = value_states -++- if ON_ORANGE_PI: -++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++- else: -++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++- -+++ # if ON_ORANGE_PI: -+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++ # else: -+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++ if cache_position.ndim > 1: -+++ cache_position = cache_position.flatten() -+++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++ cache_position = cache_position.int() -+++ -+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++ k_out[:, :, cache_position] = key_states -+++ v_out[:, :, cache_position] = value_states -+++ -++ return k_out, v_out -++ -++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index c695b944..d8303e45 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++- x1 = x[..., : x.shape[-1] // 2] -++- x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++ # x1 = x[..., : x.shape[-1] // 2] -+++ # x2 = x[..., x.shape[-1] // 2 :] -+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++ if self.training: -++ raise NotImplementedError("Training is not supported yet.") -++ else: -++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++- if self.config.n_shared_experts is not None: -++- y = y + self.shared_experts(identity) -++- return y -+++ # @lwx -+++ if orig_shape[1] == 1: -+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++ y=y.view(*orig_shape) -+++ if self.config.n_shared_experts is not None: -+++ y = y + self.shared_experts(identity) -+++ return y -+++ else: -+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++ if self.config.n_shared_experts is not None: -+++ y = y + self.shared_experts(identity) -+++ return y -+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++ # if self.config.n_shared_experts is not None: -+++ # y = y + self.shared_experts(identity) -+++ # return y -+++ -+++ @no_grad() -+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ -+++ expert_cache = ops.zeros_like(x) -+++ for i in range(self.num_experts_per_tok): -+++ expert_id = flat_expert_indices[i].item() -+++ weight = flat_expert_weights[i].item() -+++ expert = self.experts[expert_id] -+++ expert_out = expert(x) -+++ expert_cache += expert_out * weight -+++ return expert_cache -++ -++ @no_grad() -++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++- # expert_cache = torch.zeros_like(x) -++- # idxs = flat_expert_indices.argsort() -++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++- # token_idxs = idxs // self.num_experts_per_tok -++- # for i, end_idx in enumerate(tokens_per_expert): -++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++- # if start_idx == end_idx: -++- # continue -++- # expert = self.experts[i] -++- # exp_token_idx = token_idxs[start_idx:end_idx] -++- # expert_tokens = x[exp_token_idx] -++- # expert_out = expert(expert_tokens) -++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++- # return expert_cache -+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ expert_cache = ops.zeros_like(x) -++ idxs = flat_expert_indices.argsort() -++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++ token_idxs = idxs // self.num_experts_per_tok -+++ -++ for i, end_idx in enumerate(tokens_per_expert): -++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++ if start_idx == end_idx: -++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++ expert_out = expert(expert_tokens) -++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -++ return expert_cache -+++ -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # # expert_cache = torch.zeros_like(x) -+++ # # idxs = flat_expert_indices.argsort() -+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++ # # token_idxs = idxs // self.num_experts_per_tok -+++ # # for i, end_idx in enumerate(tokens_per_expert): -+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++ # # if start_idx == end_idx: -+++ # # continue -+++ # # expert = self.experts[i] -+++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # # expert_tokens = x[exp_token_idx] -+++ # # expert_out = expert(expert_tokens) -+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++ # # return expert_cache -+++ # expert_cache = ops.zeros_like(x) -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # for i, end_idx in enumerate(tokens_per_expert): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # if start_idx == end_idx: -+++ # continue -+++ # expert = self.experts[i] -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = expert(expert_tokens) -+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -+++ # return expert_cache -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # expert_cache = ops.zeros_like(x) -+++ -+++ # # 排序保证顺序一致 -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # # 找出有 token 的专家 -+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++ -+++ # for i in active_experts.tolist(): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # end_idx = tokens_per_expert[i] -+++ # if start_idx == end_idx: # 没有 token -+++ # continue -+++ -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = self.experts[i](expert_tokens) -+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++ -+++ # expert_cache = mindspore.mint.scatter_add( -+++ # expert_cache, -+++ # 0, -+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++ # expert_out -+++ # ) -+++ -+++ # return expert_cache -+++ -+++ -++ -++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++ # """ -++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ -++ # Initialize weights and apply final processing -++ self.post_init() -+++ self.warm_up = False -+++ -+++ def warmup_moe_model_deep(self): -+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++ test_texts = [ -+++ "warmup short", -+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++ ] -+++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++ if tokenizer is None: -+++ from mindnlp.transformers import AutoTokenizer -+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++ self._warmup_tokenizer = tokenizer -+++ -+++ for text in test_texts: -+++ inputs = tokenizer(text, return_tensors="ms") -+++ with mindspore._no_grad(): -+++ _ = self(**inputs, use_cache=False) -+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++ -++ def get_input_embeddings(self): -++ return self.model.embed_tokens -++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++ ```""" -+++ if not self.warm_up: -+++ self.warm_up = True -+++ self.warmup_moe_model_deep() -+++ -++ output_attentions = ( -++ output_attentions -++ if output_attentions is not None -++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++index 3cbf820e..d4c6b651 100644 -++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++@@ -18,7 +18,6 @@ -++ # See the License for the specific language governing permissions and -++ # limitations under the License. -++ """MindSpore Qwen2MoE model.""" -++- -++ import math -++ from typing import List, Optional, Tuple, Union -++ -++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++ TokenClassifierOutput, -++ ) -++ from ...modeling_utils import PreTrainedModel -+++from ...generation import GenerationMixin -++ from ....utils import logging -++ from .configuration_qwen2_moe import Qwen2MoeConfig -++ -++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++ self.variance_epsilon = eps -++ -++ def forward(self, hidden_states): -+++ # @dwj -+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++ # @lwx -+++ # if not self.training : -+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++ input_dtype = hidden_states.dtype -++ hidden_states = hidden_states.to(mindspore.float32) -++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++@@ -234,6 +239,8 @@ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++ x1 = x[..., : x.shape[-1] // 2] -++ x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++ self.config = config -++ self.hidden_size = config.hidden_size -++ self.intermediate_size = intermediate_size -+++ -++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++ self.act_fn = ACT2FN[config.hidden_act] -++ -++ def forward(self, x): -++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++- -++ -+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++ # @lwx -+++ # gate_up_output = self.gate_up_proj(x) -+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++ # return self.down_proj(swiglu_output) -+++ -+++ # def forward(self, x): -+++ # gate_proj_out = self.gate_proj(x) -+++ # up_proj_out = self.up_proj(x) -+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++ # return self.down_proj(swiglu_out) -+++ -++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++ """ -++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++ use_cache: bool = False, -++ cache_position: Optional[mindspore.Tensor] = None, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ -+++ -++ bsz, q_len, _ = hidden_states.shape -++ -++ query_states = self.q_proj(hidden_states) -++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++ "with a layer index." -++ ) -++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ if isinstance(past_key_value, StaticCache): -+++ kv_seq_len = key_states.shape[-2] -+++ else: -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++ if isinstance(past_key_value, StaticCache): -+++ kv_seq_len = key_states.shape[-2] -++ -++ # repeat k/v heads if n_kv_heads < n_heads -++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++- -+++ -++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++ -++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++- raise ValueError( -++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++- f" {attn_weights.shape}" -++- ) -++- -++- if attention_mask is not None: # no matter the length, we just slice it -++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++ if attention_mask is not None: -+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++ attn_weights = attn_weights + causal_mask -++ -++ # upcast attention to fp32 -++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++ -++ attn_output = self.o_proj(attn_output) -++- -+++ # @lwx -+++ -+++ # max_seq_len = self.max_position_embeddings # 2048 -+++ -+++ # if attention_mask is not None: -+++ # # attention_mask: [B, 1, Sq, Sk] -+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++ -+++ # # pad 到 [max_seq_len, max_seq_len] -+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++ # global_attention_mask = padded_mask -+++ # else: -+++ # global_attention_mask = None -+++ -+++ -+++ # sparse_mode=3 -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, -+++ # key=key_states, -+++ # value=value_states, -+++ # real_shift=None, -+++ # padding_mask=None, -+++ -+++ # head_num=self.num_heads, -+++ # attn_mask=global_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++ # input_layout="BNSD", -+++ # pre_tokens=2147483647, -+++ # next_tokens=2147483647, -+++ # inner_precise=0, -+++ # drop_mask=None, -+++ # prefix=None, -+++ # actual_seq_qlen=None, -+++ # actual_seq_kvlen=None, -+++ # sparse_mode=sparse_mode, -+++ # ) -++ if not output_attentions: -++ attn_weights = None -++ -++ return attn_output, attn_weights, past_key_value -++ -++ -+++class Qwen2MoeFlashAttention(nn.Module): -+++ """ -+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++ -+++ 关键改动: -+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++ 直接传入原始的 key 和 value 张量效率更高。 -+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++ """ -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++ super().__init__() -+++ self.config = config -+++ self.layer_idx = layer_idx -+++ self.hidden_size = config.hidden_size -+++ self.num_heads = config.num_attention_heads -+++ self.head_dim = self.hidden_size // self.num_heads -+++ self.num_key_value_heads = config.num_key_value_heads -+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++ self.max_position_embeddings = config.max_position_embeddings -+++ self.rope_theta = config.rope_theta -+++ self.attention_dropout = config.attention_dropout -+++ -+++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++ raise ValueError( -+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++ ) -+++ -+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++ -+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++ self.head_dim, -+++ max_position_embeddings=self.max_position_embeddings, -+++ base=self.rope_theta, -+++ ) -+++ -+++ def forward( -+++ self, -+++ hidden_states: mindspore.Tensor, -+++ attention_mask: Optional[mindspore.Tensor] = None, -+++ position_ids: Optional[mindspore.Tensor] = None, -+++ past_key_value: Optional[Cache] = None, -+++ output_attentions: bool = False, -+++ use_cache: bool = False, -+++ cache_position: Optional[mindspore.Tensor] = None, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ # 1. 线性投射 Q, K, V -+++ query_states = self.q_proj(hidden_states) -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++ # query: [B, S, H*D] -> [B, N1, S, D] -+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # 3. RoPE 旋转位置编码 -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++ if self.layer_idx is None: -+++ raise ValueError( -+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ "with a layer index." -+++ ) -+++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++ if cache_position.shape[0] == 1: -+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++ kv_seq_len = past_seen_tokens + 1 -+++ else: -+++ # prefill 阶段:cache_position 是范围,使用其长度 -+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++ else: -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # 4. KV 缓存更新 -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ key_states, value_states = past_key_value.update( -+++ key_states, value_states, self.layer_idx, cache_kwargs -+++ ) -+++ -+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++ if cache_position.shape[0] == 1: -+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++ kv_seq_len = key_states.shape[-2] -+++ -+++ # 5. [重要] 准备 Attention Mask -+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++ fa_attention_mask = None -+++ if attention_mask is not None: -+++ # 截取与当前key长度匹配的部分 -+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++ fa_attention_mask = (mask_slice != 0) -+++ -+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++ input_dtype = query_states.dtype -+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++ query_states = query_states.to(mindspore.float16) -+++ key_states = key_states.to(mindspore.float16) -+++ value_states = value_states.to(mindspore.float16) -+++ -+++ # 6. [核心] 调用 flash_attention_score 算子 -+++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++ attn_output = mindspore.ops.flash_attention_score( -+++ query=query_states, -+++ key=key_states, -+++ value=value_states, -+++ head_num=self.num_heads, # 传入Q的头数(N1) -+++ attn_mask=fa_attention_mask, -+++ keep_prob=1.0 - self.attention_dropout, -+++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++ input_layout="BNSD", -+++ sparse_mode=0 # 使用 defaultMask 模式 -+++ ) -+++ -+++ # 恢复原始数据类型 -+++ attn_output = attn_output.to(input_dtype) -+++ -+++ # 7. 调整输出形状 -+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ attn_output = self.o_proj(attn_output) -+++ -+++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++ attn_weights = None -+++ if output_attentions: -+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++ # def forward( -+++ # self, -+++ # hidden_states: mindspore.Tensor, -+++ # attention_mask: Optional[mindspore.Tensor] = None, -+++ # position_ids: Optional[mindspore.Tensor] = None, -+++ # past_key_value: Optional[Cache] = None, -+++ # output_attentions: bool = False, -+++ # use_cache: bool = False, -+++ # cache_position: Optional[mindspore.Tensor] = None, -+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ # bsz, q_len, _ = hidden_states.shape -+++ -+++ # # 1. 线性投射 Q, K, V -+++ # query_states = self.q_proj(hidden_states) -+++ # key_states = self.k_proj(hidden_states) -+++ # value_states = self.v_proj(hidden_states) -+++ -+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # # 3. RoPE 旋转位置编码 -+++ # kv_seq_len = key_states.shape[-2] -+++ # if past_key_value is not None: -+++ # if self.layer_idx is None: -+++ # raise ValueError( -+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ # "with a layer index." -+++ # ) -+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # # 4. KV 缓存更新 -+++ # if past_key_value is not None: -+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ # key_states, value_states = past_key_value.update( -+++ # key_states, value_states, self.layer_idx, cache_kwargs -+++ # ) -+++ -+++ # # 5. 准备 Attention Mask -+++ # fa_attention_mask = None -+++ # if attention_mask is not None: -+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # fa_attention_mask = (mask_slice != 0) -+++ -+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++ # input_dtype = query_states.dtype -+++ -+++ # # 6. [核心] 调用 flash_attention_score 算子 -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, -+++ # key=key_states, -+++ # value=value_states, -+++ # head_num=self.num_heads, -+++ # attn_mask=fa_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++ # input_layout="BNSD", -+++ # sparse_mode=0, -+++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++ # inner_precise=1 -+++ # ) -+++ -+++ # # 恢复原始数据类型 -+++ # attn_output = attn_output.to(input_dtype) -+++ -+++ # # 7. 调整输出形状 -+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ # attn_output = self.o_proj(attn_output) -+++ -+++ # attn_weights = None -+++ # if output_attentions: -+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++ # return attn_output, attn_weights, past_key_value -+++ -+++ # def forward( -+++ # self, -+++ # hidden_states: mindspore.Tensor, -+++ # attention_mask: Optional[mindspore.Tensor] = None, -+++ # position_ids: Optional[mindspore.Tensor] = None, -+++ # past_key_value: Optional[Cache] = None, -+++ # output_attentions: bool = False, -+++ # use_cache: bool = False, -+++ # cache_position: Optional[mindspore.Tensor] = None, -+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ # bsz, q_len, _ = hidden_states.shape -+++ -+++ # query_states = self.q_proj(hidden_states) -+++ # key_states = self.k_proj(hidden_states) -+++ # value_states = self.v_proj(hidden_states) -+++ -+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ # kv_seq_len = key_states.shape[-2] -+++ # if past_key_value is not None: -+++ # if self.layer_idx is None: -+++ # raise ValueError("`layer_idx` must be specified for caching") -+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ # if past_key_value is not None: -+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ # key_states, value_states = past_key_value.update( -+++ # key_states, value_states, self.layer_idx, cache_kwargs -+++ # ) -+++ -+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++ -+++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++ # query_states = query_states / math.sqrt(self.head_dim) -+++ # # <--- 修改结束 --- -+++ -+++ # fa_attention_mask = None -+++ # if attention_mask is not None: -+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ # fa_attention_mask = (mask_slice != 0) -+++ -+++ # input_dtype = query_states.dtype -+++ -+++ # attn_output = mindspore.ops.flash_attention_score( -+++ # query=query_states, # 传入已经预先缩放过的 query -+++ # key=key_states, -+++ # value=value_states, -+++ # head_num=self.num_heads, -+++ # attn_mask=fa_attention_mask, -+++ # keep_prob=1.0 - self.attention_dropout, -+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++ # input_layout="BNSD", -+++ # sparse_mode=0, -+++ # inner_precise=1 # 仍然保持内部高精度计算 -+++ # ) -+++ -+++ # attn_output = attn_output.to(input_dtype) -+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ # attn_output = self.o_proj(attn_output) -+++ -+++ # attn_weights = None -+++ # if output_attentions: -+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++ -+++ # return attn_output, attn_weights, past_key_value -+++ -++ QWEN2MOE_ATTENTION_CLASSES = { -++ "eager": Qwen2MoeAttention, -+++ "flash-attention": Qwen2MoeFlashAttention, -++ } -++ -++ -++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -+++ #@dwj -+++ # 只遍历激活的专家,而非全部专家 -++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++- hidden_states = hidden_states.view(-1, hidden_dim) -++- # router_logits: (batch * sequence_length, n_experts) -++- router_logits = self.gate(hidden_states) -++- -++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- if self.norm_topk_prob: -++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- # we cast back to the input dtype -++- routing_weights = routing_weights.to(hidden_states.dtype) -++- -++- final_hidden_states = ops.zeros( -++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++- ) -++- -++- # One hot encode the selected experts to create an expert mask -++- # this will be used to easily index which expert is going to be sollicitated -++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++- -++- # Loop over all available experts in the model and perform the computation on each expert -++- for expert_idx in range(self.num_experts): -++- expert_layer = self.experts[expert_idx] -++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++- -++- # Index the correct hidden states and compute the expert hidden state for -++- # the current expert. We need to make sure to multiply the output hidden -++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++- if 0 not in idx.shape: -++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++- -++- # However `index_add_` only support torch tensors for indexing so we'll use -++- # the `top_x` tensor here. -++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++- -++- shared_expert_output = self.shared_expert(hidden_states) -++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++- -++- final_hidden_states = final_hidden_states + shared_expert_output -+++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++ num_tokens = hidden_states_reshaped.shape[0] -+++ -+++ router_logits = self.gate(hidden_states_reshaped) -+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++ if self.norm_topk_prob: -+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++ flat_selected_experts = selected_experts.flatten() -+++ -+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++ token_indices = broadcasted_token_indices.flatten() -+++ -+++ active_experts = ops.unique(flat_selected_experts) -+++ -+++ for expert_idx_tensor in active_experts: -+++ expert_idx = expert_idx_tensor.item() -+++ expert_layer = self.experts[expert_idx] -+++ -+++ mask = (flat_selected_experts == expert_idx_tensor) -+++ selected_token_indices = token_indices[mask] -+++ selected_routing_weights = routing_weights.flatten()[mask] -+++ -+++ current_states = hidden_states_reshaped[selected_token_indices] -+++ -+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++ -+++ final_hidden_states = final_hidden_states.index_add( -+++ dim=0, -+++ index=selected_token_indices, -+++ source=expert_output.to(hidden_states.dtype) -+++ ) -+++ -+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++ -++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++- return final_hidden_states, router_logits -+++ final_hidden_states = final_hidden_states + shared_expert_output -+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++ -+++ return final_hidden_states, router_logits -++ -++ -++ class Qwen2MoeDecoderLayer(nn.Module): -++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++ -++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++ -++ if (layer_idx not in config.mlp_only_layers) and ( -++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++ ): -++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++ _skip_keys_device_placement = "past_key_values" -++ _supports_cache_class = True -+++#lwx -+++ # _supports_static_cache = True -++ -++ def _init_weights(self, module): -++ std = self.config.initializer_range -++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++ return causal_mask -++ -++ -++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ _tied_weights_keys = ["lm_head.weight"] -++ -++ def __init__(self, config): -++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ self.num_experts_per_tok = config.num_experts_per_tok -++ # Initialize weights and apply final processing -++ self.post_init() -+++ # @lwx -+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++ # self.generation_config.cache_implementation = "static" -+++ self._warmed_up = False -+++ -+++ def warmup_moe_model(self): -+++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++ test_texts = [ -+++ "warmup short", -+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++ ] -+++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++ if tokenizer is None: -+++ from mindnlp.transformers import AutoTokenizer -+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++ self._warmup_tokenizer = tokenizer -+++ -+++ for text in test_texts: -+++ inputs = tokenizer(text, return_tensors="ms") -+++ with mindspore._no_grad(): -+++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++ -++ def get_input_embeddings(self): -++ return self.model.embed_tokens -++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++ ```""" -+++ if not self._warmed_up: -+++ self._warmed_up = True -+++ self.warmup_moe_model() -++ -++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++ output_router_logits = ( -++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++ } -++ ) -++ return model_inputs -+++# @lwx -+++ # def _decode_one_tokens_logits( -+++ # self, -+++ # cur_token: mindspore.Tensor, -+++ # input_pos: Optional[mindspore.Tensor], -+++ # cache_position: mindspore.Tensor, -+++ # past_key_values: StaticCache, -+++ # ) -> mindspore.Tensor: -+++ # """ -+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++ -+++ # Args: -+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++ # input_pos: 输入位置信息,可选 -+++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++ -+++ # Returns: -+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++ # """ -+++ # # 调用JIT编译的版本 -+++ # return self.get_decode_one_tokens_logits( -+++ # cur_token=cur_token, -+++ # input_pos=input_pos, -+++ # cache_position=cache_position, -+++ # past_key_values=past_key_values, -+++ # ) -+++ -+++ # @mindspore.jit(jit_level='O1') -+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++ # """ -+++ # JIT编译的函数,用于高效的单token解码 -+++ # 使用JIT编译优化以支持静态shape和高效执行 -+++ -+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++ # """ -+++ # outputs = self.model.forward( -+++ # input_ids=cur_token, -+++ # position_ids=input_pos, -+++ # cache_position=cache_position, -+++ # past_key_values=past_key_values, -+++ # use_cache=True, -+++ # return_dict=False, -+++ # ) -+++ -+++ # hidden_states = outputs[0] -+++ # logits = self.lm_head.forward(hidden_states) -+++ # logits = logits.float() -+++ -+++ # return logits[:, -1, :] -+++ -+++ # def _sample( -+++ # self, -+++ # input_ids: mindspore.Tensor, -+++ # logits_processor, -+++ # stopping_criteria, -+++ # generation_config, -+++ # synced_devices: bool, -+++ # streamer=None, -+++ # logits_warper=None, -+++ # **model_kwargs, -+++ # ): -+++ # """ -+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++ # """ -+++ # from ...generation.logits_process import LogitsProcessorList -+++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++ # from mindnlp.core import nn, ops, no_grad -+++ # import numpy as np -+++ -+++ # # 检查是否使用 StaticCache -+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++ # # 否则,直接调用父类方法 -+++ # past_key_values = model_kwargs.get("past_key_values") -+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++ -+++ # if not isinstance(past_key_values, StaticCache): -+++ # # 不使用 StaticCache,直接调用父类方法 -+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++ # return super()._sample( -+++ # input_ids=input_ids, -+++ # logits_processor=logits_processor, -+++ # stopping_criteria=stopping_criteria, -+++ # generation_config=generation_config, -+++ # synced_devices=synced_devices, -+++ # streamer=streamer, -+++ # logits_warper=logits_warper, -+++ # **model_kwargs, -+++ # ) -+++ -+++ # # 使用 StaticCache,进入自定义循环 -+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++ # pad_token_id = generation_config._pad_token_tensor -+++ # output_attentions = generation_config.output_attentions -+++ # output_hidden_states = generation_config.output_hidden_states -+++ # output_scores = generation_config.output_scores -+++ # output_logits = generation_config.output_logits -+++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++ # max_length = generation_config.max_length -+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++ # do_sample = generation_config.do_sample -+++ -+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++ # raise ValueError( -+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++ # f"{logits_warper})." -+++ # ) -+++ -+++ # # init attention / hidden states / scores tuples -+++ # scores = () if (return_dict_in_generate and output_scores) else None -+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++ -+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++ # encoder_hidden_states = ( -+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++ # ) -+++ -+++ # # keep track of which sequences are already finished -+++ # batch_size, cur_len = input_ids.shape -+++ # this_peer_finished = False -+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++ -+++ # time_record = [] -+++ # from ....utils.testing_utils import parse_flag_from_env -+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++ -+++ # while self._has_unfinished_sequences( -+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++ # ): -+++ # if _record_time: -+++ # import time as time_module -+++ # infer_start = time_module.time() -+++ -+++ # # prepare model inputs -+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++ -+++ # # prepare variable output controls -+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++ -+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++ # cur_cache_position = model_inputs.get("cache_position") -+++ # cur_past_key_values = model_inputs.get("past_key_values") -+++ # cur_input_ids = model_inputs.get("input_ids") -+++ -+++ # if (isinstance(cur_past_key_values, StaticCache) and -+++ # cur_cache_position is not None and -+++ # len(cur_cache_position.shape) > 0 and -+++ # cur_cache_position.shape[0] == 1 and -+++ # cur_input_ids is not None and -+++ # cur_input_ids.shape[1] == 1): -+++ # # 使用 JIT 优化的单 token 解码 -+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++ # if not hasattr(self, '_jit_used'): -+++ # self._jit_used = False -+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++ -+++ # next_token_logits = self.get_decode_one_tokens_logits( -+++ # cur_token=cur_input_ids, -+++ # input_pos=model_inputs.get("position_ids"), -+++ # cache_position=cur_cache_position, -+++ # past_key_values=cur_past_key_values, -+++ # ) -+++ -+++ # # 标记已使用JIT(用于后续判断) -+++ # if not self._jit_used: -+++ # self._jit_used = True -+++ -+++ # # 构造兼容的输出对象 -+++ # class JitOptimizedOutput: -+++ # def __init__(self, logits, config): -+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++ # self.config = config -+++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++ # self.attentions = None if not config.is_encoder_decoder else None -+++ # self.cross_attentions = None -+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++ -+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++ # else: -+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++ # outputs = self(**model_inputs, return_dict=True) -+++ -+++ # if synced_devices and this_peer_finished: -+++ # continue -+++ -+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++ # next_token_logits = outputs.logits[:, -1, :] -+++ -+++ # # pre-process distribution -+++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++ # if do_sample: -+++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++ -+++ # # Store scores, attentions and hidden_states when required -+++ # if return_dict_in_generate: -+++ # if output_scores: -+++ # scores += (next_token_scores,) -+++ # if output_logits: -+++ # raw_logits += (next_token_logits,) -+++ # if output_attentions: -+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++ # if self.config.is_encoder_decoder: -+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++ -+++ # if output_hidden_states: -+++ # hidden = ( -+++ # outputs.decoder_hidden_states -+++ # if self.config.is_encoder_decoder -+++ # else outputs.hidden_states -+++ # ) -+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++ -+++ # # token selection -+++ # if do_sample: -+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++ # else: -+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++ -+++ # # finished sentences should have their next token be a padding token -+++ # if has_eos_stopping_criteria: -+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++ -+++ # # update generated ids, model inputs, and length for next step -+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++ # if streamer is not None: -+++ # streamer.put(next_tokens) -+++ -+++ # model_kwargs = self._update_model_kwargs_for_generation( -+++ # outputs, -+++ # model_kwargs, -+++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++ # ) -+++ -+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++ # cur_len += 1 -+++ -+++ # if _record_time: -+++ # import time as time_module -+++ # infer_stop = time_module.time() -+++ # time_record.append(infer_stop - infer_start) -+++ -+++ # del outputs -+++ -+++ # average_infer_time = None -+++ # if time_record: -+++ # if len(time_record) > 1: -+++ # time_record.pop(0) -+++ # average_infer_time = sum(time_record) / len(time_record) -+++ # print(f'average inference time is: {average_infer_time}') -+++ # print(f'inference time record: {time_record}') -+++ -+++ # if streamer is not None: -+++ # streamer.end() -+++ -+++ # # 简单判断:打印是否使用了JIT路径 -+++ # if hasattr(self, '_jit_used') and self._jit_used: -+++ # print("[JIT] ✓ JIT optimization was used during generation") -+++ # else: -+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++ -+++ # if return_dict_in_generate: -+++ # if self.config.is_encoder_decoder: -+++ # return GenerateEncoderDecoderOutput( -+++ # sequences=input_ids, -+++ # scores=scores, -+++ # logits=raw_logits, -+++ # encoder_attentions=encoder_attentions, -+++ # encoder_hidden_states=encoder_hidden_states, -+++ # decoder_attentions=decoder_attentions, -+++ # cross_attentions=cross_attentions, -+++ # decoder_hidden_states=decoder_hidden_states, -+++ # past_key_values=model_kwargs.get("past_key_values"), -+++ # average_infer_time=average_infer_time -+++ # ) -+++ # else: -+++ # return GenerateDecoderOnlyOutput( -+++ # sequences=input_ids, -+++ # scores=scores, -+++ # logits=raw_logits, -+++ # attentions=decoder_attentions, -+++ # hidden_states=decoder_hidden_states, -+++ # past_key_values=model_kwargs.get("past_key_values"), -+++ # average_infer_time=average_infer_time -+++ # ) -+++ # else: -+++ # return input_ids -+++ -+++ # def _prepare_cache_for_generation( -+++ # self, -+++ # generation_config, -+++ # model_kwargs, -+++ # assistant_model, -+++ # batch_size, -+++ # max_cache_length, -+++ # ): -+++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++ # generation_config.cache_implementation = "static" -+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++ -+++ # if generation_config.cache_implementation == "static": -+++ # base_required_from_max_length = generation_config.max_length + 1 -+++ # base_required = max(max_cache_length, base_required_from_max_length) -+++ # min_cache_size = 50 -+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++ # else: -+++ # max_cache_length = max(base_required, min_cache_size) -+++ -+++ # original_max_cache_length = max_cache_length -+++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++ # print(f" - final max_cache_length: {max_cache_length}") -+++ -+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++ # if max_cache_length > self.config.max_position_embeddings: -+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++ -+++ # result = super()._prepare_cache_for_generation( -+++ # generation_config=generation_config, -+++ # model_kwargs=model_kwargs, -+++ # assistant_model=assistant_model, -+++ # batch_size=batch_size, -+++ # max_cache_length=max_cache_length, -+++ # ) -+++ -+++ # if generation_config.cache_implementation == "static": -+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++ # created_cache = model_kwargs.get(cache_name) -+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++ # if created_cache.max_cache_len < generation_config.max_length: -+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++ -+++ # return result -+++ -+++ -+++ -++ -++ -++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++-- -++2.27.0 -++ -+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+new file mode 100644 -+index 00000000..22b65dd5 -+--- /dev/null -++++ b/patches/0002-20251106commit.patch -+@@ -0,0 +1,3200 @@ -++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Thu, 6 Nov 2025 09:20:38 +0800 -++Subject: [PATCH 2/3] 20251106commit -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -++ 3 files changed, 2689 insertions(+), 305 deletions(-) -++ create mode 100644 patches/0001-20251104commit.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index d8303e45..73773c22 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -++ # y = y + self.shared_experts(identity) -++ # return y -++ -+++ # @no_grad() -+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ -+++ # expert_cache = ops.zeros_like(x) -+++ # for i in range(self.num_experts_per_tok): -+++ # expert_id = flat_expert_indices[i].item() -+++ # weight = flat_expert_weights[i].item() -+++ # expert = self.experts[expert_id] -+++ # expert_out = expert(x) -+++ # expert_cache += expert_out * weight -+++ # return expert_cache -+++ -++ @no_grad() -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ # x 的 shape: (1, hidden_size) -+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++ -+++ # 1. 收集所有需要的专家层 -+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++ selected_experts = [self.experts[i] for i in flat_expert_indices] -+++ -+++ # 2. 并行计算所有专家的输出 -+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++ # ops.cat 会将它们堆叠成一个新的 Tensor -+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++ -+++ # 3. 使用矩阵乘法进行加权求和 -+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ # 最终结果 final_output 的 shape: (1, hidden_size) -+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++ -+++ return final_output -++ -++- expert_cache = ops.zeros_like(x) -++- for i in range(self.num_experts_per_tok): -++- expert_id = flat_expert_indices[i].item() -++- weight = flat_expert_weights[i].item() -++- expert = self.experts[expert_id] -++- expert_out = expert(x) -++- expert_cache += expert_out * weight -++- return expert_cache -++ -++ @no_grad() -++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++ # @lwx -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -+++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++ -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -++ return attn_output, attn_weights, past_key_value -++ -++ -+++# class DeepseekFlashAttention(nn.Module): -+++# """ -+++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+++ -+++# This class is designed as a drop-in replacement for DeepseekAttention. -+++# """ -+++ -+++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++# super().__init__() -+++# self.config = config -+++# self.layer_idx = layer_idx -+++# if layer_idx is None: -+++# logger.warning( -+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++# "when creating this class." -+++# ) -+++ -+++# self.attention_dropout = config.attention_dropout -+++# self.hidden_size = config.hidden_size -+++# self.num_heads = config.num_attention_heads -+++# self.head_dim = self.hidden_size // self.num_heads -+++# self.num_key_value_heads = config.num_key_value_heads -+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++# self.max_position_embeddings = config.max_position_embeddings -+++# self.rope_theta = config.rope_theta -+++# self.is_causal = True -+++ -+++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++# raise ValueError( -+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++# f" and `num_heads`: {self.num_heads})." -+++# ) -+++ -+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++# self._init_rope() -+++ -+++# def _init_rope(self): -+++# if self.config.rope_scaling is None: -+++# self.rotary_emb = DeepseekRotaryEmbedding( -+++# self.head_dim, -+++# max_position_embeddings=self.max_position_embeddings, -+++# base=self.rope_theta, -+++# ) -+++# else: -+++# scaling_type = self.config.rope_scaling["type"] -+++# scaling_factor = self.config.rope_scaling["factor"] -+++# if scaling_type == "linear": -+++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++# self.head_dim, -+++# max_position_embeddings=self.max_position_embeddings, -+++# scaling_factor=scaling_factor, -+++# base=self.rope_theta, -+++# ) -+++# elif scaling_type == "dynamic": -+++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++# self.head_dim, -+++# max_position_embeddings=self.max_position_embeddings, -+++# scaling_factor=scaling_factor, -+++# base=self.rope_theta, -+++# ) -+++# else: -+++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++ -+++# def forward( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# attention_mask: Optional[mindspore.Tensor] = None, -+++# position_ids: Optional[mindspore.Tensor] = None, -+++# past_key_value: Optional[Cache] = None, -+++# output_attentions: bool = False, -+++# use_cache: bool = False, -+++# **kwargs, -+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++# if "padding_mask" in kwargs: -+++# warnings.warn( -+++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++# ) -+++ -+++# if output_attentions: -+++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+++ -+++# bsz, q_len, _ = hidden_states.shape -+++ -+++# if self.config.pretraining_tp > 1: -+++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++ -+++# query_states = self.q_proj(hidden_states) -+++# key_states = self.k_proj(hidden_states) -+++# value_states = self.v_proj(hidden_states) -+++ -+++# # Reshape for multi-head attention -+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++# kv_seq_len = key_states.shape[-2] -+++# if past_key_value is not None: -+++# if self.layer_idx is None: -+++# raise ValueError( -+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++# "with a layer index." -+++# ) -+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++# # Apply Rotary Positional Embedding -+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++# if past_key_value is not None: -+++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ -+++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++ -+++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++ -+++# # Convert attention_mask for flash_attention_score -+++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+++# if attention_mask is not None: -+++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++# raise ValueError( -+++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++# ) -+++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+++# else: -+++# attn_mask_for_fa = None -+++ -+++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++ -+++# # Call the fused flash_attention_score operator -+++# attn_output = mindspore.ops.flash_attention_score( -+++# query=query_states_for_fa, -+++# key=key_states_for_fa, -+++# value=value_states_for_fa, -+++# head_num=self.num_heads, # This is N1, the number of query heads -+++# input_layout='BSH', -+++# attn_mask=attn_mask_for_fa, -+++# keep_prob=keep_prob, -+++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++# sparse_mode=0 # Default mask mode -+++# ) -+++ -+++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+++# attn_output = self.o_proj(attn_output) -+++ -+++# # Flash Attention does not return attention weights -+++# attn_weights = None -+++ -+++# return attn_output, attn_weights, past_key_value -+++ -+++class DeepseekFlashAttention(nn.Module): -+++ """ -+++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -+++ This implementation is a drop-in replacement for the original DeepseekAttention class, -+++ designed for high performance on supported hardware (Ascend). -+++ -+++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -+++ """ -+++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++ super().__init__() -+++ self.config = config -+++ self.layer_idx = layer_idx -+++ if layer_idx is None: -+++ logger.warning( -+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++ "when creating this class." -+++ ) -+++ -+++ # --- [FIX] Correctly initialize all required attributes --- -+++ self.attention_dropout = config.attention_dropout -+++ self.hidden_size = config.hidden_size -+++ self.num_heads = config.num_attention_heads -+++ self.head_dim = self.hidden_size // self.num_heads -+++ self.num_key_value_heads = config.num_key_value_heads -+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++ self.max_position_embeddings = config.max_position_embeddings -+++ self.rope_theta = config.rope_theta -+++ self.is_causal = True -+++ -+++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++ raise ValueError( -+++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++ f" and `num_heads`: {self.num_heads})." -+++ ) -+++ -+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++ -+++ # This call will now succeed as all attributes are initialized. -+++ self._init_rope() -+++ -+++ def _init_rope(self): -+++ if self.config.rope_scaling is None: -+++ self.rotary_emb = DeepseekRotaryEmbedding( -+++ self.head_dim, -+++ max_position_embeddings=self.max_position_embeddings, -+++ base=self.rope_theta, -+++ ) -+++ else: -+++ scaling_type = self.config.rope_scaling["type"] -+++ scaling_factor = self.config.rope_scaling["factor"] -+++ if scaling_type == "linear": -+++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++ self.head_dim, -+++ max_position_embeddings=self.max_position_embeddings, -+++ scaling_factor=scaling_factor, -+++ base=self.rope_theta, -+++ ) -+++ elif scaling_type == "dynamic": -+++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++ self.head_dim, -+++ max_position_embeddings=self.max_position_embeddings, -+++ scaling_factor=scaling_factor, -+++ base=self.rope_theta, -+++ ) -+++ else: -+++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++ -+++ def forward( -+++ self, -+++ hidden_states: mindspore.Tensor, -+++ attention_mask: Optional[mindspore.Tensor] = None, -+++ position_ids: Optional[mindspore.Tensor] = None, -+++ past_key_value: Optional[Cache] = None, -+++ output_attentions: bool = False, -+++ use_cache: bool = False, -+++ **kwargs, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ if "padding_mask" in kwargs: -+++ warnings.warn( -+++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++ ) -+++ if output_attentions: -+++ warnings.warn( -+++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -+++ ) -+++ -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ if self.config.pretraining_tp > 1: -+++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++ -+++ query_states = self.q_proj(hidden_states) -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++ # Apply Rotary Position Embedding -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos} -+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -+++ # So we must explicitly repeat the KV heads. -+++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++ -+++ # Convert attention mask for flash_attention_score -+++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -+++ if attention_mask is not None: -+++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++ raise ValueError( -+++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++ ) -+++ attn_mask_for_fa = attention_mask < 0 -+++ else: -+++ attn_mask_for_fa = None -+++ -+++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++ -+++ # Call the fused operator using the efficient BNSD layout -+++ attn_output = mindspore.ops.flash_attention_score( -+++ query=query_states, -+++ key=key_states, -+++ value=value_states, -+++ head_num=self.num_heads, -+++ input_layout='BNSD', # Specify the correct layout -+++ attn_mask=attn_mask_for_fa, -+++ keep_prob=keep_prob, -+++ scalar_value=1.0 / math.sqrt(self.head_dim) -+++ ) -+++ -+++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ -+++ # Apply output projection -+++ attn_output = self.o_proj(attn_output) -+++ -+++ # Flash attention does not return attention weights, so we return None. -+++ attn_weights = None -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -++ Deepseek_ATTENTION_CLASSES = { -++ "eager": DeepseekAttention, -+++ "flash-attention": DeepseekFlashAttention, -++ } -++ -++ -++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -++ config=config, layer_idx=layer_idx -++ ) -++ -+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+++ config=config, layer_idx=layer_idx -+++ ) -+++ -++ self.mlp = ( -++ DeepseekMoE(config) -++ if ( -++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++index d4c6b651..bced285c 100644 -++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -++ -++ import mindspore -++ import mindnlp.core.nn.functional as F -++-from mindnlp.core import nn, ops -+++from mindnlp.core import nn, ops, no_grad -++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -++ -++ from ....common.activations import ACT2FN -++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++ -+++Long_Prompt = False -+++PROMPT_LENGTH_THRESHOLD = 128 -++ -++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++ def _prepare_4d_causal_attention_mask_with_cache_position( -++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -++ return attn_output, attn_weights, past_key_value -++ -++ -+++# class Qwen2MoeFlashAttention(nn.Module): -+++# """ -+++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++ -+++# 关键改动: -+++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++# 直接传入原始的 key 和 value 张量效率更高。 -+++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++# super().__init__() -+++# self.config = config -+++# self.layer_idx = layer_idx -+++# self.hidden_size = config.hidden_size -+++# self.num_heads = config.num_attention_heads -+++# self.head_dim = self.hidden_size // self.num_heads -+++# self.num_key_value_heads = config.num_key_value_heads -+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++# self.max_position_embeddings = config.max_position_embeddings -+++# self.rope_theta = config.rope_theta -+++# self.attention_dropout = config.attention_dropout -+++ -+++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++# raise ValueError( -+++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++# ) -+++ -+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++ -+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++# self.head_dim, -+++# max_position_embeddings=self.max_position_embeddings, -+++# base=self.rope_theta, -+++# ) -+++ -+++# def forward( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# attention_mask: Optional[mindspore.Tensor] = None, -+++# position_ids: Optional[mindspore.Tensor] = None, -+++# past_key_value: Optional[Cache] = None, -+++# output_attentions: bool = False, -+++# use_cache: bool = False, -+++# cache_position: Optional[mindspore.Tensor] = None, -+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++# bsz, q_len, _ = hidden_states.shape -+++ -+++# # 1. 线性投射 Q, K, V -+++# query_states = self.q_proj(hidden_states) -+++# key_states = self.k_proj(hidden_states) -+++# value_states = self.v_proj(hidden_states) -+++ -+++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++# # query: [B, S, H*D] -> [B, N1, S, D] -+++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++# # 3. RoPE 旋转位置编码 -+++# kv_seq_len = key_states.shape[-2] -+++# if past_key_value is not None: -+++# if self.layer_idx is None: -+++# raise ValueError( -+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++# "with a layer index." -+++# ) -+++# # 对于 StaticCache,需要特殊处理 kv_seq_len -+++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++# if cache_position.shape[0] == 1: -+++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++# kv_seq_len = past_seen_tokens + 1 -+++# else: -+++# # prefill 阶段:cache_position 是范围,使用其长度 -+++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++# else: -+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++# # 4. KV 缓存更新 -+++# if past_key_value is not None: -+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++# key_states, value_states = past_key_value.update( -+++# key_states, value_states, self.layer_idx, cache_kwargs -+++# ) -+++ -+++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++# if cache_position.shape[0] == 1: -+++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++# kv_seq_len = key_states.shape[-2] -+++ -+++# # 5. [重要] 准备 Attention Mask -+++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++# fa_attention_mask = None -+++# if attention_mask is not None: -+++# # 截取与当前key长度匹配的部分 -+++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++# # 转换为布尔类型: 大负数 -> True, 0 -> False -+++# fa_attention_mask = (mask_slice != 0) -+++ -+++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++# input_dtype = query_states.dtype -+++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++# query_states = query_states.to(mindspore.float16) -+++# key_states = key_states.to(mindspore.float16) -+++# value_states = value_states.to(mindspore.float16) -+++ -+++# # 6. [核心] 调用 flash_attention_score 算子 -+++# # - 无需手动 repeat_kv, 算子原生支持 GQA -+++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++# attn_output = mindspore.ops.flash_attention_score( -+++# query=query_states, -+++# key=key_states, -+++# value=value_states, -+++# head_num=self.num_heads, # 传入Q的头数(N1) -+++# attn_mask=fa_attention_mask, -+++# keep_prob=1.0 - self.attention_dropout, -+++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++# input_layout="BNSD", -+++# sparse_mode=0 # 使用 defaultMask 模式 -+++# ) -+++ -+++# # 恢复原始数据类型 -+++# attn_output = attn_output.to(input_dtype) -+++ -+++# # 7. 调整输出形状 -+++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++# attn_output = self.o_proj(attn_output) -+++ -+++# # FlashAttention 算子不直接返回注意力权重矩阵 -+++# attn_weights = None -+++# if output_attentions: -+++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++# return attn_output, attn_weights, past_key_value -+++ -+++# # def forward( -+++# # self, -+++# # hidden_states: mindspore.Tensor, -+++# # attention_mask: Optional[mindspore.Tensor] = None, -+++# # position_ids: Optional[mindspore.Tensor] = None, -+++# # past_key_value: Optional[Cache] = None, -+++# # output_attentions: bool = False, -+++# # use_cache: bool = False, -+++# # cache_position: Optional[mindspore.Tensor] = None, -+++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++# # bsz, q_len, _ = hidden_states.shape -+++ -+++# # # 1. 线性投射 Q, K, V -+++# # query_states = self.q_proj(hidden_states) -+++# # key_states = self.k_proj(hidden_states) -+++# # value_states = self.v_proj(hidden_states) -+++ -+++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -+++# # # 3. RoPE 旋转位置编码 -+++# # kv_seq_len = key_states.shape[-2] -+++# # if past_key_value is not None: -+++# # if self.layer_idx is None: -+++# # raise ValueError( -+++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++# # "with a layer index." -+++# # ) -+++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++# # # 4. KV 缓存更新 -+++# # if past_key_value is not None: -+++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++# # key_states, value_states = past_key_value.update( -+++# # key_states, value_states, self.layer_idx, cache_kwargs -+++# # ) -+++ -+++# # # 5. 准备 Attention Mask -+++# # fa_attention_mask = None -+++# # if attention_mask is not None: -+++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++# # fa_attention_mask = (mask_slice != 0) -+++ -+++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++# # input_dtype = query_states.dtype -+++ -+++# # # 6. [核心] 调用 flash_attention_score 算子 -+++# # attn_output = mindspore.ops.flash_attention_score( -+++# # query=query_states, -+++# # key=key_states, -+++# # value=value_states, -+++# # head_num=self.num_heads, -+++# # attn_mask=fa_attention_mask, -+++# # keep_prob=1.0 - self.attention_dropout, -+++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++# # input_layout="BNSD", -+++# # sparse_mode=0, -+++# # # <--- 修改点 2: 启用内部高精度计算 --- -+++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++# # inner_precise=1 -+++# # ) -+++ -+++# # # 恢复原始数据类型 -+++# # attn_output = attn_output.to(input_dtype) -+++ -+++# # # 7. 调整输出形状 -+++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++# # attn_output = self.o_proj(attn_output) -+++ -+++# # attn_weights = None -+++# # if output_attentions: -+++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ -+++# # return attn_output, attn_weights, past_key_value -+++ -+++ -++ class Qwen2MoeFlashAttention(nn.Module): -++ """ -++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++- -++- 关键改动: -++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++- 直接传入原始的 key 和 value 张量效率更高。 -++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -+++ -+++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -+++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -+++ 完全使用模型的低精度数据类型(如 float16)进行计算, -+++ 以达到理论上的最高执行速度。 -++ """ -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++ super().__init__() -++ self.config = config -++ self.layer_idx = layer_idx -+++ if layer_idx is None: -+++ logger.warning_once( -+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -+++ ) -+++ -++ self.hidden_size = config.hidden_size -++ self.num_heads = config.num_attention_heads -++ self.head_dim = self.hidden_size // self.num_heads -++ self.num_key_value_heads = config.num_key_value_heads -++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++ self.max_position_embeddings = config.max_position_embeddings -++ self.rope_theta = config.rope_theta -++ self.attention_dropout = config.attention_dropout -++ -++- if (self.head_dim * self.num_heads) != self.hidden_size: -++- raise ValueError( -++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++- ) -++- -++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++- # query: [B, S, H*D] -> [B, N1, S, D] -++- # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++ # 2. 调整形状以匹配 BNSD 布局 -++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- -++- # 3. RoPE 旋转位置编码 -+++ -+++ # 3. RoPE 和 KV 缓存 -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++- if self.layer_idx is None: -++- raise ValueError( -++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++- "with a layer index." -++- ) -++- # 对于 StaticCache,需要特殊处理 kv_seq_len -++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++- # 使用 cache_position 的长度来确定实际的 kv_seq_len -++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++- # 临时解决方案:使用 cache_position 的最大值(如果可能) -++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++- if cache_position.shape[0] == 1: -++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++- kv_seq_len = past_seen_tokens + 1 -++- else: -++- # prefill 阶段:cache_position 是范围,使用其长度 -++- kv_seq_len = cache_position.shape[0] + past_seen_tokens -++- else: -++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++- -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++- # 4. KV 缓存更新 -++ if past_key_value is not None: -++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++- key_states, value_states = past_key_value.update( -++- key_states, value_states, self.layer_idx, cache_kwargs -++- ) -++- -++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++- if cache_position.shape[0] == 1: -++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++- kv_seq_len = key_states.shape[-2] -++- -++- # 5. [重要] 准备 Attention Mask -++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++ # 4. 准备 Attention Mask -++ fa_attention_mask = None -++ if attention_mask is not None: -++- # 截取与当前key长度匹配的部分 -++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++- # 转换为布尔类型: 大负数 -> True, 0 -> False -++ fa_attention_mask = (mask_slice != 0) -++ -++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++- input_dtype = query_states.dtype -++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++- query_states = query_states.to(mindspore.float16) -++- key_states = key_states.to(mindspore.float16) -++- value_states = value_states.to(mindspore.float16) -++- -++- # 6. [核心] 调用 flash_attention_score 算子 -++- # - 无需手动 repeat_kv, 算子原生支持 GQA -++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -++ attn_output = mindspore.ops.flash_attention_score( -++ query=query_states, -++ key=key_states, -++ value=value_states, -++- head_num=self.num_heads, # 传入Q的头数(N1) -+++ head_num=self.num_heads, -++ attn_mask=fa_attention_mask, -++- keep_prob=1.0 - self.attention_dropout, -+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -++ scalar_value=1.0 / math.sqrt(self.head_dim), -++ input_layout="BNSD", -++- sparse_mode=0 # 使用 defaultMask 模式 -+++ sparse_mode=0, -+++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -++ ) -++ -++- # 恢复原始数据类型 -++- attn_output = attn_output.to(input_dtype) -++- -++- # 7. 调整输出形状 -++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++ # 6. 调整输出形状 -++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++ attn_output = self.o_proj(attn_output) -++ -++- # FlashAttention 算子不直接返回注意力权重矩阵 -+++ # 7. 返回结果 -++ attn_weights = None -++ if output_attentions: -++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++ -++ return attn_output, attn_weights, past_key_value -++ -++- # def forward( -++- # self, -++- # hidden_states: mindspore.Tensor, -++- # attention_mask: Optional[mindspore.Tensor] = None, -++- # position_ids: Optional[mindspore.Tensor] = None, -++- # past_key_value: Optional[Cache] = None, -++- # output_attentions: bool = False, -++- # use_cache: bool = False, -++- # cache_position: Optional[mindspore.Tensor] = None, -++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++- -++- # bsz, q_len, _ = hidden_states.shape -++- -++- # # 1. 线性投射 Q, K, V -++- # query_states = self.q_proj(hidden_states) -++- # key_states = self.k_proj(hidden_states) -++- # value_states = self.v_proj(hidden_states) -++- -++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- -++- # # 3. RoPE 旋转位置编码 -++- # kv_seq_len = key_states.shape[-2] -++- # if past_key_value is not None: -++- # if self.layer_idx is None: -++- # raise ValueError( -++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++- # "with a layer index." -++- # ) -++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++ -++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++- -++- # # 4. KV 缓存更新 -++- # if past_key_value is not None: -++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++- # key_states, value_states = past_key_value.update( -++- # key_states, value_states, self.layer_idx, cache_kwargs -++- # ) -++- -++- # # 5. 准备 Attention Mask -++- # fa_attention_mask = None -++- # if attention_mask is not None: -++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++- # fa_attention_mask = (mask_slice != 0) -++- -++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++- # input_dtype = query_states.dtype -++- -++- # # 6. [核心] 调用 flash_attention_score 算子 -++- # attn_output = mindspore.ops.flash_attention_score( -++- # query=query_states, -++- # key=key_states, -++- # value=value_states, -++- # head_num=self.num_heads, -++- # attn_mask=fa_attention_mask, -++- # keep_prob=1.0 - self.attention_dropout, -++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++- # input_layout="BNSD", -++- # sparse_mode=0, -++- # # <--- 修改点 2: 启用内部高精度计算 --- -++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++- # inner_precise=1 -++- # ) -++- -++- # # 恢复原始数据类型 -++- # attn_output = attn_output.to(input_dtype) -+++QWEN2MOE_ATTENTION_CLASSES = { -+++ "eager": Qwen2MoeAttention, -+++ "flash-attention": Qwen2MoeFlashAttention, -+++} -++ -++- # # 7. 调整输出形状 -++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++- # attn_output = self.o_proj(attn_output) -++ -++- # attn_weights = None -++- # if output_attentions: -++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# def __init__(self, config): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# # gating -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++ -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# #@dwj -+++# # 只遍历激活的专家,而非全部专家 -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# num_tokens = hidden_states_reshaped.shape[0] -+++ -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++# routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++# flat_selected_experts = selected_experts.flatten() -+++ -+++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++# token_indices = broadcasted_token_indices.flatten() -+++ -+++# active_experts = ops.unique(flat_selected_experts) -+++ -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++ -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# selected_token_indices = token_indices[mask] -+++# selected_routing_weights = routing_weights.flatten()[mask] -+++ -+++# current_states = hidden_states_reshaped[selected_token_indices] -+++ -+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++ -+++# final_hidden_states = final_hidden_states.index_add( -+++# dim=0, -+++# index=selected_token_indices, -+++# source=expert_output.to(hidden_states.dtype) -+++# ) -+++ -+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++ -++- # return attn_output, attn_weights, past_key_value -+++# final_hidden_states = final_hidden_states + shared_expert_output -+++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -+++ -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# """ -+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+++# `_moe_infer_prefill` (用于长序列处理) 方法。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# # 门控网络 -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# # 专家列表 -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++# # 共享专家 -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# @no_grad() -+++# def _moe_infer_decode( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# """ -+++# 【解码路径】针对 sequence_length=1 的极致优化。 -+++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+++# """ -+++# batch_size, hidden_dim = hidden_states.shape -+++ -+++# expert_outputs_list = [ -+++# ops.cat([ -+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++# ], dim=0) -+++# for i in range(batch_size) -+++# ] -+++ -+++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+++# # shape: (batch_size, top_k, hidden_dim) -+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++ -+++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++ -+++# return moe_output.squeeze(1) -+++ -+++# @no_grad() -+++# def _moe_infer_prefill( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# """ -+++# 【预填充路径】针对 sequence_length > 1 的优化。 -+++# 按专家对 Token 进行分组,并进行批处理。 -+++# """ -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens = hidden_states.shape[0] -+++# flat_selected_experts = selected_experts.flatten() -+++ -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++ -+++# active_experts = ops.unique(flat_selected_experts) -+++ -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++ -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# selected_token_indices = token_indices[mask] -+++# selected_routing_weights = routing_weights.flatten()[mask] -+++ -+++# current_states = hidden_states[selected_token_indices] -+++ -+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++ -+++# moe_output = moe_output.index_add( -+++# dim=0, -+++# index=selected_token_indices, -+++# source=expert_output.to(hidden_states.dtype) -+++# ) -+++# return moe_output -+++ -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# """ -+++# 顶层 forward 方法,作为智能分发器。 -+++# """ -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++- # def forward( -++- # self, -++- # hidden_states: mindspore.Tensor, -++- # attention_mask: Optional[mindspore.Tensor] = None, -++- # position_ids: Optional[mindspore.Tensor] = None, -++- # past_key_value: Optional[Cache] = None, -++- # output_attentions: bool = False, -++- # use_cache: bool = False, -++- # cache_position: Optional[mindspore.Tensor] = None, -++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++- -++- # bsz, q_len, _ = hidden_states.shape -++- -++- # query_states = self.q_proj(hidden_states) -++- # key_states = self.k_proj(hidden_states) -++- # value_states = self.v_proj(hidden_states) -++- -++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- -++- # kv_seq_len = key_states.shape[-2] -++- # if past_key_value is not None: -++- # if self.layer_idx is None: -++- # raise ValueError("`layer_idx` must be specified for caching") -++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++- -++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++- -++- # if past_key_value is not None: -++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++- # key_states, value_states = past_key_value.update( -++- # key_states, value_states, self.layer_idx, cache_kwargs -++- # ) -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++# routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++# moe_output = None -+++# # 在推理时,根据序列长度选择最优路径 -+++# if not self.training: -+++# if sequence_length == 1: -+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++# else: -+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++# else: -+++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+++# raise NotImplementedError("Training path is not implemented.") -+++ -+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+++ -+++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+++ -+++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -+++ -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# """ -+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# # 门控网络 -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# # 专家列表 -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++# # 共享专家 -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# @no_grad() -+++# def _moe_infer_decode( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# batch_size, _ = hidden_states.shape -+++# expert_outputs_list = [ -+++# ops.cat([ -+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++# ], dim=0) -+++# for i in range(batch_size) -+++# ] -+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++# return moe_output.squeeze(1) -+++ -+++# @no_grad() -+++# def _moe_infer_prefill( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens = hidden_states.shape[0] -+++# flat_selected_experts = selected_experts.flatten() -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++# active_experts = ops.unique(flat_selected_experts) -+++ -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# selected_token_indices = token_indices[mask] -+++# selected_routing_weights = routing_weights.flatten()[mask] -+++# current_states = hidden_states[selected_token_indices] -+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++# moe_output = moe_output.index_add( -+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++# ) -+++# return moe_output -+++ -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# """ -+++# 顶层 forward 方法,作为智能分发器。 -+++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+++# """ -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ -+++# # 1. 门控计算 (通用逻辑) -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++# routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++# # 2. 智能分发到最优 MoE 路径 -+++# moe_output = None -+++# if not self.training: -+++# if sequence_length == 1: -+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++# else: -+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++# else: -+++# raise NotImplementedError("Training path is not implemented.") -+++ -+++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++ -+++# # 4. 合并 MoE 输出和共享专家输出 -+++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++ -+++# # 5. 恢复原始形状并返回 -+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -+++# prefill fastest -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# """ -+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# # 门控网络 -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# # 专家列表 -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++# # 共享专家 -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# @no_grad() -+++# def _moe_infer_dispatch( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# """ -+++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+++# """ -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens, _ = hidden_states.shape -+++ -+++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+++# flat_selected_experts = selected_experts.flatten() -+++# flat_routing_weights = routing_weights.flatten() -++ -++- # key_states = repeat_kv(key_states, self.num_key_value_groups) -++- # value_states = repeat_kv(value_states, self.num_key_value_groups) -++- -++- # # <--- 核心修改点: 手动进行高精度缩放 --- -++- # # 在调用算子前,手动将 query_states 除以缩放因子。 -++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++- # query_states = query_states / math.sqrt(self.head_dim) -++- # # <--- 修改结束 --- -++- -++- # fa_attention_mask = None -++- # if attention_mask is not None: -++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++- # fa_attention_mask = (mask_slice != 0) -++- -++- # input_dtype = query_states.dtype -++- -++- # attn_output = mindspore.ops.flash_attention_score( -++- # query=query_states, # 传入已经预先缩放过的 query -++- # key=key_states, -++- # value=value_states, -++- # head_num=self.num_heads, -++- # attn_mask=fa_attention_mask, -++- # keep_prob=1.0 - self.attention_dropout, -++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++- # input_layout="BNSD", -++- # sparse_mode=0, -++- # inner_precise=1 # 仍然保持内部高精度计算 -++- # ) -+++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++ -++- # attn_output = attn_output.to(input_dtype) -++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++- # attn_output = self.o_proj(attn_output) -+++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+++# active_experts = ops.unique(flat_selected_experts) -+++ -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++ -+++# # 找到所有分配给该专家的 token -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++ -+++# # 使用 mask 选取对应的 token 和权重 -+++# current_token_indices = token_indices[mask] -+++# current_routing_weights = flat_routing_weights[mask] -+++# current_hidden_states = hidden_states[current_token_indices] -+++ -+++# # 对这些 token 进行批处理 -+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++ -+++# # 使用 index_add 将结果精确地加回到对应位置 -+++# moe_output = moe_output.index_add( -+++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+++# ) -+++# return moe_output -+++ -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# """ -+++# 顶层 forward 方法,作为智能分发器。 -+++# """ -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ -+++# # 1. 门控计算 -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++# routing_weights = routing_weights.to(hidden_states.dtype) -+++ -+++# # 2. 调用统一的 MoE 计算内核 -+++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++ -++- # attn_weights = None -++- # if output_attentions: -++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++# # 3. 统一处理共享专家 -+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++ -+++# # 4. 合并输出 -+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++ -+++# # 5. 恢复原始形状并返回 -+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -+++ -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# """ -+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++# 【最终高性能与高精度版】: -+++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+++# 3. 这样实现了速度和准确性的两全其美。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# @no_grad() -+++# def _moe_infer_decode( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# """ -+++# 【解码路径】极致优化版:bmm + 高精度累加。 -+++# """ -+++# original_dtype = hidden_states.dtype -+++# batch_size, _ = hidden_states.shape -+++ -+++# expert_outputs_list = [ -+++# ops.cat([ -+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++# ], dim=0) -+++# for i in range(batch_size) -+++# ] -+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++ -+++# # 在 float32 下执行 bmm,得到高精度结果 -+++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++ -+++# # 将高精度结果转换回原始数据类型 -+++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+++ -+++# return moe_output -+++ -+++# @no_grad() -+++# def _moe_infer_prefill( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# selected_experts: mindspore.Tensor, -+++# routing_weights: mindspore.Tensor -+++# ) -> mindspore.Tensor: -+++# """ -+++# 【预填充路径】与原始实现一致,结果精确。 -+++# """ -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens, _ = hidden_states.shape -+++# flat_selected_experts = selected_experts.flatten() -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++# active_experts = ops.unique(flat_selected_experts) -+++ -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# selected_token_indices = token_indices[mask] -+++# selected_routing_weights = routing_weights.flatten()[mask] -+++# current_states = hidden_states[selected_token_indices] -+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++# moe_output = moe_output.index_add( -+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++# ) -+++# return moe_output -+++ -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++ -++- # return attn_output, attn_weights, past_key_value -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+++# # 如果模型主体是 float16,后续再转换 -+++ -+++# moe_output = None -+++# if not self.training: -+++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+++# # _moe_infer_decode 内部会处理好类型转换 -+++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+++# if sequence_length == 1: -+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++# else: -+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++# else: -+++# raise NotImplementedError("Training path is not implemented.") -+++ -+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++ -+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -++ -++-QWEN2MOE_ATTENTION_CLASSES = { -++- "eager": Qwen2MoeAttention, -++- "flash-attention": Qwen2MoeFlashAttention, -++-} -+++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++# """ -+++# 【融合版】一个混合专家模块,内置两种推理策略, -+++# 由外部全局变量 `Long_Prompt` 控制: -+++ -+++# - if Long_Prompt is True: 【精度优先模式】 -+++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+++# 适用于处理长序列,避免误差累积。 -+++ -+++# - if Long_Prompt is False: 【速度优先模式】 -+++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+++# 在解码阶段获得极致速度,同时保证结果高度准确。 -+++# """ -+++# def __init__(self, config: Qwen2MoeConfig): -+++# super().__init__() -+++# self.num_experts = config.num_experts -+++# self.top_k = config.num_experts_per_tok -+++# self.norm_topk_prob = config.norm_topk_prob -+++ -+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++# self.experts = nn.ModuleList( -+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++# ) -+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++# # --- 速度优先模式的辅助函数 --- -+++# @no_grad() -+++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++# original_dtype = hidden_states.dtype -+++# batch_size, _ = hidden_states.shape -+++# expert_outputs_list = [ -+++# ops.cat([ -+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++# ], dim=0) -+++# for i in range(batch_size) -+++# ] -+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++# weights_fp32 = routing_weights.to(mindspore.float32) -+++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++# return moe_output_fp32.squeeze(1).to(original_dtype) -+++ -+++# @no_grad() -+++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens, _ = hidden_states.shape -+++# flat_selected_experts = selected_experts.flatten() -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++# active_experts = ops.unique(flat_selected_experts) -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# selected_token_indices = token_indices[mask] -+++# selected_routing_weights = routing_weights.flatten()[mask] -+++# current_states = hidden_states[selected_token_indices] -+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+++# return moe_output -+++ -+++# # --- 精度优先模式的辅助函数 --- -+++# @no_grad() -+++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++# moe_output = ops.zeros_like(hidden_states) -+++# num_tokens, _ = hidden_states.shape -+++# flat_selected_experts = selected_experts.flatten() -+++# flat_routing_weights = routing_weights.flatten() -+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++# active_experts = ops.unique(flat_selected_experts) -+++# for expert_idx_tensor in active_experts: -+++# expert_idx = expert_idx_tensor.item() -+++# expert_layer = self.experts[expert_idx] -+++# mask = (flat_selected_experts == expert_idx_tensor) -+++# current_token_indices = token_indices[mask] -+++# current_routing_weights = flat_routing_weights[mask] -+++# current_hidden_states = hidden_states[current_token_indices] -+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++# return moe_output -+++ -+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++# # 声明我们将要使用一个在模块外部定义的全局变量 -+++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+++# global Long_Prompt -+++ -+++# # 1. 门控计算 (所有模式通用) -+++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++# router_logits = self.gate(hidden_states_reshaped) -+++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++# if self.norm_topk_prob: -+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++# moe_output = None -+++# if not self.training: -+++# # 根据 Long_Prompt 标志选择模式 -+++# if Long_Prompt: -+++# # --- 精度优先模式 --- -+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++# else: -+++# # --- 速度优先模式 --- -+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++# if sequence_length == 1: -+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++# else: -+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++# else: -+++# raise NotImplementedError("Training path is not implemented.") -+++ -+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++ -+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++ -+++# return final_hidden_states, router_logits -+++ -+++class Qwen2MoeSparseMoeBlock(nn.Module): -+++ """ -+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+++ 控制的顶级推理策略: -++ -+++ - if Long_Prompt is True: 【精度优先模式】 -+++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -+++ 适用于需要严格可复现性的长序列任务。 -++ -++-class Qwen2MoeSparseMoeBlock(nn.Module): -++- def __init__(self, config): -+++ - if Long_Prompt is False: 【速度优先模式】 -+++ 采用业界最强的性能组合: -+++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -+++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -+++ """ -+++ def __init__(self, config: Qwen2MoeConfig): -++ super().__init__() -++ self.num_experts = config.num_experts -++ self.top_k = config.num_experts_per_tok -++ self.norm_topk_prob = config.norm_topk_prob -++ -++- # gating -++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++ self.experts = nn.ModuleList( -++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++ ) -++- -++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++ -++- #@dwj -++- # 只遍历激活的专家,而非全部专家 -++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++- num_tokens = hidden_states_reshaped.shape[0] -++- -++- router_logits = self.gate(hidden_states_reshaped) -++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- -++- if self.norm_topk_prob: -++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- routing_weights = routing_weights.to(hidden_states.dtype) -++- -++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++- flat_selected_experts = selected_experts.flatten() -++- -++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++- token_indices = broadcasted_token_indices.flatten() -++- -++- active_experts = ops.unique(flat_selected_experts) -++- -++- for expert_idx_tensor in active_experts: -++- expert_idx = expert_idx_tensor.item() -++- expert_layer = self.experts[expert_idx] -++- -++- mask = (flat_selected_experts == expert_idx_tensor) -++- selected_token_indices = token_indices[mask] -++- selected_routing_weights = routing_weights.flatten()[mask] -++- -++- current_states = hidden_states_reshaped[selected_token_indices] -++- -++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++- -++- final_hidden_states = final_hidden_states.index_add( -++- dim=0, -++- index=selected_token_indices, -++- source=expert_output.to(hidden_states.dtype) -++- ) -++- -++- shared_expert_output = self.shared_expert(hidden_states_reshaped) -++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -+++ @no_grad() -+++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++ original_dtype = hidden_states.dtype -+++ batch_size, _ = hidden_states.shape -+++ expert_outputs_list = [ -+++ ops.cat([ -+++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++ ], dim=0) -+++ for i in range(batch_size) -+++ ] -+++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++ weights_fp32 = routing_weights.to(mindspore.float32) -+++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++ return moe_output_fp32.squeeze(1).to(original_dtype) -+++ -+++ @no_grad() -+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++ num_tokens, _ = hidden_states.shape -+++ flat_selected_experts = selected_experts.flatten() -+++ sorted_expert_indices = flat_selected_experts.argsort() -+++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++ original_token_indices = sorted_expert_indices // self.top_k -+++ moe_output = ops.zeros_like(hidden_states) -+++ current_token_offset = 0 -+++ for i in range(self.num_experts): -+++ expert_token_count = tokens_per_expert[i] - current_token_offset -+++ if expert_token_count == 0: -+++ continue -+++ end_offset = current_token_offset + expert_token_count -+++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++ expert_hidden_states = hidden_states[expert_original_token_indices] -+++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++ current_token_offset += expert_token_count -+++ return moe_output -+++ -+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+++ @no_grad() -+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++ moe_output = ops.zeros_like(hidden_states) -+++ num_tokens, _ = hidden_states.shape -+++ flat_selected_experts = selected_experts.flatten() -+++ flat_routing_weights = routing_weights.flatten() -+++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++ active_experts = ops.unique(flat_selected_experts) -+++ for expert_idx_tensor in active_experts: -+++ expert_idx = expert_idx_tensor.item() -+++ expert_layer = self.experts[expert_idx] -+++ mask = (flat_selected_experts == expert_idx_tensor) -+++ current_token_indices = token_indices[mask] -+++ current_routing_weights = flat_routing_weights[mask] -+++ current_hidden_states = hidden_states[current_token_indices] -+++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++ return moe_output -++ -++- final_hidden_states = final_hidden_states + shared_expert_output -++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++- -++- return final_hidden_states, router_logits -+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++ global Long_Prompt -+++ -+++ # 1. 门控计算 (所有模式通用) -+++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++ router_logits = self.gate(hidden_states_reshaped) -+++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++ if self.norm_topk_prob: -+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++ moe_output = None -+++ if Long_Prompt: -+++ # --- 精度优先模式 (ACCURACY MODE) --- -+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ else: -+++ # --- 速度优先模式 (SPEED MODE) --- -+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++ if sequence_length == 1: -+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ else: -+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ -++ -+++ # 3. 共享专家计算与合并 (所有模式通用) -+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++ -+++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++ -+++ return final_hidden_states, router_logits -++ -++ class Qwen2MoeDecoderLayer(nn.Module): -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++ super().__init__() -++ self.hidden_size = config.hidden_size -+++ -+++ # if Long_Prompt: -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ # else: -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++ -++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ -++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++- -++ if (layer_idx not in config.mlp_only_layers) and ( -++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++ ): -++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ self._warmed_up = True -++ self.warmup_moe_model() -++ -+++ -+++ -++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++ output_router_logits = ( -++ output_router_logits if output_router_logits is not None else self.config.output_router_logits -++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ router_logits=outputs.router_logits, -++ ) -++ -+++ def generate(self, *args, **kwargs): -+++ """ -+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+++ """ -+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++ -+++ input_ids = kwargs.get("input_ids") -+++ if input_ids is None and args: -+++ input_ids = args[0] -+++ -+++ if input_ids is not None: -+++ prompt_length = input_ids.shape[1] -+++ -+++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -+++ Long_Prompt = True -+++ else: -+++ Long_Prompt = False -+++ -+++ return super().generate(*args, **kwargs) -+++ -++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -++ def prepare_inputs_for_generation( -++ self, -++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -++ # Exception 1: when passing input_embeds, input_ids may be missing entries -++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -+++ -++ if past_key_values is not None: -++ if inputs_embeds is not None: # Exception 1 -++ if 0 not in input_ids.shape: -++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ } -++ ) -++ return model_inputs -+++ -++ # @lwx -++ # def _decode_one_tokens_logits( -++ # self, -++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -++ attentions=outputs.attentions, -++ ) -++ -+++ -++ __all__ = [ -++ "Qwen2MoeForCausalLM", -++ "Qwen2MoeModel", -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++new file mode 100644 -++index 00000000..6dfb5b93 -++--- /dev/null -+++++ b/patches/0001-20251104commit.patch -++@@ -0,0 +1,1272 @@ -+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++Subject: [PATCH] 20251104commit -+++ -+++--- -+++ mindnlp/transformers/cache_utils.py | 28 +- -+++ .../models/deepseek/modeling_deepseek.py | 149 ++- -+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++ 3 files changed, 976 insertions(+), 87 deletions(-) -+++ -+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++index cadd2e04..02f8d4be 100644 -+++--- a/mindnlp/transformers/cache_utils.py -++++++ b/mindnlp/transformers/cache_utils.py -+++@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++ # k_out[:, :, cache_position] = key_states -+++ # v_out[:, :, cache_position] = value_states -+++- if ON_ORANGE_PI: -+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++- else: -+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++- -++++ # if ON_ORANGE_PI: -++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++ # else: -++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++ # 确保 cache_position 是 1D tensor 并且类型正确 -++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++ if cache_position.ndim > 1: -++++ cache_position = cache_position.flatten() -++++ # 确保类型是 int32 或 int64(MindSpore 要求) -++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++ cache_position = cache_position.int() -++++ -++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++ k_out[:, :, cache_position] = key_states -++++ v_out[:, :, cache_position] = value_states -++++ -+++ return k_out, v_out -+++ -+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index c695b944..d8303e45 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++- x1 = x[..., : x.shape[-1] // 2] -+++- x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++ # x1 = x[..., : x.shape[-1] // 2] -++++ # x2 = x[..., x.shape[-1] // 2 :] -++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++ if self.training: -+++ raise NotImplementedError("Training is not supported yet.") -+++ else: -+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++- if self.config.n_shared_experts is not None: -+++- y = y + self.shared_experts(identity) -+++- return y -++++ # @lwx -++++ if orig_shape[1] == 1: -++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++ y=y.view(*orig_shape) -++++ if self.config.n_shared_experts is not None: -++++ y = y + self.shared_experts(identity) -++++ return y -++++ else: -++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++ if self.config.n_shared_experts is not None: -++++ y = y + self.shared_experts(identity) -++++ return y -++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++ # if self.config.n_shared_experts is not None: -++++ # y = y + self.shared_experts(identity) -++++ # return y -++++ -++++ @no_grad() -++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ -++++ expert_cache = ops.zeros_like(x) -++++ for i in range(self.num_experts_per_tok): -++++ expert_id = flat_expert_indices[i].item() -++++ weight = flat_expert_weights[i].item() -++++ expert = self.experts[expert_id] -++++ expert_out = expert(x) -++++ expert_cache += expert_out * weight -++++ return expert_cache -+++ -+++ @no_grad() -+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++- # expert_cache = torch.zeros_like(x) -+++- # idxs = flat_expert_indices.argsort() -+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++- # token_idxs = idxs // self.num_experts_per_tok -+++- # for i, end_idx in enumerate(tokens_per_expert): -+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++- # if start_idx == end_idx: -+++- # continue -+++- # expert = self.experts[i] -+++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++- # expert_tokens = x[exp_token_idx] -+++- # expert_out = expert(expert_tokens) -+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++- # return expert_cache -++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++ expert_cache = ops.zeros_like(x) -+++ idxs = flat_expert_indices.argsort() -+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ token_idxs = idxs // self.num_experts_per_tok -++++ -+++ for i, end_idx in enumerate(tokens_per_expert): -+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ if start_idx == end_idx: -+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++ expert_out = expert(expert_tokens) -+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -+++ return expert_cache -++++ -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # # expert_cache = torch.zeros_like(x) -++++ # # idxs = flat_expert_indices.argsort() -++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++ # # token_idxs = idxs // self.num_experts_per_tok -++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++ # # if start_idx == end_idx: -++++ # # continue -++++ # # expert = self.experts[i] -++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # # expert_tokens = x[exp_token_idx] -++++ # # expert_out = expert(expert_tokens) -++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++ # # return expert_cache -++++ # expert_cache = ops.zeros_like(x) -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # for i, end_idx in enumerate(tokens_per_expert): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # if start_idx == end_idx: -++++ # continue -++++ # expert = self.experts[i] -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = expert(expert_tokens) -++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -++++ # return expert_cache -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # expert_cache = ops.zeros_like(x) -++++ -++++ # # 排序保证顺序一致 -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # # 找出有 token 的专家 -++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++ -++++ # for i in active_experts.tolist(): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # end_idx = tokens_per_expert[i] -++++ # if start_idx == end_idx: # 没有 token -++++ # continue -++++ -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = self.experts[i](expert_tokens) -++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++ -++++ # expert_cache = mindspore.mint.scatter_add( -++++ # expert_cache, -++++ # 0, -++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++ # expert_out -++++ # ) -++++ -++++ # return expert_cache -++++ -++++ -+++ -+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++ # """ -+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ -+++ # Initialize weights and apply final processing -+++ self.post_init() -++++ self.warm_up = False -++++ -++++ def warmup_moe_model_deep(self): -++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++ test_texts = [ -++++ "warmup short", -++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++ ] -++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++ if tokenizer is None: -++++ from mindnlp.transformers import AutoTokenizer -++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++ self._warmup_tokenizer = tokenizer -++++ -++++ for text in test_texts: -++++ inputs = tokenizer(text, return_tensors="ms") -++++ with mindspore._no_grad(): -++++ _ = self(**inputs, use_cache=False) -++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++ -+++ def get_input_embeddings(self): -+++ return self.model.embed_tokens -+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++ ```""" -++++ if not self.warm_up: -++++ self.warm_up = True -++++ self.warmup_moe_model_deep() -++++ -+++ output_attentions = ( -+++ output_attentions -+++ if output_attentions is not None -+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++index 3cbf820e..d4c6b651 100644 -+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++@@ -18,7 +18,6 @@ -+++ # See the License for the specific language governing permissions and -+++ # limitations under the License. -+++ """MindSpore Qwen2MoE model.""" -+++- -+++ import math -+++ from typing import List, Optional, Tuple, Union -+++ -+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++ TokenClassifierOutput, -+++ ) -+++ from ...modeling_utils import PreTrainedModel -++++from ...generation import GenerationMixin -+++ from ....utils import logging -+++ from .configuration_qwen2_moe import Qwen2MoeConfig -+++ -+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++ self.variance_epsilon = eps -+++ -+++ def forward(self, hidden_states): -++++ # @dwj -++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++ # @lwx -++++ # if not self.training : -++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++ input_dtype = hidden_states.dtype -+++ hidden_states = hidden_states.to(mindspore.float32) -+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++@@ -234,6 +239,8 @@ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++ x1 = x[..., : x.shape[-1] // 2] -+++ x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++ self.config = config -+++ self.hidden_size = config.hidden_size -+++ self.intermediate_size = intermediate_size -++++ -+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++ self.act_fn = ACT2FN[config.hidden_act] -+++ -+++ def forward(self, x): -+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++- -+++ -++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++ # @lwx -++++ # gate_up_output = self.gate_up_proj(x) -++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++ # return self.down_proj(swiglu_output) -++++ -++++ # def forward(self, x): -++++ # gate_proj_out = self.gate_proj(x) -++++ # up_proj_out = self.up_proj(x) -++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++ # return self.down_proj(swiglu_out) -++++ -+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++ """ -+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++ use_cache: bool = False, -+++ cache_position: Optional[mindspore.Tensor] = None, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ -++++ -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ query_states = self.q_proj(hidden_states) -+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ "with a layer index." -+++ ) -+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ if isinstance(past_key_value, StaticCache): -++++ kv_seq_len = key_states.shape[-2] -++++ else: -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++ if isinstance(past_key_value, StaticCache): -++++ kv_seq_len = key_states.shape[-2] -+++ -+++ # repeat k/v heads if n_kv_heads < n_heads -+++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++- -++++ -+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++ -+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++- raise ValueError( -+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++- f" {attn_weights.shape}" -+++- ) -+++- -+++- if attention_mask is not None: # no matter the length, we just slice it -+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++ if attention_mask is not None: -++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++ attn_weights = attn_weights + causal_mask -+++ -+++ # upcast attention to fp32 -+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++ -+++ attn_output = self.o_proj(attn_output) -+++- -++++ # @lwx -++++ -++++ # max_seq_len = self.max_position_embeddings # 2048 -++++ -++++ # if attention_mask is not None: -++++ # # attention_mask: [B, 1, Sq, Sk] -++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++ -++++ # # pad 到 [max_seq_len, max_seq_len] -++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++ # global_attention_mask = padded_mask -++++ # else: -++++ # global_attention_mask = None -++++ -++++ -++++ # sparse_mode=3 -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, -++++ # key=key_states, -++++ # value=value_states, -++++ # real_shift=None, -++++ # padding_mask=None, -++++ -++++ # head_num=self.num_heads, -++++ # attn_mask=global_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++ # input_layout="BNSD", -++++ # pre_tokens=2147483647, -++++ # next_tokens=2147483647, -++++ # inner_precise=0, -++++ # drop_mask=None, -++++ # prefix=None, -++++ # actual_seq_qlen=None, -++++ # actual_seq_kvlen=None, -++++ # sparse_mode=sparse_mode, -++++ # ) -+++ if not output_attentions: -+++ attn_weights = None -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++ -++++class Qwen2MoeFlashAttention(nn.Module): -++++ """ -++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++ -++++ 关键改动: -++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++ 直接传入原始的 key 和 value 张量效率更高。 -++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++ """ -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++ super().__init__() -++++ self.config = config -++++ self.layer_idx = layer_idx -++++ self.hidden_size = config.hidden_size -++++ self.num_heads = config.num_attention_heads -++++ self.head_dim = self.hidden_size // self.num_heads -++++ self.num_key_value_heads = config.num_key_value_heads -++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++ self.max_position_embeddings = config.max_position_embeddings -++++ self.rope_theta = config.rope_theta -++++ self.attention_dropout = config.attention_dropout -++++ -++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++ raise ValueError( -++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++ ) -++++ -++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++ -++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++ self.head_dim, -++++ max_position_embeddings=self.max_position_embeddings, -++++ base=self.rope_theta, -++++ ) -++++ -++++ def forward( -++++ self, -++++ hidden_states: mindspore.Tensor, -++++ attention_mask: Optional[mindspore.Tensor] = None, -++++ position_ids: Optional[mindspore.Tensor] = None, -++++ past_key_value: Optional[Cache] = None, -++++ output_attentions: bool = False, -++++ use_cache: bool = False, -++++ cache_position: Optional[mindspore.Tensor] = None, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ # 1. 线性投射 Q, K, V -++++ query_states = self.q_proj(hidden_states) -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++ # query: [B, S, H*D] -> [B, N1, S, D] -++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # 3. RoPE 旋转位置编码 -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++ if self.layer_idx is None: -++++ raise ValueError( -++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ "with a layer index." -++++ ) -++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++ if cache_position.shape[0] == 1: -++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++ kv_seq_len = past_seen_tokens + 1 -++++ else: -++++ # prefill 阶段:cache_position 是范围,使用其长度 -++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++ else: -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # 4. KV 缓存更新 -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ key_states, value_states = past_key_value.update( -++++ key_states, value_states, self.layer_idx, cache_kwargs -++++ ) -++++ -++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++ if cache_position.shape[0] == 1: -++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++ kv_seq_len = key_states.shape[-2] -++++ -++++ # 5. [重要] 准备 Attention Mask -++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++ fa_attention_mask = None -++++ if attention_mask is not None: -++++ # 截取与当前key长度匹配的部分 -++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++ fa_attention_mask = (mask_slice != 0) -++++ -++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++ input_dtype = query_states.dtype -++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++ query_states = query_states.to(mindspore.float16) -++++ key_states = key_states.to(mindspore.float16) -++++ value_states = value_states.to(mindspore.float16) -++++ -++++ # 6. [核心] 调用 flash_attention_score 算子 -++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++ attn_output = mindspore.ops.flash_attention_score( -++++ query=query_states, -++++ key=key_states, -++++ value=value_states, -++++ head_num=self.num_heads, # 传入Q的头数(N1) -++++ attn_mask=fa_attention_mask, -++++ keep_prob=1.0 - self.attention_dropout, -++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++ input_layout="BNSD", -++++ sparse_mode=0 # 使用 defaultMask 模式 -++++ ) -++++ -++++ # 恢复原始数据类型 -++++ attn_output = attn_output.to(input_dtype) -++++ -++++ # 7. 调整输出形状 -++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ attn_output = self.o_proj(attn_output) -++++ -++++ # FlashAttention 算子不直接返回注意力权重矩阵 -++++ attn_weights = None -++++ if output_attentions: -++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++ # def forward( -++++ # self, -++++ # hidden_states: mindspore.Tensor, -++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++ # position_ids: Optional[mindspore.Tensor] = None, -++++ # past_key_value: Optional[Cache] = None, -++++ # output_attentions: bool = False, -++++ # use_cache: bool = False, -++++ # cache_position: Optional[mindspore.Tensor] = None, -++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ # bsz, q_len, _ = hidden_states.shape -++++ -++++ # # 1. 线性投射 Q, K, V -++++ # query_states = self.q_proj(hidden_states) -++++ # key_states = self.k_proj(hidden_states) -++++ # value_states = self.v_proj(hidden_states) -++++ -++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # # 3. RoPE 旋转位置编码 -++++ # kv_seq_len = key_states.shape[-2] -++++ # if past_key_value is not None: -++++ # if self.layer_idx is None: -++++ # raise ValueError( -++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ # "with a layer index." -++++ # ) -++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # # 4. KV 缓存更新 -++++ # if past_key_value is not None: -++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ # key_states, value_states = past_key_value.update( -++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++ # ) -++++ -++++ # # 5. 准备 Attention Mask -++++ # fa_attention_mask = None -++++ # if attention_mask is not None: -++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # fa_attention_mask = (mask_slice != 0) -++++ -++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++ # input_dtype = query_states.dtype -++++ -++++ # # 6. [核心] 调用 flash_attention_score 算子 -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, -++++ # key=key_states, -++++ # value=value_states, -++++ # head_num=self.num_heads, -++++ # attn_mask=fa_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++ # input_layout="BNSD", -++++ # sparse_mode=0, -++++ # # <--- 修改点 2: 启用内部高精度计算 --- -++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++ # inner_precise=1 -++++ # ) -++++ -++++ # # 恢复原始数据类型 -++++ # attn_output = attn_output.to(input_dtype) -++++ -++++ # # 7. 调整输出形状 -++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ # attn_output = self.o_proj(attn_output) -++++ -++++ # attn_weights = None -++++ # if output_attentions: -++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++ # return attn_output, attn_weights, past_key_value -++++ -++++ # def forward( -++++ # self, -++++ # hidden_states: mindspore.Tensor, -++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++ # position_ids: Optional[mindspore.Tensor] = None, -++++ # past_key_value: Optional[Cache] = None, -++++ # output_attentions: bool = False, -++++ # use_cache: bool = False, -++++ # cache_position: Optional[mindspore.Tensor] = None, -++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ # bsz, q_len, _ = hidden_states.shape -++++ -++++ # query_states = self.q_proj(hidden_states) -++++ # key_states = self.k_proj(hidden_states) -++++ # value_states = self.v_proj(hidden_states) -++++ -++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # kv_seq_len = key_states.shape[-2] -++++ # if past_key_value is not None: -++++ # if self.layer_idx is None: -++++ # raise ValueError("`layer_idx` must be specified for caching") -++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # if past_key_value is not None: -++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ # key_states, value_states = past_key_value.update( -++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++ # ) -++++ -++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++ -++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++ # query_states = query_states / math.sqrt(self.head_dim) -++++ # # <--- 修改结束 --- -++++ -++++ # fa_attention_mask = None -++++ # if attention_mask is not None: -++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # fa_attention_mask = (mask_slice != 0) -++++ -++++ # input_dtype = query_states.dtype -++++ -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, # 传入已经预先缩放过的 query -++++ # key=key_states, -++++ # value=value_states, -++++ # head_num=self.num_heads, -++++ # attn_mask=fa_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++ # input_layout="BNSD", -++++ # sparse_mode=0, -++++ # inner_precise=1 # 仍然保持内部高精度计算 -++++ # ) -++++ -++++ # attn_output = attn_output.to(input_dtype) -++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ # attn_output = self.o_proj(attn_output) -++++ -++++ # attn_weights = None -++++ # if output_attentions: -++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++ -++++ # return attn_output, attn_weights, past_key_value -++++ -+++ QWEN2MOE_ATTENTION_CLASSES = { -+++ "eager": Qwen2MoeAttention, -++++ "flash-attention": Qwen2MoeFlashAttention, -+++ } -+++ -+++ -+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -++++ #@dwj -++++ # 只遍历激活的专家,而非全部专家 -+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- hidden_states = hidden_states.view(-1, hidden_dim) -+++- # router_logits: (batch * sequence_length, n_experts) -+++- router_logits = self.gate(hidden_states) -+++- -+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- if self.norm_topk_prob: -+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- # we cast back to the input dtype -+++- routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++- final_hidden_states = ops.zeros( -+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++- ) -+++- -+++- # One hot encode the selected experts to create an expert mask -+++- # this will be used to easily index which expert is going to be sollicitated -+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++- -+++- # Loop over all available experts in the model and perform the computation on each expert -+++- for expert_idx in range(self.num_experts): -+++- expert_layer = self.experts[expert_idx] -+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++- -+++- # Index the correct hidden states and compute the expert hidden state for -+++- # the current expert. We need to make sure to multiply the output hidden -+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++- if 0 not in idx.shape: -+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++- -+++- # However `index_add_` only support torch tensors for indexing so we'll use -+++- # the `top_x` tensor here. -+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++- -+++- shared_expert_output = self.shared_expert(hidden_states) -+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++- -+++- final_hidden_states = final_hidden_states + shared_expert_output -++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++ num_tokens = hidden_states_reshaped.shape[0] -++++ -++++ router_logits = self.gate(hidden_states_reshaped) -++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++ if self.norm_topk_prob: -++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++ flat_selected_experts = selected_experts.flatten() -++++ -++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++ token_indices = broadcasted_token_indices.flatten() -++++ -++++ active_experts = ops.unique(flat_selected_experts) -++++ -++++ for expert_idx_tensor in active_experts: -++++ expert_idx = expert_idx_tensor.item() -++++ expert_layer = self.experts[expert_idx] -++++ -++++ mask = (flat_selected_experts == expert_idx_tensor) -++++ selected_token_indices = token_indices[mask] -++++ selected_routing_weights = routing_weights.flatten()[mask] -++++ -++++ current_states = hidden_states_reshaped[selected_token_indices] -++++ -++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++ -++++ final_hidden_states = final_hidden_states.index_add( -++++ dim=0, -++++ index=selected_token_indices, -++++ source=expert_output.to(hidden_states.dtype) -++++ ) -++++ -++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++ -+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++- return final_hidden_states, router_logits -++++ final_hidden_states = final_hidden_states + shared_expert_output -++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++ -++++ return final_hidden_states, router_logits -+++ -+++ -+++ class Qwen2MoeDecoderLayer(nn.Module): -+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++ -+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++ -+++ if (layer_idx not in config.mlp_only_layers) and ( -+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++ ): -+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++ _skip_keys_device_placement = "past_key_values" -+++ _supports_cache_class = True -++++#lwx -++++ # _supports_static_cache = True -+++ -+++ def _init_weights(self, module): -+++ std = self.config.initializer_range -+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++ return causal_mask -+++ -+++ -+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ _tied_weights_keys = ["lm_head.weight"] -+++ -+++ def __init__(self, config): -+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ self.num_experts_per_tok = config.num_experts_per_tok -+++ # Initialize weights and apply final processing -+++ self.post_init() -++++ # @lwx -++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++ # self.generation_config.cache_implementation = "static" -++++ self._warmed_up = False -++++ -++++ def warmup_moe_model(self): -++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++ test_texts = [ -++++ "warmup short", -++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++ ] -++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++ if tokenizer is None: -++++ from mindnlp.transformers import AutoTokenizer -++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++ self._warmup_tokenizer = tokenizer -++++ -++++ for text in test_texts: -++++ inputs = tokenizer(text, return_tensors="ms") -++++ with mindspore._no_grad(): -++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++ -+++ def get_input_embeddings(self): -+++ return self.model.embed_tokens -+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++ ```""" -++++ if not self._warmed_up: -++++ self._warmed_up = True -++++ self.warmup_moe_model() -+++ -+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++ output_router_logits = ( -+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ } -+++ ) -+++ return model_inputs -++++# @lwx -++++ # def _decode_one_tokens_logits( -++++ # self, -++++ # cur_token: mindspore.Tensor, -++++ # input_pos: Optional[mindspore.Tensor], -++++ # cache_position: mindspore.Tensor, -++++ # past_key_values: StaticCache, -++++ # ) -> mindspore.Tensor: -++++ # """ -++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++ -++++ # Args: -++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++ # input_pos: 输入位置信息,可选 -++++ # cache_position: 当前token在cache中的位置,shape为(1,) -++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++ -++++ # Returns: -++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++ # """ -++++ # # 调用JIT编译的版本 -++++ # return self.get_decode_one_tokens_logits( -++++ # cur_token=cur_token, -++++ # input_pos=input_pos, -++++ # cache_position=cache_position, -++++ # past_key_values=past_key_values, -++++ # ) -++++ -++++ # @mindspore.jit(jit_level='O1') -++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++ # """ -++++ # JIT编译的函数,用于高效的单token解码 -++++ # 使用JIT编译优化以支持静态shape和高效执行 -++++ -++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++ # """ -++++ # outputs = self.model.forward( -++++ # input_ids=cur_token, -++++ # position_ids=input_pos, -++++ # cache_position=cache_position, -++++ # past_key_values=past_key_values, -++++ # use_cache=True, -++++ # return_dict=False, -++++ # ) -++++ -++++ # hidden_states = outputs[0] -++++ # logits = self.lm_head.forward(hidden_states) -++++ # logits = logits.float() -++++ -++++ # return logits[:, -1, :] -++++ -++++ # def _sample( -++++ # self, -++++ # input_ids: mindspore.Tensor, -++++ # logits_processor, -++++ # stopping_criteria, -++++ # generation_config, -++++ # synced_devices: bool, -++++ # streamer=None, -++++ # logits_warper=None, -++++ # **model_kwargs, -++++ # ): -++++ # """ -++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++ # """ -++++ # from ...generation.logits_process import LogitsProcessorList -++++ # from ...generation.stopping_criteria import StoppingCriteriaList -++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++ # from mindnlp.core import nn, ops, no_grad -++++ # import numpy as np -++++ -++++ # # 检查是否使用 StaticCache -++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++ # # 否则,直接调用父类方法 -++++ # past_key_values = model_kwargs.get("past_key_values") -++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++ -++++ # if not isinstance(past_key_values, StaticCache): -++++ # # 不使用 StaticCache,直接调用父类方法 -++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++ # return super()._sample( -++++ # input_ids=input_ids, -++++ # logits_processor=logits_processor, -++++ # stopping_criteria=stopping_criteria, -++++ # generation_config=generation_config, -++++ # synced_devices=synced_devices, -++++ # streamer=streamer, -++++ # logits_warper=logits_warper, -++++ # **model_kwargs, -++++ # ) -++++ -++++ # # 使用 StaticCache,进入自定义循环 -++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++ # pad_token_id = generation_config._pad_token_tensor -++++ # output_attentions = generation_config.output_attentions -++++ # output_hidden_states = generation_config.output_hidden_states -++++ # output_scores = generation_config.output_scores -++++ # output_logits = generation_config.output_logits -++++ # return_dict_in_generate = generation_config.return_dict_in_generate -++++ # max_length = generation_config.max_length -++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++ # do_sample = generation_config.do_sample -++++ -++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++ # raise ValueError( -++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++ # f"{logits_warper})." -++++ # ) -++++ -++++ # # init attention / hidden states / scores tuples -++++ # scores = () if (return_dict_in_generate and output_scores) else None -++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++ -++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++ # encoder_hidden_states = ( -++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++ # ) -++++ -++++ # # keep track of which sequences are already finished -++++ # batch_size, cur_len = input_ids.shape -++++ # this_peer_finished = False -++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++ -++++ # time_record = [] -++++ # from ....utils.testing_utils import parse_flag_from_env -++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++ -++++ # while self._has_unfinished_sequences( -++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++ # ): -++++ # if _record_time: -++++ # import time as time_module -++++ # infer_start = time_module.time() -++++ -++++ # # prepare model inputs -++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++ -++++ # # prepare variable output controls -++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++ -++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++ # cur_cache_position = model_inputs.get("cache_position") -++++ # cur_past_key_values = model_inputs.get("past_key_values") -++++ # cur_input_ids = model_inputs.get("input_ids") -++++ -++++ # if (isinstance(cur_past_key_values, StaticCache) and -++++ # cur_cache_position is not None and -++++ # len(cur_cache_position.shape) > 0 and -++++ # cur_cache_position.shape[0] == 1 and -++++ # cur_input_ids is not None and -++++ # cur_input_ids.shape[1] == 1): -++++ # # 使用 JIT 优化的单 token 解码 -++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++ # if not hasattr(self, '_jit_used'): -++++ # self._jit_used = False -++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++ -++++ # next_token_logits = self.get_decode_one_tokens_logits( -++++ # cur_token=cur_input_ids, -++++ # input_pos=model_inputs.get("position_ids"), -++++ # cache_position=cur_cache_position, -++++ # past_key_values=cur_past_key_values, -++++ # ) -++++ -++++ # # 标记已使用JIT(用于后续判断) -++++ # if not self._jit_used: -++++ # self._jit_used = True -++++ -++++ # # 构造兼容的输出对象 -++++ # class JitOptimizedOutput: -++++ # def __init__(self, logits, config): -++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++ # self.config = config -++++ # # 对于 JIT 优化路径,这些属性通常不需要 -++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++ # self.attentions = None if not config.is_encoder_decoder else None -++++ # self.cross_attentions = None -++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++ # self.hidden_states = None if not config.is_encoder_decoder else None -++++ -++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++ # else: -++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++ # outputs = self(**model_inputs, return_dict=True) -++++ -++++ # if synced_devices and this_peer_finished: -++++ # continue -++++ -++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++ # next_token_logits = outputs.logits[:, -1, :] -++++ -++++ # # pre-process distribution -++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++ # if do_sample: -++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++ -++++ # # Store scores, attentions and hidden_states when required -++++ # if return_dict_in_generate: -++++ # if output_scores: -++++ # scores += (next_token_scores,) -++++ # if output_logits: -++++ # raw_logits += (next_token_logits,) -++++ # if output_attentions: -++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++ # decoder_attentions += (attn,) if attn is not None else (None,) -++++ # if self.config.is_encoder_decoder: -++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++ -++++ # if output_hidden_states: -++++ # hidden = ( -++++ # outputs.decoder_hidden_states -++++ # if self.config.is_encoder_decoder -++++ # else outputs.hidden_states -++++ # ) -++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++ -++++ # # token selection -++++ # if do_sample: -++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++ # else: -++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++ -++++ # # finished sentences should have their next token be a padding token -++++ # if has_eos_stopping_criteria: -++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++ -++++ # # update generated ids, model inputs, and length for next step -++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++ # if streamer is not None: -++++ # streamer.put(next_tokens) -++++ -++++ # model_kwargs = self._update_model_kwargs_for_generation( -++++ # outputs, -++++ # model_kwargs, -++++ # is_encoder_decoder=self.config.is_encoder_decoder, -++++ # ) -++++ -++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++ # cur_len += 1 -++++ -++++ # if _record_time: -++++ # import time as time_module -++++ # infer_stop = time_module.time() -++++ # time_record.append(infer_stop - infer_start) -++++ -++++ # del outputs -++++ -++++ # average_infer_time = None -++++ # if time_record: -++++ # if len(time_record) > 1: -++++ # time_record.pop(0) -++++ # average_infer_time = sum(time_record) / len(time_record) -++++ # print(f'average inference time is: {average_infer_time}') -++++ # print(f'inference time record: {time_record}') -++++ -++++ # if streamer is not None: -++++ # streamer.end() -++++ -++++ # # 简单判断:打印是否使用了JIT路径 -++++ # if hasattr(self, '_jit_used') and self._jit_used: -++++ # print("[JIT] ✓ JIT optimization was used during generation") -++++ # else: -++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++ -++++ # if return_dict_in_generate: -++++ # if self.config.is_encoder_decoder: -++++ # return GenerateEncoderDecoderOutput( -++++ # sequences=input_ids, -++++ # scores=scores, -++++ # logits=raw_logits, -++++ # encoder_attentions=encoder_attentions, -++++ # encoder_hidden_states=encoder_hidden_states, -++++ # decoder_attentions=decoder_attentions, -++++ # cross_attentions=cross_attentions, -++++ # decoder_hidden_states=decoder_hidden_states, -++++ # past_key_values=model_kwargs.get("past_key_values"), -++++ # average_infer_time=average_infer_time -++++ # ) -++++ # else: -++++ # return GenerateDecoderOnlyOutput( -++++ # sequences=input_ids, -++++ # scores=scores, -++++ # logits=raw_logits, -++++ # attentions=decoder_attentions, -++++ # hidden_states=decoder_hidden_states, -++++ # past_key_values=model_kwargs.get("past_key_values"), -++++ # average_infer_time=average_infer_time -++++ # ) -++++ # else: -++++ # return input_ids -++++ -++++ # def _prepare_cache_for_generation( -++++ # self, -++++ # generation_config, -++++ # model_kwargs, -++++ # assistant_model, -++++ # batch_size, -++++ # max_cache_length, -++++ # ): -++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++ # generation_config.cache_implementation = "static" -++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++ -++++ # if generation_config.cache_implementation == "static": -++++ # base_required_from_max_length = generation_config.max_length + 1 -++++ # base_required = max(max_cache_length, base_required_from_max_length) -++++ # min_cache_size = 50 -++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++ # else: -++++ # max_cache_length = max(base_required, min_cache_size) -++++ -++++ # original_max_cache_length = max_cache_length -++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++ # print(f" - input max_cache_length: {original_max_cache_length}") -++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++ # print(f" - final max_cache_length: {max_cache_length}") -++++ -++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++ # if max_cache_length > self.config.max_position_embeddings: -++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++ -++++ # result = super()._prepare_cache_for_generation( -++++ # generation_config=generation_config, -++++ # model_kwargs=model_kwargs, -++++ # assistant_model=assistant_model, -++++ # batch_size=batch_size, -++++ # max_cache_length=max_cache_length, -++++ # ) -++++ -++++ # if generation_config.cache_implementation == "static": -++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++ # created_cache = model_kwargs.get(cache_name) -++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++ # if created_cache.max_cache_len < generation_config.max_length: -++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++ -++++ # return result -++++ -++++ -++++ -+++ -+++ -+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++-- -+++2.27.0 -+++ -++-- -++2.27.0 -++ -+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+new file mode 100644 -+index 00000000..966529e4 -+--- /dev/null -++++ b/patches/0003-20261106secondcommit.patch -+@@ -0,0 +1,2769 @@ -++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Thu, 6 Nov 2025 14:54:37 +0800 -++Subject: [PATCH 3/3] 20261106secondcommit -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -++ patches/0001-20251104commit.patch | 1272 ----------------- -++ 3 files changed, 528 insertions(+), 2032 deletions(-) -++ delete mode 100644 patches/0001-20251104commit.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index 73773c22..2f9192bf 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -++ -++ _CONFIG_FOR_DOC = "DeepseekConfig" -++ -+++_attn_mask_cache = {} -+++ -+++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -+++ q_len = batch_and_seq[1] -+++ kv_len = batch_and_seq[1] + past_key_values_length -+++ key = (batch_and_seq[0], q_len, kv_len) -+++ -+++ if key in _attn_mask_cache: -+++ return _attn_mask_cache[key] -+++ -+++ mask = _prepare_4d_causal_attention_mask( -+++ attention_mask, -+++ batch_and_seq, -+++ inputs_embeds, -+++ past_key_values_length, -+++ ) -+++ _attn_mask_cache[key] = mask -+++ return mask -++ -++ def _get_unpad_data(attention_mask): -++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -++ return final_output -++ -++ -++- @no_grad() -++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++- expert_cache = ops.zeros_like(x) -++- idxs = flat_expert_indices.argsort() -++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++- token_idxs = idxs // self.num_experts_per_tok -++- -++- for i, end_idx in enumerate(tokens_per_expert): -++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++- if start_idx == end_idx: -++- continue -++- expert = self.experts[i] -++- exp_token_idx = token_idxs[start_idx:end_idx] -++- expert_tokens = x[exp_token_idx] -++- expert_out = expert(expert_tokens) -++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++- -++- return expert_cache -++- -++ # @no_grad() -++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++- # # expert_cache = torch.zeros_like(x) -++- # # idxs = flat_expert_indices.argsort() -++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++- # # token_idxs = idxs // self.num_experts_per_tok -++- # # for i, end_idx in enumerate(tokens_per_expert): -++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++- # # if start_idx == end_idx: -++- # # continue -++- # # expert = self.experts[i] -++- # # exp_token_idx = token_idxs[start_idx:end_idx] -++- # # expert_tokens = x[exp_token_idx] -++- # # expert_out = expert(expert_tokens) -++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++- # # return expert_cache -+++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ # expert_cache = ops.zeros_like(x) -++ # idxs = flat_expert_indices.argsort() -++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++ -++ # return expert_cache -++- # @no_grad() -++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++- # expert_cache = ops.zeros_like(x) -+++ -+++ @no_grad() -+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++ """ -+++ 优化版 MoE prefill: -+++ - 批量张量化处理同一个 expert 的所有 token -+++ - 跳过无 token 的专家 -+++ - 保持结果完全一致 -+++ """ -+++ # 初始化输出缓存 -+++ expert_cache = ops.zeros_like(x) -++ -++- # # 排序保证顺序一致 -++- # idxs = flat_expert_indices.argsort() -++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++- # token_idxs = idxs // self.num_experts_per_tok -+++ # 排序(确保 scatter_add 位置对应原逻辑) -+++ idxs = flat_expert_indices.argsort() -+++ sorted_expert_indices = flat_expert_indices[idxs] -+++ sorted_token_indices = idxs // self.num_experts_per_tok -++ -++- # # 找出有 token 的专家 -++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++ # 每个 expert 的 token 数 -+++ tokens_per_expert = sorted_expert_indices.bincount() -++ -++- # for i in active_experts.tolist(): -++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++- # end_idx = tokens_per_expert[i] -++- # if start_idx == end_idx: # 没有 token -++- # continue -+++ # 找出有 token 的专家 -+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++ -++- # exp_token_idx = token_idxs[start_idx:end_idx] -++- # expert_tokens = x[exp_token_idx] -++- # expert_out = self.experts[i](expert_tokens) -++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++ for expert_id in active_experts.tolist(): -+++ # 取该 expert 对应的排序后 token 区间 -+++ start = (tokens_per_expert[:expert_id]).sum().item() -+++ end = start + tokens_per_expert[expert_id].item() -++ -++- # expert_cache = mindspore.mint.scatter_add( -++- # expert_cache, -++- # 0, -++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++- # expert_out -++- # ) -+++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -+++ expert_tokens = x[token_idx] # 取输入向量 -++ -++- # return expert_cache -+++ # 执行专家 MLP -+++ expert_out = self.experts[expert_id](expert_tokens) -+++ -+++ # 按权重缩放 -+++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -+++ -+++ # 回写到缓存(等价 scatter_add) -+++ expert_cache = mindspore.mint.scatter_add( -+++ expert_cache, -+++ 0, -+++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++ scaled_out -+++ ) -+++ -+++ return expert_cache -+++ -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # # expert_cache = torch.zeros_like(x) -+++ # # idxs = flat_expert_indices.argsort() -+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++ # # token_idxs = idxs // self.num_experts_per_tok -+++ # # for i, end_idx in enumerate(tokens_per_expert): -+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++ # # if start_idx == end_idx: -+++ # # continue -+++ # # expert = self.experts[i] -+++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # # expert_tokens = x[exp_token_idx] -+++ # # expert_out = expert(expert_tokens) -+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++ # # return expert_cache -+++ # expert_cache = ops.zeros_like(x) -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # for i, end_idx in enumerate(tokens_per_expert): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # if start_idx == end_idx: -+++ # continue -+++ # expert = self.experts[i] -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = expert(expert_tokens) -+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -+++ # return expert_cache -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++ # expert_cache = ops.zeros_like(x) -+++ -+++ # # 排序保证顺序一致 -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ # token_idxs = idxs // self.num_experts_per_tok -+++ -+++ # # 找出有 token 的专家 -+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++ -+++ # for i in active_experts.tolist(): -+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ # end_idx = tokens_per_expert[i] -+++ # if start_idx == end_idx: # 没有 token -+++ # continue -+++ -+++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++ # expert_tokens = x[exp_token_idx] -+++ # expert_out = self.experts[i](expert_tokens) -+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++ -+++ # expert_cache = mindspore.mint.scatter_add( -+++ # expert_cache, -+++ # 0, -+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++ # expert_out -+++ # ) -+++ -+++ # return expert_cache -++ -++ -++ -++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -++ -++ return attn_output, attn_weights, past_key_value -++ -++- -++ # class DeepseekFlashAttention(nn.Module): -++ # """ -++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -++ -++ return attn_output, attn_weights, past_key_value -++ -+++ -++ Deepseek_ATTENTION_CLASSES = { -++ "eager": DeepseekAttention, -++ "flash-attention": DeepseekFlashAttention, -++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -++ ) -++ else: -++ # 4d mask is passed through the layers -++- attention_mask = _prepare_4d_causal_attention_mask( -+++ # attention_mask = _prepare_4d_causal_attention_mask( -+++ # attention_mask, -+++ # (batch_size, seq_length), -+++ # inputs_embeds, -+++ # past_key_values_length, -+++ # ) -+++ #@dwj -+++ attention_mask = get_cached_causal_mask( -++ attention_mask, -++ (batch_size, seq_length), -++ inputs_embeds, -++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ # Initialize weights and apply final processing -++ self.post_init() -++ self.warm_up = False -+++ #@dwj -+++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+++ self.num_layers, -+++ self.num_attention_heads, -+++ self.head_dim, -+++ batch_size=1, -+++ max_length=self.max_length, -+++ dtype=mindspore.float16 -+++ ) -+++ -+++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+++ key_cache = [] -+++ value_cache = [] -+++ for _ in range(num_layers): -+++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++ key_cache.append(k) -+++ value_cache.append(v) -+++ return key_cache, value_cache -+++ -++ -++ def warmup_moe_model_deep(self): -++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++index bced285c..ebd7782e 100644 -++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++ -++-Long_Prompt = False -++-PROMPT_LENGTH_THRESHOLD = 128 -+++Long_Prompt = 1 -+++LONG_PROMPT_LENGTH_THRESHOLD = 128 -+++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -+++ -+++_causal_mask_cache = {} -+++ -+++def get_cached_causal_mask_with_cache_position( -+++ attention_mask: mindspore.Tensor, -+++ sequence_length: int, -+++ target_length: int, -+++ dtype: mindspore.dtype, -+++ min_dtype: float, -+++ cache_position: mindspore.Tensor, -+++ batch_size: int, -+++): -+++ """ -+++ 带缓存的 causal mask 构造函数 -+++ """ -+++ # q_len 是当前 query 长度 -+++ q_len = sequence_length -+++ # kv_len 是 target_length -+++ kv_len = target_length -+++ -+++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -+++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -+++ -+++ if key in _causal_mask_cache: -+++ return _causal_mask_cache[key] -+++ -+++ # 调用原来的 mask 构造逻辑 -+++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++ attention_mask, -+++ sequence_length=sequence_length, -+++ target_length=target_length, -+++ dtype=dtype, -+++ min_dtype=min_dtype, -+++ cache_position=cache_position, -+++ batch_size=batch_size, -+++ ) -+++ # 缓存结果 -+++ _causal_mask_cache[key] = causal_mask -+++ return causal_mask -++ -++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++ def _prepare_4d_causal_attention_mask_with_cache_position( -++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++ -++ -++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -+++# class Qwen2MoeAttention(nn.Module): -+++# """ -+++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+++# and "Generating Long Sequences with Sparse Transformers". -+++# """ -+++ -+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++# super().__init__() -+++# self.config = config -+++# self.layer_idx = layer_idx -+++# if layer_idx is None: -+++# logger.warning_once( -+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++# "when creating this class." -+++# ) -+++ -+++# self.hidden_size = config.hidden_size -+++# self.num_heads = config.num_attention_heads -+++# self.head_dim = self.hidden_size // self.num_heads -+++# self.num_key_value_heads = config.num_key_value_heads -+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++# self.max_position_embeddings = config.max_position_embeddings -+++# self.rope_theta = config.rope_theta -+++# self.is_causal = True -+++# self.attention_dropout = config.attention_dropout -+++ -+++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++# raise ValueError( -+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++# f" and `num_heads`: {self.num_heads})." -+++# ) -+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++ -+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++# self.head_dim, -+++# max_position_embeddings=self.max_position_embeddings, -+++# base=self.rope_theta, -+++# ) -+++ -+++# def forward( -+++# self, -+++# hidden_states: mindspore.Tensor, -+++# attention_mask: Optional[mindspore.Tensor] = None, -+++# position_ids: Optional[mindspore.Tensor] = None, -+++# past_key_value: Optional[Cache] = None, -+++# output_attentions: bool = False, -+++# use_cache: bool = False, -+++# cache_position: Optional[mindspore.Tensor] = None, -+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++ -+++ -+++ -+++# bsz, q_len, _ = hidden_states.shape -+++ -+++# query_states = self.q_proj(hidden_states) -+++# key_states = self.k_proj(hidden_states) -+++# value_states = self.v_proj(hidden_states) -+++ -+++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++ -+++# kv_seq_len = key_states.shape[-2] -+++# if past_key_value is not None: -+++# if self.layer_idx is None: -+++# raise ValueError( -+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++# "with a layer index." -+++# ) -+++# if isinstance(past_key_value, StaticCache): -+++# kv_seq_len = key_states.shape[-2] -+++# else: -+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++# if past_key_value is not None: -+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++# if isinstance(past_key_value, StaticCache): -+++# kv_seq_len = key_states.shape[-2] -+++ -+++# # repeat k/v heads if n_kv_heads < n_heads -+++# key_states = repeat_kv(key_states, self.num_key_value_groups) -+++# value_states = repeat_kv(value_states, self.num_key_value_groups) -+++ -+++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++ -+++# if attention_mask is not None: -+++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++# attn_weights = attn_weights + causal_mask -+++ -+++# # upcast attention to fp32 -+++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++# attn_output = ops.matmul(attn_weights, value_states) -+++ -+++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++# raise ValueError( -+++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+++# f" {attn_output.shape}" -+++# ) -+++ -+++# attn_output = ops.transpose(attn_output, 1, 2) -+++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++ -+++# attn_output = self.o_proj(attn_output) -+++# # @lwx -+++ -+++# # max_seq_len = self.max_position_embeddings # 2048 -+++ -+++# # if attention_mask is not None: -+++# # # attention_mask: [B, 1, Sq, Sk] -+++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++ -+++# # # pad 到 [max_seq_len, max_seq_len] -+++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++# # global_attention_mask = padded_mask -+++# # else: -+++# # global_attention_mask = None -+++ -+++ -+++# # sparse_mode=3 -+++# # attn_output = mindspore.ops.flash_attention_score( -+++# # query=query_states, -+++# # key=key_states, -+++# # value=value_states, -+++# # real_shift=None, -+++# # padding_mask=None, -+++ -+++# # head_num=self.num_heads, -+++# # attn_mask=global_attention_mask, -+++# # keep_prob=1.0 - self.attention_dropout, -+++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++# # input_layout="BNSD", -+++# # pre_tokens=2147483647, -+++# # next_tokens=2147483647, -+++# # inner_precise=0, -+++# # drop_mask=None, -+++# # prefix=None, -+++# # actual_seq_qlen=None, -+++# # actual_seq_kvlen=None, -+++# # sparse_mode=sparse_mode, -+++# # ) -+++# if not output_attentions: -+++# attn_weights = None -+++ -+++# return attn_output, attn_weights, past_key_value -+++ -++ class Qwen2MoeAttention(nn.Module): -++ """ -++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++- and "Generating Long Sequences with Sparse Transformers". -++- """ -+++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -++ -+++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -+++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -+++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -+++ -+++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -+++ """ -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++ super().__init__() -++ self.config = config -++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -++ if layer_idx is None: -++ logger.warning_once( -++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++ "when creating this class." -++ ) -++ -++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -++ use_cache: bool = False, -++ cache_position: Optional[mindspore.Tensor] = None, -++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++- -++ -++- -+++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -++ bsz, q_len, _ = hidden_states.shape -++ -++ query_states = self.q_proj(hidden_states) -++ key_states = self.k_proj(hidden_states) -++ value_states = self.v_proj(hidden_states) -++ -++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++- -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ -++ kv_seq_len = key_states.shape[-2] -++ if past_key_value is not None: -++- if self.layer_idx is None: -++- raise ValueError( -++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++- "with a layer index." -++- ) -++- if isinstance(past_key_value, StaticCache): -++- kv_seq_len = key_states.shape[-2] -++- else: -++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++ -++ if past_key_value is not None: -++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++ -+++ # --- 2. 动态调度核心注意力计算 --- -+++ global Long_Prompt -+++ if Long_Prompt >= 1: -+++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -+++ fa_attention_mask = None -+++ if attention_mask is not None: -+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++ fa_attention_mask = (mask_slice != 0) -+++ -+++ attn_output = mindspore.ops.flash_attention_score( -+++ query=query_states, -+++ key=key_states, -+++ value=value_states, -+++ head_num=self.num_heads, -+++ attn_mask=fa_attention_mask, -+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -+++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++ input_layout="BNSD", -+++ sparse_mode=0, -+++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -+++ ) -++ -++- if isinstance(past_key_value, StaticCache): -++- kv_seq_len = key_states.shape[-2] -+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ attn_output = self.o_proj(attn_output) -+++ attn_weights = None -+++ if output_attentions: -+++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++ -++- # repeat k/v heads if n_kv_heads < n_heads -++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++- -++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++ else: -+++ # --- Eager Attention 路径 (用于短序列和解码) --- -+++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++ -+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++ -++- if attention_mask is not None: -++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++- attn_weights = attn_weights + causal_mask -+++ if attention_mask is not None: -+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++ attn_weights = attn_weights + causal_mask -++ -++- # upcast attention to fp32 -++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++- attn_output = ops.matmul(attn_weights, value_states) -+++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++ attn_output = ops.matmul(attn_weights, value_states) -++ -++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++- raise ValueError( -++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++- f" {attn_output.shape}" -++- ) -+++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++ raise ValueError( -+++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -+++ ) -++ -++- attn_output = ops.transpose(attn_output, 1, 2) -++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++ attn_output = ops.transpose(attn_output, 1, 2) -+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++ attn_output = self.o_proj(attn_output) -++ -++- attn_output = self.o_proj(attn_output) -++- # @lwx -+++ if not output_attentions: -+++ attn_weights = None -++ -++- # max_seq_len = self.max_position_embeddings # 2048 -++- -++- # if attention_mask is not None: -++- # # attention_mask: [B, 1, Sq, Sk] -++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++- -++- # # pad 到 [max_seq_len, max_seq_len] -++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++- # global_attention_mask = padded_mask -++- # else: -++- # global_attention_mask = None -++- -++- -++- # sparse_mode=3 -++- # attn_output = mindspore.ops.flash_attention_score( -++- # query=query_states, -++- # key=key_states, -++- # value=value_states, -++- # real_shift=None, -++- # padding_mask=None, -++- -++- # head_num=self.num_heads, -++- # attn_mask=global_attention_mask, -++- # keep_prob=1.0 - self.attention_dropout, -++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++- # input_layout="BNSD", -++- # pre_tokens=2147483647, -++- # next_tokens=2147483647, -++- # inner_precise=0, -++- # drop_mask=None, -++- # prefix=None, -++- # actual_seq_qlen=None, -++- # actual_seq_kvlen=None, -++- # sparse_mode=sparse_mode, -++- # ) -++- if not output_attentions: -++- attn_weights = None -++- -++ return attn_output, attn_weights, past_key_value -++ -++- -++ # class Qwen2MoeFlashAttention(nn.Module): -++ # """ -++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -++ # return final_hidden_states, router_logits -++ -++ -++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++-# """ -++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++-# `_moe_infer_prefill` (用于长序列处理) 方法。 -++-# """ -++-# def __init__(self, config: Qwen2MoeConfig): -++-# super().__init__() -++-# self.num_experts = config.num_experts -++-# self.top_k = config.num_experts_per_tok -++-# self.norm_topk_prob = config.norm_topk_prob -++- -++-# # 门控网络 -++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++-# # 专家列表 -++-# self.experts = nn.ModuleList( -++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++-# ) -++-# # 共享专家 -++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-# @no_grad() -++-# def _moe_infer_decode( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# """ -++-# 【解码路径】针对 sequence_length=1 的极致优化。 -++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++-# """ -++-# batch_size, hidden_dim = hidden_states.shape -++- -++-# expert_outputs_list = [ -++-# ops.cat([ -++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++-# ], dim=0) -++-# for i in range(batch_size) -++-# ] -++- -++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++-# # shape: (batch_size, top_k, hidden_dim) -++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++- -++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++- -++-# return moe_output.squeeze(1) -++- -++-# @no_grad() -++-# def _moe_infer_prefill( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# """ -++-# 【预填充路径】针对 sequence_length > 1 的优化。 -++-# 按专家对 Token 进行分组,并进行批处理。 -++-# """ -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens = hidden_states.shape[0] -++-# flat_selected_experts = selected_experts.flatten() -++- -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++- -++-# active_experts = ops.unique(flat_selected_experts) -++- -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++- -++-# mask = (flat_selected_experts == expert_idx_tensor) -++-# selected_token_indices = token_indices[mask] -++-# selected_routing_weights = routing_weights.flatten()[mask] -++- -++-# current_states = hidden_states[selected_token_indices] -++- -++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++- -++-# moe_output = moe_output.index_add( -++-# dim=0, -++-# index=selected_token_indices, -++-# source=expert_output.to(hidden_states.dtype) -++-# ) -++-# return moe_output -++- -++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-# """ -++-# 顶层 forward 方法,作为智能分发器。 -++-# """ -++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++- -++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-# router_logits = self.gate(hidden_states_reshaped) -++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- -++-# if self.norm_topk_prob: -++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- -++-# routing_weights = routing_weights.to(hidden_states.dtype) -++- -++-# moe_output = None -++-# # 在推理时,根据序列长度选择最优路径 -++-# if not self.training: -++-# if sequence_length == 1: -++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++-# else: -++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++-# else: -++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++-# raise NotImplementedError("Training path is not implemented.") -++- -++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++- -++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++- -++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++- -++-# return final_hidden_states, router_logits -++- -++- -++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++-# """ -++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++-# """ -++-# def __init__(self, config: Qwen2MoeConfig): -++-# super().__init__() -++-# self.num_experts = config.num_experts -++-# self.top_k = config.num_experts_per_tok -++-# self.norm_topk_prob = config.norm_topk_prob -++- -++-# # 门控网络 -++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++-# # 专家列表 -++-# self.experts = nn.ModuleList( -++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++-# ) -++-# # 共享专家 -++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-# @no_grad() -++-# def _moe_infer_decode( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# batch_size, _ = hidden_states.shape -++-# expert_outputs_list = [ -++-# ops.cat([ -++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++-# ], dim=0) -++-# for i in range(batch_size) -++-# ] -++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++-# return moe_output.squeeze(1) -++- -++-# @no_grad() -++-# def _moe_infer_prefill( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens = hidden_states.shape[0] -++-# flat_selected_experts = selected_experts.flatten() -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++-# active_experts = ops.unique(flat_selected_experts) -++- -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++-# mask = (flat_selected_experts == expert_idx_tensor) -++-# selected_token_indices = token_indices[mask] -++-# selected_routing_weights = routing_weights.flatten()[mask] -++-# current_states = hidden_states[selected_token_indices] -++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++-# moe_output = moe_output.index_add( -++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++-# ) -++-# return moe_output -++- -++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-# """ -++-# 顶层 forward 方法,作为智能分发器。 -++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++-# """ -++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++- -++-# # 1. 门控计算 (通用逻辑) -++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-# router_logits = self.gate(hidden_states_reshaped) -++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- -++-# if self.norm_topk_prob: -++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- -++-# routing_weights = routing_weights.to(hidden_states.dtype) -++- -++-# # 2. 智能分发到最优 MoE 路径 -++-# moe_output = None -++-# if not self.training: -++-# if sequence_length == 1: -++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++-# else: -++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++-# else: -++-# raise NotImplementedError("Training path is not implemented.") -++- -++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++- -++-# # 4. 合并 MoE 输出和共享专家输出 -++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++- -++-# # 5. 恢复原始形状并返回 -++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++- -++-# return final_hidden_states, router_logits -++- -++-# prefill fastest -++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++-# """ -++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++-# """ -++-# def __init__(self, config: Qwen2MoeConfig): -++-# super().__init__() -++-# self.num_experts = config.num_experts -++-# self.top_k = config.num_experts_per_tok -++-# self.norm_topk_prob = config.norm_topk_prob -++- -++-# # 门控网络 -++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++-# # 专家列表 -++-# self.experts = nn.ModuleList( -++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++-# ) -++-# # 共享专家 -++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-# @no_grad() -++-# def _moe_infer_dispatch( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# """ -++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++-# """ -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens, _ = hidden_states.shape -++- -++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++-# flat_selected_experts = selected_experts.flatten() -++-# flat_routing_weights = routing_weights.flatten() -++- -++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++- -++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++-# active_experts = ops.unique(flat_selected_experts) -++- -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++- -++-# # 找到所有分配给该专家的 token -++-# mask = (flat_selected_experts == expert_idx_tensor) -++- -++-# # 使用 mask 选取对应的 token 和权重 -++-# current_token_indices = token_indices[mask] -++-# current_routing_weights = flat_routing_weights[mask] -++-# current_hidden_states = hidden_states[current_token_indices] -++- -++-# # 对这些 token 进行批处理 -++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++- -++-# # 使用 index_add 将结果精确地加回到对应位置 -++-# moe_output = moe_output.index_add( -++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++-# ) -++-# return moe_output -++- -++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-# """ -++-# 顶层 forward 方法,作为智能分发器。 -++-# """ -++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++- -++-# # 1. 门控计算 -++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-# router_logits = self.gate(hidden_states_reshaped) -++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- -++-# if self.norm_topk_prob: -++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- -++-# routing_weights = routing_weights.to(hidden_states.dtype) -++- -++-# # 2. 调用统一的 MoE 计算内核 -++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++- -++-# # 3. 统一处理共享专家 -++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++- -++-# # 4. 合并输出 -++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++- -++-# # 5. 恢复原始形状并返回 -++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++- -++-# return final_hidden_states, router_logits -++- -++- -++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++-# """ -++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++-# 【最终高性能与高精度版】: -++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++-# 3. 这样实现了速度和准确性的两全其美。 -++-# """ -++-# def __init__(self, config: Qwen2MoeConfig): -++-# super().__init__() -++-# self.num_experts = config.num_experts -++-# self.top_k = config.num_experts_per_tok -++-# self.norm_topk_prob = config.norm_topk_prob -++- -++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++-# self.experts = nn.ModuleList( -++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++-# ) -++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-# @no_grad() -++-# def _moe_infer_decode( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# """ -++-# 【解码路径】极致优化版:bmm + 高精度累加。 -++-# """ -++-# original_dtype = hidden_states.dtype -++-# batch_size, _ = hidden_states.shape -++- -++-# expert_outputs_list = [ -++-# ops.cat([ -++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++-# ], dim=0) -++-# for i in range(batch_size) -++-# ] -++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++- -++-# # 在 float32 下执行 bmm,得到高精度结果 -++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++- -++-# # 将高精度结果转换回原始数据类型 -++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++- -++-# return moe_output -++- -++-# @no_grad() -++-# def _moe_infer_prefill( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# selected_experts: mindspore.Tensor, -++-# routing_weights: mindspore.Tensor -++-# ) -> mindspore.Tensor: -++-# """ -++-# 【预填充路径】与原始实现一致,结果精确。 -++-# """ -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens, _ = hidden_states.shape -++-# flat_selected_experts = selected_experts.flatten() -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++-# active_experts = ops.unique(flat_selected_experts) -++- -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++-# mask = (flat_selected_experts == expert_idx_tensor) -++-# selected_token_indices = token_indices[mask] -++-# selected_routing_weights = routing_weights.flatten()[mask] -++-# current_states = hidden_states[selected_token_indices] -++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++-# moe_output = moe_output.index_add( -++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++-# ) -++-# return moe_output -++- -++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++- -++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-# router_logits = self.gate(hidden_states_reshaped) -++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++- -++-# if self.norm_topk_prob: -++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- -++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++-# # 如果模型主体是 float16,后续再转换 -++- -++-# moe_output = None -++-# if not self.training: -++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++-# # _moe_infer_decode 内部会处理好类型转换 -++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++-# if sequence_length == 1: -++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++-# else: -++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++-# else: -++-# raise NotImplementedError("Training path is not implemented.") -++- -++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++- -++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++- -++-# return final_hidden_states, router_logits -++- -++- -++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++-# """ -++-# 【融合版】一个混合专家模块,内置两种推理策略, -++-# 由外部全局变量 `Long_Prompt` 控制: -++- -++-# - if Long_Prompt is True: 【精度优先模式】 -++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++-# 适用于处理长序列,避免误差累积。 -++- -++-# - if Long_Prompt is False: 【速度优先模式】 -++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++-# 在解码阶段获得极致速度,同时保证结果高度准确。 -++-# """ -++-# def __init__(self, config: Qwen2MoeConfig): -++-# super().__init__() -++-# self.num_experts = config.num_experts -++-# self.top_k = config.num_experts_per_tok -++-# self.norm_topk_prob = config.norm_topk_prob -++- -++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++-# self.experts = nn.ModuleList( -++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++-# ) -++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-# # --- 速度优先模式的辅助函数 --- -++-# @no_grad() -++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++-# original_dtype = hidden_states.dtype -++-# batch_size, _ = hidden_states.shape -++-# expert_outputs_list = [ -++-# ops.cat([ -++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++-# ], dim=0) -++-# for i in range(batch_size) -++-# ] -++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++-# weights_fp32 = routing_weights.to(mindspore.float32) -++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++-# return moe_output_fp32.squeeze(1).to(original_dtype) -++- -++-# @no_grad() -++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens, _ = hidden_states.shape -++-# flat_selected_experts = selected_experts.flatten() -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++-# active_experts = ops.unique(flat_selected_experts) -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++-# mask = (flat_selected_experts == expert_idx_tensor) -++-# selected_token_indices = token_indices[mask] -++-# selected_routing_weights = routing_weights.flatten()[mask] -++-# current_states = hidden_states[selected_token_indices] -++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++-# return moe_output -++- -++-# # --- 精度优先模式的辅助函数 --- -++-# @no_grad() -++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++-# moe_output = ops.zeros_like(hidden_states) -++-# num_tokens, _ = hidden_states.shape -++-# flat_selected_experts = selected_experts.flatten() -++-# flat_routing_weights = routing_weights.flatten() -++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++-# active_experts = ops.unique(flat_selected_experts) -++-# for expert_idx_tensor in active_experts: -++-# expert_idx = expert_idx_tensor.item() -++-# expert_layer = self.experts[expert_idx] -++-# mask = (flat_selected_experts == expert_idx_tensor) -++-# current_token_indices = token_indices[mask] -++-# current_routing_weights = flat_routing_weights[mask] -++-# current_hidden_states = hidden_states[current_token_indices] -++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++-# return moe_output -++- -++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-# # 声明我们将要使用一个在模块外部定义的全局变量 -++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++-# global Long_Prompt -++- -++-# # 1. 门控计算 (所有模式通用) -++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-# router_logits = self.gate(hidden_states_reshaped) -++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++-# if self.norm_topk_prob: -++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++- -++-# moe_output = None -++-# if not self.training: -++-# # 根据 Long_Prompt 标志选择模式 -++-# if Long_Prompt: -++-# # --- 精度优先模式 --- -++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++-# else: -++-# # --- 速度优先模式 --- -++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++-# if sequence_length == 1: -++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++-# else: -++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++-# else: -++-# raise NotImplementedError("Training path is not implemented.") -++- -++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++- -++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++- -++-# return final_hidden_states, router_logits -++- -++ class Qwen2MoeSparseMoeBlock(nn.Module): -++ """ -++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++ return moe_output_fp32.squeeze(1).to(original_dtype) -++ -+++ # @no_grad() -+++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++ # num_tokens, _ = hidden_states.shape -+++ # flat_selected_experts = selected_experts.flatten() -+++ # sorted_expert_indices = flat_selected_experts.argsort() -+++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++ # original_token_indices = sorted_expert_indices // self.top_k -+++ # moe_output = ops.zeros_like(hidden_states) -+++ # current_token_offset = 0 -+++ # for i in range(self.num_experts): -+++ # expert_token_count = tokens_per_expert[i] - current_token_offset -+++ # if expert_token_count == 0: -+++ # continue -+++ # end_offset = current_token_offset + expert_token_count -+++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++ # expert_hidden_states = hidden_states[expert_original_token_indices] -+++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++ # current_token_offset += expert_token_count -+++ # return moe_output -+++ -++ @no_grad() -++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++- num_tokens, _ = hidden_states.shape -++- flat_selected_experts = selected_experts.flatten() -++- sorted_expert_indices = flat_selected_experts.argsort() -++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++- original_token_indices = sorted_expert_indices // self.top_k -+++ """ -+++ 优化版 MoE prefill (速度优先模式): -+++ - 批量张量化处理同一个 expert 的所有 token -+++ - 跳过无 token 的专家 -+++ - 保持结果完全一致 -+++ """ -++ moe_output = ops.zeros_like(hidden_states) -++- current_token_offset = 0 -++- for i in range(self.num_experts): -++- expert_token_count = tokens_per_expert[i] - current_token_offset -++- if expert_token_count == 0: -++- continue -++- end_offset = current_token_offset + expert_token_count -++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++- expert_hidden_states = hidden_states[expert_original_token_indices] -++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++- current_token_offset += expert_token_count -+++ -+++ flat_selected_experts = selected_experts.flatten() -+++ flat_routing_weights = routing_weights.flatten() -+++ -+++ idxs = flat_selected_experts.argsort() -+++ sorted_expert_indices = flat_selected_experts[idxs] -+++ sorted_token_indices = idxs // self.top_k -+++ -+++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -+++ -+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+++ -+++ for expert_id in active_experts.tolist(): -+++ start = int(tokens_per_expert[:expert_id].sum().item()) -+++ end = start + int(tokens_per_expert[expert_id].item()) -+++ -+++ token_idx = sorted_token_indices[start:end] -+++ expert_tokens = hidden_states[token_idx] -+++ -+++ expert_out = self.experts[expert_id](expert_tokens) -+++ -+++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -+++ -+++ moe_output = mindspore.mint.scatter_add( -+++ moe_output, -+++ 0, -+++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -+++ scaled_out.to(hidden_states.dtype) -+++ ) -+++ -++ return moe_output -++ -+++ -++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++ @no_grad() -++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++ -++ moe_output = None -++- if Long_Prompt: -++- # --- 精度优先模式 (ACCURACY MODE) --- -++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ # if Long_Prompt==0: -+++ # # --- 精度优先模式 (ACCURACY MODE) --- -+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ # else: -+++ # # --- 速度优先模式 (SPEED MODE) --- -+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++ # if sequence_length == 1: -+++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ # else: -+++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ -+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++ if sequence_length == 1: -+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ else: -++- # --- 速度优先模式 (SPEED MODE) --- -++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++- if sequence_length == 1: -++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++- else: -++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++- -+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ -++ -++ # 3. 共享专家计算与合并 (所有模式通用) -++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++ -++ return final_hidden_states, router_logits -++ -+++ -++ class Qwen2MoeDecoderLayer(nn.Module): -++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++ super().__init__() -++ self.hidden_size = config.hidden_size -++ -++- # if Long_Prompt: -++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++- # else: -+++ # if Long_Prompt == 2: -++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++ # else: -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ -++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++ -++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++ ) -++ -++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++ # attention_mask, -+++ # sequence_length=sequence_length, -+++ # target_length=target_length, -+++ # dtype=dtype, -+++ # min_dtype=min_dtype, -+++ # cache_position=cache_position, -+++ # batch_size=input_tensor.shape[0], -+++ # ) -+++ #@dwj -+++ causal_mask = get_cached_causal_mask_with_cache_position( -++ attention_mask, -++ sequence_length=sequence_length, -++ target_length=target_length, -++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++ """ -++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -+++ _causal_mask_cache.clear() -++ -++ input_ids = kwargs.get("input_ids") -++ if input_ids is None and args: -++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ -++ if input_ids is not None: -++ prompt_length = input_ids.shape[1] -++- -++- if prompt_length > PROMPT_LENGTH_THRESHOLD: -++- Long_Prompt = True -+++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -+++ Long_Prompt = 2 -+++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -+++ Long_Prompt = 0 -++ else: -++- Long_Prompt = False -+++ Long_Prompt = 1 -+++ -++ -++ return super().generate(*args, **kwargs) -++ -++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++ dtype = self.lm_head.weight.dtype -++ min_dtype = float(ops.finfo(dtype).min) -++ -++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++ # attention_mask, -+++ # sequence_length=sequence_length, -+++ # target_length=past_key_values.get_max_length(), -+++ # dtype=dtype, -+++ # min_dtype=min_dtype, -+++ # cache_position=cache_position, -+++ # batch_size=batch_size, -+++ # ) -+++ -+++ #@dwj -+++ attention_mask = get_cached_causal_mask_with_cache_position( -++ attention_mask, -++ sequence_length=sequence_length, -++ target_length=past_key_values.get_max_length(), -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++deleted file mode 100644 -++index 6dfb5b93..00000000 -++--- a/patches/0001-20251104commit.patch -+++++ /dev/null -++@@ -1,1272 +0,0 @@ -++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++-From: Pinoeer-kingxi <13022943007@163.com> -++-Date: Tue, 4 Nov 2025 09:11:51 +0800 -++-Subject: [PATCH] 20251104commit -++- -++---- -++- mindnlp/transformers/cache_utils.py | 28 +- -++- .../models/deepseek/modeling_deepseek.py | 149 ++- -++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++- 3 files changed, 976 insertions(+), 87 deletions(-) -++- -++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++-index cadd2e04..02f8d4be 100644 -++---- a/mindnlp/transformers/cache_utils.py -++-+++ b/mindnlp/transformers/cache_utils.py -++-@@ -812,14 +812,26 @@ class StaticCache(Cache): -++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++- # k_out[:, :, cache_position] = key_states -++- # v_out[:, :, cache_position] = value_states -++-- if ON_ORANGE_PI: -++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++-- else: -++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++-- -++-+ # if ON_ORANGE_PI: -++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++-+ # else: -++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++-+ # 确保 cache_position 是 1D tensor 并且类型正确 -++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++-+ if cache_position.ndim > 1: -++-+ cache_position = cache_position.flatten() -++-+ # 确保类型是 int32 或 int64(MindSpore 要求) -++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++-+ cache_position = cache_position.int() -++-+ -++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++-+ k_out[:, :, cache_position] = key_states -++-+ v_out[:, :, cache_position] = value_states -++-+ -++- return k_out, v_out -++- -++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++-index c695b944..d8303e45 100644 -++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++- # Copied from transformers.models.llama.modeling_llama.rotate_half -++- def rotate_half(x): -++- """Rotates half the hidden dims of the input.""" -++-- x1 = x[..., : x.shape[-1] // 2] -++-- x2 = x[..., x.shape[-1] // 2 :] -++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++-+ # x1 = x[..., : x.shape[-1] // 2] -++-+ # x2 = x[..., x.shape[-1] // 2 :] -++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++- return ops.cat((-x2, x1), dim=-1) -++- -++- -++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++- if self.training: -++- raise NotImplementedError("Training is not supported yet.") -++- else: -++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++-- if self.config.n_shared_experts is not None: -++-- y = y + self.shared_experts(identity) -++-- return y -++-+ # @lwx -++-+ if orig_shape[1] == 1: -++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++-+ y=y.view(*orig_shape) -++-+ if self.config.n_shared_experts is not None: -++-+ y = y + self.shared_experts(identity) -++-+ return y -++-+ else: -++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++-+ if self.config.n_shared_experts is not None: -++-+ y = y + self.shared_experts(identity) -++-+ return y -++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++-+ # if self.config.n_shared_experts is not None: -++-+ # y = y + self.shared_experts(identity) -++-+ # return y -++-+ -++-+ @no_grad() -++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++-+ -++-+ expert_cache = ops.zeros_like(x) -++-+ for i in range(self.num_experts_per_tok): -++-+ expert_id = flat_expert_indices[i].item() -++-+ weight = flat_expert_weights[i].item() -++-+ expert = self.experts[expert_id] -++-+ expert_out = expert(x) -++-+ expert_cache += expert_out * weight -++-+ return expert_cache -++- -++- @no_grad() -++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++-- # expert_cache = torch.zeros_like(x) -++-- # idxs = flat_expert_indices.argsort() -++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++-- # token_idxs = idxs // self.num_experts_per_tok -++-- # for i, end_idx in enumerate(tokens_per_expert): -++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++-- # if start_idx == end_idx: -++-- # continue -++-- # expert = self.experts[i] -++-- # exp_token_idx = token_idxs[start_idx:end_idx] -++-- # expert_tokens = x[exp_token_idx] -++-- # expert_out = expert(expert_tokens) -++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++-- # return expert_cache -++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++- expert_cache = ops.zeros_like(x) -++- idxs = flat_expert_indices.argsort() -++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++- token_idxs = idxs // self.num_experts_per_tok -++-+ -++- for i, end_idx in enumerate(tokens_per_expert): -++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++- if start_idx == end_idx: -++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++- expert_out = expert(expert_tokens) -++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++-+ -++- return expert_cache -++-+ -++-+ # @no_grad() -++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++-+ # # expert_cache = torch.zeros_like(x) -++-+ # # idxs = flat_expert_indices.argsort() -++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++-+ # # token_idxs = idxs // self.num_experts_per_tok -++-+ # # for i, end_idx in enumerate(tokens_per_expert): -++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++-+ # # if start_idx == end_idx: -++-+ # # continue -++-+ # # expert = self.experts[i] -++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -++-+ # # expert_tokens = x[exp_token_idx] -++-+ # # expert_out = expert(expert_tokens) -++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++-+ # # return expert_cache -++-+ # expert_cache = ops.zeros_like(x) -++-+ # idxs = flat_expert_indices.argsort() -++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++-+ # token_idxs = idxs // self.num_experts_per_tok -++-+ -++-+ # for i, end_idx in enumerate(tokens_per_expert): -++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++-+ # if start_idx == end_idx: -++-+ # continue -++-+ # expert = self.experts[i] -++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++-+ # expert_tokens = x[exp_token_idx] -++-+ # expert_out = expert(expert_tokens) -++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++-+ -++-+ # return expert_cache -++-+ # @no_grad() -++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++-+ # expert_cache = ops.zeros_like(x) -++-+ -++-+ # # 排序保证顺序一致 -++-+ # idxs = flat_expert_indices.argsort() -++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++-+ # token_idxs = idxs // self.num_experts_per_tok -++-+ -++-+ # # 找出有 token 的专家 -++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++-+ -++-+ # for i in active_experts.tolist(): -++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++-+ # end_idx = tokens_per_expert[i] -++-+ # if start_idx == end_idx: # 没有 token -++-+ # continue -++-+ -++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++-+ # expert_tokens = x[exp_token_idx] -++-+ # expert_out = self.experts[i](expert_tokens) -++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++-+ -++-+ # expert_cache = mindspore.mint.scatter_add( -++-+ # expert_cache, -++-+ # 0, -++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++-+ # expert_out -++-+ # ) -++-+ -++-+ # return expert_cache -++-+ -++-+ -++- -++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++- # """ -++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++- -++- # Initialize weights and apply final processing -++- self.post_init() -++-+ self.warm_up = False -++-+ -++-+ def warmup_moe_model_deep(self): -++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++-+ test_texts = [ -++-+ "warmup short", -++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++-+ ] -++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++-+ if tokenizer is None: -++-+ from mindnlp.transformers import AutoTokenizer -++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++-+ self._warmup_tokenizer = tokenizer -++-+ -++-+ for text in test_texts: -++-+ inputs = tokenizer(text, return_tensors="ms") -++-+ with mindspore._no_grad(): -++-+ _ = self(**inputs, use_cache=False) -++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++- -++- def get_input_embeddings(self): -++- return self.model.embed_tokens -++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++- ```""" -++-+ if not self.warm_up: -++-+ self.warm_up = True -++-+ self.warmup_moe_model_deep() -++-+ -++- output_attentions = ( -++- output_attentions -++- if output_attentions is not None -++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++-index 3cbf820e..d4c6b651 100644 -++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++-@@ -18,7 +18,6 @@ -++- # See the License for the specific language governing permissions and -++- # limitations under the License. -++- """MindSpore Qwen2MoE model.""" -++-- -++- import math -++- from typing import List, Optional, Tuple, Union -++- -++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++- TokenClassifierOutput, -++- ) -++- from ...modeling_utils import PreTrainedModel -++-+from ...generation import GenerationMixin -++- from ....utils import logging -++- from .configuration_qwen2_moe import Qwen2MoeConfig -++- -++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++- self.variance_epsilon = eps -++- -++- def forward(self, hidden_states): -++-+ # @dwj -++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++-+ # @lwx -++-+ # if not self.training : -++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++- input_dtype = hidden_states.dtype -++- hidden_states = hidden_states.to(mindspore.float32) -++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++-@@ -234,6 +239,8 @@ def rotate_half(x): -++- """Rotates half the hidden dims of the input.""" -++- x1 = x[..., : x.shape[-1] // 2] -++- x2 = x[..., x.shape[-1] // 2 :] -++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++- return ops.cat((-x2, x1), dim=-1) -++- -++- -++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++- self.config = config -++- self.hidden_size = config.hidden_size -++- self.intermediate_size = intermediate_size -++-+ -++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++- self.act_fn = ACT2FN[config.hidden_act] -++- -++- def forward(self, x): -++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++-- -++- -++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++-+ # @lwx -++-+ # gate_up_output = self.gate_up_proj(x) -++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++-+ # return self.down_proj(swiglu_output) -++-+ -++-+ # def forward(self, x): -++-+ # gate_proj_out = self.gate_proj(x) -++-+ # up_proj_out = self.up_proj(x) -++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++-+ # return self.down_proj(swiglu_out) -++-+ -++- # Copied from transformers.models.llama.modeling_llama.repeat_kv -++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++- """ -++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++- use_cache: bool = False, -++- cache_position: Optional[mindspore.Tensor] = None, -++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++-+ -++-+ -++-+ -++- bsz, q_len, _ = hidden_states.shape -++- -++- query_states = self.q_proj(hidden_states) -++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++- "with a layer index." -++- ) -++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++-+ if isinstance(past_key_value, StaticCache): -++-+ kv_seq_len = key_states.shape[-2] -++-+ else: -++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++- -++- if past_key_value is not None: -++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++-+ -++-+ if isinstance(past_key_value, StaticCache): -++-+ kv_seq_len = key_states.shape[-2] -++- -++- # repeat k/v heads if n_kv_heads < n_heads -++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++-- -++-+ -++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++- -++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++-- raise ValueError( -++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++-- f" {attn_weights.shape}" -++-- ) -++-- -++-- if attention_mask is not None: # no matter the length, we just slice it -++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++-+ if attention_mask is not None: -++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++- attn_weights = attn_weights + causal_mask -++- -++- # upcast attention to fp32 -++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++- -++- attn_output = self.o_proj(attn_output) -++-- -++-+ # @lwx -++-+ -++-+ # max_seq_len = self.max_position_embeddings # 2048 -++-+ -++-+ # if attention_mask is not None: -++-+ # # attention_mask: [B, 1, Sq, Sk] -++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++-+ -++-+ # # pad 到 [max_seq_len, max_seq_len] -++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++-+ # global_attention_mask = padded_mask -++-+ # else: -++-+ # global_attention_mask = None -++-+ -++-+ -++-+ # sparse_mode=3 -++-+ # attn_output = mindspore.ops.flash_attention_score( -++-+ # query=query_states, -++-+ # key=key_states, -++-+ # value=value_states, -++-+ # real_shift=None, -++-+ # padding_mask=None, -++-+ -++-+ # head_num=self.num_heads, -++-+ # attn_mask=global_attention_mask, -++-+ # keep_prob=1.0 - self.attention_dropout, -++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++-+ # input_layout="BNSD", -++-+ # pre_tokens=2147483647, -++-+ # next_tokens=2147483647, -++-+ # inner_precise=0, -++-+ # drop_mask=None, -++-+ # prefix=None, -++-+ # actual_seq_qlen=None, -++-+ # actual_seq_kvlen=None, -++-+ # sparse_mode=sparse_mode, -++-+ # ) -++- if not output_attentions: -++- attn_weights = None -++- -++- return attn_output, attn_weights, past_key_value -++- -++- -++-+class Qwen2MoeFlashAttention(nn.Module): -++-+ """ -++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++-+ -++-+ 关键改动: -++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++-+ 直接传入原始的 key 和 value 张量效率更高。 -++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++-+ """ -++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++-+ super().__init__() -++-+ self.config = config -++-+ self.layer_idx = layer_idx -++-+ self.hidden_size = config.hidden_size -++-+ self.num_heads = config.num_attention_heads -++-+ self.head_dim = self.hidden_size // self.num_heads -++-+ self.num_key_value_heads = config.num_key_value_heads -++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++-+ self.max_position_embeddings = config.max_position_embeddings -++-+ self.rope_theta = config.rope_theta -++-+ self.attention_dropout = config.attention_dropout -++-+ -++-+ if (self.head_dim * self.num_heads) != self.hidden_size: -++-+ raise ValueError( -++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++-+ ) -++-+ -++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++-+ -++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++-+ self.head_dim, -++-+ max_position_embeddings=self.max_position_embeddings, -++-+ base=self.rope_theta, -++-+ ) -++-+ -++-+ def forward( -++-+ self, -++-+ hidden_states: mindspore.Tensor, -++-+ attention_mask: Optional[mindspore.Tensor] = None, -++-+ position_ids: Optional[mindspore.Tensor] = None, -++-+ past_key_value: Optional[Cache] = None, -++-+ output_attentions: bool = False, -++-+ use_cache: bool = False, -++-+ cache_position: Optional[mindspore.Tensor] = None, -++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++-+ -++-+ bsz, q_len, _ = hidden_states.shape -++-+ -++-+ # 1. 线性投射 Q, K, V -++-+ query_states = self.q_proj(hidden_states) -++-+ key_states = self.k_proj(hidden_states) -++-+ value_states = self.v_proj(hidden_states) -++-+ -++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++-+ # query: [B, S, H*D] -> [B, N1, S, D] -++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ -++-+ # 3. RoPE 旋转位置编码 -++-+ kv_seq_len = key_states.shape[-2] -++-+ if past_key_value is not None: -++-+ if self.layer_idx is None: -++-+ raise ValueError( -++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++-+ "with a layer index." -++-+ ) -++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++-+ if cache_position.shape[0] == 1: -++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++-+ kv_seq_len = past_seen_tokens + 1 -++-+ else: -++-+ # prefill 阶段:cache_position 是范围,使用其长度 -++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++-+ else: -++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++-+ -++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++-+ -++-+ # 4. KV 缓存更新 -++-+ if past_key_value is not None: -++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++-+ key_states, value_states = past_key_value.update( -++-+ key_states, value_states, self.layer_idx, cache_kwargs -++-+ ) -++-+ -++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++-+ if cache_position.shape[0] == 1: -++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++-+ kv_seq_len = key_states.shape[-2] -++-+ -++-+ # 5. [重要] 准备 Attention Mask -++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++-+ fa_attention_mask = None -++-+ if attention_mask is not None: -++-+ # 截取与当前key长度匹配的部分 -++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -++-+ fa_attention_mask = (mask_slice != 0) -++-+ -++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++-+ input_dtype = query_states.dtype -++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++-+ query_states = query_states.to(mindspore.float16) -++-+ key_states = key_states.to(mindspore.float16) -++-+ value_states = value_states.to(mindspore.float16) -++-+ -++-+ # 6. [核心] 调用 flash_attention_score 算子 -++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++-+ attn_output = mindspore.ops.flash_attention_score( -++-+ query=query_states, -++-+ key=key_states, -++-+ value=value_states, -++-+ head_num=self.num_heads, # 传入Q的头数(N1) -++-+ attn_mask=fa_attention_mask, -++-+ keep_prob=1.0 - self.attention_dropout, -++-+ scalar_value=1.0 / math.sqrt(self.head_dim), -++-+ input_layout="BNSD", -++-+ sparse_mode=0 # 使用 defaultMask 模式 -++-+ ) -++-+ -++-+ # 恢复原始数据类型 -++-+ attn_output = attn_output.to(input_dtype) -++-+ -++-+ # 7. 调整输出形状 -++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++-+ attn_output = self.o_proj(attn_output) -++-+ -++-+ # FlashAttention 算子不直接返回注意力权重矩阵 -++-+ attn_weights = None -++-+ if output_attentions: -++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++-+ -++-+ return attn_output, attn_weights, past_key_value -++-+ -++-+ # def forward( -++-+ # self, -++-+ # hidden_states: mindspore.Tensor, -++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++-+ # position_ids: Optional[mindspore.Tensor] = None, -++-+ # past_key_value: Optional[Cache] = None, -++-+ # output_attentions: bool = False, -++-+ # use_cache: bool = False, -++-+ # cache_position: Optional[mindspore.Tensor] = None, -++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++-+ -++-+ # bsz, q_len, _ = hidden_states.shape -++-+ -++-+ # # 1. 线性投射 Q, K, V -++-+ # query_states = self.q_proj(hidden_states) -++-+ # key_states = self.k_proj(hidden_states) -++-+ # value_states = self.v_proj(hidden_states) -++-+ -++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ -++-+ # # 3. RoPE 旋转位置编码 -++-+ # kv_seq_len = key_states.shape[-2] -++-+ # if past_key_value is not None: -++-+ # if self.layer_idx is None: -++-+ # raise ValueError( -++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++-+ # "with a layer index." -++-+ # ) -++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++-+ -++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++-+ -++-+ # # 4. KV 缓存更新 -++-+ # if past_key_value is not None: -++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++-+ # key_states, value_states = past_key_value.update( -++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++-+ # ) -++-+ -++-+ # # 5. 准备 Attention Mask -++-+ # fa_attention_mask = None -++-+ # if attention_mask is not None: -++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++-+ # fa_attention_mask = (mask_slice != 0) -++-+ -++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++-+ # input_dtype = query_states.dtype -++-+ -++-+ # # 6. [核心] 调用 flash_attention_score 算子 -++-+ # attn_output = mindspore.ops.flash_attention_score( -++-+ # query=query_states, -++-+ # key=key_states, -++-+ # value=value_states, -++-+ # head_num=self.num_heads, -++-+ # attn_mask=fa_attention_mask, -++-+ # keep_prob=1.0 - self.attention_dropout, -++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++-+ # input_layout="BNSD", -++-+ # sparse_mode=0, -++-+ # # <--- 修改点 2: 启用内部高精度计算 --- -++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++-+ # inner_precise=1 -++-+ # ) -++-+ -++-+ # # 恢复原始数据类型 -++-+ # attn_output = attn_output.to(input_dtype) -++-+ -++-+ # # 7. 调整输出形状 -++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++-+ # attn_output = self.o_proj(attn_output) -++-+ -++-+ # attn_weights = None -++-+ # if output_attentions: -++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++-+ -++-+ # return attn_output, attn_weights, past_key_value -++-+ -++-+ # def forward( -++-+ # self, -++-+ # hidden_states: mindspore.Tensor, -++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++-+ # position_ids: Optional[mindspore.Tensor] = None, -++-+ # past_key_value: Optional[Cache] = None, -++-+ # output_attentions: bool = False, -++-+ # use_cache: bool = False, -++-+ # cache_position: Optional[mindspore.Tensor] = None, -++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++-+ -++-+ # bsz, q_len, _ = hidden_states.shape -++-+ -++-+ # query_states = self.q_proj(hidden_states) -++-+ # key_states = self.k_proj(hidden_states) -++-+ # value_states = self.v_proj(hidden_states) -++-+ -++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-+ -++-+ # kv_seq_len = key_states.shape[-2] -++-+ # if past_key_value is not None: -++-+ # if self.layer_idx is None: -++-+ # raise ValueError("`layer_idx` must be specified for caching") -++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++-+ -++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++-+ -++-+ # if past_key_value is not None: -++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++-+ # key_states, value_states = past_key_value.update( -++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++-+ # ) -++-+ -++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++-+ -++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++-+ # query_states = query_states / math.sqrt(self.head_dim) -++-+ # # <--- 修改结束 --- -++-+ -++-+ # fa_attention_mask = None -++-+ # if attention_mask is not None: -++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++-+ # fa_attention_mask = (mask_slice != 0) -++-+ -++-+ # input_dtype = query_states.dtype -++-+ -++-+ # attn_output = mindspore.ops.flash_attention_score( -++-+ # query=query_states, # 传入已经预先缩放过的 query -++-+ # key=key_states, -++-+ # value=value_states, -++-+ # head_num=self.num_heads, -++-+ # attn_mask=fa_attention_mask, -++-+ # keep_prob=1.0 - self.attention_dropout, -++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++-+ # input_layout="BNSD", -++-+ # sparse_mode=0, -++-+ # inner_precise=1 # 仍然保持内部高精度计算 -++-+ # ) -++-+ -++-+ # attn_output = attn_output.to(input_dtype) -++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++-+ # attn_output = self.o_proj(attn_output) -++-+ -++-+ # attn_weights = None -++-+ # if output_attentions: -++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++-+ -++-+ # return attn_output, attn_weights, past_key_value -++-+ -++- QWEN2MOE_ATTENTION_CLASSES = { -++- "eager": Qwen2MoeAttention, -++-+ "flash-attention": Qwen2MoeFlashAttention, -++- } -++- -++- -++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++- -++-+ #@dwj -++-+ # 只遍历激活的专家,而非全部专家 -++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++-- batch_size, sequence_length, hidden_dim = hidden_states.shape -++-- hidden_states = hidden_states.view(-1, hidden_dim) -++-- # router_logits: (batch * sequence_length, n_experts) -++-- router_logits = self.gate(hidden_states) -++-- -++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++-- if self.norm_topk_prob: -++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++-- # we cast back to the input dtype -++-- routing_weights = routing_weights.to(hidden_states.dtype) -++-- -++-- final_hidden_states = ops.zeros( -++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++-- ) -++-- -++-- # One hot encode the selected experts to create an expert mask -++-- # this will be used to easily index which expert is going to be sollicitated -++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++-- -++-- # Loop over all available experts in the model and perform the computation on each expert -++-- for expert_idx in range(self.num_experts): -++-- expert_layer = self.experts[expert_idx] -++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++-- -++-- # Index the correct hidden states and compute the expert hidden state for -++-- # the current expert. We need to make sure to multiply the output hidden -++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++-- if 0 not in idx.shape: -++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++-- -++-- # However `index_add_` only support torch tensors for indexing so we'll use -++-- # the `top_x` tensor here. -++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++-- -++-- shared_expert_output = self.shared_expert(hidden_states) -++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++-- -++-- final_hidden_states = final_hidden_states + shared_expert_output -++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++-+ num_tokens = hidden_states_reshaped.shape[0] -++-+ -++-+ router_logits = self.gate(hidden_states_reshaped) -++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++-+ -++-+ if self.norm_topk_prob: -++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++-+ routing_weights = routing_weights.to(hidden_states.dtype) -++-+ -++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++-+ flat_selected_experts = selected_experts.flatten() -++-+ -++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++-+ token_indices = broadcasted_token_indices.flatten() -++-+ -++-+ active_experts = ops.unique(flat_selected_experts) -++-+ -++-+ for expert_idx_tensor in active_experts: -++-+ expert_idx = expert_idx_tensor.item() -++-+ expert_layer = self.experts[expert_idx] -++-+ -++-+ mask = (flat_selected_experts == expert_idx_tensor) -++-+ selected_token_indices = token_indices[mask] -++-+ selected_routing_weights = routing_weights.flatten()[mask] -++-+ -++-+ current_states = hidden_states_reshaped[selected_token_indices] -++-+ -++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++-+ -++-+ final_hidden_states = final_hidden_states.index_add( -++-+ dim=0, -++-+ index=selected_token_indices, -++-+ source=expert_output.to(hidden_states.dtype) -++-+ ) -++-+ -++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++- -++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++-- return final_hidden_states, router_logits -++-+ final_hidden_states = final_hidden_states + shared_expert_output -++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++-+ -++-+ return final_hidden_states, router_logits -++- -++- -++- class Qwen2MoeDecoderLayer(nn.Module): -++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++- -++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++- -++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++-+ -++- if (layer_idx not in config.mlp_only_layers) and ( -++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++- ): -++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++- _no_split_modules = ["Qwen2MoeDecoderLayer"] -++- _skip_keys_device_placement = "past_key_values" -++- _supports_cache_class = True -++-+#lwx -++-+ # _supports_static_cache = True -++- -++- def _init_weights(self, module): -++- std = self.config.initializer_range -++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++- return causal_mask -++- -++- -++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++- _tied_weights_keys = ["lm_head.weight"] -++- -++- def __init__(self, config): -++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++- self.num_experts_per_tok = config.num_experts_per_tok -++- # Initialize weights and apply final processing -++- self.post_init() -++-+ # @lwx -++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++-+ # self.generation_config.cache_implementation = "static" -++-+ self._warmed_up = False -++-+ -++-+ def warmup_moe_model(self): -++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -++-+ test_texts = [ -++-+ "warmup short", -++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++-+ ] -++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++-+ if tokenizer is None: -++-+ from mindnlp.transformers import AutoTokenizer -++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++-+ self._warmup_tokenizer = tokenizer -++-+ -++-+ for text in test_texts: -++-+ inputs = tokenizer(text, return_tensors="ms") -++-+ with mindspore._no_grad(): -++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -++- -++- def get_input_embeddings(self): -++- return self.model.embed_tokens -++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++- ```""" -++-+ if not self._warmed_up: -++-+ self._warmed_up = True -++-+ self.warmup_moe_model() -++- -++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++- output_router_logits = ( -++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++- } -++- ) -++- return model_inputs -++-+# @lwx -++-+ # def _decode_one_tokens_logits( -++-+ # self, -++-+ # cur_token: mindspore.Tensor, -++-+ # input_pos: Optional[mindspore.Tensor], -++-+ # cache_position: mindspore.Tensor, -++-+ # past_key_values: StaticCache, -++-+ # ) -> mindspore.Tensor: -++-+ # """ -++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++-+ -++-+ # Args: -++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++-+ # input_pos: 输入位置信息,可选 -++-+ # cache_position: 当前token在cache中的位置,shape为(1,) -++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -++-+ -++-+ # Returns: -++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++-+ # """ -++-+ # # 调用JIT编译的版本 -++-+ # return self.get_decode_one_tokens_logits( -++-+ # cur_token=cur_token, -++-+ # input_pos=input_pos, -++-+ # cache_position=cache_position, -++-+ # past_key_values=past_key_values, -++-+ # ) -++-+ -++-+ # @mindspore.jit(jit_level='O1') -++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++-+ # """ -++-+ # JIT编译的函数,用于高效的单token解码 -++-+ # 使用JIT编译优化以支持静态shape和高效执行 -++-+ -++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++-+ # """ -++-+ # outputs = self.model.forward( -++-+ # input_ids=cur_token, -++-+ # position_ids=input_pos, -++-+ # cache_position=cache_position, -++-+ # past_key_values=past_key_values, -++-+ # use_cache=True, -++-+ # return_dict=False, -++-+ # ) -++-+ -++-+ # hidden_states = outputs[0] -++-+ # logits = self.lm_head.forward(hidden_states) -++-+ # logits = logits.float() -++-+ -++-+ # return logits[:, -1, :] -++-+ -++-+ # def _sample( -++-+ # self, -++-+ # input_ids: mindspore.Tensor, -++-+ # logits_processor, -++-+ # stopping_criteria, -++-+ # generation_config, -++-+ # synced_devices: bool, -++-+ # streamer=None, -++-+ # logits_warper=None, -++-+ # **model_kwargs, -++-+ # ): -++-+ # """ -++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++-+ # """ -++-+ # from ...generation.logits_process import LogitsProcessorList -++-+ # from ...generation.stopping_criteria import StoppingCriteriaList -++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++-+ # from mindnlp.core import nn, ops, no_grad -++-+ # import numpy as np -++-+ -++-+ # # 检查是否使用 StaticCache -++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++-+ # # 否则,直接调用父类方法 -++-+ # past_key_values = model_kwargs.get("past_key_values") -++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++-+ -++-+ # if not isinstance(past_key_values, StaticCache): -++-+ # # 不使用 StaticCache,直接调用父类方法 -++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++-+ # return super()._sample( -++-+ # input_ids=input_ids, -++-+ # logits_processor=logits_processor, -++-+ # stopping_criteria=stopping_criteria, -++-+ # generation_config=generation_config, -++-+ # synced_devices=synced_devices, -++-+ # streamer=streamer, -++-+ # logits_warper=logits_warper, -++-+ # **model_kwargs, -++-+ # ) -++-+ -++-+ # # 使用 StaticCache,进入自定义循环 -++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++-+ # pad_token_id = generation_config._pad_token_tensor -++-+ # output_attentions = generation_config.output_attentions -++-+ # output_hidden_states = generation_config.output_hidden_states -++-+ # output_scores = generation_config.output_scores -++-+ # output_logits = generation_config.output_logits -++-+ # return_dict_in_generate = generation_config.return_dict_in_generate -++-+ # max_length = generation_config.max_length -++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++-+ # do_sample = generation_config.do_sample -++-+ -++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++-+ # raise ValueError( -++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++-+ # f"{logits_warper})." -++-+ # ) -++-+ -++-+ # # init attention / hidden states / scores tuples -++-+ # scores = () if (return_dict_in_generate and output_scores) else None -++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++-+ -++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++-+ # encoder_hidden_states = ( -++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++-+ # ) -++-+ -++-+ # # keep track of which sequences are already finished -++-+ # batch_size, cur_len = input_ids.shape -++-+ # this_peer_finished = False -++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++-+ -++-+ # time_record = [] -++-+ # from ....utils.testing_utils import parse_flag_from_env -++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++-+ -++-+ # while self._has_unfinished_sequences( -++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++-+ # ): -++-+ # if _record_time: -++-+ # import time as time_module -++-+ # infer_start = time_module.time() -++-+ -++-+ # # prepare model inputs -++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++-+ -++-+ # # prepare variable output controls -++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++-+ -++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++-+ # cur_cache_position = model_inputs.get("cache_position") -++-+ # cur_past_key_values = model_inputs.get("past_key_values") -++-+ # cur_input_ids = model_inputs.get("input_ids") -++-+ -++-+ # if (isinstance(cur_past_key_values, StaticCache) and -++-+ # cur_cache_position is not None and -++-+ # len(cur_cache_position.shape) > 0 and -++-+ # cur_cache_position.shape[0] == 1 and -++-+ # cur_input_ids is not None and -++-+ # cur_input_ids.shape[1] == 1): -++-+ # # 使用 JIT 优化的单 token 解码 -++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++-+ # if not hasattr(self, '_jit_used'): -++-+ # self._jit_used = False -++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++-+ -++-+ # next_token_logits = self.get_decode_one_tokens_logits( -++-+ # cur_token=cur_input_ids, -++-+ # input_pos=model_inputs.get("position_ids"), -++-+ # cache_position=cur_cache_position, -++-+ # past_key_values=cur_past_key_values, -++-+ # ) -++-+ -++-+ # # 标记已使用JIT(用于后续判断) -++-+ # if not self._jit_used: -++-+ # self._jit_used = True -++-+ -++-+ # # 构造兼容的输出对象 -++-+ # class JitOptimizedOutput: -++-+ # def __init__(self, logits, config): -++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++-+ # self.config = config -++-+ # # 对于 JIT 优化路径,这些属性通常不需要 -++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -++-+ # self.attentions = None if not config.is_encoder_decoder else None -++-+ # self.cross_attentions = None -++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++-+ # self.hidden_states = None if not config.is_encoder_decoder else None -++-+ -++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++-+ # else: -++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++-+ # outputs = self(**model_inputs, return_dict=True) -++-+ -++-+ # if synced_devices and this_peer_finished: -++-+ # continue -++-+ -++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++-+ # next_token_logits = outputs.logits[:, -1, :] -++-+ -++-+ # # pre-process distribution -++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -++-+ # if do_sample: -++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -++-+ -++-+ # # Store scores, attentions and hidden_states when required -++-+ # if return_dict_in_generate: -++-+ # if output_scores: -++-+ # scores += (next_token_scores,) -++-+ # if output_logits: -++-+ # raw_logits += (next_token_logits,) -++-+ # if output_attentions: -++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++-+ # decoder_attentions += (attn,) if attn is not None else (None,) -++-+ # if self.config.is_encoder_decoder: -++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++-+ -++-+ # if output_hidden_states: -++-+ # hidden = ( -++-+ # outputs.decoder_hidden_states -++-+ # if self.config.is_encoder_decoder -++-+ # else outputs.hidden_states -++-+ # ) -++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++-+ -++-+ # # token selection -++-+ # if do_sample: -++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++-+ # else: -++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++-+ -++-+ # # finished sentences should have their next token be a padding token -++-+ # if has_eos_stopping_criteria: -++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++-+ -++-+ # # update generated ids, model inputs, and length for next step -++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++-+ # if streamer is not None: -++-+ # streamer.put(next_tokens) -++-+ -++-+ # model_kwargs = self._update_model_kwargs_for_generation( -++-+ # outputs, -++-+ # model_kwargs, -++-+ # is_encoder_decoder=self.config.is_encoder_decoder, -++-+ # ) -++-+ -++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++-+ # cur_len += 1 -++-+ -++-+ # if _record_time: -++-+ # import time as time_module -++-+ # infer_stop = time_module.time() -++-+ # time_record.append(infer_stop - infer_start) -++-+ -++-+ # del outputs -++-+ -++-+ # average_infer_time = None -++-+ # if time_record: -++-+ # if len(time_record) > 1: -++-+ # time_record.pop(0) -++-+ # average_infer_time = sum(time_record) / len(time_record) -++-+ # print(f'average inference time is: {average_infer_time}') -++-+ # print(f'inference time record: {time_record}') -++-+ -++-+ # if streamer is not None: -++-+ # streamer.end() -++-+ -++-+ # # 简单判断:打印是否使用了JIT路径 -++-+ # if hasattr(self, '_jit_used') and self._jit_used: -++-+ # print("[JIT] ✓ JIT optimization was used during generation") -++-+ # else: -++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++-+ -++-+ # if return_dict_in_generate: -++-+ # if self.config.is_encoder_decoder: -++-+ # return GenerateEncoderDecoderOutput( -++-+ # sequences=input_ids, -++-+ # scores=scores, -++-+ # logits=raw_logits, -++-+ # encoder_attentions=encoder_attentions, -++-+ # encoder_hidden_states=encoder_hidden_states, -++-+ # decoder_attentions=decoder_attentions, -++-+ # cross_attentions=cross_attentions, -++-+ # decoder_hidden_states=decoder_hidden_states, -++-+ # past_key_values=model_kwargs.get("past_key_values"), -++-+ # average_infer_time=average_infer_time -++-+ # ) -++-+ # else: -++-+ # return GenerateDecoderOnlyOutput( -++-+ # sequences=input_ids, -++-+ # scores=scores, -++-+ # logits=raw_logits, -++-+ # attentions=decoder_attentions, -++-+ # hidden_states=decoder_hidden_states, -++-+ # past_key_values=model_kwargs.get("past_key_values"), -++-+ # average_infer_time=average_infer_time -++-+ # ) -++-+ # else: -++-+ # return input_ids -++-+ -++-+ # def _prepare_cache_for_generation( -++-+ # self, -++-+ # generation_config, -++-+ # model_kwargs, -++-+ # assistant_model, -++-+ # batch_size, -++-+ # max_cache_length, -++-+ # ): -++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -++-+ # generation_config.cache_implementation = "static" -++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++-+ -++-+ # if generation_config.cache_implementation == "static": -++-+ # base_required_from_max_length = generation_config.max_length + 1 -++-+ # base_required = max(max_cache_length, base_required_from_max_length) -++-+ # min_cache_size = 50 -++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++-+ # else: -++-+ # max_cache_length = max(base_required, min_cache_size) -++-+ -++-+ # original_max_cache_length = max_cache_length -++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -++-+ # print(f" - input max_cache_length: {original_max_cache_length}") -++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++-+ # print(f" - final max_cache_length: {max_cache_length}") -++-+ -++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++-+ # if max_cache_length > self.config.max_position_embeddings: -++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++-+ -++-+ # result = super()._prepare_cache_for_generation( -++-+ # generation_config=generation_config, -++-+ # model_kwargs=model_kwargs, -++-+ # assistant_model=assistant_model, -++-+ # batch_size=batch_size, -++-+ # max_cache_length=max_cache_length, -++-+ # ) -++-+ -++-+ # if generation_config.cache_implementation == "static": -++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++-+ # created_cache = model_kwargs.get(cache_name) -++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++-+ # if created_cache.max_cache_len < generation_config.max_length: -++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++-+ -++-+ # return result -++-+ -++-+ -++-+ -++- -++- -++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++--- -++-2.27.0 -++- -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" deleted file mode 100644 index 46db89f2..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0006-20251107002commit.patch" +++ /dev/null @@ -1,7931 +0,0 @@ -From 2c9ca98c339c674179652ab1635dab69b46d9012 Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Fri, 7 Nov 2025 12:06:32 +0800 -Subject: [PATCH 06/10] 20251107002commit - ---- - .../models/deepseek/modeling_deepseek.py | 122 +- - patches/0001-20251104commit.patch | 2 +- - patches/0002-20251106commit.patch | 2 +- - patches/0003-20261106secondcommit.patch | 2 +- - patches/0004-20251106change.patch | 2 +- - patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ - 6 files changed, 7773 insertions(+), 64 deletions(-) - create mode 100644 patches/0005-20251107001commit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 8831e4b7..e7e1c053 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): - # expert_out = expert(x) - # expert_cache += expert_out * weight - # return expert_cache -- -- # @no_grad() -- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- # # x 的 shape: (1, hidden_size) -- # # flat_expert_indices 的 shape: (num_experts_per_tok,) -- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -- -- # # 1. 收集所有需要的专家层 -- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -- # selected_experts = [self.experts[i] for i in flat_expert_indices] -- -- # # 2. 并行计算所有专家的输出 -- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -- # # ops.cat 会将它们堆叠成一个新的 Tensor -- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -- -- # # 3. 使用矩阵乘法进行加权求和 -- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- # # 最终结果 final_output 的 shape: (1, hidden_size) -- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+ -+ @no_grad() -+ dwj -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ # x 的 shape: (1, hidden_size) -+ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+ -+ # 1. 收集所有需要的专家层 -+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+ selected_experts = [self.experts[i] for i in flat_expert_indices] -+ -+ # 2. 并行计算所有专家的输出 -+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+ # ops.cat 会将它们堆叠成一个新的 Tensor -+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+ -+ # 3. 使用矩阵乘法进行加权求和 -+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ # 最终结果 final_output 的 shape: (1, hidden_size) -+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) - -- # return final_output -+ return final_output - - - # @no_grad() -@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): - - return expert_cache - # 放置在 DeepseekMoE 类中 -- @no_grad() -- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- """ -- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -- -- Args: -- x (Tensor): 输入张量, shape: (1, hidden_size) -- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -- """ -- top_k, _ = flat_expert_weights.shape -- hidden_size = x.shape[-1] -- -- # 1. 将所有专家的权重堆叠起来 -- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+ # @no_grad() -+ # #lwx 20251107 -+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ # """ -+ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+ -+ # Args: -+ # x (Tensor): 输入张量, shape: (1, hidden_size) -+ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+ # """ -+ # top_k, _ = flat_expert_weights.shape -+ # hidden_size = x.shape[-1] -+ -+ # # 1. 将所有专家的权重堆叠起来 -+ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) - -- # 2. "收集" 所需的专家权重 -- selected_gate_w = stacked_gate_w[flat_expert_indices] -- selected_up_w = stacked_up_w[flat_expert_indices] -- selected_down_w = stacked_down_w[flat_expert_indices] -+ # # 2. "收集" 所需的专家权重 -+ # selected_gate_w = stacked_gate_w[flat_expert_indices] -+ # selected_up_w = stacked_up_w[flat_expert_indices] -+ # selected_down_w = stacked_down_w[flat_expert_indices] - -- # 3. 准备输入 -- x_expanded = x.expand((top_k, 1, hidden_size)) -+ # # 3. 准备输入 -+ # x_expanded = x.expand((top_k, 1, hidden_size)) - -- # 4. 并行计算 gate_proj 和 up_proj -- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+ # # 4. 并行计算 gate_proj 和 up_proj -+ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) - -- # 5. 计算中间状态 -- intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+ # # 5. 计算中间状态 -+ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out - -- # 6. 并行计算 down_proj -- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -- # --- [FIX] --- -- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -- # --- [FIX END] --- -+ # # 6. 并行计算 down_proj -+ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+ # # --- [FIX] --- -+ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+ # # --- [FIX END] --- - -- # 7. 根据路由权重进行加权求和 -- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+ # # 7. 根据路由权重进行加权求和 -+ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) - -- return weighted_sum -+ # return weighted_sum - - - -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -index 0a0ef2d7..2842180e 100644 ---- a/patches/0001-20251104commit.patch -+++ b/patches/0001-20251104commit.patch -@@ -1,7 +1,7 @@ - From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/4] 20251104commit -+Subject: [PATCH 1/5] 20251104commit - - --- - mindnlp/transformers/cache_utils.py | 28 +- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -index 5185270c..c6cd8757 100644 ---- a/patches/0002-20251106commit.patch -+++ b/patches/0002-20251106commit.patch -@@ -1,7 +1,7 @@ - From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/4] 20251106commit -+Subject: [PATCH 2/5] 20251106commit - - --- - .../models/deepseek/modeling_deepseek.py | 379 ++++- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -index 3e05f821..601960c9 100644 ---- a/patches/0003-20261106secondcommit.patch -+++ b/patches/0003-20261106secondcommit.patch -@@ -1,7 +1,7 @@ - From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/4] 20261106secondcommit -+Subject: [PATCH 3/5] 20261106secondcommit - - --- - .../models/deepseek/modeling_deepseek.py | 217 ++- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -index 88a1aef4..8976f10b 100644 ---- a/patches/0004-20251106change.patch -+++ b/patches/0004-20251106change.patch -@@ -1,7 +1,7 @@ - From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 15:48:09 +0800 --Subject: [PATCH 4/4] 20251106change -+Subject: [PATCH 4/5] 20251106change - - --- - .../models/deepseek/modeling_deepseek.py | 189 +- -diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -new file mode 100644 -index 00000000..8d9032be ---- /dev/null -+++ b/patches/0005-20251107001commit.patch -@@ -0,0 +1,7707 @@ -+From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Fri, 7 Nov 2025 11:48:18 +0800 -+Subject: [PATCH 5/5] 20251107001commit -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 91 +- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- -+ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- -+ patches/0001-20251104commit.patch | 2 +- -+ patches/0002-20251106commit.patch | 2 +- -+ patches/0003-20261106secondcommit.patch | 2 +- -+ patches/0004-20251106change.patch | 7498 +++++++++++++++++ -+ 7 files changed, 7577 insertions(+), 30 deletions(-) -+ create mode 100644 patches/0004-20251106change.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index 0546f318..8831e4b7 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): -+ # expert_cache += expert_out * weight -+ # return expert_cache -+ -+- @no_grad() -+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+- # x 的 shape: (1, hidden_size) -+- # flat_expert_indices 的 shape: (num_experts_per_tok,) -+- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+- -+- # 1. 收集所有需要的专家层 -+- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+- selected_experts = [self.experts[i] for i in flat_expert_indices] -+- -+- # 2. 并行计算所有专家的输出 -+- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+- # ops.cat 会将它们堆叠成一个新的 Tensor -+- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+- -+- # 3. 使用矩阵乘法进行加权求和 -+- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+- # 最终结果 final_output 的 shape: (1, hidden_size) -+- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++ # @no_grad() -++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ # # x 的 shape: (1, hidden_size) -++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) -++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++ -++ # # 1. 收集所有需要的专家层 -++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++ # selected_experts = [self.experts[i] for i in flat_expert_indices] -++ -++ # # 2. 并行计算所有专家的输出 -++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++ # # ops.cat 会将它们堆叠成一个新的 Tensor -++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++ -++ # # 3. 使用矩阵乘法进行加权求和 -++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ # # 最终结果 final_output 的 shape: (1, hidden_size) -++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+ -+- return final_output -++ # return final_output -+ -+ -+ # @no_grad() -+@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): -+ ) -+ -+ return expert_cache -++# 放置在 DeepseekMoE 类中 -++ @no_grad() -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ """ -++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -++ -++ Args: -++ x (Tensor): 输入张量, shape: (1, hidden_size) -++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -++ """ -++ top_k, _ = flat_expert_weights.shape -++ hidden_size = x.shape[-1] -++ -++ # 1. 将所有专家的权重堆叠起来 -++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -++ -++ # 2. "收集" 所需的专家权重 -++ selected_gate_w = stacked_gate_w[flat_expert_indices] -++ selected_up_w = stacked_up_w[flat_expert_indices] -++ selected_down_w = stacked_down_w[flat_expert_indices] -++ -++ # 3. 准备输入 -++ x_expanded = x.expand((top_k, 1, hidden_size)) -++ -++ # 4. 并行计算 gate_proj 和 up_proj -++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -++ -++ # 5. 计算中间状态 -++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -++ -++ # 6. 并行计算 down_proj -++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -++ # --- [FIX] --- -++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -++ # --- [FIX END] --- -++ -++ # 7. 根据路由权重进行加权求和 -++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -++ -++ return weighted_sum -++ -++ -+ -+ # @no_grad() -+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index ebd7782e..913a7609 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): -+ # Copied from transformers.models.llama.modeling_llama.rotate_half -+ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+- x1 = x[..., : x.shape[-1] // 2] -+- x2 = x[..., x.shape[-1] // 2 :] -++ # x1 = x[..., : x.shape[-1] // 2] -++ # x2 = x[..., x.shape[-1] // 2 :] -+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+index d059dcbe..2b217b64 100644 -+--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): -+ # Copied from transformers.models.llama.modeling_llama.rotate_half -+ def rotate_half(x): -+ """Rotates half the hidden dims of the input.""" -+- x1 = x[..., : x.shape[-1] // 2] -+- x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++ # x1 = x[..., : x.shape[-1] // 2] -++ # x2 = x[..., x.shape[-1] // 2 :] -++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+ return ops.cat((-x2, x1), dim=-1) -+ -+ -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+index 78f22642..0a0ef2d7 100644 -+--- a/patches/0001-20251104commit.patch -++++ b/patches/0001-20251104commit.patch -+@@ -1,7 +1,7 @@ -+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+-Subject: [PATCH 1/3] 20251104commit -++Subject: [PATCH 1/4] 20251104commit -+ -+ --- -+ mindnlp/transformers/cache_utils.py | 28 +- -+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+index 22b65dd5..5185270c 100644 -+--- a/patches/0002-20251106commit.patch -++++ b/patches/0002-20251106commit.patch -+@@ -1,7 +1,7 @@ -+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+-Subject: [PATCH 2/3] 20251106commit -++Subject: [PATCH 2/4] 20251106commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+index 966529e4..3e05f821 100644 -+--- a/patches/0003-20261106secondcommit.patch -++++ b/patches/0003-20261106secondcommit.patch -+@@ -1,7 +1,7 @@ -+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+-Subject: [PATCH 3/3] 20261106secondcommit -++Subject: [PATCH 3/4] 20261106secondcommit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 217 ++- -+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+new file mode 100644 -+index 00000000..88a1aef4 -+--- /dev/null -++++ b/patches/0004-20251106change.patch -+@@ -0,0 +1,7498 @@ -++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Thu, 6 Nov 2025 15:48:09 +0800 -++Subject: [PATCH 4/4] 20251106change -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 189 +- -++ patches/0001-20251104commit.patch | 1272 +++++++ -++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -++ 4 files changed, 7244 insertions(+), 186 deletions(-) -++ create mode 100644 patches/0001-20251104commit.patch -++ create mode 100644 patches/0002-20251106commit.patch -++ create mode 100644 patches/0003-20261106secondcommit.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index 2f9192bf..0546f318 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -++ -++ return attn_output, attn_weights, past_key_value -++ -++-# class DeepseekFlashAttention(nn.Module): -++-# """ -++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -++- -++-# This class is designed as a drop-in replacement for DeepseekAttention. -++-# """ -++- -++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++-# super().__init__() -++-# self.config = config -++-# self.layer_idx = layer_idx -++-# if layer_idx is None: -++-# logger.warning( -++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++-# "when creating this class." -++-# ) -++- -++-# self.attention_dropout = config.attention_dropout -++-# self.hidden_size = config.hidden_size -++-# self.num_heads = config.num_attention_heads -++-# self.head_dim = self.hidden_size // self.num_heads -++-# self.num_key_value_heads = config.num_key_value_heads -++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++-# self.max_position_embeddings = config.max_position_embeddings -++-# self.rope_theta = config.rope_theta -++-# self.is_causal = True -++- -++-# if (self.head_dim * self.num_heads) != self.hidden_size: -++-# raise ValueError( -++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++-# f" and `num_heads`: {self.num_heads})." -++-# ) -++- -++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++-# self._init_rope() -++- -++-# def _init_rope(self): -++-# if self.config.rope_scaling is None: -++-# self.rotary_emb = DeepseekRotaryEmbedding( -++-# self.head_dim, -++-# max_position_embeddings=self.max_position_embeddings, -++-# base=self.rope_theta, -++-# ) -++-# else: -++-# scaling_type = self.config.rope_scaling["type"] -++-# scaling_factor = self.config.rope_scaling["factor"] -++-# if scaling_type == "linear": -++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++-# self.head_dim, -++-# max_position_embeddings=self.max_position_embeddings, -++-# scaling_factor=scaling_factor, -++-# base=self.rope_theta, -++-# ) -++-# elif scaling_type == "dynamic": -++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++-# self.head_dim, -++-# max_position_embeddings=self.max_position_embeddings, -++-# scaling_factor=scaling_factor, -++-# base=self.rope_theta, -++-# ) -++-# else: -++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++- -++-# def forward( -++-# self, -++-# hidden_states: mindspore.Tensor, -++-# attention_mask: Optional[mindspore.Tensor] = None, -++-# position_ids: Optional[mindspore.Tensor] = None, -++-# past_key_value: Optional[Cache] = None, -++-# output_attentions: bool = False, -++-# use_cache: bool = False, -++-# **kwargs, -++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++-# if "padding_mask" in kwargs: -++-# warnings.warn( -++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++-# ) -++- -++-# if output_attentions: -++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -++- -++-# bsz, q_len, _ = hidden_states.shape -++- -++-# if self.config.pretraining_tp > 1: -++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++- -++-# query_states = self.q_proj(hidden_states) -++-# key_states = self.k_proj(hidden_states) -++-# value_states = self.v_proj(hidden_states) -++- -++-# # Reshape for multi-head attention -++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++- -++-# kv_seq_len = key_states.shape[-2] -++-# if past_key_value is not None: -++-# if self.layer_idx is None: -++-# raise ValueError( -++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++-# "with a layer index." -++-# ) -++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++- -++-# # Apply Rotary Positional Embedding -++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++- -++-# if past_key_value is not None: -++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++- -++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++- -++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++- -++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++- -++-# # Convert attention_mask for flash_attention_score -++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -++-# if attention_mask is not None: -++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++-# raise ValueError( -++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++-# ) -++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -++-# else: -++-# attn_mask_for_fa = None -++- -++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++- -++-# # Call the fused flash_attention_score operator -++-# attn_output = mindspore.ops.flash_attention_score( -++-# query=query_states_for_fa, -++-# key=key_states_for_fa, -++-# value=value_states_for_fa, -++-# head_num=self.num_heads, # This is N1, the number of query heads -++-# input_layout='BSH', -++-# attn_mask=attn_mask_for_fa, -++-# keep_prob=keep_prob, -++-# scalar_value=1.0 / math.sqrt(self.head_dim), -++-# sparse_mode=0 # Default mask mode -++-# ) -++- -++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -++-# attn_output = self.o_proj(attn_output) -++- -++-# # Flash Attention does not return attention weights -++-# attn_weights = None -++- -++-# return attn_output, attn_weights, past_key_value -++ -++ class DeepseekFlashAttention(nn.Module): -++ """ -++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -++ super().__init__() -++ self.hidden_size = config.hidden_size -++ -++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -++- config=config, layer_idx=layer_idx -++- ) -+++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+++ # config=config, layer_idx=layer_idx -+++ # ) -++ -++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -++ config=config, layer_idx=layer_idx -++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -++ return outputs -++ -++ -++- -++ class DeepseekPreTrainedModel(PreTrainedModel): -++ config_class = DeepseekConfig -++ base_model_prefix = "model" -++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++ # Initialize weights and apply final processing -++ self.post_init() -++ self.warm_up = False -++- #@dwj -++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -++- self.num_layers, -++- self.num_attention_heads, -++- self.head_dim, -++- batch_size=1, -++- max_length=self.max_length, -++- dtype=mindspore.float16 -++- ) -++- -++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -++- key_cache = [] -++- value_cache = [] -++- for _ in range(num_layers): -++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++- key_cache.append(k) -++- value_cache.append(v) -++- return key_cache, value_cache -++- -++ -++ def warmup_moe_model_deep(self): -++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++new file mode 100644 -++index 00000000..78f22642 -++--- /dev/null -+++++ b/patches/0001-20251104commit.patch -++@@ -0,0 +1,1272 @@ -+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++Subject: [PATCH 1/3] 20251104commit -+++ -+++--- -+++ mindnlp/transformers/cache_utils.py | 28 +- -+++ .../models/deepseek/modeling_deepseek.py | 149 ++- -+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++ 3 files changed, 976 insertions(+), 87 deletions(-) -+++ -+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++index cadd2e04..02f8d4be 100644 -+++--- a/mindnlp/transformers/cache_utils.py -++++++ b/mindnlp/transformers/cache_utils.py -+++@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++ # k_out[:, :, cache_position] = key_states -+++ # v_out[:, :, cache_position] = value_states -+++- if ON_ORANGE_PI: -+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++- else: -+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++- -++++ # if ON_ORANGE_PI: -++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++ # else: -++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++ # 确保 cache_position 是 1D tensor 并且类型正确 -++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++ if cache_position.ndim > 1: -++++ cache_position = cache_position.flatten() -++++ # 确保类型是 int32 或 int64(MindSpore 要求) -++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++ cache_position = cache_position.int() -++++ -++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++ k_out[:, :, cache_position] = key_states -++++ v_out[:, :, cache_position] = value_states -++++ -+++ return k_out, v_out -+++ -+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index c695b944..d8303e45 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++- x1 = x[..., : x.shape[-1] // 2] -+++- x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++ # x1 = x[..., : x.shape[-1] // 2] -++++ # x2 = x[..., x.shape[-1] // 2 :] -++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++ if self.training: -+++ raise NotImplementedError("Training is not supported yet.") -+++ else: -+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++- if self.config.n_shared_experts is not None: -+++- y = y + self.shared_experts(identity) -+++- return y -++++ # @lwx -++++ if orig_shape[1] == 1: -++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++ y=y.view(*orig_shape) -++++ if self.config.n_shared_experts is not None: -++++ y = y + self.shared_experts(identity) -++++ return y -++++ else: -++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++ if self.config.n_shared_experts is not None: -++++ y = y + self.shared_experts(identity) -++++ return y -++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++ # if self.config.n_shared_experts is not None: -++++ # y = y + self.shared_experts(identity) -++++ # return y -++++ -++++ @no_grad() -++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ -++++ expert_cache = ops.zeros_like(x) -++++ for i in range(self.num_experts_per_tok): -++++ expert_id = flat_expert_indices[i].item() -++++ weight = flat_expert_weights[i].item() -++++ expert = self.experts[expert_id] -++++ expert_out = expert(x) -++++ expert_cache += expert_out * weight -++++ return expert_cache -+++ -+++ @no_grad() -+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++- # expert_cache = torch.zeros_like(x) -+++- # idxs = flat_expert_indices.argsort() -+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++- # token_idxs = idxs // self.num_experts_per_tok -+++- # for i, end_idx in enumerate(tokens_per_expert): -+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++- # if start_idx == end_idx: -+++- # continue -+++- # expert = self.experts[i] -+++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++- # expert_tokens = x[exp_token_idx] -+++- # expert_out = expert(expert_tokens) -+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++- # return expert_cache -++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++ expert_cache = ops.zeros_like(x) -+++ idxs = flat_expert_indices.argsort() -+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++ token_idxs = idxs // self.num_experts_per_tok -++++ -+++ for i, end_idx in enumerate(tokens_per_expert): -+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++ if start_idx == end_idx: -+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++ expert_out = expert(expert_tokens) -+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -+++ return expert_cache -++++ -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # # expert_cache = torch.zeros_like(x) -++++ # # idxs = flat_expert_indices.argsort() -++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++ # # token_idxs = idxs // self.num_experts_per_tok -++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++ # # if start_idx == end_idx: -++++ # # continue -++++ # # expert = self.experts[i] -++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # # expert_tokens = x[exp_token_idx] -++++ # # expert_out = expert(expert_tokens) -++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++ # # return expert_cache -++++ # expert_cache = ops.zeros_like(x) -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # for i, end_idx in enumerate(tokens_per_expert): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # if start_idx == end_idx: -++++ # continue -++++ # expert = self.experts[i] -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = expert(expert_tokens) -++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -++++ # return expert_cache -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # expert_cache = ops.zeros_like(x) -++++ -++++ # # 排序保证顺序一致 -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # # 找出有 token 的专家 -++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++ -++++ # for i in active_experts.tolist(): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # end_idx = tokens_per_expert[i] -++++ # if start_idx == end_idx: # 没有 token -++++ # continue -++++ -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = self.experts[i](expert_tokens) -++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++ -++++ # expert_cache = mindspore.mint.scatter_add( -++++ # expert_cache, -++++ # 0, -++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++ # expert_out -++++ # ) -++++ -++++ # return expert_cache -++++ -++++ -+++ -+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++ # """ -+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ -+++ # Initialize weights and apply final processing -+++ self.post_init() -++++ self.warm_up = False -++++ -++++ def warmup_moe_model_deep(self): -++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++ test_texts = [ -++++ "warmup short", -++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++ ] -++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++ if tokenizer is None: -++++ from mindnlp.transformers import AutoTokenizer -++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++ self._warmup_tokenizer = tokenizer -++++ -++++ for text in test_texts: -++++ inputs = tokenizer(text, return_tensors="ms") -++++ with mindspore._no_grad(): -++++ _ = self(**inputs, use_cache=False) -++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++ -+++ def get_input_embeddings(self): -+++ return self.model.embed_tokens -+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++ ```""" -++++ if not self.warm_up: -++++ self.warm_up = True -++++ self.warmup_moe_model_deep() -++++ -+++ output_attentions = ( -+++ output_attentions -+++ if output_attentions is not None -+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++index 3cbf820e..d4c6b651 100644 -+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++@@ -18,7 +18,6 @@ -+++ # See the License for the specific language governing permissions and -+++ # limitations under the License. -+++ """MindSpore Qwen2MoE model.""" -+++- -+++ import math -+++ from typing import List, Optional, Tuple, Union -+++ -+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++ TokenClassifierOutput, -+++ ) -+++ from ...modeling_utils import PreTrainedModel -++++from ...generation import GenerationMixin -+++ from ....utils import logging -+++ from .configuration_qwen2_moe import Qwen2MoeConfig -+++ -+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++ self.variance_epsilon = eps -+++ -+++ def forward(self, hidden_states): -++++ # @dwj -++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++ # @lwx -++++ # if not self.training : -++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++ input_dtype = hidden_states.dtype -+++ hidden_states = hidden_states.to(mindspore.float32) -+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++@@ -234,6 +239,8 @@ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++ x1 = x[..., : x.shape[-1] // 2] -+++ x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++ self.config = config -+++ self.hidden_size = config.hidden_size -+++ self.intermediate_size = intermediate_size -++++ -+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++ self.act_fn = ACT2FN[config.hidden_act] -+++ -+++ def forward(self, x): -+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++- -+++ -++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++ # @lwx -++++ # gate_up_output = self.gate_up_proj(x) -++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++ # return self.down_proj(swiglu_output) -++++ -++++ # def forward(self, x): -++++ # gate_proj_out = self.gate_proj(x) -++++ # up_proj_out = self.up_proj(x) -++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++ # return self.down_proj(swiglu_out) -++++ -+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++ """ -+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++ use_cache: bool = False, -+++ cache_position: Optional[mindspore.Tensor] = None, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ -++++ -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ query_states = self.q_proj(hidden_states) -+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++ "with a layer index." -+++ ) -+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ if isinstance(past_key_value, StaticCache): -++++ kv_seq_len = key_states.shape[-2] -++++ else: -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++ if isinstance(past_key_value, StaticCache): -++++ kv_seq_len = key_states.shape[-2] -+++ -+++ # repeat k/v heads if n_kv_heads < n_heads -+++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++- -++++ -+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++ -+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++- raise ValueError( -+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++- f" {attn_weights.shape}" -+++- ) -+++- -+++- if attention_mask is not None: # no matter the length, we just slice it -+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++ if attention_mask is not None: -++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++ attn_weights = attn_weights + causal_mask -+++ -+++ # upcast attention to fp32 -+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++ -+++ attn_output = self.o_proj(attn_output) -+++- -++++ # @lwx -++++ -++++ # max_seq_len = self.max_position_embeddings # 2048 -++++ -++++ # if attention_mask is not None: -++++ # # attention_mask: [B, 1, Sq, Sk] -++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++ -++++ # # pad 到 [max_seq_len, max_seq_len] -++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++ # global_attention_mask = padded_mask -++++ # else: -++++ # global_attention_mask = None -++++ -++++ -++++ # sparse_mode=3 -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, -++++ # key=key_states, -++++ # value=value_states, -++++ # real_shift=None, -++++ # padding_mask=None, -++++ -++++ # head_num=self.num_heads, -++++ # attn_mask=global_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++ # input_layout="BNSD", -++++ # pre_tokens=2147483647, -++++ # next_tokens=2147483647, -++++ # inner_precise=0, -++++ # drop_mask=None, -++++ # prefix=None, -++++ # actual_seq_qlen=None, -++++ # actual_seq_kvlen=None, -++++ # sparse_mode=sparse_mode, -++++ # ) -+++ if not output_attentions: -+++ attn_weights = None -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++ -++++class Qwen2MoeFlashAttention(nn.Module): -++++ """ -++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++ -++++ 关键改动: -++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++ 直接传入原始的 key 和 value 张量效率更高。 -++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++ """ -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++ super().__init__() -++++ self.config = config -++++ self.layer_idx = layer_idx -++++ self.hidden_size = config.hidden_size -++++ self.num_heads = config.num_attention_heads -++++ self.head_dim = self.hidden_size // self.num_heads -++++ self.num_key_value_heads = config.num_key_value_heads -++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++ self.max_position_embeddings = config.max_position_embeddings -++++ self.rope_theta = config.rope_theta -++++ self.attention_dropout = config.attention_dropout -++++ -++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++ raise ValueError( -++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++ ) -++++ -++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++ -++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++ self.head_dim, -++++ max_position_embeddings=self.max_position_embeddings, -++++ base=self.rope_theta, -++++ ) -++++ -++++ def forward( -++++ self, -++++ hidden_states: mindspore.Tensor, -++++ attention_mask: Optional[mindspore.Tensor] = None, -++++ position_ids: Optional[mindspore.Tensor] = None, -++++ past_key_value: Optional[Cache] = None, -++++ output_attentions: bool = False, -++++ use_cache: bool = False, -++++ cache_position: Optional[mindspore.Tensor] = None, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ # 1. 线性投射 Q, K, V -++++ query_states = self.q_proj(hidden_states) -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++ # query: [B, S, H*D] -> [B, N1, S, D] -++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # 3. RoPE 旋转位置编码 -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++ if self.layer_idx is None: -++++ raise ValueError( -++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ "with a layer index." -++++ ) -++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++ if cache_position.shape[0] == 1: -++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++ kv_seq_len = past_seen_tokens + 1 -++++ else: -++++ # prefill 阶段:cache_position 是范围,使用其长度 -++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++ else: -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # 4. KV 缓存更新 -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ key_states, value_states = past_key_value.update( -++++ key_states, value_states, self.layer_idx, cache_kwargs -++++ ) -++++ -++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++ if cache_position.shape[0] == 1: -++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++ kv_seq_len = key_states.shape[-2] -++++ -++++ # 5. [重要] 准备 Attention Mask -++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++ fa_attention_mask = None -++++ if attention_mask is not None: -++++ # 截取与当前key长度匹配的部分 -++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++ fa_attention_mask = (mask_slice != 0) -++++ -++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++ input_dtype = query_states.dtype -++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++ query_states = query_states.to(mindspore.float16) -++++ key_states = key_states.to(mindspore.float16) -++++ value_states = value_states.to(mindspore.float16) -++++ -++++ # 6. [核心] 调用 flash_attention_score 算子 -++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++ attn_output = mindspore.ops.flash_attention_score( -++++ query=query_states, -++++ key=key_states, -++++ value=value_states, -++++ head_num=self.num_heads, # 传入Q的头数(N1) -++++ attn_mask=fa_attention_mask, -++++ keep_prob=1.0 - self.attention_dropout, -++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++ input_layout="BNSD", -++++ sparse_mode=0 # 使用 defaultMask 模式 -++++ ) -++++ -++++ # 恢复原始数据类型 -++++ attn_output = attn_output.to(input_dtype) -++++ -++++ # 7. 调整输出形状 -++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ attn_output = self.o_proj(attn_output) -++++ -++++ # FlashAttention 算子不直接返回注意力权重矩阵 -++++ attn_weights = None -++++ if output_attentions: -++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++ # def forward( -++++ # self, -++++ # hidden_states: mindspore.Tensor, -++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++ # position_ids: Optional[mindspore.Tensor] = None, -++++ # past_key_value: Optional[Cache] = None, -++++ # output_attentions: bool = False, -++++ # use_cache: bool = False, -++++ # cache_position: Optional[mindspore.Tensor] = None, -++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ # bsz, q_len, _ = hidden_states.shape -++++ -++++ # # 1. 线性投射 Q, K, V -++++ # query_states = self.q_proj(hidden_states) -++++ # key_states = self.k_proj(hidden_states) -++++ # value_states = self.v_proj(hidden_states) -++++ -++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # # 3. RoPE 旋转位置编码 -++++ # kv_seq_len = key_states.shape[-2] -++++ # if past_key_value is not None: -++++ # if self.layer_idx is None: -++++ # raise ValueError( -++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ # "with a layer index." -++++ # ) -++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # # 4. KV 缓存更新 -++++ # if past_key_value is not None: -++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ # key_states, value_states = past_key_value.update( -++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++ # ) -++++ -++++ # # 5. 准备 Attention Mask -++++ # fa_attention_mask = None -++++ # if attention_mask is not None: -++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # fa_attention_mask = (mask_slice != 0) -++++ -++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++ # input_dtype = query_states.dtype -++++ -++++ # # 6. [核心] 调用 flash_attention_score 算子 -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, -++++ # key=key_states, -++++ # value=value_states, -++++ # head_num=self.num_heads, -++++ # attn_mask=fa_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++ # input_layout="BNSD", -++++ # sparse_mode=0, -++++ # # <--- 修改点 2: 启用内部高精度计算 --- -++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++ # inner_precise=1 -++++ # ) -++++ -++++ # # 恢复原始数据类型 -++++ # attn_output = attn_output.to(input_dtype) -++++ -++++ # # 7. 调整输出形状 -++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ # attn_output = self.o_proj(attn_output) -++++ -++++ # attn_weights = None -++++ # if output_attentions: -++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++ # return attn_output, attn_weights, past_key_value -++++ -++++ # def forward( -++++ # self, -++++ # hidden_states: mindspore.Tensor, -++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++ # position_ids: Optional[mindspore.Tensor] = None, -++++ # past_key_value: Optional[Cache] = None, -++++ # output_attentions: bool = False, -++++ # use_cache: bool = False, -++++ # cache_position: Optional[mindspore.Tensor] = None, -++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ # bsz, q_len, _ = hidden_states.shape -++++ -++++ # query_states = self.q_proj(hidden_states) -++++ # key_states = self.k_proj(hidden_states) -++++ # value_states = self.v_proj(hidden_states) -++++ -++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ # kv_seq_len = key_states.shape[-2] -++++ # if past_key_value is not None: -++++ # if self.layer_idx is None: -++++ # raise ValueError("`layer_idx` must be specified for caching") -++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ # if past_key_value is not None: -++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ # key_states, value_states = past_key_value.update( -++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++ # ) -++++ -++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++ -++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++ # query_states = query_states / math.sqrt(self.head_dim) -++++ # # <--- 修改结束 --- -++++ -++++ # fa_attention_mask = None -++++ # if attention_mask is not None: -++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ # fa_attention_mask = (mask_slice != 0) -++++ -++++ # input_dtype = query_states.dtype -++++ -++++ # attn_output = mindspore.ops.flash_attention_score( -++++ # query=query_states, # 传入已经预先缩放过的 query -++++ # key=key_states, -++++ # value=value_states, -++++ # head_num=self.num_heads, -++++ # attn_mask=fa_attention_mask, -++++ # keep_prob=1.0 - self.attention_dropout, -++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++ # input_layout="BNSD", -++++ # sparse_mode=0, -++++ # inner_precise=1 # 仍然保持内部高精度计算 -++++ # ) -++++ -++++ # attn_output = attn_output.to(input_dtype) -++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ # attn_output = self.o_proj(attn_output) -++++ -++++ # attn_weights = None -++++ # if output_attentions: -++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++ -++++ # return attn_output, attn_weights, past_key_value -++++ -+++ QWEN2MOE_ATTENTION_CLASSES = { -+++ "eager": Qwen2MoeAttention, -++++ "flash-attention": Qwen2MoeFlashAttention, -+++ } -+++ -+++ -+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -++++ #@dwj -++++ # 只遍历激活的专家,而非全部专家 -+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- hidden_states = hidden_states.view(-1, hidden_dim) -+++- # router_logits: (batch * sequence_length, n_experts) -+++- router_logits = self.gate(hidden_states) -+++- -+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- if self.norm_topk_prob: -+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- # we cast back to the input dtype -+++- routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++- final_hidden_states = ops.zeros( -+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++- ) -+++- -+++- # One hot encode the selected experts to create an expert mask -+++- # this will be used to easily index which expert is going to be sollicitated -+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++- -+++- # Loop over all available experts in the model and perform the computation on each expert -+++- for expert_idx in range(self.num_experts): -+++- expert_layer = self.experts[expert_idx] -+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++- -+++- # Index the correct hidden states and compute the expert hidden state for -+++- # the current expert. We need to make sure to multiply the output hidden -+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++- if 0 not in idx.shape: -+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++- -+++- # However `index_add_` only support torch tensors for indexing so we'll use -+++- # the `top_x` tensor here. -+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++- -+++- shared_expert_output = self.shared_expert(hidden_states) -+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++- -+++- final_hidden_states = final_hidden_states + shared_expert_output -++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++ num_tokens = hidden_states_reshaped.shape[0] -++++ -++++ router_logits = self.gate(hidden_states_reshaped) -++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++ if self.norm_topk_prob: -++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++ flat_selected_experts = selected_experts.flatten() -++++ -++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++ token_indices = broadcasted_token_indices.flatten() -++++ -++++ active_experts = ops.unique(flat_selected_experts) -++++ -++++ for expert_idx_tensor in active_experts: -++++ expert_idx = expert_idx_tensor.item() -++++ expert_layer = self.experts[expert_idx] -++++ -++++ mask = (flat_selected_experts == expert_idx_tensor) -++++ selected_token_indices = token_indices[mask] -++++ selected_routing_weights = routing_weights.flatten()[mask] -++++ -++++ current_states = hidden_states_reshaped[selected_token_indices] -++++ -++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++ -++++ final_hidden_states = final_hidden_states.index_add( -++++ dim=0, -++++ index=selected_token_indices, -++++ source=expert_output.to(hidden_states.dtype) -++++ ) -++++ -++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++ -+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++- return final_hidden_states, router_logits -++++ final_hidden_states = final_hidden_states + shared_expert_output -++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++ -++++ return final_hidden_states, router_logits -+++ -+++ -+++ class Qwen2MoeDecoderLayer(nn.Module): -+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++ -+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++ -+++ if (layer_idx not in config.mlp_only_layers) and ( -+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++ ): -+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++ _skip_keys_device_placement = "past_key_values" -+++ _supports_cache_class = True -++++#lwx -++++ # _supports_static_cache = True -+++ -+++ def _init_weights(self, module): -+++ std = self.config.initializer_range -+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++ return causal_mask -+++ -+++ -+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ _tied_weights_keys = ["lm_head.weight"] -+++ -+++ def __init__(self, config): -+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ self.num_experts_per_tok = config.num_experts_per_tok -+++ # Initialize weights and apply final processing -+++ self.post_init() -++++ # @lwx -++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++ # self.generation_config.cache_implementation = "static" -++++ self._warmed_up = False -++++ -++++ def warmup_moe_model(self): -++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++ test_texts = [ -++++ "warmup short", -++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++ ] -++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++ if tokenizer is None: -++++ from mindnlp.transformers import AutoTokenizer -++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++ self._warmup_tokenizer = tokenizer -++++ -++++ for text in test_texts: -++++ inputs = tokenizer(text, return_tensors="ms") -++++ with mindspore._no_grad(): -++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++ -+++ def get_input_embeddings(self): -+++ return self.model.embed_tokens -+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++ ```""" -++++ if not self._warmed_up: -++++ self._warmed_up = True -++++ self.warmup_moe_model() -+++ -+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++ output_router_logits = ( -+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++ } -+++ ) -+++ return model_inputs -++++# @lwx -++++ # def _decode_one_tokens_logits( -++++ # self, -++++ # cur_token: mindspore.Tensor, -++++ # input_pos: Optional[mindspore.Tensor], -++++ # cache_position: mindspore.Tensor, -++++ # past_key_values: StaticCache, -++++ # ) -> mindspore.Tensor: -++++ # """ -++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++ -++++ # Args: -++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++ # input_pos: 输入位置信息,可选 -++++ # cache_position: 当前token在cache中的位置,shape为(1,) -++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++ -++++ # Returns: -++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++ # """ -++++ # # 调用JIT编译的版本 -++++ # return self.get_decode_one_tokens_logits( -++++ # cur_token=cur_token, -++++ # input_pos=input_pos, -++++ # cache_position=cache_position, -++++ # past_key_values=past_key_values, -++++ # ) -++++ -++++ # @mindspore.jit(jit_level='O1') -++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++ # """ -++++ # JIT编译的函数,用于高效的单token解码 -++++ # 使用JIT编译优化以支持静态shape和高效执行 -++++ -++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++ # """ -++++ # outputs = self.model.forward( -++++ # input_ids=cur_token, -++++ # position_ids=input_pos, -++++ # cache_position=cache_position, -++++ # past_key_values=past_key_values, -++++ # use_cache=True, -++++ # return_dict=False, -++++ # ) -++++ -++++ # hidden_states = outputs[0] -++++ # logits = self.lm_head.forward(hidden_states) -++++ # logits = logits.float() -++++ -++++ # return logits[:, -1, :] -++++ -++++ # def _sample( -++++ # self, -++++ # input_ids: mindspore.Tensor, -++++ # logits_processor, -++++ # stopping_criteria, -++++ # generation_config, -++++ # synced_devices: bool, -++++ # streamer=None, -++++ # logits_warper=None, -++++ # **model_kwargs, -++++ # ): -++++ # """ -++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++ # """ -++++ # from ...generation.logits_process import LogitsProcessorList -++++ # from ...generation.stopping_criteria import StoppingCriteriaList -++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++ # from mindnlp.core import nn, ops, no_grad -++++ # import numpy as np -++++ -++++ # # 检查是否使用 StaticCache -++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++ # # 否则,直接调用父类方法 -++++ # past_key_values = model_kwargs.get("past_key_values") -++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++ -++++ # if not isinstance(past_key_values, StaticCache): -++++ # # 不使用 StaticCache,直接调用父类方法 -++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++ # return super()._sample( -++++ # input_ids=input_ids, -++++ # logits_processor=logits_processor, -++++ # stopping_criteria=stopping_criteria, -++++ # generation_config=generation_config, -++++ # synced_devices=synced_devices, -++++ # streamer=streamer, -++++ # logits_warper=logits_warper, -++++ # **model_kwargs, -++++ # ) -++++ -++++ # # 使用 StaticCache,进入自定义循环 -++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++ # pad_token_id = generation_config._pad_token_tensor -++++ # output_attentions = generation_config.output_attentions -++++ # output_hidden_states = generation_config.output_hidden_states -++++ # output_scores = generation_config.output_scores -++++ # output_logits = generation_config.output_logits -++++ # return_dict_in_generate = generation_config.return_dict_in_generate -++++ # max_length = generation_config.max_length -++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++ # do_sample = generation_config.do_sample -++++ -++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++ # raise ValueError( -++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++ # f"{logits_warper})." -++++ # ) -++++ -++++ # # init attention / hidden states / scores tuples -++++ # scores = () if (return_dict_in_generate and output_scores) else None -++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++ -++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++ # encoder_hidden_states = ( -++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++ # ) -++++ -++++ # # keep track of which sequences are already finished -++++ # batch_size, cur_len = input_ids.shape -++++ # this_peer_finished = False -++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++ -++++ # time_record = [] -++++ # from ....utils.testing_utils import parse_flag_from_env -++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++ -++++ # while self._has_unfinished_sequences( -++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++ # ): -++++ # if _record_time: -++++ # import time as time_module -++++ # infer_start = time_module.time() -++++ -++++ # # prepare model inputs -++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++ -++++ # # prepare variable output controls -++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++ -++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++ # cur_cache_position = model_inputs.get("cache_position") -++++ # cur_past_key_values = model_inputs.get("past_key_values") -++++ # cur_input_ids = model_inputs.get("input_ids") -++++ -++++ # if (isinstance(cur_past_key_values, StaticCache) and -++++ # cur_cache_position is not None and -++++ # len(cur_cache_position.shape) > 0 and -++++ # cur_cache_position.shape[0] == 1 and -++++ # cur_input_ids is not None and -++++ # cur_input_ids.shape[1] == 1): -++++ # # 使用 JIT 优化的单 token 解码 -++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++ # if not hasattr(self, '_jit_used'): -++++ # self._jit_used = False -++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++ -++++ # next_token_logits = self.get_decode_one_tokens_logits( -++++ # cur_token=cur_input_ids, -++++ # input_pos=model_inputs.get("position_ids"), -++++ # cache_position=cur_cache_position, -++++ # past_key_values=cur_past_key_values, -++++ # ) -++++ -++++ # # 标记已使用JIT(用于后续判断) -++++ # if not self._jit_used: -++++ # self._jit_used = True -++++ -++++ # # 构造兼容的输出对象 -++++ # class JitOptimizedOutput: -++++ # def __init__(self, logits, config): -++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++ # self.config = config -++++ # # 对于 JIT 优化路径,这些属性通常不需要 -++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++ # self.attentions = None if not config.is_encoder_decoder else None -++++ # self.cross_attentions = None -++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++ # self.hidden_states = None if not config.is_encoder_decoder else None -++++ -++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++ # else: -++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++ # outputs = self(**model_inputs, return_dict=True) -++++ -++++ # if synced_devices and this_peer_finished: -++++ # continue -++++ -++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++ # next_token_logits = outputs.logits[:, -1, :] -++++ -++++ # # pre-process distribution -++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++ # if do_sample: -++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++ -++++ # # Store scores, attentions and hidden_states when required -++++ # if return_dict_in_generate: -++++ # if output_scores: -++++ # scores += (next_token_scores,) -++++ # if output_logits: -++++ # raw_logits += (next_token_logits,) -++++ # if output_attentions: -++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++ # decoder_attentions += (attn,) if attn is not None else (None,) -++++ # if self.config.is_encoder_decoder: -++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++ -++++ # if output_hidden_states: -++++ # hidden = ( -++++ # outputs.decoder_hidden_states -++++ # if self.config.is_encoder_decoder -++++ # else outputs.hidden_states -++++ # ) -++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++ -++++ # # token selection -++++ # if do_sample: -++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++ # else: -++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++ -++++ # # finished sentences should have their next token be a padding token -++++ # if has_eos_stopping_criteria: -++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++ -++++ # # update generated ids, model inputs, and length for next step -++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++ # if streamer is not None: -++++ # streamer.put(next_tokens) -++++ -++++ # model_kwargs = self._update_model_kwargs_for_generation( -++++ # outputs, -++++ # model_kwargs, -++++ # is_encoder_decoder=self.config.is_encoder_decoder, -++++ # ) -++++ -++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++ # cur_len += 1 -++++ -++++ # if _record_time: -++++ # import time as time_module -++++ # infer_stop = time_module.time() -++++ # time_record.append(infer_stop - infer_start) -++++ -++++ # del outputs -++++ -++++ # average_infer_time = None -++++ # if time_record: -++++ # if len(time_record) > 1: -++++ # time_record.pop(0) -++++ # average_infer_time = sum(time_record) / len(time_record) -++++ # print(f'average inference time is: {average_infer_time}') -++++ # print(f'inference time record: {time_record}') -++++ -++++ # if streamer is not None: -++++ # streamer.end() -++++ -++++ # # 简单判断:打印是否使用了JIT路径 -++++ # if hasattr(self, '_jit_used') and self._jit_used: -++++ # print("[JIT] ✓ JIT optimization was used during generation") -++++ # else: -++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++ -++++ # if return_dict_in_generate: -++++ # if self.config.is_encoder_decoder: -++++ # return GenerateEncoderDecoderOutput( -++++ # sequences=input_ids, -++++ # scores=scores, -++++ # logits=raw_logits, -++++ # encoder_attentions=encoder_attentions, -++++ # encoder_hidden_states=encoder_hidden_states, -++++ # decoder_attentions=decoder_attentions, -++++ # cross_attentions=cross_attentions, -++++ # decoder_hidden_states=decoder_hidden_states, -++++ # past_key_values=model_kwargs.get("past_key_values"), -++++ # average_infer_time=average_infer_time -++++ # ) -++++ # else: -++++ # return GenerateDecoderOnlyOutput( -++++ # sequences=input_ids, -++++ # scores=scores, -++++ # logits=raw_logits, -++++ # attentions=decoder_attentions, -++++ # hidden_states=decoder_hidden_states, -++++ # past_key_values=model_kwargs.get("past_key_values"), -++++ # average_infer_time=average_infer_time -++++ # ) -++++ # else: -++++ # return input_ids -++++ -++++ # def _prepare_cache_for_generation( -++++ # self, -++++ # generation_config, -++++ # model_kwargs, -++++ # assistant_model, -++++ # batch_size, -++++ # max_cache_length, -++++ # ): -++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++ # generation_config.cache_implementation = "static" -++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++ -++++ # if generation_config.cache_implementation == "static": -++++ # base_required_from_max_length = generation_config.max_length + 1 -++++ # base_required = max(max_cache_length, base_required_from_max_length) -++++ # min_cache_size = 50 -++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++ # else: -++++ # max_cache_length = max(base_required, min_cache_size) -++++ -++++ # original_max_cache_length = max_cache_length -++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++ # print(f" - input max_cache_length: {original_max_cache_length}") -++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++ # print(f" - final max_cache_length: {max_cache_length}") -++++ -++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++ # if max_cache_length > self.config.max_position_embeddings: -++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++ -++++ # result = super()._prepare_cache_for_generation( -++++ # generation_config=generation_config, -++++ # model_kwargs=model_kwargs, -++++ # assistant_model=assistant_model, -++++ # batch_size=batch_size, -++++ # max_cache_length=max_cache_length, -++++ # ) -++++ -++++ # if generation_config.cache_implementation == "static": -++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++ # created_cache = model_kwargs.get(cache_name) -++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++ # if created_cache.max_cache_len < generation_config.max_length: -++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++ -++++ # return result -++++ -++++ -++++ -+++ -+++ -+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++-- -+++2.27.0 -+++ -++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++new file mode 100644 -++index 00000000..22b65dd5 -++--- /dev/null -+++++ b/patches/0002-20251106commit.patch -++@@ -0,0 +1,3200 @@ -+++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Thu, 6 Nov 2025 09:20:38 +0800 -+++Subject: [PATCH 2/3] 20251106commit -+++ -+++--- -+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -+++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -+++ 3 files changed, 2689 insertions(+), 305 deletions(-) -+++ create mode 100644 patches/0001-20251104commit.patch -+++ -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index d8303e45..73773c22 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -+++ # y = y + self.shared_experts(identity) -+++ # return y -+++ -++++ # @no_grad() -++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ -++++ # expert_cache = ops.zeros_like(x) -++++ # for i in range(self.num_experts_per_tok): -++++ # expert_id = flat_expert_indices[i].item() -++++ # weight = flat_expert_weights[i].item() -++++ # expert = self.experts[expert_id] -++++ # expert_out = expert(x) -++++ # expert_cache += expert_out * weight -++++ # return expert_cache -++++ -+++ @no_grad() -+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ # x 的 shape: (1, hidden_size) -++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++++ -++++ # 1. 收集所有需要的专家层 -++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++++ selected_experts = [self.experts[i] for i in flat_expert_indices] -++++ -++++ # 2. 并行计算所有专家的输出 -++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++++ # ops.cat 会将它们堆叠成一个新的 Tensor -++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++++ -++++ # 3. 使用矩阵乘法进行加权求和 -++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ # 最终结果 final_output 的 shape: (1, hidden_size) -++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++++ -++++ return final_output -+++ -+++- expert_cache = ops.zeros_like(x) -+++- for i in range(self.num_experts_per_tok): -+++- expert_id = flat_expert_indices[i].item() -+++- weight = flat_expert_weights[i].item() -+++- expert = self.experts[expert_id] -+++- expert_out = expert(x) -+++- expert_cache += expert_out * weight -+++- return expert_cache -+++ -+++ @no_grad() -+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++ # @lwx -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+++ -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -+++ return attn_output, attn_weights, past_key_value -+++ -+++ -++++# class DeepseekFlashAttention(nn.Module): -++++# """ -++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -++++ -++++# This class is designed as a drop-in replacement for DeepseekAttention. -++++# """ -++++ -++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++++# super().__init__() -++++# self.config = config -++++# self.layer_idx = layer_idx -++++# if layer_idx is None: -++++# logger.warning( -++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++# "when creating this class." -++++# ) -++++ -++++# self.attention_dropout = config.attention_dropout -++++# self.hidden_size = config.hidden_size -++++# self.num_heads = config.num_attention_heads -++++# self.head_dim = self.hidden_size // self.num_heads -++++# self.num_key_value_heads = config.num_key_value_heads -++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++# self.max_position_embeddings = config.max_position_embeddings -++++# self.rope_theta = config.rope_theta -++++# self.is_causal = True -++++ -++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++# raise ValueError( -++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++# f" and `num_heads`: {self.num_heads})." -++++# ) -++++ -++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++++# self._init_rope() -++++ -++++# def _init_rope(self): -++++# if self.config.rope_scaling is None: -++++# self.rotary_emb = DeepseekRotaryEmbedding( -++++# self.head_dim, -++++# max_position_embeddings=self.max_position_embeddings, -++++# base=self.rope_theta, -++++# ) -++++# else: -++++# scaling_type = self.config.rope_scaling["type"] -++++# scaling_factor = self.config.rope_scaling["factor"] -++++# if scaling_type == "linear": -++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++++# self.head_dim, -++++# max_position_embeddings=self.max_position_embeddings, -++++# scaling_factor=scaling_factor, -++++# base=self.rope_theta, -++++# ) -++++# elif scaling_type == "dynamic": -++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++++# self.head_dim, -++++# max_position_embeddings=self.max_position_embeddings, -++++# scaling_factor=scaling_factor, -++++# base=self.rope_theta, -++++# ) -++++# else: -++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++++ -++++# def forward( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# attention_mask: Optional[mindspore.Tensor] = None, -++++# position_ids: Optional[mindspore.Tensor] = None, -++++# past_key_value: Optional[Cache] = None, -++++# output_attentions: bool = False, -++++# use_cache: bool = False, -++++# **kwargs, -++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++# if "padding_mask" in kwargs: -++++# warnings.warn( -++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++++# ) -++++ -++++# if output_attentions: -++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -++++ -++++# bsz, q_len, _ = hidden_states.shape -++++ -++++# if self.config.pretraining_tp > 1: -++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++++ -++++# query_states = self.q_proj(hidden_states) -++++# key_states = self.k_proj(hidden_states) -++++# value_states = self.v_proj(hidden_states) -++++ -++++# # Reshape for multi-head attention -++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++# kv_seq_len = key_states.shape[-2] -++++# if past_key_value is not None: -++++# if self.layer_idx is None: -++++# raise ValueError( -++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++# "with a layer index." -++++# ) -++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++# # Apply Rotary Positional Embedding -++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++# if past_key_value is not None: -++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ -++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++ -++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++ -++++# # Convert attention_mask for flash_attention_score -++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -++++# if attention_mask is not None: -++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++++# raise ValueError( -++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++++# ) -++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -++++# else: -++++# attn_mask_for_fa = None -++++ -++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++++ -++++# # Call the fused flash_attention_score operator -++++# attn_output = mindspore.ops.flash_attention_score( -++++# query=query_states_for_fa, -++++# key=key_states_for_fa, -++++# value=value_states_for_fa, -++++# head_num=self.num_heads, # This is N1, the number of query heads -++++# input_layout='BSH', -++++# attn_mask=attn_mask_for_fa, -++++# keep_prob=keep_prob, -++++# scalar_value=1.0 / math.sqrt(self.head_dim), -++++# sparse_mode=0 # Default mask mode -++++# ) -++++ -++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -++++# attn_output = self.o_proj(attn_output) -++++ -++++# # Flash Attention does not return attention weights -++++# attn_weights = None -++++ -++++# return attn_output, attn_weights, past_key_value -++++ -++++class DeepseekFlashAttention(nn.Module): -++++ """ -++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -++++ This implementation is a drop-in replacement for the original DeepseekAttention class, -++++ designed for high performance on supported hardware (Ascend). -++++ -++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -++++ """ -++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++++ super().__init__() -++++ self.config = config -++++ self.layer_idx = layer_idx -++++ if layer_idx is None: -++++ logger.warning( -++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++ "when creating this class." -++++ ) -++++ -++++ # --- [FIX] Correctly initialize all required attributes --- -++++ self.attention_dropout = config.attention_dropout -++++ self.hidden_size = config.hidden_size -++++ self.num_heads = config.num_attention_heads -++++ self.head_dim = self.hidden_size // self.num_heads -++++ self.num_key_value_heads = config.num_key_value_heads -++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++ self.max_position_embeddings = config.max_position_embeddings -++++ self.rope_theta = config.rope_theta -++++ self.is_causal = True -++++ -++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++ raise ValueError( -++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++ f" and `num_heads`: {self.num_heads})." -++++ ) -++++ -++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++++ -++++ # This call will now succeed as all attributes are initialized. -++++ self._init_rope() -++++ -++++ def _init_rope(self): -++++ if self.config.rope_scaling is None: -++++ self.rotary_emb = DeepseekRotaryEmbedding( -++++ self.head_dim, -++++ max_position_embeddings=self.max_position_embeddings, -++++ base=self.rope_theta, -++++ ) -++++ else: -++++ scaling_type = self.config.rope_scaling["type"] -++++ scaling_factor = self.config.rope_scaling["factor"] -++++ if scaling_type == "linear": -++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++++ self.head_dim, -++++ max_position_embeddings=self.max_position_embeddings, -++++ scaling_factor=scaling_factor, -++++ base=self.rope_theta, -++++ ) -++++ elif scaling_type == "dynamic": -++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++++ self.head_dim, -++++ max_position_embeddings=self.max_position_embeddings, -++++ scaling_factor=scaling_factor, -++++ base=self.rope_theta, -++++ ) -++++ else: -++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++++ -++++ def forward( -++++ self, -++++ hidden_states: mindspore.Tensor, -++++ attention_mask: Optional[mindspore.Tensor] = None, -++++ position_ids: Optional[mindspore.Tensor] = None, -++++ past_key_value: Optional[Cache] = None, -++++ output_attentions: bool = False, -++++ use_cache: bool = False, -++++ **kwargs, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ if "padding_mask" in kwargs: -++++ warnings.warn( -++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++++ ) -++++ if output_attentions: -++++ warnings.warn( -++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -++++ ) -++++ -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ if self.config.pretraining_tp > 1: -++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++++ -++++ query_states = self.q_proj(hidden_states) -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++ # Apply Rotary Position Embedding -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos} -++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -++++ # So we must explicitly repeat the KV heads. -++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++ -++++ # Convert attention mask for flash_attention_score -++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -++++ if attention_mask is not None: -++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++++ raise ValueError( -++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++++ ) -++++ attn_mask_for_fa = attention_mask < 0 -++++ else: -++++ attn_mask_for_fa = None -++++ -++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++++ -++++ # Call the fused operator using the efficient BNSD layout -++++ attn_output = mindspore.ops.flash_attention_score( -++++ query=query_states, -++++ key=key_states, -++++ value=value_states, -++++ head_num=self.num_heads, -++++ input_layout='BNSD', # Specify the correct layout -++++ attn_mask=attn_mask_for_fa, -++++ keep_prob=keep_prob, -++++ scalar_value=1.0 / math.sqrt(self.head_dim) -++++ ) -++++ -++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ -++++ # Apply output projection -++++ attn_output = self.o_proj(attn_output) -++++ -++++ # Flash attention does not return attention weights, so we return None. -++++ attn_weights = None -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -+++ Deepseek_ATTENTION_CLASSES = { -+++ "eager": DeepseekAttention, -++++ "flash-attention": DeepseekFlashAttention, -+++ } -+++ -+++ -+++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -+++ config=config, layer_idx=layer_idx -+++ ) -+++ -++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -++++ config=config, layer_idx=layer_idx -++++ ) -++++ -+++ self.mlp = ( -+++ DeepseekMoE(config) -+++ if ( -+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++index d4c6b651..bced285c 100644 -+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -+++ -+++ import mindspore -+++ import mindnlp.core.nn.functional as F -+++-from mindnlp.core import nn, ops -++++from mindnlp.core import nn, ops, no_grad -+++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -+++ -+++ from ....common.activations import ACT2FN -+++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+++ -++++Long_Prompt = False -++++PROMPT_LENGTH_THRESHOLD = 128 -+++ -+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+++ def _prepare_4d_causal_attention_mask_with_cache_position( -+++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -+++ return attn_output, attn_weights, past_key_value -+++ -+++ -++++# class Qwen2MoeFlashAttention(nn.Module): -++++# """ -++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++ -++++# 关键改动: -++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++# 直接传入原始的 key 和 value 张量效率更高。 -++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++# super().__init__() -++++# self.config = config -++++# self.layer_idx = layer_idx -++++# self.hidden_size = config.hidden_size -++++# self.num_heads = config.num_attention_heads -++++# self.head_dim = self.hidden_size // self.num_heads -++++# self.num_key_value_heads = config.num_key_value_heads -++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++# self.max_position_embeddings = config.max_position_embeddings -++++# self.rope_theta = config.rope_theta -++++# self.attention_dropout = config.attention_dropout -++++ -++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++# raise ValueError( -++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++# ) -++++ -++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++ -++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++# self.head_dim, -++++# max_position_embeddings=self.max_position_embeddings, -++++# base=self.rope_theta, -++++# ) -++++ -++++# def forward( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# attention_mask: Optional[mindspore.Tensor] = None, -++++# position_ids: Optional[mindspore.Tensor] = None, -++++# past_key_value: Optional[Cache] = None, -++++# output_attentions: bool = False, -++++# use_cache: bool = False, -++++# cache_position: Optional[mindspore.Tensor] = None, -++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++# bsz, q_len, _ = hidden_states.shape -++++ -++++# # 1. 线性投射 Q, K, V -++++# query_states = self.q_proj(hidden_states) -++++# key_states = self.k_proj(hidden_states) -++++# value_states = self.v_proj(hidden_states) -++++ -++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++# # query: [B, S, H*D] -> [B, N1, S, D] -++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++# # 3. RoPE 旋转位置编码 -++++# kv_seq_len = key_states.shape[-2] -++++# if past_key_value is not None: -++++# if self.layer_idx is None: -++++# raise ValueError( -++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++# "with a layer index." -++++# ) -++++# # 对于 StaticCache,需要特殊处理 kv_seq_len -++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++# if cache_position.shape[0] == 1: -++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++# kv_seq_len = past_seen_tokens + 1 -++++# else: -++++# # prefill 阶段:cache_position 是范围,使用其长度 -++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++# else: -++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++# # 4. KV 缓存更新 -++++# if past_key_value is not None: -++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++# key_states, value_states = past_key_value.update( -++++# key_states, value_states, self.layer_idx, cache_kwargs -++++# ) -++++ -++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++# if cache_position.shape[0] == 1: -++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++# kv_seq_len = key_states.shape[-2] -++++ -++++# # 5. [重要] 准备 Attention Mask -++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++# fa_attention_mask = None -++++# if attention_mask is not None: -++++# # 截取与当前key长度匹配的部分 -++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++# # 转换为布尔类型: 大负数 -> True, 0 -> False -++++# fa_attention_mask = (mask_slice != 0) -++++ -++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++# input_dtype = query_states.dtype -++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++# query_states = query_states.to(mindspore.float16) -++++# key_states = key_states.to(mindspore.float16) -++++# value_states = value_states.to(mindspore.float16) -++++ -++++# # 6. [核心] 调用 flash_attention_score 算子 -++++# # - 无需手动 repeat_kv, 算子原生支持 GQA -++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++# attn_output = mindspore.ops.flash_attention_score( -++++# query=query_states, -++++# key=key_states, -++++# value=value_states, -++++# head_num=self.num_heads, # 传入Q的头数(N1) -++++# attn_mask=fa_attention_mask, -++++# keep_prob=1.0 - self.attention_dropout, -++++# scalar_value=1.0 / math.sqrt(self.head_dim), -++++# input_layout="BNSD", -++++# sparse_mode=0 # 使用 defaultMask 模式 -++++# ) -++++ -++++# # 恢复原始数据类型 -++++# attn_output = attn_output.to(input_dtype) -++++ -++++# # 7. 调整输出形状 -++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++# attn_output = self.o_proj(attn_output) -++++ -++++# # FlashAttention 算子不直接返回注意力权重矩阵 -++++# attn_weights = None -++++# if output_attentions: -++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++# return attn_output, attn_weights, past_key_value -++++ -++++# # def forward( -++++# # self, -++++# # hidden_states: mindspore.Tensor, -++++# # attention_mask: Optional[mindspore.Tensor] = None, -++++# # position_ids: Optional[mindspore.Tensor] = None, -++++# # past_key_value: Optional[Cache] = None, -++++# # output_attentions: bool = False, -++++# # use_cache: bool = False, -++++# # cache_position: Optional[mindspore.Tensor] = None, -++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++# # bsz, q_len, _ = hidden_states.shape -++++ -++++# # # 1. 线性投射 Q, K, V -++++# # query_states = self.q_proj(hidden_states) -++++# # key_states = self.k_proj(hidden_states) -++++# # value_states = self.v_proj(hidden_states) -++++ -++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -++++# # # 3. RoPE 旋转位置编码 -++++# # kv_seq_len = key_states.shape[-2] -++++# # if past_key_value is not None: -++++# # if self.layer_idx is None: -++++# # raise ValueError( -++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++# # "with a layer index." -++++# # ) -++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++# # # 4. KV 缓存更新 -++++# # if past_key_value is not None: -++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++# # key_states, value_states = past_key_value.update( -++++# # key_states, value_states, self.layer_idx, cache_kwargs -++++# # ) -++++ -++++# # # 5. 准备 Attention Mask -++++# # fa_attention_mask = None -++++# # if attention_mask is not None: -++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++# # fa_attention_mask = (mask_slice != 0) -++++ -++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++# # input_dtype = query_states.dtype -++++ -++++# # # 6. [核心] 调用 flash_attention_score 算子 -++++# # attn_output = mindspore.ops.flash_attention_score( -++++# # query=query_states, -++++# # key=key_states, -++++# # value=value_states, -++++# # head_num=self.num_heads, -++++# # attn_mask=fa_attention_mask, -++++# # keep_prob=1.0 - self.attention_dropout, -++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++++# # input_layout="BNSD", -++++# # sparse_mode=0, -++++# # # <--- 修改点 2: 启用内部高精度计算 --- -++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++# # inner_precise=1 -++++# # ) -++++ -++++# # # 恢复原始数据类型 -++++# # attn_output = attn_output.to(input_dtype) -++++ -++++# # # 7. 调整输出形状 -++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++# # attn_output = self.o_proj(attn_output) -++++ -++++# # attn_weights = None -++++# # if output_attentions: -++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ -++++# # return attn_output, attn_weights, past_key_value -++++ -++++ -+++ class Qwen2MoeFlashAttention(nn.Module): -+++ """ -+++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++- -+++- 关键改动: -+++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++- 直接传入原始的 key 和 value 张量效率更高。 -+++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -++++ -++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -++++ 完全使用模型的低精度数据类型(如 float16)进行计算, -++++ 以达到理论上的最高执行速度。 -+++ """ -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++ super().__init__() -+++ self.config = config -+++ self.layer_idx = layer_idx -++++ if layer_idx is None: -++++ logger.warning_once( -++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -++++ ) -++++ -+++ self.hidden_size = config.hidden_size -+++ self.num_heads = config.num_attention_heads -+++ self.head_dim = self.hidden_size // self.num_heads -+++ self.num_key_value_heads = config.num_key_value_heads -+++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++ self.max_position_embeddings = config.max_position_embeddings -+++ self.rope_theta = config.rope_theta -+++ self.attention_dropout = config.attention_dropout -+++ -+++- if (self.head_dim * self.num_heads) != self.hidden_size: -+++- raise ValueError( -+++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++- ) -+++- -+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++- # query: [B, S, H*D] -> [B, N1, S, D] -+++- # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++ # 2. 调整形状以匹配 BNSD 布局 -+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- -+++- # 3. RoPE 旋转位置编码 -++++ -++++ # 3. RoPE 和 KV 缓存 -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++- if self.layer_idx is None: -+++- raise ValueError( -+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++- "with a layer index." -+++- ) -+++- # 对于 StaticCache,需要特殊处理 kv_seq_len -+++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++- # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++- # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++- if cache_position.shape[0] == 1: -+++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++- kv_seq_len = past_seen_tokens + 1 -+++- else: -+++- # prefill 阶段:cache_position 是范围,使用其长度 -+++- kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++- else: -+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++- -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++- # 4. KV 缓存更新 -+++ if past_key_value is not None: -+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++- key_states, value_states = past_key_value.update( -+++- key_states, value_states, self.layer_idx, cache_kwargs -+++- ) -+++- -+++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++- if cache_position.shape[0] == 1: -+++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++- kv_seq_len = key_states.shape[-2] -+++- -+++- # 5. [重要] 准备 Attention Mask -+++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++ # 4. 准备 Attention Mask -+++ fa_attention_mask = None -+++ if attention_mask is not None: -+++- # 截取与当前key长度匹配的部分 -+++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++- # 转换为布尔类型: 大负数 -> True, 0 -> False -+++ fa_attention_mask = (mask_slice != 0) -+++ -+++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++- input_dtype = query_states.dtype -+++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++- query_states = query_states.to(mindspore.float16) -+++- key_states = key_states.to(mindspore.float16) -+++- value_states = value_states.to(mindspore.float16) -+++- -+++- # 6. [核心] 调用 flash_attention_score 算子 -+++- # - 无需手动 repeat_kv, 算子原生支持 GQA -+++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -+++ attn_output = mindspore.ops.flash_attention_score( -+++ query=query_states, -+++ key=key_states, -+++ value=value_states, -+++- head_num=self.num_heads, # 传入Q的头数(N1) -++++ head_num=self.num_heads, -+++ attn_mask=fa_attention_mask, -+++- keep_prob=1.0 - self.attention_dropout, -++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -+++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++ input_layout="BNSD", -+++- sparse_mode=0 # 使用 defaultMask 模式 -++++ sparse_mode=0, -++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -+++ ) -+++ -+++- # 恢复原始数据类型 -+++- attn_output = attn_output.to(input_dtype) -+++- -+++- # 7. 调整输出形状 -+++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++ # 6. 调整输出形状 -+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++ attn_output = self.o_proj(attn_output) -+++ -+++- # FlashAttention 算子不直接返回注意力权重矩阵 -++++ # 7. 返回结果 -+++ attn_weights = None -+++ if output_attentions: -+++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++- # def forward( -+++- # self, -+++- # hidden_states: mindspore.Tensor, -+++- # attention_mask: Optional[mindspore.Tensor] = None, -+++- # position_ids: Optional[mindspore.Tensor] = None, -+++- # past_key_value: Optional[Cache] = None, -+++- # output_attentions: bool = False, -+++- # use_cache: bool = False, -+++- # cache_position: Optional[mindspore.Tensor] = None, -+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++- -+++- # bsz, q_len, _ = hidden_states.shape -+++- -+++- # # 1. 线性投射 Q, K, V -+++- # query_states = self.q_proj(hidden_states) -+++- # key_states = self.k_proj(hidden_states) -+++- # value_states = self.v_proj(hidden_states) -+++- -+++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- -+++- # # 3. RoPE 旋转位置编码 -+++- # kv_seq_len = key_states.shape[-2] -+++- # if past_key_value is not None: -+++- # if self.layer_idx is None: -+++- # raise ValueError( -+++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++- # "with a layer index." -+++- # ) -+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++ -+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++- -+++- # # 4. KV 缓存更新 -+++- # if past_key_value is not None: -+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++- # key_states, value_states = past_key_value.update( -+++- # key_states, value_states, self.layer_idx, cache_kwargs -+++- # ) -+++- -+++- # # 5. 准备 Attention Mask -+++- # fa_attention_mask = None -+++- # if attention_mask is not None: -+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++- # fa_attention_mask = (mask_slice != 0) -+++- -+++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++- # input_dtype = query_states.dtype -+++- -+++- # # 6. [核心] 调用 flash_attention_score 算子 -+++- # attn_output = mindspore.ops.flash_attention_score( -+++- # query=query_states, -+++- # key=key_states, -+++- # value=value_states, -+++- # head_num=self.num_heads, -+++- # attn_mask=fa_attention_mask, -+++- # keep_prob=1.0 - self.attention_dropout, -+++- # scalar_value=1.0 / math.sqrt(self.head_dim), -+++- # input_layout="BNSD", -+++- # sparse_mode=0, -+++- # # <--- 修改点 2: 启用内部高精度计算 --- -+++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++- # inner_precise=1 -+++- # ) -+++- -+++- # # 恢复原始数据类型 -+++- # attn_output = attn_output.to(input_dtype) -++++QWEN2MOE_ATTENTION_CLASSES = { -++++ "eager": Qwen2MoeAttention, -++++ "flash-attention": Qwen2MoeFlashAttention, -++++} -+++ -+++- # # 7. 调整输出形状 -+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++- # attn_output = self.o_proj(attn_output) -+++ -+++- # attn_weights = None -+++- # if output_attentions: -+++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# def __init__(self, config): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# # gating -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++ -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# #@dwj -++++# # 只遍历激活的专家,而非全部专家 -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# num_tokens = hidden_states_reshaped.shape[0] -++++ -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++# flat_selected_experts = selected_experts.flatten() -++++ -++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++# token_indices = broadcasted_token_indices.flatten() -++++ -++++# active_experts = ops.unique(flat_selected_experts) -++++ -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++ -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# selected_token_indices = token_indices[mask] -++++# selected_routing_weights = routing_weights.flatten()[mask] -++++ -++++# current_states = hidden_states_reshaped[selected_token_indices] -++++ -++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++ -++++# final_hidden_states = final_hidden_states.index_add( -++++# dim=0, -++++# index=selected_token_indices, -++++# source=expert_output.to(hidden_states.dtype) -++++# ) -++++ -++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++ -+++- # return attn_output, attn_weights, past_key_value -++++# final_hidden_states = final_hidden_states + shared_expert_output -++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -++++ -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# """ -++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++++# `_moe_infer_prefill` (用于长序列处理) 方法。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# # 门控网络 -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# # 专家列表 -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++# # 共享专家 -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# @no_grad() -++++# def _moe_infer_decode( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# """ -++++# 【解码路径】针对 sequence_length=1 的极致优化。 -++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++++# """ -++++# batch_size, hidden_dim = hidden_states.shape -++++ -++++# expert_outputs_list = [ -++++# ops.cat([ -++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++# ], dim=0) -++++# for i in range(batch_size) -++++# ] -++++ -++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++++# # shape: (batch_size, top_k, hidden_dim) -++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++ -++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++ -++++# return moe_output.squeeze(1) -++++ -++++# @no_grad() -++++# def _moe_infer_prefill( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# """ -++++# 【预填充路径】针对 sequence_length > 1 的优化。 -++++# 按专家对 Token 进行分组,并进行批处理。 -++++# """ -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens = hidden_states.shape[0] -++++# flat_selected_experts = selected_experts.flatten() -++++ -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++ -++++# active_experts = ops.unique(flat_selected_experts) -++++ -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++ -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# selected_token_indices = token_indices[mask] -++++# selected_routing_weights = routing_weights.flatten()[mask] -++++ -++++# current_states = hidden_states[selected_token_indices] -++++ -++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++ -++++# moe_output = moe_output.index_add( -++++# dim=0, -++++# index=selected_token_indices, -++++# source=expert_output.to(hidden_states.dtype) -++++# ) -++++# return moe_output -++++ -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# """ -++++# 顶层 forward 方法,作为智能分发器。 -++++# """ -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++- # def forward( -+++- # self, -+++- # hidden_states: mindspore.Tensor, -+++- # attention_mask: Optional[mindspore.Tensor] = None, -+++- # position_ids: Optional[mindspore.Tensor] = None, -+++- # past_key_value: Optional[Cache] = None, -+++- # output_attentions: bool = False, -+++- # use_cache: bool = False, -+++- # cache_position: Optional[mindspore.Tensor] = None, -+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++- -+++- # bsz, q_len, _ = hidden_states.shape -+++- -+++- # query_states = self.q_proj(hidden_states) -+++- # key_states = self.k_proj(hidden_states) -+++- # value_states = self.v_proj(hidden_states) -+++- -+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- -+++- # kv_seq_len = key_states.shape[-2] -+++- # if past_key_value is not None: -+++- # if self.layer_idx is None: -+++- # raise ValueError("`layer_idx` must be specified for caching") -+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++- -+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++- -+++- # if past_key_value is not None: -+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++- # key_states, value_states = past_key_value.update( -+++- # key_states, value_states, self.layer_idx, cache_kwargs -+++- # ) -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++# moe_output = None -++++# # 在推理时,根据序列长度选择最优路径 -++++# if not self.training: -++++# if sequence_length == 1: -++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++# else: -++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++# else: -++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++++# raise NotImplementedError("Training path is not implemented.") -++++ -++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++++ -++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++++ -++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -++++ -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# """ -++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# # 门控网络 -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# # 专家列表 -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++# # 共享专家 -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# @no_grad() -++++# def _moe_infer_decode( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# batch_size, _ = hidden_states.shape -++++# expert_outputs_list = [ -++++# ops.cat([ -++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++# ], dim=0) -++++# for i in range(batch_size) -++++# ] -++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++# return moe_output.squeeze(1) -++++ -++++# @no_grad() -++++# def _moe_infer_prefill( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens = hidden_states.shape[0] -++++# flat_selected_experts = selected_experts.flatten() -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++# active_experts = ops.unique(flat_selected_experts) -++++ -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# selected_token_indices = token_indices[mask] -++++# selected_routing_weights = routing_weights.flatten()[mask] -++++# current_states = hidden_states[selected_token_indices] -++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++# moe_output = moe_output.index_add( -++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++# ) -++++# return moe_output -++++ -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# """ -++++# 顶层 forward 方法,作为智能分发器。 -++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++++# """ -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ -++++# # 1. 门控计算 (通用逻辑) -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++# # 2. 智能分发到最优 MoE 路径 -++++# moe_output = None -++++# if not self.training: -++++# if sequence_length == 1: -++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++# else: -++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++# else: -++++# raise NotImplementedError("Training path is not implemented.") -++++ -++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++ -++++# # 4. 合并 MoE 输出和共享专家输出 -++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++ -++++# # 5. 恢复原始形状并返回 -++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -++++# prefill fastest -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# """ -++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# # 门控网络 -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# # 专家列表 -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++# # 共享专家 -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# @no_grad() -++++# def _moe_infer_dispatch( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# """ -++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++++# """ -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens, _ = hidden_states.shape -++++ -++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++++# flat_selected_experts = selected_experts.flatten() -++++# flat_routing_weights = routing_weights.flatten() -+++ -+++- # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++- # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++- -+++- # # <--- 核心修改点: 手动进行高精度缩放 --- -+++- # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++- # query_states = query_states / math.sqrt(self.head_dim) -+++- # # <--- 修改结束 --- -+++- -+++- # fa_attention_mask = None -+++- # if attention_mask is not None: -+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++- # fa_attention_mask = (mask_slice != 0) -+++- -+++- # input_dtype = query_states.dtype -+++- -+++- # attn_output = mindspore.ops.flash_attention_score( -+++- # query=query_states, # 传入已经预先缩放过的 query -+++- # key=key_states, -+++- # value=value_states, -+++- # head_num=self.num_heads, -+++- # attn_mask=fa_attention_mask, -+++- # keep_prob=1.0 - self.attention_dropout, -+++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++- # input_layout="BNSD", -+++- # sparse_mode=0, -+++- # inner_precise=1 # 仍然保持内部高精度计算 -+++- # ) -++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++ -+++- # attn_output = attn_output.to(input_dtype) -+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++- # attn_output = self.o_proj(attn_output) -++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++++# active_experts = ops.unique(flat_selected_experts) -++++ -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++ -++++# # 找到所有分配给该专家的 token -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++ -++++# # 使用 mask 选取对应的 token 和权重 -++++# current_token_indices = token_indices[mask] -++++# current_routing_weights = flat_routing_weights[mask] -++++# current_hidden_states = hidden_states[current_token_indices] -++++ -++++# # 对这些 token 进行批处理 -++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++ -++++# # 使用 index_add 将结果精确地加回到对应位置 -++++# moe_output = moe_output.index_add( -++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++++# ) -++++# return moe_output -++++ -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# """ -++++# 顶层 forward 方法,作为智能分发器。 -++++# """ -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ -++++# # 1. 门控计算 -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++ -++++# # 2. 调用统一的 MoE 计算内核 -++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+++ -+++- # attn_weights = None -+++- # if output_attentions: -+++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++# # 3. 统一处理共享专家 -++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++ -++++# # 4. 合并输出 -++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++ -++++# # 5. 恢复原始形状并返回 -++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -++++ -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# """ -++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++# 【最终高性能与高精度版】: -++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++++# 3. 这样实现了速度和准确性的两全其美。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# @no_grad() -++++# def _moe_infer_decode( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# """ -++++# 【解码路径】极致优化版:bmm + 高精度累加。 -++++# """ -++++# original_dtype = hidden_states.dtype -++++# batch_size, _ = hidden_states.shape -++++ -++++# expert_outputs_list = [ -++++# ops.cat([ -++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++# ], dim=0) -++++# for i in range(batch_size) -++++# ] -++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++ -++++# # 在 float32 下执行 bmm,得到高精度结果 -++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++ -++++# # 将高精度结果转换回原始数据类型 -++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++++ -++++# return moe_output -++++ -++++# @no_grad() -++++# def _moe_infer_prefill( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# selected_experts: mindspore.Tensor, -++++# routing_weights: mindspore.Tensor -++++# ) -> mindspore.Tensor: -++++# """ -++++# 【预填充路径】与原始实现一致,结果精确。 -++++# """ -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens, _ = hidden_states.shape -++++# flat_selected_experts = selected_experts.flatten() -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++# active_experts = ops.unique(flat_selected_experts) -++++ -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# selected_token_indices = token_indices[mask] -++++# selected_routing_weights = routing_weights.flatten()[mask] -++++# current_states = hidden_states[selected_token_indices] -++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++# moe_output = moe_output.index_add( -++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++# ) -++++# return moe_output -++++ -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++ -+++- # return attn_output, attn_weights, past_key_value -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++++# # 如果模型主体是 float16,后续再转换 -++++ -++++# moe_output = None -++++# if not self.training: -++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++++# # _moe_infer_decode 内部会处理好类型转换 -++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++++# if sequence_length == 1: -++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++# else: -++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++# else: -++++# raise NotImplementedError("Training path is not implemented.") -++++ -++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++ -++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -+++ -+++-QWEN2MOE_ATTENTION_CLASSES = { -+++- "eager": Qwen2MoeAttention, -+++- "flash-attention": Qwen2MoeFlashAttention, -+++-} -++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++# """ -++++# 【融合版】一个混合专家模块,内置两种推理策略, -++++# 由外部全局变量 `Long_Prompt` 控制: -++++ -++++# - if Long_Prompt is True: 【精度优先模式】 -++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++++# 适用于处理长序列,避免误差累积。 -++++ -++++# - if Long_Prompt is False: 【速度优先模式】 -++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++++# 在解码阶段获得极致速度,同时保证结果高度准确。 -++++# """ -++++# def __init__(self, config: Qwen2MoeConfig): -++++# super().__init__() -++++# self.num_experts = config.num_experts -++++# self.top_k = config.num_experts_per_tok -++++# self.norm_topk_prob = config.norm_topk_prob -++++ -++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++# self.experts = nn.ModuleList( -++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++# ) -++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++# # --- 速度优先模式的辅助函数 --- -++++# @no_grad() -++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++# original_dtype = hidden_states.dtype -++++# batch_size, _ = hidden_states.shape -++++# expert_outputs_list = [ -++++# ops.cat([ -++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++# ], dim=0) -++++# for i in range(batch_size) -++++# ] -++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++# weights_fp32 = routing_weights.to(mindspore.float32) -++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++# return moe_output_fp32.squeeze(1).to(original_dtype) -++++ -++++# @no_grad() -++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens, _ = hidden_states.shape -++++# flat_selected_experts = selected_experts.flatten() -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++# active_experts = ops.unique(flat_selected_experts) -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# selected_token_indices = token_indices[mask] -++++# selected_routing_weights = routing_weights.flatten()[mask] -++++# current_states = hidden_states[selected_token_indices] -++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++++# return moe_output -++++ -++++# # --- 精度优先模式的辅助函数 --- -++++# @no_grad() -++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++# moe_output = ops.zeros_like(hidden_states) -++++# num_tokens, _ = hidden_states.shape -++++# flat_selected_experts = selected_experts.flatten() -++++# flat_routing_weights = routing_weights.flatten() -++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++# active_experts = ops.unique(flat_selected_experts) -++++# for expert_idx_tensor in active_experts: -++++# expert_idx = expert_idx_tensor.item() -++++# expert_layer = self.experts[expert_idx] -++++# mask = (flat_selected_experts == expert_idx_tensor) -++++# current_token_indices = token_indices[mask] -++++# current_routing_weights = flat_routing_weights[mask] -++++# current_hidden_states = hidden_states[current_token_indices] -++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++# return moe_output -++++ -++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++# # 声明我们将要使用一个在模块外部定义的全局变量 -++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++++# global Long_Prompt -++++ -++++# # 1. 门控计算 (所有模式通用) -++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++# router_logits = self.gate(hidden_states_reshaped) -++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++# if self.norm_topk_prob: -++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++# moe_output = None -++++# if not self.training: -++++# # 根据 Long_Prompt 标志选择模式 -++++# if Long_Prompt: -++++# # --- 精度优先模式 --- -++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++# else: -++++# # --- 速度优先模式 --- -++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++# if sequence_length == 1: -++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++# else: -++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++# else: -++++# raise NotImplementedError("Training path is not implemented.") -++++ -++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++ -++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++ -++++# return final_hidden_states, router_logits -++++ -++++class Qwen2MoeSparseMoeBlock(nn.Module): -++++ """ -++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++++ 控制的顶级推理策略: -+++ -++++ - if Long_Prompt is True: 【精度优先模式】 -++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -++++ 适用于需要严格可复现性的长序列任务。 -+++ -+++-class Qwen2MoeSparseMoeBlock(nn.Module): -+++- def __init__(self, config): -++++ - if Long_Prompt is False: 【速度优先模式】 -++++ 采用业界最强的性能组合: -++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -++++ """ -++++ def __init__(self, config: Qwen2MoeConfig): -+++ super().__init__() -+++ self.num_experts = config.num_experts -+++ self.top_k = config.num_experts_per_tok -+++ self.norm_topk_prob = config.norm_topk_prob -+++ -+++- # gating -+++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++ self.experts = nn.ModuleList( -+++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++ ) -+++- -+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++ -+++- #@dwj -+++- # 只遍历激活的专家,而非全部专家 -+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++- num_tokens = hidden_states_reshaped.shape[0] -+++- -+++- router_logits = self.gate(hidden_states_reshaped) -+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- -+++- if self.norm_topk_prob: -+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++- flat_selected_experts = selected_experts.flatten() -+++- -+++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++- token_indices = broadcasted_token_indices.flatten() -+++- -+++- active_experts = ops.unique(flat_selected_experts) -+++- -+++- for expert_idx_tensor in active_experts: -+++- expert_idx = expert_idx_tensor.item() -+++- expert_layer = self.experts[expert_idx] -+++- -+++- mask = (flat_selected_experts == expert_idx_tensor) -+++- selected_token_indices = token_indices[mask] -+++- selected_routing_weights = routing_weights.flatten()[mask] -+++- -+++- current_states = hidden_states_reshaped[selected_token_indices] -+++- -+++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++- -+++- final_hidden_states = final_hidden_states.index_add( -+++- dim=0, -+++- index=selected_token_indices, -+++- source=expert_output.to(hidden_states.dtype) -+++- ) -+++- -+++- shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -++++ @no_grad() -++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++ original_dtype = hidden_states.dtype -++++ batch_size, _ = hidden_states.shape -++++ expert_outputs_list = [ -++++ ops.cat([ -++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++ ], dim=0) -++++ for i in range(batch_size) -++++ ] -++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++ weights_fp32 = routing_weights.to(mindspore.float32) -++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++ return moe_output_fp32.squeeze(1).to(original_dtype) -++++ -++++ @no_grad() -++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++ num_tokens, _ = hidden_states.shape -++++ flat_selected_experts = selected_experts.flatten() -++++ sorted_expert_indices = flat_selected_experts.argsort() -++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++ original_token_indices = sorted_expert_indices // self.top_k -++++ moe_output = ops.zeros_like(hidden_states) -++++ current_token_offset = 0 -++++ for i in range(self.num_experts): -++++ expert_token_count = tokens_per_expert[i] - current_token_offset -++++ if expert_token_count == 0: -++++ continue -++++ end_offset = current_token_offset + expert_token_count -++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++ expert_hidden_states = hidden_states[expert_original_token_indices] -++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++ current_token_offset += expert_token_count -++++ return moe_output -++++ -++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++++ @no_grad() -++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++ moe_output = ops.zeros_like(hidden_states) -++++ num_tokens, _ = hidden_states.shape -++++ flat_selected_experts = selected_experts.flatten() -++++ flat_routing_weights = routing_weights.flatten() -++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++ active_experts = ops.unique(flat_selected_experts) -++++ for expert_idx_tensor in active_experts: -++++ expert_idx = expert_idx_tensor.item() -++++ expert_layer = self.experts[expert_idx] -++++ mask = (flat_selected_experts == expert_idx_tensor) -++++ current_token_indices = token_indices[mask] -++++ current_routing_weights = flat_routing_weights[mask] -++++ current_hidden_states = hidden_states[current_token_indices] -++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++ return moe_output -+++ -+++- final_hidden_states = final_hidden_states + shared_expert_output -+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++- -+++- return final_hidden_states, router_logits -++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++ global Long_Prompt -++++ -++++ # 1. 门控计算 (所有模式通用) -++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++ router_logits = self.gate(hidden_states_reshaped) -++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++ if self.norm_topk_prob: -++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++ moe_output = None -++++ if Long_Prompt: -++++ # --- 精度优先模式 (ACCURACY MODE) --- -++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ else: -++++ # --- 速度优先模式 (SPEED MODE) --- -++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++ if sequence_length == 1: -++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ else: -++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ -+++ -++++ # 3. 共享专家计算与合并 (所有模式通用) -++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++ -++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++ -++++ return final_hidden_states, router_logits -+++ -+++ class Qwen2MoeDecoderLayer(nn.Module): -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+++ super().__init__() -+++ self.hidden_size = config.hidden_size -++++ -++++ # if Long_Prompt: -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ # else: -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++ -+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ -+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++- -+++ if (layer_idx not in config.mlp_only_layers) and ( -+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++ ): -+++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ self._warmed_up = True -+++ self.warmup_moe_model() -+++ -++++ -++++ -+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++ output_router_logits = ( -+++ output_router_logits if output_router_logits is not None else self.config.output_router_logits -+++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ router_logits=outputs.router_logits, -+++ ) -+++ -++++ def generate(self, *args, **kwargs): -++++ """ -++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++++ """ -++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++++ -++++ input_ids = kwargs.get("input_ids") -++++ if input_ids is None and args: -++++ input_ids = args[0] -++++ -++++ if input_ids is not None: -++++ prompt_length = input_ids.shape[1] -++++ -++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -++++ Long_Prompt = True -++++ else: -++++ Long_Prompt = False -++++ -++++ return super().generate(*args, **kwargs) -++++ -+++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -+++ def prepare_inputs_for_generation( -+++ self, -+++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -+++ # Exception 1: when passing input_embeds, input_ids may be missing entries -+++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -++++ -+++ if past_key_values is not None: -+++ if inputs_embeds is not None: # Exception 1 -+++ if 0 not in input_ids.shape: -+++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ } -+++ ) -+++ return model_inputs -++++ -+++ # @lwx -+++ # def _decode_one_tokens_logits( -+++ # self, -+++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -+++ attentions=outputs.attentions, -+++ ) -+++ -++++ -+++ __all__ = [ -+++ "Qwen2MoeForCausalLM", -+++ "Qwen2MoeModel", -+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++new file mode 100644 -+++index 00000000..6dfb5b93 -+++--- /dev/null -++++++ b/patches/0001-20251104commit.patch -+++@@ -0,0 +1,1272 @@ -++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++Subject: [PATCH] 20251104commit -++++ -++++--- -++++ mindnlp/transformers/cache_utils.py | 28 +- -++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++ 3 files changed, 976 insertions(+), 87 deletions(-) -++++ -++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++index cadd2e04..02f8d4be 100644 -++++--- a/mindnlp/transformers/cache_utils.py -+++++++ b/mindnlp/transformers/cache_utils.py -++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++ # k_out[:, :, cache_position] = key_states -++++ # v_out[:, :, cache_position] = value_states -++++- if ON_ORANGE_PI: -++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++- else: -++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++- -+++++ # if ON_ORANGE_PI: -+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++ # else: -+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++++ if cache_position.ndim > 1: -+++++ cache_position = cache_position.flatten() -+++++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++++ cache_position = cache_position.int() -+++++ -+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++++ k_out[:, :, cache_position] = key_states -+++++ v_out[:, :, cache_position] = value_states -+++++ -++++ return k_out, v_out -++++ -++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index c695b944..d8303e45 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++- x1 = x[..., : x.shape[-1] // 2] -++++- x2 = x[..., x.shape[-1] // 2 :] -+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++ # x1 = x[..., : x.shape[-1] // 2] -+++++ # x2 = x[..., x.shape[-1] // 2 :] -+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++ if self.training: -++++ raise NotImplementedError("Training is not supported yet.") -++++ else: -++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++- if self.config.n_shared_experts is not None: -++++- y = y + self.shared_experts(identity) -++++- return y -+++++ # @lwx -+++++ if orig_shape[1] == 1: -+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++++ y=y.view(*orig_shape) -+++++ if self.config.n_shared_experts is not None: -+++++ y = y + self.shared_experts(identity) -+++++ return y -+++++ else: -+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++++ if self.config.n_shared_experts is not None: -+++++ y = y + self.shared_experts(identity) -+++++ return y -+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++ # if self.config.n_shared_experts is not None: -+++++ # y = y + self.shared_experts(identity) -+++++ # return y -+++++ -+++++ @no_grad() -+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ -+++++ expert_cache = ops.zeros_like(x) -+++++ for i in range(self.num_experts_per_tok): -+++++ expert_id = flat_expert_indices[i].item() -+++++ weight = flat_expert_weights[i].item() -+++++ expert = self.experts[expert_id] -+++++ expert_out = expert(x) -+++++ expert_cache += expert_out * weight -+++++ return expert_cache -++++ -++++ @no_grad() -++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++- # expert_cache = torch.zeros_like(x) -++++- # idxs = flat_expert_indices.argsort() -++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++- # token_idxs = idxs // self.num_experts_per_tok -++++- # for i, end_idx in enumerate(tokens_per_expert): -++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++- # if start_idx == end_idx: -++++- # continue -++++- # expert = self.experts[i] -++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++- # expert_tokens = x[exp_token_idx] -++++- # expert_out = expert(expert_tokens) -++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++- # return expert_cache -+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++ expert_cache = ops.zeros_like(x) -++++ idxs = flat_expert_indices.argsort() -++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ token_idxs = idxs // self.num_experts_per_tok -+++++ -++++ for i, end_idx in enumerate(tokens_per_expert): -++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ if start_idx == end_idx: -++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++ expert_out = expert(expert_tokens) -++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -++++ return expert_cache -+++++ -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # # expert_cache = torch.zeros_like(x) -+++++ # # idxs = flat_expert_indices.argsort() -+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++ # # if start_idx == end_idx: -+++++ # # continue -+++++ # # expert = self.experts[i] -+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # # expert_tokens = x[exp_token_idx] -+++++ # # expert_out = expert(expert_tokens) -+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++ # # return expert_cache -+++++ # expert_cache = ops.zeros_like(x) -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # if start_idx == end_idx: -+++++ # continue -+++++ # expert = self.experts[i] -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = expert(expert_tokens) -+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -+++++ # return expert_cache -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # expert_cache = ops.zeros_like(x) -+++++ -+++++ # # 排序保证顺序一致 -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # # 找出有 token 的专家 -+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++ -+++++ # for i in active_experts.tolist(): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # end_idx = tokens_per_expert[i] -+++++ # if start_idx == end_idx: # 没有 token -+++++ # continue -+++++ -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = self.experts[i](expert_tokens) -+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++ -+++++ # expert_cache = mindspore.mint.scatter_add( -+++++ # expert_cache, -+++++ # 0, -+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++ # expert_out -+++++ # ) -+++++ -+++++ # return expert_cache -+++++ -+++++ -++++ -++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++ # """ -++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ -++++ # Initialize weights and apply final processing -++++ self.post_init() -+++++ self.warm_up = False -+++++ -+++++ def warmup_moe_model_deep(self): -+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++ test_texts = [ -+++++ "warmup short", -+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++++ ] -+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++ if tokenizer is None: -+++++ from mindnlp.transformers import AutoTokenizer -+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++ self._warmup_tokenizer = tokenizer -+++++ -+++++ for text in test_texts: -+++++ inputs = tokenizer(text, return_tensors="ms") -+++++ with mindspore._no_grad(): -+++++ _ = self(**inputs, use_cache=False) -+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++ -++++ def get_input_embeddings(self): -++++ return self.model.embed_tokens -++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++ ```""" -+++++ if not self.warm_up: -+++++ self.warm_up = True -+++++ self.warmup_moe_model_deep() -+++++ -++++ output_attentions = ( -++++ output_attentions -++++ if output_attentions is not None -++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++index 3cbf820e..d4c6b651 100644 -++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++@@ -18,7 +18,6 @@ -++++ # See the License for the specific language governing permissions and -++++ # limitations under the License. -++++ """MindSpore Qwen2MoE model.""" -++++- -++++ import math -++++ from typing import List, Optional, Tuple, Union -++++ -++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++ TokenClassifierOutput, -++++ ) -++++ from ...modeling_utils import PreTrainedModel -+++++from ...generation import GenerationMixin -++++ from ....utils import logging -++++ from .configuration_qwen2_moe import Qwen2MoeConfig -++++ -++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++ self.variance_epsilon = eps -++++ -++++ def forward(self, hidden_states): -+++++ # @dwj -+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++ # @lwx -+++++ # if not self.training : -+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++ input_dtype = hidden_states.dtype -++++ hidden_states = hidden_states.to(mindspore.float32) -++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++@@ -234,6 +239,8 @@ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++ x1 = x[..., : x.shape[-1] // 2] -++++ x2 = x[..., x.shape[-1] // 2 :] -+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++ self.config = config -++++ self.hidden_size = config.hidden_size -++++ self.intermediate_size = intermediate_size -+++++ -++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++ self.act_fn = ACT2FN[config.hidden_act] -++++ -++++ def forward(self, x): -++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++- -++++ -+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++ # @lwx -+++++ # gate_up_output = self.gate_up_proj(x) -+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++++ # return self.down_proj(swiglu_output) -+++++ -+++++ # def forward(self, x): -+++++ # gate_proj_out = self.gate_proj(x) -+++++ # up_proj_out = self.up_proj(x) -+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++++ # return self.down_proj(swiglu_out) -+++++ -++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++ """ -++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++ use_cache: bool = False, -++++ cache_position: Optional[mindspore.Tensor] = None, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ -+++++ -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ query_states = self.q_proj(hidden_states) -++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ "with a layer index." -++++ ) -++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ if isinstance(past_key_value, StaticCache): -+++++ kv_seq_len = key_states.shape[-2] -+++++ else: -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++ if isinstance(past_key_value, StaticCache): -+++++ kv_seq_len = key_states.shape[-2] -++++ -++++ # repeat k/v heads if n_kv_heads < n_heads -++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++- -+++++ -++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++ -++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++- raise ValueError( -++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++- f" {attn_weights.shape}" -++++- ) -++++- -++++- if attention_mask is not None: # no matter the length, we just slice it -++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++++ if attention_mask is not None: -+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++ attn_weights = attn_weights + causal_mask -++++ -++++ # upcast attention to fp32 -++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++ -++++ attn_output = self.o_proj(attn_output) -++++- -+++++ # @lwx -+++++ -+++++ # max_seq_len = self.max_position_embeddings # 2048 -+++++ -+++++ # if attention_mask is not None: -+++++ # # attention_mask: [B, 1, Sq, Sk] -+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++ -+++++ # # pad 到 [max_seq_len, max_seq_len] -+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++ # global_attention_mask = padded_mask -+++++ # else: -+++++ # global_attention_mask = None -+++++ -+++++ -+++++ # sparse_mode=3 -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # real_shift=None, -+++++ # padding_mask=None, -+++++ -+++++ # head_num=self.num_heads, -+++++ # attn_mask=global_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ # input_layout="BNSD", -+++++ # pre_tokens=2147483647, -+++++ # next_tokens=2147483647, -+++++ # inner_precise=0, -+++++ # drop_mask=None, -+++++ # prefix=None, -+++++ # actual_seq_qlen=None, -+++++ # actual_seq_kvlen=None, -+++++ # sparse_mode=sparse_mode, -+++++ # ) -++++ if not output_attentions: -++++ attn_weights = None -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++ -+++++class Qwen2MoeFlashAttention(nn.Module): -+++++ """ -+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++ -+++++ 关键改动: -+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++ 直接传入原始的 key 和 value 张量效率更高。 -+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++ """ -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++ super().__init__() -+++++ self.config = config -+++++ self.layer_idx = layer_idx -+++++ self.hidden_size = config.hidden_size -+++++ self.num_heads = config.num_attention_heads -+++++ self.head_dim = self.hidden_size // self.num_heads -+++++ self.num_key_value_heads = config.num_key_value_heads -+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++ self.max_position_embeddings = config.max_position_embeddings -+++++ self.rope_theta = config.rope_theta -+++++ self.attention_dropout = config.attention_dropout -+++++ -+++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++ raise ValueError( -+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++ ) -+++++ -+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++ -+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++ self.head_dim, -+++++ max_position_embeddings=self.max_position_embeddings, -+++++ base=self.rope_theta, -+++++ ) -+++++ -+++++ def forward( -+++++ self, -+++++ hidden_states: mindspore.Tensor, -+++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++ position_ids: Optional[mindspore.Tensor] = None, -+++++ past_key_value: Optional[Cache] = None, -+++++ output_attentions: bool = False, -+++++ use_cache: bool = False, -+++++ cache_position: Optional[mindspore.Tensor] = None, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # 1. 线性投射 Q, K, V -+++++ query_states = self.q_proj(hidden_states) -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++ # query: [B, S, H*D] -> [B, N1, S, D] -+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # 3. RoPE 旋转位置编码 -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++ if self.layer_idx is None: -+++++ raise ValueError( -+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ "with a layer index." -+++++ ) -+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++ if cache_position.shape[0] == 1: -+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++ kv_seq_len = past_seen_tokens + 1 -+++++ else: -+++++ # prefill 阶段:cache_position 是范围,使用其长度 -+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++ else: -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # 4. KV 缓存更新 -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ key_states, value_states = past_key_value.update( -+++++ key_states, value_states, self.layer_idx, cache_kwargs -+++++ ) -+++++ -+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++ if cache_position.shape[0] == 1: -+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++ kv_seq_len = key_states.shape[-2] -+++++ -+++++ # 5. [重要] 准备 Attention Mask -+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++ fa_attention_mask = None -+++++ if attention_mask is not None: -+++++ # 截取与当前key长度匹配的部分 -+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++ fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++ input_dtype = query_states.dtype -+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++ query_states = query_states.to(mindspore.float16) -+++++ key_states = key_states.to(mindspore.float16) -+++++ value_states = value_states.to(mindspore.float16) -+++++ -+++++ # 6. [核心] 调用 flash_attention_score 算子 -+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++ attn_output = mindspore.ops.flash_attention_score( -+++++ query=query_states, -+++++ key=key_states, -+++++ value=value_states, -+++++ head_num=self.num_heads, # 传入Q的头数(N1) -+++++ attn_mask=fa_attention_mask, -+++++ keep_prob=1.0 - self.attention_dropout, -+++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ input_layout="BNSD", -+++++ sparse_mode=0 # 使用 defaultMask 模式 -+++++ ) -+++++ -+++++ # 恢复原始数据类型 -+++++ attn_output = attn_output.to(input_dtype) -+++++ -+++++ # 7. 调整输出形状 -+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = self.o_proj(attn_output) -+++++ -+++++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++++ attn_weights = None -+++++ if output_attentions: -+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ # def forward( -+++++ # self, -+++++ # hidden_states: mindspore.Tensor, -+++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++ # past_key_value: Optional[Cache] = None, -+++++ # output_attentions: bool = False, -+++++ # use_cache: bool = False, -+++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ # bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # # 1. 线性投射 Q, K, V -+++++ # query_states = self.q_proj(hidden_states) -+++++ # key_states = self.k_proj(hidden_states) -+++++ # value_states = self.v_proj(hidden_states) -+++++ -+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # # 3. RoPE 旋转位置编码 -+++++ # kv_seq_len = key_states.shape[-2] -+++++ # if past_key_value is not None: -+++++ # if self.layer_idx is None: -+++++ # raise ValueError( -+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ # "with a layer index." -+++++ # ) -+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # # 4. KV 缓存更新 -+++++ # if past_key_value is not None: -+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ # key_states, value_states = past_key_value.update( -+++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++ # ) -+++++ -+++++ # # 5. 准备 Attention Mask -+++++ # fa_attention_mask = None -+++++ # if attention_mask is not None: -+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++ # input_dtype = query_states.dtype -+++++ -+++++ # # 6. [核心] 调用 flash_attention_score 算子 -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # head_num=self.num_heads, -+++++ # attn_mask=fa_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ # input_layout="BNSD", -+++++ # sparse_mode=0, -+++++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++ # inner_precise=1 -+++++ # ) -+++++ -+++++ # # 恢复原始数据类型 -+++++ # attn_output = attn_output.to(input_dtype) -+++++ -+++++ # # 7. 调整输出形状 -+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ # attn_output = self.o_proj(attn_output) -+++++ -+++++ # attn_weights = None -+++++ # if output_attentions: -+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++ # return attn_output, attn_weights, past_key_value -+++++ -+++++ # def forward( -+++++ # self, -+++++ # hidden_states: mindspore.Tensor, -+++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++ # past_key_value: Optional[Cache] = None, -+++++ # output_attentions: bool = False, -+++++ # use_cache: bool = False, -+++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ # bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # query_states = self.q_proj(hidden_states) -+++++ # key_states = self.k_proj(hidden_states) -+++++ # value_states = self.v_proj(hidden_states) -+++++ -+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # kv_seq_len = key_states.shape[-2] -+++++ # if past_key_value is not None: -+++++ # if self.layer_idx is None: -+++++ # raise ValueError("`layer_idx` must be specified for caching") -+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # if past_key_value is not None: -+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ # key_states, value_states = past_key_value.update( -+++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++ # ) -+++++ -+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++ -+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++ # query_states = query_states / math.sqrt(self.head_dim) -+++++ # # <--- 修改结束 --- -+++++ -+++++ # fa_attention_mask = None -+++++ # if attention_mask is not None: -+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # input_dtype = query_states.dtype -+++++ -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, # 传入已经预先缩放过的 query -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # head_num=self.num_heads, -+++++ # attn_mask=fa_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++ # input_layout="BNSD", -+++++ # sparse_mode=0, -+++++ # inner_precise=1 # 仍然保持内部高精度计算 -+++++ # ) -+++++ -+++++ # attn_output = attn_output.to(input_dtype) -+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ # attn_output = self.o_proj(attn_output) -+++++ -+++++ # attn_weights = None -+++++ # if output_attentions: -+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++ -+++++ # return attn_output, attn_weights, past_key_value -+++++ -++++ QWEN2MOE_ATTENTION_CLASSES = { -++++ "eager": Qwen2MoeAttention, -+++++ "flash-attention": Qwen2MoeFlashAttention, -++++ } -++++ -++++ -++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -+++++ #@dwj -+++++ # 只遍历激活的专家,而非全部专家 -++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- hidden_states = hidden_states.view(-1, hidden_dim) -++++- # router_logits: (batch * sequence_length, n_experts) -++++- router_logits = self.gate(hidden_states) -++++- -++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- if self.norm_topk_prob: -++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- # we cast back to the input dtype -++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++- final_hidden_states = ops.zeros( -++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++- ) -++++- -++++- # One hot encode the selected experts to create an expert mask -++++- # this will be used to easily index which expert is going to be sollicitated -++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++- -++++- # Loop over all available experts in the model and perform the computation on each expert -++++- for expert_idx in range(self.num_experts): -++++- expert_layer = self.experts[expert_idx] -++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++- -++++- # Index the correct hidden states and compute the expert hidden state for -++++- # the current expert. We need to make sure to multiply the output hidden -++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++- if 0 not in idx.shape: -++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++- -++++- # However `index_add_` only support torch tensors for indexing so we'll use -++++- # the `top_x` tensor here. -++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++- -++++- shared_expert_output = self.shared_expert(hidden_states) -++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++- -++++- final_hidden_states = final_hidden_states + shared_expert_output -+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++ num_tokens = hidden_states_reshaped.shape[0] -+++++ -+++++ router_logits = self.gate(hidden_states_reshaped) -+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++ if self.norm_topk_prob: -+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++ flat_selected_experts = selected_experts.flatten() -+++++ -+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++ token_indices = broadcasted_token_indices.flatten() -+++++ -+++++ active_experts = ops.unique(flat_selected_experts) -+++++ -+++++ for expert_idx_tensor in active_experts: -+++++ expert_idx = expert_idx_tensor.item() -+++++ expert_layer = self.experts[expert_idx] -+++++ -+++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++ selected_token_indices = token_indices[mask] -+++++ selected_routing_weights = routing_weights.flatten()[mask] -+++++ -+++++ current_states = hidden_states_reshaped[selected_token_indices] -+++++ -+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++ -+++++ final_hidden_states = final_hidden_states.index_add( -+++++ dim=0, -+++++ index=selected_token_indices, -+++++ source=expert_output.to(hidden_states.dtype) -+++++ ) -+++++ -+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++ -++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++- return final_hidden_states, router_logits -+++++ final_hidden_states = final_hidden_states + shared_expert_output -+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++ -+++++ return final_hidden_states, router_logits -++++ -++++ -++++ class Qwen2MoeDecoderLayer(nn.Module): -++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++ -++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++ -++++ if (layer_idx not in config.mlp_only_layers) and ( -++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++ ): -++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++ _skip_keys_device_placement = "past_key_values" -++++ _supports_cache_class = True -+++++#lwx -+++++ # _supports_static_cache = True -++++ -++++ def _init_weights(self, module): -++++ std = self.config.initializer_range -++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++ return causal_mask -++++ -++++ -++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ _tied_weights_keys = ["lm_head.weight"] -++++ -++++ def __init__(self, config): -++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ self.num_experts_per_tok = config.num_experts_per_tok -++++ # Initialize weights and apply final processing -++++ self.post_init() -+++++ # @lwx -+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++++ # self.generation_config.cache_implementation = "static" -+++++ self._warmed_up = False -+++++ -+++++ def warmup_moe_model(self): -+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++++ test_texts = [ -+++++ "warmup short", -+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++++ ] -+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++ if tokenizer is None: -+++++ from mindnlp.transformers import AutoTokenizer -+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++ self._warmup_tokenizer = tokenizer -+++++ -+++++ for text in test_texts: -+++++ inputs = tokenizer(text, return_tensors="ms") -+++++ with mindspore._no_grad(): -+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++ -++++ def get_input_embeddings(self): -++++ return self.model.embed_tokens -++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++ ```""" -+++++ if not self._warmed_up: -+++++ self._warmed_up = True -+++++ self.warmup_moe_model() -++++ -++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++ output_router_logits = ( -++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ } -++++ ) -++++ return model_inputs -+++++# @lwx -+++++ # def _decode_one_tokens_logits( -+++++ # self, -+++++ # cur_token: mindspore.Tensor, -+++++ # input_pos: Optional[mindspore.Tensor], -+++++ # cache_position: mindspore.Tensor, -+++++ # past_key_values: StaticCache, -+++++ # ) -> mindspore.Tensor: -+++++ # """ -+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++++ -+++++ # Args: -+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++++ # input_pos: 输入位置信息,可选 -+++++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++++ -+++++ # Returns: -+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++++ # """ -+++++ # # 调用JIT编译的版本 -+++++ # return self.get_decode_one_tokens_logits( -+++++ # cur_token=cur_token, -+++++ # input_pos=input_pos, -+++++ # cache_position=cache_position, -+++++ # past_key_values=past_key_values, -+++++ # ) -+++++ -+++++ # @mindspore.jit(jit_level='O1') -+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++++ # """ -+++++ # JIT编译的函数,用于高效的单token解码 -+++++ # 使用JIT编译优化以支持静态shape和高效执行 -+++++ -+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++++ # """ -+++++ # outputs = self.model.forward( -+++++ # input_ids=cur_token, -+++++ # position_ids=input_pos, -+++++ # cache_position=cache_position, -+++++ # past_key_values=past_key_values, -+++++ # use_cache=True, -+++++ # return_dict=False, -+++++ # ) -+++++ -+++++ # hidden_states = outputs[0] -+++++ # logits = self.lm_head.forward(hidden_states) -+++++ # logits = logits.float() -+++++ -+++++ # return logits[:, -1, :] -+++++ -+++++ # def _sample( -+++++ # self, -+++++ # input_ids: mindspore.Tensor, -+++++ # logits_processor, -+++++ # stopping_criteria, -+++++ # generation_config, -+++++ # synced_devices: bool, -+++++ # streamer=None, -+++++ # logits_warper=None, -+++++ # **model_kwargs, -+++++ # ): -+++++ # """ -+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++++ # """ -+++++ # from ...generation.logits_process import LogitsProcessorList -+++++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++++ # from mindnlp.core import nn, ops, no_grad -+++++ # import numpy as np -+++++ -+++++ # # 检查是否使用 StaticCache -+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++++ # # 否则,直接调用父类方法 -+++++ # past_key_values = model_kwargs.get("past_key_values") -+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++++ -+++++ # if not isinstance(past_key_values, StaticCache): -+++++ # # 不使用 StaticCache,直接调用父类方法 -+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++++ # return super()._sample( -+++++ # input_ids=input_ids, -+++++ # logits_processor=logits_processor, -+++++ # stopping_criteria=stopping_criteria, -+++++ # generation_config=generation_config, -+++++ # synced_devices=synced_devices, -+++++ # streamer=streamer, -+++++ # logits_warper=logits_warper, -+++++ # **model_kwargs, -+++++ # ) -+++++ -+++++ # # 使用 StaticCache,进入自定义循环 -+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++++ # pad_token_id = generation_config._pad_token_tensor -+++++ # output_attentions = generation_config.output_attentions -+++++ # output_hidden_states = generation_config.output_hidden_states -+++++ # output_scores = generation_config.output_scores -+++++ # output_logits = generation_config.output_logits -+++++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++++ # max_length = generation_config.max_length -+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++++ # do_sample = generation_config.do_sample -+++++ -+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++++ # raise ValueError( -+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++++ # f"{logits_warper})." -+++++ # ) -+++++ -+++++ # # init attention / hidden states / scores tuples -+++++ # scores = () if (return_dict_in_generate and output_scores) else None -+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++++ -+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++++ # encoder_hidden_states = ( -+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++++ # ) -+++++ -+++++ # # keep track of which sequences are already finished -+++++ # batch_size, cur_len = input_ids.shape -+++++ # this_peer_finished = False -+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++++ -+++++ # time_record = [] -+++++ # from ....utils.testing_utils import parse_flag_from_env -+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++++ -+++++ # while self._has_unfinished_sequences( -+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++++ # ): -+++++ # if _record_time: -+++++ # import time as time_module -+++++ # infer_start = time_module.time() -+++++ -+++++ # # prepare model inputs -+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++++ -+++++ # # prepare variable output controls -+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++++ -+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++++ # cur_cache_position = model_inputs.get("cache_position") -+++++ # cur_past_key_values = model_inputs.get("past_key_values") -+++++ # cur_input_ids = model_inputs.get("input_ids") -+++++ -+++++ # if (isinstance(cur_past_key_values, StaticCache) and -+++++ # cur_cache_position is not None and -+++++ # len(cur_cache_position.shape) > 0 and -+++++ # cur_cache_position.shape[0] == 1 and -+++++ # cur_input_ids is not None and -+++++ # cur_input_ids.shape[1] == 1): -+++++ # # 使用 JIT 优化的单 token 解码 -+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++++ # if not hasattr(self, '_jit_used'): -+++++ # self._jit_used = False -+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++++ -+++++ # next_token_logits = self.get_decode_one_tokens_logits( -+++++ # cur_token=cur_input_ids, -+++++ # input_pos=model_inputs.get("position_ids"), -+++++ # cache_position=cur_cache_position, -+++++ # past_key_values=cur_past_key_values, -+++++ # ) -+++++ -+++++ # # 标记已使用JIT(用于后续判断) -+++++ # if not self._jit_used: -+++++ # self._jit_used = True -+++++ -+++++ # # 构造兼容的输出对象 -+++++ # class JitOptimizedOutput: -+++++ # def __init__(self, logits, config): -+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++++ # self.config = config -+++++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++++ # self.attentions = None if not config.is_encoder_decoder else None -+++++ # self.cross_attentions = None -+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++++ -+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++++ # else: -+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++++ # outputs = self(**model_inputs, return_dict=True) -+++++ -+++++ # if synced_devices and this_peer_finished: -+++++ # continue -+++++ -+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++++ # next_token_logits = outputs.logits[:, -1, :] -+++++ -+++++ # # pre-process distribution -+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++++ # if do_sample: -+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++++ -+++++ # # Store scores, attentions and hidden_states when required -+++++ # if return_dict_in_generate: -+++++ # if output_scores: -+++++ # scores += (next_token_scores,) -+++++ # if output_logits: -+++++ # raw_logits += (next_token_logits,) -+++++ # if output_attentions: -+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++++ # if self.config.is_encoder_decoder: -+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++++ -+++++ # if output_hidden_states: -+++++ # hidden = ( -+++++ # outputs.decoder_hidden_states -+++++ # if self.config.is_encoder_decoder -+++++ # else outputs.hidden_states -+++++ # ) -+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++++ -+++++ # # token selection -+++++ # if do_sample: -+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++++ # else: -+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++++ -+++++ # # finished sentences should have their next token be a padding token -+++++ # if has_eos_stopping_criteria: -+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++++ -+++++ # # update generated ids, model inputs, and length for next step -+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++++ # if streamer is not None: -+++++ # streamer.put(next_tokens) -+++++ -+++++ # model_kwargs = self._update_model_kwargs_for_generation( -+++++ # outputs, -+++++ # model_kwargs, -+++++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++++ # ) -+++++ -+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++++ # cur_len += 1 -+++++ -+++++ # if _record_time: -+++++ # import time as time_module -+++++ # infer_stop = time_module.time() -+++++ # time_record.append(infer_stop - infer_start) -+++++ -+++++ # del outputs -+++++ -+++++ # average_infer_time = None -+++++ # if time_record: -+++++ # if len(time_record) > 1: -+++++ # time_record.pop(0) -+++++ # average_infer_time = sum(time_record) / len(time_record) -+++++ # print(f'average inference time is: {average_infer_time}') -+++++ # print(f'inference time record: {time_record}') -+++++ -+++++ # if streamer is not None: -+++++ # streamer.end() -+++++ -+++++ # # 简单判断:打印是否使用了JIT路径 -+++++ # if hasattr(self, '_jit_used') and self._jit_used: -+++++ # print("[JIT] ✓ JIT optimization was used during generation") -+++++ # else: -+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++++ -+++++ # if return_dict_in_generate: -+++++ # if self.config.is_encoder_decoder: -+++++ # return GenerateEncoderDecoderOutput( -+++++ # sequences=input_ids, -+++++ # scores=scores, -+++++ # logits=raw_logits, -+++++ # encoder_attentions=encoder_attentions, -+++++ # encoder_hidden_states=encoder_hidden_states, -+++++ # decoder_attentions=decoder_attentions, -+++++ # cross_attentions=cross_attentions, -+++++ # decoder_hidden_states=decoder_hidden_states, -+++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++ # average_infer_time=average_infer_time -+++++ # ) -+++++ # else: -+++++ # return GenerateDecoderOnlyOutput( -+++++ # sequences=input_ids, -+++++ # scores=scores, -+++++ # logits=raw_logits, -+++++ # attentions=decoder_attentions, -+++++ # hidden_states=decoder_hidden_states, -+++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++ # average_infer_time=average_infer_time -+++++ # ) -+++++ # else: -+++++ # return input_ids -+++++ -+++++ # def _prepare_cache_for_generation( -+++++ # self, -+++++ # generation_config, -+++++ # model_kwargs, -+++++ # assistant_model, -+++++ # batch_size, -+++++ # max_cache_length, -+++++ # ): -+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++++ # generation_config.cache_implementation = "static" -+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++++ -+++++ # if generation_config.cache_implementation == "static": -+++++ # base_required_from_max_length = generation_config.max_length + 1 -+++++ # base_required = max(max_cache_length, base_required_from_max_length) -+++++ # min_cache_size = 50 -+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++++ # else: -+++++ # max_cache_length = max(base_required, min_cache_size) -+++++ -+++++ # original_max_cache_length = max_cache_length -+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++++ # print(f" - final max_cache_length: {max_cache_length}") -+++++ -+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++ # if max_cache_length > self.config.max_position_embeddings: -+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++++ -+++++ # result = super()._prepare_cache_for_generation( -+++++ # generation_config=generation_config, -+++++ # model_kwargs=model_kwargs, -+++++ # assistant_model=assistant_model, -+++++ # batch_size=batch_size, -+++++ # max_cache_length=max_cache_length, -+++++ # ) -+++++ -+++++ # if generation_config.cache_implementation == "static": -+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++++ # created_cache = model_kwargs.get(cache_name) -+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++++ # if created_cache.max_cache_len < generation_config.max_length: -+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++++ -+++++ # return result -+++++ -+++++ -+++++ -++++ -++++ -++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++-- -++++2.27.0 -++++ -+++-- -+++2.27.0 -+++ -++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++new file mode 100644 -++index 00000000..966529e4 -++--- /dev/null -+++++ b/patches/0003-20261106secondcommit.patch -++@@ -0,0 +1,2769 @@ -+++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Thu, 6 Nov 2025 14:54:37 +0800 -+++Subject: [PATCH 3/3] 20261106secondcommit -+++ -+++--- -+++ .../models/deepseek/modeling_deepseek.py | 217 ++- -+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -+++ patches/0001-20251104commit.patch | 1272 ----------------- -+++ 3 files changed, 528 insertions(+), 2032 deletions(-) -+++ delete mode 100644 patches/0001-20251104commit.patch -+++ -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index 73773c22..2f9192bf 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -+++ -+++ _CONFIG_FOR_DOC = "DeepseekConfig" -+++ -++++_attn_mask_cache = {} -++++ -++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -++++ q_len = batch_and_seq[1] -++++ kv_len = batch_and_seq[1] + past_key_values_length -++++ key = (batch_and_seq[0], q_len, kv_len) -++++ -++++ if key in _attn_mask_cache: -++++ return _attn_mask_cache[key] -++++ -++++ mask = _prepare_4d_causal_attention_mask( -++++ attention_mask, -++++ batch_and_seq, -++++ inputs_embeds, -++++ past_key_values_length, -++++ ) -++++ _attn_mask_cache[key] = mask -++++ return mask -+++ -+++ def _get_unpad_data(attention_mask): -+++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -+++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -+++ return final_output -+++ -+++ -+++- @no_grad() -+++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++- expert_cache = ops.zeros_like(x) -+++- idxs = flat_expert_indices.argsort() -+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++- token_idxs = idxs // self.num_experts_per_tok -+++- -+++- for i, end_idx in enumerate(tokens_per_expert): -+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++- if start_idx == end_idx: -+++- continue -+++- expert = self.experts[i] -+++- exp_token_idx = token_idxs[start_idx:end_idx] -+++- expert_tokens = x[exp_token_idx] -+++- expert_out = expert(expert_tokens) -+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++- -+++- return expert_cache -+++- -+++ # @no_grad() -+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++- # # expert_cache = torch.zeros_like(x) -+++- # # idxs = flat_expert_indices.argsort() -+++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++- # # token_idxs = idxs // self.num_experts_per_tok -+++- # # for i, end_idx in enumerate(tokens_per_expert): -+++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++- # # if start_idx == end_idx: -+++- # # continue -+++- # # expert = self.experts[i] -+++- # # exp_token_idx = token_idxs[start_idx:end_idx] -+++- # # expert_tokens = x[exp_token_idx] -+++- # # expert_out = expert(expert_tokens) -+++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++- # # return expert_cache -++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++ # expert_cache = ops.zeros_like(x) -+++ # idxs = flat_expert_indices.argsort() -+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++ -+++ # return expert_cache -+++- # @no_grad() -+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++- # expert_cache = ops.zeros_like(x) -++++ -++++ @no_grad() -++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++ """ -++++ 优化版 MoE prefill: -++++ - 批量张量化处理同一个 expert 的所有 token -++++ - 跳过无 token 的专家 -++++ - 保持结果完全一致 -++++ """ -++++ # 初始化输出缓存 -++++ expert_cache = ops.zeros_like(x) -+++ -+++- # # 排序保证顺序一致 -+++- # idxs = flat_expert_indices.argsort() -+++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++- # token_idxs = idxs // self.num_experts_per_tok -++++ # 排序(确保 scatter_add 位置对应原逻辑) -++++ idxs = flat_expert_indices.argsort() -++++ sorted_expert_indices = flat_expert_indices[idxs] -++++ sorted_token_indices = idxs // self.num_experts_per_tok -+++ -+++- # # 找出有 token 的专家 -+++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++ # 每个 expert 的 token 数 -++++ tokens_per_expert = sorted_expert_indices.bincount() -+++ -+++- # for i in active_experts.tolist(): -+++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++- # end_idx = tokens_per_expert[i] -+++- # if start_idx == end_idx: # 没有 token -+++- # continue -++++ # 找出有 token 的专家 -++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+++ -+++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++- # expert_tokens = x[exp_token_idx] -+++- # expert_out = self.experts[i](expert_tokens) -+++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++ for expert_id in active_experts.tolist(): -++++ # 取该 expert 对应的排序后 token 区间 -++++ start = (tokens_per_expert[:expert_id]).sum().item() -++++ end = start + tokens_per_expert[expert_id].item() -+++ -+++- # expert_cache = mindspore.mint.scatter_add( -+++- # expert_cache, -+++- # 0, -+++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++- # expert_out -+++- # ) -++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -++++ expert_tokens = x[token_idx] # 取输入向量 -+++ -+++- # return expert_cache -++++ # 执行专家 MLP -++++ expert_out = self.experts[expert_id](expert_tokens) -++++ -++++ # 按权重缩放 -++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -++++ -++++ # 回写到缓存(等价 scatter_add) -++++ expert_cache = mindspore.mint.scatter_add( -++++ expert_cache, -++++ 0, -++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++ scaled_out -++++ ) -++++ -++++ return expert_cache -++++ -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # # expert_cache = torch.zeros_like(x) -++++ # # idxs = flat_expert_indices.argsort() -++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++ # # token_idxs = idxs // self.num_experts_per_tok -++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++ # # if start_idx == end_idx: -++++ # # continue -++++ # # expert = self.experts[i] -++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # # expert_tokens = x[exp_token_idx] -++++ # # expert_out = expert(expert_tokens) -++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++ # # return expert_cache -++++ # expert_cache = ops.zeros_like(x) -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # for i, end_idx in enumerate(tokens_per_expert): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # if start_idx == end_idx: -++++ # continue -++++ # expert = self.experts[i] -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = expert(expert_tokens) -++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -++++ # return expert_cache -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++ # expert_cache = ops.zeros_like(x) -++++ -++++ # # 排序保证顺序一致 -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ # token_idxs = idxs // self.num_experts_per_tok -++++ -++++ # # 找出有 token 的专家 -++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++ -++++ # for i in active_experts.tolist(): -++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ # end_idx = tokens_per_expert[i] -++++ # if start_idx == end_idx: # 没有 token -++++ # continue -++++ -++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++ # expert_tokens = x[exp_token_idx] -++++ # expert_out = self.experts[i](expert_tokens) -++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++ -++++ # expert_cache = mindspore.mint.scatter_add( -++++ # expert_cache, -++++ # 0, -++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++ # expert_out -++++ # ) -++++ -++++ # return expert_cache -+++ -+++ -+++ -+++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++- -+++ # class DeepseekFlashAttention(nn.Module): -+++ # """ -+++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -++++ -+++ Deepseek_ATTENTION_CLASSES = { -+++ "eager": DeepseekAttention, -+++ "flash-attention": DeepseekFlashAttention, -+++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -+++ ) -+++ else: -+++ # 4d mask is passed through the layers -+++- attention_mask = _prepare_4d_causal_attention_mask( -++++ # attention_mask = _prepare_4d_causal_attention_mask( -++++ # attention_mask, -++++ # (batch_size, seq_length), -++++ # inputs_embeds, -++++ # past_key_values_length, -++++ # ) -++++ #@dwj -++++ attention_mask = get_cached_causal_mask( -+++ attention_mask, -+++ (batch_size, seq_length), -+++ inputs_embeds, -+++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ # Initialize weights and apply final processing -+++ self.post_init() -+++ self.warm_up = False -++++ #@dwj -++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -++++ self.num_layers, -++++ self.num_attention_heads, -++++ self.head_dim, -++++ batch_size=1, -++++ max_length=self.max_length, -++++ dtype=mindspore.float16 -++++ ) -++++ -++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -++++ key_cache = [] -++++ value_cache = [] -++++ for _ in range(num_layers): -++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++ key_cache.append(k) -++++ value_cache.append(v) -++++ return key_cache, value_cache -++++ -+++ -+++ def warmup_moe_model_deep(self): -+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++index bced285c..ebd7782e 100644 -+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+++ -+++-Long_Prompt = False -+++-PROMPT_LENGTH_THRESHOLD = 128 -++++Long_Prompt = 1 -++++LONG_PROMPT_LENGTH_THRESHOLD = 128 -++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -++++ -++++_causal_mask_cache = {} -++++ -++++def get_cached_causal_mask_with_cache_position( -++++ attention_mask: mindspore.Tensor, -++++ sequence_length: int, -++++ target_length: int, -++++ dtype: mindspore.dtype, -++++ min_dtype: float, -++++ cache_position: mindspore.Tensor, -++++ batch_size: int, -++++): -++++ """ -++++ 带缓存的 causal mask 构造函数 -++++ """ -++++ # q_len 是当前 query 长度 -++++ q_len = sequence_length -++++ # kv_len 是 target_length -++++ kv_len = target_length -++++ -++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -++++ -++++ if key in _causal_mask_cache: -++++ return _causal_mask_cache[key] -++++ -++++ # 调用原来的 mask 构造逻辑 -++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++ attention_mask, -++++ sequence_length=sequence_length, -++++ target_length=target_length, -++++ dtype=dtype, -++++ min_dtype=min_dtype, -++++ cache_position=cache_position, -++++ batch_size=batch_size, -++++ ) -++++ # 缓存结果 -++++ _causal_mask_cache[key] = causal_mask -++++ return causal_mask -+++ -+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+++ def _prepare_4d_causal_attention_mask_with_cache_position( -+++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++ -+++ -+++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -++++# class Qwen2MoeAttention(nn.Module): -++++# """ -++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++++# and "Generating Long Sequences with Sparse Transformers". -++++# """ -++++ -++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++# super().__init__() -++++# self.config = config -++++# self.layer_idx = layer_idx -++++# if layer_idx is None: -++++# logger.warning_once( -++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++# "when creating this class." -++++# ) -++++ -++++# self.hidden_size = config.hidden_size -++++# self.num_heads = config.num_attention_heads -++++# self.head_dim = self.hidden_size // self.num_heads -++++# self.num_key_value_heads = config.num_key_value_heads -++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++# self.max_position_embeddings = config.max_position_embeddings -++++# self.rope_theta = config.rope_theta -++++# self.is_causal = True -++++# self.attention_dropout = config.attention_dropout -++++ -++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++# raise ValueError( -++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++# f" and `num_heads`: {self.num_heads})." -++++# ) -++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++ -++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++# self.head_dim, -++++# max_position_embeddings=self.max_position_embeddings, -++++# base=self.rope_theta, -++++# ) -++++ -++++# def forward( -++++# self, -++++# hidden_states: mindspore.Tensor, -++++# attention_mask: Optional[mindspore.Tensor] = None, -++++# position_ids: Optional[mindspore.Tensor] = None, -++++# past_key_value: Optional[Cache] = None, -++++# output_attentions: bool = False, -++++# use_cache: bool = False, -++++# cache_position: Optional[mindspore.Tensor] = None, -++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++ -++++ -++++ -++++# bsz, q_len, _ = hidden_states.shape -++++ -++++# query_states = self.q_proj(hidden_states) -++++# key_states = self.k_proj(hidden_states) -++++# value_states = self.v_proj(hidden_states) -++++ -++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++ -++++# kv_seq_len = key_states.shape[-2] -++++# if past_key_value is not None: -++++# if self.layer_idx is None: -++++# raise ValueError( -++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++# "with a layer index." -++++# ) -++++# if isinstance(past_key_value, StaticCache): -++++# kv_seq_len = key_states.shape[-2] -++++# else: -++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++# if past_key_value is not None: -++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++# if isinstance(past_key_value, StaticCache): -++++# kv_seq_len = key_states.shape[-2] -++++ -++++# # repeat k/v heads if n_kv_heads < n_heads -++++# key_states = repeat_kv(key_states, self.num_key_value_groups) -++++# value_states = repeat_kv(value_states, self.num_key_value_groups) -++++ -++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++ -++++# if attention_mask is not None: -++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++# attn_weights = attn_weights + causal_mask -++++ -++++# # upcast attention to fp32 -++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++# attn_output = ops.matmul(attn_weights, value_states) -++++ -++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++# raise ValueError( -++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++++# f" {attn_output.shape}" -++++# ) -++++ -++++# attn_output = ops.transpose(attn_output, 1, 2) -++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++ -++++# attn_output = self.o_proj(attn_output) -++++# # @lwx -++++ -++++# # max_seq_len = self.max_position_embeddings # 2048 -++++ -++++# # if attention_mask is not None: -++++# # # attention_mask: [B, 1, Sq, Sk] -++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++ -++++# # # pad 到 [max_seq_len, max_seq_len] -++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++# # global_attention_mask = padded_mask -++++# # else: -++++# # global_attention_mask = None -++++ -++++ -++++# # sparse_mode=3 -++++# # attn_output = mindspore.ops.flash_attention_score( -++++# # query=query_states, -++++# # key=key_states, -++++# # value=value_states, -++++# # real_shift=None, -++++# # padding_mask=None, -++++ -++++# # head_num=self.num_heads, -++++# # attn_mask=global_attention_mask, -++++# # keep_prob=1.0 - self.attention_dropout, -++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++++# # input_layout="BNSD", -++++# # pre_tokens=2147483647, -++++# # next_tokens=2147483647, -++++# # inner_precise=0, -++++# # drop_mask=None, -++++# # prefix=None, -++++# # actual_seq_qlen=None, -++++# # actual_seq_kvlen=None, -++++# # sparse_mode=sparse_mode, -++++# # ) -++++# if not output_attentions: -++++# attn_weights = None -++++ -++++# return attn_output, attn_weights, past_key_value -++++ -+++ class Qwen2MoeAttention(nn.Module): -+++ """ -+++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+++- and "Generating Long Sequences with Sparse Transformers". -+++- """ -++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -+++ -++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -++++ -++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -++++ """ -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++ super().__init__() -+++ self.config = config -+++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -+++ if layer_idx is None: -+++ logger.warning_once( -+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++ "when creating this class." -+++ ) -+++ -+++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -+++ use_cache: bool = False, -+++ cache_position: Optional[mindspore.Tensor] = None, -+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++- -+++ -+++- -++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -+++ bsz, q_len, _ = hidden_states.shape -+++ -+++ query_states = self.q_proj(hidden_states) -+++ key_states = self.k_proj(hidden_states) -+++ value_states = self.v_proj(hidden_states) -+++ -+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++- -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ -+++ kv_seq_len = key_states.shape[-2] -+++ if past_key_value is not None: -+++- if self.layer_idx is None: -+++- raise ValueError( -+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++- "with a layer index." -+++- ) -+++- if isinstance(past_key_value, StaticCache): -+++- kv_seq_len = key_states.shape[-2] -+++- else: -+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++ -+++ if past_key_value is not None: -+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++ -++++ # --- 2. 动态调度核心注意力计算 --- -++++ global Long_Prompt -++++ if Long_Prompt >= 1: -++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -++++ fa_attention_mask = None -++++ if attention_mask is not None: -++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++ fa_attention_mask = (mask_slice != 0) -++++ -++++ attn_output = mindspore.ops.flash_attention_score( -++++ query=query_states, -++++ key=key_states, -++++ value=value_states, -++++ head_num=self.num_heads, -++++ attn_mask=fa_attention_mask, -++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++ input_layout="BNSD", -++++ sparse_mode=0, -++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -++++ ) -+++ -+++- if isinstance(past_key_value, StaticCache): -+++- kv_seq_len = key_states.shape[-2] -++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ attn_output = self.o_proj(attn_output) -++++ attn_weights = None -++++ if output_attentions: -++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+++ -+++- # repeat k/v heads if n_kv_heads < n_heads -+++- key_states = repeat_kv(key_states, self.num_key_value_groups) -+++- value_states = repeat_kv(value_states, self.num_key_value_groups) -+++- -+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++ else: -++++ # --- Eager Attention 路径 (用于短序列和解码) --- -++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++ -++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++ -+++- if attention_mask is not None: -+++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++- attn_weights = attn_weights + causal_mask -++++ if attention_mask is not None: -++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++ attn_weights = attn_weights + causal_mask -+++ -+++- # upcast attention to fp32 -+++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++- attn_output = ops.matmul(attn_weights, value_states) -++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++ attn_output = ops.matmul(attn_weights, value_states) -+++ -+++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++- raise ValueError( -+++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+++- f" {attn_output.shape}" -+++- ) -++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++ raise ValueError( -++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -++++ ) -+++ -+++- attn_output = ops.transpose(attn_output, 1, 2) -+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++ attn_output = ops.transpose(attn_output, 1, 2) -++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++ attn_output = self.o_proj(attn_output) -+++ -+++- attn_output = self.o_proj(attn_output) -+++- # @lwx -++++ if not output_attentions: -++++ attn_weights = None -+++ -+++- # max_seq_len = self.max_position_embeddings # 2048 -+++- -+++- # if attention_mask is not None: -+++- # # attention_mask: [B, 1, Sq, Sk] -+++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++- -+++- # # pad 到 [max_seq_len, max_seq_len] -+++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++- # global_attention_mask = padded_mask -+++- # else: -+++- # global_attention_mask = None -+++- -+++- -+++- # sparse_mode=3 -+++- # attn_output = mindspore.ops.flash_attention_score( -+++- # query=query_states, -+++- # key=key_states, -+++- # value=value_states, -+++- # real_shift=None, -+++- # padding_mask=None, -+++- -+++- # head_num=self.num_heads, -+++- # attn_mask=global_attention_mask, -+++- # keep_prob=1.0 - self.attention_dropout, -+++- # scalar_value=1.0 / math.sqrt(self.head_dim), -+++- # input_layout="BNSD", -+++- # pre_tokens=2147483647, -+++- # next_tokens=2147483647, -+++- # inner_precise=0, -+++- # drop_mask=None, -+++- # prefix=None, -+++- # actual_seq_qlen=None, -+++- # actual_seq_kvlen=None, -+++- # sparse_mode=sparse_mode, -+++- # ) -+++- if not output_attentions: -+++- attn_weights = None -+++- -+++ return attn_output, attn_weights, past_key_value -+++ -+++- -+++ # class Qwen2MoeFlashAttention(nn.Module): -+++ # """ -+++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -+++ # return final_hidden_states, router_logits -+++ -+++ -+++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++-# """ -+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+++-# `_moe_infer_prefill` (用于长序列处理) 方法。 -+++-# """ -+++-# def __init__(self, config: Qwen2MoeConfig): -+++-# super().__init__() -+++-# self.num_experts = config.num_experts -+++-# self.top_k = config.num_experts_per_tok -+++-# self.norm_topk_prob = config.norm_topk_prob -+++- -+++-# # 门控网络 -+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++-# # 专家列表 -+++-# self.experts = nn.ModuleList( -+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++-# ) -+++-# # 共享专家 -+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-# @no_grad() -+++-# def _moe_infer_decode( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# """ -+++-# 【解码路径】针对 sequence_length=1 的极致优化。 -+++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+++-# """ -+++-# batch_size, hidden_dim = hidden_states.shape -+++- -+++-# expert_outputs_list = [ -+++-# ops.cat([ -+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++-# ], dim=0) -+++-# for i in range(batch_size) -+++-# ] -+++- -+++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+++-# # shape: (batch_size, top_k, hidden_dim) -+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++- -+++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++- -+++-# return moe_output.squeeze(1) -+++- -+++-# @no_grad() -+++-# def _moe_infer_prefill( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# """ -+++-# 【预填充路径】针对 sequence_length > 1 的优化。 -+++-# 按专家对 Token 进行分组,并进行批处理。 -+++-# """ -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens = hidden_states.shape[0] -+++-# flat_selected_experts = selected_experts.flatten() -+++- -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++- -+++-# active_experts = ops.unique(flat_selected_experts) -+++- -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++- -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++-# selected_token_indices = token_indices[mask] -+++-# selected_routing_weights = routing_weights.flatten()[mask] -+++- -+++-# current_states = hidden_states[selected_token_indices] -+++- -+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++- -+++-# moe_output = moe_output.index_add( -+++-# dim=0, -+++-# index=selected_token_indices, -+++-# source=expert_output.to(hidden_states.dtype) -+++-# ) -+++-# return moe_output -+++- -+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-# """ -+++-# 顶层 forward 方法,作为智能分发器。 -+++-# """ -+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- -+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-# router_logits = self.gate(hidden_states_reshaped) -+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- -+++-# if self.norm_topk_prob: -+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- -+++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++-# moe_output = None -+++-# # 在推理时,根据序列长度选择最优路径 -+++-# if not self.training: -+++-# if sequence_length == 1: -+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++-# else: -+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++-# else: -+++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+++-# raise NotImplementedError("Training path is not implemented.") -+++- -+++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+++- -+++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+++- -+++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+++- -+++-# return final_hidden_states, router_logits -+++- -+++- -+++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++-# """ -+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+++-# """ -+++-# def __init__(self, config: Qwen2MoeConfig): -+++-# super().__init__() -+++-# self.num_experts = config.num_experts -+++-# self.top_k = config.num_experts_per_tok -+++-# self.norm_topk_prob = config.norm_topk_prob -+++- -+++-# # 门控网络 -+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++-# # 专家列表 -+++-# self.experts = nn.ModuleList( -+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++-# ) -+++-# # 共享专家 -+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-# @no_grad() -+++-# def _moe_infer_decode( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# batch_size, _ = hidden_states.shape -+++-# expert_outputs_list = [ -+++-# ops.cat([ -+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++-# ], dim=0) -+++-# for i in range(batch_size) -+++-# ] -+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++-# return moe_output.squeeze(1) -+++- -+++-# @no_grad() -+++-# def _moe_infer_prefill( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens = hidden_states.shape[0] -+++-# flat_selected_experts = selected_experts.flatten() -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++-# active_experts = ops.unique(flat_selected_experts) -+++- -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++-# selected_token_indices = token_indices[mask] -+++-# selected_routing_weights = routing_weights.flatten()[mask] -+++-# current_states = hidden_states[selected_token_indices] -+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++-# moe_output = moe_output.index_add( -+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++-# ) -+++-# return moe_output -+++- -+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-# """ -+++-# 顶层 forward 方法,作为智能分发器。 -+++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+++-# """ -+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- -+++-# # 1. 门控计算 (通用逻辑) -+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-# router_logits = self.gate(hidden_states_reshaped) -+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- -+++-# if self.norm_topk_prob: -+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- -+++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++-# # 2. 智能分发到最优 MoE 路径 -+++-# moe_output = None -+++-# if not self.training: -+++-# if sequence_length == 1: -+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++-# else: -+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++-# else: -+++-# raise NotImplementedError("Training path is not implemented.") -+++- -+++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++- -+++-# # 4. 合并 MoE 输出和共享专家输出 -+++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++- -+++-# # 5. 恢复原始形状并返回 -+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++- -+++-# return final_hidden_states, router_logits -+++- -+++-# prefill fastest -+++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++-# """ -+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+++-# """ -+++-# def __init__(self, config: Qwen2MoeConfig): -+++-# super().__init__() -+++-# self.num_experts = config.num_experts -+++-# self.top_k = config.num_experts_per_tok -+++-# self.norm_topk_prob = config.norm_topk_prob -+++- -+++-# # 门控网络 -+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++-# # 专家列表 -+++-# self.experts = nn.ModuleList( -+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++-# ) -+++-# # 共享专家 -+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-# @no_grad() -+++-# def _moe_infer_dispatch( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# """ -+++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+++-# """ -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens, _ = hidden_states.shape -+++- -+++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+++-# flat_selected_experts = selected_experts.flatten() -+++-# flat_routing_weights = routing_weights.flatten() -+++- -+++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++- -+++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+++-# active_experts = ops.unique(flat_selected_experts) -+++- -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++- -+++-# # 找到所有分配给该专家的 token -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++- -+++-# # 使用 mask 选取对应的 token 和权重 -+++-# current_token_indices = token_indices[mask] -+++-# current_routing_weights = flat_routing_weights[mask] -+++-# current_hidden_states = hidden_states[current_token_indices] -+++- -+++-# # 对这些 token 进行批处理 -+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++- -+++-# # 使用 index_add 将结果精确地加回到对应位置 -+++-# moe_output = moe_output.index_add( -+++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+++-# ) -+++-# return moe_output -+++- -+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-# """ -+++-# 顶层 forward 方法,作为智能分发器。 -+++-# """ -+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- -+++-# # 1. 门控计算 -+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-# router_logits = self.gate(hidden_states_reshaped) -+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- -+++-# if self.norm_topk_prob: -+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- -+++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++- -+++-# # 2. 调用统一的 MoE 计算内核 -+++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+++- -+++-# # 3. 统一处理共享专家 -+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++- -+++-# # 4. 合并输出 -+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++- -+++-# # 5. 恢复原始形状并返回 -+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++- -+++-# return final_hidden_states, router_logits -+++- -+++- -+++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++-# """ -+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++-# 【最终高性能与高精度版】: -+++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+++-# 3. 这样实现了速度和准确性的两全其美。 -+++-# """ -+++-# def __init__(self, config: Qwen2MoeConfig): -+++-# super().__init__() -+++-# self.num_experts = config.num_experts -+++-# self.top_k = config.num_experts_per_tok -+++-# self.norm_topk_prob = config.norm_topk_prob -+++- -+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++-# self.experts = nn.ModuleList( -+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++-# ) -+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-# @no_grad() -+++-# def _moe_infer_decode( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# """ -+++-# 【解码路径】极致优化版:bmm + 高精度累加。 -+++-# """ -+++-# original_dtype = hidden_states.dtype -+++-# batch_size, _ = hidden_states.shape -+++- -+++-# expert_outputs_list = [ -+++-# ops.cat([ -+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++-# ], dim=0) -+++-# for i in range(batch_size) -+++-# ] -+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++- -+++-# # 在 float32 下执行 bmm,得到高精度结果 -+++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++- -+++-# # 将高精度结果转换回原始数据类型 -+++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+++- -+++-# return moe_output -+++- -+++-# @no_grad() -+++-# def _moe_infer_prefill( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# selected_experts: mindspore.Tensor, -+++-# routing_weights: mindspore.Tensor -+++-# ) -> mindspore.Tensor: -+++-# """ -+++-# 【预填充路径】与原始实现一致,结果精确。 -+++-# """ -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens, _ = hidden_states.shape -+++-# flat_selected_experts = selected_experts.flatten() -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++-# active_experts = ops.unique(flat_selected_experts) -+++- -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++-# selected_token_indices = token_indices[mask] -+++-# selected_routing_weights = routing_weights.flatten()[mask] -+++-# current_states = hidden_states[selected_token_indices] -+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++-# moe_output = moe_output.index_add( -+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++-# ) -+++-# return moe_output -+++- -+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++- -+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-# router_logits = self.gate(hidden_states_reshaped) -+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++- -+++-# if self.norm_topk_prob: -+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- -+++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+++-# # 如果模型主体是 float16,后续再转换 -+++- -+++-# moe_output = None -+++-# if not self.training: -+++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+++-# # _moe_infer_decode 内部会处理好类型转换 -+++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+++-# if sequence_length == 1: -+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++-# else: -+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++-# else: -+++-# raise NotImplementedError("Training path is not implemented.") -+++- -+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++- -+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++- -+++-# return final_hidden_states, router_logits -+++- -+++- -+++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++-# """ -+++-# 【融合版】一个混合专家模块,内置两种推理策略, -+++-# 由外部全局变量 `Long_Prompt` 控制: -+++- -+++-# - if Long_Prompt is True: 【精度优先模式】 -+++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+++-# 适用于处理长序列,避免误差累积。 -+++- -+++-# - if Long_Prompt is False: 【速度优先模式】 -+++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+++-# 在解码阶段获得极致速度,同时保证结果高度准确。 -+++-# """ -+++-# def __init__(self, config: Qwen2MoeConfig): -+++-# super().__init__() -+++-# self.num_experts = config.num_experts -+++-# self.top_k = config.num_experts_per_tok -+++-# self.norm_topk_prob = config.norm_topk_prob -+++- -+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++-# self.experts = nn.ModuleList( -+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++-# ) -+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-# # --- 速度优先模式的辅助函数 --- -+++-# @no_grad() -+++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++-# original_dtype = hidden_states.dtype -+++-# batch_size, _ = hidden_states.shape -+++-# expert_outputs_list = [ -+++-# ops.cat([ -+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++-# ], dim=0) -+++-# for i in range(batch_size) -+++-# ] -+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++-# weights_fp32 = routing_weights.to(mindspore.float32) -+++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++-# return moe_output_fp32.squeeze(1).to(original_dtype) -+++- -+++-# @no_grad() -+++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens, _ = hidden_states.shape -+++-# flat_selected_experts = selected_experts.flatten() -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++-# active_experts = ops.unique(flat_selected_experts) -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++-# selected_token_indices = token_indices[mask] -+++-# selected_routing_weights = routing_weights.flatten()[mask] -+++-# current_states = hidden_states[selected_token_indices] -+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+++-# return moe_output -+++- -+++-# # --- 精度优先模式的辅助函数 --- -+++-# @no_grad() -+++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++-# moe_output = ops.zeros_like(hidden_states) -+++-# num_tokens, _ = hidden_states.shape -+++-# flat_selected_experts = selected_experts.flatten() -+++-# flat_routing_weights = routing_weights.flatten() -+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++-# active_experts = ops.unique(flat_selected_experts) -+++-# for expert_idx_tensor in active_experts: -+++-# expert_idx = expert_idx_tensor.item() -+++-# expert_layer = self.experts[expert_idx] -+++-# mask = (flat_selected_experts == expert_idx_tensor) -+++-# current_token_indices = token_indices[mask] -+++-# current_routing_weights = flat_routing_weights[mask] -+++-# current_hidden_states = hidden_states[current_token_indices] -+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++-# return moe_output -+++- -+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-# # 声明我们将要使用一个在模块外部定义的全局变量 -+++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+++-# global Long_Prompt -+++- -+++-# # 1. 门控计算 (所有模式通用) -+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-# router_logits = self.gate(hidden_states_reshaped) -+++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++-# if self.norm_topk_prob: -+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++- -+++-# moe_output = None -+++-# if not self.training: -+++-# # 根据 Long_Prompt 标志选择模式 -+++-# if Long_Prompt: -+++-# # --- 精度优先模式 --- -+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++-# else: -+++-# # --- 速度优先模式 --- -+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++-# if sequence_length == 1: -+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++-# else: -+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++-# else: -+++-# raise NotImplementedError("Training path is not implemented.") -+++- -+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++- -+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++- -+++-# return final_hidden_states, router_logits -+++- -+++ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ """ -+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++ return moe_output_fp32.squeeze(1).to(original_dtype) -+++ -++++ # @no_grad() -++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++ # num_tokens, _ = hidden_states.shape -++++ # flat_selected_experts = selected_experts.flatten() -++++ # sorted_expert_indices = flat_selected_experts.argsort() -++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++ # original_token_indices = sorted_expert_indices // self.top_k -++++ # moe_output = ops.zeros_like(hidden_states) -++++ # current_token_offset = 0 -++++ # for i in range(self.num_experts): -++++ # expert_token_count = tokens_per_expert[i] - current_token_offset -++++ # if expert_token_count == 0: -++++ # continue -++++ # end_offset = current_token_offset + expert_token_count -++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++ # expert_hidden_states = hidden_states[expert_original_token_indices] -++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++ # current_token_offset += expert_token_count -++++ # return moe_output -++++ -+++ @no_grad() -+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++- num_tokens, _ = hidden_states.shape -+++- flat_selected_experts = selected_experts.flatten() -+++- sorted_expert_indices = flat_selected_experts.argsort() -+++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++- original_token_indices = sorted_expert_indices // self.top_k -++++ """ -++++ 优化版 MoE prefill (速度优先模式): -++++ - 批量张量化处理同一个 expert 的所有 token -++++ - 跳过无 token 的专家 -++++ - 保持结果完全一致 -++++ """ -+++ moe_output = ops.zeros_like(hidden_states) -+++- current_token_offset = 0 -+++- for i in range(self.num_experts): -+++- expert_token_count = tokens_per_expert[i] - current_token_offset -+++- if expert_token_count == 0: -+++- continue -+++- end_offset = current_token_offset + expert_token_count -+++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++- expert_hidden_states = hidden_states[expert_original_token_indices] -+++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++- current_token_offset += expert_token_count -++++ -++++ flat_selected_experts = selected_experts.flatten() -++++ flat_routing_weights = routing_weights.flatten() -++++ -++++ idxs = flat_selected_experts.argsort() -++++ sorted_expert_indices = flat_selected_experts[idxs] -++++ sorted_token_indices = idxs // self.top_k -++++ -++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -++++ -++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++++ -++++ for expert_id in active_experts.tolist(): -++++ start = int(tokens_per_expert[:expert_id].sum().item()) -++++ end = start + int(tokens_per_expert[expert_id].item()) -++++ -++++ token_idx = sorted_token_indices[start:end] -++++ expert_tokens = hidden_states[token_idx] -++++ -++++ expert_out = self.experts[expert_id](expert_tokens) -++++ -++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -++++ -++++ moe_output = mindspore.mint.scatter_add( -++++ moe_output, -++++ 0, -++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -++++ scaled_out.to(hidden_states.dtype) -++++ ) -++++ -+++ return moe_output -+++ -++++ -+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+++ @no_grad() -+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++ -+++ moe_output = None -+++- if Long_Prompt: -+++- # --- 精度优先模式 (ACCURACY MODE) --- -+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ # if Long_Prompt==0: -++++ # # --- 精度优先模式 (ACCURACY MODE) --- -++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ # else: -++++ # # --- 速度优先模式 (SPEED MODE) --- -++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++ # if sequence_length == 1: -++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ # else: -++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ -++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++ if sequence_length == 1: -++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++ else: -+++- # --- 速度优先模式 (SPEED MODE) --- -+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++- if sequence_length == 1: -+++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++- else: -+++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++- -++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ -+++ -+++ # 3. 共享专家计算与合并 (所有模式通用) -+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++ -+++ return final_hidden_states, router_logits -+++ -++++ -+++ class Qwen2MoeDecoderLayer(nn.Module): -+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+++ super().__init__() -+++ self.hidden_size = config.hidden_size -+++ -+++- # if Long_Prompt: -+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++- # else: -++++ # if Long_Prompt == 2: -+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++ # else: -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ -+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++ -+++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++ ) -+++ -+++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -+++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++ # attention_mask, -++++ # sequence_length=sequence_length, -++++ # target_length=target_length, -++++ # dtype=dtype, -++++ # min_dtype=min_dtype, -++++ # cache_position=cache_position, -++++ # batch_size=input_tensor.shape[0], -++++ # ) -++++ #@dwj -++++ causal_mask = get_cached_causal_mask_with_cache_position( -+++ attention_mask, -+++ sequence_length=sequence_length, -+++ target_length=target_length, -+++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+++ """ -+++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -++++ _causal_mask_cache.clear() -+++ -+++ input_ids = kwargs.get("input_ids") -+++ if input_ids is None and args: -+++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ -+++ if input_ids is not None: -+++ prompt_length = input_ids.shape[1] -+++- -+++- if prompt_length > PROMPT_LENGTH_THRESHOLD: -+++- Long_Prompt = True -++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -++++ Long_Prompt = 2 -++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -++++ Long_Prompt = 0 -+++ else: -+++- Long_Prompt = False -++++ Long_Prompt = 1 -++++ -+++ -+++ return super().generate(*args, **kwargs) -+++ -+++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++ dtype = self.lm_head.weight.dtype -+++ min_dtype = float(ops.finfo(dtype).min) -+++ -+++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++ # attention_mask, -++++ # sequence_length=sequence_length, -++++ # target_length=past_key_values.get_max_length(), -++++ # dtype=dtype, -++++ # min_dtype=min_dtype, -++++ # cache_position=cache_position, -++++ # batch_size=batch_size, -++++ # ) -++++ -++++ #@dwj -++++ attention_mask = get_cached_causal_mask_with_cache_position( -+++ attention_mask, -+++ sequence_length=sequence_length, -+++ target_length=past_key_values.get_max_length(), -+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++deleted file mode 100644 -+++index 6dfb5b93..00000000 -+++--- a/patches/0001-20251104commit.patch -++++++ /dev/null -+++@@ -1,1272 +0,0 @@ -+++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++-From: Pinoeer-kingxi <13022943007@163.com> -+++-Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++-Subject: [PATCH] 20251104commit -+++- -+++---- -+++- mindnlp/transformers/cache_utils.py | 28 +- -+++- .../models/deepseek/modeling_deepseek.py | 149 ++- -+++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++- 3 files changed, 976 insertions(+), 87 deletions(-) -+++- -+++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++-index cadd2e04..02f8d4be 100644 -+++---- a/mindnlp/transformers/cache_utils.py -+++-+++ b/mindnlp/transformers/cache_utils.py -+++-@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++- # k_out[:, :, cache_position] = key_states -+++- # v_out[:, :, cache_position] = value_states -+++-- if ON_ORANGE_PI: -+++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++-- else: -+++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++-- -+++-+ # if ON_ORANGE_PI: -+++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++-+ # else: -+++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++-+ # 确保 cache_position 是 1D tensor 并且类型正确 -+++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++-+ if cache_position.ndim > 1: -+++-+ cache_position = cache_position.flatten() -+++-+ # 确保类型是 int32 或 int64(MindSpore 要求) -+++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++-+ cache_position = cache_position.int() -+++-+ -+++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++-+ k_out[:, :, cache_position] = key_states -+++-+ v_out[:, :, cache_position] = value_states -+++-+ -+++- return k_out, v_out -+++- -+++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++-index c695b944..d8303e45 100644 -+++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++- # Copied from transformers.models.llama.modeling_llama.rotate_half -+++- def rotate_half(x): -+++- """Rotates half the hidden dims of the input.""" -+++-- x1 = x[..., : x.shape[-1] // 2] -+++-- x2 = x[..., x.shape[-1] // 2 :] -+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++-+ # x1 = x[..., : x.shape[-1] // 2] -+++-+ # x2 = x[..., x.shape[-1] // 2 :] -+++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++- return ops.cat((-x2, x1), dim=-1) -+++- -+++- -+++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++- if self.training: -+++- raise NotImplementedError("Training is not supported yet.") -+++- else: -+++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++-- if self.config.n_shared_experts is not None: -+++-- y = y + self.shared_experts(identity) -+++-- return y -+++-+ # @lwx -+++-+ if orig_shape[1] == 1: -+++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++-+ y=y.view(*orig_shape) -+++-+ if self.config.n_shared_experts is not None: -+++-+ y = y + self.shared_experts(identity) -+++-+ return y -+++-+ else: -+++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++-+ if self.config.n_shared_experts is not None: -+++-+ y = y + self.shared_experts(identity) -+++-+ return y -+++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++-+ # if self.config.n_shared_experts is not None: -+++-+ # y = y + self.shared_experts(identity) -+++-+ # return y -+++-+ -+++-+ @no_grad() -+++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++-+ -+++-+ expert_cache = ops.zeros_like(x) -+++-+ for i in range(self.num_experts_per_tok): -+++-+ expert_id = flat_expert_indices[i].item() -+++-+ weight = flat_expert_weights[i].item() -+++-+ expert = self.experts[expert_id] -+++-+ expert_out = expert(x) -+++-+ expert_cache += expert_out * weight -+++-+ return expert_cache -+++- -+++- @no_grad() -+++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++-- # expert_cache = torch.zeros_like(x) -+++-- # idxs = flat_expert_indices.argsort() -+++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++-- # token_idxs = idxs // self.num_experts_per_tok -+++-- # for i, end_idx in enumerate(tokens_per_expert): -+++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++-- # if start_idx == end_idx: -+++-- # continue -+++-- # expert = self.experts[i] -+++-- # exp_token_idx = token_idxs[start_idx:end_idx] -+++-- # expert_tokens = x[exp_token_idx] -+++-- # expert_out = expert(expert_tokens) -+++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++-- # return expert_cache -+++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++- expert_cache = ops.zeros_like(x) -+++- idxs = flat_expert_indices.argsort() -+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++- token_idxs = idxs // self.num_experts_per_tok -+++-+ -+++- for i, end_idx in enumerate(tokens_per_expert): -+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++- if start_idx == end_idx: -+++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++- expert_out = expert(expert_tokens) -+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++-+ -+++- return expert_cache -+++-+ -+++-+ # @no_grad() -+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++-+ # # expert_cache = torch.zeros_like(x) -+++-+ # # idxs = flat_expert_indices.argsort() -+++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++-+ # # token_idxs = idxs // self.num_experts_per_tok -+++-+ # # for i, end_idx in enumerate(tokens_per_expert): -+++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++-+ # # if start_idx == end_idx: -+++-+ # # continue -+++-+ # # expert = self.experts[i] -+++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++-+ # # expert_tokens = x[exp_token_idx] -+++-+ # # expert_out = expert(expert_tokens) -+++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++-+ # # return expert_cache -+++-+ # expert_cache = ops.zeros_like(x) -+++-+ # idxs = flat_expert_indices.argsort() -+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++-+ # token_idxs = idxs // self.num_experts_per_tok -+++-+ -+++-+ # for i, end_idx in enumerate(tokens_per_expert): -+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++-+ # if start_idx == end_idx: -+++-+ # continue -+++-+ # expert = self.experts[i] -+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+++-+ # expert_tokens = x[exp_token_idx] -+++-+ # expert_out = expert(expert_tokens) -+++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++-+ -+++-+ # return expert_cache -+++-+ # @no_grad() -+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++-+ # expert_cache = ops.zeros_like(x) -+++-+ -+++-+ # # 排序保证顺序一致 -+++-+ # idxs = flat_expert_indices.argsort() -+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++-+ # token_idxs = idxs // self.num_experts_per_tok -+++-+ -+++-+ # # 找出有 token 的专家 -+++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++-+ -+++-+ # for i in active_experts.tolist(): -+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++-+ # end_idx = tokens_per_expert[i] -+++-+ # if start_idx == end_idx: # 没有 token -+++-+ # continue -+++-+ -+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+++-+ # expert_tokens = x[exp_token_idx] -+++-+ # expert_out = self.experts[i](expert_tokens) -+++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++-+ -+++-+ # expert_cache = mindspore.mint.scatter_add( -+++-+ # expert_cache, -+++-+ # 0, -+++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++-+ # expert_out -+++-+ # ) -+++-+ -+++-+ # return expert_cache -+++-+ -+++-+ -+++- -+++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++- # """ -+++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++- -+++- # Initialize weights and apply final processing -+++- self.post_init() -+++-+ self.warm_up = False -+++-+ -+++-+ def warmup_moe_model_deep(self): -+++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++-+ test_texts = [ -+++-+ "warmup short", -+++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++-+ ] -+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++-+ if tokenizer is None: -+++-+ from mindnlp.transformers import AutoTokenizer -+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++-+ self._warmup_tokenizer = tokenizer -+++-+ -+++-+ for text in test_texts: -+++-+ inputs = tokenizer(text, return_tensors="ms") -+++-+ with mindspore._no_grad(): -+++-+ _ = self(**inputs, use_cache=False) -+++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++- -+++- def get_input_embeddings(self): -+++- return self.model.embed_tokens -+++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++- ```""" -+++-+ if not self.warm_up: -+++-+ self.warm_up = True -+++-+ self.warmup_moe_model_deep() -+++-+ -+++- output_attentions = ( -+++- output_attentions -+++- if output_attentions is not None -+++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++-index 3cbf820e..d4c6b651 100644 -+++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++-@@ -18,7 +18,6 @@ -+++- # See the License for the specific language governing permissions and -+++- # limitations under the License. -+++- """MindSpore Qwen2MoE model.""" -+++-- -+++- import math -+++- from typing import List, Optional, Tuple, Union -+++- -+++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++- TokenClassifierOutput, -+++- ) -+++- from ...modeling_utils import PreTrainedModel -+++-+from ...generation import GenerationMixin -+++- from ....utils import logging -+++- from .configuration_qwen2_moe import Qwen2MoeConfig -+++- -+++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++- self.variance_epsilon = eps -+++- -+++- def forward(self, hidden_states): -+++-+ # @dwj -+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++-+ # @lwx -+++-+ # if not self.training : -+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++- input_dtype = hidden_states.dtype -+++- hidden_states = hidden_states.to(mindspore.float32) -+++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++-@@ -234,6 +239,8 @@ def rotate_half(x): -+++- """Rotates half the hidden dims of the input.""" -+++- x1 = x[..., : x.shape[-1] // 2] -+++- x2 = x[..., x.shape[-1] // 2 :] -+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++- return ops.cat((-x2, x1), dim=-1) -+++- -+++- -+++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++- self.config = config -+++- self.hidden_size = config.hidden_size -+++- self.intermediate_size = intermediate_size -+++-+ -+++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++- self.act_fn = ACT2FN[config.hidden_act] -+++- -+++- def forward(self, x): -+++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++-- -+++- -+++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++-+ # @lwx -+++-+ # gate_up_output = self.gate_up_proj(x) -+++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++-+ # return self.down_proj(swiglu_output) -+++-+ -+++-+ # def forward(self, x): -+++-+ # gate_proj_out = self.gate_proj(x) -+++-+ # up_proj_out = self.up_proj(x) -+++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++-+ # return self.down_proj(swiglu_out) -+++-+ -+++- # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++- """ -+++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++- use_cache: bool = False, -+++- cache_position: Optional[mindspore.Tensor] = None, -+++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++-+ -+++-+ -+++-+ -+++- bsz, q_len, _ = hidden_states.shape -+++- -+++- query_states = self.q_proj(hidden_states) -+++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++- "with a layer index." -+++- ) -+++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++-+ if isinstance(past_key_value, StaticCache): -+++-+ kv_seq_len = key_states.shape[-2] -+++-+ else: -+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++- -+++- if past_key_value is not None: -+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++-+ -+++-+ if isinstance(past_key_value, StaticCache): -+++-+ kv_seq_len = key_states.shape[-2] -+++- -+++- # repeat k/v heads if n_kv_heads < n_heads -+++- key_states = repeat_kv(key_states, self.num_key_value_groups) -+++- value_states = repeat_kv(value_states, self.num_key_value_groups) -+++-- -+++-+ -+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++- -+++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++-- raise ValueError( -+++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++-- f" {attn_weights.shape}" -+++-- ) -+++-- -+++-- if attention_mask is not None: # no matter the length, we just slice it -+++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++-+ if attention_mask is not None: -+++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++- attn_weights = attn_weights + causal_mask -+++- -+++- # upcast attention to fp32 -+++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++- -+++- attn_output = self.o_proj(attn_output) -+++-- -+++-+ # @lwx -+++-+ -+++-+ # max_seq_len = self.max_position_embeddings # 2048 -+++-+ -+++-+ # if attention_mask is not None: -+++-+ # # attention_mask: [B, 1, Sq, Sk] -+++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++-+ -+++-+ # # pad 到 [max_seq_len, max_seq_len] -+++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++-+ # global_attention_mask = padded_mask -+++-+ # else: -+++-+ # global_attention_mask = None -+++-+ -+++-+ -+++-+ # sparse_mode=3 -+++-+ # attn_output = mindspore.ops.flash_attention_score( -+++-+ # query=query_states, -+++-+ # key=key_states, -+++-+ # value=value_states, -+++-+ # real_shift=None, -+++-+ # padding_mask=None, -+++-+ -+++-+ # head_num=self.num_heads, -+++-+ # attn_mask=global_attention_mask, -+++-+ # keep_prob=1.0 - self.attention_dropout, -+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++-+ # input_layout="BNSD", -+++-+ # pre_tokens=2147483647, -+++-+ # next_tokens=2147483647, -+++-+ # inner_precise=0, -+++-+ # drop_mask=None, -+++-+ # prefix=None, -+++-+ # actual_seq_qlen=None, -+++-+ # actual_seq_kvlen=None, -+++-+ # sparse_mode=sparse_mode, -+++-+ # ) -+++- if not output_attentions: -+++- attn_weights = None -+++- -+++- return attn_output, attn_weights, past_key_value -+++- -+++- -+++-+class Qwen2MoeFlashAttention(nn.Module): -+++-+ """ -+++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++-+ -+++-+ 关键改动: -+++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++-+ 直接传入原始的 key 和 value 张量效率更高。 -+++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++-+ """ -+++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++-+ super().__init__() -+++-+ self.config = config -+++-+ self.layer_idx = layer_idx -+++-+ self.hidden_size = config.hidden_size -+++-+ self.num_heads = config.num_attention_heads -+++-+ self.head_dim = self.hidden_size // self.num_heads -+++-+ self.num_key_value_heads = config.num_key_value_heads -+++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++-+ self.max_position_embeddings = config.max_position_embeddings -+++-+ self.rope_theta = config.rope_theta -+++-+ self.attention_dropout = config.attention_dropout -+++-+ -+++-+ if (self.head_dim * self.num_heads) != self.hidden_size: -+++-+ raise ValueError( -+++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++-+ ) -+++-+ -+++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++-+ -+++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++-+ self.head_dim, -+++-+ max_position_embeddings=self.max_position_embeddings, -+++-+ base=self.rope_theta, -+++-+ ) -+++-+ -+++-+ def forward( -+++-+ self, -+++-+ hidden_states: mindspore.Tensor, -+++-+ attention_mask: Optional[mindspore.Tensor] = None, -+++-+ position_ids: Optional[mindspore.Tensor] = None, -+++-+ past_key_value: Optional[Cache] = None, -+++-+ output_attentions: bool = False, -+++-+ use_cache: bool = False, -+++-+ cache_position: Optional[mindspore.Tensor] = None, -+++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++-+ -+++-+ bsz, q_len, _ = hidden_states.shape -+++-+ -+++-+ # 1. 线性投射 Q, K, V -+++-+ query_states = self.q_proj(hidden_states) -+++-+ key_states = self.k_proj(hidden_states) -+++-+ value_states = self.v_proj(hidden_states) -+++-+ -+++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++-+ # query: [B, S, H*D] -> [B, N1, S, D] -+++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ -+++-+ # 3. RoPE 旋转位置编码 -+++-+ kv_seq_len = key_states.shape[-2] -+++-+ if past_key_value is not None: -+++-+ if self.layer_idx is None: -+++-+ raise ValueError( -+++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++-+ "with a layer index." -+++-+ ) -+++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++-+ if cache_position.shape[0] == 1: -+++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++-+ kv_seq_len = past_seen_tokens + 1 -+++-+ else: -+++-+ # prefill 阶段:cache_position 是范围,使用其长度 -+++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++-+ else: -+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++-+ -+++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++-+ -+++-+ # 4. KV 缓存更新 -+++-+ if past_key_value is not None: -+++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++-+ key_states, value_states = past_key_value.update( -+++-+ key_states, value_states, self.layer_idx, cache_kwargs -+++-+ ) -+++-+ -+++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++-+ if cache_position.shape[0] == 1: -+++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++-+ kv_seq_len = key_states.shape[-2] -+++-+ -+++-+ # 5. [重要] 准备 Attention Mask -+++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++-+ fa_attention_mask = None -+++-+ if attention_mask is not None: -+++-+ # 截取与当前key长度匹配的部分 -+++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++-+ fa_attention_mask = (mask_slice != 0) -+++-+ -+++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++-+ input_dtype = query_states.dtype -+++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++-+ query_states = query_states.to(mindspore.float16) -+++-+ key_states = key_states.to(mindspore.float16) -+++-+ value_states = value_states.to(mindspore.float16) -+++-+ -+++-+ # 6. [核心] 调用 flash_attention_score 算子 -+++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++-+ attn_output = mindspore.ops.flash_attention_score( -+++-+ query=query_states, -+++-+ key=key_states, -+++-+ value=value_states, -+++-+ head_num=self.num_heads, # 传入Q的头数(N1) -+++-+ attn_mask=fa_attention_mask, -+++-+ keep_prob=1.0 - self.attention_dropout, -+++-+ scalar_value=1.0 / math.sqrt(self.head_dim), -+++-+ input_layout="BNSD", -+++-+ sparse_mode=0 # 使用 defaultMask 模式 -+++-+ ) -+++-+ -+++-+ # 恢复原始数据类型 -+++-+ attn_output = attn_output.to(input_dtype) -+++-+ -+++-+ # 7. 调整输出形状 -+++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++-+ attn_output = self.o_proj(attn_output) -+++-+ -+++-+ # FlashAttention 算子不直接返回注意力权重矩阵 -+++-+ attn_weights = None -+++-+ if output_attentions: -+++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++-+ -+++-+ return attn_output, attn_weights, past_key_value -+++-+ -+++-+ # def forward( -+++-+ # self, -+++-+ # hidden_states: mindspore.Tensor, -+++-+ # attention_mask: Optional[mindspore.Tensor] = None, -+++-+ # position_ids: Optional[mindspore.Tensor] = None, -+++-+ # past_key_value: Optional[Cache] = None, -+++-+ # output_attentions: bool = False, -+++-+ # use_cache: bool = False, -+++-+ # cache_position: Optional[mindspore.Tensor] = None, -+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++-+ -+++-+ # bsz, q_len, _ = hidden_states.shape -+++-+ -+++-+ # # 1. 线性投射 Q, K, V -+++-+ # query_states = self.q_proj(hidden_states) -+++-+ # key_states = self.k_proj(hidden_states) -+++-+ # value_states = self.v_proj(hidden_states) -+++-+ -+++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ -+++-+ # # 3. RoPE 旋转位置编码 -+++-+ # kv_seq_len = key_states.shape[-2] -+++-+ # if past_key_value is not None: -+++-+ # if self.layer_idx is None: -+++-+ # raise ValueError( -+++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++-+ # "with a layer index." -+++-+ # ) -+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++-+ -+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++-+ -+++-+ # # 4. KV 缓存更新 -+++-+ # if past_key_value is not None: -+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++-+ # key_states, value_states = past_key_value.update( -+++-+ # key_states, value_states, self.layer_idx, cache_kwargs -+++-+ # ) -+++-+ -+++-+ # # 5. 准备 Attention Mask -+++-+ # fa_attention_mask = None -+++-+ # if attention_mask is not None: -+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++-+ # fa_attention_mask = (mask_slice != 0) -+++-+ -+++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++-+ # input_dtype = query_states.dtype -+++-+ -+++-+ # # 6. [核心] 调用 flash_attention_score 算子 -+++-+ # attn_output = mindspore.ops.flash_attention_score( -+++-+ # query=query_states, -+++-+ # key=key_states, -+++-+ # value=value_states, -+++-+ # head_num=self.num_heads, -+++-+ # attn_mask=fa_attention_mask, -+++-+ # keep_prob=1.0 - self.attention_dropout, -+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++-+ # input_layout="BNSD", -+++-+ # sparse_mode=0, -+++-+ # # <--- 修改点 2: 启用内部高精度计算 --- -+++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++-+ # inner_precise=1 -+++-+ # ) -+++-+ -+++-+ # # 恢复原始数据类型 -+++-+ # attn_output = attn_output.to(input_dtype) -+++-+ -+++-+ # # 7. 调整输出形状 -+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++-+ # attn_output = self.o_proj(attn_output) -+++-+ -+++-+ # attn_weights = None -+++-+ # if output_attentions: -+++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++-+ -+++-+ # return attn_output, attn_weights, past_key_value -+++-+ -+++-+ # def forward( -+++-+ # self, -+++-+ # hidden_states: mindspore.Tensor, -+++-+ # attention_mask: Optional[mindspore.Tensor] = None, -+++-+ # position_ids: Optional[mindspore.Tensor] = None, -+++-+ # past_key_value: Optional[Cache] = None, -+++-+ # output_attentions: bool = False, -+++-+ # use_cache: bool = False, -+++-+ # cache_position: Optional[mindspore.Tensor] = None, -+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++-+ -+++-+ # bsz, q_len, _ = hidden_states.shape -+++-+ -+++-+ # query_states = self.q_proj(hidden_states) -+++-+ # key_states = self.k_proj(hidden_states) -+++-+ # value_states = self.v_proj(hidden_states) -+++-+ -+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-+ -+++-+ # kv_seq_len = key_states.shape[-2] -+++-+ # if past_key_value is not None: -+++-+ # if self.layer_idx is None: -+++-+ # raise ValueError("`layer_idx` must be specified for caching") -+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++-+ -+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++-+ -+++-+ # if past_key_value is not None: -+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++-+ # key_states, value_states = past_key_value.update( -+++-+ # key_states, value_states, self.layer_idx, cache_kwargs -+++-+ # ) -+++-+ -+++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++-+ -+++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++-+ # query_states = query_states / math.sqrt(self.head_dim) -+++-+ # # <--- 修改结束 --- -+++-+ -+++-+ # fa_attention_mask = None -+++-+ # if attention_mask is not None: -+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++-+ # fa_attention_mask = (mask_slice != 0) -+++-+ -+++-+ # input_dtype = query_states.dtype -+++-+ -+++-+ # attn_output = mindspore.ops.flash_attention_score( -+++-+ # query=query_states, # 传入已经预先缩放过的 query -+++-+ # key=key_states, -+++-+ # value=value_states, -+++-+ # head_num=self.num_heads, -+++-+ # attn_mask=fa_attention_mask, -+++-+ # keep_prob=1.0 - self.attention_dropout, -+++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++-+ # input_layout="BNSD", -+++-+ # sparse_mode=0, -+++-+ # inner_precise=1 # 仍然保持内部高精度计算 -+++-+ # ) -+++-+ -+++-+ # attn_output = attn_output.to(input_dtype) -+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++-+ # attn_output = self.o_proj(attn_output) -+++-+ -+++-+ # attn_weights = None -+++-+ # if output_attentions: -+++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++-+ -+++-+ # return attn_output, attn_weights, past_key_value -+++-+ -+++- QWEN2MOE_ATTENTION_CLASSES = { -+++- "eager": Qwen2MoeAttention, -+++-+ "flash-attention": Qwen2MoeFlashAttention, -+++- } -+++- -+++- -+++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++- -+++-+ #@dwj -+++-+ # 只遍历激活的专家,而非全部专家 -+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++-- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++-- hidden_states = hidden_states.view(-1, hidden_dim) -+++-- # router_logits: (batch * sequence_length, n_experts) -+++-- router_logits = self.gate(hidden_states) -+++-- -+++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++-- if self.norm_topk_prob: -+++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++-- # we cast back to the input dtype -+++-- routing_weights = routing_weights.to(hidden_states.dtype) -+++-- -+++-- final_hidden_states = ops.zeros( -+++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++-- ) -+++-- -+++-- # One hot encode the selected experts to create an expert mask -+++-- # this will be used to easily index which expert is going to be sollicitated -+++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++-- -+++-- # Loop over all available experts in the model and perform the computation on each expert -+++-- for expert_idx in range(self.num_experts): -+++-- expert_layer = self.experts[expert_idx] -+++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++-- -+++-- # Index the correct hidden states and compute the expert hidden state for -+++-- # the current expert. We need to make sure to multiply the output hidden -+++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++-- if 0 not in idx.shape: -+++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++-- -+++-- # However `index_add_` only support torch tensors for indexing so we'll use -+++-- # the `top_x` tensor here. -+++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++-- -+++-- shared_expert_output = self.shared_expert(hidden_states) -+++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++-- -+++-- final_hidden_states = final_hidden_states + shared_expert_output -+++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++-+ num_tokens = hidden_states_reshaped.shape[0] -+++-+ -+++-+ router_logits = self.gate(hidden_states_reshaped) -+++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++-+ -+++-+ if self.norm_topk_prob: -+++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++-+ routing_weights = routing_weights.to(hidden_states.dtype) -+++-+ -+++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++-+ flat_selected_experts = selected_experts.flatten() -+++-+ -+++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++-+ token_indices = broadcasted_token_indices.flatten() -+++-+ -+++-+ active_experts = ops.unique(flat_selected_experts) -+++-+ -+++-+ for expert_idx_tensor in active_experts: -+++-+ expert_idx = expert_idx_tensor.item() -+++-+ expert_layer = self.experts[expert_idx] -+++-+ -+++-+ mask = (flat_selected_experts == expert_idx_tensor) -+++-+ selected_token_indices = token_indices[mask] -+++-+ selected_routing_weights = routing_weights.flatten()[mask] -+++-+ -+++-+ current_states = hidden_states_reshaped[selected_token_indices] -+++-+ -+++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++-+ -+++-+ final_hidden_states = final_hidden_states.index_add( -+++-+ dim=0, -+++-+ index=selected_token_indices, -+++-+ source=expert_output.to(hidden_states.dtype) -+++-+ ) -+++-+ -+++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++- -+++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++-- return final_hidden_states, router_logits -+++-+ final_hidden_states = final_hidden_states + shared_expert_output -+++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++-+ -+++-+ return final_hidden_states, router_logits -+++- -+++- -+++- class Qwen2MoeDecoderLayer(nn.Module): -+++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++- -+++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++- -+++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++-+ -+++- if (layer_idx not in config.mlp_only_layers) and ( -+++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++- ): -+++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++- _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++- _skip_keys_device_placement = "past_key_values" -+++- _supports_cache_class = True -+++-+#lwx -+++-+ # _supports_static_cache = True -+++- -+++- def _init_weights(self, module): -+++- std = self.config.initializer_range -+++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++- return causal_mask -+++- -+++- -+++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++- _tied_weights_keys = ["lm_head.weight"] -+++- -+++- def __init__(self, config): -+++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++- self.num_experts_per_tok = config.num_experts_per_tok -+++- # Initialize weights and apply final processing -+++- self.post_init() -+++-+ # @lwx -+++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++-+ # self.generation_config.cache_implementation = "static" -+++-+ self._warmed_up = False -+++-+ -+++-+ def warmup_moe_model(self): -+++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++-+ test_texts = [ -+++-+ "warmup short", -+++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++-+ ] -+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++-+ if tokenizer is None: -+++-+ from mindnlp.transformers import AutoTokenizer -+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++-+ self._warmup_tokenizer = tokenizer -+++-+ -+++-+ for text in test_texts: -+++-+ inputs = tokenizer(text, return_tensors="ms") -+++-+ with mindspore._no_grad(): -+++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++- -+++- def get_input_embeddings(self): -+++- return self.model.embed_tokens -+++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++- ```""" -+++-+ if not self._warmed_up: -+++-+ self._warmed_up = True -+++-+ self.warmup_moe_model() -+++- -+++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++- output_router_logits = ( -+++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++- } -+++- ) -+++- return model_inputs -+++-+# @lwx -+++-+ # def _decode_one_tokens_logits( -+++-+ # self, -+++-+ # cur_token: mindspore.Tensor, -+++-+ # input_pos: Optional[mindspore.Tensor], -+++-+ # cache_position: mindspore.Tensor, -+++-+ # past_key_values: StaticCache, -+++-+ # ) -> mindspore.Tensor: -+++-+ # """ -+++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++-+ -+++-+ # Args: -+++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++-+ # input_pos: 输入位置信息,可选 -+++-+ # cache_position: 当前token在cache中的位置,shape为(1,) -+++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++-+ -+++-+ # Returns: -+++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++-+ # """ -+++-+ # # 调用JIT编译的版本 -+++-+ # return self.get_decode_one_tokens_logits( -+++-+ # cur_token=cur_token, -+++-+ # input_pos=input_pos, -+++-+ # cache_position=cache_position, -+++-+ # past_key_values=past_key_values, -+++-+ # ) -+++-+ -+++-+ # @mindspore.jit(jit_level='O1') -+++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++-+ # """ -+++-+ # JIT编译的函数,用于高效的单token解码 -+++-+ # 使用JIT编译优化以支持静态shape和高效执行 -+++-+ -+++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++-+ # """ -+++-+ # outputs = self.model.forward( -+++-+ # input_ids=cur_token, -+++-+ # position_ids=input_pos, -+++-+ # cache_position=cache_position, -+++-+ # past_key_values=past_key_values, -+++-+ # use_cache=True, -+++-+ # return_dict=False, -+++-+ # ) -+++-+ -+++-+ # hidden_states = outputs[0] -+++-+ # logits = self.lm_head.forward(hidden_states) -+++-+ # logits = logits.float() -+++-+ -+++-+ # return logits[:, -1, :] -+++-+ -+++-+ # def _sample( -+++-+ # self, -+++-+ # input_ids: mindspore.Tensor, -+++-+ # logits_processor, -+++-+ # stopping_criteria, -+++-+ # generation_config, -+++-+ # synced_devices: bool, -+++-+ # streamer=None, -+++-+ # logits_warper=None, -+++-+ # **model_kwargs, -+++-+ # ): -+++-+ # """ -+++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++-+ # """ -+++-+ # from ...generation.logits_process import LogitsProcessorList -+++-+ # from ...generation.stopping_criteria import StoppingCriteriaList -+++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++-+ # from mindnlp.core import nn, ops, no_grad -+++-+ # import numpy as np -+++-+ -+++-+ # # 检查是否使用 StaticCache -+++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++-+ # # 否则,直接调用父类方法 -+++-+ # past_key_values = model_kwargs.get("past_key_values") -+++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++-+ -+++-+ # if not isinstance(past_key_values, StaticCache): -+++-+ # # 不使用 StaticCache,直接调用父类方法 -+++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++-+ # return super()._sample( -+++-+ # input_ids=input_ids, -+++-+ # logits_processor=logits_processor, -+++-+ # stopping_criteria=stopping_criteria, -+++-+ # generation_config=generation_config, -+++-+ # synced_devices=synced_devices, -+++-+ # streamer=streamer, -+++-+ # logits_warper=logits_warper, -+++-+ # **model_kwargs, -+++-+ # ) -+++-+ -+++-+ # # 使用 StaticCache,进入自定义循环 -+++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++-+ # pad_token_id = generation_config._pad_token_tensor -+++-+ # output_attentions = generation_config.output_attentions -+++-+ # output_hidden_states = generation_config.output_hidden_states -+++-+ # output_scores = generation_config.output_scores -+++-+ # output_logits = generation_config.output_logits -+++-+ # return_dict_in_generate = generation_config.return_dict_in_generate -+++-+ # max_length = generation_config.max_length -+++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++-+ # do_sample = generation_config.do_sample -+++-+ -+++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++-+ # raise ValueError( -+++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++-+ # f"{logits_warper})." -+++-+ # ) -+++-+ -+++-+ # # init attention / hidden states / scores tuples -+++-+ # scores = () if (return_dict_in_generate and output_scores) else None -+++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++-+ -+++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++-+ # encoder_hidden_states = ( -+++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++-+ # ) -+++-+ -+++-+ # # keep track of which sequences are already finished -+++-+ # batch_size, cur_len = input_ids.shape -+++-+ # this_peer_finished = False -+++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++-+ -+++-+ # time_record = [] -+++-+ # from ....utils.testing_utils import parse_flag_from_env -+++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++-+ -+++-+ # while self._has_unfinished_sequences( -+++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++-+ # ): -+++-+ # if _record_time: -+++-+ # import time as time_module -+++-+ # infer_start = time_module.time() -+++-+ -+++-+ # # prepare model inputs -+++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++-+ -+++-+ # # prepare variable output controls -+++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++-+ -+++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++-+ # cur_cache_position = model_inputs.get("cache_position") -+++-+ # cur_past_key_values = model_inputs.get("past_key_values") -+++-+ # cur_input_ids = model_inputs.get("input_ids") -+++-+ -+++-+ # if (isinstance(cur_past_key_values, StaticCache) and -+++-+ # cur_cache_position is not None and -+++-+ # len(cur_cache_position.shape) > 0 and -+++-+ # cur_cache_position.shape[0] == 1 and -+++-+ # cur_input_ids is not None and -+++-+ # cur_input_ids.shape[1] == 1): -+++-+ # # 使用 JIT 优化的单 token 解码 -+++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++-+ # if not hasattr(self, '_jit_used'): -+++-+ # self._jit_used = False -+++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++-+ -+++-+ # next_token_logits = self.get_decode_one_tokens_logits( -+++-+ # cur_token=cur_input_ids, -+++-+ # input_pos=model_inputs.get("position_ids"), -+++-+ # cache_position=cur_cache_position, -+++-+ # past_key_values=cur_past_key_values, -+++-+ # ) -+++-+ -+++-+ # # 标记已使用JIT(用于后续判断) -+++-+ # if not self._jit_used: -+++-+ # self._jit_used = True -+++-+ -+++-+ # # 构造兼容的输出对象 -+++-+ # class JitOptimizedOutput: -+++-+ # def __init__(self, logits, config): -+++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++-+ # self.config = config -+++-+ # # 对于 JIT 优化路径,这些属性通常不需要 -+++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++-+ # self.attentions = None if not config.is_encoder_decoder else None -+++-+ # self.cross_attentions = None -+++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++-+ # self.hidden_states = None if not config.is_encoder_decoder else None -+++-+ -+++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++-+ # else: -+++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++-+ # outputs = self(**model_inputs, return_dict=True) -+++-+ -+++-+ # if synced_devices and this_peer_finished: -+++-+ # continue -+++-+ -+++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++-+ # next_token_logits = outputs.logits[:, -1, :] -+++-+ -+++-+ # # pre-process distribution -+++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++-+ # if do_sample: -+++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++-+ -+++-+ # # Store scores, attentions and hidden_states when required -+++-+ # if return_dict_in_generate: -+++-+ # if output_scores: -+++-+ # scores += (next_token_scores,) -+++-+ # if output_logits: -+++-+ # raw_logits += (next_token_logits,) -+++-+ # if output_attentions: -+++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++-+ # decoder_attentions += (attn,) if attn is not None else (None,) -+++-+ # if self.config.is_encoder_decoder: -+++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++-+ -+++-+ # if output_hidden_states: -+++-+ # hidden = ( -+++-+ # outputs.decoder_hidden_states -+++-+ # if self.config.is_encoder_decoder -+++-+ # else outputs.hidden_states -+++-+ # ) -+++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++-+ -+++-+ # # token selection -+++-+ # if do_sample: -+++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++-+ # else: -+++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++-+ -+++-+ # # finished sentences should have their next token be a padding token -+++-+ # if has_eos_stopping_criteria: -+++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++-+ -+++-+ # # update generated ids, model inputs, and length for next step -+++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++-+ # if streamer is not None: -+++-+ # streamer.put(next_tokens) -+++-+ -+++-+ # model_kwargs = self._update_model_kwargs_for_generation( -+++-+ # outputs, -+++-+ # model_kwargs, -+++-+ # is_encoder_decoder=self.config.is_encoder_decoder, -+++-+ # ) -+++-+ -+++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++-+ # cur_len += 1 -+++-+ -+++-+ # if _record_time: -+++-+ # import time as time_module -+++-+ # infer_stop = time_module.time() -+++-+ # time_record.append(infer_stop - infer_start) -+++-+ -+++-+ # del outputs -+++-+ -+++-+ # average_infer_time = None -+++-+ # if time_record: -+++-+ # if len(time_record) > 1: -+++-+ # time_record.pop(0) -+++-+ # average_infer_time = sum(time_record) / len(time_record) -+++-+ # print(f'average inference time is: {average_infer_time}') -+++-+ # print(f'inference time record: {time_record}') -+++-+ -+++-+ # if streamer is not None: -+++-+ # streamer.end() -+++-+ -+++-+ # # 简单判断:打印是否使用了JIT路径 -+++-+ # if hasattr(self, '_jit_used') and self._jit_used: -+++-+ # print("[JIT] ✓ JIT optimization was used during generation") -+++-+ # else: -+++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++-+ -+++-+ # if return_dict_in_generate: -+++-+ # if self.config.is_encoder_decoder: -+++-+ # return GenerateEncoderDecoderOutput( -+++-+ # sequences=input_ids, -+++-+ # scores=scores, -+++-+ # logits=raw_logits, -+++-+ # encoder_attentions=encoder_attentions, -+++-+ # encoder_hidden_states=encoder_hidden_states, -+++-+ # decoder_attentions=decoder_attentions, -+++-+ # cross_attentions=cross_attentions, -+++-+ # decoder_hidden_states=decoder_hidden_states, -+++-+ # past_key_values=model_kwargs.get("past_key_values"), -+++-+ # average_infer_time=average_infer_time -+++-+ # ) -+++-+ # else: -+++-+ # return GenerateDecoderOnlyOutput( -+++-+ # sequences=input_ids, -+++-+ # scores=scores, -+++-+ # logits=raw_logits, -+++-+ # attentions=decoder_attentions, -+++-+ # hidden_states=decoder_hidden_states, -+++-+ # past_key_values=model_kwargs.get("past_key_values"), -+++-+ # average_infer_time=average_infer_time -+++-+ # ) -+++-+ # else: -+++-+ # return input_ids -+++-+ -+++-+ # def _prepare_cache_for_generation( -+++-+ # self, -+++-+ # generation_config, -+++-+ # model_kwargs, -+++-+ # assistant_model, -+++-+ # batch_size, -+++-+ # max_cache_length, -+++-+ # ): -+++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++-+ # generation_config.cache_implementation = "static" -+++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++-+ -+++-+ # if generation_config.cache_implementation == "static": -+++-+ # base_required_from_max_length = generation_config.max_length + 1 -+++-+ # base_required = max(max_cache_length, base_required_from_max_length) -+++-+ # min_cache_size = 50 -+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++-+ # else: -+++-+ # max_cache_length = max(base_required, min_cache_size) -+++-+ -+++-+ # original_max_cache_length = max_cache_length -+++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++-+ # print(f" - input max_cache_length: {original_max_cache_length}") -+++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++-+ # print(f" - final max_cache_length: {max_cache_length}") -+++-+ -+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++-+ # if max_cache_length > self.config.max_position_embeddings: -+++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++-+ -+++-+ # result = super()._prepare_cache_for_generation( -+++-+ # generation_config=generation_config, -+++-+ # model_kwargs=model_kwargs, -+++-+ # assistant_model=assistant_model, -+++-+ # batch_size=batch_size, -+++-+ # max_cache_length=max_cache_length, -+++-+ # ) -+++-+ -+++-+ # if generation_config.cache_implementation == "static": -+++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++-+ # created_cache = model_kwargs.get(cache_name) -+++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++-+ # if created_cache.max_cache_len < generation_config.max_length: -+++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++-+ -+++-+ # return result -+++-+ -+++-+ -+++-+ -+++- -+++- -+++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++--- -+++-2.27.0 -+++- -+++-- -+++2.27.0 -+++ -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" deleted file mode 100644 index 695e3df9..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0007-20251107003commit.patch" +++ /dev/null @@ -1,8034 +0,0 @@ -From 2831c3ffbda41719e00e1cd83c3840bcb9dd79db Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Fri, 7 Nov 2025 12:12:51 +0800 -Subject: [PATCH 07/10] 20251107003commit - ---- - .../models/deepseek/modeling_deepseek.py | 2 +- - patches/0001-20251104commit.patch | 2 +- - patches/0002-20251106commit.patch | 2 +- - patches/0003-20261106secondcommit.patch | 2 +- - patches/0004-20251106change.patch | 2 +- - patches/0005-20251107001commit.patch | 2 +- - patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ - 7 files changed, 7937 insertions(+), 6 deletions(-) - create mode 100644 patches/0006-20251107002commit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index e7e1c053..ff631974 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): - # return expert_cache - - @no_grad() -- dwj -+ # dwj - def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): - # x 的 shape: (1, hidden_size) - # flat_expert_indices 的 shape: (num_experts_per_tok,) -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -index 2842180e..c9c8c5ee 100644 ---- a/patches/0001-20251104commit.patch -+++ b/patches/0001-20251104commit.patch -@@ -1,7 +1,7 @@ - From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/5] 20251104commit -+Subject: [PATCH 1/6] 20251104commit - - --- - mindnlp/transformers/cache_utils.py | 28 +- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -index c6cd8757..625656eb 100644 ---- a/patches/0002-20251106commit.patch -+++ b/patches/0002-20251106commit.patch -@@ -1,7 +1,7 @@ - From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/5] 20251106commit -+Subject: [PATCH 2/6] 20251106commit - - --- - .../models/deepseek/modeling_deepseek.py | 379 ++++- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -index 601960c9..dcb85080 100644 ---- a/patches/0003-20261106secondcommit.patch -+++ b/patches/0003-20261106secondcommit.patch -@@ -1,7 +1,7 @@ - From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/5] 20261106secondcommit -+Subject: [PATCH 3/6] 20261106secondcommit - - --- - .../models/deepseek/modeling_deepseek.py | 217 ++- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -index 8976f10b..bbed13cc 100644 ---- a/patches/0004-20251106change.patch -+++ b/patches/0004-20251106change.patch -@@ -1,7 +1,7 @@ - From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 15:48:09 +0800 --Subject: [PATCH 4/5] 20251106change -+Subject: [PATCH 4/6] 20251106change - - --- - .../models/deepseek/modeling_deepseek.py | 189 +- -diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -index 8d9032be..b2d1035c 100644 ---- a/patches/0005-20251107001commit.patch -+++ b/patches/0005-20251107001commit.patch -@@ -1,7 +1,7 @@ - From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 11:48:18 +0800 --Subject: [PATCH 5/5] 20251107001commit -+Subject: [PATCH 5/6] 20251107001commit - - --- - .../models/deepseek/modeling_deepseek.py | 91 +- -diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -new file mode 100644 -index 00000000..bffa134e ---- /dev/null -+++ b/patches/0006-20251107002commit.patch -@@ -0,0 +1,7931 @@ -+From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Fri, 7 Nov 2025 12:06:32 +0800 -+Subject: [PATCH 6/6] 20251107002commit -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 122 +- -+ patches/0001-20251104commit.patch | 2 +- -+ patches/0002-20251106commit.patch | 2 +- -+ patches/0003-20261106secondcommit.patch | 2 +- -+ patches/0004-20251106change.patch | 2 +- -+ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ -+ 6 files changed, 7773 insertions(+), 64 deletions(-) -+ create mode 100644 patches/0005-20251107001commit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index 8831e4b7..e7e1c053 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): -+ # expert_out = expert(x) -+ # expert_cache += expert_out * weight -+ # return expert_cache -+- -+- # @no_grad() -+- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+- # # x 的 shape: (1, hidden_size) -+- # # flat_expert_indices 的 shape: (num_experts_per_tok,) -+- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+- -+- # # 1. 收集所有需要的专家层 -+- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+- # selected_experts = [self.experts[i] for i in flat_expert_indices] -+- -+- # # 2. 并行计算所有专家的输出 -+- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+- # # ops.cat 会将它们堆叠成一个新的 Tensor -+- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+- -+- # # 3. 使用矩阵乘法进行加权求和 -+- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+- # # 最终结果 final_output 的 shape: (1, hidden_size) -+- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++ -++ @no_grad() -++ dwj -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ # x 的 shape: (1, hidden_size) -++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++ -++ # 1. 收集所有需要的专家层 -++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++ selected_experts = [self.experts[i] for i in flat_expert_indices] -++ -++ # 2. 并行计算所有专家的输出 -++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++ # ops.cat 会将它们堆叠成一个新的 Tensor -++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++ -++ # 3. 使用矩阵乘法进行加权求和 -++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++ # 最终结果 final_output 的 shape: (1, hidden_size) -++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+ -+- # return final_output -++ return final_output -+ -+ -+ # @no_grad() -+@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): -+ -+ return expert_cache -+ # 放置在 DeepseekMoE 类中 -+- @no_grad() -+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+- """ -+- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+- -+- Args: -+- x (Tensor): 输入张量, shape: (1, hidden_size) -+- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+- """ -+- top_k, _ = flat_expert_weights.shape -+- hidden_size = x.shape[-1] -+- -+- # 1. 将所有专家的权重堆叠起来 -+- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -++ # @no_grad() -++ # #lwx 20251107 -++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ # """ -++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -++ -++ # Args: -++ # x (Tensor): 输入张量, shape: (1, hidden_size) -++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -++ # """ -++ # top_k, _ = flat_expert_weights.shape -++ # hidden_size = x.shape[-1] -++ -++ # # 1. 将所有专家的权重堆叠起来 -++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+ -+- # 2. "收集" 所需的专家权重 -+- selected_gate_w = stacked_gate_w[flat_expert_indices] -+- selected_up_w = stacked_up_w[flat_expert_indices] -+- selected_down_w = stacked_down_w[flat_expert_indices] -++ # # 2. "收集" 所需的专家权重 -++ # selected_gate_w = stacked_gate_w[flat_expert_indices] -++ # selected_up_w = stacked_up_w[flat_expert_indices] -++ # selected_down_w = stacked_down_w[flat_expert_indices] -+ -+- # 3. 准备输入 -+- x_expanded = x.expand((top_k, 1, hidden_size)) -++ # # 3. 准备输入 -++ # x_expanded = x.expand((top_k, 1, hidden_size)) -+ -+- # 4. 并行计算 gate_proj 和 up_proj -+- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -++ # # 4. 并行计算 gate_proj 和 up_proj -++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+ -+- # 5. 计算中间状态 -+- intermediate_states = self.experts[0].act_fn(gate_out) * up_out -++ # # 5. 计算中间状态 -++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+ -+- # 6. 并行计算 down_proj -+- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+- # --- [FIX] --- -+- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+- # --- [FIX END] --- -++ # # 6. 并行计算 down_proj -++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -++ # # --- [FIX] --- -++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -++ # # --- [FIX END] --- -+ -+- # 7. 根据路由权重进行加权求和 -+- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -++ # # 7. 根据路由权重进行加权求和 -++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+ -+- return weighted_sum -++ # return weighted_sum -+ -+ -+ -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+index 0a0ef2d7..2842180e 100644 -+--- a/patches/0001-20251104commit.patch -++++ b/patches/0001-20251104commit.patch -+@@ -1,7 +1,7 @@ -+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+-Subject: [PATCH 1/4] 20251104commit -++Subject: [PATCH 1/5] 20251104commit -+ -+ --- -+ mindnlp/transformers/cache_utils.py | 28 +- -+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+index 5185270c..c6cd8757 100644 -+--- a/patches/0002-20251106commit.patch -++++ b/patches/0002-20251106commit.patch -+@@ -1,7 +1,7 @@ -+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+-Subject: [PATCH 2/4] 20251106commit -++Subject: [PATCH 2/5] 20251106commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+index 3e05f821..601960c9 100644 -+--- a/patches/0003-20261106secondcommit.patch -++++ b/patches/0003-20261106secondcommit.patch -+@@ -1,7 +1,7 @@ -+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+-Subject: [PATCH 3/4] 20261106secondcommit -++Subject: [PATCH 3/5] 20261106secondcommit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 217 ++- -+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+index 88a1aef4..8976f10b 100644 -+--- a/patches/0004-20251106change.patch -++++ b/patches/0004-20251106change.patch -+@@ -1,7 +1,7 @@ -+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 15:48:09 +0800 -+-Subject: [PATCH 4/4] 20251106change -++Subject: [PATCH 4/5] 20251106change -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 189 +- -+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -+new file mode 100644 -+index 00000000..8d9032be -+--- /dev/null -++++ b/patches/0005-20251107001commit.patch -+@@ -0,0 +1,7707 @@ -++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Fri, 7 Nov 2025 11:48:18 +0800 -++Subject: [PATCH 5/5] 20251107001commit -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 91 +- -++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- -++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- -++ patches/0001-20251104commit.patch | 2 +- -++ patches/0002-20251106commit.patch | 2 +- -++ patches/0003-20261106secondcommit.patch | 2 +- -++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ -++ 7 files changed, 7577 insertions(+), 30 deletions(-) -++ create mode 100644 patches/0004-20251106change.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index 0546f318..8831e4b7 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): -++ # expert_cache += expert_out * weight -++ # return expert_cache -++ -++- @no_grad() -++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++- # x 的 shape: (1, hidden_size) -++- # flat_expert_indices 的 shape: (num_experts_per_tok,) -++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++- -++- # 1. 收集所有需要的专家层 -++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++- selected_experts = [self.experts[i] for i in flat_expert_indices] -++- -++- # 2. 并行计算所有专家的输出 -++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++- # ops.cat 会将它们堆叠成一个新的 Tensor -++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++- -++- # 3. 使用矩阵乘法进行加权求和 -++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++- # 最终结果 final_output 的 shape: (1, hidden_size) -++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++ # @no_grad() -+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ # # x 的 shape: (1, hidden_size) -+++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++ -+++ # # 1. 收集所有需要的专家层 -+++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++ # selected_experts = [self.experts[i] for i in flat_expert_indices] -+++ -+++ # # 2. 并行计算所有专家的输出 -+++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++ # # ops.cat 会将它们堆叠成一个新的 Tensor -+++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++ -+++ # # 3. 使用矩阵乘法进行加权求和 -+++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ # # 最终结果 final_output 的 shape: (1, hidden_size) -+++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++ -++- return final_output -+++ # return final_output -++ -++ -++ # @no_grad() -++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): -++ ) -++ -++ return expert_cache -+++# 放置在 DeepseekMoE 类中 -+++ @no_grad() -+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ """ -+++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+++ -+++ Args: -+++ x (Tensor): 输入张量, shape: (1, hidden_size) -+++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+++ """ -+++ top_k, _ = flat_expert_weights.shape -+++ hidden_size = x.shape[-1] -+++ -+++ # 1. 将所有专家的权重堆叠起来 -+++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+++ -+++ # 2. "收集" 所需的专家权重 -+++ selected_gate_w = stacked_gate_w[flat_expert_indices] -+++ selected_up_w = stacked_up_w[flat_expert_indices] -+++ selected_down_w = stacked_down_w[flat_expert_indices] -+++ -+++ # 3. 准备输入 -+++ x_expanded = x.expand((top_k, 1, hidden_size)) -+++ -+++ # 4. 并行计算 gate_proj 和 up_proj -+++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+++ -+++ # 5. 计算中间状态 -+++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+++ -+++ # 6. 并行计算 down_proj -+++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+++ # --- [FIX] --- -+++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+++ # --- [FIX END] --- -+++ -+++ # 7. 根据路由权重进行加权求和 -+++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+++ -+++ return weighted_sum -+++ -+++ -++ -++ # @no_grad() -++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++index ebd7782e..913a7609 100644 -++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): -++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++- x1 = x[..., : x.shape[-1] // 2] -++- x2 = x[..., x.shape[-1] // 2 :] -+++ # x1 = x[..., : x.shape[-1] // 2] -+++ # x2 = x[..., x.shape[-1] // 2 :] -++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++index d059dcbe..2b217b64 100644 -++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): -++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++ def rotate_half(x): -++ """Rotates half the hidden dims of the input.""" -++- x1 = x[..., : x.shape[-1] // 2] -++- x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++ # x1 = x[..., : x.shape[-1] // 2] -+++ # x2 = x[..., x.shape[-1] // 2 :] -+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++ return ops.cat((-x2, x1), dim=-1) -++ -++ -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++index 78f22642..0a0ef2d7 100644 -++--- a/patches/0001-20251104commit.patch -+++++ b/patches/0001-20251104commit.patch -++@@ -1,7 +1,7 @@ -++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -++-Subject: [PATCH 1/3] 20251104commit -+++Subject: [PATCH 1/4] 20251104commit -++ -++ --- -++ mindnlp/transformers/cache_utils.py | 28 +- -++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++index 22b65dd5..5185270c 100644 -++--- a/patches/0002-20251106commit.patch -+++++ b/patches/0002-20251106commit.patch -++@@ -1,7 +1,7 @@ -++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -++-Subject: [PATCH 2/3] 20251106commit -+++Subject: [PATCH 2/4] 20251106commit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++index 966529e4..3e05f821 100644 -++--- a/patches/0003-20261106secondcommit.patch -+++++ b/patches/0003-20261106secondcommit.patch -++@@ -1,7 +1,7 @@ -++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -++-Subject: [PATCH 3/3] 20261106secondcommit -+++Subject: [PATCH 3/4] 20261106secondcommit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -++new file mode 100644 -++index 00000000..88a1aef4 -++--- /dev/null -+++++ b/patches/0004-20251106change.patch -++@@ -0,0 +1,7498 @@ -+++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Thu, 6 Nov 2025 15:48:09 +0800 -+++Subject: [PATCH 4/4] 20251106change -+++ -+++--- -+++ .../models/deepseek/modeling_deepseek.py | 189 +- -+++ patches/0001-20251104commit.patch | 1272 +++++++ -+++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -+++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -+++ 4 files changed, 7244 insertions(+), 186 deletions(-) -+++ create mode 100644 patches/0001-20251104commit.patch -+++ create mode 100644 patches/0002-20251106commit.patch -+++ create mode 100644 patches/0003-20261106secondcommit.patch -+++ -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index 2f9192bf..0546f318 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -+++ -+++ return attn_output, attn_weights, past_key_value -+++ -+++-# class DeepseekFlashAttention(nn.Module): -+++-# """ -+++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+++- -+++-# This class is designed as a drop-in replacement for DeepseekAttention. -+++-# """ -+++- -+++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++-# super().__init__() -+++-# self.config = config -+++-# self.layer_idx = layer_idx -+++-# if layer_idx is None: -+++-# logger.warning( -+++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++-# "when creating this class." -+++-# ) -+++- -+++-# self.attention_dropout = config.attention_dropout -+++-# self.hidden_size = config.hidden_size -+++-# self.num_heads = config.num_attention_heads -+++-# self.head_dim = self.hidden_size // self.num_heads -+++-# self.num_key_value_heads = config.num_key_value_heads -+++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++-# self.max_position_embeddings = config.max_position_embeddings -+++-# self.rope_theta = config.rope_theta -+++-# self.is_causal = True -+++- -+++-# if (self.head_dim * self.num_heads) != self.hidden_size: -+++-# raise ValueError( -+++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++-# f" and `num_heads`: {self.num_heads})." -+++-# ) -+++- -+++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++-# self._init_rope() -+++- -+++-# def _init_rope(self): -+++-# if self.config.rope_scaling is None: -+++-# self.rotary_emb = DeepseekRotaryEmbedding( -+++-# self.head_dim, -+++-# max_position_embeddings=self.max_position_embeddings, -+++-# base=self.rope_theta, -+++-# ) -+++-# else: -+++-# scaling_type = self.config.rope_scaling["type"] -+++-# scaling_factor = self.config.rope_scaling["factor"] -+++-# if scaling_type == "linear": -+++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++-# self.head_dim, -+++-# max_position_embeddings=self.max_position_embeddings, -+++-# scaling_factor=scaling_factor, -+++-# base=self.rope_theta, -+++-# ) -+++-# elif scaling_type == "dynamic": -+++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++-# self.head_dim, -+++-# max_position_embeddings=self.max_position_embeddings, -+++-# scaling_factor=scaling_factor, -+++-# base=self.rope_theta, -+++-# ) -+++-# else: -+++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++- -+++-# def forward( -+++-# self, -+++-# hidden_states: mindspore.Tensor, -+++-# attention_mask: Optional[mindspore.Tensor] = None, -+++-# position_ids: Optional[mindspore.Tensor] = None, -+++-# past_key_value: Optional[Cache] = None, -+++-# output_attentions: bool = False, -+++-# use_cache: bool = False, -+++-# **kwargs, -+++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++-# if "padding_mask" in kwargs: -+++-# warnings.warn( -+++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++-# ) -+++- -+++-# if output_attentions: -+++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+++- -+++-# bsz, q_len, _ = hidden_states.shape -+++- -+++-# if self.config.pretraining_tp > 1: -+++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++- -+++-# query_states = self.q_proj(hidden_states) -+++-# key_states = self.k_proj(hidden_states) -+++-# value_states = self.v_proj(hidden_states) -+++- -+++-# # Reshape for multi-head attention -+++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++- -+++-# kv_seq_len = key_states.shape[-2] -+++-# if past_key_value is not None: -+++-# if self.layer_idx is None: -+++-# raise ValueError( -+++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++-# "with a layer index." -+++-# ) -+++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++- -+++-# # Apply Rotary Positional Embedding -+++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++- -+++-# if past_key_value is not None: -+++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++- -+++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++- -+++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++- -+++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++- -+++-# # Convert attention_mask for flash_attention_score -+++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+++-# if attention_mask is not None: -+++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++-# raise ValueError( -+++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++-# ) -+++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+++-# else: -+++-# attn_mask_for_fa = None -+++- -+++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++- -+++-# # Call the fused flash_attention_score operator -+++-# attn_output = mindspore.ops.flash_attention_score( -+++-# query=query_states_for_fa, -+++-# key=key_states_for_fa, -+++-# value=value_states_for_fa, -+++-# head_num=self.num_heads, # This is N1, the number of query heads -+++-# input_layout='BSH', -+++-# attn_mask=attn_mask_for_fa, -+++-# keep_prob=keep_prob, -+++-# scalar_value=1.0 / math.sqrt(self.head_dim), -+++-# sparse_mode=0 # Default mask mode -+++-# ) -+++- -+++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+++-# attn_output = self.o_proj(attn_output) -+++- -+++-# # Flash Attention does not return attention weights -+++-# attn_weights = None -+++- -+++-# return attn_output, attn_weights, past_key_value -+++ -+++ class DeepseekFlashAttention(nn.Module): -+++ """ -+++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -+++ super().__init__() -+++ self.hidden_size = config.hidden_size -+++ -+++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+++- config=config, layer_idx=layer_idx -+++- ) -++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -++++ # config=config, layer_idx=layer_idx -++++ # ) -+++ -+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+++ config=config, layer_idx=layer_idx -+++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -+++ return outputs -+++ -+++ -+++- -+++ class DeepseekPreTrainedModel(PreTrainedModel): -+++ config_class = DeepseekConfig -+++ base_model_prefix = "model" -+++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++ # Initialize weights and apply final processing -+++ self.post_init() -+++ self.warm_up = False -+++- #@dwj -+++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+++- self.num_layers, -+++- self.num_attention_heads, -+++- self.head_dim, -+++- batch_size=1, -+++- max_length=self.max_length, -+++- dtype=mindspore.float16 -+++- ) -+++- -+++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+++- key_cache = [] -+++- value_cache = [] -+++- for _ in range(num_layers): -+++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++- key_cache.append(k) -+++- value_cache.append(v) -+++- return key_cache, value_cache -+++- -+++ -+++ def warmup_moe_model_deep(self): -+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++new file mode 100644 -+++index 00000000..78f22642 -+++--- /dev/null -++++++ b/patches/0001-20251104commit.patch -+++@@ -0,0 +1,1272 @@ -++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++Subject: [PATCH 1/3] 20251104commit -++++ -++++--- -++++ mindnlp/transformers/cache_utils.py | 28 +- -++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++ 3 files changed, 976 insertions(+), 87 deletions(-) -++++ -++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++index cadd2e04..02f8d4be 100644 -++++--- a/mindnlp/transformers/cache_utils.py -+++++++ b/mindnlp/transformers/cache_utils.py -++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++ # k_out[:, :, cache_position] = key_states -++++ # v_out[:, :, cache_position] = value_states -++++- if ON_ORANGE_PI: -++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++- else: -++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++- -+++++ # if ON_ORANGE_PI: -+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++ # else: -+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++++ if cache_position.ndim > 1: -+++++ cache_position = cache_position.flatten() -+++++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++++ cache_position = cache_position.int() -+++++ -+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++++ k_out[:, :, cache_position] = key_states -+++++ v_out[:, :, cache_position] = value_states -+++++ -++++ return k_out, v_out -++++ -++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index c695b944..d8303e45 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++- x1 = x[..., : x.shape[-1] // 2] -++++- x2 = x[..., x.shape[-1] // 2 :] -+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++ # x1 = x[..., : x.shape[-1] // 2] -+++++ # x2 = x[..., x.shape[-1] // 2 :] -+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++ if self.training: -++++ raise NotImplementedError("Training is not supported yet.") -++++ else: -++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++- if self.config.n_shared_experts is not None: -++++- y = y + self.shared_experts(identity) -++++- return y -+++++ # @lwx -+++++ if orig_shape[1] == 1: -+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++++ y=y.view(*orig_shape) -+++++ if self.config.n_shared_experts is not None: -+++++ y = y + self.shared_experts(identity) -+++++ return y -+++++ else: -+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++++ if self.config.n_shared_experts is not None: -+++++ y = y + self.shared_experts(identity) -+++++ return y -+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++ # if self.config.n_shared_experts is not None: -+++++ # y = y + self.shared_experts(identity) -+++++ # return y -+++++ -+++++ @no_grad() -+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ -+++++ expert_cache = ops.zeros_like(x) -+++++ for i in range(self.num_experts_per_tok): -+++++ expert_id = flat_expert_indices[i].item() -+++++ weight = flat_expert_weights[i].item() -+++++ expert = self.experts[expert_id] -+++++ expert_out = expert(x) -+++++ expert_cache += expert_out * weight -+++++ return expert_cache -++++ -++++ @no_grad() -++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++- # expert_cache = torch.zeros_like(x) -++++- # idxs = flat_expert_indices.argsort() -++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++- # token_idxs = idxs // self.num_experts_per_tok -++++- # for i, end_idx in enumerate(tokens_per_expert): -++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++- # if start_idx == end_idx: -++++- # continue -++++- # expert = self.experts[i] -++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++- # expert_tokens = x[exp_token_idx] -++++- # expert_out = expert(expert_tokens) -++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++- # return expert_cache -+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++ expert_cache = ops.zeros_like(x) -++++ idxs = flat_expert_indices.argsort() -++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++ token_idxs = idxs // self.num_experts_per_tok -+++++ -++++ for i, end_idx in enumerate(tokens_per_expert): -++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++ if start_idx == end_idx: -++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++ expert_out = expert(expert_tokens) -++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -++++ return expert_cache -+++++ -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # # expert_cache = torch.zeros_like(x) -+++++ # # idxs = flat_expert_indices.argsort() -+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++ # # if start_idx == end_idx: -+++++ # # continue -+++++ # # expert = self.experts[i] -+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # # expert_tokens = x[exp_token_idx] -+++++ # # expert_out = expert(expert_tokens) -+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++ # # return expert_cache -+++++ # expert_cache = ops.zeros_like(x) -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # if start_idx == end_idx: -+++++ # continue -+++++ # expert = self.experts[i] -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = expert(expert_tokens) -+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -+++++ # return expert_cache -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # expert_cache = ops.zeros_like(x) -+++++ -+++++ # # 排序保证顺序一致 -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # # 找出有 token 的专家 -+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++ -+++++ # for i in active_experts.tolist(): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # end_idx = tokens_per_expert[i] -+++++ # if start_idx == end_idx: # 没有 token -+++++ # continue -+++++ -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = self.experts[i](expert_tokens) -+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++ -+++++ # expert_cache = mindspore.mint.scatter_add( -+++++ # expert_cache, -+++++ # 0, -+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++ # expert_out -+++++ # ) -+++++ -+++++ # return expert_cache -+++++ -+++++ -++++ -++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++ # """ -++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ -++++ # Initialize weights and apply final processing -++++ self.post_init() -+++++ self.warm_up = False -+++++ -+++++ def warmup_moe_model_deep(self): -+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++ test_texts = [ -+++++ "warmup short", -+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++++ ] -+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++ if tokenizer is None: -+++++ from mindnlp.transformers import AutoTokenizer -+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++ self._warmup_tokenizer = tokenizer -+++++ -+++++ for text in test_texts: -+++++ inputs = tokenizer(text, return_tensors="ms") -+++++ with mindspore._no_grad(): -+++++ _ = self(**inputs, use_cache=False) -+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++ -++++ def get_input_embeddings(self): -++++ return self.model.embed_tokens -++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++ ```""" -+++++ if not self.warm_up: -+++++ self.warm_up = True -+++++ self.warmup_moe_model_deep() -+++++ -++++ output_attentions = ( -++++ output_attentions -++++ if output_attentions is not None -++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++index 3cbf820e..d4c6b651 100644 -++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++@@ -18,7 +18,6 @@ -++++ # See the License for the specific language governing permissions and -++++ # limitations under the License. -++++ """MindSpore Qwen2MoE model.""" -++++- -++++ import math -++++ from typing import List, Optional, Tuple, Union -++++ -++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++ TokenClassifierOutput, -++++ ) -++++ from ...modeling_utils import PreTrainedModel -+++++from ...generation import GenerationMixin -++++ from ....utils import logging -++++ from .configuration_qwen2_moe import Qwen2MoeConfig -++++ -++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++ self.variance_epsilon = eps -++++ -++++ def forward(self, hidden_states): -+++++ # @dwj -+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++ # @lwx -+++++ # if not self.training : -+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++ input_dtype = hidden_states.dtype -++++ hidden_states = hidden_states.to(mindspore.float32) -++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++@@ -234,6 +239,8 @@ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++ x1 = x[..., : x.shape[-1] // 2] -++++ x2 = x[..., x.shape[-1] // 2 :] -+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++ self.config = config -++++ self.hidden_size = config.hidden_size -++++ self.intermediate_size = intermediate_size -+++++ -++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++ self.act_fn = ACT2FN[config.hidden_act] -++++ -++++ def forward(self, x): -++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++- -++++ -+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++ # @lwx -+++++ # gate_up_output = self.gate_up_proj(x) -+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++++ # return self.down_proj(swiglu_output) -+++++ -+++++ # def forward(self, x): -+++++ # gate_proj_out = self.gate_proj(x) -+++++ # up_proj_out = self.up_proj(x) -+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++++ # return self.down_proj(swiglu_out) -+++++ -++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++ """ -++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++ use_cache: bool = False, -++++ cache_position: Optional[mindspore.Tensor] = None, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ -+++++ -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ query_states = self.q_proj(hidden_states) -++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++ "with a layer index." -++++ ) -++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ if isinstance(past_key_value, StaticCache): -+++++ kv_seq_len = key_states.shape[-2] -+++++ else: -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++ if isinstance(past_key_value, StaticCache): -+++++ kv_seq_len = key_states.shape[-2] -++++ -++++ # repeat k/v heads if n_kv_heads < n_heads -++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++- -+++++ -++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++ -++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++- raise ValueError( -++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++- f" {attn_weights.shape}" -++++- ) -++++- -++++- if attention_mask is not None: # no matter the length, we just slice it -++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++++ if attention_mask is not None: -+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++ attn_weights = attn_weights + causal_mask -++++ -++++ # upcast attention to fp32 -++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++ -++++ attn_output = self.o_proj(attn_output) -++++- -+++++ # @lwx -+++++ -+++++ # max_seq_len = self.max_position_embeddings # 2048 -+++++ -+++++ # if attention_mask is not None: -+++++ # # attention_mask: [B, 1, Sq, Sk] -+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++ -+++++ # # pad 到 [max_seq_len, max_seq_len] -+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++ # global_attention_mask = padded_mask -+++++ # else: -+++++ # global_attention_mask = None -+++++ -+++++ -+++++ # sparse_mode=3 -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # real_shift=None, -+++++ # padding_mask=None, -+++++ -+++++ # head_num=self.num_heads, -+++++ # attn_mask=global_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ # input_layout="BNSD", -+++++ # pre_tokens=2147483647, -+++++ # next_tokens=2147483647, -+++++ # inner_precise=0, -+++++ # drop_mask=None, -+++++ # prefix=None, -+++++ # actual_seq_qlen=None, -+++++ # actual_seq_kvlen=None, -+++++ # sparse_mode=sparse_mode, -+++++ # ) -++++ if not output_attentions: -++++ attn_weights = None -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++ -+++++class Qwen2MoeFlashAttention(nn.Module): -+++++ """ -+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++ -+++++ 关键改动: -+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++ 直接传入原始的 key 和 value 张量效率更高。 -+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++ """ -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++ super().__init__() -+++++ self.config = config -+++++ self.layer_idx = layer_idx -+++++ self.hidden_size = config.hidden_size -+++++ self.num_heads = config.num_attention_heads -+++++ self.head_dim = self.hidden_size // self.num_heads -+++++ self.num_key_value_heads = config.num_key_value_heads -+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++ self.max_position_embeddings = config.max_position_embeddings -+++++ self.rope_theta = config.rope_theta -+++++ self.attention_dropout = config.attention_dropout -+++++ -+++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++ raise ValueError( -+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++ ) -+++++ -+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++ -+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++ self.head_dim, -+++++ max_position_embeddings=self.max_position_embeddings, -+++++ base=self.rope_theta, -+++++ ) -+++++ -+++++ def forward( -+++++ self, -+++++ hidden_states: mindspore.Tensor, -+++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++ position_ids: Optional[mindspore.Tensor] = None, -+++++ past_key_value: Optional[Cache] = None, -+++++ output_attentions: bool = False, -+++++ use_cache: bool = False, -+++++ cache_position: Optional[mindspore.Tensor] = None, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # 1. 线性投射 Q, K, V -+++++ query_states = self.q_proj(hidden_states) -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++ # query: [B, S, H*D] -> [B, N1, S, D] -+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # 3. RoPE 旋转位置编码 -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++ if self.layer_idx is None: -+++++ raise ValueError( -+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ "with a layer index." -+++++ ) -+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++ if cache_position.shape[0] == 1: -+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++ kv_seq_len = past_seen_tokens + 1 -+++++ else: -+++++ # prefill 阶段:cache_position 是范围,使用其长度 -+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++ else: -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # 4. KV 缓存更新 -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ key_states, value_states = past_key_value.update( -+++++ key_states, value_states, self.layer_idx, cache_kwargs -+++++ ) -+++++ -+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++ if cache_position.shape[0] == 1: -+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++ kv_seq_len = key_states.shape[-2] -+++++ -+++++ # 5. [重要] 准备 Attention Mask -+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++ fa_attention_mask = None -+++++ if attention_mask is not None: -+++++ # 截取与当前key长度匹配的部分 -+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++ fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++ input_dtype = query_states.dtype -+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++ query_states = query_states.to(mindspore.float16) -+++++ key_states = key_states.to(mindspore.float16) -+++++ value_states = value_states.to(mindspore.float16) -+++++ -+++++ # 6. [核心] 调用 flash_attention_score 算子 -+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++ attn_output = mindspore.ops.flash_attention_score( -+++++ query=query_states, -+++++ key=key_states, -+++++ value=value_states, -+++++ head_num=self.num_heads, # 传入Q的头数(N1) -+++++ attn_mask=fa_attention_mask, -+++++ keep_prob=1.0 - self.attention_dropout, -+++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ input_layout="BNSD", -+++++ sparse_mode=0 # 使用 defaultMask 模式 -+++++ ) -+++++ -+++++ # 恢复原始数据类型 -+++++ attn_output = attn_output.to(input_dtype) -+++++ -+++++ # 7. 调整输出形状 -+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = self.o_proj(attn_output) -+++++ -+++++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++++ attn_weights = None -+++++ if output_attentions: -+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ # def forward( -+++++ # self, -+++++ # hidden_states: mindspore.Tensor, -+++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++ # past_key_value: Optional[Cache] = None, -+++++ # output_attentions: bool = False, -+++++ # use_cache: bool = False, -+++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ # bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # # 1. 线性投射 Q, K, V -+++++ # query_states = self.q_proj(hidden_states) -+++++ # key_states = self.k_proj(hidden_states) -+++++ # value_states = self.v_proj(hidden_states) -+++++ -+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # # 3. RoPE 旋转位置编码 -+++++ # kv_seq_len = key_states.shape[-2] -+++++ # if past_key_value is not None: -+++++ # if self.layer_idx is None: -+++++ # raise ValueError( -+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ # "with a layer index." -+++++ # ) -+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # # 4. KV 缓存更新 -+++++ # if past_key_value is not None: -+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ # key_states, value_states = past_key_value.update( -+++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++ # ) -+++++ -+++++ # # 5. 准备 Attention Mask -+++++ # fa_attention_mask = None -+++++ # if attention_mask is not None: -+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++ # input_dtype = query_states.dtype -+++++ -+++++ # # 6. [核心] 调用 flash_attention_score 算子 -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # head_num=self.num_heads, -+++++ # attn_mask=fa_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ # input_layout="BNSD", -+++++ # sparse_mode=0, -+++++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++ # inner_precise=1 -+++++ # ) -+++++ -+++++ # # 恢复原始数据类型 -+++++ # attn_output = attn_output.to(input_dtype) -+++++ -+++++ # # 7. 调整输出形状 -+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ # attn_output = self.o_proj(attn_output) -+++++ -+++++ # attn_weights = None -+++++ # if output_attentions: -+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++ # return attn_output, attn_weights, past_key_value -+++++ -+++++ # def forward( -+++++ # self, -+++++ # hidden_states: mindspore.Tensor, -+++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++ # past_key_value: Optional[Cache] = None, -+++++ # output_attentions: bool = False, -+++++ # use_cache: bool = False, -+++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ # bsz, q_len, _ = hidden_states.shape -+++++ -+++++ # query_states = self.q_proj(hidden_states) -+++++ # key_states = self.k_proj(hidden_states) -+++++ # value_states = self.v_proj(hidden_states) -+++++ -+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ # kv_seq_len = key_states.shape[-2] -+++++ # if past_key_value is not None: -+++++ # if self.layer_idx is None: -+++++ # raise ValueError("`layer_idx` must be specified for caching") -+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ # if past_key_value is not None: -+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ # key_states, value_states = past_key_value.update( -+++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++ # ) -+++++ -+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++ -+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++ # query_states = query_states / math.sqrt(self.head_dim) -+++++ # # <--- 修改结束 --- -+++++ -+++++ # fa_attention_mask = None -+++++ # if attention_mask is not None: -+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ # fa_attention_mask = (mask_slice != 0) -+++++ -+++++ # input_dtype = query_states.dtype -+++++ -+++++ # attn_output = mindspore.ops.flash_attention_score( -+++++ # query=query_states, # 传入已经预先缩放过的 query -+++++ # key=key_states, -+++++ # value=value_states, -+++++ # head_num=self.num_heads, -+++++ # attn_mask=fa_attention_mask, -+++++ # keep_prob=1.0 - self.attention_dropout, -+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++ # input_layout="BNSD", -+++++ # sparse_mode=0, -+++++ # inner_precise=1 # 仍然保持内部高精度计算 -+++++ # ) -+++++ -+++++ # attn_output = attn_output.to(input_dtype) -+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ # attn_output = self.o_proj(attn_output) -+++++ -+++++ # attn_weights = None -+++++ # if output_attentions: -+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++ -+++++ # return attn_output, attn_weights, past_key_value -+++++ -++++ QWEN2MOE_ATTENTION_CLASSES = { -++++ "eager": Qwen2MoeAttention, -+++++ "flash-attention": Qwen2MoeFlashAttention, -++++ } -++++ -++++ -++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -+++++ #@dwj -+++++ # 只遍历激活的专家,而非全部专家 -++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- hidden_states = hidden_states.view(-1, hidden_dim) -++++- # router_logits: (batch * sequence_length, n_experts) -++++- router_logits = self.gate(hidden_states) -++++- -++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- if self.norm_topk_prob: -++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- # we cast back to the input dtype -++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++- final_hidden_states = ops.zeros( -++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++- ) -++++- -++++- # One hot encode the selected experts to create an expert mask -++++- # this will be used to easily index which expert is going to be sollicitated -++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++- -++++- # Loop over all available experts in the model and perform the computation on each expert -++++- for expert_idx in range(self.num_experts): -++++- expert_layer = self.experts[expert_idx] -++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++- -++++- # Index the correct hidden states and compute the expert hidden state for -++++- # the current expert. We need to make sure to multiply the output hidden -++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++- if 0 not in idx.shape: -++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++- -++++- # However `index_add_` only support torch tensors for indexing so we'll use -++++- # the `top_x` tensor here. -++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++- -++++- shared_expert_output = self.shared_expert(hidden_states) -++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++- -++++- final_hidden_states = final_hidden_states + shared_expert_output -+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++ num_tokens = hidden_states_reshaped.shape[0] -+++++ -+++++ router_logits = self.gate(hidden_states_reshaped) -+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++ if self.norm_topk_prob: -+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++ flat_selected_experts = selected_experts.flatten() -+++++ -+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++ token_indices = broadcasted_token_indices.flatten() -+++++ -+++++ active_experts = ops.unique(flat_selected_experts) -+++++ -+++++ for expert_idx_tensor in active_experts: -+++++ expert_idx = expert_idx_tensor.item() -+++++ expert_layer = self.experts[expert_idx] -+++++ -+++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++ selected_token_indices = token_indices[mask] -+++++ selected_routing_weights = routing_weights.flatten()[mask] -+++++ -+++++ current_states = hidden_states_reshaped[selected_token_indices] -+++++ -+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++ -+++++ final_hidden_states = final_hidden_states.index_add( -+++++ dim=0, -+++++ index=selected_token_indices, -+++++ source=expert_output.to(hidden_states.dtype) -+++++ ) -+++++ -+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++ -++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++- return final_hidden_states, router_logits -+++++ final_hidden_states = final_hidden_states + shared_expert_output -+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++ -+++++ return final_hidden_states, router_logits -++++ -++++ -++++ class Qwen2MoeDecoderLayer(nn.Module): -++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++ -++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++ -++++ if (layer_idx not in config.mlp_only_layers) and ( -++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++ ): -++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++ _skip_keys_device_placement = "past_key_values" -++++ _supports_cache_class = True -+++++#lwx -+++++ # _supports_static_cache = True -++++ -++++ def _init_weights(self, module): -++++ std = self.config.initializer_range -++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++ return causal_mask -++++ -++++ -++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ _tied_weights_keys = ["lm_head.weight"] -++++ -++++ def __init__(self, config): -++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ self.num_experts_per_tok = config.num_experts_per_tok -++++ # Initialize weights and apply final processing -++++ self.post_init() -+++++ # @lwx -+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++++ # self.generation_config.cache_implementation = "static" -+++++ self._warmed_up = False -+++++ -+++++ def warmup_moe_model(self): -+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++++ test_texts = [ -+++++ "warmup short", -+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++++ ] -+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++ if tokenizer is None: -+++++ from mindnlp.transformers import AutoTokenizer -+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++ self._warmup_tokenizer = tokenizer -+++++ -+++++ for text in test_texts: -+++++ inputs = tokenizer(text, return_tensors="ms") -+++++ with mindspore._no_grad(): -+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++ -++++ def get_input_embeddings(self): -++++ return self.model.embed_tokens -++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++ ```""" -+++++ if not self._warmed_up: -+++++ self._warmed_up = True -+++++ self.warmup_moe_model() -++++ -++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++ output_router_logits = ( -++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++ } -++++ ) -++++ return model_inputs -+++++# @lwx -+++++ # def _decode_one_tokens_logits( -+++++ # self, -+++++ # cur_token: mindspore.Tensor, -+++++ # input_pos: Optional[mindspore.Tensor], -+++++ # cache_position: mindspore.Tensor, -+++++ # past_key_values: StaticCache, -+++++ # ) -> mindspore.Tensor: -+++++ # """ -+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++++ -+++++ # Args: -+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++++ # input_pos: 输入位置信息,可选 -+++++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++++ -+++++ # Returns: -+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++++ # """ -+++++ # # 调用JIT编译的版本 -+++++ # return self.get_decode_one_tokens_logits( -+++++ # cur_token=cur_token, -+++++ # input_pos=input_pos, -+++++ # cache_position=cache_position, -+++++ # past_key_values=past_key_values, -+++++ # ) -+++++ -+++++ # @mindspore.jit(jit_level='O1') -+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++++ # """ -+++++ # JIT编译的函数,用于高效的单token解码 -+++++ # 使用JIT编译优化以支持静态shape和高效执行 -+++++ -+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++++ # """ -+++++ # outputs = self.model.forward( -+++++ # input_ids=cur_token, -+++++ # position_ids=input_pos, -+++++ # cache_position=cache_position, -+++++ # past_key_values=past_key_values, -+++++ # use_cache=True, -+++++ # return_dict=False, -+++++ # ) -+++++ -+++++ # hidden_states = outputs[0] -+++++ # logits = self.lm_head.forward(hidden_states) -+++++ # logits = logits.float() -+++++ -+++++ # return logits[:, -1, :] -+++++ -+++++ # def _sample( -+++++ # self, -+++++ # input_ids: mindspore.Tensor, -+++++ # logits_processor, -+++++ # stopping_criteria, -+++++ # generation_config, -+++++ # synced_devices: bool, -+++++ # streamer=None, -+++++ # logits_warper=None, -+++++ # **model_kwargs, -+++++ # ): -+++++ # """ -+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++++ # """ -+++++ # from ...generation.logits_process import LogitsProcessorList -+++++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++++ # from mindnlp.core import nn, ops, no_grad -+++++ # import numpy as np -+++++ -+++++ # # 检查是否使用 StaticCache -+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++++ # # 否则,直接调用父类方法 -+++++ # past_key_values = model_kwargs.get("past_key_values") -+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++++ -+++++ # if not isinstance(past_key_values, StaticCache): -+++++ # # 不使用 StaticCache,直接调用父类方法 -+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++++ # return super()._sample( -+++++ # input_ids=input_ids, -+++++ # logits_processor=logits_processor, -+++++ # stopping_criteria=stopping_criteria, -+++++ # generation_config=generation_config, -+++++ # synced_devices=synced_devices, -+++++ # streamer=streamer, -+++++ # logits_warper=logits_warper, -+++++ # **model_kwargs, -+++++ # ) -+++++ -+++++ # # 使用 StaticCache,进入自定义循环 -+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++++ # pad_token_id = generation_config._pad_token_tensor -+++++ # output_attentions = generation_config.output_attentions -+++++ # output_hidden_states = generation_config.output_hidden_states -+++++ # output_scores = generation_config.output_scores -+++++ # output_logits = generation_config.output_logits -+++++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++++ # max_length = generation_config.max_length -+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++++ # do_sample = generation_config.do_sample -+++++ -+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++++ # raise ValueError( -+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++++ # f"{logits_warper})." -+++++ # ) -+++++ -+++++ # # init attention / hidden states / scores tuples -+++++ # scores = () if (return_dict_in_generate and output_scores) else None -+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++++ -+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++++ # encoder_hidden_states = ( -+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++++ # ) -+++++ -+++++ # # keep track of which sequences are already finished -+++++ # batch_size, cur_len = input_ids.shape -+++++ # this_peer_finished = False -+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++++ -+++++ # time_record = [] -+++++ # from ....utils.testing_utils import parse_flag_from_env -+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++++ -+++++ # while self._has_unfinished_sequences( -+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++++ # ): -+++++ # if _record_time: -+++++ # import time as time_module -+++++ # infer_start = time_module.time() -+++++ -+++++ # # prepare model inputs -+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++++ -+++++ # # prepare variable output controls -+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++++ -+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++++ # cur_cache_position = model_inputs.get("cache_position") -+++++ # cur_past_key_values = model_inputs.get("past_key_values") -+++++ # cur_input_ids = model_inputs.get("input_ids") -+++++ -+++++ # if (isinstance(cur_past_key_values, StaticCache) and -+++++ # cur_cache_position is not None and -+++++ # len(cur_cache_position.shape) > 0 and -+++++ # cur_cache_position.shape[0] == 1 and -+++++ # cur_input_ids is not None and -+++++ # cur_input_ids.shape[1] == 1): -+++++ # # 使用 JIT 优化的单 token 解码 -+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++++ # if not hasattr(self, '_jit_used'): -+++++ # self._jit_used = False -+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++++ -+++++ # next_token_logits = self.get_decode_one_tokens_logits( -+++++ # cur_token=cur_input_ids, -+++++ # input_pos=model_inputs.get("position_ids"), -+++++ # cache_position=cur_cache_position, -+++++ # past_key_values=cur_past_key_values, -+++++ # ) -+++++ -+++++ # # 标记已使用JIT(用于后续判断) -+++++ # if not self._jit_used: -+++++ # self._jit_used = True -+++++ -+++++ # # 构造兼容的输出对象 -+++++ # class JitOptimizedOutput: -+++++ # def __init__(self, logits, config): -+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++++ # self.config = config -+++++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++++ # self.attentions = None if not config.is_encoder_decoder else None -+++++ # self.cross_attentions = None -+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++++ -+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++++ # else: -+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++++ # outputs = self(**model_inputs, return_dict=True) -+++++ -+++++ # if synced_devices and this_peer_finished: -+++++ # continue -+++++ -+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++++ # next_token_logits = outputs.logits[:, -1, :] -+++++ -+++++ # # pre-process distribution -+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++++ # if do_sample: -+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++++ -+++++ # # Store scores, attentions and hidden_states when required -+++++ # if return_dict_in_generate: -+++++ # if output_scores: -+++++ # scores += (next_token_scores,) -+++++ # if output_logits: -+++++ # raw_logits += (next_token_logits,) -+++++ # if output_attentions: -+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++++ # if self.config.is_encoder_decoder: -+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++++ -+++++ # if output_hidden_states: -+++++ # hidden = ( -+++++ # outputs.decoder_hidden_states -+++++ # if self.config.is_encoder_decoder -+++++ # else outputs.hidden_states -+++++ # ) -+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++++ -+++++ # # token selection -+++++ # if do_sample: -+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++++ # else: -+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++++ -+++++ # # finished sentences should have their next token be a padding token -+++++ # if has_eos_stopping_criteria: -+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++++ -+++++ # # update generated ids, model inputs, and length for next step -+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++++ # if streamer is not None: -+++++ # streamer.put(next_tokens) -+++++ -+++++ # model_kwargs = self._update_model_kwargs_for_generation( -+++++ # outputs, -+++++ # model_kwargs, -+++++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++++ # ) -+++++ -+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++++ # cur_len += 1 -+++++ -+++++ # if _record_time: -+++++ # import time as time_module -+++++ # infer_stop = time_module.time() -+++++ # time_record.append(infer_stop - infer_start) -+++++ -+++++ # del outputs -+++++ -+++++ # average_infer_time = None -+++++ # if time_record: -+++++ # if len(time_record) > 1: -+++++ # time_record.pop(0) -+++++ # average_infer_time = sum(time_record) / len(time_record) -+++++ # print(f'average inference time is: {average_infer_time}') -+++++ # print(f'inference time record: {time_record}') -+++++ -+++++ # if streamer is not None: -+++++ # streamer.end() -+++++ -+++++ # # 简单判断:打印是否使用了JIT路径 -+++++ # if hasattr(self, '_jit_used') and self._jit_used: -+++++ # print("[JIT] ✓ JIT optimization was used during generation") -+++++ # else: -+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++++ -+++++ # if return_dict_in_generate: -+++++ # if self.config.is_encoder_decoder: -+++++ # return GenerateEncoderDecoderOutput( -+++++ # sequences=input_ids, -+++++ # scores=scores, -+++++ # logits=raw_logits, -+++++ # encoder_attentions=encoder_attentions, -+++++ # encoder_hidden_states=encoder_hidden_states, -+++++ # decoder_attentions=decoder_attentions, -+++++ # cross_attentions=cross_attentions, -+++++ # decoder_hidden_states=decoder_hidden_states, -+++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++ # average_infer_time=average_infer_time -+++++ # ) -+++++ # else: -+++++ # return GenerateDecoderOnlyOutput( -+++++ # sequences=input_ids, -+++++ # scores=scores, -+++++ # logits=raw_logits, -+++++ # attentions=decoder_attentions, -+++++ # hidden_states=decoder_hidden_states, -+++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++ # average_infer_time=average_infer_time -+++++ # ) -+++++ # else: -+++++ # return input_ids -+++++ -+++++ # def _prepare_cache_for_generation( -+++++ # self, -+++++ # generation_config, -+++++ # model_kwargs, -+++++ # assistant_model, -+++++ # batch_size, -+++++ # max_cache_length, -+++++ # ): -+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++++ # generation_config.cache_implementation = "static" -+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++++ -+++++ # if generation_config.cache_implementation == "static": -+++++ # base_required_from_max_length = generation_config.max_length + 1 -+++++ # base_required = max(max_cache_length, base_required_from_max_length) -+++++ # min_cache_size = 50 -+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++++ # else: -+++++ # max_cache_length = max(base_required, min_cache_size) -+++++ -+++++ # original_max_cache_length = max_cache_length -+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++++ # print(f" - final max_cache_length: {max_cache_length}") -+++++ -+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++ # if max_cache_length > self.config.max_position_embeddings: -+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++++ -+++++ # result = super()._prepare_cache_for_generation( -+++++ # generation_config=generation_config, -+++++ # model_kwargs=model_kwargs, -+++++ # assistant_model=assistant_model, -+++++ # batch_size=batch_size, -+++++ # max_cache_length=max_cache_length, -+++++ # ) -+++++ -+++++ # if generation_config.cache_implementation == "static": -+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++++ # created_cache = model_kwargs.get(cache_name) -+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++++ # if created_cache.max_cache_len < generation_config.max_length: -+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++++ -+++++ # return result -+++++ -+++++ -+++++ -++++ -++++ -++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++-- -++++2.27.0 -++++ -+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+++new file mode 100644 -+++index 00000000..22b65dd5 -+++--- /dev/null -++++++ b/patches/0002-20251106commit.patch -+++@@ -0,0 +1,3200 @@ -++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Thu, 6 Nov 2025 09:20:38 +0800 -++++Subject: [PATCH 2/3] 20251106commit -++++ -++++--- -++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -++++ 3 files changed, 2689 insertions(+), 305 deletions(-) -++++ create mode 100644 patches/0001-20251104commit.patch -++++ -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index d8303e45..73773c22 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -++++ # y = y + self.shared_experts(identity) -++++ # return y -++++ -+++++ # @no_grad() -+++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ -+++++ # expert_cache = ops.zeros_like(x) -+++++ # for i in range(self.num_experts_per_tok): -+++++ # expert_id = flat_expert_indices[i].item() -+++++ # weight = flat_expert_weights[i].item() -+++++ # expert = self.experts[expert_id] -+++++ # expert_out = expert(x) -+++++ # expert_cache += expert_out * weight -+++++ # return expert_cache -+++++ -++++ @no_grad() -++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ # x 的 shape: (1, hidden_size) -+++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++++ -+++++ # 1. 收集所有需要的专家层 -+++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++++ selected_experts = [self.experts[i] for i in flat_expert_indices] -+++++ -+++++ # 2. 并行计算所有专家的输出 -+++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++++ # ops.cat 会将它们堆叠成一个新的 Tensor -+++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++++ -+++++ # 3. 使用矩阵乘法进行加权求和 -+++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++ # 最终结果 final_output 的 shape: (1, hidden_size) -+++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++++ -+++++ return final_output -++++ -++++- expert_cache = ops.zeros_like(x) -++++- for i in range(self.num_experts_per_tok): -++++- expert_id = flat_expert_indices[i].item() -++++- weight = flat_expert_weights[i].item() -++++- expert = self.experts[expert_id] -++++- expert_out = expert(x) -++++- expert_cache += expert_out * weight -++++- return expert_cache -++++ -++++ @no_grad() -++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++ # @lwx -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -+++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++++ -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -++++ return attn_output, attn_weights, past_key_value -++++ -++++ -+++++# class DeepseekFlashAttention(nn.Module): -+++++# """ -+++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+++++ -+++++# This class is designed as a drop-in replacement for DeepseekAttention. -+++++# """ -+++++ -+++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++++# super().__init__() -+++++# self.config = config -+++++# self.layer_idx = layer_idx -+++++# if layer_idx is None: -+++++# logger.warning( -+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++# "when creating this class." -+++++# ) -+++++ -+++++# self.attention_dropout = config.attention_dropout -+++++# self.hidden_size = config.hidden_size -+++++# self.num_heads = config.num_attention_heads -+++++# self.head_dim = self.hidden_size // self.num_heads -+++++# self.num_key_value_heads = config.num_key_value_heads -+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++# self.max_position_embeddings = config.max_position_embeddings -+++++# self.rope_theta = config.rope_theta -+++++# self.is_causal = True -+++++ -+++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++# raise ValueError( -+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++# f" and `num_heads`: {self.num_heads})." -+++++# ) -+++++ -+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++++# self._init_rope() -+++++ -+++++# def _init_rope(self): -+++++# if self.config.rope_scaling is None: -+++++# self.rotary_emb = DeepseekRotaryEmbedding( -+++++# self.head_dim, -+++++# max_position_embeddings=self.max_position_embeddings, -+++++# base=self.rope_theta, -+++++# ) -+++++# else: -+++++# scaling_type = self.config.rope_scaling["type"] -+++++# scaling_factor = self.config.rope_scaling["factor"] -+++++# if scaling_type == "linear": -+++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++++# self.head_dim, -+++++# max_position_embeddings=self.max_position_embeddings, -+++++# scaling_factor=scaling_factor, -+++++# base=self.rope_theta, -+++++# ) -+++++# elif scaling_type == "dynamic": -+++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++++# self.head_dim, -+++++# max_position_embeddings=self.max_position_embeddings, -+++++# scaling_factor=scaling_factor, -+++++# base=self.rope_theta, -+++++# ) -+++++# else: -+++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++++ -+++++# def forward( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++# position_ids: Optional[mindspore.Tensor] = None, -+++++# past_key_value: Optional[Cache] = None, -+++++# output_attentions: bool = False, -+++++# use_cache: bool = False, -+++++# **kwargs, -+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++# if "padding_mask" in kwargs: -+++++# warnings.warn( -+++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++++# ) -+++++ -+++++# if output_attentions: -+++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+++++ -+++++# bsz, q_len, _ = hidden_states.shape -+++++ -+++++# if self.config.pretraining_tp > 1: -+++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++++ -+++++# query_states = self.q_proj(hidden_states) -+++++# key_states = self.k_proj(hidden_states) -+++++# value_states = self.v_proj(hidden_states) -+++++ -+++++# # Reshape for multi-head attention -+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++# kv_seq_len = key_states.shape[-2] -+++++# if past_key_value is not None: -+++++# if self.layer_idx is None: -+++++# raise ValueError( -+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++# "with a layer index." -+++++# ) -+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++# # Apply Rotary Positional Embedding -+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++# if past_key_value is not None: -+++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ -+++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++ -+++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++ -+++++# # Convert attention_mask for flash_attention_score -+++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+++++# if attention_mask is not None: -+++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++++# raise ValueError( -+++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++++# ) -+++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+++++# else: -+++++# attn_mask_for_fa = None -+++++ -+++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++++ -+++++# # Call the fused flash_attention_score operator -+++++# attn_output = mindspore.ops.flash_attention_score( -+++++# query=query_states_for_fa, -+++++# key=key_states_for_fa, -+++++# value=value_states_for_fa, -+++++# head_num=self.num_heads, # This is N1, the number of query heads -+++++# input_layout='BSH', -+++++# attn_mask=attn_mask_for_fa, -+++++# keep_prob=keep_prob, -+++++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++++# sparse_mode=0 # Default mask mode -+++++# ) -+++++ -+++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+++++# attn_output = self.o_proj(attn_output) -+++++ -+++++# # Flash Attention does not return attention weights -+++++# attn_weights = None -+++++ -+++++# return attn_output, attn_weights, past_key_value -+++++ -+++++class DeepseekFlashAttention(nn.Module): -+++++ """ -+++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -+++++ This implementation is a drop-in replacement for the original DeepseekAttention class, -+++++ designed for high performance on supported hardware (Ascend). -+++++ -+++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -+++++ """ -+++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++++ super().__init__() -+++++ self.config = config -+++++ self.layer_idx = layer_idx -+++++ if layer_idx is None: -+++++ logger.warning( -+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++ "when creating this class." -+++++ ) -+++++ -+++++ # --- [FIX] Correctly initialize all required attributes --- -+++++ self.attention_dropout = config.attention_dropout -+++++ self.hidden_size = config.hidden_size -+++++ self.num_heads = config.num_attention_heads -+++++ self.head_dim = self.hidden_size // self.num_heads -+++++ self.num_key_value_heads = config.num_key_value_heads -+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++ self.max_position_embeddings = config.max_position_embeddings -+++++ self.rope_theta = config.rope_theta -+++++ self.is_causal = True -+++++ -+++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++ raise ValueError( -+++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++ f" and `num_heads`: {self.num_heads})." -+++++ ) -+++++ -+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++++ -+++++ # This call will now succeed as all attributes are initialized. -+++++ self._init_rope() -+++++ -+++++ def _init_rope(self): -+++++ if self.config.rope_scaling is None: -+++++ self.rotary_emb = DeepseekRotaryEmbedding( -+++++ self.head_dim, -+++++ max_position_embeddings=self.max_position_embeddings, -+++++ base=self.rope_theta, -+++++ ) -+++++ else: -+++++ scaling_type = self.config.rope_scaling["type"] -+++++ scaling_factor = self.config.rope_scaling["factor"] -+++++ if scaling_type == "linear": -+++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++++ self.head_dim, -+++++ max_position_embeddings=self.max_position_embeddings, -+++++ scaling_factor=scaling_factor, -+++++ base=self.rope_theta, -+++++ ) -+++++ elif scaling_type == "dynamic": -+++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++++ self.head_dim, -+++++ max_position_embeddings=self.max_position_embeddings, -+++++ scaling_factor=scaling_factor, -+++++ base=self.rope_theta, -+++++ ) -+++++ else: -+++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++++ -+++++ def forward( -+++++ self, -+++++ hidden_states: mindspore.Tensor, -+++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++ position_ids: Optional[mindspore.Tensor] = None, -+++++ past_key_value: Optional[Cache] = None, -+++++ output_attentions: bool = False, -+++++ use_cache: bool = False, -+++++ **kwargs, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ if "padding_mask" in kwargs: -+++++ warnings.warn( -+++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++++ ) -+++++ if output_attentions: -+++++ warnings.warn( -+++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -+++++ ) -+++++ -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ if self.config.pretraining_tp > 1: -+++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++++ -+++++ query_states = self.q_proj(hidden_states) -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++ # Apply Rotary Position Embedding -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos} -+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -+++++ # So we must explicitly repeat the KV heads. -+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++ -+++++ # Convert attention mask for flash_attention_score -+++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -+++++ if attention_mask is not None: -+++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++++ raise ValueError( -+++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++++ ) -+++++ attn_mask_for_fa = attention_mask < 0 -+++++ else: -+++++ attn_mask_for_fa = None -+++++ -+++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++++ -+++++ # Call the fused operator using the efficient BNSD layout -+++++ attn_output = mindspore.ops.flash_attention_score( -+++++ query=query_states, -+++++ key=key_states, -+++++ value=value_states, -+++++ head_num=self.num_heads, -+++++ input_layout='BNSD', # Specify the correct layout -+++++ attn_mask=attn_mask_for_fa, -+++++ keep_prob=keep_prob, -+++++ scalar_value=1.0 / math.sqrt(self.head_dim) -+++++ ) -+++++ -+++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ -+++++ # Apply output projection -+++++ attn_output = self.o_proj(attn_output) -+++++ -+++++ # Flash attention does not return attention weights, so we return None. -+++++ attn_weights = None -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -++++ Deepseek_ATTENTION_CLASSES = { -++++ "eager": DeepseekAttention, -+++++ "flash-attention": DeepseekFlashAttention, -++++ } -++++ -++++ -++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -++++ config=config, layer_idx=layer_idx -++++ ) -++++ -+++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+++++ config=config, layer_idx=layer_idx -+++++ ) -+++++ -++++ self.mlp = ( -++++ DeepseekMoE(config) -++++ if ( -++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++index d4c6b651..bced285c 100644 -++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -++++ -++++ import mindspore -++++ import mindnlp.core.nn.functional as F -++++-from mindnlp.core import nn, ops -+++++from mindnlp.core import nn, ops, no_grad -++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -++++ -++++ from ....common.activations import ACT2FN -++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++++ -+++++Long_Prompt = False -+++++PROMPT_LENGTH_THRESHOLD = 128 -++++ -++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++++ def _prepare_4d_causal_attention_mask_with_cache_position( -++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -++++ return attn_output, attn_weights, past_key_value -++++ -++++ -+++++# class Qwen2MoeFlashAttention(nn.Module): -+++++# """ -+++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++ -+++++# 关键改动: -+++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++# 直接传入原始的 key 和 value 张量效率更高。 -+++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++# super().__init__() -+++++# self.config = config -+++++# self.layer_idx = layer_idx -+++++# self.hidden_size = config.hidden_size -+++++# self.num_heads = config.num_attention_heads -+++++# self.head_dim = self.hidden_size // self.num_heads -+++++# self.num_key_value_heads = config.num_key_value_heads -+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++# self.max_position_embeddings = config.max_position_embeddings -+++++# self.rope_theta = config.rope_theta -+++++# self.attention_dropout = config.attention_dropout -+++++ -+++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++# raise ValueError( -+++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++# ) -+++++ -+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++ -+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++# self.head_dim, -+++++# max_position_embeddings=self.max_position_embeddings, -+++++# base=self.rope_theta, -+++++# ) -+++++ -+++++# def forward( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++# position_ids: Optional[mindspore.Tensor] = None, -+++++# past_key_value: Optional[Cache] = None, -+++++# output_attentions: bool = False, -+++++# use_cache: bool = False, -+++++# cache_position: Optional[mindspore.Tensor] = None, -+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++# bsz, q_len, _ = hidden_states.shape -+++++ -+++++# # 1. 线性投射 Q, K, V -+++++# query_states = self.q_proj(hidden_states) -+++++# key_states = self.k_proj(hidden_states) -+++++# value_states = self.v_proj(hidden_states) -+++++ -+++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++# # query: [B, S, H*D] -> [B, N1, S, D] -+++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++# # 3. RoPE 旋转位置编码 -+++++# kv_seq_len = key_states.shape[-2] -+++++# if past_key_value is not None: -+++++# if self.layer_idx is None: -+++++# raise ValueError( -+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++# "with a layer index." -+++++# ) -+++++# # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++# if cache_position.shape[0] == 1: -+++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++# kv_seq_len = past_seen_tokens + 1 -+++++# else: -+++++# # prefill 阶段:cache_position 是范围,使用其长度 -+++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++# else: -+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++# # 4. KV 缓存更新 -+++++# if past_key_value is not None: -+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++# key_states, value_states = past_key_value.update( -+++++# key_states, value_states, self.layer_idx, cache_kwargs -+++++# ) -+++++ -+++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++# if cache_position.shape[0] == 1: -+++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++# kv_seq_len = key_states.shape[-2] -+++++ -+++++# # 5. [重要] 准备 Attention Mask -+++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++# fa_attention_mask = None -+++++# if attention_mask is not None: -+++++# # 截取与当前key长度匹配的部分 -+++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++# # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++# fa_attention_mask = (mask_slice != 0) -+++++ -+++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++# input_dtype = query_states.dtype -+++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++# query_states = query_states.to(mindspore.float16) -+++++# key_states = key_states.to(mindspore.float16) -+++++# value_states = value_states.to(mindspore.float16) -+++++ -+++++# # 6. [核心] 调用 flash_attention_score 算子 -+++++# # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++# attn_output = mindspore.ops.flash_attention_score( -+++++# query=query_states, -+++++# key=key_states, -+++++# value=value_states, -+++++# head_num=self.num_heads, # 传入Q的头数(N1) -+++++# attn_mask=fa_attention_mask, -+++++# keep_prob=1.0 - self.attention_dropout, -+++++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++++# input_layout="BNSD", -+++++# sparse_mode=0 # 使用 defaultMask 模式 -+++++# ) -+++++ -+++++# # 恢复原始数据类型 -+++++# attn_output = attn_output.to(input_dtype) -+++++ -+++++# # 7. 调整输出形状 -+++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++# attn_output = self.o_proj(attn_output) -+++++ -+++++# # FlashAttention 算子不直接返回注意力权重矩阵 -+++++# attn_weights = None -+++++# if output_attentions: -+++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++# return attn_output, attn_weights, past_key_value -+++++ -+++++# # def forward( -+++++# # self, -+++++# # hidden_states: mindspore.Tensor, -+++++# # attention_mask: Optional[mindspore.Tensor] = None, -+++++# # position_ids: Optional[mindspore.Tensor] = None, -+++++# # past_key_value: Optional[Cache] = None, -+++++# # output_attentions: bool = False, -+++++# # use_cache: bool = False, -+++++# # cache_position: Optional[mindspore.Tensor] = None, -+++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++# # bsz, q_len, _ = hidden_states.shape -+++++ -+++++# # # 1. 线性投射 Q, K, V -+++++# # query_states = self.q_proj(hidden_states) -+++++# # key_states = self.k_proj(hidden_states) -+++++# # value_states = self.v_proj(hidden_states) -+++++ -+++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -+++++# # # 3. RoPE 旋转位置编码 -+++++# # kv_seq_len = key_states.shape[-2] -+++++# # if past_key_value is not None: -+++++# # if self.layer_idx is None: -+++++# # raise ValueError( -+++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++# # "with a layer index." -+++++# # ) -+++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++# # # 4. KV 缓存更新 -+++++# # if past_key_value is not None: -+++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++# # key_states, value_states = past_key_value.update( -+++++# # key_states, value_states, self.layer_idx, cache_kwargs -+++++# # ) -+++++ -+++++# # # 5. 准备 Attention Mask -+++++# # fa_attention_mask = None -+++++# # if attention_mask is not None: -+++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++# # fa_attention_mask = (mask_slice != 0) -+++++ -+++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++# # input_dtype = query_states.dtype -+++++ -+++++# # # 6. [核心] 调用 flash_attention_score 算子 -+++++# # attn_output = mindspore.ops.flash_attention_score( -+++++# # query=query_states, -+++++# # key=key_states, -+++++# # value=value_states, -+++++# # head_num=self.num_heads, -+++++# # attn_mask=fa_attention_mask, -+++++# # keep_prob=1.0 - self.attention_dropout, -+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++# # input_layout="BNSD", -+++++# # sparse_mode=0, -+++++# # # <--- 修改点 2: 启用内部高精度计算 --- -+++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++# # inner_precise=1 -+++++# # ) -+++++ -+++++# # # 恢复原始数据类型 -+++++# # attn_output = attn_output.to(input_dtype) -+++++ -+++++# # # 7. 调整输出形状 -+++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++# # attn_output = self.o_proj(attn_output) -+++++ -+++++# # attn_weights = None -+++++# # if output_attentions: -+++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ -+++++# # return attn_output, attn_weights, past_key_value -+++++ -+++++ -++++ class Qwen2MoeFlashAttention(nn.Module): -++++ """ -++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++- -++++- 关键改动: -++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++- 直接传入原始的 key 和 value 张量效率更高。 -++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -+++++ -+++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -+++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -+++++ 完全使用模型的低精度数据类型(如 float16)进行计算, -+++++ 以达到理论上的最高执行速度。 -++++ """ -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++ super().__init__() -++++ self.config = config -++++ self.layer_idx = layer_idx -+++++ if layer_idx is None: -+++++ logger.warning_once( -+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -+++++ ) -+++++ -++++ self.hidden_size = config.hidden_size -++++ self.num_heads = config.num_attention_heads -++++ self.head_dim = self.hidden_size // self.num_heads -++++ self.num_key_value_heads = config.num_key_value_heads -++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++ self.max_position_embeddings = config.max_position_embeddings -++++ self.rope_theta = config.rope_theta -++++ self.attention_dropout = config.attention_dropout -++++ -++++- if (self.head_dim * self.num_heads) != self.hidden_size: -++++- raise ValueError( -++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++- ) -++++- -++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++- # query: [B, S, H*D] -> [B, N1, S, D] -++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++ # 2. 调整形状以匹配 BNSD 布局 -++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- -++++- # 3. RoPE 旋转位置编码 -+++++ -+++++ # 3. RoPE 和 KV 缓存 -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++- if self.layer_idx is None: -++++- raise ValueError( -++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++- "with a layer index." -++++- ) -++++- # 对于 StaticCache,需要特殊处理 kv_seq_len -++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++- if cache_position.shape[0] == 1: -++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++- kv_seq_len = past_seen_tokens + 1 -++++- else: -++++- # prefill 阶段:cache_position 是范围,使用其长度 -++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++- else: -++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++- -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++- # 4. KV 缓存更新 -++++ if past_key_value is not None: -++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++- key_states, value_states = past_key_value.update( -++++- key_states, value_states, self.layer_idx, cache_kwargs -++++- ) -++++- -++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++- if cache_position.shape[0] == 1: -++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++- kv_seq_len = key_states.shape[-2] -++++- -++++- # 5. [重要] 准备 Attention Mask -++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++ # 4. 准备 Attention Mask -++++ fa_attention_mask = None -++++ if attention_mask is not None: -++++- # 截取与当前key长度匹配的部分 -++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++- # 转换为布尔类型: 大负数 -> True, 0 -> False -++++ fa_attention_mask = (mask_slice != 0) -++++ -++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++- input_dtype = query_states.dtype -++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++- query_states = query_states.to(mindspore.float16) -++++- key_states = key_states.to(mindspore.float16) -++++- value_states = value_states.to(mindspore.float16) -++++- -++++- # 6. [核心] 调用 flash_attention_score 算子 -++++- # - 无需手动 repeat_kv, 算子原生支持 GQA -++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -++++ attn_output = mindspore.ops.flash_attention_score( -++++ query=query_states, -++++ key=key_states, -++++ value=value_states, -++++- head_num=self.num_heads, # 传入Q的头数(N1) -+++++ head_num=self.num_heads, -++++ attn_mask=fa_attention_mask, -++++- keep_prob=1.0 - self.attention_dropout, -+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++ input_layout="BNSD", -++++- sparse_mode=0 # 使用 defaultMask 模式 -+++++ sparse_mode=0, -+++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -++++ ) -++++ -++++- # 恢复原始数据类型 -++++- attn_output = attn_output.to(input_dtype) -++++- -++++- # 7. 调整输出形状 -++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++ # 6. 调整输出形状 -++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++ attn_output = self.o_proj(attn_output) -++++ -++++- # FlashAttention 算子不直接返回注意力权重矩阵 -+++++ # 7. 返回结果 -++++ attn_weights = None -++++ if output_attentions: -++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++- # def forward( -++++- # self, -++++- # hidden_states: mindspore.Tensor, -++++- # attention_mask: Optional[mindspore.Tensor] = None, -++++- # position_ids: Optional[mindspore.Tensor] = None, -++++- # past_key_value: Optional[Cache] = None, -++++- # output_attentions: bool = False, -++++- # use_cache: bool = False, -++++- # cache_position: Optional[mindspore.Tensor] = None, -++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++- -++++- # bsz, q_len, _ = hidden_states.shape -++++- -++++- # # 1. 线性投射 Q, K, V -++++- # query_states = self.q_proj(hidden_states) -++++- # key_states = self.k_proj(hidden_states) -++++- # value_states = self.v_proj(hidden_states) -++++- -++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- -++++- # # 3. RoPE 旋转位置编码 -++++- # kv_seq_len = key_states.shape[-2] -++++- # if past_key_value is not None: -++++- # if self.layer_idx is None: -++++- # raise ValueError( -++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++- # "with a layer index." -++++- # ) -++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++ -++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++- -++++- # # 4. KV 缓存更新 -++++- # if past_key_value is not None: -++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++- # key_states, value_states = past_key_value.update( -++++- # key_states, value_states, self.layer_idx, cache_kwargs -++++- # ) -++++- -++++- # # 5. 准备 Attention Mask -++++- # fa_attention_mask = None -++++- # if attention_mask is not None: -++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++- # fa_attention_mask = (mask_slice != 0) -++++- -++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++- # input_dtype = query_states.dtype -++++- -++++- # # 6. [核心] 调用 flash_attention_score 算子 -++++- # attn_output = mindspore.ops.flash_attention_score( -++++- # query=query_states, -++++- # key=key_states, -++++- # value=value_states, -++++- # head_num=self.num_heads, -++++- # attn_mask=fa_attention_mask, -++++- # keep_prob=1.0 - self.attention_dropout, -++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++++- # input_layout="BNSD", -++++- # sparse_mode=0, -++++- # # <--- 修改点 2: 启用内部高精度计算 --- -++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++- # inner_precise=1 -++++- # ) -++++- -++++- # # 恢复原始数据类型 -++++- # attn_output = attn_output.to(input_dtype) -+++++QWEN2MOE_ATTENTION_CLASSES = { -+++++ "eager": Qwen2MoeAttention, -+++++ "flash-attention": Qwen2MoeFlashAttention, -+++++} -++++ -++++- # # 7. 调整输出形状 -++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++- # attn_output = self.o_proj(attn_output) -++++ -++++- # attn_weights = None -++++- # if output_attentions: -++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# def __init__(self, config): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# # gating -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++ -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# #@dwj -+++++# # 只遍历激活的专家,而非全部专家 -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# num_tokens = hidden_states_reshaped.shape[0] -+++++ -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++# flat_selected_experts = selected_experts.flatten() -+++++ -+++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++# token_indices = broadcasted_token_indices.flatten() -+++++ -+++++# active_experts = ops.unique(flat_selected_experts) -+++++ -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++ -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# selected_token_indices = token_indices[mask] -+++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++ -+++++# current_states = hidden_states_reshaped[selected_token_indices] -+++++ -+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++ -+++++# final_hidden_states = final_hidden_states.index_add( -+++++# dim=0, -+++++# index=selected_token_indices, -+++++# source=expert_output.to(hidden_states.dtype) -+++++# ) -+++++ -+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++ -++++- # return attn_output, attn_weights, past_key_value -+++++# final_hidden_states = final_hidden_states + shared_expert_output -+++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -+++++ -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# """ -+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+++++# `_moe_infer_prefill` (用于长序列处理) 方法。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# # 门控网络 -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# # 专家列表 -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++# # 共享专家 -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_decode( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# """ -+++++# 【解码路径】针对 sequence_length=1 的极致优化。 -+++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+++++# """ -+++++# batch_size, hidden_dim = hidden_states.shape -+++++ -+++++# expert_outputs_list = [ -+++++# ops.cat([ -+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++# ], dim=0) -+++++# for i in range(batch_size) -+++++# ] -+++++ -+++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+++++# # shape: (batch_size, top_k, hidden_dim) -+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++ -+++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++ -+++++# return moe_output.squeeze(1) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_prefill( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# """ -+++++# 【预填充路径】针对 sequence_length > 1 的优化。 -+++++# 按专家对 Token 进行分组,并进行批处理。 -+++++# """ -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens = hidden_states.shape[0] -+++++# flat_selected_experts = selected_experts.flatten() -+++++ -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++ -+++++# active_experts = ops.unique(flat_selected_experts) -+++++ -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++ -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# selected_token_indices = token_indices[mask] -+++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++ -+++++# current_states = hidden_states[selected_token_indices] -+++++ -+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++ -+++++# moe_output = moe_output.index_add( -+++++# dim=0, -+++++# index=selected_token_indices, -+++++# source=expert_output.to(hidden_states.dtype) -+++++# ) -+++++# return moe_output -+++++ -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# """ -+++++# 顶层 forward 方法,作为智能分发器。 -+++++# """ -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++- # def forward( -++++- # self, -++++- # hidden_states: mindspore.Tensor, -++++- # attention_mask: Optional[mindspore.Tensor] = None, -++++- # position_ids: Optional[mindspore.Tensor] = None, -++++- # past_key_value: Optional[Cache] = None, -++++- # output_attentions: bool = False, -++++- # use_cache: bool = False, -++++- # cache_position: Optional[mindspore.Tensor] = None, -++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++- -++++- # bsz, q_len, _ = hidden_states.shape -++++- -++++- # query_states = self.q_proj(hidden_states) -++++- # key_states = self.k_proj(hidden_states) -++++- # value_states = self.v_proj(hidden_states) -++++- -++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- -++++- # kv_seq_len = key_states.shape[-2] -++++- # if past_key_value is not None: -++++- # if self.layer_idx is None: -++++- # raise ValueError("`layer_idx` must be specified for caching") -++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++- -++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++- -++++- # if past_key_value is not None: -++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++- # key_states, value_states = past_key_value.update( -++++- # key_states, value_states, self.layer_idx, cache_kwargs -++++- # ) -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++# moe_output = None -+++++# # 在推理时,根据序列长度选择最优路径 -+++++# if not self.training: -+++++# if sequence_length == 1: -+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++# else: -+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++# else: -+++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+++++# raise NotImplementedError("Training path is not implemented.") -+++++ -+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+++++ -+++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+++++ -+++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -+++++ -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# """ -+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# # 门控网络 -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# # 专家列表 -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++# # 共享专家 -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_decode( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# batch_size, _ = hidden_states.shape -+++++# expert_outputs_list = [ -+++++# ops.cat([ -+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++# ], dim=0) -+++++# for i in range(batch_size) -+++++# ] -+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++# return moe_output.squeeze(1) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_prefill( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens = hidden_states.shape[0] -+++++# flat_selected_experts = selected_experts.flatten() -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++# active_experts = ops.unique(flat_selected_experts) -+++++ -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# selected_token_indices = token_indices[mask] -+++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++# current_states = hidden_states[selected_token_indices] -+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++# moe_output = moe_output.index_add( -+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++# ) -+++++# return moe_output -+++++ -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# """ -+++++# 顶层 forward 方法,作为智能分发器。 -+++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+++++# """ -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ -+++++# # 1. 门控计算 (通用逻辑) -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++# # 2. 智能分发到最优 MoE 路径 -+++++# moe_output = None -+++++# if not self.training: -+++++# if sequence_length == 1: -+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++# else: -+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++# else: -+++++# raise NotImplementedError("Training path is not implemented.") -+++++ -+++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++ -+++++# # 4. 合并 MoE 输出和共享专家输出 -+++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++ -+++++# # 5. 恢复原始形状并返回 -+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -+++++# prefill fastest -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# """ -+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# # 门控网络 -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# # 专家列表 -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++# # 共享专家 -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_dispatch( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# """ -+++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+++++# """ -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens, _ = hidden_states.shape -+++++ -+++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+++++# flat_selected_experts = selected_experts.flatten() -+++++# flat_routing_weights = routing_weights.flatten() -++++ -++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++- -++++- # # <--- 核心修改点: 手动进行高精度缩放 --- -++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++- # query_states = query_states / math.sqrt(self.head_dim) -++++- # # <--- 修改结束 --- -++++- -++++- # fa_attention_mask = None -++++- # if attention_mask is not None: -++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++- # fa_attention_mask = (mask_slice != 0) -++++- -++++- # input_dtype = query_states.dtype -++++- -++++- # attn_output = mindspore.ops.flash_attention_score( -++++- # query=query_states, # 传入已经预先缩放过的 query -++++- # key=key_states, -++++- # value=value_states, -++++- # head_num=self.num_heads, -++++- # attn_mask=fa_attention_mask, -++++- # keep_prob=1.0 - self.attention_dropout, -++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++- # input_layout="BNSD", -++++- # sparse_mode=0, -++++- # inner_precise=1 # 仍然保持内部高精度计算 -++++- # ) -+++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++ -++++- # attn_output = attn_output.to(input_dtype) -++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++- # attn_output = self.o_proj(attn_output) -+++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+++++# active_experts = ops.unique(flat_selected_experts) -+++++ -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++ -+++++# # 找到所有分配给该专家的 token -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++ -+++++# # 使用 mask 选取对应的 token 和权重 -+++++# current_token_indices = token_indices[mask] -+++++# current_routing_weights = flat_routing_weights[mask] -+++++# current_hidden_states = hidden_states[current_token_indices] -+++++ -+++++# # 对这些 token 进行批处理 -+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++ -+++++# # 使用 index_add 将结果精确地加回到对应位置 -+++++# moe_output = moe_output.index_add( -+++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+++++# ) -+++++# return moe_output -+++++ -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# """ -+++++# 顶层 forward 方法,作为智能分发器。 -+++++# """ -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ -+++++# # 1. 门控计算 -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++ -+++++# # 2. 调用统一的 MoE 计算内核 -+++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++++ -++++- # attn_weights = None -++++- # if output_attentions: -++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++# # 3. 统一处理共享专家 -+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++ -+++++# # 4. 合并输出 -+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++ -+++++# # 5. 恢复原始形状并返回 -+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -+++++ -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# """ -+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++# 【最终高性能与高精度版】: -+++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+++++# 3. 这样实现了速度和准确性的两全其美。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_decode( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# """ -+++++# 【解码路径】极致优化版:bmm + 高精度累加。 -+++++# """ -+++++# original_dtype = hidden_states.dtype -+++++# batch_size, _ = hidden_states.shape -+++++ -+++++# expert_outputs_list = [ -+++++# ops.cat([ -+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++# ], dim=0) -+++++# for i in range(batch_size) -+++++# ] -+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++ -+++++# # 在 float32 下执行 bmm,得到高精度结果 -+++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++ -+++++# # 将高精度结果转换回原始数据类型 -+++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+++++ -+++++# return moe_output -+++++ -+++++# @no_grad() -+++++# def _moe_infer_prefill( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# selected_experts: mindspore.Tensor, -+++++# routing_weights: mindspore.Tensor -+++++# ) -> mindspore.Tensor: -+++++# """ -+++++# 【预填充路径】与原始实现一致,结果精确。 -+++++# """ -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens, _ = hidden_states.shape -+++++# flat_selected_experts = selected_experts.flatten() -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++# active_experts = ops.unique(flat_selected_experts) -+++++ -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# selected_token_indices = token_indices[mask] -+++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++# current_states = hidden_states[selected_token_indices] -+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++# moe_output = moe_output.index_add( -+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++# ) -+++++# return moe_output -+++++ -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++ -++++- # return attn_output, attn_weights, past_key_value -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+++++# # 如果模型主体是 float16,后续再转换 -+++++ -+++++# moe_output = None -+++++# if not self.training: -+++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+++++# # _moe_infer_decode 内部会处理好类型转换 -+++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+++++# if sequence_length == 1: -+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++# else: -+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++# else: -+++++# raise NotImplementedError("Training path is not implemented.") -+++++ -+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++ -+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -++++ -++++-QWEN2MOE_ATTENTION_CLASSES = { -++++- "eager": Qwen2MoeAttention, -++++- "flash-attention": Qwen2MoeFlashAttention, -++++-} -+++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++# """ -+++++# 【融合版】一个混合专家模块,内置两种推理策略, -+++++# 由外部全局变量 `Long_Prompt` 控制: -+++++ -+++++# - if Long_Prompt is True: 【精度优先模式】 -+++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+++++# 适用于处理长序列,避免误差累积。 -+++++ -+++++# - if Long_Prompt is False: 【速度优先模式】 -+++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+++++# 在解码阶段获得极致速度,同时保证结果高度准确。 -+++++# """ -+++++# def __init__(self, config: Qwen2MoeConfig): -+++++# super().__init__() -+++++# self.num_experts = config.num_experts -+++++# self.top_k = config.num_experts_per_tok -+++++# self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++# self.experts = nn.ModuleList( -+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++# ) -+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++# # --- 速度优先模式的辅助函数 --- -+++++# @no_grad() -+++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++# original_dtype = hidden_states.dtype -+++++# batch_size, _ = hidden_states.shape -+++++# expert_outputs_list = [ -+++++# ops.cat([ -+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++# ], dim=0) -+++++# for i in range(batch_size) -+++++# ] -+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++# weights_fp32 = routing_weights.to(mindspore.float32) -+++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++# return moe_output_fp32.squeeze(1).to(original_dtype) -+++++ -+++++# @no_grad() -+++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens, _ = hidden_states.shape -+++++# flat_selected_experts = selected_experts.flatten() -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++# active_experts = ops.unique(flat_selected_experts) -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# selected_token_indices = token_indices[mask] -+++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++# current_states = hidden_states[selected_token_indices] -+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++# return moe_output -+++++ -+++++# # --- 精度优先模式的辅助函数 --- -+++++# @no_grad() -+++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++# moe_output = ops.zeros_like(hidden_states) -+++++# num_tokens, _ = hidden_states.shape -+++++# flat_selected_experts = selected_experts.flatten() -+++++# flat_routing_weights = routing_weights.flatten() -+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++# active_experts = ops.unique(flat_selected_experts) -+++++# for expert_idx_tensor in active_experts: -+++++# expert_idx = expert_idx_tensor.item() -+++++# expert_layer = self.experts[expert_idx] -+++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++# current_token_indices = token_indices[mask] -+++++# current_routing_weights = flat_routing_weights[mask] -+++++# current_hidden_states = hidden_states[current_token_indices] -+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++# return moe_output -+++++ -+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++# # 声明我们将要使用一个在模块外部定义的全局变量 -+++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+++++# global Long_Prompt -+++++ -+++++# # 1. 门控计算 (所有模式通用) -+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++# router_logits = self.gate(hidden_states_reshaped) -+++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++++# if self.norm_topk_prob: -+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++# moe_output = None -+++++# if not self.training: -+++++# # 根据 Long_Prompt 标志选择模式 -+++++# if Long_Prompt: -+++++# # --- 精度优先模式 --- -+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++# else: -+++++# # --- 速度优先模式 --- -+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++# if sequence_length == 1: -+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++# else: -+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++# else: -+++++# raise NotImplementedError("Training path is not implemented.") -+++++ -+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++ -+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++# return final_hidden_states, router_logits -+++++ -+++++class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ """ -+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+++++ 控制的顶级推理策略: -++++ -+++++ - if Long_Prompt is True: 【精度优先模式】 -+++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -+++++ 适用于需要严格可复现性的长序列任务。 -++++ -++++-class Qwen2MoeSparseMoeBlock(nn.Module): -++++- def __init__(self, config): -+++++ - if Long_Prompt is False: 【速度优先模式】 -+++++ 采用业界最强的性能组合: -+++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -+++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -+++++ """ -+++++ def __init__(self, config: Qwen2MoeConfig): -++++ super().__init__() -++++ self.num_experts = config.num_experts -++++ self.top_k = config.num_experts_per_tok -++++ self.norm_topk_prob = config.norm_topk_prob -++++ -++++- # gating -++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++ self.experts = nn.ModuleList( -++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++ ) -++++- -++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++ -++++- #@dwj -++++- # 只遍历激活的专家,而非全部专家 -++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++- num_tokens = hidden_states_reshaped.shape[0] -++++- -++++- router_logits = self.gate(hidden_states_reshaped) -++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- -++++- if self.norm_topk_prob: -++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++- flat_selected_experts = selected_experts.flatten() -++++- -++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++- token_indices = broadcasted_token_indices.flatten() -++++- -++++- active_experts = ops.unique(flat_selected_experts) -++++- -++++- for expert_idx_tensor in active_experts: -++++- expert_idx = expert_idx_tensor.item() -++++- expert_layer = self.experts[expert_idx] -++++- -++++- mask = (flat_selected_experts == expert_idx_tensor) -++++- selected_token_indices = token_indices[mask] -++++- selected_routing_weights = routing_weights.flatten()[mask] -++++- -++++- current_states = hidden_states_reshaped[selected_token_indices] -++++- -++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++- -++++- final_hidden_states = final_hidden_states.index_add( -++++- dim=0, -++++- index=selected_token_indices, -++++- source=expert_output.to(hidden_states.dtype) -++++- ) -++++- -++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -+++++ @no_grad() -+++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++ original_dtype = hidden_states.dtype -+++++ batch_size, _ = hidden_states.shape -+++++ expert_outputs_list = [ -+++++ ops.cat([ -+++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++ ], dim=0) -+++++ for i in range(batch_size) -+++++ ] -+++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++ weights_fp32 = routing_weights.to(mindspore.float32) -+++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++ return moe_output_fp32.squeeze(1).to(original_dtype) -+++++ -+++++ @no_grad() -+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++ num_tokens, _ = hidden_states.shape -+++++ flat_selected_experts = selected_experts.flatten() -+++++ sorted_expert_indices = flat_selected_experts.argsort() -+++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++++ original_token_indices = sorted_expert_indices // self.top_k -+++++ moe_output = ops.zeros_like(hidden_states) -+++++ current_token_offset = 0 -+++++ for i in range(self.num_experts): -+++++ expert_token_count = tokens_per_expert[i] - current_token_offset -+++++ if expert_token_count == 0: -+++++ continue -+++++ end_offset = current_token_offset + expert_token_count -+++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++++ expert_hidden_states = hidden_states[expert_original_token_indices] -+++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++ current_token_offset += expert_token_count -+++++ return moe_output -+++++ -+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+++++ @no_grad() -+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++ moe_output = ops.zeros_like(hidden_states) -+++++ num_tokens, _ = hidden_states.shape -+++++ flat_selected_experts = selected_experts.flatten() -+++++ flat_routing_weights = routing_weights.flatten() -+++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++ active_experts = ops.unique(flat_selected_experts) -+++++ for expert_idx_tensor in active_experts: -+++++ expert_idx = expert_idx_tensor.item() -+++++ expert_layer = self.experts[expert_idx] -+++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++ current_token_indices = token_indices[mask] -+++++ current_routing_weights = flat_routing_weights[mask] -+++++ current_hidden_states = hidden_states[current_token_indices] -+++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++ return moe_output -++++ -++++- final_hidden_states = final_hidden_states + shared_expert_output -++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++- -++++- return final_hidden_states, router_logits -+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++ global Long_Prompt -+++++ -+++++ # 1. 门控计算 (所有模式通用) -+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++ router_logits = self.gate(hidden_states_reshaped) -+++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++++ if self.norm_topk_prob: -+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++ moe_output = None -+++++ if Long_Prompt: -+++++ # --- 精度优先模式 (ACCURACY MODE) --- -+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ else: -+++++ # --- 速度优先模式 (SPEED MODE) --- -+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++ if sequence_length == 1: -+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ else: -+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ -++++ -+++++ # 3. 共享专家计算与合并 (所有模式通用) -+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++ -+++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++ -+++++ return final_hidden_states, router_logits -++++ -++++ class Qwen2MoeDecoderLayer(nn.Module): -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++++ super().__init__() -++++ self.hidden_size = config.hidden_size -+++++ -+++++ # if Long_Prompt: -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ # else: -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++ -++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ -++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++- -++++ if (layer_idx not in config.mlp_only_layers) and ( -++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++ ): -++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ self._warmed_up = True -++++ self.warmup_moe_model() -++++ -+++++ -+++++ -++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++ output_router_logits = ( -++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits -++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ router_logits=outputs.router_logits, -++++ ) -++++ -+++++ def generate(self, *args, **kwargs): -+++++ """ -+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+++++ """ -+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++++ -+++++ input_ids = kwargs.get("input_ids") -+++++ if input_ids is None and args: -+++++ input_ids = args[0] -+++++ -+++++ if input_ids is not None: -+++++ prompt_length = input_ids.shape[1] -+++++ -+++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -+++++ Long_Prompt = True -+++++ else: -+++++ Long_Prompt = False -+++++ -+++++ return super().generate(*args, **kwargs) -+++++ -++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -++++ def prepare_inputs_for_generation( -++++ self, -++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -++++ # Exception 1: when passing input_embeds, input_ids may be missing entries -++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -+++++ -++++ if past_key_values is not None: -++++ if inputs_embeds is not None: # Exception 1 -++++ if 0 not in input_ids.shape: -++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ } -++++ ) -++++ return model_inputs -+++++ -++++ # @lwx -++++ # def _decode_one_tokens_logits( -++++ # self, -++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -++++ attentions=outputs.attentions, -++++ ) -++++ -+++++ -++++ __all__ = [ -++++ "Qwen2MoeForCausalLM", -++++ "Qwen2MoeModel", -++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++new file mode 100644 -++++index 00000000..6dfb5b93 -++++--- /dev/null -+++++++ b/patches/0001-20251104commit.patch -++++@@ -0,0 +1,1272 @@ -+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++++From: Pinoeer-kingxi <13022943007@163.com> -+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++++Subject: [PATCH] 20251104commit -+++++ -+++++--- -+++++ mindnlp/transformers/cache_utils.py | 28 +- -+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++++ 3 files changed, 976 insertions(+), 87 deletions(-) -+++++ -+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++++index cadd2e04..02f8d4be 100644 -+++++--- a/mindnlp/transformers/cache_utils.py -++++++++ b/mindnlp/transformers/cache_utils.py -+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++++ # k_out[:, :, cache_position] = key_states -+++++ # v_out[:, :, cache_position] = value_states -+++++- if ON_ORANGE_PI: -+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++- else: -+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++- -++++++ # if ON_ORANGE_PI: -++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++ # else: -++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++ # 确保 cache_position 是 1D tensor 并且类型正确 -++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++++ if cache_position.ndim > 1: -++++++ cache_position = cache_position.flatten() -++++++ # 确保类型是 int32 或 int64(MindSpore 要求) -++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++++ cache_position = cache_position.int() -++++++ -++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++++ k_out[:, :, cache_position] = key_states -++++++ v_out[:, :, cache_position] = value_states -++++++ -+++++ return k_out, v_out -+++++ -+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++index c695b944..d8303e45 100644 -+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++++ def rotate_half(x): -+++++ """Rotates half the hidden dims of the input.""" -+++++- x1 = x[..., : x.shape[-1] // 2] -+++++- x2 = x[..., x.shape[-1] // 2 :] -++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++ # x1 = x[..., : x.shape[-1] // 2] -++++++ # x2 = x[..., x.shape[-1] // 2 :] -++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++ return ops.cat((-x2, x1), dim=-1) -+++++ -+++++ -+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++++ if self.training: -+++++ raise NotImplementedError("Training is not supported yet.") -+++++ else: -+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++- if self.config.n_shared_experts is not None: -+++++- y = y + self.shared_experts(identity) -+++++- return y -++++++ # @lwx -++++++ if orig_shape[1] == 1: -++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++++ y=y.view(*orig_shape) -++++++ if self.config.n_shared_experts is not None: -++++++ y = y + self.shared_experts(identity) -++++++ return y -++++++ else: -++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++++ if self.config.n_shared_experts is not None: -++++++ y = y + self.shared_experts(identity) -++++++ return y -++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++ # if self.config.n_shared_experts is not None: -++++++ # y = y + self.shared_experts(identity) -++++++ # return y -++++++ -++++++ @no_grad() -++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++ -++++++ expert_cache = ops.zeros_like(x) -++++++ for i in range(self.num_experts_per_tok): -++++++ expert_id = flat_expert_indices[i].item() -++++++ weight = flat_expert_weights[i].item() -++++++ expert = self.experts[expert_id] -++++++ expert_out = expert(x) -++++++ expert_cache += expert_out * weight -++++++ return expert_cache -+++++ -+++++ @no_grad() -+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++- # expert_cache = torch.zeros_like(x) -+++++- # idxs = flat_expert_indices.argsort() -+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++- # token_idxs = idxs // self.num_experts_per_tok -+++++- # for i, end_idx in enumerate(tokens_per_expert): -+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++- # if start_idx == end_idx: -+++++- # continue -+++++- # expert = self.experts[i] -+++++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++++- # expert_tokens = x[exp_token_idx] -+++++- # expert_out = expert(expert_tokens) -+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++- # return expert_cache -++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++ expert_cache = ops.zeros_like(x) -+++++ idxs = flat_expert_indices.argsort() -+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ token_idxs = idxs // self.num_experts_per_tok -++++++ -+++++ for i, end_idx in enumerate(tokens_per_expert): -+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ if start_idx == end_idx: -+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++++ expert_out = expert(expert_tokens) -+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -+++++ return expert_cache -++++++ -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # # expert_cache = torch.zeros_like(x) -++++++ # # idxs = flat_expert_indices.argsort() -++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++ # # token_idxs = idxs // self.num_experts_per_tok -++++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++ # # if start_idx == end_idx: -++++++ # # continue -++++++ # # expert = self.experts[i] -++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # # expert_tokens = x[exp_token_idx] -++++++ # # expert_out = expert(expert_tokens) -++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++ # # return expert_cache -++++++ # expert_cache = ops.zeros_like(x) -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # for i, end_idx in enumerate(tokens_per_expert): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # if start_idx == end_idx: -++++++ # continue -++++++ # expert = self.experts[i] -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = expert(expert_tokens) -++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -++++++ # return expert_cache -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # expert_cache = ops.zeros_like(x) -++++++ -++++++ # # 排序保证顺序一致 -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # # 找出有 token 的专家 -++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++ -++++++ # for i in active_experts.tolist(): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # end_idx = tokens_per_expert[i] -++++++ # if start_idx == end_idx: # 没有 token -++++++ # continue -++++++ -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = self.experts[i](expert_tokens) -++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++ -++++++ # expert_cache = mindspore.mint.scatter_add( -++++++ # expert_cache, -++++++ # 0, -++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++ # expert_out -++++++ # ) -++++++ -++++++ # return expert_cache -++++++ -++++++ -+++++ -+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++++ # """ -+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -++++++ self.warm_up = False -++++++ -++++++ def warmup_moe_model_deep(self): -++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++++ test_texts = [ -++++++ "warmup short", -++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++++ ] -++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++ if tokenizer is None: -++++++ from mindnlp.transformers import AutoTokenizer -++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++ self._warmup_tokenizer = tokenizer -++++++ -++++++ for text in test_texts: -++++++ inputs = tokenizer(text, return_tensors="ms") -++++++ with mindspore._no_grad(): -++++++ _ = self(**inputs, use_cache=False) -++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++++ -+++++ def get_input_embeddings(self): -+++++ return self.model.embed_tokens -+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++ ```""" -++++++ if not self.warm_up: -++++++ self.warm_up = True -++++++ self.warmup_moe_model_deep() -++++++ -+++++ output_attentions = ( -+++++ output_attentions -+++++ if output_attentions is not None -+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++index 3cbf820e..d4c6b651 100644 -+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++@@ -18,7 +18,6 @@ -+++++ # See the License for the specific language governing permissions and -+++++ # limitations under the License. -+++++ """MindSpore Qwen2MoE model.""" -+++++- -+++++ import math -+++++ from typing import List, Optional, Tuple, Union -+++++ -+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++++ TokenClassifierOutput, -+++++ ) -+++++ from ...modeling_utils import PreTrainedModel -++++++from ...generation import GenerationMixin -+++++ from ....utils import logging -+++++ from .configuration_qwen2_moe import Qwen2MoeConfig -+++++ -+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++++ self.variance_epsilon = eps -+++++ -+++++ def forward(self, hidden_states): -++++++ # @dwj -++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++ # @lwx -++++++ # if not self.training : -++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++ input_dtype = hidden_states.dtype -+++++ hidden_states = hidden_states.to(mindspore.float32) -+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++++@@ -234,6 +239,8 @@ def rotate_half(x): -+++++ """Rotates half the hidden dims of the input.""" -+++++ x1 = x[..., : x.shape[-1] // 2] -+++++ x2 = x[..., x.shape[-1] // 2 :] -++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++ return ops.cat((-x2, x1), dim=-1) -+++++ -+++++ -+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++++ self.config = config -+++++ self.hidden_size = config.hidden_size -+++++ self.intermediate_size = intermediate_size -++++++ -+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++++ self.act_fn = ACT2FN[config.hidden_act] -+++++ -+++++ def forward(self, x): -+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++- -+++++ -++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++ # @lwx -++++++ # gate_up_output = self.gate_up_proj(x) -++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++++ # return self.down_proj(swiglu_output) -++++++ -++++++ # def forward(self, x): -++++++ # gate_proj_out = self.gate_proj(x) -++++++ # up_proj_out = self.up_proj(x) -++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++++ # return self.down_proj(swiglu_out) -++++++ -+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++++ """ -+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++++ use_cache: bool = False, -+++++ cache_position: Optional[mindspore.Tensor] = None, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ -++++++ -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ query_states = self.q_proj(hidden_states) -+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ "with a layer index." -+++++ ) -+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ if isinstance(past_key_value, StaticCache): -++++++ kv_seq_len = key_states.shape[-2] -++++++ else: -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++ if isinstance(past_key_value, StaticCache): -++++++ kv_seq_len = key_states.shape[-2] -+++++ -+++++ # repeat k/v heads if n_kv_heads < n_heads -+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++- -++++++ -+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++ -+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++++- raise ValueError( -+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++++- f" {attn_weights.shape}" -+++++- ) -+++++- -+++++- if attention_mask is not None: # no matter the length, we just slice it -+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++++ if attention_mask is not None: -++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++ attn_weights = attn_weights + causal_mask -+++++ -+++++ # upcast attention to fp32 -+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++ -+++++ attn_output = self.o_proj(attn_output) -+++++- -++++++ # @lwx -++++++ -++++++ # max_seq_len = self.max_position_embeddings # 2048 -++++++ -++++++ # if attention_mask is not None: -++++++ # # attention_mask: [B, 1, Sq, Sk] -++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++ -++++++ # # pad 到 [max_seq_len, max_seq_len] -++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++ # global_attention_mask = padded_mask -++++++ # else: -++++++ # global_attention_mask = None -++++++ -++++++ -++++++ # sparse_mode=3 -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # real_shift=None, -++++++ # padding_mask=None, -++++++ -++++++ # head_num=self.num_heads, -++++++ # attn_mask=global_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ # input_layout="BNSD", -++++++ # pre_tokens=2147483647, -++++++ # next_tokens=2147483647, -++++++ # inner_precise=0, -++++++ # drop_mask=None, -++++++ # prefix=None, -++++++ # actual_seq_qlen=None, -++++++ # actual_seq_kvlen=None, -++++++ # sparse_mode=sparse_mode, -++++++ # ) -+++++ if not output_attentions: -+++++ attn_weights = None -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ -++++++class Qwen2MoeFlashAttention(nn.Module): -++++++ """ -++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++ -++++++ 关键改动: -++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++ 直接传入原始的 key 和 value 张量效率更高。 -++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++ """ -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++ super().__init__() -++++++ self.config = config -++++++ self.layer_idx = layer_idx -++++++ self.hidden_size = config.hidden_size -++++++ self.num_heads = config.num_attention_heads -++++++ self.head_dim = self.hidden_size // self.num_heads -++++++ self.num_key_value_heads = config.num_key_value_heads -++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++ self.max_position_embeddings = config.max_position_embeddings -++++++ self.rope_theta = config.rope_theta -++++++ self.attention_dropout = config.attention_dropout -++++++ -++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++++ raise ValueError( -++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++ ) -++++++ -++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++ -++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++ self.head_dim, -++++++ max_position_embeddings=self.max_position_embeddings, -++++++ base=self.rope_theta, -++++++ ) -++++++ -++++++ def forward( -++++++ self, -++++++ hidden_states: mindspore.Tensor, -++++++ attention_mask: Optional[mindspore.Tensor] = None, -++++++ position_ids: Optional[mindspore.Tensor] = None, -++++++ past_key_value: Optional[Cache] = None, -++++++ output_attentions: bool = False, -++++++ use_cache: bool = False, -++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # 1. 线性投射 Q, K, V -++++++ query_states = self.q_proj(hidden_states) -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++ # query: [B, S, H*D] -> [B, N1, S, D] -++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # 3. RoPE 旋转位置编码 -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++ if self.layer_idx is None: -++++++ raise ValueError( -++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ "with a layer index." -++++++ ) -++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++ if cache_position.shape[0] == 1: -++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++ kv_seq_len = past_seen_tokens + 1 -++++++ else: -++++++ # prefill 阶段:cache_position 是范围,使用其长度 -++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++ else: -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # 4. KV 缓存更新 -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ key_states, value_states = past_key_value.update( -++++++ key_states, value_states, self.layer_idx, cache_kwargs -++++++ ) -++++++ -++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++ if cache_position.shape[0] == 1: -++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++ kv_seq_len = key_states.shape[-2] -++++++ -++++++ # 5. [重要] 准备 Attention Mask -++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++ fa_attention_mask = None -++++++ if attention_mask is not None: -++++++ # 截取与当前key长度匹配的部分 -++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++ fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++ input_dtype = query_states.dtype -++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++ query_states = query_states.to(mindspore.float16) -++++++ key_states = key_states.to(mindspore.float16) -++++++ value_states = value_states.to(mindspore.float16) -++++++ -++++++ # 6. [核心] 调用 flash_attention_score 算子 -++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++ attn_output = mindspore.ops.flash_attention_score( -++++++ query=query_states, -++++++ key=key_states, -++++++ value=value_states, -++++++ head_num=self.num_heads, # 传入Q的头数(N1) -++++++ attn_mask=fa_attention_mask, -++++++ keep_prob=1.0 - self.attention_dropout, -++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ input_layout="BNSD", -++++++ sparse_mode=0 # 使用 defaultMask 模式 -++++++ ) -++++++ -++++++ # 恢复原始数据类型 -++++++ attn_output = attn_output.to(input_dtype) -++++++ -++++++ # 7. 调整输出形状 -++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = self.o_proj(attn_output) -++++++ -++++++ # FlashAttention 算子不直接返回注意力权重矩阵 -++++++ attn_weights = None -++++++ if output_attentions: -++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ # def forward( -++++++ # self, -++++++ # hidden_states: mindspore.Tensor, -++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++ # past_key_value: Optional[Cache] = None, -++++++ # output_attentions: bool = False, -++++++ # use_cache: bool = False, -++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ # bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # # 1. 线性投射 Q, K, V -++++++ # query_states = self.q_proj(hidden_states) -++++++ # key_states = self.k_proj(hidden_states) -++++++ # value_states = self.v_proj(hidden_states) -++++++ -++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # # 3. RoPE 旋转位置编码 -++++++ # kv_seq_len = key_states.shape[-2] -++++++ # if past_key_value is not None: -++++++ # if self.layer_idx is None: -++++++ # raise ValueError( -++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ # "with a layer index." -++++++ # ) -++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # # 4. KV 缓存更新 -++++++ # if past_key_value is not None: -++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ # key_states, value_states = past_key_value.update( -++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++ # ) -++++++ -++++++ # # 5. 准备 Attention Mask -++++++ # fa_attention_mask = None -++++++ # if attention_mask is not None: -++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++ # input_dtype = query_states.dtype -++++++ -++++++ # # 6. [核心] 调用 flash_attention_score 算子 -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # head_num=self.num_heads, -++++++ # attn_mask=fa_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ # input_layout="BNSD", -++++++ # sparse_mode=0, -++++++ # # <--- 修改点 2: 启用内部高精度计算 --- -++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++ # inner_precise=1 -++++++ # ) -++++++ -++++++ # # 恢复原始数据类型 -++++++ # attn_output = attn_output.to(input_dtype) -++++++ -++++++ # # 7. 调整输出形状 -++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ # attn_output = self.o_proj(attn_output) -++++++ -++++++ # attn_weights = None -++++++ # if output_attentions: -++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++ # return attn_output, attn_weights, past_key_value -++++++ -++++++ # def forward( -++++++ # self, -++++++ # hidden_states: mindspore.Tensor, -++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++ # past_key_value: Optional[Cache] = None, -++++++ # output_attentions: bool = False, -++++++ # use_cache: bool = False, -++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ # bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # query_states = self.q_proj(hidden_states) -++++++ # key_states = self.k_proj(hidden_states) -++++++ # value_states = self.v_proj(hidden_states) -++++++ -++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # kv_seq_len = key_states.shape[-2] -++++++ # if past_key_value is not None: -++++++ # if self.layer_idx is None: -++++++ # raise ValueError("`layer_idx` must be specified for caching") -++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # if past_key_value is not None: -++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ # key_states, value_states = past_key_value.update( -++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++ # ) -++++++ -++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++ -++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++++ # query_states = query_states / math.sqrt(self.head_dim) -++++++ # # <--- 修改结束 --- -++++++ -++++++ # fa_attention_mask = None -++++++ # if attention_mask is not None: -++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # input_dtype = query_states.dtype -++++++ -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, # 传入已经预先缩放过的 query -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # head_num=self.num_heads, -++++++ # attn_mask=fa_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++++ # input_layout="BNSD", -++++++ # sparse_mode=0, -++++++ # inner_precise=1 # 仍然保持内部高精度计算 -++++++ # ) -++++++ -++++++ # attn_output = attn_output.to(input_dtype) -++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ # attn_output = self.o_proj(attn_output) -++++++ -++++++ # attn_weights = None -++++++ # if output_attentions: -++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++++ -++++++ # return attn_output, attn_weights, past_key_value -++++++ -+++++ QWEN2MOE_ATTENTION_CLASSES = { -+++++ "eager": Qwen2MoeAttention, -++++++ "flash-attention": Qwen2MoeFlashAttention, -+++++ } -+++++ -+++++ -+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -++++++ #@dwj -++++++ # 只遍历激活的专家,而非全部专家 -+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- hidden_states = hidden_states.view(-1, hidden_dim) -+++++- # router_logits: (batch * sequence_length, n_experts) -+++++- router_logits = self.gate(hidden_states) -+++++- -+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- if self.norm_topk_prob: -+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- # we cast back to the input dtype -+++++- routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++- final_hidden_states = ops.zeros( -+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++++- ) -+++++- -+++++- # One hot encode the selected experts to create an expert mask -+++++- # this will be used to easily index which expert is going to be sollicitated -+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++++- -+++++- # Loop over all available experts in the model and perform the computation on each expert -+++++- for expert_idx in range(self.num_experts): -+++++- expert_layer = self.experts[expert_idx] -+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++++- -+++++- # Index the correct hidden states and compute the expert hidden state for -+++++- # the current expert. We need to make sure to multiply the output hidden -+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++++- if 0 not in idx.shape: -+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++++- -+++++- # However `index_add_` only support torch tensors for indexing so we'll use -+++++- # the `top_x` tensor here. -+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++++- -+++++- shared_expert_output = self.shared_expert(hidden_states) -+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++++- -+++++- final_hidden_states = final_hidden_states + shared_expert_output -++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++ num_tokens = hidden_states_reshaped.shape[0] -++++++ -++++++ router_logits = self.gate(hidden_states_reshaped) -++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++ if self.norm_topk_prob: -++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++ flat_selected_experts = selected_experts.flatten() -++++++ -++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++ token_indices = broadcasted_token_indices.flatten() -++++++ -++++++ active_experts = ops.unique(flat_selected_experts) -++++++ -++++++ for expert_idx_tensor in active_experts: -++++++ expert_idx = expert_idx_tensor.item() -++++++ expert_layer = self.experts[expert_idx] -++++++ -++++++ mask = (flat_selected_experts == expert_idx_tensor) -++++++ selected_token_indices = token_indices[mask] -++++++ selected_routing_weights = routing_weights.flatten()[mask] -++++++ -++++++ current_states = hidden_states_reshaped[selected_token_indices] -++++++ -++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++ -++++++ final_hidden_states = final_hidden_states.index_add( -++++++ dim=0, -++++++ index=selected_token_indices, -++++++ source=expert_output.to(hidden_states.dtype) -++++++ ) -++++++ -++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++ -+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++- return final_hidden_states, router_logits -++++++ final_hidden_states = final_hidden_states + shared_expert_output -++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++ -++++++ return final_hidden_states, router_logits -+++++ -+++++ -+++++ class Qwen2MoeDecoderLayer(nn.Module): -+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++++ -+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++ -+++++ if (layer_idx not in config.mlp_only_layers) and ( -+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++++ ): -+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++++ _skip_keys_device_placement = "past_key_values" -+++++ _supports_cache_class = True -++++++#lwx -++++++ # _supports_static_cache = True -+++++ -+++++ def _init_weights(self, module): -+++++ std = self.config.initializer_range -+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++++ return causal_mask -+++++ -+++++ -+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ _tied_weights_keys = ["lm_head.weight"] -+++++ -+++++ def __init__(self, config): -+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ self.num_experts_per_tok = config.num_experts_per_tok -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -++++++ # @lwx -++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++++ # self.generation_config.cache_implementation = "static" -++++++ self._warmed_up = False -++++++ -++++++ def warmup_moe_model(self): -++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++++ test_texts = [ -++++++ "warmup short", -++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++++ ] -++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++ if tokenizer is None: -++++++ from mindnlp.transformers import AutoTokenizer -++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++ self._warmup_tokenizer = tokenizer -++++++ -++++++ for text in test_texts: -++++++ inputs = tokenizer(text, return_tensors="ms") -++++++ with mindspore._no_grad(): -++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++++ -+++++ def get_input_embeddings(self): -+++++ return self.model.embed_tokens -+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++ ```""" -++++++ if not self._warmed_up: -++++++ self._warmed_up = True -++++++ self.warmup_moe_model() -+++++ -+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++++ output_router_logits = ( -+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ } -+++++ ) -+++++ return model_inputs -++++++# @lwx -++++++ # def _decode_one_tokens_logits( -++++++ # self, -++++++ # cur_token: mindspore.Tensor, -++++++ # input_pos: Optional[mindspore.Tensor], -++++++ # cache_position: mindspore.Tensor, -++++++ # past_key_values: StaticCache, -++++++ # ) -> mindspore.Tensor: -++++++ # """ -++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++++ -++++++ # Args: -++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++++ # input_pos: 输入位置信息,可选 -++++++ # cache_position: 当前token在cache中的位置,shape为(1,) -++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++++ -++++++ # Returns: -++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++++ # """ -++++++ # # 调用JIT编译的版本 -++++++ # return self.get_decode_one_tokens_logits( -++++++ # cur_token=cur_token, -++++++ # input_pos=input_pos, -++++++ # cache_position=cache_position, -++++++ # past_key_values=past_key_values, -++++++ # ) -++++++ -++++++ # @mindspore.jit(jit_level='O1') -++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++++ # """ -++++++ # JIT编译的函数,用于高效的单token解码 -++++++ # 使用JIT编译优化以支持静态shape和高效执行 -++++++ -++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++++ # """ -++++++ # outputs = self.model.forward( -++++++ # input_ids=cur_token, -++++++ # position_ids=input_pos, -++++++ # cache_position=cache_position, -++++++ # past_key_values=past_key_values, -++++++ # use_cache=True, -++++++ # return_dict=False, -++++++ # ) -++++++ -++++++ # hidden_states = outputs[0] -++++++ # logits = self.lm_head.forward(hidden_states) -++++++ # logits = logits.float() -++++++ -++++++ # return logits[:, -1, :] -++++++ -++++++ # def _sample( -++++++ # self, -++++++ # input_ids: mindspore.Tensor, -++++++ # logits_processor, -++++++ # stopping_criteria, -++++++ # generation_config, -++++++ # synced_devices: bool, -++++++ # streamer=None, -++++++ # logits_warper=None, -++++++ # **model_kwargs, -++++++ # ): -++++++ # """ -++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++++ # """ -++++++ # from ...generation.logits_process import LogitsProcessorList -++++++ # from ...generation.stopping_criteria import StoppingCriteriaList -++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++++ # from mindnlp.core import nn, ops, no_grad -++++++ # import numpy as np -++++++ -++++++ # # 检查是否使用 StaticCache -++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++++ # # 否则,直接调用父类方法 -++++++ # past_key_values = model_kwargs.get("past_key_values") -++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++++ -++++++ # if not isinstance(past_key_values, StaticCache): -++++++ # # 不使用 StaticCache,直接调用父类方法 -++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++++ # return super()._sample( -++++++ # input_ids=input_ids, -++++++ # logits_processor=logits_processor, -++++++ # stopping_criteria=stopping_criteria, -++++++ # generation_config=generation_config, -++++++ # synced_devices=synced_devices, -++++++ # streamer=streamer, -++++++ # logits_warper=logits_warper, -++++++ # **model_kwargs, -++++++ # ) -++++++ -++++++ # # 使用 StaticCache,进入自定义循环 -++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++++ # pad_token_id = generation_config._pad_token_tensor -++++++ # output_attentions = generation_config.output_attentions -++++++ # output_hidden_states = generation_config.output_hidden_states -++++++ # output_scores = generation_config.output_scores -++++++ # output_logits = generation_config.output_logits -++++++ # return_dict_in_generate = generation_config.return_dict_in_generate -++++++ # max_length = generation_config.max_length -++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++++ # do_sample = generation_config.do_sample -++++++ -++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++++ # raise ValueError( -++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++++ # f"{logits_warper})." -++++++ # ) -++++++ -++++++ # # init attention / hidden states / scores tuples -++++++ # scores = () if (return_dict_in_generate and output_scores) else None -++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++++ -++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++++ # encoder_hidden_states = ( -++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++++ # ) -++++++ -++++++ # # keep track of which sequences are already finished -++++++ # batch_size, cur_len = input_ids.shape -++++++ # this_peer_finished = False -++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++++ -++++++ # time_record = [] -++++++ # from ....utils.testing_utils import parse_flag_from_env -++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++++ -++++++ # while self._has_unfinished_sequences( -++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++++ # ): -++++++ # if _record_time: -++++++ # import time as time_module -++++++ # infer_start = time_module.time() -++++++ -++++++ # # prepare model inputs -++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++++ -++++++ # # prepare variable output controls -++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++++ -++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++++ # cur_cache_position = model_inputs.get("cache_position") -++++++ # cur_past_key_values = model_inputs.get("past_key_values") -++++++ # cur_input_ids = model_inputs.get("input_ids") -++++++ -++++++ # if (isinstance(cur_past_key_values, StaticCache) and -++++++ # cur_cache_position is not None and -++++++ # len(cur_cache_position.shape) > 0 and -++++++ # cur_cache_position.shape[0] == 1 and -++++++ # cur_input_ids is not None and -++++++ # cur_input_ids.shape[1] == 1): -++++++ # # 使用 JIT 优化的单 token 解码 -++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++++ # if not hasattr(self, '_jit_used'): -++++++ # self._jit_used = False -++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++++ -++++++ # next_token_logits = self.get_decode_one_tokens_logits( -++++++ # cur_token=cur_input_ids, -++++++ # input_pos=model_inputs.get("position_ids"), -++++++ # cache_position=cur_cache_position, -++++++ # past_key_values=cur_past_key_values, -++++++ # ) -++++++ -++++++ # # 标记已使用JIT(用于后续判断) -++++++ # if not self._jit_used: -++++++ # self._jit_used = True -++++++ -++++++ # # 构造兼容的输出对象 -++++++ # class JitOptimizedOutput: -++++++ # def __init__(self, logits, config): -++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++++ # self.config = config -++++++ # # 对于 JIT 优化路径,这些属性通常不需要 -++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++++ # self.attentions = None if not config.is_encoder_decoder else None -++++++ # self.cross_attentions = None -++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++++ # self.hidden_states = None if not config.is_encoder_decoder else None -++++++ -++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++++ # else: -++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++++ # outputs = self(**model_inputs, return_dict=True) -++++++ -++++++ # if synced_devices and this_peer_finished: -++++++ # continue -++++++ -++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++++ # next_token_logits = outputs.logits[:, -1, :] -++++++ -++++++ # # pre-process distribution -++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++++ # if do_sample: -++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++++ -++++++ # # Store scores, attentions and hidden_states when required -++++++ # if return_dict_in_generate: -++++++ # if output_scores: -++++++ # scores += (next_token_scores,) -++++++ # if output_logits: -++++++ # raw_logits += (next_token_logits,) -++++++ # if output_attentions: -++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++++ # decoder_attentions += (attn,) if attn is not None else (None,) -++++++ # if self.config.is_encoder_decoder: -++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++++ -++++++ # if output_hidden_states: -++++++ # hidden = ( -++++++ # outputs.decoder_hidden_states -++++++ # if self.config.is_encoder_decoder -++++++ # else outputs.hidden_states -++++++ # ) -++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++++ -++++++ # # token selection -++++++ # if do_sample: -++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++++ # else: -++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++++ -++++++ # # finished sentences should have their next token be a padding token -++++++ # if has_eos_stopping_criteria: -++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++++ -++++++ # # update generated ids, model inputs, and length for next step -++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++++ # if streamer is not None: -++++++ # streamer.put(next_tokens) -++++++ -++++++ # model_kwargs = self._update_model_kwargs_for_generation( -++++++ # outputs, -++++++ # model_kwargs, -++++++ # is_encoder_decoder=self.config.is_encoder_decoder, -++++++ # ) -++++++ -++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++++ # cur_len += 1 -++++++ -++++++ # if _record_time: -++++++ # import time as time_module -++++++ # infer_stop = time_module.time() -++++++ # time_record.append(infer_stop - infer_start) -++++++ -++++++ # del outputs -++++++ -++++++ # average_infer_time = None -++++++ # if time_record: -++++++ # if len(time_record) > 1: -++++++ # time_record.pop(0) -++++++ # average_infer_time = sum(time_record) / len(time_record) -++++++ # print(f'average inference time is: {average_infer_time}') -++++++ # print(f'inference time record: {time_record}') -++++++ -++++++ # if streamer is not None: -++++++ # streamer.end() -++++++ -++++++ # # 简单判断:打印是否使用了JIT路径 -++++++ # if hasattr(self, '_jit_used') and self._jit_used: -++++++ # print("[JIT] ✓ JIT optimization was used during generation") -++++++ # else: -++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++++ -++++++ # if return_dict_in_generate: -++++++ # if self.config.is_encoder_decoder: -++++++ # return GenerateEncoderDecoderOutput( -++++++ # sequences=input_ids, -++++++ # scores=scores, -++++++ # logits=raw_logits, -++++++ # encoder_attentions=encoder_attentions, -++++++ # encoder_hidden_states=encoder_hidden_states, -++++++ # decoder_attentions=decoder_attentions, -++++++ # cross_attentions=cross_attentions, -++++++ # decoder_hidden_states=decoder_hidden_states, -++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++ # average_infer_time=average_infer_time -++++++ # ) -++++++ # else: -++++++ # return GenerateDecoderOnlyOutput( -++++++ # sequences=input_ids, -++++++ # scores=scores, -++++++ # logits=raw_logits, -++++++ # attentions=decoder_attentions, -++++++ # hidden_states=decoder_hidden_states, -++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++ # average_infer_time=average_infer_time -++++++ # ) -++++++ # else: -++++++ # return input_ids -++++++ -++++++ # def _prepare_cache_for_generation( -++++++ # self, -++++++ # generation_config, -++++++ # model_kwargs, -++++++ # assistant_model, -++++++ # batch_size, -++++++ # max_cache_length, -++++++ # ): -++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++++ # generation_config.cache_implementation = "static" -++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++++ -++++++ # if generation_config.cache_implementation == "static": -++++++ # base_required_from_max_length = generation_config.max_length + 1 -++++++ # base_required = max(max_cache_length, base_required_from_max_length) -++++++ # min_cache_size = 50 -++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++++ # else: -++++++ # max_cache_length = max(base_required, min_cache_size) -++++++ -++++++ # original_max_cache_length = max_cache_length -++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++++ # print(f" - input max_cache_length: {original_max_cache_length}") -++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++++ # print(f" - final max_cache_length: {max_cache_length}") -++++++ -++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++ # if max_cache_length > self.config.max_position_embeddings: -++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++++ -++++++ # result = super()._prepare_cache_for_generation( -++++++ # generation_config=generation_config, -++++++ # model_kwargs=model_kwargs, -++++++ # assistant_model=assistant_model, -++++++ # batch_size=batch_size, -++++++ # max_cache_length=max_cache_length, -++++++ # ) -++++++ -++++++ # if generation_config.cache_implementation == "static": -++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++++ # created_cache = model_kwargs.get(cache_name) -++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++++ # if created_cache.max_cache_len < generation_config.max_length: -++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++++ -++++++ # return result -++++++ -++++++ -++++++ -+++++ -+++++ -+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++++-- -+++++2.27.0 -+++++ -++++-- -++++2.27.0 -++++ -+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+++new file mode 100644 -+++index 00000000..966529e4 -+++--- /dev/null -++++++ b/patches/0003-20261106secondcommit.patch -+++@@ -0,0 +1,2769 @@ -++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Thu, 6 Nov 2025 14:54:37 +0800 -++++Subject: [PATCH 3/3] 20261106secondcommit -++++ -++++--- -++++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -++++ patches/0001-20251104commit.patch | 1272 ----------------- -++++ 3 files changed, 528 insertions(+), 2032 deletions(-) -++++ delete mode 100644 patches/0001-20251104commit.patch -++++ -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index 73773c22..2f9192bf 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -++++ -++++ _CONFIG_FOR_DOC = "DeepseekConfig" -++++ -+++++_attn_mask_cache = {} -+++++ -+++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -+++++ q_len = batch_and_seq[1] -+++++ kv_len = batch_and_seq[1] + past_key_values_length -+++++ key = (batch_and_seq[0], q_len, kv_len) -+++++ -+++++ if key in _attn_mask_cache: -+++++ return _attn_mask_cache[key] -+++++ -+++++ mask = _prepare_4d_causal_attention_mask( -+++++ attention_mask, -+++++ batch_and_seq, -+++++ inputs_embeds, -+++++ past_key_values_length, -+++++ ) -+++++ _attn_mask_cache[key] = mask -+++++ return mask -++++ -++++ def _get_unpad_data(attention_mask): -++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -++++ return final_output -++++ -++++ -++++- @no_grad() -++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++- expert_cache = ops.zeros_like(x) -++++- idxs = flat_expert_indices.argsort() -++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++- token_idxs = idxs // self.num_experts_per_tok -++++- -++++- for i, end_idx in enumerate(tokens_per_expert): -++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++- if start_idx == end_idx: -++++- continue -++++- expert = self.experts[i] -++++- exp_token_idx = token_idxs[start_idx:end_idx] -++++- expert_tokens = x[exp_token_idx] -++++- expert_out = expert(expert_tokens) -++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++- -++++- return expert_cache -++++- -++++ # @no_grad() -++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++- # # expert_cache = torch.zeros_like(x) -++++- # # idxs = flat_expert_indices.argsort() -++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++- # # token_idxs = idxs // self.num_experts_per_tok -++++- # # for i, end_idx in enumerate(tokens_per_expert): -++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++- # # if start_idx == end_idx: -++++- # # continue -++++- # # expert = self.experts[i] -++++- # # exp_token_idx = token_idxs[start_idx:end_idx] -++++- # # expert_tokens = x[exp_token_idx] -++++- # # expert_out = expert(expert_tokens) -++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++- # # return expert_cache -+++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++ # expert_cache = ops.zeros_like(x) -++++ # idxs = flat_expert_indices.argsort() -++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++ -++++ # return expert_cache -++++- # @no_grad() -++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++- # expert_cache = ops.zeros_like(x) -+++++ -+++++ @no_grad() -+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++ """ -+++++ 优化版 MoE prefill: -+++++ - 批量张量化处理同一个 expert 的所有 token -+++++ - 跳过无 token 的专家 -+++++ - 保持结果完全一致 -+++++ """ -+++++ # 初始化输出缓存 -+++++ expert_cache = ops.zeros_like(x) -++++ -++++- # # 排序保证顺序一致 -++++- # idxs = flat_expert_indices.argsort() -++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++- # token_idxs = idxs // self.num_experts_per_tok -+++++ # 排序(确保 scatter_add 位置对应原逻辑) -+++++ idxs = flat_expert_indices.argsort() -+++++ sorted_expert_indices = flat_expert_indices[idxs] -+++++ sorted_token_indices = idxs // self.num_experts_per_tok -++++ -++++- # # 找出有 token 的专家 -++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++ # 每个 expert 的 token 数 -+++++ tokens_per_expert = sorted_expert_indices.bincount() -++++ -++++- # for i in active_experts.tolist(): -++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++- # end_idx = tokens_per_expert[i] -++++- # if start_idx == end_idx: # 没有 token -++++- # continue -+++++ # 找出有 token 的专家 -+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++++ -++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++- # expert_tokens = x[exp_token_idx] -++++- # expert_out = self.experts[i](expert_tokens) -++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++ for expert_id in active_experts.tolist(): -+++++ # 取该 expert 对应的排序后 token 区间 -+++++ start = (tokens_per_expert[:expert_id]).sum().item() -+++++ end = start + tokens_per_expert[expert_id].item() -++++ -++++- # expert_cache = mindspore.mint.scatter_add( -++++- # expert_cache, -++++- # 0, -++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++- # expert_out -++++- # ) -+++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -+++++ expert_tokens = x[token_idx] # 取输入向量 -++++ -++++- # return expert_cache -+++++ # 执行专家 MLP -+++++ expert_out = self.experts[expert_id](expert_tokens) -+++++ -+++++ # 按权重缩放 -+++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -+++++ -+++++ # 回写到缓存(等价 scatter_add) -+++++ expert_cache = mindspore.mint.scatter_add( -+++++ expert_cache, -+++++ 0, -+++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++ scaled_out -+++++ ) -+++++ -+++++ return expert_cache -+++++ -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # # expert_cache = torch.zeros_like(x) -+++++ # # idxs = flat_expert_indices.argsort() -+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++ # # if start_idx == end_idx: -+++++ # # continue -+++++ # # expert = self.experts[i] -+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # # expert_tokens = x[exp_token_idx] -+++++ # # expert_out = expert(expert_tokens) -+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++ # # return expert_cache -+++++ # expert_cache = ops.zeros_like(x) -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # if start_idx == end_idx: -+++++ # continue -+++++ # expert = self.experts[i] -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = expert(expert_tokens) -+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -+++++ # return expert_cache -+++++ # @no_grad() -+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++ # expert_cache = ops.zeros_like(x) -+++++ -+++++ # # 排序保证顺序一致 -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ # token_idxs = idxs // self.num_experts_per_tok -+++++ -+++++ # # 找出有 token 的专家 -+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++ -+++++ # for i in active_experts.tolist(): -+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ # end_idx = tokens_per_expert[i] -+++++ # if start_idx == end_idx: # 没有 token -+++++ # continue -+++++ -+++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++ # expert_tokens = x[exp_token_idx] -+++++ # expert_out = self.experts[i](expert_tokens) -+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++ -+++++ # expert_cache = mindspore.mint.scatter_add( -+++++ # expert_cache, -+++++ # 0, -+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++ # expert_out -+++++ # ) -+++++ -+++++ # return expert_cache -++++ -++++ -++++ -++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++- -++++ # class DeepseekFlashAttention(nn.Module): -++++ # """ -++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -+++++ -++++ Deepseek_ATTENTION_CLASSES = { -++++ "eager": DeepseekAttention, -++++ "flash-attention": DeepseekFlashAttention, -++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -++++ ) -++++ else: -++++ # 4d mask is passed through the layers -++++- attention_mask = _prepare_4d_causal_attention_mask( -+++++ # attention_mask = _prepare_4d_causal_attention_mask( -+++++ # attention_mask, -+++++ # (batch_size, seq_length), -+++++ # inputs_embeds, -+++++ # past_key_values_length, -+++++ # ) -+++++ #@dwj -+++++ attention_mask = get_cached_causal_mask( -++++ attention_mask, -++++ (batch_size, seq_length), -++++ inputs_embeds, -++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ # Initialize weights and apply final processing -++++ self.post_init() -++++ self.warm_up = False -+++++ #@dwj -+++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+++++ self.num_layers, -+++++ self.num_attention_heads, -+++++ self.head_dim, -+++++ batch_size=1, -+++++ max_length=self.max_length, -+++++ dtype=mindspore.float16 -+++++ ) -+++++ -+++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+++++ key_cache = [] -+++++ value_cache = [] -+++++ for _ in range(num_layers): -+++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++ key_cache.append(k) -+++++ value_cache.append(v) -+++++ return key_cache, value_cache -+++++ -++++ -++++ def warmup_moe_model_deep(self): -++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++index bced285c..ebd7782e 100644 -++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++++ -++++-Long_Prompt = False -++++-PROMPT_LENGTH_THRESHOLD = 128 -+++++Long_Prompt = 1 -+++++LONG_PROMPT_LENGTH_THRESHOLD = 128 -+++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -+++++ -+++++_causal_mask_cache = {} -+++++ -+++++def get_cached_causal_mask_with_cache_position( -+++++ attention_mask: mindspore.Tensor, -+++++ sequence_length: int, -+++++ target_length: int, -+++++ dtype: mindspore.dtype, -+++++ min_dtype: float, -+++++ cache_position: mindspore.Tensor, -+++++ batch_size: int, -+++++): -+++++ """ -+++++ 带缓存的 causal mask 构造函数 -+++++ """ -+++++ # q_len 是当前 query 长度 -+++++ q_len = sequence_length -+++++ # kv_len 是 target_length -+++++ kv_len = target_length -+++++ -+++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -+++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -+++++ -+++++ if key in _causal_mask_cache: -+++++ return _causal_mask_cache[key] -+++++ -+++++ # 调用原来的 mask 构造逻辑 -+++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++ attention_mask, -+++++ sequence_length=sequence_length, -+++++ target_length=target_length, -+++++ dtype=dtype, -+++++ min_dtype=min_dtype, -+++++ cache_position=cache_position, -+++++ batch_size=batch_size, -+++++ ) -+++++ # 缓存结果 -+++++ _causal_mask_cache[key] = causal_mask -+++++ return causal_mask -++++ -++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++++ def _prepare_4d_causal_attention_mask_with_cache_position( -++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++ -++++ -++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -+++++# class Qwen2MoeAttention(nn.Module): -+++++# """ -+++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+++++# and "Generating Long Sequences with Sparse Transformers". -+++++# """ -+++++ -+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++# super().__init__() -+++++# self.config = config -+++++# self.layer_idx = layer_idx -+++++# if layer_idx is None: -+++++# logger.warning_once( -+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++# "when creating this class." -+++++# ) -+++++ -+++++# self.hidden_size = config.hidden_size -+++++# self.num_heads = config.num_attention_heads -+++++# self.head_dim = self.hidden_size // self.num_heads -+++++# self.num_key_value_heads = config.num_key_value_heads -+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++# self.max_position_embeddings = config.max_position_embeddings -+++++# self.rope_theta = config.rope_theta -+++++# self.is_causal = True -+++++# self.attention_dropout = config.attention_dropout -+++++ -+++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++# raise ValueError( -+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++# f" and `num_heads`: {self.num_heads})." -+++++# ) -+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++ -+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++# self.head_dim, -+++++# max_position_embeddings=self.max_position_embeddings, -+++++# base=self.rope_theta, -+++++# ) -+++++ -+++++# def forward( -+++++# self, -+++++# hidden_states: mindspore.Tensor, -+++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++# position_ids: Optional[mindspore.Tensor] = None, -+++++# past_key_value: Optional[Cache] = None, -+++++# output_attentions: bool = False, -+++++# use_cache: bool = False, -+++++# cache_position: Optional[mindspore.Tensor] = None, -+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++ -+++++ -+++++ -+++++# bsz, q_len, _ = hidden_states.shape -+++++ -+++++# query_states = self.q_proj(hidden_states) -+++++# key_states = self.k_proj(hidden_states) -+++++# value_states = self.v_proj(hidden_states) -+++++ -+++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++ -+++++# kv_seq_len = key_states.shape[-2] -+++++# if past_key_value is not None: -+++++# if self.layer_idx is None: -+++++# raise ValueError( -+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++# "with a layer index." -+++++# ) -+++++# if isinstance(past_key_value, StaticCache): -+++++# kv_seq_len = key_states.shape[-2] -+++++# else: -+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++# if past_key_value is not None: -+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++# if isinstance(past_key_value, StaticCache): -+++++# kv_seq_len = key_states.shape[-2] -+++++ -+++++# # repeat k/v heads if n_kv_heads < n_heads -+++++# key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++# value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++ -+++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++ -+++++# if attention_mask is not None: -+++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++# attn_weights = attn_weights + causal_mask -+++++ -+++++# # upcast attention to fp32 -+++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++++# attn_output = ops.matmul(attn_weights, value_states) -+++++ -+++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++++# raise ValueError( -+++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+++++# f" {attn_output.shape}" -+++++# ) -+++++ -+++++# attn_output = ops.transpose(attn_output, 1, 2) -+++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++ -+++++# attn_output = self.o_proj(attn_output) -+++++# # @lwx -+++++ -+++++# # max_seq_len = self.max_position_embeddings # 2048 -+++++ -+++++# # if attention_mask is not None: -+++++# # # attention_mask: [B, 1, Sq, Sk] -+++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++ -+++++# # # pad 到 [max_seq_len, max_seq_len] -+++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++# # global_attention_mask = padded_mask -+++++# # else: -+++++# # global_attention_mask = None -+++++ -+++++ -+++++# # sparse_mode=3 -+++++# # attn_output = mindspore.ops.flash_attention_score( -+++++# # query=query_states, -+++++# # key=key_states, -+++++# # value=value_states, -+++++# # real_shift=None, -+++++# # padding_mask=None, -+++++ -+++++# # head_num=self.num_heads, -+++++# # attn_mask=global_attention_mask, -+++++# # keep_prob=1.0 - self.attention_dropout, -+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++# # input_layout="BNSD", -+++++# # pre_tokens=2147483647, -+++++# # next_tokens=2147483647, -+++++# # inner_precise=0, -+++++# # drop_mask=None, -+++++# # prefix=None, -+++++# # actual_seq_qlen=None, -+++++# # actual_seq_kvlen=None, -+++++# # sparse_mode=sparse_mode, -+++++# # ) -+++++# if not output_attentions: -+++++# attn_weights = None -+++++ -+++++# return attn_output, attn_weights, past_key_value -+++++ -++++ class Qwen2MoeAttention(nn.Module): -++++ """ -++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++++- and "Generating Long Sequences with Sparse Transformers". -++++- """ -+++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -++++ -+++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -+++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -+++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -+++++ -+++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -+++++ """ -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++ super().__init__() -++++ self.config = config -++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -++++ if layer_idx is None: -++++ logger.warning_once( -++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++ "when creating this class." -++++ ) -++++ -++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -++++ use_cache: bool = False, -++++ cache_position: Optional[mindspore.Tensor] = None, -++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++- -++++ -++++- -+++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -++++ bsz, q_len, _ = hidden_states.shape -++++ -++++ query_states = self.q_proj(hidden_states) -++++ key_states = self.k_proj(hidden_states) -++++ value_states = self.v_proj(hidden_states) -++++ -++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++- -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ -++++ kv_seq_len = key_states.shape[-2] -++++ if past_key_value is not None: -++++- if self.layer_idx is None: -++++- raise ValueError( -++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++- "with a layer index." -++++- ) -++++- if isinstance(past_key_value, StaticCache): -++++- kv_seq_len = key_states.shape[-2] -++++- else: -++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++ -++++ if past_key_value is not None: -++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++ -+++++ # --- 2. 动态调度核心注意力计算 --- -+++++ global Long_Prompt -+++++ if Long_Prompt >= 1: -+++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -+++++ fa_attention_mask = None -+++++ if attention_mask is not None: -+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++ fa_attention_mask = (mask_slice != 0) -+++++ -+++++ attn_output = mindspore.ops.flash_attention_score( -+++++ query=query_states, -+++++ key=key_states, -+++++ value=value_states, -+++++ head_num=self.num_heads, -+++++ attn_mask=fa_attention_mask, -+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -+++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ input_layout="BNSD", -+++++ sparse_mode=0, -+++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -+++++ ) -++++ -++++- if isinstance(past_key_value, StaticCache): -++++- kv_seq_len = key_states.shape[-2] -+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = self.o_proj(attn_output) -+++++ attn_weights = None -+++++ if output_attentions: -+++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++++ -++++- # repeat k/v heads if n_kv_heads < n_heads -++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++++- -++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++ else: -+++++ # --- Eager Attention 路径 (用于短序列和解码) --- -+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++ -+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++ -++++- if attention_mask is not None: -++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++- attn_weights = attn_weights + causal_mask -+++++ if attention_mask is not None: -+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++ attn_weights = attn_weights + causal_mask -++++ -++++- # upcast attention to fp32 -++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++- attn_output = ops.matmul(attn_weights, value_states) -+++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++++ attn_output = ops.matmul(attn_weights, value_states) -++++ -++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++- raise ValueError( -++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++++- f" {attn_output.shape}" -++++- ) -+++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++++ raise ValueError( -+++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -+++++ ) -++++ -++++- attn_output = ops.transpose(attn_output, 1, 2) -++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = ops.transpose(attn_output, 1, 2) -+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = self.o_proj(attn_output) -++++ -++++- attn_output = self.o_proj(attn_output) -++++- # @lwx -+++++ if not output_attentions: -+++++ attn_weights = None -++++ -++++- # max_seq_len = self.max_position_embeddings # 2048 -++++- -++++- # if attention_mask is not None: -++++- # # attention_mask: [B, 1, Sq, Sk] -++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++- -++++- # # pad 到 [max_seq_len, max_seq_len] -++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++- # global_attention_mask = padded_mask -++++- # else: -++++- # global_attention_mask = None -++++- -++++- -++++- # sparse_mode=3 -++++- # attn_output = mindspore.ops.flash_attention_score( -++++- # query=query_states, -++++- # key=key_states, -++++- # value=value_states, -++++- # real_shift=None, -++++- # padding_mask=None, -++++- -++++- # head_num=self.num_heads, -++++- # attn_mask=global_attention_mask, -++++- # keep_prob=1.0 - self.attention_dropout, -++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++++- # input_layout="BNSD", -++++- # pre_tokens=2147483647, -++++- # next_tokens=2147483647, -++++- # inner_precise=0, -++++- # drop_mask=None, -++++- # prefix=None, -++++- # actual_seq_qlen=None, -++++- # actual_seq_kvlen=None, -++++- # sparse_mode=sparse_mode, -++++- # ) -++++- if not output_attentions: -++++- attn_weights = None -++++- -++++ return attn_output, attn_weights, past_key_value -++++ -++++- -++++ # class Qwen2MoeFlashAttention(nn.Module): -++++ # """ -++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -++++ # return final_hidden_states, router_logits -++++ -++++ -++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++-# """ -++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 -++++-# """ -++++-# def __init__(self, config: Qwen2MoeConfig): -++++-# super().__init__() -++++-# self.num_experts = config.num_experts -++++-# self.top_k = config.num_experts_per_tok -++++-# self.norm_topk_prob = config.norm_topk_prob -++++- -++++-# # 门控网络 -++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++-# # 专家列表 -++++-# self.experts = nn.ModuleList( -++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++-# ) -++++-# # 共享专家 -++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-# @no_grad() -++++-# def _moe_infer_decode( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# """ -++++-# 【解码路径】针对 sequence_length=1 的极致优化。 -++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++++-# """ -++++-# batch_size, hidden_dim = hidden_states.shape -++++- -++++-# expert_outputs_list = [ -++++-# ops.cat([ -++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++-# ], dim=0) -++++-# for i in range(batch_size) -++++-# ] -++++- -++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++++-# # shape: (batch_size, top_k, hidden_dim) -++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++- -++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++- -++++-# return moe_output.squeeze(1) -++++- -++++-# @no_grad() -++++-# def _moe_infer_prefill( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# """ -++++-# 【预填充路径】针对 sequence_length > 1 的优化。 -++++-# 按专家对 Token 进行分组,并进行批处理。 -++++-# """ -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens = hidden_states.shape[0] -++++-# flat_selected_experts = selected_experts.flatten() -++++- -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++- -++++-# active_experts = ops.unique(flat_selected_experts) -++++- -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++- -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++-# selected_token_indices = token_indices[mask] -++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++- -++++-# current_states = hidden_states[selected_token_indices] -++++- -++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++- -++++-# moe_output = moe_output.index_add( -++++-# dim=0, -++++-# index=selected_token_indices, -++++-# source=expert_output.to(hidden_states.dtype) -++++-# ) -++++-# return moe_output -++++- -++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-# """ -++++-# 顶层 forward 方法,作为智能分发器。 -++++-# """ -++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- -++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-# router_logits = self.gate(hidden_states_reshaped) -++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- -++++-# if self.norm_topk_prob: -++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- -++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++-# moe_output = None -++++-# # 在推理时,根据序列长度选择最优路径 -++++-# if not self.training: -++++-# if sequence_length == 1: -++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++-# else: -++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++-# else: -++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++++-# raise NotImplementedError("Training path is not implemented.") -++++- -++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++++- -++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++++- -++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++++- -++++-# return final_hidden_states, router_logits -++++- -++++- -++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++-# """ -++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++++-# """ -++++-# def __init__(self, config: Qwen2MoeConfig): -++++-# super().__init__() -++++-# self.num_experts = config.num_experts -++++-# self.top_k = config.num_experts_per_tok -++++-# self.norm_topk_prob = config.norm_topk_prob -++++- -++++-# # 门控网络 -++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++-# # 专家列表 -++++-# self.experts = nn.ModuleList( -++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++-# ) -++++-# # 共享专家 -++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-# @no_grad() -++++-# def _moe_infer_decode( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# batch_size, _ = hidden_states.shape -++++-# expert_outputs_list = [ -++++-# ops.cat([ -++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++-# ], dim=0) -++++-# for i in range(batch_size) -++++-# ] -++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++-# return moe_output.squeeze(1) -++++- -++++-# @no_grad() -++++-# def _moe_infer_prefill( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens = hidden_states.shape[0] -++++-# flat_selected_experts = selected_experts.flatten() -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++-# active_experts = ops.unique(flat_selected_experts) -++++- -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++-# selected_token_indices = token_indices[mask] -++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++-# current_states = hidden_states[selected_token_indices] -++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++-# moe_output = moe_output.index_add( -++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++-# ) -++++-# return moe_output -++++- -++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-# """ -++++-# 顶层 forward 方法,作为智能分发器。 -++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++++-# """ -++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- -++++-# # 1. 门控计算 (通用逻辑) -++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-# router_logits = self.gate(hidden_states_reshaped) -++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- -++++-# if self.norm_topk_prob: -++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- -++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++-# # 2. 智能分发到最优 MoE 路径 -++++-# moe_output = None -++++-# if not self.training: -++++-# if sequence_length == 1: -++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++-# else: -++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++-# else: -++++-# raise NotImplementedError("Training path is not implemented.") -++++- -++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++- -++++-# # 4. 合并 MoE 输出和共享专家输出 -++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++- -++++-# # 5. 恢复原始形状并返回 -++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++- -++++-# return final_hidden_states, router_logits -++++- -++++-# prefill fastest -++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++-# """ -++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++++-# """ -++++-# def __init__(self, config: Qwen2MoeConfig): -++++-# super().__init__() -++++-# self.num_experts = config.num_experts -++++-# self.top_k = config.num_experts_per_tok -++++-# self.norm_topk_prob = config.norm_topk_prob -++++- -++++-# # 门控网络 -++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++-# # 专家列表 -++++-# self.experts = nn.ModuleList( -++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++-# ) -++++-# # 共享专家 -++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-# @no_grad() -++++-# def _moe_infer_dispatch( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# """ -++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++++-# """ -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens, _ = hidden_states.shape -++++- -++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++++-# flat_selected_experts = selected_experts.flatten() -++++-# flat_routing_weights = routing_weights.flatten() -++++- -++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++- -++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++++-# active_experts = ops.unique(flat_selected_experts) -++++- -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++- -++++-# # 找到所有分配给该专家的 token -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++- -++++-# # 使用 mask 选取对应的 token 和权重 -++++-# current_token_indices = token_indices[mask] -++++-# current_routing_weights = flat_routing_weights[mask] -++++-# current_hidden_states = hidden_states[current_token_indices] -++++- -++++-# # 对这些 token 进行批处理 -++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++- -++++-# # 使用 index_add 将结果精确地加回到对应位置 -++++-# moe_output = moe_output.index_add( -++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++++-# ) -++++-# return moe_output -++++- -++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-# """ -++++-# 顶层 forward 方法,作为智能分发器。 -++++-# """ -++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- -++++-# # 1. 门控计算 -++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-# router_logits = self.gate(hidden_states_reshaped) -++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- -++++-# if self.norm_topk_prob: -++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- -++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++- -++++-# # 2. 调用统一的 MoE 计算内核 -++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++++- -++++-# # 3. 统一处理共享专家 -++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++- -++++-# # 4. 合并输出 -++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++- -++++-# # 5. 恢复原始形状并返回 -++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++- -++++-# return final_hidden_states, router_logits -++++- -++++- -++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++-# """ -++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++-# 【最终高性能与高精度版】: -++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++++-# 3. 这样实现了速度和准确性的两全其美。 -++++-# """ -++++-# def __init__(self, config: Qwen2MoeConfig): -++++-# super().__init__() -++++-# self.num_experts = config.num_experts -++++-# self.top_k = config.num_experts_per_tok -++++-# self.norm_topk_prob = config.norm_topk_prob -++++- -++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++-# self.experts = nn.ModuleList( -++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++-# ) -++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-# @no_grad() -++++-# def _moe_infer_decode( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# """ -++++-# 【解码路径】极致优化版:bmm + 高精度累加。 -++++-# """ -++++-# original_dtype = hidden_states.dtype -++++-# batch_size, _ = hidden_states.shape -++++- -++++-# expert_outputs_list = [ -++++-# ops.cat([ -++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++-# ], dim=0) -++++-# for i in range(batch_size) -++++-# ] -++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++- -++++-# # 在 float32 下执行 bmm,得到高精度结果 -++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++- -++++-# # 将高精度结果转换回原始数据类型 -++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++++- -++++-# return moe_output -++++- -++++-# @no_grad() -++++-# def _moe_infer_prefill( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# selected_experts: mindspore.Tensor, -++++-# routing_weights: mindspore.Tensor -++++-# ) -> mindspore.Tensor: -++++-# """ -++++-# 【预填充路径】与原始实现一致,结果精确。 -++++-# """ -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens, _ = hidden_states.shape -++++-# flat_selected_experts = selected_experts.flatten() -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++-# active_experts = ops.unique(flat_selected_experts) -++++- -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++-# selected_token_indices = token_indices[mask] -++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++-# current_states = hidden_states[selected_token_indices] -++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++-# moe_output = moe_output.index_add( -++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++-# ) -++++-# return moe_output -++++- -++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++- -++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-# router_logits = self.gate(hidden_states_reshaped) -++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++- -++++-# if self.norm_topk_prob: -++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- -++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++++-# # 如果模型主体是 float16,后续再转换 -++++- -++++-# moe_output = None -++++-# if not self.training: -++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++++-# # _moe_infer_decode 内部会处理好类型转换 -++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++++-# if sequence_length == 1: -++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++-# else: -++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++-# else: -++++-# raise NotImplementedError("Training path is not implemented.") -++++- -++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++- -++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++- -++++-# return final_hidden_states, router_logits -++++- -++++- -++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++-# """ -++++-# 【融合版】一个混合专家模块,内置两种推理策略, -++++-# 由外部全局变量 `Long_Prompt` 控制: -++++- -++++-# - if Long_Prompt is True: 【精度优先模式】 -++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++++-# 适用于处理长序列,避免误差累积。 -++++- -++++-# - if Long_Prompt is False: 【速度优先模式】 -++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 -++++-# """ -++++-# def __init__(self, config: Qwen2MoeConfig): -++++-# super().__init__() -++++-# self.num_experts = config.num_experts -++++-# self.top_k = config.num_experts_per_tok -++++-# self.norm_topk_prob = config.norm_topk_prob -++++- -++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++-# self.experts = nn.ModuleList( -++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++-# ) -++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-# # --- 速度优先模式的辅助函数 --- -++++-# @no_grad() -++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++-# original_dtype = hidden_states.dtype -++++-# batch_size, _ = hidden_states.shape -++++-# expert_outputs_list = [ -++++-# ops.cat([ -++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++-# ], dim=0) -++++-# for i in range(batch_size) -++++-# ] -++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++-# weights_fp32 = routing_weights.to(mindspore.float32) -++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++-# return moe_output_fp32.squeeze(1).to(original_dtype) -++++- -++++-# @no_grad() -++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens, _ = hidden_states.shape -++++-# flat_selected_experts = selected_experts.flatten() -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++-# active_experts = ops.unique(flat_selected_experts) -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++-# selected_token_indices = token_indices[mask] -++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++-# current_states = hidden_states[selected_token_indices] -++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++++-# return moe_output -++++- -++++-# # --- 精度优先模式的辅助函数 --- -++++-# @no_grad() -++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++-# moe_output = ops.zeros_like(hidden_states) -++++-# num_tokens, _ = hidden_states.shape -++++-# flat_selected_experts = selected_experts.flatten() -++++-# flat_routing_weights = routing_weights.flatten() -++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++-# active_experts = ops.unique(flat_selected_experts) -++++-# for expert_idx_tensor in active_experts: -++++-# expert_idx = expert_idx_tensor.item() -++++-# expert_layer = self.experts[expert_idx] -++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++-# current_token_indices = token_indices[mask] -++++-# current_routing_weights = flat_routing_weights[mask] -++++-# current_hidden_states = hidden_states[current_token_indices] -++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++-# return moe_output -++++- -++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-# # 声明我们将要使用一个在模块外部定义的全局变量 -++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++++-# global Long_Prompt -++++- -++++-# # 1. 门控计算 (所有模式通用) -++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-# router_logits = self.gate(hidden_states_reshaped) -++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++-# if self.norm_topk_prob: -++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++- -++++-# moe_output = None -++++-# if not self.training: -++++-# # 根据 Long_Prompt 标志选择模式 -++++-# if Long_Prompt: -++++-# # --- 精度优先模式 --- -++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++-# else: -++++-# # --- 速度优先模式 --- -++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++-# if sequence_length == 1: -++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++-# else: -++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++-# else: -++++-# raise NotImplementedError("Training path is not implemented.") -++++- -++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++- -++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++- -++++-# return final_hidden_states, router_logits -++++- -++++ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ """ -++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++ return moe_output_fp32.squeeze(1).to(original_dtype) -++++ -+++++ # @no_grad() -+++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++ # num_tokens, _ = hidden_states.shape -+++++ # flat_selected_experts = selected_experts.flatten() -+++++ # sorted_expert_indices = flat_selected_experts.argsort() -+++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++++ # original_token_indices = sorted_expert_indices // self.top_k -+++++ # moe_output = ops.zeros_like(hidden_states) -+++++ # current_token_offset = 0 -+++++ # for i in range(self.num_experts): -+++++ # expert_token_count = tokens_per_expert[i] - current_token_offset -+++++ # if expert_token_count == 0: -+++++ # continue -+++++ # end_offset = current_token_offset + expert_token_count -+++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++++ # expert_hidden_states = hidden_states[expert_original_token_indices] -+++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++ # current_token_offset += expert_token_count -+++++ # return moe_output -+++++ -++++ @no_grad() -++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++- num_tokens, _ = hidden_states.shape -++++- flat_selected_experts = selected_experts.flatten() -++++- sorted_expert_indices = flat_selected_experts.argsort() -++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++- original_token_indices = sorted_expert_indices // self.top_k -+++++ """ -+++++ 优化版 MoE prefill (速度优先模式): -+++++ - 批量张量化处理同一个 expert 的所有 token -+++++ - 跳过无 token 的专家 -+++++ - 保持结果完全一致 -+++++ """ -++++ moe_output = ops.zeros_like(hidden_states) -++++- current_token_offset = 0 -++++- for i in range(self.num_experts): -++++- expert_token_count = tokens_per_expert[i] - current_token_offset -++++- if expert_token_count == 0: -++++- continue -++++- end_offset = current_token_offset + expert_token_count -++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++- expert_hidden_states = hidden_states[expert_original_token_indices] -++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++- current_token_offset += expert_token_count -+++++ -+++++ flat_selected_experts = selected_experts.flatten() -+++++ flat_routing_weights = routing_weights.flatten() -+++++ -+++++ idxs = flat_selected_experts.argsort() -+++++ sorted_expert_indices = flat_selected_experts[idxs] -+++++ sorted_token_indices = idxs // self.top_k -+++++ -+++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -+++++ -+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+++++ -+++++ for expert_id in active_experts.tolist(): -+++++ start = int(tokens_per_expert[:expert_id].sum().item()) -+++++ end = start + int(tokens_per_expert[expert_id].item()) -+++++ -+++++ token_idx = sorted_token_indices[start:end] -+++++ expert_tokens = hidden_states[token_idx] -+++++ -+++++ expert_out = self.experts[expert_id](expert_tokens) -+++++ -+++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -+++++ -+++++ moe_output = mindspore.mint.scatter_add( -+++++ moe_output, -+++++ 0, -+++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -+++++ scaled_out.to(hidden_states.dtype) -+++++ ) -+++++ -++++ return moe_output -++++ -+++++ -++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++++ @no_grad() -++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++ -++++ moe_output = None -++++- if Long_Prompt: -++++- # --- 精度优先模式 (ACCURACY MODE) --- -++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ # if Long_Prompt==0: -+++++ # # --- 精度优先模式 (ACCURACY MODE) --- -+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ # else: -+++++ # # --- 速度优先模式 (SPEED MODE) --- -+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++ # if sequence_length == 1: -+++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ # else: -+++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ -+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++ if sequence_length == 1: -+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++ else: -++++- # --- 速度优先模式 (SPEED MODE) --- -++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++- if sequence_length == 1: -++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++- else: -++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++- -+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ -++++ -++++ # 3. 共享专家计算与合并 (所有模式通用) -++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++ -++++ return final_hidden_states, router_logits -++++ -+++++ -++++ class Qwen2MoeDecoderLayer(nn.Module): -++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++++ super().__init__() -++++ self.hidden_size = config.hidden_size -++++ -++++- # if Long_Prompt: -++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++- # else: -+++++ # if Long_Prompt == 2: -++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++ # else: -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ -++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++ -++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++ ) -++++ -++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++ # attention_mask, -+++++ # sequence_length=sequence_length, -+++++ # target_length=target_length, -+++++ # dtype=dtype, -+++++ # min_dtype=min_dtype, -+++++ # cache_position=cache_position, -+++++ # batch_size=input_tensor.shape[0], -+++++ # ) -+++++ #@dwj -+++++ causal_mask = get_cached_causal_mask_with_cache_position( -++++ attention_mask, -++++ sequence_length=sequence_length, -++++ target_length=target_length, -++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++++ """ -++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -+++++ _causal_mask_cache.clear() -++++ -++++ input_ids = kwargs.get("input_ids") -++++ if input_ids is None and args: -++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ -++++ if input_ids is not None: -++++ prompt_length = input_ids.shape[1] -++++- -++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: -++++- Long_Prompt = True -+++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -+++++ Long_Prompt = 2 -+++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -+++++ Long_Prompt = 0 -++++ else: -++++- Long_Prompt = False -+++++ Long_Prompt = 1 -+++++ -++++ -++++ return super().generate(*args, **kwargs) -++++ -++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++ dtype = self.lm_head.weight.dtype -++++ min_dtype = float(ops.finfo(dtype).min) -++++ -++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++ # attention_mask, -+++++ # sequence_length=sequence_length, -+++++ # target_length=past_key_values.get_max_length(), -+++++ # dtype=dtype, -+++++ # min_dtype=min_dtype, -+++++ # cache_position=cache_position, -+++++ # batch_size=batch_size, -+++++ # ) -+++++ -+++++ #@dwj -+++++ attention_mask = get_cached_causal_mask_with_cache_position( -++++ attention_mask, -++++ sequence_length=sequence_length, -++++ target_length=past_key_values.get_max_length(), -++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++deleted file mode 100644 -++++index 6dfb5b93..00000000 -++++--- a/patches/0001-20251104commit.patch -+++++++ /dev/null -++++@@ -1,1272 +0,0 @@ -++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++-From: Pinoeer-kingxi <13022943007@163.com> -++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++-Subject: [PATCH] 20251104commit -++++- -++++---- -++++- mindnlp/transformers/cache_utils.py | 28 +- -++++- .../models/deepseek/modeling_deepseek.py | 149 ++- -++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++- 3 files changed, 976 insertions(+), 87 deletions(-) -++++- -++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++-index cadd2e04..02f8d4be 100644 -++++---- a/mindnlp/transformers/cache_utils.py -++++-+++ b/mindnlp/transformers/cache_utils.py -++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++- # k_out[:, :, cache_position] = key_states -++++- # v_out[:, :, cache_position] = value_states -++++-- if ON_ORANGE_PI: -++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++-- else: -++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++-- -++++-+ # if ON_ORANGE_PI: -++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++-+ # else: -++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 -++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++-+ if cache_position.ndim > 1: -++++-+ cache_position = cache_position.flatten() -++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) -++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++-+ cache_position = cache_position.int() -++++-+ -++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++-+ k_out[:, :, cache_position] = key_states -++++-+ v_out[:, :, cache_position] = value_states -++++-+ -++++- return k_out, v_out -++++- -++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++-index c695b944..d8303e45 100644 -++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++- # Copied from transformers.models.llama.modeling_llama.rotate_half -++++- def rotate_half(x): -++++- """Rotates half the hidden dims of the input.""" -++++-- x1 = x[..., : x.shape[-1] // 2] -++++-- x2 = x[..., x.shape[-1] // 2 :] -++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++-+ # x1 = x[..., : x.shape[-1] // 2] -++++-+ # x2 = x[..., x.shape[-1] // 2 :] -++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++- return ops.cat((-x2, x1), dim=-1) -++++- -++++- -++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++- if self.training: -++++- raise NotImplementedError("Training is not supported yet.") -++++- else: -++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++-- if self.config.n_shared_experts is not None: -++++-- y = y + self.shared_experts(identity) -++++-- return y -++++-+ # @lwx -++++-+ if orig_shape[1] == 1: -++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++-+ y=y.view(*orig_shape) -++++-+ if self.config.n_shared_experts is not None: -++++-+ y = y + self.shared_experts(identity) -++++-+ return y -++++-+ else: -++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++-+ if self.config.n_shared_experts is not None: -++++-+ y = y + self.shared_experts(identity) -++++-+ return y -++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++-+ # if self.config.n_shared_experts is not None: -++++-+ # y = y + self.shared_experts(identity) -++++-+ # return y -++++-+ -++++-+ @no_grad() -++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++-+ -++++-+ expert_cache = ops.zeros_like(x) -++++-+ for i in range(self.num_experts_per_tok): -++++-+ expert_id = flat_expert_indices[i].item() -++++-+ weight = flat_expert_weights[i].item() -++++-+ expert = self.experts[expert_id] -++++-+ expert_out = expert(x) -++++-+ expert_cache += expert_out * weight -++++-+ return expert_cache -++++- -++++- @no_grad() -++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++-- # expert_cache = torch.zeros_like(x) -++++-- # idxs = flat_expert_indices.argsort() -++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++-- # token_idxs = idxs // self.num_experts_per_tok -++++-- # for i, end_idx in enumerate(tokens_per_expert): -++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++-- # if start_idx == end_idx: -++++-- # continue -++++-- # expert = self.experts[i] -++++-- # exp_token_idx = token_idxs[start_idx:end_idx] -++++-- # expert_tokens = x[exp_token_idx] -++++-- # expert_out = expert(expert_tokens) -++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++-- # return expert_cache -++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++- expert_cache = ops.zeros_like(x) -++++- idxs = flat_expert_indices.argsort() -++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++- token_idxs = idxs // self.num_experts_per_tok -++++-+ -++++- for i, end_idx in enumerate(tokens_per_expert): -++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++- if start_idx == end_idx: -++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++- expert_out = expert(expert_tokens) -++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++-+ -++++- return expert_cache -++++-+ -++++-+ # @no_grad() -++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++-+ # # expert_cache = torch.zeros_like(x) -++++-+ # # idxs = flat_expert_indices.argsort() -++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++-+ # # token_idxs = idxs // self.num_experts_per_tok -++++-+ # # for i, end_idx in enumerate(tokens_per_expert): -++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++-+ # # if start_idx == end_idx: -++++-+ # # continue -++++-+ # # expert = self.experts[i] -++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++-+ # # expert_tokens = x[exp_token_idx] -++++-+ # # expert_out = expert(expert_tokens) -++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++-+ # # return expert_cache -++++-+ # expert_cache = ops.zeros_like(x) -++++-+ # idxs = flat_expert_indices.argsort() -++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++-+ # token_idxs = idxs // self.num_experts_per_tok -++++-+ -++++-+ # for i, end_idx in enumerate(tokens_per_expert): -++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++-+ # if start_idx == end_idx: -++++-+ # continue -++++-+ # expert = self.experts[i] -++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++++-+ # expert_tokens = x[exp_token_idx] -++++-+ # expert_out = expert(expert_tokens) -++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++-+ -++++-+ # return expert_cache -++++-+ # @no_grad() -++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++-+ # expert_cache = ops.zeros_like(x) -++++-+ -++++-+ # # 排序保证顺序一致 -++++-+ # idxs = flat_expert_indices.argsort() -++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++-+ # token_idxs = idxs // self.num_experts_per_tok -++++-+ -++++-+ # # 找出有 token 的专家 -++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++-+ -++++-+ # for i in active_experts.tolist(): -++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++-+ # end_idx = tokens_per_expert[i] -++++-+ # if start_idx == end_idx: # 没有 token -++++-+ # continue -++++-+ -++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++++-+ # expert_tokens = x[exp_token_idx] -++++-+ # expert_out = self.experts[i](expert_tokens) -++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++-+ -++++-+ # expert_cache = mindspore.mint.scatter_add( -++++-+ # expert_cache, -++++-+ # 0, -++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++-+ # expert_out -++++-+ # ) -++++-+ -++++-+ # return expert_cache -++++-+ -++++-+ -++++- -++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++- # """ -++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++- -++++- # Initialize weights and apply final processing -++++- self.post_init() -++++-+ self.warm_up = False -++++-+ -++++-+ def warmup_moe_model_deep(self): -++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++-+ test_texts = [ -++++-+ "warmup short", -++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++-+ ] -++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++-+ if tokenizer is None: -++++-+ from mindnlp.transformers import AutoTokenizer -++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++-+ self._warmup_tokenizer = tokenizer -++++-+ -++++-+ for text in test_texts: -++++-+ inputs = tokenizer(text, return_tensors="ms") -++++-+ with mindspore._no_grad(): -++++-+ _ = self(**inputs, use_cache=False) -++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++- -++++- def get_input_embeddings(self): -++++- return self.model.embed_tokens -++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++- ```""" -++++-+ if not self.warm_up: -++++-+ self.warm_up = True -++++-+ self.warmup_moe_model_deep() -++++-+ -++++- output_attentions = ( -++++- output_attentions -++++- if output_attentions is not None -++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++-index 3cbf820e..d4c6b651 100644 -++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++-@@ -18,7 +18,6 @@ -++++- # See the License for the specific language governing permissions and -++++- # limitations under the License. -++++- """MindSpore Qwen2MoE model.""" -++++-- -++++- import math -++++- from typing import List, Optional, Tuple, Union -++++- -++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++- TokenClassifierOutput, -++++- ) -++++- from ...modeling_utils import PreTrainedModel -++++-+from ...generation import GenerationMixin -++++- from ....utils import logging -++++- from .configuration_qwen2_moe import Qwen2MoeConfig -++++- -++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++- self.variance_epsilon = eps -++++- -++++- def forward(self, hidden_states): -++++-+ # @dwj -++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++-+ # @lwx -++++-+ # if not self.training : -++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++- input_dtype = hidden_states.dtype -++++- hidden_states = hidden_states.to(mindspore.float32) -++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++-@@ -234,6 +239,8 @@ def rotate_half(x): -++++- """Rotates half the hidden dims of the input.""" -++++- x1 = x[..., : x.shape[-1] // 2] -++++- x2 = x[..., x.shape[-1] // 2 :] -++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++- return ops.cat((-x2, x1), dim=-1) -++++- -++++- -++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++- self.config = config -++++- self.hidden_size = config.hidden_size -++++- self.intermediate_size = intermediate_size -++++-+ -++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++- self.act_fn = ACT2FN[config.hidden_act] -++++- -++++- def forward(self, x): -++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++-- -++++- -++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++-+ # @lwx -++++-+ # gate_up_output = self.gate_up_proj(x) -++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++-+ # return self.down_proj(swiglu_output) -++++-+ -++++-+ # def forward(self, x): -++++-+ # gate_proj_out = self.gate_proj(x) -++++-+ # up_proj_out = self.up_proj(x) -++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++-+ # return self.down_proj(swiglu_out) -++++-+ -++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++- """ -++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++- use_cache: bool = False, -++++- cache_position: Optional[mindspore.Tensor] = None, -++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++-+ -++++-+ -++++-+ -++++- bsz, q_len, _ = hidden_states.shape -++++- -++++- query_states = self.q_proj(hidden_states) -++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++- "with a layer index." -++++- ) -++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++-+ if isinstance(past_key_value, StaticCache): -++++-+ kv_seq_len = key_states.shape[-2] -++++-+ else: -++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++- -++++- if past_key_value is not None: -++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++-+ -++++-+ if isinstance(past_key_value, StaticCache): -++++-+ kv_seq_len = key_states.shape[-2] -++++- -++++- # repeat k/v heads if n_kv_heads < n_heads -++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++++-- -++++-+ -++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++- -++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++-- raise ValueError( -++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++-- f" {attn_weights.shape}" -++++-- ) -++++-- -++++-- if attention_mask is not None: # no matter the length, we just slice it -++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++-+ if attention_mask is not None: -++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++- attn_weights = attn_weights + causal_mask -++++- -++++- # upcast attention to fp32 -++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++- -++++- attn_output = self.o_proj(attn_output) -++++-- -++++-+ # @lwx -++++-+ -++++-+ # max_seq_len = self.max_position_embeddings # 2048 -++++-+ -++++-+ # if attention_mask is not None: -++++-+ # # attention_mask: [B, 1, Sq, Sk] -++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++-+ -++++-+ # # pad 到 [max_seq_len, max_seq_len] -++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++-+ # global_attention_mask = padded_mask -++++-+ # else: -++++-+ # global_attention_mask = None -++++-+ -++++-+ -++++-+ # sparse_mode=3 -++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++-+ # query=query_states, -++++-+ # key=key_states, -++++-+ # value=value_states, -++++-+ # real_shift=None, -++++-+ # padding_mask=None, -++++-+ -++++-+ # head_num=self.num_heads, -++++-+ # attn_mask=global_attention_mask, -++++-+ # keep_prob=1.0 - self.attention_dropout, -++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++-+ # input_layout="BNSD", -++++-+ # pre_tokens=2147483647, -++++-+ # next_tokens=2147483647, -++++-+ # inner_precise=0, -++++-+ # drop_mask=None, -++++-+ # prefix=None, -++++-+ # actual_seq_qlen=None, -++++-+ # actual_seq_kvlen=None, -++++-+ # sparse_mode=sparse_mode, -++++-+ # ) -++++- if not output_attentions: -++++- attn_weights = None -++++- -++++- return attn_output, attn_weights, past_key_value -++++- -++++- -++++-+class Qwen2MoeFlashAttention(nn.Module): -++++-+ """ -++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++-+ -++++-+ 关键改动: -++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++-+ 直接传入原始的 key 和 value 张量效率更高。 -++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++-+ """ -++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++-+ super().__init__() -++++-+ self.config = config -++++-+ self.layer_idx = layer_idx -++++-+ self.hidden_size = config.hidden_size -++++-+ self.num_heads = config.num_attention_heads -++++-+ self.head_dim = self.hidden_size // self.num_heads -++++-+ self.num_key_value_heads = config.num_key_value_heads -++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++-+ self.max_position_embeddings = config.max_position_embeddings -++++-+ self.rope_theta = config.rope_theta -++++-+ self.attention_dropout = config.attention_dropout -++++-+ -++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: -++++-+ raise ValueError( -++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++-+ ) -++++-+ -++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++-+ -++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++-+ self.head_dim, -++++-+ max_position_embeddings=self.max_position_embeddings, -++++-+ base=self.rope_theta, -++++-+ ) -++++-+ -++++-+ def forward( -++++-+ self, -++++-+ hidden_states: mindspore.Tensor, -++++-+ attention_mask: Optional[mindspore.Tensor] = None, -++++-+ position_ids: Optional[mindspore.Tensor] = None, -++++-+ past_key_value: Optional[Cache] = None, -++++-+ output_attentions: bool = False, -++++-+ use_cache: bool = False, -++++-+ cache_position: Optional[mindspore.Tensor] = None, -++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++-+ -++++-+ bsz, q_len, _ = hidden_states.shape -++++-+ -++++-+ # 1. 线性投射 Q, K, V -++++-+ query_states = self.q_proj(hidden_states) -++++-+ key_states = self.k_proj(hidden_states) -++++-+ value_states = self.v_proj(hidden_states) -++++-+ -++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++-+ # query: [B, S, H*D] -> [B, N1, S, D] -++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ -++++-+ # 3. RoPE 旋转位置编码 -++++-+ kv_seq_len = key_states.shape[-2] -++++-+ if past_key_value is not None: -++++-+ if self.layer_idx is None: -++++-+ raise ValueError( -++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++-+ "with a layer index." -++++-+ ) -++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++-+ if cache_position.shape[0] == 1: -++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++-+ kv_seq_len = past_seen_tokens + 1 -++++-+ else: -++++-+ # prefill 阶段:cache_position 是范围,使用其长度 -++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++-+ else: -++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++-+ -++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++-+ -++++-+ # 4. KV 缓存更新 -++++-+ if past_key_value is not None: -++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++-+ key_states, value_states = past_key_value.update( -++++-+ key_states, value_states, self.layer_idx, cache_kwargs -++++-+ ) -++++-+ -++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++-+ if cache_position.shape[0] == 1: -++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++-+ kv_seq_len = key_states.shape[-2] -++++-+ -++++-+ # 5. [重要] 准备 Attention Mask -++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++-+ fa_attention_mask = None -++++-+ if attention_mask is not None: -++++-+ # 截取与当前key长度匹配的部分 -++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++-+ fa_attention_mask = (mask_slice != 0) -++++-+ -++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++-+ input_dtype = query_states.dtype -++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++-+ query_states = query_states.to(mindspore.float16) -++++-+ key_states = key_states.to(mindspore.float16) -++++-+ value_states = value_states.to(mindspore.float16) -++++-+ -++++-+ # 6. [核心] 调用 flash_attention_score 算子 -++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++-+ attn_output = mindspore.ops.flash_attention_score( -++++-+ query=query_states, -++++-+ key=key_states, -++++-+ value=value_states, -++++-+ head_num=self.num_heads, # 传入Q的头数(N1) -++++-+ attn_mask=fa_attention_mask, -++++-+ keep_prob=1.0 - self.attention_dropout, -++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), -++++-+ input_layout="BNSD", -++++-+ sparse_mode=0 # 使用 defaultMask 模式 -++++-+ ) -++++-+ -++++-+ # 恢复原始数据类型 -++++-+ attn_output = attn_output.to(input_dtype) -++++-+ -++++-+ # 7. 调整输出形状 -++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++-+ attn_output = self.o_proj(attn_output) -++++-+ -++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 -++++-+ attn_weights = None -++++-+ if output_attentions: -++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++-+ -++++-+ return attn_output, attn_weights, past_key_value -++++-+ -++++-+ # def forward( -++++-+ # self, -++++-+ # hidden_states: mindspore.Tensor, -++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++++-+ # position_ids: Optional[mindspore.Tensor] = None, -++++-+ # past_key_value: Optional[Cache] = None, -++++-+ # output_attentions: bool = False, -++++-+ # use_cache: bool = False, -++++-+ # cache_position: Optional[mindspore.Tensor] = None, -++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++-+ -++++-+ # bsz, q_len, _ = hidden_states.shape -++++-+ -++++-+ # # 1. 线性投射 Q, K, V -++++-+ # query_states = self.q_proj(hidden_states) -++++-+ # key_states = self.k_proj(hidden_states) -++++-+ # value_states = self.v_proj(hidden_states) -++++-+ -++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ -++++-+ # # 3. RoPE 旋转位置编码 -++++-+ # kv_seq_len = key_states.shape[-2] -++++-+ # if past_key_value is not None: -++++-+ # if self.layer_idx is None: -++++-+ # raise ValueError( -++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++-+ # "with a layer index." -++++-+ # ) -++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++-+ -++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++-+ -++++-+ # # 4. KV 缓存更新 -++++-+ # if past_key_value is not None: -++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++-+ # key_states, value_states = past_key_value.update( -++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++++-+ # ) -++++-+ -++++-+ # # 5. 准备 Attention Mask -++++-+ # fa_attention_mask = None -++++-+ # if attention_mask is not None: -++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++-+ # fa_attention_mask = (mask_slice != 0) -++++-+ -++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++-+ # input_dtype = query_states.dtype -++++-+ -++++-+ # # 6. [核心] 调用 flash_attention_score 算子 -++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++-+ # query=query_states, -++++-+ # key=key_states, -++++-+ # value=value_states, -++++-+ # head_num=self.num_heads, -++++-+ # attn_mask=fa_attention_mask, -++++-+ # keep_prob=1.0 - self.attention_dropout, -++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++-+ # input_layout="BNSD", -++++-+ # sparse_mode=0, -++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- -++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++-+ # inner_precise=1 -++++-+ # ) -++++-+ -++++-+ # # 恢复原始数据类型 -++++-+ # attn_output = attn_output.to(input_dtype) -++++-+ -++++-+ # # 7. 调整输出形状 -++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++-+ # attn_output = self.o_proj(attn_output) -++++-+ -++++-+ # attn_weights = None -++++-+ # if output_attentions: -++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++-+ -++++-+ # return attn_output, attn_weights, past_key_value -++++-+ -++++-+ # def forward( -++++-+ # self, -++++-+ # hidden_states: mindspore.Tensor, -++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++++-+ # position_ids: Optional[mindspore.Tensor] = None, -++++-+ # past_key_value: Optional[Cache] = None, -++++-+ # output_attentions: bool = False, -++++-+ # use_cache: bool = False, -++++-+ # cache_position: Optional[mindspore.Tensor] = None, -++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++-+ -++++-+ # bsz, q_len, _ = hidden_states.shape -++++-+ -++++-+ # query_states = self.q_proj(hidden_states) -++++-+ # key_states = self.k_proj(hidden_states) -++++-+ # value_states = self.v_proj(hidden_states) -++++-+ -++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-+ -++++-+ # kv_seq_len = key_states.shape[-2] -++++-+ # if past_key_value is not None: -++++-+ # if self.layer_idx is None: -++++-+ # raise ValueError("`layer_idx` must be specified for caching") -++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++-+ -++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++-+ -++++-+ # if past_key_value is not None: -++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++-+ # key_states, value_states = past_key_value.update( -++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++++-+ # ) -++++-+ -++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++-+ -++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++-+ # query_states = query_states / math.sqrt(self.head_dim) -++++-+ # # <--- 修改结束 --- -++++-+ -++++-+ # fa_attention_mask = None -++++-+ # if attention_mask is not None: -++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++-+ # fa_attention_mask = (mask_slice != 0) -++++-+ -++++-+ # input_dtype = query_states.dtype -++++-+ -++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++-+ # query=query_states, # 传入已经预先缩放过的 query -++++-+ # key=key_states, -++++-+ # value=value_states, -++++-+ # head_num=self.num_heads, -++++-+ # attn_mask=fa_attention_mask, -++++-+ # keep_prob=1.0 - self.attention_dropout, -++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++-+ # input_layout="BNSD", -++++-+ # sparse_mode=0, -++++-+ # inner_precise=1 # 仍然保持内部高精度计算 -++++-+ # ) -++++-+ -++++-+ # attn_output = attn_output.to(input_dtype) -++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++-+ # attn_output = self.o_proj(attn_output) -++++-+ -++++-+ # attn_weights = None -++++-+ # if output_attentions: -++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++-+ -++++-+ # return attn_output, attn_weights, past_key_value -++++-+ -++++- QWEN2MOE_ATTENTION_CLASSES = { -++++- "eager": Qwen2MoeAttention, -++++-+ "flash-attention": Qwen2MoeFlashAttention, -++++- } -++++- -++++- -++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++- -++++-+ #@dwj -++++-+ # 只遍历激活的专家,而非全部专家 -++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++-- hidden_states = hidden_states.view(-1, hidden_dim) -++++-- # router_logits: (batch * sequence_length, n_experts) -++++-- router_logits = self.gate(hidden_states) -++++-- -++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++-- if self.norm_topk_prob: -++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++-- # we cast back to the input dtype -++++-- routing_weights = routing_weights.to(hidden_states.dtype) -++++-- -++++-- final_hidden_states = ops.zeros( -++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++-- ) -++++-- -++++-- # One hot encode the selected experts to create an expert mask -++++-- # this will be used to easily index which expert is going to be sollicitated -++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++-- -++++-- # Loop over all available experts in the model and perform the computation on each expert -++++-- for expert_idx in range(self.num_experts): -++++-- expert_layer = self.experts[expert_idx] -++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++-- -++++-- # Index the correct hidden states and compute the expert hidden state for -++++-- # the current expert. We need to make sure to multiply the output hidden -++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++-- if 0 not in idx.shape: -++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++-- -++++-- # However `index_add_` only support torch tensors for indexing so we'll use -++++-- # the `top_x` tensor here. -++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++-- -++++-- shared_expert_output = self.shared_expert(hidden_states) -++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++-- -++++-- final_hidden_states = final_hidden_states + shared_expert_output -++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++-+ num_tokens = hidden_states_reshaped.shape[0] -++++-+ -++++-+ router_logits = self.gate(hidden_states_reshaped) -++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++-+ -++++-+ if self.norm_topk_prob: -++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++-+ routing_weights = routing_weights.to(hidden_states.dtype) -++++-+ -++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++-+ flat_selected_experts = selected_experts.flatten() -++++-+ -++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++-+ token_indices = broadcasted_token_indices.flatten() -++++-+ -++++-+ active_experts = ops.unique(flat_selected_experts) -++++-+ -++++-+ for expert_idx_tensor in active_experts: -++++-+ expert_idx = expert_idx_tensor.item() -++++-+ expert_layer = self.experts[expert_idx] -++++-+ -++++-+ mask = (flat_selected_experts == expert_idx_tensor) -++++-+ selected_token_indices = token_indices[mask] -++++-+ selected_routing_weights = routing_weights.flatten()[mask] -++++-+ -++++-+ current_states = hidden_states_reshaped[selected_token_indices] -++++-+ -++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++-+ -++++-+ final_hidden_states = final_hidden_states.index_add( -++++-+ dim=0, -++++-+ index=selected_token_indices, -++++-+ source=expert_output.to(hidden_states.dtype) -++++-+ ) -++++-+ -++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++- -++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++-- return final_hidden_states, router_logits -++++-+ final_hidden_states = final_hidden_states + shared_expert_output -++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++-+ -++++-+ return final_hidden_states, router_logits -++++- -++++- -++++- class Qwen2MoeDecoderLayer(nn.Module): -++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++- -++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++- -++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++-+ -++++- if (layer_idx not in config.mlp_only_layers) and ( -++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++- ): -++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++- _skip_keys_device_placement = "past_key_values" -++++- _supports_cache_class = True -++++-+#lwx -++++-+ # _supports_static_cache = True -++++- -++++- def _init_weights(self, module): -++++- std = self.config.initializer_range -++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++- return causal_mask -++++- -++++- -++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++- _tied_weights_keys = ["lm_head.weight"] -++++- -++++- def __init__(self, config): -++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++- self.num_experts_per_tok = config.num_experts_per_tok -++++- # Initialize weights and apply final processing -++++- self.post_init() -++++-+ # @lwx -++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++-+ # self.generation_config.cache_implementation = "static" -++++-+ self._warmed_up = False -++++-+ -++++-+ def warmup_moe_model(self): -++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++-+ test_texts = [ -++++-+ "warmup short", -++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++-+ ] -++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++-+ if tokenizer is None: -++++-+ from mindnlp.transformers import AutoTokenizer -++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++-+ self._warmup_tokenizer = tokenizer -++++-+ -++++-+ for text in test_texts: -++++-+ inputs = tokenizer(text, return_tensors="ms") -++++-+ with mindspore._no_grad(): -++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++- -++++- def get_input_embeddings(self): -++++- return self.model.embed_tokens -++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++- ```""" -++++-+ if not self._warmed_up: -++++-+ self._warmed_up = True -++++-+ self.warmup_moe_model() -++++- -++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++- output_router_logits = ( -++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++- } -++++- ) -++++- return model_inputs -++++-+# @lwx -++++-+ # def _decode_one_tokens_logits( -++++-+ # self, -++++-+ # cur_token: mindspore.Tensor, -++++-+ # input_pos: Optional[mindspore.Tensor], -++++-+ # cache_position: mindspore.Tensor, -++++-+ # past_key_values: StaticCache, -++++-+ # ) -> mindspore.Tensor: -++++-+ # """ -++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++-+ -++++-+ # Args: -++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++-+ # input_pos: 输入位置信息,可选 -++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) -++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++-+ -++++-+ # Returns: -++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++-+ # """ -++++-+ # # 调用JIT编译的版本 -++++-+ # return self.get_decode_one_tokens_logits( -++++-+ # cur_token=cur_token, -++++-+ # input_pos=input_pos, -++++-+ # cache_position=cache_position, -++++-+ # past_key_values=past_key_values, -++++-+ # ) -++++-+ -++++-+ # @mindspore.jit(jit_level='O1') -++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++-+ # """ -++++-+ # JIT编译的函数,用于高效的单token解码 -++++-+ # 使用JIT编译优化以支持静态shape和高效执行 -++++-+ -++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++-+ # """ -++++-+ # outputs = self.model.forward( -++++-+ # input_ids=cur_token, -++++-+ # position_ids=input_pos, -++++-+ # cache_position=cache_position, -++++-+ # past_key_values=past_key_values, -++++-+ # use_cache=True, -++++-+ # return_dict=False, -++++-+ # ) -++++-+ -++++-+ # hidden_states = outputs[0] -++++-+ # logits = self.lm_head.forward(hidden_states) -++++-+ # logits = logits.float() -++++-+ -++++-+ # return logits[:, -1, :] -++++-+ -++++-+ # def _sample( -++++-+ # self, -++++-+ # input_ids: mindspore.Tensor, -++++-+ # logits_processor, -++++-+ # stopping_criteria, -++++-+ # generation_config, -++++-+ # synced_devices: bool, -++++-+ # streamer=None, -++++-+ # logits_warper=None, -++++-+ # **model_kwargs, -++++-+ # ): -++++-+ # """ -++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++-+ # """ -++++-+ # from ...generation.logits_process import LogitsProcessorList -++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList -++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++-+ # from mindnlp.core import nn, ops, no_grad -++++-+ # import numpy as np -++++-+ -++++-+ # # 检查是否使用 StaticCache -++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++-+ # # 否则,直接调用父类方法 -++++-+ # past_key_values = model_kwargs.get("past_key_values") -++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++-+ -++++-+ # if not isinstance(past_key_values, StaticCache): -++++-+ # # 不使用 StaticCache,直接调用父类方法 -++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++-+ # return super()._sample( -++++-+ # input_ids=input_ids, -++++-+ # logits_processor=logits_processor, -++++-+ # stopping_criteria=stopping_criteria, -++++-+ # generation_config=generation_config, -++++-+ # synced_devices=synced_devices, -++++-+ # streamer=streamer, -++++-+ # logits_warper=logits_warper, -++++-+ # **model_kwargs, -++++-+ # ) -++++-+ -++++-+ # # 使用 StaticCache,进入自定义循环 -++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++-+ # pad_token_id = generation_config._pad_token_tensor -++++-+ # output_attentions = generation_config.output_attentions -++++-+ # output_hidden_states = generation_config.output_hidden_states -++++-+ # output_scores = generation_config.output_scores -++++-+ # output_logits = generation_config.output_logits -++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate -++++-+ # max_length = generation_config.max_length -++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++-+ # do_sample = generation_config.do_sample -++++-+ -++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++-+ # raise ValueError( -++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++-+ # f"{logits_warper})." -++++-+ # ) -++++-+ -++++-+ # # init attention / hidden states / scores tuples -++++-+ # scores = () if (return_dict_in_generate and output_scores) else None -++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++-+ -++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++-+ # encoder_hidden_states = ( -++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++-+ # ) -++++-+ -++++-+ # # keep track of which sequences are already finished -++++-+ # batch_size, cur_len = input_ids.shape -++++-+ # this_peer_finished = False -++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++-+ -++++-+ # time_record = [] -++++-+ # from ....utils.testing_utils import parse_flag_from_env -++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++-+ -++++-+ # while self._has_unfinished_sequences( -++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++-+ # ): -++++-+ # if _record_time: -++++-+ # import time as time_module -++++-+ # infer_start = time_module.time() -++++-+ -++++-+ # # prepare model inputs -++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++-+ -++++-+ # # prepare variable output controls -++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++-+ -++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++-+ # cur_cache_position = model_inputs.get("cache_position") -++++-+ # cur_past_key_values = model_inputs.get("past_key_values") -++++-+ # cur_input_ids = model_inputs.get("input_ids") -++++-+ -++++-+ # if (isinstance(cur_past_key_values, StaticCache) and -++++-+ # cur_cache_position is not None and -++++-+ # len(cur_cache_position.shape) > 0 and -++++-+ # cur_cache_position.shape[0] == 1 and -++++-+ # cur_input_ids is not None and -++++-+ # cur_input_ids.shape[1] == 1): -++++-+ # # 使用 JIT 优化的单 token 解码 -++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++-+ # if not hasattr(self, '_jit_used'): -++++-+ # self._jit_used = False -++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++-+ -++++-+ # next_token_logits = self.get_decode_one_tokens_logits( -++++-+ # cur_token=cur_input_ids, -++++-+ # input_pos=model_inputs.get("position_ids"), -++++-+ # cache_position=cur_cache_position, -++++-+ # past_key_values=cur_past_key_values, -++++-+ # ) -++++-+ -++++-+ # # 标记已使用JIT(用于后续判断) -++++-+ # if not self._jit_used: -++++-+ # self._jit_used = True -++++-+ -++++-+ # # 构造兼容的输出对象 -++++-+ # class JitOptimizedOutput: -++++-+ # def __init__(self, logits, config): -++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++-+ # self.config = config -++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 -++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++-+ # self.attentions = None if not config.is_encoder_decoder else None -++++-+ # self.cross_attentions = None -++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None -++++-+ -++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++-+ # else: -++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++-+ # outputs = self(**model_inputs, return_dict=True) -++++-+ -++++-+ # if synced_devices and this_peer_finished: -++++-+ # continue -++++-+ -++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++-+ # next_token_logits = outputs.logits[:, -1, :] -++++-+ -++++-+ # # pre-process distribution -++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++-+ # if do_sample: -++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++-+ -++++-+ # # Store scores, attentions and hidden_states when required -++++-+ # if return_dict_in_generate: -++++-+ # if output_scores: -++++-+ # scores += (next_token_scores,) -++++-+ # if output_logits: -++++-+ # raw_logits += (next_token_logits,) -++++-+ # if output_attentions: -++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) -++++-+ # if self.config.is_encoder_decoder: -++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++-+ -++++-+ # if output_hidden_states: -++++-+ # hidden = ( -++++-+ # outputs.decoder_hidden_states -++++-+ # if self.config.is_encoder_decoder -++++-+ # else outputs.hidden_states -++++-+ # ) -++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++-+ -++++-+ # # token selection -++++-+ # if do_sample: -++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++-+ # else: -++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++-+ -++++-+ # # finished sentences should have their next token be a padding token -++++-+ # if has_eos_stopping_criteria: -++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++-+ -++++-+ # # update generated ids, model inputs, and length for next step -++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++-+ # if streamer is not None: -++++-+ # streamer.put(next_tokens) -++++-+ -++++-+ # model_kwargs = self._update_model_kwargs_for_generation( -++++-+ # outputs, -++++-+ # model_kwargs, -++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, -++++-+ # ) -++++-+ -++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++-+ # cur_len += 1 -++++-+ -++++-+ # if _record_time: -++++-+ # import time as time_module -++++-+ # infer_stop = time_module.time() -++++-+ # time_record.append(infer_stop - infer_start) -++++-+ -++++-+ # del outputs -++++-+ -++++-+ # average_infer_time = None -++++-+ # if time_record: -++++-+ # if len(time_record) > 1: -++++-+ # time_record.pop(0) -++++-+ # average_infer_time = sum(time_record) / len(time_record) -++++-+ # print(f'average inference time is: {average_infer_time}') -++++-+ # print(f'inference time record: {time_record}') -++++-+ -++++-+ # if streamer is not None: -++++-+ # streamer.end() -++++-+ -++++-+ # # 简单判断:打印是否使用了JIT路径 -++++-+ # if hasattr(self, '_jit_used') and self._jit_used: -++++-+ # print("[JIT] ✓ JIT optimization was used during generation") -++++-+ # else: -++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++-+ -++++-+ # if return_dict_in_generate: -++++-+ # if self.config.is_encoder_decoder: -++++-+ # return GenerateEncoderDecoderOutput( -++++-+ # sequences=input_ids, -++++-+ # scores=scores, -++++-+ # logits=raw_logits, -++++-+ # encoder_attentions=encoder_attentions, -++++-+ # encoder_hidden_states=encoder_hidden_states, -++++-+ # decoder_attentions=decoder_attentions, -++++-+ # cross_attentions=cross_attentions, -++++-+ # decoder_hidden_states=decoder_hidden_states, -++++-+ # past_key_values=model_kwargs.get("past_key_values"), -++++-+ # average_infer_time=average_infer_time -++++-+ # ) -++++-+ # else: -++++-+ # return GenerateDecoderOnlyOutput( -++++-+ # sequences=input_ids, -++++-+ # scores=scores, -++++-+ # logits=raw_logits, -++++-+ # attentions=decoder_attentions, -++++-+ # hidden_states=decoder_hidden_states, -++++-+ # past_key_values=model_kwargs.get("past_key_values"), -++++-+ # average_infer_time=average_infer_time -++++-+ # ) -++++-+ # else: -++++-+ # return input_ids -++++-+ -++++-+ # def _prepare_cache_for_generation( -++++-+ # self, -++++-+ # generation_config, -++++-+ # model_kwargs, -++++-+ # assistant_model, -++++-+ # batch_size, -++++-+ # max_cache_length, -++++-+ # ): -++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++-+ # generation_config.cache_implementation = "static" -++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++-+ -++++-+ # if generation_config.cache_implementation == "static": -++++-+ # base_required_from_max_length = generation_config.max_length + 1 -++++-+ # base_required = max(max_cache_length, base_required_from_max_length) -++++-+ # min_cache_size = 50 -++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++-+ # else: -++++-+ # max_cache_length = max(base_required, min_cache_size) -++++-+ -++++-+ # original_max_cache_length = max_cache_length -++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") -++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++-+ # print(f" - final max_cache_length: {max_cache_length}") -++++-+ -++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++-+ # if max_cache_length > self.config.max_position_embeddings: -++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++-+ -++++-+ # result = super()._prepare_cache_for_generation( -++++-+ # generation_config=generation_config, -++++-+ # model_kwargs=model_kwargs, -++++-+ # assistant_model=assistant_model, -++++-+ # batch_size=batch_size, -++++-+ # max_cache_length=max_cache_length, -++++-+ # ) -++++-+ -++++-+ # if generation_config.cache_implementation == "static": -++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++-+ # created_cache = model_kwargs.get(cache_name) -++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++-+ # if created_cache.max_cache_len < generation_config.max_length: -++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++-+ -++++-+ # return result -++++-+ -++++-+ -++++-+ -++++- -++++- -++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++--- -++++-2.27.0 -++++- -++++-- -++++2.27.0 -++++ -+++-- -+++2.27.0 -+++ -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" deleted file mode 100644 index 31d324c3..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0008-moe-change.patch" +++ /dev/null @@ -1,8789 +0,0 @@ -From 3b0f98eeed90a7204357d96aacc9dc7098b9dab1 Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Sun, 9 Nov 2025 00:50:01 +0800 -Subject: [PATCH 08/10] moe change - ---- - .../models/deepseek/modeling_deepseek.py | 433 +- - .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- - patches/0001-20251104commit.patch | 2 +- - patches/0002-20251106commit.patch | 2 +- - patches/0003-20261106secondcommit.patch | 2 +- - patches/0004-20251106change.patch | 2 +- - patches/0005-20251107001commit.patch | 2 +- - patches/0006-20251107002commit.patch | 2 +- - patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ - 9 files changed, 8510 insertions(+), 55 deletions(-) - create mode 100644 patches/0007-20251107003commit.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index ff631974..0af29305 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -19,8 +19,10 @@ - # limitations under the License. - """ MindNLP DeepSeek model.""" - import math -+import time - import warnings - from typing import List, Optional, Tuple, Union -+from mindspore import mint - import mindspore - from mindnlp.core import nn, ops, no_grad - from mindnlp.core.nn import functional as F -@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) - - _CONFIG_FOR_DOC = "DeepseekConfig" - -+Long_Prompt = 1 -+LONG_PROMPT_LENGTH_THRESHOLD = 128 -+SHORT_PROMPT_LENGTH_THRESHOLD = 32 -+ - _attn_mask_cache = {} - - def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -@@ -380,6 +386,8 @@ class MoEGate(nn.Module): - return topk_idx, topk_weight, aux_loss - - -+bincount_op = mindspore.ops.Bincount() -+ - class DeepseekMoE(nn.Module): - """ - A mixed expert module containing shared experts. -@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): - y = y + self.shared_experts(identity) - return y - else: -- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+ if Long_Prompt == 0: -+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+ else: -+ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) - if self.config.n_shared_experts is not None: - y = y + self.shared_experts(identity) - return y -@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): - # if self.config.n_shared_experts is not None: - # y = y + self.shared_experts(identity) - # return y -- -+ -+ -+ -+ # lwx -+ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): -+ # """ -+ # 如果 expert_ids 为 None,走单专家逻辑; -+ # 如果有,多专家批量处理,保证和原逻辑一致。 -+ # """ -+ # if expert_ids is None: -+ # # 原单专家逻辑 -+ # if self.config.pretraining_tp > 1: -+ # slice = self.intermediate_size // self.config.pretraining_tp -+ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) -+ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) -+ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) -+ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) -+ # for i in range(self.config.pretraining_tp)], dim=-1) -+ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) -+ # for i in range(self.config.pretraining_tp)], dim=-1) -+ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) -+ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) -+ # for i in range(self.config.pretraining_tp)] -+ # down_proj = sum(down_proj) -+ # else: -+ # down_proj = self.down_proj( -+ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) -+ # ) -+ # return down_proj -+ -+ # # ====== 批量多专家路径 ====== -+ # hidden_size = x.shape[-1] -+ -+ # # 按 token expert_ids 选权重 -+ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] -+ # up_weights = self.up_proj.weight[expert_ids] -+ # down_weights = self.down_proj.weight[expert_ids] -+ -+ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 -+ # if self.config.pretraining_tp > 1: -+ # outputs = [] -+ # slice = self.intermediate_size // self.config.pretraining_tp -+ # for i in range(self.config.pretraining_tp): -+ # # 每个 slice 单独计算 -+ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) -+ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) -+ # act_out = self.act_fn(gate_proj_out) * up_proj_out -+ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) -+ # outputs.append(down_proj_out) -+ # return sum(outputs) -+ # else: -+ # gate_proj_out = F.linear(x, gate_weights) -+ # up_proj_out = F.linear(x, up_weights) -+ # act_out = self.act_fn(gate_proj_out) * up_proj_out -+ # return F.linear(act_out, down_weights) -+ # @no_grad() -+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ # num_tokens = x.shape[0] -+ # hidden_size = x.shape[-1] -+ -+ # idxs = flat_expert_indices.argsort() -+ # sorted_expert_indices = flat_expert_indices[idxs] -+ # sorted_token_indices = idxs // self.num_experts_per_tok -+ # sorted_indices = sorted_token_indices -+ -+ # permuted_tokens = x[sorted_token_indices] -+ # sorted_weights = flat_expert_weights[idxs] -+ -+ # # 一次调用多专家 forward -+ # expert_outputs = ops.zeros_like(permuted_tokens) -+ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) -+ -+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -+ # try: -+ # final_output = ops.moe_token_unpermute( -+ # expert_outputs, -+ # sorted_indices, -+ # probs=probs, -+ # padded_mode=False -+ # ) -+ # except Exception: -+ # final_output = ops.zeros_like(x) -+ # final_output = mindspore.mint.scatter_add( -+ # final_output, -+ # 0, -+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -+ # expert_outputs * sorted_weights -+ # ) -+ -+ # return final_output -+ -+ # def mlp_batch_forward(self, tokens, expert_ids): -+ # """ -+ # 使用批量专家 forward(保留精度) -+ # """ -+ # return self.experts[0].forward(tokens, expert_ids) -+ - # @no_grad() - # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): - -@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): - # expert_cache += expert_out * weight - # return expert_cache - -+ #@dwj - @no_grad() -- # dwj - def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- # x 的 shape: (1, hidden_size) -- # flat_expert_indices 的 shape: (num_experts_per_tok,) -- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -- -- # 1. 收集所有需要的专家层 -- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 - selected_experts = [self.experts[i] for i in flat_expert_indices] -- -- # 2. 并行计算所有专家的输出 -- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -- # ops.cat 会将它们堆叠成一个新的 Tensor -- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) - expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -- -- # 3. 使用矩阵乘法进行加权求和 -- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- # 最终结果 final_output 的 shape: (1, hidden_size) - final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -- - return final_output - - -- # @no_grad() -- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # expert_cache = ops.zeros_like(x) -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- # token_idxs = idxs // self.num_experts_per_tok -- -- # for i, end_idx in enumerate(tokens_per_expert): -- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- # if start_idx == end_idx: -- # continue -- # expert = self.experts[i] -- # exp_token_idx = token_idxs[start_idx:end_idx] -- # expert_tokens = x[exp_token_idx] -- # expert_out = expert(expert_tokens) -- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -- -- # return expert_cache -- - @no_grad() - def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): - """ -@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): - ) - - return expert_cache -+ -+ -+ # @no_grad() -+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ # """ -+ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add -+ # """ -+ # num_tokens = x.shape[0] -+ # hidden_size = x.shape[-1] -+ -+ # # 生成排序后的 token 索引 -+ # idxs = flat_expert_indices.argsort() -+ # sorted_expert_indices = flat_expert_indices[idxs] -+ # sorted_token_indices = idxs // self.num_experts_per_tok -+ -+ # # 记录到 sorted_indices(moe_token_unpermute 用) -+ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] -+ -+ # # 收集专家输入 -+ # permuted_tokens = x[sorted_token_indices] -+ -+ # # 执行每个专家的 MLP(批量处理) -+ # expert_outputs = [] -+ # token_ptr = 0 -+ # tokens_per_expert = sorted_expert_indices.bincount() -+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): -+ # if count == 0: -+ # continue -+ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] -+ # out = self.experts[expert_id](cur_tokens) -+ # expert_outputs.append(out) -+ # token_ptr += count -+ -+ # # 拼接所有专家输出 -+ # permuted_outputs = ops.cat(expert_outputs, axis=0) -+ -+ # # 权重缩放(probs 形状为 [num_tokens, top_k]) -+ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) -+ -+ # # 直接调用硬件加速的 unpermute -+ # final_output = ops.moe_token_unpermute( -+ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] -+ # sorted_indices, # shape: [num_tokens * top_k] -+ # probs=probs, # 按概率加权 -+ # padded_mode=False -+ # ) -+ -+ # return final_output -+ -+ # lwx prefill 20251108 -+ @no_grad() -+ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): -+ """ -+ 高性能 + 数值一致的 MoE prefill 推理: -+ 1. 批量化处理所有专家计算,减少 Python 循环开销 -+ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 -+ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 -+ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch -+ -+ 参数: -+ x: [num_tokens, hidden_size], -+ MoE 输入的 token 表示 -+ flat_expert_indices: [num_tokens * top_k], -+ 每个 token 的路由专家 id -+ flat_expert_weights: [num_tokens * top_k, 1], -+ 路由专家权重 -+ """ -+ num_tokens = x.shape[0] -+ hidden_size = x.shape[-1] -+ -+ # 1) 排序专家分配(与原 scatter_add 一致的顺序) -+ idxs = flat_expert_indices.argsort() # 排序索引 -+ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] -+ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID -+ -+ # sorted_indices 必须与 permuted_tokens 顺序匹配 -+ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 -+ -+ # 2) 收集专家输入(按 idxs 排序) -+ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] -+ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 -+ -+ # 3) 计算每个专家的 token 数 -+ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) -+ -+ # 4) 批量专家计算(减少 Python 循环) -+ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) -+ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) -+ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) -+ -+ expert_outputs = ops.zeros_like(permuted_tokens) -+ ptr = 0 -+ for expert_id, count in enumerate(tokens_per_expert.tolist()): -+ if count == 0: -+ continue -+ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] -+ -+ # 与 DeepseekMLP forward 等价 -+ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) -+ up_proj_out = F.linear(tokens, up_weights[expert_id]) -+ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out -+ expert_out = F.linear(act_out, down_weights[expert_id]) -+ -+ expert_outputs[ptr:ptr+count] = expert_out -+ ptr += count -+ -+ # 5) Ascend 加速的 unpermute(已排序的权重) -+ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape -+ -+ final_output = ops.zeros_like(x) -+ final_output = mindspore.mint.scatter_add( -+ final_output, -+ 0, -+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -+ expert_outputs * sorted_weights -+ ) -+ -+ -+ # try: -+ # final_output = ops.moe_token_unpermute( -+ # expert_outputs, # [num_tokens*top_k, hidden_size] -+ # sorted_indices, # [num_tokens*top_k] 原 token id -+ # probs=probs, # 对应权重 -+ # padded_mode=False -+ # ) -+ # except Exception: -+ # # CPU/GPU fallback:用 scatter_add 保证完全一致 -+ # final_output = ops.zeros_like(x) -+ # final_output = mindspore.mint.scatter_add( -+ # final_output, -+ # 0, -+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -+ # expert_outputs * sorted_weights -+ # ) -+ -+ return final_output -+ -+ -+ # @no_grad() -+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ # num_tokens = x.shape[0] -+ # hidden_size = x.shape[-1] -+ -+ # idxs = flat_expert_indices.argsort() -+ # sorted_expert_indices = flat_expert_indices[idxs] -+ # sorted_token_indices = idxs // self.num_experts_per_tok -+ -+ # # sorted_indices = sorted_token_indices -+ # sorted_indices = sorted_token_indices.astype(mindspore.int32) -+ # permuted_tokens = x[sorted_token_indices] -+ # sorted_weights = flat_expert_weights[idxs] -+ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) -+ -+ # expert_outputs = ops.zeros_like(permuted_tokens) -+ # ptr = 0 -+ -+ # # 只按专家维度循环 -+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): -+ # if count == 0: -+ # continue -+ # token_slice = slice(ptr, ptr + count) -+ # expert_tokens = permuted_tokens[token_slice] -+ -+ # # 保持原 forward(含 pretraining_tp、bias 等) -+ # expert_out = self.experts[expert_id](expert_tokens) -+ -+ # expert_outputs[token_slice] = expert_out -+ # ptr += count -+ -+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -+ # try: -+ # final_output = mindspore.ops.moe_token_unpermute( -+ # expert_outputs, -+ # sorted_indices, -+ # probs=probs, -+ # padded_mode=False -+ # ) -+ # except Exception: -+ # final_output = ops.zeros_like(x) -+ # final_output = mindspore.mint.scatter_add( -+ # final_output, -+ # 0, -+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -+ # expert_outputs * sorted_weights -+ # ) -+ -+ # return final_output -+ -+ -+ #lwx -+ # @no_grad() -+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ # """ -+ # 并行化 MoE prefill: -+ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 -+ # - 保证结果与原版完全一致 -+ # """ -+ # # 输出缓存 -+ # expert_cache = ops.zeros_like(x) -+ -+ # # token 总数(批量*seq_len*num_experts_per_tok) -+ # num_tokens = flat_expert_indices.shape[0] -+ # hidden_dim = x.shape[-1] -+ -+ # # 原 token ID(idxs // num_experts_per_tok) -+ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) -+ -+ # # ====== Step 1: 组织输入 ====== -+ # # 按 experts 排序,保证 scatter_add 对应位置一致 -+ # sort_ids = flat_expert_indices.argsort() -+ # sorted_experts = flat_expert_indices[sort_ids] -+ # sorted_tokens = token_ids[sort_ids] -+ # sorted_weights = flat_expert_weights[sort_ids] -+ -+ # # 收集每个专家的输入 -+ # # build: expert_inputs[expert_id] = [tokens...] -+ # expert_inputs = [] -+ # expert_outs = [] -+ -+ # for eid in range(self.config.n_routed_experts): -+ # eid_mask = (sorted_experts == eid) -+ # if eid_mask.any(): -+ # tokens_for_eid = x[sorted_tokens[eid_mask]] -+ # expert_inputs.append(tokens_for_eid) -+ # else: -+ # expert_inputs.append(None) -+ -+ # # ====== Step 2: 并行计算所有专家输出 ====== -+ # # 存储所有专家结果到一个列表 -+ # for eid in range(self.config.n_routed_experts): -+ # if expert_inputs[eid] is not None: -+ # out = self.experts[eid](expert_inputs[eid]) -+ # expert_outs.append(out) -+ # else: -+ # expert_outs.append(None) -+ -+ # # ====== Step 3: scatter_add 回写结果 ====== -+ # # 遍历专家,将结果加回对应的 token -+ # pos = 0 -+ # for eid in range(self.config.n_routed_experts): -+ # if expert_outs[eid] is not None: -+ # size = expert_outs[eid].shape[0] -+ # tokens_idx = sorted_tokens[pos:pos+size] -+ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] -+ # pos += size -+ -+ # # scatter_add 到 expert_cache -+ # expert_cache = mindspore.mint.scatter_add( -+ # expert_cache, -+ # dim=0, -+ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), -+ # src=scaled_out -+ # ) -+ -+ # return expert_cache -+ -+ -+ - # 放置在 DeepseekMoE 类中 - # @no_grad() - # #lwx 20251107 -@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): - self.hidden_size = config.hidden_size - - # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -- # config=config, layer_idx=layer_idx -+ # config=config, layer_idx=layer_idx - # ) - - self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): - ) - else DeepseekMLP(config) - ) -+ - self.input_layernorm = DeepseekRMSNorm( - config.hidden_size, eps=config.rms_norm_eps - ) -@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - def get_decoder(self): - return self.model - -+ def generate(self, *args, **kwargs): -+ """ -+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+ """ -+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+ -+ input_ids = kwargs.get("input_ids") -+ if input_ids is None and args: -+ input_ids = args[0] -+ -+ if input_ids is not None: -+ prompt_length = input_ids.shape[1] -+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -+ Long_Prompt = 2 -+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -+ Long_Prompt = 0 -+ else: -+ Long_Prompt = 1 -+ -+ -+ return super().generate(*args, **kwargs) - - def forward( - self, -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index 913a7609..6566958b 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - - # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- - @no_grad() -- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - original_dtype = hidden_states.dtype - batch_size, _ = hidden_states.shape - expert_outputs_list = [ -@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) - return moe_output_fp32.squeeze(1).to(original_dtype) - -+ - # @no_grad() -- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - # num_tokens, _ = hidden_states.shape - # flat_selected_experts = selected_experts.flatten() - # sorted_expert_indices = flat_selected_experts.argsort() -@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - # current_token_offset += expert_token_count - # return moe_output - -+ # baseline - @no_grad() -- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - """ - 优化版 MoE prefill (速度优先模式): - - 批量张量化处理同一个 expert 的所有 token -@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - return moe_output - - -+ @no_grad() -+ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ """ -+ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add -+ 逻辑: -+ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 -+ 2. 每个 expert 一次性处理其全部 token -+ 3. 最后一次 scatter_add 回到原 token 顺序 -+ """ -+ -+ num_tokens = hidden_states.shape[0] -+ hidden_size = hidden_states.shape[-1] -+ -+ # 展平为一维 -+ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] -+ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] -+ -+ # 按 expert 排序 -+ idxs = flat_selected_experts.argsort() -+ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 -+ sorted_token_indices = idxs // self.top_k # 对应原 token ID -+ -+ # 排好序的输入向量(连续内存) -+ permuted_tokens = hidden_states[sorted_token_indices] -+ -+ # 排好序的权重 -+ sorted_weights = flat_routing_weights[idxs] -+ -+ # 每个 expert 对应的 token 数量 -+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -+ -+ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) -+ expert_outputs = ops.zeros_like(permuted_tokens) -+ -+ ptr = 0 # 指向当前切片的起点 -+ for expert_id, count in enumerate(tokens_per_expert.tolist()): -+ if count == 0: -+ continue -+ -+ token_slice = slice(ptr, ptr + count) -+ expert_tokens = permuted_tokens[token_slice] # 连续切片 -+ -+ # 执行专家 MLP -+ expert_out = self.experts[expert_id](expert_tokens) -+ -+ expert_outputs[token_slice] = expert_out -+ ptr += count -+ -+ # 按权重缩放 -+ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) -+ -+ # 回写到原 token 顺序 (单次 scatter_add) -+ moe_output = mindspore.mint.scatter_add( -+ ops.zeros_like(hidden_states), -+ 0, -+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -+ scaled_outputs -+ ) -+ -+ return moe_output -+ -+ -+ - # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+ - @no_grad() - def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - moe_output = ops.zeros_like(hidden_states) -@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - # # --- 速度优先模式 (SPEED MODE) --- - # routing_weights_casted = routing_weights.to(hidden_states.dtype) - # if sequence_length == 1: -- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) - # else: -- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) - - routing_weights_casted = routing_weights.to(hidden_states.dtype) - if sequence_length == 1: -- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) - else: -- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -- -+ # if Long_Prompt == 1: -+ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # else: -+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ - - # 3. 共享专家计算与合并 (所有模式通用) - gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -index c9c8c5ee..513dd40b 100644 ---- a/patches/0001-20251104commit.patch -+++ b/patches/0001-20251104commit.patch -@@ -1,7 +1,7 @@ - From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/6] 20251104commit -+Subject: [PATCH 1/7] 20251104commit - - --- - mindnlp/transformers/cache_utils.py | 28 +- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -index 625656eb..41081b85 100644 ---- a/patches/0002-20251106commit.patch -+++ b/patches/0002-20251106commit.patch -@@ -1,7 +1,7 @@ - From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/6] 20251106commit -+Subject: [PATCH 2/7] 20251106commit - - --- - .../models/deepseek/modeling_deepseek.py | 379 ++++- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -index dcb85080..c1392569 100644 ---- a/patches/0003-20261106secondcommit.patch -+++ b/patches/0003-20261106secondcommit.patch -@@ -1,7 +1,7 @@ - From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/6] 20261106secondcommit -+Subject: [PATCH 3/7] 20261106secondcommit - - --- - .../models/deepseek/modeling_deepseek.py | 217 ++- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -index bbed13cc..e548b1b2 100644 ---- a/patches/0004-20251106change.patch -+++ b/patches/0004-20251106change.patch -@@ -1,7 +1,7 @@ - From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 15:48:09 +0800 --Subject: [PATCH 4/6] 20251106change -+Subject: [PATCH 4/7] 20251106change - - --- - .../models/deepseek/modeling_deepseek.py | 189 +- -diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -index b2d1035c..bf224d2a 100644 ---- a/patches/0005-20251107001commit.patch -+++ b/patches/0005-20251107001commit.patch -@@ -1,7 +1,7 @@ - From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 11:48:18 +0800 --Subject: [PATCH 5/6] 20251107001commit -+Subject: [PATCH 5/7] 20251107001commit - - --- - .../models/deepseek/modeling_deepseek.py | 91 +- -diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -index bffa134e..1bd306b9 100644 ---- a/patches/0006-20251107002commit.patch -+++ b/patches/0006-20251107002commit.patch -@@ -1,7 +1,7 @@ - From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 12:06:32 +0800 --Subject: [PATCH 6/6] 20251107002commit -+Subject: [PATCH 6/7] 20251107002commit - - --- - .../models/deepseek/modeling_deepseek.py | 122 +- -diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch -new file mode 100644 -index 00000000..ce558554 ---- /dev/null -+++ b/patches/0007-20251107003commit.patch -@@ -0,0 +1,8034 @@ -+From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Fri, 7 Nov 2025 12:12:51 +0800 -+Subject: [PATCH 7/7] 20251107003commit -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 2 +- -+ patches/0001-20251104commit.patch | 2 +- -+ patches/0002-20251106commit.patch | 2 +- -+ patches/0003-20261106secondcommit.patch | 2 +- -+ patches/0004-20251106change.patch | 2 +- -+ patches/0005-20251107001commit.patch | 2 +- -+ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ -+ 7 files changed, 7937 insertions(+), 6 deletions(-) -+ create mode 100644 patches/0006-20251107002commit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index e7e1c053..ff631974 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): -+ # return expert_cache -+ -+ @no_grad() -+- dwj -++ # dwj -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ # x 的 shape: (1, hidden_size) -+ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+index 2842180e..c9c8c5ee 100644 -+--- a/patches/0001-20251104commit.patch -++++ b/patches/0001-20251104commit.patch -+@@ -1,7 +1,7 @@ -+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+-Subject: [PATCH 1/5] 20251104commit -++Subject: [PATCH 1/6] 20251104commit -+ -+ --- -+ mindnlp/transformers/cache_utils.py | 28 +- -+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+index c6cd8757..625656eb 100644 -+--- a/patches/0002-20251106commit.patch -++++ b/patches/0002-20251106commit.patch -+@@ -1,7 +1,7 @@ -+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+-Subject: [PATCH 2/5] 20251106commit -++Subject: [PATCH 2/6] 20251106commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+index 601960c9..dcb85080 100644 -+--- a/patches/0003-20261106secondcommit.patch -++++ b/patches/0003-20261106secondcommit.patch -+@@ -1,7 +1,7 @@ -+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+-Subject: [PATCH 3/5] 20261106secondcommit -++Subject: [PATCH 3/6] 20261106secondcommit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 217 ++- -+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+index 8976f10b..bbed13cc 100644 -+--- a/patches/0004-20251106change.patch -++++ b/patches/0004-20251106change.patch -+@@ -1,7 +1,7 @@ -+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 15:48:09 +0800 -+-Subject: [PATCH 4/5] 20251106change -++Subject: [PATCH 4/6] 20251106change -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 189 +- -+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -+index 8d9032be..b2d1035c 100644 -+--- a/patches/0005-20251107001commit.patch -++++ b/patches/0005-20251107001commit.patch -+@@ -1,7 +1,7 @@ -+ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Fri, 7 Nov 2025 11:48:18 +0800 -+-Subject: [PATCH 5/5] 20251107001commit -++Subject: [PATCH 5/6] 20251107001commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 91 +- -+diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -+new file mode 100644 -+index 00000000..bffa134e -+--- /dev/null -++++ b/patches/0006-20251107002commit.patch -+@@ -0,0 +1,7931 @@ -++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Fri, 7 Nov 2025 12:06:32 +0800 -++Subject: [PATCH 6/6] 20251107002commit -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 122 +- -++ patches/0001-20251104commit.patch | 2 +- -++ patches/0002-20251106commit.patch | 2 +- -++ patches/0003-20261106secondcommit.patch | 2 +- -++ patches/0004-20251106change.patch | 2 +- -++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ -++ 6 files changed, 7773 insertions(+), 64 deletions(-) -++ create mode 100644 patches/0005-20251107001commit.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index 8831e4b7..e7e1c053 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): -++ # expert_out = expert(x) -++ # expert_cache += expert_out * weight -++ # return expert_cache -++- -++- # @no_grad() -++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++- # # x 的 shape: (1, hidden_size) -++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) -++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++- -++- # # 1. 收集所有需要的专家层 -++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++- # selected_experts = [self.experts[i] for i in flat_expert_indices] -++- -++- # # 2. 并行计算所有专家的输出 -++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++- # # ops.cat 会将它们堆叠成一个新的 Tensor -++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++- -++- # # 3. 使用矩阵乘法进行加权求和 -++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++- # # 最终结果 final_output 的 shape: (1, hidden_size) -++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++ -+++ @no_grad() -+++ dwj -+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ # x 的 shape: (1, hidden_size) -+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++ -+++ # 1. 收集所有需要的专家层 -+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++ selected_experts = [self.experts[i] for i in flat_expert_indices] -+++ -+++ # 2. 并行计算所有专家的输出 -+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++ # ops.cat 会将它们堆叠成一个新的 Tensor -+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++ -+++ # 3. 使用矩阵乘法进行加权求和 -+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++ # 最终结果 final_output 的 shape: (1, hidden_size) -+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++ -++- # return final_output -+++ return final_output -++ -++ -++ # @no_grad() -++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): -++ -++ return expert_cache -++ # 放置在 DeepseekMoE 类中 -++- @no_grad() -++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++- """ -++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -++- -++- Args: -++- x (Tensor): 输入张量, shape: (1, hidden_size) -++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -++- """ -++- top_k, _ = flat_expert_weights.shape -++- hidden_size = x.shape[-1] -++- -++- # 1. 将所有专家的权重堆叠起来 -++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+++ # @no_grad() -+++ # #lwx 20251107 -+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++ # """ -+++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+++ -+++ # Args: -+++ # x (Tensor): 输入张量, shape: (1, hidden_size) -+++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+++ # """ -+++ # top_k, _ = flat_expert_weights.shape -+++ # hidden_size = x.shape[-1] -+++ -+++ # # 1. 将所有专家的权重堆叠起来 -+++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -++ -++- # 2. "收集" 所需的专家权重 -++- selected_gate_w = stacked_gate_w[flat_expert_indices] -++- selected_up_w = stacked_up_w[flat_expert_indices] -++- selected_down_w = stacked_down_w[flat_expert_indices] -+++ # # 2. "收集" 所需的专家权重 -+++ # selected_gate_w = stacked_gate_w[flat_expert_indices] -+++ # selected_up_w = stacked_up_w[flat_expert_indices] -+++ # selected_down_w = stacked_down_w[flat_expert_indices] -++ -++- # 3. 准备输入 -++- x_expanded = x.expand((top_k, 1, hidden_size)) -+++ # # 3. 准备输入 -+++ # x_expanded = x.expand((top_k, 1, hidden_size)) -++ -++- # 4. 并行计算 gate_proj 和 up_proj -++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+++ # # 4. 并行计算 gate_proj 和 up_proj -+++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -++ -++- # 5. 计算中间状态 -++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+++ # # 5. 计算中间状态 -+++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out -++ -++- # 6. 并行计算 down_proj -++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -++- # --- [FIX] --- -++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -++- # --- [FIX END] --- -+++ # # 6. 并行计算 down_proj -+++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+++ # # --- [FIX] --- -+++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+++ # # --- [FIX END] --- -++ -++- # 7. 根据路由权重进行加权求和 -++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+++ # # 7. 根据路由权重进行加权求和 -+++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -++ -++- return weighted_sum -+++ # return weighted_sum -++ -++ -++ -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++index 0a0ef2d7..2842180e 100644 -++--- a/patches/0001-20251104commit.patch -+++++ b/patches/0001-20251104commit.patch -++@@ -1,7 +1,7 @@ -++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -++-Subject: [PATCH 1/4] 20251104commit -+++Subject: [PATCH 1/5] 20251104commit -++ -++ --- -++ mindnlp/transformers/cache_utils.py | 28 +- -++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++index 5185270c..c6cd8757 100644 -++--- a/patches/0002-20251106commit.patch -+++++ b/patches/0002-20251106commit.patch -++@@ -1,7 +1,7 @@ -++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -++-Subject: [PATCH 2/4] 20251106commit -+++Subject: [PATCH 2/5] 20251106commit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++index 3e05f821..601960c9 100644 -++--- a/patches/0003-20261106secondcommit.patch -+++++ b/patches/0003-20261106secondcommit.patch -++@@ -1,7 +1,7 @@ -++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -++-Subject: [PATCH 3/4] 20261106secondcommit -+++Subject: [PATCH 3/5] 20261106secondcommit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -++index 88a1aef4..8976f10b 100644 -++--- a/patches/0004-20251106change.patch -+++++ b/patches/0004-20251106change.patch -++@@ -1,7 +1,7 @@ -++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 15:48:09 +0800 -++-Subject: [PATCH 4/4] 20251106change -+++Subject: [PATCH 4/5] 20251106change -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 189 +- -++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -++new file mode 100644 -++index 00000000..8d9032be -++--- /dev/null -+++++ b/patches/0005-20251107001commit.patch -++@@ -0,0 +1,7707 @@ -+++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Fri, 7 Nov 2025 11:48:18 +0800 -+++Subject: [PATCH 5/5] 20251107001commit -+++ -+++--- -+++ .../models/deepseek/modeling_deepseek.py | 91 +- -+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- -+++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- -+++ patches/0001-20251104commit.patch | 2 +- -+++ patches/0002-20251106commit.patch | 2 +- -+++ patches/0003-20261106secondcommit.patch | 2 +- -+++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ -+++ 7 files changed, 7577 insertions(+), 30 deletions(-) -+++ create mode 100644 patches/0004-20251106change.patch -+++ -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index 0546f318..8831e4b7 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): -+++ # expert_cache += expert_out * weight -+++ # return expert_cache -+++ -+++- @no_grad() -+++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++- # x 的 shape: (1, hidden_size) -+++- # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++- -+++- # 1. 收集所有需要的专家层 -+++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++- selected_experts = [self.experts[i] for i in flat_expert_indices] -+++- -+++- # 2. 并行计算所有专家的输出 -+++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++- # ops.cat 会将它们堆叠成一个新的 Tensor -+++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++- -+++- # 3. 使用矩阵乘法进行加权求和 -+++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++- # 最终结果 final_output 的 shape: (1, hidden_size) -+++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++++ # @no_grad() -++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ # # x 的 shape: (1, hidden_size) -++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) -++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++++ -++++ # # 1. 收集所有需要的专家层 -++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] -++++ -++++ # # 2. 并行计算所有专家的输出 -++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++++ # # ops.cat 会将它们堆叠成一个新的 Tensor -++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++++ -++++ # # 3. 使用矩阵乘法进行加权求和 -++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ # # 最终结果 final_output 的 shape: (1, hidden_size) -++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++ -+++- return final_output -++++ # return final_output -+++ -+++ -+++ # @no_grad() -+++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): -+++ ) -+++ -+++ return expert_cache -++++# 放置在 DeepseekMoE 类中 -++++ @no_grad() -++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ """ -++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -++++ -++++ Args: -++++ x (Tensor): 输入张量, shape: (1, hidden_size) -++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -++++ """ -++++ top_k, _ = flat_expert_weights.shape -++++ hidden_size = x.shape[-1] -++++ -++++ # 1. 将所有专家的权重堆叠起来 -++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -++++ -++++ # 2. "收集" 所需的专家权重 -++++ selected_gate_w = stacked_gate_w[flat_expert_indices] -++++ selected_up_w = stacked_up_w[flat_expert_indices] -++++ selected_down_w = stacked_down_w[flat_expert_indices] -++++ -++++ # 3. 准备输入 -++++ x_expanded = x.expand((top_k, 1, hidden_size)) -++++ -++++ # 4. 并行计算 gate_proj 和 up_proj -++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -++++ -++++ # 5. 计算中间状态 -++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -++++ -++++ # 6. 并行计算 down_proj -++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -++++ # --- [FIX] --- -++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -++++ # --- [FIX END] --- -++++ -++++ # 7. 根据路由权重进行加权求和 -++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -++++ -++++ return weighted_sum -++++ -++++ -+++ -+++ # @no_grad() -+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++index ebd7782e..913a7609 100644 -+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): -+++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++- x1 = x[..., : x.shape[-1] // 2] -+++- x2 = x[..., x.shape[-1] // 2 :] -++++ # x1 = x[..., : x.shape[-1] // 2] -++++ # x2 = x[..., x.shape[-1] // 2 :] -+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+++index d059dcbe..2b217b64 100644 -+++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): -+++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++ def rotate_half(x): -+++ """Rotates half the hidden dims of the input.""" -+++- x1 = x[..., : x.shape[-1] // 2] -+++- x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++ # x1 = x[..., : x.shape[-1] // 2] -++++ # x2 = x[..., x.shape[-1] // 2 :] -++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++ return ops.cat((-x2, x1), dim=-1) -+++ -+++ -+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++index 78f22642..0a0ef2d7 100644 -+++--- a/patches/0001-20251104commit.patch -++++++ b/patches/0001-20251104commit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++-Subject: [PATCH 1/3] 20251104commit -++++Subject: [PATCH 1/4] 20251104commit -+++ -+++ --- -+++ mindnlp/transformers/cache_utils.py | 28 +- -+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+++index 22b65dd5..5185270c 100644 -+++--- a/patches/0002-20251106commit.patch -++++++ b/patches/0002-20251106commit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+++-Subject: [PATCH 2/3] 20251106commit -++++Subject: [PATCH 2/4] 20251106commit -+++ -+++ --- -+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+++index 966529e4..3e05f821 100644 -+++--- a/patches/0003-20261106secondcommit.patch -++++++ b/patches/0003-20261106secondcommit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+++-Subject: [PATCH 3/3] 20261106secondcommit -++++Subject: [PATCH 3/4] 20261106secondcommit -+++ -+++ --- -+++ .../models/deepseek/modeling_deepseek.py | 217 ++- -+++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+++new file mode 100644 -+++index 00000000..88a1aef4 -+++--- /dev/null -++++++ b/patches/0004-20251106change.patch -+++@@ -0,0 +1,7498 @@ -++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Thu, 6 Nov 2025 15:48:09 +0800 -++++Subject: [PATCH 4/4] 20251106change -++++ -++++--- -++++ .../models/deepseek/modeling_deepseek.py | 189 +- -++++ patches/0001-20251104commit.patch | 1272 +++++++ -++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -++++ 4 files changed, 7244 insertions(+), 186 deletions(-) -++++ create mode 100644 patches/0001-20251104commit.patch -++++ create mode 100644 patches/0002-20251106commit.patch -++++ create mode 100644 patches/0003-20261106secondcommit.patch -++++ -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index 2f9192bf..0546f318 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -++++ -++++ return attn_output, attn_weights, past_key_value -++++ -++++-# class DeepseekFlashAttention(nn.Module): -++++-# """ -++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -++++- -++++-# This class is designed as a drop-in replacement for DeepseekAttention. -++++-# """ -++++- -++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++++-# super().__init__() -++++-# self.config = config -++++-# self.layer_idx = layer_idx -++++-# if layer_idx is None: -++++-# logger.warning( -++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++-# "when creating this class." -++++-# ) -++++- -++++-# self.attention_dropout = config.attention_dropout -++++-# self.hidden_size = config.hidden_size -++++-# self.num_heads = config.num_attention_heads -++++-# self.head_dim = self.hidden_size // self.num_heads -++++-# self.num_key_value_heads = config.num_key_value_heads -++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++-# self.max_position_embeddings = config.max_position_embeddings -++++-# self.rope_theta = config.rope_theta -++++-# self.is_causal = True -++++- -++++-# if (self.head_dim * self.num_heads) != self.hidden_size: -++++-# raise ValueError( -++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++-# f" and `num_heads`: {self.num_heads})." -++++-# ) -++++- -++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++++-# self._init_rope() -++++- -++++-# def _init_rope(self): -++++-# if self.config.rope_scaling is None: -++++-# self.rotary_emb = DeepseekRotaryEmbedding( -++++-# self.head_dim, -++++-# max_position_embeddings=self.max_position_embeddings, -++++-# base=self.rope_theta, -++++-# ) -++++-# else: -++++-# scaling_type = self.config.rope_scaling["type"] -++++-# scaling_factor = self.config.rope_scaling["factor"] -++++-# if scaling_type == "linear": -++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++++-# self.head_dim, -++++-# max_position_embeddings=self.max_position_embeddings, -++++-# scaling_factor=scaling_factor, -++++-# base=self.rope_theta, -++++-# ) -++++-# elif scaling_type == "dynamic": -++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++++-# self.head_dim, -++++-# max_position_embeddings=self.max_position_embeddings, -++++-# scaling_factor=scaling_factor, -++++-# base=self.rope_theta, -++++-# ) -++++-# else: -++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++++- -++++-# def forward( -++++-# self, -++++-# hidden_states: mindspore.Tensor, -++++-# attention_mask: Optional[mindspore.Tensor] = None, -++++-# position_ids: Optional[mindspore.Tensor] = None, -++++-# past_key_value: Optional[Cache] = None, -++++-# output_attentions: bool = False, -++++-# use_cache: bool = False, -++++-# **kwargs, -++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++-# if "padding_mask" in kwargs: -++++-# warnings.warn( -++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++++-# ) -++++- -++++-# if output_attentions: -++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -++++- -++++-# bsz, q_len, _ = hidden_states.shape -++++- -++++-# if self.config.pretraining_tp > 1: -++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++++- -++++-# query_states = self.q_proj(hidden_states) -++++-# key_states = self.k_proj(hidden_states) -++++-# value_states = self.v_proj(hidden_states) -++++- -++++-# # Reshape for multi-head attention -++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++- -++++-# kv_seq_len = key_states.shape[-2] -++++-# if past_key_value is not None: -++++-# if self.layer_idx is None: -++++-# raise ValueError( -++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++-# "with a layer index." -++++-# ) -++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++- -++++-# # Apply Rotary Positional Embedding -++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++- -++++-# if past_key_value is not None: -++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++- -++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++- -++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++- -++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++- -++++-# # Convert attention_mask for flash_attention_score -++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -++++-# if attention_mask is not None: -++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++++-# raise ValueError( -++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++++-# ) -++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -++++-# else: -++++-# attn_mask_for_fa = None -++++- -++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++++- -++++-# # Call the fused flash_attention_score operator -++++-# attn_output = mindspore.ops.flash_attention_score( -++++-# query=query_states_for_fa, -++++-# key=key_states_for_fa, -++++-# value=value_states_for_fa, -++++-# head_num=self.num_heads, # This is N1, the number of query heads -++++-# input_layout='BSH', -++++-# attn_mask=attn_mask_for_fa, -++++-# keep_prob=keep_prob, -++++-# scalar_value=1.0 / math.sqrt(self.head_dim), -++++-# sparse_mode=0 # Default mask mode -++++-# ) -++++- -++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -++++-# attn_output = self.o_proj(attn_output) -++++- -++++-# # Flash Attention does not return attention weights -++++-# attn_weights = None -++++- -++++-# return attn_output, attn_weights, past_key_value -++++ -++++ class DeepseekFlashAttention(nn.Module): -++++ """ -++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -++++ super().__init__() -++++ self.hidden_size = config.hidden_size -++++ -++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -++++- config=config, layer_idx=layer_idx -++++- ) -+++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+++++ # config=config, layer_idx=layer_idx -+++++ # ) -++++ -++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -++++ config=config, layer_idx=layer_idx -++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -++++ return outputs -++++ -++++ -++++- -++++ class DeepseekPreTrainedModel(PreTrainedModel): -++++ config_class = DeepseekConfig -++++ base_model_prefix = "model" -++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++ # Initialize weights and apply final processing -++++ self.post_init() -++++ self.warm_up = False -++++- #@dwj -++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -++++- self.num_layers, -++++- self.num_attention_heads, -++++- self.head_dim, -++++- batch_size=1, -++++- max_length=self.max_length, -++++- dtype=mindspore.float16 -++++- ) -++++- -++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -++++- key_cache = [] -++++- value_cache = [] -++++- for _ in range(num_layers): -++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++- key_cache.append(k) -++++- value_cache.append(v) -++++- return key_cache, value_cache -++++- -++++ -++++ def warmup_moe_model_deep(self): -++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++new file mode 100644 -++++index 00000000..78f22642 -++++--- /dev/null -+++++++ b/patches/0001-20251104commit.patch -++++@@ -0,0 +1,1272 @@ -+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++++From: Pinoeer-kingxi <13022943007@163.com> -+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++++Subject: [PATCH 1/3] 20251104commit -+++++ -+++++--- -+++++ mindnlp/transformers/cache_utils.py | 28 +- -+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++++ 3 files changed, 976 insertions(+), 87 deletions(-) -+++++ -+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++++index cadd2e04..02f8d4be 100644 -+++++--- a/mindnlp/transformers/cache_utils.py -++++++++ b/mindnlp/transformers/cache_utils.py -+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++++ # k_out[:, :, cache_position] = key_states -+++++ # v_out[:, :, cache_position] = value_states -+++++- if ON_ORANGE_PI: -+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++- else: -+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++- -++++++ # if ON_ORANGE_PI: -++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++ # else: -++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++ # 确保 cache_position 是 1D tensor 并且类型正确 -++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++++ if cache_position.ndim > 1: -++++++ cache_position = cache_position.flatten() -++++++ # 确保类型是 int32 或 int64(MindSpore 要求) -++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++++ cache_position = cache_position.int() -++++++ -++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++++ k_out[:, :, cache_position] = key_states -++++++ v_out[:, :, cache_position] = value_states -++++++ -+++++ return k_out, v_out -+++++ -+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++index c695b944..d8303e45 100644 -+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++++ def rotate_half(x): -+++++ """Rotates half the hidden dims of the input.""" -+++++- x1 = x[..., : x.shape[-1] // 2] -+++++- x2 = x[..., x.shape[-1] // 2 :] -++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++ # x1 = x[..., : x.shape[-1] // 2] -++++++ # x2 = x[..., x.shape[-1] // 2 :] -++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++ return ops.cat((-x2, x1), dim=-1) -+++++ -+++++ -+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++++ if self.training: -+++++ raise NotImplementedError("Training is not supported yet.") -+++++ else: -+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++- if self.config.n_shared_experts is not None: -+++++- y = y + self.shared_experts(identity) -+++++- return y -++++++ # @lwx -++++++ if orig_shape[1] == 1: -++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++++ y=y.view(*orig_shape) -++++++ if self.config.n_shared_experts is not None: -++++++ y = y + self.shared_experts(identity) -++++++ return y -++++++ else: -++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++++ if self.config.n_shared_experts is not None: -++++++ y = y + self.shared_experts(identity) -++++++ return y -++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++ # if self.config.n_shared_experts is not None: -++++++ # y = y + self.shared_experts(identity) -++++++ # return y -++++++ -++++++ @no_grad() -++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++ -++++++ expert_cache = ops.zeros_like(x) -++++++ for i in range(self.num_experts_per_tok): -++++++ expert_id = flat_expert_indices[i].item() -++++++ weight = flat_expert_weights[i].item() -++++++ expert = self.experts[expert_id] -++++++ expert_out = expert(x) -++++++ expert_cache += expert_out * weight -++++++ return expert_cache -+++++ -+++++ @no_grad() -+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++- # expert_cache = torch.zeros_like(x) -+++++- # idxs = flat_expert_indices.argsort() -+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++- # token_idxs = idxs // self.num_experts_per_tok -+++++- # for i, end_idx in enumerate(tokens_per_expert): -+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++- # if start_idx == end_idx: -+++++- # continue -+++++- # expert = self.experts[i] -+++++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++++- # expert_tokens = x[exp_token_idx] -+++++- # expert_out = expert(expert_tokens) -+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++- # return expert_cache -++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++ expert_cache = ops.zeros_like(x) -+++++ idxs = flat_expert_indices.argsort() -+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++ token_idxs = idxs // self.num_experts_per_tok -++++++ -+++++ for i, end_idx in enumerate(tokens_per_expert): -+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++ if start_idx == end_idx: -+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++++ expert_out = expert(expert_tokens) -+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -+++++ return expert_cache -++++++ -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # # expert_cache = torch.zeros_like(x) -++++++ # # idxs = flat_expert_indices.argsort() -++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++ # # token_idxs = idxs // self.num_experts_per_tok -++++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++ # # if start_idx == end_idx: -++++++ # # continue -++++++ # # expert = self.experts[i] -++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # # expert_tokens = x[exp_token_idx] -++++++ # # expert_out = expert(expert_tokens) -++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++ # # return expert_cache -++++++ # expert_cache = ops.zeros_like(x) -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # for i, end_idx in enumerate(tokens_per_expert): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # if start_idx == end_idx: -++++++ # continue -++++++ # expert = self.experts[i] -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = expert(expert_tokens) -++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -++++++ # return expert_cache -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # expert_cache = ops.zeros_like(x) -++++++ -++++++ # # 排序保证顺序一致 -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # # 找出有 token 的专家 -++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++ -++++++ # for i in active_experts.tolist(): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # end_idx = tokens_per_expert[i] -++++++ # if start_idx == end_idx: # 没有 token -++++++ # continue -++++++ -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = self.experts[i](expert_tokens) -++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++ -++++++ # expert_cache = mindspore.mint.scatter_add( -++++++ # expert_cache, -++++++ # 0, -++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++ # expert_out -++++++ # ) -++++++ -++++++ # return expert_cache -++++++ -++++++ -+++++ -+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++++ # """ -+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -++++++ self.warm_up = False -++++++ -++++++ def warmup_moe_model_deep(self): -++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++++ test_texts = [ -++++++ "warmup short", -++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++++ ] -++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++ if tokenizer is None: -++++++ from mindnlp.transformers import AutoTokenizer -++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++ self._warmup_tokenizer = tokenizer -++++++ -++++++ for text in test_texts: -++++++ inputs = tokenizer(text, return_tensors="ms") -++++++ with mindspore._no_grad(): -++++++ _ = self(**inputs, use_cache=False) -++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++++ -+++++ def get_input_embeddings(self): -+++++ return self.model.embed_tokens -+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++ ```""" -++++++ if not self.warm_up: -++++++ self.warm_up = True -++++++ self.warmup_moe_model_deep() -++++++ -+++++ output_attentions = ( -+++++ output_attentions -+++++ if output_attentions is not None -+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++index 3cbf820e..d4c6b651 100644 -+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++@@ -18,7 +18,6 @@ -+++++ # See the License for the specific language governing permissions and -+++++ # limitations under the License. -+++++ """MindSpore Qwen2MoE model.""" -+++++- -+++++ import math -+++++ from typing import List, Optional, Tuple, Union -+++++ -+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++++ TokenClassifierOutput, -+++++ ) -+++++ from ...modeling_utils import PreTrainedModel -++++++from ...generation import GenerationMixin -+++++ from ....utils import logging -+++++ from .configuration_qwen2_moe import Qwen2MoeConfig -+++++ -+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++++ self.variance_epsilon = eps -+++++ -+++++ def forward(self, hidden_states): -++++++ # @dwj -++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++ # @lwx -++++++ # if not self.training : -++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++ input_dtype = hidden_states.dtype -+++++ hidden_states = hidden_states.to(mindspore.float32) -+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++++@@ -234,6 +239,8 @@ def rotate_half(x): -+++++ """Rotates half the hidden dims of the input.""" -+++++ x1 = x[..., : x.shape[-1] // 2] -+++++ x2 = x[..., x.shape[-1] // 2 :] -++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++ return ops.cat((-x2, x1), dim=-1) -+++++ -+++++ -+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++++ self.config = config -+++++ self.hidden_size = config.hidden_size -+++++ self.intermediate_size = intermediate_size -++++++ -+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++++ self.act_fn = ACT2FN[config.hidden_act] -+++++ -+++++ def forward(self, x): -+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++- -+++++ -++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++ # @lwx -++++++ # gate_up_output = self.gate_up_proj(x) -++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++++ # return self.down_proj(swiglu_output) -++++++ -++++++ # def forward(self, x): -++++++ # gate_proj_out = self.gate_proj(x) -++++++ # up_proj_out = self.up_proj(x) -++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++++ # return self.down_proj(swiglu_out) -++++++ -+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++++ """ -+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++++ use_cache: bool = False, -+++++ cache_position: Optional[mindspore.Tensor] = None, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ -++++++ -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ query_states = self.q_proj(hidden_states) -+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++ "with a layer index." -+++++ ) -+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ if isinstance(past_key_value, StaticCache): -++++++ kv_seq_len = key_states.shape[-2] -++++++ else: -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++ if isinstance(past_key_value, StaticCache): -++++++ kv_seq_len = key_states.shape[-2] -+++++ -+++++ # repeat k/v heads if n_kv_heads < n_heads -+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++- -++++++ -+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++ -+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++++- raise ValueError( -+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++++- f" {attn_weights.shape}" -+++++- ) -+++++- -+++++- if attention_mask is not None: # no matter the length, we just slice it -+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++++ if attention_mask is not None: -++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++ attn_weights = attn_weights + causal_mask -+++++ -+++++ # upcast attention to fp32 -+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++ -+++++ attn_output = self.o_proj(attn_output) -+++++- -++++++ # @lwx -++++++ -++++++ # max_seq_len = self.max_position_embeddings # 2048 -++++++ -++++++ # if attention_mask is not None: -++++++ # # attention_mask: [B, 1, Sq, Sk] -++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++ -++++++ # # pad 到 [max_seq_len, max_seq_len] -++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++ # global_attention_mask = padded_mask -++++++ # else: -++++++ # global_attention_mask = None -++++++ -++++++ -++++++ # sparse_mode=3 -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # real_shift=None, -++++++ # padding_mask=None, -++++++ -++++++ # head_num=self.num_heads, -++++++ # attn_mask=global_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ # input_layout="BNSD", -++++++ # pre_tokens=2147483647, -++++++ # next_tokens=2147483647, -++++++ # inner_precise=0, -++++++ # drop_mask=None, -++++++ # prefix=None, -++++++ # actual_seq_qlen=None, -++++++ # actual_seq_kvlen=None, -++++++ # sparse_mode=sparse_mode, -++++++ # ) -+++++ if not output_attentions: -+++++ attn_weights = None -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ -++++++class Qwen2MoeFlashAttention(nn.Module): -++++++ """ -++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++ -++++++ 关键改动: -++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++ 直接传入原始的 key 和 value 张量效率更高。 -++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++ """ -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++ super().__init__() -++++++ self.config = config -++++++ self.layer_idx = layer_idx -++++++ self.hidden_size = config.hidden_size -++++++ self.num_heads = config.num_attention_heads -++++++ self.head_dim = self.hidden_size // self.num_heads -++++++ self.num_key_value_heads = config.num_key_value_heads -++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++ self.max_position_embeddings = config.max_position_embeddings -++++++ self.rope_theta = config.rope_theta -++++++ self.attention_dropout = config.attention_dropout -++++++ -++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++++ raise ValueError( -++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++ ) -++++++ -++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++ -++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++ self.head_dim, -++++++ max_position_embeddings=self.max_position_embeddings, -++++++ base=self.rope_theta, -++++++ ) -++++++ -++++++ def forward( -++++++ self, -++++++ hidden_states: mindspore.Tensor, -++++++ attention_mask: Optional[mindspore.Tensor] = None, -++++++ position_ids: Optional[mindspore.Tensor] = None, -++++++ past_key_value: Optional[Cache] = None, -++++++ output_attentions: bool = False, -++++++ use_cache: bool = False, -++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # 1. 线性投射 Q, K, V -++++++ query_states = self.q_proj(hidden_states) -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++ # query: [B, S, H*D] -> [B, N1, S, D] -++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # 3. RoPE 旋转位置编码 -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++ if self.layer_idx is None: -++++++ raise ValueError( -++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ "with a layer index." -++++++ ) -++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++ if cache_position.shape[0] == 1: -++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++ kv_seq_len = past_seen_tokens + 1 -++++++ else: -++++++ # prefill 阶段:cache_position 是范围,使用其长度 -++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++ else: -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # 4. KV 缓存更新 -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ key_states, value_states = past_key_value.update( -++++++ key_states, value_states, self.layer_idx, cache_kwargs -++++++ ) -++++++ -++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++ if cache_position.shape[0] == 1: -++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++ kv_seq_len = key_states.shape[-2] -++++++ -++++++ # 5. [重要] 准备 Attention Mask -++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++ fa_attention_mask = None -++++++ if attention_mask is not None: -++++++ # 截取与当前key长度匹配的部分 -++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++ fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++ input_dtype = query_states.dtype -++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++ query_states = query_states.to(mindspore.float16) -++++++ key_states = key_states.to(mindspore.float16) -++++++ value_states = value_states.to(mindspore.float16) -++++++ -++++++ # 6. [核心] 调用 flash_attention_score 算子 -++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++ attn_output = mindspore.ops.flash_attention_score( -++++++ query=query_states, -++++++ key=key_states, -++++++ value=value_states, -++++++ head_num=self.num_heads, # 传入Q的头数(N1) -++++++ attn_mask=fa_attention_mask, -++++++ keep_prob=1.0 - self.attention_dropout, -++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ input_layout="BNSD", -++++++ sparse_mode=0 # 使用 defaultMask 模式 -++++++ ) -++++++ -++++++ # 恢复原始数据类型 -++++++ attn_output = attn_output.to(input_dtype) -++++++ -++++++ # 7. 调整输出形状 -++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = self.o_proj(attn_output) -++++++ -++++++ # FlashAttention 算子不直接返回注意力权重矩阵 -++++++ attn_weights = None -++++++ if output_attentions: -++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ # def forward( -++++++ # self, -++++++ # hidden_states: mindspore.Tensor, -++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++ # past_key_value: Optional[Cache] = None, -++++++ # output_attentions: bool = False, -++++++ # use_cache: bool = False, -++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ # bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # # 1. 线性投射 Q, K, V -++++++ # query_states = self.q_proj(hidden_states) -++++++ # key_states = self.k_proj(hidden_states) -++++++ # value_states = self.v_proj(hidden_states) -++++++ -++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # # 3. RoPE 旋转位置编码 -++++++ # kv_seq_len = key_states.shape[-2] -++++++ # if past_key_value is not None: -++++++ # if self.layer_idx is None: -++++++ # raise ValueError( -++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ # "with a layer index." -++++++ # ) -++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # # 4. KV 缓存更新 -++++++ # if past_key_value is not None: -++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ # key_states, value_states = past_key_value.update( -++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++ # ) -++++++ -++++++ # # 5. 准备 Attention Mask -++++++ # fa_attention_mask = None -++++++ # if attention_mask is not None: -++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++ # input_dtype = query_states.dtype -++++++ -++++++ # # 6. [核心] 调用 flash_attention_score 算子 -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # head_num=self.num_heads, -++++++ # attn_mask=fa_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ # input_layout="BNSD", -++++++ # sparse_mode=0, -++++++ # # <--- 修改点 2: 启用内部高精度计算 --- -++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++ # inner_precise=1 -++++++ # ) -++++++ -++++++ # # 恢复原始数据类型 -++++++ # attn_output = attn_output.to(input_dtype) -++++++ -++++++ # # 7. 调整输出形状 -++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ # attn_output = self.o_proj(attn_output) -++++++ -++++++ # attn_weights = None -++++++ # if output_attentions: -++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++ # return attn_output, attn_weights, past_key_value -++++++ -++++++ # def forward( -++++++ # self, -++++++ # hidden_states: mindspore.Tensor, -++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++ # past_key_value: Optional[Cache] = None, -++++++ # output_attentions: bool = False, -++++++ # use_cache: bool = False, -++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ # bsz, q_len, _ = hidden_states.shape -++++++ -++++++ # query_states = self.q_proj(hidden_states) -++++++ # key_states = self.k_proj(hidden_states) -++++++ # value_states = self.v_proj(hidden_states) -++++++ -++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ # kv_seq_len = key_states.shape[-2] -++++++ # if past_key_value is not None: -++++++ # if self.layer_idx is None: -++++++ # raise ValueError("`layer_idx` must be specified for caching") -++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ # if past_key_value is not None: -++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ # key_states, value_states = past_key_value.update( -++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++ # ) -++++++ -++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++ -++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++++ # query_states = query_states / math.sqrt(self.head_dim) -++++++ # # <--- 修改结束 --- -++++++ -++++++ # fa_attention_mask = None -++++++ # if attention_mask is not None: -++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ # fa_attention_mask = (mask_slice != 0) -++++++ -++++++ # input_dtype = query_states.dtype -++++++ -++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++ # query=query_states, # 传入已经预先缩放过的 query -++++++ # key=key_states, -++++++ # value=value_states, -++++++ # head_num=self.num_heads, -++++++ # attn_mask=fa_attention_mask, -++++++ # keep_prob=1.0 - self.attention_dropout, -++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++++ # input_layout="BNSD", -++++++ # sparse_mode=0, -++++++ # inner_precise=1 # 仍然保持内部高精度计算 -++++++ # ) -++++++ -++++++ # attn_output = attn_output.to(input_dtype) -++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ # attn_output = self.o_proj(attn_output) -++++++ -++++++ # attn_weights = None -++++++ # if output_attentions: -++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++++ -++++++ # return attn_output, attn_weights, past_key_value -++++++ -+++++ QWEN2MOE_ATTENTION_CLASSES = { -+++++ "eager": Qwen2MoeAttention, -++++++ "flash-attention": Qwen2MoeFlashAttention, -+++++ } -+++++ -+++++ -+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -++++++ #@dwj -++++++ # 只遍历激活的专家,而非全部专家 -+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- hidden_states = hidden_states.view(-1, hidden_dim) -+++++- # router_logits: (batch * sequence_length, n_experts) -+++++- router_logits = self.gate(hidden_states) -+++++- -+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- if self.norm_topk_prob: -+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- # we cast back to the input dtype -+++++- routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++- final_hidden_states = ops.zeros( -+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++++- ) -+++++- -+++++- # One hot encode the selected experts to create an expert mask -+++++- # this will be used to easily index which expert is going to be sollicitated -+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++++- -+++++- # Loop over all available experts in the model and perform the computation on each expert -+++++- for expert_idx in range(self.num_experts): -+++++- expert_layer = self.experts[expert_idx] -+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++++- -+++++- # Index the correct hidden states and compute the expert hidden state for -+++++- # the current expert. We need to make sure to multiply the output hidden -+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++++- if 0 not in idx.shape: -+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++++- -+++++- # However `index_add_` only support torch tensors for indexing so we'll use -+++++- # the `top_x` tensor here. -+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++++- -+++++- shared_expert_output = self.shared_expert(hidden_states) -+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++++- -+++++- final_hidden_states = final_hidden_states + shared_expert_output -++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++ num_tokens = hidden_states_reshaped.shape[0] -++++++ -++++++ router_logits = self.gate(hidden_states_reshaped) -++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++ if self.norm_topk_prob: -++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++ flat_selected_experts = selected_experts.flatten() -++++++ -++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++ token_indices = broadcasted_token_indices.flatten() -++++++ -++++++ active_experts = ops.unique(flat_selected_experts) -++++++ -++++++ for expert_idx_tensor in active_experts: -++++++ expert_idx = expert_idx_tensor.item() -++++++ expert_layer = self.experts[expert_idx] -++++++ -++++++ mask = (flat_selected_experts == expert_idx_tensor) -++++++ selected_token_indices = token_indices[mask] -++++++ selected_routing_weights = routing_weights.flatten()[mask] -++++++ -++++++ current_states = hidden_states_reshaped[selected_token_indices] -++++++ -++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++ -++++++ final_hidden_states = final_hidden_states.index_add( -++++++ dim=0, -++++++ index=selected_token_indices, -++++++ source=expert_output.to(hidden_states.dtype) -++++++ ) -++++++ -++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++ -+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++- return final_hidden_states, router_logits -++++++ final_hidden_states = final_hidden_states + shared_expert_output -++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++ -++++++ return final_hidden_states, router_logits -+++++ -+++++ -+++++ class Qwen2MoeDecoderLayer(nn.Module): -+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++++ -+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++ -+++++ if (layer_idx not in config.mlp_only_layers) and ( -+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++++ ): -+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++++ _skip_keys_device_placement = "past_key_values" -+++++ _supports_cache_class = True -++++++#lwx -++++++ # _supports_static_cache = True -+++++ -+++++ def _init_weights(self, module): -+++++ std = self.config.initializer_range -+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++++ return causal_mask -+++++ -+++++ -+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ _tied_weights_keys = ["lm_head.weight"] -+++++ -+++++ def __init__(self, config): -+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ self.num_experts_per_tok = config.num_experts_per_tok -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -++++++ # @lwx -++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++++ # self.generation_config.cache_implementation = "static" -++++++ self._warmed_up = False -++++++ -++++++ def warmup_moe_model(self): -++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++++ test_texts = [ -++++++ "warmup short", -++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++++ ] -++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++ if tokenizer is None: -++++++ from mindnlp.transformers import AutoTokenizer -++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++ self._warmup_tokenizer = tokenizer -++++++ -++++++ for text in test_texts: -++++++ inputs = tokenizer(text, return_tensors="ms") -++++++ with mindspore._no_grad(): -++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++++ -+++++ def get_input_embeddings(self): -+++++ return self.model.embed_tokens -+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++ ```""" -++++++ if not self._warmed_up: -++++++ self._warmed_up = True -++++++ self.warmup_moe_model() -+++++ -+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++++ output_router_logits = ( -+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++ } -+++++ ) -+++++ return model_inputs -++++++# @lwx -++++++ # def _decode_one_tokens_logits( -++++++ # self, -++++++ # cur_token: mindspore.Tensor, -++++++ # input_pos: Optional[mindspore.Tensor], -++++++ # cache_position: mindspore.Tensor, -++++++ # past_key_values: StaticCache, -++++++ # ) -> mindspore.Tensor: -++++++ # """ -++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++++ -++++++ # Args: -++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++++ # input_pos: 输入位置信息,可选 -++++++ # cache_position: 当前token在cache中的位置,shape为(1,) -++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++++ -++++++ # Returns: -++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++++ # """ -++++++ # # 调用JIT编译的版本 -++++++ # return self.get_decode_one_tokens_logits( -++++++ # cur_token=cur_token, -++++++ # input_pos=input_pos, -++++++ # cache_position=cache_position, -++++++ # past_key_values=past_key_values, -++++++ # ) -++++++ -++++++ # @mindspore.jit(jit_level='O1') -++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++++ # """ -++++++ # JIT编译的函数,用于高效的单token解码 -++++++ # 使用JIT编译优化以支持静态shape和高效执行 -++++++ -++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++++ # """ -++++++ # outputs = self.model.forward( -++++++ # input_ids=cur_token, -++++++ # position_ids=input_pos, -++++++ # cache_position=cache_position, -++++++ # past_key_values=past_key_values, -++++++ # use_cache=True, -++++++ # return_dict=False, -++++++ # ) -++++++ -++++++ # hidden_states = outputs[0] -++++++ # logits = self.lm_head.forward(hidden_states) -++++++ # logits = logits.float() -++++++ -++++++ # return logits[:, -1, :] -++++++ -++++++ # def _sample( -++++++ # self, -++++++ # input_ids: mindspore.Tensor, -++++++ # logits_processor, -++++++ # stopping_criteria, -++++++ # generation_config, -++++++ # synced_devices: bool, -++++++ # streamer=None, -++++++ # logits_warper=None, -++++++ # **model_kwargs, -++++++ # ): -++++++ # """ -++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++++ # """ -++++++ # from ...generation.logits_process import LogitsProcessorList -++++++ # from ...generation.stopping_criteria import StoppingCriteriaList -++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++++ # from mindnlp.core import nn, ops, no_grad -++++++ # import numpy as np -++++++ -++++++ # # 检查是否使用 StaticCache -++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++++ # # 否则,直接调用父类方法 -++++++ # past_key_values = model_kwargs.get("past_key_values") -++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++++ -++++++ # if not isinstance(past_key_values, StaticCache): -++++++ # # 不使用 StaticCache,直接调用父类方法 -++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++++ # return super()._sample( -++++++ # input_ids=input_ids, -++++++ # logits_processor=logits_processor, -++++++ # stopping_criteria=stopping_criteria, -++++++ # generation_config=generation_config, -++++++ # synced_devices=synced_devices, -++++++ # streamer=streamer, -++++++ # logits_warper=logits_warper, -++++++ # **model_kwargs, -++++++ # ) -++++++ -++++++ # # 使用 StaticCache,进入自定义循环 -++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++++ # pad_token_id = generation_config._pad_token_tensor -++++++ # output_attentions = generation_config.output_attentions -++++++ # output_hidden_states = generation_config.output_hidden_states -++++++ # output_scores = generation_config.output_scores -++++++ # output_logits = generation_config.output_logits -++++++ # return_dict_in_generate = generation_config.return_dict_in_generate -++++++ # max_length = generation_config.max_length -++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++++ # do_sample = generation_config.do_sample -++++++ -++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++++ # raise ValueError( -++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++++ # f"{logits_warper})." -++++++ # ) -++++++ -++++++ # # init attention / hidden states / scores tuples -++++++ # scores = () if (return_dict_in_generate and output_scores) else None -++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++++ -++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++++ # encoder_hidden_states = ( -++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++++ # ) -++++++ -++++++ # # keep track of which sequences are already finished -++++++ # batch_size, cur_len = input_ids.shape -++++++ # this_peer_finished = False -++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++++ -++++++ # time_record = [] -++++++ # from ....utils.testing_utils import parse_flag_from_env -++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++++ -++++++ # while self._has_unfinished_sequences( -++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++++ # ): -++++++ # if _record_time: -++++++ # import time as time_module -++++++ # infer_start = time_module.time() -++++++ -++++++ # # prepare model inputs -++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++++ -++++++ # # prepare variable output controls -++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++++ -++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++++ # cur_cache_position = model_inputs.get("cache_position") -++++++ # cur_past_key_values = model_inputs.get("past_key_values") -++++++ # cur_input_ids = model_inputs.get("input_ids") -++++++ -++++++ # if (isinstance(cur_past_key_values, StaticCache) and -++++++ # cur_cache_position is not None and -++++++ # len(cur_cache_position.shape) > 0 and -++++++ # cur_cache_position.shape[0] == 1 and -++++++ # cur_input_ids is not None and -++++++ # cur_input_ids.shape[1] == 1): -++++++ # # 使用 JIT 优化的单 token 解码 -++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++++ # if not hasattr(self, '_jit_used'): -++++++ # self._jit_used = False -++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++++ -++++++ # next_token_logits = self.get_decode_one_tokens_logits( -++++++ # cur_token=cur_input_ids, -++++++ # input_pos=model_inputs.get("position_ids"), -++++++ # cache_position=cur_cache_position, -++++++ # past_key_values=cur_past_key_values, -++++++ # ) -++++++ -++++++ # # 标记已使用JIT(用于后续判断) -++++++ # if not self._jit_used: -++++++ # self._jit_used = True -++++++ -++++++ # # 构造兼容的输出对象 -++++++ # class JitOptimizedOutput: -++++++ # def __init__(self, logits, config): -++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++++ # self.config = config -++++++ # # 对于 JIT 优化路径,这些属性通常不需要 -++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++++ # self.attentions = None if not config.is_encoder_decoder else None -++++++ # self.cross_attentions = None -++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++++ # self.hidden_states = None if not config.is_encoder_decoder else None -++++++ -++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++++ # else: -++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++++ # outputs = self(**model_inputs, return_dict=True) -++++++ -++++++ # if synced_devices and this_peer_finished: -++++++ # continue -++++++ -++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++++ # next_token_logits = outputs.logits[:, -1, :] -++++++ -++++++ # # pre-process distribution -++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++++ # if do_sample: -++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++++ -++++++ # # Store scores, attentions and hidden_states when required -++++++ # if return_dict_in_generate: -++++++ # if output_scores: -++++++ # scores += (next_token_scores,) -++++++ # if output_logits: -++++++ # raw_logits += (next_token_logits,) -++++++ # if output_attentions: -++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++++ # decoder_attentions += (attn,) if attn is not None else (None,) -++++++ # if self.config.is_encoder_decoder: -++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++++ -++++++ # if output_hidden_states: -++++++ # hidden = ( -++++++ # outputs.decoder_hidden_states -++++++ # if self.config.is_encoder_decoder -++++++ # else outputs.hidden_states -++++++ # ) -++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++++ -++++++ # # token selection -++++++ # if do_sample: -++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++++ # else: -++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++++ -++++++ # # finished sentences should have their next token be a padding token -++++++ # if has_eos_stopping_criteria: -++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++++ -++++++ # # update generated ids, model inputs, and length for next step -++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++++ # if streamer is not None: -++++++ # streamer.put(next_tokens) -++++++ -++++++ # model_kwargs = self._update_model_kwargs_for_generation( -++++++ # outputs, -++++++ # model_kwargs, -++++++ # is_encoder_decoder=self.config.is_encoder_decoder, -++++++ # ) -++++++ -++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++++ # cur_len += 1 -++++++ -++++++ # if _record_time: -++++++ # import time as time_module -++++++ # infer_stop = time_module.time() -++++++ # time_record.append(infer_stop - infer_start) -++++++ -++++++ # del outputs -++++++ -++++++ # average_infer_time = None -++++++ # if time_record: -++++++ # if len(time_record) > 1: -++++++ # time_record.pop(0) -++++++ # average_infer_time = sum(time_record) / len(time_record) -++++++ # print(f'average inference time is: {average_infer_time}') -++++++ # print(f'inference time record: {time_record}') -++++++ -++++++ # if streamer is not None: -++++++ # streamer.end() -++++++ -++++++ # # 简单判断:打印是否使用了JIT路径 -++++++ # if hasattr(self, '_jit_used') and self._jit_used: -++++++ # print("[JIT] ✓ JIT optimization was used during generation") -++++++ # else: -++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++++ -++++++ # if return_dict_in_generate: -++++++ # if self.config.is_encoder_decoder: -++++++ # return GenerateEncoderDecoderOutput( -++++++ # sequences=input_ids, -++++++ # scores=scores, -++++++ # logits=raw_logits, -++++++ # encoder_attentions=encoder_attentions, -++++++ # encoder_hidden_states=encoder_hidden_states, -++++++ # decoder_attentions=decoder_attentions, -++++++ # cross_attentions=cross_attentions, -++++++ # decoder_hidden_states=decoder_hidden_states, -++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++ # average_infer_time=average_infer_time -++++++ # ) -++++++ # else: -++++++ # return GenerateDecoderOnlyOutput( -++++++ # sequences=input_ids, -++++++ # scores=scores, -++++++ # logits=raw_logits, -++++++ # attentions=decoder_attentions, -++++++ # hidden_states=decoder_hidden_states, -++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++ # average_infer_time=average_infer_time -++++++ # ) -++++++ # else: -++++++ # return input_ids -++++++ -++++++ # def _prepare_cache_for_generation( -++++++ # self, -++++++ # generation_config, -++++++ # model_kwargs, -++++++ # assistant_model, -++++++ # batch_size, -++++++ # max_cache_length, -++++++ # ): -++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++++ # generation_config.cache_implementation = "static" -++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++++ -++++++ # if generation_config.cache_implementation == "static": -++++++ # base_required_from_max_length = generation_config.max_length + 1 -++++++ # base_required = max(max_cache_length, base_required_from_max_length) -++++++ # min_cache_size = 50 -++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++++ # else: -++++++ # max_cache_length = max(base_required, min_cache_size) -++++++ -++++++ # original_max_cache_length = max_cache_length -++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++++ # print(f" - input max_cache_length: {original_max_cache_length}") -++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++++ # print(f" - final max_cache_length: {max_cache_length}") -++++++ -++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++ # if max_cache_length > self.config.max_position_embeddings: -++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++++ -++++++ # result = super()._prepare_cache_for_generation( -++++++ # generation_config=generation_config, -++++++ # model_kwargs=model_kwargs, -++++++ # assistant_model=assistant_model, -++++++ # batch_size=batch_size, -++++++ # max_cache_length=max_cache_length, -++++++ # ) -++++++ -++++++ # if generation_config.cache_implementation == "static": -++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++++ # created_cache = model_kwargs.get(cache_name) -++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++++ # if created_cache.max_cache_len < generation_config.max_length: -++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++++ -++++++ # return result -++++++ -++++++ -++++++ -+++++ -+++++ -+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++++-- -+++++2.27.0 -+++++ -++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++++new file mode 100644 -++++index 00000000..22b65dd5 -++++--- /dev/null -+++++++ b/patches/0002-20251106commit.patch -++++@@ -0,0 +1,3200 @@ -+++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+++++From: Pinoeer-kingxi <13022943007@163.com> -+++++Date: Thu, 6 Nov 2025 09:20:38 +0800 -+++++Subject: [PATCH 2/3] 20251106commit -+++++ -+++++--- -+++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -+++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -+++++ 3 files changed, 2689 insertions(+), 305 deletions(-) -+++++ create mode 100644 patches/0001-20251104commit.patch -+++++ -+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++index d8303e45..73773c22 100644 -+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -+++++ # y = y + self.shared_experts(identity) -+++++ # return y -+++++ -++++++ # @no_grad() -++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++ -++++++ # expert_cache = ops.zeros_like(x) -++++++ # for i in range(self.num_experts_per_tok): -++++++ # expert_id = flat_expert_indices[i].item() -++++++ # weight = flat_expert_weights[i].item() -++++++ # expert = self.experts[expert_id] -++++++ # expert_out = expert(x) -++++++ # expert_cache += expert_out * weight -++++++ # return expert_cache -++++++ -+++++ @no_grad() -+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++ # x 的 shape: (1, hidden_size) -++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++++++ -++++++ # 1. 收集所有需要的专家层 -++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] -++++++ -++++++ # 2. 并行计算所有专家的输出 -++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++++++ # ops.cat 会将它们堆叠成一个新的 Tensor -++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++++++ -++++++ # 3. 使用矩阵乘法进行加权求和 -++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++++ # 最终结果 final_output 的 shape: (1, hidden_size) -++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++++++ -++++++ return final_output -+++++ -+++++- expert_cache = ops.zeros_like(x) -+++++- for i in range(self.num_experts_per_tok): -+++++- expert_id = flat_expert_indices[i].item() -+++++- weight = flat_expert_weights[i].item() -+++++- expert = self.experts[expert_id] -+++++- expert_out = expert(x) -+++++- expert_cache += expert_out * weight -+++++- return expert_cache -+++++ -+++++ @no_grad() -+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++ # @lwx -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+++++ -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ -++++++# class DeepseekFlashAttention(nn.Module): -++++++# """ -++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -++++++ -++++++# This class is designed as a drop-in replacement for DeepseekAttention. -++++++# """ -++++++ -++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++++++# super().__init__() -++++++# self.config = config -++++++# self.layer_idx = layer_idx -++++++# if layer_idx is None: -++++++# logger.warning( -++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++++# "when creating this class." -++++++# ) -++++++ -++++++# self.attention_dropout = config.attention_dropout -++++++# self.hidden_size = config.hidden_size -++++++# self.num_heads = config.num_attention_heads -++++++# self.head_dim = self.hidden_size // self.num_heads -++++++# self.num_key_value_heads = config.num_key_value_heads -++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++# self.max_position_embeddings = config.max_position_embeddings -++++++# self.rope_theta = config.rope_theta -++++++# self.is_causal = True -++++++ -++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++++# raise ValueError( -++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++++# f" and `num_heads`: {self.num_heads})." -++++++# ) -++++++ -++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++++++# self._init_rope() -++++++ -++++++# def _init_rope(self): -++++++# if self.config.rope_scaling is None: -++++++# self.rotary_emb = DeepseekRotaryEmbedding( -++++++# self.head_dim, -++++++# max_position_embeddings=self.max_position_embeddings, -++++++# base=self.rope_theta, -++++++# ) -++++++# else: -++++++# scaling_type = self.config.rope_scaling["type"] -++++++# scaling_factor = self.config.rope_scaling["factor"] -++++++# if scaling_type == "linear": -++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++++++# self.head_dim, -++++++# max_position_embeddings=self.max_position_embeddings, -++++++# scaling_factor=scaling_factor, -++++++# base=self.rope_theta, -++++++# ) -++++++# elif scaling_type == "dynamic": -++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++++++# self.head_dim, -++++++# max_position_embeddings=self.max_position_embeddings, -++++++# scaling_factor=scaling_factor, -++++++# base=self.rope_theta, -++++++# ) -++++++# else: -++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++++++ -++++++# def forward( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# attention_mask: Optional[mindspore.Tensor] = None, -++++++# position_ids: Optional[mindspore.Tensor] = None, -++++++# past_key_value: Optional[Cache] = None, -++++++# output_attentions: bool = False, -++++++# use_cache: bool = False, -++++++# **kwargs, -++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++# if "padding_mask" in kwargs: -++++++# warnings.warn( -++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++++++# ) -++++++ -++++++# if output_attentions: -++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -++++++ -++++++# bsz, q_len, _ = hidden_states.shape -++++++ -++++++# if self.config.pretraining_tp > 1: -++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++++++ -++++++# query_states = self.q_proj(hidden_states) -++++++# key_states = self.k_proj(hidden_states) -++++++# value_states = self.v_proj(hidden_states) -++++++ -++++++# # Reshape for multi-head attention -++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++# kv_seq_len = key_states.shape[-2] -++++++# if past_key_value is not None: -++++++# if self.layer_idx is None: -++++++# raise ValueError( -++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++# "with a layer index." -++++++# ) -++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++# # Apply Rotary Positional Embedding -++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++# if past_key_value is not None: -++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ -++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++++ -++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -++++++ -++++++# # Convert attention_mask for flash_attention_score -++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -++++++# if attention_mask is not None: -++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++++++# raise ValueError( -++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++++++# ) -++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -++++++# else: -++++++# attn_mask_for_fa = None -++++++ -++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++++++ -++++++# # Call the fused flash_attention_score operator -++++++# attn_output = mindspore.ops.flash_attention_score( -++++++# query=query_states_for_fa, -++++++# key=key_states_for_fa, -++++++# value=value_states_for_fa, -++++++# head_num=self.num_heads, # This is N1, the number of query heads -++++++# input_layout='BSH', -++++++# attn_mask=attn_mask_for_fa, -++++++# keep_prob=keep_prob, -++++++# scalar_value=1.0 / math.sqrt(self.head_dim), -++++++# sparse_mode=0 # Default mask mode -++++++# ) -++++++ -++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -++++++# attn_output = self.o_proj(attn_output) -++++++ -++++++# # Flash Attention does not return attention weights -++++++# attn_weights = None -++++++ -++++++# return attn_output, attn_weights, past_key_value -++++++ -++++++class DeepseekFlashAttention(nn.Module): -++++++ """ -++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, -++++++ designed for high performance on supported hardware (Ascend). -++++++ -++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -++++++ """ -++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -++++++ super().__init__() -++++++ self.config = config -++++++ self.layer_idx = layer_idx -++++++ if layer_idx is None: -++++++ logger.warning( -++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++++ "when creating this class." -++++++ ) -++++++ -++++++ # --- [FIX] Correctly initialize all required attributes --- -++++++ self.attention_dropout = config.attention_dropout -++++++ self.hidden_size = config.hidden_size -++++++ self.num_heads = config.num_attention_heads -++++++ self.head_dim = self.hidden_size // self.num_heads -++++++ self.num_key_value_heads = config.num_key_value_heads -++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++ self.max_position_embeddings = config.max_position_embeddings -++++++ self.rope_theta = config.rope_theta -++++++ self.is_causal = True -++++++ -++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++++ raise ValueError( -++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++++ f" and `num_heads`: {self.num_heads})." -++++++ ) -++++++ -++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -++++++ -++++++ # This call will now succeed as all attributes are initialized. -++++++ self._init_rope() -++++++ -++++++ def _init_rope(self): -++++++ if self.config.rope_scaling is None: -++++++ self.rotary_emb = DeepseekRotaryEmbedding( -++++++ self.head_dim, -++++++ max_position_embeddings=self.max_position_embeddings, -++++++ base=self.rope_theta, -++++++ ) -++++++ else: -++++++ scaling_type = self.config.rope_scaling["type"] -++++++ scaling_factor = self.config.rope_scaling["factor"] -++++++ if scaling_type == "linear": -++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -++++++ self.head_dim, -++++++ max_position_embeddings=self.max_position_embeddings, -++++++ scaling_factor=scaling_factor, -++++++ base=self.rope_theta, -++++++ ) -++++++ elif scaling_type == "dynamic": -++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -++++++ self.head_dim, -++++++ max_position_embeddings=self.max_position_embeddings, -++++++ scaling_factor=scaling_factor, -++++++ base=self.rope_theta, -++++++ ) -++++++ else: -++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -++++++ -++++++ def forward( -++++++ self, -++++++ hidden_states: mindspore.Tensor, -++++++ attention_mask: Optional[mindspore.Tensor] = None, -++++++ position_ids: Optional[mindspore.Tensor] = None, -++++++ past_key_value: Optional[Cache] = None, -++++++ output_attentions: bool = False, -++++++ use_cache: bool = False, -++++++ **kwargs, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ if "padding_mask" in kwargs: -++++++ warnings.warn( -++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -++++++ ) -++++++ if output_attentions: -++++++ warnings.warn( -++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -++++++ ) -++++++ -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ if self.config.pretraining_tp > 1: -++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -++++++ -++++++ query_states = self.q_proj(hidden_states) -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++ # Apply Rotary Position Embedding -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos} -++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -++++++ # So we must explicitly repeat the KV heads. -++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++ -++++++ # Convert attention mask for flash_attention_score -++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -++++++ if attention_mask is not None: -++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -++++++ raise ValueError( -++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -++++++ ) -++++++ attn_mask_for_fa = attention_mask < 0 -++++++ else: -++++++ attn_mask_for_fa = None -++++++ -++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -++++++ -++++++ # Call the fused operator using the efficient BNSD layout -++++++ attn_output = mindspore.ops.flash_attention_score( -++++++ query=query_states, -++++++ key=key_states, -++++++ value=value_states, -++++++ head_num=self.num_heads, -++++++ input_layout='BNSD', # Specify the correct layout -++++++ attn_mask=attn_mask_for_fa, -++++++ keep_prob=keep_prob, -++++++ scalar_value=1.0 / math.sqrt(self.head_dim) -++++++ ) -++++++ -++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ -++++++ # Apply output projection -++++++ attn_output = self.o_proj(attn_output) -++++++ -++++++ # Flash attention does not return attention weights, so we return None. -++++++ attn_weights = None -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -+++++ Deepseek_ATTENTION_CLASSES = { -+++++ "eager": DeepseekAttention, -++++++ "flash-attention": DeepseekFlashAttention, -+++++ } -+++++ -+++++ -+++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -+++++ config=config, layer_idx=layer_idx -+++++ ) -+++++ -++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -++++++ config=config, layer_idx=layer_idx -++++++ ) -++++++ -+++++ self.mlp = ( -+++++ DeepseekMoE(config) -+++++ if ( -+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++index d4c6b651..bced285c 100644 -+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -+++++ -+++++ import mindspore -+++++ import mindnlp.core.nn.functional as F -+++++-from mindnlp.core import nn, ops -++++++from mindnlp.core import nn, ops, no_grad -+++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -+++++ -+++++ from ....common.activations import ACT2FN -+++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+++++ -++++++Long_Prompt = False -++++++PROMPT_LENGTH_THRESHOLD = 128 -+++++ -+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+++++ def _prepare_4d_causal_attention_mask_with_cache_position( -+++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++ -++++++# class Qwen2MoeFlashAttention(nn.Module): -++++++# """ -++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++ -++++++# 关键改动: -++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++# 直接传入原始的 key 和 value 张量效率更高。 -++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++# super().__init__() -++++++# self.config = config -++++++# self.layer_idx = layer_idx -++++++# self.hidden_size = config.hidden_size -++++++# self.num_heads = config.num_attention_heads -++++++# self.head_dim = self.hidden_size // self.num_heads -++++++# self.num_key_value_heads = config.num_key_value_heads -++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++# self.max_position_embeddings = config.max_position_embeddings -++++++# self.rope_theta = config.rope_theta -++++++# self.attention_dropout = config.attention_dropout -++++++ -++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++++# raise ValueError( -++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++# ) -++++++ -++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++ -++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++# self.head_dim, -++++++# max_position_embeddings=self.max_position_embeddings, -++++++# base=self.rope_theta, -++++++# ) -++++++ -++++++# def forward( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# attention_mask: Optional[mindspore.Tensor] = None, -++++++# position_ids: Optional[mindspore.Tensor] = None, -++++++# past_key_value: Optional[Cache] = None, -++++++# output_attentions: bool = False, -++++++# use_cache: bool = False, -++++++# cache_position: Optional[mindspore.Tensor] = None, -++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++# bsz, q_len, _ = hidden_states.shape -++++++ -++++++# # 1. 线性投射 Q, K, V -++++++# query_states = self.q_proj(hidden_states) -++++++# key_states = self.k_proj(hidden_states) -++++++# value_states = self.v_proj(hidden_states) -++++++ -++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++# # query: [B, S, H*D] -> [B, N1, S, D] -++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++# # 3. RoPE 旋转位置编码 -++++++# kv_seq_len = key_states.shape[-2] -++++++# if past_key_value is not None: -++++++# if self.layer_idx is None: -++++++# raise ValueError( -++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++# "with a layer index." -++++++# ) -++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++# if cache_position.shape[0] == 1: -++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++# kv_seq_len = past_seen_tokens + 1 -++++++# else: -++++++# # prefill 阶段:cache_position 是范围,使用其长度 -++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++# else: -++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++# # 4. KV 缓存更新 -++++++# if past_key_value is not None: -++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++# key_states, value_states = past_key_value.update( -++++++# key_states, value_states, self.layer_idx, cache_kwargs -++++++# ) -++++++ -++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++# if cache_position.shape[0] == 1: -++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++# kv_seq_len = key_states.shape[-2] -++++++ -++++++# # 5. [重要] 准备 Attention Mask -++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++# fa_attention_mask = None -++++++# if attention_mask is not None: -++++++# # 截取与当前key长度匹配的部分 -++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++# fa_attention_mask = (mask_slice != 0) -++++++ -++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++# input_dtype = query_states.dtype -++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++# query_states = query_states.to(mindspore.float16) -++++++# key_states = key_states.to(mindspore.float16) -++++++# value_states = value_states.to(mindspore.float16) -++++++ -++++++# # 6. [核心] 调用 flash_attention_score 算子 -++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++# attn_output = mindspore.ops.flash_attention_score( -++++++# query=query_states, -++++++# key=key_states, -++++++# value=value_states, -++++++# head_num=self.num_heads, # 传入Q的头数(N1) -++++++# attn_mask=fa_attention_mask, -++++++# keep_prob=1.0 - self.attention_dropout, -++++++# scalar_value=1.0 / math.sqrt(self.head_dim), -++++++# input_layout="BNSD", -++++++# sparse_mode=0 # 使用 defaultMask 模式 -++++++# ) -++++++ -++++++# # 恢复原始数据类型 -++++++# attn_output = attn_output.to(input_dtype) -++++++ -++++++# # 7. 调整输出形状 -++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++# attn_output = self.o_proj(attn_output) -++++++ -++++++# # FlashAttention 算子不直接返回注意力权重矩阵 -++++++# attn_weights = None -++++++# if output_attentions: -++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++# return attn_output, attn_weights, past_key_value -++++++ -++++++# # def forward( -++++++# # self, -++++++# # hidden_states: mindspore.Tensor, -++++++# # attention_mask: Optional[mindspore.Tensor] = None, -++++++# # position_ids: Optional[mindspore.Tensor] = None, -++++++# # past_key_value: Optional[Cache] = None, -++++++# # output_attentions: bool = False, -++++++# # use_cache: bool = False, -++++++# # cache_position: Optional[mindspore.Tensor] = None, -++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++# # bsz, q_len, _ = hidden_states.shape -++++++ -++++++# # # 1. 线性投射 Q, K, V -++++++# # query_states = self.q_proj(hidden_states) -++++++# # key_states = self.k_proj(hidden_states) -++++++# # value_states = self.v_proj(hidden_states) -++++++ -++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -++++++# # # 3. RoPE 旋转位置编码 -++++++# # kv_seq_len = key_states.shape[-2] -++++++# # if past_key_value is not None: -++++++# # if self.layer_idx is None: -++++++# # raise ValueError( -++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++# # "with a layer index." -++++++# # ) -++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++# # # 4. KV 缓存更新 -++++++# # if past_key_value is not None: -++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++# # key_states, value_states = past_key_value.update( -++++++# # key_states, value_states, self.layer_idx, cache_kwargs -++++++# # ) -++++++ -++++++# # # 5. 准备 Attention Mask -++++++# # fa_attention_mask = None -++++++# # if attention_mask is not None: -++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++# # fa_attention_mask = (mask_slice != 0) -++++++ -++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++# # input_dtype = query_states.dtype -++++++ -++++++# # # 6. [核心] 调用 flash_attention_score 算子 -++++++# # attn_output = mindspore.ops.flash_attention_score( -++++++# # query=query_states, -++++++# # key=key_states, -++++++# # value=value_states, -++++++# # head_num=self.num_heads, -++++++# # attn_mask=fa_attention_mask, -++++++# # keep_prob=1.0 - self.attention_dropout, -++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++# # input_layout="BNSD", -++++++# # sparse_mode=0, -++++++# # # <--- 修改点 2: 启用内部高精度计算 --- -++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++# # inner_precise=1 -++++++# # ) -++++++ -++++++# # # 恢复原始数据类型 -++++++# # attn_output = attn_output.to(input_dtype) -++++++ -++++++# # # 7. 调整输出形状 -++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++# # attn_output = self.o_proj(attn_output) -++++++ -++++++# # attn_weights = None -++++++# # if output_attentions: -++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ -++++++# # return attn_output, attn_weights, past_key_value -++++++ -++++++ -+++++ class Qwen2MoeFlashAttention(nn.Module): -+++++ """ -+++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++- -+++++- 关键改动: -+++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++- 直接传入原始的 key 和 value 张量效率更高。 -+++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -++++++ -++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, -++++++ 以达到理论上的最高执行速度。 -+++++ """ -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++ super().__init__() -+++++ self.config = config -+++++ self.layer_idx = layer_idx -++++++ if layer_idx is None: -++++++ logger.warning_once( -++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -++++++ ) -++++++ -+++++ self.hidden_size = config.hidden_size -+++++ self.num_heads = config.num_attention_heads -+++++ self.head_dim = self.hidden_size // self.num_heads -+++++ self.num_key_value_heads = config.num_key_value_heads -+++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++ self.max_position_embeddings = config.max_position_embeddings -+++++ self.rope_theta = config.rope_theta -+++++ self.attention_dropout = config.attention_dropout -+++++ -+++++- if (self.head_dim * self.num_heads) != self.hidden_size: -+++++- raise ValueError( -+++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++- ) -+++++- -+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++- # query: [B, S, H*D] -> [B, N1, S, D] -+++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++ # 2. 调整形状以匹配 BNSD 布局 -+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- -+++++- # 3. RoPE 旋转位置编码 -++++++ -++++++ # 3. RoPE 和 KV 缓存 -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++- if self.layer_idx is None: -+++++- raise ValueError( -+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++- "with a layer index." -+++++- ) -+++++- # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++- if cache_position.shape[0] == 1: -+++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++- kv_seq_len = past_seen_tokens + 1 -+++++- else: -+++++- # prefill 阶段:cache_position 是范围,使用其长度 -+++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++- else: -+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++- -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++- # 4. KV 缓存更新 -+++++ if past_key_value is not None: -+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++- key_states, value_states = past_key_value.update( -+++++- key_states, value_states, self.layer_idx, cache_kwargs -+++++- ) -+++++- -+++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++- if cache_position.shape[0] == 1: -+++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++- kv_seq_len = key_states.shape[-2] -+++++- -+++++- # 5. [重要] 准备 Attention Mask -+++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++ # 4. 准备 Attention Mask -+++++ fa_attention_mask = None -+++++ if attention_mask is not None: -+++++- # 截取与当前key长度匹配的部分 -+++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++- # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++ fa_attention_mask = (mask_slice != 0) -+++++ -+++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++- input_dtype = query_states.dtype -+++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++- query_states = query_states.to(mindspore.float16) -+++++- key_states = key_states.to(mindspore.float16) -+++++- value_states = value_states.to(mindspore.float16) -+++++- -+++++- # 6. [核心] 调用 flash_attention_score 算子 -+++++- # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -+++++ attn_output = mindspore.ops.flash_attention_score( -+++++ query=query_states, -+++++ key=key_states, -+++++ value=value_states, -+++++- head_num=self.num_heads, # 传入Q的头数(N1) -++++++ head_num=self.num_heads, -+++++ attn_mask=fa_attention_mask, -+++++- keep_prob=1.0 - self.attention_dropout, -++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -+++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++ input_layout="BNSD", -+++++- sparse_mode=0 # 使用 defaultMask 模式 -++++++ sparse_mode=0, -++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -+++++ ) -+++++ -+++++- # 恢复原始数据类型 -+++++- attn_output = attn_output.to(input_dtype) -+++++- -+++++- # 7. 调整输出形状 -+++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++ # 6. 调整输出形状 -+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++ attn_output = self.o_proj(attn_output) -+++++ -+++++- # FlashAttention 算子不直接返回注意力权重矩阵 -++++++ # 7. 返回结果 -+++++ attn_weights = None -+++++ if output_attentions: -+++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++- # def forward( -+++++- # self, -+++++- # hidden_states: mindspore.Tensor, -+++++- # attention_mask: Optional[mindspore.Tensor] = None, -+++++- # position_ids: Optional[mindspore.Tensor] = None, -+++++- # past_key_value: Optional[Cache] = None, -+++++- # output_attentions: bool = False, -+++++- # use_cache: bool = False, -+++++- # cache_position: Optional[mindspore.Tensor] = None, -+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++- -+++++- # bsz, q_len, _ = hidden_states.shape -+++++- -+++++- # # 1. 线性投射 Q, K, V -+++++- # query_states = self.q_proj(hidden_states) -+++++- # key_states = self.k_proj(hidden_states) -+++++- # value_states = self.v_proj(hidden_states) -+++++- -+++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- -+++++- # # 3. RoPE 旋转位置编码 -+++++- # kv_seq_len = key_states.shape[-2] -+++++- # if past_key_value is not None: -+++++- # if self.layer_idx is None: -+++++- # raise ValueError( -+++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++- # "with a layer index." -+++++- # ) -+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++ -+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++- -+++++- # # 4. KV 缓存更新 -+++++- # if past_key_value is not None: -+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++- # key_states, value_states = past_key_value.update( -+++++- # key_states, value_states, self.layer_idx, cache_kwargs -+++++- # ) -+++++- -+++++- # # 5. 准备 Attention Mask -+++++- # fa_attention_mask = None -+++++- # if attention_mask is not None: -+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++- # fa_attention_mask = (mask_slice != 0) -+++++- -+++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++- # input_dtype = query_states.dtype -+++++- -+++++- # # 6. [核心] 调用 flash_attention_score 算子 -+++++- # attn_output = mindspore.ops.flash_attention_score( -+++++- # query=query_states, -+++++- # key=key_states, -+++++- # value=value_states, -+++++- # head_num=self.num_heads, -+++++- # attn_mask=fa_attention_mask, -+++++- # keep_prob=1.0 - self.attention_dropout, -+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++- # input_layout="BNSD", -+++++- # sparse_mode=0, -+++++- # # <--- 修改点 2: 启用内部高精度计算 --- -+++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++- # inner_precise=1 -+++++- # ) -+++++- -+++++- # # 恢复原始数据类型 -+++++- # attn_output = attn_output.to(input_dtype) -++++++QWEN2MOE_ATTENTION_CLASSES = { -++++++ "eager": Qwen2MoeAttention, -++++++ "flash-attention": Qwen2MoeFlashAttention, -++++++} -+++++ -+++++- # # 7. 调整输出形状 -+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++- # attn_output = self.o_proj(attn_output) -+++++ -+++++- # attn_weights = None -+++++- # if output_attentions: -+++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# def __init__(self, config): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# # gating -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++ -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# #@dwj -++++++# # 只遍历激活的专家,而非全部专家 -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# num_tokens = hidden_states_reshaped.shape[0] -++++++ -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++# flat_selected_experts = selected_experts.flatten() -++++++ -++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++# token_indices = broadcasted_token_indices.flatten() -++++++ -++++++# active_experts = ops.unique(flat_selected_experts) -++++++ -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++ -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# selected_token_indices = token_indices[mask] -++++++# selected_routing_weights = routing_weights.flatten()[mask] -++++++ -++++++# current_states = hidden_states_reshaped[selected_token_indices] -++++++ -++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++ -++++++# final_hidden_states = final_hidden_states.index_add( -++++++# dim=0, -++++++# index=selected_token_indices, -++++++# source=expert_output.to(hidden_states.dtype) -++++++# ) -++++++ -++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++ -+++++- # return attn_output, attn_weights, past_key_value -++++++# final_hidden_states = final_hidden_states + shared_expert_output -++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -++++++ -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# """ -++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# # 门控网络 -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# # 专家列表 -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++# # 共享专家 -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_decode( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# """ -++++++# 【解码路径】针对 sequence_length=1 的极致优化。 -++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++++++# """ -++++++# batch_size, hidden_dim = hidden_states.shape -++++++ -++++++# expert_outputs_list = [ -++++++# ops.cat([ -++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++# ], dim=0) -++++++# for i in range(batch_size) -++++++# ] -++++++ -++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++++++# # shape: (batch_size, top_k, hidden_dim) -++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++ -++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++ -++++++# return moe_output.squeeze(1) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_prefill( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# """ -++++++# 【预填充路径】针对 sequence_length > 1 的优化。 -++++++# 按专家对 Token 进行分组,并进行批处理。 -++++++# """ -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens = hidden_states.shape[0] -++++++# flat_selected_experts = selected_experts.flatten() -++++++ -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++ -++++++# active_experts = ops.unique(flat_selected_experts) -++++++ -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++ -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# selected_token_indices = token_indices[mask] -++++++# selected_routing_weights = routing_weights.flatten()[mask] -++++++ -++++++# current_states = hidden_states[selected_token_indices] -++++++ -++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++ -++++++# moe_output = moe_output.index_add( -++++++# dim=0, -++++++# index=selected_token_indices, -++++++# source=expert_output.to(hidden_states.dtype) -++++++# ) -++++++# return moe_output -++++++ -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# """ -++++++# 顶层 forward 方法,作为智能分发器。 -++++++# """ -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++- # def forward( -+++++- # self, -+++++- # hidden_states: mindspore.Tensor, -+++++- # attention_mask: Optional[mindspore.Tensor] = None, -+++++- # position_ids: Optional[mindspore.Tensor] = None, -+++++- # past_key_value: Optional[Cache] = None, -+++++- # output_attentions: bool = False, -+++++- # use_cache: bool = False, -+++++- # cache_position: Optional[mindspore.Tensor] = None, -+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++- -+++++- # bsz, q_len, _ = hidden_states.shape -+++++- -+++++- # query_states = self.q_proj(hidden_states) -+++++- # key_states = self.k_proj(hidden_states) -+++++- # value_states = self.v_proj(hidden_states) -+++++- -+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- -+++++- # kv_seq_len = key_states.shape[-2] -+++++- # if past_key_value is not None: -+++++- # if self.layer_idx is None: -+++++- # raise ValueError("`layer_idx` must be specified for caching") -+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++- -+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++- -+++++- # if past_key_value is not None: -+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++- # key_states, value_states = past_key_value.update( -+++++- # key_states, value_states, self.layer_idx, cache_kwargs -+++++- # ) -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++# moe_output = None -++++++# # 在推理时,根据序列长度选择最优路径 -++++++# if not self.training: -++++++# if sequence_length == 1: -++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++++# else: -++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++++# else: -++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++++++# raise NotImplementedError("Training path is not implemented.") -++++++ -++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++++++ -++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++++++ -++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -++++++ -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# """ -++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# # 门控网络 -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# # 专家列表 -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++# # 共享专家 -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_decode( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# batch_size, _ = hidden_states.shape -++++++# expert_outputs_list = [ -++++++# ops.cat([ -++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++# ], dim=0) -++++++# for i in range(batch_size) -++++++# ] -++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++# return moe_output.squeeze(1) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_prefill( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens = hidden_states.shape[0] -++++++# flat_selected_experts = selected_experts.flatten() -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++# active_experts = ops.unique(flat_selected_experts) -++++++ -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# selected_token_indices = token_indices[mask] -++++++# selected_routing_weights = routing_weights.flatten()[mask] -++++++# current_states = hidden_states[selected_token_indices] -++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++# moe_output = moe_output.index_add( -++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++++# ) -++++++# return moe_output -++++++ -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# """ -++++++# 顶层 forward 方法,作为智能分发器。 -++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++++++# """ -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ -++++++# # 1. 门控计算 (通用逻辑) -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++# # 2. 智能分发到最优 MoE 路径 -++++++# moe_output = None -++++++# if not self.training: -++++++# if sequence_length == 1: -++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++++# else: -++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++++# else: -++++++# raise NotImplementedError("Training path is not implemented.") -++++++ -++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++ -++++++# # 4. 合并 MoE 输出和共享专家输出 -++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++ -++++++# # 5. 恢复原始形状并返回 -++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -++++++# prefill fastest -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# """ -++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# # 门控网络 -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# # 专家列表 -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++# # 共享专家 -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_dispatch( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# """ -++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++++++# """ -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens, _ = hidden_states.shape -++++++ -++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++++++# flat_selected_experts = selected_experts.flatten() -++++++# flat_routing_weights = routing_weights.flatten() -+++++ -+++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++- -+++++- # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++- # query_states = query_states / math.sqrt(self.head_dim) -+++++- # # <--- 修改结束 --- -+++++- -+++++- # fa_attention_mask = None -+++++- # if attention_mask is not None: -+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++- # fa_attention_mask = (mask_slice != 0) -+++++- -+++++- # input_dtype = query_states.dtype -+++++- -+++++- # attn_output = mindspore.ops.flash_attention_score( -+++++- # query=query_states, # 传入已经预先缩放过的 query -+++++- # key=key_states, -+++++- # value=value_states, -+++++- # head_num=self.num_heads, -+++++- # attn_mask=fa_attention_mask, -+++++- # keep_prob=1.0 - self.attention_dropout, -+++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++- # input_layout="BNSD", -+++++- # sparse_mode=0, -+++++- # inner_precise=1 # 仍然保持内部高精度计算 -+++++- # ) -++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++ -+++++- # attn_output = attn_output.to(input_dtype) -+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++- # attn_output = self.o_proj(attn_output) -++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++++++# active_experts = ops.unique(flat_selected_experts) -++++++ -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++ -++++++# # 找到所有分配给该专家的 token -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++ -++++++# # 使用 mask 选取对应的 token 和权重 -++++++# current_token_indices = token_indices[mask] -++++++# current_routing_weights = flat_routing_weights[mask] -++++++# current_hidden_states = hidden_states[current_token_indices] -++++++ -++++++# # 对这些 token 进行批处理 -++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++++ -++++++# # 使用 index_add 将结果精确地加回到对应位置 -++++++# moe_output = moe_output.index_add( -++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++++++# ) -++++++# return moe_output -++++++ -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# """ -++++++# 顶层 forward 方法,作为智能分发器。 -++++++# """ -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ -++++++# # 1. 门控计算 -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++# routing_weights = routing_weights.to(hidden_states.dtype) -++++++ -++++++# # 2. 调用统一的 MoE 计算内核 -++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+++++ -+++++- # attn_weights = None -+++++- # if output_attentions: -+++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++++# # 3. 统一处理共享专家 -++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++ -++++++# # 4. 合并输出 -++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++ -++++++# # 5. 恢复原始形状并返回 -++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -++++++ -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# """ -++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++# 【最终高性能与高精度版】: -++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++++++# 3. 这样实现了速度和准确性的两全其美。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_decode( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# """ -++++++# 【解码路径】极致优化版:bmm + 高精度累加。 -++++++# """ -++++++# original_dtype = hidden_states.dtype -++++++# batch_size, _ = hidden_states.shape -++++++ -++++++# expert_outputs_list = [ -++++++# ops.cat([ -++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++# ], dim=0) -++++++# for i in range(batch_size) -++++++# ] -++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++ -++++++# # 在 float32 下执行 bmm,得到高精度结果 -++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++ -++++++# # 将高精度结果转换回原始数据类型 -++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++++++ -++++++# return moe_output -++++++ -++++++# @no_grad() -++++++# def _moe_infer_prefill( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# selected_experts: mindspore.Tensor, -++++++# routing_weights: mindspore.Tensor -++++++# ) -> mindspore.Tensor: -++++++# """ -++++++# 【预填充路径】与原始实现一致,结果精确。 -++++++# """ -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens, _ = hidden_states.shape -++++++# flat_selected_experts = selected_experts.flatten() -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++# active_experts = ops.unique(flat_selected_experts) -++++++ -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# selected_token_indices = token_indices[mask] -++++++# selected_routing_weights = routing_weights.flatten()[mask] -++++++# current_states = hidden_states[selected_token_indices] -++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++# moe_output = moe_output.index_add( -++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++++# ) -++++++# return moe_output -++++++ -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++ -+++++- # return attn_output, attn_weights, past_key_value -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++++++# # 如果模型主体是 float16,后续再转换 -++++++ -++++++# moe_output = None -++++++# if not self.training: -++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++++++# # _moe_infer_decode 内部会处理好类型转换 -++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++++++# if sequence_length == 1: -++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++++# else: -++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++++# else: -++++++# raise NotImplementedError("Training path is not implemented.") -++++++ -++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++ -++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -+++++ -+++++-QWEN2MOE_ATTENTION_CLASSES = { -+++++- "eager": Qwen2MoeAttention, -+++++- "flash-attention": Qwen2MoeFlashAttention, -+++++-} -++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++# """ -++++++# 【融合版】一个混合专家模块,内置两种推理策略, -++++++# 由外部全局变量 `Long_Prompt` 控制: -++++++ -++++++# - if Long_Prompt is True: 【精度优先模式】 -++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++++++# 适用于处理长序列,避免误差累积。 -++++++ -++++++# - if Long_Prompt is False: 【速度优先模式】 -++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 -++++++# """ -++++++# def __init__(self, config: Qwen2MoeConfig): -++++++# super().__init__() -++++++# self.num_experts = config.num_experts -++++++# self.top_k = config.num_experts_per_tok -++++++# self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++# self.experts = nn.ModuleList( -++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++# ) -++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++# # --- 速度优先模式的辅助函数 --- -++++++# @no_grad() -++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++# original_dtype = hidden_states.dtype -++++++# batch_size, _ = hidden_states.shape -++++++# expert_outputs_list = [ -++++++# ops.cat([ -++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++# ], dim=0) -++++++# for i in range(batch_size) -++++++# ] -++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++# weights_fp32 = routing_weights.to(mindspore.float32) -++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++++# return moe_output_fp32.squeeze(1).to(original_dtype) -++++++ -++++++# @no_grad() -++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens, _ = hidden_states.shape -++++++# flat_selected_experts = selected_experts.flatten() -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++# active_experts = ops.unique(flat_selected_experts) -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# selected_token_indices = token_indices[mask] -++++++# selected_routing_weights = routing_weights.flatten()[mask] -++++++# current_states = hidden_states[selected_token_indices] -++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++# return moe_output -++++++ -++++++# # --- 精度优先模式的辅助函数 --- -++++++# @no_grad() -++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++# moe_output = ops.zeros_like(hidden_states) -++++++# num_tokens, _ = hidden_states.shape -++++++# flat_selected_experts = selected_experts.flatten() -++++++# flat_routing_weights = routing_weights.flatten() -++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++# active_experts = ops.unique(flat_selected_experts) -++++++# for expert_idx_tensor in active_experts: -++++++# expert_idx = expert_idx_tensor.item() -++++++# expert_layer = self.experts[expert_idx] -++++++# mask = (flat_selected_experts == expert_idx_tensor) -++++++# current_token_indices = token_indices[mask] -++++++# current_routing_weights = flat_routing_weights[mask] -++++++# current_hidden_states = hidden_states[current_token_indices] -++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++# return moe_output -++++++ -++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++# # 声明我们将要使用一个在模块外部定义的全局变量 -++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++++++# global Long_Prompt -++++++ -++++++# # 1. 门控计算 (所有模式通用) -++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++# router_logits = self.gate(hidden_states_reshaped) -++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++++# if self.norm_topk_prob: -++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++# moe_output = None -++++++# if not self.training: -++++++# # 根据 Long_Prompt 标志选择模式 -++++++# if Long_Prompt: -++++++# # --- 精度优先模式 --- -++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++# else: -++++++# # --- 速度优先模式 --- -++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++# if sequence_length == 1: -++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++# else: -++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++# else: -++++++# raise NotImplementedError("Training path is not implemented.") -++++++ -++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++ -++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++# return final_hidden_states, router_logits -++++++ -++++++class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ """ -++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++++++ 控制的顶级推理策略: -+++++ -++++++ - if Long_Prompt is True: 【精度优先模式】 -++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -++++++ 适用于需要严格可复现性的长序列任务。 -+++++ -+++++-class Qwen2MoeSparseMoeBlock(nn.Module): -+++++- def __init__(self, config): -++++++ - if Long_Prompt is False: 【速度优先模式】 -++++++ 采用业界最强的性能组合: -++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -++++++ """ -++++++ def __init__(self, config: Qwen2MoeConfig): -+++++ super().__init__() -+++++ self.num_experts = config.num_experts -+++++ self.top_k = config.num_experts_per_tok -+++++ self.norm_topk_prob = config.norm_topk_prob -+++++ -+++++- # gating -+++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++ self.experts = nn.ModuleList( -+++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++ ) -+++++- -+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++ -+++++- #@dwj -+++++- # 只遍历激活的专家,而非全部专家 -+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++- num_tokens = hidden_states_reshaped.shape[0] -+++++- -+++++- router_logits = self.gate(hidden_states_reshaped) -+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- -+++++- if self.norm_topk_prob: -+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++- flat_selected_experts = selected_experts.flatten() -+++++- -+++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++- token_indices = broadcasted_token_indices.flatten() -+++++- -+++++- active_experts = ops.unique(flat_selected_experts) -+++++- -+++++- for expert_idx_tensor in active_experts: -+++++- expert_idx = expert_idx_tensor.item() -+++++- expert_layer = self.experts[expert_idx] -+++++- -+++++- mask = (flat_selected_experts == expert_idx_tensor) -+++++- selected_token_indices = token_indices[mask] -+++++- selected_routing_weights = routing_weights.flatten()[mask] -+++++- -+++++- current_states = hidden_states_reshaped[selected_token_indices] -+++++- -+++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++- -+++++- final_hidden_states = final_hidden_states.index_add( -+++++- dim=0, -+++++- index=selected_token_indices, -+++++- source=expert_output.to(hidden_states.dtype) -+++++- ) -+++++- -+++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -++++++ @no_grad() -++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++ original_dtype = hidden_states.dtype -++++++ batch_size, _ = hidden_states.shape -++++++ expert_outputs_list = [ -++++++ ops.cat([ -++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++ ], dim=0) -++++++ for i in range(batch_size) -++++++ ] -++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++ weights_fp32 = routing_weights.to(mindspore.float32) -++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++++ return moe_output_fp32.squeeze(1).to(original_dtype) -++++++ -++++++ @no_grad() -++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++ num_tokens, _ = hidden_states.shape -++++++ flat_selected_experts = selected_experts.flatten() -++++++ sorted_expert_indices = flat_selected_experts.argsort() -++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++++ original_token_indices = sorted_expert_indices // self.top_k -++++++ moe_output = ops.zeros_like(hidden_states) -++++++ current_token_offset = 0 -++++++ for i in range(self.num_experts): -++++++ expert_token_count = tokens_per_expert[i] - current_token_offset -++++++ if expert_token_count == 0: -++++++ continue -++++++ end_offset = current_token_offset + expert_token_count -++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++++ expert_hidden_states = hidden_states[expert_original_token_indices] -++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++ current_token_offset += expert_token_count -++++++ return moe_output -++++++ -++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++++++ @no_grad() -++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++ moe_output = ops.zeros_like(hidden_states) -++++++ num_tokens, _ = hidden_states.shape -++++++ flat_selected_experts = selected_experts.flatten() -++++++ flat_routing_weights = routing_weights.flatten() -++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++ active_experts = ops.unique(flat_selected_experts) -++++++ for expert_idx_tensor in active_experts: -++++++ expert_idx = expert_idx_tensor.item() -++++++ expert_layer = self.experts[expert_idx] -++++++ mask = (flat_selected_experts == expert_idx_tensor) -++++++ current_token_indices = token_indices[mask] -++++++ current_routing_weights = flat_routing_weights[mask] -++++++ current_hidden_states = hidden_states[current_token_indices] -++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++ return moe_output -+++++ -+++++- final_hidden_states = final_hidden_states + shared_expert_output -+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++- -+++++- return final_hidden_states, router_logits -++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++ global Long_Prompt -++++++ -++++++ # 1. 门控计算 (所有模式通用) -++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++ router_logits = self.gate(hidden_states_reshaped) -++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++++ if self.norm_topk_prob: -++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++ moe_output = None -++++++ if Long_Prompt: -++++++ # --- 精度优先模式 (ACCURACY MODE) --- -++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ else: -++++++ # --- 速度优先模式 (SPEED MODE) --- -++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++ if sequence_length == 1: -++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ else: -++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ -+++++ -++++++ # 3. 共享专家计算与合并 (所有模式通用) -++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++ -++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++ -++++++ return final_hidden_states, router_logits -+++++ -+++++ class Qwen2MoeDecoderLayer(nn.Module): -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+++++ super().__init__() -+++++ self.hidden_size = config.hidden_size -++++++ -++++++ # if Long_Prompt: -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ # else: -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++ -+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ -+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++- -+++++ if (layer_idx not in config.mlp_only_layers) and ( -+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++++ ): -+++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ self._warmed_up = True -+++++ self.warmup_moe_model() -+++++ -++++++ -++++++ -+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++++ output_router_logits = ( -+++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits -+++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ router_logits=outputs.router_logits, -+++++ ) -+++++ -++++++ def generate(self, *args, **kwargs): -++++++ """ -++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++++++ """ -++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++++++ -++++++ input_ids = kwargs.get("input_ids") -++++++ if input_ids is None and args: -++++++ input_ids = args[0] -++++++ -++++++ if input_ids is not None: -++++++ prompt_length = input_ids.shape[1] -++++++ -++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -++++++ Long_Prompt = True -++++++ else: -++++++ Long_Prompt = False -++++++ -++++++ return super().generate(*args, **kwargs) -++++++ -+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -+++++ def prepare_inputs_for_generation( -+++++ self, -+++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -+++++ # Exception 1: when passing input_embeds, input_ids may be missing entries -+++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -++++++ -+++++ if past_key_values is not None: -+++++ if inputs_embeds is not None: # Exception 1 -+++++ if 0 not in input_ids.shape: -+++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ } -+++++ ) -+++++ return model_inputs -++++++ -+++++ # @lwx -+++++ # def _decode_one_tokens_logits( -+++++ # self, -+++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -+++++ attentions=outputs.attentions, -+++++ ) -+++++ -++++++ -+++++ __all__ = [ -+++++ "Qwen2MoeForCausalLM", -+++++ "Qwen2MoeModel", -+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++++new file mode 100644 -+++++index 00000000..6dfb5b93 -+++++--- /dev/null -++++++++ b/patches/0001-20251104commit.patch -+++++@@ -0,0 +1,1272 @@ -++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++++From: Pinoeer-kingxi <13022943007@163.com> -++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++++Subject: [PATCH] 20251104commit -++++++ -++++++--- -++++++ mindnlp/transformers/cache_utils.py | 28 +- -++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++++ 3 files changed, 976 insertions(+), 87 deletions(-) -++++++ -++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++++index cadd2e04..02f8d4be 100644 -++++++--- a/mindnlp/transformers/cache_utils.py -+++++++++ b/mindnlp/transformers/cache_utils.py -++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++++ # k_out[:, :, cache_position] = key_states -++++++ # v_out[:, :, cache_position] = value_states -++++++- if ON_ORANGE_PI: -++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++- else: -++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++- -+++++++ # if ON_ORANGE_PI: -+++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++++ # else: -+++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++++++ if cache_position.ndim > 1: -+++++++ cache_position = cache_position.flatten() -+++++++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++++++ cache_position = cache_position.int() -+++++++ -+++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++++++ k_out[:, :, cache_position] = key_states -+++++++ v_out[:, :, cache_position] = value_states -+++++++ -++++++ return k_out, v_out -++++++ -++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++index c695b944..d8303e45 100644 -++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++++ def rotate_half(x): -++++++ """Rotates half the hidden dims of the input.""" -++++++- x1 = x[..., : x.shape[-1] // 2] -++++++- x2 = x[..., x.shape[-1] // 2 :] -+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++++ # x1 = x[..., : x.shape[-1] // 2] -+++++++ # x2 = x[..., x.shape[-1] // 2 :] -+++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++ return ops.cat((-x2, x1), dim=-1) -++++++ -++++++ -++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++++ if self.training: -++++++ raise NotImplementedError("Training is not supported yet.") -++++++ else: -++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++- if self.config.n_shared_experts is not None: -++++++- y = y + self.shared_experts(identity) -++++++- return y -+++++++ # @lwx -+++++++ if orig_shape[1] == 1: -+++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++++++ y=y.view(*orig_shape) -+++++++ if self.config.n_shared_experts is not None: -+++++++ y = y + self.shared_experts(identity) -+++++++ return y -+++++++ else: -+++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++++++ if self.config.n_shared_experts is not None: -+++++++ y = y + self.shared_experts(identity) -+++++++ return y -+++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++++ # if self.config.n_shared_experts is not None: -+++++++ # y = y + self.shared_experts(identity) -+++++++ # return y -+++++++ -+++++++ @no_grad() -+++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++++ -+++++++ expert_cache = ops.zeros_like(x) -+++++++ for i in range(self.num_experts_per_tok): -+++++++ expert_id = flat_expert_indices[i].item() -+++++++ weight = flat_expert_weights[i].item() -+++++++ expert = self.experts[expert_id] -+++++++ expert_out = expert(x) -+++++++ expert_cache += expert_out * weight -+++++++ return expert_cache -++++++ -++++++ @no_grad() -++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++- # expert_cache = torch.zeros_like(x) -++++++- # idxs = flat_expert_indices.argsort() -++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++- # token_idxs = idxs // self.num_experts_per_tok -++++++- # for i, end_idx in enumerate(tokens_per_expert): -++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++- # if start_idx == end_idx: -++++++- # continue -++++++- # expert = self.experts[i] -++++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++++- # expert_tokens = x[exp_token_idx] -++++++- # expert_out = expert(expert_tokens) -++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++- # return expert_cache -+++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++ expert_cache = ops.zeros_like(x) -++++++ idxs = flat_expert_indices.argsort() -++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ token_idxs = idxs // self.num_experts_per_tok -+++++++ -++++++ for i, end_idx in enumerate(tokens_per_expert): -++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ if start_idx == end_idx: -++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++++ expert_out = expert(expert_tokens) -++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++++ -++++++ return expert_cache -+++++++ -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # # expert_cache = torch.zeros_like(x) -+++++++ # # idxs = flat_expert_indices.argsort() -+++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++++ # # if start_idx == end_idx: -+++++++ # # continue -+++++++ # # expert = self.experts[i] -+++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # # expert_tokens = x[exp_token_idx] -+++++++ # # expert_out = expert(expert_tokens) -+++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++++ # # return expert_cache -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # if start_idx == end_idx: -+++++++ # continue -+++++++ # expert = self.experts[i] -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = expert(expert_tokens) -+++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++++ -+++++++ # return expert_cache -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ -+++++++ # # 排序保证顺序一致 -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # # 找出有 token 的专家 -+++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++++ -+++++++ # for i in active_experts.tolist(): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # end_idx = tokens_per_expert[i] -+++++++ # if start_idx == end_idx: # 没有 token -+++++++ # continue -+++++++ -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = self.experts[i](expert_tokens) -+++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++++ -+++++++ # expert_cache = mindspore.mint.scatter_add( -+++++++ # expert_cache, -+++++++ # 0, -+++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++++ # expert_out -+++++++ # ) -+++++++ -+++++++ # return expert_cache -+++++++ -+++++++ -++++++ -++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++++ # """ -++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++ -++++++ # Initialize weights and apply final processing -++++++ self.post_init() -+++++++ self.warm_up = False -+++++++ -+++++++ def warmup_moe_model_deep(self): -+++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++++ test_texts = [ -+++++++ "warmup short", -+++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++++++ ] -+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++++ if tokenizer is None: -+++++++ from mindnlp.transformers import AutoTokenizer -+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++++ self._warmup_tokenizer = tokenizer -+++++++ -+++++++ for text in test_texts: -+++++++ inputs = tokenizer(text, return_tensors="ms") -+++++++ with mindspore._no_grad(): -+++++++ _ = self(**inputs, use_cache=False) -+++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++++ -++++++ def get_input_embeddings(self): -++++++ return self.model.embed_tokens -++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++ ```""" -+++++++ if not self.warm_up: -+++++++ self.warm_up = True -+++++++ self.warmup_moe_model_deep() -+++++++ -++++++ output_attentions = ( -++++++ output_attentions -++++++ if output_attentions is not None -++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++index 3cbf820e..d4c6b651 100644 -++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++@@ -18,7 +18,6 @@ -++++++ # See the License for the specific language governing permissions and -++++++ # limitations under the License. -++++++ """MindSpore Qwen2MoE model.""" -++++++- -++++++ import math -++++++ from typing import List, Optional, Tuple, Union -++++++ -++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++++ TokenClassifierOutput, -++++++ ) -++++++ from ...modeling_utils import PreTrainedModel -+++++++from ...generation import GenerationMixin -++++++ from ....utils import logging -++++++ from .configuration_qwen2_moe import Qwen2MoeConfig -++++++ -++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++++ self.variance_epsilon = eps -++++++ -++++++ def forward(self, hidden_states): -+++++++ # @dwj -+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++++ # @lwx -+++++++ # if not self.training : -+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++ input_dtype = hidden_states.dtype -++++++ hidden_states = hidden_states.to(mindspore.float32) -++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++++@@ -234,6 +239,8 @@ def rotate_half(x): -++++++ """Rotates half the hidden dims of the input.""" -++++++ x1 = x[..., : x.shape[-1] // 2] -++++++ x2 = x[..., x.shape[-1] // 2 :] -+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++ return ops.cat((-x2, x1), dim=-1) -++++++ -++++++ -++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++++ self.config = config -++++++ self.hidden_size = config.hidden_size -++++++ self.intermediate_size = intermediate_size -+++++++ -++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++++ self.act_fn = ACT2FN[config.hidden_act] -++++++ -++++++ def forward(self, x): -++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++- -++++++ -+++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++++ # @lwx -+++++++ # gate_up_output = self.gate_up_proj(x) -+++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++++++ # return self.down_proj(swiglu_output) -+++++++ -+++++++ # def forward(self, x): -+++++++ # gate_proj_out = self.gate_proj(x) -+++++++ # up_proj_out = self.up_proj(x) -+++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++++++ # return self.down_proj(swiglu_out) -+++++++ -++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++++ """ -++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++++ use_cache: bool = False, -++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ -+++++++ -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ query_states = self.q_proj(hidden_states) -++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ "with a layer index." -++++++ ) -++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ if isinstance(past_key_value, StaticCache): -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ else: -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++ if isinstance(past_key_value, StaticCache): -+++++++ kv_seq_len = key_states.shape[-2] -++++++ -++++++ # repeat k/v heads if n_kv_heads < n_heads -++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++- -+++++++ -++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++ -++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++++- raise ValueError( -++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++++- f" {attn_weights.shape}" -++++++- ) -++++++- -++++++- if attention_mask is not None: # no matter the length, we just slice it -++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++++++ if attention_mask is not None: -+++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++ attn_weights = attn_weights + causal_mask -++++++ -++++++ # upcast attention to fp32 -++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++ -++++++ attn_output = self.o_proj(attn_output) -++++++- -+++++++ # @lwx -+++++++ -+++++++ # max_seq_len = self.max_position_embeddings # 2048 -+++++++ -+++++++ # if attention_mask is not None: -+++++++ # # attention_mask: [B, 1, Sq, Sk] -+++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++++ -+++++++ # # pad 到 [max_seq_len, max_seq_len] -+++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++++ # global_attention_mask = padded_mask -+++++++ # else: -+++++++ # global_attention_mask = None -+++++++ -+++++++ -+++++++ # sparse_mode=3 -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # real_shift=None, -+++++++ # padding_mask=None, -+++++++ -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=global_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ # input_layout="BNSD", -+++++++ # pre_tokens=2147483647, -+++++++ # next_tokens=2147483647, -+++++++ # inner_precise=0, -+++++++ # drop_mask=None, -+++++++ # prefix=None, -+++++++ # actual_seq_qlen=None, -+++++++ # actual_seq_kvlen=None, -+++++++ # sparse_mode=sparse_mode, -+++++++ # ) -++++++ if not output_attentions: -++++++ attn_weights = None -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ -+++++++class Qwen2MoeFlashAttention(nn.Module): -+++++++ """ -+++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++++ -+++++++ 关键改动: -+++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++++ 直接传入原始的 key 和 value 张量效率更高。 -+++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++++ """ -+++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++++ super().__init__() -+++++++ self.config = config -+++++++ self.layer_idx = layer_idx -+++++++ self.hidden_size = config.hidden_size -+++++++ self.num_heads = config.num_attention_heads -+++++++ self.head_dim = self.hidden_size // self.num_heads -+++++++ self.num_key_value_heads = config.num_key_value_heads -+++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++ self.max_position_embeddings = config.max_position_embeddings -+++++++ self.rope_theta = config.rope_theta -+++++++ self.attention_dropout = config.attention_dropout -+++++++ -+++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++ raise ValueError( -+++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++++ ) -+++++++ -+++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++++ -+++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++++ self.head_dim, -+++++++ max_position_embeddings=self.max_position_embeddings, -+++++++ base=self.rope_theta, -+++++++ ) -+++++++ -+++++++ def forward( -+++++++ self, -+++++++ hidden_states: mindspore.Tensor, -+++++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++++ position_ids: Optional[mindspore.Tensor] = None, -+++++++ past_key_value: Optional[Cache] = None, -+++++++ output_attentions: bool = False, -+++++++ use_cache: bool = False, -+++++++ cache_position: Optional[mindspore.Tensor] = None, -+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # 1. 线性投射 Q, K, V -+++++++ query_states = self.q_proj(hidden_states) -+++++++ key_states = self.k_proj(hidden_states) -+++++++ value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++ # query: [B, S, H*D] -> [B, N1, S, D] -+++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # 3. RoPE 旋转位置编码 -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ if past_key_value is not None: -+++++++ if self.layer_idx is None: -+++++++ raise ValueError( -+++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++ "with a layer index." -+++++++ ) -+++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++++ if cache_position.shape[0] == 1: -+++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++++ kv_seq_len = past_seen_tokens + 1 -+++++++ else: -+++++++ # prefill 阶段:cache_position 是范围,使用其长度 -+++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++++ else: -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # 4. KV 缓存更新 -+++++++ if past_key_value is not None: -+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ key_states, value_states = past_key_value.update( -+++++++ key_states, value_states, self.layer_idx, cache_kwargs -+++++++ ) -+++++++ -+++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++ if cache_position.shape[0] == 1: -+++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ -+++++++ # 5. [重要] 准备 Attention Mask -+++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++++ fa_attention_mask = None -+++++++ if attention_mask is not None: -+++++++ # 截取与当前key长度匹配的部分 -+++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++++ fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++++ input_dtype = query_states.dtype -+++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++++ query_states = query_states.to(mindspore.float16) -+++++++ key_states = key_states.to(mindspore.float16) -+++++++ value_states = value_states.to(mindspore.float16) -+++++++ -+++++++ # 6. [核心] 调用 flash_attention_score 算子 -+++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++++ attn_output = mindspore.ops.flash_attention_score( -+++++++ query=query_states, -+++++++ key=key_states, -+++++++ value=value_states, -+++++++ head_num=self.num_heads, # 传入Q的头数(N1) -+++++++ attn_mask=fa_attention_mask, -+++++++ keep_prob=1.0 - self.attention_dropout, -+++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ input_layout="BNSD", -+++++++ sparse_mode=0 # 使用 defaultMask 模式 -+++++++ ) -+++++++ -+++++++ # 恢复原始数据类型 -+++++++ attn_output = attn_output.to(input_dtype) -+++++++ -+++++++ # 7. 调整输出形状 -+++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++++++ attn_weights = None -+++++++ if output_attentions: -+++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++ return attn_output, attn_weights, past_key_value -+++++++ -+++++++ # def forward( -+++++++ # self, -+++++++ # hidden_states: mindspore.Tensor, -+++++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++++ # past_key_value: Optional[Cache] = None, -+++++++ # output_attentions: bool = False, -+++++++ # use_cache: bool = False, -+++++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ # bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # # 1. 线性投射 Q, K, V -+++++++ # query_states = self.q_proj(hidden_states) -+++++++ # key_states = self.k_proj(hidden_states) -+++++++ # value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # # 3. RoPE 旋转位置编码 -+++++++ # kv_seq_len = key_states.shape[-2] -+++++++ # if past_key_value is not None: -+++++++ # if self.layer_idx is None: -+++++++ # raise ValueError( -+++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++ # "with a layer index." -+++++++ # ) -+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # # 4. KV 缓存更新 -+++++++ # if past_key_value is not None: -+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ # key_states, value_states = past_key_value.update( -+++++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++++ # ) -+++++++ -+++++++ # # 5. 准备 Attention Mask -+++++++ # fa_attention_mask = None -+++++++ # if attention_mask is not None: -+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++++ # input_dtype = query_states.dtype -+++++++ -+++++++ # # 6. [核心] 调用 flash_attention_score 算子 -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=fa_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ # input_layout="BNSD", -+++++++ # sparse_mode=0, -+++++++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++++ # inner_precise=1 -+++++++ # ) -+++++++ -+++++++ # # 恢复原始数据类型 -+++++++ # attn_output = attn_output.to(input_dtype) -+++++++ -+++++++ # # 7. 调整输出形状 -+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ # attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # attn_weights = None -+++++++ # if output_attentions: -+++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++ # return attn_output, attn_weights, past_key_value -+++++++ -+++++++ # def forward( -+++++++ # self, -+++++++ # hidden_states: mindspore.Tensor, -+++++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++++ # past_key_value: Optional[Cache] = None, -+++++++ # output_attentions: bool = False, -+++++++ # use_cache: bool = False, -+++++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ # bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # query_states = self.q_proj(hidden_states) -+++++++ # key_states = self.k_proj(hidden_states) -+++++++ # value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # kv_seq_len = key_states.shape[-2] -+++++++ # if past_key_value is not None: -+++++++ # if self.layer_idx is None: -+++++++ # raise ValueError("`layer_idx` must be specified for caching") -+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # if past_key_value is not None: -+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ # key_states, value_states = past_key_value.update( -+++++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++++ # ) -+++++++ -+++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++ -+++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++++ # query_states = query_states / math.sqrt(self.head_dim) -+++++++ # # <--- 修改结束 --- -+++++++ -+++++++ # fa_attention_mask = None -+++++++ # if attention_mask is not None: -+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # input_dtype = query_states.dtype -+++++++ -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, # 传入已经预先缩放过的 query -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=fa_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++++ # input_layout="BNSD", -+++++++ # sparse_mode=0, -+++++++ # inner_precise=1 # 仍然保持内部高精度计算 -+++++++ # ) -+++++++ -+++++++ # attn_output = attn_output.to(input_dtype) -+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ # attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # attn_weights = None -+++++++ # if output_attentions: -+++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++++ -+++++++ # return attn_output, attn_weights, past_key_value -+++++++ -++++++ QWEN2MOE_ATTENTION_CLASSES = { -++++++ "eager": Qwen2MoeAttention, -+++++++ "flash-attention": Qwen2MoeFlashAttention, -++++++ } -++++++ -++++++ -++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -+++++++ #@dwj -+++++++ # 只遍历激活的专家,而非全部专家 -++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- hidden_states = hidden_states.view(-1, hidden_dim) -++++++- # router_logits: (batch * sequence_length, n_experts) -++++++- router_logits = self.gate(hidden_states) -++++++- -++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- if self.norm_topk_prob: -++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- # we cast back to the input dtype -++++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++- final_hidden_states = ops.zeros( -++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++++- ) -++++++- -++++++- # One hot encode the selected experts to create an expert mask -++++++- # this will be used to easily index which expert is going to be sollicitated -++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++++- -++++++- # Loop over all available experts in the model and perform the computation on each expert -++++++- for expert_idx in range(self.num_experts): -++++++- expert_layer = self.experts[expert_idx] -++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++++- -++++++- # Index the correct hidden states and compute the expert hidden state for -++++++- # the current expert. We need to make sure to multiply the output hidden -++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++++- if 0 not in idx.shape: -++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++++- -++++++- # However `index_add_` only support torch tensors for indexing so we'll use -++++++- # the `top_x` tensor here. -++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++++- -++++++- shared_expert_output = self.shared_expert(hidden_states) -++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++++- -++++++- final_hidden_states = final_hidden_states + shared_expert_output -+++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++ num_tokens = hidden_states_reshaped.shape[0] -+++++++ -+++++++ router_logits = self.gate(hidden_states_reshaped) -+++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++ -+++++++ if self.norm_topk_prob: -+++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++++ flat_selected_experts = selected_experts.flatten() -+++++++ -+++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++++ token_indices = broadcasted_token_indices.flatten() -+++++++ -+++++++ active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++ for expert_idx_tensor in active_experts: -+++++++ expert_idx = expert_idx_tensor.item() -+++++++ expert_layer = self.experts[expert_idx] -+++++++ -+++++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++++ selected_token_indices = token_indices[mask] -+++++++ selected_routing_weights = routing_weights.flatten()[mask] -+++++++ -+++++++ current_states = hidden_states_reshaped[selected_token_indices] -+++++++ -+++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++ -+++++++ final_hidden_states = final_hidden_states.index_add( -+++++++ dim=0, -+++++++ index=selected_token_indices, -+++++++ source=expert_output.to(hidden_states.dtype) -+++++++ ) -+++++++ -+++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++++ -++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++- return final_hidden_states, router_logits -+++++++ final_hidden_states = final_hidden_states + shared_expert_output -+++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++ return final_hidden_states, router_logits -++++++ -++++++ -++++++ class Qwen2MoeDecoderLayer(nn.Module): -++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++++ -++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ -+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++++ -++++++ if (layer_idx not in config.mlp_only_layers) and ( -++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++++ ): -++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++++ _skip_keys_device_placement = "past_key_values" -++++++ _supports_cache_class = True -+++++++#lwx -+++++++ # _supports_static_cache = True -++++++ -++++++ def _init_weights(self, module): -++++++ std = self.config.initializer_range -++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++++ return causal_mask -++++++ -++++++ -++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ _tied_weights_keys = ["lm_head.weight"] -++++++ -++++++ def __init__(self, config): -++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ self.num_experts_per_tok = config.num_experts_per_tok -++++++ # Initialize weights and apply final processing -++++++ self.post_init() -+++++++ # @lwx -+++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++++++ # self.generation_config.cache_implementation = "static" -+++++++ self._warmed_up = False -+++++++ -+++++++ def warmup_moe_model(self): -+++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++++++ test_texts = [ -+++++++ "warmup short", -+++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++++++ ] -+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++++ if tokenizer is None: -+++++++ from mindnlp.transformers import AutoTokenizer -+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++++ self._warmup_tokenizer = tokenizer -+++++++ -+++++++ for text in test_texts: -+++++++ inputs = tokenizer(text, return_tensors="ms") -+++++++ with mindspore._no_grad(): -+++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++++ -++++++ def get_input_embeddings(self): -++++++ return self.model.embed_tokens -++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++ ```""" -+++++++ if not self._warmed_up: -+++++++ self._warmed_up = True -+++++++ self.warmup_moe_model() -++++++ -++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++++ output_router_logits = ( -++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ } -++++++ ) -++++++ return model_inputs -+++++++# @lwx -+++++++ # def _decode_one_tokens_logits( -+++++++ # self, -+++++++ # cur_token: mindspore.Tensor, -+++++++ # input_pos: Optional[mindspore.Tensor], -+++++++ # cache_position: mindspore.Tensor, -+++++++ # past_key_values: StaticCache, -+++++++ # ) -> mindspore.Tensor: -+++++++ # """ -+++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++++++ -+++++++ # Args: -+++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++++++ # input_pos: 输入位置信息,可选 -+++++++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++++++ -+++++++ # Returns: -+++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++++++ # """ -+++++++ # # 调用JIT编译的版本 -+++++++ # return self.get_decode_one_tokens_logits( -+++++++ # cur_token=cur_token, -+++++++ # input_pos=input_pos, -+++++++ # cache_position=cache_position, -+++++++ # past_key_values=past_key_values, -+++++++ # ) -+++++++ -+++++++ # @mindspore.jit(jit_level='O1') -+++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++++++ # """ -+++++++ # JIT编译的函数,用于高效的单token解码 -+++++++ # 使用JIT编译优化以支持静态shape和高效执行 -+++++++ -+++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++++++ # """ -+++++++ # outputs = self.model.forward( -+++++++ # input_ids=cur_token, -+++++++ # position_ids=input_pos, -+++++++ # cache_position=cache_position, -+++++++ # past_key_values=past_key_values, -+++++++ # use_cache=True, -+++++++ # return_dict=False, -+++++++ # ) -+++++++ -+++++++ # hidden_states = outputs[0] -+++++++ # logits = self.lm_head.forward(hidden_states) -+++++++ # logits = logits.float() -+++++++ -+++++++ # return logits[:, -1, :] -+++++++ -+++++++ # def _sample( -+++++++ # self, -+++++++ # input_ids: mindspore.Tensor, -+++++++ # logits_processor, -+++++++ # stopping_criteria, -+++++++ # generation_config, -+++++++ # synced_devices: bool, -+++++++ # streamer=None, -+++++++ # logits_warper=None, -+++++++ # **model_kwargs, -+++++++ # ): -+++++++ # """ -+++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++++++ # """ -+++++++ # from ...generation.logits_process import LogitsProcessorList -+++++++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++++++ # from mindnlp.core import nn, ops, no_grad -+++++++ # import numpy as np -+++++++ -+++++++ # # 检查是否使用 StaticCache -+++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++++++ # # 否则,直接调用父类方法 -+++++++ # past_key_values = model_kwargs.get("past_key_values") -+++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++++++ -+++++++ # if not isinstance(past_key_values, StaticCache): -+++++++ # # 不使用 StaticCache,直接调用父类方法 -+++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++++++ # return super()._sample( -+++++++ # input_ids=input_ids, -+++++++ # logits_processor=logits_processor, -+++++++ # stopping_criteria=stopping_criteria, -+++++++ # generation_config=generation_config, -+++++++ # synced_devices=synced_devices, -+++++++ # streamer=streamer, -+++++++ # logits_warper=logits_warper, -+++++++ # **model_kwargs, -+++++++ # ) -+++++++ -+++++++ # # 使用 StaticCache,进入自定义循环 -+++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++++++ # pad_token_id = generation_config._pad_token_tensor -+++++++ # output_attentions = generation_config.output_attentions -+++++++ # output_hidden_states = generation_config.output_hidden_states -+++++++ # output_scores = generation_config.output_scores -+++++++ # output_logits = generation_config.output_logits -+++++++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++++++ # max_length = generation_config.max_length -+++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++++++ # do_sample = generation_config.do_sample -+++++++ -+++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++++++ # raise ValueError( -+++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++++++ # f"{logits_warper})." -+++++++ # ) -+++++++ -+++++++ # # init attention / hidden states / scores tuples -+++++++ # scores = () if (return_dict_in_generate and output_scores) else None -+++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++++++ -+++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++++++ # encoder_hidden_states = ( -+++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++++++ # ) -+++++++ -+++++++ # # keep track of which sequences are already finished -+++++++ # batch_size, cur_len = input_ids.shape -+++++++ # this_peer_finished = False -+++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++++++ -+++++++ # time_record = [] -+++++++ # from ....utils.testing_utils import parse_flag_from_env -+++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++++++ -+++++++ # while self._has_unfinished_sequences( -+++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++++++ # ): -+++++++ # if _record_time: -+++++++ # import time as time_module -+++++++ # infer_start = time_module.time() -+++++++ -+++++++ # # prepare model inputs -+++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++++++ -+++++++ # # prepare variable output controls -+++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++++++ -+++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++++++ # cur_cache_position = model_inputs.get("cache_position") -+++++++ # cur_past_key_values = model_inputs.get("past_key_values") -+++++++ # cur_input_ids = model_inputs.get("input_ids") -+++++++ -+++++++ # if (isinstance(cur_past_key_values, StaticCache) and -+++++++ # cur_cache_position is not None and -+++++++ # len(cur_cache_position.shape) > 0 and -+++++++ # cur_cache_position.shape[0] == 1 and -+++++++ # cur_input_ids is not None and -+++++++ # cur_input_ids.shape[1] == 1): -+++++++ # # 使用 JIT 优化的单 token 解码 -+++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++++++ # if not hasattr(self, '_jit_used'): -+++++++ # self._jit_used = False -+++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++++++ -+++++++ # next_token_logits = self.get_decode_one_tokens_logits( -+++++++ # cur_token=cur_input_ids, -+++++++ # input_pos=model_inputs.get("position_ids"), -+++++++ # cache_position=cur_cache_position, -+++++++ # past_key_values=cur_past_key_values, -+++++++ # ) -+++++++ -+++++++ # # 标记已使用JIT(用于后续判断) -+++++++ # if not self._jit_used: -+++++++ # self._jit_used = True -+++++++ -+++++++ # # 构造兼容的输出对象 -+++++++ # class JitOptimizedOutput: -+++++++ # def __init__(self, logits, config): -+++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++++++ # self.config = config -+++++++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++++++ # self.attentions = None if not config.is_encoder_decoder else None -+++++++ # self.cross_attentions = None -+++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++++++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++++++ -+++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++++++ # else: -+++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++++++ # outputs = self(**model_inputs, return_dict=True) -+++++++ -+++++++ # if synced_devices and this_peer_finished: -+++++++ # continue -+++++++ -+++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++++++ # next_token_logits = outputs.logits[:, -1, :] -+++++++ -+++++++ # # pre-process distribution -+++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++++++ # if do_sample: -+++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++++++ -+++++++ # # Store scores, attentions and hidden_states when required -+++++++ # if return_dict_in_generate: -+++++++ # if output_scores: -+++++++ # scores += (next_token_scores,) -+++++++ # if output_logits: -+++++++ # raw_logits += (next_token_logits,) -+++++++ # if output_attentions: -+++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++++++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++++++ # if self.config.is_encoder_decoder: -+++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++++++ -+++++++ # if output_hidden_states: -+++++++ # hidden = ( -+++++++ # outputs.decoder_hidden_states -+++++++ # if self.config.is_encoder_decoder -+++++++ # else outputs.hidden_states -+++++++ # ) -+++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++++++ -+++++++ # # token selection -+++++++ # if do_sample: -+++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++++++ # else: -+++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++++++ -+++++++ # # finished sentences should have their next token be a padding token -+++++++ # if has_eos_stopping_criteria: -+++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++++++ -+++++++ # # update generated ids, model inputs, and length for next step -+++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++++++ # if streamer is not None: -+++++++ # streamer.put(next_tokens) -+++++++ -+++++++ # model_kwargs = self._update_model_kwargs_for_generation( -+++++++ # outputs, -+++++++ # model_kwargs, -+++++++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++++++ # ) -+++++++ -+++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++++++ # cur_len += 1 -+++++++ -+++++++ # if _record_time: -+++++++ # import time as time_module -+++++++ # infer_stop = time_module.time() -+++++++ # time_record.append(infer_stop - infer_start) -+++++++ -+++++++ # del outputs -+++++++ -+++++++ # average_infer_time = None -+++++++ # if time_record: -+++++++ # if len(time_record) > 1: -+++++++ # time_record.pop(0) -+++++++ # average_infer_time = sum(time_record) / len(time_record) -+++++++ # print(f'average inference time is: {average_infer_time}') -+++++++ # print(f'inference time record: {time_record}') -+++++++ -+++++++ # if streamer is not None: -+++++++ # streamer.end() -+++++++ -+++++++ # # 简单判断:打印是否使用了JIT路径 -+++++++ # if hasattr(self, '_jit_used') and self._jit_used: -+++++++ # print("[JIT] ✓ JIT optimization was used during generation") -+++++++ # else: -+++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++++++ -+++++++ # if return_dict_in_generate: -+++++++ # if self.config.is_encoder_decoder: -+++++++ # return GenerateEncoderDecoderOutput( -+++++++ # sequences=input_ids, -+++++++ # scores=scores, -+++++++ # logits=raw_logits, -+++++++ # encoder_attentions=encoder_attentions, -+++++++ # encoder_hidden_states=encoder_hidden_states, -+++++++ # decoder_attentions=decoder_attentions, -+++++++ # cross_attentions=cross_attentions, -+++++++ # decoder_hidden_states=decoder_hidden_states, -+++++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++++ # average_infer_time=average_infer_time -+++++++ # ) -+++++++ # else: -+++++++ # return GenerateDecoderOnlyOutput( -+++++++ # sequences=input_ids, -+++++++ # scores=scores, -+++++++ # logits=raw_logits, -+++++++ # attentions=decoder_attentions, -+++++++ # hidden_states=decoder_hidden_states, -+++++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++++ # average_infer_time=average_infer_time -+++++++ # ) -+++++++ # else: -+++++++ # return input_ids -+++++++ -+++++++ # def _prepare_cache_for_generation( -+++++++ # self, -+++++++ # generation_config, -+++++++ # model_kwargs, -+++++++ # assistant_model, -+++++++ # batch_size, -+++++++ # max_cache_length, -+++++++ # ): -+++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++++++ # generation_config.cache_implementation = "static" -+++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++++++ -+++++++ # if generation_config.cache_implementation == "static": -+++++++ # base_required_from_max_length = generation_config.max_length + 1 -+++++++ # base_required = max(max_cache_length, base_required_from_max_length) -+++++++ # min_cache_size = 50 -+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++++++ # else: -+++++++ # max_cache_length = max(base_required, min_cache_size) -+++++++ -+++++++ # original_max_cache_length = max_cache_length -+++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++++++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++++++ # print(f" - final max_cache_length: {max_cache_length}") -+++++++ -+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++++ # if max_cache_length > self.config.max_position_embeddings: -+++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++++++ -+++++++ # result = super()._prepare_cache_for_generation( -+++++++ # generation_config=generation_config, -+++++++ # model_kwargs=model_kwargs, -+++++++ # assistant_model=assistant_model, -+++++++ # batch_size=batch_size, -+++++++ # max_cache_length=max_cache_length, -+++++++ # ) -+++++++ -+++++++ # if generation_config.cache_implementation == "static": -+++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++++++ # created_cache = model_kwargs.get(cache_name) -+++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++++++ # if created_cache.max_cache_len < generation_config.max_length: -+++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++++++ -+++++++ # return result -+++++++ -+++++++ -+++++++ -++++++ -++++++ -++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++++-- -++++++2.27.0 -++++++ -+++++-- -+++++2.27.0 -+++++ -++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++++new file mode 100644 -++++index 00000000..966529e4 -++++--- /dev/null -+++++++ b/patches/0003-20261106secondcommit.patch -++++@@ -0,0 +1,2769 @@ -+++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+++++From: Pinoeer-kingxi <13022943007@163.com> -+++++Date: Thu, 6 Nov 2025 14:54:37 +0800 -+++++Subject: [PATCH 3/3] 20261106secondcommit -+++++ -+++++--- -+++++ .../models/deepseek/modeling_deepseek.py | 217 ++- -+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -+++++ patches/0001-20251104commit.patch | 1272 ----------------- -+++++ 3 files changed, 528 insertions(+), 2032 deletions(-) -+++++ delete mode 100644 patches/0001-20251104commit.patch -+++++ -+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++index 73773c22..2f9192bf 100644 -+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -+++++ -+++++ _CONFIG_FOR_DOC = "DeepseekConfig" -+++++ -++++++_attn_mask_cache = {} -++++++ -++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -++++++ q_len = batch_and_seq[1] -++++++ kv_len = batch_and_seq[1] + past_key_values_length -++++++ key = (batch_and_seq[0], q_len, kv_len) -++++++ -++++++ if key in _attn_mask_cache: -++++++ return _attn_mask_cache[key] -++++++ -++++++ mask = _prepare_4d_causal_attention_mask( -++++++ attention_mask, -++++++ batch_and_seq, -++++++ inputs_embeds, -++++++ past_key_values_length, -++++++ ) -++++++ _attn_mask_cache[key] = mask -++++++ return mask -+++++ -+++++ def _get_unpad_data(attention_mask): -+++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -+++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -+++++ return final_output -+++++ -+++++ -+++++- @no_grad() -+++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++- expert_cache = ops.zeros_like(x) -+++++- idxs = flat_expert_indices.argsort() -+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++- token_idxs = idxs // self.num_experts_per_tok -+++++- -+++++- for i, end_idx in enumerate(tokens_per_expert): -+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++- if start_idx == end_idx: -+++++- continue -+++++- expert = self.experts[i] -+++++- exp_token_idx = token_idxs[start_idx:end_idx] -+++++- expert_tokens = x[exp_token_idx] -+++++- expert_out = expert(expert_tokens) -+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++- -+++++- return expert_cache -+++++- -+++++ # @no_grad() -+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++- # # expert_cache = torch.zeros_like(x) -+++++- # # idxs = flat_expert_indices.argsort() -+++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++- # # token_idxs = idxs // self.num_experts_per_tok -+++++- # # for i, end_idx in enumerate(tokens_per_expert): -+++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++- # # if start_idx == end_idx: -+++++- # # continue -+++++- # # expert = self.experts[i] -+++++- # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++- # # expert_tokens = x[exp_token_idx] -+++++- # # expert_out = expert(expert_tokens) -+++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++- # # return expert_cache -++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++ # expert_cache = ops.zeros_like(x) -+++++ # idxs = flat_expert_indices.argsort() -+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++ -+++++ # return expert_cache -+++++- # @no_grad() -+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++- # expert_cache = ops.zeros_like(x) -++++++ -++++++ @no_grad() -++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++ """ -++++++ 优化版 MoE prefill: -++++++ - 批量张量化处理同一个 expert 的所有 token -++++++ - 跳过无 token 的专家 -++++++ - 保持结果完全一致 -++++++ """ -++++++ # 初始化输出缓存 -++++++ expert_cache = ops.zeros_like(x) -+++++ -+++++- # # 排序保证顺序一致 -+++++- # idxs = flat_expert_indices.argsort() -+++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++- # token_idxs = idxs // self.num_experts_per_tok -++++++ # 排序(确保 scatter_add 位置对应原逻辑) -++++++ idxs = flat_expert_indices.argsort() -++++++ sorted_expert_indices = flat_expert_indices[idxs] -++++++ sorted_token_indices = idxs // self.num_experts_per_tok -+++++ -+++++- # # 找出有 token 的专家 -+++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++ # 每个 expert 的 token 数 -++++++ tokens_per_expert = sorted_expert_indices.bincount() -+++++ -+++++- # for i in active_experts.tolist(): -+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++- # end_idx = tokens_per_expert[i] -+++++- # if start_idx == end_idx: # 没有 token -+++++- # continue -++++++ # 找出有 token 的专家 -++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+++++ -+++++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++++- # expert_tokens = x[exp_token_idx] -+++++- # expert_out = self.experts[i](expert_tokens) -+++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++ for expert_id in active_experts.tolist(): -++++++ # 取该 expert 对应的排序后 token 区间 -++++++ start = (tokens_per_expert[:expert_id]).sum().item() -++++++ end = start + tokens_per_expert[expert_id].item() -+++++ -+++++- # expert_cache = mindspore.mint.scatter_add( -+++++- # expert_cache, -+++++- # 0, -+++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++- # expert_out -+++++- # ) -++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -++++++ expert_tokens = x[token_idx] # 取输入向量 -+++++ -+++++- # return expert_cache -++++++ # 执行专家 MLP -++++++ expert_out = self.experts[expert_id](expert_tokens) -++++++ -++++++ # 按权重缩放 -++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -++++++ -++++++ # 回写到缓存(等价 scatter_add) -++++++ expert_cache = mindspore.mint.scatter_add( -++++++ expert_cache, -++++++ 0, -++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++ scaled_out -++++++ ) -++++++ -++++++ return expert_cache -++++++ -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # # expert_cache = torch.zeros_like(x) -++++++ # # idxs = flat_expert_indices.argsort() -++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++ # # token_idxs = idxs // self.num_experts_per_tok -++++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++ # # if start_idx == end_idx: -++++++ # # continue -++++++ # # expert = self.experts[i] -++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # # expert_tokens = x[exp_token_idx] -++++++ # # expert_out = expert(expert_tokens) -++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++ # # return expert_cache -++++++ # expert_cache = ops.zeros_like(x) -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # for i, end_idx in enumerate(tokens_per_expert): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # if start_idx == end_idx: -++++++ # continue -++++++ # expert = self.experts[i] -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = expert(expert_tokens) -++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -++++++ # return expert_cache -++++++ # @no_grad() -++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++ # expert_cache = ops.zeros_like(x) -++++++ -++++++ # # 排序保证顺序一致 -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++ -++++++ # # 找出有 token 的专家 -++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++ -++++++ # for i in active_experts.tolist(): -++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ # end_idx = tokens_per_expert[i] -++++++ # if start_idx == end_idx: # 没有 token -++++++ # continue -++++++ -++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++ # expert_tokens = x[exp_token_idx] -++++++ # expert_out = self.experts[i](expert_tokens) -++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++ -++++++ # expert_cache = mindspore.mint.scatter_add( -++++++ # expert_cache, -++++++ # 0, -++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++ # expert_out -++++++ # ) -++++++ -++++++ # return expert_cache -+++++ -+++++ -+++++ -+++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++- -+++++ # class DeepseekFlashAttention(nn.Module): -+++++ # """ -+++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -++++++ -+++++ Deepseek_ATTENTION_CLASSES = { -+++++ "eager": DeepseekAttention, -+++++ "flash-attention": DeepseekFlashAttention, -+++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -+++++ ) -+++++ else: -+++++ # 4d mask is passed through the layers -+++++- attention_mask = _prepare_4d_causal_attention_mask( -++++++ # attention_mask = _prepare_4d_causal_attention_mask( -++++++ # attention_mask, -++++++ # (batch_size, seq_length), -++++++ # inputs_embeds, -++++++ # past_key_values_length, -++++++ # ) -++++++ #@dwj -++++++ attention_mask = get_cached_causal_mask( -+++++ attention_mask, -+++++ (batch_size, seq_length), -+++++ inputs_embeds, -+++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -+++++ self.warm_up = False -++++++ #@dwj -++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -++++++ self.num_layers, -++++++ self.num_attention_heads, -++++++ self.head_dim, -++++++ batch_size=1, -++++++ max_length=self.max_length, -++++++ dtype=mindspore.float16 -++++++ ) -++++++ -++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -++++++ key_cache = [] -++++++ value_cache = [] -++++++ for _ in range(num_layers): -++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -++++++ key_cache.append(k) -++++++ value_cache.append(v) -++++++ return key_cache, value_cache -++++++ -+++++ -+++++ def warmup_moe_model_deep(self): -+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++index bced285c..ebd7782e 100644 -+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -+++++ -+++++-Long_Prompt = False -+++++-PROMPT_LENGTH_THRESHOLD = 128 -++++++Long_Prompt = 1 -++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 -++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -++++++ -++++++_causal_mask_cache = {} -++++++ -++++++def get_cached_causal_mask_with_cache_position( -++++++ attention_mask: mindspore.Tensor, -++++++ sequence_length: int, -++++++ target_length: int, -++++++ dtype: mindspore.dtype, -++++++ min_dtype: float, -++++++ cache_position: mindspore.Tensor, -++++++ batch_size: int, -++++++): -++++++ """ -++++++ 带缓存的 causal mask 构造函数 -++++++ """ -++++++ # q_len 是当前 query 长度 -++++++ q_len = sequence_length -++++++ # kv_len 是 target_length -++++++ kv_len = target_length -++++++ -++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -++++++ -++++++ if key in _causal_mask_cache: -++++++ return _causal_mask_cache[key] -++++++ -++++++ # 调用原来的 mask 构造逻辑 -++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++++ attention_mask, -++++++ sequence_length=sequence_length, -++++++ target_length=target_length, -++++++ dtype=dtype, -++++++ min_dtype=min_dtype, -++++++ cache_position=cache_position, -++++++ batch_size=batch_size, -++++++ ) -++++++ # 缓存结果 -++++++ _causal_mask_cache[key] = causal_mask -++++++ return causal_mask -+++++ -+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -+++++ def _prepare_4d_causal_attention_mask_with_cache_position( -+++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++++ -+++++ -+++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -++++++# class Qwen2MoeAttention(nn.Module): -++++++# """ -++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++++++# and "Generating Long Sequences with Sparse Transformers". -++++++# """ -++++++ -++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++# super().__init__() -++++++# self.config = config -++++++# self.layer_idx = layer_idx -++++++# if layer_idx is None: -++++++# logger.warning_once( -++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++++# "when creating this class." -++++++# ) -++++++ -++++++# self.hidden_size = config.hidden_size -++++++# self.num_heads = config.num_attention_heads -++++++# self.head_dim = self.hidden_size // self.num_heads -++++++# self.num_key_value_heads = config.num_key_value_heads -++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++# self.max_position_embeddings = config.max_position_embeddings -++++++# self.rope_theta = config.rope_theta -++++++# self.is_causal = True -++++++# self.attention_dropout = config.attention_dropout -++++++ -++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -++++++# raise ValueError( -++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -++++++# f" and `num_heads`: {self.num_heads})." -++++++# ) -++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++ -++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++# self.head_dim, -++++++# max_position_embeddings=self.max_position_embeddings, -++++++# base=self.rope_theta, -++++++# ) -++++++ -++++++# def forward( -++++++# self, -++++++# hidden_states: mindspore.Tensor, -++++++# attention_mask: Optional[mindspore.Tensor] = None, -++++++# position_ids: Optional[mindspore.Tensor] = None, -++++++# past_key_value: Optional[Cache] = None, -++++++# output_attentions: bool = False, -++++++# use_cache: bool = False, -++++++# cache_position: Optional[mindspore.Tensor] = None, -++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++ -++++++ -++++++ -++++++# bsz, q_len, _ = hidden_states.shape -++++++ -++++++# query_states = self.q_proj(hidden_states) -++++++# key_states = self.k_proj(hidden_states) -++++++# value_states = self.v_proj(hidden_states) -++++++ -++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++ -++++++# kv_seq_len = key_states.shape[-2] -++++++# if past_key_value is not None: -++++++# if self.layer_idx is None: -++++++# raise ValueError( -++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++# "with a layer index." -++++++# ) -++++++# if isinstance(past_key_value, StaticCache): -++++++# kv_seq_len = key_states.shape[-2] -++++++# else: -++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++# if past_key_value is not None: -++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++# if isinstance(past_key_value, StaticCache): -++++++# kv_seq_len = key_states.shape[-2] -++++++ -++++++# # repeat k/v heads if n_kv_heads < n_heads -++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++ -++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++ -++++++# if attention_mask is not None: -++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++# attn_weights = attn_weights + causal_mask -++++++ -++++++# # upcast attention to fp32 -++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++++# attn_output = ops.matmul(attn_weights, value_states) -++++++ -++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++++# raise ValueError( -++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++++++# f" {attn_output.shape}" -++++++# ) -++++++ -++++++# attn_output = ops.transpose(attn_output, 1, 2) -++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++ -++++++# attn_output = self.o_proj(attn_output) -++++++# # @lwx -++++++ -++++++# # max_seq_len = self.max_position_embeddings # 2048 -++++++ -++++++# # if attention_mask is not None: -++++++# # # attention_mask: [B, 1, Sq, Sk] -++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++ -++++++# # # pad 到 [max_seq_len, max_seq_len] -++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++# # global_attention_mask = padded_mask -++++++# # else: -++++++# # global_attention_mask = None -++++++ -++++++ -++++++# # sparse_mode=3 -++++++# # attn_output = mindspore.ops.flash_attention_score( -++++++# # query=query_states, -++++++# # key=key_states, -++++++# # value=value_states, -++++++# # real_shift=None, -++++++# # padding_mask=None, -++++++ -++++++# # head_num=self.num_heads, -++++++# # attn_mask=global_attention_mask, -++++++# # keep_prob=1.0 - self.attention_dropout, -++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++# # input_layout="BNSD", -++++++# # pre_tokens=2147483647, -++++++# # next_tokens=2147483647, -++++++# # inner_precise=0, -++++++# # drop_mask=None, -++++++# # prefix=None, -++++++# # actual_seq_qlen=None, -++++++# # actual_seq_kvlen=None, -++++++# # sparse_mode=sparse_mode, -++++++# # ) -++++++# if not output_attentions: -++++++# attn_weights = None -++++++ -++++++# return attn_output, attn_weights, past_key_value -++++++ -+++++ class Qwen2MoeAttention(nn.Module): -+++++ """ -+++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+++++- and "Generating Long Sequences with Sparse Transformers". -+++++- """ -++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -+++++ -++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -++++++ -++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -++++++ """ -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++ super().__init__() -+++++ self.config = config -+++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -+++++ if layer_idx is None: -+++++ logger.warning_once( -+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++ "when creating this class." -+++++ ) -+++++ -+++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -+++++ use_cache: bool = False, -+++++ cache_position: Optional[mindspore.Tensor] = None, -+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++- -+++++ -+++++- -++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -+++++ bsz, q_len, _ = hidden_states.shape -+++++ -+++++ query_states = self.q_proj(hidden_states) -+++++ key_states = self.k_proj(hidden_states) -+++++ value_states = self.v_proj(hidden_states) -+++++ -+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++- -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ -+++++ kv_seq_len = key_states.shape[-2] -+++++ if past_key_value is not None: -+++++- if self.layer_idx is None: -+++++- raise ValueError( -+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++- "with a layer index." -+++++- ) -+++++- if isinstance(past_key_value, StaticCache): -+++++- kv_seq_len = key_states.shape[-2] -+++++- else: -+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++ -+++++ if past_key_value is not None: -+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++ -++++++ # --- 2. 动态调度核心注意力计算 --- -++++++ global Long_Prompt -++++++ if Long_Prompt >= 1: -++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -++++++ fa_attention_mask = None -++++++ if attention_mask is not None: -++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++ fa_attention_mask = (mask_slice != 0) -++++++ -++++++ attn_output = mindspore.ops.flash_attention_score( -++++++ query=query_states, -++++++ key=key_states, -++++++ value=value_states, -++++++ head_num=self.num_heads, -++++++ attn_mask=fa_attention_mask, -++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ input_layout="BNSD", -++++++ sparse_mode=0, -++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -++++++ ) -+++++ -+++++- if isinstance(past_key_value, StaticCache): -+++++- kv_seq_len = key_states.shape[-2] -++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = self.o_proj(attn_output) -++++++ attn_weights = None -++++++ if output_attentions: -++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -+++++ -+++++- # repeat k/v heads if n_kv_heads < n_heads -+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++- -+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++ else: -++++++ # --- Eager Attention 路径 (用于短序列和解码) --- -++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++ -++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++ -+++++- if attention_mask is not None: -+++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++- attn_weights = attn_weights + causal_mask -++++++ if attention_mask is not None: -++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++ attn_weights = attn_weights + causal_mask -+++++ -+++++- # upcast attention to fp32 -+++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++++- attn_output = ops.matmul(attn_weights, value_states) -++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++++ attn_output = ops.matmul(attn_weights, value_states) -+++++ -+++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++++- raise ValueError( -+++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+++++- f" {attn_output.shape}" -+++++- ) -++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++++ raise ValueError( -++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -++++++ ) -+++++ -+++++- attn_output = ops.transpose(attn_output, 1, 2) -+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = ops.transpose(attn_output, 1, 2) -++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = self.o_proj(attn_output) -+++++ -+++++- attn_output = self.o_proj(attn_output) -+++++- # @lwx -++++++ if not output_attentions: -++++++ attn_weights = None -+++++ -+++++- # max_seq_len = self.max_position_embeddings # 2048 -+++++- -+++++- # if attention_mask is not None: -+++++- # # attention_mask: [B, 1, Sq, Sk] -+++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++- -+++++- # # pad 到 [max_seq_len, max_seq_len] -+++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++- # global_attention_mask = padded_mask -+++++- # else: -+++++- # global_attention_mask = None -+++++- -+++++- -+++++- # sparse_mode=3 -+++++- # attn_output = mindspore.ops.flash_attention_score( -+++++- # query=query_states, -+++++- # key=key_states, -+++++- # value=value_states, -+++++- # real_shift=None, -+++++- # padding_mask=None, -+++++- -+++++- # head_num=self.num_heads, -+++++- # attn_mask=global_attention_mask, -+++++- # keep_prob=1.0 - self.attention_dropout, -+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++- # input_layout="BNSD", -+++++- # pre_tokens=2147483647, -+++++- # next_tokens=2147483647, -+++++- # inner_precise=0, -+++++- # drop_mask=None, -+++++- # prefix=None, -+++++- # actual_seq_qlen=None, -+++++- # actual_seq_kvlen=None, -+++++- # sparse_mode=sparse_mode, -+++++- # ) -+++++- if not output_attentions: -+++++- attn_weights = None -+++++- -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++- -+++++ # class Qwen2MoeFlashAttention(nn.Module): -+++++ # """ -+++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -+++++ # return final_hidden_states, router_logits -+++++ -+++++ -+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++-# """ -+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 -+++++-# """ -+++++-# def __init__(self, config: Qwen2MoeConfig): -+++++-# super().__init__() -+++++-# self.num_experts = config.num_experts -+++++-# self.top_k = config.num_experts_per_tok -+++++-# self.norm_topk_prob = config.norm_topk_prob -+++++- -+++++-# # 门控网络 -+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++-# # 专家列表 -+++++-# self.experts = nn.ModuleList( -+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++-# ) -+++++-# # 共享专家 -+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_decode( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# """ -+++++-# 【解码路径】针对 sequence_length=1 的极致优化。 -+++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+++++-# """ -+++++-# batch_size, hidden_dim = hidden_states.shape -+++++- -+++++-# expert_outputs_list = [ -+++++-# ops.cat([ -+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++-# ], dim=0) -+++++-# for i in range(batch_size) -+++++-# ] -+++++- -+++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+++++-# # shape: (batch_size, top_k, hidden_dim) -+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++- -+++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++- -+++++-# return moe_output.squeeze(1) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_prefill( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# """ -+++++-# 【预填充路径】针对 sequence_length > 1 的优化。 -+++++-# 按专家对 Token 进行分组,并进行批处理。 -+++++-# """ -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens = hidden_states.shape[0] -+++++-# flat_selected_experts = selected_experts.flatten() -+++++- -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++- -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++- -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++- -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++-# selected_token_indices = token_indices[mask] -+++++-# selected_routing_weights = routing_weights.flatten()[mask] -+++++- -+++++-# current_states = hidden_states[selected_token_indices] -+++++- -+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++- -+++++-# moe_output = moe_output.index_add( -+++++-# dim=0, -+++++-# index=selected_token_indices, -+++++-# source=expert_output.to(hidden_states.dtype) -+++++-# ) -+++++-# return moe_output -+++++- -+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-# """ -+++++-# 顶层 forward 方法,作为智能分发器。 -+++++-# """ -+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- -+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-# router_logits = self.gate(hidden_states_reshaped) -+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- -+++++-# if self.norm_topk_prob: -+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- -+++++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++-# moe_output = None -+++++-# # 在推理时,根据序列长度选择最优路径 -+++++-# if not self.training: -+++++-# if sequence_length == 1: -+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++-# else: -+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++-# else: -+++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+++++-# raise NotImplementedError("Training path is not implemented.") -+++++- -+++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+++++- -+++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+++++- -+++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+++++- -+++++-# return final_hidden_states, router_logits -+++++- -+++++- -+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++-# """ -+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+++++-# """ -+++++-# def __init__(self, config: Qwen2MoeConfig): -+++++-# super().__init__() -+++++-# self.num_experts = config.num_experts -+++++-# self.top_k = config.num_experts_per_tok -+++++-# self.norm_topk_prob = config.norm_topk_prob -+++++- -+++++-# # 门控网络 -+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++-# # 专家列表 -+++++-# self.experts = nn.ModuleList( -+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++-# ) -+++++-# # 共享专家 -+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_decode( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# batch_size, _ = hidden_states.shape -+++++-# expert_outputs_list = [ -+++++-# ops.cat([ -+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++-# ], dim=0) -+++++-# for i in range(batch_size) -+++++-# ] -+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++-# return moe_output.squeeze(1) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_prefill( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens = hidden_states.shape[0] -+++++-# flat_selected_experts = selected_experts.flatten() -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++- -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++-# selected_token_indices = token_indices[mask] -+++++-# selected_routing_weights = routing_weights.flatten()[mask] -+++++-# current_states = hidden_states[selected_token_indices] -+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++-# moe_output = moe_output.index_add( -+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++-# ) -+++++-# return moe_output -+++++- -+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-# """ -+++++-# 顶层 forward 方法,作为智能分发器。 -+++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+++++-# """ -+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- -+++++-# # 1. 门控计算 (通用逻辑) -+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-# router_logits = self.gate(hidden_states_reshaped) -+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- -+++++-# if self.norm_topk_prob: -+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- -+++++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++-# # 2. 智能分发到最优 MoE 路径 -+++++-# moe_output = None -+++++-# if not self.training: -+++++-# if sequence_length == 1: -+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++-# else: -+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++-# else: -+++++-# raise NotImplementedError("Training path is not implemented.") -+++++- -+++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++- -+++++-# # 4. 合并 MoE 输出和共享专家输出 -+++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++- -+++++-# # 5. 恢复原始形状并返回 -+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++- -+++++-# return final_hidden_states, router_logits -+++++- -+++++-# prefill fastest -+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++-# """ -+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+++++-# """ -+++++-# def __init__(self, config: Qwen2MoeConfig): -+++++-# super().__init__() -+++++-# self.num_experts = config.num_experts -+++++-# self.top_k = config.num_experts_per_tok -+++++-# self.norm_topk_prob = config.norm_topk_prob -+++++- -+++++-# # 门控网络 -+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++-# # 专家列表 -+++++-# self.experts = nn.ModuleList( -+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++-# ) -+++++-# # 共享专家 -+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_dispatch( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# """ -+++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+++++-# """ -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens, _ = hidden_states.shape -+++++- -+++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+++++-# flat_selected_experts = selected_experts.flatten() -+++++-# flat_routing_weights = routing_weights.flatten() -+++++- -+++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++- -+++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++- -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++- -+++++-# # 找到所有分配给该专家的 token -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++- -+++++-# # 使用 mask 选取对应的 token 和权重 -+++++-# current_token_indices = token_indices[mask] -+++++-# current_routing_weights = flat_routing_weights[mask] -+++++-# current_hidden_states = hidden_states[current_token_indices] -+++++- -+++++-# # 对这些 token 进行批处理 -+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++- -+++++-# # 使用 index_add 将结果精确地加回到对应位置 -+++++-# moe_output = moe_output.index_add( -+++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+++++-# ) -+++++-# return moe_output -+++++- -+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-# """ -+++++-# 顶层 forward 方法,作为智能分发器。 -+++++-# """ -+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- -+++++-# # 1. 门控计算 -+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-# router_logits = self.gate(hidden_states_reshaped) -+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- -+++++-# if self.norm_topk_prob: -+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- -+++++-# routing_weights = routing_weights.to(hidden_states.dtype) -+++++- -+++++-# # 2. 调用统一的 MoE 计算内核 -+++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -+++++- -+++++-# # 3. 统一处理共享专家 -+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++- -+++++-# # 4. 合并输出 -+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++- -+++++-# # 5. 恢复原始形状并返回 -+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++- -+++++-# return final_hidden_states, router_logits -+++++- -+++++- -+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++-# """ -+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++-# 【最终高性能与高精度版】: -+++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+++++-# 3. 这样实现了速度和准确性的两全其美。 -+++++-# """ -+++++-# def __init__(self, config: Qwen2MoeConfig): -+++++-# super().__init__() -+++++-# self.num_experts = config.num_experts -+++++-# self.top_k = config.num_experts_per_tok -+++++-# self.norm_topk_prob = config.norm_topk_prob -+++++- -+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++-# self.experts = nn.ModuleList( -+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++-# ) -+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_decode( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# """ -+++++-# 【解码路径】极致优化版:bmm + 高精度累加。 -+++++-# """ -+++++-# original_dtype = hidden_states.dtype -+++++-# batch_size, _ = hidden_states.shape -+++++- -+++++-# expert_outputs_list = [ -+++++-# ops.cat([ -+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++-# ], dim=0) -+++++-# for i in range(batch_size) -+++++-# ] -+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++- -+++++-# # 在 float32 下执行 bmm,得到高精度结果 -+++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++- -+++++-# # 将高精度结果转换回原始数据类型 -+++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+++++- -+++++-# return moe_output -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_prefill( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# selected_experts: mindspore.Tensor, -+++++-# routing_weights: mindspore.Tensor -+++++-# ) -> mindspore.Tensor: -+++++-# """ -+++++-# 【预填充路径】与原始实现一致,结果精确。 -+++++-# """ -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens, _ = hidden_states.shape -+++++-# flat_selected_experts = selected_experts.flatten() -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++- -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++-# selected_token_indices = token_indices[mask] -+++++-# selected_routing_weights = routing_weights.flatten()[mask] -+++++-# current_states = hidden_states[selected_token_indices] -+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++-# moe_output = moe_output.index_add( -+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++-# ) -+++++-# return moe_output -+++++- -+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++- -+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-# router_logits = self.gate(hidden_states_reshaped) -+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++- -+++++-# if self.norm_topk_prob: -+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- -+++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+++++-# # 如果模型主体是 float16,后续再转换 -+++++- -+++++-# moe_output = None -+++++-# if not self.training: -+++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+++++-# # _moe_infer_decode 内部会处理好类型转换 -+++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+++++-# if sequence_length == 1: -+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++-# else: -+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++-# else: -+++++-# raise NotImplementedError("Training path is not implemented.") -+++++- -+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++- -+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++- -+++++-# return final_hidden_states, router_logits -+++++- -+++++- -+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++-# """ -+++++-# 【融合版】一个混合专家模块,内置两种推理策略, -+++++-# 由外部全局变量 `Long_Prompt` 控制: -+++++- -+++++-# - if Long_Prompt is True: 【精度优先模式】 -+++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+++++-# 适用于处理长序列,避免误差累积。 -+++++- -+++++-# - if Long_Prompt is False: 【速度优先模式】 -+++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 -+++++-# """ -+++++-# def __init__(self, config: Qwen2MoeConfig): -+++++-# super().__init__() -+++++-# self.num_experts = config.num_experts -+++++-# self.top_k = config.num_experts_per_tok -+++++-# self.norm_topk_prob = config.norm_topk_prob -+++++- -+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++-# self.experts = nn.ModuleList( -+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++-# ) -+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-# # --- 速度优先模式的辅助函数 --- -+++++-# @no_grad() -+++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++-# original_dtype = hidden_states.dtype -+++++-# batch_size, _ = hidden_states.shape -+++++-# expert_outputs_list = [ -+++++-# ops.cat([ -+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++-# ], dim=0) -+++++-# for i in range(batch_size) -+++++-# ] -+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++-# weights_fp32 = routing_weights.to(mindspore.float32) -+++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++-# return moe_output_fp32.squeeze(1).to(original_dtype) -+++++- -+++++-# @no_grad() -+++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens, _ = hidden_states.shape -+++++-# flat_selected_experts = selected_experts.flatten() -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++-# selected_token_indices = token_indices[mask] -+++++-# selected_routing_weights = routing_weights.flatten()[mask] -+++++-# current_states = hidden_states[selected_token_indices] -+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++-# return moe_output -+++++- -+++++-# # --- 精度优先模式的辅助函数 --- -+++++-# @no_grad() -+++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++-# moe_output = ops.zeros_like(hidden_states) -+++++-# num_tokens, _ = hidden_states.shape -+++++-# flat_selected_experts = selected_experts.flatten() -+++++-# flat_routing_weights = routing_weights.flatten() -+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++-# active_experts = ops.unique(flat_selected_experts) -+++++-# for expert_idx_tensor in active_experts: -+++++-# expert_idx = expert_idx_tensor.item() -+++++-# expert_layer = self.experts[expert_idx] -+++++-# mask = (flat_selected_experts == expert_idx_tensor) -+++++-# current_token_indices = token_indices[mask] -+++++-# current_routing_weights = flat_routing_weights[mask] -+++++-# current_hidden_states = hidden_states[current_token_indices] -+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++-# return moe_output -+++++- -+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-# # 声明我们将要使用一个在模块外部定义的全局变量 -+++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+++++-# global Long_Prompt -+++++- -+++++-# # 1. 门控计算 (所有模式通用) -+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-# router_logits = self.gate(hidden_states_reshaped) -+++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++++-# if self.norm_topk_prob: -+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++- -+++++-# moe_output = None -+++++-# if not self.training: -+++++-# # 根据 Long_Prompt 标志选择模式 -+++++-# if Long_Prompt: -+++++-# # --- 精度优先模式 --- -+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++-# else: -+++++-# # --- 速度优先模式 --- -+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++-# if sequence_length == 1: -+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++-# else: -+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++-# else: -+++++-# raise NotImplementedError("Training path is not implemented.") -+++++- -+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++- -+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++- -+++++-# return final_hidden_states, router_logits -+++++- -+++++ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ """ -+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++ return moe_output_fp32.squeeze(1).to(original_dtype) -+++++ -++++++ # @no_grad() -++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++ # num_tokens, _ = hidden_states.shape -++++++ # flat_selected_experts = selected_experts.flatten() -++++++ # sorted_expert_indices = flat_selected_experts.argsort() -++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++++ # original_token_indices = sorted_expert_indices // self.top_k -++++++ # moe_output = ops.zeros_like(hidden_states) -++++++ # current_token_offset = 0 -++++++ # for i in range(self.num_experts): -++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset -++++++ # if expert_token_count == 0: -++++++ # continue -++++++ # end_offset = current_token_offset + expert_token_count -++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] -++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++ # current_token_offset += expert_token_count -++++++ # return moe_output -++++++ -+++++ @no_grad() -+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++- num_tokens, _ = hidden_states.shape -+++++- flat_selected_experts = selected_experts.flatten() -+++++- sorted_expert_indices = flat_selected_experts.argsort() -+++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++++- original_token_indices = sorted_expert_indices // self.top_k -++++++ """ -++++++ 优化版 MoE prefill (速度优先模式): -++++++ - 批量张量化处理同一个 expert 的所有 token -++++++ - 跳过无 token 的专家 -++++++ - 保持结果完全一致 -++++++ """ -+++++ moe_output = ops.zeros_like(hidden_states) -+++++- current_token_offset = 0 -+++++- for i in range(self.num_experts): -+++++- expert_token_count = tokens_per_expert[i] - current_token_offset -+++++- if expert_token_count == 0: -+++++- continue -+++++- end_offset = current_token_offset + expert_token_count -+++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++++- expert_hidden_states = hidden_states[expert_original_token_indices] -+++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++- current_token_offset += expert_token_count -++++++ -++++++ flat_selected_experts = selected_experts.flatten() -++++++ flat_routing_weights = routing_weights.flatten() -++++++ -++++++ idxs = flat_selected_experts.argsort() -++++++ sorted_expert_indices = flat_selected_experts[idxs] -++++++ sorted_token_indices = idxs // self.top_k -++++++ -++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -++++++ -++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++++++ -++++++ for expert_id in active_experts.tolist(): -++++++ start = int(tokens_per_expert[:expert_id].sum().item()) -++++++ end = start + int(tokens_per_expert[expert_id].item()) -++++++ -++++++ token_idx = sorted_token_indices[start:end] -++++++ expert_tokens = hidden_states[token_idx] -++++++ -++++++ expert_out = self.experts[expert_id](expert_tokens) -++++++ -++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -++++++ -++++++ moe_output = mindspore.mint.scatter_add( -++++++ moe_output, -++++++ 0, -++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -++++++ scaled_out.to(hidden_states.dtype) -++++++ ) -++++++ -+++++ return moe_output -+++++ -++++++ -+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+++++ @no_grad() -+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++ -+++++ moe_output = None -+++++- if Long_Prompt: -+++++- # --- 精度优先模式 (ACCURACY MODE) --- -+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ # if Long_Prompt==0: -++++++ # # --- 精度优先模式 (ACCURACY MODE) --- -++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ # else: -++++++ # # --- 速度优先模式 (SPEED MODE) --- -++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++ # if sequence_length == 1: -++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ # else: -++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ -++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++ if sequence_length == 1: -++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++ else: -+++++- # --- 速度优先模式 (SPEED MODE) --- -+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++- if sequence_length == 1: -+++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++- else: -+++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++- -++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ -+++++ -+++++ # 3. 共享专家计算与合并 (所有模式通用) -+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++ -+++++ return final_hidden_states, router_logits -+++++ -++++++ -+++++ class Qwen2MoeDecoderLayer(nn.Module): -+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -+++++ super().__init__() -+++++ self.hidden_size = config.hidden_size -+++++ -+++++- # if Long_Prompt: -+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++- # else: -++++++ # if Long_Prompt == 2: -+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++ # else: -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ -+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++ -+++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++++ ) -+++++ -+++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -+++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++++ # attention_mask, -++++++ # sequence_length=sequence_length, -++++++ # target_length=target_length, -++++++ # dtype=dtype, -++++++ # min_dtype=min_dtype, -++++++ # cache_position=cache_position, -++++++ # batch_size=input_tensor.shape[0], -++++++ # ) -++++++ #@dwj -++++++ causal_mask = get_cached_causal_mask_with_cache_position( -+++++ attention_mask, -+++++ sequence_length=sequence_length, -+++++ target_length=target_length, -+++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+++++ """ -+++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -++++++ _causal_mask_cache.clear() -+++++ -+++++ input_ids = kwargs.get("input_ids") -+++++ if input_ids is None and args: -+++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ -+++++ if input_ids is not None: -+++++ prompt_length = input_ids.shape[1] -+++++- -+++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: -+++++- Long_Prompt = True -++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -++++++ Long_Prompt = 2 -++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -++++++ Long_Prompt = 0 -+++++ else: -+++++- Long_Prompt = False -++++++ Long_Prompt = 1 -++++++ -+++++ -+++++ return super().generate(*args, **kwargs) -+++++ -+++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++ dtype = self.lm_head.weight.dtype -+++++ min_dtype = float(ops.finfo(dtype).min) -+++++ -+++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -++++++ # attention_mask, -++++++ # sequence_length=sequence_length, -++++++ # target_length=past_key_values.get_max_length(), -++++++ # dtype=dtype, -++++++ # min_dtype=min_dtype, -++++++ # cache_position=cache_position, -++++++ # batch_size=batch_size, -++++++ # ) -++++++ -++++++ #@dwj -++++++ attention_mask = get_cached_causal_mask_with_cache_position( -+++++ attention_mask, -+++++ sequence_length=sequence_length, -+++++ target_length=past_key_values.get_max_length(), -+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++++deleted file mode 100644 -+++++index 6dfb5b93..00000000 -+++++--- a/patches/0001-20251104commit.patch -++++++++ /dev/null -+++++@@ -1,1272 +0,0 @@ -+++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++++-From: Pinoeer-kingxi <13022943007@163.com> -+++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++++-Subject: [PATCH] 20251104commit -+++++- -+++++---- -+++++- mindnlp/transformers/cache_utils.py | 28 +- -+++++- .../models/deepseek/modeling_deepseek.py | 149 ++- -+++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++++- 3 files changed, 976 insertions(+), 87 deletions(-) -+++++- -+++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++++-index cadd2e04..02f8d4be 100644 -+++++---- a/mindnlp/transformers/cache_utils.py -+++++-+++ b/mindnlp/transformers/cache_utils.py -+++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++++- # k_out[:, :, cache_position] = key_states -+++++- # v_out[:, :, cache_position] = value_states -+++++-- if ON_ORANGE_PI: -+++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++-- else: -+++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++-- -+++++-+ # if ON_ORANGE_PI: -+++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++-+ # else: -+++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 -+++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++++-+ if cache_position.ndim > 1: -+++++-+ cache_position = cache_position.flatten() -+++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) -+++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++++-+ cache_position = cache_position.int() -+++++-+ -+++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++++-+ k_out[:, :, cache_position] = key_states -+++++-+ v_out[:, :, cache_position] = value_states -+++++-+ -+++++- return k_out, v_out -+++++- -+++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++-index c695b944..d8303e45 100644 -+++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++++- # Copied from transformers.models.llama.modeling_llama.rotate_half -+++++- def rotate_half(x): -+++++- """Rotates half the hidden dims of the input.""" -+++++-- x1 = x[..., : x.shape[-1] // 2] -+++++-- x2 = x[..., x.shape[-1] // 2 :] -+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++-+ # x1 = x[..., : x.shape[-1] // 2] -+++++-+ # x2 = x[..., x.shape[-1] // 2 :] -+++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++- return ops.cat((-x2, x1), dim=-1) -+++++- -+++++- -+++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++++- if self.training: -+++++- raise NotImplementedError("Training is not supported yet.") -+++++- else: -+++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++-- if self.config.n_shared_experts is not None: -+++++-- y = y + self.shared_experts(identity) -+++++-- return y -+++++-+ # @lwx -+++++-+ if orig_shape[1] == 1: -+++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++++-+ y=y.view(*orig_shape) -+++++-+ if self.config.n_shared_experts is not None: -+++++-+ y = y + self.shared_experts(identity) -+++++-+ return y -+++++-+ else: -+++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++++-+ if self.config.n_shared_experts is not None: -+++++-+ y = y + self.shared_experts(identity) -+++++-+ return y -+++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++-+ # if self.config.n_shared_experts is not None: -+++++-+ # y = y + self.shared_experts(identity) -+++++-+ # return y -+++++-+ -+++++-+ @no_grad() -+++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++-+ -+++++-+ expert_cache = ops.zeros_like(x) -+++++-+ for i in range(self.num_experts_per_tok): -+++++-+ expert_id = flat_expert_indices[i].item() -+++++-+ weight = flat_expert_weights[i].item() -+++++-+ expert = self.experts[expert_id] -+++++-+ expert_out = expert(x) -+++++-+ expert_cache += expert_out * weight -+++++-+ return expert_cache -+++++- -+++++- @no_grad() -+++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++-- # expert_cache = torch.zeros_like(x) -+++++-- # idxs = flat_expert_indices.argsort() -+++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++-- # token_idxs = idxs // self.num_experts_per_tok -+++++-- # for i, end_idx in enumerate(tokens_per_expert): -+++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++-- # if start_idx == end_idx: -+++++-- # continue -+++++-- # expert = self.experts[i] -+++++-- # exp_token_idx = token_idxs[start_idx:end_idx] -+++++-- # expert_tokens = x[exp_token_idx] -+++++-- # expert_out = expert(expert_tokens) -+++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++-- # return expert_cache -+++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++- expert_cache = ops.zeros_like(x) -+++++- idxs = flat_expert_indices.argsort() -+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++- token_idxs = idxs // self.num_experts_per_tok -+++++-+ -+++++- for i, end_idx in enumerate(tokens_per_expert): -+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++- if start_idx == end_idx: -+++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++++- expert_out = expert(expert_tokens) -+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++-+ -+++++- return expert_cache -+++++-+ -+++++-+ # @no_grad() -+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++-+ # # expert_cache = torch.zeros_like(x) -+++++-+ # # idxs = flat_expert_indices.argsort() -+++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++-+ # # token_idxs = idxs // self.num_experts_per_tok -+++++-+ # # for i, end_idx in enumerate(tokens_per_expert): -+++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++-+ # # if start_idx == end_idx: -+++++-+ # # continue -+++++-+ # # expert = self.experts[i] -+++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++-+ # # expert_tokens = x[exp_token_idx] -+++++-+ # # expert_out = expert(expert_tokens) -+++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++-+ # # return expert_cache -+++++-+ # expert_cache = ops.zeros_like(x) -+++++-+ # idxs = flat_expert_indices.argsort() -+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++-+ # token_idxs = idxs // self.num_experts_per_tok -+++++-+ -+++++-+ # for i, end_idx in enumerate(tokens_per_expert): -+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++-+ # if start_idx == end_idx: -+++++-+ # continue -+++++-+ # expert = self.experts[i] -+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++-+ # expert_tokens = x[exp_token_idx] -+++++-+ # expert_out = expert(expert_tokens) -+++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++-+ -+++++-+ # return expert_cache -+++++-+ # @no_grad() -+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++-+ # expert_cache = ops.zeros_like(x) -+++++-+ -+++++-+ # # 排序保证顺序一致 -+++++-+ # idxs = flat_expert_indices.argsort() -+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++-+ # token_idxs = idxs // self.num_experts_per_tok -+++++-+ -+++++-+ # # 找出有 token 的专家 -+++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++-+ -+++++-+ # for i in active_experts.tolist(): -+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++-+ # end_idx = tokens_per_expert[i] -+++++-+ # if start_idx == end_idx: # 没有 token -+++++-+ # continue -+++++-+ -+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++-+ # expert_tokens = x[exp_token_idx] -+++++-+ # expert_out = self.experts[i](expert_tokens) -+++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++-+ -+++++-+ # expert_cache = mindspore.mint.scatter_add( -+++++-+ # expert_cache, -+++++-+ # 0, -+++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++-+ # expert_out -+++++-+ # ) -+++++-+ -+++++-+ # return expert_cache -+++++-+ -+++++-+ -+++++- -+++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++++- # """ -+++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++- -+++++- # Initialize weights and apply final processing -+++++- self.post_init() -+++++-+ self.warm_up = False -+++++-+ -+++++-+ def warmup_moe_model_deep(self): -+++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++-+ test_texts = [ -+++++-+ "warmup short", -+++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++++-+ ] -+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++-+ if tokenizer is None: -+++++-+ from mindnlp.transformers import AutoTokenizer -+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++-+ self._warmup_tokenizer = tokenizer -+++++-+ -+++++-+ for text in test_texts: -+++++-+ inputs = tokenizer(text, return_tensors="ms") -+++++-+ with mindspore._no_grad(): -+++++-+ _ = self(**inputs, use_cache=False) -+++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++++- -+++++- def get_input_embeddings(self): -+++++- return self.model.embed_tokens -+++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++- ```""" -+++++-+ if not self.warm_up: -+++++-+ self.warm_up = True -+++++-+ self.warmup_moe_model_deep() -+++++-+ -+++++- output_attentions = ( -+++++- output_attentions -+++++- if output_attentions is not None -+++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++-index 3cbf820e..d4c6b651 100644 -+++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++-@@ -18,7 +18,6 @@ -+++++- # See the License for the specific language governing permissions and -+++++- # limitations under the License. -+++++- """MindSpore Qwen2MoE model.""" -+++++-- -+++++- import math -+++++- from typing import List, Optional, Tuple, Union -+++++- -+++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++++- TokenClassifierOutput, -+++++- ) -+++++- from ...modeling_utils import PreTrainedModel -+++++-+from ...generation import GenerationMixin -+++++- from ....utils import logging -+++++- from .configuration_qwen2_moe import Qwen2MoeConfig -+++++- -+++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++++- self.variance_epsilon = eps -+++++- -+++++- def forward(self, hidden_states): -+++++-+ # @dwj -+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++-+ # @lwx -+++++-+ # if not self.training : -+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++- input_dtype = hidden_states.dtype -+++++- hidden_states = hidden_states.to(mindspore.float32) -+++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++++-@@ -234,6 +239,8 @@ def rotate_half(x): -+++++- """Rotates half the hidden dims of the input.""" -+++++- x1 = x[..., : x.shape[-1] // 2] -+++++- x2 = x[..., x.shape[-1] // 2 :] -+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++- return ops.cat((-x2, x1), dim=-1) -+++++- -+++++- -+++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++++- self.config = config -+++++- self.hidden_size = config.hidden_size -+++++- self.intermediate_size = intermediate_size -+++++-+ -+++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++++- self.act_fn = ACT2FN[config.hidden_act] -+++++- -+++++- def forward(self, x): -+++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++-- -+++++- -+++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++-+ # @lwx -+++++-+ # gate_up_output = self.gate_up_proj(x) -+++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++++-+ # return self.down_proj(swiglu_output) -+++++-+ -+++++-+ # def forward(self, x): -+++++-+ # gate_proj_out = self.gate_proj(x) -+++++-+ # up_proj_out = self.up_proj(x) -+++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++++-+ # return self.down_proj(swiglu_out) -+++++-+ -+++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++++- """ -+++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++++- use_cache: bool = False, -+++++- cache_position: Optional[mindspore.Tensor] = None, -+++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++-+ -+++++-+ -+++++-+ -+++++- bsz, q_len, _ = hidden_states.shape -+++++- -+++++- query_states = self.q_proj(hidden_states) -+++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++- "with a layer index." -+++++- ) -+++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++-+ if isinstance(past_key_value, StaticCache): -+++++-+ kv_seq_len = key_states.shape[-2] -+++++-+ else: -+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++- -+++++- if past_key_value is not None: -+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++-+ -+++++-+ if isinstance(past_key_value, StaticCache): -+++++-+ kv_seq_len = key_states.shape[-2] -+++++- -+++++- # repeat k/v heads if n_kv_heads < n_heads -+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++-- -+++++-+ -+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++- -+++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++++-- raise ValueError( -+++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++++-- f" {attn_weights.shape}" -+++++-- ) -+++++-- -+++++-- if attention_mask is not None: # no matter the length, we just slice it -+++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++++-+ if attention_mask is not None: -+++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++- attn_weights = attn_weights + causal_mask -+++++- -+++++- # upcast attention to fp32 -+++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++- -+++++- attn_output = self.o_proj(attn_output) -+++++-- -+++++-+ # @lwx -+++++-+ -+++++-+ # max_seq_len = self.max_position_embeddings # 2048 -+++++-+ -+++++-+ # if attention_mask is not None: -+++++-+ # # attention_mask: [B, 1, Sq, Sk] -+++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++-+ -+++++-+ # # pad 到 [max_seq_len, max_seq_len] -+++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++-+ # global_attention_mask = padded_mask -+++++-+ # else: -+++++-+ # global_attention_mask = None -+++++-+ -+++++-+ -+++++-+ # sparse_mode=3 -+++++-+ # attn_output = mindspore.ops.flash_attention_score( -+++++-+ # query=query_states, -+++++-+ # key=key_states, -+++++-+ # value=value_states, -+++++-+ # real_shift=None, -+++++-+ # padding_mask=None, -+++++-+ -+++++-+ # head_num=self.num_heads, -+++++-+ # attn_mask=global_attention_mask, -+++++-+ # keep_prob=1.0 - self.attention_dropout, -+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++-+ # input_layout="BNSD", -+++++-+ # pre_tokens=2147483647, -+++++-+ # next_tokens=2147483647, -+++++-+ # inner_precise=0, -+++++-+ # drop_mask=None, -+++++-+ # prefix=None, -+++++-+ # actual_seq_qlen=None, -+++++-+ # actual_seq_kvlen=None, -+++++-+ # sparse_mode=sparse_mode, -+++++-+ # ) -+++++- if not output_attentions: -+++++- attn_weights = None -+++++- -+++++- return attn_output, attn_weights, past_key_value -+++++- -+++++- -+++++-+class Qwen2MoeFlashAttention(nn.Module): -+++++-+ """ -+++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++-+ -+++++-+ 关键改动: -+++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++-+ 直接传入原始的 key 和 value 张量效率更高。 -+++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++-+ """ -+++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++-+ super().__init__() -+++++-+ self.config = config -+++++-+ self.layer_idx = layer_idx -+++++-+ self.hidden_size = config.hidden_size -+++++-+ self.num_heads = config.num_attention_heads -+++++-+ self.head_dim = self.hidden_size // self.num_heads -+++++-+ self.num_key_value_heads = config.num_key_value_heads -+++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++-+ self.max_position_embeddings = config.max_position_embeddings -+++++-+ self.rope_theta = config.rope_theta -+++++-+ self.attention_dropout = config.attention_dropout -+++++-+ -+++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++-+ raise ValueError( -+++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++-+ ) -+++++-+ -+++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++-+ -+++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++-+ self.head_dim, -+++++-+ max_position_embeddings=self.max_position_embeddings, -+++++-+ base=self.rope_theta, -+++++-+ ) -+++++-+ -+++++-+ def forward( -+++++-+ self, -+++++-+ hidden_states: mindspore.Tensor, -+++++-+ attention_mask: Optional[mindspore.Tensor] = None, -+++++-+ position_ids: Optional[mindspore.Tensor] = None, -+++++-+ past_key_value: Optional[Cache] = None, -+++++-+ output_attentions: bool = False, -+++++-+ use_cache: bool = False, -+++++-+ cache_position: Optional[mindspore.Tensor] = None, -+++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++-+ -+++++-+ bsz, q_len, _ = hidden_states.shape -+++++-+ -+++++-+ # 1. 线性投射 Q, K, V -+++++-+ query_states = self.q_proj(hidden_states) -+++++-+ key_states = self.k_proj(hidden_states) -+++++-+ value_states = self.v_proj(hidden_states) -+++++-+ -+++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++-+ # query: [B, S, H*D] -> [B, N1, S, D] -+++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ -+++++-+ # 3. RoPE 旋转位置编码 -+++++-+ kv_seq_len = key_states.shape[-2] -+++++-+ if past_key_value is not None: -+++++-+ if self.layer_idx is None: -+++++-+ raise ValueError( -+++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++-+ "with a layer index." -+++++-+ ) -+++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++-+ if cache_position.shape[0] == 1: -+++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++-+ kv_seq_len = past_seen_tokens + 1 -+++++-+ else: -+++++-+ # prefill 阶段:cache_position 是范围,使用其长度 -+++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++-+ else: -+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++-+ -+++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++-+ -+++++-+ # 4. KV 缓存更新 -+++++-+ if past_key_value is not None: -+++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++-+ key_states, value_states = past_key_value.update( -+++++-+ key_states, value_states, self.layer_idx, cache_kwargs -+++++-+ ) -+++++-+ -+++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++-+ if cache_position.shape[0] == 1: -+++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++-+ kv_seq_len = key_states.shape[-2] -+++++-+ -+++++-+ # 5. [重要] 准备 Attention Mask -+++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++-+ fa_attention_mask = None -+++++-+ if attention_mask is not None: -+++++-+ # 截取与当前key长度匹配的部分 -+++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++-+ fa_attention_mask = (mask_slice != 0) -+++++-+ -+++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++-+ input_dtype = query_states.dtype -+++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++-+ query_states = query_states.to(mindspore.float16) -+++++-+ key_states = key_states.to(mindspore.float16) -+++++-+ value_states = value_states.to(mindspore.float16) -+++++-+ -+++++-+ # 6. [核心] 调用 flash_attention_score 算子 -+++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++-+ attn_output = mindspore.ops.flash_attention_score( -+++++-+ query=query_states, -+++++-+ key=key_states, -+++++-+ value=value_states, -+++++-+ head_num=self.num_heads, # 传入Q的头数(N1) -+++++-+ attn_mask=fa_attention_mask, -+++++-+ keep_prob=1.0 - self.attention_dropout, -+++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++-+ input_layout="BNSD", -+++++-+ sparse_mode=0 # 使用 defaultMask 模式 -+++++-+ ) -+++++-+ -+++++-+ # 恢复原始数据类型 -+++++-+ attn_output = attn_output.to(input_dtype) -+++++-+ -+++++-+ # 7. 调整输出形状 -+++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++-+ attn_output = self.o_proj(attn_output) -+++++-+ -+++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 -+++++-+ attn_weights = None -+++++-+ if output_attentions: -+++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++-+ -+++++-+ return attn_output, attn_weights, past_key_value -+++++-+ -+++++-+ # def forward( -+++++-+ # self, -+++++-+ # hidden_states: mindspore.Tensor, -+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -+++++-+ # position_ids: Optional[mindspore.Tensor] = None, -+++++-+ # past_key_value: Optional[Cache] = None, -+++++-+ # output_attentions: bool = False, -+++++-+ # use_cache: bool = False, -+++++-+ # cache_position: Optional[mindspore.Tensor] = None, -+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++-+ -+++++-+ # bsz, q_len, _ = hidden_states.shape -+++++-+ -+++++-+ # # 1. 线性投射 Q, K, V -+++++-+ # query_states = self.q_proj(hidden_states) -+++++-+ # key_states = self.k_proj(hidden_states) -+++++-+ # value_states = self.v_proj(hidden_states) -+++++-+ -+++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ -+++++-+ # # 3. RoPE 旋转位置编码 -+++++-+ # kv_seq_len = key_states.shape[-2] -+++++-+ # if past_key_value is not None: -+++++-+ # if self.layer_idx is None: -+++++-+ # raise ValueError( -+++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++-+ # "with a layer index." -+++++-+ # ) -+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++-+ -+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++-+ -+++++-+ # # 4. KV 缓存更新 -+++++-+ # if past_key_value is not None: -+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++-+ # key_states, value_states = past_key_value.update( -+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -+++++-+ # ) -+++++-+ -+++++-+ # # 5. 准备 Attention Mask -+++++-+ # fa_attention_mask = None -+++++-+ # if attention_mask is not None: -+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++-+ # fa_attention_mask = (mask_slice != 0) -+++++-+ -+++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++-+ # input_dtype = query_states.dtype -+++++-+ -+++++-+ # # 6. [核心] 调用 flash_attention_score 算子 -+++++-+ # attn_output = mindspore.ops.flash_attention_score( -+++++-+ # query=query_states, -+++++-+ # key=key_states, -+++++-+ # value=value_states, -+++++-+ # head_num=self.num_heads, -+++++-+ # attn_mask=fa_attention_mask, -+++++-+ # keep_prob=1.0 - self.attention_dropout, -+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++-+ # input_layout="BNSD", -+++++-+ # sparse_mode=0, -+++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- -+++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++-+ # inner_precise=1 -+++++-+ # ) -+++++-+ -+++++-+ # # 恢复原始数据类型 -+++++-+ # attn_output = attn_output.to(input_dtype) -+++++-+ -+++++-+ # # 7. 调整输出形状 -+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++-+ # attn_output = self.o_proj(attn_output) -+++++-+ -+++++-+ # attn_weights = None -+++++-+ # if output_attentions: -+++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++-+ -+++++-+ # return attn_output, attn_weights, past_key_value -+++++-+ -+++++-+ # def forward( -+++++-+ # self, -+++++-+ # hidden_states: mindspore.Tensor, -+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -+++++-+ # position_ids: Optional[mindspore.Tensor] = None, -+++++-+ # past_key_value: Optional[Cache] = None, -+++++-+ # output_attentions: bool = False, -+++++-+ # use_cache: bool = False, -+++++-+ # cache_position: Optional[mindspore.Tensor] = None, -+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++-+ -+++++-+ # bsz, q_len, _ = hidden_states.shape -+++++-+ -+++++-+ # query_states = self.q_proj(hidden_states) -+++++-+ # key_states = self.k_proj(hidden_states) -+++++-+ # value_states = self.v_proj(hidden_states) -+++++-+ -+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-+ -+++++-+ # kv_seq_len = key_states.shape[-2] -+++++-+ # if past_key_value is not None: -+++++-+ # if self.layer_idx is None: -+++++-+ # raise ValueError("`layer_idx` must be specified for caching") -+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++-+ -+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++-+ -+++++-+ # if past_key_value is not None: -+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++-+ # key_states, value_states = past_key_value.update( -+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -+++++-+ # ) -+++++-+ -+++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++-+ -+++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++-+ # query_states = query_states / math.sqrt(self.head_dim) -+++++-+ # # <--- 修改结束 --- -+++++-+ -+++++-+ # fa_attention_mask = None -+++++-+ # if attention_mask is not None: -+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++-+ # fa_attention_mask = (mask_slice != 0) -+++++-+ -+++++-+ # input_dtype = query_states.dtype -+++++-+ -+++++-+ # attn_output = mindspore.ops.flash_attention_score( -+++++-+ # query=query_states, # 传入已经预先缩放过的 query -+++++-+ # key=key_states, -+++++-+ # value=value_states, -+++++-+ # head_num=self.num_heads, -+++++-+ # attn_mask=fa_attention_mask, -+++++-+ # keep_prob=1.0 - self.attention_dropout, -+++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++-+ # input_layout="BNSD", -+++++-+ # sparse_mode=0, -+++++-+ # inner_precise=1 # 仍然保持内部高精度计算 -+++++-+ # ) -+++++-+ -+++++-+ # attn_output = attn_output.to(input_dtype) -+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++-+ # attn_output = self.o_proj(attn_output) -+++++-+ -+++++-+ # attn_weights = None -+++++-+ # if output_attentions: -+++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++-+ -+++++-+ # return attn_output, attn_weights, past_key_value -+++++-+ -+++++- QWEN2MOE_ATTENTION_CLASSES = { -+++++- "eager": Qwen2MoeAttention, -+++++-+ "flash-attention": Qwen2MoeFlashAttention, -+++++- } -+++++- -+++++- -+++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++- -+++++-+ #@dwj -+++++-+ # 只遍历激活的专家,而非全部专家 -+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++-- hidden_states = hidden_states.view(-1, hidden_dim) -+++++-- # router_logits: (batch * sequence_length, n_experts) -+++++-- router_logits = self.gate(hidden_states) -+++++-- -+++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++-- if self.norm_topk_prob: -+++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++-- # we cast back to the input dtype -+++++-- routing_weights = routing_weights.to(hidden_states.dtype) -+++++-- -+++++-- final_hidden_states = ops.zeros( -+++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++++-- ) -+++++-- -+++++-- # One hot encode the selected experts to create an expert mask -+++++-- # this will be used to easily index which expert is going to be sollicitated -+++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++++-- -+++++-- # Loop over all available experts in the model and perform the computation on each expert -+++++-- for expert_idx in range(self.num_experts): -+++++-- expert_layer = self.experts[expert_idx] -+++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++++-- -+++++-- # Index the correct hidden states and compute the expert hidden state for -+++++-- # the current expert. We need to make sure to multiply the output hidden -+++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++++-- if 0 not in idx.shape: -+++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++++-- -+++++-- # However `index_add_` only support torch tensors for indexing so we'll use -+++++-- # the `top_x` tensor here. -+++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++++-- -+++++-- shared_expert_output = self.shared_expert(hidden_states) -+++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++++-- -+++++-- final_hidden_states = final_hidden_states + shared_expert_output -+++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++-+ num_tokens = hidden_states_reshaped.shape[0] -+++++-+ -+++++-+ router_logits = self.gate(hidden_states_reshaped) -+++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++-+ -+++++-+ if self.norm_topk_prob: -+++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++-+ routing_weights = routing_weights.to(hidden_states.dtype) -+++++-+ -+++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++-+ flat_selected_experts = selected_experts.flatten() -+++++-+ -+++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++-+ token_indices = broadcasted_token_indices.flatten() -+++++-+ -+++++-+ active_experts = ops.unique(flat_selected_experts) -+++++-+ -+++++-+ for expert_idx_tensor in active_experts: -+++++-+ expert_idx = expert_idx_tensor.item() -+++++-+ expert_layer = self.experts[expert_idx] -+++++-+ -+++++-+ mask = (flat_selected_experts == expert_idx_tensor) -+++++-+ selected_token_indices = token_indices[mask] -+++++-+ selected_routing_weights = routing_weights.flatten()[mask] -+++++-+ -+++++-+ current_states = hidden_states_reshaped[selected_token_indices] -+++++-+ -+++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++-+ -+++++-+ final_hidden_states = final_hidden_states.index_add( -+++++-+ dim=0, -+++++-+ index=selected_token_indices, -+++++-+ source=expert_output.to(hidden_states.dtype) -+++++-+ ) -+++++-+ -+++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++- -+++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++-- return final_hidden_states, router_logits -+++++-+ final_hidden_states = final_hidden_states + shared_expert_output -+++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++-+ -+++++-+ return final_hidden_states, router_logits -+++++- -+++++- -+++++- class Qwen2MoeDecoderLayer(nn.Module): -+++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++++- -+++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++- -+++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++-+ -+++++- if (layer_idx not in config.mlp_only_layers) and ( -+++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++++- ): -+++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++++- _skip_keys_device_placement = "past_key_values" -+++++- _supports_cache_class = True -+++++-+#lwx -+++++-+ # _supports_static_cache = True -+++++- -+++++- def _init_weights(self, module): -+++++- std = self.config.initializer_range -+++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++++- return causal_mask -+++++- -+++++- -+++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++- _tied_weights_keys = ["lm_head.weight"] -+++++- -+++++- def __init__(self, config): -+++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++- self.num_experts_per_tok = config.num_experts_per_tok -+++++- # Initialize weights and apply final processing -+++++- self.post_init() -+++++-+ # @lwx -+++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++++-+ # self.generation_config.cache_implementation = "static" -+++++-+ self._warmed_up = False -+++++-+ -+++++-+ def warmup_moe_model(self): -+++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++++-+ test_texts = [ -+++++-+ "warmup short", -+++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++++-+ ] -+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++-+ if tokenizer is None: -+++++-+ from mindnlp.transformers import AutoTokenizer -+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++-+ self._warmup_tokenizer = tokenizer -+++++-+ -+++++-+ for text in test_texts: -+++++-+ inputs = tokenizer(text, return_tensors="ms") -+++++-+ with mindspore._no_grad(): -+++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++++- -+++++- def get_input_embeddings(self): -+++++- return self.model.embed_tokens -+++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++- ```""" -+++++-+ if not self._warmed_up: -+++++-+ self._warmed_up = True -+++++-+ self.warmup_moe_model() -+++++- -+++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++++- output_router_logits = ( -+++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++- } -+++++- ) -+++++- return model_inputs -+++++-+# @lwx -+++++-+ # def _decode_one_tokens_logits( -+++++-+ # self, -+++++-+ # cur_token: mindspore.Tensor, -+++++-+ # input_pos: Optional[mindspore.Tensor], -+++++-+ # cache_position: mindspore.Tensor, -+++++-+ # past_key_values: StaticCache, -+++++-+ # ) -> mindspore.Tensor: -+++++-+ # """ -+++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++++-+ -+++++-+ # Args: -+++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++++-+ # input_pos: 输入位置信息,可选 -+++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) -+++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++++-+ -+++++-+ # Returns: -+++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++++-+ # """ -+++++-+ # # 调用JIT编译的版本 -+++++-+ # return self.get_decode_one_tokens_logits( -+++++-+ # cur_token=cur_token, -+++++-+ # input_pos=input_pos, -+++++-+ # cache_position=cache_position, -+++++-+ # past_key_values=past_key_values, -+++++-+ # ) -+++++-+ -+++++-+ # @mindspore.jit(jit_level='O1') -+++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++++-+ # """ -+++++-+ # JIT编译的函数,用于高效的单token解码 -+++++-+ # 使用JIT编译优化以支持静态shape和高效执行 -+++++-+ -+++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++++-+ # """ -+++++-+ # outputs = self.model.forward( -+++++-+ # input_ids=cur_token, -+++++-+ # position_ids=input_pos, -+++++-+ # cache_position=cache_position, -+++++-+ # past_key_values=past_key_values, -+++++-+ # use_cache=True, -+++++-+ # return_dict=False, -+++++-+ # ) -+++++-+ -+++++-+ # hidden_states = outputs[0] -+++++-+ # logits = self.lm_head.forward(hidden_states) -+++++-+ # logits = logits.float() -+++++-+ -+++++-+ # return logits[:, -1, :] -+++++-+ -+++++-+ # def _sample( -+++++-+ # self, -+++++-+ # input_ids: mindspore.Tensor, -+++++-+ # logits_processor, -+++++-+ # stopping_criteria, -+++++-+ # generation_config, -+++++-+ # synced_devices: bool, -+++++-+ # streamer=None, -+++++-+ # logits_warper=None, -+++++-+ # **model_kwargs, -+++++-+ # ): -+++++-+ # """ -+++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++++-+ # """ -+++++-+ # from ...generation.logits_process import LogitsProcessorList -+++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList -+++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++++-+ # from mindnlp.core import nn, ops, no_grad -+++++-+ # import numpy as np -+++++-+ -+++++-+ # # 检查是否使用 StaticCache -+++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++++-+ # # 否则,直接调用父类方法 -+++++-+ # past_key_values = model_kwargs.get("past_key_values") -+++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++++-+ -+++++-+ # if not isinstance(past_key_values, StaticCache): -+++++-+ # # 不使用 StaticCache,直接调用父类方法 -+++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++++-+ # return super()._sample( -+++++-+ # input_ids=input_ids, -+++++-+ # logits_processor=logits_processor, -+++++-+ # stopping_criteria=stopping_criteria, -+++++-+ # generation_config=generation_config, -+++++-+ # synced_devices=synced_devices, -+++++-+ # streamer=streamer, -+++++-+ # logits_warper=logits_warper, -+++++-+ # **model_kwargs, -+++++-+ # ) -+++++-+ -+++++-+ # # 使用 StaticCache,进入自定义循环 -+++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++++-+ # pad_token_id = generation_config._pad_token_tensor -+++++-+ # output_attentions = generation_config.output_attentions -+++++-+ # output_hidden_states = generation_config.output_hidden_states -+++++-+ # output_scores = generation_config.output_scores -+++++-+ # output_logits = generation_config.output_logits -+++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate -+++++-+ # max_length = generation_config.max_length -+++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++++-+ # do_sample = generation_config.do_sample -+++++-+ -+++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++++-+ # raise ValueError( -+++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++++-+ # f"{logits_warper})." -+++++-+ # ) -+++++-+ -+++++-+ # # init attention / hidden states / scores tuples -+++++-+ # scores = () if (return_dict_in_generate and output_scores) else None -+++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++++-+ -+++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++++-+ # encoder_hidden_states = ( -+++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++++-+ # ) -+++++-+ -+++++-+ # # keep track of which sequences are already finished -+++++-+ # batch_size, cur_len = input_ids.shape -+++++-+ # this_peer_finished = False -+++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++++-+ -+++++-+ # time_record = [] -+++++-+ # from ....utils.testing_utils import parse_flag_from_env -+++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++++-+ -+++++-+ # while self._has_unfinished_sequences( -+++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++++-+ # ): -+++++-+ # if _record_time: -+++++-+ # import time as time_module -+++++-+ # infer_start = time_module.time() -+++++-+ -+++++-+ # # prepare model inputs -+++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++++-+ -+++++-+ # # prepare variable output controls -+++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++++-+ -+++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++++-+ # cur_cache_position = model_inputs.get("cache_position") -+++++-+ # cur_past_key_values = model_inputs.get("past_key_values") -+++++-+ # cur_input_ids = model_inputs.get("input_ids") -+++++-+ -+++++-+ # if (isinstance(cur_past_key_values, StaticCache) and -+++++-+ # cur_cache_position is not None and -+++++-+ # len(cur_cache_position.shape) > 0 and -+++++-+ # cur_cache_position.shape[0] == 1 and -+++++-+ # cur_input_ids is not None and -+++++-+ # cur_input_ids.shape[1] == 1): -+++++-+ # # 使用 JIT 优化的单 token 解码 -+++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++++-+ # if not hasattr(self, '_jit_used'): -+++++-+ # self._jit_used = False -+++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++++-+ -+++++-+ # next_token_logits = self.get_decode_one_tokens_logits( -+++++-+ # cur_token=cur_input_ids, -+++++-+ # input_pos=model_inputs.get("position_ids"), -+++++-+ # cache_position=cur_cache_position, -+++++-+ # past_key_values=cur_past_key_values, -+++++-+ # ) -+++++-+ -+++++-+ # # 标记已使用JIT(用于后续判断) -+++++-+ # if not self._jit_used: -+++++-+ # self._jit_used = True -+++++-+ -+++++-+ # # 构造兼容的输出对象 -+++++-+ # class JitOptimizedOutput: -+++++-+ # def __init__(self, logits, config): -+++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++++-+ # self.config = config -+++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 -+++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++++-+ # self.attentions = None if not config.is_encoder_decoder else None -+++++-+ # self.cross_attentions = None -+++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None -+++++-+ -+++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++++-+ # else: -+++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++++-+ # outputs = self(**model_inputs, return_dict=True) -+++++-+ -+++++-+ # if synced_devices and this_peer_finished: -+++++-+ # continue -+++++-+ -+++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++++-+ # next_token_logits = outputs.logits[:, -1, :] -+++++-+ -+++++-+ # # pre-process distribution -+++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++++-+ # if do_sample: -+++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++++-+ -+++++-+ # # Store scores, attentions and hidden_states when required -+++++-+ # if return_dict_in_generate: -+++++-+ # if output_scores: -+++++-+ # scores += (next_token_scores,) -+++++-+ # if output_logits: -+++++-+ # raw_logits += (next_token_logits,) -+++++-+ # if output_attentions: -+++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) -+++++-+ # if self.config.is_encoder_decoder: -+++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++++-+ -+++++-+ # if output_hidden_states: -+++++-+ # hidden = ( -+++++-+ # outputs.decoder_hidden_states -+++++-+ # if self.config.is_encoder_decoder -+++++-+ # else outputs.hidden_states -+++++-+ # ) -+++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++++-+ -+++++-+ # # token selection -+++++-+ # if do_sample: -+++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++++-+ # else: -+++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++++-+ -+++++-+ # # finished sentences should have their next token be a padding token -+++++-+ # if has_eos_stopping_criteria: -+++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++++-+ -+++++-+ # # update generated ids, model inputs, and length for next step -+++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++++-+ # if streamer is not None: -+++++-+ # streamer.put(next_tokens) -+++++-+ -+++++-+ # model_kwargs = self._update_model_kwargs_for_generation( -+++++-+ # outputs, -+++++-+ # model_kwargs, -+++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, -+++++-+ # ) -+++++-+ -+++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++++-+ # cur_len += 1 -+++++-+ -+++++-+ # if _record_time: -+++++-+ # import time as time_module -+++++-+ # infer_stop = time_module.time() -+++++-+ # time_record.append(infer_stop - infer_start) -+++++-+ -+++++-+ # del outputs -+++++-+ -+++++-+ # average_infer_time = None -+++++-+ # if time_record: -+++++-+ # if len(time_record) > 1: -+++++-+ # time_record.pop(0) -+++++-+ # average_infer_time = sum(time_record) / len(time_record) -+++++-+ # print(f'average inference time is: {average_infer_time}') -+++++-+ # print(f'inference time record: {time_record}') -+++++-+ -+++++-+ # if streamer is not None: -+++++-+ # streamer.end() -+++++-+ -+++++-+ # # 简单判断:打印是否使用了JIT路径 -+++++-+ # if hasattr(self, '_jit_used') and self._jit_used: -+++++-+ # print("[JIT] ✓ JIT optimization was used during generation") -+++++-+ # else: -+++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++++-+ -+++++-+ # if return_dict_in_generate: -+++++-+ # if self.config.is_encoder_decoder: -+++++-+ # return GenerateEncoderDecoderOutput( -+++++-+ # sequences=input_ids, -+++++-+ # scores=scores, -+++++-+ # logits=raw_logits, -+++++-+ # encoder_attentions=encoder_attentions, -+++++-+ # encoder_hidden_states=encoder_hidden_states, -+++++-+ # decoder_attentions=decoder_attentions, -+++++-+ # cross_attentions=cross_attentions, -+++++-+ # decoder_hidden_states=decoder_hidden_states, -+++++-+ # past_key_values=model_kwargs.get("past_key_values"), -+++++-+ # average_infer_time=average_infer_time -+++++-+ # ) -+++++-+ # else: -+++++-+ # return GenerateDecoderOnlyOutput( -+++++-+ # sequences=input_ids, -+++++-+ # scores=scores, -+++++-+ # logits=raw_logits, -+++++-+ # attentions=decoder_attentions, -+++++-+ # hidden_states=decoder_hidden_states, -+++++-+ # past_key_values=model_kwargs.get("past_key_values"), -+++++-+ # average_infer_time=average_infer_time -+++++-+ # ) -+++++-+ # else: -+++++-+ # return input_ids -+++++-+ -+++++-+ # def _prepare_cache_for_generation( -+++++-+ # self, -+++++-+ # generation_config, -+++++-+ # model_kwargs, -+++++-+ # assistant_model, -+++++-+ # batch_size, -+++++-+ # max_cache_length, -+++++-+ # ): -+++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++++-+ # generation_config.cache_implementation = "static" -+++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++++-+ -+++++-+ # if generation_config.cache_implementation == "static": -+++++-+ # base_required_from_max_length = generation_config.max_length + 1 -+++++-+ # base_required = max(max_cache_length, base_required_from_max_length) -+++++-+ # min_cache_size = 50 -+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++++-+ # else: -+++++-+ # max_cache_length = max(base_required, min_cache_size) -+++++-+ -+++++-+ # original_max_cache_length = max_cache_length -+++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") -+++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++++-+ # print(f" - final max_cache_length: {max_cache_length}") -+++++-+ -+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++-+ # if max_cache_length > self.config.max_position_embeddings: -+++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++++-+ -+++++-+ # result = super()._prepare_cache_for_generation( -+++++-+ # generation_config=generation_config, -+++++-+ # model_kwargs=model_kwargs, -+++++-+ # assistant_model=assistant_model, -+++++-+ # batch_size=batch_size, -+++++-+ # max_cache_length=max_cache_length, -+++++-+ # ) -+++++-+ -+++++-+ # if generation_config.cache_implementation == "static": -+++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++++-+ # created_cache = model_kwargs.get(cache_name) -+++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++++-+ # if created_cache.max_cache_len < generation_config.max_length: -+++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++++-+ -+++++-+ # return result -+++++-+ -+++++-+ -+++++-+ -+++++- -+++++- -+++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++++--- -+++++-2.27.0 -+++++- -+++++-- -+++++2.27.0 -+++++ -++++-- -++++2.27.0 -++++ -+++-- -+++2.27.0 -+++ -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" deleted file mode 100644 index 5ba94286..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0009-20251109firstcommit.patch" +++ /dev/null @@ -1,9078 +0,0 @@ -From 4f88911daf60910b3b94b56b8a590650454a2dde Mon Sep 17 00:00:00 2001 -From: Pinoeer-kingxi <13022943007@163.com> -Date: Sun, 9 Nov 2025 02:09:15 +0800 -Subject: [PATCH 09/10] 20251109firstcommit - ---- - .../models/deepseek/modeling_deepseek.py | 103 +- - patches/0001-20251104commit.patch | 2 +- - patches/0002-20251106commit.patch | 2 +- - patches/0003-20261106secondcommit.patch | 2 +- - patches/0004-20251106change.patch | 2 +- - patches/0005-20251107001commit.patch | 2 +- - patches/0006-20251107002commit.patch | 2 +- - patches/0007-20251107003commit.patch | 2 +- - patches/0008-moe-change.patch | 8789 +++++++++++++++++ - 9 files changed, 8889 insertions(+), 17 deletions(-) - create mode 100644 patches/0008-moe-change.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 0af29305..8d004af1 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -415,7 +415,9 @@ class DeepseekMoE(nn.Module): - else: - # @lwx - if orig_shape[1] == 1: -- y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+ # lwx moe_infer_decode_fast -+ # y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+ y=self.moe_infer_decode_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) - y=y.view(*orig_shape) - if self.config.n_shared_experts is not None: - y = y + self.shared_experts(identity) -@@ -544,6 +546,7 @@ class DeepseekMoE(nn.Module): - #@dwj - @no_grad() - def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ - selected_experts = [self.experts[i] for i in flat_expert_indices] - expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) - final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -@@ -643,6 +646,43 @@ class DeepseekMoE(nn.Module): - # ) - - # return final_output -+ # def init_expert_cache(self): -+ # """ -+ # 在模型初始化时调用,缓存所有专家的权重到显存。 -+ # """ -+ # self.cache_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) -+ # self.cache_up_w = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) -+ # self.cache_down_w = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) -+ @no_grad() -+ def moe_infer_decode_fast(self, x, flat_expert_indices, flat_expert_weights): -+ top_k = flat_expert_indices.shape[0] -+ hidden_size = x.shape[-1] -+ -+ selected_gate_w = [] -+ selected_up_w = [] -+ selected_down_w = [] -+ -+ for eid in flat_expert_indices.tolist(): -+ if hasattr(self, "cache_gate_w") and eid < self.cache_gate_w.shape[0]: -+ selected_gate_w.append(self.cache_gate_w[eid]) -+ selected_up_w.append(self.cache_up_w[eid]) -+ selected_down_w.append(self.cache_down_w[eid]) -+ else: -+ selected_gate_w.append(self.experts[eid].gate_proj.weight) -+ selected_up_w.append(self.experts[eid].up_proj.weight) -+ selected_down_w.append(self.experts[eid].down_proj.weight) -+ -+ selected_gate_w = ops.stack(selected_gate_w, dim=0) -+ selected_up_w = ops.stack(selected_up_w, dim=0) -+ selected_down_w = ops.stack(selected_down_w, dim=0) -+ -+ x_expanded = x.expand((top_k, 1, hidden_size)) -+ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+ return weighted_sum - - # lwx prefill 20251108 - @no_grad() -@@ -711,7 +751,7 @@ class DeepseekMoE(nn.Module): - sorted_token_indices.view(-1, 1).tile((1, hidden_size)), - expert_outputs * sorted_weights - ) -- -+ return final_output - - # try: - # final_output = ops.moe_token_unpermute( -@@ -730,7 +770,7 @@ class DeepseekMoE(nn.Module): - # expert_outputs * sorted_weights - # ) - -- return final_output -+ # return final_output - - - # @no_grad() -@@ -1827,27 +1867,68 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - - # Initialize weights and apply final processing - self.post_init() -+ # lwx - self.warm_up = False -- -+ #初始 -+ -+ # def warmup_moe_model_deep(self): -+ # print("[Warmup] DeepSeek-MoE 模型预热开始...") -+ # test_texts = [ -+ # "warmup short", -+ # "This is a medium length warmup sentence for MoE experts. middle middle middle", -+ # "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+ # ] -+ # tokenizer = getattr(self, "_warmup_tokenizer", None) -+ # if tokenizer is None: -+ # from mindnlp.transformers import AutoTokenizer -+ # tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+ # self._warmup_tokenizer = tokenizer -+ -+ # for text in test_texts: -+ # inputs = tokenizer(text, return_tensors="ms") -+ # with mindspore._no_grad(): -+ # _ = self(**inputs, use_cache=False) -+ # print("[Warmup] DeepSeek-MoE 模型预热完成。") -+ - def warmup_moe_model_deep(self): - print("[Warmup] DeepSeek-MoE 模型预热开始...") -- test_texts = [ -- "warmup short", -- "This is a medium length warmup sentence for MoE experts. middle middle middle", -- "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+ -+ # 直接用 eval.py 默认的 prompts 内容 -+ warmup_prompts = [ -+ "Hello, how are you?", -+ "This American studied art at Yale and is the author of multiple popular mystery novels. First name is 'Hillary'. What's the last name?", -+ """Summarize the following text: US President Donald Trump has said he is 'not happy' with his Russian counterpart Vladimir Putin, following Moscow's largest aerial attack yet on Ukraine. -+ In a rare rebuke, Trump said: "What the hell happened to him? He's killing a lot of people." He later called Putin "absolutely crazy". -+ Ukrainian President Volodymyr Zelensky earlier said Washington's "silence" over recent Russian attacks was encouraging Putin, urging "strong pressure" - including tougher sanctions - on Moscow. -+ """ - ] -+ - tokenizer = getattr(self, "_warmup_tokenizer", None) - if tokenizer is None: - from mindnlp.transformers import AutoTokenizer - tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) - self._warmup_tokenizer = tokenizer - -- for text in test_texts: -+ # 跑一遍 warmup_prompts,触发路由逻辑 -+ for text in warmup_prompts: - inputs = tokenizer(text, return_tensors="ms") - with mindspore._no_grad(): - _ = self(**inputs, use_cache=False) -+ -+ # 这里可以加按需缓存逻辑,避免显存 OOM -+ from mindnlp.transformers.models.deepseek.modeling_deepseek import DeepseekMoE -+ for module in self.modules(): -+ if isinstance(module, DeepseekMoE): -+ active_ids = getattr(module, "_last_routed_expert_ids", None) -+ if active_ids is not None: -+ module.init_active_expert_cache(active_ids) - print("[Warmup] DeepSeek-MoE 模型预热完成。") - -+ def init_active_expert_cache(self, active_ids): -+ self.cache_gate_w = ops.stack([self.experts[i].gate_proj.weight for i in active_ids], dim=0) -+ self.cache_up_w = ops.stack([self.experts[i].up_proj.weight for i in active_ids], dim=0) -+ self.cache_down_w = ops.stack([self.experts[i].down_proj.weight for i in active_ids], dim=0) -+ - def get_input_embeddings(self): - return self.model.embed_tokens - -@@ -2208,7 +2289,9 @@ if __name__ == "__main__": - config.num_hidden_layers = 2 - config.n_routed_experts = 2 - model = DeepseekForCausalLM(config) -- -+ # for module in model.modules(): -+ # if isinstance(module, DeepseekMoE): -+ # module.init_expert_cache() - print('init model') - input_ids = mindspore.Tensor(np.random.randint(0, 10000, (1, 11)), mindspore.int32) - attention_mask = mindspore.Tensor(np.ones((1,11)), mindspore.int32) -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -index 513dd40b..8de61195 100644 ---- a/patches/0001-20251104commit.patch -+++ b/patches/0001-20251104commit.patch -@@ -1,7 +1,7 @@ - From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/7] 20251104commit -+Subject: [PATCH 1/8] 20251104commit - - --- - mindnlp/transformers/cache_utils.py | 28 +- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -index 41081b85..d7a129ea 100644 ---- a/patches/0002-20251106commit.patch -+++ b/patches/0002-20251106commit.patch -@@ -1,7 +1,7 @@ - From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/7] 20251106commit -+Subject: [PATCH 2/8] 20251106commit - - --- - .../models/deepseek/modeling_deepseek.py | 379 ++++- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -index c1392569..179a9bb5 100644 ---- a/patches/0003-20261106secondcommit.patch -+++ b/patches/0003-20261106secondcommit.patch -@@ -1,7 +1,7 @@ - From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/7] 20261106secondcommit -+Subject: [PATCH 3/8] 20261106secondcommit - - --- - .../models/deepseek/modeling_deepseek.py | 217 ++- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -index e548b1b2..bc5549ca 100644 ---- a/patches/0004-20251106change.patch -+++ b/patches/0004-20251106change.patch -@@ -1,7 +1,7 @@ - From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Thu, 6 Nov 2025 15:48:09 +0800 --Subject: [PATCH 4/7] 20251106change -+Subject: [PATCH 4/8] 20251106change - - --- - .../models/deepseek/modeling_deepseek.py | 189 +- -diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -index bf224d2a..7217a46b 100644 ---- a/patches/0005-20251107001commit.patch -+++ b/patches/0005-20251107001commit.patch -@@ -1,7 +1,7 @@ - From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 11:48:18 +0800 --Subject: [PATCH 5/7] 20251107001commit -+Subject: [PATCH 5/8] 20251107001commit - - --- - .../models/deepseek/modeling_deepseek.py | 91 +- -diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -index 1bd306b9..80906633 100644 ---- a/patches/0006-20251107002commit.patch -+++ b/patches/0006-20251107002commit.patch -@@ -1,7 +1,7 @@ - From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 12:06:32 +0800 --Subject: [PATCH 6/7] 20251107002commit -+Subject: [PATCH 6/8] 20251107002commit - - --- - .../models/deepseek/modeling_deepseek.py | 122 +- -diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch -index ce558554..8a2fc4fe 100644 ---- a/patches/0007-20251107003commit.patch -+++ b/patches/0007-20251107003commit.patch -@@ -1,7 +1,7 @@ - From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 - From: Pinoeer-kingxi <13022943007@163.com> - Date: Fri, 7 Nov 2025 12:12:51 +0800 --Subject: [PATCH 7/7] 20251107003commit -+Subject: [PATCH 7/8] 20251107003commit - - --- - .../models/deepseek/modeling_deepseek.py | 2 +- -diff --git a/patches/0008-moe-change.patch b/patches/0008-moe-change.patch -new file mode 100644 -index 00000000..349f1429 ---- /dev/null -+++ b/patches/0008-moe-change.patch -@@ -0,0 +1,8789 @@ -+From 45ba3bbc411b64cbffd547fa3d66bce9545639dd Mon Sep 17 00:00:00 2001 -+From: Pinoeer-kingxi <13022943007@163.com> -+Date: Sun, 9 Nov 2025 00:50:01 +0800 -+Subject: [PATCH 8/8] moe change -+ -+--- -+ .../models/deepseek/modeling_deepseek.py | 433 +- -+ .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- -+ patches/0001-20251104commit.patch | 2 +- -+ patches/0002-20251106commit.patch | 2 +- -+ patches/0003-20261106secondcommit.patch | 2 +- -+ patches/0004-20251106change.patch | 2 +- -+ patches/0005-20251107001commit.patch | 2 +- -+ patches/0006-20251107002commit.patch | 2 +- -+ patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ -+ 9 files changed, 8510 insertions(+), 55 deletions(-) -+ create mode 100644 patches/0007-20251107003commit.patch -+ -+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+index ff631974..0af29305 100644 -+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+@@ -19,8 +19,10 @@ -+ # limitations under the License. -+ """ MindNLP DeepSeek model.""" -+ import math -++import time -+ import warnings -+ from typing import List, Optional, Tuple, Union -++from mindspore import mint -+ import mindspore -+ from mindnlp.core import nn, ops, no_grad -+ from mindnlp.core.nn import functional as F -+@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) -+ -+ _CONFIG_FOR_DOC = "DeepseekConfig" -+ -++Long_Prompt = 1 -++LONG_PROMPT_LENGTH_THRESHOLD = 128 -++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -++ -+ _attn_mask_cache = {} -+ -+ def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -+@@ -380,6 +386,8 @@ class MoEGate(nn.Module): -+ return topk_idx, topk_weight, aux_loss -+ -+ -++bincount_op = mindspore.ops.Bincount() -++ -+ class DeepseekMoE(nn.Module): -+ """ -+ A mixed expert module containing shared experts. -+@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): -+ y = y + self.shared_experts(identity) -+ return y -+ else: -+- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++ if Long_Prompt == 0: -++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++ else: -++ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+ if self.config.n_shared_experts is not None: -+ y = y + self.shared_experts(identity) -+ return y -+@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): -+ # if self.config.n_shared_experts is not None: -+ # y = y + self.shared_experts(identity) -+ # return y -+- -++ -++ -++ -++ # lwx -++ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): -++ # """ -++ # 如果 expert_ids 为 None,走单专家逻辑; -++ # 如果有,多专家批量处理,保证和原逻辑一致。 -++ # """ -++ # if expert_ids is None: -++ # # 原单专家逻辑 -++ # if self.config.pretraining_tp > 1: -++ # slice = self.intermediate_size // self.config.pretraining_tp -++ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) -++ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) -++ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) -++ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) -++ # for i in range(self.config.pretraining_tp)], dim=-1) -++ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) -++ # for i in range(self.config.pretraining_tp)], dim=-1) -++ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) -++ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) -++ # for i in range(self.config.pretraining_tp)] -++ # down_proj = sum(down_proj) -++ # else: -++ # down_proj = self.down_proj( -++ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) -++ # ) -++ # return down_proj -++ -++ # # ====== 批量多专家路径 ====== -++ # hidden_size = x.shape[-1] -++ -++ # # 按 token expert_ids 选权重 -++ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] -++ # up_weights = self.up_proj.weight[expert_ids] -++ # down_weights = self.down_proj.weight[expert_ids] -++ -++ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 -++ # if self.config.pretraining_tp > 1: -++ # outputs = [] -++ # slice = self.intermediate_size // self.config.pretraining_tp -++ # for i in range(self.config.pretraining_tp): -++ # # 每个 slice 单独计算 -++ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) -++ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) -++ # act_out = self.act_fn(gate_proj_out) * up_proj_out -++ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) -++ # outputs.append(down_proj_out) -++ # return sum(outputs) -++ # else: -++ # gate_proj_out = F.linear(x, gate_weights) -++ # up_proj_out = F.linear(x, up_weights) -++ # act_out = self.act_fn(gate_proj_out) * up_proj_out -++ # return F.linear(act_out, down_weights) -++ # @no_grad() -++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ # num_tokens = x.shape[0] -++ # hidden_size = x.shape[-1] -++ -++ # idxs = flat_expert_indices.argsort() -++ # sorted_expert_indices = flat_expert_indices[idxs] -++ # sorted_token_indices = idxs // self.num_experts_per_tok -++ # sorted_indices = sorted_token_indices -++ -++ # permuted_tokens = x[sorted_token_indices] -++ # sorted_weights = flat_expert_weights[idxs] -++ -++ # # 一次调用多专家 forward -++ # expert_outputs = ops.zeros_like(permuted_tokens) -++ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) -++ -++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -++ # try: -++ # final_output = ops.moe_token_unpermute( -++ # expert_outputs, -++ # sorted_indices, -++ # probs=probs, -++ # padded_mode=False -++ # ) -++ # except Exception: -++ # final_output = ops.zeros_like(x) -++ # final_output = mindspore.mint.scatter_add( -++ # final_output, -++ # 0, -++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -++ # expert_outputs * sorted_weights -++ # ) -++ -++ # return final_output -++ -++ # def mlp_batch_forward(self, tokens, expert_ids): -++ # """ -++ # 使用批量专家 forward(保留精度) -++ # """ -++ # return self.experts[0].forward(tokens, expert_ids) -++ -+ # @no_grad() -+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+ -+@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): -+ # expert_cache += expert_out * weight -+ # return expert_cache -+ -++ #@dwj -+ @no_grad() -+- # dwj -+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+- # x 的 shape: (1, hidden_size) -+- # flat_expert_indices 的 shape: (num_experts_per_tok,) -+- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+- -+- # 1. 收集所有需要的专家层 -+- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+ selected_experts = [self.experts[i] for i in flat_expert_indices] -+- -+- # 2. 并行计算所有专家的输出 -+- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+- # ops.cat 会将它们堆叠成一个新的 Tensor -+- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+- -+- # 3. 使用矩阵乘法进行加权求和 -+- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+- # 最终结果 final_output 的 shape: (1, hidden_size) -+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+- -+ return final_output -+ -+ -+- # @no_grad() -+- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+- # expert_cache = ops.zeros_like(x) -+- # idxs = flat_expert_indices.argsort() -+- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+- # token_idxs = idxs // self.num_experts_per_tok -+- -+- # for i, end_idx in enumerate(tokens_per_expert): -+- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+- # if start_idx == end_idx: -+- # continue -+- # expert = self.experts[i] -+- # exp_token_idx = token_idxs[start_idx:end_idx] -+- # expert_tokens = x[exp_token_idx] -+- # expert_out = expert(expert_tokens) -+- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+- -+- # return expert_cache -+- -+ @no_grad() -+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+ """ -+@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): -+ ) -+ -+ return expert_cache -++ -++ -++ # @no_grad() -++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ # """ -++ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add -++ # """ -++ # num_tokens = x.shape[0] -++ # hidden_size = x.shape[-1] -++ -++ # # 生成排序后的 token 索引 -++ # idxs = flat_expert_indices.argsort() -++ # sorted_expert_indices = flat_expert_indices[idxs] -++ # sorted_token_indices = idxs // self.num_experts_per_tok -++ -++ # # 记录到 sorted_indices(moe_token_unpermute 用) -++ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] -++ -++ # # 收集专家输入 -++ # permuted_tokens = x[sorted_token_indices] -++ -++ # # 执行每个专家的 MLP(批量处理) -++ # expert_outputs = [] -++ # token_ptr = 0 -++ # tokens_per_expert = sorted_expert_indices.bincount() -++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): -++ # if count == 0: -++ # continue -++ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] -++ # out = self.experts[expert_id](cur_tokens) -++ # expert_outputs.append(out) -++ # token_ptr += count -++ -++ # # 拼接所有专家输出 -++ # permuted_outputs = ops.cat(expert_outputs, axis=0) -++ -++ # # 权重缩放(probs 形状为 [num_tokens, top_k]) -++ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) -++ -++ # # 直接调用硬件加速的 unpermute -++ # final_output = ops.moe_token_unpermute( -++ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] -++ # sorted_indices, # shape: [num_tokens * top_k] -++ # probs=probs, # 按概率加权 -++ # padded_mode=False -++ # ) -++ -++ # return final_output -++ -++ # lwx prefill 20251108 -++ @no_grad() -++ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): -++ """ -++ 高性能 + 数值一致的 MoE prefill 推理: -++ 1. 批量化处理所有专家计算,减少 Python 循环开销 -++ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 -++ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 -++ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch -++ -++ 参数: -++ x: [num_tokens, hidden_size], -++ MoE 输入的 token 表示 -++ flat_expert_indices: [num_tokens * top_k], -++ 每个 token 的路由专家 id -++ flat_expert_weights: [num_tokens * top_k, 1], -++ 路由专家权重 -++ """ -++ num_tokens = x.shape[0] -++ hidden_size = x.shape[-1] -++ -++ # 1) 排序专家分配(与原 scatter_add 一致的顺序) -++ idxs = flat_expert_indices.argsort() # 排序索引 -++ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] -++ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID -++ -++ # sorted_indices 必须与 permuted_tokens 顺序匹配 -++ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 -++ -++ # 2) 收集专家输入(按 idxs 排序) -++ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] -++ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 -++ -++ # 3) 计算每个专家的 token 数 -++ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) -++ -++ # 4) 批量专家计算(减少 Python 循环) -++ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) -++ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) -++ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) -++ -++ expert_outputs = ops.zeros_like(permuted_tokens) -++ ptr = 0 -++ for expert_id, count in enumerate(tokens_per_expert.tolist()): -++ if count == 0: -++ continue -++ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] -++ -++ # 与 DeepseekMLP forward 等价 -++ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) -++ up_proj_out = F.linear(tokens, up_weights[expert_id]) -++ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out -++ expert_out = F.linear(act_out, down_weights[expert_id]) -++ -++ expert_outputs[ptr:ptr+count] = expert_out -++ ptr += count -++ -++ # 5) Ascend 加速的 unpermute(已排序的权重) -++ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape -++ -++ final_output = ops.zeros_like(x) -++ final_output = mindspore.mint.scatter_add( -++ final_output, -++ 0, -++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -++ expert_outputs * sorted_weights -++ ) -++ -++ -++ # try: -++ # final_output = ops.moe_token_unpermute( -++ # expert_outputs, # [num_tokens*top_k, hidden_size] -++ # sorted_indices, # [num_tokens*top_k] 原 token id -++ # probs=probs, # 对应权重 -++ # padded_mode=False -++ # ) -++ # except Exception: -++ # # CPU/GPU fallback:用 scatter_add 保证完全一致 -++ # final_output = ops.zeros_like(x) -++ # final_output = mindspore.mint.scatter_add( -++ # final_output, -++ # 0, -++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -++ # expert_outputs * sorted_weights -++ # ) -++ -++ return final_output -++ -++ -++ # @no_grad() -++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ # num_tokens = x.shape[0] -++ # hidden_size = x.shape[-1] -++ -++ # idxs = flat_expert_indices.argsort() -++ # sorted_expert_indices = flat_expert_indices[idxs] -++ # sorted_token_indices = idxs // self.num_experts_per_tok -++ -++ # # sorted_indices = sorted_token_indices -++ # sorted_indices = sorted_token_indices.astype(mindspore.int32) -++ # permuted_tokens = x[sorted_token_indices] -++ # sorted_weights = flat_expert_weights[idxs] -++ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) -++ -++ # expert_outputs = ops.zeros_like(permuted_tokens) -++ # ptr = 0 -++ -++ # # 只按专家维度循环 -++ # for expert_id, count in enumerate(tokens_per_expert.tolist()): -++ # if count == 0: -++ # continue -++ # token_slice = slice(ptr, ptr + count) -++ # expert_tokens = permuted_tokens[token_slice] -++ -++ # # 保持原 forward(含 pretraining_tp、bias 等) -++ # expert_out = self.experts[expert_id](expert_tokens) -++ -++ # expert_outputs[token_slice] = expert_out -++ # ptr += count -++ -++ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -++ # try: -++ # final_output = mindspore.ops.moe_token_unpermute( -++ # expert_outputs, -++ # sorted_indices, -++ # probs=probs, -++ # padded_mode=False -++ # ) -++ # except Exception: -++ # final_output = ops.zeros_like(x) -++ # final_output = mindspore.mint.scatter_add( -++ # final_output, -++ # 0, -++ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -++ # expert_outputs * sorted_weights -++ # ) -++ -++ # return final_output -++ -++ -++ #lwx -++ # @no_grad() -++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++ # """ -++ # 并行化 MoE prefill: -++ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 -++ # - 保证结果与原版完全一致 -++ # """ -++ # # 输出缓存 -++ # expert_cache = ops.zeros_like(x) -++ -++ # # token 总数(批量*seq_len*num_experts_per_tok) -++ # num_tokens = flat_expert_indices.shape[0] -++ # hidden_dim = x.shape[-1] -++ -++ # # 原 token ID(idxs // num_experts_per_tok) -++ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) -++ -++ # # ====== Step 1: 组织输入 ====== -++ # # 按 experts 排序,保证 scatter_add 对应位置一致 -++ # sort_ids = flat_expert_indices.argsort() -++ # sorted_experts = flat_expert_indices[sort_ids] -++ # sorted_tokens = token_ids[sort_ids] -++ # sorted_weights = flat_expert_weights[sort_ids] -++ -++ # # 收集每个专家的输入 -++ # # build: expert_inputs[expert_id] = [tokens...] -++ # expert_inputs = [] -++ # expert_outs = [] -++ -++ # for eid in range(self.config.n_routed_experts): -++ # eid_mask = (sorted_experts == eid) -++ # if eid_mask.any(): -++ # tokens_for_eid = x[sorted_tokens[eid_mask]] -++ # expert_inputs.append(tokens_for_eid) -++ # else: -++ # expert_inputs.append(None) -++ -++ # # ====== Step 2: 并行计算所有专家输出 ====== -++ # # 存储所有专家结果到一个列表 -++ # for eid in range(self.config.n_routed_experts): -++ # if expert_inputs[eid] is not None: -++ # out = self.experts[eid](expert_inputs[eid]) -++ # expert_outs.append(out) -++ # else: -++ # expert_outs.append(None) -++ -++ # # ====== Step 3: scatter_add 回写结果 ====== -++ # # 遍历专家,将结果加回对应的 token -++ # pos = 0 -++ # for eid in range(self.config.n_routed_experts): -++ # if expert_outs[eid] is not None: -++ # size = expert_outs[eid].shape[0] -++ # tokens_idx = sorted_tokens[pos:pos+size] -++ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] -++ # pos += size -++ -++ # # scatter_add 到 expert_cache -++ # expert_cache = mindspore.mint.scatter_add( -++ # expert_cache, -++ # dim=0, -++ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), -++ # src=scaled_out -++ # ) -++ -++ # return expert_cache -++ -++ -++ -+ # 放置在 DeepseekMoE 类中 -+ # @no_grad() -+ # #lwx 20251107 -+@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): -+ self.hidden_size = config.hidden_size -+ -+ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+- # config=config, layer_idx=layer_idx -++ # config=config, layer_idx=layer_idx -+ # ) -+ -+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): -+ ) -+ else DeepseekMLP(config) -+ ) -++ -+ self.input_layernorm = DeepseekRMSNorm( -+ config.hidden_size, eps=config.rms_norm_eps -+ ) -+@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+ def get_decoder(self): -+ return self.model -+ -++ def generate(self, *args, **kwargs): -++ """ -++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++ """ -++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -++ -++ input_ids = kwargs.get("input_ids") -++ if input_ids is None and args: -++ input_ids = args[0] -++ -++ if input_ids is not None: -++ prompt_length = input_ids.shape[1] -++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -++ Long_Prompt = 2 -++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -++ Long_Prompt = 0 -++ else: -++ Long_Prompt = 1 -++ -++ -++ return super().generate(*args, **kwargs) -+ -+ def forward( -+ self, -+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+index 913a7609..6566958b 100644 -+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ -+ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -+ @no_grad() -+- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ original_dtype = hidden_states.dtype -+ batch_size, _ = hidden_states.shape -+ expert_outputs_list = [ -+@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+ return moe_output_fp32.squeeze(1).to(original_dtype) -+ -++ -+ # @no_grad() -+- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ # num_tokens, _ = hidden_states.shape -+ # flat_selected_experts = selected_experts.flatten() -+ # sorted_expert_indices = flat_selected_experts.argsort() -+@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ # current_token_offset += expert_token_count -+ # return moe_output -+ -++ # baseline -+ @no_grad() -+- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ """ -+ 优化版 MoE prefill (速度优先模式): -+ - 批量张量化处理同一个 expert 的所有 token -+@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ return moe_output -+ -+ -++ @no_grad() -++ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++ """ -++ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add -++ 逻辑: -++ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 -++ 2. 每个 expert 一次性处理其全部 token -++ 3. 最后一次 scatter_add 回到原 token 顺序 -++ """ -++ -++ num_tokens = hidden_states.shape[0] -++ hidden_size = hidden_states.shape[-1] -++ -++ # 展平为一维 -++ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] -++ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] -++ -++ # 按 expert 排序 -++ idxs = flat_selected_experts.argsort() -++ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 -++ sorted_token_indices = idxs // self.top_k # 对应原 token ID -++ -++ # 排好序的输入向量(连续内存) -++ permuted_tokens = hidden_states[sorted_token_indices] -++ -++ # 排好序的权重 -++ sorted_weights = flat_routing_weights[idxs] -++ -++ # 每个 expert 对应的 token 数量 -++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -++ -++ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) -++ expert_outputs = ops.zeros_like(permuted_tokens) -++ -++ ptr = 0 # 指向当前切片的起点 -++ for expert_id, count in enumerate(tokens_per_expert.tolist()): -++ if count == 0: -++ continue -++ -++ token_slice = slice(ptr, ptr + count) -++ expert_tokens = permuted_tokens[token_slice] # 连续切片 -++ -++ # 执行专家 MLP -++ expert_out = self.experts[expert_id](expert_tokens) -++ -++ expert_outputs[token_slice] = expert_out -++ ptr += count -++ -++ # 按权重缩放 -++ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) -++ -++ # 回写到原 token 顺序 (单次 scatter_add) -++ moe_output = mindspore.mint.scatter_add( -++ ops.zeros_like(hidden_states), -++ 0, -++ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -++ scaled_outputs -++ ) -++ -++ return moe_output -++ -++ -++ -+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++ -+ @no_grad() -+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+ moe_output = ops.zeros_like(hidden_states) -+@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+ # # --- 速度优先模式 (SPEED MODE) --- -+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ # if sequence_length == 1: -+- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ # else: -+- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ -+ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+ if sequence_length == 1: -+- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+ else: -+- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+- -++ # if Long_Prompt == 1: -++ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ # else: -++ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++ -+ -+ # 3. 共享专家计算与合并 (所有模式通用) -+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+index c9c8c5ee..513dd40b 100644 -+--- a/patches/0001-20251104commit.patch -++++ b/patches/0001-20251104commit.patch -+@@ -1,7 +1,7 @@ -+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+-Subject: [PATCH 1/6] 20251104commit -++Subject: [PATCH 1/7] 20251104commit -+ -+ --- -+ mindnlp/transformers/cache_utils.py | 28 +- -+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+index 625656eb..41081b85 100644 -+--- a/patches/0002-20251106commit.patch -++++ b/patches/0002-20251106commit.patch -+@@ -1,7 +1,7 @@ -+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+-Subject: [PATCH 2/6] 20251106commit -++Subject: [PATCH 2/7] 20251106commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+index dcb85080..c1392569 100644 -+--- a/patches/0003-20261106secondcommit.patch -++++ b/patches/0003-20261106secondcommit.patch -+@@ -1,7 +1,7 @@ -+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+-Subject: [PATCH 3/6] 20261106secondcommit -++Subject: [PATCH 3/7] 20261106secondcommit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 217 ++- -+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+index bbed13cc..e548b1b2 100644 -+--- a/patches/0004-20251106change.patch -++++ b/patches/0004-20251106change.patch -+@@ -1,7 +1,7 @@ -+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Thu, 6 Nov 2025 15:48:09 +0800 -+-Subject: [PATCH 4/6] 20251106change -++Subject: [PATCH 4/7] 20251106change -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 189 +- -+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -+index b2d1035c..bf224d2a 100644 -+--- a/patches/0005-20251107001commit.patch -++++ b/patches/0005-20251107001commit.patch -+@@ -1,7 +1,7 @@ -+ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Fri, 7 Nov 2025 11:48:18 +0800 -+-Subject: [PATCH 5/6] 20251107001commit -++Subject: [PATCH 5/7] 20251107001commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 91 +- -+diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -+index bffa134e..1bd306b9 100644 -+--- a/patches/0006-20251107002commit.patch -++++ b/patches/0006-20251107002commit.patch -+@@ -1,7 +1,7 @@ -+ From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 -+ From: Pinoeer-kingxi <13022943007@163.com> -+ Date: Fri, 7 Nov 2025 12:06:32 +0800 -+-Subject: [PATCH 6/6] 20251107002commit -++Subject: [PATCH 6/7] 20251107002commit -+ -+ --- -+ .../models/deepseek/modeling_deepseek.py | 122 +- -+diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch -+new file mode 100644 -+index 00000000..ce558554 -+--- /dev/null -++++ b/patches/0007-20251107003commit.patch -+@@ -0,0 +1,8034 @@ -++From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 -++From: Pinoeer-kingxi <13022943007@163.com> -++Date: Fri, 7 Nov 2025 12:12:51 +0800 -++Subject: [PATCH 7/7] 20251107003commit -++ -++--- -++ .../models/deepseek/modeling_deepseek.py | 2 +- -++ patches/0001-20251104commit.patch | 2 +- -++ patches/0002-20251106commit.patch | 2 +- -++ patches/0003-20261106secondcommit.patch | 2 +- -++ patches/0004-20251106change.patch | 2 +- -++ patches/0005-20251107001commit.patch | 2 +- -++ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ -++ 7 files changed, 7937 insertions(+), 6 deletions(-) -++ create mode 100644 patches/0006-20251107002commit.patch -++ -++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++index e7e1c053..ff631974 100644 -++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): -++ # return expert_cache -++ -++ @no_grad() -++- dwj -+++ # dwj -++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++ # x 的 shape: (1, hidden_size) -++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++index 2842180e..c9c8c5ee 100644 -++--- a/patches/0001-20251104commit.patch -+++++ b/patches/0001-20251104commit.patch -++@@ -1,7 +1,7 @@ -++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -++-Subject: [PATCH 1/5] 20251104commit -+++Subject: [PATCH 1/6] 20251104commit -++ -++ --- -++ mindnlp/transformers/cache_utils.py | 28 +- -++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++index c6cd8757..625656eb 100644 -++--- a/patches/0002-20251106commit.patch -+++++ b/patches/0002-20251106commit.patch -++@@ -1,7 +1,7 @@ -++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -++-Subject: [PATCH 2/5] 20251106commit -+++Subject: [PATCH 2/6] 20251106commit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++index 601960c9..dcb85080 100644 -++--- a/patches/0003-20261106secondcommit.patch -+++++ b/patches/0003-20261106secondcommit.patch -++@@ -1,7 +1,7 @@ -++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -++-Subject: [PATCH 3/5] 20261106secondcommit -+++Subject: [PATCH 3/6] 20261106secondcommit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -++index 8976f10b..bbed13cc 100644 -++--- a/patches/0004-20251106change.patch -+++++ b/patches/0004-20251106change.patch -++@@ -1,7 +1,7 @@ -++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Thu, 6 Nov 2025 15:48:09 +0800 -++-Subject: [PATCH 4/5] 20251106change -+++Subject: [PATCH 4/6] 20251106change -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 189 +- -++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -++index 8d9032be..b2d1035c 100644 -++--- a/patches/0005-20251107001commit.patch -+++++ b/patches/0005-20251107001commit.patch -++@@ -1,7 +1,7 @@ -++ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -++ From: Pinoeer-kingxi <13022943007@163.com> -++ Date: Fri, 7 Nov 2025 11:48:18 +0800 -++-Subject: [PATCH 5/5] 20251107001commit -+++Subject: [PATCH 5/6] 20251107001commit -++ -++ --- -++ .../models/deepseek/modeling_deepseek.py | 91 +- -++diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -++new file mode 100644 -++index 00000000..bffa134e -++--- /dev/null -+++++ b/patches/0006-20251107002commit.patch -++@@ -0,0 +1,7931 @@ -+++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 -+++From: Pinoeer-kingxi <13022943007@163.com> -+++Date: Fri, 7 Nov 2025 12:06:32 +0800 -+++Subject: [PATCH 6/6] 20251107002commit -+++ -+++--- -+++ .../models/deepseek/modeling_deepseek.py | 122 +- -+++ patches/0001-20251104commit.patch | 2 +- -+++ patches/0002-20251106commit.patch | 2 +- -+++ patches/0003-20261106secondcommit.patch | 2 +- -+++ patches/0004-20251106change.patch | 2 +- -+++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ -+++ 6 files changed, 7773 insertions(+), 64 deletions(-) -+++ create mode 100644 patches/0005-20251107001commit.patch -+++ -+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++index 8831e4b7..e7e1c053 100644 -+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): -+++ # expert_out = expert(x) -+++ # expert_cache += expert_out * weight -+++ # return expert_cache -+++- -+++- # @no_grad() -+++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++- # # x 的 shape: (1, hidden_size) -+++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++- -+++- # # 1. 收集所有需要的专家层 -+++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++- # selected_experts = [self.experts[i] for i in flat_expert_indices] -+++- -+++- # # 2. 并行计算所有专家的输出 -+++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++- # # ops.cat 会将它们堆叠成一个新的 Tensor -+++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++- -+++- # # 3. 使用矩阵乘法进行加权求和 -+++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++- # # 最终结果 final_output 的 shape: (1, hidden_size) -+++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++++ -++++ @no_grad() -++++ dwj -++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ # x 的 shape: (1, hidden_size) -++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++++ -++++ # 1. 收集所有需要的专家层 -++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++++ selected_experts = [self.experts[i] for i in flat_expert_indices] -++++ -++++ # 2. 并行计算所有专家的输出 -++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++++ # ops.cat 会将它们堆叠成一个新的 Tensor -++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++++ -++++ # 3. 使用矩阵乘法进行加权求和 -++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++ # 最终结果 final_output 的 shape: (1, hidden_size) -++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++ -+++- # return final_output -++++ return final_output -+++ -+++ -+++ # @no_grad() -+++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): -+++ -+++ return expert_cache -+++ # 放置在 DeepseekMoE 类中 -+++- @no_grad() -+++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++- """ -+++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+++- -+++- Args: -+++- x (Tensor): 输入张量, shape: (1, hidden_size) -+++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+++- """ -+++- top_k, _ = flat_expert_weights.shape -+++- hidden_size = x.shape[-1] -+++- -+++- # 1. 将所有专家的权重堆叠起来 -+++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -++++ # @no_grad() -++++ # #lwx 20251107 -++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++ # """ -++++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -++++ -++++ # Args: -++++ # x (Tensor): 输入张量, shape: (1, hidden_size) -++++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -++++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -++++ # """ -++++ # top_k, _ = flat_expert_weights.shape -++++ # hidden_size = x.shape[-1] -++++ -++++ # # 1. 将所有专家的权重堆叠起来 -++++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -++++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -++++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+++ -+++- # 2. "收集" 所需的专家权重 -+++- selected_gate_w = stacked_gate_w[flat_expert_indices] -+++- selected_up_w = stacked_up_w[flat_expert_indices] -+++- selected_down_w = stacked_down_w[flat_expert_indices] -++++ # # 2. "收集" 所需的专家权重 -++++ # selected_gate_w = stacked_gate_w[flat_expert_indices] -++++ # selected_up_w = stacked_up_w[flat_expert_indices] -++++ # selected_down_w = stacked_down_w[flat_expert_indices] -+++ -+++- # 3. 准备输入 -+++- x_expanded = x.expand((top_k, 1, hidden_size)) -++++ # # 3. 准备输入 -++++ # x_expanded = x.expand((top_k, 1, hidden_size)) -+++ -+++- # 4. 并行计算 gate_proj 和 up_proj -+++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -++++ # # 4. 并行计算 gate_proj 和 up_proj -++++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -++++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+++ -+++- # 5. 计算中间状态 -+++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out -++++ # # 5. 计算中间状态 -++++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+++ -+++- # 6. 并行计算 down_proj -+++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+++- # --- [FIX] --- -+++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+++- # --- [FIX END] --- -++++ # # 6. 并行计算 down_proj -++++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -++++ # # --- [FIX] --- -++++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -++++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -++++ # # --- [FIX END] --- -+++ -+++- # 7. 根据路由权重进行加权求和 -+++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -++++ # # 7. 根据路由权重进行加权求和 -++++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+++ -+++- return weighted_sum -++++ # return weighted_sum -+++ -+++ -+++ -+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++index 0a0ef2d7..2842180e 100644 -+++--- a/patches/0001-20251104commit.patch -++++++ b/patches/0001-20251104commit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++-Subject: [PATCH 1/4] 20251104commit -++++Subject: [PATCH 1/5] 20251104commit -+++ -+++ --- -+++ mindnlp/transformers/cache_utils.py | 28 +- -+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+++index 5185270c..c6cd8757 100644 -+++--- a/patches/0002-20251106commit.patch -++++++ b/patches/0002-20251106commit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -+++-Subject: [PATCH 2/4] 20251106commit -++++Subject: [PATCH 2/5] 20251106commit -+++ -+++ --- -+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+++index 3e05f821..601960c9 100644 -+++--- a/patches/0003-20261106secondcommit.patch -++++++ b/patches/0003-20261106secondcommit.patch -+++@@ -1,7 +1,7 @@ -+++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -+++-Subject: [PATCH 3/4] 20261106secondcommit -++++Subject: [PATCH 3/5] 20261106secondcommit -+++ -+++ --- -+++ .../models/deepseek/modeling_deepseek.py | 217 ++- -+++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -+++index 88a1aef4..8976f10b 100644 -+++--- a/patches/0004-20251106change.patch -++++++ b/patches/0004-20251106change.patch -+++@@ -1,7 +1,7 @@ -+++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+++ From: Pinoeer-kingxi <13022943007@163.com> -+++ Date: Thu, 6 Nov 2025 15:48:09 +0800 -+++-Subject: [PATCH 4/4] 20251106change -++++Subject: [PATCH 4/5] 20251106change -+++ -+++ --- -+++ .../models/deepseek/modeling_deepseek.py | 189 +- -+++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -+++new file mode 100644 -+++index 00000000..8d9032be -+++--- /dev/null -++++++ b/patches/0005-20251107001commit.patch -+++@@ -0,0 +1,7707 @@ -++++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -++++From: Pinoeer-kingxi <13022943007@163.com> -++++Date: Fri, 7 Nov 2025 11:48:18 +0800 -++++Subject: [PATCH 5/5] 20251107001commit -++++ -++++--- -++++ .../models/deepseek/modeling_deepseek.py | 91 +- -++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- -++++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- -++++ patches/0001-20251104commit.patch | 2 +- -++++ patches/0002-20251106commit.patch | 2 +- -++++ patches/0003-20261106secondcommit.patch | 2 +- -++++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ -++++ 7 files changed, 7577 insertions(+), 30 deletions(-) -++++ create mode 100644 patches/0004-20251106change.patch -++++ -++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++index 0546f318..8831e4b7 100644 -++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): -++++ # expert_cache += expert_out * weight -++++ # return expert_cache -++++ -++++- @no_grad() -++++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++- # x 的 shape: (1, hidden_size) -++++- # flat_expert_indices 的 shape: (num_experts_per_tok,) -++++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -++++- -++++- # 1. 收集所有需要的专家层 -++++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -++++- selected_experts = [self.experts[i] for i in flat_expert_indices] -++++- -++++- # 2. 并行计算所有专家的输出 -++++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -++++- # ops.cat 会将它们堆叠成一个新的 Tensor -++++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -++++- -++++- # 3. 使用矩阵乘法进行加权求和 -++++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -++++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -++++- # 最终结果 final_output 的 shape: (1, hidden_size) -++++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++++ # @no_grad() -+++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ # # x 的 shape: (1, hidden_size) -+++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++++ -+++++ # # 1. 收集所有需要的专家层 -+++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] -+++++ -+++++ # # 2. 并行计算所有专家的输出 -+++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++++ # # ops.cat 会将它们堆叠成一个新的 Tensor -+++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++++ -+++++ # # 3. 使用矩阵乘法进行加权求和 -+++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++ # # 最终结果 final_output 的 shape: (1, hidden_size) -+++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -++++ -++++- return final_output -+++++ # return final_output -++++ -++++ -++++ # @no_grad() -++++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): -++++ ) -++++ -++++ return expert_cache -+++++# 放置在 DeepseekMoE 类中 -+++++ @no_grad() -+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++ """ -+++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -+++++ -+++++ Args: -+++++ x (Tensor): 输入张量, shape: (1, hidden_size) -+++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -+++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -+++++ """ -+++++ top_k, _ = flat_expert_weights.shape -+++++ hidden_size = x.shape[-1] -+++++ -+++++ # 1. 将所有专家的权重堆叠起来 -+++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -+++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -+++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -+++++ -+++++ # 2. "收集" 所需的专家权重 -+++++ selected_gate_w = stacked_gate_w[flat_expert_indices] -+++++ selected_up_w = stacked_up_w[flat_expert_indices] -+++++ selected_down_w = stacked_down_w[flat_expert_indices] -+++++ -+++++ # 3. 准备输入 -+++++ x_expanded = x.expand((top_k, 1, hidden_size)) -+++++ -+++++ # 4. 并行计算 gate_proj 和 up_proj -+++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -+++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -+++++ -+++++ # 5. 计算中间状态 -+++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out -+++++ -+++++ # 6. 并行计算 down_proj -+++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -+++++ # --- [FIX] --- -+++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -+++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -+++++ # --- [FIX END] --- -+++++ -+++++ # 7. 根据路由权重进行加权求和 -+++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -+++++ -+++++ return weighted_sum -+++++ -+++++ -++++ -++++ # @no_grad() -++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++index ebd7782e..913a7609 100644 -++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): -++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++- x1 = x[..., : x.shape[-1] // 2] -++++- x2 = x[..., x.shape[-1] // 2 :] -+++++ # x1 = x[..., : x.shape[-1] // 2] -+++++ # x2 = x[..., x.shape[-1] // 2 :] -++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++++index d059dcbe..2b217b64 100644 -++++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -+++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py -++++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): -++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++ def rotate_half(x): -++++ """Rotates half the hidden dims of the input.""" -++++- x1 = x[..., : x.shape[-1] // 2] -++++- x2 = x[..., x.shape[-1] // 2 :] -+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++ # x1 = x[..., : x.shape[-1] // 2] -+++++ # x2 = x[..., x.shape[-1] // 2 :] -+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++ return ops.cat((-x2, x1), dim=-1) -++++ -++++ -++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++index 78f22642..0a0ef2d7 100644 -++++--- a/patches/0001-20251104commit.patch -+++++++ b/patches/0001-20251104commit.patch -++++@@ -1,7 +1,7 @@ -++++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++ From: Pinoeer-kingxi <13022943007@163.com> -++++ Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++-Subject: [PATCH 1/3] 20251104commit -+++++Subject: [PATCH 1/4] 20251104commit -++++ -++++ --- -++++ mindnlp/transformers/cache_utils.py | 28 +- -++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -++++index 22b65dd5..5185270c 100644 -++++--- a/patches/0002-20251106commit.patch -+++++++ b/patches/0002-20251106commit.patch -++++@@ -1,7 +1,7 @@ -++++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++++ From: Pinoeer-kingxi <13022943007@163.com> -++++ Date: Thu, 6 Nov 2025 09:20:38 +0800 -++++-Subject: [PATCH 2/3] 20251106commit -+++++Subject: [PATCH 2/4] 20251106commit -++++ -++++ --- -++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -++++index 966529e4..3e05f821 100644 -++++--- a/patches/0003-20261106secondcommit.patch -+++++++ b/patches/0003-20261106secondcommit.patch -++++@@ -1,7 +1,7 @@ -++++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++++ From: Pinoeer-kingxi <13022943007@163.com> -++++ Date: Thu, 6 Nov 2025 14:54:37 +0800 -++++-Subject: [PATCH 3/3] 20261106secondcommit -+++++Subject: [PATCH 3/4] 20261106secondcommit -++++ -++++ --- -++++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -++++new file mode 100644 -++++index 00000000..88a1aef4 -++++--- /dev/null -+++++++ b/patches/0004-20251106change.patch -++++@@ -0,0 +1,7498 @@ -+++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -+++++From: Pinoeer-kingxi <13022943007@163.com> -+++++Date: Thu, 6 Nov 2025 15:48:09 +0800 -+++++Subject: [PATCH 4/4] 20251106change -+++++ -+++++--- -+++++ .../models/deepseek/modeling_deepseek.py | 189 +- -+++++ patches/0001-20251104commit.patch | 1272 +++++++ -+++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -+++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -+++++ 4 files changed, 7244 insertions(+), 186 deletions(-) -+++++ create mode 100644 patches/0001-20251104commit.patch -+++++ create mode 100644 patches/0002-20251106commit.patch -+++++ create mode 100644 patches/0003-20261106secondcommit.patch -+++++ -+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++index 2f9192bf..0546f318 100644 -+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -+++++ -+++++ return attn_output, attn_weights, past_key_value -+++++ -+++++-# class DeepseekFlashAttention(nn.Module): -+++++-# """ -+++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+++++- -+++++-# This class is designed as a drop-in replacement for DeepseekAttention. -+++++-# """ -+++++- -+++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++++-# super().__init__() -+++++-# self.config = config -+++++-# self.layer_idx = layer_idx -+++++-# if layer_idx is None: -+++++-# logger.warning( -+++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++-# "when creating this class." -+++++-# ) -+++++- -+++++-# self.attention_dropout = config.attention_dropout -+++++-# self.hidden_size = config.hidden_size -+++++-# self.num_heads = config.num_attention_heads -+++++-# self.head_dim = self.hidden_size // self.num_heads -+++++-# self.num_key_value_heads = config.num_key_value_heads -+++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++-# self.max_position_embeddings = config.max_position_embeddings -+++++-# self.rope_theta = config.rope_theta -+++++-# self.is_causal = True -+++++- -+++++-# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++-# raise ValueError( -+++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++-# f" and `num_heads`: {self.num_heads})." -+++++-# ) -+++++- -+++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++++-# self._init_rope() -+++++- -+++++-# def _init_rope(self): -+++++-# if self.config.rope_scaling is None: -+++++-# self.rotary_emb = DeepseekRotaryEmbedding( -+++++-# self.head_dim, -+++++-# max_position_embeddings=self.max_position_embeddings, -+++++-# base=self.rope_theta, -+++++-# ) -+++++-# else: -+++++-# scaling_type = self.config.rope_scaling["type"] -+++++-# scaling_factor = self.config.rope_scaling["factor"] -+++++-# if scaling_type == "linear": -+++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++++-# self.head_dim, -+++++-# max_position_embeddings=self.max_position_embeddings, -+++++-# scaling_factor=scaling_factor, -+++++-# base=self.rope_theta, -+++++-# ) -+++++-# elif scaling_type == "dynamic": -+++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++++-# self.head_dim, -+++++-# max_position_embeddings=self.max_position_embeddings, -+++++-# scaling_factor=scaling_factor, -+++++-# base=self.rope_theta, -+++++-# ) -+++++-# else: -+++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++++- -+++++-# def forward( -+++++-# self, -+++++-# hidden_states: mindspore.Tensor, -+++++-# attention_mask: Optional[mindspore.Tensor] = None, -+++++-# position_ids: Optional[mindspore.Tensor] = None, -+++++-# past_key_value: Optional[Cache] = None, -+++++-# output_attentions: bool = False, -+++++-# use_cache: bool = False, -+++++-# **kwargs, -+++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++-# if "padding_mask" in kwargs: -+++++-# warnings.warn( -+++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++++-# ) -+++++- -+++++-# if output_attentions: -+++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+++++- -+++++-# bsz, q_len, _ = hidden_states.shape -+++++- -+++++-# if self.config.pretraining_tp > 1: -+++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++++- -+++++-# query_states = self.q_proj(hidden_states) -+++++-# key_states = self.k_proj(hidden_states) -+++++-# value_states = self.v_proj(hidden_states) -+++++- -+++++-# # Reshape for multi-head attention -+++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++- -+++++-# kv_seq_len = key_states.shape[-2] -+++++-# if past_key_value is not None: -+++++-# if self.layer_idx is None: -+++++-# raise ValueError( -+++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++-# "with a layer index." -+++++-# ) -+++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++- -+++++-# # Apply Rotary Positional Embedding -+++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++- -+++++-# if past_key_value is not None: -+++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++- -+++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++- -+++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++- -+++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++- -+++++-# # Convert attention_mask for flash_attention_score -+++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+++++-# if attention_mask is not None: -+++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++++-# raise ValueError( -+++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++++-# ) -+++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+++++-# else: -+++++-# attn_mask_for_fa = None -+++++- -+++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++++- -+++++-# # Call the fused flash_attention_score operator -+++++-# attn_output = mindspore.ops.flash_attention_score( -+++++-# query=query_states_for_fa, -+++++-# key=key_states_for_fa, -+++++-# value=value_states_for_fa, -+++++-# head_num=self.num_heads, # This is N1, the number of query heads -+++++-# input_layout='BSH', -+++++-# attn_mask=attn_mask_for_fa, -+++++-# keep_prob=keep_prob, -+++++-# scalar_value=1.0 / math.sqrt(self.head_dim), -+++++-# sparse_mode=0 # Default mask mode -+++++-# ) -+++++- -+++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+++++-# attn_output = self.o_proj(attn_output) -+++++- -+++++-# # Flash Attention does not return attention weights -+++++-# attn_weights = None -+++++- -+++++-# return attn_output, attn_weights, past_key_value -+++++ -+++++ class DeepseekFlashAttention(nn.Module): -+++++ """ -+++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -+++++ super().__init__() -+++++ self.hidden_size = config.hidden_size -+++++ -+++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -+++++- config=config, layer_idx=layer_idx -+++++- ) -++++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -++++++ # config=config, layer_idx=layer_idx -++++++ # ) -+++++ -+++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+++++ config=config, layer_idx=layer_idx -+++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -+++++ return outputs -+++++ -+++++ -+++++- -+++++ class DeepseekPreTrainedModel(PreTrainedModel): -+++++ config_class = DeepseekConfig -+++++ base_model_prefix = "model" -+++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++ # Initialize weights and apply final processing -+++++ self.post_init() -+++++ self.warm_up = False -+++++- #@dwj -+++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+++++- self.num_layers, -+++++- self.num_attention_heads, -+++++- self.head_dim, -+++++- batch_size=1, -+++++- max_length=self.max_length, -+++++- dtype=mindspore.float16 -+++++- ) -+++++- -+++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+++++- key_cache = [] -+++++- value_cache = [] -+++++- for _ in range(num_layers): -+++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++- key_cache.append(k) -+++++- value_cache.append(v) -+++++- return key_cache, value_cache -+++++- -+++++ -+++++ def warmup_moe_model_deep(self): -+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -+++++new file mode 100644 -+++++index 00000000..78f22642 -+++++--- /dev/null -++++++++ b/patches/0001-20251104commit.patch -+++++@@ -0,0 +1,1272 @@ -++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++++From: Pinoeer-kingxi <13022943007@163.com> -++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++++Subject: [PATCH 1/3] 20251104commit -++++++ -++++++--- -++++++ mindnlp/transformers/cache_utils.py | 28 +- -++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++++ 3 files changed, 976 insertions(+), 87 deletions(-) -++++++ -++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++++index cadd2e04..02f8d4be 100644 -++++++--- a/mindnlp/transformers/cache_utils.py -+++++++++ b/mindnlp/transformers/cache_utils.py -++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++++ # k_out[:, :, cache_position] = key_states -++++++ # v_out[:, :, cache_position] = value_states -++++++- if ON_ORANGE_PI: -++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++- else: -++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++- -+++++++ # if ON_ORANGE_PI: -+++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++++ # else: -+++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++++ # 确保 cache_position 是 1D tensor 并且类型正确 -+++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -+++++++ if cache_position.ndim > 1: -+++++++ cache_position = cache_position.flatten() -+++++++ # 确保类型是 int32 或 int64(MindSpore 要求) -+++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -+++++++ cache_position = cache_position.int() -+++++++ -+++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -+++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -+++++++ k_out[:, :, cache_position] = key_states -+++++++ v_out[:, :, cache_position] = value_states -+++++++ -++++++ return k_out, v_out -++++++ -++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++index c695b944..d8303e45 100644 -++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -++++++ def rotate_half(x): -++++++ """Rotates half the hidden dims of the input.""" -++++++- x1 = x[..., : x.shape[-1] // 2] -++++++- x2 = x[..., x.shape[-1] // 2 :] -+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++++ # x1 = x[..., : x.shape[-1] // 2] -+++++++ # x2 = x[..., x.shape[-1] // 2 :] -+++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++ return ops.cat((-x2, x1), dim=-1) -++++++ -++++++ -++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++++ if self.training: -++++++ raise NotImplementedError("Training is not supported yet.") -++++++ else: -++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++- if self.config.n_shared_experts is not None: -++++++- y = y + self.shared_experts(identity) -++++++- return y -+++++++ # @lwx -+++++++ if orig_shape[1] == 1: -+++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -+++++++ y=y.view(*orig_shape) -+++++++ if self.config.n_shared_experts is not None: -+++++++ y = y + self.shared_experts(identity) -+++++++ return y -+++++++ else: -+++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -+++++++ if self.config.n_shared_experts is not None: -+++++++ y = y + self.shared_experts(identity) -+++++++ return y -+++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++++ # if self.config.n_shared_experts is not None: -+++++++ # y = y + self.shared_experts(identity) -+++++++ # return y -+++++++ -+++++++ @no_grad() -+++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++++ -+++++++ expert_cache = ops.zeros_like(x) -+++++++ for i in range(self.num_experts_per_tok): -+++++++ expert_id = flat_expert_indices[i].item() -+++++++ weight = flat_expert_weights[i].item() -+++++++ expert = self.experts[expert_id] -+++++++ expert_out = expert(x) -+++++++ expert_cache += expert_out * weight -+++++++ return expert_cache -++++++ -++++++ @no_grad() -++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++- # expert_cache = torch.zeros_like(x) -++++++- # idxs = flat_expert_indices.argsort() -++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++- # token_idxs = idxs // self.num_experts_per_tok -++++++- # for i, end_idx in enumerate(tokens_per_expert): -++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++- # if start_idx == end_idx: -++++++- # continue -++++++- # expert = self.experts[i] -++++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++++- # expert_tokens = x[exp_token_idx] -++++++- # expert_out = expert(expert_tokens) -++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++- # return expert_cache -+++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++ expert_cache = ops.zeros_like(x) -++++++ idxs = flat_expert_indices.argsort() -++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++ token_idxs = idxs // self.num_experts_per_tok -+++++++ -++++++ for i, end_idx in enumerate(tokens_per_expert): -++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++ if start_idx == end_idx: -++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++++ expert_out = expert(expert_tokens) -++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++++ -++++++ return expert_cache -+++++++ -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # # expert_cache = torch.zeros_like(x) -+++++++ # # idxs = flat_expert_indices.argsort() -+++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++++ # # if start_idx == end_idx: -+++++++ # # continue -+++++++ # # expert = self.experts[i] -+++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # # expert_tokens = x[exp_token_idx] -+++++++ # # expert_out = expert(expert_tokens) -+++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++++ # # return expert_cache -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # if start_idx == end_idx: -+++++++ # continue -+++++++ # expert = self.experts[i] -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = expert(expert_tokens) -+++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++++ -+++++++ # return expert_cache -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ -+++++++ # # 排序保证顺序一致 -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # # 找出有 token 的专家 -+++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++++ -+++++++ # for i in active_experts.tolist(): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # end_idx = tokens_per_expert[i] -+++++++ # if start_idx == end_idx: # 没有 token -+++++++ # continue -+++++++ -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = self.experts[i](expert_tokens) -+++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++++ -+++++++ # expert_cache = mindspore.mint.scatter_add( -+++++++ # expert_cache, -+++++++ # 0, -+++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++++ # expert_out -+++++++ # ) -+++++++ -+++++++ # return expert_cache -+++++++ -+++++++ -++++++ -++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++++ # """ -++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++ -++++++ # Initialize weights and apply final processing -++++++ self.post_init() -+++++++ self.warm_up = False -+++++++ -+++++++ def warmup_moe_model_deep(self): -+++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -+++++++ test_texts = [ -+++++++ "warmup short", -+++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -+++++++ ] -+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++++ if tokenizer is None: -+++++++ from mindnlp.transformers import AutoTokenizer -+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++++ self._warmup_tokenizer = tokenizer -+++++++ -+++++++ for text in test_texts: -+++++++ inputs = tokenizer(text, return_tensors="ms") -+++++++ with mindspore._no_grad(): -+++++++ _ = self(**inputs, use_cache=False) -+++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++++ -++++++ def get_input_embeddings(self): -++++++ return self.model.embed_tokens -++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++ ```""" -+++++++ if not self.warm_up: -+++++++ self.warm_up = True -+++++++ self.warmup_moe_model_deep() -+++++++ -++++++ output_attentions = ( -++++++ output_attentions -++++++ if output_attentions is not None -++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++index 3cbf820e..d4c6b651 100644 -++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++@@ -18,7 +18,6 @@ -++++++ # See the License for the specific language governing permissions and -++++++ # limitations under the License. -++++++ """MindSpore Qwen2MoE model.""" -++++++- -++++++ import math -++++++ from typing import List, Optional, Tuple, Union -++++++ -++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++++ TokenClassifierOutput, -++++++ ) -++++++ from ...modeling_utils import PreTrainedModel -+++++++from ...generation import GenerationMixin -++++++ from ....utils import logging -++++++ from .configuration_qwen2_moe import Qwen2MoeConfig -++++++ -++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++++ self.variance_epsilon = eps -++++++ -++++++ def forward(self, hidden_states): -+++++++ # @dwj -+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++++ # @lwx -+++++++ # if not self.training : -+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++ input_dtype = hidden_states.dtype -++++++ hidden_states = hidden_states.to(mindspore.float32) -++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++++@@ -234,6 +239,8 @@ def rotate_half(x): -++++++ """Rotates half the hidden dims of the input.""" -++++++ x1 = x[..., : x.shape[-1] // 2] -++++++ x2 = x[..., x.shape[-1] // 2 :] -+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -+++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++ return ops.cat((-x2, x1), dim=-1) -++++++ -++++++ -++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++++ self.config = config -++++++ self.hidden_size = config.hidden_size -++++++ self.intermediate_size = intermediate_size -+++++++ -++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++++ self.act_fn = ACT2FN[config.hidden_act] -++++++ -++++++ def forward(self, x): -++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++- -++++++ -+++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++++ # @lwx -+++++++ # gate_up_output = self.gate_up_proj(x) -+++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -+++++++ # return self.down_proj(swiglu_output) -+++++++ -+++++++ # def forward(self, x): -+++++++ # gate_proj_out = self.gate_proj(x) -+++++++ # up_proj_out = self.up_proj(x) -+++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -+++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -+++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -+++++++ # return self.down_proj(swiglu_out) -+++++++ -++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++++ """ -++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++++ use_cache: bool = False, -++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ -+++++++ -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ query_states = self.q_proj(hidden_states) -++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++ "with a layer index." -++++++ ) -++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ if isinstance(past_key_value, StaticCache): -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ else: -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++ if isinstance(past_key_value, StaticCache): -+++++++ kv_seq_len = key_states.shape[-2] -++++++ -++++++ # repeat k/v heads if n_kv_heads < n_heads -++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++- -+++++++ -++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++ -++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++++- raise ValueError( -++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++++- f" {attn_weights.shape}" -++++++- ) -++++++- -++++++- if attention_mask is not None: # no matter the length, we just slice it -++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -+++++++ if attention_mask is not None: -+++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++ attn_weights = attn_weights + causal_mask -++++++ -++++++ # upcast attention to fp32 -++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++ -++++++ attn_output = self.o_proj(attn_output) -++++++- -+++++++ # @lwx -+++++++ -+++++++ # max_seq_len = self.max_position_embeddings # 2048 -+++++++ -+++++++ # if attention_mask is not None: -+++++++ # # attention_mask: [B, 1, Sq, Sk] -+++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++++ -+++++++ # # pad 到 [max_seq_len, max_seq_len] -+++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++++ # global_attention_mask = padded_mask -+++++++ # else: -+++++++ # global_attention_mask = None -+++++++ -+++++++ -+++++++ # sparse_mode=3 -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # real_shift=None, -+++++++ # padding_mask=None, -+++++++ -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=global_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ # input_layout="BNSD", -+++++++ # pre_tokens=2147483647, -+++++++ # next_tokens=2147483647, -+++++++ # inner_precise=0, -+++++++ # drop_mask=None, -+++++++ # prefix=None, -+++++++ # actual_seq_qlen=None, -+++++++ # actual_seq_kvlen=None, -+++++++ # sparse_mode=sparse_mode, -+++++++ # ) -++++++ if not output_attentions: -++++++ attn_weights = None -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ -+++++++class Qwen2MoeFlashAttention(nn.Module): -+++++++ """ -+++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++++ -+++++++ 关键改动: -+++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++++ 直接传入原始的 key 和 value 张量效率更高。 -+++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++++ """ -+++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++++ super().__init__() -+++++++ self.config = config -+++++++ self.layer_idx = layer_idx -+++++++ self.hidden_size = config.hidden_size -+++++++ self.num_heads = config.num_attention_heads -+++++++ self.head_dim = self.hidden_size // self.num_heads -+++++++ self.num_key_value_heads = config.num_key_value_heads -+++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++ self.max_position_embeddings = config.max_position_embeddings -+++++++ self.rope_theta = config.rope_theta -+++++++ self.attention_dropout = config.attention_dropout -+++++++ -+++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++ raise ValueError( -+++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++++ ) -+++++++ -+++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++++ -+++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++++ self.head_dim, -+++++++ max_position_embeddings=self.max_position_embeddings, -+++++++ base=self.rope_theta, -+++++++ ) -+++++++ -+++++++ def forward( -+++++++ self, -+++++++ hidden_states: mindspore.Tensor, -+++++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++++ position_ids: Optional[mindspore.Tensor] = None, -+++++++ past_key_value: Optional[Cache] = None, -+++++++ output_attentions: bool = False, -+++++++ use_cache: bool = False, -+++++++ cache_position: Optional[mindspore.Tensor] = None, -+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # 1. 线性投射 Q, K, V -+++++++ query_states = self.q_proj(hidden_states) -+++++++ key_states = self.k_proj(hidden_states) -+++++++ value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++ # query: [B, S, H*D] -> [B, N1, S, D] -+++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # 3. RoPE 旋转位置编码 -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ if past_key_value is not None: -+++++++ if self.layer_idx is None: -+++++++ raise ValueError( -+++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++ "with a layer index." -+++++++ ) -+++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++++ if cache_position.shape[0] == 1: -+++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++++ kv_seq_len = past_seen_tokens + 1 -+++++++ else: -+++++++ # prefill 阶段:cache_position 是范围,使用其长度 -+++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++++ else: -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # 4. KV 缓存更新 -+++++++ if past_key_value is not None: -+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ key_states, value_states = past_key_value.update( -+++++++ key_states, value_states, self.layer_idx, cache_kwargs -+++++++ ) -+++++++ -+++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++ if cache_position.shape[0] == 1: -+++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ -+++++++ # 5. [重要] 准备 Attention Mask -+++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++++ fa_attention_mask = None -+++++++ if attention_mask is not None: -+++++++ # 截取与当前key长度匹配的部分 -+++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++++ fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++++ input_dtype = query_states.dtype -+++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++++ query_states = query_states.to(mindspore.float16) -+++++++ key_states = key_states.to(mindspore.float16) -+++++++ value_states = value_states.to(mindspore.float16) -+++++++ -+++++++ # 6. [核心] 调用 flash_attention_score 算子 -+++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++++ attn_output = mindspore.ops.flash_attention_score( -+++++++ query=query_states, -+++++++ key=key_states, -+++++++ value=value_states, -+++++++ head_num=self.num_heads, # 传入Q的头数(N1) -+++++++ attn_mask=fa_attention_mask, -+++++++ keep_prob=1.0 - self.attention_dropout, -+++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ input_layout="BNSD", -+++++++ sparse_mode=0 # 使用 defaultMask 模式 -+++++++ ) -+++++++ -+++++++ # 恢复原始数据类型 -+++++++ attn_output = attn_output.to(input_dtype) -+++++++ -+++++++ # 7. 调整输出形状 -+++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # FlashAttention 算子不直接返回注意力权重矩阵 -+++++++ attn_weights = None -+++++++ if output_attentions: -+++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++ return attn_output, attn_weights, past_key_value -+++++++ -+++++++ # def forward( -+++++++ # self, -+++++++ # hidden_states: mindspore.Tensor, -+++++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++++ # past_key_value: Optional[Cache] = None, -+++++++ # output_attentions: bool = False, -+++++++ # use_cache: bool = False, -+++++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ # bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # # 1. 线性投射 Q, K, V -+++++++ # query_states = self.q_proj(hidden_states) -+++++++ # key_states = self.k_proj(hidden_states) -+++++++ # value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # # 3. RoPE 旋转位置编码 -+++++++ # kv_seq_len = key_states.shape[-2] -+++++++ # if past_key_value is not None: -+++++++ # if self.layer_idx is None: -+++++++ # raise ValueError( -+++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++ # "with a layer index." -+++++++ # ) -+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # # 4. KV 缓存更新 -+++++++ # if past_key_value is not None: -+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ # key_states, value_states = past_key_value.update( -+++++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++++ # ) -+++++++ -+++++++ # # 5. 准备 Attention Mask -+++++++ # fa_attention_mask = None -+++++++ # if attention_mask is not None: -+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++++ # input_dtype = query_states.dtype -+++++++ -+++++++ # # 6. [核心] 调用 flash_attention_score 算子 -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=fa_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ # input_layout="BNSD", -+++++++ # sparse_mode=0, -+++++++ # # <--- 修改点 2: 启用内部高精度计算 --- -+++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++++ # inner_precise=1 -+++++++ # ) -+++++++ -+++++++ # # 恢复原始数据类型 -+++++++ # attn_output = attn_output.to(input_dtype) -+++++++ -+++++++ # # 7. 调整输出形状 -+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ # attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # attn_weights = None -+++++++ # if output_attentions: -+++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++ # return attn_output, attn_weights, past_key_value -+++++++ -+++++++ # def forward( -+++++++ # self, -+++++++ # hidden_states: mindspore.Tensor, -+++++++ # attention_mask: Optional[mindspore.Tensor] = None, -+++++++ # position_ids: Optional[mindspore.Tensor] = None, -+++++++ # past_key_value: Optional[Cache] = None, -+++++++ # output_attentions: bool = False, -+++++++ # use_cache: bool = False, -+++++++ # cache_position: Optional[mindspore.Tensor] = None, -+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ # bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ # query_states = self.q_proj(hidden_states) -+++++++ # key_states = self.k_proj(hidden_states) -+++++++ # value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ # kv_seq_len = key_states.shape[-2] -+++++++ # if past_key_value is not None: -+++++++ # if self.layer_idx is None: -+++++++ # raise ValueError("`layer_idx` must be specified for caching") -+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ # if past_key_value is not None: -+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++ # key_states, value_states = past_key_value.update( -+++++++ # key_states, value_states, self.layer_idx, cache_kwargs -+++++++ # ) -+++++++ -+++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++ -+++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -+++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -+++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -+++++++ # query_states = query_states / math.sqrt(self.head_dim) -+++++++ # # <--- 修改结束 --- -+++++++ -+++++++ # fa_attention_mask = None -+++++++ # if attention_mask is not None: -+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ # fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ # input_dtype = query_states.dtype -+++++++ -+++++++ # attn_output = mindspore.ops.flash_attention_score( -+++++++ # query=query_states, # 传入已经预先缩放过的 query -+++++++ # key=key_states, -+++++++ # value=value_states, -+++++++ # head_num=self.num_heads, -+++++++ # attn_mask=fa_attention_mask, -+++++++ # keep_prob=1.0 - self.attention_dropout, -+++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -+++++++ # input_layout="BNSD", -+++++++ # sparse_mode=0, -+++++++ # inner_precise=1 # 仍然保持内部高精度计算 -+++++++ # ) -+++++++ -+++++++ # attn_output = attn_output.to(input_dtype) -+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ # attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # attn_weights = None -+++++++ # if output_attentions: -+++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++++ -+++++++ # return attn_output, attn_weights, past_key_value -+++++++ -++++++ QWEN2MOE_ATTENTION_CLASSES = { -++++++ "eager": Qwen2MoeAttention, -+++++++ "flash-attention": Qwen2MoeFlashAttention, -++++++ } -++++++ -++++++ -++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -+++++++ #@dwj -+++++++ # 只遍历激活的专家,而非全部专家 -++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- hidden_states = hidden_states.view(-1, hidden_dim) -++++++- # router_logits: (batch * sequence_length, n_experts) -++++++- router_logits = self.gate(hidden_states) -++++++- -++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- if self.norm_topk_prob: -++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- # we cast back to the input dtype -++++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++- final_hidden_states = ops.zeros( -++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++++- ) -++++++- -++++++- # One hot encode the selected experts to create an expert mask -++++++- # this will be used to easily index which expert is going to be sollicitated -++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++++- -++++++- # Loop over all available experts in the model and perform the computation on each expert -++++++- for expert_idx in range(self.num_experts): -++++++- expert_layer = self.experts[expert_idx] -++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++++- -++++++- # Index the correct hidden states and compute the expert hidden state for -++++++- # the current expert. We need to make sure to multiply the output hidden -++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++++- if 0 not in idx.shape: -++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++++- -++++++- # However `index_add_` only support torch tensors for indexing so we'll use -++++++- # the `top_x` tensor here. -++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++++- -++++++- shared_expert_output = self.shared_expert(hidden_states) -++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++++- -++++++- final_hidden_states = final_hidden_states + shared_expert_output -+++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++ num_tokens = hidden_states_reshaped.shape[0] -+++++++ -+++++++ router_logits = self.gate(hidden_states_reshaped) -+++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++ -+++++++ if self.norm_topk_prob: -+++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++++ flat_selected_experts = selected_experts.flatten() -+++++++ -+++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++++ token_indices = broadcasted_token_indices.flatten() -+++++++ -+++++++ active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++ for expert_idx_tensor in active_experts: -+++++++ expert_idx = expert_idx_tensor.item() -+++++++ expert_layer = self.experts[expert_idx] -+++++++ -+++++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++++ selected_token_indices = token_indices[mask] -+++++++ selected_routing_weights = routing_weights.flatten()[mask] -+++++++ -+++++++ current_states = hidden_states_reshaped[selected_token_indices] -+++++++ -+++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++ -+++++++ final_hidden_states = final_hidden_states.index_add( -+++++++ dim=0, -+++++++ index=selected_token_indices, -+++++++ source=expert_output.to(hidden_states.dtype) -+++++++ ) -+++++++ -+++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++++ -++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++- return final_hidden_states, router_logits -+++++++ final_hidden_states = final_hidden_states + shared_expert_output -+++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++ return final_hidden_states, router_logits -++++++ -++++++ -++++++ class Qwen2MoeDecoderLayer(nn.Module): -++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++++ -++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ -+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++++ -++++++ if (layer_idx not in config.mlp_only_layers) and ( -++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++++ ): -++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++++ _skip_keys_device_placement = "past_key_values" -++++++ _supports_cache_class = True -+++++++#lwx -+++++++ # _supports_static_cache = True -++++++ -++++++ def _init_weights(self, module): -++++++ std = self.config.initializer_range -++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++++ return causal_mask -++++++ -++++++ -++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ _tied_weights_keys = ["lm_head.weight"] -++++++ -++++++ def __init__(self, config): -++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ self.num_experts_per_tok = config.num_experts_per_tok -++++++ # Initialize weights and apply final processing -++++++ self.post_init() -+++++++ # @lwx -+++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -+++++++ # self.generation_config.cache_implementation = "static" -+++++++ self._warmed_up = False -+++++++ -+++++++ def warmup_moe_model(self): -+++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -+++++++ test_texts = [ -+++++++ "warmup short", -+++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -+++++++ ] -+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -+++++++ if tokenizer is None: -+++++++ from mindnlp.transformers import AutoTokenizer -+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -+++++++ self._warmup_tokenizer = tokenizer -+++++++ -+++++++ for text in test_texts: -+++++++ inputs = tokenizer(text, return_tensors="ms") -+++++++ with mindspore._no_grad(): -+++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -+++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++++ -++++++ def get_input_embeddings(self): -++++++ return self.model.embed_tokens -++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++ ```""" -+++++++ if not self._warmed_up: -+++++++ self._warmed_up = True -+++++++ self.warmup_moe_model() -++++++ -++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++++ output_router_logits = ( -++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++ } -++++++ ) -++++++ return model_inputs -+++++++# @lwx -+++++++ # def _decode_one_tokens_logits( -+++++++ # self, -+++++++ # cur_token: mindspore.Tensor, -+++++++ # input_pos: Optional[mindspore.Tensor], -+++++++ # cache_position: mindspore.Tensor, -+++++++ # past_key_values: StaticCache, -+++++++ # ) -> mindspore.Tensor: -+++++++ # """ -+++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -+++++++ -+++++++ # Args: -+++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -+++++++ # input_pos: 输入位置信息,可选 -+++++++ # cache_position: 当前token在cache中的位置,shape为(1,) -+++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -+++++++ -+++++++ # Returns: -+++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -+++++++ # """ -+++++++ # # 调用JIT编译的版本 -+++++++ # return self.get_decode_one_tokens_logits( -+++++++ # cur_token=cur_token, -+++++++ # input_pos=input_pos, -+++++++ # cache_position=cache_position, -+++++++ # past_key_values=past_key_values, -+++++++ # ) -+++++++ -+++++++ # @mindspore.jit(jit_level='O1') -+++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -+++++++ # """ -+++++++ # JIT编译的函数,用于高效的单token解码 -+++++++ # 使用JIT编译优化以支持静态shape和高效执行 -+++++++ -+++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -+++++++ # """ -+++++++ # outputs = self.model.forward( -+++++++ # input_ids=cur_token, -+++++++ # position_ids=input_pos, -+++++++ # cache_position=cache_position, -+++++++ # past_key_values=past_key_values, -+++++++ # use_cache=True, -+++++++ # return_dict=False, -+++++++ # ) -+++++++ -+++++++ # hidden_states = outputs[0] -+++++++ # logits = self.lm_head.forward(hidden_states) -+++++++ # logits = logits.float() -+++++++ -+++++++ # return logits[:, -1, :] -+++++++ -+++++++ # def _sample( -+++++++ # self, -+++++++ # input_ids: mindspore.Tensor, -+++++++ # logits_processor, -+++++++ # stopping_criteria, -+++++++ # generation_config, -+++++++ # synced_devices: bool, -+++++++ # streamer=None, -+++++++ # logits_warper=None, -+++++++ # **model_kwargs, -+++++++ # ): -+++++++ # """ -+++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -+++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -+++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -+++++++ # """ -+++++++ # from ...generation.logits_process import LogitsProcessorList -+++++++ # from ...generation.stopping_criteria import StoppingCriteriaList -+++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -+++++++ # from mindnlp.core import nn, ops, no_grad -+++++++ # import numpy as np -+++++++ -+++++++ # # 检查是否使用 StaticCache -+++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -+++++++ # # 否则,直接调用父类方法 -+++++++ # past_key_values = model_kwargs.get("past_key_values") -+++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -+++++++ -+++++++ # if not isinstance(past_key_values, StaticCache): -+++++++ # # 不使用 StaticCache,直接调用父类方法 -+++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -+++++++ # return super()._sample( -+++++++ # input_ids=input_ids, -+++++++ # logits_processor=logits_processor, -+++++++ # stopping_criteria=stopping_criteria, -+++++++ # generation_config=generation_config, -+++++++ # synced_devices=synced_devices, -+++++++ # streamer=streamer, -+++++++ # logits_warper=logits_warper, -+++++++ # **model_kwargs, -+++++++ # ) -+++++++ -+++++++ # # 使用 StaticCache,进入自定义循环 -+++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -+++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -+++++++ # pad_token_id = generation_config._pad_token_tensor -+++++++ # output_attentions = generation_config.output_attentions -+++++++ # output_hidden_states = generation_config.output_hidden_states -+++++++ # output_scores = generation_config.output_scores -+++++++ # output_logits = generation_config.output_logits -+++++++ # return_dict_in_generate = generation_config.return_dict_in_generate -+++++++ # max_length = generation_config.max_length -+++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -+++++++ # do_sample = generation_config.do_sample -+++++++ -+++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -+++++++ # raise ValueError( -+++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -+++++++ # f"{logits_warper})." -+++++++ # ) -+++++++ -+++++++ # # init attention / hidden states / scores tuples -+++++++ # scores = () if (return_dict_in_generate and output_scores) else None -+++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -+++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -+++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -+++++++ -+++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -+++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -+++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -+++++++ # encoder_hidden_states = ( -+++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -+++++++ # ) -+++++++ -+++++++ # # keep track of which sequences are already finished -+++++++ # batch_size, cur_len = input_ids.shape -+++++++ # this_peer_finished = False -+++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -+++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -+++++++ -+++++++ # time_record = [] -+++++++ # from ....utils.testing_utils import parse_flag_from_env -+++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -+++++++ -+++++++ # while self._has_unfinished_sequences( -+++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -+++++++ # ): -+++++++ # if _record_time: -+++++++ # import time as time_module -+++++++ # infer_start = time_module.time() -+++++++ -+++++++ # # prepare model inputs -+++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -+++++++ -+++++++ # # prepare variable output controls -+++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -+++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -+++++++ -+++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -+++++++ # cur_cache_position = model_inputs.get("cache_position") -+++++++ # cur_past_key_values = model_inputs.get("past_key_values") -+++++++ # cur_input_ids = model_inputs.get("input_ids") -+++++++ -+++++++ # if (isinstance(cur_past_key_values, StaticCache) and -+++++++ # cur_cache_position is not None and -+++++++ # len(cur_cache_position.shape) > 0 and -+++++++ # cur_cache_position.shape[0] == 1 and -+++++++ # cur_input_ids is not None and -+++++++ # cur_input_ids.shape[1] == 1): -+++++++ # # 使用 JIT 优化的单 token 解码 -+++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -+++++++ # if not hasattr(self, '_jit_used'): -+++++++ # self._jit_used = False -+++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -+++++++ -+++++++ # next_token_logits = self.get_decode_one_tokens_logits( -+++++++ # cur_token=cur_input_ids, -+++++++ # input_pos=model_inputs.get("position_ids"), -+++++++ # cache_position=cur_cache_position, -+++++++ # past_key_values=cur_past_key_values, -+++++++ # ) -+++++++ -+++++++ # # 标记已使用JIT(用于后续判断) -+++++++ # if not self._jit_used: -+++++++ # self._jit_used = True -+++++++ -+++++++ # # 构造兼容的输出对象 -+++++++ # class JitOptimizedOutput: -+++++++ # def __init__(self, logits, config): -+++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -+++++++ # self.config = config -+++++++ # # 对于 JIT 优化路径,这些属性通常不需要 -+++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -+++++++ # self.attentions = None if not config.is_encoder_decoder else None -+++++++ # self.cross_attentions = None -+++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -+++++++ # self.hidden_states = None if not config.is_encoder_decoder else None -+++++++ -+++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -+++++++ # else: -+++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -+++++++ # outputs = self(**model_inputs, return_dict=True) -+++++++ -+++++++ # if synced_devices and this_peer_finished: -+++++++ # continue -+++++++ -+++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -+++++++ # next_token_logits = outputs.logits[:, -1, :] -+++++++ -+++++++ # # pre-process distribution -+++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -+++++++ # if do_sample: -+++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -+++++++ -+++++++ # # Store scores, attentions and hidden_states when required -+++++++ # if return_dict_in_generate: -+++++++ # if output_scores: -+++++++ # scores += (next_token_scores,) -+++++++ # if output_logits: -+++++++ # raw_logits += (next_token_logits,) -+++++++ # if output_attentions: -+++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -+++++++ # decoder_attentions += (attn,) if attn is not None else (None,) -+++++++ # if self.config.is_encoder_decoder: -+++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -+++++++ -+++++++ # if output_hidden_states: -+++++++ # hidden = ( -+++++++ # outputs.decoder_hidden_states -+++++++ # if self.config.is_encoder_decoder -+++++++ # else outputs.hidden_states -+++++++ # ) -+++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -+++++++ -+++++++ # # token selection -+++++++ # if do_sample: -+++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -+++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -+++++++ # else: -+++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -+++++++ -+++++++ # # finished sentences should have their next token be a padding token -+++++++ # if has_eos_stopping_criteria: -+++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -+++++++ -+++++++ # # update generated ids, model inputs, and length for next step -+++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -+++++++ # if streamer is not None: -+++++++ # streamer.put(next_tokens) -+++++++ -+++++++ # model_kwargs = self._update_model_kwargs_for_generation( -+++++++ # outputs, -+++++++ # model_kwargs, -+++++++ # is_encoder_decoder=self.config.is_encoder_decoder, -+++++++ # ) -+++++++ -+++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -+++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -+++++++ # cur_len += 1 -+++++++ -+++++++ # if _record_time: -+++++++ # import time as time_module -+++++++ # infer_stop = time_module.time() -+++++++ # time_record.append(infer_stop - infer_start) -+++++++ -+++++++ # del outputs -+++++++ -+++++++ # average_infer_time = None -+++++++ # if time_record: -+++++++ # if len(time_record) > 1: -+++++++ # time_record.pop(0) -+++++++ # average_infer_time = sum(time_record) / len(time_record) -+++++++ # print(f'average inference time is: {average_infer_time}') -+++++++ # print(f'inference time record: {time_record}') -+++++++ -+++++++ # if streamer is not None: -+++++++ # streamer.end() -+++++++ -+++++++ # # 简单判断:打印是否使用了JIT路径 -+++++++ # if hasattr(self, '_jit_used') and self._jit_used: -+++++++ # print("[JIT] ✓ JIT optimization was used during generation") -+++++++ # else: -+++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -+++++++ -+++++++ # if return_dict_in_generate: -+++++++ # if self.config.is_encoder_decoder: -+++++++ # return GenerateEncoderDecoderOutput( -+++++++ # sequences=input_ids, -+++++++ # scores=scores, -+++++++ # logits=raw_logits, -+++++++ # encoder_attentions=encoder_attentions, -+++++++ # encoder_hidden_states=encoder_hidden_states, -+++++++ # decoder_attentions=decoder_attentions, -+++++++ # cross_attentions=cross_attentions, -+++++++ # decoder_hidden_states=decoder_hidden_states, -+++++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++++ # average_infer_time=average_infer_time -+++++++ # ) -+++++++ # else: -+++++++ # return GenerateDecoderOnlyOutput( -+++++++ # sequences=input_ids, -+++++++ # scores=scores, -+++++++ # logits=raw_logits, -+++++++ # attentions=decoder_attentions, -+++++++ # hidden_states=decoder_hidden_states, -+++++++ # past_key_values=model_kwargs.get("past_key_values"), -+++++++ # average_infer_time=average_infer_time -+++++++ # ) -+++++++ # else: -+++++++ # return input_ids -+++++++ -+++++++ # def _prepare_cache_for_generation( -+++++++ # self, -+++++++ # generation_config, -+++++++ # model_kwargs, -+++++++ # assistant_model, -+++++++ # batch_size, -+++++++ # max_cache_length, -+++++++ # ): -+++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -+++++++ # generation_config.cache_implementation = "static" -+++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -+++++++ -+++++++ # if generation_config.cache_implementation == "static": -+++++++ # base_required_from_max_length = generation_config.max_length + 1 -+++++++ # base_required = max(max_cache_length, base_required_from_max_length) -+++++++ # min_cache_size = 50 -+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -+++++++ # else: -+++++++ # max_cache_length = max(base_required, min_cache_size) -+++++++ -+++++++ # original_max_cache_length = max_cache_length -+++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -+++++++ # print(f" - input max_cache_length: {original_max_cache_length}") -+++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -+++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -+++++++ # print(f" - final max_cache_length: {max_cache_length}") -+++++++ -+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -+++++++ # if max_cache_length > self.config.max_position_embeddings: -+++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -+++++++ -+++++++ # result = super()._prepare_cache_for_generation( -+++++++ # generation_config=generation_config, -+++++++ # model_kwargs=model_kwargs, -+++++++ # assistant_model=assistant_model, -+++++++ # batch_size=batch_size, -+++++++ # max_cache_length=max_cache_length, -+++++++ # ) -+++++++ -+++++++ # if generation_config.cache_implementation == "static": -+++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -+++++++ # created_cache = model_kwargs.get(cache_name) -+++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -+++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -+++++++ # if created_cache.max_cache_len < generation_config.max_length: -+++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -+++++++ -+++++++ # return result -+++++++ -+++++++ -+++++++ -++++++ -++++++ -++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++++-- -++++++2.27.0 -++++++ -+++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -+++++new file mode 100644 -+++++index 00000000..22b65dd5 -+++++--- /dev/null -++++++++ b/patches/0002-20251106commit.patch -+++++@@ -0,0 +1,3200 @@ -++++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -++++++From: Pinoeer-kingxi <13022943007@163.com> -++++++Date: Thu, 6 Nov 2025 09:20:38 +0800 -++++++Subject: [PATCH 2/3] 20251106commit -++++++ -++++++--- -++++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- -++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -++++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -++++++ 3 files changed, 2689 insertions(+), 305 deletions(-) -++++++ create mode 100644 patches/0001-20251104commit.patch -++++++ -++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++index d8303e45..73773c22 100644 -++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -++++++ # y = y + self.shared_experts(identity) -++++++ # return y -++++++ -+++++++ # @no_grad() -+++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++++ -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ # for i in range(self.num_experts_per_tok): -+++++++ # expert_id = flat_expert_indices[i].item() -+++++++ # weight = flat_expert_weights[i].item() -+++++++ # expert = self.experts[expert_id] -+++++++ # expert_out = expert(x) -+++++++ # expert_cache += expert_out * weight -+++++++ # return expert_cache -+++++++ -++++++ @no_grad() -++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # x 的 shape: (1, hidden_size) -+++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) -+++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) -+++++++ -+++++++ # 1. 收集所有需要的专家层 -+++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -+++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] -+++++++ -+++++++ # 2. 并行计算所有专家的输出 -+++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors -+++++++ # ops.cat 会将它们堆叠成一个新的 Tensor -+++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) -+++++++ -+++++++ # 3. 使用矩阵乘法进行加权求和 -+++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) -+++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -+++++++ # 最终结果 final_output 的 shape: (1, hidden_size) -+++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -+++++++ -+++++++ return final_output -++++++ -++++++- expert_cache = ops.zeros_like(x) -++++++- for i in range(self.num_experts_per_tok): -++++++- expert_id = flat_expert_indices[i].item() -++++++- weight = flat_expert_weights[i].item() -++++++- expert = self.experts[expert_id] -++++++- expert_out = expert(x) -++++++- expert_cache += expert_out * weight -++++++- return expert_cache -++++++ -++++++ @no_grad() -++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++++ # @lwx -+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) -+++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) -+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -+++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -++++++ -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ -+++++++# class DeepseekFlashAttention(nn.Module): -+++++++# """ -+++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using -+++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. -+++++++ -+++++++# This class is designed as a drop-in replacement for DeepseekAttention. -+++++++# """ -+++++++ -+++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++++++# super().__init__() -+++++++# self.config = config -+++++++# self.layer_idx = layer_idx -+++++++# if layer_idx is None: -+++++++# logger.warning( -+++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++++# "when creating this class." -+++++++# ) -+++++++ -+++++++# self.attention_dropout = config.attention_dropout -+++++++# self.hidden_size = config.hidden_size -+++++++# self.num_heads = config.num_attention_heads -+++++++# self.head_dim = self.hidden_size // self.num_heads -+++++++# self.num_key_value_heads = config.num_key_value_heads -+++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++# self.max_position_embeddings = config.max_position_embeddings -+++++++# self.rope_theta = config.rope_theta -+++++++# self.is_causal = True -+++++++ -+++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++# raise ValueError( -+++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++++# f" and `num_heads`: {self.num_heads})." -+++++++# ) -+++++++ -+++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++++++# self._init_rope() -+++++++ -+++++++# def _init_rope(self): -+++++++# if self.config.rope_scaling is None: -+++++++# self.rotary_emb = DeepseekRotaryEmbedding( -+++++++# self.head_dim, -+++++++# max_position_embeddings=self.max_position_embeddings, -+++++++# base=self.rope_theta, -+++++++# ) -+++++++# else: -+++++++# scaling_type = self.config.rope_scaling["type"] -+++++++# scaling_factor = self.config.rope_scaling["factor"] -+++++++# if scaling_type == "linear": -+++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++++++# self.head_dim, -+++++++# max_position_embeddings=self.max_position_embeddings, -+++++++# scaling_factor=scaling_factor, -+++++++# base=self.rope_theta, -+++++++# ) -+++++++# elif scaling_type == "dynamic": -+++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++++++# self.head_dim, -+++++++# max_position_embeddings=self.max_position_embeddings, -+++++++# scaling_factor=scaling_factor, -+++++++# base=self.rope_theta, -+++++++# ) -+++++++# else: -+++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++++++ -+++++++# def forward( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++++# position_ids: Optional[mindspore.Tensor] = None, -+++++++# past_key_value: Optional[Cache] = None, -+++++++# output_attentions: bool = False, -+++++++# use_cache: bool = False, -+++++++# **kwargs, -+++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++# if "padding_mask" in kwargs: -+++++++# warnings.warn( -+++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++++++# ) -+++++++ -+++++++# if output_attentions: -+++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") -+++++++ -+++++++# bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++# if self.config.pretraining_tp > 1: -+++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++++++ -+++++++# query_states = self.q_proj(hidden_states) -+++++++# key_states = self.k_proj(hidden_states) -+++++++# value_states = self.v_proj(hidden_states) -+++++++ -+++++++# # Reshape for multi-head attention -+++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++# kv_seq_len = key_states.shape[-2] -+++++++# if past_key_value is not None: -+++++++# if self.layer_idx is None: -+++++++# raise ValueError( -+++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++# "with a layer index." -+++++++# ) -+++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++# # Apply Rotary Positional Embedding -+++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++# if past_key_value is not None: -+++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models -+++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout -+++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) -+++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ -+++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++++ -+++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) -+++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) -+++++++ -+++++++# # Convert attention_mask for flash_attention_score -+++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. -+++++++# if attention_mask is not None: -+++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) -+++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++++++# raise ValueError( -+++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++++++# ) -+++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True -+++++++# else: -+++++++# attn_mask_for_fa = None -+++++++ -+++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++++++ -+++++++# # Call the fused flash_attention_score operator -+++++++# attn_output = mindspore.ops.flash_attention_score( -+++++++# query=query_states_for_fa, -+++++++# key=key_states_for_fa, -+++++++# value=value_states_for_fa, -+++++++# head_num=self.num_heads, # This is N1, the number of query heads -+++++++# input_layout='BSH', -+++++++# attn_mask=attn_mask_for_fa, -+++++++# keep_prob=keep_prob, -+++++++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++# sparse_mode=0 # Default mask mode -+++++++# ) -+++++++ -+++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed -+++++++# attn_output = self.o_proj(attn_output) -+++++++ -+++++++# # Flash Attention does not return attention weights -+++++++# attn_weights = None -+++++++ -+++++++# return attn_output, attn_weights, past_key_value -+++++++ -+++++++class DeepseekFlashAttention(nn.Module): -+++++++ """ -+++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. -+++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, -+++++++ designed for high performance on supported hardware (Ascend). -+++++++ -+++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. -+++++++ """ -+++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): -+++++++ super().__init__() -+++++++ self.config = config -+++++++ self.layer_idx = layer_idx -+++++++ if layer_idx is None: -+++++++ logger.warning( -+++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++++ "when creating this class." -+++++++ ) -+++++++ -+++++++ # --- [FIX] Correctly initialize all required attributes --- -+++++++ self.attention_dropout = config.attention_dropout -+++++++ self.hidden_size = config.hidden_size -+++++++ self.num_heads = config.num_attention_heads -+++++++ self.head_dim = self.hidden_size // self.num_heads -+++++++ self.num_key_value_heads = config.num_key_value_heads -+++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++ self.max_position_embeddings = config.max_position_embeddings -+++++++ self.rope_theta = config.rope_theta -+++++++ self.is_causal = True -+++++++ -+++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++ raise ValueError( -+++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++++ f" and `num_heads`: {self.num_heads})." -+++++++ ) -+++++++ -+++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) -+++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) -+++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) -+++++++ -+++++++ # This call will now succeed as all attributes are initialized. -+++++++ self._init_rope() -+++++++ -+++++++ def _init_rope(self): -+++++++ if self.config.rope_scaling is None: -+++++++ self.rotary_emb = DeepseekRotaryEmbedding( -+++++++ self.head_dim, -+++++++ max_position_embeddings=self.max_position_embeddings, -+++++++ base=self.rope_theta, -+++++++ ) -+++++++ else: -+++++++ scaling_type = self.config.rope_scaling["type"] -+++++++ scaling_factor = self.config.rope_scaling["factor"] -+++++++ if scaling_type == "linear": -+++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( -+++++++ self.head_dim, -+++++++ max_position_embeddings=self.max_position_embeddings, -+++++++ scaling_factor=scaling_factor, -+++++++ base=self.rope_theta, -+++++++ ) -+++++++ elif scaling_type == "dynamic": -+++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( -+++++++ self.head_dim, -+++++++ max_position_embeddings=self.max_position_embeddings, -+++++++ scaling_factor=scaling_factor, -+++++++ base=self.rope_theta, -+++++++ ) -+++++++ else: -+++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") -+++++++ -+++++++ def forward( -+++++++ self, -+++++++ hidden_states: mindspore.Tensor, -+++++++ attention_mask: Optional[mindspore.Tensor] = None, -+++++++ position_ids: Optional[mindspore.Tensor] = None, -+++++++ past_key_value: Optional[Cache] = None, -+++++++ output_attentions: bool = False, -+++++++ use_cache: bool = False, -+++++++ **kwargs, -+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ if "padding_mask" in kwargs: -+++++++ warnings.warn( -+++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" -+++++++ ) -+++++++ if output_attentions: -+++++++ warnings.warn( -+++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." -+++++++ ) -+++++++ -+++++++ bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ if self.config.pretraining_tp > 1: -+++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") -+++++++ -+++++++ query_states = self.q_proj(hidden_states) -+++++++ key_states = self.k_proj(hidden_states) -+++++++ value_states = self.v_proj(hidden_states) -+++++++ -+++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) -+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++ kv_seq_len = key_states.shape[-2] -+++++++ if past_key_value is not None: -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++ # Apply Rotary Position Embedding -+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ if past_key_value is not None: -+++++++ cache_kwargs = {"sin": sin, "cos": cos} -+++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. -+++++++ # So we must explicitly repeat the KV heads. -+++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++ -+++++++ # Convert attention mask for flash_attention_score -+++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. -+++++++ if attention_mask is not None: -+++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): -+++++++ raise ValueError( -+++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" -+++++++ ) -+++++++ attn_mask_for_fa = attention_mask < 0 -+++++++ else: -+++++++ attn_mask_for_fa = None -+++++++ -+++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 -+++++++ -+++++++ # Call the fused operator using the efficient BNSD layout -+++++++ attn_output = mindspore.ops.flash_attention_score( -+++++++ query=query_states, -+++++++ key=key_states, -+++++++ value=value_states, -+++++++ head_num=self.num_heads, -+++++++ input_layout='BNSD', # Specify the correct layout -+++++++ attn_mask=attn_mask_for_fa, -+++++++ keep_prob=keep_prob, -+++++++ scalar_value=1.0 / math.sqrt(self.head_dim) -+++++++ ) -+++++++ -+++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. -+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ -+++++++ # Apply output projection -+++++++ attn_output = self.o_proj(attn_output) -+++++++ -+++++++ # Flash attention does not return attention weights, so we return None. -+++++++ attn_weights = None -+++++++ -+++++++ return attn_output, attn_weights, past_key_value -+++++++ -++++++ Deepseek_ATTENTION_CLASSES = { -++++++ "eager": DeepseekAttention, -+++++++ "flash-attention": DeepseekFlashAttention, -++++++ } -++++++ -++++++ -++++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -++++++ config=config, layer_idx=layer_idx -++++++ ) -++++++ -+++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -+++++++ config=config, layer_idx=layer_idx -+++++++ ) -+++++++ -++++++ self.mlp = ( -++++++ DeepseekMoE(config) -++++++ if ( -++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++index d4c6b651..bced285c 100644 -++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -++++++ -++++++ import mindspore -++++++ import mindnlp.core.nn.functional as F -++++++-from mindnlp.core import nn, ops -+++++++from mindnlp.core import nn, ops, no_grad -++++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -++++++ -++++++ from ....common.activations import ACT2FN -++++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++++++ -+++++++Long_Prompt = False -+++++++PROMPT_LENGTH_THRESHOLD = 128 -++++++ -++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++++++ def _prepare_4d_causal_attention_mask_with_cache_position( -++++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++ -+++++++# class Qwen2MoeFlashAttention(nn.Module): -+++++++# """ -+++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -+++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -+++++++ -+++++++# 关键改动: -+++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -+++++++# 直接传入原始的 key 和 value 张量效率更高。 -+++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -+++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++++# super().__init__() -+++++++# self.config = config -+++++++# self.layer_idx = layer_idx -+++++++# self.hidden_size = config.hidden_size -+++++++# self.num_heads = config.num_attention_heads -+++++++# self.head_dim = self.hidden_size // self.num_heads -+++++++# self.num_key_value_heads = config.num_key_value_heads -+++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++# self.max_position_embeddings = config.max_position_embeddings -+++++++# self.rope_theta = config.rope_theta -+++++++# self.attention_dropout = config.attention_dropout -+++++++ -+++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++# raise ValueError( -+++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -+++++++# ) -+++++++ -+++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++++ -+++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++++# self.head_dim, -+++++++# max_position_embeddings=self.max_position_embeddings, -+++++++# base=self.rope_theta, -+++++++# ) -+++++++ -+++++++# def forward( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++++# position_ids: Optional[mindspore.Tensor] = None, -+++++++# past_key_value: Optional[Cache] = None, -+++++++# output_attentions: bool = False, -+++++++# use_cache: bool = False, -+++++++# cache_position: Optional[mindspore.Tensor] = None, -+++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++# bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++# # 1. 线性投射 Q, K, V -+++++++# query_states = self.q_proj(hidden_states) -+++++++# key_states = self.k_proj(hidden_states) -+++++++# value_states = self.v_proj(hidden_states) -+++++++ -+++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++# # query: [B, S, H*D] -> [B, N1, S, D] -+++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++# # 3. RoPE 旋转位置编码 -+++++++# kv_seq_len = key_states.shape[-2] -+++++++# if past_key_value is not None: -+++++++# if self.layer_idx is None: -+++++++# raise ValueError( -+++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++# "with a layer index." -+++++++# ) -+++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len -+++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -+++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len -+++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -+++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -+++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -+++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -+++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) -+++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -+++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -+++++++# if cache_position.shape[0] == 1: -+++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -+++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -+++++++# kv_seq_len = past_seen_tokens + 1 -+++++++# else: -+++++++# # prefill 阶段:cache_position 是范围,使用其长度 -+++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens -+++++++# else: -+++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++# # 4. KV 缓存更新 -+++++++# if past_key_value is not None: -+++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++# key_states, value_states = past_key_value.update( -+++++++# key_states, value_states, self.layer_idx, cache_kwargs -+++++++# ) -+++++++ -+++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -+++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -+++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: -+++++++# if cache_position.shape[0] == 1: -+++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -+++++++# kv_seq_len = key_states.shape[-2] -+++++++ -+++++++# # 5. [重要] 准备 Attention Mask -+++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -+++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++++# fa_attention_mask = None -+++++++# if attention_mask is not None: -+++++++# # 截取与当前key长度匹配的部分 -+++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -+++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -+++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False -+++++++# fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -+++++++# input_dtype = query_states.dtype -+++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): -+++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -+++++++# query_states = query_states.to(mindspore.float16) -+++++++# key_states = key_states.to(mindspore.float16) -+++++++# value_states = value_states.to(mindspore.float16) -+++++++ -+++++++# # 6. [核心] 调用 flash_attention_score 算子 -+++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA -+++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++++# attn_output = mindspore.ops.flash_attention_score( -+++++++# query=query_states, -+++++++# key=key_states, -+++++++# value=value_states, -+++++++# head_num=self.num_heads, # 传入Q的头数(N1) -+++++++# attn_mask=fa_attention_mask, -+++++++# keep_prob=1.0 - self.attention_dropout, -+++++++# scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++# input_layout="BNSD", -+++++++# sparse_mode=0 # 使用 defaultMask 模式 -+++++++# ) -+++++++ -+++++++# # 恢复原始数据类型 -+++++++# attn_output = attn_output.to(input_dtype) -+++++++ -+++++++# # 7. 调整输出形状 -+++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++# attn_output = self.o_proj(attn_output) -+++++++ -+++++++# # FlashAttention 算子不直接返回注意力权重矩阵 -+++++++# attn_weights = None -+++++++# if output_attentions: -+++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++# return attn_output, attn_weights, past_key_value -+++++++ -+++++++# # def forward( -+++++++# # self, -+++++++# # hidden_states: mindspore.Tensor, -+++++++# # attention_mask: Optional[mindspore.Tensor] = None, -+++++++# # position_ids: Optional[mindspore.Tensor] = None, -+++++++# # past_key_value: Optional[Cache] = None, -+++++++# # output_attentions: bool = False, -+++++++# # use_cache: bool = False, -+++++++# # cache_position: Optional[mindspore.Tensor] = None, -+++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++# # bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++# # # 1. 线性投射 Q, K, V -+++++++# # query_states = self.q_proj(hidden_states) -+++++++# # key_states = self.k_proj(hidden_states) -+++++++# # value_states = self.v_proj(hidden_states) -+++++++ -+++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -+++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -+++++++# # # 3. RoPE 旋转位置编码 -+++++++# # kv_seq_len = key_states.shape[-2] -+++++++# # if past_key_value is not None: -+++++++# # if self.layer_idx is None: -+++++++# # raise ValueError( -+++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++# # "with a layer index." -+++++++# # ) -+++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -+++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++# # # 4. KV 缓存更新 -+++++++# # if past_key_value is not None: -+++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -+++++++# # key_states, value_states = past_key_value.update( -+++++++# # key_states, value_states, self.layer_idx, cache_kwargs -+++++++# # ) -+++++++ -+++++++# # # 5. 准备 Attention Mask -+++++++# # fa_attention_mask = None -+++++++# # if attention_mask is not None: -+++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++# # fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -+++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -+++++++# # input_dtype = query_states.dtype -+++++++ -+++++++# # # 6. [核心] 调用 flash_attention_score 算子 -+++++++# # attn_output = mindspore.ops.flash_attention_score( -+++++++# # query=query_states, -+++++++# # key=key_states, -+++++++# # value=value_states, -+++++++# # head_num=self.num_heads, -+++++++# # attn_mask=fa_attention_mask, -+++++++# # keep_prob=1.0 - self.attention_dropout, -+++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++# # input_layout="BNSD", -+++++++# # sparse_mode=0, -+++++++# # # <--- 修改点 2: 启用内部高精度计算 --- -+++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -+++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -+++++++# # inner_precise=1 -+++++++# # ) -+++++++ -+++++++# # # 恢复原始数据类型 -+++++++# # attn_output = attn_output.to(input_dtype) -+++++++ -+++++++# # # 7. 调整输出形状 -+++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++# # attn_output = self.o_proj(attn_output) -+++++++ -+++++++# # attn_weights = None -+++++++# # if output_attentions: -+++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ -+++++++# # return attn_output, attn_weights, past_key_value -+++++++ -+++++++ -++++++ class Qwen2MoeFlashAttention(nn.Module): -++++++ """ -++++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++- -++++++- 关键改动: -++++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++- 直接传入原始的 key 和 value 张量效率更高。 -++++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -+++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 -+++++++ -+++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` -+++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, -+++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, -+++++++ 以达到理论上的最高执行速度。 -++++++ """ -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++ super().__init__() -++++++ self.config = config -++++++ self.layer_idx = layer_idx -+++++++ if layer_idx is None: -+++++++ logger.warning_once( -+++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." -+++++++ ) -+++++++ -++++++ self.hidden_size = config.hidden_size -++++++ self.num_heads = config.num_attention_heads -++++++ self.head_dim = self.hidden_size // self.num_heads -++++++ self.num_key_value_heads = config.num_key_value_heads -++++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++ self.max_position_embeddings = config.max_position_embeddings -++++++ self.rope_theta = config.rope_theta -++++++ self.attention_dropout = config.attention_dropout -++++++ -++++++- if (self.head_dim * self.num_heads) != self.hidden_size: -++++++- raise ValueError( -++++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++- ) -++++++- -++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++- # query: [B, S, H*D] -> [B, N1, S, D] -++++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] -+++++++ # 2. 调整形状以匹配 BNSD 布局 -++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- -++++++- # 3. RoPE 旋转位置编码 -+++++++ -+++++++ # 3. RoPE 和 KV 缓存 -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++- if self.layer_idx is None: -++++++- raise ValueError( -++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++- "with a layer index." -++++++- ) -++++++- # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++- if cache_position.shape[0] == 1: -++++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++- kv_seq_len = past_seen_tokens + 1 -++++++- else: -++++++- # prefill 阶段:cache_position 是范围,使用其长度 -++++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++- else: -++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++- -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++- # 4. KV 缓存更新 -++++++ if past_key_value is not None: -++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++- key_states, value_states = past_key_value.update( -++++++- key_states, value_states, self.layer_idx, cache_kwargs -++++++- ) -++++++- -++++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++- if cache_position.shape[0] == 1: -++++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++- kv_seq_len = key_states.shape[-2] -++++++- -++++++- # 5. [重要] 准备 Attention Mask -++++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -+++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++ # 4. 准备 Attention Mask -++++++ fa_attention_mask = None -++++++ if attention_mask is not None: -++++++- # 截取与当前key长度匹配的部分 -++++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++- # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++ fa_attention_mask = (mask_slice != 0) -++++++ -++++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++- input_dtype = query_states.dtype -++++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++- query_states = query_states.to(mindspore.float16) -++++++- key_states = key_states.to(mindspore.float16) -++++++- value_states = value_states.to(mindspore.float16) -++++++- -++++++- # 6. [核心] 调用 flash_attention_score 算子 -++++++- # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -+++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -++++++ attn_output = mindspore.ops.flash_attention_score( -++++++ query=query_states, -++++++ key=key_states, -++++++ value=value_states, -++++++- head_num=self.num_heads, # 传入Q的头数(N1) -+++++++ head_num=self.num_heads, -++++++ attn_mask=fa_attention_mask, -++++++- keep_prob=1.0 - self.attention_dropout, -+++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++ input_layout="BNSD", -++++++- sparse_mode=0 # 使用 defaultMask 模式 -+++++++ sparse_mode=0, -+++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -++++++ ) -++++++ -++++++- # 恢复原始数据类型 -++++++- attn_output = attn_output.to(input_dtype) -++++++- -++++++- # 7. 调整输出形状 -++++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -+++++++ # 6. 调整输出形状 -++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++ attn_output = self.o_proj(attn_output) -++++++ -++++++- # FlashAttention 算子不直接返回注意力权重矩阵 -+++++++ # 7. 返回结果 -++++++ attn_weights = None -++++++ if output_attentions: -++++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++- # def forward( -++++++- # self, -++++++- # hidden_states: mindspore.Tensor, -++++++- # attention_mask: Optional[mindspore.Tensor] = None, -++++++- # position_ids: Optional[mindspore.Tensor] = None, -++++++- # past_key_value: Optional[Cache] = None, -++++++- # output_attentions: bool = False, -++++++- # use_cache: bool = False, -++++++- # cache_position: Optional[mindspore.Tensor] = None, -++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++- -++++++- # bsz, q_len, _ = hidden_states.shape -++++++- -++++++- # # 1. 线性投射 Q, K, V -++++++- # query_states = self.q_proj(hidden_states) -++++++- # key_states = self.k_proj(hidden_states) -++++++- # value_states = self.v_proj(hidden_states) -++++++- -++++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- -++++++- # # 3. RoPE 旋转位置编码 -++++++- # kv_seq_len = key_states.shape[-2] -++++++- # if past_key_value is not None: -++++++- # if self.layer_idx is None: -++++++- # raise ValueError( -++++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++- # "with a layer index." -++++++- # ) -++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++ -++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++- -++++++- # # 4. KV 缓存更新 -++++++- # if past_key_value is not None: -++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++- # key_states, value_states = past_key_value.update( -++++++- # key_states, value_states, self.layer_idx, cache_kwargs -++++++- # ) -++++++- -++++++- # # 5. 准备 Attention Mask -++++++- # fa_attention_mask = None -++++++- # if attention_mask is not None: -++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++- # fa_attention_mask = (mask_slice != 0) -++++++- -++++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++- # input_dtype = query_states.dtype -++++++- -++++++- # # 6. [核心] 调用 flash_attention_score 算子 -++++++- # attn_output = mindspore.ops.flash_attention_score( -++++++- # query=query_states, -++++++- # key=key_states, -++++++- # value=value_states, -++++++- # head_num=self.num_heads, -++++++- # attn_mask=fa_attention_mask, -++++++- # keep_prob=1.0 - self.attention_dropout, -++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++- # input_layout="BNSD", -++++++- # sparse_mode=0, -++++++- # # <--- 修改点 2: 启用内部高精度计算 --- -++++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++- # inner_precise=1 -++++++- # ) -++++++- -++++++- # # 恢复原始数据类型 -++++++- # attn_output = attn_output.to(input_dtype) -+++++++QWEN2MOE_ATTENTION_CLASSES = { -+++++++ "eager": Qwen2MoeAttention, -+++++++ "flash-attention": Qwen2MoeFlashAttention, -+++++++} -++++++ -++++++- # # 7. 调整输出形状 -++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++- # attn_output = self.o_proj(attn_output) -++++++ -++++++- # attn_weights = None -++++++- # if output_attentions: -++++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# def __init__(self, config): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# # gating -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++ -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# #@dwj -+++++++# # 只遍历激活的专家,而非全部专家 -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# num_tokens = hidden_states_reshaped.shape[0] -+++++++ -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++ -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++ -+++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -+++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -+++++++# token_indices = broadcasted_token_indices.flatten() -+++++++ -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++ -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# selected_token_indices = token_indices[mask] -+++++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++++ -+++++++# current_states = hidden_states_reshaped[selected_token_indices] -+++++++ -+++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++ -+++++++# final_hidden_states = final_hidden_states.index_add( -+++++++# dim=0, -+++++++# index=selected_token_indices, -+++++++# source=expert_output.to(hidden_states.dtype) -+++++++# ) -+++++++ -+++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++++ -++++++- # return attn_output, attn_weights, past_key_value -+++++++# final_hidden_states = final_hidden_states + shared_expert_output -+++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -+++++++ -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# """ -+++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -+++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -+++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# # 门控网络 -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# # 专家列表 -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++# # 共享专家 -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_decode( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# """ -+++++++# 【解码路径】针对 sequence_length=1 的极致优化。 -+++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -+++++++# """ -+++++++# batch_size, hidden_dim = hidden_states.shape -+++++++ -+++++++# expert_outputs_list = [ -+++++++# ops.cat([ -+++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++++# ], dim=0) -+++++++# for i in range(batch_size) -+++++++# ] -+++++++ -+++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- -+++++++# # shape: (batch_size, top_k, hidden_dim) -+++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++++ -+++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -+++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++++ -+++++++# return moe_output.squeeze(1) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_prefill( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# """ -+++++++# 【预填充路径】针对 sequence_length > 1 的优化。 -+++++++# 按专家对 Token 进行分组,并进行批处理。 -+++++++# """ -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens = hidden_states.shape[0] -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++ -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++ -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++ -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# selected_token_indices = token_indices[mask] -+++++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++++ -+++++++# current_states = hidden_states[selected_token_indices] -+++++++ -+++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++ -+++++++# moe_output = moe_output.index_add( -+++++++# dim=0, -+++++++# index=selected_token_indices, -+++++++# source=expert_output.to(hidden_states.dtype) -+++++++# ) -+++++++# return moe_output -+++++++ -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# """ -+++++++# 顶层 forward 方法,作为智能分发器。 -+++++++# """ -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++- # def forward( -++++++- # self, -++++++- # hidden_states: mindspore.Tensor, -++++++- # attention_mask: Optional[mindspore.Tensor] = None, -++++++- # position_ids: Optional[mindspore.Tensor] = None, -++++++- # past_key_value: Optional[Cache] = None, -++++++- # output_attentions: bool = False, -++++++- # use_cache: bool = False, -++++++- # cache_position: Optional[mindspore.Tensor] = None, -++++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++- -++++++- # bsz, q_len, _ = hidden_states.shape -++++++- -++++++- # query_states = self.q_proj(hidden_states) -++++++- # key_states = self.k_proj(hidden_states) -++++++- # value_states = self.v_proj(hidden_states) -++++++- -++++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++- -++++++- # kv_seq_len = key_states.shape[-2] -++++++- # if past_key_value is not None: -++++++- # if self.layer_idx is None: -++++++- # raise ValueError("`layer_idx` must be specified for caching") -++++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++- -++++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++- -++++++- # if past_key_value is not None: -++++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++- # key_states, value_states = past_key_value.update( -++++++- # key_states, value_states, self.layer_idx, cache_kwargs -++++++- # ) -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++# moe_output = None -+++++++# # 在推理时,根据序列长度选择最优路径 -+++++++# if not self.training: -+++++++# if sequence_length == 1: -+++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++++# else: -+++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++++# else: -+++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -+++++++# raise NotImplementedError("Training path is not implemented.") -+++++++ -+++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) -+++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -+++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -+++++++ -+++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -+++++++ -+++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -+++++++ -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# """ -+++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# # 门控网络 -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# # 专家列表 -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++# # 共享专家 -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_decode( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# batch_size, _ = hidden_states.shape -+++++++# expert_outputs_list = [ -+++++++# ops.cat([ -+++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++++# ], dim=0) -+++++++# for i in range(batch_size) -+++++++# ] -+++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++++# return moe_output.squeeze(1) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_prefill( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens = hidden_states.shape[0] -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# selected_token_indices = token_indices[mask] -+++++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++++# current_states = hidden_states[selected_token_indices] -+++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++# moe_output = moe_output.index_add( -+++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++++# ) -+++++++# return moe_output -+++++++ -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# """ -+++++++# 顶层 forward 方法,作为智能分发器。 -+++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -+++++++# """ -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ -+++++++# # 1. 门控计算 (通用逻辑) -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++ -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++# # 2. 智能分发到最优 MoE 路径 -+++++++# moe_output = None -+++++++# if not self.training: -+++++++# if sequence_length == 1: -+++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -+++++++# else: -+++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -+++++++# else: -+++++++# raise NotImplementedError("Training path is not implemented.") -+++++++ -+++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -+++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -+++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++++ -+++++++# # 4. 合并 MoE 输出和共享专家输出 -+++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -+++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++++ -+++++++# # 5. 恢复原始形状并返回 -+++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -+++++++# prefill fastest -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# """ -+++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -+++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# # 门控网络 -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# # 专家列表 -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++# # 共享专家 -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_dispatch( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# """ -+++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -+++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -+++++++# """ -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens, _ = hidden_states.shape -+++++++ -+++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++# flat_routing_weights = routing_weights.flatten() -++++++ -++++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++- -++++++- # # <--- 核心修改点: 手动进行高精度缩放 --- -++++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++++- # query_states = query_states / math.sqrt(self.head_dim) -++++++- # # <--- 修改结束 --- -++++++- -++++++- # fa_attention_mask = None -++++++- # if attention_mask is not None: -++++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++- # fa_attention_mask = (mask_slice != 0) -++++++- -++++++- # input_dtype = query_states.dtype -++++++- -++++++- # attn_output = mindspore.ops.flash_attention_score( -++++++- # query=query_states, # 传入已经预先缩放过的 query -++++++- # key=key_states, -++++++- # value=value_states, -++++++- # head_num=self.num_heads, -++++++- # attn_mask=fa_attention_mask, -++++++- # keep_prob=1.0 - self.attention_dropout, -++++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++++- # input_layout="BNSD", -++++++- # sparse_mode=0, -++++++- # inner_precise=1 # 仍然保持内部高精度计算 -++++++- # ) -+++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++ -++++++- # attn_output = attn_output.to(input_dtype) -++++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++- # attn_output = self.o_proj(attn_output) -+++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++ -+++++++# # 找到所有分配给该专家的 token -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++ -+++++++# # 使用 mask 选取对应的 token 和权重 -+++++++# current_token_indices = token_indices[mask] -+++++++# current_routing_weights = flat_routing_weights[mask] -+++++++# current_hidden_states = hidden_states[current_token_indices] -+++++++ -+++++++# # 对这些 token 进行批处理 -+++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++++ -+++++++# # 使用 index_add 将结果精确地加回到对应位置 -+++++++# moe_output = moe_output.index_add( -+++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -+++++++# ) -+++++++# return moe_output -+++++++ -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# """ -+++++++# 顶层 forward 方法,作为智能分发器。 -+++++++# """ -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ -+++++++# # 1. 门控计算 -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++ -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++# routing_weights = routing_weights.to(hidden_states.dtype) -+++++++ -+++++++# # 2. 调用统一的 MoE 计算内核 -+++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -+++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++++++ -++++++- # attn_weights = None -++++++- # if output_attentions: -++++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -+++++++# # 3. 统一处理共享专家 -+++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++++ -+++++++# # 4. 合并输出 -+++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++++ -+++++++# # 5. 恢复原始形状并返回 -+++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -+++++++ -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# """ -+++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -+++++++# 【最终高性能与高精度版】: -+++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -+++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -+++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -+++++++# 3. 这样实现了速度和准确性的两全其美。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_decode( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# """ -+++++++# 【解码路径】极致优化版:bmm + 高精度累加。 -+++++++# """ -+++++++# original_dtype = hidden_states.dtype -+++++++# batch_size, _ = hidden_states.shape -+++++++ -+++++++# expert_outputs_list = [ -+++++++# ops.cat([ -+++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++++# ], dim=0) -+++++++# for i in range(batch_size) -+++++++# ] -+++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++++ -+++++++# # 在 float32 下执行 bmm,得到高精度结果 -+++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -+++++++ -+++++++# # 将高精度结果转换回原始数据类型 -+++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -+++++++ -+++++++# return moe_output -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_prefill( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# selected_experts: mindspore.Tensor, -+++++++# routing_weights: mindspore.Tensor -+++++++# ) -> mindspore.Tensor: -+++++++# """ -+++++++# 【预填充路径】与原始实现一致,结果精确。 -+++++++# """ -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens, _ = hidden_states.shape -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++ -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# selected_token_indices = token_indices[mask] -+++++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++++# current_states = hidden_states[selected_token_indices] -+++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++# moe_output = moe_output.index_add( -+++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -+++++++# ) -+++++++# return moe_output -+++++++ -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++ -++++++- # return attn_output, attn_weights, past_key_value -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -+++++++# # 如果模型主体是 float16,后续再转换 -+++++++ -+++++++# moe_output = None -+++++++# if not self.training: -+++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -+++++++# # _moe_infer_decode 内部会处理好类型转换 -+++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) -+++++++# if sequence_length == 1: -+++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++++# else: -+++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -+++++++# else: -+++++++# raise NotImplementedError("Training path is not implemented.") -+++++++ -+++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++++ -+++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -++++++ -++++++-QWEN2MOE_ATTENTION_CLASSES = { -++++++- "eager": Qwen2MoeAttention, -++++++- "flash-attention": Qwen2MoeFlashAttention, -++++++-} -+++++++# class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++# """ -+++++++# 【融合版】一个混合专家模块,内置两种推理策略, -+++++++# 由外部全局变量 `Long_Prompt` 控制: -+++++++ -+++++++# - if Long_Prompt is True: 【精度优先模式】 -+++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -+++++++# 适用于处理长序列,避免误差累积。 -+++++++ -+++++++# - if Long_Prompt is False: 【速度优先模式】 -+++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -+++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 -+++++++# """ -+++++++# def __init__(self, config: Qwen2MoeConfig): -+++++++# super().__init__() -+++++++# self.num_experts = config.num_experts -+++++++# self.top_k = config.num_experts_per_tok -+++++++# self.norm_topk_prob = config.norm_topk_prob -+++++++ -+++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -+++++++# self.experts = nn.ModuleList( -+++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -+++++++# ) -+++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -+++++++# # --- 速度优先模式的辅助函数 --- -+++++++# @no_grad() -+++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++# original_dtype = hidden_states.dtype -+++++++# batch_size, _ = hidden_states.shape -+++++++# expert_outputs_list = [ -+++++++# ops.cat([ -+++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++++# ], dim=0) -+++++++# for i in range(batch_size) -+++++++# ] -+++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++++# weights_fp32 = routing_weights.to(mindspore.float32) -+++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++++# return moe_output_fp32.squeeze(1).to(original_dtype) -+++++++ -+++++++# @no_grad() -+++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens, _ = hidden_states.shape -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# selected_token_indices = token_indices[mask] -+++++++# selected_routing_weights = routing_weights.flatten()[mask] -+++++++# current_states = hidden_states[selected_token_indices] -+++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -+++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++++# return moe_output -+++++++ -+++++++# # --- 精度优先模式的辅助函数 --- -+++++++# @no_grad() -+++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++# moe_output = ops.zeros_like(hidden_states) -+++++++# num_tokens, _ = hidden_states.shape -+++++++# flat_selected_experts = selected_experts.flatten() -+++++++# flat_routing_weights = routing_weights.flatten() -+++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++# active_experts = ops.unique(flat_selected_experts) -+++++++# for expert_idx_tensor in active_experts: -+++++++# expert_idx = expert_idx_tensor.item() -+++++++# expert_layer = self.experts[expert_idx] -+++++++# mask = (flat_selected_experts == expert_idx_tensor) -+++++++# current_token_indices = token_indices[mask] -+++++++# current_routing_weights = flat_routing_weights[mask] -+++++++# current_hidden_states = hidden_states[current_token_indices] -+++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++++# return moe_output -+++++++ -+++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++# # 声明我们将要使用一个在模块外部定义的全局变量 -+++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -+++++++# global Long_Prompt -+++++++ -+++++++# # 1. 门控计算 (所有模式通用) -+++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++# router_logits = self.gate(hidden_states_reshaped) -+++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++++++# if self.norm_topk_prob: -+++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++# moe_output = None -+++++++# if not self.training: -+++++++# # 根据 Long_Prompt 标志选择模式 -+++++++# if Long_Prompt: -+++++++# # --- 精度优先模式 --- -+++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++# else: -+++++++# # --- 速度优先模式 --- -+++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++# if sequence_length == 1: -+++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++# else: -+++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++# else: -+++++++# raise NotImplementedError("Training path is not implemented.") -+++++++ -+++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++++ -+++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++# return final_hidden_states, router_logits -+++++++ -+++++++class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++ """ -+++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -+++++++ 控制的顶级推理策略: -++++++ -+++++++ - if Long_Prompt is True: 【精度优先模式】 -+++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -+++++++ 适用于需要严格可复现性的长序列任务。 -++++++ -++++++-class Qwen2MoeSparseMoeBlock(nn.Module): -++++++- def __init__(self, config): -+++++++ - if Long_Prompt is False: 【速度优先模式】 -+++++++ 采用业界最强的性能组合: -+++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -+++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -+++++++ """ -+++++++ def __init__(self, config: Qwen2MoeConfig): -++++++ super().__init__() -++++++ self.num_experts = config.num_experts -++++++ self.top_k = config.num_experts_per_tok -++++++ self.norm_topk_prob = config.norm_topk_prob -++++++ -++++++- # gating -++++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++ self.experts = nn.ModuleList( -++++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++ ) -++++++- -++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++ -++++++- #@dwj -++++++- # 只遍历激活的专家,而非全部专家 -++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++- num_tokens = hidden_states_reshaped.shape[0] -++++++- -++++++- router_logits = self.gate(hidden_states_reshaped) -++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- -++++++- if self.norm_topk_prob: -++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++- flat_selected_experts = selected_experts.flatten() -++++++- -++++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++- token_indices = broadcasted_token_indices.flatten() -++++++- -++++++- active_experts = ops.unique(flat_selected_experts) -++++++- -++++++- for expert_idx_tensor in active_experts: -++++++- expert_idx = expert_idx_tensor.item() -++++++- expert_layer = self.experts[expert_idx] -++++++- -++++++- mask = (flat_selected_experts == expert_idx_tensor) -++++++- selected_token_indices = token_indices[mask] -++++++- selected_routing_weights = routing_weights.flatten()[mask] -++++++- -++++++- current_states = hidden_states_reshaped[selected_token_indices] -++++++- -++++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++- -++++++- final_hidden_states = final_hidden_states.index_add( -++++++- dim=0, -++++++- index=selected_token_indices, -++++++- source=expert_output.to(hidden_states.dtype) -++++++- ) -++++++- -++++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -+++++++ @no_grad() -+++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++ original_dtype = hidden_states.dtype -+++++++ batch_size, _ = hidden_states.shape -+++++++ expert_outputs_list = [ -+++++++ ops.cat([ -+++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -+++++++ ], dim=0) -+++++++ for i in range(batch_size) -+++++++ ] -+++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -+++++++ weights_fp32 = routing_weights.to(mindspore.float32) -+++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -+++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -+++++++ return moe_output_fp32.squeeze(1).to(original_dtype) -+++++++ -+++++++ @no_grad() -+++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++ num_tokens, _ = hidden_states.shape -+++++++ flat_selected_experts = selected_experts.flatten() -+++++++ sorted_expert_indices = flat_selected_experts.argsort() -+++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++++++ original_token_indices = sorted_expert_indices // self.top_k -+++++++ moe_output = ops.zeros_like(hidden_states) -+++++++ current_token_offset = 0 -+++++++ for i in range(self.num_experts): -+++++++ expert_token_count = tokens_per_expert[i] - current_token_offset -+++++++ if expert_token_count == 0: -+++++++ continue -+++++++ end_offset = current_token_offset + expert_token_count -+++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++++++ expert_hidden_states = hidden_states[expert_original_token_indices] -+++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++++ current_token_offset += expert_token_count -+++++++ return moe_output -+++++++ -+++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -+++++++ @no_grad() -+++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++ moe_output = ops.zeros_like(hidden_states) -+++++++ num_tokens, _ = hidden_states.shape -+++++++ flat_selected_experts = selected_experts.flatten() -+++++++ flat_routing_weights = routing_weights.flatten() -+++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -+++++++ active_experts = ops.unique(flat_selected_experts) -+++++++ for expert_idx_tensor in active_experts: -+++++++ expert_idx = expert_idx_tensor.item() -+++++++ expert_layer = self.experts[expert_idx] -+++++++ mask = (flat_selected_experts == expert_idx_tensor) -+++++++ current_token_indices = token_indices[mask] -+++++++ current_routing_weights = flat_routing_weights[mask] -+++++++ current_hidden_states = hidden_states[current_token_indices] -+++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -+++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++++ return moe_output -++++++ -++++++- final_hidden_states = final_hidden_states + shared_expert_output -++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++- -++++++- return final_hidden_states, router_logits -+++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++ global Long_Prompt -+++++++ -+++++++ # 1. 门控计算 (所有模式通用) -+++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -+++++++ router_logits = self.gate(hidden_states_reshaped) -+++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -+++++++ if self.norm_topk_prob: -+++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++ -+++++++ moe_output = None -+++++++ if Long_Prompt: -+++++++ # --- 精度优先模式 (ACCURACY MODE) --- -+++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ else: -+++++++ # --- 速度优先模式 (SPEED MODE) --- -+++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++ if sequence_length == 1: -+++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ else: -+++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ -++++++ -+++++++ # 3. 共享专家计算与合并 (所有模式通用) -+++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -+++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -+++++++ -+++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output -+++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -+++++++ -+++++++ return final_hidden_states, router_logits -++++++ -++++++ class Qwen2MoeDecoderLayer(nn.Module): -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++++++ super().__init__() -++++++ self.hidden_size = config.hidden_size -+++++++ -+++++++ # if Long_Prompt: -+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++++ # else: -+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++ -++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ -++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++- -++++++ if (layer_idx not in config.mlp_only_layers) and ( -++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++++ ): -++++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ self._warmed_up = True -++++++ self.warmup_moe_model() -++++++ -+++++++ -+++++++ -++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++++ output_router_logits = ( -++++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits -++++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ router_logits=outputs.router_logits, -++++++ ) -++++++ -+++++++ def generate(self, *args, **kwargs): -+++++++ """ -+++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -+++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -+++++++ """ -+++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++++++ -+++++++ input_ids = kwargs.get("input_ids") -+++++++ if input_ids is None and args: -+++++++ input_ids = args[0] -+++++++ -+++++++ if input_ids is not None: -+++++++ prompt_length = input_ids.shape[1] -+++++++ -+++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: -+++++++ Long_Prompt = True -+++++++ else: -+++++++ Long_Prompt = False -+++++++ -+++++++ return super().generate(*args, **kwargs) -+++++++ -++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -++++++ def prepare_inputs_for_generation( -++++++ self, -++++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -++++++ # Exception 1: when passing input_embeds, input_ids may be missing entries -++++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here -+++++++ -++++++ if past_key_values is not None: -++++++ if inputs_embeds is not None: # Exception 1 -++++++ if 0 not in input_ids.shape: -++++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ } -++++++ ) -++++++ return model_inputs -+++++++ -++++++ # @lwx -++++++ # def _decode_one_tokens_logits( -++++++ # self, -++++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -++++++ attentions=outputs.attentions, -++++++ ) -++++++ -+++++++ -++++++ __all__ = [ -++++++ "Qwen2MoeForCausalLM", -++++++ "Qwen2MoeModel", -++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++++new file mode 100644 -++++++index 00000000..6dfb5b93 -++++++--- /dev/null -+++++++++ b/patches/0001-20251104commit.patch -++++++@@ -0,0 +1,1272 @@ -+++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -+++++++From: Pinoeer-kingxi <13022943007@163.com> -+++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 -+++++++Subject: [PATCH] 20251104commit -+++++++ -+++++++--- -+++++++ mindnlp/transformers/cache_utils.py | 28 +- -+++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- -+++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -+++++++ 3 files changed, 976 insertions(+), 87 deletions(-) -+++++++ -+++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -+++++++index cadd2e04..02f8d4be 100644 -+++++++--- a/mindnlp/transformers/cache_utils.py -++++++++++ b/mindnlp/transformers/cache_utils.py -+++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): -+++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -+++++++ # k_out[:, :, cache_position] = key_states -+++++++ # v_out[:, :, cache_position] = value_states -+++++++- if ON_ORANGE_PI: -+++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -+++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -+++++++- else: -+++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -+++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -+++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -+++++++- -++++++++ # if ON_ORANGE_PI: -++++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++++ # else: -++++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++++ # 确保 cache_position 是 1D tensor 并且类型正确 -++++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++++++ if cache_position.ndim > 1: -++++++++ cache_position = cache_position.flatten() -++++++++ # 确保类型是 int32 或 int64(MindSpore 要求) -++++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++++++ cache_position = cache_position.int() -++++++++ -++++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++++++ k_out[:, :, cache_position] = key_states -++++++++ v_out[:, :, cache_position] = value_states -++++++++ -+++++++ return k_out, v_out -+++++++ -+++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -+++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++index c695b944..d8303e45 100644 -+++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -+++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half -+++++++ def rotate_half(x): -+++++++ """Rotates half the hidden dims of the input.""" -+++++++- x1 = x[..., : x.shape[-1] // 2] -+++++++- x2 = x[..., x.shape[-1] // 2 :] -++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++++ # x1 = x[..., : x.shape[-1] // 2] -++++++++ # x2 = x[..., x.shape[-1] // 2 :] -++++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++++ return ops.cat((-x2, x1), dim=-1) -+++++++ -+++++++ -+++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -+++++++ if self.training: -+++++++ raise NotImplementedError("Training is not supported yet.") -+++++++ else: -+++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -+++++++- if self.config.n_shared_experts is not None: -+++++++- y = y + self.shared_experts(identity) -+++++++- return y -++++++++ # @lwx -++++++++ if orig_shape[1] == 1: -++++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++++++ y=y.view(*orig_shape) -++++++++ if self.config.n_shared_experts is not None: -++++++++ y = y + self.shared_experts(identity) -++++++++ return y -++++++++ else: -++++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++++++ if self.config.n_shared_experts is not None: -++++++++ y = y + self.shared_experts(identity) -++++++++ return y -++++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++++ # if self.config.n_shared_experts is not None: -++++++++ # y = y + self.shared_experts(identity) -++++++++ # return y -++++++++ -++++++++ @no_grad() -++++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++++ -++++++++ expert_cache = ops.zeros_like(x) -++++++++ for i in range(self.num_experts_per_tok): -++++++++ expert_id = flat_expert_indices[i].item() -++++++++ weight = flat_expert_weights[i].item() -++++++++ expert = self.experts[expert_id] -++++++++ expert_out = expert(x) -++++++++ expert_cache += expert_out * weight -++++++++ return expert_cache -+++++++ -+++++++ @no_grad() -+++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++- # expert_cache = torch.zeros_like(x) -+++++++- # idxs = flat_expert_indices.argsort() -+++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++++- # token_idxs = idxs // self.num_experts_per_tok -+++++++- # for i, end_idx in enumerate(tokens_per_expert): -+++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++++- # if start_idx == end_idx: -+++++++- # continue -+++++++- # expert = self.experts[i] -+++++++- # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++- # expert_tokens = x[exp_token_idx] -+++++++- # expert_out = expert(expert_tokens) -+++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++++- # return expert_cache -++++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++++ expert_cache = ops.zeros_like(x) -+++++++ idxs = flat_expert_indices.argsort() -+++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ token_idxs = idxs // self.num_experts_per_tok -++++++++ -+++++++ for i, end_idx in enumerate(tokens_per_expert): -+++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ if start_idx == end_idx: -+++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -+++++++ expert_out = expert(expert_tokens) -+++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++++ -+++++++ return expert_cache -++++++++ -++++++++ # @no_grad() -++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++++ # # expert_cache = torch.zeros_like(x) -++++++++ # # idxs = flat_expert_indices.argsort() -++++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++++ # # token_idxs = idxs // self.num_experts_per_tok -++++++++ # # for i, end_idx in enumerate(tokens_per_expert): -++++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++++ # # if start_idx == end_idx: -++++++++ # # continue -++++++++ # # expert = self.experts[i] -++++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++++ # # expert_tokens = x[exp_token_idx] -++++++++ # # expert_out = expert(expert_tokens) -++++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++++ # # return expert_cache -++++++++ # expert_cache = ops.zeros_like(x) -++++++++ # idxs = flat_expert_indices.argsort() -++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++++ -++++++++ # for i, end_idx in enumerate(tokens_per_expert): -++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++++ # if start_idx == end_idx: -++++++++ # continue -++++++++ # expert = self.experts[i] -++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++++ # expert_tokens = x[exp_token_idx] -++++++++ # expert_out = expert(expert_tokens) -++++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++++ -++++++++ # return expert_cache -++++++++ # @no_grad() -++++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++++ # expert_cache = ops.zeros_like(x) -++++++++ -++++++++ # # 排序保证顺序一致 -++++++++ # idxs = flat_expert_indices.argsort() -++++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++++ # token_idxs = idxs // self.num_experts_per_tok -++++++++ -++++++++ # # 找出有 token 的专家 -++++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++++ -++++++++ # for i in active_experts.tolist(): -++++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++++ # end_idx = tokens_per_expert[i] -++++++++ # if start_idx == end_idx: # 没有 token -++++++++ # continue -++++++++ -++++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++++ # expert_tokens = x[exp_token_idx] -++++++++ # expert_out = self.experts[i](expert_tokens) -++++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++++ -++++++++ # expert_cache = mindspore.mint.scatter_add( -++++++++ # expert_cache, -++++++++ # 0, -++++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++++ # expert_out -++++++++ # ) -++++++++ -++++++++ # return expert_cache -++++++++ -++++++++ -+++++++ -+++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -+++++++ # """ -+++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++++ -+++++++ # Initialize weights and apply final processing -+++++++ self.post_init() -++++++++ self.warm_up = False -++++++++ -++++++++ def warmup_moe_model_deep(self): -++++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++++++ test_texts = [ -++++++++ "warmup short", -++++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++++++ ] -++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++++ if tokenizer is None: -++++++++ from mindnlp.transformers import AutoTokenizer -++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++++ self._warmup_tokenizer = tokenizer -++++++++ -++++++++ for text in test_texts: -++++++++ inputs = tokenizer(text, return_tensors="ms") -++++++++ with mindspore._no_grad(): -++++++++ _ = self(**inputs, use_cache=False) -++++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") -+++++++ -+++++++ def get_input_embeddings(self): -+++++++ return self.model.embed_tokens -+++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -+++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++++ ```""" -++++++++ if not self.warm_up: -++++++++ self.warm_up = True -++++++++ self.warmup_moe_model_deep() -++++++++ -+++++++ output_attentions = ( -+++++++ output_attentions -+++++++ if output_attentions is not None -+++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++index 3cbf820e..d4c6b651 100644 -+++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++@@ -18,7 +18,6 @@ -+++++++ # See the License for the specific language governing permissions and -+++++++ # limitations under the License. -+++++++ """MindSpore Qwen2MoE model.""" -+++++++- -+++++++ import math -+++++++ from typing import List, Optional, Tuple, Union -+++++++ -+++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -+++++++ TokenClassifierOutput, -+++++++ ) -+++++++ from ...modeling_utils import PreTrainedModel -++++++++from ...generation import GenerationMixin -+++++++ from ....utils import logging -+++++++ from .configuration_qwen2_moe import Qwen2MoeConfig -+++++++ -+++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -+++++++ self.variance_epsilon = eps -+++++++ -+++++++ def forward(self, hidden_states): -++++++++ # @dwj -++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++++ # @lwx -++++++++ # if not self.training : -++++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -+++++++ input_dtype = hidden_states.dtype -+++++++ hidden_states = hidden_states.to(mindspore.float32) -+++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -+++++++@@ -234,6 +239,8 @@ def rotate_half(x): -+++++++ """Rotates half the hidden dims of the input.""" -+++++++ x1 = x[..., : x.shape[-1] // 2] -+++++++ x2 = x[..., x.shape[-1] // 2 :] -++++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -+++++++ return ops.cat((-x2, x1), dim=-1) -+++++++ -+++++++ -+++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -+++++++ self.config = config -+++++++ self.hidden_size = config.hidden_size -+++++++ self.intermediate_size = intermediate_size -++++++++ -+++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -+++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -+++++++ self.act_fn = ACT2FN[config.hidden_act] -+++++++ -+++++++ def forward(self, x): -+++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -+++++++- -+++++++ -++++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++++ # @lwx -++++++++ # gate_up_output = self.gate_up_proj(x) -++++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++++++ # return self.down_proj(swiglu_output) -++++++++ -++++++++ # def forward(self, x): -++++++++ # gate_proj_out = self.gate_proj(x) -++++++++ # up_proj_out = self.up_proj(x) -++++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++++++ # return self.down_proj(swiglu_out) -++++++++ -+++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv -+++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -+++++++ """ -+++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -+++++++ use_cache: bool = False, -+++++++ cache_position: Optional[mindspore.Tensor] = None, -+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++++ -++++++++ -++++++++ -+++++++ bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++ query_states = self.q_proj(hidden_states) -+++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -+++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++ "with a layer index." -+++++++ ) -+++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++++ if isinstance(past_key_value, StaticCache): -++++++++ kv_seq_len = key_states.shape[-2] -++++++++ else: -++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++ if past_key_value is not None: -+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++++ -++++++++ if isinstance(past_key_value, StaticCache): -++++++++ kv_seq_len = key_states.shape[-2] -+++++++ -+++++++ # repeat k/v heads if n_kv_heads < n_heads -+++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++- -++++++++ -+++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++++ -+++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -+++++++- raise ValueError( -+++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -+++++++- f" {attn_weights.shape}" -+++++++- ) -+++++++- -+++++++- if attention_mask is not None: # no matter the length, we just slice it -+++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++++++ if attention_mask is not None: -++++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++++ attn_weights = attn_weights + causal_mask -+++++++ -+++++++ # upcast attention to fp32 -+++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -+++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++++ -+++++++ attn_output = self.o_proj(attn_output) -+++++++- -++++++++ # @lwx -++++++++ -++++++++ # max_seq_len = self.max_position_embeddings # 2048 -++++++++ -++++++++ # if attention_mask is not None: -++++++++ # # attention_mask: [B, 1, Sq, Sk] -++++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++++ -++++++++ # # pad 到 [max_seq_len, max_seq_len] -++++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++++ # global_attention_mask = padded_mask -++++++++ # else: -++++++++ # global_attention_mask = None -++++++++ -++++++++ -++++++++ # sparse_mode=3 -++++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++++ # query=query_states, -++++++++ # key=key_states, -++++++++ # value=value_states, -++++++++ # real_shift=None, -++++++++ # padding_mask=None, -++++++++ -++++++++ # head_num=self.num_heads, -++++++++ # attn_mask=global_attention_mask, -++++++++ # keep_prob=1.0 - self.attention_dropout, -++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++++ # input_layout="BNSD", -++++++++ # pre_tokens=2147483647, -++++++++ # next_tokens=2147483647, -++++++++ # inner_precise=0, -++++++++ # drop_mask=None, -++++++++ # prefix=None, -++++++++ # actual_seq_qlen=None, -++++++++ # actual_seq_kvlen=None, -++++++++ # sparse_mode=sparse_mode, -++++++++ # ) -+++++++ if not output_attentions: -+++++++ attn_weights = None -+++++++ -+++++++ return attn_output, attn_weights, past_key_value -+++++++ -+++++++ -++++++++class Qwen2MoeFlashAttention(nn.Module): -++++++++ """ -++++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++++ -++++++++ 关键改动: -++++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++++ 直接传入原始的 key 和 value 张量效率更高。 -++++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++++ """ -++++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++++ super().__init__() -++++++++ self.config = config -++++++++ self.layer_idx = layer_idx -++++++++ self.hidden_size = config.hidden_size -++++++++ self.num_heads = config.num_attention_heads -++++++++ self.head_dim = self.hidden_size // self.num_heads -++++++++ self.num_key_value_heads = config.num_key_value_heads -++++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++++ self.max_position_embeddings = config.max_position_embeddings -++++++++ self.rope_theta = config.rope_theta -++++++++ self.attention_dropout = config.attention_dropout -++++++++ -++++++++ if (self.head_dim * self.num_heads) != self.hidden_size: -++++++++ raise ValueError( -++++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++++ ) -++++++++ -++++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++++ -++++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++++ self.head_dim, -++++++++ max_position_embeddings=self.max_position_embeddings, -++++++++ base=self.rope_theta, -++++++++ ) -++++++++ -++++++++ def forward( -++++++++ self, -++++++++ hidden_states: mindspore.Tensor, -++++++++ attention_mask: Optional[mindspore.Tensor] = None, -++++++++ position_ids: Optional[mindspore.Tensor] = None, -++++++++ past_key_value: Optional[Cache] = None, -++++++++ output_attentions: bool = False, -++++++++ use_cache: bool = False, -++++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++++ -++++++++ bsz, q_len, _ = hidden_states.shape -++++++++ -++++++++ # 1. 线性投射 Q, K, V -++++++++ query_states = self.q_proj(hidden_states) -++++++++ key_states = self.k_proj(hidden_states) -++++++++ value_states = self.v_proj(hidden_states) -++++++++ -++++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++++ # query: [B, S, H*D] -> [B, N1, S, D] -++++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ -++++++++ # 3. RoPE 旋转位置编码 -++++++++ kv_seq_len = key_states.shape[-2] -++++++++ if past_key_value is not None: -++++++++ if self.layer_idx is None: -++++++++ raise ValueError( -++++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++++ "with a layer index." -++++++++ ) -++++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++++ if cache_position.shape[0] == 1: -++++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++++ kv_seq_len = past_seen_tokens + 1 -++++++++ else: -++++++++ # prefill 阶段:cache_position 是范围,使用其长度 -++++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++++ else: -++++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++++ -++++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++++ -++++++++ # 4. KV 缓存更新 -++++++++ if past_key_value is not None: -++++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++++ key_states, value_states = past_key_value.update( -++++++++ key_states, value_states, self.layer_idx, cache_kwargs -++++++++ ) -++++++++ -++++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++++ if cache_position.shape[0] == 1: -++++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++++ kv_seq_len = key_states.shape[-2] -++++++++ -++++++++ # 5. [重要] 准备 Attention Mask -++++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++++ fa_attention_mask = None -++++++++ if attention_mask is not None: -++++++++ # 截取与当前key长度匹配的部分 -++++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++++ fa_attention_mask = (mask_slice != 0) -++++++++ -++++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++++ input_dtype = query_states.dtype -++++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++++ query_states = query_states.to(mindspore.float16) -++++++++ key_states = key_states.to(mindspore.float16) -++++++++ value_states = value_states.to(mindspore.float16) -++++++++ -++++++++ # 6. [核心] 调用 flash_attention_score 算子 -++++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++++ attn_output = mindspore.ops.flash_attention_score( -++++++++ query=query_states, -++++++++ key=key_states, -++++++++ value=value_states, -++++++++ head_num=self.num_heads, # 传入Q的头数(N1) -++++++++ attn_mask=fa_attention_mask, -++++++++ keep_prob=1.0 - self.attention_dropout, -++++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++++ input_layout="BNSD", -++++++++ sparse_mode=0 # 使用 defaultMask 模式 -++++++++ ) -++++++++ -++++++++ # 恢复原始数据类型 -++++++++ attn_output = attn_output.to(input_dtype) -++++++++ -++++++++ # 7. 调整输出形状 -++++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++++ attn_output = self.o_proj(attn_output) -++++++++ -++++++++ # FlashAttention 算子不直接返回注意力权重矩阵 -++++++++ attn_weights = None -++++++++ if output_attentions: -++++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++++ -++++++++ return attn_output, attn_weights, past_key_value -++++++++ -++++++++ # def forward( -++++++++ # self, -++++++++ # hidden_states: mindspore.Tensor, -++++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++++ # past_key_value: Optional[Cache] = None, -++++++++ # output_attentions: bool = False, -++++++++ # use_cache: bool = False, -++++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++++ -++++++++ # bsz, q_len, _ = hidden_states.shape -++++++++ -++++++++ # # 1. 线性投射 Q, K, V -++++++++ # query_states = self.q_proj(hidden_states) -++++++++ # key_states = self.k_proj(hidden_states) -++++++++ # value_states = self.v_proj(hidden_states) -++++++++ -++++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ -++++++++ # # 3. RoPE 旋转位置编码 -++++++++ # kv_seq_len = key_states.shape[-2] -++++++++ # if past_key_value is not None: -++++++++ # if self.layer_idx is None: -++++++++ # raise ValueError( -++++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++++ # "with a layer index." -++++++++ # ) -++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++++ -++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++++ -++++++++ # # 4. KV 缓存更新 -++++++++ # if past_key_value is not None: -++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++++ # key_states, value_states = past_key_value.update( -++++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++++ # ) -++++++++ -++++++++ # # 5. 准备 Attention Mask -++++++++ # fa_attention_mask = None -++++++++ # if attention_mask is not None: -++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++++ # fa_attention_mask = (mask_slice != 0) -++++++++ -++++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++++ # input_dtype = query_states.dtype -++++++++ -++++++++ # # 6. [核心] 调用 flash_attention_score 算子 -++++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++++ # query=query_states, -++++++++ # key=key_states, -++++++++ # value=value_states, -++++++++ # head_num=self.num_heads, -++++++++ # attn_mask=fa_attention_mask, -++++++++ # keep_prob=1.0 - self.attention_dropout, -++++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++++ # input_layout="BNSD", -++++++++ # sparse_mode=0, -++++++++ # # <--- 修改点 2: 启用内部高精度计算 --- -++++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++++ # inner_precise=1 -++++++++ # ) -++++++++ -++++++++ # # 恢复原始数据类型 -++++++++ # attn_output = attn_output.to(input_dtype) -++++++++ -++++++++ # # 7. 调整输出形状 -++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++++ # attn_output = self.o_proj(attn_output) -++++++++ -++++++++ # attn_weights = None -++++++++ # if output_attentions: -++++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++++ -++++++++ # return attn_output, attn_weights, past_key_value -++++++++ -++++++++ # def forward( -++++++++ # self, -++++++++ # hidden_states: mindspore.Tensor, -++++++++ # attention_mask: Optional[mindspore.Tensor] = None, -++++++++ # position_ids: Optional[mindspore.Tensor] = None, -++++++++ # past_key_value: Optional[Cache] = None, -++++++++ # output_attentions: bool = False, -++++++++ # use_cache: bool = False, -++++++++ # cache_position: Optional[mindspore.Tensor] = None, -++++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++++ -++++++++ # bsz, q_len, _ = hidden_states.shape -++++++++ -++++++++ # query_states = self.q_proj(hidden_states) -++++++++ # key_states = self.k_proj(hidden_states) -++++++++ # value_states = self.v_proj(hidden_states) -++++++++ -++++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++++ -++++++++ # kv_seq_len = key_states.shape[-2] -++++++++ # if past_key_value is not None: -++++++++ # if self.layer_idx is None: -++++++++ # raise ValueError("`layer_idx` must be specified for caching") -++++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++++ -++++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++++ -++++++++ # if past_key_value is not None: -++++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++++ # key_states, value_states = past_key_value.update( -++++++++ # key_states, value_states, self.layer_idx, cache_kwargs -++++++++ # ) -++++++++ -++++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++++ -++++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++++++ # query_states = query_states / math.sqrt(self.head_dim) -++++++++ # # <--- 修改结束 --- -++++++++ -++++++++ # fa_attention_mask = None -++++++++ # if attention_mask is not None: -++++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++++ # fa_attention_mask = (mask_slice != 0) -++++++++ -++++++++ # input_dtype = query_states.dtype -++++++++ -++++++++ # attn_output = mindspore.ops.flash_attention_score( -++++++++ # query=query_states, # 传入已经预先缩放过的 query -++++++++ # key=key_states, -++++++++ # value=value_states, -++++++++ # head_num=self.num_heads, -++++++++ # attn_mask=fa_attention_mask, -++++++++ # keep_prob=1.0 - self.attention_dropout, -++++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++++++ # input_layout="BNSD", -++++++++ # sparse_mode=0, -++++++++ # inner_precise=1 # 仍然保持内部高精度计算 -++++++++ # ) -++++++++ -++++++++ # attn_output = attn_output.to(input_dtype) -++++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++++ # attn_output = self.o_proj(attn_output) -++++++++ -++++++++ # attn_weights = None -++++++++ # if output_attentions: -++++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++++++ -++++++++ # return attn_output, attn_weights, past_key_value -++++++++ -+++++++ QWEN2MOE_ATTENTION_CLASSES = { -+++++++ "eager": Qwen2MoeAttention, -++++++++ "flash-attention": Qwen2MoeFlashAttention, -+++++++ } -+++++++ -+++++++ -+++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -+++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -+++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -+++++++ -++++++++ #@dwj -++++++++ # 只遍历激活的专家,而非全部专家 -+++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -+++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape -+++++++- hidden_states = hidden_states.view(-1, hidden_dim) -+++++++- # router_logits: (batch * sequence_length, n_experts) -+++++++- router_logits = self.gate(hidden_states) -+++++++- -+++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -+++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -+++++++- if self.norm_topk_prob: -+++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -+++++++- # we cast back to the input dtype -+++++++- routing_weights = routing_weights.to(hidden_states.dtype) -+++++++- -+++++++- final_hidden_states = ops.zeros( -+++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -+++++++- ) -+++++++- -+++++++- # One hot encode the selected experts to create an expert mask -+++++++- # this will be used to easily index which expert is going to be sollicitated -+++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -+++++++- -+++++++- # Loop over all available experts in the model and perform the computation on each expert -+++++++- for expert_idx in range(self.num_experts): -+++++++- expert_layer = self.experts[expert_idx] -+++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -+++++++- -+++++++- # Index the correct hidden states and compute the expert hidden state for -+++++++- # the current expert. We need to make sure to multiply the output hidden -+++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -+++++++- if 0 not in idx.shape: -+++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -+++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -+++++++- -+++++++- # However `index_add_` only support torch tensors for indexing so we'll use -+++++++- # the `top_x` tensor here. -+++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -+++++++- -+++++++- shared_expert_output = self.shared_expert(hidden_states) -+++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -+++++++- -+++++++- final_hidden_states = final_hidden_states + shared_expert_output -++++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++++ num_tokens = hidden_states_reshaped.shape[0] -++++++++ -++++++++ router_logits = self.gate(hidden_states_reshaped) -++++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++++ -++++++++ if self.norm_topk_prob: -++++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++++ routing_weights = routing_weights.to(hidden_states.dtype) -++++++++ -++++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++++ flat_selected_experts = selected_experts.flatten() -++++++++ -++++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++++ token_indices = broadcasted_token_indices.flatten() -++++++++ -++++++++ active_experts = ops.unique(flat_selected_experts) -++++++++ -++++++++ for expert_idx_tensor in active_experts: -++++++++ expert_idx = expert_idx_tensor.item() -++++++++ expert_layer = self.experts[expert_idx] -++++++++ -++++++++ mask = (flat_selected_experts == expert_idx_tensor) -++++++++ selected_token_indices = token_indices[mask] -++++++++ selected_routing_weights = routing_weights.flatten()[mask] -++++++++ -++++++++ current_states = hidden_states_reshaped[selected_token_indices] -++++++++ -++++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++++ -++++++++ final_hidden_states = final_hidden_states.index_add( -++++++++ dim=0, -++++++++ index=selected_token_indices, -++++++++ source=expert_output.to(hidden_states.dtype) -++++++++ ) -++++++++ -++++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -+++++++ -+++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -+++++++- return final_hidden_states, router_logits -++++++++ final_hidden_states = final_hidden_states + shared_expert_output -++++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++++ -++++++++ return final_hidden_states, router_logits -+++++++ -+++++++ -+++++++ class Qwen2MoeDecoderLayer(nn.Module): -+++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -+++++++ -+++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -+++++++ -++++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++++ -+++++++ if (layer_idx not in config.mlp_only_layers) and ( -+++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -+++++++ ): -+++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -+++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] -+++++++ _skip_keys_device_placement = "past_key_values" -+++++++ _supports_cache_class = True -++++++++#lwx -++++++++ # _supports_static_cache = True -+++++++ -+++++++ def _init_weights(self, module): -+++++++ std = self.config.initializer_range -+++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -+++++++ return causal_mask -+++++++ -+++++++ -+++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -+++++++ _tied_weights_keys = ["lm_head.weight"] -+++++++ -+++++++ def __init__(self, config): -+++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++++ self.num_experts_per_tok = config.num_experts_per_tok -+++++++ # Initialize weights and apply final processing -+++++++ self.post_init() -++++++++ # @lwx -++++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++++++ # self.generation_config.cache_implementation = "static" -++++++++ self._warmed_up = False -++++++++ -++++++++ def warmup_moe_model(self): -++++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++++++ test_texts = [ -++++++++ "warmup short", -++++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++++++ ] -++++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++++ if tokenizer is None: -++++++++ from mindnlp.transformers import AutoTokenizer -++++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++++ self._warmup_tokenizer = tokenizer -++++++++ -++++++++ for text in test_texts: -++++++++ inputs = tokenizer(text, return_tensors="ms") -++++++++ with mindspore._no_grad(): -++++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") -+++++++ -+++++++ def get_input_embeddings(self): -+++++++ return self.model.embed_tokens -+++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -+++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -+++++++ ```""" -++++++++ if not self._warmed_up: -++++++++ self._warmed_up = True -++++++++ self.warmup_moe_model() -+++++++ -+++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -+++++++ output_router_logits = ( -+++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -+++++++ } -+++++++ ) -+++++++ return model_inputs -++++++++# @lwx -++++++++ # def _decode_one_tokens_logits( -++++++++ # self, -++++++++ # cur_token: mindspore.Tensor, -++++++++ # input_pos: Optional[mindspore.Tensor], -++++++++ # cache_position: mindspore.Tensor, -++++++++ # past_key_values: StaticCache, -++++++++ # ) -> mindspore.Tensor: -++++++++ # """ -++++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++++++ -++++++++ # Args: -++++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++++++ # input_pos: 输入位置信息,可选 -++++++++ # cache_position: 当前token在cache中的位置,shape为(1,) -++++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++++++ -++++++++ # Returns: -++++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++++++ # """ -++++++++ # # 调用JIT编译的版本 -++++++++ # return self.get_decode_one_tokens_logits( -++++++++ # cur_token=cur_token, -++++++++ # input_pos=input_pos, -++++++++ # cache_position=cache_position, -++++++++ # past_key_values=past_key_values, -++++++++ # ) -++++++++ -++++++++ # @mindspore.jit(jit_level='O1') -++++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++++++ # """ -++++++++ # JIT编译的函数,用于高效的单token解码 -++++++++ # 使用JIT编译优化以支持静态shape和高效执行 -++++++++ -++++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++++++ # """ -++++++++ # outputs = self.model.forward( -++++++++ # input_ids=cur_token, -++++++++ # position_ids=input_pos, -++++++++ # cache_position=cache_position, -++++++++ # past_key_values=past_key_values, -++++++++ # use_cache=True, -++++++++ # return_dict=False, -++++++++ # ) -++++++++ -++++++++ # hidden_states = outputs[0] -++++++++ # logits = self.lm_head.forward(hidden_states) -++++++++ # logits = logits.float() -++++++++ -++++++++ # return logits[:, -1, :] -++++++++ -++++++++ # def _sample( -++++++++ # self, -++++++++ # input_ids: mindspore.Tensor, -++++++++ # logits_processor, -++++++++ # stopping_criteria, -++++++++ # generation_config, -++++++++ # synced_devices: bool, -++++++++ # streamer=None, -++++++++ # logits_warper=None, -++++++++ # **model_kwargs, -++++++++ # ): -++++++++ # """ -++++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++++++ # """ -++++++++ # from ...generation.logits_process import LogitsProcessorList -++++++++ # from ...generation.stopping_criteria import StoppingCriteriaList -++++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++++++ # from mindnlp.core import nn, ops, no_grad -++++++++ # import numpy as np -++++++++ -++++++++ # # 检查是否使用 StaticCache -++++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++++++ # # 否则,直接调用父类方法 -++++++++ # past_key_values = model_kwargs.get("past_key_values") -++++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++++++ -++++++++ # if not isinstance(past_key_values, StaticCache): -++++++++ # # 不使用 StaticCache,直接调用父类方法 -++++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++++++ # return super()._sample( -++++++++ # input_ids=input_ids, -++++++++ # logits_processor=logits_processor, -++++++++ # stopping_criteria=stopping_criteria, -++++++++ # generation_config=generation_config, -++++++++ # synced_devices=synced_devices, -++++++++ # streamer=streamer, -++++++++ # logits_warper=logits_warper, -++++++++ # **model_kwargs, -++++++++ # ) -++++++++ -++++++++ # # 使用 StaticCache,进入自定义循环 -++++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++++++ # pad_token_id = generation_config._pad_token_tensor -++++++++ # output_attentions = generation_config.output_attentions -++++++++ # output_hidden_states = generation_config.output_hidden_states -++++++++ # output_scores = generation_config.output_scores -++++++++ # output_logits = generation_config.output_logits -++++++++ # return_dict_in_generate = generation_config.return_dict_in_generate -++++++++ # max_length = generation_config.max_length -++++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++++++ # do_sample = generation_config.do_sample -++++++++ -++++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++++++ # raise ValueError( -++++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++++++ # f"{logits_warper})." -++++++++ # ) -++++++++ -++++++++ # # init attention / hidden states / scores tuples -++++++++ # scores = () if (return_dict_in_generate and output_scores) else None -++++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++++++ -++++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++++++ # encoder_hidden_states = ( -++++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++++++ # ) -++++++++ -++++++++ # # keep track of which sequences are already finished -++++++++ # batch_size, cur_len = input_ids.shape -++++++++ # this_peer_finished = False -++++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++++++ -++++++++ # time_record = [] -++++++++ # from ....utils.testing_utils import parse_flag_from_env -++++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++++++ -++++++++ # while self._has_unfinished_sequences( -++++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++++++ # ): -++++++++ # if _record_time: -++++++++ # import time as time_module -++++++++ # infer_start = time_module.time() -++++++++ -++++++++ # # prepare model inputs -++++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++++++ -++++++++ # # prepare variable output controls -++++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++++++ -++++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++++++ # cur_cache_position = model_inputs.get("cache_position") -++++++++ # cur_past_key_values = model_inputs.get("past_key_values") -++++++++ # cur_input_ids = model_inputs.get("input_ids") -++++++++ -++++++++ # if (isinstance(cur_past_key_values, StaticCache) and -++++++++ # cur_cache_position is not None and -++++++++ # len(cur_cache_position.shape) > 0 and -++++++++ # cur_cache_position.shape[0] == 1 and -++++++++ # cur_input_ids is not None and -++++++++ # cur_input_ids.shape[1] == 1): -++++++++ # # 使用 JIT 优化的单 token 解码 -++++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++++++ # if not hasattr(self, '_jit_used'): -++++++++ # self._jit_used = False -++++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++++++ -++++++++ # next_token_logits = self.get_decode_one_tokens_logits( -++++++++ # cur_token=cur_input_ids, -++++++++ # input_pos=model_inputs.get("position_ids"), -++++++++ # cache_position=cur_cache_position, -++++++++ # past_key_values=cur_past_key_values, -++++++++ # ) -++++++++ -++++++++ # # 标记已使用JIT(用于后续判断) -++++++++ # if not self._jit_used: -++++++++ # self._jit_used = True -++++++++ -++++++++ # # 构造兼容的输出对象 -++++++++ # class JitOptimizedOutput: -++++++++ # def __init__(self, logits, config): -++++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++++++ # self.config = config -++++++++ # # 对于 JIT 优化路径,这些属性通常不需要 -++++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++++++ # self.attentions = None if not config.is_encoder_decoder else None -++++++++ # self.cross_attentions = None -++++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++++++ # self.hidden_states = None if not config.is_encoder_decoder else None -++++++++ -++++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++++++ # else: -++++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++++++ # outputs = self(**model_inputs, return_dict=True) -++++++++ -++++++++ # if synced_devices and this_peer_finished: -++++++++ # continue -++++++++ -++++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++++++ # next_token_logits = outputs.logits[:, -1, :] -++++++++ -++++++++ # # pre-process distribution -++++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++++++ # if do_sample: -++++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++++++ -++++++++ # # Store scores, attentions and hidden_states when required -++++++++ # if return_dict_in_generate: -++++++++ # if output_scores: -++++++++ # scores += (next_token_scores,) -++++++++ # if output_logits: -++++++++ # raw_logits += (next_token_logits,) -++++++++ # if output_attentions: -++++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++++++ # decoder_attentions += (attn,) if attn is not None else (None,) -++++++++ # if self.config.is_encoder_decoder: -++++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++++++ -++++++++ # if output_hidden_states: -++++++++ # hidden = ( -++++++++ # outputs.decoder_hidden_states -++++++++ # if self.config.is_encoder_decoder -++++++++ # else outputs.hidden_states -++++++++ # ) -++++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++++++ -++++++++ # # token selection -++++++++ # if do_sample: -++++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++++++ # else: -++++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++++++ -++++++++ # # finished sentences should have their next token be a padding token -++++++++ # if has_eos_stopping_criteria: -++++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++++++ -++++++++ # # update generated ids, model inputs, and length for next step -++++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++++++ # if streamer is not None: -++++++++ # streamer.put(next_tokens) -++++++++ -++++++++ # model_kwargs = self._update_model_kwargs_for_generation( -++++++++ # outputs, -++++++++ # model_kwargs, -++++++++ # is_encoder_decoder=self.config.is_encoder_decoder, -++++++++ # ) -++++++++ -++++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++++++ # cur_len += 1 -++++++++ -++++++++ # if _record_time: -++++++++ # import time as time_module -++++++++ # infer_stop = time_module.time() -++++++++ # time_record.append(infer_stop - infer_start) -++++++++ -++++++++ # del outputs -++++++++ -++++++++ # average_infer_time = None -++++++++ # if time_record: -++++++++ # if len(time_record) > 1: -++++++++ # time_record.pop(0) -++++++++ # average_infer_time = sum(time_record) / len(time_record) -++++++++ # print(f'average inference time is: {average_infer_time}') -++++++++ # print(f'inference time record: {time_record}') -++++++++ -++++++++ # if streamer is not None: -++++++++ # streamer.end() -++++++++ -++++++++ # # 简单判断:打印是否使用了JIT路径 -++++++++ # if hasattr(self, '_jit_used') and self._jit_used: -++++++++ # print("[JIT] ✓ JIT optimization was used during generation") -++++++++ # else: -++++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++++++ -++++++++ # if return_dict_in_generate: -++++++++ # if self.config.is_encoder_decoder: -++++++++ # return GenerateEncoderDecoderOutput( -++++++++ # sequences=input_ids, -++++++++ # scores=scores, -++++++++ # logits=raw_logits, -++++++++ # encoder_attentions=encoder_attentions, -++++++++ # encoder_hidden_states=encoder_hidden_states, -++++++++ # decoder_attentions=decoder_attentions, -++++++++ # cross_attentions=cross_attentions, -++++++++ # decoder_hidden_states=decoder_hidden_states, -++++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++++ # average_infer_time=average_infer_time -++++++++ # ) -++++++++ # else: -++++++++ # return GenerateDecoderOnlyOutput( -++++++++ # sequences=input_ids, -++++++++ # scores=scores, -++++++++ # logits=raw_logits, -++++++++ # attentions=decoder_attentions, -++++++++ # hidden_states=decoder_hidden_states, -++++++++ # past_key_values=model_kwargs.get("past_key_values"), -++++++++ # average_infer_time=average_infer_time -++++++++ # ) -++++++++ # else: -++++++++ # return input_ids -++++++++ -++++++++ # def _prepare_cache_for_generation( -++++++++ # self, -++++++++ # generation_config, -++++++++ # model_kwargs, -++++++++ # assistant_model, -++++++++ # batch_size, -++++++++ # max_cache_length, -++++++++ # ): -++++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++++++ # generation_config.cache_implementation = "static" -++++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++++++ -++++++++ # if generation_config.cache_implementation == "static": -++++++++ # base_required_from_max_length = generation_config.max_length + 1 -++++++++ # base_required = max(max_cache_length, base_required_from_max_length) -++++++++ # min_cache_size = 50 -++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++++++ # else: -++++++++ # max_cache_length = max(base_required, min_cache_size) -++++++++ -++++++++ # original_max_cache_length = max_cache_length -++++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++++++ # print(f" - input max_cache_length: {original_max_cache_length}") -++++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++++++ # print(f" - final max_cache_length: {max_cache_length}") -++++++++ -++++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++++ # if max_cache_length > self.config.max_position_embeddings: -++++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++++++ -++++++++ # result = super()._prepare_cache_for_generation( -++++++++ # generation_config=generation_config, -++++++++ # model_kwargs=model_kwargs, -++++++++ # assistant_model=assistant_model, -++++++++ # batch_size=batch_size, -++++++++ # max_cache_length=max_cache_length, -++++++++ # ) -++++++++ -++++++++ # if generation_config.cache_implementation == "static": -++++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++++++ # created_cache = model_kwargs.get(cache_name) -++++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++++++ # if created_cache.max_cache_len < generation_config.max_length: -++++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++++++ -++++++++ # return result -++++++++ -++++++++ -++++++++ -+++++++ -+++++++ -+++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -+++++++-- -+++++++2.27.0 -+++++++ -++++++-- -++++++2.27.0 -++++++ -+++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -+++++new file mode 100644 -+++++index 00000000..966529e4 -+++++--- /dev/null -++++++++ b/patches/0003-20261106secondcommit.patch -+++++@@ -0,0 +1,2769 @@ -++++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -++++++From: Pinoeer-kingxi <13022943007@163.com> -++++++Date: Thu, 6 Nov 2025 14:54:37 +0800 -++++++Subject: [PATCH 3/3] 20261106secondcommit -++++++ -++++++--- -++++++ .../models/deepseek/modeling_deepseek.py | 217 ++- -++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -++++++ patches/0001-20251104commit.patch | 1272 ----------------- -++++++ 3 files changed, 528 insertions(+), 2032 deletions(-) -++++++ delete mode 100644 patches/0001-20251104commit.patch -++++++ -++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++index 73773c22..2f9192bf 100644 -++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -++++++ -++++++ _CONFIG_FOR_DOC = "DeepseekConfig" -++++++ -+++++++_attn_mask_cache = {} -+++++++ -+++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): -+++++++ q_len = batch_and_seq[1] -+++++++ kv_len = batch_and_seq[1] + past_key_values_length -+++++++ key = (batch_and_seq[0], q_len, kv_len) -+++++++ -+++++++ if key in _attn_mask_cache: -+++++++ return _attn_mask_cache[key] -+++++++ -+++++++ mask = _prepare_4d_causal_attention_mask( -+++++++ attention_mask, -+++++++ batch_and_seq, -+++++++ inputs_embeds, -+++++++ past_key_values_length, -+++++++ ) -+++++++ _attn_mask_cache[key] = mask -+++++++ return mask -++++++ -++++++ def _get_unpad_data(attention_mask): -++++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) -++++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -++++++ return final_output -++++++ -++++++ -++++++- @no_grad() -++++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++- expert_cache = ops.zeros_like(x) -++++++- idxs = flat_expert_indices.argsort() -++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++- token_idxs = idxs // self.num_experts_per_tok -++++++- -++++++- for i, end_idx in enumerate(tokens_per_expert): -++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++- if start_idx == end_idx: -++++++- continue -++++++- expert = self.experts[i] -++++++- exp_token_idx = token_idxs[start_idx:end_idx] -++++++- expert_tokens = x[exp_token_idx] -++++++- expert_out = expert(expert_tokens) -++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++- -++++++- return expert_cache -++++++- -++++++ # @no_grad() -++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++- # # expert_cache = torch.zeros_like(x) -++++++- # # idxs = flat_expert_indices.argsort() -++++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++- # # token_idxs = idxs // self.num_experts_per_tok -++++++- # # for i, end_idx in enumerate(tokens_per_expert): -++++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++- # # if start_idx == end_idx: -++++++- # # continue -++++++- # # expert = self.experts[i] -++++++- # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++- # # expert_tokens = x[exp_token_idx] -++++++- # # expert_out = expert(expert_tokens) -++++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++- # # return expert_cache -+++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++ # expert_cache = ops.zeros_like(x) -++++++ # idxs = flat_expert_indices.argsort() -++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++ -++++++ # return expert_cache -++++++- # @no_grad() -++++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++- # expert_cache = ops.zeros_like(x) -+++++++ -+++++++ @no_grad() -+++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -+++++++ """ -+++++++ 优化版 MoE prefill: -+++++++ - 批量张量化处理同一个 expert 的所有 token -+++++++ - 跳过无 token 的专家 -+++++++ - 保持结果完全一致 -+++++++ """ -+++++++ # 初始化输出缓存 -+++++++ expert_cache = ops.zeros_like(x) -++++++ -++++++- # # 排序保证顺序一致 -++++++- # idxs = flat_expert_indices.argsort() -++++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++- # token_idxs = idxs // self.num_experts_per_tok -+++++++ # 排序(确保 scatter_add 位置对应原逻辑) -+++++++ idxs = flat_expert_indices.argsort() -+++++++ sorted_expert_indices = flat_expert_indices[idxs] -+++++++ sorted_token_indices = idxs // self.num_experts_per_tok -++++++ -++++++- # # 找出有 token 的专家 -++++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++++ # 每个 expert 的 token 数 -+++++++ tokens_per_expert = sorted_expert_indices.bincount() -++++++ -++++++- # for i in active_experts.tolist(): -++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++- # end_idx = tokens_per_expert[i] -++++++- # if start_idx == end_idx: # 没有 token -++++++- # continue -+++++++ # 找出有 token 的专家 -+++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -++++++ -++++++- # exp_token_idx = token_idxs[start_idx:end_idx] -++++++- # expert_tokens = x[exp_token_idx] -++++++- # expert_out = self.experts[i](expert_tokens) -++++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++++ for expert_id in active_experts.tolist(): -+++++++ # 取该 expert 对应的排序后 token 区间 -+++++++ start = (tokens_per_expert[:expert_id]).sum().item() -+++++++ end = start + tokens_per_expert[expert_id].item() -++++++ -++++++- # expert_cache = mindspore.mint.scatter_add( -++++++- # expert_cache, -++++++- # 0, -++++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++- # expert_out -++++++- # ) -+++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 -+++++++ expert_tokens = x[token_idx] # 取输入向量 -++++++ -++++++- # return expert_cache -+++++++ # 执行专家 MLP -+++++++ expert_out = self.experts[expert_id](expert_tokens) -+++++++ -+++++++ # 按权重缩放 -+++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] -+++++++ -+++++++ # 回写到缓存(等价 scatter_add) -+++++++ expert_cache = mindspore.mint.scatter_add( -+++++++ expert_cache, -+++++++ 0, -+++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++++ scaled_out -+++++++ ) -+++++++ -+++++++ return expert_cache -+++++++ -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # # expert_cache = torch.zeros_like(x) -+++++++ # # idxs = flat_expert_indices.argsort() -+++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -+++++++ # # token_idxs = idxs // self.num_experts_per_tok -+++++++ # # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -+++++++ # # if start_idx == end_idx: -+++++++ # # continue -+++++++ # # expert = self.experts[i] -+++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # # expert_tokens = x[exp_token_idx] -+++++++ # # expert_out = expert(expert_tokens) -+++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -+++++++ # # return expert_cache -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # for i, end_idx in enumerate(tokens_per_expert): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # if start_idx == end_idx: -+++++++ # continue -+++++++ # expert = self.experts[i] -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = expert(expert_tokens) -+++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -+++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -+++++++ -+++++++ # return expert_cache -+++++++ # @no_grad() -+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -+++++++ # expert_cache = ops.zeros_like(x) -+++++++ -+++++++ # # 排序保证顺序一致 -+++++++ # idxs = flat_expert_indices.argsort() -+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -+++++++ # token_idxs = idxs // self.num_experts_per_tok -+++++++ -+++++++ # # 找出有 token 的专家 -+++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -+++++++ -+++++++ # for i in active_experts.tolist(): -+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -+++++++ # end_idx = tokens_per_expert[i] -+++++++ # if start_idx == end_idx: # 没有 token -+++++++ # continue -+++++++ -+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] -+++++++ # expert_tokens = x[exp_token_idx] -+++++++ # expert_out = self.experts[i](expert_tokens) -+++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -+++++++ -+++++++ # expert_cache = mindspore.mint.scatter_add( -+++++++ # expert_cache, -+++++++ # 0, -+++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -+++++++ # expert_out -+++++++ # ) -+++++++ -+++++++ # return expert_cache -++++++ -++++++ -++++++ -++++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++- -++++++ # class DeepseekFlashAttention(nn.Module): -++++++ # """ -++++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using -++++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -++++++ -++++++ return attn_output, attn_weights, past_key_value -++++++ -+++++++ -++++++ Deepseek_ATTENTION_CLASSES = { -++++++ "eager": DeepseekAttention, -++++++ "flash-attention": DeepseekFlashAttention, -++++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -++++++ ) -++++++ else: -++++++ # 4d mask is passed through the layers -++++++- attention_mask = _prepare_4d_causal_attention_mask( -+++++++ # attention_mask = _prepare_4d_causal_attention_mask( -+++++++ # attention_mask, -+++++++ # (batch_size, seq_length), -+++++++ # inputs_embeds, -+++++++ # past_key_values_length, -+++++++ # ) -+++++++ #@dwj -+++++++ attention_mask = get_cached_causal_mask( -++++++ attention_mask, -++++++ (batch_size, seq_length), -++++++ inputs_embeds, -++++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++ # Initialize weights and apply final processing -++++++ self.post_init() -++++++ self.warm_up = False -+++++++ #@dwj -+++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( -+++++++ self.num_layers, -+++++++ self.num_attention_heads, -+++++++ self.head_dim, -+++++++ batch_size=1, -+++++++ max_length=self.max_length, -+++++++ dtype=mindspore.float16 -+++++++ ) -+++++++ -+++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): -+++++++ key_cache = [] -+++++++ value_cache = [] -+++++++ for _ in range(num_layers): -+++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) -+++++++ key_cache.append(k) -+++++++ value_cache.append(v) -+++++++ return key_cache, value_cache -+++++++ -++++++ -++++++ def warmup_moe_model_deep(self): -++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++index bced285c..ebd7782e 100644 -++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -++++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -++++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" -++++++ -++++++-Long_Prompt = False -++++++-PROMPT_LENGTH_THRESHOLD = 128 -+++++++Long_Prompt = 1 -+++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 -+++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 -+++++++ -+++++++_causal_mask_cache = {} -+++++++ -+++++++def get_cached_causal_mask_with_cache_position( -+++++++ attention_mask: mindspore.Tensor, -+++++++ sequence_length: int, -+++++++ target_length: int, -+++++++ dtype: mindspore.dtype, -+++++++ min_dtype: float, -+++++++ cache_position: mindspore.Tensor, -+++++++ batch_size: int, -+++++++): -+++++++ """ -+++++++ 带缓存的 causal mask 构造函数 -+++++++ """ -+++++++ # q_len 是当前 query 长度 -+++++++ q_len = sequence_length -+++++++ # kv_len 是 target_length -+++++++ kv_len = target_length -+++++++ -+++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 -+++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) -+++++++ -+++++++ if key in _causal_mask_cache: -+++++++ return _causal_mask_cache[key] -+++++++ -+++++++ # 调用原来的 mask 构造逻辑 -+++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++++ attention_mask, -+++++++ sequence_length=sequence_length, -+++++++ target_length=target_length, -+++++++ dtype=dtype, -+++++++ min_dtype=min_dtype, -+++++++ cache_position=cache_position, -+++++++ batch_size=batch_size, -+++++++ ) -+++++++ # 缓存结果 -+++++++ _causal_mask_cache[key] = causal_mask -+++++++ return causal_mask -++++++ -++++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -++++++ def _prepare_4d_causal_attention_mask_with_cache_position( -++++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++++ -++++++ -++++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe -+++++++# class Qwen2MoeAttention(nn.Module): -+++++++# """ -+++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -+++++++# and "Generating Long Sequences with Sparse Transformers". -+++++++# """ -+++++++ -+++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -+++++++# super().__init__() -+++++++# self.config = config -+++++++# self.layer_idx = layer_idx -+++++++# if layer_idx is None: -+++++++# logger.warning_once( -+++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -+++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++++# "when creating this class." -+++++++# ) -+++++++ -+++++++# self.hidden_size = config.hidden_size -+++++++# self.num_heads = config.num_attention_heads -+++++++# self.head_dim = self.hidden_size // self.num_heads -+++++++# self.num_key_value_heads = config.num_key_value_heads -+++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads -+++++++# self.max_position_embeddings = config.max_position_embeddings -+++++++# self.rope_theta = config.rope_theta -+++++++# self.is_causal = True -+++++++# self.attention_dropout = config.attention_dropout -+++++++ -+++++++# if (self.head_dim * self.num_heads) != self.hidden_size: -+++++++# raise ValueError( -+++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" -+++++++# f" and `num_heads`: {self.num_heads})." -+++++++# ) -+++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -+++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -+++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -+++++++ -+++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( -+++++++# self.head_dim, -+++++++# max_position_embeddings=self.max_position_embeddings, -+++++++# base=self.rope_theta, -+++++++# ) -+++++++ -+++++++# def forward( -+++++++# self, -+++++++# hidden_states: mindspore.Tensor, -+++++++# attention_mask: Optional[mindspore.Tensor] = None, -+++++++# position_ids: Optional[mindspore.Tensor] = None, -+++++++# past_key_value: Optional[Cache] = None, -+++++++# output_attentions: bool = False, -+++++++# use_cache: bool = False, -+++++++# cache_position: Optional[mindspore.Tensor] = None, -+++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -+++++++ -+++++++ -+++++++ -+++++++# bsz, q_len, _ = hidden_states.shape -+++++++ -+++++++# query_states = self.q_proj(hidden_states) -+++++++# key_states = self.k_proj(hidden_states) -+++++++# value_states = self.v_proj(hidden_states) -+++++++ -+++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -+++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -+++++++ -+++++++# kv_seq_len = key_states.shape[-2] -+++++++# if past_key_value is not None: -+++++++# if self.layer_idx is None: -+++++++# raise ValueError( -+++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -+++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -+++++++# "with a layer index." -+++++++# ) -+++++++# if isinstance(past_key_value, StaticCache): -+++++++# kv_seq_len = key_states.shape[-2] -+++++++# else: -+++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -+++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -+++++++ -+++++++# if past_key_value is not None: -+++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++# if isinstance(past_key_value, StaticCache): -+++++++# kv_seq_len = key_states.shape[-2] -+++++++ -+++++++# # repeat k/v heads if n_kv_heads < n_heads -+++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++ -+++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++++ -+++++++# if attention_mask is not None: -+++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++++# attn_weights = attn_weights + causal_mask -+++++++ -+++++++# # upcast attention to fp32 -+++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++++++# attn_output = ops.matmul(attn_weights, value_states) -+++++++ -+++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++++++# raise ValueError( -+++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -+++++++# f" {attn_output.shape}" -+++++++# ) -+++++++ -+++++++# attn_output = ops.transpose(attn_output, 1, 2) -+++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++++ -+++++++# attn_output = self.o_proj(attn_output) -+++++++# # @lwx -+++++++ -+++++++# # max_seq_len = self.max_position_embeddings # 2048 -+++++++ -+++++++# # if attention_mask is not None: -+++++++# # # attention_mask: [B, 1, Sq, Sk] -+++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -+++++++ -+++++++# # # pad 到 [max_seq_len, max_seq_len] -+++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -+++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -+++++++# # global_attention_mask = padded_mask -+++++++# # else: -+++++++# # global_attention_mask = None -+++++++ -+++++++ -+++++++# # sparse_mode=3 -+++++++# # attn_output = mindspore.ops.flash_attention_score( -+++++++# # query=query_states, -+++++++# # key=key_states, -+++++++# # value=value_states, -+++++++# # real_shift=None, -+++++++# # padding_mask=None, -+++++++ -+++++++# # head_num=self.num_heads, -+++++++# # attn_mask=global_attention_mask, -+++++++# # keep_prob=1.0 - self.attention_dropout, -+++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++# # input_layout="BNSD", -+++++++# # pre_tokens=2147483647, -+++++++# # next_tokens=2147483647, -+++++++# # inner_precise=0, -+++++++# # drop_mask=None, -+++++++# # prefix=None, -+++++++# # actual_seq_qlen=None, -+++++++# # actual_seq_kvlen=None, -+++++++# # sparse_mode=sparse_mode, -+++++++# # ) -+++++++# if not output_attentions: -+++++++# attn_weights = None -+++++++ -+++++++# return attn_output, attn_weights, past_key_value -+++++++ -++++++ class Qwen2MoeAttention(nn.Module): -++++++ """ -++++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer -++++++- and "Generating Long Sequences with Sparse Transformers". -++++++- """ -+++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -++++++ -+++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: -+++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 -+++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 -+++++++ -+++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 -+++++++ """ -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++ super().__init__() -++++++ self.config = config -++++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -++++++ if layer_idx is None: -++++++ logger.warning_once( -++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " -++++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -+++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -++++++ "when creating this class." -++++++ ) -++++++ -++++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -++++++ use_cache: bool = False, -++++++ cache_position: Optional[mindspore.Tensor] = None, -++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++- -++++++ -++++++- -+++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -++++++ bsz, q_len, _ = hidden_states.shape -++++++ -++++++ query_states = self.q_proj(hidden_states) -++++++ key_states = self.k_proj(hidden_states) -++++++ value_states = self.v_proj(hidden_states) -++++++ -++++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -++++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -++++++- -+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -+++++++ -++++++ kv_seq_len = key_states.shape[-2] -++++++ if past_key_value is not None: -++++++- if self.layer_idx is None: -++++++- raise ValueError( -++++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++- "with a layer index." -++++++- ) -++++++- if isinstance(past_key_value, StaticCache): -++++++- kv_seq_len = key_states.shape[-2] -++++++- else: -++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -+++++++ -++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++ -++++++ if past_key_value is not None: -++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -+++++++ -+++++++ # --- 2. 动态调度核心注意力计算 --- -+++++++ global Long_Prompt -+++++++ if Long_Prompt >= 1: -+++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- -+++++++ fa_attention_mask = None -+++++++ if attention_mask is not None: -+++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -+++++++ fa_attention_mask = (mask_slice != 0) -+++++++ -+++++++ attn_output = mindspore.ops.flash_attention_score( -+++++++ query=query_states, -+++++++ key=key_states, -+++++++ value=value_states, -+++++++ head_num=self.num_heads, -+++++++ attn_mask=fa_attention_mask, -+++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, -+++++++ scalar_value=1.0 / math.sqrt(self.head_dim), -+++++++ input_layout="BNSD", -+++++++ sparse_mode=0, -+++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -+++++++ ) -++++++ -++++++- if isinstance(past_key_value, StaticCache): -++++++- kv_seq_len = key_states.shape[-2] -+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -+++++++ attn_output = self.o_proj(attn_output) -+++++++ attn_weights = None -+++++++ if output_attentions: -+++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -++++++ -++++++- # repeat k/v heads if n_kv_heads < n_heads -++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++- -++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -+++++++ else: -+++++++ # --- Eager Attention 路径 (用于短序列和解码) --- -+++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) -+++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) -+++++++ -+++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++ -++++++- if attention_mask is not None: -++++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++- attn_weights = attn_weights + causal_mask -+++++++ if attention_mask is not None: -+++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -+++++++ attn_weights = attn_weights + causal_mask -++++++ -++++++- # upcast attention to fp32 -++++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -++++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -++++++- attn_output = ops.matmul(attn_weights, value_states) -+++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) -+++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) -+++++++ attn_output = ops.matmul(attn_weights, value_states) -++++++ -++++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -++++++- raise ValueError( -++++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" -++++++- f" {attn_output.shape}" -++++++- ) -+++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): -+++++++ raise ValueError( -+++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" -+++++++ ) -++++++ -++++++- attn_output = ops.transpose(attn_output, 1, 2) -++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++++ attn_output = ops.transpose(attn_output, 1, 2) -+++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -+++++++ attn_output = self.o_proj(attn_output) -++++++ -++++++- attn_output = self.o_proj(attn_output) -++++++- # @lwx -+++++++ if not output_attentions: -+++++++ attn_weights = None -++++++ -++++++- # max_seq_len = self.max_position_embeddings # 2048 -++++++- -++++++- # if attention_mask is not None: -++++++- # # attention_mask: [B, 1, Sq, Sk] -++++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++- -++++++- # # pad 到 [max_seq_len, max_seq_len] -++++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++- # global_attention_mask = padded_mask -++++++- # else: -++++++- # global_attention_mask = None -++++++- -++++++- -++++++- # sparse_mode=3 -++++++- # attn_output = mindspore.ops.flash_attention_score( -++++++- # query=query_states, -++++++- # key=key_states, -++++++- # value=value_states, -++++++- # real_shift=None, -++++++- # padding_mask=None, -++++++- -++++++- # head_num=self.num_heads, -++++++- # attn_mask=global_attention_mask, -++++++- # keep_prob=1.0 - self.attention_dropout, -++++++- # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++- # input_layout="BNSD", -++++++- # pre_tokens=2147483647, -++++++- # next_tokens=2147483647, -++++++- # inner_precise=0, -++++++- # drop_mask=None, -++++++- # prefix=None, -++++++- # actual_seq_qlen=None, -++++++- # actual_seq_kvlen=None, -++++++- # sparse_mode=sparse_mode, -++++++- # ) -++++++- if not output_attentions: -++++++- attn_weights = None -++++++- -++++++ return attn_output, attn_weights, past_key_value -++++++ -++++++- -++++++ # class Qwen2MoeFlashAttention(nn.Module): -++++++ # """ -++++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -++++++ # return final_hidden_states, router_logits -++++++ -++++++ -++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++-# """ -++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 -++++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 -++++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 -++++++-# """ -++++++-# def __init__(self, config: Qwen2MoeConfig): -++++++-# super().__init__() -++++++-# self.num_experts = config.num_experts -++++++-# self.top_k = config.num_experts_per_tok -++++++-# self.norm_topk_prob = config.norm_topk_prob -++++++- -++++++-# # 门控网络 -++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++-# # 专家列表 -++++++-# self.experts = nn.ModuleList( -++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++-# ) -++++++-# # 共享专家 -++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_decode( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# """ -++++++-# 【解码路径】针对 sequence_length=1 的极致优化。 -++++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 -++++++-# """ -++++++-# batch_size, hidden_dim = hidden_states.shape -++++++- -++++++-# expert_outputs_list = [ -++++++-# ops.cat([ -++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++-# ], dim=0) -++++++-# for i in range(batch_size) -++++++-# ] -++++++- -++++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- -++++++-# # shape: (batch_size, top_k, hidden_dim) -++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++- -++++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 -++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++- -++++++-# return moe_output.squeeze(1) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_prefill( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# """ -++++++-# 【预填充路径】针对 sequence_length > 1 的优化。 -++++++-# 按专家对 Token 进行分组,并进行批处理。 -++++++-# """ -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens = hidden_states.shape[0] -++++++-# flat_selected_experts = selected_experts.flatten() -++++++- -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++- -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++- -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++- -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++-# selected_token_indices = token_indices[mask] -++++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++++- -++++++-# current_states = hidden_states[selected_token_indices] -++++++- -++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++- -++++++-# moe_output = moe_output.index_add( -++++++-# dim=0, -++++++-# index=selected_token_indices, -++++++-# source=expert_output.to(hidden_states.dtype) -++++++-# ) -++++++-# return moe_output -++++++- -++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-# """ -++++++-# 顶层 forward 方法,作为智能分发器。 -++++++-# """ -++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- -++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-# router_logits = self.gate(hidden_states_reshaped) -++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- -++++++-# if self.norm_topk_prob: -++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- -++++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++-# moe_output = None -++++++-# # 在推理时,根据序列长度选择最优路径 -++++++-# if not self.training: -++++++-# if sequence_length == 1: -++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++++-# else: -++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++++-# else: -++++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 -++++++-# raise NotImplementedError("Training path is not implemented.") -++++++- -++++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) -++++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) -++++++- -++++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights -++++++- -++++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) -++++++- -++++++-# return final_hidden_states, router_logits -++++++- -++++++- -++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++-# """ -++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 -++++++-# """ -++++++-# def __init__(self, config: Qwen2MoeConfig): -++++++-# super().__init__() -++++++-# self.num_experts = config.num_experts -++++++-# self.top_k = config.num_experts_per_tok -++++++-# self.norm_topk_prob = config.norm_topk_prob -++++++- -++++++-# # 门控网络 -++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++-# # 专家列表 -++++++-# self.experts = nn.ModuleList( -++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++-# ) -++++++-# # 共享专家 -++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_decode( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# batch_size, _ = hidden_states.shape -++++++-# expert_outputs_list = [ -++++++-# ops.cat([ -++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++-# ], dim=0) -++++++-# for i in range(batch_size) -++++++-# ] -++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++-# return moe_output.squeeze(1) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_prefill( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens = hidden_states.shape[0] -++++++-# flat_selected_experts = selected_experts.flatten() -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++- -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++-# selected_token_indices = token_indices[mask] -++++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++++-# current_states = hidden_states[selected_token_indices] -++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++-# moe_output = moe_output.index_add( -++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++++-# ) -++++++-# return moe_output -++++++- -++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-# """ -++++++-# 顶层 forward 方法,作为智能分发器。 -++++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 -++++++-# """ -++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- -++++++-# # 1. 门控计算 (通用逻辑) -++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-# router_logits = self.gate(hidden_states_reshaped) -++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- -++++++-# if self.norm_topk_prob: -++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- -++++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++-# # 2. 智能分发到最优 MoE 路径 -++++++-# moe_output = None -++++++-# if not self.training: -++++++-# if sequence_length == 1: -++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) -++++++-# else: -++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) -++++++-# else: -++++++-# raise NotImplementedError("Training path is not implemented.") -++++++- -++++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 -++++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 -++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++- -++++++-# # 4. 合并 MoE 输出和共享专家输出 -++++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 -++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++- -++++++-# # 5. 恢复原始形状并返回 -++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++- -++++++-# return final_hidden_states, router_logits -++++++- -++++++-# prefill fastest -++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++-# """ -++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), -++++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 -++++++-# """ -++++++-# def __init__(self, config: Qwen2MoeConfig): -++++++-# super().__init__() -++++++-# self.num_experts = config.num_experts -++++++-# self.top_k = config.num_experts_per_tok -++++++-# self.norm_topk_prob = config.norm_topk_prob -++++++- -++++++-# # 门控网络 -++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++-# # 专家列表 -++++++-# self.experts = nn.ModuleList( -++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++-# ) -++++++-# # 共享专家 -++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_dispatch( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# """ -++++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 -++++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 -++++++-# """ -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens, _ = hidden_states.shape -++++++- -++++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 -++++++-# flat_selected_experts = selected_experts.flatten() -++++++-# flat_routing_weights = routing_weights.flatten() -++++++- -++++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++- -++++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++- -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++- -++++++-# # 找到所有分配给该专家的 token -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++- -++++++-# # 使用 mask 选取对应的 token 和权重 -++++++-# current_token_indices = token_indices[mask] -++++++-# current_routing_weights = flat_routing_weights[mask] -++++++-# current_hidden_states = hidden_states[current_token_indices] -++++++- -++++++-# # 对这些 token 进行批处理 -++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++++- -++++++-# # 使用 index_add 将结果精确地加回到对应位置 -++++++-# moe_output = moe_output.index_add( -++++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) -++++++-# ) -++++++-# return moe_output -++++++- -++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-# """ -++++++-# 顶层 forward 方法,作为智能分发器。 -++++++-# """ -++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- -++++++-# # 1. 门控计算 -++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-# router_logits = self.gate(hidden_states_reshaped) -++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- -++++++-# if self.norm_topk_prob: -++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- -++++++-# routing_weights = routing_weights.to(hidden_states.dtype) -++++++- -++++++-# # 2. 调用统一的 MoE 计算内核 -++++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 -++++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -++++++- -++++++-# # 3. 统一处理共享专家 -++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++- -++++++-# # 4. 合并输出 -++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++- -++++++-# # 5. 恢复原始形状并返回 -++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++- -++++++-# return final_hidden_states, router_logits -++++++- -++++++- -++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++-# """ -++++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 -++++++-# 【最终高性能与高精度版】: -++++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 -++++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 -++++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 -++++++-# 3. 这样实现了速度和准确性的两全其美。 -++++++-# """ -++++++-# def __init__(self, config: Qwen2MoeConfig): -++++++-# super().__init__() -++++++-# self.num_experts = config.num_experts -++++++-# self.top_k = config.num_experts_per_tok -++++++-# self.norm_topk_prob = config.norm_topk_prob -++++++- -++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++-# self.experts = nn.ModuleList( -++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++-# ) -++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_decode( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# """ -++++++-# 【解码路径】极致优化版:bmm + 高精度累加。 -++++++-# """ -++++++-# original_dtype = hidden_states.dtype -++++++-# batch_size, _ = hidden_states.shape -++++++- -++++++-# expert_outputs_list = [ -++++++-# ops.cat([ -++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++-# ], dim=0) -++++++-# for i in range(batch_size) -++++++-# ] -++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++- -++++++-# # 在 float32 下执行 bmm,得到高精度结果 -++++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) -++++++- -++++++-# # 将高精度结果转换回原始数据类型 -++++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) -++++++- -++++++-# return moe_output -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_prefill( -++++++-# self, -++++++-# hidden_states: mindspore.Tensor, -++++++-# selected_experts: mindspore.Tensor, -++++++-# routing_weights: mindspore.Tensor -++++++-# ) -> mindspore.Tensor: -++++++-# """ -++++++-# 【预填充路径】与原始实现一致,结果精确。 -++++++-# """ -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens, _ = hidden_states.shape -++++++-# flat_selected_experts = selected_experts.flatten() -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++- -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++-# selected_token_indices = token_indices[mask] -++++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++++-# current_states = hidden_states[selected_token_indices] -++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++-# moe_output = moe_output.index_add( -++++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) -++++++-# ) -++++++-# return moe_output -++++++- -++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++- -++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-# router_logits = self.gate(hidden_states_reshaped) -++++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++- -++++++-# if self.norm_topk_prob: -++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- -++++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 -++++++-# # 如果模型主体是 float16,后续再转换 -++++++- -++++++-# moe_output = None -++++++-# if not self.training: -++++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 -++++++-# # _moe_infer_decode 内部会处理好类型转换 -++++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) -++++++-# if sequence_length == 1: -++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++++-# else: -++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) -++++++-# else: -++++++-# raise NotImplementedError("Training path is not implemented.") -++++++- -++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++- -++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++- -++++++-# return final_hidden_states, router_logits -++++++- -++++++- -++++++-# class Qwen2MoeSparseMoeBlock(nn.Module): -++++++-# """ -++++++-# 【融合版】一个混合专家模块,内置两种推理策略, -++++++-# 由外部全局变量 `Long_Prompt` 控制: -++++++- -++++++-# - if Long_Prompt is True: 【精度优先模式】 -++++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 -++++++-# 适用于处理长序列,避免误差累积。 -++++++- -++++++-# - if Long_Prompt is False: 【速度优先模式】 -++++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, -++++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 -++++++-# """ -++++++-# def __init__(self, config: Qwen2MoeConfig): -++++++-# super().__init__() -++++++-# self.num_experts = config.num_experts -++++++-# self.top_k = config.num_experts_per_tok -++++++-# self.norm_topk_prob = config.norm_topk_prob -++++++- -++++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -++++++-# self.experts = nn.ModuleList( -++++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -++++++-# ) -++++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-# # --- 速度优先模式的辅助函数 --- -++++++-# @no_grad() -++++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++-# original_dtype = hidden_states.dtype -++++++-# batch_size, _ = hidden_states.shape -++++++-# expert_outputs_list = [ -++++++-# ops.cat([ -++++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] -++++++-# ], dim=0) -++++++-# for i in range(batch_size) -++++++-# ] -++++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) -++++++-# weights_fp32 = routing_weights.to(mindspore.float32) -++++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) -++++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++++-# return moe_output_fp32.squeeze(1).to(original_dtype) -++++++- -++++++-# @no_grad() -++++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens, _ = hidden_states.shape -++++++-# flat_selected_experts = selected_experts.flatten() -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++-# selected_token_indices = token_indices[mask] -++++++-# selected_routing_weights = routing_weights.flatten()[mask] -++++++-# current_states = hidden_states[selected_token_indices] -++++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++-# return moe_output -++++++- -++++++-# # --- 精度优先模式的辅助函数 --- -++++++-# @no_grad() -++++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++-# moe_output = ops.zeros_like(hidden_states) -++++++-# num_tokens, _ = hidden_states.shape -++++++-# flat_selected_experts = selected_experts.flatten() -++++++-# flat_routing_weights = routing_weights.flatten() -++++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -++++++-# active_experts = ops.unique(flat_selected_experts) -++++++-# for expert_idx_tensor in active_experts: -++++++-# expert_idx = expert_idx_tensor.item() -++++++-# expert_layer = self.experts[expert_idx] -++++++-# mask = (flat_selected_experts == expert_idx_tensor) -++++++-# current_token_indices = token_indices[mask] -++++++-# current_routing_weights = flat_routing_weights[mask] -++++++-# current_hidden_states = hidden_states[current_token_indices] -++++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) -++++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++-# return moe_output -++++++- -++++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-# # 声明我们将要使用一个在模块外部定义的全局变量 -++++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 -++++++-# global Long_Prompt -++++++- -++++++-# # 1. 门控计算 (所有模式通用) -++++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-# router_logits = self.gate(hidden_states_reshaped) -++++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) -++++++-# if self.norm_topk_prob: -++++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++- -++++++-# moe_output = None -++++++-# if not self.training: -++++++-# # 根据 Long_Prompt 标志选择模式 -++++++-# if Long_Prompt: -++++++-# # --- 精度优先模式 --- -++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++-# else: -++++++-# # --- 速度优先模式 --- -++++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++-# if sequence_length == 1: -++++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++-# else: -++++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++-# else: -++++++-# raise NotImplementedError("Training path is not implemented.") -++++++- -++++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) -++++++- -++++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output -++++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) -++++++- -++++++-# return final_hidden_states, router_logits -++++++- -++++++ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ """ -++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -++++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -++++++ return moe_output_fp32.squeeze(1).to(original_dtype) -++++++ -+++++++ # @no_grad() -+++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -+++++++ # num_tokens, _ = hidden_states.shape -+++++++ # flat_selected_experts = selected_experts.flatten() -+++++++ # sorted_expert_indices = flat_selected_experts.argsort() -+++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -+++++++ # original_token_indices = sorted_expert_indices // self.top_k -+++++++ # moe_output = ops.zeros_like(hidden_states) -+++++++ # current_token_offset = 0 -+++++++ # for i in range(self.num_experts): -+++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset -+++++++ # if expert_token_count == 0: -+++++++ # continue -+++++++ # end_offset = current_token_offset + expert_token_count -+++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -+++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -+++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] -+++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -+++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -+++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -+++++++ # current_token_offset += expert_token_count -+++++++ # return moe_output -+++++++ -++++++ @no_grad() -++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++- num_tokens, _ = hidden_states.shape -++++++- flat_selected_experts = selected_experts.flatten() -++++++- sorted_expert_indices = flat_selected_experts.argsort() -++++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -++++++- original_token_indices = sorted_expert_indices // self.top_k -+++++++ """ -+++++++ 优化版 MoE prefill (速度优先模式): -+++++++ - 批量张量化处理同一个 expert 的所有 token -+++++++ - 跳过无 token 的专家 -+++++++ - 保持结果完全一致 -+++++++ """ -++++++ moe_output = ops.zeros_like(hidden_states) -++++++- current_token_offset = 0 -++++++- for i in range(self.num_experts): -++++++- expert_token_count = tokens_per_expert[i] - current_token_offset -++++++- if expert_token_count == 0: -++++++- continue -++++++- end_offset = current_token_offset + expert_token_count -++++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -++++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -++++++- expert_hidden_states = hidden_states[expert_original_token_indices] -++++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -++++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -++++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -++++++- current_token_offset += expert_token_count -+++++++ -+++++++ flat_selected_experts = selected_experts.flatten() -+++++++ flat_routing_weights = routing_weights.flatten() -+++++++ -+++++++ idxs = flat_selected_experts.argsort() -+++++++ sorted_expert_indices = flat_selected_experts[idxs] -+++++++ sorted_token_indices = idxs // self.top_k -+++++++ -+++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) -+++++++ -+++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -+++++++ -+++++++ for expert_id in active_experts.tolist(): -+++++++ start = int(tokens_per_expert[:expert_id].sum().item()) -+++++++ end = start + int(tokens_per_expert[expert_id].item()) -+++++++ -+++++++ token_idx = sorted_token_indices[start:end] -+++++++ expert_tokens = hidden_states[token_idx] -+++++++ -+++++++ expert_out = self.experts[expert_id](expert_tokens) -+++++++ -+++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) -+++++++ -+++++++ moe_output = mindspore.mint.scatter_add( -+++++++ moe_output, -+++++++ 0, -+++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), -+++++++ scaled_out.to(hidden_states.dtype) -+++++++ ) -+++++++ -++++++ return moe_output -++++++ -+++++++ -++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -++++++ @no_grad() -++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -++++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++ -++++++ moe_output = None -++++++- if Long_Prompt: -++++++- # --- 精度优先模式 (ACCURACY MODE) --- -++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ # if Long_Prompt==0: -+++++++ # # --- 精度优先模式 (ACCURACY MODE) --- -+++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ # else: -+++++++ # # --- 速度优先模式 (SPEED MODE) --- -+++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++ # if sequence_length == 1: -+++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ # else: -+++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ -+++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) -+++++++ if sequence_length == 1: -+++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++ else: -++++++- # --- 速度优先模式 (SPEED MODE) --- -++++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) -++++++- if sequence_length == 1: -++++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++- else: -++++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -++++++- -+++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) -+++++++ -++++++ -++++++ # 3. 共享专家计算与合并 (所有模式通用) -++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ -++++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++ -++++++ return final_hidden_states, router_logits -++++++ -+++++++ -++++++ class Qwen2MoeDecoderLayer(nn.Module): -++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -++++++ super().__init__() -++++++ self.hidden_size = config.hidden_size -++++++ -++++++- # if Long_Prompt: -++++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++- # else: -+++++++ # if Long_Prompt == 2: -++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -+++++++ # else: -+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ -++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++ -++++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++++ ) -++++++ -++++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -++++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++++ # attention_mask, -+++++++ # sequence_length=sequence_length, -+++++++ # target_length=target_length, -+++++++ # dtype=dtype, -+++++++ # min_dtype=min_dtype, -+++++++ # cache_position=cache_position, -+++++++ # batch_size=input_tensor.shape[0], -+++++++ # ) -+++++++ #@dwj -+++++++ causal_mask = get_cached_causal_mask_with_cache_position( -++++++ attention_mask, -++++++ sequence_length=sequence_length, -++++++ target_length=target_length, -++++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -++++++ """ -++++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD -+++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache -+++++++ _causal_mask_cache.clear() -++++++ -++++++ input_ids = kwargs.get("input_ids") -++++++ if input_ids is None and args: -++++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ -++++++ if input_ids is not None: -++++++ prompt_length = input_ids.shape[1] -++++++- -++++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: -++++++- Long_Prompt = True -+++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: -+++++++ Long_Prompt = 2 -+++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: -+++++++ Long_Prompt = 0 -++++++ else: -++++++- Long_Prompt = False -+++++++ Long_Prompt = 1 -+++++++ -++++++ -++++++ return super().generate(*args, **kwargs) -++++++ -++++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++ dtype = self.lm_head.weight.dtype -++++++ min_dtype = float(ops.finfo(dtype).min) -++++++ -++++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -+++++++ # attention_mask, -+++++++ # sequence_length=sequence_length, -+++++++ # target_length=past_key_values.get_max_length(), -+++++++ # dtype=dtype, -+++++++ # min_dtype=min_dtype, -+++++++ # cache_position=cache_position, -+++++++ # batch_size=batch_size, -+++++++ # ) -+++++++ -+++++++ #@dwj -+++++++ attention_mask = get_cached_causal_mask_with_cache_position( -++++++ attention_mask, -++++++ sequence_length=sequence_length, -++++++ target_length=past_key_values.get_max_length(), -++++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -++++++deleted file mode 100644 -++++++index 6dfb5b93..00000000 -++++++--- a/patches/0001-20251104commit.patch -+++++++++ /dev/null -++++++@@ -1,1272 +0,0 @@ -++++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -++++++-From: Pinoeer-kingxi <13022943007@163.com> -++++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 -++++++-Subject: [PATCH] 20251104commit -++++++- -++++++---- -++++++- mindnlp/transformers/cache_utils.py | 28 +- -++++++- .../models/deepseek/modeling_deepseek.py | 149 ++- -++++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -++++++- 3 files changed, 976 insertions(+), 87 deletions(-) -++++++- -++++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py -++++++-index cadd2e04..02f8d4be 100644 -++++++---- a/mindnlp/transformers/cache_utils.py -++++++-+++ b/mindnlp/transformers/cache_utils.py -++++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): -++++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -++++++- # k_out[:, :, cache_position] = key_states -++++++- # v_out[:, :, cache_position] = value_states -++++++-- if ON_ORANGE_PI: -++++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++-- else: -++++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++-- -++++++-+ # if ON_ORANGE_PI: -++++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) -++++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) -++++++-+ # else: -++++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy -++++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) -++++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) -++++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 -++++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] -++++++-+ if cache_position.ndim > 1: -++++++-+ cache_position = cache_position.flatten() -++++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) -++++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): -++++++-+ cache_position = cache_position.int() -++++++-+ -++++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) -++++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 -++++++-+ k_out[:, :, cache_position] = key_states -++++++-+ v_out[:, :, cache_position] = value_states -++++++-+ -++++++- return k_out, v_out -++++++- -++++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: -++++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++-index c695b944..d8303e45 100644 -++++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -++++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -++++++- # Copied from transformers.models.llama.modeling_llama.rotate_half -++++++- def rotate_half(x): -++++++- """Rotates half the hidden dims of the input.""" -++++++-- x1 = x[..., : x.shape[-1] // 2] -++++++-- x2 = x[..., x.shape[-1] // 2 :] -++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++-+ # x1 = x[..., : x.shape[-1] // 2] -++++++-+ # x2 = x[..., x.shape[-1] // 2 :] -++++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++- return ops.cat((-x2, x1), dim=-1) -++++++- -++++++- -++++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -++++++- if self.training: -++++++- raise NotImplementedError("Training is not supported yet.") -++++++- else: -++++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++-- if self.config.n_shared_experts is not None: -++++++-- y = y + self.shared_experts(identity) -++++++-- return y -++++++-+ # @lwx -++++++-+ if orig_shape[1] == 1: -++++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) -++++++-+ y=y.view(*orig_shape) -++++++-+ if self.config.n_shared_experts is not None: -++++++-+ y = y + self.shared_experts(identity) -++++++-+ return y -++++++-+ else: -++++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -++++++-+ if self.config.n_shared_experts is not None: -++++++-+ y = y + self.shared_experts(identity) -++++++-+ return y -++++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -++++++-+ # if self.config.n_shared_experts is not None: -++++++-+ # y = y + self.shared_experts(identity) -++++++-+ # return y -++++++-+ -++++++-+ @no_grad() -++++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -++++++-+ -++++++-+ expert_cache = ops.zeros_like(x) -++++++-+ for i in range(self.num_experts_per_tok): -++++++-+ expert_id = flat_expert_indices[i].item() -++++++-+ weight = flat_expert_weights[i].item() -++++++-+ expert = self.experts[expert_id] -++++++-+ expert_out = expert(x) -++++++-+ expert_cache += expert_out * weight -++++++-+ return expert_cache -++++++- -++++++- @no_grad() -++++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++-- # expert_cache = torch.zeros_like(x) -++++++-- # idxs = flat_expert_indices.argsort() -++++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++-- # token_idxs = idxs // self.num_experts_per_tok -++++++-- # for i, end_idx in enumerate(tokens_per_expert): -++++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++-- # if start_idx == end_idx: -++++++-- # continue -++++++-- # expert = self.experts[i] -++++++-- # exp_token_idx = token_idxs[start_idx:end_idx] -++++++-- # expert_tokens = x[exp_token_idx] -++++++-- # expert_out = expert(expert_tokens) -++++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++-- # return expert_cache -++++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -++++++- expert_cache = ops.zeros_like(x) -++++++- idxs = flat_expert_indices.argsort() -++++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++- token_idxs = idxs // self.num_experts_per_tok -++++++-+ -++++++- for i, end_idx in enumerate(tokens_per_expert): -++++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++- if start_idx == end_idx: -++++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -++++++- expert_out = expert(expert_tokens) -++++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++-+ -++++++- return expert_cache -++++++-+ -++++++-+ # @no_grad() -++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++-+ # # expert_cache = torch.zeros_like(x) -++++++-+ # # idxs = flat_expert_indices.argsort() -++++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -++++++-+ # # token_idxs = idxs // self.num_experts_per_tok -++++++-+ # # for i, end_idx in enumerate(tokens_per_expert): -++++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -++++++-+ # # if start_idx == end_idx: -++++++-+ # # continue -++++++-+ # # expert = self.experts[i] -++++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] -++++++-+ # # expert_tokens = x[exp_token_idx] -++++++-+ # # expert_out = expert(expert_tokens) -++++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -++++++-+ # # return expert_cache -++++++-+ # expert_cache = ops.zeros_like(x) -++++++-+ # idxs = flat_expert_indices.argsort() -++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++-+ # token_idxs = idxs // self.num_experts_per_tok -++++++-+ -++++++-+ # for i, end_idx in enumerate(tokens_per_expert): -++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++-+ # if start_idx == end_idx: -++++++-+ # continue -++++++-+ # expert = self.experts[i] -++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++-+ # expert_tokens = x[exp_token_idx] -++++++-+ # expert_out = expert(expert_tokens) -++++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -++++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -++++++-+ -++++++-+ # return expert_cache -++++++-+ # @no_grad() -++++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -++++++-+ # expert_cache = ops.zeros_like(x) -++++++-+ -++++++-+ # # 排序保证顺序一致 -++++++-+ # idxs = flat_expert_indices.argsort() -++++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -++++++-+ # token_idxs = idxs // self.num_experts_per_tok -++++++-+ -++++++-+ # # 找出有 token 的专家 -++++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -++++++-+ -++++++-+ # for i in active_experts.tolist(): -++++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -++++++-+ # end_idx = tokens_per_expert[i] -++++++-+ # if start_idx == end_idx: # 没有 token -++++++-+ # continue -++++++-+ -++++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] -++++++-+ # expert_tokens = x[exp_token_idx] -++++++-+ # expert_out = self.experts[i](expert_tokens) -++++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -++++++-+ -++++++-+ # expert_cache = mindspore.mint.scatter_add( -++++++-+ # expert_cache, -++++++-+ # 0, -++++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -++++++-+ # expert_out -++++++-+ # ) -++++++-+ -++++++-+ # return expert_cache -++++++-+ -++++++-+ -++++++- -++++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -++++++- # """ -++++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++- -++++++- # Initialize weights and apply final processing -++++++- self.post_init() -++++++-+ self.warm_up = False -++++++-+ -++++++-+ def warmup_moe_model_deep(self): -++++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") -++++++-+ test_texts = [ -++++++-+ "warmup short", -++++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", -++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -++++++-+ ] -++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++-+ if tokenizer is None: -++++++-+ from mindnlp.transformers import AutoTokenizer -++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++-+ self._warmup_tokenizer = tokenizer -++++++-+ -++++++-+ for text in test_texts: -++++++-+ inputs = tokenizer(text, return_tensors="ms") -++++++-+ with mindspore._no_grad(): -++++++-+ _ = self(**inputs, use_cache=False) -++++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -++++++- -++++++- def get_input_embeddings(self): -++++++- return self.model.embed_tokens -++++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++- ```""" -++++++-+ if not self.warm_up: -++++++-+ self.warm_up = True -++++++-+ self.warmup_moe_model_deep() -++++++-+ -++++++- output_attentions = ( -++++++- output_attentions -++++++- if output_attentions is not None -++++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++-index 3cbf820e..d4c6b651 100644 -++++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -++++++-@@ -18,7 +18,6 @@ -++++++- # See the License for the specific language governing permissions and -++++++- # limitations under the License. -++++++- """MindSpore Qwen2MoE model.""" -++++++-- -++++++- import math -++++++- from typing import List, Optional, Tuple, Union -++++++- -++++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -++++++- TokenClassifierOutput, -++++++- ) -++++++- from ...modeling_utils import PreTrainedModel -++++++-+from ...generation import GenerationMixin -++++++- from ....utils import logging -++++++- from .configuration_qwen2_moe import Qwen2MoeConfig -++++++- -++++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -++++++- self.variance_epsilon = eps -++++++- -++++++- def forward(self, hidden_states): -++++++-+ # @dwj -++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++-+ # @lwx -++++++-+ # if not self.training : -++++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -++++++- input_dtype = hidden_states.dtype -++++++- hidden_states = hidden_states.to(mindspore.float32) -++++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -++++++-@@ -234,6 +239,8 @@ def rotate_half(x): -++++++- """Rotates half the hidden dims of the input.""" -++++++- x1 = x[..., : x.shape[-1] // 2] -++++++- x2 = x[..., x.shape[-1] // 2 :] -++++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -++++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -++++++- return ops.cat((-x2, x1), dim=-1) -++++++- -++++++- -++++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -++++++- self.config = config -++++++- self.hidden_size = config.hidden_size -++++++- self.intermediate_size = intermediate_size -++++++-+ -++++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -++++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -++++++- self.act_fn = ACT2FN[config.hidden_act] -++++++- -++++++- def forward(self, x): -++++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++-- -++++++- -++++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -++++++-+ # @lwx -++++++-+ # gate_up_output = self.gate_up_proj(x) -++++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) -++++++-+ # return self.down_proj(swiglu_output) -++++++-+ -++++++-+ # def forward(self, x): -++++++-+ # gate_proj_out = self.gate_proj(x) -++++++-+ # up_proj_out = self.up_proj(x) -++++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -++++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -++++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -++++++-+ # return self.down_proj(swiglu_out) -++++++-+ -++++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv -++++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -++++++- """ -++++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -++++++- use_cache: bool = False, -++++++- cache_position: Optional[mindspore.Tensor] = None, -++++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++-+ -++++++-+ -++++++-+ -++++++- bsz, q_len, _ = hidden_states.shape -++++++- -++++++- query_states = self.q_proj(hidden_states) -++++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -++++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++- "with a layer index." -++++++- ) -++++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++-+ if isinstance(past_key_value, StaticCache): -++++++-+ kv_seq_len = key_states.shape[-2] -++++++-+ else: -++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++- -++++++- if past_key_value is not None: -++++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -++++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -++++++-+ -++++++-+ if isinstance(past_key_value, StaticCache): -++++++-+ kv_seq_len = key_states.shape[-2] -++++++- -++++++- # repeat k/v heads if n_kv_heads < n_heads -++++++- key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++- value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++-- -++++++-+ -++++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -++++++- -++++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): -++++++-- raise ValueError( -++++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" -++++++-- f" {attn_weights.shape}" -++++++-- ) -++++++-- -++++++-- if attention_mask is not None: # no matter the length, we just slice it -++++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] -++++++-+ if attention_mask is not None: -++++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -++++++- attn_weights = attn_weights + causal_mask -++++++- -++++++- # upcast attention to fp32 -++++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -++++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -++++++- -++++++- attn_output = self.o_proj(attn_output) -++++++-- -++++++-+ # @lwx -++++++-+ -++++++-+ # max_seq_len = self.max_position_embeddings # 2048 -++++++-+ -++++++-+ # if attention_mask is not None: -++++++-+ # # attention_mask: [B, 1, Sq, Sk] -++++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -++++++-+ -++++++-+ # # pad 到 [max_seq_len, max_seq_len] -++++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 -++++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) -++++++-+ # global_attention_mask = padded_mask -++++++-+ # else: -++++++-+ # global_attention_mask = None -++++++-+ -++++++-+ -++++++-+ # sparse_mode=3 -++++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++++-+ # query=query_states, -++++++-+ # key=key_states, -++++++-+ # value=value_states, -++++++-+ # real_shift=None, -++++++-+ # padding_mask=None, -++++++-+ -++++++-+ # head_num=self.num_heads, -++++++-+ # attn_mask=global_attention_mask, -++++++-+ # keep_prob=1.0 - self.attention_dropout, -++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++-+ # input_layout="BNSD", -++++++-+ # pre_tokens=2147483647, -++++++-+ # next_tokens=2147483647, -++++++-+ # inner_precise=0, -++++++-+ # drop_mask=None, -++++++-+ # prefix=None, -++++++-+ # actual_seq_qlen=None, -++++++-+ # actual_seq_kvlen=None, -++++++-+ # sparse_mode=sparse_mode, -++++++-+ # ) -++++++- if not output_attentions: -++++++- attn_weights = None -++++++- -++++++- return attn_output, attn_weights, past_key_value -++++++- -++++++- -++++++-+class Qwen2MoeFlashAttention(nn.Module): -++++++-+ """ -++++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 -++++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -++++++-+ -++++++-+ 关键改动: -++++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), -++++++-+ 直接传入原始的 key 和 value 张量效率更高。 -++++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 -++++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 -++++++-+ """ -++++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -++++++-+ super().__init__() -++++++-+ self.config = config -++++++-+ self.layer_idx = layer_idx -++++++-+ self.hidden_size = config.hidden_size -++++++-+ self.num_heads = config.num_attention_heads -++++++-+ self.head_dim = self.hidden_size // self.num_heads -++++++-+ self.num_key_value_heads = config.num_key_value_heads -++++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads -++++++-+ self.max_position_embeddings = config.max_position_embeddings -++++++-+ self.rope_theta = config.rope_theta -++++++-+ self.attention_dropout = config.attention_dropout -++++++-+ -++++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: -++++++-+ raise ValueError( -++++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" -++++++-+ ) -++++++-+ -++++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -++++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -++++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -++++++-+ -++++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( -++++++-+ self.head_dim, -++++++-+ max_position_embeddings=self.max_position_embeddings, -++++++-+ base=self.rope_theta, -++++++-+ ) -++++++-+ -++++++-+ def forward( -++++++-+ self, -++++++-+ hidden_states: mindspore.Tensor, -++++++-+ attention_mask: Optional[mindspore.Tensor] = None, -++++++-+ position_ids: Optional[mindspore.Tensor] = None, -++++++-+ past_key_value: Optional[Cache] = None, -++++++-+ output_attentions: bool = False, -++++++-+ use_cache: bool = False, -++++++-+ cache_position: Optional[mindspore.Tensor] = None, -++++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++-+ -++++++-+ bsz, q_len, _ = hidden_states.shape -++++++-+ -++++++-+ # 1. 线性投射 Q, K, V -++++++-+ query_states = self.q_proj(hidden_states) -++++++-+ key_states = self.k_proj(hidden_states) -++++++-+ value_states = self.v_proj(hidden_states) -++++++-+ -++++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++-+ # query: [B, S, H*D] -> [B, N1, S, D] -++++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] -++++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ -++++++-+ # 3. RoPE 旋转位置编码 -++++++-+ kv_seq_len = key_states.shape[-2] -++++++-+ if past_key_value is not None: -++++++-+ if self.layer_idx is None: -++++++-+ raise ValueError( -++++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++-+ "with a layer index." -++++++-+ ) -++++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len -++++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 -++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len -++++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n -++++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) -++++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 -++++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 -++++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) -++++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens -++++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 -++++++-+ if cache_position.shape[0] == 1: -++++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 -++++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) -++++++-+ kv_seq_len = past_seen_tokens + 1 -++++++-+ else: -++++++-+ # prefill 阶段:cache_position 是范围,使用其长度 -++++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens -++++++-+ else: -++++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++-+ -++++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++-+ -++++++-+ # 4. KV 缓存更新 -++++++-+ if past_key_value is not None: -++++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++-+ key_states, value_states = past_key_value.update( -++++++-+ key_states, value_states, self.layer_idx, cache_kwargs -++++++-+ ) -++++++-+ -++++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 -++++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) -++++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: -++++++-+ if cache_position.shape[0] == 1: -++++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) -++++++-+ kv_seq_len = key_states.shape[-2] -++++++-+ -++++++-+ # 5. [重要] 准备 Attention Mask -++++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) -++++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 -++++++-+ fa_attention_mask = None -++++++-+ if attention_mask is not None: -++++++-+ # 截取与当前key长度匹配的部分 -++++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) -++++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -++++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False -++++++-+ fa_attention_mask = (mask_slice != 0) -++++++-+ -++++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 -++++++-+ input_dtype = query_states.dtype -++++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): -++++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 -++++++-+ query_states = query_states.to(mindspore.float16) -++++++-+ key_states = key_states.to(mindspore.float16) -++++++-+ value_states = value_states.to(mindspore.float16) -++++++-+ -++++++-+ # 6. [核心] 调用 flash_attention_score 算子 -++++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA -++++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] -++++++-+ attn_output = mindspore.ops.flash_attention_score( -++++++-+ query=query_states, -++++++-+ key=key_states, -++++++-+ value=value_states, -++++++-+ head_num=self.num_heads, # 传入Q的头数(N1) -++++++-+ attn_mask=fa_attention_mask, -++++++-+ keep_prob=1.0 - self.attention_dropout, -++++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), -++++++-+ input_layout="BNSD", -++++++-+ sparse_mode=0 # 使用 defaultMask 模式 -++++++-+ ) -++++++-+ -++++++-+ # 恢复原始数据类型 -++++++-+ attn_output = attn_output.to(input_dtype) -++++++-+ -++++++-+ # 7. 调整输出形状 -++++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] -++++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++-+ attn_output = self.o_proj(attn_output) -++++++-+ -++++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 -++++++-+ attn_weights = None -++++++-+ if output_attentions: -++++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++-+ -++++++-+ return attn_output, attn_weights, past_key_value -++++++-+ -++++++-+ # def forward( -++++++-+ # self, -++++++-+ # hidden_states: mindspore.Tensor, -++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++++++-+ # position_ids: Optional[mindspore.Tensor] = None, -++++++-+ # past_key_value: Optional[Cache] = None, -++++++-+ # output_attentions: bool = False, -++++++-+ # use_cache: bool = False, -++++++-+ # cache_position: Optional[mindspore.Tensor] = None, -++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++-+ -++++++-+ # bsz, q_len, _ = hidden_states.shape -++++++-+ -++++++-+ # # 1. 线性投射 Q, K, V -++++++-+ # query_states = self.q_proj(hidden_states) -++++++-+ # key_states = self.k_proj(hidden_states) -++++++-+ # value_states = self.v_proj(hidden_states) -++++++-+ -++++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 -++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ -++++++-+ # # 3. RoPE 旋转位置编码 -++++++-+ # kv_seq_len = key_states.shape[-2] -++++++-+ # if past_key_value is not None: -++++++-+ # if self.layer_idx is None: -++++++-+ # raise ValueError( -++++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " -++++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -++++++-+ # "with a layer index." -++++++-+ # ) -++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++-+ -++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++-+ -++++++-+ # # 4. KV 缓存更新 -++++++-+ # if past_key_value is not None: -++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++-+ # key_states, value_states = past_key_value.update( -++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++++++-+ # ) -++++++-+ -++++++-+ # # 5. 准备 Attention Mask -++++++-+ # fa_attention_mask = None -++++++-+ # if attention_mask is not None: -++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++-+ # fa_attention_mask = (mask_slice != 0) -++++++-+ -++++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- -++++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 -++++++-+ # input_dtype = query_states.dtype -++++++-+ -++++++-+ # # 6. [核心] 调用 flash_attention_score 算子 -++++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++++-+ # query=query_states, -++++++-+ # key=key_states, -++++++-+ # value=value_states, -++++++-+ # head_num=self.num_heads, -++++++-+ # attn_mask=fa_attention_mask, -++++++-+ # keep_prob=1.0 - self.attention_dropout, -++++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), -++++++-+ # input_layout="BNSD", -++++++-+ # sparse_mode=0, -++++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- -++++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, -++++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 -++++++-+ # inner_precise=1 -++++++-+ # ) -++++++-+ -++++++-+ # # 恢复原始数据类型 -++++++-+ # attn_output = attn_output.to(input_dtype) -++++++-+ -++++++-+ # # 7. 调整输出形状 -++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++-+ # attn_output = self.o_proj(attn_output) -++++++-+ -++++++-+ # attn_weights = None -++++++-+ # if output_attentions: -++++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -++++++-+ -++++++-+ # return attn_output, attn_weights, past_key_value -++++++-+ -++++++-+ # def forward( -++++++-+ # self, -++++++-+ # hidden_states: mindspore.Tensor, -++++++-+ # attention_mask: Optional[mindspore.Tensor] = None, -++++++-+ # position_ids: Optional[mindspore.Tensor] = None, -++++++-+ # past_key_value: Optional[Cache] = None, -++++++-+ # output_attentions: bool = False, -++++++-+ # use_cache: bool = False, -++++++-+ # cache_position: Optional[mindspore.Tensor] = None, -++++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -++++++-+ -++++++-+ # bsz, q_len, _ = hidden_states.shape -++++++-+ -++++++-+ # query_states = self.q_proj(hidden_states) -++++++-+ # key_states = self.k_proj(hidden_states) -++++++-+ # value_states = self.v_proj(hidden_states) -++++++-+ -++++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -++++++-+ -++++++-+ # kv_seq_len = key_states.shape[-2] -++++++-+ # if past_key_value is not None: -++++++-+ # if self.layer_idx is None: -++++++-+ # raise ValueError("`layer_idx` must be specified for caching") -++++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -++++++-+ -++++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -++++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -++++++-+ -++++++-+ # if past_key_value is not None: -++++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -++++++-+ # key_states, value_states = past_key_value.update( -++++++-+ # key_states, value_states, self.layer_idx, cache_kwargs -++++++-+ # ) -++++++-+ -++++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) -++++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) -++++++-+ -++++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- -++++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 -++++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 -++++++-+ # query_states = query_states / math.sqrt(self.head_dim) -++++++-+ # # <--- 修改结束 --- -++++++-+ -++++++-+ # fa_attention_mask = None -++++++-+ # if attention_mask is not None: -++++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -++++++-+ # fa_attention_mask = (mask_slice != 0) -++++++-+ -++++++-+ # input_dtype = query_states.dtype -++++++-+ -++++++-+ # attn_output = mindspore.ops.flash_attention_score( -++++++-+ # query=query_states, # 传入已经预先缩放过的 query -++++++-+ # key=key_states, -++++++-+ # value=value_states, -++++++-+ # head_num=self.num_heads, -++++++-+ # attn_mask=fa_attention_mask, -++++++-+ # keep_prob=1.0 - self.attention_dropout, -++++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 -++++++-+ # input_layout="BNSD", -++++++-+ # sparse_mode=0, -++++++-+ # inner_precise=1 # 仍然保持内部高精度计算 -++++++-+ # ) -++++++-+ -++++++-+ # attn_output = attn_output.to(input_dtype) -++++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -++++++-+ # attn_output = self.o_proj(attn_output) -++++++-+ -++++++-+ # attn_weights = None -++++++-+ # if output_attentions: -++++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") -++++++-+ -++++++-+ # return attn_output, attn_weights, past_key_value -++++++-+ -++++++- QWEN2MOE_ATTENTION_CLASSES = { -++++++- "eager": Qwen2MoeAttention, -++++++-+ "flash-attention": Qwen2MoeFlashAttention, -++++++- } -++++++- -++++++- -++++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -++++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -++++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -++++++- -++++++-+ #@dwj -++++++-+ # 只遍历激活的专家,而非全部专家 -++++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: -++++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++-- hidden_states = hidden_states.view(-1, hidden_dim) -++++++-- # router_logits: (batch * sequence_length, n_experts) -++++++-- router_logits = self.gate(hidden_states) -++++++-- -++++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++-- if self.norm_topk_prob: -++++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++-- # we cast back to the input dtype -++++++-- routing_weights = routing_weights.to(hidden_states.dtype) -++++++-- -++++++-- final_hidden_states = ops.zeros( -++++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype -++++++-- ) -++++++-- -++++++-- # One hot encode the selected experts to create an expert mask -++++++-- # this will be used to easily index which expert is going to be sollicitated -++++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) -++++++-- -++++++-- # Loop over all available experts in the model and perform the computation on each expert -++++++-- for expert_idx in range(self.num_experts): -++++++-- expert_layer = self.experts[expert_idx] -++++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) -++++++-- -++++++-- # Index the correct hidden states and compute the expert hidden state for -++++++-- # the current expert. We need to make sure to multiply the output hidden -++++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) -++++++-- if 0 not in idx.shape: -++++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) -++++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] -++++++-- -++++++-- # However `index_add_` only support torch tensors for indexing so we'll use -++++++-- # the `top_x` tensor here. -++++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) -++++++-- -++++++-- shared_expert_output = self.shared_expert(hidden_states) -++++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output -++++++-- -++++++-- final_hidden_states = final_hidden_states + shared_expert_output -++++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape -++++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) -++++++-+ num_tokens = hidden_states_reshaped.shape[0] -++++++-+ -++++++-+ router_logits = self.gate(hidden_states_reshaped) -++++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) -++++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -++++++-+ -++++++-+ if self.norm_topk_prob: -++++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -++++++-+ routing_weights = routing_weights.to(hidden_states.dtype) -++++++-+ -++++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) -++++++-+ flat_selected_experts = selected_experts.flatten() -++++++-+ -++++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) -++++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) -++++++-+ token_indices = broadcasted_token_indices.flatten() -++++++-+ -++++++-+ active_experts = ops.unique(flat_selected_experts) -++++++-+ -++++++-+ for expert_idx_tensor in active_experts: -++++++-+ expert_idx = expert_idx_tensor.item() -++++++-+ expert_layer = self.experts[expert_idx] -++++++-+ -++++++-+ mask = (flat_selected_experts == expert_idx_tensor) -++++++-+ selected_token_indices = token_indices[mask] -++++++-+ selected_routing_weights = routing_weights.flatten()[mask] -++++++-+ -++++++-+ current_states = hidden_states_reshaped[selected_token_indices] -++++++-+ -++++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -++++++-+ -++++++-+ final_hidden_states = final_hidden_states.index_add( -++++++-+ dim=0, -++++++-+ index=selected_token_indices, -++++++-+ source=expert_output.to(hidden_states.dtype) -++++++-+ ) -++++++-+ -++++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) -++++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -++++++- -++++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++-- return final_hidden_states, router_logits -++++++-+ final_hidden_states = final_hidden_states + shared_expert_output -++++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -++++++-+ -++++++-+ return final_hidden_states, router_logits -++++++- -++++++- -++++++- class Qwen2MoeDecoderLayer(nn.Module): -++++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -++++++- -++++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -++++++- -++++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -++++++-+ -++++++- if (layer_idx not in config.mlp_only_layers) and ( -++++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -++++++- ): -++++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -++++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] -++++++- _skip_keys_device_placement = "past_key_values" -++++++- _supports_cache_class = True -++++++-+#lwx -++++++-+ # _supports_static_cache = True -++++++- -++++++- def _init_weights(self, module): -++++++- std = self.config.initializer_range -++++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -++++++- return causal_mask -++++++- -++++++- -++++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -++++++- _tied_weights_keys = ["lm_head.weight"] -++++++- -++++++- def __init__(self, config): -++++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++- self.num_experts_per_tok = config.num_experts_per_tok -++++++- # Initialize weights and apply final processing -++++++- self.post_init() -++++++-+ # @lwx -++++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: -++++++-+ # self.generation_config.cache_implementation = "static" -++++++-+ self._warmed_up = False -++++++-+ -++++++-+ def warmup_moe_model(self): -++++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") -++++++-+ test_texts = [ -++++++-+ "warmup short", -++++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -++++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" -++++++-+ ] -++++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) -++++++-+ if tokenizer is None: -++++++-+ from mindnlp.transformers import AutoTokenizer -++++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -++++++-+ self._warmup_tokenizer = tokenizer -++++++-+ -++++++-+ for text in test_texts: -++++++-+ inputs = tokenizer(text, return_tensors="ms") -++++++-+ with mindspore._no_grad(): -++++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) -++++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") -++++++- -++++++- def get_input_embeddings(self): -++++++- return self.model.embed_tokens -++++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -++++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -++++++- ```""" -++++++-+ if not self._warmed_up: -++++++-+ self._warmed_up = True -++++++-+ self.warmup_moe_model() -++++++- -++++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -++++++- output_router_logits = ( -++++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -++++++- } -++++++- ) -++++++- return model_inputs -++++++-+# @lwx -++++++-+ # def _decode_one_tokens_logits( -++++++-+ # self, -++++++-+ # cur_token: mindspore.Tensor, -++++++-+ # input_pos: Optional[mindspore.Tensor], -++++++-+ # cache_position: mindspore.Tensor, -++++++-+ # past_key_values: StaticCache, -++++++-+ # ) -> mindspore.Tensor: -++++++-+ # """ -++++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -++++++-+ -++++++-+ # Args: -++++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) -++++++-+ # input_pos: 输入位置信息,可选 -++++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) -++++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 -++++++-+ -++++++-+ # Returns: -++++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) -++++++-+ # """ -++++++-+ # # 调用JIT编译的版本 -++++++-+ # return self.get_decode_one_tokens_logits( -++++++-+ # cur_token=cur_token, -++++++-+ # input_pos=input_pos, -++++++-+ # cache_position=cache_position, -++++++-+ # past_key_values=past_key_values, -++++++-+ # ) -++++++-+ -++++++-+ # @mindspore.jit(jit_level='O1') -++++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -++++++-+ # """ -++++++-+ # JIT编译的函数,用于高效的单token解码 -++++++-+ # 使用JIT编译优化以支持静态shape和高效执行 -++++++-+ -++++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except -++++++-+ # """ -++++++-+ # outputs = self.model.forward( -++++++-+ # input_ids=cur_token, -++++++-+ # position_ids=input_pos, -++++++-+ # cache_position=cache_position, -++++++-+ # past_key_values=past_key_values, -++++++-+ # use_cache=True, -++++++-+ # return_dict=False, -++++++-+ # ) -++++++-+ -++++++-+ # hidden_states = outputs[0] -++++++-+ # logits = self.lm_head.forward(hidden_states) -++++++-+ # logits = logits.float() -++++++-+ -++++++-+ # return logits[:, -1, :] -++++++-+ -++++++-+ # def _sample( -++++++-+ # self, -++++++-+ # input_ids: mindspore.Tensor, -++++++-+ # logits_processor, -++++++-+ # stopping_criteria, -++++++-+ # generation_config, -++++++-+ # synced_devices: bool, -++++++-+ # streamer=None, -++++++-+ # logits_warper=None, -++++++-+ # **model_kwargs, -++++++-+ # ): -++++++-+ # """ -++++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -++++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -++++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -++++++-+ # """ -++++++-+ # from ...generation.logits_process import LogitsProcessorList -++++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList -++++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -++++++-+ # from mindnlp.core import nn, ops, no_grad -++++++-+ # import numpy as np -++++++-+ -++++++-+ # # 检查是否使用 StaticCache -++++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -++++++-+ # # 否则,直接调用父类方法 -++++++-+ # past_key_values = model_kwargs.get("past_key_values") -++++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -++++++-+ -++++++-+ # if not isinstance(past_key_values, StaticCache): -++++++-+ # # 不使用 StaticCache,直接调用父类方法 -++++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -++++++-+ # return super()._sample( -++++++-+ # input_ids=input_ids, -++++++-+ # logits_processor=logits_processor, -++++++-+ # stopping_criteria=stopping_criteria, -++++++-+ # generation_config=generation_config, -++++++-+ # synced_devices=synced_devices, -++++++-+ # streamer=streamer, -++++++-+ # logits_warper=logits_warper, -++++++-+ # **model_kwargs, -++++++-+ # ) -++++++-+ -++++++-+ # # 使用 StaticCache,进入自定义循环 -++++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -++++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -++++++-+ # pad_token_id = generation_config._pad_token_tensor -++++++-+ # output_attentions = generation_config.output_attentions -++++++-+ # output_hidden_states = generation_config.output_hidden_states -++++++-+ # output_scores = generation_config.output_scores -++++++-+ # output_logits = generation_config.output_logits -++++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate -++++++-+ # max_length = generation_config.max_length -++++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -++++++-+ # do_sample = generation_config.do_sample -++++++-+ -++++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -++++++-+ # raise ValueError( -++++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -++++++-+ # f"{logits_warper})." -++++++-+ # ) -++++++-+ -++++++-+ # # init attention / hidden states / scores tuples -++++++-+ # scores = () if (return_dict_in_generate and output_scores) else None -++++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None -++++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -++++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -++++++-+ -++++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -++++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: -++++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -++++++-+ # encoder_hidden_states = ( -++++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -++++++-+ # ) -++++++-+ -++++++-+ # # keep track of which sequences are already finished -++++++-+ # batch_size, cur_len = input_ids.shape -++++++-+ # this_peer_finished = False -++++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -++++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -++++++-+ -++++++-+ # time_record = [] -++++++-+ # from ....utils.testing_utils import parse_flag_from_env -++++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -++++++-+ -++++++-+ # while self._has_unfinished_sequences( -++++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -++++++-+ # ): -++++++-+ # if _record_time: -++++++-+ # import time as time_module -++++++-+ # infer_start = time_module.time() -++++++-+ -++++++-+ # # prepare model inputs -++++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -++++++-+ -++++++-+ # # prepare variable output controls -++++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -++++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -++++++-+ -++++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -++++++-+ # cur_cache_position = model_inputs.get("cache_position") -++++++-+ # cur_past_key_values = model_inputs.get("past_key_values") -++++++-+ # cur_input_ids = model_inputs.get("input_ids") -++++++-+ -++++++-+ # if (isinstance(cur_past_key_values, StaticCache) and -++++++-+ # cur_cache_position is not None and -++++++-+ # len(cur_cache_position.shape) > 0 and -++++++-+ # cur_cache_position.shape[0] == 1 and -++++++-+ # cur_input_ids is not None and -++++++-+ # cur_input_ids.shape[1] == 1): -++++++-+ # # 使用 JIT 优化的单 token 解码 -++++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) -++++++-+ # if not hasattr(self, '_jit_used'): -++++++-+ # self._jit_used = False -++++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -++++++-+ -++++++-+ # next_token_logits = self.get_decode_one_tokens_logits( -++++++-+ # cur_token=cur_input_ids, -++++++-+ # input_pos=model_inputs.get("position_ids"), -++++++-+ # cache_position=cur_cache_position, -++++++-+ # past_key_values=cur_past_key_values, -++++++-+ # ) -++++++-+ -++++++-+ # # 标记已使用JIT(用于后续判断) -++++++-+ # if not self._jit_used: -++++++-+ # self._jit_used = True -++++++-+ -++++++-+ # # 构造兼容的输出对象 -++++++-+ # class JitOptimizedOutput: -++++++-+ # def __init__(self, logits, config): -++++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -++++++-+ # self.config = config -++++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 -++++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None -++++++-+ # self.attentions = None if not config.is_encoder_decoder else None -++++++-+ # self.cross_attentions = None -++++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None -++++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None -++++++-+ -++++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) -++++++-+ # else: -++++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) -++++++-+ # outputs = self(**model_inputs, return_dict=True) -++++++-+ -++++++-+ # if synced_devices and this_peer_finished: -++++++-+ # continue -++++++-+ -++++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits -++++++-+ # next_token_logits = outputs.logits[:, -1, :] -++++++-+ -++++++-+ # # pre-process distribution -++++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) -++++++-+ # if do_sample: -++++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) -++++++-+ -++++++-+ # # Store scores, attentions and hidden_states when required -++++++-+ # if return_dict_in_generate: -++++++-+ # if output_scores: -++++++-+ # scores += (next_token_scores,) -++++++-+ # if output_logits: -++++++-+ # raw_logits += (next_token_logits,) -++++++-+ # if output_attentions: -++++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -++++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) -++++++-+ # if self.config.is_encoder_decoder: -++++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -++++++-+ -++++++-+ # if output_hidden_states: -++++++-+ # hidden = ( -++++++-+ # outputs.decoder_hidden_states -++++++-+ # if self.config.is_encoder_decoder -++++++-+ # else outputs.hidden_states -++++++-+ # ) -++++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -++++++-+ -++++++-+ # # token selection -++++++-+ # if do_sample: -++++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) -++++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -++++++-+ # else: -++++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) -++++++-+ -++++++-+ # # finished sentences should have their next token be a padding token -++++++-+ # if has_eos_stopping_criteria: -++++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -++++++-+ -++++++-+ # # update generated ids, model inputs, and length for next step -++++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -++++++-+ # if streamer is not None: -++++++-+ # streamer.put(next_tokens) -++++++-+ -++++++-+ # model_kwargs = self._update_model_kwargs_for_generation( -++++++-+ # outputs, -++++++-+ # model_kwargs, -++++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, -++++++-+ # ) -++++++-+ -++++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -++++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -++++++-+ # cur_len += 1 -++++++-+ -++++++-+ # if _record_time: -++++++-+ # import time as time_module -++++++-+ # infer_stop = time_module.time() -++++++-+ # time_record.append(infer_stop - infer_start) -++++++-+ -++++++-+ # del outputs -++++++-+ -++++++-+ # average_infer_time = None -++++++-+ # if time_record: -++++++-+ # if len(time_record) > 1: -++++++-+ # time_record.pop(0) -++++++-+ # average_infer_time = sum(time_record) / len(time_record) -++++++-+ # print(f'average inference time is: {average_infer_time}') -++++++-+ # print(f'inference time record: {time_record}') -++++++-+ -++++++-+ # if streamer is not None: -++++++-+ # streamer.end() -++++++-+ -++++++-+ # # 简单判断:打印是否使用了JIT路径 -++++++-+ # if hasattr(self, '_jit_used') and self._jit_used: -++++++-+ # print("[JIT] ✓ JIT optimization was used during generation") -++++++-+ # else: -++++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -++++++-+ -++++++-+ # if return_dict_in_generate: -++++++-+ # if self.config.is_encoder_decoder: -++++++-+ # return GenerateEncoderDecoderOutput( -++++++-+ # sequences=input_ids, -++++++-+ # scores=scores, -++++++-+ # logits=raw_logits, -++++++-+ # encoder_attentions=encoder_attentions, -++++++-+ # encoder_hidden_states=encoder_hidden_states, -++++++-+ # decoder_attentions=decoder_attentions, -++++++-+ # cross_attentions=cross_attentions, -++++++-+ # decoder_hidden_states=decoder_hidden_states, -++++++-+ # past_key_values=model_kwargs.get("past_key_values"), -++++++-+ # average_infer_time=average_infer_time -++++++-+ # ) -++++++-+ # else: -++++++-+ # return GenerateDecoderOnlyOutput( -++++++-+ # sequences=input_ids, -++++++-+ # scores=scores, -++++++-+ # logits=raw_logits, -++++++-+ # attentions=decoder_attentions, -++++++-+ # hidden_states=decoder_hidden_states, -++++++-+ # past_key_values=model_kwargs.get("past_key_values"), -++++++-+ # average_infer_time=average_infer_time -++++++-+ # ) -++++++-+ # else: -++++++-+ # return input_ids -++++++-+ -++++++-+ # def _prepare_cache_for_generation( -++++++-+ # self, -++++++-+ # generation_config, -++++++-+ # model_kwargs, -++++++-+ # assistant_model, -++++++-+ # batch_size, -++++++-+ # max_cache_length, -++++++-+ # ): -++++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: -++++++-+ # generation_config.cache_implementation = "static" -++++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -++++++-+ -++++++-+ # if generation_config.cache_implementation == "static": -++++++-+ # base_required_from_max_length = generation_config.max_length + 1 -++++++-+ # base_required = max(max_cache_length, base_required_from_max_length) -++++++-+ # min_cache_size = 50 -++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -++++++-+ # else: -++++++-+ # max_cache_length = max(base_required, min_cache_size) -++++++-+ -++++++-+ # original_max_cache_length = max_cache_length -++++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") -++++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") -++++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") -++++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") -++++++-+ # print(f" - final max_cache_length: {max_cache_length}") -++++++-+ -++++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -++++++-+ # if max_cache_length > self.config.max_position_embeddings: -++++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -++++++-+ -++++++-+ # result = super()._prepare_cache_for_generation( -++++++-+ # generation_config=generation_config, -++++++-+ # model_kwargs=model_kwargs, -++++++-+ # assistant_model=assistant_model, -++++++-+ # batch_size=batch_size, -++++++-+ # max_cache_length=max_cache_length, -++++++-+ # ) -++++++-+ -++++++-+ # if generation_config.cache_implementation == "static": -++++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -++++++-+ # created_cache = model_kwargs.get(cache_name) -++++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -++++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -++++++-+ # if created_cache.max_cache_len < generation_config.max_length: -++++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -++++++-+ -++++++-+ # return result -++++++-+ -++++++-+ -++++++-+ -++++++- -++++++- -++++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE -++++++--- -++++++-2.27.0 -++++++- -++++++-- -++++++2.27.0 -++++++ -+++++-- -+++++2.27.0 -+++++ -++++-- -++++2.27.0 -++++ -+++-- -+++2.27.0 -+++ -++-- -++2.27.0 -++ -+-- -+2.27.0 -+ --- -2.39.5 (Apple Git-154) - diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" deleted file mode 100644 index a1832dc4..00000000 --- "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches/0010-.patch" +++ /dev/null @@ -1,49453 +0,0 @@ -From 5d88d879c9a97cf89b7f7a00df9534ba2df9e955 Mon Sep 17 00:00:00 2001 -From: =?UTF-8?q?=E9=82=93=E4=BC=9F=E9=94=AE?= -Date: Wed, 3 Dec 2025 16:13:15 +0800 -Subject: [PATCH 10/10] =?UTF-8?q?=E6=9C=80=E5=90=8E=E6=95=B4=E7=90=86?= -MIME-Version: 1.0 -Content-Type: text/plain; charset=UTF-8 -Content-Transfer-Encoding: 8bit - ---- - .../models/deepseek/modeling_deepseek.py | 731 +- - .../models/qwen2_moe/modeling_qwen2_moe.py | 1005 +- - patches/0001-20251104commit.patch | 1272 --- - patches/0002-20251106commit.patch | 3200 ------ - patches/0003-20261106secondcommit.patch | 2769 ------ - patches/0004-20251106change.patch | 7498 -------------- - patches/0005-20251107001commit.patch | 7707 --------------- - patches/0006-20251107002commit.patch | 7931 --------------- - patches/0007-20251107003commit.patch | 8034 --------------- - patches/0008-moe-change.patch | 8789 ----------------- - 10 files changed, 29 insertions(+), 48907 deletions(-) - delete mode 100644 patches/0001-20251104commit.patch - delete mode 100644 patches/0002-20251106commit.patch - delete mode 100644 patches/0003-20261106secondcommit.patch - delete mode 100644 patches/0004-20251106change.patch - delete mode 100644 patches/0005-20251107001commit.patch - delete mode 100644 patches/0006-20251107002commit.patch - delete mode 100644 patches/0007-20251107003commit.patch - delete mode 100644 patches/0008-moe-change.patch - -diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -index 8d004af1..8178fb05 100644 ---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py -+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py -@@ -234,9 +234,6 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): - # Copied from transformers.models.llama.modeling_llama.rotate_half - def rotate_half(x): - """Rotates half the hidden dims of the input.""" -- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] -- # x1 = x[..., : x.shape[-1] // 2] -- # x2 = x[..., x.shape[-1] // 2 :] - x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - -@@ -413,10 +410,7 @@ class DeepseekMoE(nn.Module): - if self.training: - raise NotImplementedError("Training is not supported yet.") - else: -- # @lwx - if orig_shape[1] == 1: -- # lwx moe_infer_decode_fast -- # y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) - y=self.moe_infer_decode_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) - y=y.view(*orig_shape) - if self.config.n_shared_experts is not None: -@@ -430,120 +424,7 @@ class DeepseekMoE(nn.Module): - if self.config.n_shared_experts is not None: - y = y + self.shared_experts(identity) - return y -- # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) -- # if self.config.n_shared_experts is not None: -- # y = y + self.shared_experts(identity) -- # return y -- -- -- -- # lwx -- # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): -- # """ -- # 如果 expert_ids 为 None,走单专家逻辑; -- # 如果有,多专家批量处理,保证和原逻辑一致。 -- # """ -- # if expert_ids is None: -- # # 原单专家逻辑 -- # if self.config.pretraining_tp > 1: -- # slice = self.intermediate_size // self.config.pretraining_tp -- # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) -- # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) -- # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) -- # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) -- # for i in range(self.config.pretraining_tp)], dim=-1) -- # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) -- # for i in range(self.config.pretraining_tp)], dim=-1) -- # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) -- # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) -- # for i in range(self.config.pretraining_tp)] -- # down_proj = sum(down_proj) -- # else: -- # down_proj = self.down_proj( -- # self.act_fn(self.gate_proj(x)) * self.up_proj(x) -- # ) -- # return down_proj -- -- # # ====== 批量多专家路径 ====== -- # hidden_size = x.shape[-1] -- -- # # 按 token expert_ids 选权重 -- # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] -- # up_weights = self.up_proj.weight[expert_ids] -- # down_weights = self.down_proj.weight[expert_ids] -- -- # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 -- # if self.config.pretraining_tp > 1: -- # outputs = [] -- # slice = self.intermediate_size // self.config.pretraining_tp -- # for i in range(self.config.pretraining_tp): -- # # 每个 slice 单独计算 -- # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) -- # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) -- # act_out = self.act_fn(gate_proj_out) * up_proj_out -- # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) -- # outputs.append(down_proj_out) -- # return sum(outputs) -- # else: -- # gate_proj_out = F.linear(x, gate_weights) -- # up_proj_out = F.linear(x, up_weights) -- # act_out = self.act_fn(gate_proj_out) * up_proj_out -- # return F.linear(act_out, down_weights) -- # @no_grad() -- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # num_tokens = x.shape[0] -- # hidden_size = x.shape[-1] -- -- # idxs = flat_expert_indices.argsort() -- # sorted_expert_indices = flat_expert_indices[idxs] -- # sorted_token_indices = idxs // self.num_experts_per_tok -- # sorted_indices = sorted_token_indices -- -- # permuted_tokens = x[sorted_token_indices] -- # sorted_weights = flat_expert_weights[idxs] -- -- # # 一次调用多专家 forward -- # expert_outputs = ops.zeros_like(permuted_tokens) -- # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) -- -- # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -- # try: -- # final_output = ops.moe_token_unpermute( -- # expert_outputs, -- # sorted_indices, -- # probs=probs, -- # padded_mode=False -- # ) -- # except Exception: -- # final_output = ops.zeros_like(x) -- # final_output = mindspore.mint.scatter_add( -- # final_output, -- # 0, -- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -- # expert_outputs * sorted_weights -- # ) -- -- # return final_output -- -- # def mlp_batch_forward(self, tokens, expert_ids): -- # """ -- # 使用批量专家 forward(保留精度) -- # """ -- # return self.experts[0].forward(tokens, expert_ids) -- -- # @no_grad() -- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- -- # expert_cache = ops.zeros_like(x) -- # for i in range(self.num_experts_per_tok): -- # expert_id = flat_expert_indices[i].item() -- # weight = flat_expert_weights[i].item() -- # expert = self.experts[expert_id] -- # expert_out = expert(x) -- # expert_cache += expert_out * weight -- # return expert_cache -- -- #@dwj -+ - @no_grad() - def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): - -@@ -561,35 +442,27 @@ class DeepseekMoE(nn.Module): - - 跳过无 token 的专家 - - 保持结果完全一致 - """ -- # 初始化输出缓存 - expert_cache = ops.zeros_like(x) - -- # 排序(确保 scatter_add 位置对应原逻辑) - idxs = flat_expert_indices.argsort() - sorted_expert_indices = flat_expert_indices[idxs] - sorted_token_indices = idxs // self.num_experts_per_tok - -- # 每个 expert 的 token 数 - tokens_per_expert = sorted_expert_indices.bincount() - -- # 找出有 token 的专家 - active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() - - for expert_id in active_experts.tolist(): -- # 取该 expert 对应的排序后 token 区间 - start = (tokens_per_expert[:expert_id]).sum().item() - end = start + tokens_per_expert[expert_id].item() - -- token_idx = sorted_token_indices[start:end] # 原 token 位置 -- expert_tokens = x[token_idx] # 取输入向量 -+ token_idx = sorted_token_indices[start:end] -+ expert_tokens = x[token_idx] - -- # 执行专家 MLP - expert_out = self.experts[expert_id](expert_tokens) - -- # 按权重缩放 - scaled_out = expert_out * flat_expert_weights[idxs[start:end]] - -- # 回写到缓存(等价 scatter_add) - expert_cache = mindspore.mint.scatter_add( - expert_cache, - 0, -@@ -599,60 +472,6 @@ class DeepseekMoE(nn.Module): - - return expert_cache - -- -- # @no_grad() -- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # """ -- # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add -- # """ -- # num_tokens = x.shape[0] -- # hidden_size = x.shape[-1] -- -- # # 生成排序后的 token 索引 -- # idxs = flat_expert_indices.argsort() -- # sorted_expert_indices = flat_expert_indices[idxs] -- # sorted_token_indices = idxs // self.num_experts_per_tok -- -- # # 记录到 sorted_indices(moe_token_unpermute 用) -- # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] -- -- # # 收集专家输入 -- # permuted_tokens = x[sorted_token_indices] -- -- # # 执行每个专家的 MLP(批量处理) -- # expert_outputs = [] -- # token_ptr = 0 -- # tokens_per_expert = sorted_expert_indices.bincount() -- # for expert_id, count in enumerate(tokens_per_expert.tolist()): -- # if count == 0: -- # continue -- # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] -- # out = self.experts[expert_id](cur_tokens) -- # expert_outputs.append(out) -- # token_ptr += count -- -- # # 拼接所有专家输出 -- # permuted_outputs = ops.cat(expert_outputs, axis=0) -- -- # # 权重缩放(probs 形状为 [num_tokens, top_k]) -- # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) -- -- # # 直接调用硬件加速的 unpermute -- # final_output = ops.moe_token_unpermute( -- # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] -- # sorted_indices, # shape: [num_tokens * top_k] -- # probs=probs, # 按概率加权 -- # padded_mode=False -- # ) -- -- # return final_output -- # def init_expert_cache(self): -- # """ -- # 在模型初始化时调用,缓存所有专家的权重到显存。 -- # """ -- # self.cache_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) -- # self.cache_up_w = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) -- # self.cache_down_w = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) - @no_grad() - def moe_infer_decode_fast(self, x, flat_expert_indices, flat_expert_weights): - top_k = flat_expert_indices.shape[0] -@@ -684,43 +503,22 @@ class DeepseekMoE(nn.Module): - weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) - return weighted_sum - -- # lwx prefill 20251108 - @no_grad() - def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): -- """ -- 高性能 + 数值一致的 MoE prefill 推理: -- 1. 批量化处理所有专家计算,减少 Python 循环开销 -- 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 -- 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 -- 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch -- -- 参数: -- x: [num_tokens, hidden_size], -- MoE 输入的 token 表示 -- flat_expert_indices: [num_tokens * top_k], -- 每个 token 的路由专家 id -- flat_expert_weights: [num_tokens * top_k, 1], -- 路由专家权重 -- """ - num_tokens = x.shape[0] - hidden_size = x.shape[-1] - -- # 1) 排序专家分配(与原 scatter_add 一致的顺序) -- idxs = flat_expert_indices.argsort() # 排序索引 -- sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] -- sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID -+ idxs = flat_expert_indices.argsort() -+ sorted_expert_indices = flat_expert_indices[idxs] -+ sorted_token_indices = idxs // self.num_experts_per_tok - -- # sorted_indices 必须与 permuted_tokens 顺序匹配 -- sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 -+ sorted_indices = sorted_token_indices - -- # 2) 收集专家输入(按 idxs 排序) -- permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] -- sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 -+ permuted_tokens = x[sorted_token_indices] -+ sorted_weights = flat_expert_weights[idxs] - -- # 3) 计算每个专家的 token 数 - tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) - -- # 4) 批量专家计算(减少 Python 循环) - gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) - up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) - down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) -@@ -731,8 +529,7 @@ class DeepseekMoE(nn.Module): - if count == 0: - continue - tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] -- -- # 与 DeepseekMLP forward 等价 -+ - gate_proj_out = F.linear(tokens, gate_weights[expert_id]) - up_proj_out = F.linear(tokens, up_weights[expert_id]) - act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out -@@ -741,7 +538,6 @@ class DeepseekMoE(nn.Module): - expert_outputs[ptr:ptr+count] = expert_out - ptr += count - -- # 5) Ascend 加速的 unpermute(已排序的权重) - probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape - - final_output = ops.zeros_like(x) -@@ -753,444 +549,6 @@ class DeepseekMoE(nn.Module): - ) - return final_output - -- # try: -- # final_output = ops.moe_token_unpermute( -- # expert_outputs, # [num_tokens*top_k, hidden_size] -- # sorted_indices, # [num_tokens*top_k] 原 token id -- # probs=probs, # 对应权重 -- # padded_mode=False -- # ) -- # except Exception: -- # # CPU/GPU fallback:用 scatter_add 保证完全一致 -- # final_output = ops.zeros_like(x) -- # final_output = mindspore.mint.scatter_add( -- # final_output, -- # 0, -- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -- # expert_outputs * sorted_weights -- # ) -- -- # return final_output -- -- -- # @no_grad() -- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # num_tokens = x.shape[0] -- # hidden_size = x.shape[-1] -- -- # idxs = flat_expert_indices.argsort() -- # sorted_expert_indices = flat_expert_indices[idxs] -- # sorted_token_indices = idxs // self.num_experts_per_tok -- -- # # sorted_indices = sorted_token_indices -- # sorted_indices = sorted_token_indices.astype(mindspore.int32) -- # permuted_tokens = x[sorted_token_indices] -- # sorted_weights = flat_expert_weights[idxs] -- # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) -- -- # expert_outputs = ops.zeros_like(permuted_tokens) -- # ptr = 0 -- -- # # 只按专家维度循环 -- # for expert_id, count in enumerate(tokens_per_expert.tolist()): -- # if count == 0: -- # continue -- # token_slice = slice(ptr, ptr + count) -- # expert_tokens = permuted_tokens[token_slice] -- -- # # 保持原 forward(含 pretraining_tp、bias 等) -- # expert_out = self.experts[expert_id](expert_tokens) -- -- # expert_outputs[token_slice] = expert_out -- # ptr += count -- -- # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) -- # try: -- # final_output = mindspore.ops.moe_token_unpermute( -- # expert_outputs, -- # sorted_indices, -- # probs=probs, -- # padded_mode=False -- # ) -- # except Exception: -- # final_output = ops.zeros_like(x) -- # final_output = mindspore.mint.scatter_add( -- # final_output, -- # 0, -- # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), -- # expert_outputs * sorted_weights -- # ) -- -- # return final_output -- -- -- #lwx -- # @no_grad() -- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # """ -- # 并行化 MoE prefill: -- # - 一次性计算所有专家输出,牺牲显存峰值换取速度 -- # - 保证结果与原版完全一致 -- # """ -- # # 输出缓存 -- # expert_cache = ops.zeros_like(x) -- -- # # token 总数(批量*seq_len*num_experts_per_tok) -- # num_tokens = flat_expert_indices.shape[0] -- # hidden_dim = x.shape[-1] -- -- # # 原 token ID(idxs // num_experts_per_tok) -- # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) -- -- # # ====== Step 1: 组织输入 ====== -- # # 按 experts 排序,保证 scatter_add 对应位置一致 -- # sort_ids = flat_expert_indices.argsort() -- # sorted_experts = flat_expert_indices[sort_ids] -- # sorted_tokens = token_ids[sort_ids] -- # sorted_weights = flat_expert_weights[sort_ids] -- -- # # 收集每个专家的输入 -- # # build: expert_inputs[expert_id] = [tokens...] -- # expert_inputs = [] -- # expert_outs = [] -- -- # for eid in range(self.config.n_routed_experts): -- # eid_mask = (sorted_experts == eid) -- # if eid_mask.any(): -- # tokens_for_eid = x[sorted_tokens[eid_mask]] -- # expert_inputs.append(tokens_for_eid) -- # else: -- # expert_inputs.append(None) -- -- # # ====== Step 2: 并行计算所有专家输出 ====== -- # # 存储所有专家结果到一个列表 -- # for eid in range(self.config.n_routed_experts): -- # if expert_inputs[eid] is not None: -- # out = self.experts[eid](expert_inputs[eid]) -- # expert_outs.append(out) -- # else: -- # expert_outs.append(None) -- -- # # ====== Step 3: scatter_add 回写结果 ====== -- # # 遍历专家,将结果加回对应的 token -- # pos = 0 -- # for eid in range(self.config.n_routed_experts): -- # if expert_outs[eid] is not None: -- # size = expert_outs[eid].shape[0] -- # tokens_idx = sorted_tokens[pos:pos+size] -- # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] -- # pos += size -- -- # # scatter_add 到 expert_cache -- # expert_cache = mindspore.mint.scatter_add( -- # expert_cache, -- # dim=0, -- # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), -- # src=scaled_out -- # ) -- -- # return expert_cache -- -- -- --# 放置在 DeepseekMoE 类中 -- # @no_grad() -- # #lwx 20251107 -- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- # """ -- # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 -- -- # Args: -- # x (Tensor): 输入张量, shape: (1, hidden_size) -- # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) -- # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) -- # """ -- # top_k, _ = flat_expert_weights.shape -- # hidden_size = x.shape[-1] -- -- # # 1. 将所有专家的权重堆叠起来 -- # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) -- # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) -- # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -- -- # # 2. "收集" 所需的专家权重 -- # selected_gate_w = stacked_gate_w[flat_expert_indices] -- # selected_up_w = stacked_up_w[flat_expert_indices] -- # selected_down_w = stacked_down_w[flat_expert_indices] -- -- # # 3. 准备输入 -- # x_expanded = x.expand((top_k, 1, hidden_size)) -- -- # # 4. 并行计算 gate_proj 和 up_proj -- # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) -- # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -- -- # # 5. 计算中间状态 -- # intermediate_states = self.experts[0].act_fn(gate_out) * up_out -- -- # # 6. 并行计算 down_proj -- # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) -- # # --- [FIX] --- -- # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 -- # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) -- # # --- [FIX END] --- -- -- # # 7. 根据路由权重进行加权求和 -- # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -- -- # return weighted_sum -- -- -- -- # @no_grad() -- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -- # # expert_cache = torch.zeros_like(x) -- # # idxs = flat_expert_indices.argsort() -- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) -- # # token_idxs = idxs // self.num_experts_per_tok -- # # for i, end_idx in enumerate(tokens_per_expert): -- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] -- # # if start_idx == end_idx: -- # # continue -- # # expert = self.experts[i] -- # # exp_token_idx = token_idxs[start_idx:end_idx] -- # # expert_tokens = x[exp_token_idx] -- # # expert_out = expert(expert_tokens) -- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) -- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') -- # # return expert_cache -- # expert_cache = ops.zeros_like(x) -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- # token_idxs = idxs // self.num_experts_per_tok -- -- # for i, end_idx in enumerate(tokens_per_expert): -- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- # if start_idx == end_idx: -- # continue -- # expert = self.experts[i] -- # exp_token_idx = token_idxs[start_idx:end_idx] -- # expert_tokens = x[exp_token_idx] -- # expert_out = expert(expert_tokens) -- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -- -- # return expert_cache -- # @no_grad() -- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): -- # expert_cache = ops.zeros_like(x) -- -- # # 排序保证顺序一致 -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- # token_idxs = idxs // self.num_experts_per_tok -- -- # # 找出有 token 的专家 -- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) -- -- # for i in active_experts.tolist(): -- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- # end_idx = tokens_per_expert[i] -- # if start_idx == end_idx: # 没有 token -- # continue -- -- # exp_token_idx = token_idxs[start_idx:end_idx] -- # expert_tokens = x[exp_token_idx] -- # expert_out = self.experts[i](expert_tokens) -- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] -- -- # expert_cache = mindspore.mint.scatter_add( -- # expert_cache, -- # 0, -- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), -- # expert_out -- # ) -- -- # return expert_cache -- -- -- --# class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --# """ --# The trick function of adding auxiliary (aux) loss, --# which includes the gradient of the aux loss during backpropagation. --# """ --# @staticmethod --# def forward(ctx, x, loss): --# assert loss.numel() == 1 --# ctx.dtype = loss.dtype --# ctx.required_aux_loss = loss.requires_grad --# return x -- --# @staticmethod --# def backward(ctx, grad_output): --# grad_loss = None --# if ctx.required_aux_loss: --# grad_loss = ops.ones(1, dtype=ctx.dtype) --# return grad_output, grad_loss -- -- --# class DeepseekMoE(nn.Module): --# ''' --# A mixed expert module containing shared experts. --# ''' --# def __init__(self, config): --# super().__init__() --# self.config = config --# self.num_experts_per_tok = config.num_experts_per_tok --# if hasattr(config, "ep_size") and config.ep_size > 1: --# assert config.ep_size == mindspore.mint.distributed.get_world_size() --# self.ep_size = config.ep_size --# self.experts_per_rank = config.n_routed_experts // config.ep_size --# self.ep_rank = mindspore.mint.distributed.get_rank() --# self.experts = nn.ModuleList( --# [ --# ( --# DeepseekMLP( --# config, intermediate_size=config.moe_intermediate_size --# ) --# if i >= self.ep_rank * self.experts_per_rank --# and i < (self.ep_rank + 1) * self.experts_per_rank --# else None --# ) --# for i in range(config.n_routed_experts) --# ] --# ) -- --# else: --# self.ep_size = 1 --# self.experts_per_rank = config.n_routed_experts --# self.ep_rank = 0 --# self.experts = nn.ModuleList( --# [ --# DeepseekMLP( --# config, intermediate_size=config.moe_intermediate_size --# ) --# for i in range(config.n_routed_experts) --# ] --# ) --# self.gate = MoEGate(config) --# if config.n_shared_experts is not None: --# intermediate_size = config.moe_intermediate_size * config.n_shared_experts --# self.shared_experts = DeepseekMLP( --# config=config, intermediate_size=intermediate_size --# ) -- --# def forward(self, hidden_states): --# identity = hidden_states --# orig_shape = hidden_states.shape --# topk_idx, topk_weight, aux_loss = self.gate(hidden_states) --# hidden_states = hidden_states.view(-1, hidden_states.shape[-1]) --# flat_topk_idx = topk_idx.view(-1) --# if self.training: --# hidden_states = hidden_states.repeat_interleave( --# self.num_experts_per_tok, dim=0 --# ) --# y = ops.empty(hidden_states.shape) --# for i, expert in enumerate(self.experts): --# y[flat_topk_idx == i] = expert(hidden_states[flat_topk_idx == i]) --# y = ops.sum(y.view(*topk_weight.shape, -1) * topk_weight.unsqueeze(-1), dim=1) --# y = y.to(hidden_states.dtype).view(*orig_shape) --# # y = AddAuxiliaryLoss.apply(y, aux_loss) --# else: --# # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --# y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape) --# if self.config.n_shared_experts is not None: --# y = y + self.shared_experts(identity) --# return y -- --# # # @mindnlp.core.no_grad() --# # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --# # expert_cache = ops.zeros_like(x) --# # idxs = flat_expert_indices.argsort() --# # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --# # token_idxs = idxs // self.num_experts_per_tok --# # for i, end_idx in enumerate(tokens_per_expert): --# # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --# # if start_idx == end_idx: --# # continue --# # expert = self.experts[i] --# # exp_token_idx = token_idxs[start_idx:end_idx] --# # expert_tokens = x[exp_token_idx] --# # expert_out = expert(expert_tokens) --# # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --# # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out, reduce='sum') --# # return expert_out # expert_cache --# def moe_infer(self, x, topk_ids, topk_weight): --# cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts))) --# cnts.scatter_(1, topk_ids, 1) --# tokens_per_expert = cnts.sum(dim=0) --# idxs = topk_ids.view(-1).argsort() --# sorted_tokens = x[idxs // topk_ids.shape[1]] --# sorted_tokens_shape = sorted_tokens.shape --# if self.ep_size > 1: --# tokens_per_ep_rank = tokens_per_expert.view(self.ep_size, -1).sum(dim=1) --# tokens_per_expert_group = tokens_per_expert.new_empty( --# tokens_per_expert.shape[0] --# ) --# mindspore.mint.distributed.all_to_all_single(tokens_per_expert_group, tokens_per_expert) --# output_splits = ( --# tokens_per_expert_group.view(self.ep_size, -1) --# .sum(1) --# .cpu() --# .numpy() --# .tolist() --# ) --# gathered_tokens = sorted_tokens.new_empty( --# tokens_per_expert_group.sum(dim=0).cpu().item(), sorted_tokens.shape[1] --# ) --# input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist() --# mindspore.mint.distributed.all_to_all( --# list(gathered_tokens.split(output_splits)), --# list(sorted_tokens.split(input_split_sizes)), --# ) --# tokens_per_expert_post_gather = tokens_per_expert_group.view( --# self.ep_size, self.experts_per_rank --# ).sum(dim=0) --# gatherd_idxs = np.zeros(shape=(gathered_tokens.shape[0],), dtype=np.int32) --# s = 0 --# for i, k in enumerate(tokens_per_expert_group.cpu().numpy()): --# gatherd_idxs[s : s + k] = i % self.experts_per_rank --# s += k --# gatherd_idxs = gatherd_idxs.argsort() --# sorted_tokens = gathered_tokens[gatherd_idxs] --# tokens_per_expert = tokens_per_expert_post_gather --# tokens_per_expert = tokens_per_expert.cpu().numpy() --# outputs = [] --# start_idx = 0 --# for i, num_tokens in enumerate(tokens_per_expert): --# end_idx = start_idx + num_tokens --# if num_tokens == 0: --# continue --# expert = self.experts[i + self.ep_rank * self.experts_per_rank] --# tokens_for_this_expert = sorted_tokens[start_idx:end_idx] --# expert_out = expert(tokens_for_this_expert) --# outputs.append(expert_out) --# start_idx = end_idx -- --# outs = ops.cat(outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0) --# if self.ep_size > 1: --# new_x = ops.empty_like(outs) --# new_x[gatherd_idxs] = outs --# gathered_tokens = new_x.new_empty(*sorted_tokens_shape) --# mindspore.mint.distributed.all_to_all( --# list(gathered_tokens.split(input_split_sizes)), --# list(new_x.split(output_splits)), --# ) --# outs = gathered_tokens -- --# new_x = ops.empty_like(outs) --# new_x[idxs] = outs --# final_out = ( --# new_x.view(*topk_ids.shape, -1) --# .type(topk_weight.dtype) --# .mul_(topk_weight.unsqueeze(dim=-1)) --# .sum(dim=1) --# .type(new_x.dtype) --# ) --# return final_out -- -- - # Copied from transformers.models.llama.modeling_llama.repeat_kv - def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: - """ -@@ -1313,10 +671,6 @@ class DeepseekAttention(nn.Module): - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - -- # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) -- # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- # @lwx - query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) - query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) - key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) -@@ -1555,10 +909,6 @@ class DeepseekDecoderLayer(nn.Module): - super().__init__() - self.hidden_size = config.hidden_size - -- # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( -- # config=config, layer_idx=layer_idx -- # ) -- - self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( - config=config, layer_idx=layer_idx - ) -@@ -1774,14 +1124,6 @@ class DeepseekModel(DeepseekPreTrainedModel): - else None - ) - else: -- # 4d mask is passed through the layers -- # attention_mask = _prepare_4d_causal_attention_mask( -- # attention_mask, -- # (batch_size, seq_length), -- # inputs_embeds, -- # past_key_values_length, -- # ) -- #@dwj - attention_mask = get_cached_causal_mask( - attention_mask, - (batch_size, seq_length), -@@ -1869,38 +1211,14 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - self.post_init() - # lwx - self.warm_up = False -- #初始 -- -- # def warmup_moe_model_deep(self): -- # print("[Warmup] DeepSeek-MoE 模型预热开始...") -- # test_texts = [ -- # "warmup short", -- # "This is a medium length warmup sentence for MoE experts. middle middle middle", -- # "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" -- # ] -- # tokenizer = getattr(self, "_warmup_tokenizer", None) -- # if tokenizer is None: -- # from mindnlp.transformers import AutoTokenizer -- # tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) -- # self._warmup_tokenizer = tokenizer -- -- # for text in test_texts: -- # inputs = tokenizer(text, return_tensors="ms") -- # with mindspore._no_grad(): -- # _ = self(**inputs, use_cache=False) -- # print("[Warmup] DeepSeek-MoE 模型预热完成。") -- -+ - def warmup_moe_model_deep(self): - print("[Warmup] DeepSeek-MoE 模型预热开始...") - -- # 直接用 eval.py 默认的 prompts 内容 - warmup_prompts = [ -- "Hello, how are you?", -- "This American studied art at Yale and is the author of multiple popular mystery novels. First name is 'Hillary'. What's the last name?", -- """Summarize the following text: US President Donald Trump has said he is 'not happy' with his Russian counterpart Vladimir Putin, following Moscow's largest aerial attack yet on Ukraine. -- In a rare rebuke, Trump said: "What the hell happened to him? He's killing a lot of people." He later called Putin "absolutely crazy". -- Ukrainian President Volodymyr Zelensky earlier said Washington's "silence" over recent Russian attacks was encouraging Putin, urging "strong pressure" - including tougher sanctions - on Moscow. -- """ -+ "warmup short", -+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", -+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" - ] - - tokenizer = getattr(self, "_warmup_tokenizer", None) -@@ -1909,13 +1227,11 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) - self._warmup_tokenizer = tokenizer - -- # 跑一遍 warmup_prompts,触发路由逻辑 - for text in warmup_prompts: - inputs = tokenizer(text, return_tensors="ms") - with mindspore._no_grad(): - _ = self(**inputs, use_cache=False) - -- # 这里可以加按需缓存逻辑,避免显存 OOM - from mindnlp.transformers.models.deepseek.modeling_deepseek import DeepseekMoE - for module in self.modules(): - if isinstance(module, DeepseekMoE): -@@ -2051,15 +1367,13 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - - loss = None - if labels is not None: -- # Shift so that tokens < n predict n - shift_logits = logits[..., :-1, :] - shift_labels = labels[..., 1:] -- # Flatten the tokens -+ - loss_fct = nn.CrossEntropyLoss() - shift_logits = shift_logits.view(-1, self.config.vocab_size) - shift_labels = shift_labels.view(-1) -- # Enable model parallelism -- # shift_labels = shift_labels.to(shift_logits) -+ - loss = loss_fct(shift_logits, shift_labels) - - if not return_dict: -@@ -2091,22 +1405,16 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - cache_length = past_length = past_key_values[0][0].shape[2] - max_cache_length = None - -- # Keep only the unprocessed tokens: -- # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where -- # some of the inputs are exclusivelly passed as part of the cache (e.g. when passing input_embeds as -- # input) -+ - if ( - attention_mask is not None - and attention_mask.shape[1] > input_ids.shape[1] - ): - input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :] -- # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard -- # input_ids based on the past_length. -+ - elif past_length < input_ids.shape[1]: - input_ids = input_ids[:, past_length:] -- # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens. - -- # If we are about to go beyond the maximum cache length, we need to crop the input attention mask. - if ( - max_cache_length is not None - and attention_mask is not None -@@ -2116,14 +1424,11 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): - - position_ids = kwargs.get("position_ids", None) - if attention_mask is not None and position_ids is None: -- # create position_ids on the fly for batch generation - position_ids = attention_mask.to(mindspore.int32).cumsum(-1) - 1 -- # position_ids.masked_fill_(attention_mask == 0, 1) - position_ids = ops.masked_fill(position_ids, attention_mask == 0, 1) - if past_key_values: - position_ids = position_ids[:, -input_ids.shape[1] :] - -- # if `inputs_embeds` are passed, we only want to use them in the 1st generation step - if inputs_embeds is not None and past_key_values is None: - model_inputs = {"inputs_embeds": inputs_embeds} - else: -diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -index 6566958b..d689e36d 100644 ---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py -@@ -63,18 +63,14 @@ def get_cached_causal_mask_with_cache_position( - """ - 带缓存的 causal mask 构造函数 - """ -- # q_len 是当前 query 长度 - q_len = sequence_length -- # kv_len 是 target_length - kv_len = target_length - -- # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 - key = (batch_size, q_len, kv_len, dtype, min_dtype) - - if key in _causal_mask_cache: - return _causal_mask_cache[key] - -- # 调用原来的 mask 构造逻辑 - causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( - attention_mask, - sequence_length=sequence_length, -@@ -84,7 +80,6 @@ def get_cached_causal_mask_with_cache_position( - cache_position=cache_position, - batch_size=batch_size, - ) -- # 缓存结果 - _causal_mask_cache[key] = causal_mask - return causal_mask - -@@ -224,11 +219,6 @@ class Qwen2MoeRMSNorm(nn.Module): - self.variance_epsilon = eps - - def forward(self, hidden_states): -- # @dwj -- # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -- # @lwx -- # if not self.training : -- # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) - input_dtype = hidden_states.dtype - hidden_states = hidden_states.to(mindspore.float32) - variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) -@@ -279,9 +269,6 @@ class Qwen2MoeRotaryEmbedding(nn.Module): - # Copied from transformers.models.llama.modeling_llama.rotate_half - def rotate_half(x): - """Rotates half the hidden dims of the input.""" -- # x1 = x[..., : x.shape[-1] // 2] -- # x2 = x[..., x.shape[-1] // 2 :] -- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] - x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) - return ops.cat((-x2, x1), dim=-1) - -@@ -329,21 +316,8 @@ class Qwen2MoeMLP(nn.Module): - self.act_fn = ACT2FN[config.hidden_act] - - def forward(self, x): -- - return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) -- # @lwx -- # gate_up_output = self.gate_up_proj(x) -- # swiglu_output = mindspore.ops.swiglu(gate_up_output) -- # return self.down_proj(swiglu_output) -- -- # def forward(self, x): -- # gate_proj_out = self.gate_proj(x) -- # up_proj_out = self.up_proj(x) -- # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) -- # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) -- # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out -- # return self.down_proj(swiglu_out) -- -+ - # Copied from transformers.models.llama.modeling_llama.repeat_kv - def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: - """ -@@ -356,164 +330,6 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: - hidden_states = hidden_states[:, :, None, :, :].broadcast_to((batch, num_key_value_heads, n_rep, slen, head_dim)) - return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim) - -- --# Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --# class Qwen2MoeAttention(nn.Module): --# """ --# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --# and "Generating Long Sequences with Sparse Transformers". --# """ -- --# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --# super().__init__() --# self.config = config --# self.layer_idx = layer_idx --# if layer_idx is None: --# logger.warning_once( --# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --# "when creating this class." --# ) -- --# self.hidden_size = config.hidden_size --# self.num_heads = config.num_attention_heads --# self.head_dim = self.hidden_size // self.num_heads --# self.num_key_value_heads = config.num_key_value_heads --# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --# self.max_position_embeddings = config.max_position_embeddings --# self.rope_theta = config.rope_theta --# self.is_causal = True --# self.attention_dropout = config.attention_dropout -- --# if (self.head_dim * self.num_heads) != self.hidden_size: --# raise ValueError( --# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --# f" and `num_heads`: {self.num_heads})." --# ) --# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -- --# self.rotary_emb = Qwen2MoeRotaryEmbedding( --# self.head_dim, --# max_position_embeddings=self.max_position_embeddings, --# base=self.rope_theta, --# ) -- --# def forward( --# self, --# hidden_states: mindspore.Tensor, --# attention_mask: Optional[mindspore.Tensor] = None, --# position_ids: Optional[mindspore.Tensor] = None, --# past_key_value: Optional[Cache] = None, --# output_attentions: bool = False, --# use_cache: bool = False, --# cache_position: Optional[mindspore.Tensor] = None, --# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- -- -- --# bsz, q_len, _ = hidden_states.shape -- --# query_states = self.q_proj(hidden_states) --# key_states = self.k_proj(hidden_states) --# value_states = self.v_proj(hidden_states) -- --# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) -- --# kv_seq_len = key_states.shape[-2] --# if past_key_value is not None: --# if self.layer_idx is None: --# raise ValueError( --# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --# "with a layer index." --# ) --# if isinstance(past_key_value, StaticCache): --# kv_seq_len = key_states.shape[-2] --# else: --# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- --# if past_key_value is not None: --# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) -- --# if isinstance(past_key_value, StaticCache): --# kv_seq_len = key_states.shape[-2] -- --# # repeat k/v heads if n_kv_heads < n_heads --# key_states = repeat_kv(key_states, self.num_key_value_groups) --# value_states = repeat_kv(value_states, self.num_key_value_groups) -- --# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -- --# if attention_mask is not None: --# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --# attn_weights = attn_weights + causal_mask -- --# # upcast attention to fp32 --# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --# attn_output = ops.matmul(attn_weights, value_states) -- --# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --# raise ValueError( --# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --# f" {attn_output.shape}" --# ) -- --# attn_output = ops.transpose(attn_output, 1, 2) --# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -- --# attn_output = self.o_proj(attn_output) --# # @lwx -- --# # max_seq_len = self.max_position_embeddings # 2048 -- --# # if attention_mask is not None: --# # # attention_mask: [B, 1, Sq, Sk] --# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask -- --# # # pad 到 [max_seq_len, max_seq_len] --# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --# # global_attention_mask = padded_mask --# # else: --# # global_attention_mask = None -- -- --# # sparse_mode=3 --# # attn_output = mindspore.ops.flash_attention_score( --# # query=query_states, --# # key=key_states, --# # value=value_states, --# # real_shift=None, --# # padding_mask=None, -- --# # head_num=self.num_heads, --# # attn_mask=global_attention_mask, --# # keep_prob=1.0 - self.attention_dropout, --# # scalar_value=1.0 / math.sqrt(self.head_dim), --# # input_layout="BNSD", --# # pre_tokens=2147483647, --# # next_tokens=2147483647, --# # inner_precise=0, --# # drop_mask=None, --# # prefix=None, --# # actual_seq_qlen=None, --# # actual_seq_kvlen=None, --# # sparse_mode=sparse_mode, --# # ) --# if not output_attentions: --# attn_weights = None -- --# return attn_output, attn_weights, past_key_value -- - class Qwen2MoeAttention(nn.Module): - """ - 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -@@ -594,10 +410,8 @@ class Qwen2MoeAttention(nn.Module): - cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} - key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) - -- # --- 2. 动态调度核心注意力计算 --- - global Long_Prompt - if Long_Prompt >= 1: -- # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- - fa_attention_mask = None - if attention_mask is not None: - mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] -@@ -613,7 +427,7 @@ class Qwen2MoeAttention(nn.Module): - scalar_value=1.0 / math.sqrt(self.head_dim), - input_layout="BNSD", - sparse_mode=0, -- inner_precise=0 # 使用高精度模式以对齐 Eager 结果 -+ inner_precise=0 - ) - - attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -@@ -623,7 +437,6 @@ class Qwen2MoeAttention(nn.Module): - logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") - - else: -- # --- Eager Attention 路径 (用于短序列和解码) --- - key_states = repeat_kv(key_states, self.num_key_value_groups) - value_states = repeat_kv(value_states, self.num_key_value_groups) - -@@ -651,252 +464,6 @@ class Qwen2MoeAttention(nn.Module): - - return attn_output, attn_weights, past_key_value - --# class Qwen2MoeFlashAttention(nn.Module): --# """ --# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 -- --# 关键改动: --# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --# 直接传入原始的 key 和 value 张量效率更高。 --# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --# """ --# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --# super().__init__() --# self.config = config --# self.layer_idx = layer_idx --# self.hidden_size = config.hidden_size --# self.num_heads = config.num_attention_heads --# self.head_dim = self.hidden_size // self.num_heads --# self.num_key_value_heads = config.num_key_value_heads --# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --# self.max_position_embeddings = config.max_position_embeddings --# self.rope_theta = config.rope_theta --# self.attention_dropout = config.attention_dropout -- --# if (self.head_dim * self.num_heads) != self.hidden_size: --# raise ValueError( --# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --# ) -- --# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) -- --# self.rotary_emb = Qwen2MoeRotaryEmbedding( --# self.head_dim, --# max_position_embeddings=self.max_position_embeddings, --# base=self.rope_theta, --# ) -- --# def forward( --# self, --# hidden_states: mindspore.Tensor, --# attention_mask: Optional[mindspore.Tensor] = None, --# position_ids: Optional[mindspore.Tensor] = None, --# past_key_value: Optional[Cache] = None, --# output_attentions: bool = False, --# use_cache: bool = False, --# cache_position: Optional[mindspore.Tensor] = None, --# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- --# bsz, q_len, _ = hidden_states.shape -- --# # 1. 线性投射 Q, K, V --# query_states = self.q_proj(hidden_states) --# key_states = self.k_proj(hidden_states) --# value_states = self.v_proj(hidden_states) -- --# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --# # query: [B, S, H*D] -> [B, N1, S, D] --# # key/val: [B, S, H2*D] -> [B, N2, S, D] --# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- --# # 3. RoPE 旋转位置编码 --# kv_seq_len = key_states.shape[-2] --# if past_key_value is not None: --# if self.layer_idx is None: --# raise ValueError( --# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --# "with a layer index." --# ) --# # 对于 StaticCache,需要特殊处理 kv_seq_len --# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --# if isinstance(past_key_value, StaticCache) and cache_position is not None: --# # 使用 cache_position 的长度来确定实际的 kv_seq_len --# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --# # 临时解决方案:使用 cache_position 的最大值(如果可能) --# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --# if cache_position.shape[0] == 1: --# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --# kv_seq_len = past_seen_tokens + 1 --# else: --# # prefill 阶段:cache_position 是范围,使用其长度 --# kv_seq_len = cache_position.shape[0] + past_seen_tokens --# else: --# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- --# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- --# # 4. KV 缓存更新 --# if past_key_value is not None: --# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --# key_states, value_states = past_key_value.update( --# key_states, value_states, self.layer_idx, cache_kwargs --# ) -- --# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --# if isinstance(past_key_value, StaticCache) and cache_position is not None: --# if cache_position.shape[0] == 1: --# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --# kv_seq_len = key_states.shape[-2] -- --# # 5. [重要] 准备 Attention Mask --# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --# fa_attention_mask = None --# if attention_mask is not None: --# # 截取与当前key长度匹配的部分 --# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --# # 转换为布尔类型: 大负数 -> True, 0 -> False --# fa_attention_mask = (mask_slice != 0) -- --# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --# input_dtype = query_states.dtype --# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --# query_states = query_states.to(mindspore.float16) --# key_states = key_states.to(mindspore.float16) --# value_states = value_states.to(mindspore.float16) -- --# # 6. [核心] 调用 flash_attention_score 算子 --# # - 无需手动 repeat_kv, 算子原生支持 GQA --# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --# attn_output = mindspore.ops.flash_attention_score( --# query=query_states, --# key=key_states, --# value=value_states, --# head_num=self.num_heads, # 传入Q的头数(N1) --# attn_mask=fa_attention_mask, --# keep_prob=1.0 - self.attention_dropout, --# scalar_value=1.0 / math.sqrt(self.head_dim), --# input_layout="BNSD", --# sparse_mode=0 # 使用 defaultMask 模式 --# ) -- --# # 恢复原始数据类型 --# attn_output = attn_output.to(input_dtype) -- --# # 7. 调整输出形状 --# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --# attn_output = self.o_proj(attn_output) -- --# # FlashAttention 算子不直接返回注意力权重矩阵 --# attn_weights = None --# if output_attentions: --# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -- --# return attn_output, attn_weights, past_key_value -- --# # def forward( --# # self, --# # hidden_states: mindspore.Tensor, --# # attention_mask: Optional[mindspore.Tensor] = None, --# # position_ids: Optional[mindspore.Tensor] = None, --# # past_key_value: Optional[Cache] = None, --# # output_attentions: bool = False, --# # use_cache: bool = False, --# # cache_position: Optional[mindspore.Tensor] = None, --# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: -- --# # bsz, q_len, _ = hidden_states.shape -- --# # # 1. 线性投射 Q, K, V --# # query_states = self.q_proj(hidden_states) --# # key_states = self.k_proj(hidden_states) --# # value_states = self.v_proj(hidden_states) -- --# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- --# # # 3. RoPE 旋转位置编码 --# # kv_seq_len = key_states.shape[-2] --# # if past_key_value is not None: --# # if self.layer_idx is None: --# # raise ValueError( --# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --# # "with a layer index." --# # ) --# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- --# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- --# # # 4. KV 缓存更新 --# # if past_key_value is not None: --# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --# # key_states, value_states = past_key_value.update( --# # key_states, value_states, self.layer_idx, cache_kwargs --# # ) -- --# # # 5. 准备 Attention Mask --# # fa_attention_mask = None --# # if attention_mask is not None: --# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --# # fa_attention_mask = (mask_slice != 0) -- --# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --# # input_dtype = query_states.dtype -- --# # # 6. [核心] 调用 flash_attention_score 算子 --# # attn_output = mindspore.ops.flash_attention_score( --# # query=query_states, --# # key=key_states, --# # value=value_states, --# # head_num=self.num_heads, --# # attn_mask=fa_attention_mask, --# # keep_prob=1.0 - self.attention_dropout, --# # scalar_value=1.0 / math.sqrt(self.head_dim), --# # input_layout="BNSD", --# # sparse_mode=0, --# # # <--- 修改点 2: 启用内部高精度计算 --- --# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --# # inner_precise=1 --# # ) -- --# # # 恢复原始数据类型 --# # attn_output = attn_output.to(input_dtype) -- --# # # 7. 调整输出形状 --# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --# # attn_output = self.o_proj(attn_output) -- --# # attn_weights = None --# # if output_attentions: --# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") -- --# # return attn_output, attn_weights, past_key_value -- - - class Qwen2MoeFlashAttention(nn.Module): - """ -@@ -948,17 +515,14 @@ class Qwen2MoeFlashAttention(nn.Module): - - bsz, q_len, _ = hidden_states.shape - -- # 1. 线性投射 Q, K, V - query_states = self.q_proj(hidden_states) - key_states = self.k_proj(hidden_states) - value_states = self.v_proj(hidden_states) - -- # 2. 调整形状以匹配 BNSD 布局 - query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) - key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) - value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- -- # 3. RoPE 和 KV 缓存 -+ - kv_seq_len = key_states.shape[-2] - if past_key_value is not None: - kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -@@ -970,13 +534,11 @@ class Qwen2MoeFlashAttention(nn.Module): - cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} - key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) - -- # 4. 准备 Attention Mask - fa_attention_mask = None - if attention_mask is not None: - mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] - fa_attention_mask = (mask_slice != 0) - -- # 5. 【核心】调用 flash_attention_score,关闭高精度累加 - attn_output = mindspore.ops.flash_attention_score( - query=query_states, - key=key_states, -@@ -987,14 +549,12 @@ class Qwen2MoeFlashAttention(nn.Module): - scalar_value=1.0 / math.sqrt(self.head_dim), - input_layout="BNSD", - sparse_mode=0, -- inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -+ inner_precise=0 - ) - -- # 6. 调整输出形状 - attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) - attn_output = self.o_proj(attn_output) - -- # 7. 返回结果 - attn_weights = None - if output_attentions: - logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -@@ -1007,88 +567,7 @@ QWEN2MOE_ATTENTION_CLASSES = { - "flash-attention": Qwen2MoeFlashAttention, - } - -- --# class Qwen2MoeSparseMoeBlock(nn.Module): --# def __init__(self, config): --# super().__init__() --# self.num_experts = config.num_experts --# self.top_k = config.num_experts_per_tok --# self.norm_topk_prob = config.norm_topk_prob -- --# # gating --# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --# self.experts = nn.ModuleList( --# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --# ) -- --# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --# #@dwj --# # 只遍历激活的专家,而非全部专家 --# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --# batch_size, sequence_length, hidden_dim = hidden_states.shape --# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --# num_tokens = hidden_states_reshaped.shape[0] -- --# router_logits = self.gate(hidden_states_reshaped) --# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --# if self.norm_topk_prob: --# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --# routing_weights = routing_weights.to(hidden_states.dtype) -- --# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --# flat_selected_experts = selected_experts.flatten() -- --# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --# token_indices = broadcasted_token_indices.flatten() -- --# active_experts = ops.unique(flat_selected_experts) -- --# for expert_idx_tensor in active_experts: --# expert_idx = expert_idx_tensor.item() --# expert_layer = self.experts[expert_idx] -- --# mask = (flat_selected_experts == expert_idx_tensor) --# selected_token_indices = token_indices[mask] --# selected_routing_weights = routing_weights.flatten()[mask] -- --# current_states = hidden_states_reshaped[selected_token_indices] -- --# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) -- --# final_hidden_states = final_hidden_states.index_add( --# dim=0, --# index=selected_token_indices, --# source=expert_output.to(hidden_states.dtype) --# ) -- --# shared_expert_output = self.shared_expert(hidden_states_reshaped) --# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -- --# final_hidden_states = final_hidden_states + shared_expert_output --# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) -- --# return final_hidden_states, router_logits -- -- - class Qwen2MoeSparseMoeBlock(nn.Module): -- """ -- 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` -- 控制的顶级推理策略: -- -- - if Long_Prompt is True: 【精度优先模式】 -- 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 -- 适用于需要严格可复现性的长序列任务。 -- -- - if Long_Prompt is False: 【速度优先模式】 -- 采用业界最强的性能组合: -- - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 -- - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 -- """ - def __init__(self, config: Qwen2MoeConfig): - super().__init__() - self.num_experts = config.num_experts -@@ -1102,7 +581,6 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) - self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) - -- # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- - @no_grad() - def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - original_dtype = hidden_states.dtype -@@ -1119,39 +597,8 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) - return moe_output_fp32.squeeze(1).to(original_dtype) - -- -- # @no_grad() -- # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- # num_tokens, _ = hidden_states.shape -- # flat_selected_experts = selected_experts.flatten() -- # sorted_expert_indices = flat_selected_experts.argsort() -- # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) -- # original_token_indices = sorted_expert_indices // self.top_k -- # moe_output = ops.zeros_like(hidden_states) -- # current_token_offset = 0 -- # for i in range(self.num_experts): -- # expert_token_count = tokens_per_expert[i] - current_token_offset -- # if expert_token_count == 0: -- # continue -- # end_offset = current_token_offset + expert_token_count -- # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] -- # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] -- # expert_hidden_states = hidden_states[expert_original_token_indices] -- # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] -- # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) -- # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) -- # current_token_offset += expert_token_count -- # return moe_output -- -- # baseline - @no_grad() - def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- """ -- 优化版 MoE prefill (速度优先模式): -- - 批量张量化处理同一个 expert 的所有 token -- - 跳过无 token 的专家 -- - 保持结果完全一致 -- """ - moe_output = ops.zeros_like(hidden_states) - - flat_selected_experts = selected_experts.flatten() -@@ -1188,56 +635,39 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - - @no_grad() - def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- """ -- 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add -- 逻辑: -- 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 -- 2. 每个 expert 一次性处理其全部 token -- 3. 最后一次 scatter_add 回到原 token 顺序 -- """ -- - num_tokens = hidden_states.shape[0] - hidden_size = hidden_states.shape[-1] - -- # 展平为一维 -- flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] -- flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] -+ flat_selected_experts = selected_experts.flatten() -+ flat_routing_weights = routing_weights.flatten() - -- # 按 expert 排序 - idxs = flat_selected_experts.argsort() -- sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 -- sorted_token_indices = idxs // self.top_k # 对应原 token ID -+ sorted_expert_indices = flat_selected_experts[idxs] -+ sorted_token_indices = idxs // self.top_k - -- # 排好序的输入向量(连续内存) - permuted_tokens = hidden_states[sorted_token_indices] - -- # 排好序的权重 - sorted_weights = flat_routing_weights[idxs] - -- # 每个 expert 对应的 token 数量 - tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) - -- # 存放专家输出(与 permuted_tokens 对应顺序保持一致) - expert_outputs = ops.zeros_like(permuted_tokens) - -- ptr = 0 # 指向当前切片的起点 -+ ptr = 0 - for expert_id, count in enumerate(tokens_per_expert.tolist()): - if count == 0: - continue - - token_slice = slice(ptr, ptr + count) -- expert_tokens = permuted_tokens[token_slice] # 连续切片 -+ expert_tokens = permuted_tokens[token_slice] - -- # 执行专家 MLP - expert_out = self.experts[expert_id](expert_tokens) - - expert_outputs[token_slice] = expert_out - ptr += count - -- # 按权重缩放 - scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) - -- # 回写到原 token 顺序 (单次 scatter_add) - moe_output = mindspore.mint.scatter_add( - ops.zeros_like(hidden_states), - 0, -@@ -1247,10 +677,6 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - - return moe_output - -- -- -- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -- - @no_grad() - def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: - moe_output = ops.zeros_like(hidden_states) -@@ -1282,31 +708,12 @@ class Qwen2MoeSparseMoeBlock(nn.Module): - if self.norm_topk_prob: - routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) - -- moe_output = None -- # if Long_Prompt==0: -- # # --- 精度优先模式 (ACCURACY MODE) --- -- # routing_weights_casted = routing_weights.to(hidden_states.dtype) -- # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) -- # else: -- # # --- 速度优先模式 (SPEED MODE) --- -- # routing_weights_casted = routing_weights.to(hidden_states.dtype) -- # if sequence_length == 1: -- # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -- # else: -- # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -- - routing_weights_casted = routing_weights.to(hidden_states.dtype) - if sequence_length == 1: - moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) - else: -- # if Long_Prompt == 1: -- # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -- # else: -- # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) - moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) - -- -- # 3. 共享专家计算与合并 (所有模式通用) - gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ - F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) - -@@ -1320,11 +727,6 @@ class Qwen2MoeDecoderLayer(nn.Module): - def __init__(self, config: Qwen2MoeConfig, layer_idx: int): - super().__init__() - self.hidden_size = config.hidden_size -- -- # if Long_Prompt == 2: -- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -- # else: -- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - - self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) - -@@ -1421,8 +823,6 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): - _no_split_modules = ["Qwen2MoeDecoderLayer"] - _skip_keys_device_placement = "past_key_values" - _supports_cache_class = True --#lwx -- # _supports_static_cache = True - - def _init_weights(self, module): - std = self.config.initializer_range -@@ -1576,7 +976,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): - - hidden_states = self.norm(hidden_states) - -- # add hidden states from the last decoder layer - if output_hidden_states: - all_hidden_states += (hidden_states,) - -@@ -1598,7 +997,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): - router_logits=all_router_logits, - ) - -- # Copied from transformers.models.llama.modeling_llama.LlamaModel._update_causal_mask - def _update_causal_mask( - self, - attention_mask: mindspore.Tensor, -@@ -1626,17 +1024,6 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): - else past_seen_tokens + sequence_length + 1 - ) - -- # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). -- # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( -- # attention_mask, -- # sequence_length=sequence_length, -- # target_length=target_length, -- # dtype=dtype, -- # min_dtype=min_dtype, -- # cache_position=cache_position, -- # batch_size=input_tensor.shape[0], -- # ) -- #@dwj - causal_mask = get_cached_causal_mask_with_cache_position( - attention_mask, - sequence_length=sequence_length, -@@ -1664,9 +1051,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - self.num_experts_per_tok = config.num_experts_per_tok - # Initialize weights and apply final processing - self.post_init() -- # @lwx -- # if self.generation_config is not None and self.generation_config.cache_implementation is None: -- # self.generation_config.cache_implementation = "static" -+ - self._warmed_up = False - - def warmup_moe_model(self): -@@ -1890,17 +1275,6 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - dtype = self.lm_head.weight.dtype - min_dtype = float(ops.finfo(dtype).min) - -- # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( -- # attention_mask, -- # sequence_length=sequence_length, -- # target_length=past_key_values.get_max_length(), -- # dtype=dtype, -- # min_dtype=min_dtype, -- # cache_position=cache_position, -- # batch_size=batch_size, -- # ) -- -- #@dwj - attention_mask = get_cached_causal_mask_with_cache_position( - attention_mask, - sequence_length=sequence_length, -@@ -1922,363 +1296,6 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): - ) - return model_inputs - --# @lwx -- # def _decode_one_tokens_logits( -- # self, -- # cur_token: mindspore.Tensor, -- # input_pos: Optional[mindspore.Tensor], -- # cache_position: mindspore.Tensor, -- # past_key_values: StaticCache, -- # ) -> mindspore.Tensor: -- # """ -- # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) -- -- # Args: -- # cur_token: 当前要处理的token,shape为(batch_size, 1) -- # input_pos: 输入位置信息,可选 -- # cache_position: 当前token在cache中的位置,shape为(1,) -- # past_key_values: StaticCache对象,存储之前的key-value状态 -- -- # Returns: -- # logits: 当前token的logits,shape为(batch_size, vocab_size) -- # """ -- # # 调用JIT编译的版本 -- # return self.get_decode_one_tokens_logits( -- # cur_token=cur_token, -- # input_pos=input_pos, -- # cache_position=cache_position, -- # past_key_values=past_key_values, -- # ) -- -- # @mindspore.jit(jit_level='O1') -- # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): -- # """ -- # JIT编译的函数,用于高效的单token解码 -- # 使用JIT编译优化以支持静态shape和高效执行 -- -- # 注意:直接调用forward方法,避免经过_call_impl中的try-except -- # """ -- # outputs = self.model.forward( -- # input_ids=cur_token, -- # position_ids=input_pos, -- # cache_position=cache_position, -- # past_key_values=past_key_values, -- # use_cache=True, -- # return_dict=False, -- # ) -- -- # hidden_states = outputs[0] -- # logits = self.lm_head.forward(hidden_states) -- # logits = logits.float() -- -- # return logits[:, -1, :] -- -- # def _sample( -- # self, -- # input_ids: mindspore.Tensor, -- # logits_processor, -- # stopping_criteria, -- # generation_config, -- # synced_devices: bool, -- # streamer=None, -- # logits_warper=None, -- # **model_kwargs, -- # ): -- # """ -- # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 -- # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 -- # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 -- # """ -- # from ...generation.logits_process import LogitsProcessorList -- # from ...generation.stopping_criteria import StoppingCriteriaList -- # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput -- # from mindnlp.core import nn, ops, no_grad -- # import numpy as np -- -- # # 检查是否使用 StaticCache -- # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 -- # # 否则,直接调用父类方法 -- # past_key_values = model_kwargs.get("past_key_values") -- # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") -- -- # if not isinstance(past_key_values, StaticCache): -- # # 不使用 StaticCache,直接调用父类方法 -- # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") -- # return super()._sample( -- # input_ids=input_ids, -- # logits_processor=logits_processor, -- # stopping_criteria=stopping_criteria, -- # generation_config=generation_config, -- # synced_devices=synced_devices, -- # streamer=streamer, -- # logits_warper=logits_warper, -- # **model_kwargs, -- # ) -- -- # # 使用 StaticCache,进入自定义循环 -- # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) -- # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 -- # pad_token_id = generation_config._pad_token_tensor -- # output_attentions = generation_config.output_attentions -- # output_hidden_states = generation_config.output_hidden_states -- # output_scores = generation_config.output_scores -- # output_logits = generation_config.output_logits -- # return_dict_in_generate = generation_config.return_dict_in_generate -- # max_length = generation_config.max_length -- # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) -- # do_sample = generation_config.do_sample -- -- # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): -- # raise ValueError( -- # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " -- # f"{logits_warper})." -- # ) -- -- # # init attention / hidden states / scores tuples -- # scores = () if (return_dict_in_generate and output_scores) else None -- # raw_logits = () if (return_dict_in_generate and output_logits) else None -- # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None -- # cross_attentions = () if (return_dict_in_generate and output_attentions) else None -- # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None -- -- # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states -- # if return_dict_in_generate and self.config.is_encoder_decoder: -- # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None -- # encoder_hidden_states = ( -- # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None -- # ) -- -- # # keep track of which sequences are already finished -- # batch_size, cur_len = input_ids.shape -- # this_peer_finished = False -- # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) -- # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) -- -- # time_record = [] -- # from ....utils.testing_utils import parse_flag_from_env -- # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) -- -- # while self._has_unfinished_sequences( -- # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length -- # ): -- # if _record_time: -- # import time as time_module -- # infer_start = time_module.time() -- -- # # prepare model inputs -- # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) -- -- # # prepare variable output controls -- # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) -- # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) -- -- # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 -- # cur_cache_position = model_inputs.get("cache_position") -- # cur_past_key_values = model_inputs.get("past_key_values") -- # cur_input_ids = model_inputs.get("input_ids") -- -- # if (isinstance(cur_past_key_values, StaticCache) and -- # cur_cache_position is not None and -- # len(cur_cache_position.shape) > 0 and -- # cur_cache_position.shape[0] == 1 and -- # cur_input_ids is not None and -- # cur_input_ids.shape[1] == 1): -- # # 使用 JIT 优化的单 token 解码 -- # # 简单判断方法:首次调用时打印(JIT编译需要时间) -- # if not hasattr(self, '_jit_used'): -- # self._jit_used = False -- # print("[JIT] ✓ JIT optimized path activated (first call will compile)") -- -- # next_token_logits = self.get_decode_one_tokens_logits( -- # cur_token=cur_input_ids, -- # input_pos=model_inputs.get("position_ids"), -- # cache_position=cur_cache_position, -- # past_key_values=cur_past_key_values, -- # ) -- -- # # 标记已使用JIT(用于后续判断) -- # if not self._jit_used: -- # self._jit_used = True -- -- # # 构造兼容的输出对象 -- # class JitOptimizedOutput: -- # def __init__(self, logits, config): -- # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits -- # self.config = config -- # # 对于 JIT 优化路径,这些属性通常不需要 -- # self.decoder_attentions = None if config.is_encoder_decoder else None -- # self.attentions = None if not config.is_encoder_decoder else None -- # self.cross_attentions = None -- # self.decoder_hidden_states = None if config.is_encoder_decoder else None -- # self.hidden_states = None if not config.is_encoder_decoder else None -- -- # outputs = JitOptimizedOutput(next_token_logits, self.config) -- # else: -- # # 标准 forward 调用(首次prefill阶段或非StaticCache) -- # outputs = self(**model_inputs, return_dict=True) -- -- # if synced_devices and this_peer_finished: -- # continue -- -- # # Clone is needed to avoid keeping a hanging ref to outputs.logits -- # next_token_logits = outputs.logits[:, -1, :] -- -- # # pre-process distribution -- # next_token_scores = logits_processor(input_ids, next_token_logits) -- # if do_sample: -- # next_token_scores = logits_warper(input_ids, next_token_scores) -- -- # # Store scores, attentions and hidden_states when required -- # if return_dict_in_generate: -- # if output_scores: -- # scores += (next_token_scores,) -- # if output_logits: -- # raw_logits += (next_token_logits,) -- # if output_attentions: -- # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions -- # decoder_attentions += (attn,) if attn is not None else (None,) -- # if self.config.is_encoder_decoder: -- # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) -- -- # if output_hidden_states: -- # hidden = ( -- # outputs.decoder_hidden_states -- # if self.config.is_encoder_decoder -- # else outputs.hidden_states -- # ) -- # decoder_hidden_states += (hidden,) if hidden is not None else (None,) -- -- # # token selection -- # if do_sample: -- # probs = nn.functional.softmax(next_token_scores, dim=-1) -- # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) -- # else: -- # next_tokens = ops.argmax(next_token_scores, dim=-1) -- -- # # finished sentences should have their next token be a padding token -- # if has_eos_stopping_criteria: -- # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) -- -- # # update generated ids, model inputs, and length for next step -- # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) -- # if streamer is not None: -- # streamer.put(next_tokens) -- -- # model_kwargs = self._update_model_kwargs_for_generation( -- # outputs, -- # model_kwargs, -- # is_encoder_decoder=self.config.is_encoder_decoder, -- # ) -- -- # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) -- # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 -- # cur_len += 1 -- -- # if _record_time: -- # import time as time_module -- # infer_stop = time_module.time() -- # time_record.append(infer_stop - infer_start) -- -- # del outputs -- -- # average_infer_time = None -- # if time_record: -- # if len(time_record) > 1: -- # time_record.pop(0) -- # average_infer_time = sum(time_record) / len(time_record) -- # print(f'average inference time is: {average_infer_time}') -- # print(f'inference time record: {time_record}') -- -- # if streamer is not None: -- # streamer.end() -- -- # # 简单判断:打印是否使用了JIT路径 -- # if hasattr(self, '_jit_used') and self._jit_used: -- # print("[JIT] ✓ JIT optimization was used during generation") -- # else: -- # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") -- -- # if return_dict_in_generate: -- # if self.config.is_encoder_decoder: -- # return GenerateEncoderDecoderOutput( -- # sequences=input_ids, -- # scores=scores, -- # logits=raw_logits, -- # encoder_attentions=encoder_attentions, -- # encoder_hidden_states=encoder_hidden_states, -- # decoder_attentions=decoder_attentions, -- # cross_attentions=cross_attentions, -- # decoder_hidden_states=decoder_hidden_states, -- # past_key_values=model_kwargs.get("past_key_values"), -- # average_infer_time=average_infer_time -- # ) -- # else: -- # return GenerateDecoderOnlyOutput( -- # sequences=input_ids, -- # scores=scores, -- # logits=raw_logits, -- # attentions=decoder_attentions, -- # hidden_states=decoder_hidden_states, -- # past_key_values=model_kwargs.get("past_key_values"), -- # average_infer_time=average_infer_time -- # ) -- # else: -- # return input_ids -- -- # def _prepare_cache_for_generation( -- # self, -- # generation_config, -- # model_kwargs, -- # assistant_model, -- # batch_size, -- # max_cache_length, -- # ): -- # if generation_config.cache_implementation is None and self._supports_static_cache: -- # generation_config.cache_implementation = "static" -- # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") -- -- # if generation_config.cache_implementation == "static": -- # base_required_from_max_length = generation_config.max_length + 1 -- # base_required = max(max_cache_length, base_required_from_max_length) -- # min_cache_size = 50 -- # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -- # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) -- # else: -- # max_cache_length = max(base_required, min_cache_size) -- -- # original_max_cache_length = max_cache_length -- # print(f"[JIT] StaticCache max_cache_length calculation:") -- # print(f" - input max_cache_length: {original_max_cache_length}") -- # print(f" - generation_config.max_length: {generation_config.max_length}") -- # print(f" - base_required_from_max_length: {base_required_from_max_length}") -- # print(f" - final max_cache_length: {max_cache_length}") -- -- # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: -- # if max_cache_length > self.config.max_position_embeddings: -- # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") -- -- # result = super()._prepare_cache_for_generation( -- # generation_config=generation_config, -- # model_kwargs=model_kwargs, -- # assistant_model=assistant_model, -- # batch_size=batch_size, -- # max_cache_length=max_cache_length, -- # ) -- -- # if generation_config.cache_implementation == "static": -- # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" -- # created_cache = model_kwargs.get(cache_name) -- # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): -- # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") -- # if created_cache.max_cache_len < generation_config.max_length: -- # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") -- -- # return result -- -- -- -- -- - # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE - class Qwen2MoeForSequenceClassification(Qwen2MoePreTrainedModel): - def __init__(self, config): -diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch -deleted file mode 100644 -index 8de61195..00000000 ---- a/patches/0001-20251104commit.patch -+++ /dev/null -@@ -1,1272 +0,0 @@ --From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Tue, 4 Nov 2025 09:11:51 +0800 --Subject: [PATCH 1/8] 20251104commit -- ----- -- mindnlp/transformers/cache_utils.py | 28 +- -- .../models/deepseek/modeling_deepseek.py | 149 ++- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- -- 3 files changed, 976 insertions(+), 87 deletions(-) -- --diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --index cadd2e04..02f8d4be 100644 ----- a/mindnlp/transformers/cache_utils.py --+++ b/mindnlp/transformers/cache_utils.py --@@ -812,14 +812,26 @@ class StaticCache(Cache): -- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. -- # k_out[:, :, cache_position] = key_states -- # v_out[:, :, cache_position] = value_states --- if ON_ORANGE_PI: --- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --- else: --- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --- --+ # if ON_ORANGE_PI: --+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+ # else: --+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+ # 确保 cache_position 是 1D tensor 并且类型正确 --+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+ if cache_position.ndim > 1: --+ cache_position = cache_position.flatten() --+ # 确保类型是 int32 或 int64(MindSpore 要求) --+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+ cache_position = cache_position.int() --+ --+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+ k_out[:, :, cache_position] = key_states --+ v_out[:, :, cache_position] = value_states --+ -- return k_out, v_out -- -- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index c695b944..d8303e45 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): -- # Copied from transformers.models.llama.modeling_llama.rotate_half -- def rotate_half(x): -- """Rotates half the hidden dims of the input.""" --- x1 = x[..., : x.shape[-1] // 2] --- x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+ # x1 = x[..., : x.shape[-1] // 2] --+ # x2 = x[..., x.shape[-1] // 2 :] --+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): -- if self.training: -- raise NotImplementedError("Training is not supported yet.") -- else: --- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --- if self.config.n_shared_experts is not None: --- y = y + self.shared_experts(identity) --- return y --+ # @lwx --+ if orig_shape[1] == 1: --+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+ y=y.view(*orig_shape) --+ if self.config.n_shared_experts is not None: --+ y = y + self.shared_experts(identity) --+ return y --+ else: --+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+ if self.config.n_shared_experts is not None: --+ y = y + self.shared_experts(identity) --+ return y --+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+ # if self.config.n_shared_experts is not None: --+ # y = y + self.shared_experts(identity) --+ # return y --+ --+ @no_grad() --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ --+ expert_cache = ops.zeros_like(x) --+ for i in range(self.num_experts_per_tok): --+ expert_id = flat_expert_indices[i].item() --+ weight = flat_expert_weights[i].item() --+ expert = self.experts[expert_id] --+ expert_out = expert(x) --+ expert_cache += expert_out * weight --+ return expert_cache -- -- @no_grad() --- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --- # expert_cache = torch.zeros_like(x) --- # idxs = flat_expert_indices.argsort() --- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --- # token_idxs = idxs // self.num_experts_per_tok --- # for i, end_idx in enumerate(tokens_per_expert): --- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --- # if start_idx == end_idx: --- # continue --- # expert = self.experts[i] --- # exp_token_idx = token_idxs[start_idx:end_idx] --- # expert_tokens = x[exp_token_idx] --- # expert_out = expert(expert_tokens) --- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --- # return expert_cache --+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- expert_cache = ops.zeros_like(x) -- idxs = flat_expert_indices.argsort() -- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) -- token_idxs = idxs // self.num_experts_per_tok --+ -- for i, end_idx in enumerate(tokens_per_expert): -- start_idx = 0 if i == 0 else tokens_per_expert[i-1] -- if start_idx == end_idx: --@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): -- expert_out = expert(expert_tokens) -- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) -- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ -- return expert_cache --+ --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # # expert_cache = torch.zeros_like(x) --+ # # idxs = flat_expert_indices.argsort() --+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+ # # token_idxs = idxs // self.num_experts_per_tok --+ # # for i, end_idx in enumerate(tokens_per_expert): --+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+ # # if start_idx == end_idx: --+ # # continue --+ # # expert = self.experts[i] --+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+ # # expert_tokens = x[exp_token_idx] --+ # # expert_out = expert(expert_tokens) --+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+ # # return expert_cache --+ # expert_cache = ops.zeros_like(x) --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # for i, end_idx in enumerate(tokens_per_expert): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # if start_idx == end_idx: --+ # continue --+ # expert = self.experts[i] --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = expert(expert_tokens) --+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ --+ # return expert_cache --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # expert_cache = ops.zeros_like(x) --+ --+ # # 排序保证顺序一致 --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # # 找出有 token 的专家 --+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+ --+ # for i in active_experts.tolist(): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # end_idx = tokens_per_expert[i] --+ # if start_idx == end_idx: # 没有 token --+ # continue --+ --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = self.experts[i](expert_tokens) --+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+ --+ # expert_cache = mindspore.mint.scatter_add( --+ # expert_cache, --+ # 0, --+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+ # expert_out --+ # ) --+ --+ # return expert_cache --+ --+ -- -- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): -- # """ --@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- -- # Initialize weights and apply final processing -- self.post_init() --+ self.warm_up = False --+ --+ def warmup_moe_model_deep(self): --+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+ test_texts = [ --+ "warmup short", --+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+ ] --+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+ if tokenizer is None: --+ from mindnlp.transformers import AutoTokenizer --+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+ self._warmup_tokenizer = tokenizer --+ --+ for text in test_texts: --+ inputs = tokenizer(text, return_tensors="ms") --+ with mindspore._no_grad(): --+ _ = self(**inputs, use_cache=False) --+ print("[Warmup] DeepSeek-MoE 模型预热完成。") -- -- def get_input_embeddings(self): -- return self.model.embed_tokens --@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -- ```""" --+ if not self.warm_up: --+ self.warm_up = True --+ self.warmup_moe_model_deep() --+ -- output_attentions = ( -- output_attentions -- if output_attentions is not None --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index 3cbf820e..d4c6b651 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -18,7 +18,6 @@ -- # See the License for the specific language governing permissions and -- # limitations under the License. -- """MindSpore Qwen2MoE model.""" --- -- import math -- from typing import List, Optional, Tuple, Union -- --@@ -36,6 +35,7 @@ from ...modeling_outputs import ( -- TokenClassifierOutput, -- ) -- from ...modeling_utils import PreTrainedModel --+from ...generation import GenerationMixin -- from ....utils import logging -- from .configuration_qwen2_moe import Qwen2MoeConfig -- --@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): -- self.variance_epsilon = eps -- -- def forward(self, hidden_states): --+ # @dwj --+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+ # @lwx --+ # if not self.training : --+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) -- input_dtype = hidden_states.dtype -- hidden_states = hidden_states.to(mindspore.float32) -- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --@@ -234,6 +239,8 @@ def rotate_half(x): -- """Rotates half the hidden dims of the input.""" -- x1 = x[..., : x.shape[-1] // 2] -- x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): -- self.config = config -- self.hidden_size = config.hidden_size -- self.intermediate_size = intermediate_size --+ -- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) -- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) -- self.act_fn = ACT2FN[config.hidden_act] -- -- def forward(self, x): --- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --- -- --+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+ # @lwx --+ # gate_up_output = self.gate_up_proj(x) --+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+ # return self.down_proj(swiglu_output) --+ --+ # def forward(self, x): --+ # gate_proj_out = self.gate_proj(x) --+ # up_proj_out = self.up_proj(x) --+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+ # return self.down_proj(swiglu_out) --+ -- # Copied from transformers.models.llama.modeling_llama.repeat_kv -- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -- """ --@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): -- use_cache: bool = False, -- cache_position: Optional[mindspore.Tensor] = None, -- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ --+ -- bsz, q_len, _ = hidden_states.shape -- -- query_states = self.q_proj(hidden_states) --@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): -- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " -- "with a layer index." -- ) --- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ if isinstance(past_key_value, StaticCache): --+ kv_seq_len = key_states.shape[-2] --+ else: --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- -- if past_key_value is not None: -- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models -- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+ if isinstance(past_key_value, StaticCache): --+ kv_seq_len = key_states.shape[-2] -- -- # repeat k/v heads if n_kv_heads < n_heads -- key_states = repeat_kv(key_states, self.num_key_value_groups) -- value_states = repeat_kv(value_states, self.num_key_value_groups) --- --+ -- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -- --- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --- raise ValueError( --- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --- f" {attn_weights.shape}" --- ) --- --- if attention_mask is not None: # no matter the length, we just slice it --- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+ if attention_mask is not None: --+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] -- attn_weights = attn_weights + causal_mask -- -- # upcast attention to fp32 --@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): -- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) -- -- attn_output = self.o_proj(attn_output) --- --+ # @lwx --+ --+ # max_seq_len = self.max_position_embeddings # 2048 --+ --+ # if attention_mask is not None: --+ # # attention_mask: [B, 1, Sq, Sk] --+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+ --+ # # pad 到 [max_seq_len, max_seq_len] --+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+ # global_attention_mask = padded_mask --+ # else: --+ # global_attention_mask = None --+ --+ --+ # sparse_mode=3 --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, --+ # key=key_states, --+ # value=value_states, --+ # real_shift=None, --+ # padding_mask=None, --+ --+ # head_num=self.num_heads, --+ # attn_mask=global_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+ # input_layout="BNSD", --+ # pre_tokens=2147483647, --+ # next_tokens=2147483647, --+ # inner_precise=0, --+ # drop_mask=None, --+ # prefix=None, --+ # actual_seq_qlen=None, --+ # actual_seq_kvlen=None, --+ # sparse_mode=sparse_mode, --+ # ) -- if not output_attentions: -- attn_weights = None -- -- return attn_output, attn_weights, past_key_value -- -- --+class Qwen2MoeFlashAttention(nn.Module): --+ """ --+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+ --+ 关键改动: --+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+ 直接传入原始的 key 和 value 张量效率更高。 --+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+ """ --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+ super().__init__() --+ self.config = config --+ self.layer_idx = layer_idx --+ self.hidden_size = config.hidden_size --+ self.num_heads = config.num_attention_heads --+ self.head_dim = self.hidden_size // self.num_heads --+ self.num_key_value_heads = config.num_key_value_heads --+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+ self.max_position_embeddings = config.max_position_embeddings --+ self.rope_theta = config.rope_theta --+ self.attention_dropout = config.attention_dropout --+ --+ if (self.head_dim * self.num_heads) != self.hidden_size: --+ raise ValueError( --+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+ ) --+ --+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+ --+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+ self.head_dim, --+ max_position_embeddings=self.max_position_embeddings, --+ base=self.rope_theta, --+ ) --+ --+ def forward( --+ self, --+ hidden_states: mindspore.Tensor, --+ attention_mask: Optional[mindspore.Tensor] = None, --+ position_ids: Optional[mindspore.Tensor] = None, --+ past_key_value: Optional[Cache] = None, --+ output_attentions: bool = False, --+ use_cache: bool = False, --+ cache_position: Optional[mindspore.Tensor] = None, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ bsz, q_len, _ = hidden_states.shape --+ --+ # 1. 线性投射 Q, K, V --+ query_states = self.q_proj(hidden_states) --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+ # query: [B, S, H*D] -> [B, N1, S, D] --+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # 3. RoPE 旋转位置编码 --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+ if self.layer_idx is None: --+ raise ValueError( --+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ "with a layer index." --+ ) --+ # 对于 StaticCache,需要特殊处理 kv_seq_len --+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+ if cache_position.shape[0] == 1: --+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+ kv_seq_len = past_seen_tokens + 1 --+ else: --+ # prefill 阶段:cache_position 是范围,使用其长度 --+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+ else: --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # 4. KV 缓存更新 --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ key_states, value_states = past_key_value.update( --+ key_states, value_states, self.layer_idx, cache_kwargs --+ ) --+ --+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+ if cache_position.shape[0] == 1: --+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+ kv_seq_len = key_states.shape[-2] --+ --+ # 5. [重要] 准备 Attention Mask --+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+ fa_attention_mask = None --+ if attention_mask is not None: --+ # 截取与当前key长度匹配的部分 --+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # 转换为布尔类型: 大负数 -> True, 0 -> False --+ fa_attention_mask = (mask_slice != 0) --+ --+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+ input_dtype = query_states.dtype --+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+ query_states = query_states.to(mindspore.float16) --+ key_states = key_states.to(mindspore.float16) --+ value_states = value_states.to(mindspore.float16) --+ --+ # 6. [核心] 调用 flash_attention_score 算子 --+ # - 无需手动 repeat_kv, 算子原生支持 GQA --+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+ attn_output = mindspore.ops.flash_attention_score( --+ query=query_states, --+ key=key_states, --+ value=value_states, --+ head_num=self.num_heads, # 传入Q的头数(N1) --+ attn_mask=fa_attention_mask, --+ keep_prob=1.0 - self.attention_dropout, --+ scalar_value=1.0 / math.sqrt(self.head_dim), --+ input_layout="BNSD", --+ sparse_mode=0 # 使用 defaultMask 模式 --+ ) --+ --+ # 恢复原始数据类型 --+ attn_output = attn_output.to(input_dtype) --+ --+ # 7. 调整输出形状 --+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ attn_output = self.o_proj(attn_output) --+ --+ # FlashAttention 算子不直接返回注意力权重矩阵 --+ attn_weights = None --+ if output_attentions: --+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+ return attn_output, attn_weights, past_key_value --+ --+ # def forward( --+ # self, --+ # hidden_states: mindspore.Tensor, --+ # attention_mask: Optional[mindspore.Tensor] = None, --+ # position_ids: Optional[mindspore.Tensor] = None, --+ # past_key_value: Optional[Cache] = None, --+ # output_attentions: bool = False, --+ # use_cache: bool = False, --+ # cache_position: Optional[mindspore.Tensor] = None, --+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ # bsz, q_len, _ = hidden_states.shape --+ --+ # # 1. 线性投射 Q, K, V --+ # query_states = self.q_proj(hidden_states) --+ # key_states = self.k_proj(hidden_states) --+ # value_states = self.v_proj(hidden_states) --+ --+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # # 3. RoPE 旋转位置编码 --+ # kv_seq_len = key_states.shape[-2] --+ # if past_key_value is not None: --+ # if self.layer_idx is None: --+ # raise ValueError( --+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ # "with a layer index." --+ # ) --+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # # 4. KV 缓存更新 --+ # if past_key_value is not None: --+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ # key_states, value_states = past_key_value.update( --+ # key_states, value_states, self.layer_idx, cache_kwargs --+ # ) --+ --+ # # 5. 准备 Attention Mask --+ # fa_attention_mask = None --+ # if attention_mask is not None: --+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # fa_attention_mask = (mask_slice != 0) --+ --+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+ # input_dtype = query_states.dtype --+ --+ # # 6. [核心] 调用 flash_attention_score 算子 --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, --+ # key=key_states, --+ # value=value_states, --+ # head_num=self.num_heads, --+ # attn_mask=fa_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+ # input_layout="BNSD", --+ # sparse_mode=0, --+ # # <--- 修改点 2: 启用内部高精度计算 --- --+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+ # inner_precise=1 --+ # ) --+ --+ # # 恢复原始数据类型 --+ # attn_output = attn_output.to(input_dtype) --+ --+ # # 7. 调整输出形状 --+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ # attn_output = self.o_proj(attn_output) --+ --+ # attn_weights = None --+ # if output_attentions: --+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+ # return attn_output, attn_weights, past_key_value --+ --+ # def forward( --+ # self, --+ # hidden_states: mindspore.Tensor, --+ # attention_mask: Optional[mindspore.Tensor] = None, --+ # position_ids: Optional[mindspore.Tensor] = None, --+ # past_key_value: Optional[Cache] = None, --+ # output_attentions: bool = False, --+ # use_cache: bool = False, --+ # cache_position: Optional[mindspore.Tensor] = None, --+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ # bsz, q_len, _ = hidden_states.shape --+ --+ # query_states = self.q_proj(hidden_states) --+ # key_states = self.k_proj(hidden_states) --+ # value_states = self.v_proj(hidden_states) --+ --+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ # kv_seq_len = key_states.shape[-2] --+ # if past_key_value is not None: --+ # if self.layer_idx is None: --+ # raise ValueError("`layer_idx` must be specified for caching") --+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ # if past_key_value is not None: --+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ # key_states, value_states = past_key_value.update( --+ # key_states, value_states, self.layer_idx, cache_kwargs --+ # ) --+ --+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+ --+ # # <--- 核心修改点: 手动进行高精度缩放 --- --+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+ # query_states = query_states / math.sqrt(self.head_dim) --+ # # <--- 修改结束 --- --+ --+ # fa_attention_mask = None --+ # if attention_mask is not None: --+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ # fa_attention_mask = (mask_slice != 0) --+ --+ # input_dtype = query_states.dtype --+ --+ # attn_output = mindspore.ops.flash_attention_score( --+ # query=query_states, # 传入已经预先缩放过的 query --+ # key=key_states, --+ # value=value_states, --+ # head_num=self.num_heads, --+ # attn_mask=fa_attention_mask, --+ # keep_prob=1.0 - self.attention_dropout, --+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+ # input_layout="BNSD", --+ # sparse_mode=0, --+ # inner_precise=1 # 仍然保持内部高精度计算 --+ # ) --+ --+ # attn_output = attn_output.to(input_dtype) --+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ # attn_output = self.o_proj(attn_output) --+ --+ # attn_weights = None --+ # if output_attentions: --+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+ --+ # return attn_output, attn_weights, past_key_value --+ -- QWEN2MOE_ATTENTION_CLASSES = { -- "eager": Qwen2MoeAttention, --+ "flash-attention": Qwen2MoeFlashAttention, -- } -- -- --@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --+ #@dwj --+ # 只遍历激活的专家,而非全部专家 -- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --- batch_size, sequence_length, hidden_dim = hidden_states.shape --- hidden_states = hidden_states.view(-1, hidden_dim) --- # router_logits: (batch * sequence_length, n_experts) --- router_logits = self.gate(hidden_states) --- --- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- if self.norm_topk_prob: --- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- # we cast back to the input dtype --- routing_weights = routing_weights.to(hidden_states.dtype) --- --- final_hidden_states = ops.zeros( --- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --- ) --- --- # One hot encode the selected experts to create an expert mask --- # this will be used to easily index which expert is going to be sollicitated --- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --- --- # Loop over all available experts in the model and perform the computation on each expert --- for expert_idx in range(self.num_experts): --- expert_layer = self.experts[expert_idx] --- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --- --- # Index the correct hidden states and compute the expert hidden state for --- # the current expert. We need to make sure to multiply the output hidden --- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --- if 0 not in idx.shape: --- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --- --- # However `index_add_` only support torch tensors for indexing so we'll use --- # the `top_x` tensor here. --- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --- --- shared_expert_output = self.shared_expert(hidden_states) --- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --- --- final_hidden_states = final_hidden_states + shared_expert_output --+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+ num_tokens = hidden_states_reshaped.shape[0] --+ --+ router_logits = self.gate(hidden_states_reshaped) --+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+ if self.norm_topk_prob: --+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ routing_weights = routing_weights.to(hidden_states.dtype) --+ --+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+ flat_selected_experts = selected_experts.flatten() --+ --+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+ token_indices = broadcasted_token_indices.flatten() --+ --+ active_experts = ops.unique(flat_selected_experts) --+ --+ for expert_idx_tensor in active_experts: --+ expert_idx = expert_idx_tensor.item() --+ expert_layer = self.experts[expert_idx] --+ --+ mask = (flat_selected_experts == expert_idx_tensor) --+ selected_token_indices = token_indices[mask] --+ selected_routing_weights = routing_weights.flatten()[mask] --+ --+ current_states = hidden_states_reshaped[selected_token_indices] --+ --+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+ --+ final_hidden_states = final_hidden_states.index_add( --+ dim=0, --+ index=selected_token_indices, --+ source=expert_output.to(hidden_states.dtype) --+ ) --+ --+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -- --- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --- return final_hidden_states, router_logits --+ final_hidden_states = final_hidden_states + shared_expert_output --+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+ --+ return final_hidden_states, router_logits -- -- -- class Qwen2MoeDecoderLayer(nn.Module): --@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): -- -- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+ -- if (layer_idx not in config.mlp_only_layers) and ( -- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -- ): --@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): -- _no_split_modules = ["Qwen2MoeDecoderLayer"] -- _skip_keys_device_placement = "past_key_values" -- _supports_cache_class = True --+#lwx --+ # _supports_static_cache = True -- -- def _init_weights(self, module): -- std = self.config.initializer_range --@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -- return causal_mask -- -- ---class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- _tied_weights_keys = ["lm_head.weight"] -- -- def __init__(self, config): --@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- self.num_experts_per_tok = config.num_experts_per_tok -- # Initialize weights and apply final processing -- self.post_init() --+ # @lwx --+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+ # self.generation_config.cache_implementation = "static" --+ self._warmed_up = False --+ --+ def warmup_moe_model(self): --+ print("[Warmup] Qwen2-MoE 模型预热开始...") --+ test_texts = [ --+ "warmup short", --+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+ ] --+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+ if tokenizer is None: --+ from mindnlp.transformers import AutoTokenizer --+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+ self._warmup_tokenizer = tokenizer --+ --+ for text in test_texts: --+ inputs = tokenizer(text, return_tensors="ms") --+ with mindspore._no_grad(): --+ _ = self(**inputs, output_router_logits=True, use_cache=False) --+ print("[Warmup] Qwen2-MoE 模型预热完成。") -- -- def get_input_embeddings(self): -- return self.model.embed_tokens --@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] -- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." -- ```""" --+ if not self._warmed_up: --+ self._warmed_up = True --+ self.warmup_moe_model() -- -- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -- output_router_logits = ( --@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): -- } -- ) -- return model_inputs --+# @lwx --+ # def _decode_one_tokens_logits( --+ # self, --+ # cur_token: mindspore.Tensor, --+ # input_pos: Optional[mindspore.Tensor], --+ # cache_position: mindspore.Tensor, --+ # past_key_values: StaticCache, --+ # ) -> mindspore.Tensor: --+ # """ --+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+ --+ # Args: --+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+ # input_pos: 输入位置信息,可选 --+ # cache_position: 当前token在cache中的位置,shape为(1,) --+ # past_key_values: StaticCache对象,存储之前的key-value状态 --+ --+ # Returns: --+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+ # """ --+ # # 调用JIT编译的版本 --+ # return self.get_decode_one_tokens_logits( --+ # cur_token=cur_token, --+ # input_pos=input_pos, --+ # cache_position=cache_position, --+ # past_key_values=past_key_values, --+ # ) --+ --+ # @mindspore.jit(jit_level='O1') --+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+ # """ --+ # JIT编译的函数,用于高效的单token解码 --+ # 使用JIT编译优化以支持静态shape和高效执行 --+ --+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+ # """ --+ # outputs = self.model.forward( --+ # input_ids=cur_token, --+ # position_ids=input_pos, --+ # cache_position=cache_position, --+ # past_key_values=past_key_values, --+ # use_cache=True, --+ # return_dict=False, --+ # ) --+ --+ # hidden_states = outputs[0] --+ # logits = self.lm_head.forward(hidden_states) --+ # logits = logits.float() --+ --+ # return logits[:, -1, :] --+ --+ # def _sample( --+ # self, --+ # input_ids: mindspore.Tensor, --+ # logits_processor, --+ # stopping_criteria, --+ # generation_config, --+ # synced_devices: bool, --+ # streamer=None, --+ # logits_warper=None, --+ # **model_kwargs, --+ # ): --+ # """ --+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+ # """ --+ # from ...generation.logits_process import LogitsProcessorList --+ # from ...generation.stopping_criteria import StoppingCriteriaList --+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+ # from mindnlp.core import nn, ops, no_grad --+ # import numpy as np --+ --+ # # 检查是否使用 StaticCache --+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+ # # 否则,直接调用父类方法 --+ # past_key_values = model_kwargs.get("past_key_values") --+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+ --+ # if not isinstance(past_key_values, StaticCache): --+ # # 不使用 StaticCache,直接调用父类方法 --+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+ # return super()._sample( --+ # input_ids=input_ids, --+ # logits_processor=logits_processor, --+ # stopping_criteria=stopping_criteria, --+ # generation_config=generation_config, --+ # synced_devices=synced_devices, --+ # streamer=streamer, --+ # logits_warper=logits_warper, --+ # **model_kwargs, --+ # ) --+ --+ # # 使用 StaticCache,进入自定义循环 --+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+ # pad_token_id = generation_config._pad_token_tensor --+ # output_attentions = generation_config.output_attentions --+ # output_hidden_states = generation_config.output_hidden_states --+ # output_scores = generation_config.output_scores --+ # output_logits = generation_config.output_logits --+ # return_dict_in_generate = generation_config.return_dict_in_generate --+ # max_length = generation_config.max_length --+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+ # do_sample = generation_config.do_sample --+ --+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+ # raise ValueError( --+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+ # f"{logits_warper})." --+ # ) --+ --+ # # init attention / hidden states / scores tuples --+ # scores = () if (return_dict_in_generate and output_scores) else None --+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+ --+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+ # if return_dict_in_generate and self.config.is_encoder_decoder: --+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+ # encoder_hidden_states = ( --+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+ # ) --+ --+ # # keep track of which sequences are already finished --+ # batch_size, cur_len = input_ids.shape --+ # this_peer_finished = False --+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+ --+ # time_record = [] --+ # from ....utils.testing_utils import parse_flag_from_env --+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+ --+ # while self._has_unfinished_sequences( --+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+ # ): --+ # if _record_time: --+ # import time as time_module --+ # infer_start = time_module.time() --+ --+ # # prepare model inputs --+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+ --+ # # prepare variable output controls --+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+ --+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+ # cur_cache_position = model_inputs.get("cache_position") --+ # cur_past_key_values = model_inputs.get("past_key_values") --+ # cur_input_ids = model_inputs.get("input_ids") --+ --+ # if (isinstance(cur_past_key_values, StaticCache) and --+ # cur_cache_position is not None and --+ # len(cur_cache_position.shape) > 0 and --+ # cur_cache_position.shape[0] == 1 and --+ # cur_input_ids is not None and --+ # cur_input_ids.shape[1] == 1): --+ # # 使用 JIT 优化的单 token 解码 --+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+ # if not hasattr(self, '_jit_used'): --+ # self._jit_used = False --+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+ --+ # next_token_logits = self.get_decode_one_tokens_logits( --+ # cur_token=cur_input_ids, --+ # input_pos=model_inputs.get("position_ids"), --+ # cache_position=cur_cache_position, --+ # past_key_values=cur_past_key_values, --+ # ) --+ --+ # # 标记已使用JIT(用于后续判断) --+ # if not self._jit_used: --+ # self._jit_used = True --+ --+ # # 构造兼容的输出对象 --+ # class JitOptimizedOutput: --+ # def __init__(self, logits, config): --+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+ # self.config = config --+ # # 对于 JIT 优化路径,这些属性通常不需要 --+ # self.decoder_attentions = None if config.is_encoder_decoder else None --+ # self.attentions = None if not config.is_encoder_decoder else None --+ # self.cross_attentions = None --+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+ # self.hidden_states = None if not config.is_encoder_decoder else None --+ --+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+ # else: --+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+ # outputs = self(**model_inputs, return_dict=True) --+ --+ # if synced_devices and this_peer_finished: --+ # continue --+ --+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+ # next_token_logits = outputs.logits[:, -1, :] --+ --+ # # pre-process distribution --+ # next_token_scores = logits_processor(input_ids, next_token_logits) --+ # if do_sample: --+ # next_token_scores = logits_warper(input_ids, next_token_scores) --+ --+ # # Store scores, attentions and hidden_states when required --+ # if return_dict_in_generate: --+ # if output_scores: --+ # scores += (next_token_scores,) --+ # if output_logits: --+ # raw_logits += (next_token_logits,) --+ # if output_attentions: --+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+ # decoder_attentions += (attn,) if attn is not None else (None,) --+ # if self.config.is_encoder_decoder: --+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+ --+ # if output_hidden_states: --+ # hidden = ( --+ # outputs.decoder_hidden_states --+ # if self.config.is_encoder_decoder --+ # else outputs.hidden_states --+ # ) --+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+ --+ # # token selection --+ # if do_sample: --+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+ # else: --+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+ --+ # # finished sentences should have their next token be a padding token --+ # if has_eos_stopping_criteria: --+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+ --+ # # update generated ids, model inputs, and length for next step --+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+ # if streamer is not None: --+ # streamer.put(next_tokens) --+ --+ # model_kwargs = self._update_model_kwargs_for_generation( --+ # outputs, --+ # model_kwargs, --+ # is_encoder_decoder=self.config.is_encoder_decoder, --+ # ) --+ --+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+ # cur_len += 1 --+ --+ # if _record_time: --+ # import time as time_module --+ # infer_stop = time_module.time() --+ # time_record.append(infer_stop - infer_start) --+ --+ # del outputs --+ --+ # average_infer_time = None --+ # if time_record: --+ # if len(time_record) > 1: --+ # time_record.pop(0) --+ # average_infer_time = sum(time_record) / len(time_record) --+ # print(f'average inference time is: {average_infer_time}') --+ # print(f'inference time record: {time_record}') --+ --+ # if streamer is not None: --+ # streamer.end() --+ --+ # # 简单判断:打印是否使用了JIT路径 --+ # if hasattr(self, '_jit_used') and self._jit_used: --+ # print("[JIT] ✓ JIT optimization was used during generation") --+ # else: --+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+ --+ # if return_dict_in_generate: --+ # if self.config.is_encoder_decoder: --+ # return GenerateEncoderDecoderOutput( --+ # sequences=input_ids, --+ # scores=scores, --+ # logits=raw_logits, --+ # encoder_attentions=encoder_attentions, --+ # encoder_hidden_states=encoder_hidden_states, --+ # decoder_attentions=decoder_attentions, --+ # cross_attentions=cross_attentions, --+ # decoder_hidden_states=decoder_hidden_states, --+ # past_key_values=model_kwargs.get("past_key_values"), --+ # average_infer_time=average_infer_time --+ # ) --+ # else: --+ # return GenerateDecoderOnlyOutput( --+ # sequences=input_ids, --+ # scores=scores, --+ # logits=raw_logits, --+ # attentions=decoder_attentions, --+ # hidden_states=decoder_hidden_states, --+ # past_key_values=model_kwargs.get("past_key_values"), --+ # average_infer_time=average_infer_time --+ # ) --+ # else: --+ # return input_ids --+ --+ # def _prepare_cache_for_generation( --+ # self, --+ # generation_config, --+ # model_kwargs, --+ # assistant_model, --+ # batch_size, --+ # max_cache_length, --+ # ): --+ # if generation_config.cache_implementation is None and self._supports_static_cache: --+ # generation_config.cache_implementation = "static" --+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+ --+ # if generation_config.cache_implementation == "static": --+ # base_required_from_max_length = generation_config.max_length + 1 --+ # base_required = max(max_cache_length, base_required_from_max_length) --+ # min_cache_size = 50 --+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+ # else: --+ # max_cache_length = max(base_required, min_cache_size) --+ --+ # original_max_cache_length = max_cache_length --+ # print(f"[JIT] StaticCache max_cache_length calculation:") --+ # print(f" - input max_cache_length: {original_max_cache_length}") --+ # print(f" - generation_config.max_length: {generation_config.max_length}") --+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+ # print(f" - final max_cache_length: {max_cache_length}") --+ --+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+ # if max_cache_length > self.config.max_position_embeddings: --+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+ --+ # result = super()._prepare_cache_for_generation( --+ # generation_config=generation_config, --+ # model_kwargs=model_kwargs, --+ # assistant_model=assistant_model, --+ # batch_size=batch_size, --+ # max_cache_length=max_cache_length, --+ # ) --+ --+ # if generation_config.cache_implementation == "static": --+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+ # created_cache = model_kwargs.get(cache_name) --+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+ # if created_cache.max_cache_len < generation_config.max_length: --+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+ --+ # return result --+ --+ --+ -- -- -- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ---- --2.27.0 -- -diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch -deleted file mode 100644 -index d7a129ea..00000000 ---- a/patches/0002-20251106commit.patch -+++ /dev/null -@@ -1,3200 +0,0 @@ --From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Thu, 6 Nov 2025 09:20:38 +0800 --Subject: [PATCH 2/8] 20251106commit -- ----- -- .../models/deepseek/modeling_deepseek.py | 379 ++++- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- -- patches/0001-20251104commit.patch | 1272 ++++++++++++++++ -- 3 files changed, 2689 insertions(+), 305 deletions(-) -- create mode 100644 patches/0001-20251104commit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index d8303e45..73773c22 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): -- # y = y + self.shared_experts(identity) -- # return y -- --+ # @no_grad() --+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ --+ # expert_cache = ops.zeros_like(x) --+ # for i in range(self.num_experts_per_tok): --+ # expert_id = flat_expert_indices[i].item() --+ # weight = flat_expert_weights[i].item() --+ # expert = self.experts[expert_id] --+ # expert_out = expert(x) --+ # expert_cache += expert_out * weight --+ # return expert_cache --+ -- @no_grad() -- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ # x 的 shape: (1, hidden_size) --+ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+ --+ # 1. 收集所有需要的专家层 --+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+ selected_experts = [self.experts[i] for i in flat_expert_indices] --+ --+ # 2. 并行计算所有专家的输出 --+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+ # ops.cat 会将它们堆叠成一个新的 Tensor --+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+ --+ # 3. 使用矩阵乘法进行加权求和 --+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ # 最终结果 final_output 的 shape: (1, hidden_size) --+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+ --+ return final_output -- --- expert_cache = ops.zeros_like(x) --- for i in range(self.num_experts_per_tok): --- expert_id = flat_expert_indices[i].item() --- weight = flat_expert_weights[i].item() --- expert = self.experts[expert_id] --- expert_out = expert(x) --- expert_cache += expert_out * weight --- return expert_cache -- -- @no_grad() -- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): -- key_states = self.k_proj(hidden_states) -- value_states = self.v_proj(hidden_states) -- --- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+ # @lwx --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --+ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) -- -- kv_seq_len = key_states.shape[-2] -- if past_key_value is not None: --@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): -- return attn_output, attn_weights, past_key_value -- -- --+# class DeepseekFlashAttention(nn.Module): --+# """ --+# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --+ --+# This class is designed as a drop-in replacement for DeepseekAttention. --+# """ --+ --+# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+# super().__init__() --+# self.config = config --+# self.layer_idx = layer_idx --+# if layer_idx is None: --+# logger.warning( --+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+# "when creating this class." --+# ) --+ --+# self.attention_dropout = config.attention_dropout --+# self.hidden_size = config.hidden_size --+# self.num_heads = config.num_attention_heads --+# self.head_dim = self.hidden_size // self.num_heads --+# self.num_key_value_heads = config.num_key_value_heads --+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+# self.max_position_embeddings = config.max_position_embeddings --+# self.rope_theta = config.rope_theta --+# self.is_causal = True --+ --+# if (self.head_dim * self.num_heads) != self.hidden_size: --+# raise ValueError( --+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+# f" and `num_heads`: {self.num_heads})." --+# ) --+ --+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+# self._init_rope() --+ --+# def _init_rope(self): --+# if self.config.rope_scaling is None: --+# self.rotary_emb = DeepseekRotaryEmbedding( --+# self.head_dim, --+# max_position_embeddings=self.max_position_embeddings, --+# base=self.rope_theta, --+# ) --+# else: --+# scaling_type = self.config.rope_scaling["type"] --+# scaling_factor = self.config.rope_scaling["factor"] --+# if scaling_type == "linear": --+# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+# self.head_dim, --+# max_position_embeddings=self.max_position_embeddings, --+# scaling_factor=scaling_factor, --+# base=self.rope_theta, --+# ) --+# elif scaling_type == "dynamic": --+# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+# self.head_dim, --+# max_position_embeddings=self.max_position_embeddings, --+# scaling_factor=scaling_factor, --+# base=self.rope_theta, --+# ) --+# else: --+# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+ --+# def forward( --+# self, --+# hidden_states: mindspore.Tensor, --+# attention_mask: Optional[mindspore.Tensor] = None, --+# position_ids: Optional[mindspore.Tensor] = None, --+# past_key_value: Optional[Cache] = None, --+# output_attentions: bool = False, --+# use_cache: bool = False, --+# **kwargs, --+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+# if "padding_mask" in kwargs: --+# warnings.warn( --+# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+# ) --+ --+# if output_attentions: --+# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --+ --+# bsz, q_len, _ = hidden_states.shape --+ --+# if self.config.pretraining_tp > 1: --+# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+ --+# query_states = self.q_proj(hidden_states) --+# key_states = self.k_proj(hidden_states) --+# value_states = self.v_proj(hidden_states) --+ --+# # Reshape for multi-head attention --+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+# kv_seq_len = key_states.shape[-2] --+# if past_key_value is not None: --+# if self.layer_idx is None: --+# raise ValueError( --+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+# "with a layer index." --+# ) --+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+# # Apply Rotary Positional Embedding --+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+# if past_key_value is not None: --+# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --+# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --+# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ --+# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+ --+# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+ --+# # Convert attention_mask for flash_attention_score --+# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --+# if attention_mask is not None: --+# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --+# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+# raise ValueError( --+# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+# ) --+# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --+# else: --+# attn_mask_for_fa = None --+ --+# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+ --+# # Call the fused flash_attention_score operator --+# attn_output = mindspore.ops.flash_attention_score( --+# query=query_states_for_fa, --+# key=key_states_for_fa, --+# value=value_states_for_fa, --+# head_num=self.num_heads, # This is N1, the number of query heads --+# input_layout='BSH', --+# attn_mask=attn_mask_for_fa, --+# keep_prob=keep_prob, --+# scalar_value=1.0 / math.sqrt(self.head_dim), --+# sparse_mode=0 # Default mask mode --+# ) --+ --+# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --+# attn_output = self.o_proj(attn_output) --+ --+# # Flash Attention does not return attention weights --+# attn_weights = None --+ --+# return attn_output, attn_weights, past_key_value --+ --+class DeepseekFlashAttention(nn.Module): --+ """ --+ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --+ This implementation is a drop-in replacement for the original DeepseekAttention class, --+ designed for high performance on supported hardware (Ascend). --+ --+ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --+ """ --+ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+ super().__init__() --+ self.config = config --+ self.layer_idx = layer_idx --+ if layer_idx is None: --+ logger.warning( --+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+ "when creating this class." --+ ) --+ --+ # --- [FIX] Correctly initialize all required attributes --- --+ self.attention_dropout = config.attention_dropout --+ self.hidden_size = config.hidden_size --+ self.num_heads = config.num_attention_heads --+ self.head_dim = self.hidden_size // self.num_heads --+ self.num_key_value_heads = config.num_key_value_heads --+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+ self.max_position_embeddings = config.max_position_embeddings --+ self.rope_theta = config.rope_theta --+ self.is_causal = True --+ --+ if (self.head_dim * self.num_heads) != self.hidden_size: --+ raise ValueError( --+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+ f" and `num_heads`: {self.num_heads})." --+ ) --+ --+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+ --+ # This call will now succeed as all attributes are initialized. --+ self._init_rope() --+ --+ def _init_rope(self): --+ if self.config.rope_scaling is None: --+ self.rotary_emb = DeepseekRotaryEmbedding( --+ self.head_dim, --+ max_position_embeddings=self.max_position_embeddings, --+ base=self.rope_theta, --+ ) --+ else: --+ scaling_type = self.config.rope_scaling["type"] --+ scaling_factor = self.config.rope_scaling["factor"] --+ if scaling_type == "linear": --+ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+ self.head_dim, --+ max_position_embeddings=self.max_position_embeddings, --+ scaling_factor=scaling_factor, --+ base=self.rope_theta, --+ ) --+ elif scaling_type == "dynamic": --+ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+ self.head_dim, --+ max_position_embeddings=self.max_position_embeddings, --+ scaling_factor=scaling_factor, --+ base=self.rope_theta, --+ ) --+ else: --+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+ --+ def forward( --+ self, --+ hidden_states: mindspore.Tensor, --+ attention_mask: Optional[mindspore.Tensor] = None, --+ position_ids: Optional[mindspore.Tensor] = None, --+ past_key_value: Optional[Cache] = None, --+ output_attentions: bool = False, --+ use_cache: bool = False, --+ **kwargs, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ if "padding_mask" in kwargs: --+ warnings.warn( --+ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+ ) --+ if output_attentions: --+ warnings.warn( --+ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --+ ) --+ --+ bsz, q_len, _ = hidden_states.shape --+ --+ if self.config.pretraining_tp > 1: --+ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+ --+ query_states = self.q_proj(hidden_states) --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+ # Apply Rotary Position Embedding --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos} --+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --+ # So we must explicitly repeat the KV heads. --+ key_states = repeat_kv(key_states, self.num_key_value_groups) --+ value_states = repeat_kv(value_states, self.num_key_value_groups) --+ --+ # Convert attention mask for flash_attention_score --+ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --+ if attention_mask is not None: --+ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+ raise ValueError( --+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+ ) --+ attn_mask_for_fa = attention_mask < 0 --+ else: --+ attn_mask_for_fa = None --+ --+ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+ --+ # Call the fused operator using the efficient BNSD layout --+ attn_output = mindspore.ops.flash_attention_score( --+ query=query_states, --+ key=key_states, --+ value=value_states, --+ head_num=self.num_heads, --+ input_layout='BNSD', # Specify the correct layout --+ attn_mask=attn_mask_for_fa, --+ keep_prob=keep_prob, --+ scalar_value=1.0 / math.sqrt(self.head_dim) --+ ) --+ --+ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ --+ # Apply output projection --+ attn_output = self.o_proj(attn_output) --+ --+ # Flash attention does not return attention weights, so we return None. --+ attn_weights = None --+ --+ return attn_output, attn_weights, past_key_value --+ -- Deepseek_ATTENTION_CLASSES = { -- "eager": DeepseekAttention, --+ "flash-attention": DeepseekFlashAttention, -- } -- -- --@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): -- config=config, layer_idx=layer_idx -- ) -- --+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --+ config=config, layer_idx=layer_idx --+ ) --+ -- self.mlp = ( -- DeepseekMoE(config) -- if ( --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index d4c6b651..bced285c 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union -- -- import mindspore -- import mindnlp.core.nn.functional as F ---from mindnlp.core import nn, ops --+from mindnlp.core import nn, ops, no_grad -- from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss -- -- from ....common.activations import ACT2FN --@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) -- _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -- _CONFIG_FOR_DOC = "Qwen2MoeConfig" -- --+Long_Prompt = False --+PROMPT_LENGTH_THRESHOLD = 128 -- -- # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -- def _prepare_4d_causal_attention_mask_with_cache_position( --@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): -- return attn_output, attn_weights, past_key_value -- -- --+# class Qwen2MoeFlashAttention(nn.Module): --+# """ --+# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+ --+# 关键改动: --+# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+# 直接传入原始的 key 和 value 张量效率更高。 --+# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+# super().__init__() --+# self.config = config --+# self.layer_idx = layer_idx --+# self.hidden_size = config.hidden_size --+# self.num_heads = config.num_attention_heads --+# self.head_dim = self.hidden_size // self.num_heads --+# self.num_key_value_heads = config.num_key_value_heads --+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+# self.max_position_embeddings = config.max_position_embeddings --+# self.rope_theta = config.rope_theta --+# self.attention_dropout = config.attention_dropout --+ --+# if (self.head_dim * self.num_heads) != self.hidden_size: --+# raise ValueError( --+# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+# ) --+ --+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+ --+# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+# self.head_dim, --+# max_position_embeddings=self.max_position_embeddings, --+# base=self.rope_theta, --+# ) --+ --+# def forward( --+# self, --+# hidden_states: mindspore.Tensor, --+# attention_mask: Optional[mindspore.Tensor] = None, --+# position_ids: Optional[mindspore.Tensor] = None, --+# past_key_value: Optional[Cache] = None, --+# output_attentions: bool = False, --+# use_cache: bool = False, --+# cache_position: Optional[mindspore.Tensor] = None, --+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+# bsz, q_len, _ = hidden_states.shape --+ --+# # 1. 线性投射 Q, K, V --+# query_states = self.q_proj(hidden_states) --+# key_states = self.k_proj(hidden_states) --+# value_states = self.v_proj(hidden_states) --+ --+# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+# # query: [B, S, H*D] -> [B, N1, S, D] --+# # key/val: [B, S, H2*D] -> [B, N2, S, D] --+# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+# # 3. RoPE 旋转位置编码 --+# kv_seq_len = key_states.shape[-2] --+# if past_key_value is not None: --+# if self.layer_idx is None: --+# raise ValueError( --+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+# "with a layer index." --+# ) --+# # 对于 StaticCache,需要特殊处理 kv_seq_len --+# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+# # 使用 cache_position 的长度来确定实际的 kv_seq_len --+# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+# # 临时解决方案:使用 cache_position 的最大值(如果可能) --+# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+# if cache_position.shape[0] == 1: --+# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+# kv_seq_len = past_seen_tokens + 1 --+# else: --+# # prefill 阶段:cache_position 是范围,使用其长度 --+# kv_seq_len = cache_position.shape[0] + past_seen_tokens --+# else: --+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+# # 4. KV 缓存更新 --+# if past_key_value is not None: --+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+# key_states, value_states = past_key_value.update( --+# key_states, value_states, self.layer_idx, cache_kwargs --+# ) --+ --+# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+# if cache_position.shape[0] == 1: --+# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+# kv_seq_len = key_states.shape[-2] --+ --+# # 5. [重要] 准备 Attention Mask --+# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+# fa_attention_mask = None --+# if attention_mask is not None: --+# # 截取与当前key长度匹配的部分 --+# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+# # 转换为布尔类型: 大负数 -> True, 0 -> False --+# fa_attention_mask = (mask_slice != 0) --+ --+# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+# input_dtype = query_states.dtype --+# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+# query_states = query_states.to(mindspore.float16) --+# key_states = key_states.to(mindspore.float16) --+# value_states = value_states.to(mindspore.float16) --+ --+# # 6. [核心] 调用 flash_attention_score 算子 --+# # - 无需手动 repeat_kv, 算子原生支持 GQA --+# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+# attn_output = mindspore.ops.flash_attention_score( --+# query=query_states, --+# key=key_states, --+# value=value_states, --+# head_num=self.num_heads, # 传入Q的头数(N1) --+# attn_mask=fa_attention_mask, --+# keep_prob=1.0 - self.attention_dropout, --+# scalar_value=1.0 / math.sqrt(self.head_dim), --+# input_layout="BNSD", --+# sparse_mode=0 # 使用 defaultMask 模式 --+# ) --+ --+# # 恢复原始数据类型 --+# attn_output = attn_output.to(input_dtype) --+ --+# # 7. 调整输出形状 --+# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+# attn_output = self.o_proj(attn_output) --+ --+# # FlashAttention 算子不直接返回注意力权重矩阵 --+# attn_weights = None --+# if output_attentions: --+# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+# return attn_output, attn_weights, past_key_value --+ --+# # def forward( --+# # self, --+# # hidden_states: mindspore.Tensor, --+# # attention_mask: Optional[mindspore.Tensor] = None, --+# # position_ids: Optional[mindspore.Tensor] = None, --+# # past_key_value: Optional[Cache] = None, --+# # output_attentions: bool = False, --+# # use_cache: bool = False, --+# # cache_position: Optional[mindspore.Tensor] = None, --+# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+# # bsz, q_len, _ = hidden_states.shape --+ --+# # # 1. 线性投射 Q, K, V --+# # query_states = self.q_proj(hidden_states) --+# # key_states = self.k_proj(hidden_states) --+# # value_states = self.v_proj(hidden_states) --+ --+# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ --+# # # 3. RoPE 旋转位置编码 --+# # kv_seq_len = key_states.shape[-2] --+# # if past_key_value is not None: --+# # if self.layer_idx is None: --+# # raise ValueError( --+# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+# # "with a layer index." --+# # ) --+# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+# # # 4. KV 缓存更新 --+# # if past_key_value is not None: --+# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+# # key_states, value_states = past_key_value.update( --+# # key_states, value_states, self.layer_idx, cache_kwargs --+# # ) --+ --+# # # 5. 准备 Attention Mask --+# # fa_attention_mask = None --+# # if attention_mask is not None: --+# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+# # fa_attention_mask = (mask_slice != 0) --+ --+# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+# # input_dtype = query_states.dtype --+ --+# # # 6. [核心] 调用 flash_attention_score 算子 --+# # attn_output = mindspore.ops.flash_attention_score( --+# # query=query_states, --+# # key=key_states, --+# # value=value_states, --+# # head_num=self.num_heads, --+# # attn_mask=fa_attention_mask, --+# # keep_prob=1.0 - self.attention_dropout, --+# # scalar_value=1.0 / math.sqrt(self.head_dim), --+# # input_layout="BNSD", --+# # sparse_mode=0, --+# # # <--- 修改点 2: 启用内部高精度计算 --- --+# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+# # inner_precise=1 --+# # ) --+ --+# # # 恢复原始数据类型 --+# # attn_output = attn_output.to(input_dtype) --+ --+# # # 7. 调整输出形状 --+# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+# # attn_output = self.o_proj(attn_output) --+ --+# # attn_weights = None --+# # if output_attentions: --+# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ --+# # return attn_output, attn_weights, past_key_value --+ --+ -- class Qwen2MoeFlashAttention(nn.Module): -- """ --- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --- --- 关键改动: --- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --- 直接传入原始的 key 和 value 张量效率更高。 --- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --+ --+ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --+ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --+ 完全使用模型的低精度数据类型(如 float16)进行计算, --+ 以达到理论上的最高执行速度。 -- """ -- def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -- super().__init__() -- self.config = config -- self.layer_idx = layer_idx --+ if layer_idx is None: --+ logger.warning_once( --+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --+ ) --+ -- self.hidden_size = config.hidden_size -- self.num_heads = config.num_attention_heads -- self.head_dim = self.hidden_size // self.num_heads -- self.num_key_value_heads = config.num_key_value_heads --- self.num_key_value_groups = self.num_heads // self.num_key_value_heads -- self.max_position_embeddings = config.max_position_embeddings -- self.rope_theta = config.rope_theta -- self.attention_dropout = config.attention_dropout -- --- if (self.head_dim * self.num_heads) != self.hidden_size: --- raise ValueError( --- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --- ) --- -- self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) -- self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) -- self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): -- key_states = self.k_proj(hidden_states) -- value_states = self.v_proj(hidden_states) -- --- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --- # query: [B, S, H*D] -> [B, N1, S, D] --- # key/val: [B, S, H2*D] -> [B, N2, S, D] --+ # 2. 调整形状以匹配 BNSD 布局 -- query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) -- key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) -- value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- --- # 3. RoPE 旋转位置编码 --+ --+ # 3. RoPE 和 KV 缓存 -- kv_seq_len = key_states.shape[-2] -- if past_key_value is not None: --- if self.layer_idx is None: --- raise ValueError( --- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --- "with a layer index." --- ) --- # 对于 StaticCache,需要特殊处理 kv_seq_len --- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --- if isinstance(past_key_value, StaticCache) and cache_position is not None: --- # 使用 cache_position 的长度来确定实际的 kv_seq_len --- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --- # 临时解决方案:使用 cache_position 的最大值(如果可能) --- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --- if cache_position.shape[0] == 1: --- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --- kv_seq_len = past_seen_tokens + 1 --- else: --- # prefill 阶段:cache_position 是范围,使用其长度 --- kv_seq_len = cache_position.shape[0] + past_seen_tokens --- else: --- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --- --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ -- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- --- # 4. KV 缓存更新 -- if past_key_value is not None: -- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --- key_states, value_states = past_key_value.update( --- key_states, value_states, self.layer_idx, cache_kwargs --- ) --- --- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --- if isinstance(past_key_value, StaticCache) and cache_position is not None: --- if cache_position.shape[0] == 1: --- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --- kv_seq_len = key_states.shape[-2] --- --- # 5. [重要] 准备 Attention Mask --- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+ # 4. 准备 Attention Mask -- fa_attention_mask = None -- if attention_mask is not None: --- # 截取与当前key长度匹配的部分 --- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) -- mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --- # 转换为布尔类型: 大负数 -> True, 0 -> False -- fa_attention_mask = (mask_slice != 0) -- --- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --- input_dtype = query_states.dtype --- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --- query_states = query_states.to(mindspore.float16) --- key_states = key_states.to(mindspore.float16) --- value_states = value_states.to(mindspore.float16) --- --- # 6. [核心] 调用 flash_attention_score 算子 --- # - 无需手动 repeat_kv, 算子原生支持 GQA --- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 -- attn_output = mindspore.ops.flash_attention_score( -- query=query_states, -- key=key_states, -- value=value_states, --- head_num=self.num_heads, # 传入Q的头数(N1) --+ head_num=self.num_heads, -- attn_mask=fa_attention_mask, --- keep_prob=1.0 - self.attention_dropout, --+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout -- scalar_value=1.0 / math.sqrt(self.head_dim), -- input_layout="BNSD", --- sparse_mode=0 # 使用 defaultMask 模式 --+ sparse_mode=0, --+ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 -- ) -- --- # 恢复原始数据类型 --- attn_output = attn_output.to(input_dtype) --- --- # 7. 调整输出形状 --- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+ # 6. 调整输出形状 -- attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) -- attn_output = self.o_proj(attn_output) -- --- # FlashAttention 算子不直接返回注意力权重矩阵 --+ # 7. 返回结果 -- attn_weights = None -- if output_attentions: --- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -- -- return attn_output, attn_weights, past_key_value -- --- # def forward( --- # self, --- # hidden_states: mindspore.Tensor, --- # attention_mask: Optional[mindspore.Tensor] = None, --- # position_ids: Optional[mindspore.Tensor] = None, --- # past_key_value: Optional[Cache] = None, --- # output_attentions: bool = False, --- # use_cache: bool = False, --- # cache_position: Optional[mindspore.Tensor] = None, --- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --- --- # bsz, q_len, _ = hidden_states.shape --- --- # # 1. 线性投射 Q, K, V --- # query_states = self.q_proj(hidden_states) --- # key_states = self.k_proj(hidden_states) --- # value_states = self.v_proj(hidden_states) --- --- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- --- # # 3. RoPE 旋转位置编码 --- # kv_seq_len = key_states.shape[-2] --- # if past_key_value is not None: --- # if self.layer_idx is None: --- # raise ValueError( --- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --- # "with a layer index." --- # ) --- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) -- --- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --- --- # # 4. KV 缓存更新 --- # if past_key_value is not None: --- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --- # key_states, value_states = past_key_value.update( --- # key_states, value_states, self.layer_idx, cache_kwargs --- # ) --- --- # # 5. 准备 Attention Mask --- # fa_attention_mask = None --- # if attention_mask is not None: --- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --- # fa_attention_mask = (mask_slice != 0) --- --- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --- # input_dtype = query_states.dtype --- --- # # 6. [核心] 调用 flash_attention_score 算子 --- # attn_output = mindspore.ops.flash_attention_score( --- # query=query_states, --- # key=key_states, --- # value=value_states, --- # head_num=self.num_heads, --- # attn_mask=fa_attention_mask, --- # keep_prob=1.0 - self.attention_dropout, --- # scalar_value=1.0 / math.sqrt(self.head_dim), --- # input_layout="BNSD", --- # sparse_mode=0, --- # # <--- 修改点 2: 启用内部高精度计算 --- --- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --- # inner_precise=1 --- # ) --- --- # # 恢复原始数据类型 --- # attn_output = attn_output.to(input_dtype) --+QWEN2MOE_ATTENTION_CLASSES = { --+ "eager": Qwen2MoeAttention, --+ "flash-attention": Qwen2MoeFlashAttention, --+} -- --- # # 7. 调整输出形状 --- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --- # attn_output = self.o_proj(attn_output) -- --- # attn_weights = None --- # if output_attentions: --- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# def __init__(self, config): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# # gating --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+ --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# #@dwj --+# # 只遍历激活的专家,而非全部专家 --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# num_tokens = hidden_states_reshaped.shape[0] --+ --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+# routing_weights = routing_weights.to(hidden_states.dtype) --+ --+# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+# flat_selected_experts = selected_experts.flatten() --+ --+# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+# token_indices = broadcasted_token_indices.flatten() --+ --+# active_experts = ops.unique(flat_selected_experts) --+ --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+ --+# mask = (flat_selected_experts == expert_idx_tensor) --+# selected_token_indices = token_indices[mask] --+# selected_routing_weights = routing_weights.flatten()[mask] --+ --+# current_states = hidden_states_reshaped[selected_token_indices] --+ --+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+ --+# final_hidden_states = final_hidden_states.index_add( --+# dim=0, --+# index=selected_token_indices, --+# source=expert_output.to(hidden_states.dtype) --+# ) --+ --+# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output -- --- # return attn_output, attn_weights, past_key_value --+# final_hidden_states = final_hidden_states + shared_expert_output --+# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ --+ --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# """ --+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+# `_moe_infer_prefill` (用于长序列处理) 方法。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# # 门控网络 --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# # 专家列表 --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+# # 共享专家 --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# @no_grad() --+# def _moe_infer_decode( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# """ --+# 【解码路径】针对 sequence_length=1 的极致优化。 --+# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+# """ --+# batch_size, hidden_dim = hidden_states.shape --+ --+# expert_outputs_list = [ --+# ops.cat([ --+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+# ], dim=0) --+# for i in range(batch_size) --+# ] --+ --+# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+# # shape: (batch_size, top_k, hidden_dim) --+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+ --+# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+ --+# return moe_output.squeeze(1) --+ --+# @no_grad() --+# def _moe_infer_prefill( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# """ --+# 【预填充路径】针对 sequence_length > 1 的优化。 --+# 按专家对 Token 进行分组,并进行批处理。 --+# """ --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens = hidden_states.shape[0] --+# flat_selected_experts = selected_experts.flatten() --+ --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+ --+# active_experts = ops.unique(flat_selected_experts) --+ --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+ --+# mask = (flat_selected_experts == expert_idx_tensor) --+# selected_token_indices = token_indices[mask] --+# selected_routing_weights = routing_weights.flatten()[mask] --+ --+# current_states = hidden_states[selected_token_indices] --+ --+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+ --+# moe_output = moe_output.index_add( --+# dim=0, --+# index=selected_token_indices, --+# source=expert_output.to(hidden_states.dtype) --+# ) --+# return moe_output --+ --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# """ --+# 顶层 forward 方法,作为智能分发器。 --+# """ --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+ --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --- # def forward( --- # self, --- # hidden_states: mindspore.Tensor, --- # attention_mask: Optional[mindspore.Tensor] = None, --- # position_ids: Optional[mindspore.Tensor] = None, --- # past_key_value: Optional[Cache] = None, --- # output_attentions: bool = False, --- # use_cache: bool = False, --- # cache_position: Optional[mindspore.Tensor] = None, --- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --- --- # bsz, q_len, _ = hidden_states.shape --- --- # query_states = self.q_proj(hidden_states) --- # key_states = self.k_proj(hidden_states) --- # value_states = self.v_proj(hidden_states) --- --- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- --- # kv_seq_len = key_states.shape[-2] --- # if past_key_value is not None: --- # if self.layer_idx is None: --- # raise ValueError("`layer_idx` must be specified for caching") --- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --- --- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --- --- # if past_key_value is not None: --- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --- # key_states, value_states = past_key_value.update( --- # key_states, value_states, self.layer_idx, cache_kwargs --- # ) --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+# routing_weights = routing_weights.to(hidden_states.dtype) --+ --+# moe_output = None --+# # 在推理时,根据序列长度选择最优路径 --+# if not self.training: --+# if sequence_length == 1: --+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+# else: --+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+# else: --+# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+# raise NotImplementedError("Training path is not implemented.") --+ --+# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+ --+# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+ --+# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ --+ --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# """ --+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# # 门控网络 --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# # 专家列表 --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+# # 共享专家 --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# @no_grad() --+# def _moe_infer_decode( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# batch_size, _ = hidden_states.shape --+# expert_outputs_list = [ --+# ops.cat([ --+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+# ], dim=0) --+# for i in range(batch_size) --+# ] --+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+# return moe_output.squeeze(1) --+ --+# @no_grad() --+# def _moe_infer_prefill( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens = hidden_states.shape[0] --+# flat_selected_experts = selected_experts.flatten() --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+# active_experts = ops.unique(flat_selected_experts) --+ --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+# mask = (flat_selected_experts == expert_idx_tensor) --+# selected_token_indices = token_indices[mask] --+# selected_routing_weights = routing_weights.flatten()[mask] --+# current_states = hidden_states[selected_token_indices] --+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+# moe_output = moe_output.index_add( --+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+# ) --+# return moe_output --+ --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# """ --+# 顶层 forward 方法,作为智能分发器。 --+# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+# """ --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+ --+# # 1. 门控计算 (通用逻辑) --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+# routing_weights = routing_weights.to(hidden_states.dtype) --+ --+# # 2. 智能分发到最优 MoE 路径 --+# moe_output = None --+# if not self.training: --+# if sequence_length == 1: --+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+# else: --+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+# else: --+# raise NotImplementedError("Training path is not implemented.") --+ --+# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+ --+# # 4. 合并 MoE 输出和共享专家输出 --+# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+ --+# # 5. 恢复原始形状并返回 --+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ --+# prefill fastest --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# """ --+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# # 门控网络 --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# # 专家列表 --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+# # 共享专家 --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# @no_grad() --+# def _moe_infer_dispatch( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# """ --+# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+# """ --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens, _ = hidden_states.shape --+ --+# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+# flat_selected_experts = selected_experts.flatten() --+# flat_routing_weights = routing_weights.flatten() -- --- # key_states = repeat_kv(key_states, self.num_key_value_groups) --- # value_states = repeat_kv(value_states, self.num_key_value_groups) --- --- # # <--- 核心修改点: 手动进行高精度缩放 --- --- # # 在调用算子前,手动将 query_states 除以缩放因子。 --- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --- # query_states = query_states / math.sqrt(self.head_dim) --- # # <--- 修改结束 --- --- --- # fa_attention_mask = None --- # if attention_mask is not None: --- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --- # fa_attention_mask = (mask_slice != 0) --- --- # input_dtype = query_states.dtype --- --- # attn_output = mindspore.ops.flash_attention_score( --- # query=query_states, # 传入已经预先缩放过的 query --- # key=key_states, --- # value=value_states, --- # head_num=self.num_heads, --- # attn_mask=fa_attention_mask, --- # keep_prob=1.0 - self.attention_dropout, --- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --- # input_layout="BNSD", --- # sparse_mode=0, --- # inner_precise=1 # 仍然保持内部高精度计算 --- # ) --+# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() -- --- # attn_output = attn_output.to(input_dtype) --- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --- # attn_output = self.o_proj(attn_output) --+# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+# active_experts = ops.unique(flat_selected_experts) --+ --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+ --+# # 找到所有分配给该专家的 token --+# mask = (flat_selected_experts == expert_idx_tensor) --+ --+# # 使用 mask 选取对应的 token 和权重 --+# current_token_indices = token_indices[mask] --+# current_routing_weights = flat_routing_weights[mask] --+# current_hidden_states = hidden_states[current_token_indices] --+ --+# # 对这些 token 进行批处理 --+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+ --+# # 使用 index_add 将结果精确地加回到对应位置 --+# moe_output = moe_output.index_add( --+# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+# ) --+# return moe_output --+ --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# """ --+# 顶层 forward 方法,作为智能分发器。 --+# """ --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+ --+# # 1. 门控计算 --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+# routing_weights = routing_weights.to(hidden_states.dtype) --+ --+# # 2. 调用统一的 MoE 计算内核 --+# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) -- --- # attn_weights = None --- # if output_attentions: --- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+# # 3. 统一处理共享专家 --+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+ --+# # 4. 合并输出 --+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+ --+# # 5. 恢复原始形状并返回 --+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ --+ --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# """ --+# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+# 【最终高性能与高精度版】: --+# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+# 3. 这样实现了速度和准确性的两全其美。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# @no_grad() --+# def _moe_infer_decode( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# """ --+# 【解码路径】极致优化版:bmm + 高精度累加。 --+# """ --+# original_dtype = hidden_states.dtype --+# batch_size, _ = hidden_states.shape --+ --+# expert_outputs_list = [ --+# ops.cat([ --+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+# ], dim=0) --+# for i in range(batch_size) --+# ] --+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+ --+# # 在 float32 下执行 bmm,得到高精度结果 --+# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+ --+# # 将高精度结果转换回原始数据类型 --+# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+ --+# return moe_output --+ --+# @no_grad() --+# def _moe_infer_prefill( --+# self, --+# hidden_states: mindspore.Tensor, --+# selected_experts: mindspore.Tensor, --+# routing_weights: mindspore.Tensor --+# ) -> mindspore.Tensor: --+# """ --+# 【预填充路径】与原始实现一致,结果精确。 --+# """ --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens, _ = hidden_states.shape --+# flat_selected_experts = selected_experts.flatten() --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+# active_experts = ops.unique(flat_selected_experts) --+ --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+# mask = (flat_selected_experts == expert_idx_tensor) --+# selected_token_indices = token_indices[mask] --+# selected_routing_weights = routing_weights.flatten()[mask] --+# current_states = hidden_states[selected_token_indices] --+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+# moe_output = moe_output.index_add( --+# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+# ) --+# return moe_output --+ --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+ --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) -- --- # return attn_output, attn_weights, past_key_value --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+# # 如果模型主体是 float16,后续再转换 --+ --+# moe_output = None --+# if not self.training: --+# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+# # _moe_infer_decode 内部会处理好类型转换 --+# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+# if sequence_length == 1: --+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+# else: --+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+# else: --+# raise NotImplementedError("Training path is not implemented.") --+ --+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+ --+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ -- ---QWEN2MOE_ATTENTION_CLASSES = { --- "eager": Qwen2MoeAttention, --- "flash-attention": Qwen2MoeFlashAttention, ---} --+# class Qwen2MoeSparseMoeBlock(nn.Module): --+# """ --+# 【融合版】一个混合专家模块,内置两种推理策略, --+# 由外部全局变量 `Long_Prompt` 控制: --+ --+# - if Long_Prompt is True: 【精度优先模式】 --+# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+# 适用于处理长序列,避免误差累积。 --+ --+# - if Long_Prompt is False: 【速度优先模式】 --+# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+# 在解码阶段获得极致速度,同时保证结果高度准确。 --+# """ --+# def __init__(self, config: Qwen2MoeConfig): --+# super().__init__() --+# self.num_experts = config.num_experts --+# self.top_k = config.num_experts_per_tok --+# self.norm_topk_prob = config.norm_topk_prob --+ --+# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+# self.experts = nn.ModuleList( --+# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+# ) --+# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+# # --- 速度优先模式的辅助函数 --- --+# @no_grad() --+# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+# original_dtype = hidden_states.dtype --+# batch_size, _ = hidden_states.shape --+# expert_outputs_list = [ --+# ops.cat([ --+# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+# ], dim=0) --+# for i in range(batch_size) --+# ] --+# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+# weights_fp32 = routing_weights.to(mindspore.float32) --+# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+# return moe_output_fp32.squeeze(1).to(original_dtype) --+ --+# @no_grad() --+# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens, _ = hidden_states.shape --+# flat_selected_experts = selected_experts.flatten() --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+# active_experts = ops.unique(flat_selected_experts) --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+# mask = (flat_selected_experts == expert_idx_tensor) --+# selected_token_indices = token_indices[mask] --+# selected_routing_weights = routing_weights.flatten()[mask] --+# current_states = hidden_states[selected_token_indices] --+# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+# return moe_output --+ --+# # --- 精度优先模式的辅助函数 --- --+# @no_grad() --+# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+# moe_output = ops.zeros_like(hidden_states) --+# num_tokens, _ = hidden_states.shape --+# flat_selected_experts = selected_experts.flatten() --+# flat_routing_weights = routing_weights.flatten() --+# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+# active_experts = ops.unique(flat_selected_experts) --+# for expert_idx_tensor in active_experts: --+# expert_idx = expert_idx_tensor.item() --+# expert_layer = self.experts[expert_idx] --+# mask = (flat_selected_experts == expert_idx_tensor) --+# current_token_indices = token_indices[mask] --+# current_routing_weights = flat_routing_weights[mask] --+# current_hidden_states = hidden_states[current_token_indices] --+# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+# return moe_output --+ --+# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+# # 声明我们将要使用一个在模块外部定义的全局变量 --+# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+# global Long_Prompt --+ --+# # 1. 门控计算 (所有模式通用) --+# batch_size, sequence_length, hidden_dim = hidden_states.shape --+# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+# router_logits = self.gate(hidden_states_reshaped) --+# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+# if self.norm_topk_prob: --+# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+# moe_output = None --+# if not self.training: --+# # 根据 Long_Prompt 标志选择模式 --+# if Long_Prompt: --+# # --- 精度优先模式 --- --+# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+# else: --+# # --- 速度优先模式 --- --+# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+# if sequence_length == 1: --+# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+# else: --+# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+# else: --+# raise NotImplementedError("Training path is not implemented.") --+ --+# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+ --+# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+ --+# return final_hidden_states, router_logits --+ --+class Qwen2MoeSparseMoeBlock(nn.Module): --+ """ --+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+ 控制的顶级推理策略: -- --+ - if Long_Prompt is True: 【精度优先模式】 --+ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --+ 适用于需要严格可复现性的长序列任务。 -- ---class Qwen2MoeSparseMoeBlock(nn.Module): --- def __init__(self, config): --+ - if Long_Prompt is False: 【速度优先模式】 --+ 采用业界最强的性能组合: --+ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --+ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --+ """ --+ def __init__(self, config: Qwen2MoeConfig): -- super().__init__() -- self.num_experts = config.num_experts -- self.top_k = config.num_experts_per_tok -- self.norm_topk_prob = config.norm_topk_prob -- --- # gating -- self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) -- self.experts = nn.ModuleList( -- [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] -- ) --- -- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) -- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) -- --- #@dwj --- # 只遍历激活的专家,而非全部专家 --- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --- batch_size, sequence_length, hidden_dim = hidden_states.shape --- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --- num_tokens = hidden_states_reshaped.shape[0] --- --- router_logits = self.gate(hidden_states_reshaped) --- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- --- if self.norm_topk_prob: --- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- routing_weights = routing_weights.to(hidden_states.dtype) --- --- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --- flat_selected_experts = selected_experts.flatten() --- --- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --- token_indices = broadcasted_token_indices.flatten() --- --- active_experts = ops.unique(flat_selected_experts) --- --- for expert_idx_tensor in active_experts: --- expert_idx = expert_idx_tensor.item() --- expert_layer = self.experts[expert_idx] --- --- mask = (flat_selected_experts == expert_idx_tensor) --- selected_token_indices = token_indices[mask] --- selected_routing_weights = routing_weights.flatten()[mask] --- --- current_states = hidden_states_reshaped[selected_token_indices] --- --- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --- --- final_hidden_states = final_hidden_states.index_add( --- dim=0, --- index=selected_token_indices, --- source=expert_output.to(hidden_states.dtype) --- ) --- --- shared_expert_output = self.shared_expert(hidden_states_reshaped) --- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --+ @no_grad() --+ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ original_dtype = hidden_states.dtype --+ batch_size, _ = hidden_states.shape --+ expert_outputs_list = [ --+ ops.cat([ --+ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+ ], dim=0) --+ for i in range(batch_size) --+ ] --+ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+ weights_fp32 = routing_weights.to(mindspore.float32) --+ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+ return moe_output_fp32.squeeze(1).to(original_dtype) --+ --+ @no_grad() --+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ num_tokens, _ = hidden_states.shape --+ flat_selected_experts = selected_experts.flatten() --+ sorted_expert_indices = flat_selected_experts.argsort() --+ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+ original_token_indices = sorted_expert_indices // self.top_k --+ moe_output = ops.zeros_like(hidden_states) --+ current_token_offset = 0 --+ for i in range(self.num_experts): --+ expert_token_count = tokens_per_expert[i] - current_token_offset --+ if expert_token_count == 0: --+ continue --+ end_offset = current_token_offset + expert_token_count --+ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+ expert_hidden_states = hidden_states[expert_original_token_indices] --+ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+ current_token_offset += expert_token_count --+ return moe_output --+ --+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+ @no_grad() --+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ moe_output = ops.zeros_like(hidden_states) --+ num_tokens, _ = hidden_states.shape --+ flat_selected_experts = selected_experts.flatten() --+ flat_routing_weights = routing_weights.flatten() --+ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+ active_experts = ops.unique(flat_selected_experts) --+ for expert_idx_tensor in active_experts: --+ expert_idx = expert_idx_tensor.item() --+ expert_layer = self.experts[expert_idx] --+ mask = (flat_selected_experts == expert_idx_tensor) --+ current_token_indices = token_indices[mask] --+ current_routing_weights = flat_routing_weights[mask] --+ current_hidden_states = hidden_states[current_token_indices] --+ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+ return moe_output -- --- final_hidden_states = final_hidden_states + shared_expert_output --- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --- --- return final_hidden_states, router_logits --+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+ global Long_Prompt --+ --+ # 1. 门控计算 (所有模式通用) --+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+ router_logits = self.gate(hidden_states_reshaped) --+ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+ if self.norm_topk_prob: --+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+ moe_output = None --+ if Long_Prompt: --+ # --- 精度优先模式 (ACCURACY MODE) --- --+ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ else: --+ # --- 速度优先模式 (SPEED MODE) --- --+ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+ if sequence_length == 1: --+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ else: --+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ -- --+ # 3. 共享专家计算与合并 (所有模式通用) --+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+ --+ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+ --+ return final_hidden_states, router_logits -- -- class Qwen2MoeDecoderLayer(nn.Module): -- def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -- super().__init__() -- self.hidden_size = config.hidden_size --+ --+ # if Long_Prompt: --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ # else: --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) -- -- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- --- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --- -- if (layer_idx not in config.mlp_only_layers) and ( -- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 -- ): --@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- self._warmed_up = True -- self.warmup_moe_model() -- --+ --+ -- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions -- output_router_logits = ( -- output_router_logits if output_router_logits is not None else self.config.output_router_logits --@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- router_logits=outputs.router_logits, -- ) -- --+ def generate(self, *args, **kwargs): --+ """ --+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+ """ --+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+ --+ input_ids = kwargs.get("input_ids") --+ if input_ids is None and args: --+ input_ids = args[0] --+ --+ if input_ids is not None: --+ prompt_length = input_ids.shape[1] --+ --+ if prompt_length > PROMPT_LENGTH_THRESHOLD: --+ Long_Prompt = True --+ else: --+ Long_Prompt = False --+ --+ return super().generate(*args, **kwargs) --+ -- # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation -- def prepare_inputs_for_generation( -- self, --@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens -- # Exception 1: when passing input_embeds, input_ids may be missing entries -- # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --+ -- if past_key_values is not None: -- if inputs_embeds is not None: # Exception 1 -- if 0 not in input_ids.shape: --@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- } -- ) -- return model_inputs --+ -- # @lwx -- # def _decode_one_tokens_logits( -- # self, --@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): -- attentions=outputs.attentions, -- ) -- --+ -- __all__ = [ -- "Qwen2MoeForCausalLM", -- "Qwen2MoeModel", --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --new file mode 100644 --index 00000000..6dfb5b93 ----- /dev/null --+++ b/patches/0001-20251104commit.patch --@@ -0,0 +1,1272 @@ --+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Tue, 4 Nov 2025 09:11:51 +0800 --+Subject: [PATCH] 20251104commit --+ --+--- --+ mindnlp/transformers/cache_utils.py | 28 +- --+ .../models/deepseek/modeling_deepseek.py | 149 ++- --+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+ 3 files changed, 976 insertions(+), 87 deletions(-) --+ --+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+index cadd2e04..02f8d4be 100644 --+--- a/mindnlp/transformers/cache_utils.py --++++ b/mindnlp/transformers/cache_utils.py --+@@ -812,14 +812,26 @@ class StaticCache(Cache): --+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+ # k_out[:, :, cache_position] = key_states --+ # v_out[:, :, cache_position] = value_states --+- if ON_ORANGE_PI: --+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+- else: --+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+- --++ # if ON_ORANGE_PI: --++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++ # else: --++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++ # 确保 cache_position 是 1D tensor 并且类型正确 --++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++ if cache_position.ndim > 1: --++ cache_position = cache_position.flatten() --++ # 确保类型是 int32 或 int64(MindSpore 要求) --++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++ cache_position = cache_position.int() --++ --++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++ k_out[:, :, cache_position] = key_states --++ v_out[:, :, cache_position] = value_states --++ --+ return k_out, v_out --+ --+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index c695b944..d8303e45 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+ # Copied from transformers.models.llama.modeling_llama.rotate_half --+ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+- x1 = x[..., : x.shape[-1] // 2] --+- x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++ # x1 = x[..., : x.shape[-1] // 2] --++ # x2 = x[..., x.shape[-1] // 2 :] --++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+ if self.training: --+ raise NotImplementedError("Training is not supported yet.") --+ else: --+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+- if self.config.n_shared_experts is not None: --+- y = y + self.shared_experts(identity) --+- return y --++ # @lwx --++ if orig_shape[1] == 1: --++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++ y=y.view(*orig_shape) --++ if self.config.n_shared_experts is not None: --++ y = y + self.shared_experts(identity) --++ return y --++ else: --++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++ if self.config.n_shared_experts is not None: --++ y = y + self.shared_experts(identity) --++ return y --++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++ # if self.config.n_shared_experts is not None: --++ # y = y + self.shared_experts(identity) --++ # return y --++ --++ @no_grad() --++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ --++ expert_cache = ops.zeros_like(x) --++ for i in range(self.num_experts_per_tok): --++ expert_id = flat_expert_indices[i].item() --++ weight = flat_expert_weights[i].item() --++ expert = self.experts[expert_id] --++ expert_out = expert(x) --++ expert_cache += expert_out * weight --++ return expert_cache --+ --+ @no_grad() --+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+- # expert_cache = torch.zeros_like(x) --+- # idxs = flat_expert_indices.argsort() --+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+- # token_idxs = idxs // self.num_experts_per_tok --+- # for i, end_idx in enumerate(tokens_per_expert): --+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+- # if start_idx == end_idx: --+- # continue --+- # expert = self.experts[i] --+- # exp_token_idx = token_idxs[start_idx:end_idx] --+- # expert_tokens = x[exp_token_idx] --+- # expert_out = expert(expert_tokens) --+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+- # return expert_cache --++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ expert_cache = ops.zeros_like(x) --+ idxs = flat_expert_indices.argsort() --+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ token_idxs = idxs // self.num_experts_per_tok --++ --+ for i, end_idx in enumerate(tokens_per_expert): --+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ if start_idx == end_idx: --+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+ expert_out = expert(expert_tokens) --+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --+ return expert_cache --++ --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # # expert_cache = torch.zeros_like(x) --++ # # idxs = flat_expert_indices.argsort() --++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++ # # token_idxs = idxs // self.num_experts_per_tok --++ # # for i, end_idx in enumerate(tokens_per_expert): --++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++ # # if start_idx == end_idx: --++ # # continue --++ # # expert = self.experts[i] --++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++ # # expert_tokens = x[exp_token_idx] --++ # # expert_out = expert(expert_tokens) --++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++ # # return expert_cache --++ # expert_cache = ops.zeros_like(x) --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # for i, end_idx in enumerate(tokens_per_expert): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # if start_idx == end_idx: --++ # continue --++ # expert = self.experts[i] --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = expert(expert_tokens) --++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --++ # return expert_cache --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # expert_cache = ops.zeros_like(x) --++ --++ # # 排序保证顺序一致 --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # # 找出有 token 的专家 --++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++ --++ # for i in active_experts.tolist(): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # end_idx = tokens_per_expert[i] --++ # if start_idx == end_idx: # 没有 token --++ # continue --++ --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = self.experts[i](expert_tokens) --++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++ --++ # expert_cache = mindspore.mint.scatter_add( --++ # expert_cache, --++ # 0, --++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++ # expert_out --++ # ) --++ --++ # return expert_cache --++ --++ --+ --+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+ # """ --+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ --+ # Initialize weights and apply final processing --+ self.post_init() --++ self.warm_up = False --++ --++ def warmup_moe_model_deep(self): --++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++ test_texts = [ --++ "warmup short", --++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++ ] --++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++ if tokenizer is None: --++ from mindnlp.transformers import AutoTokenizer --++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++ self._warmup_tokenizer = tokenizer --++ --++ for text in test_texts: --++ inputs = tokenizer(text, return_tensors="ms") --++ with mindspore._no_grad(): --++ _ = self(**inputs, use_cache=False) --++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+ --+ def get_input_embeddings(self): --+ return self.model.embed_tokens --+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+ ```""" --++ if not self.warm_up: --++ self.warm_up = True --++ self.warmup_moe_model_deep() --++ --+ output_attentions = ( --+ output_attentions --+ if output_attentions is not None --+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+index 3cbf820e..d4c6b651 100644 --+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+@@ -18,7 +18,6 @@ --+ # See the License for the specific language governing permissions and --+ # limitations under the License. --+ """MindSpore Qwen2MoE model.""" --+- --+ import math --+ from typing import List, Optional, Tuple, Union --+ --+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+ TokenClassifierOutput, --+ ) --+ from ...modeling_utils import PreTrainedModel --++from ...generation import GenerationMixin --+ from ....utils import logging --+ from .configuration_qwen2_moe import Qwen2MoeConfig --+ --+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+ self.variance_epsilon = eps --+ --+ def forward(self, hidden_states): --++ # @dwj --++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++ # @lwx --++ # if not self.training : --++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+ input_dtype = hidden_states.dtype --+ hidden_states = hidden_states.to(mindspore.float32) --+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+@@ -234,6 +239,8 @@ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+ x1 = x[..., : x.shape[-1] // 2] --+ x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+ self.config = config --+ self.hidden_size = config.hidden_size --+ self.intermediate_size = intermediate_size --++ --+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+ self.act_fn = ACT2FN[config.hidden_act] --+ --+ def forward(self, x): --+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+- --+ --++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++ # @lwx --++ # gate_up_output = self.gate_up_proj(x) --++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++ # return self.down_proj(swiglu_output) --++ --++ # def forward(self, x): --++ # gate_proj_out = self.gate_proj(x) --++ # up_proj_out = self.up_proj(x) --++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++ # return self.down_proj(swiglu_out) --++ --+ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+ """ --+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+ use_cache: bool = False, --+ cache_position: Optional[mindspore.Tensor] = None, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ --++ --+ bsz, q_len, _ = hidden_states.shape --+ --+ query_states = self.q_proj(hidden_states) --+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ "with a layer index." --+ ) --+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ if isinstance(past_key_value, StaticCache): --++ kv_seq_len = key_states.shape[-2] --++ else: --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++ if isinstance(past_key_value, StaticCache): --++ kv_seq_len = key_states.shape[-2] --+ --+ # repeat k/v heads if n_kv_heads < n_heads --+ key_states = repeat_kv(key_states, self.num_key_value_groups) --+ value_states = repeat_kv(value_states, self.num_key_value_groups) --+- --++ --+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+ --+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+- raise ValueError( --+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+- f" {attn_weights.shape}" --+- ) --+- --+- if attention_mask is not None: # no matter the length, we just slice it --+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++ if attention_mask is not None: --++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+ attn_weights = attn_weights + causal_mask --+ --+ # upcast attention to fp32 --+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+ --+ attn_output = self.o_proj(attn_output) --+- --++ # @lwx --++ --++ # max_seq_len = self.max_position_embeddings # 2048 --++ --++ # if attention_mask is not None: --++ # # attention_mask: [B, 1, Sq, Sk] --++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++ --++ # # pad 到 [max_seq_len, max_seq_len] --++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++ # global_attention_mask = padded_mask --++ # else: --++ # global_attention_mask = None --++ --++ --++ # sparse_mode=3 --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, --++ # key=key_states, --++ # value=value_states, --++ # real_shift=None, --++ # padding_mask=None, --++ --++ # head_num=self.num_heads, --++ # attn_mask=global_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++ # input_layout="BNSD", --++ # pre_tokens=2147483647, --++ # next_tokens=2147483647, --++ # inner_precise=0, --++ # drop_mask=None, --++ # prefix=None, --++ # actual_seq_qlen=None, --++ # actual_seq_kvlen=None, --++ # sparse_mode=sparse_mode, --++ # ) --+ if not output_attentions: --+ attn_weights = None --+ --+ return attn_output, attn_weights, past_key_value --+ --+ --++class Qwen2MoeFlashAttention(nn.Module): --++ """ --++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++ --++ 关键改动: --++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++ 直接传入原始的 key 和 value 张量效率更高。 --++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++ """ --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++ super().__init__() --++ self.config = config --++ self.layer_idx = layer_idx --++ self.hidden_size = config.hidden_size --++ self.num_heads = config.num_attention_heads --++ self.head_dim = self.hidden_size // self.num_heads --++ self.num_key_value_heads = config.num_key_value_heads --++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++ self.max_position_embeddings = config.max_position_embeddings --++ self.rope_theta = config.rope_theta --++ self.attention_dropout = config.attention_dropout --++ --++ if (self.head_dim * self.num_heads) != self.hidden_size: --++ raise ValueError( --++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++ ) --++ --++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++ --++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++ self.head_dim, --++ max_position_embeddings=self.max_position_embeddings, --++ base=self.rope_theta, --++ ) --++ --++ def forward( --++ self, --++ hidden_states: mindspore.Tensor, --++ attention_mask: Optional[mindspore.Tensor] = None, --++ position_ids: Optional[mindspore.Tensor] = None, --++ past_key_value: Optional[Cache] = None, --++ output_attentions: bool = False, --++ use_cache: bool = False, --++ cache_position: Optional[mindspore.Tensor] = None, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ bsz, q_len, _ = hidden_states.shape --++ --++ # 1. 线性投射 Q, K, V --++ query_states = self.q_proj(hidden_states) --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++ # query: [B, S, H*D] -> [B, N1, S, D] --++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # 3. RoPE 旋转位置编码 --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++ if self.layer_idx is None: --++ raise ValueError( --++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ "with a layer index." --++ ) --++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++ if cache_position.shape[0] == 1: --++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++ kv_seq_len = past_seen_tokens + 1 --++ else: --++ # prefill 阶段:cache_position 是范围,使用其长度 --++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++ else: --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # 4. KV 缓存更新 --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ key_states, value_states = past_key_value.update( --++ key_states, value_states, self.layer_idx, cache_kwargs --++ ) --++ --++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++ if cache_position.shape[0] == 1: --++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++ kv_seq_len = key_states.shape[-2] --++ --++ # 5. [重要] 准备 Attention Mask --++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++ fa_attention_mask = None --++ if attention_mask is not None: --++ # 截取与当前key长度匹配的部分 --++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++ fa_attention_mask = (mask_slice != 0) --++ --++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++ input_dtype = query_states.dtype --++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++ query_states = query_states.to(mindspore.float16) --++ key_states = key_states.to(mindspore.float16) --++ value_states = value_states.to(mindspore.float16) --++ --++ # 6. [核心] 调用 flash_attention_score 算子 --++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++ attn_output = mindspore.ops.flash_attention_score( --++ query=query_states, --++ key=key_states, --++ value=value_states, --++ head_num=self.num_heads, # 传入Q的头数(N1) --++ attn_mask=fa_attention_mask, --++ keep_prob=1.0 - self.attention_dropout, --++ scalar_value=1.0 / math.sqrt(self.head_dim), --++ input_layout="BNSD", --++ sparse_mode=0 # 使用 defaultMask 模式 --++ ) --++ --++ # 恢复原始数据类型 --++ attn_output = attn_output.to(input_dtype) --++ --++ # 7. 调整输出形状 --++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ attn_output = self.o_proj(attn_output) --++ --++ # FlashAttention 算子不直接返回注意力权重矩阵 --++ attn_weights = None --++ if output_attentions: --++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++ return attn_output, attn_weights, past_key_value --++ --++ # def forward( --++ # self, --++ # hidden_states: mindspore.Tensor, --++ # attention_mask: Optional[mindspore.Tensor] = None, --++ # position_ids: Optional[mindspore.Tensor] = None, --++ # past_key_value: Optional[Cache] = None, --++ # output_attentions: bool = False, --++ # use_cache: bool = False, --++ # cache_position: Optional[mindspore.Tensor] = None, --++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ # bsz, q_len, _ = hidden_states.shape --++ --++ # # 1. 线性投射 Q, K, V --++ # query_states = self.q_proj(hidden_states) --++ # key_states = self.k_proj(hidden_states) --++ # value_states = self.v_proj(hidden_states) --++ --++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # # 3. RoPE 旋转位置编码 --++ # kv_seq_len = key_states.shape[-2] --++ # if past_key_value is not None: --++ # if self.layer_idx is None: --++ # raise ValueError( --++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ # "with a layer index." --++ # ) --++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # # 4. KV 缓存更新 --++ # if past_key_value is not None: --++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ # key_states, value_states = past_key_value.update( --++ # key_states, value_states, self.layer_idx, cache_kwargs --++ # ) --++ --++ # # 5. 准备 Attention Mask --++ # fa_attention_mask = None --++ # if attention_mask is not None: --++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # fa_attention_mask = (mask_slice != 0) --++ --++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++ # input_dtype = query_states.dtype --++ --++ # # 6. [核心] 调用 flash_attention_score 算子 --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, --++ # key=key_states, --++ # value=value_states, --++ # head_num=self.num_heads, --++ # attn_mask=fa_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++ # input_layout="BNSD", --++ # sparse_mode=0, --++ # # <--- 修改点 2: 启用内部高精度计算 --- --++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++ # inner_precise=1 --++ # ) --++ --++ # # 恢复原始数据类型 --++ # attn_output = attn_output.to(input_dtype) --++ --++ # # 7. 调整输出形状 --++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ # attn_output = self.o_proj(attn_output) --++ --++ # attn_weights = None --++ # if output_attentions: --++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++ # return attn_output, attn_weights, past_key_value --++ --++ # def forward( --++ # self, --++ # hidden_states: mindspore.Tensor, --++ # attention_mask: Optional[mindspore.Tensor] = None, --++ # position_ids: Optional[mindspore.Tensor] = None, --++ # past_key_value: Optional[Cache] = None, --++ # output_attentions: bool = False, --++ # use_cache: bool = False, --++ # cache_position: Optional[mindspore.Tensor] = None, --++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ # bsz, q_len, _ = hidden_states.shape --++ --++ # query_states = self.q_proj(hidden_states) --++ # key_states = self.k_proj(hidden_states) --++ # value_states = self.v_proj(hidden_states) --++ --++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # kv_seq_len = key_states.shape[-2] --++ # if past_key_value is not None: --++ # if self.layer_idx is None: --++ # raise ValueError("`layer_idx` must be specified for caching") --++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # if past_key_value is not None: --++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ # key_states, value_states = past_key_value.update( --++ # key_states, value_states, self.layer_idx, cache_kwargs --++ # ) --++ --++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++ --++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++ # query_states = query_states / math.sqrt(self.head_dim) --++ # # <--- 修改结束 --- --++ --++ # fa_attention_mask = None --++ # if attention_mask is not None: --++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # fa_attention_mask = (mask_slice != 0) --++ --++ # input_dtype = query_states.dtype --++ --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, # 传入已经预先缩放过的 query --++ # key=key_states, --++ # value=value_states, --++ # head_num=self.num_heads, --++ # attn_mask=fa_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++ # input_layout="BNSD", --++ # sparse_mode=0, --++ # inner_precise=1 # 仍然保持内部高精度计算 --++ # ) --++ --++ # attn_output = attn_output.to(input_dtype) --++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ # attn_output = self.o_proj(attn_output) --++ --++ # attn_weights = None --++ # if output_attentions: --++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++ --++ # return attn_output, attn_weights, past_key_value --++ --+ QWEN2MOE_ATTENTION_CLASSES = { --+ "eager": Qwen2MoeAttention, --++ "flash-attention": Qwen2MoeFlashAttention, --+ } --+ --+ --+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --++ #@dwj --++ # 只遍历激活的专家,而非全部专家 --+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+- batch_size, sequence_length, hidden_dim = hidden_states.shape --+- hidden_states = hidden_states.view(-1, hidden_dim) --+- # router_logits: (batch * sequence_length, n_experts) --+- router_logits = self.gate(hidden_states) --+- --+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- if self.norm_topk_prob: --+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- # we cast back to the input dtype --+- routing_weights = routing_weights.to(hidden_states.dtype) --+- --+- final_hidden_states = ops.zeros( --+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+- ) --+- --+- # One hot encode the selected experts to create an expert mask --+- # this will be used to easily index which expert is going to be sollicitated --+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+- --+- # Loop over all available experts in the model and perform the computation on each expert --+- for expert_idx in range(self.num_experts): --+- expert_layer = self.experts[expert_idx] --+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+- --+- # Index the correct hidden states and compute the expert hidden state for --+- # the current expert. We need to make sure to multiply the output hidden --+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+- if 0 not in idx.shape: --+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+- --+- # However `index_add_` only support torch tensors for indexing so we'll use --+- # the `top_x` tensor here. --+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+- --+- shared_expert_output = self.shared_expert(hidden_states) --+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+- --+- final_hidden_states = final_hidden_states + shared_expert_output --++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++ num_tokens = hidden_states_reshaped.shape[0] --++ --++ router_logits = self.gate(hidden_states_reshaped) --++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++ if self.norm_topk_prob: --++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ routing_weights = routing_weights.to(hidden_states.dtype) --++ --++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++ flat_selected_experts = selected_experts.flatten() --++ --++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++ token_indices = broadcasted_token_indices.flatten() --++ --++ active_experts = ops.unique(flat_selected_experts) --++ --++ for expert_idx_tensor in active_experts: --++ expert_idx = expert_idx_tensor.item() --++ expert_layer = self.experts[expert_idx] --++ --++ mask = (flat_selected_experts == expert_idx_tensor) --++ selected_token_indices = token_indices[mask] --++ selected_routing_weights = routing_weights.flatten()[mask] --++ --++ current_states = hidden_states_reshaped[selected_token_indices] --++ --++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++ --++ final_hidden_states = final_hidden_states.index_add( --++ dim=0, --++ index=selected_token_indices, --++ source=expert_output.to(hidden_states.dtype) --++ ) --++ --++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+ --+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+- return final_hidden_states, router_logits --++ final_hidden_states = final_hidden_states + shared_expert_output --++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++ --++ return final_hidden_states, router_logits --+ --+ --+ class Qwen2MoeDecoderLayer(nn.Module): --+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+ --+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++ --+ if (layer_idx not in config.mlp_only_layers) and ( --+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+ ): --+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+ _skip_keys_device_placement = "past_key_values" --+ _supports_cache_class = True --++#lwx --++ # _supports_static_cache = True --+ --+ def _init_weights(self, module): --+ std = self.config.initializer_range --+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+ return causal_mask --+ --+ --+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ _tied_weights_keys = ["lm_head.weight"] --+ --+ def __init__(self, config): --+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ self.num_experts_per_tok = config.num_experts_per_tok --+ # Initialize weights and apply final processing --+ self.post_init() --++ # @lwx --++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++ # self.generation_config.cache_implementation = "static" --++ self._warmed_up = False --++ --++ def warmup_moe_model(self): --++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++ test_texts = [ --++ "warmup short", --++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++ ] --++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++ if tokenizer is None: --++ from mindnlp.transformers import AutoTokenizer --++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++ self._warmup_tokenizer = tokenizer --++ --++ for text in test_texts: --++ inputs = tokenizer(text, return_tensors="ms") --++ with mindspore._no_grad(): --++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+ --+ def get_input_embeddings(self): --+ return self.model.embed_tokens --+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+ ```""" --++ if not self._warmed_up: --++ self._warmed_up = True --++ self.warmup_moe_model() --+ --+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+ output_router_logits = ( --+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ } --+ ) --+ return model_inputs --++# @lwx --++ # def _decode_one_tokens_logits( --++ # self, --++ # cur_token: mindspore.Tensor, --++ # input_pos: Optional[mindspore.Tensor], --++ # cache_position: mindspore.Tensor, --++ # past_key_values: StaticCache, --++ # ) -> mindspore.Tensor: --++ # """ --++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++ --++ # Args: --++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++ # input_pos: 输入位置信息,可选 --++ # cache_position: 当前token在cache中的位置,shape为(1,) --++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++ --++ # Returns: --++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++ # """ --++ # # 调用JIT编译的版本 --++ # return self.get_decode_one_tokens_logits( --++ # cur_token=cur_token, --++ # input_pos=input_pos, --++ # cache_position=cache_position, --++ # past_key_values=past_key_values, --++ # ) --++ --++ # @mindspore.jit(jit_level='O1') --++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++ # """ --++ # JIT编译的函数,用于高效的单token解码 --++ # 使用JIT编译优化以支持静态shape和高效执行 --++ --++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++ # """ --++ # outputs = self.model.forward( --++ # input_ids=cur_token, --++ # position_ids=input_pos, --++ # cache_position=cache_position, --++ # past_key_values=past_key_values, --++ # use_cache=True, --++ # return_dict=False, --++ # ) --++ --++ # hidden_states = outputs[0] --++ # logits = self.lm_head.forward(hidden_states) --++ # logits = logits.float() --++ --++ # return logits[:, -1, :] --++ --++ # def _sample( --++ # self, --++ # input_ids: mindspore.Tensor, --++ # logits_processor, --++ # stopping_criteria, --++ # generation_config, --++ # synced_devices: bool, --++ # streamer=None, --++ # logits_warper=None, --++ # **model_kwargs, --++ # ): --++ # """ --++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++ # """ --++ # from ...generation.logits_process import LogitsProcessorList --++ # from ...generation.stopping_criteria import StoppingCriteriaList --++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++ # from mindnlp.core import nn, ops, no_grad --++ # import numpy as np --++ --++ # # 检查是否使用 StaticCache --++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++ # # 否则,直接调用父类方法 --++ # past_key_values = model_kwargs.get("past_key_values") --++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++ --++ # if not isinstance(past_key_values, StaticCache): --++ # # 不使用 StaticCache,直接调用父类方法 --++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++ # return super()._sample( --++ # input_ids=input_ids, --++ # logits_processor=logits_processor, --++ # stopping_criteria=stopping_criteria, --++ # generation_config=generation_config, --++ # synced_devices=synced_devices, --++ # streamer=streamer, --++ # logits_warper=logits_warper, --++ # **model_kwargs, --++ # ) --++ --++ # # 使用 StaticCache,进入自定义循环 --++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++ # pad_token_id = generation_config._pad_token_tensor --++ # output_attentions = generation_config.output_attentions --++ # output_hidden_states = generation_config.output_hidden_states --++ # output_scores = generation_config.output_scores --++ # output_logits = generation_config.output_logits --++ # return_dict_in_generate = generation_config.return_dict_in_generate --++ # max_length = generation_config.max_length --++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++ # do_sample = generation_config.do_sample --++ --++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++ # raise ValueError( --++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++ # f"{logits_warper})." --++ # ) --++ --++ # # init attention / hidden states / scores tuples --++ # scores = () if (return_dict_in_generate and output_scores) else None --++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++ --++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++ # encoder_hidden_states = ( --++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++ # ) --++ --++ # # keep track of which sequences are already finished --++ # batch_size, cur_len = input_ids.shape --++ # this_peer_finished = False --++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++ --++ # time_record = [] --++ # from ....utils.testing_utils import parse_flag_from_env --++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++ --++ # while self._has_unfinished_sequences( --++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++ # ): --++ # if _record_time: --++ # import time as time_module --++ # infer_start = time_module.time() --++ --++ # # prepare model inputs --++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++ --++ # # prepare variable output controls --++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++ --++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++ # cur_cache_position = model_inputs.get("cache_position") --++ # cur_past_key_values = model_inputs.get("past_key_values") --++ # cur_input_ids = model_inputs.get("input_ids") --++ --++ # if (isinstance(cur_past_key_values, StaticCache) and --++ # cur_cache_position is not None and --++ # len(cur_cache_position.shape) > 0 and --++ # cur_cache_position.shape[0] == 1 and --++ # cur_input_ids is not None and --++ # cur_input_ids.shape[1] == 1): --++ # # 使用 JIT 优化的单 token 解码 --++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++ # if not hasattr(self, '_jit_used'): --++ # self._jit_used = False --++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++ --++ # next_token_logits = self.get_decode_one_tokens_logits( --++ # cur_token=cur_input_ids, --++ # input_pos=model_inputs.get("position_ids"), --++ # cache_position=cur_cache_position, --++ # past_key_values=cur_past_key_values, --++ # ) --++ --++ # # 标记已使用JIT(用于后续判断) --++ # if not self._jit_used: --++ # self._jit_used = True --++ --++ # # 构造兼容的输出对象 --++ # class JitOptimizedOutput: --++ # def __init__(self, logits, config): --++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++ # self.config = config --++ # # 对于 JIT 优化路径,这些属性通常不需要 --++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++ # self.attentions = None if not config.is_encoder_decoder else None --++ # self.cross_attentions = None --++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++ # self.hidden_states = None if not config.is_encoder_decoder else None --++ --++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++ # else: --++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++ # outputs = self(**model_inputs, return_dict=True) --++ --++ # if synced_devices and this_peer_finished: --++ # continue --++ --++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++ # next_token_logits = outputs.logits[:, -1, :] --++ --++ # # pre-process distribution --++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++ # if do_sample: --++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++ --++ # # Store scores, attentions and hidden_states when required --++ # if return_dict_in_generate: --++ # if output_scores: --++ # scores += (next_token_scores,) --++ # if output_logits: --++ # raw_logits += (next_token_logits,) --++ # if output_attentions: --++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++ # decoder_attentions += (attn,) if attn is not None else (None,) --++ # if self.config.is_encoder_decoder: --++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++ --++ # if output_hidden_states: --++ # hidden = ( --++ # outputs.decoder_hidden_states --++ # if self.config.is_encoder_decoder --++ # else outputs.hidden_states --++ # ) --++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++ --++ # # token selection --++ # if do_sample: --++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++ # else: --++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++ --++ # # finished sentences should have their next token be a padding token --++ # if has_eos_stopping_criteria: --++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++ --++ # # update generated ids, model inputs, and length for next step --++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++ # if streamer is not None: --++ # streamer.put(next_tokens) --++ --++ # model_kwargs = self._update_model_kwargs_for_generation( --++ # outputs, --++ # model_kwargs, --++ # is_encoder_decoder=self.config.is_encoder_decoder, --++ # ) --++ --++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++ # cur_len += 1 --++ --++ # if _record_time: --++ # import time as time_module --++ # infer_stop = time_module.time() --++ # time_record.append(infer_stop - infer_start) --++ --++ # del outputs --++ --++ # average_infer_time = None --++ # if time_record: --++ # if len(time_record) > 1: --++ # time_record.pop(0) --++ # average_infer_time = sum(time_record) / len(time_record) --++ # print(f'average inference time is: {average_infer_time}') --++ # print(f'inference time record: {time_record}') --++ --++ # if streamer is not None: --++ # streamer.end() --++ --++ # # 简单判断:打印是否使用了JIT路径 --++ # if hasattr(self, '_jit_used') and self._jit_used: --++ # print("[JIT] ✓ JIT optimization was used during generation") --++ # else: --++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++ --++ # if return_dict_in_generate: --++ # if self.config.is_encoder_decoder: --++ # return GenerateEncoderDecoderOutput( --++ # sequences=input_ids, --++ # scores=scores, --++ # logits=raw_logits, --++ # encoder_attentions=encoder_attentions, --++ # encoder_hidden_states=encoder_hidden_states, --++ # decoder_attentions=decoder_attentions, --++ # cross_attentions=cross_attentions, --++ # decoder_hidden_states=decoder_hidden_states, --++ # past_key_values=model_kwargs.get("past_key_values"), --++ # average_infer_time=average_infer_time --++ # ) --++ # else: --++ # return GenerateDecoderOnlyOutput( --++ # sequences=input_ids, --++ # scores=scores, --++ # logits=raw_logits, --++ # attentions=decoder_attentions, --++ # hidden_states=decoder_hidden_states, --++ # past_key_values=model_kwargs.get("past_key_values"), --++ # average_infer_time=average_infer_time --++ # ) --++ # else: --++ # return input_ids --++ --++ # def _prepare_cache_for_generation( --++ # self, --++ # generation_config, --++ # model_kwargs, --++ # assistant_model, --++ # batch_size, --++ # max_cache_length, --++ # ): --++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++ # generation_config.cache_implementation = "static" --++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++ --++ # if generation_config.cache_implementation == "static": --++ # base_required_from_max_length = generation_config.max_length + 1 --++ # base_required = max(max_cache_length, base_required_from_max_length) --++ # min_cache_size = 50 --++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++ # else: --++ # max_cache_length = max(base_required, min_cache_size) --++ --++ # original_max_cache_length = max_cache_length --++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++ # print(f" - input max_cache_length: {original_max_cache_length}") --++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++ # print(f" - final max_cache_length: {max_cache_length}") --++ --++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++ # if max_cache_length > self.config.max_position_embeddings: --++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++ --++ # result = super()._prepare_cache_for_generation( --++ # generation_config=generation_config, --++ # model_kwargs=model_kwargs, --++ # assistant_model=assistant_model, --++ # batch_size=batch_size, --++ # max_cache_length=max_cache_length, --++ # ) --++ --++ # if generation_config.cache_implementation == "static": --++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++ # created_cache = model_kwargs.get(cache_name) --++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++ # if created_cache.max_cache_len < generation_config.max_length: --++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++ --++ # return result --++ --++ --++ --+ --+ --+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+-- --+2.27.0 --+ ---- --2.27.0 -- -diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch -deleted file mode 100644 -index 179a9bb5..00000000 ---- a/patches/0003-20261106secondcommit.patch -+++ /dev/null -@@ -1,2769 +0,0 @@ --From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Thu, 6 Nov 2025 14:54:37 +0800 --Subject: [PATCH 3/8] 20261106secondcommit -- ----- -- .../models/deepseek/modeling_deepseek.py | 217 ++- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- -- patches/0001-20251104commit.patch | 1272 ----------------- -- 3 files changed, 528 insertions(+), 2032 deletions(-) -- delete mode 100644 patches/0001-20251104commit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index 73773c22..2f9192bf 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) -- -- _CONFIG_FOR_DOC = "DeepseekConfig" -- --+_attn_mask_cache = {} --+ --+def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --+ q_len = batch_and_seq[1] --+ kv_len = batch_and_seq[1] + past_key_values_length --+ key = (batch_and_seq[0], q_len, kv_len) --+ --+ if key in _attn_mask_cache: --+ return _attn_mask_cache[key] --+ --+ mask = _prepare_4d_causal_attention_mask( --+ attention_mask, --+ batch_and_seq, --+ inputs_embeds, --+ past_key_values_length, --+ ) --+ _attn_mask_cache[key] = mask --+ return mask -- -- def _get_unpad_data(attention_mask): -- seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): -- return final_output -- -- --- @no_grad() --- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --- expert_cache = ops.zeros_like(x) --- idxs = flat_expert_indices.argsort() --- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --- token_idxs = idxs // self.num_experts_per_tok --- --- for i, end_idx in enumerate(tokens_per_expert): --- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --- if start_idx == end_idx: --- continue --- expert = self.experts[i] --- exp_token_idx = token_idxs[start_idx:end_idx] --- expert_tokens = x[exp_token_idx] --- expert_out = expert(expert_tokens) --- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --- --- return expert_cache --- -- # @no_grad() --- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --- # # expert_cache = torch.zeros_like(x) --- # # idxs = flat_expert_indices.argsort() --- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --- # # token_idxs = idxs // self.num_experts_per_tok --- # # for i, end_idx in enumerate(tokens_per_expert): --- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --- # # if start_idx == end_idx: --- # # continue --- # # expert = self.experts[i] --- # # exp_token_idx = token_idxs[start_idx:end_idx] --- # # expert_tokens = x[exp_token_idx] --- # # expert_out = expert(expert_tokens) --- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --- # # return expert_cache --+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- # expert_cache = ops.zeros_like(x) -- # idxs = flat_expert_indices.argsort() -- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): -- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) -- -- # return expert_cache --- # @no_grad() --- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --- # expert_cache = ops.zeros_like(x) --+ --+ @no_grad() --+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ """ --+ 优化版 MoE prefill: --+ - 批量张量化处理同一个 expert 的所有 token --+ - 跳过无 token 的专家 --+ - 保持结果完全一致 --+ """ --+ # 初始化输出缓存 --+ expert_cache = ops.zeros_like(x) -- --- # # 排序保证顺序一致 --- # idxs = flat_expert_indices.argsort() --- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --- # token_idxs = idxs // self.num_experts_per_tok --+ # 排序(确保 scatter_add 位置对应原逻辑) --+ idxs = flat_expert_indices.argsort() --+ sorted_expert_indices = flat_expert_indices[idxs] --+ sorted_token_indices = idxs // self.num_experts_per_tok -- --- # # 找出有 token 的专家 --- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+ # 每个 expert 的 token 数 --+ tokens_per_expert = sorted_expert_indices.bincount() -- --- # for i in active_experts.tolist(): --- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --- # end_idx = tokens_per_expert[i] --- # if start_idx == end_idx: # 没有 token --- # continue --+ # 找出有 token 的专家 --+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() -- --- # exp_token_idx = token_idxs[start_idx:end_idx] --- # expert_tokens = x[exp_token_idx] --- # expert_out = self.experts[i](expert_tokens) --- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+ for expert_id in active_experts.tolist(): --+ # 取该 expert 对应的排序后 token 区间 --+ start = (tokens_per_expert[:expert_id]).sum().item() --+ end = start + tokens_per_expert[expert_id].item() -- --- # expert_cache = mindspore.mint.scatter_add( --- # expert_cache, --- # 0, --- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --- # expert_out --- # ) --+ token_idx = sorted_token_indices[start:end] # 原 token 位置 --+ expert_tokens = x[token_idx] # 取输入向量 -- --- # return expert_cache --+ # 执行专家 MLP --+ expert_out = self.experts[expert_id](expert_tokens) --+ --+ # 按权重缩放 --+ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --+ --+ # 回写到缓存(等价 scatter_add) --+ expert_cache = mindspore.mint.scatter_add( --+ expert_cache, --+ 0, --+ token_idx.view(-1, 1).tile((1, x.shape[-1])), --+ scaled_out --+ ) --+ --+ return expert_cache --+ --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # # expert_cache = torch.zeros_like(x) --+ # # idxs = flat_expert_indices.argsort() --+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+ # # token_idxs = idxs // self.num_experts_per_tok --+ # # for i, end_idx in enumerate(tokens_per_expert): --+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+ # # if start_idx == end_idx: --+ # # continue --+ # # expert = self.experts[i] --+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+ # # expert_tokens = x[exp_token_idx] --+ # # expert_out = expert(expert_tokens) --+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+ # # return expert_cache --+ # expert_cache = ops.zeros_like(x) --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # for i, end_idx in enumerate(tokens_per_expert): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # if start_idx == end_idx: --+ # continue --+ # expert = self.experts[i] --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = expert(expert_tokens) --+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ --+ # return expert_cache --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+ # expert_cache = ops.zeros_like(x) --+ --+ # # 排序保证顺序一致 --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ # token_idxs = idxs // self.num_experts_per_tok --+ --+ # # 找出有 token 的专家 --+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+ --+ # for i in active_experts.tolist(): --+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ # end_idx = tokens_per_expert[i] --+ # if start_idx == end_idx: # 没有 token --+ # continue --+ --+ # exp_token_idx = token_idxs[start_idx:end_idx] --+ # expert_tokens = x[exp_token_idx] --+ # expert_out = self.experts[i](expert_tokens) --+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+ --+ # expert_cache = mindspore.mint.scatter_add( --+ # expert_cache, --+ # 0, --+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+ # expert_out --+ # ) --+ --+ # return expert_cache -- -- -- --@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): -- -- return attn_output, attn_weights, past_key_value -- --- -- # class DeepseekFlashAttention(nn.Module): -- # """ -- # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): -- -- return attn_output, attn_weights, past_key_value -- --+ -- Deepseek_ATTENTION_CLASSES = { -- "eager": DeepseekAttention, -- "flash-attention": DeepseekFlashAttention, --@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): -- ) -- else: -- # 4d mask is passed through the layers --- attention_mask = _prepare_4d_causal_attention_mask( --+ # attention_mask = _prepare_4d_causal_attention_mask( --+ # attention_mask, --+ # (batch_size, seq_length), --+ # inputs_embeds, --+ # past_key_values_length, --+ # ) --+ #@dwj --+ attention_mask = get_cached_causal_mask( -- attention_mask, -- (batch_size, seq_length), -- inputs_embeds, --@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- # Initialize weights and apply final processing -- self.post_init() -- self.warm_up = False --+ #@dwj --+ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --+ self.num_layers, --+ self.num_attention_heads, --+ self.head_dim, --+ batch_size=1, --+ max_length=self.max_length, --+ dtype=mindspore.float16 --+ ) --+ --+ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --+ key_cache = [] --+ value_cache = [] --+ for _ in range(num_layers): --+ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+ key_cache.append(k) --+ value_cache.append(v) --+ return key_cache, value_cache --+ -- -- def warmup_moe_model_deep(self): -- print("[Warmup] DeepSeek-MoE 模型预热开始...") --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index bced285c..ebd7782e 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) -- _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" -- _CONFIG_FOR_DOC = "Qwen2MoeConfig" -- ---Long_Prompt = False ---PROMPT_LENGTH_THRESHOLD = 128 --+Long_Prompt = 1 --+LONG_PROMPT_LENGTH_THRESHOLD = 128 --+SHORT_PROMPT_LENGTH_THRESHOLD = 32 --+ --+_causal_mask_cache = {} --+ --+def get_cached_causal_mask_with_cache_position( --+ attention_mask: mindspore.Tensor, --+ sequence_length: int, --+ target_length: int, --+ dtype: mindspore.dtype, --+ min_dtype: float, --+ cache_position: mindspore.Tensor, --+ batch_size: int, --+): --+ """ --+ 带缓存的 causal mask 构造函数 --+ """ --+ # q_len 是当前 query 长度 --+ q_len = sequence_length --+ # kv_len 是 target_length --+ kv_len = target_length --+ --+ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --+ key = (batch_size, q_len, kv_len, dtype, min_dtype) --+ --+ if key in _causal_mask_cache: --+ return _causal_mask_cache[key] --+ --+ # 调用原来的 mask 构造逻辑 --+ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+ attention_mask, --+ sequence_length=sequence_length, --+ target_length=target_length, --+ dtype=dtype, --+ min_dtype=min_dtype, --+ cache_position=cache_position, --+ batch_size=batch_size, --+ ) --+ # 缓存结果 --+ _causal_mask_cache[key] = causal_mask --+ return causal_mask -- -- # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position -- def _prepare_4d_causal_attention_mask_with_cache_position( --@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: -- -- -- # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --+# class Qwen2MoeAttention(nn.Module): --+# """ --+# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+# and "Generating Long Sequences with Sparse Transformers". --+# """ --+ --+# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+# super().__init__() --+# self.config = config --+# self.layer_idx = layer_idx --+# if layer_idx is None: --+# logger.warning_once( --+# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+# "when creating this class." --+# ) --+ --+# self.hidden_size = config.hidden_size --+# self.num_heads = config.num_attention_heads --+# self.head_dim = self.hidden_size // self.num_heads --+# self.num_key_value_heads = config.num_key_value_heads --+# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+# self.max_position_embeddings = config.max_position_embeddings --+# self.rope_theta = config.rope_theta --+# self.is_causal = True --+# self.attention_dropout = config.attention_dropout --+ --+# if (self.head_dim * self.num_heads) != self.hidden_size: --+# raise ValueError( --+# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+# f" and `num_heads`: {self.num_heads})." --+# ) --+# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+ --+# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+# self.head_dim, --+# max_position_embeddings=self.max_position_embeddings, --+# base=self.rope_theta, --+# ) --+ --+# def forward( --+# self, --+# hidden_states: mindspore.Tensor, --+# attention_mask: Optional[mindspore.Tensor] = None, --+# position_ids: Optional[mindspore.Tensor] = None, --+# past_key_value: Optional[Cache] = None, --+# output_attentions: bool = False, --+# use_cache: bool = False, --+# cache_position: Optional[mindspore.Tensor] = None, --+# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+ --+ --+ --+# bsz, q_len, _ = hidden_states.shape --+ --+# query_states = self.q_proj(hidden_states) --+# key_states = self.k_proj(hidden_states) --+# value_states = self.v_proj(hidden_states) --+ --+# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+ --+# kv_seq_len = key_states.shape[-2] --+# if past_key_value is not None: --+# if self.layer_idx is None: --+# raise ValueError( --+# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+# "with a layer index." --+# ) --+# if isinstance(past_key_value, StaticCache): --+# kv_seq_len = key_states.shape[-2] --+# else: --+# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+# if past_key_value is not None: --+# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+# if isinstance(past_key_value, StaticCache): --+# kv_seq_len = key_states.shape[-2] --+ --+# # repeat k/v heads if n_kv_heads < n_heads --+# key_states = repeat_kv(key_states, self.num_key_value_groups) --+# value_states = repeat_kv(value_states, self.num_key_value_groups) --+ --+# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+ --+# if attention_mask is not None: --+# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+# attn_weights = attn_weights + causal_mask --+ --+# # upcast attention to fp32 --+# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+# attn_output = ops.matmul(attn_weights, value_states) --+ --+# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+# raise ValueError( --+# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+# f" {attn_output.shape}" --+# ) --+ --+# attn_output = ops.transpose(attn_output, 1, 2) --+# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+ --+# attn_output = self.o_proj(attn_output) --+# # @lwx --+ --+# # max_seq_len = self.max_position_embeddings # 2048 --+ --+# # if attention_mask is not None: --+# # # attention_mask: [B, 1, Sq, Sk] --+# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+ --+# # # pad 到 [max_seq_len, max_seq_len] --+# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+# # global_attention_mask = padded_mask --+# # else: --+# # global_attention_mask = None --+ --+ --+# # sparse_mode=3 --+# # attn_output = mindspore.ops.flash_attention_score( --+# # query=query_states, --+# # key=key_states, --+# # value=value_states, --+# # real_shift=None, --+# # padding_mask=None, --+ --+# # head_num=self.num_heads, --+# # attn_mask=global_attention_mask, --+# # keep_prob=1.0 - self.attention_dropout, --+# # scalar_value=1.0 / math.sqrt(self.head_dim), --+# # input_layout="BNSD", --+# # pre_tokens=2147483647, --+# # next_tokens=2147483647, --+# # inner_precise=0, --+# # drop_mask=None, --+# # prefix=None, --+# # actual_seq_qlen=None, --+# # actual_seq_kvlen=None, --+# # sparse_mode=sparse_mode, --+# # ) --+# if not output_attentions: --+# attn_weights = None --+ --+# return attn_output, attn_weights, past_key_value --+ -- class Qwen2MoeAttention(nn.Module): -- """ --- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --- and "Generating Long Sequences with Sparse Transformers". --- """ --+ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 -- --+ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --+ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --+ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --+ --+ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --+ """ -- def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): -- super().__init__() -- self.config = config --@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): -- if layer_idx is None: -- logger.warning_once( -- f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " -- "when creating this class." -- ) -- --@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): -- use_cache: bool = False, -- cache_position: Optional[mindspore.Tensor] = None, -- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --- -- --- --+ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- -- bsz, q_len, _ = hidden_states.shape -- -- query_states = self.q_proj(hidden_states) -- key_states = self.k_proj(hidden_states) -- value_states = self.v_proj(hidden_states) -- --- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --- --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ -- kv_seq_len = key_states.shape[-2] -- if past_key_value is not None: --- if self.layer_idx is None: --- raise ValueError( --- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --- "with a layer index." --- ) --- if isinstance(past_key_value, StaticCache): --- kv_seq_len = key_states.shape[-2] --- else: --- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ -- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) -- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) -- -- if past_key_value is not None: --- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} -- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+ --+ # --- 2. 动态调度核心注意力计算 --- --+ global Long_Prompt --+ if Long_Prompt >= 1: --+ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --+ fa_attention_mask = None --+ if attention_mask is not None: --+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+ fa_attention_mask = (mask_slice != 0) --+ --+ attn_output = mindspore.ops.flash_attention_score( --+ query=query_states, --+ key=key_states, --+ value=value_states, --+ head_num=self.num_heads, --+ attn_mask=fa_attention_mask, --+ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --+ scalar_value=1.0 / math.sqrt(self.head_dim), --+ input_layout="BNSD", --+ sparse_mode=0, --+ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --+ ) -- --- if isinstance(past_key_value, StaticCache): --- kv_seq_len = key_states.shape[-2] --+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ attn_output = self.o_proj(attn_output) --+ attn_weights = None --+ if output_attentions: --+ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") -- --- # repeat k/v heads if n_kv_heads < n_heads --- key_states = repeat_kv(key_states, self.num_key_value_groups) --- value_states = repeat_kv(value_states, self.num_key_value_groups) --- --- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+ else: --+ # --- Eager Attention 路径 (用于短序列和解码) --- --+ key_states = repeat_kv(key_states, self.num_key_value_groups) --+ value_states = repeat_kv(value_states, self.num_key_value_groups) --+ --+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) -- --- if attention_mask is not None: --- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --- attn_weights = attn_weights + causal_mask --+ if attention_mask is not None: --+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+ attn_weights = attn_weights + causal_mask -- --- # upcast attention to fp32 --- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --- attn_output = ops.matmul(attn_weights, value_states) --+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+ attn_output = ops.matmul(attn_weights, value_states) -- --- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --- raise ValueError( --- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --- f" {attn_output.shape}" --- ) --+ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+ raise ValueError( --+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --+ ) -- --- attn_output = ops.transpose(attn_output, 1, 2) --- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+ attn_output = ops.transpose(attn_output, 1, 2) --+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+ attn_output = self.o_proj(attn_output) -- --- attn_output = self.o_proj(attn_output) --- # @lwx --+ if not output_attentions: --+ attn_weights = None -- --- # max_seq_len = self.max_position_embeddings # 2048 --- --- # if attention_mask is not None: --- # # attention_mask: [B, 1, Sq, Sk] --- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --- --- # # pad 到 [max_seq_len, max_seq_len] --- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --- # global_attention_mask = padded_mask --- # else: --- # global_attention_mask = None --- --- --- # sparse_mode=3 --- # attn_output = mindspore.ops.flash_attention_score( --- # query=query_states, --- # key=key_states, --- # value=value_states, --- # real_shift=None, --- # padding_mask=None, --- --- # head_num=self.num_heads, --- # attn_mask=global_attention_mask, --- # keep_prob=1.0 - self.attention_dropout, --- # scalar_value=1.0 / math.sqrt(self.head_dim), --- # input_layout="BNSD", --- # pre_tokens=2147483647, --- # next_tokens=2147483647, --- # inner_precise=0, --- # drop_mask=None, --- # prefix=None, --- # actual_seq_qlen=None, --- # actual_seq_kvlen=None, --- # sparse_mode=sparse_mode, --- # ) --- if not output_attentions: --- attn_weights = None --- -- return attn_output, attn_weights, past_key_value -- --- -- # class Qwen2MoeFlashAttention(nn.Module): -- # """ -- # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { -- # return final_hidden_states, router_logits -- -- ---# class Qwen2MoeSparseMoeBlock(nn.Module): ---# """ ---# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ---# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 ---# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 ---# `_moe_infer_prefill` (用于长序列处理) 方法。 ---# """ ---# def __init__(self, config: Qwen2MoeConfig): ---# super().__init__() ---# self.num_experts = config.num_experts ---# self.top_k = config.num_experts_per_tok ---# self.norm_topk_prob = config.norm_topk_prob --- ---# # 门控网络 ---# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ---# # 专家列表 ---# self.experts = nn.ModuleList( ---# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ---# ) ---# # 共享专家 ---# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ---# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---# @no_grad() ---# def _moe_infer_decode( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# """ ---# 【解码路径】针对 sequence_length=1 的极致优化。 ---# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 ---# """ ---# batch_size, hidden_dim = hidden_states.shape --- ---# expert_outputs_list = [ ---# ops.cat([ ---# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ---# ], dim=0) ---# for i in range(batch_size) ---# ] --- ---# # --- 错误修复:将 axis=0 修改为 dim=0 --- ---# # shape: (batch_size, top_k, hidden_dim) ---# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --- ---# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 ---# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --- ---# return moe_output.squeeze(1) --- ---# @no_grad() ---# def _moe_infer_prefill( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# """ ---# 【预填充路径】针对 sequence_length > 1 的优化。 ---# 按专家对 Token 进行分组,并进行批处理。 ---# """ ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens = hidden_states.shape[0] ---# flat_selected_experts = selected_experts.flatten() --- ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --- ---# active_experts = ops.unique(flat_selected_experts) --- ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] --- ---# mask = (flat_selected_experts == expert_idx_tensor) ---# selected_token_indices = token_indices[mask] ---# selected_routing_weights = routing_weights.flatten()[mask] --- ---# current_states = hidden_states[selected_token_indices] --- ---# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --- ---# moe_output = moe_output.index_add( ---# dim=0, ---# index=selected_token_indices, ---# source=expert_output.to(hidden_states.dtype) ---# ) ---# return moe_output --- ---# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---# """ ---# 顶层 forward 方法,作为智能分发器。 ---# """ ---# batch_size, sequence_length, hidden_dim = hidden_states.shape --- ---# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---# router_logits = self.gate(hidden_states_reshaped) ---# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- ---# if self.norm_topk_prob: ---# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- ---# routing_weights = routing_weights.to(hidden_states.dtype) --- ---# moe_output = None ---# # 在推理时,根据序列长度选择最优路径 ---# if not self.training: ---# if sequence_length == 1: ---# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ---# else: ---# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ---# else: ---# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 ---# raise NotImplementedError("Training path is not implemented.") --- ---# shared_expert_output = self.shared_expert(hidden_states_reshaped) ---# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) ---# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --- ---# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --- ---# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --- ---# return final_hidden_states, router_logits --- --- ---# class Qwen2MoeSparseMoeBlock(nn.Module): ---# """ ---# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ---# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 ---# """ ---# def __init__(self, config: Qwen2MoeConfig): ---# super().__init__() ---# self.num_experts = config.num_experts ---# self.top_k = config.num_experts_per_tok ---# self.norm_topk_prob = config.norm_topk_prob --- ---# # 门控网络 ---# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ---# # 专家列表 ---# self.experts = nn.ModuleList( ---# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ---# ) ---# # 共享专家 ---# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ---# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---# @no_grad() ---# def _moe_infer_decode( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# batch_size, _ = hidden_states.shape ---# expert_outputs_list = [ ---# ops.cat([ ---# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ---# ], dim=0) ---# for i in range(batch_size) ---# ] ---# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ---# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) ---# return moe_output.squeeze(1) --- ---# @no_grad() ---# def _moe_infer_prefill( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens = hidden_states.shape[0] ---# flat_selected_experts = selected_experts.flatten() ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ---# active_experts = ops.unique(flat_selected_experts) --- ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] ---# mask = (flat_selected_experts == expert_idx_tensor) ---# selected_token_indices = token_indices[mask] ---# selected_routing_weights = routing_weights.flatten()[mask] ---# current_states = hidden_states[selected_token_indices] ---# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ---# moe_output = moe_output.index_add( ---# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ---# ) ---# return moe_output --- ---# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---# """ ---# 顶层 forward 方法,作为智能分发器。 ---# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 ---# """ ---# batch_size, sequence_length, hidden_dim = hidden_states.shape --- ---# # 1. 门控计算 (通用逻辑) ---# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---# router_logits = self.gate(hidden_states_reshaped) ---# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- ---# if self.norm_topk_prob: ---# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- ---# routing_weights = routing_weights.to(hidden_states.dtype) --- ---# # 2. 智能分发到最优 MoE 路径 ---# moe_output = None ---# if not self.training: ---# if sequence_length == 1: ---# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) ---# else: ---# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) ---# else: ---# raise NotImplementedError("Training path is not implemented.") --- ---# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 ---# # 共享专家和它的门控网络,都作用于 reshape 后的张量 ---# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ---# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --- ---# # 4. 合并 MoE 输出和共享专家输出 ---# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 ---# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --- ---# # 5. 恢复原始形状并返回 ---# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --- ---# return final_hidden_states, router_logits --- ---# prefill fastest ---# class Qwen2MoeSparseMoeBlock(nn.Module): ---# """ ---# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ---# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), ---# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 ---# """ ---# def __init__(self, config: Qwen2MoeConfig): ---# super().__init__() ---# self.num_experts = config.num_experts ---# self.top_k = config.num_experts_per_tok ---# self.norm_topk_prob = config.norm_topk_prob --- ---# # 门控网络 ---# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ---# # 专家列表 ---# self.experts = nn.ModuleList( ---# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ---# ) ---# # 共享专家 ---# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ---# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---# @no_grad() ---# def _moe_infer_dispatch( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# """ ---# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 ---# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 ---# """ ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens, _ = hidden_states.shape --- ---# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 ---# flat_selected_experts = selected_experts.flatten() ---# flat_routing_weights = routing_weights.flatten() --- ---# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --- ---# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) ---# active_experts = ops.unique(flat_selected_experts) --- ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] --- ---# # 找到所有分配给该专家的 token ---# mask = (flat_selected_experts == expert_idx_tensor) --- ---# # 使用 mask 选取对应的 token 和权重 ---# current_token_indices = token_indices[mask] ---# current_routing_weights = flat_routing_weights[mask] ---# current_hidden_states = hidden_states[current_token_indices] --- ---# # 对这些 token 进行批处理 ---# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --- ---# # 使用 index_add 将结果精确地加回到对应位置 ---# moe_output = moe_output.index_add( ---# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) ---# ) ---# return moe_output --- ---# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---# """ ---# 顶层 forward 方法,作为智能分发器。 ---# """ ---# batch_size, sequence_length, hidden_dim = hidden_states.shape --- ---# # 1. 门控计算 ---# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---# router_logits = self.gate(hidden_states_reshaped) ---# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- ---# if self.norm_topk_prob: ---# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- ---# routing_weights = routing_weights.to(hidden_states.dtype) --- ---# # 2. 调用统一的 MoE 计算内核 ---# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 ---# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --- ---# # 3. 统一处理共享专家 ---# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ---# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --- ---# # 4. 合并输出 ---# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --- ---# # 5. 恢复原始形状并返回 ---# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --- ---# return final_hidden_states, router_logits --- --- ---# class Qwen2MoeSparseMoeBlock(nn.Module): ---# """ ---# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 ---# 【最终高性能与高精度版】: ---# 1. 解码路径使用 bmm 算子以达到最大推理速度。 ---# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 ---# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 ---# 3. 这样实现了速度和准确性的两全其美。 ---# """ ---# def __init__(self, config: Qwen2MoeConfig): ---# super().__init__() ---# self.num_experts = config.num_experts ---# self.top_k = config.num_experts_per_tok ---# self.norm_topk_prob = config.norm_topk_prob --- ---# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ---# self.experts = nn.ModuleList( ---# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ---# ) ---# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ---# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---# @no_grad() ---# def _moe_infer_decode( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# """ ---# 【解码路径】极致优化版:bmm + 高精度累加。 ---# """ ---# original_dtype = hidden_states.dtype ---# batch_size, _ = hidden_states.shape --- ---# expert_outputs_list = [ ---# ops.cat([ ---# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ---# ], dim=0) ---# for i in range(batch_size) ---# ] ---# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --- ---# # 在 float32 下执行 bmm,得到高精度结果 ---# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --- ---# # 将高精度结果转换回原始数据类型 ---# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --- ---# return moe_output --- ---# @no_grad() ---# def _moe_infer_prefill( ---# self, ---# hidden_states: mindspore.Tensor, ---# selected_experts: mindspore.Tensor, ---# routing_weights: mindspore.Tensor ---# ) -> mindspore.Tensor: ---# """ ---# 【预填充路径】与原始实现一致,结果精确。 ---# """ ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens, _ = hidden_states.shape ---# flat_selected_experts = selected_experts.flatten() ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ---# active_experts = ops.unique(flat_selected_experts) --- ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] ---# mask = (flat_selected_experts == expert_idx_tensor) ---# selected_token_indices = token_indices[mask] ---# selected_routing_weights = routing_weights.flatten()[mask] ---# current_states = hidden_states[selected_token_indices] ---# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ---# moe_output = moe_output.index_add( ---# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) ---# ) ---# return moe_output --- ---# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---# batch_size, sequence_length, hidden_dim = hidden_states.shape --- ---# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---# router_logits = self.gate(hidden_states_reshaped) ---# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --- ---# if self.norm_topk_prob: ---# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- ---# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 ---# # 如果模型主体是 float16,后续再转换 --- ---# moe_output = None ---# if not self.training: ---# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 ---# # _moe_infer_decode 内部会处理好类型转换 ---# temp_routing_weights = routing_weights.to(hidden_states.dtype) ---# if sequence_length == 1: ---# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) ---# else: ---# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) ---# else: ---# raise NotImplementedError("Training path is not implemented.") --- ---# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ---# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --- ---# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ---# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --- ---# return final_hidden_states, router_logits --- --- ---# class Qwen2MoeSparseMoeBlock(nn.Module): ---# """ ---# 【融合版】一个混合专家模块,内置两种推理策略, ---# 由外部全局变量 `Long_Prompt` 控制: --- ---# - if Long_Prompt is True: 【精度优先模式】 ---# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 ---# 适用于处理长序列,避免误差累积。 --- ---# - if Long_Prompt is False: 【速度优先模式】 ---# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, ---# 在解码阶段获得极致速度,同时保证结果高度准确。 ---# """ ---# def __init__(self, config: Qwen2MoeConfig): ---# super().__init__() ---# self.num_experts = config.num_experts ---# self.top_k = config.num_experts_per_tok ---# self.norm_topk_prob = config.norm_topk_prob --- ---# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) ---# self.experts = nn.ModuleList( ---# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] ---# ) ---# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) ---# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---# # --- 速度优先模式的辅助函数 --- ---# @no_grad() ---# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ---# original_dtype = hidden_states.dtype ---# batch_size, _ = hidden_states.shape ---# expert_outputs_list = [ ---# ops.cat([ ---# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] ---# ], dim=0) ---# for i in range(batch_size) ---# ] ---# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) ---# weights_fp32 = routing_weights.to(mindspore.float32) ---# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) ---# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) ---# return moe_output_fp32.squeeze(1).to(original_dtype) --- ---# @no_grad() ---# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens, _ = hidden_states.shape ---# flat_selected_experts = selected_experts.flatten() ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ---# active_experts = ops.unique(flat_selected_experts) ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] ---# mask = (flat_selected_experts == expert_idx_tensor) ---# selected_token_indices = token_indices[mask] ---# selected_routing_weights = routing_weights.flatten()[mask] ---# current_states = hidden_states[selected_token_indices] ---# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ---# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) ---# return moe_output --- ---# # --- 精度优先模式的辅助函数 --- ---# @no_grad() ---# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: ---# moe_output = ops.zeros_like(hidden_states) ---# num_tokens, _ = hidden_states.shape ---# flat_selected_experts = selected_experts.flatten() ---# flat_routing_weights = routing_weights.flatten() ---# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() ---# active_experts = ops.unique(flat_selected_experts) ---# for expert_idx_tensor in active_experts: ---# expert_idx = expert_idx_tensor.item() ---# expert_layer = self.experts[expert_idx] ---# mask = (flat_selected_experts == expert_idx_tensor) ---# current_token_indices = token_indices[mask] ---# current_routing_weights = flat_routing_weights[mask] ---# current_hidden_states = hidden_states[current_token_indices] ---# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) ---# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) ---# return moe_output --- ---# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---# # 声明我们将要使用一个在模块外部定义的全局变量 ---# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 ---# global Long_Prompt --- ---# # 1. 门控计算 (所有模式通用) ---# batch_size, sequence_length, hidden_dim = hidden_states.shape ---# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---# router_logits = self.gate(hidden_states_reshaped) ---# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) ---# if self.norm_topk_prob: ---# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --- ---# moe_output = None ---# if not self.training: ---# # 根据 Long_Prompt 标志选择模式 ---# if Long_Prompt: ---# # --- 精度优先模式 --- ---# routing_weights_casted = routing_weights.to(hidden_states.dtype) ---# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) ---# else: ---# # --- 速度优先模式 --- ---# routing_weights_casted = routing_weights.to(hidden_states.dtype) ---# if sequence_length == 1: ---# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) ---# else: ---# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) ---# else: ---# raise NotImplementedError("Training path is not implemented.") --- ---# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ ---# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --- ---# final_hidden_states_reshaped = moe_output + gated_shared_expert_output ---# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --- ---# return final_hidden_states, router_logits --- -- class Qwen2MoeSparseMoeBlock(nn.Module): -- """ -- 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -- return moe_output_fp32.squeeze(1).to(original_dtype) -- --+ # @no_grad() --+ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ # num_tokens, _ = hidden_states.shape --+ # flat_selected_experts = selected_experts.flatten() --+ # sorted_expert_indices = flat_selected_experts.argsort() --+ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+ # original_token_indices = sorted_expert_indices // self.top_k --+ # moe_output = ops.zeros_like(hidden_states) --+ # current_token_offset = 0 --+ # for i in range(self.num_experts): --+ # expert_token_count = tokens_per_expert[i] - current_token_offset --+ # if expert_token_count == 0: --+ # continue --+ # end_offset = current_token_offset + expert_token_count --+ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+ # expert_hidden_states = hidden_states[expert_original_token_indices] --+ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+ # current_token_offset += expert_token_count --+ # return moe_output --+ -- @no_grad() -- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --- num_tokens, _ = hidden_states.shape --- flat_selected_experts = selected_experts.flatten() --- sorted_expert_indices = flat_selected_experts.argsort() --- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --- original_token_indices = sorted_expert_indices // self.top_k --+ """ --+ 优化版 MoE prefill (速度优先模式): --+ - 批量张量化处理同一个 expert 的所有 token --+ - 跳过无 token 的专家 --+ - 保持结果完全一致 --+ """ -- moe_output = ops.zeros_like(hidden_states) --- current_token_offset = 0 --- for i in range(self.num_experts): --- expert_token_count = tokens_per_expert[i] - current_token_offset --- if expert_token_count == 0: --- continue --- end_offset = current_token_offset + expert_token_count --- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --- expert_hidden_states = hidden_states[expert_original_token_indices] --- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --- current_token_offset += expert_token_count --+ --+ flat_selected_experts = selected_experts.flatten() --+ flat_routing_weights = routing_weights.flatten() --+ --+ idxs = flat_selected_experts.argsort() --+ sorted_expert_indices = flat_selected_experts[idxs] --+ sorted_token_indices = idxs // self.top_k --+ --+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --+ --+ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+ --+ for expert_id in active_experts.tolist(): --+ start = int(tokens_per_expert[:expert_id].sum().item()) --+ end = start + int(tokens_per_expert[expert_id].item()) --+ --+ token_idx = sorted_token_indices[start:end] --+ expert_tokens = hidden_states[token_idx] --+ --+ expert_out = self.experts[expert_id](expert_tokens) --+ --+ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --+ --+ moe_output = mindspore.mint.scatter_add( --+ moe_output, --+ 0, --+ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --+ scaled_out.to(hidden_states.dtype) --+ ) --+ -- return moe_output -- --+ -- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- -- @no_grad() -- def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) -- -- moe_output = None --- if Long_Prompt: --- # --- 精度优先模式 (ACCURACY MODE) --- --- routing_weights_casted = routing_weights.to(hidden_states.dtype) --- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # if Long_Prompt==0: --+ # # --- 精度优先模式 (ACCURACY MODE) --- --+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # else: --+ # # --- 速度优先模式 (SPEED MODE) --- --+ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+ # if sequence_length == 1: --+ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # else: --+ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ --+ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+ if sequence_length == 1: --+ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) -- else: --- # --- 速度优先模式 (SPEED MODE) --- --- routing_weights_casted = routing_weights.to(hidden_states.dtype) --- if sequence_length == 1: --- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --- else: --- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --- --+ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ -- -- # 3. 共享专家计算与合并 (所有模式通用) -- gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- -- return final_hidden_states, router_logits -- --+ -- class Qwen2MoeDecoderLayer(nn.Module): -- def __init__(self, config: Qwen2MoeConfig, layer_idx: int): -- super().__init__() -- self.hidden_size = config.hidden_size -- --- # if Long_Prompt: --- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --- # else: --+ # if Long_Prompt == 2: -- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+ # else: --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- -- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) -- --@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): -- ) -- -- # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+ # attention_mask, --+ # sequence_length=sequence_length, --+ # target_length=target_length, --+ # dtype=dtype, --+ # min_dtype=min_dtype, --+ # cache_position=cache_position, --+ # batch_size=input_tensor.shape[0], --+ # ) --+ #@dwj --+ causal_mask = get_cached_causal_mask_with_cache_position( -- attention_mask, -- sequence_length=sequence_length, -- target_length=target_length, --@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 -- 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 -- """ --- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --+ _causal_mask_cache.clear() -- -- input_ids = kwargs.get("input_ids") -- if input_ids is None and args: --@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- -- if input_ids is not None: -- prompt_length = input_ids.shape[1] --- --- if prompt_length > PROMPT_LENGTH_THRESHOLD: --- Long_Prompt = True --+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --+ Long_Prompt = 2 --+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --+ Long_Prompt = 0 -- else: --- Long_Prompt = False --+ Long_Prompt = 1 --+ -- -- return super().generate(*args, **kwargs) -- --@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): -- dtype = self.lm_head.weight.dtype -- min_dtype = float(ops.finfo(dtype).min) -- --- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+ # attention_mask, --+ # sequence_length=sequence_length, --+ # target_length=past_key_values.get_max_length(), --+ # dtype=dtype, --+ # min_dtype=min_dtype, --+ # cache_position=cache_position, --+ # batch_size=batch_size, --+ # ) --+ --+ #@dwj --+ attention_mask = get_cached_causal_mask_with_cache_position( -- attention_mask, -- sequence_length=sequence_length, -- target_length=past_key_values.get_max_length(), --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --deleted file mode 100644 --index 6dfb5b93..00000000 ----- a/patches/0001-20251104commit.patch --+++ /dev/null --@@ -1,1272 +0,0 @@ ---From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 ---From: Pinoeer-kingxi <13022943007@163.com> ---Date: Tue, 4 Nov 2025 09:11:51 +0800 ---Subject: [PATCH] 20251104commit --- ------ --- mindnlp/transformers/cache_utils.py | 28 +- --- .../models/deepseek/modeling_deepseek.py | 149 ++- --- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --- 3 files changed, 976 insertions(+), 87 deletions(-) --- ---diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py ---index cadd2e04..02f8d4be 100644 ------ a/mindnlp/transformers/cache_utils.py ---+++ b/mindnlp/transformers/cache_utils.py ---@@ -812,14 +812,26 @@ class StaticCache(Cache): --- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --- # k_out[:, :, cache_position] = key_states --- # v_out[:, :, cache_position] = value_states ---- if ON_ORANGE_PI: ---- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ---- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ---- else: ---- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ---- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ---- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ---- ---+ # if ON_ORANGE_PI: ---+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) ---+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) ---+ # else: ---+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy ---+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) ---+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) ---+ # 确保 cache_position 是 1D tensor 并且类型正确 ---+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] ---+ if cache_position.ndim > 1: ---+ cache_position = cache_position.flatten() ---+ # 确保类型是 int32 或 int64(MindSpore 要求) ---+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): ---+ cache_position = cache_position.int() ---+ ---+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) ---+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 ---+ k_out[:, :, cache_position] = key_states ---+ v_out[:, :, cache_position] = value_states ---+ --- return k_out, v_out --- --- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: ---diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ---index c695b944..d8303e45 100644 ------ a/mindnlp/transformers/models/deepseek/modeling_deepseek.py ---+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py ---@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --- # Copied from transformers.models.llama.modeling_llama.rotate_half --- def rotate_half(x): --- """Rotates half the hidden dims of the input.""" ---- x1 = x[..., : x.shape[-1] // 2] ---- x2 = x[..., x.shape[-1] // 2 :] ---+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ---+ # x1 = x[..., : x.shape[-1] // 2] ---+ # x2 = x[..., x.shape[-1] // 2 :] ---+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --- return ops.cat((-x2, x1), dim=-1) --- --- ---@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --- if self.training: --- raise NotImplementedError("Training is not supported yet.") --- else: ---- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ---- if self.config.n_shared_experts is not None: ---- y = y + self.shared_experts(identity) ---- return y ---+ # @lwx ---+ if orig_shape[1] == 1: ---+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) ---+ y=y.view(*orig_shape) ---+ if self.config.n_shared_experts is not None: ---+ y = y + self.shared_experts(identity) ---+ return y ---+ else: ---+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) ---+ if self.config.n_shared_experts is not None: ---+ y = y + self.shared_experts(identity) ---+ return y ---+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) ---+ # if self.config.n_shared_experts is not None: ---+ # y = y + self.shared_experts(identity) ---+ # return y ---+ ---+ @no_grad() ---+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): ---+ ---+ expert_cache = ops.zeros_like(x) ---+ for i in range(self.num_experts_per_tok): ---+ expert_id = flat_expert_indices[i].item() ---+ weight = flat_expert_weights[i].item() ---+ expert = self.experts[expert_id] ---+ expert_out = expert(x) ---+ expert_cache += expert_out * weight ---+ return expert_cache --- --- @no_grad() ---- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ---- # expert_cache = torch.zeros_like(x) ---- # idxs = flat_expert_indices.argsort() ---- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ---- # token_idxs = idxs // self.num_experts_per_tok ---- # for i, end_idx in enumerate(tokens_per_expert): ---- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ---- # if start_idx == end_idx: ---- # continue ---- # expert = self.experts[i] ---- # exp_token_idx = token_idxs[start_idx:end_idx] ---- # expert_tokens = x[exp_token_idx] ---- # expert_out = expert(expert_tokens) ---- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ---- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ---- # return expert_cache ---+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --- expert_cache = ops.zeros_like(x) --- idxs = flat_expert_indices.argsort() --- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --- token_idxs = idxs // self.num_experts_per_tok ---+ --- for i, end_idx in enumerate(tokens_per_expert): --- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --- if start_idx == end_idx: ---@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --- expert_out = expert(expert_tokens) --- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ---+ --- return expert_cache ---+ ---+ # @no_grad() ---+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ---+ # # expert_cache = torch.zeros_like(x) ---+ # # idxs = flat_expert_indices.argsort() ---+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) ---+ # # token_idxs = idxs // self.num_experts_per_tok ---+ # # for i, end_idx in enumerate(tokens_per_expert): ---+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] ---+ # # if start_idx == end_idx: ---+ # # continue ---+ # # expert = self.experts[i] ---+ # # exp_token_idx = token_idxs[start_idx:end_idx] ---+ # # expert_tokens = x[exp_token_idx] ---+ # # expert_out = expert(expert_tokens) ---+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) ---+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') ---+ # # return expert_cache ---+ # expert_cache = ops.zeros_like(x) ---+ # idxs = flat_expert_indices.argsort() ---+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ---+ # token_idxs = idxs // self.num_experts_per_tok ---+ ---+ # for i, end_idx in enumerate(tokens_per_expert): ---+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ---+ # if start_idx == end_idx: ---+ # continue ---+ # expert = self.experts[i] ---+ # exp_token_idx = token_idxs[start_idx:end_idx] ---+ # expert_tokens = x[exp_token_idx] ---+ # expert_out = expert(expert_tokens) ---+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) ---+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) ---+ ---+ # return expert_cache ---+ # @no_grad() ---+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): ---+ # expert_cache = ops.zeros_like(x) ---+ ---+ # # 排序保证顺序一致 ---+ # idxs = flat_expert_indices.argsort() ---+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) ---+ # token_idxs = idxs // self.num_experts_per_tok ---+ ---+ # # 找出有 token 的专家 ---+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) ---+ ---+ # for i in active_experts.tolist(): ---+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] ---+ # end_idx = tokens_per_expert[i] ---+ # if start_idx == end_idx: # 没有 token ---+ # continue ---+ ---+ # exp_token_idx = token_idxs[start_idx:end_idx] ---+ # expert_tokens = x[exp_token_idx] ---+ # expert_out = self.experts[i](expert_tokens) ---+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] ---+ ---+ # expert_cache = mindspore.mint.scatter_add( ---+ # expert_cache, ---+ # 0, ---+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), ---+ # expert_out ---+ # ) ---+ ---+ # return expert_cache ---+ ---+ --- --- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --- # """ ---@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --- --- # Initialize weights and apply final processing --- self.post_init() ---+ self.warm_up = False ---+ ---+ def warmup_moe_model_deep(self): ---+ print("[Warmup] DeepSeek-MoE 模型预热开始...") ---+ test_texts = [ ---+ "warmup short", ---+ "This is a medium length warmup sentence for MoE experts. middle middle middle", ---+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" ---+ ] ---+ tokenizer = getattr(self, "_warmup_tokenizer", None) ---+ if tokenizer is None: ---+ from mindnlp.transformers import AutoTokenizer ---+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ---+ self._warmup_tokenizer = tokenizer ---+ ---+ for text in test_texts: ---+ inputs = tokenizer(text, return_tensors="ms") ---+ with mindspore._no_grad(): ---+ _ = self(**inputs, use_cache=False) ---+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --- --- def get_input_embeddings(self): --- return self.model.embed_tokens ---@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --- ```""" ---+ if not self.warm_up: ---+ self.warm_up = True ---+ self.warmup_moe_model_deep() ---+ --- output_attentions = ( --- output_attentions --- if output_attentions is not None ---diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ---index 3cbf820e..d4c6b651 100644 ------ a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ---+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py ---@@ -18,7 +18,6 @@ --- # See the License for the specific language governing permissions and --- # limitations under the License. --- """MindSpore Qwen2MoE model.""" ---- --- import math --- from typing import List, Optional, Tuple, Union --- ---@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --- TokenClassifierOutput, --- ) --- from ...modeling_utils import PreTrainedModel ---+from ...generation import GenerationMixin --- from ....utils import logging --- from .configuration_qwen2_moe import Qwen2MoeConfig --- ---@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --- self.variance_epsilon = eps --- --- def forward(self, hidden_states): ---+ # @dwj ---+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) ---+ # @lwx ---+ # if not self.training : ---+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --- input_dtype = hidden_states.dtype --- hidden_states = hidden_states.to(mindspore.float32) --- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) ---@@ -234,6 +239,8 @@ def rotate_half(x): --- """Rotates half the hidden dims of the input.""" --- x1 = x[..., : x.shape[-1] // 2] --- x2 = x[..., x.shape[-1] // 2 :] ---+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] ---+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --- return ops.cat((-x2, x1), dim=-1) --- --- ---@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --- self.config = config --- self.hidden_size = config.hidden_size --- self.intermediate_size = intermediate_size ---+ --- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --- self.act_fn = ACT2FN[config.hidden_act] --- --- def forward(self, x): ---- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ---- --- ---+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) ---+ # @lwx ---+ # gate_up_output = self.gate_up_proj(x) ---+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) ---+ # return self.down_proj(swiglu_output) ---+ ---+ # def forward(self, x): ---+ # gate_proj_out = self.gate_proj(x) ---+ # up_proj_out = self.up_proj(x) ---+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) ---+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) ---+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out ---+ # return self.down_proj(swiglu_out) ---+ --- # Copied from transformers.models.llama.modeling_llama.repeat_kv --- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --- """ ---@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --- use_cache: bool = False, --- cache_position: Optional[mindspore.Tensor] = None, --- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ---+ ---+ ---+ --- bsz, q_len, _ = hidden_states.shape --- --- query_states = self.q_proj(hidden_states) ---@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --- "with a layer index." --- ) ---- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ---+ if isinstance(past_key_value, StaticCache): ---+ kv_seq_len = key_states.shape[-2] ---+ else: ---+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --- --- if past_key_value is not None: --- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) ---+ ---+ if isinstance(past_key_value, StaticCache): ---+ kv_seq_len = key_states.shape[-2] --- --- # repeat k/v heads if n_kv_heads < n_heads --- key_states = repeat_kv(key_states, self.num_key_value_groups) --- value_states = repeat_kv(value_states, self.num_key_value_groups) ---- ---+ --- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --- ---- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): ---- raise ValueError( ---- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" ---- f" {attn_weights.shape}" ---- ) ---- ---- if attention_mask is not None: # no matter the length, we just slice it ---- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] ---+ if attention_mask is not None: ---+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --- attn_weights = attn_weights + causal_mask --- --- # upcast attention to fp32 ---@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --- --- attn_output = self.o_proj(attn_output) ---- ---+ # @lwx ---+ ---+ # max_seq_len = self.max_position_embeddings # 2048 ---+ ---+ # if attention_mask is not None: ---+ # # attention_mask: [B, 1, Sq, Sk] ---+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask ---+ ---+ # # pad 到 [max_seq_len, max_seq_len] ---+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 ---+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) ---+ # global_attention_mask = padded_mask ---+ # else: ---+ # global_attention_mask = None ---+ ---+ ---+ # sparse_mode=3 ---+ # attn_output = mindspore.ops.flash_attention_score( ---+ # query=query_states, ---+ # key=key_states, ---+ # value=value_states, ---+ # real_shift=None, ---+ # padding_mask=None, ---+ ---+ # head_num=self.num_heads, ---+ # attn_mask=global_attention_mask, ---+ # keep_prob=1.0 - self.attention_dropout, ---+ # scalar_value=1.0 / math.sqrt(self.head_dim), ---+ # input_layout="BNSD", ---+ # pre_tokens=2147483647, ---+ # next_tokens=2147483647, ---+ # inner_precise=0, ---+ # drop_mask=None, ---+ # prefix=None, ---+ # actual_seq_qlen=None, ---+ # actual_seq_kvlen=None, ---+ # sparse_mode=sparse_mode, ---+ # ) --- if not output_attentions: --- attn_weights = None --- --- return attn_output, attn_weights, past_key_value --- --- ---+class Qwen2MoeFlashAttention(nn.Module): ---+ """ ---+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 ---+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 ---+ ---+ 关键改动: ---+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), ---+ 直接传入原始的 key 和 value 张量效率更高。 ---+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 ---+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 ---+ """ ---+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): ---+ super().__init__() ---+ self.config = config ---+ self.layer_idx = layer_idx ---+ self.hidden_size = config.hidden_size ---+ self.num_heads = config.num_attention_heads ---+ self.head_dim = self.hidden_size // self.num_heads ---+ self.num_key_value_heads = config.num_key_value_heads ---+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads ---+ self.max_position_embeddings = config.max_position_embeddings ---+ self.rope_theta = config.rope_theta ---+ self.attention_dropout = config.attention_dropout ---+ ---+ if (self.head_dim * self.num_heads) != self.hidden_size: ---+ raise ValueError( ---+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" ---+ ) ---+ ---+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) ---+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ---+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) ---+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) ---+ ---+ self.rotary_emb = Qwen2MoeRotaryEmbedding( ---+ self.head_dim, ---+ max_position_embeddings=self.max_position_embeddings, ---+ base=self.rope_theta, ---+ ) ---+ ---+ def forward( ---+ self, ---+ hidden_states: mindspore.Tensor, ---+ attention_mask: Optional[mindspore.Tensor] = None, ---+ position_ids: Optional[mindspore.Tensor] = None, ---+ past_key_value: Optional[Cache] = None, ---+ output_attentions: bool = False, ---+ use_cache: bool = False, ---+ cache_position: Optional[mindspore.Tensor] = None, ---+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ---+ ---+ bsz, q_len, _ = hidden_states.shape ---+ ---+ # 1. 线性投射 Q, K, V ---+ query_states = self.q_proj(hidden_states) ---+ key_states = self.k_proj(hidden_states) ---+ value_states = self.v_proj(hidden_states) ---+ ---+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ---+ # query: [B, S, H*D] -> [B, N1, S, D] ---+ # key/val: [B, S, H2*D] -> [B, N2, S, D] ---+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ ---+ # 3. RoPE 旋转位置编码 ---+ kv_seq_len = key_states.shape[-2] ---+ if past_key_value is not None: ---+ if self.layer_idx is None: ---+ raise ValueError( ---+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ---+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ---+ "with a layer index." ---+ ) ---+ # 对于 StaticCache,需要特殊处理 kv_seq_len ---+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 ---+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ---+ # 使用 cache_position 的长度来确定实际的 kv_seq_len ---+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n ---+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) ---+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 ---+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 ---+ # 临时解决方案:使用 cache_position 的最大值(如果可能) ---+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens ---+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 ---+ if cache_position.shape[0] == 1: ---+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 ---+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) ---+ kv_seq_len = past_seen_tokens + 1 ---+ else: ---+ # prefill 阶段:cache_position 是范围,使用其长度 ---+ kv_seq_len = cache_position.shape[0] + past_seen_tokens ---+ else: ---+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ---+ ---+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ---+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ---+ ---+ # 4. KV 缓存更新 ---+ if past_key_value is not None: ---+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ---+ key_states, value_states = past_key_value.update( ---+ key_states, value_states, self.layer_idx, cache_kwargs ---+ ) ---+ ---+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 ---+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) ---+ if isinstance(past_key_value, StaticCache) and cache_position is not None: ---+ if cache_position.shape[0] == 1: ---+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) ---+ kv_seq_len = key_states.shape[-2] ---+ ---+ # 5. [重要] 准备 Attention Mask ---+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) ---+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 ---+ fa_attention_mask = None ---+ if attention_mask is not None: ---+ # 截取与当前key长度匹配的部分 ---+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) ---+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) ---+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ---+ # 转换为布尔类型: 大负数 -> True, 0 -> False ---+ fa_attention_mask = (mask_slice != 0) ---+ ---+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 ---+ input_dtype = query_states.dtype ---+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): ---+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 ---+ query_states = query_states.to(mindspore.float16) ---+ key_states = key_states.to(mindspore.float16) ---+ value_states = value_states.to(mindspore.float16) ---+ ---+ # 6. [核心] 调用 flash_attention_score 算子 ---+ # - 无需手动 repeat_kv, 算子原生支持 GQA ---+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] ---+ attn_output = mindspore.ops.flash_attention_score( ---+ query=query_states, ---+ key=key_states, ---+ value=value_states, ---+ head_num=self.num_heads, # 传入Q的头数(N1) ---+ attn_mask=fa_attention_mask, ---+ keep_prob=1.0 - self.attention_dropout, ---+ scalar_value=1.0 / math.sqrt(self.head_dim), ---+ input_layout="BNSD", ---+ sparse_mode=0 # 使用 defaultMask 模式 ---+ ) ---+ ---+ # 恢复原始数据类型 ---+ attn_output = attn_output.to(input_dtype) ---+ ---+ # 7. 调整输出形状 ---+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] ---+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ---+ attn_output = self.o_proj(attn_output) ---+ ---+ # FlashAttention 算子不直接返回注意力权重矩阵 ---+ attn_weights = None ---+ if output_attentions: ---+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ---+ ---+ return attn_output, attn_weights, past_key_value ---+ ---+ # def forward( ---+ # self, ---+ # hidden_states: mindspore.Tensor, ---+ # attention_mask: Optional[mindspore.Tensor] = None, ---+ # position_ids: Optional[mindspore.Tensor] = None, ---+ # past_key_value: Optional[Cache] = None, ---+ # output_attentions: bool = False, ---+ # use_cache: bool = False, ---+ # cache_position: Optional[mindspore.Tensor] = None, ---+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ---+ ---+ # bsz, q_len, _ = hidden_states.shape ---+ ---+ # # 1. 线性投射 Q, K, V ---+ # query_states = self.q_proj(hidden_states) ---+ # key_states = self.k_proj(hidden_states) ---+ # value_states = self.v_proj(hidden_states) ---+ ---+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 ---+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ ---+ # # 3. RoPE 旋转位置编码 ---+ # kv_seq_len = key_states.shape[-2] ---+ # if past_key_value is not None: ---+ # if self.layer_idx is None: ---+ # raise ValueError( ---+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ---+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ---+ # "with a layer index." ---+ # ) ---+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ---+ ---+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ---+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ---+ ---+ # # 4. KV 缓存更新 ---+ # if past_key_value is not None: ---+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ---+ # key_states, value_states = past_key_value.update( ---+ # key_states, value_states, self.layer_idx, cache_kwargs ---+ # ) ---+ ---+ # # 5. 准备 Attention Mask ---+ # fa_attention_mask = None ---+ # if attention_mask is not None: ---+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ---+ # fa_attention_mask = (mask_slice != 0) ---+ ---+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- ---+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 ---+ # input_dtype = query_states.dtype ---+ ---+ # # 6. [核心] 调用 flash_attention_score 算子 ---+ # attn_output = mindspore.ops.flash_attention_score( ---+ # query=query_states, ---+ # key=key_states, ---+ # value=value_states, ---+ # head_num=self.num_heads, ---+ # attn_mask=fa_attention_mask, ---+ # keep_prob=1.0 - self.attention_dropout, ---+ # scalar_value=1.0 / math.sqrt(self.head_dim), ---+ # input_layout="BNSD", ---+ # sparse_mode=0, ---+ # # <--- 修改点 2: 启用内部高精度计算 --- ---+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, ---+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 ---+ # inner_precise=1 ---+ # ) ---+ ---+ # # 恢复原始数据类型 ---+ # attn_output = attn_output.to(input_dtype) ---+ ---+ # # 7. 调整输出形状 ---+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ---+ # attn_output = self.o_proj(attn_output) ---+ ---+ # attn_weights = None ---+ # if output_attentions: ---+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") ---+ ---+ # return attn_output, attn_weights, past_key_value ---+ ---+ # def forward( ---+ # self, ---+ # hidden_states: mindspore.Tensor, ---+ # attention_mask: Optional[mindspore.Tensor] = None, ---+ # position_ids: Optional[mindspore.Tensor] = None, ---+ # past_key_value: Optional[Cache] = None, ---+ # output_attentions: bool = False, ---+ # use_cache: bool = False, ---+ # cache_position: Optional[mindspore.Tensor] = None, ---+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ---+ ---+ # bsz, q_len, _ = hidden_states.shape ---+ ---+ # query_states = self.q_proj(hidden_states) ---+ # key_states = self.k_proj(hidden_states) ---+ # value_states = self.v_proj(hidden_states) ---+ ---+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---+ ---+ # kv_seq_len = key_states.shape[-2] ---+ # if past_key_value is not None: ---+ # if self.layer_idx is None: ---+ # raise ValueError("`layer_idx` must be specified for caching") ---+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) ---+ ---+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ---+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) ---+ ---+ # if past_key_value is not None: ---+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} ---+ # key_states, value_states = past_key_value.update( ---+ # key_states, value_states, self.layer_idx, cache_kwargs ---+ # ) ---+ ---+ # key_states = repeat_kv(key_states, self.num_key_value_groups) ---+ # value_states = repeat_kv(value_states, self.num_key_value_groups) ---+ ---+ # # <--- 核心修改点: 手动进行高精度缩放 --- ---+ # # 在调用算子前,手动将 query_states 除以缩放因子。 ---+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 ---+ # query_states = query_states / math.sqrt(self.head_dim) ---+ # # <--- 修改结束 --- ---+ ---+ # fa_attention_mask = None ---+ # if attention_mask is not None: ---+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] ---+ # fa_attention_mask = (mask_slice != 0) ---+ ---+ # input_dtype = query_states.dtype ---+ ---+ # attn_output = mindspore.ops.flash_attention_score( ---+ # query=query_states, # 传入已经预先缩放过的 query ---+ # key=key_states, ---+ # value=value_states, ---+ # head_num=self.num_heads, ---+ # attn_mask=fa_attention_mask, ---+ # keep_prob=1.0 - self.attention_dropout, ---+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 ---+ # input_layout="BNSD", ---+ # sparse_mode=0, ---+ # inner_precise=1 # 仍然保持内部高精度计算 ---+ # ) ---+ ---+ # attn_output = attn_output.to(input_dtype) ---+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) ---+ # attn_output = self.o_proj(attn_output) ---+ ---+ # attn_weights = None ---+ # if output_attentions: ---+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") ---+ ---+ # return attn_output, attn_weights, past_key_value ---+ --- QWEN2MOE_ATTENTION_CLASSES = { --- "eager": Qwen2MoeAttention, ---+ "flash-attention": Qwen2MoeFlashAttention, --- } --- --- ---@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --- ---+ #@dwj ---+ # 只遍历激活的专家,而非全部专家 --- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: ---- batch_size, sequence_length, hidden_dim = hidden_states.shape ---- hidden_states = hidden_states.view(-1, hidden_dim) ---- # router_logits: (batch * sequence_length, n_experts) ---- router_logits = self.gate(hidden_states) ---- ---- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ---- if self.norm_topk_prob: ---- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ---- # we cast back to the input dtype ---- routing_weights = routing_weights.to(hidden_states.dtype) ---- ---- final_hidden_states = ops.zeros( ---- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype ---- ) ---- ---- # One hot encode the selected experts to create an expert mask ---- # this will be used to easily index which expert is going to be sollicitated ---- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) ---- ---- # Loop over all available experts in the model and perform the computation on each expert ---- for expert_idx in range(self.num_experts): ---- expert_layer = self.experts[expert_idx] ---- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) ---- ---- # Index the correct hidden states and compute the expert hidden state for ---- # the current expert. We need to make sure to multiply the output hidden ---- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) ---- if 0 not in idx.shape: ---- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) ---- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] ---- ---- # However `index_add_` only support torch tensors for indexing so we'll use ---- # the `top_x` tensor here. ---- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) ---- ---- shared_expert_output = self.shared_expert(hidden_states) ---- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output ---- ---- final_hidden_states = final_hidden_states + shared_expert_output ---+ batch_size, sequence_length, hidden_dim = hidden_states.shape ---+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) ---+ num_tokens = hidden_states_reshaped.shape[0] ---+ ---+ router_logits = self.gate(hidden_states_reshaped) ---+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) ---+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) ---+ ---+ if self.norm_topk_prob: ---+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) ---+ routing_weights = routing_weights.to(hidden_states.dtype) ---+ ---+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) ---+ flat_selected_experts = selected_experts.flatten() ---+ ---+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) ---+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) ---+ token_indices = broadcasted_token_indices.flatten() ---+ ---+ active_experts = ops.unique(flat_selected_experts) ---+ ---+ for expert_idx_tensor in active_experts: ---+ expert_idx = expert_idx_tensor.item() ---+ expert_layer = self.experts[expert_idx] ---+ ---+ mask = (flat_selected_experts == expert_idx_tensor) ---+ selected_token_indices = token_indices[mask] ---+ selected_routing_weights = routing_weights.flatten()[mask] ---+ ---+ current_states = hidden_states_reshaped[selected_token_indices] ---+ ---+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) ---+ ---+ final_hidden_states = final_hidden_states.index_add( ---+ dim=0, ---+ index=selected_token_indices, ---+ source=expert_output.to(hidden_states.dtype) ---+ ) ---+ ---+ shared_expert_output = self.shared_expert(hidden_states_reshaped) ---+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --- ---- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ---- return final_hidden_states, router_logits ---+ final_hidden_states = final_hidden_states + shared_expert_output ---+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) ---+ ---+ return final_hidden_states, router_logits --- --- --- class Qwen2MoeDecoderLayer(nn.Module): ---@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --- --- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --- ---+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) ---+ --- if (layer_idx not in config.mlp_only_layers) and ( --- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --- ): ---@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --- _no_split_modules = ["Qwen2MoeDecoderLayer"] --- _skip_keys_device_placement = "past_key_values" --- _supports_cache_class = True ---+#lwx ---+ # _supports_static_cache = True --- --- def _init_weights(self, module): --- std = self.config.initializer_range ---@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --- return causal_mask --- --- ----class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): ---+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --- _tied_weights_keys = ["lm_head.weight"] --- --- def __init__(self, config): ---@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --- self.num_experts_per_tok = config.num_experts_per_tok --- # Initialize weights and apply final processing --- self.post_init() ---+ # @lwx ---+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: ---+ # self.generation_config.cache_implementation = "static" ---+ self._warmed_up = False ---+ ---+ def warmup_moe_model(self): ---+ print("[Warmup] Qwen2-MoE 模型预热开始...") ---+ test_texts = [ ---+ "warmup short", ---+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", ---+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" ---+ ] ---+ tokenizer = getattr(self, "_warmup_tokenizer", None) ---+ if tokenizer is None: ---+ from mindnlp.transformers import AutoTokenizer ---+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) ---+ self._warmup_tokenizer = tokenizer ---+ ---+ for text in test_texts: ---+ inputs = tokenizer(text, return_tensors="ms") ---+ with mindspore._no_grad(): ---+ _ = self(**inputs, output_router_logits=True, use_cache=False) ---+ print("[Warmup] Qwen2-MoE 模型预热完成。") --- --- def get_input_embeddings(self): --- return self.model.embed_tokens ---@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --- ```""" ---+ if not self._warmed_up: ---+ self._warmed_up = True ---+ self.warmup_moe_model() --- --- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --- output_router_logits = ( ---@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --- } --- ) --- return model_inputs ---+# @lwx ---+ # def _decode_one_tokens_logits( ---+ # self, ---+ # cur_token: mindspore.Tensor, ---+ # input_pos: Optional[mindspore.Tensor], ---+ # cache_position: mindspore.Tensor, ---+ # past_key_values: StaticCache, ---+ # ) -> mindspore.Tensor: ---+ # """ ---+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) ---+ ---+ # Args: ---+ # cur_token: 当前要处理的token,shape为(batch_size, 1) ---+ # input_pos: 输入位置信息,可选 ---+ # cache_position: 当前token在cache中的位置,shape为(1,) ---+ # past_key_values: StaticCache对象,存储之前的key-value状态 ---+ ---+ # Returns: ---+ # logits: 当前token的logits,shape为(batch_size, vocab_size) ---+ # """ ---+ # # 调用JIT编译的版本 ---+ # return self.get_decode_one_tokens_logits( ---+ # cur_token=cur_token, ---+ # input_pos=input_pos, ---+ # cache_position=cache_position, ---+ # past_key_values=past_key_values, ---+ # ) ---+ ---+ # @mindspore.jit(jit_level='O1') ---+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): ---+ # """ ---+ # JIT编译的函数,用于高效的单token解码 ---+ # 使用JIT编译优化以支持静态shape和高效执行 ---+ ---+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except ---+ # """ ---+ # outputs = self.model.forward( ---+ # input_ids=cur_token, ---+ # position_ids=input_pos, ---+ # cache_position=cache_position, ---+ # past_key_values=past_key_values, ---+ # use_cache=True, ---+ # return_dict=False, ---+ # ) ---+ ---+ # hidden_states = outputs[0] ---+ # logits = self.lm_head.forward(hidden_states) ---+ # logits = logits.float() ---+ ---+ # return logits[:, -1, :] ---+ ---+ # def _sample( ---+ # self, ---+ # input_ids: mindspore.Tensor, ---+ # logits_processor, ---+ # stopping_criteria, ---+ # generation_config, ---+ # synced_devices: bool, ---+ # streamer=None, ---+ # logits_warper=None, ---+ # **model_kwargs, ---+ # ): ---+ # """ ---+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 ---+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 ---+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 ---+ # """ ---+ # from ...generation.logits_process import LogitsProcessorList ---+ # from ...generation.stopping_criteria import StoppingCriteriaList ---+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput ---+ # from mindnlp.core import nn, ops, no_grad ---+ # import numpy as np ---+ ---+ # # 检查是否使用 StaticCache ---+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 ---+ # # 否则,直接调用父类方法 ---+ # past_key_values = model_kwargs.get("past_key_values") ---+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") ---+ ---+ # if not isinstance(past_key_values, StaticCache): ---+ # # 不使用 StaticCache,直接调用父类方法 ---+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") ---+ # return super()._sample( ---+ # input_ids=input_ids, ---+ # logits_processor=logits_processor, ---+ # stopping_criteria=stopping_criteria, ---+ # generation_config=generation_config, ---+ # synced_devices=synced_devices, ---+ # streamer=streamer, ---+ # logits_warper=logits_warper, ---+ # **model_kwargs, ---+ # ) ---+ ---+ # # 使用 StaticCache,进入自定义循环 ---+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) ---+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 ---+ # pad_token_id = generation_config._pad_token_tensor ---+ # output_attentions = generation_config.output_attentions ---+ # output_hidden_states = generation_config.output_hidden_states ---+ # output_scores = generation_config.output_scores ---+ # output_logits = generation_config.output_logits ---+ # return_dict_in_generate = generation_config.return_dict_in_generate ---+ # max_length = generation_config.max_length ---+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) ---+ # do_sample = generation_config.do_sample ---+ ---+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): ---+ # raise ValueError( ---+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " ---+ # f"{logits_warper})." ---+ # ) ---+ ---+ # # init attention / hidden states / scores tuples ---+ # scores = () if (return_dict_in_generate and output_scores) else None ---+ # raw_logits = () if (return_dict_in_generate and output_logits) else None ---+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None ---+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None ---+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None ---+ ---+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states ---+ # if return_dict_in_generate and self.config.is_encoder_decoder: ---+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None ---+ # encoder_hidden_states = ( ---+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None ---+ # ) ---+ ---+ # # keep track of which sequences are already finished ---+ # batch_size, cur_len = input_ids.shape ---+ # this_peer_finished = False ---+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) ---+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) ---+ ---+ # time_record = [] ---+ # from ....utils.testing_utils import parse_flag_from_env ---+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) ---+ ---+ # while self._has_unfinished_sequences( ---+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length ---+ # ): ---+ # if _record_time: ---+ # import time as time_module ---+ # infer_start = time_module.time() ---+ ---+ # # prepare model inputs ---+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) ---+ ---+ # # prepare variable output controls ---+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) ---+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) ---+ ---+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 ---+ # cur_cache_position = model_inputs.get("cache_position") ---+ # cur_past_key_values = model_inputs.get("past_key_values") ---+ # cur_input_ids = model_inputs.get("input_ids") ---+ ---+ # if (isinstance(cur_past_key_values, StaticCache) and ---+ # cur_cache_position is not None and ---+ # len(cur_cache_position.shape) > 0 and ---+ # cur_cache_position.shape[0] == 1 and ---+ # cur_input_ids is not None and ---+ # cur_input_ids.shape[1] == 1): ---+ # # 使用 JIT 优化的单 token 解码 ---+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) ---+ # if not hasattr(self, '_jit_used'): ---+ # self._jit_used = False ---+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") ---+ ---+ # next_token_logits = self.get_decode_one_tokens_logits( ---+ # cur_token=cur_input_ids, ---+ # input_pos=model_inputs.get("position_ids"), ---+ # cache_position=cur_cache_position, ---+ # past_key_values=cur_past_key_values, ---+ # ) ---+ ---+ # # 标记已使用JIT(用于后续判断) ---+ # if not self._jit_used: ---+ # self._jit_used = True ---+ ---+ # # 构造兼容的输出对象 ---+ # class JitOptimizedOutput: ---+ # def __init__(self, logits, config): ---+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits ---+ # self.config = config ---+ # # 对于 JIT 优化路径,这些属性通常不需要 ---+ # self.decoder_attentions = None if config.is_encoder_decoder else None ---+ # self.attentions = None if not config.is_encoder_decoder else None ---+ # self.cross_attentions = None ---+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None ---+ # self.hidden_states = None if not config.is_encoder_decoder else None ---+ ---+ # outputs = JitOptimizedOutput(next_token_logits, self.config) ---+ # else: ---+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) ---+ # outputs = self(**model_inputs, return_dict=True) ---+ ---+ # if synced_devices and this_peer_finished: ---+ # continue ---+ ---+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits ---+ # next_token_logits = outputs.logits[:, -1, :] ---+ ---+ # # pre-process distribution ---+ # next_token_scores = logits_processor(input_ids, next_token_logits) ---+ # if do_sample: ---+ # next_token_scores = logits_warper(input_ids, next_token_scores) ---+ ---+ # # Store scores, attentions and hidden_states when required ---+ # if return_dict_in_generate: ---+ # if output_scores: ---+ # scores += (next_token_scores,) ---+ # if output_logits: ---+ # raw_logits += (next_token_logits,) ---+ # if output_attentions: ---+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions ---+ # decoder_attentions += (attn,) if attn is not None else (None,) ---+ # if self.config.is_encoder_decoder: ---+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) ---+ ---+ # if output_hidden_states: ---+ # hidden = ( ---+ # outputs.decoder_hidden_states ---+ # if self.config.is_encoder_decoder ---+ # else outputs.hidden_states ---+ # ) ---+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) ---+ ---+ # # token selection ---+ # if do_sample: ---+ # probs = nn.functional.softmax(next_token_scores, dim=-1) ---+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) ---+ # else: ---+ # next_tokens = ops.argmax(next_token_scores, dim=-1) ---+ ---+ # # finished sentences should have their next token be a padding token ---+ # if has_eos_stopping_criteria: ---+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) ---+ ---+ # # update generated ids, model inputs, and length for next step ---+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) ---+ # if streamer is not None: ---+ # streamer.put(next_tokens) ---+ ---+ # model_kwargs = self._update_model_kwargs_for_generation( ---+ # outputs, ---+ # model_kwargs, ---+ # is_encoder_decoder=self.config.is_encoder_decoder, ---+ # ) ---+ ---+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) ---+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 ---+ # cur_len += 1 ---+ ---+ # if _record_time: ---+ # import time as time_module ---+ # infer_stop = time_module.time() ---+ # time_record.append(infer_stop - infer_start) ---+ ---+ # del outputs ---+ ---+ # average_infer_time = None ---+ # if time_record: ---+ # if len(time_record) > 1: ---+ # time_record.pop(0) ---+ # average_infer_time = sum(time_record) / len(time_record) ---+ # print(f'average inference time is: {average_infer_time}') ---+ # print(f'inference time record: {time_record}') ---+ ---+ # if streamer is not None: ---+ # streamer.end() ---+ ---+ # # 简单判断:打印是否使用了JIT路径 ---+ # if hasattr(self, '_jit_used') and self._jit_used: ---+ # print("[JIT] ✓ JIT optimization was used during generation") ---+ # else: ---+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") ---+ ---+ # if return_dict_in_generate: ---+ # if self.config.is_encoder_decoder: ---+ # return GenerateEncoderDecoderOutput( ---+ # sequences=input_ids, ---+ # scores=scores, ---+ # logits=raw_logits, ---+ # encoder_attentions=encoder_attentions, ---+ # encoder_hidden_states=encoder_hidden_states, ---+ # decoder_attentions=decoder_attentions, ---+ # cross_attentions=cross_attentions, ---+ # decoder_hidden_states=decoder_hidden_states, ---+ # past_key_values=model_kwargs.get("past_key_values"), ---+ # average_infer_time=average_infer_time ---+ # ) ---+ # else: ---+ # return GenerateDecoderOnlyOutput( ---+ # sequences=input_ids, ---+ # scores=scores, ---+ # logits=raw_logits, ---+ # attentions=decoder_attentions, ---+ # hidden_states=decoder_hidden_states, ---+ # past_key_values=model_kwargs.get("past_key_values"), ---+ # average_infer_time=average_infer_time ---+ # ) ---+ # else: ---+ # return input_ids ---+ ---+ # def _prepare_cache_for_generation( ---+ # self, ---+ # generation_config, ---+ # model_kwargs, ---+ # assistant_model, ---+ # batch_size, ---+ # max_cache_length, ---+ # ): ---+ # if generation_config.cache_implementation is None and self._supports_static_cache: ---+ # generation_config.cache_implementation = "static" ---+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") ---+ ---+ # if generation_config.cache_implementation == "static": ---+ # base_required_from_max_length = generation_config.max_length + 1 ---+ # base_required = max(max_cache_length, base_required_from_max_length) ---+ # min_cache_size = 50 ---+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ---+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) ---+ # else: ---+ # max_cache_length = max(base_required, min_cache_size) ---+ ---+ # original_max_cache_length = max_cache_length ---+ # print(f"[JIT] StaticCache max_cache_length calculation:") ---+ # print(f" - input max_cache_length: {original_max_cache_length}") ---+ # print(f" - generation_config.max_length: {generation_config.max_length}") ---+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") ---+ # print(f" - final max_cache_length: {max_cache_length}") ---+ ---+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: ---+ # if max_cache_length > self.config.max_position_embeddings: ---+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") ---+ ---+ # result = super()._prepare_cache_for_generation( ---+ # generation_config=generation_config, ---+ # model_kwargs=model_kwargs, ---+ # assistant_model=assistant_model, ---+ # batch_size=batch_size, ---+ # max_cache_length=max_cache_length, ---+ # ) ---+ ---+ # if generation_config.cache_implementation == "static": ---+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" ---+ # created_cache = model_kwargs.get(cache_name) ---+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): ---+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") ---+ # if created_cache.max_cache_len < generation_config.max_length: ---+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") ---+ ---+ # return result ---+ ---+ ---+ --- --- --- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE ----- ---2.27.0 --- ---- --2.27.0 -- -diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch -deleted file mode 100644 -index bc5549ca..00000000 ---- a/patches/0004-20251106change.patch -+++ /dev/null -@@ -1,7498 +0,0 @@ --From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Thu, 6 Nov 2025 15:48:09 +0800 --Subject: [PATCH 4/8] 20251106change -- ----- -- .../models/deepseek/modeling_deepseek.py | 189 +- -- patches/0001-20251104commit.patch | 1272 +++++++ -- patches/0002-20251106commit.patch | 3200 +++++++++++++++++ -- patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ -- 4 files changed, 7244 insertions(+), 186 deletions(-) -- create mode 100644 patches/0001-20251104commit.patch -- create mode 100644 patches/0002-20251106commit.patch -- create mode 100644 patches/0003-20261106secondcommit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index 2f9192bf..0546f318 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): -- -- return attn_output, attn_weights, past_key_value -- ---# class DeepseekFlashAttention(nn.Module): ---# """ ---# Multi-headed attention from 'Attention Is All You Need' paper, implemented using ---# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --- ---# This class is designed as a drop-in replacement for DeepseekAttention. ---# """ --- ---# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): ---# super().__init__() ---# self.config = config ---# self.layer_idx = layer_idx ---# if layer_idx is None: ---# logger.warning( ---# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " ---# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " ---# "when creating this class." ---# ) --- ---# self.attention_dropout = config.attention_dropout ---# self.hidden_size = config.hidden_size ---# self.num_heads = config.num_attention_heads ---# self.head_dim = self.hidden_size // self.num_heads ---# self.num_key_value_heads = config.num_key_value_heads ---# self.num_key_value_groups = self.num_heads // self.num_key_value_heads ---# self.max_position_embeddings = config.max_position_embeddings ---# self.rope_theta = config.rope_theta ---# self.is_causal = True --- ---# if (self.head_dim * self.num_heads) != self.hidden_size: ---# raise ValueError( ---# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" ---# f" and `num_heads`: {self.num_heads})." ---# ) --- ---# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) ---# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ---# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) ---# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) ---# self._init_rope() --- ---# def _init_rope(self): ---# if self.config.rope_scaling is None: ---# self.rotary_emb = DeepseekRotaryEmbedding( ---# self.head_dim, ---# max_position_embeddings=self.max_position_embeddings, ---# base=self.rope_theta, ---# ) ---# else: ---# scaling_type = self.config.rope_scaling["type"] ---# scaling_factor = self.config.rope_scaling["factor"] ---# if scaling_type == "linear": ---# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( ---# self.head_dim, ---# max_position_embeddings=self.max_position_embeddings, ---# scaling_factor=scaling_factor, ---# base=self.rope_theta, ---# ) ---# elif scaling_type == "dynamic": ---# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( ---# self.head_dim, ---# max_position_embeddings=self.max_position_embeddings, ---# scaling_factor=scaling_factor, ---# base=self.rope_theta, ---# ) ---# else: ---# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --- ---# def forward( ---# self, ---# hidden_states: mindspore.Tensor, ---# attention_mask: Optional[mindspore.Tensor] = None, ---# position_ids: Optional[mindspore.Tensor] = None, ---# past_key_value: Optional[Cache] = None, ---# output_attentions: bool = False, ---# use_cache: bool = False, ---# **kwargs, ---# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: ---# if "padding_mask" in kwargs: ---# warnings.warn( ---# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" ---# ) --- ---# if output_attentions: ---# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --- ---# bsz, q_len, _ = hidden_states.shape --- ---# if self.config.pretraining_tp > 1: ---# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --- ---# query_states = self.q_proj(hidden_states) ---# key_states = self.k_proj(hidden_states) ---# value_states = self.v_proj(hidden_states) --- ---# # Reshape for multi-head attention ---# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) ---# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) ---# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --- ---# kv_seq_len = key_states.shape[-2] ---# if past_key_value is not None: ---# if self.layer_idx is None: ---# raise ValueError( ---# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " ---# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " ---# "with a layer index." ---# ) ---# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --- ---# # Apply Rotary Positional Embedding ---# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) ---# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --- ---# if past_key_value is not None: ---# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models ---# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --- ---# # Reshape Q, K, V for flash_attention_score's 'BSH' layout ---# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) ---# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --- ---# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ---# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --- ---# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) ---# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --- ---# # Convert attention_mask for flash_attention_score ---# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. ---# if attention_mask is not None: ---# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) ---# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): ---# raise ValueError( ---# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" ---# ) ---# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True ---# else: ---# attn_mask_for_fa = None --- ---# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --- ---# # Call the fused flash_attention_score operator ---# attn_output = mindspore.ops.flash_attention_score( ---# query=query_states_for_fa, ---# key=key_states_for_fa, ---# value=value_states_for_fa, ---# head_num=self.num_heads, # This is N1, the number of query heads ---# input_layout='BSH', ---# attn_mask=attn_mask_for_fa, ---# keep_prob=keep_prob, ---# scalar_value=1.0 / math.sqrt(self.head_dim), ---# sparse_mode=0 # Default mask mode ---# ) --- ---# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed ---# attn_output = self.o_proj(attn_output) --- ---# # Flash Attention does not return attention weights ---# attn_weights = None --- ---# return attn_output, attn_weights, past_key_value -- -- class DeepseekFlashAttention(nn.Module): -- """ --@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): -- super().__init__() -- self.hidden_size = config.hidden_size -- --- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --- config=config, layer_idx=layer_idx --- ) --+ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --+ # config=config, layer_idx=layer_idx --+ # ) -- -- self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( -- config=config, layer_idx=layer_idx --@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): -- return outputs -- -- --- -- class DeepseekPreTrainedModel(PreTrainedModel): -- config_class = DeepseekConfig -- base_model_prefix = "model" --@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- # Initialize weights and apply final processing -- self.post_init() -- self.warm_up = False --- #@dwj --- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --- self.num_layers, --- self.num_attention_heads, --- self.head_dim, --- batch_size=1, --- max_length=self.max_length, --- dtype=mindspore.float16 --- ) --- --- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --- key_cache = [] --- value_cache = [] --- for _ in range(num_layers): --- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --- key_cache.append(k) --- value_cache.append(v) --- return key_cache, value_cache --- -- -- def warmup_moe_model_deep(self): -- print("[Warmup] DeepSeek-MoE 模型预热开始...") --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --new file mode 100644 --index 00000000..78f22642 ----- /dev/null --+++ b/patches/0001-20251104commit.patch --@@ -0,0 +1,1272 @@ --+From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Tue, 4 Nov 2025 09:11:51 +0800 --+Subject: [PATCH 1/3] 20251104commit --+ --+--- --+ mindnlp/transformers/cache_utils.py | 28 +- --+ .../models/deepseek/modeling_deepseek.py | 149 ++- --+ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+ 3 files changed, 976 insertions(+), 87 deletions(-) --+ --+diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+index cadd2e04..02f8d4be 100644 --+--- a/mindnlp/transformers/cache_utils.py --++++ b/mindnlp/transformers/cache_utils.py --+@@ -812,14 +812,26 @@ class StaticCache(Cache): --+ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+ # k_out[:, :, cache_position] = key_states --+ # v_out[:, :, cache_position] = value_states --+- if ON_ORANGE_PI: --+- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+- else: --+- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+- --++ # if ON_ORANGE_PI: --++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++ # else: --++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++ # 确保 cache_position 是 1D tensor 并且类型正确 --++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++ if cache_position.ndim > 1: --++ cache_position = cache_position.flatten() --++ # 确保类型是 int32 或 int64(MindSpore 要求) --++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++ cache_position = cache_position.int() --++ --++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++ k_out[:, :, cache_position] = key_states --++ v_out[:, :, cache_position] = value_states --++ --+ return k_out, v_out --+ --+ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index c695b944..d8303e45 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+ # Copied from transformers.models.llama.modeling_llama.rotate_half --+ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+- x1 = x[..., : x.shape[-1] // 2] --+- x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++ # x1 = x[..., : x.shape[-1] // 2] --++ # x2 = x[..., x.shape[-1] // 2 :] --++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+ if self.training: --+ raise NotImplementedError("Training is not supported yet.") --+ else: --+- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+- if self.config.n_shared_experts is not None: --+- y = y + self.shared_experts(identity) --+- return y --++ # @lwx --++ if orig_shape[1] == 1: --++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++ y=y.view(*orig_shape) --++ if self.config.n_shared_experts is not None: --++ y = y + self.shared_experts(identity) --++ return y --++ else: --++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++ if self.config.n_shared_experts is not None: --++ y = y + self.shared_experts(identity) --++ return y --++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++ # if self.config.n_shared_experts is not None: --++ # y = y + self.shared_experts(identity) --++ # return y --++ --++ @no_grad() --++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ --++ expert_cache = ops.zeros_like(x) --++ for i in range(self.num_experts_per_tok): --++ expert_id = flat_expert_indices[i].item() --++ weight = flat_expert_weights[i].item() --++ expert = self.experts[expert_id] --++ expert_out = expert(x) --++ expert_cache += expert_out * weight --++ return expert_cache --+ --+ @no_grad() --+- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+- # expert_cache = torch.zeros_like(x) --+- # idxs = flat_expert_indices.argsort() --+- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+- # token_idxs = idxs // self.num_experts_per_tok --+- # for i, end_idx in enumerate(tokens_per_expert): --+- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+- # if start_idx == end_idx: --+- # continue --+- # expert = self.experts[i] --+- # exp_token_idx = token_idxs[start_idx:end_idx] --+- # expert_tokens = x[exp_token_idx] --+- # expert_out = expert(expert_tokens) --+- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+- # return expert_cache --++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ expert_cache = ops.zeros_like(x) --+ idxs = flat_expert_indices.argsort() --+ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+ token_idxs = idxs // self.num_experts_per_tok --++ --+ for i, end_idx in enumerate(tokens_per_expert): --+ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+ if start_idx == end_idx: --+@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+ expert_out = expert(expert_tokens) --+ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --+ return expert_cache --++ --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # # expert_cache = torch.zeros_like(x) --++ # # idxs = flat_expert_indices.argsort() --++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++ # # token_idxs = idxs // self.num_experts_per_tok --++ # # for i, end_idx in enumerate(tokens_per_expert): --++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++ # # if start_idx == end_idx: --++ # # continue --++ # # expert = self.experts[i] --++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++ # # expert_tokens = x[exp_token_idx] --++ # # expert_out = expert(expert_tokens) --++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++ # # return expert_cache --++ # expert_cache = ops.zeros_like(x) --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # for i, end_idx in enumerate(tokens_per_expert): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # if start_idx == end_idx: --++ # continue --++ # expert = self.experts[i] --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = expert(expert_tokens) --++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --++ # return expert_cache --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # expert_cache = ops.zeros_like(x) --++ --++ # # 排序保证顺序一致 --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # # 找出有 token 的专家 --++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++ --++ # for i in active_experts.tolist(): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # end_idx = tokens_per_expert[i] --++ # if start_idx == end_idx: # 没有 token --++ # continue --++ --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = self.experts[i](expert_tokens) --++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++ --++ # expert_cache = mindspore.mint.scatter_add( --++ # expert_cache, --++ # 0, --++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++ # expert_out --++ # ) --++ --++ # return expert_cache --++ --++ --+ --+ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+ # """ --+@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ --+ # Initialize weights and apply final processing --+ self.post_init() --++ self.warm_up = False --++ --++ def warmup_moe_model_deep(self): --++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++ test_texts = [ --++ "warmup short", --++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++ ] --++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++ if tokenizer is None: --++ from mindnlp.transformers import AutoTokenizer --++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++ self._warmup_tokenizer = tokenizer --++ --++ for text in test_texts: --++ inputs = tokenizer(text, return_tensors="ms") --++ with mindspore._no_grad(): --++ _ = self(**inputs, use_cache=False) --++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+ --+ def get_input_embeddings(self): --+ return self.model.embed_tokens --+@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+ ```""" --++ if not self.warm_up: --++ self.warm_up = True --++ self.warmup_moe_model_deep() --++ --+ output_attentions = ( --+ output_attentions --+ if output_attentions is not None --+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+index 3cbf820e..d4c6b651 100644 --+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+@@ -18,7 +18,6 @@ --+ # See the License for the specific language governing permissions and --+ # limitations under the License. --+ """MindSpore Qwen2MoE model.""" --+- --+ import math --+ from typing import List, Optional, Tuple, Union --+ --+@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+ TokenClassifierOutput, --+ ) --+ from ...modeling_utils import PreTrainedModel --++from ...generation import GenerationMixin --+ from ....utils import logging --+ from .configuration_qwen2_moe import Qwen2MoeConfig --+ --+@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+ self.variance_epsilon = eps --+ --+ def forward(self, hidden_states): --++ # @dwj --++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++ # @lwx --++ # if not self.training : --++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+ input_dtype = hidden_states.dtype --+ hidden_states = hidden_states.to(mindspore.float32) --+ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+@@ -234,6 +239,8 @@ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+ x1 = x[..., : x.shape[-1] // 2] --+ x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+ self.config = config --+ self.hidden_size = config.hidden_size --+ self.intermediate_size = intermediate_size --++ --+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+ self.act_fn = ACT2FN[config.hidden_act] --+ --+ def forward(self, x): --+- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+- --+ --++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++ # @lwx --++ # gate_up_output = self.gate_up_proj(x) --++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++ # return self.down_proj(swiglu_output) --++ --++ # def forward(self, x): --++ # gate_proj_out = self.gate_proj(x) --++ # up_proj_out = self.up_proj(x) --++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++ # return self.down_proj(swiglu_out) --++ --+ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+ """ --+@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+ use_cache: bool = False, --+ cache_position: Optional[mindspore.Tensor] = None, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ --++ --+ bsz, q_len, _ = hidden_states.shape --+ --+ query_states = self.q_proj(hidden_states) --+@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+ "with a layer index." --+ ) --+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ if isinstance(past_key_value, StaticCache): --++ kv_seq_len = key_states.shape[-2] --++ else: --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++ if isinstance(past_key_value, StaticCache): --++ kv_seq_len = key_states.shape[-2] --+ --+ # repeat k/v heads if n_kv_heads < n_heads --+ key_states = repeat_kv(key_states, self.num_key_value_groups) --+ value_states = repeat_kv(value_states, self.num_key_value_groups) --+- --++ --+ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+ --+- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+- raise ValueError( --+- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+- f" {attn_weights.shape}" --+- ) --+- --+- if attention_mask is not None: # no matter the length, we just slice it --+- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++ if attention_mask is not None: --++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+ attn_weights = attn_weights + causal_mask --+ --+ # upcast attention to fp32 --+@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+ --+ attn_output = self.o_proj(attn_output) --+- --++ # @lwx --++ --++ # max_seq_len = self.max_position_embeddings # 2048 --++ --++ # if attention_mask is not None: --++ # # attention_mask: [B, 1, Sq, Sk] --++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++ --++ # # pad 到 [max_seq_len, max_seq_len] --++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++ # global_attention_mask = padded_mask --++ # else: --++ # global_attention_mask = None --++ --++ --++ # sparse_mode=3 --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, --++ # key=key_states, --++ # value=value_states, --++ # real_shift=None, --++ # padding_mask=None, --++ --++ # head_num=self.num_heads, --++ # attn_mask=global_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++ # input_layout="BNSD", --++ # pre_tokens=2147483647, --++ # next_tokens=2147483647, --++ # inner_precise=0, --++ # drop_mask=None, --++ # prefix=None, --++ # actual_seq_qlen=None, --++ # actual_seq_kvlen=None, --++ # sparse_mode=sparse_mode, --++ # ) --+ if not output_attentions: --+ attn_weights = None --+ --+ return attn_output, attn_weights, past_key_value --+ --+ --++class Qwen2MoeFlashAttention(nn.Module): --++ """ --++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++ --++ 关键改动: --++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++ 直接传入原始的 key 和 value 张量效率更高。 --++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++ """ --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++ super().__init__() --++ self.config = config --++ self.layer_idx = layer_idx --++ self.hidden_size = config.hidden_size --++ self.num_heads = config.num_attention_heads --++ self.head_dim = self.hidden_size // self.num_heads --++ self.num_key_value_heads = config.num_key_value_heads --++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++ self.max_position_embeddings = config.max_position_embeddings --++ self.rope_theta = config.rope_theta --++ self.attention_dropout = config.attention_dropout --++ --++ if (self.head_dim * self.num_heads) != self.hidden_size: --++ raise ValueError( --++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++ ) --++ --++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++ --++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++ self.head_dim, --++ max_position_embeddings=self.max_position_embeddings, --++ base=self.rope_theta, --++ ) --++ --++ def forward( --++ self, --++ hidden_states: mindspore.Tensor, --++ attention_mask: Optional[mindspore.Tensor] = None, --++ position_ids: Optional[mindspore.Tensor] = None, --++ past_key_value: Optional[Cache] = None, --++ output_attentions: bool = False, --++ use_cache: bool = False, --++ cache_position: Optional[mindspore.Tensor] = None, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ bsz, q_len, _ = hidden_states.shape --++ --++ # 1. 线性投射 Q, K, V --++ query_states = self.q_proj(hidden_states) --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++ # query: [B, S, H*D] -> [B, N1, S, D] --++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # 3. RoPE 旋转位置编码 --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++ if self.layer_idx is None: --++ raise ValueError( --++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ "with a layer index." --++ ) --++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++ if cache_position.shape[0] == 1: --++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++ kv_seq_len = past_seen_tokens + 1 --++ else: --++ # prefill 阶段:cache_position 是范围,使用其长度 --++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++ else: --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # 4. KV 缓存更新 --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ key_states, value_states = past_key_value.update( --++ key_states, value_states, self.layer_idx, cache_kwargs --++ ) --++ --++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++ if cache_position.shape[0] == 1: --++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++ kv_seq_len = key_states.shape[-2] --++ --++ # 5. [重要] 准备 Attention Mask --++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++ fa_attention_mask = None --++ if attention_mask is not None: --++ # 截取与当前key长度匹配的部分 --++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++ fa_attention_mask = (mask_slice != 0) --++ --++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++ input_dtype = query_states.dtype --++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++ query_states = query_states.to(mindspore.float16) --++ key_states = key_states.to(mindspore.float16) --++ value_states = value_states.to(mindspore.float16) --++ --++ # 6. [核心] 调用 flash_attention_score 算子 --++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++ attn_output = mindspore.ops.flash_attention_score( --++ query=query_states, --++ key=key_states, --++ value=value_states, --++ head_num=self.num_heads, # 传入Q的头数(N1) --++ attn_mask=fa_attention_mask, --++ keep_prob=1.0 - self.attention_dropout, --++ scalar_value=1.0 / math.sqrt(self.head_dim), --++ input_layout="BNSD", --++ sparse_mode=0 # 使用 defaultMask 模式 --++ ) --++ --++ # 恢复原始数据类型 --++ attn_output = attn_output.to(input_dtype) --++ --++ # 7. 调整输出形状 --++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ attn_output = self.o_proj(attn_output) --++ --++ # FlashAttention 算子不直接返回注意力权重矩阵 --++ attn_weights = None --++ if output_attentions: --++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++ return attn_output, attn_weights, past_key_value --++ --++ # def forward( --++ # self, --++ # hidden_states: mindspore.Tensor, --++ # attention_mask: Optional[mindspore.Tensor] = None, --++ # position_ids: Optional[mindspore.Tensor] = None, --++ # past_key_value: Optional[Cache] = None, --++ # output_attentions: bool = False, --++ # use_cache: bool = False, --++ # cache_position: Optional[mindspore.Tensor] = None, --++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ # bsz, q_len, _ = hidden_states.shape --++ --++ # # 1. 线性投射 Q, K, V --++ # query_states = self.q_proj(hidden_states) --++ # key_states = self.k_proj(hidden_states) --++ # value_states = self.v_proj(hidden_states) --++ --++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # # 3. RoPE 旋转位置编码 --++ # kv_seq_len = key_states.shape[-2] --++ # if past_key_value is not None: --++ # if self.layer_idx is None: --++ # raise ValueError( --++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ # "with a layer index." --++ # ) --++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # # 4. KV 缓存更新 --++ # if past_key_value is not None: --++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ # key_states, value_states = past_key_value.update( --++ # key_states, value_states, self.layer_idx, cache_kwargs --++ # ) --++ --++ # # 5. 准备 Attention Mask --++ # fa_attention_mask = None --++ # if attention_mask is not None: --++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # fa_attention_mask = (mask_slice != 0) --++ --++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++ # input_dtype = query_states.dtype --++ --++ # # 6. [核心] 调用 flash_attention_score 算子 --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, --++ # key=key_states, --++ # value=value_states, --++ # head_num=self.num_heads, --++ # attn_mask=fa_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++ # input_layout="BNSD", --++ # sparse_mode=0, --++ # # <--- 修改点 2: 启用内部高精度计算 --- --++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++ # inner_precise=1 --++ # ) --++ --++ # # 恢复原始数据类型 --++ # attn_output = attn_output.to(input_dtype) --++ --++ # # 7. 调整输出形状 --++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ # attn_output = self.o_proj(attn_output) --++ --++ # attn_weights = None --++ # if output_attentions: --++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++ # return attn_output, attn_weights, past_key_value --++ --++ # def forward( --++ # self, --++ # hidden_states: mindspore.Tensor, --++ # attention_mask: Optional[mindspore.Tensor] = None, --++ # position_ids: Optional[mindspore.Tensor] = None, --++ # past_key_value: Optional[Cache] = None, --++ # output_attentions: bool = False, --++ # use_cache: bool = False, --++ # cache_position: Optional[mindspore.Tensor] = None, --++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ # bsz, q_len, _ = hidden_states.shape --++ --++ # query_states = self.q_proj(hidden_states) --++ # key_states = self.k_proj(hidden_states) --++ # value_states = self.v_proj(hidden_states) --++ --++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ # kv_seq_len = key_states.shape[-2] --++ # if past_key_value is not None: --++ # if self.layer_idx is None: --++ # raise ValueError("`layer_idx` must be specified for caching") --++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ # if past_key_value is not None: --++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ # key_states, value_states = past_key_value.update( --++ # key_states, value_states, self.layer_idx, cache_kwargs --++ # ) --++ --++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++ --++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++ # query_states = query_states / math.sqrt(self.head_dim) --++ # # <--- 修改结束 --- --++ --++ # fa_attention_mask = None --++ # if attention_mask is not None: --++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ # fa_attention_mask = (mask_slice != 0) --++ --++ # input_dtype = query_states.dtype --++ --++ # attn_output = mindspore.ops.flash_attention_score( --++ # query=query_states, # 传入已经预先缩放过的 query --++ # key=key_states, --++ # value=value_states, --++ # head_num=self.num_heads, --++ # attn_mask=fa_attention_mask, --++ # keep_prob=1.0 - self.attention_dropout, --++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++ # input_layout="BNSD", --++ # sparse_mode=0, --++ # inner_precise=1 # 仍然保持内部高精度计算 --++ # ) --++ --++ # attn_output = attn_output.to(input_dtype) --++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ # attn_output = self.o_proj(attn_output) --++ --++ # attn_weights = None --++ # if output_attentions: --++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++ --++ # return attn_output, attn_weights, past_key_value --++ --+ QWEN2MOE_ATTENTION_CLASSES = { --+ "eager": Qwen2MoeAttention, --++ "flash-attention": Qwen2MoeFlashAttention, --+ } --+ --+ --+@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --++ #@dwj --++ # 只遍历激活的专家,而非全部专家 --+ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+- batch_size, sequence_length, hidden_dim = hidden_states.shape --+- hidden_states = hidden_states.view(-1, hidden_dim) --+- # router_logits: (batch * sequence_length, n_experts) --+- router_logits = self.gate(hidden_states) --+- --+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- if self.norm_topk_prob: --+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- # we cast back to the input dtype --+- routing_weights = routing_weights.to(hidden_states.dtype) --+- --+- final_hidden_states = ops.zeros( --+- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+- ) --+- --+- # One hot encode the selected experts to create an expert mask --+- # this will be used to easily index which expert is going to be sollicitated --+- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+- --+- # Loop over all available experts in the model and perform the computation on each expert --+- for expert_idx in range(self.num_experts): --+- expert_layer = self.experts[expert_idx] --+- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+- --+- # Index the correct hidden states and compute the expert hidden state for --+- # the current expert. We need to make sure to multiply the output hidden --+- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+- if 0 not in idx.shape: --+- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+- --+- # However `index_add_` only support torch tensors for indexing so we'll use --+- # the `top_x` tensor here. --+- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+- --+- shared_expert_output = self.shared_expert(hidden_states) --+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+- --+- final_hidden_states = final_hidden_states + shared_expert_output --++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++ num_tokens = hidden_states_reshaped.shape[0] --++ --++ router_logits = self.gate(hidden_states_reshaped) --++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++ if self.norm_topk_prob: --++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ routing_weights = routing_weights.to(hidden_states.dtype) --++ --++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++ flat_selected_experts = selected_experts.flatten() --++ --++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++ token_indices = broadcasted_token_indices.flatten() --++ --++ active_experts = ops.unique(flat_selected_experts) --++ --++ for expert_idx_tensor in active_experts: --++ expert_idx = expert_idx_tensor.item() --++ expert_layer = self.experts[expert_idx] --++ --++ mask = (flat_selected_experts == expert_idx_tensor) --++ selected_token_indices = token_indices[mask] --++ selected_routing_weights = routing_weights.flatten()[mask] --++ --++ current_states = hidden_states_reshaped[selected_token_indices] --++ --++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++ --++ final_hidden_states = final_hidden_states.index_add( --++ dim=0, --++ index=selected_token_indices, --++ source=expert_output.to(hidden_states.dtype) --++ ) --++ --++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+ --+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+- return final_hidden_states, router_logits --++ final_hidden_states = final_hidden_states + shared_expert_output --++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++ --++ return final_hidden_states, router_logits --+ --+ --+ class Qwen2MoeDecoderLayer(nn.Module): --+@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+ --+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++ --+ if (layer_idx not in config.mlp_only_layers) and ( --+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+ ): --+@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+ _skip_keys_device_placement = "past_key_values" --+ _supports_cache_class = True --++#lwx --++ # _supports_static_cache = True --+ --+ def _init_weights(self, module): --+ std = self.config.initializer_range --+@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+ return causal_mask --+ --+ --+-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ _tied_weights_keys = ["lm_head.weight"] --+ --+ def __init__(self, config): --+@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ self.num_experts_per_tok = config.num_experts_per_tok --+ # Initialize weights and apply final processing --+ self.post_init() --++ # @lwx --++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++ # self.generation_config.cache_implementation = "static" --++ self._warmed_up = False --++ --++ def warmup_moe_model(self): --++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++ test_texts = [ --++ "warmup short", --++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++ ] --++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++ if tokenizer is None: --++ from mindnlp.transformers import AutoTokenizer --++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++ self._warmup_tokenizer = tokenizer --++ --++ for text in test_texts: --++ inputs = tokenizer(text, return_tensors="ms") --++ with mindspore._no_grad(): --++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+ --+ def get_input_embeddings(self): --+ return self.model.embed_tokens --+@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+ ```""" --++ if not self._warmed_up: --++ self._warmed_up = True --++ self.warmup_moe_model() --+ --+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+ output_router_logits = ( --+@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+ } --+ ) --+ return model_inputs --++# @lwx --++ # def _decode_one_tokens_logits( --++ # self, --++ # cur_token: mindspore.Tensor, --++ # input_pos: Optional[mindspore.Tensor], --++ # cache_position: mindspore.Tensor, --++ # past_key_values: StaticCache, --++ # ) -> mindspore.Tensor: --++ # """ --++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++ --++ # Args: --++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++ # input_pos: 输入位置信息,可选 --++ # cache_position: 当前token在cache中的位置,shape为(1,) --++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++ --++ # Returns: --++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++ # """ --++ # # 调用JIT编译的版本 --++ # return self.get_decode_one_tokens_logits( --++ # cur_token=cur_token, --++ # input_pos=input_pos, --++ # cache_position=cache_position, --++ # past_key_values=past_key_values, --++ # ) --++ --++ # @mindspore.jit(jit_level='O1') --++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++ # """ --++ # JIT编译的函数,用于高效的单token解码 --++ # 使用JIT编译优化以支持静态shape和高效执行 --++ --++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++ # """ --++ # outputs = self.model.forward( --++ # input_ids=cur_token, --++ # position_ids=input_pos, --++ # cache_position=cache_position, --++ # past_key_values=past_key_values, --++ # use_cache=True, --++ # return_dict=False, --++ # ) --++ --++ # hidden_states = outputs[0] --++ # logits = self.lm_head.forward(hidden_states) --++ # logits = logits.float() --++ --++ # return logits[:, -1, :] --++ --++ # def _sample( --++ # self, --++ # input_ids: mindspore.Tensor, --++ # logits_processor, --++ # stopping_criteria, --++ # generation_config, --++ # synced_devices: bool, --++ # streamer=None, --++ # logits_warper=None, --++ # **model_kwargs, --++ # ): --++ # """ --++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++ # """ --++ # from ...generation.logits_process import LogitsProcessorList --++ # from ...generation.stopping_criteria import StoppingCriteriaList --++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++ # from mindnlp.core import nn, ops, no_grad --++ # import numpy as np --++ --++ # # 检查是否使用 StaticCache --++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++ # # 否则,直接调用父类方法 --++ # past_key_values = model_kwargs.get("past_key_values") --++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++ --++ # if not isinstance(past_key_values, StaticCache): --++ # # 不使用 StaticCache,直接调用父类方法 --++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++ # return super()._sample( --++ # input_ids=input_ids, --++ # logits_processor=logits_processor, --++ # stopping_criteria=stopping_criteria, --++ # generation_config=generation_config, --++ # synced_devices=synced_devices, --++ # streamer=streamer, --++ # logits_warper=logits_warper, --++ # **model_kwargs, --++ # ) --++ --++ # # 使用 StaticCache,进入自定义循环 --++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++ # pad_token_id = generation_config._pad_token_tensor --++ # output_attentions = generation_config.output_attentions --++ # output_hidden_states = generation_config.output_hidden_states --++ # output_scores = generation_config.output_scores --++ # output_logits = generation_config.output_logits --++ # return_dict_in_generate = generation_config.return_dict_in_generate --++ # max_length = generation_config.max_length --++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++ # do_sample = generation_config.do_sample --++ --++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++ # raise ValueError( --++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++ # f"{logits_warper})." --++ # ) --++ --++ # # init attention / hidden states / scores tuples --++ # scores = () if (return_dict_in_generate and output_scores) else None --++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++ --++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++ # encoder_hidden_states = ( --++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++ # ) --++ --++ # # keep track of which sequences are already finished --++ # batch_size, cur_len = input_ids.shape --++ # this_peer_finished = False --++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++ --++ # time_record = [] --++ # from ....utils.testing_utils import parse_flag_from_env --++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++ --++ # while self._has_unfinished_sequences( --++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++ # ): --++ # if _record_time: --++ # import time as time_module --++ # infer_start = time_module.time() --++ --++ # # prepare model inputs --++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++ --++ # # prepare variable output controls --++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++ --++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++ # cur_cache_position = model_inputs.get("cache_position") --++ # cur_past_key_values = model_inputs.get("past_key_values") --++ # cur_input_ids = model_inputs.get("input_ids") --++ --++ # if (isinstance(cur_past_key_values, StaticCache) and --++ # cur_cache_position is not None and --++ # len(cur_cache_position.shape) > 0 and --++ # cur_cache_position.shape[0] == 1 and --++ # cur_input_ids is not None and --++ # cur_input_ids.shape[1] == 1): --++ # # 使用 JIT 优化的单 token 解码 --++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++ # if not hasattr(self, '_jit_used'): --++ # self._jit_used = False --++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++ --++ # next_token_logits = self.get_decode_one_tokens_logits( --++ # cur_token=cur_input_ids, --++ # input_pos=model_inputs.get("position_ids"), --++ # cache_position=cur_cache_position, --++ # past_key_values=cur_past_key_values, --++ # ) --++ --++ # # 标记已使用JIT(用于后续判断) --++ # if not self._jit_used: --++ # self._jit_used = True --++ --++ # # 构造兼容的输出对象 --++ # class JitOptimizedOutput: --++ # def __init__(self, logits, config): --++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++ # self.config = config --++ # # 对于 JIT 优化路径,这些属性通常不需要 --++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++ # self.attentions = None if not config.is_encoder_decoder else None --++ # self.cross_attentions = None --++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++ # self.hidden_states = None if not config.is_encoder_decoder else None --++ --++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++ # else: --++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++ # outputs = self(**model_inputs, return_dict=True) --++ --++ # if synced_devices and this_peer_finished: --++ # continue --++ --++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++ # next_token_logits = outputs.logits[:, -1, :] --++ --++ # # pre-process distribution --++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++ # if do_sample: --++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++ --++ # # Store scores, attentions and hidden_states when required --++ # if return_dict_in_generate: --++ # if output_scores: --++ # scores += (next_token_scores,) --++ # if output_logits: --++ # raw_logits += (next_token_logits,) --++ # if output_attentions: --++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++ # decoder_attentions += (attn,) if attn is not None else (None,) --++ # if self.config.is_encoder_decoder: --++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++ --++ # if output_hidden_states: --++ # hidden = ( --++ # outputs.decoder_hidden_states --++ # if self.config.is_encoder_decoder --++ # else outputs.hidden_states --++ # ) --++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++ --++ # # token selection --++ # if do_sample: --++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++ # else: --++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++ --++ # # finished sentences should have their next token be a padding token --++ # if has_eos_stopping_criteria: --++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++ --++ # # update generated ids, model inputs, and length for next step --++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++ # if streamer is not None: --++ # streamer.put(next_tokens) --++ --++ # model_kwargs = self._update_model_kwargs_for_generation( --++ # outputs, --++ # model_kwargs, --++ # is_encoder_decoder=self.config.is_encoder_decoder, --++ # ) --++ --++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++ # cur_len += 1 --++ --++ # if _record_time: --++ # import time as time_module --++ # infer_stop = time_module.time() --++ # time_record.append(infer_stop - infer_start) --++ --++ # del outputs --++ --++ # average_infer_time = None --++ # if time_record: --++ # if len(time_record) > 1: --++ # time_record.pop(0) --++ # average_infer_time = sum(time_record) / len(time_record) --++ # print(f'average inference time is: {average_infer_time}') --++ # print(f'inference time record: {time_record}') --++ --++ # if streamer is not None: --++ # streamer.end() --++ --++ # # 简单判断:打印是否使用了JIT路径 --++ # if hasattr(self, '_jit_used') and self._jit_used: --++ # print("[JIT] ✓ JIT optimization was used during generation") --++ # else: --++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++ --++ # if return_dict_in_generate: --++ # if self.config.is_encoder_decoder: --++ # return GenerateEncoderDecoderOutput( --++ # sequences=input_ids, --++ # scores=scores, --++ # logits=raw_logits, --++ # encoder_attentions=encoder_attentions, --++ # encoder_hidden_states=encoder_hidden_states, --++ # decoder_attentions=decoder_attentions, --++ # cross_attentions=cross_attentions, --++ # decoder_hidden_states=decoder_hidden_states, --++ # past_key_values=model_kwargs.get("past_key_values"), --++ # average_infer_time=average_infer_time --++ # ) --++ # else: --++ # return GenerateDecoderOnlyOutput( --++ # sequences=input_ids, --++ # scores=scores, --++ # logits=raw_logits, --++ # attentions=decoder_attentions, --++ # hidden_states=decoder_hidden_states, --++ # past_key_values=model_kwargs.get("past_key_values"), --++ # average_infer_time=average_infer_time --++ # ) --++ # else: --++ # return input_ids --++ --++ # def _prepare_cache_for_generation( --++ # self, --++ # generation_config, --++ # model_kwargs, --++ # assistant_model, --++ # batch_size, --++ # max_cache_length, --++ # ): --++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++ # generation_config.cache_implementation = "static" --++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++ --++ # if generation_config.cache_implementation == "static": --++ # base_required_from_max_length = generation_config.max_length + 1 --++ # base_required = max(max_cache_length, base_required_from_max_length) --++ # min_cache_size = 50 --++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++ # else: --++ # max_cache_length = max(base_required, min_cache_size) --++ --++ # original_max_cache_length = max_cache_length --++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++ # print(f" - input max_cache_length: {original_max_cache_length}") --++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++ # print(f" - final max_cache_length: {max_cache_length}") --++ --++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++ # if max_cache_length > self.config.max_position_embeddings: --++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++ --++ # result = super()._prepare_cache_for_generation( --++ # generation_config=generation_config, --++ # model_kwargs=model_kwargs, --++ # assistant_model=assistant_model, --++ # batch_size=batch_size, --++ # max_cache_length=max_cache_length, --++ # ) --++ --++ # if generation_config.cache_implementation == "static": --++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++ # created_cache = model_kwargs.get(cache_name) --++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++ # if created_cache.max_cache_len < generation_config.max_length: --++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++ --++ # return result --++ --++ --++ --+ --+ --+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+-- --+2.27.0 --+ --diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --new file mode 100644 --index 00000000..22b65dd5 ----- /dev/null --+++ b/patches/0002-20251106commit.patch --@@ -0,0 +1,3200 @@ --+From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Thu, 6 Nov 2025 09:20:38 +0800 --+Subject: [PATCH 2/3] 20251106commit --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- --+ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ --+ 3 files changed, 2689 insertions(+), 305 deletions(-) --+ create mode 100644 patches/0001-20251104commit.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index d8303e45..73773c22 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): --+ # y = y + self.shared_experts(identity) --+ # return y --+ --++ # @no_grad() --++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ --++ # expert_cache = ops.zeros_like(x) --++ # for i in range(self.num_experts_per_tok): --++ # expert_id = flat_expert_indices[i].item() --++ # weight = flat_expert_weights[i].item() --++ # expert = self.experts[expert_id] --++ # expert_out = expert(x) --++ # expert_cache += expert_out * weight --++ # return expert_cache --++ --+ @no_grad() --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ # x 的 shape: (1, hidden_size) --++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++ --++ # 1. 收集所有需要的专家层 --++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++ selected_experts = [self.experts[i] for i in flat_expert_indices] --++ --++ # 2. 并行计算所有专家的输出 --++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++ # ops.cat 会将它们堆叠成一个新的 Tensor --++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++ --++ # 3. 使用矩阵乘法进行加权求和 --++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ # 最终结果 final_output 的 shape: (1, hidden_size) --++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++ --++ return final_output --+ --+- expert_cache = ops.zeros_like(x) --+- for i in range(self.num_experts_per_tok): --+- expert_id = flat_expert_indices[i].item() --+- weight = flat_expert_weights[i].item() --+- expert = self.experts[expert_id] --+- expert_out = expert(x) --+- expert_cache += expert_out * weight --+- return expert_cache --+ --+ @no_grad() --+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++ # @lwx --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+ --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): --+ return attn_output, attn_weights, past_key_value --+ --+ --++# class DeepseekFlashAttention(nn.Module): --++# """ --++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --++ --++# This class is designed as a drop-in replacement for DeepseekAttention. --++# """ --++ --++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++# super().__init__() --++# self.config = config --++# self.layer_idx = layer_idx --++# if layer_idx is None: --++# logger.warning( --++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++# "when creating this class." --++# ) --++ --++# self.attention_dropout = config.attention_dropout --++# self.hidden_size = config.hidden_size --++# self.num_heads = config.num_attention_heads --++# self.head_dim = self.hidden_size // self.num_heads --++# self.num_key_value_heads = config.num_key_value_heads --++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++# self.max_position_embeddings = config.max_position_embeddings --++# self.rope_theta = config.rope_theta --++# self.is_causal = True --++ --++# if (self.head_dim * self.num_heads) != self.hidden_size: --++# raise ValueError( --++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++# f" and `num_heads`: {self.num_heads})." --++# ) --++ --++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++# self._init_rope() --++ --++# def _init_rope(self): --++# if self.config.rope_scaling is None: --++# self.rotary_emb = DeepseekRotaryEmbedding( --++# self.head_dim, --++# max_position_embeddings=self.max_position_embeddings, --++# base=self.rope_theta, --++# ) --++# else: --++# scaling_type = self.config.rope_scaling["type"] --++# scaling_factor = self.config.rope_scaling["factor"] --++# if scaling_type == "linear": --++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++# self.head_dim, --++# max_position_embeddings=self.max_position_embeddings, --++# scaling_factor=scaling_factor, --++# base=self.rope_theta, --++# ) --++# elif scaling_type == "dynamic": --++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++# self.head_dim, --++# max_position_embeddings=self.max_position_embeddings, --++# scaling_factor=scaling_factor, --++# base=self.rope_theta, --++# ) --++# else: --++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++ --++# def forward( --++# self, --++# hidden_states: mindspore.Tensor, --++# attention_mask: Optional[mindspore.Tensor] = None, --++# position_ids: Optional[mindspore.Tensor] = None, --++# past_key_value: Optional[Cache] = None, --++# output_attentions: bool = False, --++# use_cache: bool = False, --++# **kwargs, --++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++# if "padding_mask" in kwargs: --++# warnings.warn( --++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++# ) --++ --++# if output_attentions: --++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --++ --++# bsz, q_len, _ = hidden_states.shape --++ --++# if self.config.pretraining_tp > 1: --++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++ --++# query_states = self.q_proj(hidden_states) --++# key_states = self.k_proj(hidden_states) --++# value_states = self.v_proj(hidden_states) --++ --++# # Reshape for multi-head attention --++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++# kv_seq_len = key_states.shape[-2] --++# if past_key_value is not None: --++# if self.layer_idx is None: --++# raise ValueError( --++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++# "with a layer index." --++# ) --++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++# # Apply Rotary Positional Embedding --++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++# if past_key_value is not None: --++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ --++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++ --++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++ --++# # Convert attention_mask for flash_attention_score --++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --++# if attention_mask is not None: --++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++# raise ValueError( --++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++# ) --++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --++# else: --++# attn_mask_for_fa = None --++ --++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++ --++# # Call the fused flash_attention_score operator --++# attn_output = mindspore.ops.flash_attention_score( --++# query=query_states_for_fa, --++# key=key_states_for_fa, --++# value=value_states_for_fa, --++# head_num=self.num_heads, # This is N1, the number of query heads --++# input_layout='BSH', --++# attn_mask=attn_mask_for_fa, --++# keep_prob=keep_prob, --++# scalar_value=1.0 / math.sqrt(self.head_dim), --++# sparse_mode=0 # Default mask mode --++# ) --++ --++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --++# attn_output = self.o_proj(attn_output) --++ --++# # Flash Attention does not return attention weights --++# attn_weights = None --++ --++# return attn_output, attn_weights, past_key_value --++ --++class DeepseekFlashAttention(nn.Module): --++ """ --++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --++ This implementation is a drop-in replacement for the original DeepseekAttention class, --++ designed for high performance on supported hardware (Ascend). --++ --++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --++ """ --++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++ super().__init__() --++ self.config = config --++ self.layer_idx = layer_idx --++ if layer_idx is None: --++ logger.warning( --++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++ "when creating this class." --++ ) --++ --++ # --- [FIX] Correctly initialize all required attributes --- --++ self.attention_dropout = config.attention_dropout --++ self.hidden_size = config.hidden_size --++ self.num_heads = config.num_attention_heads --++ self.head_dim = self.hidden_size // self.num_heads --++ self.num_key_value_heads = config.num_key_value_heads --++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++ self.max_position_embeddings = config.max_position_embeddings --++ self.rope_theta = config.rope_theta --++ self.is_causal = True --++ --++ if (self.head_dim * self.num_heads) != self.hidden_size: --++ raise ValueError( --++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++ f" and `num_heads`: {self.num_heads})." --++ ) --++ --++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++ --++ # This call will now succeed as all attributes are initialized. --++ self._init_rope() --++ --++ def _init_rope(self): --++ if self.config.rope_scaling is None: --++ self.rotary_emb = DeepseekRotaryEmbedding( --++ self.head_dim, --++ max_position_embeddings=self.max_position_embeddings, --++ base=self.rope_theta, --++ ) --++ else: --++ scaling_type = self.config.rope_scaling["type"] --++ scaling_factor = self.config.rope_scaling["factor"] --++ if scaling_type == "linear": --++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++ self.head_dim, --++ max_position_embeddings=self.max_position_embeddings, --++ scaling_factor=scaling_factor, --++ base=self.rope_theta, --++ ) --++ elif scaling_type == "dynamic": --++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++ self.head_dim, --++ max_position_embeddings=self.max_position_embeddings, --++ scaling_factor=scaling_factor, --++ base=self.rope_theta, --++ ) --++ else: --++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++ --++ def forward( --++ self, --++ hidden_states: mindspore.Tensor, --++ attention_mask: Optional[mindspore.Tensor] = None, --++ position_ids: Optional[mindspore.Tensor] = None, --++ past_key_value: Optional[Cache] = None, --++ output_attentions: bool = False, --++ use_cache: bool = False, --++ **kwargs, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ if "padding_mask" in kwargs: --++ warnings.warn( --++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++ ) --++ if output_attentions: --++ warnings.warn( --++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --++ ) --++ --++ bsz, q_len, _ = hidden_states.shape --++ --++ if self.config.pretraining_tp > 1: --++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++ --++ query_states = self.q_proj(hidden_states) --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++ # Apply Rotary Position Embedding --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos} --++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --++ # So we must explicitly repeat the KV heads. --++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++ --++ # Convert attention mask for flash_attention_score --++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --++ if attention_mask is not None: --++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++ raise ValueError( --++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++ ) --++ attn_mask_for_fa = attention_mask < 0 --++ else: --++ attn_mask_for_fa = None --++ --++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++ --++ # Call the fused operator using the efficient BNSD layout --++ attn_output = mindspore.ops.flash_attention_score( --++ query=query_states, --++ key=key_states, --++ value=value_states, --++ head_num=self.num_heads, --++ input_layout='BNSD', # Specify the correct layout --++ attn_mask=attn_mask_for_fa, --++ keep_prob=keep_prob, --++ scalar_value=1.0 / math.sqrt(self.head_dim) --++ ) --++ --++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ --++ # Apply output projection --++ attn_output = self.o_proj(attn_output) --++ --++ # Flash attention does not return attention weights, so we return None. --++ attn_weights = None --++ --++ return attn_output, attn_weights, past_key_value --++ --+ Deepseek_ATTENTION_CLASSES = { --+ "eager": DeepseekAttention, --++ "flash-attention": DeepseekFlashAttention, --+ } --+ --+ --+@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): --+ config=config, layer_idx=layer_idx --+ ) --+ --++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --++ config=config, layer_idx=layer_idx --++ ) --++ --+ self.mlp = ( --+ DeepseekMoE(config) --+ if ( --+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+index d4c6b651..bced285c 100644 --+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union --+ --+ import mindspore --+ import mindnlp.core.nn.functional as F --+-from mindnlp.core import nn, ops --++from mindnlp.core import nn, ops, no_grad --+ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss --+ --+ from ....common.activations import ACT2FN --+@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) --+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+ --++Long_Prompt = False --++PROMPT_LENGTH_THRESHOLD = 128 --+ --+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+ def _prepare_4d_causal_attention_mask_with_cache_position( --+@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): --+ return attn_output, attn_weights, past_key_value --+ --+ --++# class Qwen2MoeFlashAttention(nn.Module): --++# """ --++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++ --++# 关键改动: --++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++# 直接传入原始的 key 和 value 张量效率更高。 --++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++# super().__init__() --++# self.config = config --++# self.layer_idx = layer_idx --++# self.hidden_size = config.hidden_size --++# self.num_heads = config.num_attention_heads --++# self.head_dim = self.hidden_size // self.num_heads --++# self.num_key_value_heads = config.num_key_value_heads --++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++# self.max_position_embeddings = config.max_position_embeddings --++# self.rope_theta = config.rope_theta --++# self.attention_dropout = config.attention_dropout --++ --++# if (self.head_dim * self.num_heads) != self.hidden_size: --++# raise ValueError( --++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++# ) --++ --++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++ --++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++# self.head_dim, --++# max_position_embeddings=self.max_position_embeddings, --++# base=self.rope_theta, --++# ) --++ --++# def forward( --++# self, --++# hidden_states: mindspore.Tensor, --++# attention_mask: Optional[mindspore.Tensor] = None, --++# position_ids: Optional[mindspore.Tensor] = None, --++# past_key_value: Optional[Cache] = None, --++# output_attentions: bool = False, --++# use_cache: bool = False, --++# cache_position: Optional[mindspore.Tensor] = None, --++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++# bsz, q_len, _ = hidden_states.shape --++ --++# # 1. 线性投射 Q, K, V --++# query_states = self.q_proj(hidden_states) --++# key_states = self.k_proj(hidden_states) --++# value_states = self.v_proj(hidden_states) --++ --++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++# # query: [B, S, H*D] -> [B, N1, S, D] --++# # key/val: [B, S, H2*D] -> [B, N2, S, D] --++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++# # 3. RoPE 旋转位置编码 --++# kv_seq_len = key_states.shape[-2] --++# if past_key_value is not None: --++# if self.layer_idx is None: --++# raise ValueError( --++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++# "with a layer index." --++# ) --++# # 对于 StaticCache,需要特殊处理 kv_seq_len --++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++# # 使用 cache_position 的长度来确定实际的 kv_seq_len --++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++# # 临时解决方案:使用 cache_position 的最大值(如果可能) --++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++# if cache_position.shape[0] == 1: --++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++# kv_seq_len = past_seen_tokens + 1 --++# else: --++# # prefill 阶段:cache_position 是范围,使用其长度 --++# kv_seq_len = cache_position.shape[0] + past_seen_tokens --++# else: --++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++# # 4. KV 缓存更新 --++# if past_key_value is not None: --++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++# key_states, value_states = past_key_value.update( --++# key_states, value_states, self.layer_idx, cache_kwargs --++# ) --++ --++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++# if cache_position.shape[0] == 1: --++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++# kv_seq_len = key_states.shape[-2] --++ --++# # 5. [重要] 准备 Attention Mask --++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++# fa_attention_mask = None --++# if attention_mask is not None: --++# # 截取与当前key长度匹配的部分 --++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++# # 转换为布尔类型: 大负数 -> True, 0 -> False --++# fa_attention_mask = (mask_slice != 0) --++ --++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++# input_dtype = query_states.dtype --++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++# query_states = query_states.to(mindspore.float16) --++# key_states = key_states.to(mindspore.float16) --++# value_states = value_states.to(mindspore.float16) --++ --++# # 6. [核心] 调用 flash_attention_score 算子 --++# # - 无需手动 repeat_kv, 算子原生支持 GQA --++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++# attn_output = mindspore.ops.flash_attention_score( --++# query=query_states, --++# key=key_states, --++# value=value_states, --++# head_num=self.num_heads, # 传入Q的头数(N1) --++# attn_mask=fa_attention_mask, --++# keep_prob=1.0 - self.attention_dropout, --++# scalar_value=1.0 / math.sqrt(self.head_dim), --++# input_layout="BNSD", --++# sparse_mode=0 # 使用 defaultMask 模式 --++# ) --++ --++# # 恢复原始数据类型 --++# attn_output = attn_output.to(input_dtype) --++ --++# # 7. 调整输出形状 --++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++# attn_output = self.o_proj(attn_output) --++ --++# # FlashAttention 算子不直接返回注意力权重矩阵 --++# attn_weights = None --++# if output_attentions: --++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++# return attn_output, attn_weights, past_key_value --++ --++# # def forward( --++# # self, --++# # hidden_states: mindspore.Tensor, --++# # attention_mask: Optional[mindspore.Tensor] = None, --++# # position_ids: Optional[mindspore.Tensor] = None, --++# # past_key_value: Optional[Cache] = None, --++# # output_attentions: bool = False, --++# # use_cache: bool = False, --++# # cache_position: Optional[mindspore.Tensor] = None, --++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++# # bsz, q_len, _ = hidden_states.shape --++ --++# # # 1. 线性投射 Q, K, V --++# # query_states = self.q_proj(hidden_states) --++# # key_states = self.k_proj(hidden_states) --++# # value_states = self.v_proj(hidden_states) --++ --++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --++# # # 3. RoPE 旋转位置编码 --++# # kv_seq_len = key_states.shape[-2] --++# # if past_key_value is not None: --++# # if self.layer_idx is None: --++# # raise ValueError( --++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++# # "with a layer index." --++# # ) --++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++# # # 4. KV 缓存更新 --++# # if past_key_value is not None: --++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++# # key_states, value_states = past_key_value.update( --++# # key_states, value_states, self.layer_idx, cache_kwargs --++# # ) --++ --++# # # 5. 准备 Attention Mask --++# # fa_attention_mask = None --++# # if attention_mask is not None: --++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++# # fa_attention_mask = (mask_slice != 0) --++ --++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++# # input_dtype = query_states.dtype --++ --++# # # 6. [核心] 调用 flash_attention_score 算子 --++# # attn_output = mindspore.ops.flash_attention_score( --++# # query=query_states, --++# # key=key_states, --++# # value=value_states, --++# # head_num=self.num_heads, --++# # attn_mask=fa_attention_mask, --++# # keep_prob=1.0 - self.attention_dropout, --++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++# # input_layout="BNSD", --++# # sparse_mode=0, --++# # # <--- 修改点 2: 启用内部高精度计算 --- --++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++# # inner_precise=1 --++# # ) --++ --++# # # 恢复原始数据类型 --++# # attn_output = attn_output.to(input_dtype) --++ --++# # # 7. 调整输出形状 --++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++# # attn_output = self.o_proj(attn_output) --++ --++# # attn_weights = None --++# # if output_attentions: --++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ --++# # return attn_output, attn_weights, past_key_value --++ --++ --+ class Qwen2MoeFlashAttention(nn.Module): --+ """ --+- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+- --+- 关键改动: --+- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+- 直接传入原始的 key 和 value 张量效率更高。 --+- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --++ --++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --++ 完全使用模型的低精度数据类型(如 float16)进行计算, --++ 以达到理论上的最高执行速度。 --+ """ --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+ super().__init__() --+ self.config = config --+ self.layer_idx = layer_idx --++ if layer_idx is None: --++ logger.warning_once( --++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --++ ) --++ --+ self.hidden_size = config.hidden_size --+ self.num_heads = config.num_attention_heads --+ self.head_dim = self.hidden_size // self.num_heads --+ self.num_key_value_heads = config.num_key_value_heads --+- self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+ self.max_position_embeddings = config.max_position_embeddings --+ self.rope_theta = config.rope_theta --+ self.attention_dropout = config.attention_dropout --+ --+- if (self.head_dim * self.num_heads) != self.hidden_size: --+- raise ValueError( --+- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+- ) --+- --+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+- # query: [B, S, H*D] -> [B, N1, S, D] --+- # key/val: [B, S, H2*D] -> [B, N2, S, D] --++ # 2. 调整形状以匹配 BNSD 布局 --+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- --+- # 3. RoPE 旋转位置编码 --++ --++ # 3. RoPE 和 KV 缓存 --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+- if self.layer_idx is None: --+- raise ValueError( --+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+- "with a layer index." --+- ) --+- # 对于 StaticCache,需要特殊处理 kv_seq_len --+- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+- # 使用 cache_position 的长度来确定实际的 kv_seq_len --+- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+- # 临时解决方案:使用 cache_position 的最大值(如果可能) --+- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+- if cache_position.shape[0] == 1: --+- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+- kv_seq_len = past_seen_tokens + 1 --+- else: --+- # prefill 阶段:cache_position 是范围,使用其长度 --+- kv_seq_len = cache_position.shape[0] + past_seen_tokens --+- else: --+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+- --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+- # 4. KV 缓存更新 --+ if past_key_value is not None: --+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+- key_states, value_states = past_key_value.update( --+- key_states, value_states, self.layer_idx, cache_kwargs --+- ) --+- --+- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+- if cache_position.shape[0] == 1: --+- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+- kv_seq_len = key_states.shape[-2] --+- --+- # 5. [重要] 准备 Attention Mask --+- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++ # 4. 准备 Attention Mask --+ fa_attention_mask = None --+ if attention_mask is not None: --+- # 截取与当前key长度匹配的部分 --+- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+- # 转换为布尔类型: 大负数 -> True, 0 -> False --+ fa_attention_mask = (mask_slice != 0) --+ --+- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+- input_dtype = query_states.dtype --+- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+- query_states = query_states.to(mindspore.float16) --+- key_states = key_states.to(mindspore.float16) --+- value_states = value_states.to(mindspore.float16) --+- --+- # 6. [核心] 调用 flash_attention_score 算子 --+- # - 无需手动 repeat_kv, 算子原生支持 GQA --+- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 --+ attn_output = mindspore.ops.flash_attention_score( --+ query=query_states, --+ key=key_states, --+ value=value_states, --+- head_num=self.num_heads, # 传入Q的头数(N1) --++ head_num=self.num_heads, --+ attn_mask=fa_attention_mask, --+- keep_prob=1.0 - self.attention_dropout, --++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout --+ scalar_value=1.0 / math.sqrt(self.head_dim), --+ input_layout="BNSD", --+- sparse_mode=0 # 使用 defaultMask 模式 --++ sparse_mode=0, --++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 --+ ) --+ --+- # 恢复原始数据类型 --+- attn_output = attn_output.to(input_dtype) --+- --+- # 7. 调整输出形状 --+- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++ # 6. 调整输出形状 --+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+ attn_output = self.o_proj(attn_output) --+ --+- # FlashAttention 算子不直接返回注意力权重矩阵 --++ # 7. 返回结果 --+ attn_weights = None --+ if output_attentions: --+- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+ --+ return attn_output, attn_weights, past_key_value --+ --+- # def forward( --+- # self, --+- # hidden_states: mindspore.Tensor, --+- # attention_mask: Optional[mindspore.Tensor] = None, --+- # position_ids: Optional[mindspore.Tensor] = None, --+- # past_key_value: Optional[Cache] = None, --+- # output_attentions: bool = False, --+- # use_cache: bool = False, --+- # cache_position: Optional[mindspore.Tensor] = None, --+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+- --+- # bsz, q_len, _ = hidden_states.shape --+- --+- # # 1. 线性投射 Q, K, V --+- # query_states = self.q_proj(hidden_states) --+- # key_states = self.k_proj(hidden_states) --+- # value_states = self.v_proj(hidden_states) --+- --+- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- --+- # # 3. RoPE 旋转位置编码 --+- # kv_seq_len = key_states.shape[-2] --+- # if past_key_value is not None: --+- # if self.layer_idx is None: --+- # raise ValueError( --+- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+- # "with a layer index." --+- # ) --+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+ --+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+- --+- # # 4. KV 缓存更新 --+- # if past_key_value is not None: --+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+- # key_states, value_states = past_key_value.update( --+- # key_states, value_states, self.layer_idx, cache_kwargs --+- # ) --+- --+- # # 5. 准备 Attention Mask --+- # fa_attention_mask = None --+- # if attention_mask is not None: --+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+- # fa_attention_mask = (mask_slice != 0) --+- --+- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+- # input_dtype = query_states.dtype --+- --+- # # 6. [核心] 调用 flash_attention_score 算子 --+- # attn_output = mindspore.ops.flash_attention_score( --+- # query=query_states, --+- # key=key_states, --+- # value=value_states, --+- # head_num=self.num_heads, --+- # attn_mask=fa_attention_mask, --+- # keep_prob=1.0 - self.attention_dropout, --+- # scalar_value=1.0 / math.sqrt(self.head_dim), --+- # input_layout="BNSD", --+- # sparse_mode=0, --+- # # <--- 修改点 2: 启用内部高精度计算 --- --+- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+- # inner_precise=1 --+- # ) --+- --+- # # 恢复原始数据类型 --+- # attn_output = attn_output.to(input_dtype) --++QWEN2MOE_ATTENTION_CLASSES = { --++ "eager": Qwen2MoeAttention, --++ "flash-attention": Qwen2MoeFlashAttention, --++} --+ --+- # # 7. 调整输出形状 --+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+- # attn_output = self.o_proj(attn_output) --+ --+- # attn_weights = None --+- # if output_attentions: --+- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# def __init__(self, config): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# # gating --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++ --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# #@dwj --++# # 只遍历激活的专家,而非全部专家 --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# num_tokens = hidden_states_reshaped.shape[0] --++ --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++# routing_weights = routing_weights.to(hidden_states.dtype) --++ --++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++# flat_selected_experts = selected_experts.flatten() --++ --++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++# token_indices = broadcasted_token_indices.flatten() --++ --++# active_experts = ops.unique(flat_selected_experts) --++ --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++ --++# mask = (flat_selected_experts == expert_idx_tensor) --++# selected_token_indices = token_indices[mask] --++# selected_routing_weights = routing_weights.flatten()[mask] --++ --++# current_states = hidden_states_reshaped[selected_token_indices] --++ --++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++ --++# final_hidden_states = final_hidden_states.index_add( --++# dim=0, --++# index=selected_token_indices, --++# source=expert_output.to(hidden_states.dtype) --++# ) --++ --++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+ --+- # return attn_output, attn_weights, past_key_value --++# final_hidden_states = final_hidden_states + shared_expert_output --++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --++ --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# """ --++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --++# `_moe_infer_prefill` (用于长序列处理) 方法。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# # 门控网络 --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# # 专家列表 --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++# # 共享专家 --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# @no_grad() --++# def _moe_infer_decode( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# """ --++# 【解码路径】针对 sequence_length=1 的极致优化。 --++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --++# """ --++# batch_size, hidden_dim = hidden_states.shape --++ --++# expert_outputs_list = [ --++# ops.cat([ --++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++# ], dim=0) --++# for i in range(batch_size) --++# ] --++ --++# # --- 错误修复:将 axis=0 修改为 dim=0 --- --++# # shape: (batch_size, top_k, hidden_dim) --++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++ --++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++ --++# return moe_output.squeeze(1) --++ --++# @no_grad() --++# def _moe_infer_prefill( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# """ --++# 【预填充路径】针对 sequence_length > 1 的优化。 --++# 按专家对 Token 进行分组,并进行批处理。 --++# """ --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens = hidden_states.shape[0] --++# flat_selected_experts = selected_experts.flatten() --++ --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++ --++# active_experts = ops.unique(flat_selected_experts) --++ --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++ --++# mask = (flat_selected_experts == expert_idx_tensor) --++# selected_token_indices = token_indices[mask] --++# selected_routing_weights = routing_weights.flatten()[mask] --++ --++# current_states = hidden_states[selected_token_indices] --++ --++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++ --++# moe_output = moe_output.index_add( --++# dim=0, --++# index=selected_token_indices, --++# source=expert_output.to(hidden_states.dtype) --++# ) --++# return moe_output --++ --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# """ --++# 顶层 forward 方法,作为智能分发器。 --++# """ --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++ --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+- # def forward( --+- # self, --+- # hidden_states: mindspore.Tensor, --+- # attention_mask: Optional[mindspore.Tensor] = None, --+- # position_ids: Optional[mindspore.Tensor] = None, --+- # past_key_value: Optional[Cache] = None, --+- # output_attentions: bool = False, --+- # use_cache: bool = False, --+- # cache_position: Optional[mindspore.Tensor] = None, --+- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+- --+- # bsz, q_len, _ = hidden_states.shape --+- --+- # query_states = self.q_proj(hidden_states) --+- # key_states = self.k_proj(hidden_states) --+- # value_states = self.v_proj(hidden_states) --+- --+- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- --+- # kv_seq_len = key_states.shape[-2] --+- # if past_key_value is not None: --+- # if self.layer_idx is None: --+- # raise ValueError("`layer_idx` must be specified for caching") --+- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+- --+- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+- --+- # if past_key_value is not None: --+- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+- # key_states, value_states = past_key_value.update( --+- # key_states, value_states, self.layer_idx, cache_kwargs --+- # ) --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++# routing_weights = routing_weights.to(hidden_states.dtype) --++ --++# moe_output = None --++# # 在推理时,根据序列长度选择最优路径 --++# if not self.training: --++# if sequence_length == 1: --++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++# else: --++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++# else: --++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --++# raise NotImplementedError("Training path is not implemented.") --++ --++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --++ --++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --++ --++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --++ --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# """ --++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# # 门控网络 --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# # 专家列表 --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++# # 共享专家 --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# @no_grad() --++# def _moe_infer_decode( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# batch_size, _ = hidden_states.shape --++# expert_outputs_list = [ --++# ops.cat([ --++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++# ], dim=0) --++# for i in range(batch_size) --++# ] --++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++# return moe_output.squeeze(1) --++ --++# @no_grad() --++# def _moe_infer_prefill( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens = hidden_states.shape[0] --++# flat_selected_experts = selected_experts.flatten() --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++# active_experts = ops.unique(flat_selected_experts) --++ --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++# mask = (flat_selected_experts == expert_idx_tensor) --++# selected_token_indices = token_indices[mask] --++# selected_routing_weights = routing_weights.flatten()[mask] --++# current_states = hidden_states[selected_token_indices] --++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++# moe_output = moe_output.index_add( --++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++# ) --++# return moe_output --++ --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# """ --++# 顶层 forward 方法,作为智能分发器。 --++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --++# """ --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++ --++# # 1. 门控计算 (通用逻辑) --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++# routing_weights = routing_weights.to(hidden_states.dtype) --++ --++# # 2. 智能分发到最优 MoE 路径 --++# moe_output = None --++# if not self.training: --++# if sequence_length == 1: --++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++# else: --++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++# else: --++# raise NotImplementedError("Training path is not implemented.") --++ --++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++ --++# # 4. 合并 MoE 输出和共享专家输出 --++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++ --++# # 5. 恢复原始形状并返回 --++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --++# prefill fastest --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# """ --++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# # 门控网络 --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# # 专家列表 --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++# # 共享专家 --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# @no_grad() --++# def _moe_infer_dispatch( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# """ --++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --++# """ --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens, _ = hidden_states.shape --++ --++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --++# flat_selected_experts = selected_experts.flatten() --++# flat_routing_weights = routing_weights.flatten() --+ --+- # key_states = repeat_kv(key_states, self.num_key_value_groups) --+- # value_states = repeat_kv(value_states, self.num_key_value_groups) --+- --+- # # <--- 核心修改点: 手动进行高精度缩放 --- --+- # # 在调用算子前,手动将 query_states 除以缩放因子。 --+- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+- # query_states = query_states / math.sqrt(self.head_dim) --+- # # <--- 修改结束 --- --+- --+- # fa_attention_mask = None --+- # if attention_mask is not None: --+- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+- # fa_attention_mask = (mask_slice != 0) --+- --+- # input_dtype = query_states.dtype --+- --+- # attn_output = mindspore.ops.flash_attention_score( --+- # query=query_states, # 传入已经预先缩放过的 query --+- # key=key_states, --+- # value=value_states, --+- # head_num=self.num_heads, --+- # attn_mask=fa_attention_mask, --+- # keep_prob=1.0 - self.attention_dropout, --+- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+- # input_layout="BNSD", --+- # sparse_mode=0, --+- # inner_precise=1 # 仍然保持内部高精度计算 --+- # ) --++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+ --+- # attn_output = attn_output.to(input_dtype) --+- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+- # attn_output = self.o_proj(attn_output) --++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --++# active_experts = ops.unique(flat_selected_experts) --++ --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++ --++# # 找到所有分配给该专家的 token --++# mask = (flat_selected_experts == expert_idx_tensor) --++ --++# # 使用 mask 选取对应的 token 和权重 --++# current_token_indices = token_indices[mask] --++# current_routing_weights = flat_routing_weights[mask] --++# current_hidden_states = hidden_states[current_token_indices] --++ --++# # 对这些 token 进行批处理 --++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++ --++# # 使用 index_add 将结果精确地加回到对应位置 --++# moe_output = moe_output.index_add( --++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --++# ) --++# return moe_output --++ --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# """ --++# 顶层 forward 方法,作为智能分发器。 --++# """ --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++ --++# # 1. 门控计算 --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++# routing_weights = routing_weights.to(hidden_states.dtype) --++ --++# # 2. 调用统一的 MoE 计算内核 --++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+ --+- # attn_weights = None --+- # if output_attentions: --+- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++# # 3. 统一处理共享专家 --++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++ --++# # 4. 合并输出 --++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++ --++# # 5. 恢复原始形状并返回 --++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --++ --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# """ --++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++# 【最终高性能与高精度版】: --++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --++# 3. 这样实现了速度和准确性的两全其美。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# @no_grad() --++# def _moe_infer_decode( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# """ --++# 【解码路径】极致优化版:bmm + 高精度累加。 --++# """ --++# original_dtype = hidden_states.dtype --++# batch_size, _ = hidden_states.shape --++ --++# expert_outputs_list = [ --++# ops.cat([ --++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++# ], dim=0) --++# for i in range(batch_size) --++# ] --++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++ --++# # 在 float32 下执行 bmm,得到高精度结果 --++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++ --++# # 将高精度结果转换回原始数据类型 --++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --++ --++# return moe_output --++ --++# @no_grad() --++# def _moe_infer_prefill( --++# self, --++# hidden_states: mindspore.Tensor, --++# selected_experts: mindspore.Tensor, --++# routing_weights: mindspore.Tensor --++# ) -> mindspore.Tensor: --++# """ --++# 【预填充路径】与原始实现一致,结果精确。 --++# """ --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens, _ = hidden_states.shape --++# flat_selected_experts = selected_experts.flatten() --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++# active_experts = ops.unique(flat_selected_experts) --++ --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++# mask = (flat_selected_experts == expert_idx_tensor) --++# selected_token_indices = token_indices[mask] --++# selected_routing_weights = routing_weights.flatten()[mask] --++# current_states = hidden_states[selected_token_indices] --++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++# moe_output = moe_output.index_add( --++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++# ) --++# return moe_output --++ --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++ --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+ --+- # return attn_output, attn_weights, past_key_value --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --++# # 如果模型主体是 float16,后续再转换 --++ --++# moe_output = None --++# if not self.training: --++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --++# # _moe_infer_decode 内部会处理好类型转换 --++# temp_routing_weights = routing_weights.to(hidden_states.dtype) --++# if sequence_length == 1: --++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --++# else: --++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --++# else: --++# raise NotImplementedError("Training path is not implemented.") --++ --++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++ --++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --+ --+-QWEN2MOE_ATTENTION_CLASSES = { --+- "eager": Qwen2MoeAttention, --+- "flash-attention": Qwen2MoeFlashAttention, --+-} --++# class Qwen2MoeSparseMoeBlock(nn.Module): --++# """ --++# 【融合版】一个混合专家模块,内置两种推理策略, --++# 由外部全局变量 `Long_Prompt` 控制: --++ --++# - if Long_Prompt is True: 【精度优先模式】 --++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --++# 适用于处理长序列,避免误差累积。 --++ --++# - if Long_Prompt is False: 【速度优先模式】 --++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --++# 在解码阶段获得极致速度,同时保证结果高度准确。 --++# """ --++# def __init__(self, config: Qwen2MoeConfig): --++# super().__init__() --++# self.num_experts = config.num_experts --++# self.top_k = config.num_experts_per_tok --++# self.norm_topk_prob = config.norm_topk_prob --++ --++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++# self.experts = nn.ModuleList( --++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++# ) --++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++# # --- 速度优先模式的辅助函数 --- --++# @no_grad() --++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++# original_dtype = hidden_states.dtype --++# batch_size, _ = hidden_states.shape --++# expert_outputs_list = [ --++# ops.cat([ --++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++# ], dim=0) --++# for i in range(batch_size) --++# ] --++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++# weights_fp32 = routing_weights.to(mindspore.float32) --++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++# return moe_output_fp32.squeeze(1).to(original_dtype) --++ --++# @no_grad() --++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens, _ = hidden_states.shape --++# flat_selected_experts = selected_experts.flatten() --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++# active_experts = ops.unique(flat_selected_experts) --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++# mask = (flat_selected_experts == expert_idx_tensor) --++# selected_token_indices = token_indices[mask] --++# selected_routing_weights = routing_weights.flatten()[mask] --++# current_states = hidden_states[selected_token_indices] --++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --++# return moe_output --++ --++# # --- 精度优先模式的辅助函数 --- --++# @no_grad() --++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++# moe_output = ops.zeros_like(hidden_states) --++# num_tokens, _ = hidden_states.shape --++# flat_selected_experts = selected_experts.flatten() --++# flat_routing_weights = routing_weights.flatten() --++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++# active_experts = ops.unique(flat_selected_experts) --++# for expert_idx_tensor in active_experts: --++# expert_idx = expert_idx_tensor.item() --++# expert_layer = self.experts[expert_idx] --++# mask = (flat_selected_experts == expert_idx_tensor) --++# current_token_indices = token_indices[mask] --++# current_routing_weights = flat_routing_weights[mask] --++# current_hidden_states = hidden_states[current_token_indices] --++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++# return moe_output --++ --++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++# # 声明我们将要使用一个在模块外部定义的全局变量 --++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --++# global Long_Prompt --++ --++# # 1. 门控计算 (所有模式通用) --++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++# router_logits = self.gate(hidden_states_reshaped) --++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++# if self.norm_topk_prob: --++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++# moe_output = None --++# if not self.training: --++# # 根据 Long_Prompt 标志选择模式 --++# if Long_Prompt: --++# # --- 精度优先模式 --- --++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++# else: --++# # --- 速度优先模式 --- --++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++# if sequence_length == 1: --++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --++# else: --++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --++# else: --++# raise NotImplementedError("Training path is not implemented.") --++ --++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++ --++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++ --++# return final_hidden_states, router_logits --++ --++class Qwen2MoeSparseMoeBlock(nn.Module): --++ """ --++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --++ 控制的顶级推理策略: --+ --++ - if Long_Prompt is True: 【精度优先模式】 --++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --++ 适用于需要严格可复现性的长序列任务。 --+ --+-class Qwen2MoeSparseMoeBlock(nn.Module): --+- def __init__(self, config): --++ - if Long_Prompt is False: 【速度优先模式】 --++ 采用业界最强的性能组合: --++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --++ """ --++ def __init__(self, config: Qwen2MoeConfig): --+ super().__init__() --+ self.num_experts = config.num_experts --+ self.top_k = config.num_experts_per_tok --+ self.norm_topk_prob = config.norm_topk_prob --+ --+- # gating --+ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+ self.experts = nn.ModuleList( --+ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+ ) --+- --+ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+ --+- #@dwj --+- # 只遍历激活的专家,而非全部专家 --+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+- batch_size, sequence_length, hidden_dim = hidden_states.shape --+- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+- num_tokens = hidden_states_reshaped.shape[0] --+- --+- router_logits = self.gate(hidden_states_reshaped) --+- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- --+- if self.norm_topk_prob: --+- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- routing_weights = routing_weights.to(hidden_states.dtype) --+- --+- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+- flat_selected_experts = selected_experts.flatten() --+- --+- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+- token_indices = broadcasted_token_indices.flatten() --+- --+- active_experts = ops.unique(flat_selected_experts) --+- --+- for expert_idx_tensor in active_experts: --+- expert_idx = expert_idx_tensor.item() --+- expert_layer = self.experts[expert_idx] --+- --+- mask = (flat_selected_experts == expert_idx_tensor) --+- selected_token_indices = token_indices[mask] --+- selected_routing_weights = routing_weights.flatten()[mask] --+- --+- current_states = hidden_states_reshaped[selected_token_indices] --+- --+- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+- --+- final_hidden_states = final_hidden_states.index_add( --+- dim=0, --+- index=selected_token_indices, --+- source=expert_output.to(hidden_states.dtype) --+- ) --+- --+- shared_expert_output = self.shared_expert(hidden_states_reshaped) --+- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --++ @no_grad() --++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++ original_dtype = hidden_states.dtype --++ batch_size, _ = hidden_states.shape --++ expert_outputs_list = [ --++ ops.cat([ --++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++ ], dim=0) --++ for i in range(batch_size) --++ ] --++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++ weights_fp32 = routing_weights.to(mindspore.float32) --++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++ return moe_output_fp32.squeeze(1).to(original_dtype) --++ --++ @no_grad() --++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++ num_tokens, _ = hidden_states.shape --++ flat_selected_experts = selected_experts.flatten() --++ sorted_expert_indices = flat_selected_experts.argsort() --++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++ original_token_indices = sorted_expert_indices // self.top_k --++ moe_output = ops.zeros_like(hidden_states) --++ current_token_offset = 0 --++ for i in range(self.num_experts): --++ expert_token_count = tokens_per_expert[i] - current_token_offset --++ if expert_token_count == 0: --++ continue --++ end_offset = current_token_offset + expert_token_count --++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++ expert_hidden_states = hidden_states[expert_original_token_indices] --++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++ current_token_offset += expert_token_count --++ return moe_output --++ --++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --++ @no_grad() --++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++ moe_output = ops.zeros_like(hidden_states) --++ num_tokens, _ = hidden_states.shape --++ flat_selected_experts = selected_experts.flatten() --++ flat_routing_weights = routing_weights.flatten() --++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++ active_experts = ops.unique(flat_selected_experts) --++ for expert_idx_tensor in active_experts: --++ expert_idx = expert_idx_tensor.item() --++ expert_layer = self.experts[expert_idx] --++ mask = (flat_selected_experts == expert_idx_tensor) --++ current_token_indices = token_indices[mask] --++ current_routing_weights = flat_routing_weights[mask] --++ current_hidden_states = hidden_states[current_token_indices] --++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++ return moe_output --+ --+- final_hidden_states = final_hidden_states + shared_expert_output --+- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+- --+- return final_hidden_states, router_logits --++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++ global Long_Prompt --++ --++ # 1. 门控计算 (所有模式通用) --++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++ router_logits = self.gate(hidden_states_reshaped) --++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++ if self.norm_topk_prob: --++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++ moe_output = None --++ if Long_Prompt: --++ # --- 精度优先模式 (ACCURACY MODE) --- --++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ else: --++ # --- 速度优先模式 (SPEED MODE) --- --++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++ if sequence_length == 1: --++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ else: --++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ --+ --++ # 3. 共享专家计算与合并 (所有模式通用) --++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++ --++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++ --++ return final_hidden_states, router_logits --+ --+ class Qwen2MoeDecoderLayer(nn.Module): --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+ super().__init__() --+ self.hidden_size = config.hidden_size --++ --++ # if Long_Prompt: --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ # else: --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+ --+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ --+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+- --+ if (layer_idx not in config.mlp_only_layers) and ( --+ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+ ): --+@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ self._warmed_up = True --+ self.warmup_moe_model() --+ --++ --++ --+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+ output_router_logits = ( --+ output_router_logits if output_router_logits is not None else self.config.output_router_logits --+@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ router_logits=outputs.router_logits, --+ ) --+ --++ def generate(self, *args, **kwargs): --++ """ --++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --++ """ --++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++ --++ input_ids = kwargs.get("input_ids") --++ if input_ids is None and args: --++ input_ids = args[0] --++ --++ if input_ids is not None: --++ prompt_length = input_ids.shape[1] --++ --++ if prompt_length > PROMPT_LENGTH_THRESHOLD: --++ Long_Prompt = True --++ else: --++ Long_Prompt = False --++ --++ return super().generate(*args, **kwargs) --++ --+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation --+ def prepare_inputs_for_generation( --+ self, --+@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens --+ # Exception 1: when passing input_embeds, input_ids may be missing entries --+ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --++ --+ if past_key_values is not None: --+ if inputs_embeds is not None: # Exception 1 --+ if 0 not in input_ids.shape: --+@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ } --+ ) --+ return model_inputs --++ --+ # @lwx --+ # def _decode_one_tokens_logits( --+ # self, --+@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): --+ attentions=outputs.attentions, --+ ) --+ --++ --+ __all__ = [ --+ "Qwen2MoeForCausalLM", --+ "Qwen2MoeModel", --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+new file mode 100644 --+index 00000000..6dfb5b93 --+--- /dev/null --++++ b/patches/0001-20251104commit.patch --+@@ -0,0 +1,1272 @@ --++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Tue, 4 Nov 2025 09:11:51 +0800 --++Subject: [PATCH] 20251104commit --++ --++--- --++ mindnlp/transformers/cache_utils.py | 28 +- --++ .../models/deepseek/modeling_deepseek.py | 149 ++- --++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++ 3 files changed, 976 insertions(+), 87 deletions(-) --++ --++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++index cadd2e04..02f8d4be 100644 --++--- a/mindnlp/transformers/cache_utils.py --+++++ b/mindnlp/transformers/cache_utils.py --++@@ -812,14 +812,26 @@ class StaticCache(Cache): --++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++ # k_out[:, :, cache_position] = key_states --++ # v_out[:, :, cache_position] = value_states --++- if ON_ORANGE_PI: --++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++- else: --++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++- --+++ # if ON_ORANGE_PI: --+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++ # else: --+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++ # 确保 cache_position 是 1D tensor 并且类型正确 --+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++ if cache_position.ndim > 1: --+++ cache_position = cache_position.flatten() --+++ # 确保类型是 int32 或 int64(MindSpore 要求) --+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++ cache_position = cache_position.int() --+++ --+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++ k_out[:, :, cache_position] = key_states --+++ v_out[:, :, cache_position] = value_states --+++ --++ return k_out, v_out --++ --++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index c695b944..d8303e45 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++- x1 = x[..., : x.shape[-1] // 2] --++- x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++ # x1 = x[..., : x.shape[-1] // 2] --+++ # x2 = x[..., x.shape[-1] // 2 :] --+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++ if self.training: --++ raise NotImplementedError("Training is not supported yet.") --++ else: --++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++- if self.config.n_shared_experts is not None: --++- y = y + self.shared_experts(identity) --++- return y --+++ # @lwx --+++ if orig_shape[1] == 1: --+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++ y=y.view(*orig_shape) --+++ if self.config.n_shared_experts is not None: --+++ y = y + self.shared_experts(identity) --+++ return y --+++ else: --+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++ if self.config.n_shared_experts is not None: --+++ y = y + self.shared_experts(identity) --+++ return y --+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++ # if self.config.n_shared_experts is not None: --+++ # y = y + self.shared_experts(identity) --+++ # return y --+++ --+++ @no_grad() --+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ --+++ expert_cache = ops.zeros_like(x) --+++ for i in range(self.num_experts_per_tok): --+++ expert_id = flat_expert_indices[i].item() --+++ weight = flat_expert_weights[i].item() --+++ expert = self.experts[expert_id] --+++ expert_out = expert(x) --+++ expert_cache += expert_out * weight --+++ return expert_cache --++ --++ @no_grad() --++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++- # expert_cache = torch.zeros_like(x) --++- # idxs = flat_expert_indices.argsort() --++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++- # token_idxs = idxs // self.num_experts_per_tok --++- # for i, end_idx in enumerate(tokens_per_expert): --++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++- # if start_idx == end_idx: --++- # continue --++- # expert = self.experts[i] --++- # exp_token_idx = token_idxs[start_idx:end_idx] --++- # expert_tokens = x[exp_token_idx] --++- # expert_out = expert(expert_tokens) --++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++- # return expert_cache --+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++ expert_cache = ops.zeros_like(x) --++ idxs = flat_expert_indices.argsort() --++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ token_idxs = idxs // self.num_experts_per_tok --+++ --++ for i, end_idx in enumerate(tokens_per_expert): --++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ if start_idx == end_idx: --++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++ expert_out = expert(expert_tokens) --++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --++ return expert_cache --+++ --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # # expert_cache = torch.zeros_like(x) --+++ # # idxs = flat_expert_indices.argsort() --+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++ # # token_idxs = idxs // self.num_experts_per_tok --+++ # # for i, end_idx in enumerate(tokens_per_expert): --+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++ # # if start_idx == end_idx: --+++ # # continue --+++ # # expert = self.experts[i] --+++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # # expert_tokens = x[exp_token_idx] --+++ # # expert_out = expert(expert_tokens) --+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++ # # return expert_cache --+++ # expert_cache = ops.zeros_like(x) --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # for i, end_idx in enumerate(tokens_per_expert): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # if start_idx == end_idx: --+++ # continue --+++ # expert = self.experts[i] --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = expert(expert_tokens) --+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --+++ # return expert_cache --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # expert_cache = ops.zeros_like(x) --+++ --+++ # # 排序保证顺序一致 --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # # 找出有 token 的专家 --+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++ --+++ # for i in active_experts.tolist(): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # end_idx = tokens_per_expert[i] --+++ # if start_idx == end_idx: # 没有 token --+++ # continue --+++ --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = self.experts[i](expert_tokens) --+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++ --+++ # expert_cache = mindspore.mint.scatter_add( --+++ # expert_cache, --+++ # 0, --+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++ # expert_out --+++ # ) --+++ --+++ # return expert_cache --+++ --+++ --++ --++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++ # """ --++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ --++ # Initialize weights and apply final processing --++ self.post_init() --+++ self.warm_up = False --+++ --+++ def warmup_moe_model_deep(self): --+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++ test_texts = [ --+++ "warmup short", --+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++ ] --+++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++ if tokenizer is None: --+++ from mindnlp.transformers import AutoTokenizer --+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++ self._warmup_tokenizer = tokenizer --+++ --+++ for text in test_texts: --+++ inputs = tokenizer(text, return_tensors="ms") --+++ with mindspore._no_grad(): --+++ _ = self(**inputs, use_cache=False) --+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++ --++ def get_input_embeddings(self): --++ return self.model.embed_tokens --++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++ ```""" --+++ if not self.warm_up: --+++ self.warm_up = True --+++ self.warmup_moe_model_deep() --+++ --++ output_attentions = ( --++ output_attentions --++ if output_attentions is not None --++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++index 3cbf820e..d4c6b651 100644 --++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++@@ -18,7 +18,6 @@ --++ # See the License for the specific language governing permissions and --++ # limitations under the License. --++ """MindSpore Qwen2MoE model.""" --++- --++ import math --++ from typing import List, Optional, Tuple, Union --++ --++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++ TokenClassifierOutput, --++ ) --++ from ...modeling_utils import PreTrainedModel --+++from ...generation import GenerationMixin --++ from ....utils import logging --++ from .configuration_qwen2_moe import Qwen2MoeConfig --++ --++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++ self.variance_epsilon = eps --++ --++ def forward(self, hidden_states): --+++ # @dwj --+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++ # @lwx --+++ # if not self.training : --+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++ input_dtype = hidden_states.dtype --++ hidden_states = hidden_states.to(mindspore.float32) --++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++@@ -234,6 +239,8 @@ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++ x1 = x[..., : x.shape[-1] // 2] --++ x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++ self.config = config --++ self.hidden_size = config.hidden_size --++ self.intermediate_size = intermediate_size --+++ --++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++ self.act_fn = ACT2FN[config.hidden_act] --++ --++ def forward(self, x): --++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++- --++ --+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++ # @lwx --+++ # gate_up_output = self.gate_up_proj(x) --+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++ # return self.down_proj(swiglu_output) --+++ --+++ # def forward(self, x): --+++ # gate_proj_out = self.gate_proj(x) --+++ # up_proj_out = self.up_proj(x) --+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++ # return self.down_proj(swiglu_out) --+++ --++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++ """ --++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++ use_cache: bool = False, --++ cache_position: Optional[mindspore.Tensor] = None, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ --+++ --++ bsz, q_len, _ = hidden_states.shape --++ --++ query_states = self.q_proj(hidden_states) --++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ "with a layer index." --++ ) --++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ if isinstance(past_key_value, StaticCache): --+++ kv_seq_len = key_states.shape[-2] --+++ else: --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++ if isinstance(past_key_value, StaticCache): --+++ kv_seq_len = key_states.shape[-2] --++ --++ # repeat k/v heads if n_kv_heads < n_heads --++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++- --+++ --++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++ --++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++- raise ValueError( --++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++- f" {attn_weights.shape}" --++- ) --++- --++- if attention_mask is not None: # no matter the length, we just slice it --++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++ if attention_mask is not None: --+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++ attn_weights = attn_weights + causal_mask --++ --++ # upcast attention to fp32 --++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++ --++ attn_output = self.o_proj(attn_output) --++- --+++ # @lwx --+++ --+++ # max_seq_len = self.max_position_embeddings # 2048 --+++ --+++ # if attention_mask is not None: --+++ # # attention_mask: [B, 1, Sq, Sk] --+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++ --+++ # # pad 到 [max_seq_len, max_seq_len] --+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++ # global_attention_mask = padded_mask --+++ # else: --+++ # global_attention_mask = None --+++ --+++ --+++ # sparse_mode=3 --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, --+++ # key=key_states, --+++ # value=value_states, --+++ # real_shift=None, --+++ # padding_mask=None, --+++ --+++ # head_num=self.num_heads, --+++ # attn_mask=global_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++ # input_layout="BNSD", --+++ # pre_tokens=2147483647, --+++ # next_tokens=2147483647, --+++ # inner_precise=0, --+++ # drop_mask=None, --+++ # prefix=None, --+++ # actual_seq_qlen=None, --+++ # actual_seq_kvlen=None, --+++ # sparse_mode=sparse_mode, --+++ # ) --++ if not output_attentions: --++ attn_weights = None --++ --++ return attn_output, attn_weights, past_key_value --++ --++ --+++class Qwen2MoeFlashAttention(nn.Module): --+++ """ --+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++ --+++ 关键改动: --+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++ 直接传入原始的 key 和 value 张量效率更高。 --+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++ """ --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++ super().__init__() --+++ self.config = config --+++ self.layer_idx = layer_idx --+++ self.hidden_size = config.hidden_size --+++ self.num_heads = config.num_attention_heads --+++ self.head_dim = self.hidden_size // self.num_heads --+++ self.num_key_value_heads = config.num_key_value_heads --+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++ self.max_position_embeddings = config.max_position_embeddings --+++ self.rope_theta = config.rope_theta --+++ self.attention_dropout = config.attention_dropout --+++ --+++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++ raise ValueError( --+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++ ) --+++ --+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++ --+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++ self.head_dim, --+++ max_position_embeddings=self.max_position_embeddings, --+++ base=self.rope_theta, --+++ ) --+++ --+++ def forward( --+++ self, --+++ hidden_states: mindspore.Tensor, --+++ attention_mask: Optional[mindspore.Tensor] = None, --+++ position_ids: Optional[mindspore.Tensor] = None, --+++ past_key_value: Optional[Cache] = None, --+++ output_attentions: bool = False, --+++ use_cache: bool = False, --+++ cache_position: Optional[mindspore.Tensor] = None, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ # 1. 线性投射 Q, K, V --+++ query_states = self.q_proj(hidden_states) --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++ # query: [B, S, H*D] -> [B, N1, S, D] --+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # 3. RoPE 旋转位置编码 --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++ if self.layer_idx is None: --+++ raise ValueError( --+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ "with a layer index." --+++ ) --+++ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++ if cache_position.shape[0] == 1: --+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++ kv_seq_len = past_seen_tokens + 1 --+++ else: --+++ # prefill 阶段:cache_position 是范围,使用其长度 --+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++ else: --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # 4. KV 缓存更新 --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ key_states, value_states = past_key_value.update( --+++ key_states, value_states, self.layer_idx, cache_kwargs --+++ ) --+++ --+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++ if cache_position.shape[0] == 1: --+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++ kv_seq_len = key_states.shape[-2] --+++ --+++ # 5. [重要] 准备 Attention Mask --+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++ fa_attention_mask = None --+++ if attention_mask is not None: --+++ # 截取与当前key长度匹配的部分 --+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++ fa_attention_mask = (mask_slice != 0) --+++ --+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++ input_dtype = query_states.dtype --+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++ query_states = query_states.to(mindspore.float16) --+++ key_states = key_states.to(mindspore.float16) --+++ value_states = value_states.to(mindspore.float16) --+++ --+++ # 6. [核心] 调用 flash_attention_score 算子 --+++ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++ attn_output = mindspore.ops.flash_attention_score( --+++ query=query_states, --+++ key=key_states, --+++ value=value_states, --+++ head_num=self.num_heads, # 传入Q的头数(N1) --+++ attn_mask=fa_attention_mask, --+++ keep_prob=1.0 - self.attention_dropout, --+++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++ input_layout="BNSD", --+++ sparse_mode=0 # 使用 defaultMask 模式 --+++ ) --+++ --+++ # 恢复原始数据类型 --+++ attn_output = attn_output.to(input_dtype) --+++ --+++ # 7. 调整输出形状 --+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ attn_output = self.o_proj(attn_output) --+++ --+++ # FlashAttention 算子不直接返回注意力权重矩阵 --+++ attn_weights = None --+++ if output_attentions: --+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++ # def forward( --+++ # self, --+++ # hidden_states: mindspore.Tensor, --+++ # attention_mask: Optional[mindspore.Tensor] = None, --+++ # position_ids: Optional[mindspore.Tensor] = None, --+++ # past_key_value: Optional[Cache] = None, --+++ # output_attentions: bool = False, --+++ # use_cache: bool = False, --+++ # cache_position: Optional[mindspore.Tensor] = None, --+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ # bsz, q_len, _ = hidden_states.shape --+++ --+++ # # 1. 线性投射 Q, K, V --+++ # query_states = self.q_proj(hidden_states) --+++ # key_states = self.k_proj(hidden_states) --+++ # value_states = self.v_proj(hidden_states) --+++ --+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # # 3. RoPE 旋转位置编码 --+++ # kv_seq_len = key_states.shape[-2] --+++ # if past_key_value is not None: --+++ # if self.layer_idx is None: --+++ # raise ValueError( --+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ # "with a layer index." --+++ # ) --+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # # 4. KV 缓存更新 --+++ # if past_key_value is not None: --+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ # key_states, value_states = past_key_value.update( --+++ # key_states, value_states, self.layer_idx, cache_kwargs --+++ # ) --+++ --+++ # # 5. 准备 Attention Mask --+++ # fa_attention_mask = None --+++ # if attention_mask is not None: --+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # fa_attention_mask = (mask_slice != 0) --+++ --+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++ # input_dtype = query_states.dtype --+++ --+++ # # 6. [核心] 调用 flash_attention_score 算子 --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, --+++ # key=key_states, --+++ # value=value_states, --+++ # head_num=self.num_heads, --+++ # attn_mask=fa_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++ # input_layout="BNSD", --+++ # sparse_mode=0, --+++ # # <--- 修改点 2: 启用内部高精度计算 --- --+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++ # inner_precise=1 --+++ # ) --+++ --+++ # # 恢复原始数据类型 --+++ # attn_output = attn_output.to(input_dtype) --+++ --+++ # # 7. 调整输出形状 --+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ # attn_output = self.o_proj(attn_output) --+++ --+++ # attn_weights = None --+++ # if output_attentions: --+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++ # return attn_output, attn_weights, past_key_value --+++ --+++ # def forward( --+++ # self, --+++ # hidden_states: mindspore.Tensor, --+++ # attention_mask: Optional[mindspore.Tensor] = None, --+++ # position_ids: Optional[mindspore.Tensor] = None, --+++ # past_key_value: Optional[Cache] = None, --+++ # output_attentions: bool = False, --+++ # use_cache: bool = False, --+++ # cache_position: Optional[mindspore.Tensor] = None, --+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ # bsz, q_len, _ = hidden_states.shape --+++ --+++ # query_states = self.q_proj(hidden_states) --+++ # key_states = self.k_proj(hidden_states) --+++ # value_states = self.v_proj(hidden_states) --+++ --+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # kv_seq_len = key_states.shape[-2] --+++ # if past_key_value is not None: --+++ # if self.layer_idx is None: --+++ # raise ValueError("`layer_idx` must be specified for caching") --+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # if past_key_value is not None: --+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ # key_states, value_states = past_key_value.update( --+++ # key_states, value_states, self.layer_idx, cache_kwargs --+++ # ) --+++ --+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++ --+++ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++ # query_states = query_states / math.sqrt(self.head_dim) --+++ # # <--- 修改结束 --- --+++ --+++ # fa_attention_mask = None --+++ # if attention_mask is not None: --+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # fa_attention_mask = (mask_slice != 0) --+++ --+++ # input_dtype = query_states.dtype --+++ --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, # 传入已经预先缩放过的 query --+++ # key=key_states, --+++ # value=value_states, --+++ # head_num=self.num_heads, --+++ # attn_mask=fa_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++ # input_layout="BNSD", --+++ # sparse_mode=0, --+++ # inner_precise=1 # 仍然保持内部高精度计算 --+++ # ) --+++ --+++ # attn_output = attn_output.to(input_dtype) --+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ # attn_output = self.o_proj(attn_output) --+++ --+++ # attn_weights = None --+++ # if output_attentions: --+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++ --+++ # return attn_output, attn_weights, past_key_value --+++ --++ QWEN2MOE_ATTENTION_CLASSES = { --++ "eager": Qwen2MoeAttention, --+++ "flash-attention": Qwen2MoeFlashAttention, --++ } --++ --++ --++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --+++ #@dwj --+++ # 只遍历激活的专家,而非全部专家 --++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++- hidden_states = hidden_states.view(-1, hidden_dim) --++- # router_logits: (batch * sequence_length, n_experts) --++- router_logits = self.gate(hidden_states) --++- --++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- if self.norm_topk_prob: --++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- # we cast back to the input dtype --++- routing_weights = routing_weights.to(hidden_states.dtype) --++- --++- final_hidden_states = ops.zeros( --++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++- ) --++- --++- # One hot encode the selected experts to create an expert mask --++- # this will be used to easily index which expert is going to be sollicitated --++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++- --++- # Loop over all available experts in the model and perform the computation on each expert --++- for expert_idx in range(self.num_experts): --++- expert_layer = self.experts[expert_idx] --++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++- --++- # Index the correct hidden states and compute the expert hidden state for --++- # the current expert. We need to make sure to multiply the output hidden --++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++- if 0 not in idx.shape: --++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++- --++- # However `index_add_` only support torch tensors for indexing so we'll use --++- # the `top_x` tensor here. --++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++- --++- shared_expert_output = self.shared_expert(hidden_states) --++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++- --++- final_hidden_states = final_hidden_states + shared_expert_output --+++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++ num_tokens = hidden_states_reshaped.shape[0] --+++ --+++ router_logits = self.gate(hidden_states_reshaped) --+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++ if self.norm_topk_prob: --+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++ flat_selected_experts = selected_experts.flatten() --+++ --+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++ token_indices = broadcasted_token_indices.flatten() --+++ --+++ active_experts = ops.unique(flat_selected_experts) --+++ --+++ for expert_idx_tensor in active_experts: --+++ expert_idx = expert_idx_tensor.item() --+++ expert_layer = self.experts[expert_idx] --+++ --+++ mask = (flat_selected_experts == expert_idx_tensor) --+++ selected_token_indices = token_indices[mask] --+++ selected_routing_weights = routing_weights.flatten()[mask] --+++ --+++ current_states = hidden_states_reshaped[selected_token_indices] --+++ --+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++ --+++ final_hidden_states = final_hidden_states.index_add( --+++ dim=0, --+++ index=selected_token_indices, --+++ source=expert_output.to(hidden_states.dtype) --+++ ) --+++ --+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++ --++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++- return final_hidden_states, router_logits --+++ final_hidden_states = final_hidden_states + shared_expert_output --+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++ --+++ return final_hidden_states, router_logits --++ --++ --++ class Qwen2MoeDecoderLayer(nn.Module): --++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++ --++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++ --++ if (layer_idx not in config.mlp_only_layers) and ( --++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++ ): --++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --++ _skip_keys_device_placement = "past_key_values" --++ _supports_cache_class = True --+++#lwx --+++ # _supports_static_cache = True --++ --++ def _init_weights(self, module): --++ std = self.config.initializer_range --++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++ return causal_mask --++ --++ --++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ _tied_weights_keys = ["lm_head.weight"] --++ --++ def __init__(self, config): --++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ self.num_experts_per_tok = config.num_experts_per_tok --++ # Initialize weights and apply final processing --++ self.post_init() --+++ # @lwx --+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++ # self.generation_config.cache_implementation = "static" --+++ self._warmed_up = False --+++ --+++ def warmup_moe_model(self): --+++ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++ test_texts = [ --+++ "warmup short", --+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++ ] --+++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++ if tokenizer is None: --+++ from mindnlp.transformers import AutoTokenizer --+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++ self._warmup_tokenizer = tokenizer --+++ --+++ for text in test_texts: --+++ inputs = tokenizer(text, return_tensors="ms") --+++ with mindspore._no_grad(): --+++ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++ print("[Warmup] Qwen2-MoE 模型预热完成。") --++ --++ def get_input_embeddings(self): --++ return self.model.embed_tokens --++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++ ```""" --+++ if not self._warmed_up: --+++ self._warmed_up = True --+++ self.warmup_moe_model() --++ --++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++ output_router_logits = ( --++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ } --++ ) --++ return model_inputs --+++# @lwx --+++ # def _decode_one_tokens_logits( --+++ # self, --+++ # cur_token: mindspore.Tensor, --+++ # input_pos: Optional[mindspore.Tensor], --+++ # cache_position: mindspore.Tensor, --+++ # past_key_values: StaticCache, --+++ # ) -> mindspore.Tensor: --+++ # """ --+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++ --+++ # Args: --+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++ # input_pos: 输入位置信息,可选 --+++ # cache_position: 当前token在cache中的位置,shape为(1,) --+++ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++ --+++ # Returns: --+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++ # """ --+++ # # 调用JIT编译的版本 --+++ # return self.get_decode_one_tokens_logits( --+++ # cur_token=cur_token, --+++ # input_pos=input_pos, --+++ # cache_position=cache_position, --+++ # past_key_values=past_key_values, --+++ # ) --+++ --+++ # @mindspore.jit(jit_level='O1') --+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++ # """ --+++ # JIT编译的函数,用于高效的单token解码 --+++ # 使用JIT编译优化以支持静态shape和高效执行 --+++ --+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++ # """ --+++ # outputs = self.model.forward( --+++ # input_ids=cur_token, --+++ # position_ids=input_pos, --+++ # cache_position=cache_position, --+++ # past_key_values=past_key_values, --+++ # use_cache=True, --+++ # return_dict=False, --+++ # ) --+++ --+++ # hidden_states = outputs[0] --+++ # logits = self.lm_head.forward(hidden_states) --+++ # logits = logits.float() --+++ --+++ # return logits[:, -1, :] --+++ --+++ # def _sample( --+++ # self, --+++ # input_ids: mindspore.Tensor, --+++ # logits_processor, --+++ # stopping_criteria, --+++ # generation_config, --+++ # synced_devices: bool, --+++ # streamer=None, --+++ # logits_warper=None, --+++ # **model_kwargs, --+++ # ): --+++ # """ --+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++ # """ --+++ # from ...generation.logits_process import LogitsProcessorList --+++ # from ...generation.stopping_criteria import StoppingCriteriaList --+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++ # from mindnlp.core import nn, ops, no_grad --+++ # import numpy as np --+++ --+++ # # 检查是否使用 StaticCache --+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++ # # 否则,直接调用父类方法 --+++ # past_key_values = model_kwargs.get("past_key_values") --+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++ --+++ # if not isinstance(past_key_values, StaticCache): --+++ # # 不使用 StaticCache,直接调用父类方法 --+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++ # return super()._sample( --+++ # input_ids=input_ids, --+++ # logits_processor=logits_processor, --+++ # stopping_criteria=stopping_criteria, --+++ # generation_config=generation_config, --+++ # synced_devices=synced_devices, --+++ # streamer=streamer, --+++ # logits_warper=logits_warper, --+++ # **model_kwargs, --+++ # ) --+++ --+++ # # 使用 StaticCache,进入自定义循环 --+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++ # pad_token_id = generation_config._pad_token_tensor --+++ # output_attentions = generation_config.output_attentions --+++ # output_hidden_states = generation_config.output_hidden_states --+++ # output_scores = generation_config.output_scores --+++ # output_logits = generation_config.output_logits --+++ # return_dict_in_generate = generation_config.return_dict_in_generate --+++ # max_length = generation_config.max_length --+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++ # do_sample = generation_config.do_sample --+++ --+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++ # raise ValueError( --+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++ # f"{logits_warper})." --+++ # ) --+++ --+++ # # init attention / hidden states / scores tuples --+++ # scores = () if (return_dict_in_generate and output_scores) else None --+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++ --+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++ # encoder_hidden_states = ( --+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++ # ) --+++ --+++ # # keep track of which sequences are already finished --+++ # batch_size, cur_len = input_ids.shape --+++ # this_peer_finished = False --+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++ --+++ # time_record = [] --+++ # from ....utils.testing_utils import parse_flag_from_env --+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++ --+++ # while self._has_unfinished_sequences( --+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++ # ): --+++ # if _record_time: --+++ # import time as time_module --+++ # infer_start = time_module.time() --+++ --+++ # # prepare model inputs --+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++ --+++ # # prepare variable output controls --+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++ --+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++ # cur_cache_position = model_inputs.get("cache_position") --+++ # cur_past_key_values = model_inputs.get("past_key_values") --+++ # cur_input_ids = model_inputs.get("input_ids") --+++ --+++ # if (isinstance(cur_past_key_values, StaticCache) and --+++ # cur_cache_position is not None and --+++ # len(cur_cache_position.shape) > 0 and --+++ # cur_cache_position.shape[0] == 1 and --+++ # cur_input_ids is not None and --+++ # cur_input_ids.shape[1] == 1): --+++ # # 使用 JIT 优化的单 token 解码 --+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++ # if not hasattr(self, '_jit_used'): --+++ # self._jit_used = False --+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++ --+++ # next_token_logits = self.get_decode_one_tokens_logits( --+++ # cur_token=cur_input_ids, --+++ # input_pos=model_inputs.get("position_ids"), --+++ # cache_position=cur_cache_position, --+++ # past_key_values=cur_past_key_values, --+++ # ) --+++ --+++ # # 标记已使用JIT(用于后续判断) --+++ # if not self._jit_used: --+++ # self._jit_used = True --+++ --+++ # # 构造兼容的输出对象 --+++ # class JitOptimizedOutput: --+++ # def __init__(self, logits, config): --+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++ # self.config = config --+++ # # 对于 JIT 优化路径,这些属性通常不需要 --+++ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++ # self.attentions = None if not config.is_encoder_decoder else None --+++ # self.cross_attentions = None --+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++ # self.hidden_states = None if not config.is_encoder_decoder else None --+++ --+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++ # else: --+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++ # outputs = self(**model_inputs, return_dict=True) --+++ --+++ # if synced_devices and this_peer_finished: --+++ # continue --+++ --+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++ # next_token_logits = outputs.logits[:, -1, :] --+++ --+++ # # pre-process distribution --+++ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++ # if do_sample: --+++ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++ --+++ # # Store scores, attentions and hidden_states when required --+++ # if return_dict_in_generate: --+++ # if output_scores: --+++ # scores += (next_token_scores,) --+++ # if output_logits: --+++ # raw_logits += (next_token_logits,) --+++ # if output_attentions: --+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++ # decoder_attentions += (attn,) if attn is not None else (None,) --+++ # if self.config.is_encoder_decoder: --+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++ --+++ # if output_hidden_states: --+++ # hidden = ( --+++ # outputs.decoder_hidden_states --+++ # if self.config.is_encoder_decoder --+++ # else outputs.hidden_states --+++ # ) --+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++ --+++ # # token selection --+++ # if do_sample: --+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++ # else: --+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++ --+++ # # finished sentences should have their next token be a padding token --+++ # if has_eos_stopping_criteria: --+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++ --+++ # # update generated ids, model inputs, and length for next step --+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++ # if streamer is not None: --+++ # streamer.put(next_tokens) --+++ --+++ # model_kwargs = self._update_model_kwargs_for_generation( --+++ # outputs, --+++ # model_kwargs, --+++ # is_encoder_decoder=self.config.is_encoder_decoder, --+++ # ) --+++ --+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++ # cur_len += 1 --+++ --+++ # if _record_time: --+++ # import time as time_module --+++ # infer_stop = time_module.time() --+++ # time_record.append(infer_stop - infer_start) --+++ --+++ # del outputs --+++ --+++ # average_infer_time = None --+++ # if time_record: --+++ # if len(time_record) > 1: --+++ # time_record.pop(0) --+++ # average_infer_time = sum(time_record) / len(time_record) --+++ # print(f'average inference time is: {average_infer_time}') --+++ # print(f'inference time record: {time_record}') --+++ --+++ # if streamer is not None: --+++ # streamer.end() --+++ --+++ # # 简单判断:打印是否使用了JIT路径 --+++ # if hasattr(self, '_jit_used') and self._jit_used: --+++ # print("[JIT] ✓ JIT optimization was used during generation") --+++ # else: --+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++ --+++ # if return_dict_in_generate: --+++ # if self.config.is_encoder_decoder: --+++ # return GenerateEncoderDecoderOutput( --+++ # sequences=input_ids, --+++ # scores=scores, --+++ # logits=raw_logits, --+++ # encoder_attentions=encoder_attentions, --+++ # encoder_hidden_states=encoder_hidden_states, --+++ # decoder_attentions=decoder_attentions, --+++ # cross_attentions=cross_attentions, --+++ # decoder_hidden_states=decoder_hidden_states, --+++ # past_key_values=model_kwargs.get("past_key_values"), --+++ # average_infer_time=average_infer_time --+++ # ) --+++ # else: --+++ # return GenerateDecoderOnlyOutput( --+++ # sequences=input_ids, --+++ # scores=scores, --+++ # logits=raw_logits, --+++ # attentions=decoder_attentions, --+++ # hidden_states=decoder_hidden_states, --+++ # past_key_values=model_kwargs.get("past_key_values"), --+++ # average_infer_time=average_infer_time --+++ # ) --+++ # else: --+++ # return input_ids --+++ --+++ # def _prepare_cache_for_generation( --+++ # self, --+++ # generation_config, --+++ # model_kwargs, --+++ # assistant_model, --+++ # batch_size, --+++ # max_cache_length, --+++ # ): --+++ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++ # generation_config.cache_implementation = "static" --+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++ --+++ # if generation_config.cache_implementation == "static": --+++ # base_required_from_max_length = generation_config.max_length + 1 --+++ # base_required = max(max_cache_length, base_required_from_max_length) --+++ # min_cache_size = 50 --+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++ # else: --+++ # max_cache_length = max(base_required, min_cache_size) --+++ --+++ # original_max_cache_length = max_cache_length --+++ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++ # print(f" - input max_cache_length: {original_max_cache_length}") --+++ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++ # print(f" - final max_cache_length: {max_cache_length}") --+++ --+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++ # if max_cache_length > self.config.max_position_embeddings: --+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++ --+++ # result = super()._prepare_cache_for_generation( --+++ # generation_config=generation_config, --+++ # model_kwargs=model_kwargs, --+++ # assistant_model=assistant_model, --+++ # batch_size=batch_size, --+++ # max_cache_length=max_cache_length, --+++ # ) --+++ --+++ # if generation_config.cache_implementation == "static": --+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++ # created_cache = model_kwargs.get(cache_name) --+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++ # if created_cache.max_cache_len < generation_config.max_length: --+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++ --+++ # return result --+++ --+++ --+++ --++ --++ --++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++-- --++2.27.0 --++ --+-- --+2.27.0 --+ --diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --new file mode 100644 --index 00000000..966529e4 ----- /dev/null --+++ b/patches/0003-20261106secondcommit.patch --@@ -0,0 +1,2769 @@ --+From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Thu, 6 Nov 2025 14:54:37 +0800 --+Subject: [PATCH 3/3] 20261106secondcommit --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 217 ++- --+ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- --+ patches/0001-20251104commit.patch | 1272 ----------------- --+ 3 files changed, 528 insertions(+), 2032 deletions(-) --+ delete mode 100644 patches/0001-20251104commit.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index 73773c22..2f9192bf 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) --+ --+ _CONFIG_FOR_DOC = "DeepseekConfig" --+ --++_attn_mask_cache = {} --++ --++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --++ q_len = batch_and_seq[1] --++ kv_len = batch_and_seq[1] + past_key_values_length --++ key = (batch_and_seq[0], q_len, kv_len) --++ --++ if key in _attn_mask_cache: --++ return _attn_mask_cache[key] --++ --++ mask = _prepare_4d_causal_attention_mask( --++ attention_mask, --++ batch_and_seq, --++ inputs_embeds, --++ past_key_values_length, --++ ) --++ _attn_mask_cache[key] = mask --++ return mask --+ --+ def _get_unpad_data(attention_mask): --+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --+@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): --+ return final_output --+ --+ --+- @no_grad() --+- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+- expert_cache = ops.zeros_like(x) --+- idxs = flat_expert_indices.argsort() --+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+- token_idxs = idxs // self.num_experts_per_tok --+- --+- for i, end_idx in enumerate(tokens_per_expert): --+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+- if start_idx == end_idx: --+- continue --+- expert = self.experts[i] --+- exp_token_idx = token_idxs[start_idx:end_idx] --+- expert_tokens = x[exp_token_idx] --+- expert_out = expert(expert_tokens) --+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+- --+- return expert_cache --+- --+ # @no_grad() --+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+- # # expert_cache = torch.zeros_like(x) --+- # # idxs = flat_expert_indices.argsort() --+- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+- # # token_idxs = idxs // self.num_experts_per_tok --+- # # for i, end_idx in enumerate(tokens_per_expert): --+- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+- # # if start_idx == end_idx: --+- # # continue --+- # # expert = self.experts[i] --+- # # exp_token_idx = token_idxs[start_idx:end_idx] --+- # # expert_tokens = x[exp_token_idx] --+- # # expert_out = expert(expert_tokens) --+- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+- # # return expert_cache --++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ # expert_cache = ops.zeros_like(x) --+ # idxs = flat_expert_indices.argsort() --+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): --+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+ --+ # return expert_cache --+- # @no_grad() --+- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+- # expert_cache = ops.zeros_like(x) --++ --++ @no_grad() --++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++ """ --++ 优化版 MoE prefill: --++ - 批量张量化处理同一个 expert 的所有 token --++ - 跳过无 token 的专家 --++ - 保持结果完全一致 --++ """ --++ # 初始化输出缓存 --++ expert_cache = ops.zeros_like(x) --+ --+- # # 排序保证顺序一致 --+- # idxs = flat_expert_indices.argsort() --+- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+- # token_idxs = idxs // self.num_experts_per_tok --++ # 排序(确保 scatter_add 位置对应原逻辑) --++ idxs = flat_expert_indices.argsort() --++ sorted_expert_indices = flat_expert_indices[idxs] --++ sorted_token_indices = idxs // self.num_experts_per_tok --+ --+- # # 找出有 token 的专家 --+- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++ # 每个 expert 的 token 数 --++ tokens_per_expert = sorted_expert_indices.bincount() --+ --+- # for i in active_experts.tolist(): --+- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+- # end_idx = tokens_per_expert[i] --+- # if start_idx == end_idx: # 没有 token --+- # continue --++ # 找出有 token 的专家 --++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+ --+- # exp_token_idx = token_idxs[start_idx:end_idx] --+- # expert_tokens = x[exp_token_idx] --+- # expert_out = self.experts[i](expert_tokens) --+- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++ for expert_id in active_experts.tolist(): --++ # 取该 expert 对应的排序后 token 区间 --++ start = (tokens_per_expert[:expert_id]).sum().item() --++ end = start + tokens_per_expert[expert_id].item() --+ --+- # expert_cache = mindspore.mint.scatter_add( --+- # expert_cache, --+- # 0, --+- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+- # expert_out --+- # ) --++ token_idx = sorted_token_indices[start:end] # 原 token 位置 --++ expert_tokens = x[token_idx] # 取输入向量 --+ --+- # return expert_cache --++ # 执行专家 MLP --++ expert_out = self.experts[expert_id](expert_tokens) --++ --++ # 按权重缩放 --++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --++ --++ # 回写到缓存(等价 scatter_add) --++ expert_cache = mindspore.mint.scatter_add( --++ expert_cache, --++ 0, --++ token_idx.view(-1, 1).tile((1, x.shape[-1])), --++ scaled_out --++ ) --++ --++ return expert_cache --++ --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # # expert_cache = torch.zeros_like(x) --++ # # idxs = flat_expert_indices.argsort() --++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++ # # token_idxs = idxs // self.num_experts_per_tok --++ # # for i, end_idx in enumerate(tokens_per_expert): --++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++ # # if start_idx == end_idx: --++ # # continue --++ # # expert = self.experts[i] --++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++ # # expert_tokens = x[exp_token_idx] --++ # # expert_out = expert(expert_tokens) --++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++ # # return expert_cache --++ # expert_cache = ops.zeros_like(x) --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # for i, end_idx in enumerate(tokens_per_expert): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # if start_idx == end_idx: --++ # continue --++ # expert = self.experts[i] --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = expert(expert_tokens) --++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --++ # return expert_cache --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++ # expert_cache = ops.zeros_like(x) --++ --++ # # 排序保证顺序一致 --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ # token_idxs = idxs // self.num_experts_per_tok --++ --++ # # 找出有 token 的专家 --++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++ --++ # for i in active_experts.tolist(): --++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ # end_idx = tokens_per_expert[i] --++ # if start_idx == end_idx: # 没有 token --++ # continue --++ --++ # exp_token_idx = token_idxs[start_idx:end_idx] --++ # expert_tokens = x[exp_token_idx] --++ # expert_out = self.experts[i](expert_tokens) --++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++ --++ # expert_cache = mindspore.mint.scatter_add( --++ # expert_cache, --++ # 0, --++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++ # expert_out --++ # ) --++ --++ # return expert_cache --+ --+ --+ --+@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): --+ --+ return attn_output, attn_weights, past_key_value --+ --+- --+ # class DeepseekFlashAttention(nn.Module): --+ # """ --+ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): --+ --+ return attn_output, attn_weights, past_key_value --+ --++ --+ Deepseek_ATTENTION_CLASSES = { --+ "eager": DeepseekAttention, --+ "flash-attention": DeepseekFlashAttention, --+@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): --+ ) --+ else: --+ # 4d mask is passed through the layers --+- attention_mask = _prepare_4d_causal_attention_mask( --++ # attention_mask = _prepare_4d_causal_attention_mask( --++ # attention_mask, --++ # (batch_size, seq_length), --++ # inputs_embeds, --++ # past_key_values_length, --++ # ) --++ #@dwj --++ attention_mask = get_cached_causal_mask( --+ attention_mask, --+ (batch_size, seq_length), --+ inputs_embeds, --+@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ # Initialize weights and apply final processing --+ self.post_init() --+ self.warm_up = False --++ #@dwj --++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --++ self.num_layers, --++ self.num_attention_heads, --++ self.head_dim, --++ batch_size=1, --++ max_length=self.max_length, --++ dtype=mindspore.float16 --++ ) --++ --++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --++ key_cache = [] --++ value_cache = [] --++ for _ in range(num_layers): --++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++ key_cache.append(k) --++ value_cache.append(v) --++ return key_cache, value_cache --++ --+ --+ def warmup_moe_model_deep(self): --+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+index bced285c..ebd7782e 100644 --+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) --+ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+ --+-Long_Prompt = False --+-PROMPT_LENGTH_THRESHOLD = 128 --++Long_Prompt = 1 --++LONG_PROMPT_LENGTH_THRESHOLD = 128 --++SHORT_PROMPT_LENGTH_THRESHOLD = 32 --++ --++_causal_mask_cache = {} --++ --++def get_cached_causal_mask_with_cache_position( --++ attention_mask: mindspore.Tensor, --++ sequence_length: int, --++ target_length: int, --++ dtype: mindspore.dtype, --++ min_dtype: float, --++ cache_position: mindspore.Tensor, --++ batch_size: int, --++): --++ """ --++ 带缓存的 causal mask 构造函数 --++ """ --++ # q_len 是当前 query 长度 --++ q_len = sequence_length --++ # kv_len 是 target_length --++ kv_len = target_length --++ --++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --++ key = (batch_size, q_len, kv_len, dtype, min_dtype) --++ --++ if key in _causal_mask_cache: --++ return _causal_mask_cache[key] --++ --++ # 调用原来的 mask 构造逻辑 --++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++ attention_mask, --++ sequence_length=sequence_length, --++ target_length=target_length, --++ dtype=dtype, --++ min_dtype=min_dtype, --++ cache_position=cache_position, --++ batch_size=batch_size, --++ ) --++ # 缓存结果 --++ _causal_mask_cache[key] = causal_mask --++ return causal_mask --+ --+ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+ def _prepare_4d_causal_attention_mask_with_cache_position( --+@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+ --+ --+ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --++# class Qwen2MoeAttention(nn.Module): --++# """ --++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --++# and "Generating Long Sequences with Sparse Transformers". --++# """ --++ --++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++# super().__init__() --++# self.config = config --++# self.layer_idx = layer_idx --++# if layer_idx is None: --++# logger.warning_once( --++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++# "when creating this class." --++# ) --++ --++# self.hidden_size = config.hidden_size --++# self.num_heads = config.num_attention_heads --++# self.head_dim = self.hidden_size // self.num_heads --++# self.num_key_value_heads = config.num_key_value_heads --++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++# self.max_position_embeddings = config.max_position_embeddings --++# self.rope_theta = config.rope_theta --++# self.is_causal = True --++# self.attention_dropout = config.attention_dropout --++ --++# if (self.head_dim * self.num_heads) != self.hidden_size: --++# raise ValueError( --++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++# f" and `num_heads`: {self.num_heads})." --++# ) --++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++ --++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++# self.head_dim, --++# max_position_embeddings=self.max_position_embeddings, --++# base=self.rope_theta, --++# ) --++ --++# def forward( --++# self, --++# hidden_states: mindspore.Tensor, --++# attention_mask: Optional[mindspore.Tensor] = None, --++# position_ids: Optional[mindspore.Tensor] = None, --++# past_key_value: Optional[Cache] = None, --++# output_attentions: bool = False, --++# use_cache: bool = False, --++# cache_position: Optional[mindspore.Tensor] = None, --++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++ --++ --++ --++# bsz, q_len, _ = hidden_states.shape --++ --++# query_states = self.q_proj(hidden_states) --++# key_states = self.k_proj(hidden_states) --++# value_states = self.v_proj(hidden_states) --++ --++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++ --++# kv_seq_len = key_states.shape[-2] --++# if past_key_value is not None: --++# if self.layer_idx is None: --++# raise ValueError( --++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++# "with a layer index." --++# ) --++# if isinstance(past_key_value, StaticCache): --++# kv_seq_len = key_states.shape[-2] --++# else: --++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++# if past_key_value is not None: --++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++# if isinstance(past_key_value, StaticCache): --++# kv_seq_len = key_states.shape[-2] --++ --++# # repeat k/v heads if n_kv_heads < n_heads --++# key_states = repeat_kv(key_states, self.num_key_value_groups) --++# value_states = repeat_kv(value_states, self.num_key_value_groups) --++ --++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++ --++# if attention_mask is not None: --++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++# attn_weights = attn_weights + causal_mask --++ --++# # upcast attention to fp32 --++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++# attn_output = ops.matmul(attn_weights, value_states) --++ --++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++# raise ValueError( --++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --++# f" {attn_output.shape}" --++# ) --++ --++# attn_output = ops.transpose(attn_output, 1, 2) --++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++ --++# attn_output = self.o_proj(attn_output) --++# # @lwx --++ --++# # max_seq_len = self.max_position_embeddings # 2048 --++ --++# # if attention_mask is not None: --++# # # attention_mask: [B, 1, Sq, Sk] --++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++ --++# # # pad 到 [max_seq_len, max_seq_len] --++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++# # global_attention_mask = padded_mask --++# # else: --++# # global_attention_mask = None --++ --++ --++# # sparse_mode=3 --++# # attn_output = mindspore.ops.flash_attention_score( --++# # query=query_states, --++# # key=key_states, --++# # value=value_states, --++# # real_shift=None, --++# # padding_mask=None, --++ --++# # head_num=self.num_heads, --++# # attn_mask=global_attention_mask, --++# # keep_prob=1.0 - self.attention_dropout, --++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++# # input_layout="BNSD", --++# # pre_tokens=2147483647, --++# # next_tokens=2147483647, --++# # inner_precise=0, --++# # drop_mask=None, --++# # prefix=None, --++# # actual_seq_qlen=None, --++# # actual_seq_kvlen=None, --++# # sparse_mode=sparse_mode, --++# # ) --++# if not output_attentions: --++# attn_weights = None --++ --++# return attn_output, attn_weights, past_key_value --++ --+ class Qwen2MoeAttention(nn.Module): --+ """ --+- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+- and "Generating Long Sequences with Sparse Transformers". --+- """ --++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 --+ --++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --++ --++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --++ """ --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+ super().__init__() --+ self.config = config --+@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): --+ if layer_idx is None: --+ logger.warning_once( --+ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+ "when creating this class." --+ ) --+ --+@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): --+ use_cache: bool = False, --+ cache_position: Optional[mindspore.Tensor] = None, --+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+- --+ --+- --++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- --+ bsz, q_len, _ = hidden_states.shape --+ --+ query_states = self.q_proj(hidden_states) --+ key_states = self.k_proj(hidden_states) --+ value_states = self.v_proj(hidden_states) --+ --+- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+- --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ --+ kv_seq_len = key_states.shape[-2] --+ if past_key_value is not None: --+- if self.layer_idx is None: --+- raise ValueError( --+- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+- "with a layer index." --+- ) --+- if isinstance(past_key_value, StaticCache): --+- kv_seq_len = key_states.shape[-2] --+- else: --+- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+ --+ if past_key_value is not None: --+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++ --++ # --- 2. 动态调度核心注意力计算 --- --++ global Long_Prompt --++ if Long_Prompt >= 1: --++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --++ fa_attention_mask = None --++ if attention_mask is not None: --++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++ fa_attention_mask = (mask_slice != 0) --++ --++ attn_output = mindspore.ops.flash_attention_score( --++ query=query_states, --++ key=key_states, --++ value=value_states, --++ head_num=self.num_heads, --++ attn_mask=fa_attention_mask, --++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --++ scalar_value=1.0 / math.sqrt(self.head_dim), --++ input_layout="BNSD", --++ sparse_mode=0, --++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --++ ) --+ --+- if isinstance(past_key_value, StaticCache): --+- kv_seq_len = key_states.shape[-2] --++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ attn_output = self.o_proj(attn_output) --++ attn_weights = None --++ if output_attentions: --++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+ --+- # repeat k/v heads if n_kv_heads < n_heads --+- key_states = repeat_kv(key_states, self.num_key_value_groups) --+- value_states = repeat_kv(value_states, self.num_key_value_groups) --+- --+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++ else: --++ # --- Eager Attention 路径 (用于短序列和解码) --- --++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++ --++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+ --+- if attention_mask is not None: --+- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+- attn_weights = attn_weights + causal_mask --++ if attention_mask is not None: --++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++ attn_weights = attn_weights + causal_mask --+ --+- # upcast attention to fp32 --+- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+- attn_output = ops.matmul(attn_weights, value_states) --++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++ attn_output = ops.matmul(attn_weights, value_states) --+ --+- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+- raise ValueError( --+- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+- f" {attn_output.shape}" --+- ) --++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++ raise ValueError( --++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --++ ) --+ --+- attn_output = ops.transpose(attn_output, 1, 2) --+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++ attn_output = ops.transpose(attn_output, 1, 2) --++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++ attn_output = self.o_proj(attn_output) --+ --+- attn_output = self.o_proj(attn_output) --+- # @lwx --++ if not output_attentions: --++ attn_weights = None --+ --+- # max_seq_len = self.max_position_embeddings # 2048 --+- --+- # if attention_mask is not None: --+- # # attention_mask: [B, 1, Sq, Sk] --+- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+- --+- # # pad 到 [max_seq_len, max_seq_len] --+- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+- # global_attention_mask = padded_mask --+- # else: --+- # global_attention_mask = None --+- --+- --+- # sparse_mode=3 --+- # attn_output = mindspore.ops.flash_attention_score( --+- # query=query_states, --+- # key=key_states, --+- # value=value_states, --+- # real_shift=None, --+- # padding_mask=None, --+- --+- # head_num=self.num_heads, --+- # attn_mask=global_attention_mask, --+- # keep_prob=1.0 - self.attention_dropout, --+- # scalar_value=1.0 / math.sqrt(self.head_dim), --+- # input_layout="BNSD", --+- # pre_tokens=2147483647, --+- # next_tokens=2147483647, --+- # inner_precise=0, --+- # drop_mask=None, --+- # prefix=None, --+- # actual_seq_qlen=None, --+- # actual_seq_kvlen=None, --+- # sparse_mode=sparse_mode, --+- # ) --+- if not output_attentions: --+- attn_weights = None --+- --+ return attn_output, attn_weights, past_key_value --+ --+- --+ # class Qwen2MoeFlashAttention(nn.Module): --+ # """ --+ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { --+ # return final_hidden_states, router_logits --+ --+ --+-# class Qwen2MoeSparseMoeBlock(nn.Module): --+-# """ --+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+-# `_moe_infer_prefill` (用于长序列处理) 方法。 --+-# """ --+-# def __init__(self, config: Qwen2MoeConfig): --+-# super().__init__() --+-# self.num_experts = config.num_experts --+-# self.top_k = config.num_experts_per_tok --+-# self.norm_topk_prob = config.norm_topk_prob --+- --+-# # 门控网络 --+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+-# # 专家列表 --+-# self.experts = nn.ModuleList( --+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+-# ) --+-# # 共享专家 --+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-# @no_grad() --+-# def _moe_infer_decode( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# """ --+-# 【解码路径】针对 sequence_length=1 的极致优化。 --+-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+-# """ --+-# batch_size, hidden_dim = hidden_states.shape --+- --+-# expert_outputs_list = [ --+-# ops.cat([ --+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+-# ], dim=0) --+-# for i in range(batch_size) --+-# ] --+- --+-# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+-# # shape: (batch_size, top_k, hidden_dim) --+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+- --+-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+- --+-# return moe_output.squeeze(1) --+- --+-# @no_grad() --+-# def _moe_infer_prefill( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# """ --+-# 【预填充路径】针对 sequence_length > 1 的优化。 --+-# 按专家对 Token 进行分组,并进行批处理。 --+-# """ --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens = hidden_states.shape[0] --+-# flat_selected_experts = selected_experts.flatten() --+- --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+- --+-# active_experts = ops.unique(flat_selected_experts) --+- --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+- --+-# mask = (flat_selected_experts == expert_idx_tensor) --+-# selected_token_indices = token_indices[mask] --+-# selected_routing_weights = routing_weights.flatten()[mask] --+- --+-# current_states = hidden_states[selected_token_indices] --+- --+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+- --+-# moe_output = moe_output.index_add( --+-# dim=0, --+-# index=selected_token_indices, --+-# source=expert_output.to(hidden_states.dtype) --+-# ) --+-# return moe_output --+- --+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-# """ --+-# 顶层 forward 方法,作为智能分发器。 --+-# """ --+-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+- --+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-# router_logits = self.gate(hidden_states_reshaped) --+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- --+-# if self.norm_topk_prob: --+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- --+-# routing_weights = routing_weights.to(hidden_states.dtype) --+- --+-# moe_output = None --+-# # 在推理时,根据序列长度选择最优路径 --+-# if not self.training: --+-# if sequence_length == 1: --+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+-# else: --+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+-# else: --+-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+-# raise NotImplementedError("Training path is not implemented.") --+- --+-# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+- --+-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+- --+-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+- --+-# return final_hidden_states, router_logits --+- --+- --+-# class Qwen2MoeSparseMoeBlock(nn.Module): --+-# """ --+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+-# """ --+-# def __init__(self, config: Qwen2MoeConfig): --+-# super().__init__() --+-# self.num_experts = config.num_experts --+-# self.top_k = config.num_experts_per_tok --+-# self.norm_topk_prob = config.norm_topk_prob --+- --+-# # 门控网络 --+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+-# # 专家列表 --+-# self.experts = nn.ModuleList( --+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+-# ) --+-# # 共享专家 --+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-# @no_grad() --+-# def _moe_infer_decode( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# batch_size, _ = hidden_states.shape --+-# expert_outputs_list = [ --+-# ops.cat([ --+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+-# ], dim=0) --+-# for i in range(batch_size) --+-# ] --+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+-# return moe_output.squeeze(1) --+- --+-# @no_grad() --+-# def _moe_infer_prefill( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens = hidden_states.shape[0] --+-# flat_selected_experts = selected_experts.flatten() --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+-# active_experts = ops.unique(flat_selected_experts) --+- --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+-# mask = (flat_selected_experts == expert_idx_tensor) --+-# selected_token_indices = token_indices[mask] --+-# selected_routing_weights = routing_weights.flatten()[mask] --+-# current_states = hidden_states[selected_token_indices] --+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+-# moe_output = moe_output.index_add( --+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+-# ) --+-# return moe_output --+- --+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-# """ --+-# 顶层 forward 方法,作为智能分发器。 --+-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+-# """ --+-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+- --+-# # 1. 门控计算 (通用逻辑) --+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-# router_logits = self.gate(hidden_states_reshaped) --+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- --+-# if self.norm_topk_prob: --+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- --+-# routing_weights = routing_weights.to(hidden_states.dtype) --+- --+-# # 2. 智能分发到最优 MoE 路径 --+-# moe_output = None --+-# if not self.training: --+-# if sequence_length == 1: --+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+-# else: --+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+-# else: --+-# raise NotImplementedError("Training path is not implemented.") --+- --+-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+- --+-# # 4. 合并 MoE 输出和共享专家输出 --+-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+- --+-# # 5. 恢复原始形状并返回 --+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+- --+-# return final_hidden_states, router_logits --+- --+-# prefill fastest --+-# class Qwen2MoeSparseMoeBlock(nn.Module): --+-# """ --+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+-# """ --+-# def __init__(self, config: Qwen2MoeConfig): --+-# super().__init__() --+-# self.num_experts = config.num_experts --+-# self.top_k = config.num_experts_per_tok --+-# self.norm_topk_prob = config.norm_topk_prob --+- --+-# # 门控网络 --+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+-# # 专家列表 --+-# self.experts = nn.ModuleList( --+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+-# ) --+-# # 共享专家 --+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-# @no_grad() --+-# def _moe_infer_dispatch( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# """ --+-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+-# """ --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens, _ = hidden_states.shape --+- --+-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+-# flat_selected_experts = selected_experts.flatten() --+-# flat_routing_weights = routing_weights.flatten() --+- --+-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+- --+-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+-# active_experts = ops.unique(flat_selected_experts) --+- --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+- --+-# # 找到所有分配给该专家的 token --+-# mask = (flat_selected_experts == expert_idx_tensor) --+- --+-# # 使用 mask 选取对应的 token 和权重 --+-# current_token_indices = token_indices[mask] --+-# current_routing_weights = flat_routing_weights[mask] --+-# current_hidden_states = hidden_states[current_token_indices] --+- --+-# # 对这些 token 进行批处理 --+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+- --+-# # 使用 index_add 将结果精确地加回到对应位置 --+-# moe_output = moe_output.index_add( --+-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+-# ) --+-# return moe_output --+- --+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-# """ --+-# 顶层 forward 方法,作为智能分发器。 --+-# """ --+-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+- --+-# # 1. 门控计算 --+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-# router_logits = self.gate(hidden_states_reshaped) --+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- --+-# if self.norm_topk_prob: --+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- --+-# routing_weights = routing_weights.to(hidden_states.dtype) --+- --+-# # 2. 调用统一的 MoE 计算内核 --+-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+- --+-# # 3. 统一处理共享专家 --+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+- --+-# # 4. 合并输出 --+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+- --+-# # 5. 恢复原始形状并返回 --+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+- --+-# return final_hidden_states, router_logits --+- --+- --+-# class Qwen2MoeSparseMoeBlock(nn.Module): --+-# """ --+-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+-# 【最终高性能与高精度版】: --+-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+-# 3. 这样实现了速度和准确性的两全其美。 --+-# """ --+-# def __init__(self, config: Qwen2MoeConfig): --+-# super().__init__() --+-# self.num_experts = config.num_experts --+-# self.top_k = config.num_experts_per_tok --+-# self.norm_topk_prob = config.norm_topk_prob --+- --+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+-# self.experts = nn.ModuleList( --+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+-# ) --+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-# @no_grad() --+-# def _moe_infer_decode( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# """ --+-# 【解码路径】极致优化版:bmm + 高精度累加。 --+-# """ --+-# original_dtype = hidden_states.dtype --+-# batch_size, _ = hidden_states.shape --+- --+-# expert_outputs_list = [ --+-# ops.cat([ --+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+-# ], dim=0) --+-# for i in range(batch_size) --+-# ] --+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+- --+-# # 在 float32 下执行 bmm,得到高精度结果 --+-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+- --+-# # 将高精度结果转换回原始数据类型 --+-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+- --+-# return moe_output --+- --+-# @no_grad() --+-# def _moe_infer_prefill( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# selected_experts: mindspore.Tensor, --+-# routing_weights: mindspore.Tensor --+-# ) -> mindspore.Tensor: --+-# """ --+-# 【预填充路径】与原始实现一致,结果精确。 --+-# """ --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens, _ = hidden_states.shape --+-# flat_selected_experts = selected_experts.flatten() --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+-# active_experts = ops.unique(flat_selected_experts) --+- --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+-# mask = (flat_selected_experts == expert_idx_tensor) --+-# selected_token_indices = token_indices[mask] --+-# selected_routing_weights = routing_weights.flatten()[mask] --+-# current_states = hidden_states[selected_token_indices] --+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+-# moe_output = moe_output.index_add( --+-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+-# ) --+-# return moe_output --+- --+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+- --+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-# router_logits = self.gate(hidden_states_reshaped) --+-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+- --+-# if self.norm_topk_prob: --+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- --+-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+-# # 如果模型主体是 float16,后续再转换 --+- --+-# moe_output = None --+-# if not self.training: --+-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+-# # _moe_infer_decode 内部会处理好类型转换 --+-# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+-# if sequence_length == 1: --+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+-# else: --+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+-# else: --+-# raise NotImplementedError("Training path is not implemented.") --+- --+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+- --+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+- --+-# return final_hidden_states, router_logits --+- --+- --+-# class Qwen2MoeSparseMoeBlock(nn.Module): --+-# """ --+-# 【融合版】一个混合专家模块,内置两种推理策略, --+-# 由外部全局变量 `Long_Prompt` 控制: --+- --+-# - if Long_Prompt is True: 【精度优先模式】 --+-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+-# 适用于处理长序列,避免误差累积。 --+- --+-# - if Long_Prompt is False: 【速度优先模式】 --+-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+-# 在解码阶段获得极致速度,同时保证结果高度准确。 --+-# """ --+-# def __init__(self, config: Qwen2MoeConfig): --+-# super().__init__() --+-# self.num_experts = config.num_experts --+-# self.top_k = config.num_experts_per_tok --+-# self.norm_topk_prob = config.norm_topk_prob --+- --+-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+-# self.experts = nn.ModuleList( --+-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+-# ) --+-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-# # --- 速度优先模式的辅助函数 --- --+-# @no_grad() --+-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+-# original_dtype = hidden_states.dtype --+-# batch_size, _ = hidden_states.shape --+-# expert_outputs_list = [ --+-# ops.cat([ --+-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+-# ], dim=0) --+-# for i in range(batch_size) --+-# ] --+-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+-# weights_fp32 = routing_weights.to(mindspore.float32) --+-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+-# return moe_output_fp32.squeeze(1).to(original_dtype) --+- --+-# @no_grad() --+-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens, _ = hidden_states.shape --+-# flat_selected_experts = selected_experts.flatten() --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+-# active_experts = ops.unique(flat_selected_experts) --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+-# mask = (flat_selected_experts == expert_idx_tensor) --+-# selected_token_indices = token_indices[mask] --+-# selected_routing_weights = routing_weights.flatten()[mask] --+-# current_states = hidden_states[selected_token_indices] --+-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+-# return moe_output --+- --+-# # --- 精度优先模式的辅助函数 --- --+-# @no_grad() --+-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+-# moe_output = ops.zeros_like(hidden_states) --+-# num_tokens, _ = hidden_states.shape --+-# flat_selected_experts = selected_experts.flatten() --+-# flat_routing_weights = routing_weights.flatten() --+-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+-# active_experts = ops.unique(flat_selected_experts) --+-# for expert_idx_tensor in active_experts: --+-# expert_idx = expert_idx_tensor.item() --+-# expert_layer = self.experts[expert_idx] --+-# mask = (flat_selected_experts == expert_idx_tensor) --+-# current_token_indices = token_indices[mask] --+-# current_routing_weights = flat_routing_weights[mask] --+-# current_hidden_states = hidden_states[current_token_indices] --+-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+-# return moe_output --+- --+-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-# # 声明我们将要使用一个在模块外部定义的全局变量 --+-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+-# global Long_Prompt --+- --+-# # 1. 门控计算 (所有模式通用) --+-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-# router_logits = self.gate(hidden_states_reshaped) --+-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+-# if self.norm_topk_prob: --+-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+- --+-# moe_output = None --+-# if not self.training: --+-# # 根据 Long_Prompt 标志选择模式 --+-# if Long_Prompt: --+-# # --- 精度优先模式 --- --+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+-# else: --+-# # --- 速度优先模式 --- --+-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+-# if sequence_length == 1: --+-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+-# else: --+-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+-# else: --+-# raise NotImplementedError("Training path is not implemented.") --+- --+-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+- --+-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+- --+-# return final_hidden_states, router_logits --+- --+ class Qwen2MoeSparseMoeBlock(nn.Module): --+ """ --+ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+ return moe_output_fp32.squeeze(1).to(original_dtype) --+ --++ # @no_grad() --++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++ # num_tokens, _ = hidden_states.shape --++ # flat_selected_experts = selected_experts.flatten() --++ # sorted_expert_indices = flat_selected_experts.argsort() --++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++ # original_token_indices = sorted_expert_indices // self.top_k --++ # moe_output = ops.zeros_like(hidden_states) --++ # current_token_offset = 0 --++ # for i in range(self.num_experts): --++ # expert_token_count = tokens_per_expert[i] - current_token_offset --++ # if expert_token_count == 0: --++ # continue --++ # end_offset = current_token_offset + expert_token_count --++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++ # expert_hidden_states = hidden_states[expert_original_token_indices] --++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++ # current_token_offset += expert_token_count --++ # return moe_output --++ --+ @no_grad() --+ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+- num_tokens, _ = hidden_states.shape --+- flat_selected_experts = selected_experts.flatten() --+- sorted_expert_indices = flat_selected_experts.argsort() --+- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+- original_token_indices = sorted_expert_indices // self.top_k --++ """ --++ 优化版 MoE prefill (速度优先模式): --++ - 批量张量化处理同一个 expert 的所有 token --++ - 跳过无 token 的专家 --++ - 保持结果完全一致 --++ """ --+ moe_output = ops.zeros_like(hidden_states) --+- current_token_offset = 0 --+- for i in range(self.num_experts): --+- expert_token_count = tokens_per_expert[i] - current_token_offset --+- if expert_token_count == 0: --+- continue --+- end_offset = current_token_offset + expert_token_count --+- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+- expert_hidden_states = hidden_states[expert_original_token_indices] --+- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+- current_token_offset += expert_token_count --++ --++ flat_selected_experts = selected_experts.flatten() --++ flat_routing_weights = routing_weights.flatten() --++ --++ idxs = flat_selected_experts.argsort() --++ sorted_expert_indices = flat_selected_experts[idxs] --++ sorted_token_indices = idxs // self.top_k --++ --++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --++ --++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --++ --++ for expert_id in active_experts.tolist(): --++ start = int(tokens_per_expert[:expert_id].sum().item()) --++ end = start + int(tokens_per_expert[expert_id].item()) --++ --++ token_idx = sorted_token_indices[start:end] --++ expert_tokens = hidden_states[token_idx] --++ --++ expert_out = self.experts[expert_id](expert_tokens) --++ --++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --++ --++ moe_output = mindspore.mint.scatter_add( --++ moe_output, --++ 0, --++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --++ scaled_out.to(hidden_states.dtype) --++ ) --++ --+ return moe_output --+ --++ --+ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+ @no_grad() --+ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+ --+ moe_output = None --+- if Long_Prompt: --+- # --- 精度优先模式 (ACCURACY MODE) --- --+- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ # if Long_Prompt==0: --++ # # --- 精度优先模式 (ACCURACY MODE) --- --++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ # else: --++ # # --- 速度优先模式 (SPEED MODE) --- --++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++ # if sequence_length == 1: --++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ # else: --++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ --++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++ if sequence_length == 1: --++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ else: --+- # --- 速度优先模式 (SPEED MODE) --- --+- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+- if sequence_length == 1: --+- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+- else: --+- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+- --++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ --+ --+ # 3. 共享专家计算与合并 (所有模式通用) --+ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+ --+ return final_hidden_states, router_logits --+ --++ --+ class Qwen2MoeDecoderLayer(nn.Module): --+ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+ super().__init__() --+ self.hidden_size = config.hidden_size --+ --+- # if Long_Prompt: --+- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+- # else: --++ # if Long_Prompt == 2: --+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++ # else: --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ --+ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+ --+@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+ ) --+ --+ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --+- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++ # attention_mask, --++ # sequence_length=sequence_length, --++ # target_length=target_length, --++ # dtype=dtype, --++ # min_dtype=min_dtype, --++ # cache_position=cache_position, --++ # batch_size=input_tensor.shape[0], --++ # ) --++ #@dwj --++ causal_mask = get_cached_causal_mask_with_cache_position( --+ attention_mask, --+ sequence_length=sequence_length, --+ target_length=target_length, --+@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+ """ --+- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --++ _causal_mask_cache.clear() --+ --+ input_ids = kwargs.get("input_ids") --+ if input_ids is None and args: --+@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ --+ if input_ids is not None: --+ prompt_length = input_ids.shape[1] --+- --+- if prompt_length > PROMPT_LENGTH_THRESHOLD: --+- Long_Prompt = True --++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --++ Long_Prompt = 2 --++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --++ Long_Prompt = 0 --+ else: --+- Long_Prompt = False --++ Long_Prompt = 1 --++ --+ --+ return super().generate(*args, **kwargs) --+ --+@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+ dtype = self.lm_head.weight.dtype --+ min_dtype = float(ops.finfo(dtype).min) --+ --+- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++ # attention_mask, --++ # sequence_length=sequence_length, --++ # target_length=past_key_values.get_max_length(), --++ # dtype=dtype, --++ # min_dtype=min_dtype, --++ # cache_position=cache_position, --++ # batch_size=batch_size, --++ # ) --++ --++ #@dwj --++ attention_mask = get_cached_causal_mask_with_cache_position( --+ attention_mask, --+ sequence_length=sequence_length, --+ target_length=past_key_values.get_max_length(), --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+deleted file mode 100644 --+index 6dfb5b93..00000000 --+--- a/patches/0001-20251104commit.patch --++++ /dev/null --+@@ -1,1272 +0,0 @@ --+-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+-From: Pinoeer-kingxi <13022943007@163.com> --+-Date: Tue, 4 Nov 2025 09:11:51 +0800 --+-Subject: [PATCH] 20251104commit --+- --+---- --+- mindnlp/transformers/cache_utils.py | 28 +- --+- .../models/deepseek/modeling_deepseek.py | 149 ++- --+- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+- 3 files changed, 976 insertions(+), 87 deletions(-) --+- --+-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+-index cadd2e04..02f8d4be 100644 --+---- a/mindnlp/transformers/cache_utils.py --+-+++ b/mindnlp/transformers/cache_utils.py --+-@@ -812,14 +812,26 @@ class StaticCache(Cache): --+- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+- # k_out[:, :, cache_position] = key_states --+- # v_out[:, :, cache_position] = value_states --+-- if ON_ORANGE_PI: --+-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+-- else: --+-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+-- --+-+ # if ON_ORANGE_PI: --+-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+-+ # else: --+-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+-+ # 确保 cache_position 是 1D tensor 并且类型正确 --+-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+-+ if cache_position.ndim > 1: --+-+ cache_position = cache_position.flatten() --+-+ # 确保类型是 int32 或 int64(MindSpore 要求) --+-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+-+ cache_position = cache_position.int() --+-+ --+-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+-+ k_out[:, :, cache_position] = key_states --+-+ v_out[:, :, cache_position] = value_states --+-+ --+- return k_out, v_out --+- --+- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+-index c695b944..d8303e45 100644 --+---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+- # Copied from transformers.models.llama.modeling_llama.rotate_half --+- def rotate_half(x): --+- """Rotates half the hidden dims of the input.""" --+-- x1 = x[..., : x.shape[-1] // 2] --+-- x2 = x[..., x.shape[-1] // 2 :] --+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+-+ # x1 = x[..., : x.shape[-1] // 2] --+-+ # x2 = x[..., x.shape[-1] // 2 :] --+-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+- return ops.cat((-x2, x1), dim=-1) --+- --+- --+-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+- if self.training: --+- raise NotImplementedError("Training is not supported yet.") --+- else: --+-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+-- if self.config.n_shared_experts is not None: --+-- y = y + self.shared_experts(identity) --+-- return y --+-+ # @lwx --+-+ if orig_shape[1] == 1: --+-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+-+ y=y.view(*orig_shape) --+-+ if self.config.n_shared_experts is not None: --+-+ y = y + self.shared_experts(identity) --+-+ return y --+-+ else: --+-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+-+ if self.config.n_shared_experts is not None: --+-+ y = y + self.shared_experts(identity) --+-+ return y --+-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+-+ # if self.config.n_shared_experts is not None: --+-+ # y = y + self.shared_experts(identity) --+-+ # return y --+-+ --+-+ @no_grad() --+-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+-+ --+-+ expert_cache = ops.zeros_like(x) --+-+ for i in range(self.num_experts_per_tok): --+-+ expert_id = flat_expert_indices[i].item() --+-+ weight = flat_expert_weights[i].item() --+-+ expert = self.experts[expert_id] --+-+ expert_out = expert(x) --+-+ expert_cache += expert_out * weight --+-+ return expert_cache --+- --+- @no_grad() --+-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+-- # expert_cache = torch.zeros_like(x) --+-- # idxs = flat_expert_indices.argsort() --+-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+-- # token_idxs = idxs // self.num_experts_per_tok --+-- # for i, end_idx in enumerate(tokens_per_expert): --+-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+-- # if start_idx == end_idx: --+-- # continue --+-- # expert = self.experts[i] --+-- # exp_token_idx = token_idxs[start_idx:end_idx] --+-- # expert_tokens = x[exp_token_idx] --+-- # expert_out = expert(expert_tokens) --+-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+-- # return expert_cache --+-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+- expert_cache = ops.zeros_like(x) --+- idxs = flat_expert_indices.argsort() --+- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+- token_idxs = idxs // self.num_experts_per_tok --+-+ --+- for i, end_idx in enumerate(tokens_per_expert): --+- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+- if start_idx == end_idx: --+-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+- expert_out = expert(expert_tokens) --+- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+-+ --+- return expert_cache --+-+ --+-+ # @no_grad() --+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+-+ # # expert_cache = torch.zeros_like(x) --+-+ # # idxs = flat_expert_indices.argsort() --+-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+-+ # # token_idxs = idxs // self.num_experts_per_tok --+-+ # # for i, end_idx in enumerate(tokens_per_expert): --+-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+-+ # # if start_idx == end_idx: --+-+ # # continue --+-+ # # expert = self.experts[i] --+-+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+-+ # # expert_tokens = x[exp_token_idx] --+-+ # # expert_out = expert(expert_tokens) --+-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+-+ # # return expert_cache --+-+ # expert_cache = ops.zeros_like(x) --+-+ # idxs = flat_expert_indices.argsort() --+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+-+ # token_idxs = idxs // self.num_experts_per_tok --+-+ --+-+ # for i, end_idx in enumerate(tokens_per_expert): --+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+-+ # if start_idx == end_idx: --+-+ # continue --+-+ # expert = self.experts[i] --+-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+-+ # expert_tokens = x[exp_token_idx] --+-+ # expert_out = expert(expert_tokens) --+-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+-+ --+-+ # return expert_cache --+-+ # @no_grad() --+-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+-+ # expert_cache = ops.zeros_like(x) --+-+ --+-+ # # 排序保证顺序一致 --+-+ # idxs = flat_expert_indices.argsort() --+-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+-+ # token_idxs = idxs // self.num_experts_per_tok --+-+ --+-+ # # 找出有 token 的专家 --+-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+-+ --+-+ # for i in active_experts.tolist(): --+-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+-+ # end_idx = tokens_per_expert[i] --+-+ # if start_idx == end_idx: # 没有 token --+-+ # continue --+-+ --+-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+-+ # expert_tokens = x[exp_token_idx] --+-+ # expert_out = self.experts[i](expert_tokens) --+-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+-+ --+-+ # expert_cache = mindspore.mint.scatter_add( --+-+ # expert_cache, --+-+ # 0, --+-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+-+ # expert_out --+-+ # ) --+-+ --+-+ # return expert_cache --+-+ --+-+ --+- --+- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+- # """ --+-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+- --+- # Initialize weights and apply final processing --+- self.post_init() --+-+ self.warm_up = False --+-+ --+-+ def warmup_moe_model_deep(self): --+-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+-+ test_texts = [ --+-+ "warmup short", --+-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+-+ ] --+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+-+ if tokenizer is None: --+-+ from mindnlp.transformers import AutoTokenizer --+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+-+ self._warmup_tokenizer = tokenizer --+-+ --+-+ for text in test_texts: --+-+ inputs = tokenizer(text, return_tensors="ms") --+-+ with mindspore._no_grad(): --+-+ _ = self(**inputs, use_cache=False) --+-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+- --+- def get_input_embeddings(self): --+- return self.model.embed_tokens --+-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+- ```""" --+-+ if not self.warm_up: --+-+ self.warm_up = True --+-+ self.warmup_moe_model_deep() --+-+ --+- output_attentions = ( --+- output_attentions --+- if output_attentions is not None --+-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+-index 3cbf820e..d4c6b651 100644 --+---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+-@@ -18,7 +18,6 @@ --+- # See the License for the specific language governing permissions and --+- # limitations under the License. --+- """MindSpore Qwen2MoE model.""" --+-- --+- import math --+- from typing import List, Optional, Tuple, Union --+- --+-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+- TokenClassifierOutput, --+- ) --+- from ...modeling_utils import PreTrainedModel --+-+from ...generation import GenerationMixin --+- from ....utils import logging --+- from .configuration_qwen2_moe import Qwen2MoeConfig --+- --+-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+- self.variance_epsilon = eps --+- --+- def forward(self, hidden_states): --+-+ # @dwj --+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+-+ # @lwx --+-+ # if not self.training : --+-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+- input_dtype = hidden_states.dtype --+- hidden_states = hidden_states.to(mindspore.float32) --+- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+-@@ -234,6 +239,8 @@ def rotate_half(x): --+- """Rotates half the hidden dims of the input.""" --+- x1 = x[..., : x.shape[-1] // 2] --+- x2 = x[..., x.shape[-1] // 2 :] --+-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+- return ops.cat((-x2, x1), dim=-1) --+- --+- --+-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+- self.config = config --+- self.hidden_size = config.hidden_size --+- self.intermediate_size = intermediate_size --+-+ --+- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+- self.act_fn = ACT2FN[config.hidden_act] --+- --+- def forward(self, x): --+-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+-- --+- --+-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+-+ # @lwx --+-+ # gate_up_output = self.gate_up_proj(x) --+-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+-+ # return self.down_proj(swiglu_output) --+-+ --+-+ # def forward(self, x): --+-+ # gate_proj_out = self.gate_proj(x) --+-+ # up_proj_out = self.up_proj(x) --+-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+-+ # return self.down_proj(swiglu_out) --+-+ --+- # Copied from transformers.models.llama.modeling_llama.repeat_kv --+- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+- """ --+-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+- use_cache: bool = False, --+- cache_position: Optional[mindspore.Tensor] = None, --+- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+-+ --+-+ --+-+ --+- bsz, q_len, _ = hidden_states.shape --+- --+- query_states = self.q_proj(hidden_states) --+-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+- "with a layer index." --+- ) --+-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+-+ if isinstance(past_key_value, StaticCache): --+-+ kv_seq_len = key_states.shape[-2] --+-+ else: --+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+- --+- if past_key_value is not None: --+- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+-+ --+-+ if isinstance(past_key_value, StaticCache): --+-+ kv_seq_len = key_states.shape[-2] --+- --+- # repeat k/v heads if n_kv_heads < n_heads --+- key_states = repeat_kv(key_states, self.num_key_value_groups) --+- value_states = repeat_kv(value_states, self.num_key_value_groups) --+-- --+-+ --+- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+- --+-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+-- raise ValueError( --+-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+-- f" {attn_weights.shape}" --+-- ) --+-- --+-- if attention_mask is not None: # no matter the length, we just slice it --+-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+-+ if attention_mask is not None: --+-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+- attn_weights = attn_weights + causal_mask --+- --+- # upcast attention to fp32 --+-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+- --+- attn_output = self.o_proj(attn_output) --+-- --+-+ # @lwx --+-+ --+-+ # max_seq_len = self.max_position_embeddings # 2048 --+-+ --+-+ # if attention_mask is not None: --+-+ # # attention_mask: [B, 1, Sq, Sk] --+-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+-+ --+-+ # # pad 到 [max_seq_len, max_seq_len] --+-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+-+ # global_attention_mask = padded_mask --+-+ # else: --+-+ # global_attention_mask = None --+-+ --+-+ --+-+ # sparse_mode=3 --+-+ # attn_output = mindspore.ops.flash_attention_score( --+-+ # query=query_states, --+-+ # key=key_states, --+-+ # value=value_states, --+-+ # real_shift=None, --+-+ # padding_mask=None, --+-+ --+-+ # head_num=self.num_heads, --+-+ # attn_mask=global_attention_mask, --+-+ # keep_prob=1.0 - self.attention_dropout, --+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+-+ # input_layout="BNSD", --+-+ # pre_tokens=2147483647, --+-+ # next_tokens=2147483647, --+-+ # inner_precise=0, --+-+ # drop_mask=None, --+-+ # prefix=None, --+-+ # actual_seq_qlen=None, --+-+ # actual_seq_kvlen=None, --+-+ # sparse_mode=sparse_mode, --+-+ # ) --+- if not output_attentions: --+- attn_weights = None --+- --+- return attn_output, attn_weights, past_key_value --+- --+- --+-+class Qwen2MoeFlashAttention(nn.Module): --+-+ """ --+-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+-+ --+-+ 关键改动: --+-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+-+ 直接传入原始的 key 和 value 张量效率更高。 --+-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+-+ """ --+-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+-+ super().__init__() --+-+ self.config = config --+-+ self.layer_idx = layer_idx --+-+ self.hidden_size = config.hidden_size --+-+ self.num_heads = config.num_attention_heads --+-+ self.head_dim = self.hidden_size // self.num_heads --+-+ self.num_key_value_heads = config.num_key_value_heads --+-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+-+ self.max_position_embeddings = config.max_position_embeddings --+-+ self.rope_theta = config.rope_theta --+-+ self.attention_dropout = config.attention_dropout --+-+ --+-+ if (self.head_dim * self.num_heads) != self.hidden_size: --+-+ raise ValueError( --+-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+-+ ) --+-+ --+-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+-+ --+-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+-+ self.head_dim, --+-+ max_position_embeddings=self.max_position_embeddings, --+-+ base=self.rope_theta, --+-+ ) --+-+ --+-+ def forward( --+-+ self, --+-+ hidden_states: mindspore.Tensor, --+-+ attention_mask: Optional[mindspore.Tensor] = None, --+-+ position_ids: Optional[mindspore.Tensor] = None, --+-+ past_key_value: Optional[Cache] = None, --+-+ output_attentions: bool = False, --+-+ use_cache: bool = False, --+-+ cache_position: Optional[mindspore.Tensor] = None, --+-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+-+ --+-+ bsz, q_len, _ = hidden_states.shape --+-+ --+-+ # 1. 线性投射 Q, K, V --+-+ query_states = self.q_proj(hidden_states) --+-+ key_states = self.k_proj(hidden_states) --+-+ value_states = self.v_proj(hidden_states) --+-+ --+-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+-+ # query: [B, S, H*D] -> [B, N1, S, D] --+-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ --+-+ # 3. RoPE 旋转位置编码 --+-+ kv_seq_len = key_states.shape[-2] --+-+ if past_key_value is not None: --+-+ if self.layer_idx is None: --+-+ raise ValueError( --+-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+-+ "with a layer index." --+-+ ) --+-+ # 对于 StaticCache,需要特殊处理 kv_seq_len --+-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+-+ if cache_position.shape[0] == 1: --+-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+-+ kv_seq_len = past_seen_tokens + 1 --+-+ else: --+-+ # prefill 阶段:cache_position 是范围,使用其长度 --+-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+-+ else: --+-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+-+ --+-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+-+ --+-+ # 4. KV 缓存更新 --+-+ if past_key_value is not None: --+-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+-+ key_states, value_states = past_key_value.update( --+-+ key_states, value_states, self.layer_idx, cache_kwargs --+-+ ) --+-+ --+-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+-+ if cache_position.shape[0] == 1: --+-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+-+ kv_seq_len = key_states.shape[-2] --+-+ --+-+ # 5. [重要] 准备 Attention Mask --+-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+-+ fa_attention_mask = None --+-+ if attention_mask is not None: --+-+ # 截取与当前key长度匹配的部分 --+-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+-+ # 转换为布尔类型: 大负数 -> True, 0 -> False --+-+ fa_attention_mask = (mask_slice != 0) --+-+ --+-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+-+ input_dtype = query_states.dtype --+-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+-+ query_states = query_states.to(mindspore.float16) --+-+ key_states = key_states.to(mindspore.float16) --+-+ value_states = value_states.to(mindspore.float16) --+-+ --+-+ # 6. [核心] 调用 flash_attention_score 算子 --+-+ # - 无需手动 repeat_kv, 算子原生支持 GQA --+-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+-+ attn_output = mindspore.ops.flash_attention_score( --+-+ query=query_states, --+-+ key=key_states, --+-+ value=value_states, --+-+ head_num=self.num_heads, # 传入Q的头数(N1) --+-+ attn_mask=fa_attention_mask, --+-+ keep_prob=1.0 - self.attention_dropout, --+-+ scalar_value=1.0 / math.sqrt(self.head_dim), --+-+ input_layout="BNSD", --+-+ sparse_mode=0 # 使用 defaultMask 模式 --+-+ ) --+-+ --+-+ # 恢复原始数据类型 --+-+ attn_output = attn_output.to(input_dtype) --+-+ --+-+ # 7. 调整输出形状 --+-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+-+ attn_output = self.o_proj(attn_output) --+-+ --+-+ # FlashAttention 算子不直接返回注意力权重矩阵 --+-+ attn_weights = None --+-+ if output_attentions: --+-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+-+ --+-+ return attn_output, attn_weights, past_key_value --+-+ --+-+ # def forward( --+-+ # self, --+-+ # hidden_states: mindspore.Tensor, --+-+ # attention_mask: Optional[mindspore.Tensor] = None, --+-+ # position_ids: Optional[mindspore.Tensor] = None, --+-+ # past_key_value: Optional[Cache] = None, --+-+ # output_attentions: bool = False, --+-+ # use_cache: bool = False, --+-+ # cache_position: Optional[mindspore.Tensor] = None, --+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+-+ --+-+ # bsz, q_len, _ = hidden_states.shape --+-+ --+-+ # # 1. 线性投射 Q, K, V --+-+ # query_states = self.q_proj(hidden_states) --+-+ # key_states = self.k_proj(hidden_states) --+-+ # value_states = self.v_proj(hidden_states) --+-+ --+-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ --+-+ # # 3. RoPE 旋转位置编码 --+-+ # kv_seq_len = key_states.shape[-2] --+-+ # if past_key_value is not None: --+-+ # if self.layer_idx is None: --+-+ # raise ValueError( --+-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+-+ # "with a layer index." --+-+ # ) --+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+-+ --+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+-+ --+-+ # # 4. KV 缓存更新 --+-+ # if past_key_value is not None: --+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+-+ # key_states, value_states = past_key_value.update( --+-+ # key_states, value_states, self.layer_idx, cache_kwargs --+-+ # ) --+-+ --+-+ # # 5. 准备 Attention Mask --+-+ # fa_attention_mask = None --+-+ # if attention_mask is not None: --+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+-+ # fa_attention_mask = (mask_slice != 0) --+-+ --+-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+-+ # input_dtype = query_states.dtype --+-+ --+-+ # # 6. [核心] 调用 flash_attention_score 算子 --+-+ # attn_output = mindspore.ops.flash_attention_score( --+-+ # query=query_states, --+-+ # key=key_states, --+-+ # value=value_states, --+-+ # head_num=self.num_heads, --+-+ # attn_mask=fa_attention_mask, --+-+ # keep_prob=1.0 - self.attention_dropout, --+-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+-+ # input_layout="BNSD", --+-+ # sparse_mode=0, --+-+ # # <--- 修改点 2: 启用内部高精度计算 --- --+-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+-+ # inner_precise=1 --+-+ # ) --+-+ --+-+ # # 恢复原始数据类型 --+-+ # attn_output = attn_output.to(input_dtype) --+-+ --+-+ # # 7. 调整输出形状 --+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+-+ # attn_output = self.o_proj(attn_output) --+-+ --+-+ # attn_weights = None --+-+ # if output_attentions: --+-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+-+ --+-+ # return attn_output, attn_weights, past_key_value --+-+ --+-+ # def forward( --+-+ # self, --+-+ # hidden_states: mindspore.Tensor, --+-+ # attention_mask: Optional[mindspore.Tensor] = None, --+-+ # position_ids: Optional[mindspore.Tensor] = None, --+-+ # past_key_value: Optional[Cache] = None, --+-+ # output_attentions: bool = False, --+-+ # use_cache: bool = False, --+-+ # cache_position: Optional[mindspore.Tensor] = None, --+-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+-+ --+-+ # bsz, q_len, _ = hidden_states.shape --+-+ --+-+ # query_states = self.q_proj(hidden_states) --+-+ # key_states = self.k_proj(hidden_states) --+-+ # value_states = self.v_proj(hidden_states) --+-+ --+-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-+ --+-+ # kv_seq_len = key_states.shape[-2] --+-+ # if past_key_value is not None: --+-+ # if self.layer_idx is None: --+-+ # raise ValueError("`layer_idx` must be specified for caching") --+-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+-+ --+-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+-+ --+-+ # if past_key_value is not None: --+-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+-+ # key_states, value_states = past_key_value.update( --+-+ # key_states, value_states, self.layer_idx, cache_kwargs --+-+ # ) --+-+ --+-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+-+ --+-+ # # <--- 核心修改点: 手动进行高精度缩放 --- --+-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+-+ # query_states = query_states / math.sqrt(self.head_dim) --+-+ # # <--- 修改结束 --- --+-+ --+-+ # fa_attention_mask = None --+-+ # if attention_mask is not None: --+-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+-+ # fa_attention_mask = (mask_slice != 0) --+-+ --+-+ # input_dtype = query_states.dtype --+-+ --+-+ # attn_output = mindspore.ops.flash_attention_score( --+-+ # query=query_states, # 传入已经预先缩放过的 query --+-+ # key=key_states, --+-+ # value=value_states, --+-+ # head_num=self.num_heads, --+-+ # attn_mask=fa_attention_mask, --+-+ # keep_prob=1.0 - self.attention_dropout, --+-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+-+ # input_layout="BNSD", --+-+ # sparse_mode=0, --+-+ # inner_precise=1 # 仍然保持内部高精度计算 --+-+ # ) --+-+ --+-+ # attn_output = attn_output.to(input_dtype) --+-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+-+ # attn_output = self.o_proj(attn_output) --+-+ --+-+ # attn_weights = None --+-+ # if output_attentions: --+-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+-+ --+-+ # return attn_output, attn_weights, past_key_value --+-+ --+- QWEN2MOE_ATTENTION_CLASSES = { --+- "eager": Qwen2MoeAttention, --+-+ "flash-attention": Qwen2MoeFlashAttention, --+- } --+- --+- --+-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+- --+-+ #@dwj --+-+ # 只遍历激活的专家,而非全部专家 --+- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+-- batch_size, sequence_length, hidden_dim = hidden_states.shape --+-- hidden_states = hidden_states.view(-1, hidden_dim) --+-- # router_logits: (batch * sequence_length, n_experts) --+-- router_logits = self.gate(hidden_states) --+-- --+-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+-- if self.norm_topk_prob: --+-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+-- # we cast back to the input dtype --+-- routing_weights = routing_weights.to(hidden_states.dtype) --+-- --+-- final_hidden_states = ops.zeros( --+-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+-- ) --+-- --+-- # One hot encode the selected experts to create an expert mask --+-- # this will be used to easily index which expert is going to be sollicitated --+-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+-- --+-- # Loop over all available experts in the model and perform the computation on each expert --+-- for expert_idx in range(self.num_experts): --+-- expert_layer = self.experts[expert_idx] --+-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+-- --+-- # Index the correct hidden states and compute the expert hidden state for --+-- # the current expert. We need to make sure to multiply the output hidden --+-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+-- if 0 not in idx.shape: --+-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+-- --+-- # However `index_add_` only support torch tensors for indexing so we'll use --+-- # the `top_x` tensor here. --+-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+-- --+-- shared_expert_output = self.shared_expert(hidden_states) --+-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+-- --+-- final_hidden_states = final_hidden_states + shared_expert_output --+-+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+-+ num_tokens = hidden_states_reshaped.shape[0] --+-+ --+-+ router_logits = self.gate(hidden_states_reshaped) --+-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+-+ --+-+ if self.norm_topk_prob: --+-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+-+ routing_weights = routing_weights.to(hidden_states.dtype) --+-+ --+-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+-+ flat_selected_experts = selected_experts.flatten() --+-+ --+-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+-+ token_indices = broadcasted_token_indices.flatten() --+-+ --+-+ active_experts = ops.unique(flat_selected_experts) --+-+ --+-+ for expert_idx_tensor in active_experts: --+-+ expert_idx = expert_idx_tensor.item() --+-+ expert_layer = self.experts[expert_idx] --+-+ --+-+ mask = (flat_selected_experts == expert_idx_tensor) --+-+ selected_token_indices = token_indices[mask] --+-+ selected_routing_weights = routing_weights.flatten()[mask] --+-+ --+-+ current_states = hidden_states_reshaped[selected_token_indices] --+-+ --+-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+-+ --+-+ final_hidden_states = final_hidden_states.index_add( --+-+ dim=0, --+-+ index=selected_token_indices, --+-+ source=expert_output.to(hidden_states.dtype) --+-+ ) --+-+ --+-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+- --+-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+-- return final_hidden_states, router_logits --+-+ final_hidden_states = final_hidden_states + shared_expert_output --+-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+-+ --+-+ return final_hidden_states, router_logits --+- --+- --+- class Qwen2MoeDecoderLayer(nn.Module): --+-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+- --+- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+- --+-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+-+ --+- if (layer_idx not in config.mlp_only_layers) and ( --+- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+- ): --+-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+- _no_split_modules = ["Qwen2MoeDecoderLayer"] --+- _skip_keys_device_placement = "past_key_values" --+- _supports_cache_class = True --+-+#lwx --+-+ # _supports_static_cache = True --+- --+- def _init_weights(self, module): --+- std = self.config.initializer_range --+-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+- return causal_mask --+- --+- --+--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+- _tied_weights_keys = ["lm_head.weight"] --+- --+- def __init__(self, config): --+-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+- self.num_experts_per_tok = config.num_experts_per_tok --+- # Initialize weights and apply final processing --+- self.post_init() --+-+ # @lwx --+-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+-+ # self.generation_config.cache_implementation = "static" --+-+ self._warmed_up = False --+-+ --+-+ def warmup_moe_model(self): --+-+ print("[Warmup] Qwen2-MoE 模型预热开始...") --+-+ test_texts = [ --+-+ "warmup short", --+-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+-+ ] --+-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+-+ if tokenizer is None: --+-+ from mindnlp.transformers import AutoTokenizer --+-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+-+ self._warmup_tokenizer = tokenizer --+-+ --+-+ for text in test_texts: --+-+ inputs = tokenizer(text, return_tensors="ms") --+-+ with mindspore._no_grad(): --+-+ _ = self(**inputs, output_router_logits=True, use_cache=False) --+-+ print("[Warmup] Qwen2-MoE 模型预热完成。") --+- --+- def get_input_embeddings(self): --+- return self.model.embed_tokens --+-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+- ```""" --+-+ if not self._warmed_up: --+-+ self._warmed_up = True --+-+ self.warmup_moe_model() --+- --+- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+- output_router_logits = ( --+-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+- } --+- ) --+- return model_inputs --+-+# @lwx --+-+ # def _decode_one_tokens_logits( --+-+ # self, --+-+ # cur_token: mindspore.Tensor, --+-+ # input_pos: Optional[mindspore.Tensor], --+-+ # cache_position: mindspore.Tensor, --+-+ # past_key_values: StaticCache, --+-+ # ) -> mindspore.Tensor: --+-+ # """ --+-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+-+ --+-+ # Args: --+-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+-+ # input_pos: 输入位置信息,可选 --+-+ # cache_position: 当前token在cache中的位置,shape为(1,) --+-+ # past_key_values: StaticCache对象,存储之前的key-value状态 --+-+ --+-+ # Returns: --+-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+-+ # """ --+-+ # # 调用JIT编译的版本 --+-+ # return self.get_decode_one_tokens_logits( --+-+ # cur_token=cur_token, --+-+ # input_pos=input_pos, --+-+ # cache_position=cache_position, --+-+ # past_key_values=past_key_values, --+-+ # ) --+-+ --+-+ # @mindspore.jit(jit_level='O1') --+-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+-+ # """ --+-+ # JIT编译的函数,用于高效的单token解码 --+-+ # 使用JIT编译优化以支持静态shape和高效执行 --+-+ --+-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+-+ # """ --+-+ # outputs = self.model.forward( --+-+ # input_ids=cur_token, --+-+ # position_ids=input_pos, --+-+ # cache_position=cache_position, --+-+ # past_key_values=past_key_values, --+-+ # use_cache=True, --+-+ # return_dict=False, --+-+ # ) --+-+ --+-+ # hidden_states = outputs[0] --+-+ # logits = self.lm_head.forward(hidden_states) --+-+ # logits = logits.float() --+-+ --+-+ # return logits[:, -1, :] --+-+ --+-+ # def _sample( --+-+ # self, --+-+ # input_ids: mindspore.Tensor, --+-+ # logits_processor, --+-+ # stopping_criteria, --+-+ # generation_config, --+-+ # synced_devices: bool, --+-+ # streamer=None, --+-+ # logits_warper=None, --+-+ # **model_kwargs, --+-+ # ): --+-+ # """ --+-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+-+ # """ --+-+ # from ...generation.logits_process import LogitsProcessorList --+-+ # from ...generation.stopping_criteria import StoppingCriteriaList --+-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+-+ # from mindnlp.core import nn, ops, no_grad --+-+ # import numpy as np --+-+ --+-+ # # 检查是否使用 StaticCache --+-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+-+ # # 否则,直接调用父类方法 --+-+ # past_key_values = model_kwargs.get("past_key_values") --+-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+-+ --+-+ # if not isinstance(past_key_values, StaticCache): --+-+ # # 不使用 StaticCache,直接调用父类方法 --+-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+-+ # return super()._sample( --+-+ # input_ids=input_ids, --+-+ # logits_processor=logits_processor, --+-+ # stopping_criteria=stopping_criteria, --+-+ # generation_config=generation_config, --+-+ # synced_devices=synced_devices, --+-+ # streamer=streamer, --+-+ # logits_warper=logits_warper, --+-+ # **model_kwargs, --+-+ # ) --+-+ --+-+ # # 使用 StaticCache,进入自定义循环 --+-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+-+ # pad_token_id = generation_config._pad_token_tensor --+-+ # output_attentions = generation_config.output_attentions --+-+ # output_hidden_states = generation_config.output_hidden_states --+-+ # output_scores = generation_config.output_scores --+-+ # output_logits = generation_config.output_logits --+-+ # return_dict_in_generate = generation_config.return_dict_in_generate --+-+ # max_length = generation_config.max_length --+-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+-+ # do_sample = generation_config.do_sample --+-+ --+-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+-+ # raise ValueError( --+-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+-+ # f"{logits_warper})." --+-+ # ) --+-+ --+-+ # # init attention / hidden states / scores tuples --+-+ # scores = () if (return_dict_in_generate and output_scores) else None --+-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+-+ --+-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+-+ # if return_dict_in_generate and self.config.is_encoder_decoder: --+-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+-+ # encoder_hidden_states = ( --+-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+-+ # ) --+-+ --+-+ # # keep track of which sequences are already finished --+-+ # batch_size, cur_len = input_ids.shape --+-+ # this_peer_finished = False --+-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+-+ --+-+ # time_record = [] --+-+ # from ....utils.testing_utils import parse_flag_from_env --+-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+-+ --+-+ # while self._has_unfinished_sequences( --+-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+-+ # ): --+-+ # if _record_time: --+-+ # import time as time_module --+-+ # infer_start = time_module.time() --+-+ --+-+ # # prepare model inputs --+-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+-+ --+-+ # # prepare variable output controls --+-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+-+ --+-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+-+ # cur_cache_position = model_inputs.get("cache_position") --+-+ # cur_past_key_values = model_inputs.get("past_key_values") --+-+ # cur_input_ids = model_inputs.get("input_ids") --+-+ --+-+ # if (isinstance(cur_past_key_values, StaticCache) and --+-+ # cur_cache_position is not None and --+-+ # len(cur_cache_position.shape) > 0 and --+-+ # cur_cache_position.shape[0] == 1 and --+-+ # cur_input_ids is not None and --+-+ # cur_input_ids.shape[1] == 1): --+-+ # # 使用 JIT 优化的单 token 解码 --+-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+-+ # if not hasattr(self, '_jit_used'): --+-+ # self._jit_used = False --+-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+-+ --+-+ # next_token_logits = self.get_decode_one_tokens_logits( --+-+ # cur_token=cur_input_ids, --+-+ # input_pos=model_inputs.get("position_ids"), --+-+ # cache_position=cur_cache_position, --+-+ # past_key_values=cur_past_key_values, --+-+ # ) --+-+ --+-+ # # 标记已使用JIT(用于后续判断) --+-+ # if not self._jit_used: --+-+ # self._jit_used = True --+-+ --+-+ # # 构造兼容的输出对象 --+-+ # class JitOptimizedOutput: --+-+ # def __init__(self, logits, config): --+-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+-+ # self.config = config --+-+ # # 对于 JIT 优化路径,这些属性通常不需要 --+-+ # self.decoder_attentions = None if config.is_encoder_decoder else None --+-+ # self.attentions = None if not config.is_encoder_decoder else None --+-+ # self.cross_attentions = None --+-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+-+ # self.hidden_states = None if not config.is_encoder_decoder else None --+-+ --+-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+-+ # else: --+-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+-+ # outputs = self(**model_inputs, return_dict=True) --+-+ --+-+ # if synced_devices and this_peer_finished: --+-+ # continue --+-+ --+-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+-+ # next_token_logits = outputs.logits[:, -1, :] --+-+ --+-+ # # pre-process distribution --+-+ # next_token_scores = logits_processor(input_ids, next_token_logits) --+-+ # if do_sample: --+-+ # next_token_scores = logits_warper(input_ids, next_token_scores) --+-+ --+-+ # # Store scores, attentions and hidden_states when required --+-+ # if return_dict_in_generate: --+-+ # if output_scores: --+-+ # scores += (next_token_scores,) --+-+ # if output_logits: --+-+ # raw_logits += (next_token_logits,) --+-+ # if output_attentions: --+-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+-+ # decoder_attentions += (attn,) if attn is not None else (None,) --+-+ # if self.config.is_encoder_decoder: --+-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+-+ --+-+ # if output_hidden_states: --+-+ # hidden = ( --+-+ # outputs.decoder_hidden_states --+-+ # if self.config.is_encoder_decoder --+-+ # else outputs.hidden_states --+-+ # ) --+-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+-+ --+-+ # # token selection --+-+ # if do_sample: --+-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+-+ # else: --+-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+-+ --+-+ # # finished sentences should have their next token be a padding token --+-+ # if has_eos_stopping_criteria: --+-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+-+ --+-+ # # update generated ids, model inputs, and length for next step --+-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+-+ # if streamer is not None: --+-+ # streamer.put(next_tokens) --+-+ --+-+ # model_kwargs = self._update_model_kwargs_for_generation( --+-+ # outputs, --+-+ # model_kwargs, --+-+ # is_encoder_decoder=self.config.is_encoder_decoder, --+-+ # ) --+-+ --+-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+-+ # cur_len += 1 --+-+ --+-+ # if _record_time: --+-+ # import time as time_module --+-+ # infer_stop = time_module.time() --+-+ # time_record.append(infer_stop - infer_start) --+-+ --+-+ # del outputs --+-+ --+-+ # average_infer_time = None --+-+ # if time_record: --+-+ # if len(time_record) > 1: --+-+ # time_record.pop(0) --+-+ # average_infer_time = sum(time_record) / len(time_record) --+-+ # print(f'average inference time is: {average_infer_time}') --+-+ # print(f'inference time record: {time_record}') --+-+ --+-+ # if streamer is not None: --+-+ # streamer.end() --+-+ --+-+ # # 简单判断:打印是否使用了JIT路径 --+-+ # if hasattr(self, '_jit_used') and self._jit_used: --+-+ # print("[JIT] ✓ JIT optimization was used during generation") --+-+ # else: --+-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+-+ --+-+ # if return_dict_in_generate: --+-+ # if self.config.is_encoder_decoder: --+-+ # return GenerateEncoderDecoderOutput( --+-+ # sequences=input_ids, --+-+ # scores=scores, --+-+ # logits=raw_logits, --+-+ # encoder_attentions=encoder_attentions, --+-+ # encoder_hidden_states=encoder_hidden_states, --+-+ # decoder_attentions=decoder_attentions, --+-+ # cross_attentions=cross_attentions, --+-+ # decoder_hidden_states=decoder_hidden_states, --+-+ # past_key_values=model_kwargs.get("past_key_values"), --+-+ # average_infer_time=average_infer_time --+-+ # ) --+-+ # else: --+-+ # return GenerateDecoderOnlyOutput( --+-+ # sequences=input_ids, --+-+ # scores=scores, --+-+ # logits=raw_logits, --+-+ # attentions=decoder_attentions, --+-+ # hidden_states=decoder_hidden_states, --+-+ # past_key_values=model_kwargs.get("past_key_values"), --+-+ # average_infer_time=average_infer_time --+-+ # ) --+-+ # else: --+-+ # return input_ids --+-+ --+-+ # def _prepare_cache_for_generation( --+-+ # self, --+-+ # generation_config, --+-+ # model_kwargs, --+-+ # assistant_model, --+-+ # batch_size, --+-+ # max_cache_length, --+-+ # ): --+-+ # if generation_config.cache_implementation is None and self._supports_static_cache: --+-+ # generation_config.cache_implementation = "static" --+-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+-+ --+-+ # if generation_config.cache_implementation == "static": --+-+ # base_required_from_max_length = generation_config.max_length + 1 --+-+ # base_required = max(max_cache_length, base_required_from_max_length) --+-+ # min_cache_size = 50 --+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+-+ # else: --+-+ # max_cache_length = max(base_required, min_cache_size) --+-+ --+-+ # original_max_cache_length = max_cache_length --+-+ # print(f"[JIT] StaticCache max_cache_length calculation:") --+-+ # print(f" - input max_cache_length: {original_max_cache_length}") --+-+ # print(f" - generation_config.max_length: {generation_config.max_length}") --+-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+-+ # print(f" - final max_cache_length: {max_cache_length}") --+-+ --+-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+-+ # if max_cache_length > self.config.max_position_embeddings: --+-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+-+ --+-+ # result = super()._prepare_cache_for_generation( --+-+ # generation_config=generation_config, --+-+ # model_kwargs=model_kwargs, --+-+ # assistant_model=assistant_model, --+-+ # batch_size=batch_size, --+-+ # max_cache_length=max_cache_length, --+-+ # ) --+-+ --+-+ # if generation_config.cache_implementation == "static": --+-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+-+ # created_cache = model_kwargs.get(cache_name) --+-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+-+ # if created_cache.max_cache_len < generation_config.max_length: --+-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+-+ --+-+ # return result --+-+ --+-+ --+-+ --+- --+- --+- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+--- --+-2.27.0 --+- --+-- --+2.27.0 --+ ---- --2.27.0 -- -diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch -deleted file mode 100644 -index 7217a46b..00000000 ---- a/patches/0005-20251107001commit.patch -+++ /dev/null -@@ -1,7707 +0,0 @@ --From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Fri, 7 Nov 2025 11:48:18 +0800 --Subject: [PATCH 5/8] 20251107001commit -- ----- -- .../models/deepseek/modeling_deepseek.py | 91 +- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- -- .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- -- patches/0001-20251104commit.patch | 2 +- -- patches/0002-20251106commit.patch | 2 +- -- patches/0003-20261106secondcommit.patch | 2 +- -- patches/0004-20251106change.patch | 7498 +++++++++++++++++ -- 7 files changed, 7577 insertions(+), 30 deletions(-) -- create mode 100644 patches/0004-20251106change.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index 0546f318..8831e4b7 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): -- # expert_cache += expert_out * weight -- # return expert_cache -- --- @no_grad() --- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --- # x 的 shape: (1, hidden_size) --- # flat_expert_indices 的 shape: (num_experts_per_tok,) --- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --- --- # 1. 收集所有需要的专家层 --- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --- selected_experts = [self.experts[i] for i in flat_expert_indices] --- --- # 2. 并行计算所有专家的输出 --- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --- # ops.cat 会将它们堆叠成一个新的 Tensor --- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --- --- # 3. 使用矩阵乘法进行加权求和 --- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --- # 最终结果 final_output 的 shape: (1, hidden_size) --- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+ # @no_grad() --+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ # # x 的 shape: (1, hidden_size) --+ # # flat_expert_indices 的 shape: (num_experts_per_tok,) --+ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+ --+ # # 1. 收集所有需要的专家层 --+ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+ # selected_experts = [self.experts[i] for i in flat_expert_indices] --+ --+ # # 2. 并行计算所有专家的输出 --+ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+ # # ops.cat 会将它们堆叠成一个新的 Tensor --+ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+ --+ # # 3. 使用矩阵乘法进行加权求和 --+ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ # # 最终结果 final_output 的 shape: (1, hidden_size) --+ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -- --- return final_output --+ # return final_output -- -- -- # @no_grad() --@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): -- ) -- -- return expert_cache --+# 放置在 DeepseekMoE 类中 --+ @no_grad() --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ """ --+ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --+ --+ Args: --+ x (Tensor): 输入张量, shape: (1, hidden_size) --+ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --+ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --+ """ --+ top_k, _ = flat_expert_weights.shape --+ hidden_size = x.shape[-1] --+ --+ # 1. 将所有专家的权重堆叠起来 --+ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --+ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --+ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --+ --+ # 2. "收集" 所需的专家权重 --+ selected_gate_w = stacked_gate_w[flat_expert_indices] --+ selected_up_w = stacked_up_w[flat_expert_indices] --+ selected_down_w = stacked_down_w[flat_expert_indices] --+ --+ # 3. 准备输入 --+ x_expanded = x.expand((top_k, 1, hidden_size)) --+ --+ # 4. 并行计算 gate_proj 和 up_proj --+ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --+ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --+ --+ # 5. 计算中间状态 --+ intermediate_states = self.experts[0].act_fn(gate_out) * up_out --+ --+ # 6. 并行计算 down_proj --+ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --+ # --- [FIX] --- --+ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --+ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --+ # --- [FIX END] --- --+ --+ # 7. 根据路由权重进行加权求和 --+ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --+ --+ return weighted_sum --+ --+ -- -- # @no_grad() -- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index ebd7782e..913a7609 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): -- # Copied from transformers.models.llama.modeling_llama.rotate_half -- def rotate_half(x): -- """Rotates half the hidden dims of the input.""" --- x1 = x[..., : x.shape[-1] // 2] --- x2 = x[..., x.shape[-1] // 2 :] --+ # x1 = x[..., : x.shape[-1] // 2] --+ # x2 = x[..., x.shape[-1] // 2 :] -- # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --index d059dcbe..2b217b64 100644 ----- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): -- # Copied from transformers.models.llama.modeling_llama.rotate_half -- def rotate_half(x): -- """Rotates half the hidden dims of the input.""" --- x1 = x[..., : x.shape[-1] // 2] --- x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+ # x1 = x[..., : x.shape[-1] // 2] --+ # x2 = x[..., x.shape[-1] // 2 :] --+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) -- return ops.cat((-x2, x1), dim=-1) -- -- --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --index 78f22642..0a0ef2d7 100644 ----- a/patches/0001-20251104commit.patch --+++ b/patches/0001-20251104commit.patch --@@ -1,7 +1,7 @@ -- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Tue, 4 Nov 2025 09:11:51 +0800 ---Subject: [PATCH 1/3] 20251104commit --+Subject: [PATCH 1/4] 20251104commit -- -- --- -- mindnlp/transformers/cache_utils.py | 28 +- --diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --index 22b65dd5..5185270c 100644 ----- a/patches/0002-20251106commit.patch --+++ b/patches/0002-20251106commit.patch --@@ -1,7 +1,7 @@ -- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 09:20:38 +0800 ---Subject: [PATCH 2/3] 20251106commit --+Subject: [PATCH 2/4] 20251106commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 379 ++++- --diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --index 966529e4..3e05f821 100644 ----- a/patches/0003-20261106secondcommit.patch --+++ b/patches/0003-20261106secondcommit.patch --@@ -1,7 +1,7 @@ -- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 14:54:37 +0800 ---Subject: [PATCH 3/3] 20261106secondcommit --+Subject: [PATCH 3/4] 20261106secondcommit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 217 ++- --diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --new file mode 100644 --index 00000000..88a1aef4 ----- /dev/null --+++ b/patches/0004-20251106change.patch --@@ -0,0 +1,7498 @@ --+From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Thu, 6 Nov 2025 15:48:09 +0800 --+Subject: [PATCH 4/4] 20251106change --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 189 +- --+ patches/0001-20251104commit.patch | 1272 +++++++ --+ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ --+ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ --+ 4 files changed, 7244 insertions(+), 186 deletions(-) --+ create mode 100644 patches/0001-20251104commit.patch --+ create mode 100644 patches/0002-20251106commit.patch --+ create mode 100644 patches/0003-20261106secondcommit.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index 2f9192bf..0546f318 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): --+ --+ return attn_output, attn_weights, past_key_value --+ --+-# class DeepseekFlashAttention(nn.Module): --+-# """ --+-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --+- --+-# This class is designed as a drop-in replacement for DeepseekAttention. --+-# """ --+- --+-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+-# super().__init__() --+-# self.config = config --+-# self.layer_idx = layer_idx --+-# if layer_idx is None: --+-# logger.warning( --+-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+-# "when creating this class." --+-# ) --+- --+-# self.attention_dropout = config.attention_dropout --+-# self.hidden_size = config.hidden_size --+-# self.num_heads = config.num_attention_heads --+-# self.head_dim = self.hidden_size // self.num_heads --+-# self.num_key_value_heads = config.num_key_value_heads --+-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+-# self.max_position_embeddings = config.max_position_embeddings --+-# self.rope_theta = config.rope_theta --+-# self.is_causal = True --+- --+-# if (self.head_dim * self.num_heads) != self.hidden_size: --+-# raise ValueError( --+-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+-# f" and `num_heads`: {self.num_heads})." --+-# ) --+- --+-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+-# self._init_rope() --+- --+-# def _init_rope(self): --+-# if self.config.rope_scaling is None: --+-# self.rotary_emb = DeepseekRotaryEmbedding( --+-# self.head_dim, --+-# max_position_embeddings=self.max_position_embeddings, --+-# base=self.rope_theta, --+-# ) --+-# else: --+-# scaling_type = self.config.rope_scaling["type"] --+-# scaling_factor = self.config.rope_scaling["factor"] --+-# if scaling_type == "linear": --+-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+-# self.head_dim, --+-# max_position_embeddings=self.max_position_embeddings, --+-# scaling_factor=scaling_factor, --+-# base=self.rope_theta, --+-# ) --+-# elif scaling_type == "dynamic": --+-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+-# self.head_dim, --+-# max_position_embeddings=self.max_position_embeddings, --+-# scaling_factor=scaling_factor, --+-# base=self.rope_theta, --+-# ) --+-# else: --+-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+- --+-# def forward( --+-# self, --+-# hidden_states: mindspore.Tensor, --+-# attention_mask: Optional[mindspore.Tensor] = None, --+-# position_ids: Optional[mindspore.Tensor] = None, --+-# past_key_value: Optional[Cache] = None, --+-# output_attentions: bool = False, --+-# use_cache: bool = False, --+-# **kwargs, --+-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+-# if "padding_mask" in kwargs: --+-# warnings.warn( --+-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+-# ) --+- --+-# if output_attentions: --+-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --+- --+-# bsz, q_len, _ = hidden_states.shape --+- --+-# if self.config.pretraining_tp > 1: --+-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+- --+-# query_states = self.q_proj(hidden_states) --+-# key_states = self.k_proj(hidden_states) --+-# value_states = self.v_proj(hidden_states) --+- --+-# # Reshape for multi-head attention --+-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+- --+-# kv_seq_len = key_states.shape[-2] --+-# if past_key_value is not None: --+-# if self.layer_idx is None: --+-# raise ValueError( --+-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+-# "with a layer index." --+-# ) --+-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+- --+-# # Apply Rotary Positional Embedding --+-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+- --+-# if past_key_value is not None: --+-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --+-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+- --+-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --+-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --+-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+- --+-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+- --+-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+- --+-# # Convert attention_mask for flash_attention_score --+-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --+-# if attention_mask is not None: --+-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --+-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+-# raise ValueError( --+-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+-# ) --+-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --+-# else: --+-# attn_mask_for_fa = None --+- --+-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+- --+-# # Call the fused flash_attention_score operator --+-# attn_output = mindspore.ops.flash_attention_score( --+-# query=query_states_for_fa, --+-# key=key_states_for_fa, --+-# value=value_states_for_fa, --+-# head_num=self.num_heads, # This is N1, the number of query heads --+-# input_layout='BSH', --+-# attn_mask=attn_mask_for_fa, --+-# keep_prob=keep_prob, --+-# scalar_value=1.0 / math.sqrt(self.head_dim), --+-# sparse_mode=0 # Default mask mode --+-# ) --+- --+-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --+-# attn_output = self.o_proj(attn_output) --+- --+-# # Flash Attention does not return attention weights --+-# attn_weights = None --+- --+-# return attn_output, attn_weights, past_key_value --+ --+ class DeepseekFlashAttention(nn.Module): --+ """ --+@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): --+ super().__init__() --+ self.hidden_size = config.hidden_size --+ --+- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --+- config=config, layer_idx=layer_idx --+- ) --++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --++ # config=config, layer_idx=layer_idx --++ # ) --+ --+ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --+ config=config, layer_idx=layer_idx --+@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): --+ return outputs --+ --+ --+- --+ class DeepseekPreTrainedModel(PreTrainedModel): --+ config_class = DeepseekConfig --+ base_model_prefix = "model" --+@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+ # Initialize weights and apply final processing --+ self.post_init() --+ self.warm_up = False --+- #@dwj --+- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --+- self.num_layers, --+- self.num_attention_heads, --+- self.head_dim, --+- batch_size=1, --+- max_length=self.max_length, --+- dtype=mindspore.float16 --+- ) --+- --+- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --+- key_cache = [] --+- value_cache = [] --+- for _ in range(num_layers): --+- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+- key_cache.append(k) --+- value_cache.append(v) --+- return key_cache, value_cache --+- --+ --+ def warmup_moe_model_deep(self): --+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+new file mode 100644 --+index 00000000..78f22642 --+--- /dev/null --++++ b/patches/0001-20251104commit.patch --+@@ -0,0 +1,1272 @@ --++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Tue, 4 Nov 2025 09:11:51 +0800 --++Subject: [PATCH 1/3] 20251104commit --++ --++--- --++ mindnlp/transformers/cache_utils.py | 28 +- --++ .../models/deepseek/modeling_deepseek.py | 149 ++- --++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++ 3 files changed, 976 insertions(+), 87 deletions(-) --++ --++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++index cadd2e04..02f8d4be 100644 --++--- a/mindnlp/transformers/cache_utils.py --+++++ b/mindnlp/transformers/cache_utils.py --++@@ -812,14 +812,26 @@ class StaticCache(Cache): --++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++ # k_out[:, :, cache_position] = key_states --++ # v_out[:, :, cache_position] = value_states --++- if ON_ORANGE_PI: --++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++- else: --++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++- --+++ # if ON_ORANGE_PI: --+++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++ # else: --+++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++ # 确保 cache_position 是 1D tensor 并且类型正确 --+++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++ if cache_position.ndim > 1: --+++ cache_position = cache_position.flatten() --+++ # 确保类型是 int32 或 int64(MindSpore 要求) --+++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++ cache_position = cache_position.int() --+++ --+++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++ k_out[:, :, cache_position] = key_states --+++ v_out[:, :, cache_position] = value_states --+++ --++ return k_out, v_out --++ --++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index c695b944..d8303e45 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++- x1 = x[..., : x.shape[-1] // 2] --++- x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++ # x1 = x[..., : x.shape[-1] // 2] --+++ # x2 = x[..., x.shape[-1] // 2 :] --+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++ if self.training: --++ raise NotImplementedError("Training is not supported yet.") --++ else: --++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++- if self.config.n_shared_experts is not None: --++- y = y + self.shared_experts(identity) --++- return y --+++ # @lwx --+++ if orig_shape[1] == 1: --+++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++ y=y.view(*orig_shape) --+++ if self.config.n_shared_experts is not None: --+++ y = y + self.shared_experts(identity) --+++ return y --+++ else: --+++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++ if self.config.n_shared_experts is not None: --+++ y = y + self.shared_experts(identity) --+++ return y --+++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++ # if self.config.n_shared_experts is not None: --+++ # y = y + self.shared_experts(identity) --+++ # return y --+++ --+++ @no_grad() --+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ --+++ expert_cache = ops.zeros_like(x) --+++ for i in range(self.num_experts_per_tok): --+++ expert_id = flat_expert_indices[i].item() --+++ weight = flat_expert_weights[i].item() --+++ expert = self.experts[expert_id] --+++ expert_out = expert(x) --+++ expert_cache += expert_out * weight --+++ return expert_cache --++ --++ @no_grad() --++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++- # expert_cache = torch.zeros_like(x) --++- # idxs = flat_expert_indices.argsort() --++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++- # token_idxs = idxs // self.num_experts_per_tok --++- # for i, end_idx in enumerate(tokens_per_expert): --++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++- # if start_idx == end_idx: --++- # continue --++- # expert = self.experts[i] --++- # exp_token_idx = token_idxs[start_idx:end_idx] --++- # expert_tokens = x[exp_token_idx] --++- # expert_out = expert(expert_tokens) --++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++- # return expert_cache --+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++ expert_cache = ops.zeros_like(x) --++ idxs = flat_expert_indices.argsort() --++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++ token_idxs = idxs // self.num_experts_per_tok --+++ --++ for i, end_idx in enumerate(tokens_per_expert): --++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++ if start_idx == end_idx: --++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++ expert_out = expert(expert_tokens) --++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --++ return expert_cache --+++ --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # # expert_cache = torch.zeros_like(x) --+++ # # idxs = flat_expert_indices.argsort() --+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++ # # token_idxs = idxs // self.num_experts_per_tok --+++ # # for i, end_idx in enumerate(tokens_per_expert): --+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++ # # if start_idx == end_idx: --+++ # # continue --+++ # # expert = self.experts[i] --+++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # # expert_tokens = x[exp_token_idx] --+++ # # expert_out = expert(expert_tokens) --+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++ # # return expert_cache --+++ # expert_cache = ops.zeros_like(x) --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # for i, end_idx in enumerate(tokens_per_expert): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # if start_idx == end_idx: --+++ # continue --+++ # expert = self.experts[i] --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = expert(expert_tokens) --+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --+++ # return expert_cache --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # expert_cache = ops.zeros_like(x) --+++ --+++ # # 排序保证顺序一致 --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # # 找出有 token 的专家 --+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++ --+++ # for i in active_experts.tolist(): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # end_idx = tokens_per_expert[i] --+++ # if start_idx == end_idx: # 没有 token --+++ # continue --+++ --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = self.experts[i](expert_tokens) --+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++ --+++ # expert_cache = mindspore.mint.scatter_add( --+++ # expert_cache, --+++ # 0, --+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++ # expert_out --+++ # ) --+++ --+++ # return expert_cache --+++ --+++ --++ --++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++ # """ --++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ --++ # Initialize weights and apply final processing --++ self.post_init() --+++ self.warm_up = False --+++ --+++ def warmup_moe_model_deep(self): --+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++ test_texts = [ --+++ "warmup short", --+++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++ ] --+++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++ if tokenizer is None: --+++ from mindnlp.transformers import AutoTokenizer --+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++ self._warmup_tokenizer = tokenizer --+++ --+++ for text in test_texts: --+++ inputs = tokenizer(text, return_tensors="ms") --+++ with mindspore._no_grad(): --+++ _ = self(**inputs, use_cache=False) --+++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++ --++ def get_input_embeddings(self): --++ return self.model.embed_tokens --++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++ ```""" --+++ if not self.warm_up: --+++ self.warm_up = True --+++ self.warmup_moe_model_deep() --+++ --++ output_attentions = ( --++ output_attentions --++ if output_attentions is not None --++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++index 3cbf820e..d4c6b651 100644 --++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++@@ -18,7 +18,6 @@ --++ # See the License for the specific language governing permissions and --++ # limitations under the License. --++ """MindSpore Qwen2MoE model.""" --++- --++ import math --++ from typing import List, Optional, Tuple, Union --++ --++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++ TokenClassifierOutput, --++ ) --++ from ...modeling_utils import PreTrainedModel --+++from ...generation import GenerationMixin --++ from ....utils import logging --++ from .configuration_qwen2_moe import Qwen2MoeConfig --++ --++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++ self.variance_epsilon = eps --++ --++ def forward(self, hidden_states): --+++ # @dwj --+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++ # @lwx --+++ # if not self.training : --+++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++ input_dtype = hidden_states.dtype --++ hidden_states = hidden_states.to(mindspore.float32) --++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++@@ -234,6 +239,8 @@ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++ x1 = x[..., : x.shape[-1] // 2] --++ x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++ self.config = config --++ self.hidden_size = config.hidden_size --++ self.intermediate_size = intermediate_size --+++ --++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++ self.act_fn = ACT2FN[config.hidden_act] --++ --++ def forward(self, x): --++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++- --++ --+++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++ # @lwx --+++ # gate_up_output = self.gate_up_proj(x) --+++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++ # return self.down_proj(swiglu_output) --+++ --+++ # def forward(self, x): --+++ # gate_proj_out = self.gate_proj(x) --+++ # up_proj_out = self.up_proj(x) --+++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++ # return self.down_proj(swiglu_out) --+++ --++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++ """ --++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++ use_cache: bool = False, --++ cache_position: Optional[mindspore.Tensor] = None, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ --+++ --++ bsz, q_len, _ = hidden_states.shape --++ --++ query_states = self.q_proj(hidden_states) --++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++ "with a layer index." --++ ) --++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ if isinstance(past_key_value, StaticCache): --+++ kv_seq_len = key_states.shape[-2] --+++ else: --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++ if isinstance(past_key_value, StaticCache): --+++ kv_seq_len = key_states.shape[-2] --++ --++ # repeat k/v heads if n_kv_heads < n_heads --++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++- --+++ --++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++ --++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++- raise ValueError( --++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++- f" {attn_weights.shape}" --++- ) --++- --++- if attention_mask is not None: # no matter the length, we just slice it --++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++ if attention_mask is not None: --+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++ attn_weights = attn_weights + causal_mask --++ --++ # upcast attention to fp32 --++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++ --++ attn_output = self.o_proj(attn_output) --++- --+++ # @lwx --+++ --+++ # max_seq_len = self.max_position_embeddings # 2048 --+++ --+++ # if attention_mask is not None: --+++ # # attention_mask: [B, 1, Sq, Sk] --+++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++ --+++ # # pad 到 [max_seq_len, max_seq_len] --+++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++ # global_attention_mask = padded_mask --+++ # else: --+++ # global_attention_mask = None --+++ --+++ --+++ # sparse_mode=3 --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, --+++ # key=key_states, --+++ # value=value_states, --+++ # real_shift=None, --+++ # padding_mask=None, --+++ --+++ # head_num=self.num_heads, --+++ # attn_mask=global_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++ # input_layout="BNSD", --+++ # pre_tokens=2147483647, --+++ # next_tokens=2147483647, --+++ # inner_precise=0, --+++ # drop_mask=None, --+++ # prefix=None, --+++ # actual_seq_qlen=None, --+++ # actual_seq_kvlen=None, --+++ # sparse_mode=sparse_mode, --+++ # ) --++ if not output_attentions: --++ attn_weights = None --++ --++ return attn_output, attn_weights, past_key_value --++ --++ --+++class Qwen2MoeFlashAttention(nn.Module): --+++ """ --+++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++ --+++ 关键改动: --+++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++ 直接传入原始的 key 和 value 张量效率更高。 --+++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++ """ --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++ super().__init__() --+++ self.config = config --+++ self.layer_idx = layer_idx --+++ self.hidden_size = config.hidden_size --+++ self.num_heads = config.num_attention_heads --+++ self.head_dim = self.hidden_size // self.num_heads --+++ self.num_key_value_heads = config.num_key_value_heads --+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++ self.max_position_embeddings = config.max_position_embeddings --+++ self.rope_theta = config.rope_theta --+++ self.attention_dropout = config.attention_dropout --+++ --+++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++ raise ValueError( --+++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++ ) --+++ --+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++ --+++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++ self.head_dim, --+++ max_position_embeddings=self.max_position_embeddings, --+++ base=self.rope_theta, --+++ ) --+++ --+++ def forward( --+++ self, --+++ hidden_states: mindspore.Tensor, --+++ attention_mask: Optional[mindspore.Tensor] = None, --+++ position_ids: Optional[mindspore.Tensor] = None, --+++ past_key_value: Optional[Cache] = None, --+++ output_attentions: bool = False, --+++ use_cache: bool = False, --+++ cache_position: Optional[mindspore.Tensor] = None, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ # 1. 线性投射 Q, K, V --+++ query_states = self.q_proj(hidden_states) --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++ # query: [B, S, H*D] -> [B, N1, S, D] --+++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # 3. RoPE 旋转位置编码 --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++ if self.layer_idx is None: --+++ raise ValueError( --+++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ "with a layer index." --+++ ) --+++ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++ if cache_position.shape[0] == 1: --+++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++ kv_seq_len = past_seen_tokens + 1 --+++ else: --+++ # prefill 阶段:cache_position 是范围,使用其长度 --+++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++ else: --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # 4. KV 缓存更新 --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ key_states, value_states = past_key_value.update( --+++ key_states, value_states, self.layer_idx, cache_kwargs --+++ ) --+++ --+++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++ if cache_position.shape[0] == 1: --+++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++ kv_seq_len = key_states.shape[-2] --+++ --+++ # 5. [重要] 准备 Attention Mask --+++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++ fa_attention_mask = None --+++ if attention_mask is not None: --+++ # 截取与当前key长度匹配的部分 --+++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++ fa_attention_mask = (mask_slice != 0) --+++ --+++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++ input_dtype = query_states.dtype --+++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++ query_states = query_states.to(mindspore.float16) --+++ key_states = key_states.to(mindspore.float16) --+++ value_states = value_states.to(mindspore.float16) --+++ --+++ # 6. [核心] 调用 flash_attention_score 算子 --+++ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++ attn_output = mindspore.ops.flash_attention_score( --+++ query=query_states, --+++ key=key_states, --+++ value=value_states, --+++ head_num=self.num_heads, # 传入Q的头数(N1) --+++ attn_mask=fa_attention_mask, --+++ keep_prob=1.0 - self.attention_dropout, --+++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++ input_layout="BNSD", --+++ sparse_mode=0 # 使用 defaultMask 模式 --+++ ) --+++ --+++ # 恢复原始数据类型 --+++ attn_output = attn_output.to(input_dtype) --+++ --+++ # 7. 调整输出形状 --+++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ attn_output = self.o_proj(attn_output) --+++ --+++ # FlashAttention 算子不直接返回注意力权重矩阵 --+++ attn_weights = None --+++ if output_attentions: --+++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++ # def forward( --+++ # self, --+++ # hidden_states: mindspore.Tensor, --+++ # attention_mask: Optional[mindspore.Tensor] = None, --+++ # position_ids: Optional[mindspore.Tensor] = None, --+++ # past_key_value: Optional[Cache] = None, --+++ # output_attentions: bool = False, --+++ # use_cache: bool = False, --+++ # cache_position: Optional[mindspore.Tensor] = None, --+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ # bsz, q_len, _ = hidden_states.shape --+++ --+++ # # 1. 线性投射 Q, K, V --+++ # query_states = self.q_proj(hidden_states) --+++ # key_states = self.k_proj(hidden_states) --+++ # value_states = self.v_proj(hidden_states) --+++ --+++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # # 3. RoPE 旋转位置编码 --+++ # kv_seq_len = key_states.shape[-2] --+++ # if past_key_value is not None: --+++ # if self.layer_idx is None: --+++ # raise ValueError( --+++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ # "with a layer index." --+++ # ) --+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # # 4. KV 缓存更新 --+++ # if past_key_value is not None: --+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ # key_states, value_states = past_key_value.update( --+++ # key_states, value_states, self.layer_idx, cache_kwargs --+++ # ) --+++ --+++ # # 5. 准备 Attention Mask --+++ # fa_attention_mask = None --+++ # if attention_mask is not None: --+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # fa_attention_mask = (mask_slice != 0) --+++ --+++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++ # input_dtype = query_states.dtype --+++ --+++ # # 6. [核心] 调用 flash_attention_score 算子 --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, --+++ # key=key_states, --+++ # value=value_states, --+++ # head_num=self.num_heads, --+++ # attn_mask=fa_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++ # input_layout="BNSD", --+++ # sparse_mode=0, --+++ # # <--- 修改点 2: 启用内部高精度计算 --- --+++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++ # inner_precise=1 --+++ # ) --+++ --+++ # # 恢复原始数据类型 --+++ # attn_output = attn_output.to(input_dtype) --+++ --+++ # # 7. 调整输出形状 --+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ # attn_output = self.o_proj(attn_output) --+++ --+++ # attn_weights = None --+++ # if output_attentions: --+++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++ # return attn_output, attn_weights, past_key_value --+++ --+++ # def forward( --+++ # self, --+++ # hidden_states: mindspore.Tensor, --+++ # attention_mask: Optional[mindspore.Tensor] = None, --+++ # position_ids: Optional[mindspore.Tensor] = None, --+++ # past_key_value: Optional[Cache] = None, --+++ # output_attentions: bool = False, --+++ # use_cache: bool = False, --+++ # cache_position: Optional[mindspore.Tensor] = None, --+++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ # bsz, q_len, _ = hidden_states.shape --+++ --+++ # query_states = self.q_proj(hidden_states) --+++ # key_states = self.k_proj(hidden_states) --+++ # value_states = self.v_proj(hidden_states) --+++ --+++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ # kv_seq_len = key_states.shape[-2] --+++ # if past_key_value is not None: --+++ # if self.layer_idx is None: --+++ # raise ValueError("`layer_idx` must be specified for caching") --+++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ # if past_key_value is not None: --+++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ # key_states, value_states = past_key_value.update( --+++ # key_states, value_states, self.layer_idx, cache_kwargs --+++ # ) --+++ --+++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++ --+++ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++ # query_states = query_states / math.sqrt(self.head_dim) --+++ # # <--- 修改结束 --- --+++ --+++ # fa_attention_mask = None --+++ # if attention_mask is not None: --+++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ # fa_attention_mask = (mask_slice != 0) --+++ --+++ # input_dtype = query_states.dtype --+++ --+++ # attn_output = mindspore.ops.flash_attention_score( --+++ # query=query_states, # 传入已经预先缩放过的 query --+++ # key=key_states, --+++ # value=value_states, --+++ # head_num=self.num_heads, --+++ # attn_mask=fa_attention_mask, --+++ # keep_prob=1.0 - self.attention_dropout, --+++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++ # input_layout="BNSD", --+++ # sparse_mode=0, --+++ # inner_precise=1 # 仍然保持内部高精度计算 --+++ # ) --+++ --+++ # attn_output = attn_output.to(input_dtype) --+++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ # attn_output = self.o_proj(attn_output) --+++ --+++ # attn_weights = None --+++ # if output_attentions: --+++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++ --+++ # return attn_output, attn_weights, past_key_value --+++ --++ QWEN2MOE_ATTENTION_CLASSES = { --++ "eager": Qwen2MoeAttention, --+++ "flash-attention": Qwen2MoeFlashAttention, --++ } --++ --++ --++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --+++ #@dwj --+++ # 只遍历激活的专家,而非全部专家 --++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++- hidden_states = hidden_states.view(-1, hidden_dim) --++- # router_logits: (batch * sequence_length, n_experts) --++- router_logits = self.gate(hidden_states) --++- --++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- if self.norm_topk_prob: --++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- # we cast back to the input dtype --++- routing_weights = routing_weights.to(hidden_states.dtype) --++- --++- final_hidden_states = ops.zeros( --++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++- ) --++- --++- # One hot encode the selected experts to create an expert mask --++- # this will be used to easily index which expert is going to be sollicitated --++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++- --++- # Loop over all available experts in the model and perform the computation on each expert --++- for expert_idx in range(self.num_experts): --++- expert_layer = self.experts[expert_idx] --++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++- --++- # Index the correct hidden states and compute the expert hidden state for --++- # the current expert. We need to make sure to multiply the output hidden --++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++- if 0 not in idx.shape: --++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++- --++- # However `index_add_` only support torch tensors for indexing so we'll use --++- # the `top_x` tensor here. --++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++- --++- shared_expert_output = self.shared_expert(hidden_states) --++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++- --++- final_hidden_states = final_hidden_states + shared_expert_output --+++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++ num_tokens = hidden_states_reshaped.shape[0] --+++ --+++ router_logits = self.gate(hidden_states_reshaped) --+++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++ if self.norm_topk_prob: --+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++ flat_selected_experts = selected_experts.flatten() --+++ --+++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++ token_indices = broadcasted_token_indices.flatten() --+++ --+++ active_experts = ops.unique(flat_selected_experts) --+++ --+++ for expert_idx_tensor in active_experts: --+++ expert_idx = expert_idx_tensor.item() --+++ expert_layer = self.experts[expert_idx] --+++ --+++ mask = (flat_selected_experts == expert_idx_tensor) --+++ selected_token_indices = token_indices[mask] --+++ selected_routing_weights = routing_weights.flatten()[mask] --+++ --+++ current_states = hidden_states_reshaped[selected_token_indices] --+++ --+++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++ --+++ final_hidden_states = final_hidden_states.index_add( --+++ dim=0, --+++ index=selected_token_indices, --+++ source=expert_output.to(hidden_states.dtype) --+++ ) --+++ --+++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++ --++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++- return final_hidden_states, router_logits --+++ final_hidden_states = final_hidden_states + shared_expert_output --+++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++ --+++ return final_hidden_states, router_logits --++ --++ --++ class Qwen2MoeDecoderLayer(nn.Module): --++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++ --++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++ --++ if (layer_idx not in config.mlp_only_layers) and ( --++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++ ): --++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --++ _skip_keys_device_placement = "past_key_values" --++ _supports_cache_class = True --+++#lwx --+++ # _supports_static_cache = True --++ --++ def _init_weights(self, module): --++ std = self.config.initializer_range --++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++ return causal_mask --++ --++ --++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ _tied_weights_keys = ["lm_head.weight"] --++ --++ def __init__(self, config): --++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ self.num_experts_per_tok = config.num_experts_per_tok --++ # Initialize weights and apply final processing --++ self.post_init() --+++ # @lwx --+++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++ # self.generation_config.cache_implementation = "static" --+++ self._warmed_up = False --+++ --+++ def warmup_moe_model(self): --+++ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++ test_texts = [ --+++ "warmup short", --+++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++ ] --+++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++ if tokenizer is None: --+++ from mindnlp.transformers import AutoTokenizer --+++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++ self._warmup_tokenizer = tokenizer --+++ --+++ for text in test_texts: --+++ inputs = tokenizer(text, return_tensors="ms") --+++ with mindspore._no_grad(): --+++ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++ print("[Warmup] Qwen2-MoE 模型预热完成。") --++ --++ def get_input_embeddings(self): --++ return self.model.embed_tokens --++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++ ```""" --+++ if not self._warmed_up: --+++ self._warmed_up = True --+++ self.warmup_moe_model() --++ --++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++ output_router_logits = ( --++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++ } --++ ) --++ return model_inputs --+++# @lwx --+++ # def _decode_one_tokens_logits( --+++ # self, --+++ # cur_token: mindspore.Tensor, --+++ # input_pos: Optional[mindspore.Tensor], --+++ # cache_position: mindspore.Tensor, --+++ # past_key_values: StaticCache, --+++ # ) -> mindspore.Tensor: --+++ # """ --+++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++ --+++ # Args: --+++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++ # input_pos: 输入位置信息,可选 --+++ # cache_position: 当前token在cache中的位置,shape为(1,) --+++ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++ --+++ # Returns: --+++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++ # """ --+++ # # 调用JIT编译的版本 --+++ # return self.get_decode_one_tokens_logits( --+++ # cur_token=cur_token, --+++ # input_pos=input_pos, --+++ # cache_position=cache_position, --+++ # past_key_values=past_key_values, --+++ # ) --+++ --+++ # @mindspore.jit(jit_level='O1') --+++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++ # """ --+++ # JIT编译的函数,用于高效的单token解码 --+++ # 使用JIT编译优化以支持静态shape和高效执行 --+++ --+++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++ # """ --+++ # outputs = self.model.forward( --+++ # input_ids=cur_token, --+++ # position_ids=input_pos, --+++ # cache_position=cache_position, --+++ # past_key_values=past_key_values, --+++ # use_cache=True, --+++ # return_dict=False, --+++ # ) --+++ --+++ # hidden_states = outputs[0] --+++ # logits = self.lm_head.forward(hidden_states) --+++ # logits = logits.float() --+++ --+++ # return logits[:, -1, :] --+++ --+++ # def _sample( --+++ # self, --+++ # input_ids: mindspore.Tensor, --+++ # logits_processor, --+++ # stopping_criteria, --+++ # generation_config, --+++ # synced_devices: bool, --+++ # streamer=None, --+++ # logits_warper=None, --+++ # **model_kwargs, --+++ # ): --+++ # """ --+++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++ # """ --+++ # from ...generation.logits_process import LogitsProcessorList --+++ # from ...generation.stopping_criteria import StoppingCriteriaList --+++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++ # from mindnlp.core import nn, ops, no_grad --+++ # import numpy as np --+++ --+++ # # 检查是否使用 StaticCache --+++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++ # # 否则,直接调用父类方法 --+++ # past_key_values = model_kwargs.get("past_key_values") --+++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++ --+++ # if not isinstance(past_key_values, StaticCache): --+++ # # 不使用 StaticCache,直接调用父类方法 --+++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++ # return super()._sample( --+++ # input_ids=input_ids, --+++ # logits_processor=logits_processor, --+++ # stopping_criteria=stopping_criteria, --+++ # generation_config=generation_config, --+++ # synced_devices=synced_devices, --+++ # streamer=streamer, --+++ # logits_warper=logits_warper, --+++ # **model_kwargs, --+++ # ) --+++ --+++ # # 使用 StaticCache,进入自定义循环 --+++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++ # pad_token_id = generation_config._pad_token_tensor --+++ # output_attentions = generation_config.output_attentions --+++ # output_hidden_states = generation_config.output_hidden_states --+++ # output_scores = generation_config.output_scores --+++ # output_logits = generation_config.output_logits --+++ # return_dict_in_generate = generation_config.return_dict_in_generate --+++ # max_length = generation_config.max_length --+++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++ # do_sample = generation_config.do_sample --+++ --+++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++ # raise ValueError( --+++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++ # f"{logits_warper})." --+++ # ) --+++ --+++ # # init attention / hidden states / scores tuples --+++ # scores = () if (return_dict_in_generate and output_scores) else None --+++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++ --+++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++ # encoder_hidden_states = ( --+++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++ # ) --+++ --+++ # # keep track of which sequences are already finished --+++ # batch_size, cur_len = input_ids.shape --+++ # this_peer_finished = False --+++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++ --+++ # time_record = [] --+++ # from ....utils.testing_utils import parse_flag_from_env --+++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++ --+++ # while self._has_unfinished_sequences( --+++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++ # ): --+++ # if _record_time: --+++ # import time as time_module --+++ # infer_start = time_module.time() --+++ --+++ # # prepare model inputs --+++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++ --+++ # # prepare variable output controls --+++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++ --+++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++ # cur_cache_position = model_inputs.get("cache_position") --+++ # cur_past_key_values = model_inputs.get("past_key_values") --+++ # cur_input_ids = model_inputs.get("input_ids") --+++ --+++ # if (isinstance(cur_past_key_values, StaticCache) and --+++ # cur_cache_position is not None and --+++ # len(cur_cache_position.shape) > 0 and --+++ # cur_cache_position.shape[0] == 1 and --+++ # cur_input_ids is not None and --+++ # cur_input_ids.shape[1] == 1): --+++ # # 使用 JIT 优化的单 token 解码 --+++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++ # if not hasattr(self, '_jit_used'): --+++ # self._jit_used = False --+++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++ --+++ # next_token_logits = self.get_decode_one_tokens_logits( --+++ # cur_token=cur_input_ids, --+++ # input_pos=model_inputs.get("position_ids"), --+++ # cache_position=cur_cache_position, --+++ # past_key_values=cur_past_key_values, --+++ # ) --+++ --+++ # # 标记已使用JIT(用于后续判断) --+++ # if not self._jit_used: --+++ # self._jit_used = True --+++ --+++ # # 构造兼容的输出对象 --+++ # class JitOptimizedOutput: --+++ # def __init__(self, logits, config): --+++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++ # self.config = config --+++ # # 对于 JIT 优化路径,这些属性通常不需要 --+++ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++ # self.attentions = None if not config.is_encoder_decoder else None --+++ # self.cross_attentions = None --+++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++ # self.hidden_states = None if not config.is_encoder_decoder else None --+++ --+++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++ # else: --+++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++ # outputs = self(**model_inputs, return_dict=True) --+++ --+++ # if synced_devices and this_peer_finished: --+++ # continue --+++ --+++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++ # next_token_logits = outputs.logits[:, -1, :] --+++ --+++ # # pre-process distribution --+++ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++ # if do_sample: --+++ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++ --+++ # # Store scores, attentions and hidden_states when required --+++ # if return_dict_in_generate: --+++ # if output_scores: --+++ # scores += (next_token_scores,) --+++ # if output_logits: --+++ # raw_logits += (next_token_logits,) --+++ # if output_attentions: --+++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++ # decoder_attentions += (attn,) if attn is not None else (None,) --+++ # if self.config.is_encoder_decoder: --+++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++ --+++ # if output_hidden_states: --+++ # hidden = ( --+++ # outputs.decoder_hidden_states --+++ # if self.config.is_encoder_decoder --+++ # else outputs.hidden_states --+++ # ) --+++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++ --+++ # # token selection --+++ # if do_sample: --+++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++ # else: --+++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++ --+++ # # finished sentences should have their next token be a padding token --+++ # if has_eos_stopping_criteria: --+++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++ --+++ # # update generated ids, model inputs, and length for next step --+++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++ # if streamer is not None: --+++ # streamer.put(next_tokens) --+++ --+++ # model_kwargs = self._update_model_kwargs_for_generation( --+++ # outputs, --+++ # model_kwargs, --+++ # is_encoder_decoder=self.config.is_encoder_decoder, --+++ # ) --+++ --+++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++ # cur_len += 1 --+++ --+++ # if _record_time: --+++ # import time as time_module --+++ # infer_stop = time_module.time() --+++ # time_record.append(infer_stop - infer_start) --+++ --+++ # del outputs --+++ --+++ # average_infer_time = None --+++ # if time_record: --+++ # if len(time_record) > 1: --+++ # time_record.pop(0) --+++ # average_infer_time = sum(time_record) / len(time_record) --+++ # print(f'average inference time is: {average_infer_time}') --+++ # print(f'inference time record: {time_record}') --+++ --+++ # if streamer is not None: --+++ # streamer.end() --+++ --+++ # # 简单判断:打印是否使用了JIT路径 --+++ # if hasattr(self, '_jit_used') and self._jit_used: --+++ # print("[JIT] ✓ JIT optimization was used during generation") --+++ # else: --+++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++ --+++ # if return_dict_in_generate: --+++ # if self.config.is_encoder_decoder: --+++ # return GenerateEncoderDecoderOutput( --+++ # sequences=input_ids, --+++ # scores=scores, --+++ # logits=raw_logits, --+++ # encoder_attentions=encoder_attentions, --+++ # encoder_hidden_states=encoder_hidden_states, --+++ # decoder_attentions=decoder_attentions, --+++ # cross_attentions=cross_attentions, --+++ # decoder_hidden_states=decoder_hidden_states, --+++ # past_key_values=model_kwargs.get("past_key_values"), --+++ # average_infer_time=average_infer_time --+++ # ) --+++ # else: --+++ # return GenerateDecoderOnlyOutput( --+++ # sequences=input_ids, --+++ # scores=scores, --+++ # logits=raw_logits, --+++ # attentions=decoder_attentions, --+++ # hidden_states=decoder_hidden_states, --+++ # past_key_values=model_kwargs.get("past_key_values"), --+++ # average_infer_time=average_infer_time --+++ # ) --+++ # else: --+++ # return input_ids --+++ --+++ # def _prepare_cache_for_generation( --+++ # self, --+++ # generation_config, --+++ # model_kwargs, --+++ # assistant_model, --+++ # batch_size, --+++ # max_cache_length, --+++ # ): --+++ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++ # generation_config.cache_implementation = "static" --+++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++ --+++ # if generation_config.cache_implementation == "static": --+++ # base_required_from_max_length = generation_config.max_length + 1 --+++ # base_required = max(max_cache_length, base_required_from_max_length) --+++ # min_cache_size = 50 --+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++ # else: --+++ # max_cache_length = max(base_required, min_cache_size) --+++ --+++ # original_max_cache_length = max_cache_length --+++ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++ # print(f" - input max_cache_length: {original_max_cache_length}") --+++ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++ # print(f" - final max_cache_length: {max_cache_length}") --+++ --+++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++ # if max_cache_length > self.config.max_position_embeddings: --+++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++ --+++ # result = super()._prepare_cache_for_generation( --+++ # generation_config=generation_config, --+++ # model_kwargs=model_kwargs, --+++ # assistant_model=assistant_model, --+++ # batch_size=batch_size, --+++ # max_cache_length=max_cache_length, --+++ # ) --+++ --+++ # if generation_config.cache_implementation == "static": --+++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++ # created_cache = model_kwargs.get(cache_name) --+++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++ # if created_cache.max_cache_len < generation_config.max_length: --+++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++ --+++ # return result --+++ --+++ --+++ --++ --++ --++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++-- --++2.27.0 --++ --+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+new file mode 100644 --+index 00000000..22b65dd5 --+--- /dev/null --++++ b/patches/0002-20251106commit.patch --+@@ -0,0 +1,3200 @@ --++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Thu, 6 Nov 2025 09:20:38 +0800 --++Subject: [PATCH 2/3] 20251106commit --++ --++--- --++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- --++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ --++ 3 files changed, 2689 insertions(+), 305 deletions(-) --++ create mode 100644 patches/0001-20251104commit.patch --++ --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index d8303e45..73773c22 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): --++ # y = y + self.shared_experts(identity) --++ # return y --++ --+++ # @no_grad() --+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ --+++ # expert_cache = ops.zeros_like(x) --+++ # for i in range(self.num_experts_per_tok): --+++ # expert_id = flat_expert_indices[i].item() --+++ # weight = flat_expert_weights[i].item() --+++ # expert = self.experts[expert_id] --+++ # expert_out = expert(x) --+++ # expert_cache += expert_out * weight --+++ # return expert_cache --+++ --++ @no_grad() --++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ # x 的 shape: (1, hidden_size) --+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+++ --+++ # 1. 收集所有需要的专家层 --+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+++ selected_experts = [self.experts[i] for i in flat_expert_indices] --+++ --+++ # 2. 并行计算所有专家的输出 --+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+++ # ops.cat 会将它们堆叠成一个新的 Tensor --+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+++ --+++ # 3. 使用矩阵乘法进行加权求和 --+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ # 最终结果 final_output 的 shape: (1, hidden_size) --+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+++ --+++ return final_output --++ --++- expert_cache = ops.zeros_like(x) --++- for i in range(self.num_experts_per_tok): --++- expert_id = flat_expert_indices[i].item() --++- weight = flat_expert_weights[i].item() --++- expert = self.experts[expert_id] --++- expert_out = expert(x) --++- expert_cache += expert_out * weight --++- return expert_cache --++ --++ @no_grad() --++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++ # @lwx --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --+++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --++ --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): --++ return attn_output, attn_weights, past_key_value --++ --++ --+++# class DeepseekFlashAttention(nn.Module): --+++# """ --+++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --+++ --+++# This class is designed as a drop-in replacement for DeepseekAttention. --+++# """ --+++ --+++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+++# super().__init__() --+++# self.config = config --+++# self.layer_idx = layer_idx --+++# if layer_idx is None: --+++# logger.warning( --+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++# "when creating this class." --+++# ) --+++ --+++# self.attention_dropout = config.attention_dropout --+++# self.hidden_size = config.hidden_size --+++# self.num_heads = config.num_attention_heads --+++# self.head_dim = self.hidden_size // self.num_heads --+++# self.num_key_value_heads = config.num_key_value_heads --+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++# self.max_position_embeddings = config.max_position_embeddings --+++# self.rope_theta = config.rope_theta --+++# self.is_causal = True --+++ --+++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++# raise ValueError( --+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++# f" and `num_heads`: {self.num_heads})." --+++# ) --+++ --+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+++# self._init_rope() --+++ --+++# def _init_rope(self): --+++# if self.config.rope_scaling is None: --+++# self.rotary_emb = DeepseekRotaryEmbedding( --+++# self.head_dim, --+++# max_position_embeddings=self.max_position_embeddings, --+++# base=self.rope_theta, --+++# ) --+++# else: --+++# scaling_type = self.config.rope_scaling["type"] --+++# scaling_factor = self.config.rope_scaling["factor"] --+++# if scaling_type == "linear": --+++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+++# self.head_dim, --+++# max_position_embeddings=self.max_position_embeddings, --+++# scaling_factor=scaling_factor, --+++# base=self.rope_theta, --+++# ) --+++# elif scaling_type == "dynamic": --+++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+++# self.head_dim, --+++# max_position_embeddings=self.max_position_embeddings, --+++# scaling_factor=scaling_factor, --+++# base=self.rope_theta, --+++# ) --+++# else: --+++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+++ --+++# def forward( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# attention_mask: Optional[mindspore.Tensor] = None, --+++# position_ids: Optional[mindspore.Tensor] = None, --+++# past_key_value: Optional[Cache] = None, --+++# output_attentions: bool = False, --+++# use_cache: bool = False, --+++# **kwargs, --+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++# if "padding_mask" in kwargs: --+++# warnings.warn( --+++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+++# ) --+++ --+++# if output_attentions: --+++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --+++ --+++# bsz, q_len, _ = hidden_states.shape --+++ --+++# if self.config.pretraining_tp > 1: --+++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+++ --+++# query_states = self.q_proj(hidden_states) --+++# key_states = self.k_proj(hidden_states) --+++# value_states = self.v_proj(hidden_states) --+++ --+++# # Reshape for multi-head attention --+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++# kv_seq_len = key_states.shape[-2] --+++# if past_key_value is not None: --+++# if self.layer_idx is None: --+++# raise ValueError( --+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++# "with a layer index." --+++# ) --+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++# # Apply Rotary Positional Embedding --+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++# if past_key_value is not None: --+++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --+++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --+++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ --+++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++ --+++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++ --+++# # Convert attention_mask for flash_attention_score --+++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --+++# if attention_mask is not None: --+++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --+++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+++# raise ValueError( --+++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+++# ) --+++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --+++# else: --+++# attn_mask_for_fa = None --+++ --+++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+++ --+++# # Call the fused flash_attention_score operator --+++# attn_output = mindspore.ops.flash_attention_score( --+++# query=query_states_for_fa, --+++# key=key_states_for_fa, --+++# value=value_states_for_fa, --+++# head_num=self.num_heads, # This is N1, the number of query heads --+++# input_layout='BSH', --+++# attn_mask=attn_mask_for_fa, --+++# keep_prob=keep_prob, --+++# scalar_value=1.0 / math.sqrt(self.head_dim), --+++# sparse_mode=0 # Default mask mode --+++# ) --+++ --+++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --+++# attn_output = self.o_proj(attn_output) --+++ --+++# # Flash Attention does not return attention weights --+++# attn_weights = None --+++ --+++# return attn_output, attn_weights, past_key_value --+++ --+++class DeepseekFlashAttention(nn.Module): --+++ """ --+++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --+++ This implementation is a drop-in replacement for the original DeepseekAttention class, --+++ designed for high performance on supported hardware (Ascend). --+++ --+++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --+++ """ --+++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+++ super().__init__() --+++ self.config = config --+++ self.layer_idx = layer_idx --+++ if layer_idx is None: --+++ logger.warning( --+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++ "when creating this class." --+++ ) --+++ --+++ # --- [FIX] Correctly initialize all required attributes --- --+++ self.attention_dropout = config.attention_dropout --+++ self.hidden_size = config.hidden_size --+++ self.num_heads = config.num_attention_heads --+++ self.head_dim = self.hidden_size // self.num_heads --+++ self.num_key_value_heads = config.num_key_value_heads --+++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++ self.max_position_embeddings = config.max_position_embeddings --+++ self.rope_theta = config.rope_theta --+++ self.is_causal = True --+++ --+++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++ raise ValueError( --+++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++ f" and `num_heads`: {self.num_heads})." --+++ ) --+++ --+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+++ --+++ # This call will now succeed as all attributes are initialized. --+++ self._init_rope() --+++ --+++ def _init_rope(self): --+++ if self.config.rope_scaling is None: --+++ self.rotary_emb = DeepseekRotaryEmbedding( --+++ self.head_dim, --+++ max_position_embeddings=self.max_position_embeddings, --+++ base=self.rope_theta, --+++ ) --+++ else: --+++ scaling_type = self.config.rope_scaling["type"] --+++ scaling_factor = self.config.rope_scaling["factor"] --+++ if scaling_type == "linear": --+++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+++ self.head_dim, --+++ max_position_embeddings=self.max_position_embeddings, --+++ scaling_factor=scaling_factor, --+++ base=self.rope_theta, --+++ ) --+++ elif scaling_type == "dynamic": --+++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+++ self.head_dim, --+++ max_position_embeddings=self.max_position_embeddings, --+++ scaling_factor=scaling_factor, --+++ base=self.rope_theta, --+++ ) --+++ else: --+++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+++ --+++ def forward( --+++ self, --+++ hidden_states: mindspore.Tensor, --+++ attention_mask: Optional[mindspore.Tensor] = None, --+++ position_ids: Optional[mindspore.Tensor] = None, --+++ past_key_value: Optional[Cache] = None, --+++ output_attentions: bool = False, --+++ use_cache: bool = False, --+++ **kwargs, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ if "padding_mask" in kwargs: --+++ warnings.warn( --+++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+++ ) --+++ if output_attentions: --+++ warnings.warn( --+++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --+++ ) --+++ --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ if self.config.pretraining_tp > 1: --+++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+++ --+++ query_states = self.q_proj(hidden_states) --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++ # Apply Rotary Position Embedding --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos} --+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --+++ # So we must explicitly repeat the KV heads. --+++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++ --+++ # Convert attention mask for flash_attention_score --+++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --+++ if attention_mask is not None: --+++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+++ raise ValueError( --+++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+++ ) --+++ attn_mask_for_fa = attention_mask < 0 --+++ else: --+++ attn_mask_for_fa = None --+++ --+++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+++ --+++ # Call the fused operator using the efficient BNSD layout --+++ attn_output = mindspore.ops.flash_attention_score( --+++ query=query_states, --+++ key=key_states, --+++ value=value_states, --+++ head_num=self.num_heads, --+++ input_layout='BNSD', # Specify the correct layout --+++ attn_mask=attn_mask_for_fa, --+++ keep_prob=keep_prob, --+++ scalar_value=1.0 / math.sqrt(self.head_dim) --+++ ) --+++ --+++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ --+++ # Apply output projection --+++ attn_output = self.o_proj(attn_output) --+++ --+++ # Flash attention does not return attention weights, so we return None. --+++ attn_weights = None --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --++ Deepseek_ATTENTION_CLASSES = { --++ "eager": DeepseekAttention, --+++ "flash-attention": DeepseekFlashAttention, --++ } --++ --++ --++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): --++ config=config, layer_idx=layer_idx --++ ) --++ --+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --+++ config=config, layer_idx=layer_idx --+++ ) --+++ --++ self.mlp = ( --++ DeepseekMoE(config) --++ if ( --++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++index d4c6b651..bced285c 100644 --++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union --++ --++ import mindspore --++ import mindnlp.core.nn.functional as F --++-from mindnlp.core import nn, ops --+++from mindnlp.core import nn, ops, no_grad --++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss --++ --++ from ....common.activations import ACT2FN --++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) --++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --++ --+++Long_Prompt = False --+++PROMPT_LENGTH_THRESHOLD = 128 --++ --++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --++ def _prepare_4d_causal_attention_mask_with_cache_position( --++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): --++ return attn_output, attn_weights, past_key_value --++ --++ --+++# class Qwen2MoeFlashAttention(nn.Module): --+++# """ --+++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++ --+++# 关键改动: --+++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++# 直接传入原始的 key 和 value 张量效率更高。 --+++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++# super().__init__() --+++# self.config = config --+++# self.layer_idx = layer_idx --+++# self.hidden_size = config.hidden_size --+++# self.num_heads = config.num_attention_heads --+++# self.head_dim = self.hidden_size // self.num_heads --+++# self.num_key_value_heads = config.num_key_value_heads --+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++# self.max_position_embeddings = config.max_position_embeddings --+++# self.rope_theta = config.rope_theta --+++# self.attention_dropout = config.attention_dropout --+++ --+++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++# raise ValueError( --+++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++# ) --+++ --+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++ --+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++# self.head_dim, --+++# max_position_embeddings=self.max_position_embeddings, --+++# base=self.rope_theta, --+++# ) --+++ --+++# def forward( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# attention_mask: Optional[mindspore.Tensor] = None, --+++# position_ids: Optional[mindspore.Tensor] = None, --+++# past_key_value: Optional[Cache] = None, --+++# output_attentions: bool = False, --+++# use_cache: bool = False, --+++# cache_position: Optional[mindspore.Tensor] = None, --+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++# bsz, q_len, _ = hidden_states.shape --+++ --+++# # 1. 线性投射 Q, K, V --+++# query_states = self.q_proj(hidden_states) --+++# key_states = self.k_proj(hidden_states) --+++# value_states = self.v_proj(hidden_states) --+++ --+++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++# # query: [B, S, H*D] -> [B, N1, S, D] --+++# # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++# # 3. RoPE 旋转位置编码 --+++# kv_seq_len = key_states.shape[-2] --+++# if past_key_value is not None: --+++# if self.layer_idx is None: --+++# raise ValueError( --+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++# "with a layer index." --+++# ) --+++# # 对于 StaticCache,需要特殊处理 kv_seq_len --+++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++# # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++# # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++# if cache_position.shape[0] == 1: --+++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++# kv_seq_len = past_seen_tokens + 1 --+++# else: --+++# # prefill 阶段:cache_position 是范围,使用其长度 --+++# kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++# else: --+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++# # 4. KV 缓存更新 --+++# if past_key_value is not None: --+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++# key_states, value_states = past_key_value.update( --+++# key_states, value_states, self.layer_idx, cache_kwargs --+++# ) --+++ --+++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++# if cache_position.shape[0] == 1: --+++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++# kv_seq_len = key_states.shape[-2] --+++ --+++# # 5. [重要] 准备 Attention Mask --+++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++# fa_attention_mask = None --+++# if attention_mask is not None: --+++# # 截取与当前key长度匹配的部分 --+++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++# # 转换为布尔类型: 大负数 -> True, 0 -> False --+++# fa_attention_mask = (mask_slice != 0) --+++ --+++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++# input_dtype = query_states.dtype --+++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++# query_states = query_states.to(mindspore.float16) --+++# key_states = key_states.to(mindspore.float16) --+++# value_states = value_states.to(mindspore.float16) --+++ --+++# # 6. [核心] 调用 flash_attention_score 算子 --+++# # - 无需手动 repeat_kv, 算子原生支持 GQA --+++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++# attn_output = mindspore.ops.flash_attention_score( --+++# query=query_states, --+++# key=key_states, --+++# value=value_states, --+++# head_num=self.num_heads, # 传入Q的头数(N1) --+++# attn_mask=fa_attention_mask, --+++# keep_prob=1.0 - self.attention_dropout, --+++# scalar_value=1.0 / math.sqrt(self.head_dim), --+++# input_layout="BNSD", --+++# sparse_mode=0 # 使用 defaultMask 模式 --+++# ) --+++ --+++# # 恢复原始数据类型 --+++# attn_output = attn_output.to(input_dtype) --+++ --+++# # 7. 调整输出形状 --+++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++# attn_output = self.o_proj(attn_output) --+++ --+++# # FlashAttention 算子不直接返回注意力权重矩阵 --+++# attn_weights = None --+++# if output_attentions: --+++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++# return attn_output, attn_weights, past_key_value --+++ --+++# # def forward( --+++# # self, --+++# # hidden_states: mindspore.Tensor, --+++# # attention_mask: Optional[mindspore.Tensor] = None, --+++# # position_ids: Optional[mindspore.Tensor] = None, --+++# # past_key_value: Optional[Cache] = None, --+++# # output_attentions: bool = False, --+++# # use_cache: bool = False, --+++# # cache_position: Optional[mindspore.Tensor] = None, --+++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++# # bsz, q_len, _ = hidden_states.shape --+++ --+++# # # 1. 线性投射 Q, K, V --+++# # query_states = self.q_proj(hidden_states) --+++# # key_states = self.k_proj(hidden_states) --+++# # value_states = self.v_proj(hidden_states) --+++ --+++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --+++# # # 3. RoPE 旋转位置编码 --+++# # kv_seq_len = key_states.shape[-2] --+++# # if past_key_value is not None: --+++# # if self.layer_idx is None: --+++# # raise ValueError( --+++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++# # "with a layer index." --+++# # ) --+++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++# # # 4. KV 缓存更新 --+++# # if past_key_value is not None: --+++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++# # key_states, value_states = past_key_value.update( --+++# # key_states, value_states, self.layer_idx, cache_kwargs --+++# # ) --+++ --+++# # # 5. 准备 Attention Mask --+++# # fa_attention_mask = None --+++# # if attention_mask is not None: --+++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++# # fa_attention_mask = (mask_slice != 0) --+++ --+++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++# # input_dtype = query_states.dtype --+++ --+++# # # 6. [核心] 调用 flash_attention_score 算子 --+++# # attn_output = mindspore.ops.flash_attention_score( --+++# # query=query_states, --+++# # key=key_states, --+++# # value=value_states, --+++# # head_num=self.num_heads, --+++# # attn_mask=fa_attention_mask, --+++# # keep_prob=1.0 - self.attention_dropout, --+++# # scalar_value=1.0 / math.sqrt(self.head_dim), --+++# # input_layout="BNSD", --+++# # sparse_mode=0, --+++# # # <--- 修改点 2: 启用内部高精度计算 --- --+++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++# # inner_precise=1 --+++# # ) --+++ --+++# # # 恢复原始数据类型 --+++# # attn_output = attn_output.to(input_dtype) --+++ --+++# # # 7. 调整输出形状 --+++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++# # attn_output = self.o_proj(attn_output) --+++ --+++# # attn_weights = None --+++# # if output_attentions: --+++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ --+++# # return attn_output, attn_weights, past_key_value --+++ --+++ --++ class Qwen2MoeFlashAttention(nn.Module): --++ """ --++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++- --++- 关键改动: --++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++- 直接传入原始的 key 和 value 张量效率更高。 --++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --+++ --+++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --+++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --+++ 完全使用模型的低精度数据类型(如 float16)进行计算, --+++ 以达到理论上的最高执行速度。 --++ """ --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++ super().__init__() --++ self.config = config --++ self.layer_idx = layer_idx --+++ if layer_idx is None: --+++ logger.warning_once( --+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --+++ ) --+++ --++ self.hidden_size = config.hidden_size --++ self.num_heads = config.num_attention_heads --++ self.head_dim = self.hidden_size // self.num_heads --++ self.num_key_value_heads = config.num_key_value_heads --++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++ self.max_position_embeddings = config.max_position_embeddings --++ self.rope_theta = config.rope_theta --++ self.attention_dropout = config.attention_dropout --++ --++- if (self.head_dim * self.num_heads) != self.hidden_size: --++- raise ValueError( --++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++- ) --++- --++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++- # query: [B, S, H*D] -> [B, N1, S, D] --++- # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++ # 2. 调整形状以匹配 BNSD 布局 --++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- --++- # 3. RoPE 旋转位置编码 --+++ --+++ # 3. RoPE 和 KV 缓存 --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++- if self.layer_idx is None: --++- raise ValueError( --++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++- "with a layer index." --++- ) --++- # 对于 StaticCache,需要特殊处理 kv_seq_len --++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --++- # 使用 cache_position 的长度来确定实际的 kv_seq_len --++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++- # 临时解决方案:使用 cache_position 的最大值(如果可能) --++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++- if cache_position.shape[0] == 1: --++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++- kv_seq_len = past_seen_tokens + 1 --++- else: --++- # prefill 阶段:cache_position 是范围,使用其长度 --++- kv_seq_len = cache_position.shape[0] + past_seen_tokens --++- else: --++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++- --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++- # 4. KV 缓存更新 --++ if past_key_value is not None: --++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++- key_states, value_states = past_key_value.update( --++- key_states, value_states, self.layer_idx, cache_kwargs --++- ) --++- --++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --++- if cache_position.shape[0] == 1: --++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++- kv_seq_len = key_states.shape[-2] --++- --++- # 5. [重要] 准备 Attention Mask --++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++ # 4. 准备 Attention Mask --++ fa_attention_mask = None --++ if attention_mask is not None: --++- # 截取与当前key长度匹配的部分 --++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++- # 转换为布尔类型: 大负数 -> True, 0 -> False --++ fa_attention_mask = (mask_slice != 0) --++ --++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++- input_dtype = query_states.dtype --++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++- query_states = query_states.to(mindspore.float16) --++- key_states = key_states.to(mindspore.float16) --++- value_states = value_states.to(mindspore.float16) --++- --++- # 6. [核心] 调用 flash_attention_score 算子 --++- # - 无需手动 repeat_kv, 算子原生支持 GQA --++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 --++ attn_output = mindspore.ops.flash_attention_score( --++ query=query_states, --++ key=key_states, --++ value=value_states, --++- head_num=self.num_heads, # 传入Q的头数(N1) --+++ head_num=self.num_heads, --++ attn_mask=fa_attention_mask, --++- keep_prob=1.0 - self.attention_dropout, --+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout --++ scalar_value=1.0 / math.sqrt(self.head_dim), --++ input_layout="BNSD", --++- sparse_mode=0 # 使用 defaultMask 模式 --+++ sparse_mode=0, --+++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 --++ ) --++ --++- # 恢复原始数据类型 --++- attn_output = attn_output.to(input_dtype) --++- --++- # 7. 调整输出形状 --++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++ # 6. 调整输出形状 --++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++ attn_output = self.o_proj(attn_output) --++ --++- # FlashAttention 算子不直接返回注意力权重矩阵 --+++ # 7. 返回结果 --++ attn_weights = None --++ if output_attentions: --++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --++ --++ return attn_output, attn_weights, past_key_value --++ --++- # def forward( --++- # self, --++- # hidden_states: mindspore.Tensor, --++- # attention_mask: Optional[mindspore.Tensor] = None, --++- # position_ids: Optional[mindspore.Tensor] = None, --++- # past_key_value: Optional[Cache] = None, --++- # output_attentions: bool = False, --++- # use_cache: bool = False, --++- # cache_position: Optional[mindspore.Tensor] = None, --++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++- --++- # bsz, q_len, _ = hidden_states.shape --++- --++- # # 1. 线性投射 Q, K, V --++- # query_states = self.q_proj(hidden_states) --++- # key_states = self.k_proj(hidden_states) --++- # value_states = self.v_proj(hidden_states) --++- --++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- --++- # # 3. RoPE 旋转位置编码 --++- # kv_seq_len = key_states.shape[-2] --++- # if past_key_value is not None: --++- # if self.layer_idx is None: --++- # raise ValueError( --++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++- # "with a layer index." --++- # ) --++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++ --++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++- --++- # # 4. KV 缓存更新 --++- # if past_key_value is not None: --++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++- # key_states, value_states = past_key_value.update( --++- # key_states, value_states, self.layer_idx, cache_kwargs --++- # ) --++- --++- # # 5. 准备 Attention Mask --++- # fa_attention_mask = None --++- # if attention_mask is not None: --++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++- # fa_attention_mask = (mask_slice != 0) --++- --++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++- # input_dtype = query_states.dtype --++- --++- # # 6. [核心] 调用 flash_attention_score 算子 --++- # attn_output = mindspore.ops.flash_attention_score( --++- # query=query_states, --++- # key=key_states, --++- # value=value_states, --++- # head_num=self.num_heads, --++- # attn_mask=fa_attention_mask, --++- # keep_prob=1.0 - self.attention_dropout, --++- # scalar_value=1.0 / math.sqrt(self.head_dim), --++- # input_layout="BNSD", --++- # sparse_mode=0, --++- # # <--- 修改点 2: 启用内部高精度计算 --- --++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++- # inner_precise=1 --++- # ) --++- --++- # # 恢复原始数据类型 --++- # attn_output = attn_output.to(input_dtype) --+++QWEN2MOE_ATTENTION_CLASSES = { --+++ "eager": Qwen2MoeAttention, --+++ "flash-attention": Qwen2MoeFlashAttention, --+++} --++ --++- # # 7. 调整输出形状 --++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++- # attn_output = self.o_proj(attn_output) --++ --++- # attn_weights = None --++- # if output_attentions: --++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# def __init__(self, config): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# # gating --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++ --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# #@dwj --+++# # 只遍历激活的专家,而非全部专家 --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# num_tokens = hidden_states_reshaped.shape[0] --+++ --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++# routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++# flat_selected_experts = selected_experts.flatten() --+++ --+++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++# token_indices = broadcasted_token_indices.flatten() --+++ --+++# active_experts = ops.unique(flat_selected_experts) --+++ --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++ --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# selected_token_indices = token_indices[mask] --+++# selected_routing_weights = routing_weights.flatten()[mask] --+++ --+++# current_states = hidden_states_reshaped[selected_token_indices] --+++ --+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++ --+++# final_hidden_states = final_hidden_states.index_add( --+++# dim=0, --+++# index=selected_token_indices, --+++# source=expert_output.to(hidden_states.dtype) --+++# ) --+++ --+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++ --++- # return attn_output, attn_weights, past_key_value --+++# final_hidden_states = final_hidden_states + shared_expert_output --+++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --+++ --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# """ --+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+++# `_moe_infer_prefill` (用于长序列处理) 方法。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# # 门控网络 --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# # 专家列表 --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++# # 共享专家 --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# @no_grad() --+++# def _moe_infer_decode( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# """ --+++# 【解码路径】针对 sequence_length=1 的极致优化。 --+++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+++# """ --+++# batch_size, hidden_dim = hidden_states.shape --+++ --+++# expert_outputs_list = [ --+++# ops.cat([ --+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++# ], dim=0) --+++# for i in range(batch_size) --+++# ] --+++ --+++# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+++# # shape: (batch_size, top_k, hidden_dim) --+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++ --+++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++ --+++# return moe_output.squeeze(1) --+++ --+++# @no_grad() --+++# def _moe_infer_prefill( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# """ --+++# 【预填充路径】针对 sequence_length > 1 的优化。 --+++# 按专家对 Token 进行分组,并进行批处理。 --+++# """ --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens = hidden_states.shape[0] --+++# flat_selected_experts = selected_experts.flatten() --+++ --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++ --+++# active_experts = ops.unique(flat_selected_experts) --+++ --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++ --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# selected_token_indices = token_indices[mask] --+++# selected_routing_weights = routing_weights.flatten()[mask] --+++ --+++# current_states = hidden_states[selected_token_indices] --+++ --+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++ --+++# moe_output = moe_output.index_add( --+++# dim=0, --+++# index=selected_token_indices, --+++# source=expert_output.to(hidden_states.dtype) --+++# ) --+++# return moe_output --+++ --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# """ --+++# 顶层 forward 方法,作为智能分发器。 --+++# """ --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++- # def forward( --++- # self, --++- # hidden_states: mindspore.Tensor, --++- # attention_mask: Optional[mindspore.Tensor] = None, --++- # position_ids: Optional[mindspore.Tensor] = None, --++- # past_key_value: Optional[Cache] = None, --++- # output_attentions: bool = False, --++- # use_cache: bool = False, --++- # cache_position: Optional[mindspore.Tensor] = None, --++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++- --++- # bsz, q_len, _ = hidden_states.shape --++- --++- # query_states = self.q_proj(hidden_states) --++- # key_states = self.k_proj(hidden_states) --++- # value_states = self.v_proj(hidden_states) --++- --++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- --++- # kv_seq_len = key_states.shape[-2] --++- # if past_key_value is not None: --++- # if self.layer_idx is None: --++- # raise ValueError("`layer_idx` must be specified for caching") --++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++- --++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++- --++- # if past_key_value is not None: --++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++- # key_states, value_states = past_key_value.update( --++- # key_states, value_states, self.layer_idx, cache_kwargs --++- # ) --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++# routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++# moe_output = None --+++# # 在推理时,根据序列长度选择最优路径 --+++# if not self.training: --+++# if sequence_length == 1: --+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++# else: --+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++# else: --+++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+++# raise NotImplementedError("Training path is not implemented.") --+++ --+++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+++ --+++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+++ --+++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --+++ --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# """ --+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# # 门控网络 --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# # 专家列表 --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++# # 共享专家 --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# @no_grad() --+++# def _moe_infer_decode( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# batch_size, _ = hidden_states.shape --+++# expert_outputs_list = [ --+++# ops.cat([ --+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++# ], dim=0) --+++# for i in range(batch_size) --+++# ] --+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++# return moe_output.squeeze(1) --+++ --+++# @no_grad() --+++# def _moe_infer_prefill( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens = hidden_states.shape[0] --+++# flat_selected_experts = selected_experts.flatten() --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++# active_experts = ops.unique(flat_selected_experts) --+++ --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# selected_token_indices = token_indices[mask] --+++# selected_routing_weights = routing_weights.flatten()[mask] --+++# current_states = hidden_states[selected_token_indices] --+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++# moe_output = moe_output.index_add( --+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++# ) --+++# return moe_output --+++ --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# """ --+++# 顶层 forward 方法,作为智能分发器。 --+++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+++# """ --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ --+++# # 1. 门控计算 (通用逻辑) --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++# routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++# # 2. 智能分发到最优 MoE 路径 --+++# moe_output = None --+++# if not self.training: --+++# if sequence_length == 1: --+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++# else: --+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++# else: --+++# raise NotImplementedError("Training path is not implemented.") --+++ --+++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++ --+++# # 4. 合并 MoE 输出和共享专家输出 --+++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++ --+++# # 5. 恢复原始形状并返回 --+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --+++# prefill fastest --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# """ --+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# # 门控网络 --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# # 专家列表 --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++# # 共享专家 --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# @no_grad() --+++# def _moe_infer_dispatch( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# """ --+++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+++# """ --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens, _ = hidden_states.shape --+++ --+++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+++# flat_selected_experts = selected_experts.flatten() --+++# flat_routing_weights = routing_weights.flatten() --++ --++- # key_states = repeat_kv(key_states, self.num_key_value_groups) --++- # value_states = repeat_kv(value_states, self.num_key_value_groups) --++- --++- # # <--- 核心修改点: 手动进行高精度缩放 --- --++- # # 在调用算子前,手动将 query_states 除以缩放因子。 --++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++- # query_states = query_states / math.sqrt(self.head_dim) --++- # # <--- 修改结束 --- --++- --++- # fa_attention_mask = None --++- # if attention_mask is not None: --++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++- # fa_attention_mask = (mask_slice != 0) --++- --++- # input_dtype = query_states.dtype --++- --++- # attn_output = mindspore.ops.flash_attention_score( --++- # query=query_states, # 传入已经预先缩放过的 query --++- # key=key_states, --++- # value=value_states, --++- # head_num=self.num_heads, --++- # attn_mask=fa_attention_mask, --++- # keep_prob=1.0 - self.attention_dropout, --++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++- # input_layout="BNSD", --++- # sparse_mode=0, --++- # inner_precise=1 # 仍然保持内部高精度计算 --++- # ) --+++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++ --++- # attn_output = attn_output.to(input_dtype) --++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++- # attn_output = self.o_proj(attn_output) --+++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+++# active_experts = ops.unique(flat_selected_experts) --+++ --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++ --+++# # 找到所有分配给该专家的 token --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++ --+++# # 使用 mask 选取对应的 token 和权重 --+++# current_token_indices = token_indices[mask] --+++# current_routing_weights = flat_routing_weights[mask] --+++# current_hidden_states = hidden_states[current_token_indices] --+++ --+++# # 对这些 token 进行批处理 --+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++ --+++# # 使用 index_add 将结果精确地加回到对应位置 --+++# moe_output = moe_output.index_add( --+++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+++# ) --+++# return moe_output --+++ --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# """ --+++# 顶层 forward 方法,作为智能分发器。 --+++# """ --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ --+++# # 1. 门控计算 --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++# routing_weights = routing_weights.to(hidden_states.dtype) --+++ --+++# # 2. 调用统一的 MoE 计算内核 --+++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --++ --++- # attn_weights = None --++- # if output_attentions: --++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++# # 3. 统一处理共享专家 --+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++ --+++# # 4. 合并输出 --+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++ --+++# # 5. 恢复原始形状并返回 --+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --+++ --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# """ --+++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++# 【最终高性能与高精度版】: --+++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+++# 3. 这样实现了速度和准确性的两全其美。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# @no_grad() --+++# def _moe_infer_decode( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# """ --+++# 【解码路径】极致优化版:bmm + 高精度累加。 --+++# """ --+++# original_dtype = hidden_states.dtype --+++# batch_size, _ = hidden_states.shape --+++ --+++# expert_outputs_list = [ --+++# ops.cat([ --+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++# ], dim=0) --+++# for i in range(batch_size) --+++# ] --+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++ --+++# # 在 float32 下执行 bmm,得到高精度结果 --+++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++ --+++# # 将高精度结果转换回原始数据类型 --+++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+++ --+++# return moe_output --+++ --+++# @no_grad() --+++# def _moe_infer_prefill( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# selected_experts: mindspore.Tensor, --+++# routing_weights: mindspore.Tensor --+++# ) -> mindspore.Tensor: --+++# """ --+++# 【预填充路径】与原始实现一致,结果精确。 --+++# """ --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens, _ = hidden_states.shape --+++# flat_selected_experts = selected_experts.flatten() --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++# active_experts = ops.unique(flat_selected_experts) --+++ --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# selected_token_indices = token_indices[mask] --+++# selected_routing_weights = routing_weights.flatten()[mask] --+++# current_states = hidden_states[selected_token_indices] --+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++# moe_output = moe_output.index_add( --+++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++# ) --+++# return moe_output --+++ --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++ --++- # return attn_output, attn_weights, past_key_value --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+++# # 如果模型主体是 float16,后续再转换 --+++ --+++# moe_output = None --+++# if not self.training: --+++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+++# # _moe_infer_decode 内部会处理好类型转换 --+++# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+++# if sequence_length == 1: --+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++# else: --+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++# else: --+++# raise NotImplementedError("Training path is not implemented.") --+++ --+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++ --+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --++ --++-QWEN2MOE_ATTENTION_CLASSES = { --++- "eager": Qwen2MoeAttention, --++- "flash-attention": Qwen2MoeFlashAttention, --++-} --+++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++# """ --+++# 【融合版】一个混合专家模块,内置两种推理策略, --+++# 由外部全局变量 `Long_Prompt` 控制: --+++ --+++# - if Long_Prompt is True: 【精度优先模式】 --+++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+++# 适用于处理长序列,避免误差累积。 --+++ --+++# - if Long_Prompt is False: 【速度优先模式】 --+++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+++# 在解码阶段获得极致速度,同时保证结果高度准确。 --+++# """ --+++# def __init__(self, config: Qwen2MoeConfig): --+++# super().__init__() --+++# self.num_experts = config.num_experts --+++# self.top_k = config.num_experts_per_tok --+++# self.norm_topk_prob = config.norm_topk_prob --+++ --+++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++# self.experts = nn.ModuleList( --+++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++# ) --+++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++# # --- 速度优先模式的辅助函数 --- --+++# @no_grad() --+++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++# original_dtype = hidden_states.dtype --+++# batch_size, _ = hidden_states.shape --+++# expert_outputs_list = [ --+++# ops.cat([ --+++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++# ], dim=0) --+++# for i in range(batch_size) --+++# ] --+++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++# weights_fp32 = routing_weights.to(mindspore.float32) --+++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++# return moe_output_fp32.squeeze(1).to(original_dtype) --+++ --+++# @no_grad() --+++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens, _ = hidden_states.shape --+++# flat_selected_experts = selected_experts.flatten() --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++# active_experts = ops.unique(flat_selected_experts) --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# selected_token_indices = token_indices[mask] --+++# selected_routing_weights = routing_weights.flatten()[mask] --+++# current_states = hidden_states[selected_token_indices] --+++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+++# return moe_output --+++ --+++# # --- 精度优先模式的辅助函数 --- --+++# @no_grad() --+++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++# moe_output = ops.zeros_like(hidden_states) --+++# num_tokens, _ = hidden_states.shape --+++# flat_selected_experts = selected_experts.flatten() --+++# flat_routing_weights = routing_weights.flatten() --+++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++# active_experts = ops.unique(flat_selected_experts) --+++# for expert_idx_tensor in active_experts: --+++# expert_idx = expert_idx_tensor.item() --+++# expert_layer = self.experts[expert_idx] --+++# mask = (flat_selected_experts == expert_idx_tensor) --+++# current_token_indices = token_indices[mask] --+++# current_routing_weights = flat_routing_weights[mask] --+++# current_hidden_states = hidden_states[current_token_indices] --+++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++# return moe_output --+++ --+++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++# # 声明我们将要使用一个在模块外部定义的全局变量 --+++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+++# global Long_Prompt --+++ --+++# # 1. 门控计算 (所有模式通用) --+++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++# router_logits = self.gate(hidden_states_reshaped) --+++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++# if self.norm_topk_prob: --+++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++# moe_output = None --+++# if not self.training: --+++# # 根据 Long_Prompt 标志选择模式 --+++# if Long_Prompt: --+++# # --- 精度优先模式 --- --+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++# else: --+++# # --- 速度优先模式 --- --+++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++# if sequence_length == 1: --+++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++# else: --+++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++# else: --+++# raise NotImplementedError("Training path is not implemented.") --+++ --+++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++ --+++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++ --+++# return final_hidden_states, router_logits --+++ --+++class Qwen2MoeSparseMoeBlock(nn.Module): --+++ """ --+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+++ 控制的顶级推理策略: --++ --+++ - if Long_Prompt is True: 【精度优先模式】 --+++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --+++ 适用于需要严格可复现性的长序列任务。 --++ --++-class Qwen2MoeSparseMoeBlock(nn.Module): --++- def __init__(self, config): --+++ - if Long_Prompt is False: 【速度优先模式】 --+++ 采用业界最强的性能组合: --+++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --+++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --+++ """ --+++ def __init__(self, config: Qwen2MoeConfig): --++ super().__init__() --++ self.num_experts = config.num_experts --++ self.top_k = config.num_experts_per_tok --++ self.norm_topk_prob = config.norm_topk_prob --++ --++- # gating --++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++ self.experts = nn.ModuleList( --++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++ ) --++- --++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++ --++- #@dwj --++- # 只遍历激活的专家,而非全部专家 --++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++- num_tokens = hidden_states_reshaped.shape[0] --++- --++- router_logits = self.gate(hidden_states_reshaped) --++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- --++- if self.norm_topk_prob: --++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- routing_weights = routing_weights.to(hidden_states.dtype) --++- --++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++- flat_selected_experts = selected_experts.flatten() --++- --++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++- token_indices = broadcasted_token_indices.flatten() --++- --++- active_experts = ops.unique(flat_selected_experts) --++- --++- for expert_idx_tensor in active_experts: --++- expert_idx = expert_idx_tensor.item() --++- expert_layer = self.experts[expert_idx] --++- --++- mask = (flat_selected_experts == expert_idx_tensor) --++- selected_token_indices = token_indices[mask] --++- selected_routing_weights = routing_weights.flatten()[mask] --++- --++- current_states = hidden_states_reshaped[selected_token_indices] --++- --++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++- --++- final_hidden_states = final_hidden_states.index_add( --++- dim=0, --++- index=selected_token_indices, --++- source=expert_output.to(hidden_states.dtype) --++- ) --++- --++- shared_expert_output = self.shared_expert(hidden_states_reshaped) --++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --+++ @no_grad() --+++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++ original_dtype = hidden_states.dtype --+++ batch_size, _ = hidden_states.shape --+++ expert_outputs_list = [ --+++ ops.cat([ --+++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++ ], dim=0) --+++ for i in range(batch_size) --+++ ] --+++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++ weights_fp32 = routing_weights.to(mindspore.float32) --+++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++ return moe_output_fp32.squeeze(1).to(original_dtype) --+++ --+++ @no_grad() --+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++ num_tokens, _ = hidden_states.shape --+++ flat_selected_experts = selected_experts.flatten() --+++ sorted_expert_indices = flat_selected_experts.argsort() --+++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++ original_token_indices = sorted_expert_indices // self.top_k --+++ moe_output = ops.zeros_like(hidden_states) --+++ current_token_offset = 0 --+++ for i in range(self.num_experts): --+++ expert_token_count = tokens_per_expert[i] - current_token_offset --+++ if expert_token_count == 0: --+++ continue --+++ end_offset = current_token_offset + expert_token_count --+++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++ expert_hidden_states = hidden_states[expert_original_token_indices] --+++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++ current_token_offset += expert_token_count --+++ return moe_output --+++ --+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+++ @no_grad() --+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++ moe_output = ops.zeros_like(hidden_states) --+++ num_tokens, _ = hidden_states.shape --+++ flat_selected_experts = selected_experts.flatten() --+++ flat_routing_weights = routing_weights.flatten() --+++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++ active_experts = ops.unique(flat_selected_experts) --+++ for expert_idx_tensor in active_experts: --+++ expert_idx = expert_idx_tensor.item() --+++ expert_layer = self.experts[expert_idx] --+++ mask = (flat_selected_experts == expert_idx_tensor) --+++ current_token_indices = token_indices[mask] --+++ current_routing_weights = flat_routing_weights[mask] --+++ current_hidden_states = hidden_states[current_token_indices] --+++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++ return moe_output --++ --++- final_hidden_states = final_hidden_states + shared_expert_output --++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++- --++- return final_hidden_states, router_logits --+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++ global Long_Prompt --+++ --+++ # 1. 门控计算 (所有模式通用) --+++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++ router_logits = self.gate(hidden_states_reshaped) --+++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++ if self.norm_topk_prob: --+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++ moe_output = None --+++ if Long_Prompt: --+++ # --- 精度优先模式 (ACCURACY MODE) --- --+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ else: --+++ # --- 速度优先模式 (SPEED MODE) --- --+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++ if sequence_length == 1: --+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ else: --+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ --++ --+++ # 3. 共享专家计算与合并 (所有模式通用) --+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++ --+++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++ --+++ return final_hidden_states, router_logits --++ --++ class Qwen2MoeDecoderLayer(nn.Module): --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --++ super().__init__() --++ self.hidden_size = config.hidden_size --+++ --+++ # if Long_Prompt: --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ # else: --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++ --++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ --++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++- --++ if (layer_idx not in config.mlp_only_layers) and ( --++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++ ): --++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ self._warmed_up = True --++ self.warmup_moe_model() --++ --+++ --+++ --++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++ output_router_logits = ( --++ output_router_logits if output_router_logits is not None else self.config.output_router_logits --++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ router_logits=outputs.router_logits, --++ ) --++ --+++ def generate(self, *args, **kwargs): --+++ """ --+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+++ """ --+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+++ --+++ input_ids = kwargs.get("input_ids") --+++ if input_ids is None and args: --+++ input_ids = args[0] --+++ --+++ if input_ids is not None: --+++ prompt_length = input_ids.shape[1] --+++ --+++ if prompt_length > PROMPT_LENGTH_THRESHOLD: --+++ Long_Prompt = True --+++ else: --+++ Long_Prompt = False --+++ --+++ return super().generate(*args, **kwargs) --+++ --++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation --++ def prepare_inputs_for_generation( --++ self, --++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens --++ # Exception 1: when passing input_embeds, input_ids may be missing entries --++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --+++ --++ if past_key_values is not None: --++ if inputs_embeds is not None: # Exception 1 --++ if 0 not in input_ids.shape: --++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ } --++ ) --++ return model_inputs --+++ --++ # @lwx --++ # def _decode_one_tokens_logits( --++ # self, --++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): --++ attentions=outputs.attentions, --++ ) --++ --+++ --++ __all__ = [ --++ "Qwen2MoeForCausalLM", --++ "Qwen2MoeModel", --++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++new file mode 100644 --++index 00000000..6dfb5b93 --++--- /dev/null --+++++ b/patches/0001-20251104commit.patch --++@@ -0,0 +1,1272 @@ --+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++Subject: [PATCH] 20251104commit --+++ --+++--- --+++ mindnlp/transformers/cache_utils.py | 28 +- --+++ .../models/deepseek/modeling_deepseek.py | 149 ++- --+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++ 3 files changed, 976 insertions(+), 87 deletions(-) --+++ --+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++index cadd2e04..02f8d4be 100644 --+++--- a/mindnlp/transformers/cache_utils.py --++++++ b/mindnlp/transformers/cache_utils.py --+++@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++ # k_out[:, :, cache_position] = key_states --+++ # v_out[:, :, cache_position] = value_states --+++- if ON_ORANGE_PI: --+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++- else: --+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++- --++++ # if ON_ORANGE_PI: --++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++ # else: --++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++ # 确保 cache_position 是 1D tensor 并且类型正确 --++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++++ if cache_position.ndim > 1: --++++ cache_position = cache_position.flatten() --++++ # 确保类型是 int32 或 int64(MindSpore 要求) --++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++++ cache_position = cache_position.int() --++++ --++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++++ k_out[:, :, cache_position] = key_states --++++ v_out[:, :, cache_position] = value_states --++++ --+++ return k_out, v_out --+++ --+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index c695b944..d8303e45 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++- x1 = x[..., : x.shape[-1] // 2] --+++- x2 = x[..., x.shape[-1] // 2 :] --++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++ # x1 = x[..., : x.shape[-1] // 2] --++++ # x2 = x[..., x.shape[-1] // 2 :] --++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++ if self.training: --+++ raise NotImplementedError("Training is not supported yet.") --+++ else: --+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++- if self.config.n_shared_experts is not None: --+++- y = y + self.shared_experts(identity) --+++- return y --++++ # @lwx --++++ if orig_shape[1] == 1: --++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++++ y=y.view(*orig_shape) --++++ if self.config.n_shared_experts is not None: --++++ y = y + self.shared_experts(identity) --++++ return y --++++ else: --++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++++ if self.config.n_shared_experts is not None: --++++ y = y + self.shared_experts(identity) --++++ return y --++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++ # if self.config.n_shared_experts is not None: --++++ # y = y + self.shared_experts(identity) --++++ # return y --++++ --++++ @no_grad() --++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ --++++ expert_cache = ops.zeros_like(x) --++++ for i in range(self.num_experts_per_tok): --++++ expert_id = flat_expert_indices[i].item() --++++ weight = flat_expert_weights[i].item() --++++ expert = self.experts[expert_id] --++++ expert_out = expert(x) --++++ expert_cache += expert_out * weight --++++ return expert_cache --+++ --+++ @no_grad() --+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++- # expert_cache = torch.zeros_like(x) --+++- # idxs = flat_expert_indices.argsort() --+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++- # token_idxs = idxs // self.num_experts_per_tok --+++- # for i, end_idx in enumerate(tokens_per_expert): --+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++- # if start_idx == end_idx: --+++- # continue --+++- # expert = self.experts[i] --+++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++- # expert_tokens = x[exp_token_idx] --+++- # expert_out = expert(expert_tokens) --+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++- # return expert_cache --++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++ expert_cache = ops.zeros_like(x) --+++ idxs = flat_expert_indices.argsort() --+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ token_idxs = idxs // self.num_experts_per_tok --++++ --+++ for i, end_idx in enumerate(tokens_per_expert): --+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ if start_idx == end_idx: --+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++ expert_out = expert(expert_tokens) --+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --+++ return expert_cache --++++ --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # # expert_cache = torch.zeros_like(x) --++++ # # idxs = flat_expert_indices.argsort() --++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++ # # token_idxs = idxs // self.num_experts_per_tok --++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++ # # if start_idx == end_idx: --++++ # # continue --++++ # # expert = self.experts[i] --++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # # expert_tokens = x[exp_token_idx] --++++ # # expert_out = expert(expert_tokens) --++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++ # # return expert_cache --++++ # expert_cache = ops.zeros_like(x) --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # for i, end_idx in enumerate(tokens_per_expert): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # if start_idx == end_idx: --++++ # continue --++++ # expert = self.experts[i] --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = expert(expert_tokens) --++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --++++ # return expert_cache --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # expert_cache = ops.zeros_like(x) --++++ --++++ # # 排序保证顺序一致 --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # # 找出有 token 的专家 --++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++ --++++ # for i in active_experts.tolist(): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # end_idx = tokens_per_expert[i] --++++ # if start_idx == end_idx: # 没有 token --++++ # continue --++++ --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = self.experts[i](expert_tokens) --++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++ --++++ # expert_cache = mindspore.mint.scatter_add( --++++ # expert_cache, --++++ # 0, --++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++ # expert_out --++++ # ) --++++ --++++ # return expert_cache --++++ --++++ --+++ --+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++ # """ --+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ --+++ # Initialize weights and apply final processing --+++ self.post_init() --++++ self.warm_up = False --++++ --++++ def warmup_moe_model_deep(self): --++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++ test_texts = [ --++++ "warmup short", --++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++++ ] --++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++ if tokenizer is None: --++++ from mindnlp.transformers import AutoTokenizer --++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++ self._warmup_tokenizer = tokenizer --++++ --++++ for text in test_texts: --++++ inputs = tokenizer(text, return_tensors="ms") --++++ with mindspore._no_grad(): --++++ _ = self(**inputs, use_cache=False) --++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++ --+++ def get_input_embeddings(self): --+++ return self.model.embed_tokens --+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++ ```""" --++++ if not self.warm_up: --++++ self.warm_up = True --++++ self.warmup_moe_model_deep() --++++ --+++ output_attentions = ( --+++ output_attentions --+++ if output_attentions is not None --+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++index 3cbf820e..d4c6b651 100644 --+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++@@ -18,7 +18,6 @@ --+++ # See the License for the specific language governing permissions and --+++ # limitations under the License. --+++ """MindSpore Qwen2MoE model.""" --+++- --+++ import math --+++ from typing import List, Optional, Tuple, Union --+++ --+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++ TokenClassifierOutput, --+++ ) --+++ from ...modeling_utils import PreTrainedModel --++++from ...generation import GenerationMixin --+++ from ....utils import logging --+++ from .configuration_qwen2_moe import Qwen2MoeConfig --+++ --+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++ self.variance_epsilon = eps --+++ --+++ def forward(self, hidden_states): --++++ # @dwj --++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++ # @lwx --++++ # if not self.training : --++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++ input_dtype = hidden_states.dtype --+++ hidden_states = hidden_states.to(mindspore.float32) --+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++@@ -234,6 +239,8 @@ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++ x1 = x[..., : x.shape[-1] // 2] --+++ x2 = x[..., x.shape[-1] // 2 :] --++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++ self.config = config --+++ self.hidden_size = config.hidden_size --+++ self.intermediate_size = intermediate_size --++++ --+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++ self.act_fn = ACT2FN[config.hidden_act] --+++ --+++ def forward(self, x): --+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++- --+++ --++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++ # @lwx --++++ # gate_up_output = self.gate_up_proj(x) --++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++++ # return self.down_proj(swiglu_output) --++++ --++++ # def forward(self, x): --++++ # gate_proj_out = self.gate_proj(x) --++++ # up_proj_out = self.up_proj(x) --++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++++ # return self.down_proj(swiglu_out) --++++ --+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++ """ --+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++ use_cache: bool = False, --+++ cache_position: Optional[mindspore.Tensor] = None, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ --++++ --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ query_states = self.q_proj(hidden_states) --+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ "with a layer index." --+++ ) --+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ if isinstance(past_key_value, StaticCache): --++++ kv_seq_len = key_states.shape[-2] --++++ else: --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++ if isinstance(past_key_value, StaticCache): --++++ kv_seq_len = key_states.shape[-2] --+++ --+++ # repeat k/v heads if n_kv_heads < n_heads --+++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++- --++++ --+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++ --+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++- raise ValueError( --+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++- f" {attn_weights.shape}" --+++- ) --+++- --+++- if attention_mask is not None: # no matter the length, we just slice it --+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++++ if attention_mask is not None: --++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++ attn_weights = attn_weights + causal_mask --+++ --+++ # upcast attention to fp32 --+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++ --+++ attn_output = self.o_proj(attn_output) --+++- --++++ # @lwx --++++ --++++ # max_seq_len = self.max_position_embeddings # 2048 --++++ --++++ # if attention_mask is not None: --++++ # # attention_mask: [B, 1, Sq, Sk] --++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++ --++++ # # pad 到 [max_seq_len, max_seq_len] --++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++ # global_attention_mask = padded_mask --++++ # else: --++++ # global_attention_mask = None --++++ --++++ --++++ # sparse_mode=3 --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, --++++ # key=key_states, --++++ # value=value_states, --++++ # real_shift=None, --++++ # padding_mask=None, --++++ --++++ # head_num=self.num_heads, --++++ # attn_mask=global_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++ # input_layout="BNSD", --++++ # pre_tokens=2147483647, --++++ # next_tokens=2147483647, --++++ # inner_precise=0, --++++ # drop_mask=None, --++++ # prefix=None, --++++ # actual_seq_qlen=None, --++++ # actual_seq_kvlen=None, --++++ # sparse_mode=sparse_mode, --++++ # ) --+++ if not output_attentions: --+++ attn_weights = None --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++ --++++class Qwen2MoeFlashAttention(nn.Module): --++++ """ --++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++ --++++ 关键改动: --++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++ 直接传入原始的 key 和 value 张量效率更高。 --++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++ """ --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++ super().__init__() --++++ self.config = config --++++ self.layer_idx = layer_idx --++++ self.hidden_size = config.hidden_size --++++ self.num_heads = config.num_attention_heads --++++ self.head_dim = self.hidden_size // self.num_heads --++++ self.num_key_value_heads = config.num_key_value_heads --++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++ self.max_position_embeddings = config.max_position_embeddings --++++ self.rope_theta = config.rope_theta --++++ self.attention_dropout = config.attention_dropout --++++ --++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++ raise ValueError( --++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++ ) --++++ --++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++ --++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++ self.head_dim, --++++ max_position_embeddings=self.max_position_embeddings, --++++ base=self.rope_theta, --++++ ) --++++ --++++ def forward( --++++ self, --++++ hidden_states: mindspore.Tensor, --++++ attention_mask: Optional[mindspore.Tensor] = None, --++++ position_ids: Optional[mindspore.Tensor] = None, --++++ past_key_value: Optional[Cache] = None, --++++ output_attentions: bool = False, --++++ use_cache: bool = False, --++++ cache_position: Optional[mindspore.Tensor] = None, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ # 1. 线性投射 Q, K, V --++++ query_states = self.q_proj(hidden_states) --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++ # query: [B, S, H*D] -> [B, N1, S, D] --++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # 3. RoPE 旋转位置编码 --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++ if self.layer_idx is None: --++++ raise ValueError( --++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ "with a layer index." --++++ ) --++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++ if cache_position.shape[0] == 1: --++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++ kv_seq_len = past_seen_tokens + 1 --++++ else: --++++ # prefill 阶段:cache_position 是范围,使用其长度 --++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++ else: --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # 4. KV 缓存更新 --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ key_states, value_states = past_key_value.update( --++++ key_states, value_states, self.layer_idx, cache_kwargs --++++ ) --++++ --++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++ if cache_position.shape[0] == 1: --++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++ kv_seq_len = key_states.shape[-2] --++++ --++++ # 5. [重要] 准备 Attention Mask --++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++ fa_attention_mask = None --++++ if attention_mask is not None: --++++ # 截取与当前key长度匹配的部分 --++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++++ fa_attention_mask = (mask_slice != 0) --++++ --++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++ input_dtype = query_states.dtype --++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++ query_states = query_states.to(mindspore.float16) --++++ key_states = key_states.to(mindspore.float16) --++++ value_states = value_states.to(mindspore.float16) --++++ --++++ # 6. [核心] 调用 flash_attention_score 算子 --++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++ attn_output = mindspore.ops.flash_attention_score( --++++ query=query_states, --++++ key=key_states, --++++ value=value_states, --++++ head_num=self.num_heads, # 传入Q的头数(N1) --++++ attn_mask=fa_attention_mask, --++++ keep_prob=1.0 - self.attention_dropout, --++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++ input_layout="BNSD", --++++ sparse_mode=0 # 使用 defaultMask 模式 --++++ ) --++++ --++++ # 恢复原始数据类型 --++++ attn_output = attn_output.to(input_dtype) --++++ --++++ # 7. 调整输出形状 --++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ attn_output = self.o_proj(attn_output) --++++ --++++ # FlashAttention 算子不直接返回注意力权重矩阵 --++++ attn_weights = None --++++ if output_attentions: --++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++ # def forward( --++++ # self, --++++ # hidden_states: mindspore.Tensor, --++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++ # position_ids: Optional[mindspore.Tensor] = None, --++++ # past_key_value: Optional[Cache] = None, --++++ # output_attentions: bool = False, --++++ # use_cache: bool = False, --++++ # cache_position: Optional[mindspore.Tensor] = None, --++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ # bsz, q_len, _ = hidden_states.shape --++++ --++++ # # 1. 线性投射 Q, K, V --++++ # query_states = self.q_proj(hidden_states) --++++ # key_states = self.k_proj(hidden_states) --++++ # value_states = self.v_proj(hidden_states) --++++ --++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # # 3. RoPE 旋转位置编码 --++++ # kv_seq_len = key_states.shape[-2] --++++ # if past_key_value is not None: --++++ # if self.layer_idx is None: --++++ # raise ValueError( --++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ # "with a layer index." --++++ # ) --++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # # 4. KV 缓存更新 --++++ # if past_key_value is not None: --++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ # key_states, value_states = past_key_value.update( --++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++ # ) --++++ --++++ # # 5. 准备 Attention Mask --++++ # fa_attention_mask = None --++++ # if attention_mask is not None: --++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # fa_attention_mask = (mask_slice != 0) --++++ --++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++ # input_dtype = query_states.dtype --++++ --++++ # # 6. [核心] 调用 flash_attention_score 算子 --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, --++++ # key=key_states, --++++ # value=value_states, --++++ # head_num=self.num_heads, --++++ # attn_mask=fa_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++ # input_layout="BNSD", --++++ # sparse_mode=0, --++++ # # <--- 修改点 2: 启用内部高精度计算 --- --++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++ # inner_precise=1 --++++ # ) --++++ --++++ # # 恢复原始数据类型 --++++ # attn_output = attn_output.to(input_dtype) --++++ --++++ # # 7. 调整输出形状 --++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ # attn_output = self.o_proj(attn_output) --++++ --++++ # attn_weights = None --++++ # if output_attentions: --++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++ # return attn_output, attn_weights, past_key_value --++++ --++++ # def forward( --++++ # self, --++++ # hidden_states: mindspore.Tensor, --++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++ # position_ids: Optional[mindspore.Tensor] = None, --++++ # past_key_value: Optional[Cache] = None, --++++ # output_attentions: bool = False, --++++ # use_cache: bool = False, --++++ # cache_position: Optional[mindspore.Tensor] = None, --++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ # bsz, q_len, _ = hidden_states.shape --++++ --++++ # query_states = self.q_proj(hidden_states) --++++ # key_states = self.k_proj(hidden_states) --++++ # value_states = self.v_proj(hidden_states) --++++ --++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # kv_seq_len = key_states.shape[-2] --++++ # if past_key_value is not None: --++++ # if self.layer_idx is None: --++++ # raise ValueError("`layer_idx` must be specified for caching") --++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # if past_key_value is not None: --++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ # key_states, value_states = past_key_value.update( --++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++ # ) --++++ --++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++ --++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++ # query_states = query_states / math.sqrt(self.head_dim) --++++ # # <--- 修改结束 --- --++++ --++++ # fa_attention_mask = None --++++ # if attention_mask is not None: --++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # fa_attention_mask = (mask_slice != 0) --++++ --++++ # input_dtype = query_states.dtype --++++ --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, # 传入已经预先缩放过的 query --++++ # key=key_states, --++++ # value=value_states, --++++ # head_num=self.num_heads, --++++ # attn_mask=fa_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++ # input_layout="BNSD", --++++ # sparse_mode=0, --++++ # inner_precise=1 # 仍然保持内部高精度计算 --++++ # ) --++++ --++++ # attn_output = attn_output.to(input_dtype) --++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ # attn_output = self.o_proj(attn_output) --++++ --++++ # attn_weights = None --++++ # if output_attentions: --++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++ --++++ # return attn_output, attn_weights, past_key_value --++++ --+++ QWEN2MOE_ATTENTION_CLASSES = { --+++ "eager": Qwen2MoeAttention, --++++ "flash-attention": Qwen2MoeFlashAttention, --+++ } --+++ --+++ --+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --++++ #@dwj --++++ # 只遍历激活的专家,而非全部专家 --+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- hidden_states = hidden_states.view(-1, hidden_dim) --+++- # router_logits: (batch * sequence_length, n_experts) --+++- router_logits = self.gate(hidden_states) --+++- --+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- if self.norm_topk_prob: --+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- # we cast back to the input dtype --+++- routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++- final_hidden_states = ops.zeros( --+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++- ) --+++- --+++- # One hot encode the selected experts to create an expert mask --+++- # this will be used to easily index which expert is going to be sollicitated --+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++- --+++- # Loop over all available experts in the model and perform the computation on each expert --+++- for expert_idx in range(self.num_experts): --+++- expert_layer = self.experts[expert_idx] --+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++- --+++- # Index the correct hidden states and compute the expert hidden state for --+++- # the current expert. We need to make sure to multiply the output hidden --+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++- if 0 not in idx.shape: --+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++- --+++- # However `index_add_` only support torch tensors for indexing so we'll use --+++- # the `top_x` tensor here. --+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++- --+++- shared_expert_output = self.shared_expert(hidden_states) --+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++- --+++- final_hidden_states = final_hidden_states + shared_expert_output --++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++ num_tokens = hidden_states_reshaped.shape[0] --++++ --++++ router_logits = self.gate(hidden_states_reshaped) --++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++ if self.norm_topk_prob: --++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++ flat_selected_experts = selected_experts.flatten() --++++ --++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++ token_indices = broadcasted_token_indices.flatten() --++++ --++++ active_experts = ops.unique(flat_selected_experts) --++++ --++++ for expert_idx_tensor in active_experts: --++++ expert_idx = expert_idx_tensor.item() --++++ expert_layer = self.experts[expert_idx] --++++ --++++ mask = (flat_selected_experts == expert_idx_tensor) --++++ selected_token_indices = token_indices[mask] --++++ selected_routing_weights = routing_weights.flatten()[mask] --++++ --++++ current_states = hidden_states_reshaped[selected_token_indices] --++++ --++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++ --++++ final_hidden_states = final_hidden_states.index_add( --++++ dim=0, --++++ index=selected_token_indices, --++++ source=expert_output.to(hidden_states.dtype) --++++ ) --++++ --++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++ --+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++- return final_hidden_states, router_logits --++++ final_hidden_states = final_hidden_states + shared_expert_output --++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++ --++++ return final_hidden_states, router_logits --+++ --+++ --+++ class Qwen2MoeDecoderLayer(nn.Module): --+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++ --+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++ --+++ if (layer_idx not in config.mlp_only_layers) and ( --+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++ ): --+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++ _skip_keys_device_placement = "past_key_values" --+++ _supports_cache_class = True --++++#lwx --++++ # _supports_static_cache = True --+++ --+++ def _init_weights(self, module): --+++ std = self.config.initializer_range --+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++ return causal_mask --+++ --+++ --+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ _tied_weights_keys = ["lm_head.weight"] --+++ --+++ def __init__(self, config): --+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ self.num_experts_per_tok = config.num_experts_per_tok --+++ # Initialize weights and apply final processing --+++ self.post_init() --++++ # @lwx --++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++++ # self.generation_config.cache_implementation = "static" --++++ self._warmed_up = False --++++ --++++ def warmup_moe_model(self): --++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++++ test_texts = [ --++++ "warmup short", --++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++++ ] --++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++ if tokenizer is None: --++++ from mindnlp.transformers import AutoTokenizer --++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++ self._warmup_tokenizer = tokenizer --++++ --++++ for text in test_texts: --++++ inputs = tokenizer(text, return_tensors="ms") --++++ with mindspore._no_grad(): --++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++ --+++ def get_input_embeddings(self): --+++ return self.model.embed_tokens --+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++ ```""" --++++ if not self._warmed_up: --++++ self._warmed_up = True --++++ self.warmup_moe_model() --+++ --+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++ output_router_logits = ( --+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ } --+++ ) --+++ return model_inputs --++++# @lwx --++++ # def _decode_one_tokens_logits( --++++ # self, --++++ # cur_token: mindspore.Tensor, --++++ # input_pos: Optional[mindspore.Tensor], --++++ # cache_position: mindspore.Tensor, --++++ # past_key_values: StaticCache, --++++ # ) -> mindspore.Tensor: --++++ # """ --++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++++ --++++ # Args: --++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++++ # input_pos: 输入位置信息,可选 --++++ # cache_position: 当前token在cache中的位置,shape为(1,) --++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++++ --++++ # Returns: --++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++++ # """ --++++ # # 调用JIT编译的版本 --++++ # return self.get_decode_one_tokens_logits( --++++ # cur_token=cur_token, --++++ # input_pos=input_pos, --++++ # cache_position=cache_position, --++++ # past_key_values=past_key_values, --++++ # ) --++++ --++++ # @mindspore.jit(jit_level='O1') --++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++++ # """ --++++ # JIT编译的函数,用于高效的单token解码 --++++ # 使用JIT编译优化以支持静态shape和高效执行 --++++ --++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++++ # """ --++++ # outputs = self.model.forward( --++++ # input_ids=cur_token, --++++ # position_ids=input_pos, --++++ # cache_position=cache_position, --++++ # past_key_values=past_key_values, --++++ # use_cache=True, --++++ # return_dict=False, --++++ # ) --++++ --++++ # hidden_states = outputs[0] --++++ # logits = self.lm_head.forward(hidden_states) --++++ # logits = logits.float() --++++ --++++ # return logits[:, -1, :] --++++ --++++ # def _sample( --++++ # self, --++++ # input_ids: mindspore.Tensor, --++++ # logits_processor, --++++ # stopping_criteria, --++++ # generation_config, --++++ # synced_devices: bool, --++++ # streamer=None, --++++ # logits_warper=None, --++++ # **model_kwargs, --++++ # ): --++++ # """ --++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++++ # """ --++++ # from ...generation.logits_process import LogitsProcessorList --++++ # from ...generation.stopping_criteria import StoppingCriteriaList --++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++++ # from mindnlp.core import nn, ops, no_grad --++++ # import numpy as np --++++ --++++ # # 检查是否使用 StaticCache --++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++++ # # 否则,直接调用父类方法 --++++ # past_key_values = model_kwargs.get("past_key_values") --++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++++ --++++ # if not isinstance(past_key_values, StaticCache): --++++ # # 不使用 StaticCache,直接调用父类方法 --++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++++ # return super()._sample( --++++ # input_ids=input_ids, --++++ # logits_processor=logits_processor, --++++ # stopping_criteria=stopping_criteria, --++++ # generation_config=generation_config, --++++ # synced_devices=synced_devices, --++++ # streamer=streamer, --++++ # logits_warper=logits_warper, --++++ # **model_kwargs, --++++ # ) --++++ --++++ # # 使用 StaticCache,进入自定义循环 --++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++++ # pad_token_id = generation_config._pad_token_tensor --++++ # output_attentions = generation_config.output_attentions --++++ # output_hidden_states = generation_config.output_hidden_states --++++ # output_scores = generation_config.output_scores --++++ # output_logits = generation_config.output_logits --++++ # return_dict_in_generate = generation_config.return_dict_in_generate --++++ # max_length = generation_config.max_length --++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++++ # do_sample = generation_config.do_sample --++++ --++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++++ # raise ValueError( --++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++++ # f"{logits_warper})." --++++ # ) --++++ --++++ # # init attention / hidden states / scores tuples --++++ # scores = () if (return_dict_in_generate and output_scores) else None --++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++++ --++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++++ # encoder_hidden_states = ( --++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++++ # ) --++++ --++++ # # keep track of which sequences are already finished --++++ # batch_size, cur_len = input_ids.shape --++++ # this_peer_finished = False --++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++++ --++++ # time_record = [] --++++ # from ....utils.testing_utils import parse_flag_from_env --++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++++ --++++ # while self._has_unfinished_sequences( --++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++++ # ): --++++ # if _record_time: --++++ # import time as time_module --++++ # infer_start = time_module.time() --++++ --++++ # # prepare model inputs --++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++++ --++++ # # prepare variable output controls --++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++++ --++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++++ # cur_cache_position = model_inputs.get("cache_position") --++++ # cur_past_key_values = model_inputs.get("past_key_values") --++++ # cur_input_ids = model_inputs.get("input_ids") --++++ --++++ # if (isinstance(cur_past_key_values, StaticCache) and --++++ # cur_cache_position is not None and --++++ # len(cur_cache_position.shape) > 0 and --++++ # cur_cache_position.shape[0] == 1 and --++++ # cur_input_ids is not None and --++++ # cur_input_ids.shape[1] == 1): --++++ # # 使用 JIT 优化的单 token 解码 --++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++++ # if not hasattr(self, '_jit_used'): --++++ # self._jit_used = False --++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++++ --++++ # next_token_logits = self.get_decode_one_tokens_logits( --++++ # cur_token=cur_input_ids, --++++ # input_pos=model_inputs.get("position_ids"), --++++ # cache_position=cur_cache_position, --++++ # past_key_values=cur_past_key_values, --++++ # ) --++++ --++++ # # 标记已使用JIT(用于后续判断) --++++ # if not self._jit_used: --++++ # self._jit_used = True --++++ --++++ # # 构造兼容的输出对象 --++++ # class JitOptimizedOutput: --++++ # def __init__(self, logits, config): --++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++++ # self.config = config --++++ # # 对于 JIT 优化路径,这些属性通常不需要 --++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++++ # self.attentions = None if not config.is_encoder_decoder else None --++++ # self.cross_attentions = None --++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++++ # self.hidden_states = None if not config.is_encoder_decoder else None --++++ --++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++++ # else: --++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++++ # outputs = self(**model_inputs, return_dict=True) --++++ --++++ # if synced_devices and this_peer_finished: --++++ # continue --++++ --++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++++ # next_token_logits = outputs.logits[:, -1, :] --++++ --++++ # # pre-process distribution --++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++++ # if do_sample: --++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++++ --++++ # # Store scores, attentions and hidden_states when required --++++ # if return_dict_in_generate: --++++ # if output_scores: --++++ # scores += (next_token_scores,) --++++ # if output_logits: --++++ # raw_logits += (next_token_logits,) --++++ # if output_attentions: --++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++++ # decoder_attentions += (attn,) if attn is not None else (None,) --++++ # if self.config.is_encoder_decoder: --++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++++ --++++ # if output_hidden_states: --++++ # hidden = ( --++++ # outputs.decoder_hidden_states --++++ # if self.config.is_encoder_decoder --++++ # else outputs.hidden_states --++++ # ) --++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++++ --++++ # # token selection --++++ # if do_sample: --++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++++ # else: --++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++++ --++++ # # finished sentences should have their next token be a padding token --++++ # if has_eos_stopping_criteria: --++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++++ --++++ # # update generated ids, model inputs, and length for next step --++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++++ # if streamer is not None: --++++ # streamer.put(next_tokens) --++++ --++++ # model_kwargs = self._update_model_kwargs_for_generation( --++++ # outputs, --++++ # model_kwargs, --++++ # is_encoder_decoder=self.config.is_encoder_decoder, --++++ # ) --++++ --++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++++ # cur_len += 1 --++++ --++++ # if _record_time: --++++ # import time as time_module --++++ # infer_stop = time_module.time() --++++ # time_record.append(infer_stop - infer_start) --++++ --++++ # del outputs --++++ --++++ # average_infer_time = None --++++ # if time_record: --++++ # if len(time_record) > 1: --++++ # time_record.pop(0) --++++ # average_infer_time = sum(time_record) / len(time_record) --++++ # print(f'average inference time is: {average_infer_time}') --++++ # print(f'inference time record: {time_record}') --++++ --++++ # if streamer is not None: --++++ # streamer.end() --++++ --++++ # # 简单判断:打印是否使用了JIT路径 --++++ # if hasattr(self, '_jit_used') and self._jit_used: --++++ # print("[JIT] ✓ JIT optimization was used during generation") --++++ # else: --++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++++ --++++ # if return_dict_in_generate: --++++ # if self.config.is_encoder_decoder: --++++ # return GenerateEncoderDecoderOutput( --++++ # sequences=input_ids, --++++ # scores=scores, --++++ # logits=raw_logits, --++++ # encoder_attentions=encoder_attentions, --++++ # encoder_hidden_states=encoder_hidden_states, --++++ # decoder_attentions=decoder_attentions, --++++ # cross_attentions=cross_attentions, --++++ # decoder_hidden_states=decoder_hidden_states, --++++ # past_key_values=model_kwargs.get("past_key_values"), --++++ # average_infer_time=average_infer_time --++++ # ) --++++ # else: --++++ # return GenerateDecoderOnlyOutput( --++++ # sequences=input_ids, --++++ # scores=scores, --++++ # logits=raw_logits, --++++ # attentions=decoder_attentions, --++++ # hidden_states=decoder_hidden_states, --++++ # past_key_values=model_kwargs.get("past_key_values"), --++++ # average_infer_time=average_infer_time --++++ # ) --++++ # else: --++++ # return input_ids --++++ --++++ # def _prepare_cache_for_generation( --++++ # self, --++++ # generation_config, --++++ # model_kwargs, --++++ # assistant_model, --++++ # batch_size, --++++ # max_cache_length, --++++ # ): --++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++++ # generation_config.cache_implementation = "static" --++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++++ --++++ # if generation_config.cache_implementation == "static": --++++ # base_required_from_max_length = generation_config.max_length + 1 --++++ # base_required = max(max_cache_length, base_required_from_max_length) --++++ # min_cache_size = 50 --++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++++ # else: --++++ # max_cache_length = max(base_required, min_cache_size) --++++ --++++ # original_max_cache_length = max_cache_length --++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++++ # print(f" - input max_cache_length: {original_max_cache_length}") --++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++++ # print(f" - final max_cache_length: {max_cache_length}") --++++ --++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++ # if max_cache_length > self.config.max_position_embeddings: --++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++++ --++++ # result = super()._prepare_cache_for_generation( --++++ # generation_config=generation_config, --++++ # model_kwargs=model_kwargs, --++++ # assistant_model=assistant_model, --++++ # batch_size=batch_size, --++++ # max_cache_length=max_cache_length, --++++ # ) --++++ --++++ # if generation_config.cache_implementation == "static": --++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++++ # created_cache = model_kwargs.get(cache_name) --++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++++ # if created_cache.max_cache_len < generation_config.max_length: --++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++++ --++++ # return result --++++ --++++ --++++ --+++ --+++ --+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++-- --+++2.27.0 --+++ --++-- --++2.27.0 --++ --+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+new file mode 100644 --+index 00000000..966529e4 --+--- /dev/null --++++ b/patches/0003-20261106secondcommit.patch --+@@ -0,0 +1,2769 @@ --++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Thu, 6 Nov 2025 14:54:37 +0800 --++Subject: [PATCH 3/3] 20261106secondcommit --++ --++--- --++ .../models/deepseek/modeling_deepseek.py | 217 ++- --++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- --++ patches/0001-20251104commit.patch | 1272 ----------------- --++ 3 files changed, 528 insertions(+), 2032 deletions(-) --++ delete mode 100644 patches/0001-20251104commit.patch --++ --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index 73773c22..2f9192bf 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) --++ --++ _CONFIG_FOR_DOC = "DeepseekConfig" --++ --+++_attn_mask_cache = {} --+++ --+++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --+++ q_len = batch_and_seq[1] --+++ kv_len = batch_and_seq[1] + past_key_values_length --+++ key = (batch_and_seq[0], q_len, kv_len) --+++ --+++ if key in _attn_mask_cache: --+++ return _attn_mask_cache[key] --+++ --+++ mask = _prepare_4d_causal_attention_mask( --+++ attention_mask, --+++ batch_and_seq, --+++ inputs_embeds, --+++ past_key_values_length, --+++ ) --+++ _attn_mask_cache[key] = mask --+++ return mask --++ --++ def _get_unpad_data(attention_mask): --++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): --++ return final_output --++ --++ --++- @no_grad() --++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++- expert_cache = ops.zeros_like(x) --++- idxs = flat_expert_indices.argsort() --++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++- token_idxs = idxs // self.num_experts_per_tok --++- --++- for i, end_idx in enumerate(tokens_per_expert): --++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++- if start_idx == end_idx: --++- continue --++- expert = self.experts[i] --++- exp_token_idx = token_idxs[start_idx:end_idx] --++- expert_tokens = x[exp_token_idx] --++- expert_out = expert(expert_tokens) --++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++- --++- return expert_cache --++- --++ # @no_grad() --++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++- # # expert_cache = torch.zeros_like(x) --++- # # idxs = flat_expert_indices.argsort() --++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++- # # token_idxs = idxs // self.num_experts_per_tok --++- # # for i, end_idx in enumerate(tokens_per_expert): --++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++- # # if start_idx == end_idx: --++- # # continue --++- # # expert = self.experts[i] --++- # # exp_token_idx = token_idxs[start_idx:end_idx] --++- # # expert_tokens = x[exp_token_idx] --++- # # expert_out = expert(expert_tokens) --++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++- # # return expert_cache --+++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++ # expert_cache = ops.zeros_like(x) --++ # idxs = flat_expert_indices.argsort() --++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): --++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++ --++ # return expert_cache --++- # @no_grad() --++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++- # expert_cache = ops.zeros_like(x) --+++ --+++ @no_grad() --+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++ """ --+++ 优化版 MoE prefill: --+++ - 批量张量化处理同一个 expert 的所有 token --+++ - 跳过无 token 的专家 --+++ - 保持结果完全一致 --+++ """ --+++ # 初始化输出缓存 --+++ expert_cache = ops.zeros_like(x) --++ --++- # # 排序保证顺序一致 --++- # idxs = flat_expert_indices.argsort() --++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++- # token_idxs = idxs // self.num_experts_per_tok --+++ # 排序(确保 scatter_add 位置对应原逻辑) --+++ idxs = flat_expert_indices.argsort() --+++ sorted_expert_indices = flat_expert_indices[idxs] --+++ sorted_token_indices = idxs // self.num_experts_per_tok --++ --++- # # 找出有 token 的专家 --++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++ # 每个 expert 的 token 数 --+++ tokens_per_expert = sorted_expert_indices.bincount() --++ --++- # for i in active_experts.tolist(): --++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++- # end_idx = tokens_per_expert[i] --++- # if start_idx == end_idx: # 没有 token --++- # continue --+++ # 找出有 token 的专家 --+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --++ --++- # exp_token_idx = token_idxs[start_idx:end_idx] --++- # expert_tokens = x[exp_token_idx] --++- # expert_out = self.experts[i](expert_tokens) --++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++ for expert_id in active_experts.tolist(): --+++ # 取该 expert 对应的排序后 token 区间 --+++ start = (tokens_per_expert[:expert_id]).sum().item() --+++ end = start + tokens_per_expert[expert_id].item() --++ --++- # expert_cache = mindspore.mint.scatter_add( --++- # expert_cache, --++- # 0, --++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++- # expert_out --++- # ) --+++ token_idx = sorted_token_indices[start:end] # 原 token 位置 --+++ expert_tokens = x[token_idx] # 取输入向量 --++ --++- # return expert_cache --+++ # 执行专家 MLP --+++ expert_out = self.experts[expert_id](expert_tokens) --+++ --+++ # 按权重缩放 --+++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --+++ --+++ # 回写到缓存(等价 scatter_add) --+++ expert_cache = mindspore.mint.scatter_add( --+++ expert_cache, --+++ 0, --+++ token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++ scaled_out --+++ ) --+++ --+++ return expert_cache --+++ --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # # expert_cache = torch.zeros_like(x) --+++ # # idxs = flat_expert_indices.argsort() --+++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++ # # token_idxs = idxs // self.num_experts_per_tok --+++ # # for i, end_idx in enumerate(tokens_per_expert): --+++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++ # # if start_idx == end_idx: --+++ # # continue --+++ # # expert = self.experts[i] --+++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # # expert_tokens = x[exp_token_idx] --+++ # # expert_out = expert(expert_tokens) --+++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++ # # return expert_cache --+++ # expert_cache = ops.zeros_like(x) --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # for i, end_idx in enumerate(tokens_per_expert): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # if start_idx == end_idx: --+++ # continue --+++ # expert = self.experts[i] --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = expert(expert_tokens) --+++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --+++ # return expert_cache --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++ # expert_cache = ops.zeros_like(x) --+++ --+++ # # 排序保证顺序一致 --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ # token_idxs = idxs // self.num_experts_per_tok --+++ --+++ # # 找出有 token 的专家 --+++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++ --+++ # for i in active_experts.tolist(): --+++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ # end_idx = tokens_per_expert[i] --+++ # if start_idx == end_idx: # 没有 token --+++ # continue --+++ --+++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++ # expert_tokens = x[exp_token_idx] --+++ # expert_out = self.experts[i](expert_tokens) --+++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++ --+++ # expert_cache = mindspore.mint.scatter_add( --+++ # expert_cache, --+++ # 0, --+++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++ # expert_out --+++ # ) --+++ --+++ # return expert_cache --++ --++ --++ --++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): --++ --++ return attn_output, attn_weights, past_key_value --++ --++- --++ # class DeepseekFlashAttention(nn.Module): --++ # """ --++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): --++ --++ return attn_output, attn_weights, past_key_value --++ --+++ --++ Deepseek_ATTENTION_CLASSES = { --++ "eager": DeepseekAttention, --++ "flash-attention": DeepseekFlashAttention, --++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): --++ ) --++ else: --++ # 4d mask is passed through the layers --++- attention_mask = _prepare_4d_causal_attention_mask( --+++ # attention_mask = _prepare_4d_causal_attention_mask( --+++ # attention_mask, --+++ # (batch_size, seq_length), --+++ # inputs_embeds, --+++ # past_key_values_length, --+++ # ) --+++ #@dwj --+++ attention_mask = get_cached_causal_mask( --++ attention_mask, --++ (batch_size, seq_length), --++ inputs_embeds, --++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ # Initialize weights and apply final processing --++ self.post_init() --++ self.warm_up = False --+++ #@dwj --+++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --+++ self.num_layers, --+++ self.num_attention_heads, --+++ self.head_dim, --+++ batch_size=1, --+++ max_length=self.max_length, --+++ dtype=mindspore.float16 --+++ ) --+++ --+++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --+++ key_cache = [] --+++ value_cache = [] --+++ for _ in range(num_layers): --+++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++ key_cache.append(k) --+++ value_cache.append(v) --+++ return key_cache, value_cache --+++ --++ --++ def warmup_moe_model_deep(self): --++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++index bced285c..ebd7782e 100644 --++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) --++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --++ --++-Long_Prompt = False --++-PROMPT_LENGTH_THRESHOLD = 128 --+++Long_Prompt = 1 --+++LONG_PROMPT_LENGTH_THRESHOLD = 128 --+++SHORT_PROMPT_LENGTH_THRESHOLD = 32 --+++ --+++_causal_mask_cache = {} --+++ --+++def get_cached_causal_mask_with_cache_position( --+++ attention_mask: mindspore.Tensor, --+++ sequence_length: int, --+++ target_length: int, --+++ dtype: mindspore.dtype, --+++ min_dtype: float, --+++ cache_position: mindspore.Tensor, --+++ batch_size: int, --+++): --+++ """ --+++ 带缓存的 causal mask 构造函数 --+++ """ --+++ # q_len 是当前 query 长度 --+++ q_len = sequence_length --+++ # kv_len 是 target_length --+++ kv_len = target_length --+++ --+++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --+++ key = (batch_size, q_len, kv_len, dtype, min_dtype) --+++ --+++ if key in _causal_mask_cache: --+++ return _causal_mask_cache[key] --+++ --+++ # 调用原来的 mask 构造逻辑 --+++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++ attention_mask, --+++ sequence_length=sequence_length, --+++ target_length=target_length, --+++ dtype=dtype, --+++ min_dtype=min_dtype, --+++ cache_position=cache_position, --+++ batch_size=batch_size, --+++ ) --+++ # 缓存结果 --+++ _causal_mask_cache[key] = causal_mask --+++ return causal_mask --++ --++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --++ def _prepare_4d_causal_attention_mask_with_cache_position( --++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++ --++ --++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --+++# class Qwen2MoeAttention(nn.Module): --+++# """ --+++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+++# and "Generating Long Sequences with Sparse Transformers". --+++# """ --+++ --+++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++# super().__init__() --+++# self.config = config --+++# self.layer_idx = layer_idx --+++# if layer_idx is None: --+++# logger.warning_once( --+++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++# "when creating this class." --+++# ) --+++ --+++# self.hidden_size = config.hidden_size --+++# self.num_heads = config.num_attention_heads --+++# self.head_dim = self.hidden_size // self.num_heads --+++# self.num_key_value_heads = config.num_key_value_heads --+++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++# self.max_position_embeddings = config.max_position_embeddings --+++# self.rope_theta = config.rope_theta --+++# self.is_causal = True --+++# self.attention_dropout = config.attention_dropout --+++ --+++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++# raise ValueError( --+++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++# f" and `num_heads`: {self.num_heads})." --+++# ) --+++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++ --+++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++# self.head_dim, --+++# max_position_embeddings=self.max_position_embeddings, --+++# base=self.rope_theta, --+++# ) --+++ --+++# def forward( --+++# self, --+++# hidden_states: mindspore.Tensor, --+++# attention_mask: Optional[mindspore.Tensor] = None, --+++# position_ids: Optional[mindspore.Tensor] = None, --+++# past_key_value: Optional[Cache] = None, --+++# output_attentions: bool = False, --+++# use_cache: bool = False, --+++# cache_position: Optional[mindspore.Tensor] = None, --+++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++ --+++ --+++ --+++# bsz, q_len, _ = hidden_states.shape --+++ --+++# query_states = self.q_proj(hidden_states) --+++# key_states = self.k_proj(hidden_states) --+++# value_states = self.v_proj(hidden_states) --+++ --+++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++ --+++# kv_seq_len = key_states.shape[-2] --+++# if past_key_value is not None: --+++# if self.layer_idx is None: --+++# raise ValueError( --+++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++# "with a layer index." --+++# ) --+++# if isinstance(past_key_value, StaticCache): --+++# kv_seq_len = key_states.shape[-2] --+++# else: --+++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++# if past_key_value is not None: --+++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++# if isinstance(past_key_value, StaticCache): --+++# kv_seq_len = key_states.shape[-2] --+++ --+++# # repeat k/v heads if n_kv_heads < n_heads --+++# key_states = repeat_kv(key_states, self.num_key_value_groups) --+++# value_states = repeat_kv(value_states, self.num_key_value_groups) --+++ --+++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++ --+++# if attention_mask is not None: --+++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++# attn_weights = attn_weights + causal_mask --+++ --+++# # upcast attention to fp32 --+++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++# attn_output = ops.matmul(attn_weights, value_states) --+++ --+++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++# raise ValueError( --+++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+++# f" {attn_output.shape}" --+++# ) --+++ --+++# attn_output = ops.transpose(attn_output, 1, 2) --+++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++ --+++# attn_output = self.o_proj(attn_output) --+++# # @lwx --+++ --+++# # max_seq_len = self.max_position_embeddings # 2048 --+++ --+++# # if attention_mask is not None: --+++# # # attention_mask: [B, 1, Sq, Sk] --+++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++ --+++# # # pad 到 [max_seq_len, max_seq_len] --+++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++# # global_attention_mask = padded_mask --+++# # else: --+++# # global_attention_mask = None --+++ --+++ --+++# # sparse_mode=3 --+++# # attn_output = mindspore.ops.flash_attention_score( --+++# # query=query_states, --+++# # key=key_states, --+++# # value=value_states, --+++# # real_shift=None, --+++# # padding_mask=None, --+++ --+++# # head_num=self.num_heads, --+++# # attn_mask=global_attention_mask, --+++# # keep_prob=1.0 - self.attention_dropout, --+++# # scalar_value=1.0 / math.sqrt(self.head_dim), --+++# # input_layout="BNSD", --+++# # pre_tokens=2147483647, --+++# # next_tokens=2147483647, --+++# # inner_precise=0, --+++# # drop_mask=None, --+++# # prefix=None, --+++# # actual_seq_qlen=None, --+++# # actual_seq_kvlen=None, --+++# # sparse_mode=sparse_mode, --+++# # ) --+++# if not output_attentions: --+++# attn_weights = None --+++ --+++# return attn_output, attn_weights, past_key_value --+++ --++ class Qwen2MoeAttention(nn.Module): --++ """ --++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --++- and "Generating Long Sequences with Sparse Transformers". --++- """ --+++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 --++ --+++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --+++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --+++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --+++ --+++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --+++ """ --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++ super().__init__() --++ self.config = config --++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): --++ if layer_idx is None: --++ logger.warning_once( --++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++ "when creating this class." --++ ) --++ --++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): --++ use_cache: bool = False, --++ cache_position: Optional[mindspore.Tensor] = None, --++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++- --++ --++- --+++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- --++ bsz, q_len, _ = hidden_states.shape --++ --++ query_states = self.q_proj(hidden_states) --++ key_states = self.k_proj(hidden_states) --++ value_states = self.v_proj(hidden_states) --++ --++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++- --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ --++ kv_seq_len = key_states.shape[-2] --++ if past_key_value is not None: --++- if self.layer_idx is None: --++- raise ValueError( --++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++- "with a layer index." --++- ) --++- if isinstance(past_key_value, StaticCache): --++- kv_seq_len = key_states.shape[-2] --++- else: --++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++ --++ if past_key_value is not None: --++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++ --+++ # --- 2. 动态调度核心注意力计算 --- --+++ global Long_Prompt --+++ if Long_Prompt >= 1: --+++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --+++ fa_attention_mask = None --+++ if attention_mask is not None: --+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++ fa_attention_mask = (mask_slice != 0) --+++ --+++ attn_output = mindspore.ops.flash_attention_score( --+++ query=query_states, --+++ key=key_states, --+++ value=value_states, --+++ head_num=self.num_heads, --+++ attn_mask=fa_attention_mask, --+++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --+++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++ input_layout="BNSD", --+++ sparse_mode=0, --+++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --+++ ) --++ --++- if isinstance(past_key_value, StaticCache): --++- kv_seq_len = key_states.shape[-2] --+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ attn_output = self.o_proj(attn_output) --+++ attn_weights = None --+++ if output_attentions: --+++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --++ --++- # repeat k/v heads if n_kv_heads < n_heads --++- key_states = repeat_kv(key_states, self.num_key_value_groups) --++- value_states = repeat_kv(value_states, self.num_key_value_groups) --++- --++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++ else: --+++ # --- Eager Attention 路径 (用于短序列和解码) --- --+++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++ --+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++ --++- if attention_mask is not None: --++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++- attn_weights = attn_weights + causal_mask --+++ if attention_mask is not None: --+++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++ attn_weights = attn_weights + causal_mask --++ --++- # upcast attention to fp32 --++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++- attn_output = ops.matmul(attn_weights, value_states) --+++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++ attn_output = ops.matmul(attn_weights, value_states) --++ --++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++- raise ValueError( --++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --++- f" {attn_output.shape}" --++- ) --+++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++ raise ValueError( --+++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --+++ ) --++ --++- attn_output = ops.transpose(attn_output, 1, 2) --++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++ attn_output = ops.transpose(attn_output, 1, 2) --+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++ attn_output = self.o_proj(attn_output) --++ --++- attn_output = self.o_proj(attn_output) --++- # @lwx --+++ if not output_attentions: --+++ attn_weights = None --++ --++- # max_seq_len = self.max_position_embeddings # 2048 --++- --++- # if attention_mask is not None: --++- # # attention_mask: [B, 1, Sq, Sk] --++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++- --++- # # pad 到 [max_seq_len, max_seq_len] --++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++- # global_attention_mask = padded_mask --++- # else: --++- # global_attention_mask = None --++- --++- --++- # sparse_mode=3 --++- # attn_output = mindspore.ops.flash_attention_score( --++- # query=query_states, --++- # key=key_states, --++- # value=value_states, --++- # real_shift=None, --++- # padding_mask=None, --++- --++- # head_num=self.num_heads, --++- # attn_mask=global_attention_mask, --++- # keep_prob=1.0 - self.attention_dropout, --++- # scalar_value=1.0 / math.sqrt(self.head_dim), --++- # input_layout="BNSD", --++- # pre_tokens=2147483647, --++- # next_tokens=2147483647, --++- # inner_precise=0, --++- # drop_mask=None, --++- # prefix=None, --++- # actual_seq_qlen=None, --++- # actual_seq_kvlen=None, --++- # sparse_mode=sparse_mode, --++- # ) --++- if not output_attentions: --++- attn_weights = None --++- --++ return attn_output, attn_weights, past_key_value --++ --++- --++ # class Qwen2MoeFlashAttention(nn.Module): --++ # """ --++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { --++ # return final_hidden_states, router_logits --++ --++ --++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++-# """ --++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --++-# `_moe_infer_prefill` (用于长序列处理) 方法。 --++-# """ --++-# def __init__(self, config: Qwen2MoeConfig): --++-# super().__init__() --++-# self.num_experts = config.num_experts --++-# self.top_k = config.num_experts_per_tok --++-# self.norm_topk_prob = config.norm_topk_prob --++- --++-# # 门控网络 --++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++-# # 专家列表 --++-# self.experts = nn.ModuleList( --++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++-# ) --++-# # 共享专家 --++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-# @no_grad() --++-# def _moe_infer_decode( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# """ --++-# 【解码路径】针对 sequence_length=1 的极致优化。 --++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --++-# """ --++-# batch_size, hidden_dim = hidden_states.shape --++- --++-# expert_outputs_list = [ --++-# ops.cat([ --++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++-# ], dim=0) --++-# for i in range(batch_size) --++-# ] --++- --++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- --++-# # shape: (batch_size, top_k, hidden_dim) --++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++- --++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++- --++-# return moe_output.squeeze(1) --++- --++-# @no_grad() --++-# def _moe_infer_prefill( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# """ --++-# 【预填充路径】针对 sequence_length > 1 的优化。 --++-# 按专家对 Token 进行分组,并进行批处理。 --++-# """ --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens = hidden_states.shape[0] --++-# flat_selected_experts = selected_experts.flatten() --++- --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++- --++-# active_experts = ops.unique(flat_selected_experts) --++- --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++- --++-# mask = (flat_selected_experts == expert_idx_tensor) --++-# selected_token_indices = token_indices[mask] --++-# selected_routing_weights = routing_weights.flatten()[mask] --++- --++-# current_states = hidden_states[selected_token_indices] --++- --++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++- --++-# moe_output = moe_output.index_add( --++-# dim=0, --++-# index=selected_token_indices, --++-# source=expert_output.to(hidden_states.dtype) --++-# ) --++-# return moe_output --++- --++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-# """ --++-# 顶层 forward 方法,作为智能分发器。 --++-# """ --++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++- --++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-# router_logits = self.gate(hidden_states_reshaped) --++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- --++-# if self.norm_topk_prob: --++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- --++-# routing_weights = routing_weights.to(hidden_states.dtype) --++- --++-# moe_output = None --++-# # 在推理时,根据序列长度选择最优路径 --++-# if not self.training: --++-# if sequence_length == 1: --++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++-# else: --++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++-# else: --++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --++-# raise NotImplementedError("Training path is not implemented.") --++- --++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --++- --++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --++- --++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --++- --++-# return final_hidden_states, router_logits --++- --++- --++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++-# """ --++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --++-# """ --++-# def __init__(self, config: Qwen2MoeConfig): --++-# super().__init__() --++-# self.num_experts = config.num_experts --++-# self.top_k = config.num_experts_per_tok --++-# self.norm_topk_prob = config.norm_topk_prob --++- --++-# # 门控网络 --++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++-# # 专家列表 --++-# self.experts = nn.ModuleList( --++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++-# ) --++-# # 共享专家 --++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-# @no_grad() --++-# def _moe_infer_decode( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# batch_size, _ = hidden_states.shape --++-# expert_outputs_list = [ --++-# ops.cat([ --++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++-# ], dim=0) --++-# for i in range(batch_size) --++-# ] --++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++-# return moe_output.squeeze(1) --++- --++-# @no_grad() --++-# def _moe_infer_prefill( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens = hidden_states.shape[0] --++-# flat_selected_experts = selected_experts.flatten() --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++-# active_experts = ops.unique(flat_selected_experts) --++- --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++-# mask = (flat_selected_experts == expert_idx_tensor) --++-# selected_token_indices = token_indices[mask] --++-# selected_routing_weights = routing_weights.flatten()[mask] --++-# current_states = hidden_states[selected_token_indices] --++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++-# moe_output = moe_output.index_add( --++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++-# ) --++-# return moe_output --++- --++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-# """ --++-# 顶层 forward 方法,作为智能分发器。 --++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --++-# """ --++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++- --++-# # 1. 门控计算 (通用逻辑) --++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-# router_logits = self.gate(hidden_states_reshaped) --++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- --++-# if self.norm_topk_prob: --++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- --++-# routing_weights = routing_weights.to(hidden_states.dtype) --++- --++-# # 2. 智能分发到最优 MoE 路径 --++-# moe_output = None --++-# if not self.training: --++-# if sequence_length == 1: --++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++-# else: --++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++-# else: --++-# raise NotImplementedError("Training path is not implemented.") --++- --++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++- --++-# # 4. 合并 MoE 输出和共享专家输出 --++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++- --++-# # 5. 恢复原始形状并返回 --++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++- --++-# return final_hidden_states, router_logits --++- --++-# prefill fastest --++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++-# """ --++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --++-# """ --++-# def __init__(self, config: Qwen2MoeConfig): --++-# super().__init__() --++-# self.num_experts = config.num_experts --++-# self.top_k = config.num_experts_per_tok --++-# self.norm_topk_prob = config.norm_topk_prob --++- --++-# # 门控网络 --++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++-# # 专家列表 --++-# self.experts = nn.ModuleList( --++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++-# ) --++-# # 共享专家 --++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-# @no_grad() --++-# def _moe_infer_dispatch( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# """ --++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --++-# """ --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens, _ = hidden_states.shape --++- --++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --++-# flat_selected_experts = selected_experts.flatten() --++-# flat_routing_weights = routing_weights.flatten() --++- --++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++- --++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --++-# active_experts = ops.unique(flat_selected_experts) --++- --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++- --++-# # 找到所有分配给该专家的 token --++-# mask = (flat_selected_experts == expert_idx_tensor) --++- --++-# # 使用 mask 选取对应的 token 和权重 --++-# current_token_indices = token_indices[mask] --++-# current_routing_weights = flat_routing_weights[mask] --++-# current_hidden_states = hidden_states[current_token_indices] --++- --++-# # 对这些 token 进行批处理 --++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++- --++-# # 使用 index_add 将结果精确地加回到对应位置 --++-# moe_output = moe_output.index_add( --++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --++-# ) --++-# return moe_output --++- --++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-# """ --++-# 顶层 forward 方法,作为智能分发器。 --++-# """ --++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++- --++-# # 1. 门控计算 --++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-# router_logits = self.gate(hidden_states_reshaped) --++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- --++-# if self.norm_topk_prob: --++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- --++-# routing_weights = routing_weights.to(hidden_states.dtype) --++- --++-# # 2. 调用统一的 MoE 计算内核 --++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --++- --++-# # 3. 统一处理共享专家 --++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++- --++-# # 4. 合并输出 --++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++- --++-# # 5. 恢复原始形状并返回 --++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++- --++-# return final_hidden_states, router_logits --++- --++- --++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++-# """ --++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++-# 【最终高性能与高精度版】: --++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --++-# 3. 这样实现了速度和准确性的两全其美。 --++-# """ --++-# def __init__(self, config: Qwen2MoeConfig): --++-# super().__init__() --++-# self.num_experts = config.num_experts --++-# self.top_k = config.num_experts_per_tok --++-# self.norm_topk_prob = config.norm_topk_prob --++- --++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++-# self.experts = nn.ModuleList( --++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++-# ) --++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-# @no_grad() --++-# def _moe_infer_decode( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# """ --++-# 【解码路径】极致优化版:bmm + 高精度累加。 --++-# """ --++-# original_dtype = hidden_states.dtype --++-# batch_size, _ = hidden_states.shape --++- --++-# expert_outputs_list = [ --++-# ops.cat([ --++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++-# ], dim=0) --++-# for i in range(batch_size) --++-# ] --++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++- --++-# # 在 float32 下执行 bmm,得到高精度结果 --++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++- --++-# # 将高精度结果转换回原始数据类型 --++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --++- --++-# return moe_output --++- --++-# @no_grad() --++-# def _moe_infer_prefill( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# selected_experts: mindspore.Tensor, --++-# routing_weights: mindspore.Tensor --++-# ) -> mindspore.Tensor: --++-# """ --++-# 【预填充路径】与原始实现一致,结果精确。 --++-# """ --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens, _ = hidden_states.shape --++-# flat_selected_experts = selected_experts.flatten() --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++-# active_experts = ops.unique(flat_selected_experts) --++- --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++-# mask = (flat_selected_experts == expert_idx_tensor) --++-# selected_token_indices = token_indices[mask] --++-# selected_routing_weights = routing_weights.flatten()[mask] --++-# current_states = hidden_states[selected_token_indices] --++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++-# moe_output = moe_output.index_add( --++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++-# ) --++-# return moe_output --++- --++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++- --++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-# router_logits = self.gate(hidden_states_reshaped) --++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++- --++-# if self.norm_topk_prob: --++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- --++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --++-# # 如果模型主体是 float16,后续再转换 --++- --++-# moe_output = None --++-# if not self.training: --++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --++-# # _moe_infer_decode 内部会处理好类型转换 --++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) --++-# if sequence_length == 1: --++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --++-# else: --++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --++-# else: --++-# raise NotImplementedError("Training path is not implemented.") --++- --++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++- --++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++- --++-# return final_hidden_states, router_logits --++- --++- --++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++-# """ --++-# 【融合版】一个混合专家模块,内置两种推理策略, --++-# 由外部全局变量 `Long_Prompt` 控制: --++- --++-# - if Long_Prompt is True: 【精度优先模式】 --++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --++-# 适用于处理长序列,避免误差累积。 --++- --++-# - if Long_Prompt is False: 【速度优先模式】 --++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --++-# 在解码阶段获得极致速度,同时保证结果高度准确。 --++-# """ --++-# def __init__(self, config: Qwen2MoeConfig): --++-# super().__init__() --++-# self.num_experts = config.num_experts --++-# self.top_k = config.num_experts_per_tok --++-# self.norm_topk_prob = config.norm_topk_prob --++- --++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++-# self.experts = nn.ModuleList( --++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++-# ) --++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-# # --- 速度优先模式的辅助函数 --- --++-# @no_grad() --++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++-# original_dtype = hidden_states.dtype --++-# batch_size, _ = hidden_states.shape --++-# expert_outputs_list = [ --++-# ops.cat([ --++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++-# ], dim=0) --++-# for i in range(batch_size) --++-# ] --++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++-# weights_fp32 = routing_weights.to(mindspore.float32) --++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++-# return moe_output_fp32.squeeze(1).to(original_dtype) --++- --++-# @no_grad() --++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens, _ = hidden_states.shape --++-# flat_selected_experts = selected_experts.flatten() --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++-# active_experts = ops.unique(flat_selected_experts) --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++-# mask = (flat_selected_experts == expert_idx_tensor) --++-# selected_token_indices = token_indices[mask] --++-# selected_routing_weights = routing_weights.flatten()[mask] --++-# current_states = hidden_states[selected_token_indices] --++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --++-# return moe_output --++- --++-# # --- 精度优先模式的辅助函数 --- --++-# @no_grad() --++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++-# moe_output = ops.zeros_like(hidden_states) --++-# num_tokens, _ = hidden_states.shape --++-# flat_selected_experts = selected_experts.flatten() --++-# flat_routing_weights = routing_weights.flatten() --++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++-# active_experts = ops.unique(flat_selected_experts) --++-# for expert_idx_tensor in active_experts: --++-# expert_idx = expert_idx_tensor.item() --++-# expert_layer = self.experts[expert_idx] --++-# mask = (flat_selected_experts == expert_idx_tensor) --++-# current_token_indices = token_indices[mask] --++-# current_routing_weights = flat_routing_weights[mask] --++-# current_hidden_states = hidden_states[current_token_indices] --++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++-# return moe_output --++- --++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-# # 声明我们将要使用一个在模块外部定义的全局变量 --++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --++-# global Long_Prompt --++- --++-# # 1. 门控计算 (所有模式通用) --++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-# router_logits = self.gate(hidden_states_reshaped) --++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++-# if self.norm_topk_prob: --++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++- --++-# moe_output = None --++-# if not self.training: --++-# # 根据 Long_Prompt 标志选择模式 --++-# if Long_Prompt: --++-# # --- 精度优先模式 --- --++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++-# else: --++-# # --- 速度优先模式 --- --++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++-# if sequence_length == 1: --++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --++-# else: --++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --++-# else: --++-# raise NotImplementedError("Training path is not implemented.") --++- --++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++- --++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++- --++-# return final_hidden_states, router_logits --++- --++ class Qwen2MoeSparseMoeBlock(nn.Module): --++ """ --++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++ return moe_output_fp32.squeeze(1).to(original_dtype) --++ --+++ # @no_grad() --+++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++ # num_tokens, _ = hidden_states.shape --+++ # flat_selected_experts = selected_experts.flatten() --+++ # sorted_expert_indices = flat_selected_experts.argsort() --+++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++ # original_token_indices = sorted_expert_indices // self.top_k --+++ # moe_output = ops.zeros_like(hidden_states) --+++ # current_token_offset = 0 --+++ # for i in range(self.num_experts): --+++ # expert_token_count = tokens_per_expert[i] - current_token_offset --+++ # if expert_token_count == 0: --+++ # continue --+++ # end_offset = current_token_offset + expert_token_count --+++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++ # expert_hidden_states = hidden_states[expert_original_token_indices] --+++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++ # current_token_offset += expert_token_count --+++ # return moe_output --+++ --++ @no_grad() --++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++- num_tokens, _ = hidden_states.shape --++- flat_selected_experts = selected_experts.flatten() --++- sorted_expert_indices = flat_selected_experts.argsort() --++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++- original_token_indices = sorted_expert_indices // self.top_k --+++ """ --+++ 优化版 MoE prefill (速度优先模式): --+++ - 批量张量化处理同一个 expert 的所有 token --+++ - 跳过无 token 的专家 --+++ - 保持结果完全一致 --+++ """ --++ moe_output = ops.zeros_like(hidden_states) --++- current_token_offset = 0 --++- for i in range(self.num_experts): --++- expert_token_count = tokens_per_expert[i] - current_token_offset --++- if expert_token_count == 0: --++- continue --++- end_offset = current_token_offset + expert_token_count --++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++- expert_hidden_states = hidden_states[expert_original_token_indices] --++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++- current_token_offset += expert_token_count --+++ --+++ flat_selected_experts = selected_experts.flatten() --+++ flat_routing_weights = routing_weights.flatten() --+++ --+++ idxs = flat_selected_experts.argsort() --+++ sorted_expert_indices = flat_selected_experts[idxs] --+++ sorted_token_indices = idxs // self.top_k --+++ --+++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --+++ --+++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+++ --+++ for expert_id in active_experts.tolist(): --+++ start = int(tokens_per_expert[:expert_id].sum().item()) --+++ end = start + int(tokens_per_expert[expert_id].item()) --+++ --+++ token_idx = sorted_token_indices[start:end] --+++ expert_tokens = hidden_states[token_idx] --+++ --+++ expert_out = self.experts[expert_id](expert_tokens) --+++ --+++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --+++ --+++ moe_output = mindspore.mint.scatter_add( --+++ moe_output, --+++ 0, --+++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --+++ scaled_out.to(hidden_states.dtype) --+++ ) --+++ --++ return moe_output --++ --+++ --++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --++ @no_grad() --++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++ --++ moe_output = None --++- if Long_Prompt: --++- # --- 精度优先模式 (ACCURACY MODE) --- --++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ # if Long_Prompt==0: --+++ # # --- 精度优先模式 (ACCURACY MODE) --- --+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ # else: --+++ # # --- 速度优先模式 (SPEED MODE) --- --+++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++ # if sequence_length == 1: --+++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ # else: --+++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ --+++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++ if sequence_length == 1: --+++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++ else: --++- # --- 速度优先模式 (SPEED MODE) --- --++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --++- if sequence_length == 1: --++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++- else: --++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++- --+++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ --++ --++ # 3. 共享专家计算与合并 (所有模式通用) --++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++ --++ return final_hidden_states, router_logits --++ --+++ --++ class Qwen2MoeDecoderLayer(nn.Module): --++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --++ super().__init__() --++ self.hidden_size = config.hidden_size --++ --++- # if Long_Prompt: --++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++- # else: --+++ # if Long_Prompt == 2: --++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++ # else: --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ --++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++ --++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++ ) --++ --++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++ # attention_mask, --+++ # sequence_length=sequence_length, --+++ # target_length=target_length, --+++ # dtype=dtype, --+++ # min_dtype=min_dtype, --+++ # cache_position=cache_position, --+++ # batch_size=input_tensor.shape[0], --+++ # ) --+++ #@dwj --+++ causal_mask = get_cached_causal_mask_with_cache_position( --++ attention_mask, --++ sequence_length=sequence_length, --++ target_length=target_length, --++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --++ """ --++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --+++ _causal_mask_cache.clear() --++ --++ input_ids = kwargs.get("input_ids") --++ if input_ids is None and args: --++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ --++ if input_ids is not None: --++ prompt_length = input_ids.shape[1] --++- --++- if prompt_length > PROMPT_LENGTH_THRESHOLD: --++- Long_Prompt = True --+++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --+++ Long_Prompt = 2 --+++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --+++ Long_Prompt = 0 --++ else: --++- Long_Prompt = False --+++ Long_Prompt = 1 --+++ --++ --++ return super().generate(*args, **kwargs) --++ --++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++ dtype = self.lm_head.weight.dtype --++ min_dtype = float(ops.finfo(dtype).min) --++ --++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++ # attention_mask, --+++ # sequence_length=sequence_length, --+++ # target_length=past_key_values.get_max_length(), --+++ # dtype=dtype, --+++ # min_dtype=min_dtype, --+++ # cache_position=cache_position, --+++ # batch_size=batch_size, --+++ # ) --+++ --+++ #@dwj --+++ attention_mask = get_cached_causal_mask_with_cache_position( --++ attention_mask, --++ sequence_length=sequence_length, --++ target_length=past_key_values.get_max_length(), --++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++deleted file mode 100644 --++index 6dfb5b93..00000000 --++--- a/patches/0001-20251104commit.patch --+++++ /dev/null --++@@ -1,1272 +0,0 @@ --++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++-From: Pinoeer-kingxi <13022943007@163.com> --++-Date: Tue, 4 Nov 2025 09:11:51 +0800 --++-Subject: [PATCH] 20251104commit --++- --++---- --++- mindnlp/transformers/cache_utils.py | 28 +- --++- .../models/deepseek/modeling_deepseek.py | 149 ++- --++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++- 3 files changed, 976 insertions(+), 87 deletions(-) --++- --++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++-index cadd2e04..02f8d4be 100644 --++---- a/mindnlp/transformers/cache_utils.py --++-+++ b/mindnlp/transformers/cache_utils.py --++-@@ -812,14 +812,26 @@ class StaticCache(Cache): --++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++- # k_out[:, :, cache_position] = key_states --++- # v_out[:, :, cache_position] = value_states --++-- if ON_ORANGE_PI: --++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++-- else: --++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++-- --++-+ # if ON_ORANGE_PI: --++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++-+ # else: --++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++-+ # 确保 cache_position 是 1D tensor 并且类型正确 --++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++-+ if cache_position.ndim > 1: --++-+ cache_position = cache_position.flatten() --++-+ # 确保类型是 int32 或 int64(MindSpore 要求) --++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++-+ cache_position = cache_position.int() --++-+ --++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++-+ k_out[:, :, cache_position] = key_states --++-+ v_out[:, :, cache_position] = value_states --++-+ --++- return k_out, v_out --++- --++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++-index c695b944..d8303e45 100644 --++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++- # Copied from transformers.models.llama.modeling_llama.rotate_half --++- def rotate_half(x): --++- """Rotates half the hidden dims of the input.""" --++-- x1 = x[..., : x.shape[-1] // 2] --++-- x2 = x[..., x.shape[-1] // 2 :] --++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++-+ # x1 = x[..., : x.shape[-1] // 2] --++-+ # x2 = x[..., x.shape[-1] // 2 :] --++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++- return ops.cat((-x2, x1), dim=-1) --++- --++- --++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++- if self.training: --++- raise NotImplementedError("Training is not supported yet.") --++- else: --++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++-- if self.config.n_shared_experts is not None: --++-- y = y + self.shared_experts(identity) --++-- return y --++-+ # @lwx --++-+ if orig_shape[1] == 1: --++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++-+ y=y.view(*orig_shape) --++-+ if self.config.n_shared_experts is not None: --++-+ y = y + self.shared_experts(identity) --++-+ return y --++-+ else: --++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++-+ if self.config.n_shared_experts is not None: --++-+ y = y + self.shared_experts(identity) --++-+ return y --++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++-+ # if self.config.n_shared_experts is not None: --++-+ # y = y + self.shared_experts(identity) --++-+ # return y --++-+ --++-+ @no_grad() --++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++-+ --++-+ expert_cache = ops.zeros_like(x) --++-+ for i in range(self.num_experts_per_tok): --++-+ expert_id = flat_expert_indices[i].item() --++-+ weight = flat_expert_weights[i].item() --++-+ expert = self.experts[expert_id] --++-+ expert_out = expert(x) --++-+ expert_cache += expert_out * weight --++-+ return expert_cache --++- --++- @no_grad() --++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++-- # expert_cache = torch.zeros_like(x) --++-- # idxs = flat_expert_indices.argsort() --++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++-- # token_idxs = idxs // self.num_experts_per_tok --++-- # for i, end_idx in enumerate(tokens_per_expert): --++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++-- # if start_idx == end_idx: --++-- # continue --++-- # expert = self.experts[i] --++-- # exp_token_idx = token_idxs[start_idx:end_idx] --++-- # expert_tokens = x[exp_token_idx] --++-- # expert_out = expert(expert_tokens) --++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++-- # return expert_cache --++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++- expert_cache = ops.zeros_like(x) --++- idxs = flat_expert_indices.argsort() --++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++- token_idxs = idxs // self.num_experts_per_tok --++-+ --++- for i, end_idx in enumerate(tokens_per_expert): --++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++- if start_idx == end_idx: --++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++- expert_out = expert(expert_tokens) --++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++-+ --++- return expert_cache --++-+ --++-+ # @no_grad() --++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++-+ # # expert_cache = torch.zeros_like(x) --++-+ # # idxs = flat_expert_indices.argsort() --++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++-+ # # token_idxs = idxs // self.num_experts_per_tok --++-+ # # for i, end_idx in enumerate(tokens_per_expert): --++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++-+ # # if start_idx == end_idx: --++-+ # # continue --++-+ # # expert = self.experts[i] --++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] --++-+ # # expert_tokens = x[exp_token_idx] --++-+ # # expert_out = expert(expert_tokens) --++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++-+ # # return expert_cache --++-+ # expert_cache = ops.zeros_like(x) --++-+ # idxs = flat_expert_indices.argsort() --++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++-+ # token_idxs = idxs // self.num_experts_per_tok --++-+ --++-+ # for i, end_idx in enumerate(tokens_per_expert): --++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++-+ # if start_idx == end_idx: --++-+ # continue --++-+ # expert = self.experts[i] --++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --++-+ # expert_tokens = x[exp_token_idx] --++-+ # expert_out = expert(expert_tokens) --++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++-+ --++-+ # return expert_cache --++-+ # @no_grad() --++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++-+ # expert_cache = ops.zeros_like(x) --++-+ --++-+ # # 排序保证顺序一致 --++-+ # idxs = flat_expert_indices.argsort() --++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++-+ # token_idxs = idxs // self.num_experts_per_tok --++-+ --++-+ # # 找出有 token 的专家 --++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++-+ --++-+ # for i in active_experts.tolist(): --++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++-+ # end_idx = tokens_per_expert[i] --++-+ # if start_idx == end_idx: # 没有 token --++-+ # continue --++-+ --++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --++-+ # expert_tokens = x[exp_token_idx] --++-+ # expert_out = self.experts[i](expert_tokens) --++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++-+ --++-+ # expert_cache = mindspore.mint.scatter_add( --++-+ # expert_cache, --++-+ # 0, --++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++-+ # expert_out --++-+ # ) --++-+ --++-+ # return expert_cache --++-+ --++-+ --++- --++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++- # """ --++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++- --++- # Initialize weights and apply final processing --++- self.post_init() --++-+ self.warm_up = False --++-+ --++-+ def warmup_moe_model_deep(self): --++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++-+ test_texts = [ --++-+ "warmup short", --++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++-+ ] --++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --++-+ if tokenizer is None: --++-+ from mindnlp.transformers import AutoTokenizer --++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++-+ self._warmup_tokenizer = tokenizer --++-+ --++-+ for text in test_texts: --++-+ inputs = tokenizer(text, return_tensors="ms") --++-+ with mindspore._no_grad(): --++-+ _ = self(**inputs, use_cache=False) --++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++- --++- def get_input_embeddings(self): --++- return self.model.embed_tokens --++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++- ```""" --++-+ if not self.warm_up: --++-+ self.warm_up = True --++-+ self.warmup_moe_model_deep() --++-+ --++- output_attentions = ( --++- output_attentions --++- if output_attentions is not None --++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++-index 3cbf820e..d4c6b651 100644 --++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++-@@ -18,7 +18,6 @@ --++- # See the License for the specific language governing permissions and --++- # limitations under the License. --++- """MindSpore Qwen2MoE model.""" --++-- --++- import math --++- from typing import List, Optional, Tuple, Union --++- --++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++- TokenClassifierOutput, --++- ) --++- from ...modeling_utils import PreTrainedModel --++-+from ...generation import GenerationMixin --++- from ....utils import logging --++- from .configuration_qwen2_moe import Qwen2MoeConfig --++- --++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++- self.variance_epsilon = eps --++- --++- def forward(self, hidden_states): --++-+ # @dwj --++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++-+ # @lwx --++-+ # if not self.training : --++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++- input_dtype = hidden_states.dtype --++- hidden_states = hidden_states.to(mindspore.float32) --++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++-@@ -234,6 +239,8 @@ def rotate_half(x): --++- """Rotates half the hidden dims of the input.""" --++- x1 = x[..., : x.shape[-1] // 2] --++- x2 = x[..., x.shape[-1] // 2 :] --++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++- return ops.cat((-x2, x1), dim=-1) --++- --++- --++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++- self.config = config --++- self.hidden_size = config.hidden_size --++- self.intermediate_size = intermediate_size --++-+ --++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++- self.act_fn = ACT2FN[config.hidden_act] --++- --++- def forward(self, x): --++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++-- --++- --++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++-+ # @lwx --++-+ # gate_up_output = self.gate_up_proj(x) --++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++-+ # return self.down_proj(swiglu_output) --++-+ --++-+ # def forward(self, x): --++-+ # gate_proj_out = self.gate_proj(x) --++-+ # up_proj_out = self.up_proj(x) --++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++-+ # return self.down_proj(swiglu_out) --++-+ --++- # Copied from transformers.models.llama.modeling_llama.repeat_kv --++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++- """ --++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++- use_cache: bool = False, --++- cache_position: Optional[mindspore.Tensor] = None, --++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++-+ --++-+ --++-+ --++- bsz, q_len, _ = hidden_states.shape --++- --++- query_states = self.q_proj(hidden_states) --++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++- "with a layer index." --++- ) --++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++-+ if isinstance(past_key_value, StaticCache): --++-+ kv_seq_len = key_states.shape[-2] --++-+ else: --++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++- --++- if past_key_value is not None: --++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++-+ --++-+ if isinstance(past_key_value, StaticCache): --++-+ kv_seq_len = key_states.shape[-2] --++- --++- # repeat k/v heads if n_kv_heads < n_heads --++- key_states = repeat_kv(key_states, self.num_key_value_groups) --++- value_states = repeat_kv(value_states, self.num_key_value_groups) --++-- --++-+ --++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++- --++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++-- raise ValueError( --++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++-- f" {attn_weights.shape}" --++-- ) --++-- --++-- if attention_mask is not None: # no matter the length, we just slice it --++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++-+ if attention_mask is not None: --++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++- attn_weights = attn_weights + causal_mask --++- --++- # upcast attention to fp32 --++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++- --++- attn_output = self.o_proj(attn_output) --++-- --++-+ # @lwx --++-+ --++-+ # max_seq_len = self.max_position_embeddings # 2048 --++-+ --++-+ # if attention_mask is not None: --++-+ # # attention_mask: [B, 1, Sq, Sk] --++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++-+ --++-+ # # pad 到 [max_seq_len, max_seq_len] --++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++-+ # global_attention_mask = padded_mask --++-+ # else: --++-+ # global_attention_mask = None --++-+ --++-+ --++-+ # sparse_mode=3 --++-+ # attn_output = mindspore.ops.flash_attention_score( --++-+ # query=query_states, --++-+ # key=key_states, --++-+ # value=value_states, --++-+ # real_shift=None, --++-+ # padding_mask=None, --++-+ --++-+ # head_num=self.num_heads, --++-+ # attn_mask=global_attention_mask, --++-+ # keep_prob=1.0 - self.attention_dropout, --++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --++-+ # input_layout="BNSD", --++-+ # pre_tokens=2147483647, --++-+ # next_tokens=2147483647, --++-+ # inner_precise=0, --++-+ # drop_mask=None, --++-+ # prefix=None, --++-+ # actual_seq_qlen=None, --++-+ # actual_seq_kvlen=None, --++-+ # sparse_mode=sparse_mode, --++-+ # ) --++- if not output_attentions: --++- attn_weights = None --++- --++- return attn_output, attn_weights, past_key_value --++- --++- --++-+class Qwen2MoeFlashAttention(nn.Module): --++-+ """ --++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++-+ --++-+ 关键改动: --++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++-+ 直接传入原始的 key 和 value 张量效率更高。 --++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++-+ """ --++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++-+ super().__init__() --++-+ self.config = config --++-+ self.layer_idx = layer_idx --++-+ self.hidden_size = config.hidden_size --++-+ self.num_heads = config.num_attention_heads --++-+ self.head_dim = self.hidden_size // self.num_heads --++-+ self.num_key_value_heads = config.num_key_value_heads --++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++-+ self.max_position_embeddings = config.max_position_embeddings --++-+ self.rope_theta = config.rope_theta --++-+ self.attention_dropout = config.attention_dropout --++-+ --++-+ if (self.head_dim * self.num_heads) != self.hidden_size: --++-+ raise ValueError( --++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++-+ ) --++-+ --++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++-+ --++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++-+ self.head_dim, --++-+ max_position_embeddings=self.max_position_embeddings, --++-+ base=self.rope_theta, --++-+ ) --++-+ --++-+ def forward( --++-+ self, --++-+ hidden_states: mindspore.Tensor, --++-+ attention_mask: Optional[mindspore.Tensor] = None, --++-+ position_ids: Optional[mindspore.Tensor] = None, --++-+ past_key_value: Optional[Cache] = None, --++-+ output_attentions: bool = False, --++-+ use_cache: bool = False, --++-+ cache_position: Optional[mindspore.Tensor] = None, --++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++-+ --++-+ bsz, q_len, _ = hidden_states.shape --++-+ --++-+ # 1. 线性投射 Q, K, V --++-+ query_states = self.q_proj(hidden_states) --++-+ key_states = self.k_proj(hidden_states) --++-+ value_states = self.v_proj(hidden_states) --++-+ --++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++-+ # query: [B, S, H*D] -> [B, N1, S, D] --++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ --++-+ # 3. RoPE 旋转位置编码 --++-+ kv_seq_len = key_states.shape[-2] --++-+ if past_key_value is not None: --++-+ if self.layer_idx is None: --++-+ raise ValueError( --++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++-+ "with a layer index." --++-+ ) --++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len --++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++-+ if cache_position.shape[0] == 1: --++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++-+ kv_seq_len = past_seen_tokens + 1 --++-+ else: --++-+ # prefill 阶段:cache_position 是范围,使用其长度 --++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++-+ else: --++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++-+ --++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++-+ --++-+ # 4. KV 缓存更新 --++-+ if past_key_value is not None: --++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++-+ key_states, value_states = past_key_value.update( --++-+ key_states, value_states, self.layer_idx, cache_kwargs --++-+ ) --++-+ --++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++-+ if cache_position.shape[0] == 1: --++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++-+ kv_seq_len = key_states.shape[-2] --++-+ --++-+ # 5. [重要] 准备 Attention Mask --++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++-+ fa_attention_mask = None --++-+ if attention_mask is not None: --++-+ # 截取与当前key长度匹配的部分 --++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False --++-+ fa_attention_mask = (mask_slice != 0) --++-+ --++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++-+ input_dtype = query_states.dtype --++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++-+ query_states = query_states.to(mindspore.float16) --++-+ key_states = key_states.to(mindspore.float16) --++-+ value_states = value_states.to(mindspore.float16) --++-+ --++-+ # 6. [核心] 调用 flash_attention_score 算子 --++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA --++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++-+ attn_output = mindspore.ops.flash_attention_score( --++-+ query=query_states, --++-+ key=key_states, --++-+ value=value_states, --++-+ head_num=self.num_heads, # 传入Q的头数(N1) --++-+ attn_mask=fa_attention_mask, --++-+ keep_prob=1.0 - self.attention_dropout, --++-+ scalar_value=1.0 / math.sqrt(self.head_dim), --++-+ input_layout="BNSD", --++-+ sparse_mode=0 # 使用 defaultMask 模式 --++-+ ) --++-+ --++-+ # 恢复原始数据类型 --++-+ attn_output = attn_output.to(input_dtype) --++-+ --++-+ # 7. 调整输出形状 --++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++-+ attn_output = self.o_proj(attn_output) --++-+ --++-+ # FlashAttention 算子不直接返回注意力权重矩阵 --++-+ attn_weights = None --++-+ if output_attentions: --++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++-+ --++-+ return attn_output, attn_weights, past_key_value --++-+ --++-+ # def forward( --++-+ # self, --++-+ # hidden_states: mindspore.Tensor, --++-+ # attention_mask: Optional[mindspore.Tensor] = None, --++-+ # position_ids: Optional[mindspore.Tensor] = None, --++-+ # past_key_value: Optional[Cache] = None, --++-+ # output_attentions: bool = False, --++-+ # use_cache: bool = False, --++-+ # cache_position: Optional[mindspore.Tensor] = None, --++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++-+ --++-+ # bsz, q_len, _ = hidden_states.shape --++-+ --++-+ # # 1. 线性投射 Q, K, V --++-+ # query_states = self.q_proj(hidden_states) --++-+ # key_states = self.k_proj(hidden_states) --++-+ # value_states = self.v_proj(hidden_states) --++-+ --++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ --++-+ # # 3. RoPE 旋转位置编码 --++-+ # kv_seq_len = key_states.shape[-2] --++-+ # if past_key_value is not None: --++-+ # if self.layer_idx is None: --++-+ # raise ValueError( --++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++-+ # "with a layer index." --++-+ # ) --++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++-+ --++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++-+ --++-+ # # 4. KV 缓存更新 --++-+ # if past_key_value is not None: --++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++-+ # key_states, value_states = past_key_value.update( --++-+ # key_states, value_states, self.layer_idx, cache_kwargs --++-+ # ) --++-+ --++-+ # # 5. 准备 Attention Mask --++-+ # fa_attention_mask = None --++-+ # if attention_mask is not None: --++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++-+ # fa_attention_mask = (mask_slice != 0) --++-+ --++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++-+ # input_dtype = query_states.dtype --++-+ --++-+ # # 6. [核心] 调用 flash_attention_score 算子 --++-+ # attn_output = mindspore.ops.flash_attention_score( --++-+ # query=query_states, --++-+ # key=key_states, --++-+ # value=value_states, --++-+ # head_num=self.num_heads, --++-+ # attn_mask=fa_attention_mask, --++-+ # keep_prob=1.0 - self.attention_dropout, --++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --++-+ # input_layout="BNSD", --++-+ # sparse_mode=0, --++-+ # # <--- 修改点 2: 启用内部高精度计算 --- --++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++-+ # inner_precise=1 --++-+ # ) --++-+ --++-+ # # 恢复原始数据类型 --++-+ # attn_output = attn_output.to(input_dtype) --++-+ --++-+ # # 7. 调整输出形状 --++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++-+ # attn_output = self.o_proj(attn_output) --++-+ --++-+ # attn_weights = None --++-+ # if output_attentions: --++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++-+ --++-+ # return attn_output, attn_weights, past_key_value --++-+ --++-+ # def forward( --++-+ # self, --++-+ # hidden_states: mindspore.Tensor, --++-+ # attention_mask: Optional[mindspore.Tensor] = None, --++-+ # position_ids: Optional[mindspore.Tensor] = None, --++-+ # past_key_value: Optional[Cache] = None, --++-+ # output_attentions: bool = False, --++-+ # use_cache: bool = False, --++-+ # cache_position: Optional[mindspore.Tensor] = None, --++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++-+ --++-+ # bsz, q_len, _ = hidden_states.shape --++-+ --++-+ # query_states = self.q_proj(hidden_states) --++-+ # key_states = self.k_proj(hidden_states) --++-+ # value_states = self.v_proj(hidden_states) --++-+ --++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-+ --++-+ # kv_seq_len = key_states.shape[-2] --++-+ # if past_key_value is not None: --++-+ # if self.layer_idx is None: --++-+ # raise ValueError("`layer_idx` must be specified for caching") --++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++-+ --++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++-+ --++-+ # if past_key_value is not None: --++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++-+ # key_states, value_states = past_key_value.update( --++-+ # key_states, value_states, self.layer_idx, cache_kwargs --++-+ # ) --++-+ --++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++-+ --++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- --++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++-+ # query_states = query_states / math.sqrt(self.head_dim) --++-+ # # <--- 修改结束 --- --++-+ --++-+ # fa_attention_mask = None --++-+ # if attention_mask is not None: --++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++-+ # fa_attention_mask = (mask_slice != 0) --++-+ --++-+ # input_dtype = query_states.dtype --++-+ --++-+ # attn_output = mindspore.ops.flash_attention_score( --++-+ # query=query_states, # 传入已经预先缩放过的 query --++-+ # key=key_states, --++-+ # value=value_states, --++-+ # head_num=self.num_heads, --++-+ # attn_mask=fa_attention_mask, --++-+ # keep_prob=1.0 - self.attention_dropout, --++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++-+ # input_layout="BNSD", --++-+ # sparse_mode=0, --++-+ # inner_precise=1 # 仍然保持内部高精度计算 --++-+ # ) --++-+ --++-+ # attn_output = attn_output.to(input_dtype) --++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++-+ # attn_output = self.o_proj(attn_output) --++-+ --++-+ # attn_weights = None --++-+ # if output_attentions: --++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++-+ --++-+ # return attn_output, attn_weights, past_key_value --++-+ --++- QWEN2MOE_ATTENTION_CLASSES = { --++- "eager": Qwen2MoeAttention, --++-+ "flash-attention": Qwen2MoeFlashAttention, --++- } --++- --++- --++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++- --++-+ #@dwj --++-+ # 只遍历激活的专家,而非全部专家 --++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++-- batch_size, sequence_length, hidden_dim = hidden_states.shape --++-- hidden_states = hidden_states.view(-1, hidden_dim) --++-- # router_logits: (batch * sequence_length, n_experts) --++-- router_logits = self.gate(hidden_states) --++-- --++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++-- if self.norm_topk_prob: --++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++-- # we cast back to the input dtype --++-- routing_weights = routing_weights.to(hidden_states.dtype) --++-- --++-- final_hidden_states = ops.zeros( --++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++-- ) --++-- --++-- # One hot encode the selected experts to create an expert mask --++-- # this will be used to easily index which expert is going to be sollicitated --++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++-- --++-- # Loop over all available experts in the model and perform the computation on each expert --++-- for expert_idx in range(self.num_experts): --++-- expert_layer = self.experts[expert_idx] --++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++-- --++-- # Index the correct hidden states and compute the expert hidden state for --++-- # the current expert. We need to make sure to multiply the output hidden --++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++-- if 0 not in idx.shape: --++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++-- --++-- # However `index_add_` only support torch tensors for indexing so we'll use --++-- # the `top_x` tensor here. --++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++-- --++-- shared_expert_output = self.shared_expert(hidden_states) --++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++-- --++-- final_hidden_states = final_hidden_states + shared_expert_output --++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape --++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++-+ num_tokens = hidden_states_reshaped.shape[0] --++-+ --++-+ router_logits = self.gate(hidden_states_reshaped) --++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++-+ --++-+ if self.norm_topk_prob: --++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++-+ routing_weights = routing_weights.to(hidden_states.dtype) --++-+ --++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++-+ flat_selected_experts = selected_experts.flatten() --++-+ --++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++-+ token_indices = broadcasted_token_indices.flatten() --++-+ --++-+ active_experts = ops.unique(flat_selected_experts) --++-+ --++-+ for expert_idx_tensor in active_experts: --++-+ expert_idx = expert_idx_tensor.item() --++-+ expert_layer = self.experts[expert_idx] --++-+ --++-+ mask = (flat_selected_experts == expert_idx_tensor) --++-+ selected_token_indices = token_indices[mask] --++-+ selected_routing_weights = routing_weights.flatten()[mask] --++-+ --++-+ current_states = hidden_states_reshaped[selected_token_indices] --++-+ --++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++-+ --++-+ final_hidden_states = final_hidden_states.index_add( --++-+ dim=0, --++-+ index=selected_token_indices, --++-+ source=expert_output.to(hidden_states.dtype) --++-+ ) --++-+ --++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++- --++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++-- return final_hidden_states, router_logits --++-+ final_hidden_states = final_hidden_states + shared_expert_output --++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++-+ --++-+ return final_hidden_states, router_logits --++- --++- --++- class Qwen2MoeDecoderLayer(nn.Module): --++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++- --++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++- --++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++-+ --++- if (layer_idx not in config.mlp_only_layers) and ( --++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++- ): --++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++- _no_split_modules = ["Qwen2MoeDecoderLayer"] --++- _skip_keys_device_placement = "past_key_values" --++- _supports_cache_class = True --++-+#lwx --++-+ # _supports_static_cache = True --++- --++- def _init_weights(self, module): --++- std = self.config.initializer_range --++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++- return causal_mask --++- --++- --++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++- _tied_weights_keys = ["lm_head.weight"] --++- --++- def __init__(self, config): --++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++- self.num_experts_per_tok = config.num_experts_per_tok --++- # Initialize weights and apply final processing --++- self.post_init() --++-+ # @lwx --++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++-+ # self.generation_config.cache_implementation = "static" --++-+ self._warmed_up = False --++-+ --++-+ def warmup_moe_model(self): --++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") --++-+ test_texts = [ --++-+ "warmup short", --++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++-+ ] --++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --++-+ if tokenizer is None: --++-+ from mindnlp.transformers import AutoTokenizer --++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++-+ self._warmup_tokenizer = tokenizer --++-+ --++-+ for text in test_texts: --++-+ inputs = tokenizer(text, return_tensors="ms") --++-+ with mindspore._no_grad(): --++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) --++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") --++- --++- def get_input_embeddings(self): --++- return self.model.embed_tokens --++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++- ```""" --++-+ if not self._warmed_up: --++-+ self._warmed_up = True --++-+ self.warmup_moe_model() --++- --++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++- output_router_logits = ( --++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++- } --++- ) --++- return model_inputs --++-+# @lwx --++-+ # def _decode_one_tokens_logits( --++-+ # self, --++-+ # cur_token: mindspore.Tensor, --++-+ # input_pos: Optional[mindspore.Tensor], --++-+ # cache_position: mindspore.Tensor, --++-+ # past_key_values: StaticCache, --++-+ # ) -> mindspore.Tensor: --++-+ # """ --++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++-+ --++-+ # Args: --++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++-+ # input_pos: 输入位置信息,可选 --++-+ # cache_position: 当前token在cache中的位置,shape为(1,) --++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 --++-+ --++-+ # Returns: --++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++-+ # """ --++-+ # # 调用JIT编译的版本 --++-+ # return self.get_decode_one_tokens_logits( --++-+ # cur_token=cur_token, --++-+ # input_pos=input_pos, --++-+ # cache_position=cache_position, --++-+ # past_key_values=past_key_values, --++-+ # ) --++-+ --++-+ # @mindspore.jit(jit_level='O1') --++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++-+ # """ --++-+ # JIT编译的函数,用于高效的单token解码 --++-+ # 使用JIT编译优化以支持静态shape和高效执行 --++-+ --++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++-+ # """ --++-+ # outputs = self.model.forward( --++-+ # input_ids=cur_token, --++-+ # position_ids=input_pos, --++-+ # cache_position=cache_position, --++-+ # past_key_values=past_key_values, --++-+ # use_cache=True, --++-+ # return_dict=False, --++-+ # ) --++-+ --++-+ # hidden_states = outputs[0] --++-+ # logits = self.lm_head.forward(hidden_states) --++-+ # logits = logits.float() --++-+ --++-+ # return logits[:, -1, :] --++-+ --++-+ # def _sample( --++-+ # self, --++-+ # input_ids: mindspore.Tensor, --++-+ # logits_processor, --++-+ # stopping_criteria, --++-+ # generation_config, --++-+ # synced_devices: bool, --++-+ # streamer=None, --++-+ # logits_warper=None, --++-+ # **model_kwargs, --++-+ # ): --++-+ # """ --++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++-+ # """ --++-+ # from ...generation.logits_process import LogitsProcessorList --++-+ # from ...generation.stopping_criteria import StoppingCriteriaList --++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++-+ # from mindnlp.core import nn, ops, no_grad --++-+ # import numpy as np --++-+ --++-+ # # 检查是否使用 StaticCache --++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++-+ # # 否则,直接调用父类方法 --++-+ # past_key_values = model_kwargs.get("past_key_values") --++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++-+ --++-+ # if not isinstance(past_key_values, StaticCache): --++-+ # # 不使用 StaticCache,直接调用父类方法 --++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++-+ # return super()._sample( --++-+ # input_ids=input_ids, --++-+ # logits_processor=logits_processor, --++-+ # stopping_criteria=stopping_criteria, --++-+ # generation_config=generation_config, --++-+ # synced_devices=synced_devices, --++-+ # streamer=streamer, --++-+ # logits_warper=logits_warper, --++-+ # **model_kwargs, --++-+ # ) --++-+ --++-+ # # 使用 StaticCache,进入自定义循环 --++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++-+ # pad_token_id = generation_config._pad_token_tensor --++-+ # output_attentions = generation_config.output_attentions --++-+ # output_hidden_states = generation_config.output_hidden_states --++-+ # output_scores = generation_config.output_scores --++-+ # output_logits = generation_config.output_logits --++-+ # return_dict_in_generate = generation_config.return_dict_in_generate --++-+ # max_length = generation_config.max_length --++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++-+ # do_sample = generation_config.do_sample --++-+ --++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++-+ # raise ValueError( --++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++-+ # f"{logits_warper})." --++-+ # ) --++-+ --++-+ # # init attention / hidden states / scores tuples --++-+ # scores = () if (return_dict_in_generate and output_scores) else None --++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++-+ --++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: --++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++-+ # encoder_hidden_states = ( --++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++-+ # ) --++-+ --++-+ # # keep track of which sequences are already finished --++-+ # batch_size, cur_len = input_ids.shape --++-+ # this_peer_finished = False --++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++-+ --++-+ # time_record = [] --++-+ # from ....utils.testing_utils import parse_flag_from_env --++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++-+ --++-+ # while self._has_unfinished_sequences( --++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++-+ # ): --++-+ # if _record_time: --++-+ # import time as time_module --++-+ # infer_start = time_module.time() --++-+ --++-+ # # prepare model inputs --++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++-+ --++-+ # # prepare variable output controls --++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++-+ --++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++-+ # cur_cache_position = model_inputs.get("cache_position") --++-+ # cur_past_key_values = model_inputs.get("past_key_values") --++-+ # cur_input_ids = model_inputs.get("input_ids") --++-+ --++-+ # if (isinstance(cur_past_key_values, StaticCache) and --++-+ # cur_cache_position is not None and --++-+ # len(cur_cache_position.shape) > 0 and --++-+ # cur_cache_position.shape[0] == 1 and --++-+ # cur_input_ids is not None and --++-+ # cur_input_ids.shape[1] == 1): --++-+ # # 使用 JIT 优化的单 token 解码 --++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++-+ # if not hasattr(self, '_jit_used'): --++-+ # self._jit_used = False --++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++-+ --++-+ # next_token_logits = self.get_decode_one_tokens_logits( --++-+ # cur_token=cur_input_ids, --++-+ # input_pos=model_inputs.get("position_ids"), --++-+ # cache_position=cur_cache_position, --++-+ # past_key_values=cur_past_key_values, --++-+ # ) --++-+ --++-+ # # 标记已使用JIT(用于后续判断) --++-+ # if not self._jit_used: --++-+ # self._jit_used = True --++-+ --++-+ # # 构造兼容的输出对象 --++-+ # class JitOptimizedOutput: --++-+ # def __init__(self, logits, config): --++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++-+ # self.config = config --++-+ # # 对于 JIT 优化路径,这些属性通常不需要 --++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None --++-+ # self.attentions = None if not config.is_encoder_decoder else None --++-+ # self.cross_attentions = None --++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++-+ # self.hidden_states = None if not config.is_encoder_decoder else None --++-+ --++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++-+ # else: --++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++-+ # outputs = self(**model_inputs, return_dict=True) --++-+ --++-+ # if synced_devices and this_peer_finished: --++-+ # continue --++-+ --++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++-+ # next_token_logits = outputs.logits[:, -1, :] --++-+ --++-+ # # pre-process distribution --++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) --++-+ # if do_sample: --++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) --++-+ --++-+ # # Store scores, attentions and hidden_states when required --++-+ # if return_dict_in_generate: --++-+ # if output_scores: --++-+ # scores += (next_token_scores,) --++-+ # if output_logits: --++-+ # raw_logits += (next_token_logits,) --++-+ # if output_attentions: --++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++-+ # decoder_attentions += (attn,) if attn is not None else (None,) --++-+ # if self.config.is_encoder_decoder: --++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++-+ --++-+ # if output_hidden_states: --++-+ # hidden = ( --++-+ # outputs.decoder_hidden_states --++-+ # if self.config.is_encoder_decoder --++-+ # else outputs.hidden_states --++-+ # ) --++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++-+ --++-+ # # token selection --++-+ # if do_sample: --++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++-+ # else: --++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++-+ --++-+ # # finished sentences should have their next token be a padding token --++-+ # if has_eos_stopping_criteria: --++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++-+ --++-+ # # update generated ids, model inputs, and length for next step --++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++-+ # if streamer is not None: --++-+ # streamer.put(next_tokens) --++-+ --++-+ # model_kwargs = self._update_model_kwargs_for_generation( --++-+ # outputs, --++-+ # model_kwargs, --++-+ # is_encoder_decoder=self.config.is_encoder_decoder, --++-+ # ) --++-+ --++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++-+ # cur_len += 1 --++-+ --++-+ # if _record_time: --++-+ # import time as time_module --++-+ # infer_stop = time_module.time() --++-+ # time_record.append(infer_stop - infer_start) --++-+ --++-+ # del outputs --++-+ --++-+ # average_infer_time = None --++-+ # if time_record: --++-+ # if len(time_record) > 1: --++-+ # time_record.pop(0) --++-+ # average_infer_time = sum(time_record) / len(time_record) --++-+ # print(f'average inference time is: {average_infer_time}') --++-+ # print(f'inference time record: {time_record}') --++-+ --++-+ # if streamer is not None: --++-+ # streamer.end() --++-+ --++-+ # # 简单判断:打印是否使用了JIT路径 --++-+ # if hasattr(self, '_jit_used') and self._jit_used: --++-+ # print("[JIT] ✓ JIT optimization was used during generation") --++-+ # else: --++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++-+ --++-+ # if return_dict_in_generate: --++-+ # if self.config.is_encoder_decoder: --++-+ # return GenerateEncoderDecoderOutput( --++-+ # sequences=input_ids, --++-+ # scores=scores, --++-+ # logits=raw_logits, --++-+ # encoder_attentions=encoder_attentions, --++-+ # encoder_hidden_states=encoder_hidden_states, --++-+ # decoder_attentions=decoder_attentions, --++-+ # cross_attentions=cross_attentions, --++-+ # decoder_hidden_states=decoder_hidden_states, --++-+ # past_key_values=model_kwargs.get("past_key_values"), --++-+ # average_infer_time=average_infer_time --++-+ # ) --++-+ # else: --++-+ # return GenerateDecoderOnlyOutput( --++-+ # sequences=input_ids, --++-+ # scores=scores, --++-+ # logits=raw_logits, --++-+ # attentions=decoder_attentions, --++-+ # hidden_states=decoder_hidden_states, --++-+ # past_key_values=model_kwargs.get("past_key_values"), --++-+ # average_infer_time=average_infer_time --++-+ # ) --++-+ # else: --++-+ # return input_ids --++-+ --++-+ # def _prepare_cache_for_generation( --++-+ # self, --++-+ # generation_config, --++-+ # model_kwargs, --++-+ # assistant_model, --++-+ # batch_size, --++-+ # max_cache_length, --++-+ # ): --++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: --++-+ # generation_config.cache_implementation = "static" --++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++-+ --++-+ # if generation_config.cache_implementation == "static": --++-+ # base_required_from_max_length = generation_config.max_length + 1 --++-+ # base_required = max(max_cache_length, base_required_from_max_length) --++-+ # min_cache_size = 50 --++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++-+ # else: --++-+ # max_cache_length = max(base_required, min_cache_size) --++-+ --++-+ # original_max_cache_length = max_cache_length --++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") --++-+ # print(f" - input max_cache_length: {original_max_cache_length}") --++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") --++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++-+ # print(f" - final max_cache_length: {max_cache_length}") --++-+ --++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++-+ # if max_cache_length > self.config.max_position_embeddings: --++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++-+ --++-+ # result = super()._prepare_cache_for_generation( --++-+ # generation_config=generation_config, --++-+ # model_kwargs=model_kwargs, --++-+ # assistant_model=assistant_model, --++-+ # batch_size=batch_size, --++-+ # max_cache_length=max_cache_length, --++-+ # ) --++-+ --++-+ # if generation_config.cache_implementation == "static": --++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++-+ # created_cache = model_kwargs.get(cache_name) --++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++-+ # if created_cache.max_cache_len < generation_config.max_length: --++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++-+ --++-+ # return result --++-+ --++-+ --++-+ --++- --++- --++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++--- --++-2.27.0 --++- --++-- --++2.27.0 --++ --+-- --+2.27.0 --+ ---- --2.27.0 -- -diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch -deleted file mode 100644 -index 80906633..00000000 ---- a/patches/0006-20251107002commit.patch -+++ /dev/null -@@ -1,7931 +0,0 @@ --From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Fri, 7 Nov 2025 12:06:32 +0800 --Subject: [PATCH 6/8] 20251107002commit -- ----- -- .../models/deepseek/modeling_deepseek.py | 122 +- -- patches/0001-20251104commit.patch | 2 +- -- patches/0002-20251106commit.patch | 2 +- -- patches/0003-20261106secondcommit.patch | 2 +- -- patches/0004-20251106change.patch | 2 +- -- patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ -- 6 files changed, 7773 insertions(+), 64 deletions(-) -- create mode 100644 patches/0005-20251107001commit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index 8831e4b7..e7e1c053 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): -- # expert_out = expert(x) -- # expert_cache += expert_out * weight -- # return expert_cache --- --- # @no_grad() --- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --- # # x 的 shape: (1, hidden_size) --- # # flat_expert_indices 的 shape: (num_experts_per_tok,) --- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --- --- # # 1. 收集所有需要的专家层 --- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --- # selected_experts = [self.experts[i] for i in flat_expert_indices] --- --- # # 2. 并行计算所有专家的输出 --- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --- # # ops.cat 会将它们堆叠成一个新的 Tensor --- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --- --- # # 3. 使用矩阵乘法进行加权求和 --- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --- # # 最终结果 final_output 的 shape: (1, hidden_size) --- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+ --+ @no_grad() --+ dwj --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ # x 的 shape: (1, hidden_size) --+ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+ --+ # 1. 收集所有需要的专家层 --+ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+ selected_experts = [self.experts[i] for i in flat_expert_indices] --+ --+ # 2. 并行计算所有专家的输出 --+ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+ # ops.cat 会将它们堆叠成一个新的 Tensor --+ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+ --+ # 3. 使用矩阵乘法进行加权求和 --+ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+ # 最终结果 final_output 的 shape: (1, hidden_size) --+ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) -- --- # return final_output --+ return final_output -- -- -- # @no_grad() --@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): -- -- return expert_cache -- # 放置在 DeepseekMoE 类中 --- @no_grad() --- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --- """ --- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --- --- Args: --- x (Tensor): 输入张量, shape: (1, hidden_size) --- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --- """ --- top_k, _ = flat_expert_weights.shape --- hidden_size = x.shape[-1] --- --- # 1. 将所有专家的权重堆叠起来 --- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --+ # @no_grad() --+ # #lwx 20251107 --+ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ # """ --+ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --+ --+ # Args: --+ # x (Tensor): 输入张量, shape: (1, hidden_size) --+ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --+ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --+ # """ --+ # top_k, _ = flat_expert_weights.shape --+ # hidden_size = x.shape[-1] --+ --+ # # 1. 将所有专家的权重堆叠起来 --+ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --+ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --+ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) -- --- # 2. "收集" 所需的专家权重 --- selected_gate_w = stacked_gate_w[flat_expert_indices] --- selected_up_w = stacked_up_w[flat_expert_indices] --- selected_down_w = stacked_down_w[flat_expert_indices] --+ # # 2. "收集" 所需的专家权重 --+ # selected_gate_w = stacked_gate_w[flat_expert_indices] --+ # selected_up_w = stacked_up_w[flat_expert_indices] --+ # selected_down_w = stacked_down_w[flat_expert_indices] -- --- # 3. 准备输入 --- x_expanded = x.expand((top_k, 1, hidden_size)) --+ # # 3. 准备输入 --+ # x_expanded = x.expand((top_k, 1, hidden_size)) -- --- # 4. 并行计算 gate_proj 和 up_proj --- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --+ # # 4. 并行计算 gate_proj 和 up_proj --+ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --+ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) -- --- # 5. 计算中间状态 --- intermediate_states = self.experts[0].act_fn(gate_out) * up_out --+ # # 5. 计算中间状态 --+ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out -- --- # 6. 并行计算 down_proj --- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --- # --- [FIX] --- --- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --- # --- [FIX END] --- --+ # # 6. 并行计算 down_proj --+ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --+ # # --- [FIX] --- --+ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --+ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --+ # # --- [FIX END] --- -- --- # 7. 根据路由权重进行加权求和 --- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --+ # # 7. 根据路由权重进行加权求和 --+ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) -- --- return weighted_sum --+ # return weighted_sum -- -- -- --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --index 0a0ef2d7..2842180e 100644 ----- a/patches/0001-20251104commit.patch --+++ b/patches/0001-20251104commit.patch --@@ -1,7 +1,7 @@ -- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Tue, 4 Nov 2025 09:11:51 +0800 ---Subject: [PATCH 1/4] 20251104commit --+Subject: [PATCH 1/5] 20251104commit -- -- --- -- mindnlp/transformers/cache_utils.py | 28 +- --diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --index 5185270c..c6cd8757 100644 ----- a/patches/0002-20251106commit.patch --+++ b/patches/0002-20251106commit.patch --@@ -1,7 +1,7 @@ -- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 09:20:38 +0800 ---Subject: [PATCH 2/4] 20251106commit --+Subject: [PATCH 2/5] 20251106commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 379 ++++- --diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --index 3e05f821..601960c9 100644 ----- a/patches/0003-20261106secondcommit.patch --+++ b/patches/0003-20261106secondcommit.patch --@@ -1,7 +1,7 @@ -- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 14:54:37 +0800 ---Subject: [PATCH 3/4] 20261106secondcommit --+Subject: [PATCH 3/5] 20261106secondcommit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 217 ++- --diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --index 88a1aef4..8976f10b 100644 ----- a/patches/0004-20251106change.patch --+++ b/patches/0004-20251106change.patch --@@ -1,7 +1,7 @@ -- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 15:48:09 +0800 ---Subject: [PATCH 4/4] 20251106change --+Subject: [PATCH 4/5] 20251106change -- -- --- -- .../models/deepseek/modeling_deepseek.py | 189 +- --diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --new file mode 100644 --index 00000000..8d9032be ----- /dev/null --+++ b/patches/0005-20251107001commit.patch --@@ -0,0 +1,7707 @@ --+From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Fri, 7 Nov 2025 11:48:18 +0800 --+Subject: [PATCH 5/5] 20251107001commit --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 91 +- --+ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- --+ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- --+ patches/0001-20251104commit.patch | 2 +- --+ patches/0002-20251106commit.patch | 2 +- --+ patches/0003-20261106secondcommit.patch | 2 +- --+ patches/0004-20251106change.patch | 7498 +++++++++++++++++ --+ 7 files changed, 7577 insertions(+), 30 deletions(-) --+ create mode 100644 patches/0004-20251106change.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index 0546f318..8831e4b7 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): --+ # expert_cache += expert_out * weight --+ # return expert_cache --+ --+- @no_grad() --+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+- # x 的 shape: (1, hidden_size) --+- # flat_expert_indices 的 shape: (num_experts_per_tok,) --+- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+- --+- # 1. 收集所有需要的专家层 --+- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+- selected_experts = [self.experts[i] for i in flat_expert_indices] --+- --+- # 2. 并行计算所有专家的输出 --+- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+- # ops.cat 会将它们堆叠成一个新的 Tensor --+- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+- --+- # 3. 使用矩阵乘法进行加权求和 --+- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+- # 最终结果 final_output 的 shape: (1, hidden_size) --+- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++ # @no_grad() --++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ # # x 的 shape: (1, hidden_size) --++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) --++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++ --++ # # 1. 收集所有需要的专家层 --++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++ # selected_experts = [self.experts[i] for i in flat_expert_indices] --++ --++ # # 2. 并行计算所有专家的输出 --++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++ # # ops.cat 会将它们堆叠成一个新的 Tensor --++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++ --++ # # 3. 使用矩阵乘法进行加权求和 --++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ # # 最终结果 final_output 的 shape: (1, hidden_size) --++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+ --+- return final_output --++ # return final_output --+ --+ --+ # @no_grad() --+@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): --+ ) --+ --+ return expert_cache --++# 放置在 DeepseekMoE 类中 --++ @no_grad() --++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ """ --++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --++ --++ Args: --++ x (Tensor): 输入张量, shape: (1, hidden_size) --++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --++ """ --++ top_k, _ = flat_expert_weights.shape --++ hidden_size = x.shape[-1] --++ --++ # 1. 将所有专家的权重堆叠起来 --++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --++ --++ # 2. "收集" 所需的专家权重 --++ selected_gate_w = stacked_gate_w[flat_expert_indices] --++ selected_up_w = stacked_up_w[flat_expert_indices] --++ selected_down_w = stacked_down_w[flat_expert_indices] --++ --++ # 3. 准备输入 --++ x_expanded = x.expand((top_k, 1, hidden_size)) --++ --++ # 4. 并行计算 gate_proj 和 up_proj --++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --++ --++ # 5. 计算中间状态 --++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out --++ --++ # 6. 并行计算 down_proj --++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --++ # --- [FIX] --- --++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --++ # --- [FIX END] --- --++ --++ # 7. 根据路由权重进行加权求和 --++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --++ --++ return weighted_sum --++ --++ --+ --+ # @no_grad() --+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+index ebd7782e..913a7609 100644 --+--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): --+ # Copied from transformers.models.llama.modeling_llama.rotate_half --+ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+- x1 = x[..., : x.shape[-1] // 2] --+- x2 = x[..., x.shape[-1] // 2 :] --++ # x1 = x[..., : x.shape[-1] // 2] --++ # x2 = x[..., x.shape[-1] // 2 :] --+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+index d059dcbe..2b217b64 100644 --+--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): --+ # Copied from transformers.models.llama.modeling_llama.rotate_half --+ def rotate_half(x): --+ """Rotates half the hidden dims of the input.""" --+- x1 = x[..., : x.shape[-1] // 2] --+- x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++ # x1 = x[..., : x.shape[-1] // 2] --++ # x2 = x[..., x.shape[-1] // 2 :] --++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+ return ops.cat((-x2, x1), dim=-1) --+ --+ --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+index 78f22642..0a0ef2d7 100644 --+--- a/patches/0001-20251104commit.patch --++++ b/patches/0001-20251104commit.patch --+@@ -1,7 +1,7 @@ --+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Tue, 4 Nov 2025 09:11:51 +0800 --+-Subject: [PATCH 1/3] 20251104commit --++Subject: [PATCH 1/4] 20251104commit --+ --+ --- --+ mindnlp/transformers/cache_utils.py | 28 +- --+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+index 22b65dd5..5185270c 100644 --+--- a/patches/0002-20251106commit.patch --++++ b/patches/0002-20251106commit.patch --+@@ -1,7 +1,7 @@ --+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 09:20:38 +0800 --+-Subject: [PATCH 2/3] 20251106commit --++Subject: [PATCH 2/4] 20251106commit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+index 966529e4..3e05f821 100644 --+--- a/patches/0003-20261106secondcommit.patch --++++ b/patches/0003-20261106secondcommit.patch --+@@ -1,7 +1,7 @@ --+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 14:54:37 +0800 --+-Subject: [PATCH 3/3] 20261106secondcommit --++Subject: [PATCH 3/4] 20261106secondcommit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 217 ++- --+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --+new file mode 100644 --+index 00000000..88a1aef4 --+--- /dev/null --++++ b/patches/0004-20251106change.patch --+@@ -0,0 +1,7498 @@ --++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Thu, 6 Nov 2025 15:48:09 +0800 --++Subject: [PATCH 4/4] 20251106change --++ --++--- --++ .../models/deepseek/modeling_deepseek.py | 189 +- --++ patches/0001-20251104commit.patch | 1272 +++++++ --++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ --++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ --++ 4 files changed, 7244 insertions(+), 186 deletions(-) --++ create mode 100644 patches/0001-20251104commit.patch --++ create mode 100644 patches/0002-20251106commit.patch --++ create mode 100644 patches/0003-20261106secondcommit.patch --++ --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index 2f9192bf..0546f318 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): --++ --++ return attn_output, attn_weights, past_key_value --++ --++-# class DeepseekFlashAttention(nn.Module): --++-# """ --++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --++- --++-# This class is designed as a drop-in replacement for DeepseekAttention. --++-# """ --++- --++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++-# super().__init__() --++-# self.config = config --++-# self.layer_idx = layer_idx --++-# if layer_idx is None: --++-# logger.warning( --++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++-# "when creating this class." --++-# ) --++- --++-# self.attention_dropout = config.attention_dropout --++-# self.hidden_size = config.hidden_size --++-# self.num_heads = config.num_attention_heads --++-# self.head_dim = self.hidden_size // self.num_heads --++-# self.num_key_value_heads = config.num_key_value_heads --++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++-# self.max_position_embeddings = config.max_position_embeddings --++-# self.rope_theta = config.rope_theta --++-# self.is_causal = True --++- --++-# if (self.head_dim * self.num_heads) != self.hidden_size: --++-# raise ValueError( --++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++-# f" and `num_heads`: {self.num_heads})." --++-# ) --++- --++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++-# self._init_rope() --++- --++-# def _init_rope(self): --++-# if self.config.rope_scaling is None: --++-# self.rotary_emb = DeepseekRotaryEmbedding( --++-# self.head_dim, --++-# max_position_embeddings=self.max_position_embeddings, --++-# base=self.rope_theta, --++-# ) --++-# else: --++-# scaling_type = self.config.rope_scaling["type"] --++-# scaling_factor = self.config.rope_scaling["factor"] --++-# if scaling_type == "linear": --++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++-# self.head_dim, --++-# max_position_embeddings=self.max_position_embeddings, --++-# scaling_factor=scaling_factor, --++-# base=self.rope_theta, --++-# ) --++-# elif scaling_type == "dynamic": --++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++-# self.head_dim, --++-# max_position_embeddings=self.max_position_embeddings, --++-# scaling_factor=scaling_factor, --++-# base=self.rope_theta, --++-# ) --++-# else: --++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++- --++-# def forward( --++-# self, --++-# hidden_states: mindspore.Tensor, --++-# attention_mask: Optional[mindspore.Tensor] = None, --++-# position_ids: Optional[mindspore.Tensor] = None, --++-# past_key_value: Optional[Cache] = None, --++-# output_attentions: bool = False, --++-# use_cache: bool = False, --++-# **kwargs, --++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++-# if "padding_mask" in kwargs: --++-# warnings.warn( --++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++-# ) --++- --++-# if output_attentions: --++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --++- --++-# bsz, q_len, _ = hidden_states.shape --++- --++-# if self.config.pretraining_tp > 1: --++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++- --++-# query_states = self.q_proj(hidden_states) --++-# key_states = self.k_proj(hidden_states) --++-# value_states = self.v_proj(hidden_states) --++- --++-# # Reshape for multi-head attention --++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++- --++-# kv_seq_len = key_states.shape[-2] --++-# if past_key_value is not None: --++-# if self.layer_idx is None: --++-# raise ValueError( --++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++-# "with a layer index." --++-# ) --++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++- --++-# # Apply Rotary Positional Embedding --++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++- --++-# if past_key_value is not None: --++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++- --++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++- --++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++- --++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++- --++-# # Convert attention_mask for flash_attention_score --++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --++-# if attention_mask is not None: --++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++-# raise ValueError( --++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++-# ) --++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --++-# else: --++-# attn_mask_for_fa = None --++- --++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++- --++-# # Call the fused flash_attention_score operator --++-# attn_output = mindspore.ops.flash_attention_score( --++-# query=query_states_for_fa, --++-# key=key_states_for_fa, --++-# value=value_states_for_fa, --++-# head_num=self.num_heads, # This is N1, the number of query heads --++-# input_layout='BSH', --++-# attn_mask=attn_mask_for_fa, --++-# keep_prob=keep_prob, --++-# scalar_value=1.0 / math.sqrt(self.head_dim), --++-# sparse_mode=0 # Default mask mode --++-# ) --++- --++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --++-# attn_output = self.o_proj(attn_output) --++- --++-# # Flash Attention does not return attention weights --++-# attn_weights = None --++- --++-# return attn_output, attn_weights, past_key_value --++ --++ class DeepseekFlashAttention(nn.Module): --++ """ --++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): --++ super().__init__() --++ self.hidden_size = config.hidden_size --++ --++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --++- config=config, layer_idx=layer_idx --++- ) --+++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --+++ # config=config, layer_idx=layer_idx --+++ # ) --++ --++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --++ config=config, layer_idx=layer_idx --++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): --++ return outputs --++ --++ --++- --++ class DeepseekPreTrainedModel(PreTrainedModel): --++ config_class = DeepseekConfig --++ base_model_prefix = "model" --++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++ # Initialize weights and apply final processing --++ self.post_init() --++ self.warm_up = False --++- #@dwj --++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --++- self.num_layers, --++- self.num_attention_heads, --++- self.head_dim, --++- batch_size=1, --++- max_length=self.max_length, --++- dtype=mindspore.float16 --++- ) --++- --++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --++- key_cache = [] --++- value_cache = [] --++- for _ in range(num_layers): --++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++- key_cache.append(k) --++- value_cache.append(v) --++- return key_cache, value_cache --++- --++ --++ def warmup_moe_model_deep(self): --++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++new file mode 100644 --++index 00000000..78f22642 --++--- /dev/null --+++++ b/patches/0001-20251104commit.patch --++@@ -0,0 +1,1272 @@ --+++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++Subject: [PATCH 1/3] 20251104commit --+++ --+++--- --+++ mindnlp/transformers/cache_utils.py | 28 +- --+++ .../models/deepseek/modeling_deepseek.py | 149 ++- --+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++ 3 files changed, 976 insertions(+), 87 deletions(-) --+++ --+++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++index cadd2e04..02f8d4be 100644 --+++--- a/mindnlp/transformers/cache_utils.py --++++++ b/mindnlp/transformers/cache_utils.py --+++@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++ # k_out[:, :, cache_position] = key_states --+++ # v_out[:, :, cache_position] = value_states --+++- if ON_ORANGE_PI: --+++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++- else: --+++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++- --++++ # if ON_ORANGE_PI: --++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++ # else: --++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++ # 确保 cache_position 是 1D tensor 并且类型正确 --++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++++ if cache_position.ndim > 1: --++++ cache_position = cache_position.flatten() --++++ # 确保类型是 int32 或 int64(MindSpore 要求) --++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++++ cache_position = cache_position.int() --++++ --++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++++ k_out[:, :, cache_position] = key_states --++++ v_out[:, :, cache_position] = value_states --++++ --+++ return k_out, v_out --+++ --+++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index c695b944..d8303e45 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++- x1 = x[..., : x.shape[-1] // 2] --+++- x2 = x[..., x.shape[-1] // 2 :] --++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++ # x1 = x[..., : x.shape[-1] // 2] --++++ # x2 = x[..., x.shape[-1] // 2 :] --++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++ if self.training: --+++ raise NotImplementedError("Training is not supported yet.") --+++ else: --+++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++- if self.config.n_shared_experts is not None: --+++- y = y + self.shared_experts(identity) --+++- return y --++++ # @lwx --++++ if orig_shape[1] == 1: --++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++++ y=y.view(*orig_shape) --++++ if self.config.n_shared_experts is not None: --++++ y = y + self.shared_experts(identity) --++++ return y --++++ else: --++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++++ if self.config.n_shared_experts is not None: --++++ y = y + self.shared_experts(identity) --++++ return y --++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++ # if self.config.n_shared_experts is not None: --++++ # y = y + self.shared_experts(identity) --++++ # return y --++++ --++++ @no_grad() --++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ --++++ expert_cache = ops.zeros_like(x) --++++ for i in range(self.num_experts_per_tok): --++++ expert_id = flat_expert_indices[i].item() --++++ weight = flat_expert_weights[i].item() --++++ expert = self.experts[expert_id] --++++ expert_out = expert(x) --++++ expert_cache += expert_out * weight --++++ return expert_cache --+++ --+++ @no_grad() --+++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++- # expert_cache = torch.zeros_like(x) --+++- # idxs = flat_expert_indices.argsort() --+++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++- # token_idxs = idxs // self.num_experts_per_tok --+++- # for i, end_idx in enumerate(tokens_per_expert): --+++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++- # if start_idx == end_idx: --+++- # continue --+++- # expert = self.experts[i] --+++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++- # expert_tokens = x[exp_token_idx] --+++- # expert_out = expert(expert_tokens) --+++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++- # return expert_cache --++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++ expert_cache = ops.zeros_like(x) --+++ idxs = flat_expert_indices.argsort() --+++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++ token_idxs = idxs // self.num_experts_per_tok --++++ --+++ for i, end_idx in enumerate(tokens_per_expert): --+++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++ if start_idx == end_idx: --+++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++ expert_out = expert(expert_tokens) --+++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --+++ return expert_cache --++++ --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # # expert_cache = torch.zeros_like(x) --++++ # # idxs = flat_expert_indices.argsort() --++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++ # # token_idxs = idxs // self.num_experts_per_tok --++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++ # # if start_idx == end_idx: --++++ # # continue --++++ # # expert = self.experts[i] --++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # # expert_tokens = x[exp_token_idx] --++++ # # expert_out = expert(expert_tokens) --++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++ # # return expert_cache --++++ # expert_cache = ops.zeros_like(x) --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # for i, end_idx in enumerate(tokens_per_expert): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # if start_idx == end_idx: --++++ # continue --++++ # expert = self.experts[i] --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = expert(expert_tokens) --++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --++++ # return expert_cache --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # expert_cache = ops.zeros_like(x) --++++ --++++ # # 排序保证顺序一致 --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # # 找出有 token 的专家 --++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++ --++++ # for i in active_experts.tolist(): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # end_idx = tokens_per_expert[i] --++++ # if start_idx == end_idx: # 没有 token --++++ # continue --++++ --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = self.experts[i](expert_tokens) --++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++ --++++ # expert_cache = mindspore.mint.scatter_add( --++++ # expert_cache, --++++ # 0, --++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++ # expert_out --++++ # ) --++++ --++++ # return expert_cache --++++ --++++ --+++ --+++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++ # """ --+++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ --+++ # Initialize weights and apply final processing --+++ self.post_init() --++++ self.warm_up = False --++++ --++++ def warmup_moe_model_deep(self): --++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++ test_texts = [ --++++ "warmup short", --++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++++ ] --++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++ if tokenizer is None: --++++ from mindnlp.transformers import AutoTokenizer --++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++ self._warmup_tokenizer = tokenizer --++++ --++++ for text in test_texts: --++++ inputs = tokenizer(text, return_tensors="ms") --++++ with mindspore._no_grad(): --++++ _ = self(**inputs, use_cache=False) --++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++ --+++ def get_input_embeddings(self): --+++ return self.model.embed_tokens --+++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++ ```""" --++++ if not self.warm_up: --++++ self.warm_up = True --++++ self.warmup_moe_model_deep() --++++ --+++ output_attentions = ( --+++ output_attentions --+++ if output_attentions is not None --+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++index 3cbf820e..d4c6b651 100644 --+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++@@ -18,7 +18,6 @@ --+++ # See the License for the specific language governing permissions and --+++ # limitations under the License. --+++ """MindSpore Qwen2MoE model.""" --+++- --+++ import math --+++ from typing import List, Optional, Tuple, Union --+++ --+++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++ TokenClassifierOutput, --+++ ) --+++ from ...modeling_utils import PreTrainedModel --++++from ...generation import GenerationMixin --+++ from ....utils import logging --+++ from .configuration_qwen2_moe import Qwen2MoeConfig --+++ --+++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++ self.variance_epsilon = eps --+++ --+++ def forward(self, hidden_states): --++++ # @dwj --++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++ # @lwx --++++ # if not self.training : --++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++ input_dtype = hidden_states.dtype --+++ hidden_states = hidden_states.to(mindspore.float32) --+++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++@@ -234,6 +239,8 @@ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++ x1 = x[..., : x.shape[-1] // 2] --+++ x2 = x[..., x.shape[-1] // 2 :] --++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++ self.config = config --+++ self.hidden_size = config.hidden_size --+++ self.intermediate_size = intermediate_size --++++ --+++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++ self.act_fn = ACT2FN[config.hidden_act] --+++ --+++ def forward(self, x): --+++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++- --+++ --++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++ # @lwx --++++ # gate_up_output = self.gate_up_proj(x) --++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++++ # return self.down_proj(swiglu_output) --++++ --++++ # def forward(self, x): --++++ # gate_proj_out = self.gate_proj(x) --++++ # up_proj_out = self.up_proj(x) --++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++++ # return self.down_proj(swiglu_out) --++++ --+++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++ """ --+++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++ use_cache: bool = False, --+++ cache_position: Optional[mindspore.Tensor] = None, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ --++++ --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ query_states = self.q_proj(hidden_states) --+++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++ "with a layer index." --+++ ) --+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ if isinstance(past_key_value, StaticCache): --++++ kv_seq_len = key_states.shape[-2] --++++ else: --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++ if isinstance(past_key_value, StaticCache): --++++ kv_seq_len = key_states.shape[-2] --+++ --+++ # repeat k/v heads if n_kv_heads < n_heads --+++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++- --++++ --+++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++ --+++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++- raise ValueError( --+++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++- f" {attn_weights.shape}" --+++- ) --+++- --+++- if attention_mask is not None: # no matter the length, we just slice it --+++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++++ if attention_mask is not None: --++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++ attn_weights = attn_weights + causal_mask --+++ --+++ # upcast attention to fp32 --+++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++ --+++ attn_output = self.o_proj(attn_output) --+++- --++++ # @lwx --++++ --++++ # max_seq_len = self.max_position_embeddings # 2048 --++++ --++++ # if attention_mask is not None: --++++ # # attention_mask: [B, 1, Sq, Sk] --++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++ --++++ # # pad 到 [max_seq_len, max_seq_len] --++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++ # global_attention_mask = padded_mask --++++ # else: --++++ # global_attention_mask = None --++++ --++++ --++++ # sparse_mode=3 --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, --++++ # key=key_states, --++++ # value=value_states, --++++ # real_shift=None, --++++ # padding_mask=None, --++++ --++++ # head_num=self.num_heads, --++++ # attn_mask=global_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++ # input_layout="BNSD", --++++ # pre_tokens=2147483647, --++++ # next_tokens=2147483647, --++++ # inner_precise=0, --++++ # drop_mask=None, --++++ # prefix=None, --++++ # actual_seq_qlen=None, --++++ # actual_seq_kvlen=None, --++++ # sparse_mode=sparse_mode, --++++ # ) --+++ if not output_attentions: --+++ attn_weights = None --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++ --++++class Qwen2MoeFlashAttention(nn.Module): --++++ """ --++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++ --++++ 关键改动: --++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++ 直接传入原始的 key 和 value 张量效率更高。 --++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++ """ --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++ super().__init__() --++++ self.config = config --++++ self.layer_idx = layer_idx --++++ self.hidden_size = config.hidden_size --++++ self.num_heads = config.num_attention_heads --++++ self.head_dim = self.hidden_size // self.num_heads --++++ self.num_key_value_heads = config.num_key_value_heads --++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++ self.max_position_embeddings = config.max_position_embeddings --++++ self.rope_theta = config.rope_theta --++++ self.attention_dropout = config.attention_dropout --++++ --++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++ raise ValueError( --++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++ ) --++++ --++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++ --++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++ self.head_dim, --++++ max_position_embeddings=self.max_position_embeddings, --++++ base=self.rope_theta, --++++ ) --++++ --++++ def forward( --++++ self, --++++ hidden_states: mindspore.Tensor, --++++ attention_mask: Optional[mindspore.Tensor] = None, --++++ position_ids: Optional[mindspore.Tensor] = None, --++++ past_key_value: Optional[Cache] = None, --++++ output_attentions: bool = False, --++++ use_cache: bool = False, --++++ cache_position: Optional[mindspore.Tensor] = None, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ # 1. 线性投射 Q, K, V --++++ query_states = self.q_proj(hidden_states) --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++ # query: [B, S, H*D] -> [B, N1, S, D] --++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # 3. RoPE 旋转位置编码 --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++ if self.layer_idx is None: --++++ raise ValueError( --++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ "with a layer index." --++++ ) --++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++ if cache_position.shape[0] == 1: --++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++ kv_seq_len = past_seen_tokens + 1 --++++ else: --++++ # prefill 阶段:cache_position 是范围,使用其长度 --++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++ else: --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # 4. KV 缓存更新 --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ key_states, value_states = past_key_value.update( --++++ key_states, value_states, self.layer_idx, cache_kwargs --++++ ) --++++ --++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++ if cache_position.shape[0] == 1: --++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++ kv_seq_len = key_states.shape[-2] --++++ --++++ # 5. [重要] 准备 Attention Mask --++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++ fa_attention_mask = None --++++ if attention_mask is not None: --++++ # 截取与当前key长度匹配的部分 --++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++++ fa_attention_mask = (mask_slice != 0) --++++ --++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++ input_dtype = query_states.dtype --++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++ query_states = query_states.to(mindspore.float16) --++++ key_states = key_states.to(mindspore.float16) --++++ value_states = value_states.to(mindspore.float16) --++++ --++++ # 6. [核心] 调用 flash_attention_score 算子 --++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++ attn_output = mindspore.ops.flash_attention_score( --++++ query=query_states, --++++ key=key_states, --++++ value=value_states, --++++ head_num=self.num_heads, # 传入Q的头数(N1) --++++ attn_mask=fa_attention_mask, --++++ keep_prob=1.0 - self.attention_dropout, --++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++ input_layout="BNSD", --++++ sparse_mode=0 # 使用 defaultMask 模式 --++++ ) --++++ --++++ # 恢复原始数据类型 --++++ attn_output = attn_output.to(input_dtype) --++++ --++++ # 7. 调整输出形状 --++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ attn_output = self.o_proj(attn_output) --++++ --++++ # FlashAttention 算子不直接返回注意力权重矩阵 --++++ attn_weights = None --++++ if output_attentions: --++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++ # def forward( --++++ # self, --++++ # hidden_states: mindspore.Tensor, --++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++ # position_ids: Optional[mindspore.Tensor] = None, --++++ # past_key_value: Optional[Cache] = None, --++++ # output_attentions: bool = False, --++++ # use_cache: bool = False, --++++ # cache_position: Optional[mindspore.Tensor] = None, --++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ # bsz, q_len, _ = hidden_states.shape --++++ --++++ # # 1. 线性投射 Q, K, V --++++ # query_states = self.q_proj(hidden_states) --++++ # key_states = self.k_proj(hidden_states) --++++ # value_states = self.v_proj(hidden_states) --++++ --++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # # 3. RoPE 旋转位置编码 --++++ # kv_seq_len = key_states.shape[-2] --++++ # if past_key_value is not None: --++++ # if self.layer_idx is None: --++++ # raise ValueError( --++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ # "with a layer index." --++++ # ) --++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # # 4. KV 缓存更新 --++++ # if past_key_value is not None: --++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ # key_states, value_states = past_key_value.update( --++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++ # ) --++++ --++++ # # 5. 准备 Attention Mask --++++ # fa_attention_mask = None --++++ # if attention_mask is not None: --++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # fa_attention_mask = (mask_slice != 0) --++++ --++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++ # input_dtype = query_states.dtype --++++ --++++ # # 6. [核心] 调用 flash_attention_score 算子 --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, --++++ # key=key_states, --++++ # value=value_states, --++++ # head_num=self.num_heads, --++++ # attn_mask=fa_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++ # input_layout="BNSD", --++++ # sparse_mode=0, --++++ # # <--- 修改点 2: 启用内部高精度计算 --- --++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++ # inner_precise=1 --++++ # ) --++++ --++++ # # 恢复原始数据类型 --++++ # attn_output = attn_output.to(input_dtype) --++++ --++++ # # 7. 调整输出形状 --++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ # attn_output = self.o_proj(attn_output) --++++ --++++ # attn_weights = None --++++ # if output_attentions: --++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++ # return attn_output, attn_weights, past_key_value --++++ --++++ # def forward( --++++ # self, --++++ # hidden_states: mindspore.Tensor, --++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++ # position_ids: Optional[mindspore.Tensor] = None, --++++ # past_key_value: Optional[Cache] = None, --++++ # output_attentions: bool = False, --++++ # use_cache: bool = False, --++++ # cache_position: Optional[mindspore.Tensor] = None, --++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ # bsz, q_len, _ = hidden_states.shape --++++ --++++ # query_states = self.q_proj(hidden_states) --++++ # key_states = self.k_proj(hidden_states) --++++ # value_states = self.v_proj(hidden_states) --++++ --++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ # kv_seq_len = key_states.shape[-2] --++++ # if past_key_value is not None: --++++ # if self.layer_idx is None: --++++ # raise ValueError("`layer_idx` must be specified for caching") --++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ # if past_key_value is not None: --++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ # key_states, value_states = past_key_value.update( --++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++ # ) --++++ --++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++ --++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++ # query_states = query_states / math.sqrt(self.head_dim) --++++ # # <--- 修改结束 --- --++++ --++++ # fa_attention_mask = None --++++ # if attention_mask is not None: --++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ # fa_attention_mask = (mask_slice != 0) --++++ --++++ # input_dtype = query_states.dtype --++++ --++++ # attn_output = mindspore.ops.flash_attention_score( --++++ # query=query_states, # 传入已经预先缩放过的 query --++++ # key=key_states, --++++ # value=value_states, --++++ # head_num=self.num_heads, --++++ # attn_mask=fa_attention_mask, --++++ # keep_prob=1.0 - self.attention_dropout, --++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++ # input_layout="BNSD", --++++ # sparse_mode=0, --++++ # inner_precise=1 # 仍然保持内部高精度计算 --++++ # ) --++++ --++++ # attn_output = attn_output.to(input_dtype) --++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ # attn_output = self.o_proj(attn_output) --++++ --++++ # attn_weights = None --++++ # if output_attentions: --++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++ --++++ # return attn_output, attn_weights, past_key_value --++++ --+++ QWEN2MOE_ATTENTION_CLASSES = { --+++ "eager": Qwen2MoeAttention, --++++ "flash-attention": Qwen2MoeFlashAttention, --+++ } --+++ --+++ --+++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --++++ #@dwj --++++ # 只遍历激活的专家,而非全部专家 --+++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- hidden_states = hidden_states.view(-1, hidden_dim) --+++- # router_logits: (batch * sequence_length, n_experts) --+++- router_logits = self.gate(hidden_states) --+++- --+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- if self.norm_topk_prob: --+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- # we cast back to the input dtype --+++- routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++- final_hidden_states = ops.zeros( --+++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++- ) --+++- --+++- # One hot encode the selected experts to create an expert mask --+++- # this will be used to easily index which expert is going to be sollicitated --+++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++- --+++- # Loop over all available experts in the model and perform the computation on each expert --+++- for expert_idx in range(self.num_experts): --+++- expert_layer = self.experts[expert_idx] --+++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++- --+++- # Index the correct hidden states and compute the expert hidden state for --+++- # the current expert. We need to make sure to multiply the output hidden --+++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++- if 0 not in idx.shape: --+++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++- --+++- # However `index_add_` only support torch tensors for indexing so we'll use --+++- # the `top_x` tensor here. --+++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++- --+++- shared_expert_output = self.shared_expert(hidden_states) --+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++- --+++- final_hidden_states = final_hidden_states + shared_expert_output --++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++ num_tokens = hidden_states_reshaped.shape[0] --++++ --++++ router_logits = self.gate(hidden_states_reshaped) --++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++ if self.norm_topk_prob: --++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++ flat_selected_experts = selected_experts.flatten() --++++ --++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++ token_indices = broadcasted_token_indices.flatten() --++++ --++++ active_experts = ops.unique(flat_selected_experts) --++++ --++++ for expert_idx_tensor in active_experts: --++++ expert_idx = expert_idx_tensor.item() --++++ expert_layer = self.experts[expert_idx] --++++ --++++ mask = (flat_selected_experts == expert_idx_tensor) --++++ selected_token_indices = token_indices[mask] --++++ selected_routing_weights = routing_weights.flatten()[mask] --++++ --++++ current_states = hidden_states_reshaped[selected_token_indices] --++++ --++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++ --++++ final_hidden_states = final_hidden_states.index_add( --++++ dim=0, --++++ index=selected_token_indices, --++++ source=expert_output.to(hidden_states.dtype) --++++ ) --++++ --++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++ --+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++- return final_hidden_states, router_logits --++++ final_hidden_states = final_hidden_states + shared_expert_output --++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++ --++++ return final_hidden_states, router_logits --+++ --+++ --+++ class Qwen2MoeDecoderLayer(nn.Module): --+++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++ --+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++ --+++ if (layer_idx not in config.mlp_only_layers) and ( --+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++ ): --+++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++ _skip_keys_device_placement = "past_key_values" --+++ _supports_cache_class = True --++++#lwx --++++ # _supports_static_cache = True --+++ --+++ def _init_weights(self, module): --+++ std = self.config.initializer_range --+++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++ return causal_mask --+++ --+++ --+++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ _tied_weights_keys = ["lm_head.weight"] --+++ --+++ def __init__(self, config): --+++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ self.num_experts_per_tok = config.num_experts_per_tok --+++ # Initialize weights and apply final processing --+++ self.post_init() --++++ # @lwx --++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++++ # self.generation_config.cache_implementation = "static" --++++ self._warmed_up = False --++++ --++++ def warmup_moe_model(self): --++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++++ test_texts = [ --++++ "warmup short", --++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++++ ] --++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++ if tokenizer is None: --++++ from mindnlp.transformers import AutoTokenizer --++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++ self._warmup_tokenizer = tokenizer --++++ --++++ for text in test_texts: --++++ inputs = tokenizer(text, return_tensors="ms") --++++ with mindspore._no_grad(): --++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++ --+++ def get_input_embeddings(self): --+++ return self.model.embed_tokens --+++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++ ```""" --++++ if not self._warmed_up: --++++ self._warmed_up = True --++++ self.warmup_moe_model() --+++ --+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++ output_router_logits = ( --+++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++ } --+++ ) --+++ return model_inputs --++++# @lwx --++++ # def _decode_one_tokens_logits( --++++ # self, --++++ # cur_token: mindspore.Tensor, --++++ # input_pos: Optional[mindspore.Tensor], --++++ # cache_position: mindspore.Tensor, --++++ # past_key_values: StaticCache, --++++ # ) -> mindspore.Tensor: --++++ # """ --++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++++ --++++ # Args: --++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++++ # input_pos: 输入位置信息,可选 --++++ # cache_position: 当前token在cache中的位置,shape为(1,) --++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++++ --++++ # Returns: --++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++++ # """ --++++ # # 调用JIT编译的版本 --++++ # return self.get_decode_one_tokens_logits( --++++ # cur_token=cur_token, --++++ # input_pos=input_pos, --++++ # cache_position=cache_position, --++++ # past_key_values=past_key_values, --++++ # ) --++++ --++++ # @mindspore.jit(jit_level='O1') --++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++++ # """ --++++ # JIT编译的函数,用于高效的单token解码 --++++ # 使用JIT编译优化以支持静态shape和高效执行 --++++ --++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++++ # """ --++++ # outputs = self.model.forward( --++++ # input_ids=cur_token, --++++ # position_ids=input_pos, --++++ # cache_position=cache_position, --++++ # past_key_values=past_key_values, --++++ # use_cache=True, --++++ # return_dict=False, --++++ # ) --++++ --++++ # hidden_states = outputs[0] --++++ # logits = self.lm_head.forward(hidden_states) --++++ # logits = logits.float() --++++ --++++ # return logits[:, -1, :] --++++ --++++ # def _sample( --++++ # self, --++++ # input_ids: mindspore.Tensor, --++++ # logits_processor, --++++ # stopping_criteria, --++++ # generation_config, --++++ # synced_devices: bool, --++++ # streamer=None, --++++ # logits_warper=None, --++++ # **model_kwargs, --++++ # ): --++++ # """ --++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++++ # """ --++++ # from ...generation.logits_process import LogitsProcessorList --++++ # from ...generation.stopping_criteria import StoppingCriteriaList --++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++++ # from mindnlp.core import nn, ops, no_grad --++++ # import numpy as np --++++ --++++ # # 检查是否使用 StaticCache --++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++++ # # 否则,直接调用父类方法 --++++ # past_key_values = model_kwargs.get("past_key_values") --++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++++ --++++ # if not isinstance(past_key_values, StaticCache): --++++ # # 不使用 StaticCache,直接调用父类方法 --++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++++ # return super()._sample( --++++ # input_ids=input_ids, --++++ # logits_processor=logits_processor, --++++ # stopping_criteria=stopping_criteria, --++++ # generation_config=generation_config, --++++ # synced_devices=synced_devices, --++++ # streamer=streamer, --++++ # logits_warper=logits_warper, --++++ # **model_kwargs, --++++ # ) --++++ --++++ # # 使用 StaticCache,进入自定义循环 --++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++++ # pad_token_id = generation_config._pad_token_tensor --++++ # output_attentions = generation_config.output_attentions --++++ # output_hidden_states = generation_config.output_hidden_states --++++ # output_scores = generation_config.output_scores --++++ # output_logits = generation_config.output_logits --++++ # return_dict_in_generate = generation_config.return_dict_in_generate --++++ # max_length = generation_config.max_length --++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++++ # do_sample = generation_config.do_sample --++++ --++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++++ # raise ValueError( --++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++++ # f"{logits_warper})." --++++ # ) --++++ --++++ # # init attention / hidden states / scores tuples --++++ # scores = () if (return_dict_in_generate and output_scores) else None --++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++++ --++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++++ # encoder_hidden_states = ( --++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++++ # ) --++++ --++++ # # keep track of which sequences are already finished --++++ # batch_size, cur_len = input_ids.shape --++++ # this_peer_finished = False --++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++++ --++++ # time_record = [] --++++ # from ....utils.testing_utils import parse_flag_from_env --++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++++ --++++ # while self._has_unfinished_sequences( --++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++++ # ): --++++ # if _record_time: --++++ # import time as time_module --++++ # infer_start = time_module.time() --++++ --++++ # # prepare model inputs --++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++++ --++++ # # prepare variable output controls --++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++++ --++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++++ # cur_cache_position = model_inputs.get("cache_position") --++++ # cur_past_key_values = model_inputs.get("past_key_values") --++++ # cur_input_ids = model_inputs.get("input_ids") --++++ --++++ # if (isinstance(cur_past_key_values, StaticCache) and --++++ # cur_cache_position is not None and --++++ # len(cur_cache_position.shape) > 0 and --++++ # cur_cache_position.shape[0] == 1 and --++++ # cur_input_ids is not None and --++++ # cur_input_ids.shape[1] == 1): --++++ # # 使用 JIT 优化的单 token 解码 --++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++++ # if not hasattr(self, '_jit_used'): --++++ # self._jit_used = False --++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++++ --++++ # next_token_logits = self.get_decode_one_tokens_logits( --++++ # cur_token=cur_input_ids, --++++ # input_pos=model_inputs.get("position_ids"), --++++ # cache_position=cur_cache_position, --++++ # past_key_values=cur_past_key_values, --++++ # ) --++++ --++++ # # 标记已使用JIT(用于后续判断) --++++ # if not self._jit_used: --++++ # self._jit_used = True --++++ --++++ # # 构造兼容的输出对象 --++++ # class JitOptimizedOutput: --++++ # def __init__(self, logits, config): --++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++++ # self.config = config --++++ # # 对于 JIT 优化路径,这些属性通常不需要 --++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++++ # self.attentions = None if not config.is_encoder_decoder else None --++++ # self.cross_attentions = None --++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++++ # self.hidden_states = None if not config.is_encoder_decoder else None --++++ --++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++++ # else: --++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++++ # outputs = self(**model_inputs, return_dict=True) --++++ --++++ # if synced_devices and this_peer_finished: --++++ # continue --++++ --++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++++ # next_token_logits = outputs.logits[:, -1, :] --++++ --++++ # # pre-process distribution --++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++++ # if do_sample: --++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++++ --++++ # # Store scores, attentions and hidden_states when required --++++ # if return_dict_in_generate: --++++ # if output_scores: --++++ # scores += (next_token_scores,) --++++ # if output_logits: --++++ # raw_logits += (next_token_logits,) --++++ # if output_attentions: --++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++++ # decoder_attentions += (attn,) if attn is not None else (None,) --++++ # if self.config.is_encoder_decoder: --++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++++ --++++ # if output_hidden_states: --++++ # hidden = ( --++++ # outputs.decoder_hidden_states --++++ # if self.config.is_encoder_decoder --++++ # else outputs.hidden_states --++++ # ) --++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++++ --++++ # # token selection --++++ # if do_sample: --++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++++ # else: --++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++++ --++++ # # finished sentences should have their next token be a padding token --++++ # if has_eos_stopping_criteria: --++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++++ --++++ # # update generated ids, model inputs, and length for next step --++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++++ # if streamer is not None: --++++ # streamer.put(next_tokens) --++++ --++++ # model_kwargs = self._update_model_kwargs_for_generation( --++++ # outputs, --++++ # model_kwargs, --++++ # is_encoder_decoder=self.config.is_encoder_decoder, --++++ # ) --++++ --++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++++ # cur_len += 1 --++++ --++++ # if _record_time: --++++ # import time as time_module --++++ # infer_stop = time_module.time() --++++ # time_record.append(infer_stop - infer_start) --++++ --++++ # del outputs --++++ --++++ # average_infer_time = None --++++ # if time_record: --++++ # if len(time_record) > 1: --++++ # time_record.pop(0) --++++ # average_infer_time = sum(time_record) / len(time_record) --++++ # print(f'average inference time is: {average_infer_time}') --++++ # print(f'inference time record: {time_record}') --++++ --++++ # if streamer is not None: --++++ # streamer.end() --++++ --++++ # # 简单判断:打印是否使用了JIT路径 --++++ # if hasattr(self, '_jit_used') and self._jit_used: --++++ # print("[JIT] ✓ JIT optimization was used during generation") --++++ # else: --++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++++ --++++ # if return_dict_in_generate: --++++ # if self.config.is_encoder_decoder: --++++ # return GenerateEncoderDecoderOutput( --++++ # sequences=input_ids, --++++ # scores=scores, --++++ # logits=raw_logits, --++++ # encoder_attentions=encoder_attentions, --++++ # encoder_hidden_states=encoder_hidden_states, --++++ # decoder_attentions=decoder_attentions, --++++ # cross_attentions=cross_attentions, --++++ # decoder_hidden_states=decoder_hidden_states, --++++ # past_key_values=model_kwargs.get("past_key_values"), --++++ # average_infer_time=average_infer_time --++++ # ) --++++ # else: --++++ # return GenerateDecoderOnlyOutput( --++++ # sequences=input_ids, --++++ # scores=scores, --++++ # logits=raw_logits, --++++ # attentions=decoder_attentions, --++++ # hidden_states=decoder_hidden_states, --++++ # past_key_values=model_kwargs.get("past_key_values"), --++++ # average_infer_time=average_infer_time --++++ # ) --++++ # else: --++++ # return input_ids --++++ --++++ # def _prepare_cache_for_generation( --++++ # self, --++++ # generation_config, --++++ # model_kwargs, --++++ # assistant_model, --++++ # batch_size, --++++ # max_cache_length, --++++ # ): --++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++++ # generation_config.cache_implementation = "static" --++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++++ --++++ # if generation_config.cache_implementation == "static": --++++ # base_required_from_max_length = generation_config.max_length + 1 --++++ # base_required = max(max_cache_length, base_required_from_max_length) --++++ # min_cache_size = 50 --++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++++ # else: --++++ # max_cache_length = max(base_required, min_cache_size) --++++ --++++ # original_max_cache_length = max_cache_length --++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++++ # print(f" - input max_cache_length: {original_max_cache_length}") --++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++++ # print(f" - final max_cache_length: {max_cache_length}") --++++ --++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++ # if max_cache_length > self.config.max_position_embeddings: --++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++++ --++++ # result = super()._prepare_cache_for_generation( --++++ # generation_config=generation_config, --++++ # model_kwargs=model_kwargs, --++++ # assistant_model=assistant_model, --++++ # batch_size=batch_size, --++++ # max_cache_length=max_cache_length, --++++ # ) --++++ --++++ # if generation_config.cache_implementation == "static": --++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++++ # created_cache = model_kwargs.get(cache_name) --++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++++ # if created_cache.max_cache_len < generation_config.max_length: --++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++++ --++++ # return result --++++ --++++ --++++ --+++ --+++ --+++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++-- --+++2.27.0 --+++ --++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --++new file mode 100644 --++index 00000000..22b65dd5 --++--- /dev/null --+++++ b/patches/0002-20251106commit.patch --++@@ -0,0 +1,3200 @@ --+++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Thu, 6 Nov 2025 09:20:38 +0800 --+++Subject: [PATCH 2/3] 20251106commit --+++ --+++--- --+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- --+++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ --+++ 3 files changed, 2689 insertions(+), 305 deletions(-) --+++ create mode 100644 patches/0001-20251104commit.patch --+++ --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index d8303e45..73773c22 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): --+++ # y = y + self.shared_experts(identity) --+++ # return y --+++ --++++ # @no_grad() --++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ --++++ # expert_cache = ops.zeros_like(x) --++++ # for i in range(self.num_experts_per_tok): --++++ # expert_id = flat_expert_indices[i].item() --++++ # weight = flat_expert_weights[i].item() --++++ # expert = self.experts[expert_id] --++++ # expert_out = expert(x) --++++ # expert_cache += expert_out * weight --++++ # return expert_cache --++++ --+++ @no_grad() --+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ # x 的 shape: (1, hidden_size) --++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++++ --++++ # 1. 收集所有需要的专家层 --++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++++ selected_experts = [self.experts[i] for i in flat_expert_indices] --++++ --++++ # 2. 并行计算所有专家的输出 --++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++++ # ops.cat 会将它们堆叠成一个新的 Tensor --++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++++ --++++ # 3. 使用矩阵乘法进行加权求和 --++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++ # 最终结果 final_output 的 shape: (1, hidden_size) --++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++++ --++++ return final_output --+++ --+++- expert_cache = ops.zeros_like(x) --+++- for i in range(self.num_experts_per_tok): --+++- expert_id = flat_expert_indices[i].item() --+++- weight = flat_expert_weights[i].item() --+++- expert = self.experts[expert_id] --+++- expert_out = expert(x) --+++- expert_cache += expert_out * weight --+++- return expert_cache --+++ --+++ @no_grad() --+++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++ # @lwx --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+++ --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): --+++ return attn_output, attn_weights, past_key_value --+++ --+++ --++++# class DeepseekFlashAttention(nn.Module): --++++# """ --++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --++++ --++++# This class is designed as a drop-in replacement for DeepseekAttention. --++++# """ --++++ --++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++++# super().__init__() --++++# self.config = config --++++# self.layer_idx = layer_idx --++++# if layer_idx is None: --++++# logger.warning( --++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++# "when creating this class." --++++# ) --++++ --++++# self.attention_dropout = config.attention_dropout --++++# self.hidden_size = config.hidden_size --++++# self.num_heads = config.num_attention_heads --++++# self.head_dim = self.hidden_size // self.num_heads --++++# self.num_key_value_heads = config.num_key_value_heads --++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++# self.max_position_embeddings = config.max_position_embeddings --++++# self.rope_theta = config.rope_theta --++++# self.is_causal = True --++++ --++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++# raise ValueError( --++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++# f" and `num_heads`: {self.num_heads})." --++++# ) --++++ --++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++++# self._init_rope() --++++ --++++# def _init_rope(self): --++++# if self.config.rope_scaling is None: --++++# self.rotary_emb = DeepseekRotaryEmbedding( --++++# self.head_dim, --++++# max_position_embeddings=self.max_position_embeddings, --++++# base=self.rope_theta, --++++# ) --++++# else: --++++# scaling_type = self.config.rope_scaling["type"] --++++# scaling_factor = self.config.rope_scaling["factor"] --++++# if scaling_type == "linear": --++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++++# self.head_dim, --++++# max_position_embeddings=self.max_position_embeddings, --++++# scaling_factor=scaling_factor, --++++# base=self.rope_theta, --++++# ) --++++# elif scaling_type == "dynamic": --++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++++# self.head_dim, --++++# max_position_embeddings=self.max_position_embeddings, --++++# scaling_factor=scaling_factor, --++++# base=self.rope_theta, --++++# ) --++++# else: --++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++++ --++++# def forward( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# attention_mask: Optional[mindspore.Tensor] = None, --++++# position_ids: Optional[mindspore.Tensor] = None, --++++# past_key_value: Optional[Cache] = None, --++++# output_attentions: bool = False, --++++# use_cache: bool = False, --++++# **kwargs, --++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++# if "padding_mask" in kwargs: --++++# warnings.warn( --++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++++# ) --++++ --++++# if output_attentions: --++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --++++ --++++# bsz, q_len, _ = hidden_states.shape --++++ --++++# if self.config.pretraining_tp > 1: --++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++++ --++++# query_states = self.q_proj(hidden_states) --++++# key_states = self.k_proj(hidden_states) --++++# value_states = self.v_proj(hidden_states) --++++ --++++# # Reshape for multi-head attention --++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++# kv_seq_len = key_states.shape[-2] --++++# if past_key_value is not None: --++++# if self.layer_idx is None: --++++# raise ValueError( --++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++# "with a layer index." --++++# ) --++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++# # Apply Rotary Positional Embedding --++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++# if past_key_value is not None: --++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ --++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++ --++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++ --++++# # Convert attention_mask for flash_attention_score --++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --++++# if attention_mask is not None: --++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++++# raise ValueError( --++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++++# ) --++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --++++# else: --++++# attn_mask_for_fa = None --++++ --++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++++ --++++# # Call the fused flash_attention_score operator --++++# attn_output = mindspore.ops.flash_attention_score( --++++# query=query_states_for_fa, --++++# key=key_states_for_fa, --++++# value=value_states_for_fa, --++++# head_num=self.num_heads, # This is N1, the number of query heads --++++# input_layout='BSH', --++++# attn_mask=attn_mask_for_fa, --++++# keep_prob=keep_prob, --++++# scalar_value=1.0 / math.sqrt(self.head_dim), --++++# sparse_mode=0 # Default mask mode --++++# ) --++++ --++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --++++# attn_output = self.o_proj(attn_output) --++++ --++++# # Flash Attention does not return attention weights --++++# attn_weights = None --++++ --++++# return attn_output, attn_weights, past_key_value --++++ --++++class DeepseekFlashAttention(nn.Module): --++++ """ --++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --++++ This implementation is a drop-in replacement for the original DeepseekAttention class, --++++ designed for high performance on supported hardware (Ascend). --++++ --++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --++++ """ --++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++++ super().__init__() --++++ self.config = config --++++ self.layer_idx = layer_idx --++++ if layer_idx is None: --++++ logger.warning( --++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++ "when creating this class." --++++ ) --++++ --++++ # --- [FIX] Correctly initialize all required attributes --- --++++ self.attention_dropout = config.attention_dropout --++++ self.hidden_size = config.hidden_size --++++ self.num_heads = config.num_attention_heads --++++ self.head_dim = self.hidden_size // self.num_heads --++++ self.num_key_value_heads = config.num_key_value_heads --++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++ self.max_position_embeddings = config.max_position_embeddings --++++ self.rope_theta = config.rope_theta --++++ self.is_causal = True --++++ --++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++ raise ValueError( --++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++ f" and `num_heads`: {self.num_heads})." --++++ ) --++++ --++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++++ --++++ # This call will now succeed as all attributes are initialized. --++++ self._init_rope() --++++ --++++ def _init_rope(self): --++++ if self.config.rope_scaling is None: --++++ self.rotary_emb = DeepseekRotaryEmbedding( --++++ self.head_dim, --++++ max_position_embeddings=self.max_position_embeddings, --++++ base=self.rope_theta, --++++ ) --++++ else: --++++ scaling_type = self.config.rope_scaling["type"] --++++ scaling_factor = self.config.rope_scaling["factor"] --++++ if scaling_type == "linear": --++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++++ self.head_dim, --++++ max_position_embeddings=self.max_position_embeddings, --++++ scaling_factor=scaling_factor, --++++ base=self.rope_theta, --++++ ) --++++ elif scaling_type == "dynamic": --++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++++ self.head_dim, --++++ max_position_embeddings=self.max_position_embeddings, --++++ scaling_factor=scaling_factor, --++++ base=self.rope_theta, --++++ ) --++++ else: --++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++++ --++++ def forward( --++++ self, --++++ hidden_states: mindspore.Tensor, --++++ attention_mask: Optional[mindspore.Tensor] = None, --++++ position_ids: Optional[mindspore.Tensor] = None, --++++ past_key_value: Optional[Cache] = None, --++++ output_attentions: bool = False, --++++ use_cache: bool = False, --++++ **kwargs, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ if "padding_mask" in kwargs: --++++ warnings.warn( --++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++++ ) --++++ if output_attentions: --++++ warnings.warn( --++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --++++ ) --++++ --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ if self.config.pretraining_tp > 1: --++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++++ --++++ query_states = self.q_proj(hidden_states) --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++ # Apply Rotary Position Embedding --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos} --++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --++++ # So we must explicitly repeat the KV heads. --++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++ --++++ # Convert attention mask for flash_attention_score --++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --++++ if attention_mask is not None: --++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++++ raise ValueError( --++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++++ ) --++++ attn_mask_for_fa = attention_mask < 0 --++++ else: --++++ attn_mask_for_fa = None --++++ --++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++++ --++++ # Call the fused operator using the efficient BNSD layout --++++ attn_output = mindspore.ops.flash_attention_score( --++++ query=query_states, --++++ key=key_states, --++++ value=value_states, --++++ head_num=self.num_heads, --++++ input_layout='BNSD', # Specify the correct layout --++++ attn_mask=attn_mask_for_fa, --++++ keep_prob=keep_prob, --++++ scalar_value=1.0 / math.sqrt(self.head_dim) --++++ ) --++++ --++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ --++++ # Apply output projection --++++ attn_output = self.o_proj(attn_output) --++++ --++++ # Flash attention does not return attention weights, so we return None. --++++ attn_weights = None --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --+++ Deepseek_ATTENTION_CLASSES = { --+++ "eager": DeepseekAttention, --++++ "flash-attention": DeepseekFlashAttention, --+++ } --+++ --+++ --+++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): --+++ config=config, layer_idx=layer_idx --+++ ) --+++ --++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --++++ config=config, layer_idx=layer_idx --++++ ) --++++ --+++ self.mlp = ( --+++ DeepseekMoE(config) --+++ if ( --+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++index d4c6b651..bced285c 100644 --+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union --+++ --+++ import mindspore --+++ import mindnlp.core.nn.functional as F --+++-from mindnlp.core import nn, ops --++++from mindnlp.core import nn, ops, no_grad --+++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss --+++ --+++ from ....common.activations import ACT2FN --+++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) --+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+++ --++++Long_Prompt = False --++++PROMPT_LENGTH_THRESHOLD = 128 --+++ --+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+++ def _prepare_4d_causal_attention_mask_with_cache_position( --+++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): --+++ return attn_output, attn_weights, past_key_value --+++ --+++ --++++# class Qwen2MoeFlashAttention(nn.Module): --++++# """ --++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++ --++++# 关键改动: --++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++# 直接传入原始的 key 和 value 张量效率更高。 --++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++# super().__init__() --++++# self.config = config --++++# self.layer_idx = layer_idx --++++# self.hidden_size = config.hidden_size --++++# self.num_heads = config.num_attention_heads --++++# self.head_dim = self.hidden_size // self.num_heads --++++# self.num_key_value_heads = config.num_key_value_heads --++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++# self.max_position_embeddings = config.max_position_embeddings --++++# self.rope_theta = config.rope_theta --++++# self.attention_dropout = config.attention_dropout --++++ --++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++# raise ValueError( --++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++# ) --++++ --++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++ --++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++# self.head_dim, --++++# max_position_embeddings=self.max_position_embeddings, --++++# base=self.rope_theta, --++++# ) --++++ --++++# def forward( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# attention_mask: Optional[mindspore.Tensor] = None, --++++# position_ids: Optional[mindspore.Tensor] = None, --++++# past_key_value: Optional[Cache] = None, --++++# output_attentions: bool = False, --++++# use_cache: bool = False, --++++# cache_position: Optional[mindspore.Tensor] = None, --++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++# bsz, q_len, _ = hidden_states.shape --++++ --++++# # 1. 线性投射 Q, K, V --++++# query_states = self.q_proj(hidden_states) --++++# key_states = self.k_proj(hidden_states) --++++# value_states = self.v_proj(hidden_states) --++++ --++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++# # query: [B, S, H*D] -> [B, N1, S, D] --++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++# # 3. RoPE 旋转位置编码 --++++# kv_seq_len = key_states.shape[-2] --++++# if past_key_value is not None: --++++# if self.layer_idx is None: --++++# raise ValueError( --++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++# "with a layer index." --++++# ) --++++# # 对于 StaticCache,需要特殊处理 kv_seq_len --++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++# if cache_position.shape[0] == 1: --++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++# kv_seq_len = past_seen_tokens + 1 --++++# else: --++++# # prefill 阶段:cache_position 是范围,使用其长度 --++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++# else: --++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++# # 4. KV 缓存更新 --++++# if past_key_value is not None: --++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++# key_states, value_states = past_key_value.update( --++++# key_states, value_states, self.layer_idx, cache_kwargs --++++# ) --++++ --++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++# if cache_position.shape[0] == 1: --++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++# kv_seq_len = key_states.shape[-2] --++++ --++++# # 5. [重要] 准备 Attention Mask --++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++# fa_attention_mask = None --++++# if attention_mask is not None: --++++# # 截取与当前key长度匹配的部分 --++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++# # 转换为布尔类型: 大负数 -> True, 0 -> False --++++# fa_attention_mask = (mask_slice != 0) --++++ --++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++# input_dtype = query_states.dtype --++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++# query_states = query_states.to(mindspore.float16) --++++# key_states = key_states.to(mindspore.float16) --++++# value_states = value_states.to(mindspore.float16) --++++ --++++# # 6. [核心] 调用 flash_attention_score 算子 --++++# # - 无需手动 repeat_kv, 算子原生支持 GQA --++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++# attn_output = mindspore.ops.flash_attention_score( --++++# query=query_states, --++++# key=key_states, --++++# value=value_states, --++++# head_num=self.num_heads, # 传入Q的头数(N1) --++++# attn_mask=fa_attention_mask, --++++# keep_prob=1.0 - self.attention_dropout, --++++# scalar_value=1.0 / math.sqrt(self.head_dim), --++++# input_layout="BNSD", --++++# sparse_mode=0 # 使用 defaultMask 模式 --++++# ) --++++ --++++# # 恢复原始数据类型 --++++# attn_output = attn_output.to(input_dtype) --++++ --++++# # 7. 调整输出形状 --++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++# attn_output = self.o_proj(attn_output) --++++ --++++# # FlashAttention 算子不直接返回注意力权重矩阵 --++++# attn_weights = None --++++# if output_attentions: --++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++# return attn_output, attn_weights, past_key_value --++++ --++++# # def forward( --++++# # self, --++++# # hidden_states: mindspore.Tensor, --++++# # attention_mask: Optional[mindspore.Tensor] = None, --++++# # position_ids: Optional[mindspore.Tensor] = None, --++++# # past_key_value: Optional[Cache] = None, --++++# # output_attentions: bool = False, --++++# # use_cache: bool = False, --++++# # cache_position: Optional[mindspore.Tensor] = None, --++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++# # bsz, q_len, _ = hidden_states.shape --++++ --++++# # # 1. 线性投射 Q, K, V --++++# # query_states = self.q_proj(hidden_states) --++++# # key_states = self.k_proj(hidden_states) --++++# # value_states = self.v_proj(hidden_states) --++++ --++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --++++# # # 3. RoPE 旋转位置编码 --++++# # kv_seq_len = key_states.shape[-2] --++++# # if past_key_value is not None: --++++# # if self.layer_idx is None: --++++# # raise ValueError( --++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++# # "with a layer index." --++++# # ) --++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++# # # 4. KV 缓存更新 --++++# # if past_key_value is not None: --++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++# # key_states, value_states = past_key_value.update( --++++# # key_states, value_states, self.layer_idx, cache_kwargs --++++# # ) --++++ --++++# # # 5. 准备 Attention Mask --++++# # fa_attention_mask = None --++++# # if attention_mask is not None: --++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++# # fa_attention_mask = (mask_slice != 0) --++++ --++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++# # input_dtype = query_states.dtype --++++ --++++# # # 6. [核心] 调用 flash_attention_score 算子 --++++# # attn_output = mindspore.ops.flash_attention_score( --++++# # query=query_states, --++++# # key=key_states, --++++# # value=value_states, --++++# # head_num=self.num_heads, --++++# # attn_mask=fa_attention_mask, --++++# # keep_prob=1.0 - self.attention_dropout, --++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++++# # input_layout="BNSD", --++++# # sparse_mode=0, --++++# # # <--- 修改点 2: 启用内部高精度计算 --- --++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++# # inner_precise=1 --++++# # ) --++++ --++++# # # 恢复原始数据类型 --++++# # attn_output = attn_output.to(input_dtype) --++++ --++++# # # 7. 调整输出形状 --++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++# # attn_output = self.o_proj(attn_output) --++++ --++++# # attn_weights = None --++++# # if output_attentions: --++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ --++++# # return attn_output, attn_weights, past_key_value --++++ --++++ --+++ class Qwen2MoeFlashAttention(nn.Module): --+++ """ --+++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++- --+++- 关键改动: --+++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++- 直接传入原始的 key 和 value 张量效率更高。 --+++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --++++ --++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --++++ 完全使用模型的低精度数据类型(如 float16)进行计算, --++++ 以达到理论上的最高执行速度。 --+++ """ --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++ super().__init__() --+++ self.config = config --+++ self.layer_idx = layer_idx --++++ if layer_idx is None: --++++ logger.warning_once( --++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --++++ ) --++++ --+++ self.hidden_size = config.hidden_size --+++ self.num_heads = config.num_attention_heads --+++ self.head_dim = self.hidden_size // self.num_heads --+++ self.num_key_value_heads = config.num_key_value_heads --+++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++ self.max_position_embeddings = config.max_position_embeddings --+++ self.rope_theta = config.rope_theta --+++ self.attention_dropout = config.attention_dropout --+++ --+++- if (self.head_dim * self.num_heads) != self.hidden_size: --+++- raise ValueError( --+++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++- ) --+++- --+++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++- # query: [B, S, H*D] -> [B, N1, S, D] --+++- # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++ # 2. 调整形状以匹配 BNSD 布局 --+++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- --+++- # 3. RoPE 旋转位置编码 --++++ --++++ # 3. RoPE 和 KV 缓存 --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++- if self.layer_idx is None: --+++- raise ValueError( --+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++- "with a layer index." --+++- ) --+++- # 对于 StaticCache,需要特殊处理 kv_seq_len --+++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++- # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++- # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++- if cache_position.shape[0] == 1: --+++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++- kv_seq_len = past_seen_tokens + 1 --+++- else: --+++- # prefill 阶段:cache_position 是范围,使用其长度 --+++- kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++- else: --+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++- --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++- # 4. KV 缓存更新 --+++ if past_key_value is not None: --+++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++- key_states, value_states = past_key_value.update( --+++- key_states, value_states, self.layer_idx, cache_kwargs --+++- ) --+++- --+++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++- if cache_position.shape[0] == 1: --+++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++- kv_seq_len = key_states.shape[-2] --+++- --+++- # 5. [重要] 准备 Attention Mask --+++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++ # 4. 准备 Attention Mask --+++ fa_attention_mask = None --+++ if attention_mask is not None: --+++- # 截取与当前key长度匹配的部分 --+++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++- # 转换为布尔类型: 大负数 -> True, 0 -> False --+++ fa_attention_mask = (mask_slice != 0) --+++ --+++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++- input_dtype = query_states.dtype --+++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++- query_states = query_states.to(mindspore.float16) --+++- key_states = key_states.to(mindspore.float16) --+++- value_states = value_states.to(mindspore.float16) --+++- --+++- # 6. [核心] 调用 flash_attention_score 算子 --+++- # - 无需手动 repeat_kv, 算子原生支持 GQA --+++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 --+++ attn_output = mindspore.ops.flash_attention_score( --+++ query=query_states, --+++ key=key_states, --+++ value=value_states, --+++- head_num=self.num_heads, # 传入Q的头数(N1) --++++ head_num=self.num_heads, --+++ attn_mask=fa_attention_mask, --+++- keep_prob=1.0 - self.attention_dropout, --++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout --+++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++ input_layout="BNSD", --+++- sparse_mode=0 # 使用 defaultMask 模式 --++++ sparse_mode=0, --++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 --+++ ) --+++ --+++- # 恢复原始数据类型 --+++- attn_output = attn_output.to(input_dtype) --+++- --+++- # 7. 调整输出形状 --+++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++ # 6. 调整输出形状 --+++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++ attn_output = self.o_proj(attn_output) --+++ --+++- # FlashAttention 算子不直接返回注意力权重矩阵 --++++ # 7. 返回结果 --+++ attn_weights = None --+++ if output_attentions: --+++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++- # def forward( --+++- # self, --+++- # hidden_states: mindspore.Tensor, --+++- # attention_mask: Optional[mindspore.Tensor] = None, --+++- # position_ids: Optional[mindspore.Tensor] = None, --+++- # past_key_value: Optional[Cache] = None, --+++- # output_attentions: bool = False, --+++- # use_cache: bool = False, --+++- # cache_position: Optional[mindspore.Tensor] = None, --+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++- --+++- # bsz, q_len, _ = hidden_states.shape --+++- --+++- # # 1. 线性投射 Q, K, V --+++- # query_states = self.q_proj(hidden_states) --+++- # key_states = self.k_proj(hidden_states) --+++- # value_states = self.v_proj(hidden_states) --+++- --+++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- --+++- # # 3. RoPE 旋转位置编码 --+++- # kv_seq_len = key_states.shape[-2] --+++- # if past_key_value is not None: --+++- # if self.layer_idx is None: --+++- # raise ValueError( --+++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++- # "with a layer index." --+++- # ) --+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++ --+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++- --+++- # # 4. KV 缓存更新 --+++- # if past_key_value is not None: --+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++- # key_states, value_states = past_key_value.update( --+++- # key_states, value_states, self.layer_idx, cache_kwargs --+++- # ) --+++- --+++- # # 5. 准备 Attention Mask --+++- # fa_attention_mask = None --+++- # if attention_mask is not None: --+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++- # fa_attention_mask = (mask_slice != 0) --+++- --+++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++- # input_dtype = query_states.dtype --+++- --+++- # # 6. [核心] 调用 flash_attention_score 算子 --+++- # attn_output = mindspore.ops.flash_attention_score( --+++- # query=query_states, --+++- # key=key_states, --+++- # value=value_states, --+++- # head_num=self.num_heads, --+++- # attn_mask=fa_attention_mask, --+++- # keep_prob=1.0 - self.attention_dropout, --+++- # scalar_value=1.0 / math.sqrt(self.head_dim), --+++- # input_layout="BNSD", --+++- # sparse_mode=0, --+++- # # <--- 修改点 2: 启用内部高精度计算 --- --+++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++- # inner_precise=1 --+++- # ) --+++- --+++- # # 恢复原始数据类型 --+++- # attn_output = attn_output.to(input_dtype) --++++QWEN2MOE_ATTENTION_CLASSES = { --++++ "eager": Qwen2MoeAttention, --++++ "flash-attention": Qwen2MoeFlashAttention, --++++} --+++ --+++- # # 7. 调整输出形状 --+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++- # attn_output = self.o_proj(attn_output) --+++ --+++- # attn_weights = None --+++- # if output_attentions: --+++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# def __init__(self, config): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# # gating --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++ --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# #@dwj --++++# # 只遍历激活的专家,而非全部专家 --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# num_tokens = hidden_states_reshaped.shape[0] --++++ --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++# flat_selected_experts = selected_experts.flatten() --++++ --++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++# token_indices = broadcasted_token_indices.flatten() --++++ --++++# active_experts = ops.unique(flat_selected_experts) --++++ --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++ --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# selected_token_indices = token_indices[mask] --++++# selected_routing_weights = routing_weights.flatten()[mask] --++++ --++++# current_states = hidden_states_reshaped[selected_token_indices] --++++ --++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++ --++++# final_hidden_states = final_hidden_states.index_add( --++++# dim=0, --++++# index=selected_token_indices, --++++# source=expert_output.to(hidden_states.dtype) --++++# ) --++++ --++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++ --+++- # return attn_output, attn_weights, past_key_value --++++# final_hidden_states = final_hidden_states + shared_expert_output --++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --++++ --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# """ --++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --++++# `_moe_infer_prefill` (用于长序列处理) 方法。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# # 门控网络 --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# # 专家列表 --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++# # 共享专家 --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# @no_grad() --++++# def _moe_infer_decode( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# """ --++++# 【解码路径】针对 sequence_length=1 的极致优化。 --++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --++++# """ --++++# batch_size, hidden_dim = hidden_states.shape --++++ --++++# expert_outputs_list = [ --++++# ops.cat([ --++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++# ], dim=0) --++++# for i in range(batch_size) --++++# ] --++++ --++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- --++++# # shape: (batch_size, top_k, hidden_dim) --++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++ --++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++ --++++# return moe_output.squeeze(1) --++++ --++++# @no_grad() --++++# def _moe_infer_prefill( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# """ --++++# 【预填充路径】针对 sequence_length > 1 的优化。 --++++# 按专家对 Token 进行分组,并进行批处理。 --++++# """ --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens = hidden_states.shape[0] --++++# flat_selected_experts = selected_experts.flatten() --++++ --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++ --++++# active_experts = ops.unique(flat_selected_experts) --++++ --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++ --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# selected_token_indices = token_indices[mask] --++++# selected_routing_weights = routing_weights.flatten()[mask] --++++ --++++# current_states = hidden_states[selected_token_indices] --++++ --++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++ --++++# moe_output = moe_output.index_add( --++++# dim=0, --++++# index=selected_token_indices, --++++# source=expert_output.to(hidden_states.dtype) --++++# ) --++++# return moe_output --++++ --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# """ --++++# 顶层 forward 方法,作为智能分发器。 --++++# """ --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++- # def forward( --+++- # self, --+++- # hidden_states: mindspore.Tensor, --+++- # attention_mask: Optional[mindspore.Tensor] = None, --+++- # position_ids: Optional[mindspore.Tensor] = None, --+++- # past_key_value: Optional[Cache] = None, --+++- # output_attentions: bool = False, --+++- # use_cache: bool = False, --+++- # cache_position: Optional[mindspore.Tensor] = None, --+++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++- --+++- # bsz, q_len, _ = hidden_states.shape --+++- --+++- # query_states = self.q_proj(hidden_states) --+++- # key_states = self.k_proj(hidden_states) --+++- # value_states = self.v_proj(hidden_states) --+++- --+++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- --+++- # kv_seq_len = key_states.shape[-2] --+++- # if past_key_value is not None: --+++- # if self.layer_idx is None: --+++- # raise ValueError("`layer_idx` must be specified for caching") --+++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++- --+++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++- --+++- # if past_key_value is not None: --+++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++- # key_states, value_states = past_key_value.update( --+++- # key_states, value_states, self.layer_idx, cache_kwargs --+++- # ) --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++# moe_output = None --++++# # 在推理时,根据序列长度选择最优路径 --++++# if not self.training: --++++# if sequence_length == 1: --++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++# else: --++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++# else: --++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --++++# raise NotImplementedError("Training path is not implemented.") --++++ --++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --++++ --++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --++++ --++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --++++ --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# """ --++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# # 门控网络 --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# # 专家列表 --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++# # 共享专家 --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# @no_grad() --++++# def _moe_infer_decode( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# batch_size, _ = hidden_states.shape --++++# expert_outputs_list = [ --++++# ops.cat([ --++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++# ], dim=0) --++++# for i in range(batch_size) --++++# ] --++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++# return moe_output.squeeze(1) --++++ --++++# @no_grad() --++++# def _moe_infer_prefill( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens = hidden_states.shape[0] --++++# flat_selected_experts = selected_experts.flatten() --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++# active_experts = ops.unique(flat_selected_experts) --++++ --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# selected_token_indices = token_indices[mask] --++++# selected_routing_weights = routing_weights.flatten()[mask] --++++# current_states = hidden_states[selected_token_indices] --++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++# moe_output = moe_output.index_add( --++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++# ) --++++# return moe_output --++++ --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# """ --++++# 顶层 forward 方法,作为智能分发器。 --++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --++++# """ --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ --++++# # 1. 门控计算 (通用逻辑) --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++# # 2. 智能分发到最优 MoE 路径 --++++# moe_output = None --++++# if not self.training: --++++# if sequence_length == 1: --++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++# else: --++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++# else: --++++# raise NotImplementedError("Training path is not implemented.") --++++ --++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++ --++++# # 4. 合并 MoE 输出和共享专家输出 --++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++ --++++# # 5. 恢复原始形状并返回 --++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --++++# prefill fastest --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# """ --++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# # 门控网络 --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# # 专家列表 --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++# # 共享专家 --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# @no_grad() --++++# def _moe_infer_dispatch( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# """ --++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --++++# """ --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens, _ = hidden_states.shape --++++ --++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --++++# flat_selected_experts = selected_experts.flatten() --++++# flat_routing_weights = routing_weights.flatten() --+++ --+++- # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++- # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++- --+++- # # <--- 核心修改点: 手动进行高精度缩放 --- --+++- # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++- # query_states = query_states / math.sqrt(self.head_dim) --+++- # # <--- 修改结束 --- --+++- --+++- # fa_attention_mask = None --+++- # if attention_mask is not None: --+++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++- # fa_attention_mask = (mask_slice != 0) --+++- --+++- # input_dtype = query_states.dtype --+++- --+++- # attn_output = mindspore.ops.flash_attention_score( --+++- # query=query_states, # 传入已经预先缩放过的 query --+++- # key=key_states, --+++- # value=value_states, --+++- # head_num=self.num_heads, --+++- # attn_mask=fa_attention_mask, --+++- # keep_prob=1.0 - self.attention_dropout, --+++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++- # input_layout="BNSD", --+++- # sparse_mode=0, --+++- # inner_precise=1 # 仍然保持内部高精度计算 --+++- # ) --++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++ --+++- # attn_output = attn_output.to(input_dtype) --+++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++- # attn_output = self.o_proj(attn_output) --++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --++++# active_experts = ops.unique(flat_selected_experts) --++++ --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++ --++++# # 找到所有分配给该专家的 token --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++ --++++# # 使用 mask 选取对应的 token 和权重 --++++# current_token_indices = token_indices[mask] --++++# current_routing_weights = flat_routing_weights[mask] --++++# current_hidden_states = hidden_states[current_token_indices] --++++ --++++# # 对这些 token 进行批处理 --++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++ --++++# # 使用 index_add 将结果精确地加回到对应位置 --++++# moe_output = moe_output.index_add( --++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --++++# ) --++++# return moe_output --++++ --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# """ --++++# 顶层 forward 方法,作为智能分发器。 --++++# """ --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ --++++# # 1. 门控计算 --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++ --++++# # 2. 调用统一的 MoE 计算内核 --++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+++ --+++- # attn_weights = None --+++- # if output_attentions: --+++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++# # 3. 统一处理共享专家 --++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++ --++++# # 4. 合并输出 --++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++ --++++# # 5. 恢复原始形状并返回 --++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --++++ --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# """ --++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++# 【最终高性能与高精度版】: --++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --++++# 3. 这样实现了速度和准确性的两全其美。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# @no_grad() --++++# def _moe_infer_decode( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# """ --++++# 【解码路径】极致优化版:bmm + 高精度累加。 --++++# """ --++++# original_dtype = hidden_states.dtype --++++# batch_size, _ = hidden_states.shape --++++ --++++# expert_outputs_list = [ --++++# ops.cat([ --++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++# ], dim=0) --++++# for i in range(batch_size) --++++# ] --++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++ --++++# # 在 float32 下执行 bmm,得到高精度结果 --++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++ --++++# # 将高精度结果转换回原始数据类型 --++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --++++ --++++# return moe_output --++++ --++++# @no_grad() --++++# def _moe_infer_prefill( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# selected_experts: mindspore.Tensor, --++++# routing_weights: mindspore.Tensor --++++# ) -> mindspore.Tensor: --++++# """ --++++# 【预填充路径】与原始实现一致,结果精确。 --++++# """ --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens, _ = hidden_states.shape --++++# flat_selected_experts = selected_experts.flatten() --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++# active_experts = ops.unique(flat_selected_experts) --++++ --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# selected_token_indices = token_indices[mask] --++++# selected_routing_weights = routing_weights.flatten()[mask] --++++# current_states = hidden_states[selected_token_indices] --++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++# moe_output = moe_output.index_add( --++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++# ) --++++# return moe_output --++++ --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++ --+++- # return attn_output, attn_weights, past_key_value --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --++++# # 如果模型主体是 float16,后续再转换 --++++ --++++# moe_output = None --++++# if not self.training: --++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --++++# # _moe_infer_decode 内部会处理好类型转换 --++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) --++++# if sequence_length == 1: --++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++# else: --++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++# else: --++++# raise NotImplementedError("Training path is not implemented.") --++++ --++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++ --++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --+++ --+++-QWEN2MOE_ATTENTION_CLASSES = { --+++- "eager": Qwen2MoeAttention, --+++- "flash-attention": Qwen2MoeFlashAttention, --+++-} --++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++# """ --++++# 【融合版】一个混合专家模块,内置两种推理策略, --++++# 由外部全局变量 `Long_Prompt` 控制: --++++ --++++# - if Long_Prompt is True: 【精度优先模式】 --++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --++++# 适用于处理长序列,避免误差累积。 --++++ --++++# - if Long_Prompt is False: 【速度优先模式】 --++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --++++# 在解码阶段获得极致速度,同时保证结果高度准确。 --++++# """ --++++# def __init__(self, config: Qwen2MoeConfig): --++++# super().__init__() --++++# self.num_experts = config.num_experts --++++# self.top_k = config.num_experts_per_tok --++++# self.norm_topk_prob = config.norm_topk_prob --++++ --++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++# self.experts = nn.ModuleList( --++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++# ) --++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++# # --- 速度优先模式的辅助函数 --- --++++# @no_grad() --++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++# original_dtype = hidden_states.dtype --++++# batch_size, _ = hidden_states.shape --++++# expert_outputs_list = [ --++++# ops.cat([ --++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++# ], dim=0) --++++# for i in range(batch_size) --++++# ] --++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++# weights_fp32 = routing_weights.to(mindspore.float32) --++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++# return moe_output_fp32.squeeze(1).to(original_dtype) --++++ --++++# @no_grad() --++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens, _ = hidden_states.shape --++++# flat_selected_experts = selected_experts.flatten() --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++# active_experts = ops.unique(flat_selected_experts) --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# selected_token_indices = token_indices[mask] --++++# selected_routing_weights = routing_weights.flatten()[mask] --++++# current_states = hidden_states[selected_token_indices] --++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --++++# return moe_output --++++ --++++# # --- 精度优先模式的辅助函数 --- --++++# @no_grad() --++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++# moe_output = ops.zeros_like(hidden_states) --++++# num_tokens, _ = hidden_states.shape --++++# flat_selected_experts = selected_experts.flatten() --++++# flat_routing_weights = routing_weights.flatten() --++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++# active_experts = ops.unique(flat_selected_experts) --++++# for expert_idx_tensor in active_experts: --++++# expert_idx = expert_idx_tensor.item() --++++# expert_layer = self.experts[expert_idx] --++++# mask = (flat_selected_experts == expert_idx_tensor) --++++# current_token_indices = token_indices[mask] --++++# current_routing_weights = flat_routing_weights[mask] --++++# current_hidden_states = hidden_states[current_token_indices] --++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++++# return moe_output --++++ --++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++# # 声明我们将要使用一个在模块外部定义的全局变量 --++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --++++# global Long_Prompt --++++ --++++# # 1. 门控计算 (所有模式通用) --++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++# router_logits = self.gate(hidden_states_reshaped) --++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++++# if self.norm_topk_prob: --++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++# moe_output = None --++++# if not self.training: --++++# # 根据 Long_Prompt 标志选择模式 --++++# if Long_Prompt: --++++# # --- 精度优先模式 --- --++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++# else: --++++# # --- 速度优先模式 --- --++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++# if sequence_length == 1: --++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++# else: --++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++# else: --++++# raise NotImplementedError("Training path is not implemented.") --++++ --++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++ --++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++ --++++# return final_hidden_states, router_logits --++++ --++++class Qwen2MoeSparseMoeBlock(nn.Module): --++++ """ --++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --++++ 控制的顶级推理策略: --+++ --++++ - if Long_Prompt is True: 【精度优先模式】 --++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --++++ 适用于需要严格可复现性的长序列任务。 --+++ --+++-class Qwen2MoeSparseMoeBlock(nn.Module): --+++- def __init__(self, config): --++++ - if Long_Prompt is False: 【速度优先模式】 --++++ 采用业界最强的性能组合: --++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --++++ """ --++++ def __init__(self, config: Qwen2MoeConfig): --+++ super().__init__() --+++ self.num_experts = config.num_experts --+++ self.top_k = config.num_experts_per_tok --+++ self.norm_topk_prob = config.norm_topk_prob --+++ --+++- # gating --+++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++ self.experts = nn.ModuleList( --+++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++ ) --+++- --+++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++ --+++- #@dwj --+++- # 只遍历激活的专家,而非全部专家 --+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++- num_tokens = hidden_states_reshaped.shape[0] --+++- --+++- router_logits = self.gate(hidden_states_reshaped) --+++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- --+++- if self.norm_topk_prob: --+++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++- flat_selected_experts = selected_experts.flatten() --+++- --+++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++- token_indices = broadcasted_token_indices.flatten() --+++- --+++- active_experts = ops.unique(flat_selected_experts) --+++- --+++- for expert_idx_tensor in active_experts: --+++- expert_idx = expert_idx_tensor.item() --+++- expert_layer = self.experts[expert_idx] --+++- --+++- mask = (flat_selected_experts == expert_idx_tensor) --+++- selected_token_indices = token_indices[mask] --+++- selected_routing_weights = routing_weights.flatten()[mask] --+++- --+++- current_states = hidden_states_reshaped[selected_token_indices] --+++- --+++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++- --+++- final_hidden_states = final_hidden_states.index_add( --+++- dim=0, --+++- index=selected_token_indices, --+++- source=expert_output.to(hidden_states.dtype) --+++- ) --+++- --+++- shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --++++ @no_grad() --++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++ original_dtype = hidden_states.dtype --++++ batch_size, _ = hidden_states.shape --++++ expert_outputs_list = [ --++++ ops.cat([ --++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++ ], dim=0) --++++ for i in range(batch_size) --++++ ] --++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++ weights_fp32 = routing_weights.to(mindspore.float32) --++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++ return moe_output_fp32.squeeze(1).to(original_dtype) --++++ --++++ @no_grad() --++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++ num_tokens, _ = hidden_states.shape --++++ flat_selected_experts = selected_experts.flatten() --++++ sorted_expert_indices = flat_selected_experts.argsort() --++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++++ original_token_indices = sorted_expert_indices // self.top_k --++++ moe_output = ops.zeros_like(hidden_states) --++++ current_token_offset = 0 --++++ for i in range(self.num_experts): --++++ expert_token_count = tokens_per_expert[i] - current_token_offset --++++ if expert_token_count == 0: --++++ continue --++++ end_offset = current_token_offset + expert_token_count --++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++++ expert_hidden_states = hidden_states[expert_original_token_indices] --++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++++ current_token_offset += expert_token_count --++++ return moe_output --++++ --++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --++++ @no_grad() --++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++ moe_output = ops.zeros_like(hidden_states) --++++ num_tokens, _ = hidden_states.shape --++++ flat_selected_experts = selected_experts.flatten() --++++ flat_routing_weights = routing_weights.flatten() --++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++ active_experts = ops.unique(flat_selected_experts) --++++ for expert_idx_tensor in active_experts: --++++ expert_idx = expert_idx_tensor.item() --++++ expert_layer = self.experts[expert_idx] --++++ mask = (flat_selected_experts == expert_idx_tensor) --++++ current_token_indices = token_indices[mask] --++++ current_routing_weights = flat_routing_weights[mask] --++++ current_hidden_states = hidden_states[current_token_indices] --++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++++ return moe_output --+++ --+++- final_hidden_states = final_hidden_states + shared_expert_output --+++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++- --+++- return final_hidden_states, router_logits --++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++ global Long_Prompt --++++ --++++ # 1. 门控计算 (所有模式通用) --++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++ router_logits = self.gate(hidden_states_reshaped) --++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++++ if self.norm_topk_prob: --++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++ moe_output = None --++++ if Long_Prompt: --++++ # --- 精度优先模式 (ACCURACY MODE) --- --++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ else: --++++ # --- 速度优先模式 (SPEED MODE) --- --++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++ if sequence_length == 1: --++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ else: --++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ --+++ --++++ # 3. 共享专家计算与合并 (所有模式通用) --++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++ --++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++ --++++ return final_hidden_states, router_logits --+++ --+++ class Qwen2MoeDecoderLayer(nn.Module): --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+++ super().__init__() --+++ self.hidden_size = config.hidden_size --++++ --++++ # if Long_Prompt: --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ # else: --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++ --+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ --+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++- --+++ if (layer_idx not in config.mlp_only_layers) and ( --+++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++ ): --+++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ self._warmed_up = True --+++ self.warmup_moe_model() --+++ --++++ --++++ --+++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++ output_router_logits = ( --+++ output_router_logits if output_router_logits is not None else self.config.output_router_logits --+++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ router_logits=outputs.router_logits, --+++ ) --+++ --++++ def generate(self, *args, **kwargs): --++++ """ --++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --++++ """ --++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++++ --++++ input_ids = kwargs.get("input_ids") --++++ if input_ids is None and args: --++++ input_ids = args[0] --++++ --++++ if input_ids is not None: --++++ prompt_length = input_ids.shape[1] --++++ --++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: --++++ Long_Prompt = True --++++ else: --++++ Long_Prompt = False --++++ --++++ return super().generate(*args, **kwargs) --++++ --+++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation --+++ def prepare_inputs_for_generation( --+++ self, --+++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens --+++ # Exception 1: when passing input_embeds, input_ids may be missing entries --+++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --++++ --+++ if past_key_values is not None: --+++ if inputs_embeds is not None: # Exception 1 --+++ if 0 not in input_ids.shape: --+++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ } --+++ ) --+++ return model_inputs --++++ --+++ # @lwx --+++ # def _decode_one_tokens_logits( --+++ # self, --+++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): --+++ attentions=outputs.attentions, --+++ ) --+++ --++++ --+++ __all__ = [ --+++ "Qwen2MoeForCausalLM", --+++ "Qwen2MoeModel", --+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++new file mode 100644 --+++index 00000000..6dfb5b93 --+++--- /dev/null --++++++ b/patches/0001-20251104commit.patch --+++@@ -0,0 +1,1272 @@ --++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++++From: Pinoeer-kingxi <13022943007@163.com> --++++Date: Tue, 4 Nov 2025 09:11:51 +0800 --++++Subject: [PATCH] 20251104commit --++++ --++++--- --++++ mindnlp/transformers/cache_utils.py | 28 +- --++++ .../models/deepseek/modeling_deepseek.py | 149 ++- --++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++++ 3 files changed, 976 insertions(+), 87 deletions(-) --++++ --++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++++index cadd2e04..02f8d4be 100644 --++++--- a/mindnlp/transformers/cache_utils.py --+++++++ b/mindnlp/transformers/cache_utils.py --++++@@ -812,14 +812,26 @@ class StaticCache(Cache): --++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++++ # k_out[:, :, cache_position] = key_states --++++ # v_out[:, :, cache_position] = value_states --++++- if ON_ORANGE_PI: --++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++- else: --++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++- --+++++ # if ON_ORANGE_PI: --+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++ # else: --+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++ # 确保 cache_position 是 1D tensor 并且类型正确 --+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++++ if cache_position.ndim > 1: --+++++ cache_position = cache_position.flatten() --+++++ # 确保类型是 int32 或 int64(MindSpore 要求) --+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++++ cache_position = cache_position.int() --+++++ --+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++++ k_out[:, :, cache_position] = key_states --+++++ v_out[:, :, cache_position] = value_states --+++++ --++++ return k_out, v_out --++++ --++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++index c695b944..d8303e45 100644 --++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++++ def rotate_half(x): --++++ """Rotates half the hidden dims of the input.""" --++++- x1 = x[..., : x.shape[-1] // 2] --++++- x2 = x[..., x.shape[-1] // 2 :] --+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++ # x1 = x[..., : x.shape[-1] // 2] --+++++ # x2 = x[..., x.shape[-1] // 2 :] --+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++ return ops.cat((-x2, x1), dim=-1) --++++ --++++ --++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++++ if self.training: --++++ raise NotImplementedError("Training is not supported yet.") --++++ else: --++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++- if self.config.n_shared_experts is not None: --++++- y = y + self.shared_experts(identity) --++++- return y --+++++ # @lwx --+++++ if orig_shape[1] == 1: --+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++++ y=y.view(*orig_shape) --+++++ if self.config.n_shared_experts is not None: --+++++ y = y + self.shared_experts(identity) --+++++ return y --+++++ else: --+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++++ if self.config.n_shared_experts is not None: --+++++ y = y + self.shared_experts(identity) --+++++ return y --+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++ # if self.config.n_shared_experts is not None: --+++++ # y = y + self.shared_experts(identity) --+++++ # return y --+++++ --+++++ @no_grad() --+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++ --+++++ expert_cache = ops.zeros_like(x) --+++++ for i in range(self.num_experts_per_tok): --+++++ expert_id = flat_expert_indices[i].item() --+++++ weight = flat_expert_weights[i].item() --+++++ expert = self.experts[expert_id] --+++++ expert_out = expert(x) --+++++ expert_cache += expert_out * weight --+++++ return expert_cache --++++ --++++ @no_grad() --++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++- # expert_cache = torch.zeros_like(x) --++++- # idxs = flat_expert_indices.argsort() --++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++- # token_idxs = idxs // self.num_experts_per_tok --++++- # for i, end_idx in enumerate(tokens_per_expert): --++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++- # if start_idx == end_idx: --++++- # continue --++++- # expert = self.experts[i] --++++- # exp_token_idx = token_idxs[start_idx:end_idx] --++++- # expert_tokens = x[exp_token_idx] --++++- # expert_out = expert(expert_tokens) --++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++- # return expert_cache --+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++ expert_cache = ops.zeros_like(x) --++++ idxs = flat_expert_indices.argsort() --++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ token_idxs = idxs // self.num_experts_per_tok --+++++ --++++ for i, end_idx in enumerate(tokens_per_expert): --++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ if start_idx == end_idx: --++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++++ expert_out = expert(expert_tokens) --++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --++++ return expert_cache --+++++ --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # # expert_cache = torch.zeros_like(x) --+++++ # # idxs = flat_expert_indices.argsort() --+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++ # # token_idxs = idxs // self.num_experts_per_tok --+++++ # # for i, end_idx in enumerate(tokens_per_expert): --+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++ # # if start_idx == end_idx: --+++++ # # continue --+++++ # # expert = self.experts[i] --+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # # expert_tokens = x[exp_token_idx] --+++++ # # expert_out = expert(expert_tokens) --+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++ # # return expert_cache --+++++ # expert_cache = ops.zeros_like(x) --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # for i, end_idx in enumerate(tokens_per_expert): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # if start_idx == end_idx: --+++++ # continue --+++++ # expert = self.experts[i] --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = expert(expert_tokens) --+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --+++++ # return expert_cache --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # expert_cache = ops.zeros_like(x) --+++++ --+++++ # # 排序保证顺序一致 --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # # 找出有 token 的专家 --+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++ --+++++ # for i in active_experts.tolist(): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # end_idx = tokens_per_expert[i] --+++++ # if start_idx == end_idx: # 没有 token --+++++ # continue --+++++ --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = self.experts[i](expert_tokens) --+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++ --+++++ # expert_cache = mindspore.mint.scatter_add( --+++++ # expert_cache, --+++++ # 0, --+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++ # expert_out --+++++ # ) --+++++ --+++++ # return expert_cache --+++++ --+++++ --++++ --++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++++ # """ --++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ --++++ # Initialize weights and apply final processing --++++ self.post_init() --+++++ self.warm_up = False --+++++ --+++++ def warmup_moe_model_deep(self): --+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++++ test_texts = [ --+++++ "warmup short", --+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++++ ] --+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++ if tokenizer is None: --+++++ from mindnlp.transformers import AutoTokenizer --+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++ self._warmup_tokenizer = tokenizer --+++++ --+++++ for text in test_texts: --+++++ inputs = tokenizer(text, return_tensors="ms") --+++++ with mindspore._no_grad(): --+++++ _ = self(**inputs, use_cache=False) --+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++++ --++++ def get_input_embeddings(self): --++++ return self.model.embed_tokens --++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++ ```""" --+++++ if not self.warm_up: --+++++ self.warm_up = True --+++++ self.warmup_moe_model_deep() --+++++ --++++ output_attentions = ( --++++ output_attentions --++++ if output_attentions is not None --++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++index 3cbf820e..d4c6b651 100644 --++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++@@ -18,7 +18,6 @@ --++++ # See the License for the specific language governing permissions and --++++ # limitations under the License. --++++ """MindSpore Qwen2MoE model.""" --++++- --++++ import math --++++ from typing import List, Optional, Tuple, Union --++++ --++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++++ TokenClassifierOutput, --++++ ) --++++ from ...modeling_utils import PreTrainedModel --+++++from ...generation import GenerationMixin --++++ from ....utils import logging --++++ from .configuration_qwen2_moe import Qwen2MoeConfig --++++ --++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++++ self.variance_epsilon = eps --++++ --++++ def forward(self, hidden_states): --+++++ # @dwj --+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++ # @lwx --+++++ # if not self.training : --+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++ input_dtype = hidden_states.dtype --++++ hidden_states = hidden_states.to(mindspore.float32) --++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++++@@ -234,6 +239,8 @@ def rotate_half(x): --++++ """Rotates half the hidden dims of the input.""" --++++ x1 = x[..., : x.shape[-1] // 2] --++++ x2 = x[..., x.shape[-1] // 2 :] --+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++ return ops.cat((-x2, x1), dim=-1) --++++ --++++ --++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++++ self.config = config --++++ self.hidden_size = config.hidden_size --++++ self.intermediate_size = intermediate_size --+++++ --++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++++ self.act_fn = ACT2FN[config.hidden_act] --++++ --++++ def forward(self, x): --++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++- --++++ --+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++ # @lwx --+++++ # gate_up_output = self.gate_up_proj(x) --+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++++ # return self.down_proj(swiglu_output) --+++++ --+++++ # def forward(self, x): --+++++ # gate_proj_out = self.gate_proj(x) --+++++ # up_proj_out = self.up_proj(x) --+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++++ # return self.down_proj(swiglu_out) --+++++ --++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++++ """ --++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++++ use_cache: bool = False, --++++ cache_position: Optional[mindspore.Tensor] = None, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ --+++++ --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ query_states = self.q_proj(hidden_states) --++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ "with a layer index." --++++ ) --++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ if isinstance(past_key_value, StaticCache): --+++++ kv_seq_len = key_states.shape[-2] --+++++ else: --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++ if isinstance(past_key_value, StaticCache): --+++++ kv_seq_len = key_states.shape[-2] --++++ --++++ # repeat k/v heads if n_kv_heads < n_heads --++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++- --+++++ --++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++ --++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++++- raise ValueError( --++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++++- f" {attn_weights.shape}" --++++- ) --++++- --++++- if attention_mask is not None: # no matter the length, we just slice it --++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++++ if attention_mask is not None: --+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++ attn_weights = attn_weights + causal_mask --++++ --++++ # upcast attention to fp32 --++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++ --++++ attn_output = self.o_proj(attn_output) --++++- --+++++ # @lwx --+++++ --+++++ # max_seq_len = self.max_position_embeddings # 2048 --+++++ --+++++ # if attention_mask is not None: --+++++ # # attention_mask: [B, 1, Sq, Sk] --+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++ --+++++ # # pad 到 [max_seq_len, max_seq_len] --+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++ # global_attention_mask = padded_mask --+++++ # else: --+++++ # global_attention_mask = None --+++++ --+++++ --+++++ # sparse_mode=3 --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # real_shift=None, --+++++ # padding_mask=None, --+++++ --+++++ # head_num=self.num_heads, --+++++ # attn_mask=global_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ # input_layout="BNSD", --+++++ # pre_tokens=2147483647, --+++++ # next_tokens=2147483647, --+++++ # inner_precise=0, --+++++ # drop_mask=None, --+++++ # prefix=None, --+++++ # actual_seq_qlen=None, --+++++ # actual_seq_kvlen=None, --+++++ # sparse_mode=sparse_mode, --+++++ # ) --++++ if not output_attentions: --++++ attn_weights = None --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++ --+++++class Qwen2MoeFlashAttention(nn.Module): --+++++ """ --+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++ --+++++ 关键改动: --+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++ 直接传入原始的 key 和 value 张量效率更高。 --+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++ """ --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++ super().__init__() --+++++ self.config = config --+++++ self.layer_idx = layer_idx --+++++ self.hidden_size = config.hidden_size --+++++ self.num_heads = config.num_attention_heads --+++++ self.head_dim = self.hidden_size // self.num_heads --+++++ self.num_key_value_heads = config.num_key_value_heads --+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++ self.max_position_embeddings = config.max_position_embeddings --+++++ self.rope_theta = config.rope_theta --+++++ self.attention_dropout = config.attention_dropout --+++++ --+++++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++++ raise ValueError( --+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++ ) --+++++ --+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++ --+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++ self.head_dim, --+++++ max_position_embeddings=self.max_position_embeddings, --+++++ base=self.rope_theta, --+++++ ) --+++++ --+++++ def forward( --+++++ self, --+++++ hidden_states: mindspore.Tensor, --+++++ attention_mask: Optional[mindspore.Tensor] = None, --+++++ position_ids: Optional[mindspore.Tensor] = None, --+++++ past_key_value: Optional[Cache] = None, --+++++ output_attentions: bool = False, --+++++ use_cache: bool = False, --+++++ cache_position: Optional[mindspore.Tensor] = None, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # 1. 线性投射 Q, K, V --+++++ query_states = self.q_proj(hidden_states) --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++ # query: [B, S, H*D] -> [B, N1, S, D] --+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # 3. RoPE 旋转位置编码 --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++ if self.layer_idx is None: --+++++ raise ValueError( --+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ "with a layer index." --+++++ ) --+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++ if cache_position.shape[0] == 1: --+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++ kv_seq_len = past_seen_tokens + 1 --+++++ else: --+++++ # prefill 阶段:cache_position 是范围,使用其长度 --+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++ else: --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # 4. KV 缓存更新 --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ key_states, value_states = past_key_value.update( --+++++ key_states, value_states, self.layer_idx, cache_kwargs --+++++ ) --+++++ --+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++ if cache_position.shape[0] == 1: --+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++ kv_seq_len = key_states.shape[-2] --+++++ --+++++ # 5. [重要] 准备 Attention Mask --+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++ fa_attention_mask = None --+++++ if attention_mask is not None: --+++++ # 截取与当前key长度匹配的部分 --+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++ fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++ input_dtype = query_states.dtype --+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++ query_states = query_states.to(mindspore.float16) --+++++ key_states = key_states.to(mindspore.float16) --+++++ value_states = value_states.to(mindspore.float16) --+++++ --+++++ # 6. [核心] 调用 flash_attention_score 算子 --+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++ attn_output = mindspore.ops.flash_attention_score( --+++++ query=query_states, --+++++ key=key_states, --+++++ value=value_states, --+++++ head_num=self.num_heads, # 传入Q的头数(N1) --+++++ attn_mask=fa_attention_mask, --+++++ keep_prob=1.0 - self.attention_dropout, --+++++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ input_layout="BNSD", --+++++ sparse_mode=0 # 使用 defaultMask 模式 --+++++ ) --+++++ --+++++ # 恢复原始数据类型 --+++++ attn_output = attn_output.to(input_dtype) --+++++ --+++++ # 7. 调整输出形状 --+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = self.o_proj(attn_output) --+++++ --+++++ # FlashAttention 算子不直接返回注意力权重矩阵 --+++++ attn_weights = None --+++++ if output_attentions: --+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ # def forward( --+++++ # self, --+++++ # hidden_states: mindspore.Tensor, --+++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++ # past_key_value: Optional[Cache] = None, --+++++ # output_attentions: bool = False, --+++++ # use_cache: bool = False, --+++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ # bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # # 1. 线性投射 Q, K, V --+++++ # query_states = self.q_proj(hidden_states) --+++++ # key_states = self.k_proj(hidden_states) --+++++ # value_states = self.v_proj(hidden_states) --+++++ --+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # # 3. RoPE 旋转位置编码 --+++++ # kv_seq_len = key_states.shape[-2] --+++++ # if past_key_value is not None: --+++++ # if self.layer_idx is None: --+++++ # raise ValueError( --+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ # "with a layer index." --+++++ # ) --+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # # 4. KV 缓存更新 --+++++ # if past_key_value is not None: --+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ # key_states, value_states = past_key_value.update( --+++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++ # ) --+++++ --+++++ # # 5. 准备 Attention Mask --+++++ # fa_attention_mask = None --+++++ # if attention_mask is not None: --+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++ # input_dtype = query_states.dtype --+++++ --+++++ # # 6. [核心] 调用 flash_attention_score 算子 --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # head_num=self.num_heads, --+++++ # attn_mask=fa_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ # input_layout="BNSD", --+++++ # sparse_mode=0, --+++++ # # <--- 修改点 2: 启用内部高精度计算 --- --+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++ # inner_precise=1 --+++++ # ) --+++++ --+++++ # # 恢复原始数据类型 --+++++ # attn_output = attn_output.to(input_dtype) --+++++ --+++++ # # 7. 调整输出形状 --+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ # attn_output = self.o_proj(attn_output) --+++++ --+++++ # attn_weights = None --+++++ # if output_attentions: --+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++ # return attn_output, attn_weights, past_key_value --+++++ --+++++ # def forward( --+++++ # self, --+++++ # hidden_states: mindspore.Tensor, --+++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++ # past_key_value: Optional[Cache] = None, --+++++ # output_attentions: bool = False, --+++++ # use_cache: bool = False, --+++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ # bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # query_states = self.q_proj(hidden_states) --+++++ # key_states = self.k_proj(hidden_states) --+++++ # value_states = self.v_proj(hidden_states) --+++++ --+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # kv_seq_len = key_states.shape[-2] --+++++ # if past_key_value is not None: --+++++ # if self.layer_idx is None: --+++++ # raise ValueError("`layer_idx` must be specified for caching") --+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # if past_key_value is not None: --+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ # key_states, value_states = past_key_value.update( --+++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++ # ) --+++++ --+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++ --+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++++ # query_states = query_states / math.sqrt(self.head_dim) --+++++ # # <--- 修改结束 --- --+++++ --+++++ # fa_attention_mask = None --+++++ # if attention_mask is not None: --+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # input_dtype = query_states.dtype --+++++ --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, # 传入已经预先缩放过的 query --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # head_num=self.num_heads, --+++++ # attn_mask=fa_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++++ # input_layout="BNSD", --+++++ # sparse_mode=0, --+++++ # inner_precise=1 # 仍然保持内部高精度计算 --+++++ # ) --+++++ --+++++ # attn_output = attn_output.to(input_dtype) --+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ # attn_output = self.o_proj(attn_output) --+++++ --+++++ # attn_weights = None --+++++ # if output_attentions: --+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++++ --+++++ # return attn_output, attn_weights, past_key_value --+++++ --++++ QWEN2MOE_ATTENTION_CLASSES = { --++++ "eager": Qwen2MoeAttention, --+++++ "flash-attention": Qwen2MoeFlashAttention, --++++ } --++++ --++++ --++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --+++++ #@dwj --+++++ # 只遍历激活的专家,而非全部专家 --++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- hidden_states = hidden_states.view(-1, hidden_dim) --++++- # router_logits: (batch * sequence_length, n_experts) --++++- router_logits = self.gate(hidden_states) --++++- --++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- if self.norm_topk_prob: --++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- # we cast back to the input dtype --++++- routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++- final_hidden_states = ops.zeros( --++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++++- ) --++++- --++++- # One hot encode the selected experts to create an expert mask --++++- # this will be used to easily index which expert is going to be sollicitated --++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++++- --++++- # Loop over all available experts in the model and perform the computation on each expert --++++- for expert_idx in range(self.num_experts): --++++- expert_layer = self.experts[expert_idx] --++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++++- --++++- # Index the correct hidden states and compute the expert hidden state for --++++- # the current expert. We need to make sure to multiply the output hidden --++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++++- if 0 not in idx.shape: --++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++++- --++++- # However `index_add_` only support torch tensors for indexing so we'll use --++++- # the `top_x` tensor here. --++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++++- --++++- shared_expert_output = self.shared_expert(hidden_states) --++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++++- --++++- final_hidden_states = final_hidden_states + shared_expert_output --+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++ num_tokens = hidden_states_reshaped.shape[0] --+++++ --+++++ router_logits = self.gate(hidden_states_reshaped) --+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++ if self.norm_topk_prob: --+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++ flat_selected_experts = selected_experts.flatten() --+++++ --+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++ token_indices = broadcasted_token_indices.flatten() --+++++ --+++++ active_experts = ops.unique(flat_selected_experts) --+++++ --+++++ for expert_idx_tensor in active_experts: --+++++ expert_idx = expert_idx_tensor.item() --+++++ expert_layer = self.experts[expert_idx] --+++++ --+++++ mask = (flat_selected_experts == expert_idx_tensor) --+++++ selected_token_indices = token_indices[mask] --+++++ selected_routing_weights = routing_weights.flatten()[mask] --+++++ --+++++ current_states = hidden_states_reshaped[selected_token_indices] --+++++ --+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++ --+++++ final_hidden_states = final_hidden_states.index_add( --+++++ dim=0, --+++++ index=selected_token_indices, --+++++ source=expert_output.to(hidden_states.dtype) --+++++ ) --+++++ --+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++ --++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++- return final_hidden_states, router_logits --+++++ final_hidden_states = final_hidden_states + shared_expert_output --+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++ --+++++ return final_hidden_states, router_logits --++++ --++++ --++++ class Qwen2MoeDecoderLayer(nn.Module): --++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++++ --++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++ --++++ if (layer_idx not in config.mlp_only_layers) and ( --++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++++ ): --++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --++++ _skip_keys_device_placement = "past_key_values" --++++ _supports_cache_class = True --+++++#lwx --+++++ # _supports_static_cache = True --++++ --++++ def _init_weights(self, module): --++++ std = self.config.initializer_range --++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++++ return causal_mask --++++ --++++ --++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ _tied_weights_keys = ["lm_head.weight"] --++++ --++++ def __init__(self, config): --++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ self.num_experts_per_tok = config.num_experts_per_tok --++++ # Initialize weights and apply final processing --++++ self.post_init() --+++++ # @lwx --+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++++ # self.generation_config.cache_implementation = "static" --+++++ self._warmed_up = False --+++++ --+++++ def warmup_moe_model(self): --+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++++ test_texts = [ --+++++ "warmup short", --+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++++ ] --+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++ if tokenizer is None: --+++++ from mindnlp.transformers import AutoTokenizer --+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++ self._warmup_tokenizer = tokenizer --+++++ --+++++ for text in test_texts: --+++++ inputs = tokenizer(text, return_tensors="ms") --+++++ with mindspore._no_grad(): --+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --++++ --++++ def get_input_embeddings(self): --++++ return self.model.embed_tokens --++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++ ```""" --+++++ if not self._warmed_up: --+++++ self._warmed_up = True --+++++ self.warmup_moe_model() --++++ --++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++++ output_router_logits = ( --++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ } --++++ ) --++++ return model_inputs --+++++# @lwx --+++++ # def _decode_one_tokens_logits( --+++++ # self, --+++++ # cur_token: mindspore.Tensor, --+++++ # input_pos: Optional[mindspore.Tensor], --+++++ # cache_position: mindspore.Tensor, --+++++ # past_key_values: StaticCache, --+++++ # ) -> mindspore.Tensor: --+++++ # """ --+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++++ --+++++ # Args: --+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++++ # input_pos: 输入位置信息,可选 --+++++ # cache_position: 当前token在cache中的位置,shape为(1,) --+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++++ --+++++ # Returns: --+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++++ # """ --+++++ # # 调用JIT编译的版本 --+++++ # return self.get_decode_one_tokens_logits( --+++++ # cur_token=cur_token, --+++++ # input_pos=input_pos, --+++++ # cache_position=cache_position, --+++++ # past_key_values=past_key_values, --+++++ # ) --+++++ --+++++ # @mindspore.jit(jit_level='O1') --+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++++ # """ --+++++ # JIT编译的函数,用于高效的单token解码 --+++++ # 使用JIT编译优化以支持静态shape和高效执行 --+++++ --+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++++ # """ --+++++ # outputs = self.model.forward( --+++++ # input_ids=cur_token, --+++++ # position_ids=input_pos, --+++++ # cache_position=cache_position, --+++++ # past_key_values=past_key_values, --+++++ # use_cache=True, --+++++ # return_dict=False, --+++++ # ) --+++++ --+++++ # hidden_states = outputs[0] --+++++ # logits = self.lm_head.forward(hidden_states) --+++++ # logits = logits.float() --+++++ --+++++ # return logits[:, -1, :] --+++++ --+++++ # def _sample( --+++++ # self, --+++++ # input_ids: mindspore.Tensor, --+++++ # logits_processor, --+++++ # stopping_criteria, --+++++ # generation_config, --+++++ # synced_devices: bool, --+++++ # streamer=None, --+++++ # logits_warper=None, --+++++ # **model_kwargs, --+++++ # ): --+++++ # """ --+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++++ # """ --+++++ # from ...generation.logits_process import LogitsProcessorList --+++++ # from ...generation.stopping_criteria import StoppingCriteriaList --+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++++ # from mindnlp.core import nn, ops, no_grad --+++++ # import numpy as np --+++++ --+++++ # # 检查是否使用 StaticCache --+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++++ # # 否则,直接调用父类方法 --+++++ # past_key_values = model_kwargs.get("past_key_values") --+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++++ --+++++ # if not isinstance(past_key_values, StaticCache): --+++++ # # 不使用 StaticCache,直接调用父类方法 --+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++++ # return super()._sample( --+++++ # input_ids=input_ids, --+++++ # logits_processor=logits_processor, --+++++ # stopping_criteria=stopping_criteria, --+++++ # generation_config=generation_config, --+++++ # synced_devices=synced_devices, --+++++ # streamer=streamer, --+++++ # logits_warper=logits_warper, --+++++ # **model_kwargs, --+++++ # ) --+++++ --+++++ # # 使用 StaticCache,进入自定义循环 --+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++++ # pad_token_id = generation_config._pad_token_tensor --+++++ # output_attentions = generation_config.output_attentions --+++++ # output_hidden_states = generation_config.output_hidden_states --+++++ # output_scores = generation_config.output_scores --+++++ # output_logits = generation_config.output_logits --+++++ # return_dict_in_generate = generation_config.return_dict_in_generate --+++++ # max_length = generation_config.max_length --+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++++ # do_sample = generation_config.do_sample --+++++ --+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++++ # raise ValueError( --+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++++ # f"{logits_warper})." --+++++ # ) --+++++ --+++++ # # init attention / hidden states / scores tuples --+++++ # scores = () if (return_dict_in_generate and output_scores) else None --+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++++ --+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++++ # encoder_hidden_states = ( --+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++++ # ) --+++++ --+++++ # # keep track of which sequences are already finished --+++++ # batch_size, cur_len = input_ids.shape --+++++ # this_peer_finished = False --+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++++ --+++++ # time_record = [] --+++++ # from ....utils.testing_utils import parse_flag_from_env --+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++++ --+++++ # while self._has_unfinished_sequences( --+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++++ # ): --+++++ # if _record_time: --+++++ # import time as time_module --+++++ # infer_start = time_module.time() --+++++ --+++++ # # prepare model inputs --+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++++ --+++++ # # prepare variable output controls --+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++++ --+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++++ # cur_cache_position = model_inputs.get("cache_position") --+++++ # cur_past_key_values = model_inputs.get("past_key_values") --+++++ # cur_input_ids = model_inputs.get("input_ids") --+++++ --+++++ # if (isinstance(cur_past_key_values, StaticCache) and --+++++ # cur_cache_position is not None and --+++++ # len(cur_cache_position.shape) > 0 and --+++++ # cur_cache_position.shape[0] == 1 and --+++++ # cur_input_ids is not None and --+++++ # cur_input_ids.shape[1] == 1): --+++++ # # 使用 JIT 优化的单 token 解码 --+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++++ # if not hasattr(self, '_jit_used'): --+++++ # self._jit_used = False --+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++++ --+++++ # next_token_logits = self.get_decode_one_tokens_logits( --+++++ # cur_token=cur_input_ids, --+++++ # input_pos=model_inputs.get("position_ids"), --+++++ # cache_position=cur_cache_position, --+++++ # past_key_values=cur_past_key_values, --+++++ # ) --+++++ --+++++ # # 标记已使用JIT(用于后续判断) --+++++ # if not self._jit_used: --+++++ # self._jit_used = True --+++++ --+++++ # # 构造兼容的输出对象 --+++++ # class JitOptimizedOutput: --+++++ # def __init__(self, logits, config): --+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++++ # self.config = config --+++++ # # 对于 JIT 优化路径,这些属性通常不需要 --+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++++ # self.attentions = None if not config.is_encoder_decoder else None --+++++ # self.cross_attentions = None --+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++++ # self.hidden_states = None if not config.is_encoder_decoder else None --+++++ --+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++++ # else: --+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++++ # outputs = self(**model_inputs, return_dict=True) --+++++ --+++++ # if synced_devices and this_peer_finished: --+++++ # continue --+++++ --+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++++ # next_token_logits = outputs.logits[:, -1, :] --+++++ --+++++ # # pre-process distribution --+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++++ # if do_sample: --+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++++ --+++++ # # Store scores, attentions and hidden_states when required --+++++ # if return_dict_in_generate: --+++++ # if output_scores: --+++++ # scores += (next_token_scores,) --+++++ # if output_logits: --+++++ # raw_logits += (next_token_logits,) --+++++ # if output_attentions: --+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++++ # decoder_attentions += (attn,) if attn is not None else (None,) --+++++ # if self.config.is_encoder_decoder: --+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++++ --+++++ # if output_hidden_states: --+++++ # hidden = ( --+++++ # outputs.decoder_hidden_states --+++++ # if self.config.is_encoder_decoder --+++++ # else outputs.hidden_states --+++++ # ) --+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++++ --+++++ # # token selection --+++++ # if do_sample: --+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++++ # else: --+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++++ --+++++ # # finished sentences should have their next token be a padding token --+++++ # if has_eos_stopping_criteria: --+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++++ --+++++ # # update generated ids, model inputs, and length for next step --+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++++ # if streamer is not None: --+++++ # streamer.put(next_tokens) --+++++ --+++++ # model_kwargs = self._update_model_kwargs_for_generation( --+++++ # outputs, --+++++ # model_kwargs, --+++++ # is_encoder_decoder=self.config.is_encoder_decoder, --+++++ # ) --+++++ --+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++++ # cur_len += 1 --+++++ --+++++ # if _record_time: --+++++ # import time as time_module --+++++ # infer_stop = time_module.time() --+++++ # time_record.append(infer_stop - infer_start) --+++++ --+++++ # del outputs --+++++ --+++++ # average_infer_time = None --+++++ # if time_record: --+++++ # if len(time_record) > 1: --+++++ # time_record.pop(0) --+++++ # average_infer_time = sum(time_record) / len(time_record) --+++++ # print(f'average inference time is: {average_infer_time}') --+++++ # print(f'inference time record: {time_record}') --+++++ --+++++ # if streamer is not None: --+++++ # streamer.end() --+++++ --+++++ # # 简单判断:打印是否使用了JIT路径 --+++++ # if hasattr(self, '_jit_used') and self._jit_used: --+++++ # print("[JIT] ✓ JIT optimization was used during generation") --+++++ # else: --+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++++ --+++++ # if return_dict_in_generate: --+++++ # if self.config.is_encoder_decoder: --+++++ # return GenerateEncoderDecoderOutput( --+++++ # sequences=input_ids, --+++++ # scores=scores, --+++++ # logits=raw_logits, --+++++ # encoder_attentions=encoder_attentions, --+++++ # encoder_hidden_states=encoder_hidden_states, --+++++ # decoder_attentions=decoder_attentions, --+++++ # cross_attentions=cross_attentions, --+++++ # decoder_hidden_states=decoder_hidden_states, --+++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++ # average_infer_time=average_infer_time --+++++ # ) --+++++ # else: --+++++ # return GenerateDecoderOnlyOutput( --+++++ # sequences=input_ids, --+++++ # scores=scores, --+++++ # logits=raw_logits, --+++++ # attentions=decoder_attentions, --+++++ # hidden_states=decoder_hidden_states, --+++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++ # average_infer_time=average_infer_time --+++++ # ) --+++++ # else: --+++++ # return input_ids --+++++ --+++++ # def _prepare_cache_for_generation( --+++++ # self, --+++++ # generation_config, --+++++ # model_kwargs, --+++++ # assistant_model, --+++++ # batch_size, --+++++ # max_cache_length, --+++++ # ): --+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++++ # generation_config.cache_implementation = "static" --+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++++ --+++++ # if generation_config.cache_implementation == "static": --+++++ # base_required_from_max_length = generation_config.max_length + 1 --+++++ # base_required = max(max_cache_length, base_required_from_max_length) --+++++ # min_cache_size = 50 --+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++++ # else: --+++++ # max_cache_length = max(base_required, min_cache_size) --+++++ --+++++ # original_max_cache_length = max_cache_length --+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++++ # print(f" - input max_cache_length: {original_max_cache_length}") --+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++++ # print(f" - final max_cache_length: {max_cache_length}") --+++++ --+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++ # if max_cache_length > self.config.max_position_embeddings: --+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++++ --+++++ # result = super()._prepare_cache_for_generation( --+++++ # generation_config=generation_config, --+++++ # model_kwargs=model_kwargs, --+++++ # assistant_model=assistant_model, --+++++ # batch_size=batch_size, --+++++ # max_cache_length=max_cache_length, --+++++ # ) --+++++ --+++++ # if generation_config.cache_implementation == "static": --+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++++ # created_cache = model_kwargs.get(cache_name) --+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++++ # if created_cache.max_cache_len < generation_config.max_length: --+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++++ --+++++ # return result --+++++ --+++++ --+++++ --++++ --++++ --++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++++-- --++++2.27.0 --++++ --+++-- --+++2.27.0 --+++ --++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --++new file mode 100644 --++index 00000000..966529e4 --++--- /dev/null --+++++ b/patches/0003-20261106secondcommit.patch --++@@ -0,0 +1,2769 @@ --+++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Thu, 6 Nov 2025 14:54:37 +0800 --+++Subject: [PATCH 3/3] 20261106secondcommit --+++ --+++--- --+++ .../models/deepseek/modeling_deepseek.py | 217 ++- --+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- --+++ patches/0001-20251104commit.patch | 1272 ----------------- --+++ 3 files changed, 528 insertions(+), 2032 deletions(-) --+++ delete mode 100644 patches/0001-20251104commit.patch --+++ --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index 73773c22..2f9192bf 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) --+++ --+++ _CONFIG_FOR_DOC = "DeepseekConfig" --+++ --++++_attn_mask_cache = {} --++++ --++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --++++ q_len = batch_and_seq[1] --++++ kv_len = batch_and_seq[1] + past_key_values_length --++++ key = (batch_and_seq[0], q_len, kv_len) --++++ --++++ if key in _attn_mask_cache: --++++ return _attn_mask_cache[key] --++++ --++++ mask = _prepare_4d_causal_attention_mask( --++++ attention_mask, --++++ batch_and_seq, --++++ inputs_embeds, --++++ past_key_values_length, --++++ ) --++++ _attn_mask_cache[key] = mask --++++ return mask --+++ --+++ def _get_unpad_data(attention_mask): --+++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --+++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): --+++ return final_output --+++ --+++ --+++- @no_grad() --+++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++- expert_cache = ops.zeros_like(x) --+++- idxs = flat_expert_indices.argsort() --+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++- token_idxs = idxs // self.num_experts_per_tok --+++- --+++- for i, end_idx in enumerate(tokens_per_expert): --+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++- if start_idx == end_idx: --+++- continue --+++- expert = self.experts[i] --+++- exp_token_idx = token_idxs[start_idx:end_idx] --+++- expert_tokens = x[exp_token_idx] --+++- expert_out = expert(expert_tokens) --+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++- --+++- return expert_cache --+++- --+++ # @no_grad() --+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++- # # expert_cache = torch.zeros_like(x) --+++- # # idxs = flat_expert_indices.argsort() --+++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++- # # token_idxs = idxs // self.num_experts_per_tok --+++- # # for i, end_idx in enumerate(tokens_per_expert): --+++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++- # # if start_idx == end_idx: --+++- # # continue --+++- # # expert = self.experts[i] --+++- # # exp_token_idx = token_idxs[start_idx:end_idx] --+++- # # expert_tokens = x[exp_token_idx] --+++- # # expert_out = expert(expert_tokens) --+++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++- # # return expert_cache --++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++ # expert_cache = ops.zeros_like(x) --+++ # idxs = flat_expert_indices.argsort() --+++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): --+++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++ --+++ # return expert_cache --+++- # @no_grad() --+++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++- # expert_cache = ops.zeros_like(x) --++++ --++++ @no_grad() --++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++ """ --++++ 优化版 MoE prefill: --++++ - 批量张量化处理同一个 expert 的所有 token --++++ - 跳过无 token 的专家 --++++ - 保持结果完全一致 --++++ """ --++++ # 初始化输出缓存 --++++ expert_cache = ops.zeros_like(x) --+++ --+++- # # 排序保证顺序一致 --+++- # idxs = flat_expert_indices.argsort() --+++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++- # token_idxs = idxs // self.num_experts_per_tok --++++ # 排序(确保 scatter_add 位置对应原逻辑) --++++ idxs = flat_expert_indices.argsort() --++++ sorted_expert_indices = flat_expert_indices[idxs] --++++ sorted_token_indices = idxs // self.num_experts_per_tok --+++ --+++- # # 找出有 token 的专家 --+++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++ # 每个 expert 的 token 数 --++++ tokens_per_expert = sorted_expert_indices.bincount() --+++ --+++- # for i in active_experts.tolist(): --+++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++- # end_idx = tokens_per_expert[i] --+++- # if start_idx == end_idx: # 没有 token --+++- # continue --++++ # 找出有 token 的专家 --++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+++ --+++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++- # expert_tokens = x[exp_token_idx] --+++- # expert_out = self.experts[i](expert_tokens) --+++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++ for expert_id in active_experts.tolist(): --++++ # 取该 expert 对应的排序后 token 区间 --++++ start = (tokens_per_expert[:expert_id]).sum().item() --++++ end = start + tokens_per_expert[expert_id].item() --+++ --+++- # expert_cache = mindspore.mint.scatter_add( --+++- # expert_cache, --+++- # 0, --+++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++- # expert_out --+++- # ) --++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 --++++ expert_tokens = x[token_idx] # 取输入向量 --+++ --+++- # return expert_cache --++++ # 执行专家 MLP --++++ expert_out = self.experts[expert_id](expert_tokens) --++++ --++++ # 按权重缩放 --++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --++++ --++++ # 回写到缓存(等价 scatter_add) --++++ expert_cache = mindspore.mint.scatter_add( --++++ expert_cache, --++++ 0, --++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++ scaled_out --++++ ) --++++ --++++ return expert_cache --++++ --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # # expert_cache = torch.zeros_like(x) --++++ # # idxs = flat_expert_indices.argsort() --++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++ # # token_idxs = idxs // self.num_experts_per_tok --++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++ # # if start_idx == end_idx: --++++ # # continue --++++ # # expert = self.experts[i] --++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # # expert_tokens = x[exp_token_idx] --++++ # # expert_out = expert(expert_tokens) --++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++ # # return expert_cache --++++ # expert_cache = ops.zeros_like(x) --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # for i, end_idx in enumerate(tokens_per_expert): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # if start_idx == end_idx: --++++ # continue --++++ # expert = self.experts[i] --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = expert(expert_tokens) --++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --++++ # return expert_cache --++++ # @no_grad() --++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++ # expert_cache = ops.zeros_like(x) --++++ --++++ # # 排序保证顺序一致 --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ # token_idxs = idxs // self.num_experts_per_tok --++++ --++++ # # 找出有 token 的专家 --++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++ --++++ # for i in active_experts.tolist(): --++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ # end_idx = tokens_per_expert[i] --++++ # if start_idx == end_idx: # 没有 token --++++ # continue --++++ --++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++ # expert_tokens = x[exp_token_idx] --++++ # expert_out = self.experts[i](expert_tokens) --++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++ --++++ # expert_cache = mindspore.mint.scatter_add( --++++ # expert_cache, --++++ # 0, --++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++ # expert_out --++++ # ) --++++ --++++ # return expert_cache --+++ --+++ --+++ --+++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++- --+++ # class DeepseekFlashAttention(nn.Module): --+++ # """ --+++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --++++ --+++ Deepseek_ATTENTION_CLASSES = { --+++ "eager": DeepseekAttention, --+++ "flash-attention": DeepseekFlashAttention, --+++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): --+++ ) --+++ else: --+++ # 4d mask is passed through the layers --+++- attention_mask = _prepare_4d_causal_attention_mask( --++++ # attention_mask = _prepare_4d_causal_attention_mask( --++++ # attention_mask, --++++ # (batch_size, seq_length), --++++ # inputs_embeds, --++++ # past_key_values_length, --++++ # ) --++++ #@dwj --++++ attention_mask = get_cached_causal_mask( --+++ attention_mask, --+++ (batch_size, seq_length), --+++ inputs_embeds, --+++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ # Initialize weights and apply final processing --+++ self.post_init() --+++ self.warm_up = False --++++ #@dwj --++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --++++ self.num_layers, --++++ self.num_attention_heads, --++++ self.head_dim, --++++ batch_size=1, --++++ max_length=self.max_length, --++++ dtype=mindspore.float16 --++++ ) --++++ --++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --++++ key_cache = [] --++++ value_cache = [] --++++ for _ in range(num_layers): --++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++ key_cache.append(k) --++++ value_cache.append(v) --++++ return key_cache, value_cache --++++ --+++ --+++ def warmup_moe_model_deep(self): --+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++index bced285c..ebd7782e 100644 --+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) --+++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+++ --+++-Long_Prompt = False --+++-PROMPT_LENGTH_THRESHOLD = 128 --++++Long_Prompt = 1 --++++LONG_PROMPT_LENGTH_THRESHOLD = 128 --++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 --++++ --++++_causal_mask_cache = {} --++++ --++++def get_cached_causal_mask_with_cache_position( --++++ attention_mask: mindspore.Tensor, --++++ sequence_length: int, --++++ target_length: int, --++++ dtype: mindspore.dtype, --++++ min_dtype: float, --++++ cache_position: mindspore.Tensor, --++++ batch_size: int, --++++): --++++ """ --++++ 带缓存的 causal mask 构造函数 --++++ """ --++++ # q_len 是当前 query 长度 --++++ q_len = sequence_length --++++ # kv_len 是 target_length --++++ kv_len = target_length --++++ --++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) --++++ --++++ if key in _causal_mask_cache: --++++ return _causal_mask_cache[key] --++++ --++++ # 调用原来的 mask 构造逻辑 --++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++ attention_mask, --++++ sequence_length=sequence_length, --++++ target_length=target_length, --++++ dtype=dtype, --++++ min_dtype=min_dtype, --++++ cache_position=cache_position, --++++ batch_size=batch_size, --++++ ) --++++ # 缓存结果 --++++ _causal_mask_cache[key] = causal_mask --++++ return causal_mask --+++ --+++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+++ def _prepare_4d_causal_attention_mask_with_cache_position( --+++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++ --+++ --+++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --++++# class Qwen2MoeAttention(nn.Module): --++++# """ --++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --++++# and "Generating Long Sequences with Sparse Transformers". --++++# """ --++++ --++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++# super().__init__() --++++# self.config = config --++++# self.layer_idx = layer_idx --++++# if layer_idx is None: --++++# logger.warning_once( --++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++# "when creating this class." --++++# ) --++++ --++++# self.hidden_size = config.hidden_size --++++# self.num_heads = config.num_attention_heads --++++# self.head_dim = self.hidden_size // self.num_heads --++++# self.num_key_value_heads = config.num_key_value_heads --++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++# self.max_position_embeddings = config.max_position_embeddings --++++# self.rope_theta = config.rope_theta --++++# self.is_causal = True --++++# self.attention_dropout = config.attention_dropout --++++ --++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++# raise ValueError( --++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++# f" and `num_heads`: {self.num_heads})." --++++# ) --++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++ --++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++# self.head_dim, --++++# max_position_embeddings=self.max_position_embeddings, --++++# base=self.rope_theta, --++++# ) --++++ --++++# def forward( --++++# self, --++++# hidden_states: mindspore.Tensor, --++++# attention_mask: Optional[mindspore.Tensor] = None, --++++# position_ids: Optional[mindspore.Tensor] = None, --++++# past_key_value: Optional[Cache] = None, --++++# output_attentions: bool = False, --++++# use_cache: bool = False, --++++# cache_position: Optional[mindspore.Tensor] = None, --++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++ --++++ --++++ --++++# bsz, q_len, _ = hidden_states.shape --++++ --++++# query_states = self.q_proj(hidden_states) --++++# key_states = self.k_proj(hidden_states) --++++# value_states = self.v_proj(hidden_states) --++++ --++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++ --++++# kv_seq_len = key_states.shape[-2] --++++# if past_key_value is not None: --++++# if self.layer_idx is None: --++++# raise ValueError( --++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++# "with a layer index." --++++# ) --++++# if isinstance(past_key_value, StaticCache): --++++# kv_seq_len = key_states.shape[-2] --++++# else: --++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++# if past_key_value is not None: --++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++# if isinstance(past_key_value, StaticCache): --++++# kv_seq_len = key_states.shape[-2] --++++ --++++# # repeat k/v heads if n_kv_heads < n_heads --++++# key_states = repeat_kv(key_states, self.num_key_value_groups) --++++# value_states = repeat_kv(value_states, self.num_key_value_groups) --++++ --++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++ --++++# if attention_mask is not None: --++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++# attn_weights = attn_weights + causal_mask --++++ --++++# # upcast attention to fp32 --++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++++# attn_output = ops.matmul(attn_weights, value_states) --++++ --++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++++# raise ValueError( --++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --++++# f" {attn_output.shape}" --++++# ) --++++ --++++# attn_output = ops.transpose(attn_output, 1, 2) --++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++ --++++# attn_output = self.o_proj(attn_output) --++++# # @lwx --++++ --++++# # max_seq_len = self.max_position_embeddings # 2048 --++++ --++++# # if attention_mask is not None: --++++# # # attention_mask: [B, 1, Sq, Sk] --++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++ --++++# # # pad 到 [max_seq_len, max_seq_len] --++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++# # global_attention_mask = padded_mask --++++# # else: --++++# # global_attention_mask = None --++++ --++++ --++++# # sparse_mode=3 --++++# # attn_output = mindspore.ops.flash_attention_score( --++++# # query=query_states, --++++# # key=key_states, --++++# # value=value_states, --++++# # real_shift=None, --++++# # padding_mask=None, --++++ --++++# # head_num=self.num_heads, --++++# # attn_mask=global_attention_mask, --++++# # keep_prob=1.0 - self.attention_dropout, --++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++++# # input_layout="BNSD", --++++# # pre_tokens=2147483647, --++++# # next_tokens=2147483647, --++++# # inner_precise=0, --++++# # drop_mask=None, --++++# # prefix=None, --++++# # actual_seq_qlen=None, --++++# # actual_seq_kvlen=None, --++++# # sparse_mode=sparse_mode, --++++# # ) --++++# if not output_attentions: --++++# attn_weights = None --++++ --++++# return attn_output, attn_weights, past_key_value --++++ --+++ class Qwen2MoeAttention(nn.Module): --+++ """ --+++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+++- and "Generating Long Sequences with Sparse Transformers". --+++- """ --++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 --+++ --++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --++++ --++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --++++ """ --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++ super().__init__() --+++ self.config = config --+++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): --+++ if layer_idx is None: --+++ logger.warning_once( --+++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++ "when creating this class." --+++ ) --+++ --+++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): --+++ use_cache: bool = False, --+++ cache_position: Optional[mindspore.Tensor] = None, --+++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++- --+++ --+++- --++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- --+++ bsz, q_len, _ = hidden_states.shape --+++ --+++ query_states = self.q_proj(hidden_states) --+++ key_states = self.k_proj(hidden_states) --+++ value_states = self.v_proj(hidden_states) --+++ --+++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++- --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ --+++ kv_seq_len = key_states.shape[-2] --+++ if past_key_value is not None: --+++- if self.layer_idx is None: --+++- raise ValueError( --+++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++- "with a layer index." --+++- ) --+++- if isinstance(past_key_value, StaticCache): --+++- kv_seq_len = key_states.shape[-2] --+++- else: --+++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --+++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++ --+++ if past_key_value is not None: --+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++ --++++ # --- 2. 动态调度核心注意力计算 --- --++++ global Long_Prompt --++++ if Long_Prompt >= 1: --++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --++++ fa_attention_mask = None --++++ if attention_mask is not None: --++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++ fa_attention_mask = (mask_slice != 0) --++++ --++++ attn_output = mindspore.ops.flash_attention_score( --++++ query=query_states, --++++ key=key_states, --++++ value=value_states, --++++ head_num=self.num_heads, --++++ attn_mask=fa_attention_mask, --++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++ input_layout="BNSD", --++++ sparse_mode=0, --++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --++++ ) --+++ --+++- if isinstance(past_key_value, StaticCache): --+++- kv_seq_len = key_states.shape[-2] --++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ attn_output = self.o_proj(attn_output) --++++ attn_weights = None --++++ if output_attentions: --++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+++ --+++- # repeat k/v heads if n_kv_heads < n_heads --+++- key_states = repeat_kv(key_states, self.num_key_value_groups) --+++- value_states = repeat_kv(value_states, self.num_key_value_groups) --+++- --+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++ else: --++++ # --- Eager Attention 路径 (用于短序列和解码) --- --++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++ --++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++ --+++- if attention_mask is not None: --+++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++- attn_weights = attn_weights + causal_mask --++++ if attention_mask is not None: --++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++ attn_weights = attn_weights + causal_mask --+++ --+++- # upcast attention to fp32 --+++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++- attn_output = ops.matmul(attn_weights, value_states) --++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++++ attn_output = ops.matmul(attn_weights, value_states) --+++ --+++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++- raise ValueError( --+++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+++- f" {attn_output.shape}" --+++- ) --++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++++ raise ValueError( --++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --++++ ) --+++ --+++- attn_output = ops.transpose(attn_output, 1, 2) --+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++ attn_output = ops.transpose(attn_output, 1, 2) --++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++ attn_output = self.o_proj(attn_output) --+++ --+++- attn_output = self.o_proj(attn_output) --+++- # @lwx --++++ if not output_attentions: --++++ attn_weights = None --+++ --+++- # max_seq_len = self.max_position_embeddings # 2048 --+++- --+++- # if attention_mask is not None: --+++- # # attention_mask: [B, 1, Sq, Sk] --+++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++- --+++- # # pad 到 [max_seq_len, max_seq_len] --+++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++- # global_attention_mask = padded_mask --+++- # else: --+++- # global_attention_mask = None --+++- --+++- --+++- # sparse_mode=3 --+++- # attn_output = mindspore.ops.flash_attention_score( --+++- # query=query_states, --+++- # key=key_states, --+++- # value=value_states, --+++- # real_shift=None, --+++- # padding_mask=None, --+++- --+++- # head_num=self.num_heads, --+++- # attn_mask=global_attention_mask, --+++- # keep_prob=1.0 - self.attention_dropout, --+++- # scalar_value=1.0 / math.sqrt(self.head_dim), --+++- # input_layout="BNSD", --+++- # pre_tokens=2147483647, --+++- # next_tokens=2147483647, --+++- # inner_precise=0, --+++- # drop_mask=None, --+++- # prefix=None, --+++- # actual_seq_qlen=None, --+++- # actual_seq_kvlen=None, --+++- # sparse_mode=sparse_mode, --+++- # ) --+++- if not output_attentions: --+++- attn_weights = None --+++- --+++ return attn_output, attn_weights, past_key_value --+++ --+++- --+++ # class Qwen2MoeFlashAttention(nn.Module): --+++ # """ --+++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { --+++ # return final_hidden_states, router_logits --+++ --+++ --+++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++-# """ --+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+++-# `_moe_infer_prefill` (用于长序列处理) 方法。 --+++-# """ --+++-# def __init__(self, config: Qwen2MoeConfig): --+++-# super().__init__() --+++-# self.num_experts = config.num_experts --+++-# self.top_k = config.num_experts_per_tok --+++-# self.norm_topk_prob = config.norm_topk_prob --+++- --+++-# # 门控网络 --+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++-# # 专家列表 --+++-# self.experts = nn.ModuleList( --+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++-# ) --+++-# # 共享专家 --+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-# @no_grad() --+++-# def _moe_infer_decode( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# """ --+++-# 【解码路径】针对 sequence_length=1 的极致优化。 --+++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+++-# """ --+++-# batch_size, hidden_dim = hidden_states.shape --+++- --+++-# expert_outputs_list = [ --+++-# ops.cat([ --+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++-# ], dim=0) --+++-# for i in range(batch_size) --+++-# ] --+++- --+++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+++-# # shape: (batch_size, top_k, hidden_dim) --+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++- --+++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++- --+++-# return moe_output.squeeze(1) --+++- --+++-# @no_grad() --+++-# def _moe_infer_prefill( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# """ --+++-# 【预填充路径】针对 sequence_length > 1 的优化。 --+++-# 按专家对 Token 进行分组,并进行批处理。 --+++-# """ --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens = hidden_states.shape[0] --+++-# flat_selected_experts = selected_experts.flatten() --+++- --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++- --+++-# active_experts = ops.unique(flat_selected_experts) --+++- --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++- --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++-# selected_token_indices = token_indices[mask] --+++-# selected_routing_weights = routing_weights.flatten()[mask] --+++- --+++-# current_states = hidden_states[selected_token_indices] --+++- --+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++- --+++-# moe_output = moe_output.index_add( --+++-# dim=0, --+++-# index=selected_token_indices, --+++-# source=expert_output.to(hidden_states.dtype) --+++-# ) --+++-# return moe_output --+++- --+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-# """ --+++-# 顶层 forward 方法,作为智能分发器。 --+++-# """ --+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- --+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-# router_logits = self.gate(hidden_states_reshaped) --+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- --+++-# if self.norm_topk_prob: --+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- --+++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++-# moe_output = None --+++-# # 在推理时,根据序列长度选择最优路径 --+++-# if not self.training: --+++-# if sequence_length == 1: --+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++-# else: --+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++-# else: --+++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+++-# raise NotImplementedError("Training path is not implemented.") --+++- --+++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+++- --+++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+++- --+++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+++- --+++-# return final_hidden_states, router_logits --+++- --+++- --+++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++-# """ --+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+++-# """ --+++-# def __init__(self, config: Qwen2MoeConfig): --+++-# super().__init__() --+++-# self.num_experts = config.num_experts --+++-# self.top_k = config.num_experts_per_tok --+++-# self.norm_topk_prob = config.norm_topk_prob --+++- --+++-# # 门控网络 --+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++-# # 专家列表 --+++-# self.experts = nn.ModuleList( --+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++-# ) --+++-# # 共享专家 --+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-# @no_grad() --+++-# def _moe_infer_decode( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# batch_size, _ = hidden_states.shape --+++-# expert_outputs_list = [ --+++-# ops.cat([ --+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++-# ], dim=0) --+++-# for i in range(batch_size) --+++-# ] --+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++-# return moe_output.squeeze(1) --+++- --+++-# @no_grad() --+++-# def _moe_infer_prefill( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens = hidden_states.shape[0] --+++-# flat_selected_experts = selected_experts.flatten() --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++-# active_experts = ops.unique(flat_selected_experts) --+++- --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++-# selected_token_indices = token_indices[mask] --+++-# selected_routing_weights = routing_weights.flatten()[mask] --+++-# current_states = hidden_states[selected_token_indices] --+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++-# moe_output = moe_output.index_add( --+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++-# ) --+++-# return moe_output --+++- --+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-# """ --+++-# 顶层 forward 方法,作为智能分发器。 --+++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+++-# """ --+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- --+++-# # 1. 门控计算 (通用逻辑) --+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-# router_logits = self.gate(hidden_states_reshaped) --+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- --+++-# if self.norm_topk_prob: --+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- --+++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++-# # 2. 智能分发到最优 MoE 路径 --+++-# moe_output = None --+++-# if not self.training: --+++-# if sequence_length == 1: --+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++-# else: --+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++-# else: --+++-# raise NotImplementedError("Training path is not implemented.") --+++- --+++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++- --+++-# # 4. 合并 MoE 输出和共享专家输出 --+++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++- --+++-# # 5. 恢复原始形状并返回 --+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++- --+++-# return final_hidden_states, router_logits --+++- --+++-# prefill fastest --+++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++-# """ --+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+++-# """ --+++-# def __init__(self, config: Qwen2MoeConfig): --+++-# super().__init__() --+++-# self.num_experts = config.num_experts --+++-# self.top_k = config.num_experts_per_tok --+++-# self.norm_topk_prob = config.norm_topk_prob --+++- --+++-# # 门控网络 --+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++-# # 专家列表 --+++-# self.experts = nn.ModuleList( --+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++-# ) --+++-# # 共享专家 --+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-# @no_grad() --+++-# def _moe_infer_dispatch( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# """ --+++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+++-# """ --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens, _ = hidden_states.shape --+++- --+++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+++-# flat_selected_experts = selected_experts.flatten() --+++-# flat_routing_weights = routing_weights.flatten() --+++- --+++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++- --+++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+++-# active_experts = ops.unique(flat_selected_experts) --+++- --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++- --+++-# # 找到所有分配给该专家的 token --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++- --+++-# # 使用 mask 选取对应的 token 和权重 --+++-# current_token_indices = token_indices[mask] --+++-# current_routing_weights = flat_routing_weights[mask] --+++-# current_hidden_states = hidden_states[current_token_indices] --+++- --+++-# # 对这些 token 进行批处理 --+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++- --+++-# # 使用 index_add 将结果精确地加回到对应位置 --+++-# moe_output = moe_output.index_add( --+++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+++-# ) --+++-# return moe_output --+++- --+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-# """ --+++-# 顶层 forward 方法,作为智能分发器。 --+++-# """ --+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- --+++-# # 1. 门控计算 --+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-# router_logits = self.gate(hidden_states_reshaped) --+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- --+++-# if self.norm_topk_prob: --+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- --+++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++- --+++-# # 2. 调用统一的 MoE 计算内核 --+++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+++- --+++-# # 3. 统一处理共享专家 --+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++- --+++-# # 4. 合并输出 --+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++- --+++-# # 5. 恢复原始形状并返回 --+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++- --+++-# return final_hidden_states, router_logits --+++- --+++- --+++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++-# """ --+++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++-# 【最终高性能与高精度版】: --+++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+++-# 3. 这样实现了速度和准确性的两全其美。 --+++-# """ --+++-# def __init__(self, config: Qwen2MoeConfig): --+++-# super().__init__() --+++-# self.num_experts = config.num_experts --+++-# self.top_k = config.num_experts_per_tok --+++-# self.norm_topk_prob = config.norm_topk_prob --+++- --+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++-# self.experts = nn.ModuleList( --+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++-# ) --+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-# @no_grad() --+++-# def _moe_infer_decode( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# """ --+++-# 【解码路径】极致优化版:bmm + 高精度累加。 --+++-# """ --+++-# original_dtype = hidden_states.dtype --+++-# batch_size, _ = hidden_states.shape --+++- --+++-# expert_outputs_list = [ --+++-# ops.cat([ --+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++-# ], dim=0) --+++-# for i in range(batch_size) --+++-# ] --+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++- --+++-# # 在 float32 下执行 bmm,得到高精度结果 --+++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++- --+++-# # 将高精度结果转换回原始数据类型 --+++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+++- --+++-# return moe_output --+++- --+++-# @no_grad() --+++-# def _moe_infer_prefill( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# selected_experts: mindspore.Tensor, --+++-# routing_weights: mindspore.Tensor --+++-# ) -> mindspore.Tensor: --+++-# """ --+++-# 【预填充路径】与原始实现一致,结果精确。 --+++-# """ --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens, _ = hidden_states.shape --+++-# flat_selected_experts = selected_experts.flatten() --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++-# active_experts = ops.unique(flat_selected_experts) --+++- --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++-# selected_token_indices = token_indices[mask] --+++-# selected_routing_weights = routing_weights.flatten()[mask] --+++-# current_states = hidden_states[selected_token_indices] --+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++-# moe_output = moe_output.index_add( --+++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++-# ) --+++-# return moe_output --+++- --+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++- --+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-# router_logits = self.gate(hidden_states_reshaped) --+++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++- --+++-# if self.norm_topk_prob: --+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- --+++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+++-# # 如果模型主体是 float16,后续再转换 --+++- --+++-# moe_output = None --+++-# if not self.training: --+++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+++-# # _moe_infer_decode 内部会处理好类型转换 --+++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+++-# if sequence_length == 1: --+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++-# else: --+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++-# else: --+++-# raise NotImplementedError("Training path is not implemented.") --+++- --+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++- --+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++- --+++-# return final_hidden_states, router_logits --+++- --+++- --+++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++-# """ --+++-# 【融合版】一个混合专家模块,内置两种推理策略, --+++-# 由外部全局变量 `Long_Prompt` 控制: --+++- --+++-# - if Long_Prompt is True: 【精度优先模式】 --+++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+++-# 适用于处理长序列,避免误差累积。 --+++- --+++-# - if Long_Prompt is False: 【速度优先模式】 --+++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+++-# 在解码阶段获得极致速度,同时保证结果高度准确。 --+++-# """ --+++-# def __init__(self, config: Qwen2MoeConfig): --+++-# super().__init__() --+++-# self.num_experts = config.num_experts --+++-# self.top_k = config.num_experts_per_tok --+++-# self.norm_topk_prob = config.norm_topk_prob --+++- --+++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++-# self.experts = nn.ModuleList( --+++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++-# ) --+++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-# # --- 速度优先模式的辅助函数 --- --+++-# @no_grad() --+++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++-# original_dtype = hidden_states.dtype --+++-# batch_size, _ = hidden_states.shape --+++-# expert_outputs_list = [ --+++-# ops.cat([ --+++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++-# ], dim=0) --+++-# for i in range(batch_size) --+++-# ] --+++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++-# weights_fp32 = routing_weights.to(mindspore.float32) --+++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++-# return moe_output_fp32.squeeze(1).to(original_dtype) --+++- --+++-# @no_grad() --+++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens, _ = hidden_states.shape --+++-# flat_selected_experts = selected_experts.flatten() --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++-# active_experts = ops.unique(flat_selected_experts) --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++-# selected_token_indices = token_indices[mask] --+++-# selected_routing_weights = routing_weights.flatten()[mask] --+++-# current_states = hidden_states[selected_token_indices] --+++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+++-# return moe_output --+++- --+++-# # --- 精度优先模式的辅助函数 --- --+++-# @no_grad() --+++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++-# moe_output = ops.zeros_like(hidden_states) --+++-# num_tokens, _ = hidden_states.shape --+++-# flat_selected_experts = selected_experts.flatten() --+++-# flat_routing_weights = routing_weights.flatten() --+++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++-# active_experts = ops.unique(flat_selected_experts) --+++-# for expert_idx_tensor in active_experts: --+++-# expert_idx = expert_idx_tensor.item() --+++-# expert_layer = self.experts[expert_idx] --+++-# mask = (flat_selected_experts == expert_idx_tensor) --+++-# current_token_indices = token_indices[mask] --+++-# current_routing_weights = flat_routing_weights[mask] --+++-# current_hidden_states = hidden_states[current_token_indices] --+++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++-# return moe_output --+++- --+++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-# # 声明我们将要使用一个在模块外部定义的全局变量 --+++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+++-# global Long_Prompt --+++- --+++-# # 1. 门控计算 (所有模式通用) --+++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-# router_logits = self.gate(hidden_states_reshaped) --+++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++-# if self.norm_topk_prob: --+++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++- --+++-# moe_output = None --+++-# if not self.training: --+++-# # 根据 Long_Prompt 标志选择模式 --+++-# if Long_Prompt: --+++-# # --- 精度优先模式 --- --+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++-# else: --+++-# # --- 速度优先模式 --- --+++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++-# if sequence_length == 1: --+++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++-# else: --+++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++-# else: --+++-# raise NotImplementedError("Training path is not implemented.") --+++- --+++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++- --+++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++- --+++-# return final_hidden_states, router_logits --+++- --+++ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ """ --+++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++ return moe_output_fp32.squeeze(1).to(original_dtype) --+++ --++++ # @no_grad() --++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++ # num_tokens, _ = hidden_states.shape --++++ # flat_selected_experts = selected_experts.flatten() --++++ # sorted_expert_indices = flat_selected_experts.argsort() --++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++++ # original_token_indices = sorted_expert_indices // self.top_k --++++ # moe_output = ops.zeros_like(hidden_states) --++++ # current_token_offset = 0 --++++ # for i in range(self.num_experts): --++++ # expert_token_count = tokens_per_expert[i] - current_token_offset --++++ # if expert_token_count == 0: --++++ # continue --++++ # end_offset = current_token_offset + expert_token_count --++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++++ # expert_hidden_states = hidden_states[expert_original_token_indices] --++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++++ # current_token_offset += expert_token_count --++++ # return moe_output --++++ --+++ @no_grad() --+++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++- num_tokens, _ = hidden_states.shape --+++- flat_selected_experts = selected_experts.flatten() --+++- sorted_expert_indices = flat_selected_experts.argsort() --+++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++- original_token_indices = sorted_expert_indices // self.top_k --++++ """ --++++ 优化版 MoE prefill (速度优先模式): --++++ - 批量张量化处理同一个 expert 的所有 token --++++ - 跳过无 token 的专家 --++++ - 保持结果完全一致 --++++ """ --+++ moe_output = ops.zeros_like(hidden_states) --+++- current_token_offset = 0 --+++- for i in range(self.num_experts): --+++- expert_token_count = tokens_per_expert[i] - current_token_offset --+++- if expert_token_count == 0: --+++- continue --+++- end_offset = current_token_offset + expert_token_count --+++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++- expert_hidden_states = hidden_states[expert_original_token_indices] --+++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++- current_token_offset += expert_token_count --++++ --++++ flat_selected_experts = selected_experts.flatten() --++++ flat_routing_weights = routing_weights.flatten() --++++ --++++ idxs = flat_selected_experts.argsort() --++++ sorted_expert_indices = flat_selected_experts[idxs] --++++ sorted_token_indices = idxs // self.top_k --++++ --++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --++++ --++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --++++ --++++ for expert_id in active_experts.tolist(): --++++ start = int(tokens_per_expert[:expert_id].sum().item()) --++++ end = start + int(tokens_per_expert[expert_id].item()) --++++ --++++ token_idx = sorted_token_indices[start:end] --++++ expert_tokens = hidden_states[token_idx] --++++ --++++ expert_out = self.experts[expert_id](expert_tokens) --++++ --++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --++++ --++++ moe_output = mindspore.mint.scatter_add( --++++ moe_output, --++++ 0, --++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --++++ scaled_out.to(hidden_states.dtype) --++++ ) --++++ --+++ return moe_output --+++ --++++ --+++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+++ @no_grad() --+++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++ --+++ moe_output = None --+++- if Long_Prompt: --+++- # --- 精度优先模式 (ACCURACY MODE) --- --+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ # if Long_Prompt==0: --++++ # # --- 精度优先模式 (ACCURACY MODE) --- --++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ # else: --++++ # # --- 速度优先模式 (SPEED MODE) --- --++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++ # if sequence_length == 1: --++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ # else: --++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ --++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++ if sequence_length == 1: --++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++ else: --+++- # --- 速度优先模式 (SPEED MODE) --- --+++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++- if sequence_length == 1: --+++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++- else: --+++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++- --++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ --+++ --+++ # 3. 共享专家计算与合并 (所有模式通用) --+++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++ --+++ return final_hidden_states, router_logits --+++ --++++ --+++ class Qwen2MoeDecoderLayer(nn.Module): --+++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+++ super().__init__() --+++ self.hidden_size = config.hidden_size --+++ --+++- # if Long_Prompt: --+++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++- # else: --++++ # if Long_Prompt == 2: --+++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++ # else: --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ --+++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++ --+++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++ ) --+++ --+++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --+++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++ # attention_mask, --++++ # sequence_length=sequence_length, --++++ # target_length=target_length, --++++ # dtype=dtype, --++++ # min_dtype=min_dtype, --++++ # cache_position=cache_position, --++++ # batch_size=input_tensor.shape[0], --++++ # ) --++++ #@dwj --++++ causal_mask = get_cached_causal_mask_with_cache_position( --+++ attention_mask, --+++ sequence_length=sequence_length, --+++ target_length=target_length, --+++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+++ """ --+++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --++++ _causal_mask_cache.clear() --+++ --+++ input_ids = kwargs.get("input_ids") --+++ if input_ids is None and args: --+++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ --+++ if input_ids is not None: --+++ prompt_length = input_ids.shape[1] --+++- --+++- if prompt_length > PROMPT_LENGTH_THRESHOLD: --+++- Long_Prompt = True --++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --++++ Long_Prompt = 2 --++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --++++ Long_Prompt = 0 --+++ else: --+++- Long_Prompt = False --++++ Long_Prompt = 1 --++++ --+++ --+++ return super().generate(*args, **kwargs) --+++ --+++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++ dtype = self.lm_head.weight.dtype --+++ min_dtype = float(ops.finfo(dtype).min) --+++ --+++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++ # attention_mask, --++++ # sequence_length=sequence_length, --++++ # target_length=past_key_values.get_max_length(), --++++ # dtype=dtype, --++++ # min_dtype=min_dtype, --++++ # cache_position=cache_position, --++++ # batch_size=batch_size, --++++ # ) --++++ --++++ #@dwj --++++ attention_mask = get_cached_causal_mask_with_cache_position( --+++ attention_mask, --+++ sequence_length=sequence_length, --+++ target_length=past_key_values.get_max_length(), --+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++deleted file mode 100644 --+++index 6dfb5b93..00000000 --+++--- a/patches/0001-20251104commit.patch --++++++ /dev/null --+++@@ -1,1272 +0,0 @@ --+++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++-From: Pinoeer-kingxi <13022943007@163.com> --+++-Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++-Subject: [PATCH] 20251104commit --+++- --+++---- --+++- mindnlp/transformers/cache_utils.py | 28 +- --+++- .../models/deepseek/modeling_deepseek.py | 149 ++- --+++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++- 3 files changed, 976 insertions(+), 87 deletions(-) --+++- --+++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++-index cadd2e04..02f8d4be 100644 --+++---- a/mindnlp/transformers/cache_utils.py --+++-+++ b/mindnlp/transformers/cache_utils.py --+++-@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++- # k_out[:, :, cache_position] = key_states --+++- # v_out[:, :, cache_position] = value_states --+++-- if ON_ORANGE_PI: --+++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++-- else: --+++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++-- --+++-+ # if ON_ORANGE_PI: --+++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++-+ # else: --+++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++-+ # 确保 cache_position 是 1D tensor 并且类型正确 --+++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++-+ if cache_position.ndim > 1: --+++-+ cache_position = cache_position.flatten() --+++-+ # 确保类型是 int32 或 int64(MindSpore 要求) --+++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++-+ cache_position = cache_position.int() --+++-+ --+++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++-+ k_out[:, :, cache_position] = key_states --+++-+ v_out[:, :, cache_position] = value_states --+++-+ --+++- return k_out, v_out --+++- --+++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++-index c695b944..d8303e45 100644 --+++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++- # Copied from transformers.models.llama.modeling_llama.rotate_half --+++- def rotate_half(x): --+++- """Rotates half the hidden dims of the input.""" --+++-- x1 = x[..., : x.shape[-1] // 2] --+++-- x2 = x[..., x.shape[-1] // 2 :] --+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++-+ # x1 = x[..., : x.shape[-1] // 2] --+++-+ # x2 = x[..., x.shape[-1] // 2 :] --+++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++- return ops.cat((-x2, x1), dim=-1) --+++- --+++- --+++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++- if self.training: --+++- raise NotImplementedError("Training is not supported yet.") --+++- else: --+++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++-- if self.config.n_shared_experts is not None: --+++-- y = y + self.shared_experts(identity) --+++-- return y --+++-+ # @lwx --+++-+ if orig_shape[1] == 1: --+++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++-+ y=y.view(*orig_shape) --+++-+ if self.config.n_shared_experts is not None: --+++-+ y = y + self.shared_experts(identity) --+++-+ return y --+++-+ else: --+++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++-+ if self.config.n_shared_experts is not None: --+++-+ y = y + self.shared_experts(identity) --+++-+ return y --+++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++-+ # if self.config.n_shared_experts is not None: --+++-+ # y = y + self.shared_experts(identity) --+++-+ # return y --+++-+ --+++-+ @no_grad() --+++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++-+ --+++-+ expert_cache = ops.zeros_like(x) --+++-+ for i in range(self.num_experts_per_tok): --+++-+ expert_id = flat_expert_indices[i].item() --+++-+ weight = flat_expert_weights[i].item() --+++-+ expert = self.experts[expert_id] --+++-+ expert_out = expert(x) --+++-+ expert_cache += expert_out * weight --+++-+ return expert_cache --+++- --+++- @no_grad() --+++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++-- # expert_cache = torch.zeros_like(x) --+++-- # idxs = flat_expert_indices.argsort() --+++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++-- # token_idxs = idxs // self.num_experts_per_tok --+++-- # for i, end_idx in enumerate(tokens_per_expert): --+++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++-- # if start_idx == end_idx: --+++-- # continue --+++-- # expert = self.experts[i] --+++-- # exp_token_idx = token_idxs[start_idx:end_idx] --+++-- # expert_tokens = x[exp_token_idx] --+++-- # expert_out = expert(expert_tokens) --+++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++-- # return expert_cache --+++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++- expert_cache = ops.zeros_like(x) --+++- idxs = flat_expert_indices.argsort() --+++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++- token_idxs = idxs // self.num_experts_per_tok --+++-+ --+++- for i, end_idx in enumerate(tokens_per_expert): --+++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++- if start_idx == end_idx: --+++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++- expert_out = expert(expert_tokens) --+++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++-+ --+++- return expert_cache --+++-+ --+++-+ # @no_grad() --+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++-+ # # expert_cache = torch.zeros_like(x) --+++-+ # # idxs = flat_expert_indices.argsort() --+++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++-+ # # token_idxs = idxs // self.num_experts_per_tok --+++-+ # # for i, end_idx in enumerate(tokens_per_expert): --+++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++-+ # # if start_idx == end_idx: --+++-+ # # continue --+++-+ # # expert = self.experts[i] --+++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++-+ # # expert_tokens = x[exp_token_idx] --+++-+ # # expert_out = expert(expert_tokens) --+++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++-+ # # return expert_cache --+++-+ # expert_cache = ops.zeros_like(x) --+++-+ # idxs = flat_expert_indices.argsort() --+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++-+ # token_idxs = idxs // self.num_experts_per_tok --+++-+ --+++-+ # for i, end_idx in enumerate(tokens_per_expert): --+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++-+ # if start_idx == end_idx: --+++-+ # continue --+++-+ # expert = self.experts[i] --+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+++-+ # expert_tokens = x[exp_token_idx] --+++-+ # expert_out = expert(expert_tokens) --+++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++-+ --+++-+ # return expert_cache --+++-+ # @no_grad() --+++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++-+ # expert_cache = ops.zeros_like(x) --+++-+ --+++-+ # # 排序保证顺序一致 --+++-+ # idxs = flat_expert_indices.argsort() --+++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++-+ # token_idxs = idxs // self.num_experts_per_tok --+++-+ --+++-+ # # 找出有 token 的专家 --+++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++-+ --+++-+ # for i in active_experts.tolist(): --+++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++-+ # end_idx = tokens_per_expert[i] --+++-+ # if start_idx == end_idx: # 没有 token --+++-+ # continue --+++-+ --+++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+++-+ # expert_tokens = x[exp_token_idx] --+++-+ # expert_out = self.experts[i](expert_tokens) --+++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++-+ --+++-+ # expert_cache = mindspore.mint.scatter_add( --+++-+ # expert_cache, --+++-+ # 0, --+++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++-+ # expert_out --+++-+ # ) --+++-+ --+++-+ # return expert_cache --+++-+ --+++-+ --+++- --+++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++- # """ --+++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++- --+++- # Initialize weights and apply final processing --+++- self.post_init() --+++-+ self.warm_up = False --+++-+ --+++-+ def warmup_moe_model_deep(self): --+++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++-+ test_texts = [ --+++-+ "warmup short", --+++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++-+ ] --+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++-+ if tokenizer is None: --+++-+ from mindnlp.transformers import AutoTokenizer --+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++-+ self._warmup_tokenizer = tokenizer --+++-+ --+++-+ for text in test_texts: --+++-+ inputs = tokenizer(text, return_tensors="ms") --+++-+ with mindspore._no_grad(): --+++-+ _ = self(**inputs, use_cache=False) --+++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++- --+++- def get_input_embeddings(self): --+++- return self.model.embed_tokens --+++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++- ```""" --+++-+ if not self.warm_up: --+++-+ self.warm_up = True --+++-+ self.warmup_moe_model_deep() --+++-+ --+++- output_attentions = ( --+++- output_attentions --+++- if output_attentions is not None --+++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++-index 3cbf820e..d4c6b651 100644 --+++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++-@@ -18,7 +18,6 @@ --+++- # See the License for the specific language governing permissions and --+++- # limitations under the License. --+++- """MindSpore Qwen2MoE model.""" --+++-- --+++- import math --+++- from typing import List, Optional, Tuple, Union --+++- --+++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++- TokenClassifierOutput, --+++- ) --+++- from ...modeling_utils import PreTrainedModel --+++-+from ...generation import GenerationMixin --+++- from ....utils import logging --+++- from .configuration_qwen2_moe import Qwen2MoeConfig --+++- --+++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++- self.variance_epsilon = eps --+++- --+++- def forward(self, hidden_states): --+++-+ # @dwj --+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++-+ # @lwx --+++-+ # if not self.training : --+++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++- input_dtype = hidden_states.dtype --+++- hidden_states = hidden_states.to(mindspore.float32) --+++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++-@@ -234,6 +239,8 @@ def rotate_half(x): --+++- """Rotates half the hidden dims of the input.""" --+++- x1 = x[..., : x.shape[-1] // 2] --+++- x2 = x[..., x.shape[-1] // 2 :] --+++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++- return ops.cat((-x2, x1), dim=-1) --+++- --+++- --+++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++- self.config = config --+++- self.hidden_size = config.hidden_size --+++- self.intermediate_size = intermediate_size --+++-+ --+++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++- self.act_fn = ACT2FN[config.hidden_act] --+++- --+++- def forward(self, x): --+++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++-- --+++- --+++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++-+ # @lwx --+++-+ # gate_up_output = self.gate_up_proj(x) --+++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++-+ # return self.down_proj(swiglu_output) --+++-+ --+++-+ # def forward(self, x): --+++-+ # gate_proj_out = self.gate_proj(x) --+++-+ # up_proj_out = self.up_proj(x) --+++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++-+ # return self.down_proj(swiglu_out) --+++-+ --+++- # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++- """ --+++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++- use_cache: bool = False, --+++- cache_position: Optional[mindspore.Tensor] = None, --+++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++-+ --+++-+ --+++-+ --+++- bsz, q_len, _ = hidden_states.shape --+++- --+++- query_states = self.q_proj(hidden_states) --+++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++- "with a layer index." --+++- ) --+++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++-+ if isinstance(past_key_value, StaticCache): --+++-+ kv_seq_len = key_states.shape[-2] --+++-+ else: --+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++- --+++- if past_key_value is not None: --+++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++-+ --+++-+ if isinstance(past_key_value, StaticCache): --+++-+ kv_seq_len = key_states.shape[-2] --+++- --+++- # repeat k/v heads if n_kv_heads < n_heads --+++- key_states = repeat_kv(key_states, self.num_key_value_groups) --+++- value_states = repeat_kv(value_states, self.num_key_value_groups) --+++-- --+++-+ --+++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++- --+++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++-- raise ValueError( --+++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++-- f" {attn_weights.shape}" --+++-- ) --+++-- --+++-- if attention_mask is not None: # no matter the length, we just slice it --+++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++-+ if attention_mask is not None: --+++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++- attn_weights = attn_weights + causal_mask --+++- --+++- # upcast attention to fp32 --+++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++- --+++- attn_output = self.o_proj(attn_output) --+++-- --+++-+ # @lwx --+++-+ --+++-+ # max_seq_len = self.max_position_embeddings # 2048 --+++-+ --+++-+ # if attention_mask is not None: --+++-+ # # attention_mask: [B, 1, Sq, Sk] --+++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++-+ --+++-+ # # pad 到 [max_seq_len, max_seq_len] --+++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++-+ # global_attention_mask = padded_mask --+++-+ # else: --+++-+ # global_attention_mask = None --+++-+ --+++-+ --+++-+ # sparse_mode=3 --+++-+ # attn_output = mindspore.ops.flash_attention_score( --+++-+ # query=query_states, --+++-+ # key=key_states, --+++-+ # value=value_states, --+++-+ # real_shift=None, --+++-+ # padding_mask=None, --+++-+ --+++-+ # head_num=self.num_heads, --+++-+ # attn_mask=global_attention_mask, --+++-+ # keep_prob=1.0 - self.attention_dropout, --+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++-+ # input_layout="BNSD", --+++-+ # pre_tokens=2147483647, --+++-+ # next_tokens=2147483647, --+++-+ # inner_precise=0, --+++-+ # drop_mask=None, --+++-+ # prefix=None, --+++-+ # actual_seq_qlen=None, --+++-+ # actual_seq_kvlen=None, --+++-+ # sparse_mode=sparse_mode, --+++-+ # ) --+++- if not output_attentions: --+++- attn_weights = None --+++- --+++- return attn_output, attn_weights, past_key_value --+++- --+++- --+++-+class Qwen2MoeFlashAttention(nn.Module): --+++-+ """ --+++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++-+ --+++-+ 关键改动: --+++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++-+ 直接传入原始的 key 和 value 张量效率更高。 --+++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++-+ """ --+++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++-+ super().__init__() --+++-+ self.config = config --+++-+ self.layer_idx = layer_idx --+++-+ self.hidden_size = config.hidden_size --+++-+ self.num_heads = config.num_attention_heads --+++-+ self.head_dim = self.hidden_size // self.num_heads --+++-+ self.num_key_value_heads = config.num_key_value_heads --+++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++-+ self.max_position_embeddings = config.max_position_embeddings --+++-+ self.rope_theta = config.rope_theta --+++-+ self.attention_dropout = config.attention_dropout --+++-+ --+++-+ if (self.head_dim * self.num_heads) != self.hidden_size: --+++-+ raise ValueError( --+++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++-+ ) --+++-+ --+++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++-+ --+++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++-+ self.head_dim, --+++-+ max_position_embeddings=self.max_position_embeddings, --+++-+ base=self.rope_theta, --+++-+ ) --+++-+ --+++-+ def forward( --+++-+ self, --+++-+ hidden_states: mindspore.Tensor, --+++-+ attention_mask: Optional[mindspore.Tensor] = None, --+++-+ position_ids: Optional[mindspore.Tensor] = None, --+++-+ past_key_value: Optional[Cache] = None, --+++-+ output_attentions: bool = False, --+++-+ use_cache: bool = False, --+++-+ cache_position: Optional[mindspore.Tensor] = None, --+++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++-+ --+++-+ bsz, q_len, _ = hidden_states.shape --+++-+ --+++-+ # 1. 线性投射 Q, K, V --+++-+ query_states = self.q_proj(hidden_states) --+++-+ key_states = self.k_proj(hidden_states) --+++-+ value_states = self.v_proj(hidden_states) --+++-+ --+++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++-+ # query: [B, S, H*D] -> [B, N1, S, D] --+++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ --+++-+ # 3. RoPE 旋转位置编码 --+++-+ kv_seq_len = key_states.shape[-2] --+++-+ if past_key_value is not None: --+++-+ if self.layer_idx is None: --+++-+ raise ValueError( --+++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++-+ "with a layer index." --+++-+ ) --+++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++-+ if cache_position.shape[0] == 1: --+++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++-+ kv_seq_len = past_seen_tokens + 1 --+++-+ else: --+++-+ # prefill 阶段:cache_position 是范围,使用其长度 --+++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++-+ else: --+++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++-+ --+++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++-+ --+++-+ # 4. KV 缓存更新 --+++-+ if past_key_value is not None: --+++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++-+ key_states, value_states = past_key_value.update( --+++-+ key_states, value_states, self.layer_idx, cache_kwargs --+++-+ ) --+++-+ --+++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++-+ if cache_position.shape[0] == 1: --+++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++-+ kv_seq_len = key_states.shape[-2] --+++-+ --+++-+ # 5. [重要] 准备 Attention Mask --+++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++-+ fa_attention_mask = None --+++-+ if attention_mask is not None: --+++-+ # 截取与当前key长度匹配的部分 --+++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++-+ fa_attention_mask = (mask_slice != 0) --+++-+ --+++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++-+ input_dtype = query_states.dtype --+++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++-+ query_states = query_states.to(mindspore.float16) --+++-+ key_states = key_states.to(mindspore.float16) --+++-+ value_states = value_states.to(mindspore.float16) --+++-+ --+++-+ # 6. [核心] 调用 flash_attention_score 算子 --+++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++-+ attn_output = mindspore.ops.flash_attention_score( --+++-+ query=query_states, --+++-+ key=key_states, --+++-+ value=value_states, --+++-+ head_num=self.num_heads, # 传入Q的头数(N1) --+++-+ attn_mask=fa_attention_mask, --+++-+ keep_prob=1.0 - self.attention_dropout, --+++-+ scalar_value=1.0 / math.sqrt(self.head_dim), --+++-+ input_layout="BNSD", --+++-+ sparse_mode=0 # 使用 defaultMask 模式 --+++-+ ) --+++-+ --+++-+ # 恢复原始数据类型 --+++-+ attn_output = attn_output.to(input_dtype) --+++-+ --+++-+ # 7. 调整输出形状 --+++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++-+ attn_output = self.o_proj(attn_output) --+++-+ --+++-+ # FlashAttention 算子不直接返回注意力权重矩阵 --+++-+ attn_weights = None --+++-+ if output_attentions: --+++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++-+ --+++-+ return attn_output, attn_weights, past_key_value --+++-+ --+++-+ # def forward( --+++-+ # self, --+++-+ # hidden_states: mindspore.Tensor, --+++-+ # attention_mask: Optional[mindspore.Tensor] = None, --+++-+ # position_ids: Optional[mindspore.Tensor] = None, --+++-+ # past_key_value: Optional[Cache] = None, --+++-+ # output_attentions: bool = False, --+++-+ # use_cache: bool = False, --+++-+ # cache_position: Optional[mindspore.Tensor] = None, --+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++-+ --+++-+ # bsz, q_len, _ = hidden_states.shape --+++-+ --+++-+ # # 1. 线性投射 Q, K, V --+++-+ # query_states = self.q_proj(hidden_states) --+++-+ # key_states = self.k_proj(hidden_states) --+++-+ # value_states = self.v_proj(hidden_states) --+++-+ --+++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ --+++-+ # # 3. RoPE 旋转位置编码 --+++-+ # kv_seq_len = key_states.shape[-2] --+++-+ # if past_key_value is not None: --+++-+ # if self.layer_idx is None: --+++-+ # raise ValueError( --+++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++-+ # "with a layer index." --+++-+ # ) --+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++-+ --+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++-+ --+++-+ # # 4. KV 缓存更新 --+++-+ # if past_key_value is not None: --+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++-+ # key_states, value_states = past_key_value.update( --+++-+ # key_states, value_states, self.layer_idx, cache_kwargs --+++-+ # ) --+++-+ --+++-+ # # 5. 准备 Attention Mask --+++-+ # fa_attention_mask = None --+++-+ # if attention_mask is not None: --+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++-+ # fa_attention_mask = (mask_slice != 0) --+++-+ --+++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++-+ # input_dtype = query_states.dtype --+++-+ --+++-+ # # 6. [核心] 调用 flash_attention_score 算子 --+++-+ # attn_output = mindspore.ops.flash_attention_score( --+++-+ # query=query_states, --+++-+ # key=key_states, --+++-+ # value=value_states, --+++-+ # head_num=self.num_heads, --+++-+ # attn_mask=fa_attention_mask, --+++-+ # keep_prob=1.0 - self.attention_dropout, --+++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++-+ # input_layout="BNSD", --+++-+ # sparse_mode=0, --+++-+ # # <--- 修改点 2: 启用内部高精度计算 --- --+++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++-+ # inner_precise=1 --+++-+ # ) --+++-+ --+++-+ # # 恢复原始数据类型 --+++-+ # attn_output = attn_output.to(input_dtype) --+++-+ --+++-+ # # 7. 调整输出形状 --+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++-+ # attn_output = self.o_proj(attn_output) --+++-+ --+++-+ # attn_weights = None --+++-+ # if output_attentions: --+++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++-+ --+++-+ # return attn_output, attn_weights, past_key_value --+++-+ --+++-+ # def forward( --+++-+ # self, --+++-+ # hidden_states: mindspore.Tensor, --+++-+ # attention_mask: Optional[mindspore.Tensor] = None, --+++-+ # position_ids: Optional[mindspore.Tensor] = None, --+++-+ # past_key_value: Optional[Cache] = None, --+++-+ # output_attentions: bool = False, --+++-+ # use_cache: bool = False, --+++-+ # cache_position: Optional[mindspore.Tensor] = None, --+++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++-+ --+++-+ # bsz, q_len, _ = hidden_states.shape --+++-+ --+++-+ # query_states = self.q_proj(hidden_states) --+++-+ # key_states = self.k_proj(hidden_states) --+++-+ # value_states = self.v_proj(hidden_states) --+++-+ --+++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-+ --+++-+ # kv_seq_len = key_states.shape[-2] --+++-+ # if past_key_value is not None: --+++-+ # if self.layer_idx is None: --+++-+ # raise ValueError("`layer_idx` must be specified for caching") --+++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++-+ --+++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++-+ --+++-+ # if past_key_value is not None: --+++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++-+ # key_states, value_states = past_key_value.update( --+++-+ # key_states, value_states, self.layer_idx, cache_kwargs --+++-+ # ) --+++-+ --+++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++-+ --+++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++-+ # query_states = query_states / math.sqrt(self.head_dim) --+++-+ # # <--- 修改结束 --- --+++-+ --+++-+ # fa_attention_mask = None --+++-+ # if attention_mask is not None: --+++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++-+ # fa_attention_mask = (mask_slice != 0) --+++-+ --+++-+ # input_dtype = query_states.dtype --+++-+ --+++-+ # attn_output = mindspore.ops.flash_attention_score( --+++-+ # query=query_states, # 传入已经预先缩放过的 query --+++-+ # key=key_states, --+++-+ # value=value_states, --+++-+ # head_num=self.num_heads, --+++-+ # attn_mask=fa_attention_mask, --+++-+ # keep_prob=1.0 - self.attention_dropout, --+++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++-+ # input_layout="BNSD", --+++-+ # sparse_mode=0, --+++-+ # inner_precise=1 # 仍然保持内部高精度计算 --+++-+ # ) --+++-+ --+++-+ # attn_output = attn_output.to(input_dtype) --+++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++-+ # attn_output = self.o_proj(attn_output) --+++-+ --+++-+ # attn_weights = None --+++-+ # if output_attentions: --+++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++-+ --+++-+ # return attn_output, attn_weights, past_key_value --+++-+ --+++- QWEN2MOE_ATTENTION_CLASSES = { --+++- "eager": Qwen2MoeAttention, --+++-+ "flash-attention": Qwen2MoeFlashAttention, --+++- } --+++- --+++- --+++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++- --+++-+ #@dwj --+++-+ # 只遍历激活的专家,而非全部专家 --+++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++-- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++-- hidden_states = hidden_states.view(-1, hidden_dim) --+++-- # router_logits: (batch * sequence_length, n_experts) --+++-- router_logits = self.gate(hidden_states) --+++-- --+++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++-- if self.norm_topk_prob: --+++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++-- # we cast back to the input dtype --+++-- routing_weights = routing_weights.to(hidden_states.dtype) --+++-- --+++-- final_hidden_states = ops.zeros( --+++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++-- ) --+++-- --+++-- # One hot encode the selected experts to create an expert mask --+++-- # this will be used to easily index which expert is going to be sollicitated --+++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++-- --+++-- # Loop over all available experts in the model and perform the computation on each expert --+++-- for expert_idx in range(self.num_experts): --+++-- expert_layer = self.experts[expert_idx] --+++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++-- --+++-- # Index the correct hidden states and compute the expert hidden state for --+++-- # the current expert. We need to make sure to multiply the output hidden --+++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++-- if 0 not in idx.shape: --+++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++-- --+++-- # However `index_add_` only support torch tensors for indexing so we'll use --+++-- # the `top_x` tensor here. --+++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++-- --+++-- shared_expert_output = self.shared_expert(hidden_states) --+++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++-- --+++-- final_hidden_states = final_hidden_states + shared_expert_output --+++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++-+ num_tokens = hidden_states_reshaped.shape[0] --+++-+ --+++-+ router_logits = self.gate(hidden_states_reshaped) --+++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++-+ --+++-+ if self.norm_topk_prob: --+++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++-+ routing_weights = routing_weights.to(hidden_states.dtype) --+++-+ --+++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++-+ flat_selected_experts = selected_experts.flatten() --+++-+ --+++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++-+ token_indices = broadcasted_token_indices.flatten() --+++-+ --+++-+ active_experts = ops.unique(flat_selected_experts) --+++-+ --+++-+ for expert_idx_tensor in active_experts: --+++-+ expert_idx = expert_idx_tensor.item() --+++-+ expert_layer = self.experts[expert_idx] --+++-+ --+++-+ mask = (flat_selected_experts == expert_idx_tensor) --+++-+ selected_token_indices = token_indices[mask] --+++-+ selected_routing_weights = routing_weights.flatten()[mask] --+++-+ --+++-+ current_states = hidden_states_reshaped[selected_token_indices] --+++-+ --+++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++-+ --+++-+ final_hidden_states = final_hidden_states.index_add( --+++-+ dim=0, --+++-+ index=selected_token_indices, --+++-+ source=expert_output.to(hidden_states.dtype) --+++-+ ) --+++-+ --+++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++- --+++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++-- return final_hidden_states, router_logits --+++-+ final_hidden_states = final_hidden_states + shared_expert_output --+++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++-+ --+++-+ return final_hidden_states, router_logits --+++- --+++- --+++- class Qwen2MoeDecoderLayer(nn.Module): --+++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++- --+++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++- --+++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++-+ --+++- if (layer_idx not in config.mlp_only_layers) and ( --+++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++- ): --+++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++- _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++- _skip_keys_device_placement = "past_key_values" --+++- _supports_cache_class = True --+++-+#lwx --+++-+ # _supports_static_cache = True --+++- --+++- def _init_weights(self, module): --+++- std = self.config.initializer_range --+++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++- return causal_mask --+++- --+++- --+++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++- _tied_weights_keys = ["lm_head.weight"] --+++- --+++- def __init__(self, config): --+++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++- self.num_experts_per_tok = config.num_experts_per_tok --+++- # Initialize weights and apply final processing --+++- self.post_init() --+++-+ # @lwx --+++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++-+ # self.generation_config.cache_implementation = "static" --+++-+ self._warmed_up = False --+++-+ --+++-+ def warmup_moe_model(self): --+++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++-+ test_texts = [ --+++-+ "warmup short", --+++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++-+ ] --+++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++-+ if tokenizer is None: --+++-+ from mindnlp.transformers import AutoTokenizer --+++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++-+ self._warmup_tokenizer = tokenizer --+++-+ --+++-+ for text in test_texts: --+++-+ inputs = tokenizer(text, return_tensors="ms") --+++-+ with mindspore._no_grad(): --+++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++- --+++- def get_input_embeddings(self): --+++- return self.model.embed_tokens --+++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++- ```""" --+++-+ if not self._warmed_up: --+++-+ self._warmed_up = True --+++-+ self.warmup_moe_model() --+++- --+++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++- output_router_logits = ( --+++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++- } --+++- ) --+++- return model_inputs --+++-+# @lwx --+++-+ # def _decode_one_tokens_logits( --+++-+ # self, --+++-+ # cur_token: mindspore.Tensor, --+++-+ # input_pos: Optional[mindspore.Tensor], --+++-+ # cache_position: mindspore.Tensor, --+++-+ # past_key_values: StaticCache, --+++-+ # ) -> mindspore.Tensor: --+++-+ # """ --+++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++-+ --+++-+ # Args: --+++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++-+ # input_pos: 输入位置信息,可选 --+++-+ # cache_position: 当前token在cache中的位置,shape为(1,) --+++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++-+ --+++-+ # Returns: --+++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++-+ # """ --+++-+ # # 调用JIT编译的版本 --+++-+ # return self.get_decode_one_tokens_logits( --+++-+ # cur_token=cur_token, --+++-+ # input_pos=input_pos, --+++-+ # cache_position=cache_position, --+++-+ # past_key_values=past_key_values, --+++-+ # ) --+++-+ --+++-+ # @mindspore.jit(jit_level='O1') --+++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++-+ # """ --+++-+ # JIT编译的函数,用于高效的单token解码 --+++-+ # 使用JIT编译优化以支持静态shape和高效执行 --+++-+ --+++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++-+ # """ --+++-+ # outputs = self.model.forward( --+++-+ # input_ids=cur_token, --+++-+ # position_ids=input_pos, --+++-+ # cache_position=cache_position, --+++-+ # past_key_values=past_key_values, --+++-+ # use_cache=True, --+++-+ # return_dict=False, --+++-+ # ) --+++-+ --+++-+ # hidden_states = outputs[0] --+++-+ # logits = self.lm_head.forward(hidden_states) --+++-+ # logits = logits.float() --+++-+ --+++-+ # return logits[:, -1, :] --+++-+ --+++-+ # def _sample( --+++-+ # self, --+++-+ # input_ids: mindspore.Tensor, --+++-+ # logits_processor, --+++-+ # stopping_criteria, --+++-+ # generation_config, --+++-+ # synced_devices: bool, --+++-+ # streamer=None, --+++-+ # logits_warper=None, --+++-+ # **model_kwargs, --+++-+ # ): --+++-+ # """ --+++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++-+ # """ --+++-+ # from ...generation.logits_process import LogitsProcessorList --+++-+ # from ...generation.stopping_criteria import StoppingCriteriaList --+++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++-+ # from mindnlp.core import nn, ops, no_grad --+++-+ # import numpy as np --+++-+ --+++-+ # # 检查是否使用 StaticCache --+++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++-+ # # 否则,直接调用父类方法 --+++-+ # past_key_values = model_kwargs.get("past_key_values") --+++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++-+ --+++-+ # if not isinstance(past_key_values, StaticCache): --+++-+ # # 不使用 StaticCache,直接调用父类方法 --+++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++-+ # return super()._sample( --+++-+ # input_ids=input_ids, --+++-+ # logits_processor=logits_processor, --+++-+ # stopping_criteria=stopping_criteria, --+++-+ # generation_config=generation_config, --+++-+ # synced_devices=synced_devices, --+++-+ # streamer=streamer, --+++-+ # logits_warper=logits_warper, --+++-+ # **model_kwargs, --+++-+ # ) --+++-+ --+++-+ # # 使用 StaticCache,进入自定义循环 --+++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++-+ # pad_token_id = generation_config._pad_token_tensor --+++-+ # output_attentions = generation_config.output_attentions --+++-+ # output_hidden_states = generation_config.output_hidden_states --+++-+ # output_scores = generation_config.output_scores --+++-+ # output_logits = generation_config.output_logits --+++-+ # return_dict_in_generate = generation_config.return_dict_in_generate --+++-+ # max_length = generation_config.max_length --+++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++-+ # do_sample = generation_config.do_sample --+++-+ --+++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++-+ # raise ValueError( --+++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++-+ # f"{logits_warper})." --+++-+ # ) --+++-+ --+++-+ # # init attention / hidden states / scores tuples --+++-+ # scores = () if (return_dict_in_generate and output_scores) else None --+++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++-+ --+++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++-+ # encoder_hidden_states = ( --+++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++-+ # ) --+++-+ --+++-+ # # keep track of which sequences are already finished --+++-+ # batch_size, cur_len = input_ids.shape --+++-+ # this_peer_finished = False --+++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++-+ --+++-+ # time_record = [] --+++-+ # from ....utils.testing_utils import parse_flag_from_env --+++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++-+ --+++-+ # while self._has_unfinished_sequences( --+++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++-+ # ): --+++-+ # if _record_time: --+++-+ # import time as time_module --+++-+ # infer_start = time_module.time() --+++-+ --+++-+ # # prepare model inputs --+++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++-+ --+++-+ # # prepare variable output controls --+++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++-+ --+++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++-+ # cur_cache_position = model_inputs.get("cache_position") --+++-+ # cur_past_key_values = model_inputs.get("past_key_values") --+++-+ # cur_input_ids = model_inputs.get("input_ids") --+++-+ --+++-+ # if (isinstance(cur_past_key_values, StaticCache) and --+++-+ # cur_cache_position is not None and --+++-+ # len(cur_cache_position.shape) > 0 and --+++-+ # cur_cache_position.shape[0] == 1 and --+++-+ # cur_input_ids is not None and --+++-+ # cur_input_ids.shape[1] == 1): --+++-+ # # 使用 JIT 优化的单 token 解码 --+++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++-+ # if not hasattr(self, '_jit_used'): --+++-+ # self._jit_used = False --+++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++-+ --+++-+ # next_token_logits = self.get_decode_one_tokens_logits( --+++-+ # cur_token=cur_input_ids, --+++-+ # input_pos=model_inputs.get("position_ids"), --+++-+ # cache_position=cur_cache_position, --+++-+ # past_key_values=cur_past_key_values, --+++-+ # ) --+++-+ --+++-+ # # 标记已使用JIT(用于后续判断) --+++-+ # if not self._jit_used: --+++-+ # self._jit_used = True --+++-+ --+++-+ # # 构造兼容的输出对象 --+++-+ # class JitOptimizedOutput: --+++-+ # def __init__(self, logits, config): --+++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++-+ # self.config = config --+++-+ # # 对于 JIT 优化路径,这些属性通常不需要 --+++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++-+ # self.attentions = None if not config.is_encoder_decoder else None --+++-+ # self.cross_attentions = None --+++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++-+ # self.hidden_states = None if not config.is_encoder_decoder else None --+++-+ --+++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++-+ # else: --+++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++-+ # outputs = self(**model_inputs, return_dict=True) --+++-+ --+++-+ # if synced_devices and this_peer_finished: --+++-+ # continue --+++-+ --+++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++-+ # next_token_logits = outputs.logits[:, -1, :] --+++-+ --+++-+ # # pre-process distribution --+++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++-+ # if do_sample: --+++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++-+ --+++-+ # # Store scores, attentions and hidden_states when required --+++-+ # if return_dict_in_generate: --+++-+ # if output_scores: --+++-+ # scores += (next_token_scores,) --+++-+ # if output_logits: --+++-+ # raw_logits += (next_token_logits,) --+++-+ # if output_attentions: --+++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++-+ # decoder_attentions += (attn,) if attn is not None else (None,) --+++-+ # if self.config.is_encoder_decoder: --+++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++-+ --+++-+ # if output_hidden_states: --+++-+ # hidden = ( --+++-+ # outputs.decoder_hidden_states --+++-+ # if self.config.is_encoder_decoder --+++-+ # else outputs.hidden_states --+++-+ # ) --+++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++-+ --+++-+ # # token selection --+++-+ # if do_sample: --+++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++-+ # else: --+++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++-+ --+++-+ # # finished sentences should have their next token be a padding token --+++-+ # if has_eos_stopping_criteria: --+++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++-+ --+++-+ # # update generated ids, model inputs, and length for next step --+++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++-+ # if streamer is not None: --+++-+ # streamer.put(next_tokens) --+++-+ --+++-+ # model_kwargs = self._update_model_kwargs_for_generation( --+++-+ # outputs, --+++-+ # model_kwargs, --+++-+ # is_encoder_decoder=self.config.is_encoder_decoder, --+++-+ # ) --+++-+ --+++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++-+ # cur_len += 1 --+++-+ --+++-+ # if _record_time: --+++-+ # import time as time_module --+++-+ # infer_stop = time_module.time() --+++-+ # time_record.append(infer_stop - infer_start) --+++-+ --+++-+ # del outputs --+++-+ --+++-+ # average_infer_time = None --+++-+ # if time_record: --+++-+ # if len(time_record) > 1: --+++-+ # time_record.pop(0) --+++-+ # average_infer_time = sum(time_record) / len(time_record) --+++-+ # print(f'average inference time is: {average_infer_time}') --+++-+ # print(f'inference time record: {time_record}') --+++-+ --+++-+ # if streamer is not None: --+++-+ # streamer.end() --+++-+ --+++-+ # # 简单判断:打印是否使用了JIT路径 --+++-+ # if hasattr(self, '_jit_used') and self._jit_used: --+++-+ # print("[JIT] ✓ JIT optimization was used during generation") --+++-+ # else: --+++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++-+ --+++-+ # if return_dict_in_generate: --+++-+ # if self.config.is_encoder_decoder: --+++-+ # return GenerateEncoderDecoderOutput( --+++-+ # sequences=input_ids, --+++-+ # scores=scores, --+++-+ # logits=raw_logits, --+++-+ # encoder_attentions=encoder_attentions, --+++-+ # encoder_hidden_states=encoder_hidden_states, --+++-+ # decoder_attentions=decoder_attentions, --+++-+ # cross_attentions=cross_attentions, --+++-+ # decoder_hidden_states=decoder_hidden_states, --+++-+ # past_key_values=model_kwargs.get("past_key_values"), --+++-+ # average_infer_time=average_infer_time --+++-+ # ) --+++-+ # else: --+++-+ # return GenerateDecoderOnlyOutput( --+++-+ # sequences=input_ids, --+++-+ # scores=scores, --+++-+ # logits=raw_logits, --+++-+ # attentions=decoder_attentions, --+++-+ # hidden_states=decoder_hidden_states, --+++-+ # past_key_values=model_kwargs.get("past_key_values"), --+++-+ # average_infer_time=average_infer_time --+++-+ # ) --+++-+ # else: --+++-+ # return input_ids --+++-+ --+++-+ # def _prepare_cache_for_generation( --+++-+ # self, --+++-+ # generation_config, --+++-+ # model_kwargs, --+++-+ # assistant_model, --+++-+ # batch_size, --+++-+ # max_cache_length, --+++-+ # ): --+++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++-+ # generation_config.cache_implementation = "static" --+++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++-+ --+++-+ # if generation_config.cache_implementation == "static": --+++-+ # base_required_from_max_length = generation_config.max_length + 1 --+++-+ # base_required = max(max_cache_length, base_required_from_max_length) --+++-+ # min_cache_size = 50 --+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++-+ # else: --+++-+ # max_cache_length = max(base_required, min_cache_size) --+++-+ --+++-+ # original_max_cache_length = max_cache_length --+++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++-+ # print(f" - input max_cache_length: {original_max_cache_length}") --+++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++-+ # print(f" - final max_cache_length: {max_cache_length}") --+++-+ --+++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++-+ # if max_cache_length > self.config.max_position_embeddings: --+++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++-+ --+++-+ # result = super()._prepare_cache_for_generation( --+++-+ # generation_config=generation_config, --+++-+ # model_kwargs=model_kwargs, --+++-+ # assistant_model=assistant_model, --+++-+ # batch_size=batch_size, --+++-+ # max_cache_length=max_cache_length, --+++-+ # ) --+++-+ --+++-+ # if generation_config.cache_implementation == "static": --+++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++-+ # created_cache = model_kwargs.get(cache_name) --+++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++-+ # if created_cache.max_cache_len < generation_config.max_length: --+++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++-+ --+++-+ # return result --+++-+ --+++-+ --+++-+ --+++- --+++- --+++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++--- --+++-2.27.0 --+++- --+++-- --+++2.27.0 --+++ --++-- --++2.27.0 --++ --+-- --+2.27.0 --+ ---- --2.27.0 -- -diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch -deleted file mode 100644 -index 8a2fc4fe..00000000 ---- a/patches/0007-20251107003commit.patch -+++ /dev/null -@@ -1,8034 +0,0 @@ --From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Fri, 7 Nov 2025 12:12:51 +0800 --Subject: [PATCH 7/8] 20251107003commit -- ----- -- .../models/deepseek/modeling_deepseek.py | 2 +- -- patches/0001-20251104commit.patch | 2 +- -- patches/0002-20251106commit.patch | 2 +- -- patches/0003-20261106secondcommit.patch | 2 +- -- patches/0004-20251106change.patch | 2 +- -- patches/0005-20251107001commit.patch | 2 +- -- patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ -- 7 files changed, 7937 insertions(+), 6 deletions(-) -- create mode 100644 patches/0006-20251107002commit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index e7e1c053..ff631974 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): -- # return expert_cache -- -- @no_grad() --- dwj --+ # dwj -- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- # x 的 shape: (1, hidden_size) -- # flat_expert_indices 的 shape: (num_experts_per_tok,) --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --index 2842180e..c9c8c5ee 100644 ----- a/patches/0001-20251104commit.patch --+++ b/patches/0001-20251104commit.patch --@@ -1,7 +1,7 @@ -- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Tue, 4 Nov 2025 09:11:51 +0800 ---Subject: [PATCH 1/5] 20251104commit --+Subject: [PATCH 1/6] 20251104commit -- -- --- -- mindnlp/transformers/cache_utils.py | 28 +- --diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --index c6cd8757..625656eb 100644 ----- a/patches/0002-20251106commit.patch --+++ b/patches/0002-20251106commit.patch --@@ -1,7 +1,7 @@ -- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 09:20:38 +0800 ---Subject: [PATCH 2/5] 20251106commit --+Subject: [PATCH 2/6] 20251106commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 379 ++++- --diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --index 601960c9..dcb85080 100644 ----- a/patches/0003-20261106secondcommit.patch --+++ b/patches/0003-20261106secondcommit.patch --@@ -1,7 +1,7 @@ -- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 14:54:37 +0800 ---Subject: [PATCH 3/5] 20261106secondcommit --+Subject: [PATCH 3/6] 20261106secondcommit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 217 ++- --diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --index 8976f10b..bbed13cc 100644 ----- a/patches/0004-20251106change.patch --+++ b/patches/0004-20251106change.patch --@@ -1,7 +1,7 @@ -- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 15:48:09 +0800 ---Subject: [PATCH 4/5] 20251106change --+Subject: [PATCH 4/6] 20251106change -- -- --- -- .../models/deepseek/modeling_deepseek.py | 189 +- --diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --index 8d9032be..b2d1035c 100644 ----- a/patches/0005-20251107001commit.patch --+++ b/patches/0005-20251107001commit.patch --@@ -1,7 +1,7 @@ -- From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Fri, 7 Nov 2025 11:48:18 +0800 ---Subject: [PATCH 5/5] 20251107001commit --+Subject: [PATCH 5/6] 20251107001commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 91 +- --diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch --new file mode 100644 --index 00000000..bffa134e ----- /dev/null --+++ b/patches/0006-20251107002commit.patch --@@ -0,0 +1,7931 @@ --+From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Fri, 7 Nov 2025 12:06:32 +0800 --+Subject: [PATCH 6/6] 20251107002commit --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 122 +- --+ patches/0001-20251104commit.patch | 2 +- --+ patches/0002-20251106commit.patch | 2 +- --+ patches/0003-20261106secondcommit.patch | 2 +- --+ patches/0004-20251106change.patch | 2 +- --+ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ --+ 6 files changed, 7773 insertions(+), 64 deletions(-) --+ create mode 100644 patches/0005-20251107001commit.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index 8831e4b7..e7e1c053 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): --+ # expert_out = expert(x) --+ # expert_cache += expert_out * weight --+ # return expert_cache --+- --+- # @no_grad() --+- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+- # # x 的 shape: (1, hidden_size) --+- # # flat_expert_indices 的 shape: (num_experts_per_tok,) --+- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+- --+- # # 1. 收集所有需要的专家层 --+- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+- # selected_experts = [self.experts[i] for i in flat_expert_indices] --+- --+- # # 2. 并行计算所有专家的输出 --+- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+- # # ops.cat 会将它们堆叠成一个新的 Tensor --+- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+- --+- # # 3. 使用矩阵乘法进行加权求和 --+- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+- # # 最终结果 final_output 的 shape: (1, hidden_size) --+- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++ --++ @no_grad() --++ dwj --++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ # x 的 shape: (1, hidden_size) --++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++ --++ # 1. 收集所有需要的专家层 --++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++ selected_experts = [self.experts[i] for i in flat_expert_indices] --++ --++ # 2. 并行计算所有专家的输出 --++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++ # ops.cat 会将它们堆叠成一个新的 Tensor --++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++ --++ # 3. 使用矩阵乘法进行加权求和 --++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++ # 最终结果 final_output 的 shape: (1, hidden_size) --++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+ --+- # return final_output --++ return final_output --+ --+ --+ # @no_grad() --+@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): --+ --+ return expert_cache --+ # 放置在 DeepseekMoE 类中 --+- @no_grad() --+- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+- """ --+- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --+- --+- Args: --+- x (Tensor): 输入张量, shape: (1, hidden_size) --+- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --+- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --+- """ --+- top_k, _ = flat_expert_weights.shape --+- hidden_size = x.shape[-1] --+- --+- # 1. 将所有专家的权重堆叠起来 --+- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --+- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --+- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --++ # @no_grad() --++ # #lwx 20251107 --++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++ # """ --++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --++ --++ # Args: --++ # x (Tensor): 输入张量, shape: (1, hidden_size) --++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --++ # """ --++ # top_k, _ = flat_expert_weights.shape --++ # hidden_size = x.shape[-1] --++ --++ # # 1. 将所有专家的权重堆叠起来 --++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --+ --+- # 2. "收集" 所需的专家权重 --+- selected_gate_w = stacked_gate_w[flat_expert_indices] --+- selected_up_w = stacked_up_w[flat_expert_indices] --+- selected_down_w = stacked_down_w[flat_expert_indices] --++ # # 2. "收集" 所需的专家权重 --++ # selected_gate_w = stacked_gate_w[flat_expert_indices] --++ # selected_up_w = stacked_up_w[flat_expert_indices] --++ # selected_down_w = stacked_down_w[flat_expert_indices] --+ --+- # 3. 准备输入 --+- x_expanded = x.expand((top_k, 1, hidden_size)) --++ # # 3. 准备输入 --++ # x_expanded = x.expand((top_k, 1, hidden_size)) --+ --+- # 4. 并行计算 gate_proj 和 up_proj --+- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --+- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --++ # # 4. 并行计算 gate_proj 和 up_proj --++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --+ --+- # 5. 计算中间状态 --+- intermediate_states = self.experts[0].act_fn(gate_out) * up_out --++ # # 5. 计算中间状态 --++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out --+ --+- # 6. 并行计算 down_proj --+- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --+- # --- [FIX] --- --+- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --+- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --+- # --- [FIX END] --- --++ # # 6. 并行计算 down_proj --++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --++ # # --- [FIX] --- --++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --++ # # --- [FIX END] --- --+ --+- # 7. 根据路由权重进行加权求和 --+- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --++ # # 7. 根据路由权重进行加权求和 --++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --+ --+- return weighted_sum --++ # return weighted_sum --+ --+ --+ --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+index 0a0ef2d7..2842180e 100644 --+--- a/patches/0001-20251104commit.patch --++++ b/patches/0001-20251104commit.patch --+@@ -1,7 +1,7 @@ --+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Tue, 4 Nov 2025 09:11:51 +0800 --+-Subject: [PATCH 1/4] 20251104commit --++Subject: [PATCH 1/5] 20251104commit --+ --+ --- --+ mindnlp/transformers/cache_utils.py | 28 +- --+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+index 5185270c..c6cd8757 100644 --+--- a/patches/0002-20251106commit.patch --++++ b/patches/0002-20251106commit.patch --+@@ -1,7 +1,7 @@ --+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 09:20:38 +0800 --+-Subject: [PATCH 2/4] 20251106commit --++Subject: [PATCH 2/5] 20251106commit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+index 3e05f821..601960c9 100644 --+--- a/patches/0003-20261106secondcommit.patch --++++ b/patches/0003-20261106secondcommit.patch --+@@ -1,7 +1,7 @@ --+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 14:54:37 +0800 --+-Subject: [PATCH 3/4] 20261106secondcommit --++Subject: [PATCH 3/5] 20261106secondcommit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 217 ++- --+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --+index 88a1aef4..8976f10b 100644 --+--- a/patches/0004-20251106change.patch --++++ b/patches/0004-20251106change.patch --+@@ -1,7 +1,7 @@ --+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 15:48:09 +0800 --+-Subject: [PATCH 4/4] 20251106change --++Subject: [PATCH 4/5] 20251106change --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 189 +- --+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --+new file mode 100644 --+index 00000000..8d9032be --+--- /dev/null --++++ b/patches/0005-20251107001commit.patch --+@@ -0,0 +1,7707 @@ --++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Fri, 7 Nov 2025 11:48:18 +0800 --++Subject: [PATCH 5/5] 20251107001commit --++ --++--- --++ .../models/deepseek/modeling_deepseek.py | 91 +- --++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- --++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- --++ patches/0001-20251104commit.patch | 2 +- --++ patches/0002-20251106commit.patch | 2 +- --++ patches/0003-20261106secondcommit.patch | 2 +- --++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ --++ 7 files changed, 7577 insertions(+), 30 deletions(-) --++ create mode 100644 patches/0004-20251106change.patch --++ --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index 0546f318..8831e4b7 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): --++ # expert_cache += expert_out * weight --++ # return expert_cache --++ --++- @no_grad() --++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++- # x 的 shape: (1, hidden_size) --++- # flat_expert_indices 的 shape: (num_experts_per_tok,) --++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++- --++- # 1. 收集所有需要的专家层 --++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++- selected_experts = [self.experts[i] for i in flat_expert_indices] --++- --++- # 2. 并行计算所有专家的输出 --++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++- # ops.cat 会将它们堆叠成一个新的 Tensor --++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++- --++- # 3. 使用矩阵乘法进行加权求和 --++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++- # 最终结果 final_output 的 shape: (1, hidden_size) --++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+++ # @no_grad() --+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ # # x 的 shape: (1, hidden_size) --+++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) --+++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+++ --+++ # # 1. 收集所有需要的专家层 --+++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+++ # selected_experts = [self.experts[i] for i in flat_expert_indices] --+++ --+++ # # 2. 并行计算所有专家的输出 --+++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+++ # # ops.cat 会将它们堆叠成一个新的 Tensor --+++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+++ --+++ # # 3. 使用矩阵乘法进行加权求和 --+++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ # # 最终结果 final_output 的 shape: (1, hidden_size) --+++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++ --++- return final_output --+++ # return final_output --++ --++ --++ # @no_grad() --++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): --++ ) --++ --++ return expert_cache --+++# 放置在 DeepseekMoE 类中 --+++ @no_grad() --+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ """ --+++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --+++ --+++ Args: --+++ x (Tensor): 输入张量, shape: (1, hidden_size) --+++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --+++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --+++ """ --+++ top_k, _ = flat_expert_weights.shape --+++ hidden_size = x.shape[-1] --+++ --+++ # 1. 将所有专家的权重堆叠起来 --+++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --+++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --+++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --+++ --+++ # 2. "收集" 所需的专家权重 --+++ selected_gate_w = stacked_gate_w[flat_expert_indices] --+++ selected_up_w = stacked_up_w[flat_expert_indices] --+++ selected_down_w = stacked_down_w[flat_expert_indices] --+++ --+++ # 3. 准备输入 --+++ x_expanded = x.expand((top_k, 1, hidden_size)) --+++ --+++ # 4. 并行计算 gate_proj 和 up_proj --+++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --+++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --+++ --+++ # 5. 计算中间状态 --+++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out --+++ --+++ # 6. 并行计算 down_proj --+++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --+++ # --- [FIX] --- --+++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --+++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --+++ # --- [FIX END] --- --+++ --+++ # 7. 根据路由权重进行加权求和 --+++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --+++ --+++ return weighted_sum --+++ --+++ --++ --++ # @no_grad() --++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++index ebd7782e..913a7609 100644 --++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): --++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++- x1 = x[..., : x.shape[-1] // 2] --++- x2 = x[..., x.shape[-1] // 2 :] --+++ # x1 = x[..., : x.shape[-1] // 2] --+++ # x2 = x[..., x.shape[-1] // 2 :] --++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --++index d059dcbe..2b217b64 100644 --++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): --++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++ def rotate_half(x): --++ """Rotates half the hidden dims of the input.""" --++- x1 = x[..., : x.shape[-1] // 2] --++- x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++ # x1 = x[..., : x.shape[-1] // 2] --+++ # x2 = x[..., x.shape[-1] // 2 :] --+++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++ return ops.cat((-x2, x1), dim=-1) --++ --++ --++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++index 78f22642..0a0ef2d7 100644 --++--- a/patches/0001-20251104commit.patch --+++++ b/patches/0001-20251104commit.patch --++@@ -1,7 +1,7 @@ --++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Tue, 4 Nov 2025 09:11:51 +0800 --++-Subject: [PATCH 1/3] 20251104commit --+++Subject: [PATCH 1/4] 20251104commit --++ --++ --- --++ mindnlp/transformers/cache_utils.py | 28 +- --++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --++index 22b65dd5..5185270c 100644 --++--- a/patches/0002-20251106commit.patch --+++++ b/patches/0002-20251106commit.patch --++@@ -1,7 +1,7 @@ --++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Thu, 6 Nov 2025 09:20:38 +0800 --++-Subject: [PATCH 2/3] 20251106commit --+++Subject: [PATCH 2/4] 20251106commit --++ --++ --- --++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --++index 966529e4..3e05f821 100644 --++--- a/patches/0003-20261106secondcommit.patch --+++++ b/patches/0003-20261106secondcommit.patch --++@@ -1,7 +1,7 @@ --++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Thu, 6 Nov 2025 14:54:37 +0800 --++-Subject: [PATCH 3/3] 20261106secondcommit --+++Subject: [PATCH 3/4] 20261106secondcommit --++ --++ --- --++ .../models/deepseek/modeling_deepseek.py | 217 ++- --++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --++new file mode 100644 --++index 00000000..88a1aef4 --++--- /dev/null --+++++ b/patches/0004-20251106change.patch --++@@ -0,0 +1,7498 @@ --+++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Thu, 6 Nov 2025 15:48:09 +0800 --+++Subject: [PATCH 4/4] 20251106change --+++ --+++--- --+++ .../models/deepseek/modeling_deepseek.py | 189 +- --+++ patches/0001-20251104commit.patch | 1272 +++++++ --+++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ --+++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ --+++ 4 files changed, 7244 insertions(+), 186 deletions(-) --+++ create mode 100644 patches/0001-20251104commit.patch --+++ create mode 100644 patches/0002-20251106commit.patch --+++ create mode 100644 patches/0003-20261106secondcommit.patch --+++ --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index 2f9192bf..0546f318 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): --+++ --+++ return attn_output, attn_weights, past_key_value --+++ --+++-# class DeepseekFlashAttention(nn.Module): --+++-# """ --+++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --+++- --+++-# This class is designed as a drop-in replacement for DeepseekAttention. --+++-# """ --+++- --+++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+++-# super().__init__() --+++-# self.config = config --+++-# self.layer_idx = layer_idx --+++-# if layer_idx is None: --+++-# logger.warning( --+++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++-# "when creating this class." --+++-# ) --+++- --+++-# self.attention_dropout = config.attention_dropout --+++-# self.hidden_size = config.hidden_size --+++-# self.num_heads = config.num_attention_heads --+++-# self.head_dim = self.hidden_size // self.num_heads --+++-# self.num_key_value_heads = config.num_key_value_heads --+++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++-# self.max_position_embeddings = config.max_position_embeddings --+++-# self.rope_theta = config.rope_theta --+++-# self.is_causal = True --+++- --+++-# if (self.head_dim * self.num_heads) != self.hidden_size: --+++-# raise ValueError( --+++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++-# f" and `num_heads`: {self.num_heads})." --+++-# ) --+++- --+++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+++-# self._init_rope() --+++- --+++-# def _init_rope(self): --+++-# if self.config.rope_scaling is None: --+++-# self.rotary_emb = DeepseekRotaryEmbedding( --+++-# self.head_dim, --+++-# max_position_embeddings=self.max_position_embeddings, --+++-# base=self.rope_theta, --+++-# ) --+++-# else: --+++-# scaling_type = self.config.rope_scaling["type"] --+++-# scaling_factor = self.config.rope_scaling["factor"] --+++-# if scaling_type == "linear": --+++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+++-# self.head_dim, --+++-# max_position_embeddings=self.max_position_embeddings, --+++-# scaling_factor=scaling_factor, --+++-# base=self.rope_theta, --+++-# ) --+++-# elif scaling_type == "dynamic": --+++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+++-# self.head_dim, --+++-# max_position_embeddings=self.max_position_embeddings, --+++-# scaling_factor=scaling_factor, --+++-# base=self.rope_theta, --+++-# ) --+++-# else: --+++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+++- --+++-# def forward( --+++-# self, --+++-# hidden_states: mindspore.Tensor, --+++-# attention_mask: Optional[mindspore.Tensor] = None, --+++-# position_ids: Optional[mindspore.Tensor] = None, --+++-# past_key_value: Optional[Cache] = None, --+++-# output_attentions: bool = False, --+++-# use_cache: bool = False, --+++-# **kwargs, --+++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++-# if "padding_mask" in kwargs: --+++-# warnings.warn( --+++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+++-# ) --+++- --+++-# if output_attentions: --+++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --+++- --+++-# bsz, q_len, _ = hidden_states.shape --+++- --+++-# if self.config.pretraining_tp > 1: --+++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+++- --+++-# query_states = self.q_proj(hidden_states) --+++-# key_states = self.k_proj(hidden_states) --+++-# value_states = self.v_proj(hidden_states) --+++- --+++-# # Reshape for multi-head attention --+++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++- --+++-# kv_seq_len = key_states.shape[-2] --+++-# if past_key_value is not None: --+++-# if self.layer_idx is None: --+++-# raise ValueError( --+++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++-# "with a layer index." --+++-# ) --+++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++- --+++-# # Apply Rotary Positional Embedding --+++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++- --+++-# if past_key_value is not None: --+++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --+++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++- --+++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --+++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --+++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++- --+++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++- --+++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++- --+++-# # Convert attention_mask for flash_attention_score --+++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --+++-# if attention_mask is not None: --+++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --+++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+++-# raise ValueError( --+++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+++-# ) --+++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --+++-# else: --+++-# attn_mask_for_fa = None --+++- --+++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+++- --+++-# # Call the fused flash_attention_score operator --+++-# attn_output = mindspore.ops.flash_attention_score( --+++-# query=query_states_for_fa, --+++-# key=key_states_for_fa, --+++-# value=value_states_for_fa, --+++-# head_num=self.num_heads, # This is N1, the number of query heads --+++-# input_layout='BSH', --+++-# attn_mask=attn_mask_for_fa, --+++-# keep_prob=keep_prob, --+++-# scalar_value=1.0 / math.sqrt(self.head_dim), --+++-# sparse_mode=0 # Default mask mode --+++-# ) --+++- --+++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --+++-# attn_output = self.o_proj(attn_output) --+++- --+++-# # Flash Attention does not return attention weights --+++-# attn_weights = None --+++- --+++-# return attn_output, attn_weights, past_key_value --+++ --+++ class DeepseekFlashAttention(nn.Module): --+++ """ --+++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): --+++ super().__init__() --+++ self.hidden_size = config.hidden_size --+++ --+++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --+++- config=config, layer_idx=layer_idx --+++- ) --++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --++++ # config=config, layer_idx=layer_idx --++++ # ) --+++ --+++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --+++ config=config, layer_idx=layer_idx --+++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): --+++ return outputs --+++ --+++ --+++- --+++ class DeepseekPreTrainedModel(PreTrainedModel): --+++ config_class = DeepseekConfig --+++ base_model_prefix = "model" --+++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++ # Initialize weights and apply final processing --+++ self.post_init() --+++ self.warm_up = False --+++- #@dwj --+++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --+++- self.num_layers, --+++- self.num_attention_heads, --+++- self.head_dim, --+++- batch_size=1, --+++- max_length=self.max_length, --+++- dtype=mindspore.float16 --+++- ) --+++- --+++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --+++- key_cache = [] --+++- value_cache = [] --+++- for _ in range(num_layers): --+++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++- key_cache.append(k) --+++- value_cache.append(v) --+++- return key_cache, value_cache --+++- --+++ --+++ def warmup_moe_model_deep(self): --+++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++new file mode 100644 --+++index 00000000..78f22642 --+++--- /dev/null --++++++ b/patches/0001-20251104commit.patch --+++@@ -0,0 +1,1272 @@ --++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++++From: Pinoeer-kingxi <13022943007@163.com> --++++Date: Tue, 4 Nov 2025 09:11:51 +0800 --++++Subject: [PATCH 1/3] 20251104commit --++++ --++++--- --++++ mindnlp/transformers/cache_utils.py | 28 +- --++++ .../models/deepseek/modeling_deepseek.py | 149 ++- --++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++++ 3 files changed, 976 insertions(+), 87 deletions(-) --++++ --++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++++index cadd2e04..02f8d4be 100644 --++++--- a/mindnlp/transformers/cache_utils.py --+++++++ b/mindnlp/transformers/cache_utils.py --++++@@ -812,14 +812,26 @@ class StaticCache(Cache): --++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++++ # k_out[:, :, cache_position] = key_states --++++ # v_out[:, :, cache_position] = value_states --++++- if ON_ORANGE_PI: --++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++- else: --++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++- --+++++ # if ON_ORANGE_PI: --+++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++ # else: --+++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++ # 确保 cache_position 是 1D tensor 并且类型正确 --+++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++++ if cache_position.ndim > 1: --+++++ cache_position = cache_position.flatten() --+++++ # 确保类型是 int32 或 int64(MindSpore 要求) --+++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++++ cache_position = cache_position.int() --+++++ --+++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++++ k_out[:, :, cache_position] = key_states --+++++ v_out[:, :, cache_position] = value_states --+++++ --++++ return k_out, v_out --++++ --++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++index c695b944..d8303e45 100644 --++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++++ def rotate_half(x): --++++ """Rotates half the hidden dims of the input.""" --++++- x1 = x[..., : x.shape[-1] // 2] --++++- x2 = x[..., x.shape[-1] // 2 :] --+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++ # x1 = x[..., : x.shape[-1] // 2] --+++++ # x2 = x[..., x.shape[-1] // 2 :] --+++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++ return ops.cat((-x2, x1), dim=-1) --++++ --++++ --++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++++ if self.training: --++++ raise NotImplementedError("Training is not supported yet.") --++++ else: --++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++- if self.config.n_shared_experts is not None: --++++- y = y + self.shared_experts(identity) --++++- return y --+++++ # @lwx --+++++ if orig_shape[1] == 1: --+++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++++ y=y.view(*orig_shape) --+++++ if self.config.n_shared_experts is not None: --+++++ y = y + self.shared_experts(identity) --+++++ return y --+++++ else: --+++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++++ if self.config.n_shared_experts is not None: --+++++ y = y + self.shared_experts(identity) --+++++ return y --+++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++ # if self.config.n_shared_experts is not None: --+++++ # y = y + self.shared_experts(identity) --+++++ # return y --+++++ --+++++ @no_grad() --+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++ --+++++ expert_cache = ops.zeros_like(x) --+++++ for i in range(self.num_experts_per_tok): --+++++ expert_id = flat_expert_indices[i].item() --+++++ weight = flat_expert_weights[i].item() --+++++ expert = self.experts[expert_id] --+++++ expert_out = expert(x) --+++++ expert_cache += expert_out * weight --+++++ return expert_cache --++++ --++++ @no_grad() --++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++- # expert_cache = torch.zeros_like(x) --++++- # idxs = flat_expert_indices.argsort() --++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++- # token_idxs = idxs // self.num_experts_per_tok --++++- # for i, end_idx in enumerate(tokens_per_expert): --++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++- # if start_idx == end_idx: --++++- # continue --++++- # expert = self.experts[i] --++++- # exp_token_idx = token_idxs[start_idx:end_idx] --++++- # expert_tokens = x[exp_token_idx] --++++- # expert_out = expert(expert_tokens) --++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++- # return expert_cache --+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++ expert_cache = ops.zeros_like(x) --++++ idxs = flat_expert_indices.argsort() --++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++ token_idxs = idxs // self.num_experts_per_tok --+++++ --++++ for i, end_idx in enumerate(tokens_per_expert): --++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++ if start_idx == end_idx: --++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++++ expert_out = expert(expert_tokens) --++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --++++ return expert_cache --+++++ --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # # expert_cache = torch.zeros_like(x) --+++++ # # idxs = flat_expert_indices.argsort() --+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++ # # token_idxs = idxs // self.num_experts_per_tok --+++++ # # for i, end_idx in enumerate(tokens_per_expert): --+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++ # # if start_idx == end_idx: --+++++ # # continue --+++++ # # expert = self.experts[i] --+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # # expert_tokens = x[exp_token_idx] --+++++ # # expert_out = expert(expert_tokens) --+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++ # # return expert_cache --+++++ # expert_cache = ops.zeros_like(x) --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # for i, end_idx in enumerate(tokens_per_expert): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # if start_idx == end_idx: --+++++ # continue --+++++ # expert = self.experts[i] --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = expert(expert_tokens) --+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --+++++ # return expert_cache --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # expert_cache = ops.zeros_like(x) --+++++ --+++++ # # 排序保证顺序一致 --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # # 找出有 token 的专家 --+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++ --+++++ # for i in active_experts.tolist(): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # end_idx = tokens_per_expert[i] --+++++ # if start_idx == end_idx: # 没有 token --+++++ # continue --+++++ --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = self.experts[i](expert_tokens) --+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++ --+++++ # expert_cache = mindspore.mint.scatter_add( --+++++ # expert_cache, --+++++ # 0, --+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++ # expert_out --+++++ # ) --+++++ --+++++ # return expert_cache --+++++ --+++++ --++++ --++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++++ # """ --++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ --++++ # Initialize weights and apply final processing --++++ self.post_init() --+++++ self.warm_up = False --+++++ --+++++ def warmup_moe_model_deep(self): --+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++++ test_texts = [ --+++++ "warmup short", --+++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++++ ] --+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++ if tokenizer is None: --+++++ from mindnlp.transformers import AutoTokenizer --+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++ self._warmup_tokenizer = tokenizer --+++++ --+++++ for text in test_texts: --+++++ inputs = tokenizer(text, return_tensors="ms") --+++++ with mindspore._no_grad(): --+++++ _ = self(**inputs, use_cache=False) --+++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++++ --++++ def get_input_embeddings(self): --++++ return self.model.embed_tokens --++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++ ```""" --+++++ if not self.warm_up: --+++++ self.warm_up = True --+++++ self.warmup_moe_model_deep() --+++++ --++++ output_attentions = ( --++++ output_attentions --++++ if output_attentions is not None --++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++index 3cbf820e..d4c6b651 100644 --++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++@@ -18,7 +18,6 @@ --++++ # See the License for the specific language governing permissions and --++++ # limitations under the License. --++++ """MindSpore Qwen2MoE model.""" --++++- --++++ import math --++++ from typing import List, Optional, Tuple, Union --++++ --++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++++ TokenClassifierOutput, --++++ ) --++++ from ...modeling_utils import PreTrainedModel --+++++from ...generation import GenerationMixin --++++ from ....utils import logging --++++ from .configuration_qwen2_moe import Qwen2MoeConfig --++++ --++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++++ self.variance_epsilon = eps --++++ --++++ def forward(self, hidden_states): --+++++ # @dwj --+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++ # @lwx --+++++ # if not self.training : --+++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++ input_dtype = hidden_states.dtype --++++ hidden_states = hidden_states.to(mindspore.float32) --++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++++@@ -234,6 +239,8 @@ def rotate_half(x): --++++ """Rotates half the hidden dims of the input.""" --++++ x1 = x[..., : x.shape[-1] // 2] --++++ x2 = x[..., x.shape[-1] // 2 :] --+++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++ return ops.cat((-x2, x1), dim=-1) --++++ --++++ --++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++++ self.config = config --++++ self.hidden_size = config.hidden_size --++++ self.intermediate_size = intermediate_size --+++++ --++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++++ self.act_fn = ACT2FN[config.hidden_act] --++++ --++++ def forward(self, x): --++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++- --++++ --+++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++ # @lwx --+++++ # gate_up_output = self.gate_up_proj(x) --+++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++++ # return self.down_proj(swiglu_output) --+++++ --+++++ # def forward(self, x): --+++++ # gate_proj_out = self.gate_proj(x) --+++++ # up_proj_out = self.up_proj(x) --+++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++++ # return self.down_proj(swiglu_out) --+++++ --++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++++ """ --++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++++ use_cache: bool = False, --++++ cache_position: Optional[mindspore.Tensor] = None, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ --+++++ --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ query_states = self.q_proj(hidden_states) --++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++ "with a layer index." --++++ ) --++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ if isinstance(past_key_value, StaticCache): --+++++ kv_seq_len = key_states.shape[-2] --+++++ else: --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++ if isinstance(past_key_value, StaticCache): --+++++ kv_seq_len = key_states.shape[-2] --++++ --++++ # repeat k/v heads if n_kv_heads < n_heads --++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++- --+++++ --++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++ --++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++++- raise ValueError( --++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++++- f" {attn_weights.shape}" --++++- ) --++++- --++++- if attention_mask is not None: # no matter the length, we just slice it --++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++++ if attention_mask is not None: --+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++ attn_weights = attn_weights + causal_mask --++++ --++++ # upcast attention to fp32 --++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++ --++++ attn_output = self.o_proj(attn_output) --++++- --+++++ # @lwx --+++++ --+++++ # max_seq_len = self.max_position_embeddings # 2048 --+++++ --+++++ # if attention_mask is not None: --+++++ # # attention_mask: [B, 1, Sq, Sk] --+++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++ --+++++ # # pad 到 [max_seq_len, max_seq_len] --+++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++ # global_attention_mask = padded_mask --+++++ # else: --+++++ # global_attention_mask = None --+++++ --+++++ --+++++ # sparse_mode=3 --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # real_shift=None, --+++++ # padding_mask=None, --+++++ --+++++ # head_num=self.num_heads, --+++++ # attn_mask=global_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ # input_layout="BNSD", --+++++ # pre_tokens=2147483647, --+++++ # next_tokens=2147483647, --+++++ # inner_precise=0, --+++++ # drop_mask=None, --+++++ # prefix=None, --+++++ # actual_seq_qlen=None, --+++++ # actual_seq_kvlen=None, --+++++ # sparse_mode=sparse_mode, --+++++ # ) --++++ if not output_attentions: --++++ attn_weights = None --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++ --+++++class Qwen2MoeFlashAttention(nn.Module): --+++++ """ --+++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++ --+++++ 关键改动: --+++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++ 直接传入原始的 key 和 value 张量效率更高。 --+++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++ """ --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++ super().__init__() --+++++ self.config = config --+++++ self.layer_idx = layer_idx --+++++ self.hidden_size = config.hidden_size --+++++ self.num_heads = config.num_attention_heads --+++++ self.head_dim = self.hidden_size // self.num_heads --+++++ self.num_key_value_heads = config.num_key_value_heads --+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++ self.max_position_embeddings = config.max_position_embeddings --+++++ self.rope_theta = config.rope_theta --+++++ self.attention_dropout = config.attention_dropout --+++++ --+++++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++++ raise ValueError( --+++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++ ) --+++++ --+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++ --+++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++ self.head_dim, --+++++ max_position_embeddings=self.max_position_embeddings, --+++++ base=self.rope_theta, --+++++ ) --+++++ --+++++ def forward( --+++++ self, --+++++ hidden_states: mindspore.Tensor, --+++++ attention_mask: Optional[mindspore.Tensor] = None, --+++++ position_ids: Optional[mindspore.Tensor] = None, --+++++ past_key_value: Optional[Cache] = None, --+++++ output_attentions: bool = False, --+++++ use_cache: bool = False, --+++++ cache_position: Optional[mindspore.Tensor] = None, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # 1. 线性投射 Q, K, V --+++++ query_states = self.q_proj(hidden_states) --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++ # query: [B, S, H*D] -> [B, N1, S, D] --+++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # 3. RoPE 旋转位置编码 --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++ if self.layer_idx is None: --+++++ raise ValueError( --+++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ "with a layer index." --+++++ ) --+++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++ if cache_position.shape[0] == 1: --+++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++ kv_seq_len = past_seen_tokens + 1 --+++++ else: --+++++ # prefill 阶段:cache_position 是范围,使用其长度 --+++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++ else: --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # 4. KV 缓存更新 --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ key_states, value_states = past_key_value.update( --+++++ key_states, value_states, self.layer_idx, cache_kwargs --+++++ ) --+++++ --+++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++ if cache_position.shape[0] == 1: --+++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++ kv_seq_len = key_states.shape[-2] --+++++ --+++++ # 5. [重要] 准备 Attention Mask --+++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++ fa_attention_mask = None --+++++ if attention_mask is not None: --+++++ # 截取与当前key长度匹配的部分 --+++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++ fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++ input_dtype = query_states.dtype --+++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++ query_states = query_states.to(mindspore.float16) --+++++ key_states = key_states.to(mindspore.float16) --+++++ value_states = value_states.to(mindspore.float16) --+++++ --+++++ # 6. [核心] 调用 flash_attention_score 算子 --+++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++ attn_output = mindspore.ops.flash_attention_score( --+++++ query=query_states, --+++++ key=key_states, --+++++ value=value_states, --+++++ head_num=self.num_heads, # 传入Q的头数(N1) --+++++ attn_mask=fa_attention_mask, --+++++ keep_prob=1.0 - self.attention_dropout, --+++++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ input_layout="BNSD", --+++++ sparse_mode=0 # 使用 defaultMask 模式 --+++++ ) --+++++ --+++++ # 恢复原始数据类型 --+++++ attn_output = attn_output.to(input_dtype) --+++++ --+++++ # 7. 调整输出形状 --+++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = self.o_proj(attn_output) --+++++ --+++++ # FlashAttention 算子不直接返回注意力权重矩阵 --+++++ attn_weights = None --+++++ if output_attentions: --+++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ # def forward( --+++++ # self, --+++++ # hidden_states: mindspore.Tensor, --+++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++ # past_key_value: Optional[Cache] = None, --+++++ # output_attentions: bool = False, --+++++ # use_cache: bool = False, --+++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ # bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # # 1. 线性投射 Q, K, V --+++++ # query_states = self.q_proj(hidden_states) --+++++ # key_states = self.k_proj(hidden_states) --+++++ # value_states = self.v_proj(hidden_states) --+++++ --+++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # # 3. RoPE 旋转位置编码 --+++++ # kv_seq_len = key_states.shape[-2] --+++++ # if past_key_value is not None: --+++++ # if self.layer_idx is None: --+++++ # raise ValueError( --+++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ # "with a layer index." --+++++ # ) --+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # # 4. KV 缓存更新 --+++++ # if past_key_value is not None: --+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ # key_states, value_states = past_key_value.update( --+++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++ # ) --+++++ --+++++ # # 5. 准备 Attention Mask --+++++ # fa_attention_mask = None --+++++ # if attention_mask is not None: --+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++ # input_dtype = query_states.dtype --+++++ --+++++ # # 6. [核心] 调用 flash_attention_score 算子 --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # head_num=self.num_heads, --+++++ # attn_mask=fa_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ # input_layout="BNSD", --+++++ # sparse_mode=0, --+++++ # # <--- 修改点 2: 启用内部高精度计算 --- --+++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++ # inner_precise=1 --+++++ # ) --+++++ --+++++ # # 恢复原始数据类型 --+++++ # attn_output = attn_output.to(input_dtype) --+++++ --+++++ # # 7. 调整输出形状 --+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ # attn_output = self.o_proj(attn_output) --+++++ --+++++ # attn_weights = None --+++++ # if output_attentions: --+++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++ # return attn_output, attn_weights, past_key_value --+++++ --+++++ # def forward( --+++++ # self, --+++++ # hidden_states: mindspore.Tensor, --+++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++ # past_key_value: Optional[Cache] = None, --+++++ # output_attentions: bool = False, --+++++ # use_cache: bool = False, --+++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ # bsz, q_len, _ = hidden_states.shape --+++++ --+++++ # query_states = self.q_proj(hidden_states) --+++++ # key_states = self.k_proj(hidden_states) --+++++ # value_states = self.v_proj(hidden_states) --+++++ --+++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ # kv_seq_len = key_states.shape[-2] --+++++ # if past_key_value is not None: --+++++ # if self.layer_idx is None: --+++++ # raise ValueError("`layer_idx` must be specified for caching") --+++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ # if past_key_value is not None: --+++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ # key_states, value_states = past_key_value.update( --+++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++ # ) --+++++ --+++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++ --+++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++++ # query_states = query_states / math.sqrt(self.head_dim) --+++++ # # <--- 修改结束 --- --+++++ --+++++ # fa_attention_mask = None --+++++ # if attention_mask is not None: --+++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ # fa_attention_mask = (mask_slice != 0) --+++++ --+++++ # input_dtype = query_states.dtype --+++++ --+++++ # attn_output = mindspore.ops.flash_attention_score( --+++++ # query=query_states, # 传入已经预先缩放过的 query --+++++ # key=key_states, --+++++ # value=value_states, --+++++ # head_num=self.num_heads, --+++++ # attn_mask=fa_attention_mask, --+++++ # keep_prob=1.0 - self.attention_dropout, --+++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++++ # input_layout="BNSD", --+++++ # sparse_mode=0, --+++++ # inner_precise=1 # 仍然保持内部高精度计算 --+++++ # ) --+++++ --+++++ # attn_output = attn_output.to(input_dtype) --+++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ # attn_output = self.o_proj(attn_output) --+++++ --+++++ # attn_weights = None --+++++ # if output_attentions: --+++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++++ --+++++ # return attn_output, attn_weights, past_key_value --+++++ --++++ QWEN2MOE_ATTENTION_CLASSES = { --++++ "eager": Qwen2MoeAttention, --+++++ "flash-attention": Qwen2MoeFlashAttention, --++++ } --++++ --++++ --++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --+++++ #@dwj --+++++ # 只遍历激活的专家,而非全部专家 --++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- hidden_states = hidden_states.view(-1, hidden_dim) --++++- # router_logits: (batch * sequence_length, n_experts) --++++- router_logits = self.gate(hidden_states) --++++- --++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- if self.norm_topk_prob: --++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- # we cast back to the input dtype --++++- routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++- final_hidden_states = ops.zeros( --++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++++- ) --++++- --++++- # One hot encode the selected experts to create an expert mask --++++- # this will be used to easily index which expert is going to be sollicitated --++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++++- --++++- # Loop over all available experts in the model and perform the computation on each expert --++++- for expert_idx in range(self.num_experts): --++++- expert_layer = self.experts[expert_idx] --++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++++- --++++- # Index the correct hidden states and compute the expert hidden state for --++++- # the current expert. We need to make sure to multiply the output hidden --++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++++- if 0 not in idx.shape: --++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++++- --++++- # However `index_add_` only support torch tensors for indexing so we'll use --++++- # the `top_x` tensor here. --++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++++- --++++- shared_expert_output = self.shared_expert(hidden_states) --++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++++- --++++- final_hidden_states = final_hidden_states + shared_expert_output --+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++ num_tokens = hidden_states_reshaped.shape[0] --+++++ --+++++ router_logits = self.gate(hidden_states_reshaped) --+++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++ if self.norm_topk_prob: --+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++ flat_selected_experts = selected_experts.flatten() --+++++ --+++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++ token_indices = broadcasted_token_indices.flatten() --+++++ --+++++ active_experts = ops.unique(flat_selected_experts) --+++++ --+++++ for expert_idx_tensor in active_experts: --+++++ expert_idx = expert_idx_tensor.item() --+++++ expert_layer = self.experts[expert_idx] --+++++ --+++++ mask = (flat_selected_experts == expert_idx_tensor) --+++++ selected_token_indices = token_indices[mask] --+++++ selected_routing_weights = routing_weights.flatten()[mask] --+++++ --+++++ current_states = hidden_states_reshaped[selected_token_indices] --+++++ --+++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++ --+++++ final_hidden_states = final_hidden_states.index_add( --+++++ dim=0, --+++++ index=selected_token_indices, --+++++ source=expert_output.to(hidden_states.dtype) --+++++ ) --+++++ --+++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++ --++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++- return final_hidden_states, router_logits --+++++ final_hidden_states = final_hidden_states + shared_expert_output --+++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++ --+++++ return final_hidden_states, router_logits --++++ --++++ --++++ class Qwen2MoeDecoderLayer(nn.Module): --++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++++ --++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++ --++++ if (layer_idx not in config.mlp_only_layers) and ( --++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++++ ): --++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --++++ _skip_keys_device_placement = "past_key_values" --++++ _supports_cache_class = True --+++++#lwx --+++++ # _supports_static_cache = True --++++ --++++ def _init_weights(self, module): --++++ std = self.config.initializer_range --++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++++ return causal_mask --++++ --++++ --++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ _tied_weights_keys = ["lm_head.weight"] --++++ --++++ def __init__(self, config): --++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ self.num_experts_per_tok = config.num_experts_per_tok --++++ # Initialize weights and apply final processing --++++ self.post_init() --+++++ # @lwx --+++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++++ # self.generation_config.cache_implementation = "static" --+++++ self._warmed_up = False --+++++ --+++++ def warmup_moe_model(self): --+++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++++ test_texts = [ --+++++ "warmup short", --+++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++++ ] --+++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++ if tokenizer is None: --+++++ from mindnlp.transformers import AutoTokenizer --+++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++ self._warmup_tokenizer = tokenizer --+++++ --+++++ for text in test_texts: --+++++ inputs = tokenizer(text, return_tensors="ms") --+++++ with mindspore._no_grad(): --+++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --++++ --++++ def get_input_embeddings(self): --++++ return self.model.embed_tokens --++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++ ```""" --+++++ if not self._warmed_up: --+++++ self._warmed_up = True --+++++ self.warmup_moe_model() --++++ --++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++++ output_router_logits = ( --++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++ } --++++ ) --++++ return model_inputs --+++++# @lwx --+++++ # def _decode_one_tokens_logits( --+++++ # self, --+++++ # cur_token: mindspore.Tensor, --+++++ # input_pos: Optional[mindspore.Tensor], --+++++ # cache_position: mindspore.Tensor, --+++++ # past_key_values: StaticCache, --+++++ # ) -> mindspore.Tensor: --+++++ # """ --+++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++++ --+++++ # Args: --+++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++++ # input_pos: 输入位置信息,可选 --+++++ # cache_position: 当前token在cache中的位置,shape为(1,) --+++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++++ --+++++ # Returns: --+++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++++ # """ --+++++ # # 调用JIT编译的版本 --+++++ # return self.get_decode_one_tokens_logits( --+++++ # cur_token=cur_token, --+++++ # input_pos=input_pos, --+++++ # cache_position=cache_position, --+++++ # past_key_values=past_key_values, --+++++ # ) --+++++ --+++++ # @mindspore.jit(jit_level='O1') --+++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++++ # """ --+++++ # JIT编译的函数,用于高效的单token解码 --+++++ # 使用JIT编译优化以支持静态shape和高效执行 --+++++ --+++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++++ # """ --+++++ # outputs = self.model.forward( --+++++ # input_ids=cur_token, --+++++ # position_ids=input_pos, --+++++ # cache_position=cache_position, --+++++ # past_key_values=past_key_values, --+++++ # use_cache=True, --+++++ # return_dict=False, --+++++ # ) --+++++ --+++++ # hidden_states = outputs[0] --+++++ # logits = self.lm_head.forward(hidden_states) --+++++ # logits = logits.float() --+++++ --+++++ # return logits[:, -1, :] --+++++ --+++++ # def _sample( --+++++ # self, --+++++ # input_ids: mindspore.Tensor, --+++++ # logits_processor, --+++++ # stopping_criteria, --+++++ # generation_config, --+++++ # synced_devices: bool, --+++++ # streamer=None, --+++++ # logits_warper=None, --+++++ # **model_kwargs, --+++++ # ): --+++++ # """ --+++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++++ # """ --+++++ # from ...generation.logits_process import LogitsProcessorList --+++++ # from ...generation.stopping_criteria import StoppingCriteriaList --+++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++++ # from mindnlp.core import nn, ops, no_grad --+++++ # import numpy as np --+++++ --+++++ # # 检查是否使用 StaticCache --+++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++++ # # 否则,直接调用父类方法 --+++++ # past_key_values = model_kwargs.get("past_key_values") --+++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++++ --+++++ # if not isinstance(past_key_values, StaticCache): --+++++ # # 不使用 StaticCache,直接调用父类方法 --+++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++++ # return super()._sample( --+++++ # input_ids=input_ids, --+++++ # logits_processor=logits_processor, --+++++ # stopping_criteria=stopping_criteria, --+++++ # generation_config=generation_config, --+++++ # synced_devices=synced_devices, --+++++ # streamer=streamer, --+++++ # logits_warper=logits_warper, --+++++ # **model_kwargs, --+++++ # ) --+++++ --+++++ # # 使用 StaticCache,进入自定义循环 --+++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++++ # pad_token_id = generation_config._pad_token_tensor --+++++ # output_attentions = generation_config.output_attentions --+++++ # output_hidden_states = generation_config.output_hidden_states --+++++ # output_scores = generation_config.output_scores --+++++ # output_logits = generation_config.output_logits --+++++ # return_dict_in_generate = generation_config.return_dict_in_generate --+++++ # max_length = generation_config.max_length --+++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++++ # do_sample = generation_config.do_sample --+++++ --+++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++++ # raise ValueError( --+++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++++ # f"{logits_warper})." --+++++ # ) --+++++ --+++++ # # init attention / hidden states / scores tuples --+++++ # scores = () if (return_dict_in_generate and output_scores) else None --+++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++++ --+++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++++ # encoder_hidden_states = ( --+++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++++ # ) --+++++ --+++++ # # keep track of which sequences are already finished --+++++ # batch_size, cur_len = input_ids.shape --+++++ # this_peer_finished = False --+++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++++ --+++++ # time_record = [] --+++++ # from ....utils.testing_utils import parse_flag_from_env --+++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++++ --+++++ # while self._has_unfinished_sequences( --+++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++++ # ): --+++++ # if _record_time: --+++++ # import time as time_module --+++++ # infer_start = time_module.time() --+++++ --+++++ # # prepare model inputs --+++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++++ --+++++ # # prepare variable output controls --+++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++++ --+++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++++ # cur_cache_position = model_inputs.get("cache_position") --+++++ # cur_past_key_values = model_inputs.get("past_key_values") --+++++ # cur_input_ids = model_inputs.get("input_ids") --+++++ --+++++ # if (isinstance(cur_past_key_values, StaticCache) and --+++++ # cur_cache_position is not None and --+++++ # len(cur_cache_position.shape) > 0 and --+++++ # cur_cache_position.shape[0] == 1 and --+++++ # cur_input_ids is not None and --+++++ # cur_input_ids.shape[1] == 1): --+++++ # # 使用 JIT 优化的单 token 解码 --+++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++++ # if not hasattr(self, '_jit_used'): --+++++ # self._jit_used = False --+++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++++ --+++++ # next_token_logits = self.get_decode_one_tokens_logits( --+++++ # cur_token=cur_input_ids, --+++++ # input_pos=model_inputs.get("position_ids"), --+++++ # cache_position=cur_cache_position, --+++++ # past_key_values=cur_past_key_values, --+++++ # ) --+++++ --+++++ # # 标记已使用JIT(用于后续判断) --+++++ # if not self._jit_used: --+++++ # self._jit_used = True --+++++ --+++++ # # 构造兼容的输出对象 --+++++ # class JitOptimizedOutput: --+++++ # def __init__(self, logits, config): --+++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++++ # self.config = config --+++++ # # 对于 JIT 优化路径,这些属性通常不需要 --+++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++++ # self.attentions = None if not config.is_encoder_decoder else None --+++++ # self.cross_attentions = None --+++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++++ # self.hidden_states = None if not config.is_encoder_decoder else None --+++++ --+++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++++ # else: --+++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++++ # outputs = self(**model_inputs, return_dict=True) --+++++ --+++++ # if synced_devices and this_peer_finished: --+++++ # continue --+++++ --+++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++++ # next_token_logits = outputs.logits[:, -1, :] --+++++ --+++++ # # pre-process distribution --+++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++++ # if do_sample: --+++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++++ --+++++ # # Store scores, attentions and hidden_states when required --+++++ # if return_dict_in_generate: --+++++ # if output_scores: --+++++ # scores += (next_token_scores,) --+++++ # if output_logits: --+++++ # raw_logits += (next_token_logits,) --+++++ # if output_attentions: --+++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++++ # decoder_attentions += (attn,) if attn is not None else (None,) --+++++ # if self.config.is_encoder_decoder: --+++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++++ --+++++ # if output_hidden_states: --+++++ # hidden = ( --+++++ # outputs.decoder_hidden_states --+++++ # if self.config.is_encoder_decoder --+++++ # else outputs.hidden_states --+++++ # ) --+++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++++ --+++++ # # token selection --+++++ # if do_sample: --+++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++++ # else: --+++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++++ --+++++ # # finished sentences should have their next token be a padding token --+++++ # if has_eos_stopping_criteria: --+++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++++ --+++++ # # update generated ids, model inputs, and length for next step --+++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++++ # if streamer is not None: --+++++ # streamer.put(next_tokens) --+++++ --+++++ # model_kwargs = self._update_model_kwargs_for_generation( --+++++ # outputs, --+++++ # model_kwargs, --+++++ # is_encoder_decoder=self.config.is_encoder_decoder, --+++++ # ) --+++++ --+++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++++ # cur_len += 1 --+++++ --+++++ # if _record_time: --+++++ # import time as time_module --+++++ # infer_stop = time_module.time() --+++++ # time_record.append(infer_stop - infer_start) --+++++ --+++++ # del outputs --+++++ --+++++ # average_infer_time = None --+++++ # if time_record: --+++++ # if len(time_record) > 1: --+++++ # time_record.pop(0) --+++++ # average_infer_time = sum(time_record) / len(time_record) --+++++ # print(f'average inference time is: {average_infer_time}') --+++++ # print(f'inference time record: {time_record}') --+++++ --+++++ # if streamer is not None: --+++++ # streamer.end() --+++++ --+++++ # # 简单判断:打印是否使用了JIT路径 --+++++ # if hasattr(self, '_jit_used') and self._jit_used: --+++++ # print("[JIT] ✓ JIT optimization was used during generation") --+++++ # else: --+++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++++ --+++++ # if return_dict_in_generate: --+++++ # if self.config.is_encoder_decoder: --+++++ # return GenerateEncoderDecoderOutput( --+++++ # sequences=input_ids, --+++++ # scores=scores, --+++++ # logits=raw_logits, --+++++ # encoder_attentions=encoder_attentions, --+++++ # encoder_hidden_states=encoder_hidden_states, --+++++ # decoder_attentions=decoder_attentions, --+++++ # cross_attentions=cross_attentions, --+++++ # decoder_hidden_states=decoder_hidden_states, --+++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++ # average_infer_time=average_infer_time --+++++ # ) --+++++ # else: --+++++ # return GenerateDecoderOnlyOutput( --+++++ # sequences=input_ids, --+++++ # scores=scores, --+++++ # logits=raw_logits, --+++++ # attentions=decoder_attentions, --+++++ # hidden_states=decoder_hidden_states, --+++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++ # average_infer_time=average_infer_time --+++++ # ) --+++++ # else: --+++++ # return input_ids --+++++ --+++++ # def _prepare_cache_for_generation( --+++++ # self, --+++++ # generation_config, --+++++ # model_kwargs, --+++++ # assistant_model, --+++++ # batch_size, --+++++ # max_cache_length, --+++++ # ): --+++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++++ # generation_config.cache_implementation = "static" --+++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++++ --+++++ # if generation_config.cache_implementation == "static": --+++++ # base_required_from_max_length = generation_config.max_length + 1 --+++++ # base_required = max(max_cache_length, base_required_from_max_length) --+++++ # min_cache_size = 50 --+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++++ # else: --+++++ # max_cache_length = max(base_required, min_cache_size) --+++++ --+++++ # original_max_cache_length = max_cache_length --+++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++++ # print(f" - input max_cache_length: {original_max_cache_length}") --+++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++++ # print(f" - final max_cache_length: {max_cache_length}") --+++++ --+++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++ # if max_cache_length > self.config.max_position_embeddings: --+++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++++ --+++++ # result = super()._prepare_cache_for_generation( --+++++ # generation_config=generation_config, --+++++ # model_kwargs=model_kwargs, --+++++ # assistant_model=assistant_model, --+++++ # batch_size=batch_size, --+++++ # max_cache_length=max_cache_length, --+++++ # ) --+++++ --+++++ # if generation_config.cache_implementation == "static": --+++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++++ # created_cache = model_kwargs.get(cache_name) --+++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++++ # if created_cache.max_cache_len < generation_config.max_length: --+++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++++ --+++++ # return result --+++++ --+++++ --+++++ --++++ --++++ --++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++++-- --++++2.27.0 --++++ --+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+++new file mode 100644 --+++index 00000000..22b65dd5 --+++--- /dev/null --++++++ b/patches/0002-20251106commit.patch --+++@@ -0,0 +1,3200 @@ --++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --++++From: Pinoeer-kingxi <13022943007@163.com> --++++Date: Thu, 6 Nov 2025 09:20:38 +0800 --++++Subject: [PATCH 2/3] 20251106commit --++++ --++++--- --++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- --++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ --++++ 3 files changed, 2689 insertions(+), 305 deletions(-) --++++ create mode 100644 patches/0001-20251104commit.patch --++++ --++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++index d8303e45..73773c22 100644 --++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): --++++ # y = y + self.shared_experts(identity) --++++ # return y --++++ --+++++ # @no_grad() --+++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++ --+++++ # expert_cache = ops.zeros_like(x) --+++++ # for i in range(self.num_experts_per_tok): --+++++ # expert_id = flat_expert_indices[i].item() --+++++ # weight = flat_expert_weights[i].item() --+++++ # expert = self.experts[expert_id] --+++++ # expert_out = expert(x) --+++++ # expert_cache += expert_out * weight --+++++ # return expert_cache --+++++ --++++ @no_grad() --++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++ # x 的 shape: (1, hidden_size) --+++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+++++ --+++++ # 1. 收集所有需要的专家层 --+++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+++++ selected_experts = [self.experts[i] for i in flat_expert_indices] --+++++ --+++++ # 2. 并行计算所有专家的输出 --+++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+++++ # ops.cat 会将它们堆叠成一个新的 Tensor --+++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+++++ --+++++ # 3. 使用矩阵乘法进行加权求和 --+++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++++ # 最终结果 final_output 的 shape: (1, hidden_size) --+++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+++++ --+++++ return final_output --++++ --++++- expert_cache = ops.zeros_like(x) --++++- for i in range(self.num_experts_per_tok): --++++- expert_id = flat_expert_indices[i].item() --++++- weight = flat_expert_weights[i].item() --++++- expert = self.experts[expert_id] --++++- expert_out = expert(x) --++++- expert_cache += expert_out * weight --++++- return expert_cache --++++ --++++ @no_grad() --++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++ # @lwx --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --+++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --+++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --++++ --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): --++++ return attn_output, attn_weights, past_key_value --++++ --++++ --+++++# class DeepseekFlashAttention(nn.Module): --+++++# """ --+++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --+++++ --+++++# This class is designed as a drop-in replacement for DeepseekAttention. --+++++# """ --+++++ --+++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+++++# super().__init__() --+++++# self.config = config --+++++# self.layer_idx = layer_idx --+++++# if layer_idx is None: --+++++# logger.warning( --+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++++# "when creating this class." --+++++# ) --+++++ --+++++# self.attention_dropout = config.attention_dropout --+++++# self.hidden_size = config.hidden_size --+++++# self.num_heads = config.num_attention_heads --+++++# self.head_dim = self.hidden_size // self.num_heads --+++++# self.num_key_value_heads = config.num_key_value_heads --+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++# self.max_position_embeddings = config.max_position_embeddings --+++++# self.rope_theta = config.rope_theta --+++++# self.is_causal = True --+++++ --+++++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++++# raise ValueError( --+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++++# f" and `num_heads`: {self.num_heads})." --+++++# ) --+++++ --+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+++++# self._init_rope() --+++++ --+++++# def _init_rope(self): --+++++# if self.config.rope_scaling is None: --+++++# self.rotary_emb = DeepseekRotaryEmbedding( --+++++# self.head_dim, --+++++# max_position_embeddings=self.max_position_embeddings, --+++++# base=self.rope_theta, --+++++# ) --+++++# else: --+++++# scaling_type = self.config.rope_scaling["type"] --+++++# scaling_factor = self.config.rope_scaling["factor"] --+++++# if scaling_type == "linear": --+++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+++++# self.head_dim, --+++++# max_position_embeddings=self.max_position_embeddings, --+++++# scaling_factor=scaling_factor, --+++++# base=self.rope_theta, --+++++# ) --+++++# elif scaling_type == "dynamic": --+++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+++++# self.head_dim, --+++++# max_position_embeddings=self.max_position_embeddings, --+++++# scaling_factor=scaling_factor, --+++++# base=self.rope_theta, --+++++# ) --+++++# else: --+++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+++++ --+++++# def forward( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# attention_mask: Optional[mindspore.Tensor] = None, --+++++# position_ids: Optional[mindspore.Tensor] = None, --+++++# past_key_value: Optional[Cache] = None, --+++++# output_attentions: bool = False, --+++++# use_cache: bool = False, --+++++# **kwargs, --+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++# if "padding_mask" in kwargs: --+++++# warnings.warn( --+++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+++++# ) --+++++ --+++++# if output_attentions: --+++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --+++++ --+++++# bsz, q_len, _ = hidden_states.shape --+++++ --+++++# if self.config.pretraining_tp > 1: --+++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+++++ --+++++# query_states = self.q_proj(hidden_states) --+++++# key_states = self.k_proj(hidden_states) --+++++# value_states = self.v_proj(hidden_states) --+++++ --+++++# # Reshape for multi-head attention --+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++# kv_seq_len = key_states.shape[-2] --+++++# if past_key_value is not None: --+++++# if self.layer_idx is None: --+++++# raise ValueError( --+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++# "with a layer index." --+++++# ) --+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++# # Apply Rotary Positional Embedding --+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++# if past_key_value is not None: --+++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --+++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --+++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ --+++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++++ --+++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --+++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --+++++ --+++++# # Convert attention_mask for flash_attention_score --+++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --+++++# if attention_mask is not None: --+++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --+++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+++++# raise ValueError( --+++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+++++# ) --+++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --+++++# else: --+++++# attn_mask_for_fa = None --+++++ --+++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+++++ --+++++# # Call the fused flash_attention_score operator --+++++# attn_output = mindspore.ops.flash_attention_score( --+++++# query=query_states_for_fa, --+++++# key=key_states_for_fa, --+++++# value=value_states_for_fa, --+++++# head_num=self.num_heads, # This is N1, the number of query heads --+++++# input_layout='BSH', --+++++# attn_mask=attn_mask_for_fa, --+++++# keep_prob=keep_prob, --+++++# scalar_value=1.0 / math.sqrt(self.head_dim), --+++++# sparse_mode=0 # Default mask mode --+++++# ) --+++++ --+++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --+++++# attn_output = self.o_proj(attn_output) --+++++ --+++++# # Flash Attention does not return attention weights --+++++# attn_weights = None --+++++ --+++++# return attn_output, attn_weights, past_key_value --+++++ --+++++class DeepseekFlashAttention(nn.Module): --+++++ """ --+++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --+++++ This implementation is a drop-in replacement for the original DeepseekAttention class, --+++++ designed for high performance on supported hardware (Ascend). --+++++ --+++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --+++++ """ --+++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --+++++ super().__init__() --+++++ self.config = config --+++++ self.layer_idx = layer_idx --+++++ if layer_idx is None: --+++++ logger.warning( --+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++++ "when creating this class." --+++++ ) --+++++ --+++++ # --- [FIX] Correctly initialize all required attributes --- --+++++ self.attention_dropout = config.attention_dropout --+++++ self.hidden_size = config.hidden_size --+++++ self.num_heads = config.num_attention_heads --+++++ self.head_dim = self.hidden_size // self.num_heads --+++++ self.num_key_value_heads = config.num_key_value_heads --+++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++ self.max_position_embeddings = config.max_position_embeddings --+++++ self.rope_theta = config.rope_theta --+++++ self.is_causal = True --+++++ --+++++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++++ raise ValueError( --+++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++++ f" and `num_heads`: {self.num_heads})." --+++++ ) --+++++ --+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --+++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --+++++ --+++++ # This call will now succeed as all attributes are initialized. --+++++ self._init_rope() --+++++ --+++++ def _init_rope(self): --+++++ if self.config.rope_scaling is None: --+++++ self.rotary_emb = DeepseekRotaryEmbedding( --+++++ self.head_dim, --+++++ max_position_embeddings=self.max_position_embeddings, --+++++ base=self.rope_theta, --+++++ ) --+++++ else: --+++++ scaling_type = self.config.rope_scaling["type"] --+++++ scaling_factor = self.config.rope_scaling["factor"] --+++++ if scaling_type == "linear": --+++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --+++++ self.head_dim, --+++++ max_position_embeddings=self.max_position_embeddings, --+++++ scaling_factor=scaling_factor, --+++++ base=self.rope_theta, --+++++ ) --+++++ elif scaling_type == "dynamic": --+++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --+++++ self.head_dim, --+++++ max_position_embeddings=self.max_position_embeddings, --+++++ scaling_factor=scaling_factor, --+++++ base=self.rope_theta, --+++++ ) --+++++ else: --+++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --+++++ --+++++ def forward( --+++++ self, --+++++ hidden_states: mindspore.Tensor, --+++++ attention_mask: Optional[mindspore.Tensor] = None, --+++++ position_ids: Optional[mindspore.Tensor] = None, --+++++ past_key_value: Optional[Cache] = None, --+++++ output_attentions: bool = False, --+++++ use_cache: bool = False, --+++++ **kwargs, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ if "padding_mask" in kwargs: --+++++ warnings.warn( --+++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --+++++ ) --+++++ if output_attentions: --+++++ warnings.warn( --+++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --+++++ ) --+++++ --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ if self.config.pretraining_tp > 1: --+++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --+++++ --+++++ query_states = self.q_proj(hidden_states) --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++ # Apply Rotary Position Embedding --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos} --+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --+++++ # So we must explicitly repeat the KV heads. --+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++ --+++++ # Convert attention mask for flash_attention_score --+++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --+++++ if attention_mask is not None: --+++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --+++++ raise ValueError( --+++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --+++++ ) --+++++ attn_mask_for_fa = attention_mask < 0 --+++++ else: --+++++ attn_mask_for_fa = None --+++++ --+++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --+++++ --+++++ # Call the fused operator using the efficient BNSD layout --+++++ attn_output = mindspore.ops.flash_attention_score( --+++++ query=query_states, --+++++ key=key_states, --+++++ value=value_states, --+++++ head_num=self.num_heads, --+++++ input_layout='BNSD', # Specify the correct layout --+++++ attn_mask=attn_mask_for_fa, --+++++ keep_prob=keep_prob, --+++++ scalar_value=1.0 / math.sqrt(self.head_dim) --+++++ ) --+++++ --+++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ --+++++ # Apply output projection --+++++ attn_output = self.o_proj(attn_output) --+++++ --+++++ # Flash attention does not return attention weights, so we return None. --+++++ attn_weights = None --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --++++ Deepseek_ATTENTION_CLASSES = { --++++ "eager": DeepseekAttention, --+++++ "flash-attention": DeepseekFlashAttention, --++++ } --++++ --++++ --++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): --++++ config=config, layer_idx=layer_idx --++++ ) --++++ --+++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --+++++ config=config, layer_idx=layer_idx --+++++ ) --+++++ --++++ self.mlp = ( --++++ DeepseekMoE(config) --++++ if ( --++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++index d4c6b651..bced285c 100644 --++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union --++++ --++++ import mindspore --++++ import mindnlp.core.nn.functional as F --++++-from mindnlp.core import nn, ops --+++++from mindnlp.core import nn, ops, no_grad --++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss --++++ --++++ from ....common.activations import ACT2FN --++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) --++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --++++ --+++++Long_Prompt = False --+++++PROMPT_LENGTH_THRESHOLD = 128 --++++ --++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --++++ def _prepare_4d_causal_attention_mask_with_cache_position( --++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): --++++ return attn_output, attn_weights, past_key_value --++++ --++++ --+++++# class Qwen2MoeFlashAttention(nn.Module): --+++++# """ --+++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++ --+++++# 关键改动: --+++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++# 直接传入原始的 key 和 value 张量效率更高。 --+++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++# super().__init__() --+++++# self.config = config --+++++# self.layer_idx = layer_idx --+++++# self.hidden_size = config.hidden_size --+++++# self.num_heads = config.num_attention_heads --+++++# self.head_dim = self.hidden_size // self.num_heads --+++++# self.num_key_value_heads = config.num_key_value_heads --+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++# self.max_position_embeddings = config.max_position_embeddings --+++++# self.rope_theta = config.rope_theta --+++++# self.attention_dropout = config.attention_dropout --+++++ --+++++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++++# raise ValueError( --+++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++# ) --+++++ --+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++ --+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++# self.head_dim, --+++++# max_position_embeddings=self.max_position_embeddings, --+++++# base=self.rope_theta, --+++++# ) --+++++ --+++++# def forward( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# attention_mask: Optional[mindspore.Tensor] = None, --+++++# position_ids: Optional[mindspore.Tensor] = None, --+++++# past_key_value: Optional[Cache] = None, --+++++# output_attentions: bool = False, --+++++# use_cache: bool = False, --+++++# cache_position: Optional[mindspore.Tensor] = None, --+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++# bsz, q_len, _ = hidden_states.shape --+++++ --+++++# # 1. 线性投射 Q, K, V --+++++# query_states = self.q_proj(hidden_states) --+++++# key_states = self.k_proj(hidden_states) --+++++# value_states = self.v_proj(hidden_states) --+++++ --+++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++# # query: [B, S, H*D] -> [B, N1, S, D] --+++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++# # 3. RoPE 旋转位置编码 --+++++# kv_seq_len = key_states.shape[-2] --+++++# if past_key_value is not None: --+++++# if self.layer_idx is None: --+++++# raise ValueError( --+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++# "with a layer index." --+++++# ) --+++++# # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++# if cache_position.shape[0] == 1: --+++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++# kv_seq_len = past_seen_tokens + 1 --+++++# else: --+++++# # prefill 阶段:cache_position 是范围,使用其长度 --+++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++# else: --+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++# # 4. KV 缓存更新 --+++++# if past_key_value is not None: --+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++# key_states, value_states = past_key_value.update( --+++++# key_states, value_states, self.layer_idx, cache_kwargs --+++++# ) --+++++ --+++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++# if cache_position.shape[0] == 1: --+++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++# kv_seq_len = key_states.shape[-2] --+++++ --+++++# # 5. [重要] 准备 Attention Mask --+++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++# fa_attention_mask = None --+++++# if attention_mask is not None: --+++++# # 截取与当前key长度匹配的部分 --+++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++# # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++# fa_attention_mask = (mask_slice != 0) --+++++ --+++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++# input_dtype = query_states.dtype --+++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++# query_states = query_states.to(mindspore.float16) --+++++# key_states = key_states.to(mindspore.float16) --+++++# value_states = value_states.to(mindspore.float16) --+++++ --+++++# # 6. [核心] 调用 flash_attention_score 算子 --+++++# # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++# attn_output = mindspore.ops.flash_attention_score( --+++++# query=query_states, --+++++# key=key_states, --+++++# value=value_states, --+++++# head_num=self.num_heads, # 传入Q的头数(N1) --+++++# attn_mask=fa_attention_mask, --+++++# keep_prob=1.0 - self.attention_dropout, --+++++# scalar_value=1.0 / math.sqrt(self.head_dim), --+++++# input_layout="BNSD", --+++++# sparse_mode=0 # 使用 defaultMask 模式 --+++++# ) --+++++ --+++++# # 恢复原始数据类型 --+++++# attn_output = attn_output.to(input_dtype) --+++++ --+++++# # 7. 调整输出形状 --+++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++# attn_output = self.o_proj(attn_output) --+++++ --+++++# # FlashAttention 算子不直接返回注意力权重矩阵 --+++++# attn_weights = None --+++++# if output_attentions: --+++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++# return attn_output, attn_weights, past_key_value --+++++ --+++++# # def forward( --+++++# # self, --+++++# # hidden_states: mindspore.Tensor, --+++++# # attention_mask: Optional[mindspore.Tensor] = None, --+++++# # position_ids: Optional[mindspore.Tensor] = None, --+++++# # past_key_value: Optional[Cache] = None, --+++++# # output_attentions: bool = False, --+++++# # use_cache: bool = False, --+++++# # cache_position: Optional[mindspore.Tensor] = None, --+++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++# # bsz, q_len, _ = hidden_states.shape --+++++ --+++++# # # 1. 线性投射 Q, K, V --+++++# # query_states = self.q_proj(hidden_states) --+++++# # key_states = self.k_proj(hidden_states) --+++++# # value_states = self.v_proj(hidden_states) --+++++ --+++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --+++++# # # 3. RoPE 旋转位置编码 --+++++# # kv_seq_len = key_states.shape[-2] --+++++# # if past_key_value is not None: --+++++# # if self.layer_idx is None: --+++++# # raise ValueError( --+++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++# # "with a layer index." --+++++# # ) --+++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++# # # 4. KV 缓存更新 --+++++# # if past_key_value is not None: --+++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++# # key_states, value_states = past_key_value.update( --+++++# # key_states, value_states, self.layer_idx, cache_kwargs --+++++# # ) --+++++ --+++++# # # 5. 准备 Attention Mask --+++++# # fa_attention_mask = None --+++++# # if attention_mask is not None: --+++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++# # fa_attention_mask = (mask_slice != 0) --+++++ --+++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++# # input_dtype = query_states.dtype --+++++ --+++++# # # 6. [核心] 调用 flash_attention_score 算子 --+++++# # attn_output = mindspore.ops.flash_attention_score( --+++++# # query=query_states, --+++++# # key=key_states, --+++++# # value=value_states, --+++++# # head_num=self.num_heads, --+++++# # attn_mask=fa_attention_mask, --+++++# # keep_prob=1.0 - self.attention_dropout, --+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++# # input_layout="BNSD", --+++++# # sparse_mode=0, --+++++# # # <--- 修改点 2: 启用内部高精度计算 --- --+++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++# # inner_precise=1 --+++++# # ) --+++++ --+++++# # # 恢复原始数据类型 --+++++# # attn_output = attn_output.to(input_dtype) --+++++ --+++++# # # 7. 调整输出形状 --+++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++# # attn_output = self.o_proj(attn_output) --+++++ --+++++# # attn_weights = None --+++++# # if output_attentions: --+++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ --+++++# # return attn_output, attn_weights, past_key_value --+++++ --+++++ --++++ class Qwen2MoeFlashAttention(nn.Module): --++++ """ --++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++- --++++- 关键改动: --++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++- 直接传入原始的 key 和 value 张量效率更高。 --++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --+++++ --+++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --+++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --+++++ 完全使用模型的低精度数据类型(如 float16)进行计算, --+++++ 以达到理论上的最高执行速度。 --++++ """ --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++ super().__init__() --++++ self.config = config --++++ self.layer_idx = layer_idx --+++++ if layer_idx is None: --+++++ logger.warning_once( --+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --+++++ ) --+++++ --++++ self.hidden_size = config.hidden_size --++++ self.num_heads = config.num_attention_heads --++++ self.head_dim = self.hidden_size // self.num_heads --++++ self.num_key_value_heads = config.num_key_value_heads --++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++ self.max_position_embeddings = config.max_position_embeddings --++++ self.rope_theta = config.rope_theta --++++ self.attention_dropout = config.attention_dropout --++++ --++++- if (self.head_dim * self.num_heads) != self.hidden_size: --++++- raise ValueError( --++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++- ) --++++- --++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++- # query: [B, S, H*D] -> [B, N1, S, D] --++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++ # 2. 调整形状以匹配 BNSD 布局 --++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- --++++- # 3. RoPE 旋转位置编码 --+++++ --+++++ # 3. RoPE 和 KV 缓存 --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++- if self.layer_idx is None: --++++- raise ValueError( --++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++- "with a layer index." --++++- ) --++++- # 对于 StaticCache,需要特殊处理 kv_seq_len --++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++- if cache_position.shape[0] == 1: --++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++- kv_seq_len = past_seen_tokens + 1 --++++- else: --++++- # prefill 阶段:cache_position 是范围,使用其长度 --++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++- else: --++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++- --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++- # 4. KV 缓存更新 --++++ if past_key_value is not None: --++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++- key_states, value_states = past_key_value.update( --++++- key_states, value_states, self.layer_idx, cache_kwargs --++++- ) --++++- --++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++- if cache_position.shape[0] == 1: --++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++- kv_seq_len = key_states.shape[-2] --++++- --++++- # 5. [重要] 准备 Attention Mask --++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++ # 4. 准备 Attention Mask --++++ fa_attention_mask = None --++++ if attention_mask is not None: --++++- # 截取与当前key长度匹配的部分 --++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++- # 转换为布尔类型: 大负数 -> True, 0 -> False --++++ fa_attention_mask = (mask_slice != 0) --++++ --++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++- input_dtype = query_states.dtype --++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++- query_states = query_states.to(mindspore.float16) --++++- key_states = key_states.to(mindspore.float16) --++++- value_states = value_states.to(mindspore.float16) --++++- --++++- # 6. [核心] 调用 flash_attention_score 算子 --++++- # - 无需手动 repeat_kv, 算子原生支持 GQA --++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 --++++ attn_output = mindspore.ops.flash_attention_score( --++++ query=query_states, --++++ key=key_states, --++++ value=value_states, --++++- head_num=self.num_heads, # 传入Q的头数(N1) --+++++ head_num=self.num_heads, --++++ attn_mask=fa_attention_mask, --++++- keep_prob=1.0 - self.attention_dropout, --+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout --++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++ input_layout="BNSD", --++++- sparse_mode=0 # 使用 defaultMask 模式 --+++++ sparse_mode=0, --+++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 --++++ ) --++++ --++++- # 恢复原始数据类型 --++++- attn_output = attn_output.to(input_dtype) --++++- --++++- # 7. 调整输出形状 --++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++ # 6. 调整输出形状 --++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++ attn_output = self.o_proj(attn_output) --++++ --++++- # FlashAttention 算子不直接返回注意力权重矩阵 --+++++ # 7. 返回结果 --++++ attn_weights = None --++++ if output_attentions: --++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++- # def forward( --++++- # self, --++++- # hidden_states: mindspore.Tensor, --++++- # attention_mask: Optional[mindspore.Tensor] = None, --++++- # position_ids: Optional[mindspore.Tensor] = None, --++++- # past_key_value: Optional[Cache] = None, --++++- # output_attentions: bool = False, --++++- # use_cache: bool = False, --++++- # cache_position: Optional[mindspore.Tensor] = None, --++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++- --++++- # bsz, q_len, _ = hidden_states.shape --++++- --++++- # # 1. 线性投射 Q, K, V --++++- # query_states = self.q_proj(hidden_states) --++++- # key_states = self.k_proj(hidden_states) --++++- # value_states = self.v_proj(hidden_states) --++++- --++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- --++++- # # 3. RoPE 旋转位置编码 --++++- # kv_seq_len = key_states.shape[-2] --++++- # if past_key_value is not None: --++++- # if self.layer_idx is None: --++++- # raise ValueError( --++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++- # "with a layer index." --++++- # ) --++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++ --++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++- --++++- # # 4. KV 缓存更新 --++++- # if past_key_value is not None: --++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++- # key_states, value_states = past_key_value.update( --++++- # key_states, value_states, self.layer_idx, cache_kwargs --++++- # ) --++++- --++++- # # 5. 准备 Attention Mask --++++- # fa_attention_mask = None --++++- # if attention_mask is not None: --++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++- # fa_attention_mask = (mask_slice != 0) --++++- --++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++- # input_dtype = query_states.dtype --++++- --++++- # # 6. [核心] 调用 flash_attention_score 算子 --++++- # attn_output = mindspore.ops.flash_attention_score( --++++- # query=query_states, --++++- # key=key_states, --++++- # value=value_states, --++++- # head_num=self.num_heads, --++++- # attn_mask=fa_attention_mask, --++++- # keep_prob=1.0 - self.attention_dropout, --++++- # scalar_value=1.0 / math.sqrt(self.head_dim), --++++- # input_layout="BNSD", --++++- # sparse_mode=0, --++++- # # <--- 修改点 2: 启用内部高精度计算 --- --++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++- # inner_precise=1 --++++- # ) --++++- --++++- # # 恢复原始数据类型 --++++- # attn_output = attn_output.to(input_dtype) --+++++QWEN2MOE_ATTENTION_CLASSES = { --+++++ "eager": Qwen2MoeAttention, --+++++ "flash-attention": Qwen2MoeFlashAttention, --+++++} --++++ --++++- # # 7. 调整输出形状 --++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++- # attn_output = self.o_proj(attn_output) --++++ --++++- # attn_weights = None --++++- # if output_attentions: --++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# def __init__(self, config): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# # gating --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++ --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# #@dwj --+++++# # 只遍历激活的专家,而非全部专家 --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# num_tokens = hidden_states_reshaped.shape[0] --+++++ --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++# routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++# flat_selected_experts = selected_experts.flatten() --+++++ --+++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++# token_indices = broadcasted_token_indices.flatten() --+++++ --+++++# active_experts = ops.unique(flat_selected_experts) --+++++ --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++ --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# selected_token_indices = token_indices[mask] --+++++# selected_routing_weights = routing_weights.flatten()[mask] --+++++ --+++++# current_states = hidden_states_reshaped[selected_token_indices] --+++++ --+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++ --+++++# final_hidden_states = final_hidden_states.index_add( --+++++# dim=0, --+++++# index=selected_token_indices, --+++++# source=expert_output.to(hidden_states.dtype) --+++++# ) --+++++ --+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++ --++++- # return attn_output, attn_weights, past_key_value --+++++# final_hidden_states = final_hidden_states + shared_expert_output --+++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --+++++ --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# """ --+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+++++# `_moe_infer_prefill` (用于长序列处理) 方法。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# # 门控网络 --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# # 专家列表 --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++# # 共享专家 --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_decode( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# """ --+++++# 【解码路径】针对 sequence_length=1 的极致优化。 --+++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+++++# """ --+++++# batch_size, hidden_dim = hidden_states.shape --+++++ --+++++# expert_outputs_list = [ --+++++# ops.cat([ --+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++# ], dim=0) --+++++# for i in range(batch_size) --+++++# ] --+++++ --+++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+++++# # shape: (batch_size, top_k, hidden_dim) --+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++ --+++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++ --+++++# return moe_output.squeeze(1) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_prefill( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# """ --+++++# 【预填充路径】针对 sequence_length > 1 的优化。 --+++++# 按专家对 Token 进行分组,并进行批处理。 --+++++# """ --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens = hidden_states.shape[0] --+++++# flat_selected_experts = selected_experts.flatten() --+++++ --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++ --+++++# active_experts = ops.unique(flat_selected_experts) --+++++ --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++ --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# selected_token_indices = token_indices[mask] --+++++# selected_routing_weights = routing_weights.flatten()[mask] --+++++ --+++++# current_states = hidden_states[selected_token_indices] --+++++ --+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++ --+++++# moe_output = moe_output.index_add( --+++++# dim=0, --+++++# index=selected_token_indices, --+++++# source=expert_output.to(hidden_states.dtype) --+++++# ) --+++++# return moe_output --+++++ --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# """ --+++++# 顶层 forward 方法,作为智能分发器。 --+++++# """ --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++- # def forward( --++++- # self, --++++- # hidden_states: mindspore.Tensor, --++++- # attention_mask: Optional[mindspore.Tensor] = None, --++++- # position_ids: Optional[mindspore.Tensor] = None, --++++- # past_key_value: Optional[Cache] = None, --++++- # output_attentions: bool = False, --++++- # use_cache: bool = False, --++++- # cache_position: Optional[mindspore.Tensor] = None, --++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++- --++++- # bsz, q_len, _ = hidden_states.shape --++++- --++++- # query_states = self.q_proj(hidden_states) --++++- # key_states = self.k_proj(hidden_states) --++++- # value_states = self.v_proj(hidden_states) --++++- --++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- --++++- # kv_seq_len = key_states.shape[-2] --++++- # if past_key_value is not None: --++++- # if self.layer_idx is None: --++++- # raise ValueError("`layer_idx` must be specified for caching") --++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++- --++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++- --++++- # if past_key_value is not None: --++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++- # key_states, value_states = past_key_value.update( --++++- # key_states, value_states, self.layer_idx, cache_kwargs --++++- # ) --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++# routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++# moe_output = None --+++++# # 在推理时,根据序列长度选择最优路径 --+++++# if not self.training: --+++++# if sequence_length == 1: --+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++++# else: --+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++++# else: --+++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+++++# raise NotImplementedError("Training path is not implemented.") --+++++ --+++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+++++ --+++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+++++ --+++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --+++++ --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# """ --+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# # 门控网络 --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# # 专家列表 --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++# # 共享专家 --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_decode( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# batch_size, _ = hidden_states.shape --+++++# expert_outputs_list = [ --+++++# ops.cat([ --+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++# ], dim=0) --+++++# for i in range(batch_size) --+++++# ] --+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++# return moe_output.squeeze(1) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_prefill( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens = hidden_states.shape[0] --+++++# flat_selected_experts = selected_experts.flatten() --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++# active_experts = ops.unique(flat_selected_experts) --+++++ --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# selected_token_indices = token_indices[mask] --+++++# selected_routing_weights = routing_weights.flatten()[mask] --+++++# current_states = hidden_states[selected_token_indices] --+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++# moe_output = moe_output.index_add( --+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++++# ) --+++++# return moe_output --+++++ --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# """ --+++++# 顶层 forward 方法,作为智能分发器。 --+++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+++++# """ --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ --+++++# # 1. 门控计算 (通用逻辑) --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++# routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++# # 2. 智能分发到最优 MoE 路径 --+++++# moe_output = None --+++++# if not self.training: --+++++# if sequence_length == 1: --+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++++# else: --+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++++# else: --+++++# raise NotImplementedError("Training path is not implemented.") --+++++ --+++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++ --+++++# # 4. 合并 MoE 输出和共享专家输出 --+++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++ --+++++# # 5. 恢复原始形状并返回 --+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --+++++# prefill fastest --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# """ --+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# # 门控网络 --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# # 专家列表 --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++# # 共享专家 --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_dispatch( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# """ --+++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+++++# """ --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens, _ = hidden_states.shape --+++++ --+++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+++++# flat_selected_experts = selected_experts.flatten() --+++++# flat_routing_weights = routing_weights.flatten() --++++ --++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++- --++++- # # <--- 核心修改点: 手动进行高精度缩放 --- --++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++- # query_states = query_states / math.sqrt(self.head_dim) --++++- # # <--- 修改结束 --- --++++- --++++- # fa_attention_mask = None --++++- # if attention_mask is not None: --++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++- # fa_attention_mask = (mask_slice != 0) --++++- --++++- # input_dtype = query_states.dtype --++++- --++++- # attn_output = mindspore.ops.flash_attention_score( --++++- # query=query_states, # 传入已经预先缩放过的 query --++++- # key=key_states, --++++- # value=value_states, --++++- # head_num=self.num_heads, --++++- # attn_mask=fa_attention_mask, --++++- # keep_prob=1.0 - self.attention_dropout, --++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++- # input_layout="BNSD", --++++- # sparse_mode=0, --++++- # inner_precise=1 # 仍然保持内部高精度计算 --++++- # ) --+++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++ --++++- # attn_output = attn_output.to(input_dtype) --++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++- # attn_output = self.o_proj(attn_output) --+++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+++++# active_experts = ops.unique(flat_selected_experts) --+++++ --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++ --+++++# # 找到所有分配给该专家的 token --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++ --+++++# # 使用 mask 选取对应的 token 和权重 --+++++# current_token_indices = token_indices[mask] --+++++# current_routing_weights = flat_routing_weights[mask] --+++++# current_hidden_states = hidden_states[current_token_indices] --+++++ --+++++# # 对这些 token 进行批处理 --+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++++ --+++++# # 使用 index_add 将结果精确地加回到对应位置 --+++++# moe_output = moe_output.index_add( --+++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+++++# ) --+++++# return moe_output --+++++ --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# """ --+++++# 顶层 forward 方法,作为智能分发器。 --+++++# """ --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ --+++++# # 1. 门控计算 --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++# routing_weights = routing_weights.to(hidden_states.dtype) --+++++ --+++++# # 2. 调用统一的 MoE 计算内核 --+++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --++++ --++++- # attn_weights = None --++++- # if output_attentions: --++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++++# # 3. 统一处理共享专家 --+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++ --+++++# # 4. 合并输出 --+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++ --+++++# # 5. 恢复原始形状并返回 --+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --+++++ --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# """ --+++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++# 【最终高性能与高精度版】: --+++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+++++# 3. 这样实现了速度和准确性的两全其美。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_decode( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# """ --+++++# 【解码路径】极致优化版:bmm + 高精度累加。 --+++++# """ --+++++# original_dtype = hidden_states.dtype --+++++# batch_size, _ = hidden_states.shape --+++++ --+++++# expert_outputs_list = [ --+++++# ops.cat([ --+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++# ], dim=0) --+++++# for i in range(batch_size) --+++++# ] --+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++ --+++++# # 在 float32 下执行 bmm,得到高精度结果 --+++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++ --+++++# # 将高精度结果转换回原始数据类型 --+++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+++++ --+++++# return moe_output --+++++ --+++++# @no_grad() --+++++# def _moe_infer_prefill( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# selected_experts: mindspore.Tensor, --+++++# routing_weights: mindspore.Tensor --+++++# ) -> mindspore.Tensor: --+++++# """ --+++++# 【预填充路径】与原始实现一致,结果精确。 --+++++# """ --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens, _ = hidden_states.shape --+++++# flat_selected_experts = selected_experts.flatten() --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++# active_experts = ops.unique(flat_selected_experts) --+++++ --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# selected_token_indices = token_indices[mask] --+++++# selected_routing_weights = routing_weights.flatten()[mask] --+++++# current_states = hidden_states[selected_token_indices] --+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++# moe_output = moe_output.index_add( --+++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++++# ) --+++++# return moe_output --+++++ --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++ --++++- # return attn_output, attn_weights, past_key_value --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+++++# # 如果模型主体是 float16,后续再转换 --+++++ --+++++# moe_output = None --+++++# if not self.training: --+++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+++++# # _moe_infer_decode 内部会处理好类型转换 --+++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+++++# if sequence_length == 1: --+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++++# else: --+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++++# else: --+++++# raise NotImplementedError("Training path is not implemented.") --+++++ --+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++ --+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --++++ --++++-QWEN2MOE_ATTENTION_CLASSES = { --++++- "eager": Qwen2MoeAttention, --++++- "flash-attention": Qwen2MoeFlashAttention, --++++-} --+++++# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++# """ --+++++# 【融合版】一个混合专家模块,内置两种推理策略, --+++++# 由外部全局变量 `Long_Prompt` 控制: --+++++ --+++++# - if Long_Prompt is True: 【精度优先模式】 --+++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+++++# 适用于处理长序列,避免误差累积。 --+++++ --+++++# - if Long_Prompt is False: 【速度优先模式】 --+++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+++++# 在解码阶段获得极致速度,同时保证结果高度准确。 --+++++# """ --+++++# def __init__(self, config: Qwen2MoeConfig): --+++++# super().__init__() --+++++# self.num_experts = config.num_experts --+++++# self.top_k = config.num_experts_per_tok --+++++# self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++# self.experts = nn.ModuleList( --+++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++# ) --+++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++# # --- 速度优先模式的辅助函数 --- --+++++# @no_grad() --+++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++# original_dtype = hidden_states.dtype --+++++# batch_size, _ = hidden_states.shape --+++++# expert_outputs_list = [ --+++++# ops.cat([ --+++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++# ], dim=0) --+++++# for i in range(batch_size) --+++++# ] --+++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++# weights_fp32 = routing_weights.to(mindspore.float32) --+++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++++# return moe_output_fp32.squeeze(1).to(original_dtype) --+++++ --+++++# @no_grad() --+++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens, _ = hidden_states.shape --+++++# flat_selected_experts = selected_experts.flatten() --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++# active_experts = ops.unique(flat_selected_experts) --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# selected_token_indices = token_indices[mask] --+++++# selected_routing_weights = routing_weights.flatten()[mask] --+++++# current_states = hidden_states[selected_token_indices] --+++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++# return moe_output --+++++ --+++++# # --- 精度优先模式的辅助函数 --- --+++++# @no_grad() --+++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++# moe_output = ops.zeros_like(hidden_states) --+++++# num_tokens, _ = hidden_states.shape --+++++# flat_selected_experts = selected_experts.flatten() --+++++# flat_routing_weights = routing_weights.flatten() --+++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++# active_experts = ops.unique(flat_selected_experts) --+++++# for expert_idx_tensor in active_experts: --+++++# expert_idx = expert_idx_tensor.item() --+++++# expert_layer = self.experts[expert_idx] --+++++# mask = (flat_selected_experts == expert_idx_tensor) --+++++# current_token_indices = token_indices[mask] --+++++# current_routing_weights = flat_routing_weights[mask] --+++++# current_hidden_states = hidden_states[current_token_indices] --+++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++# return moe_output --+++++ --+++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++# # 声明我们将要使用一个在模块外部定义的全局变量 --+++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+++++# global Long_Prompt --+++++ --+++++# # 1. 门控计算 (所有模式通用) --+++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++# router_logits = self.gate(hidden_states_reshaped) --+++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++++# if self.norm_topk_prob: --+++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++# moe_output = None --+++++# if not self.training: --+++++# # 根据 Long_Prompt 标志选择模式 --+++++# if Long_Prompt: --+++++# # --- 精度优先模式 --- --+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++# else: --+++++# # --- 速度优先模式 --- --+++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++# if sequence_length == 1: --+++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++# else: --+++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++# else: --+++++# raise NotImplementedError("Training path is not implemented.") --+++++ --+++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++ --+++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++# return final_hidden_states, router_logits --+++++ --+++++class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ """ --+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+++++ 控制的顶级推理策略: --++++ --+++++ - if Long_Prompt is True: 【精度优先模式】 --+++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --+++++ 适用于需要严格可复现性的长序列任务。 --++++ --++++-class Qwen2MoeSparseMoeBlock(nn.Module): --++++- def __init__(self, config): --+++++ - if Long_Prompt is False: 【速度优先模式】 --+++++ 采用业界最强的性能组合: --+++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --+++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --+++++ """ --+++++ def __init__(self, config: Qwen2MoeConfig): --++++ super().__init__() --++++ self.num_experts = config.num_experts --++++ self.top_k = config.num_experts_per_tok --++++ self.norm_topk_prob = config.norm_topk_prob --++++ --++++- # gating --++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++ self.experts = nn.ModuleList( --++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++ ) --++++- --++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++ --++++- #@dwj --++++- # 只遍历激活的专家,而非全部专家 --++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++- num_tokens = hidden_states_reshaped.shape[0] --++++- --++++- router_logits = self.gate(hidden_states_reshaped) --++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- --++++- if self.norm_topk_prob: --++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++- flat_selected_experts = selected_experts.flatten() --++++- --++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++- token_indices = broadcasted_token_indices.flatten() --++++- --++++- active_experts = ops.unique(flat_selected_experts) --++++- --++++- for expert_idx_tensor in active_experts: --++++- expert_idx = expert_idx_tensor.item() --++++- expert_layer = self.experts[expert_idx] --++++- --++++- mask = (flat_selected_experts == expert_idx_tensor) --++++- selected_token_indices = token_indices[mask] --++++- selected_routing_weights = routing_weights.flatten()[mask] --++++- --++++- current_states = hidden_states_reshaped[selected_token_indices] --++++- --++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++- --++++- final_hidden_states = final_hidden_states.index_add( --++++- dim=0, --++++- index=selected_token_indices, --++++- source=expert_output.to(hidden_states.dtype) --++++- ) --++++- --++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --+++++ @no_grad() --+++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++ original_dtype = hidden_states.dtype --+++++ batch_size, _ = hidden_states.shape --+++++ expert_outputs_list = [ --+++++ ops.cat([ --+++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++ ], dim=0) --+++++ for i in range(batch_size) --+++++ ] --+++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++ weights_fp32 = routing_weights.to(mindspore.float32) --+++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++++ return moe_output_fp32.squeeze(1).to(original_dtype) --+++++ --+++++ @no_grad() --+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++ num_tokens, _ = hidden_states.shape --+++++ flat_selected_experts = selected_experts.flatten() --+++++ sorted_expert_indices = flat_selected_experts.argsort() --+++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++++ original_token_indices = sorted_expert_indices // self.top_k --+++++ moe_output = ops.zeros_like(hidden_states) --+++++ current_token_offset = 0 --+++++ for i in range(self.num_experts): --+++++ expert_token_count = tokens_per_expert[i] - current_token_offset --+++++ if expert_token_count == 0: --+++++ continue --+++++ end_offset = current_token_offset + expert_token_count --+++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++++ expert_hidden_states = hidden_states[expert_original_token_indices] --+++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++ current_token_offset += expert_token_count --+++++ return moe_output --+++++ --+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+++++ @no_grad() --+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++ moe_output = ops.zeros_like(hidden_states) --+++++ num_tokens, _ = hidden_states.shape --+++++ flat_selected_experts = selected_experts.flatten() --+++++ flat_routing_weights = routing_weights.flatten() --+++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++ active_experts = ops.unique(flat_selected_experts) --+++++ for expert_idx_tensor in active_experts: --+++++ expert_idx = expert_idx_tensor.item() --+++++ expert_layer = self.experts[expert_idx] --+++++ mask = (flat_selected_experts == expert_idx_tensor) --+++++ current_token_indices = token_indices[mask] --+++++ current_routing_weights = flat_routing_weights[mask] --+++++ current_hidden_states = hidden_states[current_token_indices] --+++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++ return moe_output --++++ --++++- final_hidden_states = final_hidden_states + shared_expert_output --++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++- --++++- return final_hidden_states, router_logits --+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++ global Long_Prompt --+++++ --+++++ # 1. 门控计算 (所有模式通用) --+++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++ router_logits = self.gate(hidden_states_reshaped) --+++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++++ if self.norm_topk_prob: --+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++ moe_output = None --+++++ if Long_Prompt: --+++++ # --- 精度优先模式 (ACCURACY MODE) --- --+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ else: --+++++ # --- 速度优先模式 (SPEED MODE) --- --+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++ if sequence_length == 1: --+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ else: --+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ --++++ --+++++ # 3. 共享专家计算与合并 (所有模式通用) --+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++ --+++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++ --+++++ return final_hidden_states, router_logits --++++ --++++ class Qwen2MoeDecoderLayer(nn.Module): --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --++++ super().__init__() --++++ self.hidden_size = config.hidden_size --+++++ --+++++ # if Long_Prompt: --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ # else: --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++ --++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ --++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++- --++++ if (layer_idx not in config.mlp_only_layers) and ( --++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++++ ): --++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ self._warmed_up = True --++++ self.warmup_moe_model() --++++ --+++++ --+++++ --++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++++ output_router_logits = ( --++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits --++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ router_logits=outputs.router_logits, --++++ ) --++++ --+++++ def generate(self, *args, **kwargs): --+++++ """ --+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+++++ """ --+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+++++ --+++++ input_ids = kwargs.get("input_ids") --+++++ if input_ids is None and args: --+++++ input_ids = args[0] --+++++ --+++++ if input_ids is not None: --+++++ prompt_length = input_ids.shape[1] --+++++ --+++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: --+++++ Long_Prompt = True --+++++ else: --+++++ Long_Prompt = False --+++++ --+++++ return super().generate(*args, **kwargs) --+++++ --++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation --++++ def prepare_inputs_for_generation( --++++ self, --++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens --++++ # Exception 1: when passing input_embeds, input_ids may be missing entries --++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --+++++ --++++ if past_key_values is not None: --++++ if inputs_embeds is not None: # Exception 1 --++++ if 0 not in input_ids.shape: --++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ } --++++ ) --++++ return model_inputs --+++++ --++++ # @lwx --++++ # def _decode_one_tokens_logits( --++++ # self, --++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): --++++ attentions=outputs.attentions, --++++ ) --++++ --+++++ --++++ __all__ = [ --++++ "Qwen2MoeForCausalLM", --++++ "Qwen2MoeModel", --++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++++new file mode 100644 --++++index 00000000..6dfb5b93 --++++--- /dev/null --+++++++ b/patches/0001-20251104commit.patch --++++@@ -0,0 +1,1272 @@ --+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++++From: Pinoeer-kingxi <13022943007@163.com> --+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++++Subject: [PATCH] 20251104commit --+++++ --+++++--- --+++++ mindnlp/transformers/cache_utils.py | 28 +- --+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- --+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++++ 3 files changed, 976 insertions(+), 87 deletions(-) --+++++ --+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++++index cadd2e04..02f8d4be 100644 --+++++--- a/mindnlp/transformers/cache_utils.py --++++++++ b/mindnlp/transformers/cache_utils.py --+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++++ # k_out[:, :, cache_position] = key_states --+++++ # v_out[:, :, cache_position] = value_states --+++++- if ON_ORANGE_PI: --+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++- else: --+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++- --++++++ # if ON_ORANGE_PI: --++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++++ # else: --++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++++ # 确保 cache_position 是 1D tensor 并且类型正确 --++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++++++ if cache_position.ndim > 1: --++++++ cache_position = cache_position.flatten() --++++++ # 确保类型是 int32 或 int64(MindSpore 要求) --++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++++++ cache_position = cache_position.int() --++++++ --++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++++++ k_out[:, :, cache_position] = key_states --++++++ v_out[:, :, cache_position] = value_states --++++++ --+++++ return k_out, v_out --+++++ --+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++index c695b944..d8303e45 100644 --+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++++ def rotate_half(x): --+++++ """Rotates half the hidden dims of the input.""" --+++++- x1 = x[..., : x.shape[-1] // 2] --+++++- x2 = x[..., x.shape[-1] // 2 :] --++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++++ # x1 = x[..., : x.shape[-1] // 2] --++++++ # x2 = x[..., x.shape[-1] // 2 :] --++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++ return ops.cat((-x2, x1), dim=-1) --+++++ --+++++ --+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++++ if self.training: --+++++ raise NotImplementedError("Training is not supported yet.") --+++++ else: --+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++- if self.config.n_shared_experts is not None: --+++++- y = y + self.shared_experts(identity) --+++++- return y --++++++ # @lwx --++++++ if orig_shape[1] == 1: --++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++++++ y=y.view(*orig_shape) --++++++ if self.config.n_shared_experts is not None: --++++++ y = y + self.shared_experts(identity) --++++++ return y --++++++ else: --++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++++++ if self.config.n_shared_experts is not None: --++++++ y = y + self.shared_experts(identity) --++++++ return y --++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++++ # if self.config.n_shared_experts is not None: --++++++ # y = y + self.shared_experts(identity) --++++++ # return y --++++++ --++++++ @no_grad() --++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++++ --++++++ expert_cache = ops.zeros_like(x) --++++++ for i in range(self.num_experts_per_tok): --++++++ expert_id = flat_expert_indices[i].item() --++++++ weight = flat_expert_weights[i].item() --++++++ expert = self.experts[expert_id] --++++++ expert_out = expert(x) --++++++ expert_cache += expert_out * weight --++++++ return expert_cache --+++++ --+++++ @no_grad() --+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++- # expert_cache = torch.zeros_like(x) --+++++- # idxs = flat_expert_indices.argsort() --+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++- # token_idxs = idxs // self.num_experts_per_tok --+++++- # for i, end_idx in enumerate(tokens_per_expert): --+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++- # if start_idx == end_idx: --+++++- # continue --+++++- # expert = self.experts[i] --+++++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++++- # expert_tokens = x[exp_token_idx] --+++++- # expert_out = expert(expert_tokens) --+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++- # return expert_cache --++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++ expert_cache = ops.zeros_like(x) --+++++ idxs = flat_expert_indices.argsort() --+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ token_idxs = idxs // self.num_experts_per_tok --++++++ --+++++ for i, end_idx in enumerate(tokens_per_expert): --+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ if start_idx == end_idx: --+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++++ expert_out = expert(expert_tokens) --+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++++ --+++++ return expert_cache --++++++ --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # # expert_cache = torch.zeros_like(x) --++++++ # # idxs = flat_expert_indices.argsort() --++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++++ # # token_idxs = idxs // self.num_experts_per_tok --++++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++++ # # if start_idx == end_idx: --++++++ # # continue --++++++ # # expert = self.experts[i] --++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # # expert_tokens = x[exp_token_idx] --++++++ # # expert_out = expert(expert_tokens) --++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++++ # # return expert_cache --++++++ # expert_cache = ops.zeros_like(x) --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # for i, end_idx in enumerate(tokens_per_expert): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # if start_idx == end_idx: --++++++ # continue --++++++ # expert = self.experts[i] --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = expert(expert_tokens) --++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++++ --++++++ # return expert_cache --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # expert_cache = ops.zeros_like(x) --++++++ --++++++ # # 排序保证顺序一致 --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # # 找出有 token 的专家 --++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++++ --++++++ # for i in active_experts.tolist(): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # end_idx = tokens_per_expert[i] --++++++ # if start_idx == end_idx: # 没有 token --++++++ # continue --++++++ --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = self.experts[i](expert_tokens) --++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++++ --++++++ # expert_cache = mindspore.mint.scatter_add( --++++++ # expert_cache, --++++++ # 0, --++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++++ # expert_out --++++++ # ) --++++++ --++++++ # return expert_cache --++++++ --++++++ --+++++ --+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++++ # """ --+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++ --+++++ # Initialize weights and apply final processing --+++++ self.post_init() --++++++ self.warm_up = False --++++++ --++++++ def warmup_moe_model_deep(self): --++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++++ test_texts = [ --++++++ "warmup short", --++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++++++ ] --++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++++ if tokenizer is None: --++++++ from mindnlp.transformers import AutoTokenizer --++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++++ self._warmup_tokenizer = tokenizer --++++++ --++++++ for text in test_texts: --++++++ inputs = tokenizer(text, return_tensors="ms") --++++++ with mindspore._no_grad(): --++++++ _ = self(**inputs, use_cache=False) --++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++++ --+++++ def get_input_embeddings(self): --+++++ return self.model.embed_tokens --+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++ ```""" --++++++ if not self.warm_up: --++++++ self.warm_up = True --++++++ self.warmup_moe_model_deep() --++++++ --+++++ output_attentions = ( --+++++ output_attentions --+++++ if output_attentions is not None --+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++index 3cbf820e..d4c6b651 100644 --+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++@@ -18,7 +18,6 @@ --+++++ # See the License for the specific language governing permissions and --+++++ # limitations under the License. --+++++ """MindSpore Qwen2MoE model.""" --+++++- --+++++ import math --+++++ from typing import List, Optional, Tuple, Union --+++++ --+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++++ TokenClassifierOutput, --+++++ ) --+++++ from ...modeling_utils import PreTrainedModel --++++++from ...generation import GenerationMixin --+++++ from ....utils import logging --+++++ from .configuration_qwen2_moe import Qwen2MoeConfig --+++++ --+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++++ self.variance_epsilon = eps --+++++ --+++++ def forward(self, hidden_states): --++++++ # @dwj --++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++++ # @lwx --++++++ # if not self.training : --++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++ input_dtype = hidden_states.dtype --+++++ hidden_states = hidden_states.to(mindspore.float32) --+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++++@@ -234,6 +239,8 @@ def rotate_half(x): --+++++ """Rotates half the hidden dims of the input.""" --+++++ x1 = x[..., : x.shape[-1] // 2] --+++++ x2 = x[..., x.shape[-1] // 2 :] --++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++ return ops.cat((-x2, x1), dim=-1) --+++++ --+++++ --+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++++ self.config = config --+++++ self.hidden_size = config.hidden_size --+++++ self.intermediate_size = intermediate_size --++++++ --+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++++ self.act_fn = ACT2FN[config.hidden_act] --+++++ --+++++ def forward(self, x): --+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++- --+++++ --++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++++ # @lwx --++++++ # gate_up_output = self.gate_up_proj(x) --++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++++++ # return self.down_proj(swiglu_output) --++++++ --++++++ # def forward(self, x): --++++++ # gate_proj_out = self.gate_proj(x) --++++++ # up_proj_out = self.up_proj(x) --++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++++++ # return self.down_proj(swiglu_out) --++++++ --+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++++ """ --+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++++ use_cache: bool = False, --+++++ cache_position: Optional[mindspore.Tensor] = None, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ --++++++ --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ query_states = self.q_proj(hidden_states) --+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ "with a layer index." --+++++ ) --+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ if isinstance(past_key_value, StaticCache): --++++++ kv_seq_len = key_states.shape[-2] --++++++ else: --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++ if isinstance(past_key_value, StaticCache): --++++++ kv_seq_len = key_states.shape[-2] --+++++ --+++++ # repeat k/v heads if n_kv_heads < n_heads --+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++- --++++++ --+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++ --+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++++- raise ValueError( --+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++++- f" {attn_weights.shape}" --+++++- ) --+++++- --+++++- if attention_mask is not None: # no matter the length, we just slice it --+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++++++ if attention_mask is not None: --++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++ attn_weights = attn_weights + causal_mask --+++++ --+++++ # upcast attention to fp32 --+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++ --+++++ attn_output = self.o_proj(attn_output) --+++++- --++++++ # @lwx --++++++ --++++++ # max_seq_len = self.max_position_embeddings # 2048 --++++++ --++++++ # if attention_mask is not None: --++++++ # # attention_mask: [B, 1, Sq, Sk] --++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++++ --++++++ # # pad 到 [max_seq_len, max_seq_len] --++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++++ # global_attention_mask = padded_mask --++++++ # else: --++++++ # global_attention_mask = None --++++++ --++++++ --++++++ # sparse_mode=3 --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # real_shift=None, --++++++ # padding_mask=None, --++++++ --++++++ # head_num=self.num_heads, --++++++ # attn_mask=global_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ # input_layout="BNSD", --++++++ # pre_tokens=2147483647, --++++++ # next_tokens=2147483647, --++++++ # inner_precise=0, --++++++ # drop_mask=None, --++++++ # prefix=None, --++++++ # actual_seq_qlen=None, --++++++ # actual_seq_kvlen=None, --++++++ # sparse_mode=sparse_mode, --++++++ # ) --+++++ if not output_attentions: --+++++ attn_weights = None --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ --++++++class Qwen2MoeFlashAttention(nn.Module): --++++++ """ --++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++++ --++++++ 关键改动: --++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++++ 直接传入原始的 key 和 value 张量效率更高。 --++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++++ """ --++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++++ super().__init__() --++++++ self.config = config --++++++ self.layer_idx = layer_idx --++++++ self.hidden_size = config.hidden_size --++++++ self.num_heads = config.num_attention_heads --++++++ self.head_dim = self.hidden_size // self.num_heads --++++++ self.num_key_value_heads = config.num_key_value_heads --++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++ self.max_position_embeddings = config.max_position_embeddings --++++++ self.rope_theta = config.rope_theta --++++++ self.attention_dropout = config.attention_dropout --++++++ --++++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++++ raise ValueError( --++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++++ ) --++++++ --++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++++ --++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++++ self.head_dim, --++++++ max_position_embeddings=self.max_position_embeddings, --++++++ base=self.rope_theta, --++++++ ) --++++++ --++++++ def forward( --++++++ self, --++++++ hidden_states: mindspore.Tensor, --++++++ attention_mask: Optional[mindspore.Tensor] = None, --++++++ position_ids: Optional[mindspore.Tensor] = None, --++++++ past_key_value: Optional[Cache] = None, --++++++ output_attentions: bool = False, --++++++ use_cache: bool = False, --++++++ cache_position: Optional[mindspore.Tensor] = None, --++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # 1. 线性投射 Q, K, V --++++++ query_states = self.q_proj(hidden_states) --++++++ key_states = self.k_proj(hidden_states) --++++++ value_states = self.v_proj(hidden_states) --++++++ --++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++ # query: [B, S, H*D] -> [B, N1, S, D] --++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # 3. RoPE 旋转位置编码 --++++++ kv_seq_len = key_states.shape[-2] --++++++ if past_key_value is not None: --++++++ if self.layer_idx is None: --++++++ raise ValueError( --++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++ "with a layer index." --++++++ ) --++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++++ if cache_position.shape[0] == 1: --++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++++ kv_seq_len = past_seen_tokens + 1 --++++++ else: --++++++ # prefill 阶段:cache_position 是范围,使用其长度 --++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++++ else: --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # 4. KV 缓存更新 --++++++ if past_key_value is not None: --++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ key_states, value_states = past_key_value.update( --++++++ key_states, value_states, self.layer_idx, cache_kwargs --++++++ ) --++++++ --++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++ if cache_position.shape[0] == 1: --++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++++ kv_seq_len = key_states.shape[-2] --++++++ --++++++ # 5. [重要] 准备 Attention Mask --++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++++ fa_attention_mask = None --++++++ if attention_mask is not None: --++++++ # 截取与当前key长度匹配的部分 --++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++++++ fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++++ input_dtype = query_states.dtype --++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++++ query_states = query_states.to(mindspore.float16) --++++++ key_states = key_states.to(mindspore.float16) --++++++ value_states = value_states.to(mindspore.float16) --++++++ --++++++ # 6. [核心] 调用 flash_attention_score 算子 --++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++++ attn_output = mindspore.ops.flash_attention_score( --++++++ query=query_states, --++++++ key=key_states, --++++++ value=value_states, --++++++ head_num=self.num_heads, # 传入Q的头数(N1) --++++++ attn_mask=fa_attention_mask, --++++++ keep_prob=1.0 - self.attention_dropout, --++++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ input_layout="BNSD", --++++++ sparse_mode=0 # 使用 defaultMask 模式 --++++++ ) --++++++ --++++++ # 恢复原始数据类型 --++++++ attn_output = attn_output.to(input_dtype) --++++++ --++++++ # 7. 调整输出形状 --++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ attn_output = self.o_proj(attn_output) --++++++ --++++++ # FlashAttention 算子不直接返回注意力权重矩阵 --++++++ attn_weights = None --++++++ if output_attentions: --++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++ return attn_output, attn_weights, past_key_value --++++++ --++++++ # def forward( --++++++ # self, --++++++ # hidden_states: mindspore.Tensor, --++++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++++ # position_ids: Optional[mindspore.Tensor] = None, --++++++ # past_key_value: Optional[Cache] = None, --++++++ # output_attentions: bool = False, --++++++ # use_cache: bool = False, --++++++ # cache_position: Optional[mindspore.Tensor] = None, --++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ # bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # # 1. 线性投射 Q, K, V --++++++ # query_states = self.q_proj(hidden_states) --++++++ # key_states = self.k_proj(hidden_states) --++++++ # value_states = self.v_proj(hidden_states) --++++++ --++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # # 3. RoPE 旋转位置编码 --++++++ # kv_seq_len = key_states.shape[-2] --++++++ # if past_key_value is not None: --++++++ # if self.layer_idx is None: --++++++ # raise ValueError( --++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++ # "with a layer index." --++++++ # ) --++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # # 4. KV 缓存更新 --++++++ # if past_key_value is not None: --++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ # key_states, value_states = past_key_value.update( --++++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++++ # ) --++++++ --++++++ # # 5. 准备 Attention Mask --++++++ # fa_attention_mask = None --++++++ # if attention_mask is not None: --++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++++ # input_dtype = query_states.dtype --++++++ --++++++ # # 6. [核心] 调用 flash_attention_score 算子 --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # head_num=self.num_heads, --++++++ # attn_mask=fa_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ # input_layout="BNSD", --++++++ # sparse_mode=0, --++++++ # # <--- 修改点 2: 启用内部高精度计算 --- --++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++++ # inner_precise=1 --++++++ # ) --++++++ --++++++ # # 恢复原始数据类型 --++++++ # attn_output = attn_output.to(input_dtype) --++++++ --++++++ # # 7. 调整输出形状 --++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ # attn_output = self.o_proj(attn_output) --++++++ --++++++ # attn_weights = None --++++++ # if output_attentions: --++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++ # return attn_output, attn_weights, past_key_value --++++++ --++++++ # def forward( --++++++ # self, --++++++ # hidden_states: mindspore.Tensor, --++++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++++ # position_ids: Optional[mindspore.Tensor] = None, --++++++ # past_key_value: Optional[Cache] = None, --++++++ # output_attentions: bool = False, --++++++ # use_cache: bool = False, --++++++ # cache_position: Optional[mindspore.Tensor] = None, --++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ # bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # query_states = self.q_proj(hidden_states) --++++++ # key_states = self.k_proj(hidden_states) --++++++ # value_states = self.v_proj(hidden_states) --++++++ --++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # kv_seq_len = key_states.shape[-2] --++++++ # if past_key_value is not None: --++++++ # if self.layer_idx is None: --++++++ # raise ValueError("`layer_idx` must be specified for caching") --++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # if past_key_value is not None: --++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ # key_states, value_states = past_key_value.update( --++++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++++ # ) --++++++ --++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++ --++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++++ # query_states = query_states / math.sqrt(self.head_dim) --++++++ # # <--- 修改结束 --- --++++++ --++++++ # fa_attention_mask = None --++++++ # if attention_mask is not None: --++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # input_dtype = query_states.dtype --++++++ --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, # 传入已经预先缩放过的 query --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # head_num=self.num_heads, --++++++ # attn_mask=fa_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++++ # input_layout="BNSD", --++++++ # sparse_mode=0, --++++++ # inner_precise=1 # 仍然保持内部高精度计算 --++++++ # ) --++++++ --++++++ # attn_output = attn_output.to(input_dtype) --++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ # attn_output = self.o_proj(attn_output) --++++++ --++++++ # attn_weights = None --++++++ # if output_attentions: --++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++++ --++++++ # return attn_output, attn_weights, past_key_value --++++++ --+++++ QWEN2MOE_ATTENTION_CLASSES = { --+++++ "eager": Qwen2MoeAttention, --++++++ "flash-attention": Qwen2MoeFlashAttention, --+++++ } --+++++ --+++++ --+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --++++++ #@dwj --++++++ # 只遍历激活的专家,而非全部专家 --+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- hidden_states = hidden_states.view(-1, hidden_dim) --+++++- # router_logits: (batch * sequence_length, n_experts) --+++++- router_logits = self.gate(hidden_states) --+++++- --+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- if self.norm_topk_prob: --+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- # we cast back to the input dtype --+++++- routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++- final_hidden_states = ops.zeros( --+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++++- ) --+++++- --+++++- # One hot encode the selected experts to create an expert mask --+++++- # this will be used to easily index which expert is going to be sollicitated --+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++++- --+++++- # Loop over all available experts in the model and perform the computation on each expert --+++++- for expert_idx in range(self.num_experts): --+++++- expert_layer = self.experts[expert_idx] --+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++++- --+++++- # Index the correct hidden states and compute the expert hidden state for --+++++- # the current expert. We need to make sure to multiply the output hidden --+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++++- if 0 not in idx.shape: --+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++++- --+++++- # However `index_add_` only support torch tensors for indexing so we'll use --+++++- # the `top_x` tensor here. --+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++++- --+++++- shared_expert_output = self.shared_expert(hidden_states) --+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++++- --+++++- final_hidden_states = final_hidden_states + shared_expert_output --++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++ num_tokens = hidden_states_reshaped.shape[0] --++++++ --++++++ router_logits = self.gate(hidden_states_reshaped) --++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++ --++++++ if self.norm_topk_prob: --++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++++ flat_selected_experts = selected_experts.flatten() --++++++ --++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++++ token_indices = broadcasted_token_indices.flatten() --++++++ --++++++ active_experts = ops.unique(flat_selected_experts) --++++++ --++++++ for expert_idx_tensor in active_experts: --++++++ expert_idx = expert_idx_tensor.item() --++++++ expert_layer = self.experts[expert_idx] --++++++ --++++++ mask = (flat_selected_experts == expert_idx_tensor) --++++++ selected_token_indices = token_indices[mask] --++++++ selected_routing_weights = routing_weights.flatten()[mask] --++++++ --++++++ current_states = hidden_states_reshaped[selected_token_indices] --++++++ --++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++ --++++++ final_hidden_states = final_hidden_states.index_add( --++++++ dim=0, --++++++ index=selected_token_indices, --++++++ source=expert_output.to(hidden_states.dtype) --++++++ ) --++++++ --++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++++ --+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++- return final_hidden_states, router_logits --++++++ final_hidden_states = final_hidden_states + shared_expert_output --++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++++ --++++++ return final_hidden_states, router_logits --+++++ --+++++ --+++++ class Qwen2MoeDecoderLayer(nn.Module): --+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++++ --+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ --++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++++ --+++++ if (layer_idx not in config.mlp_only_layers) and ( --+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++++ ): --+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++++ _skip_keys_device_placement = "past_key_values" --+++++ _supports_cache_class = True --++++++#lwx --++++++ # _supports_static_cache = True --+++++ --+++++ def _init_weights(self, module): --+++++ std = self.config.initializer_range --+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++++ return causal_mask --+++++ --+++++ --+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ _tied_weights_keys = ["lm_head.weight"] --+++++ --+++++ def __init__(self, config): --+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ self.num_experts_per_tok = config.num_experts_per_tok --+++++ # Initialize weights and apply final processing --+++++ self.post_init() --++++++ # @lwx --++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++++++ # self.generation_config.cache_implementation = "static" --++++++ self._warmed_up = False --++++++ --++++++ def warmup_moe_model(self): --++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++++++ test_texts = [ --++++++ "warmup short", --++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++++++ ] --++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++++ if tokenizer is None: --++++++ from mindnlp.transformers import AutoTokenizer --++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++++ self._warmup_tokenizer = tokenizer --++++++ --++++++ for text in test_texts: --++++++ inputs = tokenizer(text, return_tensors="ms") --++++++ with mindspore._no_grad(): --++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++++ --+++++ def get_input_embeddings(self): --+++++ return self.model.embed_tokens --+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++ ```""" --++++++ if not self._warmed_up: --++++++ self._warmed_up = True --++++++ self.warmup_moe_model() --+++++ --+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++++ output_router_logits = ( --+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ } --+++++ ) --+++++ return model_inputs --++++++# @lwx --++++++ # def _decode_one_tokens_logits( --++++++ # self, --++++++ # cur_token: mindspore.Tensor, --++++++ # input_pos: Optional[mindspore.Tensor], --++++++ # cache_position: mindspore.Tensor, --++++++ # past_key_values: StaticCache, --++++++ # ) -> mindspore.Tensor: --++++++ # """ --++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++++++ --++++++ # Args: --++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++++++ # input_pos: 输入位置信息,可选 --++++++ # cache_position: 当前token在cache中的位置,shape为(1,) --++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++++++ --++++++ # Returns: --++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++++++ # """ --++++++ # # 调用JIT编译的版本 --++++++ # return self.get_decode_one_tokens_logits( --++++++ # cur_token=cur_token, --++++++ # input_pos=input_pos, --++++++ # cache_position=cache_position, --++++++ # past_key_values=past_key_values, --++++++ # ) --++++++ --++++++ # @mindspore.jit(jit_level='O1') --++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++++++ # """ --++++++ # JIT编译的函数,用于高效的单token解码 --++++++ # 使用JIT编译优化以支持静态shape和高效执行 --++++++ --++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++++++ # """ --++++++ # outputs = self.model.forward( --++++++ # input_ids=cur_token, --++++++ # position_ids=input_pos, --++++++ # cache_position=cache_position, --++++++ # past_key_values=past_key_values, --++++++ # use_cache=True, --++++++ # return_dict=False, --++++++ # ) --++++++ --++++++ # hidden_states = outputs[0] --++++++ # logits = self.lm_head.forward(hidden_states) --++++++ # logits = logits.float() --++++++ --++++++ # return logits[:, -1, :] --++++++ --++++++ # def _sample( --++++++ # self, --++++++ # input_ids: mindspore.Tensor, --++++++ # logits_processor, --++++++ # stopping_criteria, --++++++ # generation_config, --++++++ # synced_devices: bool, --++++++ # streamer=None, --++++++ # logits_warper=None, --++++++ # **model_kwargs, --++++++ # ): --++++++ # """ --++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++++++ # """ --++++++ # from ...generation.logits_process import LogitsProcessorList --++++++ # from ...generation.stopping_criteria import StoppingCriteriaList --++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++++++ # from mindnlp.core import nn, ops, no_grad --++++++ # import numpy as np --++++++ --++++++ # # 检查是否使用 StaticCache --++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++++++ # # 否则,直接调用父类方法 --++++++ # past_key_values = model_kwargs.get("past_key_values") --++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++++++ --++++++ # if not isinstance(past_key_values, StaticCache): --++++++ # # 不使用 StaticCache,直接调用父类方法 --++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++++++ # return super()._sample( --++++++ # input_ids=input_ids, --++++++ # logits_processor=logits_processor, --++++++ # stopping_criteria=stopping_criteria, --++++++ # generation_config=generation_config, --++++++ # synced_devices=synced_devices, --++++++ # streamer=streamer, --++++++ # logits_warper=logits_warper, --++++++ # **model_kwargs, --++++++ # ) --++++++ --++++++ # # 使用 StaticCache,进入自定义循环 --++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++++++ # pad_token_id = generation_config._pad_token_tensor --++++++ # output_attentions = generation_config.output_attentions --++++++ # output_hidden_states = generation_config.output_hidden_states --++++++ # output_scores = generation_config.output_scores --++++++ # output_logits = generation_config.output_logits --++++++ # return_dict_in_generate = generation_config.return_dict_in_generate --++++++ # max_length = generation_config.max_length --++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++++++ # do_sample = generation_config.do_sample --++++++ --++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++++++ # raise ValueError( --++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++++++ # f"{logits_warper})." --++++++ # ) --++++++ --++++++ # # init attention / hidden states / scores tuples --++++++ # scores = () if (return_dict_in_generate and output_scores) else None --++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++++++ --++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++++++ # encoder_hidden_states = ( --++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++++++ # ) --++++++ --++++++ # # keep track of which sequences are already finished --++++++ # batch_size, cur_len = input_ids.shape --++++++ # this_peer_finished = False --++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++++++ --++++++ # time_record = [] --++++++ # from ....utils.testing_utils import parse_flag_from_env --++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++++++ --++++++ # while self._has_unfinished_sequences( --++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++++++ # ): --++++++ # if _record_time: --++++++ # import time as time_module --++++++ # infer_start = time_module.time() --++++++ --++++++ # # prepare model inputs --++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++++++ --++++++ # # prepare variable output controls --++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++++++ --++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++++++ # cur_cache_position = model_inputs.get("cache_position") --++++++ # cur_past_key_values = model_inputs.get("past_key_values") --++++++ # cur_input_ids = model_inputs.get("input_ids") --++++++ --++++++ # if (isinstance(cur_past_key_values, StaticCache) and --++++++ # cur_cache_position is not None and --++++++ # len(cur_cache_position.shape) > 0 and --++++++ # cur_cache_position.shape[0] == 1 and --++++++ # cur_input_ids is not None and --++++++ # cur_input_ids.shape[1] == 1): --++++++ # # 使用 JIT 优化的单 token 解码 --++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++++++ # if not hasattr(self, '_jit_used'): --++++++ # self._jit_used = False --++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++++++ --++++++ # next_token_logits = self.get_decode_one_tokens_logits( --++++++ # cur_token=cur_input_ids, --++++++ # input_pos=model_inputs.get("position_ids"), --++++++ # cache_position=cur_cache_position, --++++++ # past_key_values=cur_past_key_values, --++++++ # ) --++++++ --++++++ # # 标记已使用JIT(用于后续判断) --++++++ # if not self._jit_used: --++++++ # self._jit_used = True --++++++ --++++++ # # 构造兼容的输出对象 --++++++ # class JitOptimizedOutput: --++++++ # def __init__(self, logits, config): --++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++++++ # self.config = config --++++++ # # 对于 JIT 优化路径,这些属性通常不需要 --++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++++++ # self.attentions = None if not config.is_encoder_decoder else None --++++++ # self.cross_attentions = None --++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++++++ # self.hidden_states = None if not config.is_encoder_decoder else None --++++++ --++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++++++ # else: --++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++++++ # outputs = self(**model_inputs, return_dict=True) --++++++ --++++++ # if synced_devices and this_peer_finished: --++++++ # continue --++++++ --++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++++++ # next_token_logits = outputs.logits[:, -1, :] --++++++ --++++++ # # pre-process distribution --++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++++++ # if do_sample: --++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++++++ --++++++ # # Store scores, attentions and hidden_states when required --++++++ # if return_dict_in_generate: --++++++ # if output_scores: --++++++ # scores += (next_token_scores,) --++++++ # if output_logits: --++++++ # raw_logits += (next_token_logits,) --++++++ # if output_attentions: --++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++++++ # decoder_attentions += (attn,) if attn is not None else (None,) --++++++ # if self.config.is_encoder_decoder: --++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++++++ --++++++ # if output_hidden_states: --++++++ # hidden = ( --++++++ # outputs.decoder_hidden_states --++++++ # if self.config.is_encoder_decoder --++++++ # else outputs.hidden_states --++++++ # ) --++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++++++ --++++++ # # token selection --++++++ # if do_sample: --++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++++++ # else: --++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++++++ --++++++ # # finished sentences should have their next token be a padding token --++++++ # if has_eos_stopping_criteria: --++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++++++ --++++++ # # update generated ids, model inputs, and length for next step --++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++++++ # if streamer is not None: --++++++ # streamer.put(next_tokens) --++++++ --++++++ # model_kwargs = self._update_model_kwargs_for_generation( --++++++ # outputs, --++++++ # model_kwargs, --++++++ # is_encoder_decoder=self.config.is_encoder_decoder, --++++++ # ) --++++++ --++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++++++ # cur_len += 1 --++++++ --++++++ # if _record_time: --++++++ # import time as time_module --++++++ # infer_stop = time_module.time() --++++++ # time_record.append(infer_stop - infer_start) --++++++ --++++++ # del outputs --++++++ --++++++ # average_infer_time = None --++++++ # if time_record: --++++++ # if len(time_record) > 1: --++++++ # time_record.pop(0) --++++++ # average_infer_time = sum(time_record) / len(time_record) --++++++ # print(f'average inference time is: {average_infer_time}') --++++++ # print(f'inference time record: {time_record}') --++++++ --++++++ # if streamer is not None: --++++++ # streamer.end() --++++++ --++++++ # # 简单判断:打印是否使用了JIT路径 --++++++ # if hasattr(self, '_jit_used') and self._jit_used: --++++++ # print("[JIT] ✓ JIT optimization was used during generation") --++++++ # else: --++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++++++ --++++++ # if return_dict_in_generate: --++++++ # if self.config.is_encoder_decoder: --++++++ # return GenerateEncoderDecoderOutput( --++++++ # sequences=input_ids, --++++++ # scores=scores, --++++++ # logits=raw_logits, --++++++ # encoder_attentions=encoder_attentions, --++++++ # encoder_hidden_states=encoder_hidden_states, --++++++ # decoder_attentions=decoder_attentions, --++++++ # cross_attentions=cross_attentions, --++++++ # decoder_hidden_states=decoder_hidden_states, --++++++ # past_key_values=model_kwargs.get("past_key_values"), --++++++ # average_infer_time=average_infer_time --++++++ # ) --++++++ # else: --++++++ # return GenerateDecoderOnlyOutput( --++++++ # sequences=input_ids, --++++++ # scores=scores, --++++++ # logits=raw_logits, --++++++ # attentions=decoder_attentions, --++++++ # hidden_states=decoder_hidden_states, --++++++ # past_key_values=model_kwargs.get("past_key_values"), --++++++ # average_infer_time=average_infer_time --++++++ # ) --++++++ # else: --++++++ # return input_ids --++++++ --++++++ # def _prepare_cache_for_generation( --++++++ # self, --++++++ # generation_config, --++++++ # model_kwargs, --++++++ # assistant_model, --++++++ # batch_size, --++++++ # max_cache_length, --++++++ # ): --++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++++++ # generation_config.cache_implementation = "static" --++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++++++ --++++++ # if generation_config.cache_implementation == "static": --++++++ # base_required_from_max_length = generation_config.max_length + 1 --++++++ # base_required = max(max_cache_length, base_required_from_max_length) --++++++ # min_cache_size = 50 --++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++++++ # else: --++++++ # max_cache_length = max(base_required, min_cache_size) --++++++ --++++++ # original_max_cache_length = max_cache_length --++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++++++ # print(f" - input max_cache_length: {original_max_cache_length}") --++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++++++ # print(f" - final max_cache_length: {max_cache_length}") --++++++ --++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++++ # if max_cache_length > self.config.max_position_embeddings: --++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++++++ --++++++ # result = super()._prepare_cache_for_generation( --++++++ # generation_config=generation_config, --++++++ # model_kwargs=model_kwargs, --++++++ # assistant_model=assistant_model, --++++++ # batch_size=batch_size, --++++++ # max_cache_length=max_cache_length, --++++++ # ) --++++++ --++++++ # if generation_config.cache_implementation == "static": --++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++++++ # created_cache = model_kwargs.get(cache_name) --++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++++++ # if created_cache.max_cache_len < generation_config.max_length: --++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++++++ --++++++ # return result --++++++ --++++++ --++++++ --+++++ --+++++ --+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++++-- --+++++2.27.0 --+++++ --++++-- --++++2.27.0 --++++ --+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+++new file mode 100644 --+++index 00000000..966529e4 --+++--- /dev/null --++++++ b/patches/0003-20261106secondcommit.patch --+++@@ -0,0 +1,2769 @@ --++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --++++From: Pinoeer-kingxi <13022943007@163.com> --++++Date: Thu, 6 Nov 2025 14:54:37 +0800 --++++Subject: [PATCH 3/3] 20261106secondcommit --++++ --++++--- --++++ .../models/deepseek/modeling_deepseek.py | 217 ++- --++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- --++++ patches/0001-20251104commit.patch | 1272 ----------------- --++++ 3 files changed, 528 insertions(+), 2032 deletions(-) --++++ delete mode 100644 patches/0001-20251104commit.patch --++++ --++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++index 73773c22..2f9192bf 100644 --++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) --++++ --++++ _CONFIG_FOR_DOC = "DeepseekConfig" --++++ --+++++_attn_mask_cache = {} --+++++ --+++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --+++++ q_len = batch_and_seq[1] --+++++ kv_len = batch_and_seq[1] + past_key_values_length --+++++ key = (batch_and_seq[0], q_len, kv_len) --+++++ --+++++ if key in _attn_mask_cache: --+++++ return _attn_mask_cache[key] --+++++ --+++++ mask = _prepare_4d_causal_attention_mask( --+++++ attention_mask, --+++++ batch_and_seq, --+++++ inputs_embeds, --+++++ past_key_values_length, --+++++ ) --+++++ _attn_mask_cache[key] = mask --+++++ return mask --++++ --++++ def _get_unpad_data(attention_mask): --++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): --++++ return final_output --++++ --++++ --++++- @no_grad() --++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++- expert_cache = ops.zeros_like(x) --++++- idxs = flat_expert_indices.argsort() --++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++- token_idxs = idxs // self.num_experts_per_tok --++++- --++++- for i, end_idx in enumerate(tokens_per_expert): --++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++- if start_idx == end_idx: --++++- continue --++++- expert = self.experts[i] --++++- exp_token_idx = token_idxs[start_idx:end_idx] --++++- expert_tokens = x[exp_token_idx] --++++- expert_out = expert(expert_tokens) --++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++- --++++- return expert_cache --++++- --++++ # @no_grad() --++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++- # # expert_cache = torch.zeros_like(x) --++++- # # idxs = flat_expert_indices.argsort() --++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++- # # token_idxs = idxs // self.num_experts_per_tok --++++- # # for i, end_idx in enumerate(tokens_per_expert): --++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++- # # if start_idx == end_idx: --++++- # # continue --++++- # # expert = self.experts[i] --++++- # # exp_token_idx = token_idxs[start_idx:end_idx] --++++- # # expert_tokens = x[exp_token_idx] --++++- # # expert_out = expert(expert_tokens) --++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++- # # return expert_cache --+++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++ # expert_cache = ops.zeros_like(x) --++++ # idxs = flat_expert_indices.argsort() --++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): --++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++ --++++ # return expert_cache --++++- # @no_grad() --++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++- # expert_cache = ops.zeros_like(x) --+++++ --+++++ @no_grad() --+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++ """ --+++++ 优化版 MoE prefill: --+++++ - 批量张量化处理同一个 expert 的所有 token --+++++ - 跳过无 token 的专家 --+++++ - 保持结果完全一致 --+++++ """ --+++++ # 初始化输出缓存 --+++++ expert_cache = ops.zeros_like(x) --++++ --++++- # # 排序保证顺序一致 --++++- # idxs = flat_expert_indices.argsort() --++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++- # token_idxs = idxs // self.num_experts_per_tok --+++++ # 排序(确保 scatter_add 位置对应原逻辑) --+++++ idxs = flat_expert_indices.argsort() --+++++ sorted_expert_indices = flat_expert_indices[idxs] --+++++ sorted_token_indices = idxs // self.num_experts_per_tok --++++ --++++- # # 找出有 token 的专家 --++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++ # 每个 expert 的 token 数 --+++++ tokens_per_expert = sorted_expert_indices.bincount() --++++ --++++- # for i in active_experts.tolist(): --++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++- # end_idx = tokens_per_expert[i] --++++- # if start_idx == end_idx: # 没有 token --++++- # continue --+++++ # 找出有 token 的专家 --+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --++++ --++++- # exp_token_idx = token_idxs[start_idx:end_idx] --++++- # expert_tokens = x[exp_token_idx] --++++- # expert_out = self.experts[i](expert_tokens) --++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++ for expert_id in active_experts.tolist(): --+++++ # 取该 expert 对应的排序后 token 区间 --+++++ start = (tokens_per_expert[:expert_id]).sum().item() --+++++ end = start + tokens_per_expert[expert_id].item() --++++ --++++- # expert_cache = mindspore.mint.scatter_add( --++++- # expert_cache, --++++- # 0, --++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++- # expert_out --++++- # ) --+++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 --+++++ expert_tokens = x[token_idx] # 取输入向量 --++++ --++++- # return expert_cache --+++++ # 执行专家 MLP --+++++ expert_out = self.experts[expert_id](expert_tokens) --+++++ --+++++ # 按权重缩放 --+++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --+++++ --+++++ # 回写到缓存(等价 scatter_add) --+++++ expert_cache = mindspore.mint.scatter_add( --+++++ expert_cache, --+++++ 0, --+++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++ scaled_out --+++++ ) --+++++ --+++++ return expert_cache --+++++ --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # # expert_cache = torch.zeros_like(x) --+++++ # # idxs = flat_expert_indices.argsort() --+++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++ # # token_idxs = idxs // self.num_experts_per_tok --+++++ # # for i, end_idx in enumerate(tokens_per_expert): --+++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++ # # if start_idx == end_idx: --+++++ # # continue --+++++ # # expert = self.experts[i] --+++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # # expert_tokens = x[exp_token_idx] --+++++ # # expert_out = expert(expert_tokens) --+++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++ # # return expert_cache --+++++ # expert_cache = ops.zeros_like(x) --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # for i, end_idx in enumerate(tokens_per_expert): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # if start_idx == end_idx: --+++++ # continue --+++++ # expert = self.experts[i] --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = expert(expert_tokens) --+++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --+++++ # return expert_cache --+++++ # @no_grad() --+++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++ # expert_cache = ops.zeros_like(x) --+++++ --+++++ # # 排序保证顺序一致 --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ # token_idxs = idxs // self.num_experts_per_tok --+++++ --+++++ # # 找出有 token 的专家 --+++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++ --+++++ # for i in active_experts.tolist(): --+++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ # end_idx = tokens_per_expert[i] --+++++ # if start_idx == end_idx: # 没有 token --+++++ # continue --+++++ --+++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++ # expert_tokens = x[exp_token_idx] --+++++ # expert_out = self.experts[i](expert_tokens) --+++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++ --+++++ # expert_cache = mindspore.mint.scatter_add( --+++++ # expert_cache, --+++++ # 0, --+++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++ # expert_out --+++++ # ) --+++++ --+++++ # return expert_cache --++++ --++++ --++++ --++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++- --++++ # class DeepseekFlashAttention(nn.Module): --++++ # """ --++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --+++++ --++++ Deepseek_ATTENTION_CLASSES = { --++++ "eager": DeepseekAttention, --++++ "flash-attention": DeepseekFlashAttention, --++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): --++++ ) --++++ else: --++++ # 4d mask is passed through the layers --++++- attention_mask = _prepare_4d_causal_attention_mask( --+++++ # attention_mask = _prepare_4d_causal_attention_mask( --+++++ # attention_mask, --+++++ # (batch_size, seq_length), --+++++ # inputs_embeds, --+++++ # past_key_values_length, --+++++ # ) --+++++ #@dwj --+++++ attention_mask = get_cached_causal_mask( --++++ attention_mask, --++++ (batch_size, seq_length), --++++ inputs_embeds, --++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ # Initialize weights and apply final processing --++++ self.post_init() --++++ self.warm_up = False --+++++ #@dwj --+++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --+++++ self.num_layers, --+++++ self.num_attention_heads, --+++++ self.head_dim, --+++++ batch_size=1, --+++++ max_length=self.max_length, --+++++ dtype=mindspore.float16 --+++++ ) --+++++ --+++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --+++++ key_cache = [] --+++++ value_cache = [] --+++++ for _ in range(num_layers): --+++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --+++++ key_cache.append(k) --+++++ value_cache.append(v) --+++++ return key_cache, value_cache --+++++ --++++ --++++ def warmup_moe_model_deep(self): --++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++index bced285c..ebd7782e 100644 --++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) --++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --++++ --++++-Long_Prompt = False --++++-PROMPT_LENGTH_THRESHOLD = 128 --+++++Long_Prompt = 1 --+++++LONG_PROMPT_LENGTH_THRESHOLD = 128 --+++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 --+++++ --+++++_causal_mask_cache = {} --+++++ --+++++def get_cached_causal_mask_with_cache_position( --+++++ attention_mask: mindspore.Tensor, --+++++ sequence_length: int, --+++++ target_length: int, --+++++ dtype: mindspore.dtype, --+++++ min_dtype: float, --+++++ cache_position: mindspore.Tensor, --+++++ batch_size: int, --+++++): --+++++ """ --+++++ 带缓存的 causal mask 构造函数 --+++++ """ --+++++ # q_len 是当前 query 长度 --+++++ q_len = sequence_length --+++++ # kv_len 是 target_length --+++++ kv_len = target_length --+++++ --+++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --+++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) --+++++ --+++++ if key in _causal_mask_cache: --+++++ return _causal_mask_cache[key] --+++++ --+++++ # 调用原来的 mask 构造逻辑 --+++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++++ attention_mask, --+++++ sequence_length=sequence_length, --+++++ target_length=target_length, --+++++ dtype=dtype, --+++++ min_dtype=min_dtype, --+++++ cache_position=cache_position, --+++++ batch_size=batch_size, --+++++ ) --+++++ # 缓存结果 --+++++ _causal_mask_cache[key] = causal_mask --+++++ return causal_mask --++++ --++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --++++ def _prepare_4d_causal_attention_mask_with_cache_position( --++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++++ --++++ --++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --+++++# class Qwen2MoeAttention(nn.Module): --+++++# """ --+++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+++++# and "Generating Long Sequences with Sparse Transformers". --+++++# """ --+++++ --+++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++# super().__init__() --+++++# self.config = config --+++++# self.layer_idx = layer_idx --+++++# if layer_idx is None: --+++++# logger.warning_once( --+++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++++# "when creating this class." --+++++# ) --+++++ --+++++# self.hidden_size = config.hidden_size --+++++# self.num_heads = config.num_attention_heads --+++++# self.head_dim = self.hidden_size // self.num_heads --+++++# self.num_key_value_heads = config.num_key_value_heads --+++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++# self.max_position_embeddings = config.max_position_embeddings --+++++# self.rope_theta = config.rope_theta --+++++# self.is_causal = True --+++++# self.attention_dropout = config.attention_dropout --+++++ --+++++# if (self.head_dim * self.num_heads) != self.hidden_size: --+++++# raise ValueError( --+++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --+++++# f" and `num_heads`: {self.num_heads})." --+++++# ) --+++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++ --+++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++# self.head_dim, --+++++# max_position_embeddings=self.max_position_embeddings, --+++++# base=self.rope_theta, --+++++# ) --+++++ --+++++# def forward( --+++++# self, --+++++# hidden_states: mindspore.Tensor, --+++++# attention_mask: Optional[mindspore.Tensor] = None, --+++++# position_ids: Optional[mindspore.Tensor] = None, --+++++# past_key_value: Optional[Cache] = None, --+++++# output_attentions: bool = False, --+++++# use_cache: bool = False, --+++++# cache_position: Optional[mindspore.Tensor] = None, --+++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++ --+++++ --+++++ --+++++# bsz, q_len, _ = hidden_states.shape --+++++ --+++++# query_states = self.q_proj(hidden_states) --+++++# key_states = self.k_proj(hidden_states) --+++++# value_states = self.v_proj(hidden_states) --+++++ --+++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++ --+++++# kv_seq_len = key_states.shape[-2] --+++++# if past_key_value is not None: --+++++# if self.layer_idx is None: --+++++# raise ValueError( --+++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++# "with a layer index." --+++++# ) --+++++# if isinstance(past_key_value, StaticCache): --+++++# kv_seq_len = key_states.shape[-2] --+++++# else: --+++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++# if past_key_value is not None: --+++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++# if isinstance(past_key_value, StaticCache): --+++++# kv_seq_len = key_states.shape[-2] --+++++ --+++++# # repeat k/v heads if n_kv_heads < n_heads --+++++# key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++# value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++ --+++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++ --+++++# if attention_mask is not None: --+++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++# attn_weights = attn_weights + causal_mask --+++++ --+++++# # upcast attention to fp32 --+++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++++# attn_output = ops.matmul(attn_weights, value_states) --+++++ --+++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++++# raise ValueError( --+++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+++++# f" {attn_output.shape}" --+++++# ) --+++++ --+++++# attn_output = ops.transpose(attn_output, 1, 2) --+++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++ --+++++# attn_output = self.o_proj(attn_output) --+++++# # @lwx --+++++ --+++++# # max_seq_len = self.max_position_embeddings # 2048 --+++++ --+++++# # if attention_mask is not None: --+++++# # # attention_mask: [B, 1, Sq, Sk] --+++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++ --+++++# # # pad 到 [max_seq_len, max_seq_len] --+++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++# # global_attention_mask = padded_mask --+++++# # else: --+++++# # global_attention_mask = None --+++++ --+++++ --+++++# # sparse_mode=3 --+++++# # attn_output = mindspore.ops.flash_attention_score( --+++++# # query=query_states, --+++++# # key=key_states, --+++++# # value=value_states, --+++++# # real_shift=None, --+++++# # padding_mask=None, --+++++ --+++++# # head_num=self.num_heads, --+++++# # attn_mask=global_attention_mask, --+++++# # keep_prob=1.0 - self.attention_dropout, --+++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++# # input_layout="BNSD", --+++++# # pre_tokens=2147483647, --+++++# # next_tokens=2147483647, --+++++# # inner_precise=0, --+++++# # drop_mask=None, --+++++# # prefix=None, --+++++# # actual_seq_qlen=None, --+++++# # actual_seq_kvlen=None, --+++++# # sparse_mode=sparse_mode, --+++++# # ) --+++++# if not output_attentions: --+++++# attn_weights = None --+++++ --+++++# return attn_output, attn_weights, past_key_value --+++++ --++++ class Qwen2MoeAttention(nn.Module): --++++ """ --++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --++++- and "Generating Long Sequences with Sparse Transformers". --++++- """ --+++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 --++++ --+++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --+++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --+++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --+++++ --+++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --+++++ """ --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++ super().__init__() --++++ self.config = config --++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): --++++ if layer_idx is None: --++++ logger.warning_once( --++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++ "when creating this class." --++++ ) --++++ --++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): --++++ use_cache: bool = False, --++++ cache_position: Optional[mindspore.Tensor] = None, --++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++- --++++ --++++- --+++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- --++++ bsz, q_len, _ = hidden_states.shape --++++ --++++ query_states = self.q_proj(hidden_states) --++++ key_states = self.k_proj(hidden_states) --++++ value_states = self.v_proj(hidden_states) --++++ --++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++- --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ --++++ kv_seq_len = key_states.shape[-2] --++++ if past_key_value is not None: --++++- if self.layer_idx is None: --++++- raise ValueError( --++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++- "with a layer index." --++++- ) --++++- if isinstance(past_key_value, StaticCache): --++++- kv_seq_len = key_states.shape[-2] --++++- else: --++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++ --++++ if past_key_value is not None: --++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++ --+++++ # --- 2. 动态调度核心注意力计算 --- --+++++ global Long_Prompt --+++++ if Long_Prompt >= 1: --+++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --+++++ fa_attention_mask = None --+++++ if attention_mask is not None: --+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++ fa_attention_mask = (mask_slice != 0) --+++++ --+++++ attn_output = mindspore.ops.flash_attention_score( --+++++ query=query_states, --+++++ key=key_states, --+++++ value=value_states, --+++++ head_num=self.num_heads, --+++++ attn_mask=fa_attention_mask, --+++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --+++++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ input_layout="BNSD", --+++++ sparse_mode=0, --+++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --+++++ ) --++++ --++++- if isinstance(past_key_value, StaticCache): --++++- kv_seq_len = key_states.shape[-2] --+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = self.o_proj(attn_output) --+++++ attn_weights = None --+++++ if output_attentions: --+++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --++++ --++++- # repeat k/v heads if n_kv_heads < n_heads --++++- key_states = repeat_kv(key_states, self.num_key_value_groups) --++++- value_states = repeat_kv(value_states, self.num_key_value_groups) --++++- --++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++ else: --+++++ # --- Eager Attention 路径 (用于短序列和解码) --- --+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++ --+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++ --++++- if attention_mask is not None: --++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++- attn_weights = attn_weights + causal_mask --+++++ if attention_mask is not None: --+++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++ attn_weights = attn_weights + causal_mask --++++ --++++- # upcast attention to fp32 --++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++++- attn_output = ops.matmul(attn_weights, value_states) --+++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++++ attn_output = ops.matmul(attn_weights, value_states) --++++ --++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++++- raise ValueError( --++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --++++- f" {attn_output.shape}" --++++- ) --+++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++++ raise ValueError( --+++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --+++++ ) --++++ --++++- attn_output = ops.transpose(attn_output, 1, 2) --++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = ops.transpose(attn_output, 1, 2) --+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = self.o_proj(attn_output) --++++ --++++- attn_output = self.o_proj(attn_output) --++++- # @lwx --+++++ if not output_attentions: --+++++ attn_weights = None --++++ --++++- # max_seq_len = self.max_position_embeddings # 2048 --++++- --++++- # if attention_mask is not None: --++++- # # attention_mask: [B, 1, Sq, Sk] --++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++- --++++- # # pad 到 [max_seq_len, max_seq_len] --++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++- # global_attention_mask = padded_mask --++++- # else: --++++- # global_attention_mask = None --++++- --++++- --++++- # sparse_mode=3 --++++- # attn_output = mindspore.ops.flash_attention_score( --++++- # query=query_states, --++++- # key=key_states, --++++- # value=value_states, --++++- # real_shift=None, --++++- # padding_mask=None, --++++- --++++- # head_num=self.num_heads, --++++- # attn_mask=global_attention_mask, --++++- # keep_prob=1.0 - self.attention_dropout, --++++- # scalar_value=1.0 / math.sqrt(self.head_dim), --++++- # input_layout="BNSD", --++++- # pre_tokens=2147483647, --++++- # next_tokens=2147483647, --++++- # inner_precise=0, --++++- # drop_mask=None, --++++- # prefix=None, --++++- # actual_seq_qlen=None, --++++- # actual_seq_kvlen=None, --++++- # sparse_mode=sparse_mode, --++++- # ) --++++- if not output_attentions: --++++- attn_weights = None --++++- --++++ return attn_output, attn_weights, past_key_value --++++ --++++- --++++ # class Qwen2MoeFlashAttention(nn.Module): --++++ # """ --++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { --++++ # return final_hidden_states, router_logits --++++ --++++ --++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++++-# """ --++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 --++++-# """ --++++-# def __init__(self, config: Qwen2MoeConfig): --++++-# super().__init__() --++++-# self.num_experts = config.num_experts --++++-# self.top_k = config.num_experts_per_tok --++++-# self.norm_topk_prob = config.norm_topk_prob --++++- --++++-# # 门控网络 --++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++-# # 专家列表 --++++-# self.experts = nn.ModuleList( --++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++-# ) --++++-# # 共享专家 --++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-# @no_grad() --++++-# def _moe_infer_decode( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# """ --++++-# 【解码路径】针对 sequence_length=1 的极致优化。 --++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --++++-# """ --++++-# batch_size, hidden_dim = hidden_states.shape --++++- --++++-# expert_outputs_list = [ --++++-# ops.cat([ --++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++-# ], dim=0) --++++-# for i in range(batch_size) --++++-# ] --++++- --++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- --++++-# # shape: (batch_size, top_k, hidden_dim) --++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++- --++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++- --++++-# return moe_output.squeeze(1) --++++- --++++-# @no_grad() --++++-# def _moe_infer_prefill( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# """ --++++-# 【预填充路径】针对 sequence_length > 1 的优化。 --++++-# 按专家对 Token 进行分组,并进行批处理。 --++++-# """ --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens = hidden_states.shape[0] --++++-# flat_selected_experts = selected_experts.flatten() --++++- --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++- --++++-# active_experts = ops.unique(flat_selected_experts) --++++- --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++- --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++-# selected_token_indices = token_indices[mask] --++++-# selected_routing_weights = routing_weights.flatten()[mask] --++++- --++++-# current_states = hidden_states[selected_token_indices] --++++- --++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++- --++++-# moe_output = moe_output.index_add( --++++-# dim=0, --++++-# index=selected_token_indices, --++++-# source=expert_output.to(hidden_states.dtype) --++++-# ) --++++-# return moe_output --++++- --++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-# """ --++++-# 顶层 forward 方法,作为智能分发器。 --++++-# """ --++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- --++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-# router_logits = self.gate(hidden_states_reshaped) --++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- --++++-# if self.norm_topk_prob: --++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- --++++-# routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++-# moe_output = None --++++-# # 在推理时,根据序列长度选择最优路径 --++++-# if not self.training: --++++-# if sequence_length == 1: --++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++-# else: --++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++-# else: --++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --++++-# raise NotImplementedError("Training path is not implemented.") --++++- --++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --++++- --++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --++++- --++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --++++- --++++-# return final_hidden_states, router_logits --++++- --++++- --++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++++-# """ --++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --++++-# """ --++++-# def __init__(self, config: Qwen2MoeConfig): --++++-# super().__init__() --++++-# self.num_experts = config.num_experts --++++-# self.top_k = config.num_experts_per_tok --++++-# self.norm_topk_prob = config.norm_topk_prob --++++- --++++-# # 门控网络 --++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++-# # 专家列表 --++++-# self.experts = nn.ModuleList( --++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++-# ) --++++-# # 共享专家 --++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-# @no_grad() --++++-# def _moe_infer_decode( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# batch_size, _ = hidden_states.shape --++++-# expert_outputs_list = [ --++++-# ops.cat([ --++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++-# ], dim=0) --++++-# for i in range(batch_size) --++++-# ] --++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++-# return moe_output.squeeze(1) --++++- --++++-# @no_grad() --++++-# def _moe_infer_prefill( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens = hidden_states.shape[0] --++++-# flat_selected_experts = selected_experts.flatten() --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++-# active_experts = ops.unique(flat_selected_experts) --++++- --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++-# selected_token_indices = token_indices[mask] --++++-# selected_routing_weights = routing_weights.flatten()[mask] --++++-# current_states = hidden_states[selected_token_indices] --++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++-# moe_output = moe_output.index_add( --++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++-# ) --++++-# return moe_output --++++- --++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-# """ --++++-# 顶层 forward 方法,作为智能分发器。 --++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --++++-# """ --++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- --++++-# # 1. 门控计算 (通用逻辑) --++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-# router_logits = self.gate(hidden_states_reshaped) --++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- --++++-# if self.norm_topk_prob: --++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- --++++-# routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++-# # 2. 智能分发到最优 MoE 路径 --++++-# moe_output = None --++++-# if not self.training: --++++-# if sequence_length == 1: --++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++-# else: --++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++-# else: --++++-# raise NotImplementedError("Training path is not implemented.") --++++- --++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++- --++++-# # 4. 合并 MoE 输出和共享专家输出 --++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++- --++++-# # 5. 恢复原始形状并返回 --++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++- --++++-# return final_hidden_states, router_logits --++++- --++++-# prefill fastest --++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++++-# """ --++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --++++-# """ --++++-# def __init__(self, config: Qwen2MoeConfig): --++++-# super().__init__() --++++-# self.num_experts = config.num_experts --++++-# self.top_k = config.num_experts_per_tok --++++-# self.norm_topk_prob = config.norm_topk_prob --++++- --++++-# # 门控网络 --++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++-# # 专家列表 --++++-# self.experts = nn.ModuleList( --++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++-# ) --++++-# # 共享专家 --++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-# @no_grad() --++++-# def _moe_infer_dispatch( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# """ --++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --++++-# """ --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens, _ = hidden_states.shape --++++- --++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --++++-# flat_selected_experts = selected_experts.flatten() --++++-# flat_routing_weights = routing_weights.flatten() --++++- --++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++- --++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --++++-# active_experts = ops.unique(flat_selected_experts) --++++- --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++- --++++-# # 找到所有分配给该专家的 token --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++- --++++-# # 使用 mask 选取对应的 token 和权重 --++++-# current_token_indices = token_indices[mask] --++++-# current_routing_weights = flat_routing_weights[mask] --++++-# current_hidden_states = hidden_states[current_token_indices] --++++- --++++-# # 对这些 token 进行批处理 --++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++- --++++-# # 使用 index_add 将结果精确地加回到对应位置 --++++-# moe_output = moe_output.index_add( --++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --++++-# ) --++++-# return moe_output --++++- --++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-# """ --++++-# 顶层 forward 方法,作为智能分发器。 --++++-# """ --++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- --++++-# # 1. 门控计算 --++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-# router_logits = self.gate(hidden_states_reshaped) --++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- --++++-# if self.norm_topk_prob: --++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- --++++-# routing_weights = routing_weights.to(hidden_states.dtype) --++++- --++++-# # 2. 调用统一的 MoE 计算内核 --++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --++++- --++++-# # 3. 统一处理共享专家 --++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++- --++++-# # 4. 合并输出 --++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++- --++++-# # 5. 恢复原始形状并返回 --++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++- --++++-# return final_hidden_states, router_logits --++++- --++++- --++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++++-# """ --++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++-# 【最终高性能与高精度版】: --++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --++++-# 3. 这样实现了速度和准确性的两全其美。 --++++-# """ --++++-# def __init__(self, config: Qwen2MoeConfig): --++++-# super().__init__() --++++-# self.num_experts = config.num_experts --++++-# self.top_k = config.num_experts_per_tok --++++-# self.norm_topk_prob = config.norm_topk_prob --++++- --++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++-# self.experts = nn.ModuleList( --++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++-# ) --++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-# @no_grad() --++++-# def _moe_infer_decode( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# """ --++++-# 【解码路径】极致优化版:bmm + 高精度累加。 --++++-# """ --++++-# original_dtype = hidden_states.dtype --++++-# batch_size, _ = hidden_states.shape --++++- --++++-# expert_outputs_list = [ --++++-# ops.cat([ --++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++-# ], dim=0) --++++-# for i in range(batch_size) --++++-# ] --++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++- --++++-# # 在 float32 下执行 bmm,得到高精度结果 --++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++- --++++-# # 将高精度结果转换回原始数据类型 --++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --++++- --++++-# return moe_output --++++- --++++-# @no_grad() --++++-# def _moe_infer_prefill( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# selected_experts: mindspore.Tensor, --++++-# routing_weights: mindspore.Tensor --++++-# ) -> mindspore.Tensor: --++++-# """ --++++-# 【预填充路径】与原始实现一致,结果精确。 --++++-# """ --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens, _ = hidden_states.shape --++++-# flat_selected_experts = selected_experts.flatten() --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++-# active_experts = ops.unique(flat_selected_experts) --++++- --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++-# selected_token_indices = token_indices[mask] --++++-# selected_routing_weights = routing_weights.flatten()[mask] --++++-# current_states = hidden_states[selected_token_indices] --++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++-# moe_output = moe_output.index_add( --++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++-# ) --++++-# return moe_output --++++- --++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++- --++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-# router_logits = self.gate(hidden_states_reshaped) --++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++- --++++-# if self.norm_topk_prob: --++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- --++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --++++-# # 如果模型主体是 float16,后续再转换 --++++- --++++-# moe_output = None --++++-# if not self.training: --++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --++++-# # _moe_infer_decode 内部会处理好类型转换 --++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) --++++-# if sequence_length == 1: --++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++-# else: --++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++-# else: --++++-# raise NotImplementedError("Training path is not implemented.") --++++- --++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++- --++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++- --++++-# return final_hidden_states, router_logits --++++- --++++- --++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --++++-# """ --++++-# 【融合版】一个混合专家模块,内置两种推理策略, --++++-# 由外部全局变量 `Long_Prompt` 控制: --++++- --++++-# - if Long_Prompt is True: 【精度优先模式】 --++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --++++-# 适用于处理长序列,避免误差累积。 --++++- --++++-# - if Long_Prompt is False: 【速度优先模式】 --++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 --++++-# """ --++++-# def __init__(self, config: Qwen2MoeConfig): --++++-# super().__init__() --++++-# self.num_experts = config.num_experts --++++-# self.top_k = config.num_experts_per_tok --++++-# self.norm_topk_prob = config.norm_topk_prob --++++- --++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++-# self.experts = nn.ModuleList( --++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++-# ) --++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-# # --- 速度优先模式的辅助函数 --- --++++-# @no_grad() --++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++-# original_dtype = hidden_states.dtype --++++-# batch_size, _ = hidden_states.shape --++++-# expert_outputs_list = [ --++++-# ops.cat([ --++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++-# ], dim=0) --++++-# for i in range(batch_size) --++++-# ] --++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++-# weights_fp32 = routing_weights.to(mindspore.float32) --++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++-# return moe_output_fp32.squeeze(1).to(original_dtype) --++++- --++++-# @no_grad() --++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens, _ = hidden_states.shape --++++-# flat_selected_experts = selected_experts.flatten() --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++-# active_experts = ops.unique(flat_selected_experts) --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++-# selected_token_indices = token_indices[mask] --++++-# selected_routing_weights = routing_weights.flatten()[mask] --++++-# current_states = hidden_states[selected_token_indices] --++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --++++-# return moe_output --++++- --++++-# # --- 精度优先模式的辅助函数 --- --++++-# @no_grad() --++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++-# moe_output = ops.zeros_like(hidden_states) --++++-# num_tokens, _ = hidden_states.shape --++++-# flat_selected_experts = selected_experts.flatten() --++++-# flat_routing_weights = routing_weights.flatten() --++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++-# active_experts = ops.unique(flat_selected_experts) --++++-# for expert_idx_tensor in active_experts: --++++-# expert_idx = expert_idx_tensor.item() --++++-# expert_layer = self.experts[expert_idx] --++++-# mask = (flat_selected_experts == expert_idx_tensor) --++++-# current_token_indices = token_indices[mask] --++++-# current_routing_weights = flat_routing_weights[mask] --++++-# current_hidden_states = hidden_states[current_token_indices] --++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++++-# return moe_output --++++- --++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-# # 声明我们将要使用一个在模块外部定义的全局变量 --++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --++++-# global Long_Prompt --++++- --++++-# # 1. 门控计算 (所有模式通用) --++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-# router_logits = self.gate(hidden_states_reshaped) --++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++++-# if self.norm_topk_prob: --++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++- --++++-# moe_output = None --++++-# if not self.training: --++++-# # 根据 Long_Prompt 标志选择模式 --++++-# if Long_Prompt: --++++-# # --- 精度优先模式 --- --++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++-# else: --++++-# # --- 速度优先模式 --- --++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++-# if sequence_length == 1: --++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++-# else: --++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++-# else: --++++-# raise NotImplementedError("Training path is not implemented.") --++++- --++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++- --++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++- --++++-# return final_hidden_states, router_logits --++++- --++++ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ """ --++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++ return moe_output_fp32.squeeze(1).to(original_dtype) --++++ --+++++ # @no_grad() --+++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++ # num_tokens, _ = hidden_states.shape --+++++ # flat_selected_experts = selected_experts.flatten() --+++++ # sorted_expert_indices = flat_selected_experts.argsort() --+++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++++ # original_token_indices = sorted_expert_indices // self.top_k --+++++ # moe_output = ops.zeros_like(hidden_states) --+++++ # current_token_offset = 0 --+++++ # for i in range(self.num_experts): --+++++ # expert_token_count = tokens_per_expert[i] - current_token_offset --+++++ # if expert_token_count == 0: --+++++ # continue --+++++ # end_offset = current_token_offset + expert_token_count --+++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++++ # expert_hidden_states = hidden_states[expert_original_token_indices] --+++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++ # current_token_offset += expert_token_count --+++++ # return moe_output --+++++ --++++ @no_grad() --++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++- num_tokens, _ = hidden_states.shape --++++- flat_selected_experts = selected_experts.flatten() --++++- sorted_expert_indices = flat_selected_experts.argsort() --++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++++- original_token_indices = sorted_expert_indices // self.top_k --+++++ """ --+++++ 优化版 MoE prefill (速度优先模式): --+++++ - 批量张量化处理同一个 expert 的所有 token --+++++ - 跳过无 token 的专家 --+++++ - 保持结果完全一致 --+++++ """ --++++ moe_output = ops.zeros_like(hidden_states) --++++- current_token_offset = 0 --++++- for i in range(self.num_experts): --++++- expert_token_count = tokens_per_expert[i] - current_token_offset --++++- if expert_token_count == 0: --++++- continue --++++- end_offset = current_token_offset + expert_token_count --++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++++- expert_hidden_states = hidden_states[expert_original_token_indices] --++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++++- current_token_offset += expert_token_count --+++++ --+++++ flat_selected_experts = selected_experts.flatten() --+++++ flat_routing_weights = routing_weights.flatten() --+++++ --+++++ idxs = flat_selected_experts.argsort() --+++++ sorted_expert_indices = flat_selected_experts[idxs] --+++++ sorted_token_indices = idxs // self.top_k --+++++ --+++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --+++++ --+++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+++++ --+++++ for expert_id in active_experts.tolist(): --+++++ start = int(tokens_per_expert[:expert_id].sum().item()) --+++++ end = start + int(tokens_per_expert[expert_id].item()) --+++++ --+++++ token_idx = sorted_token_indices[start:end] --+++++ expert_tokens = hidden_states[token_idx] --+++++ --+++++ expert_out = self.experts[expert_id](expert_tokens) --+++++ --+++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --+++++ --+++++ moe_output = mindspore.mint.scatter_add( --+++++ moe_output, --+++++ 0, --+++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --+++++ scaled_out.to(hidden_states.dtype) --+++++ ) --+++++ --++++ return moe_output --++++ --+++++ --++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --++++ @no_grad() --++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++ --++++ moe_output = None --++++- if Long_Prompt: --++++- # --- 精度优先模式 (ACCURACY MODE) --- --++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ # if Long_Prompt==0: --+++++ # # --- 精度优先模式 (ACCURACY MODE) --- --+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ # else: --+++++ # # --- 速度优先模式 (SPEED MODE) --- --+++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++ # if sequence_length == 1: --+++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ # else: --+++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ --+++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++ if sequence_length == 1: --+++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++ else: --++++- # --- 速度优先模式 (SPEED MODE) --- --++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++- if sequence_length == 1: --++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++- else: --++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++- --+++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ --++++ --++++ # 3. 共享专家计算与合并 (所有模式通用) --++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++ --++++ return final_hidden_states, router_logits --++++ --+++++ --++++ class Qwen2MoeDecoderLayer(nn.Module): --++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --++++ super().__init__() --++++ self.hidden_size = config.hidden_size --++++ --++++- # if Long_Prompt: --++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++- # else: --+++++ # if Long_Prompt == 2: --++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++ # else: --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ --++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++ --++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++++ ) --++++ --++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++++ # attention_mask, --+++++ # sequence_length=sequence_length, --+++++ # target_length=target_length, --+++++ # dtype=dtype, --+++++ # min_dtype=min_dtype, --+++++ # cache_position=cache_position, --+++++ # batch_size=input_tensor.shape[0], --+++++ # ) --+++++ #@dwj --+++++ causal_mask = get_cached_causal_mask_with_cache_position( --++++ attention_mask, --++++ sequence_length=sequence_length, --++++ target_length=target_length, --++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --++++ """ --++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --+++++ _causal_mask_cache.clear() --++++ --++++ input_ids = kwargs.get("input_ids") --++++ if input_ids is None and args: --++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ --++++ if input_ids is not None: --++++ prompt_length = input_ids.shape[1] --++++- --++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: --++++- Long_Prompt = True --+++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --+++++ Long_Prompt = 2 --+++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --+++++ Long_Prompt = 0 --++++ else: --++++- Long_Prompt = False --+++++ Long_Prompt = 1 --+++++ --++++ --++++ return super().generate(*args, **kwargs) --++++ --++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++ dtype = self.lm_head.weight.dtype --++++ min_dtype = float(ops.finfo(dtype).min) --++++ --++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --+++++ # attention_mask, --+++++ # sequence_length=sequence_length, --+++++ # target_length=past_key_values.get_max_length(), --+++++ # dtype=dtype, --+++++ # min_dtype=min_dtype, --+++++ # cache_position=cache_position, --+++++ # batch_size=batch_size, --+++++ # ) --+++++ --+++++ #@dwj --+++++ attention_mask = get_cached_causal_mask_with_cache_position( --++++ attention_mask, --++++ sequence_length=sequence_length, --++++ target_length=past_key_values.get_max_length(), --++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++++deleted file mode 100644 --++++index 6dfb5b93..00000000 --++++--- a/patches/0001-20251104commit.patch --+++++++ /dev/null --++++@@ -1,1272 +0,0 @@ --++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++++-From: Pinoeer-kingxi <13022943007@163.com> --++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 --++++-Subject: [PATCH] 20251104commit --++++- --++++---- --++++- mindnlp/transformers/cache_utils.py | 28 +- --++++- .../models/deepseek/modeling_deepseek.py | 149 ++- --++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++++- 3 files changed, 976 insertions(+), 87 deletions(-) --++++- --++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++++-index cadd2e04..02f8d4be 100644 --++++---- a/mindnlp/transformers/cache_utils.py --++++-+++ b/mindnlp/transformers/cache_utils.py --++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): --++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++++- # k_out[:, :, cache_position] = key_states --++++- # v_out[:, :, cache_position] = value_states --++++-- if ON_ORANGE_PI: --++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++-- else: --++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++-- --++++-+ # if ON_ORANGE_PI: --++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++-+ # else: --++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 --++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++++-+ if cache_position.ndim > 1: --++++-+ cache_position = cache_position.flatten() --++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) --++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++++-+ cache_position = cache_position.int() --++++-+ --++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++++-+ k_out[:, :, cache_position] = key_states --++++-+ v_out[:, :, cache_position] = value_states --++++-+ --++++- return k_out, v_out --++++- --++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++-index c695b944..d8303e45 100644 --++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++++- # Copied from transformers.models.llama.modeling_llama.rotate_half --++++- def rotate_half(x): --++++- """Rotates half the hidden dims of the input.""" --++++-- x1 = x[..., : x.shape[-1] // 2] --++++-- x2 = x[..., x.shape[-1] // 2 :] --++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++-+ # x1 = x[..., : x.shape[-1] // 2] --++++-+ # x2 = x[..., x.shape[-1] // 2 :] --++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++- return ops.cat((-x2, x1), dim=-1) --++++- --++++- --++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++++- if self.training: --++++- raise NotImplementedError("Training is not supported yet.") --++++- else: --++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++-- if self.config.n_shared_experts is not None: --++++-- y = y + self.shared_experts(identity) --++++-- return y --++++-+ # @lwx --++++-+ if orig_shape[1] == 1: --++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++++-+ y=y.view(*orig_shape) --++++-+ if self.config.n_shared_experts is not None: --++++-+ y = y + self.shared_experts(identity) --++++-+ return y --++++-+ else: --++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++++-+ if self.config.n_shared_experts is not None: --++++-+ y = y + self.shared_experts(identity) --++++-+ return y --++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++-+ # if self.config.n_shared_experts is not None: --++++-+ # y = y + self.shared_experts(identity) --++++-+ # return y --++++-+ --++++-+ @no_grad() --++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++-+ --++++-+ expert_cache = ops.zeros_like(x) --++++-+ for i in range(self.num_experts_per_tok): --++++-+ expert_id = flat_expert_indices[i].item() --++++-+ weight = flat_expert_weights[i].item() --++++-+ expert = self.experts[expert_id] --++++-+ expert_out = expert(x) --++++-+ expert_cache += expert_out * weight --++++-+ return expert_cache --++++- --++++- @no_grad() --++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++-- # expert_cache = torch.zeros_like(x) --++++-- # idxs = flat_expert_indices.argsort() --++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++-- # token_idxs = idxs // self.num_experts_per_tok --++++-- # for i, end_idx in enumerate(tokens_per_expert): --++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++-- # if start_idx == end_idx: --++++-- # continue --++++-- # expert = self.experts[i] --++++-- # exp_token_idx = token_idxs[start_idx:end_idx] --++++-- # expert_tokens = x[exp_token_idx] --++++-- # expert_out = expert(expert_tokens) --++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++-- # return expert_cache --++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++- expert_cache = ops.zeros_like(x) --++++- idxs = flat_expert_indices.argsort() --++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++- token_idxs = idxs // self.num_experts_per_tok --++++-+ --++++- for i, end_idx in enumerate(tokens_per_expert): --++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++- if start_idx == end_idx: --++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++++- expert_out = expert(expert_tokens) --++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++-+ --++++- return expert_cache --++++-+ --++++-+ # @no_grad() --++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++-+ # # expert_cache = torch.zeros_like(x) --++++-+ # # idxs = flat_expert_indices.argsort() --++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++-+ # # token_idxs = idxs // self.num_experts_per_tok --++++-+ # # for i, end_idx in enumerate(tokens_per_expert): --++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++-+ # # if start_idx == end_idx: --++++-+ # # continue --++++-+ # # expert = self.experts[i] --++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++-+ # # expert_tokens = x[exp_token_idx] --++++-+ # # expert_out = expert(expert_tokens) --++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++-+ # # return expert_cache --++++-+ # expert_cache = ops.zeros_like(x) --++++-+ # idxs = flat_expert_indices.argsort() --++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++-+ # token_idxs = idxs // self.num_experts_per_tok --++++-+ --++++-+ # for i, end_idx in enumerate(tokens_per_expert): --++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++-+ # if start_idx == end_idx: --++++-+ # continue --++++-+ # expert = self.experts[i] --++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --++++-+ # expert_tokens = x[exp_token_idx] --++++-+ # expert_out = expert(expert_tokens) --++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++-+ --++++-+ # return expert_cache --++++-+ # @no_grad() --++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++-+ # expert_cache = ops.zeros_like(x) --++++-+ --++++-+ # # 排序保证顺序一致 --++++-+ # idxs = flat_expert_indices.argsort() --++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++-+ # token_idxs = idxs // self.num_experts_per_tok --++++-+ --++++-+ # # 找出有 token 的专家 --++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++-+ --++++-+ # for i in active_experts.tolist(): --++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++-+ # end_idx = tokens_per_expert[i] --++++-+ # if start_idx == end_idx: # 没有 token --++++-+ # continue --++++-+ --++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --++++-+ # expert_tokens = x[exp_token_idx] --++++-+ # expert_out = self.experts[i](expert_tokens) --++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++-+ --++++-+ # expert_cache = mindspore.mint.scatter_add( --++++-+ # expert_cache, --++++-+ # 0, --++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++-+ # expert_out --++++-+ # ) --++++-+ --++++-+ # return expert_cache --++++-+ --++++-+ --++++- --++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++++- # """ --++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++- --++++- # Initialize weights and apply final processing --++++- self.post_init() --++++-+ self.warm_up = False --++++-+ --++++-+ def warmup_moe_model_deep(self): --++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++-+ test_texts = [ --++++-+ "warmup short", --++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++++-+ ] --++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++-+ if tokenizer is None: --++++-+ from mindnlp.transformers import AutoTokenizer --++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++-+ self._warmup_tokenizer = tokenizer --++++-+ --++++-+ for text in test_texts: --++++-+ inputs = tokenizer(text, return_tensors="ms") --++++-+ with mindspore._no_grad(): --++++-+ _ = self(**inputs, use_cache=False) --++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++++- --++++- def get_input_embeddings(self): --++++- return self.model.embed_tokens --++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++- ```""" --++++-+ if not self.warm_up: --++++-+ self.warm_up = True --++++-+ self.warmup_moe_model_deep() --++++-+ --++++- output_attentions = ( --++++- output_attentions --++++- if output_attentions is not None --++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++-index 3cbf820e..d4c6b651 100644 --++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++-@@ -18,7 +18,6 @@ --++++- # See the License for the specific language governing permissions and --++++- # limitations under the License. --++++- """MindSpore Qwen2MoE model.""" --++++-- --++++- import math --++++- from typing import List, Optional, Tuple, Union --++++- --++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++++- TokenClassifierOutput, --++++- ) --++++- from ...modeling_utils import PreTrainedModel --++++-+from ...generation import GenerationMixin --++++- from ....utils import logging --++++- from .configuration_qwen2_moe import Qwen2MoeConfig --++++- --++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++++- self.variance_epsilon = eps --++++- --++++- def forward(self, hidden_states): --++++-+ # @dwj --++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++-+ # @lwx --++++-+ # if not self.training : --++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++- input_dtype = hidden_states.dtype --++++- hidden_states = hidden_states.to(mindspore.float32) --++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++++-@@ -234,6 +239,8 @@ def rotate_half(x): --++++- """Rotates half the hidden dims of the input.""" --++++- x1 = x[..., : x.shape[-1] // 2] --++++- x2 = x[..., x.shape[-1] // 2 :] --++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++- return ops.cat((-x2, x1), dim=-1) --++++- --++++- --++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++++- self.config = config --++++- self.hidden_size = config.hidden_size --++++- self.intermediate_size = intermediate_size --++++-+ --++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++++- self.act_fn = ACT2FN[config.hidden_act] --++++- --++++- def forward(self, x): --++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++-- --++++- --++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++-+ # @lwx --++++-+ # gate_up_output = self.gate_up_proj(x) --++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++++-+ # return self.down_proj(swiglu_output) --++++-+ --++++-+ # def forward(self, x): --++++-+ # gate_proj_out = self.gate_proj(x) --++++-+ # up_proj_out = self.up_proj(x) --++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++++-+ # return self.down_proj(swiglu_out) --++++-+ --++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv --++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++++- """ --++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++++- use_cache: bool = False, --++++- cache_position: Optional[mindspore.Tensor] = None, --++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++-+ --++++-+ --++++-+ --++++- bsz, q_len, _ = hidden_states.shape --++++- --++++- query_states = self.q_proj(hidden_states) --++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++- "with a layer index." --++++- ) --++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++-+ if isinstance(past_key_value, StaticCache): --++++-+ kv_seq_len = key_states.shape[-2] --++++-+ else: --++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++- --++++- if past_key_value is not None: --++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++-+ --++++-+ if isinstance(past_key_value, StaticCache): --++++-+ kv_seq_len = key_states.shape[-2] --++++- --++++- # repeat k/v heads if n_kv_heads < n_heads --++++- key_states = repeat_kv(key_states, self.num_key_value_groups) --++++- value_states = repeat_kv(value_states, self.num_key_value_groups) --++++-- --++++-+ --++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++- --++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++++-- raise ValueError( --++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++++-- f" {attn_weights.shape}" --++++-- ) --++++-- --++++-- if attention_mask is not None: # no matter the length, we just slice it --++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++++-+ if attention_mask is not None: --++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++- attn_weights = attn_weights + causal_mask --++++- --++++- # upcast attention to fp32 --++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++- --++++- attn_output = self.o_proj(attn_output) --++++-- --++++-+ # @lwx --++++-+ --++++-+ # max_seq_len = self.max_position_embeddings # 2048 --++++-+ --++++-+ # if attention_mask is not None: --++++-+ # # attention_mask: [B, 1, Sq, Sk] --++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++-+ --++++-+ # # pad 到 [max_seq_len, max_seq_len] --++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++-+ # global_attention_mask = padded_mask --++++-+ # else: --++++-+ # global_attention_mask = None --++++-+ --++++-+ --++++-+ # sparse_mode=3 --++++-+ # attn_output = mindspore.ops.flash_attention_score( --++++-+ # query=query_states, --++++-+ # key=key_states, --++++-+ # value=value_states, --++++-+ # real_shift=None, --++++-+ # padding_mask=None, --++++-+ --++++-+ # head_num=self.num_heads, --++++-+ # attn_mask=global_attention_mask, --++++-+ # keep_prob=1.0 - self.attention_dropout, --++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++-+ # input_layout="BNSD", --++++-+ # pre_tokens=2147483647, --++++-+ # next_tokens=2147483647, --++++-+ # inner_precise=0, --++++-+ # drop_mask=None, --++++-+ # prefix=None, --++++-+ # actual_seq_qlen=None, --++++-+ # actual_seq_kvlen=None, --++++-+ # sparse_mode=sparse_mode, --++++-+ # ) --++++- if not output_attentions: --++++- attn_weights = None --++++- --++++- return attn_output, attn_weights, past_key_value --++++- --++++- --++++-+class Qwen2MoeFlashAttention(nn.Module): --++++-+ """ --++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++-+ --++++-+ 关键改动: --++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++-+ 直接传入原始的 key 和 value 张量效率更高。 --++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++-+ """ --++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++-+ super().__init__() --++++-+ self.config = config --++++-+ self.layer_idx = layer_idx --++++-+ self.hidden_size = config.hidden_size --++++-+ self.num_heads = config.num_attention_heads --++++-+ self.head_dim = self.hidden_size // self.num_heads --++++-+ self.num_key_value_heads = config.num_key_value_heads --++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++-+ self.max_position_embeddings = config.max_position_embeddings --++++-+ self.rope_theta = config.rope_theta --++++-+ self.attention_dropout = config.attention_dropout --++++-+ --++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: --++++-+ raise ValueError( --++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++-+ ) --++++-+ --++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++-+ --++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++-+ self.head_dim, --++++-+ max_position_embeddings=self.max_position_embeddings, --++++-+ base=self.rope_theta, --++++-+ ) --++++-+ --++++-+ def forward( --++++-+ self, --++++-+ hidden_states: mindspore.Tensor, --++++-+ attention_mask: Optional[mindspore.Tensor] = None, --++++-+ position_ids: Optional[mindspore.Tensor] = None, --++++-+ past_key_value: Optional[Cache] = None, --++++-+ output_attentions: bool = False, --++++-+ use_cache: bool = False, --++++-+ cache_position: Optional[mindspore.Tensor] = None, --++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++-+ --++++-+ bsz, q_len, _ = hidden_states.shape --++++-+ --++++-+ # 1. 线性投射 Q, K, V --++++-+ query_states = self.q_proj(hidden_states) --++++-+ key_states = self.k_proj(hidden_states) --++++-+ value_states = self.v_proj(hidden_states) --++++-+ --++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++-+ # query: [B, S, H*D] -> [B, N1, S, D] --++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ --++++-+ # 3. RoPE 旋转位置编码 --++++-+ kv_seq_len = key_states.shape[-2] --++++-+ if past_key_value is not None: --++++-+ if self.layer_idx is None: --++++-+ raise ValueError( --++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++-+ "with a layer index." --++++-+ ) --++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len --++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++-+ if cache_position.shape[0] == 1: --++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++-+ kv_seq_len = past_seen_tokens + 1 --++++-+ else: --++++-+ # prefill 阶段:cache_position 是范围,使用其长度 --++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++-+ else: --++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++-+ --++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++-+ --++++-+ # 4. KV 缓存更新 --++++-+ if past_key_value is not None: --++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++-+ key_states, value_states = past_key_value.update( --++++-+ key_states, value_states, self.layer_idx, cache_kwargs --++++-+ ) --++++-+ --++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++-+ if cache_position.shape[0] == 1: --++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++-+ kv_seq_len = key_states.shape[-2] --++++-+ --++++-+ # 5. [重要] 准备 Attention Mask --++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++-+ fa_attention_mask = None --++++-+ if attention_mask is not None: --++++-+ # 截取与当前key长度匹配的部分 --++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False --++++-+ fa_attention_mask = (mask_slice != 0) --++++-+ --++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++-+ input_dtype = query_states.dtype --++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++-+ query_states = query_states.to(mindspore.float16) --++++-+ key_states = key_states.to(mindspore.float16) --++++-+ value_states = value_states.to(mindspore.float16) --++++-+ --++++-+ # 6. [核心] 调用 flash_attention_score 算子 --++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA --++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++-+ attn_output = mindspore.ops.flash_attention_score( --++++-+ query=query_states, --++++-+ key=key_states, --++++-+ value=value_states, --++++-+ head_num=self.num_heads, # 传入Q的头数(N1) --++++-+ attn_mask=fa_attention_mask, --++++-+ keep_prob=1.0 - self.attention_dropout, --++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), --++++-+ input_layout="BNSD", --++++-+ sparse_mode=0 # 使用 defaultMask 模式 --++++-+ ) --++++-+ --++++-+ # 恢复原始数据类型 --++++-+ attn_output = attn_output.to(input_dtype) --++++-+ --++++-+ # 7. 调整输出形状 --++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++-+ attn_output = self.o_proj(attn_output) --++++-+ --++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 --++++-+ attn_weights = None --++++-+ if output_attentions: --++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++-+ --++++-+ return attn_output, attn_weights, past_key_value --++++-+ --++++-+ # def forward( --++++-+ # self, --++++-+ # hidden_states: mindspore.Tensor, --++++-+ # attention_mask: Optional[mindspore.Tensor] = None, --++++-+ # position_ids: Optional[mindspore.Tensor] = None, --++++-+ # past_key_value: Optional[Cache] = None, --++++-+ # output_attentions: bool = False, --++++-+ # use_cache: bool = False, --++++-+ # cache_position: Optional[mindspore.Tensor] = None, --++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++-+ --++++-+ # bsz, q_len, _ = hidden_states.shape --++++-+ --++++-+ # # 1. 线性投射 Q, K, V --++++-+ # query_states = self.q_proj(hidden_states) --++++-+ # key_states = self.k_proj(hidden_states) --++++-+ # value_states = self.v_proj(hidden_states) --++++-+ --++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ --++++-+ # # 3. RoPE 旋转位置编码 --++++-+ # kv_seq_len = key_states.shape[-2] --++++-+ # if past_key_value is not None: --++++-+ # if self.layer_idx is None: --++++-+ # raise ValueError( --++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++-+ # "with a layer index." --++++-+ # ) --++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++-+ --++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++-+ --++++-+ # # 4. KV 缓存更新 --++++-+ # if past_key_value is not None: --++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++-+ # key_states, value_states = past_key_value.update( --++++-+ # key_states, value_states, self.layer_idx, cache_kwargs --++++-+ # ) --++++-+ --++++-+ # # 5. 准备 Attention Mask --++++-+ # fa_attention_mask = None --++++-+ # if attention_mask is not None: --++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++-+ # fa_attention_mask = (mask_slice != 0) --++++-+ --++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++-+ # input_dtype = query_states.dtype --++++-+ --++++-+ # # 6. [核心] 调用 flash_attention_score 算子 --++++-+ # attn_output = mindspore.ops.flash_attention_score( --++++-+ # query=query_states, --++++-+ # key=key_states, --++++-+ # value=value_states, --++++-+ # head_num=self.num_heads, --++++-+ # attn_mask=fa_attention_mask, --++++-+ # keep_prob=1.0 - self.attention_dropout, --++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++-+ # input_layout="BNSD", --++++-+ # sparse_mode=0, --++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- --++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++-+ # inner_precise=1 --++++-+ # ) --++++-+ --++++-+ # # 恢复原始数据类型 --++++-+ # attn_output = attn_output.to(input_dtype) --++++-+ --++++-+ # # 7. 调整输出形状 --++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++-+ # attn_output = self.o_proj(attn_output) --++++-+ --++++-+ # attn_weights = None --++++-+ # if output_attentions: --++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++-+ --++++-+ # return attn_output, attn_weights, past_key_value --++++-+ --++++-+ # def forward( --++++-+ # self, --++++-+ # hidden_states: mindspore.Tensor, --++++-+ # attention_mask: Optional[mindspore.Tensor] = None, --++++-+ # position_ids: Optional[mindspore.Tensor] = None, --++++-+ # past_key_value: Optional[Cache] = None, --++++-+ # output_attentions: bool = False, --++++-+ # use_cache: bool = False, --++++-+ # cache_position: Optional[mindspore.Tensor] = None, --++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++-+ --++++-+ # bsz, q_len, _ = hidden_states.shape --++++-+ --++++-+ # query_states = self.q_proj(hidden_states) --++++-+ # key_states = self.k_proj(hidden_states) --++++-+ # value_states = self.v_proj(hidden_states) --++++-+ --++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-+ --++++-+ # kv_seq_len = key_states.shape[-2] --++++-+ # if past_key_value is not None: --++++-+ # if self.layer_idx is None: --++++-+ # raise ValueError("`layer_idx` must be specified for caching") --++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++-+ --++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++-+ --++++-+ # if past_key_value is not None: --++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++-+ # key_states, value_states = past_key_value.update( --++++-+ # key_states, value_states, self.layer_idx, cache_kwargs --++++-+ # ) --++++-+ --++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++-+ --++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- --++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++-+ # query_states = query_states / math.sqrt(self.head_dim) --++++-+ # # <--- 修改结束 --- --++++-+ --++++-+ # fa_attention_mask = None --++++-+ # if attention_mask is not None: --++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++-+ # fa_attention_mask = (mask_slice != 0) --++++-+ --++++-+ # input_dtype = query_states.dtype --++++-+ --++++-+ # attn_output = mindspore.ops.flash_attention_score( --++++-+ # query=query_states, # 传入已经预先缩放过的 query --++++-+ # key=key_states, --++++-+ # value=value_states, --++++-+ # head_num=self.num_heads, --++++-+ # attn_mask=fa_attention_mask, --++++-+ # keep_prob=1.0 - self.attention_dropout, --++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++-+ # input_layout="BNSD", --++++-+ # sparse_mode=0, --++++-+ # inner_precise=1 # 仍然保持内部高精度计算 --++++-+ # ) --++++-+ --++++-+ # attn_output = attn_output.to(input_dtype) --++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++-+ # attn_output = self.o_proj(attn_output) --++++-+ --++++-+ # attn_weights = None --++++-+ # if output_attentions: --++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++-+ --++++-+ # return attn_output, attn_weights, past_key_value --++++-+ --++++- QWEN2MOE_ATTENTION_CLASSES = { --++++- "eager": Qwen2MoeAttention, --++++-+ "flash-attention": Qwen2MoeFlashAttention, --++++- } --++++- --++++- --++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++- --++++-+ #@dwj --++++-+ # 只遍历激活的专家,而非全部专家 --++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape --++++-- hidden_states = hidden_states.view(-1, hidden_dim) --++++-- # router_logits: (batch * sequence_length, n_experts) --++++-- router_logits = self.gate(hidden_states) --++++-- --++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++-- if self.norm_topk_prob: --++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++-- # we cast back to the input dtype --++++-- routing_weights = routing_weights.to(hidden_states.dtype) --++++-- --++++-- final_hidden_states = ops.zeros( --++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++++-- ) --++++-- --++++-- # One hot encode the selected experts to create an expert mask --++++-- # this will be used to easily index which expert is going to be sollicitated --++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++++-- --++++-- # Loop over all available experts in the model and perform the computation on each expert --++++-- for expert_idx in range(self.num_experts): --++++-- expert_layer = self.experts[expert_idx] --++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++++-- --++++-- # Index the correct hidden states and compute the expert hidden state for --++++-- # the current expert. We need to make sure to multiply the output hidden --++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++++-- if 0 not in idx.shape: --++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++++-- --++++-- # However `index_add_` only support torch tensors for indexing so we'll use --++++-- # the `top_x` tensor here. --++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++++-- --++++-- shared_expert_output = self.shared_expert(hidden_states) --++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++++-- --++++-- final_hidden_states = final_hidden_states + shared_expert_output --++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++-+ num_tokens = hidden_states_reshaped.shape[0] --++++-+ --++++-+ router_logits = self.gate(hidden_states_reshaped) --++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++-+ --++++-+ if self.norm_topk_prob: --++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++-+ routing_weights = routing_weights.to(hidden_states.dtype) --++++-+ --++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++-+ flat_selected_experts = selected_experts.flatten() --++++-+ --++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++-+ token_indices = broadcasted_token_indices.flatten() --++++-+ --++++-+ active_experts = ops.unique(flat_selected_experts) --++++-+ --++++-+ for expert_idx_tensor in active_experts: --++++-+ expert_idx = expert_idx_tensor.item() --++++-+ expert_layer = self.experts[expert_idx] --++++-+ --++++-+ mask = (flat_selected_experts == expert_idx_tensor) --++++-+ selected_token_indices = token_indices[mask] --++++-+ selected_routing_weights = routing_weights.flatten()[mask] --++++-+ --++++-+ current_states = hidden_states_reshaped[selected_token_indices] --++++-+ --++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++-+ --++++-+ final_hidden_states = final_hidden_states.index_add( --++++-+ dim=0, --++++-+ index=selected_token_indices, --++++-+ source=expert_output.to(hidden_states.dtype) --++++-+ ) --++++-+ --++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++- --++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++-- return final_hidden_states, router_logits --++++-+ final_hidden_states = final_hidden_states + shared_expert_output --++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++-+ --++++-+ return final_hidden_states, router_logits --++++- --++++- --++++- class Qwen2MoeDecoderLayer(nn.Module): --++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++++- --++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++- --++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++-+ --++++- if (layer_idx not in config.mlp_only_layers) and ( --++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++++- ): --++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] --++++- _skip_keys_device_placement = "past_key_values" --++++- _supports_cache_class = True --++++-+#lwx --++++-+ # _supports_static_cache = True --++++- --++++- def _init_weights(self, module): --++++- std = self.config.initializer_range --++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++++- return causal_mask --++++- --++++- --++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++- _tied_weights_keys = ["lm_head.weight"] --++++- --++++- def __init__(self, config): --++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++- self.num_experts_per_tok = config.num_experts_per_tok --++++- # Initialize weights and apply final processing --++++- self.post_init() --++++-+ # @lwx --++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++++-+ # self.generation_config.cache_implementation = "static" --++++-+ self._warmed_up = False --++++-+ --++++-+ def warmup_moe_model(self): --++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") --++++-+ test_texts = [ --++++-+ "warmup short", --++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++++-+ ] --++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++-+ if tokenizer is None: --++++-+ from mindnlp.transformers import AutoTokenizer --++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++-+ self._warmup_tokenizer = tokenizer --++++-+ --++++-+ for text in test_texts: --++++-+ inputs = tokenizer(text, return_tensors="ms") --++++-+ with mindspore._no_grad(): --++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) --++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") --++++- --++++- def get_input_embeddings(self): --++++- return self.model.embed_tokens --++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++- ```""" --++++-+ if not self._warmed_up: --++++-+ self._warmed_up = True --++++-+ self.warmup_moe_model() --++++- --++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++++- output_router_logits = ( --++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++- } --++++- ) --++++- return model_inputs --++++-+# @lwx --++++-+ # def _decode_one_tokens_logits( --++++-+ # self, --++++-+ # cur_token: mindspore.Tensor, --++++-+ # input_pos: Optional[mindspore.Tensor], --++++-+ # cache_position: mindspore.Tensor, --++++-+ # past_key_values: StaticCache, --++++-+ # ) -> mindspore.Tensor: --++++-+ # """ --++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++++-+ --++++-+ # Args: --++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++++-+ # input_pos: 输入位置信息,可选 --++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) --++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 --++++-+ --++++-+ # Returns: --++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++++-+ # """ --++++-+ # # 调用JIT编译的版本 --++++-+ # return self.get_decode_one_tokens_logits( --++++-+ # cur_token=cur_token, --++++-+ # input_pos=input_pos, --++++-+ # cache_position=cache_position, --++++-+ # past_key_values=past_key_values, --++++-+ # ) --++++-+ --++++-+ # @mindspore.jit(jit_level='O1') --++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++++-+ # """ --++++-+ # JIT编译的函数,用于高效的单token解码 --++++-+ # 使用JIT编译优化以支持静态shape和高效执行 --++++-+ --++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++++-+ # """ --++++-+ # outputs = self.model.forward( --++++-+ # input_ids=cur_token, --++++-+ # position_ids=input_pos, --++++-+ # cache_position=cache_position, --++++-+ # past_key_values=past_key_values, --++++-+ # use_cache=True, --++++-+ # return_dict=False, --++++-+ # ) --++++-+ --++++-+ # hidden_states = outputs[0] --++++-+ # logits = self.lm_head.forward(hidden_states) --++++-+ # logits = logits.float() --++++-+ --++++-+ # return logits[:, -1, :] --++++-+ --++++-+ # def _sample( --++++-+ # self, --++++-+ # input_ids: mindspore.Tensor, --++++-+ # logits_processor, --++++-+ # stopping_criteria, --++++-+ # generation_config, --++++-+ # synced_devices: bool, --++++-+ # streamer=None, --++++-+ # logits_warper=None, --++++-+ # **model_kwargs, --++++-+ # ): --++++-+ # """ --++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++++-+ # """ --++++-+ # from ...generation.logits_process import LogitsProcessorList --++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList --++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++++-+ # from mindnlp.core import nn, ops, no_grad --++++-+ # import numpy as np --++++-+ --++++-+ # # 检查是否使用 StaticCache --++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++++-+ # # 否则,直接调用父类方法 --++++-+ # past_key_values = model_kwargs.get("past_key_values") --++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++++-+ --++++-+ # if not isinstance(past_key_values, StaticCache): --++++-+ # # 不使用 StaticCache,直接调用父类方法 --++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++++-+ # return super()._sample( --++++-+ # input_ids=input_ids, --++++-+ # logits_processor=logits_processor, --++++-+ # stopping_criteria=stopping_criteria, --++++-+ # generation_config=generation_config, --++++-+ # synced_devices=synced_devices, --++++-+ # streamer=streamer, --++++-+ # logits_warper=logits_warper, --++++-+ # **model_kwargs, --++++-+ # ) --++++-+ --++++-+ # # 使用 StaticCache,进入自定义循环 --++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++++-+ # pad_token_id = generation_config._pad_token_tensor --++++-+ # output_attentions = generation_config.output_attentions --++++-+ # output_hidden_states = generation_config.output_hidden_states --++++-+ # output_scores = generation_config.output_scores --++++-+ # output_logits = generation_config.output_logits --++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate --++++-+ # max_length = generation_config.max_length --++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++++-+ # do_sample = generation_config.do_sample --++++-+ --++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++++-+ # raise ValueError( --++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++++-+ # f"{logits_warper})." --++++-+ # ) --++++-+ --++++-+ # # init attention / hidden states / scores tuples --++++-+ # scores = () if (return_dict_in_generate and output_scores) else None --++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++++-+ --++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: --++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++++-+ # encoder_hidden_states = ( --++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++++-+ # ) --++++-+ --++++-+ # # keep track of which sequences are already finished --++++-+ # batch_size, cur_len = input_ids.shape --++++-+ # this_peer_finished = False --++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++++-+ --++++-+ # time_record = [] --++++-+ # from ....utils.testing_utils import parse_flag_from_env --++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++++-+ --++++-+ # while self._has_unfinished_sequences( --++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++++-+ # ): --++++-+ # if _record_time: --++++-+ # import time as time_module --++++-+ # infer_start = time_module.time() --++++-+ --++++-+ # # prepare model inputs --++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++++-+ --++++-+ # # prepare variable output controls --++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++++-+ --++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++++-+ # cur_cache_position = model_inputs.get("cache_position") --++++-+ # cur_past_key_values = model_inputs.get("past_key_values") --++++-+ # cur_input_ids = model_inputs.get("input_ids") --++++-+ --++++-+ # if (isinstance(cur_past_key_values, StaticCache) and --++++-+ # cur_cache_position is not None and --++++-+ # len(cur_cache_position.shape) > 0 and --++++-+ # cur_cache_position.shape[0] == 1 and --++++-+ # cur_input_ids is not None and --++++-+ # cur_input_ids.shape[1] == 1): --++++-+ # # 使用 JIT 优化的单 token 解码 --++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++++-+ # if not hasattr(self, '_jit_used'): --++++-+ # self._jit_used = False --++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++++-+ --++++-+ # next_token_logits = self.get_decode_one_tokens_logits( --++++-+ # cur_token=cur_input_ids, --++++-+ # input_pos=model_inputs.get("position_ids"), --++++-+ # cache_position=cur_cache_position, --++++-+ # past_key_values=cur_past_key_values, --++++-+ # ) --++++-+ --++++-+ # # 标记已使用JIT(用于后续判断) --++++-+ # if not self._jit_used: --++++-+ # self._jit_used = True --++++-+ --++++-+ # # 构造兼容的输出对象 --++++-+ # class JitOptimizedOutput: --++++-+ # def __init__(self, logits, config): --++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++++-+ # self.config = config --++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 --++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None --++++-+ # self.attentions = None if not config.is_encoder_decoder else None --++++-+ # self.cross_attentions = None --++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None --++++-+ --++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++++-+ # else: --++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++++-+ # outputs = self(**model_inputs, return_dict=True) --++++-+ --++++-+ # if synced_devices and this_peer_finished: --++++-+ # continue --++++-+ --++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++++-+ # next_token_logits = outputs.logits[:, -1, :] --++++-+ --++++-+ # # pre-process distribution --++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) --++++-+ # if do_sample: --++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) --++++-+ --++++-+ # # Store scores, attentions and hidden_states when required --++++-+ # if return_dict_in_generate: --++++-+ # if output_scores: --++++-+ # scores += (next_token_scores,) --++++-+ # if output_logits: --++++-+ # raw_logits += (next_token_logits,) --++++-+ # if output_attentions: --++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) --++++-+ # if self.config.is_encoder_decoder: --++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++++-+ --++++-+ # if output_hidden_states: --++++-+ # hidden = ( --++++-+ # outputs.decoder_hidden_states --++++-+ # if self.config.is_encoder_decoder --++++-+ # else outputs.hidden_states --++++-+ # ) --++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++++-+ --++++-+ # # token selection --++++-+ # if do_sample: --++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++++-+ # else: --++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++++-+ --++++-+ # # finished sentences should have their next token be a padding token --++++-+ # if has_eos_stopping_criteria: --++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++++-+ --++++-+ # # update generated ids, model inputs, and length for next step --++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++++-+ # if streamer is not None: --++++-+ # streamer.put(next_tokens) --++++-+ --++++-+ # model_kwargs = self._update_model_kwargs_for_generation( --++++-+ # outputs, --++++-+ # model_kwargs, --++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, --++++-+ # ) --++++-+ --++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++++-+ # cur_len += 1 --++++-+ --++++-+ # if _record_time: --++++-+ # import time as time_module --++++-+ # infer_stop = time_module.time() --++++-+ # time_record.append(infer_stop - infer_start) --++++-+ --++++-+ # del outputs --++++-+ --++++-+ # average_infer_time = None --++++-+ # if time_record: --++++-+ # if len(time_record) > 1: --++++-+ # time_record.pop(0) --++++-+ # average_infer_time = sum(time_record) / len(time_record) --++++-+ # print(f'average inference time is: {average_infer_time}') --++++-+ # print(f'inference time record: {time_record}') --++++-+ --++++-+ # if streamer is not None: --++++-+ # streamer.end() --++++-+ --++++-+ # # 简单判断:打印是否使用了JIT路径 --++++-+ # if hasattr(self, '_jit_used') and self._jit_used: --++++-+ # print("[JIT] ✓ JIT optimization was used during generation") --++++-+ # else: --++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++++-+ --++++-+ # if return_dict_in_generate: --++++-+ # if self.config.is_encoder_decoder: --++++-+ # return GenerateEncoderDecoderOutput( --++++-+ # sequences=input_ids, --++++-+ # scores=scores, --++++-+ # logits=raw_logits, --++++-+ # encoder_attentions=encoder_attentions, --++++-+ # encoder_hidden_states=encoder_hidden_states, --++++-+ # decoder_attentions=decoder_attentions, --++++-+ # cross_attentions=cross_attentions, --++++-+ # decoder_hidden_states=decoder_hidden_states, --++++-+ # past_key_values=model_kwargs.get("past_key_values"), --++++-+ # average_infer_time=average_infer_time --++++-+ # ) --++++-+ # else: --++++-+ # return GenerateDecoderOnlyOutput( --++++-+ # sequences=input_ids, --++++-+ # scores=scores, --++++-+ # logits=raw_logits, --++++-+ # attentions=decoder_attentions, --++++-+ # hidden_states=decoder_hidden_states, --++++-+ # past_key_values=model_kwargs.get("past_key_values"), --++++-+ # average_infer_time=average_infer_time --++++-+ # ) --++++-+ # else: --++++-+ # return input_ids --++++-+ --++++-+ # def _prepare_cache_for_generation( --++++-+ # self, --++++-+ # generation_config, --++++-+ # model_kwargs, --++++-+ # assistant_model, --++++-+ # batch_size, --++++-+ # max_cache_length, --++++-+ # ): --++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: --++++-+ # generation_config.cache_implementation = "static" --++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++++-+ --++++-+ # if generation_config.cache_implementation == "static": --++++-+ # base_required_from_max_length = generation_config.max_length + 1 --++++-+ # base_required = max(max_cache_length, base_required_from_max_length) --++++-+ # min_cache_size = 50 --++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++++-+ # else: --++++-+ # max_cache_length = max(base_required, min_cache_size) --++++-+ --++++-+ # original_max_cache_length = max_cache_length --++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") --++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") --++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") --++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++++-+ # print(f" - final max_cache_length: {max_cache_length}") --++++-+ --++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++-+ # if max_cache_length > self.config.max_position_embeddings: --++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++++-+ --++++-+ # result = super()._prepare_cache_for_generation( --++++-+ # generation_config=generation_config, --++++-+ # model_kwargs=model_kwargs, --++++-+ # assistant_model=assistant_model, --++++-+ # batch_size=batch_size, --++++-+ # max_cache_length=max_cache_length, --++++-+ # ) --++++-+ --++++-+ # if generation_config.cache_implementation == "static": --++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++++-+ # created_cache = model_kwargs.get(cache_name) --++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++++-+ # if created_cache.max_cache_len < generation_config.max_length: --++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++++-+ --++++-+ # return result --++++-+ --++++-+ --++++-+ --++++- --++++- --++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++++--- --++++-2.27.0 --++++- --++++-- --++++2.27.0 --++++ --+++-- --+++2.27.0 --+++ --++-- --++2.27.0 --++ --+-- --+2.27.0 --+ ---- --2.27.0 -- -diff --git a/patches/0008-moe-change.patch b/patches/0008-moe-change.patch -deleted file mode 100644 -index 349f1429..00000000 ---- a/patches/0008-moe-change.patch -+++ /dev/null -@@ -1,8789 +0,0 @@ --From 45ba3bbc411b64cbffd547fa3d66bce9545639dd Mon Sep 17 00:00:00 2001 --From: Pinoeer-kingxi <13022943007@163.com> --Date: Sun, 9 Nov 2025 00:50:01 +0800 --Subject: [PATCH 8/8] moe change -- ----- -- .../models/deepseek/modeling_deepseek.py | 433 +- -- .../models/qwen2_moe/modeling_qwen2_moe.py | 86 +- -- patches/0001-20251104commit.patch | 2 +- -- patches/0002-20251106commit.patch | 2 +- -- patches/0003-20261106secondcommit.patch | 2 +- -- patches/0004-20251106change.patch | 2 +- -- patches/0005-20251107001commit.patch | 2 +- -- patches/0006-20251107002commit.patch | 2 +- -- patches/0007-20251107003commit.patch | 8034 +++++++++++++++++ -- 9 files changed, 8510 insertions(+), 55 deletions(-) -- create mode 100644 patches/0007-20251107003commit.patch -- --diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --index ff631974..0af29305 100644 ----- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --@@ -19,8 +19,10 @@ -- # limitations under the License. -- """ MindNLP DeepSeek model.""" -- import math --+import time -- import warnings -- from typing import List, Optional, Tuple, Union --+from mindspore import mint -- import mindspore -- from mindnlp.core import nn, ops, no_grad -- from mindnlp.core.nn import functional as F --@@ -54,6 +56,10 @@ logger = logging.get_logger(__name__) -- -- _CONFIG_FOR_DOC = "DeepseekConfig" -- --+Long_Prompt = 1 --+LONG_PROMPT_LENGTH_THRESHOLD = 128 --+SHORT_PROMPT_LENGTH_THRESHOLD = 32 --+ -- _attn_mask_cache = {} -- -- def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --@@ -380,6 +386,8 @@ class MoEGate(nn.Module): -- return topk_idx, topk_weight, aux_loss -- -- --+bincount_op = mindspore.ops.Bincount() --+ -- class DeepseekMoE(nn.Module): -- """ -- A mixed expert module containing shared experts. --@@ -413,7 +421,10 @@ class DeepseekMoE(nn.Module): -- y = y + self.shared_experts(identity) -- return y -- else: --- y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+ if Long_Prompt == 0: --+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+ else: --+ y= self.moe_infer_prefill_fast(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) -- if self.config.n_shared_experts is not None: -- y = y + self.shared_experts(identity) -- return y --@@ -421,7 +432,103 @@ class DeepseekMoE(nn.Module): -- # if self.config.n_shared_experts is not None: -- # y = y + self.shared_experts(identity) -- # return y --- --+ --+ --+ --+ # lwx --+ # def forward(self, x, expert_ids: Optional[mindspore.Tensor] = None): --+ # """ --+ # 如果 expert_ids 为 None,走单专家逻辑; --+ # 如果有,多专家批量处理,保证和原逻辑一致。 --+ # """ --+ # if expert_ids is None: --+ # # 原单专家逻辑 --+ # if self.config.pretraining_tp > 1: --+ # slice = self.intermediate_size // self.config.pretraining_tp --+ # gate_proj_slices = ops.split(self.gate_proj.weight, slice, dim=0) --+ # up_proj_slices = ops.split(self.up_proj.weight, slice, dim=0) --+ # down_proj_slices = ops.split(self.down_proj.weight, slice, dim=1) --+ # gate_proj = ops.cat([F.linear(x, gate_proj_slices[i]) --+ # for i in range(self.config.pretraining_tp)], dim=-1) --+ # up_proj = ops.cat([F.linear(x, up_proj_slices[i]) --+ # for i in range(self.config.pretraining_tp)], dim=-1) --+ # intermediate_states = ops.split((self.act_fn(gate_proj) * up_proj), slice, dim=2) --+ # down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) --+ # for i in range(self.config.pretraining_tp)] --+ # down_proj = sum(down_proj) --+ # else: --+ # down_proj = self.down_proj( --+ # self.act_fn(self.gate_proj(x)) * self.up_proj(x) --+ # ) --+ # return down_proj --+ --+ # # ====== 批量多专家路径 ====== --+ # hidden_size = x.shape[-1] --+ --+ # # 按 token expert_ids 选权重 --+ # gate_weights = self.gate_proj.weight[expert_ids] # shape: [tokens, inter_size] --+ # up_weights = self.up_proj.weight[expert_ids] --+ # down_weights = self.down_proj.weight[expert_ids] --+ --+ # # 注意:pretraining_tp > 1 的分 slice 逻辑仍然要保留 --+ # if self.config.pretraining_tp > 1: --+ # outputs = [] --+ # slice = self.intermediate_size // self.config.pretraining_tp --+ # for i in range(self.config.pretraining_tp): --+ # # 每个 slice 单独计算 --+ # gate_proj_out = F.linear(x, gate_weights[:, i*slice:(i+1)*slice]) --+ # up_proj_out = F.linear(x, up_weights[:, i*slice:(i+1)*slice]) --+ # act_out = self.act_fn(gate_proj_out) * up_proj_out --+ # down_proj_out = F.linear(act_out, down_weights[i*slice:(i+1)*slice, :]) --+ # outputs.append(down_proj_out) --+ # return sum(outputs) --+ # else: --+ # gate_proj_out = F.linear(x, gate_weights) --+ # up_proj_out = F.linear(x, up_weights) --+ # act_out = self.act_fn(gate_proj_out) * up_proj_out --+ # return F.linear(act_out, down_weights) --+ # @no_grad() --+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ # num_tokens = x.shape[0] --+ # hidden_size = x.shape[-1] --+ --+ # idxs = flat_expert_indices.argsort() --+ # sorted_expert_indices = flat_expert_indices[idxs] --+ # sorted_token_indices = idxs // self.num_experts_per_tok --+ # sorted_indices = sorted_token_indices --+ --+ # permuted_tokens = x[sorted_token_indices] --+ # sorted_weights = flat_expert_weights[idxs] --+ --+ # # 一次调用多专家 forward --+ # expert_outputs = ops.zeros_like(permuted_tokens) --+ # expert_outputs = self.mlp_batch_forward(permuted_tokens, sorted_expert_indices) --+ --+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) --+ # try: --+ # final_output = ops.moe_token_unpermute( --+ # expert_outputs, --+ # sorted_indices, --+ # probs=probs, --+ # padded_mode=False --+ # ) --+ # except Exception: --+ # final_output = ops.zeros_like(x) --+ # final_output = mindspore.mint.scatter_add( --+ # final_output, --+ # 0, --+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), --+ # expert_outputs * sorted_weights --+ # ) --+ --+ # return final_output --+ --+ # def mlp_batch_forward(self, tokens, expert_ids): --+ # """ --+ # 使用批量专家 forward(保留精度) --+ # """ --+ # return self.experts[0].forward(tokens, expert_ids) --+ -- # @no_grad() -- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): -- --@@ -434,52 +541,15 @@ class DeepseekMoE(nn.Module): -- # expert_cache += expert_out * weight -- # return expert_cache -- --+ #@dwj -- @no_grad() --- # dwj -- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --- # x 的 shape: (1, hidden_size) --- # flat_expert_indices 的 shape: (num_experts_per_tok,) --- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --- --- # 1. 收集所有需要的专家层 --- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 -- selected_experts = [self.experts[i] for i in flat_expert_indices] --- --- # 2. 并行计算所有专家的输出 --- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --- # ops.cat 会将它们堆叠成一个新的 Tensor --- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) -- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --- --- # 3. 使用矩阵乘法进行加权求和 --- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --- # 最终结果 final_output 的 shape: (1, hidden_size) -- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --- -- return final_output -- -- --- # @no_grad() --- # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --- # expert_cache = ops.zeros_like(x) --- # idxs = flat_expert_indices.argsort() --- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --- # token_idxs = idxs // self.num_experts_per_tok --- --- # for i, end_idx in enumerate(tokens_per_expert): --- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --- # if start_idx == end_idx: --- # continue --- # expert = self.experts[i] --- # exp_token_idx = token_idxs[start_idx:end_idx] --- # expert_tokens = x[exp_token_idx] --- # expert_out = expert(expert_tokens) --- # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --- # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --- --- # return expert_cache --- -- @no_grad() -- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): -- """ --@@ -525,6 +595,264 @@ class DeepseekMoE(nn.Module): -- ) -- -- return expert_cache --+ --+ --+ # @no_grad() --+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ # """ --+ # 优化版 MoE prefill:使用 mindspore.ops.moe_token_unpermute 替代手动 scatter_add --+ # """ --+ # num_tokens = x.shape[0] --+ # hidden_size = x.shape[-1] --+ --+ # # 生成排序后的 token 索引 --+ # idxs = flat_expert_indices.argsort() --+ # sorted_expert_indices = flat_expert_indices[idxs] --+ # sorted_token_indices = idxs // self.num_experts_per_tok --+ --+ # # 记录到 sorted_indices(moe_token_unpermute 用) --+ # sorted_indices = sorted_token_indices # shape: [num_tokens * top_k] --+ --+ # # 收集专家输入 --+ # permuted_tokens = x[sorted_token_indices] --+ --+ # # 执行每个专家的 MLP(批量处理) --+ # expert_outputs = [] --+ # token_ptr = 0 --+ # tokens_per_expert = sorted_expert_indices.bincount() --+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): --+ # if count == 0: --+ # continue --+ # cur_tokens = permuted_tokens[token_ptr:token_ptr+count] --+ # out = self.experts[expert_id](cur_tokens) --+ # expert_outputs.append(out) --+ # token_ptr += count --+ --+ # # 拼接所有专家输出 --+ # permuted_outputs = ops.cat(expert_outputs, axis=0) --+ --+ # # 权重缩放(probs 形状为 [num_tokens, top_k]) --+ # probs = flat_expert_weights.view(num_tokens, self.num_experts_per_tok) --+ --+ # # 直接调用硬件加速的 unpermute --+ # final_output = ops.moe_token_unpermute( --+ # permuted_outputs, # shape: [num_tokens * top_k, hidden_size] --+ # sorted_indices, # shape: [num_tokens * top_k] --+ # probs=probs, # 按概率加权 --+ # padded_mode=False --+ # ) --+ --+ # return final_output --+ --+ # lwx prefill 20251108 --+ @no_grad() --+ def moe_infer_prefill_fast(self, x, flat_expert_indices, flat_expert_weights): --+ """ --+ 高性能 + 数值一致的 MoE prefill 推理: --+ 1. 批量化处理所有专家计算,减少 Python 循环开销 --+ 2. Ascend A2 上使用 ops.moe_token_unpermute 加速 token 恢复 --+ 3. CPU/GPU 上自动 fallback 到 scatter_add 实现 --+ 4. 保证权重和 token 排列顺序与原版本完全一致,避免生成结果 mismatch --+ --+ 参数: --+ x: [num_tokens, hidden_size], --+ MoE 输入的 token 表示 --+ flat_expert_indices: [num_tokens * top_k], --+ 每个 token 的路由专家 id --+ flat_expert_weights: [num_tokens * top_k, 1], --+ 路由专家权重 --+ """ --+ num_tokens = x.shape[0] --+ hidden_size = x.shape[-1] --+ --+ # 1) 排序专家分配(与原 scatter_add 一致的顺序) --+ idxs = flat_expert_indices.argsort() # 排序索引 --+ sorted_expert_indices = flat_expert_indices[idxs] # [num_tokens*top_k] --+ sorted_token_indices = idxs // self.num_experts_per_tok # 原 token ID --+ --+ # sorted_indices 必须与 permuted_tokens 顺序匹配 --+ sorted_indices = sorted_token_indices # 用原 token 位置恢复顺序 --+ --+ # 2) 收集专家输入(按 idxs 排序) --+ permuted_tokens = x[sorted_token_indices] # [num_tokens*top_k, hidden_size] --+ sorted_weights = flat_expert_weights[idxs] # [num_tokens*top_k, 1],确保与 permuted_tokens 对齐 --+ --+ # 3) 计算每个专家的 token 数 --+ tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) --+ --+ # 4) 批量专家计算(减少 Python 循环) --+ gate_weights = ops.stack([expert.gate_proj.weight for expert in self.experts], dim=0) --+ up_weights = ops.stack([expert.up_proj.weight for expert in self.experts], dim=0) --+ down_weights = ops.stack([expert.down_proj.weight for expert in self.experts], dim=0) --+ --+ expert_outputs = ops.zeros_like(permuted_tokens) --+ ptr = 0 --+ for expert_id, count in enumerate(tokens_per_expert.tolist()): --+ if count == 0: --+ continue --+ tokens = permuted_tokens[ptr:ptr+count] # [count, hidden_size] --+ --+ # 与 DeepseekMLP forward 等价 --+ gate_proj_out = F.linear(tokens, gate_weights[expert_id]) --+ up_proj_out = F.linear(tokens, up_weights[expert_id]) --+ act_out = self.experts[expert_id].act_fn(gate_proj_out) * up_proj_out --+ expert_out = F.linear(act_out, down_weights[expert_id]) --+ --+ expert_outputs[ptr:ptr+count] = expert_out --+ ptr += count --+ --+ # 5) Ascend 加速的 unpermute(已排序的权重) --+ probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) # 按排序后的顺序 reshape --+ --+ final_output = ops.zeros_like(x) --+ final_output = mindspore.mint.scatter_add( --+ final_output, --+ 0, --+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), --+ expert_outputs * sorted_weights --+ ) --+ --+ --+ # try: --+ # final_output = ops.moe_token_unpermute( --+ # expert_outputs, # [num_tokens*top_k, hidden_size] --+ # sorted_indices, # [num_tokens*top_k] 原 token id --+ # probs=probs, # 对应权重 --+ # padded_mode=False --+ # ) --+ # except Exception: --+ # # CPU/GPU fallback:用 scatter_add 保证完全一致 --+ # final_output = ops.zeros_like(x) --+ # final_output = mindspore.mint.scatter_add( --+ # final_output, --+ # 0, --+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), --+ # expert_outputs * sorted_weights --+ # ) --+ --+ return final_output --+ --+ --+ # @no_grad() --+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ # num_tokens = x.shape[0] --+ # hidden_size = x.shape[-1] --+ --+ # idxs = flat_expert_indices.argsort() --+ # sorted_expert_indices = flat_expert_indices[idxs] --+ # sorted_token_indices = idxs // self.num_experts_per_tok --+ --+ # # sorted_indices = sorted_token_indices --+ # sorted_indices = sorted_token_indices.astype(mindspore.int32) --+ # permuted_tokens = x[sorted_token_indices] --+ # sorted_weights = flat_expert_weights[idxs] --+ # tokens_per_expert = sorted_expert_indices.bincount(minlength=len(self.experts)) --+ --+ # expert_outputs = ops.zeros_like(permuted_tokens) --+ # ptr = 0 --+ --+ # # 只按专家维度循环 --+ # for expert_id, count in enumerate(tokens_per_expert.tolist()): --+ # if count == 0: --+ # continue --+ # token_slice = slice(ptr, ptr + count) --+ # expert_tokens = permuted_tokens[token_slice] --+ --+ # # 保持原 forward(含 pretraining_tp、bias 等) --+ # expert_out = self.experts[expert_id](expert_tokens) --+ --+ # expert_outputs[token_slice] = expert_out --+ # ptr += count --+ --+ # probs = sorted_weights.view(num_tokens, self.num_experts_per_tok) --+ # try: --+ # final_output = mindspore.ops.moe_token_unpermute( --+ # expert_outputs, --+ # sorted_indices, --+ # probs=probs, --+ # padded_mode=False --+ # ) --+ # except Exception: --+ # final_output = ops.zeros_like(x) --+ # final_output = mindspore.mint.scatter_add( --+ # final_output, --+ # 0, --+ # sorted_token_indices.view(-1, 1).tile((1, hidden_size)), --+ # expert_outputs * sorted_weights --+ # ) --+ --+ # return final_output --+ --+ --+ #lwx --+ # @no_grad() --+ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+ # """ --+ # 并行化 MoE prefill: --+ # - 一次性计算所有专家输出,牺牲显存峰值换取速度 --+ # - 保证结果与原版完全一致 --+ # """ --+ # # 输出缓存 --+ # expert_cache = ops.zeros_like(x) --+ --+ # # token 总数(批量*seq_len*num_experts_per_tok) --+ # num_tokens = flat_expert_indices.shape[0] --+ # hidden_dim = x.shape[-1] --+ --+ # # 原 token ID(idxs // num_experts_per_tok) --+ # token_ids = ops.arange(num_tokens // self.num_experts_per_tok).repeat_interleave(self.num_experts_per_tok) --+ --+ # # ====== Step 1: 组织输入 ====== --+ # # 按 experts 排序,保证 scatter_add 对应位置一致 --+ # sort_ids = flat_expert_indices.argsort() --+ # sorted_experts = flat_expert_indices[sort_ids] --+ # sorted_tokens = token_ids[sort_ids] --+ # sorted_weights = flat_expert_weights[sort_ids] --+ --+ # # 收集每个专家的输入 --+ # # build: expert_inputs[expert_id] = [tokens...] --+ # expert_inputs = [] --+ # expert_outs = [] --+ --+ # for eid in range(self.config.n_routed_experts): --+ # eid_mask = (sorted_experts == eid) --+ # if eid_mask.any(): --+ # tokens_for_eid = x[sorted_tokens[eid_mask]] --+ # expert_inputs.append(tokens_for_eid) --+ # else: --+ # expert_inputs.append(None) --+ --+ # # ====== Step 2: 并行计算所有专家输出 ====== --+ # # 存储所有专家结果到一个列表 --+ # for eid in range(self.config.n_routed_experts): --+ # if expert_inputs[eid] is not None: --+ # out = self.experts[eid](expert_inputs[eid]) --+ # expert_outs.append(out) --+ # else: --+ # expert_outs.append(None) --+ --+ # # ====== Step 3: scatter_add 回写结果 ====== --+ # # 遍历专家,将结果加回对应的 token --+ # pos = 0 --+ # for eid in range(self.config.n_routed_experts): --+ # if expert_outs[eid] is not None: --+ # size = expert_outs[eid].shape[0] --+ # tokens_idx = sorted_tokens[pos:pos+size] --+ # scaled_out = expert_outs[eid] * sorted_weights[pos:pos+size] --+ # pos += size --+ --+ # # scatter_add 到 expert_cache --+ # expert_cache = mindspore.mint.scatter_add( --+ # expert_cache, --+ # dim=0, --+ # index=tokens_idx.view(-1, 1).tile((1, hidden_dim)), --+ # src=scaled_out --+ # ) --+ --+ # return expert_cache --+ --+ --+ -- # 放置在 DeepseekMoE 类中 -- # @no_grad() -- # #lwx 20251107 --@@ -1188,7 +1516,7 @@ class DeepseekDecoderLayer(nn.Module): -- self.hidden_size = config.hidden_size -- -- # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --- # config=config, layer_idx=layer_idx --+ # config=config, layer_idx=layer_idx -- # ) -- -- self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --@@ -1204,6 +1532,7 @@ class DeepseekDecoderLayer(nn.Module): -- ) -- else DeepseekMLP(config) -- ) --+ -- self.input_layernorm = DeepseekRMSNorm( -- config.hidden_size, eps=config.rms_norm_eps -- ) --@@ -1537,6 +1866,28 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): -- def get_decoder(self): -- return self.model -- --+ def generate(self, *args, **kwargs): --+ """ --+ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+ """ --+ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --+ --+ input_ids = kwargs.get("input_ids") --+ if input_ids is None and args: --+ input_ids = args[0] --+ --+ if input_ids is not None: --+ prompt_length = input_ids.shape[1] --+ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --+ Long_Prompt = 2 --+ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --+ Long_Prompt = 0 --+ else: --+ Long_Prompt = 1 --+ --+ --+ return super().generate(*args, **kwargs) -- -- def forward( -- self, --diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --index 913a7609..6566958b 100644 ----- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --@@ -1104,7 +1104,7 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- -- # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- -- @no_grad() --- def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- original_dtype = hidden_states.dtype -- batch_size, _ = hidden_states.shape -- expert_outputs_list = [ --@@ -1119,8 +1119,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) -- return moe_output_fp32.squeeze(1).to(original_dtype) -- --+ -- # @no_grad() --- # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ # def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- # num_tokens, _ = hidden_states.shape -- # flat_selected_experts = selected_experts.flatten() -- # sorted_expert_indices = flat_selected_experts.argsort() --@@ -1142,8 +1143,9 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- # current_token_offset += expert_token_count -- # return moe_output -- --+ # baseline -- @no_grad() --- def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- """ -- 优化版 MoE prefill (速度优先模式): -- - 批量张量化处理同一个 expert 的所有 token --@@ -1184,7 +1186,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- return moe_output -- -- --+ @no_grad() --+ def _moe_infer_prefill_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+ """ --+ 优化版 MoE prefill (速度优先模式) - 连续切片 & 单次 scatter_add --+ 逻辑: --+ 1. 按 expert 排序,将同一 expert 的 token 放在连续内存中 --+ 2. 每个 expert 一次性处理其全部 token --+ 3. 最后一次 scatter_add 回到原 token 顺序 --+ """ --+ --+ num_tokens = hidden_states.shape[0] --+ hidden_size = hidden_states.shape[-1] --+ --+ # 展平为一维 --+ flat_selected_experts = selected_experts.flatten() # [num_tokens * top_k] --+ flat_routing_weights = routing_weights.flatten() # [num_tokens * top_k] --+ --+ # 按 expert 排序 --+ idxs = flat_selected_experts.argsort() --+ sorted_expert_indices = flat_selected_experts[idxs] # expert ID 排序后 --+ sorted_token_indices = idxs // self.top_k # 对应原 token ID --+ --+ # 排好序的输入向量(连续内存) --+ permuted_tokens = hidden_states[sorted_token_indices] --+ --+ # 排好序的权重 --+ sorted_weights = flat_routing_weights[idxs] --+ --+ # 每个 expert 对应的 token 数量 --+ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --+ --+ # 存放专家输出(与 permuted_tokens 对应顺序保持一致) --+ expert_outputs = ops.zeros_like(permuted_tokens) --+ --+ ptr = 0 # 指向当前切片的起点 --+ for expert_id, count in enumerate(tokens_per_expert.tolist()): --+ if count == 0: --+ continue --+ --+ token_slice = slice(ptr, ptr + count) --+ expert_tokens = permuted_tokens[token_slice] # 连续切片 --+ --+ # 执行专家 MLP --+ expert_out = self.experts[expert_id](expert_tokens) --+ --+ expert_outputs[token_slice] = expert_out --+ ptr += count --+ --+ # 按权重缩放 --+ scaled_outputs = expert_outputs * sorted_weights.unsqueeze(1) --+ --+ # 回写到原 token 顺序 (单次 scatter_add) --+ moe_output = mindspore.mint.scatter_add( --+ ops.zeros_like(hidden_states), --+ 0, --+ sorted_token_indices.view(-1, 1).tile((1, hidden_size)), --+ scaled_outputs --+ ) --+ --+ return moe_output --+ --+ --+ -- # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+ -- @no_grad() -- def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: -- moe_output = ops.zeros_like(hidden_states) --@@ -1225,16 +1291,20 @@ class Qwen2MoeSparseMoeBlock(nn.Module): -- # # --- 速度优先模式 (SPEED MODE) --- -- # routing_weights_casted = routing_weights.to(hidden_states.dtype) -- # if sequence_length == 1: --- # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -- # else: --- # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) -- -- routing_weights_casted = routing_weights.to(hidden_states.dtype) -- if sequence_length == 1: --- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) -- else: --- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --- --+ # if Long_Prompt == 1: --+ # moe_output = self._moe_infer_prefill_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ # else: --+ # moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+ -- -- # 3. 共享专家计算与合并 (所有模式通用) -- gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --index c9c8c5ee..513dd40b 100644 ----- a/patches/0001-20251104commit.patch --+++ b/patches/0001-20251104commit.patch --@@ -1,7 +1,7 @@ -- From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Tue, 4 Nov 2025 09:11:51 +0800 ---Subject: [PATCH 1/6] 20251104commit --+Subject: [PATCH 1/7] 20251104commit -- -- --- -- mindnlp/transformers/cache_utils.py | 28 +- --diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --index 625656eb..41081b85 100644 ----- a/patches/0002-20251106commit.patch --+++ b/patches/0002-20251106commit.patch --@@ -1,7 +1,7 @@ -- From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 09:20:38 +0800 ---Subject: [PATCH 2/6] 20251106commit --+Subject: [PATCH 2/7] 20251106commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 379 ++++- --diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --index dcb85080..c1392569 100644 ----- a/patches/0003-20261106secondcommit.patch --+++ b/patches/0003-20261106secondcommit.patch --@@ -1,7 +1,7 @@ -- From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 14:54:37 +0800 ---Subject: [PATCH 3/6] 20261106secondcommit --+Subject: [PATCH 3/7] 20261106secondcommit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 217 ++- --diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --index bbed13cc..e548b1b2 100644 ----- a/patches/0004-20251106change.patch --+++ b/patches/0004-20251106change.patch --@@ -1,7 +1,7 @@ -- From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Thu, 6 Nov 2025 15:48:09 +0800 ---Subject: [PATCH 4/6] 20251106change --+Subject: [PATCH 4/7] 20251106change -- -- --- -- .../models/deepseek/modeling_deepseek.py | 189 +- --diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --index b2d1035c..bf224d2a 100644 ----- a/patches/0005-20251107001commit.patch --+++ b/patches/0005-20251107001commit.patch --@@ -1,7 +1,7 @@ -- From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Fri, 7 Nov 2025 11:48:18 +0800 ---Subject: [PATCH 5/6] 20251107001commit --+Subject: [PATCH 5/7] 20251107001commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 91 +- --diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch --index bffa134e..1bd306b9 100644 ----- a/patches/0006-20251107002commit.patch --+++ b/patches/0006-20251107002commit.patch --@@ -1,7 +1,7 @@ -- From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 -- From: Pinoeer-kingxi <13022943007@163.com> -- Date: Fri, 7 Nov 2025 12:06:32 +0800 ---Subject: [PATCH 6/6] 20251107002commit --+Subject: [PATCH 6/7] 20251107002commit -- -- --- -- .../models/deepseek/modeling_deepseek.py | 122 +- --diff --git a/patches/0007-20251107003commit.patch b/patches/0007-20251107003commit.patch --new file mode 100644 --index 00000000..ce558554 ----- /dev/null --+++ b/patches/0007-20251107003commit.patch --@@ -0,0 +1,8034 @@ --+From cee579410530fa9fad61cd1b8a2c5cb8eb2d71f7 Mon Sep 17 00:00:00 2001 --+From: Pinoeer-kingxi <13022943007@163.com> --+Date: Fri, 7 Nov 2025 12:12:51 +0800 --+Subject: [PATCH 7/7] 20251107003commit --+ --+--- --+ .../models/deepseek/modeling_deepseek.py | 2 +- --+ patches/0001-20251104commit.patch | 2 +- --+ patches/0002-20251106commit.patch | 2 +- --+ patches/0003-20261106secondcommit.patch | 2 +- --+ patches/0004-20251106change.patch | 2 +- --+ patches/0005-20251107001commit.patch | 2 +- --+ patches/0006-20251107002commit.patch | 7931 +++++++++++++++++ --+ 7 files changed, 7937 insertions(+), 6 deletions(-) --+ create mode 100644 patches/0006-20251107002commit.patch --+ --+diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+index e7e1c053..ff631974 100644 --+--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+@@ -435,7 +435,7 @@ class DeepseekMoE(nn.Module): --+ # return expert_cache --+ --+ @no_grad() --+- dwj --++ # dwj --+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+ # x 的 shape: (1, hidden_size) --+ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+index 2842180e..c9c8c5ee 100644 --+--- a/patches/0001-20251104commit.patch --++++ b/patches/0001-20251104commit.patch --+@@ -1,7 +1,7 @@ --+ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Tue, 4 Nov 2025 09:11:51 +0800 --+-Subject: [PATCH 1/5] 20251104commit --++Subject: [PATCH 1/6] 20251104commit --+ --+ --- --+ mindnlp/transformers/cache_utils.py | 28 +- --+diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+index c6cd8757..625656eb 100644 --+--- a/patches/0002-20251106commit.patch --++++ b/patches/0002-20251106commit.patch --+@@ -1,7 +1,7 @@ --+ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 09:20:38 +0800 --+-Subject: [PATCH 2/5] 20251106commit --++Subject: [PATCH 2/6] 20251106commit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+index 601960c9..dcb85080 100644 --+--- a/patches/0003-20261106secondcommit.patch --++++ b/patches/0003-20261106secondcommit.patch --+@@ -1,7 +1,7 @@ --+ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 14:54:37 +0800 --+-Subject: [PATCH 3/5] 20261106secondcommit --++Subject: [PATCH 3/6] 20261106secondcommit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 217 ++- --+diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --+index 8976f10b..bbed13cc 100644 --+--- a/patches/0004-20251106change.patch --++++ b/patches/0004-20251106change.patch --+@@ -1,7 +1,7 @@ --+ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Thu, 6 Nov 2025 15:48:09 +0800 --+-Subject: [PATCH 4/5] 20251106change --++Subject: [PATCH 4/6] 20251106change --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 189 +- --+diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --+index 8d9032be..b2d1035c 100644 --+--- a/patches/0005-20251107001commit.patch --++++ b/patches/0005-20251107001commit.patch --+@@ -1,7 +1,7 @@ --+ From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 --+ From: Pinoeer-kingxi <13022943007@163.com> --+ Date: Fri, 7 Nov 2025 11:48:18 +0800 --+-Subject: [PATCH 5/5] 20251107001commit --++Subject: [PATCH 5/6] 20251107001commit --+ --+ --- --+ .../models/deepseek/modeling_deepseek.py | 91 +- --+diff --git a/patches/0006-20251107002commit.patch b/patches/0006-20251107002commit.patch --+new file mode 100644 --+index 00000000..bffa134e --+--- /dev/null --++++ b/patches/0006-20251107002commit.patch --+@@ -0,0 +1,7931 @@ --++From 5914e3e59151bf5f44089d83c508b03132e7bb60 Mon Sep 17 00:00:00 2001 --++From: Pinoeer-kingxi <13022943007@163.com> --++Date: Fri, 7 Nov 2025 12:06:32 +0800 --++Subject: [PATCH 6/6] 20251107002commit --++ --++--- --++ .../models/deepseek/modeling_deepseek.py | 122 +- --++ patches/0001-20251104commit.patch | 2 +- --++ patches/0002-20251106commit.patch | 2 +- --++ patches/0003-20261106secondcommit.patch | 2 +- --++ patches/0004-20251106change.patch | 2 +- --++ patches/0005-20251107001commit.patch | 7707 +++++++++++++++++ --++ 6 files changed, 7773 insertions(+), 64 deletions(-) --++ create mode 100644 patches/0005-20251107001commit.patch --++ --++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++index 8831e4b7..e7e1c053 100644 --++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++@@ -433,30 +433,31 @@ class DeepseekMoE(nn.Module): --++ # expert_out = expert(x) --++ # expert_cache += expert_out * weight --++ # return expert_cache --++- --++- # @no_grad() --++- # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++- # # x 的 shape: (1, hidden_size) --++- # # flat_expert_indices 的 shape: (num_experts_per_tok,) --++- # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++- --++- # # 1. 收集所有需要的专家层 --++- # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++- # selected_experts = [self.experts[i] for i in flat_expert_indices] --++- --++- # # 2. 并行计算所有专家的输出 --++- # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++- # # ops.cat 会将它们堆叠成一个新的 Tensor --++- # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++- # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++- --++- # # 3. 使用矩阵乘法进行加权求和 --++- # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++- # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++- # # 最终结果 final_output 的 shape: (1, hidden_size) --++- # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+++ --+++ @no_grad() --+++ dwj --+++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ # x 的 shape: (1, hidden_size) --+++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --+++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+++ --+++ # 1. 收集所有需要的专家层 --+++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+++ selected_experts = [self.experts[i] for i in flat_expert_indices] --+++ --+++ # 2. 并行计算所有专家的输出 --+++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+++ # ops.cat 会将它们堆叠成一个新的 Tensor --+++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+++ --+++ # 3. 使用矩阵乘法进行加权求和 --+++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++ # 最终结果 final_output 的 shape: (1, hidden_size) --+++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++ --++- # return final_output --+++ return final_output --++ --++ --++ # @no_grad() --++@@ -525,50 +526,51 @@ class DeepseekMoE(nn.Module): --++ --++ return expert_cache --++ # 放置在 DeepseekMoE 类中 --++- @no_grad() --++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++- """ --++- 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --++- --++- Args: --++- x (Tensor): 输入张量, shape: (1, hidden_size) --++- flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --++- flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --++- """ --++- top_k, _ = flat_expert_weights.shape --++- hidden_size = x.shape[-1] --++- --++- # 1. 将所有专家的权重堆叠起来 --++- stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --++- stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --++- stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --+++ # @no_grad() --+++ # #lwx 20251107 --+++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++ # """ --+++ # 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --+++ --+++ # Args: --+++ # x (Tensor): 输入张量, shape: (1, hidden_size) --+++ # flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --+++ # flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --+++ # """ --+++ # top_k, _ = flat_expert_weights.shape --+++ # hidden_size = x.shape[-1] --+++ --+++ # # 1. 将所有专家的权重堆叠起来 --+++ # stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --+++ # stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --+++ # stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --++ --++- # 2. "收集" 所需的专家权重 --++- selected_gate_w = stacked_gate_w[flat_expert_indices] --++- selected_up_w = stacked_up_w[flat_expert_indices] --++- selected_down_w = stacked_down_w[flat_expert_indices] --+++ # # 2. "收集" 所需的专家权重 --+++ # selected_gate_w = stacked_gate_w[flat_expert_indices] --+++ # selected_up_w = stacked_up_w[flat_expert_indices] --+++ # selected_down_w = stacked_down_w[flat_expert_indices] --++ --++- # 3. 准备输入 --++- x_expanded = x.expand((top_k, 1, hidden_size)) --+++ # # 3. 准备输入 --+++ # x_expanded = x.expand((top_k, 1, hidden_size)) --++ --++- # 4. 并行计算 gate_proj 和 up_proj --++- gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --++- up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --+++ # # 4. 并行计算 gate_proj 和 up_proj --+++ # gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --+++ # up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --++ --++- # 5. 计算中间状态 --++- intermediate_states = self.experts[0].act_fn(gate_out) * up_out --+++ # # 5. 计算中间状态 --+++ # intermediate_states = self.experts[0].act_fn(gate_out) * up_out --++ --++- # 6. 并行计算 down_proj --++- # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --++- # --- [FIX] --- --++- # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --++- expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --++- # --- [FIX END] --- --+++ # # 6. 并行计算 down_proj --+++ # # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --+++ # # --- [FIX] --- --+++ # # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --+++ # expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --+++ # # --- [FIX END] --- --++ --++- # 7. 根据路由权重进行加权求和 --++- weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --+++ # # 7. 根据路由权重进行加权求和 --+++ # weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --++ --++- return weighted_sum --+++ # return weighted_sum --++ --++ --++ --++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++index 0a0ef2d7..2842180e 100644 --++--- a/patches/0001-20251104commit.patch --+++++ b/patches/0001-20251104commit.patch --++@@ -1,7 +1,7 @@ --++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Tue, 4 Nov 2025 09:11:51 +0800 --++-Subject: [PATCH 1/4] 20251104commit --+++Subject: [PATCH 1/5] 20251104commit --++ --++ --- --++ mindnlp/transformers/cache_utils.py | 28 +- --++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --++index 5185270c..c6cd8757 100644 --++--- a/patches/0002-20251106commit.patch --+++++ b/patches/0002-20251106commit.patch --++@@ -1,7 +1,7 @@ --++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Thu, 6 Nov 2025 09:20:38 +0800 --++-Subject: [PATCH 2/4] 20251106commit --+++Subject: [PATCH 2/5] 20251106commit --++ --++ --- --++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --++index 3e05f821..601960c9 100644 --++--- a/patches/0003-20261106secondcommit.patch --+++++ b/patches/0003-20261106secondcommit.patch --++@@ -1,7 +1,7 @@ --++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Thu, 6 Nov 2025 14:54:37 +0800 --++-Subject: [PATCH 3/4] 20261106secondcommit --+++Subject: [PATCH 3/5] 20261106secondcommit --++ --++ --- --++ .../models/deepseek/modeling_deepseek.py | 217 ++- --++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --++index 88a1aef4..8976f10b 100644 --++--- a/patches/0004-20251106change.patch --+++++ b/patches/0004-20251106change.patch --++@@ -1,7 +1,7 @@ --++ From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --++ From: Pinoeer-kingxi <13022943007@163.com> --++ Date: Thu, 6 Nov 2025 15:48:09 +0800 --++-Subject: [PATCH 4/4] 20251106change --+++Subject: [PATCH 4/5] 20251106change --++ --++ --- --++ .../models/deepseek/modeling_deepseek.py | 189 +- --++diff --git a/patches/0005-20251107001commit.patch b/patches/0005-20251107001commit.patch --++new file mode 100644 --++index 00000000..8d9032be --++--- /dev/null --+++++ b/patches/0005-20251107001commit.patch --++@@ -0,0 +1,7707 @@ --+++From 0aff56c2ef51374ba385ca0965ff57053447db54 Mon Sep 17 00:00:00 2001 --+++From: Pinoeer-kingxi <13022943007@163.com> --+++Date: Fri, 7 Nov 2025 11:48:18 +0800 --+++Subject: [PATCH 5/5] 20251107001commit --+++ --+++--- --+++ .../models/deepseek/modeling_deepseek.py | 91 +- --+++ .../models/qwen2_moe/modeling_qwen2_moe.py | 6 +- --+++ .../models/qwen2_vl/modeling_qwen2_vl.py | 6 +- --+++ patches/0001-20251104commit.patch | 2 +- --+++ patches/0002-20251106commit.patch | 2 +- --+++ patches/0003-20261106secondcommit.patch | 2 +- --+++ patches/0004-20251106change.patch | 7498 +++++++++++++++++ --+++ 7 files changed, 7577 insertions(+), 30 deletions(-) --+++ create mode 100644 patches/0004-20251106change.patch --+++ --+++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++index 0546f318..8831e4b7 100644 --+++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++@@ -434,29 +434,29 @@ class DeepseekMoE(nn.Module): --+++ # expert_cache += expert_out * weight --+++ # return expert_cache --+++ --+++- @no_grad() --+++- def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++- # x 的 shape: (1, hidden_size) --+++- # flat_expert_indices 的 shape: (num_experts_per_tok,) --+++- # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --+++- --+++- # 1. 收集所有需要的专家层 --+++- # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --+++- selected_experts = [self.experts[i] for i in flat_expert_indices] --+++- --+++- # 2. 并行计算所有专家的输出 --+++- # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --+++- # ops.cat 会将它们堆叠成一个新的 Tensor --+++- # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++- expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --+++- --+++- # 3. 使用矩阵乘法进行加权求和 --+++- # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --+++- # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --+++- # 最终结果 final_output 的 shape: (1, hidden_size) --+++- final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++++ # @no_grad() --++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ # # x 的 shape: (1, hidden_size) --++++ # # flat_expert_indices 的 shape: (num_experts_per_tok,) --++++ # # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++++ --++++ # # 1. 收集所有需要的专家层 --++++ # # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++++ # selected_experts = [self.experts[i] for i in flat_expert_indices] --++++ --++++ # # 2. 并行计算所有专家的输出 --++++ # # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++++ # # ops.cat 会将它们堆叠成一个新的 Tensor --++++ # # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++ # expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++++ --++++ # # 3. 使用矩阵乘法进行加权求和 --++++ # # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++++ # # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++ # # 最终结果 final_output 的 shape: (1, hidden_size) --++++ # final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --+++ --+++- return final_output --++++ # return final_output --+++ --+++ --+++ # @no_grad() --+++@@ -524,6 +524,53 @@ class DeepseekMoE(nn.Module): --+++ ) --+++ --+++ return expert_cache --++++# 放置在 DeepseekMoE 类中 --++++ @no_grad() --++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++ """ --++++ 优化版 MoE decode:使用批量矩阵乘法 (bmm) 并行处理所有专家。 --++++ --++++ Args: --++++ x (Tensor): 输入张量, shape: (1, hidden_size) --++++ flat_expert_indices (Tensor): 选中的专家索引, shape: (num_experts_per_tok,) --++++ flat_expert_weights (Tensor): 专家的权重, shape: (num_experts_per_tok, 1) --++++ """ --++++ top_k, _ = flat_expert_weights.shape --++++ hidden_size = x.shape[-1] --++++ --++++ # 1. 将所有专家的权重堆叠起来 --++++ stacked_gate_w = ops.stack([expert.gate_proj.weight for expert in self.experts]) --++++ stacked_up_w = ops.stack([expert.up_proj.weight for expert in self.experts]) --++++ stacked_down_w = ops.stack([expert.down_proj.weight for expert in self.experts]) --++++ --++++ # 2. "收集" 所需的专家权重 --++++ selected_gate_w = stacked_gate_w[flat_expert_indices] --++++ selected_up_w = stacked_up_w[flat_expert_indices] --++++ selected_down_w = stacked_down_w[flat_expert_indices] --++++ --++++ # 3. 准备输入 --++++ x_expanded = x.expand((top_k, 1, hidden_size)) --++++ --++++ # 4. 并行计算 gate_proj 和 up_proj --++++ gate_out = ops.bmm(x_expanded, selected_gate_w.transpose(0, 2, 1)) --++++ up_out = ops.bmm(x_expanded, selected_up_w.transpose(0, 2, 1)) --++++ --++++ # 5. 计算中间状态 --++++ intermediate_states = self.experts[0].act_fn(gate_out) * up_out --++++ --++++ # 6. 并行计算 down_proj --++++ # (top_k, 1, I) @ (top_k, I, H) -> (top_k, 1, H) --++++ # --- [FIX] --- --++++ # 对 down_proj 的权重进行转置以匹配矩阵乘法维度 --++++ expert_outputs = ops.bmm(intermediate_states, selected_down_w.transpose(0, 2, 1)) --++++ # --- [FIX END] --- --++++ --++++ # 7. 根据路由权重进行加权求和 --++++ weighted_sum = (expert_outputs * flat_expert_weights.unsqueeze(-1)).sum(axis=0) --++++ --++++ return weighted_sum --++++ --++++ --+++ --+++ # @no_grad() --+++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++index ebd7782e..913a7609 100644 --+++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++@@ -279,10 +279,10 @@ class Qwen2MoeRotaryEmbedding(nn.Module): --+++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++- x1 = x[..., : x.shape[-1] // 2] --+++- x2 = x[..., x.shape[-1] // 2 :] --++++ # x1 = x[..., : x.shape[-1] // 2] --++++ # x2 = x[..., x.shape[-1] // 2 :] --+++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++- # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++diff --git a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+++index d059dcbe..2b217b64 100644 --+++--- a/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --++++++ b/mindnlp/transformers/models/qwen2_vl/modeling_qwen2_vl.py --+++@@ -176,8 +176,10 @@ class Qwen2VLRotaryEmbedding(nn.Module): --+++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++ def rotate_half(x): --+++ """Rotates half the hidden dims of the input.""" --+++- x1 = x[..., : x.shape[-1] // 2] --+++- x2 = x[..., x.shape[-1] // 2 :] --++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++ # x1 = x[..., : x.shape[-1] // 2] --++++ # x2 = x[..., x.shape[-1] // 2 :] --++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++ return ops.cat((-x2, x1), dim=-1) --+++ --+++ --+++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++index 78f22642..0a0ef2d7 100644 --+++--- a/patches/0001-20251104commit.patch --++++++ b/patches/0001-20251104commit.patch --+++@@ -1,7 +1,7 @@ --+++ From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++ From: Pinoeer-kingxi <13022943007@163.com> --+++ Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++-Subject: [PATCH 1/3] 20251104commit --++++Subject: [PATCH 1/4] 20251104commit --+++ --+++ --- --+++ mindnlp/transformers/cache_utils.py | 28 +- --+++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --+++index 22b65dd5..5185270c 100644 --+++--- a/patches/0002-20251106commit.patch --++++++ b/patches/0002-20251106commit.patch --+++@@ -1,7 +1,7 @@ --+++ From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+++ From: Pinoeer-kingxi <13022943007@163.com> --+++ Date: Thu, 6 Nov 2025 09:20:38 +0800 --+++-Subject: [PATCH 2/3] 20251106commit --++++Subject: [PATCH 2/4] 20251106commit --+++ --+++ --- --+++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --+++index 966529e4..3e05f821 100644 --+++--- a/patches/0003-20261106secondcommit.patch --++++++ b/patches/0003-20261106secondcommit.patch --+++@@ -1,7 +1,7 @@ --+++ From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+++ From: Pinoeer-kingxi <13022943007@163.com> --+++ Date: Thu, 6 Nov 2025 14:54:37 +0800 --+++-Subject: [PATCH 3/3] 20261106secondcommit --++++Subject: [PATCH 3/4] 20261106secondcommit --+++ --+++ --- --+++ .../models/deepseek/modeling_deepseek.py | 217 ++- --+++diff --git a/patches/0004-20251106change.patch b/patches/0004-20251106change.patch --+++new file mode 100644 --+++index 00000000..88a1aef4 --+++--- /dev/null --++++++ b/patches/0004-20251106change.patch --+++@@ -0,0 +1,7498 @@ --++++From 04a0154934c483b9f42d997f28c0420c9b50ead6 Mon Sep 17 00:00:00 2001 --++++From: Pinoeer-kingxi <13022943007@163.com> --++++Date: Thu, 6 Nov 2025 15:48:09 +0800 --++++Subject: [PATCH 4/4] 20251106change --++++ --++++--- --++++ .../models/deepseek/modeling_deepseek.py | 189 +- --++++ patches/0001-20251104commit.patch | 1272 +++++++ --++++ patches/0002-20251106commit.patch | 3200 +++++++++++++++++ --++++ patches/0003-20261106secondcommit.patch | 2769 ++++++++++++++ --++++ 4 files changed, 7244 insertions(+), 186 deletions(-) --++++ create mode 100644 patches/0001-20251104commit.patch --++++ create mode 100644 patches/0002-20251106commit.patch --++++ create mode 100644 patches/0003-20261106secondcommit.patch --++++ --++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++index 2f9192bf..0546f318 100644 --++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++@@ -968,168 +968,6 @@ class DeepseekAttention(nn.Module): --++++ --++++ return attn_output, attn_weights, past_key_value --++++ --++++-# class DeepseekFlashAttention(nn.Module): --++++-# """ --++++-# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++++-# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --++++- --++++-# This class is designed as a drop-in replacement for DeepseekAttention. --++++-# """ --++++- --++++-# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++++-# super().__init__() --++++-# self.config = config --++++-# self.layer_idx = layer_idx --++++-# if layer_idx is None: --++++-# logger.warning( --++++-# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++-# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++-# "when creating this class." --++++-# ) --++++- --++++-# self.attention_dropout = config.attention_dropout --++++-# self.hidden_size = config.hidden_size --++++-# self.num_heads = config.num_attention_heads --++++-# self.head_dim = self.hidden_size // self.num_heads --++++-# self.num_key_value_heads = config.num_key_value_heads --++++-# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++-# self.max_position_embeddings = config.max_position_embeddings --++++-# self.rope_theta = config.rope_theta --++++-# self.is_causal = True --++++- --++++-# if (self.head_dim * self.num_heads) != self.hidden_size: --++++-# raise ValueError( --++++-# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++-# f" and `num_heads`: {self.num_heads})." --++++-# ) --++++- --++++-# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++++-# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++-# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++-# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++++-# self._init_rope() --++++- --++++-# def _init_rope(self): --++++-# if self.config.rope_scaling is None: --++++-# self.rotary_emb = DeepseekRotaryEmbedding( --++++-# self.head_dim, --++++-# max_position_embeddings=self.max_position_embeddings, --++++-# base=self.rope_theta, --++++-# ) --++++-# else: --++++-# scaling_type = self.config.rope_scaling["type"] --++++-# scaling_factor = self.config.rope_scaling["factor"] --++++-# if scaling_type == "linear": --++++-# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++++-# self.head_dim, --++++-# max_position_embeddings=self.max_position_embeddings, --++++-# scaling_factor=scaling_factor, --++++-# base=self.rope_theta, --++++-# ) --++++-# elif scaling_type == "dynamic": --++++-# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++++-# self.head_dim, --++++-# max_position_embeddings=self.max_position_embeddings, --++++-# scaling_factor=scaling_factor, --++++-# base=self.rope_theta, --++++-# ) --++++-# else: --++++-# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++++- --++++-# def forward( --++++-# self, --++++-# hidden_states: mindspore.Tensor, --++++-# attention_mask: Optional[mindspore.Tensor] = None, --++++-# position_ids: Optional[mindspore.Tensor] = None, --++++-# past_key_value: Optional[Cache] = None, --++++-# output_attentions: bool = False, --++++-# use_cache: bool = False, --++++-# **kwargs, --++++-# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++-# if "padding_mask" in kwargs: --++++-# warnings.warn( --++++-# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++++-# ) --++++- --++++-# if output_attentions: --++++-# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --++++- --++++-# bsz, q_len, _ = hidden_states.shape --++++- --++++-# if self.config.pretraining_tp > 1: --++++-# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++++- --++++-# query_states = self.q_proj(hidden_states) --++++-# key_states = self.k_proj(hidden_states) --++++-# value_states = self.v_proj(hidden_states) --++++- --++++-# # Reshape for multi-head attention --++++-# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++-# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++- --++++-# kv_seq_len = key_states.shape[-2] --++++-# if past_key_value is not None: --++++-# if self.layer_idx is None: --++++-# raise ValueError( --++++-# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++-# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++-# "with a layer index." --++++-# ) --++++-# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++- --++++-# # Apply Rotary Positional Embedding --++++-# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++-# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++- --++++-# if past_key_value is not None: --++++-# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --++++-# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++- --++++-# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --++++-# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --++++-# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++- --++++-# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++-# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++- --++++-# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++-# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++- --++++-# # Convert attention_mask for flash_attention_score --++++-# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --++++-# if attention_mask is not None: --++++-# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --++++-# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++++-# raise ValueError( --++++-# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++++-# ) --++++-# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --++++-# else: --++++-# attn_mask_for_fa = None --++++- --++++-# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++++- --++++-# # Call the fused flash_attention_score operator --++++-# attn_output = mindspore.ops.flash_attention_score( --++++-# query=query_states_for_fa, --++++-# key=key_states_for_fa, --++++-# value=value_states_for_fa, --++++-# head_num=self.num_heads, # This is N1, the number of query heads --++++-# input_layout='BSH', --++++-# attn_mask=attn_mask_for_fa, --++++-# keep_prob=keep_prob, --++++-# scalar_value=1.0 / math.sqrt(self.head_dim), --++++-# sparse_mode=0 # Default mask mode --++++-# ) --++++- --++++-# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --++++-# attn_output = self.o_proj(attn_output) --++++- --++++-# # Flash Attention does not return attention weights --++++-# attn_weights = None --++++- --++++-# return attn_output, attn_weights, past_key_value --++++ --++++ class DeepseekFlashAttention(nn.Module): --++++ """ --++++@@ -1300,9 +1138,9 @@ class DeepseekDecoderLayer(nn.Module): --++++ super().__init__() --++++ self.hidden_size = config.hidden_size --++++ --++++- self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --++++- config=config, layer_idx=layer_idx --++++- ) --+++++ # self.self_attn = Deepseek_ATTENTION_CLASSES[config._attn_implementation]( --+++++ # config=config, layer_idx=layer_idx --+++++ # ) --++++ --++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --++++ config=config, layer_idx=layer_idx --++++@@ -1387,7 +1225,6 @@ class DeepseekDecoderLayer(nn.Module): --++++ return outputs --++++ --++++ --++++- --++++ class DeepseekPreTrainedModel(PreTrainedModel): --++++ config_class = DeepseekConfig --++++ base_model_prefix = "model" --++++@@ -1613,26 +1450,6 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++ # Initialize weights and apply final processing --++++ self.post_init() --++++ self.warm_up = False --++++- #@dwj --++++- self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --++++- self.num_layers, --++++- self.num_attention_heads, --++++- self.head_dim, --++++- batch_size=1, --++++- max_length=self.max_length, --++++- dtype=mindspore.float16 --++++- ) --++++- --++++- def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --++++- key_cache = [] --++++- value_cache = [] --++++- for _ in range(num_layers): --++++- k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++- v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++- key_cache.append(k) --++++- value_cache.append(v) --++++- return key_cache, value_cache --++++- --++++ --++++ def warmup_moe_model_deep(self): --++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --++++new file mode 100644 --++++index 00000000..78f22642 --++++--- /dev/null --+++++++ b/patches/0001-20251104commit.patch --++++@@ -0,0 +1,1272 @@ --+++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++++From: Pinoeer-kingxi <13022943007@163.com> --+++++Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++++Subject: [PATCH 1/3] 20251104commit --+++++ --+++++--- --+++++ mindnlp/transformers/cache_utils.py | 28 +- --+++++ .../models/deepseek/modeling_deepseek.py | 149 ++- --+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++++ 3 files changed, 976 insertions(+), 87 deletions(-) --+++++ --+++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++++index cadd2e04..02f8d4be 100644 --+++++--- a/mindnlp/transformers/cache_utils.py --++++++++ b/mindnlp/transformers/cache_utils.py --+++++@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++++ # k_out[:, :, cache_position] = key_states --+++++ # v_out[:, :, cache_position] = value_states --+++++- if ON_ORANGE_PI: --+++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++- else: --+++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++- --++++++ # if ON_ORANGE_PI: --++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++++ # else: --++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++++ # 确保 cache_position 是 1D tensor 并且类型正确 --++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --++++++ if cache_position.ndim > 1: --++++++ cache_position = cache_position.flatten() --++++++ # 确保类型是 int32 或 int64(MindSpore 要求) --++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --++++++ cache_position = cache_position.int() --++++++ --++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --++++++ k_out[:, :, cache_position] = key_states --++++++ v_out[:, :, cache_position] = value_states --++++++ --+++++ return k_out, v_out --+++++ --+++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++index c695b944..d8303e45 100644 --+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++++ # Copied from transformers.models.llama.modeling_llama.rotate_half --+++++ def rotate_half(x): --+++++ """Rotates half the hidden dims of the input.""" --+++++- x1 = x[..., : x.shape[-1] // 2] --+++++- x2 = x[..., x.shape[-1] // 2 :] --++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++++ # x1 = x[..., : x.shape[-1] // 2] --++++++ # x2 = x[..., x.shape[-1] // 2 :] --++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++ return ops.cat((-x2, x1), dim=-1) --+++++ --+++++ --+++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++++ if self.training: --+++++ raise NotImplementedError("Training is not supported yet.") --+++++ else: --+++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++- if self.config.n_shared_experts is not None: --+++++- y = y + self.shared_experts(identity) --+++++- return y --++++++ # @lwx --++++++ if orig_shape[1] == 1: --++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --++++++ y=y.view(*orig_shape) --++++++ if self.config.n_shared_experts is not None: --++++++ y = y + self.shared_experts(identity) --++++++ return y --++++++ else: --++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --++++++ if self.config.n_shared_experts is not None: --++++++ y = y + self.shared_experts(identity) --++++++ return y --++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++++ # if self.config.n_shared_experts is not None: --++++++ # y = y + self.shared_experts(identity) --++++++ # return y --++++++ --++++++ @no_grad() --++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++++ --++++++ expert_cache = ops.zeros_like(x) --++++++ for i in range(self.num_experts_per_tok): --++++++ expert_id = flat_expert_indices[i].item() --++++++ weight = flat_expert_weights[i].item() --++++++ expert = self.experts[expert_id] --++++++ expert_out = expert(x) --++++++ expert_cache += expert_out * weight --++++++ return expert_cache --+++++ --+++++ @no_grad() --+++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++- # expert_cache = torch.zeros_like(x) --+++++- # idxs = flat_expert_indices.argsort() --+++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++- # token_idxs = idxs // self.num_experts_per_tok --+++++- # for i, end_idx in enumerate(tokens_per_expert): --+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++- # if start_idx == end_idx: --+++++- # continue --+++++- # expert = self.experts[i] --+++++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++++- # expert_tokens = x[exp_token_idx] --+++++- # expert_out = expert(expert_tokens) --+++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++- # return expert_cache --++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++ expert_cache = ops.zeros_like(x) --+++++ idxs = flat_expert_indices.argsort() --+++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++ token_idxs = idxs // self.num_experts_per_tok --++++++ --+++++ for i, end_idx in enumerate(tokens_per_expert): --+++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++ if start_idx == end_idx: --+++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++++ expert_out = expert(expert_tokens) --+++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++++ --+++++ return expert_cache --++++++ --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # # expert_cache = torch.zeros_like(x) --++++++ # # idxs = flat_expert_indices.argsort() --++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++++ # # token_idxs = idxs // self.num_experts_per_tok --++++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++++ # # if start_idx == end_idx: --++++++ # # continue --++++++ # # expert = self.experts[i] --++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # # expert_tokens = x[exp_token_idx] --++++++ # # expert_out = expert(expert_tokens) --++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++++ # # return expert_cache --++++++ # expert_cache = ops.zeros_like(x) --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # for i, end_idx in enumerate(tokens_per_expert): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # if start_idx == end_idx: --++++++ # continue --++++++ # expert = self.experts[i] --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = expert(expert_tokens) --++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++++ --++++++ # return expert_cache --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # expert_cache = ops.zeros_like(x) --++++++ --++++++ # # 排序保证顺序一致 --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # # 找出有 token 的专家 --++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++++ --++++++ # for i in active_experts.tolist(): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # end_idx = tokens_per_expert[i] --++++++ # if start_idx == end_idx: # 没有 token --++++++ # continue --++++++ --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = self.experts[i](expert_tokens) --++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++++ --++++++ # expert_cache = mindspore.mint.scatter_add( --++++++ # expert_cache, --++++++ # 0, --++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++++ # expert_out --++++++ # ) --++++++ --++++++ # return expert_cache --++++++ --++++++ --+++++ --+++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++++ # """ --+++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++ --+++++ # Initialize weights and apply final processing --+++++ self.post_init() --++++++ self.warm_up = False --++++++ --++++++ def warmup_moe_model_deep(self): --++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --++++++ test_texts = [ --++++++ "warmup short", --++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --++++++ ] --++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++++ if tokenizer is None: --++++++ from mindnlp.transformers import AutoTokenizer --++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++++ self._warmup_tokenizer = tokenizer --++++++ --++++++ for text in test_texts: --++++++ inputs = tokenizer(text, return_tensors="ms") --++++++ with mindspore._no_grad(): --++++++ _ = self(**inputs, use_cache=False) --++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++++ --+++++ def get_input_embeddings(self): --+++++ return self.model.embed_tokens --+++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++ ```""" --++++++ if not self.warm_up: --++++++ self.warm_up = True --++++++ self.warmup_moe_model_deep() --++++++ --+++++ output_attentions = ( --+++++ output_attentions --+++++ if output_attentions is not None --+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++index 3cbf820e..d4c6b651 100644 --+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++@@ -18,7 +18,6 @@ --+++++ # See the License for the specific language governing permissions and --+++++ # limitations under the License. --+++++ """MindSpore Qwen2MoE model.""" --+++++- --+++++ import math --+++++ from typing import List, Optional, Tuple, Union --+++++ --+++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++++ TokenClassifierOutput, --+++++ ) --+++++ from ...modeling_utils import PreTrainedModel --++++++from ...generation import GenerationMixin --+++++ from ....utils import logging --+++++ from .configuration_qwen2_moe import Qwen2MoeConfig --+++++ --+++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++++ self.variance_epsilon = eps --+++++ --+++++ def forward(self, hidden_states): --++++++ # @dwj --++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++++ # @lwx --++++++ # if not self.training : --++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++ input_dtype = hidden_states.dtype --+++++ hidden_states = hidden_states.to(mindspore.float32) --+++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++++@@ -234,6 +239,8 @@ def rotate_half(x): --+++++ """Rotates half the hidden dims of the input.""" --+++++ x1 = x[..., : x.shape[-1] // 2] --+++++ x2 = x[..., x.shape[-1] // 2 :] --++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++ return ops.cat((-x2, x1), dim=-1) --+++++ --+++++ --+++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++++ self.config = config --+++++ self.hidden_size = config.hidden_size --+++++ self.intermediate_size = intermediate_size --++++++ --+++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++++ self.act_fn = ACT2FN[config.hidden_act] --+++++ --+++++ def forward(self, x): --+++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++- --+++++ --++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++++ # @lwx --++++++ # gate_up_output = self.gate_up_proj(x) --++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --++++++ # return self.down_proj(swiglu_output) --++++++ --++++++ # def forward(self, x): --++++++ # gate_proj_out = self.gate_proj(x) --++++++ # up_proj_out = self.up_proj(x) --++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --++++++ # return self.down_proj(swiglu_out) --++++++ --+++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++++ """ --+++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++++ use_cache: bool = False, --+++++ cache_position: Optional[mindspore.Tensor] = None, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ --++++++ --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ query_states = self.q_proj(hidden_states) --+++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++ "with a layer index." --+++++ ) --+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ if isinstance(past_key_value, StaticCache): --++++++ kv_seq_len = key_states.shape[-2] --++++++ else: --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++ if isinstance(past_key_value, StaticCache): --++++++ kv_seq_len = key_states.shape[-2] --+++++ --+++++ # repeat k/v heads if n_kv_heads < n_heads --+++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++- --++++++ --+++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++ --+++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++++- raise ValueError( --+++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++++- f" {attn_weights.shape}" --+++++- ) --+++++- --+++++- if attention_mask is not None: # no matter the length, we just slice it --+++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --++++++ if attention_mask is not None: --++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++ attn_weights = attn_weights + causal_mask --+++++ --+++++ # upcast attention to fp32 --+++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++ --+++++ attn_output = self.o_proj(attn_output) --+++++- --++++++ # @lwx --++++++ --++++++ # max_seq_len = self.max_position_embeddings # 2048 --++++++ --++++++ # if attention_mask is not None: --++++++ # # attention_mask: [B, 1, Sq, Sk] --++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++++ --++++++ # # pad 到 [max_seq_len, max_seq_len] --++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++++ # global_attention_mask = padded_mask --++++++ # else: --++++++ # global_attention_mask = None --++++++ --++++++ --++++++ # sparse_mode=3 --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # real_shift=None, --++++++ # padding_mask=None, --++++++ --++++++ # head_num=self.num_heads, --++++++ # attn_mask=global_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ # input_layout="BNSD", --++++++ # pre_tokens=2147483647, --++++++ # next_tokens=2147483647, --++++++ # inner_precise=0, --++++++ # drop_mask=None, --++++++ # prefix=None, --++++++ # actual_seq_qlen=None, --++++++ # actual_seq_kvlen=None, --++++++ # sparse_mode=sparse_mode, --++++++ # ) --+++++ if not output_attentions: --+++++ attn_weights = None --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ --++++++class Qwen2MoeFlashAttention(nn.Module): --++++++ """ --++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++++ --++++++ 关键改动: --++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++++ 直接传入原始的 key 和 value 张量效率更高。 --++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++++ """ --++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++++ super().__init__() --++++++ self.config = config --++++++ self.layer_idx = layer_idx --++++++ self.hidden_size = config.hidden_size --++++++ self.num_heads = config.num_attention_heads --++++++ self.head_dim = self.hidden_size // self.num_heads --++++++ self.num_key_value_heads = config.num_key_value_heads --++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++ self.max_position_embeddings = config.max_position_embeddings --++++++ self.rope_theta = config.rope_theta --++++++ self.attention_dropout = config.attention_dropout --++++++ --++++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++++ raise ValueError( --++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++++ ) --++++++ --++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++++ --++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++++ self.head_dim, --++++++ max_position_embeddings=self.max_position_embeddings, --++++++ base=self.rope_theta, --++++++ ) --++++++ --++++++ def forward( --++++++ self, --++++++ hidden_states: mindspore.Tensor, --++++++ attention_mask: Optional[mindspore.Tensor] = None, --++++++ position_ids: Optional[mindspore.Tensor] = None, --++++++ past_key_value: Optional[Cache] = None, --++++++ output_attentions: bool = False, --++++++ use_cache: bool = False, --++++++ cache_position: Optional[mindspore.Tensor] = None, --++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # 1. 线性投射 Q, K, V --++++++ query_states = self.q_proj(hidden_states) --++++++ key_states = self.k_proj(hidden_states) --++++++ value_states = self.v_proj(hidden_states) --++++++ --++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++ # query: [B, S, H*D] -> [B, N1, S, D] --++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # 3. RoPE 旋转位置编码 --++++++ kv_seq_len = key_states.shape[-2] --++++++ if past_key_value is not None: --++++++ if self.layer_idx is None: --++++++ raise ValueError( --++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++ "with a layer index." --++++++ ) --++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++++ if cache_position.shape[0] == 1: --++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++++ kv_seq_len = past_seen_tokens + 1 --++++++ else: --++++++ # prefill 阶段:cache_position 是范围,使用其长度 --++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++++ else: --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # 4. KV 缓存更新 --++++++ if past_key_value is not None: --++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ key_states, value_states = past_key_value.update( --++++++ key_states, value_states, self.layer_idx, cache_kwargs --++++++ ) --++++++ --++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++ if cache_position.shape[0] == 1: --++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++++ kv_seq_len = key_states.shape[-2] --++++++ --++++++ # 5. [重要] 准备 Attention Mask --++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++++ fa_attention_mask = None --++++++ if attention_mask is not None: --++++++ # 截取与当前key长度匹配的部分 --++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --++++++ fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++++ input_dtype = query_states.dtype --++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++++ query_states = query_states.to(mindspore.float16) --++++++ key_states = key_states.to(mindspore.float16) --++++++ value_states = value_states.to(mindspore.float16) --++++++ --++++++ # 6. [核心] 调用 flash_attention_score 算子 --++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++++ attn_output = mindspore.ops.flash_attention_score( --++++++ query=query_states, --++++++ key=key_states, --++++++ value=value_states, --++++++ head_num=self.num_heads, # 传入Q的头数(N1) --++++++ attn_mask=fa_attention_mask, --++++++ keep_prob=1.0 - self.attention_dropout, --++++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ input_layout="BNSD", --++++++ sparse_mode=0 # 使用 defaultMask 模式 --++++++ ) --++++++ --++++++ # 恢复原始数据类型 --++++++ attn_output = attn_output.to(input_dtype) --++++++ --++++++ # 7. 调整输出形状 --++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ attn_output = self.o_proj(attn_output) --++++++ --++++++ # FlashAttention 算子不直接返回注意力权重矩阵 --++++++ attn_weights = None --++++++ if output_attentions: --++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++ return attn_output, attn_weights, past_key_value --++++++ --++++++ # def forward( --++++++ # self, --++++++ # hidden_states: mindspore.Tensor, --++++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++++ # position_ids: Optional[mindspore.Tensor] = None, --++++++ # past_key_value: Optional[Cache] = None, --++++++ # output_attentions: bool = False, --++++++ # use_cache: bool = False, --++++++ # cache_position: Optional[mindspore.Tensor] = None, --++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ # bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # # 1. 线性投射 Q, K, V --++++++ # query_states = self.q_proj(hidden_states) --++++++ # key_states = self.k_proj(hidden_states) --++++++ # value_states = self.v_proj(hidden_states) --++++++ --++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # # 3. RoPE 旋转位置编码 --++++++ # kv_seq_len = key_states.shape[-2] --++++++ # if past_key_value is not None: --++++++ # if self.layer_idx is None: --++++++ # raise ValueError( --++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++ # "with a layer index." --++++++ # ) --++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # # 4. KV 缓存更新 --++++++ # if past_key_value is not None: --++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ # key_states, value_states = past_key_value.update( --++++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++++ # ) --++++++ --++++++ # # 5. 准备 Attention Mask --++++++ # fa_attention_mask = None --++++++ # if attention_mask is not None: --++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++++ # input_dtype = query_states.dtype --++++++ --++++++ # # 6. [核心] 调用 flash_attention_score 算子 --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # head_num=self.num_heads, --++++++ # attn_mask=fa_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ # input_layout="BNSD", --++++++ # sparse_mode=0, --++++++ # # <--- 修改点 2: 启用内部高精度计算 --- --++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++++ # inner_precise=1 --++++++ # ) --++++++ --++++++ # # 恢复原始数据类型 --++++++ # attn_output = attn_output.to(input_dtype) --++++++ --++++++ # # 7. 调整输出形状 --++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ # attn_output = self.o_proj(attn_output) --++++++ --++++++ # attn_weights = None --++++++ # if output_attentions: --++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++ # return attn_output, attn_weights, past_key_value --++++++ --++++++ # def forward( --++++++ # self, --++++++ # hidden_states: mindspore.Tensor, --++++++ # attention_mask: Optional[mindspore.Tensor] = None, --++++++ # position_ids: Optional[mindspore.Tensor] = None, --++++++ # past_key_value: Optional[Cache] = None, --++++++ # output_attentions: bool = False, --++++++ # use_cache: bool = False, --++++++ # cache_position: Optional[mindspore.Tensor] = None, --++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ # bsz, q_len, _ = hidden_states.shape --++++++ --++++++ # query_states = self.q_proj(hidden_states) --++++++ # key_states = self.k_proj(hidden_states) --++++++ # value_states = self.v_proj(hidden_states) --++++++ --++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ # kv_seq_len = key_states.shape[-2] --++++++ # if past_key_value is not None: --++++++ # if self.layer_idx is None: --++++++ # raise ValueError("`layer_idx` must be specified for caching") --++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ # if past_key_value is not None: --++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++ # key_states, value_states = past_key_value.update( --++++++ # key_states, value_states, self.layer_idx, cache_kwargs --++++++ # ) --++++++ --++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++ --++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --++++++ # query_states = query_states / math.sqrt(self.head_dim) --++++++ # # <--- 修改结束 --- --++++++ --++++++ # fa_attention_mask = None --++++++ # if attention_mask is not None: --++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ # fa_attention_mask = (mask_slice != 0) --++++++ --++++++ # input_dtype = query_states.dtype --++++++ --++++++ # attn_output = mindspore.ops.flash_attention_score( --++++++ # query=query_states, # 传入已经预先缩放过的 query --++++++ # key=key_states, --++++++ # value=value_states, --++++++ # head_num=self.num_heads, --++++++ # attn_mask=fa_attention_mask, --++++++ # keep_prob=1.0 - self.attention_dropout, --++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --++++++ # input_layout="BNSD", --++++++ # sparse_mode=0, --++++++ # inner_precise=1 # 仍然保持内部高精度计算 --++++++ # ) --++++++ --++++++ # attn_output = attn_output.to(input_dtype) --++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ # attn_output = self.o_proj(attn_output) --++++++ --++++++ # attn_weights = None --++++++ # if output_attentions: --++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++++ --++++++ # return attn_output, attn_weights, past_key_value --++++++ --+++++ QWEN2MOE_ATTENTION_CLASSES = { --+++++ "eager": Qwen2MoeAttention, --++++++ "flash-attention": Qwen2MoeFlashAttention, --+++++ } --+++++ --+++++ --+++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --++++++ #@dwj --++++++ # 只遍历激活的专家,而非全部专家 --+++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- hidden_states = hidden_states.view(-1, hidden_dim) --+++++- # router_logits: (batch * sequence_length, n_experts) --+++++- router_logits = self.gate(hidden_states) --+++++- --+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- if self.norm_topk_prob: --+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- # we cast back to the input dtype --+++++- routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++- final_hidden_states = ops.zeros( --+++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++++- ) --+++++- --+++++- # One hot encode the selected experts to create an expert mask --+++++- # this will be used to easily index which expert is going to be sollicitated --+++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++++- --+++++- # Loop over all available experts in the model and perform the computation on each expert --+++++- for expert_idx in range(self.num_experts): --+++++- expert_layer = self.experts[expert_idx] --+++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++++- --+++++- # Index the correct hidden states and compute the expert hidden state for --+++++- # the current expert. We need to make sure to multiply the output hidden --+++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++++- if 0 not in idx.shape: --+++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++++- --+++++- # However `index_add_` only support torch tensors for indexing so we'll use --+++++- # the `top_x` tensor here. --+++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++++- --+++++- shared_expert_output = self.shared_expert(hidden_states) --+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++++- --+++++- final_hidden_states = final_hidden_states + shared_expert_output --++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++ num_tokens = hidden_states_reshaped.shape[0] --++++++ --++++++ router_logits = self.gate(hidden_states_reshaped) --++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++ --++++++ if self.norm_topk_prob: --++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++++ flat_selected_experts = selected_experts.flatten() --++++++ --++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++++ token_indices = broadcasted_token_indices.flatten() --++++++ --++++++ active_experts = ops.unique(flat_selected_experts) --++++++ --++++++ for expert_idx_tensor in active_experts: --++++++ expert_idx = expert_idx_tensor.item() --++++++ expert_layer = self.experts[expert_idx] --++++++ --++++++ mask = (flat_selected_experts == expert_idx_tensor) --++++++ selected_token_indices = token_indices[mask] --++++++ selected_routing_weights = routing_weights.flatten()[mask] --++++++ --++++++ current_states = hidden_states_reshaped[selected_token_indices] --++++++ --++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++ --++++++ final_hidden_states = final_hidden_states.index_add( --++++++ dim=0, --++++++ index=selected_token_indices, --++++++ source=expert_output.to(hidden_states.dtype) --++++++ ) --++++++ --++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++++ --+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++- return final_hidden_states, router_logits --++++++ final_hidden_states = final_hidden_states + shared_expert_output --++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++++ --++++++ return final_hidden_states, router_logits --+++++ --+++++ --+++++ class Qwen2MoeDecoderLayer(nn.Module): --+++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++++ --+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ --++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++++ --+++++ if (layer_idx not in config.mlp_only_layers) and ( --+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++++ ): --+++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++++ _skip_keys_device_placement = "past_key_values" --+++++ _supports_cache_class = True --++++++#lwx --++++++ # _supports_static_cache = True --+++++ --+++++ def _init_weights(self, module): --+++++ std = self.config.initializer_range --+++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++++ return causal_mask --+++++ --+++++ --+++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ _tied_weights_keys = ["lm_head.weight"] --+++++ --+++++ def __init__(self, config): --+++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ self.num_experts_per_tok = config.num_experts_per_tok --+++++ # Initialize weights and apply final processing --+++++ self.post_init() --++++++ # @lwx --++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --++++++ # self.generation_config.cache_implementation = "static" --++++++ self._warmed_up = False --++++++ --++++++ def warmup_moe_model(self): --++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --++++++ test_texts = [ --++++++ "warmup short", --++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --++++++ ] --++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --++++++ if tokenizer is None: --++++++ from mindnlp.transformers import AutoTokenizer --++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --++++++ self._warmup_tokenizer = tokenizer --++++++ --++++++ for text in test_texts: --++++++ inputs = tokenizer(text, return_tensors="ms") --++++++ with mindspore._no_grad(): --++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++++ --+++++ def get_input_embeddings(self): --+++++ return self.model.embed_tokens --+++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++ ```""" --++++++ if not self._warmed_up: --++++++ self._warmed_up = True --++++++ self.warmup_moe_model() --+++++ --+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++++ output_router_logits = ( --+++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++ } --+++++ ) --+++++ return model_inputs --++++++# @lwx --++++++ # def _decode_one_tokens_logits( --++++++ # self, --++++++ # cur_token: mindspore.Tensor, --++++++ # input_pos: Optional[mindspore.Tensor], --++++++ # cache_position: mindspore.Tensor, --++++++ # past_key_values: StaticCache, --++++++ # ) -> mindspore.Tensor: --++++++ # """ --++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --++++++ --++++++ # Args: --++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --++++++ # input_pos: 输入位置信息,可选 --++++++ # cache_position: 当前token在cache中的位置,shape为(1,) --++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --++++++ --++++++ # Returns: --++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --++++++ # """ --++++++ # # 调用JIT编译的版本 --++++++ # return self.get_decode_one_tokens_logits( --++++++ # cur_token=cur_token, --++++++ # input_pos=input_pos, --++++++ # cache_position=cache_position, --++++++ # past_key_values=past_key_values, --++++++ # ) --++++++ --++++++ # @mindspore.jit(jit_level='O1') --++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --++++++ # """ --++++++ # JIT编译的函数,用于高效的单token解码 --++++++ # 使用JIT编译优化以支持静态shape和高效执行 --++++++ --++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --++++++ # """ --++++++ # outputs = self.model.forward( --++++++ # input_ids=cur_token, --++++++ # position_ids=input_pos, --++++++ # cache_position=cache_position, --++++++ # past_key_values=past_key_values, --++++++ # use_cache=True, --++++++ # return_dict=False, --++++++ # ) --++++++ --++++++ # hidden_states = outputs[0] --++++++ # logits = self.lm_head.forward(hidden_states) --++++++ # logits = logits.float() --++++++ --++++++ # return logits[:, -1, :] --++++++ --++++++ # def _sample( --++++++ # self, --++++++ # input_ids: mindspore.Tensor, --++++++ # logits_processor, --++++++ # stopping_criteria, --++++++ # generation_config, --++++++ # synced_devices: bool, --++++++ # streamer=None, --++++++ # logits_warper=None, --++++++ # **model_kwargs, --++++++ # ): --++++++ # """ --++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --++++++ # """ --++++++ # from ...generation.logits_process import LogitsProcessorList --++++++ # from ...generation.stopping_criteria import StoppingCriteriaList --++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --++++++ # from mindnlp.core import nn, ops, no_grad --++++++ # import numpy as np --++++++ --++++++ # # 检查是否使用 StaticCache --++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --++++++ # # 否则,直接调用父类方法 --++++++ # past_key_values = model_kwargs.get("past_key_values") --++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --++++++ --++++++ # if not isinstance(past_key_values, StaticCache): --++++++ # # 不使用 StaticCache,直接调用父类方法 --++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --++++++ # return super()._sample( --++++++ # input_ids=input_ids, --++++++ # logits_processor=logits_processor, --++++++ # stopping_criteria=stopping_criteria, --++++++ # generation_config=generation_config, --++++++ # synced_devices=synced_devices, --++++++ # streamer=streamer, --++++++ # logits_warper=logits_warper, --++++++ # **model_kwargs, --++++++ # ) --++++++ --++++++ # # 使用 StaticCache,进入自定义循环 --++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --++++++ # pad_token_id = generation_config._pad_token_tensor --++++++ # output_attentions = generation_config.output_attentions --++++++ # output_hidden_states = generation_config.output_hidden_states --++++++ # output_scores = generation_config.output_scores --++++++ # output_logits = generation_config.output_logits --++++++ # return_dict_in_generate = generation_config.return_dict_in_generate --++++++ # max_length = generation_config.max_length --++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --++++++ # do_sample = generation_config.do_sample --++++++ --++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --++++++ # raise ValueError( --++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --++++++ # f"{logits_warper})." --++++++ # ) --++++++ --++++++ # # init attention / hidden states / scores tuples --++++++ # scores = () if (return_dict_in_generate and output_scores) else None --++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --++++++ --++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --++++++ # encoder_hidden_states = ( --++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --++++++ # ) --++++++ --++++++ # # keep track of which sequences are already finished --++++++ # batch_size, cur_len = input_ids.shape --++++++ # this_peer_finished = False --++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --++++++ --++++++ # time_record = [] --++++++ # from ....utils.testing_utils import parse_flag_from_env --++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --++++++ --++++++ # while self._has_unfinished_sequences( --++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --++++++ # ): --++++++ # if _record_time: --++++++ # import time as time_module --++++++ # infer_start = time_module.time() --++++++ --++++++ # # prepare model inputs --++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --++++++ --++++++ # # prepare variable output controls --++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --++++++ --++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --++++++ # cur_cache_position = model_inputs.get("cache_position") --++++++ # cur_past_key_values = model_inputs.get("past_key_values") --++++++ # cur_input_ids = model_inputs.get("input_ids") --++++++ --++++++ # if (isinstance(cur_past_key_values, StaticCache) and --++++++ # cur_cache_position is not None and --++++++ # len(cur_cache_position.shape) > 0 and --++++++ # cur_cache_position.shape[0] == 1 and --++++++ # cur_input_ids is not None and --++++++ # cur_input_ids.shape[1] == 1): --++++++ # # 使用 JIT 优化的单 token 解码 --++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --++++++ # if not hasattr(self, '_jit_used'): --++++++ # self._jit_used = False --++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --++++++ --++++++ # next_token_logits = self.get_decode_one_tokens_logits( --++++++ # cur_token=cur_input_ids, --++++++ # input_pos=model_inputs.get("position_ids"), --++++++ # cache_position=cur_cache_position, --++++++ # past_key_values=cur_past_key_values, --++++++ # ) --++++++ --++++++ # # 标记已使用JIT(用于后续判断) --++++++ # if not self._jit_used: --++++++ # self._jit_used = True --++++++ --++++++ # # 构造兼容的输出对象 --++++++ # class JitOptimizedOutput: --++++++ # def __init__(self, logits, config): --++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --++++++ # self.config = config --++++++ # # 对于 JIT 优化路径,这些属性通常不需要 --++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --++++++ # self.attentions = None if not config.is_encoder_decoder else None --++++++ # self.cross_attentions = None --++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --++++++ # self.hidden_states = None if not config.is_encoder_decoder else None --++++++ --++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --++++++ # else: --++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --++++++ # outputs = self(**model_inputs, return_dict=True) --++++++ --++++++ # if synced_devices and this_peer_finished: --++++++ # continue --++++++ --++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --++++++ # next_token_logits = outputs.logits[:, -1, :] --++++++ --++++++ # # pre-process distribution --++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --++++++ # if do_sample: --++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --++++++ --++++++ # # Store scores, attentions and hidden_states when required --++++++ # if return_dict_in_generate: --++++++ # if output_scores: --++++++ # scores += (next_token_scores,) --++++++ # if output_logits: --++++++ # raw_logits += (next_token_logits,) --++++++ # if output_attentions: --++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --++++++ # decoder_attentions += (attn,) if attn is not None else (None,) --++++++ # if self.config.is_encoder_decoder: --++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --++++++ --++++++ # if output_hidden_states: --++++++ # hidden = ( --++++++ # outputs.decoder_hidden_states --++++++ # if self.config.is_encoder_decoder --++++++ # else outputs.hidden_states --++++++ # ) --++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --++++++ --++++++ # # token selection --++++++ # if do_sample: --++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --++++++ # else: --++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --++++++ --++++++ # # finished sentences should have their next token be a padding token --++++++ # if has_eos_stopping_criteria: --++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --++++++ --++++++ # # update generated ids, model inputs, and length for next step --++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --++++++ # if streamer is not None: --++++++ # streamer.put(next_tokens) --++++++ --++++++ # model_kwargs = self._update_model_kwargs_for_generation( --++++++ # outputs, --++++++ # model_kwargs, --++++++ # is_encoder_decoder=self.config.is_encoder_decoder, --++++++ # ) --++++++ --++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --++++++ # cur_len += 1 --++++++ --++++++ # if _record_time: --++++++ # import time as time_module --++++++ # infer_stop = time_module.time() --++++++ # time_record.append(infer_stop - infer_start) --++++++ --++++++ # del outputs --++++++ --++++++ # average_infer_time = None --++++++ # if time_record: --++++++ # if len(time_record) > 1: --++++++ # time_record.pop(0) --++++++ # average_infer_time = sum(time_record) / len(time_record) --++++++ # print(f'average inference time is: {average_infer_time}') --++++++ # print(f'inference time record: {time_record}') --++++++ --++++++ # if streamer is not None: --++++++ # streamer.end() --++++++ --++++++ # # 简单判断:打印是否使用了JIT路径 --++++++ # if hasattr(self, '_jit_used') and self._jit_used: --++++++ # print("[JIT] ✓ JIT optimization was used during generation") --++++++ # else: --++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --++++++ --++++++ # if return_dict_in_generate: --++++++ # if self.config.is_encoder_decoder: --++++++ # return GenerateEncoderDecoderOutput( --++++++ # sequences=input_ids, --++++++ # scores=scores, --++++++ # logits=raw_logits, --++++++ # encoder_attentions=encoder_attentions, --++++++ # encoder_hidden_states=encoder_hidden_states, --++++++ # decoder_attentions=decoder_attentions, --++++++ # cross_attentions=cross_attentions, --++++++ # decoder_hidden_states=decoder_hidden_states, --++++++ # past_key_values=model_kwargs.get("past_key_values"), --++++++ # average_infer_time=average_infer_time --++++++ # ) --++++++ # else: --++++++ # return GenerateDecoderOnlyOutput( --++++++ # sequences=input_ids, --++++++ # scores=scores, --++++++ # logits=raw_logits, --++++++ # attentions=decoder_attentions, --++++++ # hidden_states=decoder_hidden_states, --++++++ # past_key_values=model_kwargs.get("past_key_values"), --++++++ # average_infer_time=average_infer_time --++++++ # ) --++++++ # else: --++++++ # return input_ids --++++++ --++++++ # def _prepare_cache_for_generation( --++++++ # self, --++++++ # generation_config, --++++++ # model_kwargs, --++++++ # assistant_model, --++++++ # batch_size, --++++++ # max_cache_length, --++++++ # ): --++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --++++++ # generation_config.cache_implementation = "static" --++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --++++++ --++++++ # if generation_config.cache_implementation == "static": --++++++ # base_required_from_max_length = generation_config.max_length + 1 --++++++ # base_required = max(max_cache_length, base_required_from_max_length) --++++++ # min_cache_size = 50 --++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --++++++ # else: --++++++ # max_cache_length = max(base_required, min_cache_size) --++++++ --++++++ # original_max_cache_length = max_cache_length --++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --++++++ # print(f" - input max_cache_length: {original_max_cache_length}") --++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --++++++ # print(f" - final max_cache_length: {max_cache_length}") --++++++ --++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --++++++ # if max_cache_length > self.config.max_position_embeddings: --++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --++++++ --++++++ # result = super()._prepare_cache_for_generation( --++++++ # generation_config=generation_config, --++++++ # model_kwargs=model_kwargs, --++++++ # assistant_model=assistant_model, --++++++ # batch_size=batch_size, --++++++ # max_cache_length=max_cache_length, --++++++ # ) --++++++ --++++++ # if generation_config.cache_implementation == "static": --++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --++++++ # created_cache = model_kwargs.get(cache_name) --++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --++++++ # if created_cache.max_cache_len < generation_config.max_length: --++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --++++++ --++++++ # return result --++++++ --++++++ --++++++ --+++++ --+++++ --+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++++-- --+++++2.27.0 --+++++ --++++diff --git a/patches/0002-20251106commit.patch b/patches/0002-20251106commit.patch --++++new file mode 100644 --++++index 00000000..22b65dd5 --++++--- /dev/null --+++++++ b/patches/0002-20251106commit.patch --++++@@ -0,0 +1,3200 @@ --+++++From 1b2b3a555b7f4c777a43b806c8c7a0f2049f8de1 Mon Sep 17 00:00:00 2001 --+++++From: Pinoeer-kingxi <13022943007@163.com> --+++++Date: Thu, 6 Nov 2025 09:20:38 +0800 --+++++Subject: [PATCH 2/3] 20251106commit --+++++ --+++++--- --+++++ .../models/deepseek/modeling_deepseek.py | 379 ++++- --+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1343 +++++++++++++---- --+++++ patches/0001-20251104commit.patch | 1272 ++++++++++++++++ --+++++ 3 files changed, 2689 insertions(+), 305 deletions(-) --+++++ create mode 100644 patches/0001-20251104commit.patch --+++++ --+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++index d8303e45..73773c22 100644 --+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++@@ -404,17 +404,42 @@ class DeepseekMoE(nn.Module): --+++++ # y = y + self.shared_experts(identity) --+++++ # return y --+++++ --++++++ # @no_grad() --++++++ # def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++++ --++++++ # expert_cache = ops.zeros_like(x) --++++++ # for i in range(self.num_experts_per_tok): --++++++ # expert_id = flat_expert_indices[i].item() --++++++ # weight = flat_expert_weights[i].item() --++++++ # expert = self.experts[expert_id] --++++++ # expert_out = expert(x) --++++++ # expert_cache += expert_out * weight --++++++ # return expert_cache --++++++ --+++++ @no_grad() --+++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --++++++ # x 的 shape: (1, hidden_size) --++++++ # flat_expert_indices 的 shape: (num_experts_per_tok,) --++++++ # flat_expert_weights 的 shape: (num_experts_per_tok, 1) --++++++ --++++++ # 1. 收集所有需要的专家层 --++++++ # 注意: flat_expert_indices 是一个 Tensor,可以直接用于索引 --++++++ selected_experts = [self.experts[i] for i in flat_expert_indices] --++++++ --++++++ # 2. 并行计算所有专家的输出 --++++++ # [expert(x) for expert in selected_experts] 会得到一个 list of Tensors --++++++ # ops.cat 会将它们堆叠成一个新的 Tensor --++++++ # 最终 expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++++ expert_outputs = ops.cat([expert(x) for expert in selected_experts], dim=0) --++++++ --++++++ # 3. 使用矩阵乘法进行加权求和 --++++++ # flat_expert_weights.T 的 shape: (1, num_experts_per_tok) --++++++ # expert_outputs 的 shape: (num_experts_per_tok, hidden_size) --++++++ # 最终结果 final_output 的 shape: (1, hidden_size) --++++++ final_output = ops.matmul(flat_expert_weights.T, expert_outputs) --++++++ --++++++ return final_output --+++++ --+++++- expert_cache = ops.zeros_like(x) --+++++- for i in range(self.num_experts_per_tok): --+++++- expert_id = flat_expert_indices[i].item() --+++++- weight = flat_expert_weights[i].item() --+++++- expert = self.experts[expert_id] --+++++- expert_out = expert(x) --+++++- expert_cache += expert_out * weight --+++++- return expert_cache --+++++ --+++++ @no_grad() --+++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++@@ -807,9 +832,16 @@ class DeepseekAttention(nn.Module): --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++++ # query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++++ # key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++++ # value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++++ # @lwx --++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim) --++++++ query_states = query_states.transpose(0, 2, 1, 3) # (bsz, num_heads, q_len, head_dim) --++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++++++ key_states = key_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim) --++++++ value_states = value_states.transpose(0, 2, 1, 3) # (bsz, num_key_value_heads, q_len, head_dim) --+++++ --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++@@ -873,8 +905,329 @@ class DeepseekAttention(nn.Module): --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ --++++++# class DeepseekFlashAttention(nn.Module): --++++++# """ --++++++# Multi-headed attention from 'Attention Is All You Need' paper, implemented using --++++++# mindspore.ops.flash_attention_score for acceleration on Ascend NPU. --++++++ --++++++# This class is designed as a drop-in replacement for DeepseekAttention. --++++++# """ --++++++ --++++++# def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++++++# super().__init__() --++++++# self.config = config --++++++# self.layer_idx = layer_idx --++++++# if layer_idx is None: --++++++# logger.warning( --++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++++# "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++++# "when creating this class." --++++++# ) --++++++ --++++++# self.attention_dropout = config.attention_dropout --++++++# self.hidden_size = config.hidden_size --++++++# self.num_heads = config.num_attention_heads --++++++# self.head_dim = self.hidden_size // self.num_heads --++++++# self.num_key_value_heads = config.num_key_value_heads --++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++# self.max_position_embeddings = config.max_position_embeddings --++++++# self.rope_theta = config.rope_theta --++++++# self.is_causal = True --++++++ --++++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++++# raise ValueError( --++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++++# f" and `num_heads`: {self.num_heads})." --++++++# ) --++++++ --++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++++++# self._init_rope() --++++++ --++++++# def _init_rope(self): --++++++# if self.config.rope_scaling is None: --++++++# self.rotary_emb = DeepseekRotaryEmbedding( --++++++# self.head_dim, --++++++# max_position_embeddings=self.max_position_embeddings, --++++++# base=self.rope_theta, --++++++# ) --++++++# else: --++++++# scaling_type = self.config.rope_scaling["type"] --++++++# scaling_factor = self.config.rope_scaling["factor"] --++++++# if scaling_type == "linear": --++++++# self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++++++# self.head_dim, --++++++# max_position_embeddings=self.max_position_embeddings, --++++++# scaling_factor=scaling_factor, --++++++# base=self.rope_theta, --++++++# ) --++++++# elif scaling_type == "dynamic": --++++++# self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++++++# self.head_dim, --++++++# max_position_embeddings=self.max_position_embeddings, --++++++# scaling_factor=scaling_factor, --++++++# base=self.rope_theta, --++++++# ) --++++++# else: --++++++# raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++++++ --++++++# def forward( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# attention_mask: Optional[mindspore.Tensor] = None, --++++++# position_ids: Optional[mindspore.Tensor] = None, --++++++# past_key_value: Optional[Cache] = None, --++++++# output_attentions: bool = False, --++++++# use_cache: bool = False, --++++++# **kwargs, --++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++# if "padding_mask" in kwargs: --++++++# warnings.warn( --++++++# "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++++++# ) --++++++ --++++++# if output_attentions: --++++++# warnings.warn("`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned.") --++++++ --++++++# bsz, q_len, _ = hidden_states.shape --++++++ --++++++# if self.config.pretraining_tp > 1: --++++++# raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++++++ --++++++# query_states = self.q_proj(hidden_states) --++++++# key_states = self.k_proj(hidden_states) --++++++# value_states = self.v_proj(hidden_states) --++++++ --++++++# # Reshape for multi-head attention --++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++# kv_seq_len = key_states.shape[-2] --++++++# if past_key_value is not None: --++++++# if self.layer_idx is None: --++++++# raise ValueError( --++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++# "with a layer index." --++++++# ) --++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++# # Apply Rotary Positional Embedding --++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++# if past_key_value is not None: --++++++# cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models --++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++# # Reshape Q, K, V for flash_attention_score's 'BSH' layout --++++++# # Q: (bsz, num_heads, q_len, head_dim) -> (bsz, q_len, num_heads, head_dim) -> (bsz, q_len, hidden_size) --++++++# query_states_for_fa = query_states.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ --++++++# # K: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++++# key_states_for_fa = key_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++++ --++++++# # V: (bsz, num_kv_heads, kv_seq_len, head_dim) -> (bsz, kv_seq_len, num_kv_heads, head_dim) -> (bsz, kv_seq_len, num_kv_heads * head_dim) --++++++# value_states_for_fa = value_states.transpose(0, 2, 1, 3).reshape(bsz, kv_seq_len, self.num_key_value_heads * self.head_dim) --++++++ --++++++# # Convert attention_mask for flash_attention_score --++++++# # The original mask is float with -inf for masked positions. FA needs a boolean mask where True means discard. --++++++# if attention_mask is not None: --++++++# # The mask should have been prepared as (bsz, 1, q_len, kv_seq_len) --++++++# if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++++++# raise ValueError( --++++++# f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++++++# ) --++++++# attn_mask_for_fa = attention_mask < 0 # Convert -inf to True --++++++# else: --++++++# attn_mask_for_fa = None --++++++ --++++++# keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++++++ --++++++# # Call the fused flash_attention_score operator --++++++# attn_output = mindspore.ops.flash_attention_score( --++++++# query=query_states_for_fa, --++++++# key=key_states_for_fa, --++++++# value=value_states_for_fa, --++++++# head_num=self.num_heads, # This is N1, the number of query heads --++++++# input_layout='BSH', --++++++# attn_mask=attn_mask_for_fa, --++++++# keep_prob=keep_prob, --++++++# scalar_value=1.0 / math.sqrt(self.head_dim), --++++++# sparse_mode=0 # Default mask mode --++++++# ) --++++++ --++++++# # Output shape is already (bsz, q_len, hidden_size), so no reshape is needed --++++++# attn_output = self.o_proj(attn_output) --++++++ --++++++# # Flash Attention does not return attention weights --++++++# attn_weights = None --++++++ --++++++# return attn_output, attn_weights, past_key_value --++++++ --++++++class DeepseekFlashAttention(nn.Module): --++++++ """ --++++++ DeepseekAttention implemented with MindSpore's flash_attention_score operator. --++++++ This implementation is a drop-in replacement for the original DeepseekAttention class, --++++++ designed for high performance on supported hardware (Ascend). --++++++ --++++++ It uses the 'BNSD' (Batch, Num_heads, Seq_len, Head_dim) memory layout for efficiency. --++++++ """ --++++++ def __init__(self, config: DeepseekConfig, layer_idx: Optional[int] = None): --++++++ super().__init__() --++++++ self.config = config --++++++ self.layer_idx = layer_idx --++++++ if layer_idx is None: --++++++ logger.warning( --++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++++ "when creating this class." --++++++ ) --++++++ --++++++ # --- [FIX] Correctly initialize all required attributes --- --++++++ self.attention_dropout = config.attention_dropout --++++++ self.hidden_size = config.hidden_size --++++++ self.num_heads = config.num_attention_heads --++++++ self.head_dim = self.hidden_size // self.num_heads --++++++ self.num_key_value_heads = config.num_key_value_heads --++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++ self.max_position_embeddings = config.max_position_embeddings --++++++ self.rope_theta = config.rope_theta --++++++ self.is_causal = True --++++++ --++++++ if (self.head_dim * self.num_heads) != self.hidden_size: --++++++ raise ValueError( --++++++ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++++ f" and `num_heads`: {self.num_heads})." --++++++ ) --++++++ --++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias) --++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias) --++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.attention_bias) --++++++ --++++++ # This call will now succeed as all attributes are initialized. --++++++ self._init_rope() --++++++ --++++++ def _init_rope(self): --++++++ if self.config.rope_scaling is None: --++++++ self.rotary_emb = DeepseekRotaryEmbedding( --++++++ self.head_dim, --++++++ max_position_embeddings=self.max_position_embeddings, --++++++ base=self.rope_theta, --++++++ ) --++++++ else: --++++++ scaling_type = self.config.rope_scaling["type"] --++++++ scaling_factor = self.config.rope_scaling["factor"] --++++++ if scaling_type == "linear": --++++++ self.rotary_emb = DeepseekLinearScalingRotaryEmbedding( --++++++ self.head_dim, --++++++ max_position_embeddings=self.max_position_embeddings, --++++++ scaling_factor=scaling_factor, --++++++ base=self.rope_theta, --++++++ ) --++++++ elif scaling_type == "dynamic": --++++++ self.rotary_emb = DeepseekDynamicNTKScalingRotaryEmbedding( --++++++ self.head_dim, --++++++ max_position_embeddings=self.max_position_embeddings, --++++++ scaling_factor=scaling_factor, --++++++ base=self.rope_theta, --++++++ ) --++++++ else: --++++++ raise ValueError(f"Unknown RoPE scaling type {scaling_type}") --++++++ --++++++ def forward( --++++++ self, --++++++ hidden_states: mindspore.Tensor, --++++++ attention_mask: Optional[mindspore.Tensor] = None, --++++++ position_ids: Optional[mindspore.Tensor] = None, --++++++ past_key_value: Optional[Cache] = None, --++++++ output_attentions: bool = False, --++++++ use_cache: bool = False, --++++++ **kwargs, --++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ if "padding_mask" in kwargs: --++++++ warnings.warn( --++++++ "Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure use `attention_mask` instead.`" --++++++ ) --++++++ if output_attentions: --++++++ warnings.warn( --++++++ "`DeepseekFlashAttention` does not support `output_attentions=True`, attention weights will not be returned." --++++++ ) --++++++ --++++++ bsz, q_len, _ = hidden_states.shape --++++++ --++++++ if self.config.pretraining_tp > 1: --++++++ raise NotImplementedError("DeepseekFlashAttention does not support `pretraining_tp > 1`.") --++++++ --++++++ query_states = self.q_proj(hidden_states) --++++++ key_states = self.k_proj(hidden_states) --++++++ value_states = self.v_proj(hidden_states) --++++++ --++++++ # Reshape to BNSD format (Batch, Num_heads, Seq_len, Head_dim) --++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++ kv_seq_len = key_states.shape[-2] --++++++ if past_key_value is not None: --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++ # Apply Rotary Position Embedding --++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ if past_key_value is not None: --++++++ cache_kwargs = {"sin": sin, "cos": cos} --++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++ # For GQA/MQA, flash_attention_score in BNSD layout requires Q and KV to have the same number of heads. --++++++ # So we must explicitly repeat the KV heads. --++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++ --++++++ # Convert attention mask for flash_attention_score --++++++ # The operator expects a boolean mask where True means to MASK OUT/DISCARD. --++++++ if attention_mask is not None: --++++++ if attention_mask.shape != (bsz, 1, q_len, kv_seq_len): --++++++ raise ValueError( --++++++ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.shape}" --++++++ ) --++++++ attn_mask_for_fa = attention_mask < 0 --++++++ else: --++++++ attn_mask_for_fa = None --++++++ --++++++ keep_prob = 1.0 - self.attention_dropout if self.training else 1.0 --++++++ --++++++ # Call the fused operator using the efficient BNSD layout --++++++ attn_output = mindspore.ops.flash_attention_score( --++++++ query=query_states, --++++++ key=key_states, --++++++ value=value_states, --++++++ head_num=self.num_heads, --++++++ input_layout='BNSD', # Specify the correct layout --++++++ attn_mask=attn_mask_for_fa, --++++++ keep_prob=keep_prob, --++++++ scalar_value=1.0 / math.sqrt(self.head_dim) --++++++ ) --++++++ --++++++ # The output of FA is in BNSD format. We need to reshape it back to the expected (B, S, H) format. --++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ --++++++ # Apply output projection --++++++ attn_output = self.o_proj(attn_output) --++++++ --++++++ # Flash attention does not return attention weights, so we return None. --++++++ attn_weights = None --++++++ --++++++ return attn_output, attn_weights, past_key_value --++++++ --+++++ Deepseek_ATTENTION_CLASSES = { --+++++ "eager": DeepseekAttention, --++++++ "flash-attention": DeepseekFlashAttention, --+++++ } --+++++ --+++++ --+++++@@ -887,6 +1240,10 @@ class DeepseekDecoderLayer(nn.Module): --+++++ config=config, layer_idx=layer_idx --+++++ ) --+++++ --++++++ self.self_attn = Deepseek_ATTENTION_CLASSES["flash-attention"]( --++++++ config=config, layer_idx=layer_idx --++++++ ) --++++++ --+++++ self.mlp = ( --+++++ DeepseekMoE(config) --+++++ if ( --+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++index d4c6b651..bced285c 100644 --+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++@@ -23,7 +23,7 @@ from typing import List, Optional, Tuple, Union --+++++ --+++++ import mindspore --+++++ import mindnlp.core.nn.functional as F --+++++-from mindnlp.core import nn, ops --++++++from mindnlp.core import nn, ops, no_grad --+++++ from mindnlp.core.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss --+++++ --+++++ from ....common.activations import ACT2FN --+++++@@ -45,6 +45,8 @@ logger = logging.get_logger(__name__) --+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+++++ --++++++Long_Prompt = False --++++++PROMPT_LENGTH_THRESHOLD = 128 --+++++ --+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+++++ def _prepare_4d_causal_attention_mask_with_cache_position( --+++++@@ -473,35 +475,279 @@ class Qwen2MoeAttention(nn.Module): --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++ --++++++# class Qwen2MoeFlashAttention(nn.Module): --++++++# """ --++++++# Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --++++++# 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --++++++ --++++++# 关键改动: --++++++# 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --++++++# 直接传入原始的 key 和 value 张量效率更高。 --++++++# 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --++++++# 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++++# super().__init__() --++++++# self.config = config --++++++# self.layer_idx = layer_idx --++++++# self.hidden_size = config.hidden_size --++++++# self.num_heads = config.num_attention_heads --++++++# self.head_dim = self.hidden_size // self.num_heads --++++++# self.num_key_value_heads = config.num_key_value_heads --++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++# self.max_position_embeddings = config.max_position_embeddings --++++++# self.rope_theta = config.rope_theta --++++++# self.attention_dropout = config.attention_dropout --++++++ --++++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++++# raise ValueError( --++++++# f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --++++++# ) --++++++ --++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++++ --++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++++# self.head_dim, --++++++# max_position_embeddings=self.max_position_embeddings, --++++++# base=self.rope_theta, --++++++# ) --++++++ --++++++# def forward( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# attention_mask: Optional[mindspore.Tensor] = None, --++++++# position_ids: Optional[mindspore.Tensor] = None, --++++++# past_key_value: Optional[Cache] = None, --++++++# output_attentions: bool = False, --++++++# use_cache: bool = False, --++++++# cache_position: Optional[mindspore.Tensor] = None, --++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++# bsz, q_len, _ = hidden_states.shape --++++++ --++++++# # 1. 线性投射 Q, K, V --++++++# query_states = self.q_proj(hidden_states) --++++++# key_states = self.k_proj(hidden_states) --++++++# value_states = self.v_proj(hidden_states) --++++++ --++++++# # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++# # query: [B, S, H*D] -> [B, N1, S, D] --++++++# # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++++# query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++# # 3. RoPE 旋转位置编码 --++++++# kv_seq_len = key_states.shape[-2] --++++++# if past_key_value is not None: --++++++# if self.layer_idx is None: --++++++# raise ValueError( --++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++# "with a layer index." --++++++# ) --++++++# # 对于 StaticCache,需要特殊处理 kv_seq_len --++++++# # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++# # 使用 cache_position 的长度来确定实际的 kv_seq_len --++++++# # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --++++++# # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --++++++# # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --++++++# # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --++++++# # 临时解决方案:使用 cache_position 的最大值(如果可能) --++++++# # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --++++++# past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --++++++# if cache_position.shape[0] == 1: --++++++# # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --++++++# # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --++++++# kv_seq_len = past_seen_tokens + 1 --++++++# else: --++++++# # prefill 阶段:cache_position 是范围,使用其长度 --++++++# kv_seq_len = cache_position.shape[0] + past_seen_tokens --++++++# else: --++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++# # 4. KV 缓存更新 --++++++# if past_key_value is not None: --++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++# key_states, value_states = past_key_value.update( --++++++# key_states, value_states, self.layer_idx, cache_kwargs --++++++# ) --++++++ --++++++# # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --++++++# # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --++++++# if isinstance(past_key_value, StaticCache) and cache_position is not None: --++++++# if cache_position.shape[0] == 1: --++++++# # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --++++++# kv_seq_len = key_states.shape[-2] --++++++ --++++++# # 5. [重要] 准备 Attention Mask --++++++# # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --++++++# # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++++# fa_attention_mask = None --++++++# if attention_mask is not None: --++++++# # 截取与当前key长度匹配的部分 --++++++# # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --++++++# # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --++++++# mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++# # 转换为布尔类型: 大负数 -> True, 0 -> False --++++++# fa_attention_mask = (mask_slice != 0) --++++++ --++++++# # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --++++++# input_dtype = query_states.dtype --++++++# if input_dtype not in (mindspore.float16, mindspore.bfloat16): --++++++# # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --++++++# query_states = query_states.to(mindspore.float16) --++++++# key_states = key_states.to(mindspore.float16) --++++++# value_states = value_states.to(mindspore.float16) --++++++ --++++++# # 6. [核心] 调用 flash_attention_score 算子 --++++++# # - 无需手动 repeat_kv, 算子原生支持 GQA --++++++# # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++++# attn_output = mindspore.ops.flash_attention_score( --++++++# query=query_states, --++++++# key=key_states, --++++++# value=value_states, --++++++# head_num=self.num_heads, # 传入Q的头数(N1) --++++++# attn_mask=fa_attention_mask, --++++++# keep_prob=1.0 - self.attention_dropout, --++++++# scalar_value=1.0 / math.sqrt(self.head_dim), --++++++# input_layout="BNSD", --++++++# sparse_mode=0 # 使用 defaultMask 模式 --++++++# ) --++++++ --++++++# # 恢复原始数据类型 --++++++# attn_output = attn_output.to(input_dtype) --++++++ --++++++# # 7. 调整输出形状 --++++++# # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++++# attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++# attn_output = self.o_proj(attn_output) --++++++ --++++++# # FlashAttention 算子不直接返回注意力权重矩阵 --++++++# attn_weights = None --++++++# if output_attentions: --++++++# logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++# return attn_output, attn_weights, past_key_value --++++++ --++++++# # def forward( --++++++# # self, --++++++# # hidden_states: mindspore.Tensor, --++++++# # attention_mask: Optional[mindspore.Tensor] = None, --++++++# # position_ids: Optional[mindspore.Tensor] = None, --++++++# # past_key_value: Optional[Cache] = None, --++++++# # output_attentions: bool = False, --++++++# # use_cache: bool = False, --++++++# # cache_position: Optional[mindspore.Tensor] = None, --++++++# # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++# # bsz, q_len, _ = hidden_states.shape --++++++ --++++++# # # 1. 线性投射 Q, K, V --++++++# # query_states = self.q_proj(hidden_states) --++++++# # key_states = self.k_proj(hidden_states) --++++++# # value_states = self.v_proj(hidden_states) --++++++ --++++++# # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --++++++# # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++# # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --++++++# # # 3. RoPE 旋转位置编码 --++++++# # kv_seq_len = key_states.shape[-2] --++++++# # if past_key_value is not None: --++++++# # if self.layer_idx is None: --++++++# # raise ValueError( --++++++# # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++# # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++# # "with a layer index." --++++++# # ) --++++++# # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --++++++# # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++# # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++# # # 4. KV 缓存更新 --++++++# # if past_key_value is not None: --++++++# # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --++++++# # key_states, value_states = past_key_value.update( --++++++# # key_states, value_states, self.layer_idx, cache_kwargs --++++++# # ) --++++++ --++++++# # # 5. 准备 Attention Mask --++++++# # fa_attention_mask = None --++++++# # if attention_mask is not None: --++++++# # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++# # fa_attention_mask = (mask_slice != 0) --++++++ --++++++# # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --++++++# # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --++++++# # input_dtype = query_states.dtype --++++++ --++++++# # # 6. [核心] 调用 flash_attention_score 算子 --++++++# # attn_output = mindspore.ops.flash_attention_score( --++++++# # query=query_states, --++++++# # key=key_states, --++++++# # value=value_states, --++++++# # head_num=self.num_heads, --++++++# # attn_mask=fa_attention_mask, --++++++# # keep_prob=1.0 - self.attention_dropout, --++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++# # input_layout="BNSD", --++++++# # sparse_mode=0, --++++++# # # <--- 修改点 2: 启用内部高精度计算 --- --++++++# # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --++++++# # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --++++++# # inner_precise=1 --++++++# # ) --++++++ --++++++# # # 恢复原始数据类型 --++++++# # attn_output = attn_output.to(input_dtype) --++++++ --++++++# # # 7. 调整输出形状 --++++++# # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++# # attn_output = self.o_proj(attn_output) --++++++ --++++++# # attn_weights = None --++++++# # if output_attentions: --++++++# # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ --++++++# # return attn_output, attn_weights, past_key_value --++++++ --++++++ --+++++ class Qwen2MoeFlashAttention(nn.Module): --+++++ """ --+++++- Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++- 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++- --+++++- 关键改动: --+++++- 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++- 直接传入原始的 key 和 value 张量效率更高。 --+++++- 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++- 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --++++++ Qwen2MoeAttention 的 Flash Attention **纯速度优化**版本。 --++++++ --++++++ 此版本将 `mindspore.ops.flash_attention_score` 的 `inner_precise` --++++++ 参数设置为 0,关闭内部高精度累加。这将在硬件允许的情况下, --++++++ 完全使用模型的低精度数据类型(如 float16)进行计算, --++++++ 以达到理论上的最高执行速度。 --+++++ """ --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++ super().__init__() --+++++ self.config = config --+++++ self.layer_idx = layer_idx --++++++ if layer_idx is None: --++++++ logger.warning_once( --++++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended." --++++++ ) --++++++ --+++++ self.hidden_size = config.hidden_size --+++++ self.num_heads = config.num_attention_heads --+++++ self.head_dim = self.hidden_size // self.num_heads --+++++ self.num_key_value_heads = config.num_key_value_heads --+++++- self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++ self.max_position_embeddings = config.max_position_embeddings --+++++ self.rope_theta = config.rope_theta --+++++ self.attention_dropout = config.attention_dropout --+++++ --+++++- if (self.head_dim * self.num_heads) != self.hidden_size: --+++++- raise ValueError( --+++++- f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++- ) --+++++- --+++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++@@ -531,351 +777,834 @@ class Qwen2MoeFlashAttention(nn.Module): --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++- # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++- # query: [B, S, H*D] -> [B, N1, S, D] --+++++- # key/val: [B, S, H2*D] -> [B, N2, S, D] --++++++ # 2. 调整形状以匹配 BNSD 布局 --+++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- --+++++- # 3. RoPE 旋转位置编码 --++++++ --++++++ # 3. RoPE 和 KV 缓存 --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++- if self.layer_idx is None: --+++++- raise ValueError( --+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++- "with a layer index." --+++++- ) --+++++- # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++- # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++- # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++- # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++- # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++- # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++- # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++- # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++- # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++- past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++- if cache_position.shape[0] == 1: --+++++- # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++- # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++- kv_seq_len = past_seen_tokens + 1 --+++++- else: --+++++- # prefill 阶段:cache_position 是范围,使用其长度 --+++++- kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++- else: --+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++- --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++- # 4. KV 缓存更新 --+++++ if past_key_value is not None: --+++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++- key_states, value_states = past_key_value.update( --+++++- key_states, value_states, self.layer_idx, cache_kwargs --+++++- ) --+++++- --+++++- # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++- # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++- if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++- if cache_position.shape[0] == 1: --+++++- # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++- kv_seq_len = key_states.shape[-2] --+++++- --+++++- # 5. [重要] 准备 Attention Mask --+++++- # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++- # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++ # 4. 准备 Attention Mask --+++++ fa_attention_mask = None --+++++ if attention_mask is not None: --+++++- # 截取与当前key长度匹配的部分 --+++++- # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++- # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++- # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++ fa_attention_mask = (mask_slice != 0) --+++++ --+++++- # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++- input_dtype = query_states.dtype --+++++- if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++- # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++- query_states = query_states.to(mindspore.float16) --+++++- key_states = key_states.to(mindspore.float16) --+++++- value_states = value_states.to(mindspore.float16) --+++++- --+++++- # 6. [核心] 调用 flash_attention_score 算子 --+++++- # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++- # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --++++++ # 5. 【核心】调用 flash_attention_score,关闭高精度累加 --+++++ attn_output = mindspore.ops.flash_attention_score( --+++++ query=query_states, --+++++ key=key_states, --+++++ value=value_states, --+++++- head_num=self.num_heads, # 传入Q的头数(N1) --++++++ head_num=self.num_heads, --+++++ attn_mask=fa_attention_mask, --+++++- keep_prob=1.0 - self.attention_dropout, --++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, # 推理时关闭dropout --+++++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++ input_layout="BNSD", --+++++- sparse_mode=0 # 使用 defaultMask 模式 --++++++ sparse_mode=0, --++++++ inner_precise=0 # 【关键改动】设置为0,关闭内部FP32计算,追求最快速度 --+++++ ) --+++++ --+++++- # 恢复原始数据类型 --+++++- attn_output = attn_output.to(input_dtype) --+++++- --+++++- # 7. 调整输出形状 --+++++- # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --++++++ # 6. 调整输出形状 --+++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++ attn_output = self.o_proj(attn_output) --+++++ --+++++- # FlashAttention 算子不直接返回注意力权重矩阵 --++++++ # 7. 返回结果 --+++++ attn_weights = None --+++++ if output_attentions: --+++++- logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++ logger.warning_once("`Qwen2MoeFlashAttention` is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++- # def forward( --+++++- # self, --+++++- # hidden_states: mindspore.Tensor, --+++++- # attention_mask: Optional[mindspore.Tensor] = None, --+++++- # position_ids: Optional[mindspore.Tensor] = None, --+++++- # past_key_value: Optional[Cache] = None, --+++++- # output_attentions: bool = False, --+++++- # use_cache: bool = False, --+++++- # cache_position: Optional[mindspore.Tensor] = None, --+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++- --+++++- # bsz, q_len, _ = hidden_states.shape --+++++- --+++++- # # 1. 线性投射 Q, K, V --+++++- # query_states = self.q_proj(hidden_states) --+++++- # key_states = self.k_proj(hidden_states) --+++++- # value_states = self.v_proj(hidden_states) --+++++- --+++++- # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- --+++++- # # 3. RoPE 旋转位置编码 --+++++- # kv_seq_len = key_states.shape[-2] --+++++- # if past_key_value is not None: --+++++- # if self.layer_idx is None: --+++++- # raise ValueError( --+++++- # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++- # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++- # "with a layer index." --+++++- # ) --+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++ --+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++- --+++++- # # 4. KV 缓存更新 --+++++- # if past_key_value is not None: --+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++- # key_states, value_states = past_key_value.update( --+++++- # key_states, value_states, self.layer_idx, cache_kwargs --+++++- # ) --+++++- --+++++- # # 5. 准备 Attention Mask --+++++- # fa_attention_mask = None --+++++- # if attention_mask is not None: --+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++- # fa_attention_mask = (mask_slice != 0) --+++++- --+++++- # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++- # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++- # input_dtype = query_states.dtype --+++++- --+++++- # # 6. [核心] 调用 flash_attention_score 算子 --+++++- # attn_output = mindspore.ops.flash_attention_score( --+++++- # query=query_states, --+++++- # key=key_states, --+++++- # value=value_states, --+++++- # head_num=self.num_heads, --+++++- # attn_mask=fa_attention_mask, --+++++- # keep_prob=1.0 - self.attention_dropout, --+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++- # input_layout="BNSD", --+++++- # sparse_mode=0, --+++++- # # <--- 修改点 2: 启用内部高精度计算 --- --+++++- # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++- # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++- # inner_precise=1 --+++++- # ) --+++++- --+++++- # # 恢复原始数据类型 --+++++- # attn_output = attn_output.to(input_dtype) --++++++QWEN2MOE_ATTENTION_CLASSES = { --++++++ "eager": Qwen2MoeAttention, --++++++ "flash-attention": Qwen2MoeFlashAttention, --++++++} --+++++ --+++++- # # 7. 调整输出形状 --+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++- # attn_output = self.o_proj(attn_output) --+++++ --+++++- # attn_weights = None --+++++- # if output_attentions: --+++++- # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# def __init__(self, config): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# # gating --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++ --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# #@dwj --++++++# # 只遍历激活的专家,而非全部专家 --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# num_tokens = hidden_states_reshaped.shape[0] --++++++ --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++ --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++# final_hidden_states = ops.zeros_like(hidden_states_reshaped) --++++++# flat_selected_experts = selected_experts.flatten() --++++++ --++++++# unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --++++++# broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --++++++# token_indices = broadcasted_token_indices.flatten() --++++++ --++++++# active_experts = ops.unique(flat_selected_experts) --++++++ --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++ --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# selected_token_indices = token_indices[mask] --++++++# selected_routing_weights = routing_weights.flatten()[mask] --++++++ --++++++# current_states = hidden_states_reshaped[selected_token_indices] --++++++ --++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++ --++++++# final_hidden_states = final_hidden_states.index_add( --++++++# dim=0, --++++++# index=selected_token_indices, --++++++# source=expert_output.to(hidden_states.dtype) --++++++# ) --++++++ --++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++++# shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++++ --+++++- # return attn_output, attn_weights, past_key_value --++++++# final_hidden_states = final_hidden_states + shared_expert_output --++++++# final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --++++++ --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# """ --++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++++# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --++++++# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --++++++# `_moe_infer_prefill` (用于长序列处理) 方法。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# # 门控网络 --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# # 专家列表 --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++# # 共享专家 --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_decode( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# """ --++++++# 【解码路径】针对 sequence_length=1 的极致优化。 --++++++# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --++++++# """ --++++++# batch_size, hidden_dim = hidden_states.shape --++++++ --++++++# expert_outputs_list = [ --++++++# ops.cat([ --++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++++# ], dim=0) --++++++# for i in range(batch_size) --++++++# ] --++++++ --++++++# # --- 错误修复:将 axis=0 修改为 dim=0 --- --++++++# # shape: (batch_size, top_k, hidden_dim) --++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++++ --++++++# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++++ --++++++# return moe_output.squeeze(1) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_prefill( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# """ --++++++# 【预填充路径】针对 sequence_length > 1 的优化。 --++++++# 按专家对 Token 进行分组,并进行批处理。 --++++++# """ --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens = hidden_states.shape[0] --++++++# flat_selected_experts = selected_experts.flatten() --++++++ --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++ --++++++# active_experts = ops.unique(flat_selected_experts) --++++++ --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++ --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# selected_token_indices = token_indices[mask] --++++++# selected_routing_weights = routing_weights.flatten()[mask] --++++++ --++++++# current_states = hidden_states[selected_token_indices] --++++++ --++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++ --++++++# moe_output = moe_output.index_add( --++++++# dim=0, --++++++# index=selected_token_indices, --++++++# source=expert_output.to(hidden_states.dtype) --++++++# ) --++++++# return moe_output --++++++ --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# """ --++++++# 顶层 forward 方法,作为智能分发器。 --++++++# """ --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++- # def forward( --+++++- # self, --+++++- # hidden_states: mindspore.Tensor, --+++++- # attention_mask: Optional[mindspore.Tensor] = None, --+++++- # position_ids: Optional[mindspore.Tensor] = None, --+++++- # past_key_value: Optional[Cache] = None, --+++++- # output_attentions: bool = False, --+++++- # use_cache: bool = False, --+++++- # cache_position: Optional[mindspore.Tensor] = None, --+++++- # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++- --+++++- # bsz, q_len, _ = hidden_states.shape --+++++- --+++++- # query_states = self.q_proj(hidden_states) --+++++- # key_states = self.k_proj(hidden_states) --+++++- # value_states = self.v_proj(hidden_states) --+++++- --+++++- # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++- --+++++- # kv_seq_len = key_states.shape[-2] --+++++- # if past_key_value is not None: --+++++- # if self.layer_idx is None: --+++++- # raise ValueError("`layer_idx` must be specified for caching") --+++++- # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++- --+++++- # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++- # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++- --+++++- # if past_key_value is not None: --+++++- # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++- # key_states, value_states = past_key_value.update( --+++++- # key_states, value_states, self.layer_idx, cache_kwargs --+++++- # ) --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++# moe_output = None --++++++# # 在推理时,根据序列长度选择最优路径 --++++++# if not self.training: --++++++# if sequence_length == 1: --++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++++# else: --++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++++# else: --++++++# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --++++++# raise NotImplementedError("Training path is not implemented.") --++++++ --++++++# shared_expert_output = self.shared_expert(hidden_states_reshaped) --++++++# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --++++++# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --++++++ --++++++# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --++++++ --++++++# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --++++++ --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# """ --++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++++# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# # 门控网络 --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# # 专家列表 --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++# # 共享专家 --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_decode( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# batch_size, _ = hidden_states.shape --++++++# expert_outputs_list = [ --++++++# ops.cat([ --++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++++# ], dim=0) --++++++# for i in range(batch_size) --++++++# ] --++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++++# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++++# return moe_output.squeeze(1) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_prefill( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens = hidden_states.shape[0] --++++++# flat_selected_experts = selected_experts.flatten() --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++# active_experts = ops.unique(flat_selected_experts) --++++++ --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# selected_token_indices = token_indices[mask] --++++++# selected_routing_weights = routing_weights.flatten()[mask] --++++++# current_states = hidden_states[selected_token_indices] --++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++# moe_output = moe_output.index_add( --++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++++# ) --++++++# return moe_output --++++++ --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# """ --++++++# 顶层 forward 方法,作为智能分发器。 --++++++# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --++++++# """ --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ --++++++# # 1. 门控计算 (通用逻辑) --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++ --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++# # 2. 智能分发到最优 MoE 路径 --++++++# moe_output = None --++++++# if not self.training: --++++++# if sequence_length == 1: --++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --++++++# else: --++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --++++++# else: --++++++# raise NotImplementedError("Training path is not implemented.") --++++++ --++++++# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --++++++# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++++ --++++++# # 4. 合并 MoE 输出和共享专家输出 --++++++# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++++ --++++++# # 5. 恢复原始形状并返回 --++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --++++++# prefill fastest --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# """ --++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++++# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --++++++# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# # 门控网络 --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# # 专家列表 --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++# # 共享专家 --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_dispatch( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# """ --++++++# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --++++++# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --++++++# """ --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens, _ = hidden_states.shape --++++++ --++++++# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --++++++# flat_selected_experts = selected_experts.flatten() --++++++# flat_routing_weights = routing_weights.flatten() --+++++ --+++++- # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++- # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++- --+++++- # # <--- 核心修改点: 手动进行高精度缩放 --- --+++++- # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++++- # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++++- # query_states = query_states / math.sqrt(self.head_dim) --+++++- # # <--- 修改结束 --- --+++++- --+++++- # fa_attention_mask = None --+++++- # if attention_mask is not None: --+++++- # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++- # fa_attention_mask = (mask_slice != 0) --+++++- --+++++- # input_dtype = query_states.dtype --+++++- --+++++- # attn_output = mindspore.ops.flash_attention_score( --+++++- # query=query_states, # 传入已经预先缩放过的 query --+++++- # key=key_states, --+++++- # value=value_states, --+++++- # head_num=self.num_heads, --+++++- # attn_mask=fa_attention_mask, --+++++- # keep_prob=1.0 - self.attention_dropout, --+++++- # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++++- # input_layout="BNSD", --+++++- # sparse_mode=0, --+++++- # inner_precise=1 # 仍然保持内部高精度计算 --+++++- # ) --++++++# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++ --+++++- # attn_output = attn_output.to(input_dtype) --+++++- # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++- # attn_output = self.o_proj(attn_output) --++++++# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --++++++# active_experts = ops.unique(flat_selected_experts) --++++++ --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++ --++++++# # 找到所有分配给该专家的 token --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++ --++++++# # 使用 mask 选取对应的 token 和权重 --++++++# current_token_indices = token_indices[mask] --++++++# current_routing_weights = flat_routing_weights[mask] --++++++# current_hidden_states = hidden_states[current_token_indices] --++++++ --++++++# # 对这些 token 进行批处理 --++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++++ --++++++# # 使用 index_add 将结果精确地加回到对应位置 --++++++# moe_output = moe_output.index_add( --++++++# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --++++++# ) --++++++# return moe_output --++++++ --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# """ --++++++# 顶层 forward 方法,作为智能分发器。 --++++++# """ --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ --++++++# # 1. 门控计算 --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++ --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++# routing_weights = routing_weights.to(hidden_states.dtype) --++++++ --++++++# # 2. 调用统一的 MoE 计算内核 --++++++# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --++++++# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+++++ --+++++- # attn_weights = None --+++++- # if output_attentions: --+++++- # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --++++++# # 3. 统一处理共享专家 --++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++++ --++++++# # 4. 合并输出 --++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++++ --++++++# # 5. 恢复原始形状并返回 --++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --++++++ --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# """ --++++++# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --++++++# 【最终高性能与高精度版】: --++++++# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --++++++# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --++++++# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --++++++# 3. 这样实现了速度和准确性的两全其美。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_decode( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# """ --++++++# 【解码路径】极致优化版:bmm + 高精度累加。 --++++++# """ --++++++# original_dtype = hidden_states.dtype --++++++# batch_size, _ = hidden_states.shape --++++++ --++++++# expert_outputs_list = [ --++++++# ops.cat([ --++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++++# ], dim=0) --++++++# for i in range(batch_size) --++++++# ] --++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++++ --++++++# # 在 float32 下执行 bmm,得到高精度结果 --++++++# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --++++++ --++++++# # 将高精度结果转换回原始数据类型 --++++++# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --++++++ --++++++# return moe_output --++++++ --++++++# @no_grad() --++++++# def _moe_infer_prefill( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# selected_experts: mindspore.Tensor, --++++++# routing_weights: mindspore.Tensor --++++++# ) -> mindspore.Tensor: --++++++# """ --++++++# 【预填充路径】与原始实现一致,结果精确。 --++++++# """ --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens, _ = hidden_states.shape --++++++# flat_selected_experts = selected_experts.flatten() --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++# active_experts = ops.unique(flat_selected_experts) --++++++ --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# selected_token_indices = token_indices[mask] --++++++# selected_routing_weights = routing_weights.flatten()[mask] --++++++# current_states = hidden_states[selected_token_indices] --++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++# moe_output = moe_output.index_add( --++++++# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --++++++# ) --++++++# return moe_output --++++++ --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++ --+++++- # return attn_output, attn_weights, past_key_value --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --++++++# # 如果模型主体是 float16,后续再转换 --++++++ --++++++# moe_output = None --++++++# if not self.training: --++++++# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --++++++# # _moe_infer_decode 内部会处理好类型转换 --++++++# temp_routing_weights = routing_weights.to(hidden_states.dtype) --++++++# if sequence_length == 1: --++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++++# else: --++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --++++++# else: --++++++# raise NotImplementedError("Training path is not implemented.") --++++++ --++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++++ --++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --+++++ --+++++-QWEN2MOE_ATTENTION_CLASSES = { --+++++- "eager": Qwen2MoeAttention, --+++++- "flash-attention": Qwen2MoeFlashAttention, --+++++-} --++++++# class Qwen2MoeSparseMoeBlock(nn.Module): --++++++# """ --++++++# 【融合版】一个混合专家模块,内置两种推理策略, --++++++# 由外部全局变量 `Long_Prompt` 控制: --++++++ --++++++# - if Long_Prompt is True: 【精度优先模式】 --++++++# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --++++++# 适用于处理长序列,避免误差累积。 --++++++ --++++++# - if Long_Prompt is False: 【速度优先模式】 --++++++# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --++++++# 在解码阶段获得极致速度,同时保证结果高度准确。 --++++++# """ --++++++# def __init__(self, config: Qwen2MoeConfig): --++++++# super().__init__() --++++++# self.num_experts = config.num_experts --++++++# self.top_k = config.num_experts_per_tok --++++++# self.norm_topk_prob = config.norm_topk_prob --++++++ --++++++# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --++++++# self.experts = nn.ModuleList( --++++++# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --++++++# ) --++++++# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --++++++# # --- 速度优先模式的辅助函数 --- --++++++# @no_grad() --++++++# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++# original_dtype = hidden_states.dtype --++++++# batch_size, _ = hidden_states.shape --++++++# expert_outputs_list = [ --++++++# ops.cat([ --++++++# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++++# ], dim=0) --++++++# for i in range(batch_size) --++++++# ] --++++++# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++++# weights_fp32 = routing_weights.to(mindspore.float32) --++++++# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++++++# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++++# return moe_output_fp32.squeeze(1).to(original_dtype) --++++++ --++++++# @no_grad() --++++++# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens, _ = hidden_states.shape --++++++# flat_selected_experts = selected_experts.flatten() --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++# active_experts = ops.unique(flat_selected_experts) --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# selected_token_indices = token_indices[mask] --++++++# selected_routing_weights = routing_weights.flatten()[mask] --++++++# current_states = hidden_states[selected_token_indices] --++++++# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --++++++# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --++++++# return moe_output --++++++ --++++++# # --- 精度优先模式的辅助函数 --- --++++++# @no_grad() --++++++# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++# moe_output = ops.zeros_like(hidden_states) --++++++# num_tokens, _ = hidden_states.shape --++++++# flat_selected_experts = selected_experts.flatten() --++++++# flat_routing_weights = routing_weights.flatten() --++++++# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++# active_experts = ops.unique(flat_selected_experts) --++++++# for expert_idx_tensor in active_experts: --++++++# expert_idx = expert_idx_tensor.item() --++++++# expert_layer = self.experts[expert_idx] --++++++# mask = (flat_selected_experts == expert_idx_tensor) --++++++# current_token_indices = token_indices[mask] --++++++# current_routing_weights = flat_routing_weights[mask] --++++++# current_hidden_states = hidden_states[current_token_indices] --++++++# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++++# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++++++# return moe_output --++++++ --++++++# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++# # 声明我们将要使用一个在模块外部定义的全局变量 --++++++# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --++++++# global Long_Prompt --++++++ --++++++# # 1. 门控计算 (所有模式通用) --++++++# batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++# router_logits = self.gate(hidden_states_reshaped) --++++++# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++++++# if self.norm_topk_prob: --++++++# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++# moe_output = None --++++++# if not self.training: --++++++# # 根据 Long_Prompt 标志选择模式 --++++++# if Long_Prompt: --++++++# # --- 精度优先模式 --- --++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++# else: --++++++# # --- 速度优先模式 --- --++++++# routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++# if sequence_length == 1: --++++++# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++# else: --++++++# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++# else: --++++++# raise NotImplementedError("Training path is not implemented.") --++++++ --++++++# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++++# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++++ --++++++# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++++# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++# return final_hidden_states, router_logits --++++++ --++++++class Qwen2MoeSparseMoeBlock(nn.Module): --++++++ """ --++++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --++++++ 控制的顶级推理策略: --+++++ --++++++ - if Long_Prompt is True: 【精度优先模式】 --++++++ 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配原始逻辑。 --++++++ 适用于需要严格可复现性的长序列任务。 --+++++ --+++++-class Qwen2MoeSparseMoeBlock(nn.Module): --+++++- def __init__(self, config): --++++++ - if Long_Prompt is False: 【速度优先模式】 --++++++ 采用业界最强的性能组合: --++++++ - Prefill 阶段: 使用 DeepSeek 的“全局-排序-切片”策略,速度最快。 --++++++ - Decode 阶段: 使用“bmm+高精度累加”策略,兼顾速度与准确性。 --++++++ """ --++++++ def __init__(self, config: Qwen2MoeConfig): --+++++ super().__init__() --+++++ self.num_experts = config.num_experts --+++++ self.top_k = config.num_experts_per_tok --+++++ self.norm_topk_prob = config.norm_topk_prob --+++++ --+++++- # gating --+++++ self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++ self.experts = nn.ModuleList( --+++++ [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++ ) --+++++- --+++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++ --+++++- #@dwj --+++++- # 只遍历激活的专家,而非全部专家 --+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++- num_tokens = hidden_states_reshaped.shape[0] --+++++- --+++++- router_logits = self.gate(hidden_states_reshaped) --+++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- --+++++- if self.norm_topk_prob: --+++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++- final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++- flat_selected_experts = selected_experts.flatten() --+++++- --+++++- unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++- broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++- token_indices = broadcasted_token_indices.flatten() --+++++- --+++++- active_experts = ops.unique(flat_selected_experts) --+++++- --+++++- for expert_idx_tensor in active_experts: --+++++- expert_idx = expert_idx_tensor.item() --+++++- expert_layer = self.experts[expert_idx] --+++++- --+++++- mask = (flat_selected_experts == expert_idx_tensor) --+++++- selected_token_indices = token_indices[mask] --+++++- selected_routing_weights = routing_weights.flatten()[mask] --+++++- --+++++- current_states = hidden_states_reshaped[selected_token_indices] --+++++- --+++++- expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++- --+++++- final_hidden_states = final_hidden_states.index_add( --+++++- dim=0, --+++++- index=selected_token_indices, --+++++- source=expert_output.to(hidden_states.dtype) --+++++- ) --+++++- --+++++- shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++++ # --- 速度优先模式 (SPEED MODE) 的辅助函数 --- --++++++ @no_grad() --++++++ def _moe_infer_decode_fast(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++ original_dtype = hidden_states.dtype --++++++ batch_size, _ = hidden_states.shape --++++++ expert_outputs_list = [ --++++++ ops.cat([ --++++++ self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --++++++ ], dim=0) --++++++ for i in range(batch_size) --++++++ ] --++++++ expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --++++++ weights_fp32 = routing_weights.to(mindspore.float32) --++++++ outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --++++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --++++++ return moe_output_fp32.squeeze(1).to(original_dtype) --++++++ --++++++ @no_grad() --++++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++ num_tokens, _ = hidden_states.shape --++++++ flat_selected_experts = selected_experts.flatten() --++++++ sorted_expert_indices = flat_selected_experts.argsort() --++++++ tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++++++ original_token_indices = sorted_expert_indices // self.top_k --++++++ moe_output = ops.zeros_like(hidden_states) --++++++ current_token_offset = 0 --++++++ for i in range(self.num_experts): --++++++ expert_token_count = tokens_per_expert[i] - current_token_offset --++++++ if expert_token_count == 0: --++++++ continue --++++++ end_offset = current_token_offset + expert_token_count --++++++ expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++++++ expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++++++ expert_hidden_states = hidden_states[expert_original_token_indices] --++++++ expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++++++ expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++++++ moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++++++ current_token_offset += expert_token_count --++++++ return moe_output --++++++ --++++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --++++++ @no_grad() --++++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++ moe_output = ops.zeros_like(hidden_states) --++++++ num_tokens, _ = hidden_states.shape --++++++ flat_selected_experts = selected_experts.flatten() --++++++ flat_routing_weights = routing_weights.flatten() --++++++ token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --++++++ active_experts = ops.unique(flat_selected_experts) --++++++ for expert_idx_tensor in active_experts: --++++++ expert_idx = expert_idx_tensor.item() --++++++ expert_layer = self.experts[expert_idx] --++++++ mask = (flat_selected_experts == expert_idx_tensor) --++++++ current_token_indices = token_indices[mask] --++++++ current_routing_weights = flat_routing_weights[mask] --++++++ current_hidden_states = hidden_states[current_token_indices] --++++++ expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --++++++ moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --++++++ return moe_output --+++++ --+++++- final_hidden_states = final_hidden_states + shared_expert_output --+++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++- --+++++- return final_hidden_states, router_logits --++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++ global Long_Prompt --++++++ --++++++ # 1. 门控计算 (所有模式通用) --++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --++++++ router_logits = self.gate(hidden_states_reshaped) --++++++ routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++ routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --++++++ if self.norm_topk_prob: --++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++ --++++++ moe_output = None --++++++ if Long_Prompt: --++++++ # --- 精度优先模式 (ACCURACY MODE) --- --++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++ moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ else: --++++++ # --- 速度优先模式 (SPEED MODE) --- --++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++ if sequence_length == 1: --++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ else: --++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ --+++++ --++++++ # 3. 共享专家计算与合并 (所有模式通用) --++++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --++++++ F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --++++++ --++++++ final_hidden_states_reshaped = moe_output + gated_shared_expert_output --++++++ final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --++++++ --++++++ return final_hidden_states, router_logits --+++++ --+++++ class Qwen2MoeDecoderLayer(nn.Module): --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+++++ super().__init__() --+++++ self.hidden_size = config.hidden_size --++++++ --++++++ # if Long_Prompt: --++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++++ # else: --++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++ --+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ --+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++- --+++++ if (layer_idx not in config.mlp_only_layers) and ( --+++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++++ ): --+++++@@ -1288,6 +2017,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ self._warmed_up = True --+++++ self.warmup_moe_model() --+++++ --++++++ --++++++ --+++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++++ output_router_logits = ( --+++++ output_router_logits if output_router_logits is not None else self.config.output_router_logits --+++++@@ -1355,6 +2086,27 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ router_logits=outputs.router_logits, --+++++ ) --+++++ --++++++ def generate(self, *args, **kwargs): --++++++ """ --++++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --++++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --++++++ """ --++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++++++ --++++++ input_ids = kwargs.get("input_ids") --++++++ if input_ids is None and args: --++++++ input_ids = args[0] --++++++ --++++++ if input_ids is not None: --++++++ prompt_length = input_ids.shape[1] --++++++ --++++++ if prompt_length > PROMPT_LENGTH_THRESHOLD: --++++++ Long_Prompt = True --++++++ else: --++++++ Long_Prompt = False --++++++ --++++++ return super().generate(*args, **kwargs) --++++++ --+++++ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.prepare_inputs_for_generation --+++++ def prepare_inputs_for_generation( --+++++ self, --+++++@@ -1370,6 +2122,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ # If we have cache: let's slice `input_ids` through `cache_position`, to keep only the unprocessed tokens --+++++ # Exception 1: when passing input_embeds, input_ids may be missing entries --+++++ # Exception 2: some generation methods do special slicing of input_ids, so we don't need to do it here --++++++ --+++++ if past_key_values is not None: --+++++ if inputs_embeds is not None: # Exception 1 --+++++ if 0 not in input_ids.shape: --+++++@@ -1421,6 +2174,7 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ } --+++++ ) --+++++ return model_inputs --++++++ --+++++ # @lwx --+++++ # def _decode_one_tokens_logits( --+++++ # self, --+++++@@ -1960,6 +2714,7 @@ class Qwen2MoeForTokenClassification(Qwen2MoePreTrainedModel): --+++++ attentions=outputs.attentions, --+++++ ) --+++++ --++++++ --+++++ __all__ = [ --+++++ "Qwen2MoeForCausalLM", --+++++ "Qwen2MoeModel", --+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++++new file mode 100644 --+++++index 00000000..6dfb5b93 --+++++--- /dev/null --++++++++ b/patches/0001-20251104commit.patch --+++++@@ -0,0 +1,1272 @@ --++++++From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --++++++From: Pinoeer-kingxi <13022943007@163.com> --++++++Date: Tue, 4 Nov 2025 09:11:51 +0800 --++++++Subject: [PATCH] 20251104commit --++++++ --++++++--- --++++++ mindnlp/transformers/cache_utils.py | 28 +- --++++++ .../models/deepseek/modeling_deepseek.py | 149 ++- --++++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --++++++ 3 files changed, 976 insertions(+), 87 deletions(-) --++++++ --++++++diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --++++++index cadd2e04..02f8d4be 100644 --++++++--- a/mindnlp/transformers/cache_utils.py --+++++++++ b/mindnlp/transformers/cache_utils.py --++++++@@ -812,14 +812,26 @@ class StaticCache(Cache): --++++++ # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --++++++ # k_out[:, :, cache_position] = key_states --++++++ # v_out[:, :, cache_position] = value_states --++++++- if ON_ORANGE_PI: --++++++- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --++++++- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --++++++- else: --++++++- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --++++++- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --++++++- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --++++++- --+++++++ # if ON_ORANGE_PI: --+++++++ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++++ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++++ # else: --+++++++ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++++ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++++ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++++ # 确保 cache_position 是 1D tensor 并且类型正确 --+++++++ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++++++ if cache_position.ndim > 1: --+++++++ cache_position = cache_position.flatten() --+++++++ # 确保类型是 int32 或 int64(MindSpore 要求) --+++++++ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++++++ cache_position = cache_position.int() --+++++++ --+++++++ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++++++ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++++++ k_out[:, :, cache_position] = key_states --+++++++ v_out[:, :, cache_position] = value_states --+++++++ --++++++ return k_out, v_out --++++++ --++++++ def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --++++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++index c695b944..d8303e45 100644 --++++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --++++++ # Copied from transformers.models.llama.modeling_llama.rotate_half --++++++ def rotate_half(x): --++++++ """Rotates half the hidden dims of the input.""" --++++++- x1 = x[..., : x.shape[-1] // 2] --++++++- x2 = x[..., x.shape[-1] // 2 :] --+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++++ # x1 = x[..., : x.shape[-1] // 2] --+++++++ # x2 = x[..., x.shape[-1] // 2 :] --+++++++ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++++ return ops.cat((-x2, x1), dim=-1) --++++++ --++++++ --++++++@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --++++++ if self.training: --++++++ raise NotImplementedError("Training is not supported yet.") --++++++ else: --++++++- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --++++++- if self.config.n_shared_experts is not None: --++++++- y = y + self.shared_experts(identity) --++++++- return y --+++++++ # @lwx --+++++++ if orig_shape[1] == 1: --+++++++ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++++++ y=y.view(*orig_shape) --+++++++ if self.config.n_shared_experts is not None: --+++++++ y = y + self.shared_experts(identity) --+++++++ return y --+++++++ else: --+++++++ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++++++ if self.config.n_shared_experts is not None: --+++++++ y = y + self.shared_experts(identity) --+++++++ return y --+++++++ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++++ # if self.config.n_shared_experts is not None: --+++++++ # y = y + self.shared_experts(identity) --+++++++ # return y --+++++++ --+++++++ @no_grad() --+++++++ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++++ --+++++++ expert_cache = ops.zeros_like(x) --+++++++ for i in range(self.num_experts_per_tok): --+++++++ expert_id = flat_expert_indices[i].item() --+++++++ weight = flat_expert_weights[i].item() --+++++++ expert = self.experts[expert_id] --+++++++ expert_out = expert(x) --+++++++ expert_cache += expert_out * weight --+++++++ return expert_cache --++++++ --++++++ @no_grad() --++++++- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++- # expert_cache = torch.zeros_like(x) --++++++- # idxs = flat_expert_indices.argsort() --++++++- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++++- # token_idxs = idxs // self.num_experts_per_tok --++++++- # for i, end_idx in enumerate(tokens_per_expert): --++++++- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++++- # if start_idx == end_idx: --++++++- # continue --++++++- # expert = self.experts[i] --++++++- # exp_token_idx = token_idxs[start_idx:end_idx] --++++++- # expert_tokens = x[exp_token_idx] --++++++- # expert_out = expert(expert_tokens) --++++++- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++++- # return expert_cache --+++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++++ expert_cache = ops.zeros_like(x) --++++++ idxs = flat_expert_indices.argsort() --++++++ tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ token_idxs = idxs // self.num_experts_per_tok --+++++++ --++++++ for i, end_idx in enumerate(tokens_per_expert): --++++++ start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ if start_idx == end_idx: --++++++@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --++++++ expert_out = expert(expert_tokens) --++++++ expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++++ --++++++ return expert_cache --+++++++ --+++++++ # @no_grad() --+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++++ # # expert_cache = torch.zeros_like(x) --+++++++ # # idxs = flat_expert_indices.argsort() --+++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++++ # # token_idxs = idxs // self.num_experts_per_tok --+++++++ # # for i, end_idx in enumerate(tokens_per_expert): --+++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++++ # # if start_idx == end_idx: --+++++++ # # continue --+++++++ # # expert = self.experts[i] --+++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++++ # # expert_tokens = x[exp_token_idx] --+++++++ # # expert_out = expert(expert_tokens) --+++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++++ # # return expert_cache --+++++++ # expert_cache = ops.zeros_like(x) --+++++++ # idxs = flat_expert_indices.argsort() --+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++++ # token_idxs = idxs // self.num_experts_per_tok --+++++++ --+++++++ # for i, end_idx in enumerate(tokens_per_expert): --+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++++ # if start_idx == end_idx: --+++++++ # continue --+++++++ # expert = self.experts[i] --+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++++ # expert_tokens = x[exp_token_idx] --+++++++ # expert_out = expert(expert_tokens) --+++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++++ --+++++++ # return expert_cache --+++++++ # @no_grad() --+++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++++ # expert_cache = ops.zeros_like(x) --+++++++ --+++++++ # # 排序保证顺序一致 --+++++++ # idxs = flat_expert_indices.argsort() --+++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++++ # token_idxs = idxs // self.num_experts_per_tok --+++++++ --+++++++ # # 找出有 token 的专家 --+++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++++ --+++++++ # for i in active_experts.tolist(): --+++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++++ # end_idx = tokens_per_expert[i] --+++++++ # if start_idx == end_idx: # 没有 token --+++++++ # continue --+++++++ --+++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++++ # expert_tokens = x[exp_token_idx] --+++++++ # expert_out = self.experts[i](expert_tokens) --+++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++++ --+++++++ # expert_cache = mindspore.mint.scatter_add( --+++++++ # expert_cache, --+++++++ # 0, --+++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++++ # expert_out --+++++++ # ) --+++++++ --+++++++ # return expert_cache --+++++++ --+++++++ --++++++ --++++++ # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --++++++ # """ --++++++@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++++ --++++++ # Initialize weights and apply final processing --++++++ self.post_init() --+++++++ self.warm_up = False --+++++++ --+++++++ def warmup_moe_model_deep(self): --+++++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++++++ test_texts = [ --+++++++ "warmup short", --+++++++ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++++++ ] --+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++++ if tokenizer is None: --+++++++ from mindnlp.transformers import AutoTokenizer --+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++++ self._warmup_tokenizer = tokenizer --+++++++ --+++++++ for text in test_texts: --+++++++ inputs = tokenizer(text, return_tensors="ms") --+++++++ with mindspore._no_grad(): --+++++++ _ = self(**inputs, use_cache=False) --+++++++ print("[Warmup] DeepSeek-MoE 模型预热完成。") --++++++ --++++++ def get_input_embeddings(self): --++++++ return self.model.embed_tokens --++++++@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++++ ```""" --+++++++ if not self.warm_up: --+++++++ self.warm_up = True --+++++++ self.warmup_moe_model_deep() --+++++++ --++++++ output_attentions = ( --++++++ output_attentions --++++++ if output_attentions is not None --++++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++index 3cbf820e..d4c6b651 100644 --++++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++@@ -18,7 +18,6 @@ --++++++ # See the License for the specific language governing permissions and --++++++ # limitations under the License. --++++++ """MindSpore Qwen2MoE model.""" --++++++- --++++++ import math --++++++ from typing import List, Optional, Tuple, Union --++++++ --++++++@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --++++++ TokenClassifierOutput, --++++++ ) --++++++ from ...modeling_utils import PreTrainedModel --+++++++from ...generation import GenerationMixin --++++++ from ....utils import logging --++++++ from .configuration_qwen2_moe import Qwen2MoeConfig --++++++ --++++++@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --++++++ self.variance_epsilon = eps --++++++ --++++++ def forward(self, hidden_states): --+++++++ # @dwj --+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++++ # @lwx --+++++++ # if not self.training : --+++++++ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --++++++ input_dtype = hidden_states.dtype --++++++ hidden_states = hidden_states.to(mindspore.float32) --++++++ variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --++++++@@ -234,6 +239,8 @@ def rotate_half(x): --++++++ """Rotates half the hidden dims of the input.""" --++++++ x1 = x[..., : x.shape[-1] // 2] --++++++ x2 = x[..., x.shape[-1] // 2 :] --+++++++ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++++ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --++++++ return ops.cat((-x2, x1), dim=-1) --++++++ --++++++ --++++++@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --++++++ self.config = config --++++++ self.hidden_size = config.hidden_size --++++++ self.intermediate_size = intermediate_size --+++++++ --++++++ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++++ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --++++++ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --++++++ self.act_fn = ACT2FN[config.hidden_act] --++++++ --++++++ def forward(self, x): --++++++- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --++++++- --++++++ --+++++++ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++++ # @lwx --+++++++ # gate_up_output = self.gate_up_proj(x) --+++++++ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++++++ # return self.down_proj(swiglu_output) --+++++++ --+++++++ # def forward(self, x): --+++++++ # gate_proj_out = self.gate_proj(x) --+++++++ # up_proj_out = self.up_proj(x) --+++++++ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++++++ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++++++ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++++++ # return self.down_proj(swiglu_out) --+++++++ --++++++ # Copied from transformers.models.llama.modeling_llama.repeat_kv --++++++ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --++++++ """ --++++++@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --++++++ use_cache: bool = False, --++++++ cache_position: Optional[mindspore.Tensor] = None, --++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++++ --+++++++ --+++++++ --++++++ bsz, q_len, _ = hidden_states.shape --++++++ --++++++ query_states = self.q_proj(hidden_states) --++++++@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++ "with a layer index." --++++++ ) --++++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++++ if isinstance(past_key_value, StaticCache): --+++++++ kv_seq_len = key_states.shape[-2] --+++++++ else: --+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++ if past_key_value is not None: --++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++++ --+++++++ if isinstance(past_key_value, StaticCache): --+++++++ kv_seq_len = key_states.shape[-2] --++++++ --++++++ # repeat k/v heads if n_kv_heads < n_heads --++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++- --+++++++ --++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++++ --++++++- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --++++++- raise ValueError( --++++++- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --++++++- f" {attn_weights.shape}" --++++++- ) --++++++- --++++++- if attention_mask is not None: # no matter the length, we just slice it --++++++- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++++++ if attention_mask is not None: --+++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++++ attn_weights = attn_weights + causal_mask --++++++ --++++++ # upcast attention to fp32 --++++++@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++++ --++++++ attn_output = self.o_proj(attn_output) --++++++- --+++++++ # @lwx --+++++++ --+++++++ # max_seq_len = self.max_position_embeddings # 2048 --+++++++ --+++++++ # if attention_mask is not None: --+++++++ # # attention_mask: [B, 1, Sq, Sk] --+++++++ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++++ --+++++++ # # pad 到 [max_seq_len, max_seq_len] --+++++++ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++++ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++++ # global_attention_mask = padded_mask --+++++++ # else: --+++++++ # global_attention_mask = None --+++++++ --+++++++ --+++++++ # sparse_mode=3 --+++++++ # attn_output = mindspore.ops.flash_attention_score( --+++++++ # query=query_states, --+++++++ # key=key_states, --+++++++ # value=value_states, --+++++++ # real_shift=None, --+++++++ # padding_mask=None, --+++++++ --+++++++ # head_num=self.num_heads, --+++++++ # attn_mask=global_attention_mask, --+++++++ # keep_prob=1.0 - self.attention_dropout, --+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++++ # input_layout="BNSD", --+++++++ # pre_tokens=2147483647, --+++++++ # next_tokens=2147483647, --+++++++ # inner_precise=0, --+++++++ # drop_mask=None, --+++++++ # prefix=None, --+++++++ # actual_seq_qlen=None, --+++++++ # actual_seq_kvlen=None, --+++++++ # sparse_mode=sparse_mode, --+++++++ # ) --++++++ if not output_attentions: --++++++ attn_weights = None --++++++ --++++++ return attn_output, attn_weights, past_key_value --++++++ --++++++ --+++++++class Qwen2MoeFlashAttention(nn.Module): --+++++++ """ --+++++++ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++++ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++++ --+++++++ 关键改动: --+++++++ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++++ 直接传入原始的 key 和 value 张量效率更高。 --+++++++ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++++ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++++ """ --+++++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++++ super().__init__() --+++++++ self.config = config --+++++++ self.layer_idx = layer_idx --+++++++ self.hidden_size = config.hidden_size --+++++++ self.num_heads = config.num_attention_heads --+++++++ self.head_dim = self.hidden_size // self.num_heads --+++++++ self.num_key_value_heads = config.num_key_value_heads --+++++++ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++++ self.max_position_embeddings = config.max_position_embeddings --+++++++ self.rope_theta = config.rope_theta --+++++++ self.attention_dropout = config.attention_dropout --+++++++ --+++++++ if (self.head_dim * self.num_heads) != self.hidden_size: --+++++++ raise ValueError( --+++++++ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++++ ) --+++++++ --+++++++ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++++ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++++ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++++ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++++ --+++++++ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++++ self.head_dim, --+++++++ max_position_embeddings=self.max_position_embeddings, --+++++++ base=self.rope_theta, --+++++++ ) --+++++++ --+++++++ def forward( --+++++++ self, --+++++++ hidden_states: mindspore.Tensor, --+++++++ attention_mask: Optional[mindspore.Tensor] = None, --+++++++ position_ids: Optional[mindspore.Tensor] = None, --+++++++ past_key_value: Optional[Cache] = None, --+++++++ output_attentions: bool = False, --+++++++ use_cache: bool = False, --+++++++ cache_position: Optional[mindspore.Tensor] = None, --+++++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++++ --+++++++ bsz, q_len, _ = hidden_states.shape --+++++++ --+++++++ # 1. 线性投射 Q, K, V --+++++++ query_states = self.q_proj(hidden_states) --+++++++ key_states = self.k_proj(hidden_states) --+++++++ value_states = self.v_proj(hidden_states) --+++++++ --+++++++ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++++ # query: [B, S, H*D] -> [B, N1, S, D] --+++++++ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ --+++++++ # 3. RoPE 旋转位置编码 --+++++++ kv_seq_len = key_states.shape[-2] --+++++++ if past_key_value is not None: --+++++++ if self.layer_idx is None: --+++++++ raise ValueError( --+++++++ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++++ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++++ "with a layer index." --+++++++ ) --+++++++ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++++ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++++ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++++ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++++ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++++ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++++ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++++ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++++ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++++ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++++ if cache_position.shape[0] == 1: --+++++++ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++++ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++++ kv_seq_len = past_seen_tokens + 1 --+++++++ else: --+++++++ # prefill 阶段:cache_position 是范围,使用其长度 --+++++++ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++++ else: --+++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++++ --+++++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++++ --+++++++ # 4. KV 缓存更新 --+++++++ if past_key_value is not None: --+++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++++ key_states, value_states = past_key_value.update( --+++++++ key_states, value_states, self.layer_idx, cache_kwargs --+++++++ ) --+++++++ --+++++++ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++++ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++++ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++++ if cache_position.shape[0] == 1: --+++++++ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++++ kv_seq_len = key_states.shape[-2] --+++++++ --+++++++ # 5. [重要] 准备 Attention Mask --+++++++ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++++ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++++ fa_attention_mask = None --+++++++ if attention_mask is not None: --+++++++ # 截取与当前key长度匹配的部分 --+++++++ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++++ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++++ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++++ fa_attention_mask = (mask_slice != 0) --+++++++ --+++++++ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++++ input_dtype = query_states.dtype --+++++++ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++++ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++++ query_states = query_states.to(mindspore.float16) --+++++++ key_states = key_states.to(mindspore.float16) --+++++++ value_states = value_states.to(mindspore.float16) --+++++++ --+++++++ # 6. [核心] 调用 flash_attention_score 算子 --+++++++ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++++ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++++ attn_output = mindspore.ops.flash_attention_score( --+++++++ query=query_states, --+++++++ key=key_states, --+++++++ value=value_states, --+++++++ head_num=self.num_heads, # 传入Q的头数(N1) --+++++++ attn_mask=fa_attention_mask, --+++++++ keep_prob=1.0 - self.attention_dropout, --+++++++ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++++ input_layout="BNSD", --+++++++ sparse_mode=0 # 使用 defaultMask 模式 --+++++++ ) --+++++++ --+++++++ # 恢复原始数据类型 --+++++++ attn_output = attn_output.to(input_dtype) --+++++++ --+++++++ # 7. 调整输出形状 --+++++++ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++++ attn_output = self.o_proj(attn_output) --+++++++ --+++++++ # FlashAttention 算子不直接返回注意力权重矩阵 --+++++++ attn_weights = None --+++++++ if output_attentions: --+++++++ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++++ --+++++++ return attn_output, attn_weights, past_key_value --+++++++ --+++++++ # def forward( --+++++++ # self, --+++++++ # hidden_states: mindspore.Tensor, --+++++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++++ # past_key_value: Optional[Cache] = None, --+++++++ # output_attentions: bool = False, --+++++++ # use_cache: bool = False, --+++++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++++ --+++++++ # bsz, q_len, _ = hidden_states.shape --+++++++ --+++++++ # # 1. 线性投射 Q, K, V --+++++++ # query_states = self.q_proj(hidden_states) --+++++++ # key_states = self.k_proj(hidden_states) --+++++++ # value_states = self.v_proj(hidden_states) --+++++++ --+++++++ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ --+++++++ # # 3. RoPE 旋转位置编码 --+++++++ # kv_seq_len = key_states.shape[-2] --+++++++ # if past_key_value is not None: --+++++++ # if self.layer_idx is None: --+++++++ # raise ValueError( --+++++++ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++++ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++++ # "with a layer index." --+++++++ # ) --+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++++ --+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++++ --+++++++ # # 4. KV 缓存更新 --+++++++ # if past_key_value is not None: --+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++++ # key_states, value_states = past_key_value.update( --+++++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++++ # ) --+++++++ --+++++++ # # 5. 准备 Attention Mask --+++++++ # fa_attention_mask = None --+++++++ # if attention_mask is not None: --+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++++ # fa_attention_mask = (mask_slice != 0) --+++++++ --+++++++ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++++ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++++ # input_dtype = query_states.dtype --+++++++ --+++++++ # # 6. [核心] 调用 flash_attention_score 算子 --+++++++ # attn_output = mindspore.ops.flash_attention_score( --+++++++ # query=query_states, --+++++++ # key=key_states, --+++++++ # value=value_states, --+++++++ # head_num=self.num_heads, --+++++++ # attn_mask=fa_attention_mask, --+++++++ # keep_prob=1.0 - self.attention_dropout, --+++++++ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++++ # input_layout="BNSD", --+++++++ # sparse_mode=0, --+++++++ # # <--- 修改点 2: 启用内部高精度计算 --- --+++++++ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++++ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++++ # inner_precise=1 --+++++++ # ) --+++++++ --+++++++ # # 恢复原始数据类型 --+++++++ # attn_output = attn_output.to(input_dtype) --+++++++ --+++++++ # # 7. 调整输出形状 --+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++++ # attn_output = self.o_proj(attn_output) --+++++++ --+++++++ # attn_weights = None --+++++++ # if output_attentions: --+++++++ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++++ --+++++++ # return attn_output, attn_weights, past_key_value --+++++++ --+++++++ # def forward( --+++++++ # self, --+++++++ # hidden_states: mindspore.Tensor, --+++++++ # attention_mask: Optional[mindspore.Tensor] = None, --+++++++ # position_ids: Optional[mindspore.Tensor] = None, --+++++++ # past_key_value: Optional[Cache] = None, --+++++++ # output_attentions: bool = False, --+++++++ # use_cache: bool = False, --+++++++ # cache_position: Optional[mindspore.Tensor] = None, --+++++++ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++++ --+++++++ # bsz, q_len, _ = hidden_states.shape --+++++++ --+++++++ # query_states = self.q_proj(hidden_states) --+++++++ # key_states = self.k_proj(hidden_states) --+++++++ # value_states = self.v_proj(hidden_states) --+++++++ --+++++++ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++++ --+++++++ # kv_seq_len = key_states.shape[-2] --+++++++ # if past_key_value is not None: --+++++++ # if self.layer_idx is None: --+++++++ # raise ValueError("`layer_idx` must be specified for caching") --+++++++ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++++ --+++++++ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++++ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++++ --+++++++ # if past_key_value is not None: --+++++++ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++++ # key_states, value_states = past_key_value.update( --+++++++ # key_states, value_states, self.layer_idx, cache_kwargs --+++++++ # ) --+++++++ --+++++++ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++++ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++++ --+++++++ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++++++ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++++++ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++++++ # query_states = query_states / math.sqrt(self.head_dim) --+++++++ # # <--- 修改结束 --- --+++++++ --+++++++ # fa_attention_mask = None --+++++++ # if attention_mask is not None: --+++++++ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++++ # fa_attention_mask = (mask_slice != 0) --+++++++ --+++++++ # input_dtype = query_states.dtype --+++++++ --+++++++ # attn_output = mindspore.ops.flash_attention_score( --+++++++ # query=query_states, # 传入已经预先缩放过的 query --+++++++ # key=key_states, --+++++++ # value=value_states, --+++++++ # head_num=self.num_heads, --+++++++ # attn_mask=fa_attention_mask, --+++++++ # keep_prob=1.0 - self.attention_dropout, --+++++++ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++++++ # input_layout="BNSD", --+++++++ # sparse_mode=0, --+++++++ # inner_precise=1 # 仍然保持内部高精度计算 --+++++++ # ) --+++++++ --+++++++ # attn_output = attn_output.to(input_dtype) --+++++++ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++++ # attn_output = self.o_proj(attn_output) --+++++++ --+++++++ # attn_weights = None --+++++++ # if output_attentions: --+++++++ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++++++ --+++++++ # return attn_output, attn_weights, past_key_value --+++++++ --++++++ QWEN2MOE_ATTENTION_CLASSES = { --++++++ "eager": Qwen2MoeAttention, --+++++++ "flash-attention": Qwen2MoeFlashAttention, --++++++ } --++++++ --++++++ --++++++@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --++++++ self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --++++++ self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --++++++ --+++++++ #@dwj --+++++++ # 只遍历激活的专家,而非全部专家 --++++++ def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --++++++- batch_size, sequence_length, hidden_dim = hidden_states.shape --++++++- hidden_states = hidden_states.view(-1, hidden_dim) --++++++- # router_logits: (batch * sequence_length, n_experts) --++++++- router_logits = self.gate(hidden_states) --++++++- --++++++- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --++++++- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --++++++- if self.norm_topk_prob: --++++++- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --++++++- # we cast back to the input dtype --++++++- routing_weights = routing_weights.to(hidden_states.dtype) --++++++- --++++++- final_hidden_states = ops.zeros( --++++++- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --++++++- ) --++++++- --++++++- # One hot encode the selected experts to create an expert mask --++++++- # this will be used to easily index which expert is going to be sollicitated --++++++- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --++++++- --++++++- # Loop over all available experts in the model and perform the computation on each expert --++++++- for expert_idx in range(self.num_experts): --++++++- expert_layer = self.experts[expert_idx] --++++++- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --++++++- --++++++- # Index the correct hidden states and compute the expert hidden state for --++++++- # the current expert. We need to make sure to multiply the output hidden --++++++- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --++++++- if 0 not in idx.shape: --++++++- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --++++++- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --++++++- --++++++- # However `index_add_` only support torch tensors for indexing so we'll use --++++++- # the `top_x` tensor here. --++++++- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --++++++- --++++++- shared_expert_output = self.shared_expert(hidden_states) --++++++- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --++++++- --++++++- final_hidden_states = final_hidden_states + shared_expert_output --+++++++ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++++ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++++ num_tokens = hidden_states_reshaped.shape[0] --+++++++ --+++++++ router_logits = self.gate(hidden_states_reshaped) --+++++++ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++++ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++++ --+++++++ if self.norm_topk_prob: --+++++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++++ routing_weights = routing_weights.to(hidden_states.dtype) --+++++++ --+++++++ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++++ flat_selected_experts = selected_experts.flatten() --+++++++ --+++++++ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++++ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++++ token_indices = broadcasted_token_indices.flatten() --+++++++ --+++++++ active_experts = ops.unique(flat_selected_experts) --+++++++ --+++++++ for expert_idx_tensor in active_experts: --+++++++ expert_idx = expert_idx_tensor.item() --+++++++ expert_layer = self.experts[expert_idx] --+++++++ --+++++++ mask = (flat_selected_experts == expert_idx_tensor) --+++++++ selected_token_indices = token_indices[mask] --+++++++ selected_routing_weights = routing_weights.flatten()[mask] --+++++++ --+++++++ current_states = hidden_states_reshaped[selected_token_indices] --+++++++ --+++++++ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++++ --+++++++ final_hidden_states = final_hidden_states.index_add( --+++++++ dim=0, --+++++++ index=selected_token_indices, --+++++++ source=expert_output.to(hidden_states.dtype) --+++++++ ) --+++++++ --+++++++ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++++ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --++++++ --++++++- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --++++++- return final_hidden_states, router_logits --+++++++ final_hidden_states = final_hidden_states + shared_expert_output --+++++++ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++++ --+++++++ return final_hidden_states, router_logits --++++++ --++++++ --++++++ class Qwen2MoeDecoderLayer(nn.Module): --++++++@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --++++++ --++++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --++++++ --+++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++++ --++++++ if (layer_idx not in config.mlp_only_layers) and ( --++++++ config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --++++++ ): --++++++@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --++++++ _no_split_modules = ["Qwen2MoeDecoderLayer"] --++++++ _skip_keys_device_placement = "past_key_values" --++++++ _supports_cache_class = True --+++++++#lwx --+++++++ # _supports_static_cache = True --++++++ --++++++ def _init_weights(self, module): --++++++ std = self.config.initializer_range --++++++@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --++++++ return causal_mask --++++++ --++++++ --++++++-class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++++class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --++++++ _tied_weights_keys = ["lm_head.weight"] --++++++ --++++++ def __init__(self, config): --++++++@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++++ self.num_experts_per_tok = config.num_experts_per_tok --++++++ # Initialize weights and apply final processing --++++++ self.post_init() --+++++++ # @lwx --+++++++ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++++++ # self.generation_config.cache_implementation = "static" --+++++++ self._warmed_up = False --+++++++ --+++++++ def warmup_moe_model(self): --+++++++ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++++++ test_texts = [ --+++++++ "warmup short", --+++++++ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++++++ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++++++ ] --+++++++ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++++ if tokenizer is None: --+++++++ from mindnlp.transformers import AutoTokenizer --+++++++ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++++ self._warmup_tokenizer = tokenizer --+++++++ --+++++++ for text in test_texts: --+++++++ inputs = tokenizer(text, return_tensors="ms") --+++++++ with mindspore._no_grad(): --+++++++ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++++++ print("[Warmup] Qwen2-MoE 模型预热完成。") --++++++ --++++++ def get_input_embeddings(self): --++++++ return self.model.embed_tokens --++++++@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++++ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --++++++ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --++++++ ```""" --+++++++ if not self._warmed_up: --+++++++ self._warmed_up = True --+++++++ self.warmup_moe_model() --++++++ --++++++ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --++++++ output_router_logits = ( --++++++@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --++++++ } --++++++ ) --++++++ return model_inputs --+++++++# @lwx --+++++++ # def _decode_one_tokens_logits( --+++++++ # self, --+++++++ # cur_token: mindspore.Tensor, --+++++++ # input_pos: Optional[mindspore.Tensor], --+++++++ # cache_position: mindspore.Tensor, --+++++++ # past_key_values: StaticCache, --+++++++ # ) -> mindspore.Tensor: --+++++++ # """ --+++++++ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++++++ --+++++++ # Args: --+++++++ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++++++ # input_pos: 输入位置信息,可选 --+++++++ # cache_position: 当前token在cache中的位置,shape为(1,) --+++++++ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++++++ --+++++++ # Returns: --+++++++ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++++++ # """ --+++++++ # # 调用JIT编译的版本 --+++++++ # return self.get_decode_one_tokens_logits( --+++++++ # cur_token=cur_token, --+++++++ # input_pos=input_pos, --+++++++ # cache_position=cache_position, --+++++++ # past_key_values=past_key_values, --+++++++ # ) --+++++++ --+++++++ # @mindspore.jit(jit_level='O1') --+++++++ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++++++ # """ --+++++++ # JIT编译的函数,用于高效的单token解码 --+++++++ # 使用JIT编译优化以支持静态shape和高效执行 --+++++++ --+++++++ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++++++ # """ --+++++++ # outputs = self.model.forward( --+++++++ # input_ids=cur_token, --+++++++ # position_ids=input_pos, --+++++++ # cache_position=cache_position, --+++++++ # past_key_values=past_key_values, --+++++++ # use_cache=True, --+++++++ # return_dict=False, --+++++++ # ) --+++++++ --+++++++ # hidden_states = outputs[0] --+++++++ # logits = self.lm_head.forward(hidden_states) --+++++++ # logits = logits.float() --+++++++ --+++++++ # return logits[:, -1, :] --+++++++ --+++++++ # def _sample( --+++++++ # self, --+++++++ # input_ids: mindspore.Tensor, --+++++++ # logits_processor, --+++++++ # stopping_criteria, --+++++++ # generation_config, --+++++++ # synced_devices: bool, --+++++++ # streamer=None, --+++++++ # logits_warper=None, --+++++++ # **model_kwargs, --+++++++ # ): --+++++++ # """ --+++++++ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++++++ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++++++ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++++++ # """ --+++++++ # from ...generation.logits_process import LogitsProcessorList --+++++++ # from ...generation.stopping_criteria import StoppingCriteriaList --+++++++ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++++++ # from mindnlp.core import nn, ops, no_grad --+++++++ # import numpy as np --+++++++ --+++++++ # # 检查是否使用 StaticCache --+++++++ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++++++ # # 否则,直接调用父类方法 --+++++++ # past_key_values = model_kwargs.get("past_key_values") --+++++++ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++++++ --+++++++ # if not isinstance(past_key_values, StaticCache): --+++++++ # # 不使用 StaticCache,直接调用父类方法 --+++++++ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++++++ # return super()._sample( --+++++++ # input_ids=input_ids, --+++++++ # logits_processor=logits_processor, --+++++++ # stopping_criteria=stopping_criteria, --+++++++ # generation_config=generation_config, --+++++++ # synced_devices=synced_devices, --+++++++ # streamer=streamer, --+++++++ # logits_warper=logits_warper, --+++++++ # **model_kwargs, --+++++++ # ) --+++++++ --+++++++ # # 使用 StaticCache,进入自定义循环 --+++++++ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++++++ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++++++ # pad_token_id = generation_config._pad_token_tensor --+++++++ # output_attentions = generation_config.output_attentions --+++++++ # output_hidden_states = generation_config.output_hidden_states --+++++++ # output_scores = generation_config.output_scores --+++++++ # output_logits = generation_config.output_logits --+++++++ # return_dict_in_generate = generation_config.return_dict_in_generate --+++++++ # max_length = generation_config.max_length --+++++++ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++++++ # do_sample = generation_config.do_sample --+++++++ --+++++++ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++++++ # raise ValueError( --+++++++ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++++++ # f"{logits_warper})." --+++++++ # ) --+++++++ --+++++++ # # init attention / hidden states / scores tuples --+++++++ # scores = () if (return_dict_in_generate and output_scores) else None --+++++++ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++++++ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++++ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++++ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++++++ --+++++++ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++++++ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++++++ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++++++ # encoder_hidden_states = ( --+++++++ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++++++ # ) --+++++++ --+++++++ # # keep track of which sequences are already finished --+++++++ # batch_size, cur_len = input_ids.shape --+++++++ # this_peer_finished = False --+++++++ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++++++ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++++++ --+++++++ # time_record = [] --+++++++ # from ....utils.testing_utils import parse_flag_from_env --+++++++ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++++++ --+++++++ # while self._has_unfinished_sequences( --+++++++ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++++++ # ): --+++++++ # if _record_time: --+++++++ # import time as time_module --+++++++ # infer_start = time_module.time() --+++++++ --+++++++ # # prepare model inputs --+++++++ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++++++ --+++++++ # # prepare variable output controls --+++++++ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++++++ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++++++ --+++++++ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++++++ # cur_cache_position = model_inputs.get("cache_position") --+++++++ # cur_past_key_values = model_inputs.get("past_key_values") --+++++++ # cur_input_ids = model_inputs.get("input_ids") --+++++++ --+++++++ # if (isinstance(cur_past_key_values, StaticCache) and --+++++++ # cur_cache_position is not None and --+++++++ # len(cur_cache_position.shape) > 0 and --+++++++ # cur_cache_position.shape[0] == 1 and --+++++++ # cur_input_ids is not None and --+++++++ # cur_input_ids.shape[1] == 1): --+++++++ # # 使用 JIT 优化的单 token 解码 --+++++++ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++++++ # if not hasattr(self, '_jit_used'): --+++++++ # self._jit_used = False --+++++++ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++++++ --+++++++ # next_token_logits = self.get_decode_one_tokens_logits( --+++++++ # cur_token=cur_input_ids, --+++++++ # input_pos=model_inputs.get("position_ids"), --+++++++ # cache_position=cur_cache_position, --+++++++ # past_key_values=cur_past_key_values, --+++++++ # ) --+++++++ --+++++++ # # 标记已使用JIT(用于后续判断) --+++++++ # if not self._jit_used: --+++++++ # self._jit_used = True --+++++++ --+++++++ # # 构造兼容的输出对象 --+++++++ # class JitOptimizedOutput: --+++++++ # def __init__(self, logits, config): --+++++++ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++++++ # self.config = config --+++++++ # # 对于 JIT 优化路径,这些属性通常不需要 --+++++++ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++++++ # self.attentions = None if not config.is_encoder_decoder else None --+++++++ # self.cross_attentions = None --+++++++ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++++++ # self.hidden_states = None if not config.is_encoder_decoder else None --+++++++ --+++++++ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++++++ # else: --+++++++ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++++++ # outputs = self(**model_inputs, return_dict=True) --+++++++ --+++++++ # if synced_devices and this_peer_finished: --+++++++ # continue --+++++++ --+++++++ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++++++ # next_token_logits = outputs.logits[:, -1, :] --+++++++ --+++++++ # # pre-process distribution --+++++++ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++++++ # if do_sample: --+++++++ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++++++ --+++++++ # # Store scores, attentions and hidden_states when required --+++++++ # if return_dict_in_generate: --+++++++ # if output_scores: --+++++++ # scores += (next_token_scores,) --+++++++ # if output_logits: --+++++++ # raw_logits += (next_token_logits,) --+++++++ # if output_attentions: --+++++++ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++++++ # decoder_attentions += (attn,) if attn is not None else (None,) --+++++++ # if self.config.is_encoder_decoder: --+++++++ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++++++ --+++++++ # if output_hidden_states: --+++++++ # hidden = ( --+++++++ # outputs.decoder_hidden_states --+++++++ # if self.config.is_encoder_decoder --+++++++ # else outputs.hidden_states --+++++++ # ) --+++++++ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++++++ --+++++++ # # token selection --+++++++ # if do_sample: --+++++++ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++++++ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++++++ # else: --+++++++ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++++++ --+++++++ # # finished sentences should have their next token be a padding token --+++++++ # if has_eos_stopping_criteria: --+++++++ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++++++ --+++++++ # # update generated ids, model inputs, and length for next step --+++++++ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++++++ # if streamer is not None: --+++++++ # streamer.put(next_tokens) --+++++++ --+++++++ # model_kwargs = self._update_model_kwargs_for_generation( --+++++++ # outputs, --+++++++ # model_kwargs, --+++++++ # is_encoder_decoder=self.config.is_encoder_decoder, --+++++++ # ) --+++++++ --+++++++ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++++++ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++++++ # cur_len += 1 --+++++++ --+++++++ # if _record_time: --+++++++ # import time as time_module --+++++++ # infer_stop = time_module.time() --+++++++ # time_record.append(infer_stop - infer_start) --+++++++ --+++++++ # del outputs --+++++++ --+++++++ # average_infer_time = None --+++++++ # if time_record: --+++++++ # if len(time_record) > 1: --+++++++ # time_record.pop(0) --+++++++ # average_infer_time = sum(time_record) / len(time_record) --+++++++ # print(f'average inference time is: {average_infer_time}') --+++++++ # print(f'inference time record: {time_record}') --+++++++ --+++++++ # if streamer is not None: --+++++++ # streamer.end() --+++++++ --+++++++ # # 简单判断:打印是否使用了JIT路径 --+++++++ # if hasattr(self, '_jit_used') and self._jit_used: --+++++++ # print("[JIT] ✓ JIT optimization was used during generation") --+++++++ # else: --+++++++ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++++++ --+++++++ # if return_dict_in_generate: --+++++++ # if self.config.is_encoder_decoder: --+++++++ # return GenerateEncoderDecoderOutput( --+++++++ # sequences=input_ids, --+++++++ # scores=scores, --+++++++ # logits=raw_logits, --+++++++ # encoder_attentions=encoder_attentions, --+++++++ # encoder_hidden_states=encoder_hidden_states, --+++++++ # decoder_attentions=decoder_attentions, --+++++++ # cross_attentions=cross_attentions, --+++++++ # decoder_hidden_states=decoder_hidden_states, --+++++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++++ # average_infer_time=average_infer_time --+++++++ # ) --+++++++ # else: --+++++++ # return GenerateDecoderOnlyOutput( --+++++++ # sequences=input_ids, --+++++++ # scores=scores, --+++++++ # logits=raw_logits, --+++++++ # attentions=decoder_attentions, --+++++++ # hidden_states=decoder_hidden_states, --+++++++ # past_key_values=model_kwargs.get("past_key_values"), --+++++++ # average_infer_time=average_infer_time --+++++++ # ) --+++++++ # else: --+++++++ # return input_ids --+++++++ --+++++++ # def _prepare_cache_for_generation( --+++++++ # self, --+++++++ # generation_config, --+++++++ # model_kwargs, --+++++++ # assistant_model, --+++++++ # batch_size, --+++++++ # max_cache_length, --+++++++ # ): --+++++++ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++++++ # generation_config.cache_implementation = "static" --+++++++ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++++++ --+++++++ # if generation_config.cache_implementation == "static": --+++++++ # base_required_from_max_length = generation_config.max_length + 1 --+++++++ # base_required = max(max_cache_length, base_required_from_max_length) --+++++++ # min_cache_size = 50 --+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++++ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++++++ # else: --+++++++ # max_cache_length = max(base_required, min_cache_size) --+++++++ --+++++++ # original_max_cache_length = max_cache_length --+++++++ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++++++ # print(f" - input max_cache_length: {original_max_cache_length}") --+++++++ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++++++ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++++++ # print(f" - final max_cache_length: {max_cache_length}") --+++++++ --+++++++ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++++ # if max_cache_length > self.config.max_position_embeddings: --+++++++ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++++++ --+++++++ # result = super()._prepare_cache_for_generation( --+++++++ # generation_config=generation_config, --+++++++ # model_kwargs=model_kwargs, --+++++++ # assistant_model=assistant_model, --+++++++ # batch_size=batch_size, --+++++++ # max_cache_length=max_cache_length, --+++++++ # ) --+++++++ --+++++++ # if generation_config.cache_implementation == "static": --+++++++ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++++++ # created_cache = model_kwargs.get(cache_name) --+++++++ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++++++ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++++++ # if created_cache.max_cache_len < generation_config.max_length: --+++++++ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++++++ --+++++++ # return result --+++++++ --+++++++ --+++++++ --++++++ --++++++ --++++++ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --++++++-- --++++++2.27.0 --++++++ --+++++-- --+++++2.27.0 --+++++ --++++diff --git a/patches/0003-20261106secondcommit.patch b/patches/0003-20261106secondcommit.patch --++++new file mode 100644 --++++index 00000000..966529e4 --++++--- /dev/null --+++++++ b/patches/0003-20261106secondcommit.patch --++++@@ -0,0 +1,2769 @@ --+++++From 1cf79d864cf51fd66bef8fea63047c5fde477f53 Mon Sep 17 00:00:00 2001 --+++++From: Pinoeer-kingxi <13022943007@163.com> --+++++Date: Thu, 6 Nov 2025 14:54:37 +0800 --+++++Subject: [PATCH 3/3] 20261106secondcommit --+++++ --+++++--- --+++++ .../models/deepseek/modeling_deepseek.py | 217 ++- --+++++ .../models/qwen2_moe/modeling_qwen2_moe.py | 1071 +++++--------- --+++++ patches/0001-20251104commit.patch | 1272 ----------------- --+++++ 3 files changed, 528 insertions(+), 2032 deletions(-) --+++++ delete mode 100644 patches/0001-20251104commit.patch --+++++ --+++++diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++index 73773c22..2f9192bf 100644 --+++++--- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --++++++++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++@@ -54,6 +54,24 @@ logger = logging.get_logger(__name__) --+++++ --+++++ _CONFIG_FOR_DOC = "DeepseekConfig" --+++++ --++++++_attn_mask_cache = {} --++++++ --++++++def get_cached_causal_mask(attention_mask, batch_and_seq, inputs_embeds, past_key_values_length): --++++++ q_len = batch_and_seq[1] --++++++ kv_len = batch_and_seq[1] + past_key_values_length --++++++ key = (batch_and_seq[0], q_len, kv_len) --++++++ --++++++ if key in _attn_mask_cache: --++++++ return _attn_mask_cache[key] --++++++ --++++++ mask = _prepare_4d_causal_attention_mask( --++++++ attention_mask, --++++++ batch_and_seq, --++++++ inputs_embeds, --++++++ past_key_values_length, --++++++ ) --++++++ _attn_mask_cache[key] = mask --++++++ return mask --+++++ --+++++ def _get_unpad_data(attention_mask): --+++++ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=mindspore.int32) --+++++@@ -441,43 +459,8 @@ class DeepseekMoE(nn.Module): --+++++ return final_output --+++++ --+++++ --+++++- @no_grad() --+++++- def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++- expert_cache = ops.zeros_like(x) --+++++- idxs = flat_expert_indices.argsort() --+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++- token_idxs = idxs // self.num_experts_per_tok --+++++- --+++++- for i, end_idx in enumerate(tokens_per_expert): --+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++- if start_idx == end_idx: --+++++- continue --+++++- expert = self.experts[i] --+++++- exp_token_idx = token_idxs[start_idx:end_idx] --+++++- expert_tokens = x[exp_token_idx] --+++++- expert_out = expert(expert_tokens) --+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++- --+++++- return expert_cache --+++++- --+++++ # @no_grad() --+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++- # # expert_cache = torch.zeros_like(x) --+++++- # # idxs = flat_expert_indices.argsort() --+++++- # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++- # # token_idxs = idxs // self.num_experts_per_tok --+++++- # # for i, end_idx in enumerate(tokens_per_expert): --+++++- # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++- # # if start_idx == end_idx: --+++++- # # continue --+++++- # # expert = self.experts[i] --+++++- # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++- # # expert_tokens = x[exp_token_idx] --+++++- # # expert_out = expert(expert_tokens) --+++++- # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++- # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++- # # return expert_cache --++++++ # def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++ # expert_cache = ops.zeros_like(x) --+++++ # idxs = flat_expert_indices.argsort() --+++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++@@ -495,37 +478,118 @@ class DeepseekMoE(nn.Module): --+++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++ --+++++ # return expert_cache --+++++- # @no_grad() --+++++- # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++- # expert_cache = ops.zeros_like(x) --++++++ --++++++ @no_grad() --++++++ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --++++++ """ --++++++ 优化版 MoE prefill: --++++++ - 批量张量化处理同一个 expert 的所有 token --++++++ - 跳过无 token 的专家 --++++++ - 保持结果完全一致 --++++++ """ --++++++ # 初始化输出缓存 --++++++ expert_cache = ops.zeros_like(x) --+++++ --+++++- # # 排序保证顺序一致 --+++++- # idxs = flat_expert_indices.argsort() --+++++- # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++- # token_idxs = idxs // self.num_experts_per_tok --++++++ # 排序(确保 scatter_add 位置对应原逻辑) --++++++ idxs = flat_expert_indices.argsort() --++++++ sorted_expert_indices = flat_expert_indices[idxs] --++++++ sorted_token_indices = idxs // self.num_experts_per_tok --+++++ --+++++- # # 找出有 token 的专家 --+++++- # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++++ # 每个 expert 的 token 数 --++++++ tokens_per_expert = sorted_expert_indices.bincount() --+++++ --+++++- # for i in active_experts.tolist(): --+++++- # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++- # end_idx = tokens_per_expert[i] --+++++- # if start_idx == end_idx: # 没有 token --+++++- # continue --++++++ # 找出有 token 的专家 --++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --+++++ --+++++- # exp_token_idx = token_idxs[start_idx:end_idx] --+++++- # expert_tokens = x[exp_token_idx] --+++++- # expert_out = self.experts[i](expert_tokens) --+++++- # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++++ for expert_id in active_experts.tolist(): --++++++ # 取该 expert 对应的排序后 token 区间 --++++++ start = (tokens_per_expert[:expert_id]).sum().item() --++++++ end = start + tokens_per_expert[expert_id].item() --+++++ --+++++- # expert_cache = mindspore.mint.scatter_add( --+++++- # expert_cache, --+++++- # 0, --+++++- # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++- # expert_out --+++++- # ) --++++++ token_idx = sorted_token_indices[start:end] # 原 token 位置 --++++++ expert_tokens = x[token_idx] # 取输入向量 --+++++ --+++++- # return expert_cache --++++++ # 执行专家 MLP --++++++ expert_out = self.experts[expert_id](expert_tokens) --++++++ --++++++ # 按权重缩放 --++++++ scaled_out = expert_out * flat_expert_weights[idxs[start:end]] --++++++ --++++++ # 回写到缓存(等价 scatter_add) --++++++ expert_cache = mindspore.mint.scatter_add( --++++++ expert_cache, --++++++ 0, --++++++ token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++++ scaled_out --++++++ ) --++++++ --++++++ return expert_cache --++++++ --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # # expert_cache = torch.zeros_like(x) --++++++ # # idxs = flat_expert_indices.argsort() --++++++ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --++++++ # # token_idxs = idxs // self.num_experts_per_tok --++++++ # # for i, end_idx in enumerate(tokens_per_expert): --++++++ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --++++++ # # if start_idx == end_idx: --++++++ # # continue --++++++ # # expert = self.experts[i] --++++++ # # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # # expert_tokens = x[exp_token_idx] --++++++ # # expert_out = expert(expert_tokens) --++++++ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --++++++ # # return expert_cache --++++++ # expert_cache = ops.zeros_like(x) --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # for i, end_idx in enumerate(tokens_per_expert): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # if start_idx == end_idx: --++++++ # continue --++++++ # expert = self.experts[i] --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = expert(expert_tokens) --++++++ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --++++++ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --++++++ --++++++ # return expert_cache --++++++ # @no_grad() --++++++ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --++++++ # expert_cache = ops.zeros_like(x) --++++++ --++++++ # # 排序保证顺序一致 --++++++ # idxs = flat_expert_indices.argsort() --++++++ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --++++++ # token_idxs = idxs // self.num_experts_per_tok --++++++ --++++++ # # 找出有 token 的专家 --++++++ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --++++++ --++++++ # for i in active_experts.tolist(): --++++++ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --++++++ # end_idx = tokens_per_expert[i] --++++++ # if start_idx == end_idx: # 没有 token --++++++ # continue --++++++ --++++++ # exp_token_idx = token_idxs[start_idx:end_idx] --++++++ # expert_tokens = x[exp_token_idx] --++++++ # expert_out = self.experts[i](expert_tokens) --++++++ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --++++++ --++++++ # expert_cache = mindspore.mint.scatter_add( --++++++ # expert_cache, --++++++ # 0, --++++++ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --++++++ # expert_out --++++++ # ) --++++++ --++++++ # return expert_cache --+++++ --+++++ --+++++ --+++++@@ -904,7 +968,6 @@ class DeepseekAttention(nn.Module): --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++- --+++++ # class DeepseekFlashAttention(nn.Module): --+++++ # """ --+++++ # Multi-headed attention from 'Attention Is All You Need' paper, implemented using --+++++@@ -1225,6 +1288,7 @@ class DeepseekFlashAttention(nn.Module): --+++++ --+++++ return attn_output, attn_weights, past_key_value --+++++ --++++++ --+++++ Deepseek_ATTENTION_CLASSES = { --+++++ "eager": DeepseekAttention, --+++++ "flash-attention": DeepseekFlashAttention, --+++++@@ -1456,7 +1520,14 @@ class DeepseekModel(DeepseekPreTrainedModel): --+++++ ) --+++++ else: --+++++ # 4d mask is passed through the layers --+++++- attention_mask = _prepare_4d_causal_attention_mask( --++++++ # attention_mask = _prepare_4d_causal_attention_mask( --++++++ # attention_mask, --++++++ # (batch_size, seq_length), --++++++ # inputs_embeds, --++++++ # past_key_values_length, --++++++ # ) --++++++ #@dwj --++++++ attention_mask = get_cached_causal_mask( --+++++ attention_mask, --+++++ (batch_size, seq_length), --+++++ inputs_embeds, --+++++@@ -1542,6 +1613,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++ # Initialize weights and apply final processing --+++++ self.post_init() --+++++ self.warm_up = False --++++++ #@dwj --++++++ self.kv_cache_keys, self.kv_cache_values = self.init_kv_cache( --++++++ self.num_layers, --++++++ self.num_attention_heads, --++++++ self.head_dim, --++++++ batch_size=1, --++++++ max_length=self.max_length, --++++++ dtype=mindspore.float16 --++++++ ) --++++++ --++++++ def init_kv_cache(self, num_layers, num_heads, head_dim, batch_size, max_length, dtype): --++++++ key_cache = [] --++++++ value_cache = [] --++++++ for _ in range(num_layers): --++++++ k = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++++ v = ops.zeros((batch_size, num_heads, max_length, head_dim), dtype=dtype) --++++++ key_cache.append(k) --++++++ value_cache.append(v) --++++++ return key_cache, value_cache --++++++ --+++++ --+++++ def warmup_moe_model_deep(self): --+++++ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++++diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++index bced285c..ebd7782e 100644 --+++++--- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --++++++++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++@@ -45,8 +45,48 @@ logger = logging.get_logger(__name__) --+++++ _CHECKPOINT_FOR_DOC = "Qwen/Qwen1.5-MoE-A2.7B" --+++++ _CONFIG_FOR_DOC = "Qwen2MoeConfig" --+++++ --+++++-Long_Prompt = False --+++++-PROMPT_LENGTH_THRESHOLD = 128 --++++++Long_Prompt = 1 --++++++LONG_PROMPT_LENGTH_THRESHOLD = 128 --++++++SHORT_PROMPT_LENGTH_THRESHOLD = 32 --++++++ --++++++_causal_mask_cache = {} --++++++ --++++++def get_cached_causal_mask_with_cache_position( --++++++ attention_mask: mindspore.Tensor, --++++++ sequence_length: int, --++++++ target_length: int, --++++++ dtype: mindspore.dtype, --++++++ min_dtype: float, --++++++ cache_position: mindspore.Tensor, --++++++ batch_size: int, --++++++): --++++++ """ --++++++ 带缓存的 causal mask 构造函数 --++++++ """ --++++++ # q_len 是当前 query 长度 --++++++ q_len = sequence_length --++++++ # kv_len 是 target_length --++++++ kv_len = target_length --++++++ --++++++ # 注意缓存 key 加上 q_len 和 kv_len,避免 prefill 与 decode 混淆 --++++++ key = (batch_size, q_len, kv_len, dtype, min_dtype) --++++++ --++++++ if key in _causal_mask_cache: --++++++ return _causal_mask_cache[key] --++++++ --++++++ # 调用原来的 mask 构造逻辑 --++++++ causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++++ attention_mask, --++++++ sequence_length=sequence_length, --++++++ target_length=target_length, --++++++ dtype=dtype, --++++++ min_dtype=min_dtype, --++++++ cache_position=cache_position, --++++++ batch_size=batch_size, --++++++ ) --++++++ # 缓存结果 --++++++ _causal_mask_cache[key] = causal_mask --++++++ return causal_mask --+++++ --+++++ # Copied from transformers.models.llama.modeling_llama._prepare_4d_causal_attention_mask_with_cache_position --+++++ def _prepare_4d_causal_attention_mask_with_cache_position( --+++++@@ -318,12 +358,172 @@ def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++++ --+++++ --+++++ # Copied from transformers.models.qwen2.modeling_qwen2.Qwen2Attention with Qwen2->Qwen2Moe --++++++# class Qwen2MoeAttention(nn.Module): --++++++# """ --++++++# Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --++++++# and "Generating Long Sequences with Sparse Transformers". --++++++# """ --++++++ --++++++# def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --++++++# super().__init__() --++++++# self.config = config --++++++# self.layer_idx = layer_idx --++++++# if layer_idx is None: --++++++# logger.warning_once( --++++++# f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --++++++# "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++++# "when creating this class." --++++++# ) --++++++ --++++++# self.hidden_size = config.hidden_size --++++++# self.num_heads = config.num_attention_heads --++++++# self.head_dim = self.hidden_size // self.num_heads --++++++# self.num_key_value_heads = config.num_key_value_heads --++++++# self.num_key_value_groups = self.num_heads // self.num_key_value_heads --++++++# self.max_position_embeddings = config.max_position_embeddings --++++++# self.rope_theta = config.rope_theta --++++++# self.is_causal = True --++++++# self.attention_dropout = config.attention_dropout --++++++ --++++++# if (self.head_dim * self.num_heads) != self.hidden_size: --++++++# raise ValueError( --++++++# f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" --++++++# f" and `num_heads`: {self.num_heads})." --++++++# ) --++++++# self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --++++++# self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++# self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --++++++# self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --++++++ --++++++# self.rotary_emb = Qwen2MoeRotaryEmbedding( --++++++# self.head_dim, --++++++# max_position_embeddings=self.max_position_embeddings, --++++++# base=self.rope_theta, --++++++# ) --++++++ --++++++# def forward( --++++++# self, --++++++# hidden_states: mindspore.Tensor, --++++++# attention_mask: Optional[mindspore.Tensor] = None, --++++++# position_ids: Optional[mindspore.Tensor] = None, --++++++# past_key_value: Optional[Cache] = None, --++++++# output_attentions: bool = False, --++++++# use_cache: bool = False, --++++++# cache_position: Optional[mindspore.Tensor] = None, --++++++# ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --++++++ --++++++ --++++++ --++++++# bsz, q_len, _ = hidden_states.shape --++++++ --++++++# query_states = self.q_proj(hidden_states) --++++++# key_states = self.k_proj(hidden_states) --++++++# value_states = self.v_proj(hidden_states) --++++++ --++++++# query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --++++++# key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++++# value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --++++++ --++++++# kv_seq_len = key_states.shape[-2] --++++++# if past_key_value is not None: --++++++# if self.layer_idx is None: --++++++# raise ValueError( --++++++# f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --++++++# "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --++++++# "with a layer index." --++++++# ) --++++++# if isinstance(past_key_value, StaticCache): --++++++# kv_seq_len = key_states.shape[-2] --++++++# else: --++++++# kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++# cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --++++++# query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --++++++ --++++++# if past_key_value is not None: --++++++# cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++++# key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++# if isinstance(past_key_value, StaticCache): --++++++# kv_seq_len = key_states.shape[-2] --++++++ --++++++# # repeat k/v heads if n_kv_heads < n_heads --++++++# key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++# value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++ --++++++# attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++++ --++++++# if attention_mask is not None: --++++++# causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++++# attn_weights = attn_weights + causal_mask --++++++ --++++++# # upcast attention to fp32 --++++++# attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++++++# attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++++++# attn_output = ops.matmul(attn_weights, value_states) --++++++ --++++++# if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++++++# raise ValueError( --++++++# f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --++++++# f" {attn_output.shape}" --++++++# ) --++++++ --++++++# attn_output = ops.transpose(attn_output, 1, 2) --++++++# attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++++ --++++++# attn_output = self.o_proj(attn_output) --++++++# # @lwx --++++++ --++++++# # max_seq_len = self.max_position_embeddings # 2048 --++++++ --++++++# # if attention_mask is not None: --++++++# # # attention_mask: [B, 1, Sq, Sk] --++++++# # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --++++++ --++++++# # # pad 到 [max_seq_len, max_seq_len] --++++++# # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --++++++# # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --++++++# # global_attention_mask = padded_mask --++++++# # else: --++++++# # global_attention_mask = None --++++++ --++++++ --++++++# # sparse_mode=3 --++++++# # attn_output = mindspore.ops.flash_attention_score( --++++++# # query=query_states, --++++++# # key=key_states, --++++++# # value=value_states, --++++++# # real_shift=None, --++++++# # padding_mask=None, --++++++ --++++++# # head_num=self.num_heads, --++++++# # attn_mask=global_attention_mask, --++++++# # keep_prob=1.0 - self.attention_dropout, --++++++# # scalar_value=1.0 / math.sqrt(self.head_dim), --++++++# # input_layout="BNSD", --++++++# # pre_tokens=2147483647, --++++++# # next_tokens=2147483647, --++++++# # inner_precise=0, --++++++# # drop_mask=None, --++++++# # prefix=None, --++++++# # actual_seq_qlen=None, --++++++# # actual_seq_kvlen=None, --++++++# # sparse_mode=sparse_mode, --++++++# # ) --++++++# if not output_attentions: --++++++# attn_weights = None --++++++ --++++++# return attn_output, attn_weights, past_key_value --++++++ --+++++ class Qwen2MoeAttention(nn.Module): --+++++ """ --+++++- Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer --+++++- and "Generating Long Sequences with Sparse Transformers". --+++++- """ --++++++ 一个融合了 Eager 和 Flash Attention 实现的统一注意力模块。 --+++++ --++++++ 本模块在 `forward` 方法内部根据全局变量 `Long_Prompt` 的值进行动态调度: --++++++ - if Long_Prompt == 2: 使用高精度 Flash Attention 路径,针对长序列进行优化。 --++++++ - else: 使用标准的 Eager Attention 路径,保证短序列和解码阶段的数值一致性。 --++++++ --++++++ 这避免了在外部(如 DecoderLayer)进行复杂的对象实例化切换。 --++++++ """ --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++ super().__init__() --+++++ self.config = config --+++++@@ -331,7 +531,7 @@ class Qwen2MoeAttention(nn.Module): --+++++ if layer_idx is None: --+++++ logger.warning_once( --+++++ f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will " --+++++- "to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --++++++ "lead to errors during the forward call, if caching is used. Please make sure to provide a `layer_idx` " --+++++ "when creating this class." --+++++ ) --+++++ --+++++@@ -371,110 +571,86 @@ class Qwen2MoeAttention(nn.Module): --+++++ use_cache: bool = False, --+++++ cache_position: Optional[mindspore.Tensor] = None, --+++++ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++- --+++++ --+++++- --++++++ # --- 1. 通用计算部分 (Projections, RoPE, KV Cache) --- --+++++ bsz, q_len, _ = hidden_states.shape --+++++ --+++++ query_states = self.q_proj(hidden_states) --+++++ key_states = self.k_proj(hidden_states) --+++++ value_states = self.v_proj(hidden_states) --+++++ --+++++- query_states = ops.transpose(query_states.view(bsz, q_len, self.num_heads, self.head_dim), 1, 2) --+++++- key_states = ops.transpose(key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++- value_states = ops.transpose(value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim), 1, 2) --+++++- --++++++ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --++++++ --+++++ kv_seq_len = key_states.shape[-2] --+++++ if past_key_value is not None: --+++++- if self.layer_idx is None: --+++++- raise ValueError( --+++++- f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++- "with a layer index." --+++++- ) --+++++- if isinstance(past_key_value, StaticCache): --+++++- kv_seq_len = key_states.shape[-2] --+++++- else: --+++++- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --++++++ --+++++ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++ --+++++ if past_key_value is not None: --+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --++++++ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --++++++ --++++++ # --- 2. 动态调度核心注意力计算 --- --++++++ global Long_Prompt --++++++ if Long_Prompt >= 1: --++++++ # --- Flash Attention 路径 (高精度,用于长序列 prefill) --- --++++++ fa_attention_mask = None --++++++ if attention_mask is not None: --++++++ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --++++++ fa_attention_mask = (mask_slice != 0) --++++++ --++++++ attn_output = mindspore.ops.flash_attention_score( --++++++ query=query_states, --++++++ key=key_states, --++++++ value=value_states, --++++++ head_num=self.num_heads, --++++++ attn_mask=fa_attention_mask, --++++++ keep_prob=1.0 - self.attention_dropout if self.training else 1.0, --++++++ scalar_value=1.0 / math.sqrt(self.head_dim), --++++++ input_layout="BNSD", --++++++ sparse_mode=0, --++++++ inner_precise=0 # 使用高精度模式以对齐 Eager 结果 --++++++ ) --+++++ --+++++- if isinstance(past_key_value, StaticCache): --+++++- kv_seq_len = key_states.shape[-2] --++++++ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --++++++ attn_output = self.o_proj(attn_output) --++++++ attn_weights = None --++++++ if output_attentions: --++++++ logger.warning_once("Flash Attention path is used, but `output_attentions=True`. Flash Attention does not return attention weights.") --+++++ --+++++- # repeat k/v heads if n_kv_heads < n_heads --+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++- --+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --++++++ else: --++++++ # --- Eager Attention 路径 (用于短序列和解码) --- --++++++ key_states = repeat_kv(key_states, self.num_key_value_groups) --++++++ value_states = repeat_kv(value_states, self.num_key_value_groups) --++++++ --++++++ attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++ --+++++- if attention_mask is not None: --+++++- causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++- attn_weights = attn_weights + causal_mask --++++++ if attention_mask is not None: --++++++ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --++++++ attn_weights = attn_weights + causal_mask --+++++ --+++++- # upcast attention to fp32 --+++++- attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --+++++- attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --+++++- attn_output = ops.matmul(attn_weights, value_states) --++++++ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=mindspore.float32).to(query_states.dtype) --++++++ attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training) --++++++ attn_output = ops.matmul(attn_weights, value_states) --+++++ --+++++- if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --+++++- raise ValueError( --+++++- f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" --+++++- f" {attn_output.shape}" --+++++- ) --++++++ if attn_output.shape != (bsz, self.num_heads, q_len, self.head_dim): --++++++ raise ValueError( --++++++ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is {attn_output.shape}" --++++++ ) --+++++ --+++++- attn_output = ops.transpose(attn_output, 1, 2) --+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++++ attn_output = ops.transpose(attn_output, 1, 2) --++++++ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --++++++ attn_output = self.o_proj(attn_output) --+++++ --+++++- attn_output = self.o_proj(attn_output) --+++++- # @lwx --++++++ if not output_attentions: --++++++ attn_weights = None --+++++ --+++++- # max_seq_len = self.max_position_embeddings # 2048 --+++++- --+++++- # if attention_mask is not None: --+++++- # # attention_mask: [B, 1, Sq, Sk] --+++++- # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++- --+++++- # # pad 到 [max_seq_len, max_seq_len] --+++++- # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++- # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++- # global_attention_mask = padded_mask --+++++- # else: --+++++- # global_attention_mask = None --+++++- --+++++- --+++++- # sparse_mode=3 --+++++- # attn_output = mindspore.ops.flash_attention_score( --+++++- # query=query_states, --+++++- # key=key_states, --+++++- # value=value_states, --+++++- # real_shift=None, --+++++- # padding_mask=None, --+++++- --+++++- # head_num=self.num_heads, --+++++- # attn_mask=global_attention_mask, --+++++- # keep_prob=1.0 - self.attention_dropout, --+++++- # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++- # input_layout="BNSD", --+++++- # pre_tokens=2147483647, --+++++- # next_tokens=2147483647, --+++++- # inner_precise=0, --+++++- # drop_mask=None, --+++++- # prefix=None, --+++++- # actual_seq_qlen=None, --+++++- # actual_seq_kvlen=None, --+++++- # sparse_mode=sparse_mode, --+++++- # ) --+++++- if not output_attentions: --+++++- attn_weights = None --+++++- --+++++ return attn_output, attn_weights, past_key_value --+++++ --+++++- --+++++ # class Qwen2MoeFlashAttention(nn.Module): --+++++ # """ --+++++ # Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++@@ -899,578 +1075,6 @@ QWEN2MOE_ATTENTION_CLASSES = { --+++++ # return final_hidden_states, router_logits --+++++ --+++++ --+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++-# """ --+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++-# 它包含一个顶层 `forward` 方法,根据输入序列的长度智能地分发到 --+++++-# 专门优化的 `_moe_infer_decode` (用于单 token 生成) 或 --+++++-# `_moe_infer_prefill` (用于长序列处理) 方法。 --+++++-# """ --+++++-# def __init__(self, config: Qwen2MoeConfig): --+++++-# super().__init__() --+++++-# self.num_experts = config.num_experts --+++++-# self.top_k = config.num_experts_per_tok --+++++-# self.norm_topk_prob = config.norm_topk_prob --+++++- --+++++-# # 门控网络 --+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++-# # 专家列表 --+++++-# self.experts = nn.ModuleList( --+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++-# ) --+++++-# # 共享专家 --+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_decode( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# """ --+++++-# 【解码路径】针对 sequence_length=1 的极致优化。 --+++++-# 使用向量化操作处理一个批次 (batch) 的单 token 输入。 --+++++-# """ --+++++-# batch_size, hidden_dim = hidden_states.shape --+++++- --+++++-# expert_outputs_list = [ --+++++-# ops.cat([ --+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++-# ], dim=0) --+++++-# for i in range(batch_size) --+++++-# ] --+++++- --+++++-# # --- 错误修复:将 axis=0 修改为 dim=0 --- --+++++-# # shape: (batch_size, top_k, hidden_dim) --+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++- --+++++-# # 使用批量矩阵乘法 (bmm) 高效完成加权求和 --+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++- --+++++-# return moe_output.squeeze(1) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_prefill( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# """ --+++++-# 【预填充路径】针对 sequence_length > 1 的优化。 --+++++-# 按专家对 Token 进行分组,并进行批处理。 --+++++-# """ --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens = hidden_states.shape[0] --+++++-# flat_selected_experts = selected_experts.flatten() --+++++- --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++- --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++- --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++- --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++-# selected_token_indices = token_indices[mask] --+++++-# selected_routing_weights = routing_weights.flatten()[mask] --+++++- --+++++-# current_states = hidden_states[selected_token_indices] --+++++- --+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++- --+++++-# moe_output = moe_output.index_add( --+++++-# dim=0, --+++++-# index=selected_token_indices, --+++++-# source=expert_output.to(hidden_states.dtype) --+++++-# ) --+++++-# return moe_output --+++++- --+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-# """ --+++++-# 顶层 forward 方法,作为智能分发器。 --+++++-# """ --+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- --+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-# router_logits = self.gate(hidden_states_reshaped) --+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- --+++++-# if self.norm_topk_prob: --+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- --+++++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++-# moe_output = None --+++++-# # 在推理时,根据序列长度选择最优路径 --+++++-# if not self.training: --+++++-# if sequence_length == 1: --+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++++-# else: --+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++++-# else: --+++++-# # 可以在此实现训练逻辑,如果暂时不需要,直接报错是安全的 --+++++-# raise NotImplementedError("Training path is not implemented.") --+++++- --+++++-# shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++-# shared_expert_gate_output = self.shared_expert_gate(hidden_states_reshaped) --+++++-# shared_expert_weights = F.sigmoid(shared_expert_gate_output) --+++++- --+++++-# final_hidden_states = moe_output + shared_expert_output * shared_expert_weights --+++++- --+++++-# final_hidden_states = final_hidden_states.view(batch_size, sequence_length, hidden_dim) --+++++- --+++++-# return final_hidden_states, router_logits --+++++- --+++++- --+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++-# """ --+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++-# 该版本修复了原始优化版本中因共享专家处理路径不统一而导致的计算结果不匹配问题。 --+++++-# """ --+++++-# def __init__(self, config: Qwen2MoeConfig): --+++++-# super().__init__() --+++++-# self.num_experts = config.num_experts --+++++-# self.top_k = config.num_experts_per_tok --+++++-# self.norm_topk_prob = config.norm_topk_prob --+++++- --+++++-# # 门控网络 --+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++-# # 专家列表 --+++++-# self.experts = nn.ModuleList( --+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++-# ) --+++++-# # 共享专家 --+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_decode( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# batch_size, _ = hidden_states.shape --+++++-# expert_outputs_list = [ --+++++-# ops.cat([ --+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++-# ], dim=0) --+++++-# for i in range(batch_size) --+++++-# ] --+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++-# moe_output = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++-# return moe_output.squeeze(1) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_prefill( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens = hidden_states.shape[0] --+++++-# flat_selected_experts = selected_experts.flatten() --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++- --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++-# selected_token_indices = token_indices[mask] --+++++-# selected_routing_weights = routing_weights.flatten()[mask] --+++++-# current_states = hidden_states[selected_token_indices] --+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++-# moe_output = moe_output.index_add( --+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++++-# ) --+++++-# return moe_output --+++++- --+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-# """ --+++++-# 顶层 forward 方法,作为智能分发器。 --+++++-# 【修正版】确保共享专家的计算逻辑在所有路径中保持一致。 --+++++-# """ --+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- --+++++-# # 1. 门控计算 (通用逻辑) --+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-# router_logits = self.gate(hidden_states_reshaped) --+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- --+++++-# if self.norm_topk_prob: --+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- --+++++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++-# # 2. 智能分发到最优 MoE 路径 --+++++-# moe_output = None --+++++-# if not self.training: --+++++-# if sequence_length == 1: --+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights) --+++++-# else: --+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights) --+++++-# else: --+++++-# raise NotImplementedError("Training path is not implemented.") --+++++- --+++++-# # 3. 【关键修正】统一在这里处理共享专家,确保逻辑一致 --+++++-# # 共享专家和它的门控网络,都作用于 reshape 后的张量 --+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++- --+++++-# # 4. 合并 MoE 输出和共享专家输出 --+++++-# # 两个张量的 shape 都是 [num_tokens, hidden_dim],直接相加 --+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++- --+++++-# # 5. 恢复原始形状并返回 --+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++- --+++++-# return final_hidden_states, router_logits --+++++- --+++++-# prefill fastest --+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++-# """ --+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++-# 【最终修正版】:统一了 decode 和 prefill 路径的核心计算内核 (index_add), --+++++-# 以确保在所有情况下计算结果 100% 匹配,同时保留了路径分发带来的性能优势。 --+++++-# """ --+++++-# def __init__(self, config: Qwen2MoeConfig): --+++++-# super().__init__() --+++++-# self.num_experts = config.num_experts --+++++-# self.top_k = config.num_experts_per_tok --+++++-# self.norm_topk_prob = config.norm_topk_prob --+++++- --+++++-# # 门控网络 --+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++-# # 专家列表 --+++++-# self.experts = nn.ModuleList( --+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++-# ) --+++++-# # 共享专家 --+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_dispatch( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# """ --+++++-# 【统一计算内核】:无论是 decode 还是 prefill,都使用与原始代码完全一致的 `index_add` 逻辑。 --+++++-# 这保证了浮点数运算的顺序和方式完全相同,从而确保结果的一致性。 --+++++-# """ --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens, _ = hidden_states.shape --+++++- --+++++-# # 将专家索引和权重展平,这对于 prefill 和 decode 都是通用的 --+++++-# flat_selected_experts = selected_experts.flatten() --+++++-# flat_routing_weights = routing_weights.flatten() --+++++- --+++++-# # 创建 token_idx 用于将计算结果映射回正确的 token 位置 --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++- --+++++-# # 找到所有被激活的专家(对于 decode 来说,这步开销极小) --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++- --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++- --+++++-# # 找到所有分配给该专家的 token --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++- --+++++-# # 使用 mask 选取对应的 token 和权重 --+++++-# current_token_indices = token_indices[mask] --+++++-# current_routing_weights = flat_routing_weights[mask] --+++++-# current_hidden_states = hidden_states[current_token_indices] --+++++- --+++++-# # 对这些 token 进行批处理 --+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++++- --+++++-# # 使用 index_add 将结果精确地加回到对应位置 --+++++-# moe_output = moe_output.index_add( --+++++-# dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype) --+++++-# ) --+++++-# return moe_output --+++++- --+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-# """ --+++++-# 顶层 forward 方法,作为智能分发器。 --+++++-# """ --+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- --+++++-# # 1. 门控计算 --+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-# router_logits = self.gate(hidden_states_reshaped) --+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- --+++++-# if self.norm_topk_prob: --+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- --+++++-# routing_weights = routing_weights.to(hidden_states.dtype) --+++++- --+++++-# # 2. 调用统一的 MoE 计算内核 --+++++-# # 我们不再需要区分 decode 和 prefill,因为这个函数对两者都高效且正确 --+++++-# moe_output = self._moe_infer_dispatch(hidden_states_reshaped, selected_experts, routing_weights) --+++++- --+++++-# # 3. 统一处理共享专家 --+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++- --+++++-# # 4. 合并输出 --+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++- --+++++-# # 5. 恢复原始形状并返回 --+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++- --+++++-# return final_hidden_states, router_logits --+++++- --+++++- --+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++-# """ --+++++-# 一个混合专家模块 (MoE block),其结构模仿了 DeepseekMoE 的高效推理封装。 --+++++-# 【最终高性能与高精度版】: --+++++-# 1. 解码路径使用 bmm 算子以达到最大推理速度。 --+++++-# 2. 在 bmm 计算前,强制将输入提升到 float32 进行高精度累加,以消除 --+++++-# 因并行计算顺序差异导致的浮点数误差,确保结果与串行逻辑一致。 --+++++-# 3. 这样实现了速度和准确性的两全其美。 --+++++-# """ --+++++-# def __init__(self, config: Qwen2MoeConfig): --+++++-# super().__init__() --+++++-# self.num_experts = config.num_experts --+++++-# self.top_k = config.num_experts_per_tok --+++++-# self.norm_topk_prob = config.norm_topk_prob --+++++- --+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++-# self.experts = nn.ModuleList( --+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++-# ) --+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_decode( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# """ --+++++-# 【解码路径】极致优化版:bmm + 高精度累加。 --+++++-# """ --+++++-# original_dtype = hidden_states.dtype --+++++-# batch_size, _ = hidden_states.shape --+++++- --+++++-# expert_outputs_list = [ --+++++-# ops.cat([ --+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++-# ], dim=0) --+++++-# for i in range(batch_size) --+++++-# ] --+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++- --+++++-# # 在 float32 下执行 bmm,得到高精度结果 --+++++-# moe_output_fp32 = ops.bmm(routing_weights.unsqueeze(1), expert_outputs_stacked) --+++++- --+++++-# # 将高精度结果转换回原始数据类型 --+++++-# moe_output = moe_output_fp32.squeeze(1).to(original_dtype) --+++++- --+++++-# return moe_output --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_prefill( --+++++-# self, --+++++-# hidden_states: mindspore.Tensor, --+++++-# selected_experts: mindspore.Tensor, --+++++-# routing_weights: mindspore.Tensor --+++++-# ) -> mindspore.Tensor: --+++++-# """ --+++++-# 【预填充路径】与原始实现一致,结果精确。 --+++++-# """ --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens, _ = hidden_states.shape --+++++-# flat_selected_experts = selected_experts.flatten() --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++- --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++-# selected_token_indices = token_indices[mask] --+++++-# selected_routing_weights = routing_weights.flatten()[mask] --+++++-# current_states = hidden_states[selected_token_indices] --+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++-# moe_output = moe_output.index_add( --+++++-# dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype) --+++++-# ) --+++++-# return moe_output --+++++- --+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++- --+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-# router_logits = self.gate(hidden_states_reshaped) --+++++-# routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-# routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++- --+++++-# if self.norm_topk_prob: --+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- --+++++-# # 注意:这里我们保留 routing_weights 为 float32,因为它在 decode 路径中需要高精度 --+++++-# # 如果模型主体是 float16,后续再转换 --+++++- --+++++-# moe_output = None --+++++-# if not self.training: --+++++-# # 传递给 decode 的 routing_weights 是 fp32,而 hidden_states 是原始类型 --+++++-# # _moe_infer_decode 内部会处理好类型转换 --+++++-# temp_routing_weights = routing_weights.to(hidden_states.dtype) --+++++-# if sequence_length == 1: --+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++++-# else: --+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, temp_routing_weights) --+++++-# else: --+++++-# raise NotImplementedError("Training path is not implemented.") --+++++- --+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++- --+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++- --+++++-# return final_hidden_states, router_logits --+++++- --+++++- --+++++-# class Qwen2MoeSparseMoeBlock(nn.Module): --+++++-# """ --+++++-# 【融合版】一个混合专家模块,内置两种推理策略, --+++++-# 由外部全局变量 `Long_Prompt` 控制: --+++++- --+++++-# - if Long_Prompt is True: 【精度优先模式】 --+++++-# 采用统一的 index_add 内核,保证在任何情况下结果都 100% 匹配。 --+++++-# 适用于处理长序列,避免误差累积。 --+++++- --+++++-# - if Long_Prompt is False: 【速度优先模式】 --+++++-# 智能分发到 prefill(index_add) 和 decode(bmm+fp32) 路径, --+++++-# 在解码阶段获得极致速度,同时保证结果高度准确。 --+++++-# """ --+++++-# def __init__(self, config: Qwen2MoeConfig): --+++++-# super().__init__() --+++++-# self.num_experts = config.num_experts --+++++-# self.top_k = config.num_experts_per_tok --+++++-# self.norm_topk_prob = config.norm_topk_prob --+++++- --+++++-# self.gate = nn.Linear(config.hidden_size, config.num_experts, bias=False) --+++++-# self.experts = nn.ModuleList( --+++++-# [Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(self.num_experts)] --+++++-# ) --+++++-# self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++-# self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-# # --- 速度优先模式的辅助函数 --- --+++++-# @no_grad() --+++++-# def _moe_infer_decode(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++-# original_dtype = hidden_states.dtype --+++++-# batch_size, _ = hidden_states.shape --+++++-# expert_outputs_list = [ --+++++-# ops.cat([ --+++++-# self.experts[expert_idx.item()](hidden_states[i:i+1]) for expert_idx in selected_experts[i] --+++++-# ], dim=0) --+++++-# for i in range(batch_size) --+++++-# ] --+++++-# expert_outputs_stacked = ops.stack(expert_outputs_list, dim=0) --+++++-# weights_fp32 = routing_weights.to(mindspore.float32) --+++++-# outputs_fp32 = expert_outputs_stacked.to(mindspore.float32) --+++++-# moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++++-# return moe_output_fp32.squeeze(1).to(original_dtype) --+++++- --+++++-# @no_grad() --+++++-# def _moe_infer_prefill(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens, _ = hidden_states.shape --+++++-# flat_selected_experts = selected_experts.flatten() --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++-# selected_token_indices = token_indices[mask] --+++++-# selected_routing_weights = routing_weights.flatten()[mask] --+++++-# current_states = hidden_states[selected_token_indices] --+++++-# expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++-# moe_output = moe_output.index_add(dim=0, index=selected_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++-# return moe_output --+++++- --+++++-# # --- 精度优先模式的辅助函数 --- --+++++-# @no_grad() --+++++-# def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++-# moe_output = ops.zeros_like(hidden_states) --+++++-# num_tokens, _ = hidden_states.shape --+++++-# flat_selected_experts = selected_experts.flatten() --+++++-# flat_routing_weights = routing_weights.flatten() --+++++-# token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1).broadcast_to((-1, self.top_k)).flatten() --+++++-# active_experts = ops.unique(flat_selected_experts) --+++++-# for expert_idx_tensor in active_experts: --+++++-# expert_idx = expert_idx_tensor.item() --+++++-# expert_layer = self.experts[expert_idx] --+++++-# mask = (flat_selected_experts == expert_idx_tensor) --+++++-# current_token_indices = token_indices[mask] --+++++-# current_routing_weights = flat_routing_weights[mask] --+++++-# current_hidden_states = hidden_states[current_token_indices] --+++++-# expert_output = expert_layer(current_hidden_states) * current_routing_weights.unsqueeze(1) --+++++-# moe_output = moe_output.index_add(dim=0, index=current_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++-# return moe_output --+++++- --+++++-# def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-# # 声明我们将要使用一个在模块外部定义的全局变量 --+++++-# # 这是一个简单的实现方式,更复杂的工程中可能会使用配置对象传递 --+++++-# global Long_Prompt --+++++- --+++++-# # 1. 门控计算 (所有模式通用) --+++++-# batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++-# hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-# router_logits = self.gate(hidden_states_reshaped) --+++++-# routing_weights_fp32 = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-# routing_weights, selected_experts = ops.topk(routing_weights_fp32, self.top_k, dim=-1) --+++++-# if self.norm_topk_prob: --+++++-# routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++- --+++++-# moe_output = None --+++++-# if not self.training: --+++++-# # 根据 Long_Prompt 标志选择模式 --+++++-# if Long_Prompt: --+++++-# # --- 精度优先模式 --- --+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++-# moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++-# else: --+++++-# # --- 速度优先模式 --- --+++++-# routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++-# if sequence_length == 1: --+++++-# moe_output = self._moe_infer_decode(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++-# else: --+++++-# moe_output = self._moe_infer_prefill(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++-# else: --+++++-# raise NotImplementedError("Training path is not implemented.") --+++++- --+++++-# gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++-# F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) --+++++- --+++++-# final_hidden_states_reshaped = moe_output + gated_shared_expert_output --+++++-# final_hidden_states = final_hidden_states_reshaped.view(batch_size, sequence_length, hidden_dim) --+++++- --+++++-# return final_hidden_states, router_logits --+++++- --+++++ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ """ --+++++ 【最终融合版】一个混合专家模块,内置两种由外部全局变量 `Long_Prompt` --+++++@@ -1515,29 +1119,71 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ moe_output_fp32 = ops.bmm(weights_fp32.unsqueeze(1), outputs_fp32) --+++++ return moe_output_fp32.squeeze(1).to(original_dtype) --+++++ --++++++ # @no_grad() --++++++ # def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --++++++ # num_tokens, _ = hidden_states.shape --++++++ # flat_selected_experts = selected_experts.flatten() --++++++ # sorted_expert_indices = flat_selected_experts.argsort() --++++++ # tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --++++++ # original_token_indices = sorted_expert_indices // self.top_k --++++++ # moe_output = ops.zeros_like(hidden_states) --++++++ # current_token_offset = 0 --++++++ # for i in range(self.num_experts): --++++++ # expert_token_count = tokens_per_expert[i] - current_token_offset --++++++ # if expert_token_count == 0: --++++++ # continue --++++++ # end_offset = current_token_offset + expert_token_count --++++++ # expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --++++++ # expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --++++++ # expert_hidden_states = hidden_states[expert_original_token_indices] --++++++ # expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --++++++ # expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --++++++ # moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --++++++ # current_token_offset += expert_token_count --++++++ # return moe_output --++++++ --+++++ @no_grad() --+++++ def _moe_infer_prefill_fast_deepspeed_style(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++- num_tokens, _ = hidden_states.shape --+++++- flat_selected_experts = selected_experts.flatten() --+++++- sorted_expert_indices = flat_selected_experts.argsort() --+++++- tokens_per_expert = flat_selected_experts.bincount(minlength=self.num_experts).cumsum(0) --+++++- original_token_indices = sorted_expert_indices // self.top_k --++++++ """ --++++++ 优化版 MoE prefill (速度优先模式): --++++++ - 批量张量化处理同一个 expert 的所有 token --++++++ - 跳过无 token 的专家 --++++++ - 保持结果完全一致 --++++++ """ --+++++ moe_output = ops.zeros_like(hidden_states) --+++++- current_token_offset = 0 --+++++- for i in range(self.num_experts): --+++++- expert_token_count = tokens_per_expert[i] - current_token_offset --+++++- if expert_token_count == 0: --+++++- continue --+++++- end_offset = current_token_offset + expert_token_count --+++++- expert_original_token_indices = original_token_indices[current_token_offset:end_offset] --+++++- expert_sorted_indices = sorted_expert_indices[current_token_offset:end_offset] --+++++- expert_hidden_states = hidden_states[expert_original_token_indices] --+++++- expert_routing_weights = routing_weights.flatten()[expert_sorted_indices] --+++++- expert_output = self.experts[i](expert_hidden_states) * expert_routing_weights.unsqueeze(1) --+++++- moe_output = moe_output.index_add(dim=0, index=expert_original_token_indices, source=expert_output.to(hidden_states.dtype)) --+++++- current_token_offset += expert_token_count --++++++ --++++++ flat_selected_experts = selected_experts.flatten() --++++++ flat_routing_weights = routing_weights.flatten() --++++++ --++++++ idxs = flat_selected_experts.argsort() --++++++ sorted_expert_indices = flat_selected_experts[idxs] --++++++ sorted_token_indices = idxs // self.top_k --++++++ --++++++ tokens_per_expert = sorted_expert_indices.bincount(minlength=self.num_experts) --++++++ --++++++ active_experts = (tokens_per_expert > 0).nonzero(as_tuple=False).flatten() --++++++ --++++++ for expert_id in active_experts.tolist(): --++++++ start = int(tokens_per_expert[:expert_id].sum().item()) --++++++ end = start + int(tokens_per_expert[expert_id].item()) --++++++ --++++++ token_idx = sorted_token_indices[start:end] --++++++ expert_tokens = hidden_states[token_idx] --++++++ --++++++ expert_out = self.experts[expert_id](expert_tokens) --++++++ --++++++ scaled_out = expert_out * flat_routing_weights[idxs[start:end]].unsqueeze(1) --++++++ --++++++ moe_output = mindspore.mint.scatter_add( --++++++ moe_output, --++++++ 0, --++++++ token_idx.view(-1, 1).tile((1, hidden_states.shape[-1])), --++++++ scaled_out.to(hidden_states.dtype) --++++++ ) --++++++ --+++++ return moe_output --+++++ --++++++ --+++++ # --- 精度优先模式 (ACCURACY MODE) 的辅助函数 --- --+++++ @no_grad() --+++++ def _moe_infer_dispatch_accurate(self, hidden_states, selected_experts, routing_weights) -> mindspore.Tensor: --+++++@@ -1571,18 +1217,24 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++ --+++++ moe_output = None --+++++- if Long_Prompt: --+++++- # --- 精度优先模式 (ACCURACY MODE) --- --+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++- moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ # if Long_Prompt==0: --++++++ # # --- 精度优先模式 (ACCURACY MODE) --- --++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++ # moe_output = self._moe_infer_dispatch_accurate(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ # else: --++++++ # # --- 速度优先模式 (SPEED MODE) --- --++++++ # routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++ # if sequence_length == 1: --++++++ # moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ # else: --++++++ # moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ --++++++ routing_weights_casted = routing_weights.to(hidden_states.dtype) --++++++ if sequence_length == 1: --++++++ moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++ else: --+++++- # --- 速度优先模式 (SPEED MODE) --- --+++++- routing_weights_casted = routing_weights.to(hidden_states.dtype) --+++++- if sequence_length == 1: --+++++- moe_output = self._moe_infer_decode_fast(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++- else: --+++++- moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --+++++- --++++++ moe_output = self._moe_infer_prefill_fast_deepspeed_style(hidden_states_reshaped, selected_experts, routing_weights_casted) --++++++ --+++++ --+++++ # 3. 共享专家计算与合并 (所有模式通用) --+++++ gated_shared_expert_output = self.shared_expert(hidden_states_reshaped) * \ --+++++@@ -1593,15 +1245,16 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++ --+++++ return final_hidden_states, router_logits --+++++ --++++++ --+++++ class Qwen2MoeDecoderLayer(nn.Module): --+++++ def __init__(self, config: Qwen2MoeConfig, layer_idx: int): --+++++ super().__init__() --+++++ self.hidden_size = config.hidden_size --+++++ --+++++- # if Long_Prompt: --+++++- # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++- # else: --++++++ # if Long_Prompt == 2: --+++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --++++++ # else: --++++++ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ --+++++ self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++ --+++++@@ -1904,7 +1557,17 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++++ ) --+++++ --+++++ # In case the provided `attention` mask is 2D, we generate a causal mask here (4D). --+++++- causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++++ # causal_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++++ # attention_mask, --++++++ # sequence_length=sequence_length, --++++++ # target_length=target_length, --++++++ # dtype=dtype, --++++++ # min_dtype=min_dtype, --++++++ # cache_position=cache_position, --++++++ # batch_size=input_tensor.shape[0], --++++++ # ) --++++++ #@dwj --++++++ causal_mask = get_cached_causal_mask_with_cache_position( --+++++ attention_mask, --+++++ sequence_length=sequence_length, --+++++ target_length=target_length, --+++++@@ -2091,7 +1754,8 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ 重写 generate 方法,将其作为设置 MoE 策略的唯一入口。 --+++++ 这个方法是所有生成任务的“前门”,保证逻辑一定会被执行。 --+++++ """ --+++++- global Long_Prompt, PROMPT_LENGTH_THRESHOLD --++++++ global Long_Prompt, PROMPT_LENGTH_THRESHOLD,_causal_mask_cache --++++++ _causal_mask_cache.clear() --+++++ --+++++ input_ids = kwargs.get("input_ids") --+++++ if input_ids is None and args: --+++++@@ -2099,11 +1763,13 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ --+++++ if input_ids is not None: --+++++ prompt_length = input_ids.shape[1] --+++++- --+++++- if prompt_length > PROMPT_LENGTH_THRESHOLD: --+++++- Long_Prompt = True --++++++ if prompt_length > LONG_PROMPT_LENGTH_THRESHOLD: --++++++ Long_Prompt = 2 --++++++ elif prompt_length < SHORT_PROMPT_LENGTH_THRESHOLD: --++++++ Long_Prompt = 0 --+++++ else: --+++++- Long_Prompt = False --++++++ Long_Prompt = 1 --++++++ --+++++ --+++++ return super().generate(*args, **kwargs) --+++++ --+++++@@ -2154,7 +1820,18 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++ dtype = self.lm_head.weight.dtype --+++++ min_dtype = float(ops.finfo(dtype).min) --+++++ --+++++- attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++++ # attention_mask = _prepare_4d_causal_attention_mask_with_cache_position( --++++++ # attention_mask, --++++++ # sequence_length=sequence_length, --++++++ # target_length=past_key_values.get_max_length(), --++++++ # dtype=dtype, --++++++ # min_dtype=min_dtype, --++++++ # cache_position=cache_position, --++++++ # batch_size=batch_size, --++++++ # ) --++++++ --++++++ #@dwj --++++++ attention_mask = get_cached_causal_mask_with_cache_position( --+++++ attention_mask, --+++++ sequence_length=sequence_length, --+++++ target_length=past_key_values.get_max_length(), --+++++diff --git a/patches/0001-20251104commit.patch b/patches/0001-20251104commit.patch --+++++deleted file mode 100644 --+++++index 6dfb5b93..00000000 --+++++--- a/patches/0001-20251104commit.patch --++++++++ /dev/null --+++++@@ -1,1272 +0,0 @@ --+++++-From 1c7cdda5edcc67eb81880aeaa24f98aa46012c01 Mon Sep 17 00:00:00 2001 --+++++-From: Pinoeer-kingxi <13022943007@163.com> --+++++-Date: Tue, 4 Nov 2025 09:11:51 +0800 --+++++-Subject: [PATCH] 20251104commit --+++++- --+++++---- --+++++- mindnlp/transformers/cache_utils.py | 28 +- --+++++- .../models/deepseek/modeling_deepseek.py | 149 ++- --+++++- .../models/qwen2_moe/modeling_qwen2_moe.py | 886 ++++++++++++++++-- --+++++- 3 files changed, 976 insertions(+), 87 deletions(-) --+++++- --+++++-diff --git a/mindnlp/transformers/cache_utils.py b/mindnlp/transformers/cache_utils.py --+++++-index cadd2e04..02f8d4be 100644 --+++++---- a/mindnlp/transformers/cache_utils.py --+++++-+++ b/mindnlp/transformers/cache_utils.py --+++++-@@ -812,14 +812,26 @@ class StaticCache(Cache): --+++++- # # The operator 'aten::index_copy.out' is not currently implemented for the MPS device. --+++++- # k_out[:, :, cache_position] = key_states --+++++- # v_out[:, :, cache_position] = value_states --+++++-- if ON_ORANGE_PI: --+++++-- k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++-- v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++-- else: --+++++-- # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++-- k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++-- v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++-- --+++++-+ # if ON_ORANGE_PI: --+++++-+ # k_out = ops.inplace_index_add(k_out, 2, cache_position.int(), key_states) --+++++-+ # v_out = ops.inplace_index_add(v_out, 2, cache_position.int(), value_states) --+++++-+ # else: --+++++-+ # # use index_add for mindspore since tensor slice is too slow and no implementation of index_copy --+++++-+ # k_out = ops.index_add(k_out, 2, cache_position.int(), key_states) --+++++-+ # v_out = ops.index_add(v_out, 2, cache_position.int(), value_states) --+++++-+ # 确保 cache_position 是 1D tensor 并且类型正确 --+++++-+ # 根据官方文档: indices 必须是 1D tensor,且 indices.shape[0] == y.shape[axis] --+++++-+ if cache_position.ndim > 1: --+++++-+ cache_position = cache_position.flatten() --+++++-+ # 确保类型是 int32 或 int64(MindSpore 要求) --+++++-+ if cache_position.dtype not in (mindspore.int32, mindspore.int64): --+++++-+ cache_position = cache_position.int() --+++++-+ --+++++-+ # JIT 编译不支持 try-except,直接使用切片赋值(更简单且兼容 JIT) --+++++-+ # 切片赋值对于 StaticCache 是安全的,因为 cache_position 是预分配的索引 --+++++-+ k_out[:, :, cache_position] = key_states --+++++-+ v_out[:, :, cache_position] = value_states --+++++-+ --+++++- return k_out, v_out --+++++- --+++++- def get_seq_length(self, layer_idx: Optional[int] = 0) -> int: --+++++-diff --git a/mindnlp/transformers/models/deepseek/modeling_deepseek.py b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++-index c695b944..d8303e45 100644 --+++++---- a/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++-+++ b/mindnlp/transformers/models/deepseek/modeling_deepseek.py --+++++-@@ -210,8 +210,10 @@ class DeepseekDynamicNTKScalingRotaryEmbedding(DeepseekRotaryEmbedding): --+++++- # Copied from transformers.models.llama.modeling_llama.rotate_half --+++++- def rotate_half(x): --+++++- """Rotates half the hidden dims of the input.""" --+++++-- x1 = x[..., : x.shape[-1] // 2] --+++++-- x2 = x[..., x.shape[-1] // 2 :] --+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++-+ # x1 = x[..., : x.shape[-1] // 2] --+++++-+ # x2 = x[..., x.shape[-1] // 2 :] --+++++-+ x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++- return ops.cat((-x2, x1), dim=-1) --+++++- --+++++- --+++++-@@ -385,32 +387,42 @@ class DeepseekMoE(nn.Module): --+++++- if self.training: --+++++- raise NotImplementedError("Training is not supported yet.") --+++++- else: --+++++-- y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++-- if self.config.n_shared_experts is not None: --+++++-- y = y + self.shared_experts(identity) --+++++-- return y --+++++-+ # @lwx --+++++-+ if orig_shape[1] == 1: --+++++-+ y=self.moe_infer_decode(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)) --+++++-+ y=y.view(*orig_shape) --+++++-+ if self.config.n_shared_experts is not None: --+++++-+ y = y + self.shared_experts(identity) --+++++-+ return y --+++++-+ else: --+++++-+ y= self.moe_infer_prefill(hidden_states,flat_topk_idx,topk_weight.view(-1, 1)).view(*orig_shape) --+++++-+ if self.config.n_shared_experts is not None: --+++++-+ y = y + self.shared_experts(identity) --+++++-+ return y --+++++-+ # y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape) --+++++-+ # if self.config.n_shared_experts is not None: --+++++-+ # y = y + self.shared_experts(identity) --+++++-+ # return y --+++++-+ --+++++-+ @no_grad() --+++++-+ def moe_infer_decode(self, x, flat_expert_indices, flat_expert_weights): --+++++-+ --+++++-+ expert_cache = ops.zeros_like(x) --+++++-+ for i in range(self.num_experts_per_tok): --+++++-+ expert_id = flat_expert_indices[i].item() --+++++-+ weight = flat_expert_weights[i].item() --+++++-+ expert = self.experts[expert_id] --+++++-+ expert_out = expert(x) --+++++-+ expert_cache += expert_out * weight --+++++-+ return expert_cache --+++++- --+++++- @no_grad() --+++++-- def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++-- # expert_cache = torch.zeros_like(x) --+++++-- # idxs = flat_expert_indices.argsort() --+++++-- # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++-- # token_idxs = idxs // self.num_experts_per_tok --+++++-- # for i, end_idx in enumerate(tokens_per_expert): --+++++-- # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++-- # if start_idx == end_idx: --+++++-- # continue --+++++-- # expert = self.experts[i] --+++++-- # exp_token_idx = token_idxs[start_idx:end_idx] --+++++-- # expert_tokens = x[exp_token_idx] --+++++-- # expert_out = expert(expert_tokens) --+++++-- # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++-- # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++-- # return expert_cache --+++++-+ def moe_infer_prefill(self, x, flat_expert_indices, flat_expert_weights): --+++++- expert_cache = ops.zeros_like(x) --+++++- idxs = flat_expert_indices.argsort() --+++++- tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++- token_idxs = idxs // self.num_experts_per_tok --+++++-+ --+++++- for i, end_idx in enumerate(tokens_per_expert): --+++++- start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++- if start_idx == end_idx: --+++++-@@ -421,7 +433,76 @@ class DeepseekMoE(nn.Module): --+++++- expert_out = expert(expert_tokens) --+++++- expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++- expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++-+ --+++++- return expert_cache --+++++-+ --+++++-+ # @no_grad() --+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++-+ # # expert_cache = torch.zeros_like(x) --+++++-+ # # idxs = flat_expert_indices.argsort() --+++++-+ # # tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0) --+++++-+ # # token_idxs = idxs // self.num_experts_per_tok --+++++-+ # # for i, end_idx in enumerate(tokens_per_expert): --+++++-+ # # start_idx = 0 if i == 0 else tokens_per_expert[i - 1] --+++++-+ # # if start_idx == end_idx: --+++++-+ # # continue --+++++-+ # # expert = self.experts[i] --+++++-+ # # exp_token_idx = token_idxs[start_idx:end_idx] --+++++-+ # # expert_tokens = x[exp_token_idx] --+++++-+ # # expert_out = expert(expert_tokens) --+++++-+ # # expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++-+ # # expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce='sum') --+++++-+ # # return expert_cache --+++++-+ # expert_cache = ops.zeros_like(x) --+++++-+ # idxs = flat_expert_indices.argsort() --+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++-+ # token_idxs = idxs // self.num_experts_per_tok --+++++-+ --+++++-+ # for i, end_idx in enumerate(tokens_per_expert): --+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++-+ # if start_idx == end_idx: --+++++-+ # continue --+++++-+ # expert = self.experts[i] --+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++-+ # expert_tokens = x[exp_token_idx] --+++++-+ # expert_out = expert(expert_tokens) --+++++-+ # expert_out = expert_out.mul(flat_expert_weights[idxs[start_idx:end_idx]]) --+++++-+ # expert_cache = mindspore.mint.scatter_add(expert_cache, 0, exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), expert_out) --+++++-+ --+++++-+ # return expert_cache --+++++-+ # @no_grad() --+++++-+ # def moe_infer(self, x, flat_expert_indices, flat_expert_weights): --+++++-+ # expert_cache = ops.zeros_like(x) --+++++-+ --+++++-+ # # 排序保证顺序一致 --+++++-+ # idxs = flat_expert_indices.argsort() --+++++-+ # tokens_per_expert = flat_expert_indices.bincount().cumsum(0) --+++++-+ # token_idxs = idxs // self.num_experts_per_tok --+++++-+ --+++++-+ # # 找出有 token 的专家 --+++++-+ # active_experts = (tokens_per_expert > ops.cat((ops.zeros(1, dtype=tokens_per_expert.dtype), tokens_per_expert[:-1]))).nonzero().squeeze(-1) --+++++-+ --+++++-+ # for i in active_experts.tolist(): --+++++-+ # start_idx = 0 if i == 0 else tokens_per_expert[i-1] --+++++-+ # end_idx = tokens_per_expert[i] --+++++-+ # if start_idx == end_idx: # 没有 token --+++++-+ # continue --+++++-+ --+++++-+ # exp_token_idx = token_idxs[start_idx:end_idx] --+++++-+ # expert_tokens = x[exp_token_idx] --+++++-+ # expert_out = self.experts[i](expert_tokens) --+++++-+ # expert_out = expert_out * flat_expert_weights[idxs[start_idx:end_idx]] --+++++-+ --+++++-+ # expert_cache = mindspore.mint.scatter_add( --+++++-+ # expert_cache, --+++++-+ # 0, --+++++-+ # exp_token_idx.view(-1, 1).tile((1, x.shape[-1])), --+++++-+ # expert_out --+++++-+ # ) --+++++-+ --+++++-+ # return expert_cache --+++++-+ --+++++-+ --+++++- --+++++- # class AddAuxiliaryLoss(mindnlp.core.autograd.Function): --+++++- # """ --+++++-@@ -1103,6 +1184,26 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++- --+++++- # Initialize weights and apply final processing --+++++- self.post_init() --+++++-+ self.warm_up = False --+++++-+ --+++++-+ def warmup_moe_model_deep(self): --+++++-+ print("[Warmup] DeepSeek-MoE 模型预热开始...") --+++++-+ test_texts = [ --+++++-+ "warmup short", --+++++-+ "This is a medium length warmup sentence for MoE experts. middle middle middle", --+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover different attention paths. very very long, very very long, very very long" --+++++-+ ] --+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++-+ if tokenizer is None: --+++++-+ from mindnlp.transformers import AutoTokenizer --+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++-+ self._warmup_tokenizer = tokenizer --+++++-+ --+++++-+ for text in test_texts: --+++++-+ inputs = tokenizer(text, return_tensors="ms") --+++++-+ with mindspore._no_grad(): --+++++-+ _ = self(**inputs, use_cache=False) --+++++-+ print("[Warmup] DeepSeek-MoE 模型预热完成。") --+++++- --+++++- def get_input_embeddings(self): --+++++- return self.model.embed_tokens --+++++-@@ -1161,6 +1262,10 @@ class DeepseekForCausalLM(DeepseekPreTrainedModel): --+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++- ```""" --+++++-+ if not self.warm_up: --+++++-+ self.warm_up = True --+++++-+ self.warmup_moe_model_deep() --+++++-+ --+++++- output_attentions = ( --+++++- output_attentions --+++++- if output_attentions is not None --+++++-diff --git a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++-index 3cbf820e..d4c6b651 100644 --+++++---- a/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++-+++ b/mindnlp/transformers/models/qwen2_moe/modeling_qwen2_moe.py --+++++-@@ -18,7 +18,6 @@ --+++++- # See the License for the specific language governing permissions and --+++++- # limitations under the License. --+++++- """MindSpore Qwen2MoE model.""" --+++++-- --+++++- import math --+++++- from typing import List, Optional, Tuple, Union --+++++- --+++++-@@ -36,6 +35,7 @@ from ...modeling_outputs import ( --+++++- TokenClassifierOutput, --+++++- ) --+++++- from ...modeling_utils import PreTrainedModel --+++++-+from ...generation import GenerationMixin --+++++- from ....utils import logging --+++++- from .configuration_qwen2_moe import Qwen2MoeConfig --+++++- --+++++-@@ -182,6 +182,11 @@ class Qwen2MoeRMSNorm(nn.Module): --+++++- self.variance_epsilon = eps --+++++- --+++++- def forward(self, hidden_states): --+++++-+ # @dwj --+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++-+ # @lwx --+++++-+ # if not self.training : --+++++-+ # return F.rms_norm(hidden_states, self.weight, self.variance_epsilon) --+++++- input_dtype = hidden_states.dtype --+++++- hidden_states = hidden_states.to(mindspore.float32) --+++++- variance = ops.mean(hidden_states.pow(2), -1, keepdim=True) --+++++-@@ -234,6 +239,8 @@ def rotate_half(x): --+++++- """Rotates half the hidden dims of the input.""" --+++++- x1 = x[..., : x.shape[-1] // 2] --+++++- x2 = x[..., x.shape[-1] // 2 :] --+++++-+ # @lwx_note: 这里使用 ops.split 代替 x[..., : x.shape[-1] // 2] 和 x[..., x.shape[-1] // 2 :] --+++++-+ # x1,x2 = ops.split( x, x.shape[-1] // 2, dim=-1 ) --+++++- return ops.cat((-x2, x1), dim=-1) --+++++- --+++++- --+++++-@@ -273,15 +280,28 @@ class Qwen2MoeMLP(nn.Module): --+++++- self.config = config --+++++- self.hidden_size = config.hidden_size --+++++- self.intermediate_size = intermediate_size --+++++-+ --+++++- self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++- self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False) --+++++- self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False) --+++++- self.act_fn = ACT2FN[config.hidden_act] --+++++- --+++++- def forward(self, x): --+++++-- return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++-- --+++++- --+++++-+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x)) --+++++-+ # @lwx --+++++-+ # gate_up_output = self.gate_up_proj(x) --+++++-+ # swiglu_output = mindspore.ops.swiglu(gate_up_output) --+++++-+ # return self.down_proj(swiglu_output) --+++++-+ --+++++-+ # def forward(self, x): --+++++-+ # gate_proj_out = self.gate_proj(x) --+++++-+ # up_proj_out = self.up_proj(x) --+++++-+ # # 拼接,形状变 (batch, seq_len, intermediate_size * 2) --+++++-+ # # gate_up_out = mindspore.ops.cat([gate_proj_out.astype(x.dtype), up_proj_out.astype(x.dtype)],-1) --+++++-+ # swiglu_out = mindspore.ops.silu(gate_proj_out) * up_proj_out --+++++-+ # return self.down_proj(swiglu_out) --+++++-+ --+++++- # Copied from transformers.models.llama.modeling_llama.repeat_kv --+++++- def repeat_kv(hidden_states: mindspore.Tensor, n_rep: int) -> mindspore.Tensor: --+++++- """ --+++++-@@ -349,6 +369,9 @@ class Qwen2MoeAttention(nn.Module): --+++++- use_cache: bool = False, --+++++- cache_position: Optional[mindspore.Tensor] = None, --+++++- ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++-+ --+++++-+ --+++++-+ --+++++- bsz, q_len, _ = hidden_states.shape --+++++- --+++++- query_states = self.q_proj(hidden_states) --+++++-@@ -367,28 +390,28 @@ class Qwen2MoeAttention(nn.Module): --+++++- "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++- "with a layer index." --+++++- ) --+++++-- kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++-+ if isinstance(past_key_value, StaticCache): --+++++-+ kv_seq_len = key_states.shape[-2] --+++++-+ else: --+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++- cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++- query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++- --+++++- if past_key_value is not None: --+++++- cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models --+++++- key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs) --+++++-+ --+++++-+ if isinstance(past_key_value, StaticCache): --+++++-+ kv_seq_len = key_states.shape[-2] --+++++- --+++++- # repeat k/v heads if n_kv_heads < n_heads --+++++- key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++- value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++-- --+++++-+ --+++++- attn_weights = ops.matmul(query_states, ops.transpose(key_states, 2, 3)) / math.sqrt(self.head_dim) --+++++- --+++++-- if attn_weights.shape != (bsz, self.num_heads, q_len, kv_seq_len): --+++++-- raise ValueError( --+++++-- f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is" --+++++-- f" {attn_weights.shape}" --+++++-- ) --+++++-- --+++++-- if attention_mask is not None: # no matter the length, we just slice it --+++++-- causal_mask = attention_mask[:, :, :, : key_states.shape[-2]] --+++++-+ if attention_mask is not None: --+++++-+ causal_mask = attention_mask[:, :, :, :key_states.shape[-2]] --+++++- attn_weights = attn_weights + causal_mask --+++++- --+++++- # upcast attention to fp32 --+++++-@@ -406,15 +429,374 @@ class Qwen2MoeAttention(nn.Module): --+++++- attn_output = attn_output.reshape(bsz, q_len, self.hidden_size) --+++++- --+++++- attn_output = self.o_proj(attn_output) --+++++-- --+++++-+ # @lwx --+++++-+ --+++++-+ # max_seq_len = self.max_position_embeddings # 2048 --+++++-+ --+++++-+ # if attention_mask is not None: --+++++-+ # # attention_mask: [B, 1, Sq, Sk] --+++++-+ # mask_2d = attention_mask[0, 0] # -> [Sq, Sk] 单个样本的二维mask --+++++-+ --+++++-+ # # pad 到 [max_seq_len, max_seq_len] --+++++-+ # padded_mask = ops.ones((max_seq_len, max_seq_len), dtype=mask_2d.dtype) != 0 --+++++-+ # padded_mask[:mask_2d.shape[0], :mask_2d.shape[1]] = (mask_2d != 0) --+++++-+ # global_attention_mask = padded_mask --+++++-+ # else: --+++++-+ # global_attention_mask = None --+++++-+ --+++++-+ --+++++-+ # sparse_mode=3 --+++++-+ # attn_output = mindspore.ops.flash_attention_score( --+++++-+ # query=query_states, --+++++-+ # key=key_states, --+++++-+ # value=value_states, --+++++-+ # real_shift=None, --+++++-+ # padding_mask=None, --+++++-+ --+++++-+ # head_num=self.num_heads, --+++++-+ # attn_mask=global_attention_mask, --+++++-+ # keep_prob=1.0 - self.attention_dropout, --+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++-+ # input_layout="BNSD", --+++++-+ # pre_tokens=2147483647, --+++++-+ # next_tokens=2147483647, --+++++-+ # inner_precise=0, --+++++-+ # drop_mask=None, --+++++-+ # prefix=None, --+++++-+ # actual_seq_qlen=None, --+++++-+ # actual_seq_kvlen=None, --+++++-+ # sparse_mode=sparse_mode, --+++++-+ # ) --+++++- if not output_attentions: --+++++- attn_weights = None --+++++- --+++++- return attn_output, attn_weights, past_key_value --+++++- --+++++- --+++++-+class Qwen2MoeFlashAttention(nn.Module): --+++++-+ """ --+++++-+ Qwen2MoeAttention的优化版本,直接调用底层的 mindspore.ops.flash_attention_score 算子。 --+++++-+ 这个实现为昇腾硬件(如 Atlas A2)进行了深度优化。 --+++++-+ --+++++-+ 关键改动: --+++++-+ 1. 移除了手动的 `repeat_kv` 调用。`flash_attention_score` 内部原生支持GQA (Grouped-Query Attention), --+++++-+ 直接传入原始的 key 和 value 张量效率更高。 --+++++-+ 2. 增加了将标准浮点型 attention_mask 转换为 `flash_attention_score` 所需的布尔型掩码的逻辑。 --+++++-+ 3. 严格遵循 `flash_attention_score` 的参数要求,如 `input_layout="BNSD"`。 --+++++-+ """ --+++++-+ def __init__(self, config: Qwen2MoeConfig, layer_idx: Optional[int] = None): --+++++-+ super().__init__() --+++++-+ self.config = config --+++++-+ self.layer_idx = layer_idx --+++++-+ self.hidden_size = config.hidden_size --+++++-+ self.num_heads = config.num_attention_heads --+++++-+ self.head_dim = self.hidden_size // self.num_heads --+++++-+ self.num_key_value_heads = config.num_key_value_heads --+++++-+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads --+++++-+ self.max_position_embeddings = config.max_position_embeddings --+++++-+ self.rope_theta = config.rope_theta --+++++-+ self.attention_dropout = config.attention_dropout --+++++-+ --+++++-+ if (self.head_dim * self.num_heads) != self.hidden_size: --+++++-+ raise ValueError( --+++++-+ f"hidden_size ({self.hidden_size}) must be divisible by num_heads ({self.num_heads})" --+++++-+ ) --+++++-+ --+++++-+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True) --+++++-+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++-+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True) --+++++-+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False) --+++++-+ --+++++-+ self.rotary_emb = Qwen2MoeRotaryEmbedding( --+++++-+ self.head_dim, --+++++-+ max_position_embeddings=self.max_position_embeddings, --+++++-+ base=self.rope_theta, --+++++-+ ) --+++++-+ --+++++-+ def forward( --+++++-+ self, --+++++-+ hidden_states: mindspore.Tensor, --+++++-+ attention_mask: Optional[mindspore.Tensor] = None, --+++++-+ position_ids: Optional[mindspore.Tensor] = None, --+++++-+ past_key_value: Optional[Cache] = None, --+++++-+ output_attentions: bool = False, --+++++-+ use_cache: bool = False, --+++++-+ cache_position: Optional[mindspore.Tensor] = None, --+++++-+ ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++-+ --+++++-+ bsz, q_len, _ = hidden_states.shape --+++++-+ --+++++-+ # 1. 线性投射 Q, K, V --+++++-+ query_states = self.q_proj(hidden_states) --+++++-+ key_states = self.k_proj(hidden_states) --+++++-+ value_states = self.v_proj(hidden_states) --+++++-+ --+++++-+ # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++-+ # query: [B, S, H*D] -> [B, N1, S, D] --+++++-+ # key/val: [B, S, H2*D] -> [B, N2, S, D] --+++++-+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ --+++++-+ # 3. RoPE 旋转位置编码 --+++++-+ kv_seq_len = key_states.shape[-2] --+++++-+ if past_key_value is not None: --+++++-+ if self.layer_idx is None: --+++++-+ raise ValueError( --+++++-+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++-+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++-+ "with a layer index." --+++++-+ ) --+++++-+ # 对于 StaticCache,需要特殊处理 kv_seq_len --+++++-+ # 因为 StaticCache 的 key_states 形状是整个 cache 大小,但实际只使用 cache_position 指定的部分 --+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++-+ # 使用 cache_position 的长度来确定实际的 kv_seq_len --+++++-+ # 在 prefill 阶段:cache_position = [0, 1, 2, ..., n-1],kv_seq_len = n --+++++-+ # 在 decode 阶段:cache_position = [pos],kv_seq_len = pos + 1(但我们无法在 JIT 中获取 pos 值) --+++++-+ # 为了 JIT 兼容,我们使用 cache_position 的长度,但这只在 prefill 阶段正确 --+++++-+ # 对于 decode 阶段,我们需要在 Python 层预先计算并传递 --+++++-+ # 临时解决方案:使用 cache_position 的最大值(如果可能) --+++++-+ # 但由于 JIT 限制,我们使用一个近似值:cache_position.shape[0] + past_seen_tokens --+++++-+ past_seen_tokens = past_key_value.get_seq_length(self.layer_idx) if hasattr(past_key_value, 'get_seq_length') else 0 --+++++-+ if cache_position.shape[0] == 1: --+++++-+ # decode 阶段:cache_position 是单个值,我们需要该值 + 1 --+++++-+ # 但由于 JIT 限制,我们使用 past_seen_tokens + 1(近似) --+++++-+ kv_seq_len = past_seen_tokens + 1 --+++++-+ else: --+++++-+ # prefill 阶段:cache_position 是范围,使用其长度 --+++++-+ kv_seq_len = cache_position.shape[0] + past_seen_tokens --+++++-+ else: --+++++-+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++-+ --+++++-+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++-+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++-+ --+++++-+ # 4. KV 缓存更新 --+++++-+ if past_key_value is not None: --+++++-+ cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++-+ key_states, value_states = past_key_value.update( --+++++-+ key_states, value_states, self.layer_idx, cache_kwargs --+++++-+ ) --+++++-+ --+++++-+ # 对于 StaticCache 的 decode 阶段,update() 后 key_states.shape[-2] 就是实际长度 --+++++-+ # 我们需要更新 kv_seq_len(因为 key_states 形状是 max_cache_len,但实际只使用部分) --+++++-+ if isinstance(past_key_value, StaticCache) and cache_position is not None: --+++++-+ if cache_position.shape[0] == 1: --+++++-+ # decode 阶段:使用 key_states 的实际 shape(已包含之前的 cache + 当前 token) --+++++-+ kv_seq_len = key_states.shape[-2] --+++++-+ --+++++-+ # 5. [重要] 准备 Attention Mask --+++++-+ # flash_attention_score 需要一个布尔掩码,其中 True 表示需要被丢弃(mask掉) --+++++-+ # 而上游传入的 attention_mask 是浮点类型,0 表示保留,大负数表示丢弃 --+++++-+ fa_attention_mask = None --+++++-+ if attention_mask is not None: --+++++-+ # 截取与当前key长度匹配的部分 --+++++-+ # 原始 mask 形状: (B, 1, Sq, Sk_max), 我们需要 (B, N1, Sq, Sk_cur) --+++++-+ # FA算子会自动广播,所以我们只需要 (B, 1, Sq, Sk_cur) --+++++-+ mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++-+ # 转换为布尔类型: 大负数 -> True, 0 -> False --+++++-+ fa_attention_mask = (mask_slice != 0) --+++++-+ --+++++-+ # 确保输入数据类型为 float16 或 bfloat16,与算子要求一致 --+++++-+ input_dtype = query_states.dtype --+++++-+ if input_dtype not in (mindspore.float16, mindspore.bfloat16): --+++++-+ # 强制用 fp16, 减少 bf16 精度异常,并满足算子要求 --+++++-+ query_states = query_states.to(mindspore.float16) --+++++-+ key_states = key_states.to(mindspore.float16) --+++++-+ value_states = value_states.to(mindspore.float16) --+++++-+ --+++++-+ # 6. [核心] 调用 flash_attention_score 算子 --+++++-+ # - 无需手动 repeat_kv, 算子原生支持 GQA --+++++-+ # - input_layout='BNSD' 对应 [Batch, Num_heads, Seq_len, Head_dim] --+++++-+ attn_output = mindspore.ops.flash_attention_score( --+++++-+ query=query_states, --+++++-+ key=key_states, --+++++-+ value=value_states, --+++++-+ head_num=self.num_heads, # 传入Q的头数(N1) --+++++-+ attn_mask=fa_attention_mask, --+++++-+ keep_prob=1.0 - self.attention_dropout, --+++++-+ scalar_value=1.0 / math.sqrt(self.head_dim), --+++++-+ input_layout="BNSD", --+++++-+ sparse_mode=0 # 使用 defaultMask 模式 --+++++-+ ) --+++++-+ --+++++-+ # 恢复原始数据类型 --+++++-+ attn_output = attn_output.to(input_dtype) --+++++-+ --+++++-+ # 7. 调整输出形状 --+++++-+ # [B, N1, S, D] -> [B, S, N1, D] -> [B, S, H] --+++++-+ attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++-+ attn_output = self.o_proj(attn_output) --+++++-+ --+++++-+ # FlashAttention 算子不直接返回注意力权重矩阵 --+++++-+ attn_weights = None --+++++-+ if output_attentions: --+++++-+ logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++-+ --+++++-+ return attn_output, attn_weights, past_key_value --+++++-+ --+++++-+ # def forward( --+++++-+ # self, --+++++-+ # hidden_states: mindspore.Tensor, --+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, --+++++-+ # position_ids: Optional[mindspore.Tensor] = None, --+++++-+ # past_key_value: Optional[Cache] = None, --+++++-+ # output_attentions: bool = False, --+++++-+ # use_cache: bool = False, --+++++-+ # cache_position: Optional[mindspore.Tensor] = None, --+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++-+ --+++++-+ # bsz, q_len, _ = hidden_states.shape --+++++-+ --+++++-+ # # 1. 线性投射 Q, K, V --+++++-+ # query_states = self.q_proj(hidden_states) --+++++-+ # key_states = self.k_proj(hidden_states) --+++++-+ # value_states = self.v_proj(hidden_states) --+++++-+ --+++++-+ # # 2. 调整形状以匹配 Flash Attention 的 BNSD 布局 --+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ --+++++-+ # # 3. RoPE 旋转位置编码 --+++++-+ # kv_seq_len = key_states.shape[-2] --+++++-+ # if past_key_value is not None: --+++++-+ # if self.layer_idx is None: --+++++-+ # raise ValueError( --+++++-+ # f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} " --+++++-+ # "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class " --+++++-+ # "with a layer index." --+++++-+ # ) --+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++-+ --+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++-+ --+++++-+ # # 4. KV 缓存更新 --+++++-+ # if past_key_value is not None: --+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++-+ # key_states, value_states = past_key_value.update( --+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs --+++++-+ # ) --+++++-+ --+++++-+ # # 5. 准备 Attention Mask --+++++-+ # fa_attention_mask = None --+++++-+ # if attention_mask is not None: --+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++-+ # fa_attention_mask = (mask_slice != 0) --+++++-+ --+++++-+ # # <--- 修改点 1: 删除了不必要的强制类型转换 --- --+++++-+ # # 保留原始数据类型,例如 bfloat16,以避免精度损失。 --+++++-+ # input_dtype = query_states.dtype --+++++-+ --+++++-+ # # 6. [核心] 调用 flash_attention_score 算子 --+++++-+ # attn_output = mindspore.ops.flash_attention_score( --+++++-+ # query=query_states, --+++++-+ # key=key_states, --+++++-+ # value=value_states, --+++++-+ # head_num=self.num_heads, --+++++-+ # attn_mask=fa_attention_mask, --+++++-+ # keep_prob=1.0 - self.attention_dropout, --+++++-+ # scalar_value=1.0 / math.sqrt(self.head_dim), --+++++-+ # input_layout="BNSD", --+++++-+ # sparse_mode=0, --+++++-+ # # <--- 修改点 2: 启用内部高精度计算 --- --+++++-+ # # inner_precise=1 会让算子内部使用 float32 进行累加和 softmax 计算, --+++++-+ # # 这与 Eager 版本中的 .softmax(dtype=ms.float32) 行为对齐。 --+++++-+ # inner_precise=1 --+++++-+ # ) --+++++-+ --+++++-+ # # 恢复原始数据类型 --+++++-+ # attn_output = attn_output.to(input_dtype) --+++++-+ --+++++-+ # # 7. 调整输出形状 --+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++-+ # attn_output = self.o_proj(attn_output) --+++++-+ --+++++-+ # attn_weights = None --+++++-+ # if output_attentions: --+++++-+ # logger.warning_once("Qwen2MoeFlashAttention is used but `output_attentions=True`. FA does not return attentions.") --+++++-+ --+++++-+ # return attn_output, attn_weights, past_key_value --+++++-+ --+++++-+ # def forward( --+++++-+ # self, --+++++-+ # hidden_states: mindspore.Tensor, --+++++-+ # attention_mask: Optional[mindspore.Tensor] = None, --+++++-+ # position_ids: Optional[mindspore.Tensor] = None, --+++++-+ # past_key_value: Optional[Cache] = None, --+++++-+ # output_attentions: bool = False, --+++++-+ # use_cache: bool = False, --+++++-+ # cache_position: Optional[mindspore.Tensor] = None, --+++++-+ # ) -> Tuple[mindspore.Tensor, Optional[mindspore.Tensor], Optional[Tuple[mindspore.Tensor]]]: --+++++-+ --+++++-+ # bsz, q_len, _ = hidden_states.shape --+++++-+ --+++++-+ # query_states = self.q_proj(hidden_states) --+++++-+ # key_states = self.k_proj(hidden_states) --+++++-+ # value_states = self.v_proj(hidden_states) --+++++-+ --+++++-+ # query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ # key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ # value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(0, 2, 1, 3) --+++++-+ --+++++-+ # kv_seq_len = key_states.shape[-2] --+++++-+ # if past_key_value is not None: --+++++-+ # if self.layer_idx is None: --+++++-+ # raise ValueError("`layer_idx` must be specified for caching") --+++++-+ # kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx) --+++++-+ --+++++-+ # cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len) --+++++-+ # query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids) --+++++-+ --+++++-+ # if past_key_value is not None: --+++++-+ # cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} --+++++-+ # key_states, value_states = past_key_value.update( --+++++-+ # key_states, value_states, self.layer_idx, cache_kwargs --+++++-+ # ) --+++++-+ --+++++-+ # key_states = repeat_kv(key_states, self.num_key_value_groups) --+++++-+ # value_states = repeat_kv(value_states, self.num_key_value_groups) --+++++-+ --+++++-+ # # <--- 核心修改点: 手动进行高精度缩放 --- --+++++-+ # # 在调用算子前,手动将 query_states 除以缩放因子。 --+++++-+ # # 这样做可以确保缩放操作的精度与 Eager 版本的隐式高精度除法完全一致。 --+++++-+ # query_states = query_states / math.sqrt(self.head_dim) --+++++-+ # # <--- 修改结束 --- --+++++-+ --+++++-+ # fa_attention_mask = None --+++++-+ # if attention_mask is not None: --+++++-+ # mask_slice = attention_mask[:, :, :q_len, :key_states.shape[-2]] --+++++-+ # fa_attention_mask = (mask_slice != 0) --+++++-+ --+++++-+ # input_dtype = query_states.dtype --+++++-+ --+++++-+ # attn_output = mindspore.ops.flash_attention_score( --+++++-+ # query=query_states, # 传入已经预先缩放过的 query --+++++-+ # key=key_states, --+++++-+ # value=value_states, --+++++-+ # head_num=self.num_heads, --+++++-+ # attn_mask=fa_attention_mask, --+++++-+ # keep_prob=1.0 - self.attention_dropout, --+++++-+ # scalar_value=1.0, # 设置为 1.0,因为缩放已在外部完成 --+++++-+ # input_layout="BNSD", --+++++-+ # sparse_mode=0, --+++++-+ # inner_precise=1 # 仍然保持内部高精度计算 --+++++-+ # ) --+++++-+ --+++++-+ # attn_output = attn_output.to(input_dtype) --+++++-+ # attn_output = attn_output.transpose(0, 2, 1, 3).reshape(bsz, q_len, self.hidden_size) --+++++-+ # attn_output = self.o_proj(attn_output) --+++++-+ --+++++-+ # attn_weights = None --+++++-+ # if output_attentions: --+++++-+ # logger.warning_once("Qwen2MoeFlashAttention does not return attention weights.") --+++++-+ --+++++-+ # return attn_output, attn_weights, past_key_value --+++++-+ --+++++- QWEN2MOE_ATTENTION_CLASSES = { --+++++- "eager": Qwen2MoeAttention, --+++++-+ "flash-attention": Qwen2MoeFlashAttention, --+++++- } --+++++- --+++++- --+++++-@@ -434,50 +816,55 @@ class Qwen2MoeSparseMoeBlock(nn.Module): --+++++- self.shared_expert = Qwen2MoeMLP(config, intermediate_size=config.shared_expert_intermediate_size) --+++++- self.shared_expert_gate = nn.Linear(config.hidden_size, 1, bias=False) --+++++- --+++++-+ #@dwj --+++++-+ # 只遍历激活的专家,而非全部专家 --+++++- def forward(self, hidden_states: mindspore.Tensor) -> mindspore.Tensor: --+++++-- batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++-- hidden_states = hidden_states.view(-1, hidden_dim) --+++++-- # router_logits: (batch * sequence_length, n_experts) --+++++-- router_logits = self.gate(hidden_states) --+++++-- --+++++-- routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-- routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++-- if self.norm_topk_prob: --+++++-- routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++-- # we cast back to the input dtype --+++++-- routing_weights = routing_weights.to(hidden_states.dtype) --+++++-- --+++++-- final_hidden_states = ops.zeros( --+++++-- (batch_size * sequence_length, hidden_dim), dtype=hidden_states.dtype --+++++-- ) --+++++-- --+++++-- # One hot encode the selected experts to create an expert mask --+++++-- # this will be used to easily index which expert is going to be sollicitated --+++++-- expert_mask = nn.functional.one_hot(selected_experts, num_classes=self.num_experts).permute(2, 1, 0) --+++++-- --+++++-- # Loop over all available experts in the model and perform the computation on each expert --+++++-- for expert_idx in range(self.num_experts): --+++++-- expert_layer = self.experts[expert_idx] --+++++-- idx, top_x = ops.nonzero(expert_mask[expert_idx], as_tuple=True) --+++++-- --+++++-- # Index the correct hidden states and compute the expert hidden state for --+++++-- # the current expert. We need to make sure to multiply the output hidden --+++++-- # states by `routing_weights` on the corresponding tokens (top-1 and top-2) --+++++-- if 0 not in idx.shape: --+++++-- current_state = hidden_states[None, top_x].reshape(-1, hidden_dim) --+++++-- current_hidden_states = expert_layer(current_state) * routing_weights[top_x, idx, None] --+++++-- --+++++-- # However `index_add_` only support torch tensors for indexing so we'll use --+++++-- # the `top_x` tensor here. --+++++-- final_hidden_states = final_hidden_states.index_add(0, top_x.int(), current_hidden_states.to(hidden_states.dtype)) --+++++-- --+++++-- shared_expert_output = self.shared_expert(hidden_states) --+++++-- shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states)) * shared_expert_output --+++++-- --+++++-- final_hidden_states = final_hidden_states + shared_expert_output --+++++-+ batch_size, sequence_length, hidden_dim = hidden_states.shape --+++++-+ hidden_states_reshaped = hidden_states.view(-1, hidden_dim) --+++++-+ num_tokens = hidden_states_reshaped.shape[0] --+++++-+ --+++++-+ router_logits = self.gate(hidden_states_reshaped) --+++++-+ routing_weights = F.softmax(router_logits, dim=1, dtype=mindspore.float32) --+++++-+ routing_weights, selected_experts = ops.topk(routing_weights, self.top_k, dim=-1) --+++++-+ --+++++-+ if self.norm_topk_prob: --+++++-+ routing_weights /= ops.sum(routing_weights, dim=-1, keepdim=True) --+++++-+ routing_weights = routing_weights.to(hidden_states.dtype) --+++++-+ --+++++-+ final_hidden_states = ops.zeros_like(hidden_states_reshaped) --+++++-+ flat_selected_experts = selected_experts.flatten() --+++++-+ --+++++-+ unsqueezed_token_indices = ops.arange(num_tokens, dtype=mindspore.int32).unsqueeze(1) --+++++-+ broadcasted_token_indices = unsqueezed_token_indices.broadcast_to((-1, self.top_k)) --+++++-+ token_indices = broadcasted_token_indices.flatten() --+++++-+ --+++++-+ active_experts = ops.unique(flat_selected_experts) --+++++-+ --+++++-+ for expert_idx_tensor in active_experts: --+++++-+ expert_idx = expert_idx_tensor.item() --+++++-+ expert_layer = self.experts[expert_idx] --+++++-+ --+++++-+ mask = (flat_selected_experts == expert_idx_tensor) --+++++-+ selected_token_indices = token_indices[mask] --+++++-+ selected_routing_weights = routing_weights.flatten()[mask] --+++++-+ --+++++-+ current_states = hidden_states_reshaped[selected_token_indices] --+++++-+ --+++++-+ expert_output = expert_layer(current_states) * selected_routing_weights.unsqueeze(1) --+++++-+ --+++++-+ final_hidden_states = final_hidden_states.index_add( --+++++-+ dim=0, --+++++-+ index=selected_token_indices, --+++++-+ source=expert_output.to(hidden_states.dtype) --+++++-+ ) --+++++-+ --+++++-+ shared_expert_output = self.shared_expert(hidden_states_reshaped) --+++++-+ shared_expert_output = F.sigmoid(self.shared_expert_gate(hidden_states_reshaped)) * shared_expert_output --+++++- --+++++-- final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++-- return final_hidden_states, router_logits --+++++-+ final_hidden_states = final_hidden_states + shared_expert_output --+++++-+ final_hidden_states = final_hidden_states.reshape(batch_size, sequence_length, hidden_dim) --+++++-+ --+++++-+ return final_hidden_states, router_logits --+++++- --+++++- --+++++- class Qwen2MoeDecoderLayer(nn.Module): --+++++-@@ -487,6 +874,8 @@ class Qwen2MoeDecoderLayer(nn.Module): --+++++- --+++++- self.self_attn = QWEN2MOE_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx) --+++++- --+++++-+ # self.self_attn = QWEN2MOE_ATTENTION_CLASSES["flash-attention"](config, layer_idx) --+++++-+ --+++++- if (layer_idx not in config.mlp_only_layers) and ( --+++++- config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0 --+++++- ): --+++++-@@ -580,6 +969,8 @@ class Qwen2MoePreTrainedModel(PreTrainedModel): --+++++- _no_split_modules = ["Qwen2MoeDecoderLayer"] --+++++- _skip_keys_device_placement = "past_key_values" --+++++- _supports_cache_class = True --+++++-+#lwx --+++++-+ # _supports_static_cache = True --+++++- --+++++- def _init_weights(self, module): --+++++- std = self.config.initializer_range --+++++-@@ -797,7 +1188,7 @@ class Qwen2MoeModel(Qwen2MoePreTrainedModel): --+++++- return causal_mask --+++++- --+++++- --+++++--class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++-+class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel, GenerationMixin): --+++++- _tied_weights_keys = ["lm_head.weight"] --+++++- --+++++- def __init__(self, config): --+++++-@@ -811,6 +1202,29 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++- self.num_experts_per_tok = config.num_experts_per_tok --+++++- # Initialize weights and apply final processing --+++++- self.post_init() --+++++-+ # @lwx --+++++-+ # if self.generation_config is not None and self.generation_config.cache_implementation is None: --+++++-+ # self.generation_config.cache_implementation = "static" --+++++-+ self._warmed_up = False --+++++-+ --+++++-+ def warmup_moe_model(self): --+++++-+ print("[Warmup] Qwen2-MoE 模型预热开始...") --+++++-+ test_texts = [ --+++++-+ "warmup short", --+++++-+ "This is a medium length warmup sentence for MoE experts.middle midlle midlle", --+++++-+ "This is a long warmup sentence designed to trigger as many experts as possible and include attention mask variations to cover FlashAttention or eager attention paths.very very long,very very long,very very long,very very long" --+++++-+ ] --+++++-+ tokenizer = getattr(self, "_warmup_tokenizer", None) --+++++-+ if tokenizer is None: --+++++-+ from mindnlp.transformers import AutoTokenizer --+++++-+ tokenizer = AutoTokenizer.from_pretrained(self.config.name_or_path) --+++++-+ self._warmup_tokenizer = tokenizer --+++++-+ --+++++-+ for text in test_texts: --+++++-+ inputs = tokenizer(text, return_tensors="ms") --+++++-+ with mindspore._no_grad(): --+++++-+ _ = self(**inputs, output_router_logits=True, use_cache=False) --+++++-+ print("[Warmup] Qwen2-MoE 模型预热完成。") --+++++- --+++++- def get_input_embeddings(self): --+++++- return self.model.embed_tokens --+++++-@@ -870,6 +1284,9 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++- >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] --+++++- "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you." --+++++- ```""" --+++++-+ if not self._warmed_up: --+++++-+ self._warmed_up = True --+++++-+ self.warmup_moe_model() --+++++- --+++++- output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions --+++++- output_router_logits = ( --+++++-@@ -1004,6 +1421,361 @@ class Qwen2MoeForCausalLM(Qwen2MoePreTrainedModel): --+++++- } --+++++- ) --+++++- return model_inputs --+++++-+# @lwx --+++++-+ # def _decode_one_tokens_logits( --+++++-+ # self, --+++++-+ # cur_token: mindspore.Tensor, --+++++-+ # input_pos: Optional[mindspore.Tensor], --+++++-+ # cache_position: mindspore.Tensor, --+++++-+ # past_key_values: StaticCache, --+++++-+ # ) -> mindspore.Tensor: --+++++-+ # """ --+++++-+ # 单个token的解码函数,返回Logits(内部实现,未被JIT编译) --+++++-+ --+++++-+ # Args: --+++++-+ # cur_token: 当前要处理的token,shape为(batch_size, 1) --+++++-+ # input_pos: 输入位置信息,可选 --+++++-+ # cache_position: 当前token在cache中的位置,shape为(1,) --+++++-+ # past_key_values: StaticCache对象,存储之前的key-value状态 --+++++-+ --+++++-+ # Returns: --+++++-+ # logits: 当前token的logits,shape为(batch_size, vocab_size) --+++++-+ # """ --+++++-+ # # 调用JIT编译的版本 --+++++-+ # return self.get_decode_one_tokens_logits( --+++++-+ # cur_token=cur_token, --+++++-+ # input_pos=input_pos, --+++++-+ # cache_position=cache_position, --+++++-+ # past_key_values=past_key_values, --+++++-+ # ) --+++++-+ --+++++-+ # @mindspore.jit(jit_level='O1') --+++++-+ # def get_decode_one_tokens_logits(self, cur_token, input_pos, cache_position, past_key_values): --+++++-+ # """ --+++++-+ # JIT编译的函数,用于高效的单token解码 --+++++-+ # 使用JIT编译优化以支持静态shape和高效执行 --+++++-+ --+++++-+ # 注意:直接调用forward方法,避免经过_call_impl中的try-except --+++++-+ # """ --+++++-+ # outputs = self.model.forward( --+++++-+ # input_ids=cur_token, --+++++-+ # position_ids=input_pos, --+++++-+ # cache_position=cache_position, --+++++-+ # past_key_values=past_key_values, --+++++-+ # use_cache=True, --+++++-+ # return_dict=False, --+++++-+ # ) --+++++-+ --+++++-+ # hidden_states = outputs[0] --+++++-+ # logits = self.lm_head.forward(hidden_states) --+++++-+ # logits = logits.float() --+++++-+ --+++++-+ # return logits[:, -1, :] --+++++-+ --+++++-+ # def _sample( --+++++-+ # self, --+++++-+ # input_ids: mindspore.Tensor, --+++++-+ # logits_processor, --+++++-+ # stopping_criteria, --+++++-+ # generation_config, --+++++-+ # synced_devices: bool, --+++++-+ # streamer=None, --+++++-+ # logits_warper=None, --+++++-+ # **model_kwargs, --+++++-+ # ): --+++++-+ # """ --+++++-+ # 重写 _sample 方法以在 StaticCache + 单 token 生成时使用 JIT 优化 --+++++-+ # 对于首次 prefill 阶段(cache_position 包含多个位置),使用标准路径 --+++++-+ # 对于自回归生成阶段(cache_position 长度为 1),使用 JIT 优化的路径 --+++++-+ # """ --+++++-+ # from ...generation.logits_process import LogitsProcessorList --+++++-+ # from ...generation.stopping_criteria import StoppingCriteriaList --+++++-+ # from ...generation.utils import GenerateDecoderOnlyOutput, GenerateEncoderDecoderOutput --+++++-+ # from mindnlp.core import nn, ops, no_grad --+++++-+ # import numpy as np --+++++-+ --+++++-+ # # 检查是否使用 StaticCache --+++++-+ # # 如果使用 StaticCache,我们进入自定义循环以在单 token 生成时使用 JIT 优化 --+++++-+ # # 否则,直接调用父类方法 --+++++-+ # past_key_values = model_kwargs.get("past_key_values") --+++++-+ # print(f"[DEBUG] _sample called, past_key_values type: {type(past_key_values).__name__}, is StaticCache: {isinstance(past_key_values, StaticCache)}") --+++++-+ --+++++-+ # if not isinstance(past_key_values, StaticCache): --+++++-+ # # 不使用 StaticCache,直接调用父类方法 --+++++-+ # print("[DEBUG] Using standard path (no StaticCache or not yet initialized)") --+++++-+ # return super()._sample( --+++++-+ # input_ids=input_ids, --+++++-+ # logits_processor=logits_processor, --+++++-+ # stopping_criteria=stopping_criteria, --+++++-+ # generation_config=generation_config, --+++++-+ # synced_devices=synced_devices, --+++++-+ # streamer=streamer, --+++++-+ # logits_warper=logits_warper, --+++++-+ # **model_kwargs, --+++++-+ # ) --+++++-+ --+++++-+ # # 使用 StaticCache,进入自定义循环 --+++++-+ # # 在循环内会根据 cache_position 的长度动态选择使用 JIT 优化(单 token)或标准路径(prefill) --+++++-+ # # 大部分逻辑与父类相同,但 forward 调用改为使用 JIT 优化方法 --+++++-+ # pad_token_id = generation_config._pad_token_tensor --+++++-+ # output_attentions = generation_config.output_attentions --+++++-+ # output_hidden_states = generation_config.output_hidden_states --+++++-+ # output_scores = generation_config.output_scores --+++++-+ # output_logits = generation_config.output_logits --+++++-+ # return_dict_in_generate = generation_config.return_dict_in_generate --+++++-+ # max_length = generation_config.max_length --+++++-+ # has_eos_stopping_criteria = any(hasattr(criteria, "eos_token_id") for criteria in stopping_criteria) --+++++-+ # do_sample = generation_config.do_sample --+++++-+ --+++++-+ # if do_sample is True and not isinstance(logits_warper, LogitsProcessorList): --+++++-+ # raise ValueError( --+++++-+ # "`do_sample` is set to `True`, `logits_warper` must be a `LogitsProcessorList` instance (it is " --+++++-+ # f"{logits_warper})." --+++++-+ # ) --+++++-+ --+++++-+ # # init attention / hidden states / scores tuples --+++++-+ # scores = () if (return_dict_in_generate and output_scores) else None --+++++-+ # raw_logits = () if (return_dict_in_generate and output_logits) else None --+++++-+ # decoder_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++-+ # cross_attentions = () if (return_dict_in_generate and output_attentions) else None --+++++-+ # decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None --+++++-+ --+++++-+ # # if model is an encoder-decoder, retrieve encoder attention weights and hidden states --+++++-+ # if return_dict_in_generate and self.config.is_encoder_decoder: --+++++-+ # encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None --+++++-+ # encoder_hidden_states = ( --+++++-+ # model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None --+++++-+ # ) --+++++-+ --+++++-+ # # keep track of which sequences are already finished --+++++-+ # batch_size, cur_len = input_ids.shape --+++++-+ # this_peer_finished = False --+++++-+ # unfinished_sequences = ops.ones(batch_size, dtype=mindspore.int64) --+++++-+ # model_kwargs = self._get_initial_cache_position(input_ids, model_kwargs) --+++++-+ --+++++-+ # time_record = [] --+++++-+ # from ....utils.testing_utils import parse_flag_from_env --+++++-+ # _record_time = parse_flag_from_env('INFERENCE_TIME_RECORD', False) --+++++-+ --+++++-+ # while self._has_unfinished_sequences( --+++++-+ # this_peer_finished, synced_devices, cur_len=cur_len, max_length=max_length --+++++-+ # ): --+++++-+ # if _record_time: --+++++-+ # import time as time_module --+++++-+ # infer_start = time_module.time() --+++++-+ --+++++-+ # # prepare model inputs --+++++-+ # model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs) --+++++-+ --+++++-+ # # prepare variable output controls --+++++-+ # model_inputs.update({"output_attentions": output_attentions} if output_attentions else {}) --+++++-+ # model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {}) --+++++-+ --+++++-+ # # 关键修改:检测到 StaticCache + 单 token 生成时,使用 JIT 优化的方法 --+++++-+ # cur_cache_position = model_inputs.get("cache_position") --+++++-+ # cur_past_key_values = model_inputs.get("past_key_values") --+++++-+ # cur_input_ids = model_inputs.get("input_ids") --+++++-+ --+++++-+ # if (isinstance(cur_past_key_values, StaticCache) and --+++++-+ # cur_cache_position is not None and --+++++-+ # len(cur_cache_position.shape) > 0 and --+++++-+ # cur_cache_position.shape[0] == 1 and --+++++-+ # cur_input_ids is not None and --+++++-+ # cur_input_ids.shape[1] == 1): --+++++-+ # # 使用 JIT 优化的单 token 解码 --+++++-+ # # 简单判断方法:首次调用时打印(JIT编译需要时间) --+++++-+ # if not hasattr(self, '_jit_used'): --+++++-+ # self._jit_used = False --+++++-+ # print("[JIT] ✓ JIT optimized path activated (first call will compile)") --+++++-+ --+++++-+ # next_token_logits = self.get_decode_one_tokens_logits( --+++++-+ # cur_token=cur_input_ids, --+++++-+ # input_pos=model_inputs.get("position_ids"), --+++++-+ # cache_position=cur_cache_position, --+++++-+ # past_key_values=cur_past_key_values, --+++++-+ # ) --+++++-+ --+++++-+ # # 标记已使用JIT(用于后续判断) --+++++-+ # if not self._jit_used: --+++++-+ # self._jit_used = True --+++++-+ --+++++-+ # # 构造兼容的输出对象 --+++++-+ # class JitOptimizedOutput: --+++++-+ # def __init__(self, logits, config): --+++++-+ # self.logits = logits.unsqueeze(1) if logits.ndim == 2 else logits --+++++-+ # self.config = config --+++++-+ # # 对于 JIT 优化路径,这些属性通常不需要 --+++++-+ # self.decoder_attentions = None if config.is_encoder_decoder else None --+++++-+ # self.attentions = None if not config.is_encoder_decoder else None --+++++-+ # self.cross_attentions = None --+++++-+ # self.decoder_hidden_states = None if config.is_encoder_decoder else None --+++++-+ # self.hidden_states = None if not config.is_encoder_decoder else None --+++++-+ --+++++-+ # outputs = JitOptimizedOutput(next_token_logits, self.config) --+++++-+ # else: --+++++-+ # # 标准 forward 调用(首次prefill阶段或非StaticCache) --+++++-+ # outputs = self(**model_inputs, return_dict=True) --+++++-+ --+++++-+ # if synced_devices and this_peer_finished: --+++++-+ # continue --+++++-+ --+++++-+ # # Clone is needed to avoid keeping a hanging ref to outputs.logits --+++++-+ # next_token_logits = outputs.logits[:, -1, :] --+++++-+ --+++++-+ # # pre-process distribution --+++++-+ # next_token_scores = logits_processor(input_ids, next_token_logits) --+++++-+ # if do_sample: --+++++-+ # next_token_scores = logits_warper(input_ids, next_token_scores) --+++++-+ --+++++-+ # # Store scores, attentions and hidden_states when required --+++++-+ # if return_dict_in_generate: --+++++-+ # if output_scores: --+++++-+ # scores += (next_token_scores,) --+++++-+ # if output_logits: --+++++-+ # raw_logits += (next_token_logits,) --+++++-+ # if output_attentions: --+++++-+ # attn = outputs.decoder_attentions if self.config.is_encoder_decoder else outputs.attentions --+++++-+ # decoder_attentions += (attn,) if attn is not None else (None,) --+++++-+ # if self.config.is_encoder_decoder: --+++++-+ # cross_attentions += (outputs.cross_attentions,) if outputs.cross_attentions is not None else (None,) --+++++-+ --+++++-+ # if output_hidden_states: --+++++-+ # hidden = ( --+++++-+ # outputs.decoder_hidden_states --+++++-+ # if self.config.is_encoder_decoder --+++++-+ # else outputs.hidden_states --+++++-+ # ) --+++++-+ # decoder_hidden_states += (hidden,) if hidden is not None else (None,) --+++++-+ --+++++-+ # # token selection --+++++-+ # if do_sample: --+++++-+ # probs = nn.functional.softmax(next_token_scores, dim=-1) --+++++-+ # next_tokens = ops.multinomial(probs, num_samples=1).squeeze(1) --+++++-+ # else: --+++++-+ # next_tokens = ops.argmax(next_token_scores, dim=-1) --+++++-+ --+++++-+ # # finished sentences should have their next token be a padding token --+++++-+ # if has_eos_stopping_criteria: --+++++-+ # next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences) --+++++-+ --+++++-+ # # update generated ids, model inputs, and length for next step --+++++-+ # input_ids = ops.cat([input_ids, next_tokens[:, None]], dim=-1) --+++++-+ # if streamer is not None: --+++++-+ # streamer.put(next_tokens) --+++++-+ --+++++-+ # model_kwargs = self._update_model_kwargs_for_generation( --+++++-+ # outputs, --+++++-+ # model_kwargs, --+++++-+ # is_encoder_decoder=self.config.is_encoder_decoder, --+++++-+ # ) --+++++-+ --+++++-+ # unfinished_sequences = unfinished_sequences & ~stopping_criteria(input_ids, scores) --+++++-+ # this_peer_finished = np.max(unfinished_sequences.asnumpy()).item() == 0 --+++++-+ # cur_len += 1 --+++++-+ --+++++-+ # if _record_time: --+++++-+ # import time as time_module --+++++-+ # infer_stop = time_module.time() --+++++-+ # time_record.append(infer_stop - infer_start) --+++++-+ --+++++-+ # del outputs --+++++-+ --+++++-+ # average_infer_time = None --+++++-+ # if time_record: --+++++-+ # if len(time_record) > 1: --+++++-+ # time_record.pop(0) --+++++-+ # average_infer_time = sum(time_record) / len(time_record) --+++++-+ # print(f'average inference time is: {average_infer_time}') --+++++-+ # print(f'inference time record: {time_record}') --+++++-+ --+++++-+ # if streamer is not None: --+++++-+ # streamer.end() --+++++-+ --+++++-+ # # 简单判断:打印是否使用了JIT路径 --+++++-+ # if hasattr(self, '_jit_used') and self._jit_used: --+++++-+ # print("[JIT] ✓ JIT optimization was used during generation") --+++++-+ # else: --+++++-+ # print("[JIT] ✗ JIT optimization was NOT used (using standard path)") --+++++-+ --+++++-+ # if return_dict_in_generate: --+++++-+ # if self.config.is_encoder_decoder: --+++++-+ # return GenerateEncoderDecoderOutput( --+++++-+ # sequences=input_ids, --+++++-+ # scores=scores, --+++++-+ # logits=raw_logits, --+++++-+ # encoder_attentions=encoder_attentions, --+++++-+ # encoder_hidden_states=encoder_hidden_states, --+++++-+ # decoder_attentions=decoder_attentions, --+++++-+ # cross_attentions=cross_attentions, --+++++-+ # decoder_hidden_states=decoder_hidden_states, --+++++-+ # past_key_values=model_kwargs.get("past_key_values"), --+++++-+ # average_infer_time=average_infer_time --+++++-+ # ) --+++++-+ # else: --+++++-+ # return GenerateDecoderOnlyOutput( --+++++-+ # sequences=input_ids, --+++++-+ # scores=scores, --+++++-+ # logits=raw_logits, --+++++-+ # attentions=decoder_attentions, --+++++-+ # hidden_states=decoder_hidden_states, --+++++-+ # past_key_values=model_kwargs.get("past_key_values"), --+++++-+ # average_infer_time=average_infer_time --+++++-+ # ) --+++++-+ # else: --+++++-+ # return input_ids --+++++-+ --+++++-+ # def _prepare_cache_for_generation( --+++++-+ # self, --+++++-+ # generation_config, --+++++-+ # model_kwargs, --+++++-+ # assistant_model, --+++++-+ # batch_size, --+++++-+ # max_cache_length, --+++++-+ # ): --+++++-+ # if generation_config.cache_implementation is None and self._supports_static_cache: --+++++-+ # generation_config.cache_implementation = "static" --+++++-+ # print("[JIT] ✓ StaticCache set as default in _prepare_cache_for_generation") --+++++-+ --+++++-+ # if generation_config.cache_implementation == "static": --+++++-+ # base_required_from_max_length = generation_config.max_length + 1 --+++++-+ # base_required = max(max_cache_length, base_required_from_max_length) --+++++-+ # min_cache_size = 50 --+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++-+ # max_cache_length = min(max(base_required, min_cache_size), self.config.max_position_embeddings) --+++++-+ # else: --+++++-+ # max_cache_length = max(base_required, min_cache_size) --+++++-+ --+++++-+ # original_max_cache_length = max_cache_length --+++++-+ # print(f"[JIT] StaticCache max_cache_length calculation:") --+++++-+ # print(f" - input max_cache_length: {original_max_cache_length}") --+++++-+ # print(f" - generation_config.max_length: {generation_config.max_length}") --+++++-+ # print(f" - base_required_from_max_length: {base_required_from_max_length}") --+++++-+ # print(f" - final max_cache_length: {max_cache_length}") --+++++-+ --+++++-+ # if hasattr(self.config, 'max_position_embeddings') and self.config.max_position_embeddings is not None: --+++++-+ # if max_cache_length > self.config.max_position_embeddings: --+++++-+ # print(f"[JIT] WARNING: Required cache length ({max_cache_length}) exceeds max_position_embeddings ({self.config.max_position_embeddings})") --+++++-+ --+++++-+ # result = super()._prepare_cache_for_generation( --+++++-+ # generation_config=generation_config, --+++++-+ # model_kwargs=model_kwargs, --+++++-+ # assistant_model=assistant_model, --+++++-+ # batch_size=batch_size, --+++++-+ # max_cache_length=max_cache_length, --+++++-+ # ) --+++++-+ --+++++-+ # if generation_config.cache_implementation == "static": --+++++-+ # cache_name = "past_key_values" if "mamba" not in self.__class__.__name__.lower() else "cache_params" --+++++-+ # created_cache = model_kwargs.get(cache_name) --+++++-+ # if created_cache is not None and hasattr(created_cache, 'max_cache_len'): --+++++-+ # print(f"[JIT] Created StaticCache with max_cache_len: {created_cache.max_cache_len}") --+++++-+ # if created_cache.max_cache_len < generation_config.max_length: --+++++-+ # print(f"[JIT] WARNING: Created cache max_cache_len ({created_cache.max_cache_len}) < max_length ({generation_config.max_length})") --+++++-+ --+++++-+ # return result --+++++-+ --+++++-+ --+++++-+ --+++++- --+++++- --+++++- # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Qwen2Moe, LLAMA->QWEN2MOE --+++++--- --+++++-2.27.0 --+++++- --+++++-- --+++++2.27.0 --+++++ --++++-- --++++2.27.0 --++++ --+++-- --+++2.27.0 --+++ --++-- --++2.27.0 --++ --+-- --+2.27.0 --+ ---- --2.27.0 -- --- -2.39.5 (Apple Git-154) - From 5eefcda2eec1f853bbb351f7587ac8c7b3e1d8b5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?=E9=82=93=E4=BC=9F=E9=94=AE?= Date: Wed, 10 Dec 2025 14:15:14 +0800 Subject: [PATCH 3/3] =?UTF-8?q?=E6=A0=B9=E6=8D=AEreview=E9=87=8D=E6=96=B0?= =?UTF-8?q?=E4=B8=8A=E4=BC=A0?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .../\351\230\237\344\274\215emmm/patches.zip" | Bin 0 -> 1011964 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 "2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches.zip" diff --git "a/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches.zip" "b/2025-Ascend-Innovation-Contest/S1/MoE/\351\230\237\344\274\215emmm/patches.zip" new file mode 100644 index 0000000000000000000000000000000000000000..6a7b63a5d43a9a817f27b335f5305912221d8004 GIT binary patch literal 1011964 zcmZU(Q_L_tw5|Kvwr$(CZQHhO+cv(}wr$(Ct^Kc+larn7CT-_MGcIR4o(=_RU=S#P z{~6SdQQH4o{C^7s01kkIp^LGFsWZK*3M2q<5*(-H|4dg8XaGQvS3m#&5S0JUD*UJL zKRd+#DvZmvcGla&1E86x0wDii3TBpehBkEntM~tlCI7!-%RJiIOJZqvKh#z`@C}wN z^)^fB{>EToAuwG65(e)0USG&o)z@lTX)z@1ZghL%5)cRw8=0h$=_O%@9 zIrj4#QLXlVk_*_IIgKoJC}-52ViSy8MC6v)6ip=>{~GyARA{J>!vkefa~No{mCU_H z>a%o%zlA8#rAt4gIy7>43#L;}I(IhtAG%K$s5n;&kt!38Ua1f$Ziq)ds)|(JC{3V7 z>E&{+X?mot^ipYZI6N-r>m(I7s!`A}D~eF41u9HEkdc>`mXJ$KH9AdO-@M|?I$*af z6~}h8*Dj}2Zal&+CU5W}`Rx&uvWS%f$fq86(`uBLcB+*(-6(Y*>H@u4yO9a|1XCt? zGNlOeIHhZnHXAKtFn6F7_6VGvnFA>N5+0Z=G00pceKni1RMN-4uVENz(z)^swe9?> z2ogIqfm|y6jXN0_3$bItGV|=P@D^U z%ajBq#XlD@GRoPkBNXdC;m@mT-25O@`%HvX|{-8;*J!in7xNBT4Xna?o`P8F&+ZO z8<5Ebr>n(sCnpruR0x_BQX#${KJ2$@1Fsq+tV}{*RnRp`?U3wUNp`Eu+$|$!g&0$j-GM4Dv~{ zNQalkt|<%aMeGnXK-rLYn-etaLRlM0rASM`BF*L35}6?QXH_K-+))dUX3Xi}wxS*~ zBkn3AZjRn`UCI09_mc!;?;)IgC{X-_@73|Q_deVl{g$hP>HC9GNKyrB!-L=L-R$)+ zzMmiL7e@zg<*O?45lE*$=j*I@{%91uFKwz4U8La>`rh3Za+*R({W%tjNa`AB7U*k*YJC!6#8nr z?_TJ8xnB)_0kk>($@$+1I~Ftyl?Sv0?^8-KeV!`R9chr%8;jW!dHh=fvs99Iy}(>4 zQCrfU?OL{x7+&^TR)p=*CZ%t2WSG$jI2k0224(8ON3G@(17cHC;v^Bp0%w<5H9LKy zolz&yTsoT^VXC>tCf3A`{*6Gfqw$fsmXcBo;=%q?jB?a_^8tAEnOoEn`+2YQ=g?eB z=%;Qi=j=h(l2%zeV8RC?yEVq*kc8W?1zuvO(erpU$>-f*X-Oj!@BT=Si^VFYvobgVn>0cZ&6<(_p?T_rhP`F168IB_duFr>eK~+e)1yTVkXez;e zAGI`1yxx(PiUw4m1d|F%u-X?Pp8$NaQ3pi_QC&Puj&GyWdx6abMQ*bV+O6ULwQ+cw zh*VwwiDHuoNYdD%{t&zBKOUTzh}4zs+K((n&<%JeB)WkH2+>hhGo zH5*OFbbjf>t+#2pu<*(tGpdlvjLdIv3HJOtsu0Yh+kFn*H>$;oN8vHF<&hCda zV}Kb8he9RUwId-ukb#Dnt6wZ^O zD>UbRITk3!T zP1~|OiMFh=>WBUUPYN=m8X;z)O@U}}85VOb2?*r&XUR8GO+ohSmt@dAH=>GHvop}3 zVhYdBy9^R6i)_pw37RD=Q-;ozgbl-YAjmUxwRqHQt;MQr*pI}5jIObxXb+7Dx9S%_ zW1>ssr%&sIJPsRxCL^4ufEi2B##Np%*Rp17iXMt!6iwUVO8bg~+HF!Ux9X*cY~0W7 zxSnSV{ilfdI2|!j7gp%TzKT^j)OyPngfk^ae7>V$X_8Kq z+EC*}CAM9uiuBv0oE=h%3JKF9PJ>QGDw;*Aoa&^RmAfI>!cfL*1CF0H3P(#EZBxPW zu#4>TL?k>AD~hK?L@YbXER!X#6=T2z>K7W01?g85#F|(y!xt*Rj`|heAe0NhpN0sD zAWh7J{2c0GUXr}iwhxFPs`sNf%#cr!X&CJKzb)-8s))?W6~ z1Y=!p*5X`@71S`rktO^7N1Wu86I0)8jw9H{ZnQuSQ~T5UcOLvg-naLCT2r$XfBzNi zoQ;0>7mLj;bNv|q6CDqhuk$*Q5WCmY^t-?J>E3rX(A;VMZ_#3|Tx>Bf;5xqTAF*W- zUH#LL7St}kI*kM-U*Y637!u@`5EX;%n@srPngyPF-_k6UZ_0S#^+wr2P=8*iHx#ci zEwrDF`KZa?UK{a1yqldruO0P)y!%<-c;5BSKf+yV=U?N_I+_kBI$G$03{Us5IQ!vs zkM}GOuKx-l5c#j~#hZd!{A(B3rTLs)#VyyTfsYa$aYYNosE6Qi z7Y!sTsRo*v1WM7&YKR^)Lq(5uH#kdNb>?`y=s0|Njuc#A#UG43GNP4st}GI9njvPE zC68FJuOPZB;e}1xCULB=(#(>Z_5C%SA*;~{nYzu^XuCOtPr&u$rmQ6;Zd;A3A*>85 z3aS{$_ll*_SD{o~`$lJVH2inm_10ipu#QlFv-xP7Fmgk-C55-7f%LYoe7%j%r`gdh z{kOH*mp>2Dv)Y5Lh3?r;M`T$J;$+J%r%pg#egF$=azVV3k}mTGjkkTn9g zHH{!eheCq&3|!n^<9Ns`OAE_eD9sg|;1_s9e?*TmRf5p7UleZ$&i&R?THv?)_Orjx z7|AW2vqMWq@1^Ngfq0w*&txwKF^pGdtNL?92{|Kigzcojqgcp*2qrnRbwf`J=I=tz zR+#~#q+kP~av>rjy15(xGGf(j;wK;C3Ft9$t>~)O;-VBJ?rRtt*1Gw&FPI2(go=2K zFQXHg>G<67E}lr9Ajv>`#h*B`wlwRGj^oN;KrXBj&|addnGi zO;)qo(4F;zo$6rezt6t2*)ymKHdMAKsUa{3c9z$O;(m9qG2?w?6EGZ+%Kn!bis_eB zGpWoQ4CBt;`zHUtUybRfAPJw4Wk9YF5E@SOc!$qtOBmE{*#>3*Sz>kOW)8Pijs`*i%XC_sCc%JV z(KsCFDXx*7eiR8SW)|5H?pmBaTg$Gr*M(TYEd%fx>61?-<64!H#((OU%+kzXDG%D| zt=NWNr=>R(4AGfGH^%SSPaxAI{Wh9MbBYy8H0feBM=~AQV!0VvE7umXQdt;u8ha4K zIcb0kU~51{ut|)uX$PQ%rNE629z~BJWYmsZhg0LIoltxKG}{(L&TF;zRmp?UD)h2j zD&OJbgcu1#acz(|S1hv)9R4f@T-RlogLezM;ZGsbW5A995?Op?qkx19BBw9#h1vk0 zN@t~EB(nO|@=!omkMIx4aVCaUAn@LyVoCtdqZe|G#n$SLCL0A7nzLoo6h7jAJeD31 zk6K^~ff*NtVQ3HnnbeK&5Pua@)(OMv>u|U|ahl1#5&|+=rw|G56b7`h$gE#N?7QE< zPyx5rf*I(ZsZl~+1A#v`)S0<};ilxKLJdG{NV0H!`Ua8YD6_E(mjtdCGf{Z7*U?0k zU}5D$qt^u+7*5~TRz(coR~%uecw_**+1Apx7;I;#)7u?|(1Lv9&_pm3?Z><3PNfy1 zjX9fuYB9KH=3UXhdq*^*TODA)p-P@c8+HIMXt$&1V-Md z3;>732QDoA@aEttVKJeSP=ukI@a7dPfgyNYCW(54}z%YN@ zv0zU#=KOF_G}@|ruxLP7x-tr>Mp6=%J%}*>uXF1b0o`f4ZC%X9fvlvi6wL8GMkdwmr){s1AXU_6O=@c79J|BK zxl}AgGvk!On)|3CJPi}*=8x2KsOHVV#cxSu{T+%z|%Qn+%#;b9{X?>-O1knh4hbxguyW+JYAx7mEBKOV!(4@0L> zS08wDtw(HtQL8?b623wu#B34cs%tY@z7xsA;0;%u=jIKk!Q+Lshk+`E=`hkSroOWm zHogXMl)bxW47yaL*l^C+UHL^RM_a7gPEWy7FH?s1@CJ8@d;Zo`j?b+cRfXEz9Phb; zRN0to!;S}8GA$16mqy@6hM2{C34!jS7?2KkMC+~wjr+vcJE6&A1ZJ)|3vj8wfR7Jo z+~>w-AnStfqSx@*=z6bir8`2nI|YsU~*h~rp!aEgpIA#PbMHE`RiS8RbCjD|3U zWXM`^QB#ATnS_el8A?`ot}7By(efEQpsI-Ih*W4!CVbwQV$(Hxk3Ix-V-FVy5e83A z_azL)pG#+c>%u-_!yq8O$%HBP@5eAaWX1itE%G1m{G=x>*5$K1C4R-v7Hpe#8DE-2 z8lg@mm9LGN?#ud@FFQDE3MZ~DEqOBoMp7ptM9k$)&Z>D_iN!YAr2Ip_4*8gs3(8DT z=dMNXJ-dk%Uf$4ak*`a`An>q7`7^oXKFbkjCf{7l5H^dakYFrK8azSakF&_kb;G5} zk8a&>#kz5tuy?1Otq1B<%1WPXO2sXQh>5JY-mfJ{H+dR=f|P57Liz5Hhzi{FGaKSX z{KwiMIKppfsXOccVUFWK9Yyxg0U?wgmCDH{oQ>lhIg}&>QkxX2FDU{@?hW>XygT+4=C!r+r?@LToEb{v-W14nz(Z zi7p<6**{;2XGOuql2MdB2T!a3v85WM36cu$RecDt3B7{A_st2Q;Qf%+7=h`@mjZ`G zLqf?W>kawL)4;*%M}kM1C>&ch0LN1%QrT%c;5g z6c-tSIGN%wdQCy?mTncLL$WmRDj!nJ?S{RjkZfV6%gj0!e+c935hHu>xiyg>3wWS) zt3D>rmoFVXTzHUaELl0Nb~?$Nx`C1++O>;e5N7&eAG}ORv=&u)#vv*APN|0_#{QM5 z_QuuTZjffqK)wfOVd_eOQg^cL4p>WYyn7C+R-a#g>p#q}Y0AUvI|Ru|_#cJiXnPIQoBve@@z z`PZY}@AJ3Y6|9Z^g}#UDX%x3ax%2Ubh_bZx9b6QZ=L1c{eECy7&p!Q$N%o*U~PL)qyqy_QWV*AP1`{dDMhBM zT*d~A#px&dR7XzPt4TonZ?>R+IaYP(u3Y;v(E|`D=X^9fG>1bbFdw7sk8>7Koo5

gyQW6x`KWC+Hp5ryaQEc4o`9Rl+^RPx3gTyLY;#*2ytV@|C@7*Q> zb43i(FlaalEK_}OlOq9Wj~qMCIx_jNpp6?Uk}#R{0Amcn5tyLuszX?Y(DcyZFTTh9 z#_n<)O|SOV$eqYBDOIo@S_5#FpQ^HzxrYcL7)q2$jNix4Q zXkysX3srGZ`H4u1rO=bmRX_EVG@@0s;|HMp<=WKt+wSIB^z4BAdHyo=xRh?Ft33;h|l9{$B!txpWl6SqSzfZ5D0G@Y%OZHQdxBrRPtg;Jp;#IPxn;_n;jjDN$Lrx^ z`tX}{8bx{?qBgp!*mA)Y4r|P^jVfFoAPk64m=>Wf-K#;R>O|3}a74ZK#}W}BQ}mrD zgF1MEglir1<~*v%R(QA@oyK2$AR0(mpE1UCX``6^*Thj6o7J|i~7-unY=k{HwY`Mhk1h9i+zIRGL$6?lY?0WL~JYE@jc3QPP?*RJzc*4w8Z5Un{_y%9U3Uz}U zM7QG|X|Kmxe^;}cHJDpmTmbAgkTR_1>J0(0_{!}`r7Y& zQMFvvSV7}Rx!?oIOdrzlxBKDs?L7wI8BoaHE=}_j4RVY=%d}>IhH6g?2L-$};+zav zX=QQTYk0{sB0;Zt9zt#E0dQK9bd#=iDcrHau{M{PT=SjcBi7PK>P>2iBwG|Iju}mld!$B6=t85dtB` zKR&?CFGHMvze3B+bCeNkvmJeGF{Ja4$d?Q0F&3Kln2&o}7^6PZ$jcsN4l;8&F1GU$ z=&HC`PUwEuUl?6#FCwK^fKiFZs!+kyrbLl8-h=+?$FKp+)m?<7=ydZ z++I_d$QCyj-j zX}Mx!NqJM2m0rNx_yJY_aZjJoPUn>_g_Z~0B2mHI8w@@C{%6rHS7q9S>7gdXVB1fI z4k8*mBm#5oBBw1j+Kal|GO4D}PV4Aoea>=v&v<+_zpnXm=F3OXPj3}GJlrJ-eh83^ z0j+M1O!bE12;s4@Yy5={MAl$)NA*a^VkN2~$gN;{nZ&&@!)$?smDn>igS(q8m9eILA@b&sTv-wY_X9xK_K1nbf*Cuey15=Uf`_QljE?iJ5O8_Ezsd z%R6<~$`}WurmVD|f5Wq5FQclQc*kE?hXSv7x&l6bC2+~yI*pona}>V=mWu(b zsLqeM2c+jVW{Gu;^~zZ-r^C6IS{jl4OUxXEiDn-V4H_CnLu4irwg($?TOAOs9#t>Er_Bv3F)FdIGh0D$hX#|IvvksoCjj zC<#Hm9KbCdO*$rb^N(}svtgE0w&k)uCp&&5v(853fP4p$8v*4u&k53^$gieo-#P5& zM$l>Q=C92sv>UXCw|j@Fl|H7a9;^LD8qj>(_f4AIl00;C$X*{;wp}b}_CWO85AXtS zVzj5JJ^e!HXseu;5vjgPK72=z#@S4Nh7DN2q1!7O7{nqzfarL9j&l1TMBh>t6yADG#q8cPUuNnH#yvpZmLOw$(OY3XXBB3wcTmJHuh9*@rN@9tJ|iq90k zsk#}Bx!yw-K1fhyy!8EB$g*Zg$2;x+eT^~0F%j-|ei zrMj8iUv)oSf9D0g;^^19H=rFR0xCF&KNPflX7Q$prfjRJZ0_Q}+=rRm`q^9uQ$_ji zH%DYgc^h;#uPMZBd-@_OiOGDZU3bS{{1#~X>F6xyAG{tsMzyzP7R4)Ikh25H)aF4| zhg-PNW1TL_t~n=%jFFn1?+s3`70x$+NhDIv8pSUkuY(WQaa?(>66ZmK6^EB@W`8*s zOm^_oI%ehBNN*MOTGlh+w;!W+k6q6*zB=CzRM7?d-`2f#UoY9(;bs75so#(Akq+l& zh77v&(sX18W+GNU_4G4{LvPE|k;5m9clRbGefrh)`cf^FI!vk2rURj3v}L;sH)(of5PG-7W=bt zv*V)}R#%bkqHK07z}Pp0Q>gwQ#)T=I`^cmkk`uHY%|bQ@Fi*$qL2+c6+O0~QrYK9o zdsQQZEER_p)?6paqE#V_rpl*bU}IyC9d~K*a@^+ft9o&3cyp1*su?gevGTP3p1}E$ z1j$91H4^rhA%yJoH+Y8pqR+HhSApJ)#zHU)y`J-QVA)_`^|}P07=H+RL|X~a3bH%V zYqVuT9{I=B9xyC!G!O2pTq?($jU|^-rX-*(D|N{?R=Dd?x3hlo5@#qLEX^$vAZCy% zx1h-g#X4nFrrG6!IRweR<7yrddZlpf3(Y45*3S^sso}~B`8fh!7M!*+*cu*p3x-Xs zn?ZPVx>RiG~))C`pD2RR;%Xt73Z6e|7Wx*tBQWw_avcqL?EXTE)TUd#! z82N14E?$627n0PQs?&u(8NWsu2FbEYLiLd-n&AZD3Y_371T)S$h;H(dv2YWyaDmtp z260lrjx{=+N%S#_C*)ks{PF6X)p5{w)PYE7=|S3dl>wPt<1~p=oeG#ghST_E{K7Px zYMlTL14`|CguW}H5(wZz#?Gke$cJAPP@J^)LGTOCtYwDA-jyst5}hS1c#@w6T>1gl zP5mGw_X8M2iENQQJbo~Uu@g%&mKb#yS@1xA>C({$Qp)l44>?dvA-SgWf#ET6R@J#H zg_9<<4cX9j`ZXeSq4g7RO(!GC&yo?T)PJawAM|MSZ6VjN<8uMe+FmT}e*~c`#IruVI60g^hf$ zafN`k$#DJO5sC&->eGuY5cov<&= zMez}m7st=gyR%Qz^M;qr!g_dwrG^kEHL<<%%~*v~z@H0s@+>2J;HAQx>rlWZe?mdM zCjA*8;1c3s7yv&01K_cwR^_Q$ zT6tok{DF>S^@QL85KE%eyWJ8kjkl_55~k&;?xPRd;6QwVPhVluo@vh0CLJsRBwl@> zDnG#^ugx=X)v{vWlOCj?DM|z*Y2IuYMWoSBA@d6ki>Rk!;ePYox zceZKexTTPRnrDm88Sd$I9b62RgVw}U0HH4YfdfvIxu9$-i(%?dii(i@@QFPqwHfuR z0dUJr0|pd~NQl5hV$q)43kj$nzsPkCax}O_nX>DR|499)ErTLXt2+#R36xqoY0?!J zdvoO{}Jp z=7|_=F2PEAdUbLWr>QXJslLdV-Gi4b7j(R${XLfGm6%ZR$JF%>WyB?L`}84HPi9K# zur#wRClKF4v^ShZklA1&>EVJ6C=|fQ<81G3up$LCzf2< zHO#m`OA96teR#khk5X0I+!B<hO z3IqwmZ)+h@T4Q3Ywpb@tm8wka3cTE}zEF!kAD0_tReQ<|U^I>&CwkH;ypTxuZxoR% zHG$EYN!O9jDhZAuC5DjNs@FeM4hwCPP=2Gw>8CVb6Z|U8S~twRz#q_{UNai42we~l zuGPSB3b3g{CyMz=Q41WNiKe;oV17BNBArFsU8bRE zgBD3Zr;15AN22#a_v7^^nxL6%G!{FB3Fh&7K3TLhrvo=^yKA1M7%2uosR5}|q=%DeTVL%D$T_{DvI~gM$enQYC zc@Xx2YB=S_1L5s+9=6uEfQQ2wp%g6yL#+CJNhH(ZKQAAo#TtIY~^hdl%`Y%evn0Q31^J{i~%^c1Y&kjFr zj)Ws(9T{;CAcQ~_bBaoAA~Sq<|L!cV?fsp#_C!CWr{Vcji7X{yuj}lZV=+rolsZ-u z+>=A-y2XOcG>)-lUB6W_XzQz3L`P?xCPL)FJmlZ>IqTW^t$ z*}yds!H%}GRfqh#G4#wj-wkyUN57@z;qLnWL6ukv&X?CGiCx5{TPvH50s@;+04dy* z0H`{l@9U3F{zrP&Lvc6qy4L;2-aPUz^D*+*5bZPaRpkp4CnjnI_Uc|V)0?Zp|$ znI{ye6L9WVoi?`V9M0U4STz@MF+kiLn_}4DS>7QH`GmwbCF|zpK}!wCNB1ZGj{U=G zEZg)U!)oc39D4I)Sn*788iBV)TS{f$0TW2Pyag@2T&!O~#df@aDn@Ui$e;(_qG{iE zX_Qe_azJZw}v~+K3$WH*4Qrq z>)r`Y-(pg7C_3QhMT-^ze3n0TM#Gf>MaH0pRL2D zav8Q@NXNv}e+Vq2jtLT5uC;d%H-VvcYEQ4~=P3@HxgbgdTgi+D8^j@0o z#MkZVq={F*ay8JHh6>IK-n$``7aqX6Ig37`mVuHWeeaeEZPbmC+4k3px`CDE2`NcQ z*eYo6LQ()2tV@F=Gmbby*EF?eG%5xzJ|?iDgu`fWyQboG#^a`%8{yyEe_xwgB_ri& zrHpGcN<>L-i&NGa$-%0O&aMX+lxHLE(dPK`q&w$WqcypItC`^T@)524wz=H|5N?hn3eV<9kr1CrBWJLPS%nG>B?xs& znJ!*5>ZwNHrbA_OWaiSIa=cT4gZy;n1ihH$bUx3{Bp2TG_wjDA zHM5xR7F*a$AdeV(3h1YGt9E=f zvU46~wGMK7Lol?xh^w>TLj~9lGbhKO_zW!*(vJnBxi2b9Nq^IZnT5`#{J(#JiWaD*pm1y0h*s%j%bWDv^Z#zFI% zlJXQ%g3ttquV>JKY4s|oFy-c|8!Q5++>m=YljzDiofe>nifUlsj7OPJLtq2GHT4G< zQdH_duobiTB#NCJXonMAAcg|83OVxL(ixwX&}BsyO6Z{oX42p54KIq_E%u^5eC;m} z4Z=m0!B!&NtSU+dM`_bd#0oxL{{=>V0bB}_fQv0y1cdTL<^vY%ua@)0fYbO4ApQ+e zq>$4gt$H{L)8bPb#D~nZJLQ!?%aKT4H3>HD?FFnSk?R3W9-?>?iX^p$bfZxh=x^ZA z*hb$ht!bo>Mf2?q_iReSc&rAVhvW(l74tWBujlP!PM?M9ih!Z50% zDafE@!Yg#-x`v!*Onz(+;QW^>H&0qkWi7dp@6)3$L(I6I7t7G7vbmP{)f?#Z33_X} zCzA1)UzFc4XcLoOuhpoRi`cGUq-?FhtYC&BDdo4q&eMI|-ZqGitunngq=semr?%Ty zCA=(g0c?`jG)hmXyPl&&r_lU_dL+jp6ur-G)a0)Lpwq8R#K9v3KfC-M8zJeeR_3HLfpMl(h`NM@tj<;Hs7#omqgLfr(3m-*mcgX&V}dD_Pvm*< zrti5C6EuVl&vJSzHB%N2eeOkBAUKNxH${MPfyyEz6o^?YeA#mAZF?!M;cl~$z1e9s zw!SQ`(RTBFzWsz)M6F@c;aLJLgOVQjuodoT(3{zawutG_0F!*R7crtta(f-%5r240 z8?%KC9hw)v530L0!K&a*{b6DOVY6Ccc?eE}52V?F$@D^{s7&}?8U76aT(3@(a_gBCv<7rdom2g z=7dc>K;EnY1zu%X*FDA{+w!U+-9}7tliFvW%Hi+gaL=bx{sS$q&D)QjZpMAQIFFT2 z8?#?rt}EauJkW^j333%izRLZZR(F3el}v@n%WcBIb=H?N^to14jgeJnoQx9-pnh$a z&@oG<5R4a4hB5)2z4q-;Y}EHmT$(49)1l%3UGU1+b;jGNh~3l0>${wu{g2uzZz{Dx zuTHTNRwsbZU!k4(#xi2n1DS={QiIubbb?o%xl-!S1bPqHSzJGktrl5F8r()F2$~Q- zHZr#(Q?gfwmT(|*3Ciz@3ep+DlOCD-bd--OSG|I)l zW)w8+j=!l0rprK!m+UVkAw=5`3V$(TtBe%-i9phkG%C%R1AN}@1< znP)v#7G%ceqFYFARt1`W^}u=bdDveM`Ru>Wh&)HCm$dC_pg5V{lB)u) z)hqy&tzHC^465|XY?{84Rkb9#tj3<22zHl;G51#->)UULZrN60ebb^u6}wq4bPkmx zvDc!FSU8-lM>z94xG<8^Q|Li5_1mvuw3EmLbxf307xGZn=?cUcU`9&qBe9NT|eL5tpPBqp+LSR-4RwVtpOwT0@vnj#DM!r1!1^@CJ z?@Y=h4nlrFei<^x2fzx9ugbCL5{9`H{>N9v3;p)dC`q9?@$05s>;AL*EQiG)3e^&q zxfWg~B4j$)SEpqb=RDSv4N*aA z;&wJ&Oh`U-->sT#*E+dC|@=9gJcb}|o+LBo6=wTh! zC$E+Vn(hJGWb+zk9#dHIJtCHX8~C+|T4D74QK?%vUmdowRYGJh7b#8!OLyh{#0y%|6K`J=In_Rm2VhQ$MdY4wSphN*aJ+JKxZxh zX`1wuMM8pyjL(3>#JK=uK)zjHAo>URO<@PWck8muY*T2v0K23;0x~IN-*hNrfA{RutrJZI(`597GL3%=@5dA{NdW zknG59grr-!?nKSJI3`Lp^iV5Y8_Jj7rqZGW36)uDkYS>bF-R#Zp45bzZHN`B3HjfX z55U^F^2Nn?&hPhJt$i@uMh=E>$}I|XASU&sb`2sQ(^b&#_M9qv$|U6KA3iqN2u8+a|;`ZT0WrEbCR1LDjU z^AB?yMGlwE#%cWf%JonOFUuaXuk%J1p|iOAU4Xb8wEiFfOIVLE%JxET*p0(mM!Sj} zS-ru;{nn;az7h59V!zvY1dJkO8#ewavEaeG8Vm?~T5-K1&?ms?vM=D+PMA3nqw3fW zgawNImUJRme8Y>Nd<4 zWWX+--yFX7@Q;1o9wmOHlZ*Sx-D0QH#?JzxrR@$_z8ZEq);GADDcI1p{-g8CVl>xA zUi=^NS_dQ6&F%T^?KlaN4GKda^~Dxf%rffB9WKN(_EpUO%P!cT8`*Y&_FbbBx0LqH z#XN;5#*m6X3=s)-Los!s4_XMuN<+O)WMTs28zwo!UBtM@;|g9;x}qRZ^sAU?S2!_? zbHI41n6V2re5SLC#+2gB?jy6bw3dSg;S)(_l)E1RUgR=QhzNye1dy#;GyPtjNVhxn zqMUV$$0Wv_77>0fFJEus>Qbx;tx>x*F+0ObT8eeg2$vNn?y(k6N+V0#@!RFhzph5O z;3c*q&j5vSS^c6N5N+H2)T%VdI1iwGJdYOu0#;%P>L_u=o5AE0YjW%IQR_8%#V7g1 z*f~!Z89JFEk{7Kmy{<w3zIlDJ{u& z?&y7`D}JsFkrYmOe43rCqD?M~gbRrf!iF6_`5MkjVvwULR6m#E&`;{Q=;^r=;Wi1N z)2+eAxVSc=VUsMEC6hzRv(pd%rDC|)<+;x6R>|g;E?cDG4a;7o{O(|e*2d7!$X33= z0L@l(NQ2Xu zF_w90Q%al$W}&p@IsnK+m1y@l4J{|nQtDl0_=#;yedHl{6@#7l3JA`Co7C+t0|4;K z$yad)cpT-$iIDD?G;t6UH)hZK#U+>J>fHm#9#)KUDN~zEu1hury;;#EwN4$AoTN;M z)RHWdS#-tljMxuYl#-K^Wul$Xcui%jxRtCN?$ow%9tSp;<_lXX`gAOjf|0<@T&h5Ia|kz7_QKzo@~s~I_81i{Z+7sB2EnltrtJK(_l zRKbWui})N6WOg-5fFTW+L7RCALyu?SEhnv2;p8KH-Nj45TWq6jV_e2FDS?!}!jQ9r zzpHChtwmy=>zbFv=%qcKzXi-X;SP zXCj{yJfX5H3Sl`>noN}@x*7O5*%4s?*c9+o#pM0nPqQvk6WA;yKT9w+J&joLGhaC` zK;svN?bi=YYF7*AzwXK|B)e>-Ne8+XP#m3#DTsE4BAFxwcwCF}@Dl#8`h;n6(9*o&0_PtLh7L0jHXg^veGl0+QFSRk+dc)(|d0XcZEZ|S!DYJ zipCxqx;I2rI;u0#@K2U9DYiaLR?z6uZ(A)27YA{uEXT0eNgtPPz!kga{3LJf)C&UJ z+NuCG(61Dpr-=gsBjjc1K-D!_ZPr}zF-M&d+M^8r zqSH-n(!o@|`$=P)VEoWKL&d~eMY5S~RXx%?8W?SS5AeuBF>{Pb5-;DuQ4hqn3 zZ7nttul_JD3$fh=0FYHhk+Ym@{PaA!J6#pT(QG}9=ww|CmGYhc!yF?RvjBlTT}mx! z7}~G=KSZ4acV^MHZDZTEZ6_7mw(W{-+pbicd|}15ZQHi}axUI&ZT*G4+nQsJ-j|dH zBHKb<*O11-L+|pk%V?n$VNsTzO_7k-zuqS3(8l`(-~M_YPi>%NJB?nKTqXB&>>&<2 z_Sjd)j;dRt*VPYX&z4jN*YjK@2|LK8gUwO67V{$0g2+kwtSNi{Q&?@Pc(%CX|iA$o7ET3nEF z_@Yk?EgA~fohEx&m=JkzpF8VSH;<&* zsMIWLob8(sIqL8_7Y+j$nA^saDV{Pj<4Nwo$3$Ba>#O~RQvu4hCJ@*9)eK(SC}a-p|JwzYUAE z2_{towJ#N5gK0ygj3^*Sg=znJ2XwiL%{CDyO)$w(Wp`hUdMSh|z$TK$smoj*ZgjDj zuJUITB3J??tlZ+SI3`3VkXR0o@Hox0A-yKcl$I^bw%dUS$Q?RHtmtMo8^2)LMzRgs zN!fu&8Me>Dh=Q1zClWV%h-UvnMA?qwNdD+gK^1MsmPueB>O~(LWKCFsodzmJXUGPW zIZ3v8pdc+>Dy~G6i7xQV+!2>z8FqX@JJ#GG&lX1XO@+VIE^zUDTDHl<$2TAqNkOVsb5S%UuZ)9 z#0zUQrUnK==ie>I!H@rh*^NwT-~Kgst-)wAhE7SDb;1HD(4G#3Z12M70@-Y5S@8c! z{k|N+Ipg{!i$`VDUGnu*&iXg>rIOKLtnfn#H6qEYJV+Py0j(30-{7oll2#Oo+`;VjzIqMbq~p``Gyzj z@7RbQdDParY~OHTF08#wE1S%Xucbf2UmddLK_)-Rf)=wIaJKSU@v45mhQz02v}Di= zxT723L0zkeq1Xa;29Eq;Zo^BULjysaD^RmypNR4N@!;O&Xp|$Loq+F^(pyrt-z?$_ z!%fxep@%~~ky9Aft>^05k-zdXI{uBI^iaYtGAtw_Xy=_{6PWTO;{rDk7Kzakd;pY zU9(h{byoiCIBvO0Mmfk?2sctQXD0oW^bdt&=Tm8uu5sGea`1&B_OB`o}d z;l74K20M2Qoo^Zo2Os9xbfKAtY! zYZ*-hT@y>rSqNe~#3?ZBlH2ierwMN+XavJ5gO{oV`ygjE(0wBpc?|XKdW|YnN~3A_ z`P$A|DOG`|Uc%IJf?5gdw_xWd7A{mXJOUZns}f+On1-7xd~8+3Q3OtDKfy6&o!p@! zh_6z3P49d7kN1{dys$qQVh+^Wsr@Yf$H%uI6M28@F3MJx%eRuv3Gu|mvGd@BrfUDi znRY}3y;&_H@^$4B@t6+cI+FfSKa1$re3%~8oIuS0ji|%J;%W#M{kvAk2snF%$4%jx zx5N7PdOYGi3xPTMb}P9R2wobAIcQ6YaG73*X>X(4pSdAh>6G$Edg+usp-ZU+4xD-l zW*pmzw1HADl{7M#j-yLSG^WQ&$&@}0TokKj4l5y&8ku-o*RwQz*mLCi)PN?^-1al{ z)mW;@PU#Bd;$00Z)&jPUCk)aK?o54U9r;AqB2AtUD_00(5hQQ<56OZg$1@phm%=qD z7gRKlU84Y!%SNCh%DXY!ffvzjm^Lpx|Lr1UJ;?ohyyYj%Z<^3n_JM~DWrEl(dUn<-Zx zZ7YnF&mn{m97#crHPi2jtd@JvOf601`m`f5f(Shu6BOpDw-U0sf@igoaacdQ3CxX0 zQ+E9XIZTh)RBVP7jIFFpdbZ%>01DW(gAU;eF!!;o{)z8*Li??M`~8#1h`_A56vFr! zJE_n^G$$fX1)rykf&w1;h^m2(PSrnb|99EOPA-Z7Vi^{xyGK#N^nuWe`c&)z(;+Yv zR-V30rnWYl>rP$}kV0niv$-3v*1uz5X@banL-9wn346T)ee)o1lx-rFsv5>2kF8qI z>Ay48zpgiLuA2CDIQ$>%RkiZEvP?Xy*sBi2-Y5frUG|H_9{C(Q0b# zK|@p|F`YwDI4r7B4lxFqoY~Uq_|bC2wCR?D_~+^mXn@F&T@T2PW^4@3(uNeo=$}SZ z#1$Gxmtv{$755)*p!kuW1sQNF;a5Z^+eB`tr&)~{(Wf?j1--js3b3Ia@H=(sEX(^| z)r(pjRl$`Uq{cG}`UA1wFj2dzL^uHGs3u$S8zx`I`hKfZ$Ng%v$;QHZ{Atqr$#xU8 z^cOt(Op?s?c&d(<96GIzDZQykQ}nE)*rJZ4DL#*I5L5?fFRWJ^sa_dD%eCU)kua9sam$6t+EqVh?N?pY@p5r_tbTA8c!NQ z@%#iHCT=QXg{X2gAa?nik@!brmg6lTM7Y zVgE`<;3eGx(z)2R5rdCp2uFO zxPr1I5yS_$>na<{*wJE-NfHX>7ge2Q^%_yI+7YGm1SK@iQJL@G<4fH~74OZ-vUuw< z=(-ND2TH5jm3yo{F#@Z#7PLNr#K@XTAHb=e04%f2g?lo5GVoT<@vQ-GUMslC*nd{H z+XOvCi)W-stmsbTNflVWSbH$u9kwbu0xNBY3hnN8>#dxRTh;N(*H>#n=M926>+m;} z1e+^y%bxE?q7Coy^ooAVQ_oGEHR+)lEelWi;BI$MD7kx^EzhKCWGo}sUzW>po8Btx z%~wAD*N}itLZ1fLX~_1bZdhBn0t@h6Z!w?iDZ2T}5>*B|aC2zw+(%dIp*ZpKFsi0@ zUJ`s<=B{>5ag{B^@;Ad{}K%?LKqw}@L zvB2jF(H(S!5FxVyG|nAM46oVF|I>A&OpD)kp7Lw-*sWL#g=_>Az8pmfU-uu?P|CLixN9VrhOsAY$WRPdb{Zppr-SN{) zfR;k4y*tO9I?eU~n(9wxGjh8u&HO1<|G~9*SRa_X_g^aM9vTtVF9iI-G3XMAxbJmedbBNxFSdJbi9g$Jbsq*G{9S`W0G0_q$W0`OLSOg0ED+Te&z zpjWOnWl0I=WOS}u9w#-4;R4X%@)us#>C=V?zPqmvd-s`xuQC~ze*2@}?RN(bVpGAT zu+vjx)VVjVKdUaT#|94Pwbswkz1QS9<9|IZZm88BPFYlL6$j(l%GqnFkQsM^EFe5$ zQi{mw0&p~sZBjoz=(s8g-imN|Zxa6*j$Xp18nU6|WN_q!r9ek1eb7avn6Vpn4E zgSc5yR~*OgRA~vzYc!YRJsF?(8edO~+a!2Cqhs?3pKHR}} z5Xs6eUik;lUNYOaNmtDm+(1hfyAzyA)NDl{24@ZZbHD0=10DOJEYa$tz}KVr}M zKRT+`?bPaWiA=owSkiiV*eDRxOVC5L{r4vKv*GEBvx6=FuU3y;t%vv27cZ1GG^q&>ZC_l0$RklN-mEW*RyPv&AN!_gTF$WczwAi z5@6?+Slq|c+N*j8d_G_gAvw2Ca`RE9Fd`NT+PjV@iu$8NxPVORDqG}3KRQt}JM6qg z(&-p;>3&Efer&SGYDl!tk6>2UKJM?&zIoOufK>)<8WrUJnezu2G74%+!-_7*i$m)+5k7p{@SYO(ZnCKw0uL4_NYExgY>JBP7>q6h7hsr;(SIB)lguP)J z!`^UnH$E6`Aox=V(ewI1rxHJLo~zziP}EkTgqcSQ1z|_aF6MZ}KSIYepPn1-pBH*Y zK>i4 zE@$~|=iIJJUpGSeCzbu5kDf}ju}Xq8;&(*_ zPvLolyfq_=iX@qt*~%vK%b4mC4cfX(SW)#?4OtZSF`{|ZZBXEbGv(EeoAo?vgB_6$ z*@Hp{Y~cqAj_I!e#mU@asA3q-j<0kI0$Q7}vaPxa2?gZCgz&Yur0Dv}abk_4;k8&v z6wNJ6RuDve@%w3}We8OZOE7D0)Z?7+zk65vzv;eiK-Bj>m@V8eo`&Zs8(Vc`6#P;o z3Pm1%yeTa2R$dISeV2br^}BXCVjBNYSN9K+g)HQ;^p~hg!_zNrYQYzpZ0D|^w@mUxGW)j|M3gI?I#VCqi0trszZwe-c$-U4 zM~P4=ugUW@z0!nHte6Abop-t9I;5uY8Y4({kM_>_A2RQDK>w<>Sr3)!!@p>yj1|s) zw5{ZG@lM8A>+3sKZ{|^}{+VC4HXsoBJ;k|_t%wT<31kARdT zZo1Krb0_=CDUGxpjIi5~Z7Ea_6Uzok{8CemcyLf8H>Mr7OSTq$Yt|AYso2b*N&JE@ zmo4%?2cVvAAE8H8)kqdgesdJ59{8V^p`UJtp+`SB1ec4wHA5ege77q<7P$ViDSfRc zvbuEjS10Snxg`A7G58*4LRAHo{Dw%Dqm~fCr_b|}_BaEUqwb+y0LzN)me$KFcKv0S zN6a%%SQE;XhJ34rG?MjT6f0KuQPZs0)LPtq7Nlhbk&QkHk$eMzI;E}U@#Su^U@#*7 zy_75lJph>?-nywpFBIE7F(EP1_x%O@p^2d%6!@}NTS~PRoim!fz=Y?#Gc}Hy7dY%L zOw&n}jz@!1aQm!ksM=?tA1%mnzXzxJYkV%Nm3LgifQPaK@KDI{nd*(G-5kaE)=(I@uoD;mh%uBvgInbdA&2(T<+o+9g9-I}NUH zEz;GevxOVY6Nvfv13Q_K3Bdjbc9My+UY`t=?LLea%jysaj(<1B3rul4B9v_t1!7(W z)|$hI7WHMuApG@H2m|8`leQ}3$q|ULWK9AM*?_ry;)-qx`bAQ3MFv~{9LN}3_eNYd zmE507WkVIyb7=CEdB1adcYi+=mmi_}R*msbG0^J{Zv6;~Y<9=r*9^xszebnK$flsn zL%7^5oepa6nqDU%u6~(XFV82r|v>pKVb-5Wt{vFnj(x z!`cG2)YF|BWVew^CNXV!Ol_>KW5iCsd$;>>X|MjLd1?^pIfOWJW+Lwd1aSN2@C{20 zh5)G)m@_)b6*YsTqT`6k#C&0Z<%&W6a~?mQer*$u$WQ6W*wb;bD|P(F-=_NqRU+No zyw~yk1^W0v0k+;^WK_s-+EDRg0hss^UK0X#dYT17+NfnD3v&WP=Wb34?1l-l>0vLF z_0{qAW3)Ka;^)5!=)}4*xm?iG&E11%B6s6>tt*eS3gXk=nA>pYC}n%Rw1#{rd055B zakne*S`VrLkXD|{mtr8E$d)$FCeZ#_H57?+xfEE;{#~fjKHq@)2c{$r_K$W8F(l6c zNuJd+H!!e3iK(=woW`|w%M|rw$Srrdv03Bhk?M)fF5 zReL+{-2AK+&(N|-#}2CM90LtldYw##D>#Z9v^diL-JuJ?`DCq$k>2VL#HbeK6P;Y z$esMX$JH~tmDnl;tjFFZ!f(!e{U09dwZm#4+SE^;O#+*v^nVe|7!*&(!yDP0-J0Ou zpS`?mhSzCt2`q&+HgySzC)W(l51)wXi8#@^{un6<5mzeq_Dw?dw)=<6#5Ya(yxLH% zbKmf7-CR9gJkildFSSF)8Y8yp@@gdI=<_N&qqJMmH5cX!XkA;F&F0h<=&LHjhkopj zPS*1HRon50e;+gR7w=K$vb@yvg==EewCVf0#=e9e0lAM9g&T!u?+{XLF>WKCvXo&z zfz;3T{++ZH=+^4C=Jwu&=^N*J!GTR!2mqki{G1I@N~vqMg;0_P-o%pjv|6nN2Au&Ut9J5=3I?LbBEQM9qRrqBJQ)&3D8Uoyiy`ZuhyJ|#DkSlETqf(^w3$hM3C$5&bxd;U8GD#XLcS25{>~uyWcAxUy7EZJV-G)>6=51IVJrSynz}wdFUC`@-#ol&Idg2_?gTy9*}M zfU8Bxs0CFFHZoVrGJlohO?M>JhhZ2Zs11c}0=&=?ykCPjW?J6D{?~O<0AUfaB~#*VJRLl|tVE{{f&d!eca2=lNMu$laAn>AFYj`z$`BC3tjjMzEIQv@u)@#UOuKN~Aqh zFeJ?-qUwW!QGjkxx`2UH<1+ap> zA=ROq7e#Pv5Fln39`}KzPOT(sSnh8K;?~3s+Y-ci$|k(I5>S{E|bw> zpr+Gxq23~tP`5f`HngfUvG>uE|3GdNSkb|U$~@V<=~^tDU!&TITXLkgZaO)UcnhCDLfyuIJj>1GB z#EYU-Q&LppCIv9%yHAm97uF*a53N!n2vkmOmsqKhkmfRw z<5K)3Ka23?a#R18h1-{NmBVs%%OAr8xT7EHDSBN1BN4W|`t}#|nYq(lWMKNRtYc)H zRRL}@FoopF*5^6A9kN1?^=X1@?OjWfcYEg)Ma>E&XW}WYY?Y6+tqrx`UoDP5$PVp$ zpZ3ScsTv9!(BVNsnAo|=*WMLw)8}2$_z6JQyl4Us5)_b{e{vd$YIn84tP`ueE{G3i8PO%J(TMAY_Q2$Iz! zK9r(zbI>P9kB+eQ0yRv8Hx4IBF{OEKPgi^Q3Yf>$HLq&e9aV<4d#{>puGib=DD?4t z+_#pytIt0pPt4x-lhdbsZOWR4yv63D3`a;8?c@ajhPu+iG^pVW=2*R^W5_9E&~x|A z-OXyd|JY#vIQ`16-sApcZW=KM^Y4ev$Nb?m#UDt8M^IvHO8WrLI0iYsw7TB*fDDPH zSB0OrX}7s?x)d|lpmZTz=~6|(=DQ>G1b|8rhZxkMgi`Qgy}-tPA6pWAJ~W@$}8S4Xeq^E|FIoo(mv z!6yHOvG^+GryE^EP!}-0!He-}Z*XWj>(u3`!BLHHNWsId4>`7pH7pYcW!1RRg{5oI zu71%!MA?EJ;`mvl_{3yx!2upoRiJL+wAS7JXB-GvO_wuj{of054!fu5J|nxi@QC5a z{Cj^gseK%IN*$KE`1HM9CSGl&Awax!Z=z?gbyd=Js6KoLOy~&yn06$t%c^$KfKdou zxO`746FF*kJ$5?piQ9`!X+bO6ouToX{T}R0sYFV}I_kkV7RVQXc!1VtO<{Re z!8397h>S`ZyrTC=If%8tO@&G}<*K0ME+>5iEGMQy*`W=vQCC_?E6%4Rh(p%F^kt?( znXAiC74|=34uK{2dhB?^n=uX@jAUz(jC`SBIGjrNC#9xJs`Yh+M|mez2F6JgMqaQ? z;QX%X%UBJ5$#__jsp&A-fSsCQ&*!d_d!RXuS1X6ZPo#6po}<6QuG-tC zgck)bitJOa;XN%oF6|D$jU>)$tT&@bMw?}}&8X?=GKhP6~LC0r5+SX>5meD9lg z{^p34>}GTJ|9!<38!K;;y=<>rqsJd+VQ=J?H8g=Vdt+02O=s{DBznbtm=nNtKXgY9 z=4SDua>Keu>d(L&Gs7<49M|iYnDh8io{qE+#>50ylpnB%e)=K#y0&i_qWk@!jqAjD zO48DF{WmvPTj2!n%X%(D8oPqlnZjtPId0Qx`xVyE{u2!`skoDTgl&Ki0O`pS*D`r; zx4NmQyDNDTle_OI3p}G=UCzCv+#C8??J?UYXRWG~{&tin-pU+Lw@B;mx)VPilUl*8 zI{!J>%I7qwM$-#`ZVS1?Giumn@%Cr&(&P3n|HksVl2*{gjx>fLDYTXMt0}C` zctvbUOh9*ac>Zx#%UtphYeO7hO_2IB9A}2hbm=p6PrdG$QFU3ZAw~# zQG2y3-{Pv`Febdez-eGn3LxD>ImQukwUWZ1+>JgM1Y!jxz26RT18KIk9KgzdP9S*s z>ljqR=yt@&C>yDTGAG@KFzSn3*_* zEWFI*M8p=5lZvuTXF!BBW4ErCBify3asyg|+uosPo==AGwiiUy!*t2b<$UHZiTr*e zAHOJ3eBO4psTOK!3MoTGOI+0MMs~I5MCtp!H3Bb>Bt2h3|H}!35az^yv8lr7sd+*v z%L_)$A#A^_p3y$k0wnQJePMR7hp`Z!A~#TD!o^|@=6dA7r&L`+jB%PQrIl&Tad5L* ziCZA^$PftyBV_eT%-Iy6h0@;Gx?kV%YoVCtV0?yqQXAYDA{Ps6QuV+{J?|2=e}l2S zU&tsE1d+WA;vd9}!x!zpF7P^~Cs9`F$l1%TF>^K6i1`PwY(pxJZ{INWh9-qr1gz?< zCSS3hUqjsP5hxpo9OaDR=Rnx3g6{7Tqn(c}vxM7p)AZ$m?8tx2 zAWG1r5!d9owm?Y?K)+<~O#aUbB4kdmH_`M44Ev}V9N}(jWbbvGOsVQ&6{}Yk_N<~1 zF={il!OBoGO&~r|`nfChN}<^9H1GKW4R%Dy&16KZn68K;^6Eu4iY|t94~Yt>-ceIZdwmWOgCI{xA6DQ#W|;N)b#;aEk(EAF0o{mOV_j#! zQ9VHz#sIARXul-C^oKA8)NuN*-oE^CIisbDHg9UPxPvEgKClEkv+rIt@B~LY(#H)5 z3yXn26tjNKX3u!6e5DlK=$Lk^{7lp-Cr$5iL8EtJ$3;<`8CK2;LmgTF_152uB^7gjEs}G(AOn4eSfyfwLktZ>w zxaeXuf;_IA#~R^LO(#|xnw+tTL_guG+hxL3a<-LhrwMf`(hE_S7iaRHM-{e|L}Ct? zX5sygVCD_(@8>gG-}qCE$ynAK=={!LmqM|ej58R5Xb>&ME`AZg7hMt*z)Kx&OmN~U$SoA$u>2B;h5%<&4! z(aq=y(o%6QPNwiA!igK__P95r6cn8NWuHZgpmhUX@IFzS`(j}R3gRCH@kB(q!a0~+9T*4{t84T! zyx@7|0UNM8f$J$mY{^r*Ycmvzuv}Rkwu6ud;*J8D)SQh9jg0OKZfTrBMTFYU z%;D`c#&OrqCuD7{fc0E6GRr~nEww9akIa}(hAg37U&{z_TD{{JO|svXCj;fh0`X3; z4H-}|qU1mQzYdXhcr}x~MfMg;8{-GfMCY$0n_!D$dy-j>@=qGkr4-0OZ^PW2n_ovM)U=RD6`TSy5O_7u6 zoWcTx*rfbr&DMSlPIOzog4S=bk&-W7qk1VL|M-WCUeTTO3J(-XGGRpMo;kR}`z}&a zDZr4k)k!3E{(H|hxBB;|R1KhubrW`D=+!jldi{bKS&0nzjOEfRe-Y35x~7r0kiw6N zg^W98l;|QB=ues2nj(Z%d=|(?L=MYl+b7;|qs@rzh%kollzmC+5 z#Y%tG&&U+BVg)+Nec{j7hi$@a&-D4zYQH#KKF)wq+s@e;iGlq_*M-(E+ykI*_ANMdPNz3*5&_(0_75NsAC`HJmnQ^A+O)>Vpg z3Naf8*S6pvfaNo+!E9-i1-+}$ZoORYx!PSey}^JphR!t95hk^YqL- z#3HhpDVR*8(L|@zhfBsXbPUU{NMQi>d_ z>uXxgw3m$u;ajWOtU-$aWO2TP3%`jA|EDypn!cq3ZVc5I(gzs2A54+JZq3`3LdCO7^jyK)Jf41SneAwm-0j{J7xWcJn z>%ZlFP zJCh*@;{u1o=AZT14mZus-0gD?j(c>WwaZ9q82z{z!J%L)5X~^{iz%7>6B6eQ&+kCa z?=dISy)=NZ_YdfUHqdDEqZgIRBaZ&nD4IpR6xzXU{A*J&eQ#AkZAZ{t?M>-@;;=C@ z#r?c_rz9ICkHgHY=-mqKBBU+jkr5^;(!U3xF?3U-`liT|WR{G1J{a zHB`9bcC+IwYK$7rBVN&D*jn#DwXxEK+LGFSENMa-2{|CbU_GY)p}$8Z_MWsr-81^@ z2Mx#s=vwK&Ra{Ciz%(`NF8biV)~@}JhQ~@6?vnFXXmD<*4uK-$sa#!y%}$l=FB@q! zM*v$<$9?r?%Ns)v(5IT0F{Ibo69`)^#akE9C4@pJWveBY;!Z!xxX$)#Xgi`Kks=(>g z6_MXW6-MxbM(j4Q=!Wi0z*MUQ9u)`TG^E@KjcUU%6?U>c67!0J`{uGB1AC0A6?dBn zgx%^HN+XQ1v?D;4jH8PN8}gT_koSYKRt=h={Q~O9qc5>okgk-rC+Z%tQ<8*gS=FwyNtssc&35e=6NxgUY_+vB~3J`4Wfi zDr^9?4WVv~n?_JS)ZOT(FvW#*M1Q(^5p}YiA?m|qhsPDCie73XN$!%Fvz7;DQ2{9C>Phh zbkQP=uP4Z5RvDXMm~lYS#uv1wEt$eJ8HCegJ8Y2o1|GyTmD)YETRP8dXzmCtjtDHj zK%|aVmifXaPHc> z6895AZRvrrB206k`Bq9Pvyj8#&m$=m7-aZF0EZoxH0W)4tc#n%+C)n><7#CGQo9LN z{DK*sgP7Kr?8t{G5fgaq4^31BXkry|Lilh?V%AHtqO!=v&#Y64%d|PJEAqvBaWcwA z`f?B6;TUCc36pswF)LoTxo-(w-ij(H6u~x<8{b|@BKE~E$ldD$I>`C8r7)YWp>8eO z$GMF;v&Z-7Q)oqJ-dl~UWfeO~{kaTM+(R;k(9BcAndz|y@rf~_(^wQuT|)3s5+VVD zC~!g<4oa0pkN`-5Ne}aAbH-9)QkhJS(Hk=~&a8!2O}nJZ;tfB`X;2q2a&yyi9uE5= zmBo2yAqnm>8c)h#FARdrHV-3#Rk`T#kqX!r6X!4h1pk5`6`Y+`i zp1Et;EWr~h^Dp9$zxEAa`ZbV4@ivlwGNmI*6i4;IYL~5!MzNWzc(==yPR>T(^Qq@& zzl62-UJ0FNpMN&*?RTzm6#ZisM#M@FqR)rr`53nyVx2NAhqrUY;}CAY7UowNu$ej_ z#;Z9T6z%l|z{jAlBMmt=c_wt}><>_s-;lfmd#v9peg|Bq)$P2E!fLF z@fNkpkL+V2+`XmfcGR&4%oc1Ry(CF7a0V~o%&BoGd#e9SFw#+dw&H(D^D;+a>&b=$ zUKYO%Dswj~TnME<-_T*A@eJTb5?VxQtta6+bgae0q2F6f995&iYE*rp z;cR_$q~H{u3mNd9&HrpA0h6i>2#2tv=j6*6S0|ZcA&`&H@5RHJnm5kyOTXtv%!t%i zrG`IFh}L4uskmY^Vp>J%<4lkxxHiWk2%wQ_2M@6bh_MB9!N%L2dWxa8_{M&3naO)Cpt`%QuO5$m&At86!^^ zGY%7>GUHoB7yA*ah)O<-~7$``P$ZCLkH$1{^0Yde1{PN$q9Xw-C69gH|LO& zP9-IV&T^B5%Z)>;Y$N9574rf&Bd7Cf%C5W3Id$rq$#ZQNY`C!bh zCt}}gqGWxc%+I{tG4Yv_D+XC)nwK>oSk48ExV&8Zj%_A#O2oCu)~Gt5;23`la|RNnp_{Jta(n_cp}Kp+K}UR3Pr3gzhO;>a>l(Xyv%H^rZZ2_iIA7H7 z4|j=*>}GCdC}QnNab!pJl$on7zYOYPLyMFAgP>zb_z>#*55d+`m&auY6NkR3K*^`u zNpI6oJH!~sC=QsFMYdkL6^9S7h;jrDQ!ElP%8-^oBbR3jYyj2xwtW z4^}r!#S8&|D`$X)7tldVQ(yMeS`a%}BXV>_l5lphQj;1ve`&{KoSLVEt;e)Xu+H67 z1QqGWVeqpma689M$WZ=~>E8|9RdIupZS-3(-k|icLxX9UTxh&G5~$w1`KsWW zAgg9OprS`w{VR#11}vw>a8*hgfew9AV33q?ctUr%x_+_zY6W%oMbr}d?-HnZ(y!%H zT6SIYN^i;fUy($ILK+q=nZ$|I%V*ey@6&ePvwI)K#gKo0EE~%Vd}VU@gXvqfPk60v zg8C^o@jt6*#cP$b&bEhnYzrTP_;$BT=Gd8_(w__N^#mWd*#ET?^VUeswg9INTbYWJDKT{Y~tF5!-bjs`3yJ zto{~r$Q1O+E->vQ^o)8Z{r3=)LlLJyU$eL&3a|tIUjLMY&R}w@ZBV{u(B`oS4^cJS)pEy9~FNRdk8IauKB^M)R*?IVWV*kd;Ul+EYaq_TL3Vbg~ zy~&|fLZa-GiNP657gYBg;f|25IxaUBS@vU{=cnQR2in>7lFE)9RQd*0F#fF&>7J-p zdS$SXg@;8L&U;p~8Yfk3NZrjN#s(0IB2YxF82O1;tKJ&|CF=fyt5(@gA#m%M@#O&v z=d1z-U>P189&S`GN6<>=!Ntb|MP(LQ)`LE*C-MTsjmp(xvu4thXt5yZH+TEq`q>K` z;gy(3p-U{Ca3iCu=ttBA5!UBVQ!QiRR4su4&sr@!3Rc8smXOY=r}Eb+2&Q~yla#vJ zxu;vh&iwR-vvFK2>^d$gY#SL|0tjYj5$k6wzR=D@?28dP)3i+-pgGOgiV4%ab%t%i zhLM>)ER6sYauMv!@jl&5+3IJr1U9KHtF;>ByOXAz^-GML)u*xl7}O!{cFN@3vPB$f z^`+TYQpQqsA=Gl!km>vN=H%o`O)c{$%^OvxE|!l~XyQ1f2L-6>rZZG~=j~)G=yQhp zW+z6dzM_{$M>HjWqs{l0XgZAi+Q6U4au6<;S~3mc5@q(E!mAc;j2E|HH|>Q)MKqfH zd98rNcvxr>f{sDI41e$)o=T8SS5RsTzn=T0g+Xsk=6!B5dvaHUI*?_#h;YA3!KE<$ zu~%EiV^;o*iP0&Di~&|hlGGvY|#DZ25?hhB|#=kiT6#_W53!R;EA(Qhf5Xz)!AVz=gQ zRi`tnCk&|SZ{tibo02+x8YD4Wy`Pz{Rx#0 zHEfp#)MXv#L$?HWQY_nf)KfQRuR30+vh8>{@D@_}Y$VTmT>5gXMR7Rsc(EP23wZs! zF$FA;n#Dufxzm}YqHMM4J)O;e-)i!Yeg1?&clERc)cV9HnWn~cp1uBXtq~s6qy5#s zsYl1o47Y06ZfStxzHIJ7$tL{63qc3K=%?c_IlTh(u3d=6#zR{V(e((|H66z&$M;3A z2*lT>!k80j(Rf};MMPF$Mso=F*ELxnuUU_~wZ~f0Lm%fb(?DlHn96Gurw>j70Ht^{ z(~8K3+^Q3iaSB3Ml}3c6WEnexQJVRbK@$kT%M2v6hkjTt^y;SZIzmP(66y=vsv95U z8_9x10YTxQ^+UI|phvF%WuHOw#;)Mfw{{*-l)PmgfsnKX!f8>TErECp`|9Fp>kUwb zX703N1J#s@SC5K2^0ne1TVKLO7b1eP89! zg5Qd2Lcd@1X#=XZ-{{ZuppS_`8~PDC7^m$ZC9hUVrL9QxLE|nM{LCUqp}5bI z<2Xxneuofp{?I4uP2I>ab4~~j<^t8p5S|)6Yyqa_{pAts6BIvs;dPTFRwgYnnLZ*N z4UA?ry<$(DD|;r%*qJe^K{NtKG?z;@TMud?U0eea>5ss+Xq`PFUDr)Y!K8=6<)LE3 znA3NPlM)bYQVY!{8?jh)GQm`_^dVZFH0D+Nmh;^T*QR@_fH+ z)^m{nV6v$s|Bfd+vUVc+m1Mj{RVhQZBLe15l+IAH5i0mZ=;3M1GWJhOw@cUSM#!En zXzmaXhc*_(U1{_{rZDweP-jl;t}KfIFW{5+|GP`&vOHU%a0W`BI8u6xJ*9MMXo z0|SAbl18#+$vBM%v<~&&c(Z--!oOD33|oSxX*^y$f5K;< zEgUUr;Ai&c=Zex{#+vi8>ILiJ?!#*uEZP-zeX8jpnSwqBEbt(F4enqEX@PX4Mjp<& z`D{|FJM~HCRzBwYK%mbJ`WBx0ayKBxp=M=LYEX3PF_FHokvPJz06AVR^%1;C;N2CaezVjS@e2;GGOJSYyw+9j!dGEkk#K%s7JcZk?ff=i>*(>K;M zwI)qug7DQ0F{Iz0lo(>VN{{ELs?Ox;8LP!bk;)>T$)I`%z6bV8M3y9YpDqT*M?^G) zIQVv{iSWQmN?t*j%?GztVXjf#s4KJh5=4QQL{oqgH##t7(#PP2;?ymYSXS z3E7{|22`DtVpA~(M^ou$LS)>>;7}cU;OFH)A#OBNHH?>><-Ep?gfzrlcrxJ7rorge zVkAXZ^e`sJMDMWOY6QKb)XBanteF>E4namg&gi%I_8K&zs@;T!qE;lBH9B z__1mbxMUGGuz$&8h7-JJGJkG6`nR~*HWU;;InoDc?_z*CZ~S$(H^3e$gcOp`pbEMS zhS5Q=<&-=oL=0w*OX5EN*G& zkU0+u*$RQ4l3OOkL>eb&)0RTS@KkI~)XB}WbPi-vqh$+q*nTA*VLN6Lq27(Wq&81u zliE(s>*9n=V~n-olikp8@dU9f1GDw(8Pm|ndO;6Lr6%P`hepYD?GNX}_;rZ~0~{dO zkZNDZjfMp$?XdK$A{IX#yBDuqWf4`az|Ca+t$iAQc~R6=p-NPDypkfVqov;8il!VlY-N%5tKAQHUTE(&_rho)t~Ay zcs7XQRQ!hVA(_irn+zdMVVuWDt+R^bJMda6O1B*Y5O0aBrJ;$jemA-Oc&U!enTI#%$8xC5G-0bG(MV<=84T5iY!s=5fPINUn15f zPn*4FgIh<8YC(r}5r&F&xo{)72Y?7;Jb>=3PSV^8!g^9vBvB0~xOYNr@1a zHE~H#$WghumQDVEQR`_sTQhT#3nG(rr*v2%tJX)AI#jQWgUOT(jqOQ%Ag|2j7RQ!l zPF`3y>7uaESLQ;a6E@`CpZNMvWE;PhOm7ZlkxWT@=i8A5rW| z{9T*^R53GgEN@iZj~@liXLUsWF{_%^U|VUg5NE(I_g2!<6bN%dRL z;wiN=yHz^5dVL=$0MpF*gurXU`Q+0}1*Dy{@w{x|ttz)fgYjq=Q zQAZRCus`)p+beODqY5#+tVMoYFeZe~Gj2WKHnZgjWR@QXhnS4zL8?pea-!JJZ$MWS zr`>vZQneUD*}UY z+`Nx0dr9BhwyNY$*^&`l8B<-^6@PmyE0WfFS)5$0k^&P6`Z7b75G!#}1FeHy)2{(z zskA81dG$fl4_CGX%JUj%6tV1iu%wAPc~msX=VJZhKqjBR4eb7{?r%j8x2TUsV4qaM zU&Luvf&!%Hl*}4YFP^J&B~>W+Oc#`(ebw0J`2QG)ao7Jd5ZmlBh`=DAfxklnPrO-c zfrv~l0S2OR!~YqGAOAHF)x5w7_8_)S^p!}6G^8@sSXn0owCoA>G|8d_mUD_p!QzqE z9q@*F^-nWIeRr-bMcei;n$J=KYK5vMWDOy*D5d@hbw_;u7wYDuGc5%ASLg3btOi*6 z{)({U^9=?Z(_&i(5l$`5mn+jH?7urdq6eEwK!Mo@EHmDifAv>KK&UnEvc)y?7ufg5|wAak8 z45mz>dc6^vj_6IkM^bjj<$$YLoIOpUWIPI;j5Q#P&4W)YjS2Hd2nBqwkDmuO5WF97 z%)+0{0#W59?5_Z%Qots|Va2}-ijZjOV&L+#v{8WPgg&W=G`5PMCT-95FLuv$3O8#L zjeRsob%`BBTl-Yj) zL=(mYn~Vb5?avXeSCS#p9=0f^bFTP}CnwEy(xIA%tNM+&ilC?J>9zeR>vegr2cBDmPnCImrSUPWt$?V4(Wir*M+TiP5Lu1| z!>1EFy>%92aOV`Qi5X#o41l=OIx4a4+M$SIB}RuR~b|B$z#9c?rBvK$S5p| z|98@R`=6v&uxpt#vxhJW8?;&ez5(2rA~i!OaYdwXujnFT#+(7gMS@g+$c_WTz^e%a zJJOCoL|LMJJgi~Nk~sPw0P!d+rxE}lmIWyyXViFNO35$*0K_gF3o7w#h&I;q+T{zAN7t^y;7$V3^`x{!rAtcibr}l2&=%gdWHA1L6Ane@+guMxXu=i=*y~Zk5&epw2i3jI})k+MsfSa8atix0S z#l?(deM`~1NpV|frqZW1_tnKc4n~jlzkx3@p$hS=8{sC!J{=J<3aC2=rWI&mj;Hu6 zMrkN<#@?W-lPuchz$Z?m2L14?g>%A;9`eFpTg2!*LimkZoH&FImk0?r zC#u1PG(5^Vk?x%O;7Av#Nra;&aV~nmF7BSdeV0hBiL( zSv5U$n&e+^V2Pv+%mT>Ru)oe3WKbyU804Ja8>AaBFbzY3QQ`vy1X-)YG7KZSJuj)E zpNTE*$=UiwrkWq_!7NAb$P!?P#ln_tP*f`ermS+NQR(_}><| znB>f`03VfdHxqPIOxBnCHO`Bk3b~+z4!`E7;-qpSz#Cx9C(xONZ&O zc2Ve`tCMWn&_vqqV%l7{ybXp+F%Bo&UNRP5I?m`36h+ek9o)I1+)n+B*PyMV}ge zBo5d}Y>=S}2(XnfMpd`VnR>oDJGz1)x`KbU7(llfG_X`gd@MLn)keG@iZ#gkIi87s z;grVK>}F%0d@FU{4mcmU)cz^XpdBGC$R=Nb1RC)2R4#3Etjz5c5#n04r>-5wBQTG* z;RtY#*LR!Y3mgzlP^ensFVOe&$A1`alHo>#|BIE(UcWL_agNif0P-Vs-E%LRH1bhKIn4kP5`)y2I3A%DvRBKR=e6u;Q^|()=aBE*D zhu^%j{&~|Br>hRS{MNlK%*T+y3$<#!%`_hou-(B@Z?d|wblmOJdcDq7Ex*F@)zp4sbh;da^pseL)iDdN$OTvDKkp@IF|ym=-29wC=VN>x|}V`hjW*hY>p`rH;g2 zO3Z{_l0|(2IBoONXB9n&kqZ2k2y}t;vH*8naf+-lSxWOdfVDhbB=Y{?Ip6r!cH4tO z8RbOP#-0I_i5q%I++GcqoNW9O%iFkEPc>lf>`zd-WCOGE9g%VONN4x)&{*hijd+o)y1b4CKX)7i~; zOI=wF=^_fgSSXa#g|XOm$ZX0(%_|zJr6lSdtRGxY#dhULI1mDqK8RDBymCqtLFRE4 zkNWu3TNw^ATmp8Pm$W%VIu^GG9?MXx4q|jJu*C+_FnfrTaht~c>=!A$8BuJ@Uruaz zUlmCrp(~_lCl%W<1<81Lzm?s=5z*FCfB#41EZV#x|9%AT0t#-$yzH?_(vl{)y>&LQ z$0TzS*EB(ix9Qh1$&a6|l#rXofwmT7PnjT>d(aiJ@rfU-q#4}CTvL`Sij*)=HK~j> zdI=m$mSh%T2ri6*Z_Is!yReGb4KP{2>NTuO_pH?3FmN2e2LGcPSGh<-(!;8Jx=nxMw4n+c15sODtD5q#{$cJ+lC1 z)`TJbnKOX9ilEMETB8zcu!ovBjPmnwMI&YX9<&PwE`#4)p+| z5>YtBsDB;A+LeV!jradJh&9KxdB1mp5tbW=e_Ap*8-OzZ;~=gky1|^+0EG=2Yq3p( zt+cSwGnq@FinAI_!{ZhwoiCqE$7ZKX=*pkhf6(GLx=E=noyfdyL&mMYk^NfzQ)Zma z7oP^;+bvyhZoA;vQfJ3?T#+0b-0Z)cw0XfiCDgHSRO_NavBA+>k!-Y zLd}upr@A!gV0p5-T*Vu`Qyzl@MW+^WB}$U50>`1n`Pp@>rZc$LDjUj9{|}_-5czGL zxhd4%q~yvE#vVaLlXKxv;7?)aD$>fNx%mx6mGIn+9l;!&MDD|E?%y&rh^^XKQvGuA z3&Sw*0Y51o()Ne%Q%>!CKgq=O`s^qP#J_Ve%j)4LsLJ~K@dlvq449x<3u@x{m@;eR z^m1$)OU(@6y~upStJg?lNGM}&eUoMH=vufO_A1GhNpBv4wlvAortybNS=f{%EylL? zCg^^(6fO{3&&xQ34Hx~~M(0Y?**=7;QP$GjdWEYYhM61(wyH38KW8L*`7NfTIPDxg z!Fbi7l&TUMQ=)mbvn1C5JnG|ZS*A;OM=DeXnQT|t85fJ2nQ0X?Qu?^XANX{8_+%sY z*wk@dqVWKDG%rcYB5lu zw@hiZuKrP)JZqYBK|c7(xaKv45`m%^T4#rS+fRjqEi56OMvYxN(6IqXN~`MnT%(7_ z(yJjP2mcx?bzrORTf|78X8S{c2e%S$^EvDCpErWNPTYYIa!q@+L8?QUYb#jL-!H&2 zl9(aEHa~u*rCydVZXvW+^<%$}WF6HBXz}|0$@<}OogXD$Ovj9IXB5UD*FG4^biIhs zhOxnZh(oObZ*iO6O~^X>U5ZS3CLW=s{$t2yuYa&#`kG}9*MAdlM8b?H;#L6sF5$(6z(Anc7<;vE2NT@8&t29xVXq>m%F{1v^ORT*RE%)1Ekw21HnbYsmVx* zvIbB+x|F%VEm<^8>k$v@@2Utcf>k3-14nJrEm?*@*!N>Fq!8SorJ*Lo){OdU^Z%nH z-XJ^8xC~=B(%<}}B-&87pz%!oqa<#|I;wz$!OAZDrXQf(h|8S@2HzSRLOU`z7}Xv) z&jJe1SgIY23!tkAJFUqbOFeF(C>0~+a1L}GJq2D^w30#CIDNhQyXnxGf>b)iaa1E5 z>wDZ+g7UE6H;F`|>MPxdMvG>XB}}g;+ddhhtUHyt#}Y!694W5%b_e*eQ5a9OUkQz2 zK$_eDr{)84Bbh!~V%ar_QZB;f`_5q4?jm9wm}p`i&1#czR1uBLiH>aksIM>qQbhtzm>j?T&ih-wBmHQ*`+Ebp-Z(je2%Mf}x-J^DgFS>r=~L%?*P7$y?jeNl=-pp^3!UFif*+hUbC=MUF4 zM75CE|JO)lnq6F#pn1SXM}_}8GSstX{xagZZ{3GqUL8{Ma(sjwML|u~lNGkKF8fPh zdVj-~toe^Mcs%LdExt$2C%l2&?YBzUR=+M$yT697pC*w&&x;`@CzGp31+txB$CV-3 zWpCpVb}b)>+R(w;C+`+57nRFFNsy=xm!7p+Z~3GRS$h=j-+%(k5l!FkW-*6ge>KaU-p>Q)69Io)$waM7`@OAymj)D7(mCq!M4bmw)iC#Q>C|N zV-06VB1WXf-Z`ua&Vs4vxb*f?9u@rvtA-iD+c{_nO$QhL{&W5!4eigRYF)`*vfd3G zCX4P(Wa(9DCZTYEPG!2dxL{>6)d|x@ib7Q#JXfH)51fH^qpav?F1g#w5nc~TT1wd> z*#*|fedN=pj)M|L>&5c0D zf>nFYS%G4{2kVd#GBWosX2;6!zK_hfQ%(|&6H8T;chP0DL5I0in~mw!g}(qY3Nf*M zfU}zKav0+;FVydO>mEYxLD2-=;I1l7tj*!r#n-Buq0ojYnK5sSN_ct}!aZWFTjQj| z)p$a8DRqt?G%Vb_%n*QTns%ryzhav*Wno~_SYY+Gl!Vk*JENRUm@4H0_ON#-X-ubA4+41Q&ngw z(IDa4Z#&y>qXNBu7}%-sqS~wbp$Qe;{}v_e6(r7(we=?w17q^(@8@{&4@wA_C4Y?qbi^SdtrNlh9|wlgD_p-sVkcgz0kIYKp-{MDIgLy|+cfww)n*ZH_$d8iAU3_NbiLJ|Q zX8C5cgx<>I8t*Sxu)9z8t=Mb540WxWD?)9(SwPxgZKV=5CGRx4+o*Q&j}A{bk*B!K zSMR(jl6H!p2xY;tZvhrA$PGK}en?=5VA7%D$OQa4 z={MPF0U}!nsLXnS@291@}|-ap}MH>Wm7| z+tp0Wt8v*BJQaACfA7Ujml}x8F9&T#op9qtX zr^#ffevOlMc}X9&3DbMjs6L6BvL!h1n{}a0GUoED{T#lTer&-5*JH->n%K_T@P|iV zcvlwf&7&buHDuhSvWyJXWQvYhaoBv_SxIQQ-fCcL_SlTCsZ40GUw@kKydW8y1ENX1 z%f-wzrU!@Yf`2$ZXDeqas=qfDB3(4y8YWH0v7|u^YW!o7Lpz6 zrCS`{w_BS{K69XaM(O+9FqBc3nwdcmwB0XBybAMc*ffpEz^$6tCwl-+Y2wKC(5-w& z9Oo2;i51uc%^W<|%Dh>Z`o$wb;_S6;_@_HvH#s=(z6hoOzwwTfUyd@gZe3aj{vCyn zeGWgA^M9N~@mPS9*!({y5$=p={2>ve&zr;9h#HH~RS&k^lo)>3bKKKZ@QaxML5h}+ z2JUW!QMul3EOCfe$J=Tg)qFb~L-dT?^>2<}Ef({|!XWt+ygg`3g*o9v&8e~Ch1n-g zbsDGLu<+3-!zQsyMzNUBvASL^_9$82_-3=rb9?>;HF{hl?)wo%vq}NaB4R*R0oplx zy59AD%F$$kTYDhuA4X08_TucJCn`m;V8nOtg!#UReCn=}r=J-Dx>Mnus|jBxZzo{L zrO2`=Lhj;6wnIG7y1Po2lm;WMWQl6Z<=&2EBNkd~ngh4>?AlZKwG9`Q$17`z_FDPQ z8l$=r!V-!!kJIUTbXK4`wDUe<#o6q0ntMBHZqq;QXmn^G#T{vWZRcucd!2rKgeX4l zKS?YP0)9q_ad|#Vu4#N(3R~(6Zp7kyb^AL)F?qmY#cVXUG)LGmR5H*r8Z|#3a9q{- zLR?A>e}kZ1{i}Vk%0}->Ndrz%!*wEFBajIoCGy4sq{Lat7|n5*JZu)1Dw{tr#3W@h z4Q*{sF#B`@oyWW#O1Qn9y#C8sd3u?7Wnkt06xAHA!VW|cC98CECkUoZhGA3H59$cp z#6VE$^B2~=zf7J%cGhl>5}M^!#Rs;Ci9-I0A0AoUGdgz|z$Q77yCSodu{?g5I6I_o z``jCM(Xtc;J&Fu$5?3C{|59pC?W+TKVynPkq{Cbr*>yshb>kXXj-p zaQr&)0+Y{hB8#!80}z8m2$YX%_$?^$J#|pRVv)70k=1sm#OF}kNN1LD3`KwHgZ4N* zuSL8<+>2Rr$+RIV;2Vz;d7h>nCX^xr{-McG%#p=o%7k1a*Vfk{ry+JKc2&NxMv)$) z5~KsLlT2fdH@*Eb1l*1l<1|mWfXJxBCo-fa;jTpMIFe^N z<{NEao>dzlu(Lic=sJNL3ldvRv7gEchItM}*S4nwhXnFjd_Bi2CbCyXgVY)Y=aZBFoF(fE zTP6ydD^jqvbq>nL}KfPm5onsC^QylsY47N|AfF>qLXjF)_ zr=k90pfTmU-L7sJB8z~JfLLDx^oAt#0rija_oIkax-%7zcDTq~X+-wIfPb2UT8j+7;<4K}RB~!n?*%)m zQsdMOq>lxt+a8TpWH2mbx|ScQUx{LbDh&)V&cK?)shI&^y{0e_Ylfx`fe4S$eN{LY|Y@1>py^l=RuCY;$EK|R*WNq6Wm|| zhhh|p;*Qyb4nm7~t=GJPXqOu7A7QijlEclZw7v;`AtEKI@-bc~GA~53Cp(f#zQ-y0 z>nNQ6E!u%EQJuLV3A((TOOw#2<=8~L*Ju1`wh%UD8C2b(AHxS10bny=VoO}1GDDs( zFOs;CKGnIyFP=LHEdMOW8@>0`LFS5&H5d-y0_UTCaz7(!QeVLCUlIQ77Mnj`c)vJK zaJS!l^=EjCRH4ABj&DY1M#d1|RnDJOPh71K{sDQqi(GI4Yo5D{C)p8i+Mi(e!g|zC z_T!IK0C1XnK84?`$4i@q@4`hJh2ETF-s3UwZsW=*VGVG8VoP7-ob;rlwG<$m3lnFb#v6ZG<9|qt z^013>6wMtY-;Q_-7}zB^T*B4qJQF8+*U3fc98Fw7-g+XO%>UTOuNO5| zv5=r$`IPUf0!>-nYhCSzCB|>^__hRd+f(yL0+qBCy_?yI%j?G9uAo~%jj7pW;eKh= zE!&8Bb9dP3JOV|#SVKy95}B{{eN-^B)4I*|IVb8#=JamR+ zhuXTQm``HjPWg^g1VJpSk*j(25TqK~cdm+-sJ;#;_N?ZKx7kX5uoJbxz($wVPWw}P@|JEjR?Ode+e4U2tRcd5YnX@V&wWSpSWLo9e-`# zb%`=>X>6(Dt77aJ83%IgPUbhr(_km|=>xg{mF#Ho;!uQYoHzd%yu=C)=M~vb4Zs^t zAoG*mD4WgxH5m%L9abeOJv{7!P|x%rykQGj>Dp9vq)A87Ajd`}Zforhl)}bT2+fIW z?IdLi#}RCNwF3d;%$UCvXNJebHTPOsFoNPJ>eLM%hqZFMK_U6{QRT|u>0$LF3PP>| zwdqzE?j@A__S|TJSo!&XLTnYY`AP=uV~sD%Jnm(G;vvb%60X(B2XUcWDPLNW5K#pb z7A^bL?yvgUOF+@|1#f8aC|nz$W5{PnbcH}#;->*8`EH6S;#({(H?e@I5?j*D%LM?Bi9xVIn#3}rc9Z8zRsu6gf*v_{9_sP`O7uV`l6(M3f`SsEW zMPSDkOcs^8)R~Pm%^czs=~45rt4R%i3X`Q&N%9ZcKjnUd33n>sP3)u5D8ypz$nEMP z)z)yH`G}qWpmSbjICOI1bc>zW>#k15s-^jReC&Sj zRJG8bDVVKYTupRhzg?_qGyk z?Yx6klb--USKnLcZ@sF&SOnvqHM?DGA9!P9Ck09ZVOWO2N?APpjGsP0D#*%m zmfsq@1#H3lxOSm)r5JE~%m%(=+J)Q6}gqGhGQ5itSM{8r)$?g{sSqK+lEh>4~t$er)cKJ(#BN@|%arYJ!P9r_& zz$L+_AY)M}KLa?>{gppgdoB4y#rp)p09FrH>PC*vl8lnw1?sJX^$@;=x3m9dk!nG; zOpF2WPZP_2OU_*4grL3~Sd|gvs&JsT+GyDgR{5w@ zL{#h$hjcfSG*A|M<=2IZu&T$sKe4|S5?XCXqbPoy?5mD1pLOn{)zFe(oY2$pwGDEB zkSRE}AW0wR3QU&GXL+~yzSe7JNk}l2*VSn}mI}y{avLQ*sDuYXf@G37ZqQCHhYeJ1 zlJgRdn&|YyV4Bh(HS*hKYTR4l&USpM4<>iz+m+V$dE) zkz$U&5Z9sFdFb1)cE5ui6uaMGI4FIy+le*%z_znL4w{*^rK)@BwAPO~N}G&99Hc7c zL2(y~)lEu>U zf`BR^!mgBI49GM69p6XH$!Du+Sk&)G*s+D5&O!0K0w9`^FL(3*TsR0DMi=($~Kp%I#iO3|aNW_;; zqoI0$v}|1^WW&C3N*&1PgHn&K6da`C9B^_Rwlbif?qUyZfXm%lK|W&R_s_DpFxT+ATqYk}58vmLDz;s%fj+Fb^>_>~_-P34~@yjb)CcQwi9+LIR-2%QQfS z-GJcHS1m!VmqEvg;6q#45i~a@GwFtGklF)<%!=d7=O&O*|PBo(gm(|n(a>X*V%u3Sh4pNc)F6v zD*soSQs?o%_Bq(NYjt4QBe)*%U6WKxocg={<7%Vc$HDr2CGY0uld$ z!wGoGrp89)V(2bJkLQoVa%E_MZ_{eIq6ElTg6{~pJFR3;KvI5O$TjklVpEw`X;a1< z9rUwW7xWs9uMW|JQPm{*~6LjI%|WSAgZL};EktSaKb*YJ$pa(miCU)2mPGxlD|6C zl~bz9uViV-2#jZyg4W8uDLX5ZoYRwD?Uw_cj1mkw7enlE{12JPv1(`FbQ8yl{_(r! z8>v+$DlRSZaZw9vu7gSQ$Tv|#s1-FPBwwvOz@rd?B{0}i<&{*mbfMtrDs7BJ6h=#5 zElNK#CNR%Q!g9@(o*ZM6Lx#IYQ3Dk2r=Fg5S4YU8KB7LYcaC8YH`IkZk;Jf^{V5ef zv=h{wE0o7l46`E`A*3U81G}MS*toTKShpm!HStI)K*{C^kgF17T@c5sLjOv)Sw~S1 z?h4R4eN^7qGbk$@Lv9jMtD~Gd7c}3g~T++P&90|Kk!H3tM6zZJlNcb`SgX~ z-Z)c5&QMJJzGauD*cHjbXbVO+vP&ggj?sqz$f}!?Z{fuL%LSVpT zkfIb2ZIfI>leh#Uu}swnyauE-5-~OF7+oOFiSG3SCgVYrA&`L)xrE63Nb%1814=IAnsIQE%)w$RT9KaoD0Gwugz!!f$MG zyu-PCWDw(E_WAcQTDSP8-PioW3_OqA&9+0k*}IVM5q1728Sh_T#YL6R>mO0zp7iV;%S-~zGYSxOHjSzIE&qJtAYa2k;HU}38& zL!i7#pOyhM%bY#^V@l&bk=oXAs{5^Ij=u{XvIGXS+g3+rHeU02R1tQ)6%-`1f*LT# zQ!=7IfBWVEb^pyFx-aW6i|Jy8At8P)Bx<1OVC9U-nyH}S=?nq?6%@XE(OH<3j?mMP zE=@r9YB+WY9W=`Vq1}RLPBUs4vS~x0*}wp*VmW67J51TQUAm@dd4K$bu#ljuSKP9 z&{oa1D;G+ba+aH!!a#0?8p2YQut$&f+0X7|g=fJQsWm@dlJ>-pWNQASy(%yOf3oZd zHcUdf$4A{v_ME1`gA)WhR}_xcEcbdu-SjK=(J1JQx+ac>bs`18@Sc}aifiY6rg#4ujk>+}=OpcW7gj=kPN;nd&md1*v} zO?CP3m3YYAv_QYGb}tbUbFpi|y@b*|5SlfpQYgw9Tjgbjt3&x}Hi;QX0(@UGWNhYR zA*O>Fc2IjpUeEf~E52dp21> zD%&}A4FDmN$w(06)ugK)hrxb=woKS>Ygds-y?670pe*o?byxy_P1+`uE~{h2Y=n1_ z)-28j&8-L9@SEwl+RKvP8S}Bx0AnTM5bS0}L%39gW+8LB2-yEbl(-c`J|Hx>O9t_^ zja@*)vaDcAy0GUQE}7g#TF!Q!%tQ@pktCl|agR0^=%o`{ibbnirzHX(@ zdrC9kTZq!MNa*)=df{a?-r<8ItMvgc;X%NONF>HX!N~ zgc^inM-itnv>5g+e15ve;YH*mFF!B#6wuWX!IJ=DwjJ2gpoGVfCLBPFY*~_qU^qpL zINqIS(j|g{H3mz*R5Yl(SR<=YnnpJ!JBPEpE@*orBjcOR-xeSCJ_M2IZK~ZL5uUOC8t9jv zxAyy8kqcnO=g#LpKR3@r`OFkMoouBYMY&iCGW+5HMc`Wa`kx&&m+Gj@ocT2070Vzd z@XYq0*ZXW3RZ_MnOQoKQv6ij?jx#U+gzV>hI1sWifo`B9Pfw-Fx;|##r+|pn{NRW zLP9mpx1@$$T&)a7tbo?&%-0959#3F%PyP*H2!^L^E;chH0Ou3%>Up~Y=)?h7D9l4y z`K7X63==|skWkj@GJoAm<{0-aKMTOKBT+BQ z0(HU)_CtfdWZKTfO1fEzna%d=ef0(I`czovI~@4icappWgE621XJ^))Zl2T!P+#ifUfL`-U&*6AQPi@{Kbs@QaRUQ5~{yDcGTIM(R^j z;1|^t5LIH6%dz2E;9}^V!|RA;coT`ZR$~5+rhgLNA(rSap>bxtHNMhAaw_I5Z6Apq zndwvGE$RWs6g_Xg3YIK}mqM_(2qD)9{VW_;63k}_us6C!7?aH}us}gtHwg_GpqBG0 zm8h6aPE%7*Ogvt2XAqvI9}`8fX|S9l5mg|Id5;6`Gr zFltFXD1xV#JVKysdfv^Uo^}9Tf%;w<`kC98LZMY~7#7L=K65+95AR)8r2DepwiC2W5Rvq+ye4rvFj_$o{Rl46iy0RcnrkPV^R#?=s^4l*O@qTqFp2D%S%6#0UX^pZr7Qb6qhQO>9 z0;#Y%3q7HftP?&#nbEcwmxj%L!N-`%VIiiP$eNcn>v1Kmbp9tfCw7Lqe{*cx#&kRw z@L>8J@w9LwPjlw@A=R}k&AG->q7MTQi2VeoGB3+ zV=Nmjb4iwEX7#l4B~coURoO`_z?KGGjkZWKa!suoeNpj{eST@$J@X7R&XR0n({qPW z`a0$+t&BwLGUa4vqHS_%@zNxr3^j#?=u^p>zXwiB_s~U}*wCT(cQ~;z&IpL0f)_QT ztiI51A3z>=iI1Uqye`^$vJ_q(fS-{A6uTO{e} z=5&^c z8G_r=wcUZL*#&%(DzKI+hm#;H(M2aopGosNEDeo_=f9n$vxnzcgRaK+uB~AXZxGZz=ynD9w?=*Of{lI3nI(Q+U7B|Ds_vU&a-AS z!*ObVHVAr}wcvTZy;xnWcBVT(+~@CLv1W`P+Gd%5BF5?ppoTs%R)tC~l~QcdwN^y` zla1jv&18j~!b%n?U>Z;5lQT`8NU?cuO@U>EA0DIhcrq(1LY}fo4_yej?4Rn$tMe1t3>}^dna42UyeXb{gY9Ao4tei z?pHHoGJt$A)bvx(&dT3~nQR3-(Pmkm>OT*tP@|}%U=9EFV?yl>$4c(ZtXA|AxHp4$ z!PZIdED1umYqq}~8QjZlHv-h$c&l@IDYlhbOyA%CWMnd|qmAz7Q~}1DN?BKBLG|nD z7k0whUYy`M)8wBddU2qXXnVc(z6Y>ys5&$3t);%{z31HA;R~S;$wtt7rp-{YXePPy zE8*Na)`eo92mhJ)LkNp{%y%LT9B^EqV>Lw-Y4gx=IHc?VqdS_ z+M&-AZKtOXfR{SHnOsdCG>eUX#~Y3C>qb%h!D=uTST&uR#uAZ-lBdw`)wnsw-H2M# z%b+R%frcI?K2<2@?pJf!TYFr8W|avA&25Cm5l{hE$YdLoQrEPlCZJfD@wz+M!&Yr% zm&%?Zb(xSG-KC zF50jgtrIC0#l&jBmrYu7bE?r6H|D+dH474#`%h*D5B}yfFhcgvi9&rq_Zj0x=ank36w~5+kC{Ec%{{!Frc&3EJE+{qP zPH=AFiGrMsDS@OoM(_$l^=&~NKowQ8D$lZere5?Zm6dVHf_xD)_tQr+7~^C#*jz}* zbfuF7NLP`m_sBl9k&(#_J%ubut%KWE%TYw9FtrykI>;H769-tnFTLsS2o3W{9P%qY z`jt77GfP*m8K-q>hgH&>Jkd|~*>pcnybpKpLTBR!v`E0pQluGV1XlqH5L@2usv_5e zl7ZWeLabrYQ&x7~@Z=vW0VsZsh)tgXvlwK-X%Pj|b#a0dE!x&-!BnCd*xVKCWWcOf z#99Lz?uo+Gj4SU4JRNj4{*`Itje3C{i8Cwt%gM`wV1*IBD7mK+GaNL*gB5o!L2R{ouCHddLlSZxPbJ1! zU}9c^o$#yvW#916?1Jm%hnAjvTAx?7|HHn4iHCvSY#p^=t68nV9b{^y=j!(I_Bz`r z6hzhF@&@0r!1sEcqBt4inIglSOg;5>O{v(}OcGE6+-ykA0s1C>(=+ zxWv5Z&I5(jZkz$VEJxO+;%??^0C{wR`))Ssa=6ZpK?Vh{z0WjfHgL!4gps!odPcs9 z(?!d2z#V_ziyW|)qj0gL4}<;#EyTPF zab^AMq0gO&@FP0{LbKmka}6mGuDlPfA7vxy?R5fqCk$j&l(;S9zC&dR0hIF}t~+$G z3;%go7Vsidm@X@WNHIR+5%3)$L16iQ(ktIJla;slp@8UdY#ugsQ)D>1ziFHbv|KKvSpXiymQx|% ze5l6O?^~%tH$WFxdv>4?yl%?nCo`lP9?~iZe8e!z|=k@@`EA;b#VEPvO0rK7TsO?wUZ4TJ;K>Cy~RM`Kqw z(FoPZ#~+wbB7g}>`oqq!{gtUXdgL!^#{caf)Xb)bjgSr|w+!7(Pdt$%imIl@%MCHo zN43nj%z@Ou(_=?PbS;% zB_r1t=mW9Efy?yoa>PrWpCZgm16u0c_)!f2=Gyv1UkQH5+4@T-QhoVKc=5?sm5eXU zv*&=}-eN5)PveW(OuPL^V*d&!6-M4bvY>W{fcA}VYe}WR$5pU~-yLeuGr*@X;!+nY z;2(?enH}`89)>t?L;Ok~pGmL=TkNSFamuq&lzAiWH02T(A88A5Xoo0Kf+qx_cz6s! zY5pl6`#+p)o+XH4{dZqgIvwYuu41eq!LKOj1w3BNtvQmo3r649x z{J}+)CVF2-_m9Tmb~?8&frg6Mj7RGrXm@ijN$t7YgnMw(1JO)+TEj5X3GKAr3F0)@ zoW%~WI!X|)mne|D6~OSBi^k^W=!0gA&6TUE49%@i)WgL}l7LS(MAF3yDrEA&-GHK_ z5(RC#S4Q}OL^9Oe*>A5yr;AHQq%HE9jz*`8Xib||$4yvK^8IkLCcFDWnwMa~P-l9q zA$pZM3#of4Cj#CYz9=iL2&2G`PdkZ5R+_wnTq0P;NuE%jl@|l`nBKdW+)CiA+x+## zaSdLnou4Ka&#;6oh&#fsTCYmv-b@kWZvOy0(iJZ*!^e;iu~EmNm55P)mcg7CPK$*U zqf^pE8eY5>W+tdElW-FhRP;f?<@T?=D|EDqYsgnYRxgm|!xTSzdVv@%3F~}_a#ARJ z5M;kPBTChUWegAe7dIo3X3l)&WE!z0n977V9FQOjWNJKFG?312u+nRdFRAQqOdH3K zno)IreMye)HNHK?EQ24=aHMdRg-0K?{1E~0C+>oYv{|({=t6L;2~K;6w);Mg!2{FE zXhvWtS__LiU6NI&kEUS8~_PqHx7S9SDbCVsGt##(I{eoes9 zyQR`$6X8i_Xfu^*`4TSlQxDI?)*62qNKcz^%8;-ylPX$H4p6Iygi6r>uuumToY~cI zy1nTeAu?Hq?Rf{7zkV%`i;e7RPtcI$%s&K~vhWjJc)G1h>i(A>cFF*{QDUuSN!sx~ z36BPU`7vko`ziFf1`Yvo&kUe9R7&Q5Dp1rvT<$uW)r;_ODXVoO1ua>c1etH^2_>qw zvg~wpOzvXGF`LX(iXc@;LePYYOcC!i!0bj6ihmW|kM_l;w)aeGvskyza$;!>9*|IZ zB#)U3bE)8^Q$>j$1Pka=uUsuh1TOyS_${jWB&vB_P6Jg#L9{E;Aoczl388w%%GMxjm|`} zaF|j&LYvfj?=CFco1t z=&6juxBA*8^glFSjFm@>b$jrdHM#Nz2S@8+h_6fLnK2a0!%~mTW=>TYE@I|5Z&{_d za04H*Sl=UgH9fm+V&8Z&^ds|TD$GzjzM{8Xi*)d@5ZX0HPeIjqjI>_pWdh zjKMS%VJ4ux5SGd=`7a{8y-6>qP$-g z%=C{;0lfSS5&6EuvG(vMfNp}v-IUPB`~@0Z1!sijS_07iYos~31=whszQoHE2t#%F zWQhL}$T2lJwl_HxWX7araf=tCE-$RNo#GU9+ra@5%j~Q;$F-6?j6uP(gkNe_8K1jv`yzF|LV284zTAbGx&8y0I z26SyIp;DVH9>7pJ0+*7-9OVR|lFDi6Dznr$9E9Zt{|b&9Yz1GyL*@EEl--`ire<;>1q< zQ-1uVy@}OdWBwD}9S8Ala2And`&;qCrj8DD0c9u`_8{bNJ+T({FH}Oags!BwOYp@3 zv-BBqWMgwW?J336!+sHYMvN;|z1HYb8e1p_5`oX03?j>@Du-&1a``-+tv24z%ladV z7{$qzu_6UwgsvaU!+2g*>{S~iy6uCMPxF6TJQ@{!b#(5c!dTNWe_o5;E_a12&1qIg zG|f^CWq3$1F+6vy)>V2!_0;GLw`h>zED~epCNu`1U+_;sXKW5wG!s?SmZ+pMvSe}+ z-bMEI*7s0Q!*K0QKwts6yt^a)?#2JE-4l^`ZnRGuBs!(IbrKpMPRwQl!Ej zH@+}qz?b3Wtu(wsN_JrhggTh%)}#UZz!lqf&r5djLtV}$JCZNG+OSiGdyc;&Gqr*u zhCRE{E~m@XvlXBykUEp=ZpG>vKb^7l#oAuI zJ1FaDyc#DJv-%&_-ofg^C=Qs=8`BlWP?vu)OLnG^jC&zip`{tK%$}7i#j&k&tqJ)$ z`c^ms6cSxUP~3`kH9$BwQPEuxYExWAI{e}d|5d`i#=C{ZMZ|M_?Ov<2Ey&sH9lDAe zk@tMoFv_zX&zD6vkqL=No%J!Wg-#gJ{BT?JJ?)D&?q+^4xO;`3|B&Oer=RNvSm~Mx z=bCA^?WdHm{}|ZJa(qLr`QSTP%~o8TOuhLt`;t0Pt|`0+ z;<*1>8K>q4qNSLJ)N1@W$ow3g_djPwwnXOmi^Ce|$1?2p(ryaUf!rD>nImXdy- zz+wG2*&<5Wmk4uO#Y3rhBR22TEc}CjiX+uQetW!n{6Y>N)faHmFDWwotrFzQ))xWW zQ82%c5WwB<-s$|SB?^n+_mOTJ0!rV8PNSZFUNUgz5Sm=OI@Ik|r%O-ny;VQ_eK`#= zbG=2s*)AM}(3_dXtOY1o^1n(wI20d}`-qNQQCsr{U=rMLRG8#D4d|cAPGCTx0gBXtzHyNO;|cOPblbTpZsP|bp|WlqU0it& z;CTCEN-TqVHEcZW@aQ|YDkeot+!@uvLmbR1T@#$`ZVvbb=p!f&W@!JC{rc~0^a*Fd zjfnPIet&deCTGDs`g;}dyqw?4YIAdQ{=e82e;ri9J7HI#=3znxX_J-a-q3RNi2oJ1 z@*#n6H6D%cH5oSSG44z5uUTs6F@Nn}cB6my!9PdAe$u3tKNa+BWj@=y=HZnSfiO8z{sH2VsHbnp z*Z7G>B~*NGSEIGYpSobiqFW{`Hd6HZX&Z#dQwBpRS~w5mcP(fvlrrpD+A9z=kQ=uC z6}_LGDVM49aKj%XkZ_=a605fLCCbC2-B$AUZ+b;@t{78F8RI9=oq;D}PLadsVjk{-83HtKrqf1qf{4Vn&VFRn_{|)C^f8Ng~7f z1v6)!lg6}oA${p5P5cVmwu9?m-Qb4rdc28|nA6w*7^=FqAbVd*3d?X^WF?qNBSuQx zr<(C`nDWY|doEYfSN|`5dz2Aq87g z1@W8lhjt9w$ys9X8Wgh*$t-A*fBtsj%oGYdjaEk4)*Bee>eAA@jls>A0cO#O`Z;)?W(MxY!H@XQ3!2|L+RD z$co+h!j<$->YNOADFh&d6+-S(TWa!>TUYrW+ajL$>OGa!-9I0J=mk zD1H@Q#LF`^2xj#E0$A=~{~Lfs=)VP6{3iGf5T;TG`R>vtKktO4Z(>ju2VbXVx_#FZ zC4hrsC0~wNTb$kKQ>XWH33TKH1fhXbYtjDJ8UkHYaaHgFUrYg}rXy2s)`@Gi) zqSJR;XNd$3Rvhl5L%El-Fqxz9dPr0a+!Sq-?_`Q${U%XBptN8#_I2d~93{u#5FY5k!>_KElIEXa!ZfqgJO( z6D%R3g&x4C%;z3?%Olc@25Fl% z>QyAvmOYgm{;*!Q{CVmbZ}~XSgTZb#XZ>mtQh;GCb|6U=NH%TWN?SfIQDi-mH*M6Q zG<~{wpxD5Zz~#sYj65u+P`-6qdUS~WA;v^h_(G>cMr24Z%%TXMY0xdNA8`+#Ws9Xj#ds-Dj{Q}<$ee8=!WTCT|M8_!fgusOmWLiMeZs55O-IDh=WU0B)EcObz|F0 z#%v+q*RA&-2+RJo!c6obeIthIiy}#(7Kl9m_-{Au8r6-KYa=Kc72r?SHqYUXGTCsE zdA)ZPWLuNf4BDmW>jlx{>L@Cj@8!*Icy-*Z1`tlup=aTMcXs^wU+L^7Unh}X##|u$D&o{gAwJaeO43Zk=OaAjVzV|A3h z%>87GOJ>CB+-i@Llr)~VA53PwuF(!JE=^||9jDq}SD0RY1}=Pn zkptjvm7~oJF(EXFMNAT+QRiBbm@q*f8I)d;H%u@MuxLLodv z6V;;{MJ`HvmgV*-#Z2|VwQhz^{gASu)BF2RlYg_a>Kd{L+qn7Fe;!I<|OIG#h&Sq=x&lN6H8ea zDU~DYii-G2>flJGhAI{k4P@x#!N9_~ac~M^>xbVbp9YT$6oh$;P3R%dFFdO%bmc-j z!Qm=#u}{reHBIXJHn@HCF!5b)@*qVc{>~`VnHGqM+|(DueP=eeg1Q2(xEv)DkP`Oo zboI>IzHR}`R;<@4V)9T{Gn3>GnE>NOcS22oU!?1q}q-(pJe!02iZIvo+XUiKJ6Pd zPwR}SfyL+a?g=;A;#Kq7qY%lw%lY1nWR`sNNpVyFfJ$ty++i*lfl9`c^}QB?f71(33 zMU+JC(m7&=m2m(Z2q4!M!%3a+W1oW}i)m4eR!_r5dj^x>?e!3nsg7n<%#fQRB#Cs< zDxbjC5@_Rv79Dl~Pd9ege4mwV)_>?(%#V)<&ERv_H{`%lk4fyS%MRy5!N9=ABnWL8 z<2C&eEWM4*g0k8I-IM6c+ezmQ>+3=L38*W-4VHUB1!AM`-+@DJ-zyh+3TBc z6+}$zONn7e1@l&>C6;xThFyRcl^xbgt-Tdc0akgq=YgnaggFFhBz>CcYP8m)tJwx5 zWKoCX8h8_?as=9&Y#H0BLhrlm`Qd;LM|<47(R!k{=EA$k_k*vLT%Mu~f)k8Nb&aD? z3oIi>ur`_qTW3ppkZ1;nUg${LACVaj(b4e0SZ^j=5}e~#cZ6niAYGApd}w#JtH@E;-T?}8vO(3#G!L&b$oK| zejOOIpz;22QSw(OzuFLMr+>@Ikt%Ge;A-w_!%P#*Biv5J2SNHmY8{&WG8@gwcEfN# zr5p<(;|6V0)wjr?gEv(9kSV_94Nn}X7ze(?ZB%3GLZg9GRftO#ZJp0k3bm_5@QhV@ zfS6t=Th34gOAkS>TtrCqiU*!$ldUGdz#?6yS}ZS{_Us>`7<^%082F4zf(=G|sjwYi z5Sn7-;?$2O=46%MdHHW7mc=^u@EkQ_{X3`-?gieB!~{?TeoXlTn2tD>!0eiwN{;~6fJUrmz?|9TB_V!BHrows zciS_(&Oo;tK74{6YW2MF7)J9)aZA$T*??WP>iY`*TyDZ}8jCMxM|LcPQAuP)oU}C| zar?3u@WrA>az48HOQ`EfR$}+8MFGv26KBAQ{gjBfu|MJt`GCM1AbyHQDP%-T(ON<) zkNM9|tCU)XS6ok%-FFKk)FF1qD63HJF z#HM!E@Lj{pzt?T$KubOTIDEadv#T>BK4tCCI8h2#PFSp}=abhcfyF*7`TS`0(8*}W zB`QlB2sV$R>qUkX^jV#655LXR&Lqr~&geRm==vOH3JF-A14?BkLQ3oh6kSf^Ke;p)pdf$F8;F~kfS2Y z2;dKU`o9sDxS)R^ELm(@(tNt(Xbd$Bu^0+~Vn(VAl~oewDVK-C9A!cJS{M2YJsE>Y z{7fWz7Y`ym9TP*f5JgG^YxUA`yG2l}m-TYSCvwm!;wy^KwA)o6Evm0!r*a^7(J%|8 zbscp07BP%I(|ua~SR_cy*f7_#qcrFqQgqgZplqSNiK3cH9ze0l;9`CXH+zhLU0r;X zS0=5l+IB>6;-W;1$JS27P%W@JD39fl3ujPpig-+u|Ru$hTis$1aJiZj|E(IHQ7l2hdvkS=>Ah)7m4qeh~0!V^7 zf+Qq0*n{pYRD zin$5XNtOx=MPeI@ZQs>>M3D5^B};6%wTuaIJhskDCcjUrHmfZ0`&d9t9iV+(=tRI5}ws+^ilD zi;E34i@c#2%i9Yv(iQEHu~^gs5d}*n{Ycn0)BFV=Kyib=5-+t`)&t6iNETCB!|Xz)B-W^&S)d^*wtPK>|7m{rIa) z*r2UkzqsKKzwf*@Bq1@`0`Bm;e=eLoG~iGjU;>!wQb2M1fhu$H(tJ9tFS*j|LWyCk zU;!Iav0Q_Ydm*j~=vo(t~+)3sVJBc?wP1 z=%la}!AkHml?7vu>%@v}u?F?@i3SF|5CJmH&o;-|P8!syBu6cVj4)Rh*C!_@XP3<3 zi1xu{{Gt>&WAIwW4fbF_!$;~c z|FgaV4h6J=4ij)OwW-GaZm0jGel;<)5RMkv`q@tZe3<$wak61(=7<6;n2FRom8;?5 zZhf}0^t~CRITx?q3sNzaoEmI~GDtb`I?WoQ(r7 z{RmJ1-0ufgm5s|kGmaED`F5!apOY$V?*276P5#JBzuIo~9?IVo*QbhZIHrwG9WePJ z0|DsuM0IkFVTL=4HZM1~fiwN4_W#vg(cAy8?uwrN|K43u`}nuJa{BR4cO`Pf?#G=r zos#Z*K?)jKuqGyMJG-%~cr}V>I|mzCDoH|rHGNNwg$<~DEgB*AksDHHsQo|saWge$BiV; zsD9?-<)$sN)OavLs*R@lah7umlcIt*5~eWGp=O7q!|G@iWi{$@Mg+Qbi?pck3jqX? zZWe(8ai$D>h1X^C0H8CUB}-&_lUbdWk(-sDx_^a9c%$<8MCNph7-3FBg>giVDG~5) zFf*K#3y4f`i~-VE2CEFO_6;lMwbMEeq6sfh`y5jE~gJtPIyu5=8zQJ zMUVe8BJgn$>?1LBvK1p5gk>Y2$@_YO*;FwFViFggP~`FoZPJC7BTL7dtDe;x)NYPJ z1amriPT}M9Cv>FLr%Zayl+fHaObBW8*eqUh<0fKpY0Nyd$l~tXwl!oil@u z@IAj@CA|D~KKFwhuenUfOTlllGVDxh24LWJ49H|ztHp?7C@?QM^?NnqyQSvRBkAsF zdY+E5hc%hnJe{D8PIbl%AUb{E>USy4UMCXQr(}DlbXf-K1Bf|(w$RoCQn!r0Gt?wa zpTgvDLc9o!?!dUcU|0y<=^m5aB`uNiX{WuLeg~b_W6=PKgFLj;x_h0TOiwmR6q17M zmkXO2v!;0mSG^TNLtKVsxmF!lK7UDkda`M@*|@lA=ewoQ0>dz56$G>nK3s*XpF5Ix0CLoL{5pmVEe}5S%vxF)z}8Vvv(^K(}X8`M2Pa(KOHZ0%taVw z@*hZPtn=QJNE+(-g~J3!B_=f5@Fot`2K#DpavLp{(&oj@FY&)eN5UY)yC)}jS>;GG zuQ%U`2K%1^+%6+XE&9UVF}ZTFqkct82R{ogo1*2HTDcjbPoQLPha^s(f}Fay@+~v( zMr%Iv6@SheR`}*kfXiWIxdp5d?I(q$*lcNb{fWxv(Dm)5fBz1iVaH+>@isS>F4@N) zr;L1WqL5uRF)c3ur7tiS_46jxK+2E;Xhk=?;`i~7)a6IB+0pj+KsjYNbzG=gT98{; zRHmf~+znx@L+BaY#6PqoT?{b|xJIWGyEXXrjq*hhpCkNi6{Uk5M@5Ufro@(K#iXcP zq>0C%$$Utj2IZV!anc?uBb$@GkTJvV_myM>I+R6vGQ%Tf&3m`UuUg8{|L=DT9 z(*YCEfC~=vKGiFZS2-9q;|nuj#PZ*hl~z&RpCn;(Q4V(CZ-w2e;2s7$oh_7crcKZ!KhP50H+hC=zkjCcO*zexyn7 zux=8~Dk_kO37o%$`*xjEA#DUpXkHyxJ>0wTErNK$(?tMe-!~C?jSha3#3|YvQxH^i zmeKR?cs8sTR2~bz@yWE{O6wmkfzIEFoZ;8bCJvtLVy=NnD-&ABA};NQq&+nD>e}z? z`VkuU%>9A!=ms&`xZv0S5M;Tw;vMjyu4Am4ZZwxLu-325i(Z-SE*7D{c%A>6gFj8c zk(f%3HgXg?n-_Ej=P142ayXt{W1)1KU;0N_Q_N2~oO{*ZDmSb(NLU>D*OpchOpwh! z29IVEVuV(PRQB_n=|mu07h1^zL1rFH8AdU!2Yf28go!`BH}kI4?uSm{=4R(?uFFrVFa#vyHT#1m&dSjA+vHWK+i( zioY2a>yDGWH-YL9objrW!fO%N~xp5W6b zT=#pEWyG?U1{?!#5JbsyF0w4iZPYWHW^q!1?#0Bx$t~>A;0#Ii^XO3d`RhU|b?G*! zr-PHlAUn0?Ck<$>YH2OghEFgim(&3-Cs%V52Dv@WVOf+}tyY4?LJ&(z@xAvxg^abu zaBYjQaz;0)am9IsTD;OSnICa{@@x2*nw+w)dIG<@w|T-ws&Ybe8vqt^k{!qgV1YhO z7YNtou%WDc7aB(*TR9-BNlkVt89!GHox2i*z%{K>Mc{gHB2uVGu;Z~Y_z)G{j9+#= zwOx8=qt|1xqd~?5y~Uy~M=M@C>1ab`VC|B~g^(j6(DD1;(;?5|IR$ZSxs=Vn2yCXP_Veg19(-vS@-IGSKKRbe-Ks?~0!TnvFJR?IeN6Y)#pjM=Sh^2q`RH9E8*n& z?T=GsiQDPGunJKML#npp;n!YfTHe;pm(2^;Gm=P#@zPnu3t)gfUiNYYtu`gzKPdqY zwOdMBksRDp*laeR`?#IwIlXYf;kdaf=3Xu?duz^9WQ(+C+|0jpwYqm_a?!F!uKYP} zE-{{mv@2#-H=x+zfea5{bn@Xf^SG|IR^yeCG!=>2edOKyU!S}00`@QvMJ9(|caI(jH;|b8RkI0pi ze}gE{BnT=-5kZiqLxium;d~B%;aHwy$kA5T)=Ck`8HV`XAd@v6u=oi9c|>=cGQ}}8 zpn!+!$3DrZE(Bt>p+0;ReZBedSdG#Q`f$|lu!VcrQH2|G<{mV}O2vHK>%GMH>zcsl z^Zm%sbjy(pi>?F?yzi&5wobVG{L=U#j*qaxw76NKCKId@Dr893##UL zlGM`A~!rT}PCCiv0D$jRyc7xN;@ z1JaTo8D#7ATBQyFB3;dfW9B+80M6h)qSp_RZ3L<>MTw{=kWpR%TtqPWP(XmGD-EO+ z3ZfR;pK7v{Um&J6J-z}tOi__35(jXi(Bv4RAr^x=rs3IA zg9yZINV*@l>Rj9YoNJ-^h!(%hCBuTea~y{t6P7Xdr~dv7X3TSY_Hx#RD<|bn!Rx zk`5K2XtjJKNDl+11!P|!sgPceopK5XOUZ$pi#+Im7jDoEAaRw4co1R+45qeHIZwe{ zaJ7!0M#=yy{VeGokvvpP9Z_1%B_RsDgRCH4;sQ<5M!|YWMm!qQV{xl*Sg@hM#}^VN zokBZbrw9IL773^9r;HJEI3#xa>PK;%0${q%_x>d!LeHYAZAotPZT}G3y@N9m7s~AG8IfS zcXVV9(4dT;P)jH|Y066b1JT|(s+>D%GMwsm`iUbF{7VomHaZ>JHyxPBv52354TCDD zKZCuoNQ-Bteg7}yg%SGq0|5p#DB_&VUbH1R6?Vf4#YpY#k8y(^avm3z<5r-6Ixrzp zc%5`R#Yqj(4KUl4<5j``Kwe1yMqX^Rzmb>8q>EVo2@usmv&0&iB(;nq?pX^dN4dsN z}yQWi#{ss*je^z-*yWrF+|Ad261~Z)=9jCiM1N zeMVAA(9P8{=N}pRegpdTRT(BgmRK=TQR?eK2y{!?_J+&~-N#+&@{9LjlM>!@vkMHG zslp=O-AOw_b7|GO1s)I9l;5?`w4?Ze^cTe|f_HjB)`Hx+&o23B67Gf=4zu3ZrcKyX z13+NByj)2qtydw7(yPQn)l!w5vg52KMh*gV8n2AY9@bYPUBWKTsRZ!tTuZ?#JDD5+ zHf+P9F~zdFMIe@vd*b|<{-w8H50)Wofl37F=boEjDraf%yQ zrwYIdVmylAF6KvjDv`<1n}_7uRr9yt9*nvaI@EmO+kLCL3QNG7!3o)?z7GS;Ednfq z$n0rCWqo(aar_U<>&$U}%j=u1Yb_cYPs?Sx1EymUG7anj4G^_t zv!Pk#8l*SU1`X-u$c-EgE^Q9N%dAU;1*qMRD{9n`>)P>(?SoP?a|FqthqQ-gDDrAQ zVa0_|8Wlr-^pDa+1bE~qDzPr+V2Q8@wTW|9{>gloVBJ0J1DH;#B*`I8JYyPfS=tgE zfsvkS;y+OExCZF+aJxTlHt^8Uww=EKmEQeT6rW67sZ;cq@xoG2tltEdbv(u%XT&jv zRIWCWut{@>ugpkzl?D9?CB`DD*@EI*e#o?T-9B!|3+;kiKa>xGyl1Hi>t!b?8Zsx@ zu0l4btfuOTtL^a{g(+K~p@F0laG|bSEz1Th`DLvD(W4|+-+#AS+%*p_2P$l<3>knJ*uD-a1sD<52)u&AV9u_mlW~8gcszA;lG8KW+$UA)Ok-cXlBU(y4);Gh&9GsheDG9 zkL~efd}i4KGUsa+kiW$EY zc=YOr6OL0t-~!RBi8tx5!Jzz9Eo zV1EHG&u`lQ0$vgr8qowI54U~09uH_vC%UPw?>Y$|acH+?B82?<&I~bey!0T-p?=~Ywh1=7OI(il>9^q-<&@yvV0;jrUbN(a_Iyv?a#kH9zPZ$71LxWQO89h zuKPWlD>gV194UYI=`SKlo(fRJ8vfpTfQNF#6Zl2O<Lhl#|lD8S~)4YK2kKppK{LVGKh9Fen9bLnj{%dD`|l(|BvuJbKU zc?N4Xt@`LKwQAK3n^hh){O|twHh;ahIMME>XV6?yS((oD)yl2LyIdD^lc`dDa6@nb-naix+7Gf)Y zmVv4GR(5vLSJSCakY+`7sKRckJs{V3Ih~kRvh%m^!u_}JA|hq{?YmSKZ7BZKj@VAR zY5Gs!J@e3z}mfA}tJ|MXo{eBb`|U8HuA>mHo%8DqN|eGZQ22ugSYQsZr6WMFka0`cH_CZb82BsV<|=XoLZ4MNM!$guir%3OMAMvI3$VNWrqykL+FI+)cxw6 zo*}VNaKKy%Vk_PS(Y%m-8 zhwmZ+^$*{L`raSt*{qc0R2Sf%zRS|L??M>zH3TRqqEydRKpZH$1ZS_0l2kq5@Pdw; zx_qfG@ri{=UcC>Vl&M9oM8a)Xo1qKbZqcry$#vWH#VG={6XFrgno&{BVU?D!^eXQ9T=FGY z&hud@)ZhOPUGLx=Y1`;q$95*p#I|iG9ozQAHYT=hJDJ$FZQJ%lC(pCrz3c4r)mPPj zL09!%ecjh@t@YbDLM#utqiku1*&t~h*&gaWV zMq_h7v=O(-2x-O?f6y(qd$OnWam;yjcYh>{CtK+~t~oy3;KR(p6FS5#Jh#wEqO_l0 zU*zM0GJRgv3P$>NYSSpk_GM1EU2htD<7!e905yGAhWL@j-nzjt?X1Sqvv?UaB7_o%U^WT9^^jBM7}o-1+7}dR&PyikUzItA7!u zOh|%q_%PvN$^VHA5k@IB2_oFW7Ncr{$4W{y1JKjD_f@XXFe$V_XMLDnC+3@ezvFju z2w0D$Eyo%Gq4r=deXMhbUy_P%GR$*$NP!>aacVH1?kJC_pKmGTc5JtQ#anjNDayBZ zG5UH}*ncp)T7~SMnuc=58K--GEjXOUKXU-K=rs8A?jX(@cFvp|4u=2q$XDQx$gAQe zc`iur+mZ_JYK!?X{Wv`$l1QCO7H|DbsJDXGdU z&m4?gf2W1K*~Jyh=WecZ2g}pBJ3W>b`J4xVoJC76+`yTK@KQoU?6HeZuq=?NHS(B^ zOl_QhstXzqJV>BpuxIUDRf; z|E;@#Zms=WcbWdD?xGOzG-Uk1?2gH~SratUQu4p*E=PC&SKUS7|F`baCC~J4-9?im zLWB6-7tR*@PQWf=bN@ee zmz*O%u`-Iq5>c{#E53Jc=NXInJx0pd780EO^}8FviF=h>v<`Xa@4Abg=8>Z?p3-3z z@IQ4IHEhwS|EarR{CC|&1JwV&>Mjl>jhUnyAtlhY`$%37Qg_31P;f|XDCfq=Vu0_u zOX7d(F6IAs-6i3lx{K*|-9=_e{=4qN{ZHKm^}FuUHSxLDo=ndEr|CE0F7~6|KXsRi z@48E)0W*pNDa}EME=d`2N7M;hNi5QHp(z;lzP+~eZ^A)kMt}FgR=X6iN~)w`*j2+s z7E$4(S-i!$(Ww?>BLcVxeYV6ZJgCK8ztJBcR#$YYp!iK;L}Ul6k|^^P@}#u&7UUkC zQf$|v({+>GHSw$yBL{jo7eWtoD#YEuq6wwznk)l9N(vcg}?!c<4Xm?OZUlkxFkr+fB1j*UYSA=_1?se9M|YKdxt2E2Oxj}kn_#@I2h+{-anX^hQILt`a+UAuHB##*BQ>)A zm|hA!)oZ9v0M9W%ctB^Q7$=LhVw-AI=1!y*)P!t|F4PD&qed%9jFHlBGf@$nZw$~G zm7kEq34~T-^#oc#1xmKauri1(^3iZ+BPd83^HXQD$PsZ5OmZMiXCUueqlPLgYWOp2 zz5p};hquh^=kyd>LJzE23ZIWSF%(_>98xH#&-|M&wry91y_lON3grPBi9q0g)m;+4 z>n`5i|JGgjh;-#Z@d*N9QU0mB^lN_0bS_5@=CU1{3lK z4dq>6nLuk+oGrLC^Vv~N;5uo=FKHm8IUiYaOFQHh+^Lo3t1zEUWX)%(t7Bj{zf|%m zFac0q^ZG$RwOq|&u@A9;?VH`i2TyN{Xi$BLI6}Lab*h4 zRI*#esS1n{dca(656v93d6crTtbIG976^j!syr}_&{#?nKpqKdy7T{Aq?LVB@hZdUld!~GFTL9_Dpu2UCxdx0$8-hd2T_9*< zvyf_@JEPWBpV;T$z_cl}M`Yn(>kzb047eyn7!(ZgbAt5>*Uh7|ZcK9VZyS8}@e*;8p_YbpLk?DIn`4tA63lI}!XIPx)B_6Z!K7w3 zCR3acP1p~zRJ!K2$5#o5BjCVi2l;32f9o#TUIbwOPu&Hfp;4PqL$&u)JGxt}KxJWX z(9nek;OiHf{J8<_yG^O~T>G}JDlpNfPB=_L^GCB#`Wc6(P7&^MrPkya{oPO0NC&bE zYuUMEuk4fKsD>c*x=p#m%+U)q+@{QreW^Xj!_`5__PEXYTHyLG%4@Am!K?PyQFfRA zWW>pYR`A!EbTA2MX9-mL*7J``|HKUbb?3zga! zuQ`&^_i@)KzP00FYQ-nmpU`;+KsFF+l|VX2IQk$QgFCD@f z8I`ub0dWjj8{a~+jd@bxfuc|aA zxo167!!5YG0Z(YkQ1qa^1U0FB9dxvzE36eR`rtc;YL0QVzqD_Iq^Hn}y|aeOQc05_ z!Xo!aY9Vh`8^JcTTq2~gDB1BQt^L>Z^}YZHB?C%PB0`<~{qJ~FLK{kB;>+N7^=OQE zEFc^uJT3?VvwhTO(jFb1e%e>J?#$D3rye2mB$l; zcy95&TVS>s>M6+K66*(0=$eJWL31y)h`l{NIJ#+T zAMI*6T?VIx#be30;tdudNA`oTOF;dIgFi@vmP-AXN&I&4qRZvi<#`wz`C487juCE) zQQUS4Lef|Wz}(h-nr%w~ zXtX_=d`|TYC}WFL)vQRZh9d%z+3Pn(OpO;-X z8;p7nCH*4m`9Z6i)9rI6H2fzE4GLJ;&ss*~_X;-o41UQOC}Lul*FLDRl%F1t`W!}BT_&Kz4a!sA6f>*sccuC zYOP(H3Fh+UNVPHROjh-2$VL>`avZ51u+zd}!fZls9!dJ#QVJP$3;ju$YnW2}k=$7_x_AMrX70lBk9HeTM>grs-~-!@ejT_`vWM@6P8k zmI)MJ+8?G#wX(mR*wAle~MwDfST@rf;j>H_P)$FnP+m0v_caqaVUud2Bq z%VRe!=kg#TfMO}E5Ueu~hzc>w)kVa0BNd5!?@oRsas6$U=`kzAjXqjsu^qf&TtTr2 zYoA*KiGP7bWeqj#f6!N`ag)~K$4A-rZ;I{EOGFj%#013B67`~u4LTx0jC*#z*B1r@W9Z5j@fA@h%ViF-3(h!tgRe zU`_ISy?^_5yMV#{jevhjoe*+Futl&OltESD?2j021F4CE?>p5=1-25`IiScFL%2IC z?~7<*jmlV0PxyR$?Rm?iM_vYV)!&Cblma?ogsGoVu-flY@g!4usXLZL35{@^UIm&KMI@VB(Pc>*J3LArA|o_}C={XM zT~^-L-{A~GcUL#Z!pz=Z?0~ZiE83KsR+VKNlg|v>5jE&?>8bGn@OK!k4Ey}brE@{4Bqu({4nLIn zJB5s7F!>p%*79YagVBzN5o{w*$D-;UI|U&zx*)LEocyH_>S-=G*Q5MCxH8wivfm}7 zg#uHyz@`rj%@oER4jgNTl;Rtj39~>CWqu*Cu9(ZBO)ChOPN>Od-ji105kejT8+uwM zs>L{#p~UYS1UVd%Z(`6~U`@z8GCq(+e=csxmhoq3$;?I&GE6aeH|}5u!4TV=g7Hb+ zj_^;!leA1S$cEWM^`nNao#fd%QP~JvF|1fUlZwQPB4AXuy$=O`%$iQUs*1lAvh{RP z&CoRRT>c^NY|L%srEqeGFkNPFkEw!*5CT7R&@~oQ?0bH+M?#|;f^EW{?Zv3XQ;|@0 z-d>`%&J&LpZ7~nXgQJVhYsdDo=V|A{|BAk8!Yu^f^Cg~lCz_LaRdmIfLR#LS zoMK5E^ZC8(_aR@UmpBvAf7)}H2d{(n(n15IXi@bm?p)VQk;%~nfsu-}wk=gjnhfK8 zBz(xRO35456bp^=h*6eK90Yk?SPdc)yzkhqiuna5^AS=se@PNI=*SpJMcS0%H&}4O zQpFnB2=bAbv=+VlU&1q7V#O}02@aq=w&jzn@eT-iyf-V+F6T=<*py8>TzB-RP9lG2 zXIvbtd&0wZkTDQWFFlz1BBZ56F{6<54IRJ@

$=80QZ~88*Ane2{xK(%EOizr26F;&_}Etwq5w9A@-6 z9Qalg6UZtxLZ}FY^xrB#DmlB72&Ux^4$+o_2q6kGvMxXg))}NE8s{cRMR@wH+{t+Q zUDzRbW%iCPc>3g#%m|DZ7Eyx}lHpvY>MN^o5uCa+P!hU=ozQWe1N!N}aVP$_az7UO z+m{!>${0S@1BK%PBY=*%jH_%MVg_}r$Iv1)toxB7G^_yEYPh1eE|GvBqmTrbTQdnK z%_y#n@Qp?mwy;m%3Eg>T9CH=H8)uN(M({vD?W=3@f+Bf@VNVH~9w6#_wv-Z<4!4(v z6HXcN=g&oOXkFN@?8)!(zMkI!+R*mfU!yh1h{Qa(Y$AdgikUy3X-?x0Wj_b=g8FYK zMaw~miS=qw)^L?N{(SxLO;ZetW>*?NVc|w1N~=i z-e`MaxjFpQ2D8MaEEp8(aI1yb)4pEnegjpET1N|WJLfZD)2_B|Z*=vi)82wusY}B1 zH@lFU6E1Cf28HYP{;OMToBKsF`qsLQUSAFRowAKhw3fPh$TQtO)|wnjifQ@wKp?CY z$5fQ9m#R5?t7a)Md8_7@PNFR1+6WJdn>?)>G zT0SaDuZ%4ojbx&6Pa&#A?@4?npJSIeRvY|0wi|KyTFFbB;J!3$DJUw4Z|iVD8ClWU z&XUO4y63dQ5pJV)(3On1gVyw;hL#yL8>BS%xHTOO1wL8~YGhMP3-SNm|)? z%C482Kt^9rFnzRuezXG3$kIEgI3%5P7|1TSji4c3A1wJUAH(E>zOO(PYvv$Hck8iJ zIfm#*ql{l72|seypSR=e0#@5RPI5sjJf3^=)_Xm0Yte(D)_(0N@U1WWd=Wlb7HhtF zA=Wmz$@OsWtx5~`Twfd~8}j(BCCUBv^TaEm!Q>w}@VmL$>bViJUWwzsA@F2$+wB3I zOa(aH)L!+=lfS>+{(=L!Hbpy~W}$bJJ_b`U{Qe#M>w%WEnbHi~y4X46pbxOyS_-kq zucJ&lMcv16I;@u2B-z#+bc$})7pYzht9o9WNtd5lwEsLf32S28Ji9MKl}W$e_%|Id zdO1d9^fS%8ZWWQ?+vdD|V;jJmVVv|fSvWA{1T=HKbi2EGza37LZ2Z;C4tReUi21s5 zYHo=#DyaFHhbh0{_$zt_5NY?e75tm_93A8$y#8;fL8t3+^82-yY0?voe(RUP+djgg zpdUnEs68UD#)y$vMCcwoZkOOc@pujS&9<3<7T2??^^%B|rqNqF+_0Uns^ZkEPtt}? zSEiLXrz4JdK{Ti6{{Ux50O0HtI9I|nz&nj)YD~Yqg~hMH-OK#!)Ai~-88Cq@eHuJg zXRrfEj)p-R3G%3QvD+!+Reo`GxqfE)J;Zo#X!+m8uXofjYbUNmT9=in+sp3RuewlP zN~ANztiSa!ZUsYi(5gHDIEX~;2&{4!51Td&^|&v~zRzh?n*J`wZuhsj|ARaK_EEU=zgcG7sj*ylC|*vtyBb}!$GaKJ%p!>)x3h`Qm$0RfjwjsCxNznm9>76 zC1C|2S4UooG&Z_S8=9b@V`J2ExTJnDnF1AY#4en~3`$x26cTf)4txKB0geuhgmx$w z8Rv#DNBqzc3vXHz3w4;wfle8G2&L)$UCH5A_k3seF(+i$YC$>mOgH%4OOQ`g%c>Ol7|=0M-O8`1=4#a9yBSePD1a#z$~h zRXVZ-WklP7M7-K*2r8*Na8+ZeHv6|n@G#~{n*JmMy>A@0A2_Fjv5e9T-6=$@W+O^|jjjFgHJ}GO*7i(%}!wkYyaj=iYTRkvm_sz7&zcA#9Ay zqxGEQ^IQbmK0J(BcDeS8)*!o?(N}`ij*1eoGUwj_`VyMK03d+Af@yB~^RL1P%!CXp ziF<76oofF_ogR<0%V@q#y208^UyYxa$JsO8H5op%a3JC`*0KGj1YHD(cE@`&$H&18 z>`Dqok|mS)V5fu-7MAe8+*ek!k)KY!SKud3{|j+%x0exwf`Z*~^6*>0V(n(- zmm{{&_sKLU06hGC|CVR$629A6|2Q zwn5G37GAvbWcBP^J$z=pqzX^yUjj@H@RqF3_bbuQIoy1?0pOa}1>C?-6)y?7i?w`u zy)UtT=Mn2CGuCYuOxFAoDqYOr6Zi>Wg*EF_xX)n`6O{U|dzZr%B{K0wV1ccvj123n z3h2%BNo>6T?V{JGc9<>la)k0wb<}ERgp^|-n#hlDR%=VQ=>jve=#U#a7Gq)m^m1F$ zYgel6Of-Xy&GZ@AJdzEOqKGB#)?HlEy30UXJ7Ea9MRug1k35R-xDy4OD!}>&R?IO= zaiZqX%~(amcf-RQPOACslg#U{6tQ8^Pgp3duH+@iL59>T2$XCv8CqyGz590VXybAn zMLpQWnJtEQ78#niG~Z@^?OFq#)Af^h_0ckLR^MnYYJaW2yA6-KqWG?ZBTfX-1=Pvh zh|FoPL|7qWnz&DrM)>v;&%+Pm$JlG}1fWaA-9am9)}kcjg-z7eeq7vCWoG-;hrYey zrBV0t;`2?*mnyVBxr^#RP=wBe!le&*fC~NJ%lw{;{q8=7Ei5h`2Gh3bH9<-K&V2$9j#_AstU}#RsA9Q#G0?n{f5SBuo&a` z-$eS})BhyWnIMGh)c}cfT;jLZ|AT@_0(+xW>hwE=8R+r3^8oH!2g|A_ag8Qz!c`7k1!w4&cJ3I{sf=7?6eIf4H!pFGWDD1geFu zwTCX)GOSj3c#%Y}@%V0_rR>&FsN2UhkY}BRDtv+2(Bw&7ESH~HzHW=pEO-lt;bGU` zb!1W4Gjx-U2S30%XcV@GYnG;`<_UnTNR6uZd$7Lpijbc{BiA;)qO>d3hq{ct-hVbm zMMg1CbaRm_p?HX$2F%KsVz}oC_AA+_{0?+j-r^|J)4NT*RkZMq*dNE zlzz?k{YEdd!`qOnl0VxyPA#kUFFIFw=WGW_{jRAKDGAUh60_5hV5j-os3%GNTT>-n z67Wm0biNP1O66>a0=@F{=3K3E#Hg;MbRn?zY1R#AuE&2}{+lzhH`$1)eI)rCo<=4a zA+1we;VtHF&gWb?%5BN_e5$VQdSSM=?|FgfTE(C`)i4wJJ1sKJ?Y#kE!0dZPp?mYP zR1*}riK%D`HZw%3DkLhCpAq5HL7_P}FyXs~QPhQkBmr&eO7lWY<{PDtM}+wM#bYnM02GbEFDnm#drMxe}6-xmqFSSl8 zgRK8ec`g2z^1?1>C4)p_In%S!2HAI9l$*fW%)CqGki;FMb8w7)mB!mN_V@3#UNhVW zZ%BuS!P=~x`iT|Fx}`Bj)Jz(M9Yfwsp4Tt)4Pd;wP#peQVay*%(xb>uWcyE`!(?Cy zN0LtRksUxC&H%dSjm#iXf4If~IcuVGxu(a%g0DMzN1+Ndi7JybXffQRY>grJ$kA!9 zX3%amWH%UXK=?e{Aw?J}SuX*vN=!@T3To5+ICldvuco`*a#abfMEUO$3BhAw@8xi_ zABc>>>kiL-dZ%5pNQ|*HSq9OM@-uQ^r#M7G>_|thUVCb&uZ?Ovu)lrC65R6`Zt~}$ zbr;8E652MY(Yln7(y4k#g@&q7$7@#I67?B4K1BIHtPPxpM zD54VX!Gn_xu|zp$K*$fC6O0Dytnq0Kx!SpYy<9KMEfa%Z)j7QpHi~e{+0Y-kzIF{V zg`%>t2tTiNns^I8{pmq2RVr)Kf@ZXTti2%cQ^j)x%9Q95*Nc|S%ZFU2co)I*sJOoa zSx%=aU(eIvQJV?5ZBZTV!}BPSAfs0~7PuPrkCLe+!_21ka9k%Vi(Xx_r8zBnuSfxi zejfR8`kIoB!wrX`>D*e+;ws8Gp7-~6k}th2?}V`TOFZ9T8`Nx?SWA>*wQd^vU?B2x zois)Ab`|x&>2Z9Q+eU&)_=oPA3hhCg?G??EIckOcKAw>DWXgmU;On)-ZD~ok{5MCK z74%!_$0?12g2YWBS8RccqBACr9(ip9mkk0Ogc`@L_j;)v2K14O+e_sYl?%9J_U4-P zstKTm5&=De+m)4~iO^Q|=M$FLF12mlk-g^W^ojPd<@ID4n@9FHC#x?g!&ir5U;2l= z`@n4yLl52lQ&SGSny92zsW)*%;~-y&k*O5$F0H z4p>Xi@)76GuJdfI%6g17iRP+L?<6+ra&)Uh>rgsVI(IyJm7+SkD9 zc>>;>NhurJNH>fHT-t~`j2%5KPoGJ%FYC_4u0A@;Ts;pBmB^UYeLrGbZ9=?DDw<2& z1gu~V(>#l%x(dO?MWqRUJ&Gp-5(Sr%+>|<}Kaa<&FMdJ&yuVuVDyV?9dJ?ci88YXa z<(qyr{I*a&GVh*23*Kx$N|j~!b<3v$+&pkI3I0u= zuVjJtbb3f3#NA$54-uCW8W_G~&P$yQaFlHK8;D8&rf?0W+S}eGKMSwBsg8E<^ef$< zr=JWyl)bLEtkTaZBYdHMG=GwMDUvx;B)S{!Pbdj|=UoT755Hj%(3gH{P?iL!z=##h zyvET1C24w!Ue@1g_lNLh)~VMiJx79Yhx?q}C#;^On}PZc#KCUVMDHF&v;?>0irHML z;3SHoWsmtx6+DzR6%nco-ojg|K-NOVPHGLY0)N!FNaL%1`$&-zAo)lUD?&`M0#y}+ z$m(yLE<*Qle)wbGa@BhmezYj|eNQS?SbHiWRqG*DA>1kJSn%bUy;)c}o67VT@dt0K zW6%zqsCH-)N#|%tm3By17S(YS=c^Le06m0og4nvUeg1Zrf&$O3@Ai^)=;DNjr89Cq z%OPzmie^2NMY(F@w>DkD1|AEM!FSD^I1zWU#o4n1ajKLDWRLN3M50^U3eW-&=5jm* z(Lw7RNOHp(XM8jzTp^lr_adpiC7NuA)HS9F_eZ%xNskl|f%_bx>0s$UL?T({c;Kl~ z@KFkjtZNd1`v)ZD-44^jilB`Y*FoB}^v(#2NEMwBFyJvM`MGFi&}IGq8|dU~@|;J@ z;fpRiZazpKUhoZ|WcS2Su=4CxE^cuXf3ziuY#yCKRuTyAE@7_Qf>xSpXv?!T{5*Pj zGd`~89?dN~QB>*p69`a%?KrDBQ5hk^t5pswpVv1y!u$oVB2eKSzQ^)c{|6?o0hpF6B=} zy<8mk2)A(E$b7{JN(9)3tZ{*-vJUKnwXkG~R_!LNLcBe^z}l03ONA-N02c4-VyCm% zm^w^D%ArU5nr2zVyrsJN*1esfuDeSg%CP zd}VKd?bt;!Bwp>uHq6Bf?!i2vy@J~L!k{D}`Lk$?AE+xB2G{mB;TI14-y}2_cTcqIXwS|BbduZi)G$h_N&4$+?ii*D0fCAedplw}c94+{-yh zNQ@7=a)dB$gu8clBDvA*oftI|Zm}=fTZ(~Su33^ok*sCWn+?ZfI=4?49fX2uVZ7e% zsS5HYR-&9kMkH=#jAA(2>~C8d{#(1h$CFm0fs>Jjyvc{PlEe@4Lc87b@QE+6I?a=q z$?iNSlkS5o;1{K{g%#yDl;eUT-h+BGYb4; z_R7SDk*LKw-NxqQkuPCd|RE51Yom3_kJip zR+u^?iXF`PlUM!j>lHu<*^UA#Ix z?#EGmPKc~aK{kN_;&^FNB5u~s(P3q+@R5dVmA*B?(rY0(_428X(?Z9hq1{Vvz_^=G6r;l zzU{Bp%e^kw0PPKu0WE)G6`6^w7bpUFF{U$;1WcfzACMFrOmcwdcw$OLfB85b1qe0C zVxWQ1pskbYD||%QCeIgK-Twm5Krp{YYH7Ln>?%GuasS4~+7V?Q?Af#UIJQ5&Rr~Tp z?eTH@)3Y)dpa~9#3SGi=RXK}v$B1~0$fv7}3v-)_%2$?#NEhQC{!M5XxW?c9-adJ< z4*5cFAws}BNLTIbX?B9+ss|`RBC12o^)1bRM7>Pt6TF@%O{-6W zm-H|0H^xXdrB^dAftKmlR;d|VE3MdCXvE(5P`vS>Xq6Aeuf#dxjZ?!KJ2m{$T@^xN z5MdsTwNkP&xzQXxFMwHP{s5u1w#rv9_p7$be`9m+XLH}7FW|Z+CoM5I;lS$gC{381 z_*}X5{Fo%@pyPG!Xnv@Ag(`ejrcT$+{JnN#YH{Hz6)I?mgY_giwHwez@GFZTeDcFf z-KH)^nlReT_UQ~*M8%@%YRW>LHQE&}TxwNcJo*7vcn`r{5Kq>FJ2Y#=^6Z19`4cp2 z0nIs3yLqE_?Kqyt#kq4d>Wh=JqW1CKl}lfc=v1felaK|%u&@b9-czjZ9K$@x80N67 zEThc_c(8&}8qXV}AUU^~F`sFR73I3p!w;sfe%ueRbU)zgcdw@Xq$@@(wy>DY>TpeS z&-&A7Lnd`OZ)Z$s>d~ znEwj*cX{fp{cvG<`U!58I#|#U2H>_t{C-!i%&vTW^^M;8jo$i=-ujK+x*^?AG@)KyD_!Ut{bz%F;1{k=eC;a4e?PV1Z%j?tDz)JHr~zB4_Vd^9y;0^V z%DjI~T?A{S2jAfP$-kyPp}|i@=Fho-PyUPgq&+H zXBz#4rX41W3Rd$a^cp1)`Hpty*x3PwoFQ&$IEO+pOd+GWd4TkJ5 zB^8ehNiOR>3x4D6iAn)VX!s6dB?-Yxo0QqQZsV^@cQ^6 zXGHes=Y?yO2mm~mfdYcz@PX3v0j&E;h$fb2>eNa66NMfAm8H7o1De0fxGv6GIjN6D z(JzX}V*eXa=`W1$;#pAFkf2n!GqhoLWX+KjWw;Yl_wYFe;^&WFy%80^5f#4?6|a$~ zI9DkFdHoFv%1n*|OhqNoHcA<5S8grOK8JIbQO4z0SK${Uj-02VneEOq-Pk-hOtHM~ zgk56E4>X}y;fnH+Te)~BXB7Axl--hjl3lu&a z?ibIJ=zQmwrx>S3i7N?B1Fd6{xgO34zLIxdT-#@lPO&TbldtWC%P=|;`rLke*?#r| z32y%d%Sgq~BRHFQq9nos(Mz$W#9ytLlGJZNP_;pGCzni`oRk_;{vU%~8_j|4Z!G5m zGuZrif|EhApnLRGcS7$CAesV%db7A0H5UF}yY$i&LlUkIBxWOvvv>L&NxB0tz&lK` zWP~`Cm@^=-KxO(y@kWFNamjXc^a+J%L?uHa-hE$w#Ze^}{EdevZ9Mqr8b$A<1!&3?`=@^{&0nBNQt2{U zel-mgnbgq1{hP%3j3B<>c#;uFy8>J|$+^ci=}-T}``fabpaSTtS&eP!yj zec^MOFU;M24nL?@=c>7;6p_5L&fb~0zIsqp$(5{$GdQCkOJU*AL_u)HBvKkU9*7b}kk%NqCM?DM+&TLcv;VixKF9mOAmetp zUJ~2=JA3*??J|(4#ko7nQ<&v^LFMX;bC)Q4=p#DO6qj6%wkmbuFCn^kz>I0C%8@A; zHNB!-^f#&MGb!}mXx`}Ox9v9JDB7SL)mY7)cZF$O@8(v(XQLQri?LE8V`Dr~uW4>aoo*GT90>(lG~yXfwIt8N|v!V`)i z2r-?-L>6da2U18|{cBLMoI6MkCoax5aIM}Kr-oHwZyZvLcxb{QMH2wua1%Z)&Yh{9 zzK4$;I3eL&e0c?*g6`VjIhBpveQLCXL9AJrKcWm(x_0qI6ksmx0@t zys#UIJJ>KG`t3T~1PeCkm;^8$hHyE+2?w`odm#Y?W)8V! zA9wzu!eWOL)3`JmTIT+3%!Dn@%`d+A3=9I@Jh|Y{F#o_lb&fJ`-UDcjlMOL*S(*BD zY5og75)hE9<0pMAk-;xdos~BVyrg`JiM6DSXxB@`GO8|IW3t`5FT`}iIt+a5l>(d( zfyvwJ^e!@u<+ZvP1?bQ^`5FCoB>Alq)&J@Q?yNdFGwLPLf9PyR&rf92KuGJ@|a-@?FWVW$DsmW(;@ozWsR0zHoJA z`T`o>Rm8h)U%5C|st%*t`Fr-vQzqSat9Amx5(>h8f)j3l+MiQ{)`hF~$rlT!!` z+)2`RD#c%V@^IpvZtpWT#W$5}Wd-LV!RQ;ZGjza7*3M3BX@U^3 zMj4y_r@2`SxfhZQ z;pIvDQaNUF91{uWEA;6V#&E)cLWbtbm9E!~_m!TWD{(6^U-UfqU?pzQ+ z-V%MWlZrTBY`vafC7{&3DN*{`ZLBw9=Cu_wzY#6J5iPe^wA^wba$_Q6%0uW^msRr+ z%w{r`&R}H1<^2uSYsMY*4b2ob>#vJu{sz2Tnjo1?yQZv)aJIlvAni0}Vm| zuHM(5*mus=zB~@Ee!9DK{|v+fzwnSmX}Lyw0G?(qX2G55@olQDP-MpW;z&9#N|1Ev zgCQCoi!4LV92x{bnTJ)YAAj*GM}C8O{dz~z6HiZ-UzX~yRhl=N9z>FS)^1JPudc8x zA!KD5HCI9v4+W7IbEPNN^;#NmMw2Dlw2Do`m6vym%c(W(=EAz$m=?`TinZQIYD^bd zAFZp5cIb`EI{2*HvsDK_G7~=t|1w3kX7 zL3Njoxf^`ifCl2jLIc+#ZKe#=N0*7j&$?+cofdh>Cb_?Bt;zIhtw}wmYs>g{_UK=o zAIQG&-O4dczD{Y;W5DQb<;$CRbArtpcR+T?OlxlA&U%B-YVV#Bu+`l@B~ae!KHOtn z$otW$)^FzissU)3(X1cjmYB_I>yc}EGuUxEcXvM1ob!6dc6DnBGCWE7_>yV}k%}UP_@QX_aG?c*^L)`}ghGu^sK+yM0GH^3WLr09j*9=Fwzx zmV$`Rgnc{e%3X6nZ_mbMBJenCFgIj>Se4R;D z*$s35H*Xf0P4oH-V$#BE$y)H@=xD`)`Y~x5J`d;)|drOi83gY+RqoIt$d( z1O;At>XyS>w{82=`&+mDpH^^XYi`A7oUFhEZ(Wa|W~}#k-PdHh$D3bWP>n|H_6&Qy zzSW+gQlqi5TM*v5L3=$@)kdxLOfXsziF$1<^(4H%9y>kWyop2xl#H65c6i-_n}g>@ z-Vd%`_Z1{GtYmMYnIi9lx1k>|5!}p=*JxRckyV=bIe2cyz)!5)V)H%^&@D9X<1P4@ z_DKt|)rNgWdcw{6gwCl(cVedkvqS<5h027jiO^O6F5ZBg6 zc<6PH82IYu1nwp2B{P;d?T5x(v-Dx-%&`l zs}CukuypCG+RR02|I3VGPoKkgy8oH_(lJGL;{Jmp+QK!+_(lRVEAW*e==aA6+V}q6 z-TMxtcJ0{n?t%AG2i|*s$Nu;B?%J+iENM!-To#rXMOO5nRSk!n=OOhj(*jsd2wUzFsV1A3Xr$FkYwaiMSE(8x&&iz z<*LM9Q^f=0P59yOtY>v4^E!Mtw4rUqQVw%sqZm9Q$qGmDJrmRWC?Rf@Xg;4FOVb~~ z0Mnnl;IFWEqnORc#2lDzs&6fpQwN=0D{?r#1tlUywxUBv`eKBR#1o0gnvjt;v~v_4 zwa}sT5esF~nM2k9%3IZq6-0Bh4m%5G7*!8p(sJ++8YbmZrD7!qP=_O^S|oRB1m`9$ zW})!{UNaf1Qo$&uB4xSJ9TOR=MAH4l2hh<&RsjNpR=@*?wr7ovSlLQMSma~rNi<@i zu^cUiPk?gH@`RcgK$YT{ghybi8>?RNa0z$II!p15p zSY!ZDDOW{@tg^}^O4~$=ajNZ0X>IcjTwz0%3M&{=>$3VW&bTOh+TkF&6LGqD`nuO< z@xbRp?SN7CCfP!4-E&T%4ZWQ|I^k|catC0yXp8`+gWkgL7F8=X7-+nU7ZZHrMn1J*+k?JemCd7jOUj-t*^ zi~yA0Ok^}xWpgYRi+3hsiJo{o)}1MijpeFQ@(zvEFN1*=tfTlIm$%5IfRj#dcQ<}a zB;qK>KGA4zc66j?WJ?#Y&7GZSbJjYtxiFs3!}xHIgS7?o6N|)9M?4Zw^d<1*K=pTk z-o`V1nQS)QV`Vd$-ac!jKi=OTOIzu5qI+~pe>&aW8;d70u{dTN3TQv3EAc+uy8-ea zO5ip*qye-qS14Lmx$`jI#}hg9yLeYDk=WAR6^r$~9q;YJyxQ1XP7@;F}2;bWvUDJ-Y|MkvvbI%w5sG`g9dHJh_69n@y$xZzKn`u~mM-CIzHUiWXX za)2v$D?5)^?f*vo{dkioA2xz6vLGrba|m>R*$CRw*NZ7i1z%}kk%v3lBdEU*?@)Xy zkO!TPL6EdWgJ5DLNGX6!7i1=#%_gi^cQhJHjP_@{M}#Fy7G|JU>7wlj-Z(7W{&*r1 z??xT)Z=%;FEc@Xao7o0+!sM@RrGm*+8#;i`z+wp~XR(Yn;GN=5TvN=z6;m`Vn8~ zN9sTqN2kMg_))FZHZ+d!fUZL%Op>D}L6|SW3=`8yQ|<{*akYr~jePMaVfPC~B+El6 z>%rGW07R}00-Hx1*baCzy4FvDzN>u>Suo?F3RgfEeK3%*04M>Nn$%Ost9LGo|eBoR(pJ$rKzveV`^^Bt_@+ux+j992ff|x%?*q!0IjBBl>fQ&09so3 zeEH!gi*x5{mmb#69>Z7h$xe%0eIT>Z^(VFScd6$x^?hFc@r*sSfX83E{)EO5!joyA zT(D=K0XP#D+K^Zk^6)-CFP`HhU%~8Y`{ezlYacn;91IH+GBVLT_t!pJdh)fsa9L}s zk@Hxp*w|_8Ew5pBF1G0jRPx>_M5TUUnDjR$&tbAZLASD(+&k0xK|GVte2kc4;2#6Q zxV}JKtp__6q^`gr_HM!Vly1x+X8XHhT~>EbFb*+jD?WTXJQcFSMbtC6O6!i z#)rV*JTc^!OSt9KQfR;>##C^S!W0bO=4hf87C3hv?1VcfaH$D!_7H7noIkMay83$}_}J*^ z>hFtmC-il`uQ8NYCs4CvxPJe6yW9k$^vdjslu#!BkJP8K{JF zSr8LdC*|hYESgw2CK1z)aJT1!@gdhhMLU)!lQO=Hai1bG*1VAB(ZHk`nOez4k!rNj z2#j2;hc&dp2NN5|Ya>QfX2f2Gnqn9n|4=824;kP9k#RvKd6!I0 z!bf!UQ(l`FcFxo&ewlL6=Q?j-91c328DKr}sPA}WRS0t_LG-C2?IH<`I=ynG=-7BZ z6*lg4bFYVd;F0wntz%T~$4tZiK4ho$7#5kvtcH{#X0 zQElC0w;O1wvDi4|Q~Rbw&Dm8@uWR-#PWF@J-bC9x_4ZtICzi;cr!EOMwtpEEz$*o2#13TVZu? zNt>%(pLxF>1t0NVuJcW8tvO3OT-p-!G?l6Z-FH%@)}E+3=!SNqSi6bMDAjLfLptZG z2FY(*YK7q=o~>E|zth!JxUuuqqdET*)~iw-O5szj$y3$~;(DF4Qu}^%LGPXK&4#U` z#`oxIPe&D0!{?&_V0D|anJw9=ZO8_;W1Y>|z*b!CMr>jmHnj;oDm8y#n_<7cBU|W? z|K?nmb2`xO*Qg4Oa?C8cZ$F3cb5qOHPkvjp0*~|bOMCix?Z#;e3^|cym6BSl1f7g5No{$j(Nf{uk3o|gq0zkni;J)@uxnLxEdhms4wFn`#!fQz8IGr^| z|7n0OlG*s@ZD%Ti(8*sQ1@s7A(=aJSo6K@P4?^gJj)26_1K$#j^e7YadeJaF_3s4J zpeqF$BE0Usiu~be$hT&*TgNAI`CPg@xvN;IgqfZ>ngLv;$E!u4)6sXv3mMSIJKA3O zFdZwgfsV(zzy>-V@9&mu75Uv)wDacAOqS2G8!vbU6mr#EIuGuW+`FFmtdvUmN$L)Y zn3X?`hVIrS3$j$KREaaDA|Q6W$*9Ab0IC*5u(2ixi5=8r&LP?pR5^rCmQX~0 zf8bD#=qb_&VwI1NA?7E;>w#tzOum3#LZ3LD#2X`<1zT+KAO8mO#lk@iZq};g4i>1N zLbaSbc+e`NG&oEYCV5-%8&5|iHTomP6xN)^i~ zK$@=7a6K@^G4eyi>4sE~7zg783x}_}diCK_rwb8Q7)nMMrDEZSD#_4TC8YQ4Xbvx8 zmw2Zf#?h>dOtFI~ylE3{Hv(pmlne?PQyF|IerBE7rZpu`DF{?KR2_LrB6oJTifq`Q7E?CCkz_XpkjE z4})(GcNqxSgy&3v5FiO;ATY_qkO?6M41AY6-m?8yzQV7oU%Ov>YfCm{9_%xRSbKL@ zcUM`k2*GlCt|HytWFuW%)CeRjga|~ji)9?FN4Cn%HK)3Vlq2E+a|<&-ky|Vrkj9qCMDiNawMbT4yXU9C zk1KGC<^B7)yhlsWeXc>PdlSzqs8Ku=)Fg*egx%u*B$MX#z4vzQ-Hi#&*K$Lc1d&-F zqzUH6g(7iC#($YRML_VeF5R_+C4!M(u?^&{cy+3lftXafM-1gyL{5&`d!{?07#?4n ztjQvg&GBWUy5q4vvWbR_M~?I;G-~;lL9G$*3sVkF;SEj;*;FYXil(|^$QrMUIMw#{ zk*lA*!!$g`M*FXliXj|8X}Nz?|J9f0ZJHYFjV1f>gbXEODNDc-NB|YMkSQAxk$KDl z?Zoe)HH-{WBm#o=PFYdB@w;8?nOQK0kXLnR;J7EorR zfxxr5dS((>?bdDkQg7`ZRVM>~9c#sa4{IJjvVbnS9+eu4sSaHPE>AaE5mVi|wS0BLn6a@XJ#^1sK{Q+$iBtv`R4mEj@Vul6H&+ z_{|)gCP2uB$=|5EuNjRshDsS+mq~kU;Vu)L;=Jl)Bt&pG3daKANZ;(0$HvAi-oEC!b)xovVu;as zP}Fdx()$gkI^&66gl^uaL}SJ$rw5Qh*pa(?hwVFa>v@o$2%OMBJKa-t_E$mqI}JL> zTp4sPRB3c@%QTsl0w@77ezN#nHlT%-Ckn=$_5gSJlCOOI z7=m4aWg){L;{{Ei`<0K0F2xXE{GdteGFOZdn2NB#xWQ4X9!Hwd- zNz)OP^T2E6#S!4}d$}G1ll^<;cbT@qR6WT9lM#xHIc!w&;8)^TL7&E$GM*3@b6*NE z>$WDk5fRRDz{%!GNWK!@mooI%FYpiYmr_k39&am?F(WMVVhEZ2)pDa!)6b&{#ywaT z%=$v>F}S^ePeNgOvG~cEH(LkK6P3J4$ZXioF%79xjL%=aSSH;Xw*g)$Yb zq%W4mZc*lG->0&NCALWIQ23SDxjgd~lujl3hD-r+Q#-`K4@PSF5j6U1k^_71`}qG4 zj#;Weu}ms&Qcog=5@S#b3(cdfJ|ZC)zHZ+5v3dQk;K=m&%*vAooFw$_(|;@3JaXsn zqfrSbF&RO&Ug1i4C(r2`_#_^2EgJQM(IxU2(BZJN9I(GcF!<3CUWEq-!RtkRl^lb! zM3iF+VCn~QcxZspj5cFwxiPJlDb@u;;zp@%8@5@TVGuadz1laG(A5JhiKr~KRRf2V zsHInA?@B8>n57hEHEC%lvy>n^XWIZiTah6$x<1S!FDr6~1woSb2 zq_yqGQk0;IK(x++eaH?u?oXtX@dWB&Ix|r%UoBV4%$B3otmU%BY?aCuq_Y2pZ>vtV z_*nMC;M0*`@80`n#EPdXkFl6eCHn^ZhI$A325kjP&f$8vl7&)WUxMwb}oSMfBUBuv7y?-2H@I1cc zes>I)3BxeZ%EBk^w-;ClX3`vu@9ZvozH;vV^5Wy>mATau&sJ{!b@|D|pO)s`ua2Ru zbqoq^P5rcVYW4ZG)thIQAD?Xg_#Mj>a^^CJCP2C;zF)hz(7gD_{phYY#$+6=+NNfaogNor{Wzin!r;cOogYW;$rEeG@4Dx{fcVoSHqqO^*@~iXgShYHdwv5< z4+*?=VNA(H!y*?g-GDgP=e}_7oQ4kIIV83ldIoiuZmi9nZ(f>TIXBn5_F(OsD>AN9 zapZpax%<%#JOuat$>xnYckX2KhlQ16kKE5s>r%k#)4w*)ehG1l{KnHfHM@3w7UFL& z9(V6wguqhg{<3o8D1JkuXUmQE#?kWP?dFZ8wWB|{&%X8r3gFHkYhJp~gHWA;NgTIu zp>Y|!OoE-If6zk|5r;5!53RFb(!OoG|XWPm0o+Vt2E1Oio%MoHgK@526Q5BtEuUi*Hqc3 zsQ@IkG{stG$J55*riA;97DmjNYqeqHv(K_i2Q69?%0w2+gWBp;OKF3W&637Q)tar0 zR!U+0ARNw4B&y3g=JqQqBAis6C{pcpOaaq0f{DUGmNHSOGfNq%{&Fp4A?;%N!4AZ} z$IqLycbXqva_@hLeiK9g8AHEQO0{J@6SWjgv&61~>S#J;iSyWMX+y!{9~M!YmmW~d z`0|t6?wLnxC(a@w;d_E7h+qgLECh1hce5H8GTIURCun{4#?U*jzd1(KH~8o7Bzg6w zu5H0%-Gci*BI=~%N-0|<4XN@^mV-=J!$}HX^;9dVzY|33odm_V|71vtO#hZN7T!Ud z+GNT}BtoW`R5^3$^`PZw5}E@Sy!#Y|gv#MWjp-7cib#J=||;0Y(x zz%v6TsSEa zg$)Reb}PrjWhYfm?34jPI;Kkx@adh+lY8rqd;dHHQ+|39AL!RUJ>j1Jnz=ISp>xm7 zxeHg}oQcoV^CxX|71G{Ft}bcNTL#J7gzVjd4qbZgKK`ot+3l5^3%HAP9Ax1RwuxK*aAL@7zyI z^LT2S^M71^^4I1kH=5sH0+0}Ld3o{M)pI|(=P#3@B%HDtHJM><@$vHGlcW|6vWy#} zAY0&+62|KDPcc=PxblhRt&p`gc-AZ>Z-DsjO#J|E(f#fi#AQA)PqU0d#PFpXYqNjy z9ngd3Cl6Qe{MkMEJ;V#YIgjz`JC^44S)g+gaN7OqSo8Di?)iJG$Dalu?CHssi{C;2 z0cUHUUU26hYB*b7oW;v%_4yx{mzD@u?S!=>X1sAZy1Gx;Qi``0qz)mviJgfcOreFx-lR{2f|Z`q=&UO7q%-=H>hAk;OL5Mah8NwjYw64mxs9 z@=->~Z$imIM1;{d5Z2u;A(uXdPTab)`rr$&%+U9wVWq9fY=Ap~L>BEhAHnkY8cUXt z?43*_o)@~3sP@AD5P+94F_Y0a1a!%aps4Jtg4;jb6;l=V&U&sY8sX8DGpo-&#w3RuYo1-89ROxVuJ9!XNCKJ( z_JicnIY;qIt$$juCPF^Sr9wq5B<>W-v?iA~v(&Jba56A!+q~@e!1jRzU z8T6N%VIOxNUqqw7rWb&Adm6&MTt<9{Y0lK;vrbFNaQj|J@O12h1W!v3B+`Q9ca!imG7AXmi$bDVoj_a?23qEF|# z^1bw(^5UH!(p5TUQw@iAYxQX%$fH#sx9Us87WeYvS>_S+{Gxm9^XB(=n;)KcKf2cZ z>^PRIR=)Vl+La&7_Dvrsnd{pHwq^2^TzT6! zdh#Ky+w5r_^9!^`T5$CeERt4SzdV~HpREv9Nrs=$cF9eAQwXCCQv-aquuZDb9sRA7 zYB_jbrhU?E{iRwcEzJkoDAn?|P@^wOBg`26M(Bi1p79d2!h8({=!H963mdN+mS(eK z?XXHI>(mdo2*gVCe8K8H7Q(od4oplz7%iCBs}0i*c_2Gb>~LiK0;0ezKL;=Qm1B=k zas;{aH>h(w7U$jPC%|I>A3LeKgq)tJE(+m_1~jT-G!I;$5PEhR{J&)t(UaS2N1wYV z&Qg6#^XwPyt?y)4aQWIw3q6t1lTI}m(zTw>v~@hq<#Vx%BFv* z*O-zBRK^Q`VTvsRiRmg>STjVgsjNJ>=YDhp?snA5ll7_WVMN=JuZ76Xkee{)(Dp0@ z>!o*XmKVPTp*GF{MA_}dnh2nV;5y9WWB1;pwWk-%HAJzg^j<6B7k0mu2F12sLhDd? zE5TZnmQ8GcHpJFOIDLpz(?^pb_K5&phO`6z+}aE!@X#xL#w&fsKY%{tpGZ&fO1OvI)li#s5>t1Ua`Q%5S$Gk+}W`_^6F^Z?$P+wxwq|e@4Dk5`LF)YA|m;mMd_hU z$m%1Anv$s~EhP1orN1;UKC_-d-RpPhfsYC1r_K^z4glibKdJI(IOsmPO@X=BZfIB1 zIYcawH*fst9=+fFFKA}1b6F&GwzAI z)G5MjF<=X65#!1CX#eEn=4W3Bbp0D@Iy%uJyeKUIEx-Z*{P?$(C+De~Iwh&q=W}q@ zBQ^fb^#w7+SKz|4m8T2fZiL^0OAc|QWb5|Fzu_hC-ntClV^+9vS_v~HDnE=LuDQXfgNz%57S9-}5KW{*R>CG?yMRGMEv)!EIsr78+A# zo3tHEMhj`04uP#1&XU<2NxUR7Qy?M51Rz#>1G|bPnTKVLNJ77iPGnwEQDfXukF;$> zB+((CPM@qI0;;7~o6w$;gAQO8kAcHA1CkqvO|wXE2g&NH4lt1{&%cOxbUg`F9TMps zf?HJ+4ppk+icYpxD9%vJHac{mfR_Y=fD7+0gOMGVf`fgzh~-a6G34YGz>JA8Q57?> zNtUuPTZABm8QlA*PMI+hmW8lur{>9LeBH$%#SvZ?QM#WeteDwZE>{rA(U-+T%1&ns zMbH-rtUz`+j4cvym|P0MuOUo4d7CRwVUnUbfbjpEtQfaS73C@)AIofA1(%1Y(PML6 zuxiwCKuc2UfgCgHk>f9{%fpQRct%g5rBrh1FpAOFlO!X8zcjW}B%oI`ln$mp`w z_nlFqZKl&TCWOwQeMkqZa~zp$qDtG4MnOZvC}}uGX$HD7W}khE&Uea(99S;nWU)gA z+6>?kPe!d#A-SsRa12EOwY4(l%$xB-Yt$Sc_%I*iq(|cn20nsSY3?yI zY_XN!THeM*Z$ZE(;30!c6th(;4*giZWnrjNm@4@z-)uX43m|xhFYKQx7g#b(Q{dBJ zl6)%`P+@l}+uLqmh~4s3tXoD?pebcCY)Rxp3v3E3;IQIWw;iy!sHYxu_!ClS6X47~ zY^?AW1bSlxI1>i}lAKb`HrkG)xfb|qLq<~|882GMY?)%=VEB0^slP1~puYmW7#vNb zAv)M7u_W?&o|7BzQ#sMgEEOg^DmQ5db2>~paRDf5xfL@}Eobu}Gh3SR>p@&p!tbMG zA$s)GWwxQi^=-2q^klJ^IPGY}-Y6AtkE7P<2tyN!#Kv;`!ZxoqW~Y8h)JvOC;)S|1 zWf_-LGmNxn)V03knqMEXv~VI)XML&2IsUwU)IOLt z8q#-Ze`%}65};Yf9=slXTkk~Xu>yhzIIb3FT7C$Gv>jpowQA)?HRq%iG==8R@Pt^Q z9ZFpRtb1cbFt33iSxn6_5`IR7LOPtVrD=Mo!qx&erb;*=x5#w}i!XM>!IaY$8a!RG zrWSsa1Z=48Bx_z%3KO;R&>&=O8yf73S+a!%)wZP1gnvNg4}J*#%5$DTl9bF8rto2Y z3UeZ&o){DQXDv0Af}TxnHD~DF{Tfm8wL!#R&s~HapLvktVkHA-Nym;k~J2LWq6K(&PO9QI*53wV=^R<{Xu7@mdQKQnAv73#cYnuCaxY)?JBj1RF59P zYb-(v?X;8~ev_PFWs_7M7C<3Kt4S5qHF3(gmL|)~1 z9vmJd`OK3;Lo~OymhhIK*9Yxdt88IgGwKjVsD=42SGSR(J9h;)pr;3DT`#7m0+$J* zUcl^;A(6~6p@<^2E3<6V5liYic3>m;)7L6Jf0{EM)(++gBw{TOSa-lxmG8`)&yq1yx;U-E^wb=vFRoM z3(~y3Sez~zm2ukRb9xTCVZq>KOV^kho08sUJCp@eU#3yvzN_jH9A+PiG%Ao1o><+B zgbK8X9hEAmv?8OwCuPUz)c5QvZzq8OK0m$o<%cWBzjc>p-8-k_@pweVDc%6_G0r(m zgz`~i+l%=HRtOJ_S9L|Ttf61#_$<`0l;?t1Qg8T!$jJ5Mqr402vLL_hCo_KEdJ9QFg zkn({@hLMIS5}{gLb$0RM@L>4HoF%DnL2pnhR^oC>T!hIPbh}#{_3}Or$9}$Ny;MrZ z0Wi3O)G51(%6kL3z*(+l0D%^+x}nqq5(7#m)*C_AmT0_S5v<9uqEIr% zwi0$EuB9VWwTRg>a;ei(vcOKiznaZQeNsAR*p?W5{dL-Ij5~8EQfHcmClXHWa?E-T zD&#^ZPCkF1Jr?OQcN1`prJgcg+i1xO9xkvz9BA8?|4eZP^leRnHcQIeE(Y<$slSY@Q<9iYP}@EbhP(NJ*5$ zwY22APqz}C#GZ~(MPDMFO`h-4kR~%p5Qs6iuyJ(Czzt^f}VmkL4)4cVU(BwyEG*^Hi|Ge zcyf&@tf6$%IYc(2%oaz$;fjT)p$> zl^aLhxu?xb_hU5r%}#MUli-;E8(wv6`Hvm@R+cWW-ut5-x+#41i@eI*0EH6G*lnh3;dTmB%ls6i$Vfp#Z=FxlLZ+ZUS+UzM? zKeSn)ogp1?uiqtK>D@&NQ0`PRX6s4Nw7K^lt$ueCy7TQ7_t;$#9ugPtA;lq-=Il{} zZ)nZDN7jTDt=#=d8U}`6WK(HJ=yW-kov^x0npCri-9;AD51!-1Q(P-3Z%?qLgdiAjJcGvF`f2H^oSu)lE<#-M?=L|v z0DANO$@$gib66lN7RkYmuGM;VrpGy)b1Lm|*upb0Z}VFT=-BbqA#lN@SUPz!{S69fgc-ok0i8Ky`vzt`&JN(Bs0xoQE6T!pM&z|dhB z5N7ZZJI|Dn>^iR5O4Y$9xoX;WrI&RPV*POr5{Ny1EMnKX!5Wb>|FeEni+E>fw9&DG^?tW}saC@t*taLu)G#TKC%L?$bZf-ubmc;{lU<9x`}o3?OXTnlW)l z&iwJXq{xb^1jWz(R8o7FQ`yc!P3L zw)w?v2#$W?D>hvcX&S|8t_-uFNV%~3{2CTOfnXM{EtcR$e4_bzPbI^zkPd5EMVAZoxCAwATSw^ zpby|bQC$l0LV?BLbSj|?*YFMdUQ^?~AF)im19K?R8f_cfwr$(CZQHhO+s=-&W81d1 zW81hn=e$?uL!8Sg&A1#C#~rX`MTh@cMj9I`^cFgaqrAGK!5o zFHmGrSzX`dlKr7Q``6z|1%GSX+y8#_ojNci}Qv|Q5dn;N42lX88ME|z)z@u z3+U^4X7w;qe7p|2TOxWb?%qHV2j4%v=+Ec@mvhf)J@BFC{N30^cu;5Ko(lg_gxz{Q zSO3DY$TTENZf$zBQRKZ3BMZ_`kJqqh{BS@l(Npp=5S|VBlEyRNaG{nPqaO4pv<9i? zk2G_k2CFd{itfS_XbDQwAQ8JAaNr!$^M(-Kb4|@#%or0R;v;jYQrLw}jf2mG6$0T6 z_+N_k0FV49Vd0iE#|eY|70^VMi8Qk!9_wTSatfefPFj%;3kWE}?i884lD`W+nK#BQ zc~NzO)M)k`cqm=ji6R&VjSDS=sorZMEEi@dY)U3av=KEhPFJGj1`Sq4 zB(3U8@dbu*DW*e>-kNqdYJbB?Y~3yDo1*wNuGfkj@^1IKt-fUs{~T#uiL(7Fm>#-s zYbwIMq=ns(GELY!`)YNJR{#Cst>zwr)1c*8g!?hlizD+@fSA-77u;6QF<~e#x?2QqQc>&od-&(26(01X}nz3FbhnYZ3*y!t%lz1e!@c{4rDXwr~rJ;#WPk~z-=OcZ2`XOm(pXCuQ+2SFtH z_h*D;o%D*jT;s*E3&(`z6?EB%gEV7e<@8ZulxMeilcY^(0(_H( zR$bq_AS|B&ai{i@GB`_{ufosIdl>(ZiYxW@m< z`yw!Mn3xP5W{CrNoPmqmoLu&esHAZi9fgYrVU})L5fp79@7C$+6g0&94VZWG&$9V* zT+Qs-*RVXzbG}>A8PH<1yQA_Xd;QuaFkYPX4|RgZmDv)R9T@7qN(!QSoy*#hc~B`&}woLeVG#T z#!J+oiPYUs4A9VB%%hQ+)Q$p1t(>=yP;n!ka)$x4<{5rT1+?-+GyHD|uLKCP{gMy>tPl1Oc zSRvX1^W#FO`LhD8QotO2qP|_wZkiIM!w$j_>UfQQSAzAFlYUtOU{NJ(|{fxRw|JuFxwxp5juFuRr6zglHARbVAQf9Vc z0Q>A~jRtwUvs`_kW<-};+FN4G#Ie0k$V%B7h}h!P9gp)w+~&@_&daLKen?;C`_N%z ze|02ZW+j^d^}8mf?TlG723{Lxcv27>Y{M{T?`;?mEL3Q`<*7pHu8|;X6AN0N!-iNt zb0U;ifj20eBpP!B@i)ZjH)lI;qla#=ZL}hto-Z*azk7Ui2=?nys8WYUhSlKjkA-Hf zu&Ym=wGcC}sZCB2+l0RYgjE);m#>m}z=A{czA*S#u4<$d;4hgxntojftdUjCb;Ay& zDoi~za#+0iz*{N*XXBtxH>0S4JGN*`$gYRYo;8liqQZ==IE;K`zW68Ob~6d1#!EmM~K-F8!pSU69Dk9`0iZ0Q5z&L zvl^Rdd?U+aP_AtFeg#5txf#gPzr=8vmZBj5rhZS{)uO^hK~(tdsaVBB%pe4Gf&D#WG&crN1hRrz&Af|`*9W7iMwNJ`G zO-`qT#q1b*Tp9Q+&oK9xo$;6x@9+MFKh_FuG#O_L8VPqGV?97K*oxZ9cQ&zY>SCQ8 z8fzqojr~Pqrknyvh9e(Eu+&FV^}3=pr0a_Z>r%s%N5x15cOa6zU025o})61s_i8gsi~_Q*c462+n2+G_T}vf&nurmg>u zM#+DSNBOGUIktyJq50SlzJ6|q~ zVMVda%ByRi*%aqT)@%O}13I_E-mAkW)q_^qTm?%ui$Fo;gu=It9rzsdDjXneJS zhno#_UUfo;-88HlZ84NLu_jPcRHb-pkc zKr%!4G9DAm*9L~y0mks!ds*4D#NmME3z|bKM zy>1ns^Um>}?~AQ&u83@-0||u6@qvAdrFj>Opu@9tv=|t`xWu%)fimRY1XUK<3BG~if+$667elfNM6e-LdQ*pB*!rlketW*&q zjxb|7u(R#b-~&OHhQ->_W!9dSTeb-Tlch*OcNB%~U%5z+e^w7I_~7?SZ!J->DTFTh z;NGsRvcd)}V(`NHCmR;Fp0uE)^JZo4*3wY%G)EuA>G4q=7*A$yr{Pfp4qFKAE{2p> z)SoRb9VajYnZlT0f6lKIa3pJ?-7e(BH~0m}`A!g%$Z#MFC(FGQurIP_Smn=je+L66|L~rt zi#h0)10@A_{1F@bQJ_kcU=koc-4?36E$i;+~5r1nlZ5XG~DxmwJwUe2IO8E4m(105UVMH$H%utK|z4+uiQ4-{h`v zm%hPhguP>B^G|pJyDqXRBnfdmQn8A2c8D$@`XJV* z)8D_NLxmb&P8Oa#SQ#nVAT?&BA^tOVf2Li_@tAaoDOXFx%QX_K&B8pqS zGfljxG6~-PWZHVW^)~;zn$q^(ne1ve$G%*#Zv<;W+`%>^?PMm}u3yvR=NNcZV~>9k z?E%O}XKX0eGBP6D*xN?Aw4=FJkDc%lgImSoe~CJcy7_zI0F^Rd5>e&puu%@sV7h4# zjga~Z{hD5Zifl}TOb@y}96rUSSpmug4a+UXA<;s4W-8{m&4`I3olg|9#uq4w8Yvax zC4FH2Eh*akDLUNoIUGl#C=$}p1Q(X$Y)wSL9`7WnPdI@>hS-j`MiFQ&$frGPg(7sE z2B$Ud_>xLDeYBMUVRuXKZ)Vm zCQYY3hEIk`Hx<@bS*ljMQY_`^<_Q$^C)#G9Wdb==sdg(s#O+SU7-}w`Vc$gK;*ejv zd4>Gw=qErlah68?Yn@kDS}>^|sb{vc0{F;g8PWBRLPhRMB%_)|egq-xp{W0r? z=0TSn`DE8Z-w#tOhVwji!lve{(-9mxUmcu$XEgp3f&JTLfK~%QI}HyPc0zBf5vbPs z^LpUAhIN`jC2|n$nl&K%n9B$p4M3$+fOEi~Wc$M$85W0aW5Mz9m8VepI`6i5z<`Bp z#(g`stX5~h@q{+T&9+$wUpA*qg25W+$`LMhXCivom(e|pEp`qRX|h3SnIypD6XOSo zD_MYIf66J|As6Zmn%BAo+Iur%OGTE^BIo=3OS^|k&m;y>5b;i`^fSR}<@lB3nNDf; zyR|Ikzem@h-RTO;RN!2yIeLf4<>*SU)hk%?P?y8oq=qI5{6?ibfZ z9=7tgMo%35)oyTgJKr$Ak9iX(ntMs@L$+uIt&D-w7{tKA(|zpe_6UhA9f`1%SGr%u ze0}D5UqE5QM(I`^Aqc~1CkCRD(UQ>6h&&vh*f@Gcg*wz=!*IN0KjOFx_h zv12vM1&WxEbw$LOl@~@Xj*JS`;?HH`WmHeVnJn=Qg$y^;295Zf!;0RFU;gO?ckanibtTB- zmK{>lPQ0o*k+c$s3>(F_hCHOJ-)tX4hcYp*X&x^WBuBVw|cnAat166Jcjp5Fr0FrAH<%7ipPb)g2>z?0%7c@KBriqqU>Fe5vJ+!o9|BBWRs z=yLBqon@Fdcs3c*b=XM@o6E8CbN8^hyP2NRPe;$fSk0J5W5nCmNe3MAqcuT4Z0&aU*-2^Q2V z9hbGKSwrNfb4#ETm%JC#y%MMrdgu#4Cwt4!+(NQU>5E4F1z{vUFk{cp4#3SoNYpyB z=HuCRx7&U+Z#&bLyY+gz)0e)D@m`Hc&aGEDHYVdQ8nP%^VJ|NbzmlOV2_xTk&o^K} zw240zqV)4RO=SXY_D^GJs+vWcv|qn{J0&SwVGoZhDa{f9DMCVlUxAk0B|ktz*stB@ zpNh3k_%QzR-&$LGk&W9h07h?P3L^_bQcb~W&wTFaca)bG&&P?TTfavr?1>54-Use?~4!MZ}Qjw|eZ{q7q zGR)RTV4=1N4;01phXOI}o(mnPKl{s1#3RPlqt|`o-NIV#bf5Y8d0kDytH=3Y&`o!^ z%Z>hXbYiA{+>TXi)egE5ZL0Xf>g{DrbiS`F{MO;QmN^6m4_A`oeqq;Y1!8X4$wLaS z`d&k5D<6Fi+vpffTz{T<60*i%=X>R4 zeHsy;ongAZ=+yYGS|Ux`e7N@ZcI)n>|Lgp5-TKm;s}3a#&}2OO&6T$bkU1&jKWk0a${&9uk)nG!p{wxOgfA7JoX4N5&%u?9!>}H z*h;|!##V1}!@jhdDcJHTT}f_(BJ~Am+43&^UU&A7q1nQEx}}FOMS+}$!NTy%Tt`>xU;E$GqGAH~+F`CcICfdsg~Tg2f3t9F zHQw0{z>@@y;c4hN5bVls1!sjCO~P$YI0M^6I8OjEgqu8TezRxJLN1BuDg6Y`+IRO}osg?FVAX1Tf>pgUm;5KQwB?pp&K z9Iy1}ZF2EQHK8UbBZMDqT5v}JQA)H!sS6k$6^EdI#*h|eFqMr{O1Cm%ZV5VCuK>~r zUxD&EQN$o81-77|&?z|RxR@t~!~|3)ODJgTfkwB#89NC???FTvbhwg{nmdZrf{%F;5V9M}4RjY4$n^!7SH24B~R;D)*?Q|V7 zQdl_~6vWDtB?@G5x!&uo)-Q*ChjZqqmD)^`*lLe5Y2Nb8zYVw)ss(_=A~WyIVJb!I%-}*Q zMZ03T?CTGx8*)?UVSV7qc;p0rWxu9a=!BZl`Nnlrp$SZBkXU)kkuQ{mq;YKF#!^CY z3zh#?f?M;QoOuQ-R4Ywvg65gBahOX1ojUPl;l#1W?E^y*KM$cPmCYE-jV zb7EDnj4uK806rnjQ=A?_B_^eh0QO2gPU=!udc#G^E#JS3=8V=5gByf>;y|?Wv7B=H)8&$llk)b9}lgz0vs#9f5 zWiS(s4L`apwJJNKEY`+_$~~w|o_+>t|1TiT_z@^T&k0UU_8#%oz1%@yGr?MV4sJD5 zcia|AS<{QX`>LeYUA)NFY>e}(AQqaooj{MPFDqSMjg4`=hd~R`=SDo;8Ys<@wI7=Y zOGBlNgw*`)j%$tbP%+Z*O1Xmrk2@K)3d||=s^=~1KjkUNYq!4fsGk6vF6$mAh?)NPO6*jzC zk&cyjj(L!F1A?YBD}6m*g4kaiLGU#g=%HD?ZU?}|)oiL^pt59*-K3Qq`KzRnLuT($kxjkm&5eVh0?lweN=ed=M%B)vG&b>bAetQI$Wv0cF>vyH;gH==Q=+w zyf4=Ne>*RCKlnI*mkGnNRCY`xBQd061Sx$qa7+Obyq_2D4_2qK`Pu4zP`6g_^EEy1 z`{JkP{%>@Vnh{fU#ZE9!h^t|=^!qt_cfl1b!^!%EL@<>x=n z>uGfLx_ulgKK&Q^7nr_w|9kyLZS8zA-*#BQ2AAHkwO+b>vZ`jE=g8l&nZw)R_0#t8 zN=b1$aO#AFSMOE&~_21 zMr|BssGr7mzTT2_cr9#e3@CNK`;9521r4wLf7H%uEsqV|Cjg)v-S(((@#zYXnU4Gi>PJLWqn)D4h4(^{Omng%NmGxuuQ2_n_VK=i&_skGY zM>DgmvwoDZA=}i1)h5Q>Fhf*L?-(BXS9oB(K>levBs!y~qNEG%`g>Z`sHcOL3u*lhQY&>*=2~MxTgkqaMNZm|G~b zg{CP$W+xdpG_(CA$+*2Y`M|h~>k)&xc?ZJ)>0v@E2zN8jNgx)$GS2jPOiWkx@+lSi zJTb!rmfo^_HO2bS7s}{1yJa?uD5*;iB=`*2>}fh?8}!%6|V;2|Rj>uaD%O4(r7H2D3 zLtht5oeS~}so@AL))~xEkp{CbKCvO0Rxq1ouByUlJMq*GwHc?t_GI-KkpH70lQ&qrrfoAaG(F_g6HA-mAAxwaveT$9NeTIF8606 zhjDlgoY9%9rxsCcugl0*x>{?uQpdvB2NNN=nsn<4aCNzL)H=oj&=d5sz8d)k3^aE*;t5!J7L5j$4GtU5N`g5PIu%bs3AoXp_|0}C(i2_CNS z+Qg_nU+n2?`lj_p?rP)jHh)638?}HOe(fvF5CY$xFvyXSr+=ZI_jGj7VzF_g*ZMxc zpDxV_oR!9TV#AQ1f_vw2JbzFpZls3X_Mrkw(2FcU3SA84%aCwU%O`@ruHQWl&O^y( z-X3P_^K|NQ>DzI8XTrhV_xv({e!-gIRU7o^HZ%Da^zIzksvY0Z8XL}?gY*(~{E;FN z7Y2ZwQ5Y+~Zn3nL1#_320vwEW^$!>vrB@?9Q z1*Ousw?#o~L(Y@N9l3U{P=3`2?5PkHZO>VE{Pkfpd8vo3+|+V^9Z#oiAP@dBds38O zOhhuXvp;wNf|f~1j2`kkUDlQhgXtQ$+=R8=S}1vne8Qb>J(FB$!Fwult`pWa<9JHJOp*)c ziBWY8(#RfUoQJV%VNkk|k9I8SB$6;E$f47?S8VKIs6LT)B)rKU&K6Mb3?{YYjeI%^ z*`ipZ^XqoI+rPa9d0$-0ORH~xkSjhr6sE}$eAas2{l@5xH=xb36#x*b<0}xUAaBx7 z{!lwmG5m1?&*kBEPSRJL{4E6nferWklk77`t^a9yH32^vU%MBJ{9BzUza1e z)%c|tdwMpGy_fs=`z!_u%p<6%q~Eo8qbP=5il<*vX}pyi2I?Uz%6o6aOqb{7 zxflFK>*HSd<(U0C5}9@l-3u5I?0B#v%UrGj7e{?t922Qqg~8wH2y)BJ0iU@K+JzRahb?Qa4S2kwXcm6iow6>4q=#>)qP#ZD&%FEgxJkdP z0nq-?z$!sq$frna=D43D#9CBR3q>C5@*HXrlh_{du}qO_kF97$5WG*R-L6bdVoHXS z#fV8gQoURjK~GM{oAn&vES>GC-cXbU&@r7L-?HU#d;wcgKHq@VlD5fo-CM-gRyKHE z!#K=Hi&vw6aFX?ps?Xfh)ZBTk<&6hK8x;PtS=^MkifvQ$i2h$V#lka$WKw#s$Wml@ zjl{1t6w)z2w7>quu8VFWuPT+P*vmP|Y7PKl0;#zkY|5xX&l76}OS+P9i(>3qXxh@^ zTY0#Bho9-~5JXi{sOxp)iw-J5v?mEaM2;U<+RGYr}?I{`#UqHhzwbI#De@Eji^ zSF##^sxBUB4f)grlqIS9d~hJx%`O?Y(n8@+J90=K{Owv+KyZYvlG#85Lg_Dzw+2E+ z!UzjJ#hm7F2X}dtit{B_lBPkcNx3CL86Boeoz8pw(*HDO<@2COQT}-=?5Jp7SN=s9 zE%ow;m0(icVTJ;+R2kc~e@l$2q{J-7KAJ|A&*Q;F8|NqyflJvOE)=$tY^%1i& zpMA;ki)f{#ex=!_4)D)j)=d8IwC1LK{u6!Cz%FIv{>XxYU5Oa8+Q+BVeCaX9KffCw zY{C8~W|rJP|Lfnxbm;GJXnY`_BT&H?@Dx#QnyoJXfIQ9K@$ERXc#?!>WwvZ&OGM}d zp)h~@Gj2hEt7%miaC4cu^n|*!TtvK*LuqB|)AvuGNC8N_C8PGKq&_95epixn@l$-Z zf;|pLvG>78b2M}cKaev?2j;JvW@UE^^G|!CCn<$2Byg!t_`L`&iyz9@Z2rveE+g1U zQb|KFv42;d5v&|9HN+6M1%1*7ZFx4bvE_+M$B^q|wjHJ1W&M-xn^zNGa%v+VkJSe9 zao|z#a9tekGR^z3mg@~;kyJ0Hucgr0sabBI2Ll)R@^vs$@s)rc&XRk za_Z+&T$-k`!-UdZ{X+v1+lv1;r&zzE^*0n-`0+TI8jptr9v6MxJj=bb9>w=mEH1sZdZKZ`&;ozoyze47QKY3PK9bf_ct`vgbiQJlnCh^+XQBI{+5QW2h0b} zznH4ox<2~fslE)bTbxE29d@QM^RCVWBm{j2JBi)rvCpl zDp~2*b$q=5pGXhS+422|`|{?umTS6P^816we)}*vG5I;#xu!{2rnP&Pl_Rcj1O4yK z;elE1dRrRiFY=AVZ^882rP2%?vvGCPxvssPhkWmN@zDQGZyurpKs@XnovJY=I5Bz~gpoh+Sh zjxBF*iZ9=9`Dt=r-ha%GtjGvZhc%!Em=-3I<$uR_Xk0W>(AZ^Xw#)bv#}tphFQMzV zNsqYgQFLA})X7zS-Oa?E40Sg3V%x-)pdx%0mEnfw;=*HSqa95zosln@$bYwI=&>hKKHOg@AB8U=Mf%W!agg1djY-j^T98OYV2rKSId%}S}Jo{(U; z2w+}}#jnig&I9JsVZfWp9ja&08*jfrGpQdye!kqI?yk_Ke-CoW|M7XB_D9D(?)}qg z9RY#llmanj&X^3Uw-(E?vhwbuSc~?YJRJ$)7ZxJF^PVrUp$6IGG}U}i2t2AIj2`(S z+%ru}pJ$iX5!{NqMrf$-^9;EI%Db_bcwS@V3`PXq+a>`cLRsx<`gEqBZQcu_3zB$r zYq!DES%1Hc{PM5}IMBUfI&O;*$4X_`ii}lcEB07-;etbl7s3$BfEgyiNdPH%UY5Tt z(%_gDtPVGG=y|mFInn=?8qDme2g>D98VZ}_VG%`7Q=&{SQwB{M_lue$d_@;ZQwR}G zkfGhQ#RIt;mwTxYN@F^qb(8qNkf9Z)q36@^e_`lgLtGvL+>LUz-XChe9BNYey}=t8 zZ-ufK9{tywqUCu{i4S=i_jH%cr#80tyL(iWz~NVx7$U=25|02P&a>=IN~>qOJwGdJ zZDO}T0o&`OBH-|`1iy2BmDn1;Y37y+yPT4G%KTiM9rpPCwTh*>}N~O+I?dj`uI%+(|Yjr+nx?M4mjPlzq{Maw@dVgam zc(KxNSfh#$Te$Augo}yZs@{)7aT#TX>6nE(HX+-=_a{R<)1u+t)WnXqc5)4~l5zk; zeBfC22a#d!sYm1<=4~XO?q`{{SFg(Jp6c9PHPd*;adIJrQG4TibncJs-Ul?xp>;fM ztxhL+RoFaiMSwVj<ARvlXx z*sc>Cv~eW7BZIf$#+5qO6*{I>&@{uUh&IWMnz;0TT7hnKFa*B69NUw#lyvO6Q@w~v zt~-cBAEUK3>Qb8N)igh(1bjhA8BT$|V|N@~ zYu7>MX?}|NH?&UAX@IGlkF$|(j5F{x<0mjiH5=i}>`S)#jG{CMhm4QDf-q0 zSqEKNpH^GP=P62vV*-mNmxzOLms9+TzGjB&tX>e*5OIK>Wk|r}DqdUb3>$xNh|P6; ztNL|b*lE$pO(OTMx+VM`(b6Jlo9)B4tQF(jV+%=WVYMJrGkn?_*JU2py7OtOkS*Sd z?a)U!`_>)u)U16sG4CDg{+Ds)4co-KXmBib*6gp5>*bA{=9r^Urs_%J&w+8t9dhjy zjF4Xmpf?NQeB87SSnVgBzDJVQXe)z;`{rMhH_HY+wxo@};4NYoT*l;Y{~AI0w4ZIZ z3sWw2y-o3bOz~|N`PSl06Ice6W(J7q+FcB=VT^%%V#~5*pUFRrD-D-&Wyr#fgC)mU zH|^?{QQP4{OXDuxmQxuhfJ~CTLd5Db0G+V;ymSAabRawkfm zaDOlFZP0p@XE|n}(ehg#g+Xv3&lwt*A)MoI21N=Jso14lr)`>Hl7nujv#ezT5>gg5 z5tPfbDbI{H>qO&DR3$AgWJeu&8o0DE zg?QHFG7%oat$@S1%`EaGYIKH1wj60QsrNe?H58{UO-I}kyOvG<7|qQb2Tzt$t%PD| zw5KJ`ul-)3ee>_Yw&IdDxaIs7stUIEFvCWm85F*MSTrf4vZdS6b zB4aR-Kr1W>nI*!DGU>*q- z6`oNSMRF=dM({|*I8M@)*akq+EQ{|`_!OEXJMP&sVDV|k84 z0uIJJYO3EsD+?KD9b*@T*BrWrlzPgDT~A1L8`LUXN`aZ&iuq^z+=aRc_8QbCq{WQQ z4!cQE|Afo}xSdeD0Ck`Kw=E^>-LsXyl={lnk4gXZQ-RQ`)%;Xhi~bP4)e$JLyynUR zWiMXy&&6cIxv)GcQw|7hDREvT?1gzP zrGuBJB0{PhnfUw#uD*162f)V?!Cx=N@gW9P6753U<)(A9wfD`55yc>AQL1HxoUs@2 zYf#J>2@l_<4nvD=X}8_?A~D1wFsH#UAZ8KAyF&5jBO_GKjL*DCk)G@1mQ z3nh-W&Xh8(K9*xuY>y;WdriU@l13)1zZy+=5Dt;_0+Ca`q^C`P}ZC`LgL=?sm1Fx87mD`@H?(cDbIr-r{_*)4ll;*BB-_ z9^();wgP>$kkOVx7Pe8!X11Qzq@A0wTOy{SWZN2!=)+WndrAmyeX2H8u3r-xj~Mam1~spucC#FysZsR5eCxKiMZPj|IQB-R$K-pEH>1MjNt!Rvl zlxgYPjz``en%WQ1i}+JHwAQ3#%;Yk$Zu<937T{@BUWy}6#e-0Jzb^1U&6?ta*L?!LKP ze=CzgvsH@2{ZAFJ&AIdT;H=y$$<8p6TssE+4!w4ezXwjil@kC-I?pc%9ss= z12vWQ1Ys5D!NQCqO%@;l*0?;|mY{+>hw)$#3Wt%?BOF6f>Q#@X-7R5V5BqL^AK>0UShtK^Sk95kxo7g= z=;eZ#evt}glud+I?9_q>YD%7VK|ZPIl>!>esoxHzr~2JX=*`2Et%qaqcQx#~+m~>l zs`T0bmBSs+#f)ij-Dr?72bSG4!#6I=>c`-1fO=kP3j0Hh<+^(t7^sQI6KRH z2MmB~6kJ6X8MG+4tYBNye;N63e#9zHtTB1HM|qf+k$;fSeFK+gkc;3MSUiukp5|WC zm5MaK1=KW|zsUHt2u&a+a)3AVgFwm@?D74ZO>gq-5g1r*WZ~hQ0?~!Z-^!lx{X&}- zdxAunBP2;0$kaK#@}S#x@BU;hn9Fci`me_=3eZ zcH6C2xr`-o&#Z2q0oKobb}CeAElXAFO16zUxlVtYX;YT4o`f3E>ZT-8Zhs;9Foe*6 zgS%I>j4&fx>AWMOf?h$&q1}kzHk~Bj!HUi5559VrIXTbNVQRIH-ubyXfb$hRwjezr z1lAO806CGj-y?jSz9Nz7$^oeg(augI3kw9qb3%HrAy}wTnLz_1BAMSV9Z071=@VtS zB6wsID<_IwcEZG8=yNLJHh@i-9WI1BMP@k3w*vb+I|)VGdw(i#6U?>fP=9e8>TkWQ z1wNltI(2<%T9pNJS@*&igsa+kDmC++Q*lq`q}x%3#Xc^vPH8Hv~@klnDvd?}B@h`%@wiB7GPjZgCtn zL&HRHwiE%HTlNf*CNz*l4J|)RlLe$gdZXmn4=!ZRS+XPA#vUY&xSxOYu8 zY)eSG$lm@K+pw+jSelwuHJ!ei|5fS$vMV@$Gkvnoy(}(#((wdzhvcdCa&3HGV@UUG z6pcO#h+T`MDX2i7A_`AL2lmv?fQdo$Rajyk`CqPu|If_u@BU!@c-VWISzNK3Yw;h> z3}-4=32NJOf#)?FEZW}8Y)wv<_Sb$BcW21xZr5`_7+zoUS)PUyIvyYgHX4CvoqDS2 z-_&PEBrA`7ltY7__dcq zk&OH**yk@EpTd(723wE_*uy{tU0EVA(mwY2$3w_v?on*)XHpuKL$|Jdf7;p8K{0Mp z?=l_DK9)j~XaD6Z>M3~7n#}$u%%cR5i8!O?ME;NbUv=5RjgEK8)7kL`AqE|G>4m`ci071 z=&q>B0~oky zh3M}#MY?Rh^jo#FA;Ljmvx#>7o_{XAfC==!2MIdw!gYVsn&oak>KFZg*t@6T${+pR^RaDP9a}55ZQHidv2ELS$F^;EY@^eYz4w3S%$yq4 zoO3f(znj#$%tcnMTHk!0_iL~&>^=wH_Z6b&SIZ`Tv$hmI`?3?8DxR8o=VaJeW^WrtL{s5mk!<%atUZIu$jQl^raM1V^;jIFk-+pe27 z=y-JCv$mhlF_KY2{i6m9S6j;G#w%NO*X;e#-D#Eq8eEx;2*?e(ZUPJ1A}(s+GaI?ouT&+lsPtwl!iMZje!m6}#g<(LO4He^_YZgzP80s1*s! zU0cy??R=dt>f9n+s8@iOL6xuU(_cW|nO-$XeF*yWU+V6Sht4a0Si5w079mt#{ZRy| z?-5-ePrPKpFu|Q{7&fvo{!!~r>{bs4&|BKJ@EPQ;ZnYA?~wAl zgTEtgmId=ED{kMOK<7+9Tz$_m*X#ETFGgT_vWBwLxI=+du&96m7d44D?(sU^LV8F^ zG}8I~L9R>0THaK@DAFQV@T%^x{-^Vsxf{Z@3R#jKkGw>0*x^>bGKCM z7OEzeQRjdvn<703K79Ww#fd;(Jq_=DLw22jh^AG_ z8DgTpScnTyTS=+r%Hyq9ha6-t&rPfs(6&;jIkQE?{N<^m6AW*Lg)alGRahyXH_`}e z@&2ypX1d;F$Bt$ID*=@Q;+sdfWHc_z3!G!;a`QKUyP8Cde!uua!VJkJh{jhpFomqK zq3I#g7VX`Gf1>z+(5=FWIl?jy{dDLA{kxw4ZzdIhM!AU3CA)*&aN*X}IZF zr}C}GpqPa@D|!yZ0rAIdBKFiH5NozXv?hDx@;;;AcaRztzr&*tO~_?1oXXSbuIx=- z02cf<;;bEE$~n|~+i+%(H)e234UDSgP1{w0ipnYb3!A9vI=rKrZrDzM!q0%tnuMP= zfb;>{tk2R~R~GbxC6V4MGD+KxIup37Afg1O$4!(kx>q#EK^&meg-t@ze%${OZ=&J; zCEl30uR~ly9`;vUBO8Kxr)#GJM{XJI2e7t^P$aIdG1PkBFKmD-fcxnLYSZ5A&K~n> z{k(RD*V@kV&=Z8OEX=3iN+!9)_)BRy9(1=;+ug}j@TO%4l|;l0tJ6QY)viE>F#gKm zTm~gkvr^KTTE7{JQJ2tmuw0=9(Wr7pz?5u^%*ADPlW`qxFM^kFsidREVMJpC&H;eXfpw-Hl>1OTth)uF~a$G!m&ztP?-0Kt>m)qd;;P2&vzI z+5(ty;lXmJmsyhftlI20)4Z%-%DJ&$FEnT%d^8?j6Tn^OpURgsRC3_EdNff?14TWK zZTBad?TB1r0!gp%fdi|`3zlECAtb7l0V-?yHfuj&l0u|=#8mtEgC$vI`n5ifbG-iG^Ie8yrdT=(5g8Zgv(gh_ zA$fj+@nD}!2MXII$Q2_^(A~ggqC>oPB z@LHXZ;uo5*&s`8j&7=+!60uZ&kYkNNKanXF>lVuYGz15@pv3BkEc&Gc?#~nKXV7Pf zigNhq=7h-<)MOWykePqWR{7H7L-|if#8mL6o;fc#JBGca_B_2645#1IEBy~5n*e!r-eBmjt z0w!4Rq~ZdJSQ2t)2%;$^Y(_ov4Uyd5=lGznx$ZRPhiZI+QJ(;K%B}WM-i2GwmlH?k@|OD$_{U{_l}-`^0!^cPNecdy5=>KOb90iWRD2F=jZKSM3de}R zm4IPmE=C#WB<+ABW<4o-V=~z$#g6O3#wCU5Tl5 z{4W96f;8P44Xy63ZxWm&`#LUw&cPu!44uUVImrC!%Q3wF;WhDNn@NWPZcu;*Hm`kG zFQ2Kz(wWturStY}qT^#SDlSX?x0*bUcE{W5F(snER;GI|cWCPh6IC?|3~2)kjVcYz8MVse-Gc79{z9e&F_=BE+je%%9a2N zlr&`jtlMm8$T5VpK zT4rSUmPV)d{9YK5)u=d{WDU~x61DMIEWe+z~T7Z@BOQ2Sx0 z1}ZED`1u#CW_Ab5>*R@JH2fXn?^X_3-Of@Aokz8{!>+hUPO{Dq)2d=()h|i=JuE9u z$^EKaEAl7zpBCLo`LEA;aFPjUyNM$M01M?TnYR;IOZ6;N+IZAM#fPLtjU|0QFSp~f zfr$lE+TN;k%gNZa4cgMp5Y~*fmt1h3{a~E$$rTcPjvi*!auvLf*XZads3vl9?yyX* zTjkzERO!b-H>hK?aiT|KP?9&O9T740wh8l*ME3Yz@7QwG6tYPq*QclEijDW7--=C9 zGo3^|_5?;=?=3cB-msb+uNA$|(ayWW@qlCmm5qQ!vo;q2J)pwLs0o%i zcT3xS>*Itt{eCg&RiO9z4Ep7Bbt**>F~{3%Mzewn`W*m8#=>U(E!kKf zQD;^`$rZ{}O1qE`amtoJ^!jQKO5zvxOYy#}#FK|@`i-BS|IPX#33vL20aya6qpR!n z>b!$+`yowI5k+o2I&+G#?bYM$PEsifLUSZWF`mvr3flPTtt_2dM(1tlUTsVZm&ZCa zkwd`8z0?n=?ev@FI$+LAFFR8pHhNu%X<6rgLpT+DND8ZR|GLcmBsd7!%^vl5!k2z{ zr=8=1=DhNAt$urRefjKOK9mPVdVV~b7{1B>8_?7u&(QvW$d*mVFmqtq{Qem?DJ>1w z%wDG_ac?(~YmI<`-sLSNR^>&!XxxW#tLgL*03Mz}*Wp}N*E4OkJgxg!>Nc=}%(Byt zr@KvQ^J7>WRw)c=duT=ksdR4OynFMNT)_wE6=~6^FoB+gThWgNA@0{ZtwJ^1AAAps zG;K%B=)`4~(QRwQF(p#u1$mz0C5z%^{ZGU5{$B_+ zw7!>q-X)$_f)Dq9p2m!De;1kv*li+?%R%CEQNB?X9dQ&K5s?g>;zC{6VLCBnH%*L; zb^pa52M1CLn3dWHHrWwAQ6%wNqbwINMa>Tv3W-fiZW)tP5xJNAeM?eO41cs8ohO9_ z!~%>>Ng@C&&Hpvc>pakXJAGfYc>QCnyzgkjF1j}E4%f?^optkx_rjv8o_4mv{ z$KWEb`$9JxC4Px}#_{(3?^3u>ZS3h-Yn)P6C2}&rr+ejs*$maVJVE3V47o)O>gBqr-RL-kU zDhrVbWVxE&FO6lK+8e$TqZ_~mW=AmkP61^383z0k1Ub%VbxZGM;MOEZMLN_fOF{>7 z1v1u7o-ESv)|xZQgWkrDqQht{k64qQeNhxN3e!1&o1MR;1i%Cn6T7*vJFZTg(@q$+L4>P3Ks(sg$%ry=zLCFeVNX-!heYEwS#`C^@KGB-}z8ZL< zA>(;oNPO}KDMIiUxpg+t%kj3qPp@`uS_Bnnhq46yJFuS}U+zS|bwV#-i?%Z2tx3}% z_w?T~x_lRprYJ19d*egDXIA4Rr{n@KQF;9A#2v(lx4t;))pxuC)D2wR7m9&qF~>Dd zTM^oBLcrul`6OXs_<%N^nDVk!8d@EN7Hc^F4KMqTU|CaB(BYH4<*cYiSKK$GZ}7_%=sc(keIVf^IL zdUCAl5wUS(V8}Hk5S(TBwaRDj=im0W@A*<>{e25>++lSb)myFm)^D@>7?{}6vz;V` z{!Vlrpd1oTM2^A=9K094eE}_!EH*ukT6X6M%Gt0KJKCTSaBmtW#gdDNK{I45@&o^D zNO|(6r{pn{7HYV8TnB9sha)7*sO-1wZ=cnQz@!hoR`Y}679m`&^``|jGdV3A{LmR$ zhWRB)z-DVh(uZOjv%8esFI#M$kWD*n3iJ0rPmcvhz zO_KNPFor26uh6z>TSFWnj2nz>ede&6#W>sZyWAD2K_{jgl+!Evl_3iYKKU_9boQ`p z?}e-b(ew7fa%)G}41zRMs|o>n?Sn1#cfSlR*#RpERmqKVe;lF&szqiot9OivqqWca zhR}lNe|mOrFkZz&Hi8Z#h>tK3!ED_--zMsFJwuU;#{>L>xJQj0cYOUocpz>Yk}9Rg z&LYkW72;f`V}y&S#UA>=VG7L4`u}RHA@g>0yZz^!tc`WnZ-3`q&J8@RuBLeo6`S`= zC5c$~;!MVRZ(1sTM?jNJZ8}h$M8bV9#mduioCZfqk?5tKS2YlAz7y3nu-Or^0=8Jq zBqXpVK=Am7%rWM5pk}hpk;cbw7;?dLN`nHM_cX7V9}xnkUALH}Yb5dMwLe4HgdZ~} zCW5@L#X0uu#A|po^Ck$yn(^PX!nfc$pq&wHXNBtZyjIOT@3fJu4&elnxAm$cfUKZu z2Wk1-Zg{o4z4~>8(Xym&T2TB9U$iY1Y%a)~5klHfszoi=hafD3ZES-V!z9>zS+R}s zpyo^^0DH4F(+VJn-3#Vj3+A7orz~F)kKb=;SO&rFTlsvJ@aU*thX>c_usU%4>$m+m z_>-h{u_IRmG5R^*!@HMS072s4^BY={#97T`f{oy6zMthkG9nj*l=L;b|GueyHoR><}co|Br6 zo2BRPjx}}OpwqR3%uB7p+$wn}JZu+rXw&07%x#y>A(_GeZH>oZ%JqqRmglqcdkzs; z?JX!RLZussS^${?p#|!T^Vjxt%KcTLoT%l?2Wif$bhxXdk0k5^{_pz)plh*Ly7nU^ zaj8~2|4&@K`|lCv9&;G^@$B_fYmWBI^>z=Y2!aw=Za64F#AGPC z{y0a-~6NB z{G;Fequ>0a-~6NB{G;Fequ>0a-~6NB{G;Fequ>0a-~6NB{G;Fequ>0a-~6NB{G;Fe zqu>0a-~6NB{QpS55$^mi{DzcaKgZ(<$G6|V7ou0`ZfDH)1pq_z?lV>h(LXHg5>*HZPB5*&VC-BpOBtxjvLH>msJsD3V zVhHMuP>LKy62$C(+u$J~emDUCAz9CoC<*=WLXGIxH#A7MmhZmX$jZ+WvR8#=Vhv9l zV}*!67aw1eFfu6S2>FF795H0=jf?8<;Piyd`hLCQySXD65lA;ag!Lm+h5`aL0xk4= zaoGEeL)IUnji*g@-*$vwrsSL*qz(4V3$#wHs)b{E^ZKNr@-oUq_q2Z_d^m$6jlI~^4 zT6Ik{p-$){vfrkf?Kc&Ekj_n2RK!AxVR>Fp7izELGGG$hY72}q0;U}kEMv@X} z$E}9Qo-%Gdi?gm{M|zE?oi@n)u(A}Yuz@DRhXv}t(P;$%?*H&oVI1eQ((`F|yE1uL zvUzy_an4Mml=*8HUpZ=V&46?Tn6++8TQ6Hz?-OU0$%X55zG4-;jyQta9Ufs+%EiF>JE%D8mXY#&;U9zWNzchWG2f|@ zs_Q2Ct+x-Ga#331!9SARN0~5b8Sj!MAC%i6hm-PO=R2HdxxMq*y#m0j47*{PMayhD z`gyvZ%ynj}&`hI#DrNl%c*sI=xn5*sa3tq3O<9!xttQ9X+xva|f|CpSzt70YNQ1=Q zET+Yq25)>T_Kyxe!j&Jhbxo{Tmwq~9t~E=Bb^`0~Dn(c=v23MJ_>)mUAzw61boW`s zL=^ruXg4T{mNYZu=$Fy4#t;v@YsX>V{M=0Q?Z z_%u*wmY+`IK}7rfj!kJo4gH&gnH-8JL{(PSjRwr^O)N}^BVEz;cS7Zs zz|VHV4js3fo35V2#8+$2ZVd*VjEIGsHe0N`9A6NtL}LNtNK}!!0ggTtnzqsczlCr+ z<{VCud1pWky^I$HjhH(d@{V8}6y8 zgvaqOCtewUp<=5U8^v--p%NsuuMKK-%dGFYBXmSaMEz~-n$bOV{H!VQy%7b$lNa)0 zSz@VMWwNy#E3)tp1H*;V@W8MDuq2C+B7+}Wj3yj)PlzHvTW1QWSQZGyU64E;G(V1j zCku#S(4<@;@9)v_jStS-oDAfmhnqUQ%o@>-i-PeQn&YT9E^Ws{_g!1fN;-gT8^MfI zVE4rfkMMO*)el}`vAn{HBNAY`7E4qimcLH50bjjLuhJjCcf60-B8U9M-XScTS}8cAjezoS;awUIc%Yaj7%XLCzVbd>j@2i;?2OFI(}I9qs*IxhvP;S z9xo^mSs6D|_DvcTUbop5Ajy~@PyAYOUkZK0SN!Ht+BppAJ@K0Rd^>zXE zagxqs>}?Y4Ua`eE=YZL-D7qN3`dxgD#|lKsF`IPKi!22n__krc;ah$E{j1em-pC_7 zl|i6(c<^wTa%}GI@-uO+r{Ut^O+7!!!R=QHcl6sB zF@KlGbMYnsxI9;oI83bhGTzio3xFezs~?sdke{WzrmQ>34(P=r+MGAphL&5sZdCi4nuurs9-ZBVFOEkIYutdY(kvV$?x_Jdetx zaT5Ni>?8c|cghF3-Mu{%{~K8H2$Nk2|6Pfjd_u`b=0wI&GZZlfQn)6IlcNNwpU#o| zj<2sHGR^d)TfB|q)hz3naftAU2r?9?s6Al&prb_Kc}{L55|5^+<6k}dJ3TtzQShN# z6Mbra9v;pNoC)%GrF1^ZcT_oc{#&DneE(v53o@olaWG#PeNF-8lB|&T7@-__A14b= z3;tVo+e|3c%a%mva8AzgS}7eN14N35PlrMdwtXb6KK#5LB`dV3-_ai&ph>ymwwaA(`D$5#0{%K`UK6e{99*BH4zeqFaM z*Zx4_J$9|JA0H4D-bun9_$+yRJu{sgeJS?7JwDq@8FdV)I1uK1_q(ZTt9CaQQn`$N zUQYN~6|K14D>Yfhog6C!n{+uT+MJyd_3sI2Gp3SJ}lbBm?=#^>L^VdQnK=Xw(<8kPdZ$iEek{$Q6A2QO)O{RH+ zfqi?6)_h^hp0d|mR&S%`IQNOhd;P#2ZOMd~6CJq2hmZ8| ztC&Nrn%AwIXIkvb+*TDjTiCHaOg8+W;wX>I&ju!hk)s5UqUDI$rFd&Ozej%x<>DS6 z-v(fZ8_g(RE>xm4&lhV8?H?agNSERBvv-Tlo3F!dU;r`HE-?i;V2i+ zT`5JlhKYg(hlfv+83buhna^PR2;>Rb?A?IYj zC?|1ClbF40pb&E;;S11{YS0$?y?QZPI-C|V#t;z@O0qqCIM=iA@g(5d3bJk1#jKx9 zq>+fEN+H6G$HC6c{-QUMA!^?0b>!B}*qkJ2uy!;K&Ot9T)5c8&+Bl+s21caq>t!?&zk(n)dd55ZWLEUQ!RsZoi#d|iA{Bf%puGOE;30@BK-PQ~&~D9N!eU6d zmpU`GaD$afqbs;`s@ZeAXCUCt{*ltJ(e(bg*us@N8#@M;LQ=YYOzCf@=S?t2vF`aZ z^L$@7E>)6% zcZ7YUBeo`^zhGR&DxU#8&Y2%zJ00v)2YaHwTH~$aVxJtQ>ki;U-`@AR{6#0{)?J#E zGdwwS5DB?B)@2QFX?`dPboJO^!jUMUfOvm+`|`*JY`-lA^w#NengjH4G}=0K%~_Ey z9n2R5qzVIp^>OOok0n{_5$XW?+-pQ_05Q}-P|!aWhGh%{Cj=f}tn(9hwfM6De}V?R z-A?!d#OaTj;y67qLW~zc)Zd)CsvR$jkj}j~r3r4I+g~+y8e9H@4}(T>UdbA0;?)j4 z?%R!o7@QM8@MeqtJ^)sD$H!@LIeHfzAxo^ z(d=1ON*~us75#yRCO^+ubaIt^eSb|VJ6Q-kcv>Al9?{e{m!EWwx0J8`L_YZYhm=$M zPS@XwlPfzjeJGLep%Z1&Q+TUp<;-4`v+uS~ftD-4-T2kGuQ|r{*zsw3P^qn2>rjdP z=*)Pc0qak{i!2kJ6JOPClce<}-RDJseiZLQx9yiPrU}~xS3m|bUuy&!&?8WrCXf)o z%EG!K{L8%~&Z4gTBKlIkqw+5EA+JK||01>>+AC=>(V2Bf%Y9rB0#)CJyDu}7=+5sm2GtR#Ip zfdF}Lglq_&=i3+JO0`6rLk2d-g{{|_n&DiQA@=tG6QtJ~W8=`GS{~KVBF5gn4JYdV z4>IK5_%CE=UFdDsifc@sOsZk;L(6>cN9xOsd*z+*x)pTnof+`=2mv5PrsV)EFK5aB z-bbl&jf~=jRtAki96WdSn_TfoB8OUB?mwBt}5eumsQ%DrsWp>??K zyyq~hd@p48f(6YMB~O0aAi#1pG~e0o1b@&+;KI1WlwipDebD(UKJ3WNY4%f#a$Er# zv!uqUG3vjpp@||FB3Ua%Lt@hy`PB1DrX{Fl8VCnm!!|^#H&uK% zfx_fbV~kcSm1#1^ilxf%*TsfI=iFaL%Av;Os&x>GmtA1`)^-1mhsFsBcHD7S$n2At(XUzJChf&~%Z`ia0cQP0lysO5z>E>SZ7MHguy zXkjH;=#->6D#S0}X~o5TK>K`#F_}2yXveTI+pPvg7x}Cgj4Gu1BE;a%EfST-jgpZ= zz1Wgd1x$d!=a=`N$oN7Ju>5N^`^a|$TVPi;A%%)i^;=S4ho9;GBAq39Osg(@Yk)n& z-@T(}aIsf`0OdOa_!Fq9!0{q~$DT9&qpHVKg(Qypa50^bR;y*9ekSSs^^=89_xn@)cwB{iFa1}DYX+Yi(-Lkl!i~6qa;DC%WDySc zaUMgVq3zIO{7$yXzM|OFC%lsJ2(v;)cS7J)GNEctdT;X$T}JAt$-ihP1Wwu;M#vVY4?Dq-Q_C_8RSVWT9M z63RO{E2XH32BtVImm&!#kWpU4#v3{iHgH);M}(PzDmx-K%y%T(B0|q9G)EHx10g6U zl)0R^DVSg`fsf=@u2qwHrbA2UIU#(eSQcIEP#(SaO!H;wBm|lmik($I-z^m8f!r^G zpzO$)g|Z^|)b4g)cS!?Ay3r{i$BP0C!n-7G2~ssvxN9BJY$I_~c)pOHs`6)@WX|k} z9aU_LPG(oEZ=i!7_mw9n&M!099wrf#c&8k-t#NN@gle!D@Chystp1n?V^kjw||EXECCT95fOMSy0@-d#+ z7M57S@HPtRyiS1U&ov>k{C(sdW69+<1ByC#Y-@`2q64`^%EcedLOsG(rQ9p@gv#_F zS&Yeb?(Hcrm=a0q#(jGa>tnbE3~_r~QS!ub7HK&%xmFQiJnp?cUxXo7))4&zQfc74 z*s8boen5*HCJc{uY(cf@s%^9H$4sc$_z)C4q$VM{4PfpWzi)ioZY`A&@z# z36}hGT3B!#qbV&{pvA4A0A%dYoJbJ@)JNZnFaL5cIPg#;JR7gpirFzH6)`FfF;o|& zW2dCFZ)pj5y5Zjte0h$GjP?I0h=}>0OmVc_7 zBXF(-xkrm3pauvj>s=zw;$1K`X+3U{_^;u%XBSPJ+gNER_-m2^UAJrw77rMtf6O(T zRwXNz2J*G;p{wvC)phIFRIzu`YuipyS|N8S*7&-u|yhJ1_*2FW?1~&9a58{3!@=BWGQ#_M4m<@LW)?9+kC5@^SX>bXsQtvvh z#5kX$EX9w*T0nRLeM5hBd8R;V_*WwJ8DW^VA!&zfwYLf&7|#PMEqPsj5kh#5LpMWtWL~v&NxUr zw&<9Bg8)Qgcf%48fq}mN0DE8C#Io(so#9Dsxjg0ayqyq$#}xboD&L)g(h8>#WXs)2 zE~X=`A}M-!m|N2_)ipFO33N_J8MEf*5i5xwEBPmC9zmfZjRVRihf_5Ce_{BH}^t#@?iR`2vu4TV_)z9`S0gj*nyEH+DxavScdYDNJz(|;GPSE+@Mc2_Sp78mA zts17y5XQ|GrdtSeb>Lrs9@Le7wx1@ujNSMP+kOhG1qHx|I6%09b=Up;Je>J}TSWNt zuMaTj8s7spnb+5psuCb(@(CUGJbA8&9uJ;`OuUH~SqT%|qcgh!SO+T(SkOBJa^(gS z!t>C-rNwl<3}fb8M&@ye{^MTi;6<|E4UpKD?|NPDcfAf}*pa|IRf_j1 zk7pQj$MYGt9b&^auBZ*aquk0vV_~7DzEml%VanBplp&OL7}Dxey!qiL#p@%^HgCs?)1(703j?)Aw=#bwFkvbHibUHffKR;P& zn85}_#yPm$w!{(BFd1rnUce?Lr@u9ZDh;6dJ;3!q%I{U0&~l0>)yaEBcJX|CpOt zI9~Sl{}wbhyFBVVe^XVen|oysa+N8KqKX#fEoh)ze{M}wbSx9fbv8+Rd`?>KU0{Gv?V3K0jI2@R2zs6F5DZvK4D?urAQ?tm@ZLUmWK^@9pjulsggeZISPv=yBX zl*Ek7%ZQ@1YArQx&&DN2-klAB8h1f?CZ`A4X8V1=x`iCPfXhJ57J=eGy=36qyHbcs z0sk)Ay`b0$IYB!bwEG*iBT9$A73_%AEv^#E3BmbeI(4TJTe_5)fDDZG`5)af6>Nw4 zZrx^Fb|3?r{qhx>^4Qp851MibgJT(-Y>8gER}!8MMr23j+ma~*Hvc6Ymc4(GouqrC z;16eyyPfZUB^?5#K*NNxJz-AZ46;S{;<#ua=vX&I#(lU{5jhxSirHq~E;k<~TE+=z z+;e_%IwpF`B=%R-{o#6eA(`0j_c(fCU^|OMR&`(h+d}Inv+9rhRx^AK09Vkj3^Uyd zEV?7XFwH#0EaN5%)ct^ZTkZDMhj}t&WuK#%jOOoY|Es!aRwNsL@f$0nv3RLsqi!({ z(1uX0;wI*}TR~2&Y!Q3qB?AUPXqTTWlfC5B%2nmI?ptUH+)L=@oLYVpLe3<>-Bm-ic$|!-VE=t()I(yTd3-`2{6B6I(lM|bJ zU$$ulh@Jx5hsbb0IhuX3VmC~#n@>AA8j*s16~7|Y@dP|)-mr5D#24nGyeWhqTPabY z%>WGNpqI!O=`P1Am3)=-DO+f^Qm$9wp=V!;0Oh$z3<9o25F`t(5QBlUEKn6y-&a~b zfjC6?Jjl7=I1?rG;fqz5BIC8pGnyU<@7%FCZMil0oRbK(@aMNTR%X2DNG zWqdOD_!P+W=;6Ss$*RQx^JVHaXn9s>EUkF3MSTAb$W7a};8Af^V*k4#2U?pkL3K1u zqUTDDfMJ^y$+JaX<{nNAL)DF~d|}q@nzwAXzwhIGQ?tOK+Drz6(mfldCcrH~> zEv`>>UD_WC0~6%NR1S{Iu%eWeb#$hVt{E(Z&4qgjv&W@d&BXmf;I!}d?b|99C~n(7 zQrO+Z*#z5KaE>?dX~jzXC}EW@t00Q=O)b$v#dox0Kmjs^p#f0oD2?7}+HG zT49~lghOd9v(#6Rci>I#ft0vn;#YO^PuH>B<*hH`(0%V1rj*i+qZ;wHw4)eN4KXF{ zJ0WZR%$F9(?<(g$gRO&KZ>=^J{QXf6C6|i@YQ6ocnI~UIe0;s1pH0i&;@@OgXfLL@ z`U%0O^&Je)jPmSO$LRD$85JX0qp&6U_kTPLHxirxnO$Jr`^^O$45IrC&UE1L%~$8A2ZO4B15eeMb4 zL9KXx8LL^y^Bzq?fzBZ(V!!Wc?6P;tO!HgA?3rFGyUOt8qqNfNN&DXW2~WTY62^C_vRBf`M+%^J(V zUN;4S@@qVnuMQoe&z%@j5*>lE_xppgzg7< z<%}ZS5)|_NnD$P*QFMh4#g`_ubI(K?JC3LEamJZEN|Rym(n3fzLu|*<%!uq9L1ec- zSYr}>B4&eOzSSn8tI|9Q(lADztM{{RA$D5%`nrDG z*paCO9v>Noy;4*wa6RCP&vc5nuds~uAT_Yw=2SICiJh(qT8ddvIK)5sGIR^FmV<+d zD_M&3&ZCEcgOO=2&t?*mUM6(#+VzCfWYRi(3$z3qIGr^pjpq)3GXl||B$Je$Z^?r! z*Qm9*H(~oG|4%(WavMoknJl^$8#BOz!#F-Nq(Ug?N)J9dvl1?CC%5nDoNK0WE0T63 z&Ow_LeBBL5eQVtQT#G}QSWfnFQ_g4XZUAG%Vr~%Y-8Mo@HEF9JBB12)IjNZw4Fy98 ztk7da9%6-5yRp*-^f~BmPvxxJorHj~jS&>BV|W1A(k4QNtgfFE(dM}Nar0ohc3@vf zGLTSxPbk@k{$?JvT?#0~ zBGAu5`}zT5vxcY7UI`$$AkDJK9p+aHVohw5vOMlo?9q$0dNK1>wzBSGkl_MDwq63> zTO2H~sHB15Dv=0>xpfN8DU4c_t87EGpj-yv1sPrk?ed3(i&?90*cqo@6psD^Q_q+OL z?m4LsX2CGs3cLj}L8;SC=#c4y7c02=T(YylTPX+e#k~>!F}zR5)vFQQ;2sKzj7Hs9MsJ6TTFhfsVzDm`r3w!1w1Tdw+bQb&Y?BLJ2g{XaAK zh6042!racNQ&W|6A>ajU+Js(czss253mJMmCg>x63oCGRml?o#q(}VJqcY}j-uw<5 zt5#cKJLPQMzP6TW)S9!lcnQP|1#oh|srNOv<3+rGNHn^8XT3&^T&1g=*ld# zMM1X$PsF4>5mB9EdTT6*yQY=VuxLsBs~Q@EUkcPY6W;H3@2dzL)PlaqY)~ILtcYH8G zRxN>vYMi8HTzlocf_A+sOM5V$c?#kN$CA~BU3hyHa^8vrkD6_^_|(7-nI+XdY#3~? zCN=2Dg-q61kTSm093hc5UevAv$3|hsHI)Q!J5A>a{NJWp#FWjp=(Vca|C6d2*A$?5 z8sE8ZT*^K&irhB>xRCKw8h+FZ@`F1;dMQ0;p>A$}pp))`RR&4H_CFyrDBIr+qR2io zsnN)3a^&ydmXwn+V5kpK%&WF=UO(w}#Cv!2ufM~TQb=%=?6K^*^kt>qC!-)8TcME{KRn{n1FjCT7x#44I9t z7?;AonNos;EH|xMvCTE>YJ`9-yv+lrDyiN+7m;R zBJ`>!455j0>d^xYFO-cRt-opO7iV^Ldm#8d^e;X;NTb=GA}4hHAz0{Y9;afu6aLu@ z+vRE~GcGbnf=RdSAip)jxCJOH0QEqKxC+e`-2OJEoKLdSeyh(+v(@o_*DNBVtG=(j z^!={CX*0zBZvRG42G10H>ES6h8G(`d;o-@k%n`c9rT8d#IbTBI&UPTeK!4~g2Dwk&j> zM8PwSiPl7`3CTJ^-;b_3)Ox0aA2Z1sVqK2U#SD2>alO#lG(DGOcz2eWgE|r%d5oHu zOkQ@t6!}^8Av+Y<&_yo(!PJr=icPRWS>#IDO_MI=;WV%=mnffGp1i?U#|R$KA2u zL}Y~(zZvruM;>ig#lrzvEcyE}-ifH2)9lGauk#qMZyr!3Ur%I$lVv{M-JI`?Hm0+D zr|F3{7`5H3@ag$`3L9s%xh;umdW#t?@X__4i_HLd9nUMiuZq{z%j8U$)<|buJr242b6%w(I&Vu5>0` z%T1NQ8#o2)Er+G5ku@F8o+56f0PmPY9beCA)Y0P>H(QM7w={24*-3i- zHW|S^QA$A@;G|Yx`Rm{OziFClUOO}`x=`78q~x4!#pBW>3;ani`cQ64G1!H{jWQR@ z7Br?+L`iru98Dirhtm%1jV2pl8h`ansGmtE_?ZTL5_&jKDRe5H=b~4I$Hh99%PYWy za|bg4Mh_=xS*uJq<>0QP2hq@!7Uqm-kAZ(@R0*)UU3GeW$GU^MQ_}qP0#|Nxp3u<_ zI=nkExF4Nd4Ba08#Uub;-ZsG<+o}fKQ90Jww+OE&^eQ|s%ZEFpzoh-U)l>=YEAZ<{bu_5ew&({D>Q#T07V`i|++piKS#G|J(UGHz|Ci1&m7k z9=>Nn!iZ-D6R3^$!=}Ar+{|obir~vWAA6p28;f)ckWf2KlVQc(0xXsW6UM6gY*hS4 z(%Kl0G)otsx&MzXHgkL1#@Bu8^F;H$VPCyk^L6jHm6|WZI=fGvakbrZkzz1k7}t(6 z5wS!#s0?dy1n4F%7t`(Z^G%wBKGB?$1 zb8+Ci0GP;5NVL@gm)mV#PZ?EmJha(984Yc;Z76k__&tMf(Yq*JE1~J|d1syA-7KU| zhUF`?PR4#KLlnjK2kvIIkF8wH?V&}s;&8_!{YkRnG1W4^k%hp#P$eDopqM#JqzvkA2eezduk;V2Sq&u3B5Y20<`{?b*ET&4%wrDhnMt z#P`*Xx0ah~@^vJS5$goDI&TZT9xA-P@KGtY5E~Wm_Z!CLYK&I1 zT%}XjdrJ?`ehjdkCmgm&8vu-{R*EadQpkouKe{5(D2naE&&1fwTpyyDN4v7q6OqW9 zf<~Z5AQ+&V{cY8ptWw*vq$1Va>0XW4g%|c^i~&SFf0+M-jd8+ zshwWuOzyGq&V;|^lk+Np5U+Rd5*ug;_y)Z5J&wv$DiM!5H@6tMl)L_JH`HCnr-o*) zpCtZ%o4jthmc3kS;B9c({&(5&e7VxV-C%RP)aBK34Hm!8x&$4;Q7bGWHUUY4KPU?M z_Y}z|MXIvD=+-h^y&y#U`usBTkwX4i-jQ%=+YVfQU0MR~`}W+B(~W`SjjJT3BJp{S zWXxdQi54gnDS~q{fQ&MkS&UH{$e{^bL|XM~B$J2*Qsge7gNu zmMwRhjerLU3!PC)SiC8A6yx}G`h&BLtE{dP02bshRR8P)RJ&G zr0CtWjVl=#(YU{1U|<2fCHPRw{7<^goGUFBU6BPBXtODYw5b#6ZNkVO{~%NpKlMwv z3+fbO{1i@Ns81%Ycn?X@Xa{lLxn`tzkx*|gpZW;MvJ_n#AulO-EhTNN1&|WlNo;yU1V4g`p#}x zF;{J9_@_hS6n>(?xI{G|X7V8sm5Bjzjj~Yf$B8oWnC@D71m?GWEDJK8aiE(e zo$&=hQ7Sr$U(`hnL0Bb5HHN3vQ%XGyh7`-DkgAl)jMQ;3z+AL(P!nRwq>{!SuCv#8 z%8r!5RMnVD(S}_>P%Qb#iB0Ax=si_tD9q;CTFd(k7KXHG;wg7YilfE-lwx{v2~m(= zb1US6ccfDG7+xcZQq4F{@Vf>ih2yH*Zeroh>9Wgdh+KhBM0izG4#xH2)HcG{U~q7S zMZ1{ASSMp?K2KC!+f)RO`fX8(gYa19%zolLSWRNV2Ne~mmF{@8onb8fjHqA>OJixw zc0(`#1t;&Fe($J6{Agy(7f99Ow&97|K~-NX4;+jxFyLIjP*{3}2~_~cgVAeBaK4m5 z#X*gC!{rT~-wxit+P@Xr`lAdo{&InmcK~&P;LUPtyAB^B*KwsfTPzv>lBCgEvUmi# zTB=&V{s^s2*PgwC9M73uIf;B3Kj!z^x;`z$lW3%!Ke(V0j#8ZK=G+NKnmaXcKerlp zniDmcL<9%!9;X+5HUIJZf_^ZxXNe9u(JvfG7&-Qkwe6HU6#3ntoOV;TQG_>f%5XRU zmREMDIw8`OH5oWzZ81~>;h+H39h(q*GOLuj8UnxEGU+;#IB2-nScv3RsNBOFqghG< zD?^ElGky!R-no{R{YIwn+eRoH<7_I)L|bu^VJ=5u`7Oc9*JosIj@N5gM8yX@dAo2X ztIG{U;5?YNgbyg|v6%Sy<^E0KU4PA7V!QaW7=SJ9XNu=_k@J6^+!}2_9g~$T}?Xf`tZsYTuVqa+(jV z1+$Sxy&JSayi7#w#T=6#2pk@er-l4=$!n$>tpE?W9|Q=rhZh` z1D+=D=XJl#pXB$7gOAxfnVW*ab(Z^&Pl?)0Vo9ALWtS72iy(UCgmAAoW;BlH2_idP zll=ORk!fZ*Y6B-zW~--2Q{@7&BCjWCO&ugv03EUtaBEmErL>#eDJy9^}E z#J*7`Mw8Y+77f9#9`h_>P3ip(fb^S70UP0`MikJaq?d6BM^L4@buXaB+%+3_t zU5%hQ)Xav}LP`6)^PLZD>+uO0EJ^+NEph*3=+JJiJY*3&ead+f0v4D`r(pXNh3S0*6kZs_tGnRE+XnHCkd}mX}h7aGaD4-<>TT1 zv=Z$jp_RJ-8W`p{WmevT~y!+@D{TYy^*zr)dUSh7XU+79EUN6ZsO^ z7T7rw98Kel=bUA6FxechA3jg&MdKSXJ*+97W9|;eWI4p1bQG%OQ53RcKcUyrBj&Wc z%xU>tI!T9juW1~+&RrR)Ea-3LfJmm0bH8~SJ{Z3uk&WQnoaqEw1=ynq+2*9%u#h`* z+OOTs5Tax%Z(jWS7-kfb)aczgM)GORYI)lQ-|%J==>rvBEz+X{s^ipem9;vL2y2=G zT$#GoU($qR3dqZI3Fk4fvEyx8D|?^IbQojqaahB_k5B9cKDMb`4CxO=Qo>;5?6cfHWkXz%pA zj`d992c!cQKUL*8pw?N^@J0B@iMkx@)ZGV|Mfz(yU(^+tFjvo-u<{kQ%sWF4kJ&TD zul>6ZU%jjvO07RlGCS2p&rp(Dyi`s^fo2a{l zc4sJRsC-6)INZN{)_Z&n6xBw@X9^2UOy78!V!ZtMLv7c-GbRhgNpe>VjX(VEF(`5l zfV@uL-V}6K6Y9oz^uMp$)q_f@=zmuu%9*oN|E^J+Gp8;2o~!T_NfP)T(v9%%2qS-e z|Cvp9jX0h3O33VBX+{epa4ka(y9vRY4=H(fS!&mG0`|Z^JqxXJ$GE2F=WBVIM3dMK z+0N|}$td9w`X%ri7KVW(9#15j6zOh>?;UlXIiKrU&Wv*4sGR#}C=7ZFi|uw^T3qSq z$#>Ka=`h*d4cZ2DGWkw%td=`%;9sRq3c=m3dcK&Be{4ThzMw_ycjeLN`iG1v*DDu) z)&PnkjqMOdfd98g-~MQo2uMSczN5>(xXMyMDoaNI`9$rkVv%D*mzq55g3}_jxZ6&) zm_qWG;N;b8k1Zp<2u)8gTy0>Cc{S)C$1)Q^Iz%y9MAR8im{K~kG~&Sz-O5xNGzOl- z3NTtnw0ciq;ZOsCv|jNG9wo8*^Hx(D6^Pt}3XAn*-6hee?+Cf@Wxdiqj4nX21U}$( zd~HMH`JfcHE38@<(%a>;8hO4Q=K#sGRsd1Y=}YK`luwHZEO{>Q5<%Y=x3XfTr(^r| zSc0VJs|$lHL&DV2Qf*isdVN?b;_i{*@Ax?Eb=80X00@|GTIF`geT#S4 zK~$=Dqs{?@rXtcI}oFu(poF_Xy zQ%<@G9JKbr=>52mEmtkoBp7hoa{>_H?eC>EdkCXN;iQA7Y|4p^n3&#o(THAcNn zwV4hT-_D-{G}C@zUXW}$GVjOMl^1K-Iu)pxbVyh7e|C&L@qdZK7}s{mzaFX${&hN# z%L^h7_*EHXc1TFB*YgqN-3#&XGv}JbiOn#>@J>Yk5h@H8aPWb$%Y6^-aN3U@S3zBC zO}>ubg@kq2Z>sQJMA~k&3-@o${~1SQ&#p>Nwu-z8W@bSG?RFoKjh&2)C45cTf+4Ua z-_W`f8LrlHVbX=FKyQ|Msi2<>GF*&Gmv@$2#)wm#fsfp1g##6mUa^AlSZQfKVkRCR z5Sf`qmBOlSXzIJAsuT?BB$y~L{tIh-ly{y0-<%0t>T1AouPeps zakIIaN;6?I(?}CWw%=eyc*2IR0)IdVr$P$S_?1apm)0bHLgQwB?;Q1)ew$HbpGRNd zbHyureC3ZX5n~mIAaLrrucZM0j1F#9(l}xB0;Iz+Tqm~jsrCwyTfb@N5BEo;2WQ5E zAQ7z-8H9_TLqcy>M05XxbBHe%Aj1?Z+V@rZ(YfoV7QFE2?kn`k4cw1?nY5yUZ!Z6awFPuc<)W(`D!_CrL$GF)Y4CTn(Bz!Ha#aE6GJm+&Ez8OP{E|BxkI{9hK3iPDtiY#Icl_$19(W)p4s!^8m9PhOYOoszt zGlHYZ=vV75P5S0D(pdgS!GWE)c%(!te5jpb0^GXnhF>^J8hc^M1o%*mNVT;J@8Xa|`0fCzWk}P5K90DmMa8zv!yc;5>jv<(^3cEsMSO zYC-EX^22J-Aqd|CqNVoQifdAM0{J2r0+=@AXXW~Lk_t?1+7( zkF0<2mJ}13jSs^fQllSuh?_JKyT)i>8d|Hk|5kduYS|ulNG-nqvAY@VKPr#iY}EX-Fm=dm?=evL?Krpfd?0}z_AZ212bgbv zm-)#dsni${An*tlArf$`5N&})DPUB*`F?Fu!|ecJJl@i;m=M{xdt=jchMr%zSLuY= z%zx1*`k95=l}nSFqqBL8x2IYcAX#-<&(@lvuu}4u)!Py@(YfTF+2nhnSZ<|h>9k&d zy|5dA4m(RmA72GK(a8NPF4{<-RJpT^L4*?qD`~;1u)mFA+yi-CxP))$$rYk*ngAkx(|8~Ka7m7| zl#y25d7p;Rip&U)UiYo8+;Z8?uB#?>T`(|GF{TxlP)kTub%1wg;yRYDQL)|MnAU3h zOnnn@fVT^xAhj0dCTn{!Z+>mlBOeTx>P*)RXmcn8Dh(SuHjWBgDmucDI37EbaUY2~ zUmLf3S=QT5>FK^-I1O&>3}-7Zr%__Ow8b`FFd4?e>0|US2xGzR8)j{O3_(K#3yt>O zS1CR;6Q}QBL#Ybe5gM0{2lFWK`UTTR5)LDM1UrD{uBPq(qaA1-tw?4OPfW=kn4j;5 zdwUfE=+gaR(fjrXdaSY|Wr-XC zP|l$?j#QY#gc}q0QjwV#71VYfpt$8>+;OdAW>ulJd!x|5J)lv>pGIHDv74o6|E`7ST53x2CXzHR?5%Q z*h!{rHE3~sZ^0^rV_=1_{oKsc(& z?VI0g`3-F`{<3>oPod>ifJ9~<0nxa)L1bYhEKDfDD>V{K#%TsWKXV3djH6nVg&12; z(wmGcgoAPPz~9{{3L)z6#HEv;+!u^eN5vH9BUh{~60@7*5^UXc`I<_T)pE6tO7|L$ z_6Srl3k{f+1Y?>7FNx%NZ>7a_D)3~%1muH^ybkCD`K?Y zy{)eo&Xn_rEE@uMHBa`p)3YBso-74pEWiq*W~U?kq=ibxhX{AP-rl-!UWuW&ymXh zs3Y>K8u1gkGAt?TjT#oXw`$6K`z>Ap!yD@{`P$#Fv2~2ns?Ks59P4u3atckz@9cA= z+>);FA2d%Xq*<+WeUmhOa3UXK@`8!9&pGcF@vhC=BTJ zJJ!OgA{D||_a8WqS7wO%gmBZ?25e?MrwzA5YbyQ{|Mxdev^`U000lkXzST4P0ki16 zJwngRUK@&=CVI^!@Xf`#n}37(3UmkS5~~-f7Dz*c?Vuv4y1594djJeE)DA)?w8GnZ zw?17fL?(0+X!u#2AfO#A(%53qX4>OXV1u*I)u-X+2xD15W{lz7xGg;_=rUkQJoss- zXo+Q@Zy|5ZLs%`dCnHD$0jU~r<5Cw;u}Tz;JA9b@xT$^4No^WQW8Ak z2;XDQ1lQ?$Q*}Kmkl}k5OZDypPkzM>&fv0r-Fr|@zybFIo!Obcv~#uo$pfRuw)7)D zuiIjX-hNI6c5QZDJLd&9KNZB_XjmvDJHSEWqG1-29W2(Lgd?i(VAY!U4Pj1TVLd|| zu+H=Gux_pGQ+)rKS;{bmGX|;p<*p4wv~NQj7a@@~{i1<*V?FgCdEfJ?p8ZRKl7cBd z8*^05I<#tmIv-yoG$gB|>GpPy@kHG-&!C^RL1{HSVQMB1I9Lp?yCA~gd}-Y1S*?h0 zpI~!bOPiW)2Cu{`@M=I|Q8XMfBInJ0zucgSGzlH!-&t~YHJu^fLrOPzjV|j7V6+?s z9sk@{#=IwaH>$Mk`v9p7X@tw48vIV?qn}f~T3BaL68RqM3gKo8qP#mv(c7DcJHD7- z|33Ze!oRyCQt8tgY!!?+lVJRKg%9z8P;&mE`I3e<#M6`O*<}x7qJuGMdqJe)hZwfD zH(=QRO)wc7onKPYZy#Uffeijn!BUnKxx7S~c<~3x)qE6zg7WH^`gW#K$Srdj8}GN9hekbzNE3sTI$=v4KG>Tg?R%d$sUj@iz;Z!!NZg^zPGb*cNM?5si6g>1+)*mESWEfAZfp<)x!d( zD9V+Vr8`PiTg^uwT?f~e9X0@kLc}DfuVV-{*8!%o&e!yixxIm!$M8E*4QK7WniL=6 z8zC&DTUdb|H?>{9W*&|BU`aw?A~wVafjta-4y?ICc$A~QnoqNv)h;`;!8W&0-jZSg z@Vb+fl*1K$KEu}E?$LecUVZoCCGzR45E_?k_FNW<0lli{gPaQKs5xexSU7+7`{%BT})7_hDoJd@t4M@V8kxE81mpa_;1tYpaf{0*MAl`Ax z{RlNJRIO4;JD*g91T!yjFahb1e~vPhq_}%%2I7IQjU;ntdG6mL@V9Av!EIbp-y-xS zN~9zIE9!wYdhl;W!)SHNZwKLs!-XYNh7=mx`=3Pp+fZ8NfKGhtC-n!M%(9Ioh+4cd z%O+4$VXy;Ot>dF8ByS6n4_bt0*ci>=3!894aP;Zt44JjE)4&L*xET=Gm^Q$Eq7Y*s zQ_*lE+M0p$)c9FK!v1`#TUFPJC`zhyu{fIqe&RdOfi*SUC zy|XdzIzlG4_DJ*VM^^9ZH2WLXT+X`4nm!b;OGRg; zJsLfe7);1ls5CfAUfT@I-i#He=1Ed*0RBA2%%U19Tp=AJH9V*^f-x2d`$%1jhkuMX zPQFb@anrs*L*`&KXR;wB0R=2{PKpBZ5T^CDFA8Y`y2}gQ`dyHLdJ`qu9*z*f1j zD@MUHA-M9-arbDf$eI^XdO!|&>*`5Ps|kBk->ow=AML$HQWc~h6`s;t5(mL zgSfQ$j{_PX((ph4R4eL+QSLU+VIh<`b*FK!%)c!b41T_MXyK1<5=ZiW#gPV2TZ8x$ zzeQa$=#UF1wy}j{`$V!gqOv7NLqEb{B0YZ#siTwQZx@d&59z4dqZDyIOKGEw1vbIg zxy($mGhI<>OU7 z_)v)mMDvhm|2+7OvK1(;A~yECZ^d^}_U1=hESJK}YyI`}J#$}jIit+ziFP)`2>c;o z_q1__jot;|=c}S+ckIYGBL0gRCr55O-0LWV;Ux%Te-XRl#{(F#lwFqAw~>;9Nu|tT zMwW!>6sEDTpcYL!pT-KN!~C0y;_1%Wxk&utcY8H}PO`}y6^fXeG7>0z3gzpRDANPz zLZX3Y9|}Y!eW$57+tj#5r+UvAP|GSC2z)|nhc(zPYoiFCJL%mFCl-J7p(sqP5|$+I z>D-vME3gYP1Z84Kj~d~aJJYQ^DDn(PAwKF*eVe1hSSq+oK`{s(r7~Hd%r9bZB%P*D z?pP@s{$h-J3gPD?vFEa}{-~Gv-$v>Ve&Xm*?okeS`{HJGZN~zE6}um@84Ygw*902~ z{XU<0LE`Bf!^GF7FM)IkWyi#6J2x0>DJ0$A{b z$0u%Kp=TIAF?89+Ae{)+n)NW1<}pCPCLV{1I*M@|tjz?hoNXw85&qQ_I-a;VgC|1t zEuhwD)eO{on8jn?c{6-m+=YoiB-h6*XE9J&O7m#vfT5INvUqC zWak=552INzJ4g!-MFg`wcc-7J=H4mWq!e7kkMKA^7|&Puq}agj*X&nu^Lqbam|h@B z|65nG0gEQfiD8OPqd-6OE9|{x^_x4nwckV#S}jBnM&(&vQ0TT*U|3Dx$i6n%dnUJQ zk#+$v`DSHf`=O9s-NvQux%?vXY=dAwSW}TfY>FgKeHK|wWZB_S<(_%)xXNO0g!Uqr z^9*4XIkyzbR8IPO)0d4`vW)d-NEv|~o8d91a(RtO;Yz7-+pNKp^B^NqZDl1R=int< zZgK#(U%>ME&tnK4g9$66MJ*?DJeKzuV~ppH-$%`ACctrx49hx`sI_0_6$J`Eg$#Gv zjdI)s8cHGLs#`ifSVFabljOhn93^O%1tJsRz`Mp{WLGvgQ)Egq3HL4Ry06VD?q=>2 zNkL|O3QtD7!gHBHfS;{1;NoK>cBZn&?%WRQn}JoraYi01$$Q3awvazHI(n-iBBQ)8 z92IxL;W(i}B@W-GwP#z$8f66n6KXCGqnI6~#oU(OITaTEiwC9&+pzIgg6N@ZZa7P7 z5R_SpyQl)681?Y-g@Y5R_fUk(+yY6P;fwvt3)y`YL9l7~juR2jFi%mWsM@V}7QRbJ*B&j_yLL{AbJ;?<3*hS+xxw*9?-kFJwq=9(h~i zRyeA6l&gT5l|$OyEuoiLI^k zE0VJ}wWF{C0TJ6Gi2CE!b%M&ribFb@H&2Jch2qMB3he|ClcY9;JA!0z4^X(exSb$z z0tO8Wa$vHoqs(g`ChOS^yS>9Ci!oJcG1R zd`;$l1$}Y_UY@jtjil6;&q|oT8K>)Ibe4-EJw>*{8E>e0rQjtAhtLgCBO`iDhy0FK zFb1!vwTAzueP1awg^{^&Mvx0vzLnoGRA3C$fjr(_yCU7=q!wi_3~oFOj{;{XOD|4w z2Gav@p!=fnpFV6*wvmKeP|G_mnnpSWFALF{akA4c;>lXz)Qf|lB=^eM+%61?-I;8E zqw@h%4#;uD&=)Qw!h0DpW{gv|?pn(cN-53MxRcl@G09V8l*s^AIWb+`MFmXYv^(`H zKk1i!NH*a*`2=i|&0Qj%1x~Q+OJ?J=`cU_5s$iZ>$i z1MLWP9(|8C7c|}*21-fq$nu=^qW&yisPt=M8!@wsy<6qGL$yg&=oj$C3TWmK5qn0o zH1_Dgfx<_PKJF7wa*Dqe|DFuv)jLa2%8GUxu$uT9s?dn8o#@KA=5fV6LLMRXvHE2y zkb3IX{k-U~7>;G4@dFQSl0f{iIZzSf&glD>^uv|&e8ug)r#w+NfuAQ0%a-vDKAVB- zhZ&>Dqr!>tH=JJ-dA=}pV-BK)Y}u+0A47!wvv$iMKsa-|^aX@K>=5I!CvZRp&cX43 zASI_wd;cCdvZH+zqT&O6?#^5Ow(SY-6Yl8v_Cm}=Q)*p2yw&iX2PL-{)`8XLhCnz| zuU{RMt81woLV#)$&JPX3RnZR}NT?8lry5j^#<|Ovr?HeEe3eQg@>hf$h>P38`8^PQ zyEZG-3F;MUkA~K3B?Mg#7Ca0E20Jj>4+%wP%OoU0>6LgZD72ax0ADYi#6k}S$qif% zJ!qtUf83)02Mo2ZmX!lW5(r~Jpq5zxf!AwhH-f~@&<`BrehNUgB}Kr76A+SBKD1c>u#srxK-l!dXLHGx@vyea!? z>H~`CyXV5qI^oaUZYP461_uuFO9CO-YolFaLH!w%Xp`8FOzEOnkBn-AnU74NKr`DD_xP~UbsMUL3Z7FCwx7* z<96&Y`&ahlJG}D{PxGb2Hd1h3bM5Mu0Tdo#;?dT#!2SwSsXUaoQ#*y0anV3A(|e?Q zNqz^%(Cz50H$8$@lG0F-|05b=FAFPHFnwuxXv`Gk#LY&;J>EAx9UOC4xgvE1o>pdhX?sL$ztjOW&L;6Y6k98V z92w5~R2Y?I+WeSvVQbnKyisZ7E(eI)EW(K4n;3b89dhqlxxBHDo)k5KCfwnBZFz+K z#il}c@ETvW*sVntU5cv;`^e)H`w$E#pn!%TMov%tg8i_WI()MW!+9Rrl~)m!tQr`P zU#=vV=J5=ku5NujwJsd2(EBGofmuO;AP(@ub#1wmc)@BSVqwq$dBWVagJCv<;M@(d zdG_6G=-0>1F>Cl$A`H>RJqeb@-qZJor*T<(X8pN)C}F*fUg;u3Z6SiX$8W9$8FG)= zc;)$1{{A{$zc$~!LfbvL-rwCG1HtYnI-$7vp94JOwQ6+#O zaSyq`)Z=LlFiCKs6Iw>Ai?{B ztt;!=isC9_@s?jS-L)izlh03ZKDWgsJ`+s_zo%DiYdsRe=5Pv3&F5VQZv%RA!@`Nf zvFz(yZ$4;_v9^&J+9%;OVp-hEs2uCC33K#AsnOOzoKfal3c()L`$*(E$swzo)Ce2M z@h-ox)fhYpQG@N$(x(m?S9xi-Cwuso|&sph)~z7>%LXuox@J7FVcO0jJr0AOUa%MH!x#C=Rnm^E21z`@g{V=SMi$$BUnz82#y)o}9prx95xNRj3m6icJ&Y=&OzF zFeJLd3tAtEMzT<#`o4|D^>KN{JVeuqj17fSlC`-yXPT>s<|XxtXeT6-d-Hj1kyWsz z9b&0+8utdBHVfU>f1b9z?VtPO$uJ89muq*Yih3OP$CJVCGnL93qQdj17RHmE77#2C zPdgvaf8C$5HyX0>#*}#+9Y0n|FD+6WyseS`5ujK4B=B z>>~|rlED-^`bdAA-%d~cr``2d?DF`}WmmW}Mts&yKiD#g4UaHQG1Ln#O;7vvC4Zx` ziZ;XS%7FT)g+Lh1K9=}?9%kWA>}YK7ZFPL9DR@!Z44{Ru{I`|r4z}}ttIH@yZ(x!k z^%yYNaJ1@io?5)9BsalLFkV^Obj+`bac#IoT*Ibue6(aYz3Z74E0;O0#t3&l$FGLb zn|S-S(a^NbhZG5c(G+{F5lM4S@MZY*)J35|=>D4IV&id{i2#5o?a61nnldmDLdZel#Zomcs)`ccMm*KPx z0JO^7&mU;4ocrl`pf@{k&=^K|JSPBk);?N6w$Nb*1*hp3ZZrnn*58E=C$b;iim=2~ zT2EMsb4w%xX1BqiK7JZ~!K}{!m#>Vrk9Z#+b*X2&!v_$YkEd@6m&0*BG57N8_dvs+PprmaCjA1D#h#7fr zn%}5}`}!fyyPj2oC$~TKdA~(JK6_Q_(d}B!H6xg84ho#J6KF_xcWS>1 zQ4&pXhP{%*>V6ZOUr=---*ls8yx@V(`?l5lb&P1f6y05jpAQv0 z-m?)tlz&bRwCrV^|U=s z*QCM3l1`~e{3&k2$W>NVg>)$Yo6aQ%TOd~PPa2$!er&LCW$BJmO}50)T0b`<99p1= zR2`KdSpz@%(QwHVU6*ZF0nODzemU!p*fsU!p|0PKQMR#eC-{4M51SzM($#kV?`-`$ zU9g}N{N!|J`C0h;D=+tf+p_%q_~mI}v%zrLzrzYJ5NS?)$Lr#}JY!2r`f9g}UoF z%+oA#ciF=zOQ@3?j?W~qvH=dx=ARd$8Gq1Th zV6%aQ9POO2 zw4|>QYBVSraq!F=TnzKq4LrK_w0MEo8j8-04C$TYK|tiYDWD+0LWKST6wyR@8JG+= zdL=%_?@C{NS{=5fw#O$DDWk=+Qc=#bi`DDZIvr1FMS9W96q5=Gb~Fv8Dd>FGw6yYv z7!&a&&Kj2Y4Ufz#E~$yb2~- z2_W2E$g7@%zR#Xc%!>eeI$d4PTfqA+-t+AJ5QUq+69;F1M@OKf&PA)9s6pv(1BV?1j2iBv3cU9|{^IxrVvXid(pAM{A`5K?*nH%0Is zUn!R9H6xcG{Jc?%ePHzbeIts%g+y)b0HSWt^bF}c@PW-PY_?v-=m3uK;lMAv>@ZE{ z;SEhR1ym1n!(8oOH}DORNOF#&nz2%IW?LjyJ@ioqk+F;|kd2lVuxOhoW}EK>`wlz6 zbl;a)JEwc^SA8JXh2jV}a<)}GAA*nCqF5F!$(85Ce8vwFIBG&JiQV4bKtD4Ka*3 z9a-#`;T_qw+IH!4Y)Xy`%$GRslD+O(ch*n{p;zH}1bn%{$|?N7Jalv_XSoXF@u z*JH3#zif2krN5g2V~2ma^Cp^Nd~{f=QaHVcxKz)xx+UmnzHy|c$4tw?WN1EdG{s7@ zxF?$cXGo|;cC5wtzg71wo4*5Gs#D-uX4}_5!+&N3| zp;e){TAHh@67~I~1H$9=zpijUuJxUYjY0gEKln?|MOAEgd&($7cF6?MFsXAE;R(spa96_E6uKBvJgLcMn|CgQIW44q89 zOt3ZDc@T}d97R1sR~D7TFg3QBLNKR021iHyIv8ZHD z;R-d9EwYN+hP{wx3u^UxU?_B#76*6qP!D5w*J&E)FU3w<~L z*7E14dxc1NN~8Jd*#33+=U>zo;{pBz6&IpM=<(Lz0rtdrpqVH~2z=+{0h%zhJWDVW z;VNRD55uJE3*Nxlc*WsUg6t#pl=7&bt zTcNYX*c_O!2)DT}W2wxmHgV>?F83qL(baQBpptfQf_EBpSCK(T_$!Ry_mwM_Ag>iG znieXA%U7){uZnzt5!al_Rk)kb)fCYbox~NJyen;sB+_G>IO3rRZ|cS_l8{lSmq-1T zEy~4lSI(3Hf%j!=J?-r=6*7Rz{U6p+;lJK5k;cCa{+^+sp@k%$zf3735VHupHD~m~ zxadylm7{bsw}e1Y*1%*c<^z6dhJ9K3pfubIJy1P8YU$}jY`@jC;1|nj4Uu!9>dI@0 z>%zR)5Rt2vGDY=?CfU+Usg~*177P*&?n1J8>&68Pu02I6RV~H(`MHE zY};6hiV1Qsubu0iYO8wnyH4uQK=Y<_V=XMPf4n=r!`4~Mcy@+UKi++G$h z=?MgHTbGo5sh7Fnp5AUUykt9Ds(UBc(JgVT5M%G6gspX|+I9^0D@ubF9E8?a1R}7? z$9Utv0I)z$zindaW9;G3o_IQ~>7XB;2D&p|c*+*(gguki%XU5RlzxRaiP8_zPI}*Y zqVN#bOS0U2jh=TpMs>jNP664L*PRaw5Yp#Psoj>xold9Qunbd)Xgl6^I)sk++IcYX z+4GOtb97YynEg0tSV9Y|>3@}-)SmU5vJrhO-;^>It2gDnkHwpU6Ycoyg{g0JrkSOj zhUej}z_?3NJtTv~lgK**nul&%-}I`#kkF@iW=x zHSfcOy%Q6>7tvKM%7@hYhjp;j3KLrHtNH)goGX1= zf1%p{791-r<#hzrmc9ep5Tq0OHR?fpXvVr}Ld-p6*M+QQFLYiXvPMTr3#P15C!#q~ znzI_@LMast5~-?75JsrlS?5U&WN0~|IT>qqoK&E7%xzMh;4orGqbKbO#@6C6sjZ5S z(8%01y4Xl75hOSY=P0Qa4JG_(%SlqJ*{W1VKe&VBGC}TS)9NU4Mmd1xvgYR4h+kAK>_Gr42eR{y-NG-oT%>uUHW1?F|NivwsCFR!aI%3u-I zIy+N{6zGVlo>Kv0o}b?dQ%|e7hwN(n<@beB_JB`XMPDzeHj~f+*@WcKm73-< z!yek|)Jo;_0c+_r{tG5Z)|N_@Y5cByfagf()s^R@rdGWh?IS)=HmLZGqqsbnZc0&0 zN6n;F8eDHq#5zDWSzD09F}{F1w(pMcl*0L9&8bW~xjgP4=|zfr{gj-v@secYMrlwO zR^Sd5(buL&k7DPaX$x<{ij!j%k(X>MRPU6^mVP#EFdb?^5`js>x?j&7{ImQUfo}T< zCd$Vf&u1GCo`9?3!l&*$SOl)jo?89lntKckWavX}w{RIF-7HPSh(y|@KeWZfn*=SN z@TtKp4<;Jy)+QRE4sCvs!&G=msWc;{IioreWb`HECTwvm0ds_DsU5Vi;;|tTd@7i_ z;oV#yd4@T0hSiPO0owWjFo2|bboxzkqAT#D2#+L%$HmThRTYen`{uyOVMCJKO9Tzr zSD-BsTVY1YXxn`W>z0BK7pJI{$~QHTHs+1A7uEW-*$LWklrWs|-Ik}%K9AZH;5t-H zk+=^Tk_GrwzL+VbmD+6P{@bbG>wO`App`Yp{0Z2i8cU%+f@=kwiIGs9tQV_zlR5v5 z>l4Jrv?+%#dc3C@N;W;bH!%qlGmJ+u0VraP06SSJWpY`3dABs=*N2#>gr7&q_Vlo; z$JUAt*|*i!CW$2Vb~N;_7xTE+5$m#q!w-Bk@dF`uC|x5o5soUgIlukjRmB}L;KcGZ zXWC#*w7OK&_wrGj}+Z*SW-^O$`=Ej7g^PjK||0!Gw+^iY>m>-f!G8nS&mEwen05!r1zlQR-` z${(P~O8g=HL!hMW6BkskTB=vFPD+`sKpJj>3AJ}s-b}q{Deqb!@IEl{$)4XgUCL8V z%fc1P;1QQ#L`?;W#T;krv3Y;nnRsh zIX8a%(LFaunQyQN`r;yr>?mzTlK`iO+C399jSukKr60F(o>+V3tk zF3eHWC-?r5m9LMfv98?tqs%Itw77cXq5J*&XoB(f69_!Rez*V@vDY6pPR~<2Q1|g! zckZI+HTe1Cm7`DHtM}cxoTVxQJHiFxQZ``RT{YPp-Zo zK>#=(0#4)ATDfuO*DEEEN8}#+e(CXd#zk3-=+`ejc)eIk@2g~T3bK?{f!~-SRnFNdZE}r{%^~70Rpz-PHrKeXZIb-L+$w!U9-azz_>mfcqf@D(4 zYyvgAEFY1f1S-6u21hx(;Wk?ZydJY`QMfH-Gqs5AoxoS~`oXR+<0^kMmLJXcCdPZT z>l18V9Yx0Sw)??26%WLF{O)9Vj}UlsrSGxvpsDEw#SdCN|M|+jdrQyf-J7R>TD%P5 z%QA=Z)l{76kHCe3^hArHNgKakTOM`|c;FG&n#j1ZxYu!j*OT zjdDwg`V1{t`phF(L2y7|$y#;KF+yvKF- zQffuinV4T(e)0izv!PEgc2R;ChL9Pkno(PEt*USB%0q;6cs3j zYDWaMM)8yW%s9sMvul0ya3-l`ask$qCl$a9Dunu_0swmZ|NKs?OlW&*J|BGOoW5_Z zdb)Mf(XESq?xhFGOAnIu@gVsXd4{}n=y(AR9lruEjV?LVGmny6Ej!u#R30BDz~VAT zh_3Z_M$qT)*Y1q)($k=+r$MiY%*}0NI%6Nhp*3Y@+A)*~y1-lVsFK)^!fjtDp{SmR z>Yi3-Pc}aKOXKS7(&8m5rZDFZa7zgbV4FQM&#G8ML}9X7JIGz94p)X0+|DMN>{?Wx zOY0q%2bycOGhB4lvEKOg1gr_~fh#3mv%5EG9*~v!yUS0G(i9Xle?;Tj)yC!b@p3LL zoTg!QL>iLDhqqSGe@^0oopg^uQV$!J6I#8qRNFRaAymYZFL8%gW>2~I7FSL@ zz|B(c5*i5te7J~{^y$i9@Oz9?yG=jEIvkeUbxO9>}X zO(ONdq;ci>=c^}9QJIiTL_t1FC@UpJOCp@N0vjRartW@n%AG$7k}F9H0!g`!KE=eH zx@;geZIB*<1O|eP$r@mwM7i{uMYQ;2|1Y2Awtm|8xeDHq5P|>Pocrh@-GVem2%aiw z=IBG){4OorgwzLQn!uHYP9C07NOUnKbJ-rb9~aW~5wQPK1dn+z zro?`sAyJ6MCxn@8S3d0d+y9(`VlO$-6gU}Ixf9#3M{*4D(XY-fKbey`sh<`vQ?dX6 zS+Ws`I>ZM{&kk^CAaSu+!m;DW@J|#GEH6)+&WH5m7UR%38zvHH%W`Ivh{yj0QBgCD z@Z2d-7LjORxHq(AcAhPegmt(Vv$ydf2eR%DpTCq#zm!YAluN%jxpcl-2GaYRmBCp( zBbcU4B72l*HZERYnSTt&F(aBQ&o99*MmIVCLvtUVVJf;O;6CQgV_JTP@>ax6R#4Fi zMENfBdJ56qo`11&?+B)gG&vC@A9{WcNGT@#Hy(qb#3!flt=e5YOB(;463W0t}JFvvr%Dr1c8$+?&Qny_VX45XQILZY6>g-AP>}NEe zoVObV|7fYt6@=|+M0stK9kg+YO~L3=F;yjM?P|z%!&3s6iN{xXFbUNe+aGO$)vD?> z0KMurAuT(6ie(kIB)Al?d(7XSVPXr>h(5fIX)ne{y5Lr#OHi*Ps$1*yztr6R{0^ap zW;IAC@luugQk81OA(}A?ZmLA|1E&Yk;7B4Mo63e}5C+rf>|~HQ#!#VhDc#!CmcNit@nO^I>AFv*=`<9a zTG`*K%F(J=e5n9jn*xwQ_!a8=pdi(oEu8UWbzJ=?)xgK69O8^|UX#<)Jjv*jTPx3= zQR5Q#=55+EO&Y5cXPRXE>^b-J2@FoleWK-MrdfX)59d}t zzo{-G2nR88-8jWezo;Di$9YKV%^h1jarwP__?0km;I1xbmKOd7ZAumo>R9??T+~SX z-0(us5rZ=S#BmG^&n>xP-gJ-MfmM5U=A~vbSTng^%Ew=&e(|44h4|ZcZ3z=IQjQCW zNie3rKqPySM)Shlqe_+hK5~9>`MHJH4Y+!>p^$s&i4!mz?Rest;^wPfBCw@}j~XX$ zlk>EXlJ$F&@6u-i?JsE^1H-ny_S51;0mYW zd27KS+o4jwbzh^95g>)x0^@rKYXt0*BzyZsSPw`yikGfU+JyYFwG5q645(0Ex;DLZ zZCXzO`O9#IzcA@^hx42KLWJAhsPV&{#s_Ed1%Y~NF!?gEjy9SG8w@t*Lbwe1J}@`Q z%@esF9F8D`U=p|qV4~{2y-Q%YzuNO3iPPrHntDM!@ zPnMs2&W8h1d3_Q{Kg(nWR%TDB+Xh}zZpO-u(pHR{Dq;zG7cVnW@~x+GHfswId@7bh z93A1Q-p}e@WgE@U>|+(COE1pR>Nll8=%9k&*C^!Y0O)zrJl!-oa+;)@re#`6H{JJM zrxXn!e7bhpJvT>1{eF4}py|ht5ktP4%ePNcq5ksS&z3LTQoVYX&wtPC`HtOjzn^u_ zURphI7ENp~;hng4 za%swDm{*<-uo=wwGo8ueDFr#eOG4-YXwg;oB^_0_?yruH{x&>%+90FtJL9a0n0CM~ z<`<&l>qhR^_ZyeaP@N0dtA2BY$8rE68ichWvNP`AdU@&5Nr-DD3C0YNU_^T99tE&G zKkJ^p05ynh1&!MP4)IqWxHnHXzIY#A{djBn&PNc+{p>vw^X4+`0=U{eod-Xz?{84O zh%S^alqNF;Sr}y#U|7NsvQRzbFrrZt^z~S~^7B`q@8ox$Ux3n+bj9B<=NG57?3~7r z=6_Kns*UR>+~*fr8W@74c72&t$4@Ec$AVeXGy5S4oZ$pr+t=+GGv?>a@Vos)8h8u^uizzOCvriC)qx+pboUPGhHalGH#&jRD45;@U}Sy|c7% zt8sl1^M#mM1on|QAPJQvZp!`e1_MB~*{!S`(OLVd_@vcsSLG>B-qOP5lm;+N(6(y+88S%s*y8GqXKcI+XM|Ij3|`&|Eviq& zuC*&qz1Lw?r)rIGic_tX&}vi9I1fJEKpyd7VnXbcCRK$Rq(?>a=f!DJMU!4UEVa+; zu0{=Lu3aUnWP- z^YvAMw#Eyr0?R**i+B9X`CR%82yI=04G5hx`war!8N*+54gzk$N;%tT+qwrgnZFip zJPIV^ft^_l61oShb9>VC;0_8&qN%lNUs!JLUoJeiG~?HUCK7j}4IQp;o2^Vwd*b^h zIMISc-@yLRGJU_AVWj#-UF*X(h5ZJcR@J8d0NPTVh8t|KK2>bmAjN99hSp2p8m{&5 zxnEV+A_EY7sFK0=b)Y8Kw^pmpNCuD)JG3L|hQmN?U+iHd_SK@i_{#qxdvy52F$WL<(5H!6tSx}pCTpH3UbY5T0XD+mT9Iw!-t9+1coVueV%w>7Sq zqRDTXn8XPHK@-r9IUEPowlSQTo$c-=7#fG@`PfdgZivpFaYA4n++!Jso6LgSE-0GwCT!n zfpgT3jf;ZJRy0jD!_xrc7CbXdH50*rdiusdmt$eRcEuq2LRdoj|KVqFZ)aTCn>7)%H|FI6S(_g&(AOF!3GeGe|wI8E)QkCt5In=DjNDHAYoT7+u zn80X{f>lc!7Eh9Cif4!;KvshynB(}yg&AXQlags+BAbE8F5|*< zM~t^eYt5t(t@pr)G~?GUGUsvq2C3VCt;>j1ODepu5%HdyEEhR$Oc%=OQn7HDUT*L` zCR5BI(|`0kBEt@y%?0Lm&pkP>Gldz<~T14f~Zo`6Mg{qXgsrJ29c9g64a;7vw1X% z+gjC`Y;y^%1-;&A&qH*`ppIk}`v(W zL0ZkTU#m8^G_&!-w-i+4(ml#iET8|fF?WvIF*CE?lc({e@PB8&5H{5!VFP&IEnbFf ze-nZV~y63H3+uty6r!*^KJ_q$pXtx-@K@T@G0df7f)6}9)%%tz>3w25D@;*fNtJ++9qQ4T z;W(i&iDWXmJ|w0aZJR;|9kf3)rqil>=MueDaU9?CjplF}(LiVp5~iV&W)3LZza zL1%i>$yK8g&QE6!qe%x%=V>u~L{##QKiK3bs+Oi5#ny*QB=!;x_Y6|uB;i*8GjYJ6KR5!-s@kC!To*Yah;se>z^mM)!Bk#~;^D+Rn z=p4is!-7MW2z--<1_t0ck;S7p`@~{Hxv9y)$xZ!)_3rCK8*|Rg#$vrtAmhZ{4%r?o zRy-Ovzf1JTlgUj3{qgwl>xrR$ z%;!zNMvA|IkB(8aw~lYu19`m zV~IZz0R9Ih1~#EyOXYU}k$}DTYKup#`v0Jj5xjM@4xh{fW`sgc~kq>RG^ zQA4%Mkhr&n_5qPc63J*{0QJJZ$sv#G?1B4nb_-OBkiQm-8cfBy(O!IDmdZe1OBJ*M zV`6c1l&ZzDrSf68g>2v_15=CPQ&w`uCHhx5b$9IEgO^|ipW888O9wz8H8vVW_&@as z9|y)(WwD~RFFh0oaZHkcOD4*4zpjGLwROvq2H;s!4^S3{TkO>8SK3_xRhR9 zwdJu+5F0!m0I?N9jk(X@+t9#Ii*q~h81@jfTKVQkUFIrywY~|j^OADtP=kGO69l^Ky!+j39JVENJD-RlHZc)E( z>KeZC_mAA!MLY+MD-UQ;CcLKZu|;?O5rDJKL^qP_Lf+=bXXG=UB zj>CeXkaKJ9IK*?h{K4{ruiV88W@9ZK)k@7459$Ch5WBT88bILE0$wpHTm)mPcbK4v zi5MmQ&MtC+%@oG)l0p~a#AF2j7!AoBhGKwyShg^AWkzyn6TaRIU`8@G(jV`41_nbi zlA+rPU@t?%0H!mUh(|}@QXh{d;vUb*a|OR~xR{yFXLs)X_dVGR0C~Gq%Tx|;1u;Vo z|BUd0+H=mLwkad4lmN5R`!j_p3I`BGdKft*^T1tQU2m7D3KU)v7M7~;!1ALC2Nygd z0)rSA=n|gfPy!>;p)q{J!FY}iF}mBA7zb1Oj>9W97<4)l9%5&NsSw7|aa~yM}5cp9A4j?XcD$TTd&u$E9t?Iq!9UEHN(G(->#> z;gmXS)}UHZb?*G8&aZ{TK67dmZFz33<|N_g*r@J5*FKCs9iWDJ_TXP3C-FBhSKaHr z8&w04Hs_(c-5Kci8N2sO^SKbpTc30ihuD0mzC!9l?qk6Qp|;G*5FXKVtT?3gzlpOL z4;o}+_k~qGH+0wBoATqbp}$g<4nV8N?c)RL3C}*eZ-C`;=2>hBGk9YTj9KfIeN54- zth;I@w(N2}or~61Uc|!jHbT2gE}7qG%Uvr~vimIyF4e%BtZH8}&9zdhCH;{q6YPx=FG1!h?ciEeVQh5SoIChnA|yOj z;zV;tY<#HEh{XpI^`XSL4G<6?4_1nI%GxM=6rV@!y&cc<)^_pRw3tNCV+GITnCQx>@&;g?n=EU5xAsE)F&6gb8@tF>a&^+Gyg-Tn5Cj|TxH$RDd_L2ObX=$nIHaE6B`T!y>iJY3fyY7e8Gw6J5AYYJvIyo++grgF8LNFtGh#oZLddfLLawE zozNRmsny{l*M!~L_LQr)vpE$2?QBr*;59%6Zd0Q)e3Zl2C=q!4+Df;00Q)r9D?K3xME8DZ# z=4@qau6bj&vn|`(ls>iAaHiXV|D30~)T1W5^#z^e;Ru3BYiJ~DX0v|dF??T`T{-dK zH%LJ6U{5}CPdsj1JxReK5${(oYQ&i^DGYkeQ$vW^jtEC*qdapFjaul;C_y#M=#+{8 z1y2KyPrx-Ac4BBou87r25Vb6_uH=tXV0%cVRtP3plfT^%3lffi{xV6UPc)#`X(W2p zE=BaAkb!d&l1%!|yFxzw3dMcjv`lQ9^n)w*wGyq;W%+VO|MG11n{&C%^+WkWK2tfo zy;QA6nEpGKg=x#wYbBuMu{Z0*EZ7QoN@)0S#V;}NPQ?4cz&nu`8BlC3`9)ebX6MhW zX5E_8Gx$Li^R;}Y01l-*HUV+jDVGa}sn;xGJ`F6Cy;qswXt`9a5rAT2!Xs285+ z&mv7NoLjmR9?|=gGfbDNmn&c%(>3-dQe#{sL+AtlEAY#{_R;EBA1oie>n_f^w?B%- zxJFb!s5uZI$T>ua@R%v)?xLMR)&2Mc>WW$l?A@OyijoY1Q1kU^#2kfqMbOrQu^LcR z=oY7>c)#RwU}-M@)1-w$skqMooO7!AeMRb=QLE(l?Q<$918yS4!@M#0jhCgGpDa-S zD|{dp>N&@=VA;|P1_gF04%pgCg&`RZ9sz!RnEru|MVtKYQirDL z9>_k%I0u1@_Bl1&7H3V>2bc@uMMj(P zHr&ernf2w4Dn(-e8Dy6=y2&DnYArMEq)U}F%$!A=;`yk`udZ6AsZs?lec$teD2cNQJcLqQ%JU5OEnw zB!`k#^P(2WEdRXr+H2l2#ArAJrqJZRd(8RjK)#%2{#mSJB=AcV_rb{&(;lXfEel@P zm?*RP<*t(+6?w}!97UOmgASMK!0=SF`BJ_5uV_oANFLQPg#)nbr=5TOui~~1)6~qK zmy4oF%sFgBSzMZz#I<6ob19mbnBd2hye~!4J4XF2c%vN-!c)>-$UGjbB2k-zcEpt8 zTK%mJ!FHe%H>nZYgs+(!O0Z#EliH*!)Qqn`J2^Fyj61PdZXi1}IW(AP&5SQhOE6Pw z?=RVHBcK{cKuiGU2Zgy2Oa_SY+je}DD^?xh;z2gXR6dIenc}{BW}k!h!Er@%wK)~y z9SLiMPdg5(ke|-i*zs1!^HY%lGe;4Oy`DKRb35Bg9rirSJ6e<-7t@%c(+AW<@Nk)! zW6_t}Nr-1|`58rf>*WI8!2eNXp0TnK*$_~g{z2-mM_^%oMa&p;v*UmzELJ@EoXRdz zkn$1Sw9SA0RN2S z3}bqqqCQO;Nyjwkm?K&@*tycdq7QTnEVI>M=$ULSJq2WV^On8IH+PPyOM>5yHz&@A zbXz5v!57_+NbN<`HNktH<$)1X36JFSUR~f$HlUGrq(3A;a3wW%9Dm9{>6NMn^ZN>Q zDb%bWtYkNnw-FVxpQ)xkNExz4IOe}RlJrbWj4a-iXG^cZMk*J#4PJ=iKU!>@A?Ycf zezpAJL-*_@L}SO1uyNp|W62|k7bt0gCC!pCktCOiGMHGV3g=$rkdVkrUDscZM}-L9 zYtfrvbKUZM)i^}JDyabx}fw7ri*8GnRA(C>;#|B)zj&9v>gK8Z_H(!jskijv|KkfkEKaYVXvY%3+UNBUtnfMIlSXd3uc* zhNHW;f0O+P53wThrwi_TAQ-VS|<%;s23k~9@Zru{9|zxKQP1n0l`384TuAAv(oiy zX0pJeS4OnUjD9d+{j2onM^jP$-rsb1C%RkextfLSKkA6%d1BoIxIn5AB>1#*4P)WBj! zMRpPZ9nO4=PU>RgJ6u0zV>c3^SFHL@rN-_Ucxt4myDBNfGby)BR|5T?6z0d^P!b~| z^D7Q1X(L-)0jWpif5P~wBcxdbK2u%dxDmR{>i$x_kb?+n_~uTSzIVfR{iY5$g$8P-r(`PnPCW#8rXGKBt?H~>rt{#PB0= zx_P=ZD4mQCjM%u&?v_x4U#-*%qiF2+B*FTgckurQ#%*<=TsoOEktrTU@o^}Hx$Q9q z0Fj6m-!`uQ*tq&vu*Q0PYWc|nfgr{Z=-+ZChupdQXiS=!s60*9j^SE4C&w`zh$tR* zH4^ay(j#(K&=Ij4AyCG66Zm7Jybce$g137Y+PLs$YMp;=XztCeg8__Nrh<6R~xToNy`SoU;|fWGE#ac(x+s z&`l|)vR%@SXQH}4KUGV4*FnpQ!tEKRNiJn8OO9B4I;FH@%2t+483Aft5ucGeM2Ik% zO2p!*kBR0)y@{DHMTRUu%m5jT+P`u)y5Z|G()Y`Q8u z24*Um7#JQH=^q*xww5e9hiV~<=8Hw@ahAm#MJjGB1A{jf$>`fu{*eFsOtw}Bguq+> z9k{22DR^Keu%x_er0+&qU139&MTR!pavjl5TQ8H0T9k>Hq>UO?sz$YIgbagrR*Oyw zyW0PtB@HW92FBqGlz#D&d+M|0lXIBxi_k|a_m4v8qsJHA@88E|!cY^mJb%%>dya?I zChgIn-R}Igrc|=6$rehGC)2$)6Tat~|f8a^uv} zuJ4aP6p@&_9z0tL3bCkBi`OB7 z`I#@>+aEy>@M;o861~>Ci`Q3AoNZi~TRwB5apl44cbAk2OU97<)iw9S>v$0E{o{@6 zC)^Xq8-Jf)e*cks?IWFQSb6%_#_6vhW|ZHE8YgF0ug*ef^@XGE{qqpn>&#!4uOGor zXk2uq_5K)ITDaA?zPNhi@9wj2eW3!pv+p-9+~;AyPQg-6Sjf|a0$;(!_Huvdv7v~0 zn|8+9vw=wO{q$dI2`_m=%IkrpTqr%6T1A{nUIlpMO|*A~*4kp{17|baJ9}pyp1LYEyGZ>c42sU05xhaYFNDU=80;GkXxc-ft56IwnUPw)%5XTh7q&1jSCN`_I~NfE%(%;)nlg- zxf^)QD)4LwSTY2h;P!w}}+0dK_p9~G67 za;>DbmNvKYPnNe%m-$JOX7!FN$=>5c2w{H#!35oo4_X|0|cL`E{K6oJ?g5p zN^wbjp$xo5v+wwZe5<5MTx`O9!s#h3EA z<5v6%sRJdxTpIZ1u?xGAy<^a&3(wuh-!#6swR~e9can~TOhCY_v6;Ji2eH6SX!X*= z#{A!ZTD)vTw+1PFg6wl*e_HHIjN?gDHf&%E)Key?0sP$niJw8L!k-rB@bWa~K3;nA z*T%)`jUO%mNC=y~v~YLj%#ZHb&qz@c65A5_noZuqmS13Y1-rH3vvw(#1IXQnYx{AN?)UFQaP4DrG-WNse_y=5I{S(LkRCKHK3uu| zwR`*ri2i7_vv@nLJpXuUago5**4B$K-%H7L)xlhZ|_Hx#!MUS!m?m|DGgz01m-#V|62$9?|ubb^x3ol;5)PS41pB$4tsvSS!jw z`Q__aS0<{HB%M)NX%avcL57M5RG6qA|K^@L=AQd*>Cs2-Nl2N&IL}@@W1r$D)TuZ; zG4{U~>3ec8hQ?M;oW?7NSN+6s_vQ)d0=~myAb>x1T^hv$deT$X4b|I#s^!>2%!CoY zU$k=L&hpL2>=66vyQPJ%+{L4K4nU68ICIh%BWZDE_SDkChmC~=s;Pnr()3EevlE>Y zO$LJ-jkBg8LCl5Ep%*uAuRQn?R6z7OX<2P-DidVjAfZP~W>TJVPtbKE=p~+iRSjr+onOMqC*aLnQ0aB>O?~C!L{_ zOBIfKU@U@1ihHFZR`R?nWvr7cn7N5q+j4dy-XEJmEk>=y$Cw`k>NHh`dhUs{?){JP zAn<6G|Mm>p)dFl zCX0>h3-0ry<4gzdL-puV11v-yp(DzeHR{o#+BtsoTj^ahZHZh%AH zeS98`{hm5J?es*AdpI-s78Af}n`y1qR$_&NY^~TD`deCwRSL4Rf{&k76x?gbG4Mw? zNpC*D8#1ypVZjB@RLa=;)=sx!Bn6-LO{Cz{+(3%7A=v}PJt0rdnab1)wH@RHymP}{ zJljMiMRBWf3O>eODw7Do8k`D*S1FQ!6Tmlj*=@3bnW;lcb1;XWFrX7f@k9*s z4k%Z0DBt55iLpW=6qnkSwv5Sq-B0D`WRIvyRc2ZlTCFO~E8AAAGHb!MRjb@^v_<7A ztrlLr%3N-nDp+Z`wWwITU#OCm_O}f+D>n^qs%VwWjayT-3IpWUl&z9cay#l)KCZPZ zT&)9NPd%t*!d`?X)H>lWQWwgXEQ~gk%~5!LXg5DHLP^=w2&~O?qH4^|0IjH65I!$d zFKRabVl|_d_Jeh!Y6WX7*!2@Tvm>A*f@hP4yb#ed-%&xr=XN*7ItZYpRc&7ct>VmD zh0sld^z0+BWtU}7Pj0OqdF~!NO{F-E z(_gwbzgNuKW#^n`0w}XHZA(Wa7p6Al5Z>wpN+OhPWS2#FMFOP}Ua{t0T3*?3q6fK` zEeao`!0c^_AEeVgPyivdBa0xcB@CobVx&KJ5A?IdOJ()DOB5lg<$?{-={MDHD}g{h zJ^vRbHzS~!{DXy&L(~sv`N2K+!|ULkv5nEIfAaJz9Nw&a$WIlq&t#i}>eLXeyVnZJ-t;GrXZDvYDo> zbV9^pYb30YMXKv39T5jfkbFcsD1Uwli4xQ4rHtgIjO3RmBl(qyK3)neUXZ}z*CB-H znh=#IgufY!H4X9bN)>*~?|OO9ZttG`@($hR9eVB2ru^waoAIXnF17ULC%F7kL{;3B z2Nyj4^5S0_=bzaxv+mV9^xDW+`je-L$q7Jl?;lsmHe8vX+@jFjE7x^*eGU?{@{Q|1 zx<@X%XYWCIVz5DLH9kGP^zS@PxjvuUVE}wH3&x)?&Cc%7s{*-&{4z=5`G#PNV zG>f6-D<^pJY2%A8y(#;rm7cUGmsmHk0xiITB>ecd7T5 zoH}9^Q-r*|w|FNHRq>zvn(cdg=JEd4;}VVj_@}KqlRI{8O>f@2ck9l*+ji|tZ`r#87l_ z&6{moOaI~Z#MwUvWK5e#2Te77r zkyh&Gs}Qcx;j#uc(mIXM+Fqiu}I#k6gTCHc%0 zAWmc&_dlXzX_N%FA#~iSdK?|!hcU==g*Qorp8Ij1+1Xwyl@Uqhn8BmU%w+Nf5J8BU zf#iXB#z=H%;tPWxL$rVLG+Ua+*hTX{;s2bB7`lcYr3xP+OP*cx!m`_}W z$kb4DW7r@Uf+?|qCP3>Il2MaoZbpA{P|=~Nrr3;3*9)~gq+Nwdj5VWe*dVjMO&&%Q z+D4y%wG*Skl(U$lI7_@RBX|n?5@by9O(&*grl(MxyAwdJU2+szmJ&W}aC95%XVMrE zZPP^?7kp~q8KuM3X^@OQ5v6fRjiJrN7->05Q3v`m?uP_)PTwjWbYSaDki8GdjMFfg zcuA`DGRat0D^>8GU>P)N-fSoe^II)pg1rH6x_Z^|YXDYdf^=$v0l{~;3QI?4Mwqrz zVawB)=r3Fh2z= zETF<4Vz#Hf{xEyx1z9^Ur+`xmrP=n#g=g#(U_f?-=er%Ua9Y&wNc=JKpl!tMv3*S@u@zoU0cAdE!rP?_`2#JiH%TsUSqJ?Q2;AUmtH-gl z_px5A;@#!^H&4=wFP6D{R&+E&*$ofnO)SFX593fw0E*cBfhH@ZOb%FVOFMpDh>1%0 zd4%js53}?XU+~cZ`!?Fz6lE|!Ipb)UUoYlyrz6%y35OgCXU9@S10SF^=d%?go-k#C ziREj~v~6%w)lh7qQQ7{Ki1 zjM>N2Qd9al?$4H0St2~^_?zrc|MuHqjqH%%1Eg3F1}=XHjFf%a0;*L@^-9)BDRUK` za>FOoo^6vs3!>_s;nuts0@*e#gG~6P6@u?@Y?&@Lq9S1n;Hb(7g*`&o!db5KpqpS& zYf~W$qcyqn8zk06Wji4Qy40Bnu1AI;f#1mRK-7{hF0i^S%PRZ<4Fgz6_>1Xzl2KAJ zou4ijoM}v-h@4|wh`9BvWa>0^x745^<@aqy%iITnm~6dMkfdwZt=(nY=(26wwr$(C z(Pi7{vTfT&7rSip&$ZtFdpBZ7+(&tsk&(}H&pEDf@dl&2LHxZff;nQ*5u;40UQi%Q z1$|Z`vGH5$4x@i4Zi48lcbu z3w-Xb0;z1e4&&ZnR#?$sIXhstSRi@OY%*?=>wBzkpLXe<2P>5VMO|%w~sDA*;M^5jS3P`9D0{=hHPMmR81>cjM(Srbda&ueCmCNPUXtjoU7g@ zdmc~s-)K>Z#xCH*1jy6CZjBf_l9Z}9P{~J|UxNevSs8ZRv|JRfm)z+pqD4pyA-?wfcqhMe@L|h12@*x^Ct`(phCU5CATCpK(F#?eVzp$6}fK@Yd$}T z*K1TsXKDGU?7!zbN7OwSA46@5as$1dP_b(Bgx7TENA-Q1BA&>fFVgJc?4#Yr^dZUH zSjAK&uiKN0qp8)y5A;y3k zL4Zu4H-mZ1zSxnt(qYb!%B{jv9YVAXDIE{DCSz>Bt4w8X{#;&zEhYyr>S^<8epZ-; z+vbpq@GfZdb9`lD`#eemb17z+L{Tn^fEMu)k0K9&kKi9kG!q076zCI-oLNi*<#0@g z(~nw~^bm){^aejwqegx(X;dvyfJKc#K-}8UsxB}@vxVu|V0dPKK%cvcR716aOUO2f z0T0>=CJIt(ru2*ETO{YOTzKiqy2`3Q=78Yu-<*xx1(2?(j~^2GS@JJlEXDl0+?=SE zMYUrU6&8Rh@+IB-5sv0`fD&EoKh?8qT6_V54ARoGR6II~MUaFw@60L4PAJBxB>0O~ z@>l5|2XCerxJ_iuCP#*z_a&>nqtU~iKt>N_(Ga}G?S3aHgW3={b|W~=Z}gl|n|~La zUkdxNoMJwx^%X}Arg3LoN4UR`(`+23N7}TekEI{adJAbYh&)v$Q6ucUH(po2sq8ix zWaV%bf0h2(3gb*Eq=>XWwag$VBTH`Lv1tXSxXi92NcK$h=*m}|;gGJv!i8HTN1Vl) zwP915$dC@2n#kIsu(MT5+ZPAbBHh0Q3Q`9d2Xwno%p*+&&3bR}_D4OK44kkTx?ANC z7Su?=(l6#yltGS-AkTPGYS+$L5#pwuGnJ_nE9srtxdEe?Hzl68}qiJr}XG%(*cIBWficLlB=%^^LuqwNmT*KHG0T(b4<0 zI9(ePZxrs#9Z&Fe66Ud7vXl~nR*(s_`h&dO)l5098U0&+Pfa?)2@~ zyGTIXj<RX=~(ZI5v{P{j#WQnHy!UrODGW8?pSfkW14aEns8; zo;-}%%URX>Fkz}+P(W&WT=7 zWyFg*#G{C!l9H3}=VlWcH+qfSGp!aUxS{g`F?W>T`lp4Ln1YU-*;N1`o4VaXO%j5Z9cLfoLc}K1-fiFt$XXS&b@9 z5G}R(kg+p$7LybfDiZ2|p{MIhxWV+^l*|%f@OMn@b?(1#t`vr(MGa~Bi1ncuDd6U22^qj`b z(5z?}eJpS~i#2!ZD=~1WhD>7mu*$`L5ZJ`QcXn(Wv$wDGOM}5^P#UyZ{rdA5vaq3Deq)QbJ(8sb>9~ z%PT`fZPby${(LO9{w!Nh7g+z*tnXl|tdLS8&vc|wDyk-{TCRTo7PZ05$5Dwm`L_8k zU0%EReD6=6QQF-1k(_rOY3D8(Bk=j|<-<3i9*2c6a56e03mI4~nIjTqU1NKj&rkmk zLmeA)j-;II{tSXStGGhEsK1q&Ah^PFBTh=Ju%DN|92G%Dj~@ZGGGVB2-#w!gC*_j%OoW(`O9 z@x+OHG}}R5PX4M_o(7KagY!zV@}Iqoa@;9RBM8pdIANA9Wa*YI0q=RV779{G7dFyFnXZNHq*gZ~7c z2O?L00hDq0lHsSl&~kgDK?R+RPH(al0LiYw72J%kX!bM!@Iui>Ux!s{HzeC^f6(^U z2fG(V4#Sir;yYj6*5oHC1A)k6V_rVx48evx;z z)oC3QzRWG8g;%3Y>B#B;pNTU9!ciDy>-Bt(+Sqfk(o8q0{J>Su6x0b+1aXgjIza;k z4Aj3A^UK1W7D|-CeaBKC5si#)4RzQuX)w1~7i89P}W2(F8akRAn%M>TC~U>YnflrS_%~9wFC|=x&Dl zHR3`{-mP#m*-#rSmY@D$upo+KL?ob`kASZyriqf_jd{K=nqF68o!xMy7h8kZsWk4M za2OUFdbhbyz*Qok{8rL}vesNG0`x(lP!0T5UqX5Odpwr#x5#BWr9lgw_=VZs^5-u{ z!DP~7g1<#bzd`-0IUQqd#se!Rf>5TTM>QQPQUs}0)@lxwQ(F!bx#Zuz+yDn?}#KeT?GpyYDtf1Ue zA#9|VrNPMFkAV+kO$a-dfFvstGZ}2jvA6T}KA)l0Ui5CgT+8(HK2O!P@2w|{vYGJt z$u|-KtXxnU7+yykjy5}u_R9(bHjs|;$zH|CUg@hU<`Z{xQ#co97tzedYTU=P(j}wZ zw&1(O9{PgI8Vh+-uo5x&y1$3)&KI-s*>p`Irb$Xmr@6T5=;&qqtf9 z`nN?OJv-xmUN6g!7bj1ow*+s+Phdt~1sYQK;{qxADKpRPHdvE9iE4-OXOdb)#0MkyM78R^k`z@K=iuR{1n?y07rGUH;++ z-bIIDk+(4S3Dou!h^Z37`@DXPNPBm$U8IY>NwMMQwiZTiHjcE<_}GL zz-*CTsDfj_@xH>GtNJx|bb-W6=gAAEc#f~{*xS33-rddRy{@aFp~WG4s#;#70R|u9 zBD`bVt8H_$JZ&)qosH$~Rtyv^LF}T-Zzx(u-jConc~9W~q4h+gH^qy}2vVh`p=zwv}42_?)I#wW92?gW5gQYB#)yDBl>aE}i4T zMHp|r`HaG-NBhi+L0sxQrvhu9Q9*qQchfL2m-zgOU=W~97mVM8?%B%-TDAGZ=$~oC z9kQ&|@hW}qx9@s(Ua=Vgg$U-8*JiQ;-fN)=bbYjC<*tV;M(lJi>a(;0bSMn2+@_}q) zaGwH0miQBrG5hW(Z&n~%@xa|E$De$2ACI@mauO%j-7w6+56hlPtNf|_1crLA0G|y1 zhh~JOlop=t=^qqhC>^H)sCCX*bx{vcy3A~Xd+K9y9Vfc!BULvAy9E@qB7BwDg0P+v zQZ#z|N|0)%p9;D@m5Z_P?Gy|NsO$iQ!9{z=OpA2q9u=pJL)8&G`-S9+8Aql$;wph z3daZq;F!(s(5+o!{-F}(|1gUB+F=E=8tRAl*KGsW9V z4+~`H+~nwct{o|HBDc!LH-{s3$Xl8h%{RfJH^y)G87xGS_B?{x^>9V)ux@(W3llKF zL1qy=)GT3v1=(V06?g|71Ld+T_f8hN%#k8hwA*lZKfJv3+HP92yOxpX#HDXB7eS&c zfsglEf<^{orDoHs*Y=H{D{3io9!^SGOzSkUH#)ECZKxF6WxioExlQ^sISJZ)3-wQh z1;gHeWH<{-2(KX}-G#W6VP>JJB)(pHKTq{7l#a_x_@F`u4J4FR$%E{ruZ|?YQfNRF zY|kl;gw`BoiB7yVk?M*;@&^idkhM!^#0gP|3o*T88{x>2_d-|M;fMQe{=V+KvsoUl z1Y>3t5gdY3D$lHi`nG)pv$;w`9Q$Qnxy+Ld&+3K3up{ke2);E%xMGifDNko$)q1Cl zetwi6FS{^q9qVJJ8_X)=49ziV8jZ@JF(OW**CW#{K(qi_V-zvSW|P+~ryb6hQ2|UY z2G+Y|N86OcEwX0~jRLE3BRit;{3rF&Od`D^pFyJV7b1@4OI) zTpXNK57}GBItkxi^cLUI%{Kp#NP2!H(bj|ASgz_T9>Xcg9=TyR##@c>^o<%5C;@eW zo<_I)s1N`L;iJE>C`sv4T|-|&!#>NLk&!p+@F>=qXqn5eIc;)a{)Hw5VHjFL6uOxO z$UU7_rXUrx7da6v{O*VJ{&|nW@|(crTy|DpmTLH^JN@m4wxDaAZGi&noxJeb_#cJ> zrOaqbd*72hTNW$Phfafr9q@`j2jiI#SUc~6!6*IQ32nb6ZW54#0L2Jkq^J)ix9X8$ z+p#Ul6C|!2FXChU_?CSc|3gQ{yDa8y?pxk83z_h-r4Jca82SfB4h!vxdICJq4<7k1)kUmUS@V#XUch z9*h@=5b>!DV~iQmX=montQr-AX2NPCvn#m3?-Mr)@PaE4;Q^9feQ}mr&Hu1}Sz%$| zNm;laV3Bd`$JrUN!(Ih_D}^cO`u;y3hl?DSvLeXB!wKw&et4p!5+&qd?fcHr^k1tQg4)-v6Fjz|-_I6v(@`;`72LkGDasB0m3b&WTAKXs@p=e~_ z;$YC^kNG$vzHdVNv_$@WVPaimVe@?4t-T}kiSs_1V4GbVu}l*AtMAi{c$kXEQm*}s z1A+~DXkXswXb`pa8=`;&4|Ot;MBNWU0a^&#t9h7M?bH9}4USxaewguz(4lx)vhNIb zUk`zTgV>)T`_KXT@f07pGA~~m&9om8n&Yhh5RxO43ytV4;=$;P59?Y!=&QdV?rjXb z6aiJldWW7bYl6>!P>gMYn|%vczC|CyOVEAT;rqtHpDiP1I8iwZIV&+qd8dc4(-@Ao z_`Slt`e!IFn?#Vre(yL?1c4;;ixLP8g8?zyp9jN`ygrwlvAl;=UGCkjr|^fY&K1(G@F4^xpSe0_$#`CTvd#@}LNFym=-aW`HP?!gBkmt=!Bz5B%PapbVXODSD<9|r_;L$k9pDR zU?TRvl^Ciy;`pYsh((yn5gxKD(FTwSlS@q46LZ7lKL>K?128~({w#*HQ`#<(tYgex zn$C(0KhN|D5OK}Lc|m}u3_4AD@;yrkRu6h!;9HGbwTCP;|%H$>r

z)QiCMT{4UK1AnrtDOSZ{v%yo(n$L~|(x=fAw}8rvUj|Ax%x%mDj=#>Zp6H=ETb;t5 zHCP*BI!8!9iI&^JI3KIH0%2{dfo~Y&Vj&awSG$1ZC?SeMz-vGy5Z^!&y`JDIU4D1e zmoPdyZ*BNRn^tXlprs91b>(?CgPL+=eL88vlB7wBK zbv6a8hRgqc`0+>>Ni-IpsRY)}-~&5BfMFr^vuVIq0kW|jQDs1h61JpH9RxVRE=fp( zKnV(z92$*W{U7{_fU>VDC8^@HNq`Itz$gYN;V{74Gi*K=z$Cgg8jE+&Mts71W3dZg zc~?wmA@~gF*kZ>FEPEMvkYq|U_324Q<>%f05YX}u`W1j7@;F%4NyQgTZ?!sB^oyv} zB0*gGs5gWCWt=mSNL-}2vkLW0I5sH`km+&7s2$u@&YNd=ylPhlWTd2FcAc~!3*h8t z*lKc&oeYZqWDczqOu@?XLCn?!fNP>kJN^S^sHUPGii+7 z%)nR_vSs*Hr2a*gVR(Z^)t%T1V2MN9RC0`K=i2t+{)`4LnogQI;c^D;~1 z&?quUMSJz5D8XVqcy8jdeP~f~W%3z%4|{q&iUKSQKvn=Yi4mtj~CA zO0Lau6lT`ICgRvrfrrT5W!qoAk&6V|$e<9EB>8O@fwaEeshHEZkh zrSTCAaX{c!Pwzwe;DZ^ZyDm04%b#3hmFq;DLvK@Q{ulnwGINa3-9YTo4r-k|%4$ zM(G+gQ(TXl5HEHh9@|DDCzvA?Aw@*$Q0TImIk#+kvDNHxhPhvKpZCKmN0gP)r|zO>|R` zSMh%Di=o(ik=#n(0nql!J_KAbUhHtOBH=Qh+h%rdrWUt)2^)vgNauCP((K}Xb*#&9 zUp#~muPfEfZ)9K1yS0?BMSK#jr-#KK$U`(f!9PMAktfK&H%#3Sd*sGISuHHjOd)F;&YO4v{72gESKBSt=3bLe z)80=u+aPM0g(Y<4jk6Ulv6gD8MOmj0>$AQUs4r4_WUB=LZMTJWlCrROkaw&VA`#9 z#j@_;QS121bHdV;6_k}T+*@QTwLSrYW$NV_mSyT2$#HJIx!f=22NjI8d{2ENMsHIg ziewIWF$>2uh&DfQB@%VHlTYHvXk9L+l}uqyFP?r9I=0~0qMQPyG_htlwL7H0QZ-Oy z$M{8bu#gK^NI8dzcul22FsA@=y8E!6Kx3b)fhwE~gC7nzcXtL2`x%gH@_0!nQy)2z zB|(qH@8d)8ihbJr`AvSLd%9d{d2IU9a!ye9)yXr#r7`He(+wT-^y4bom`W3tWyD)r zOBwm>7?y{J+&{6AixiYrBzy~f=U$Z7tg~UTlsTyy$o%L7-(^9bwSn|wB38+xuTBGe z_@gM^MbCP-l=)KeW~LWwaN^BZ;2P2~@Q}{~N?h%OJSwo7QMm=d>dJLSxhaRgf=I)w zRq2E6;b4Fv;L%M^n}5y}5q?CfV3Ic(A(X=S8f+}IIS+RK-!P@oQVwmCurzaTo`HM< zS#tp(y=lsHewaucysj~oF$ISZN0O78ttHe1(%5o5gZy6Ke*&3)EDj;2c87xsR`S%GWzIws0^HB~H|0UGAXY{+po6-hL(J?=vrX!+q#cH6JXg}Gl1Ko)#0_k>q~ z0J|R88NgsbEsn=ME(aZID-C_7E?=$u_8=}6KhB$Waz^QaMq#TMb1%+o^7xv=)S!CX zb%MgoG55?2=z5!3pRK12jGCMli8x8)}jCs zMMzHZ!?D;3a)%gPuUwIkmyRU!+*)o;BD&-~1m95H?kJwpR*<`N%z>MGWzWo(0zFrS z=6G_&Y=nJI%AXHEei3mvoYs}B?CGrx1GW{c6d_L$bFC&*-V%4hP3oDgHR+~23&hej zIM{~L0TD@n3tPli;V`^O_&a@-$cMY<7D_C{RMMDNXbRi z$voEJAse}PV_*IIes(33IIE?}PwHZd;z|tb907?7vJ5pA3LIBcM3x7-({@cQ{wZc` z>ilC;iFmpHL#);d%mvKBKX3f61iD#;dJk<9Dk=#NxGnl1*+J~O)x*geNoouVI5uHS z=yW`wSU=)CU476=u(f5x&PIoWpq1{tzz3*Lnoz01RmgiCcjO;j41EFG-onXU=*J_B zJ_*-fqTdDT(G7^=)sn-wCx1!XT1RolQ(TcZUoNl?aA?vFdCtfyLoxL_L+$ znh0jv1cc3Z;hA4_a7aE^Go)%Rw z)-O?jbIWOSBW#Xax=UoAxZ_d(&PJFht>w~lVWB@rpGm2VwyCp1p=VB+S1RQaS4V=- zvKH89qT^(FCd8)rV8f(SH8K}NQQFq|78%w&GhB#mVe*Uak|DI-{j?gCjV@NC{nNOv zELc1M4pF#P!jhSv#vBHb4a^<%FylQh+igCrAza<&Cy9X@rkBu&!}Wtx-I#kBtvU!a zAY|a~x9Ksl(7$}6#>4(gvYGo(pHYWIen~$fpErG*Eaq?#P<*lb)5O8SLD88j8ea=9 zgXX8bW1idr9#uL4sqktg5@T@Mid>=l8LRN_W}SsqH3*@VB3I!`)@?P$mMS8BQYW+? z6FE_D@r0A7kV+Ja>a&kLDrq4F7KczX-@ z{L};ir=1hh89ZCCdL>D`cn^ZBWq`cKF)#qf^JeUoGfiLCD#@(|V(#H5-d7^@ksaf* z0N0=UGr_@(c#a=cV5@w>(8DT(?u7lm`#rKZYTs7|gG!&_RlTI{_@pY}&VHaEh}mto zAH5$K%lz?#C~7)cO?XWSzIV2_?RE;rRp+@N6s9+dn6Rsjh5K?MF{&me6;xeQnfbk8 zU!f(6_u|p-4YK!G%ECi@=QBlQ+>-#+PNa8R9%lUJ&`7$5@=8Q5fOn98!BD^~=b7)S zz_c7$i5hI-ya>y-(*xS0|5|>cqQa?E-Pv@cJKEIAlQZ9`+m=(KxS@q%hv`o~eSW$n z-pqW<8^3x<$*l?L0QvIfX8Ej!ZKT}$77vyjRyNOS%W7#=5}^;~D_ArUA@jI98?mEx zWJlIatexl78PAN4I{L7B9se0rou5elnq)rq{I+>0bDOVuyoR4UdpKbxe$kf+ASIVH zhh2L?Q1L|-mv%uiyd1he>(97TGQF4|H$gH#B^G%Jq_37c;zMFS&9yaf_7=jbKQ359 zmcZ4PCN*P%kON1vEGwrToO+V1$<*es>B=sG zA}Jt~?+KoYz+ULBlad?NZzXKSM%I5!{mDO}3Ze+Ulj0(#z{dpMTOm72kI&5e;|WIo zQ)QqFA^YOGWiILYW*K8vya$`$)MFL1MwQMehsJrKo3~N}W5YwWxnT=t>PvuM=1HYv zDf}dkgg@T9=(z}|TaUwoB0|D7iMva@2PD*|&4r1N!DzRQc0R}MNxj!J?lV{B?{{g|~MPrXc$f zFxw+!xiB|~1o)RVUuvu`>z@;e;(4D^kF1PHM<{uVZKZnhZo(nUSX-C6OgI$fLMq&a z^pvL@t!>8TJ6Bo|w4oIqvz0PLtvaHD&y79%`cR=YO;nErXdVW=P703 zvZSR6iY%=<-)HwJgKaWpN{JX&%!||6*2CzkU`CrDO*s47ivIaS)-!}tYfGf^{F(c? zYyODY94-7+q&P7$a4#_7t02(OTaj|samz(sdx;=GF6J%LTkxS*b~-fT>9=nq>>o`XCtgfuCaK6=|y zq~q&kCn{a;e9ERiy*__~*QufL*EOQ<@qjN6$Jy>$diOLA3T+L~cQ&@$o@u}H2_U{L zzYeHpx?;a3PR+6B-WmmTGmV>_9G{M0Tn}b%dpF&E!gc9M&k{gqa=RBgzEOKJZ0&}{ zr^Dk2Md(_x2jA$uV+wY`Ke2y8IVfO}e0P0$NpB|4V|Mx|7tr~UAxV695d=dLq#<0Ikjwx_QpcI!Qhi)h<&%@U?CRWOy zmU|F+q6W=-5QV9k(;G9jF!>vMei4yTUu`*}_CKH`TE?qnh}2<PT18DgFi%O)Fna%!HxrycVTsf2%UqR!39Mn_a4E43Zw^x3ybk%2Fyj5DCfn zL0P(x$<&^cK7(j$)wr4Q3OirZVLTu)A$oee0ZUxi(+fV&F;r#5U4#k{VLi$nQ&eSq?poqc1qB!NysPe|sqo^D_k#cQOC z0CRYu$87X4DlFl<$0kSLeqTQoop{Q$VNMFg6zGY=-@os)I+uG=g`l)glq`1qT1Al@ z--lNfr!HxE`G{-Uw=~N~PtbiznSh*B+YKNMc^3FY(jX~g99ktN1Otqw%zw?y#cAF3}! z2JOVnvmZW3En?D%Aq!>9nXu?jNo4H6NGtpQfKUQr4-dE~&lWvOAB)C_rFnnb;QQ<~ z#Vt&!wC8Z&T9M1ySSmF^yEc+vA)_D1>!DmaT=q3a)<#OuR9d+W1ISyJP#A;jj-}u+ z_8lUb-Mb5o4eSojI$F(wjC!$Kj05)F53FZDL?m-b{JJO(P>NNgr(on*&8P%gS;9S}4qjp)$VYy|(wl!X`+4R38)0^mlC#Tes zzo(Nn?itaQ(sa({HUlklAGZT9OVb+7xMO!;lP1|DrwOS@<8F&e4|5m1Yn!lH~OCB`z3@o+B8S8G~rSXj%Ie=nX-zK6_LtN%wZUy^`80IUwt{|gGF=hz6g1$%B#yhW1^G7QX4hJa^c&s_E>S8+c3C$#QXyuEHbil z0^Z)ce8*$-Xr;%OAb=N#L`v1UAu#nXeTVxj;{kP@64rVXB^L)zkQd7Bo*;JF>;H(W zRst;gGO0tdr}gU1BMPB&MPedD7kYfEsmP*d()l!(WDxydeG8I#^X1j@m^3dinlWrX zUWnIV5)mJlCJ;mkn)XThPWw@g@w_EMit+!nO3UX4n-(OCqr<+d^>!|`F|2YpLo9^v z@HgB&GO|fl5xResd!MCyUvX~$DvrZFR^Wt%wosdiOgvj7LLl;!%57t3@oRB?&-A*l zXMVcR`o$pVnv&$z=yeUEW7i&wY)sZm4VK9|tRwLqdK!G)>$4rtJm!yVnM(wbu9{|| zDQceRSX3_ItvI_josopMj4Pu@U0D2{&(R53^rrU@K)nj(ZCb7{+;{EM(i(rQZdIw2 zIdC^Ms%NHDRnPS~me!5$XOUu(~?%xSV(u=e*4BL#<>aZSaP{1zXfD6lK zzgok7ES0@Ty^LQJpoYm~>`?=_w#QOdEFIWUB9a~x%fc#vMX)hh-9|E4i|@XG4oL#2J_Ba?&1MxD+0lZ(jh~FIPHjM0 zW#UbySASJyd*r9{O4pSu#gMTMfnn$6>UnvkBr?0oWr4OHiWO)!X!#kHbO03 zG~lKs(gMTT9U1GCSCa5>0ID&N6>yz?84wt!(oQrMukAqjurg8O8BubKs%j6^e^~Ue zGwu|X5|MHiyP)n#S}miGR@eYOL4f2VDuX5k-O)Mqw&Ag`Dlav8{P;`#d_~Kf=Wd_- zWqI4+{mn6fJE5-Z7?Q_2!>^^~A>%#3N&*ea^j&0ysp!Nt_Pg|1N*jg8cpXIUm#*pf zf3-pDhe<<+llot#j|;nLc)A1PG*|l%E(nF$Vch=Ed0SWEt2p|&wm#oI@8gtk@N>|j zxFOWyO|jU&F+AV0LJ~72@l*`#vCfZK(eHn>2(Nmuz4%iNw7e}tT*Y2*8INYpu{~fD zEltPOu*&6_{N=6bOjDvhlVWuDd4=+DY>UHNAjj;a=!X#=gH=gf(9bvqKnxt~qH_2! z%Q(^JF2N?-^B7T5YsunP$d>#UOUxC*yUco+-W#mS_kpv%bKYZF#BBel!SbsInu$p4 ze~pSgJ%pV9N;rdlQwFdR)JW%jAO#gBfDr17nOVgGBO)I%t9%ASam>_{*HYagz2Y^Y z@~|b)aCN2+`ql!icDRVBZ^^RD_+0vB2=a2XFMOqKtWp$MY5*&iQHx&1pUK8DI`uep z?_FU*HOS`H5EES@t43ZH>|ED$=&*1tmJCD*mb|qWFUdC9kzibJ#k{m)w+X(Vey)`z zUZm&6$&o)qOQcLlnroni{LOHU-_}XRr;EZBlNQUS7X!w@+Vm6Sp$)*%oEc?)2-tt; z?OLWlEg$&SMpCN+YGe~B%=?fmM?O3$MTZ8*p|^^Np4e<2m#5!L6*_8ECw}-If#NjmD%QlVy(^(w#M3~JNj2WrF;=}2i=im7t<1~rGNk} z+6W4vdXSzKPFcNP;S*EX82He87%B9M8LMX(<0In!mO44ZG6<$|Q?DSe)kft1YL2M5 zQ})Hi(1D9wa=Z*MTkW~MGA)po|1Nm7Kv(B(aQE)PnT9S`tCyqudK`hs!OqteT5A=bVDwcCVD*koY&GOy zf!$KAHoJ!0mzeUc>08H-#Nz61l>B(iYRaz}R0WD=itgx0x*5Uv1O*mp6_}a}g=X*G zOQk)1o&eILDXPpw*P|ce-A|HU%RB!_k9bV4jQ4((F}Tk;oCWh?3H$M$1*dqomOq6_ zarT9OO{D=AGA6?_3j?~(6Rw(kQnen*Hv5+n#6PR1a-FVeYqC88vPybzm%eZ?9OQYm zc-_1n_dm|m)AXPp^#A8@6F(=O`ULoYN>{LlFL-+1cSMHiO9gLkqxJgqpUu)sf0Co+ zeLWSI{6zWHL!fCPE}GXbG?TV3)5&p<-}qM`*!$Vqr~h^IVsw)II4#$o@c*qp5>J&- zX~92o=R9A zMJ1KF>Hl7ceKHbcT-D>N&CV(hK&d22_%c|AK*s!^0;FUO?7CZVBwFA9r9ZkQkNZb| zlo>_DW;@2~4sKJ9UR#^X@f!$SDp|T#QzTmc>HAL8?p!>vyoIdZwt*PJ$fM5rswmtA zhVOle_ZG8MDwFeVj;&W3hCL0vr(_Y7tjp!@CPwJ)h%saGY3#EtzJ^|3(}BNhM+cWW zk1Psu&7U%8lqaHME}?cStE@s|!S`+XE*v4UWZd(4qS@_eurQ1NcYd38aY63DGg_I} z`?BOOM!~-Dfx9HmjHEZqTP9MNAMvECVFc|ybq^Omgr$(*!<}fZZjE2aqaaf;xF`?+ z^%)}tj%-zq|D(hm|G&*h|51bg|8gWBVH68?Gy4N!c#~~>K7YzIzyBOZr&FJ1*y3k) z>flB#6DC=}7fvy6qtn&H^o{w;Sd2wIS3gKU#?I*r;=&y;Gi3mXwHdVC|1#3{3VTxT z=MtBnfEN0sdGZd!;=pyro*Ero-vboORyu9HcOSU%N-{6@JmA8S@^X9H&7B^pY(HfB z+Lav!{hZ89Q*T|9uY&f<(Q~*Ru(=4bVN)OP?~Z3W{j+ti@qLcy^a2R|99vm|k0a1PLjEVCozkGPgx3NlvNnvgXdR;qefj zh}Ysc+uW`N?J@fu?SOPt#dfxC4X-cXH^y3vZ@rrXZdtFhG@z;HngSKfv7Tg8inRQ% zK1VyWgg=WU<^?027U>kv#P@1va$RQq!qhi?eRhQP>hnN`1}-JbJ#C%6H{5KM8Qy3` z(G;~iv7NRFN`T0&;%Xhrh?XgObs1ZwQg3tC%!RbrKF}tqN< z(^U+8iecg>c^zeV!d#HbV?Yhz)T_70KvG>d)!zG*#dZDJ;{($v>&EHP#fd2;q4&w} z&_;1^32eQr)rv7TNjFTMHH0Idn}WHS9Tu0=wD< z#07}LaUOCtwO(+VC{JCN>J6NL(`wH7m8|``5M74y8#o+sK?a?r>Fn4EtFiQ&&CT^D z^n;IT&q<9wRizZaGeApvZ9#SzkT?&^aneDJ-(W-$_L$n4e$uj6z&-n13W?nY#aZ0grd*l zZ&^c}!fts(Sb?7;6D>Z8Tu6SZ=yb9CLC;GA-=c?W)5msz@V4xRMZZ-#KA**#w{b`x zEE6k{xsZx!izy}iz|LYJbPjbWPyczShPtJb7i9x>VCrPb($klNlYR<##cFtXP_FAC z=tB`6O23%8OX|G3T};-5!4@u~Dd?PE*nj-5J7~b_e{%0)<6gW_8sAvW>ztiNei_t_S`X2|vrx||NL2Lrlye*8?n`Nl4J02U}C zc_irmcTOX2dgK@6onJ8;N6Vk1uhrJ>^UjhIMZf!L^EyDg{Nn!l@q9HsOYrfCE>kXbCee8;n4Wi%lE_b040%vE6Gw zq>wVX3N@k>vG`&r9zKs#TK=ZqQBZ?y5oh;ozN|sB!S7K!$2V`?97&2=$RN0TH{tJ9 z{i4cO&&QvgyG!pbr=(jX$peb^T;v4rxfd34{F?q`ntDKi49f6iUzh)cfUMOVD!B zUw|d3huXhgzMUkyoebeJf^4T<$~Iy zC+SZILat(<-!A%*`qzAW5hEX*Q7-=~2NCyu0_6Fhu~U52?C0&uT}~Xq{`tE15bGB} z6XX#9&;&gf{-X&Bp+hB;q&GE8mJKkBR2$-X8#ffmp?(BNf}U>F^BMk?1g*ro4C}kT zL@#2y$sLFd4pAd712X4V_t7+3Jc3G{JyMkiWvF+?{WR-EAMv0w4nZ}S=PNN-_xFsq zFYPTkvKs!{%ek+SLwX+~&E*j`e67~7l4uMC(`3z-yDPW_6J^fyXi0!$(!Hr9(^8y3 za(PoVe1mkwV|zj}Xn|Kyu&OWa8`(c(z|s!&NQN9KJHH27Ol{YGa0ag%ji{1$pDLPf zVgM@BN(N$pa7i@B0@^qrGDE3832y*!zZBf>|5yQg3F%svX59e%JbpOPcNr~!%|c`9 zuRRNbfk0of%p*f@r~Di+=}D(uj%G#7Z&WD`tE5S-mIt9nTg^g}{;e;5S+v%sxc|nY zP%Wuf9CZd){jdB*L`NS}mDngoQQzyO?n-&gB=tpBJ9nf0gX65xZ4eka?}e?rt+xd} z{x&5F)Bc>RFyRZ$c_*Oye$Lu*+CRT%(()hwi__BiQk%atEPp1!6gD;_x-94~_8(T% z3+Cq%AO^p)L(1Nw*Tg_&Lb~O{K7_5SMPJ4L$TSl|%zGJ&Z>Ts|HA5?U4eu4>Q60|D zdKN?GmMmi`TTfAUgbj-HY0MUhuq-45gYw02qx6QV<3kPh4cduLP3j{wUL&)tOU4jT(csX^dM1As@dE&FR2BnW|Y#>x-8hd816$$ zUIJ%oCOd9LQon3b(em7!4=1Ik$%Bc?#Rx79QL$TexNsxT=SqRv9sQ-rMI}}cw-4RiTSqjRq|+NStZerCN`UW2%_mq7)6Syy9(0 z82r5|!lfGtMQt)Z;rP};hKiRe;1>#`7lLx+6Q%=0Mc^yZoR@FDWSe>P!E%CV1nlCS~-#i&&kwp5AXZb?`XFRFJo(jyLx3pflPg}m0ht=DWGta z`60+NSrU}N`9pzCB>^Vy9EA)hi!YlBh4kB2Hq+Ft`Z?*5W>%mgD|vZhkPl83&F&qQ z@}v~0Vew|*%c*p=ZWQOzz>f@b+47Jo6jQ|y(Wi@@j>lLGUTMJ%{%VVRNzzPyqUnN^qY59IP0Mur0F! z8y`#dzUIFCa_TODhwi#u44&;}n43Zjkr5}<)*HpvXj2d1e$&0m(3t3-@t2>1mxW9? zzig%<#>!q4<%T?#o?7&#@_lh7XZ&YpODX&HcBkv#9otT~(pH<@F7w$HUgIiTB_>)K z7N%yO3JyA+n;{SHrWKpA)Z>MyaRnz8;cYeU_0U?kYm& znb?HbR!Tm^pCj*f4h|fdm~78KN8HK~_j=vbag7QO4#%7)FNc{@?V7417z~{hYZwqcK)Llr$Jq-{@`Il^MIm0~llvdfmSqCb*REA?QL#LV$V;WB0mLTr z0r*6eWoQ!KSjjrhJR)ThUwZB3Xl=uLj!V1B^k8&E%g*@^WIn6va|MvjfhCqpZJD3eQlFN;|6n1P+=H4# zh5yp@EUZC<&3F)`QTWq9foZLYdgPl5;wbK5^;p}}5cy-|S0FoCu=~ti>~&55 zP+bBbK3KA?h5y89;H~~Y_Usd71xc7HF`<8LL2gd9MMsT$q~mPyHckkiIQ3`QnDsz#57_>N~=@c#E2O|ekqlJBSX6Nvov$()QO05F|BvQ z1X`qICzwhfN5^4?MLb2ieVOS6=p+vJVouCR$I0ASz>gSO4Y^qC?xx=)kPIB@VK|<& zq+WsswSx>NiyshQz>XPa7$QR09wb2`tshwZL(6n>P`{d|XJSvnz=$->o+f3lYd$YV z#zc=IOVV%)Vw@9h`jfAJ0YZ(5`N!6A5%T?^pzep__m$q@4kF&+4ri~CK5q@}=lOlO zc|Y9sOZkGi-moaYz{na4Ol>y9@&P>`{t{q7{fDi?0F8jJ7qF%H6N*l#ipsCs%y_nT zv%o1cATwIIpywbusR<9d8`dCItdvQjnwKeRkIVn30j*^4W9(>r{A274>gK9ZHzOt& z%5A0^q_LdX#PZ&f1NDCA4)&5QSd@LhA$Y|GxcxhNKQlUtaJInw=ezGHahM@9gRlGUb|H!II`EwD1zITzp%0CSermT2ZTJdC(v;xiklefDs)f zg7_s{=Kw8wuho)N5%) zm;hQv)51(-Q$SJH0tfe!$}Q1#T^Eo83BorI(#g#)35*wE+PEHdrYs6&1a3j9v{srv zatxxM{k90C&7pRxQ5NWv2p7*KUKglzi$o<&2+z|mb;K#XheZ}+&Ek|@X{x`A!x=%5 z48|EnrGboBueRn*crxdxux!Q1#Tv#e6t9vIcr4x7M?U&vxtr~BM-~5Y%ku>CbcR3B z&Zz3Yp`B95(!R|9f_AF@1?`Z)f*gme)=cJcH*74>4Xx=$?1+A2Ex%)O=l=`ZVX<%d zH?#w(zr@t8F6B%p>C&bXpORbsO&PdGB2&v-;Mo*lEeH!!31sp(#IW}_%Qk#;eCzmG6 z9exdYBTIC3o5}sXg&h)+cIxSTyydfjRSg{R&l7 zsVmC6JN}}=!}NSpZr1CBI5q!pg&)D8rfvBLKKH;Gxus>(4U z53yH>ZJp501$79K!2jOg!4jxE{PA~)gn#^&e;^-q z;6RX7j=O|qcy4z11TbB7+`0T&H`raUPX2h-0Wr}K3R+D9CB!2UdW z{{tZ_ez|1yTWrQ`gFq=w17n~l^Cv01*{99k_b71f`ij>yKDp8PotpQbXu49RBH)=b zNFBs)_w4QXO{bxjg!-It2Jqp^M{`=5Mw9;u?{rVAy*8)(Y5r75=};**eAYz!)BGo) zbBTF`ZjnXf$KAobJmy$Jo=84_p&Kr`+YP;fuM(1J;H9*-^mYR%90&0z7`#maMRl>I zDc5FcxBQFi8LEhMKNOA}#beh+D!g^H6!9jfzX1jj8Uy}oh|9cJ$@X&ByjPu|&73J_CCbXBEzJ=Ldb#jz_kTFs@WYkyw=pbQhz1!_R4 zBim^%&Q&Q3d#I#hFA;YVksO6rV@w3=H;SA&u*e3i$i4Ubw&3>S=uy?RvmaxgSO*2J zdrkz_<%PXPW&o?RQw>yZR#w!3Y{h%~w7x!aXUT3kb1X@)3GnCV>~0>KjReg_fw9v| zvblliD!&5=^i6xBq=@{E3i4G#2ug~0+4=ty$PShXuN{7lk;=nD#0tg4!BkF$l_HB(9_cm0b>tIi54AB(Z<-8wa>9B zo$gBU7cve_N;%7(P$hKX*m^+PxMRC^@}}@XS&##C_`<}XW>)E9*j7%aUb|A)AHe(6 ze`ct#_6;9AZp7Yu3nGh^b^hB76~=NX6a))+Ju*iK1R@^O(1uncQ)xSX4!@SXq+i$j zy|8?|KiSN)fxR#V^$0mls!%mq)!f za85L5MLA z32S2-)?zQ=5pIr5Q;SwIvPGo8GGdN&8&FZynpgG;=mpOdCb`Q-!zqceXP%0#7_F$8 zOD@oz6c;bdjBbwd;`S+LaJj;7xOSKTi}|uerapr_u@%ZLmtq%&Z;QZGZqUaJNbZ zZ4qVG6Zf!&D*dgHQw#CPSp+j~SR8~PStBhJG8}`_Zy!y_OGM)B$k8TX4MWI6z>urb zZiCFqTFMA@FenULqU4UJPJR~r`!^uSfCpc!xVpzsE23?039t>Qa()L%+3D~!Obfj4 zwC0~`W{x}JK(mSB^t(>O;GojB`g~mhZxrs*9@9SeJ zI==a)L*UUOJ5aybbBX&;((o0S(gj{e=6_S{Z6g5^ix>*=o4tV-kidZX1-+s&7*MZ* zfm7=-aQHR@sQ2IRDmlB!tzAm-1sib2`!l_CSr&qnMZz$tJ!CDb{R~|M7n)JkqGia!~n~LrD_~Q4<2Co zG-Qzhwmm&0)L`4uNktS*WYmp2vcRs6I||`ROhY8#Ha-8X&1R@tYOhKgj$PC(G2pp_ zAm*^nAk{}70xpmQKR8DNaCJ$AF~`aY)F`d9Kh-5!MgIL{Xgz4$>H}~?vDy;`)^pNn zM_YzD+?a&+afo%?ttPZ~Kp!=bgcBy3fcE`~yY9lax|!G+x}%hgq^GZDrcCb;E{o#X z0Su#u8f9|k+`#YGYZO~{Gu_IHH|Q*7%k&UU%!dDHhQy@%Dy4jst^9us@D%N#{qF#t zN!kNIP8lUrB_Prh)B!YqdQ^w>4P68R1i+ZN1VG@Mq7A6Y3gv0nPEQ} zq=TWj>31(#KO4MNxHThuGCae0rj#w|n+t#FJf$3% z5Zf=2Q&Rwf)2yz$mzUPX+*N1f?8+|}6oy84Df&VckCH1TIFF5=Pnin^lseMbjKw59 z&SuwV#)YVBC~6#^mhCohpR3N$sw!aGJ&YK418MA~T;IY^9^BCFj=+BK9whIpmX5B%-uaXjpx(z0%`9v=+qPs_uxVaCK${QERAz3)je;k>^Q$-xO~m=5AB zlNQ?Od8iW23YCI`Tl#T5TAaoD%VaAXRvLC?0$X78K)vaUBmoB9>3a*fT2XYJaQrrf6uvKy=VIY)p8JWA-@xBV zSC#yzP77oA#8ME0^mFf2MegcsN{eJ$z{Y91KDJ)T3H;0gM2+rl9;1&Oiog+=sH@`s z&g}OLoMjPF*S4RjQJZ3c+W+Lru{sQP%Hoi2%L@^v3pW_GARl}bQt<DB}g zHpCrdgDuy3jEsU=P*l91#;sOe+`3>8$4TJ;5=2I&s9RI~UJwxZVMy41zlYGIDGAon z6)Ntnk$#5P?d&6_Vb0r3EI9~+ikBx+pp$jgDF9^j%FzMdE^?=fJL$i~v2Su~VI|58 zgNs=HF>#SB?W<8S50YnodZ&=k75s+#`tEJ5f`imdnM@l+iOGLqPwd}#!T|Oa6OORK zUiPv4^Z857fQPr~;>?2 zxA>2x@P!COBEuHQ*H7M@7zl&-QeGd&nh23s6JVirYam3Vd@-LpK0&9kSheBCudg=H zzjB%PTp39#$Mjy|^Gy=EiYesFJ|vItkPx!!ObLhd~LC31sMFmOzX2u;*W zikbZEkgCW&tRktgJhu6lV%`Qs+2u68+PNYs&=Hm$a9Q>C^~Qa$HP}FL#hxvXDvw*TcW^maS44=KlYVTh{UKHu7KvZ)Ei}dmeqw_3sZC zA^#f5rrCZupGyegqxsvm-5JBWOi$g&@jAaZmUOJDVx6hs?%8>}|4hC6w`>}O9lq4# zB4qLYKI%SZ1O8eJSbA{g?vlJBXBz~9oUflQZy$Njo8IpvmwpiP9DwZSd%tov%k|Ty zCRE^rgnkU+vlfZt>+R|rd;CNU{n^D{4lt1vAVgy^zvUb>P!^`{iQ}n7S#<5!O@Kgu8 zq{ly1-kaDwGgGUJ6S{8?J-YpQTF9`5$w$508-y<}ZYyc&-N4jrS7rZg?<}sTW4ik> z2%enGjdm!F){HhQsKE6?C)ZZ%!C&T((~;j1 z)vqYEzWw9~ydO_oKW(jb592ajoK0<>{iLY#-jP6eXNSY2RG9vmpjP_O zsberl1WlXes7%3R=MqzZ7EKvtGYfNXi=2hy<4og8h3i%eDr&rCym4D^N{nd=5i&A~ za0?q@3RdPqzz-u7Z_gBOvQ26h z)V1+sf0M=Ho6>plCh7H?o9%W17q{qm+77eb8{IFMiClnV)Iqah0VRWdkw3guLWF|3 z!+ZSyn8~7aJ)U9j3-bp0TuH@>+FGz@+7nn5kTk z#$WKr2wy?Qwu4uOk#IfQNEF;$GM87CeK>ewEun)k-3iAnw!R>C6I1i>bbUB_yWKKP zzF0KOr&}`xCd(|#n`Jikv_N_aGVG%;;TDhXw*u-#tbu0^S=l z|7TBYFc81;{d^}Wh=M#(CLzL!GpYv>Jn_L}v>)kq8hXBIUGsUd345r@E_;f9@C7Ku zChUc=#q&VafD3I#MFjdY`ge9!~e8_Ef!~Du7FPwlUvK8K658dyM2tqGxx+ z^q6_tS-cK&UyjrNx)CduX{>I@EFk1<##SMQ)Ft94sgJ(p;Q>;c-|ZLRpM z=RUu@AgaAAqm7q?Uao(bEfc<5{JDVPp^#rKJ+icTMWa2un`lq_rgoB`(e!&VR;tDc&x9(FuAqiQm)aplV!A! zBSUo0g-3oLNeP7h#2Rme_&*3<2$P31#oNj%+w=)9}2F@<38DVH~eUm|Q4+1ahK=kPm-76M4zP4bQBch0VKy-j?UEUu|&{ zYW2FP!GPS1{8UX``DNY9+wk~&;brU6Im zuuNpI3`l6IC$lv)UTW)V`B{|d1=6)4U7hnC^2c;-3+$FqNJ86gI3D(nE|b7@{Gmd5 z$!RL0q>cPxa7Z3Z$E2W~$6Hdr|7frar6AM6Zz?O@Alhp6dZe2knnHlLdx^*kvljZH zg!9N7F~tLMbs?>~54l&lJ1{Bx3N9TWZ@1%b+2Qlpc-o`02PQN~%38~(x{6BZW(bgq zP%dapO_N08x&b=tY48Qj#qObM?a7m%C5qjJK6~PHOZ15vvNsP!iESoLnmVEsx&^Cb zdxgirw=A>s$i4#{M71)K1&b}^i9ZD#$Z`sir%>TM_Nsxzpgc?bt~_v49ake)rFTHC zYd*XX?&M<)(1_HCh+*J^STHMxdE7uY*?h;}7ucDl{=UTAIo^A}>Ib^W5{1W>wXWp; z5Wq`-WD-*_e)HM$2F>cE!i84n5sO9z5Y)LLltRN^M&CFV9Z7@R3OSp7t&OeZ4YbK~ z-InX>Ho3<(QZ`KDc^klI(-fdMJB1~H-UJBge1GA_dgpO+T{{iKdbTybqapL1XPwum zsv3j>U)88%d1z@wsz5O`0mbL5plDr=HQ0%Gy}uE=#O?tje+;5QE~q z%n<7`tE1&ir97;T(&(NNF5GZ3IF#@2o>vE2kQoLaIH3`^ksjm*0sl z&SNx5aDG65@lHDr$mea!HoC+ruyp>3!9^K62VGe)jt}2NehNlkgCccjpaO>#N<)yZ zY6ZWm1!<};re`2QJEDrVhCT}}xH0)(?+*8aMvo)c{a4HVKIFJm+eMnbP%FQxN#bHIXTX2XdAq4uKIhcPU+ycyo;_Vu79J_M4Q_WPoApsn+@v&G*+w70eb& zrc}%%n-fXr%yB_T+5bo|smJ?E8sZ7y%_GxytRb^IQ)6mYO8!RTD|Yg{N19I$h;enb z9w3J#f31Hrib2$^;yD-T2!~sXnFfHt|DY0cr)b_G0wy|Os$}O^XXG{^N9uzmRbvrr zc18FzL554!H^h=9#iGM5g?fFm^ToYI2m?afSpy;h(261kB>_6wDws@3ULt7R7{$<* z+ALj1l9L8B7$&v|Q?UO|59U^VLl!pUz<0&OwJur2NK}Tyk&}QZv^-Owv@2vdHvdaE1(AW% zYL&DSkHvZhY15s4WGeiC5f=LqLFjk(j#|1UCJ^sV#VHnq;NhlG|BqZ$JLd>Ql}cE7 zo5%-qA2xxIh81oNgMA=KPs%z2V%jChXkI@#bs`H##fh?rk`Z_6o|hcbSRofyXWR&j zZMf7jHef2aeu7-sRD zZEf%=g+-ii)o!})FAK|E;MmFu8$U~)&kR?_(3plnOe0Ep#;-H{8SDG^RI?}m&N2(K z{+Te%4vP$#PP|E*40FP~2x<$NI)H2{1NOI`cLk|bE!&Q?Y7M71Y;U<#tc)~0!3&(K z-s#CC#JmcXr5hIVXHW#7AAFlTQRCc3T6vml#z>m`Cn$3#$)#O89DfLpphvQ?Lz?qkJ|I8r zKURe9A^Bory-xL#v-hBRId!*1G&AyHj>t`LVDHa|xUU=s+bMQqg?2C0%1w8G7mhre z{RfeYrReKCr_!2XWy-O_X_%?jbLBCan&AEj<)_=;T$ORSNS?kZ=Y740@P2XivYPx&=4IMr zoRmh@{fgEvfoly`jHUugT}&J4%Yp>!#YVA^J&l4lxY0E8Ln2BfmVh|Qn0d9LCmQS% zPI!*UFy>ZNhI)-;F5Ury?WLCFhWsg<$bFBa^dvXfn0TQYb3>piEk^1=e7Sl6(2zb1 zJhF@u62JO}1PBc#GyPXP+}@6PnZXkj zMfJ*9rL@NekasbG|4g=t$bOyoxFg!@#EW`y9Pd$q9{ z!K|xj3X+U?v)&Wtp^kNlWUIjf>LIGwH{O*K!H8N4;kk~F{M|5T>E@irle(OlwCIfL z^_8vODdm%8&SR*j(%s|!Q#0NrVvP{(1e+32S=MDg!ii_HN@9iSn7P@YvvWVg@BT)u zPlX9P!KtObd+hQ;$b?^Hp04w@ju=0&WMX1wv(9Adw8@6)kMqm=&#MH1uy}DANujd( zpv0~20&!w@_mCsh0{3_~P#$wPDV|lscu?{e`P_2rscJ9U7dQ7{W%xai9Kb&? zu8qsW3N^mN=&kux?}4rzFNkG|N?^3l!3YV*(NfVYEn(Q8z=!%E=*IuX1VJ1Uv;fLSs*cs!4DRZgjK|k__{-oT~#E3Wjs4P6LCotlc1y5^>f<^ zxC~Ghwj`$8N+ZT-N?O6#3V{+;pAf87Q&CoZTHfujBZQ;PvkF@a@=lfZXOQmFBt0uV zPF1IfMzpOdtiP(*6oNvpv4#X^r&iF~Q$Ow@X7#z_hs5I7?_Ekse{kghHe@7rrzJlj z8>A4S8m%K8n`Oo+5uG3PrI~fYCZD&c={lm3iPXlUm`B%3Tmk&0?U@2F9KOPDu;gpB zn17#2!=CnY&czhNmbDh3K0fp3#hKJ^Ua4;}FIpk6WpePF&Es#+bDF&T+9MdF> zC7?j>4P=jFKW$1xg2O>PR^2V(=Y(hKOyMJxze1Rsm;c6`AT9I>o|MyH{l-nUgIp~x zTMA>7>_VoBiDW@JY}W`2}+$s5%ZFhkxI0X znaAe>>^BlS=<+Es7HTSt$Lk&V&-sXUxA!2=ZKK_;cj7SGnv5}mpyb?#E=d+9hPuwD z^I^sHm(}{#;n1|om6xTk<^uNGpH&mKgM{L&Vdg-BMG^?ED})FYxA0KfWHh{AbI|H{ zC7SN>MR|HprazFuF&mcXmEYybsz*fwc@@uWa)@(?V_%nhNN15u7E1m`mGHt^O75e> zk9Ya{1?G5iMf&{7MWjnkyCoVfd?VfRg^nPF;Kx0PH8V01P_}=(8u|7?_ zWt=V4p=Q%x1~{OrvBWX9;UfqquQf z1lR2DK()rB<6-)xIOtmzJrlDecCGeTU7jepncw|jb%{Q2XP;=(?Ws!nVm5uFfVP4J zGQsb?S1!qz=}K=G$hnu@;}W{qVamf9Lb+pezz}1WK-*jrwcRuH90L$?_7iXWaDU6p z&My0XC3?!d-M`)%ozx{3hEFL7DPJMbvGabxG_Ll}hp&*6L#R*({$kHIRramZ9Encj zO$ZJA@S=H(#gi#Hs+A%BspRJV<@Lx5)6rEow=Y`#Evw=5&D<%|OA5~LOW8Y3Oba89 z&OugwstPP*JNDSTw|!gL`Lv&c(!h>o!SG)B%Dw8KN{t1nl$nny(0rMx) zX*=Tm41Aid!Ni(KdubcJDUt_BFt!?D^byWQh!96=NDbHeQ_JtxxruYwd|Uyh*VxO5 z5ErmMplzu)E!1DJdEE*^apC%_Gq9K-A$^w-6-g|(k8Y2t zP6=#Xor;mZnfjMX4lO>|t#TcX6nDG){`u-~N(;ftXPeDRy||cpatVk0WIR#D@c8|& zR1eU}w4Tm~lfw6l^+Tqg^XQg<>((KM?N)~z3$y4H@kxUAmh?Y-z34J)GrfbqVWPqW zBFZui%PlUVs(QX-_#z`NVNB{1nAnj)G%n>IXg`o=+tBEcG*M_f=PGxxy8S*X%t^kV zKRmrwrue9Feht8?+skO8vs@H5e>x)f=Hsyy$;AohZClp z*S+UBsAL{PJ#kFFjO?2Zvi))a!2kS^VBwj08ObFtI}*kH#t_vmeU|YN*wGX^%$Ad5D6fBwz_4IG3P>d5Tfv|a!#nCZ8(&e zHp5@J57FlTT=ftL6KM5+Y5KsoF~s|LsCUF>%c3;WkC+p`ZD#MT-4E#6y?Ix z@1^5F1SQ{;(WaVipS7n&Oo}FO33BR4IZ)Q2#>%SWk{r0|nbTmSTsPTmjw;Q|Kr-KO z*2D{@?d?LyZjeI2ocZtJkhZ&zDS#7=?crpUwSa{~k1e{KgSC(%oc@F7IiQu{uOf3NEzO;R8F9TAW2IH!DpGu@H2o{9+zVPp zzCH>3bA(xOmAk)@1KX+A2ZiWR-s(?IGY}{Vcv9`ALQ){6P!GEU6BHIi4)6wAoe;>) zj!}wBn-#N*2C7uAwDB9Yb#mUCoJu7qrV>31&OjYYi4?y0pykTFe9g>5o-Az)hDoKk zOrVoXhoPHN-R;{9w4wEbI0v50`bj%6PRqC&kQ*ZNM&L2&0m_%Mo$-%Eba78_zpWl` zQ!B!IxAyw!nRRg_?=Bq5G}6y<#+(SN>|GFJ7FEkx)7CWbbf&2=w0_YyG6TWPZ3P-2Bb941;1p=0VSP}NZxDi*`6{hPuET6&aWadu zfUcF+Z;SP}pQdW&qxC|ox4zL=_xt{KpakAN#o}Sd@n34bHp$s`4*&AhB`k%t%l&l6 zk2Ua$hUHCzh^r4`IqzBe{Xys96n2D(MxTjgXFJE@@x0!@8Zh5&J#BjcJBnJ7s;WE_ z(c$zV7j|*2?5GVMnLPO;q%|c3=AZk&uAY~%8d_9M=sKXG zf3SASE~rJ+ieA51>rsyjmk8MIcFZU>KPH3eXl7kRv%un~&L+Y^RLZMkPT?%@Wn?7^h_k*(LOAc{N$0jf;rSWpGaX z=!VN1aC))5agKAE9NUHEMVVg}U=>PIJc}5RV0Rk))kvkukXuQ1O{xMJD=mIp-nW&H z1gZ!){=6feyCQ`?%!>6=IcuM#R6u$x$(9CFWN@Hqu{yWhlOAs~P8N+Q%Cn}0AR7mW zPgd4PT_Z4Y(9a#V8qQ(9Ss^}Quwr+C3DZollG+yrIL<>PysQsql5q9WIs$Pl-;}B{ zY$7Hz5}68oEZu!f_mf}yTyjYVCIG0LARBHdkTWf?N*r&V#V{Gg2h^%KdKdJ3Dk|>- z7fMmksHie=C?Wi*p#m05b3Ej9MIY!c$(2?ebJjEpQ6UQ!WZY9ly2QhvC4->g9jUQH z+IV*@4}7*B;rlUfqO$FECHl=+Y#L!4&^_uLuO@W942_AsHZyO|@7qP^5S*fqbL9@G zd{ve-SJdA~rPCVB4~}wWWU5K~X19)Tp$-pbP zi;x*Y{qIrW`S>#+#)tFF0KA^AyQ$;!Az4Tj#le3SG`veD1~P2>yZZx+i$x6KU<6a4 ziWBrAlRu1mMc<9QKE89mr+Zz}!kE}w2#jDY{TWvhge|Rx7?t%Auuf0)7|J@;(9y&h zB(LKk*p42T4-&_vQ8>(5#FNk`5+`IB(BlXm$Ds0KC!n~}2@PuBU=#WzjPXyrav_Tk z?jh0XoaSYP!0N5aFhU!UyYGL&cl@Ce-CKgJe?ky^ro6OdPH*nhN2|xikm9TxY{~AO zQWc%;F!s2%f&eK7)I$#3O5!ZwucZ5>IggtiJmDc`f~_EdHIPJJQ)t=>vCrH9PtiP!)QdyLUe`etwp zzD?zSj)6*Ds(6p$T6k)#;zh&UXiPNF-6c zdXbzJcflUY_prry<$^=KC?|f*AGNqxufK-KKd_b< zGkcO*(CW)*!~3Vg$yA4b45PWR3FB{DxNrAcVrP<2PW#l8{xk?L-+qmE3Y3>^$QYn0 zOr?UqxWWii`KAzf&7u*Yj_URv2jNyVBDF%T`?m`UY0amdj!B`%BucKLVJ)y^p;-^t z37!&E(kmC*ddy3aGHEq25;40d$q2i%VCdBtc>5cp23mKWzE+4At#qR+;_Ln@y{e-@ z2^ah_6XU_eAyblYU^uLT|7?g7sS5Q*ms(mUK=tHSzb6>*;4?+wX=OulL51X15R`+k zZ^#Yw?`Hm@nyu}aO;$)NgaiME;CLa1Wmk+sg-Sbx`2>u&eskdZ@M=(69jMOL1n{|6 ze5FRF>C=d3^s2edwUlr4dOyO=xCKTFN_WhSaU#K#gd3r-5Y0&#*_u~ z_Aw;jP0QLyoPf0%m(Is^1@Xvalz=I4^-TN-iNAIA_r?6L65q}#d!lHM13asG{Q!rpLxO`!IlyA0F{;u0N*N-Hg$1y&x)inyN zpQo_vyYYRDEifs_MkQiq_^ex5SdjD4pT7aXhS<5#cmwP3mj3j}D$K$S#5j1!S$4Y^Wm7A|Evk~3@is(I zNCxqnv3N`W^3nqF^a>J=B02qIX7wU5?z;ijZ>=zf2(2aY2o{ph3>zmVD2ePq0iR|< zevJm|4WW-ukiG@?yAww90f-#+r!`itk?$_*p;T!u!$OvkqNf&*gDFNPvL{2GyqSnQ z2FDp4PCB(KXKAVpr@|w!_csC62NZEFXB5uFp=~lZVZt~hIBP6!a8!(Y-;i<4uS}Xa zZL(L6J8p6q`KuAXiT?w!=*xo;?9x~Cb3a1_88YVW`SgIV&Q1=&hqYAIeM(SBgfKGk zz?Xk%N-z!f>}eDpTm~e8m>>8vc{AilmX&TGnTIQIdU>~RSIv@wgPocEKI8$(k9E|%2PaadSsPDNrx8RJquUiwVj%yACjfdR#U&+~Ak zFu&P8Klw?#mt0{!AxHyLGW`dpLHda17(8+ABZ7mQ-N}F6YY2vVGgnh8#GJ?>GV-iH z`Ft6VzajA1!Khvm8;Auw36MwF0;E5ISAXVv$D|d3$2N+k2NcKREIkRr7@X|suMF*l z&Lcnq3V?|VZl%4&=yPX+KABs>)lEk!7QXt)MOZczq5{l0en#-zSSby<akb#>X5KZtS z@e~!)gui`-Zh%}ojScFwDN-e`w_&KgxUtA3;h$MM-?|JR!P!W`GQGjSnNm|Tu`;nR zLSrm})B$+2$o4x+1k>9>_EO#3zUDCwRAOyKr0gYcAurC2d|mXTRA0+~4+u9X)?&pXS5k z(=FzS!C?7Wx;*JCjjm_JS9Fa7F;F^LGJ^Du^J)mk6W)m8B4gV6>J@^SuRei*f6=@8 zmK)QHDFpGt$`J))9sfMVaBQ)2thc(yyF&=!1qR7(zJ^>qUt1iS`98^?6Z|yiX=o;C zX*?rL0W8zw^nT{Vu<(=G3+eb?#>^)F#?Z2OxBjq2P}-2s1Lsk6(X8D*+n)zY4wliEi!O;q ziEq^>D>%36I^K6?KMA&EkR(t9sE*kriOC8|#f)lnqNQDtrmX-hsNPy1Lj@5$1UHJ$ zC5CviL~UuvbftP}`y=2yQvIb0Ur2%R!Z}=GLm*@D(NxY{TVbj48F3^sKdUU?vHK0k zJJKDeMAxzg+;m4!18cQ@Qi0Q%jYjE4kiez{YB)&BLb)+gnmMie{H2!AFHybGLHO3$ zz5`uH1c!ovvAh#q3Tvgav?P^g zdAKIixZHADldl~tAd49imf{UAt|m~=q7@kghKdN?b~l5)*zK|qoDgr-RglYo2NYA+ zd7W7l*<+>ncE+<_m56 zqMd+LD;k@j8EQK?56AR-6VG3)XIanXi3;{+(tRs|)uj|nkLSshA31BK2ygjQ1zF>1pp(ELf~xMLubjXu?4 z8}5A^7{=qO&YnTYV0c0zM+d<_MO##e%6)Ko$3ODqW8e@C8kDBu!MkEBv}m(nW%6Nq@az z65c&0l5#q#sV^;BzIfFrT|D|9T(&T|NchaY29-}Iow&@Z|MJS!*0)--l+Cf#hNW8vZa zfjy@pji6P8@+B79RR!Kf>unQ}-y$}rTh^<|<8C$cE3&KY@*9`@P4cqH<2r7c7Kt3M z3$}*)^B2jcm;D6Kqv_*{S=xZQ_g$~oM-<4EyAHTit(3MBIluH%I3rA>(=oOs#W*9_ zWU)eZ)a$~z1ygyC_Ic*Gbxl@|a!;Swy}9V}E=E3Op^A5Bi|3bzn;cpgohQXLK%eT_ zJi@&s_Ed!m+24q!wUBr009%|ip;xM;{ZTLYP5d&I5f2hs#00icD~c%1ZQ`$O&Xowp zECX(2m*V)gUA4%JXmidHIkbXZDG?7R;J7Ur%&4bkeWE$+zm|>5`fQxKYf&07qBKiI zwUL(Zjb>bD5x^Goct-5>sTi12a3c+M7%jw)BY-~lv$jtK-R!l-&~s0=1S`P{gMNj9 zz(U>Zx-2Y7wz{yvh25?*gpvxu$e%Kvw&o;hp+l|mWqdNs4Ww2>sS7zGm5Z5;#4mZ6 ziRmx1(>DY9^5zVbk4?N<(ABmT3OZ8i0qSM7N@FC9e;4?vhiRlf>jt)KnKO;b@I@*& zn;V%h@-$M?F&J{?V2riR>*$&=@~T0hQ^U&WkSbS)8<)*BRw~xf?1`1aF~jZjOb6z) zS341?M*yJRxeT?|*5m#1a)R>AbkEpNITmVTfvJM3*+#q3luML z(y!|j-QrQU=bhdeB_Dkd8*Ae_3TGct7}RCcvQa5i@cc1p-Ey%nt7;tU35dEfRH9g0 zMT-ryraDh#8{>LnRI6p<5pZD(5iT#PTzX?Jh1z<++_cvhNeWo~T{-QwhWsv6E!J@t=zMFZ{t6Oee?L^HMBL9P} ze~PZGjn+Wj*tTukNyWD9if!ArZQH8Ywr!i0WT*bM)@f(Awr^&exAS6-@r}{@`@CIA z{Z|R}dIZ&ob-K997)T|`r#Og>x# zVfS*eFgoW!!zRycjoMACv99Y@+K=&?$6>h_2KYT z#;P`RY~!YfB?_@_YYp{ra>xeO2ExfPY5fppmMs@^Y`MA45flwBMi{x_Fid4}5&3MF zl#AyboWbudQ_CoQbflInC3UgFEedz(Ter*51&SwAlqlMbne>kh->6_ihB;F?)Lo*P z(xPUJX3U&!bz$`q)`U>bWR3#^#1iCa8%d`#Fz`d{%|uYv+men3W{w!WL1dxqBFOfO z6JMy>6!2JYF6AlZINRe!hvzuvnRC?Cq^tC2pXvQ5{Qgg=czas$`mT%v_b%b{UW>eXq`L<*To1K4bY$Ltko&QKSie`aPVBAEjR zpqYY)nfZ3N@*V0x#dG5iGe0eJhLq=!zzTHwJh~qD&IjWUeyU@8D66XYH{MB)QoS*~ zU-Ol>(q`hbWo+kT%Kkep|LlZ(p>RCT6|6sa&29FdP4Wh8fpR+a=8kuLGVL^jLlUxN z$ODN&z}2>z-)#Lh8om(1d%Qd-xB}g`vGsc2jvo_O??CKoClUiQjb~iO?4xXRx1-8T zoGfusLv4`xJ>mTta3L(XEY&$0-2xse};T!4f)NOnMtI^Wg<433ANFQ z1w#DVzM?LGSef*=@(Va&x&V-YM{Nvx0wGggt-qaZ(Mhl9{81v97e0}w13 zDYRW91w#n_ryMrRUp)#N9gIL+DYHK%3JSPvBx4GSJCLTsAb+Z%ZfB)o)==}*Hl&Sh z5OF#9!ch0rNZCX)nGskxgsVbBL%dUPsN&}^3!(eyGTv4v(%GEDlHBK9vMVkW{YrD~GRgPUWIRZ)`7LW)#S@w{BY!4(4=OD?FZX%d0Z| zQ|($igG%@udq;myoETAiTFU18w*9KD@QA5_$LEYH^Yl7q)#H>G`Xe*V_BXi$^e>!kJhM+;kc8X_*~6UuBd=d zr44BGs-hcdmd2X~#7H9ccRe40bakm<0)6UH5n&G$_d8V)kaQyrY1j+5313lu;>VoH zt9{bN5Q0EbkU!}|YwCn71@|u5A@5V2TxE%ycq`_-KNLc6SlSMPkA?v!i~(H>3vT`p z(|g5yX(Me*M+**WfW+h#$#F{MNmMF1A($lKZtn1c9*0&X@?i-)Laz?M^bah*9bvA9 zlYbT)PyGT-@riLllxmVp?$cyVl4{=AeC_@y6+I@iO)xT;RL&I|C8{&QKL9F}2ZR=* zFzK$HSFWq7DtVYy0(0rFuB+67@!YQy(^psTp`U5{WL)?kF0r-n5N-u6Zuh!!k)9X06>Cz2Ui;QF=-Dt6p%hQ# z6@^okR0U#l#(YXLnA}J?n&prFDeH@^{hC_2ZyH+kZ%ud%ohCP$R*87&b)7_DL~+C$ zp72M}jcQV^axLbl_&Aqsx7@gEeBc6H2*5lh(;%nw&iGzxCw--H@KX|?3N*je<4RbH zYSTGnA^B_vniQgLOyX%jh?K|wo)feEC4H?U5|f%fHzRLScDCf5`(Sn|i! zPAp&T4i&0*hW10ZC*-R&E>K%c8%Ta)Dl^h>^@@o>zV2?s{@!;CW$YxMw`pN|u$|4| ze);;CSc7N&c$R_0Kd{R^e~;bwH$2lv?ABTDPIkqL7x7fn70jtpBF>M>URLX!(7WM1 zNXjyM5iPF7W0hWv^nCQ@QsO1pN_mXw)>Q`g`-< zxE!QvBNMZW&coj8Kp*oOII{b_BC8^mkDN@_^Wv4x8io>!LGyo;(MLF!RHhK)i9@GA zjY8|h*ah{L2FaW7pU);MHa;tfgmSao9+*4N;#{is3jyAo7 zN|9yC!k7W)Td}sk>mhx?;=h?jtZPgc09|EPe^VnR8X z-~ZiF*QmU49PUhEe%2rt_#169h0IIzS$3|)VO`Z^*O-bC?O1a*43STN$Nf(TGpSDG zrlktD()$@_a}E4Q>5g-R)APJ^YhlF`qPrYfL8yEoz}gdiaXao z0}je+mdPG1V9K_>q7*m?WaS#*Qyj@uR;+o6LX1t;P`GqU3L)~N`Vv_?s5pE&!uOiB z$?sItJrRLpFHu6>bO2;y7~c8^Fz96>3Ts5-f!Jzzj;P;P21_xKqkvIS5MI?cwBPtFqP9P*%TYr!xrr~9Xy5+8}?7?hQ=sCuM$;@B1-p3 z@*^jD+e^GuBy5dgR16|tGqL=AM(!&-Hck2&Dl_$ks+wWMs_W2JzZawnQ8dXY*$F z0p87%B^rO9TNY2RhPpjXLaNz@J-@iAwaD1nw_JDBzy=_9JQ0rXh(Swd*pzS1d0s>I zr{d8!M7>ksjaZ%#1mGMd^gsvYP+?C>4$@Jzm%R&nHu$B$tU)(W!gGZUKo2(18Yzqq z_uNChx_atHUR#tS0R1%&NiyR&Kv4?2ZeGA?7obO_4;KW)I9wg)>Fo@PlYG2ZwFp^$ z$aKT__A|juTmaT$hpJ2&teD%`OYJwS$c!yl5YkQv&|#N^#xzmpEp%c?l0cC)DCDQO zMp#lcLq1K~ebM8E3Og$?fGXYr?j({AXZV_YR8rB%XbDdce7&;V!UCU2jE~j;5WFIn z=qmV(JOOFXT7>^Aol-su3usvbqHX#5m-$Z?wxPZx4 z;tLHTt0M7=c}9du)qa%0%*o(|Wpwu&8>~X%UuF&v@J{*+HrS_`#p7s(srBAsgQdCa zWu0?ua)0~ELBra{jl#X2N&jXBP8%3XUtzY&&&VD4JV=fTT?oPeZH=O3| zP)>4}c4xSU$XYj1inut>1f7|k^bHxDG!VVCf__c@Mv{BVa*NBul|+4i$x#U^gC0;3 zT-05+2=4a{noQts)$s#Wh{#+FrZd4@HxMd%XR@;9To#87YaXg6E~+O+$&X$R(`~Xi zU#bJC{CfK%)FUG=-Tq_}*zF%w}H zy_&p`7`10kv*DG0A=D?GRKY*WPfO+$RAwipQDHe zzV@U=@cf>6t0f7{6ibpYHwbVZ?3;)LT?gLQNy7*d7WXF!ri?~~9i6nEn|a-@zMg|i z5-eppWjxd>K61G~cCS4EvSdbhF`;(s!>~C-u6+xsr5(EWJ(5ut;^Orb0)h)8(oXm9g0gqz3uZ+hB2I zXi1xbl|X_{#yR$V8P#?g<;3vnlA%B9xCg%F%G=h|;k$;f8-#FE&<@KBA57_kWC$o{ z3*jw!6!d7Bm|t>3xLG(8^RI&5t(c`BI=GoxZ~W%zRSS|Qa49Jj(%~k^!k|i+3@Fvf z?PQ;{8WiZ?V!Nn1?^g$EJ#_x8UAkC;QEg(98X$M4$}u-4I^<&M2lR-1e-nDtVo29| zjCI}ABSqwL%aF+uLi@An%l!-b*ChaME@clPYAHF2Cc7W!TaGM zJ+x_nQ^xJqAwz#^Sez##5o}QWV?k$SbzgnFbDkP&^y4oVTwy6_wG@2s4i0vu?#D7N zZy*&(Bi!Y?CY7q)w*p2>b1OJc3x~7G!+W`t%RA7h!1L%*sm6on$=^Js&w~NHA2_|3 zjB^o#9nXa)k@VA%raUx!p3hOdclu}VP?w+3KEI}ZhU*Xt-YBp6h zg_;!_Cj!+*D{sR${7#j4M~G|d!G?BMfQ1kx8i0`y9!pR5`jJR7L|^mhah%%lFY?pk zF1A}tP0i5E3F+I@W~EEbqy)qop@F9ryu#+f)#@lfkG;PP{c?JTXYr=NvP7&UA@FAi z_@YDa%;B*{L(!{nVNAUP-xd+4OJn@_U_&3viOXUe@3L7}*E&g~W|Cauo?M0k2hq^V zA}icjHlw|&MByErf2%ei4MDq1RxsFyLr(_s;vX1Fu~X$4*N^oBs6q`9FAj%#6%c7y zgsdVG_o1Dow)Wx>NW@jw%|uLD;Xx$_VVqUG>)sq=SeVXxm77px0y3KamNt)@DvjR= zbhhC+*}8rif%tVWlo~~9Ty@tr#dL4H>7GAb8O#WuWg&7gTn(1bhhw?aZGY8wx&TN; z9czK%&>j-^O7B;>%(@dDP*G*i)5I3xuNqt5xpTOCdL;uNI~`<7Kru`iD zvn~0!Gq%RR1Qt^u3cGEq&fXP@N&3Zp;o$2+kWj5oOy-gs+`u*r`%*cwuS;DYk{yFV zXDaAJnqws~;2**|(*q3SDS75?K^k_)4!UHN^PJ#sy^WS7 zDZV8cL@d*^UDeo_C6$GA`Sj)V)eoP);*Pphz)Dflu!1+EB|+bY3Raj$nuxU0%pit? z>~7B5P%)OFzz-?&wtuUL^``k7_A$E=Ao|+HbPR;(O7^VxW3zm2lrHtT8K%5OXNwNl z%-EFpuH03xkp+bbVTTu1t+<8*@_S`z4fC2&J^Xb$>3mhJ`TUl4sW*nn{}_5Q`E5@Q z)ilsQPmKmsv`4EtVM>G0xl}@{da4P?iyeb!o^+?zn3bHV3pnZv)@@VL8OT<+Ts1LP z8`s5qL#b+e^f(ECr9ogTP!0H~Q>JY!vnif-J>PcLqs6p(N@HeO`)Sz9 zSc|QNlGmboUz#1*H?g#1wh*h+Ia~H#A4H&(b?G5@jh>FC=gZ-3ETiAFNm`-0h^%^$%8zx^0m***LwkqRXU; zzM@N-ApC(9{ZjQn0z?2vW4pLB=b}KLRDd;ZTHX(K=))?(*p8`Z>S5N#j@L{SFRj+$S>`EK z#mC&kAI`X;mW!2m)uOvkDaFidHw~T$X!y4Ud*|@}_k{f&#e=$7yYknp*?C+rT%Wy& zRsL-{djf6h7|z~8OyKx$u?AF^u4u!*Ny5F83Lv$*D>!MtkBse;v?sVVlW@@yNc?*8 z$O4Q+3}ZxmH>2sLY+%M1qWLjJea9O8TnZ31*yS^(X-OE7f`4Q~a3h{T8ep0#Lk9qB z@P?o}gJiQqfN?Rq1vXf$yRE%ClmHRaWHkd8NLg>yzq}_?IhS`9eATa5r*CNg3))s>eC3_&vSr1@5Ucg3FhL0dJpt_D8tL0*~Fx4 z?#LI+`wN>D*@R;Z%F+kyQVs5Kxqoj*Cy9=~&JOA*!MIcK5=*-4~oZW#YF4WjLDYR9v>$gLWNMts~>}TuL{d{N*#93RwMmb7M5tLLn?bfbvxO&5yAMj@Ht)lTTchA{% zsqO4}y4=*(cfywIBq0g(+fb++ZXPrhJj-s|fKk{ad;my*vfR>S{psbMm=%k(1mBRk zsu5uCy%!hWh0HGL4&7LM4!m=Hd7e-T@Wm~Y$%?IPkt7Pdar#Blgr0B$a`VQ9#~7WD zq#J(H&C@q!#ZP!6?k(ZBE9tO`_m$$WNSz$}8~#YNdtZ7^Pl(BwUP;C|j>tIBZqm~O zxHdp3pmxp`Y*_ zjd>Wz`7%8$SpwBSJ3803_+fBO4Simaq`(Q?p3;DEm+mSw#vPF0Fb*}6ffo@3(>08|c1}ObiLCg0fF{CT%hVAK zVi&psNl*J=Vt2X0r~J7nm@bHt*5ZmNI7|Gn9f-W{=u!N}Yh- z-+n>LJ~q7OnC)WOVfaFBx*4W&ohy=IuKt(^m#{Shc1RpD6b9&HPn;@%k(loF1MoHo zhV~5h5F(q$$e2p}HK&i{EmdG6DuJOM!sE#jm5kywcqKwnu@m6em_)5EVp(qOw7OIA z{2k-+zu1z{enhh*H4Y>7;x{1uo_>;49->u7=ucQAs0I!U)NalU#Q1*YgL@`)_t%*= zy`;cAm&xDma$r{LoxdRm#*tEbS|Y8}UEjK2?G9d^$_O%AG zd39jx=^cVp_OxyRZY^d4!L3y&{Ur3uQv6OG=%$>205cri_UG@+)M~>otg0=|Z!42C zS}VR|NmQTw;qSi1z@(kUn|9np^`z$*BH;KbC#_$Y0m6ivUTZu)PR^%%{aBYH&x+ql zr-9}|_`lJVFpcGSML&T|`iU~4a04f$RYPi>2nyU*kK{+8!;pJpqrH*urgE(3M%3QibPvmPDf?lGS>>=w}n=}03A488QMt^*CbR%-tieGGK zv|3jgcy!!WRAk`eY-n>r#@La`G*RLAdzW^kd%EDT%;S#k;lpu=%qXAExO|ym);d;Z&H3;^VSKm1{$1FX=uaa#X3JZJ-N8JK4C~5S05z%i8Q|P zJaI@F#SkH~rug?+CKf$eb9iokyeM}tK|urAe!}nNAQkP#7xs`2=@5G@k5@opSI{t( zT0HPtXIopv#44#}*+@L@(B;;D-F27`BqaO0Z~zzUwVBB2X>c(IL;gX2H+#FI{6Y8j z8$J(7=#Yal@`~3gpM@AW?KiZ1K0*1EFW0oBepssM?lRp>?HsperxX0hhZ!G@w1z*t z;VmjZ>K^UaM1T6x3VX_pG<^?;&gWbB=f2iDJ$M%MV$Q4m(Y)2uyyM3uut8ougl5OD zKlejAes;$=Ux-|MrRaqYTx!)FN^J*+Ki=MvQA%Zu3-F@Ru zf768zUF}vv>Pcz;se5}Zy!P3I#dfrsc+n558jxQCh=&G`RjsLt)BUANu+BJB*F~S40 z!76c@5y%B1d70BDNRRRSD4RY?!J@eZ8_&_OB9j|Q%*!}Ry02SO-=wK_O4X!sID$HK z=;Cr@YW9svQsJk`7|SIsZg?)dXC1~NH!3?xU!vy$gx1~k5FVw+TlOQQE+uTz^mOZI z8DfCI6zuKQS{vS`WqGC?_UGom(%ay?q(@EJ)pR*W39>4+Q zt~X{O{^R!&c{7Szs)USe(yV?uNgQ?x*o)69aiUo1N4hPo30mvWFt57TH6#B%=~}bW zYjvC9SZsm{_E`Qjkj8Cm<~X)nf4+nyYE0M!#yW9rN+7uJKi~qSsa)yi}UfZ^ZuEOGiC2 ze4>2FvT!yAcA;KMhEd0dH^E@k((3-3WO0Lja%Y^!~o)77G>&JfZ^}?CFZ&=+{zuZwo{G689JSBR`S!$Ub zrLKpX7Ec~gb-v{I8dLY`U~+bJIp2>u`AWT7F4v05FI9S8h9bMZu>q3f2ZO+Kk<_rd z`Py-|K2_~`zTUh)%rWCZVCdyP&?@YeuEq7x#pSvD(}+Ajw9voLAei45F=RJ<*)oCt zgWoM##xgi-W_`rSq#Zfr|*A^`#9Q#VTMd;_TOl`%V7Z#=)0 z`$MC>QlJ%>#3%a9S^sOwbr6hUz6kbW1L!H~&Jw_PnSbB440q_~Kia>bl>P2x|AoI) z*;@Ha7c35e&C~vR@BS_K3R+EE0SRdX;VpyE_IhYPa-j$-Iw@MKkijx9xhCdlMQ^t_ z-)MrGp|hzYd9Tssj0!yNV}vTyKJhs*5kcG2(8B#0ZnQS%mzH*0H`7=8OI(g>EuU7; zceMArSlE}3DOBOIBg<0eLEWarrrpOz;*u8o6?ANhT=(O^+QfoXYECZ%fHk$Ac2j8% zrI}V1_Q;0$+f!*IG&D#HQ>#tTn~P-41llnXNGNT!Mst$Qj6TE~ZCXLE=wMN58&O54 zU81(L61KBLY=&n6cr5zBk!Nsfaam>kKH)0YoIPTe1hnA~@b?!9tq1?_&x>B81RF9g zjhqFz*e+OD{Ye#X>1eZBO9iZ`l%q4H=jc-7g?PUg0yUY?vh3m9sqJ#%sKxjlI*O&52g^?+Y zf%BwqpkpKSJoF%nhP4)(8~0DYbF;2eC7CDY%I*Ij{f_Ma^gG~x`dzITuI0EeYvMos zjsQ>@(YTBSn1SH@1po33^5X}so0(ZNCHM_?z_YV4nj?W)-3cf*JQu{%;FRe0f-^Qi1B^u(5Dk?t0f&XDow{Wcv~+)2Mm`26$| z>8Q5>UCIE*AEUzSD~e5QZtw2pVSjDqY!hQq0U!|Z|7YJR{Ab_&1%2GV{`}eb_lFAW zA*Y>)hAsUIgA;V5keZHinRLVO-Ok~fBTJrx*HKxvJ8;tB_rfwaK8FD3^zoP)-^oG-_01Tr7h_fC6=0)8mg5g~DZ`VsZzS!P!ciYL zg{!O?xe;$-$MVqVtAo+g@IlU5w1YgJ4?#cfO5G6LSfqY+1w6Ie&F5qWc0c^paQ>RU zZ^R#BJY)DI`Q7nNex15AM!it)xAb|a&PT8ac{6NW@8LP^<89%0w#!?FHx%k`>`dSV zJ@gTB{-l$+!%#_7aXI&Mk23)wEH7pEr*T5Yxqpt7D*KLI+%vDGw&3YaAA3m$kX@6m zY;JKdS)Eyqn96T0PG*+!aZ$Z>(-7(F4#5HT4M}{wR_Y=Y@-zx*-p!CS*Vn2}!5w?lInJQ~A6TcuItRUc+9Zmoq@KI;;FstNk%H3mO=o;zr?y;W;Kv1Y%hy ziJ(g=2P6xyvuW5e24rH=Wc53Y^KTwlHbn$DahfZCdVt6q-imk>s)L~Yn2z~q+T_Bg zvX|U*Se5ecl4Sp-iWv9b0}(1Fw5xR%$=ET=<=Vwd0hMa$6V@`I)U>TwsFoKUR2t^K z*z%1I3%3~&s(qVnd@7Yje%iSWDh!e~!dv4b%^ty(_Pdtl;H(5zye#>mLA9lHNpczp z))WH@cHU;xznFj|)(#=6nw<(9CS~xe%d`=4oR=ccr)q0W7H50@*eKWNM}c;_7@E3( zu!!7NEyOt^@;=}mk$c%V8CCfMXr?Us;S&Ju>{~Lt>v6r|486zkI%j6QA9(tGdH`9W-+m^K>yJJ~MzkZ; zlm!+lN`H=nC6uV0^rq+}@lNe%ao(-xq~$ASolmMgZ;>Zxbe?7@NdXkQmKdCc?e;I` z;q+(Y>PZAOMNc+Bw1(BP^wX+tNKMdUOFt4kH^}Y=!rOf3kKElph6{=P$@L?5M){dwD$HNZW|=GroaC6Ni@JPtPxiJ+E!- zwkguVsEI0$AXN_VAs|O``;P%y$1p$;9uH}_5G@nhy{3B_{PO6z zzb`P5Q%I8hREu;wQ7WTMWwg%YB4O^z$7)E=sAdrwG_)WoixVlkAjP$ZL;Cd=W+&g{ z4jGQJU@j!uKuG-dG$!t2R1gd6Kg0z|IWJZ_o~ZJE0PJRCO8J_E zVdMm)cs^KQS_`3m_{$45)nl*aymCUoNUhiX+33CB;k}2LL=J}Cr26!bK9tJNs3yl< z;_z4B+wrrY+ph17%LZw3#I*D(9DLDcV76#xO_4ezr8G)||Aubtv!OHvLiQK~9k z-5*U~0Lkae!2An{wYPM_^6iOa6GI{z77(NcSJX>xx<}K46Piu&l@<6F1=SHrrozYrH)ZH(YJG?etFo_2s^z*@HA`ID|FdA0b%#fvgCF zN(au`<+br>ped!>F<6~^)J)afq)?h)hC2qdT2B2h;Np&AfBj`t6UQ${6v3~}Nl3O& zw1~KSS)lOz5KwPSj*nf^chx3TjOWfIRIJ8x>sXABR?=DLv@D91Olum7QEY zkyF!)Rm+{`4(DVX)4VW zZaIBk&tCMSxl)dF#we=)2f7E z=S3ry`78+_7>CL%!IUb)t?9+{7`Ss<#kA>)DG-g$FiE?1lrqHc;z0Ycz?P)T@E zGVRF!(RVZ&{2Gcs^qoThHdGx|oYfshTMDd(azLb|X1N0!yXX&nx4K93AARRl2Jk=h z9Tew(^c^*^(ErkRu#F_So}d*$cq8Uz4o3AQo)JOA~1I+ zbTH8meW&E6TwYji1t*LEXF}jkF#Nyto#hXGw_O5F?B+c8SMK!}^%0DY23{}{K9N+8 zE>LGw+_L!qoxxL?ePF`raxiC}8cC}8YT3w|;vr-VsBt91EufyPDnA0%H61S9A%Ads z$&Z^9QrEBaRhKNM9Rgu->R?nUKm$U-=ugHkCPh+6HnAD?Ws%<{lA2-6%*2#YX-s%U zrEFM*X*~dv{$yBu&E;Rk)S~Xm=S? zFiw+nTircnQ=KXDAzCjmr8zl3CUO*=DBpytJ6nbsj{UD#oy3Rx=Qq+JMr~`|X z0%0I}29njexOn@U;-Rp&MYF#8hx~d&wRs7pEu-BYxuBvvg-uotsRJJ~<3Y;+K^+0s zgouM_CUPqpL!jgnR7L=(UI|S+Hm+eCb+Y1?9vqN3%H)~~djs(&$Wd9M+ZbDF!H_QYj!5L+GqJjKm{NHaN4ydb z??c?>_56EbYg#ODq#W_Z9)5RqR(L6YeYb-v}A&bx=#m+=0eItv1fR%O~~!Jyej=cq@FPxpo@mpzK2n9bBtd zp6dl4opQGV^(U7;)hAkuU^T?bl4nKGDpO(mb3|kkOKNeNFSe7_ajq=6^O4UR3&(|E zabsnQ_}a((CwZ0d#wc%xC#NGramL=ZrD_ss4|7D%$k(HF2g1@g1Oz_LkQJlt6#)a1 zdEKib8M?5zDLKv#HOpZD*=&YwH1AE}>DbQF2D-;8GeLmtb&)IJM&Bx(imiai0C=}< z5(1Q0P;Ee?<}S~&D(_vM5ULQh{Q(Z?qM#`=4p8M>8f2xLH?Fs~wUpXQ9Q{OVE9p1t zmtP1V4N2F~BhMHGW%^_cisLdwiJb5><8D~nQk5}NhUSHZeB!+)TcPoFApc~I0CZ>I z!|PnTArf6-(s+tF1+q|pCm|BsugmRWbPvV#;H?-#>r>x{eim^F%CSe2G)~HXNX+DX zXyo={b|~cbX3t7K z47s`Y(ufQOt9;6hIr2>&cHd!>6N1K^5w;tf$u|Sne{=UsREjzqS z6bek`#^w}BEs67;FN(Xi9P+(oB2C-iKusnWXf#jAXEEV8`ew zcD6L4KR3L>JG;#|S(ewd-vuOAT_j)7mq3yjE~MWIU_biZh-*}1c%5r_GFtH*gr3;K zZdwZaQ5PNBI4YA<^2+lhx!ld$Gzp1@E?}Q{&||MQjUao1v?^~vQ3ue{5d;}>%TJz&~G74T^WS)_n0Xbx$ z-NokkN-j;dhZ%j;#>S?VLO7%{~Nub3% zlF4_ZRKJcX;BQ^K+RPYy)DhOaxY#og|M7PhW)(@Kjbf;b!r1v{_j?%<(lwhEL&`^X zoDL}SiE+PXC+v9jl2HpbO(7QGoKWRv2`dHnc@+C+6O5?`Ddz=?hsJRZqW6LB;DDqX zC;|zHa;ZfJt;Qkft>Xx7v8z$w9Qg6>ftD4@%|HkOha`g)^Wxm>5N4JIDIwKa7$2uy z5J$#p>ukghGwPQ06_Bgr8~bGh5(KC<(E5V&%+j)1Ezj&Yc;3%3>XfmrDrjJa14Vm z7Pa8WD8NoAil`_MMJo|gl0(6p;n67_UW3h%zU}j2?dm}Ec+a2SgHSk#rhd2I3BsT{ z2!_=NMpKD4WW;4!d@tepDZd}hDe8s7xahFfAmzN>I@~`4Rzh1!W`c)P~zo+|w>xP?{0PDn7qCXy+5 z(fM@$_@}9xls{izKRYejnr#h#PnBrb3)W{_<_Q}KbNyVf77zlI6A3QcglII6SP*q+ zrn}>c&($FJoZ5sK3sh>L!W2l-`gcwGar@w1ulO)nKIX=EIPblO`Vtp3R3gla5 zeF#%3nYLr#u7bV|toDIfL5TbYgU2PT#8k zw;lx23&e|qTg~>nr>o0wi!{u=082{6DZt~P5*06Xhtw;x9KvXVdCiH`;91}LA(vam zTe|*Y6|If*w-FFs>t5K)%d^w@Jp3XHf3k(!EV(9VmI}YfM(G+4gKl9Kkh>7WbiW27 zV2szgR|4eExnn$8upXkK&WSQ|t=USFI$uKE3B&01IL81<{zoR}+;q90Z z5ncx=pZCn`ziLg)XiWD{i2Bh*0i`qQae=?W_z95^NlXyko$dR8j_#fxNL#tJmLUoO zQ(6!>+kh`auP(j$-3pVfrq7TPFC93)dc16`@E!l`{<P}HW+@B0tT&y_bR)4vKP{=`4UXm!0-^F9S99J_@N!W4K`4nF~V?FGTYN^6TR zq?t9PjZ1`Ry}MriJmIyOB5vF$4S>=>g*78NaYX@m{Xy0>bSTha2zbQ*33x4O*}N!3;oE|`k=P}o>&i4snPVgENnB(QYrq4=Dqgz| zS~%foprJRa59Y)=R7n$->p=Lyl2C!{#1mlBwC+_&2FKfAB*8g zi6m@QGD#~baEp|X8H@DoxH~mVBv89z{#2y5v9vJEdNMXJ8e|qD%5aU&n1<|$wD>Je z&;=BX;GFo;>PNRD6&G;TAU1_#1;JudedEvMHe+)<^zlv~{}Ff-3BB99{j*!exKc#V z?aH(gPcw%daY^l+aTDiKDwcUI2}3K5NV+wo7@B{`mJVKgJT_we>;Z5}=C^V)kl;=f zLlXua`o{59eh9pTe*|995`^`7^UGE?{^e$;x9RDE=Yv`ds^Mhbw<&HlcJp&QzH4wVL96 zH!vJg@M@?oRL-c@@Uf0^1Udg7ftLjGAAxtsOXGGbF$C?&h`K)@64jOqeCliCBO9kk zVYm+GAVg(E(&LFx1QZ<7Na|G&oK6et31|R&i@0W@d3@`}n%x?sw)&)bhgP#f#IFri z^PArqkW`?>h*X~dBtWxe2G*#HUgAFj?|W4B`R+(M zS*RzQ730g%*KrK6x3Gg*SLKO@iwX#C?SjYtMJ|tnH)JY-#RzzEIENc-vEty+!6h6o zwM$lL@g1H0lDxIz;6-h?cEiyz(vdky7xX+rVV7T={gC&$W9VIPt4TDfH@)6e35y|& z1!(07NtFc~Qup>^IPlf2`BJm%)RUqJ@wTpwCR#f&;bfaX<5EH;BDNfvS0mlU&tkGt zNlmq#O8%ov%yVR=~jQ`@B#-mtZ?g{X9QX;(rQ~6d`kP}87KXm z3G!2@n?_L#dJP8Az)zDQn4lHat3NFucdeIrn`OueVxn33A$9$8@L{oU3W{mXp*RT_ zEBH#e&|9phq5^VgBYRn|aj%qs=|%XD+=BA-c z#Srjs_=ozVi8VW6`#;TCPKiRd%WL7W`M4OAk7S#}`xGb6#GeR^S@$@+1N^{>26uCC z9H*H0Iqyd52<_*8z%cw@7jslNN)7oLu zMBYQDF@@xBF_{wSJkU!Rsd-La2_&K8W5+cSgz)AiN?9DNNO5zt0-^gQHaooxNHWKR z#VuKoh_2K)5iKLr41+)pr5=wJi-iO{uVW3wo@cx%CU&eBT8GicPWN!ZPQf{t{TP$wJ1=8aEA;0=LCu~v{ho}_Rnz;PL_99jllr& z;>bsT1Uyk;2q3PsR;LQ2bM~C`pee$=PcX04{{%cGW&l2(E<0h9w7`PFvhsX&$%sGY z_qSMCb)^jXCFO{H{-O}bUrtP6a|!5RJWYDTE8p<%-%U?+O!inO$E8*NGV&mv3327s z;uX>`fj3kBPnu5J)46G@N=SO(5j8Qs&M_apctb zERIW?liS|l=1TbrTZ?Lm_1L!xGobZ%(4frN&&<6qzo&5pv3g_DlE{m=`&McGBuv`w zbeq1XEX$Z6y;t%hki_s9cwIn?{b3_}RQuwBI0{LbrE_xLp+OpMbv|PPdWC%LUl`L7 z^hM?8M2cHeLu)y9y3bT!imuJ~y;^(8a%a1ku6ko;6c!kSRV>S>1^c#s1g*MC-JOUj z9Y3PUf}P2!Kd9QFSKSRT62u~Y4!tq!_1JZu0F9TnEwIwdQAGZYKwol z-SX1I_Xzmk3cO9jS)oy#k-*Sn>kPCiIbZr3nqVGRM_VMME zZOZAPBD5!fcxizWueV6W^phYYKA12bf3GY`Qu_&nY`65DL4ermhzHhj^;0VF< zj}1K#N8@;A!GdK6F>bVN*l6L}#n6OAJ2HNEUg3%ffYly%$lNHl6C_`QWn+5<^v3d_ znZlbW!Ncb_;Y!wYyek3DJ%1AI>f-Fjl2)W)1HZAueUl$+!SQFqcz@P|<?vx9-lKYO*##qjqRd+zoxUzFV~&VVCBF6WimCf;E0 z8M96+hS6%H^~P8mH#9hZWXjlumXBQs*Jm41kYDklum&qa%RJ~_w-LfO_6S!Omtxne z3x_Wv8O11hxwIC`L9pS7_W=`)d7n$7sIr)4y*tn-=gxK)DvneQ5~5?s1|$A1<~EQO zSXA$s0U)TR9KT`D%5{2_h9!Ev0a4ZnFBZ(EaLg*{cK#;u(kA9rb7!jv5pNDJZ_b&A z&YGY}R&`;Vjd4p|XI=eqXQ8fPxAw;VZf!gE@9ad~RaU>mJwQ`ZWc&()8>|DN^*6%l z7d2T>Gp&}b8IQP0|I{N{xZ!X#QU!z`J}o99Lpw~S!iWvG>?I=Crk&X%>@lIs|GZo& zR%k{!g-LFf9BUCyE{=GZjUP7-$cpnK7lb_un&;lh!<~0JZ%;WBNY3P8%(a&#N3Q3D zj$`cT%Gw>A05k3n3mbS2(RiT!YZ>RPQB9fKe@kH<*lm3+<)D7z2$2zJp(cXjgxrLA z{Z&(HR(u{dl8^P)oY|?048e5DZY}yd-|Km`Ivu(^O?y0@U|LaJA4#<4Wvfun3;6nu z7O2$mYGq^VXMeyg2w1Q%ogRGhVv&b;lF{&fZRoYr1hXE&T%`TgcI)5f2}_NUywp0G%)xR}Il25ObdTy9NK znVPf-SMc&$&E;ule};N}H+mhqxLgy0%Jn>9m^~FDr3M-Noqa6*q-I-D3{HAhi-08S zt0AL-t&EVLG<0$hPjXrwS9aBFznWh^wVuYNrt0kXFdOD}YW+Mb-;avF)?ETX@i@!v zn31?$V5E$B>e95v4RPQOI*{}>?;>Wa*W zUiuqJR`>?zsh3Xo0?JcYSmBzd6ii@|r&J`BI!~(9o=}#QCN0iV5VCk&G5_(DWbu&z z?E8y-1$pQ#U~kTYbMbebl^i^neF+SHToZ>KHApuRU%BQ4b-ZYt{rtlTl^BLrkC^cyu%eGx(;k?0I;hl=0Gp8*R56X$HH0& zy4o;~t@zA(Gz1hFKmS9E`x{tMO~m=Hz%5xO8v*Iqa6*y6{N%w31VpjiBd#MUqPUv| z>5q+ec(X8>(WPPMnYaP(>DXrlxHBvk4G3v86z(c&6S<+WE;MfWId$Y9`6=?tZE+Nq{)$qWUp>K^)Jc zBtCKi&u!2M%EVw8SdJgeeGKSm2eC)#`1??k+yQN~2kUm?5H;!+A1%b8`pIZz%24W| zBD5m*pa^zwhV(0=G^{qU#qaP3Ci^SQoCDPJ!GJqyk2PZ^3Cf#izj==QAwsNQbbB+3 z#;9sOgPzLZawqR^EXih5?zP~@1cX?nY^%Io)R^F0JJRvCi#if__Ku!)a#OYOe=q~> zHa~=(=P-Z$Zx{|V@VrFCz4KJD9|A@`GiBJiG12Lw>n2*YG2RlH z2}&}M7>1jZWNt)zvVU{_XR&OoeVy0HD6_STV~OX;l+2q&T6qgLrsEO`3U|=kOQD=$ zZR}t0?X~gT7weW#M1MZ;e8qLQqp&g9t~GvQ7OLyqX3mT@rI{qrqogSN-(hlFlS<%iMA48CvE3_4Hdi(vguj~8|cIF(IHydh`1rnw{}V_!*GAHaI6 zsZVKr((uEj33gxNpgW8V-fcf;-YuYKM^X~ZoG+R2P*pNGuGFg{qr8E^MT&w%YD0aP z-2jrp0Pu(^QXxxW2szz{c0RzA#08)cN6IfW*DGLw$c&LLS-}iB%@_Y`ZMy9Ep;=`0 z2>^oODj6tiIa978W+L=|s=@Ry>@T8)^Qg{-{K;UP#_DvFd6Mae({jBT;J1|#x|O_L zE4j6R>35jYzZbs`HTHZIH2jgVUZwWU>2*AetR4gVmN)eDHCroxR~I;@uK-&&a4vac*je+G0a3QJRbXw zK;-4`DC+an&cAQ7S{?=l11@Uq^fJDTB((X6c|J^a+%ApO3D_m)zShgrHc`Z9f}U&i zJKqG29!~cWLWdU|N*qx2cOI?{)U!)QohJslh5YUy3lBGmCf0vr3iowk1KuN6ci65| z`I>sn5lo;_dXX9YkXXxmOw*`o3sK5OqM?DQ9!tS^!^_5fL$L!G6Ed|M@N8ohYWQ?x|8A>2 z28ue!nWz%S3e4slV*%^L3~$kRJc-IM7jF$@>LO<}>o0Az153so!zxMp+3YEqO6Km; z4sFW;(p}%ewBhjBFe2|em9S*i0u@9TxdIDWCx>Ii*{vR zEkzZM1E*e-7`9yI;l?rJxNasH3XV(2E7hK}6h^feU*El}JYPow9Qy)m*q!$YKVv!n~LRvWv+=4Amo}% zX|6cYLsi~{m&&E{0P^aL@kJT^os=RdqGkvbMX?}P zQ0@NfcpB^Wi4dQ6po$-5RVoBtrR->Hi03=uk2D^Q%7zu)iV8D~;KKwKzdDtuHTX}EIX7QTX7ZosmPF>cmunWr5x@Me?(PW zn;)Sv`WSO`S`w}=&c-jZ&i9E8LS#6pykAm$^hwnhqi8;W;atRIOB7KkkJ3zX61NVMuK@DJ z-*#!6GIp>Q3dTUeKnLaoM+#$}te6>AAu2??J(?I&N&*uyrI8trEfb4$9#fry#R{++ zhfbIe^}`YJrrIqeS(cz^4+$T0BIp?iVh9s0%Pmb*KnDj8Q$G^#z=t^%D8ryySj-Ao zWyZW?oHCSfA71jOy3B@owJak#fL2{XGI(n|3YU7cp2VPZzP7 z?D+8Sv@HoUq0{x%0QZ|f!hO>>3gN+ENtZ~}=z;BvGNQbGkGe^E&}OT#42UbdF%97s zxC0n5B@VPFT}so9&{R4FA$d~Q;`*)b`} zVoWh?(qf(qvw~u_?%roft&5iGq~SrlT*^J=^`Gn7A9ny|LQday& zKQ`IHTK!Pbr1o52r_ae^!#7#&P!J*d5fKYi10n{!!Ftrj z=+akSS=hZn*<#B7q<&ziL63Uvy=#0S#6r(}HA!iqXpNgV+9! z!`ixCe5@vv1P4NG$P8NorVjwZfZJ643gc54{q<|6sQea4L$QvB7=rSdR#wp-2`6jP zU-WTv(r@&smBLT->a~>5M`t32;yqokY|v%5$>GuYeOe;o@F63Ln}=CKRK6g32~R>2 z$R_FU#AXbP*6X9xAGBEe+D$d6+ki@Bp)HV?&6yr5H)(whfo>*9$iXTO$iPNGj?ueK z?6^G~PIrR4p08x>c6+Y+19QLwwBOjvGSH}9$EjK#&(3`e z6I$b8clV=JDCl-aEpSEF@fh1Dxe<;^+Q_$JHjM9TEjZ9tg1hUxCA$I(?&6R}gLcYP zQGJBeLb;DeB-k2J@V`kk7tQUHpR6G;7!E9jVQvtB6Zs@C5J1r-IKZvyP!>V+__D{HZ$rB za!sOp*PRW>eGtN>48(4M$?!sbNJ7V)k~Pp$IRDvh!jJB5Ul?>)LnDBGwxQR{)F}3s zgZlTEldWpLGY!YctR>3Z`TL(f^Iz?@S>FsCb$#$jT(7m^u3865JZ0_6(=?U<-z3G_ z-bu{iYkTDQdhc#O<_BG4}HTwU#3Bc+A2;mtI0pba})m} zwX?Jya$c<=hrB{8bg`ALF{2~cwiH-te~ly9yB%h7dmCNMD!m@?`)FjA0tTkyxl>uAg1|{Wyy|Jm+j+3I!H;rkrOvZnz; zyPc>1c0eb<=?wP+vyJaU*B%@J2P=%r0%we{UPi7a$6G(=-{NvsbzUOAa)x~K<%tc| z1CFjwy>w@8#CTy)4|^zc3E)Qog^q5wiTuKV@$t?14>(ub z;@9YPcsBX6iZUz+a$IM>Zf<`VxE3d`ZZE%)R93}5Lzm-Yx64(wGn%UZUBvakD#mb+ zrTR4eIyvLGh3I{q{Pl_yrpWvAQnHEQ=w<5T_qI|d^?@Mhajp0CJUAJdcd5SXIS3ko zF87;((ffWDd76-f;E5r3Rp?Z02JqfyDsRrOmc_>w&LH~&BPv`m)Gbxnh{$`oi2gMv zmlQyqs3T}o7jLGvtgt99$xQATxX{GE*XT3Q-2DAJXgA%}a)S8>F?Ytkl7tvh&Mg>Z zkWn*uoHP^dl}uqRGX{XEF9?8%NnH0P;O`EN@Rj|LqqGEDT*?2Y(_;Tx4eaXtLzoUT z3hVHCtN6OVC~2BQ`+rm#i#)M`VlFDR7YWlN-)BSKgv{;+BLFju{s2AfFOfFV zMQ}I8eyI4!xpH9C(=Of<13`9yG7wUXMd)U?QgwPuYjSryy*RaA5=#AQRj_YCyfJaZ zyCF&4h|jC<3`fD+gr zM@2m^GlP0Ykp%v|zmP{qM}u%Njm7khgUff5n*F8FxT!*k)3G0{G-%7TfddJ$YtR|a zm0&uCvvQdk4=L9p7*Mae=tI`tr0g*d{62X-TUE_>s_xZfv;1@b;+H@gls5+z!WuAzFrr0^RHvOd`<3VWkT z^20rWg7usaYuY2T7EL}-GPy}VOCd!5#s9~mnHgMb+;c~_&D~*(NT~9XY<`rM+^oHBxMyg~mP+y0o6 z?*lqU^_AaH%fgnUJU5yUkt{TF9UH2Y^9}c08|ip(ak^7k=n)I0=~ajCUX8R%!1}W& zyCM#=D3E?n`27QMJ#yDt-z+K_*kc#@XfbhE_FvA9M*6GtshASSamB9Y~aSj?)LV9~#R?XbZl0V=xA>O9Jjt5nJ9^nHB z#QvH`lu0%Bpr?`NO+{N>Bqh_(uQ!n*s+*-+xB;qKU|KLlhmpdXpGKnDGjCdP;Z{l8 zBd|j%UvEgX+)`ez9^d~@c+gn$;}Eq*12p+n2j`mdT%PLb4GsT?B2ETmrTu$%(2l6tV3P8o$c~VK<+j=uQza&B8^F1Ze z7)BQYl6btX0Ix1K4|4E0y%Z&h1zJnRA+Re+zf;$+m}OOewIYFVVeRPh`)Rh*Otj7V7TB|kzj91^lHxSjzabT`-mL#~Z5T@lSi^<#92P>Ke1vfcB?_D=fsm z-VNPv_H@5A!%93@`Y&vqXL4L=z~=6I$m6Qhl@f}J>a!K@`khWaVR$P`?-(Mfr?M85 zgr=cFNg=14-s+_E?4Kv}WcFU&zakoR)3;W`#{(oc1E43ptEk`Z(#5R~N#C$UAI@ZC z`Hyo~M$Dmx-}jRMLNwnm@O$p9uUOgCsd9Yc4SXi@+of=*uc?N8xd76zKK7+ZgiZ@Z z2kh)(RWR{V+D{jR<9&KL!1jQOL~Q={$HUg%MqyGZmCX9Qs1@50(&Ay^iMGhTtfySlQLz^9D(mfto`yYRtf8WyDQ_?Mo;^?@7TXc zFigFaGtN+NkdwqX`Av5Gc~e{YY>nI; zVBle8>6NDye1I^Dk1eRt&Vvq*%nCi&gg~~hrASR4ee9Sj+z$N9N$yi<%SYwwCgP- z;VWKV-u<0fI$i^D1Tw^2FBUSnlJ{oP$#cDuQjq}Ut!=q-mAEF|0J^GeXNP7!DVztM z+6JI8YuxIN`%>YYX{|rldP+IJJ#ic^;&?R#Eb+Nxlz0Lbv+*<8Q#?L~G})6te| zc{%g2!VFiDk?S=I+VMH^xNG0?#ay#@JgWMvBTtH}w9<4f$}gL> z3x#t{2WQ%=5ZlVJ0%}8eBMs8YL-h;Wa=IF`S1c(8><5;|J%pflOsN&@2d3q5cj8oQ zCQIA#AqCI{oF5@{3ci$YPQL`W%e=$9_n2u#K58p-PkmO4eDW)4`a_GSB?7>{RQp>c$glq?}1-k!b;WLzYu9I!Mlo0BZbN_o|6bg~4+qXAo0*$W2BZ zg;DjuoP43+ z?_8mjiQr}b5X^q6alY{Ro=LELsA5%~U!!UY$s+s_)J$y9;)M>6)##RqJ`{UD-&1Y% z%NT!~w#XHSu{LSpMxP5<3Dnkolq)Q%+fd7k5)9ZT;fkZC4;;sXC0FlQssOF>q$wiK z1RiC2nPZ`K)=X-uWZ=?UCilj$cqS_is}rHlOR7*}%akvt2bU|g*crxob$t{HIDf>=fB;--fO-qmnYN55=lSzSk+^4r+2^1gcT>8dR6;*(m zOXRaBI^a$|^=P_)+awe=hxLe98K5iOkqqP>8k<=9`JrN~$=g5TG;8*RTFr8J(!>#7 zk{@ZSqr!l+1z!@#JsD)%6tdQk(Dlzovqqqw=kjeYjV7iVdPl`FkgGza%|ELvw4}gN zQZ-^~oeS;^WB(*Q0)@`JewGiHrzel6Gae7CSzo)kAl}3K*T>1UwoAzRqXPPst0B=0 zmzU*uRV2;K_eBhE2MCtf*h2)f1O`IGHvTfy^PyABL|HHjRB0xntc7#reFjmr3UC$rk<5f@dK}xfoA{(*U~gXtE5px|+*4_^C&R6nhpX@< zxD}U7hRTQS#FT|t*%SCPvfD(2PZY!yh(CL#%t^9>Z)n$FvqXq9V+yar`kIj(Sgd6u zy1Z^r2`Jo*!hrujIN@pa&mh7$)ffnM@J&>A6Qz{gwrqYKrOf{ApFbP{LMIyC!X-jf zNpdN_*mE}*q9fJ*3t^-7w(DOr8!^zUb(@|DqK;i!Q~KzWruIX^!y^z~ir2jZbI}uz zMj;AP922MD+^nIRsE=a6KPJ?FBW4FEQYtv;330c-=&AvIN%V>W!=>?gQmYHna=)Cd zs+OpHZ^F@H8jeXZTHL#vNJJt8!*lRxQy7}g?c5%&ui4`&W=Aoga-sq_5(qoUZI`e9 zMgR}h!fpUF2QqiZATT;Ir7S$O{((oZ=KPcNj=%xQ{O&`Y`rpLiiTt)J2cUb?^w`7; z%UPS3MIX3=&#O6n3`%1%DRX_io^3p;VSr^XD;wG&`M9OUFVnf*Bxcx&eq~T?)?Z=f z!N`J%6;)@GROSa$6^t$3jy{k z*!l9ngz@<=dO#^k?PCPg`@Hgk`E4ZXSegDREAFad={Z#Xs$%S3gHONXJpj}$$bn7L zYX*B%qL}SHm;yx3fqr(AIgRp3+~ADEwIOD0{7?VzJ}Th;}c5&*o6#-7Jc>J87UEi}%s;ySWCn!Zn+ep`;>mox;? z&IEd5xKu@S_wz)ll6wbdQZHKvp$stO&81a7G6VpRlc&;&P1C(s=K7(LzYkJ;IC z1!Evp6q&DZWXu+k(f@DeU3)14;6SKasZ?JY*b#Yj9x!!W!@Z3Z?_~R zE;sEgPu=$+Tp_t3wxgZXI3?*OSb3OU6uBh>CTv}WP-3iSJ6f7y&P!G+txo%5h1B?K zQB%Ov1#}lM?D(a^?`~of58FbS&Y-?%Wo0psRgVs=yqR$0#ItUy^KoD#z4Zp{lW`fU zlf2pKEb3T14o!bUumrASR}ckzTYzBQ4H2lN-#d-pD+^ia4{?2J-w+D z!@;6>1Vv1Q{+{+YgFg;!OV~)zXpU#BI(18^u8$uq{%Cs|F|ln#BkA)c3*XUIQ(KjO zZY3Y5r9WE@5D?bE?eILKj&=&P^3j2cb*2GQXxgbCo_wFV# zGAjhW7dGPJspMnjk~xNZ05yJbadf!`V_65w59AYbq5Z8^(J})@%92`Y+CRBXjvXOm zQMF4o&6)uD&{ZBwJy=SedAJ(2fKoym)S#xYP%~KQX_@{^5(@2_`05NGCS)J~0SyPw z^f{mE2ApXQPa;PIxX_J7I)6!ZVWrv?srKfy4a2fL2kT;FmU%-m2ZNZ71F-2WAs90% z&3vO!dc5VieztPgM=c!LW&T<%>vlJ+p_TALA2u8n zt&xBoUQfGD^WyqG4WDHIqZu9n?Ztj@f#IEgKe+KwA5pr$XjgANFvb(hM1FX1)yxx% z$;Kfy9K-P@DIrHrJ9sQ#G*xyvmW}DB%+uCx75msPXQMa}4j5BETig#NMuOxsRfxpC zlQKage97m^zchSf)~aNFJxC=slV(ajt>3e~AJlO2G}c6KfcgfpC3CmT+bzx9MKi~z zlgTgo5IL_y@GCf~T9rT0)XmN+aE!{HW?E@_QKRz}B&yG8ASV!ya-2XW>TOe4XASbq zMx1Wb#1xx>zrV#CLCF;sGjFtGoU@sbWoMgD;rCPUwT1Z?h zx=?b4#$!L|@8i+`P7`rxspNSFZWY$Bu;qYow!=j}$LE)mtE+4*KiHW+urohlC;l<6 z1_y$BS$clL%(d2|Syl0R+%tD!wH_-UN9+YoU#JwC{&Q9<{*t6t4=SY>B?Pv$T+8}5 z9R+)jst&(WR3eQ)-cPXI_;1c{N?012#`Ia-h)EWKzm8_zzarf}X5QiBswD(?9R2j2 zcCZu?_TO?bG+)AICO_Q5>7*TN=Z*qL+d}`7zHW@P;q;{E`P3mKq_j`((*lEsSXSpPviXi_tR2pC&DA zlzS?j?8ZF)BJRXdoq>P91&jJv3Edp-=#5$kI)TiHBLrg9#Y|`DZ%!`+7!Jd4((ET5 z0#5!NbsqB)i)J@(<*e7Vh?LcEu*|~M>VyOiLyZ>f3IkbaQeBZ1VFRRHmmJSurNzq# zSKlYEi=WBkE?|eaaA$kcx zb`gU56bU_AbwCFIWuyMh=(a|946d294~dU!s7MTfi4geTE1*riQz%5`^)P*v|Pmg6EaVp6% z1gq{7Zb>zi%9cUT8mJ!wD}YV`DJ#(pK+3O&TTJ4gj~$uKX})?tEmAw9sI}L}dz8_j z9cN%B20`ZsyyTlGLpA6zmI>wuI(-#TYLP?=YW|nig^M-Aa4KiHu(#CX+H!1JA*DXB zR)6?o^glWmBp7JymHFbh{VMr?3h>EHCI{PZ$Le&G%Y%Jc^?fm*%eqY-?actP=_Ws) zLyMoFnS~j2b;qko0%MY}$o;#zM_UXB-ihSsLBrDhA4xp9Hq<|#z@#!X=I^C;y9*pXb+Gn!A+)bX`Sxt&;^=Qxiu$Z)$}|>0WW+dnZ-w#Zf6uM>+s?}ZE+bh~?9~!A z+tg@)SwNqL>^%uf0L3-Ty>htGbja37|AnikLkKUIHMKFyj!GxVb!T8C zsjlJhHaeoQEsyt+@oUOXT~HsbDf}k%U9D-zGr3^sxC12Z^mm}ElF}%5SY@iU-1gOF_xAyxVu)WskXAg~U3XJM9cf84K=B!l z{Bw#{#KywhhqOAqq3!b$g*CYBda*oYd8=RKAgg_r)k;BBA(AhqEZBpSTUyg^x0sVL z@{nr9Cf#P2{ZR2``xOA{n(_(0N1$)ySen&|m0a`B2`zeD&mKaZ;`0vjbDp4Y4V-=+ zY{Ac$8^jbumog8UW(@Db&w=5x{I+uDj?OdqO|U-I^4)Q^c2HR3j^!%VY+6wGxM7fOSQnO05eYMH~Wo1BYL_ds<7-w(kcuv(ZU_ z?LW)13vL~3b)08>yaEI-q0O$TtKUVQ*J82Lci)*M;LsTnSYtsYl}Dw(Eri4q95I%K zJamFm;-ce}?BP++e|QnQut^O1;n%7O4moj^qDHF|eWKjz1cltF8}AWpNwH;j`9)F1 zlAV#5FksK>{ytY1L@B!w)NfTN;q1H>Gj-zGJAW$q`O5vA0QKq1{h^S}kq_-_9P5z{ z%iT@Z<99%t^0r13yFio_+lP)ZJp{1O&IWL&4`^p+=zBkLqM5g+clMf?e&+djg*AnL z3|Zy|{$L>1MEke0{D9&kmjkKD#P^~>Vx^9i9m|67E9H(&_b-|?OVb;@<)YUzPdxRc z88x~fqxe{$0cTUWA`G!#=$m29VY@RyFITuev3#g6X|pKP-eIfKa1g9%A`>KVa>iMa zGL?m%OPlQ6IU=Lt5Uh2w8tZ2lVu8{7f+Kt`ArGV6T`zuNR!V7jd6=Hv^Ny5R{@$3( z5;^Yr{g~|Z+gq7D0b#G{_O?6vR4^#jz!aZ8eZ9{!ie?b1+yNqyp@w>cog3rckrW;d zL#>~OQCw{Hfv%f)TBwtf5)(D*wF#w+=H(Fh0eqiE`jU_qXK zA>5VX&JMM`d{ifcLDY64+S2KGuIBn!`i~=0UA@95$xY0KYbf2=Fo_!gG$69alI ziypyF^jVr6;iqZ^MLt-qkyODVMiYU;W>ExFpvQg!yHRO-v`b|Bm3C)j70_He_+_CA z(h6xO&_{G^MBUiusng?@Yvs}Z?|BoM=Mxm-m_Q66VZ0^5GK+{>-}bx-p7_|Q`Veo6rK6Dm#;H1E0re#`2^2@Pi> zf2{x;P4+a48m7dlSZkUZh;hK*xF&v-Txdwv>}eTv{={BjbKF*}A3U1!I5Gx(wpz|$ zTUwex_L@M_va*@V(jt+ftYQyuwHsZG79mq&(j z&dEjfdcdL2nhyc)7k9_0dR_axNUY2x~1&$(2ZIa%W4qj$+n%Xi^O^G|G83jZQuo zdwUR?yz^x7lib6UL(4|17x>B?aJihRC2mX}z`q|j)fHqq<=gy|OqT}tw`svJ*=0>R zN!pz3$NW=TyhM&0jJEe%PtX-SO!Tl(lGY$U|4j8$UL^f_cyKaD-DceR6jz@EvbUN3 zd}WIZdtKQ6;u}xHnpz#Qyh(KfvbF(L26EpLbTSei8+g9d($TcOmrI9)@1lIbbps_h zX5vfF8@E>88`j!EbZWt!Y3!TKNA&qW(lLWXcdaOE5VBBe{hdK=j7vb>Ro8F|R%x!i zYGDsGf8f;qnrAo~zz4ax!$wG@+*~+ZG2`DnoCq>(+LrN-toq~yeQCw`Yl6o>LT>|^1Ib9)p^WjP;mXmJ}B5J*P(NjUW zAgf@_h+^Eiz$Vl{3b>vsa0PtxC2qC_>?6^7w~7 zwtk)*lg28gP7uLtLZrq>2!tyJcY*NmDj%}KmZtE7iA^YvUUgAwf%985w;D0j2CS_o z+@dqq4F6-6)}`}@N&ZYVVJnSQb<*hm2lmKo>|bmPUYQoh?45j5yGh%tSKC)UkCpxn;q`PQq=pZkH39PbOxW@MD za5dx9gzjhTrl=!a4cq+GcN-$uWP%9M;sUKN$I~n%vR*rF3FYH=T0&WcOyn8gni_%< zOb=zY_-)*YR?~ff!Vb*COUQ;kuVKi7Ye5MuD_n~yAc$#Dl-Sdv)fVWuop&N3|4$SkN?9f{kdH4=8y z5;9>grnA^ir@?9|aYbsnV-CZS{ZK?QdB779stp@v65m?&I<0=BLed9ib^=i7b+m-0 z0%Z=^6n5r!C)q}V`v zsDnKi4l>c%{LSSP7?q-LK8plT9&Ir&m)d;+R_>M-%oL-uP;-u>SxS)s4)$?yyUV z3H3Z3gOAln%uHo=hapo8>Gn6p4X}$OdA11`cbiSY()uQD>`V2wkJJo#x)L>!K4G$G zS?;Gm1N=g2`@b97`N2K^9WEyrzKVB{i32n|iD{pAXz)K-=Tm+@8@V20$bJjc`6S>- zthb=2J<@w;;FN0gK1CR?OCW2`yNVjW25O3%hW~`XmPM`NW?+)$ruVduf1>@*Fv7Vr z*!qth_PI>FNC2l>(3%7Pv9B|1{0#K+e!!Et-)w#C47?>J}@8kOS7S_AO06@7!})o5WyH(ye4-KvADA&qrE)KMV2#X+s7b7x!YddE_96MBT^ z>x=q0d=<+pzwj+js*qe0(SY$7S!QLX@84>!(ck$J#1{ zb|!~Vk7eXi|Kw@#;Oh|IEJeG*78s#%<#1aRxKj9Ro(i&6*Hy$=Rv+@4lED;@8OS8V z;fe&6=rn1Pn*iJyqNzZgPQ@LN4vo;TW)4%5TPIDUt*9u}v$j2XY*8gQYBlh=+}0+A z9k!S}G#1cm5(uf4;h?n*NLa(dXjl?UMs6{cCbjq+4t>jYFdE{cY*~y{;m{e%WdQfJ zHMycHKwyyjQA4)Df@B1wO5=A`1U;i*fT&EV|JAE9F%~b9B^G{SJ=Zp$zp>Fb5lWAs z%AB=ehESMCCFjuJlIh-I=*sZwTDPR{y{&mDPI?OF^?NU$Ll z@)H(rnL?=dyuj$SJfUGS+)G^9UJ>;p&;md)l_y!u!NlA6?Kewmyurn_a31CA6J8#~GnoG1>?Ao}NNPwzS5M**kpV(?t}qvKoxAdh}4c zi>qiM*WGA4CtV!JGb?pBa4wC{#Q|k^v+H)0=TH6%MP~~=+QsJY<#}O;g87<*f$Nc- zTK7!&>!S^K>-!rZaf~w8ymlBCWbjSLvB?9^wp7HeIdLRh$x%kbH@wLE*NgsIPV5G~j>3?iB=}E2gXp_Eu@Sk^NpWB5Bxl9S+Rnnu&8)4Lf!3azE1B!v6>|z8_%j0K z(7hs(G%4Vuk(HDq^~6FwJN(Cog4TM505IX8bE!Vq6*MJ_`h0qqD8N{=bbZy z$(&NeAszVB>5R5~$BVZi>QL!rzYKaplz4A|NhF-9{VNq@$b5^?krl%Db51n%gOegu z%|H-i^ZX-eQ8OV8mhi#!ehgOj-)5cGXahje$yVJ<|_Z#N;WE6>!7;POWY(+*Gh z%&yzA`NoO>roK>39R?m|!Z1R>m;gqOa3^S}1ei zyKD}~S_*0Lnq27r?Ic{-PdGLxx5l~hzvPNYMXROZ|8?)%m3o~iy1s){CXaH_IDG|e znf%)0=XQ(qZkC<>-pZVxUx7x2pGK9jj-ljh{wo3r?qFuqOw_=1^ycTN$MYkZ8^6x{ zJRDqdk;n0-0wayuSSgzm2BEpiuj%#P(&Ml`ujO0TuF|r@yd~c`eC-=ef-qhkq7XmH zFaU5VOcrRAC?h!59Jdvs+~LR5HXlR5Xgi?)9&{fc=wS-qN++QkxqG?u$6i!7eRulN zO^myymR9g>W{UZTP#$fvhN& zq|C+tVe1~-YhBwkOviRIW81cE+n%v8W81cE+qRt<+qTnLtLp8l>aG96`1EY!y3b?3 zfsH3sGR)%$#gO^Q&z8cmR8L@BM-{~gOB9-{=^}o2-L>dN{^)R)SYo)5L8RZGeI>P7rSY!~tb2B9FHo#Aq%uQjijDLdD>ihIhty*rn5zsglK!1c#}E>WFJS<_LVq&? z&VP;h(;Gu#yn#WK9P}H;onW>hjDc>^;`g~&-P~OfoU-Mkm;9A8M{a?#Y92?sCcpAm z1b(ISrAvub$vio&Cf2ZE#mY-iO}R^0f!xz5h(o4<$Wu(bo`WqCFH(`SYPeleJN%fv zGYSj$&reLc&(n8{y;J-VuBG;e-@!EYY+ixd&Hx9*-)~tx`Ju?O9h!quk>Ozi`^4x?!LzgpiBg$mhg5bcMm`5C}G1C;VCVctK3rHA+X3;U-GtIq)x3Plij#Yz9gWd-#IL7Q9iPqrB0hp z^t|s`w2yUPu7Y-*_v3+aN&0W9@lRKFc=j6ZP&ov2xGFLi3|9l6eJ8X zcrsdJ^32rxV%%2$@`$3Gq_^AF26-_C!hj^KyDXg&w%ZAW=w>~cQN@0n*OgiN%G(sU zp4(h1n13Hl$nHe{v>BGnud0geDP4%aIv6=gH$q zuXGmmP^plR^Q2_;l=bM-?n%%*^W(dm^-$mE9a>f;;&Oc~ser1?yH^3IUz1rMO}?bJ zT5WH(oSM>HQPuerwKmIOZI$3DuR2;~6Y^;Q@0l&pTzA==e)DDS=l3&fv@=QaIR^F2 z#?s67e5;XExe1-=tS8e`w$081h(iw2psE-;nLUMbUtjWKSBK}=+R=17;=*D@$lW7C~!rP;JFa{;ShW6@>obCGFdbBm6t>$#wI`PM$(dV6Cc%nOgSKG z7JlwIEfb6)@Y71aR9}!qRU7_0%+p7xiIH|ae)c2!Nn9rkuIOtnUSg2j2L$5#8;8` zBt@XUe`NhN(Q!ruuZi(&j+Z&?Jxmt+gO4lqGf9o)L?b%t>$^v2Z5pem`6<-yEmlDG z;?xM8rIWwhhMENf%R0*t|A9QHT@Zd9LPEc9{7KF#W7yqkI{A`ew$^WRw%k3ENtNKP zLGM3(I>v<--o71bjGz#`2AH^Jmdt1tyJ?ZHfpz z(YuxzIZV}Fv?jJY9eP<(M=Fv)NBt7@6QcaTvDs(<9Q3-c&+j7_Ys@@akWOuMSOZ!k zG|Bz>PU7lc3fi`FJ3G0K;bB-y65gY5U<*)+aBwR=i`tTW9NdL9q&`ov0wz{Zax2EC z-MZI}lKNDGn!IdZ!CovD3*GWkgVPVQcDCFWB7|wxPS5hs$?93jp?+&XRxEiF#`ll@ zqJ&_#D6Qb=CR+67_{$z0w@LJR&Df@>^>V=t=7~~=@KMEZ8Yc+Yrp!h;DulcR^hz(x zPWp%SM1P^UzC{`0B?*nMkOR;fEDt8Rum321i}bn1{xTUEgCKpmfMs#Ura}^8d^@*6 zipp583!dwnGla7}>p5sUt~CZ)fdwgbgm%^=o~5_k&yD85ahTo9Pv*v|n&(xFXTXkj z%La>erM9<&qA*gLJcqDps_fjuDQ`rK(?5m^fosbop%I#yQh@5$LG293b`up+sbrH4 zW%(jQUC83$TXoc@(+%f!rxsL)kOMJrqi2~p3RU3vGUm1S6udDfGp2Gt;t8YstMFh` z$gC(UNX@$R6{^V9pBFO4vnb;Ylk8G84AMAiWwzKU+h&}AJ?ZBNjL*^>uT8|y?l$N# zD6B*DGO8??_zQYO@V1%$E>vAiI-jc}6xjY)6?6JG4Rwsf62E9i<3Y{eJJ(O^pW)ZP ztfI&dtGI#S0FKnRw0f6#!5&v4EpiqiFLG}7dLXV1cbHZGEp3NSjsar&frZ30RkjJ` z!CX$3qtt6eYjNpo3N>)0h zF^wQ7m_XoWLXV@5;e+kjPnGslX=^XDCLh$ZjyOnY+!@9fyWO|Nps!MWRezGVQ4WQ^ zK+vfl8ml3|Xe;xx9_kz( zI3(26S^pIIMMuh|F&FjAVQin!JMJH75m*RL?@N|<;xTl@AW?3wpPAgC>SQE-qp87? zYPYSUCAj76c#PHi{GLMvJ5CG_lM=;28w^u5e2mxG-!fxO<;F3fqX@z`GE5t%anArx zyx;a{92giFmv#N9<5B*#r$6)(=sWdGatL0mYCd&hCywQjKu5k98|Z;?5!lqz<5=}x z1JBrIumsd520!7NTq8sh|BM&vZpjJISWNlF`KM06_#?X58G07oWH}N$gXpn~N#%kG zX{bfRb!Jg97wISs5h>mm2BAFK6xuAOvj_2!uDlV6wUrR3`b`Lg;2UZ zsnoDol%0L#!z{abt1VvBPO|2knFmd)`A3>lu1K-Mji&o8LnaJ(>jkK72t3q#n$sXL zH!?VRKC1aWhvuX&Y>XPHv*f^hC{Vj+%{tOblkEdK(c)ok(s=AnlAC8a$!QDU4&6b5 zqJpOUl>)ipSe>(7a6Gp%Wh5#Kur}y7Ei-Vo*G?G$N202@od#P9SX)>LrT!@qofpzH zE;?{${ek(&bFb{=s|~4Qllbd^Vvvp;z4xI07?j^2#IN;YCI2rqpnDrTLyiGLvp=%* z2BsOF#AO{PK0v6YDW!QZ+ygYz3I}Ft6CuzAZ7?9UR4_#Sgx)BPRCr3e9?ctje{tW$ z@h(S-_wE`gPV#kRkq&b)gJ|`9F-qwvDBv}{CIx-4Xw+P}Zdbxh~i!R9-$B4P1q6o_(pW5n3t5N$Cx>>q{0-P10)an```@* z7zaSGtz!GF-SrAjPvODv^CEdshxU#BWZRH3;OB{csa~&~x^QxvZL8++_)QZ$3TjLj zHk6xyx#L~JtU#Ry3Br_F71-OmD+%r_*fr&=iWR^W13@%{Z#gvdPTsCZ!uSWt;8CMH z>F^w0dPSw>b3bpyJM4vYzS6tO7Tn!Zx-ci&8g;;{+~_Kw0~~4L!mK)6LTR8KE#^e* zJqH2K4la;)WZb`OCBX(2Yy)(PeQ%2Lq~% zerbRNKvsB~VxGDkv$o`>180&{68X1axkm%C34pSX->b}qaQDlGQ)fofYl-3B9z=X-Y+gW(ZFt#|M1uY2qY>CwQw%t>!oZmmRWd-m;ApCAU za_K=_!I#KT7P03tw52;ydDFb|Rk2i*13UC-U}C>?B#G5dtujaDhNv-l<%}e`#_A0( z8I!p-Yvk;-BRe8$_1^`xwqyq_pOliin-_P7FFhM;(2()E>6Dkso;hg!H_u2vsD6ZfNlz`$uA=&GgH)Km+fA!j={O1AfZ%V} z!)D2QSHcEtExLtlOE_8*nr!uo2?-tHY+kcFZ!(TpzRoqAiTbnI(_*iW7^gJv-_bH|D*73czENeYUn6 zlg%`Pnxg+X#*wbgmVB>~7U>6T6Xdi!)P3gomC6JL7&)aK?7k|&qT&g~(9(pv{Lez0 zKe2*MQO6Jw6hPScLPg5dScX!tq_u~1j$*l#G!8)BN5)V;E6;lgr1);={OslP?C-?W zzT9liz=@o}Lylw{r*niXk3=^u0{gdSII&Ky%8~n8E`3Cp#_ri+$zs30{Z}^6r{`}& zusE%ti5;MX5Z@4?UM^mY6@WPV=iM0mHckZ5>z*bfUx?hBl`e)mix66>{(`N@k=f(L z@Q9_oBMYeTp&IIyJW72O;f=c}ErbuTuueWOLZ6;8#KER5YYK3Fp(;uxDsP>FpkoN7R;5A9=k3~iQ5#j70itU8(MC+|GNHhk`*H&Vbs6HMe{u17>i^n5E zejm(vYyhEWyS9eRO=1f4>Nb)d$VVjg3 z+G(#5lgVz6ZEsIN!Hgbm>zOUhSHu498yqH96<{?vxwVFs-R20}C#909bLw2(WcC({ z8{oi1E0LpWfs+3-ahBmk^#|e`WH|U|;#^DpKNDw5f^h*R1fuoq)x{$&Ltu2<`J+c$ ze@^JvCSN{F`dhA1o7ahcll_PS!F()$+ z;uKkl`I|xbUhiL7b0+tH%bJHircg#Jj@J(D$Nvu%Pk=PLaK___La;Q1i8i==HxnTjTdWoQxJ`gv+KrZwwG!iveTDw0nyx zH10Q*!ii4ZJy5(4E7gCBfBA=*IthVYUYD!xx~DsrVK1IlbwqMsV`{pba_g&se4g^n z&G#Qa4OlK={mi#QWUD?9<^9q`lDMqlv8IJ@y)G{uit%i}eqz zNsZgf<9gNb18b6V*7X1JbquOEB7KU0a5-s}&;ULb!JwW=q_&(eVo923TBF(lT*NVM z4D|)o{`?5w;W5zr@V;G1A3rd>G7HBJZ%>EX8kX8P%29x+DKM$^FT?&s9DVjxiE3!YLZ95B%|>-Rr{s8p zwv)bPGgmdTX5O*1AtXz9f$7wmcMoWv?QOVmR=Fmhz!fke_e6kh?Fd7P7wFAaErtl@ zC&6D0{4-x>_|JUV6WqfEYpGqZ{%q{}q5=NmxTYe92Lx_!^5v&D358-S2WTTxJE$G3 zx#IS>rfy+&CRFCMte~jyJYw1j&*3OZ97LkjkK7Z5s5R7A+7TOxlREvyEFz9HMbrC{ zLpQi}>}-bksi{TZW`AamSRhny)T6J1)G5 z_NZ`=hcO*=pQZT*=st?SCOATVF1u5o?+Yac7~bqOM6Fg4O?J)Eu{JK7uVQ*z!~3a# zof=r4&miLpI>O_~gL0yf#>|UPAr%kCdar|_ta2;6K`#eW9~A&MFqUvgc!(3ByJPRC zE(xW*en?+IqzRMz{M9)HBQzgn3)FT-x&v^5mLmU+R~E3Mt7hWn>zH-T#F5?Jq-qU+ zG}&0gf}SCTr$ukt>`BwNbIZ8q-+{n^_XTr&v038k6Z|wRUD4lkFdypIwjZ@>*Vh0A zP3sD@Y8$5t!c|mCMK-QJP&TLi8=eCXZ=(*4c5&HzE=ZN0XB`zl(ND z6Ugt*swAWW!8cd7JJ&QrRdOEK4)PoDP1gOk`+F~$(+tUE#mMAhF1W|5E*kJJPz0j? zP#cDc(BoUY*S_lG2>6m5?W(D*au0Vm+t1i@bcn6CLHxRLs(z`vnT4`a2dSXD=(T#KEG89dl*ZM9Fq z>#b3oX_+8lt7n}j1+mml8SjUOliD}Vnsl3JtJFdaQ`!htR4?cKD-^Gqs~g-?{RNL}ao@iZ`oMI8n*jy!A6%U^STz2CayZf&{G=8(px@{2lTy%RJRwZZ9OqxgwFB=TR`8wU`hQ!G z>aZTr_PDoA*vhBZ698$5A&%OLmd`PsEaEzOUZ}g;K?tam+vf>P3>SH^q&!8vzm#9? z%Y+^t$`D{i9UZ!H6|aaLd(VcHo@Fs!PhaYr?_MuS)o|MkJxzf>+>Ox!o0^3FhViY# z%SkTd(C$9IRv&}&sLYYfYu_n zd(Y{%?l|GLKpx#REIbVYTWA_uEF904VH8X%ozVX*Jtt~F8b7B=7^5UPmt|!IplkV& z_`YkIef-){XN#IY+>x%b3WtM*2ja{E5x0vPYJR?Q{0-2&p-tolE0Zp-a(sFN`?@_q zE1$%}b6C*+loCP>zPGXSH9U+lNeh%OkVQJg?9}_Z)4ljv*leucZ=u=an%P%f?VtQ- z2RVJv902Ze#nWoJy>;-kxekZD>IR^m=Hs{~NQ0WmMC0Wih-c5!j1a!K`=Cco2cIV9 zWg;Du4U|{WK-EbSzUE=Hxl5z<;OhrIn%s-`l}c3@vp62?tEMzI-dd(UBn)Ma`X&PH zlC2NOQ8+G9mC6v>qYb_nC#8V)P)lQcc%BA0c&h49rGK>F??Odpa&cc$Q8Q8@cbw%} zor*~lBUj_;295WIpMetJ8IQXxexEO;_F)*sWx!y*Oov+wl!Fz^5frj$3_B~%n9b%2 zp6pYMLgriytpZE38BPxlLBVs&I;naSYv$yEv@Lk5nSC6sVjzwXix9N!nxqyGBWS() zLM!6761OWEYtD*5=2)!;hfdcUNUwPs-mvB>z9yOHN^DHY4Z}wgL#`qI)9HUep-mKx z2MEN@8Xm11M__B9rYInC?`WhklEpNbkM{e6YtvH3EHQj!oGbtoa}Z#72%9Mt+{lf~ zB#IPBx>R&b%EqdXe?aTLFpsoMmZLjEEdZw-v-RJ;W(bH45VnMPFg@#|Y`nC%n?}9w zMxkrA+QD!#imwVj!7;;44OV?=E4o^f;-QsIE7oJ8%U# zm`hre@KECq>0Q5~=;C^gSPS-5zm0fwOX5T}d%*r00&`%u4BE(@AW=u3-S5goCU@s7 z6iLrS;q(MC5_()n^H0`G4AnSHee$t*om}AH^aBexu4i_wByPQCUwcFl6X5YNs8Fq< zDHC6qFh_%73RKC|kjfhrIV&s#7Ycv8_N&!$1ZY6b$Xdw7IV6dmaRN)7R4=p=M9F)RjulP)Ywp@8veyg&-4;eb@x?+JNf8t&_U#kf| zO^y&PeEYBkpJv4gCKRvEqzi$P4rCd9%a1*ox~!l5)@Ws=kX?Bpvy~~wnk9?jz~ueG z8dt;M^uaFK?ao%HzCetDabkjd6p6lt! zrIPL~q1XciMP|$Yvxx=9Zp$O6;T6u5eDMmkQgVqhZ`<5aAn<6`q)S&!z2&x`Sl4IH z8Ty37C?P$ejAQ3dw#P;XZ4;>RBA}0f#E~V|suPVQLOBE5$7@={t!blPH)_M=JYC`` zspl-2fA%(;t%!jOUDMjsfa?a}(UsCpc!r#_#Z}CTDkJM+`&Ko9M2uoDZPN2{pM?NF z1cbuPozoorb4wtPWODGjM1>|QWllvvPzjd~C)ca9flqfVb(1l$+9vbj=xX7eyW@BP zu^PZ&mktvSl?59+z}SQG5~}kHt9^ODx+(rlIE4CA6=a=DJT-RSMi7I!jukQL`H7!) zWGlP6mOwAp$IFOO_XQXVws!MI72-y?vCx>2OL04n=LP92I zXEQc^0CCT#gOu%pa7M`EI9#H2O7TsrcvoK#3cxjx{qTd?6HCP09?g|2yeaQ{3VMN^ zu83G^&49KTLd;AVkZ$@Ud5e7VWcjM97(<0|IVABqewZkYj#whuLLPO`EyQbf+fWmW z#*nhq_*J|af-~QN(FkXVm{Vi2Xx`YH!538{K-nhFLeY;uZRj#jM=n+z%gRvmkXWKh z5NDP6FLM>8ieduR!p++Sim_w_B7Be>$v_!<2I9)^@k>hm9u_Qx#mA*OOp#j+cdB6*b3A|d|8{n&Q))Ft_hOzgrtTDeFA4a7uLK{^NW6e*GKq+v6R!7Wd!TFllBY~YCO+aNJbChR3+ z8$|5%0U-eya^8+##*$g@IY5Rd7N16^p79lMOP&q$T?BIp3xiR16R@>;6x{6||D?qi zu{qdueLk&U5sRE02Hb@ai00GKOAWgq?hJ$)upQy6W3q%#IZaC*CR{1zE8{Swy<{3f z{25%O#c|)pghn25w$mShZpoZR6`lBwZZ5eyP$5Yo3V%X8-~P7sUMWU&K<79#3t!er zEg0h>v8SQQd;Yn8X4;2Y>j4yF6BHK}2UpD|iJIAyS}Y9OOLogKS)>J1(Ph<568JuM;@3J{X+2`qMS|yPR zILf6EVJXP>R6Ur5slx0vh&(g!K7L>)Sa7i{*@(BzK7{oNz~ai`7Rv$gyS*;BFxxR< ze9S+}C13kAIK2n*ywWbZ^!%UX0bme&7Ly}dZ-~csyC&Qs3AkX{U}L@RD`!}-*UBm;|%+q zGpQj~rER#-y;{r~weh+aq2q+p*h&sT6`*WwGhy9F9?ag6sL0@#`_b23)~O~+uxQ%*@7?y) z?gWfDL0vyQIw2c4%zJ~c{=0LD$+ts?gM5i+ry}NM0~|T6AxV7YS*69f7A5bg1=CfX z*wl#fCI=C*TD^h2ppnBQ$QD!Tfz~%^Rvih~*5`8WW>x+LD<$R*Fa#TgzdL@=K`$ta zHv!I+3-F7c0tv~kbF`X}*v|YO8>aNr`Wq4)p#G7$ouLha{x-&m{m)bJ$RoSEZORCSlX;#0!7Lh2p12j7mjG(U7~e=G63T{;ikgfbDHm8QaCt=d&`s4gkWg zGIo?LNo7y7M0z@OGLfl=Eud$4kA+YI+B`EGXHl_X>TdEE4%hcf z$}?Ez6zoV#(i}SmNcc1F1L6m=VjDIaf-d^iEYO}E?93v}Zw|JBk*#i7=|TjQ_%xl6 zOmU{+GT2BrG+Vf}FF_z+!5YRFSK>PrlyNlTdUf)+s~C-61mSwcQv<(q?r*sV8do$% z@KV0*dDC7YTo)H;6y96+V?_0l0O16`%)ku^dl63?sd18i3|hUAzkkR07KNd?#ENXU zl0M*6zZMTXGix{Q;+f|$kRN-ULOx7#vemU)b-O-0IhoGWYA1Q$0rY@xSi0P=t}(5F z+(5_-UIr^AJ&} ztt@)GD)Nk^k_$OOmj_dwCg%d##dHKwV=V5vj=nmvQ_5XJN>ZMHM!nd*`{$8#o^fTi zn0ieyiXC)~Jrbr(k+o^BL5vktf+63%(r>0iM!q+AD@Y_Ir~*=hDE}e#Q+%yMnAtpv z#a;v&dL1`2FAz|`#wD*=pvk&0f%qW(D7RyylAX+2)M^I7l+4S3i;GgY{wt^xm2)z-aY%iOuYk;z!dW`@5*$pr;sx;N zNC?RpYRgw^swU2n93$7WTdSaGs&e5j zo1*Twbu%}cYp#vNPGR;K?5hhcKo~#^qB4cwx}8C4TeU;SG^2uVu|w+X1j2i^)oX2$ z9<|6Vw>W~iGJQzOvS+dx%}s#}Dh;<2oH|3ieMIz;TU_Bf%UOc2m~Qd|Fqli!k6zr1 zMCK=T7OzYsdxa*E1`N~223%$#1EN;{qN>&})dDhbz7iCqZNpy#idd`VTc8?`BkwV~ zt9>RPzpA4CT7ZeYERKXpOiGEseX;WNLaBkhT8oRcoY#j!^~E zhe9;|Y~bJY!)~In z#5}ao!Ue=xkca>oLa-3tv?Yf`xn>PN`w?Eyi#mu@)oUv&bKNhEQ#Q0nOg56}OiO9k z+|g9{T_Fll&=(u3zNWIONd{?OR^N?U;5C^kHx_uG1?WK zd?;PYLe-I~;*MR3YDpuib)TF4kYO&+F0~>Qai*%5jx$npS=O;c1>cr^ t7e>${e zy2O;T&J^RLk?OEmP5pkkWJ;x$lYT(l@9x`oKS8}jZ!77aj%ny%b%%k${eZGx#Df8U zyK5-5u>|Cc7zCtPPUc)3boh<0Zf!Z3?V_UC!M(_1(^;*-MfaWN$5V9kl ztp|Vf#ZR&3`WQN!^6QDINZl9feI+QIeTRKG2LScr$CVK1;;fZW>YIa z4g|ZIko2TX=^a{%1Hw)7QLdZ4;Lih9=Z~-P7iSL}8+)5zq&od921yzSnUKuCx-C1H z&%9N?h@4F7)-q9H{Nr|#x`c6MMgr7;9}5IMx!3<#+do{Q%%8ZcYiFBmLAmd_m?#AW zn4Xv=hE_4jX`Srsl5onsM}Df%&|z#K zZA9B!BXCB%O~zjA=`W+63iHzKHU18G#=P#aX6kL2)_@5dB6vOo?iraufI)9Uw$zEQxNQOw2wSyxBKF4q zKr$Mhjr;I#!(ydh=%@#6Z#4qL_`HBes?myPTG;r?t_ zJ*bnF5R;@{LdVetWzlJO_@)nxG67z$h)?<`=z9`LO}%m4s>!mRiK!}_{4G#u%?bDc zul?ezNzideW-k(J4Gx6Ms0_pEkmKDau*H%gKrY>2`8XuNE+jMoIp=0@c$j4c#H>cm z!tTArfhP_TWsz8)vz9-d*((7tN^g&e$_ulUDz+`bJoGbq)GA;SW$p3DM|75IcVPQz zI9U~w5C9WE5CMeLzDO8>Gw_8JSrx(0VaFUhA>!OU-)}(>%D%`Li(^{fkd)|A>OXd! z=qy4gDiXJBDto#SE>EbEi>&^Hsk0Eq{W;!X4JCJj0U`5KFK^!h-#N{1BEfRUcOQvD zwIt#Cdq|8ds1iwGkd~!YQJ^14{(Z5hWr%45+Fc8>{?esc&D)5onv_EKtTqOUc^pfUeBP_}-%I@1>OVueyrOhE>&(h+s zk{5Nh!4>6I90fmJafxF=N1Zi~xASiYCbrw8RkTLaFPQSh0Qq!p5i|%`DB#l}IpU?4ux7J>72>{jb zUjWXGsk3^Z^P?bLs|WkVMziu_m}zzewjKC02JzC%640#aoU=LfBgqzYM{!)io$TiMQ{P7C!x_rM{)YNOahGK@M`5$}}!f1SD zmc8Q#-?Yd12j5gy0)=Bs{TJV4DiB-<{GGYlfw_P2G^tcfGazp*r-R0mhe8BA4f2x4 zKF(#s1xkbi#q6I40iQry))o3tfhNHmB~~5wwIg?qMS$ZeZ2svx(AaB>Zr#cS^#1FP z&kRTb#sMuaFyPbTO$<_JE4&#koEVtr#V53m&3rK1fjC7fs_+C2p<_^8;Fj18G-R1eUEXtCG$xq8;w@TER@JYH~9-$+C^$MBlzD!gaVCGrV_r;bH|0 z;L3PwIa6rV6^r&DGe@`7Z$gz!m@&4OXOW^FTVc*FdjpmS){36s5MXnC0|f?)QRe*$ zFy!5!L+x+PPLIw{xld-ciPPhLb9p@)r;>PQ+dY>%PtOX+N7I1Dm=`^ej#7?Vej{@x z?X&bx{|#|2WBvs7Z+-6x(%Iuvt>_YqZ>{{^Pt4i%Q2*u z(6pZG%tImJ!H{nui{YFJ892ue{bO-%0jYXQb)tRP{7{@Ffor!*__&vm0Y?M*HJm3; z#kPgcY8|GuibYv{f(=%vEl%{(@GQD1P60Lv+0)tC(@xyDVS^zp!_C7X!9iKV%@dSw zqpa%hyRY2h>|U2SqI0{eV=n9Kuea#3Kyd_cCPGoCqk0tPQ${x37C!X)ag2D z4(Xc9CyYH&&KE1PQ7E`vVJXL0j4AD ziy(i){i#UwDY@oXu%s86_c=$h7TS-#4tYZZ11#3d(sfeP>}q) zF~P;I_I?vk^Tt3;=n=}w1|!;&;!Fkn6(v=Xps)e!uUQK`FhO@+$eao zUwLtYIw}q6#x?cG(GkgSL0fcQ>WFvxt&nu+P)l`;%%xL?Hfsu-?A7vNxHqv>n_NXu z684B07!p1c`?&Z_=&xrxJ=7^Z(diswYHTT_V*I1;M}u2~=^Ch}3K+klUvON|3d=;a zt%g8J?cl=`$flf`j z8I7eG;`fm{D~=0twgJo&7tg8DCcRonm_ub*^s^K|mnBM|ry&8Pq*gBH4XE^$y)r+u3R7`M+x~N+gQ1zzpPZm4-iarSRb2i)~{(M%HmqCP9J(A8Ci%mm>+A%c!XB zTPB(TkszV4W1!mMa~8zFRg47cIDk8>kWj(mU^;Y+z2}Q!P_9*?qXPCQ-@i#ENcZsj zs$=*=m#|3Tu>rUr;~^IPgb4}!e^Lg6^EW4hg47Bb`s77e1eopt0uUj2s`z^b)!@uT zf63?QeG3iYWXU8|O_;4ogt`|2nT%A+^W(weIDO^d0VJ*ZC)^V9fOQiE<=a?eaBk2W z$%OD}7jm8WOB*bTeeDdjS^{_5x0V*LqKD3Qhd_cBGq^Eh*NCj^XOFrVO*=jeR zkzGfIx8s^fkh6p-{`FT$Z|!-;4~zPiOjq+|!`oGc7R4&#%(%$?i|MD`xzTVaMy7hz z={sYS6Br+beLc1N2&(Rka|HBtg5MYQF-?t2iNwm7kC^uJ&*CLo%Ei6joPKPyC20tv z>x(R`>oF;a^8z#$f2^Zfuey$mEVkJ}u{rNO6dkrHB$cHyc{xe`vK!7r^S%{=!?MO` z8>Qu;2GmBa-}Wg{A%df05H~+$8yrovk3W4x4DDz}d{mM9Yn}FD<5lDWK_xfS9t3=8 z$wag=RGU>B(tH?6TKkeVbWlSmLpG!mju9*!Tpa#&2p1!u(pf)I#}icOvkrGP48`nx zSbIYjC7A@ia59SISX#9Vn83KjsytXOEva&;5;}|B1dFReRr}2$&BEN#Si*Q~@?HCi zc5XZg2(K)yw#HBC(3V*8Z6}gQdA{2SrafN& z7c6^&&leSI-@^%$bX2q0#;2b&ypAIANW9mr(xg<{@%S^;et*8XE_slE$Dixxi(CVt z_KM*~hNa1g#1k@3k1qGqB!jLJC636I_Gh}jcG6rlyFdHMUT@GIXAFnO zzC3QOCiY*ZaXfSAhz2hLD;>&eRd}puVeXwZK1r7N+D>OZSS4eQe@l-OPlpA3%%_XQ z0mQ$-5);fH&yGn~#C1)Ax5+bPzF7&`7JOra=4&q|A*R*5O)3Fn(Buxj9+L{(mP!-C zZp~Z4sP@$wlBKIeH$@LEy!BUPnv0$Dk9HSBnG2^o8We4po=uTE-&k89rIA_LxubLw zU~>)j1_B|+rBeY871tlEs@m9p*M@Pv#VkJh<XXy>UO-A#7V<=NNmq)}BHGcux3=IAU7l25)rRhKR7ZWBb6s*uS7q%zl(4qZLy4A`0G}sw= z9VsT*NQYF$X?Q&7rR9_4T%RIoN(o*ngl*hW<#)}_!TZvYMB>DSL&whHn|@ER8Ay9T&^M8 zgSF;p$)XAqTgg?ZkA_oi$)MB6DI%28Ot#4r&8t)8Gj>;u%I3u#Q2e^2X_0=bCEAOkBEBRrfHAavbzOVu^8KR+?&WmJ&o#PxI~JX~ z=2ABpfuj;$x3CO|hDsF{Qr>Z{|70)_h1z!s{gJeYj<0+GP>hbF zpQhwxH`4w8JyAm31HxP!DoqfLZ@UzWTF?JFJ7r32X1v{^ITato9DjIt7_NHb_VRB8 zN!9Z53FFvEwxeq$3G8q`e|VDkUwoOs+zR*y7TZ9Mx-qo0?ntw1;c*O2XY>));3t9e zc5d`0=*IbH>NT}0ar|c8cYv426Q1Eq%K0BdNq^hY^rgv7k2}9DvY%9MymBp2)zNK2W-Pd? z5G3Dz20@zCehfxB$a;{hE(9pPCm5BmSd!o#Ruedcy1*^xER{1>$5pFos zcuV~QMgkp425y-R)%}q_f*T!Q%giKxWY#=-{;U6{!^W{xUp7Q2hnImkBhn^gv{ve_ z3!>tGy-J*AKrEqszqtH);aZ9OWV!$|PY1m~MW(xs%(Ov@NHd*fnl~)}6TGhA%R$9N z+Z;dnXJAxEM0hcsmlKiJoXnFs+2|n3)}Wev*30{?@;Ko^AJzY%fW>J~z=0pI0cCN4 z?e$FV8O68{dDx+-nf^EML56|?dvu&`8iY7gP7;y@Jci;Mj_w?d5y;yXxy0|1$WZC5 z`#Du4(rXvIg+0ULk!`RD9JXzmjd`X~gu+|>a{?r~XIkBs9aZcyOY5eG$DSPskWvNJ zu$zTA#B3!DYn;s9KyZ3Zeq>z}P~s0Fio!u*^eDJXAa-K)N6;V&{`E#GkW+apN@&j` zYFXn1Q0@~cgfAgPODxXxJP2;VZ|EdP4W!C<^kM1UROnjxJ`5Q#T2}byE>u(@gfjsO zr`|Ag>0vM5gE?Vv(2XYHNFgLWOb3FwRU|L*Bf)-UFu}w5SUfPp=I4Ei;3~jfe&Y{B zi+mH*kYuhzphgGrdRP0pDif!ArZQHhOI~Cik*tVVYs^@w4?%m%v zy2qHm*RQqKob$e}^E@0OY>xQy`gHM$VEZ~^xSy77p;x{F;CM^g?+*7^wz`V9-vvZ7 z?^y-y&CjO}t571wN$*K3X}eoc+%U1mipTWjsf(hBoSO13bKNyK+pDWkxzz7^lQ*&F zf>ZNW4c7FO-C$VAtfTSk@ncgsUxJxwd4*Cf;^34e_Dy7_G2w3fTOyWBYL> z+z`$izGU?1Ak((vd=tkCb!KhySW+s!5j>gpAi;#5y`mb&n9!hAmZ2L5Z)@tDb&SU;_7P4<`eqrkb<_G@bY)Z@xIMyDq z#$!UdXJOPIYQr1ev}jM4d2!6nny4Pu^IHNG}L`izjk5>q5(KFEi<93{b<~$ z60IshyzMz>9mpTFxG~NT?iUBIBXU%46`M9`B0~iiK+h$TGmv4 zz<>d`JwWn8rR&4t;gF{%XlL+#?p=W=U+@b}0n<>1TlfPcxbH_h7sq!s+!r;bc|t)u&D4^!aqf`=`{CGH|mM90-A~kfShIZ17S2HkcBGcHO?4nFq06$j#}p9 zeYmjq5`K1u#jM4*F+n+@$Q-1s*XdhtTQN?83=9-#t9ULIEVuj|h!+{rlEyZIUkJ#3 zepa-CBX<0|6cgYQXhY&Jw$82b^kaVOYTp)-38bh^Xr)AnNpm2BqSGZfcXW}5*l!2? zmB8v5{|Q_63&dZ^VdBgykc_Fsy>6~Ye+Nq^fWGT%jT0Qlf5Q7SyBpLVKiDaVbfc;Q zAUaK4RANNb2+OQ5`c^!%g_p$l_UI5F$iYKriYI4=ch2p8BT;ykgarAIf(0(Xs@{*x zVi^r~k$AEsE{%ZEU!`WB*^dEr=Cz+}^ue|f(|3&KQhLlGEL$Vry|WfolSS1?t8Ed& z1;bdy2`XxI|E<4O_q4;h09%IE#}w)$kLASkcwCx$8GgplO#gyf&M#JaT*20*dOcY5 ziOwlJw&&XuNcc3~^7Q?dw(X6eR0FUYTvvVp3Yt#&9v<&bCn~xBXICy%aA3eTn`sOk z_3!RFPm|SOZ@C=a`Y$&WZ-XuJ6RmKn+VLNHAcJoRQuFP%A5BI0+B=l1zNV+j^dGO5 zW8U|f80A{ECDn6FrllOesF&c_>oK-ZFT^}@Z@_iD#o9FtF?RS zAFMWn+2?g7hL>{w}u{gti!sz0}xAk z>W!>2Z#Sgt$K#jV1Wwl_G#$QB4iEEsj=Lr%ykYpT*lgo9bGD9!e);~7jdAHc+*PhXv#-VJEDE`!H{$SMjsH9MOKnC$Etn?V3Jxt^oS#htJ4P&Xq#-V(%zzPuaFR>7~ z4c0`L(g?Bhw5TvSyXKu%w^&ImD&2Nr#z1YyC7GMUfRH>mBnD#zE#kvcqF=|!{AQ-Q z!sRz`Cw&7HjzSaWA493YxjWIvSB1fKRy4%K(rqm+>2Q`!DOuG@ILY57jtgsC%S=?_ zY%9p^J(Hf_stNyRZ1lG}cp0df*Fu4*lZX3V0-E zW9BXLRyaw3=vUk=LYgCcij+q7s%UfEhhjXXqC$=Uzk)xQ=J%OWQ?-l05>quAQxuwt zATFVCb3UJ+wd$*=pcZU*=T-=awf#SBVacKr0xO;cX{uz{D*3O`;4G{lZ)Ive$?diO z=!D}y8R|o$1E^TPiIW2%Qf_(f!RPtn`3hF&!Z&IcIW03o{xdm!ch zPU1PrV9DD&v?O!1o3Xuk_;MS(g>7WUo8C}iJ4Wz`h`?6a49@IXIomG}qtCO(#Gg7; z*!2lF?KCW=Bt#`k_T5z(Q7|^LvJ&CYzS#-Ja&%C$4RR^>?l{p>^zul(%qxj>W{bHvmfG8%iVSQja!ESeBIG0XUnE9VNe@MO;$i_y%2mpU(I3;Vt~vQE3oTqK3O zWUE`k&K?*Pl|1&e24dCD-IcE8MrHr$7nc0ib;jBy(EZ*_r|o3{^*z$sbY!U}r!ACY zT6>~ff!C^uZ+;LCRhoN34A(e9Jj)o1##|uR2Q;0U@*&;2_88WO`7j)6A$KU8pE=;2 zBR5Y!1oNytE&tjeAfC9$Lff9WC{--H5alMGkYX@$C@dJ0Ih7PkrKm z2f;VSGF(JIR%EVgBJOH>ndC=QV4TCKwsb1Av&*x6fXdA9b5uW#{rAF~45zc}OUTY6 zw&qDyF1h;uEW;4mX)Hj+AD9u{5&_BV9xDBIt@AuB^17?zc3SPerudbAQ0Lig{{Wm} z$5j;|#u`QktIawhq>xIr-G)KyrIgkFC%9ZeB^E-mazc?)+Dha^fbpieuA2%}nI88~ zzHByK{p=)ZfIg%*U~7jc5QF2qs7@YSp_yoC*M7MyHVvc{7J#nTt?`-Xl98mtHBaEmhD>qyzB8(rka6r z0zNsju$~_$kry|G5<6J7W6&^nG!pYrIsgaKrM^Xx92yap9SdrT zTGzuOQQ_Io=+b`_Ta6Fn!MFt)cLWn&(n;u6j!5>Vb`zk z1ECrVuQl@Z?(MJS{Wnx3_`pn+(yi)!Z+i1k>E&{m5_I;sH&5=wr6Dbd=&jOq?`d}l zei2nr62}X$iV_UE;_n19d8$mt_H_q}VpK7m1xUB}gJ1Sc%b2MTy=)KJwW}lmb&&_F zG7-}?JN+0>_J)P9a?JzTSwkTJbk!Mvt}c?1T}&4%@$w1~_aay0%Cp!QGEFBC4Hi;s z@BTMhgl-rh7-L8FTcZ4LSJq?J)kRnsRvZ z8Gp#Moj{phIWof-BlAKMU8ao4mc@?wzQQa+#SMtleytQ6D?OdZCgZ}oGo?@OI`a?p zF8h+rUl;~a10G4znsIgzkG=T6|2PC?RVn;dp~(|t}FPXw`IbI1joSwbSJJ|23NL`v&*gOl!~-c zf58)^1FMpwqrdIqA7ROLj+P5#T$RfS{z^XN?qDTE;;zK>6iQe8H(oV_^M zmX?&UH@=l^yT1-_mae1V&WnQKbKqg!(jV2`HfOCIO4Dv8*%oHhu0I44|PUr zN22JPaLJ&34GqPAgBS6&3y#8@*W=xiE+>X+{>=_Wz?0StS^P|CSJdG+l%_yLf^e zK}^1mjXvHSA1k}x-n)wsXzEM8>S0C8I?#%(zrL-%t_Nk>GqOIR9W2KAOj|*VPzeCpstx+YQD~|AHGmI)h2i%-|`E(g? zJVWHKy&Cp2vzn@a!Ml!me@0%8sN}3A&;e1Yj#8%I_@tcOaz>$lQb=F)=N*Ct6UmoUGOa_P#|Hi2S z9#ct)jVAfV={^Vk5G!tl=8Keg+0U?^??e9w$5h`cAZqXr$D~Id)+mUH{dN!b8O{%v z0>hp7joBy0&LPvNB0IfS=Bts-MH6uIibTrqo{c=IhJNcy zjr77Uj{i5yw3fo~e`lE>B@@t(rWmDmxKxkV8d9ftaw9S9@uw{98J^3Fw(}x8`cIW z8C1R*%Sas@dew9aL30iDJ-tn6|88ic)6C-J~%qLv3# zc6Muc5EKDB7z3Ol0oYPjPA6~IAV{?5%j~YG6Ljr#$KA=mh|0c0h|8#kTEdnuW;IR= z{Rdd~*I*p4Bzs4!eLse4+g1d(tez2UaL2_m8T6ws;9%F1w4YVnA0YPL$NIqf>A%(o+a9qJ+&cXXcy{dDST+~?KKuZ0c6KN|w zs1dtjcr|2q0?+MAtj{8JREh++ekC!aG_*>c3aYBJ4q$(00|a}xt58B3d5$z(aC%R#zJKhA2X#7 z^V3+n$SIo!$%f@Hl}-GQS2TG=yWifg@O1hfoXMJ+=`u;&_FGBq;z|nZ&4XSH4rNC$ zQs2Lfk#v%mdo7M1CSbz~s_2~1egN@NzC$Ip#rr8QSTF5XEyCfb z9KcwE!boed8R%#Yl9hLuuTLNg8*@Y6bT6|ALm;|Q;i`MT#GFnn6R;9;5&g=lU~G_( zIlRy~@}xr(9|ku7G;NKJcW=&+z{b~Zg|xZOxROM>hu(+>=73~NAf$hoCo)YgWLRR9 z65XuO%WnO1)%2@a5p53z2evf_wxP}r_&vMxS{bGVreNfnMG3rf26KpXLfcvVd>gso zYSjQqg5%Uo_X{b=!^B(frC!}#f8QOS?&S^Sjd4PjXuXrTgTI((Ol-Qt7ru+^)R^4` zG&|i)Cm;UaHUK$k4Tqdt{=wG|puRdGw?)PpQZZ#$Bc08J9#6uO6yVYR+C_I=DUIEs zK$uX!?!$qlgE&H0GMemIqmH7m+rOVKw5xY~Rh;0amnlC-QGU}zkWz&6l|5dH^s);V zKk9M>=e3lPq3+7m%IxTH^a&2%Nx{=rYxk6?Gt+Lr9{`+hAXCIj@L;ykX=|n3x3%UF2LKZg8FK$6OS#f*j=o2B`?6NPj?jQWHKK+b=9^K$##9okAjh z8-PQ|5skQemTWu!qdEV!J64vG-`nJDr-{ikIxz+T=?sz<(7PM9{5SxU^g_q&)#&&Y zr_iy45hmCWvu+cXx#bcW?t`pS0u_Ad|AGf*XBFa58kN;K7D$M2A=|q>)?~7{l9LRm z+WFlsZWPeSm_N%tgwsF&w9Jdomxzd69s;}Af2H}*lP_SAfu8B0BlXAROS%esp>*cK zlp3a`hd>#|uR6$rvTG}Lxu&UeGib;-Yy1zar}qUFtxZTTy$d}CxlP&cKXO9*bj>;% zj!3GMTeYcPG_0G!BW_)SQ`r26u!MJ5&}kX8lGcK@lI(~d z1*0FQpOW!11IRnk@!}vrmd)^c9xe{3pB1><1e);E2hs|~P!H6NF+&MIkt4i#r72F0 z%O@p}7m|NN3X%LK`M_#SDx~{GIj&PsU!SJ_3u{{^enHtrX0@iKp6$!9%iVM&P&`|l zoKAQ$k$u9Vtp`&k`H~XF9c!~w(@e9vMlT0zY3G&MS^Kt$)vwZ>ITvt9Fea+wty&ef zlHR4^v`trYX=N(N%9JlP8@f~iNg~A_Rl^LWA0_q72+Ct!)8Apnb8?Akd@x3LCw->fAaicX4a)Nq zLURQ~J@!IP&Y1v8rT);)a+CTxgtC(3fR$hB#asmgk5&VSs9Y+h(s_2Mj0V{C4+6QR zKc%-E`psGnE-&s-bgHg?at|1q%gP^ch9Q1Dx*a*I(Ky;e#>>*E5mVO2BSzI}Gv==u z8}o+gi$z}<1G*}`O9qeF*SkBuGxekJ&V)mKTWHd4HfJLI0P2%eQZ@sN1lEm z4z^R!!*Ow7<=cJ}{I6McQ5o6nN}LE%K2m#Q`s5SckGo6=9Wm{pj;t=n|Kx&^a z!vLaiv7$w-laP!hMLky7uq0SfpN;blVlso%mw&(o{dR9bE65K{fx=%-AzB_%zbZhH zQgVnCufYBYi$_o@Kr?}8VL_G=UKaUX2|tH4SBirR;M-OdD43s1;b+R5oN}`mwb!Lf zL@4~$7Al?#u{?CuYQb?mXd?cTFF)`@ z7VVDxK87crl#n#5Kc%l6@`;hcRMS3$3XgL>#J`(0(yKut6r3?=FjwZ11M1{EarQ%Gcrvh>SO3`*>!ZHyd(q;V~iD_YfAN{f8xL));&4_f9OAmg*AWl|0@? z#A&<4`7)eK4?e{_G|_nZ0%!E(3i?K1(CL~M!UVbyp1~YS#C30QLz8IyLPdpYTo&|D z_fD7VWAkxd#l*pPCvr3vyY3I%I=^NXcw5?HPorCTZbXtPiMtEnG9hjy`aG+(mal^V ztF>qO3`~J%bBOTmGd6xx68JH|n$Kt9Qb$Nc9T)moMuOFsF8`atRYAAgY@*GaoTwg# zV|d2zgVW_w)kpG$5-+K~>8C_uv>?lzXJ+4C*mAnoUD6s;Tg`3D;govqi#sUeK_gYE z?%?G@%k6S%{cFWq+-)Tz=>c>FX2$Q}2ih~ufSRs{)o>6W(#pQ6&_t0ytVq62{B+*2 zdHV_AS<5OOqxxp&U;_=`(bH9=XeZVwLrk|N-&LEil%$L;K4mvE*19<+7Q8ItwOiECn69&+;xLOfDosKr2v%NSI z*5L5g+n%}qkv1WwHm47XRMHHc7>LasFwB=CMY_%=?W`subeV_sNtM)*|*1sD& z5vmAkn75mbXFuUeQ7^`bBpay}*)8+*%K+o<-jdZ^M(%)xcmPgxgRAE_!5>oD3I0wg zzT*x+grd>dER(2FPFH8q;?L1bNaRV~s5RIleZ=vb;|KZK zp?Lxq!gQchuSXX94j)Ky|3EI1jwqJIQ{vWBb_U;1s%EeD#Qz@Jx?yHN`7=t|H9krt zJ)#rEtZ7jq$O7NYFd^V~^kbQ_4ErH#qVD%-&3>DIT!J^7!T}H890-e4j3U9zz9_S~ zG_2*ItD{Av{26usqdFy-sV0q(axRxV9H(4?W9lG^fRJh^K9lR1^e zvWSSgyT=tyjGnK^t;XqzN=>h8o;@87KH^M>Dc$Jbwz%=Pu|28k(9)t{pZu&T@C6_E z=W7jLy7o^~U01lCJZ{$?@XmOyK3kW#xBO)V`l{+&h8rWb_gh3KIIca|W&~QND=?x8 zF*u-@-e7bAvJrRPLhSE1b;nD+iHL<<8C5%~Flf@O#+4iy9ro`XU`quv zH2v(d#3N4DIM&?FlLpylrUR_4e3&z9+6@PGcCi@HNYFw@+ptmRn=MaWg*FoD*GkqD z)(BxojlrM3^aJp4)#XK{(H=PwiJB?o!v0K1$SX!fTa#*P(i;3Wy$rAgfXHx!mg@TD zT$EMbx`+)Il>IVup4;_xElgnDSQIh5Mww)wc5&b~j6CYm3=vi%e>oD!4o$Zf3_AaP zIqWL1U5P|%vP>N54g7UuHHm50(KjgpBIi@3pDM5{PA&xFAT%QE-0t>rHE6}UX4h?49zmgdwOmM72jdgK( z`hy2)4Rql@q7NfdSbd1~R&+E0cb-B}Tw?`4y1Hf)lf(x*f?FPo<+V0uRk?7Z^s=d9;8cYIZS-eN@n7Hv?PN4n zy-a;rbZRT*hvO_iXa4`I`NFm)Tg3-1x(w?faM9{|3qdg_1rS34`5JHfb=@?ndZCZE zdH5K(uNg_z_j#WmYjA+YETB=Mk|4I^)&-<{Q-JE@P?R{95&~U>SfRKnnX+q2l0hXB zZqSGIViUOA6)M2@SND;WsqSR&zb%}(A~@W&!GZXS($~_E>nM8+v8DDMnEY?w$GN~6 zpc~JMp-EsC!3XRplSVc_IvO|^StjYWBnn(u=`E}UjEw!D;~>VzG$Z`pJ1QG1kHVr= zd!k5++>)>5+B+2D!KfuLmw%ZH%}Ot++Tr8~jOJ`ZjTh~d=lsk{3vmwTn@QM(YaTSB zlMshZ>RokA(EW%WZYzl{Zi=j(^3wgw94nC)W%x(X>=@~k48va5dg^hC-~=PDbM3#p zPlFkizU3*z2-(|PAH)|~!XWT=$4{kv20nv>g6TbrtMUfe!^sdxx)#ay=s*pRbdl=t)&oJK2FG=XgAh7aA+%wE z)q9ccjdg0ma0w@%N9F|;(_t3D>dj)P>*N?IAlF%h%p4NN7x&7SCy0*L;o^uq-kAVi zjZHNTeqQB(gu9!pH7xU5IvCZvA7RVN$|mr{Eh7Shk8mF$w z8hPa{7Dk|)dV zMc8RVf;7hw)<^L&-|o*05kitcNenT%_KsC7Ts$?n;P#=}&*AC-Q?s41+XBEVzHxa9 zo>GZSuDDrcKSy?RII8HL-JaRQJ2zHvUv|1C2V>j&JiCV{@Lrz#9tEtqU&+!x@lH@S zMiv6@R*f{HF|;WkqF`pJB-D>$Qr6z>t&RULVNmG8_Nf_wm&~qC&Q!kLM+5 zK;q|<{nm&2T_@=oHKZI2d~vJ$ha*AK*{}ojMkta1(|ZveIXq^l%A@X?|3cF9f z_d?Liq@;~W<2>L-owx)nK&0 zY@Hqy>`c%Vn}hb}I1LRAK1!(s@VmIHR{0+IS`Pz`*>$+698ok|<^nffKh>!#PdD<8 zP)u}(bG~jPE7$4HhT9xiPaGj|m(OhPB#6r^RyL396b1fef43`&;@+#3ASr_PRg7Cx;^R7vmn{|PmZwam z!4vKTspD0;bo!RNerC@rQdY_DK%eAW`i{=0dbWBcy~0u=r`gC78Khiu1Ff^@o3T%l zS%>HMJBG1xWJ>f_;*z9CB)@S=9I;ydiGaI0IzL!|96rIOldLsc6jGt(tpQY?OQehz zH%NZ=CiRL);`Q)2bjEhM`uS#NilXTc5(3rfL`ypqN2Air^L>P|#qER& zRMEZO{Cb45i~skWU7!-zSm-3KtdTCw_0?u$XfP5K!(OcQ}=clS)<{tjH^-TYCozu z%s(F$9;+VZH*KBBL>)&}KLyia6nge;p)RrPXSJ zQ+!fjMuiDTuDpwC12$Yu_5Lw_#ruCX=?Tdu?qVH$Pder@$!x-3c4mtUd4 zcVA^uMxDDmxP}TlyLEu%=T$`ylx#+A>eYjn*NyHJFQSrtbI{7l+R46jr&E78FIP`` z{$^q5lcsyubhsBvnG8d;6?N2_fR~~v8r0pArsDtQ9Uzj7k;+!TWSlj0K?@7zjVF=h zezehZRJk={oJ7D>uNRb{7I5&}s%iInq@+Q4;LDQ*wdN`C(X@UE^FpbL zl>4Ua&o%o6xWk9EdD-{RMu986s5cZ1ji~eO#hhx&+TbY_;1P3HDv z$o>Xp!?<~42e#&Ra*DJx{L2x!D-hWw9h%AeP@*NEn|(r0koYOAP)Jc;?i)He3DNR8&NF_q`VY7rOi4vqSAL*wHlj54LBvnGq>>JLNr7eP2@4DRnneS%)o3fFQ&~LNom!h*+1u)`hmX(6;ln2 z=8%7cX@RhxGs$@uK@D-?7Wfb-4UG_=4})F;A8}lFeC_rx)@LYuB@T-htw;1U=9BODWX#_#r**%K>rAQ~KX>9KVoGI;n z0s0rDMZPL6?gGJ4mii8XwvJ@2Y?nHl>r7d`C9IT@1mj;rD5XR>m0X^c z)zcoWC*#;^ih|M0u9Z>lL?Z*GtG5O9TF9=#!3k3bL4^bF5To$JUGoRMYu zRDh*Tq_@&(oS(*?rL;)!PLN6=ku*RM@&^b)A${ww|EC~S_KzTB`yW9F@~m?BO6LE^>AcJ|jHD}wjp+ORR&C022m z7IVSk$kV8h%PfDwT1_X9RJb?0VjTA*-r0Pqt+ttkwx@Vp$0_OzR%0DyR>!Ly{ z1aWYF72gmm35~3ieIshH*`L5MaUl?A_T1xBPQ@itNZoE=U0ARy=7$+QuIUh+v$P(@ z68TP0YmW2L9r#-!F)FnO76|zCA|}E22N;&IRcD!jn`b0H-O7 zuv0}EP_;9~HzM&1h!8y1Ct22)pQ@WqSa*|z4S8~J90v(f_Vf;PwgRopMV`8**SV2b zcwn1Eq^Ie9BE?9DRr0^EBwPmLNE+~ehFxBOW{N*q7ug9e_1~r{Mj19rNPMQ##ISA( zvvG?GN@sb;bS-H~c!zGEfuX~4=Z`i194$+!9jA?-VEzK8x$vYTtf zvh37%+e2|}_k~BtKKlJlq_RYZ#^-vb%NBf=I)J#BfQ@qS!7ECI(h7uBGMOcZ_gT2; zhPa-Y4clOtE6I1p?P0jCwFdM$otX`(G^f5xlWNZ*5(%Qk_- z*V7%@8Q&T;Y0Ez2h!)4xOdSe`kw_vQ)$(|2M#}ckweYO)&bHiZ>Go9^M0SHI2(xb+ zG0L?oNy6JVHOm@_P`yuA9_iB$W4$#JCr%S;`PCC&)N&W!YZ>8*AHeH^%40i;Zx(P zFokCW`zjvKKo4<$%vKz#;K1;-Y9q36u5HN)x+!UO=H}AVVueyvaaUM+c-#cPVIdc# z3ToNo>|@ihw%*2Bh+jQHMH~GOKu2S?MHGo_Xv$%Kc_9M;FC=m);yFV}j#viZg<#T6 zl^I}IHt)f@t321w6mjf8faUoDjuU8-x$wW4AeqrpoaF5c*lul}^w2oL%s+D2&ObAr z@P|=?pRSh;`X641 zT8qa4xh>4(48RK&{Celb0F6|;t>5#|r91R&K`2wte{^n|swQ*w9*~>@=t9Vp%~B~~ zA2o;*9lc=(uo03CF8+|5Yx_iE8$w^8#(wT$SYtha(+*srZ09EDQZB8D0?6+JK!HM@ zWacWHMSXylnEq8fcM<%H4uCFHPcct2B!)}@k!x87LnUh>FP|;e z!k$;{f51WzmjGBu*pe&P$}At%F}d40B5H-@(-9nG3fdB+I4yQ-)x`ca9#k-QHf0m* z3YUp^Dy;7yHrj2OWkEUG2XrJv9O%3oZcDsfB2IvO zsuaEv0sWMKabS|Y=!$-VK3SJBvXPfwMT?kt)n$bXy!3nI|&dCya8)0@} zK$J30kXsV4_A;R|o>X75BY(=z(PEi&OIO&Ql|>Jl`wBf_vr8(>;oV{KRgr!Jm(f(- zpY(8|xTU)#d&Rqz+h#h-t{kJ@EK4b;(us4(D;$5wPY?cTZ}yQvjRQ>{JGoc_%tq#<@pieh3wweMTBMAF`vm0)n+Yh|EH zFf6_;LjvJgFyg8k92v+oP82)@-R{Cz*Av0+!ru7h2Xd<p-S3;&50YcCWI1{9YS zP9LYOM7p}$WsUobB(R4Ju}#~@0Xw`44jqAP>y8vbYJXA(J^LqS9zYfv#740GdQZ7* z!RJT6Zbzy@>!O&KT#Fia7(bF>t>#w!_zdT=I03W&{o41j{S7bJ<4SRF$n5F9Y$)b# zoD8o`$FtNPlS<}mmF)?2;H*BX%+A>P;gy;gMAdb=%LS6ZBI|4XHfrw~7Iv48bt{8c z0$PXFnz|bAYu4*A3zZtE(>CAk1rL6}84b;GowsG^;G1EgB@KQIR$>IkS=n$Mi9n1# ztGVLPs8DCH9mQ2cnJ!aRmgX>L+dJ0vSdPt(c6H;BG{RI{cvXZngfh@A&Dsp)!LTjH zeeo3MKD^YqG*~VR-uhW)kuN0Lul#FZw~;cFkJC?11=x$wCn|HEGXD#h&Lq^4ZsFgqVN#R%*1H;#)Sey-su&Fe#pi|2sPA&F0UT%%oGfk=s z!Il{-yVg7Ne2p}s!K}=YGsCD4*^@RZk{MxS6}Pd5s`w!k{)L4UFVbw>s7Dqx%4q9C zL4bZRG%wC!swITaP~laI-G?`oixC5{XeVH(nUv}<`;~dVcm8~{+6x%J?0TBbu)FVI zdxbC3OlTLX$tKo8htRa85sIgsp>*K3l1&{PsI=X?1vS^Jz2CQZIxkpEI6`PsdWNB; zcY|lFag7n&rEXTIzN#se zrhh~AusjDA=^v`mZ3pfDMI853!n=>($mHL;w3%!V)VNSy-?+Q%WUi)x>j_LC-ZWH# zzdJrSH}5bmbCiP41E7VRg3}jTgypcxUm1eF;I7;J$S51YrVYOnv*b#LG`Tx*sC);1 z%d1S0^dF?-*;VX~$HqO8xQ8R2nG)OW0J4Ui-!2O#Urrp2J#!-s;6Y@~oah)C_;q8l zM-4nNxkdxrczY4~lR>#p?Ym!HSk$3a9181c`5yt?8+(}nY)307bd zx!8=$+km2=t3TmwE!|F6TD~)fGBRyd_fW;~7iUuDjjD>)q5_pyFj|Pv^@1X}i z4sg@ePFzuE4ZaV=?;UTGVMFN48_+H)fOa)DD;XIYopyKw-!zSn^L)-H*|fHzAhpga zd20P2#tURfHxwOv&n@6%I`Ls` zsht8M^G%2QOX0H9lNzqtmhoaQL~hYDJw;g!7e1>Yp;_lDB@S|ztT&AI-IizDC60jK zkd9rx{F2dmoUZ_@3LZ1h`<;pHjdeMxw>Dp~s8MeZSGE@%^S%I;6XiqwkUh!t3|(Q`#fVb2-qh}XNX@d* z@%r6(oz(G+m;F69n0YE^c33Q3GmX2W@tV&-&yR9YHyH&Mt@cr{L+I{vTREYUW(3^v zh4FZ(%0>mx6Y{@&p)^>gHiVf%_5j*HOj~-iPG&n?TFRZ`jv;bsVl#svHn#a-g|l?W z%2>ibXUD$JSvU_CMTtS3-$?8iR~&Nwu~;izLcO)g`JDRLZhkv<+mmte4PayJ0Q|B1 z>kE;q>eC8TomoCwoHM~N=qtFNno&@A&nDpsZ?+?a5kNA>2FB=vp(DncMH%pi2qELc zVw%~PC7IwGFbzntOij*P1N9QoW|aZQo16Q45xO-@&0`jGNYg-I@{y0uDk4chdKdrr z3}e2JRIGIRNrrkAH%pDN_h;8&f)%7Iy^>ex=^p%PQ^Xa=m3Q37<6COGYoC^?Z`A#< z-rneBQV36IKLg;{q4}h{w$;S1q+W70e+eEb2)uU&%^N6TY$yu+>BYlf0;Go@;-oOO zR29C@MWr*MRL#$(1INTkFar4gHDhgpo_OP#G`7MT1g^)+6Ef;EnZxzs#whuXefi~E z!d{t%#pI*zrucXDfVw7_TJSq%su?60yo_12^`McLYtqby7QF2b%(Vmty&#*>B1BM! zVU4;Se-l_V&1=Ehtshko&f;P^(uppl4fBJq*7m||B5Cnx z?fgG@p{VMahY5a-e|Vv#e|RA^057y0_?H(7sc<&yWu|c9WL52w=va^UPtoWIsc;4F zKOs%(%%*TkFxT$wY-bjt!`H-C>>8S_srX!^x4_Q#W1e~)0`NlZLUVofZsz>(YoqXq zrgNO^g4U&fc_FOAe|Vv~zq}CaUtS0yKOxB?HQ=GT#4-aH?jp|mjL9fqf6dbeT)b6d zljMx>08QBi*4`S^V~P4C8et}NguGucfm+wK^?XVr9`MP(eXJ5qUU)T71aHD#Ur_&IM}6 z(Sy`wy0y7K*SX!I$AYnrx*cscQl##M&v}SlYk%EKoE*tRHzADmHTxvOEcmjnHfgw_IIpNmE;l_gW_^0dUKH^__ zuuVeN2IeH~`@*GGNv*DAQn?dCY%)Hu^5_(rOVDlOfU8FP@)n+4%8d|Yg%@#zCsMJ% zV7-F-^5sw6^)`Zkt>yuLKs0*FW0U5p&ie4OBfgCQH63S=DRX+tgR^8!%;twKhTI~j z(Pr0L@9nM*R!X*-_2A{h+P}ci6yi7r(B38|iGo9`C(lc|zeCN&3_qrA>NCrH~k zd%4FBts(&nNu&DSo`Zx;&E6Z(MY_8R5NBQm`oeGn&_+|uUTccyhtu_$yTSebl8 z5zmw<9u02}_+YJPnpAGY2Ib}spI2;Bn9C}e9`l&OXEhhS5XyYs&KwDuDVpD32nw-7 zD|^CiB93aj4e0Np%sT6TBpxH!;2T@DpX}}r@tx1bt~NV}8ZLVLa6RD8*E`4>yq23k z-dxVVD+ivtts(+&x?A zrdV^QfMhNUsnYL8T=KZy^jlrT!Y`j#>d8L_(;}gg<6^@IhRVIi%3wqUAq}?74bshi=tUz=3?YrflPogg*HykU?`o~)Y0n@S2a&S$>>N$6gd z@dhJ{P8B-72A#5D7LN50M8>0yAy_mOmoHAr3^sg8X)*V(?p_Mk;QUG=VPSwARyfH# zPk6$HGteDNqR~WC!A;%mtaWvcTA^@^Y^u$ak#*o-I@CJJWMSqAiE5z^WD`bFiFx}7 zfthp$F4~!J??SHye`{u0alnJ&KYdYq!R0b);#(aa?Gex5lg*Ni^+F112b!4kw=!

7he7^Hn8?M=9dDVK!z&UZRO zZ_`smLI3bVyF1G5eg79PWO4Mv3thtbtIUEVr5oLE~uFAl* zk#Xi5d&!|opjfb#b@QtEgyG12SfN$msUqMrm#KC}nL44pAN4*vJ(NRW%Rjmh5oeZg z{*^SQ82Li<2%kkW%|E(OKuw73TF0@9>ay}1FBI(B+lfEK_8A^02vAd3N*n1rB!dN> zV&XrzP_}Pneb7&D2E=+TAdgvaV1r)(xLFfgy^a?|65{;<=wfAZX(h*R;fy*`Gfr+q zSNq5=ZTx;)c%x%B#Hro#SO0?AdNTdkkv?-qa!Myw{+P)R#RDQ28LJbIST2~vj4U{f zwQ7hKJrV=KicScVnV~gdW?p89cAQYK_q`A=|Nh75akpJ~a^^ml!5V~}a5ITXdw^A|?Rxjo(+o1`Jro+Txj%P zTu6!XxA?!fkaMUv{hGl*G#mq}a*0cuJRq(DO`>azK2ml$N+U-@4+ei|l|Y5FI7>Ny z-pJn5Oik=RxR3$o4=$Ab4=xmp^Z(*P6p;g^ImgOb)nV6+D$@vmR&rnas{?NZ7P2K_ zVRwC-%S(}MG$5)?M8T^53l~bN{K=E(bPi%E&N~JzM-?t-Cn%A4I=Swu=Mxi2b^C== zWhytkfoMf2zvtxfIjPUdU?Wq8iSr~$0xFiyjXQB$i*v3I^_e)3T#UecL05rdF<7a5 z6#@YlW#>TS9@7|zAx|<&4k7j|9rf|m8c${EQY@btzhGtBNJB22Zf3%JRK^={5P(Rb z%|SZWgKTOu&NhowJWiJt0u5+oT?w=BQaw6yXMhI(HtiU=jBo}7)CWPytkEA3qCt%S zVfwL!k}1RXIYqzuC1VwwPF8ic1Onu}ZgYZZZ zq5I2r>96uW{WgLYI~edqM*-!1cD@@RP}=8}sPBL6vD^*;2=QOJ&oKE&&I-Kb z7Cg(TLk}~yxlsDu!ySQ3DFf)%qV*{Pm{`8xsb<_Jp}ug9Fl7h>(ih54pqAbjjFPpt z;Et<88>5!rg_`l9+yk{$BT+H*T99b*DNEt_D>EIyKQL}hsi7(@yd&&B`|s_kaGoqk zWXVWUa5BOubP?hzQ&!h7Z!z;G`ISu(1G0&;N)Ag3N)JWTXZ&LeIdyya2{nxWQr9bP z_&FFAWpS2FGTbW{)b^QNgpvNo7P|ajTSy%IMW#swuL0XgEzq=yR4__lm7ObWMWvf{ zjzML<1(ACQa|Im|@Y=2}V|uiz+<@sqKiYardOgO`5hHvOH{YOzyA*^&3j`c39s%LL zy!UU;$q8ewkbal6b?k>1JksH<2OX&g83?6-g*YPbct1;85%{uvE`^2%)$!6KUGD{* zVt_6$c`k+#@fu*Gu@^cZZxn`$5nMpALSbV%73Zqrh!~0FB12;MUkcX#9E|l+p?^+B z_uAm#E->sZ+XdY_o@~uLTZDBN8$}Atavi=d`J}kmWa06^!AT|nB<+)6$(|K5gXyU%jucgs+=JKTlza!ptscdZ2RIrD< zK%__#y+H=w=E~i!`73zX!Txqma4oWiTLUuMIxXK|cZ{^j)6s+!W6o?SIT;g0oV6xr z*33mHc(t7Hj+E-g87a0YDY*$v1e>bEz**e5HKg%E0+>9MUYqThGQJ#EUe|iPA0J|m zZqCbLr(O@6AC%}AlM+mk$v}!FtoVPvJYK8zynSuE;;HL?7$07Vc`Qg(ms>9eqHzfu ze3_d`gptV(9PRVg@8tgump1JK;cS>$n->Rio4nrqk6U#Ouf>!m)m#}EZf`TDPjl7&Q@WP z#W;SZX|SbM`Y&22!peR+(X%LPQda$k7HV4<_uAko0{8`BcwdW(8q2|sQ+TZpsVCYc zCy=w|#vs-ucNB0}5&JJKq=Ku^bDlgTk>ukwyxrkDw9NA1TXKS9A3>D)RKn_Pen4y2 zvm&ybiRKHnH-WE`xvr_k96?Y+TZ|G%cRzPoJbJ9w+Y5AxbNlX?Dsi8bV}@Tw)Bzq= z(iX&CO>A>;_OojhMTT+ubxIw)%6@-2G2LkJuPx+$Tk$`(5beLV(C&C}@^LA{xf*ps zB3cZA`;d$5i0*eIJ4ML*b&Sq zu5uHBIK*W1TY6%sNdEUYg8+P`aLsP*az6)JH5z@@+PK9?vpaO)q;y5N#%|?{37l>j zYZY9t?JA4Yu;M>^$>r*kR$J1myl2a$b{w-O^Jk}ySLJ_Snut2C2{zUr*A=~9ZFam* zg24r**4iVZ<$g6FU=q_n=Sy-93KK5Y`rJ$G*RnM~z&lE|x%_nKbvvWpt7qx4?-vK; zVOp#Cbe!{gZqLC5HTFCOzdy(|g|@hu;n?gQ6S3k)0jF^}3YBmW&qW9eB21GQEAJ~p zPvBndzqD4$&r_5FriCQqZ23G5x;*hx_MLjVTO!-AsDMJ!bW|!R_ua2bZ3Ec%mZr2?m z&n(LOAN9;edW0*f=I9Mp(dEs2EEbFX-uS&uliMrhDkxPJo7wde&Gnp~@XIqj$h6>{ z`T_}f%YOh2e2cEc+{Q^#W|%$^x3Ko=qZ*F7s6&bF-XzA*y%R_H9(jhytM!w0{c zeVY(8a;T-c64S!L2F&%EcXpaM0VOe0CmcuoAIwzNyQ%m4BboJU34Q((6}(@u!aWg& z#SOw}lPYtcn5DGBx09|K+D{z6{xV6g|Ka?4`G|cUpAzdU%&{Tb9mVx!24X_m#XtD5 zn_`673AV242E-!1lZ_ zUOZeGjU9`1d{hlNxdCspZwb&-!p8d%DVoWT*Dry=dyD0m8x~6Om7G1*pgwRuR&D!# z*+O2{{oQUUB~H74eqYG0w>>~8)|P+PDf0YHBbWQv7TQn!v4ti#1yf+o?5G`habqTw zv;-tTt!KjEUGXtA2UNHByQ&{qCUN4QNtYv~^OJTR%!!DlG+;>&?hQIvkTJf0e-c?C}*Qi(G9%$_3ACERw3dL!M8Stj`O?bI;Ll>SoR zBeavy39%R&|0E=G^keuo#aDsN-thq3KC8fRuB7i$&`}>d1XI?G1Ly>0w;)*sY!1z$ zHib70(+O7MEa*3gHqH}3+LNqAT5odvYss~>R{#4wX39Ui)7ALr+WT6o)+U0eFxh_z zXe63Y>BQ7;oW42&^Zc5}JKSUBaG; z)+)3PF__}Mvit>n_Wlb8uENwW_~E3PvI%&S{bkoG35F8Q;86D zR*VdY>TS|dLiaO&Ax7U9&9qI{XBhq)7IKEhCjEhhnx#0i6s2qB;M-|!4z`3R9Y_~u zCcdU9d16!}n^Ov^HOk07O$Eh&In!t=_rBeIP}5%i*|hL-2tDVhRYV0%)b)C+vp+I= z-ssLnKjvJ0Pw_c3;SD2XcVD3Cf~f+ifDDCmj5eq0cB8$tk0Ixv_Y0m9421ycbJ&vKvSSUp?k zFRI#Ei8YFF(v3ALp*YWYlBWwa*aW^=2p2WtS@#I)bX^NkKxL(-Au>fBkeSCFm%{$Y zY4WS*m7MQA3TmD|v(%zbfWwhOm63fu8ztubZ#)Q;4U-Y_>1R8Y>n~ zPTKRzDJVnP`?H`oo!E~n1aB+Csx>pAXiZW8go$J;IvmbR5b_qrqWiI4T4FXTne>n7{x3$C^wh0 zD5|IfX%v-Qb`iOy-{SGJw=G1vro?EcHiz}nE{lGfAI~Fb$b?o=1a2h}M%$V&{XPqC zZE2*^CWc%7V%?~*I!mKvYb9Eba>jsWZ75mRmUMo8yYO-bJH)}ou9A)mlm;3H8PErF zG>%Z@Q|Hxg#lvdB{S3(_3`7|NN<=6RY3hZN(+Pm5%Sq6NlkAK~Ey=cj{vG(08}dmwSnB95FhQ__gi3E8o{lADY&jIf zva~o1OZKE+%)Xv zs;3_-TIP~eneK091g!wN6FAaSKwEe!`YB2O= zYY))}wOM}gsgJN*YkK_8JC1^0CHKLx1at*N#Gwmqwe3gXvIiipPB}9`cb$j2vY3%(eX!9ll3X;8i1iaX3W4#%P+(u z(COK<2>^Wl=a{F`F>BVn+ef={#+H3u3JgrmEs<8;@Q4V5|Ah*j|JrC8)`x2Rg#t0& z%^2`Dtwt7{D0OoZpP2_Jw|OlFhqTINkQS0Kdv$xJk0`MV?ud{mimN8NfEqg&s`?Kq zB;yct`C@^Ib*?vmcKV}dKiyj&>c8C=9rA+>^WO#Ys ziI@j=2?fJ-a!kGh>%}t(%&bXhunI&@6d&nh&J(WcTgBL&1L{RTIUp*-cjaj* zC)-o^m#D4%nP^~I!;|3~hzkLP@9u%+7;DE07b54YO;9D4>CKK8=54lErV;jz`KpqV zF|O!9Z~zn?1JZ<6-~K&1kw6vLX-oqgQU3=xH@5r5oaPRTHB=KcjkB~#DQN>PcM!v& zrCHXd^vHdiEXt&qL3ue)DE5?TDU`_7a_UbYL+3YV^bhZ9 z?ajYeb?boX(22n~iK_h-35+~+_b0CQO_s|oo9j2wXf#<*d7-W(+2Zh|81Q301-%e_ zT?f5S+dV_u?zOrFbT7}TJ6fExoh-AlhU=6&^RFKHTZw1a=N_lU%laGL$m8SHLU1LQ zT8XRuP7SrKzVyA$_?gp(Ek}-lKmQ=q7O2hReV4Vzu4=zm7bwdHMHTia+h-uyGas>buy#s3V zfu#y=j^`WS6@Shn`#L#18ZdX*uKQow<{iwWC_BJh+k+p%#Y3#n2^iYktgH&6TS#r$ zQ?iV5bPa1!eUU1p=d0guLp(k{71*@zt=9=6#LU*7b{neLi!d|uQ>yxDb2ZaQa9`k^ zcu*!^3<76TE+8MUneX$9B&kxBxDWf2QauR{1D2^f-ms2iS7cH>z0xF{Ms67L)v||; zT3%JR@A=6yY)TIc#@E*H!e z@y?<3r$IWHUsoC9a4Scm6WAeW!ftKVw3J;KlLkcH9ulX%SH8z7rCB`y=|!C!;eo;#1uC@XSc>;q}SYP<8eIlK0^?)tmx z(PHcC^_Yu8%O0dxxu+Ut?E>#cF%c)>^0Mw*OQOuBR^Fjp$Jy zw(aH0_sEt`tzh`WjMV1Kq!u_3)MLr5dmy83h-WO9=dsQSCcJ|97SQh<60ES-(~Dip zEc>xESm0zPXchIsWe6G)x9CXCknyg^IV+TIAz(w+h|NRX+CZ(0G{vDp@+fb7%w4yf zxk2hiejJlh1}epxMSTc5XE_wf_B-aI>+rV~S9?&U$nA!0{^4gN%e22dSAJXU{ibd$ z2I8Y#!fn)SAPkk$Xp+L4B8opnJ{#lW?~2`VpMV25ia6mj1#eS30#p1ISTyx=G^%5R z84JHV9xALxOb_(RPDCetIQB&Z4g}3ifoVc_BNIP#wg@3CcYaGtJ3G(PWeQxy=YQ$a zIakscnyL%lASU4~NqsVU8{Jixa3-b?VQkmnnNL$z{81Dg0oJfGID5{h57TE7HR?Os zVjel#BAsb(90qi1nR}G&oyW_WA71$03xqRNoR9Tb@{s=BHRR+^P?(-AwtOQskQ@qf zeBV14XA8s+!tkAk^)mY57Ax%dwUX^C4vhWTg7A}T@C2M1`*|5prU2xTy2V|>P``es5yZ8}<4S0v+&fezAVK>0 z#U(5Z(5{*Qaq2*Vjr+vH(d+SQLV=wo(*w$2`g;H)!$Rvx>F4|%S2e=OTx`driE<`D zCf*lHpr<+=i=Lb#(WKHOn_rS}ugN?$?Bpaj%1^7cl`z$*`^eH4OEUSK1+v(&gPm#!@n1oy9XqT?|SH z?SISbYy0@XQBRFooExzqUj6}7)yz;P<~UNQkQ0=u4}Y56l;c*zu0JF&i6R9;y+nLp z>?BDd?Vw}w!d=nogTdd#?|Ow&_a zRQ5r}bc4Vx;YwFs^CjjHGPi#6*O#-&*U{ZaY5(X?CiY8tj7y8A4u{T)ls?ob_(~c5 zlN-&g3sAfW;4!|GBCgJ#g8hIfTaqz9h@W-W^2VS%DxC!>^~PyoM(l=+`KPY4rr>Bm zQHqris3>kRF#;o8f8lWsE3$=Dv)N^u3)}KW*|WNLuh<2t-TZUn=yvhIW(X2oYK^M< zfJPkoj0^|8%*LS#bRx>um^ip{3-+d@pWB; z%ScxMHb`$3LnpncLV~ErGAnL$>G@6Y9QY=bj3_`3-DdYB;dW*-s0P~kYz3zeF zgdu+iY*XL|uqh+r2vt?%pAL${Pfu2HSL9|K2!cv#$F?o^`WAbq@jUayry%lti3Mm8 zGBl}Sk+r!+lc3_fV^h#%m_~}8ggNee;{_yv#hWZ7<(M>HGBolCeF_+eS}NQ5pQSPR zdGkl>Pc@gVcoi@e1_L6(I%FD#Cnh&cHc&oQH60FyI7Y6ke%*m{bryOUROLV82<8i$ z+o=(yZ$wa)LC~Dl5ui%_o1YdUVHseG+*Sz;uPeW_>TTVBQ?) zW}jfxt$yU>?bWHz0O(d&;JeyW0r+?Y0s7(S^AN@<1F_KZXm`6**qgT4f5Jo~07G^{ zq;x#Xg$17;8m_E7j{J_MF7Mp+g%=csa+_ zbzK5ue9`qL>k9j}X9Uf^+OwSX{#=8lG4BcH22A>j|2b*I-kyPs$nVH{Kyk97} zUg2RI*UMK5=__)sWI!a3tBWUlmTtU0r{sJrcH2*M+Z8gn#;e+TIl!w%sHT{{yj|QS zawJy-*Hi=$>~E{_T42hxMAb z+2Gs#I<cmb$QO ztbU%4;~lG>9Ea3}Q_Q^p9^`6=yH|-k-$C{lCSP@DRMH|Qk9bTBN;p~+_Mz#m5VQ@L z0zWLW2;3ydRm2hsk1inYE|bj@s+c+Hp8Gz3@@pvC!yCgCmtG}A1kr{OiCxRxJGqL&xQRKPkXup}#YK-B|6C4Jf7@L8|J=Y7U#Q>!e zY!x9JM?DHQJD+y@C(ibn?xuBj7mX1)ts51=ElZ1@0v#%_QnlIDhTnL<GJI*XS;vb$i%;~4>hq8`74R1r#W06>80jy52*z#uF@Hx zyJ=|;^eqWHoq>#W>jjkg4b7cPQflOoZd(BL9R4<}(vi*~IZy zq8=*T9@0|n6VuP(X15oJ*4+4&?kiH{%-y=~k?_W(x!*4*JmnPFEVO%$|IjKbVG81W zGGVj)*(M#jtz! z2gzaO)eEGceCKn``vpp-D)fYPq}5~##uVs?SEu_GqY+37`grqRZ?0Oh1TW{;Id4*OnuE(}`l?2&$E2+U2#qF)nH!e4&OoxH*1%F^JzNBn z^E7_ztA(?1NBjp8DXRRiR)Ui~&jqX;a_wLFv_&-WmO-WFcr&?mvi25ikq@!}8`Qqy zfP*im3D;gKIn2SEj=8zo-LD{j(?*hK(;fcu4kT7QQbuVJdp%qCqB9@UZyur4ZB#m? z`u_&0xw^pQG~|4p>ku&|ihsf|-E#7P1db^XiduqM<@lLmX9#DM>HN>U-`&N73jwLK zin^ZY5-s#lEocd<-fmheS*buJCV1JxN_6Yy}=(RUUAFf6=1~kM$VE$ZS4w7 z@Y4aMS|y_f3PSD(nK(FIRs_K^j+=Hx^0fehXb{v9@pyMSek$@HvB?rVw&fX>>uoF0Q+*{Vx0g#s z*lNo}{Nd%$KS$A7=N|8?c~4v?nOAF*d|0}<>=za|aGVmE85Dt_i5=Pn*d!9}2fe){ zOmu~8<9zaIZ-a*mhUhj=HyvI!!L~_v=7{TS77zv73WU{>ppUR7#Rl{mHKaC_@y)$D zU7}hD6ajRp)I8m&(cfWH>c6ShzWRwK8wu9{EiS>;#EmWE;F(DefiIIk(J~MOEUVw6 z^!-@!^(?b7?wYlJc~lj4kzWO0-=vRUVaVaN#rEcZeAFs5Oz#akzPrA#g@mG(azh4N zAd7{H50$gDadSK%la56ybX9P*hxdj)bvIvW@QYt|K?Lj+A~bYGZltwaGy4f zv8d)payq)bOYdc=;dZ>%w)X|t_1@o5`$;t1I724tg5nu~mW!_?>YZoh*NgU6kVc&=C(Oz zF+pCVvdP@e$02}|=Qjev$f^~bDQ8i`0A5cNB=QVsmll{krnPQ;aSm}}iB8J>B<&qC zB>7?gcq~B2_CE|2OzCKu1JBR!yh#TWlVr5 z6l!UxR%8-56@;G(GVop4GZLSL+5rR>16QpSz z%t1R+|D9r|g46^e6jOfA0Mu(i2Kh47C>d^LPHOt}GlGW=!WFbt4#SXx4rVlUqC+(e zp>{h-U--U?)g+2;c zJ2BAt%+b&g6_iu~z{Cj_#g1_dfXGT>Vwzg~x=uQ_#(OvCEs8=}MylT9Bg-?#+>ig% z>l#HX0^WvSNAM4Kb?8t_ie<&6?D|p}j(hNmj864$?msVsE)Xn)$t-?n zYm+9C3!;NF6(hF6gMk`D%RnzWy_Pe!a}Iwtm8emx)sg$P7nHSv^^KW!37}BVYl+QP z!}#X?VQ9cic7Km2sW}pTHDyYsuc~hr+>t>z^LCmXAm0q7Ht=mHy?FFeEGW_*|lu=@Y>|r)Ky}OWt9~EmRj|e#dTR z1HyQs*v?53byU6i{=CD+&<^Vzz7LW58#%p9c99W)EO0w~I)UN&87_eV9u{V3Ug#}P zk4mjI(a*BeW%Q-fTjdx4LOkok2BK#4x8JC_(JM7XF>8>NQGSKpaB3PxylT-a|H`Kg z6J&N2t8GcX`z5JT>EGSPvC3$yTyBdGyL3FxtMO1Nx!cdP+X?OqIfp^5p8N4Tle$@Z zI)5s(BjQbH-zP!Deqf3I{#S?{=9-OS7|><{O^>^w&mGd555C6bO{Qa2zV1)H##~SC zPV|kj?B0$hH|L)jm-6&V5S_D+enWRcEfk@cpRl`R4*4q@(_=UVU)M8kKnVaheb=2~ z2}D_GML1~R4HL)tMP8Cn5+5of5+6{HS}1NFg$Pqr|AdOJsDJ<{?3pD8xq&9_4oWka zPaqHxQbAWBbRuek6VteQ(<(4BBSqNZaxd^uh_iSBSb3Y=NkVRu!ftFJD zYQO)%lCWW(ufD$J_l)pET7dygkQgd#Aitc3^wDlo@wW#`?5OsZ-ngnI+}F*zA&09v z>BGlGNWFYNoSRBkmgUAjDGL7j&v%y#H`L3m_p=yU!@}o_qW6BijeiGu2#`E5@D}Bw zjoTD-tVGroIs9^u{4m7&KnK;^hoB$3q89Az26?L&u`F%5@H{*?NbmAS`!o41@CCNs zii6{x$O}gr|u{A{Vxj`ai_<`*fC zWPcDCZVpI^B(GTCw+u?3pbV^%?gI)7g5OXWC?bRc*irJ>A+k4a^^Us?Z$tTMvTL3E zWIHP7yCYALY{H42&wCn2uYXX6^x;Ha#7NKdQ@LF}V0NW+D_HJ~HO*QUNdJOP&eaSm z-eF$Z_=Mb6)o_0szK-L%UOn8U&!2qrFVv6d*`C#e+}yYoMZgDJp8BtN0yp62(i6#W!-EzV5B=s z4~%C|GySxre@^XI1LZlMsKK-V$?zwRmK;hJnI>kJT#(^&12tvzm?MKBT~%J2kiMe% z{%u8mQt9~Hl#vqkg;LIfNcYnmhw=5S*ZTcwySLf;{TsIJ({)`)GfeGHuv&0l{osbC z4?Vmk6EOvi(at-~iF@;xoPQa6Hw&s9nU;|@SFi>Qa-JIya&H0SkTjfQJ@IhG1kVn3 zv%tzs@c_}Sr#UOFoIVGTf~V(!!CJ^nzW8JNW@o6$W;U=`n8oBP(;2bZkNz zk)bV+Z7en)$H)+lWhU+};Sy^QjM%Uq6r?ZU1Cw#kD~Upklkg$sgE2$7L9?_HuI+P% za|Sc=?0Zu_`N9aR$s=n|L?_ub9nY8UgA-`xWTfL&VJNxiA=AT)MmA_EE+8y?f^w3g z3%Uu!oNdZh;QF@jgl5K#mml)3N!kcWD+^oFxoehi0pDWg-)9okXeoA_3N=VNu@wNq zPC}2wRD^Nh^ad7M27^TVrAf-R0??EkVOt1k*Dr@D7!^6tqh)5TaS? z_K4qounC?Lh%1_Dxk#~&hTkw+t%A>e0)GPk5hk2Xm4quM4SB1cwDB*b=`7ANStXl8 z?q2=+`U>wK@7dVd(Qm1BHH1*Zoq#n>O#G=7SQ^jonm;n(W5yK&o@*#g9PAcqF_D9w zcK5JmhrfPDvqbhh+x0d>q7VAi8ArS)hilgh!{US`5dbatqsIEkX16N0oh<1~C=9GZ!2iNQ1HKEgk!JJ8OCpc_OmG#nT&^>Az#9yqIA5Ki+wtS0#f5zXWk#52j)63@q=Z1e;D{q5MBPuL=SY`%d@^XEXNkX- zfs4(7SUI*lIRNN_C#5}1*+XgLb9+==t&CKA;@H}Nb#z#%EG9wuId;YqmY_X?cf z=3Cg7y9ILn7)zORsxPDK?#5JqGvaR}Y`|049HEoqOY{;EB zWT?%MIeuf<-_v(sNz1@#s`YV2uH3fP@&3qlWtmZRme*a>?RlefMozbee$Ol+ySCvt z_jH{!-|p1<^S-s|u=jkG!qn@^mi^{!XK6bT*WZm@+nO+dn~I~X)CNC>xzQ=4*ajXq zoZ8_FDzVXdD(eE4tqgZvCme9>c+Az^32VNwoHE+|j+*dLlxuo&$)>SbsnFb+{!>0s zxD5Il{7Hn>DKAt~_;W&Af-v@5vC5i5+RI;69SOPK*iu8fV_Cp(NS>5uAypQZgu%8i zSPW`HH~~`x8OymcO2D;qNOW_$4s3H4L~IjmT%nhigE1@y=8mLK(IcdZlBZ#>4{Jya z2HXL-{bNS|Tc5C>M8XB94mz$tBCd+P&JL+W~a{&6+CKy6cBs~U$S9O zAk$a^hU_L~ozt)OBpd{tYSo@lI%6S6N1*Rp-?a{i=rWKJOaTIjiosVfnb0Xm*5RF z2%%U8%sx?rx*4rWABl`gvr6M&!8(cAsj>i!HnBjRa2$(pGQ9)T9Jn10h}Nl9Vr5}7 zsjHvX4q;g44Qc&A=0#J&1dav*H;ug96IL;lB@A~7-&H+uC*OU1a7W)^hM4bUWJm`U z9|rEp0w{=ABmjX_5Td~QrB^Qy!C&bie=BL8&A%2{J7?!|uO$p|iCj!SAZW|Y7Ks@0 zWm_3UA%I-UPmJ-=gcWLEwgOfkwvfzGMLK3YorTiR3^wfE&R(vt2WRo&@PV_=_M47c-aM@ zh4#iCq<-ehAkDIqUg)jON{nz&Pe!CId^`x(u_Hqf zJj{+61_EO%bxyQRo808wLxHQ`l%9U`q+Lyo3$ePw`Y4%c8E1R`?#;?B#2E>`9fOV0 zqKs!L8=x`y7@ONVgVYSz&P%bOxUqPem?zcn4$f%L$1!5f6w%RQ`x!DcH^{@b{X{tn z$n+SjMkO>O9{3Af_N8<(|Ki!g78XdfXPHH|M+@J6R#blzrVl@QarT3HJm*9t(ljOPkKdvxcD5L7y~ zp-WpJ<0~$N9PQC9PtpD^BFNr=2CpseAtYWiBnC5p)x;2Hlb*ViY6q79REnJPi3Bpw&+fr3}>@XERO??ls9bCR=& z>aD+iM=MlU+*Nu$$8R&={So%))OCMT6Cb`gu*^_aI|wl=AVJJh009s;NE>LAOdV*6 zVMb~)`djJ6JmU9Mc0tf2x}c3<(;bCiw>P&2x*HbV5L2GkWCxy*iFVSj^rI*$f! zir6qD)B$tHm>bqU$7mvq6+4}muS^R5$`ME{+9Vd$LRy^oiO(Z8$l~%kky%bUfS`r2 z)t1&T=zDpPF$J5-IPs4$CtsSl9g&6#Jk~TTYk_^0&C(q2{mK!HRsq&n2DR}!3|a#i zvPL2SjfP@RrDk)24my)Y2*PK1{?~&(W2(4a$*?GWyO8s_SQX6csA5C?>uNwKnH1nU)M& z+L4c@_eYt5-ddqitZAflf4&5!BOBSuu?#{q4Av+pKE5mI2M)M;dTsR^*jR#Gxf@3x z%mqEC=$9=*$6*-W_fD(z!z8|<^9F6E2N*Yo65`wf9Xc=9tZ#FES-24=S+CxW0ToKj z7vg~`6Ymq{dtq0{Izri)h6q9mG(V4?(F+L>VSH)(hHiti@xoe;kR>Y^O5zLH8>H3#gb$OxtJJa9J z`L#|CNC9uiGReTVF;>P)=QyQCn$qMrK}=(CpEhSU-!;H)>!%%ju62zkRws{*aO-A2 zt=zfoEZJlhU5QrON2`^;Efwze+O(@I!tSsJjuhjml`rAcXhW(X0B>%P{+sw0Fv9HM;X@S>J z%uMx3vp{`H)&yCX>}1&jlU4pmzJ*<)4*(0|`sQ)scQS+Ty=poV?F;N8JV(^gAZ#TV zo0OXfV;~s@ld$@?5d}MgC~~+k4s4_sEb)j`?Cas24@WZ!))RcqA94iual14xOO#l= z$)t(AE=~8Z)w9#aB%_iI(c&CygG;DV#+wiIC4G)ps-ZQ z{DI};GOqg&k75bnyTh|Xo$di=-~d{Wrm>5$+t}vLFF%{5sajq5yOlfoiCxwQ`AT)T zd_>{jP^}$-wr593sZ`KqX@1E26tNTCK1+Q0;SHTIV&e4SO-O~Qgw)GVt#OvpL?O?+ zPp;dzkqY|^a3~ZDtKA2|^#T;Q)XXE+M1fbc_9;cuO_gfID$XgoIlop)kiF36DH}^P^Vdgo!v@cOSV^Rp`pu2xhd4E(V&?vXx{J0!eB+*guZKv)TGnUQr75W z+NI>19anP_{tLNvA)+b+Bg|TB?Wy;w6l&?W-u@zi+9uh4CF)XV=sNF*M9j0Ry#*8mY6kLbKjMoK)!EXdJMf${G2mJfGd?WcpF9EU3 zP~?e1526}fXJXJ8e>tsWTu~TG3Oq157A@K%mVG;*>;$O&(Xm{)V;xov8Q_dUxR5@= zurq2nUP0DEsg~*cr|?{Eu=EcG?I;)T;(G|FwCo?`JFB$MvuTsRm>qf|PahVD}S zY(x6BC~pg^X?~AG@T-D1Rt!iPk;p3#+SI

?Cg!zRqsoeUT*6N!vFkt$d}>K)G4& z<54g<3%xP`#RV?jIStPzy%r*Kp@5z#KdF}})}W5zuxmq+IsH}agMgvSNbG0V`1|(w z={iCQI%dMHGl^y&4Xh%1!ZOsaj+?k_*vNQ8`_TJv$YAqi(c;H8=-bzI;Wv#^w>g}g zlTouc`)@~Du%-c!;|Vg;Snl(RKv(H#Fx38F6feZ}R`XTuIGiBJ2#=qx^YwJ6&h&M2 zYx#gZ$gYsRTK<7?B}22EJ!&B9r1~#04;Oj2+|kyPuy;(6mZw20EM*X4<^>{b7#mLgubvI&^MIT9mjpwy3~RXBL9cIdy0`gd;>h+I&IswZQHhO+qP}nwryLd zZQI>v`#;FeOeUFR?>4(hT~zA6D!+Q)=lOo>uN2*xLG>FCUrUBc9u^QitGeyLZrKT= zUSr#nZ(Y^PsSXEPO7PLhpGnG|gb&%Nw^1uPN>q3F07@HMh6YWfY9{P-F+uFkA>TEn zo=dvg0SvgxjjC79Ms`jc9JY{-8SOpmOzwd!j_++kGLgcPx4Pfyb8Y^7ho`ItSa{Q{ z{W3}Djdxic59{mc=hau`#uK>sR*Qh6$<1=2)kbJjm>mY1uU2EczFNozaZ8uJe|FKD zT<&YbJr*jNRCoh-_gCYdNIbz{)Ofhd%(X}J9pJB57-Ynw!5EUistoQv21DaIjkc-l zFvQ*)h3bY|&LMN%r$w#q#P%%danV3nwna0f|0t6#0Sc0MAF|jexs(1UxN2t>DI1ZA zC&i^e&&~RmZ@3AaP83He0kHrI<%v=9&o=W*@+Fs#1wsis&PfNobjZ$fyNkP>9M{`Z zybbv|3_fX$GEqi-%durxk#doDllFahPoZWHQqOTb2U%k!g&h8Ao#QybNL>WcpHp*b zZT*bP*x#AxG+PJ^v7P~>Lq7;}=?=N()oR9ugSBT1O{~`h=^28%V+SU#*yd5U0~kU) zMD1yl9=1Cg6dA!&yE1SzH+NDlSZqi5=l0r_s-+H2`5_?O+j zcz!Oj1!)gV6NqEz)O(HFhdXEZ_Z#P)?f)cWairJc*q2X1`~%(LadsSoqb?504dE9C zVru*|;{G{?XU~=SC_`dz^I8x}9p{CYV@xjsyGK(e`vd(Ynmizd<$RuYrr}a>&hH=i zC`i}iuC2BjcUX*Q{ZDmVxI?o!u7Prc@PS>sMY0xpT!-8t^RQ`#!J&#@Ogv8XYt}*t z#X?+r=T_rPEfjFhZ-)It2m2e010otS zNrt8QyzO&$9R%G59@?#tEewGq3TVRY?JR-caGY*YwxGHv{v^{xAP@S6yLYD?EGglN zw9e!e~@{Hk9y2WT<{P?>JHhSv2c=@kMDV4{;wpf}Q}AB?7+^DyQb( zo6Br!d(_u^;_TOR4uel=Ynqf;zld}%+d5krEFkQzV~aifmkL5f^!b)?Lw3#5P+ggP zX0bKi4-DRDw(%12EjpLrJK4N42)`xTzr$V1 z;edTBMGPn1e&ueY`3Sc|bHh$pq}Nyra95E~2f}buJj>6wn(BEwc|~FFK-{;d2NppD z;`&fk{gI%FK?84uASx*FZN&)Uiv|h`j=o7dqhdVykVK!E@RjByTTL-wNRMP_LlQRYP(Y69{k{Tj^ z7OgwnDm=;Mo!V6s9+R{V-coWFGGOZsaYVVHs##-HJ->6&kT%Pb`6e4W5fn+iyI|8O zmF*x5E}|g02u9M#a2VZ+MHqaX$ad54i;J;6`0BkXbfvpbjo4_djA%QNzZJ^2^4suaS1O2(`JGVe#!SqCSf)f6|6a#Q;l*{4 z6%l>_&dDJlL#JDT!CbF6>YZ+Lr(bloUT$|A{E`KTNtkkLkQkIPqbBtC3q~>q73rZd z6{MYvol&#Jx3<_Q>IKY}cOd?rBPJB4ykFuoca9BRz^*bi7%6;;C|H@tuRD}2eUo-| znU{~`SeIwju!|Z{QN+KhJukBFw&9U(pAFE18O&%N$NL8AcO#1Z_6Opj(y z`a{2efxjD4!yJDRQwDbD|M`(r3fQMwYfENI^n2$^t)n8g8{zr^gd!`!pDBCu&fE-{ zE^?fmX*N$r$7`y1X{?6%O5A%UYf?(3ZAD<@eM&_wP~M2gFg0>BF$>)%5qfa^lj=Uh(+l8^mm#P&6*)YL)0Mp@XiM`Xcph556)ARS#lf> zYs>8%BAu%+#pqcg7%ilPdmM6IC)9-mNEQrOAyfDghsG5R()VHbL2 zEbcH(0Eee@+uU5eg<9NAd2T2vnqX5ELCekSHc#i;p2&J&lX#^a)hq0UEpPS1{Jj6k z#ePhY+VjFXCwqRUIW)`#E~d5M;5SY(M}s@mcnP_6WOIi#Z=P$SPlPZd4@dHCRy;h{ zc+>z$)7DMBJ|*ncsz@jrL4yn+n7_?|!jGhtR&H$I>DlGqeS*})AJL-tGUG}Ch^(N`kVut3}5^$gu49d zcb*;5eaUf%wg|f!$5LS8RNmAP%mtS$304{Wt+z)Vj73q3<~Z=9^MYFnq838@m_nCW zL97KxUA{=0y%E~I9DPq;P9>Zm86p*G@7Nu2xU@9dn!W!(oOENj{6lN)n1OBcE5w) zigLn-OuCDs=zf>jbe~EB>Vdc?!~ELH#T_^&`UR%g=Pegp{_%*M>Wy!6vwf|@x$WVV zgL2*M6Z@E>Y)Ncdj3{PlgY0(4(v+EO-Pcro-sKk)tx zvm&I3l)qhZsknJ)bAJ7TZa>jp4=)cZuLRTgEm*&)XPELxJK&pb3h+_aXCH!xhHXu! zvUcT`4W@1SksUQ(uR^TXa5UsPEo$&r68cU351jh-{Iz2MvR+~Z>jSzW1&?u<`;6nS zlm6A8h#H^t-D$XIQm>repYktLCfz10x2zI{X+;U}0FZy-1o4DfzoVIy+_J4a0*O}! ziL3DB6X6?qyadYGuIgS3&&yL6RE#icENGmg zW>yBEA9NAL;G6n;S_Xp0jr+Qd2jnqEvrK+V1jg2~rAbzvA&hX6Qq2lny5rWdf2`tw zVC5<~s#fNtUn0b%_P@UrXP2w=XgM@NVpgLO4Zq~ zhGQ#=?iQWoSwhMP6_=RO?ZaOg>x3Q6{^HU^lR&zdO}fkJ-;c)->;Y3Xo*heuo7Yj5 zPUXbgMxE{~k_l#UdF9d;q|%@T4trOr=bK$x`n*3M@p~U%9$VhTTU>`wx*x7gN2LKY z>3EZmJxQYD+__}whqj@Uf+K|})^fwV4hz>Z#{4tvuAJ^~JEsWW$!_+at=CQP8+HeU zl(lSupdCw4npE-w>ZWbS9Cyk|U$qzl84QX*8X%gNBnzlV-b0pAw4#FY zm)JBz#P=CEwv4DaHylltFNVt+zJ(Z>fE1vZdEKy)}}tSL07hEZ&mr5 zn^zLvn7bA{IuS4=??}V4RP{22^{+0kW8g5MH3I20XRa#iCB1`M8vCT{Icnb#MZPL- zY}nvQqeu?P?GIOe_Y6vJxGWA`G8hQl`v!=lprx@jvIf7r=FfJ^PaBjeK4z0ETRx>% z1K2wBMm}h4R3QYZuvFPISkwo$RH?oH*)rIHNHnrk%4c_7NyA2@7+ zIvmJamV@*+3ts7uct`B3QgFIomQ?<%F1}L&Xk3=sm`uH(vsx={u{bxewy3Q2Drjw% z(%33WwX|S=nx6Gt4bVMNSby5CYv3h_qLa%Xe&ubCp{3^jXRqk(VrJ~l!@%|UDacQ? zGEmI>B*W{&;qU_k65hk(sJU@NyT^VQ!Ex%s8La2(lGmGP4Vg`QWm4(dU2U7+I}~Ea zAhcW?h}@kr*jXt+7}30Fd+m4ho5RHDQW}-R2A4s_GBzi9=oa2HW@rBWT=gc=~~Ac zR3p!Kkaj*&{8iBTk728&66|#b@J~W#nPQ7rW9(aCn+?t#Fawkf`D;&%m8w7C5HPyCN*jyN zb}+q%#|41;1}gUx8ric^3mV}_rwP_A{L^3k(C|nzMCi!tbL{#o)X25QmtlACvHnd| z@4z@wVcxg$qoo(z<1eFUUJH8@~(>%NbNzYrq;{3m6K48R#QoZs=0QDP{Z;azeSMmld2VHoJO!q{XSVF8t- zj$wby0(z`7UYQIq87nOdf+6(eD2B_FeAMu!_?WWH7R_BeIrtA?{JpwLRwL^WY|Hw| zO!IX+9-UnF$3h9W60z4 z&Ax)`4MY!>IaD}cob>-**it|P(v1z(T#s;UXQqp=YH*kFW%}>#kf9(EHZ&=;A}lOj zZQy@%hct_M^_O56jcT$rQfq^dlwX@C>&Vh5+k~%XWTatBl4J=JZl_?Re@|00n}wrO zhVM{*k{85(Rh){ zAX_jsD7-N}>m&cOE}G-F?^H zxS;2@FO`Nj&m=n(FuHIn#hsdO48#TiQr?zVJX$zUZ*?exBPTk|A0;j`TYk-7d)Ljb zg6S2&fSv1+3mWfE?O+c{VI(zuNpJZF8i?v&=d0;|0(r&F3Gwq!_OO9CN_%-2uaOdBdY?DwVaY4$R z>(ug=h$B)*QmMq^-F@&w@gkPIi2%!EdsAThU{v3InQUHbY4YCNwBtn89cXkLU&t6u=Zk;BGUD6W#tTmy<#r@q7fCzp0_Xk_%gZUEu( zTLnK5Q?>++=XIgE8727ZTf!pZ~99Rfe??;w*3I+mFo+1b;h{`ml zJ-p&ia*&pO$A5)}s?hPG^2lD~wRmkHr4jbfl6)=*%eeCaR(X>ANnIItYl=jx)`aJ0 zMgq*z3Ix~>yS&;m=RU}-htf2a3L6kJ9NDbH~fV~TPEn1kYOe?aI`PSl@5k+)QXFsl@FZ4hEjBCTL;tHHPt4ly4G zw1lw2+(x3Dwhe6Jg!xrPKWi)6yR9dkHwG1eCsQZufm}#Qjt$mhInUzaof1coDzG$* zGngpkS&A8>XaW9VyvBKo3WsOX+%AU4fHl*|GzZQuuGuy)Y>npC9?|C@tb<51kx}(n zUmmqyVcjR(2B)|QF&Vt7Wb`wk_KK`8V`^sFHiFO^X27x3Sh`o=sBD_ zH`k$*Xsp3q}`(?D?d{i7aC)6Dr{{BUXq@&50@0DpJ>->F=YG+oHVrxm zQ8W%Re3it{oTvvC4f7zuod><8d{?oAzF}bCH3CpUL+y8}SI_O=64&TMXyB;Hjc|Yt zJ)yenv}vLf?Y~*^wPE<^*yQ8+>aau^G^CaYn42+qia$MZQ~{e4zm_*>R*km$B&uv>GI5ON&i;wDjTl|tadsJbQ=rA0EHqfZG-@&ge$5?*WN;tp z^^$bVqmoyU9gg*J71OGCqL ztIg5l_^|OX(?`qW@vC2Cf6hQO*Te=C*qs0VeX8c}Yju3QT0B&W$-C+LT~EPz+=tNr zo5WYGn`x)*dE{%ag>6F_klN?d__09);CLeRMdj?f-DMt4#7PhByv-i&aFom0@jDy* z`js}7)?56)z&8?od;cSRb2CxLfc$%yPypqw6QLTlll=D=!6>6+rk$h7k#Je`nxS67L#fst-^K+8)>|$pjmJKw4dm zi?XV=GNaO~AS|zqx+Ef2dYwp?ew{yTPR(FJHvdZWp+<5pH)+52%@hZ1oK^@t#bLW% z_ek&|uYc-84Ae@NINQxyWU|dR-T%?N@sO1LuiT9;6GiE_=8Q=?D>P(NdOC`&wl6;o zbZV~mb*@jd=fm%D7-YNJ-ahf4i$cZa&Cr|tXs*TVtX z+i3o*eA34%xulSaUg_N~ATQgUuu1CPM!E9($tv3{V*l@E4kAC59qdX?Si&e9zxUj5 z+rTkX9Offld)tm)ue&5aA`?2Mm1eZ)jP0~Hl04!{yCYgs)&Dm=lxaUc;Kb7-NR_otd{fA_DQEOT%B&2JO+VP~n|!pqb0 zcdHT|czXN74DD}6*yJ<~&Gdmy%fnYu$%qJ$wuT;`0Y4kbE(;hFdf{p2YOSVpTPrbO3P=cR>Erg~oPN8jfDMip4b$Rb?0qAqvK)!9CCJum*&e2IN~M?#$hMK5kL z>g@@pI+S27u9vh#b6v5kI-GFpY5aYkaLwIs4PK{`hnqp8WUs4}Q!O(9(*Ru-RCHLT z&Bk9S_{2!NB1eVbpGCOr)mu8_J?=Z>7DTT*K8Yj1?O>O@m%@ZGg_o4pVC-t-T&9M7 z;r3y%S;;K+y2%5LrN$b1xK%aDxeK}mxcq+wM5US}hP1z**j}$BUJqWS;&`3s_+^nt zvs@1)KGFLMbk}6!&-}TPTZKN6%$VJoMd>0tTF7OMJZWrX?8rOr(0^fbTr<_w7p@cfW-zek7iE}iKJ?#bF?(x^XlsP zk__o2B0+zuc@}IJy7}5yh^?R;0d;<$I~^{!Qo!^AYa@EeWT8Yr70IydxVWs?J1JVAA)gepikKMeS=ucD-}l2X)<(IsTnX5Q-8 z*%WE6{nj?A%KDu6+OK&%cixt8_V8z?X?fl=fkw8J{IMWY)2HMwIe!&9E|{R`_2gVn zP02_-=aR65wpw9AaX}d3{UY=M5fEd|!}NKBg<&xv(M)a2A@7r^jY2{yTL? zo&VpY`TtFt|CgII!Bm)g3IY|r=fDW7O>hIWddQYM@1D&bZg(f&kVR3@y#}VCyF!zz9Y= z)hu+v_&{l;A!EM*>?%3B%op?=H46~UjC^5?{YBiRSWC&Bp)JUyw_w&~)Q*^nl(R&v zf>xD-gB<1nMznqIZx(T#VtWm`Bii3rpJ&GuUq1732)Hq@2>w5M#h6%!x|jA-qp`st zA@D)R6et)@UG2>^M+-j(JiR5KZkUGdE9AlzbKTV~h8|we*B}F7mD^;GS6i^DC_f8r zPty*Rt~I-*DY7o;Yz{GMBY5>mv zPoB4oiJPP0>507sZnnD60lW&X)pbUrbI%#G3FP?tFChX;O+!njy?KZj1ydAG?)Ucl z_q4(0?xm`MadmArc6;I-U+20VH%5b22Us@iHwyz%{X#q^LIILsb1rxv- z-8-TcZwDi~4Y~vF`2)|tV+&b+>kVW*UbnITn;Td>^^COcR{1dCQ|YmK+koKqIAQLAR>0xsFpg zmfSg3T<-ga4nxakMlaXrN4mQLxJC(r3954g@k}(GPDEs@ZKx=TCf|F1l(%jFX$tOQ zQY(Ma#D<+c9`%6?#qQ+{US*bKXR<+T{NktKt$O(}p~8i032&$pE0SC;)er&!enDj4 zLls>ZUdR@M!%yU67P(L=fzl2B?&*$JYytH0qK!-b_fJv@NT6?5B^>{ zr1&f7+jR6}xWrYAO%+--_=&fS5#TA@3DijZl{-@~N2vo1qJr)SD0p8O39RugZ4XEy z%$e_ye@=}%XDZwjYVYI`%w@rq`rL?1A2ypz)r2hl@`)s${A1886D!!RHw>XE+j9Ap`7Gcsvcq2jC2C>dm9Sq(vEJlPn6MpJV5V5QDtz?PO2a}DVnq+<3h zt|Sr^2FPND6VG#nC2lwY-!mr|Of(hT*Wb=sRp+V{2*t{z*iIT+2Ts!<)k`D^F-MA5 z3I0YfViXmhvyb4PO{e3aoelRY@Lu$@Vv-dDJQ$kki{1||mr@bi>GWuid%$h<+gED;-m|)pDh5b8l40<(IQ3@ixPtLtqo$R>aOZ)@ zQKlGZk?ev1lM4WAUWtIw2fbjkKVPEw5jF0<~uvW4<)rO`yl7ovxFEt{!p z+Nm}IYC>c-e;+BUtSEi(LO^YNp7?|9T;Onk05x@`w3EJr(OKXsCPtT&c4qrl)CGkP zv4gMI{BfHH1vK~vfSEO;)@gY`Cc`}*0xwo3msYZw2xZifm~nE$yV^x`Y2)|W!Wtg4 zz)$X$zxWr{){*MPjP{r@kWo0X@W)JiDIO5INLihG#&AL|>JsS*R&;_H z%?)h`G4U{iwPS^Xe14YRi-x5`QYvw2m;1+-r%rH*(MQZG|IXfO=t1WTt>Uk67H2Bs%Nsd(nXie7 z7sIsD=Qt)7%ND>|Ckn=@rchQ)#8|ldxI!|Pj3kTfE6q7k&Z-W(WmKAin_bO)?X3>H z;a|v>gofVrZ7wfCv{47IG7IKwl%1$ZLK6 zz7o0pqMkGSFS&(Jvg!~+3~eryeh;w6AQDP|IyESL3janHZ#b$M_esbvoI?zmLV$Dy z^5e)QkA=e|?X6hjsy~g9%I`wWc#$4}+o};M>3S`RHF%XIvHX=74`7}cb|+Pl6c;|> zcc1(Y_mnu!7sRuqCCE4!q2#&=u#_pPYM6GIc#{0eriuPo#8@SUBm|@fqUbXMsy#NH zy1e}a8%J=}b&4Ax{-8yfoFtMA_sRsceaDudrHk$;cQ3K*s=g?worTMPih;jKH!0yZ zVj8Lhm^6_HM)R$+afGibb+OLVD=oIbaSx!cqJsV1*wtrFj8>QHFHz)|Ah;qS}8cW2Jd8FK{mx}~jRzP(@(5AHo^NIghEDfleK;CV-T zn9_?uR^@Za)IBLqm%nTGFK843ba=^f(F};!{u+$EQ2=Mg8`CH`))a?D zh$R>46T_#-S^9F(*GmO#*%|B(+F)R=(d{kT1>8EHY|T7cgmo7igbK}a9RFVON^-JF z#|@5BiYPSQ3}Z`=QifZagLi)K%N$2ZT73Z19fQ;5rT9!hZWO4>ahOu4b2AH2$Nx3E zkw(#;%aa!Pjri1|u(efH!Wi-fmLyK_1{(OBD|fr$ui#<_o$4ItSY{2k0%WjtTD`&O z7-^TMp$;uVo8M4yHX?{TZ%xXmS&WeLYB}Q_D$$KIP;6IFbQ2s4GEswqvAA<jAUl5-npA{7F)2FpxJh*9D|qATVp)d(?8 z;)kKsJbfGNzGzqablW;PJVrNBjk3)`#Yn&Q;*r|LwWL5Kj?~CZ4}_-p>rGM0CtHD( zTWvd;trvl^!#%DdcHr!6B?eja<0t9{TPno~sp+7-J(Wmn`^f~4;;ac7wP`h}_JuL; z4X&a;96-bST4dxHHg@cSTYWGc(M}n@oK1H+(GHoDfX9m1QQZmUWMwS*?z7|pu_Ryb zf!$xegGS*qshC6`qv91eK`W3*M$A&-;p+|F;8}Y}(C!^fa5ko}sea7hp z;wyw}bZeFQInt_7>8sSnFGQH%qW~wRDZ(~%DP2uscS>6;VS9g*70$zoX7`iI)h55+ zJZrq?OQrT~^Cyenlb_dRGw)4=zpwB&*B>?%z20nhyibEc_@>v}Bfmts=0_L@i8fbE)Xh$3qzCmZUG}}AzYvTt%+;U11DY4R?vf?V3BLW58BpptKQ)N1{N z>5xQEGSibXaw?|$qN&XCghr6jXC{2CHcD z=06vUMNBtMx2bcxC0qq0%VRRSU!ysn(i3q#(t=D2KB%q{fVcbxLBO|YOU!JXC8dVx z6R`_wZeK62b#We9Sh03wbq1ArV6?)fE_)j<%t0*S)gR z#qcPJ89HIv;-}G4SRN6sQOA_$Kha8Pg>T1Q)U==3 zai$nv^4T0B@8jS>4j&{_>PxdCgtX93V# z!ovFzDU!jD+ard~bC2ne8y-UbHz{kXLA~#CwA#k3@@M7~b^~TyRZP{17BG*hhnXL1V%23L^lssVju`PiVs0%wPM;@%$aYapjaUiSta2OXn zG>t*kt;5dhN2W>axL4ApNU8j!eMd7wVo5b9(!)Fb-%N<;e@(s$6RV7}7}OxjuBIR+ zQcMiFT{A=+<-n0n(`RRe0t(!?<;6=hIaeoBUr$WWtDh{^bcaThVZn0(iJPFx?vwi@ zOK3<#W*mcsVm;UBubzysIu#syodKf~@b&DNX`OpvsS||`LU!~Zt1ie?@OkuJSL+k8 zZp7=it{OlyqWRJG>X4mNG?7k=sKYPN4z7y`TU!bYiO+PVLeA<07Euk|exh=BmM`V_ zTXE3ost`caGhxIR$UT64n({;)mK~~EyQFXI$`|N8I2yzH){~+1Nz-F+%y%Rw^cvR& zI)on!mIE9IcY4lYUW;%LGtyn$=J1Jth@~dYB&0gb=J0#_A;lpG8Bq(cy1@{(oTKD< zw`;#P+hy|=6-jG#O{z_x`5y!|O*Jh(lUz%yk#Od46nC1u-XHNqeB*6@rz z?uJ{6Wmt}n5k=&=Adm#nKacJfGiDL1Kzy6hBLhOG{wPEkeP_-QXcKQbMZ6I1M=TP2 z`u1z+XG?I@4hZeVv_dQgN4^OO9R2BjO!1YWvi3dxb}lN>ohs-%<+RjCj=+>O;{Jbu zuv-$Z{5OSUQJcV+hHD2Ya2EFJMH%PuAsk9nB5XE0PFrw(YuQacVkiBxf4LZc-}k=O zsIiSCEKKrS0UnAbP&hO78>g#|L_dEhj-^mVgs~G{?Y8#=#(>y!r8?o-p*aX%0UoYH zCuvNro;fN!j{9F85TjSDHWA~8$`b+R4RKo!Uo!W?DeH$ZBvM(X92XXDW*AXbP4Ae} zFQwjd&>`p@YpFv0B??v4mmaBpV6drGn;N;ReHgdAbv9RSF;XvUz6G7%b=zUteFn}- zia4uJ2<*j@);x9#^z{OXu^SHX>Q_Ynu04o|U;ftE)dTPtU+cET< ztyU2kC_%^Tv(En5;B~Vz7v+?@nPSrS{j`5k1}E)^2#BYD_r)m1GM&YkUmBW@odqNc z3db|zZb5ZOz^iQ?(Zw11bNVr6Y&eM`{V0Gcahy6Fg&^paziwO*X%~zSRXQMu4-Uf1 z*A-0;7uRy1=u|yhdm2gQvcw8WDEZb3iBOztJjv7b2iOFjc?c&J{8{%f(o}s5VPHjt zrXd1Z9gvylJ?D4#{Au!w=e4ZQ0~82dufih0>;CH@i3@pz2)`=YS%$EDq@J$&QLnp? z(`$&Q<7}1>$Oa1rR&Lts>M0O?`P-|YH;sr*MdXj$`rbxA&Z7w&n3Rv*d?k>!^LB8C zah((~=ODP`lG7$(CR1xaHhBA~kU0q*Qo9^vhxjY?5stg}OwT_NqnE9fjjEf$!Noxi z$>BnmQSugiW*kS;E!zQtSKb;FB4^Wv&0JLIz=X3D8^XBOxuBYWlQKc?YMaki_7B%c z3uhkM!m^CEz+8}=4dwm(atQjK!c*EMKKT$Iqno6axxV;hQddY)ER>PAl)$aVC{Whc zHzd|p*46%VgYc^2`S^>ymRae#<{V(Lp+ zj<*YywAT$Nf^uv@|6F-* zXqW;00n)bq|_F8twenKU1ub>e11XH>lFx1ip8gi5~AKOVI|R>=(Ea=|SrMaiJB zp%FJhAnRWW3i#?$TTjJdg7Nj_o`(bN`lcfKEazCAuGXoU>J(5MMQIk8S3{!fMGE@i zWGYKQVh`#Xoc9Wbs^0Bc45O0DFCupIT0H0- zj|5296d3JP7BFAhWl?YQX<8E}-)CX1Ee(|0MX}0XtsB)>XQ?%8 ztVQaO&*)LC3?$0hlh4oYmfp{x2iZ8;mD8|*(?H`O0{Wnj#^8#4>b=^nxL7Q?pTSs! zfGA^u2ngjMOuUeDIskBVIq@1-XOD)<&}8Gf6*&K^U`m!7IzmLod;94t_!T zv|OC8 z-Tjprm`5hJPMzlxbTft@JvNUd^?ZVlCk^EBCvOGAK~0#C0^O5}ll%Uh1xD@i@x;~? ze&}@(3x{VGUzRW3@6E+9~E~xIaZgduJ#wwW6lJ;wDLwg44IZilkm^$e~xx06}xWT zwR^lHYh=;iDNoPP+!ATs2?Gz`=N||M-3@JihO^x?qz~DEgA6{>#Sr*DsY)80AaQ>h zmzno3yLlx7i@3_6mlm2Ze|2}M4==t8;s}>0ilrjDh!i^)qAH865G&}3@Wi50g94L) zS9YRWtjebX-QqjgS-?MO9NBhG&_Ajl35^mGI&y>%^`k`gwrY5J{|hc3z1AD}My;6D2Y7Go@(s~qqgZ?-`d7$$f7-srd4qL~JmyXLD3ieG(ZUr-<<4ISc` zX7BzzDgj?L$XQGyEJ6PVI5&pd)q>_Oll9MLNGfM(qY}~vEbd_X0}J!4E$N|$Hd*9x zQT?)Vzz~dS)5?|d%yjuYvw;WQOYJE8Cr{I4vk7|Tb;WHVS+b#Vzm`8fl9Q?@sbBK^@qd~Mf{gP)|H9C{l0>g6NFqnBlv8~H8#uo^p?-Q*Yj4cl*024iLnZ`c zC93pQBrZfxHEM4`%h%`XjgCdw9rAw`1``#pwCNGvZo!E1xHDf^4F zx6n)&GyYAIE&WugIO4D6W>G088@Dqj?1l!^a*RC?Y7+0kk*CbSKD#91n=kf|cp*kS ze;VfZ6}RZfN>S9qh`&6XowsHR(huNx#GCsR!3 z+4TMWjTDsR(=jkGV6AsM3?Isvj5SS4wlvXOlO*Ip7UFuC@%L6)-V*-RG5sryzBw7r z3jLZYc`{b%nj?DcLLKm%bR|o=tnKBQG%2XA7}WEAf%%z{<{#zw_k4iRk^{4U+ZO^z zkJv=b(D{-5E&PQnEXjuDRbK8}QBoCL(SpwwthFC-N!wE_fk*XVql*{B&_Iqk)CeKB zi)_gWdsv%$`0y?X?;zxMrC1c6Dh0Ameg3x%z>Zo5QQ18`%FGeOB&)hSgx zzttH3IL1t}$?X~!uEN&jyCrMWQ0c8XRCMsiW5cxij^PylMIs(JM~aAPZ8yLLRrjffQne2VQ6l4U-5aV;Bt$D=?>ukr{HuwI~J^X=qm9x~BC2A#u zE$^BY{(YXY$)@WCos#3V^>)ZH5j2c#U4RHwj={Jzxl~#jNq3lcFaZHpcXHR2jRD|=!-d~>-{CM$l_*{2!yGD=WQo`ltMDSuuVu`5 zpl9m2SIGN8W|5p1YfQeAt-Ju+GJJ8^~P2Vr?S62{Cqz;p@9{nKA zCy`uOmdY7L9Gec}Y%JCB9kG~t1cR_R1_Yb>_K>k2$S9h6Q$1{f8u|8!Z^+MW?a-d$ zS;?3OG+S8+YYezlX$Zf-f#)$HF(L!et_OliH_dnBm#qoH;{0uBp+1PD31OJ$lT?|i zkV#RNkb-O@vB`sa?g^HAST^u(QJV5uoC@oa5|4DKo6Nh8qSAP@ZXzX=xBgZzQ=f(} zYTcHqN%I9WeD`8ylX>bh=5)ieMveQiy;GFAuFkg_B7E;YFmE2Rzst0T5s&7S;SRKv zeLq=3x8vP{lfzDYpJ^c~ltB#Im^g7v+bDUk4gRyKI?h(az212J4F3J3o)`lS$EV=r zbZ1h~v-Up)5Wt<8dnQ~#q|Fav>LtRSR?<WU}S;&$P^d~X?zYjh&sz@by37o{#6xc#H z>W^x-2vmUeQOVS!eW62LxzTrBV68)nZetL%-%tiT3eIXPS0uX40McqeeguoHix%IJ zd+jLvIiakwKssCq9@MfRTfocyB+L9Fw(FgwzVNU;y_RcnJlQ^C$m}0y|2IcO2wI!; zLzCaO_*Agb4&nzFaC~~>iS&EXw{a-G_Q+S~pp+BM?NL$_?T1Yw9t{m6<+O%Y1DIAT z`Q7ylm9Xwj_=6ELIat;4dE}VgCopmsGDNjZ%dN}2Vm~xYT~5-#*>9)=;LGiV^UwG& zWJS@DIaX;6PEwk26XVB!5*;dxx4v5Zic3FeY1v~~Gib5M=A}as7zeQav0$K|uuUi{Ok|N*8E=UrxDQlfz-->GTJU!lB2WExw zRncY~C_9?uZ+oSh$}HpSYrOZ6Ejq1DxOAf)KbbbkIIVRz1|r3`_Xr6Lipz=>oIvYC zOkUe{0wX>R43cuQ`xOu*Xl~D0$!-zdn?2#--t!}l2Gpkc(ggfT(=gCh3nEAirnJ>l z9p>L5uOL|s&;z4Fgv;buSAM|Uu5}^bvgX4TBr+J4eOn z9XSF~-7V79wr{KIxs&iOmUc^L3ycS*^cdvm+DwfwHH*JIV7{-0G?_Y3 zTdg5tMu?*L~rS+smf9i2S~akB0Q42%RJdVcg}P{qZ>V9YJSqOSW>%Yd2do7YW~kJpy6IEuC0?#KFk^->7pl_ds2 z_BtRY@!Ym@0RGe|9TUO3Z`XSC<87wSZk<=njKz zDqu{J_eKGl)d_3RXbY&Q`UwD!A!i0`xpXNAvFh8V^-lX;c%tAq70tCY#kwmx)ySh0 zlBEJdPQ6T8GSBgctRA=sghH5yAoOEF8w)PIs7N>UHum)U<5#8(MSQTQOl8~?ty^K+ zZy;D0lJ!{P^Xm^B*0JXBqbZ=>-Bq(tA#TH87X&}B)*3jT%xujanu0`2U?VCh2w*$% zpOZ*JK!3Bw!+wIW4W6}&m2+*J@}q)kGsFmbh}r*I&kzOa=aA?$+`SlQ6e+kFaw79_eTQsC{h?2ZsoI1gV@!bu!V97}^z zM`Rdvkuc1EY}y8*e4eMUnjxZ+ODD+vbQ+TT1TBuL3d?iRxzHv>(g9RK^#7vjor5Ed zx_<3wV%xTDCllMo#J0_eZQHhO+qNc7I?l=SKJWQXeO>$SuCA`T_PzI7>-t@o|BPT| zTf%rtG%6E=n%dmsInnMlZ6pIAeTcHL@7_xh)QhMbJl@Un&9K1nZip8GtG%}_)pY($ zR&7>?Op8=Mc8~GgIPa}+hgxLfwRwDCefwaTy7&!`2dwktp%?~~^Hq2>hX_$ROK5tc z3+()aWqh_R9YR73W;*;FGT#s!1u4r z#td|g#Yhgo*wUkmo@7!#>!y0o*4Zh8hJlhwlRf?Y zUBpiAM(W1Ghx()Ek(|Ih*d^5AEzEBQ^R&SU7lq+gZR6oyNB(917#B1vod9A3-AjsO z;Pf(*!#J;1e3{>r@|>20Hn=Ct`66RXA1gX5eadtEZ#%DXWDRXXKR}oiNaeka$!<}lmewhKobWs#I`lcz)o>`ZzpQAh?ZuP63=g8LJk>^B&xMqn(f&?oBe zCM)yNoNH-PJ-J`DoP-0?WC>-XC3wd=biI9w>f@*YY!+K832(v2K~l$rb>e&-U}dmb z+ID69IJ0T&{km=iS$ie*TpmW4;8W|M|%Lz}k-xJ^+Jln#a-nQ5)t(;gva{NsH?i5m?nGVdznvyUjchD63*Nx-h@$`sHVoW>S7|w;IV(zzG5G!(fgS%;@j^ z3-grE%)sMQyKXSyLcD1p;eyp9h^%!b@K?%;Gt9F(Fr!v$PlA8V{IgUb+W5Ux1(LNJ z*<3)ghHm&8UJ{rVIr^okLovs4RJUqybw_-Qno=f?tp!}d4gdKJg$!r|j=fMk;RBuFrWo4}a(#!zBAtr}RB}B*+r2k#@Uv1?doU54O$jf;U{PsX!UD z!U9mZ+yz+Z3_A@-=n!)SP$}x& zRG_qaKT+Fg4ETS%ha<~N)Q#^}cW)y&2>P$2L|F>wlOrN~^Tt)@(4nQNSGF)Y7S5=- zgSMN{QW&}Bt?P#5tdoAB+tq5m)9BT5UDD`P{@$d~+g7hzorEN{3X?d$m_M@WC(-A~ zq&#jyKVFWiIn*?Q`1?NcI5hL;yJAL< zlm;4!XVvu5e~F?(133qd*;yfzZ9AD5i%S&M6AXvhd~HYX;TQ04lfz#K>t*M(AAM~0 zE)EZuZnY~>m_u0evY0XejdjO<6=4ny#^i=|s%ETY7uLTciND~U;) zW|=q$i)LxWNH)XdyrmdI{o_|t!_@8;ICjL+@1$dyJOR!jPawz`GX;MJto1EJ_uuw^ z<7g!)p0%f9|Hw5(s>6#C?rXo>(0wu~*@iYl)3&9gsoiLWkp=r!A;15%c7BKWQ{a%K zzXU1ED5nSNaXMN@zsf&F_jf5)QKH!ic10zis z#|n?Wu!2GW^;mp>F&fCncwN^1^El2j`JX9KxWZ?d%>}8y<$?{6nw&6%bi!Cz1 zr-CGs`rZZsaTPyFu5CE)l53lXbYHJe=w+)I)o&9V~ z&*(vfn-KkYSvsxc05mIFph^R`{w%TzLCa$cRPVTT)9oW)L7*wJD0)-1W`AOnNMtWY z3Gac4<#GD@lV>l8*8qeqfPhh)Div9iH;-h9uwo>@Zqt|Wa|7S6akm=%;&{IHMAPVdjN_%FE77=R?s+^;*A*5!F=H{n9#WFJy=1mhl44zz`An;0 z{kfIx_;zsy{pjoR5MQEZ*Fquq58+xdUt}&hM1%FZ4st$?U#D~q-}HYXQ>00_Yqz}Y zUKeZtcP4-@C`cb1vaGi44)Rf4;M&&#CZ2* z%1})o|9<)$w!nET1P8l@kcT_n9=P-ShJA-4`B5Z*rSWBeZRcsPPS2ys;llRF`r*%> z0XMAo_q{3c#v^eJ)T?nVL0UVCYG{m1qy8|Bquzd& zK=%XTBgbApT2i|$0bt;PyM-8O^+fo+8J>}4aurfI#?&OLWdVtgcGIiFBrS%5q|c1P zq-Fz_a^N>rQO3yg6$jmXG1m;)Ee4!GSg ziTseL6dK3)=MdxwbcnOxygB-+y!h#GcAg421_A{x9&mU15)moKqq8`TyX^dJS36$0 zokdn;-$bBw1|w)J>(P@4Tc!GRez*4LP(-#d~?~`C84T7y-w(@=>>gl27 z<7@qJ_Vu_!@urwRJ^#IyhGA&LxzpxGK*z_=K&~|GCYN$@F;zJ70dZXoNM*=DjvE0# z+kk6hip0`fm(8|=lN!KQzy9M0UY>Wh(caqQa--W9P0z=7JmXjYwk`=6aAUy&CGfK_5O5v@zJ7*oER4aT$t^mfI3Xnt%a#(c2OBADF6zY0 zV_NgrZ_VJSegAeSUi>`eT`JM-u_-wp&3EU~BM7O$N4+<^{we6!RaxKT-M3Iw7c=xf zUq^0^j)Bj>$N$^1g92mwRlH3In%&r!s-cSa8`da--|RmzMvVllps2^mk_NKMOAux6 zraykK%AZPGrGEEKPF5k*!j?EfD!N_O%omlt0Rna|zs~FP>~%T8f4odvS%Xg2c*=)w zv{%4sba97l{~xs$H~zoW8pPD@|59robn(^}G_wL)S~{eA$u6c~0$iaTR35clX=LT8 z2u8ls8x2|tt=b@xJR;bPBFZweLs7RE*IT6O+e6426xG>=F~ucB(NSx{->*)X4RDOb zj_3ujCK2UW7+Ik;r-A=_0|AY<`z5eZi z7HUe3bkjan)!u4(-qj#^F0LfG@>+~fF)OH0^h8eiAeBh+WoVps#(CX!#T$uVcfSkP zK=vZleRIc444+ZMXoJZSPK>0lv2)RM#~T$!rAM3E0NJgapadwI8?p-uiHyh$*7rsI z|LL`?peG|Ngu5l5s9i$gt_B*A#UvYd@o+o|JwLuoO;<*?Q? zbEK;-Ej1t|i>_oKu&<=O!#j2Sh$PR6577$WV*4h=nh^_iydQ{4gCaC5iNxo2G%a7&B z(BPwRfA)&SWdx~<%6r@NQG#>D{%pyaHd&day1;=FAiKrjw}IY)I!(+Ph-j;vZ< zh1Xl4*9{3THgDB&FCXZZF7G})Z6CPTW8K}+$;2SRP(M~bxT!4dC0|Q$ZF4bQ6f6;= zvH=h(C^oEt<3?Nk-j9MhN|RGVf-D=CCHqq7#Wu8v0$$BO#|mrr&lc=2@ww9C!NOUab?4z7vq4-z^L{TQ)d z_P{vS6@S5Swo#OHPH66??9B2e$M!JHe3Kh0FwSW25A|3mo-BbNVlTyfNk0lEj6SzQ zsO~(M^zbOdnv~dtrUO%-+moo*g%tYsubIEK8F@NR2RHxVX#)N}#A&MF>Xi(zfe}q$ z%dykD=WAbm_v)4QP^fu0&BuftCru$cG=wC@+aQCL_5$BXh8O$r+uPBae~MJDYFVKN zJx9Mw^sKhT*N!jf_L?f~w<4ZTOA|t?DNK^Kky||{tkA$k~e3z}F ziibnTLiW3*tmVY7kh11{cSU9O(T--q>dj5|!s@G)4%+&VwyIHVozV-YNSS@+OUx?y zbvMk@gk4+KBISIS+&E&P9PM`}ErCJ*n~5%U<2<%$Hv~?N5xX;y%A_^AZ$)o)Ep84T z$-ji5%SY2eX% zpS~@s6X>+?(WJPCetw|*uF|vp>c+1_BuUs}*IPfadx5nSKY?%tEP)JrDpR7i5h@K7 zV3Wy2N&z*#{b2nYKVaH}5Nh-P1Jf%17ff>rG+4x!-7&pd)G89dZ|y~gitPPk$c%0= zt65CzTh7Q8sn0xQrALRaSw{Ta=20J^-)Yhx(I*@nt`@u^x*t&;*CMds?x)iQL+x+? z*8KK~7SXm8Gt@wFRw`T`e2y*3Gl)AW0Udzp_r)M%FOkRe3crrmZ6S)+^S^+auj2m& z)LK+psq3i@)Lrra0j=7KcCY@tW`J0735kxxJJpGXwd$YbNEPQpl1)|L* zhNvpSOdd{-N8QxTdLI>ldkHlv+QC7mMzXaZ)p-rgc|m#-zC`SPH3-lH8H;54wR||^ zlXFPQK>KQZKE^l-FF3)vLCf&oWINXGmOx(T@b)nG0+A=)WIM}$wl&@6`!&RapD^*d zvIe#s!t^B-m!4t`QZl6(SGrG;ZBku0KnXH!w%=c+^SIf)9+W4)j!MH2C}hfRZH}bd zIgoMZ@1RbI26|idQq+jY+xWKm0#{F(0{r7leW!>W?SzSTi&yIR$Ggo*!C@APIe}|v1N|MwUIj6*# zRj-=-iG{kPR+WPU$a0`rp=>$H$S_#Xito|XGKf|DHr#$QmG$4q0$}q_JXAVa>o%_fi?5Yku6fos_6~Avh-I}=QIZWCf9-;m zYlhRLG8eK8(1>`kmLC)Og5diWQZ$^Xv)IxFG`)99Ozk=nQ&+Fmz(5{S)8?a)EN=(C z4P{iu>db~LEw2KOv0lEumvsr6-fY2b^a6D2dXexz9{MzuGp$Y#Rfu8~VNc{(ee-&| zqJ^D^E)GXZuGM9E1HN4J{?W3p{$g!B5*~I4wt3JN)-n0&cDiE5v_7YkjDB87((1Iq zB_&d!2wh>*=XUUDu~g*7gD49*!RKaL2OU31R7I7ihb_F49|ZrmHWSHmg1?7I1Y_jgjsD;x2qfT&xQYDA(kVITMbhI2)8Q*Xa3qAb@E zw&j_5WuCBh6Q6_t?n>2Z{aljE(qr0ls_bfGi*5t{TMiq~9Cn@AG&jMt#|rCce@&zC zyuo~37wLTP;6K+4ReyG(Q&w3S99&Z}cN^zaGzy-Ky`{O_m0zA#PD!e2K~Z6Qp-6;^@5#2L`l=%fJ&)OyGp6Zq*UX_&)Ys^3D-PJq?V~3#8ag} z?jI+LXx@AR`F=v+dL?=nj!I7VSmV0^N&A5cv!y{334_%C#7Et_NcBB}v#8=iH;EzLm2k?NsgAp1@^I zo7?Z#-!{3R)y^H_R~@(c5ey{gIeTGA$UQG~uxT0^?dL?wfwuibYkO8lwVVaAo=ae7 zLU`E3m32y~zx1PlA9iQH7cRe6Db4Z*2E;!a2fy1VXGc97VT|1=fG?G)wv7)jC0{I% zFoTyNi;zdq=s&C7+~)gn{JXs>;<|6H&54PBG8rrIj-l zdNe)dB)N3<5B?TlR)vvREspYm`6q4TDRl#F!!L)a{bD^cXpy6M3}?^Gd@X?+4LP~i zXPlozXG#ibjs87Zo??~$XW!n5)87NX4t@n)%;%RnHm9mld?)u0en@xrfe|XCfUjK7 z+v9rwWlEL)mj0mDpVZgo!43#0qu;!SE-JYUiY9tcW;6Y4fj1O&_Dto_^) z&8hY4n!^sD_)wQ&8zWof=7)&EN)9NF?7+u4?Qe=C0rbl}!P=pS5@(N}z*%?=LzJ~4 z!$|u;79l;5g0EI3msWC_iDtARcGQAKboH9#$oHH`_OiLqYUKU-%KHxin%9*Xu-n8Y ztS|_)=z3|*#IKYzxbWD*ljl?o@?>{8zqT+&GjlRBXRKoiYvOBR12AA9hkci`S@?0E zAL(TkTJvk@`g~DE{G-1SL;h_(@<~i1T8O!c0jXF78mbYu1Q%3}+6k!NYbf+%|9pH=z0PG$c3mhglsU-}gBc9D zUeZW)3<?r2z+&aK|+?tJv#>2|q3 z9j-+5X7H*~z>#F8!)_ao!?!>(rh;uCACx#qPeY`XXep^e>DUBrYXk#(P0$p(V}D zy3MaI^ESgbJSraPmX592Gs1Cmwyc0r#2?8Ph9F-B- zwRch7A>B?oNXZX*PG<`D4TAdjwhZ#ycx$ScSu!{cjic91J_j~G;><+!tRS11eJ01S z3#`V+e}Svud6@U&J`qq}h~^W-6q_(nMyepfn(3~vEA5KaOr7z3S*00KCS;vE#wfag z(JgJ2GJn3Dazv;wWbSFJZV^6Mj_lt4S(l0^uuNa@=^rV zzoox5x_{ea-X>iZbbD^`J@@&V?si1J{rQC$a6Q=P;!g=iI7&reGiM5=R;lG`7&ZGw z5$hJGo{B|s%}t8_!J0nHpL*U8R2D(iCfMSQ7ithSPlALGM)u)1n?|A1>M)*|kjE=@ z_aWu{Tm^c{#B5!jWex{>S!;f>a`f^GP#3L1W+d~erko^CbkTbu5$Q+DdDpO3jG86 z+Nm6u8~U0oUw@nu-`}kCQ0@RE!27cNsd{h6{Q&EJh{XWRV(=Wfej0gIBEB3i7053F zhNfpN2RvVG*5k{vIp6$#+ufO0aYc}C?53XYd&?a^wQO4)w6>qY7@I~*c8y=}!pqpd zm)`*Ekp}&)r?W2+Pk{AUb4>GinkRZi6)BK~M3%C&+ILUGKxi4hFp!-w4V&8nyIP#{AT*AYV} zSJ#)B;@(VvL9e(7RImY67*RYDAtQ%tuSJ-+uO2nHRPLab@`PzcS1*r_zluViWhS=K zEgB%#QR=Is^`elJSE$1L+KYr%Fi6Re^XoTOxZmG*kxf&IeGp00@OP$nI(~eii0}Rf z3}nE6*Bm*A#(`d$s=x8Kv;;g|4}**yC8D&HB~I>+M!FyDUmlX?A6~DIL(GOt|Mx>_ zjv4eM#h%vL%puDj`|(EDVY9f`DAMtnjJipfnE!3jfEq2Ma#|pOZar-1$5_x|EX5*87H0QV<`l~Y;W!y#M%=n0kS8JJ_}TPi z8mY-&7R=xHL;q9;{wevmgG<*DhG83^Qztp1==IGRiYoIdQqT*SEga}$r{-u591ZCH z6)iNG7n;o?NnZnkbyt=Otw)T;a3Tf6HUGH59CS-4%ZtVlXdkUrn#2$ND8hak9F`F5 zECELd`RDvZyz^V{)FCr2S!=DegdOzC4OF;eqa<0Da;G}ik4h#}?L(hN)TuH*c9p@N ziB9mrq+Q_C=5_p6#(w7i1jGm)NWzf}`As8^It;l{$;dGYOZ1OS^zV!QRYVwkBjOtbfJUhj17k(K!5`=VK4_m4kZ&twNy)ywm1x4$=+ z`W39%Rc&|ha|n-WTsba`TwGxR1}InwihH&BzLp>ZXr#55ksjW-J1q``Nr04!JsJ|$ z*0#x4kmTWY4Dj_2`(=$Wz6p(?)z~L75SrH6ndrD6y(X`==k|~XY!&>xLc>`^SKs!) zOO`REsX#QsScS&eshXoB?}SS~FP#ZiSZDf$6EUbyewL}q)YXnmu!{XN;&w7&6*Enx zxvvwWr@%0;=2m0uT@ODOVxs0uf{8JwdDn zHu+*tL<9C}^ra)boGNGQWqg%pHb;>J53thO$}}GI0*Q2i;3k{Y`OiZGo+{~q!Aij1?DF5DZ*MK8EC{ z%8ofOeG=l8=`s5ua;Cmf3?IMWWpC%Tztb{b;g12#?>Br8a2YdxtVG^(D;Pm&#ktY% zOGUYeA1pc@YSFdAx7J*Pb}w)mI|+JKwWXZsCMiGKK7^8`C4u91c_%-COhwP3<@>~x zk@-xz+be!Jit)O@JtiF#TaW`V|9NQQelIcaG&-=J6MfGQeVaD_>Q6t9XoSqxgU+bS?Ezl20)orr1X{y{7$nPks zo{00)F$qNo&K0ryG91l$yzb^<-h{_pt3U0V>GopYbhFv)^&9o&jdKegtQZ8Y)F(Rb z&iI#;Lj0W1mjf?P$x!*}UhKA6sR5k7{`0PhbW*7|Tx{&Ku)iQ>D(Pze_{UYzr_u8; zet%r68e4NL4MhyfU_LqSDB` za>JlmG|Ru-c>2YGkuRMG4cCX=!Cdwc>ys#um;T#MtodmL`aDyi6)aqb)wsF8tr`3H z{d(jQO@9z}`R^hTHq-2C+J>E%&B=Bac!6p5m~U`nE52vi3N6HJ{h5`iZQM8v$q}LJ zxn??9dvlr&-p@eZ@0r#9BRHiS1vyzmhtye7CN%^v8b|qi)uc*@t-c25F2vneWcY#M z=V*xb<5m zN1HBgvzb$;I{d7q|0fE&Wcf<;I`@XHCad&~v<~;{gfkWsM3^74P1M+H3*zlROgmz- zEaE1Xe<0IMawqE0e@!M!L@gAo+>l`EZO?t(E;VH4Ae7 zjmL`~w_BZli5H&E^`7;-@pKWy%$bz+d~4rS%e*Ab z0=;W|i#s-+EjcZ~DI`I*+8j21?v&C;q6(`eIH;EE$#(3nGilU2p`3l7h^%^%?8=#Z znUyKCI4xI=!=76aEeqsT?D@AmrHen$aDQ_XWS9M#Xxr@Wev*mOw>NCMDwo4hnF*F*UB(v zyy*QMLGS4;v@EXv*@mMVxty!OnmPFhCcEK*H$Nx2CuC>iv3djh}nYKt;Qc6x=lhSg{CD`7x)Ibv8>ZW z{dnWkT4tvT`kQ#$w2w^Jul>NL6=*8odDMvF_*IM#nuK0^h#lpcp$7qMmwMN1K@Vbg zq$IgmrrnkJcgk3qgAOWXKRKNUPKgk##;ejZpw8A4SwWX$wPS6!skfcDZ!WGHfoKC% zgofJKs&ZORx$+`Rf!0DwJpW%Lc@t_&$NVFzmd)g43K}MM(TVDaDM#sgKm1#+qujfZ4|@-CvGTiC;~ z;RGnGsDqJNXl|s8zFQE6(wJE~%MGOz9fe}a=qePLw1*`l@8BzgTX003-moiM4JrDu zHVFV>o%PpN#OgUfD|PW>u&jLCuveKpV_NIa3sNAXl{w*>28DeWutKKcKr{)QKV75F zS3d?zG$bdHQ|4xd_d*4Wb8%d z!AyrrOsC)c(_*ej8Tb(r#o+N!@S4r9Xc#yNmsyM$DP6>iWKCCI)wI2c#af(?2ZKt} z+*!H+MU8m96P$#QnqbEWSs=MUGI#f^tx?fv?!957Ki*p-{T!e_(vsk3laRmHMy!3H z$kTX1Kj3KT>RyyR_H5`N%BgN}^tb=H^tTsNfAA>BKC&M8td!Z<>-;$fm>;2=1k_Y& zmBU{#O$3SYD6^cbkTM~oobp5tN0)i-=%k`oJ@fdeD;tu~zcU&{Xb| zY2Gq)E+JqX%A!O>P0xf+$5!B?#^W)3{NkHh7;;j%@w5RG*QloXr|s zUHDG_+-h_LC>8nV9v`TV-Ae{eT*{sq&DRb(G~HBpcLMOnt~TmVpFr)5;k)iuVm|Gz zwx23qb7Y(f(eM4koxxEcvOTK?y><=TB43jkuaUYy13{BzbxPO3lJ2IcntLeYl5yve z=b*8wM#JCH(#K7y3B`+++9iK2O}4x?GA$K+_DIwaI~_3t{kgHqAsAtGkQ>^Byg)~+ zfYPo&v0+e$PBY3Hk|j#maz)lJy}L^w9BNFVL~m5qeEZer_U3}|fi@-xvWs-v56l5x z0l}co0>KB8SEmF8DApT*dZCbMcq4o!cEUI?*v*s$C7J@6bVo0R;;rkOT+AY0epnM0KWo`PNxEy=&z< z#)zv_SgEl%bbek$4oVnhwXZOW{ro?MaxI-TB;|E1*8RM45V}x=bjyAykSu($=KG|e z4=QnGxlaYG!HHC)o{eMZG}k^sU_H$>A3~ZCi}6(H%!{(4I77i=VLUv?3R8Bq4P~V? zlSSK<7NFf%Hg@asE@th}3s}(vCq-x`M6C24=Lq6ugWV1yG~OF6coq|e#OaZizf;xD z1?myvv6AnpI;Lq?K=QDQ&vlC|j1oMFfC=S_d}=jETfK%1zLtf}S-A zZh`^&A-FeJ+T@9J=qy;fTM!7b26xxP>wiQk@P|b=(xo1mYdJwjzQ&{0($u0VCDt#NWGB513r#KzlT=z zTz{RN3u)ia#M2d@|A~*rK_GPOtcx0VM|LSaD1S7-t|bD!$Buhcv1Lr1HG9wYrsrTh zlY#w7$4k(p$dm=h+Tu-^zJn)}JJY8R$jCzd8$F_C3-jWZ`4X|%Y5=nSe<-iTU_Hp2 zYz0TUoZY`mS#k({Nw~*9Uyc_hYWu3*kZ^^di*mS6`*dDl{g4Wdm-f_bTR(m6@qAJa z{&fMqJ8=qi0;y)&l?$$Q>&h8z0eP-$=Vts?(Vw8Nr z9o-w@qJLoc4qJ{iXL{*&m_~Qq3BPcP-&zW6jhg5aQZt2nL{0&2j=F8cCOy=@fds&(P_cUisYumQdHns#P;-= zv!&U_`o!@v^69P5Lr_jOtkFpi=$t3_U0_Tm;_v-MVOvlc`rr-d`E(bG zF^C~Fw7q16To?O$2>DIX2w~KxaTpXQ)jHK23${WoUZ$h})Un=4(Fmxv@zUFkUvG6< zQvt_c5HE@Dr;uaH1RaT@Q5dgvNK*3xY5}dUX=4nf*9H{_!Qn(+@aL2AY#b_0R+YvfzTFg$e2DIbtr@Hbz*tj%cghFrq+364=5 zfc(eLkfhdX`8R0i?GUPREoXur?4wIFuW%q*$;1iL_fI;6xyM~%a_0}vm}RBYvB~L$ zyIje_2jAW&bl+};fPLCq!USw?5RgnD;B1C*r^8yqmuEP53*GrY^Ois}%p}xvoo2bjQuS}4BdW}DQU~bz}xR~n#N*K?K_hS zgvR36{2k-k#Z`iII-7!OTb{-ZFjND>#d2eB$Y8i}cj|MeI^LU?N|xd)6{6N3qxbV^ z1JFseBcFEmIxuk@?-%yruVeuK;AVkLtrocTB<=({1sACx#=gr{+LWrBvvZefiPWE$ zbjD$oYA7{UHqNU(G$dnEZzi^V%a~$DOFW$6R?05WIVQ1UFD0YBuQ>d~h?36&OCeTT zBo}IeaRZRBn;cd?^hguoDYWpd^-F^CSH*;soRMDj%3`{KTIz z`9_{(wl%?|9=s&rTX1wMPOuof;J#83M9KFI24*)3ez>_}+ub5TcRfupIh~U{wDA!y z@y1am6xsBO#OC!l8a5qDL^Z(XBjpsV%|Y3px%*;xIe5}C^^}CJ4MWzjiF}Bw2l7ko z!>zs2E-v#7nf*-yBINo;Gp}jw8>;M0jMQ{ZH8cL}rKSW*k!bgXl$EpQ(G6@Zw{bm5 z3Wt~<0Z}C*q*i3nFlkuvW^O1yg*>PGT zXmAbuy|`u`ZK@A$gSwAXJAny3#p};TLv?EzNV{qxQY{ftv~*i33-XvMnxSg4VeuuRD$90vFCr}j^w6A$S2pazk*0N zi4`XU%u6%+QT=}8j(^76c*PG4lghZ!ARcz9!tHJ%V6vk~pv{d@#7<>H;nysJ_jTeD zp__%_cu6DN3Eht~^FZ~o)yn*b2YZTjYqb`d+5rENJ3e?tkxLqO;?r9iG%|*DynA{b z27$`%E@o^P88*MTT3NKF`#Uu~94EghZ*O`N)}c#}DDJUWVMNHuY=M<< zYNf}4ZmIN~lWZ-RpAp$sV#xggv25{;+w`~HQ>1qx3eGt5f$fZ7qP4ORY2ANR&2M)9 zf;L#`3fEpA;46XK+IoiD?V7gPX1Cs2X|buyTv4)<%?iN!V3@{%?1tW+#TkPWCBq@_ z5GCng=?%(^-G3OknmC8}UMd1dc%NbcJRXibv7nOP|1r7@X?`K+W)o(3hsLmbHp-do zQp@2&8-+u)NFyXN&6*(d4=fED?Z0*5U_U4ym62SUMD${4xjsrzwsE?xJ&~Z*%-NNC9QsTi3_Xg)?nUiEg5$wRzLX*BDbXb4|8t#%#opI5J>vTzq2A0uNs;k>;waM^t^jeW7;zlaVX+ z9BASDKSIZjdzzlJl)ha}*n76Wp0^HEXcr;YNLdLUIJ5lZ(q*oYx#oBdM)9D3p+!}b ze#e~|dO|a)GwGozVf<5g?$INDAK0FTjo8;RomBzZrfih5A+Ag zp_%lPm9Y+H2Hm))TZV)aKaCEGTRU*yyy#)EW-R~cBIAm$t1vZcyupG%cnBb_azh5< z)}Uj7zj;Nz1xF8vc6pZiEjrK}E}6)M32TDn@LEUL3u0hJ5NajFv&v4IOoqk75klJu zv7h9JXhUyg=6S)R*zN)gZQKj-_wr;AhpweUP=Iib>RdXYj1mf%c#e%HmSLyz75^T+ zb44B=mnN@aHM28Gj$*=<_Qvjv%qzDNBbmcOk78{b0Z&?z;3>2DT=qFN`PXj=5HsV) zTB+ijt<7U>cBQxALkwz3X%(xGS6@B6Hb%GqhvV7b3=M0F!0Cdzb~2zjPcqxmc}CR8!G{?^_3Z0*LkmL;V80#hRkp{C-mM7eY_Qiqz6D~y=22?ziPTn?w6l8o!=hC^h> z3P@d+)5DJ1M-$P$p}3`kfV6hn`bKF6m(S>{B%AB!viY$Gw@>;yvnLbO{ZT$4nH&9rZaze%6Ll6 zzsDjQ4^5$SAwJO9*0ahHQ8?co=#e?IwU0;U1J0$8>>hsrwOPw`e>xj5)^-dFtZ#zMf{UZ~UlCp$k@FM@WQU0kMqD45;Dj zykEvV$sh%U7Zes0_b6WBcIzHbS_RyylY+Rs%hnMK!UH>_)7`9VROwX^JkjM5@xy`@ z(VzF>)La9FX6}NPm^AGZ-?KZJuB{HZFoV0FG;c(+50I=vkT|O-FIw>?+5pSQV$z$3 z`+GqV14<$gU`!aYk=2`g44amgp#-|k20AW1RGyNfwfJU*6~FdW(1hF%jGY=iQ%^it zK?T?A<*XrM(~io{%Kv;SDhrN41iO@F3fW@;YGkZB)t!=Z2&3Z&32{XyKR}UL;4AndwKLdAiB$HC#;suTF)BxC&|kpKJv+1actT8l z;nx5$*?$*iZ!%L#A^ft#N0#Kzz6=k52MNSv%-P6i5l-M}TN7hd+>512*QHW%@i2GW0rS{U*s9;1KHT- zLw8VB81S_C1juMWEtO;1!Snae@QmKxHLF52;!^**@QMQmMWg$dS>hSLK_ScexI)n# zS959kVgxH@Zym7ve*noqHoqYSXLmWU6U2(kw;IlDna6;Q1;yEZtVI`l8^u(F)%5^7PlQ@DK8rT2mseV5<~yG}`Svu~e!- z04r6Pi6+y?@|fetUG#@mqsim4G);{bhDLN*RT7Rn~ah=L@Wo{VS4vpot!q;vkUPxuw2su4sy z^#pI-gJiIzT9wLn!WS#AFe^UwQuEFUf!k7BA%5lWl7W>GN~e?A37dR7*d^KUS!&2V^v;ZSu%5>+ZRemh7% zxs3i1okXn{+HHW>Qq0^*vRme)J=0tev#*@6eVPlAtvn~~^)|Z##!TaAX|9_IkbpMMlobyyQFzSeQi&uQVTwk-X0cwaRRd$2 zQn^uZ%1)h1Jfyx|`mjKt#uF^5;zn`g&vf}Cz{x+1> z%~#ZXruhXJk4DN~F6NX{i!)W3u(0D&^|*4VQQ`2eD{+5;7ldu)qyvgChb98R&araU zJNNF|*)w<}@dJIW-+luEeciv}efv5t6NL|m@!Eykt9S3W zuby5%`QzG+zpXyF^V7;1?~B*bjwXhNcBFq=IlKPo!}S~IR_~u`|L`qKA9e0s0iX!j zJNezl<>mI}d)`|&r6H#hXzj}f8}EMxaL&GsKf?eH`zMyehz<_JJ!~AvVdR}Ux$(wL z@4_dL2Y>`ad*L@bQ0xUb$+*Rl-vCoZJ`RU4))ew$(b86~L)L){pLk!s4Sm2fOiXO_ z@a?T!-#C4-edWyBh12a1zu)-!s)D+70(qZ)?7ejz55&8Ds(t;mcluQOAIod6-}63x zTa`-IAO5X<{!@sRW*b%e?9#@yB}j;H_YLp%Wr%He;je4gU&C)`E(4|gOaiUm{k(mB zW#hGfct3t+g$ocbzTUoan}@n8X9pu)OK!L9w4EF-7+Yb z2@=0&xC(= zHFf$GCDp#b`8rfnl=uP@vsBbahJ{irKDV!YPrV>kAAIheySH)jJR;|jKqv;O5P{H# zK-&CvNryxRKZXAU-Ppl6dhzM!W{8Fh|J;`%ubwlgqVUkS<8e&+)zV_Uw56VcY4In^ zkEv=i1>L%Qeiy6`Nn&hBfxRJgh6D|c-jux_s%vb`_SHYzXxX;ldX`!fumCutyem;-?BA1c!uOg$U#h=IMr(kYB z#7{x1V2QS_2LZ=NNPJSo$H}#`_u9*Ed7r(xcIlKrEEYI4RJa&SoSnM)f}lJLvix27 ze)aBGY}&lfzVvQigb4KyPkGB9Z@hofyZ9CJo>k+HpF8a>e*g#OH*R`oPFd(JqOF+R z_EQ>&66tdgnL7vFy7I`o|3&+w&)04&Be)aA*>lc3TF1|~OlF;KSe6inDw|l3bg}rDV?l8dLI*sjxl@6 z1c}xa6tI>l%>&Q_i_N3BQSaN=A>*v@(;8OG>AoO_u=#>Z4=hZ~mdV9cbyaSYt8N>j zp%pYUm9WNb0}2h-OcB_D{viPZvF#DeNu;rU>vQ}PmO*$65vEATCtX~0(n1IT9(iFJ z0NE;sx^L0Gcj7=8J?rP*@IL&WEJTuy;^ZB=G$VI;bJX~Feb6-dst>+sMzMG+ebEd; zVm5*P?RRKx4 zguF`^i~>3GZhuQsw18w`Uv`5?GOCtF%6tP<$ZQqIPC8=x=`T9Y$>EvL8Sz0jG~q(Fddz|g9SNqa~!STxViS( zeYW*|`t|DFPra2lFfKr+*1mAoFjms$`qH`8yKlAc-lg_8K$T)EgA)L;qtO&kxY;CY z4AMSbc^~@m*_Z3ze*#7=`ku6{HaF*lJAnw~(xvMete>y2EW3L50-atwS#-5gbxY_G z1g~~tbEJVo=(3tZG0_DH_u|A*Tw0{0z`meFp;fmW*q+}(O}&uy{Pb7eTQ|LXkJ^9! z1|*ELOREn)r%>@)6##dJOB{KqH=xbgj%}2o9(PJ#eX}T~Gdkf@$e&)KStM~8lJKV# z21kTsn(CO)Wy$?WCrg@9Z2iY~FhSy0+vk^QPk;rItF5Uql3#3={UG_$E>JX6v#K8O zjZl_!b);!}N9fyEiV#U`+=kfz4Dc>-AN>HBo=(~C(q25vk!NZ&~>UhF{a zQYTGX@~+&aEsb<=lQy71b`E3gVfF5L=G^t@viIS~?eA{3-@NF(^eqaflfd=A29o~fq>L%xJ7b>;jjjun8^<(uq1+t zRhPs~S$XiW8?rZ~=`;yrbRk|?Cn^C6AbhLSO zdZeJ!L<8DU@yX)6N6-uUQxj1$7<96qin0J|&k8P%o?k)a$<;@2C1dUNdnh%9yffG7 zH4J?6@E)B6e-3=2rrIKM0Hn$}ge#f=QeCWmYy{odk8i^r15sP{;PZ{w9(gCvQ;k#m z{3qUL-zwJ)#EVv)Rc59xZ9mNl-A#J-ncFRP~71 z+pZNPLYjmgRV0F(&_PKev?Z!ZEM>GMk;Y5n(k-x9llw&*UtgufO1(SG%YgnR-^@!P z(7Z4Im1+G5Fs7$rVGfqr zQI`NJ_50_FwUxiNFaKyeUVGPW(jzQ8Mw~rQ9C!eZcl(sg4*%9OJ&$m?l}CH@o8Rx+m)^U7SANH#L%a4J z+P!~Ye&?PY2M_K#2x8AuL33!pg{^2{N{^5vk>>zmRU`f=gbMP&{lMH@Mlz+c$Y!$f z(Igt4NR7ouNA*CB2gx$QfBd>!Egb8Z)dEvj-7N+Qe8r3y1r|3)3s9e9Axxyod99c^ z4TD?>0?13@DfEChWGNVuJ4jQnCep#8%EhlXUOVqyIMsf%)c*c~p9UB+s`aIF8y|g) zB?Qocin)PFFFU01(N#8~!QuVbq%`jvkMDuE7RZXj1DmoIXc!*5I=}}Oo&?9wu^NEY z!hH`SEH*;^SG+PyO*|@^mw?0rQ94)QIfpeM5(p%uP%x_nv^cO(AJ{>_X`c5r#u99r zb>Q{g%Ils~3smk~qQ(#^VBpU(pnvcH;x`6{E_ z&Jy6X=Jl3oYv!aw!rUDPw*fdNI|Y)K&(9o$M3E{4&&nIvxD2_MtP@8P$>vom3=(9U z&=3n&fQiZkTIJJh&j#fKw)Suk?XS4#DBeYID{ym$B#C20;O9|=tf+Mtc#y+$10x}{ zrvR-9Nn4M@CBs>l>ZWmB*J+f>i_~oa9Y0#a>xTisg^yIhi2|1bT}HKxMP)^3D3r?vFiV`0~ka`qjj>KIjH=pop$h|?{7OL}@=4jqG{6E+65ug!Fway1hbVL~zSfI4HUN)Gz4lus}M<>}W2w2LsXu+YdiL8&XRJriumwhS{* zQBlWYXOpM&Vrzn26(dNZ1jwjEECi@A11Dfxb&{)>WtT_4cTvTq#3y;-%(u$T5~O&B zN=)#ht(al6{>?6;!wM@n3^VDE2y@v=k|up!cAuI^>iL}^baaK@L0JM3*LWR*{eT%nR9NX|V;rAI1BBTY#zZJt=*@#cu)?r;avH&O_QlxJdA)*5uEJ2ix z%vVb+OS3L+Dqu;5ECZc6sIfDB7aGH$_f2p#;v-y$l!}^0-m4(l0iE5+h7vH-yAHBqzFa!i)%@e* zG=y8wW<`AkDnd9~>rq>6J#AGQcO$qSSQf&BGKbM-()*>bCb>X>K81LqDxO^lPh=EB&5^7Oma% zVRT;mK4FAIw7r*vEBod#Z5~zE&8f10UC?W=#F$1rbX>n-edtBdvcZA}n71BQUwjCh zoEe3o6&ux7z2N2~taZcf;fcJidldV^_k#Y&fI%A_gh4I)Q}oI975ULbJzb$i)!GKo zaapX2z6EbiX5I3kcciFpQKG8T70vl`B*IF44`CL%-kd1kC&nSI=frq6Zb%r}0f0GY zGyDhaC*W0LH$CuV#iV4uG+!&b^O%^F%@e|w2Tnmu z8^EKkmuq=A4pYwDpdoc)Dg=0hDDr%}dG!&|o2uXzz0*N(rB!K_v~=pim2z zO{qXD%hj}G*fbXc+bNlWWZq_CY8E+*XLA$W7OUrpI4w-U_~dwe91W*Z6BGDfZ4M~r zEiJX5i9|o+6%8iLIBr=~!rGU&+>0~;dHl!x*X7n;66P9Qm{};p%lryYf{!s-2 z%2fX*As0wtofB_3%H6m_~tZlgZ>fxh(9fPmd`_Qw;}y7 z7fa9uC$nU0xZuDuJSrr-@af{zBIMFbpT`}rx#n9n?j>!~Hvnxw!B!2kBMjz(e_CGX3(6N$uthG@L~;uETSf*dEN zP5br$?F!_ntT)XOMFtLmxds2vL10mC&9fAch*!n<0r7#H7NK+e%s-FCfU=A5KV|=) zj!l%SaNAXZTy(JvS)r=-X1xUN3&?5Uqg-XtH;&-nnB-{k3m}nFrBH4a#Y?c{DgZCWQ+32`k~pW7 zbqwSgz4%22sPNYydWeIv>N8k(rNRk- z4SqjO%BN!DmP77BSL=Der}+@!FC=li{75Z!My(Ch2c%A{WIo-SLyCiSwjlzILE$*t zr`WN-ksFwA=xYida#1?aG=Y$Dr0x`B_VmGdw)e)Kewwx(=l(j9beyB1%=}9iN~#YI zEh1wezFz;e^!k7*gAyRXh`<}0I!Oh&I%H<#kAUsrG|LAttpuqe$5|!Y+IW&~YZKXc zzjn1}o_QvitOPy0P|-5Rc()6F*7z&~VjJ;R$ASjSn@4e4D7#K2A5ax#_gF5)n;1iL zpfL|P(~It6oLujRYyOyu8--G})%X{*)2WbGO{Yw({`2m?{Cj2hwt2Eo`Q>N|-`!a9 zsrV&vEj<SlUYJ**>ozN8B6tv+TeLreW`g+u?0cUM;DQQ-q!^jdO?lRY#v+* zC~>lJMZdCEXw_-eNY~Zq+Q$20*mbN3cex@or5uT-#5k>{n_f02EdivmBIrvRTbrO@ z?SnVhzx>PE_1C=958GF6$7x)wJp`i=a+PKQB&iF*HU8RH>z{u8ue%Set-QN_>raO0 zqk3aI>PKvDl)^;N2%ix8og_NI-D0s^&k<$I>fQU2zD3q!s{XfRvS^@58hd5+(T(-@()O!dimz+{R2@Km)MlTX3&oms+kVKbWe2R%AQqYd%nWD*9q|vO_ zYG5ub)Js@QEIEcUns&zoXR%Vif-}|ivRm=S=ISm!!q)W+)>4o0G}Y{do_?CjjP$aj zDjP-RUmK^-d#B##Q-CN_fJl4SZc14KhCyPYhbIKlh0k1Szjud48zE_T2-7miStQ}W z#ux9l|8_%5>TyPixlcped!K!PMKdm8ob?lY(9Fl{Kiu+ueACtt1S0?NWAEXgX&-G3 z(HQFFo|D*uFcJ_7EiD=GNK_$-grEaV$aKe5B2=zCz=?5BsbOHP251_Ia)1Zfhn?(w zWQeV1sqA-)>9DyiV}GT*xSw6M;h%O9uj@K5p_vhdhPY0M1VLggRPZiat%092RdX1Y zH7nSfuPuTiu2SpFYz9&LFH7xDK8Iu!m%d=rC=jlx2F-D#m-TbwtUvk?%gsPb%O9-X zJL~=U)!KzybW!xS3UHg3zVObxuOwDqBN4FZn(KHJg8P!ff(MZ?rdf~-e4J`|<_4yp z+x6?;zA)o2b=c|Mq9*wu@C}$kufRVsRVoVN5XRwHF5nK=uqNAR$k{et4ZG{bbQMuy zS%iJafOXzD!O_tl;Lbnf+IE?XvVr<-OB-oU{u$b5wzKj-I^%}z zveK`RTA|x6eu$$WSmZA|Jg|Cs$WKXkWkAKEG^DSLeNT6CWb*xxfAP zUp0OON)FD}@OIcf^RBqn;;(GqR&*VHPd@vSCL!5ada(ZEpCDlcSy*cy-t{h>qj`Jy z^*eSvZeP9!nfKI=*g2u^163q;4b{$;4@yw2yX{~iD&bC3cwfFtpIdqB2iu^CzE`slINOmBPdH*9Vp~(Vg^1FoZz$Jh~qT!ZWTNjREZ~t zoXTR1r|cOV~>VOX_yzR@QdJ77)4TWgv2YeZ@!$$e4}jVWOl3$C5yr@5|Sn zPm z>m#gF#8z7`xeG4;qK76RJ&~x*xaPygA-PLo=0=-hic6Ep0v=vUnjKO0m>%Fom=b7U zh8_xn8UsUSlUm4hiA`%$W?$}+IwrxB`X;y3p?i@S2HcWC%{ttsuFkQWyxc{#K}@ku z8HnL&Xl|D9VX%(LxDG>{Hs~O^-hpZx z65vt+8?mH&piZ50W#>qqczw8)1sxD;CJ#-L5Mo-L*tXsKp5OJ-u6;Xq`G{Bv z3D3t6O>pNl=(JG}&#;5$9C+L|9TG)EsFHxcJKrl3PviIiQG+GNSg~Z~mdlEEa8A3d zHhY`L(p^4~&_wE$&t^UKkT`aM^B22$fVp!9q;_@pH|NJhF_K?`MoA;_uL@LmA=O{xne_ci8%LI%XPT z1wv-v&2`93{_xnFuec|g?0B!-Gw~KA>X>~6y532s5N5*&)jbN)PzC3(nW2X;-vqmu(i+wHpIu? zdf+9budJBH?&BhvS|l*&Oz4(QO&W?F0&_6BD>DlR^%rQLd3WQZk0rZI2v3teo?`qA z!u|6_zaMMF4y*VlA0cUwChepdGHy7&lqzO^X_M-AmdW%tNLVU*TgF+amWo8B0tQeA zK`uM^3u6sFGZZAkxQ=NPm+R@SLQ#5)XKsYMSQJZ*X1z4qg2mRc7rsVVHHnHRO0Cy6 zr&~ekA2J;w%~LJ8rER8)?glRho8WsvE5?N+DuOzbFZkm}u~1ugUv8D^CM_5npGCnL zrgfN;Rq35JdWwcNf%`hL|@MZ*^z0wrpHj$d@93PiG>`pc(b;IVy?<;oT zv>Za1ZBl#;Ex~x9Kk*?Ji@c&=u80wLvk)9GJoKc~79VYMgx>K%svqNmvRMBbX%ylU zWrpPtvEk!k=7CTRIe%12=wKi28ugt)NaWN-^Bsu7EU&<<(l>Mt8LGOOF&m4cr*Rr3 z#mey`a%PCq)zE@<@*re7KT>0X=@5fDT%k`QI##OlC8r!C?a`Q@v~P`EYA9ijQ4?9p z(VKMYN22$?6TW?G2ES3;fD6B^GTu*|1riOjR4094GeZo?HK!Ovo3u`m)#e|0wBoO) z&1RKh*^%UF6icBIW6`eALiMoRX#k@qu%sX^1;VDw+#$d@fglN@SXdAus1jio3r;gO zZIld%&_qI5&U7G2;2X>*Dni-`czE%IER=VH9HS+Q;wvwHQ+4_srR@;qMg;dPf)9;x z=EC`ER97!MHY~P2-`DRP?bw9$Qtl*3ncYR}>7G38z+bE5(=qYeY1HburPf@)@CaHacR~S$B?rbqkte z=1psQ2BE`tt03s=g@T}-PlaGyL_)1vizO`xGxiihRZ4Ab*P1s&V}8lo7*vXC5~qPc zE;XjmD|(Bss<*&Ya0(cPD)frbAq9qcTDs$xKCc7{isYXlnh3ylMyvs%7yIh{Q~$dA zknIuz-7_uf|G?>_lae(|t52hq9Cuja&SI+$`*^^mO~$Yg0Yz!*J#$n0_8+3n#ag}_21SHW1EU0g9y<$2#4N_$d8om&qvUV70~+NvNr$6=(`b}ne`(UQh@phw zeP@8~oM2;M5ijK;CJSF_&fr@f7H&+~$&HcyEHksINMs?>f#}x*uH<;b##8FF$|PP= z=*&jkx})|y=(+_3vrYr#J*w2EK7x_&BtZ1Z2vEV%2cM`kItXvu=Guh?6$*u<)5)Px z%lS)@{1QBV=Qe1h0(yP$p30q6zI?_NEoN}R6CZsuKb2#JR5ILiQqoN81A9<|3el-f% zumysFhkcF&KXj4c2136_YJmorG$5t}KEi3?XJ3DCr9q{cXQIIGV8oC~^!Gbn+P8b( z3sdMN&h2O@S=L}oN8%8INFi>qfo$W8y%M39uSPoMVNnovom_Bpwo9=Bp5E`MijqYx zg2qu1HFBLr4|oFcBNr-(kRB=1k?SCy!1I{5WLga_keh-ZLsJ9u&it%1z++2O!H0%E z&gcEh=ZSLlILtXU9C43dr|!%*EP@Qp2v_u@A!~wm&^5dITn3_g&IY#_rQOjuhtK; zL-0-9PQb&WA6_9f<<*_T^JeRok=Is;|VS-J|%w?zmfkI zOLKE*WaJ3krrcfwrs6iXCzHv{NIIDwOQn)y4SbNT6bse)`BF1MN}$=OB}o$%_c-2q z%k;JfN+pwH*(~`$lPV_JlSoXCjg6)!UHTD{u5T~83)?HLa@n7)?pl8KSu~Q2C(&>! zo*o~AsG$_l^Ds%N!rb^|abhf6m>W&a701SA-MNW5*BQ$sv*U%)xuTmLADmSJyAlxN@bGi^kg=ZOpZUB8q44$hZIP~r|3D~>2k2)*6RW=rqz6GggZlm} z2H-#NypE?3?*e@oBw@x*|5=Hi#0Nu4kEhXy`Y>P_LP`x(0GBh6;YZUGa1Gh5 z6BSKt7@sQA$xIq-E-rmF!WKV$C6Qt0zKOV%)(Q)0vt3;4(eORndtKTOx^BiZDf7PZISh}iC*#*H@x}o!YUjk0afvT(^-h7N^^x%XE$&hB%b^9WsL9Y zeAd`M`JMasJ-_>f{PX)?%0IV%CrAv>v6glchCd*cAEtRUAULjHE1ZV1o*&TDtKDKiG>AU|(jJMI}JfDKRqeu?0&R;;3M1 zL5!-4KZ3uJTsJx6rl%Y!RvgKLwU2F9{31MEp4DIY|RCT^iy1 zNX_*t0Lt+WnHq#gb}=_3c~-OTl}wG|kG^Nti6HkKM&z{LnJLy2OMRFz-yQ74v;;Av zN>#`!7n8uKb8M*uKft#yW+3Wm%LrydVEczuf4$?snu$gMI7FRfmlI3IQ8F&0FGvEL z8cH;=_!Nubzgg1eX*}seLvf++LuPVhd_inVDIqYDK{IO@l$paOnF4p6Q1C<&FI!my zuhb=b;MICz5w#ZnFLS6eG7JfLmrO&f-#f?P5a8*SYoHM%@eL7H1l}PQhsVG_#3B)R zh*G>KKB6}niI*6vMdK&tioizlJ@XL*OsY%atfTH0TLm{C3uPEsd~=}{qVSb2wF2kxWDU>UbvXb>qjGG7oXZC-F7V%y|s98>_=@rigU)s^5T7{-3q zO#^9t3g0#!<316M5>y2t{Hb=4yzDc{`t1hMg!vm5ap3G3k_O9-rG2R z(ObC=|2_9EO&j4|I=6avY4z?`oO420jk8PbYiA`F`~b@O58tgnI^BNnI*-xu!JDgh z-|?33XeC!4ebhep+S-G6+81cWWO&fw-M2NE3xeie8b z)o5WH9eeBTwQ`vV>qk%5~eM5Y%i+a%B*@lUs#K z)s{iV?j*iy^@Sf#n|42|7^`3$h?G-yu`{tsUK@C`x%lq-tJ#Dq<8X_z(?N>xBnGd>+$_=HV0-5X7Fq%E0XxggU+sf*;*@-F=e z3tqNK3qJw{;Akoz%6HJ-JqPTaG%V48gp5m7%iv90Ja@MJ(Hk45&#$fgwSDT5 zdLVlZMu1gA0n3cpsYu(+jHed*sZWP10V4fxR7IRtu?Hb^s}yqP26~pz>3@4t%xO>v z(e$pIGr|j3&>5@1W0j$T|As5-Y*o4yc|w}*B>03??WnZtP`PKtT^QWQr@Yg8(Rugk zNy*afDO{1L{TP65cKS1gfiEy4xC%=^rq0 zPA0SQaWp(RHW44w?VCHe%bUT<`IOW?Z<=`NRas(6@`KH|>Xa#~W!SGlrTypeucKj| zpu_eJTI}~;h~qT^$#mVKkS~T@1=!|;L%SPjN4bptZ?%Q?xo&YA?qLm5&iMCcaBULg zB0yIvogM|p*;IOB0^?0XSjP_c6KjkH{KoS;4jtOH@6hi3`|>;Y>^OLE*Fo^H6o1}TPUTqyY3K7px*;#b!4MF|KAK9O?7tdJQ ze81r6eKIMkXgMkEOv8PddpO%`gu3De#X39X2IWm!;K65$$Nyk>uOZ3e=WUo^D_*ox z9k8O6H{pa(^Jq3rR{2;e6HgoX;`7z|PSWW;dn42IAlmJx8ZaH z>d}Mb{@*vf9I#(Ag@WhbpFjm z#lG-MSDlG&{35}e0;jf@g|GQyY2H*oxPX~5eHRk=&~*3yZ&PX2jhkdtPmL)h4LdNL zsbt*HbCrMz@;4nQ@<${kuoC)4d!ot5ur9geGp$N;sI8wdDO*Sc^ai zunb!@xQoR?D8dCCLJ%5H;PE?tPpZri{&EoWk`Zt$YTvx!ef;*ur*E#k@r}2#U38u{5DaC+O#k9`qON|Ev6?%3yFj~TO1#sNV{ELe~Q!=V6P8dCfz+5zkI^dj%@L^Tk~||Lws3-TMxS_n+WO=yv!|DlrOEII<(182|NvUrD<##qW@s zE=-N=sp765z%pyCDYNvE12660d*D!h&#rwh9C|T-=*5?I9ei>Bp675WESEAo$+da3 zls)_Ry^xPm9lv?$P)Ny4TIPb{Cf>vSt~}hVRny)HD^lM7r-XCgA(Bp5V7J;?67K@7u(?|5g=BaEl|BBBYXg@XKQoPtOR6uTRTVafOu3i;RP zOa&xh?eA{3-@HhNM()0OZ(U!#`xfhn_vdpA)=w+vHeP$=ojlK-F45}U3oHN}YX9&< z`-f9v`{D&Wy1z(Q@7QJ%4<L-6La{gax5j8cFF)b4>92%hm3JggT z!j>2DpcOt?rWDVE@IffgC_($kMTH)t6ZACctVt=dTMtA7J7y*|5l^MjaAp+$JD!GJ z6TniNCYF6;a!7ui`=Lr67bfg(2#q|W6;QM!k+DTK_=LRQC&=a+XnRQP2J$Y5-N2@d ztDX4k5CzV#qH!ATgT^U=a2ut$5>4gOYPhITE|Hk*$1#4Y#{+;<2)1=Pc&U9jM-H%v zk5Cbcp@V*~YBKhNBqb|4B;ao#VHuV5+K?-YhV+<1vH%mO$`-07hpa&mqkkGtHmno~ z#G+u4a2XyRh&M2#7qRESQVRLagBZvE)`~&^X1^edXL@Xse*)VZS?vdDt2VggifwPbNT5cydXOz5#371DJ4J zNHyQ!VSqvP#d9e^mcKZOCxcfud@RwB!3lJr>^cpI^l}UnU8@eZ70^K}l!_$BywIb;C(Lz*Of=?CA!i{p#vThDBfK=%@(+$d9fOL&(jq%uawUh@1JRxr z0s=n!N;6YOl#fxM9I01ZHDB`*3{dP+B(Sz%Uw#KI1eCAThS2}g=h?ja>rM%;jo$%&cfr~UTlyl= z%(og%$gNQpg-o0vHZ*v2LBOLZ2`t|)cuMTxs zdkX4bhU8j*KqR}$o>IkiSm->VLvfW3pH{SUXtv}uqzr0GBgeXGL|sl@H@EP3o2y!y z3qhFP9ny%D4BfT}v3$JvR6kN1`kBM+}UC1Uf zV+pi-4lUx-tAoqZ{2%sHj$ML*X8;URPODiRsk=w&VDnsXsVa{ieDQ(x*!Bf}Vu_=g z;iOUu461P43C`X0INzZIqz#821rf1fScE2cbV6jage5$3>cb~Pd5+^^WKhGwmp|Q; z6Fg$L?NFG*IfeF!T>}f_ST1GixmZv_QdROR2({R3M5-tIxN;=K|n5Fp)3C zWsX9u;&}!gFK*I;t$f_?R$SVWt!uc470GPmJXP8&1Gu$;Dca@$ZXW+lIx!$`pzvCH z@ha#U57yiQgcSuUeyMt37Y%r;AGye63d2;mpE9#Faa+t*L7m*VG$xaw^k*k}4ZtAN zHWKDQsQEj*q6vF&{>MuvL3@b?n41Q1IEK3T2U>!fw7Hl-f4nsySxf) zTQAW~{m(SjW(KDWU-~(0P`eu_ zrBy3n;uN+8yx8VyzU9)2YIBNF*)7i&`HxrxGwGp3vnoqnl7T5=(@fc+G+oP?)g^II zjh8@CNel2_mPY9bLZ&RaA(4_h&6SGwrUtN0j>@ltApYG7W`25r&USP`dBLa>S@e^9d=ELRtMgM}&Rk1qNOs)mTY z5;PlPpY+}gqG!v;Pjq5wgG9FwF#b^WBrcs!W+&{+yGt%L$o{LI`lo+QP<-%Z{QqM* zFQR`D<5;?=v$kX$C1;=%KJ`vB7|1(+W%ce??dw0Zul)@!a@;?+_Tc+)UPb>~bBf42 za~n+yc#4Ziv-J;>__{^Teu0=U4jQqT4UX?g#RkUM6D-_h=O(77cs0tzA*!wNDm4S@ zff&adz%@|~JV(maSLrt>)91c-P+G(GO1~HQn*R5Q}@LgUZ0p|y#6VGX( zU6K+_XI%6zvOVOa&D>HF@0vMbW4Dwd>^dijAC{7J7h|V!v^3YunYBy7jDy(=uqPF? zlqW0{v)7!$XD!A3fWA57eA$uV)_^0;=2D3y8nM_8Y_;Kzzf-4@7y)3Hj%*-kIyEds z<7RRLzutH7xdAH%>n^)DlS`+v(NIp<&Jnc$>^6ZN&f``UD#K zIafLnTErDBjz}4DJyc z*`0v2<(u8f^@N|zm4LLqNk@VaX*^_jt0dUn`p4(JOJ`Q^pF+Ff4hnJVBDYCUFeGpJ z;;i{##w6gx_!}S~Iyti()mtI@H{f2k{ z3pt|S2+*43RGCBRDMV8NZ+v}q?e9P0I@*|MdSGx|`|}y^);;2DchNiZ9&L_B67)yL z8X{X8gWSG;+B|6Ixw=^gea1!&8KD>3#MtM%_;7v~reV>a+9hk6s7d-MY8_ z?F~G4t3N{gK<~`y_W4icAqDrd2utd_JBhz)ZO$Y?0jn#?v*6#9Nx>ziQS#5T&vv-o5cZg~jQILp+KatIT>Rrq(jedT;u1tP@tu76INLizg$J{q!9OD)n z`$jA%+HzB$N|g$ZH-lB7<4K~JO@VG!X-`6OKhPSfRRzoa!E27-TcenN`o@Q0EIqE?zu+IU`9-lIL+9X`b3(PmV$n$E z2BbVh|J^8;3a-^E$`*F0X;cMk>UK=>KwSgbj8Xz95y0Y*zwhbfdlefy0I!`lFW3UddrWL$cg@rM*R*z; zM{1w-=j6cYXpc5>Sup|ORLoM1v=C#9$3^y^&C6w&crrxo4xehDN^o^;btYUgTtPTO zBumGiS}N9kRdqg9!>ydbTt4@Za)J(=plRFwu6}DP*9o8g0@h9SIEJ>d?aw};64Q&V zQzG_lJ#nt;wHyUAQ@AhtqQA@DD!9wbTyvNi&<^>3r84Wb#Y=^ojM!R)n>Nw*X(rng zt^z~SJRRs7Rj!ws6fRBVyJKAezHXFSo$+F~Zrl5(jqDSQ2XW2^3gp_J+;>xFHLsVh zlw>(^rp}ZoD*vfeg|sSBi3;OVHa*MQCP@(%A{TG-4}~aYV|M@-cNHD;mGAQ}dMHs;IR?XJ z%B|KbYVG>=bq@oabTjb|1FUG<=l=_`ODdN%HgfQYgR-B+USq z5yW!c?%@;xdLnvi_}oFdvxRi6?JGB5-z`7J)1N~!eO%b_Jl1SIPiNiJeM8!vj+<|`gR<`DuFDPi^)ic9IlnPnKqWl5=a%`kqdhW<&M)|VuD_q%pqjY= zVzHvlJ6EeTMaLh;d%IOG-JmKvw6i7*olzP_SF3E)Oa( zOn_Ac8&HzK0GtsFc@_ss$8AZIdOnfM_uV=Ardp&w(r}y&hEf)mRDXQeKcg|=c zNUn*R!$nO>>Wv(>Tt6ct0fQw2BH(orE|c+=JBjq$EMMrjYjV3c2CoUv5e%kZ=QRA- z{6i+ayBJ^*ucmks0p<$xJ4FR3{?jaAx*)kRm&w>ce1C+nCL-` zpdA)k;3g?u2JU02D%Dz?h6spDI3CTrLy}Z>(!XGVjAtM|dZtih{dk@mO= zN5JG9u>}5C>X?#@omYfd6f2Bc(eF?a1v9kH3W48-P(4Smj4;!D^F3W)G+oiCcB!oS z9F(0r9|We0NjNe3fC=HNx|uZNS&*@WWNVS@(jaqHHjON3NA2MWLxW-15lQ)vg3r(g zyBDJGlKE$5N@y{ahp8I1W!)!Am;qB}QN_MB#Izfu6SXPCJ(J(I8)1B&vLO9nNFMTO zYx7?RWRoUW8PwguuMjVK3=_J4T*or!2+SkuBOTdeEwr+sFj-^*4)A`^iWb7b!?E7X z-k_R*(wm~Qu+r;cR_x_ybzj@Kdg9-uNQI*$?)Nl3tlsuDe0iPR(bFzM5r|u-3fo5xh08pp)x;+S`vI=GQ5+ z$I$Qhdv+`M?L0^+fyA%v`l*Qs%i%M4?1@| zZ54xWQsT#zq@7l4nm07D=V+jk7y<*P#=JJys&-SD+yAIT=h&Ld$Pdbn{~{*aey>fE zU$pR*F4I&b2du)P^WAlOYq2>0LY%|D0A$%u@rdlW>0My)9NTj0v`35DZzL3lUGm~l z=EhIA`(N7-pTk}g*txnshLi69bxv0gk&TAYEUpK$iCEePka>GlXLiiC-q9+0b5A=z zS{%RsEDra8@ZSBn58e8y)7s;!cE`!~Y?ztYw$49CII=}LKL7q51x$Tp_UsZb`_X<` zTn*o!7siQ{FY=2h(n6ZcPt!HkW*>#s*xA25x2>p2%5>`mzq`c;?+}@xqD>Qu1v$(i zo;JLT=zWUlZANvwBLqfBSY1xqRJ@Ajz0{feaa*P5o`(6secdjf+JGyP0m>}@*NTeC zxMjc$k}Rre2}1SbBWJ}3(i)w4HY9zCOw>PhVCPQ>sCxd~hm;+Fn9RJ&fV65mo`1feH8#tXj z{_cb9Zg>i0hF>KWW0&Vk!%~NyVsHR3n<()xElFH18tI#URNCDY_ZqMlX0QiRviL*F-#BDJp zKi|2s#IfmM+cL{|S%^msxoUiQRlovDzOq0H4}uH4;5a&x+SF7(?!ShrT#94px}|1+~UNspO`WvTrrO`9E@>1pGMaspH2LAwhhP|-?{Ho zr_+)7RCg~B+HbA69P`cG|2o`kZ)vR8#bU4;&p!}T`9oSMYp|fEl7DPIx-KpzUJRVf zs$9S3R9}Z$X;UkEFFc>FK~3wt0a6h=O*+9hins1fA2%%g!zChy0gaK6MN-6bn`cUg2^+nx>n&eoXmbLfO${*3 zf>c>G{2aYsvj+4$3wG!#-v=>HM0xk=LhVLe%K8pJfw5#9huNyizq0|Pf%od0Gm@5` zw@s}->#d!O->hlP_`UdidIvF)ltH#18U>{}6rh z&yupSfR z6M;1nww6}r>18bICy^N zBlLdm#4W`he~a#HJ@t=}U}bFZ9xyd??a@xzq+DyLlL^c=CaQAQD&Rdb(&1^o8 z3zXrXL9D%>2&$bQPPyedqTz5w1BQcPZS2)^2BlWx#lMP8x?Do8fpf)fG%y)F#d_A4dSHL$1Edz0+Qahmht{hoom3bcu3|$W`Ka$xi_OdbJP9_oJNg)%&HO$SPk@14IwOCLPFQ_I z@AI4#?q%qTA(vvF$dsb>95X+W7q>03|fy`*vyT(F}S{Lx;X5@jgos zzkoJh3tcW8KJI_m8(kllQtoD~q7eKaSMUJaOX6m{Ql}NX{HJ&TwO<#2$X^d=PQb9O3cBuAuci&8|U)!6AFY zGb%7@0ayD=%BV@!LV5aXZ$@6qt`l6vAInEh zjYi+!snP_)VGkpQ4*u&12TuceT2*4!(Ehm3dtKBnJ_$h|ZY38F+J{H)8zEk)v<)o> zk{<&$y{3T289dCOzK0$N1{*QiMF29oF%r>qBnVzs7#bjOBM2fQZ-LZ7#(byH4U%hK zjZ`f877UBYFRsRk$85RTxWq+TP*t6ZSMfnoPg1kopfGp}Qr`yaoD;P7bUX#`V5L0q zQ}%q_I0ux0yLO~|@n;&g`?PvL{PrH;h;HaxHq_}AxZyZ9eebY(pkSoA94@GAkTld$6g5>Y#2~V)r$99RC_uM;dOHj2ddwXZ z7@tw?fuN>5lSwOoM7Jcqsx?U{F^y;Gy}Bsv62Ga3fem)HVdKnC**Y~rK8gNwU>yw1 z1VDH&rGdzLxn>Ik`p;lSJxn9cs~%}R$%P-jzn!hib@-kPxGbcgnc0*3*>2%(u?pAQ zFw(~`k`=zvP1!mXO=Dg&o*NhMCsGZbZI!}RZGEv}rXXR_+>#|HM`WN}#gd6HC$Q*@ z*I7@uFH?EBf_XNmA_X?yA8WA@;j-zEYG&B2RT^P?lh_ofrVU+6ix6QS7ofaxb)Z72 zdUYTG&g`Pp-f96Z9yciV};hO(}|cu*bur2q5unKQ!LbQ!ed>`M+PY`opS@ByQRN;VrOyb z!A^?is|C}TXwA$n&_wip+pwdl-tBRI&1c?&nd_e36j~j3uTzf}8v!G!_n++FTe3*w zL1Cd$XW~gD zSlkxUc^ySupWg#3g^U zMBpy7%vP`I$%^tH0g0yBED%g*Gg4?U3OF~X4br@gv42&#~%QnuG?3* z7OL>fZM8w<)HJtRm>Ukm2g*0?7lA12gqP6O|rE|K*X*0Fo<94DMB!9P2IdS!I5H4tla~GX3z5U_)j~6YHZq z08|m(ut_$qX2nVg@C><|N`r$6RWt5BhHiYVH3fRfgIsUj-7Zy8t-)(al33?^g@A=| z&U2?SNasn3%nUv{OIVzojfd6sueI;$QBfR@{#kdt74WpE93pQvI+@<-ZZtai5%qfE zZ?S^-!|Mr=Mlq5$vNt8nntD*~ZV3ciI$wV0&eZ`qMcQZL7ah#k$pEOqfOvmnx{$rv zH>!-WS|l{QQpqX>^7epn9SR&gDlu`Cc+Q7#-DNtG1(e6%s_g%Lxr!{KCl-u3nOz?Q z?t0~UebTGeI;BjxiP<%{Ey+Tw<*(s>IeN9)rogMv<5l^2`LHRP)s$=0{TV6>UALM~ zhu){gya6ZXSHu&CoqeCa5;(61az>4|b1L$>x?D^$1o@aIDIuv}CXO8}UXb2*S!2T& z-R*_1AZX2mEFTj{kb)aiI?P5Go_@4ZM&>Sho&0f&lmGUPx36Y-EqEwh<{;9kyr;qd zdLn3p(@b412-w?XZ^Y1wPu0mnG(xVY`l*P%sXvSGNgf9;iP}p4_M!7J@AYyse#7Bx zzS1q+@p3b9v)OR4+ouCw`X6Od=gss!i}ccXv|r0Sep`E%HBg9SIwo^cV8Md$(JrRw&OHh%R#VP6s~5) zlEXxEw@k}Tk>+DlL8u8lUdZEE4W0+{0gXME_IM4whf;~muXB+nk{>m;{Y|q;F%Z5Y zQQzLmzn^$TSrsVju}Fuv^qmHinl<|!v1P1MqGuAd?1%2Hwhe&K;>3r1$6G?sWAhgz zZj;(gOr}=tjSJgq+fBcC(p+EnX?FjFALsU11^_Ig|1NS#ql(ChK_*QOXOmS*p8g9` z$()@fld#0emk}d>8ujpI=|he=o72@yiX~1zYZ44z*Wz5DGPg`)T)_3@TetoxgSCC7 zygkOvk~{obNLqYV*7nEFYfCP%b&-lO|S6=;7F5zP~?d$SOoDc=Jpd;_&=`I zphTV(m}(ag2?;N;(CvQk)vfP?fm+KM2JQNFOT2Wi11=S6tX=kn) z6E*!!B(-T+oet5Tb~cm;-Wr};kXOD(AJZW?UZg-(Bb%30xUq%h4-N~fCMeMG;PYE_ ziG=i|LJm@waQW;?(73OVV+NBl^cG0396p`HwMJqBkFTS%WBE#C+r`{eTJ_wK1&hrg z$;d0{G+5rv@|o0`q~|zFk!~fmeFgczVb;{(g5nm;hX9gSXr<04FC&BC3)E__cMrdt zAo)CfsJFlzl~$1~2)DQ(fn$P^|feoczYx)LG3{;%3U%J%boVYv#ER@wASY6j#z z?363rz;3$9c zp=|!%8!ODMmnC@r`iBx=yzGjT)8i_9LXezXDtSk?o`Au82~s1-9GwI&Z)o zr7|YFoWohP0{6W@E^Czjp&1`;3IEM!)Q7{Mj5^F0=5=QkzLEyKX)NV4miGuV- zlyql8h#kZ-Z`aSrGUqq1h=mw2GW%^rlp8y`B~Eyp`DcevH@pU>tP z2HIk875zD_i|u>WS+FjzcAo=!eXM-BsO`eP*z9-`_JYXHFHH1%RzQ6`b)&wV*F04E#HlvFWw*b z)xURf3Ju*H`#BV?O=LxSZ?_-MAu|?}aKB-4Q$y*>FM8Y380awfed~n<2r^UW<~XF) z4Zcf*5#W6zPVf6 zu48vY#`DpWo)YW;0hEc37!|096f5=`JrpOl$U>P$@Ursz*6V>K{~xcB_sK219BsDf zl!RfE=wp2s`sMKCM*91TGXpqG`h+*IM5Bj`k(4#0fi9YywxR;E6W0H z*W>Kq9r^bLT~Nyje3x-29HjWDTP|0jlnQV#Fx`rIE#%vDh@r)O7`YhO)nC=o+yWWU z5_h>bQHbQlAPb-99;&|_$aKwaOioThWJ%C}JV&IAPlAuuZ?6$Nrv9&&lFTx7mj#1g z|I~(AI^K8buYSFNjk5iJ#BJ!;Io$V7i8&J@b9Y;|C3}ocgLNak;)uq4r_%uAbg{8H z1E6#Kq3<05aEd;m7t_-*Vh*;c3zjJ|jiO*6D|?9;L4eccrX+OrGEnN${xX_J zUm~U6nF|+o#fVaeHudxBYpUb8M*ND7FpjIip}15+h^7+EiHIJ15SPL52P~6)!aava z^l#EI8utZ0t3f$Q?F@*U2aWjq?1}XT67}w-5x(sg@#-8!;FfL{b7qI<_d;ppP4)To zLEP#bOKRj(O%Gqwl}xx$ic9?PDXL8E?WVjXDT^xieMo@WSYY6&Q&Y`<7A>Q31~bUg zbqd1~V~E%Up$bN_Wk9AZyX+SwzzSCK>@yYHAj)(b(JlYntZ!T+!I47!2P7$}UpJKL z(W&j9Zt#^{!M<8!3qBA2cW~Iw0zGxJ@40wb<+2eh>OPvx;z~|0>5VhmWb= zdx>=LvuDJc8Ye!pBc#)dS-+rCDUL5Mg43s3-_%H=1E%?5V zK$}fkRtqMV{t05JkKi=?EHwLV1jT;nZt6`S{Hu%S9#Bdd7H{>t0jh+Z6@G+#3$@?i zY2%{B(tsdsCRwcSnhY5q-|`b@btUU^Ihvj=Z#5gTy__qdo6}ID14=v0*F_U~5p|Q$ zNtiXtoWWaPVHtk({)YZe^TnP&y-U4;bW^)Vg;H0y-8$E!EqWb9Uds@5z^MsflvWE^ z)d16fghtlCLL)fr)^hI^t=8hX*v!^yT{ARfrG}T7O|VBs+ao1(E_>vzxa?TN=y6ZY zS)WxORde3!9-uT?`hDG$+i(j&w-#@1`t9o!TYNNIyNTu%QLFM6%CTOU?~@eRU|kpO znK(tP@NqU7?{c5LneDHrOS++$kJ}c6@T(rm%Dw@vhuF|hilKn3;b9)`5!Q7k*Ri%{&TAbvnv}_z9XkpGu)?}c zJtpR+V5^Ie%KLZ;L=Q{xOA+JhXOH-_|0s+OwN|X#jqHHfe-%b{+B&?y3L}qcT zv216ZJ-KdtxATWLHQ8<$l0qC;A&TXSgjI+^(tualcpb$S{KfNKPnQ1-X_ zKx{Gx>x1I=blsh0TG@FYRMU9qb!SU}uo>ZtErfU*T|@SdmM+yA6Pc$f98he}f$a|y zg~g_XmyZg@Ken*Otu9gPqE@Z%5~U_lyc;r5!g~ z3~5OMgzDI{tmlSn!RVzIc)@U3?L&%l-wN>#3Z@f;ZzV&K1F)SMWI7`blL!|w-yByy z@$Al;3l}g7E#BR4NnJ*3X0{d?MTTzi6WAS`MzG5XgJ1>qI}E?9+MJ8o>UZbUyvFIVm6TbIMIn_!u4a*(@T3usv+$&`w_4i`r$0T8)6D`cBxj{{Cs!wKc>8@; zI4p6fm+oUSv?%U4Dq)e>>0)=6?Md-x;r}eQmIYyQ4m>$4EpWnjp3)5`D7(!WHh~hU z7AjOUWBmlWa-rT=oF&Bpnx(n@9PLndv`j5rAa-iwOc_QPvs3NPcaqd3E6{5d4urvj zXJ)lJetD}|&oIm<{woK0v85e)lpU%#HK7#r785u2{u`=EZLYX;0RY+p^y@YvHYtS` z+l!6BwQQJHFdLUKaJ41Mc1DgcY-V8NX3IEVPG{1z8o=G6EbJq+atPGDi-`ku+o|&o zS_5_9-wc{bNe;cx+aR1Utx*v$DRnG5!ugH)Rfm$74aRn#HPj6y) z|IjOJm6Ons?)Yr4i~Hoq4tY7uC%M9D;FNqW)q7RO(57!Duz|M9QH2iLh3j~%fTZw2 zZcrA=rVhGPA?wvE@#2cZ_D55W6WqPc{t4)vO_IQl{bRLVDzas9GuC`ey;A#nXyq>* zv3P8I6k*C?8;n}*I1v!%6Ie^Y51!H>(`;lbMna3Y`wo`R6yU6?c$U+N7n49}hj+~S z1{CfcSPNQf5|-CCB;QXlt#4I##QJwgH!za8f+j?-7N<^g3p@Mh%H8J%HT8*8cZ16f zU!1$cPAd=p&4nCOo;&+BFbEvVADuR9DU*XgiA^9-5dPoGh_FfW3HD|Q!~72IFgdz3 z58Tfr+DsycS{1br1>+EIVOkX;bYCjS8m77{d&h}y>RD3schb?N$ccd zrA7`$W_vpJ9JmE$>OFv$AwM zf)TyD3e+PEHGS>)h>PT1x@2!U!4oP^LoxkqM0;E5DO>)(Zz2knrQ!Af!)XAV_pkQe zyivt*NC9jYFHA|EN{>cf;KhQ>Pr{$OwQ2RLh1nejD>45I-0Jelem3Ag*}B{GxUag( z3g=i2r;A0e3eQj-rN~VVr!+`EZ`!0v9e9#Z9mrMhU?GukGr})s$({yyMkoFXJ6L5E z@5J~2%|vwW5Z@8?f0>8?`4IkxA|ff!n*S9Mkw^pGDN4A&o7CcTrbj!-SDzz^RIpC>9I-GeXET zqbu}FQ4jZ)BmaXC?25hHgu?8Wc$va!F~fpBcu7reoqH@~Og$s%)iqV)&$z&$nPGK2 z2wY)MvS$q+{&ymM;%_zfjVq_g2J`PuH%VR$v&m49Gzwlrn+j2cSX#kc?Vl=jtEJx6 z5|vOEzyGO*=-M$Z(3y%mCwny7WQ5j)`Z~h&u)kO;WLuF&KY}#1F$qWngG}z?K>JKY z1G0418n9g7^o}VjtpJs`c{abQwLQ28w%FwXl{MGG7jU>R;a_SL<@e&gj4Bz;WvdEr z4UVM>N9Tom0g>!NYfEj`pAP%>uf3PwAxKq)R5^m1g}APknu7rM7Wv=Vs%N=hIg*jb@4Ei> zh+U=9*!w_F(~KWf{k(OdWv-gWohRyR!j72T)H)w|Uz?9@zNW3FW0Qb%q} zZLM|ZGyKg+ZIwOX{?N+rBIkpG`-#3LJT}IzAeBj~k z#nCv(Y4&!3Nr!o$?zD@r#i6Q_O-1OYxhZ61U2|z(6Qa2QV|nT?Dn+Y9PIp?lWEik{X{p+}j(^ca*l+cu!<>0h%9P zVQygC7Xm+Mbs*__J01uQKH;4tDn9wj^tKg!&ymmb@p`c`P`$?PtiqllgM;h&F?8TP z&8W8f(Z_FLtlr{*?_<8h2a#xzpXZ3yuk-F(@@8Ls(n~Fd+xNP$-<{SMgj!&Tm5I~?z+I=3F^S4+>^TlbcvGbeLHhdrHbt>-& znucPVF$7iTKh%^6BEU$abFx(Pt>qY)k2} z1jVPx=J|c#YW~;4hULe=&Bo34c4yTJyB{m^ZB6!k70LYH2FY`7*_}2$+G4n^dPwWl z7DBRgIlS}a$jShh%6sHyqrrW6#n7R=GHO9lswz+hU=@~;Ldl4e-{=d*ZsLevu-n%7 zxaQjU?U=@m2$9F*lF3nknkZsw< zu!0)p)wjcZy`PWXUvm8wqq@m$DyOD6_v8qG+MLfk!7fgyX(np6y97$n<j zLHa7QS8*(M5!$Ot6l9zQ0~afW9V`+Ofj$C}Le#b#^}T$-S)YW+UKaE`?6w}Sc`bk5 zu;Uq~>8kl&vxJ7MQ~@?I=^PI8xBc17xSJJ2JvS+Ro~8}HOcMr8B>Hoe zFmMnPEh>zJQb}KkvuP~E>Ywm;`amMIY;yHVcmdHQYMM)iC4B4uJO>?0cL>bMOjUUH zvJ3s0Pu7yGCf#mDrBtl?r3#F$qS8m*b0?>#(pJ^3joy$rR<&U*Jxl35TYiGW0!JF3 zU+@D=)pcL4w?kB2Zsb1tVxG8nB8Zsi`m99ySETJ75QaZ@x`f zzN|d+ZS?X!c}9m8(9^jTw9GSKG#<=&HuYkrEYeys_^tN3EpJ!7$9F$P*kX5j=|0z9 z+}nnqi=!VAIWjCR`{v(rv?q7hqJuDb6gMX%xC4Jwm=su1w>V2xZLdi?PNuN8$A=_B z*`!>qR`{OyxaD6+65c|P6HlF@(KCCSg)*>Cjv@7%N^cgr;y4mcy>%wtEA&*2(?0 zTHoP~#G`F(_frgA(*K4iK)~)V6W+>)fN%kxdG}tz!RiDm6sKn(w`mkr5Xn+uAXmrlOO9^(H5)IY+)rzl zt}xt=4R|>teRAGLWLQRx0>)FF%!Yd}`~@y**Z5k6Xs^fL3FQVWXuj+_!LH@Km&+-| zD*dGcfAx2&HQ3?k)m)X6GCJ$QXyUCI%MCRoY1MG+NEZ2PkEDY6a9Dry^pm{Y1<%UP z5$juYs3n3~AbBe!KsfV*5=DNwwZ9tU=F&tlse{E@`t@WlQEc7aQW43XZfU82U((&_ zsX}=kdf(Iw?}+*`@&o-OM4o>wuRgT4wO01V6$`mmvfShd){`6!B&{PhSK9quIT0$d zbfL7mK43@JAi;=-ZJ*da)1o2hY__^?egVfjg{~h`B6X}FT?!sxDG=O5iwKu;(fuix zMPe80O8#*IW#+B2px}2`+Sne{kPC-ko0ru5u>_(7-nk>5xueg9Gska9R2KOps<=Pp z_cKLgm>iet{Fgh_#)w;l=SEG_E(pN#EC87+06d%BrD{cQBU=Uh9VYqpBSHC&@PymB~h0dRh6#pz=MjNg4HC^;{CF1(Y# z;E{3gzkslNtue2PNh&S5JyMC|S$Ee$8H-RIBlE;r6jpP;Mz;`t@ zd>lQ`$GDSf>nFO*R7&mhG|ttu;VMsKd%3vZ)n=Y}dydp`&!_)4E%q5J9Oin(-RSkx z{W;Cc4Trbsf-s1u3zY+i-dqqs$t#p7k^>y98uMC+KsEXcoH#jQjwOu~WsVJt|220f zVnxHpT>@ssBbfmCxw~*|vad2ygO<^8X~7`5MP)Wb(b&YH74V4-VMw4lX)-)!fpH{- zAfOK^odw>GFV>g1CXnsQkN}UOLho63zv^_`Q~8)v`uG6%#nPWF`^6hxzU(WbAsgGj4WL-8G~uJsw$t}cmxPPDml|?nZ~bZ0OhQ7 z?b4e~<+w;Dw;b%T{f+)96Ii@`P6|q(lndQfbq2&t#Cwf#Iglh+>lXJD2sT5)*6V|& zmC59(@=ZNkEAZ)C!Gq6zM@Gf&>wvTXv71RGa)^QW;;D{CYPto0bMmk=APIRv&1S^^ z-c4&WWx15wVy(Dji>g zO&ZUmxMKt<7iT<{*mQAxfb0p~#6Gmi=Zdq4J@y&+al*w1#OhPI@XiwvR}pl%wU4mS zVXq-j`^yzU0oejv##~Zymhb7>uBt8Norpf|UAajk7-?2E2*SS} ziOF`cb4kQiRTntG@b^2MA+;KX68EoHJ7e^C0tzPjXEC#G-~(sU;fe^s3Hw@qG%_?i zHj&Qpxh|V*(z7#=nYTsjz;=m{+FO!@9AmkMghAo1QAQ(fX)s7vIwUb<)zHIaoOO7E#54J33*JG<}A%l+eslGU|}{3X|U8*3G&u zkp#Win*Usg_SE}tjEtAS7E8#Vg2X{>Y?{4PI5mWH?2}pNDw6D1jD{g2<>(TNMv+#6 zNC??2bVve|(iZ?prDWZAZM|lakJ08?9~ahzdz!i0=5H*_=hkKM#ZH=T zki|GeHoPQT|JmU!94oqPWl=$|f_8yPM_UXEk33;7hv|aQV$-<$dt4Yi4TbfNr7L1+ z)+^d9LEFmpYqD?R<&0jmi`Ub8)1St{U^LDNVl4jS7tOWSOTd?1TtL-SskQ;YP5sS4 zv~G`$muc}m#UszrNGrH>=Cb(3iK-S~5KQwgrXucr3~}Z?<)Y0FLW2`<0n(h<2K;(` zfpke-7l7r}WTcLzADaGg$9Z7RbxyuUdh(*@M>hA>U@g(06Uthw@eo>j?4k@d0`i=D zwxxs}FdR+7VkwcA2>FYL6V0?tm?es3)JQ!+!J+|0JoAYcB;7({sm$xcLN)2HB|$a* z)0UMEm31Pl%TC0UIJO}xF->T2XJhZPsrSqYU+71PmX947`5jATWeoUPI`%i*W+L|7 zh~^G9mN=kM8%t2Cz9DgWzPwsZ%g`-$TH^5Rt8R7qXaABSmqZ11#v+~e!gPqFM9`>Q z0tX2BZqS=iwwVSYO!N?;0yWB$ELU=1A#AyCD@M=EWDs$D=O+jw$T{$Sgs)1FC-Y>7 zV0+DNE=JP=VhN1)FNBQ|THWg`u zYSBR*6-A$jN3F25TbADK_Rj+28&^g1Ie_?%k&Gql0YIWug*sAG>;#dP^M#h{5;6UY zl@;D3dgtLUOz&U~U9p^XJSS!dI1iTaBS3$!^|?qP#x0*v2y|=6qbj~h7U3lIXG9=p zE=teX-?FFaZ5WyY02z%e^UxMtRkB!HI-GM8k<4Q@;>Mh{GK_Ug1xvxq~(44Y^c10}Y>CldDq zyDi`?w7X=N5WN^!78wFl{9KOFR31Oe%**cbtkmMd;kuSsa)c9We6q(mzvz8;mWT;# z|KjO4K7}u1q5uSnhUly>|Em&TDQqhC4Y7(R*WeAXAQeyLn5Fu!I)?nLnJppa#Y)}l zKiTP+)a(E>Dy>npTgR?h%nTDjGoqafjd@+C2TEzkzYCRnKvhN8xFe;`2u2ogtOtMQ zuHWEv?8&$FIIg@^kwR4~#!rwt8_HwJd3^Ak|K@w5N8Qyfll&t|FSg`%D~2>TQV(fm zUbzpS&s#&XWN4Y$U)dF%IN^*tp2E(Ov}s8apkWlMaX%$g zm0MJXph<-(owPAUwsb$eUM05-Hb;FaH3J+~3jxH(8S(AJN2fO*eGy~oxG&isi5y$~ zk=iOmEn6+dStq8(rjtHYXC%@-m0C!7Ji%Nldcm^&%*1IPNq~wvu)CNEQHZ(xOc~;V3p&ZFkY1X&LjDocl?WMS*1$)vK%FZ<+VFLVJ7CZ?9`` zS)Ii&88k{4VQ)8O9#VIK$rwcHNDt9|L1rr6EO$NG*$I?(x1D$SF2GwxMa9MMBS5-K zPM{kO-%A^lt|3fi?*|j^g^pXNOi}2Tp=JHa6s}3A0{g=QkjuAQ0Nl2kmx|gVv?|!4 zv4nJV_}L?L({S)->do(u)c8a0Uq8)riib>PUu;Xs(IkF;?5Y#~U1t1mzwS%A)}zhl zuW}nILuRgSv5Csm1yh@IRXAufTo zXt*HV^%uf9?@|PT)Zp9A*HfQP5C|VnKR)q*<{95MU*Rb>OAREJ)*X3j{O~jFbaI-* zkk@OdLydeuW;|`}-}ml`s3RuMx_OuFD*D(~awywsOhkO=jwfEgvj2mwbBeB{4YX}+ z+ji3F*y`A32OZnCZQD*dww;c4Y}>YzyU+RWeY|fqM!nR_9=qyWYt0#M)~iGqPo3Jy zgVx}Vdxb4LaRcGVk$^fGvQm(*Nyp~Y(m9d^%i~E=dM?DVa-~87+o4X;}?*G$7m^ zJUo`wP+*hL$=LbWFxk``+U&&(*i9qrUd$2@Cz1AJwBxQX;=4pjytn8>`$-J$UczT z5bvRI8J>gwCZH&0xU6_-{hAN4*s+d8)c~dpZLC6l=CAr|15Sp{98!X@Gi;!MvI@l8 zVG~ZsjSB71#gSz!jn1P6mNT#+c`-ovipq3{wI5ahKbA|L{93&~=vRun<7frLKle9* zJlFzInkiUg{c!QL+u`EKsWAjVo8}>YQAO|9@)SJ&wLU0AmsBEwxW2UN6MM#P`hoEJ z)smR(UcjL5A0!boJVqVWUtv;h&0IVTE@UtZ`}u>RU-~dr_cZEc*n7fzZ&Qd}WT1X7 zqgaq!XpO<$s&&P-8nz}j64syzUV2bu&&wLo#s1N&;Y3u@zwH@di(h-ZH@O+pF~~$p z)r3i%;;1A3h}n^rW35RS=-IHfv>^(hr5K3YNiM0TuQ$w15_F!x*n(oZIyDIl@Q7JA zKPGnW-2L=42b)8WSc|q2uqH_C#4xpOqAhJ>%V5~qEh%0_HHkX~)eEK#fHFNPASq{N z0xAbXWYbofI82@ni7JfH_$uriWe7X4BMka};4V}R+MJycbdF8vs^Sk}oF~_ZYf|`1 zmi$v%wkN@~0nN`UPFVJC*Y2$n``bUf=RG;Rf1IzTkehG)cVOZmXlu3;+x-J^jKH8R zlO>hi+GJ11{_;=m+R0b|#ZQ@t`W+HdwHW^+Q&?iC^YrL*rXhz| z;e^%CUv}UsQW+U(_aq_dR3i;g58QF|VK{lgju+c)Bov1O5M)^d{%LXHDe*Vn2(u7w z@G6Wx@~)vIrMSUgk@u^2gBwe zIILW;-mCSJ3UY)IHkV(07IeB5=lP;z@YWrZYb}ELhpWEyaer;7{gH-VvpKx*dQ$s& zazb+t-jq{HC@<;ce`EnXx=U(rsLu)IcVgfnyRF+2fwpRmK?jZ|`|ntHSW@FSr9jNz zD6oCjZ|tbQAwKb-kLL?eB9=v4%%=bRh;kPy+;o09%OLVI` zkrM+Xp#AOb^zm?g3)$~^6O-vuSpoWbE~FWHThTF2xQ*r0MW%`)ZYUP~34^k2xT!RpGmqOx-U~lYMUQrK#e9b#D_sY<60;(~I}5Y)@Gw8JX$ zi!Ay)ox`C$_lgdqq}BaMye&$se!+Z6rtycSHKEt8Rj^u8xqIP)r;SqwX7U&5Jn9_Z zVxsn%Y2}ISjbj~b4|bEGj$|QOIqPenpf3h|v&-|LdIgXrS%0gAk!|ohj9V*Z%c7_Z z2Y0F~BKz+-be%OP<>NL>8J%t#aUSeqbDDxhSliG+$TKx9e; zvlxFbQ4);wv|FRJjT8@NsJ;2dTCT}ab&x5uegwym^=q+}~RBDF8W8DfD zSffV+bM~qV^G}Rop|V+XE|AGcfmF3*Vac`HgO%+ltt;kNNja4R9?XHV5nOj+>r164 zyewpyzd^ZhEexTuetKC6ukY!-bbh|%7i?h!h_%@e9*Jjt?UvvpP+hvpNIJtmU-*Tb zEzd)%HQ4mV_Iu1BjPi=1<*Y6aT4&7|0UlK`xpa$_v)S&ObWNwKgjXBC3R)cC?54U2 zFbHk83)4+Fv8TAnu&3~|RSv5AB0xVFm}qcE>Cb2t@z^}2DeH}d5#fP8h_>Sunv1LB z{#2uuiE>m2SV&9|$+L6suQS?kpxQb)`bV z!mu+*-yGLB`ilY#H*&)8_;{tiQhD>0Y?GY1;J9MMVV}~d_@i*7VYb|$PNrZvps;S0 z84Vk5QjQRM4h?qy1?OzwtWLAAW<`o(kr9Ub^6yB;a$571$p4kyqYRm{9;)WGoDQnd zR%AM76WS`0wmBomd6fqf&5H)XB=EV3kS+?kX|bu%)l@7{l;p`Fz-!L##Tfh9zwc%n zMCDB#OoEa`g9Jy;o*Z3{tvC`TNhTpAB8%s)gFCOtCr$0So-FHgwvu-O-C^1796VIa z-b&sL$Ddo*Oi@FQXxvYW{o6!1Q-vdp6$vSUm5z(9QvW7lWbIs|$RAOg1jl{lSn@^$ z8`hTO!K)a{++PaZkY|TH8_W0U5n2OJHPeF%?gL32Aw$%!cfZ7^^SPTTL~m~+HZ;)$ z@t-pV5@|Tb9v1#6N6!j9s^_a^u?i6uq`~qSB&S&?2K~n)&ZG00j3i9vett? zHHx5PK_bj)0t;a~(jF{doQF*5;tp6l?F%y-JJU5_ng^V(g$>i&{AAyItJgc1Mk8)8 zTV)XS`NidI%6pJ(bSBzfOc1Yu${bp^@$a+=N#(J%J-rgma9f}tVnJ@k-P^j??{%3N+ z&2*|?GuM5PxmmONfC%Zc_Rvfj+kq0f5+mknM-Eor zj2uG3tQ>oB>LjD0u?zDhsr^BgwWu)Pnd*WCE@`T=(Qieb3|Ukrl@p-~WtYvjQRe6o zbjpXns#F>B3Z|@M2QQ5k^$1(?M|mD++%wk&LDkX(a(iO!{-DJYN-X>-+1b)Dg}Vo$ zEUydwOBci&!cK5<#48Tw$5Axg?ZDflVk#8}6t`OcW{zYg%W?y=uk9Tycyp#N#g_k* z2>D6B{*OdxZpDuzS=tmO7h*=;aL&eBVb36%=|u~U=U0!^#G&AhphdG`rUt=zgCP`M z?62X_C~S#Y#Z4V%{3aM8J=+5;36Eh1K0PNppf4#n^3;|t79NvzQ#BhWqqc*5>2@{} zeIGr^T+=6apqO1F@Z`IRM?gtdA1zFK);7mH+HmkCK4=U_C0yoDGy?zdeFE67KA2Rj zN#ib|=UIiJg)LG{G(iCG^tu*P&Lvq|gm_A0_p(owkoh(B6Z4H^fKY_)<|7+iDfeMO zdmR_vc`j=!a|UbM&^95}okysrAlZh;|`he$m|^~yI~1gxqnua zNH9RWk%I5W+*MZA5LMcg&5Qf>UQ4iw^AnLr(Z;hJ@me&ybl`rzc0O1EDE^@*gc#G; z3El9RS1xUwLzt9=Q*JJZz%_6P*9xS7#z>tw9g2^iiIF;0-Upyl(gN`0PWGvt$fgwK zoqE(_aYIwNFf=@Wl^HN(1z6w=&c{!R=0M;5*+9Q7OEw8F6DiCya2%~hTmmHvcq;&a zqfkKU+{|~nCWXpAPk95a1suT*9@j{2%z~{gcd^fti%+X1PtA>N-26H&A18?6-9RVj z`&rz}x!;d&BCxzfo`g`+WwkFZ5i&ii5F>9>$u&89qc89Yy29xs1(be*3~h@=d;HNA zxUYpgj>yG*%n8ZSvfSUO>g#nj%}TC|^NITqa>{JNjtt#3@f;~>N7)Hmt@vmpn8CV)VftHbwwNmDBj zL#wXfoDB|n6Rmm`$l*A5x}8{!pB zj)ODievwVXP9d$2-wY$TGWLt%2-IX%%U(FCO~@8Up=2kSNgy+lq35+&v-&*41@e`B zc4(wuQ` z6u)R?i~`p6Uy`nYL>)kkP?sbsLq$I_5$A^;E?*glNv;a|>8(57+@xG|%*Y7U!)&+( z-wQ8)Ojz*N6-zo~aM2+nypS}soM233`gPRgH%ST=w-Y`a)hAh)VNh@-)pNit(a=7k z!UyJpMw0JXQVr#YvcPlZ1w71>>l9;to=AOELM9_Yc%@sm5p@ z)`MW|_7Mx4MW*{FyU9dhGiv)`vDgW`B}=f@ZP={kt6SP)YOHXsWq$CzBN?;U2GHmP z?S9o<>?Da*#3Bjdy|&HlZCMs6^2feMOV~ry!X5N2X0s@Q4B-X0ZKuQedPUduKoE{z z!BNfd=yzXDwZyhNEb*SKV&H*L#5)iUtB7_Svt;uB&iuxHD2&qP4OJ4q-b%YgVVbhC z#Wdn$_7~i*+3{`kziJvW@5`_O#?9k{hSCLB9}W)wk0DP8|M2rK+!h_1+qil)CBA>A zzSGATxw~*ZI>Lfej=p!#b4B9YsnfSngCjEB?+vs2kK&bgLW1-sNY2iX^nC2xZC3J9 zD1!_d{gwAcWXp8XtaXnwXZ!7fU=>`=>#0`fazJB|r8i`~161T<+*{-1lib4^i!X#c zH95UCPrjE*#%x|?DsTE;WT5rI@RpL;gQAf?qY$;WRNqL1k~deIXobTOV~SCx8zU3<_(3=pqA18m zWoZT%ZxV46T{}?nOLOf(RO^`h-~QEUF_^zSOKS0k1fsE8V^512pkc zF_!WskWTecxmU{Or6h?3v*i_Jl|_DTP?=2W4_!0l<&EgD<`L`XhY%dfGOM^^1&ACg z{$?aqX;t*!+MdF7f#yR)6)dBlNJbLXE&FO`SffkvM*I{bo2A!AcjmVyN1cxnJNr#X+<}3l|ycY z>rbLPhJ_8RH*#_$#$qW|PK8-a+`NQ!U-(JSUYubZ(aj)7%Eo|@ zCQG(qf;@56>9U0#EUwqc8Cm7| zD=t)s&uTdEn<>bHgy~o_hzPS}DU|n+Gu=fg>xnqtraF$!4J{~AKtXBTuF ztgV-FveKP$>SbAY^>A|NJ63OR+h+l(iK2A8?3p>^(d&|AgObAdnesTe;y9kOTVv!g zBLw0ESZPD6QzkgS+KN^qw7R&^BUT4a%v~GwkKwVn@Mty1Ga^VbT!7j&TPp2Qd1F)R zX;R48>TK~j#z;#ljCffCy2_XoKB_Tgj}TOef}i>xe3RledL9_E z%zM{5zqA=Srct9HZ<@SWlm&GH&Pa0T^On~6#riJ`2>qXy1)yfOw$4@YuFS-v(Pm~j zfFbjKDf!r#z32V0Sy>KVy==nCrxd8%+O$1$o%{fd8KLnG&S*F`lxP|1Af*6=V8?D1 zuuRzyrNB)wcMfG{aoCDgLn>~LD9+G`17k;PhNODsK*D6;$6%qUG)|7J$!2uaH?G{| zRLqn~;}~q0m@!<@Sj4I^Rf<&Seq*E2SanJ8gQ}x7T~dqEK)Nif1wLi;SaSRp8)fm_ zDU)>FyZ5v>-9Ir1H)m%sJIHjj><1lE-s9i2{YcV95MD~Glo&d0&faJ;)kPTx*{s}Q zt^TS7;gp)xx;Q@o@@jf#RDGSFDEj$vt%(GcW68h5wz zD%(aP?t4x`r*k%d@>1iv_}uYQYFNsh8b<5drg0OW2E-eK)TCPq082Z+KecDNUJ&l^ zhp%f(yj+QCJq;Hxrv|}f>@+tn3Jml46!df6=0 zt4EM^;~p||H`i_k!ObVclq_Qm&vpO-zh)r1H1SL=eOBp}4qzYF$wd0`RI&S~uO!`I zoayN3A4|oULs!tvJ9Bmc!t?HLz|qw(LeixvOY9-v#~(h*zVN>D^5BFqgq3PIQB0_) z{aoe#VHGIx=70duCTz}5yJ^rrB~z*y3-~#!v`2(!`$91D*xtNX3-tNN!*(abqhe@G z*<}Jh*)ts=cDMRP@g>4YG>iy7TZDsJ-#9P|sUNcF)iQFYyTTT5{tSx{nncM$0Ow_-s*VFtIZDH(ngt$%F1>B0;)rW zw4ZDr;4XprhWNvU3)!AUOesA>5)6qI-N>OI+s@8q89u&6`hgC?S_D9}vyCr2FkHap z_?wz(4j+*Rfp2P6IYjgFp5n$R-w4XsXy&A zX|{moG$|8dJM%&3-x*Rf6??o9A@69l3h%oBsr|!lCx$^n(q4py$w%RJeYwfbj@#c?LKFX+k48SrjuV2U(>b@Q{F_q0I1*t*%ZO?J=8e;bT= zt?OBd^H#n4FYl?&8`$Zwo+jTnFUJvvVk`IDU|UG}qDX#66oWQt&Sv59Y^M>dc=b$n zqB!*$FomLMNHpL6Up*NhQ-9W5!-Uwu9uRLmz%u`Mn=g!0dP2&wE{<4GgC(_q(y?}F z8;pm((xCXTc4C%VhNtD10u(>%Bc8eP6!Z&bUCx+gSLr-UWmrWVfLJ^xeX-UH_2C+`q}3t49J z^*I4HX2?^u-5zG2QOh^-tw7eW3;Mb1KJ3+y-1)=SG9m5T0BbF7bcHsy#JZJz`nt$u{e5Cvopyx&2wL{y<^rjiQhr z9V_M)$6X#Fhl-$p{pZFXvb>bUEsRroOHO6S9 zPjJ*?4YmLgFsl5qDLkAa^^@y6;I`gKsplwhqlF6Rmkqk2OZG*fxl-RoQ$=ol=a+3F zt6ZF*tO<1=HKw3cL+|t*dPbMCEOzGi$4W%Cu2eZ`aY+7K=wzH$`J4?c$6<|iX|k82ZwtpSuz|aTowR+wiLkvT^sr;05=e|0f84u;c#7sG-#=rT zuZ~`xEsuc*-X*Fk>7Oy!)j?#z$Cg719*kv+LjLFZyZNUw<>d^u+d!DHVuTG)ZsYmf zn{T0Z=`VV0Pr3qnom*g}Ixj<7o}k}UZNgfRhB6;%(r`P=791))tWu=ZY`C`RQJGIX z1$!TToX$bLXn=$IxBi;If#M3ohmuS`n6&SEMNsucy5*NMSg?!lo}!dCorSqjrPWS^ z1206+qI01Cj=3ONfE1RJywkng=LoHO+xM9RfK&Zr|8;F{ zEr^wF{hX_trM=>B)lbq%3~MvA=*?=MyXBa_-lu5*_(-b3%8ruJ#gq~ zT!J_P4x1hG!6_hj=UEB%tMt=Wwq5|oYVZOEnjAK_8H|61wTB!OH&PkWmx95{YHjWr zm)=;AphK4iOXgpF6@Aep!wej?S=$@tgSR9lZxLTn1GrrL`98}tv#_zk-b(@zWiD}2 z(d~m~i#EQ%J?L(ngZE(J!RHDFNoI17<09|#)!Hg6#Tk^N)=d5RBa#jI;fqI5^Q#(HL+MEUXqvFDbJqmr0t+4W9&{1P+fUHu>EgVfSQD-w&#H?sOxo( z^kOa>mC4&drwVuADVTJyo^q#f=yU)h)m&)B`PK5agCwA9&ENxeKJe#_<=0a=k?JkZ zb8m1x^r^jmUBOs~Xh0U=QjTa(z)tue;CXfQ9GH9DUdsS?6LpZjr)V+*+#b z#tbKF@Jiaxz)j5DnA%#PeoDu0pA=90lD@UO1oP}kB;BlTqzLG83HIVbht` zk>7b&Fi+Qjz#@8J{~6&*OMsh>pv|xgKZh99kRD{4?PfzUSptHo-!FQUXM`A!*o#~H z8AfI0-*)8iro(70WrpKJabSOQ360OB!f4-_%~Wveumn>F>cC8s2t<+4p|MZ`#a6&= zYtU5&piz|R3$7p;<;KaqP%=9BKK>s?0(x%E;RuLb5LskiYFhzFyY_0LFcikj*!E|3 zB)6FI0})WncMDP^mvlta(dG2aWcnEElE@Zu;LCb4dNz0uWLBQ8mBlZ{mV_bPQH2K` zPt6(3bH^Y5Y=DVu8eszc$1jiLjNxV0bdcee^%g(BaoKfsvVgA zyRHehu7~!(CWg^6E6X}JK(@12w#AQRF@z?lCV#15Mk~(JSnSZKQl!)UOR&;7pb1h}#}2sq^W?D4@X zjO6Hc34Ju;K2$8{xn1cF=l{gSGeCOMrP7Ma`!@8rc z_l$5%XXu2;6qFC!YNa^C(~k6z$=2v%nrzqXg+FQ*CW9+FQi)L1_!=<EcE?mB*`>Q>-TvcUZ1P{@~+wUOyVUkRnl<{HOKTu>!n!)_|I)vu6 z=n+HF7O1aU*rhRfe+MLV?+PnK9_gbjZ=8e5L(cDTc9~sF9dFdOok-w7|EAc}MKMpP zNmAYJA{`Y3RT|5o>7bFY9`ORu4BO%Y5t_2r0pj)yGQumK(h6Fe&9p{rwM_be+U?}* zPFV0KpT=(1>+0>c)#Y}K0TCuJh`FR7HykFTL=H%96@QZ+d)j#Oa(bze_6k={)(qMV z4WE~KU=!06oMUS2pVe_UJU|GtCbQ(;J z<*>DX4alsTxW$X%4SbW5E-1!>QS|Q56zIMqU2gUM>~)%s4T5$f%r>1duWO$3uD+P+ z#KbeTqbh~i#51-zJ`r>fbxiDIIQr;iqclpU6NZoccHk7j?VFR?KM5xWwFQ9^IwH-X z=)X}e%5aXm?=zQrldFM(ape~LJ6@-0qDhD-XDKZHr()?igt@y3DE|1MDs3uKa3|P> zLnw)O^Cn03zkA8qV8eYkH_eXrF|=mLz2Yum2$|Twnq7Bwp|$wi))8pU#Wt?1*ecwt z`*8BY2GjSFR~APY);qvAD4#dtmG+1DKbIujW>Hs?n>z7c(E8`2iDvu3P21CCrmW9{^20y$+Sv^up!BNiqrhTL^FnmS`Ci(3S*`-g7D2P-9sw?j zz=CCHqMmwQ8n(5H{gxgB&#l1;U#c3>sg9qY?ns4-UZyC7Gk-;K;1AlFAV0a0VKhz*gZD@%&QzmdeGh@q8}m6d~&A#3aa6W*3t?j~yq6z4zx-a#)W z9{%J}enClJRd+yJ$#?JW3Hs{RV8zn%2KMd_x|a*U+7@Q>fk&*XGZwD&gLk^}hdo{S z*)F5p%75GzKQI01o<|0PJjn49@rW+(hyjjz-7^etE90bi&G@+v5NPZO=~nAAqpZL6 z{?Z2ETU9@{ljbRt3R5WYWA*O@YlJKa2X0EAyFQh=Wd)(wi}(w zHVO^gTpOpcP7({DL+0iEHJOw3W9JnlHi99kMX#3LKTJLUnvE$5-QycadSxrVj)wz? zDzeX3{WA_$QrC}#C)vQC%;Imdus+O;qOP0OVR-gR3>)OB!-Oa~X@@nkwm=%@iIo6K_P9IcZJt z)b)%tbse=W@4^tAclbR>T~0i&(i2*%Z45?cJQb%DUWi>f6w<6OcYSy*w$PnEE?OOsLZz^4`CP zcJFc$oyxlX8FwVoD5m}bR57!q2#>iV?0oE+5sDf+S<1c2=YL3Elu`Emr|AggF_(FJLKJr+l z2pme%{f1!$llrnAdU1*`pq-zYmi8ofzQP0*ofb}3E-CN7dfR%=z-L>ND11JoQ}yOet}%wa6 zg$%5;@$x2Qrc3hF>=!pWC85VA5z3U#g&5OO$XRg0T7-F&-lgJ8Xpt(N9;nh`8W$5T zDbJO*#JyISHu4cyr%tm`Qq(-kfSNFC4Vy!dUom*dW*mm*%_j53m-z-BOgGK?Q?8#_ zAG2gb{hKQ?Dl)pVCQFc`q*iW3x0D})iD$#B!)%)Ev-afJkB*I!<-%1B+h;GOU^+n9PM77HU5 z1kG&JH!8_EX3}6W;p!7mpoOTKqM?r#=THsoS|p?DZH$v2{iBXTH(DLoFtyUS_{!42J#AY$4WH%{`s_)@2j5LeGYeg8Vzpz})G6%+Pj9%A`TFMs^a5b*z3h{*xRcT^x(lnZrMF(HzlUB@9Vsp-=2%SE0+>Pg>VkdQD3htNbdkhk}LVe^Q= zD03Jn!YsKE!X(@W0a&51KmKn<)c*f6qOkwTi1dIN(J)$ERQoD0BH|0i3hgWiF`)mk z^9@OjXcIYdNIUh?lC$E?Adpz zvOJCyUc4QXjf6R!1CRd)d&r4$>Bx*EHlkE9*9ZNr+&mYKh#VjrKs{a3D-n)XtmxKn z@>`QB;Z1amqQT9A>oJ(uCUj{yCx2p{&`kxh>i{jHy}e%6(vk1H_8X?BBxmj9;`bWt zKqE!Y6pb7Uzh#G6zL*~8i#aai-3y%1m{r)D=`S9FZRofU63X{90trLQ} zdQ25X$(q%*X}^oKVoRzRt2Z`aN!anMLIFE1?>Uq9!yMq}sO~>Qt}o|dNJ6t(s2pN3 zufM>nokWeT9bH3{!y-?K7`v@+2ln_9cBHFOlwoI0#8geS#)^>7sss|k3TId>n#2+D zegsKR6u+zyNZD5}VO2JB>0(ePWjruo@wA1~2Qv;}YR|H3yPmCd+5Y7`J1O2|sc03S zJo^#kJ5dO;P7q^{xr}wz#;}GUm($;@A8l&vBknwzT0TosCM%3P_^E8?QZZjg&rSiE zCgorx`kL zCN(Q1gF248``rt}4|)8Go{^lDGkgqJ(*Tp|>c@terKZNDD@K(o71#jwofTPFt!P&y zl{*ChJsgVf)^|_cfmuV5t_^+27(9Z=%u@CaxA@{Y7dV*ft`xjb>=;L3Admdw6wcKz z5#M%%l(B!0|GC1}Wv&&JBK-*dAQW{I=PKXT}0 z$cyFE$g3YQ%a@7YMPErPSFxlQK>=g~z7ySHi`W#~m$$}Ez09x&UL0F~es(n*q|8op z4I&Dj5$Y+R-$7H-Aor+lc+H0Hkr(bu#9)l!UMrQ{17o76=(PUDTy$0AtU^rS5=SsL zW}2k)N!KwXYzrCLccS&eg4}o97_U)?!U7{JCy#h9Afve|l5H=A1g5VsC?qq{PAOZ6 zx#K3RRMS@o51(|)f%p-2vM&}q9bcwRVf#fVr(wrk)MI@CV%s^8%6m%MPAL@p*s#rp zOw%{5i~EQXS)WW5X2P`q0J3_=gssj2kt&qPbol7^F#{AeR$z_Yg}f5WA%}2eby$zW z5IBnC{-W2K=zx-S;>Z2vmQ=^EV?(#B|1*Uq?(4lLO4qQ?;dU~$YR8}i)ejN=+EZ^B zF1lK4gf!;Cj517KdSQ`46;^lp%-gxSFvL^4z7%oiy}1%G|GmA&=S;DUGPMZ$1l9n= z0AU|IP{Gfa4K85c{@1KTW%QxN-znjsz8k8-`lvMlOJf}^QppDl3&5WD1CJ$owCz8e z%I(Wh8+2rid-=RO0p827D&(GA5Fz;zN!z7d$r<8$90mKzA@G?SE07=F*Fk%{1%}-x{`I)9y zYrI};c3?Nfe&{CUE-e#(A+osfy<}4bVo_3OPRInBEzp7Z;Gp1Ytg4NqP#{8Gu03Z9 zl2{muFw>`8z_!vsAa+^?Ozvy=w^8D>AW99?$Hb4Im-_%^m>(>Y5I>Qo%I8DF#>wsN zIw5zvjmFD!Ip89G$>yMLTtn300Z*dvbigTN3!(&IA6xP@SQ^shMXe=H14|$hzZ)q9 z`Zg*Cq&l2{g#xnv%>edwYtf%O<+goLX4wKfU~Z?e=&P732k_bKuVDDu0WkuD&Yqvc zOO1jzOI#4FXE<_l(WN6vA)|E=`!}Jc{?a)CqWwawF|RjKUWTLve~(z@cDvb zkcQw~tjD?C&%1+gn~gUPPBuh&QY7sqc9|?sfsNywAbXrYt6fp{c{CDEXdE7}W|w*c zP@KCcd3%$}N;|Y&>@@>De(_0;8|~I){oSF1CsY6}uFBTB9QC)|pBq1aEAfZQF%D`a zP1N<9t(^Ss(t5ku3~imWTRFn0^aycWHk4Q6Q)@S%QaE=rJ7~8TI9 z&nOVnZQj+>Qq$SkCv27n^Y}a!p^-cvNKzX1DCDZ{N>9tRDO!# zN;SU$9w=7$>-jy9-iBUyw8%O7-w;@55%14MA9U*kVO0f{T5z#w{7e2ux6eL?;K9YuSL#K%HTa)+B54^lnp_jlBUrk)M_tvOtVO-5E(M6XZCUmcMpWCY&*wi+YRCm` zXke=ncSNq4$7;U~C)IqmIHTyc+ng?TK4~=W^&RLO`d#%qOspw7+fno?=Vd2&yZhYK z&ZU-Za*qC7Kz5^`|2?Ry85~$njf{osf7M6kcOCRC{n8BVT-BzJkbA*sIc&z%hX1jl zC(y*8YvPVVfd5t#B`J~-cqfx$;g4eKP1ac(^aTvA?!CN-uE|lj8(v;5bSeGDyDC44 z0fU6tet0qFOA#Hs`18YVUS_dW<;5&!k2h|i1sK$lCQn}Pq8qs zsDT4h&eBA7`!Dj>+}DkgRAl8Rh{-wnEl z^#KHU24a`03($IbrfVcd`hld**~+9hHi$7I@dzRb0(Z2 z8-4F+irRnQ=2mZRUbzewK2n0E!>#NH2gkapKKD~FbZsi#i&mn(L=Sr0nL6?s~vKcvpvcLFe`iQYBuV&+)7~O!I6ti-3aAf8> zH%36X!6H)Tz~tuZb_9kxW(#lyJw-)3yD26Kno|uCFYKV;9SL-JB%*l>w~o>-oZXap zeK3zGwafyDov`oW<=YRX6~&=aLi0v?V?&z+$Rr$g zdY@ayC+9|$0LwQ9i@S@b^)6Y+NSC(xgoi9|y^C0D#~s5qZC329L=q)H%uCKli2bEH zxRD$f(Sbx*(k#IP({*ct9mWl|A-W?RWLMuAk_FtiokY=&dZhVwz%~^QAM8azQ3(`2 z|5woohqMIU%sZ5$V!d8g~@vcFje{8)eJ?hb}i=_}yy8UnQ4OEtBCcpogshXOpI#uVU?^e~iSzW99?S3ELM;I!u7}KAo)>U0gHvAZ6 zNEXl{b&I53zRWxyf^SvFqb;8=RhJ)<4u%6?Zc&Kpq<%Yja5hz@C6W@JD^0dq1A!u7 zpIpuvbonpjc!rHqQ`Dwcx7J9?Y$d3{tmh8(h%RGckP5?y;i|)8P<&g7+8{l}EU->{ zAa{lHUhb!IQ$@2*TAi5Z9bz_1h%c%=BlHkNIe2UTJWp|15>>RQ`@2Y$Lhkv`wPvem=Aw7eNrcr6fieVr95 z_IMsM*6Z42#;f035eXn0ufUW{~Dsp^tt+xzdD+x1#m3WAMzeRtTU+O^hS z8A%ipW&0^YoME}ycl`_5%4FDw_-D6DVZ5!0O$h`F!UV_X#UN9z27b8!Qs7vD23# zvo&Uj9Gq|%(0;$WAWi*76pH0(*O{^XHPa9`mGi;cW#xhrC`pOxakDnT1V1EBCy@rj zKQ-vweAOfR{ba(>kqf)g7%GX1as=V0&8=2JP*;#%3F&q_P0A8k7oqZY7La<&(yyD3 z>hw1M?=Ck$L%1vil+u-Ci583((6zutk+ZZ|`LR=Hu z|JNY(Sl&gyq61Y?0gNTa*beq82Fr16{T>8;%_eheZuIAXAU;xse&XEGvlDMVmv^6P z0hqTxK4CefNh3RT{kx!xMCOo%KC2n6W;hqu0Pu zd=99OmDAyel{=AtlxX@#!xkt@;OesgGeFn2QwZMrFCY}k$CWP7ZRXFA6)bm1QJ6L2 zr3|nKLj!xHzc%3V0#C7#zl;v#26#(wkj2^=*XR6;)uZUn(HW1hFoQnj(ZUsldU+@Hfy0El39|*!nbWKo%rrq!p%iL zTn`BiR1v6Lf#dinW>gDLD2<|5#X!3vk<|#m71WnFcO3k4A)NLurLF4&|3ouSc%&+# z*?^Ytxf4Adz;f<%Zg{ImiBA)`^VQ7~ppBbRXp69Obi4i1TL+?sxV$N1S^FEviG+ly zg+9|<8!KcWJf*^ZDe6UZQcXDbvu$x73U|vGLes`$=$M2WL7QOz0i z@YvJbSP7lXu*Asx!zf)e2?fX!(WeC<+JS#4z?3-7kp)080rlITBFGfAKR1Jt`09tG zSc$C7TN;7X4-cHfd^2%``f7D*(&PGK^~9`<-8j{2dE`5&<5Fmxe=`BLZy2ULEl30N zwgYKuwZXD!a`1J_M1b)G&-Ol1e!*K1^a9GtQ3TdkqM=diR6G79uFMl8H^vq%skKL# zLUkq2ba*HL3ccWEVF1&zx4CyKcB^{E7ci-HeMBfAl|MS40~qp17^0dgU1lKQRz`|b zFSck+4K?98$`XnA35AspdwKXU162 zg#!2BKU1|CLx`jWtd4f7T~=;TR#r?(KB)Wh9Ipu6Vg?KcsXjF5mjOca27~Ca)Cb&5 z`O=$&4G2#f>Vt??1G~+#xEvWu4mPI7(O3UW*r39>3YENUP>}UFOx=5n!ds_qYHg`- zIv}QO%Qnx1Ehp)i=;KXp$o%Zn`OsFa8YtdvKN*gcO9KGUAA2;xpy3xb@?@L<~m5cV$@JDB^A;ZEhsgrSml6!K*et`X9>TNKcbya z`0L&N`hLfG_>hQKJX})-NlBjD8Z1p!kqYMf&b+ldDzl zWriq^UVW;DFUiJ9uzb?Yb5zx+)0S2fH*^n19KZ2*^?I(-ZlJ%Wn%O>UG8B{T6PPiS z_?oLtRnJQ_3^06xp>fAc}8Ab~4A+QD1!jI{*%+B9U3 zv<9@mL;_4#EBo2Khb{V0weJRRd6lcuX{J{D9tlnB^g4H<69YhomsdD&EE=B6@;olS-zJT|>~hIEL4{}(}uyiX*R2EHeQO!|E& zCyKET(e4$K`!vTPSCLvl&C=)dM@w7Iz|$IcMH5}?k>{UHLk_W?X8=EN+=w?K9sspb zk){pWS$kwBft4R!QbmVmhH-*e` zq-Dksoq;lsHi5*l)xIWtQvJvyl`CNoteXUj6zoR!dZhI}k#GorAx<_Do08W?Fu;7! zK9+{RJ!hagS$;1!2*;E!p6PnZ3jMpBfKr4)AZkJqQ!xw-fC68k9X2UXoX9AGt&h>D z^pkb%Vr88Oo>Ij|0E{iZnG+Qx5A2AQ-yXJI zl6Uakb~Cxy25XUl1X_o95(gn=ELM>-Bxf2RsE$xx3|MhAQ4j(M2}g?9b@P9ry~B-y zH92e1l1>RoYj2o!Exo>7i-`&!={ z3bC@Uo42#AcZjT^W60}CYN+7n{%oT@tM+yVPLB$^m5om5nJ;_4tHKEn;s{wTI`q(B z46alrT2zmeSyJ~=@nN!trai(OEkA0^$aW1L4sIUa%poI7N0Q{pwzauYBTIGl$ajbI z42d{da^g|64^oKCsEancb}kgCE?*c?bd(4y)<5N8MUX{%;zc1&UYo!b9Wc^Ms*08J zAth%tOsRc1U}uwZF~hfZY;ZgmHauEl_JN2MN1FT8bb#<@B+m0RvV2{+$!NO;^?k6N z>b-9g^&W^s7V>RP44fv-8vLR!l>^2~gHpQ!Qfid6lrIk)JeoVO*DQxx4;J+T4c$?b z0$!mLU0t!kNh{)|*wmm{`FBz!PAYJ7HNfAWIemf}!KVV21*9Y*GDnqzrtms0wzd}v7N%e+{SaE{PCndO zOWnF!g}!IgOUssGC+7mhm;9^hluTq;VU4PA@Ge+VNf2wO z)$h>erho2R;$MALu0^DjW+1AA-475uF_xo8o6-z z&EGxe^qcb*Bl4SSb*#}p_wn8(>_FeJTAeCj@!KN*jr@{imN4a~KM}a-p=l;~)(4&} z#dKtp^#C*nbZYnHV%odQwj#o|wwi977q%>Jq?i~$d^=A-h}7}a9*t=;o6l;r3bv1;j-ZlRBw`WZWmez?1};!*!#qq1JDE=Qs8ck>WO26Kod-I+ z=9Nq2(d75{&$wR)V&?6=CT5j4pHmL1&PF`6|CQlIR4)l2qAavv%7*!zXUV%?$nQY> zNjZ(Jsu~gdVnm$iF4EYX<58kX&w1O+vnoP1{sMt+!^{d$B(KLM%d=O7*)b;0y4Z!5 zJ2>Exr^yW59IsduQ@eA$+~t|7KVq6+=*}wWcr}t&?DK1qknm)i`3ADPuKfm%DUI&j za#5HiyQeVy0=h)eS*zuQ{E<698moheR!pr4l>Bi3RY9t_cwm9FqQ_jE!&OxpA=w~% zg7_O$>cqvdAnS-h|Edeky1^Pcck@2FR5TB!`tAAT+2|PM82=ef)D&W!hHX0NI-HU3 z{edEg$tYtmP~17*5BhFWY&&9$-(#lhb2Yg8|B9VRBli&-fJ?^)3!<6PK}Hs&vW6v9 ztP*bE7`!tn30lYKU0M~FAFxs$3h6H>Ty1XS)mZ#KW^%L}mea!RNXA>)6rouD#VN~F!`CsL;E<41`WI^;K z*As1O`;K|Gz71vhr8OhLM4{XRIL%>&b@zwPm#pbu)~AnsoMH#6o4c`kiiji*;}=8M z)cc>4XtAD3LB9=on*li^$|2lkBl8lRwY3?n{&_!Qr-2sALlh?K6M-Y`0Tx_f++Mk3 z_3sHzqMHO2vs_Z<&%HLxT9q-#hyB?;Fm^~*r?B(W;!qW`Vf zi5IKdfQ#;6sfV(%*IysxhCnEQXXCUtPTnIhkY`v-ZmI%JqM{LXnjNJ&R~0G+hP_?< zXBkFo@tH@czb0Vm6})8Ua5^P(;NmBrIi)}uvN5!Avte7RVKoJ?bq#(>0VbCBz}ju) zS)0I6(9JLbpALuu+G;qp{v_E8ipc_R}y;iwO%L$LbL1YA!Bx{))~^c z_&xKl(~Z;71a@$HrmlQMZzdXVLL@2Hwo8KB5K5YB&=$n`$Ty9UQ8R75KsXIe8p6lv zjscW8gIBr@94ZhDBpD8ZA_U83r#HuO;;HN-wAD@3)m^t!yU`{I`&lMtv2tCtrOIlh z*7s5YMJw)GnvAU?$Ng$6bfq+(CpPQ^i~#1|)Kc#99^8l{UNG|)IpqyhthkB52D7{w*LTE#XlcODQ!Ev4mm7ms zFNW<$hCmM?kxA8}CTXvBr%dI@{ly%K1L#4K47wHyCmoBUDkdYFv+DcV$16p!^Cj=} zHuxDhC5hiRX);xlJ9Z}%E$8%GH6OE?{eL~VN}^`p=sb=TjrpZwTx>Q_CLma-a7vpF zgoY>57^Y$6E4=)LrVUqB{WpfI3~MQvf{+a%W*Jnnjl|`S{N>@v{^wXGJR0H!$0F>2 zSm1_rP0I^y^f|5OJ?Z)q2dxb9n+GkfoKWd<@C7!EHwJ3ON-T#^4wonABoMCRM&^j@ zK-%7xb$+g(c{vV>NJj}y>afdnXVGeD@bxp)QN@M*%~~wPmZK^aa$=Ap;{BsECK;;3 zVsEI8?`o08X3+b;ZnbeBD}FmnOBf)snZJG>lQ>5Z?TsSF!Cm>Hy2 z5E(p>VDuBF&_SO7!4GSvp6pCIIu?fRbYgJ34ytU+rmw@LofackbDNC?Z`4O0PcCA};1}!9u{7bWQ^f2sc zDsW;6X59d+20@!)j7n0eCySDrqDf6Ggy8VgbBbjof%s`Cs^^NAp0E+t^H6E|@!T6d zxVj3LRxZI6xop>{deA)3q2GE}+yTw4VXFmxfPv#HOubVAUP^&m#NMLXs_obtxY<8E zmDbE3*R`qy z*uRjTaV9^6O@+%*qjv~pQY=>5BzM(4oc;k*N_Kp&anU7q9?OT6B0tHUj$y@Qw^9rflb4-&$E9 zVQw>c-!>kw4A1PKL`ouVuUgBvfn7Y?G0p+jHA zVF=6suDnkwR*<$LpdTk-DS}{UO`j@uCzo8_ILcn3ru>lYbJsCSbPXvIv%%y;T!G@fkE^4x#o@UHJ zJzfOy)YuM6Z`ki+V1O03Q0Im^oa&4DGIMQueVNb?>9Kp=;cJM zlYv*G#5XE4=8>c*Tp3^zHq^2QM>c;zofw5SZg@#h*6C|yib}KUBM^!-qY65w;eL zzB?p0IO+P6wR7m;xfxVTPehB$X7}1EYz-P*yqgywdW1zr#p?B3zHUrVGA=SIFy;wr z_TIX@hFhbYB{7@$J5LDs;vxe4drPP z)HQ|~i$nw@5xYhhkw3fKS6UQ~pKlnO#|%E)UEVWbnL%|dYKU+a6N&=;dG;@p{de+$ z5E^K;x3hJH4Pz1LIo8_!_K(*VoqZuuTjg~OFttM}aR7mrniq{RKsb?H)}5Pm026g$ z*&M46nHX<+lJ8Q$+!h(HL6t6;QkFJKimEs3RTlTYmy9qZf<#b*Ff+6*`3HSkZ%|o{ zqZI3?h7-Vnz_BPHgwWQvj8J1~O)hFJunJS8v7(ShxibvkVWy^>Bf5oyI)h6<%T1j2T`h1f4$fk9OOJ#~V?^dF`V8Etah5xiWOZ^g0!B)~4ey zYuN?`%3h=iTQ9hxy$J;IZO_&B&hPWV^!>K%7lV4cK~d$eRwGf*iEbuxIM$L3 z-&%b?nHnODqpLOZd`5*_mH8ClL0CCL5WD-aVwgxcKOldwwbg%AZ#y;a;SnLQ{-Gak zFH4;`N@xUc5`1yMty)QS>1>PIMF*#0*jksHr^DEKpNpp*>_LnOV*Tp9UJ%;{DX#7)D_6&!U*y= zeORPFUx(1%N6UBVDo>%;nUB?Y?jIM@-bXw-8S#SkWeQl!+ETjAw@=A9P^hO(_@}&( zTOA+kez&@xRo>C=W)7047EmZYjw3hj_rONLfYBTAhM z892j5Xk6|B_%p`C6|k>rprLC;N|_H9!Pv42fzO|_#@lD=_ACBfa74sDp92?zCHI0G zn6(N2(UBbg(UBL?p_r6X;y6tNt&)Erh|+75|C?KV&KS`g1?@bS(75J>aKQ^NNsCYSd^B^ipDNJ8{z=BHj{+O z;8PR?tZ;)g12ygEw72&Qxee$yhKO8ZltM@QN%1QkkOv^P0@$ST!n#NNoj@#*$t!J| zpw*s_9QVhL%))oT@SsWNB`Zb#fGW#*mtx0Kpy9K&>;WWQ&?r|! ziE-ui($aytdXX0%Lk@=C_OPmt43ql$Jg?opZ-Jj5uAleeeTF$%IvJgkfA+e^ff2%F z;XKRP3&vLC@Kbs)+?DlpTc~u+T?QZ%c)P_KrS@0=yG2M4;o-&EXHYAq{U1`gX>3KS z^Hq#DL4ca&nW@=4AZU}d3~9dxLKsVJKDs$!lNHK z@;-AkN8mIz`_?GiKk!wyz@W&q+%mt*K=2C=5j@`V_PWksYMQXg+}u!|-(M}iv+3bN z($9pa-0zNOi4p60>KlMjp=$x;H_xU|bOv%VbkOAIz3k`f6R}2!WkEU=lUDCa?6W`Q z9i_$Y!Lh2O7Rjk0bp7oZ4IELBu?@ts4&O(x%E<7ie_JK|uWsIrbKPK`UmqZFTtd5j zD2sRh>&eRI*Jt(8rsjRCYVe4O3dA#|lfwt{uiZ|CTesK?8H<&0bBodhNYUeL7Up1c zatO1V5N1ZwONS>FM}@_dCJEO4(Dh$VUIJ~#Gx|^^7DZLHeh;|sSKM9{w}2Uc;#-ap zk0wWxiJv(`k{`JE=>XR2sbUBWKe&i*RdySiQ2106!)<%&Uo1F_Pm!t|6O_mxh50WzZgsYNQ?qaBv=C- z>O+Mr@K*q@{lZ2Qa4?%*ixVK!P$N)b@nDd7GO|EttgtJ^+-wBKG4Sx~F##E~TIhDnE3gOT1QgZcDH9LLhA(l(m zn zmyCm|q;K19$8FHm=xD}RkmmtIFU>t#1`Skv*42<|y!Fdj2J(nOF8xO6qHKpt=mky2 zMdBnbsl(NoNLy>RUq%l{t&^1z^X-~?kxaTelC+p*;V?Y%WCeji3K+-yAv&|USqVwP zb;&hoDV~bi7!KSgulQnlIv(&0*wG_-MF612GJ6$HIG_7^eT6`N$Q~sXOXPMS>K8t5>w~rwD_LCWz$NLDxaGxPg|5poxs-?=yWU#u9o1+c+>1P zmYR{A+Q9QDhCde=$o2jJMMkPK#SrGH^Hh6;T-y$UpfJdbB$J@TDL^45IoRGBp@?Fe z%97otwY@;GEdzQjetxoC8kElwJ~;~V#b&y|K*ruD^)naafsCDSFes3f6e@?x;x86j z`wz>taB5;r<9^7aW}Z3^PKIe0*ajlF6#pvtbjloo4p;0lx>I z*OH6xAyRd$V69O>P#GxfB)e#KDsiVQ*c4Lmze_lV3ALFL0AU^9CbSX$sE*sG_Sh!nF@Z21uxR{f?^{KalfI3g<)q4uAgrGe{v<*h`$^t{FQ9=Lm;(HcXzd*$Wk)ssE{4D-iRp+@_BkMUUV?tR) z^-=(?Q6HQC-NWS`$Rj^%+WPQDwy7r_>xvvnMiNP4zCGQg<--lS{KG|JSs|b6{N45c z;w?pg+M+dZluF{R34IX5@FXg$f^|N|5cCZjlli<*sv#;N9upay4^mWJVf?CK7aKDzZ-)bRwT`!R+;I^=%@=7t5N?Eo`A~p-JX`M*ce~mA5Jw?l#HZLF{ErX@mrA(%M;4 zZE(^9(H9_{EK|oyO*0;x2rVQ1x7{?vDjXi?5qJy z5+I}l1KzD66!{+*iRY1W#IJ5d*Y^WPZZ=O0qX|-shFEB=Wp&=p%AUlqlJmKsyYNET&mnzZHvaZsJ2U0k=_@H;mMn)_sCki%KvYbxTM5M?QqiyK z%rSOaAXa#ox679#B|hTaz*4f}w*(op4MaS>GgUHUKmlL~shD6WgVb>)@)*?*zMXS) z!^RS>jyvla{@dJC2aF1~^tU5~jER`|Z(!K;h(B#caJ8Oqe0X7P29Y_T-$a z+=mM(R;n&E3&Tq>%8wbTgyz2apBc$L(ELAUt+L_11B?wTEV z@BonpwhfMU2`q77JR}Rwpdf<3=aZG4ncNCeeskcJmvR%C5`ri6#&|6+H|8Xvg&m2U z6H*WbP0arkfl9qG`)EsHUxVSN1bhB7BTdx~)QLz!s#zY$04t4MieXR42jF=H3Hhf6HXzG2LXP$>m10r{26nt+o8KQb~U6Ck^J{ z*x;&U$q?l?k@hy~${5j&x1gl@a6;x1PirIKf^LxBr`Tni z{Ij6R5GH?Dc^Pv9f8EvKS1k}9TF04HO00qL2dj%Q*M0m0Sz9ilL=wF~KK9y4HXvgP zNsLjsnFF=`{!Ub5?ys-#JDG_EC_uoR^#k}5S|c?>74cW2kupmzAuVP#v)n|)f$1pv zVpThoX>$wbVtHZvjit=k5wzh1e-RM{yhC9v=D1}Wg37o~39%~)3z$b6)!nq?D@WAA z9l^t|loO%v4Qe^j!{76A#u00|EIAF1*gZ@w*>C$lgE!}Sshh&ky875bw3;0y)Bwr6 zyHa}+L)tu9CpQ)CElhwf^J=L!xvj`}oy(&^4_s7*6^}Xoz#^|MgGr-FA#(_alC7#O zYL#-kNNT81VYfme75Ow+i>zh+db?J=?VYBUW_2mqU}-zjs`MBdnK1>r3jJ}y4;z`+ zszENlYH0`a#_=rvu4qm&w;lhj`X3v~L~6AW!}DSy*faO|&%pj98>DGj; zLRx-v^h8`_*mjQX3)Tio#`3A|z6cI1q@tGc&tSrU0Ohhn$~dVH|8!8ckz?;)#6#s4 z3B>}Zmp3+t%PVxBUS4|3w3%SmR^rcD%;-6I$|MuxHiH=!K030l+tZ`pBnQf6GlST( z?WH3VajDI}{sCEo!o;OBOBZsWfj3Gl-tZm5K)tU+e9J{4&V|mkG&RSD+_brxy=E(5 z#tSwdQ(nHC8#QQT2U+-h?BQ0rhUid2xE{OVUcwwyR4+ati?2Tgc7T7U3l}I65E@ex zC90tBL#{$PX@^X6yMHlHLZH!wd{ei5gK;QP1-4d!ahn?wG)jqIw#X1nVk(wK)j178 zt8)kS?vqO<>-V9IVK#-{;$4Hu+WCdEr1E9TS2~caV~iaEGjs#$r>eB37ZH9NhZET% zy@GbfaNMbCGLQR4>7WN->{qPYFG=O7$ndPs+jO^yua`e)XWDlK@-g`tHcDT)A*G_h z9%AJW;_s3pvfR?|x6BJbrYPn0chI860o&BSS19qI@5bZp!Rl5P1LwcVhIA&r-14Yf z^{_hGYWUVKMy5^bKEEaG739|336UytjZag%UDlPZ4_BV$1%-3TcqdU9{Mc_S?GfC* z4q3U~iLL(%Yb5c#@dJ5(%_*AIE`+9>4F+N)F(GN!`O6VlCwnS_fI%T^7^{#{GpdJ6 z#cU$iv*wt+0u`c@ssgyG4m8fA7mRCqDc37iE-zcRpD1W07#UMqVwl!#19RDXCV)01 zs+X9H`tOdEa?D`WN;96+(8-@O7>B~KM4Kp?pQ#ogm&H=ro`bUm^@f&gdf4aIBvr*5&NcxB~czwbOZgHId6 zs2)*TOgjfeSf+Y>YRWRdCnUrBf6I{=b^aYL-y1_4l9$Ph=jnu!mO9n=U`0qysb;`r z6|ntmUq{cf+aEWC8_)F9`kG!3nG|p3>#IK~Uotj1Co!Z=kO%5#KXBxouB;gR z=WVvq``mpArF(%2V%^#ZL7T9x<1K?S9tp4l&8x_R##h1x8EPSZ5tp4*+aHi@Q~oH# z)W%HYWZ_yj&6Xw)psP0^Nyxp`g0#bU5iVie)un-P>Gph|&$Zj?(oKc1q`i+aKwTKm z{6}f=b(R(g%TWII?bE&fO)lbhK7&ShwTRW=6(9^D742;=RyR}k!XqL=hBTEvq6**e z6iRR|#ZTyNQ5F$HFGv4t7pM3Ua6O|-fnV@jB2+l{Nmq((n;LAEoR}}Y_YK7!xKZG8 zW=6T-QRO-PYIIG4T{{Tk`Q7x3Cx1Fsb3jBdf{;-(4jqTd_-bJOJUK(34sA3v_3B>B zv~Yqasb#z+;iX`t#u{2kldnm#rIn9)8@(;T)4WZ8Oe<_;zX>|NoDskP-tO}^jpbE< z7Oz7~OsPY~$5{1H#!(@3NNt(nx4!eSnyVp-;pr5v^?mCp?I9$Ow;yrU?XRcX>_ICI=9Q@!hNdsJv$ zJvw`E%d9bT`t^YO9gseaMkIqqH2-f3QY0fwky0us6*Oku4qTe zo=V~x9E))jSD-6XK^xO+QBl0QLs+}k47iOs(+dU+3O81D+~YD^z*)y$isE8~j%b`* z1emZ9J=!g`HcfCYE4uT~(P&yO0ku3JVVwp%Ih%E*&yBISm1iRCp`q(lHYZqh_245e z<%y1V$27g|J|EWVlFW118HKgF9(}&=2)n2wH(b<8n7A44h2mvdNlqivLf{L{BdLE z{Man!^z32*@rE2^`~BSft8$hW$%)GZo-q9Jl=ed@+TZl}FWI?+x6`wLyhtM-ck9C$ ztCWT?$MRmR=KG#zR_lmkukp(MTgmYc8+jFxiPFKL$X&RS@MvRM2bS8A%|KBIqzrC? zLRlGLBC!c2OJxBjnX~1)JBuO##L#I6R)B+g`@z$mS$sd#ObWdc#FC==9~)V^v(SI* zx2)sHvec{!0t=D3Fa1Z(kz}v<+Bp)*DtK=i#PYmg&Qyn^qFk8;o_RVoCs=7k|9NF6 zHdcWN5kr7X8i7=iT;@1S%E86w+w5ryMM{YJXsiQE#1Bs!biK#llZ{VX)2q5=fG z^;!NOFoZ_p+D1V$xYEFS zvIW180$7ipDRZo~>cNpci{uiWF+H?Je*sS7kpVR^uweyx_ zZTs%9QCqVy|Ni%SGwsV^V|k0Pr;|A9;Rw+R*CQ1z!;2eUC&JDN&RZz=zT73!FtR!i`lFy*)g6&_+{yfGC zf(U7NxMH<_jaaMfGI1Wj#0N2nLFhO?Zlv5DJAoF|YZmCVZn&1vRh^d#(>_GJ11E%h z8M~*=6t7mW@EXaoq?QpnDGiVpm>UV*7}25 zgLM4)l73DmVCeF+^N|)2pa0yo63h@Jp=#!=tfzj4|I5^GUG<}pZtN#f(f6%PZE-jU zLE6y?smSvd#c4i78Zimp-r4y>nWM5fnR63Q`r}68i+nrZ!v^IjW9NetZ5KBx?#)00 zD=C_=>G!KkConG{(IT;c64v6s9U{iJzZSX{57#~?<$La{p?=KYR8tN_{J4?2vM}%! zJ1GmcrHem_TauI38b0u&sd!5qTLv*3GvsBt-}}M*r1AT4Z|4z=$EC6?H@>Td`n+&>b-`0O3aPu4QXlxh$$K;Fq-y zXATrZ5^Okd0I!oK{0wg+h``!82{c8bh_qKiwu$dLMTUFG{yMVG#1!c@vXrEn4+e{5 zSDT?=yJ~({FQ4aDUaoEiza+Kx=#_;yVaTdzkLv zVa9#Y1+nO|2xmygi+j&bLSp~|={uz!v1J0qVd*KWU0csVw3nCV-dQWg=!3en2F)3L zGWYNC;|M8mv;u6f_KO|VVVsXiMWp*8o(@r+>ar}D&{O)nc7_oENqXHf&1<6Ls!d?G z1NlIsOiHg3O*aSjmXnBo{^^Znp3E4$Evu3XsmGc&=mxA7^EkQEE+BI{Owb32wFUjW z=+;M^65M#76^iS#7@??;ivOgb4C_?xPXvogvSY(4T+{8NeOU6EtUR+7YGuXFOWTE^ zz}mF(3Z7kDuMtp*E78j-#jE7P|FDsl1`VUpmCoWJ8QHHp0lR;wGdaGg!bu;dv5P;9 zQ&2Igdc$zQze6$&nZ>HVJupS)@`S^w z#!hBL0_HZjVv~%DUx=w$#ZD_}`gwAROlbI%i~}o>x_RP+^2J=}(uVIcMM9b-js50( zR8=D5G6W>#Ef^$iSya{td%lHV+*+{XByG|0{6=73_{Q)^=9uo$|8pb75_P5zx-621 zsaeiGMB~%GpVIVd=BaU#x7)yACn;#e)}cLe_Lvi??ZLRUc zF%H!Oxv17%)ZnbWRd=0<>8EuyRQ^(*S`cr1KOBluH1`qT?_KQCS6m;VL62lG9dMQ< zD}w2CobG4~8HZ(eoRdsa8#ubX5)HAk-HCWBn~BeXSVzisj>ddLGc~lx$#CACORv5Z zJi=8E8Plst$?6JIrJBPn#EF%pg+DSL~kZg1BLs5Ye2{n4FA??1gXiRo<~O$x^_~^fzv(Z&#Q~FMFl)CUnpvO}!K{#YB<{bN zey8;B(8f2c0dH@Mwg`dBSs|eYAbz*p3_laLv5i_ET{7K( zreN)%eBP%ay4!9)egVDB{&8bVp|fTxBxwK@E=HnN7cadP7eJVsP!QR=W1@vB`!#-_ zW1CPDQ07Ufn%5*1n!qJfEkTp<7%*ZisN9qD14pvqR{%YM<*^GSS+YA2t$+)L{jB?0 zj7c*&DF3pV{s%|8th)-tqsxRFPFL)WHoQ#zW-(R;B^6~IINY=tg(-~s z%2Ov%k5U46$MKekBBwPed`7(|p_+ zZ0@%w!fzAA^T)Uq^b=7=a`OHY0(f1!sY5aE!MCMPz4xF3^oAn(%7_xE8i$IQ(TVm3 zz~%dG0sC*$sy;Q!-l%D;z0UchLH8N>NoUr!<}w8!E!Fv6PJJbI-L@yiq zRAQ8qLdpR4yZQMaFteDLr}H%EGVxBgQYLKQIG5KLnwib<*ER4M(eO6i!t9@O-4)gO za3pghE!Vcan_31R#O3sP${*rn=bc(t2Wo~XR;Kx6AC!bsn zm#wUBR1b6#{T6#ih>=ucECmZkiUAtpL?qh>-TEwN{YU72WP1HFs$Xqm3j}|))@XkY zN!8xq4HXHBVOGB|!HsyOf_qvyz|}miI>FO6r>ilYt>GC|`|vGM-f7QD!o-Fo5abGu z%*&5-MGLXtt5n!ohUd|7SRSWrm?%VFAAOzHj3 z8C=w$N-UBgMcKpYGQt_CJXC1ku}PFgw?Ns76U<9lsJTi_XEP6`^ zjZ%U=wn?|jb{%(<#0}|v+%DlHA&!(;S(nn(Oh}h=qfMYUZJ0Ub+#zz}qIf7eXlfcw|CxT-bsu|XB~`)ABZXb$ z-(Rm#587uaFf(DcaGEHOF`O3VNH<*-5TGd(CHRVU*WCY$t$T_NCET_}9ox2T+csuw zd&ahH+qP{xnXzr#x>*!^0ZTozM{2aXTbQXys@ik$Tw&pEG zgTn^RPL6(3GQ4R?xE!!Um8@^?BAw4D2%EOg2VKKG!t;RE`+afWoQsyDSNY@Se)a^* z!~J}=@tY4=$;D~pSKBYJO!fJj+Xd;IqzbwuRm90$&+)q=iwpUKF73+W>PoE8(;UNrnBPP$G3w1kvkSp8#XHjawc3ZXJtl~R= zCKbYQawy`Ep(F<}h<}`9w@s?khSc&QcZF89}&R zbI=WvD@n)WRkAM~!}Kzs{<{In^;&lzwSaA=%9s3``ctn$>)2!)%`ny6R>E<-?wzQ< zd^DE8(2Jza4z!N#i1U*C*wzXBF0pa2t}lzNg-F(bs)XL+k07j|356Ba#HHDYhh=VDtLw9G=FoQL3%q2vNOcfWWM^gqj^Hg#i>sjhx$x|Z&U z78Iz{pQMpFy2%OazAX|(5S?t88r~VwC8K}ctBL}F5_Af)+untF$^c4!IdlmXQO)!W0BdTHt^g6ec)K4TVMKUVWq3G8@M*4j} zXvfN|#iU8?MEd2BwD4vg+}k!N(b8ZzSxo#^_wuCFd6uImF0s@eV@mA!Q5v^MIL}qjhQ%^DD%~;$-2g`J|_>DSzL`& zH`=%|JnQ3;4JbF*K*u^zLaNG`H@hppqj!B_r#Ol^VS^M@pD87!1B~@RYBfP&r+)(v z{^+D3f$pusZ#dzy<4-fS;6t^^VPO`NGf2^xRFv%U7F6BYK++la_cCuqpT+?FItN`7 zo6^FpFXku@wjR|9__EeaSXXuZUifS_Nm#dMeokOvI`KX*pr*dXEV?f3j>5$RR`@cQyFOR50)iPy^i`LuNnGEr>#lmjO0qMDY-GWGt9sHLo$+GWuUq2PGw#5j8FaG zGnU}kwrO~y43NcOqz5OBWu?N~?V`8wjJ4G6B!_13a~)ITLo~s5%o!m2DeT&!;m0Sf z;WX~n{E1Abj$j_Cj>NCL`uu5iq=>9T_A?+6{?e0XDNGd;GEqulGIFnifo{^QfDsg? z|E*TO0~L-;1L)s`!j&>$3#yu96aM)+Nv^r+MlpVsv@tGFGiXprDjBK;bjDVvxiv?F z;ZYT2$8Lw5a=1q`jFuVgY`U-U_EpbKI4&7#i``Si2%l@=M}GG`EnzsL67sMD=siE8+YgMrxO|L4;s&+^Gl{q@w$tAi|CnBCLT3b zQCT$X2b^+(p(8t-@VZE?r4bi(jbR}cwQTFgTqs&PD3KnvqiFc{CODikX$L?z0VjZH zb=-ESwVMjb)pxX&82%q3I-6|$^tKRmw4>PXR&3kKU?kGM--q$WZ#b&GvU$R-1#Iy_ zV1)HpB#p(4)@iIl*+37d`zkvBs*vQ!J48X#x4kIV2{4C^s1mO>xr1u?J0e&>=F$TTP{%q7!TNJgf@I8iDijo z?9L6)^zeSpvY4hwX2kL6A89Nv`eN!o z-1*w1718N9cPwPoazU{76w}?i93t7Y3Gpo-QPAgUvj^73#4_{%?4i>GwoUyF0P7am z7XaT8Uw**FYg#Q0hLg}>Q8!adIjb~h8v-PggcN|tN81X5+JLfuM+zsy_=TtNijTR6 zoBt|;*_7u(c5c4(fvcRV-8yl<>3vnHge>o8O=oO|bWSpo6!|4{k z`;4e|#Q*r?bLzhT75L;^f{FRjs9F^;cuc8KU(RmAGYn{2Bhp0CL>8*TfHM1evl_cQuqL+ZtrQx*|Yhh8l_@=)u{UHPOkPjwK zSbiC<@`?gw2oGUUn#v$4N zQg4IdV~GhY6t)B`aJ$^aqSY&G??ad{95BnNyMiiZT4ga#pNTC$UB#pwCP0=JHmsN` zNqceIGm{<56+xy=&iMrvsEsPZE4wd$7ELT=QWtUiw_;;hv7N1r5QY{8Iw(tvays2K z?!_P@;m=rCK21I^C9FSl5_u3La!r!gja4ZazFlG8BywDcFHS!WgKlTh#aLc-aOg;6 zzL!V@18T_eR&YW>Y7`>+U;m>2ETjm!r<5re<@z<(!(0un1ghQ@i^mIizXJJfSP_s`C*=Hn{SP{^5s2RjbTvV_x*r?T4F!JhGnpBwsz)M&cY0BZjZ^e8 zn9v%&6qy45s5E#uvN0(fPSqM1-Ik^}u(~=*dTf5$w}|SWhnAehT6Z6pt!++Do%gAW zM%UXMZH3^m0^iyA{(&<9D-DQOpq9C9Ak>H3Y?DVg1)P`Jh23ozjfBp})M5k_P?u)R z|6}B0Xy`eu9!?p}ZbW*!m+`j7WP9C=05Q7vn3K z;Vd+4He^RCHgl3=Wvg|ODZpLIPb*RoLvE2Ezh=9EPQSg{oU=4E>q=V=Q-Kf7jOcb^ zcq^6>>wJU|y3c0~Z(HuZ7SU~#-q2c| z-x%`bNVe}L5-Wx9Z4eUHP!ga_m!_yM$;mzbH*#1~q6UHr;orJa6B19U_QBS|Zc?uh zMZy&4BJDA=2ZFI`brWbMPQSqZ^9v|xe^m*59uk5PihLL%Ogs))8*ji0Vr3XX1rA{Z zcpye#g9wV>+h{?zT=Y5CJ5Fx%=WY3}x87}CdA8cT-Rgh_@6qtQI(G;$!;tXj2j|>Y zCBuen);#B+PJ9zJL6G*124b8Oh52t&GgBW?6BQPxv$Hm4Y=+CL*v2D#BicWrg z2zAZ>GkOXA&R$c`M%}nm(F9v2hA=>6LkvKB0DT919@=v`!%2)ZR zBAsh$9o6+&m3yowPT&D2nXx9g_m(GIiUj#^J2$P#~Xc(^A z5@4^%MbDvUCr8)IUm&7@pJ(9h@?v-l*gbfRx5EKhJu#dITvy?uR-ZcS3;gHH?H((= z_wFtHlAWz?)Rn)1QHa=k@In!ZP}fbA##nzr1cGu4d#=r2$ati^GYiJ~=ti;v+t`teW-PkSFdU_B z2u)u;SwAk5hm{tdGJSuO2{f?n-H)%Y`^DhTk(1AfMhLMSGOumJ>6Ooekw0lXWG(HF z?~}o6H&^Lektg`{Q$1yR-0dt+mUlBdXvaBUO#Ifqt}xc+25#ok_o;76EJY6kk;qTE zxvTU&h294O5qVeT_^I^Mi}bv$^89fjIbr$3AC8lH+o0;DGrK?xQ3fwud{L(ejF$#i zb|LZJhL&%yz#pG0k*Pg3dLGLBo{wMrHiGdldl@!I9p4gF**Y>V`kUc8eej~k1E1EO z=XhYoJN&@ieGk!>v$OM&APpzB^toLMy*%A+-^O1f3rF}hloUY{hdY}M@;f7e_4v7E zJA4+oJc>USrgy83AIq3;Ep3nYwoi^b`pU!$W9Ox0skU&g9BeBuxTGH z2F0)@&F|_y5sw$_C|9+=ul4eE51KYTUz#@4Dt;A(-UJSN#HF5E;(sy=JHpl*Pw&X(YP5UF0al*@$(U(}RWQn)S{aSyIXWm^N4>pE<~;o9$M7 zy}oL?E*^u~P0eqIiJn{DX&2ILhcch%B3?(_nzLWoxj-Iu0ndCqM9*L)I&_46)G5A@ zC@l+8Ys;nS4W4h6s<%PkQ|aq$dp84fO@oUlC;|Y6iDt_uK!c8`+4Ee*_pm_!uu`o0 z$DBcKAxA{;6-#uuW(YwnudW|hE8qR=6RSTLweA)7TqIoYxb)6xsgz$U?L+J9{OUb6 z74Xpa?@s8iHlWM?S?jS2MrkHL`DLp|kz;GkR> zTP+^I9*#U}Jt1JL%xeIEPO}0R0Cp>W&X&*V(&t5%=@2u1-xda6PxpJrpM2hGJ?|E$ zAFYp}*eB}Q3)d3HjOfCOovfg000`iE{r@A~0&oA9bVCx@`(M%x3SGjot9a#C&CMde z&TBU!6YvFbQU3O?rfHQY!x_fRY&L2CX|4lO=$!(2%YS73RHIno^-=w ze+g+{I|oaa!W5?QFo#)%O*3CRp$jBHza^(hC>DRYm0m>=v<^z6LalUaoEv}8fw~RT zQURXO<|PoL!xcm@88Fx6^4#pXOZqE^5y$%!WwQdt)ib|@$&j?X;b~c8_CsTTu&yW% zBLj|p(p!hGTe;(=*?Zr{>Sf45EdNSF1MWuK^(q42V<4=M3%ZaLZtk}l>x-GgNlcUM zWn2&FMDVfiPHPlujp|C@{8;5rCu!nEql-g2g>Q>|8Y}JrCof<>c1OX9sc`ThK8+_w?B!5Sm5DyY^yd z+2GqosCntcO+@evqt=0D86ja;0oU|^XP)Ww~&;Q;jj^|qT<*}0p+*Zb`7 z<@pT!`C;X079YsW!P3Ft8ZS8Vg#nxk#@NtYf%dR5#c-aR13~l_pIa5&$!aa`ykJY| zcHUXM)EA&VQI;eVH4nr7l_JeHtj-A;f3M@K_m9N%#1y@<*Kl7Xtb@TXrlwu_1##N| z;Us&Y$lY$5HF=27pke!$?h!T>L?hhhXNOADQp8jTI1eD`A{K|mF4p(S`YMLs#QTuu z8u$fShg)Zx^ zIbUP3^6)F{8m5w0jg($>lo}>(2t8+nUwx-3!t`YIPaBQT7YD*qJp%%~lp!!hwa|Om ztlL(LHiM4&VGWjDl3;Eb82#o`75>F_dfGyPgqzze*mh5bU%^oK=v*Co2L@lKj1S@* zoi}{LTAU5kh$!?-#c{EHr(QmEV3lZf@c%E_Z4TJl8<`0)VmY5E#4?Pqn zdHJpx@O#FjclB)JA~lrXF^a9=~Ta)U+VV8xwjB@6F**@Ids9A z@ulYNs8_|ZXvjU-s=+DuW4UF-IdQgQSK;vP zT0C+8%$`SA>YJ&W#_cOyK6SQ3mVEqZt7Pf;?iw87cw%d%e4?pb)k3BfIM6(AVpXbt z)x0*dQ}_Niq<(~D{n3`VL26F{x_Aj~wiTeeh3ZJdak^zaxOv6e>K&nKi!Ds$2wzQW zbrvoyCHzC>y)gvH_rx}C%$F3~B6;Y0GpbhR7GEy%U)U-CZNlN_778?RTCwloIL%f# z6m9b(*M{a@c5f4F(ZA`p5AIv?U(!}X|9?rF8M;?({V!>Ig6bJpT0?EO{F6_J->eQI zX)AJL88APCuC0+3NM!%0BL#=8lB72v_N zcHW}cYLIDrv>J_q!vQL61UW$}JblTY=T=bONac1{S;zr1P%x{uXII;Zy3`t-WS0&d zAJ7)h-L2+tR8(hVv0=DL^t9BgJ`55sD1ID#^Z5_+t7BLkdHJpdMpTuJdp0IYO#Rxjb9rvGmooTb>RjByVIykEZH5YIr6 ziS&B+yICUnbSHTtd9^p1t3RQIw{(E+wW9>Cu4q`bqu1rm&v~ZyFqHl5eSH{hR|;1T zGK>B6S!!KRr&R6AfusRbL-%1>E0P|MR^I2T2I4cIN;TSuMWC9$OWJo-LU@o+l@1fg zDUBOB#69e<8*}aN(EuRiU}$1*^r#ZPm^dlNu}x1u<^472*TCtVKQGP|r)x0W^YEGE zzFHVO!?@nZqnR=YxVQvF9HG&q{hlo_11TAbj91#GC^lFo+@Ta0b2}c-lX<_~?)Jlr z)6tsH8so! zg0oTU5$2mUs}|#oniqAEZI&z#WI0q^-vBJOi#0FTp>Qii_47B&hpA*al!!5Tshx^l z_WFzxOre5LW7K)sE%O=8n4K){TDj;J4Y4b)1ZZ1Y)~^?Cx&k}coXHEeGLgy}YisG( zceV<_d?80Y1{Ts($Eugh#3VGHsF%U)O8-E{E8(r+H7zt)eWHECtvH=$Z6S$f=n{;yMP zYLmsxF_KrB*IR(Ykww?H&Y|{?dwugrsq*i)rg*u3FT{;J6@dA2rn2y5X5%OFxLsmS z5mTJ%TfB$HMD9tzy5p5FO*_ zWZ3jNw@wjrfE(oPMM#%OqfnyE{I1qg?z44&nr@$PNm7VL-5-Hzo<>gAxy8nyP7Qdks28fvx-78WFZhVt|@)Q|g!F!Cexlb*7jj`FMWWrhW< zDn7CF{jHN}Ok*%tBlVqFgHqzTyc`LrqrM^|-5Zqtc_`)*r@8bc4uVEK2*mx(*!C5| z?AUfJ*-ku1h72d3V?vfgmnjvPGpDH~2s_SWDYh%SDg7${PW#mLi`$uzj0wS-5$!oK z-01d-Xv`!#RBTqtxtKasF;c4w{l7w66&ruk6nmFdvGa8z-L->J=iOBXunkytm1&a6 z#e;GQM>3r#*qrpMqZzRr$(EJps`M=+o5VVUSVl)Co}|FA8Z$CTuWXplQ&ihbR7HE6 z98L_Txis5jB%N`V3%L5M)2Oite9%hCiU2x};0BQvf2(peADad=)B#qd)C};B2uT|| z=&A+O+l0ssELx`SnVkzQNX-A_17K>JpXir6sS~O<#`OO=A`?+!%0(c^g0h25nP9m= zr~=Srb=5B}M&kme9$RV$at!t=m&TCPNR;dI{d{QA8PD!)eKQ3vwdl$IyBi^_qrhg@ zeK}Z7Y6grRf=AJ=a__(wQ? zE+EhM91@P5GI?iN$-+8K0FpVp^fihgoP#jTlAxJ~>-8$~Z{+zTlC_e_^yt1VwTPZ_ zRHfRuI2oT?KqhQHZ?3w}gn>w2J(TR)L;+5u%ad9)Mo>T(Ye-=Wh02Nxr)1ixj8ZCS z>|*s>?)p}*NDX2NCB`tqx8U2O;oAxB)5)y2&a5xR>G$w@uhx1=Sr=VEzXW>>VS4rr z71x2pnfE!5AngdkQeOj&VcO5oZ_ujPihEvuhAg50LWv^7JdmYCfIujx7GOJ!z>S`g zVpU!hdYxgPo{GtXDnrVMNfB4|1lJ!L9(&r4#Ra18$K@IH!<(9mkP$}2Q-Oj6lLP1 zX$7mSR*B8frqru<(4~p(l`{rbf1eb9)#@J5$IXPuwX>fdx8OTKL zgc`&OsL;ICve}k9%Yz)sJj1sP){=wJqH|?}bR@VRLHLY@ytKmC_=;&6{2z6$&eWLUArqllZeBIp?y+kikJ$`npm~Q)k%KcdB}pni+wTZ zZ^Cye6ranh;RXPa8eyN`^?!1h+_8=q7ojwWvC@vUtQzZ!2eOJgTC;t=tk5c4kKjiK z*ck3$_%p-`iJ&PTRH(b9Jq4Xsp)D=-e%{PJ_F1~I{jyip)D(@)EU>Wv&2blBb5;+$ zB7AkyJYlcXZVY`YEj7UNankI6?5cg-Gc-rmVq@vNxr>L|8_i;v9D?6&m{Ya3{I$86`Klw&kKRAX6K2FbziNWan zr#0W#r{Cd2#5X9?XJ5%Oa@&~38{>-t%-&Ao6*qx0b1W*xy6oDt;jl{7;VJ`CMkrh$ zv;1WAbXz|zch7Uhx%g|fd|#u7W|O({8?iZ);5t--S^sQtb-3P?;rzz@jdXj^`L+1m z0aMHWTLU14zb5H;?*3)#iomPH9mbt09YWc^GO+ZV&7V3bm5|l(`yT{nzeo?%=l(n& z&%tK;%yLl}{!VG_ZoYc0)wr?0u*cx{+2Q$cXGjp9db+}a?t{&yVG~08xO^{@UE;Dl zmln^Z4rl-3Q59Bb_1+!M<_d#Xn4h;Cd#Ch5S3}Z?-z=$SP~3MhL_CVYZi@xRE@24# z#|;}u_1d#pbTFdV0!1JsKtyk83(^(AMGZnm2k>${Vpf=%RdwBbVDWF;YK@F=!_s$9% zu4RZMrk|N2hfpQk)GUq)q)<&S4if<9hs3y=9T@Y#+^La%ss$cff$DSd^e2bBsnMUB zRi{dA2lcvE!jxxSeN=Ta%+h3X!t6$+Fw%A6x@sgfwkD38GFB$~%lT>5WYsE)cV$Ed zupU83(#k{1j+D`aKneBu%wz0rG|8=W*=Uz$TLcO_GS(ND=8I}RR8dqkzb=-aeZw{?SIaG%Fr_WAfT8ds|>V6`v`i zdr>r&r=-hp0u0_N%4W!ehEYAfbxI0ovfqbPk`_^MRN!dm@Tl-Vth`T;2ftYLCBNT<2J;V2+kW=`1{DL))*Jr{d)6QhyvS;Me`} z2Du)0Bf{SNf0YUsdbZ7gh;)>wjuqfB>S~ z^u^yn-Ba46bE(YT((Rs07^2eXWovvN zuXkJx{qG9|_~~K}^jITmz^2Yr4^cV$pb&RR;f+luEM-a>TpHSUI zdo%CQ(S;cy9Dh&4m*wLGNeX`VX&nYZ{L%PD+=G9V5OeQE#;MBP*%vyJOyX(8Ku)HQ z4CoXD2qhV%BZT{N2f_O^WPzGApUePfM}9peEEtfCv3}YFUH7-c+o1*(;lUXrL>pv$ zNzz~c*?0oK(O3ejkIyPzN1P(}tXFU*COqG8dFb=1V%4HMxUBN zSGqzSigqux4sdzBf@j>u1#O(|9PHDL4!^IG(jgH2hQEpkbyJueys>>sp+Q?l?G*Ad zd$kn%_xrfIBFtv`jxQLgr*#2O2TIT3xoD=dxBDJRVWU6n1xTJ4C&P_-CB**V_Mzqn zmuXp~<+AqTWH$$m$jV@)dbN=+E5jcb`(HbLZv1+*z??tmNKzSD-f8a%g@s8_E4ACY z8Sa2Q0EGk6wYB~1Vqv&=QKqZBvyUZ?yXF?NyAOMSCU;?>9nvYGh$cTx=r&<2Z|{}G z;^K_?#gyj+pj#J_0-1O24kw@(171EYK1c1mzWyzMQ=`L;B9EzwZ0 zO(VqB2&l*yun@qQAhH;+{mQeliKK^+i>pr8qqNDwtIHk$tSweO1S|c@pX}9cx4IFbKPq#gV)FFvousLx5I&d*Fh`-%kk(UQ_E403H2zz#N6n&F-s-92 zqY6KuxW-mp1xH7J`Tq2xmS!%D(zwvd2s;X$LI;9;M`qQig5`3RRYkq(d!@S+@ysm% z(O7e>tLDmHdTqL5OY{~{7)~+60huPKq?-Hv_ct!9Kj$0pcC~7zx9eRrr_`3kGoE|j z?R{_Z#T3`~`}9gDraAjaJ`qDVrc!uf6GZ$$*^K#g7(PGL(-MmTYiOIWv?vKub(*!j z^Yu`vlicg9NoIX{(^~+YP#F-?s2I=Uy@6!H&8||1%w6!35gh@GiHhb7uRpn_?e-9b z4ZrMsu>dz`aw-N+!`WcBFcAv~=SeCF22KYEPgn1?o8xmu5>8^$bmt{4uQuOreQ!KF zqQ9YHhPhYj0=UUCN%2=MynTq`^__Cwx<`SJ#oNo(ZbThTzxW)u+{xFNi4et9be)PH zLHd#Bp?5mfPqM!U2?@QEshM=WDhFo>ubrvAh|XNVt}W1gZLNO!Qd_$Cu57mshH;+^ zZ60R9MRv+-&6Xg_-Kz~Kmh$1}q`*xM3c-Hq6=16-(sAWFn5fNINLRJYwe3^r4=}PO&f)5x9W! z`MqrAe(LZnp48}X^^zY|dfAc%xhsBN^S;kOy_)m+BxKa(WnPlu{JLQ?A=LiBk7qIj zUZ{6+#S|$2yK#uqgi8#0^sobGj)QJ=gr@HW5Y_lS{-7s3J>+S{KN+=CprK z{RFO;Ie)(smP*qex$Ts*(uj`pT+wfG3oR3eL)Fmeca%QY(CB#$Lh_a1FUb?1gROqc zS@ZUANxLFnl%6G7}MS z6k`6T^h|boe?dI*1>gQDa8&AW8dyYE;E@^7DB0@}SC;?8X^hiQ0IXP!LzWwZJYR-@ z$Ok|~rqk<6!~3}OKNXm@HGd>&uMlqh!tBQwo(Ea~ks^xDcdVQ1SU1(M-78|(qu5M5 zhukvMmRv6Q7Kdk9G*?}yZYvA$w;w)y9rGRyd>s$JCgm#D zn$jMarRmsnB@Rn~3664sE)AM*etBnu{VdGw`ZC!zlS$l0Xiuq%glX*_K2a-|wh(`8 zj6PYNls`jF<@R!)gJ=Ne6N=OlH3*GTf4&bqC?Pz?Bm$l11qj|Zi=&L}PZfDzGbt0b zFtFr71?9EfE%tcUd@nJ+o}G9`!p{g?M?%L+I8|OWq3>G1JAi4Uk6JNZNYwPWg0O?yFju9n3gR=iPs%E;tS9K8`R!}0)W%v1jVmEJ_rCKPhN$`U@hCe&0W8+-2kZ8HlFsL)Avu;)^ z0Muz#r)(~8C z|5~$Y!JFLmm3{!)Snt*6je;zZJSj{|zpnikvv$q(FDJlz>?YJLOHyw>LCRBK?*a8C z?+k(39Xn~_Z4e$UozM4MJj>MrSd`zj;XkIN$6DR(`PQOGoQrK=PV)g|M{mu8=y$k! zpe@VG?(zS~QsV=8o=d5zIO{)jfL{5yF2eq&C<@}A;%P>_Aa}8(eX* zw15DLw2))k7nei}amA)8G!Qs6Y=&58ixxx>qJA1A{IwB^MGa1BJ0uCVHWF-nhrTc^ zAx`}uc}0@a65#7nmjpUfwVP1$fJCS=A*TD>DBC zq19i$Vr<$h7GKLXOo5$^*4q=0(qyh!s6SAamYDM@V&PV60ua~8#7Yn**;7gT6~dyh z2__F7-MtQjKs#G-7X*gWXK~fFO|H|$F{VVgLB>)`$2ireJ&09{ij|h|=JsQis{9P* z)l37{s%dJ@WEet|>K|1Vf%*~??I3>mTL)RJ#4N$LjitJW8j7%!?WR0tLq!E|<)tUk zH)OD=A)&RqT$KpXYzrunQy4e#hW%aZc#_ArS$p@@9A~Wwe=2t&!Y0bGJxsG&K3^>V z0kf9NrmNz4l`niQ!Qw)kSyxu7Psf3ym1_MLlwDrq!K_-s#Zf&GR`-rpGN9eX`SGU| znk-?%?@{Bo;_T*l2S8ipKvVM)NHaQtIX9Zd25O3J|^dab|CL$YjhSMB`U#N5MmS{!EjALlwH zqBmrr_!pHr@MOhuZ%td+b1%K@M+Dmmq{X*wipjkitU2wt^?nr>gllD1G!ofhHoZuq zfg_*;et=^M*~~_JQ_&o6%oipI&MQH2>CP3jPk$$RQHDyU5eqEiYy#4S8hd9j!{0C$ znfmo@@|o3&QtHS@zKGAg1+GPWX9e)`>PPkRDT*{F&k15H%5PUSSEN~$f?R4uFr)|k zpjwR#V`$T{Z-qOd2llXqJG=+ElvCc%6-tS?66rKkykUh7*&l7p-{ysFcW`IYgs-pv z!~>(q-wVDVdXZX>J&Pxnr`%YTR_mGfyEc^S!dqEjBsFN_D2W87QI^#>4_> z_U5*nLQw4hT_Js+R0oFcC05|ZS7q<{FJ6=PI1##!zh91$UzVG(nBY$%$=oL?0$bVr zRyclV#d*99v_`D439;N^1grKwhut&(Cx|AmSM+>0#qqhw=~ercJ6@~p=*`{=%aJ;x z7Yb8L@~^i9cG_-{+ZpBR&F*D}_hX*B<;uf%2mH(%->c2~2E5a?%c|q=!eiST4vlog z*s4z@nuX&|>)Ew~E^QaWI1vaJNON-()eZ#8{U!g>w?yC3=N`}|6K3P&fG2Buy2SuH zd)dE1Mab0+G7&6J`)z!|JmbA|cSMkSHMYI*I}Nyl{cBivwcDrH?B$r2lnM7Ki>tpeg>Z4&ZNxN5YU=j8{8ca0bc^ zUjW!dhck!r4Dhi*MP+KlbdZ2XT=d3&ZiuzDRVvknNlJIU>$fe$8kI8hH>6V~ew5JrIv9g6ouu&6YR2^1*xgu$yK!rxF^(hyEGJ9~q_b`b)pusak4 z2hwWz0uf!)=)(CRwB_ZQ1ku&>X3uHq-1y@4s9a&y9HRXjx``awu1>QD_K5dG$X*D^ zqWd5WHOXA2#D*Hyzna@quWNgPAVEI>4prSU@u$tsT3Gdk`Q2hw9NGbIR2Zmtz2r-o zgN3-J*Mj5ktuxjvBQ2%l_&R!z?h6acDxry?OxTb()Bi{OBL8pkJO2L@zXpE)5x?kk z>c&j{fVHiqpJ3#=D}plYF^6+ENLzQt4vzA!JlijF1+CZ*B}TT~ ztlPunO8~!vHc>_R?Wtez3-llGiv;|;>i!S-jW&9YiTxa&SO)vIn?w82;=>R$ggALZ zOfL$u=UT6@J(g>OP5Jwa<)b@9wVd6QV_)K=OYv;oCv}>Z_?%#DN1CMru@?@f?KXyQ zY*G)lLOJiRSpB8`>0#jA9z%q0?5CH5X>#68^Nw+$ih@hPf5ayR=8k&wCFUw#s6o@n zXP03G-+VHlw-%zhPLyKS15EIN+3^g;IHyj6X3Xu{^ARsYhIfbNzBOhdn$N}tp_W@( zkJ-`AAZR|mM87^WPwtg{?Eh|Skjn#J#_Z!4voks}>GM z9@Ny|avs2yD%hFuA(XV&x)Yqua4=lBpKQp*q0T_bDz`RX@ zoNrCY2if3x0|!v5lsJmS0-B^oQz0UkCxi|rkq+)ZL+KW862qz{fEMfO!K5V3L^MRQ z@XMB%phRx0h#OKt^Ig4200|_dcQt{Ozl=9{O`0V?{I_=#NNf;fjd?`?&+*9kYo?Xp zsnnF``J()Y+Yv}VZg@)9L8=X|f`B4he}_r@K1VT{hCNP92ZT7VmD8lafizw$J%Nyd zOh?n0QDiqBp^{NKw+ATK!_y}{cRL)sF}13>_$FpK7w0#JtUR`@v;ya%6Csgep-9qo zdUa;*6K8C1ZiOxO^;cJnfq@NvoF1FKIZlOx$p#jhg>b>;8xY=sM>hThtirwe_V{%}1D-jU}Xp5E@r36WXq3I)Rd@4;lWuf_s z%=JOW0AZHRiCj)5_!ixD3&gP~pTiE=6QiVNsQcMw#I+7@s{8o8^I%JLw&Zrxin!W2 z!+l+>51=$tW~lIspe;-s**)QJ;Y{;jzx!GI2K-20q+iq65AF72v)Glt|Ai<>FzM#qgik2uQBl{BQUo|uy)Mb^XQD8XES6`g{_jOCO zBW5>2ypK@2<>L=u-=C}N9h6~l6*E^~y(aSVnx*Vw>k>Qp8TaB?6bc+(1EGX^D=p_P zZ8AA{^MQ(CWtGrbXM=%yZu@C#qJ(zf`WMm`n>3672@Kx?GM1b@{er?kA9v_1w0UDB zW!-Je)&&I%VQFoUk2BWXs0CEwYznN-QU<7G ze~9p!>z@mV*ks2ZEsO`E%!{3vAy(FrS%cg_18zXl=n0*==Fb;L!C}IeeVG=UYlp8< zt9>#?B|o6*=4?g+Z-t&P9Gb5~KDos)LPYpE?urix52Gc|K&d4;)Rve}!?+3n@FmHG z()LXEJ^VPXF%q!#d?XxX5iJ+)Bvf$4;eIM=QY*LfMSa^JVZO9|VR5BUcwFw`)=p`c zZHw`=u)I`wD{E`u*yod5y;bXII0|Vjco(ITF9E+IcB%GYHJ9Ye&i$uUn7Ih08MM(4 zW7J!CVt)3YRCB`2H)Bk8EtZ+@V-K`16WyaL<(J))+GZLsK+tL=oIj;To6Ca0k(pPZ zspS^ShrS~-U@WqN6w#$%F!ccuANq>3YSJ@8#9g3LB2L*ww-*7m39bs39LEOQaGki> z`aF(+2cH(JkV-WCuoPrh!+|KakPxUxMn<5|!crVg#Tifl5-;P`S)m6Cq_A?q4>YK8 z!^K5;?m|_n43lZ6TF60sdWk2Vm?9|h7Qwn2*iSung;`n5Lej7+D9?9mO6tslwek@= zc2zAysL}@DS;Rq6v~o?u0rzvoBn4Bq-KgSV`gr00Qobvg$-BBX)wbSQ=^AmadIpqxmH#ov@v9YL##Y$pzsTr$)~YN+jdg1QL$5T#b(8}ZJoK+-p_e)T01{rejRO&arfRY zc>80dpB?=*8_{ta2ORd^sJJw>v0k%(0`4zBm+N-md&)k9Ev1V})!EMY?|-%5gYeJD zGvxZd%_s!itSEFhDgd%Pd}2W)oW>nx%_z9yrAjAu3KRR&!h>FG@_azMZ}BYyvR{{1Mc%7ZXTYSWsOB-=;?{>is^iz%_-ArA=MOuG z|1xhHPXmm<4$0f@XzHeBlpZ8EKf82{GdNu7KMT4v@Exn>oqfE>AQT%sO@OF;%n0Gy zeah1(YXesKpnK3FT_k}yK5>b^;%k;H^d))~K_8`oj;ewAb`am3nia-9aty)r^^ZZmTL#zi{Q;S7^gB7`ZI04i#mI6ucTjsb=t?*r#iSSFv9nIW^-t zw}Ca(dG5{N*BgEabo5X?Bf_UxK#*1rarWbp1=o|YTBpk7@R5tvb$kkr{*%)|6s{7L z3b7Zrdl2INI;Y-fiWW2k_F&Qy*kYzeyH%B~4OQlZ{VrSym|Y%`C3;6Izb56IKw0ko z>NI~+WkRYn(o&RKnQzbJ)F>Eyj2tjedx{kp#^(|o3qu|0sYUIzce~-qR664=lw(z< zVB}z6Z(flzxI~#cDBF1d#oBbwYuol;|CxssCo`Zr6Q4orBEgr*Se`XPa!vpZgg&%c z&ob29mO9E1K?RjL$QtXyFiLRw+Ls+!Bs(OGTI#gZU$yFM)%A95X;UVtzr+$2B1k*@ zE8x3#=n{9h;yglyAzGa2xe?#Y$*P>J7F}hdb?)Yd$phj}XH_nIJRbG4Dj5N9YgyAk z{)dCF>3F9lG;(QbYl)P}4z5OjCE6MW+vv~n?Ru7i)&*Q^#I;;}>`R5=8o)|z$|XKv zMaOc5fX&n=uqE9!$9BqpwwH`3KH~T>0-IP?7;uh5@Rg!hr(YI|x+t3S#Xx5n`rOa@ z#8^WdA)eZ@Qi}{=p2Xpo>Uzh#T< zu+(8<_}}J>x=r(sDJ^%(_s-%dInVjG{kakfOEj`jI5LD(*^Z>8hcKMDesLS>RdT6j ztkc{^k;i%>EW%L6PuM+Hh4I`N9Gt)8#i=-b8*A3VGieBz@bH#>mGF=piDQ4~iv`!( zY&SD}9hM8Yq+XT_>%qPwha2O13))4Z9N{GfwFVx7&2F8UVe3>|MY6sy+Q{(ntbNEJ zYrvl|m>Wy)(}5Qks|lvB8Dm|L+w_d(B7tAoNkE>NG14I?K{?0K{U{Ssesr$viH-2V zu)D3Syy8H(wy!0!-ap=xvEEHMU?>PQG3tu24U$LMLnGK^4>o&VC3`#KouOSMgQ*Tw z{3B2{00c_qP~5Z7GO5<@8IzK<~;9+aJ(N9uybqzmlM?(S~CFd{~}Jrw~!%v zFB0ww!9`5Unh3eU*k=s6{$r9A^1{NW56n(H?xPL5L--?z=`x4rFEWqd+Nh#~=2>SQ zG8lIPzwe)Zk$4aK%!io%CL(DA@2+fYYAz#hQhg_P>uMtdF{2ZzagBR1KEh0kgvg!Z zu0Kr)wkhW%*HsrcIZ`1eA$8}<=Mk5|)9E|HX^cFyO~StVLNqwLdZn8pB!|EnLANrw zh=o8x2trM%=CjO3MD+j6E!c}CiHHggSRtJ>R-)dbV9_KYQ{$i`J>LL!7xV$U3oY;- zKmin+mZ+83T^RblAe7_!68fQSV-cl4WV4Ggzi&YglVVdl@EMQ$^v6|&^0q3N${>QF zf1M62pOT&uOH$1O)25Ib>Bqdt$}Twc{xKM(EUl^^?J7ghd|xktBk{F16NkM zdb49ye+8=q4_DkQHo)%F;TY*lqyjoY4WB zV6rsnRJfy`dm@5x18Q;|4U@?6Xe!<0{)c%bdE=0wE&CA-m=fnQHDVHu4Sn_14Xz24r?};eF7{Ebb!-)d30@Yc?5WM>rRavQ#;zBMV@}SyWc2*ub5^-!Pa$~2(p@S^@CEZe01}ATUN_Y03 z)TT745+p#AlhM#%OB;p+n4Brr|B|29aHUuXwDPgrsw{I0%M06^>kAe%y3O&60~ynp zF^gWTG2A(-8>(=#g+GN$^NU|ed&F2)*SLP+%8Ky^+h~k7LL!2rA9>(CYiZ~fixqqG z4S5O$^e4yr6E|ck2Sx3-!xN$diI)COf*HdPLMy=@4ZU7f;358*Po)c*!*suV(WHT) zAh!=X5?9yy@WM5y;p4RG?+lj#EmAVOH>vi%E5=X?Y?XW@TW?`cpBK=@^j_qDc`(0N zJ;WQJlj-v`@zMOb$=qBT-d{*RW^klH-`&&TdVKMp5^+79u|?et68P-XV!}j*Ibg7C zy__&|Uw&|Q@3RAtBmk*UvADTriZpsYV~srlVm>jtH2mL6#EIKMgM~id2|9`ge(fk) z9kl`=b7Bkxyi7+2L&GFXQ`_ObD9636*ZD#6I~^2n{NDL(|KIlT*o(|?;^!3o`-2@H z+WT$P+hxkr7@*;7j*;-A%^^3bBeOIkpgR8@Vs9~PSWSSv!`B7HS#Od6&|(eV2|oVr z^n9oC7im%*#XpHJ3|sYiT)&@?2xv6$a{z-QV#KR9@V@{f<%3P({Z;7t#~JmG#g5@H z1ur}vOSvkdiy0bCV^Bhl8skp+uB5B>Bqn^xGP-30b+#I`j_(2%1*}rc9ftdZWgB!0 zcnC6-3USs|^o?2>1=T68g4cX1;Q}O?lkD8 zLBgK7MXqLwe`wP0S4wes^} zBfZE8W_>jo2ecG&=;WdI23qoJTCU^Jk78Cp<#&;&auOh^_+O|)(4|u~QX>*6m(r+H z0eUU1?DwIJc%rMC03_7F;(sBb+e)gk?7uP)%9Ya|BT?$E&jY{}JSpJv<~SsUg!MH3 zGk8=64as7Uf|V|_lc)y#?(14X`F9gH1k8M=1|*(tpNtXfaCqK|ghsObmU1sS?F;O5 zmE~AVh8D&T^5Yj(Do$D8t2LB|FXWJio-d&u-(wfQ%UaL->E00PMiDb`QPS3{=FP>? zl`VqwAQAY%VL70|+SJkXZy7sK6qHe6g|#icH>Z)&cV;r^u>RIUVU2ZpciQhG=!f4x z^Of*fUDSK=qHqP!uS?18A15n^%TF=aX9k-(U92I$rAI_&cM|x+2Ywhy>hvaq*$jTp9QTtES>9M-) zTtFWpF)`TC=xKNUHUAP;LsBVOUoc-R(8%d&2=L4(p^)PX#4Rxdy^+D3m_KK{l~!^( zO8FMnyhB0JTyQ-FUHsPrA`KNIGLok_nWgEYGyVJ_Rv)T2-4CqcjPqXU*Y#8`O=j*QM$EWPT_P_<%mbpwF zpeCMAxl`)(H1FlAe5c>S1bC}up`o7uSCc`P+XJ1xfa0K{m>;3|#eJ;SlC$8H^n>{! z;#)N4-e3i>ZpneK#=vU>M!(+d$W8kMetkA;Sz2Y-4S0a}%g&M@wqE`Yc&Kwb=pyF0 z#h15{P=NT=Bk=rYHS997+{`tvBM#&k$?V8=5*L|m?lz?nrzft_fWCmr5)H(vA$+E(oS3?J>d-jr?f;T)RXxf^rwl)qKx@*3#z zNk&D!SX_7Ak*Lcoi3zzwQau+_(=RB>_72zjv8SDoe#NKR^&XrXz5P+o^d9#k0B;|Z z%;oXE+;yKkjsYOLW&Y61&n>Ou@5?Aj3yA@lTbZ-fZ{OsyfoJj6`$FMye*T(Ys1x!L z-%c4$dlF{_HmBZbv|5xzXq|t~+z0$xp&WY#d(%r*c|Sm_UnjqU%W4w6a|l8)v@K!a zlWQIsor8|Leu6d$zb@0>;dus!}khl_6QUK2tD$On@?onbO zR8&&yZp%hEQ;UjTuYyG#4Mdq`+}vm2gK3r|#bhE3K%aUTWl!kToNvHQn=NjhPBB~D z{&We`(`qjq04`Ws1X7`3Zi_NX(Qpp*B=UJ3Q`c5fXrSsQ&@uPFY9hrz_l$gGADmaL z9%~}Le7MNh!BMPvE=~3J`BKd`&#|bsDCpM{282t5M6|p-Yt(xmC#*I#)eq$fRL^W` zc|VZ~aM5cDB;}ux#`$16X2*ar@qq~h4}E)9#(h+7y+e(#a^CxufCzW?@r%KNLjQG6^gS zIJ`&B!4C4*qUX{$@^P$DDR5CKVZ2c|_@Ok$VJ9E2pn7D&cDZYXR<)3Tf!!%(V+R^K z2W!8bXw+r)ZmnqC6r7|+(Wv`H9Il4hLN+{gQ`T>9EH@7NPT?^3cU(oQg?P@IMR#A6 zfr=Jxlc#X>^;$2VKUH?q2i;zWt*n#j6>Tk^nr41Y5CJwNPJj0TF5RZ5;ldYrSbf<` zJnTIXitFdanMYlrHk+{6R}9|i(!(rj9%rl(|LTI)oC+5pX#X;S@#tPt_I595mX@!Z z&tXqp_`RPPP`Y~s1|wr|xSYrxb{g*AO51`R=dwXWAp3p++6b#X1H`n5QBZY4wR zNU(wqHH}C1iBROv&;@~*0#MPFjDc1c+A39`dm2~j%KI9@45lifqBn>tHxPHDwIfwG zs`dGVPM*%-USw{J+dF(FBn;M7jw|r0{jY4^uTWRrf;}iMV(_X{=nn)|}8o%TsE8P>&`gfZi@piq80*4zeTst}t+kliA5 zyg;cQ4|-Yu5#hgoSGZk2X`v^3 z4>8Er%iJNeUau3d@>psr^g=vqUbz>^w)S@jx(0>v*95dS1*N4F9uF(J6+cSuj9;$4 z=_cb8*O^T9jxioRa2<&1$NWS|bXM_5+V z(;}#KL9xTqaV8A1$qW@j)_c*{(lW@~Uvh~pLv^A!EjcL|`aY24kB{e}4%kKfP=a& zW@}tUQrW2Lmblq^1`j%2c&8Wr3)hU&`Es(;)2k@^=;G-wQ3uxr3UV|Z*8DZHiC_e3YH($`lNd$r8Z=h2c_bU5!2sg;2ppQo3wS+$1Sf1N ztz;$>m)-x<7`>Rw4F9h&YECKmm!>3@;s?CsLdkg0NAc@#U}LKz0ZKd=em|o9Bvk!$ zb|Dk2*HH{z%dgkC4bv_HU3M}b-hnO(GT7mt>;`{8i?PWiKhP?D#3(gdV8jB0+CMNM zkV9~iF^PbZZm4E`YJIf>yNCxq54SWuP}w* z+lO9Y?N-H>0mX8L!rh~l0o@#R#*Is;v|D(B(<*JJ=7W2JbKIUbl?oQ;WJN>aVj?t^ zao90kAX~Wuq<9>;=ouv$$2u`EksWwXF!0*!b5Ebf-VTMIPR)4N^qOaZeJ6Lq`_v*2 zpiL+G-xE*sdG6mxs7ZiL_~TC?nVjvmmA8 z2rXFyYYjGF;!Sa`s9KQ;69<=Z@h%YykA|5)iNgeF+MA0dD4cwa2@drv9(rNRIKI|$ zuN|-d%xQ!i;%+!)UrwO+re0~@jxpaQL{vnu+-cOTAY}RncC&29 ziy1odT12xbiZq1OzEWpi0k6mxZAj5e$h;A3>oOBD+Nmr<2E@jmR8@S26vBjdW$-in z#I`I43^mM2Z93D^b|B}Gr&?*?9u77|H?Fzk4SVz6A+wsuh0SUEk=^7ayRM-mYN8c= zuv801c^T=D1!`1Ik>_t@e^w`K;6SE`q-_E9@utBo8WWm+a<5so0-PV{ZVtSD#2$Al zizo!jfmEq7P4uZYk%eeGz)SAMvL`{mV;>GbN~zMAto9zeHwAXE}kfb3PM zs&V}m@GOAjaSw>WE!Q%D`OG>K1y)vg`j(xbWC~^DKVVD~9NCw_CfMEGyEKP%=H3`7 ztND5mkMFeFMzF^hJ$Xg+OftRq&@o(q#o)FaNYj)lAU}G_s8tw^?a4p*g*h79uGm)b zc9V>7*X~hwj*ICYjW0 z-8JJ?D;+-ilcsUg8_KGW@Rg;S35It7EHf_{sF2;u}Fc{1I1uv8FgqQZ}F z1ZR={Y6Pp))&;w^*w$zJlYkx652~t{@*WrfRgy}O^u_*PX|%%T1TU~WD0ho20?Sdh zIb81n2YL^8X+wY4Vp9z()C7ioo|%Sgxi3C?LjzRe0S0?xiO}x`Fmyy;)uoRz)hiRc zpz4nFR){=_dg~uRXzF=+eI8}a5)L86RPq!qP*4l(gV~h)N6J!8y6oQMFql1@8V`9F zuA>u34o4c??-HK9`b6>rN9Dd#n9Xa(HYy4&C-5cA8Ru#t)m+0$vnUb-gn(hV*Cnqf zZm&5QvImu%XZG72X2DV(sDCL4(1Rl0O8YI3h7ur2XTSI?lS8e6d1gK^JhF`LX(s#Kfc ze#oCnKBm%;LM6z_kLj86o8t#e89qGk2c(S^s2FDA$eqVpzL*cXK-Z`4l>w~Ls>>DZ zje;Bk%69wzTBB3{TBBaXOLv)UtYj%}Lh8yvYgGSOqx{rYEzq)qVcYqVeFF+)`(u{M zDSe&9Xpv;`>upvan3@==^S0K<65ueS)_M`r5iwz#3Un_g%z-!Tys%r@L!WQ@sWMAA zLLo=ahLWGlS6fHVzwcAbJ2bg|63wIgHDGOqlJs+o*gnN>{s|s92)xo@7iWG9{E-9P z;^5tNvEq9^gjp|LaN>!LN^8BTL{u$?dRd_htYo~tlP~p3+$q}Wp5B@SMAbG;7jS>N zcN_8aQR4S4ONU{javO(e?Yq6&@URBLM!Kzrz7T?g*4FcuY6xfl;a0t3DPY2R^QPYgf;; zOJj25(e-9IkD^KYYTU|{xBnSDc6h5;mQ2|sou(`Ln-tr)Z!>FU&r5A0Yc0t70;wdB zB%mN;PQ0gzR-!v0Ne4pFiqGJ^1}BPD#!5;jn;y4K5;v`}x+$JB1#*Pb{?)RSP5j|; zNH}Y4FYpS+1saMmAhFv4aFTxFuAjU^{%ca`3Ktk+r5u-B3l%_sOnD2qmrAnh3T)^N zC9HO2!)_{_{b31JQ)qU%!eA2hhJ<%Yi@*(zj^D7XLf%E;v2KHyCWR4ZmN{5bj zV_?6q`h&n!yvm!>QRCfED1MeBNIR)aG#Zl)Fm{UaZHcJsDX<`4Udc6EZ}#6u-Kn?S z8?T2LCS6f0co!IjRyYvo^jp>2709H;)gn7|UwDS8BOF&sJoEB%f9A{%wAY7O$#$y3 z-w?aB5ZH^mupm@)NuB86eB}@s-@cX0JooB#`7QFlHu}FhrHbRP3o*g&I#}&VB-f+- z`f~8^eV9W0Zt&aQ`$#L5$UiOWv(8>uUUg^&IW!N2&<>;eBx47^!iKH!NtblOFgJh- z#*@;e`H7OERh-0c^^qO6>~SVR802>sCNYsxF8i5e9`7FQI@t3X_#ypSOs&}C5>-(t z;jdyN;NA0ZfZ%$*%l^Is$~nozuq75P;&~*+z8(X-OqkqlCfa5)JOU5fre?TMBk@qE z%)x5RM*p|2-W`LtUyo*IN~nvIE$nq4C4N=hy|X@C`HC{2E`YC_`Z=Ytq8&wfPeX;= zs!V0xLysM~QaQ+@OYRHP*J<#(9QH49b>dn*Z-Y=CLWCUiUV>ha9+}rkr1YP(%ff2> z-AY)2%^kEMt!9d<0Q`EHF3M5s76Ad7!oLce z2A#zRxl5PGI4ZXpujjtr2IkYSUO)W1;5wWwo0B?VW$F}+DVh|CFbz7NZPKC`&L#iS zEDMu{)vxlMi z07ZLlr4G{4;ly+8(OGDN_?2&-J%F-<&PjSqvL&d4LDJ85AYx7GVL_@+1)h{5`;X`e z?P?;e{bWNmJ)KasDRst1%UqQvXv%zztxY81ofpogGG{)Q1D~V|4eS|;4s#enmz#Gf zT=$$F=P%vzj$z4J1XbOw7+7cnO!8fz8t^OxQ4IwWMQ{>AbrA3b%@koWgEy`D;i$J`g|##=Jts4>1xhtKq9p5C_MdD}UZ_Uh=DZJR6{8nU>yJ!{RC0l2DZWc-u z<7jrJ!4aoQxWDVFfko>lzYbnBy+BI*hMciT7URzVL||_Zu7Rijxc}ql-m8Rb09yyKV=-oLQ#mm^Sb!;GvY?6UH@HDXUq~t` zP@k1Jnp}GPxC-&S<@(O=b^q`3=ZK9sMA@J;6)tkHBUsQAlNU-77bShHQL2px?uT8R z;Rk9p=74d^uO53@C=`tB(<-3X>1nI-jv(%vo8uw?uFXgBKo^0iMe;*D$ykagw-g3@ zAXIIiMC*trHi$R}H#B8x%U78vVMM1RU$zUH?`|v;#zmDUK{!~s5JBl!31LCCVt>g_ zkIP(TPvhIZ$z}N!2}I}u4x{H)?Oq@?ml93l5MmG$vpOB;)C#Ul;XvbkaB<+{@$K^X z(S>!`Gz>6Z_yW_&=T+r@R0jhOIu<2zcB7@0-!nis_@d!nH35RwM(c6C@WAq)t=14c zGShtet;`n2ywb5OJV;UZ1;C^%LKNm<-ixBdvMa5GD@|i7<7HYwk^g>;c&bZPVyv;Tw{l+7-Ct&jp zPBv95m`{!8O9DAvk=XBFmK;23HEQGhGn-olZqIF*3Dz1@WK@W&zfuf`^Y;Nt#Bft0 ziSHKTA*O?G?WqCiGX=pU@+k2EHC|i3?zfV&#Cqh4JwQ+B1i!v&g&KpzVYUOxWHfqj zh%E@O07;uPl21X#VXN6gTw|U`q0)Sz=(7HrGNmoGc5A_P8YWqoy`hf+K@>4LI%FO1 z`q8gmII3$&tM^Ms)IoPlCxY%6lRXmjG^YOK0jpqeT4k__ zBE(J}nL^PTw}=y}PFyEKzxbHb!gqbV>(%y4Q4bXerW^Gq%$cF&9k9Gmg*q_h{XCV6 zt@|&#YEkMOYC~}riHp*^CD8OMk_lP@!857i^$xzyVVStUpbF=%s{@jbfwV{B`n$J9 zDpZ$liEo*PbgJN&5ySB1Z4w@%w<>#*XbX$hfg6VNB$U{Y=YVXJ6Iv z3=5V!XiwhTid)$5(Hl9Y%*y%GV)3dTjYTCU!df#)fb=5|#~*Q9w#Rt?GOQH-V^p%a z!VMGlKH3Yw^PK_Y(WW4)gp~c|M$clK#*{%M_}HaP5Ad?%f8^1!Kk}1(0C`k;gQGzC zSi#rwX(T`M0KNx4+SEf(#zo+ka@ctyDDuDZ=>8l)9|7T|0t^^p)*(rTBT4p?Q_vCk#LcJr+p98kh8c+E0ws>K$CP%> zp-7k@f<*jyA7PQ2nAKDf)#rM;M_$_>e~p7%ElZjoe%fbw!>(0GJps{8%kJGhJ#4X>nXu&!pG!FHP9AKLSfSGR_YlKN`$T>= z_Ivo=%r4AOU;pU?(4+c8_yqr7Cl%DvXiozx1ewVxrzc9~+V(SbkTd zZCZdYk0Rjwo#`p)Jm4@8A7Vg~_d|l}l}@PFX1%ZwKgYjoDid|!GO&bD!(AE)NK2!Z z6rGZsxKep_!Jx+}hNH_Y$$V{b#G^Ff5(*om5>%Q3DjwLhf&@$~M% z&c}Ky&kNyAG&?c~OX^h#-MHt4@O91JVpoM-M{$I`hg!Y@a#dmos2{(_7Os!?2}B)0 zk4AO>LyykmhDEG#WS&Rt|3)d&=3TgTHeEm#&JWulWVByVuwSCQW>nB`v#t<^x>Az$7Ap};oGa>;tsP%uJp`R^ir1y4`SNMNaeG+!BsSGZlHMCbgU^`_=b{bXYjt}Hgg$_) zZJ~TMoY>R`!9mn)ohlRgM@kAG#qk9Q0B8M`*nf>yuNoe^lMPI^60}o|ct|Lx!en}) z*c>7j!&nthifjyh903s?eI$DHQH+;6D`1gxm!Kq%QuBplkN1IUoj3H-1uSAr0~LVW zbw3JABFVaW0~Q)c{Ht1=2dSCoK^A_J!B6BkdAGVSi#7R&cW(Kv*`$P!W;A@mOqFz? zDrM(S>7rCD8yd;FSm86)gn^!$XnDrY-QXj)hAA-P{HQk z)~F*5?J#*W`{2HdrU*$d#jh}@;6I6qh8QfI^<5S2WwECa1`;y=!AFA@sMnhR2OlM} zt_WOi(CwqTOXV2d#oCoMSI0Ia5kdkd7Lb+PJxF27`VT&e59J^h9F71^FJC}5sY)w5 zjO3Yf$D9=B04Jh7g(;>Dz(@b$fZZSa3nw#$n}E^62+6%Jtly|x@^dj)V71hJ6Fz)X z^M+WExos5Z-@&%ZV8aEGbRGfXY|N(Xp2UoFbr8L*vVf1dp9P8`N166ZIqD(jEeZT6t{}QnilAg4 ziZlB~x-=4d6oXbiu*AY)e>rIInF38HmT<8&qg_y@f&-^DAZJIcQ}U?@99a0E5&mGl zWu7O)4Kn?JpI6R^A8>}TV(hOb6||6GL+qmd0-S;YuMqij#lZx>Ak;!EU{R?CA61BE zu^1y*>EdkcyFf;|CH}NXa|!5+TVGW5H0b?)o`WFWuWD(RS1XiI2zlN_*9*&p?#!Q& zn6(=qpWC&amqK@DE5SO(RLmTjzOkPcLRYP>0EnV=NV_me^F8n&rTU>NXVrp-@vghF z)Xxj%SEAIH>*9~1#CWQ~f>oxhr9Ch;J}?17Nq>9y5FUT(e9&r}e)HJ;anP%0-<7w- za*wt=z_TH(a)a;T$ow!6;T{LmBl|}mMJT}Oo;-9q;-f8iFvDcnj|IIwp8B}6-985e-^7LnHr$m_)|p^uTf?4wt_I&xeki{tAX80 z&)#z?2rD4DGy-cHhk#YHkLMYw>(<|{93FBlGwdIL0#m!K-#`25kFXqg?cY!T@ztn$wr|{h3&}h+_tBr3KR?T zkWWd|MEV3-Tqf9PBl@1Y0aq5*v^GqxNR~}OzsYfv;>7Xfx_{o702)=9zE$c;%+^= ze$!$rJW-(BZqf9>VY;l&uW0sQkdn_zt|T|d6}e#%&VZjy1f$$~Iw&B1h8-=&Hc?`% zZ>;~O?pQVMb>Gc%nAD2kz!-rAcX-3G{rq1JX=5E`a4U=Ab$={*W3k^)m(T7Wt8;Qu zbGi=Kk?d;%)@I(n_kZ}U$`x~0m)AMCs&KAFm7D#?A>IC`mpHr+jj#f=mcp1$S3U2B zH1m+e_oi3Gn?G!k|-*1#aYi}KutkzrrU2e*zC)pg>{3vNSKv$T7V zM2Btu5eBw9h!W)bmB40@;TCFN(sR|*b7jkd9YZd`Ny$MeAs)6V&anvdc52^0jq!lh zdPsrt7vo#*Nl;cN+b}QtBxsJ*NxFcMbo%vLJ`W%Qo?q!*+XV=GZe1ZRTwiU$Yz+PW z)lhSwYnfv0I-g2~B)Dk?&Hp7@l$9%QE3OO@PsRx;I+#Y2T&FZ#Ys3KL+Da#Gl6Kc7 zo%dEnN{Uo;>p!?*4lWr2O{{<4FA^?17%#jWj&y|e4fCXAeE6c6+KO+s!)waMBo1sH z0w+0;gQA>?U*X&#q#Ga!nO!RAg8fQ>xj>SfM#nghH!2XLLh?;Bo%m5rkR{AaEZXQ^ zFRHJIj-tY)l-j*KN-A0$*-nsv*-u81QE679c2tWwFUD5^(tdmpW0k%2gPb_jG{iMo z;0zVjdA+`z68Z8}Jom@XP&|%ljy0{dGAV_FWJ}_fI|U5YMKVCaplF{`%RI0mzyeDlN^Gsp3X7pm4FhJ9-EYSzf5 z2rlWtret)7O$2?CJ(=`zBFdYZ?!ItkRJ9Rl=G{c!wi}kR(R-@tL5Yh4N~XJyGtQ1W z?HqIWqN&qWT%%M9SdhCLId>dxkt2+HH9$*oeJWW3&T){flZ zCi?3PUrQ}?ynCfWgD-fl@2q`1CLlm6$7@aa*l2=pOVm!U;G`gUX07loJFr;P;%qX$ zvrLi10y|er%j6(0D)j18AgdLLbRoWDRM8S*Xx?MNsQ3*y<{c!+A6aH^1cNEtbI6B0 z+eX@1v6?q=V`83f^Y%oTmn8j~)P;_8d}OvszMJOGyNlMTrd*HN#JGn}9~oI8r%{nW z-Acj(pqjwQPJ(?4RK!Ahl4Rt3^-Kq$jED*GB{yaRiRhKSE+t}iYSx1XJIE_1oL8g$}+Dy@=VCgDB3^ca@I&dT!u0*v9zc6hHv-RuM5(@+y^Qwku@ z#tViXoSc*WYnzYU8`sk?-T%w!^l|KMX#NHD>11yM^HGQM z^54t%#^3p|a^mo%_>0rJi+78@r0kb6#_0effi4P6*x$KsrWO8QF7HF&8zX>qxQy$r zz!~4`_Qt6On3_%l$5|XxI4;}ye&K>(URKZ1$n}TFVb(UnA8iJ@jWYpqA%)KI-3Y0G z=Jem7HvLW`#{LcnDDII|Qw_Ml9HJj&5D%yUXXyHhS120lOd6V#YKTQxZtC0adNL$e zZ!ztdo~3*;UHD6mN}!>k#_|oSU{uI6g;Dq7yGIXm;79FU2i`=c-mn_AJ zcbS7)(B@F9(_mh$>5E7kO=&dTxeupVhd0VxTj1Pe`KW{rwy#IQ6sq78F6gG^MEip8 z4h#y@RW@{hv#hg2fz8HtjO*>J)#n_Xp@P+iFIbS^8#{Gt=G((r5f41kfi^PQTp>%6 z?$d4&VE<|zcOb!P)HC%<%!L8^yI0(}aVXMUl+bd3k1`eu_W6DdzG~bX8)6@yhEal(6J%rtmRsl`BbE#p&#yH`1g!O(-PD z+ti)=;k8VfdLmcRT;J?JUA;Z6DP>lLwyuOaBN*6{Wu|&wu#e^^n=l>2?uh1HMzb_DLJX(< zDrHY|IiX4(a%NscImVIyQ5FHb&!0k)k9Z{O2_!V~u=~NdUbZGgG$I0LH}cq&j1g|=Bf7lm&Q%2$=XMDS*i(|ioWKrlreNCMWG z!%z4`uDasCHQn0h(&vSmJ>)%_|gz*PGLNUsWE$;o*5Z-9mn!zAzRn; zbD@?{puU9B2e_g z%ID>x>$HWcK95J9Tdfm!GKJX5hiZ3C>7CZ_wq9)hwqwjhy02-~%KB6O8(N6bN$M0uD|2EweN&n?MM33u@CdO`)_1foWzr?NQ z{)qJ_|9YR1-|bjOR`{+_kDHEcyzbZD0{AdM{*Sx-`9+&jS%PS&#UC%9sLc*y z_1N(JYqj|}CrXu$cQ8Yp` zI}`adlB1A@bln@9a;Y4TcH}(!?-j{w4KA&MAZx(ws^1lRWiMKESr=LpD+pf`4r|ij zKLH^S?SP%U3?cM=HpdMehh{4xT)Gt}rrchWuL3d#2oZM7`+b9iY#oc1WNE;F$$2H3 zr9x$&QF3r@Y>3t);LI0y zRERufMCVWdcxv6S`gI<}|1~yF!zNLWJVPM0?uZ!rX*86QM=0sj^ADm6dN?H3?ar3p zUemYuI#e8%yHF?424}u2*p-+S1b;A?ZGaHj%yhi(+<5&-xNd&YU9zuCGI&vvZy8P5 zdWtspiX!dZmUUORh|Pxn*lZr+^x{{ddWD8K(C)8PCKY~Q9Zb$syg9<;jP;yLTg_H) z>RQcVK|*E!#UUV=J+Cq{mLh6vEG8z33&#aGiAem*ZiGGcC7j0kfkWGaPK2MuLH)dze%rL@@QPx?mv3<+Owso9J8``Z6weM20u z??0?|d5KqjA4Vh1xBnSXE=&hI8W-K$E44qYDEAbapM}a)YuM{|M5k6msV2Cq##ymT zyh(|{$Wx&m_?CpH^91}`rlXmUkEfCrRbsJhHul+4;IEN7;Bx(q(m0s@N~rU<)%+r2 zUiLi7r!OvDtCB2YktWjzxvhr-uU=9;HwE3MscoE)|W<6_=O7Dl8`X^__evh|oU@_lXz^GJSMZcd$eg*{ZIQ zd*TN6?T~WL85QAQH4iWMJc_^Qe8PBc1{Jx~R_G|3Fh@JggcGI#$aS8o+Wy&v7Bcld zcVZE?UVJh2n0mB9sgQc0ag?e`Ndrhze%V1IKDlg0G=y&N=;-mb`ShEUU#8Z+lnq-W zcJ2MGz-zX+SHNGgZ*vrRbvN3cx6q?Z&*?REqM>2npGHjr<`G=!R2r6>w9Qn4dq&keBzdp{Ad#UKWR@4<@rreif;cTmSc zD}=fjC0U2PTsYjJ;?1lL=Navx7dm1S+NjR10#hB^w}Lbui>2kdbl)A&$EK+!)ps9> zjon5I#{M75?t#0~w$T=JY}-b~wrx8V8x`BO&5CV1E4E#+Z9CQZ-glqwz59$aM)&#; z>sj}D?m6f6wJ_Bla3b{s3E+Bm@ZQ13qBp&lGX5LO{XQm@PptxZ|INF(ASIuQh*U|= zm6FG(IdL>IV!S%uhNV4X(Uq$>txHtsD+dz#2i=Z0OS7~c{vHbI&pk|+odO4-`BiWS zc+*)Ef@A^T4fItQscFWQ2t&%CtMCx%^d*q5-F=9+duYI<@y0>r*L-$dibK|*0RM>! z<-q!KV18wQHOBcpUq3JAgOq79SGm~9_l*Jy;AbX6bDar7~fj6$B4b&m~am2&<~Z&FGvo;TO1KgLsLSKm10C?{dHIv+1bUL z4H!m5n!fv;i-lkWMkq41C?Z5=;hb%ty=&lLGhf1%qDmns^rJRP8cV%Q?*j*oJ#SNz zDL+obceN^hY6xzyZ1n++Qc~>ctdn`=gci*y9K1+En*+*!@N28!ff`VYk9tV&sLG1b zU<@li3i>vsGLKeutwN^Y<|)xyRtQPbQuI(oAY_eq5@)Eawcqzes6iO9B}xzOl}+ST zGpdK#LubGoZbE^vaF_-QC0!eS;c_JZ+w5$~K-Q{Hr&OZOI5|B(y}r6UWl24n4+*2n zVZgAQY^UXyS8*`r5{xvXX7jjkuzb9*$2CwWi9=6He^9lVgqW9XA~#iWfjjs?RB)g_d4=op_~1Vvr+(I^K^=9aQMEotkCz+`+M8)59k{7?-l-chWvf3$Y*!$ZVc8Uo#h1&;#k$o(f9N*rkKb4 zaslxyNg|DauvOoia_HwpUVRhzf=hBk-9y)bKoSJw{l>qgGx{eP}d+|BS@ zvvyQtr!Yr`g1;RYzXu+eU()I)TijS%JQuLW2q6`;uwT+>BDQoOM^4M6g7_}L{{Qwv(2q*`m6O1}5J!3@y1kIZ{bA7grxpfEM@df+eg7iaK9v1l-3?Lfy};OSVxS>h(~Sg%Q#u9xAN zURvgK#hz_?dlNSekapLDHfygLiGbq-HZo`z0-mcDv}rRVmu-b;-OW^QQI!-6{z;tt zCMlJ&SGl1UE0L6^LIpLmM&;T(RhzG^u?>h*)+o=DE1qu?mI=UhMUSA;;hak*9K3sGPF=sBihcS}@ zSH35pC~S1CGQ@kC*~)xX{8?je+}?#R&OQdZ!3RJ7NbJy2!&~&wQ7NcQhyI9^IxPwU zy6NyPNAPD~ofq6CcjQoUmd<CaGT1Swcthn!|2hWEolq^BZR85Y!$|F^ebBdHDA)(|6qE(MCIfVw>3F zauz$b8j5BZE7sxoLQnprSh03B8^#h`$QQf~HT9AXxQLPyEDZ~4%FZ`yd`N9|c#{t= z+Bw;n$AWn1YpcCS?lMn%od|PQtgB4iSS3Sk!3eW7*|bcySZ4)#V6J=`f0PG9fWril zLd`);nnd&>Icjq7nh_vv`WYjbLT@1_Sgh;`s@NRY@LI|%#Xp>TR8VnR%8UFJnyX_O zq6zj}&bL)zYS}HSg*&s;cxJ(w)oq9fY8sIepOs%;P|RW9^7Dfaw{v!O_WuxJEd}Z_1w?wW$7%7PY;G+9eCaMH( zjKQ!aZ0QG3PhmVUQXVF1W*s8*)&F6NXvO%4+&I2FkD~PZKNduiN}@~f_h1ebSPcF* zn}ikhw8%nY`-S$mgYg+BTBZHt+yK4fW1#u%@bv8!GG6)pcCOg2ly|be z3wk(?IBgDpkgE&<17mHDr{`}ELkQoyeSxC(-Yz8@j^~|L44D#LL67>_ErIv5%h`7q z$PEU}D+d#t=22;6p;1qRdN>tJjB>nX)kKGlN^3M~!#6XgMh$HJLP%Q^rKEO%SjrmT zudls{8H{6F9Gum-=WwJJ@tA(}C1+jCDshWUdY?L0z;UN z8>)YFl5-R{F{!N~Q2yhcmCIR}D+kw_^zXI)mW5|skZ>z{<7ccY!bnbeRE`d-0;dGJ zNxo23`-YmljWrz&(Ep{Hl#4;sxIc{VSgbs2!3^bq5^A@(JDl&ze}&Bi6h_Kqc4GeT zHj`%j2Ev~FPNJdoO|yNo&oSs_!&rX8-DFEer21&kQSmx4HUW_YIs{wU_LjUS-N4NE zOWt`XLPN{)Fwup@z7IhF3)3R;(J2Ks+#%cGp&KSHe!1yuy zl$M0Lpj)%|=|r(a9ERS^AGeuLX1`5eZxGYigUWKw{_F4Egm1V#uY1o~9R!J)2h?cM z#1Z1wsp2&MU#m$ycJp|wHQY=bA1wr%;Qj~KBVO&{zrDxOkmHpb;J9~g_T_*d1I9Di zLsdaOZ?2z}`;U|(-{N{jp8pY4KdX)aiOt;~_c^JmR3t~xyF%DGn_(?axnbTq8j`f= zcdgPi%(R}@MZkR!9~-(L^W1tkk{qFN0-XZzXH2nI&-c0XgE*{L&x?cyto%tplziN10u8owl6@bm;oL5Z;Ghc-Rb| zP?v=jdHu3bWu-5==(y#{%v_3$d)ml}nN$1jJlr^^$%#S8o^0POj}{VbvS5EIqz%Tt;J5%%-wc&r3r+QGasQ3cCE7WB6Q>s!>Za^X}5@aLZypSW0njP)W6lC}ZDfwPN zV%yPwv&)L^abC59piHV74@Zs(mk}LIKu$4DJ35A7*v|_Wvo1UzqFLAm%IfngAcb)q zKwm{7`>;SD?QQwZu(=`is$9*EY!D;4s&bv_8Wxh|j;iWy z|0XdXvy#$53%fEg%895kCne#2kiTGZb)lIH*AlDtGnTz4koe8k8oA($nOj z7^LA5^rZOQog(J+!^ZViU@II9bg~xp%})*twePdS!`Jov)3(>@@r9eY?Z0S0%u15k z#&;=x`&OoIK2a0lrm`DK2@Les>#)O%M{AUAI@16tnbA#`PSGRyjXyQvby78E4AdJ* z#eh@8)puKOnxs=AUi;)Y#G!^dK)5Ea326>*6U?mm;Xo^`Sin+&UCxl7QUUr?>}4Jk z+4KXyJ$D&a^Orax5k5J}0a~8UaaK!ZO^|zf*?-pnexVv|bd&r|b{j{BQxEgfV+>n; zta@Yf2U_pnhljD!Hq)~?(cjqi!0#9LhviNa8s@1z?JYIZS{lVwmCDO%RB(&660DrYcT|A3Vml3-G3hOhubDwW#u{^?h-tQWfCCHAF6M|G7rWI(q-Lwaaus zO^abSxmIcymiXTVs{VuXf9l|>uGT#tX15P4MkdZ<00(PuX4lr99vnljPSjYKn7#EB zwQ3K64~XLH^yr-+GY8vD^F!Z^9{C99-XiDhU|2T@`598i4RWQyA6X{3>HYdjgNH-; zar3=LS4{MChT!KTr5C6Ibdx!|Lytd@~HfTi@Eu6$cmF1KGD?teLmwq`UY1gHOo1Se9g>z4oyPJR^kfA3qQndt!` z_LkkupIhI=ARC22f(G4%tlrDoaYOoIKEd%55-4KCaaOgPcwi|#DR}J;SCPt@_(R)# zmS%jnuFum58M*;bP1w1VZhHR2lgH-g8aeCtm=?Ib-a5Rni9zl6GD*<`MZtJg$xGgl zLN69E2keSLd#ZC+D!$%@-O$0m#W zA{uN|E%Mk1EJMoDdo2X9P9&Z$PjR0%U?s3JcmmTY_xDjh>UI7ln#_*<71~%pDr|FQon;Z8)l!}-9FL+DrxjYV7yiOvGOm9(q56IPHsJO2 zad1~W_Euf^v=7@enX8SN#bMS(S3u}wm)`m_`~6!aT#=a|zcqXlm-=C2>T#Fs zvkRjK`xGU!h2tjW-^1B$`3e9P+??x(A%IBqZbt$CSVQ1o&5gH%$i-l=<+ODz5Ez3| z1t&=uaq|In%w*+rD_!S93D{~xJ-@DHv;-Bn`L>yg2mY~{B1pe&CTZwzo9Ud)_tNz= zvfn^gq^^co6CEBHo7A_)XXt7s zYu_n&1X=~(7-DLC{{yuDxJ)zh$Ya;M6e=@2IMTwLiwrtVGrxmpy)PaNc}rIoO5+%} z#YK^y&t^)}db%nn8PgN?!QGE(8)frQVsgzdUX5!`+bCb$AE5poR7~L+-`&GD`h}Ts z(Pc7|$VTpLxxbsA`2`+_AmAqpQz{s^W6c~~d0;je?MZpN`0CnjW^Z_+AI|sKwBPqs z=)CGr{o*wNBj6uLC^=2f1MR*PMR|qutEE=Edmj^$O4mGO7ukB-9?ry=tHJGdI@Fjt z(WRjoK5laQpr6bs&+Ukd6XtB+?ecX&&z;`#++!KDsD-i&@9%k=vkn*3KU(9&zc=O_Ghzta>0Y!$PN57)Hly4h#2+2=1m+7Tp#`rSn6WjLR|Pm!5fN0yU_WT03K%owYV*v*{> z4hI^WJ%OnvMqxEm!2gLoG8H`DrTe8YK{)SEgm`7}WM}AVyL*UfU?#Fd=@|jP_d3)b z_7vfiYy(FiYb|%bKG*+oc{n!n?KwM(L*DseQx?sb2F4eUHSRt66H+}J8#!8vjEmu{ zQkJB~323JGOc$L!>nTH#P>XWD^wNfLO5u=r+pP%*ag^wXBngkkb^|Mf_{rnTSfl^w zv3!I?EIFh`LNfg4olXf3RFXK@=zT=_P#Ozt<3H9?WaxicOSH3?41D$1dVL3QMOO+G zm#{0T3VMPAI$vjosseu7IljZEJfcXtj{VkMlsmz=d8ye4;lRY}i>y<%QC{?Qkof2=7{k#i_@^+AZY-Two0Cf2mi}Wbk$OxOyL=lvGgU6ovl|KiC@I zo)!+h$F;rn8Iz-|0Xq&*;KcaXOY7Oh7GyY~?#NOSgE>-W8{l6&s8b*X07b&Er~R`c z)a3kS{=TF}y+(%PvhT@->=@GY%@J7Gq8rw%$^V3l1SA;-WVqmXo0Eq? zYzHwK{f2mer%@)Tx*FEO%&zU;@zzzG&T$cX)rKDbNsnNscO&>HZVT*9Ii zZGuE!rJ~3eO@X((YV3nsOtea(43?D)>c5O7 zbTa>DKS1lt4xDHu;3j;quua|zcsiEByw~_@48LM*ZggCV+DvcFEOpSrD7oGp=Rl{h zW+ci=uqHDu zacc14lm#L)73l`bM5`h0PD~pUd~Iho}QyLgH`Hl3vC`}}RAU9Z8hXyh62{6G0YBNZbR)^=)h7{fe4;h4f&dpae@`OP?}jR?UR8izfN7ul zL+pRp97cNwIhm{EiD&aqO?KM@cX@Q}fd;ag8utv^m{@f>f0Py3lrll(;Yux(o4S0( z7ck4%bpMeD17E0bIHxcyG6M$RV8cO#V^mW{tgCK>Z6e;eDgA!9R!pX}L~dIMw^Zvt zaLy8Nt-c5#roI@pE>(Rwv?qGVZSl(=7kSWnOekaVjmwH8b7R~|t@c9Hv9IdRD!uv-)y zR`a@6Vz_tsO5bCPJ4(s!#SRfgOv`6YhD=UDc=KEI&X?g}Gl%}SrP^@M`4D$}(=%#i zG|O?Bk>9zWw)98mK%(JsV}Aa`@UxfmzM4Tri7@26ux6mmKt&bt`1hdG=|E@nXqb}G z)asvqA9ho#|3Y4Fd|tI6n6R^u*b8q4EDsKbkBh#e%!vw|$&#V^%$dwcn-qW*)6L6w z0Qt0=>2aK*FOdfBqa1Hku=A_Js!UipU9x-`g6e^ zr;Ot@S}hYw=9>ORtTs7<>VaG;Rk^W7M3$11t0Sb}8;JKq;1x~>qL|C4U znYC?I61}k8rCQ8HrQvxVOXu}$yUsJpEhJr{OPVE1r%#R09$WNnAi<&lzU;= z!B-+V04~cl2fxP06Q19GraD+$T6iHF8ky&>0mG)y6pOs#@neT=LgSXWe!fym18~K{ z=$f{vwR;f@;c7VarqC<`K-GtY6LQ>mj3Vy>lBHRmfr$R3A8DTE=wyEGwTO)uAs$;p z-YBess(0vA9PL7it<3_Y6?c(os0+Z&(P1R~q+818S1L@Y->WlKBwozQglyu*^hKb( zwG=ur#w~ji_fEcBlp^>uwzMe>ficMUB>DYmFYzUr9ly87@V}xyl(PkTK~Lj&e1wuH zXzf0ep~Zyd(-kQPo%=%NhB+ulLoe4CQ7<)~&%f;^b>=dB9=T5T$gR?Km^ef8FG)a| zt+6D7>z<6=d;6-we`!gOB^-AfZ-e}(R%DzOt?ie*W$3H6F68Hz{%AtwgM5G|SY4v;gX zxY<2h>Tg0jQ2e1XNnFN9R8u@PCkle0!eF-9@v}QKkQ*YkbC&CpN||ER)s9C`P8qsK4Uv%Z@FPO>0zF~}uh|kq^ z&26^y-VDx0J_QIcE2_d~^YA)71N=OTKkhJS4}N%_({&IB!}l^=4^oV;KTz%yiRqDR zL&u>m2Qk`()IWSL21m`E|3=ynDLWv@7cJBrUnC)l9(K!~Qx1*<5kw_XfdKxyr?tqu z1(x%)1YYx`SE*qc3w&o4S!#uSbqQG=@sKZTxIvPP#^DqEi8oj28t6ejqr@y7~x5PAcF++=M#w5q4Sd*no<(}pDL|kk^P0i;k&iF4fX7|btCW1K1 zo89z{WXVCVhT{5qZ#r>b5z;sgr)b-&852*>{Rk8GoZbyxX-VwvXQ}fHB_#A>n$>U!<}>^E;SE9I@uEiQ1h02UE0T0Zdf@+3Pbz z^ucjU@Y+X{n+&wnChnr=Y9{F+ok@TPf00>{tHZ zl%gH*-*}fZkI|A7fGGM4h@Jj5+cMI+n|IVPivmcO{vIiLq;^ab($#832H~%>7&xZY zUlQ*Knq{h?g*~c5xHp+}2ylubRo_SvGEK}1J0~wKOER_WjjzSv4)_5c&-2C5z2lGE zW;9E1>kA@DeMraxsK1mb3&cC$uYU>~H z_y*4eg1EQXz_u%XEKp~INshEfTuJ!~G`OqxAr)6VtUg2pK!*oo5kOcDfDedM)3%e? zSBr}BgGN2krp&Voopk+c%i6wM^4T4II~y|E%`Nr5Vcz9a)8^L5|Tg;aUEH~)r> zEvoF8)=R;nJ@41QrJ_;if25+k$cG#~@dyDIWU-<=vH!1BQPsD9 zrW4e}^)FTgEdMXzN&o3QoRdBBQnu|bJy5(8(NcPJnQ;s>9h~Ais1f$7@gJ@TT36uz z=8FFGd~-$Pgqd1j|2tM>op1>-0KpVzgmP|;R>Vp!v@h;YV>y-wT>Yz8*te9~Q6DLGX*=1TTKZ)8F`~ z!L(z(yEW>@F3y8XgT`!}k-o0-)~p=plExoT72T0%uw;KtDna(tSzUcOfZ_rdunqkz z`?zfoqZML|q=zhJAB}pSv?hV#9)piP_)Oo0KZ`tLggJSUBx1B$q@hk90U>I%9lJ$^ z$Nyi55j|3>UTS)5>VHR!P}F(<2Vx}82{iu*#a*!M7_FaBHfz4v7lDm;${i>}5HJg@ zd_h#??O(jM*+Uf>S22C{1+Y+1)~)1`S$DpDdA*k=oR#48>=#8aTxmITZiVEd)bcFvf@pES&i6SZQ?rys6H|J5o zZmfDjfvYQsixsU`MS?eiiDNT#vljb0g)1t1NJRCGI7>Pf3j7k9x*-Yqo)Rt!CAHB) zuxze4#bmg;G)IuV8o}z1J#;7lssSK)_Ex%5g>aaXO#APc;tBwxQ=`enih%NxrdPZZ z8?hgjt&Y|MzOa*9&Q@+Lz&xv<~Q^kUvdX>9B1-hzd^~HS}OO0=5XD7<) zBhsnhcC_Ek813>zzC3;8t|`*vRVpq~>RYHbtKoPsiqi4_ZWje2jrJ~7nFm&uDg>g{ z@M|3_?MY2HP-?c3sQJ?1uN0sV>?MqRpzII61adYfHHsXw#C0edjRL`jr$f>oI!`UK z@;1bEsfzYc-IXBhfLLzL&5PXTt=9uD(-2|>mL4Vu0b2>O8B$vLP)(s58}ElWi^_04 zkYd63OWB0s5lVRzM4MMTlM$fblDV_1S2cvJJ_Yw)!)8d5n%Lx z@`@ZZj{bp)oHt|Qpy-Enu6ZVDNG&+}Nhg>ey7auxr|;$C)Y%*hV3%-KJB5dqbI+(> zLfu#q^T(!Wdo9dFV9h&0=q!HHpzkEixD_kTLf2N#zF^x~u)id8sKE)jcvsZy6mbaY zPBiMBh#>hWd11Rp8J}0Te{M!DXhEzzyiNjZzn0jV$+%f^I0vnB4Vu~7)3h3qa^!@f*HcD_@V2)JLA>a9}h zw+Vr!r2BiH+UMu!(8ck&HsU)~lQ3?$rdRGV zK-`yj9CE{1blP>&^N_)N6JQoDtpIKM6N!J5f_*ue z|HG9AClf)S4%P(W<>C~JwS$Bz+XVShq=Z6AKke(r6! zCSd0;v~7_h^_7Ofag~*?K#q&-nPPR~cF%T2o-BHdX(Xz=*A2dc2AhC#o_yR3Eq(Mc z?aXIzM?B%U>fK-)N;Cu07+V}TtD03iWfL~6YU!L)t|FzBdi7GxKF@(F4 zdYK!%?VMrZ+WfDSHnFt=QNA*HdVU*T&dDW}+7FUto@w>NYWLn{GzFfuK({M?$W#ol z?SEIdnp(vojs;OW3KuEGpWGcl3wNrW`C!5~75|;x z5=YKx_4sFY>w?t1cQb-Vbe`Vnm%3T$DloRkIo+Bs+GG&bYRX*>NBy<=^l~Y~m+>8z zc^8Qyz}+oJEoG8UJj^Krlo>{TERCUc zoA1>=rQ~u=ivm^zUQM5$W0l$^_DB{(>5tccTibjFuv|Xv$HnHLj`Ugz zY3VPM2b+x|8Mtu>2#%%&eQ7$mCjK}S=&Q{K3~-U+j+Yxp}?;Ph5o8#nI&N* zBWhrKFJEzLO1pm>h(@9y9`tpD1QcOZrk>}Cy3SefN5N$5ecev_!d`~$$5X5}P#b%A zXZBa^E+hx3kH#HZUx49{oK)P@@hMPtE%F69ADMK+E3Nfs*>|A-abOLNgMG$lU;1F?w|D8X@5<6fIWbo*&E+82OUV4!CoiIm%$$>Z zcHk{WjEWSTS+SUIfq;}G-xp>6n9SqShf0M2f_R`el=owYNjYjxdo85XSz5;6 zk9t+i>T)6&GOKG*%F>P;;b&eIEH(4mf-Tlg3M5H}8Z@0p2|A8{coM)_i6K&O079?C z^7&kZ<*);%n5Y{uyCRw}cAE6~v1B|2qv`iaSWRxvz1lLab|rH*>Jw(!stA@AauZ)` zHpAQ5rfv;!5MMG+BZX3O*)iW0GB8v~>5`3Jw>M#8kprn_Mvj;`CU5HywZSv$>ty_j zFz|X*V&E%%MmXLQ{PaxolcAVviN#sQBb57$ioxe7cmrD-w!Y)r*A7%bzV*GJsZV#cZ;Vb4ATSKT+e=VSUAoe9^X0(mh&3PhbaXkI8Cy)lkr+hhACar7f(wjmfDQsoVWeq= z{jhj!n>4a5&|RE}k{;|H+nJ&6@0l)^ECfYOR~=llgjxryrNdsPx2bkNWYFM;J`s8% z$9Nt)v1tQADG(&FAy#_X{c?y>i6oQpO};GX+U-uR4lTZz7S3nIZI&_gdAPg;cN@`4`#usH4}%C*DMOx$o{=%rK-0oN25Q5dFySE!$%SrcZV>; z-yl~L)m7Or;6wo*3lFG~sU~cJ<^CB6s|Gl!nuNLa057))c1v+Zj7Wym0Uu^?;SUCZ zkY|;#guq9Im=7%XUc6LC;}QiU%Z3P`&*DaA(8eSIu1uDkv-fBVBW&G$T7$5tv}pr< zVL27glAPLUzS%?NRC?mRBk7Xkr@b_s$LoHHBm5C(QZ#w#&IS3TTmp0VARfZRh9qks zhY64;ltL5|4j_^>NRfoQ$w)8R_6_TaSCCdSOz@DPiep;w9K=`l^ebl7TgH-=Y258w zFxbQ_z`+u?j2bNc`n&42@>CLQqV&SkQxNAqE!5vo_sbZ!r%)Cu{@YKWXJond@c9^R zXoABr8A~wf<;l9oLh0R*0tp5pjwcR~3JicIhJgD>ldNSEHqzAlx9WUiL#w~jN_PyI2Wbru=o&MbBCvjN)cfpfz()+1ctWLXl_mfoqqFoR-$fqYLn)i3B{Vw^`9S~DSKQ5dkw5*=ODtaKhPPg zH4TmO9>{|Ozl-u16&i}^W^yw46T~m*c#6w!*{xsQ%A(I-= zy9564>yfJ9RsAtMb3vP)yplkMAEXI`00D{zk-uvVu-^D+cld;KzVh-k`&uQ(Av@N1Cuv($tQe#%#PsxE_m!p&u_ zEX+6akFZr|=(PH+f6hD~Tj50&FnK_;X^}2_9@^$3OQa@}#WWDy(i6JG8TMh3b%u*7 zIuS&3ATZkrJ7TAi_&Nb-q>8HY$+8x#c$mo2kAV#5il8PZpAwB`#p-A^vd&y}+y#N7 z3R&?2pumt6BBV)2P-c48q=@c%>u2?-LbMUXC1hK~-=|kwSAd$FtF63#(_=+G&r_rJ zNqV~@q5d&vFmw^;KP9s#$C9^-77024{#a*FodZx_S25bozk-eZcacfHt>~?qbLsdq z_TR!nT~oEb)Tf2&&Kaj=A0;}F)a5S|;XCm>a8?O+~3GoUXHtm%^^R{k8%j?K0e~jO?aHNLd*PJ<38Km9XcFtBfQRP zxUZZeC$~LXUWRPIT<%!CM}*)!Byx%SA)xZYs)Dddu5OK^2sd43d_tsXEbuy7x7KC% z>~<4IshWDiWYgT;Wr2>IjrJ4H%&~2yJvc zi2{*x`o{1Z>^q3Np?}M!r6$NAib&v7P08^yN~9fqR}xB8VLhfqxUSv2=E^YSnp0ar z-&}4|;(Ww{@|s88h=~~{_~IGzg8daAfm35G2nz)bwZZ48{QMo$NCMy!|8^rlpw)>4 z$sw?>`N;DfgWhbfo@0>&=`X#Y(Z2RV6h*@Kcbx2oMlQeMt<7`ZsOaBxYlA#X)xt5q8chxE7&bMM;GaL>3}PQspAhnL$gxgix&&W|67X zus=;zhuK@HZ)!16XvX(dGVyt9VDcz+I7@urF}~AF{2-S4GD-fmdUv*@S_Ok5^GjW! zkx+7juB^n=ooG04bhGUt+{0>;?|qo>-sJX=maqadfGEa@Oo%EW&-=}O9wgEzkI+}_ za{Y?T!|QWds6seMz^(YSvYK5EGujU{x_G;Cg-YWBuA+=s=PPORibb_FQCWWpLXsj5 z!Rl19u%6VrfUMZr-uipjaL~0mBK5CMa|_))7$kz#uV3E2II%2FGc(fyxjDJraOP;J zDs=0`g&C?YJOvczVOu@@lpRQ+Gm>yB@pYAz&7Lm2`VuR_K3ojYa5Logz2yad{9LoY z&(5S0E=eC+VDwC|M{r|;rHof;)zD2EZ*aR4Q}|>rj56HyiZMgFK@IiL#AEXOU(KN0 zID`7NFsBReDRWs%6xv$_a(e7_NHU28Y3DofU~_UehV~3(;qJBdJXtyJ@pE~epYWgi zJY-pKKH7MEjBlfiuU@a)cS&A&ZF;XERNW=kB;V3jsnRQk7b%@)d@a0v8*lf?!6K-jj?Av;vfo$v z{4y|#Pr)y=z&J6?bEf(D8)XvzMD2;bqLxGzLMAfI)j zKKjg7yx;sdqUK~DQXVrdiTVECC*_7BTqy0|kHETpoe}&+zW=q634#Y1Tkhb4vH~PG zyyNTD1Xsxf!E zFsA-F6VxceGi}a7i^0OPLKB5h(pI8PFhFYIz3c5vr&L`x%|MAt_B_4mPf#ctfqK?% z-57JPD#l`sWJ2leNdKqZ4tY%|g`8+Q1#0oHg!%~BLJ;pobgKX~O`eUpXqO&r4iGfD z?AOumr*L|yOx=S�VC&z>2ouNC-5vZvPMyvJ3E_Z6Pgt3{(%Xs-jgqaPRMSKTx* z)^Rv#gjK<`ZAMov>L+yXv)~KkiP`6wKduj26HFPYxleJ?u6+aBFsHQ+UHACiCGW5E z7LglZR!}n1>OE31V6H@D(;K1fRObeV6(+O~*eP;C$kerMplw2)!KX{ff90))#%3t? zhS%r*y~^P_(tGi}8nb*RGh^Gkv%$@~VsRh1x@;Ys7MU?iUFu{KgTPS84&ee<=5*NT z_7PS_zG!-P?ltC+W!^Jj25a;Qe&3aw?{Rwpw zkl8-mp44(XFOEquGdY!QH95(sLbxZ>IKC?spnajq&S23R)EbHnp!W&DZ}&0Zozt-3 zGrqRm7xyA zLv|udk{UJ@ z6sUw=eA0_)lqLQ;fA=at+m4TiwvdDU`g1VLv?9Mqh?e_CgDo{4Zk&DJp?smjIypp8 z3#UB6p^9Y`?G%068Bva!WAbmF$K5oO4vS0;c^d&a)8Sox0fx#I!2~PS z3RC0EEy3vPv1l!bPE){{S#BU+B(IGLTxAF4bSENzz+7+L(cdqSa<`{49+n`Pi?RLT z#-Ffpe#wyL;J77sEZ34E*O?bb)S%G_#InSg2t&e9e7(xJ9OfHogz%2nVL+~9v&P3R zmtfOlb^YloA=ffF2hApa(09vTSwVR}^nlM5w>uF`A)imhm~S9Dds`<~3a&65FV=M` zE$BPZZ_4k@01*=Sx$BC4+;&fy?hCu92X~?7^c{s=VO|fH5_O*Yo%62Jx1T$3p~$TM z#U8!8jkJF`yh%KI=2FReY7q2)s@aP>?iatwfB&p^A5Q1QSuGfo{o%Z8_IjsXuQSQ{ z6I@}^W2xzj!UAhkt&z?j51s2i{v9Ba- z$(r$^rsXMuo_-*BO4t4;laZHg+XR!b6Xg6MRzpaS!+FUs^+s}I_MRGoq3o$h`h~pi zCk;l+##n~cGP)}=Q6_7d=?gmC7QW}FRJrxUiTC-Kb)$3|qpg-b!<5S@jE`=Ts7Ca^ zwOdEDW{d?(8T*ywo%h=H*kViU6+$JXe{z<}`gWJin`!xL*S>x|?WFw7qtmsp3Q|7@ z3+dFeq2t?DRF_t;yUW%nE6231ji~X(J~wxHc%1*mAY)vsya;o0K?(s;`nGgrrACk= z5IH@URZyla(NjuO5pdr}FXBKE|G*Rd#`TP1mp)18BRu~(_`73_KBYHd$j1YYPiDtr zT2=7@Z0y0eNVoHhnuyt@*5gLBcs**tU0+yzfbZSxc2ws1k5Rrl1?hDaapP{82p z;wXo`}iG@FbKsu+vuPba$b649usW)-=tpTzJ<)xVdW zWl@5Wrs7*Q*yiA+I`0LdTqU9Jp0a#TI6h~4+I&y+HIMJJGLJMA7e1m+b&W*yK8dgC z$r1bG0@!KP^5?@pOXM}FT>9-i1JM^yt0eI{4CoskB%O(mU0mZIr^!Bp+o|jJ1Kmh6 zazB6W??v@Gb$ET!^FZB$j*c}h&OiaTBgR&C5u$^9Kb(LztvTq38NpOJ-C1qFE(qGs za(d#_%FGce|GZpWY}}vQLl|x@%ut?qxjY8)cx~YVHASwVm&~+{Tror9dsPyzs|I6v z1)Rm3=duAlq=mNaGD-ayCuWDMN+?3mcYvjQzurF@?=QHaNc%$4C~x3otv^bf^5bfp zq)_S;OZ6uOa>X&Z-J@zu7st}?zwIQ1i-B$*>~3<)wHl40*g3Tyq7SZDtE;bZgrK#S z*<427^|*>s3*ASg@+&(}Fs(SGG6JN0F^!)xf?MzRExNyXwpUX(yUb@=biG=78fiLB zM~*f-TQ9|t$7n(oCi^z2m+i#UmCM=0Jv2KxFv>%@y#-u5D3w=L5}n;#C^=l?TL8`q zj{D*)f=9L;&O2!w(kUNbN2^l)YZdZ+3h%U~>|^(NBI42kaENaOs?#__ASjk)&5V0J zC?f-K=%Op3bOK@05aOw1NWZ52THOz-ysq5$4^8&xSQxxmZ|XwESfcc|fE++vs->kZ z*A^%&GiCV{-C6w*!D;Vvp~65q5zTl}c1=A{X+xF5ZTo^Sa<)fhhR+AsL>g+ziL8wp z3^c2TxZ$Erlk;v>nyymBYOGKrL4qV-89-Fw%p~=K{bl=)1?@cFEWx=aHun!N5TY(gY%R9t%!*d${qC!ZdqU09!{ezV|_BsQ4vm+2+Sqfqm*(7qZKG^+!RlI z;T&{f_$5$b6$f>AdAR*7{vtyvr`Hn@(AlSGI@Z8{_oaXU5WRMVC z{VEE_(pYuuu!Z60$=OA_6(^}Xi{4x+RqJRae8J6V=6B)}KT2Fxsf~k=5U2ZY93dTt zP|oytX${S^dkNqZ6&SqxtY!0sw~uHF1$}9c2|tbxc=6JcRmvmlm8VoPROjF+Z6CG( zIqqslqXpzP&wU8m=gQ5%HJ2jU>Jur6~;DH+WE$jwnadpanzC( zE(!fH3`m}Gu||DZ z5|96AXcs1qY73F7h>4rSOMR3^UxuIAmc@jji#(}O^FOZ*3gOaY;mVUH6=`x4!9&0k zsq`gKVHpIZ#QoeV>C)Q=4rLF>Sl;WcsOE%UQx?WZpr)!Q#XbBPpF9VKB#AZDU6?|m zaA5;ef}@iOIR#R&q&xT2LWky<#VbZbB0`On;;01(hM|K_hT%O}(@TctvA;H|{ZypP z2@foWT;9q}!K5GPt5mRktwU}0ysV8bYniLyAhDLbt zCTgvo=xnW24;NA~&#G$`&E)AUaRuK^xiMxm(5O`}2>#D!$-gL9%~Ba$o{Qd1y~$qE zmQMHe!N102uqg7LwxLn!rRQ{JwM=2|wN0(V6VDB(QD?1uBb^)@>Gk zqR2urmPIW-8cs_c6QY@~1>cPjh0RqSDJl@tU@3r@0tWv8}G*}J?0jdAidZ+ol z-uz^FE*8lrh<3$ZEX5X~%CLiDHv7)*Q9%JaVC&V4QOs_@9?YTHH#ig|aGD79Fk9gS5{l^7fNGc7;JNThvdWg88Ea^v*k*B{S+bkCigbnl!_rBX2y z->^}`QP(*{4p9^?;Bpu{~(uf&X#@P21z6JQpYtbpcH%Y|t|&_D^UK zMnPmKIRySGdH*zTqg25Qz#8a+19E4XDfR2sBKlO2Tn$cxH}vcT}6m&knu7D^*~tGPlvB*k==Z_n}d>sk9r z;ZUSV)fpBp&bw5i=V=k(A~FifjdiWX#K zDOyL1c1t(C-VV0u^>rs#Y|ne`wb%SfOR>WblSbp5sXOR64rd{h@~8_w$Rj*aBExLH zL&ASB|B4-VPE{*((XPQZb{Up`jr?Pa0p~2vFO2?Br(dPKtzT6g4)dkW9da z6C7$VjdwP0Ib~a?VX|fJU=$B(SvQQtWui-WbaYdy>+Vb^yL!?qMR)idHr|?kUVM2_ zEV4D_YY!}f5WAblDL9M14UAOz=vuk0mmjaP%F%A>vGoo+XV@jQC|B5`G^QP@`sI+h z)*Eht?qm{BeUY%YEc`h|#QfEd7w-IZ{^k+)#Pi0b2T2xnYzM_GhQDQrA8Eo$^!mU4 z-NM)3|J(LG^Rt&1?tfy70&8h*b9HZ^7pOENEYK=O>8GhFpl3X^?is4ZnVWg4C~`D) zsqRWlXN^VrWbtd}UfyaPxsU$I=kG60o($@bHgl{~v?K2IyYwr&AI*Tuol7T!dK5Ob z?)@hV4{u?&?p<+@-u1!^L-BT69K$nDn%Q0gaQ!}ATY=~nZhX?Ppsf<0Zabn6RPwpu zK-XeBg{|eW6FgGE+Z^=vhNZB)UE(&|%1T#gDC7MWe(N8mH{<;lR8*yejQVF#C2giN z+TXTnDpIuqultMqFK}>#|7T0i0jHE%_fC49J{>|#3B(^ie0xap(UH`KtNTOO64AMf zJm-w7U?$y*cjum-Tm1eCrgw$+XD`t%Fy3PU!U7z4pjh%-jBHG!sV;u`1*{25t?s91 z7}UneI}5jrytB47!eT1^GJ91{@5|kJF|WsumoQfji#GD?{KCrMSuWP47;7)!8r14aB0|3FdjMB3{r!n|ptNy9<|qYB^NB zv?A|#6ckMaVZF(6MV7rxtzH=$Ljz{MS_BcZn6ni!bVpW56XiUJ-b}&FLkQSgtU7Qi zu3D0wE8T`U8w)A5emxU7*$q%bqKeMn7EhdYkAESigD{SPK@P9qRdPMFp`67IVg%zr zPhDty{+I`Dp=p&UPqT?x0f3~+%i(NOlgS|Sat&U_EnJRgiLbMOWVRSWH0=D z-+l3MU`rT6{@M-q`KPRxK`pWP_w=5Z+=Mh16efe3vg4BKSyCxU*_cv^Q0QBjMEb6v z%pJv=jrUr>8T2Y3?&cr1@b}R{j@OGNuVY+sF03iCf%UKJ%w%cj(l z&^gf%Q@n{XY{!)`^xkF)6x%}QCC5j{CQ#c~9*e?+){@3wCmUaVgULlMe9Na(B4KM4 zmdna4?4>_iczF$kZX~Mdt8-6Ix-Y()KX;$4k0mY$X!OFj?i8kmkhko48fc8Izvc%j zyg@240F*T2h9%3!<(U#`BsQ>h+Z%s;YtUQxC=@saLsBB>zR*uT#DC(t6y=A3OyWsj zLLbx!OrKIy^QRmRzw5@8AW@+O9ThO&*;(r}w=f4j{yg;kk%q^_o)ut|rP^Z>FIN=S zS*g+?OgMEZl4T(g7#KpFAVfC%;GI%DzHH_F79^YCqwkEFC|jj?3>K7$w26wx*i18# zwpXupWt+{Ye};FP@4Moc2ngbPu<}<}weU?@Jdq?SFCe~3tX6Jh=P|&a=vVjnvANl+ zjhjyzXQu;`6}qSH!VzR~@?_)m-wYmxniIk5V6SXUU6wau_{w*6U4i0t<@OzxUS@If z*}{uYFb4@;XY+|hI0Do~q>ZtctZC77x|@KmGf zIZa4`T1yj>F1-f=Qwj#dqfr4u0kIOn5Ih+_5Q4@3NtW^Hp(a}_XNB(F3M-(j1uEw0 z;#Ni*Apf)NaFnZMoeFwb=t5D3>>~`OshorOB$$q^WIrDhINOkAWLsD-ok(Ltbx&usbyeujP;-8rUvLhLmpj zVUKcT$kHQ#z`IZ4MS2@BEavh^>_Y|m;Y47F=7{_gw_3@ryFt5-wt?`~M6bBn2;N(5 zaheY_9<%^(zLe$1GlB+06{zM8N|ubspjnl0(85Ve;b765@T-M4ldo24wbgD!j78_6 z(OXg#9s!L7lg|jh?TNk$WqyUW+RnR0?Z(xjbHEW_%peZ5FACAwIY8J@6z4ABcC0O~ z6Sdk%@Dat7DUwB#ftV=56-u!|c7pJ0#1eLkUf7(;Z5#6zviKu)O#!I%PHn1L2P3i% zyc9(0RlHo-DeZC!gom z01^an!`;X)JYE)svVI5I9TUJpO|gDhuHp1_+cr6QN`yu+%e)5Y2_~ybl5zjsp;V28 z?0ON-kX0b*Dh_kNrrH#G3#P3|p|=Na(~9nbd8Uq(a(lDXC&eirut4GsWwB8jqRprj zU$=eNo7>*swrk6_?4IpAw`Jeow&k7ox2{W)0KW`ND?l$LZtTvYm|~lwpcP?Geo{O3 zc*K<9|0X3lDibO*1Va(vfv3`>~9}eN!A07_UH-t8*vfX|w&pjN@jP;hABF;Vb7LV>1IJ^FJ zvhmYtcj`fNg>--(H&mm8Ye+j6XH=Zf{YMKA8H7xP7No0_!9~&433whsS|NE~102_a z7kpegVp#fZ$#WdCz%6LhoqEGGE>1aaLSP$YLR?R1sdY7*a)jzWmbHFIt;btTvlat6 zE&IzxranOjT9SPL^V5@1-krMLxO`98d9N}v3)i81{>R3d^?-H>e+hmZGKu*99Fg?1 z6#brWhlO`Y_VHL?P6)39ntj|@Da`=rwsIS*2m!IAsaQ)IH%}}~KX8BkyXQOsbj%zt z+zaRCpWS1mABl)nQf!%U16?Vb58}qhi<39qW3%q`6SOC?L=4nPDvYa8_r-rJ*546? z&&+JiI&Qg}B0@drcT@r(c*ED70pyFlh*kc&Ok;0kNGIFr1C@u?Hu%*Kad#y`wYA*F zE3r+Xgg%40Kb~H5AKqw8-dUWy>dwr}&75WIE^3FY`V+bfz=#7+;Z0ARMmtvc+;@1T z+OQ~TTB#nSepv-<1U0b~gvd`w`@a8@xya(iE|8STHAqgw{)F%N%xIz+I)JCKJ)7H5 zn7cw_>hj{3Hx$!P7?)GsvEt&;BJLc>?nuEgPJ+cm_3+8UT^VQ8u(9KTr&tc)p@FA> zEhS0_H5%lL-3ZDZs1yrSHG@{y9Kjrh_#5C3KXW{!LfMdQlUFNR9GD_?7^bjYIC>O{ zwR*KUJdSIxc}F}|vu=o$&!R@h16_tCt3uecgmi!@;R9OdTJF}6v%8KyAgpRqN{AzA zNd*fyxDO=Ys`J5kv1-+-@i7MIO+z&oBdVspsU*PA(v~4{WBh^s(P~57N4ZHul2Jfw zjI*FGq4MswA@EexS4+C8(0xL91no6qAV*lKbIX#CM) zmG**95Vc~DeI5G#M7kRTC_#fFa@vhTmTgLMjIj| zM%W*0#Zpd1Ej-HFXLjYG8XR$ zLY%d?7$Q_@ZG+K)0C3!;dL0MLa89FHP|(F%Kl#vX^00B^O^2t!;rJpSN?p={1WZqh zywu~PNKO&`C1i61+U%(1L3YKTdwTrew(kkNQNcFW08@kS3)57I8)iUYFnOxs(8KP0G&@arEIOJ` zI;}#th76bdniS@-GyIctvkInHRK|Q26^L?#$XMSt17#kBGv=$N#vSdznK90%OF0u| ztZ!Sj6Ggb7(hQ`@HZ$g{$hN2G2NaF|jq6bVa>TzZBKz+{8)0<8X~LV(DB={g$O&&F zW_4|TYzVsl8tI};vx7vgRx9ELRA+}7TM@}&(AGfi5FZJTIjWVh+8B!?7FYSOU~bZm zZ;hlFG)c>g^A;jT$gW4|7Z3Y7Om1$xM8i*o&T-@&K5FOv-Up#!I9Edvk*Uy`XJh1> z78v<63RrYTf>W7658;gmI-!9`!@$rCNg1*+=n$t#iYXC6bsaY1xkmVLtz_|08my&c zR6T<}WJ1$*pu3*63HmNSn8f->s8YpwQ?q`REv9Qh+un5%B%3B4PRICnlfPK zbEW)viPC6)pm&^rN9a)(ohx-Xa9#@R_u)uz!`2Ew5BnUgfzb5=ErfoL)&z^`X~9i1 zlr$lUUw25_xl#2saK)se!u=jOsN!S65eElJCv%caSH>aY#=N?rwkF;o6=%R`5MeP zZDUgFlHY~z3QJ=xVpMr^Lj+zCHk1T(ND|=U;hR!qjzZ5njIns%gMYCk1z1bARK|)h zBzk{Gjy%g^M%7ZKQf@TI{y>8P`*R2We6v#B%}>!=kj#rC#XQ%k;Y`yq?XQV=j!Cj( z$L5`z;RTPrFZ$5#cc(Me*_-mDO0R(Uu6$mjew4qhk-tbE*7&dew`;^Z@6{UiLj1i3 z|LRP2ZAxt<@y%m+5%{fQy*<6LJF#ZXn~1iJg}%N*U+<>;rrf69{7B!X;ogznTt`P? zWYfm3?%`Y~{NildxRLAx-0XJ7NV=DFboA5zNM}b!nu6&kncx3$&ztRizyIL(nQfag zeVv(2U72m&nK!m%Hob{&Hg#t-#eNsmSBe(eQnKF)HRYn zqYT+awmNwL-MNvZd-~H|{ppRQt)s7_W6kdI;lDWfdOsO>ck`YtZ8S+jHd&TZ{~0y?2#3)6t8EfqB6*4y{c zcqACRcf(jIS1kWOB)<>uaMd%k3BMdAHmBN7b(cU}{iJWWSYHG9ZeN2}$u=+w+)!|w zv6|!T-{29Uc#|5NV2k5F20}#-$kyaeTy~?R==zs}n zN7SX$9o>M~(PBMCzbPC7Dt7jE653a(W~Zpx6H>Da<{WG4mAkNV4_2-@z!VA=5dT3s zdwVw7DtG%Wqtl&ZZMnC5Qy;G^9tFro0i+j@U>^Yh>Fwy%fT&PFem~4u#U!4wKy;h!PH^w{MdU1>vMSKrY8T7OXxjtj;^FVt3z=zIwX5n^RY_T6W56Rx6C8n$@*aRI^%vB-E^8Ak{Rh7)U71 zDw?#hW=)|uG6JNq7ef|6YHcw0Cx40P!~br^k=yyp@;QIswvXF~4iiA9m++SpWS>Jy)IBCeBQ8 zQBZpi0_IXBl4=FD+}VA((#RUB)RkVu4ZY{(=TR&Mn`)nouuWwUdG8Gv$$~cfm4+%RSGwu zc>5tJawwhf-GK-tf$nry5*TD#ce+E)RwJf%XJwm`70P3`!t6=Y5#YAJMrOu59Llar zeY-fMIRzFMCANpx5XTp;cLrLSaO#+L!vn%Jaas~-8tGeU3;i{##l2bOTA)^s)rG1A zL$Ncv1;1%F0>?mG;fV|nx~i5}1AHcF-b9)(6%#enUszlUKeKH*kyxXH4}wm6vtmt5 zuoCNi^jL~%2JCWH$*HC(h$zW)btKcBNHd+?Ez!)9S~P0Yp}~5X!1QlM(!5%{SU_`I zeZn$Sz9=Sinwsz}nU=;a)r&r2sggQY^Cw*m;#Xv~B7iM( z3xPp;nLk4#*>eY^M8ANAkHmEGVbLEYfCK}AVl})__R|9zTK+S4`Z0@K_RH)wvv%V; zMh~SSOR;3*>6I0Yg&S*-%7GRuw9`eVFNlo77mhsj}EX6NenO0djUY?896T)v$>BB_nL(XYaotK6SE_?570UUs+59lb zxqLl4QjU8##cZU!?+InHI>RXdspFjg^Zn}C+4+wjFMN9hwzBz4UufG@ zi<^soQ+2?g?v>nhqAxP*3IQbDKR-9~9q$wDvGb>YSeU*wKmB?5_EzD^lQs6vXA3Vs z6xUOj{qbU3I>CM)4Bg%kB+-ES#M2T;T)+$*@rh1nnOH(%oA_t=W~q&tH_xv^g=VtrXt*Si@26wf?%7p-z^j{evC9 zaqi32^eiD52hV`6yc8`c2*tZ$-o>UvUeui>aAr8c(A?K%H}HBG$n^ENg4!BAoohwy z00t2Dprx{s?;T_FVIKsmu_aSALHLzzFx4Rfl+kgo48_=ht(L8;-gc3(O^O(y%Io2< z;l%^;c2IGmQr3x_s9Z=^|L#;PwQQ-l--&A-PlVQGN7m9=YBd=Kub}eflki0tp5qXCLD$0hqCu1jy1Ue>BZD46bKf(tcs# z7sBHtlmcynoNbjGWbTbHXhRs}_XYKetkXeEHmDXWCwQO0yw?ds*!77L>aag7Yfr&c zTSl(6o_EwT`Eo5qv-W!7;9u6XG4-P~+4S!fn`)-M-`JB*c5NhW-QAtZ&Q^4>?PTHS z9~WMpXncNC1b4jpFFNLbn;_x=<2=|xZJhl|#0E0Ub! zm2O?i(W|yB3S1asM1oE1e;6v)R{3MIOOP?5v<%mO%#+Dxim zDHUt=c)~FFxaTh~-2X-(hD|ue8l|-1UO2}G;+}cB`1vDM=P$zGngbZ<_jyJlg{K;E zuUUlv2O~r=h0|uDU*bGob+!K@vcRDLxpjcT?uT{;y#b&5ijRrWxC)TBC<(S+6NAL| zFbvY1rtXDLVB1@xcpSITLA7F7M|3P8-C?3dzT@uP@vequnN0_<_PElZfoQ64VydP2 z*}pX|zR*Y^pDV%8}Q!rDcndJrso41 zBI6m;DA!GzpFzH(`@yD7$sRP#^>*4!`o?zP1F~&uB)ZVDLZ@6^T4NzwPfig`zwYI~ zxylXqJ^}-d6QptNCED~)o_0^&m8T}8&6PW7uN4841*`J8`uwA>-Pue3dZ%)Aco;bJ z`6Y}Pt)9PNHP|MD#^F0@=|}j{Ue@+Xld~SJu-W~l-u2?+#cRj85}UHUC29sVu#|zZ zU77*M*#u4ogH?@@$_R}^yQ+|7ZI0Ee1fyx_A6nh%HG$saofwxCO5m>R00!w@TXRqJ zwqR4rr;#WT0tWrCs&c*9gW5MQEH92-L;ip_`hM@XHrkFM$A|9>ej`Dmy{7B5X*m6< zXv?tC8YFOfwsG#8#>q(#pM3f{@zhFRs)`3KeVQ7Bpcp&YwF0Xn`p->V|2TqUb`{XWVpY;9+^6 zrkifK1IG2cjk~wp8>hYFS>yA^A~Nb3Cc?;cjq)`mL*VU=D=*+{xFZEK{Lp(6OtCL) z#`Ro&f1EE>!=xuMDetlkC&NLT1A){ES^)^LwxU}1meB^&_FLP~?SPS5q41;V1DF%o zl9U4Cp#&TtPIwXx3XDB{-O27Q(zdaqE7{rEf&zgxiZPsYj@2 z;7hEeaqcb^4X*lC(?G zY_8?;sJgk6%*~vV+*)m77~|VQy>R4f_x4#IR#%E_dH0VS-g@_s2njP_g5W%oE0u;Z zMH3bEBqfYHePjOI1HX26iZD+s&OJUoBbosbcV`rJv%#=+byR zr)oJ1jo5b!qXn+f@$@VUKi{9f_>dogi-q83IMoGgbZN6C8M5qMA?VdB7<9Jr3Iv~Q z6fIIZ!F6-2P0TrZ>?|tK>C~!vI{Lh3 z$9HvY7h^~auraJAHeh8Coe9fP5D#l-PBL@|?XpCWkxz%jcq}Ak6H2AQL@_i+YlgA} zrX#n_l6(Kj;`0kWGNYe?b^J%NjVrUzl^|>{(@)MTqum}Ymidiy3?eAYT?tL`-JKw^ z)lY1&Q??COQQT?HX)7ewdf+t&wWjhcX$`#Q#kNACEJ}{OUFl>W2#%ebTD}CJsS^w$ z_mm0B+dwq}DiD0t0bL{X*P!)>kFL+fb>sDY$Gcwo2&UB));xtKY@SU*^ibT?rJ?n>7uOKjR_&z zTC@$gdVl^q7Y@XURav{s_G%F)dr5=KwkEuF#;oPWL0tlFs)X?i zDK}^WTzve}YrUhAI3U511W#Uy{L5~p$hPz5Q)2Pc2rS>!Q%J>u#ZyQv&|!VMJWnAN zRwPd$Wm4o<0iHsy*iiM~(H#>XRVf_NHPdK6yh18h!~Dr{C4X|4dX=OFJC4!1FAjw` zc>1+gS}e6WX)Vbi)H55Ylj5HL4z?^VX+L}9KK+(Swf~IfA)bRmTuH!x<1nlw6A2j9 zAe&ATvl4UASCPYhV@^%D!T7F!?uGA(dS&4slShV&xf)7N$$5Q zE$PODKqJ+wcH?bof&a~+_cw=L%Qh|DJ^pn(OWY@q7jB(#&s^3#OXM=`+^-*1{7i>o zn0`_3a`Nf?$-grznfoK!02*h%cF$ji(+aR&4IUK+nP%HlgPXFUwa06T;!w zDjcqQnS4f-=-hi@ea-m*BbTkWa0QM3DY8$fg~^nLXd=@vml;`q3Y`;@N%LL_?=wHu z5p`vjIvXP3BvMsp%*o*pU{#i!+<_1$=zt59yXzPfte?z3`*{A@38rtf+@E9@BZYUR zP(tz>?5cHz9%q)7%R$`Mx|)#f9=)r0FEDrYsRyuVfP&nq&lhgpUBMZY!bWh4Ls~PV zcc?hAiHeetD1oxx)|6;6%!{enlvG-okgk4}A5ouIZkM%9#p02(?m2F6LEH6%<9sSk z-Nd@Wb(?2?lf~9bO?QA>u{o+lo*{lx*XvogGx6{hV){?l24GPA@c%ZeOA>9*?vu=W ztIJ!>;UN~`Sb*Qnvj+Q1na-&PmOCs~s)^#UY^kk7u!@=ZP_shdR~kw!<6}BhpaoC? z@+?JpHJ|a(ZYyOMV^ut{|KUFM0`H()zbk8R$^64-b2ImX#ak49h4V_UUFKcR6)?Z= zq0ZE-3D1|B34!xp{cZ8ePjgSMG=989;=`k(2_YyO9xBu!Oa{NXiow1$j2V)g3~fQkD}D=_KU_F$qQ+Fw z@Xz(**#Or}W=?2bf4CN1iGbDG{r7GjchBEk`03}ymx|~_ZGf-{R>)f~VS`P5^p@wL zXN9{$etD8!T}5YSio}=%8Y6(z#p{zEftjI8Sjt%$lPYQlECf7g1;EiIV=sDT4<{;; zLI9ZK?gBD8aooLq!j!ZRA&gwPK(mn?O0hrUap8WTX%m7P=uRo}ofNYP9mP0R$W>^8 zil=S?eiZk6lGJgZqG2Xs6LjJoNsNe?PAK#P4Mc#2c(+4R2eUE79UFiFEq?xJ{?y~f zr}4UepU1z}DKaOClug_qZ`YtMhcBkQwom}ii@t1yLhQw=yyj|G8Ao#LC# zxiS*6ZP!+5G$w9)Q>1b8N#pD^bEuPt#Jsv2503(=%9U2y4b~=nD6X*Wz7~xY)CCOo zR*@%`=kY4AqsfBxN;SVP+)}x=1>=@n%Vu%0f$*=sr1E3q@RR4LP36Z&5vd(Pu(P~h z(d}NMI=v{L#CxGs<_|ZS*szEXy;;c53>|n>DN@S`ODADoGA)vJk~VDVoXd>J&@!T9 z0S)7$nqu@4zJ*I>RKJ-NFP8OMxG92aT!c>kZ0ZYPNA}Mt;Nf@BaIw0qDmJS4l zBCz)_ihIFQmh4g%Y|s*Iae?a9m$^XkX?1zFn~ReoGUCz{Ptd~jC$agmC?$%C zydhTly(^`JpyISo9^w0Ol7h zxWx!qG)jfa4FmDfk}B84w?&m}7P7k! z8WR$QsW$0{a!B?|MCHFiiObw?Hx~+<#}5@t#awk_N2OMa^UNA4%oCl;jn^y4p;K>; zm-FZyf-6Z<^08m{0AlOaVtzjv;l(f&MUE!LAUOe@#NpQjDM2GiBB~zTSIqB&g_|#p z7n~YRUxDovopN2Ivw&K(5-E&V5g_`T!r^-?x0j~tN?AJeI?IPQ>eTmD3gV?4>pfpT zL^lkkPZzzK3P{D3UfDa8{suYz0#YBSaa&DRhm3?}C{o9gJ{*+- zgB-v)v~}y&iC;Iv;yvUPgitt2OHyvgGi$K?0K&&0WdT7GW@qV5@+_Avm}yPLo0sHL z6jcT%S>#BD*Ehb~F3O3*_RU$D8ZW0I&v+Vh#;BGt#xk68IQ%7#g{)2+r(~^U#^8Vj z*deRMVKggZLdNXDO0`s=h1F4eBlAFMul z7?0p~pq!u=#%I9x5+7@6@km!{Ld)SJrr3x{oe{YesWhy841;QnT7IaY^IAs5=6d^V zHR(E1Vv+og#1&X=cujT*0IA%ibE!zf(xNy98Ov?e2lbZ;pZkhV8B0rB2I*x|eiJ(H zSUnn!N~SAIqOWQmHI9t720$0=GjE|%g!6rUON^AGMs3OrChtwP6)&(j0-CYTwAPbM z(^=!Gm4hd(=}bwBz)X>&qzF>?O_7lYI;oVB%lJux^iskU?3cA>9!e5PqVi@|&0M5R ztMbKY)Dy@peW3!m3`)s|L=RW&R{=#QCoNZ)hUV~&nVy+e@~|4@sbfj#lE7zzOLWfY zSbZXH8NX3OXc;}zvae)RZhII*rOgd!=*y6z@9?j-Zz~$uf7FeLoL_Q^DmsQFX{Y;< z{6R59`RR2`=o6`WMMDe|c5NB2)aop~mVUlMAGFd$e8MMUrJ{AaS>;4s=BUG)vbNF$ za%(BTEc4Jj1(B@ zT2tPd(wboMVOj}!wB8BFv<%Qu$-w9??xDootPK#r+AHH@R^T|VOysdiDQ0=!54wpw z6{3;Il}doOEdEmiQLtoTjjYEOYlzVf)2L->W;$w$i8KxdKLE8h2o&7`pQzu7^$Yh! zv3?Wwn5LKniGH6`l_^)Hy%wjC58Yo;+;4`(E>GGBdm!C>zUkcsHVqV$HI3XDfoncG zK{$v+Jl5e-cpAod+Cqg=sA(afx`qK-*%+P$K!k-jR;ksqtb4ZQD3hhFdn&0$(kdka zcqcxrBLyMql*hQ4l1_TYM5EyM38CZLj+4WgG z#j#l@SafF1vc8%P+B7TPps6Jn&S6_BV@0PxMyi!jLOs)A6$4Ym$QT7!HmOo6H=1LA zV2}~^C)aQ8Kd^?L*}OkheN#WtQ?CO9h6CqVKPtAVKfhjp2Mlk;#n-~ql%fV?hpvGm5B*q7YezBG$=)6%2d}XRuVshBp*tl^cT^pM^tWo*4 z(%X{N9k~goYF{1}R|)?cZBV(+#}&ZRODzO1Rr11*Y_47}OUAD3<~@71?b@^bon6^2 zJ2vm$y>0gZheb3B5?SD@$?>r)Ds8EVkS#!8EXh5_s<7k7V*`Jt zCZZZll~c|;^pJ=h5?*K^o)iitNBt)heyLL4Yk(~{wc_3~ zuDJ?2TkhTKR7tJ|3$i@n1=WGSVR~xCVQ7c?<}klXpOD!4J&>ywDVMAvuzckJfGQM6 zMnF_9*QLT^AU;F?;qwIhfgMXW`CGaAZN?J>QmwKzK;7GzdOnurB;XZ^B?*tx;n#;l zl@9_CWsPS+MNnbolSNSi8w5-&**spa>=9t}+Hx0_f+>WX6@l`cf*O@0H|k{J*oycv zuG!Oq#`uJyf^|Z!gXDuWimC;lV9humB7%(5){u zXFWrxH9Sa`3$O9@>shNwGG22S3*)ITm$pi$&orhk{A2Q{=%9!;-IOPIU0V`IP~;7k z-U)-(I>6{tV-sZY*_DOq+Ze6|M0Ge9)`)xjm^=MMN&&#DM8naqUIa-8 z7W&wDDOV+<6SX?7y)rCgIBAhLi`814AW357b#E8puvMK{mm+`Om#eR<@vWK{F}oFu z?H(T;1qK1q8WtU?z)%jNb1jlrKl$Ts0y{*F+TF?43Y-HAWKVT`bPPRKNiA0_fbdFN zMteRSb;l;wF?z%CzCRA!35H9J(N1*?hx?~e4#d%7mApG%FM|B)x45%X%U2FUpJ4=h zVJVOtU@KQb4uwfsCY(9}F7?O#bV;VzK(?2GfmRXgRcCm7KkSRFci6Lj5iPu*7>{XKcxf!6i|%Bbf$@H6U#KIT*aYhOR)rNj_I9K{eWA63Y$ODy8wd zQ<@<8YVO}BV!SV`XExhk@s4B?GFyQ>q`Ax=;H1W6Ap`1o*G9Nxq4)y{9f97-J(x zo6xpQZHH{-Pbq25pd4d;Nt-8?0`X**?m{WBvp1PeFIx)Kd$ZJiQUMrYmsHsH!q1<~ z%}g#HIjb?_UuMrN-1*i${|Sp^z_Mej$(Y6Fs#er9HMF$7Dpf-^SE4^rwpMlAFTYqk zamGD=AA?Msy4g5$62r96$e1D%I8P?U0&|jg-q|S)KS($ddb$FieqU85-U#tYx~-w2 zmEXn?(Pr46HDlE1-BT-;Yjt2aPMnoW%21&Kl5JFJRLaH>`1=Oyu-It;m>Vl*aybg9WQ8od7NlLoSV{^HLxw+-LWy5wy0#5z*F&I2^jf?nPEBz z^)Fu|G0-uHfpnOGx^vJ3Hkvz>)c`Wu*D?DC=pgL+Aif|)jhBHX<{hTrq10t8hcrkD z+u7|r4qDvRQ*|8}TiPkqw1UhG#9XoVIT+@JO38S87pSaWN4ex6+HDL9sDMz*TjC5F zD?%%!@mdkYUl;*bOyMSv!X+=w95CZfYHtcbmuWFSo6OctyZH>%M zEM-CNP|JHxC*(CBLB1!7Qrkv3ULAA<5Wf%fCrNuesG)Y5z7oXbGoAi4%ajNstBv9+ zts*-t`^~N66YBS6U!{b21yP&6Ty-gP zp~7VF#gZGYP;`>L6*BBhRLTX4B3SapQ4qNqy>pc%)+swoD9*zk;IAJ|=1oN*bAzk2 zNa#v=%1qPg9;V*z?(ANUfGVj4DzzeQkWDP!l-*myOM&s2{K=D(S6d}T`8?QRF%Yo{|91|*#b?G8M;Z&;7)3?d#>M4+PM@LV0_Zo~vx2*oHX=`g+W*?E;^mHZD zeWVTlp#~AuHQEg_Sv<&bRwqa=?t2!{U)-5*+2IS-g{&^q9yD%zyf}H&J@MR2^e6>I zrFI(9xN_fp{+WC7EVC7p#Y-7O5oWLzQI9>^I8F5{{aDk)KN7ib{|%mf6eIM zMD50pcN-s{XZ=P?GNzcDn>ocB#gLQ?f_Iek(m7+YKo&ojG=BcM@$+%spo3R*6k`;( zMKeI$b25oX&n)*a;oKV55Y~p}rl6i;U0NsH@5C%K0JR7Gp|RBN?zyiTU()o((p<=@ z00~SKi)08T;1g)3(`GND{mra&echOKu&;v|c9%qsY)JBljQXe4Cx4PM(tNS+o*JLi zG>u`-J_TSN4f_6vHS%03@7YGrw)eKqwBl;;9xl(dlZ36ESDI~^w;)_sEWKXQmW*r_p?`7bU{c4POhw7nH;MejCUr`=M7i_ z2yp0!wx>GIHI$vbo06M=EPCu@Q2{nKu?;sr)L5(Ju~g5q3id0zFIO7jzGwc6_@RXF zKlXlwVFW)gvjH6!dWhDPlW0m}?}yU#q7Bpb$d{e)+D)ft5kQOOV4{3>x_E0)Wd*s25ii`pKx_PaVaB|Bm5cY&O%xy2ZMP4&Stx@ zkx;N+7~ialzN z3YQr?-xW)&0ziJJ21mk#fh;7#0c8Lpsp26o+MLv7{o`P=T@7XLBhj3fqL`JStilqU zW>X}`4sZLX_o(gK2q*R)II?$jNpj1IHy64q;R pLFN{e!9DY9&|5E7z1U5Ay8rR?O;(nMJMWJQ%cV&DfQ@*q8!pUIa1>ieb@e)V4etD zWv6zJX70x=#Yx+V&IIk)QAbkKupX}S8{J+bn?d^j-<5H)%W(?p;9xuER8s^X7e|Ur zv4kh*TB%5l$_HUO(e*(9=qEcW<-KgQqG>!I$KK*cYp8=I$Zjqm*Vs6BgH#S+_ei*k zso~z_JHOVjUi}_Uw&)I*{p1~*@Gw^zD6%x&R2j6=alpLxs9nc%HC674_+@;JlInv~ zs-P_!ed=hI&I%gOY*!~nV%w>0tzKkk{g4?}vy?uvS@?|_VY7!RhLNoaP56-4i6NoL zsKQZibkw2g4{*K?;u$bzYc>WH;8d%Xsu!h?`Hrw$^5`v*q``II;6SzRHKdn9io9EL zay3+2?g#1_ucA2%y>1Q^(H^gK$=)p0Oz&rzEwk=*

6$rm||4%B?Qnp+SM6D1O0d-VtP+rbTj)J1tY^u z!dRG67%)Y^Y$lgWIf>qQJdqk7$n}nS7z#5Vu1$iZU7=eBL>)+`V#!|A1^-R;d9-66 z+%~g2ph}edYgm3@I@E#o<72H<2AWu^pv@TDii3kx^^+}?r-9$!%qclj{ou1ua;hXc zMK~MZd}|+`^htcW#?5vPf$)Z<&sLAw5nwDuo@le?WMEA>e>BXqcB0^0^4_<%?|FS^ z`mJ4q`geFpznBCbQ}se73ycg`e37ptr?1*F=yIc-F{$63HsF(iz$cpkmxpG6=RBp+ zW(TU{8=L3-2dg9W69_XInE7F9U&;yLwXKyf_fRMuCY*Y)gj8__Wh6L4G@rx`nY^|q z4K|cIKNhsmN~||xSir9@YG@5`G8GeiwaD*-7j3=C4_Gl2*{?{ugp}Q zSPAHNS8bR@(73)#H8rNaojR-`y|Qd^Vx1T^Xxw#T&w?6rpTJ*zy+1F_y@`9jk08?O zH%A)ZAEg<^n{=1jWHZJ;s+r5gkn*Chx3jHLk(Jjgnj`$rUHj3>(kH8TKU`io)3|WA zary|pKTmf%LoV`hXbE$Lh>OW4oGfQ~vjVt$Q$Q!(h?(rpe{vm*~%s>Z{ zYqIR@(pgPn-p`+OkKbOo{JuyAV8KVo*F%$%Uw(h({#WkOMZKvO4qc^QOGmD+V1?b7 z*e)k9;WIBn#c^Qd^9~ccFrl2J*Vs$0if6YVRpi*`~~?QUWYWAIucS0Scgm^eQ^#ajcO@U0w#M+u(A6{0AL~t-sP<-QrzI)} zKOT&jN(!Y96a9tVR;u-KnV2l2X{Q!f#ME~7d73NIg{?qT)9w;GKw@Hp$rFeN@n2N` z@7X3LJ3Hx*m%!vWO(s`iaQPUy3QbjV8-i}5i_>7{X6WOq(R>aBI<;w~F}8LVzKlw( zjMLfc^l)-SSbP{K?(~p4Q`XQ#F?GKDrppe0+B%Cy!4_xJYWNU#R*lMjvukC*>98~m zvxiO!IdT8OGt{xcyG=D&Xlwi(?Tx&{I%fX2VxKS6l0Jd$r`Q~*J|yZ(ZjVp8F1N_O z6aLLK!Jkt4+{9&whh;IDZ()tijifbqm;8ul9IjNP-PUSx%WB_R!lTP>O>SI{Ji8%b zK&_43n5iz=ix9oUIx1IT%PZHC`C4U#Mcfe&$=h36$uvY;*jlNQJ!mYgR32}hs(pDB z2Rf;SuMeG8U3k!S#N|!kD+KgX>Ur zsL@@0#^Aw3MR+hIh^~s*o=^h^i?Er9pTk?>6nL9h38x7V)>V zm^jam0ngsB=*b}KhP}G*MXC}uQ32>*L)kP6l+K_!@rimN9kuRidzVN2!4VXXSF<3J zsH8FFt!JHdG!awhL_Q4S6{qZA9yO`(wMzct7@~lNHe(dstTk@m+ZH;lCQfK|Hwy#_ zL8L|5ko#DaZ)X=gw(3iS_rjbXN(z zSDMw}+ot7RV*|`l$&;JmziMI=fq+yl8AE~Fsg^G|QGGWl+o@Au@qjG^k93tbSfP$} zIV!t6Iui1*RBD9WW=frPJupqkt!zHIa_gH#nK|7OSn9^%O-?(Cz+)sqwy5;08Gf`-4pNj1>Ro^W2wZdACZ8s(Y8U(7;1 z1_?r8#-vmPD0ukXZ=A~cZN$xt9}usVAl_4S!^jz@ruINJ;mIFahaawsMQBbAe{F=$ zC(cT%6c4>*m&EzdJKvcIN%HLTtPnzbk%Z6tmPuw{52&hI`yQUM@+;9Mq8X(Z%I&$_ z_WD%5kk3@6cbBTwDAVW0voJxKdaVRhIR09_m<63TSF6H9)tSUhIGN}HGvQ=%pjWZ= zQ z?zl@c?yXbtIMw%w`ZWhatTd;0Wyosvg8AMkHo@-hVJp)C3s>!LI8Qh`EXI zYM_M#!zrMK&>>C*@eau4z*<@UPayn4sd&Hun{%r914U}GU#sK~9B?Wq1Fi$bY2Fn4 z#>-I6j}@rD4L-gK^_=5be$&V*lbK3BLtV}=uxx1(g91AW2kbhf!jM1%_aA?rrvE_4 zVoiQW+SnUxr@6^DmUg$4>~}3Aq+A zrNvkZC#r|C#z?9Afea`L1!IgJvA{Uhp~yrvV)W{89%HUYn$v>+vnv16?1+iJ^b&0~ z1_p$b)Op#S+t{b`ryFzU|2cCs(z(fE6ePLJfRRbnBMG8s+&PM>5HanDlj3Ramcs~U zA8{E;rutG=vzO2WlRq!N{IWOgaq39Uly=-di#gmJ%9qp3Cx>;41k#A%9ypm|+QSL5 zA;Bva6IeE<)D_O75N|lsF_ft|Xu4DfcBGokm+IC3Ksz!;@~W089D*G^;rz#c7k6!* zpyt}VTnvrj6Mh%U;?le%t`(OZ1kvc|C_QN8T_=)UF*ada`5V1F3`)kLP-`ePexFldTw#HE*T4 z30ZR(WIqjnusjLA*~BLX^A31>6Vt2R_})~kI>H@*p#FG1iwc?IfqLeEgATxPL-R8^ z72@#*>w-@!3aF5u$k*86RL8SYk--v|0*s%Yi79h9+DToqxXl+7W8{7U(`5KbP12^z z#59P0+D(Gda_hnv+Fvgh@W%X8k$5(GhD?1xJ9@TKk1PTQa~5KLm@5?rD)3bC)N(3& zNijI}sY#Uqhxl`FC>rsBwc`~ew?9@`b7qsF#}%lB(t!hvb)yC7`BkTty_M$W)mRhK zlq3gIgr4I6B$K9}+S_mL+k^4SPh~<#0pUR);0C6)`2sQ5#eeCiKtSZHPR&e%CI+!s zV7wNuOjOel7fClQp)`rGx-omNG@B0HTB^gE0uWgYNY<$@9&sZ_W4K7xNDqyo7H<|8 z8S#!Vsm%mFx3thf<?&s{<^LKX>*1Q2clL@ADZ06y&d=q^;k7yd&W>PjSJIIUx9Hx zJ!xT2?>54}_HB!8GfDn-el_wV{NkQJh{j5#0vF-M*BZFsfApi9igYbiJAxEAY5` zQLi{V_ms$Qs{Csl-{$j0JmZ)*jjE7ivM-%orX_Vu_JZR~SIRUa_A}>S*C7h1*%I6e z@S_roUlKnITjJGcV?pHo0dK5!^tAmWIQ{N;; zyqKV%O^kNX5JtF5??!L~kwMx_1h%Yx*28SX1%PlZMcVgH9$#qQ4dK!IP%HeusaMzvNm9 z%6LbWJSRfoExMmRP$|{RRqYcSr7=col_Z?={5s4Y_ZR6Tc zjjMkL8=yz0R~GLxBhzDDZM*To^irDsa zEbJBrG%gVceRz=9;2|vV=1^ZHN8mOP<+B7JJ5880Vr5!esv)^8wUWW51;b&Ape}Qk zS$taHAclI>PbyZccTN(8S8AsUmK{;EL*!H%QqCr`5GE%>(h+4AAjjj7a(>!m&v@k3 zgZc5=5Wjp`Ru1l`FezRcvm7}J@o6%o9V=!@GDQTOb>^*Ob;!khY$zE|pl&A2618%b zQn^GGC|bp8HdDw{D5pT``k&yI+EkE^IXDb_DDwL~`(BOMCaA(=_zk6!z5TrdJ$=3X zmU2aBsurqXzF4IGLs@*{4<#%mV8Ggf7=0SgPx*b#WNUR;cD#|_f!jgzVuvOJ%E`My z`dcfhswk*x#L&SSR{-riLLEv7NKYql@nMM{$`jGy$#5f9l>jPs8<+)@UeTcm9i&v-g%4 z9yKn_t{#7~a{ceiiw}NYnsdK7iniA&*N)wG z&wdGUib;T+*Z;U3MPCOl-N|+T5r`TRbJmGxNkS7AJ!k0}L_HhZpdac<@8Y~#xPweK!T)0B!M_p2}553b<>xc5#puFbl$CmR2lUpe~F{o<5< zs;@r&d*jSk5Ie_jG>wxpYgcC=Nb|xm_ud7FYjpN+E7y+TH#Ekw)OJrCEic?`Tw7W@ z@(=gPx4tj|+PR~Ri}!d4q|>l`qZVQ_DnXYpue}TadZZj;LZzIBwwwT(v7V{rQNMeE z-G{ukjZz_WL23nYns~+Gl{dBgVyd;nXgg;F<#uILYnrUfX``82ky{Gc7cq5WK z(hQ;>Co8(;Q@{X(&kuZP9dv3LudesbC~h3JUM=I3&K*D^0d+NA(x!m$jQRo@w=CQJ z*ARP*N8~k*zN*j4%(tF7u^eYV+9 z&?Ys2jAbx)r|m|ikTM)u1E>#CsntwPB$w5%t)Y}NqP8sQZMm+(l1a6ROw(G$WEKsr zlPDXYnG$8%(oB)s&r>rMP7|fw$AFJ}^t3T^tMS1__ul*Hk1_Oz82T@{P%G*xQArW^ zN>nMJiYC~V2#lqYItt9pVRo}|@jlf$FE8G7Pd{8ceg+Xf-D4twqe8$iA>h)!pHZQZ zfezw7K_;^=hTeGT)e$0H!9VvT$*Wg2aR?sf7ToVaQ6(kUNm=S>vnl^%`J{Bom?U>q z@1c?eIza@_NsvAJu7V_0^lwO`(XBVBt)l#MBbrs012fh{Vbl2Z6z15MAD>xyJioH^ z31-8US!qRQ*HjOs>tZzS*3ZKLo=}VmJmV3FtWABPsioeCJj zqRHNPPhT9;R)9e_*m~DVN(joaq|@Xwn@;1uh|@v(J!G&h{CQ6LVtd;@`eLK7CBm#3 z3>9CW@C^-LrB+TpY|MY)-gtlI{0RveEHG%eR5_q3yJd0&r}PFg?p(Z&Z{2K(+#9#t zd*>hu@8c8rBEI(7arfM}%<@l1# zJ|RU(C|)DtE}OQ6N6U{+kV-T@FK&xMY%-w~Csv<+h7rP?DSP;51)sH{vsNiy1HiYY zYX@6Snr*d3WxCHet&PGkE)~KK*ccX^BAA*3^q( z-9^c9)xrFLaL+Ls)ytSVc}uup$%Fzn9wx8ExlG@6o7JkDJE^;I!qyKnM711@+a3;+ zPFfQjhrrH6*F$DYL4(!1H}OkAb${p~mc*Pxx_Rg#`7Aa(>ZY>*u~qani=q8(Mp5u| zte!sRUb#;;7zwF;`~lr^p}N^Vwmlzv9)-_&Mw=chE4ooOJsN}78%KZmFSN4sv3uuI zUtwbsq%M+e-pVUKnB+cNgC3XtR%VHh34r053)>uM)*&!4q2y~w@yJqZ^N zC%t{^bRe0|(Djq>9L@vEG1)jIA_}2nC1owl5#@pGk^VQ34O%}U%_yfdd=TN2K_UVZ zrs$_1+|$S1^WQB$JmsE*fcuPV?80;PF@8f`LcsU3q$1Q$+{Wa)4^$_yf#95wA_Odj0mwjYsS_`s%yog|FPDV|eT! zJ8GOgsf~@aw>oordEtY`!UEOlz_e&O7~ryreuk!vz)dDtLy$`1;%Cr@8@E>Pe+dc` z`ku5aH#MH|SKdqL(1J-1?1^tN&j`7<$@1X^p}U8QAbeSSymN`(iaOKNZ8C_WvN;HD zeX292Y&ue4k5?jFuUIZWuWhGlLCEMi`>p%IZTI2R#z%MHgg-g6ym*sB#QRPH8Vr*d zSWs&!Kf@(%8NGwFjeW4O@bTR9Mq?Q08>zH$g05(+s4Za zXPDv5(+lpEFB(7IZoGfa{oqRD^JADtSnwjb8*P|D`8XJQZwdCx z3Pxt?7gEJ2%>CL_GBR7XHr0&We6b}(Bdrux)yP~an49nVeSNPF6%%8?r# zH&HrDrmd~09fh%LD~dn^sHRwCvCkEY8up(WcZva&>9zj_<8T zz%5_dB({6LS0Lb~?-gq1!{e0z)lyF`(WYUNE-rhlnkJxK&o-1)yT$rAy@C~9LT-iliJV0{GY|rnI-)Eu+yNdn&gw^**Ire`^FrV(vhc{g`*7{?d1KGe ziB$To7O+QWZzAikZI6)B4&FqdyrU`nXBS-9ng}Dhkm}i`4&oT_mpe#D+b=75P@?0! zkUhMRJ^T`~hkp^F!V6)*1`7jz?N|KBsHi$B{E%1_T!<8gZ`$l09eHt^4S$<`aaV2O zuKKJId;Ir$TQPg=S4eu_5*+s+qAFX;5et5MW$AB?3r}oUNB8P&dR1fW^~p2DNCIHE z_fDvU7_O&_Hz_Rl$~D#Pnghgab>rGk?vcyxxw}xBm>SSJjgQYPKfa1NY}zTE`vyyq z%jeytbE4b0K`>9AKkXjBO)UKvQ7Ha4|spOH$DLi zERrN*qi+8Cl^yZ!Eiax~dH(?>0gcl~tU`dMH{_OXzM-o9TV7~=@2Ncc*?L}A(I5Z3 zb5H8cy*ty}_wV1iXaBCfd(u00Z{N3X=RRN~Hu3PP4v_!Ipcx@Vpjd>Ek#5mjsjxag z5)2ji3^`PHZ%=P*YXWr*B>Q4px0->x_7SwkfBe2s${q@g+KsWJ;^e$UcXG7vEQ=vw z1ojhM-qVaU5aeq;y z%EQECb>{Ti=U?E{7G!KZe;(;olQaURc)>TYu82cN^Sy8%c}&hBNB1E-lO^ePMCJm-*^+=c)!x9NVafAgEDlN3 zmew)LO9-hCIqG4v4Ra*ipMWVno4F@_UUt8tcqFH?XBd1>7DsE&;>5>SLtWvA+KTYR6zA-fIU3Q>AaCoHqF zyHqM8l6)6g%;^b|%G=cGm=7GWgb24Jk3Nn-` zd~hu7a0RRhqDFU(oxo~QrM#A4(t|jnS0fiz*oIRK>39lJX^Q3g_)xBy%Y|SnOuz(Wy+ZQdv8<@*&kia&6tNTwhlzTj zmWNb`P>C^5v;`9`>(|&c8r8P(D6Ezk38sd{9K}!KV;99!*PSHefxlD&N#=JPCAir+ zxL;y?|YD9350cVX4Ri`g9x_U`2g9V@7)2K7$%t*) z=}f*+I_$s}86~?L(uSsC3h^>j>t&Lqs8*_A+-KSDXf9eP3bR@*VG_IUrYWv)~PUioG@z?;Cp zCwTtAL@CcQ8=7pK29jjruz(789NFG>`N8Xw7vWiXF9n!VB+a%#E-WLafC6GFJeKWn zg>#yGTj5VgUQ7TpINnrFV!{2P+1@D( zj6XpiwvQxqGx{0q&qP&O;w$U81NWa_+wCwjHaPGAVO2vI%MYnPWS=VEO4U-ml68iZ z8462<;c2=b+hnEu>GS&VRUZ5R220685q=?spg0^0r3z=LIMo6+rm`;Kj==W$zYQI3 zAjY&Z1zeP?No`*xQ3oo!$p%+twnXSV&=0A)2KswrmSj-@mCd|}@DIq~LEG;y(&otl zNy&760$;-?FwG%Sg%KefR`V0d)6&sWV}?ZBrwJuX7lfAej4{~Fmvp^TcV^K7Z5!K8 zDzHY&EwlfCynx80Z9+I(0)V6C|(Mju^bkS=w>5xo?xNm@h_ zE2DeYeTJp;RNgOtX)fN!K;40o&}98Q;RcP08q4H9`c*Gw}sIN!{HTxH5#iaV8-uoy0o9;6I-AyUgVeX<5aoGf@gzOr(v3D!K&Q zo?3~=F*WSDkk19#)m7rWgR3L}coLpDrqNR-9G%cjJJ*wn%ooX#uiN+Dnu*V-DHNoI zw=BT+ZoCl%mr{|8UIr1D3B^+zGz|6uSCU++@ko^zT}De8C>*a++OYI8b@Y3%AwHbe zCoUirI8ac*6V~Yn{)wakzZebU-?;&XZc?;){@QK`rV-{oO1BG+g;|?9DgW>IYtU&wLVBm&+s~~+t zgBef;MPx&;9UJCXONzC(NaL%{P^yfqWb8E~{odcfinDZd)cpDh@m3t0jlOJS^=2`c_7n93z{-t)%ZKAW+tdaiDhB|-ZiQo zL5#W-IY+bCej+8Z3nZX8aZvo^>SyJ?EoR^y)rzyQ1ESu>%y`=9Js z@Kn^Lv2zqrl&I*`)3&GZ@hjp8tC#NoWy4DWP(5IhZS^lseq9h*nurQlcD; z*op}pD}D=qz@z~7ryLs)t1>V?Zk~A@=s4wR(X_85FN?ydvfV50BvaBE%QLC?g8)9D z;GW5mneh2dxb+izZmYDr!5XJf75@s_)KP;7=B{PjlnKGfB*s1uaN$U|?GdhKbu%jR zK`xLGvi~Lb!I#Ziqz4Xhxa!kM5jpcB20~`U9Db#VqZ-W0Caq(C&m)G_#qJy}EvD)tO&hZ7% zd!yxV0!*jp6AG7{ulaR9zqnnG{*XPN*dfOvVP2Du5Lxd}Bj@o~#C6NL2iggDFeR~q z!FBhm|K(3h0oa;Pn_F5IN<-e)q1zycFAYa((@_xcIckS(hQWx#MfLf9ir+@VNA=MH z7Txfa1GFb?&e=;pm#rfw?)3ZiE_?cR-{HRkI(KEc!-{5nQgV3I=FvDAE~qKZ`+qsg z{z|il*1EK9-8md7dB;U5(rfSeIJsE&$RM=C^ba^9w#rlnZQ*wdMpv}FVurGp9Bi!h z`@$AC2)4MAD5vWGA+oAh6y8xGae&T_0aMQ0?BlSv`I6?ds7;(Y4$^wbq|rr!j*zikAJu+u*`@mB~j5K zb$V$;F$vV?c=>mYJ^eBaN?%dj0$KBIiM*aga>9F(F4o!&Z8;hW&O;5~4J zk?Wiuy|<^JM?wmNU#Npcx5T6sck3JltY7|HQ2hMmBba6Xe7AmOFue~d9HkH=am~=S zzb)uJPrSE^6AGTEF>HXmh3@nOCaO{RS&?uG(O-T( z9K4m<+7gP@Tbc}hKk1)_$JduaT|y)sY;h~S>mN-_=1o-}7qYT51yX#>2Z{&n$bF+m-D1b}rTA{napveR_2M1d9%?tpgy85#5(zxix) zv!3O2_Wd1yu`Im&S?7G@2Pd+%_|!ZBit}` zsnOMPoGmEh0%_lX@4927=MBB!@us*)m068Ch|JYlDKG6oQ5Ci^nO7sqfF34qWDYu! zt!BUl6r!U!k-~=x$!OBRE-$w=S_cx{w}t8`9LHGE3$+B_QdNV5>LSTznS}iFTh{!R zWxGrj45@L!7NJzhjlWCf3-&C8LOYg3IxS5?1+!6RrAicMbs_WDTzjMHeRELut@32N z2@<;_39ORju~9WP+V8QYkRM-7Oj4SQP&yG^0uJ$Dl^%Q<{#!BMj_F7oB2M?&kEB_y z#Hx^M;=4y1FR`r#)VE?4Gma&>yUUxxYOKSW5Jwkj2eJ2FL{z&MUZo9&2*c z2s%%;Z&CWFp41%yK27~LY_e7qjT;reQ)8wr%odXErrCp8>Jjh60(#e`GXNub36Z~-bKr#=kct;EwyF)msXAW`rN!i zV!aVJrllsjj-m40tA81pbetnoJ+Im(o%z~4nI^_(u3L%2nAgJF%c#bvt!^kel${&9 z({*acgJ5mXiImpG!Djf2!4#o1>S>EgzDf}5g_TuIcoplgB*){@%kFrC=lNnaYrF4! zChxS(<7)N&sXE7GTve{R{`L$rGn)|}H_#54w%zu(g{;XJxHVkE_bBP8>39NkL*5Y9 z=OOkD^!+e-%Cu6MbVitV%d1}h=zw>9IuZV}@-64b<#hnzjwI8)UijspB;c=d)uAaK z`IwM;N{^h$AF4|yL1v69kIZy$F*uWc&u}en6Kn12R2Vt$aS`I4UrZ;4rhiN0uE7sdfs}E_yI!5dq>M^Yj3Be*RrmW&OS;WP&)?#(7=TZYF;Yqyh;pz}Jo=$dEN=|nF~ zaP6lOpP781JUdt(Hrh~iSp;8sI8r*djHIr~Befe7e?MXP)$5gMwXoymrg3E9sN``a zzMsOQJL{eWv$F>HUVw~CA?UKW`s`+z<~8MCyr)YiEv7r?4MeNo4MU{+Fy$08l=D``MH`1PqoetYQBuRW$M=kZrjZM^sr`Sjl2mCx{`ACeG#eM3zjY5 z8D*(jg&bL49CL)HCBZg%n&HyIo8<90hs7aCrVz~m%vWErjJ^8X7pQ+9_G|l9oE#^K zfRbw*VI`AH_7G`w-X^C9NQQ}+7IG{z%XLx550NLv0;%}M^pNm(2DYML+~5Lw8GqW^ zeyk#f26+t7R!bZoi74pA+4i+no%HI-1Sd(Mx|jZk1STm*D}TFnpI58aWb1apZze{k z<~rAOfj4W&6L_8cz1=gIs!&^Y!-30g;2Io1k(Ed}nN{?LBvNlw05Vw3A}82ks3o%+ zN&7FvM3Z?WF7-I*8Jbz}aZ+y02G*omwBk}a9D?|#O77H)*c>JtVJ8%Gm8{7iM?ZvY z6(Xu*eHS&yqiRgXxLMg`O6yk`tDG5rCF*n*E(Yx^H*DX>?jmuDZ|@2E-`C1A2&7H{I==%$Y=|U!iuCnOAHqeGLig z5(K7xpWj8_7Q4;UwpO_6iOir#J#gtLN;)#PvT0Z};ygTBrUdw2{lpe#QN+nJW1S7& z@yc9^GezpjsoJ-)9T#XWO3?CcHN4Obk@YIrgReBjj8fI62!Y~Yf-GtoY`!gvTwSeD zrUS}ENZ;gs<|_D>9jfu#N4BHa}G***Vyv+;PJM*@80~kaJz;HrI9t zK!6~_h2ZTpZsH}5az5TqlxEm^BE>%1m|USY;nJzbk*m8B)b9?PzL~=8dcTfnxBgyC|R8d6rm-Vs=+2L|!48a^IffsJJYV~MZSqJCi<_>H~rfyVkv;nNyI z^In(wtfsmp+6Pw=H4Zw7w-NP%h=JzGbwu}SsA)o!WjZ1TLr(uR;<_%4ryN`We(G7Z z^_nCB;vp5%?NZ=s_TbI~7zOp_7+1iwPG#CNpmudiKP^+#TC`S3EGjj1wDVcdS}PX^ zV{}gqLPidyDf2kklgt}-62s7(US>Klj4>k-#QIfmy%8Z^j=RxpT=+vz%jdt%#!a%Q zZzNSYrR^$(H{?ZF8_AId>Opo0W8Dz!SIn71uz_HLN`bJ1b?7|c<6NTe+wF3In-wv9 z21HHWrE&5Arjb6i{Nkc0>;r#GCJjy2HMvcUxc{yX7@o^4__Ov6|EK?f^mid%@7-jE zo`sLdHYP;7ZCPS1Ce%9odaMU0rRbHB;VK}c|Dw;qVaEo78S0d2w{TX@2$Gl|0{^9l z$`vPdni(6|Ww&p;@5CSE-Ug3B9|4p{@TInRx1g=B$6_Ua1#)g9S9>losHUNU%PJ45 z`#Mh-{WXr}%{~&p7t`~8fEFVcK_nkj4|a^3_wx1kMEScS>S>1iCL%b&_l+lQeA?mZ(UR?3Om?$cYV7tFG)NaZ{0CH0s z=p|*ERHJ@SiZ>qgUaSvmlk3~1~L}(m3bwfOUA*7l$J(9>Ic}$~TqfI1D3h>AzeOrViKYI* z6}R1uB=4(sBK~vBo-ElR^FgUxY~9WmE&fh3H9+K-C)eLz{G^8J`}-C0h;=r^d!uQ$k0Lu8+@3Eneb8Vdv$KU4t3(ui2pAM8{oL#+Jc8e6AhSS2fh%*H8 zGPOG1#WF_l*B7~dep&mQMyRqUp`M=h4{xVOSdpWl(QTae9lcU15LLZbo+E%2-Apf~ znNizb*8^ieju@VNrch-6IS}X#C9SH2A?q4Awv~+onq=u8a<(Ww2iC{Fgp60e^S=| z4M*4S?pa1Zh_EgleEcs!r3Z;C;b^%uT*y#Xj5hjoic)~d$n55Op6BVx%WmcTbmgfx z&)@ADfiuj&?5_vXAP&)C=RC1S4JX@HpEo6e z4-$S0x~FGC(*g19Y^y5N_?YwahpximG(h4fYZXx74IKZLEYc9Q)OZ33hg4#(WK7m_ zVXmE5jU_)vmQt~*f;}X=HwlwJ(SmwKSeOTsbCVXk@1RP4>bHfc5vc-5(@|qu$b%t8 z>y!qyEa6RIHUYRzsbeh8Q5nynL1nXBzye6LRNh1QHl_p|ubm*DKUX`0hr- zrmg0@)MK+c(`lyt)gsi^1NZ+t#D^ z$7X2ZOkQG3^1=;UKIM}hGfZOR{_dFkU7u$CZkr>=y7QY(EqejThivujwDqYM9KWaQ z{!-t``LQM?9>U&v`f1mvuXpND4nZCs$kLK!ba}@6e3_5o^m6$$@(H{8Gcmn^aj`d@ zns-A;PP5QX$HdX>@{%BuETt%mQ0nzI2HB#r~+Wv1|`wvNJKUc&Cx^Bk`U1JXW7(am0 z`%^FIPcdazn|x3K{;i{~*4t{2ixf4oakbS!q0Ewvn@@uhEB$uLtGfbxVc#>NrY^ zUrw?-^ynV^8g=H#=PDDknt*NwRDwD({=}GKwqd}-4JlbsBFVqn@^LW*(2em|<0LL{ zI~SiCcTatVD#a=t&pE>M*$SFZ&k6Uvmq!2jY3eoGkR*k*XwAZDy%$# z@b2cl#hPUESf0MNE;blGoGbzGqS=5RVr+$j-fJRl zGI{}e?}Rq7QNR{+0Q#aYw5aa7gp zX3SwczqGeQOx-1HFNJPyQ42a;o+*#PHV5)jreXvpHiIT1B}3rDx~0W>oTu;jjC|IJ zCvhHovKIKQs}0A%4i_2QUEjR%=n+v^-yE=r5|t{&#O+8zh1_@ToN6C-(!+!yz=FXp zp~GwHKf*+o_ii-3kgg2^h>{F$)iomm$wROr3$24dO32YSP;kHpeEEY(N+1U)B&LD` z)$KF10Tx|G>wzux2Jk?$okstDf+R*ksNU6gV;&{m)!4>4jf1oGQe+3vePRIVT3f{e z>&r9ukD_l7VIfq$9J`HU_tA(#c#NujhLX2SrDVsI2ZIvQF?lqRZP%3oICj*wW}RTEO|4a>tPP%VnyPVB={RDmbte-xI`(b=Pjl;{?6!7@E2pT{z?VO& zakr}oGdBgrNGXN+U`|iyZdaA1Pf_5PLS_6Dr@TGOl^piqI_!q0ZkIgj1}f;Psx{42 zA#dRENM>TcJpXyk?jtHZpzV66&t>*Y&*!<+<6^tlVdsyZ;qjPuWFqcyl8{TDarZE% zYqQRDjH)`GaBN(chYi85kh2Xnvdq&uYg*z!=67T1yP9K*%G(TXd+Ie_1&74t<(*n4 z^i5tJUQOy5B!Y1BFV%5bp0q0Swc_=n7VAoh96dxVa|2f)Xu6Wn`a14wM;d7#tDlpE zI?4K=LJjo{T=(w(wEJAgC3z|Y!Gw*+yhUULv2qVj2?s7rhQE~EP+Js}=Lg`52jpYHuc zy~(lK<)ATiAb9jPIZ;KJ@JOg=L@f26;yW-oIml(XY=rP|4%v-f+Z_l%z^3zA7@U%w zG+ULT02(>Kd5|^nfxy6QmFa=&MWIba$NHNhrv60p%<#=Z#V`Ey$=^NJHPBw{^xpYC zQrS|sZNTL|rkCyHkQY6+ziWxME?1O=C{d3on6kd9W<+~_pbVsodjw^ZxCNWBz@|)91LCwR_iKfG9tWa5%G1!{=;=BH<-c931aqI(jqAp@JIj zg6^b7F+M>|m7;PoP8)B!4a`_FDfbyFp^tgy?x|6D^HVej7dZH}z!2*Uw*h33Pwbl# z=r7z+^ZRh$@aOLO-5_L*(R0ZHT==*Neey2`hFr4&Z^ixc0vv(7UAxJ6x>S1_db(K2 zvF?!uo(`-`OBWX=7^`@CJX)YupQowOV-6#~(y$_O84xnCQpRp)q|qEr1{auhWdy6f z!{W&*c&feww`@4t{6KJ%a7(J8Wb3Q#T$)*T1Ak5o3}-#?iUeq+|4_)@WF?v4YWjA z#rkSfh69m^7N(6)UPtz%+yKSwCf=2TM#0Ap+r>_}Acju5`AnX^?&42E-X>9j-;+!9 z6r50+i&Mhy1Onlbmo=pSbpytLJ7I;|(E4j&d{?k@o*K~ssu(qYABcHxSn@~HK=RKd zKU_Grqzy*C3`YMuf{}QMhVVVS>)MoG!Z}Ua@6IYCG@C*Q#Igj78QW%)c86*|PLsHI z%PN)xgI&svR!!a zn-%;NGgZuWx;zlNQkj~^Dwee+R%6Y_B7If^`&5EBDPDb0ClMDy^zaR?{#Hr?UXs3y zeqCmr*RA;5>~~3ri`}WO-G;h2M1BoNM!~ekeWHwm?RuAX$q8^bhXeqoB*eBr=6d_5 zJna^0ZK4DuyMkuBKDYO1d#>ymPtx)hQ(;^bgkKrd^$#c$bSg_*5$_T&Y5B-gq$1)h4Zm# zRvjOP7>pb(vX!m$(z7L%nUr&9jR}Qqv~t;Iw4EyKQKoJe>ft9#tyn$f^;f&DX8aG} zj5IN!Cq9c0EG|tNoIX<#cE2fYYa9gTRO5%05_;MGG(>VhiwEyd>RqErh>eA-suIEL z8ms!&*i>I=H<-+hLF<=BceZ*~mXvCiF?(F1e>4E*D?x5Qp zaVox(sA^855CY&axgmOtDl&&F1Cyxa-0_36mMZlz%2}h9uu})x##ZPAA|O&H8ae)(ZW@X=uXQPLexT~S!aTTzLBXa1 z1Q8Ju2ddgea)^rfrxTmz=sdH6@(ew}3JZY1q65-Bi^=758}DU?3C(w$2-&`nu{*39 zE!doB1%d6L2(ne@E9**Ch8s1?Z|h{i)I+Mkb7v3MhJ;;uDe9Q=;xJ#I-|wl&v$PG` zvC`X^66R*Zqa&0qxLz0B8_8tG+o^nxDP#&ThD`BKij(tF z9B+V%R(L!27wz^j?A^%!EFC%-lwhR2JVg_GG^g>~rxw?{%2 z?#)d!3-Q~7@jnQn%5@mK?8*gpwqYmjlDdsBHQ|%M26Hx;)UV%nO){vb)>w)R7EHxb zr8o}hx6E_gukz)5X6e@22bg(LmOkFfdAHh9KOx0IBGaY|kUBOU{Kgm{W5V#65TrPa z-9B3Q#*ass<%+oR)zc72TsS>8By7biaf=PpJqE>UKdZrt#RtmS9k!!Flq1buRsXgy z{qBgm98df^HG`sU>holg zefmF@O#Y_jA)9G@#JH5yL;qzBU1VAXrMOr7-fG=no=xALdP2SfEQgiDC}otnQXWW+ zDVab3H*yY{OAUlkcy@f+VAtiFLw z9OU9sL6piGzQozYK0t#o(HP`pk?ClF%X zrV|^KTQ)@sGuUTgg$$*&plp9wYP%;QVyt%#o4Ng5nWjy(fL{AbvagIe$T&CzFB_iV)XwRM zgb$jr6mG|Q8wwzz9ueMxSyZIPU+b#`2T!O#)$Sp& zZZ6yun(tlrT#|2dll@h(dis==46k-Ro>?|?+B99f_zw*EiA-plq6Hv(<}GWofYG9* zSKwv4Ue(Q!J(VD~X721qvfNtGJ54x^p_C!jJ?Ao?rlsT0pUv#E;Lew&q>ebgm25o_ zhylX1c(bEAj0|?X)?)GtL6bkFKg{RLSXm7*jYI_Oqtl+&Lmi zzECA3Jtkj!I zMYJlTK1T9U8Q58_r-EXIwQOvAh>=;&DbeK;8(t^4`bF&2MrgfK<2lJ4r&dy#--d&A zjqCzbjvf@G7~nD$!yyE5|H6w}LHWh}ak?lubI)%M%BZ;gAtR$AIfzlR4UuGIi^jNPUO!wWrxsMb~Jr;<4bI(^6X2mlNKeTZC z&6t~1kExTExy&qbqd^jGf-#c&+YEDV9=L3GDuZ>;Qak^KQ}7Dvu<;-ZiamKFTU9b; zI)s(ycHUH-l`Cbio>XS(*D$Hv%p6WC_HLh6VreET!ZKxjYHm)R{t9c-96Td89V-;h z%J?H}B6pml-WZ{Kn|{emw;J(;` z37l^>lnBouTDz*Lk4sDjyh>`Uw%_zq!i?XgaoyoUvFSpv7ovnSGG-VUrC;(3+Cwa(GmiPJ{jc{D{X#OD89NysO4rEB-$!n=0r> zAaM&a|3K_!QuEU?3^qZlL}U6)T>U;f_7l0(RroHV)(&Nqe4}3ap98U&M`{SC7 zl&O9D1wu7mRwyB~OOtZ^U1fvce~Xg^! z@5yHXBDFfo2IRnM6L!>>NPgwwasEG*bX8V?G%OV6v) z^Pk8aQ^m{v^{`p>>uo)M?(1#N`yZh9c^f)4Yf>Wu1*-2y($L<06ZsTq4W9aD`Wtd- zHzKrgie?eJ8N@TbF(MbY3pd$7X^Qw(>V^QOX@}g(Q=|7|V;vAcE6^R9AhOvJ6(ukF z3heR4%5SoT_?wMiVEN^%;(g8v9OlAL%r&@LyagVs4$})JrZibp4FCT+a-IWQ#PdAM z_IEhFwBOJCp5HvKZujLV(Y&#dam>o);}$5^vSMwn%2^|_S7!4xwZN+VpZ|frL6{(s z>CB8L1<@)i<7|bhRY6$PjSm|Y=?+1sK_uE@zcnhJrrb%(7h8|<@o6wVaP5cnl;trW zzQ344oc6{(-8M*5rGV zx=@IhOq$nJ7)eX&vZbP^&;VtMN>9lm@ZS?Fe~is(;u^Y=oPh?L8zrk*K6Ppoj(Rl?>y;3VfrKl_}ifrVWlvFekQnlcU~l8;3W}~DJ~`O z)_?5RHbKQU8<dqHYeHaJbs-ztjKnsH72_I>>h@+rjAmgr=m?zM`H*RyG> z=d1a%3%^aumi=;!!S+C(w}0!mEH(vy>ipbJ!xZ2D@7QpkO{{FLi@1lyzVV(kJ~Nkzt@kP7Ff@=9MZ@+iUyQrBb@Ea^Aa@fW3^QsX@L&>7c_7_wZO30Lw)V5g|+n~eKoO}p9$n!`85`j>q5X^m~$~P zc6C9LNo+NGBIBZ-foV_6;44D`87p|2`p+x$05HT1d69I5Oyr~A(YX8L8iQsE33R*h_TT^7P3oO)x6c;t1<{yZd%vL2J_+(h z+cZ5+alsKIo?aImOZ0fVxQ9F$q=g1pm7<_<8G&bMw_%e*?&sfrcX|Ztm0SlL((8o= z`T5)0J;lC%iaq~J@RaoA-t0mQdfy*|$K|udrqmE&XzI-Vzy?H?dc)q4rYv{|Az{2ggsb2tzTQqt)l`vbwoBxe2$=+hbCeI&4Lzw2|`>3~n zD!czhEW|STHrsHx$8|2y*YTst(HkGl%ek7LUJuI}#gzDIweZ*He!V8=OtZl@THy~c~~r5)YBJ6>&l z8ljOse-}o()Na8w$RnZQrJPD=w39?x4{qgiYLYd)Ae^xv(_yDXo1q;eBE1cMPpIJW91yVfHG__D4#>ZYVGdhsV~LX>F37MP>K5@P_z9D= z>6rA2_!9+IY@vN0+jp|U8Y<+Iq=n(T_Z;-G&5^hUP>=yl%jv=xP$LzpcK>Riny5uX zih21}P0RI1J!0i#>tB$5b{HB3yhUdu&!cqf7wgj*Secu&!@9lF=S(mz$>U$?jE|l` z85DcX8=M~7d@Qg^L0EYfyqPO~?;cD#B3Hv4zYvd>`MZc1a55u}Njv?sEA3Bf-^#r_ zPAeE;AL(~IuJ;s#D+l1kX6YF|fas*T5xj6SW&6{P*A*5)a1{jbXd-Vd6buzT#A3Yc6!$34DRv;qfw?s}*r z^=%ILf0i87o3!o#0wkcFU@)_)WdxLDWkq%^>c`D!U^ zf!cjYVp3F7lW#rTsaAxQ;r?#kjGUq8HB%jfL|tgNq!KqcK7vfFuaK<4jtOp^`_4xFb z**G~q&vHa`+1>YNE;il^P}1y}*I8L_7;B8G_Wsn);hLlkhuT6j@)%f;J#a2}K23Y~ z(6HYEsDX}&k-CH&I7@=feeKV0^0nv7Hy)2;xTSTQOO<}pYywuSg5zO*ZSv+y_B zbxMpn_IKCojCDxq41?(OIi>NFV1wJ?pb;$W)`9l8jv>&69xU*ffg+uUQi}3a90xLG zC@Mu85!rTq+=Fm(r&01E=H9~3a6>8h-BsEQK0s-PG)*(k!VEP{OrJNgXBC=r-NzMW zGc>yo?)R}$lJI?o0Bx5z=KcwtA0_!aB;Jir&e*ciMTA0CEh@k^l9JJMFN#SS>kG9^?I_7*vK7>&Q>;M}{L={v*RV zoA*D2Vf~K`r^WI|hEvsSkc4mN$Ew6?+Cp4ptDECi;6F*Mf$G|P>RgKB4v6|8UQLs3 zMW8rCptW18otVuhW@!fY2iDyc414Uvst+hUy?ap?leK@p#`2pa66{#PGU#bU#EJ|2`Dj!%>H+Oi^+gtG0q&Mg-}+e>Az=YH50c!3m)wc6A>e`>X> z;y0dQq*mx?G=0tZfzVm8n!7}O$XR)x-b`MNTU8mB1>t5;AaN#3JzM#KHx-2KL}>y3 zOw9>~Qy5VqDnKp7mx)!C&w%+C&m~|WXg4NRzF;PHx5|G;TV;w9n>Ss*d5eZ#kr(YQ zgMpC>R@gJ!%gy&qbpgbqe>+W{YBKM`t?>LZ0Azi@Z}bJ9L-n--LUl4E{+EolmcjUT zA&3|-VDU?s<}{7E=aha#)xLq{I5F$uptONK64{$VHq~MsBD=UPv;lwLj3pL|%ss7< z^;{cD#QDj`V+^hmkt_^soP6`C1PnLELIUfZB0?;U@!}yop)|<+p#vE9gl9!?#AePl zDH>E!9y^icn56f)jgh7^?>?+1)0uQhPo1MgBP61e2y4C`_Rkj9M$1<`j~wuuC8)%P z&{bG)xxCZ(N#Et|$d3^>l@B)W8EalE%ZK^Bo>&xsF+KOF$R<38y~>{|XBPgj*FOrJ z+t$`>NLhe>*&gS@?|J7sFHdYIvdCMVV)Fe<&2M(?m2xe-lnJibOx}~T=Pt;B2`C6G z2as4`^(SPuJmyJw7oqxBW?!Z@x|1$a0S4J!1C*LtGL(v0^l8z9h$UE-Tx<2|Ij6B^ z9L&)Hg_Ur_Lts87745E(R8?lA`R z@bQ>#4&$AkL#5D#nk3$cq@^5IsewGkt*4iK2A5@p1ZdrW{UzJ7KpvgsM{lxZr$1G; zU~s9U%*8BAIId|&Q07SAYBBriHiTlu{wC6FE!NX_Y(_p* ze%W~mky(FEXgUMAk*Q34X;!t#Md4)CS!fVKYx++Us}S5elzB^BDV~rOd&dSgG?xq{ z4)6sW%4cvupOM1DlR>1-*3PE>%-hG(pf53hFjUesDaMOIVy=8_j{j&B9ACVg5BwZi zl_#I{M-Ch2Q6S^Li?Uy;7UG)l`A~g!#M_`5yX8N3Mv8Y}C|*gzM)o9uF$;FYG#`7( z=XuXI$?h=3|4BY#;0@->;1h2rgP!%?lNR)~mIEUaw_HZ)(n_l6`@L>9`C@w>Mb-wg zztF{&eqobx%k()d<+`pm;pm&O5z)f4<&q3Rv0ZJH_j2yWVJX z*k^kL&a+&0zB=qdY-raZ<``o$XKbHpI>O1!GZsN<9m ztX^Am(Wc=X@lE`xt^o=mwT=;kTn`2#j{{riYsdd7eY-mmftMz2E0c`3)YrnnZfIYz zBAte33e>gpu%XWMz`X_0zTj|FekW#DAhaBdTHwVaCW`NIo#&V!P-l5?p|7Y7C86$%HhlIc9jyfz-8iolRdK2KD5-7#6dH72%SrIuY{-ac(h@p@|F% zpI(tiCw#twbhP>eLD`~w(2HBf+LeO^u6or}8)}O|bw^nNv4mj*6_I`6WK1gW66rMP zN2WneLGV3yto}6&lRlu+KMxzSCVY@f%%o97KB8aH6nE0in*_JN-AW`3!!4{ind;xc z^=FZA2F*~Zr7|0FA{24Y&cbGrgs?3ua{~h1hy+`+N;hn{1#&h#NWcu{PbT2H&iF7P zUE8Z-oBVhYueO`ID#La6y&Eu(Yns6~Iw8pGct@u^YIak5$vPdfT1io=wC@ymwGfrx z31aXAfW)3G?PBJyeOi?aO~%EU%P>SdC}>z@!dRjENDI6XLzh>W5k-cWL2+^DBfD{6 z4Q1FM-penET?D%S!Hs#@^ZPCW(yk|sldwL!5GQ)o6@Ppfw1dVOv`4zA;iO-T zh#3_J1Lga-8>8YWMG7&>_p~(fxo2brn)*{0Spq(nMu8DG#@4TlR;lGc?cV$KhY|C8ROzQ`;`wH9 zdt~_)P#;8E+4wl^^mpukgv_gjWOf!V~^2$k3N^p!dF6G*&PML%bsQVczCv zu%nTwVVj@~_b2xIl7pBqwrP!v-y48QMIj!zk8r^mhhiZBb_`aMf#o6N%hc*Io%(zf z<7>@-ZB+h8kHI3>r&#*dV>bUcLicyW%;Mrdh44)A&eb@QA}mpE(jkBEC*Ti-F?}lj#Y5%+T<+r>B$K-{)x8 z_a@2R>-4>52gJkG^-qIwYql|Z2lY`S{`fI#ozXv75+~FF3rr-A3O$_Gy`-RivMj(f zh*Ll_X)Fa(2ud{^3fHu1IC;6kOI*eGp)V?`u&z~l)|N#mpYj}1mmotZ3c`$~Own9{ zW;|g(7~HI70aj#A-c)A#?1z6)>8W=9U@|vCu-U7vc_AFnIWn-Cq5LtwHPb<_qJb7l z_e_$Fdb{qs?FR?2D;4_9@5t3CxO}3BQSqc%YkdKP{w}Ibc)5FCyykR!!A zI@s=QT@i*hzD~srKZ%;lOp$CBu~O`TsR_wb(+IfN6P4ucq@U|oKc#N__hW`DYvr0mD`X4`5~LPNzD+u=0TZXIj&M#B3AE)yu?CxCk~=j{+iPZc z&+1gpg*^x=&mBST_lI!MZlCK4P-Sj$isCL~GbTl>VOS_>($FsCPkOBIEZCv)*IEtz zQ};^UZ5GdmO=_QS|k@0e()d%Cs-i#?Nq>oAaxSl_OYQK#)m74N3tjm zJU{x?VN1IOd+Cv90FZ|ibm+T~0BAxjpj`@VTuH-&Fxz8h2R6y-k*NhquvnNWm_0UH zZh%=5XjMRQ)nBh8@PP+!3hazM+_J(Nc_w3(9SldYW1dx!zw!tLn~JToXl&K){P;+3 zym}fly~|iU*zN8RuZCUZs7qT^7mEV7<0Apc)%Sq?TFvupc^V6%d)VkvNFjKjPkX|C7;fM?7WnTA*Yolco#+0$NctmiKl$p6_hjvXc zOXO3AsPX_)M|`)687crnOViy6paklYhCe5}2cflz^fDVK_)f)@@`whfo-fiVIG>G9 z7*YPz)5Y>VN$SiLZ&TGtz$F9obd$JA_y>a@urqQa5LYZqA_Y}9E{&|W2xqA;+Y)kI zw-5;2IjL=mRtQQ@a1X?+gCyE1nnv5uUnZ~p(&Sz?dz-@nA{mn{M;Y`+XQ#it2P1ll z|B+>&c|m$KvbV!G^ZH6W>c9M1>kFSXchekfhYaFN4*j zRIZ1NDoUyg}N#T(NFljY`3VF|NZMy)?J2gy{>r)h3Zf$$3;nai^r@L(LSuMT0%+$FnM zr)af3ctsqPENQP>V+Dd;sI^w!7l-02*~-dR_S6)0NE~;~S3+#FP+tLl=tWBae>&Qa zVZm(3^mE%t*o7r?ZfD+{;9p2G{WA&Yx2q{=?U(ks&#=gX{m&li}TFa5S%rW>wRwjItaS+ZHp zvWA?;x~E*p?x?MpIRoTBO)Adr3;w-zVbbIEf&4xN^7StAUxW-HGLCq(FA*jjriLKY z(a^bwaIRSamqTKtdg%xrKN+ozXh|rG_~ARt9>&iUL&1N@m{f`XkTH_@<>_$)x`J;jqM=W{X_-ZBT>9#3VA?L^{>q&OwOe>MhKhz)b?MUNiGF-RmkbDjZ zF-zSq9}LPHPaVRf{uT=Uk{XYRy}2APJcHQAPoH(YsCZJ1-n2_G+0ZQ6vn#(seXvA0 zV6ZEO0>)Ve??qmNSKl6*4p!uY&lAjq*~5vQ z@N24n&*6IxQn>L2W;s&2gS@`gANRLNdiMjB0XHr;BT&rzE8SVZ0@!IXL5mwJ+qx$~gVKUGbjpaK%7okH=}=L;_jQEhi$ zne2d*#4%)6n0#*?1XhMJcOw4#NKOoSbb?|WJB7?FPk+5fV(31phoVXo+B|3yA06)( zW81a;Mc_d4i&k6I=aM102f6Od>f|h|?2(a*cr_t8^!M%KJDZSN91X471 zlLL5Dj$j6KNa>W*-e{$U7lp3ZUuL4kVmeC z>RRb3gH3z&1B3t^oFd4m&NHD;IFaum#at-YL);y?0gAhz>u7%_tbow7w)VU^T79Ue zG+%tW%l}73My{n(?bkzCUnl6+*5JJbYd0%;Mf7h-yi(rmnS6m_*W7IOom|l08?%du zGR6ckvXp9=o*MFeLGMD{!w4lD*-7DW$*pL5@ApztY3tiELae+EFa8FR8?t)oaJ*&W zYmRBD5OcB};_(VL+E{{b_-OiYA7!CFw()RjufERrR#9=WpkN4psRrL`Yd?OL_A1&t zo&zxij9e)_-m*DuE*6ga3cju2J#ZpR=Tl`&+fx43Y_!;54X>O+`pGE zjBiQY-CC?5noGokk_*V*_{1mwgtqNzvqc=He<4q=qCz5_R?j`*6F_4Ad| z$VyLxcqni9FH0siOYHw`$!J6U50=dDaYK1Fy5TC@YAyLJmq2lAlIymjAT5d%1X0B- zdLzkIp-LGKcVJxw?sc}Jz041?~oD| zF(!do%uSb@(XT51xV=oaC(EmYDI-4LTIpEZb6T-!kLo9qzYCe!Fxo*;>`CzcnUpv* zuf5HD4C`?IS!2ud`+PFrqHvD8^-OI$-1C_jU0jyJg^3Y_rXwds6PofL;49;H&3|UI47EPWFyM`=`13owdxGZo8P+Q z;~^=C3k4Fgi)tn)dThUpzY?7$$SM^2@y|5|aSTF419mQ2q-iSx2PoZvQ*^nC+)mOx zaGG({hdFOy&vlX=#zo2q4)@`zAb8R(!Gw(#*hq_|Qe+}Cp?^CvsQ-3k@I2cy1iI5{ z_pnDg&6R?FCU~8kGf)yB=;iII z^6U1dGGrEa7ta6(Fry=Kn?q&!vWY6&`wT(KrhmTTeDZ(BTauIQ<0#)|LIt%Pd%yA( zDbcDnc@#1%;gU@xz5<_Z9XR~Jn;Kgx&#s8x-s zPwG!$cGP%di5TaB=#ja<(F0B z7Ng5;cWeVZ$s-mL$5{Bpn;%_aJobH8($*0V_nP+9OvY?trh`qQujr=u(FF|ah1*h5 z4Se9D;VEQ2L|*idP2^^ZJp!5JG5>2t=9i$jRPlD-7cY~+G9GY;!v7&ghU)IWF)~Pf zN^y@A>>w?EWI4n@tLFWMbxcjqP6wG7)@4a>G}KrKieOG@4O+x=ZDH9KrJ^5&22Jiq zL~$8K&gFWm$jqEVGDY}@Rt&KE_6$Tg@+QFcTnOYxCDMVTnv?fP_<8Q!DP5P?c$(qInJ z#I(OqHxx2p=IJ3r{1An`drY@0kk`SKe^z%%6MMXQK)S!Wzfm8CPb*O?^?m2NKSJuC z2Y*SZN%Z9vw4`E#C1aFL34?Bef@pBR!U7BU zchr#vHJBv#-$_UDXZ*lg9sjJH`m^|F&c9;r?UC zL_GdRzG;!C2&56K!jnO*Ve7mtPS%xba>lYgip<@pwhFhdsnTJjp0fOv0hIfsw1sXf zWG|Y|f4^@e+~Ns?<$L|&hgcOLL%qNBu!TJoRj^!P>^a{CDs{S)kX&Ith=!CcuZy>z zY~1k$Fo?Mc?HrzSaVouX%7<>{-<$YN!<$yJ9Ba5Rs)?W?+UZ)un9sUchA1EZ5G)Vn z9@SO2N3OVKchL&-B`109G-zE^Gx%4({(|<7O$ELNK8v&;ve2Q+iq0LH%nJbE5q)<6$CE7xf=yHHI>lICG7P z)XDrjClq5)CdGUy&2(~wH#dvoe)IgsH>_*Bm~0jk8)lx?d8M54aTKxu`pJB`S`%?B6U4)joIFC(ts=-MJ?xkkm}jnfa^4+ImT9eWIG zm-#YZe5&;S+JG{cg>T++x^CuAG!!oxGiJ#e_CYQQxc3;<(|*3oydQdO_PQSu(U%NV zXT+lD=GpxL=$5F;^jt~(>g~Gu9DhMYVNlbX{at3yr|dj&03~w*Z-mtD`fIy=6{U^U zrtN8(A8l)d^jg^SqGA8C%{J{Y2iN1_zLL`in=tXg|AdPMU^eFoB5z&$M+~8{8jB zpEgJf$)JZUg*J>Any<}X=AaeAhL47<G#4cXokG%I6AS+tBon_ylkc(?(cEulX#J@v}cx+)x z=iUyDH$%!#8=FgpF(0O9H(^Pz+k#j6{mVR@SJT|HBvx+v5Vko>CEtIP8hLar-JQO& zgN(hJBlQ>3CA&z0JGLUJEi@)onq2Sjz(EpQ%1MV!gNBm>sS~Tr%By~GX>x$w`hfnt zxTXI-(n{K~`=}MUS*`mMi8plVB_j7=e5!*OE21`0&QP%ErXvPkb9lE2ePhoQ8BBKc z(!$6F<SZc$7`#{XUAbk$wmUw?;N`*ZP#j^VcSwkGc}Mi7a#ej&1E)#P6l^?Dtyh|IrP(T zr6albf}GF(NVRo7rL3%F`EjlWf}HeEA75;;y{0ggns2B3CKN|h*Q=%TI;q1EE(XU) zG^&BLsV?qO#Z-ot_8Ui0%jLjH#gK*s)nb{OnQcUF9Rst3QV`dAhkO0KzkK>Z%dao6 zo#1>}Jk0nx?g5RXL3iTEeM9{h8()}Bou?De@ktIjkk&HBg=S7u-qgKa8bV%)1=3H+ zEQ{WFAr8c&6J^zX$o|F6no}88Ab5V&b`#O212Kn%ryZ2^eJt(!P(dTuazUBZOC9X> z*HC=Hrl}j4_~R~qRv=d<&y^p+5ju#1GKPI)C)1lla1a1D)SM|_R?+`cNSHO<%jjWj z1Z(cMWEA6FqZ6L?pcD0~sC`^y-7lh?adnO&CG-*qBm%X9)_HORoJ(<1!&I$}N!9+f zB*x_H?o^RXthi)c{ZDwZ8CPtJwuYk(k6(M2`*xMz*|pzxC#4;=K%58P!p`2q5+v_< z!5lusSg8~*?*{1?s#1|$9tEU$tS4|O>_*6#_Wqjx+x-gEg5cXMBL^%5{i(>aN$3;A zl9dNbF}(`EiblaV$+ii?CqHikk+yHbR=z2UIeUFoCOC;JTiX)T;B8A6mr}2vJ*6O4 zS=y6@q{+xjiC;(Y$s%z5y!z1V6@c`>+i&i=&$DGe$1QHS$!++{P*am*y*>?*>jLo7 zyD&Kz(wlOwiIVL~d=zPV8nl*ifY#x)sm*az(V2&}$ab6Lu@A@n%LwE!2CY7dc#G(} z9>w?SXwM$cuOa4mwE#ojXiF=%E2djfrxV$?OrpdJWn=blE)P6ds=tMx3cT|Qi;fQ4 z0Gw^M&hHw}x#t__{SSvrNL4`2uYt+=@f07;j(fxcyMaM5k~0-X;-oOpC$|z1eyZ@n zn_p~ep_|oESTXKPW5^4q@}Mm8NzubM<}vofA-81NVW^aY{+MH%W&?2JBYm?TlC`=j zbPS3;u@oZumw$@m<5ugN%1jRaMo^Oj%!_KvMgB^x3ayGsHr3diyV}5rt}S3FM{gXr zhRw?}SX&EUkVR^1%5zCT)K-<_zL;ezznI}%4xfNiFErky^Pe&C(r)7Emi}GiXJOTU z2*Ala4~f+%yCDp43bgpiuS=b$0hfJLO3L)ZCp%q0@{GS-psAHO{%{z|5}REY(ymIJ zkVQ*x!X>I_w3+)}*QKtiG5i9?M)}>9WzJeJR+b3B_th^JUDg|jI zV5J@?Sa#s57vJ`UJTFF3 zs*}QJb9aEYA~g~JotU*0&2Ar=ksA}+pG>)ox+1&&rd*s18KK1ft3mrzT@=xU2Gbr@(d9@2sdZVza&Xpl*x|QK9=UvU z<3d&`XxfH`t(^VdN6x6G3v>WH~w7a%{H=Lb^ zT6V8Ztrs)BCA6%+j*DM}(;3F+CaU;6L(^Kzj4a}aztkX!Gf2Nf>k=E?tgCXHD>UMU z@8OE?@l0%l_(dQPgxT*4j{)(%R`VPuFYl!Ej9*9AibM{#-MGrlal6Oz&!6lE0o;lYE@a@-md!jq6G%9jD!6y@> z%zWESLM?Uydf8k)Yuw%$8b7h%Au`eW=tDhi_alC>@_fb1F7+fSj{FP zMkGEcvibf%$+@?wsl(N9`w1!q0w{@53j&h19i`s%6jXj|>70GtQgpj-c<8Lo=lyg$ z%3jRya4G(u6ruErSncg9|WDCj%+5@AtXW80TrZe?$+EQ6f_ z#2$y5ZJ0sbE(pz{zMj4E61mZ^ntm9H4vo2mT^^Wit78cEL!EODYJDY*n-`>Fduvny z^|_$`v1LR8<2R(vZBmn9FXiRAo$=~$K8W;q6#6HsoG+FX5;k%=dD1k~(_2D4y5joD zHD5tu4kmTuZ>y(GuaJHf^fqc}k0Af*2E+-G6e^S!s@lLn~(U9;v#7XjY3 zcx^NO=UU9H(_~M!#;JVN;g%dI@l*_k)LO@(sWNFiB)_?T|C-!OEZUnkgs>3Opo}370_uQ@5tPn?aJ}FE_7CYKhB>84 zKd8AELVyHjwQm}gcncu|Cf;;1*ksuw1OmKdHl(4QSblS3w7QOfnAfO0Vup1p;V!PO zh(|Lt@ZzyNzNNaaitY)V?YvoQbfQTo0N3GcrlzlGgTw10n9TE zL}*lUzgyJFDl*QOAdrGnIc9zDjS6Aa`dTG)-ho3uEorkz0Xy!%5}PY~(uhRYccLp5 zS?nvwG<>j@cnkTNc@UG-wxI|*Qj`f1?lUI=IY1>gPiZ&6&^&}vPa282-SQkvC4%Cx z2`mA35V;G`R@={#X&^SD7%HJi^=Bl+o>?1oH?PIVF*P;~NhKMf!(Y7#APe1pozWT{ z3!OJQq@qg$rl=1>f_H`<{}gR3EbQYfH6TBO`dhtcxOlF# zIZ9N&X;6%gnd&c@H_ERLvh4OtB5cI>A`~%(o_(^uscHNXVs?edp@0zh2ph!@D{K-F zy>>Z5{{T2POUF_Uuedqlawd5ErzUnRl8a%odOleZs|gv?+o*Yc$mUV9)y-HICV0ir zlspaE1S;Vs1@L{I31?A?fnQ;&D-b4cPW_WvjYL`SEs_s zOPnJKBnyp4l@ZwE6t1Dx25=-6Sh{;4a*Oxy*cFKk?0WvxRnJa%G!(l}^7VIIRHsbo zvwg2D*xnYAj%RfE4+gj{tQOj#e@k>8i9lzSxyJC6O|c2kxZ9b zc#MG8^OTyMWHhPf7kB?2ztT6GT4R>)dFe)ny+oXPl+o9FhkmW~BFf8{o6++jDIZ7C zRUg4={d!-s2}!H#^Miz{0Q9SQi+$j~DZQwaC!Z!&G8RtCsZqemE_|$zP#((W{0XVK z-GQW2lfVeSh9w6Z+EpV{W)mDL!>FZCa)NdNYo7VCKK>0#Uz|j~ldmOh@kSBGA-5{c z(n`@lEU@k8Ew;owyZxhqC<3O`>aTF14<7{Lnzv6Ji!6eeMDZQ!Gl0j>93hS(5^A?L zJx44%IXW4@NVLZzO{U(C*H~#9*=N(ug<_MJ8C-!^l&TC6=odqy7QQ48g>#H7!?djDv9MtV}501C`;*w8Q5|VW;lC z6B^gM03g%KJ>p{!k7q7HDN`Gd2cbI`QwAZlE@Aq_jCNu&TD5#HfCB<^@7D}vw5|OW z_JCAW3#I#2zZ0&+bT^qSxdgzjAAfi~FfC>^r$02u*w$jO#TtoBD1h<}vt+s3-RE!x$HjcO%ZUOvMl^A3a0Jlp#jY4wg}Xe{4`q#eE2C^t%<5&t z8+vl{#Q@b9^Z=c_1$|@Tim-!4(n4^I)YBoeW7xEV_nzbbEQ|Ln0L`_?RLTx{rzK3$ zU!{DTYaxA85<2;Bp zTA?w$h@i5c5DVe=4%!vjbgS40b;!fNVwxN0s*b?MjB*rh^Cp(Q8%TzdH?M50eG`2B zJqcwaev#Tu*QH9_x7Sh3#`gB;QdyhnQ@h%)jF9MK>jl@002EQnr@f={4k-}BJD4*3 zF;E=c-&Ymcf^Ngm!^owb%ZJip8R+(Bsipu87wE&{q+DTc=0Hm;6^YbA9ubxWfIvR` zNswSRm7Ga7+|j(wOExZHwIs>4j^G10M6tXi^=H@%V}E3lz{xKoi15(|;YgnQiFkKWAZ`6)lm0YO_WSb(Ds~Di)1ofCE`Rdw4eG$k^MdN zd%#a_dZsgFcRAiLhLtH!eWVMDH<+B*1r3rWP3m_?JN%wVifowmWnmZ^F=x;&Rebq5 z8-*VzQ!C;DHlUAqKRa42@4(nOh1k;tY%%{*h36Ah7Zr=);NT>oc-D47=`n@d;>EZT z{N|Y=GtPt&5n&nNf(+Cg=FGr}u{VhCO}l+$4%Y-M8E?o33ycZnr|ihtQedB*NLGVl z^xb9mS7{<;NFi@R9c+~|AT62)?dkv)d(p$VC>h;_O9j% zp)V-t$VRpnxcRMyX;H|aBTireGl_6chqrIvUpjQ!oCmYKHd7n(Lv+X%hDR##SX1okEP{fd${eXaRca1&4D7VIy_ehU-_G)i;^V(*7%tl_FwV(-S}Mb&4quL}w~ znl1PjEh>Ui$xpB7wpSnE+qTzV3*e^ngX$YLnA)+g+uXWvqmsys(7-7I+gSIXp9_nmnjxD zrD33Wv2vL~${7HYhUkY!Y&P1+n*r~he;E&F>S|e96y%}=Q{6@`9`j&Do+hw;} zm_dLwYCLy&K0_a;YLEen$-rObc%bGvyoR2D+ zieup%_j@1f5II-gC{(?qoT`jLm{lh%2JBi(I)K)aKBC$ zC{shgxYBw$->(bb4%U+zT*&iVqE}nH?zUSUa?Hg=RmBp~+e1Lqy{{eOS*U827L&#Gd{ZHjVKORV`b1dRBO;@L5o85)}{$!`^8sI5l zJvD9EWqG))2m2nJ<{%Spx1fyvDa`X=x90pk@$xVqlhtN!=a@;D25gli!|-t2-{X^M z3b}XFW+kb{ZWrHZdyZ4TM5|whNc3Q%orm}$DHlcKM?y7Bn-m^PToSa9sfO1s8F*E0 zD4{tXoJPNfdTn!ek}@bTp5jn|A0k!!FJL9% zHT41%*dJHMoOS}k!I6>SJdpyVCO3X#|I-K?zmtr(HE5#h~;c*5cxW#oB@f`{Z-5G0R{UW-&1 zLWt`e%t{->X_ah<^JmkpXS*!Asf{R$ESx}h?~sp$Xgmo|tv9Sj!hh^fmO(4E3jak{ zIWft?Z*P&yK?_ziE=g7DgD00lJ807;Z0DP|6DzYAYl2x=zUAtcHK_S zVZg6+qpJ*f-4)*fPrC6+#!dmezqcfh9HoDl?AK z{I%3c5P)9BB8=#-7GPc)!EFJ4q9t+Sd0;xZYL$L&Q}3>6@tppTc{L^ z%9FN`AgmXs7tpj8v&LwM3M%Pe5Seqbtcm)pUQZ23WqeLobTIG+M6^4w0&+HX-J2(r zo~LO(a(K)#-&eRDu8xY>^R5lmp?JM&A5XjloAIhR;Zk+s%6J?c08IHiPd%%)eAlz6 zA(%~!ptl3UF0CPi$3E5GQX~aR!;OFgJ0)k)_eQ+mK;h9xl+Wql403>QAg!4j{y*mV zj9iZ<5z|Uh+P2I&Gs}&exAPtvS0)C$41EHiHmiT*1sDuqJE+o^(BdJQAs3gD5}K>5 zYc~72Tylq-R+nBpE@gS4niz+L%d4Dcatb922;_Ph;PFFj+g$FO5Qt93l3v>5DMqb-0R>j-)Py8WpDJ zP&?2&Y|`3~mwfSUE3KPo5cOpW=o*2lJH~_}&*)G%KJ{e&*fWUF{sz#T)!+NxXamln z){OS% z=p1TmAljWY@JYafQiNE?g#!i} zv6aeq(xM7@UBz`Onfu5D7fAf~f7OS)NZBVg{bbtkDT78oKzesnzcn-mABOS@bE?Ox z6<%23l#ufE9G)89s@GEXjueD(AA&6Qi4^yjyt7=`nviWsQW^k%jqwXB@G5F}pHt=J zvt7AGgH%lJ1DwI$L=-YJl{h1Iv zT4y0;rsGD#M>5wJUKHk8j6`fb_x!t2b}7%ubR%*ODLjELV_XR#@XTR06GoiGadXUCAjR!$&69609p1ellsty-B?F$PX^_GlnIe%_1-h#dSQXM?6W zO8EDA%v8yY+Rcdx~36sT0W>ozqUF?@OC{4ios4Pi56$YvmCYLkT~vVfzT9R zxEO9aZd{TYA@RH2_)t7lI52A*X<8RKNL+*10f>nJD7~z`a<`&rdbLRy4>lRb@0s<4 zG2@nRqR5^}KMl!p1}`yfei)-b9TlZvX=4&^-kj)j7R;iXBXo;d)5F#QhzuSq1LMAW z5xK0J{bKhdC0E-Q-tW~|{Q#~Voj^O8+BxM{*7J<^n>EDJz2_}5sYfs{zd$#hIQ_#Z zZ>BzUtNjB^W)g1GViGY=NcZw!$s@_*`KD|}*s`>sqBM&2q`u?e@fYT6Z>EdgmzcXV$=;2KW+JxkG~IC zB5>g~@>=(g&03qdm zX^K*N%2?Ru=RoF$--vo)j7t`{W*$+uyKAar2I2v!X$D@#anqeM?IxtWHgz%ZFdl?i ziOZvM$*VS;1Twv2lQp>v-%Cc<+&?D+KvgTpZm^nLbX=$faVM1gMDzRQEhN#` zoB{c9%zjY8=+6-Pl@7lakgX&jcNeCJg=Z~p~ z1ve}o=n3L$p@VUiA$}|j&ttoV~@20+;58x{?++$!QJuL$rHNucQHH|&L~QFO8iV{$Ls2#eAHHJ z&PDM4x~n$>(WGw25$m?|Sc7eO*L=oBnrAa4U96{Z-<$0DIb_5TIM}ld`1|c{JOo4y z%r$!eS-iqmv6NbkmOmjCJDGFmHM4tzXzuoP*y7~iY*#E4@tTlQ2&FX08mchS43(Tj+tngRQ!X# zSJ%(a=-{`~Vc4Kaq2Ny-H@c6uAD>=TN~r ztS$SPR2DQg+0F#$vJWGE$m>G7*V`HPFE@6Jn_fVj70rqI1Ds3LVM6&WoJnHV(j=%} z>U8BthLh+f@2Smy#I3|Ff(Jz9(z|3PD9@in#=mzIK> z4>HH7hurgOPtc(r(9fcHF&lcYCm`p`(-8N-Nd3QAblIcFK0tvgwh)|;z* zxR_(5?~74ZtqnC%Wk~*nYz^ME&X5-v@k$UujD5*b#EB`e1sqp-)*Y}VT?)D?7vCPc zjAaW^G*z{b#vS_*ul@-@j0sC&b?Sl1r)m7s;|_kk?#>dLbxGXdSmC5gT?u|;0TSF$ z{q+`+Pz9gA+tPfseFJg2$w247C~reh5%2{}8cB`Ddi^`ge#s8O8u_?ydvIa_jr1#A z&K`&rjJZf$>Q?-G=p%FV&kD-;jIKPxQjUQ3y?jMUC-Zg9RzOLF>Ju)y93s#uD4#Oq zRZS-NhOT^hkL?eJ-23ldGy)FZRMbn^he!WEd5ZY+A!e8+;|ZW%p*ug_7NS}xXWohN zb??H=HKus*r7OVUf2nA#`cdHxVWdd}4g}Ucp4prhb22h_Q@2qwvqZc6UEb=x$vyzu zZrgMW;Sq$bb;>D=s9kUQuJVuf0!8}afwz1V^x%<%j>fOFLL2xxX z{#NHKVLB6ow7+4&VbSafQ+x0V$S0f4stpK#sUJHZO4ztj$=FaJQ=*x4ANsdU@EB-} z4laTNZKE+!e)9cr`1Pll@BJR0nw!k`e0IM@5*UsildiwfYMyC;aK#p(^LtRx^05Tt znztU(`8UZRw-Pm|P_BhyBEL943b4KVucGcxQsz)_EiT`!2YsP!!V)eKMLvoAWNdV9 ztiw!#<0+DBd!&v22nj7Hf@Nny@JN560sPib)iw#}_++~dg=>LOWF3S%$!mJqC0V$o)~WCf4Mlm8!D; z>gissUJ^Ud(L5+9B2L!M(&QPXwq4EdYi2z=>}ql3ds~6)_me2o!7#!4PgS=|6b8;6 zxt>7(*?uAY_!5JT@l}>O5289ao=U~hYp$&%a~Zd^HDOdE>TTy4rT*pj=O*y=t)!F$ zNU_mF(mST-rxm8#Vm5am+gbUt{$srL^ZAaa^EYEIf#$nXUy{>sR;p-$)UG>(v5}~Q zjWKs#0}lVr!hA-_3afKASIfNC5Dg1?(PvXjk6TQuj)Mi~-p+I{%Q9j2^FG{giioSj z8N6*FJ{w&puGL^=PcXWETi)FjnrdT;h1$6rv6q_<7PqzpTkkgrg%KO-JweG}f3i3n z6e#l$rh3C^L1W#qZEOwJ|vt{+K-XeUG?_LN#&?>)m_mivH}Ec_9Z4wwV_)-l#Yv1 zxE??g9t0PkK9Iu3u*9gy2)<1L9XBJRLAD-($ z(=j>G6YJJUTMeNDDjRh}y#Y4)bTG~rKm_QX!jV{&M`{mW!?T=Bxe^*P z9^?X7V0A=X9k8oQ#x_4K4L6<=EjqaPp@}h&-*%`CS=X_fB9F4#1IjwCDU;YN86z82 zVVz4gj^q|N5`QUMO1~~DP&ntdnCA-+WThd+liDH&h!A}hx|JUiE4{3wx^p?rPx7SZ zvd0#4p$m>-okFKVUy1Tb0&KTLGS;NNLw>p1POS-gAu|Fp;sZr-Oznz<9DggowEuKv zOy*Lw-X5-WNzb1~tYhzIDPMQ9cu0CQh*16=N275(Y>7ds6|#g|o^6=X1fh;n zJ@zb{UrwF^PrB){eNXVnfnH(JC18FzZg8hm6~7aCfkLEfkKP*HtIP*-O=f;o{x5RZ z5LSp3FtmsQc}~zGoGU)t!?oH%X(yx@iXzJE_?cn}ZCGfTW-8H&B;%+nE7Gy|p<%ux zeb5DcvvkA{KMUuehPe*p4S{b|6SYrzuji)cst@im;9lnYxta^D|Euh>PfNON0q`-K zoD3dxTajC1BXwd8bn|Hl_kkb$8G&S1q2=R%ZvFYu&B_AWe0n>La6iH?bdoMqh20Jf zfRIWTcrUE~3?iZr;!Q_dNV$ph&Q%L5PEk`W8PHJV>DWWOm7_;!Kpy3sWB|N0Xw(Ju z9~0L!&afX%1qyu$P6qJf2|$&Dkeu{*2PF_)q1vV=-noI2N@V2X6cO;uSgbis`@ESi zJtZ8UEk?$fxDOIV7{TU+sJL{q$zU_Q!wc}yzp6$R{Ao@A!pJoY;7m@?FC#$w<#_ET zk`E@klS1S#gt4u&&r!A5{!@(7TvnWYmD^>q?+AlYhwzn$qeZ;#*T*!3B_^2v<;^t|vaZ{xp=#mkVgRPt4zyfrN+ZsyiMAH-~E>7q|y5xaH-27czW8tUfT$ zcRFPe#<~J7Dzx_5B*5TL_hpxkpd4U~(<>fa0zT8NIP)p-}M@)Ad-7UA0FH>zNe5Lz_?{ z3WB0_DGUtx|^k8$@1|R7e~Y%)&(8 z!z&F|YFaWi9$)05i;owWjD%b-e?G<62DuNYUJQau5Lx+Nzj@l2z`&G3c#VP0yl*3r zNq~x42iT)ETZo_;zCPm#ZuCSkh)hG`&BjkqWy3&v*9)#hJVh}up5j_IJ*Z^YOWbz5 z$LA>fqvP>X{A#7kYr8w^E!d3%f`PyM;!4ahAw4ERUCPucAr61lT~I(uzkjtZTU!C{ zj#I!%;=4V`B6Crqhak|`8TLJr&-P$C@Dz3^&F@oAO!SJ4btyDz+l~49jp3k;fp#$K|47XQ|8DOc zC)1o}tf8rlH_`b3b8}ugIM1K@SM5js70KjCtz{wOR88t$R%o+i6{#NLo$V_I$_Stj zilDc}dznh7^3aGn>0SPoRP1!S5NlmP`H&rPl=$KLB(O>5v5d1b;{!}NhQr3fnj~t` z`A-yi-Cof*e3KN|%U9jj_f?zuns$r6EZh58!Y?leOs=tb78Pglj1=*7>ysr>22kD` zmHbHdIL_L!DmP1EZcX^V*t@4L!M27=&|%xQZQHhO+qP}nHZyFS8AgU38Ae9Oc~6~B z)zvk+d(`NQt{<@P*IsMRXJUh>PlLtxIgVTW%tq5QQ=%J!hyXxIwpNg8nwVm*D=#HW zD;n$-WT`q~4d?BJzj0)4eO-+4zEI(jBl+WpcE@+TY&P~8PohwkU8uk#q8 z`a*?;dB7CAYkBAWA^6_*hi^O2w?LoSJba(M9chhhU~FBBHWW~qWZKNBGbQ;nb_}`l zE>PztBfL}>ux%Ht1%TPUn1xIS42;J?sNH*gr zDmeH=z~WCOPb?%&+vEELn;xZsa16e3Q)jD!2?t;A-?hFGU^L%L<2XR0xIi|r=bdrl zD4luVyM7I2KH#CAX%`3YoMul?y*<8j7u-wsS6It^$W7U0Yz+d7cQZ)SO-n8h9<0T` zs1vg+?+yYf+$Muw<*;uj1oo0;1Ez$T``k_?6|os9>jZ9%$-#X!5gh{nishV7kQ8xtQrgMxc=L=X5~PE)i%-hfob8(KFtnA;QDx$O zE;*8rcU@Vln3%dookbi{p~r$ddOw&6HsKTBP4htXne|av3joZC3{!=)QyB;pP3#g9YNuo*%rv+LGv4#87l=@NZrQ| zE?n=7b$>hjwjSF4Vpn^A2@ong&X|}xBuiV+{8CTR+)5>KFh-LD&-|=ZRze9f|D{xdPuqHssL6FqXNla)211F2adG zR-`+kC`-byEepH50ybvOIxrKSKy=s!JS0@3IYM26P84|5oi_eV9o(;EAya$NYkPZW zZq>}gf{H7!>I`sn^t`(mO)UqsCHh`tD%`Nzao1zexhYR#R8o3ijVhal;(4kizd1^h zMB`=O;0=YACFL%q=ycI$mbt)O#~LCRMjsdPRwk?J(cGw<6yVJ&tX_`6mJY~0@T&`( zqIQeKpe)bDLX`F>sJZ;jg8mYy5E8jrYdnCLJ7y2O@nz0N6w}?f;&3zW1 zE|8g923h&5jOAjQ2qo~AeHNuRbmhT<@h{5(J=ux+vFH^C0>gSq%GlB!}Y zYi3DBQ~qC)gT|$#?cnn3eKsBUhvfwg7N~ZPY_T78vz5wIwk|cWTzILWMwThArS-v| zQ1c4oVE6120xS0q+^g8`Auh8&&(tGIZvv#rztI}z9Dx9K6e zYEz4j9qB`(9YYL%w9{}I?7(THe@K(?w;*OWtFo4;dqldZ&$WG)EGPJsD&vzVgn*3m2PO`-DF#_b(DGzg?1iRdQ6=SGU%S-P~R zTB!0+-Az(;JKtTRS4-D5w`AVx0EqQQ2hB5C(gBC!_Hy!7`$l_+j_Qyd>KUtB+ok)u zMnE*ehmb9pFID7h2#}J$Wr$XvfynbN$)9q$Z&BbjgdQc;Edx-lHB*X|BWck8-^&ne z#5Ehw2hJR!LpH9tek`+ozrIhK9}mC(S>!}F`EG~ioZudX>xZbkDk-f`P@$M zD?lZayu~G|!xAJ}=cn=++~`8xdc>IwcZ_4R`Vm>5Me2P5bESj-_*;WM8D6Re7II>#WH={;&`zc z8PLVAaQI+NgGjAb-7h?sF^bEVc^K};;^&;NUVm$~h}$itIM;5Qc&^G4d)<+pvrD~^ zHFTynfmic38`t~H3hvTy{&(QP!M|pii1CowU1h2Ol03a${~Or(rsCOc@?1JmL~M1#x62lbqj(+T*F zE&IQi4nRs!QgK}*#<Hq2*WLO=1z3YC4z-C{kb?N`h31YPwP|DaK;-6iq;Z zSjs{FnXCGcZYC!Cj!*GC*FjjYinO_yd%wHO&MCuSQaNW2(}q( zWltn_!lTL9=y%AP9B=N??0xjDjZB-7z^$ESO`2{6ZzFDwFa86mZCt&k4WD*Tp4#xN)$|52a3VZ zi>!>s5ImZH$hNltgZO!`0QNr+4tGbx(JAQPYKzgSyeQJf6tl@W@Ww(jYCPr_i35W< z;cML2s>sLy;|ND%uH-p}5rHlmx+nykU ze)8xtn?Awuj~Dt+ls!i>%Itzs>!oRy3h(j}pu<)V)1oMNjxEBOmXJk~!Sm6}o<$oj z7wby6P_8Hl1T^3z*Yjxct?`8nV@8)1i-2FXIvTGwDD|mDS=<)#QkKzwu?_@bQ5)XV z@Az?LT*e8aVIoQZ(E=PGT$sHjUVrNjM!*4d7uE-o=dA3>B_n>hWO;LAB+M(j$kKKr z7YqpzrC`&(@5k5eK+iRaZ-VbTJUM}CfX$IFF#YY6{UB;H9_039z=MDksB9vgfLWp7 z=r}b{U3xnd4r$d=1~XXo^SvBKcZYIX#edWbu7kuX};s*GNTv zS44G&(_7gdoJwFMmU=Ojz+9{~FvNM4gPy3lNcR?%n{Q)@7{~(NSDwmNb;^15Na5fw z#DVRTcRuMg%H<+gZ`=s2R&Vpa$gZ#Y^()4zNDbJi${YLBSQ}tP`>(C^e|VrDRgdawY-fFwX2O#KhrR&N}8yMZ3(RyySN^P1UFZj zXZ-fqa(!Pt`4Dw|H`>5mSfR1yx-T$iRg?W=4>x`CO8h6e6j7aIo=TgiT7*JS;4EkR z>V#1g@@da&aLw5%PHTQhW`X9ap3v(|Yrbd`{^cSd;-sqC{(c25RZToX?>CSg5thdi zBDdg?@96Ua6!Q-kp`xo#3u4+)B%O;S)uEV4E!`$gt$Y&DEL(=xcqYr=3_ufz(@|k& zZP-&l#NzW9P&g7SAt13b7m(p+%WeAh4XW^Jl;K@1Vz(-hq^0NOm`Q?g>XYHu*-gxU z%SF88B6=H>)ReQ0KgTrr*tSgEL6<}Q8(%Oc$@R?_2v9Mg{lyoEFlQ)6c!ek&*T8%= zUxSC1utz^vUAHI9ytP(MSB6zh&e3&}I)ANBY08k6IAbIKxO0+^`cH@rz1nyW(Ji>2MvJN) zJz2e`1_UYozd#zMUP3lVkDw-tfJJJw*W?0&E}h*Xd^1 z>v()?sM;{}nBvs!Xls03fUfd+gZTwR#X;EkqAm;=s}-$>#pAq=K3D6u&zI@c=)EQX z#T`Iu*Zz<0!1Y}=Mu_(XN=v+YT!r*$=z{K#@0ISl|F`sj1Nl~{@pTV0!rJNOu38=L zfc>f&ntHwuMe8=Q_&55Fy_U8b_c7#qUrR3&;C5|sc)*JXbjCunTM@b*BQ;Mt4J~n3 zYh@~2`jR$%N5fdjYRs!VRqXdXaZ{TLRS2y_9WsMh+rFd~Wg%T$YZ?IuGOre70c-UB zTF5FevnXsmqCEv!=LCLb9;arn+4f#f_@wY#OnioBK36tb68mR9v6pb}8v?1OJw~WO z^VXaWcirG_gv{Ji8XHl4b5H5x)WIbtmX?F=BpUU!Ld>rQv}6{_jkdSe8I=YqgnX4c zJwj6sD>GHs2&vj*Pt~5Ls}j{z8BXolDykLOY?YPjMV|61bv6NJ(#e+6OrqBOaxX)ZOXFj;!{F%ao@dWI=1v9bb1FqjmpIvaDQHTzM@0iOxaREG=;z8+PtUuu zzqeV$xDXAH0u6=wOMUr^g$RLhhy81ci5kK6@*u zNKUvarE$#>4wTPP*_imr@2YOCKjd-Y8H{F8*kI-74dT4tA2L(NpQxwt4cK%rQP-Yc zW}O$h=*En0zjy)z{?GXS4+ zVnh(UFI#}&8wGT!Kiw?h>7|U%`}YUC*!gVB>zPq#F6>a@4a=#FWrL~jWlDD`LM^`X}91$wFKF*Do zgDUUq>+{C3Kh2LX!;_&mh0hFz*7>>aCwLafZQTsP?Q`c#@agGk5GJ;=Js7LHDLCr5 z)9?__IK_%RF_p>fA<$_u6b6Te_fk1i54!D|wLdzOe_)f_d@=~DObv8g?K7N}unssv zT*|ocnzW&?X!Lj^_kpN9hvOMZPIH!Ihtnm^84VTNaX}+!i|RLHZ0wq$fXa`ZpBZ1L zzMBLmCTxu1&(oq;WnUR@{%$-`e919yIcsbx!ZuwhX0+&Wd8Hhh(n#;4HO8+7kn}?gTOgE)@>6K2b0d2G(4=09-tyi}cDQp`+-#Qv zPdbCq<%gWx=PAHRjuj>v~eGSdR7{ICRn zeRg>LdXt8q=Oi!G?>oxsw%_@N(N{=UXb4%e`zgBZu6=0XVfzH5$J#yHOumKPyjc1o zEBqOY-_;gb$HQA&Srlx=DZuLZ0cciW5%g>2qFvpN$7pOpeV3Nr=ZFq>$Qb;m;^Frd zj7>?;_u$18G(HO&T7?{Vnds60K)UP$Gm9;V9n%0-kO~qye=y%xN2}MDv_i)GbBS(e z{I;+tLT?T4PN#^yXztk=8>WcjPaKeTz7LnlXOeMoEdS&hMWBP>TF5iPHfSM~y)L(__***xbP--xdGvBvu$s>u0K$IdaSdO71JUms7tdhG zo|M?fKWRdTquw?l8(DuhBOz5E*b{@u3wYT{Eq_6j@5jD-Oo={|#JWDm_{P}Up4m*9 z@!hsq{{oh0M#cxypPHR{3B-|I{3PQXvD0@?pM9mWrvPJ(^Euk#5S|0i6nG^LwK0xl zj}+0P*?aHoz>(OyB=&l$5JkykN4K~szDYs12QjExq|63kZ{e(v2L>!F*ho+f#%+sD zN1tUF$}azF#finQd)wnH=Y1e`^kj^ox#j+7ca`pQ$YP6uF=^t*ij1V`LBH=#r2s-S zk>dw=9Gg))HPC3Wo~zSvi`R01tV(IIqtC{BxnzCQo~RT121D?@Bg4ZE(q%eo%=9vS zNt9y7duyeMNt=yr6=vtFk#v-p7G+IUVy6+J`s`)Tojv0R-Zlng)p37;kGR_;`eD7Tt)-ycN*OA1YW3 zr8XdwUc8E!290PY1$3_J2^1bmIvLQwNvfc&HBL)MP{Maw9x}r|mgpEuU&yl+0$}x8 z4tzqxbwN;efWfS}mNdd>>mYK+G$E?y5TIw`NyjtV-zJe@l}K{Kz=;~gp*k6U%#mT$ z&nT-hc6l+ty8#3PK<85cB4wikeZEx!Y)(wjADx(Rz-%>wAum%qqyf|~13fGC`lTRCid;foe_!|p~cddQK}Ff(~B ziF60?gS{J>*tD?(Iz*fS4*ohZ87@FIp-_F<)7aRD(m)o8y~g4WTq%h{)RA2i@HoA~$0ai%3rQ^eY3GWoJb4fpRcbN-!mJHGbNN1YuhcxX$T?7F=PvLGW|}&uh&G=Di};- zNrHKsoa~MXVQ@_nC=h@kLBKF!U@(XP4EARVv)#sjQs2XRt@dKIHjO4*CQL7~v8&om z^{Q&4)mlF?UKgaej$CjQxUUi!DWf?yOuG4-rOwumw#jrBKP!FuCWYeV5 z3bm@e*14wb1)UZJlm;zYL!@pv`?`rLV{mw8MyVgnlLC9#DxoAwaV@Bo(Lg$BV=@{2 z{J~@?&9E)0Wwd~oLr;281m1RGJs7jzcz%5(4Ns#@U(^fufT~YP4DHCGX{z1e$Y;qc zo< zk4?10MPSmLKD-vIL0<5nq{zXLp0zrbX{Ymu43^FmFnY~mHiVO`!U&3FuG-jf46K<` zYI&I1pjK4Q=Q#RD2)OG-MgE@&0ipj%2uQ(I?M> zexZW*0|+iPDTLrp6jH#63snkbR>`pf#8%ZzcMvzY_}Hz2V|N%~2?nB@Tq$Y0NvZ%& zwKWN@X3{2lz{Lqr(G3RnRg`Y32$(d}6L|q-@|b^mk8bgF5>R5w&F^2kg_LoQP%#vN z2w(XSVbbEtwwg|ecxNdzISZO^%3}oUdcBwE2zA1$3raCkvxI;oR%}6;Nyk^UKa^&n zyb!z=i==z7?`D97SCnMFLNp_tbm$#1T{qfbgBdm4V0*G}kO5y%eIkkXQBJPx$6YIc(jK_zg} z%5|q8@j-Op>Ge9cR^&BkY5}$Kl=;JWnrKa6Q7DkoNyBd$lH9g{L7`7MGb9pBWF_C3 zuvkh0Vz`x11tzvBV||4fvejuY&$TPc}AzcDnZ0pg8u|r5@(GnK+TCFiG+=G>3PeH;=FCgmPOUmk5CfojQJm${v`xFdiHZ~|B$_rPA5w#V|Oq7=`NgM7fT!%IzTm5ChgE}QNp zKeG@-xhh;#Gu68Zc@X>vgOHApi9vZivPAsN1LWY*2m7{xG|T4G+4p_6v12k$ zb}PHl44zg2vlV5egRh2btR4*I9BmbLOrhn-FkaQ7NJTNyhr^?U$x($qQcQx+s-$!* z1aZ^qC9c@?vS@y~w{R+x)LUf&E$qp0eEso{(q5BIXckT>rhI5Knk}M`I?%F;T3%RC zq#OX>j0O-Zei=Prln0qD$1IIe9vB}t%Qy*enDQ`hK2(&GLE=!}=@oO7F8&e2J(#9I zgm*)1{hcw0czPn=Xo%W-)!1L9h0ZRBumf?jSH*-lZy8+u%Ls5uZ2ZISW;`LMW|103 zN8>_&@S9@Xwm7g)JeaJdx<3KOTmBiO5wlOCphQ(hf8L|aq#|0KU-QTN)xpfrbtL+u zG<-1&-xOOABSfJ%Gk7A3xIvxw=XU$M`Me0SL!CJSChWHaYy&v*3;Ka1-!# zgnx_xwqQ3EJc_@K0OLv*rY|kV69X05n2Hsr`vu@~)p6a5nuwH_nKgTqISr`~Gs@Cg zYpKs~BOtpy<>}i9Aei$0jB9`X6LQ1;^5rtXxrd^3;YPOVT3JGgKz~U^!LSF*Q57rI z9#Z4Rx^Z*=Yrz*Ta-MES+r!Q7HJm!34Z4rtJCa4FGEgg@YXFLZ#T^rboy72GTHhNI zy?)c|*4(T@z+2?{Y)ZnV=a)Inw7=mDu=)>Ap0X zPzq7fWr(j+l~eb0-Ubr?HUjj&jR2(oU<4%mn-MTt-kWh3_J3&vp!^pjpoE66s>U7T z$kaQqeW&OH*;OXVGeYda_;pzG`6V5vG{R0HqSxg!4+*vtn*-VlQvjjW@mT2d9{E^M ze((?SVBsARNyYuWyZ-W**AxUFU-1Zf*}l*AraI&Mp!{(%5d!B7O~=C$@beLQn;3z> zMJx7o^qj@+2Ua1YX=Hvdj4l=+gKU}_On0KSCk@#|y3sk-!J6}={nY`qwRW$THN=F5 z9nQMllOeM=KGcQ@tv&FtB5vJBYDauOxGDNy$f&mVUSEEGJ z4{^jUKkLmz)T|r|`M~*djVE%j+=@R}E}s|N2r}=3?lxH^p#N?JELK}*gon5?dAshP zNj@JvN1x!$hwgk#j$Y}$%;vr-Npt;>D5{y~MH#;j&QESJ46Kd9zygZRQ<5C=7`yG# z_Iu91iI1R#E-z1J<^TtITPJ)#{6JplgrY}NMX5_d>UDtKCD>PGm3kRt(}#D9hl3s( zd=MgwZC;Ip3Ach*2U7i#Ja82BcY0YSA`(QQOhW-uuS$rM=VYbyQ>&Xg#+=49%IjP5 z!nqqegG)9MTCi>6SfiWmM2|z*8N{I-+wG`Y-v{ca z`-9v%WqJeh1R`g9oxG$sX+_BDXi2pY9dd-Mi7C)RrkXy7lpr<5fdm#ra9V>JT1~CJ z=?Z}Gu{~sa-Yoi(PLLJop|Tn%L?=N8(-_#$drjk?OsfsbKrqb{CM62xoT%Fr-XMl|dq%(|2C6ScsGJHegNJxANsK&7ELE&GAUTu8@*+G;pevT>{V?Y!`Rswm zm^jp4e3uO(CNLyUj(L)pP0w*-BVOZ-fDeC*DD{(MNQraQ3$0NGjOk%Rh+LVz6{)iV z5_fo*RP~F{ks1+HE+qVR_3@@qn^KjUIwBu_pxm!P_M)B98Mhj1;~Y-bK&bZd?m;Tv z8fKAawFz!;oIfKjucW*RYAKEG!EbdLxdwwG*@lBGx?8aD^o7$QK7Zid|3U&57=czyc_!rdF&oflfyq zMdXV2{+L^!vuV=7G%;gYjNda>6^GLT%>i061&>b+PN^m9c1jlg!+Z4@vEg>;sD_%@ zHnQB)C(mtU;^SCNKjF zUgY##24-^@I=#cP(91mnOoaf(2P>P1z$W%O{-c(AXY+nP4L7^J%H+e5-{@*5}^Ax5-|IJh6MP_B`8uEYr)=gP<0XeX(4a`h>re7 z^_c;}q<|F#s5^NnxL`{E8wn`H>e{j5%b&6AI;wnb*@+$Y1NjFDXc|Iu(`2cpB#B+Y zQ&3lz+h}X+aYOdF7J1AjPcy^+!T%{+k$b0AvSIZ+*g10Ye0Fc!g0RB6GJVjjX$|W3 zGD<$R#BTXtm4NXzw|i1)R0_b{mW){E8-od-t@0*}(mK0T+y#KYm4I@EB)lRv&Qw&M z=H+9vqXAYCKGpaHN6ur+h>Xw6toKtf-?!`g2UT}FJHu<5QX#@0hCRo~LNKr5pLU6Y z=oH0})D~t&>tMHNj6&mCen3G_aJ%qc$ruUv-qm@JVV&AUV3Ki@Q}cso6qqHnCS$yX zQ~iubh2_M_HO-fI%=$F`uEGnTFY`nzZzO}S?soDtn`Rok+qIXeUads~;xX76a~y$J zbLg^fW`|Mi=;JMFk{+m5x%r$5lP3}504b)JAf0x5>@m16(_z@Zi!s`SRcec~Tv*Nt z^HTBZq~?yQTc!MLkl17lri8RAt7m^@r!6RF-xZQLJPEoBG^;3GERi9{hYbu|W1X&A zhE|}U=pl99#dNM~0RSKq(Abh)IIuT!rib4E9x-nwZk0xt;1HlM* zkI9bhz2v0&$_Ph!ii)4{hmi`BZl)gAo1Q<`ObHhKlpl;N?kRQlUOfK_pvQ2D0k{s_D)qza=rf($R7$CAJtv;a* z^WTwxrT>Nm)VE(AyxFaA?9rk~7T{SgC|LZB1f1TM{El!|b+~0rWv3{FRO%<*3IdS4 z1Z7U0Stc4<<3jviY%ZG$1A!HS8(Zuy!`Ck^*Y$(Qa1 zcG_&b19mVCs?0lf+WxZmX#reQ|BL_hi+QzW+Qt$`ErAIb%?p!;l&CGuD3gj&Eyl&Q zah#9mK0tJK5=oQ~5t3nf6X;2O#2`s+ z(obkCM1UC;-PPL_rIX7Iq6ARsP=Sx^+krBkd8bO8NNoh8E6flkl+l2MiFUjs1G{i)eC{hDWr(z==l&ZpM0I&JKY#wU0i zGA4^GEGr}nU}%H-Q(C6ynVnV%p6&qIG!cfv@2W~!2|FmpK_63$UsgAE=nAVWFA|yc zLd)tNhbey>0clM7ppqx8-inhQVyqL}nNOLfY(qwIADFO3RQqJ8DxIj~pxQozSehN+4a1pMtne*8Vy_NY)h3N4 zg3~zv10w*ZY?7}#f0LT!OPcXwVXNPX~1dJH~1xOMQK`5KXH5S$h%Aw=tDxgIn%~Me1_+<(U zm-Ga}vzBj61c^;hp?J!GSZkw;=#Wc6XYjykdf2ChPb8)4y|<)}Ah*v^Dt*`NQ&=}W zjKfbELT*ihDZBCi&IoWMP7sX71BQFsLU7v~9P0<}V-JxFPSQ+33;_Om^{ah{oUVU# zy>t7Q5#S2>$8jl~yY1S;AHsLM5)~G694k6Q;zH;jBOo#U`{n)czZn4%o_a2_I2X4q z$6aLk*Og$4^j977p!i?k@{qt}IyOPiyh0w70pgx!FmckiGNhp0AZ1ZzOvKEXcDCop zPq`#Gq*}H#SjeJAfgAE#5(R2^l~HLzR)r=OWR^J*Ns$IteC|3$m!t}KiWZVbo~#me zWfN$pf5GaCZ(GKIk|Y-#H_^|4sDu?wsb8Bi!-mv~^=8Y^XqSa@+h{Tq40&@^0O+#c zE77e;@{(a800^-r4+-d#^aq{Mn|t9{DU;*^^xcc|0~hvLF%}gnGZN+#_}9S0obB!J z9wk%uE&#hQ)hzru^UQg$#oCrYTr|di( zKOdaOhf7fZsz=lH136b;9z6LclEVquNRe)S?W~OPHKV_X0G(rYUeOZ*z6Ge|4jj^h zyFb0XyASgqQjSla>dU|>w%yvD#BKL;)a$oTSNo^5gPr#T^8a1ZU; zmBk_fx0(s>@dpc#VCE4vQt8uh$`_mQsTaq(k-d)?*ZJ|`wt!{%@6WWCw^1ClPh#S^$V zn5ef|=BPUKYs#yE4#z0fam~NH8W~=)H9Vv0M!&==+zp5aUoKXyzc5!pMC^W7MXT5R z@h=v!*Qc+f&BI%Tw`4-{VR}mmCFI9Yaw4b{Ixp)gNT-RRnhc_aV`!fEGZ4G?z04f} zK)4wDdcN}lhS1IZAPWy!H`?SbNnPbwPC=E5IxECelm4vB0TbV4o^bF&@iR9 z#l@EoSOR_~xI#Qq!=oLSp&$2(v&5jp3PVs%O4=YRLV7vcGf>}{;j^9p1z$KZqoqj za|ej7#C2K)m9D9tAO#=a)hv#l_qEdFxy9?>%;y{kRN;SzU}hJOgdb%1{`PS28l)S!NDq1587-n=y4o%{F0#e?kccvp@rnLUfl>{ zvb@SDW9S!l*|s3HK-i$NnClL=EszaXJotIT4^Wm&BNY+CasIKDtTOm$t}Z020QreLdB*6vjYuikzRE1u2<~eAF`sdETPc7f8XDKG_ng#rLO?HFr&Q${ zkZtk~Nej@y7|93>uN%F$GcgP#XT}SNi4w@aR|DJ7tcdByJ|YweT#xko^g{5D zX*XWQQU8HTO49GvilAp9$fhlUKjB1qEfJakd3SNOBr!s;N*gH8Me#lll=#SotPZwE zFcbYDVxstBuW5k4=OB6LO>*~NN%L-TpTfAIJa~hI^MMOztF~B>lI)$oByP*%k#a%I zfcK5_vJ^+>>2x~JD>~O1^IZt?i5+K4@?KlzK9(aqlG-NIb*GgIrFDgjs=8C*q?7Ll zWL^z?npD#3SoK8DN=u_S(w4im^sIjs*V2knb81tz_xSaIZIv6w!d2_e^Ra<6wku`P zK+~;Zr*CDBGJK*86@}a zGL?%krPd|}-e7Cyspq-Xr}HO}Pu7Z+^^Adyx*XwlQ>kL$8g}b@xHiuJWZn^Ed#L~* zr;29h5iscJc`$dHE1cEXycZ=Osv#)pJ*~kXBO9|ri9KRoNofS$uZ=~!v9FT!eXMV| z1FmnrOOihXtYGC{(@B0%2!TaiNDD|OO!9ejT7A{ZNw(D-DF^08_L=2xtF^|Qemty= z51*2MfuES7&B1xOBtr#$V&Z!sMHR{Q5eTGJk|GvOE5dRL)y-mai^^pYsS&2szl3!c z08HXt1n>7C`<^fc5}YSuBVe$X5?6XzC%>p2T0Gg~EWU7I2Sv978j>^0CSMfMm-=Kb z&>I&8kWYWzpj>D8PUXKMKD6LE`+VFN!lA|g{(x`2(3jV=+5n*CTHqK7c0VBmRdd;Wh%gLew-I|(R zuab|!{;I+xOqF0l##5KBre)FqN%Od>odk>5>=C?W?oEK#$e3lJpvQKT-E%)a9^pCa z%Ajw>1dUt3Sn%0cL+e&qowS)d$9K6tVsFK?rRdd2ci4s1*n{-k!su|EO@Rv#enP}G zkA8vMpwq%Ywovgy&jP|yVvPBG?Dmd7hH#VE7$mmUc$p3yQ?xTlxV~^27%&!VUYSO$ z4U;FM0DuaaC7=Opm-&$vBAx~&_#xLS$P+P=Eg^9KOm0Gt9@<_Jd# zz>ljb2x9gSN%Pgfhbh%{Z2|aqV~Yz8$8F(N&6z*n4WQzggxFBC`}2&h?C$YIHZ{*XR-D@8j+m zR=#05GMzJy4IL^xfF=CY`sdvUuTNeQdrVgVkrJvh1Clv1-f6G<5y#1MlyshaRnM5Ftj=6yooQZnoK>eikXicBTSj-mX3m1l~5?@l#qj@!BIKJ78 zL-YkUB@puA)dybsn%sg9`0*#;y_VqcI}d+Pi5Dp4!5riUFnmy#J>#q*JmO1Y?dMOC zFEkJYS+?be)h0ITK2b-{!bH%(Ig&bdONzOc7K14YVr8Oom~^NFb5bM6<-_u(Lmsj| za>ejQ5XTlqcBtWJGAbz~^O3q+x>Ug>*(+)86+{T_5rin`so`X2Qd3T6nkFh4a3wI# z@1c|yB2z+%w5E^<1ES=`8>w6sNBAvHJ*xnM4@Vg1bPwF5Yocf{cswlYz`PSmrw;8b z)a3ld^g!1wn*-$oc4^9l7E`Jeg)#%*nlbK*HhQ&B=hCwC%O5tamSQLdWlD}bFYcN# zfKAK$wOXXNK1Bn<6`jB)Bb#+rg=m*X5FutXMg_Q(_H~AIMHn3kzpc1;a=WHd4qcxRGZ?AnlERC|4f6arA2Z$~AJMG8=H} zy0RHwZPxl3QP0Q5HtDVpjjcDi)+<&9h<1Th!fS1!UCiE|2xVYfE( zGj?%rLu;2pw!Zc%cJV<=W9Jr3bwi5QTGX$|=!~H3Ml+&bsK)w^II9c^K1CqPmT97|$CUqb%YB5-m32X@JiRM9k;dm%^Y_Bp$ncw{i7O$ypps zQR03a>R4^&)oH$VJus@Vx6*AsLA&l!EQ$*XJhPN#vEZ_iV zVVq1(seECKll!LH%oF>q+Kl6fu0z>}w^W8PjvhY=lyUUL7>5a)e@vQjbnlCaaooQ# z$~rutKK+1k@Oc(Cpvoc}ZA~NNfc?2f2IEA+6T&c-sIBX3jI*PpT*lF@gdE1vEw%Lh zZBcncZ7M=U0s`W6!wGbm{SdI4^2{=Id?%t(_L^?LIRr{4NBBsHFB=@tXoQkGg9{J_ zX0fJRhLtvF$6r7&q`rm5qRYF0#^V9BEN)z3x_C6ovz|?HidhsivLdmTb+y?z$9}iv z{n_Hg|623wdUZRS-S@>#M!OA`<@)|jJJ1T}OrByNo5#8keQlHF8n@8p09F(b(97Nx zwq1IYww70QjTQ~5)jI@U2UsP!z0obwQt*T}es|deyPZ#U!PbZEaZ?PDzqPF~L)Dbk zVycu zVMc%=1Pbf-W#*-yhv+IOA!|bTv3myS@U%X9{be6VN#BMg;|XkR6G{!Qf%D36pjmv7 z%JGJxILcAxZIn~dApf{Mr-^uWMPv1>XCfNviI7RkPK)_m8m@QS^;Y6ex9!sPwzeCv z`W-e}Q_;}Gr7IP#aG8-7%^+M;p!e{4$>6yxgcX5-jE6w9J3+zdo0?s$VHJ>!_ukB?zi%u>6ZOS5^F8VX=Vm7|T>D_RGE$?(vgL zPzmg`Mer(czs^YW)uO(C+Sz4A=`662y+2y6q5?xqQq9EwED~iO>jPv)G^x!#rPiZFla+d~A2w zk~l52E2FIK!ekV*qgbb@Wfs^s-R%SQUv43~G5q>b3RPlEZz13iw9~1;yDWd_{qp9< zkb7tACmZ7j2P|JU0!qAGPFTp+K|7m*U=JmT=IVJ3rp@;6vp!kH^t|+YAk4x^I6!m!RU^4Yw4<@)7#*W0RI~g_eVs{kxqDWC$FKQPoUl+1a{pXGsUbTCK3@Irv&8pH-VH{GSj;L-#4JpdCdq`t9YNoYNr(^bb!l?Gp*K9xoFII&2Gus2u9I3M) zuV|+E(d@3SfK#uZmHF*f_Z-8X>a=K?ZRN;IXj>i#irz`zxj+8!+}bAOgx32Wb8 zxl#drb~BBGF@hD`hD{9ja5Or}>SVFc6{Lwo*C@KUB=8i&a3@7}b;_KE4MB|18n6-jV6soW@ z7@dH(T3)Y?pxcD+3$#a+TJ+~qh)q8oiMi=TIzbm>uNL2<>HaJJ^YqfOi3R*FilIPc z3bJS4uO%KlmObK+3#MeM-bVn`G{pxr!LOHbUYzOBdhXVF4s<=WKdRWDR#<5W_K zgd_0b-b%k$wq8`aZmzBuV|S-7*y(>Z?Y~iV2GG*RE0>cfkNb!HB9c|a0h;xS+nEH7C zxpn@(r}@A6XPOIg_{#lGLU_H=YfE2H1YEW{dznOKKKvWy#q_e2I&9ct%OC_iuNIXy zcp>yUA0~aD^ggS2qub1#D9bC#Lg9)g^wWt4YDh2F)C_d2yxT;asMhP1bhh+CMrG z!0X=G*|Oi{entbC&{b;v?wAZHBAwdxIUvX`$aLE(Q@}WJU{LTs5iA~ioKi>w_+1cu zH++pOz13jHXzwm8*C`A+Ly`qrR{0%vdVdWpE?^3Cc{};s9*|-I=i9wb!2+gH;q73_ zW@Z6H#u^;2&^X;W;s}N3ztV=#Kk{daz(pl&=gkq-Ep|S9+BqW^m^?WgF zRjO)D)6k}STTf(ebSWn~v< z#3-jcHuja;3czD}tc7&NHZCMmAd1Y+rj?#IOd8J&kpK;puDD^Bm(uD6LeHBvpk;ba zUzsM@ZiT&A8CvwS{(zl6<-Vu(uaTkDTdRGz-9nNP#tXu3b$uV(Ao~rjSqtW7QsGWQc0 ztHE-Nhz*kex=6MIuR?m{b@jr)|RW4VKx>G z<*Y`~6|7U0ms7$Iv?wvCN&d$C)+c0giAZGba!O@gp^bbURe~Ddq9&#q!+kPR5s@@5 zV>0uzB-8ypg3{L1?iaZf^L0FZ8k)Z5&$1g#s1&PixW_Hj)9vuQ1L3^|y00tp>e*LP zz!u)+pE8B>IR|?FS0#dEXjY~I^@+flov&qc}#h<6)^?|-IQSaNk|EUHW+EY7)vm*NU5YXdFgtG z-s8(B|1Pq&`fcA~Yt!N^lO(w@f28p5P8tacb|VLKf9M-q2P1#egGiKVF6_3)lE-8; zBFLH(?o7~UmZj+_ZM^jTtx5RBmX>+g!+D|i6@=X8Wp1bX_c@h$w`G*R45aM4{xK62 zDz+Z1ND2dR-Ih^X+X1m(ghOzi4n^eHKbv0NmQe%;_7Dxqp$a`KX@6^Ll{ePzgXUzD zF0&Qt&-Ko`lobbcUupj^jr_&DShezpQ|XlQ9Gh~962WZs;=4R}wTcG|*h$s5CM!3Z z&-Fr))NkDm=@p*Km1S4$`4)^$dD_l3`OOLcumukmxX?Av#XKW3tI2^y#fyquW?(M` z73qf3EQO}mmK9&^4a8MH?P1@y@D&s~*1=tjg+?jCRiXB2e~8F*oo%!#BI7jkW;bFN zE-N=p=BDeB!0#<28#1C|LENeNVX=^;!r%Bdd6v|g2Wvu9!~S0wAITW{fv;Z&iDgvG zAzPW*%a(-qAAl+o!_)wW3e{PWS}oOt5mM#Kz4AYc-C2QbXUA6c(&@)CbJui)e#Zp7 zGUV)!O$J?Ugwzn{a1+EKvgR7vX#r4ion~6WoUDP15N4q3$ME`zLXp85bmMaxj6cQ- zqUS~_;9!jwcI05!Qb-u=Vs}&K;Fgevqh!ws15Y3?jRU*Jk0JCTT0kk|W5gACi40Am(}#y#)k2(KkN0l&T6PG|(G|5*m>0<{o^c8cDCf6(B9 zU9F_-s7^65Xu6|`S1kJXL?HH9(s?)7 z<;^c*el-Ptlo;V&Ar<|_fu@v9w0@Hk8oos)15HF2TcZar5Y$Nx-uupWN+GpVIxQx6 zahv&bBx!TyoOrPh(8>n%Jr_$2{A)74b(noxl5W^`HrI zYRMp4Ym9N+It7mmSo`zK?u~#HIyOu*v3PS{DpDZc+0vG)iSZU5w;r1^>x?j+ad$0_ z(Q9_EHvoHg=n(Me?F!vBhVT=c4s_aDFag~5F;3^0Qz8ZmL7fzWv2^o1`stiP_uuV#KiO6%_$c8hR;%4)gU!n% zs)&je;)Qo>k6)^NbrJ!-(kqWFH`VGXzTfk~=a!=Fs+H&>ysFk$Y-jRT92>=~W(q(7 ztGd)H3^o0U%HYfMvC8y2#i-;-N(~^G(!ihu*{X8XLK;u!?FuyTZVgMCBTDajlxiOp1+$G5L%YgvjeF5*+i9WMN*3Dw^7YdC@J;! zTFR+6@>;|^RqnETX7sAKPrr54TEcd_>l&qte&zHVxI*!C8iDSw8n=GQEip<*)Fungx~V;bv-;T3k;cp+R;9p zHwxKSDWoHQv^@T6EU-)c?65twPv=Z`=Jx^@G$GuiS1f{v%bIb)JeUNp6qB9O~Mt&Wh_p8&!98HRmxY{1=wn4B6EON>_6nXu9l@X&Pt)qyE-wQl(ro3rI&T5B z(^$-zP)K&Sm)0Z1sWYIZ#Ml*up0nk(!Xw8hpG0mcgfGj>u#&!q<;94sna)1 zf*8!w5LEopH*Y9)M{+DU=aYD~UIK=rVf8!OigX3Yeyc4dp6KPqbZwWIVHlb=chbAl) zt#sl2Z`ukW@{*nyZ5m(63w;nNR|^{N@XY~S^^Eb35{WldC4e<&8g$R@KUvvsrRDaHEj0< zy6|ROA;@IEM~0}kM0v2S&7eD;R66?@AVu>=*OYGwb~@&jyQh4%)vv%7$6qHOPz2bF zTi8%XYa7lf)aySsYx6WSSqsd3t7U%`@Sx!QgL5QJnFNpERa*{6@@!oRR1C&akF&@( zDwz$2sdw){m@J1UMqln)D7>)-mk-{s{cFa1La|WHs~rgRDo` zO=J}%pF6VTxB)$RVAa=`jKddQ#?mcBy?D+;`o@7Qv4MINQBENIwV1I62q{1PyD7)P z^mXhJs_fp!KbBi}swmSj5x^ocxGOE^LS@8AuV!81?A6!ei1Q6rMWDl5G70vjx-)mL zd%rRHzKgYA!Ly)$fN2UGDaaLN5^&ay+Tyx6Zd)AqSWi5U*da=P`xN%tY`5D=JJ>XO z63t%5vVU~f?oo_L4M0$FUL5_Eq>Wu7ETa(!t0v(^ zLZid8Uf$zcULgr~lydGv>2ag5me|t z7KUg@^9a?A<855-rvctF^w5}avvAx+B_`jmqSJN1>_Qax0f-9 z;xLAXYqj#^QR7Raoz++!EVB}HUwp#nznTgL8B!j@U5RXAq>0r_OIp9ij4{19O-i!q z#{O``vbPS=Gn@iHN34q2S6kpZ3P3C{MOqyoCb~R?_e-uTVJOyUXH}c#qy$wE zN!y7)H>jxj$@EZD?+&Us5hlF+oxRSXI;I9SGkW;YCc1YBwA$Iwj%-0W;cM!-db&`l zLh8A&<4R*1K4S6F#EGt~#y-{2?%}1;YN+6L{4C4cf@MvKhj`DbV<;MbTEt+3Q0y!Z zWNHX}0rS`M{DRqP-WlJ8&8fs$*O5wH|8oI$xp zy>YuY__YyVywp7$#8H;`0RRWl3Yd8Y2FVw^FWUnwqb|*^7dwKdEBXYn-j@r5O@t4$ zwOBKPuc_ZT#?(D+NfQ%Q*|yle&l}FnHe$_**oZ7bR3Ao%I5@l}>Nip0iBwdsAi4gs z2AcoHm}CU`9S!{M;?ES)1+$$9`Bo6qRD-x;cDTA-4;t_iWI#r`9~pc)eGd4(^uIqY z2cs`i%$cM7@2o!S+K~ABIwPv0iK0_paF}AhCM^uG&LHc`M3mx8+e6fi@DZ~&k!yL@ zW$BGy$?AJc+ zuf^XMWHOD?*j)rm4ser1`d&@zpifyj4G=7mtI`@PxRfw2bf zbMw!tqQ_sqeki1J!vG2>1oY03i*sj;<2d9;Y$YKLY|ImObd@?fxkQFh{*UN+2FKXG z$6)VEP0+j09f^>D*M!=)%;Zb3F|}vKATibOj0+u~FX%^>0*4!~Lbe*_8kW_#ym6UL zlrLj7@vL(-%}h<%&=mn}P0%y7jM-}0hhpLivqY8XgHjA}Ig0$&hzxZKed|2Si4c4` zWf)qiyOHlo=>)oDatE)-V$LK~@QDcR1jHIyj^Z8`1gb|>ZS?JQVY&6k9QGr<9p+?c zWV18=9R0<eIssd#1L($vO@IpNSv~RLmWQRXE>gyHz z>43_4Vo2nZ1J$rt0=YC5qEx!T?ZEmf zp-6}XyBF=wjZLA7u59xr0>s_uHfl@hmbIGn_6JMcA4kdv4MU%Sw!m*qxBKLpB(jb0jcl68h!8V_+)ww)(s^i zrW)b6S>CI8=D3X&Q`GSbLA6_6ub-*sFR-+Z8 zC`uC$4-eeeIr*oT&b1?+q|5V-)+j^5;tT?9rA6#s-yN?XJJfbB==A;X;VgMiobVK4 zIpxIBXG4hy$5^haDuMMa`nGjO{1{w`tkVm9sVn37NUeb0EWyylWjEwq@f*PD<(%Y` z%Y2SMgEfqG{nqgj{2?~W*j_58KO%9VQZWPY_K#dY-=zFZI^w6+=7e;Oee{0Yb&8-| zN3wiQU-LKe`Jx|aC+mBF<7qgsMc2G+d`gGrk}ak+T9^|k15cM(B8n5^2&Ufq>8PhG{3%Gn%~TJXPeD?YcEj+*vjk@onLOV3yo}-X zMk|&{8V6})!S@ZM=5o^!f-^}F3IMAP>~5pcCF1FLhypKrS^g4y`rRr}^NtN}CmgZj zXEf^j76RbLSpq4N8pQFn622PTvWtjUU;>LHLvz5&On@-bJjvuQ6d(!Nj$S*lWI8gW zuJAFXa@P9Rg2K5D%RtEw3A-jzPeST$7SF6X<@o9hY10qNa&pYBS^cUg?m$xdZ+I?~ z|Lyk|4LJxqCc^(qI2cDWb&0g$Q^M>N)3qNd|5DPp~N8 z-b2k&Mb<6B9GKE_s57Z;@NX!z2uRdJ{Vf-y!(j~U7WK+Ia{u&yN+C#$n<^x&wiLq^kcF0m%5!G~uzz`}JT_jHei6jtPq@m-UyXHYpl zSb!q}-6uo$#^=Z(WXh=A2wKh~Z!r zinx*ReUof~w6BlLj3FRVtKR#ED)FDQVB+y&$l7`UYfLcoJ%wZUtSCH*EKL zD0G(_fS|N6d3amlH2q;L*Y!05keDHYicy9gBuy=-7c+d2i(w*WGRtXjvw#;BXsAYi z(Wv=ygzn(Z;QIP)+oOssKpdBkS_A%2${u=g)`1A%9&_o$q~c^}>_w8~?P__wCgIq( zw_PuWSv4w@7|GJyhUTc69wm++TJm7E-!G!gpE9;vC>UC5Q^3&&GRVCt`?EBC9NU>a z3Au-3F4wS5M@31O`e%g}Pi>71p&AN$1_(m4f&d6 zdt~aU01NBUfPz~-ct#~rTGB@ zhg^hT&n(A&5IjsW(C82n10)x*6wT9y+Wn&4UaZNSV0eBX{o-%*aon+7&N%+7bhW<7 z!slpYj5hX_5k}$Vt;lo1ohpB4f~J>ck3eXj$k+M>roWs_rm$9wWC8ut#BG;{vj(b~ zbX?5AGDQVufbBuqzk^%%p_qBW!ujcxm><}=yB_g*CDYm8KpTgqqSLx*Wr9D24LKn+ z)cE6VvOwCz-p@C$92Z0hg4`He?X-?%9gu_R+-ZagA@DRwJR3fLbmiuS zxjuTQtrsKWY8m?xDwDa5td zhGkVf0Z@KJHu`>GF$Z_`cs|pi*D>yV8{aPCkJOp{%#Y0BJT7zJEjRf6&0dFpvvt;l zcx=~7TNUc!ia)w|Y-du+c3!Jp0SA$zhkSxjZXvY@mO<3qYiyaY&X>I2x3}GaG30R- zPzjOK!WII7YwG7tQ#WQtriA;;j6EitdO6^vl z`QP3Kk8~kwZtpc9uOB0IV4`;T_WxzuRjo};{S!@AFzjfK}HTJd_K_+b1+0Dd5n&9I|`he7QD4qL!HT#_7tcClR@4JTh zn1G>rh)`e?-JBd|t6&urME1y~l1fCuK&NRDEkA!C%jZk-av@u6OC%7-$KI!(Z9}bJ zE66VoOF1LnWyAz8uaIQYQ)(NaH*DU_umIV3dnLP23RZ@cQ5JBmAX8~Vx`i7}Q*7sx zc_N?p0zrdav#LquBjNk-aSE&RK*V!Xuju+6VLe2g5bkNHw4vzY>P5Av>2&psT79 zU09Gxy0sj?_@7OieZoc%m2jn|sIH8poas+reA!{vL?4kgpdSS#9It?9n7*UTe&J0l zacejP*|(vtbtZcW13jK)1LLHYL7xZIUhRHP%x7TaG;dF>-?hEj6>(z(L{Vwv0O{ES za;|$QxM-PMl(IwA_?V+8O5faJdCw#v z|91HU(=-Cm@OrGnxNdkl3w=!m-P!z%0=Z$|Q}#VS{7nHX-X{NA;ZH65_3y=>yu8gqSw@Fn#TxOKnFNH}_Q z(T>ay5wnxl*BQu3`M25$(p1=d!vz=f0zNl&hacULw#%0$|6|G5bF4Gvx#(Bj?&RkR z&^$-cI?=ctb2j_vwZFew)VjpRBz!CI`QWbhen=E|QZme%vcT(x11dd#NvU?~5W1xsp)Z>{=DOfykghPyc$< zQPK;;v8)$Sxm*C6F)ue)m)gayyt;wrM^DEEMc`Atynbd^j6K(=kix_Vl^FW`8`s@n z1k}n+-=9;+O)oS}2({A?_kj8LS#0mjaIhrwDG;=}+kNmn>OMRjVYvVpT z#d!utVCYOSmt*W`PB*ghBXrNkNsR42LL5jlFZAl4*QsYYOwev&M*XXo8Hx?LMDWgn zU`M`cR^X0^dX{t*+?J$Ln}(l&xmpTy#o4`AWX)phX)}syY?4C5{u(zK??kIQEe{BV zp@G`aAW1`dHuZjN@L|@;qFXggoWIbYY)=_$Oh$EtRZOqT5X;%FfA;_Mid=85>tmvz z?@p@(XN-;4z%M7%C>7q`9&HbEb?)z!H9~221W1jNQ!tf}Z2;pF^!ls|9oXZ@B{Qc+ zd1}e?NVG+9JA|wTdgyH2V*j~rUKfPh+`zr_GM(36h0_6pR~N6E502#n`E?H2h+VqI zBO%yNgLs6mjJ&BCz}w(K#nfw+OhFb1Wj&B0rZv3!m}y-v#@8(WGk4q3(`5sKaSh zXHm6L-I5D0!Aqi4S6h-B^t@(LHkRF%IG&KCw7F^Dr{idI(>?yazj6^66jGXcX~(8= z|)4%E;uLt!; z=^)2Uc2|cEWJ9DpJv~1uFXEDJLm3g3RSuE?yVk0u1#{mG3$-zS2Y*XIOKbUIAE(Q5 z@=AV|xe{;jB*)-wMVJ)2yhW$YQfpqAA3|`kN%wT4xCwus!V2N5Ur|Uz|M1#L=VuUm;3|dciC_ z&6;(UEP3V<1zEFm|01nR=t&4y4@SE{*rgp1JMgd5nor1mPBjolE#E;dth|;}PH)NlFqSS*|2lQmMtVVZ$H%x`!G4L@e2BV)|UWd?? zf2PvqMAM7>v*3~ew2CS6Ga=r>Nu9_jCrXA^a2@myMMjQX4l(DQt9y2sbQp&>BtjLi)6UxZo#fuH;$)LA~<+6h&9f14jdkg;${bc zD8N}j0bD&ZMy9?^pWZwzz5WD2#f!BT`1dBFlsbc2YZ=BeIbLv;I{8x;4*{0LAfkUm z?qllo#m)A7BuDz>@9D@`lbZ-KTD-AmS=nookajhbP!V$y(QhD#jIR!OKWT5yuytG6 z>A5ec4!RVUYZ;dd5Dyp$Q!XKvB* zz2#LkB8jNYG%B!Jbvugo1>B*9*5^I9|I0>a1F~Bm^teoz&aI$!}d82$wl#;CY zq;-)Ff>{oW)N?YH2;?)jI24jLI3nIgE-UlNQKgbqqJnAZd`2+ET{D&dFwO5VOlxN- zH5pWdJI5G`KO4O$5Gl08KA4jyx3l5IAoWSj*+O7YP!p+97uMzf5`cwDeBiHKXn2y# zZBfZbD;gM6TV^;NQ{mO9tBoY4-2oX2818TQC)B09%^O}0#8j20~Ver)uI>!bd9y!s1g&g@>5N7njQ!ZkidwTD=2T`}R2FN2^!oR=39h@v@v;sV#o}Cl;Sfey6r}$n;VKX_wiE?%8-qL?<}!Ld`kVL1 z)(#5(dZXFm9+FfNA79YF-4!OJctqrPk55o_*^yIY+Sr)nZ)t7$?^s1=B)6r?22ERj zew)=(O`h_{gZ6y~vCnUb$lloU&FXgxP450w@6-`m_4mGoJ)#}5jUkD784NLWv*$nr zCQNZ$$Tw#3iUV4Bx;se^4K#R!J^UyI@VmlaSYXsLY;Y8#gDD{6uEOT59lDE=&m+dS zM=yr84zk*frbc1bsSF!{o7ka2d|9%hNP%VeZ;E&R!A9KLhV>amN)f-TpD~84TOt1o z99?Rla~h2$zP|TJ0dZx-WZI;PqnqX!wSbD)ez1x&7kQREm*h|4HP*==h;W2iM)X}+ zcDS*zD*O(Lsg&fpz2rQwztUO2ImqtnascxthtPiG_ABk7wnkI}*i&*X0|f~6TZ963 z8c|~;v5t(nd4uC$CaIoM#f6^wKS+=QLQ5b$f*fdrn<9y-97~j4VE%B)46*_W;S9zB zIMgL66#%q?Q%%TgtjHOW(Q`C;@uVo##lV_4lvphSh|Vv(bv>Fm+bL1n#V2s%z}*w_ z!WJTo$dcn0O$vDAPVER=Oc1Ltff^QQx;)QxlQG{n)1QYtJlOTI^a{{IJvQJHJptFp zsnG92Cjmhv8Txsp;aF!5J)ej*qib9S>XZ#>SgxUuCH!tVLAVdnhLk>tCCS#J>hr6(ts8%PV) z-i4CC7k1ZHHu}dwn5?xWUI}%c^G9A8(+GSznLJwkZiPRmjeOAa&Yi|rAX#wrymbXV z!o3(&c27(C^4!Z#Z(lCf2$rx!JXy9)?g5u7Ko&zHlu1=01UnX|M=pPkXfycF2)f!l zA5jx$1{#@&Nl|B9{2!A?#(n~C=`pQY-0`VVGdZJQcR22+#@V?-@P|#e%HA z5t2WpI#Y9wb~35q4l_ee4u{b|znRhEE?`Gi-s(+^&ZY5;za@MW&W7-e?dv&rE*|Wl z#)!l0$O9bUU_sg^Oi2jvL!v$TvLd{v$>u=b0jkWbiF1Iz_wU>5aQ!bo8;n&;TLTP< z!oOo~f;PNIy<4fKDpL&kOP#H&FMxUpN{3fVuK-xYf7e&}pB=L&ibxHt8%o8_3aHqZjQ%5i~A`>ub{ z%XqXHP)%&vj+c!qEw{qAa}PTGj24j=ROg0s6aRu66@rsUe&DjJ+{9F(=;=MgxAkZ6 z$R$+Y4f*`h%`3Csms{^#n4o4_VYP**O!WD&m9|i~aC1a+V8lrc$B=GaygC{Wvnn#j z9ap48G1_dEl}E7)7QeDFnf)qNYSb#$wqp|^T`RweHsxKCA1kfl@V2c?`1elvOXbx1 zT41HIkR8vpmFrFZy?unRlJcja{dPpgHOy4-HdD^naSD9Zw@5=P%q6h2xtFkU5AmhU zHyImhgNbpv1*hwBKM^lUaU(6a;ZcLbN$^}p(Y)EyEc_(-S>n}2K^?ia)VO(ks_t?g$K?0F_hOx zgH9)RSn&3ZT0xu*Qv}~J>oMYE2p`|jl~i4lz(opm+SDA(T9Uu2dAD~i5`jETd&VlYkepQ)GZ+-ib6m*? z)NV-#F&d&&+?i~$6<>_VU)|7@geu%TTWXhdp4YE1aB5HbEWJFUQ&GbFq6dwhTgvAr zYg;=&IQyc}aii5aSd#M0+nnO9Qc&7M_iI5r0l>@KYn!JzGcFJew|Q?%vJhmsXXk z&|TLl(pUUSa`SRm5*_Za*-90=lRq)FN?>YB9h}hp#5wgMq4_gw+O4hE^CjII=nIVs z;1a0v)dT7aNT-@vt&&foPQxdL8}p&npC92`QdWded3LV^Qr{@IKACpO%wdYU*)U*k zYfM|~O4?cvud@*%kg6?Me^qa=n@Imi>FH8C!HsYI{38jL;5NtDZp$6Gc3G04B>3*M z1SdG1NC!k(JYG;zGm()@S(C?VLQ~KSJ9y&&2OdYei$1v%ZK_s|O`$C#sRT{F&V-`eqZxoUr}4O3R31a=pWp1ojnKW0Qkf&~fL^ zLnMY1`Xy=nFW4be;w>fe-PbZJ7s4aRYQ6?75awfcDtgD5^7<#Qa#HsKDE8B8V=%lKmj20y-j;6Wfgo)eoP5 zXP;|vdnQ6Z2Tv|5pi37xI5}=#FbRf~B*+iZ6F)HdHg@SJPTuJG(90`$wZo4e%?w@) zr07Bv6zdX^@YhK6WJ}mPsPg@0ViB3W?|MC*(klusNb@)uHR5&R#w#BiJee^+eV$UD z@^*CmxF193i?TgXq@sN-Als$#wMYECOtI?d8Q?}k1&V+=LcN3W5nEQ2MM$gIN9r?+ zPT>+~u_r&YiZ`x$hkYIy9%v>5$wXEE?AR}&MgwQ1d;@fIO8o}PgJE~$5zfM*1}Vfr zr*Da5X;G&#MDN5=Ogf_;M74-aeoo|u!W{Z+$b?9%f-FKu_QH`&Pu{7LUVtwlU`7~)#XvA{c+46q zk;i1IhDwQHkscXDR8zr6T|sEFfV}ra_8#V?$NJU9!ZL$U_#dsSdyyf8)I@#?TX(20 zuXm|{%Rb@KFP=_Sni0E9vUFmm=n837k%u_mD5VmLQc7lJ0s245SoBwv46Ko+{)^U? zsJUr(r&n#ePxFLDy)Uw%<2@@Gn+sn)+%tU1nr)H+NmPKKWa`qMV_|nHPh^wDDW)l1 zF#fFDKvs0|%&6c}>E!-C)uz`))5qO~6WPS>N;8;uqj-h?wYn;*xau=RjWTi8qyKk( zVH7nW+0xZDB}ep*18rg!E0R+q+8xL-Pskhl9I#6Ijk+}I& zhjcE+aHtWYrGRo7qL!1FLm(rDw?VU(aZ_-$k9Z*kUN^$s&{};k`%|XYBs(X@i2*07Z5d_u2$RMzl zX{0P!9dU4`9*AIRKYhv*;P``pUU;rUN_#uoQ(rjtpdg=uIfLv7<2zY01K>?) zzuN=K$YwADPmB^`5tu-cfpZEwOoaVyo{I)NFxKh1gpvj&;WvVi@NAqKX^ms7e1^fC zVvH=nZGD|-A7QNN1LLd=w)Nr?5<+Y+8M7%eg&zemMA9U3%1&@GTjTQ;O0nSo3^XvKE(Z-u)a{GoHxy*koZs2k#$gJ0(J;%lVJCdh6a zH(%!K!eAKtsU9J;OB)jZaDabp@PdG7D3PY!w!tN+mw7SuwJ}kPGolYC+v?u27lJ%7 zTQ5UWfpVCt_a>7}G=oW)Diy_0)bQaJ>29#t5rj+HnNr=CjmAR(ajH1{SyM=r9^eh{mBD^WJ7*CM8d5?w z&=w8vTUI+ordZSIgEZpc^$iRXf|v0T>-CnkpjlL&F&b*M*RnTWon+hC=0Ljd`_3uB znMFBB;Wu{2j~N1F1QnYq4bDIxLldg%Pd(u)bxrcZGC)6DB`+a@eIbZFF#urOQZN&DF8IT> zA0EDVi67m<0X(D+QV1fGKJYRWCibaLP{^DdQA8*v53YAj&h+U}X4@2enUCsPDlPE( z4lKIa1pg}*^#%L9MnccnV^ZpBT>jGXje8rF+sfU_hYbFNW({eaM<4;bUwWNUB zJV)H!iVyN3{(Untx~r-?$3tR{lwR^S7^BMVQbuzV~hVm{;)x0b*&^76G3 z{d@6Ii~_ytt3%lY6Fdf;(Wgq}feGf={R}MsJ9Ur`;}5Sf6IjCmUeZ&|*{wv**B7jW zyzgfW%1w+%1gW?sLe;x;awuNN&0xCS;HWQJD=wjmlWXtB{u0jHINs@)xsgK>8`?r` zX(Wtn{E!+Q@TiS!*7(o~f$Mx8MHM?iUx9gZD!FzLn+!xj>%d3%zTm}|Vui0GW-LBd zqtUKA`m*(u`fHPp{bXwi2p$)p#Ar^nY{D0Z4?&+X+ zn1~R9e8W{Ydqkdn<#i~ej_DDOvcR&RA_$1^Oj)bubmW|+D=h1shLS(|dAlKwXjVA5 zUzHl~r3|i*ATlI0GInYMMC8V2=-ge(lhEeWp;B_vJQH-}<5F45aoSbmxF-3XjDq?y zkvf0OF!noP|0cp{Nhg`0=J8Npy)>Yamy;mXG#L^$iq^iHREW{Vd1Xw?s!pj%AEPWC zTe4UESm#smMta-`H~nkKVbHYvcxT`guR*97mv(?7su^gYFO#{nnzb zvrsCFn9)W~#6@@o#8Hq-h)};Jaak%=W2`^=AOjziVLhRB)}FV>Vkc?54~xNJ1pG(% z??(HOc%ieeJSXL4p4TY3q@ui@9-j*K>LrJOTTITTO4Ns4`>v z3_EWZgZGp_fMNSCzHqiR7TQM2$M3JBAK$Y~+brGxaJ0x%XwE0F2yy|g@t^HJPCqaB z3h(n=&QUQq`W@fipK*P;-yPc*L=FME?I%?s=-{42O%9ai78x3)$6LKq*kj`1x?$hCGb;@J)iNQG>fO+T`tup3r0v; zRIS>~ozIHSUOHpW(r7rQLmihx;N7-Mb-t9nAjXVIu>4BCC53}Zi{2_mwPsz?sswG7 zad7`Ztgw^L#J;z2a8PO`l4wPL%#=xvJcA0am>M-ydnqOH+@^{ewgPjB(Y^nP|Np^$ z-~MZWk)d8md6HdLCYwR&fT=8+NFkpzsVad7Z}QI&uQBkFWj}fyXEJRTebGpGyNoMD-0|`ViEy~d<3%}I>FR6-k9Xq-yeE_=r_hqy zrLq-I`Zwa4c4En+yi~SnOMT!|Dqn*^^?p){qL{w56(`11qs$PDbQHuYVPXd^7I(WR}zgNxL*W5|DBZQUp-sP-q;u;(SZ!GAdvTihS*^j-+?Ar7P%;EiFv9slG9PWvoMW4!)tu~KfWS2@lNKc8hDwz{D! z?@H>Xv;Bh)K5`@RZp#-<*YB9?t@3xXxLu_Kkl;gN_X%&F5t6(KSEe4X4qSoDH=>Gpk?o&nkbw}iy{ zFF&NQ#%{xZNMjyK;Ii%OxMc@jk2yh2o;MO8BgQyJeIKw-$UpS{zXx)D7w+2{+2c4A zm1yKKOwXatZ*+SRPjpfLuI0&4yZK0oQGXOE9SI=cYcf3n#zjKaIdCbv?sc3*!rpNlj|HmPr!66+f)a z&{DuP|6W*%0>LNEHyjQn*E|0|LdGb})QsKrxP@nq?;)>~A>pLer_Um2V0__sE-yd$C>u6ezLGnyFmdfOuVaoVoxcg`^>@9^QTQ2D~eTHQ!Y<<+y z-iEVzBSsKW%3j+NU!TPSB^OLF4rP{(4$EGl>%Q4 zEp3l?HZzlB&-vQ0{?1WUc12pm3zgBD{uB;2%#l+w}cRc~c4o@eQuclA_pg%aeuHO;B z!@uW)H37>Ryz*D|y#M;O(spQTQT(8Ae0}Uu;K&&?D1`$3%^oUtphjmM4m-+Pg+#%W zTM^xQXD*3Fl7V+b65+cFAP&F@G#J%-P4+4dlEL%{IacTxvzu)3)u<b*bu1`1NZsBH2TTr^eo8?O26CvssTL0XVT1$Nt+v@08 ztDVDB&fC*;zc1>}2Qf;>lzMU(=810CULi!IbwNm!#0zF!@Wuo%N3$Gt@$mFD>C zeFjdIw?UYh zy}fE+N|UH>p&lNE3!W}c3^rjydOdM=KkH`FD8D&7%l^rx?d~Y>UdV}K3DI|NlSdK} zG2I6I7t*Q>nk{blv6?2H{L7#mxc#4YDAuS2K& z$@yDuNt|=7u|9E4{=0&!gh;zI6<6@8jUu(|SW=67F$KYV#8S>OQ%wQSd_X1JteA#< zm&;Fa9!;Xg1UTOOhlK2RIVci%MNlvHS?P*7ypN_t3l6ts1zz*!?h+gc$vgj_5>1_S zi^mq?LD>1Wp;9YF*p`Iohn|#t^y|AiW<6GF8B;061E&(|`ChI3get^Guozc$NW;-a zdIm84Vtt)DP0~O2{jKx2 zICvJ+QwGuC%$9}Cb=vv!9Ro7SUn#;u|D8!bp!=Ii9xK_dUy}gkh@VSTA;V74l(kd< zwN(k&hZwJ}lR)T1;jag+Q4%aAt1+YG_g^146arTex>5-OpYOvXr3V~pfGr3UegpOCiektI^EJ z2gCOunQ@&$(p(?kX#oj!LIPuI#%@1T@2zlh9=Uyn>Nb+kr0IWAJB zTLPsgmS8X6-RJDriPJP^a_nO_3Pq4jB7sH40T3$@Bha%5(Y5Pcm63w>0ueh7H zmlq7Z9#KW0>ca8?^dNklp|CSE^gg#>3&F4H11jYLX>Tr$zmMe}KAo?IyH$1YU8*;} z>)wa*+N{Q}cRNaYEaY%&x}2}JJBoVTHA{E6+OERWLQ;XqhJsQEjl^N=4{6%t1$Ezt z@kkag>Cd}01<}b1RzE#F3V$Dzf6^=9ZPyRP=Cl5;e)^ocg)XuB(Kq+|#-jWO2LpG( zw`T4M5Id~=G(}pq00j1`0|~p#2+~*5m7$Gqrt8_ z7+*43^qV|V4tfeqm-7|BfD3&}cw#{F)&})*tY)aGaxI!vDQyzeq2q_jo_t8CfO4>t za(|ah4rlcvgv=o$FJW(epxx@<4_YO+9!Z!J$S4WSvkd7lB9}yA2%kg~vtt4dTiWa` zHxNJ6PbW#3WvF6cW+*RXx(sV+w2l(F+hqB3i%y00XIy# zLCh9%?35Y-DIRX7C(O->8(Ik-5Zw^))3(k z@+Ky-N6tu9+Nuh5GV<75YMfdzii&A}KESOgiapGGRxc zov*AYtd@3a33MKwl+|_+azy09k;87irLrBrl%hpY4ueIK-?}TQp)0R@*ydzh6F}E< z+S0%Igs7}3K&guwg0M>rYxFLwhZTEkhh)}Gpj3{j4J-NR!H#SB$QdbUNX3aK8X`ZB z^9lr2(+nb*nAFP-i2TWw3_3Ba`$5tktgK+7E#5rP$fcve7e)>CI>3YmzNjLZ z##}RiYh8c(+#C9WMaGR|%y^CdOTaNUT|ctwe)*Q2+yfSl2N)4si#Y9fwFoN?hDZfu7a%*Mh4qiATcAqOB*r<;Y3I%~e#Nn}9PW?^L#_#*;d^9uJ4+ zLylCPR4Yyz1D>%mAFhG?n^TQGt}3N(fXA!0&78`k2wWWmjcHz?h;(1y3v>S2{|h?U z(2&DtCruT%p~feCNitpTUxX4nRfz?z#dNW$dp|)p*Vz0dr^~RQf+ui%YhX63%N2xg zFPgT52Po^Ixc}&R;Z^@lFT+fXxA>#2idxdo1kdx@=ia&tJ!b?43O}z^Bg_1S2uZ;u z=;e1s3OS8BMR6V#t9J;+AcA%L84TgZg??WrM1UI;+lbgRHaf_O-DC>+Nj``s?0VVw zP7oKdYA z-zdR3(tGd`mp>_0`n2r!#5-7tkPsIW@&1g41vNN#wltgR=;jBG{9P0Opze%Tkan(n z+^zrzysNl`WF;$HXnK%VMAPIvx?rk(IKd^i!C-Z^l1>_F(L6 zKmSoEKmSoEr*cy;f9~dX^Jvo+i!ZCmrR;EoIPi1RuO@8i!{)0A|RoIMM4sZwsPtHO4MB3kT!Po>Tj zpR-;I#GPFLWv#|sQuLpIEl3&Jk0*+sWKu^4uTV8vE#ags*R2BH>;mUZ6N_Wr9&=;P zXe5n*78=ZM@Ba3CE-il7)wFDf ztL%tD9RhmkGtl7gA4f>~{e zGuNx~WZ%iT{A0HR0o^ZdoFr{}vc;M5t-t9NYf%K8V(C{;JBz0Xa7*#@vNeWYZ(E z8TVlL8_e(Yf}QXn`HpW5Ie{>3bK&tbyd|+tPU)r4X~PPV*bJAQLW)8)2ZQ55*Mnm2 zR24)VUJ$$Y*VyL`WK=ev=k9M8EnVW_@`t$&cE}=jc#IxHuE5(#VQq{&cj^gb5q(yD zV`vMT1^KKv4ZhEf`eMc)y0ODv+fu^e4NO1q_72pi4kZ(pd_^-^gl{WSsF0j*gYK@r zN?V-)_ZBKb92o}oG>3c<1*Fx*WWVw7@RA){YNw$68V7o{L=Me~esC71+3Tq>KVT0F z5puHt&f{W|NRoqCl3g3+*Ox2km13`llEOybMlDVy0VOOS;V!=@+;~*fz1Ns2(hA-J zHUrp10$J!m5%IlsGHgcZ_VH{M<_YddGkS~~-NJIHqx%7US z{6t}N*I)R~x=)LJy}R^J=~{=|U9IP?@W=cBWwcL2v>rk~_PmiiX&t&6vF?}8NhZOk z;aSe#xvC1$mxYW~rB%zOpu}3~z!Uy`LzH>kr;uP#E}gc>6fit8o-H9{%?+Kw{?xZnEZUOC`Bc*? zdKrEl*Xo8yg^Ptc~g24Z<;l`MAEOk}qCw(4SD-@Lf#5jX$}fJKDnC(1K(t>|0Lo0>0I3gjqIsILz+LTRe&BwP3*)p!e#7-scHsPyz3KtH4Hr z4cb8+gfa69T1Z)tiZk%3S(Z3prp&2N<}zu!#qRPW8ek-p+M#iYfhI}ktX`EE>iga% zF@88d(H?fA`$%=OteNyuu&DuaG?&&*oBIP$)R$0*Iky4pUsJuMXrSQ=$}7Nt0kvy6 ztyQM$(;t8w*N6~QeM}1%9^Li6f%(w0$t*BFW5!=#E@blbs66>_-SA!n#URTfNB?o( z>-`emgbr2Zc@grq%kqb_N5zawZ}*5OFm)JcGNlR|)pdWGAW}TwS6X}C^G2?NMaMJs zT1w5S(+tSCn&-Dae@#k+^?yuCJOcq0df}~qP0ARke@x12fJv!jowNv(>K-RY0q2EG z;9ShCmOdZZ)y=P7J{Eq9CYBO$KuA0nehTp9db&aT<`}$2N+dNhMC>zY252c2=xG5s z2DY3UV}mMYPL7kVPmkLru=X7tt;70u@YQ-cLz7igBxxG_hHM`u%`UP-3M}=|4yE>14OAmpD|YuPk=A+Pt4t5kjVZEB?0W}9o)yRiec}s zmCLh!`eXh46*ZI5Oo!#4*+e%NJCb>4yy9Z4$MOgSRze`*#6t-Jw-V^{4^*Y(!i*En z>e=NS-=`Y7|H#sOx^YE3Pq`cp6z0m6i-=T25JXMKWSY-{|2zWm)J1*PZHBF>0pTFsEmQ zkk)U)Ht^3da|Le6x-fYbrP5mV!=ki*pB8nYDA8ISTq~<*Ko3!4v*aIVRWlG4=iy=8 zw8Dc4NUm7I`mD4xo3Iey4F=809z~kdH#GI%P*w5=brMJw8iDS5FEhaZt5Sw&nhO0@ zDSt@^Wc{O3{uU6$gy$z}O)U^lbW+HmxpACaR=i;|8`%XN2@H^BNq_=XQu)JFq=1MU zX~G@VFLDV`DZ8+OZ2&6e6+orbQvj%xQ=U7V|EQFx0F@FKpi;7oGC&;uqf+k2cKq65 z#+~K*aY_2I?lGlo%x%CK*xGdrt9G6NHFkl1qdCh+G0twV+C{tBkvz#S{s#|-wrUp; z#cQTaqTq6?$?!#8l_=NHo`QJBa%#uyMt@MSH6QI38csLGJWh_BDhbvKP#3L_d*+09!CLVVhJ@2!FWatMY#Un|=eX?+JT5bC2 zb4Su*IT=4gj<7@qo;FX@l%5yEMhLpC=4XL!2SyjoawKHKuSPcBEd|VC#bFSeP;=>c zADKft2x0y$x`fnCXDTpw8JJS_Uz z0Q&-PEm87Er~ji=d75tb=2BvbuW0%9IY+rqvJqqeXL-o>eN)EvjA^!az`lHaqa1zH zji}+fv=wO9E-*ZdB_VutJQBZ1@c$o6sqS=$HUXaiqC!BMGoFdngmMdF96u3CqUJL6Gr*85<7uuYJOgrd$V!5cJjI< z+rff}3oGqy6w{pmy8}&o$d0u^I;%_iiN1TqaUafgnlCPJ?!{WB_EeRf`s3BK^%ls;i^^Z#_jaWf%Lygz2%oeF1Sb@*p)E!bjpdm4v@(V5g(^%7E zleozOscSs4VWIUGr(?(GkazV_hs4mEl#Shs_{aXy2D4kg3ihEqh%;vXHIU|if@g6hQl4YOmK zk6lHm7q!C^81R-qx6xKh-FQ&xQ)k8$_oery+e|E)+VE~{R-Jpq%Op6}T40--F#t`pm74}5|h=KlHr zf+@+Q^?w>9qa(qa-B6>B5#S!GXaCe|M z?g_z3;3cOPMiVF`CTK#%dkSFI8)^v6$dGrGQlGs}VggQxz=5;^82%^=T#njJGkIehE0 z#S*D9rB)BOFzyVJ)x(krr*j2b#dAVHp%3IULOzCOC$FC^?Q(z7t}rg?C?D-HpTYydL58tYi) z0&PXkKX4CZSuop2x}x z^Qh@Lj|Jt~*%#!8?=Va%HcK`xtU^w2JwxD)$l$UPts>cBa(p#C2D^BWRV|c*IzpUK z+D&z{L9r6Ta%EW)ska8LB0cAjmJrAyNu{O06#2PM^Lv1kUfPN>vnM|^=yrO}6uyhr zX?|=DjL|e;S<6E4K1&_jqD3vb%RZ>)2}!h{q_T_)ugS;6vfvCyIz(}UL!o!KM06j& zx;}Cipl8mAWiu5!^q1Gq(GQV3@WNSWh@{V&FuPaco0LK zXNAK`zFHVJIkI+H61s5s$Asiw4Fgv~)_=B-1JOSRw)v+TaDA;WL{aLy$fQ=m6B@*+B4wAkt=d@P}AvFg7# zPCD{6$73}oy#oVa#3?G+P*UM;$duGc=4=*#x62TCNr>!2LY zZ3f(q=PsGP32l-+_%dU)@uCW0Dx0&^Y6MiQ~eK|5)*!XR3kDi>`VZutanT~ zeruGNAozH91fE_x4Lgv92q2PCp*K;XO_%f>v;gTPx-^d=4!sjH&Nm$5> z5)vfv%%5+$jR?;pe?YHsMuDOzPSJI5(0G8XwJ zCLGp$IhjFpHZ08=U7mSe^Z=W$V-8#K;uR50v8V6timpp0FhWH4tizdb@QGx_O=-7F zGUrx?@@M&55Q^7_;-Cc5%UreBI*2#|n3qw5T*3i}u_0*w>!_6agjW>|c+^Lg824IsKdu$%o^%t0}HA}C3 z-@&A9XKLJ95NwZ+B7-ytG936?aNS+9^>BE(v0q$hz~q`(PW4CmH@er^LV+#&NP%+m zN!|U9?zlQXDAbKXLN=?mlF)NP^%L^OUM-LPff70P>FMF4B=m48qJaWqZDp&6sBNxK zm(cm*KTKs_nQcLn>n1<FgVJ`geM$<=GtXMEOO)PXvY|Z38BI4Pn2Be?Z8)eAfGBqW8ZNd}XfQ zS5Wov@sldwrT1aGDbG$&`db~wl}A>4bbb+klB?tt&^-4bOYH84)3qDu8{d%C*Q|ftP3uHtkXE6ICV=*O_ zFp#`!k)eWpm43haSFu?5k9A?E(KCUwB4uU^Nwb=baS}3g^Ex*YW#fG3rn?Ut=Z=iZ zSwKc^@(Z3vn*VBe*GHLZ#&qAV>ejO+toqsX5TTS0TPCbE;|=i{jj-Jc*9y4+#TWPu zrOCic$Y{4Xuj4sxl?b+p&m>=86dH02fvjtiv6_=O(C|bN40Vlyv7A^wT+UYfeUi7l zETyMREwyJQo-`tiF=e?g&eKL2?BGzI{)9#-y#ELBrGKI@Bi*R61-5l?TpPJ|UpBp; zIB$2@LSv~%vqQp8IE7O^1(AVfQ4Is~#hLJiH*V7Nt5>-#gZ11zG1PonkuA~*8PZ!W zX@{Gr(C&e)UJ?>?as+RBF}0#SUc~$_Y2q9DDVu^$qv`G<3XgG#?w7_W`OxU0 zgq?mXs3?t!VhOr}B-3eBBqePl)MjP;3CxCKaWNyz2^krn*FEwqR?F=72A?ilJ=Zcy zRr2z;g3-RSsOnKW)~Nz4x*c{G4<6QM(wrxv~QBZG^m zhT&X8V(AIfN;RMYFPDGF9a=;%qT3yLF2i6!FI?}a=peHQMcXzo*n>7EZSr}($ zkloFlq#a(rhuJq_@78J?ik_()SU5ivTzQaxsUyY13)kROXl_6BqnPo zV2KXs_P_66f66-ccl;bXd3{LC0dIgbI4LDa)yXc??H=2Bukop8d}`x-6QXOrKCJ}Y z^hVC`)@-$2PXBYwJYY?m?e>i^NM?W9>i{r{DHAma>16-u`1Fb1CYE+lni&4tHnVPe zWbd1T|4pa+0nX=$JARzam$W7DDus~E9!L_aFc6etOS8c(Ozclq*l?0F0he>Gw$qHa zbl?ut(N6@T{aRsnx4q)?y97@Spg2dPK?QYLI2?n)!=fCoaLh2Zf^q637#wI_=^U^w zy2%+518(8L;zi1t1bGp9Q)v-wW$A2l{V*kb^Ln-2VII1MuX zjpI^I$oNxI;<%N`FfSH#1*qK2%2m3fMUFvC!Bm>?lJ2~p(S+x=Ek*%FOF^P8gGoqu zwQ^h>Qf0;{PbTb&(us0*v>Z_hfmrHDb3X$|E57&PORz~KVj!@`NwQK)`gR&~CV909 zWA{kJ>_foH&#ml{?7U-2A_5}F0^!(MvO$-Cqm?>y=NT@|?sv7C=|YV31r9FKSC)BU zBYACwW@Bs!e5CC-zSCw}Bg7cU-fl2~TcyXPDm#rwV*A`D={qBTZc1dLjK_`MjLtvS zM!D5kYIGz9B{SYrb zMz7d9Z`<`dMSA?K8ie`ij!J`qVhvssdYI-Gdq)U75I)F5&2Z2&E!=|@6NmR}H(n|d z^tTUeiE*PvieZ~mh&yaXER4fL(J43YiupIOY zg@K7{bt&AT-LJlcVZyz_j8`o0sN+UL9J0v8Duh!|0|q}6piKpDB-~*B1cVtElP(f` zb+@oaF7^=!EhL0)Z<@J&mj*#j@gOJ>v|(xMxr(<$&yH( zsC_QfSZ1>h7Q?lHMQZpSLKuRfkbtg~&Fg3=pS%0X9;=fMu^g+TFQ~TqB6GQ&d!Guy z<7H*^K-E~-ONkN$R9MU8RsZw*J=xghx$M12B1St|DGy?k2Kz z!=VG;R}e*~y~~UQ34JANktz6^K`P~-xS$`Ag`nqzNd?V(*VA0+ZC2@hBiwZ!UA#L_ zwPiszwiWYE^&A_S8*;b%V~uz1zXU=n*I&qZ@qO*lZU7cf4yH~~-Y;TY`T1tV6=jme zs$h)yIj)w{a9fV=P;+B+-TW)I=p8B_s^jMJ3V1EH2(- z;Vr&G-fW|cv$Y)T1lchhhaEsc1I0_RzoLh$SRc2|-~zFQ?l zNh{Rh4G@_HHwujY+BDu(DXz+(uTLvQo|3)WfQ(_#6pEua0x5!yr|bj%I88LOGlW}$YE`1maZBDB zFD+W`T`ngce4h74pq@DX)R`Ia8`Sl~uH)0bdJvO*x&d~9JJn@z;m#On|GE<+8oG98 zklh|rw+G|j?z^0rfYzPXQXBfoY=h{I+-!Y0pB&x>ujgdX^Iu{mU5u_V!LkAmxjq$N zPFk!q9YFFhH9x)>9q=Da#%J(G)pAur2@8{UP!`Afb2^T?t9_jz$t5$$ETRwSP=IOS zj__JhCpYr0vHouInWwgYA|MM95n}D6iJdyfR=f-e<-tzoi}ZN1{Hq=Z37zD9-gK+ zJ)M_7S=i?&1^v(n*b$*MI52F|H*`Gg1AP)g>Z3t0lU4npZERBUxSs&zc-wi0&tjcA zRsp&IC+8K!fl^pp46;)cpiGN2M&)aRQz!1z!ngSidCQT2p?aF6J3x8@ z6h6bJIg~eMW|#t_eTssjIAjh4&o>mnE(YkjADkMi)NA8i`(xlr<71z?mg3Ci^RSZ{ zcfHb{=WPXLU|YGdCETG#jpU}7TRD17<=alg`CZ1xBHf%b;H z5yR%MLWSTtqUM`oSAw_m+#7`Y7E$mY^6H&CCq;Cg;sdw2L_1@-Na49Xbh=K=v;|3v zv1Q$poAX8zX0*BG9_5ye7?0nV0=i{0EYEsznw4+OX@F!+bj4tG=vB{_`{B;fta#Va zc*%K58CtIC%+(W07G*y!8(I|kllnj)b_e^~__RC#34tRWVP zf-!$8-s|WtEcUqLO#V==*Ru+g6wAh7t^2oYN7sf@r|}hlzJRIr{F}*Wbp9s2gG;dQR}#PVaRtv~1;&M!cRnh$ zEb(t%^4<{g8+B>HcVVxwwVqF)9#A3q)&935?gLyc!p=zQt`~?V8Jy8sSajlfN8B|K z*;HqlBCf$xrwXbtZ}RdnHqh9x8yu)A&f+S1V#$$p`a{t4Z`8Z!u@;h&xE9S|9fZ33 z`@?&XN!(Apax-VtNib@>JbZhmi-or|zT>Mj_c&+g< zM1PbZZP34D{wz0`A)1G7ju0xYX}i=Tks)H7gcfUSnpq83tX6QbMu(Z-jWVadDbJQN zSHO+VHg;9BRjFJUC9u^mQ0>8_bMZ4te!bGw>qIAm??WFMtv~B%2g)KKrh&J0Z=b6r zPdTsTv&NVEkJYqyeoTvbtqaN9VixK5Jv92HV^Fzh`>iBN)sv?#bg`Iu9mt&AJE46t|1`rH92ojuE>F?s>WTJK4k1*MV$VCUS^E#1|A`Ur z!`jXD6%_t|#TZpWDJ>E>5=n>tjWJ#u+ou8Z0jxOyp!VDLQ>o`ul~L2z#pLG2ke6A| zH-&{`f0|2Z07a#MIpWtw09UX~jH09(ncM7=*dSsUW4QjFeiHbjZYMBc+EgbLMJsHo ztV<%Qm+fc2Jvd*qF!0Q#fcd0cATokrV~J{mU1ndcn+XfgH1=~Fc=#t*nbXM;JcI9 z|Lc8*#r_&P(%sWo93A0yhNng7X@8GBA?eXK`2fNou(|K@{J_UC4mg)r`r0-B7?J!N zJ6!tre%v{B!;wI`!yser z-xy-wcIfUnNXqLjosV~Jz|e};Ux#5rHD}TE3jtV$o_kQR(8lB8q~jOe=0BKexi(r3 zQWWpgBPN#ie?>28Q~$8Hasoie1pTAO)mK$fAYA}m$=x1oYOzBG_Q%0gW^u-cBs%CN z8A)MYhX%9`wu&8MQ?(0_3VuJ=b<^BLli9)mos-&WaAXtcWaf_=E>^td z@6qL6GYnOVAhU-xc1Ap_gFLi@I@eJHY%w9!@EoH z_-(1@wqxHB&P!358Uc0l3zfyInvo5uOyKsWF@;biiM|El`#0`TyWNzpAHGD?w1TvC zK)(Ga>1e98CXG@B3T}9bW%E%X&|{j`@b{A3%l1_zF6ASTbd{Bw3K-cTzmxS8P93sF z(NPICl5iD~))17~Z4?HqJndFn)r~WQkTfFu;gw-T+?;iXvhAf-!7N0G+?GYmO|3nm zm0TX1z0q7m^hv-x9V6gDb<)raLcB~15(&ku?pmbGpxKtwQ!7LcQxjit)IRJzc4a|WZNT&K=9f7nD2 z?>Jp>5A`|Mi@gc53tQ?49S%)(WTs@%7*_fDVULp?|yRxY={yGEl-)i%W zhq98z;cBYyY)Qq`O&6JqYjB#+9klQm*rBCDWk4s~g6*L}mkMKWX2bh#%qH`4+Y$zb z=(WC0X2%{wm3?ny-jZH*S*dZQQ*(LV_@hyO7Pqh$w)K0oBJeu}n*D9ig5oqiwJ@?r zeR_hz*<7=MA`Z5iX0$ohZlKE)HIZ)mRfuW(Fkkg(mI$H(>Smfq^$Hhmy#fs(o(_xl z?=lfT^tdg2Ep6O9yQ=0H5Haf-@V8vQa5ZDg@>#l;Ih}E{2KOlS$G3=!;-u_v60Hpp z%?c^a@3h591rPyi#!@9uBNj3xm>R;QFMyW23*3of^A4um3_A7bF>cFEzo{OVOit8Z zRLE`Syc8#`V-Z`c~^K#sTap@npC*Tgt#TeAJ2dwygVdGU(`Kq7E++8Y2nQMi7_PD3wtIQqy#f z0io&oQxGFAjWMJ{`(d>KoDp;M%%nZDChwvjKLg$;cN-l(MeIcyLd3IF`dPzR-bgbC zPEfZE;f{2r*n$`0&uyV7l9kx)kUkLG^~E(6(9A`XG;Db1oRTyKu0f7jWHrOa*uv^+ zxT|wJ2KvdlJN9{g3iOk+mB5+VKh}GGfv_2Y+{N}q;}+sVK*+li^VIr_JbVlgw496C zY_$#X#{D`(CoDx&wSp*s(!C}MaL8}`u3PyjB0*hk7_PCpqwAweY5ZiVb{bzd_?dU; z0@x#gIC7%YWo3-zLxEl4f>LT@DFTdCA*&bYe@JC?tV8h^?mlLlmr$AqNfrP+MmV%L zFd(^wU$~-w(b^!o!7{QA%R38^YRlcX!VY(1C%Fg{dmvBDj0GfP5VF+c)$DX1?D>(; z>A24QMa3g>VDHuK^@5+N%N1T|&Hzphzs)ut%=tWdDF5f*GS1mxz{nCIVv<%e6}ymj z{%u|xZHHPldjs1GXX$|sp*s~bw5NHw!oaK&gYjf=uf6Vs!CVTSOdf!k;!hM_}6NEHzv@a#0Le< zkTPH;tYW~``x%l_8kn9xIG}xA6SJZC{Y)3HiudQPkBbO6C@Vkm{B}FCwc;>&Bv+rfz*q}OEM;D06zp9_g!^^| zxi;6+V-LiI#^)jG(=y^^ogw@vrUx!E$f02Ssi{mo>NnL5L1+|b>NJk-=Wm_Z0M+;y zw=;)Q+ES`As)@vynHkQh*uzdK(uoeEo&DSIGSIQIf*HhQSE`yM(r%ted0B1vk7>7^ zDf1pFV^E$^{w2bzpp}0D5B?5Nza!!rDe=FrETNaMh-?xcPZ=@Au7#;V;sHq(x1Zps z8<>pe6D77vH2yRWxOUk6Qz;KB*3(!X3P)D}9hM39BGphVo3a_^@d{QcZkbOvu}WfL z>4}pxjKhSsaM}lhuj5j$(1)ol3tyF!@fZ{8S2?S1dpPPPv?$VH_Re;Ol+1=_P`@wD z5btCFOjA#g`3sC=rr^=G$YwkSDwld`-iYk12<@96u^bsi&a_f+-d|WKVW4p?u%vHp z9;M0jx9epT3H?>t*q{`2N9ls(YwA;mA?X9q9L+Xua=&>=SuQe;(U=`q{PMY2GE1>TkjDhPA_1({eT6n>-5yb4bpS9OBo`s!Q$}m?-_8&d zrr--6s6C;GveG3+Bl3I2x_m_Xp$1eqKpG0rTTi^8nb*RV$BS%MOtdu4IEOj%TwB7- z_2CiL1gu7SgGaKIEbNamXftkpL3s{%&m23mN`X;k!xgCYvUZf%A zjdyGd?Wg$Q6X3c)t_nI?#E;;)t!AU6)Hry&e;q)3NvrFKGngiE?QV#E-2dPBkSv4+ zQZ($}j)V>`cbAt;8e5^$|9+&yi;b8suLwUyo+pE5E|T5>3+%^ev*t1GYuHq$ZkLn8 zo*bb&Rq{To)IkSp(xi(dF$h?Ktx;??jy>KI2ET&)&z9$Zu`o>ADf-S=36sx3_@HrE zz8E`)cApQ@p1SpsgELbEhPYa_8tJH2ae*VGl3aHmQ8PlkKWEi`d#BN}chR$o8lq!Z z3XhYMg4Df2gkZJeOt_=sTTmRTx}&Q$FNPGS(ZI%k>qAohV-OJzzH-`!NZ1%@Kno}e z@KG)0v&ut869|9atzx@AQi`RL#)AuSaJ_;k+F8hlV-3f7?tD`tn%dIhvDXQP6{1Mh zsoO@&{qr@#GqWs;;>1oN3cnuU&BQf}{d9ubex&1bnd1gwmQX9At0C6`Ku+{8JtK%} z?Tg(8MuYZS2bgEhkspwW&UHH3k8;{qlifvwh-83p(Jx82Mk zMcFXm5#gsl(HO?f1+g{KZfjJ`V6l6ep^8->isFtpk0U{VcP(lX|AWgn83ovH!~tJ$ zMVUv=gU4@&C|pEP9r@{*>7611>J$By0{-{~Z`j#>xx(i@&GY>Vz={~pF_3C}eHSY3 zVeE!A<`3bdjI$7lqlHsYJ%H;P^F}7 zSY>2JPZnCEH2$UA-NyO`H?1y$4cO%P_h8xYbHQV^0S-(J+v|?>CN8E8JiKG;B?p;= zg503+vPmHVLi1d4=e+Z$W{{l8B&WSDmeOGx+$Xs<;1ku5;nEPs{>Q~Da=SH3QsSrIiOTBd^j!}#-rlh>V-66A4#V%STfN|x zYu23zd;(U9et@0e~e8>X0D57f;yJ*~}TZA>**mm+KA)?mUtW@Q0}mZE}Fn1h`o zXz#0RmokNaDq1K#RKgj$Cte8)#ZlFnvUk-$G$v(Y0J}#csCPQqjX@=727#Y*><&W9 znw&vvDxE#?)?WreGqy!jfG59rYOyJYdC86oDLr*iOEXC8AzcBhJGsw(0ZJVRLzD9qok#3M{xl? zEVNx-;^=CZ7aff|2PZX5ti}3JMc&yeMoveh@r=`k3AbR?x`Rt?KA!=ek6=N@2BkoI ztDX2fK7Q_m4qe@f|Dk(&&(RsmG)>dk(>G^*Gn|iNp%hiGh`+XhKS>v<)}J5N``5Sf z#Kk)b{vYb@@jJJ#UlV+6+qP}n&W>%{wr$(Vj&1MQHg;?~mEUuobGoYfjM3HouJ1Ab zgmsTK*PPe)`b2{y!L>1OeMGFcA8_Umg8uD~IKV&~p>B8E3WfRQKA@ZHMU37&A*v?% z^g2Vn^P!et@TEDWBP6%|1t3kpW@6PB4408TF)bYOPgM>WZpO|<-c~}OEQ?WD2T0qcneH|Mhs3HEVKB$xY#c`{<*#Zoa|mU8MDUt&3ZOqOU5Fy5PY&?<%~mz7pxG& zfEQ}d*#{}PU%K+U&^+&GHzt7DV;h5`m(kwWXjdkeJpgW7!XStPFLNK7`E(H~X-p`B z@F!;~@@;s66sQp15V6K>ThuqTT7jc_qXd>Gnsa!)HDJB?Z@k9>I3VYQ{7L;7E$?4` z&qdmg-(y6#8G&|WMyQ?iV7J7paMwLYFa%HrM%Dg z`S!fcw6qHA7*`QCt&tbQ)z>{JiIWKTuNa1ZmKuGunS)hK4Pr1H1J&3bEt3^X#!&TARDNt=(I9)9b_5P=pi#}(}(0Bm!4R*yV$tX?jAr;-o{ z1dcs^ouJJrtz>SUuo7C*WC&@N7kxBr$c#jW-bFY5zg>^cGZ@B4Zc$cKoRdxVua2yYJ_%lOA&Q4eme5ZbW2 zuLi#f?o5$J)$B--esnNvD-Xvm!V!#Uz8> zQl;VM1e8(Iwi1=8@GWbKGx$MR(v%5VNtqMQD$r}@U)+}fAslR^R7^q``E17b40Ouu zAp>7jHAc?teL)`8xebJdz&l_muBuBsm|Bz&>6ax(=i3dZv8zx)w^CcY!B4qkp5O;t z(89X-VqVyWQbmJY=D9xf+l6+`KLgl^Nm^nT@SDad+$-C{ZFu^KS38R*=$b|{lXXA} zD?BhmmoeWhWm&a~O2h-=kx&_K?{}0Vrj)M+nQLldsUZ*QFXy38D{RNqNGCdGgolc>;X)40%k$H9 zb1C+ECz+)lBgQWpvgER`!X*L+WgIlvZG~0__wkCsi`q4Za3H0eFxqe+pt|>u7APf0 zk8#5+**1Hktb?^effv+dJmChholx_D*brsc$5w) z-^wnOx8Tj=UL1l2$t{g7GVtd&toUswJP&^$6HmQ9&0wV`lw-i8KnUm*EYaSXZrdOY z9yA7}9|>^&1W4AOn12%A#Slh$&H|St<{uk_krUq%AlVMGi>TGO9s00T4OJkKyV)mJYX_>@ z!;=JI^pJ4}TT;Y^=hR-_u()ahYI0D{pGK=?2N04mz^(O;SpEU&gDK$L|kyb@tzM38-Irm7+~w+cFv0=WOn(7qk%*JPe@ zC>N4TRmug4p&%`VjJa(!OX=d}u7H>#H5|5hO{*_mwFTyFfuKh}(&jl5NfPx$+u*g5 zN_+jAHWWz3UUO{Zh2)H#$j_XeZ11+;bkG)bTy%Y1{`J{)S|y!R^L`Dn28y(JGsSfI zrE-3f0F`+@mvuPLQ?+UD=CRgV<7b$zq-HAAh$p3K-_uBn4cPJu1X07B3k$!K95BCI z({>!k<|2lS4Zvx!l`!+fE`_t(~c0Sjtu{mp|u0NjOkFmW!FG*catyPaI2+BXZaLX_S^bal_44(r&%cY$BFd z%k4_SZo9?e`HHqPOk=RvSQA6o*z#4$L4>0uqQJExMkAJVstS1_>oC4UCF{}vyeC8T z`7u6<)wyc0OfM!QZfO{rsnlIHizv(5sEeM<|9LeO7I<#RzkT%P&J_yrYvp&Ou(f_R zj)&d0P_%(_VU1{TL|l;NJC4flWVl187ZEJfBt|Klw0qTLZ}vzyF7sMtXa77AOl`Y{ zOv+iH_w2|ZBA;v5VH80&$^_G+837(fS^IgPpFKO#I-$ut_j(F7vDUXQ`AaV}y;C_sb><&pcb`&Fjlo;n%{BM*`IGTl(%A;=&&#FRpVm zcQ*Z2x{)4TZxZPfJ%$NjWC6W@ex28Kl=qv~MG&3AV$u3v-VX0K`oE6&>TW*Tz0S~n zfAKG@Y$chc(eJ$f7-u330Ek6slikO)LhJBIA__q>*_*F^ zely^wy!hU2CePs9t=-i0QDl{(DJ4R`B4EZouqM#@Cf4#ToEeXu2m&_!5Q;Y#c}^4H zZhM3|t8W92minCGj43FTCke|r4{Ls4I1wN;Nq$o=&y9AeKrUdk=vl&`#571I(3R^Y z!wq%=Sy#gEm@K7$u1j|2Uq?CyKoW~kIdLs4bvDPb&kKMx7(mj^DaC{=bQm!oXa?r^ zR~mxU4!A2l=orm-iij*=dAre7A@_f+Ny|_YTJ;gd=6>3|LukUgf;!R6GR87|atU!?JsX`^2Uje(Qe?P|t;r-bjTW zGk0+z)e>faWtr4)p$DxoF!nIz4)O(wC761|Et}G(;p2<*^1ft7F`Y-itW^At8$E#Z zh9LdFxi$c2C^!hLF~`twnf~?`5$;{RweWsKnof&(*!UX^E%oeZQH|hirt;{bMGLpTN_r`iAxlxeiPj z#yc+~{W7Js{J9dQ2auDHsB^Yu(Kp@Xc=$x$1teF%VJyL}f1^l8D;_Rp(4X|osP{;_SyN`%d4ijhTqUO0ID^-(B#94Hbn*8SlhUkGrouEy&5 zIoH+^u(q2mW{%*+;)ty)_}Ve>lJkMs*EF-Zk0?1DpXb$O4u|e&@1cjlADme zunP_e3K0}&(6rze6v}_k)N8{$*dajOvKM~&6^ZPKE!r%9GRHf|%X$2#a-+|Gwv5LX zsM|pBgE7EIhdU}ZIppIvBB_=EaTj2nUPTu61B(2D_n0H*p-48t0z)=3kODOBAZX0c zraK*azX_Cyg+F>5DJIfzk|H(RDBn$Eq@x{#&7L(a;im?>d=Z%s~2HfnWsG7H7B9jD7GW zz;`}G1A7S$xqYL6viGe5#152AU&^H7YFgn*P)9{5r(}|2v)NAEeLpD6y}gsT-R$y* z0$|3s)IUra->Ec22JNtHQ#;H%rm%RMtu56Ei^8RQ0_szM-~kvmf1-$a4_bG|yct>c z7cS)pmQ3O-PLU-ZK1g_bz8bXqxPGKn3DQ@QbNnWGBxx~}pMnLB$2k!Qks5eF4RP!R z9jbWCUxPjH^$r1=n~g4!CbS`<`lc?liDE(_HRyy(Ap+UlB2_X*1LSMRi#GW@nn|5v zM(u7GeJp}AoAYU1K*2xg_7c^T0M{2fKWp5?NMF9@y-D#UXhU7rSi`h*fHThZ7poYx?WxW83Jw-+cRE|0$HVaSx@_yzgXin5d6eY=AXnyjwTt^5 zrEjecV9cw#k|wLhIW|dQu#ub){C-JZgurxSAlMm##p`@Y-aOT)sK#{KsKn5ztUf^f z795$VPF6lY@wyO!qIY*ZfS&GSTu@a%vC_&66KfhJ^j1db-ocP4Iis>Q7=G9$v< zbu@S9dfF6VB#hMCEPI{l2mNw6f)BVX61Iu0=V*g_Eqd5f>i~Bejf%+kr5XdQ8`^_!V*hsf!3{932rOIrTpq zIPqpV5M+I0h_%}D)MaTx!g!2VBG)Kzcv41Cp>|FzBV5w7;xeb>pj}kxxqe!`PgMD; zNNj8>MuZ(R(xEtg1tJd!ZcC(FG5Nx#u(kl0SC~GfMT+k6^vH~z3ktPT+%taF2 zVIxaOnKt~(z(FVYw}HdMR%1zE0I@qICs$w_^(excI1@lq!B)aHYC1PE&_B?EyQe!7 zhD`jE#CM>PFH!N<7!(yzt$rn(Dhr_GW{OFiB3eq$U=7kPYN1cOl;OtBYFWP^q!lW> z&cl(-sp@`L(MnvlpIzHO;l4aT^D7U47^CLE^bx|uH%wQ+RbEWnoX&pIvq(uHz{xws zb(gsG0CwZYTi+OH_R^)K)N%{ECsyj>E+M<@{3Yuy_wwuh!}UTJ$Co$RhwWhx`{iu8 zi`%?4najHJuO$_C^|Wgo<3^2KqqRWEv_p-`*cG#xM$2a+Pcusx`S7t?<^K88`7y&1 z>O?B}t$dg7xH96@oEmtxDX)@tt9RhI^uOqvYdN&~_5PG{rG=Ba05NaSG- z2Rc&c&7;;xPX)svAn@>LaH`=>V=E-QHF?!{YrRzut)dqrP1oW(1x^|+<{{60Wxd;C zE|zKmnDW94qCvg!UL|Y8hWJ=@w~p==V!)enAF$`J0cQ`U^J3!L@<{aIqKf@w+{tw^ z6kqi*9!yab1xsArn)8QcAJ@*1&?sq1~)5>IGs z%rW`KjkeJyZOF|5`;}8^B;emY+uJ7seoksRH{4~4s67#xJHl*#yj4NlFvdxXDD0ge z-TH+iqw*K;|9?Q;l=xmTKd*V1{y<@vAJak9pOOD%-4G$9AX!P#VbZBY%tAoWh>IB8 z74h5?#b`>5BWUR8ZYI$oz9&(lSBumNW+yEX{$tw&sj>dsw!uNJCd~N&RT9YiBQ2XG zghxb5^8ZZRT$KF3plz&?Z@10)MUe3>l&w);x71t#cCW2%19>Yc0bf9mtX^dRyVN`; zk@iuP>3(YP!ss}3^*c1C(4VRPQ@6pcf(#!U4ayi7@ob=Zm613JK7fFdNsZWGRn$iA z=Hkp`0d#8z0wc#A^dmtP`%~w3K=|bU6O4-J{+J?<62I=eU%1*@sK-zifC69zPh92* zgNVRoGoHq6A}{ZN&!fRrynoO2@hB|&Lm;ut!5NT_4OI)CsmSmOV3$~53EORHy3@Hg$<=@bavM6%KK8hXm*T{n6R2*u#5i;K~R1)VXvqu?S zHaJrX4k2*ix&q@QOBlqi7^a7 zkV?H6yQ;{Xg$Y@<^G{(^aRwDEmW}{EF~5iid$Q*c2L`=5hj;gq693m*am}xWkF;xw*WV}s=_{rd(o+wqI zw3-yg+QG5^+qv;l`?qtGo0EV?dnu2gnRy|vDeE$@p!C)C`Lu!eX0&)8Qi0GOh5m9O zs_5d^ll?n2$2wP>raG}sUJ*ntcDSjP=MUVlD0BJ)(l`PP= z(*N2QbrHtPxA*b>O=+vcE+{+a(N2LixY_qBdAP(wQk%N;C8%(2PSF)^Y(wEt!hmF3 zn|2+r4+q?5Jl)J+RBnuD85&45-Q>r+Np6tLkX1;8!lDXp@v}P2K(tn0&S!fFUhjJ} z9UTErav(>WcrubkwCx$-L`9BnN5!n zWc-okp#)O=>wI2TI=!`E*z&Mj-J~!xlt#5f6sANySUz^$s}#Es15Gby8Ath#>C61@ zEBE3!oI=KP z>JNanfF^8M5IjjWC8Gh0w(DKwuiLQ(E=)Ho31T2G@us)L%qNBCh!yqWT-fe@jU?kh zQ8(k<11T0Z#LhAP#Fw~$0afJ%OD|er5>?8)={s8KO`xb`*lFG|<^Dd=SdW0?{;(M- z*ToV{v;l}3ixfqY)H5R#X)Q2X5(DxxD>GYwWm)JuG`XJpxxInr+YE_JFgPlON4~@N zi7bMWCyWdbB-ZMfk~GR8%(RXY5JM%g=lJ4<+CW#VCkD&ewLE#Wa0sc1AUkwy4>=ul zNdnDhcY-Lrr=Q>u2#)%9>}RJ|N0~8&hFIvO4mW`dr4kKe+dVPC&zCzGTjseCSueN~ z&)Nt;!GtWos8HLQ7%EYeK1)cRNxP$VCqP;`YPm2W4K`y2<^VY6*}eWIBp^mL*Lv9D zR(3C()!5jR*xAJ9I%)9#cTwMH%`2dYF1W+MX2bu@JH=lG}>)) zK}=;Q1z0pK1slw%8VNjaL!jDjB?O z!tHhZY7~QEj;0vXxT^rtx%W)DUb;;U-({DL zE7V*Xw@vYb+oD6bgy%-nve+*eRQt$%I8~S``Xo=^xD+Nm>Kf5EdXxif#wUWp&^!`2@<~j8wN{Uqd z)~=|J0vd@-6IUhl#0YWhbOxIJmNLMDp2lUw2vm1~lYC!(a4(Yc`vxU0>+uzXbRXd! zN+4nhTk%t0{<^bnvxCdm@t*PQVeBDut@BCY8 zt>xJOLs(NXh^s8{hjL;qvB#Ys5#xcBUCaxdrh*mgp0rqaf1TtL)#$d8nCZVKaBvO1 zJ&Y613T~(!96ipcDRQEpVKZTj&6uMC^r9Ev6qEa$KW1LOKZ#SRNK7@E%zo#( zqYR7ZqnKufY50f;F?j3szAAsd>Gj8-u4N1#$Gs-bcFO->*>Sl-ra>9jL$`B{nm(5N?vs5PJ4z0N1JHW0=uH+^Cd2m-5w|ArdeG85{mkLgeksR3v*e9s_ z(+9XEQ1n=VR(s65UIMqOAzfHv*hW-;^=%R36Ke+l?lXIAnaluo7x&Ao%`ljv^;t{;Isc`ri;eBQWStP5E z*!~_QL-D|0^&wxm2Pg29*sXy!lG0_Nah&^*h1m4DvU4s@wG6$ z-HiA-5}4rE{{JN3D0u(x@(rB!|G9i)jq_jf%?C~uc>O#brnCYCIV6b1q)@>s zlS#?PhSh}d(eP~;Znej(_H4oJyo|0Lo?p-}yl_W+&X2Mrb3QjhGqPBNNKMd%C8I&a z!@9YtK>rE8vHVdwuENxN?-loXa^bI$8RFgz?+CIw0|*WpP|wv|6If`pE5tXHQr?#0 zL%-H_EW$19my>v4m#11<%WnPeQ6GFHmqf>s;+n)zl~?(*0Hg3dl7L#zs+6G}2ckzg z|J@BgCKAC?KWQvTy8>Aoroa4ZB|0{848n5B;*S$$j8(li`xpCBYd`g&4Bx*EODTB` zjhfScqtCk=>hVLn4=}R+&}9N`&S!MLYE;Q?FLCwopG`y*Jkgk6{gnZ;^qlkQmpAZ* zwMHXIxiQ-11Hq9BH)XkGe-E9qe8sfs7W0ddWew&v&zo*wKQEYKm)HmfPARm!e1%-Q zztFX~-&y!PC40Rm4MUHXh#CM36>W{*+ONr`(7Q_c6Z9^YjJ`ZgAB}bUtoXR&1yc5M z#L_=Z=;1ld_|3>*KF+qA$+b1A?d3*XFEEeVoKqF|B(%|4%wmI$or^r#3Wm{jJLdVQ zeVGX!FBM>%u|CRK4y5^tJV+A4(8sZUTW2;}Hc_I;G7)h4J zS_E)Gh764JV<*^mE5|1F&CTmqR?d5M-h0(MPqSP0>jEJlJiHs&{48+L>v5Ya<#RS>}cnHO}Yuhou}o|Y<({Cei08r_p%1vo^hwXJzA(Alk7IVZza6N zoL{|yON;A)5L^u34oAi1fBQH7rKfLWmS@SPos>PWX?i$Y2zm7Y(#%q$C-Gn>olymE zgv!lJoU7FR(}Yn6;+;ClqYrFoDD8dqu+=J8)A2jX({i*1m+>HGVs6Q2AXTZ0Rs9U1 zFzX}(Vj#J>?qL_s7m}I}z&qT6aZZ9AF^8fTGfaHT2aRGm^9zipX(FZ^T5N1Sqv)!6 z)VL}E;`HZAXbSyz8MV5Lm5Fz;qaby&K!amdm1#h+lf4h)!~3^T)O}v(bM4|!&fV|U zp0rDSLnayKkL(8F+KQm`h$Pzd%h=5S)Z>M%w z8Aewj_-MJguJS@+pd)IoM$SI5I|_Mgv&JjWzl$yhU>RAuy!UBxxp}t8V>?^rCDcwY zOZo8lUH_u#06fr?VWC#a0iyn?Cr=+HvaK;cUG5&^zkLw|QssA3 z!pFC;w|M($;i!1q#8Jfuv}Lh@XE+c+7GVzFWjm-Q{e$IrDt$6^*(n{@3pNGgBaiiK z9cu<89a&rCA8mVDE$zteU$PFn)_eBrrhKbeYzHGPf(b;Nf;WHN zH8vO?XZBX4EK3&*7M({-Um_qeOOn_R5%R$GdO9#0Gqh;LNaQ@qs5XU&D^JsOf8fw_ zZmNG?&|Cr_Fz*kfosdJCkF4U8#7%QXs#Sh21hOXlE#g3>G9=tnrl@6Y@5(3%ZKgS{ z*z2xOFWg6E{f4>s5QrL?P?+L%*!Fp$!~?_|-`~+4sLbZypPuS>^>{1(`dFS%!v711 zNf0!oT{=U2rP_B|iBx0X92BhZh0qjnHA+(F>oo1-;v8(z?<0XD8X}qJiO?gzpCg2N zo?By`ycF+pbn?k>-5{vIGITEB+YUGF@Vq9<^1*!u&@`BYaCOA94=&?aYMY0;&J>XW zTW4(ObZj+BVoKf@35mzwNZeM0VEGNsL2buB!p+b?Z15Xk25oHhqy>)ME(An=f=%Lv zrAdH@R>GCtL0%}yX@of`WZl(qWUKSVsFU`@e$Ttb--^&8T7Q>J0rzEe|A+u6UAHmDg zu6M$O(JV5sH){gy6=H4=D{_^8cQvFxi`U*VEh%WHz|Sa_?phRzsz?z{gL`n943bQ% z160zl3jo^fqr3_}8w@Piv{K=!s7*rd_P_$vBeLn$yGDgD*b@UHC?SYJ?>(2;PUFEF zK>OjuN9+ipHV3Hd67_iQP)I}Lxt)@*MvYvzeOzE@z;2q8BBlRaMvNCK#2p@u6euDV zdF%lN%uX!mf73>V=k93meT?(ErjBc!nf1Eb@h8gmo3y5YIT96-*GMsuwB{}sgKbxX zjKL=B7^UF2lLi`t>sExyN$buj1Ol~(O+`Qzhi}(FKo}o7(}|gS#~E zZbRe*G3RRBiqMX^59tJFGbT`{duG+h;D(>Rjsc8dupOm}C;%J4J-NyJ)6D%B9et8l zq`u-omwFG~wKoA|xJ6HBLcyK7Unx0Mwn-4n(26*lN2K!fXQb>{#Ggi(3sA3>MHrIj zK`i(7XFm>RP5=sQ zhK;Je!Juy)yHPOw1l*@cbA*RrJU)-)ulA~bUScX>9|1s4P>*2Icn_~x_m0E?q z`FdY?&?d^zn#+5D*C`%DG>HZr8xO{q>s|Zw(^%y9*etNpn_panN;zOD31|+82C!S! zOUuVEKec$8sOjkfbmdt*(%aTW5cvX^?bH5Zh&CB>qs1^r7@BOo^Yy^q^>=GS&;Joa z61*X@5JU&W-yI%3H&5r|5V92d=dP|sDTLwq^62+m!P)2ec8pJL7te)ur(Mqpyx(R6 zal6lP>VH?uq7J77k_!he7amKyOUUbzj z;^#aF#_Ydt+pu^`-%Fj`@E%lDd$6SB;Hc{RM&o(&_O0sl?ckVbkf@aU2VhI>?Ffi7jsF|b z6jf{nJ1>&X1pO`X$FPm%03M#wpOVj5f7bv6(;aJO@Hm%-0Q6WJA>xWB-4bF`F8Gy6 zqAQGDy(bi0nAj%i1z58+k|(VQO0XiXSbaB@+9^&EnXOTra2wfC4wPz&p5RVxy;0&N zfD~kUYRPP<>rXx5X}dM;m7Rrhs*(s3*6?FEt1MaAWr`(V?Pq$$ZOriE9FP z)fi5d@k`lZ>h(&nqUjscnL(BLx64ZMdVE%DemK|V3QWo5p3WPbNlMJ|CpiXYAxj><` zuKR4lbwX;GTA+@E0%4#`0h}6yl1Li4IuGLDsJI;iD9X_hV?wm>kSqmA2hHZlXnoaY zM#|HizkiGsSNyTD2!R;9k7j69F zcMFYZ%8*+L*XIhFCd93lb2v0XR$LEdfx^NHl^j&;FwLhXep)L#`N(&k4Y*{IV&gW> zmj%YVHC7>?AW>7TQR#p8|0`OI8CyV+-=aB??18&&o}@Yo&Nn$qaXLp46i};=n&~wO zqh(#vn%c@Zr=@=b>HcRIso!`hQkXx{ZViJ9z0~n*oP7gw2LOU^@BW`Dn^c$u;}@26 zlATCUmDQy2rZ3UnbEyoL4W#y2m@FV@it3xQ7GShk&oj|5e!>hk!08KN zuZ&iL*-L|685|IMM=8z?bFKX&tle5jJlO!_F7v*|66n{@z2l7HuNE2qFQ`L9d)AoX zlb>>J(uA4Y?9G?#@km`$3g*2UN2g1DsW3G;8GuOjF2fP35*avmQ)y;j-tY*i7oT8j z5j$R&44V^spABm;Ud%}{Jo~0$K8?u(Jn`xUL{nJ`9PEi(=+UnAbe~7*YTfZ*q=}Vm z$ceV%OoKMwqTi3CX&-Mi($dctxX7xHm?~d*p+^_p$N*S&#*B}EyVdlle4}~G@EhO@ zBEoN!moql18bB=Pt3l62cIP!WJii2P0EU6+ znn!7?0i4>xLm#ms3`JFNiVe(^BS?f9wXjISqpoW1imAepA0oB`nL4A zB4;2a+K)ga^jlU^raU>duH+Pcx>^a(tc4}|@;oBCc$*ADK_9nhlUCByEHU7iBso;7 zHt|jEj;Y-*t^@DDCheH*y&Yf~5F?#ZQA{$$viWN2nnnD6 zcswf9os*n^nPq2IOU^wszS;bVGdeD?^u;Y|lQ4HBT ze#=wm;D5+04+A&9xve>%6{w5fWL47;V>4AQ3{ik~)T^>46KT7Fx9OjYm+#PBo(uJ3 zuQl_K#lb`v$Bso(D0DMTux%Mgep zBsW+Kd?^M1+zIWm40H0WmJIDjBnuZitZHf&Jra{ zzor*|bUAG3yigvzfS0$xJPQjAW~bFLy11IWn02WSrs8ut#?-2#TD!~eu3+2%L%A2R z+%51c0Ij}A8!+M5=ncN)8Y0&t?}oIN1c5(xSz7O_@$HF^o9}wJ7Z+NTOF73dI6=<{ zXBdPspEXVvZWDNXP_NHBDiI^*h%Z z^eKsq05C@ctWvf_cWVMT9MUN)1hWIFQMg2L33|MuKew56L;9r1moe(gspehzh}GkD zL<&sfv5DV#-DWn9`x1BWACZv`LxcsR&gV4|Dd_YgRyY_q>D`?Wwn9zCXsI;`P4h4? z{f>-Jo(8L+S&fXV{@w1M+O<;$A3#A6s=$G+xIcrAvC&#+9kdnDx6F!eM)(-zTJGaJ?9)y*IYu@8%Ti#Tl%8C1@ z&$*-cew%s8G^k${AZhwAE1+d+-e>@l5pB>1ym_P=g&5YZI%R6l5utbHG+%pLp@b=v z-adHuOMl!iYre(c6Z=~y&4}Ypi%cLjAqORz_i(1`AiPnB~IkF7z>(T@~ zbS1W>CTc(bNgy`7LTljwjBmg^F)qGF%QY@TU0{O|F6EobfM_yX=T2HlXevpxFWrw^ zj?B_jO*w;bmh`(XbsBUq$z6(QNTF5i?I_$Er2>?0-*6vkUC~d(PhaYbNupU=r#Q=Cto+z3h;> z`^VXd%xPFC%?*~_csbgd+PD^W>v8w`7%$Uxm(b=BjufVtT1sQ>DPVDmuZASnY=2K; zfPv|qD4ma#r1cW$4TX_>=yjjtIdwMUf?wcDhS-Gchv8E3io6WU zQ41@7a*2wLzeUhkKk?{v33wzd-bVEg)g(ay8{MT7$Qd9s0pk=FzlC6Ng2 z&bQncc#$iydmUnhva>&p??0Ukc!EZ@-fq888F}%Ybb{NC^m;(s0Zyjgy=v|FID$YM zY}0TqPBe9}a{shZ%evK}L;!Rgd$br)NrG;^KSaPubL=%y>H@rtCX(~?OiGb32t$&o zwQ^0BWZov4GFcC>VAG0Hv2VPCnh5i%)z~|SZCgB4&g+%0@glC%js%&5w$%Z;&Uy@| z_)EeOgD(O(9D>9F6}H;1JLXU-JzN)+eCY#(mL1gviT;i)LYsBQzzRlw3EwgtmvF^Z zvokr_Pui?A?f4h9!w*4G??jH;b7Dg)c^7^}uVpqJC*M0b7j<$L(xFEtXcmY=_zow%<3{rH%qM3m!xJF<~J32*5yU1tzZB;bh(vRAhCvjKar-JZbYW zN6O`Sv`UW|K*=0l*b~otT~>D-xUEOU!XwvEa-T$#ZsnSbsl)g&K8r-`%wOhNnzW8t z_nIj!otYnI_G)Wiqka(rU>ndSy@8b=cCjCjDGFV?eN20poZ-{x4E%(8H-J8$Unygc zok{oXKuKsZ-!GJC!?%R)cV)enJm7M#*{pj$Rt>I_n+UgREuE2JKDF5RzCwfB3o;W3 z>^apbF;#9O=+ko|2p{{1+`Mvh+aZV2mh?bv8OPi6L|m>*mCDc33e9zj18FW6Tqs4R zGGT>M9RdEufu`z*BQ_KWngiM!NnyZN)O^H+GBO~ao@HvsTw280{tZwff8P6Pq{s-q z%pPZVqt9&dX*Mefl!WE3MFP!i?p7f)L-w=LOp{nZLdd%xoO6pC=6~f~8J9#) z*|ztlaAzluVCXY0u@+`}W{5B!nWC^goL(c=rH>V07o%%?E+YQ%cCc_0GM@+CI8MR~ z$T5J<*`%EROy%ceq+hNjAss498nsPAJZod5#`I7`5;I{<3k$eqUV^{pB>pU_~4xBi;Jf# zk_o5C!@)BV#Q-{X0hBWGiPb3Cs7h|`JLKBT38g+%-$4SH^-|<4M&EWOD6e7|#1<>2d}plt7Yw=vxRvA_7#%(xI7OTL zBXELOH~-2l@~!S*E|vtgqrawCMS&e3@lqT3SJ|>I(dUZE813G~mC}+zF`df|r&YM5 zGsGZQb>+@Wdrt0hqkcfxr4m=CGJTip3a2J{51{RJI#39!LfD8zI6;k&;pq0K{3U0b z+4mLiqp(-rmNI|a4{07GeCPc^i!|f>WLHCAjbs)euscYi*y&2Ff zwq}lliWX6o{lo3l7$H_&NlOvQj+5t6a*N@3^sR+?I2rMEvRt?oag80w=VUK&9A0!y zv&x6u%qNw}9AC_i7^iaB9fp++V3w>}QWTE(y*Qrl!(zRkh$}`N=8agj7N(Q} za*Ya1fmVakKK~8|Z@{N87?$}1EFsuIj;NA+6)|`hJ_RxSZfOs|M86tLkc<6h`G*;! zP6C!f85c8%1;&=XV5JmWHL+7_)K+!c5JPckv~+%j!$n=1&1UQm;;A0l{C&itx!?n} z-KP4m$*2{+9Gy@9rMFtSXWl6+ogXvj`_aos4gPhl2bp+k&D|ZlR$A};Bl5twtOJ+fA}Rvh54aN~S5Qz!Ujs{QeMH`ldjmTj-liI~2(u9#Kyb4_ zB@;i5?*p2R~!PRhUX)3>lAvhS_O3c~K+a>pj*EM-R<4az6OM@T z1PI$rV&cWDT`DYkQhE4#*n_lUrFEu%a*zIaJJa$1csuGW;BpgjaAvtN0?JNZh>?Dm zEZAU54S54{$#se&>MUeb<#!2M6^v5e!T|e77#C5KcW|Zlff#Lkc0z&q(m{|`T+$JP zBfMFE=t)*Lg)rs-!S%0L6crcu_DmWZUa;Y@#U>y{6?8sIEvp9qt`m?7Bn!i1c8}ax z?tL(XSW}x`d zb`ZFy-Tf>FrTh?V)e@{ApPH(fG1|I+2>~P{PO;XTBLAW%?q$HnFm)9^rob=&;Aw2k z7>tGHg{e~xnAV^GwKV01sM&W1{}dcxA{<%eh9D$wW-4Q~1{mIyabt)_$va%;0NlWw zDoXNnDsv6)SJG$L%ufaQ7G`L1v+R$Gh%(R}nbpPyJ`1v|8et;{> zQujYwe^!G~B?_f1v7iY)fRi3w*2Eh~J-b($l0YWGr-;#JV{iEGBV1Q7)Ob!-UE1cN zD&^SVeMs^ux1%tQLfhi)8%4j)ai7QWZUGNFo^6lX6g_ZV zy{@-dg*5{)L{%MDg0EY}+e3h)gCgPZ+f@!{+u4J~&he@%7Z8lEhGSJ>!Bt0*q@Ex? zz699Kh1`gnB8t=-0DvVu_=%4j5GA%ii{-_SWc?CLk{LbiZ+Z`{U{BzUaDqF7=N6Yt z7k7f5uhLh{t3h^%F~7-_Yi-{oePFkH_{ZAeoDN{lgaXdP1$y_QLNdkpH2PzZR}>KT z8xV%~!HdNm3w+u?j)5gu=AGZwCP2Pj3NLOH5Ud49f3517#=1KCr?^9hFK+)XHbxy;YqmhSzIz`GHWX^M|Q!DH8DQ<8` zDG1DM#W{3({||}zY_Je5{qJ(J)wfTm*{^W~m2dIyyHtfpT4(svemS?h%5~WQiRj!B zcjN+t7Lw-Fl%C9t919vp-iIiAI6hvh-6D%KY&6|-17V5%X@~)3BJca~Lx{s{0Ziam zvTu2ax(|%?E&TZG8GAp0=zpx8=!UJ|jVH7G0?>eiiA1 z&=zgbC#P{X>D1pI2=$GqMaT@5CPzGZuTcMi33e<~pv^I3O3rqJ9Hu83OQZOdhv$&o z8P7Z9>gYU$XJOdk)H=@jij&Z!l%d}wt8AL2+yKTjyu8S7NzE6hmHK3{ldeJdLvmh+ z8xLWU+}i?PRs9wCy4C?RHuMvDeN(&+=8tp5-_hnuD3e!Xaap@qTJQ z7I4P+Jf)v)gB5IS?R=K9OH#4O`uS5Zl4PZU)V z&cr;$rg2u7MlaFIXt5x$Ny$6y?*?x&&m$CTT&;o0Q7kQ+w#X0M7V}}s|C)Y=nrT&^ zXTTUUX{J4a5C2LZ^&UpI2M?DLGBBqeURk^J-h^Es%ZJokYHY`F#%=WHr4IHjWsXd{ zSfjxdxNcS&*Ck1Ay23;;>m6dXW)b`bS3iiEAJC!Yw5*I!JUgVv54)C7y;}Le=k(%@ zf22$X_qz}*NsHwKJvZi{Ea0+kLM)pA8X_}sbq7{mftjWe^pYx_H{1~{*)SxNQMv}K zC|is|ACtB9eA_@r2B{RF1rJ;>sIV_L@7imm0GgIt z){u-U|H8nqe7*s^-sf}lDRgYFLCZQvhxVm!K%Q$m4lKEaF_x=IHpcGF=Y8{V@3d&B zN-w)yC+D6b#kS`so*@yD2#=oDMspK6MPxWr2k?HOAJRIFAwuShhC251e%%@0#9;Pk z>;1#hsoPeT*Yo0^z%tCwJ$~Kn_Se^ubG1RcjL6*eGa{D*%t4B|PxhUVJ&(Q3^n}OC znsNNIQN8v7S`~w6naSB*M-BsAbIoH;@$PyO;uz%mt+l=aQWiU~?D$bW(jw>}CbTL^ zDnCPenv%E1Ynqkrt72R!JXM}7GL;qCN;2TAEV~iRE%-^Y)!4=aw-P|iL`QBhC;eyq zwlz*Pep=`^M*Pf32cIxgkXcdp=ZrVis4b-TB9olQPp8FTL_H9MA~kkAmVeczh4QnMv-5ZKl3n8&K>j?e-j2snvWOE{96hDCbJ;7hqf1UAl_e012xYu z`w4#Mx5D7*j-QTJ*_1t5F1@P`0ugga?uk_3;Oo-U`0;h9+0hAVX9W;c(Gv0CaT4u2 zNt}LSIqjN2KDu!Wn0tNjRb?2aAQap1V`r1UK78U8cf9JgY(hWm3kk-J<)Qw!2;QD; zBR`M~NTB!-#c^UTsBx!8df$T|HHTze9b_1aM$S1_o99vzm38Q~vQ8zN7i3T0iDbTZ zpAaFWpdolQQzUo9Qqwvl>c?I^?XY=eW+|Y+vt&yf1oxbtN;*C}{lZQZKfZUmFWF(Z z(}s+4rf>PeOF&(Cv+}vQogy3#&>8P-s_k9!q32+kFr{LF-po=Lh>bV0!q%I8CY;FU z*P3|26V{M&bak{iTsM9v4BUo5FrnKf?1XjiM;TRj#pZ1DP2mpolya2krpb&WmFn#S z`&I*|DUi%zY#|Du`-Xr(!Pc)TElQQ&ckhGL(Be32Q$KnjIXPik=DJChQhtIuIx@_j zyO0n^D3CawRA#XzMD^g5l1BAM-**HnYq<|@IYRq67QM)TAX)4kyl8DuKz}!w{iiW&lh=VibW~sd}Hwz>x(=BAfV3it-ZJ zOOLd>QP503w1UO$rBFn(A9HFbSU~*gEja;8DTb~S**+xt@90X6#OX~F+ z^+-A)RlghyGRKmjTdWryX`3>wO4yq)V~Y_6xnEH-SD(<_pdoV6xQ zOU~G-T?-63EBE%H&n*WWpvlH2+32|Y>!S;6Do8A~`N1r%p*tNmN+;JIyk`Ckx=gg% zP95-`dibY*WpqmdzFtOL+IC!AsJt5rHeq~v@5n88kRzI*K76h}5U3kTD{PU!cBKM! zqvn8hqq$l{OYH(|p+BKGHXIllY_7j(Wt>~YfMMP`J?q#(ts7BKb`mM}u*U7PBq7{B7&ayNtblBef4o0L;X~hWN?E@eVGz8~1{$`5lw(?P zvgCmF0-bT4`x12{4JUUhdGC%zB9bmotXTcnf@*=Xd1It&n+i4Cx@*BXS3$c9B}e3W zvKD2^+Q6HfG=Zx@r+VsC3g|J()R8bww!h@fPf1u4f?(C^qd*}lvToi2Bpp-oIj5LY zTlIC{K;G@Ho#p5?)Basgwc~!odH7i1C%v!GNgk}cPz0AD8txc9dg-%Jo%{$#>A6s) zN=5RQBQZz4UvAoFdRFb~Ye6$Tjr;BVqyROv#oE4~GY57>pBbq*j-kL1fVAuT2|KAu)p*)e=E}BYZZs0}tm8h$*?@PU8uNv#fWXVssm!{5 zyBxfdl4y!@#~ME9QHWZPi8LdtR{`Ex_3q0CZkp;G6y^bgOH=9%T;66Iw~ zCL(MhO#JzdRXPTBw)Xa-`^a3A+H|rP{`=PM_tk^%D=y94q@KZy7DEl;eT|SSFdp!3 zp$bQ-hiMXd-H1uTTnRlmA=oq|Z(GTdgH*;dmAt+Y4hg!@vBn8SeG@9}`_~r34B2!T zQmMU0Jjpl5u*XQiw$ei5-GPt2$Z7G|atKi0wSMVI@_F}D6*;c-DQ0bH|K{q?Wg0#T z5qna#Kf+se%Z^RZ{5C@krm^=|2TFM$B^gDOTE+D3!24?^vvCG}H`^vH(3JU_D>7P3 znv^gr;4UsEv}dT_5T_o3ta@*dEc6FKNDC-XzZz!Wtk&mA71P>?St%(r`Kn&NN&wn+(h~Nw zxfM*8T-^7PuU%|*^KNf!^q`SN@N**_#MiNb>-cF<_q}*-C4$22rNpB&Ibaw{9vNBa zs!q$*vh$8Q?toU&>N1t9Vr*IniU!OB#9EHPS}eccjB{axdpRLueHa|nc=~J{Dc{40 zhgIDjCt&i3YeT9hIK8_%VR<*4btXmm6Sf$`5vhC5pmT5u17@&Jj1guCPaa3`nY(_s z(%Tmh_@G~}^+?Dw*u%T}P}JKabS)3;Ex%qD<(6tZJpftE>N0MkQjp+i>5xS=le3`d7yOApm{DZXj1124FVk8FEec%7*k2p-*B*Z=DxI6kvb>T`>uy| z2x|{l>1nWTsk-G-PnDv$aRa)#*jRTz=w;D!4NKGr%K^c|0g5!*U zwI)FJAi9m7;Hzw6`0;gqH965u*uk<{bL%!xz@%Qd0Te>dp9^RXe;|=(VnND$KY%`? z%Lz?4G=HV!QpBY=rm{Gb^mYF=HG!4IPK<(Ah2Azg`B58Kau+204i7~7Ey~`$4-CWe zhr$T)#ZB!X%CBEyGsSnT#V)e~X`_fpNWE*eCs*pJcow7F39HWW+Ddz9!{yq}mMA0G2WS~`^}xN8^CaCsEnW!njt~^?vV$76RKA^3n{Qk`c6Pz5 zRuU=Ni233QBrZ#cWK{47E{C-=Nh1~1egw0M;btEQ^-@L^-Ukj!N*bAPnI~|a+B~1f z@(LtaXsPcqZg+(x=~N6JLz><@+B*Z2y|xjdJgeUBNr zlwb|g?2TkAutpW1l~7J3^=Z-B(V_J(OmDU;V@?h@1j7|_u>A#atR$r~i~gY+d@L|a zIF#%dq)JK$HlmpPHRmyof6*z)q`vUy9+HAX4C0U0zd)`Mp4Bm)+3pbrMVq_jk$`uT zsx!Pvcww%NPM7ggd^4p2k8gDTg*-}H9s>t}R6*#cCP*$ToviFrO*iF*G^NHG3iJ&H zKOmFB+bmT9nWDfBOy`;g0dZgS)Pn$ar^l<2om4g zvD{yEhA3ZyKax&23d7jJj{sWLLrZzGx}R^niB zX?^U=1Y-zrmOS>Gre(r(WZszz8Gv60&R!#MrxvXpSqK8RtEd-ONpvHg&0`Q+2C?7z zuk~V$41?s}c-`{_iNQ0&q%+nrzP_c=nt6c4fY%B%yWkpE0`Ka&kuG19eG}rh+AjcTe!W zSrqKY>0OasKrRUpK$dhAQOmCG(W45l1s^{wASsu|aKLIOji%w@NAxK94L>0;!cmVX zM+u7~0eFi32@t6a^S;~%hGfWbZRGbl7X@x-EI(&#PyHxVVX6hbO?`KnwqEOlV_^M_ zYB>MnuQtQ)aXZ}K{$R1s0KV}6uq=RJ)TfYz>&H=0-+m<$5ThC}Y`*Itaum3JHhSKU z-*9Bp;`?q-%C$aD_8Tg%RAcKJWx7K8$!>u9y(aZk!(sTvUKj?^GE)E<_R3}eacLc} zEUu*D2$+4Zs*#NpPMxrUE${h7BVabiXOJ1yhD7SYp>&DZU|XD1`Tm{`%K^k5fWHyo z>$F9_;n&qA*R(|;UTlX_WzaiRsZ9qSdz?F-U$vRu>+3T zkREJs{Xo&07p^=1?bB6CtQA#>TgWH=y^>k<8#-!KJgB;eHvZ+>(+-;w-ZjXdAVKs}NJ>`)Xt2{tsGtBh4}W-? zyKsRc6qmnwET`~@6-(}!KasA*LD4Kw(w{11f1r!whm;_cwCT{V2?m6NeEXW@={CI` zQ2%XXP5j%&`i*As3{ul=ca1!wi5S^Gd!tF4&q|r|Uv0iFPQc+d{g1b}K8R?~7>Cul zD9bs$Zo703-(T6A;gHoSE{)j}3bPw({983)$ zQJFDHqNRh~zbbV3by9iN+U}}dA1F(cW1IyeQWVkTP@{Y#jU<}qlp>1?lYO}MWGDBo zmn!*xe-pYkZl(CM@yiToG9=iQRLzBePYd9qI}!!V`NK@Mh2nxi_Z1AJO1GXUA}Vtb!f=4#ja-=H5DW%#@fa;&4F(gBF%v=^ciPI* z?^ZRV_~MEHbf>wK?W=kYceii?M~wZ^5T0pm3+2+mgCpdN$$Ay4oX3i{PuK|m9u(v6 zyMM*>dm_Yp{0i%GB?5>A^Cw%sbWtn>2c~gI4eotjgL4Qh>s(9nZ&Sb*O+~=9E^U?o zY_^OV1jX6+$f@Vb`;L?Mw0s2iY-Q&b2GoE|G}GYqhmE3p-H|{I?hop%WV=YrcNOa0 zEB%#~a}w>sqk&wdx!Bolrv<}f!iF@WE(HxLw!Kry*k+%*nEMLRz;@xu>jaQ4-uQ_iQK5*z+fBh-i|B;a;Ws(K092A(x+X2->6uCT?Dc;SF^G`3#Fc4_ydWN=fQqFNmF>v#GwmWkpdseKN;r9Jj~skDy}w78j5Gi z+X7pDt5=@<)M$1^2VcRq=E}j2Of3hmK-*wsP*W7b%hC}e==5jk;^9%m^Hg>T$5S<1 z8y}oIV@)3se8N?WJQxRdN-lr~2&VKjUKV+wS?AN44`V)xNpnQBB-)S*%%E=(X_2R^i)*n&+GEEMM` zm2}Dwi!8J>OOnEH=#TQVMBKlfs^+hq9>g!G%JDxzRkwW|BO7$8l9(@a(4iYkxk4;$ znwvw^Iiju(MT#BaQ9t*KW^APAP>LA?8X4DJJlZ7AS~jz?%(Dl;NRCQTd=BT!dN~yK zJcl7PNYl|2Ip4HYtf>%t|1qmnoZKmLFX>F2TvAyG>hreaE)U4J=HvBuLD1i^E?8Ac zgErhG3l_7A$&W%@^R(N-CyM9oOobzy7W`1_>7shChRwJuW+S`+@HT*-+Ft?s*>^Xbvoa1 zeX=vP(>!jbe$ZTfm+NpY%389`$nB0DhA&8xHY;uUCN_F*K8t*pT!eHrPJ^;;JWDCZ z){UnCBY{O&2E)QQ03y&s8H+smgv^`cV##l`Ky2*tvz++Xo-?~e*({U}Y2EeXnjv-X z^oSl2_`JFET>f;H$8ZWFL=Fil(_K%K5Q+C_|2rS6v|J!}59^Vpe+X2STNN=s0I*88 za+ZAE*WyL5!&jT9ULum3#R)NK%@pPe0sKA>4n0;6e-$L&1ht?<3IEU? zP_0pT43u3+%yFOqtEwD;>3Kcb3nBc?5)|G5St)^Ig&?rd^IpysYVo2i{AXR%6Qncu zz*1}eZIbtS7HT#+Du&2X%^$!!{WD+$V_ND6r6>VlWL&2oL-&cYF3<_JL(kewd}P-U zR(gQjUkx6!UgNw2^*1vm7^vBcyX!>HpOuuSkEqrj?#w9$JK0SLm(cMI7W#}Aw^}|4 zI%ReN6aa*>T`cY+FqG@^VXGuscmgyWG5`Yya(M1m+DtYXQeUtviTGnRmo=jrgph-} zCpI9fxR?|j*(-e{8Cl;B64+gAz!5%X z&sY(7XQG)|zL*pT85G_5=DZEt!Z$go=i5asJNKg)g)WRjq4i7DVKk)P3=xF`{mR5r zfwws4ESE@;t%zOv@5e%;TKEN6MG9?2WmTh$&jS8lVg9$QVBcN zP6M3qD-S0WJ~^x;KzpWF$rO!7+*jb`@(w|J!ogy@BSkQ;Lvycq^#su*hN?kR^)=cB zLZ*<}rm%{99LZ(1b5^){g|vMIn<#)5kB~arCgR9zlJP}xL04SANbcwKv_(8Bq&)H$ z8_3RmfB{S%>yZKKdx=0^sf4L|x?ldCHK~OICFyt&l*n~`Hf(v5^~iB8+AzXkpi-gF zWe~6ae+8v_C=qZE%I`dPCnnc-v<3YMl(vNuy(TYbknb{ZiWWM;577YdQafs=zsYB| zWQOTJDB`M(JJws=Fm%m3({Qk%5di4pkOhovlLz5~%pVgxQ~*l(QPtwOTjx3CKTtGX zhLFQ;H8<+u1i{T}Y5$FvI){Eo1!lJnVW>EX7JC^cN;A2RSs;wsuTa!{woR-;CrR6~ zmTOwYzJx)67wpG&0eX6{r<~nD3eHd@6#-Z0?3I}5S(0^=V<^wO$eK_XW}bI^RZ(OL z0;Z&Uv}(}P;kKH)=*%)VB*}{X0Mxf_t&elNoqgLJHZ4V{5C}s=(u8BfVy!T1S8`4H z)+fb^PyEmhoDuB(_@sySeN3wpJEwzn4ZbD2j zQ}utNrQEL6bzuK(OA-CcmJ&p%k%}|^2bbbZT+1eSgmD$=5>@Wz6ODQSGU*N(A9YhC z{gvDuMr)Zklb>W}lqjVYLPxq|_Q&e*GGJ~Z(hgmY--SW@2P>-|6b}~LBTZI{ElOe# zK4_poFGNyY|18?~IpWY@0qv9K4+qLovTfZ!eA9{N(d(66AOk9it8+u)t&&s5Z_0J_ z1R~ezT;B!#VnGwS6y)@zfSi@ZcY3!q)yOUcJb(|X(aP;EO?zK$S82uIcPRq&fQVxj z!Gee$p-Vm#iTX~ zXpHJ>OfH}hW#X_H447YCdI;^Ry7S?bRFu$T+!9ApYl7She$CL&*WB!ky=PPf8ZGHhGbybot`=zx!*xVZO0GQ?K! z#M(lctsx=0D(I^`*d(ki1eTvLQJo{HsbE{eG+IKW#O{^f2)>4BR&2o`xiw4_u_4E( z;2aK_8ra3NqCN-r1qW+Tf(({Prws=x;W z$Kn-^U8nRy$HjX}bz zHnV*x0odyx*!@I+*C>s(ui4OWy6^vL*##?LoZ}fjAwdq_!qTOclvr_uay~k|nY@jR(bTysJ%xxiqI8%zZRs zbOrO>qm9U>;wD-pwSi3XoUu3+Q-L|Ot-1i=sRh~FzQ7FfxY#f&EU(EwgI6P9o* zv*IFV5;vPA_`m%9GcEFJM>oG)_y}j%OUgdT!rpU!O4PQK(~{ zcgN8lftPK<{v_*qhfhJf{rR)|ECy3IWu;b>EZeT#{@xVaQq(mr*S3WxtW}DMx;pzl z_Eu7$wv=LeYiDxe0HnX!c7A@&U$iqo|QhIWqG{ww)+{5soBoa^mc$(>~jM9 z0g;K1ZvD=f3);IKQjtj@dNG1BHgwr`*rnTk#}*}Xw@J{HGwj)Y!M#$+A5VA=Vuy#R zbdzfk7x~BT>+j4E^P+=g8ki?k#7+1z=i;s6paiC2Y%cP!$Z{#0_S0TvP{j?lWozA8 zZmU<=o*=V)Bh9f$cF`wIgXdwq!(H?cBw6>Z!u0|tOlzMbYqezL04JdtucMyNB@Ak}&Ue+ew=l@xTrdgWt$z#k zF$3JcLF}ifp$+ZajCtaWVLNHRoaH$x!aR|sEJXy*NSCk)QIIcbSu!%T6s$nZ#X}02 z)6=v-ID$)zu!*}k#akWVNQ+lFAk&DeRpO&AYBdn4DI_F`jnD7lxv-?-U=6MPOy>HR zRy6AIJQ$3wXQR(n8~MuGfZfvXoo1=?$9^Pew(+XKxxd$IaGbAvvSGFQx@ASXj~ca0 zxks1EMqYDM*CpR_0#U=WqgRIP{JxYUu1>8Km&>F|jY4Elx5_@YP#V=zR!Ow9s-loK zUst+o8F=mSO~vL$_x#S}afX6NFiWwN==@hP8lOX%T2Wkk#h^UnYpBNj>H&o^ODCh- z67Q?^b3Z4^2<%l-w*k@iDZZSg{;)aZRTbg=3eJzqDqpj(FuSYlSa+x{53#=>BHM0n z0z{YA1Idj>Nvk-HMKfkqHk^iX@>etI8-{9sens$yu-qgIy>sz9WU3N2z31PM;T;yVcu_2 zDAr-NSd>J0oiA3X-2Py%vs`<7yyV;H1kZCIFOk7{Y!JGi4{y>PO&_RT++JGhZRoV~ z2DXV035Hz8Fo!haJR-pm2PIeWRZP6yjlHpvFfxmeqZk{ZXiW&I8A2IJujC8D$#cXJ zK0UdC68nU5DeR^^Tkx*;oMaz1~!b0`{Us$lwriTC&V8{1Zsd! z$&Hw+Fc@pXs)O)Jnc`OiGu4psxxHLy6{+eh3Gz1VEk>ta4;~zy@EJ%^ZJyA&K+VY9 zNrM@TYN~qxFuH=!2!Qzd>r>71)5ByFw|@5JUGTx6TFLl=;;$Zf%!B6{AAfzSFDSJO z;z4&Pa>svQmgdmPTN{}96kBqVGzm3Y7F6m4W;vTTF<<(8a|fQMHXzlkma^wN0rkkb zmV;U78<7BO8Cq@r;o6=sc?eMR$Fu<49DQ~_RTyej~k%1NVe6jFu&|hR*4{n zpv^Io9>^nEDGFMp$PaqIUuoQMMtNPT>5^g7eNwj6&PP4{*3mjVAya|~h!}3~-c=a}hcJXu`a;;e#+338EdcAm`0Uu!b$nmc z!sNs>JP?yn34;Vk8qx#gNS;&HZj$9^y}pd`Go@^Bgu3_>>Zp|ry`5=^0o|CpCBTlh zuo&=Tf`7;he>!&tS=p(GzBz_(fTUYk^nAeeq6lM1$IJJDJdjle|Lj1Znz%y_JX(fcUTu2gKeQI`}| zi(Na3+wnL=#}1{5eXHZSr2%=^{Q8EB{l@D8w0a-`_2`Iy?M=?9)HDTc^Q(UxLY=v$ z<+i~Hxei5t08CCm2q1a>31xPWNdbTqbtM}|+g>5Y!r55d8LWuiwpfMumeqQr| z&5@CHsnZfpiyAaauC*ok1qtr8>e_@gg(W37Nzqm~DxTmKTVZOsk;Ep40t$lJh=r_0 zrm9@N*?=;(X>m=54p%M-J}sj9cv$}5_wna%@=zpjN}yhgGcu(!c<&7fmYje^MShFM z&O&TSsat<8$%b~i`6Ek-AgteZ!P3k5SXP8+e>^D#=-2mk%{#2s(m(!k}N)AP(Le>l;G%iubhZHb`^r`&$cD_r+avhk-O*L6`4i#PAA-1Mkaj2#FK$h%2i{VHzd zjxkZPU$H5O&KN$DsEdaksVL$s@#mlsbObe7QYVAd4JTajS1YZYi&tY!%n`USFvXo&LXm*jGoQ&btVGYik5V!=St@LY0h1%6}Aq09mE;aSU{xCaNnSj>J21q z^Ak9X@p2#nDkr2H&-9_?^=joNGV%|8?y|B+Rniqq+n4=exW@3?gt9$JpGtCrT$2#4 zrU6M7pLpRe&N$48SRk1w)2~V^?Kc>{2gwwGWaPG@tSmHS$HnM0o19>?Qm$Ylpjbm)H9YkFkZz7s}43VTl(0%56SVR`-b zJjUUtX4~DY2c(PLdJzWSU}`p(m^CL**y~ZgWcE%jw_6H`Q`Oe2_AIB!c{KsXy%h4Sm$dT%67Do4@A?6YgNMT_f9W)~q>v=*4t>94wr@Yi|DUTdaps1lUD zkw;7ZNB8rxD)811a5B6JaU@kKXaSgNARzuvZ-)yIHl-pd%i{&q^?UIsUk7(V)N8!& zw~p5Xu&FpZP5SY|uv9CZ@6VTju{O6(pi`!Jcw@95-&&!7uJOOhRJGprsS05l>csxB zf(HMq6;$?R1;zMoHkNrFUfBFEqfUlu+H<- zT@h~zdope z4C5Z;#8IleR5HJ2P0(C28onD-D7f0n`_%o-dTJ~j#z6a>*Mq#n^;-(7Dsn5iKO!MN zq}7tir|5lMPzWXn$51RVB&JUC!x>G`KjLD4FbbqyhlZkdqY^MPUNHv1@M z1~}1&iMZiQvw_)i_j+NGZua5OstpAGj%$>71Fp~l&l*z*&Y2feq_rPQY8fm3mcEjQ zcnQr}1d{Xz1M!*WT!Zwj8yV!#xYSVRT0Z@2Y@y$;3P9y6fTz}g+h2~oTc=GPQUnUX z1th1Yu5Pned%^4(nWop{Nr}Q(q`~`FVqL@^!%DQ1GAj}G41f`;dV&^3ko^iZi53^? zO6(kN&(9rfCfS^Aq$9S#yE(oap(f7J^jL^?#lUq2N&vv(uEUDk60%Paci@qh(Qpfh z%<_B^@6Ozl%BzUyyFs;=isDi$51?f_lcn$rePKb<{s|6oBX|Hgtu zzpx-@{4XrX{y$+sX#WEXiZ-kX`oe-(@fAIYGsgnk1@ysd`wXn({QCN-z`8;hulH>V zzlWrtP5=jeI?WocpOjK2KR>CPo-(KMi2TG)e+^K{3jx)PS47y4Rfy5{HPFL1J&iHT zPr9N7?G|$*QvpXRsLzBgj-5m{O^xpO+0sTW2i%9w?T}8S{=4jF1xO2GP9MmxZd`^0 z(s0@-`j@&uXm}li2&%zlKxbMYSektJwPyV{82!b%I1+R9wWYp)VL`S3#)8!T0}G=3 zU$7vEzp2Afd3a3bn`bB6sB89^*^v6(a`~C^#6ecrT&8jf$>_5 zlgoZpixV*H+sBYB9z)fUqee#^JOvx++KhL;kf}0Q2!$V%PNte@D^4)3XDKYb#7p^j zCl?fWK17F?zQ9uZA_zG>Y<>f}=crA60kU6AfzIBOJN?=Yv&bQOPkcCRq^1Qn;V0{nI@9SRRZ1;kg8{^*q{;xd?JJ*qTGaO1J^aTlECY7!u{?^K(jL3aTD(jxq+`eWc$sC zMDzG4xY+p7xmir6r7mwH^&^m(n_qV7Mipb{abAOC>`^{}sTncse4LE`wY zfs=8P)Th#4SP=EW9M)f0Q2G}ZR4?%t7S#JMENJZu3p)P`3u=M=uUHV%7Z#Lu@DCP5 zRRr~g1+6!=#C%~vcaL-bI~JtADH(CGjP*AbA$g{kH4^> z%V;;THo9$mfdNOpmL@*MssBu3r)t>l@ zX616_k!rS(6^2WWj#QZSB6Vwzk&ggYI|lT;bWyU5ho_7ze_M4%z}hTt;b39@rgi9S z&WTi1O7nTC$5)GHDDfVd_&2%w3-?;+`Y^l#=CJxzh)?NJ_?eQ+RgHB~5a4y5y1%+l zh6tK(6Ta?qYShbR2B`ReuGuCzrBiPN_m-{mc>QpV4@${!1=slGVYfeLq2`nKjA*AY zcJCRfx2MV7XpT_2zQyd2e7_`olhb`ef#uf&E8>N-UhMJG^oNctJHS2aQfLjZ69^^| zN{#s(q=|rV33Z93F}~T;SV@V|P8BzF$IGyP(+Wk%OG>mx#U( zdB?muoP9DXe@PN_3vm*DXxgPj<^Vy3XD(G!Kon80Kvl`r25oBzQ0_}H zagx#Zj$s+MX$_VQc;iOD`%HP1B>n`}XM`+KbVP}pg4pfWpe+LOM{m?VLv;W7c0%l%rnx6%2L*n)Bq!jCqJ5AYg zX(+j$i7C0*@qI-MvIyB9x?1tOgHu^en;v87iI^rIRFD`3U`xPxYF-#Xvf-&Pdn|t= zKI8kD&O`@v*phinuc2}{A<~pp?L0KUv>n=SF3qi41$0&B+;e5| zoj%^Xpx2h7;*5SI=24+H8NufhO73cEI&SRj7;w75tYrX+bHk3C7-yje)UiZhL57ei zWgAF?&9dxePMnByDoT7;au>rHu&Fo^x`)z}4hOWSloxJZh}4@Y<)WQ_%ZgD^-D({F zR(RBz)1)6Ylk>r4O;#wtAd`ZN^8*thYaH-Pwe+$>_>qFICLlxQq@`uvN~BF1cGv(B z4gs_5S2UQk?(!WR9H$gSJK#K_z;la!lj}sbqgclWwP42174iQnK2rNQTMXW}eZJOP zP2WEUX8YvsDSb6rv_@YmlI@WjEf{&jyrhM`!dKN-VP@e$1_lo;#Kw%Z3sx*bdF?X* zIG8HirD#4d&oLN;6biYOm){eAAwGVddwkE5-WXH@ElT0V6%>MVfB3u_@1EG54|`M| zm%Wkjkuffb2Z$2?>?;J?nooEMJOrUttiMTMLxg0TDOrdei7+ML*Q80Ol@(?_6Gt9H zjyV`oRiY)vP<)^5y^uDsAsw*$*lYlt#P4!wF}Z{!oMbmbufmcyblp?q&*mf8dE(qK zJjDln${l2WqsZ27;;yL_CeiYLGDwm1hq7B%urY7dF)PU9LyJ9(yYLx~a6A6!y}cdd z*ABk3d3tXd%eRmuVFj8uQ1|;Cenxmg8Xnv9wvYngO;9=XaZMH~&undSeziuN$Z=@$ z`PD^UbZArk;cH-z zyW!6ll}r5|T)IKxzhTH4rV11aHysFXUI_NnsW@Md7BNH)ktXK@NmfgiNigEMLy<07 zG6yi4=#+XeT5t~%0^8?j52jaBCyhC>?l7S`3g0=Bl)#>%yZGbhXS`ha{VCP$kB$b^ z$vi%9jV!Lithe|)jF}A>|0>&cHA77|5Pl&Bbf{e`S-o-{_tAd&|L{oO;N zz?4Ct@#G3D6u13Jf=G#g^3>Lxr;Th!OV20jwYah)=P8g;bx+(?-3J32PXhRL09y(M zo}rK`{f{l3VS<$jT%@NWoBr*FajD`}$`UmCTf*P6m`@;lClO@T%mo;p-hs7>dqxE$ zgc{U|0bwHCrQX~cuP^jqVqH&o>GT#B*aH@ouyr}&E!B4?T&4mBJ>%M10M}9P=l9*` zQ0EDbnqh}Pj4r35uKC&p`gnbIXMS)!-0oYc)$Z=7AFwjPvNk1zM)fvVnKoUYbDqM6 zpnZa;#&EKr{-XdOZ?I1hXZn5^aP;7{k|1X7K8`cbA5pI#H5*I8I@5hZH8tb090wUL zT*A9+eJKJ<+2Z1?M+)c!)*>L`#KE~jHxVYY_aw3^5&FsJ)m%zWwrOVG4cV%1*8tS( z^z-phG5+82_tCT;nENCekBoaU4#i|&URAEjL97y4@#aZSGsOz(3t|*$L!c zwhM(_u?Qk>Ntu;5CVI|$lf2juS*DsH&Bw-7w7S_Y3|R%?gw1W^ zzZmUCkg?h-7bM)tiq)4#cgkv6Q6f~BO@(GSb@T)z1-WT=9dMw6lB<`{->U6wCQL>C zgn!G+9z|2sHnj@gQ&J6ta_3JK9f9b7E!V@IO-fV(gKL@$BUNa%C+)3c|K=F5n)6)` z#s=>T@7dOcCYPC&uxK2fV$zZuSjxeRM8HLW=9>^FhLzW_7>bqTwWdwGXN1OSsk2eP zGsJ9FaOiF)fJ^X`P+%=*y$rq-&ohvcV+32YK^;MwBJ9FU&9fn(isiqk4jC&KzTIw4 zxO2i`RZD^kJBQtU$Ys#4u_7jIQtP)9?8~5B3p07F1zaZ2g_2shnOE0G@T=P{=gMbY zZFg9H$QX|fNSd(WY#U7<>prQ@OLY_wy&_|oWa;Fn(K9|aJ7S!|&wwU7O$POKrB2S| zBaHSkyE)1~qLE-pmzZ-)%bE=@?d<$WJdRxn>|xgp`X;^_LdkKlV18O{2AD>K(bNdK`S0fB= z_B*zW#8rD5_nEjIL`>?v9Ef0Z@dY0!OzZRY2bJ-3Mw7 zoINxOHtM1PG&TuQz)$Enmo>MJw(~%nwnkg2^R-SVt&>0-EjQn*$yUnRqj9z)0N(zu zS!QtB02_R3nzfAOVpko+NYasyD9LuI*%Y~&tT_Ppu=y2C>43c?V^Zbx@$V4#PPHZl z#^CZoiBymhFIXRho7Px{`cy53o(}`k`MI=GaFYURy|{8}&-7ke)zQ7geYTQKFYGf+ zrhQ-|0@I18W9tU(@ugGJ82)>ObU~su2_ouQr}MO!vR zg^O^46o!fEWwEa9wvDEQhV04axtJ3e8E@n0&Ulz*7|{zIaD;uH>E4KySjgBnV>#wx zk6H?NDOz?K)awO9AmgoSq;TG*yM=hOX|n`P`9d7#7YWP^q5g6igi?50YP?nzwn&4_ zQhfG?&XAHmP08uxa@4pF6D^NT;s#5^j?qY#gywQ?@1lnR(8|L$(ScWKTl)ix^ZCPz z7O}6XB}@{@n5z51@l6s%p_fpaFD7f3vSP(aD=MWak!mm37E_O~9Et8?s6i|NJ$!#A zuF$9Fo>M@q;kP;a*lPuM!fW-wucZm*G@89vtYK9PAV~Z9p(6cI6N|pRIr|m2|ehYMhxQ5TGY)^=>oFjbbiZP~3iWA_{MFT1DpFS`e`@JRv0 zaQB{|<;1qrucVc{I1jxIa3M;!HuO4=F1(P`S!;x^AJajP_<2uyQdm^3a%B=cH9p<4 z;gv~Ej8i4H+3a^m+t1_cv*GPk$L?e5=h{pC}TvNv2(~iTRRfQ%kGy6r^H~w`ZLnO?G}ed8`TjU zW}?ni z{3(Uh-Rl&e%dcgm_I3wQX)PYSoa$xl*~xLQprVNjfL20vCRgakEL$K~w((mrhhQcA zmfs+X*;2eEp7}lT%%C&h3s_5*=%P&}l*bXO^M&QPh846gXi|h*J#jO@Vi=!jQNl;q ztj|u{nudQ63F3X%XpUc$tfct4!RPa!=)$YC1{*Y_M`ZJ3SMTE}0YAUS8gs_wgvN^8 zx=)bB&Wuq1;RuII4VL4#*`Q?MkxFuiZgs*jBPyEVo$^mV51yM?3DrbQ0rKxvAQ*}X z^47ushpu;ut}N=`HIs^Mqhi~(?WB^5ZQHhO+qP|672EdN-}gV=XLOJ5G3Ld(+82AR zjX8hs^St{SblI_SBW4PG8RiOU96^A)NANbG|FN#eIBs^$dr4ts^#|eIHyVwO*@TN7 zy@>0!-2N-uv9&2mAay(=r)*ladQei3s_9R=K`oDv(p+#`f4%u>OVr*n47&YQ}(QPHp{K zpgO>%3uAh;&}{?dLM0lE8xMj{$bSq!_$r+_c!#`1R&%v%C}a%@aitoatk&-U zh1&SEtVt|BpZ+UjXr9EKE;og6q9d}>f3z>H3kv)@wHMWYnK||RZ2LnWB@{K;WGEe0UhbOC7o9HhX%fKNM^YKIY9k4z6T&@IpBvP^CD20Q5bcFcZ|iYPnY^?y;D+OlMvuJn}2M ze#;ACgF=T~5F=(>uwJgWS2v^1`ry^Ru$-rk-v9hPfrJr0b$;_igP8V`YYAiO||uJ%W6qj_3VeZWX!aVw=Aa-Zc<@tgys8Gr-@2eVizW3(8~Z=(WK0 z@3$ok0rXWqIS-wDmW*=wqZXdP{-03?7ZGuTNai)#aa3M|NlFmFLN1-DN@BpOMR)B8Z>ByXq zI>wopYgH$`7zf*@FB>~oL`TPSAn|$#vXxh910jB_tE=;9Cp#JPooxF4nnn#a@$Q_& z7ZavgKW9*E7f*Ahc(|LFvMs%jY@+G#I!dIoYpm5q+O*A3E`i`xNELH^K10qd%93-4 z)x}k-qB;+@cy_p%yem)4jt@>nFAStylvC_u5xdP6p!snl^X%e6`sF$5(Gc=0y~oL` ziF>Ehy@~&VWKJ$>xLXd2^xf_Te9RYeP|8jM#7w?M!70mORgi}|)bZ3cJjc!T<#;E@ zt?TA$s081uEUFuQzkDG69F_!D#%Rb<(`+(Tjo@p7F zrOkA*r~0Pw8je?30*^2=PcEebV@sB8PBj4OD2Xb@(JoMf;#0QDDTr?2N3g8amidVSQ%!cdw-=n20k z#Jaxi%y;%clwxHv{!GTOTL?R`wu6`sh*dcmFOy<46I zFYQyXZMgOC9JkoTPDb>L3&49c?&RwE5quee_850vIFsyx>Y`!W#U9}*PX4*Mc{NP! z6^gd~XX`kVBPn5=_|=Z**Ze~?E=aNj9Sm~Wc|3eIK5?~p(kk+sw-zYLvJoR9N&<9% zhJ-BD1wU@_nWYH0QpT~M`*;`rdU(ZTgc+94nkA(Z`~^864s@Ig7b~9BlCwi(3snTi zz|EV6v+z*m=6r_P6v2a`sURA^x}1|XoD#Yz9Y}B=lorNdx0;?rU1Q4Il3V2IQQUY_Jv4~;13dreO0i#jz(jw3)gqR(k%EQ@7tsG@chN7W8p%>#$cu7@Xfbv_upJ|4V&gA~0e3L-=|wOx#)Q#Ss+pJ==F zHVLgF-BTuL9DWqiHEm$Nme5><=R6t*z@Qec8d<8@@bK}H>qb{AU?ae&NCga1fr|r$ z&Gf%BKJFv9zrx@+V(AtB=}1H9%mXniBPw0YOS&Gv4s7jiX_0Tkv|21;PzG&a#Oi^+ zl;^Wz&e+P(&B#`oy9&v)-4HDSNtRIMBMDfm?XQq7qZocm0t$Sb-_GS>W@}b_y1D@S z?;Mzy$LbCBQOnra_T231-V(!)M{i@|SnRA6{qO_8Sn^?G8iz&~+OY_>b30jTN{>bx z4`)`qYB1WO`=$Ml#FD&?z!u44QG-83LP9a?qrH*z0`@h7l0^uA1PvVcZ;=*!%@{(h zdXLb!97PHKfG&bX=U9ScmbptT&?$A_F8YW+Z{IoBW(JAlp5bXaf~ijbnxk#9;IJ?W z2t?5Xn-Hp&7%PO|*WR|+3E3($5z;?39}}EkMY$5s&OE>=T{ z`bJ#0TDB?sH8bN*@BZi; zd0>`$V6?jD6aT6wWbS7f`;a6J+k~MGUIsOOb^NDSLD+MExR@Dh3NBT2<%2dkN_nh6 z+w&Xsp-mq;q3AN-a7MKYme*WjKM~-a(C)M5)Uja(vYir}3V=OJwza4N2DYfoNtT-^ zjiXixOmE<7qd$#^{41;1hEfBE(>zOd4hUdckaO>CSNK4NcGnQ1~ zEVt~LM#<$R`YD>eN5gDfKAlrfOk=V=8WbvyVpjIS%(Z$drjA!v_Asv`o}pzO%vi;L zGqa9XRGJBxsJ30p6MV7MBwn;2Z;E}=zof?)l@1_(4k2?T2UXyAZa&SD2gae{(UOCv z9|9;QuDctT#{|hlC0K|@1sK?EM6!dPHq!p1m^c}3X}!|NvhLK%(ZhFU1Slrvw@iq| zdGH}cpATqxXjC!r^$rC*Ug~#!u&s%;d^Gw{nul?avKcRT-sAqpm8oyU{Oy-y=7uZh z=c*jDpRmUjC2n^jHNTW_vn*bKz_C!?g{HRl-fwQvc{Jw*>NL5*wo1fLuj?cl5xR8< zPqm?sX z;?r#V#Hm5L$zmAUGiLN&GHD_zv za<%8mi*?V?U6^`n9g7(gp1A|*#vX%4-pAeLN9JM3>z$H`Ek=&5h5--v3Rj*Q@`WYx zoyuvGi%2X8VuRB%sTQKxaymCjwbEX<;+sMrP(W0g6TdGLWk3s60SCxj}GH z+3--L(BFRvKbs^?Au#S(+^mh^@*l56Jv=F`wF`?zon8yeB{P#O0k!>oOO&QgH@FJ5 zQgsL_wTJrVJOsQTYdl3XRC+plbHAZhIN~N9BHVx^HOXwgrZ$?Oy};yq-r$(iGOwa| zg{3~Hs=Zd9Z2zXf)6M2>RLd3jKL*PUsb-#>@?Xo`ap=eHkcmrCAZK0Fb<2KoHWB$( zg(UWN846>1l3BZGxByLr`~R(p7=H;8=9L2`tGDNE(+_}IDkw8TDsn&b4P%)hc2p6< zS>rJ$lqF3HTRZ-9^Le}WN@RWx&_q-vwB;Y5g)L55MG))Um9Z82n~HI7C{Z275=;~n z@J2yNp2X{^H)CT-*oKz|D`UojsU?}s!7OAKtulM3F-?b6;PAk8OEIpwl_I|S92FKQ zD+Wh5*p!E~LN?!{Gs^K<3|=+C&8=+M4Zga5J{a~L=}1)YW=!@i5W-S5hc!*tEV)OI zm?=*iXn&~AnQ~V=w!)YH7@~d=Mtmp6c~w3eXoFRMeC@of_EWJHw_KgZf;>Hi_PtUX zDN$YMj@8B1)8p*pGiSgTUqvOK)yZaDSw4;jJhThCO|*5A<>dd1!>O&}@PWHK4#zhN~H_c4J8AIhl^s zU!5}q1sg+>_E7mMp_Nf%U~yf$G4!7JQAA|Dzyv*HRw$wF4d}EEkyxoxOZIx zNxvRInh4(&knoDJ9TD=daDee%B_5u(veK*wD(NCU_cxNpq4usv_`lL=m!PshNgIRg zC@q83PlO&-RM1}5sgWJ6_Cp9F zE%FV`K+(6@|JzSYl0`a`&YnR+0&5tf;IYI}#WlPDNnMig)SzC8x`Gd$kDG^QAwAJx zoBD%_gX>UoHRS1H8_rbLufHp}`a?Ob3P1d|`~1&F6m z;(ZIzZC40&_j29kvDBORC$>@(WWjnNl{Af0Uu z)-Ue4D)}vT=Q;!X2Qzw=@hg5(6OX+w-|lw%>gR`WN-gQGtn!VxVGb1D9ElEmYlbj@ zOJ{xRPC5H3DlFON$3vtQEI0O)lc;YIvp;!6{VsPX>~?S!rVTZHYllOb&XHL+*7tH&@B>sxEOqZ zh)^}(4O00V7PNv;Y~VFn#MFO0PHAm)Dt|3H=3is)czhG%AQtX;WviegvG<||EXgC$ zbf(y8+^03p)YllGvBV|ZhF#ylsB*W=KOVNQ#yuWwRTdlLtWIb~+wx>~C~LWHSdDgT zY0z+F71~e%6@PVaD*U_o&U(|~wrR`{QUL2%m?>pBCxi)aKRA;GcYisP2?seDhXH%v zStuR$ZonU0xGT>>H{-NPq5{ zCfG#y+;nhYh#3DVJPR-El$l}^-b5}pDcq%GEc1yr%Iy`aTV)z(rBqs_FOb{|YZj*^ zwj76$@2K=d+S1*8*-wUK`)(mWA#EPC6N3ha*QH<91$;}uPxF70E8ynXL#q<7m6om( zag&xbx3`ALV^a;=!>8W2%fO2y8J8aFd4LHp$ViJCPne%A^+SPS^%S)Aeckjo5G?-A zV51r|0}5uqzO6t>1TG;U%$HzVerE&1eBf@q`~undD)q+MS?_VZNI}>MKcjBTVfE`m zRxfB4e@BsPf|$)HyL=Lx=9Z>#_J-^iij&gOny*$Kl7>-K9gGweg=60XSrt${imu-E z%TOGh)L=-yJKk1Y?$6wnUGDrdB7a~*XQLVm&(&4NfXE|+n3xZZf+Lm`{xn}-Df}>B zHyMmQS4-`DAwlfkBC*DMW8)v)Z_Z&1ZFo8hKOY1Fj5J|CbzRB~v4{4DI0#g5A%E8} z2I(OJugjrNDE`}`x;vMhYr?Y*-Jf*@Svn&ddgX4we3paPBynpPDYy^ zgnog-BP-=_y!!Ljp3IqIY}PVE2ERW#_^0s;B`IaPo_W8~I zu8B!nIi=&Vr?SxWwJrx^H2Pc;p50DKW(6vgbYDT+F@l^1-HN z{8HoE+FObdns>UyQZii#$sOUwc-bFFsoGwZ*tPr6f1U05T@fm3ZQq)Mmcz($~^Z+kEEHKrDwMmUyka@49xC!w(e71friSN zU#XqEHMT@Tk?_#LIj!&p7#!qtH28dWd4Hbvi)$@Y@mFT>hhN-K-4Sfc}2-*Tk8kch0>T}SD(5Vwt^%FN8|0q zsks%?!eJZ#4{!|~NasJ`n%P-rVe{xHK^|kyjQ)|(9QddGX5EHg2MGx5gUa~>b?iLa z4x4o416e~N)Z>$wi^UqBJjne$Vc*8x}{7E$0^D+UsdvlELeA+45^8aWom94ea|1BFDL!$8qeqT`*6pQNVwZdkqW4- z{R2r5M64&EZ|AshhSgqX&E3+Kt@j;rMzksXb|F&q+d zuzI9QZO%Y?6d3v$+McvOSwFva5uHA}pcd|Gv5N}v+7WVTdHCtc1%}x9QjkxU8aJlQ zUuMW7G32Ephv1LjY1=9%+)>xz(rtmI@+{PU8yJ9{8JQe`eK6WQ(Hty*4*xqRljrfFbgItp0i1vT?!TkIL1s)4$uNQ?Fxo#5FW zpA+8S+E$RTjOK@&($-_NX?C# z1v{DkGDwbql3J(xVE3)v*=wXgQ#VJu+ZkgNFarPu6?J_t4`~NWP&PkX2E5h-Xd7E^ zRMu#y0y*g}8U!=O^1k4d@iK2s=i8p(Va%GL-nehRiw*E|D z%N|MMvw}JP^Er)|OyZVfs$dS3nX(!ns@3^gEJ1d&DiC(Y7%$0=md)_f4rri{Md#pi z4ir*eh*>S}>aenc>Tf0gNH zU;mdV!^Z&-Wgvt^UV%68mf}p#1&vhJpd5-6#3qJaOHp%bIgt%&E*4GaTdLCLJcS*| z_yYBUV$y*IKs7$!ou1D@o_C?arlwoUCwe;?_$*OPvv)WmkjfxK>o3nnmow3fENwi_ z@UuiY@8yZt*_3AuJ1y{6@T!*-H-AbL2sUHIZYXE_;ysqbFqvJH9Cj3W!AMR5 z)uy<8CY#{1Fq33o0AjrV6+RaF`V-iO5XxOJ(Lw^|;!etRVEnOtu}!Pe+QHy!ZGP?v zpoU-8D>qCL{2V!0+Zg~@$8e@{fEEVLyqo%cdz4|C5j?yI(&wmLKk%kS^RTn+OH;g) z_k23OfPqdpXdSE*0ic^c1cZuU6IZ+l{wJ3t;v>r>t$TD5${{VMEfS|HJ2O8U$Ur122rYvWY4Dw1e?>hXD^Wnaf2jnGiy--eaK zgva!kG3+{++Pb%sG%Zj~R?UEj>m{(4`BFDrH-SiBU#JVG?=)_1n9bAm#f`12;!^&J z!3H3>sX{dtPfNJnv(yP1qp!YC{T_xW^WC6vqpOvwkMmc=#rqFL!3bm=unVEro(NY& zR=iu-wZf_xD>R_vFBeSoKn;$M?UU3wJ;+)@Z>?%2$lCY%&H=D13kFRLUYu`8@=E|( z<_bW|ehGJx!U-Ybx>Aqrc%li}{{j49iCh}ch^K&GZ%EFJe6()A!8|?wbxL23@j2Mx z6i2{K9{{Vt9S&>5j1Nncr}qA1I5VU)JcF|lLV~8Jp?9}qqZ%|Q{g_b|CZaJT=B&t`tdAXv; zx0O)vr00-s-??$kRzZYxj zs)H8&>lvpS)f~FgwA4RXQnM3vOMWHE->|fW$}`#T8){FG&}*6soAL_W6AwdNg2a`*8H{)|dg)`s+v9^8i137=S2E z@3((#r;3~KwYiEInNVQGZ5hQ<*Oq_UIRF%^O9fN^KcHCd|A1nuhMpnk3=7oKwi&4v zkaS({XTj{k{r51&;%^cRT}=H)7x*|GPV zGqELYv%7KOmq*&$nsQX}O2f0ku$H2Z<1sj-+@c3F9F7~J^Oq)u4?F`oYD*^4j(S05I(`<}Cu4(jdeaF9-<_9>E>EV3ZUA3`S+btA_|>BlNmwuFIouKMxb!HA*Z zLNzjrPUhsp-4I~xtR$G}?~L&ju*oOxf2Y>+%+0WS!0%oAo$VDSUPpg8F?N#2&`nPT zaPHorT#x6Xo6O!OhWq9A&#m=?bNOY~v7zf^nZjld042y0rt!CsFAc&A7?jwZ9Z9>s z&+iW>N;A&_m|SHop{>P@a?dtsQDWgN8St^2>gQ2ZmX+)1GwY-$nV3pl?724f097?6z zdSr4XU{L(ZDLG2P)i)~6I4WA_Fc2-#Wh#!?3b?9@i?WTS3a*%*FShfIwZyPzE1A5R zo}L~vU-??B2qMrE2ziiq7{yY;?ck0A^^(@Oz91LD#B-K^e&_|gn!BJbD5hgo0i8(V%$^!eGc)Tvnx@DQGz8g?(k)XazZ_;7A@GpI& zZjwy$32RySm_Z2+tBL3>VTSLNDgB= zur3ug3$M+F08CGQ(z+zXpJK=514Brg`ZB)N*CnU_^itx?kKW5t8rFL#_k3t{QCb#Y z<#sbXMb9iXCM~YGyHU4~T)dI{`RlLF?CN&8HMks6W}E(zC+CmGKA=qgg1ti}VL9eK z)mlvH5M@&bnR9!ZWMip=Xxj_4dAYsL$+6M*1mt|d$eais-Jd|3=@ddNcqs#{c0}q! zF1MYKk1Q|9lFgTV=PTRC5xAh|-;)2#zcnEm36>iA8t+6?#Yh4?4YQW)wrSnW){k9J zC#JcjJ7hYE);eE&u$$bBQ=UlGt8lVsf7of^FEXf{%(==1FtP==u@bw<9s@fKq;3W{ z45WrL9S4(uSPp_H)e&w2D3_rR{3y{79D0)?SWmphfvQ?Aj8j+E<6=@hngv>=%l`_? zr7CrU%b*u&!eP;-32WA`4`6d?z3*8#PlJ2{SYhyAds{TdpFs)o?nlBFUxQp47Qs-o zFUz03Y$sB=E2}CWyMBhn#a%rD&E>ZKY@laBKEWP{zccDBG7udR)}0Bd*I}wxL#yWE z7aEk*%24LBC}gX^+ZCk10!1p$QG<{wT2|bZK(I9B%!mw!V-lg8QrcxFAmNE!SUeW0 z^sGqX0cY^9kkM1xg^|%s$btcdb}^9lpGC0QiR8PQV4kAd70bSQ`TqEe(@W5NL5kr_VT_w%7edz;(U_F?uy%CY6Xt0u%&Nl!<;lfzH# z?!JSn5Efmrx`DaG7H1Ds(my8coD+@U*!@c`%|mmz+erOD*C&_W1pd89_39i%*punr)*`ndu2>vgcDZi^}ACaB;{Usn)xQ(JmAgpWA89a%3W>QWpMzTQn zRZL5^18MdaBoAle&M>dn$~F(_Z1xa|TQZ<5Rg6hFjlCRBP)JS4t{-U#MJ=9|vLOgV zjlU@rP#RT^vor0UV48+vnzkK=mskB82t=YnymS|hyaON|y_HCC`+J$CuG;_FR*R5@ z%SJNZEi&il4Yq+?k7kk-)v}#jXG7`M#vxj)(o^o=#Aut>T^1Fao}C9L4mAkXyLtb( zepl8FTqCSaiTpw++yL}@8g1b%NhwqRx;LAtXXnHb0w>*ZD%X#;f)VYVx=pPw9TRGK z{7wkWmg%I|-n*fW(RyM(692|LBSF@7JChm;g}E!n;FrP>Hn2<4Ul~*8*LxfDv04O1 zMX27`xT~}38h9;gE~8yF9ARY1^DDDS->$g9?ih_y(yjMCaTI>UhEQcyA(Xu`rOby- z1Rd2aLJ&C`b>|V-6_HBmVr$7GBm74JIIO~3EGVA88vhz)Xtf_NvdXGvOYUC?Sj-~- z)cBS4_74li0fpc?m~{Oe=_+k)y~=c-$~0{#z7OkX08^h0g;Bmy0xcde<_Ib+RB#OyK|XF;wA<+#v7J(6qY>YxPS`f zd~8>?L47VMvm>1vwJsyV5CXb8+T4ujw_St__bY7lY>`vLMy5ABTxIQuTh!X z*Y<9>G}4iMMFJ6MCbX{P&Wd)s=Vvn+d5qXWo=3;+JJ4Pco%sDq8exYHRT+SYRW+=Q z<1e^2u^P9^>0hklogF!+Fo~c=zcE@$90@~FsJsKu@{`fiZGX7jt;`qa{_ddXe;Yk4 zoy?W(M94i0m1h#{GPA|k&hRJz$*ncDN$shhjf;(V~F`W;eaCkNY#KSk-@Ox7@YtLMAgRpkNTJZx)r#=_1| z4RQtDBbfT)9_J_p`3Xx!&lc94BW#j-~JrEK5| zm!K9N)StHNDus^X!V33twMOJ>lj1+<3f1uj)d~U2LTAZf9(7ApO?4B#_-g7B#+fYX zh!QNlH560|u**@97Qy@EU)lCiXQr#?r+YF=lZU^mDCMdAuv8$Qq``n5pMQ?N#M0-E zm%R8c@}1v%lF!f7WW`&l-rL#nNo;u})J6jj*$=VqNAeBfaJhUybg-=*%1rS|Ta_G` z+UIfkWc-GWV-T$?Z=&)JZd$LSZbugLYDxu5+tV&K$C^g?LWCLdY z-W0`mmy87o4-s!gz1GcCyDD9vz;H|dAcWMe(YA#8YOnO-qgAKrs;N2$b{OdOcrCgp zGsSWph^EUPu2N5ZEm*$+XGv4tfo3=am9X|raf@FxQ+#%{y9rN8Doax1WI+*WAK0#L z=J!<}c0T6kP&O$ZsdEyD{{}^~ic)K*XnY-?q7A+h%6?xp! z_WT22p32x2YR4GEXdzMr0!;Q64K4?$z4ttQN0z39fF6sRF12P-s76s+SQ;k*078Hr zL{84TYI+-;(#Wb4_A|AhihL}yF1#39mWS%E(oaQOO&;$}EUQgzZgFvxMG}_*2`LRl zt{_YF)L&$OwwyYtY+5cOK~_MJOPZ9NPymKBqvanR=b-G$&wVk23o)Kd0_I=zcT`tz zUrw71Q5dHvl)&Ggz_0-vlR}ZL*#L=_Z~bS|w4+-7fls}k8=2@XSSIG9|TC_Cz?NbCQE)CL|nmqah26bpOs*SY-*8 zxXTSaNt6T|^BT@NJ=?UAL?7>Z_>k9Y%)jd6v69{4?0;5sqVx$BDM9y76B9?JDd=L7 zq){yL^hUB4U_-daVItyqwygRZIWIf&Gq)9bk{)3LzcsprB{O-Wyc5wj4o$@sMX_tF zBTv||`ClFJuv0u4dBCA?E!{85;h>*qM~0-=OPU_l&Ib%LkUr4AJkmsDIl37}UU|>>FL%Z->#-!B{i!sKPDsjlxs9eNB~ z)C=MslsV270@E!aJN{}j_a~;c%ku{*OQlr~Natx`3%KEe$F<9LSF=RPUIr&?c& zRS$i~_PvAqCF?@8AruKl@9wn7E3sU=`-txy^u8vJWVTMJtdR2)1@GA-Y zOIPZQVztpvi81=-G;ZDgda^jwHJB}=#zA`9gf-2%b7-Fe<#jT#bPGmz&dXuP2KBSB zm`@#%trY2~S=ME4gmeY0P>K?;{UeSd~0@tuip*_jQ*kV#wvh20a{K;vXe9;LTTCVz%zj9)} zLc^;N=x_S3IKfxA^N&1&<;>}v7bEAtU*yYuqoug}QT95-hK4-jr(4^_sr2Rc;hrb> z)){2cs+l7!sJq{ivSRGvU9n`K0iDdUOg?)Bo;J{diw?PF?&;<$xf_dTNAx#ki;x3? zU2QhG#wEHc@ZS%xBu~dNv^i~`d%~ZRX&iRysShxXzM*WSkr2)_yo_`fBtAB7M5keb z>HN;fAcJkB>QmeR5fFnV{cP{N47eqky7b#?<`VrkbvfcsySCQAP*l=^kO6Yk89m~- zT+w_N!%LX~XKQYj?q469J!g;_9a)OQ9k+c@y+SN`T;(mVL21wy2G0_gb9}zcf4ISm z4@YQTY9276MprF6a7u-wu!d+L^vG9LD>xZP2u6%6hq^wgnANKZwTOTBmYjDq7{*M> zmiRks6~`ds2CV$=c!OAfJxeBZe^FII6yXfxtd{I1rR1|l0HaL+uGwp3D=7(lQD$z{ z$+B(ieG+xU?K<(6eY$i%4uAC!DA>Z0>5{F?i!JH}_S?|~O~&1TI3{lK#DDZFt~dSL zlqrb;zi#)%d9Z?MbcCVjgJ=|^2WED&o8q;T&nfKEnZwjPG7zb90E~UKvOW?JAd2p} zkR|w~K|uf-x@BvS-<0fzM`GsGBeH1qpxqCvX%uhQFr1mqV32%h(u6TFw<6_H$W_3hfoG%uBLZ8oK#D?Ubn;hN1YYwvj8{YRL;$i_sd<8=id4K5*SNYVVLWF zj)Q%b3Z7W=!^_rL>EB8M)$3nUWN5*>;GzG3nj1=8Iv&)wQ5NSa-(cDXC#>k?Whtrt zjN=BC83{@X&>Y^Sw=#srR!LUe2VL~lg}tV z_3tN!M!w+RPdJZ=gBF7vy2!GC3mze4zOVMy`5M|H{?rgRVu{hy^P8{F^khUtlBAl=2Ym-9{C zyfo+V%98h5wS9$5s%GF?ahATe6|>rT~*jIPFP+5TRx$tmT+ zs8j#CrieWP5@Lb06gD#5>V4W6XjDdAl|qifBIpl$JQs%6vo~4bb;Ytk+=Rw%1;*X4 z<+s1ds`Pe-<@@q-z5B+3Jo4Ei40i|o83g`&r8DLCd-Y&bg}wl$_+dESNe5kQ9TYQ# zA&5o=fiuc%3DlcUt>*I+?v(p2`$RqOkwp9MW%6y=oK~y!uJBD4HQeHACF$cFS{g(c zyaAWc<#Ea<5_HC{;cw>Z6&B`$8JzY@s_UoFW-5alJuH<8a&bmo+0%~;iilh<<8Y?)aWw<4I`P0n6<@QFk`xJ(O3DTPQz8{H7 zN*m?NqcymnMI%l+XF=n$>YZfcX4QnU*v`OZ^YEl;f40?kJLJuOwiP-Ef5T-WUGsTq z`Y=?@;cwf3{-L(6+QSD(&i%;k3qF;vHM<;@zi?Kat5hS&*;h0p zYg(UUo%AolA0quGBO6jPS5hq#^iUz;w=5M!ll8c(TjHzjz-G;BH%Tr_fI8k>7xumT z7P=0;NsToaa9MyH24VA<)ZHHXK%=<~a&5`uqwQFZY=)SPsI7%=LetSh)A%<3+jXEj zbFyG8V0MmdC194lZi{%jg87}n0oevdePmT=eUASZ!WVAK4OLq=t8Qzn_ph6c`A98P z=W6;))Mp$!Y}L$n5Hl%zY@53Nca?hcK-9`TY)x8k%eIxl6CYLWNLYbe1=7Hq69L&8 z+iJ1wcvu2L7c=PvxKRJKuC+ml%SuuAyW}6y=j-*BE4(Jh7w%22xZGoG0y7UwrhK-8 zHq;_jv+kY@L%Jq!+}gRaO0PKUES}&BsQq8%qgB(Zt>*#&8s%0K~Z{PF#rP zCdMp*duZ7`jx2Pn0ed@Y=0KR%%(TliEAp=9ET!T@KamCxl3wUz0xzRzF%R2?0%5B} zgLbN2$2Ub=u*mk=F9pFs!A|KQ-2DdYTK)uTqlklGsw#93Eg9_84-FG9*lYL=DfGm0 zX3}RQt2F|~Qa)3#4ifw7-v_LMBwJz@Fd8N(+#xiE*s%ZUUGhT1vu@^DOB2Lo)B}z$8lBV0E~*!S z4lrLYD-U9<);pf$e_O4;`)Un#SNBZkJI@R(_ObG1W;C!KChMja6WaO)?}=kKFg00y zA+m*}Xu^rJ*os2li&eu|JL9y_HUXPUR^6($P^-lKWK==Bf2cKJsU2E)Zk~At+ra0g zGUNlR`qq&Y*a!Pk!%@4_2f(7Zr-M%+yC<)R8vQAkED&&21`l_*Qp+TM@3SdpB(fzM zLsk5fSj{fdlz^!pLg75efRx0Io?)MWMcO-VIq@zF{;CNJ>VA;ZpBA3yA=J;>FPEoE zu#kiJ?56^~u1WGv++b7sE10wTs+a?eRA~!E*+Ew+0G5o7rQTk9V#|VB0MB)$G?~rs zGgdV$yXX1Y_cJym4Y-3&<%7|JTy;kJ%Gf4&tWt~OD1q-=^+n@_i(~>OMtP4V{2&>m6 zK74c;r^p7Rg^E~-w)}QTY{=E3;qO07LmFKvG?+KyM~A>0#xo)GbTa*tdMpo^VRL;1 zCvqvLJl})hGb|;!dO&UBSV|bIGO*n0>!xjCjfktSFCd0@8AJYF@CDhc^s?SLe82MO zMw-1_&pfbuMaexxvN^u5eR5cM%3Jy`3F_PR82$u$10H!Xb0F&zlT6~EO?+r*SqD}s zI&};R^z8Xw>6n8aCZJ4d3B?EK#jLB*7(6~vC81b8)6|V6VpDTD-O3TV8mxQA>_*sW zEHB0g-IL9xsBx{Aa$_4nTa~fo7)h0+d0;4DWOx*TDDKE>?J;#cY>{^5R0t}91}!s+|4`NlT-AntE?0)JFYOP!!igVi!Lbod1cs3!hS>KBGGxm?A9R33k-@S~33 zgX&qU90A-_KF)(Ax@Zf%o%_L*?C#{NLEZtB1a>uQ6i;Q8-%;6nT>%>V@B8Sxrj%R8AKxXBhdWIb= zde>$EEWCyish&}&2TKXPoS95PCwB)INeQ0yZ_!X(Cj#ERJQ*&>J}d$E(vF${Gx%No zdXEsrCm=HJ4}?5@B|23qH3uLUj))3T2XS$`zI_9DZ2EskiS|@zQ6^N~E1KIb90y4f zQl&#cTuyzI0&XkNZn*Q{+4V@|TH#WbJ~ZB-(Hv$p;*opA1zC=t{VoW3Qub4V_yzN6 z%Df6gaSftD^eXANB@he}ericB7!p{(r0WLy+GuFFOrT{+l?5Pj>qfN6+J&I$tgaOe zb8k5{b$e9BtTb53SbiKQj!o9r^*G(zP#t0K9_r>u{SWm=hzrkD>k3vVZhu9&J-B^O z>zcU7Sp&8w8Hv)K1GAlx%JbTF*=!3wis|>*m%M?jsJo`4kC~i6-lH~{Bw|^S$m?Gw zUNFgeimd<(>XFCc7hJ zZ$+aX9dBz^X$QX-slP+04B8`^>EXG@8$Vpg;b~AWyFMQuO`Pcod9l45%x07w z4i|jql@o>H%5CJy@aN79*O{y!{JO`6M;WxQQy>l{y+?*EbLZ5EIbGI!M#8tCmp{j4 z;p}>6e){Xo>88Z3fEz_#&!o`r*`OftbdVR9;3wZ^s{)AC@A(i~bd=H+Z^-l04*rX7 z9_8kGXKNEB_N6;}Z7J}rtG-p@&h>Vl(eF9>VAGj?#|H({BW6Dn0y*428fppaCRX_% z{Qrf?A`m)Jenn9!ldJ2AY5r=N&BL@>xrX7ok1#Lpr|{%BiuVM_T8ZE`-ob z{%R4#mHvzd`VS=AiRG+GYDKw;9^T(B{94~Z za)(O{#;(wA3FF2w`;G44Tfm)hDw=(3`$86bei{5>Z$hmS5toY|i7hj{G0*3vhb}Fk z8)@7GUpq9=2c#Nl98paCzu3CR?nu;iZNsr`+qP}nw%I|)wr$%+$4)xv*mlQeM{o6< zYwf*1ykq1y)JWBHpV#38t39R&QHC*8UUL{D`%v$6cD31zsW<>8&~Dn6h7crV<3tp7 z<9~)M02WjG)6}t=7ge4!h$S^!-%eo-8{piZ| zOGjY}`AeQnP0BRK8EPeA6lU+pbfj%hq8r}9PZ`2#W?k41g0Qnbt*;foX!w^*5~ov2>AQVMU%u;~ zt;p*SrT=68nc^)1%s=7Go1Fiee^HU{WGkWFzh-78E&0aTnO!FMo&9guDM|3d%|k_| zYfqDrW=_{BqyPE!EIXFnRgIP%m#+eJSm~=!U;L1Qot;Tgu*1{fwyZM zoFkWTSQSP;0yP5 z3aj7MopOK9@5ka8;BI}){}+F&}-mKh9DwDt3JI;v?sFoxi3iCUGS9G zTp>U4b3$B8yAt&DohlR*R?G{Vo9@2f-rk-Ntr7-J00kqh>Zbp$7z7ftn*<~n3-bge z@6ZW^v1BOtB`kP_n`i380e}z4opyzn1Z^1z_h8{9M6jiblrU)zJqu-dg^OnqEmN2B z8RWyd>@3&y!C0japCPX3WFX%Z8IL(}A@fh=s!PR_o-}BOW-k#*Tl{L{`K}V>ia$Dn z5Ud$(az^?Q!oIrD#~d}2GM)F?YvtJ90refDUHf3@CU^t|QI4B*aG28MxoEVffbX;uYB;j);ztB+DRY)f~l?LT*B6D$>7r87+@ptEf2mvyotGSQTsy=WS|i zA~m`yGeA7Y(e)lG0~f~~{%@l`Sbm0FE6kCv;Sevo5#zFb{%K?))l|+reMKsbZo6Ed zqm2IDuj1ZzF9J~^wg~=U*+v0{U`)czU+<0Sk88mCL^I?UOysaXh+b>`9|tvfp{!w1 z?`ABkXY8)$^kWy|ly-qrG6Ytiie8a3*(=ilh{Cq12_%9!*BY*x$G*5`EJPeL)8|li zeoohw&FpWYO@)0bsR%;VW0wb@XV5bOU5)`q-;TBIpjkaYAt9D|%n4On^4v6Qhs*Tq z%3bExr=qpEwk53GC&=${?K`VSz8wdc50>J1B&%PjmUU>zxcT(ZvI7maH<6}M#~$Q+ z(5J?Ux%$iCe~WsId>`te_(=McN*zCoMPU>^roPYnXlkM)I>p zA55Yl3O=LSkO?%MYSy9U;K)-Huj$&^zAYpQFuNV2!A8nXQO&T4x6Dx^BT#oHC8TI< zQvJpxnWOY#MegE-rYtn>=CN1Dn)ST#h4H7ZjM>_!J(?UfswH~9@VTR6Zi=D(k2Ggv zwK{D=nL9^PzQnUz`iKvOhvJYrwCjy;MRR*O<$8PNK(-3#TZI9BZ)w)5GRLe>peKvq zUv!WO>ER!Au=-zgkj=c>ZB;{P7a~})<4`xF<_QD8EtjAN+IN%2O~g^GR3kcG57pgt zL$z*u>R1hkty#Dycm)XkhkWe(4OMu(GRe(LYpTNy(35q;H1dUJHNh=Z#mWWq1?0D1 z2qvYVfhmI*hBu`VwG|X?7g36e_!_9c%fU=Svs*Ih*UcSfCE6@|ey#mn*1H|Gv(V*e zr}nj}-IZ$x_0KKuTrB44a*I{Ca{?PEVP`{OZ`SVnA8k-kO6bPuf3(5fgox*AO!j@b z3^a(Iqe^11HqyblHL4jsv0@A!R-rT;3)d*{fS|el15~Hjuw1OHL|T(25;6Md8k*Mw zv{eF?O5`Bvo z^^4?qpgv zJxq)xgyE(Sr0*izGOh2`?DhrpyWf1zcM6OH9A)6bGob*Y-%P}V!DSg;vF;G+2)n2i z^fuHD*FterGV@R(zvAPGkEw=fgW51+mI8`I+PTO_-pI4TWnEtLdFT(LLq8yPYgETC z8I^v=*l@S|_o+%)JE2nJ?Zpo4c<9h{B=SA>4MLju@tu3Pzq3)yigo75szQpx(W&P2 z7xXiX0PE^jGLa1pkoM=t&l1iyTIKd_EEZaKLN>+UW(5mLyyen3L^sphQ@0U`CALv1 zOIxc(A_5~qxo|XF8AlL9?tE_&hRlH7mzL1I;n*Al&U`$IqY*a+!Zu4+?;IzIvZ3cc zE&(`(M$<*|w?1k|&UY@O=Mg%|8N3Iru)kZ{6X;FL>L^%#K%vF5;EV?GsVCW+qWcE- zyHJI%G)r!)1|2*o$Ca%*qyL2r@xV|loNPd4>T~)JXS>bJppNE^Nuy@DK`Di1N5u1pU>2s zv`Jpb+KWTQpC1JFo)*M(>Q@|SnnvUP`0teO$w4XcyJhweZ2Mh9#@1lb(=_!Y9kEUO zS+FJRB65K_j8{918DcsidEp0W0qqDL@azInVMSv8oQ0z=`C@|)Ht$BixAi@aUTG7c zUWxb+z+wj$dC~y{)eGl_H44&~yx{9Pvf?{wpK1BjU&W9d{I=g!lt5+%DjC1^dAcEw zxpsNI)>=Gq$*A0%{|<_j+hIr|Rnn`UuvO$E;xOWS_;|tSjdVrm1PwZ(cqebsPXeac zRk-H|lJUu+jr=dmu*znIEuTXySLxW3!NTX(&A6-CB+So}GdbC4Cok^;f#?xOZY zPE4R9$AQTZtHz2PmSTdw%Jkxa#~)4_XS~uO=>M;$;ghz>UefGJRqz5y4z-_VmVE&j% zqymTDnHUti*;DnLaEn`w15@UbJEOq9tjf&w`GtQG^;JHmm>=T1vG|2R2VzOs_;+v_ z%)_g*4}161K|MjPm~%smp^{ltrrQSUJM!YLx3A%-_MfbC= zQ^esGPS)U2#r~#z9a;HvKUm}abc-9cfidlmuPJxwjv*sAbx2-N4q0|eV^Y?212N|uN?WBif6{X#(!aB{hgcX@hlALUD&i?YIS%E;IlIvjJ_ zZhp>P6hx*Tl>(dGt3@3kZ&ZuX_ z6x>G%z5MJK@g>M*RtvO0Wr9b$cRny;-&hkMJ`&-RSc2mtcOiqA$E*Yja`m6zG|+m$R@S44I-y27*#RpZ zcp)OjLT1lXFi^LbX*S;k4m5-W)hxuhVKCynB%8&5D(kB@=~R>u+_n0cY_@bvifYF_3*z{JxR-}gwmT5npLAE3=*Ky627_Bg$I`Nvb{^rr?v-h_SR^-0X1`dW z!d|6MYKGO)tmCS(y7UQ2Z>`nH!h~R1@BnxxZ8oIK%Nh+a3=z=ra>WxN5D1$L zDWtG+K8||ZB|)r~UoRH!+RDt1{!K_fB~Y)QtHX2q?e=L0<2VoS9sknrJDh!`iI$b@ zGvj@Ent{Gyr?30o>-hRiy!K#5vmuldfztDE{A9X?O?qBO`L!&l{P664PY2*tA1FlC z@<#IHrQru594QYKobxJAfbn4-U!%`^ulJ{I1g*lL$X}O>>W*L&H^**;Fp!EQunQrE z-WCW0wKFD+VLfp)c4=)JJ2nZu?5Iq(%W@l^ z?1<*S3x1^Q49adNJa`#VW6CI&X&{#;X#yK~Mm*Kc>gZnh>|T^_w%LVUYZ~?->2?+*RTkO!T zCk}S1`s1qY5b)I1N&ExbKG~B0E)n7N0D%pnUC+mP=9uE!Vw&3_)E!0m5;CK}l@)6? z%Y!#o7Xag4J=>)N!V~Uww1jx%pTo$9j0H^m%PzFitm=uK`!IN z3VCaaTXTb>@6B$=Y?h)e09)D=L1m*>huPGsscsRkL8B`VMxMUbuvWCHM$A{cb4$!V z^eFiZ^|V%zJej)n<6%4d%RAJn9>a{T^E}@+pM5-%@1XJ8 z;Cb-CvS9(JK75r!#G(ZKeL;HGj%Yo7pSi>&goFN9G6|a3Sc7+l@99Pap+RT1ZMGc`n|ZGQ-1&D)r?3f2U4smh^AuFT#}1gOlJshM%)OS46%yZ3o- zr+e`a_UUHq2rR?h8+v5JV+=+9>R&nT*FH@iwt6c}JENFn z!=}RF5Ia~CtkKEym#Iw;i=^-CznC~%4x3pot;Q3&;#T1g$aWNy|((>DR(6Q*t_~ZX2a?D1CxLO-yb=o8Iux| zkv|L>Xg#g3`$R!-X||1tH?nM zZZ}>(*eD$`Fo%e-e1vdlz~^XQnD}ZzX996n_K-|_S=DJ?XNn_sRVt@j_uew>g;?E}MRXJ2_L#WYH z=9y5$ynGauZaiaN^I3+kathK-`5jCXj&T8TYh1olo5qxkl#VyUHkaErH>;_FL(M

Z^wpjwo61GA2wBDRPGc1Iup8f@5*!g zFsm;PP6)gL-d!H45f|oEuwe-#ZYGMe8T=Q#Wc@FA`Ka><055?;@rG1)y2`J+W(<}D zyn*lrE5}1Sy3i=L6H!*Zlp>X3S~c!jp-m+X0`xPM6${x)&Z87=flyFo$UKL=1rYjV z1|-e6+Qwt0wo<389>}Hc;pltb@fZa*4K6p~E?z2M}3kQ(z!QBpR$Iy##-_7=Z_H>5oN`CQ0)^QuEW0p^6quz&;oO)WJ z+{l&2)H+_eqDrXTusM=|EBb;`&xe^IqCJQ3XJMfu>I9+>zMWvkAdu9ae{m_6#j^`-Z!8gvu>NK2hwGnp&Vb#;MAr{+FwUy(H3~G5dg}7TL5G)6%x8s*XGWgUKR|y-W%Zo86k1`tK7}Iq?lX&|e2kI`2Rn!@e{Kbi3#$-q6-J1q{9}>Dy=^siYYSlw zt~#xj=|ElHY9>A~Ju+W5o0+`WjE^KZbbj_0Ry9#TnnkN8a!aR$22DsEE?v45@;MZu z_Tg#g9Xm66A(@@QiVmgwoe>8)(>~QJzSyp=2e`yn^jKcuFwnPhM`cxiA4Dere7OPc z<6p{x_3fj`gdv1BYqtVzp}Q%th%nnk#!h6;8e_ zduM2UIHvsZ4lJg=nF5K)_qqRWXq#@Z&e#knlpofAYW-3D9Wf3qgrLqvr7OnQu}qiHh-&lWfK;$kdJw_p!MHw2h`=U#gu0ZahY?4q4OS7Y{fa*NI<}4 zncq2o-CipTj!XA!WcDrFdkW zScjP6>CfS^lhIg<-Oe7f*Z*{n%aKL#nu-AVQkwnf-S_oK39R@PC^)w|xATS|Jf_^Zhi`k7e6I!2aI}mz~f~c9;Lx7EO)CRJ} zF_A3kwo3;M0b*>T!alndKJ5)&0S**6P!dSTM%|9Sh5qV3!I6iilsftkHv14Mzq4er zVMGREKZP=yG%Q4a9r`o`3A<3qJ=L>TSyQ__6c8vzs!L)2=*5|H*f)RFEfrcRaV3G);YkK&a^czIlj!oNd-2Etx51|b76fF}i+5U|!Im`{vXC1Bf|$}yDL z7t@Vf@*p~ssQ&mu_Dz+LcES}I%Z(bFyY)wohrXm4VJ((nH|Q2=nJ#@`iH*M#QgR34}E#Eeg~j0k+km~SXRx&Via`Up{pfhaa+XN z38sjfNjl_)B>1P&NEDsuh;_pP5MN}fV1kr9zh^oTL()vKi3?$~hN$P=h1dG*vxq5) zrsL|ues&<3PHepOVTZKVEJQK=WrfS}v&NBBa<-vu zF5LKP*ox_h&g?Xc+#mxvX0Q$n8}+>mRW;~T8SVEKSb}Xjo_JbiNH~0)gXxDH4q1^s zkF{u!B0OKQ>kl6Iq;y3wYdjS|X^2HeFL^p2*bDbc4-&QT9376g$}?nkEy{9$Mue&; zZQd=$k_I?G=%&9Jtzx z=+sl&0-;4YA2IFkBQz(OLbMHnwVRk+5V)UpH$Ls67z)vU@FN9@J9vR-=VLKgJ%t-| zhflO`kcRK5!R<*M;x4O$hb26?!Bbw#5FFDdej{)&?T~eJn-e@}5;+A|*Ak4)Z;t^= z*K3B~eV$1MK8;hyEJelE6Am7-+j`;|?|bZMr~xD#{soEOm~+7~Hs<3DF-?k4{0NEP z=5@8bq8L3~`+H+K!iw0ADJk*nkK!@E{a8RgFA34bgRvY_5eFqK;P~B)PL{`H9rO=@ zY44=6q-0m9y!M$Eo#C5SQ5r7bi|Y1RsrE?_^5nxqMYOoHW`Ybh(A=lKO8jMNG%Ju7 zmNh@)?B7;1h_a_WW@zdEr+o*`)+`(hF~Ni%>&H71mNf4=NR@ij!BI@IniG27w;@!U z8IJJ+aOKBkCtCy9K)dWTVT$)fn1SzHZE~2n%83o&PHN^Mg9L@Ndvuzfvu+wnrWFaH zb6K0O_$cQ^-N6aK`e$-rD<8fI{q=qIoHo*!m|9zD=IDL_VZd}`?;{U<<5bj{6S1jx z(^zGSV$;lIVw+wXEa>u=@a$GnufExHEYHG37BdB*TLYf&CqC5R>B=(Zv}`GnL>GJ4j1u5Vh-CaYaEF>Nw*PC z?KNRi=uy;3_E;%y9&N^3h;3^cV%aGqRO;Lrw1J&vNfSJ%83^SvudZQov~@dWG%hsu z-^9I=>l=(%TXv*T=z|obx{^f7^{aq&Nzx0^k7#t7bZ{DTbDiW8nJhH>)S7dYnmupc z!zQQ)4`D){?XRMr3a7=2`tm7aSCYt4`QCO79n&wr=flaQ$k}M)L><+ZJjCU?;mj@Y zF4ki$B~Q^#7H17DCKD&iacL8JR-IXlPj1vm2iFF^mXdGgOC^BaZ{1#5wUg&sy)ajc zNw0fQzoV{#r?gzid@6$j?@R6Rj;@KVb`8r!&s&Am7h^lg(XM_oNRLXT!S0;4eLGGr zfA(R9rcTooVnHG1Z@`S6i!Dh~&nK$P`{k^~<-xjOUoH+=TIUO(Fzw|46ef9{Uvan_ zQ0S;^Mz@l80G~FZ7f^}{HkifVcg@i76Oxm^6y_k+|A)d9r5B*N^_}IG)A}QPA7Wpe z8{p%5R7IaB3ZO8>|5BI_jbFttoW`6~h!n{Lrt!Juw`2(^@1Ig&D1>ne7k5hV6% zIyJwTx2Xt_zKHtCqFJCwU|v<^rpsb(<{_}}(r#^>2_X+j8IADOqeg*Pv})9-!J~dc zLIO(WG}QocTv5ihXg&qM&_|x4<&aj0=q#()VVHJsX^2`gaR|)7B4zjahI6;S7UtY{ z1isxYcFqKUzu4Gh*F9T?pMduU>W&dcHIaob>N@a|L zRE?k--7gHlG3bqfqcy;jRpPf9U)yo0Ok~OiO-*EPQaRXbr0+`(H^~ei^qA>Cx>FYM zXQR+-Yv1_PY@yaf=Mlv?pt_eYW22l^4KJI+pdGmaH#LU17)->ZTel*`$-2($v(o<* zFR%NqxMk6D{xbkCasJ`G5f@E*H>Fj+)=g=q9rmY-YcYN4bQX) z1`LaQ+Id**yK-Hxq{>3$C#$LGP|bE*Ny4DZlTg4?Pk9ahTv@n#xa&ZVbG}chLM~d- zX06!bdV0aD3t&UY%LkX|VYkZrq8Dm4GWvVsKGKfsU4^vbLznS#PR-F`heDOV zj;)nT=He&Enbgl@!Jf;@`M%w3j8DPT?-V&c-?pxXI}vhA!1p8d6~P6%2P9i>Pc^Di z{8SvhryYlb3ve+U^x6(DiijIIXftEs0(j_wbGq@EpH zeM%;|bDU+CkhBWIi)U?Ak0)nnmvw_OD65*)%qc5YdUf`0U-Spt*P@u(__8O}lHC!P z<90h(7st@L2HW~+D{)YrI^ZRJ8rsV%Wc>@vfIHQO!Ug7Nc56Ur)`5qP18;Fs*?!^< zI%AlKhj86l-&0dmG@@0gMVF#f46+>IS*(1sE)F3kW;Vtf;C!Y_4q(|amlZXKG(?#6 za?mc#_h6Cn6JM?`&%t#wR((xnda(37%kjRGjk6JJ=NQK8ayE>V+lJPI>nS0M2;uQ< z1T4`+p9{^qBQ=>i36KrH-hDS*{B+-tLSw2G|5GO1y2ZP~)I6;CrAlHA^h>St;a>Fl zi1NvTi0bY!`zm_I>gzqLurVhxKN`*u2b{@z*`4l%6eLtjE-7PVacKk`Hf0PzL~h3E zjS$6PXpM>-?mjgk>aM*d+>BOtbf2IkUwq^Kyb}5R>2T~fAZSnO-feCmqM9K6%PA8E znrvtn^CnY^F7T2L9UDtG4IP(|79aRly%0@~Z(aFvV|;z(JMrqr$n|QAbo3&fqDxT* zxwuAbspbmvGYqZe@!!blYVsO$MzCbEFY_@@xBDfR_oFoJ*duWmuE@K3`VrJ;uLK@m zPH%WN1FI!tLRw-2VrT8?O|XZPtc^8iTw!@v4#}z&a#r$2QS7%)x0Idoj~f}SA$A|o zSItGKPr|p?KMmRbT3i}oxJ`#S%?dVJCEi@9RBxz}GLXI#`_1uCe99$CY7s__aydIJLC*#3NR67LpA2)lDL^=4b3v4Rl%(#wKpFQ#~&(ZcOk6OmkQ5y<2YZ$YS{_||NCvXoh{~GH?i4nxpC^uldmrNnH{Z48k`(-54F5^1 z%x#<~CoHMWDfqa!8Mx72gwHQ#gziZxgv=}Yh29|p$6U2h&5?uf_qRU@WM+u)=^D@h zllIPNI%ovFn)C&+avfTgH8!w5Xgt(#J25o44XuaGDiWFAxKv%RQ;H3nBRpcRNk=)X zdpC?_xZJ0qO}%t)JXT;)(ufrLEyNTcY~X)xZdfLAer^sK;W7?}(A$fqxINpnxUHtj z^d^&W+R3Htq#-O(gJ;b%w&wQNEs?|QN(NS$-^A0$u^7rbz-bfPjJXmtyJ?$nCRgJ* z#lRNPF+*|_#%htjm~Od+Yk;)Rsf`a59~-H^q1GRI1g1-QeiXI+@t8svH*$0~UR|`s zlVy48P@@-roH^`FQ0eH37(1g?KQCy@8d_;)F{m@gS}!18I(YW;+K3Ny2Er?w*vijB zi923+Qyu;`KrMG-q=iat6>}Vk5R}zw>z2h6wl>pp(AU@on0^60yyG>tlfTyy$1y z)XLrgdZH!GBH9g7+i63ci%VGXI=CoZI@ud1O(UnYzs9Iv+`%RB!60V{wkZ98W~WGb z;vtp>c?nm@=wM^1&BU%3T=KQ zS7Z|mxl3oY0$Z5Q;DeqmA$R2P7mbnETN2xR+@D_^WVeocPbRNyt^vkST~Z7}&!W@; zQ{HOVP=d51lX_p)*2OXg{uXNkYZX?Tfhw3U< zxN(riemuU$Vb)=E1jQn(cU;wavMiya{0aBcE}@L-phGqe&|!)tYDa$$V+PM(>o@&m z8*zuKZgss+|MM|?zc}&>rs+Q}=A~M>FR|g$Qpnxq^hy2JjVem!TajRaHI3zybvm@$ zkcp`yx*fb#X+=tnFbdM$FP{@23;xOJ@Pm#fYIBnVz@E>aohM0kW7ivI`Pw1p|oVhGvoa+Wb0nHcNieqdAh>+0Q+9e2@>9Uc76apITri^H*bgcvk0)Jr4h z6xgyvzYr{i*SI|60_e(ycPvP}hZsb~FRWCty5kij(uDj!%~cQ+a5&JY)sR2aUyEma z3+h^oT~|O0=<;UeAsK*q(}A0%c4oOvp-RH%x2iYt_5kC;bt%>iHN1#l8it=Qp zNW|8f4c^;033EkyYf{A>&A&of$4ZEld{mL4;xp^UBqaDmuIY7l3);;DGt-%UZH6}i zex^L|^kFHXlq%`h%WRoB4ysUaZCxn`js<;L8 za&t4toN9~^jZ$3}WrMf6okgFu6kD1uMTu!_kA;a;rzYU;lTAHHzT z^)2?{rs<`$L9+|Mp2(1ObsJqnu|gzxc(mkb7?(rERi@^<$OdCQAKAG_RPe>)bmUcz zM&?VCRDVj4N$EEQQ607n##ux&Ykh=`8*%p#J_yW$souSYM5lM2mJ@@nwW+}kuR$BW zPYf}j16J$Ng_j9LP6SGG;UyM0w5CMd@q?w2Wqr<}H}%Yce9hXPZ##vIaFD`?9ugO^ z!Gr8^a0_&I^kx35~ipb3efwUtz&naS>Sh5zVH- ziQ$bU z95JNNH^N-8iQmCM_A(C{tb1V!@u0^qY+*N?cwXge&W=}R1)ne8runknm~98-;FOV^ zgVCyvtB3q`eSy#4$ibQp{>qr;qN&?pG1p1@8Ni&r8IRfGZ^^J3>$RH|BVOE8X34I1 zI;8p9zm34>azLiPt&xDkr2!VE${G3dD3Pr}ESZOmaoc7L%4p*Wr8m~e-6W7?SW73ie zueSb3Axm#VKt)iA#@QU4MueS;9E?TP3c9_aT+yi0{eA*`sy%U}XacW*wOJm1m;gY(JkCmx2a%?k@T1W>A?p?CECU_$RluD&>6voOd-`_e%NC2fU!s^8>VXSpIN~T ztHr33?*nE%&?*CC$;^l>-Os@gbAjHZ&YFq4q~u_sM4YfPR40-LS_1h5vI~Mj^YUMtz=AvUY0pixCmLFP9dmkNk!eN@(wTMb9!hOg>y4(9 z)ATRk(IWCZvrFQ<^Z4gi(aOsRehn+zy8r8AYHo$0BtZS+V?O^A{*eQGOmv?>Iy2Uh z#zsAM37LY#26AUm@ZU%REcc71K!Eun`JSE^cKK=yOGUai&xNwst!(@ku*i7M|Hgm6 z#7XJ@ZoEcG9#pU*UtVp1Iimvq>|#QXoU+<**xlvr z;;6{C{p@a{iYSmax*ars9k7dV$;(6_o_#Z&-*42(StNGdKe`>5C-ArbOYT_Z6}ZEs zRK*tx_HMh__#<$(Hp?1fcVLFPGCnUH{Pqi?vx;dxM+HEoI3U4B_eG}0&9;63D!>Eb zMmp(lM~KGK!cwCDsr~Z$dclIQ1CLKdccq$>1J>3eytk@ruwMt9I^Q~#(?5}v_{}=h zlpx>=1B$m-w-DpX6fz?ONOCj!uH&9vgFgof>pOo!lrg4+YV;Z6Wlsng5le9V=k9Lc9P_VlqzLBVrM;xOXpAL`Ap2a~f<-2? zhA^}V0Ee&}hRErSxk1gCMtx0O@AO{j+SvIYOJi{S6T#M$mUzy-$j(y6tQN!aR+135 zX?a|UIB*NnunSO}*HsjKvbBYY3N(-4NQ$DntEYMQNzt-!Vz$djbk%fmeA9Qp?_bJd-4tW6Mo@|J znhWbmg~^IMA-wsJj1WEp<lM;XjCmWAd67}qx!c=fE$xtQPY!A*lsEVd%|GN7 zeTXBW{g9b!~b!c48(|*YPcIT2&^@s$UW~+?7OY#(7xH{oh zc7g8AJPE-ixEOx-Y6i8i?up%cg*Cluk6n6a+mO{V!1ZH#!un2uEd;vT6T zNT%^=To9MzW%QJiJ@t%i#1xI&1HM_U9HCBf5f-^P9w+SQI&crqWCVJ+d^Twt!~=y1 z5Z08Na#qprgOq+UPs^#%-9-3g`bf>kO&#uL?ssV|2>hoeJ>y0ACFr&1&_c8)8zCYUc@-#FF`B?#vBC?pF;)c{6e$=c+oO_d9{L1n z;2$kO1&2J{VJn%JZ?oXiU29Fn+m^$A;XU0PqF3IP6EcWCgL?McSfSN!j)wkhiB_vZ!kM+^>2N-QuepaV3$jy<&r%ilWZ10wrNk`!!CthW#RKPAN z@ayJv?oRIn^l4IL?Go|@`0w(0tE+^;-$oJMM zVVm>2SQT5p!PfUylc%(_Ol8%fEd#Sr%1=@i&lqYZO@wHhgeFqvvw5O1#QJQD|tb+t+L?$`}- z%g0BXNJUH44HZE(%3yF>4CN7YtI_^8UO6o_N@P4ago=yqd9u)KBS>D-l4D45N)_4| zy^kZ4sr6#or-!5UodVmx}sooaRol-5#e3UEie)-s>`D+k$B)dLq~ji575su%xxzdVD# zNnUvteHc_sH9$!x$o*>dV1;|~=^i7>9M^q)oyWWCY3UYl0~Y=7XdJ_}98di1+KmDE zD#4vCx(aa4`+TmSSO@DaH$je0#+&vrx9gP%j>pfVTD7HW$y)OLrxkcc8_=L|$%AHd zhxg@3cMxp{*auNr238clKdd=elNHNMx<7dK>3Cl#Jk9?!LpP=Y2h+t_i0f)t3GdU} zf_wOQ;9XlLWK7nJnA%ATH4jWlD~<4w)bprkqj`f1>(kDq&xLzgDx#W1k2-Ygw5D!* z+3(`4|H2@m-I={%^PFrHvN58)r!*q6q$om(3)i60NDV&C$hr)|p!nt9Q3{ga{6Xl&F>CG(r~uZnMMViM?CJX! zrDJG=n%BZX+M&(t44(n{<((&=u!mUUY%Jjv{;-kHun48NnfM1(%_IAu2LMfVc&hrl zCRg@t4F0NvqbGPBAY8_4M4)YQbvZ=jDDND|r#vL;YqhQIjsD`<-C@HgQQd3ka~F{- zW{`3_H7k8b8K(2e1qyA`oK5o~_}JO!T0i1ibNgXPE~QNTPUd8fA`?(dIk(l5XdZSw z=D~ubR$g5@fRbrEXJ#|m>$~B8xzX#bX>%S&@BQqs0jrT!RLViwvqRkupW%Y_I1wY# z-UXIdGJh6O1cKOu%Q?U!ZFdnTiQ2;<&uC`(+h1EXgVwf7)RS}*IF;qHf>XAZ%muw4 zUd*e4fC_W$nJ~agkt$zcKZRCAYm{(Fqrn+gcHPysAmf7>oWCU{DXVDT=cm2N#{?vn zt5@h*rn^_E{HxX5-{m3ZZ5*rHMo12d1xQqhOd&q3QhBM zK!>0ALREMxd1qHAR%6e&6i_c$rr9qD+Z{vFSMFE`tLe?KLgY9Y@oECm* zIfp36n$)?F^4RpB9Fu1fePpW6Cdn!ay!d;|MJ7SpdL@a;JEyY=SfNQx5@)EF!K3Me z1_ydS3nr8kxo_W~HXUJp?6AC;h;<8SyDCB+y(x>WF{*qW<=r*9IyvOIZw2r*+9ZG* zcu3U*r!Dh?-KwD4P`d}g7%6we0fOuga8(3d+itL4b~-3LF0F*h6?d-qup3JiVDG*Y-|;^JF#V zqB#~r-n%mMOY2|!cU=!3?pEWQ;*L%R|1X|hfbTW%YW{g~H>Y_O=|U3tHhAHQu?Vx| z_X${}maaEn(~immkIF_PeOGo|f5$J6QMdh`%MzuXQ@*a<@BRF~(prBtI3=wr0spJS zoOCyI?;!YlM2D=}m00#+$y2+IR_3pgyL25T9=#<*kaOwVY#kxRHWTTV!DxtKrhpmz z|Hs!ocIOr?VS|os+je$r+qRP(+qP}n_Kt1awrzLLIdAvq{?cRASpVReYd*EA?hEOD zErd`tWxsi5&>%|+U)Y>=W1W1E?n&m&nT@5nNPtvQWdIs@x*OO56ZPdsRVqhlv*la!IE+UZX4JUlx3*w;2Z4{W@Z@0I2AVo9OQv*9v z5_~BSVVzoVCxP*XbPRru7t+?6al)-gEk&D+d9)8eQ6roJRYqQb!5GLNV>{5z(14y_ z9tzYk$l^AR0>>*L)9@`75`jSnRx_CuFvJXd($Sg|;ux|k7A z8lG^zn7TT)rU7*Rh=%V15`{-$B`8^^*u+2v~3gDVcQ#hqI~&v zZVtPaSzBw8L|XvnPUXN5p|Y|g4AGU(fpldta_^$7A{M!BN7|fsRgPd;3_VG;?=n0_ zWNpm?j<#WCme-h}k^qpPr0CHy;Ui*EqCIH$7NC!mGItPzp)H62!NP{zu6^`c_fz1( z`jL1ys^MW_VaeB4-mX6PiXZh1d2roTaUi}CtE5aoC{W5OZ$02DB!v}R_aatAyXOLB%%A>k6=eY_ zpBl&4TBzea3`bc=-1}E7@P|6^nlLd^-n9<1Q<2o4``)`u3Ygm<9okKFE5jbYVy^;DNHK(mWA*j57J8@korK_w#rlq<~Kk|?6w-QA#P#y=bH_V&j&b%Rlc)m zq;jaHp|J<1y#q27s{hm(Sq}x?CT+JRl!8jb29`sP5#m?G_$Ng_>pzu_>)jSW%DR9?+QG&4f;=NLOZj(Q(8*5Y^3)20N% zlSpkBWYh>J375=iSOUHYPO^kkA=wbY+9slJcq6syUTR$Xb|0gQ+(pg2GCljdpN_E$d|a8T(=f4>5p@+rE;iz9up z&=D6L{%(}1incQ!Q2oAS4q6;?1ay{#AlgJjJ;bzKOk0vjC1lbhFPIPuP2%iZf_Ao zy%5B%tx!$z^k2h24IRsoOnpcAia!K*uzmDtDr|WOg#$+{cX0@P8FkHzgHOGVd3oMo zvk`JC&Hh@nUHjdKgxLBZ{Qk+4)j72lmRQVYkx=1|*cXzA+F=971Uby<=w$ftzFMAX z#`Xn1_&PB~KgoGj!hdf}nkI66N5o39Lrz9Cxxp;zb85(n>rdz-t`#Qokq_d5$3`M% zuOmHp!otAWO{QQB8O>O)97NC_A;!@ISklJ7y_`W8gW#>>i`@ZoL04nC83`WNBD}4dnt1G)STFBq@%cZZNro!gr&)8U zZ>0=(C|W6zIXIlDA^!tVQ)IZ@5Cmd)pSE~f)LaKameDjFxpSM1G@BA{OoS$;!|UMe zsdHN*n@X7~T=C&Z?*hPlgm`Rbk39Bo=VtvwHWPk&9PS}p+=blZ!{ft~_33#nQqo=a zB4r8{iaAXQ4Bv9BwF@DSKbJ=8N7OMC7^>)w1!ozo1PmdPMSCm|xliQCo8-#|ZP^^q zZYNFi!&o#0qx&=bHt(Ee9{%<&C^Zy9QBHN>5oGe)J_nY5S3U+=qUL6sGW5Bh)|x(U zG3)Q!cV+eRqvzDktfZ?j9Z(osf4`|+FG|!Y7rZv~7RyFO>ha~-F^Z;Mdv~~9oc>Rl zBxs*J5_7X#?5SsUk7Xm%)b@BHWx4EM2(Ulhegi7JOnozd=`ad2IoSYwxzjtAV}#k_ zQ2RK?*uC6qEFNjO^P=qxqAh=JKc#+3{-Vjo6-c63$jXRtb&g9`&=nW%M;|Ich#rJvXy$Z=jIGX3UJa#F zr&YBskSl;J$cdK-D4HPk8KQjc@gXLU5$2c$@DV2KRIF$VRnRF^0*a)TEyQI&Q+C+W zCTZ&|*H>3WR8D}iR95XOuH=SnBRbHDihjd;7u`*oSEhxrkD_EsJ}aS$kGHg+32|F^7tpGoAVbTA%_OmK7I-yqyrG z6R20uqSpYLJ!$LZt11F$B4?9mj^XJtc^_6VTip6vR-803C~q;6=*d~O{M}L;83Ijr z(8vlzQYp`q?rx{bZPvQdl1QoxrVBp4a@9gXY|-<=9opRPe|Xr>n8*(1jI!xt%7BZj z90wJGIEa5n;IJpnonY_QUbbB{s&?a_JE}%+iu`tN@+sWy z(Z&y1t-G2yi#2+GVUBTDr4Qb!8V`2Sr}9~-&q~s=1o%2`9i2h~SuAJDj7f7zE@J~i zRP_c8?dU2X`nXSo{9@Lv_<6otRkPa%OkoNsyjHK^s}4EQL#bR(dxzYB>h3`+(vgWtlBluPRQU;Lo z_67bYR!`fX4SXIldEgX2hbgkvWK=EMGXU)KE?Mz_s+7;?L#hp$V9S5OXi^q|Q>)aV z(jykwZoxL~$z-~sg)#(F|Gw00_KPns>=Fu=m~R@Gk1aQglUVLH{}8ixNx3kwcP&yG zj_+*B-1wR7G22`pD>ZW%v9-C%sBgci**d!hu%8k{fh{FYBMxZC(yjqaz<4`&tGnLm zSk`kq^w7ad0$Yi>E$~%KLc{(%UN7YCO@237RI|Q6fcF3;^1+&k2lADtP-?-MUapu7 z{4+Qw_qLQhFMJvoJ1$Ia5cg_A;=RYKPMYeS`4d^Ds>&r0}B{$;YhnTCZytUy! zB0MW1vsP6gAGr_fh#gC|4~uob2^C&q#V+ zovaRnz%G1xH`0n=OdE2Gvl_$+LGz>gH&(pa5E_Cjqfk6))!gnF!fFFY;p^}b=o#NQ#dHheei|JO{=~%!S~!I@!%iW51a6{IC{hr- znnxa1lpw4MN{#NvqxpeQ=-R)VTqnP7@-Z)G_#yqlJC3@LV+R`!@dCHRJ z7|<%D=FaN&M={9yEmLlDgRhlbb}`@jlvcPb6JMt?n`da+Q`;54SamVk@cnQf$IZTV zCTE__%~>G;JtPiONqtPO?4+OAQ`YO}d_f^aQJ^>Ew>8ImbV zDh=C{G%D{Ntfaqf#ec?=Q0@N!<)oPfBJu_)+5&$#O`XClqPIKtO@ET)hNNk96Vd=r zYoooGQU=#qyYoOg_ux;LMQ|zj8^tTwWCuU^um1*U7!Uo%*8{)2(`xWa*19EG?O%x# z5td^d%Sj@Z3KNC}ykKnJ+Pfl4*9PN%TsYzyTesY+0Swsd-ILmpp zf18IP&EjVXUj+XRyy<0kP8km}CW5-x5Y(WgEEt^ki^dmedR)_RJ!TiLA)$mA%;o+D zLUy&FiojTp^St9;m7m(t%xTX91d~jXsa6+`lzaZUPq(=smQcn{TwPv^?_%U$HGf%C zY6sQxxt8StSC&XCq_=+6N$YB4_}~#iPP8(sKN3ivcpm+Vi>sm6&2|1gd+1}5 z<&Z@{F8NUVjJCtdXXk8x&$}}5dGEWa{j<`+`NIUyufL7=26Iy@R1`Mo=4E1c-b3Ze zzK#h-(g?Ttb?j`kj*2~#dWFV-01b!h2qdZSDLJ_M;u3~G@-ag^?m$1`L;**^aO3~C zhqM%vYbW>J#Na8K!c-{C{cRf-kzo*oJ`>Pzx}+AF+B4XdPuuul^~{yQ-@)n*kMBTm z!TWU{t^NMr^0+#jZUrrD->>YIa;2dbYzdgoYVGjM*TK)N_^nc z@A)O5a>`GSd#$#z`i9H0o)g6> zO269jfeVGtOz&$obNTwS0!a4`HQEj2aP=^cy>d#r_Sq>%PUD7!ph1uantXM3wmslw z8?Stj@3cOo8ieWL>g8~r@OEGFejvib5f6L6!mvyG==bdnl?2^OJvvxtU{tBYNdhnv z05&d|x&9Xmg5Lhcf*vM4oD_MfEY)n=VtMP zSf@P$Bss)MHGR_hSt6h#+-CLMR~5H80L?%{SH@45xev3_|YwY`h@!XZGj zkrkY>%-mS`XL%7YXG>WTw!Xd;_W(u-%V6*8lg))%F8VeV(WcT6bu-8dVZM5+CSA$i z>?hvncyvS0Xbn`mxX{8--hUA<+nOvIU%Y~((2n|LOLA7b!>}6X)PrdKTWrMZC3lb% z)NrNX@nsS;Z8AZsncq$Z*S7lOE(sdB?WVaxoDfQ*fDNG+Ew{&jlH4C%Tu(rXtNPQ% z&(rgVkdq>cW8?joq~qS8^&=tTnL8P(6r%qm+|s#M5Dva0AyljP7uXD^lfyZDI}~Lo z#ESzPPejs;2oIAic7-Hd2uwdGM^`Z-%^Mh8U4Omh3PH!4%|aP^IiKjhav33~N9-oD zLebJN97nu`^in&2Vz~WgX!I>dB?R=yY{aOMhxgY*!oO*wGG8`VXOH)(u8Za%zo`NL z@q(aiJT`vZhqL*}CS7^*dj~>a6Xg%>0=NH<7t|HQi6$+(JMVm9YB3c{9uR|}jHX1+ z$PlIFVBZ03K6P8w#r-4tjYZ?DLyLG}s~GWKOG5E=!nzBK94{2p?k~acdv7G+iWhz8 z-;|9w+VpYoLDZdV#ffVHM9&}_jIq6p2Ckcn>Z+%F7%U~@gF*j0vnabYV2YeoXy?)0 z889Phn;#Yhq8h^mDo~)tR0SF@VA-YNao4imX3O6%@XF_PzC1le!<*x1n*-U7W!9BK zQ?O@PRP!(_Xw58C{Jln7AvC;#FpfF=g0Kv+N-k0GR5=^pByWE=?5D+T3)8xTVb{aB z3->(lflRX>J-L$0Q?Rzn5GoffnZBBgVUH;z^vV*#Oz0SFSC)o*sEV?wof}~VdSK#U z!ra~awUKtFnU<~PcE_Qvhp-ziGnjVypP}8CLyqw((CB=Xl-VmYgYV`0L(@n{z56}C zI!!U4Gt+Q={a-Y%{d$ksM6BwP#CoXJf+)k;$KV?)5p>5}pi;_6&41UZ8YSR4lw7NJ z32KuyW5o&zuSCky6#iST7|<*~=%9v>g0w>OZX+b1Lc=>`cn~LF{$W*)z*KWVf3-arZSn@4swbW9&JOj#;6NZZHo)cNVBSZIc zhtwQAlv;cCAh0q3$quRyybj zV~+zZ-fuxxspVRqiNujKZ?ya-A`sTq4_xn~9Copzn@Bl4x^2%<>J?1;Ogb&e|86d(>5RIOV7A0p^7#zo}c`0)RTAPTTB)!k^g z|CF8#c&{KVwkN8#yk3r-+)pwVf*D~5_AHKijlIDoNdM-AszS!MsNZy6ldj)2ppoO$ z&^B5~*Y#6UOF#c=2Rc+j4~Fh_30YujW-aw41T}@N;&WF7_Ej{u8|-19t#I#eRgwVN zv{khj^wF6xO;XE2ufqIIOp`Qakle%j4ngn$zUsW{K1JKlZl%<_7+wZ zme$&>xuaG{XeNHy5NobHDW3+0;qBoX*Wnci1KYk4b{Rf2U|K=T$qu+hPtczPfK*NU zwG{v&nG&ztCgcM#$mL&TA!5A0ejPCFD{ArOlkfi;L1C{RdxZAUJ)#`|c>45zd%I5W ztSeGygZoL&ifN)6AQ@=pBGSb8IHJlNi+S;>SMa{UP(nVZAE;-s5^7#6;e4Gn2bRq@ z&C=XN3N(LXzPj7{3x*O!ELPMcoIjHUeQ451J(G5b{F$<7R*s!5!G=G*q`7@+%uSJ( zcSM^_eTeFt#gqdhM&Da?48$t{f|i9Wwz{j?1(W4YwkgPVD9FaV*oxDv#EB)=_cS8+ z5U;sqPDuvL@bdFpc0_f;qr;=;uBXbF;>E3Eo=FLG><4hwTu&u~@;Ill(fkwq5gD#B zVEnq8BKegNCZUiyN|h-&>WC`E?yJ$k>*#us?^WrS%&t}G0eib{BP7tdsr9`KcHmKA zD4G_8r6`=#Tp)V#@yD%A>-<=nEBVe&fYUF^I(hfx` z%C=`C{zJd;4?cw}un-MnJe7xrI_CSo25iXisg@BUq-dW4r`?98ja( zw&Sj#W1Z;{UcjzKgvv@;egi*c%YK9kXe9#cmWF(48D11l2WSWY_Z5Wq-huI33%^_8 zngduRC*CUiB5s8Fh1b1|#rc|5vrzm%3@Rf#AXKwGi)2>ySX2up$=b7*heEkOamJ$b zH8NuvSp>U_sM)1KU{ty%g^6l9#PtreGTpV@o)a5jxUE))xxZcLKkF3yMK36iGS^mu z2}-Zj<(9(a{wFH2ZbM{rZ-^%G{~A3)V-yQYoFAX_+gUW0h-v$ao#joxvZ4xN!|7t^ zK}|pDUOu>EK4e}FQ01sfWir-Ul&E;6v2apk?N@0pxxQ|`hiMMnGmktE*2r>Fh7pt#xJu(j`wHPRHW!4nqQqo zMl@q!fU*414}ep6@Dx{?blsr0uL$f`B+vPsKhbYYA4XkQd#uP=5L|+Ze=B6tM;v?L zH@x&duFJ$W_1(f5wE8y|7bZVJ;>5lsvVp8o+(A)OrK*m%SOGTuvSThlvKXWz+_!Sr z?m{Kb$-albxImM`;tQ{(2krDhS&d+G%s`hTA#8nhwXbGSUMv(hBeamfy-jgyAlS{M z3y7dOt5%kz!Q^6@&%4g#;zxwbwWqNf_-(Zuf#8WAxze^!u-MN#+^hiKnXDbzd+vI> zq}UI{vsykK((K!+Tt4ltmenL>vwl`-xS0j8&_Y&@7}yK9qDmGXm8v0O?A1>-4>QVq zNz0!J_Rw-TLmpBC(L~|%fAmsz^RQyvvmCW1e!1|2SP>?z7&*5OMfFBKKZlg1XT|5- z{2LUDk|m#5frcD7m1{}<@2}mMS$;0M$<#W&Y@4+@>=Tv7LF59{XRLbe8aXSq>Dj|j zk_yTq78KCREQgS9fCeysXx(L;-}*Y7j1>rYse>_Nb&(m^E)p=N?B zY_70+YZfj-L70ZzUkl&wBW>BWqod#jlY#mz)%`ngUJ4k~(EXfd!tYsy8IVu$wLuY} zw*CJn?8G&$Sw^xXC|tBXgBmR)2s{VFHgq}TYH?m$lM(YAZ#KoIFxzTCtz+Du&OstH zr}@p979_S|d{p4rxYohQ_`o{mMfBsgd!x0UKI4?>1y}b+8`QZ6`B!;q+M#6c^(|noVbhg3G57x^ zRRrR1%eSo}yb)pJ|C9^0odbSSI`&E=Metn(gl*+`~klz2s!8_87ou2Z>%2#;w4 zUMi&V@XgRwugtj|{b#mEVY`e8M2O5A0U3pzc08h4I^v#ZGKm89Tn6H@wOI$^{l@Q<*@hHRf6vEhk>@^T^g|Ex zgTvhrj6&ZW0gOJB59HyMcq)~Sg?vl;LX*Xy;^^?K_S&yQ`nE6<-m zYtNmXt^-^XZ)SP=iYm+oZ4~+A6Lm6p-W@beu@x6OXou;*Q2so;fkyoQSU`(HxVi4S zsL}ue6oK=>GyxUB^A>A3G9>zKR{dUza#AZ&`o6v2Nbkl-???g_txRr9t0}Ut#&q#| zyUWT9wr^AM*-vV#ph&v|`u(&rm#bYDUCQ!YA?nK3r5i~KC((a>LK!;YpR0?qTm2n> zq~bIa&Bz6t;%?n^u@5GHi6=p#3j-C=x%Ve0b5AJ^n#DoUaoI=L5dngBEGJkzJQu$@ ziMRU2-$2~xIj%T#FaB~uS*C(Es**}Xgo^#jJ~Kwa+F^py7ni00gQqYvSTO*(4=O>{ z59^4}%w3L7i*rV`0MAQ-6LAZoYqgwD}1C)hKGNk-c#@gy#QlyaP$450mp0FcWJ%kaq z@r#d=@e>>-4EkdWMltSLy0co{+ElJ2$vhH6}Pm!ju+IhZqZAwD1~(cbVw_c{o@TzNbus=F00-^ZXKO z4u(r~`ZL|h(b1?(kG3Te7cUM+CZV?(7C2adp$Cd6fz;Ng&6~Sra_rdVFW_iM!N`m< zld)#?8^Fw+G*tq=A8t&S1MpBT43~il1YPFl74$8ZFbjje1i>Lsku1}=&Zk2m!vb|E z7p}~mjw4&Tf7jPT0;b4lb~@4Ys*BJ>d?>%gERLmoL(^9i_j)T%#Ml5W?e`kDa2EI! z{xYR)9TbJT>TNc2aUgunM0+4VvL-h8P^nno(5O0yjh6gCpkoyF^GbW|w3C|*2~_mhp%ssMLz%Wo48Bx1e? zi<~jemgf+GW_AKqx%-G}4KPa`3)6s?wauXzlX}IHt5xN+wGSoJ&J*9KaOBrT#Q3;WJ?uKUN<)`y-aspD=m zqA%p9G?y%p8KIws*5t9>N{T1eTnm^)MY9nAZ3K6mjLTAVY++KK^#sOz%RkUGSx+$DJK^Vb*0j+wae__i^&- zNWPY7Z()n2zddWC>B(5S;Py{GmS<=aRr>UFR$02MpHEdhEnv)MI$bmO%)M#rrOTH` zp<%L8keyd+fzH>wXinTRKb>A!wm0ARMr=a=LWJ({pk2mpQt=xr>%Wqniz`S+=}?<< z+pp)Y z5@}(o%*5ef1_a4rLyUbWQ03CkbO>R`W}liE>4gsTJ$dbcmmy!}h_2VNKEJ!VrW9RD z;x>Fwrvlq(h}>IXpr?F)rIo7vD3~C=j;-9FVtTqRN2D;`;_?SoCr!82a>&$O4sLDV zJ~G!xys^mPsWnFru9d6Hl~1Eem)C`OSI*sQ-;BcT`|!s~J2DIZEH$~a!lb(h6hT+y zWAZEM&DEG+>_l8d4D;UVYl36ER|P_yoIUslk?i*XRD0xnSS5Mu{`^F4%Ibw2J!k7cT`yjM5S zo}{PnZ-Aq)v#Ani$U<)^g4cnxD#9q;FZ7SK@Rk&ZwoHdPgz11N)tnn2efbde*$|?8 z_Zb_|p;mJgV+N-4sDV_ll-y0l!J_R25fuX$iFq1!Nap+m)vaJIwBT-v;5cZKfHHz^ z2j>SU*|ePPRU?v7bGf9SKFsWyjN!1q#tjn^nc{LCQOGC-rEZOsKK!)I($Ja78 zc;v)4Ts<7hm^oAT1?7YxEi$RL`D|{<$RT=hB`OKlWka4*>f%@YTUb=X;!C)HN;a$X zaW3vq?gd13k?#|b&khp9ZoR75BS@l`MwEgzqo(}4vh1O7V7I_ldMLAA$aPr}A+{R{x5GL^!=X+qnu>VLM69T9Tr{qI-(;gxOQ%IxjJNNp=;qf{o6? z%&$5=ia&E^s=r5u?P9o0Vmsyt=cDs~(&%qTOmmX{w%3vDU)J#Zc8nPI15w+W5YKTS zD+;obN{lo76Ly{O3PXhg@3qiecQyOFQCORFs!A6b$@HA6XV<}%yQ5=^5)o|G{HD6r zP%!=GN)u{4v21CQbXc!;QH=8r|(L)H6Ycm$i= zw_)Y@a>LY!(TC{Vuq@q;D}F6c&~3733fCZ-a2auyA5lnYcQzh~hhhw=bFh(DC<_4x zF08cqv?>G%HFJoRax&Emk7sJgmvd2`Cbh&X?4d*@EkFP?{r9!eznG*!su4M9EG46h z`5j_V4}~m(wgU=PuRfERqs9kqpFO1ty{HD~ea25U4Mg=Q#W=U3N)h#Tmk)m1KK89Y z5Czvm?68v52+kSne^NoC3V6Q%SBm6-AxMkFaF&?TghBw)O*~7Vb&3+k`Ed6^*6Ebr zRRF`d0lQ?6FlH2iF&OK7syRl_Keh6=aZ#B@@+Ekj`3Tvkcl2dp_)d&Uhmp0K=w z>6Y%?JT{)iyCjGBHySBI1Xe7BW3F!5%y?Ll9Kwe|To~a7b7dcf9cPODoa%jEs_HLr zHb;6GN^RlRFt;x6ie~3+k0W`jX&||8D>Pap2y+{8EGZOROEC9f6mDZcRJa0wSSFOA z3xU9pl>Xqj{Xw63LVUE-W6k8pTRAUn0*MlDF~%TB-f=Oq%a^ z?L0Qu-?U5~deb<_I-G3+4vHx$AZ2N|RMJp*{9_Brtki2B_L#?&+voxuH)a!l!XA%V zn{dul_|||(&Gl~`Z9T@T;eYt)cHFXrf*rTcP(tQ67%PAotif$}#SX}cLSwC%kw*({ z{^9;Q;n$n5{fmoh@$@igoEstPw92W(EI)dw;R=`+RjPEwVjj-z!|*D27Zb``alCkamK^u6-iwEF~fq1fWAac zHJmoV#;NZeQU<&ds0VpusI>Le!1-vi%!J2|-r8$L@gEXR4wP=2tLtn<`t3v_Br>2+ zC@Iy%k47+ftD8b_^X?)zjAI20hv$k6JN3}ja=Q@w{DbbX{+53>CHU26qxcOQ&x;t86PPa(+xbOcP!)`C}+M`_`@=p zpvV2&YK6iaZcax1JaL}(u8(ON6G#cA=0IUf$awwow#7YV=ty_7PgF9O3>q^i_Vcrc z!U9!w9VLXzyGiHXVpeCc$J|za<>eP5-v;u}G_k~~_aM%bWax#`kZQ^?xztDGVcUGW zuU7m73@L3Zs&R!TvACt0_zIBT{%Zw=er5zwa6}dNR=Ppr!(G5 zc)2uIeI?p8TX;74NHZX~MCD6DZT_P(1V(9+$1owiy4R0mPtKf-SK{)}TgP~*hSBk6 zNULt$%7sj9T%b*NA1xt7M=BNv*{5y6^!rZ?JqU>V3aBiQBbn1g`xQ(=oZ4Rl>Qa`O2esbw2w#4iZQp2u@5!s?$YqP@QV{4;^6p{5s zS?bXS3SKnss%y8~STcDpkCYMMg21M%wIDLqa;>#+qL~<^rwA@iWeL6VkaCnNLC7-r ze|E~l)?ssL0-jES2ZQXAHtVF$HG6#qI5Vx4blcjPDAsb|YcJ_WB&fQ<)rb)({uf$X&fBsMo%xeh@@P_fT z#3QO|fSd>4G(CsW#N5_D?y^h%1bT&nl=OBS0Z-F~{I!+3%o11G#~$+_jooGt065pj zQVbOoVeBF3e#Z4+b8uImy{F%37%pM-itwKIO;!J4Q5uaX>a)G*} zphDOgDlN3D@AQQwv#Tn=)a3E08fbI*cD=NZDmnVcoih5F=G6U!H)((Q4fR<8F5X! z{mh?g;v)}6IWlSOHl@Owz&cotX_r1|RgyB!d;gxw5tibiIYkK4k}ITQPzUq!sH9pB zuGCMWLnAWx9-AdMyr_&GP7a-N?QmqG*dOzg$O4FUws}5sAyr)MMqz{C;No0wL!EaG z-CF$+hdv;HxiB{5j%6uMUVb)$`7r}xma;Yp(x5j%>?694AU#I%H#cSG;l-r})!0!A zO7V0JgMRF4kXdVt3Q~ri405KmbPR5X=1kX=kmAI<$GGY9%d-fU*Lg`W3_>Gc7I7eu zFNIfM4{dOvg=NwQpT7~Yi@B)GO}=Egc6>7Rm}roJ=UkC={>4Kpr=XG9i!)!_~7MxwIVVw7WHHlHTeQfFf30Oe!a_@@M;= z{n2u83H_EJP@_b@+kbX-mtcnp)s_xVcE^N079 zlh9px_-i6Tb7&MU*BHn?7u){c8S6nAG9C%SX)V`GMysVFv8AA^D_>fCT8R*8Fo~a| zYpCt_c_nC&}8x~Z8T_o1Uix5$wWvu{BP4}k9;^qev)sT1R~ zdt9pCavFjULQfyYA|tQsBHXvU*vX;WGdS!tjA<;=Ap=d3Cjg2Ck+g%@nDUg@Wr_um zd;YxZRqLb*UIwY*`Y=G9im|Au5-QaQv{<|3jm?&bXunv9UzQg8l7Ojw)ZfpemqyE= zw;WD(2R7nwkB@T`{aSgSt=5LZh4l`wRC!OD?wegGJF~%GW6)lSr5y6xwOm72SL4X5 zVRbt960gE0vi(4(a@_Jsin94;OucySflmu5ydtXyn-WH&5jXiqHRPp(JR|vqtD47! zn?F{(RO~!TOg%%Xwz{XycIjf1?Ij$a5Fa4&Nb=SS0rO25Yr5*_^%N(z6ez50~m}rt+`t2 zW$oHuLvS}QIN6;)wiLSWG5%LZh@vYKDg_Jeak^^x5hJ*5%F0zuu_ra z(}hZ9So!nKi%Jz74949fZt4A#rV4Qx-?!n;&E4qg55_l?b34}tjJ;%Ojsj(hlU);} zk02KRYkHJEoEMsc=~Bg$HtbAYtnn5!c-G`f<6qRR%rPs@6s)j#HnMVPK983bMyf~# zDT>5==nHAE?m2*@N&5oE@Yj>z?B)wgcy7xV)RklRXL6ztP@F0x1I8BncLU3}DDQ`Z zXy^wZN3;4m;~_&%Bh%&y%v=eQPVdY4ucu_LY5g@e+YQ$lBiYN5hv)khExoLUnH_sC zrlrY|6ST`Cdi!#y$onC8THj!9;Np61sQ2kMXwR^2gViD_xyz*+%^ye0gZDqo2~R;t zd9X`3p@qezU{BC%Y zZlfOK2k|W<8Mv#5p`({w zs!yZDx;Sj-LO^{GGexckVEI~(nv|xqZ&Ums+Z8o|c;x^03PRa@hs%GgS&p`0!`{h>$y`Q;*#f4z6RsXIf+rPO`k>PEJy`+J=C=izv4Vn#lA)7VVtG)Z5s6DmOg-P0t(f<0Mz6b=Me_l%!#h>3W z8A)3K)KId*2wlu53faM2#{fx{&b1OUIo<1g+pCKJ@xsF*}-nlSlvfIG{M=ybHV~{!vo4mfJOYt z?fEq|C#{gD^pvGE)yV*;_S%CsEQJ}xA}&Q$0w6q**&3t>RLG|dzDvi_OK83-K5on0 zVQMCviAH#eXEwL~fQSL;gI(u%PIA|w(beUCa)tngbLfZXDNqBzZn*;j0Q@fvtOMYZ zv$Z)`u9m*S6eajxh(h}wzika;R$SoV{ygmt ziT&vg;FmK`$)v1V4BS?#_0h4c6QF4ekIFH{cKQpZLBb+pSe=6PD{*ARv~@ew@s(=L z6BTN%jllAqEbhPX}lol61Q{W!@h!}QYT zrJiP^uMx0!5N~|X+g%;y7n&$PJ-LQDo|}h+!a~!f=HSK@Ma$!10zBy#n$ri zq$&xpyc}%!sYnCHWA%C(kZTltPWM(5DFyOYkfvQN8b?7O6s}K5YwE-!OC`+Qv^5pcSny7E8ki zIcHW0_P8#GzV71N6WE=r`n;mtJ(;(6wMTrO9sr*jX;H=ZqK}aX#U!Cz_~Ntg36RxI zj}fWDBN>4L$ftWnAd8&JjJ$4i9ZrYa7JNGO!Vp)*?{AUsv@O%!X?croq$(K4>$)F? zBZiRR=>j#**V>J9#n%xu>M4Me`|wgY@R^qG9tW`-|EpQsoZQkc@K4nW_9{hnn+kMI zqtB=@BES=t18mzb@$uQalhEix5ls75rnsrv_Rujjfzm>*$n#9qjk9%caIM1Ml)opDu-1_Zk$Z=BnY>o6~J^J{?HRDFt zZl`1U^ALA3pxWy_wVB$N&8bhnh!hVB&4@l-B1j)r7Ph}KNun2WH?w1sRik%uaxoi- zF*J8J4_V|+80Z16Ox~Vi=KwprUQm~Uu8rCnGf`tpZs^50c2rX3uO_BO^?AD5ziM@p z!W##G|KnKfl z)joi4qOe!s7otZ{!|HX|Y^+0G3COv{@^ZOjaQ5c2YMv3G?$5%Rw!$=GsI+i)=6X6?a{Sk8$^aMe=k4*I+ zdmKIqC8LR69%U@fV{ZVj83ktn6i@ry?3h+sFDld1p2_hzF z(xo9}`y_n*Q~c$DpnVPg-#n1&^^WsS2V@NO)3~w`rz^@_2uUe=J?{R6Bi_}xfV#=K zY*Xhetz91^7~yay`-ScxCkPy0Wb9@K{7Q~C^Ccx`vEt7X(X$s+zQg}T)jcqWwnbZj zj_u^cwr$(CZQHhO+qP{dC$>*)vvc2jUDaK+f5Wb|)*N$u7=OVm1sI30pb}<+TNfb_ zqbd^01yKuXWjGIT{^No6^FPl>h!v&F{xK&0etCJ(M|skZciQ?55q7y)eeka%r9DwB4JxPq~p;QltbmYu08#2o}c|pU~-VZ16*9}$I zAzkvh=+^U^Gw`!Cd>yeJo*N7)hu{}kRcPBK(7ONQ1NGYf`aoeDHt7HPK!b+n8pj}U zeCEFQ6?`T?t&2&|M@U%z=L6w=x8A=3vGkXnw!B}WY@tiU`~!l}=8ApI)pqZ^HwLsW zzSW{9Bz(@druw;MlLp<|#0xyelK5s!^_k1cmX6uooO12HjE5!dqKT9 zN}&AC(rc<|GsC^HBcfe6jrm;wPTTp|sJ)B@26w&Us{d&5S9s4^78qtnZ z{Md;jc}H_Im2Q(PBkx`ws68W4xw(c0rNmR3G#Qj2+&iFx+foA|PY1*x3at!|Lr7Bp zmj_z48k_y)fmriq*lzfqk(>+@n>hVkuadA{vB5J6@@kSEX^COOX(1pXbgp7DPP`8g z(QgX?)>+)?#lZ5wgc@`@dE^Jqn7gCHdF4r@9oPn^v5hT;y=vF`4+|I;CAz^%j7)NY zl>rF=`33Ra{yufd)j?}*m0{C)DXKuy%3$r2@U;*_HVlvq&rPs#qa5zbVWph!r&rvf zd?xcyn^hqIyDT2IC}zz9!8lT84kl9`ZjLHiOiN$XET&3RNChu*0!iApC!gFCgQlf< zfV@e|^cFO2wtyhEA5$s}V=u7q^$1prlm6R2_iR_nz=B<<~`NmXc4PRe(&vDlOc zID`CqgWy#9ApTMT#rcYZ(NxA6j!=xoqMPVRQ5KshQui5D0@>r#*M8C?i|mlLLZ|IY zFC3h`@W}E1pzIRSd0vuI*VmOQlo2J4(}#*CXHFtX5Cq(rfG2?O{(n4BFJnwY)QR8V z4+n?cDU_$OrFS^BhbQbMqoHOQY*+Ad(I~8Bp%&gryKaJZsX-C0N%=T3Hgpn=U_ZDw z&nAFCd3%?0dpz(T)l6A_Sf*guUmnPS*8zi2NoXO)AV?jebQmNPAg3ZBI+NOo>Afdt z5>A9=!qnKabctMWL#2FJooVCO&A!+?!p@XueFv$|VZ)T>m;#M#8ir;HEvu&!83oBx zHk35=gd+JRbdEMi#1afB_6>rHHr08OM#IgsXgAu&%D{3JD790-B2JN$FvFJ6C#9nk zQV&sF^63s)iSwP_Esw*2wafx9eZ)<(Lnel$6!R()qfKHYVQ(N)1(Hr?LJR5po}ftA zv~EkYD!5v~^i@d3%Sh1?yu+#*oSi*`Er|ZGcFVYbg+u`cz;}2WrmtZBTX01|e+i$i zolpZiV1E!>bYqZwoXMm{_ep6nXODd)hb57 zX+#DH+<4*9n?>Ly8+HE610i;B!RHrX;h5eB>br$6Fii^~Op*Ozj%nv;&C(V6wTBkp-M)F40Ui9I3j+Ar#k z{pEoSuqEC74N3j-Cp1aa08lsQV{4!x>wbA4qL0NfYL*c+Zc2Ttq8b?mxQC>s*uXcY zW5R3a>2~skc%g$!^-A+S;G+|2ErC5qCeC!>hJEQQFf!$sVN}bM8M}_~^}xbVT+R2B zWtB9=cV4h74bD>mh~0HuB#;57zLYhB4+eTuTsbM3ihu1nG*wN)?xS~MmvUYoSM{Q% z;qdbUhnJLn_rPryLOU0in zrpePj2;u8&<_2!|xO|<;GlM3{itC)PCWhknbi#utVcXEJf^w1{^0lHU zCK)?0Ok^tT1WAm}1_a<@iS_ZDVXaxHiV}>tO1)8*;np3ABC0 ztRoC_RZ*(QCoC<7U0nwmLfGU+K?fN<>A?7u`uT z7&FYKKC7-8zicP+g(r#=O9~V>B_wawu@=KKcL=%-wPhIT3dD-SE6B2>8}~r?B$bzE z64Tr;lEAQlAE3#(6D1Qs9?pz8b6Ja#V|Yz7$RV-lqJDt7{24=aCj62G{Mi>Da1N~r zFLew{L@byGR~NY1OH*Vx>Z(#agJ5)yTWa2p&UkjUf~lvd@X;gUanqwnPDK+ZDoMmE z?VPax!0k*sZ&EA>R9NjMV4Gt#F{#)_RPCGs!spLR|4Bl#gqJ6L%?IPJSmD=SVR`ma zbk8Of*xi|=MH1Z~{jkdG>AIh7yI$=6SpJ9*E|kj9^j1Fj$ahMve>8fB+!Wygd^@#- zRpJh-1ml#9}a|2s6Lx!3RUPu)ZsL9Vky$-sf;Mpj2KCZ zSDP_>G+C!P!veRA+!UW1{28D-s`XZ>AMTh(Ak8(e(#;Q^0~YZh4k@ITY7nkg93>f* z!ZM0A0&WNoB;80JgkPv#Njhvj2~os49`82IObO18AJy&eqD+|yFs|RQn2$nEii;Va|Naz_^lm5Xe#-dZd77wb)mB1n5~f^wq+3#;`b2!P#gq=qdFKRLzVbR=i4bM zxdcW6ZtYzv9-G6!PsFJEXBcQJ7>!gMh_020Ei)wH$s7Zdj-=D(z}Ry;2>pFPe0m(f z=))iyfm44l;s&W-9S+NE3ahS4!x{IMm_%jf^I0&A>v78ZwoMvr5 zGuo~>hepNp^uy2PWO84bK!v#d-2A&7ud&l(l9|n}%aIxA#CD=QsVku^B{Rj%z(7!P zY|!Ams{-T-%5FSEvk#93`5W7VxAO$Vc;zGGx&<$S$QS7)TctL-!0&5e^Z4)>QLZskaQw8q@)Y zh-(e+QqYdmV9%Fe)*@{&(UI`v&VO<1`QiX%`^ zERch-T*HMbxg>>A$0K5ikAlgc6{)$#7Utoq40UHV*}`_~(v7)El*#|7I2 zI`(zD1bvESvQP?_s+6{uQuFT}Zg{o6ZqS^MSASk!ap?5PXm`XSg}nd18FBI`V0k88 zB0U^A2GJOhw?`mpDLpjLs@``?1F!8QTvFss(vO!2oUIYS(|CWq_3{LPP#v97tj2 zKOE=@dmO9K0TowN_&*#-gS4ChrJRj&nX-KFzJ<_qHIkzyn3Q0_V0u5xD?oxjm9;5dY&W{&|Kxzt(*3qu#_vl{bn7pD` zVHq^{FCicVZ_E3tPwI(-_*L7fJ#-8cb!f-@YP2)GVmkD0^IQ?Rb};D*$j5_C+oeN5 z5)E#rUZty*2V7GlVVY^CB4?!4{_%UO)Pf<$*)Dy0_r9H>AzXeRq*7dd2pk6ZwtC;YHBEq3If|*uz@k z|8!zc{McmG!FRohg%}yP6G+W+92Sb^IQz|ZQ+&|;rvp*${f`b*t(_3AUn0uM_m4Rk z-7y5W>4(E?#G_V!%3u^Pujq~!qXXD|>46H_GyhJkh%Do_D3)&{Aq|obq{p>nh@Qv~ z_LPPT=Pj%r!+@;haYplha3Bb_LlM{1hIw&im|~#6==cBOKy*4bJKk&DZTd1A$GvXg z|KUK}-amAhI6w=4WNNSSR=MBu+_ZOfmzOET5W0M}ej>=2^TFE)PLjQm$VtJN57o(pAu;JSNC zjAW|05%&Yyxc+=v@>zw#s<6@fTlDNk*%D?Tt@2x<*(J4`)xE2zIYK-*-gEI$WceHl zvSevl>Fn&c>_}|27OTOFF%akCYkwx-O#Fh119QQ1e>@nHFj5$w0R;6ig=+Dw!sir1z`lN6|~9bVge_LK+-!&8}eFEiXqKRC$l+dI)X>Jby2hM8K3)M(eynVcW_jvcG1g$SAWgOf zQIs!P-~QS_y?R~E{J%C(_)T!$9NW)#uh+Y7JM{|Tc=LZYkU2+INd>-H2-5frdjP;B z=@^nh$bU8v{6(~Wx1P+2WGXB+@~elC8>|Ppjn@jjT_sJaisXK>k>{@s^z>^3onB3P z&<>ryJ5@TfDgN3(k*y}BNsn2O&_3QQ7xZ5nRL~FY^i^37JjsU{G}yJA}zSaVJo5&Uatk>p-{3YS8hk(j9Zy6L4#TP#f;-1XZ9OdbX;Ru z*Pb-*P;Hk9`GrAu8S%CGB}g}$kmwPB1ZXx*;GC~nYRXMobP&|FR2DK@dux27ns#C` z*?uFLTTXtjS!Ic1?2;~T*CP*9aeXD`II;PRr7fY!rXTI`vAgbWHs7UQh7&sO5R-J6 zrrB$=>Jh*V(jP7jcbH;kmDo_1)dRPAA6Zgx0IBCY?kdYor{xe%s#Q|n}&+V`HQg)4lWC%$e)96!q&pDJz+4jy;dC29$shA8X^i?AMX zHf?KrRaqTrULNMjAn#D7VWYL)%9yLM4*s%LhYNY3UO)q_z&k|s;={Ids?f73B z2>(v?KN?7o$>5HEeaPFIgA*&3{JITQkc|VRYgX19=Bi)3fb9+!Rji{L>ufB1KiTEc z8f+sTDk{#kUtsR_b_wq=a9rhwWe{SR%vqe1SeN72zpRr ztIN@(Go5Ng+=>~A@Jfv*>0Bb+Fj)9V>hZ*}B}l;wmwPDpGVeovWuV8YG*~Wli#h`3 zS8sn_*FtSw8jq%zCK zH-sT{%2}Fatu#ZTHpOob>Mur+^ge;uHtQ&?ejK>=9TJ!U3y!e|&iD_PAV&XxWgvt9 zlYw@BWuOaECX6o?VhAL$T!vp6s7Mcs>HR+$$Wds@9IZ3e4i)2YGu0Ub@jn^p?1;9g zFESrHV*HXSyoMv^Q%T?DgZ;6t=Y1u`tm?W(lO9VBpCO=5mr6FHIvse3HIl5Z(Gk-jW;Ic!QojbRX0iVO zBd_7z{jsSeF`>*Jr6{{~(ewIEv%_gp2dBNE@~za>3gKpl5K6Cnx_*A39pgqPc$+n= z(4aJEhyap{@qGT>N|C)OByN{IwEqjrg>N}lf)4ViIXTAJPE9OI!jTO2tzwGi9ezZog{_e1AA1>nFVQVJ%zp3MFB_wuafvO+m8z%Z>wT7b*@_As; zPe#da-?6L@*BdVrl%9zS4A{Jh%&4=Z)gb)RAfRcBjI z#?=v`W5w1GFqauq9=1c_3P%vtbV|kc9*f%4tXeD##7wTrJrU1kbb~%4R)HoM!IquJ zujSz19x*Q^Jbfzin}!^eNa1e_aUMkM-{h+k}d zK63<~b~YflKhV62gz^xMP5Hrr-{oH^g$8a#IVqG9IB?&HhPPsv_NBSxDUN?R-a(1C zA1`oUKO9S1^K=B^>-eW~zW;il#9rcag-`ovJv}|< zS_2u|s<{*?zITMd-BIv%H*y6##@rFQ-gjs|bVGwMwsNrSUvS@M^eh}K>@4*0*z3CJ zXLInRVx&o572nNF+#DRq!d`cmH*S^WEYXP*3io)DWJwfIW+*+n)N0seD{km@#|wn- zCY5{{vUYf0r}@r@WFSVo5?pUlvXxUOr<1Idz2vHbjTwSme5yyovCb-FB@C#e+nyl`dl(-7`Z}-=qpTMde?+ z5NzR_ftiEKrLW3x;#L&AKd^p}nHIWE<;$60@=g$zD>&U0(4p_CnZUkT`qan5c24sA z8Oa?(KQW2ye7p5N&PKIGplh`k z=Lb>5XtsXcd=Z`#eS+dsY3c(#DLRg{fW&yBzNTPwaS||r4_Q$QOciWIs+*Iaq9vS{ zmk3o~K+iUKH1~yQhbvy7i4+jjk|svf>{QFq*fGCl4!W5-nV!O)6W(eD!9KohoeK&O zxBljEd?*ip!yB|Jl~?A&l{XQ3xN$*(KK_Od;4|QVpzv3KSn7J#yIvVTOnX1PV_42# z9^bq4an&Z5-5rU>1exo0bak`$b-%I{-gTM-OdOBt&D+y11^OO;n_VxOKvohoEvF}E zm9xt>0RX*Nn7eD^sd5Hvg~6-Z<&>-spY5Hda}T+fOs0pW3I>vn-bij^iA_nVZI|J7 zl-)G!r{2l{dK;b9UqI=bstH(bbDhz4H+{a=UC$(4hh)wCMWYBl3476hf~zmbxR2J$ z)AR7wlm$aEXsCfG2SB>yC6ohT0g%9)uizW_3{~a*n%YR|qvPvd{QSQEy-~Z9Y`xgB z@_CfKZ9BZonGH2?IPh;Pgep{a6&g7WoB15yfzM#AYRsaZ+@QM@Nu{JCs@5{_C`Dp~ zVX1k-Ycu&9{Dc=rcjvFSPn5+bFvp3(lb8!muuf!v3Q7%Es{tVgBtw>PiH>upzrU1+|%;tcr*gsF@n;Y zA9sySMeQ~&n133Fbx632TT9+xJ#_OpWfl_^z<#znr-UqWKiFR_fg-wmHDVLtWEu4_^%n$@0KQzZjM|jZoi3LNf(g)qC$4`BXS~j7~6w zRFn%&TbNNO=zM*^&&c&6oG)Q@d>Tj(;J@iTLu{rWk&t;2`_aDwe+7_OEk@+K!Nk0v z^Uo>M_}{@_0BDtCQ)mGs)=C<1dM`u~*9 z;8gOjh4Wy}3zNB8^fL-?RtSdp_fXFHL3gP%o1DG+vH;vn$FcxiZ{b)Mx!dO63bA4t-iIY@}I%M zi?`Dm->igi`U$_HEdYD#IbfJhA0GRR9bOhGnV76h=|q+qRlqdQa<@0UMdTlF%mkEh z`r|-cIG_VQO6ae3V}2BZA&t+8;LTp+*GX>xE0# zPw#G*^cQ@NWQJ+B+Lf{*6Lc~iciVygV^-r@LDq*<#@*BAP*Kn$DQm3px2OLVXvSn+O#}A{8nJ_ZtsV@$CJ#ZS*^qoyY@2vg=p#0 zEp~vy;1RSIy3~H>HvH3+0io_ZnY3@xKFg3H#i?{QXiu1ciy>YnniyN?6b%q@j!J}L ztO``lCUQ@CSb&HXpzO=8M}7pBBgo6gd?_7UEGnAFL6#;3FKvT2aSRrkSmDf@0acI4 zUxN;bZ%JYgz(@o4stPo5RaCKbsvoHZX>JNsVFR3-(GSpuc)BbFdtRcCvHr|(u3w=8 z8*@8x1u_1x3sE8#9=jb1jC$ceam$De4yHsDbhdBIn=t-MG!P914Y^|13oP3AtImf4 zC!I}vZ|&{v9o;VR`?|Zcv(58y1X9Hvg*i=BI6LZpEQt@CvBP0!Midc1f-g=id?Ql9 z6@y}@pieu`oL21$xa>PrOf`PZbz#oB&zLT}d!4^Yrw!zDlh}};FFpI8D3=}6(!Pqd1 z!o)}l2ZAB zP_2q*S+Be3F-(cr--)j|x0a~k&`Q*Ns%*8{THEY(F|zwOiRq!)WP^jZ#$(W+cs?jm zn?V5&3sYel`7SmrUBfagGbc%T?aK?wFr?eu3IY~V`l>+e?#LJ^Z2R!yGAM7Cl^W5t ztYk-!vjbNWxwmU~Y+vnrN+e4}EL5K;SGWe9!jZAs+B9f!sIy0gXIe*$Y_)1K0$j*k zkj;Z3k6e$H?{;$K>MZud3FVs1C(OKtVBxgvBI>spau>_KidDY>i2~ka-Bmc@?gW&h z#}sSbL+p394BTkjzvFlAzn(9)zUkK0LD~#Kq^q#+I+}tiW&v8M3XBvSEQZPNM?45BYoT~tluJj_1X3WO+A)O|0Is<1fSW{7dDHSV@U zFB!B=&_xbw`)eYRsG0i-7NJ9*&4STvK%D_-(*IecYBR?&!RWAEI{aE|Y=|$3C~u|L zO^Hn(be~_bSuL{W&2ouUGyWQ-PmO?KEJuheV7;U`IP33 zGR)XMU4hmunaxrXZ#QMtJ2m-{f{7g9&gw~Xk73P-9ygq)=WwqNXDDtZgK{B-9vULD zAWK-wJ)eVxCuwUVgo3pr(MHFF15TfRynS}3jPP9V!$q4k((cx%-A+ljw z6RiRIAn_ts%LF2?{~o0Brn+C#nXd2FD0qbQ4Gh2L=H%-|Ti$1cs1yGMNsetHc72&a z!3Wg|6T*uW5|jisTMHiB)u^w~l%N};X;4Il_U{4<5Z+A5aX2z*)c7r$-!Bhal(_4N z4Dc&SqQ|D<`0RAOQ0ITI1VEeThdpsgz8>{P^?gf}-1c1e=E_)4!j=7YT@AfO?eh!D zap;)2dU2Xt?W{D)$us&D7d9J$j`m&zd=YFmo{Tw|x!9QIK7Thh!#ho`5W)w&+{t+Z zJ@@f!`o3-F99Vv&`L%DRXc!#mSJig5|e`7pTFBGU1*rRq=ap zm~by`Cg6(M3g7hW27K#x_4&-{@tHF)5=o9pN3J#yG!YGXM1x@n7SARM1_TZ06ZW2i z0U;3eMbQ^x=VK8r+K8=@2?VVnU4q4-AsCFsAev?hf)Oql$eRg8%BZ0*0pwqwbA|Y3s^&FF#dqa$GK$vaRt**9umaSc*kf=r-p_t4RK`}ejI3G7PLA-+U5~*UWeo@XnH{)RJ!I-aTodp0cds4A57H8UptlO=x zXU9iAAJP{aqGtsE_@Sm6WlN-Q z!x>P9&;{yhLVJ>I9Rvt=U6_ki1b^F^fL%Fd529|BR4PRg(fQKX$|zn!2EKANv z^wStay`PZH=5u_9KNF?R9l{C+4HACt#(M5hpw0KL`enU_pwXuUa@&`yn;Fpxo(Fi| z>0`X>h;b;Ft5o`kK)`9y9N!l!fpoWGLJk>2)WJYbSrl(5z%R+7qIM@|Ia#h7?w^te zI?u+3J5;ONMmH5}E;bJxjZDP~r*OX%KSt$pjt-eiuR!I5|u})}bMDUosX?WR5}$ z+r!a4B@m9*&`M3|kVI$4`>j_jQ?_u#ejlW)4az;miXLt<-+7?4|7DzAda_eiLYs%R2GIuf4b0Ksjsu#T{XE5qx979v#p$ zDLtns{Va&sjsRFf9oo6bF7by%)ukbeLTpHQsQS2(W{G^V=4DfwOR)zXkF^ycFfqz09-`8{8&J>C;}qx?EGgvts7AxI&e>0&J>wqZ4JbNQCUFCM^WwX#CM9a}*d~`uHApIv#W&UnE*yAF!m>nBIy%zHVyiYHy@S zy(@1psy#75t?4gS<($E9iK;RKC1-z8Aqe;Dvr+o%AhzjCH#I@c^ruwwNiT4p2k(c1 z>25d0{sP=fMeS}Xd>WP5Vm!fhFWcr86l@$h6m32V7nzP1S7zrRG+4`+`PQD?BHGMvtEtpI{L2SUdDpV zt~Aw29|bmcu069io`gq`o?>+eo46CWhue>&1UC21Ca8*wX$zP9g??-HN<)7mNR5Xx z0M4|4Z+?b@a|Did9~~|(FgW*WA)BmCfc{Xz^-MQrN1F=*=K;d`5z>aMw13fNfR#&^ z0uA%MHdfk6x|3e~{1NjDhT&LZR@?CCRjA!ZBFJrYN5o9mO@;(5fA{E=SqR8!_Xd(ZhTFq)%QKED!V#8rv z7A|DInwb79fyK>wWns{AjvXF0N%8pC?mhip<=gWdJmiL z8RnQf2xyR)={8@jR5hdu%*m0=-&HoRZ31=-5dxDsyk5I^>x?*YAzien<^;EbE!}H| zbM692FA-QZv}xMGPZL`SCV;us9+_VCN*5-6=*=ib()=Hl<)|ze5x%tR{x5877!6Wt zH%meF57MBC*?GqVl3Y64L)2KxA6Rat&V^XZ7UnSWs7PHiY4O@%Zda9xggO9Tw5jaD zS899dEu4Ai#hmC}CYN-=neZtrZt$K-#x?gcgxA)4hdL#Ph!{rOiqtj;7P5j*;OliQ zo%Wnsg$+$KTXerszunM|Ej^d}cp+H%F#2-kRYOgkf(NJ=g;dW{&3?XN>px4&u~J{K z1W$Iu3w8&mZox-bEofTj=2LageVw57>S-IUMQoyoCH_D*X5yy`T!x#pi4q0XJ@MxW z#sXN03{I3~xcs6(S>+1`7tC5#)16wrL~!UvtXN^~A23EpeyLQg^5`~CQd)DOe!Rhd z2?wpskB9>0X^bmTfYi9tlJB|pA zUWJ0pNu}I|D=2AhD>nX;Hz@_NkhrVhKHF1@H5nx7l;xbFuuI#RLoi|^_XD8}2H?-x zxBrX^WT*5#tI(f?dJSy@>b!NumYHSNU)*7Y?XNVTmjPNR6Vkz}N@;l`k&J{8y9L5i z&P^oMmbWwqR^<$b!yav;Y!CAHr%n+s304~J7qc9yXy=>(hCVX?vcchEke4FMhc4&Bc26?^q#za_(2nVr*j&*T6}N>k?L6HcNa5(|sg;jK9h%;NxDk1j*5(TXZS_ z37I>tT@k94*Th8eieIk35w^x5Ku%CvBTInr24bQl{Yp~oBrF^N!w9uIFN!*r{@zc* zGKi}h*<#cBH*o>ZB2uHG8{(wmYr@!@>o+F(E_M(au0n@6_O^k$=N?lJ!HRw*|Cv{` zfrW4Jo633x1}UTt|7_xd!(MB->AzlmHu-HXZ1DT3{W5~VrKlh#z>AZR-#4h(dZ=8N z5pDHJh*z^uzBC^r-u=7|*Fom|dw3|og}f>vZl4!6q|>ZsSVs69Vfc5PzQIkl3|t1) zI-x$ID^s3K(a(LSRKdSj*9oql;RNM*5R|^u8!i7YqWnZCJq;I67r7#oGqpx0BCeB! zk_dg|IpOY?4pZ`ZTv^yHKR8%YX+W=b#TBtDhS&2b^_Q~fc8u%C0lX{?|F8`0Kju&H zbj;%%&bFu|G;F6#*v22}F8E!!vfs32tx)`Y*M~^voWiI)u`<4$??95U8WttcF*8x-{zkQWjJwRbmB?e!YOBka;z9O;KOYP_ zr$GQLGhA(3ST}V(&)1V&yPZ-G*J?LpNYZl+1lP{2Y2Y*e0=%{o7A|Dr6>!dk+rSO0 z<+gD>xlUwOD-VV2g#9>^%CZ*HuXh`ozladhKPh_VS`j>vDk_Iz}_PMDUw zEf+v>gQ$|kb9&DVcwjEn9}V2xygzB{^vV_43%HJi66t6QbU*KP8ZO8Zb6?P@TcD z=b*MAttchB2gpaf-u2z?7Qauy=>?X)9~`xYfx<#K9nfl-c)ZrW z?v)-F4+_J;*Bx`R)!Uxb0$&bi)L@fR^hy3l?C5l}4R#2_=Gu0V9SIOYG(oVuY#9pT zzeo8!MoSk6#bB;K(CPTB?%V*}4PO>Z++PIVn3%5)XOFtvpP>3$4^u)_(DhUWQYGYyE`^HrIm?-$$P^OK8VRc=4c-R`v>XSNTQ3`^xR zPaI+ovMm9tQKQu*4IEp4%T#BxHNQSSu0N4-gHBpLep4pDZQkxO+sIPagF*mi*%7wH zD=s&@+P==Z+=qZ~TOmx>5nNPiyCLo!tG2Xsyo~zfCs*5l$q0zomfgM(IvtH@aF!Kk zP~k0oTzl+3;e^}HspH01s8E8e;L%0sXQ0%D=c{b}fwdCNk?ztK!l(C5+ru119nC-X zj-COh#Zg7ylfGvA^@T+nLBSVXyuug5H7kfgP6Pag0eBK(3TS4*vPG`MA@X#!5b*Tg zw=B`~cFF_1-3M;<*|V^w%R=mVX+Rz=xQgZM9`kv3Mqljf{fy~j=aO~fpWup%hEY+= zqbl8*w_<>*Md9sTOcWZCMj*KUF{if=gPR)6{cB%ajDsTAux_OHkCA+$Z@DALD;H1A zBzP_nuNQeYp-Klfo-n4mY%*lh3Ke7_5bdn&UG3rs^UPq@tHhu}<(J+_SHd*DusC`K zn2Y(>z5CUGvTtVXJ)td{%Yxgck<0CIZLMTxWiZS5+dh~2Pc>+M1KVo(%E`vy@eJV4 zbP;B#VnvO_Ct7;dHSBIoU}g1}$EmFwSSO9qOW(x6f)^JqZVu$(QhbxP%$~$yQ^|n= znXB?yUs_W@B`mi{3}*MPKU;SgD9Jb5#l5-%w}LhMUsBh+kVe)~Soq4Z16`_y(4JUy zS~Edkl3Z(vEMS|JYLARy$yEP0LbRId?%bS;E0Wcuq1$6yWy?3AP*60>JsDLtIt`#` z?K=mozvucW{cXm9usfLP0Me*aZ<8=vcSvvNFyB!Q6`bI^rlkaM_KPqaQ}4pxh+hwX zn3buJFJp+xwDU)w-FfV3;@deHYNp<|27)3G25{to$N~aU_QRzWy(nNu)e)Zxmf9f^ z5b*F+L)|9V$hp(SSt3m=PYS+mM;{6sbUv+CvV;Qxy#>d*zoQ#x*8uvF78{UKm4Psn z#Wqxtux)aVH6zY46jzefjiwjx?(qGU57p*^){0X{l)XrAj0d)qF~i-{i7jQ92e2I^ z4~o2oi*X%<0Zm%^6VE-e)!ZUER9E7!$<1)Hv^XUo4xOrb>Khge)!8UM2xH z*4M?oHMdu(VMAbu?d674EbwIpe_$V9#=>A#2?N+lP@fglae7NO))YwRu?M^}@V=LB zS$Ns>4WC?D%RusTV9)WoG0XdHDpNwmIK5kt0>NsY-9eVDUeH?u zwwi;~&-w97=0DK9DGbqhcK<5YU;VYci+Vp_t(iu$9Xi%tQBL4z?l4I8@2ia4(C!48uB`Cg)?C z{#Fb62bKwpv!U?A1b|KQ$S_mJqcFD?9K|W^MOVm5fh1K=?Am%jBgOcCb4wHn_HC9W zoDeIkLSe9zAx;$A?2z1BP$f5iFETd3o`I`fQjutuOdT9Ye;*qJyP-gFnZ}|@jcD80 z%9ZHGF?_iJb$=ouPy>;B92{pj;#z}xlQ45@-$`xJC{Fl}n`TPJz=*e|(sXG(QjWse zxa+*8$k?KN&$0GFJ6IO2#eE=bvf#fEFk^46hdL#bny-v(=`=I#scTs|kHCVqR~f+x zxOQe(2w(9b)In@$(yIbTLTw);quitk0#^Ul-U(`wtKd6_1FbzeV8l&FePfiTzuMJ=1td`9G@~ z*I~9(@ix1EuX%OZH@SW8@E*oGOoHV6+96S02%d}JB7=u62#U6)XCf`61C{m5D)>#q z6oaQr!xlv!3Q@;LH#h0miEIanBh519Jhn_7#x)Ow5v@o5Ee<)3~`9=W)Iro zWoxsa7)Zo)MyfcZlh;3EN&cK!uU5@!XM?Z=?unvPUA!5Zlz!uxo7GC|vN)Cu)nAn5 z{JhTHU(@_un$ArtN&s>#;-;OEI?!Y;j^CXe9tPnn9rlr+RsuP<>&H5_-%|jj4tYEr zAV*pT8U-w?k#cA;6dh=Vla`N5!sN>;PTz{Ii+`iN1KIdLhpitvwPvSbw53<)r)FSsjkwLp~ zX%u~JKFpUKIWr8xv14-gwItL(ArEX=Wdk?W;LyM(*}acNH5mg1fy}2XkJ!^rNjF-+ zd)b>x#1mqaC6XUqu$!XJ&oLKf1xYUTizgN>tf%igm_eVzp8|kZmxjyF;ZMEwW@o|V z@V`ggYL*ES^>$;Y4$)swIaWcJ2~^7a<2T~f)aKr-o~k7-9#8NPFh;yc_{g1I0N~BN z$KpBpEu%B@Ld zB6Nfvn#lpHwOTVG&crD|oMq8UQU&UP6Ffz!&PtzjKEo>qmpEsTx%=?EON$O>oU;#bi3otdy zKKs$#J^30BfLrp#E>5Vm0+xQ3grz9Gbuo}n?{I5|rC0|3WlSl^-9$9;?lHV~=;hjj z$Jh-^v5TCfFnC+Oa=c^ALgXr^dmBdnllTQO5z2f|gWWH{E45%wnGLQrZ$l@@*55=z zSU*Je@6M2o<-dT(sgKbkl7YV1Fywzll{JaudtTddkXIu|ECXJ>VPeuoN|1HK54yU0 zvaI>>ZpFRC{PrXq*6@B){1qwZVv*odOs(unkE`i0>!y_zaEig!2(jyFV)3~i$V8EB zBkPJ(^$C$9GC?oGcQ9em&qK_#*DA)AV2h~?WD zEa)S&ab|ng6yyl$oOQk&xUVDjzEHm)S5|Ar)nbG|s*y13#VKl}){F|Zt~`4XmYtt( zsY3)~RlX&8i(vt;6gZ-wgQg6E*hMQqkBHXIvcOxBdp*p%(5oZx&{xXkcwwQSC~wjd zdBzLD#><}AIG?*IQ5m;`iNHA2N(N{|5X@H74cI%qKH9S4ZXu`$d9(hAtS@%bgg5GL z?UC=SBTPmL+d=)g5J{w`y%{!Mb6CLoOIj3#vY}Orx;MJ8U*KuKigmRfJ%NRKsjEH{q?_rV4bEQiq(+xjOKj3Qb6Pna-B?#FiUOU$TranDB4#)Q@5I!fJUW>`SghZi6S~H5@;h7G{z)OqQ&LV8EMwO#rvcfa|2O zlZ2Sn+D1K7A2_(Us@p*GK$}-_!XSdk4j_$LR{3QS=Y=5-hX^>#@9EX;=blgVED~E! z=^cd&pnK)@O}HRhPx)XRiB0rEbK-+4Y+z2*ck%INi0Z z_22sYGM`S*Q$v`r?U52pXXVExshWCuZ9JkaPS@e zEwP_ef8f~SFmh-C%C&FyIMm9e_jw9vv_!3Bs!Vo)1N7aX>UAC}BkB@9mhd+slB--~ z;8SqfP!oWSu_I0i>!XG&;IhNs^Y)dA4|Ks{smC4N!iuDgt^1l~jK5x2hf(VyGsVW> zpb~5<*DVgMnz~UQJ1$f|o_h+@`)be1Ike!jWnp39BOl>mVX&o=vkIa@MIw%vp4cZ0 zqlpq&lKuKDBMP6a%Uu`0J?M5)0RwwAe`4(9!W3_Bt@I@9S0Hkl>dX6ssUjxJ)`8?S zZLu>B4jNHPG?Bhu#!ap4I&Ic#jf`_~r}@uPs#{I@v^X5GkiF?2=gzzS5j@TnOkDpP zu}`B|1gWC&JoJVpBlq^GnBg4`3Gc8oJo@bVrQ)WjOH|~zwcDOnjz-|UwK%W5-EgkV z^wiA6-badkhQG(g^ae6@<f4S3H;Ldd6IN_S6>-niBCG_+ZxtdJArA`;0s|@P- zz#`<)ugl<%Z5eKem+i6RDW&VOgsC3-N*tflZCBkRER~$kCbw(mP)uWNg@?({^(VJY z*nfApFWaP{m(Jhx)NC<(Y?co(syW{h7F~RP$>=?1+gci2&;MS|R#Zv>Zu;+zk36_x zR@K$UA%We}V|S%m#hFMSN^g|kWl*sS!fm)1SWy^lm8R7krQP=aWqeD~?UiUu5O4&x zYthALOIPokmLwz1Q8QM`Sl|7w&3WfBm>rd!#Bb4ae?n_70tok#yF1_7Ox~W!>kGW8%^v{S}S2u4^2(%xSs80 zm6Q6Vh~d0pZGyv&=Sgpvv#O&%DI3c^wp?jVFvIEL>-5F2GE-2r@M>kaq~wjd?Olnp z-d`?Eg3wI`hn{n?MRc#UFMGH2ey`;SE%b7prZ`;N%+a25u z`C65KjI&j%f~e@ay;X_qCjEv+Rv4JOj9tXK^$ z(ruiW3~;F#h~-e!r((88{skV9pQvQhRML2bV)_`nXe?NZ&d!%{QUxV6dt~$?#{v4A zCMcq1VDGSKceFWIRBKb)L_So46Ah!YdrA$av?VdaXF%Vywd`oOxhnF*5u;#Ls==5} z+H~ODQHogTv_+IH-0!bg)opI*w!P5tOJ9uCcLxm7U5Fh=a-~vo+G4t`yzhs`t3Wp;IFXnD0UL%*uHSJ7f!Okz zsNvJ0eSYlE%F(v43Lh!!XTrYB+vxv9lgq5v`q9eErR9R%n!=KJhqx=8j}$Gqx}}%= zH2$U?r($eadwO692(6y)_ggAXe10ECJUl*jHF$Kz>K+ZBmXA}Sd`^6=^$!%Ykp`tp zCGQ7DK`^gobMGaR3-y5Ml_CvzY|YFcxIMD1n`j}sE94SYbQ{0V^Sywhq72IipJw-h#Bq3YPbe+`nfQr2On0inG(scN z<5+6c;{t2<&Q=u2OA_(ZrIiejx#0vtMVP+&`QEP?9?yXucNs?^Hyx$~#$X5B9(R#m zzB}?I`=^7e-t>wuhk8I!Rau-Nst7gjWeL|;O8uaoHGTX5d?wmhL%{CyN=vXy8@}a7 zgTRx4bA>k*E!UNs-h*&;Gu`SspOf0|;4^@ub?GNFW`-oR)zdlJI-WIct>S#E9mU3LE(@p`R%JF)c>VrTC8e1(QWz7Yqs!caCy6s%gZBGbPdg|3BGdOxFtrPP@t z#S_`~N0Pw1%a-dGBYX{o7-?)}%nVCIbfM>Yhdrm(=GsE}otkWv{Ken>} zV=Mdrx2?=h1Q#dC;9Qk8mW9U)-)c={ZQ+1^Ihw~&F0()}^SDiJYhy>kkCBdhb!6Ed0 zuX3~JqWd2*j_P?**|5_!MLw*AEWH8O;K~X2dQw2`_+pls+nwgFKQfG9q?`S6SEw@t zjyD?qe~bw{ZDy-VPLf666(W~!D1t{!#juP1M!{?-L>Z9QMaV>GO2i65G=kb0PJ>+H z{)fc<0&fda;)SWQGsYyp7at$y7+>1yL0hju0tnglMqB^Qqw^LFBuqLi&SUJ74$X?Vu{E?!Je{bR~;QOWT{rNx+wm0srlIN<^uIiA#?{=Wbo@+B zZwFk*r$$4n5%^^`Rr+=*^v>!7vOZ7Gy-JULxTXyT7EHV5z@gdtaY#I$+0R2IpYd<& zBJ!&-Qda3oe&|{PNWM?&qdQOwU%5HUr!~q}#(1n5P=r<&oI9SjTc@KjpjGj`7Cm9% zOMx|&?>*a8sMZ!fkO{W6k(-_Pl6hzw_+U+sozn^S9 zCYrW0+}k^%TKRL>Jv&YGT`x`AYq;QWm+SxZUMxTAxn3?;dwf>B_2kSjUE6Wkh1*4+ zApQUWxghp0oIGh!(7`7Mp%iOHVsZN9`VjD@3U?8g<-P9a;Ik-7de?+2nbqNkiiKQ}aGAKs4bwGu%r3FQv4TwPyUg;l)mZB*+=)D{a zs#%H7a-7NeO;Xx!`&^P950aQVe_yVXv)yvQF$wW$lAmddVwf zyvqAtVsWPu2gn8Ft4*yBmm(FnV(Xj0)wAsLs9Npb&0?I9=mgByG0gy$ z`^Ewg5XN?#y>FGQ1kqkAMW^tTm4l=dN8QHfXv7Du?jr4<9Ae{z-`$q{wnXeEm0h8{ z$MKRI{)YN-p51Ab$(RCyaiGfROD5e}8c>BJct; zc-~eJoaL?hdtf3E=IBoWj>jKxaK)i-%KI+)8byfpHd)Q*A!%tXj4eeJag@7?iNT@X z!RZs&?FFDR0QDB}E6P*q52MygbAVzp6<9}(4K-SalfF$R5y%>$z3>$qm}i2t5ISv^ zec<5kgg}lu2C+#*W`9UbTwazbmxmTQOztThnK%w1g%Wh8`5pSX4vRaP?L!CX3N`xN zv+AY|uZleI?fqnCvN?hBQ2pT*PUh|cbHQY+S_szeKUX*aBb}>)chss8r(391h+$Gb zgp3XvLnGV;?#{OMB~aef=F}SXgQAup&ke@x2a|gnV905Oi7U%H9j+Ci0Z}pllJT8h z6c?3AZo~ZA7BCLUPd8#_;!(0dA-tkeHmJn39so&uvUY%zp}_V6TAs#+EzUdw8dBd6 z!W>XqO(8P$m!o1NW#$Qr-^>DVxQZ_rtxdeG?VGc%%$6_|Zj?c{-aJqOo}oakmJASn z43mfwvVbur8y26i59gdsvByT5?c+9xfWQI;647cB~Y0h!2^ zcrX!CNZoM>-_E?^6d@L;$f7w zjCwHRf{OGIHd#5O4tUIn1BLMO=kVR{JClq%l3P+20;L!$Hv~ZaLSXK?Y8tpiMK}8D z%mk68N;KXmK**{~@C96Z;@+8x2ZzM@vNgk zdC(DaOx(Iv&>10;{$_w9<>qHh;F~q7il+vMur?c64hmF-?1_oZJM@-p@d1A{rHq^Nk&adSF(Nu7xM z;8p0kr0dH?jhLCl)fc6ruXiYRoqYlp&j|voi4`@UK_iZLGZQrTsP}=Za_z>qHJ5+N@ zl1xtNpA@WxqPHcpsS3X05S!tAX6g zfeKhxzGdV7P&z^p`DX7{HBe_69!)zFM%-x?sWxtP6|EjAmE*;Z!SF{iUDG}jvmRy= z4G`^E8_FQr7J1-iVMaOX@V|*hO)LgnTzjcR`s>FCa1W6(|D6o{Vk zz`4qn1Pm8gA3T)YvT?b#w#Mj?1$GBtO*7ltZYP@m%(uVJzl4eAilk|IsqVaHIV9EI z8NULr^0RzBotlCvvxQawf&;O-~99kRi1wtgxh8iig8b=baDe^0|wm+ z1M{dS7zU~5MT&;Tu@0j4f$P8l$bX_Oi3qshQ@wP+98-&0raAhwb1ViEkA!*av0fr^LG7;A^10~0y8 zEYx4Qa4)uOufE1{4IQYU?*L5R)uUAnU7PndSu4ZbXisX!Iaww{YDv!{%;hfdqTB-r zLA5_Xjwtk*#J`zWditSWE5Z{6oph2h@_Q>F5V;INf!l^4y)c^ zPy4Sfm_5?tJ%3s^M!_JQ+TDH!2!qNX7*+!qO$A!ki09Z?QUmX|{C*UNs3%HY*-^c5 z&UxqgFcS#rXnz)219V8RAjYCyy zo%-K!{D)bpBxyK=PAu_~{V$=iXgxG;Tk01O;`s!~f{nnfvcJKB?Np4Nlm7XcLu(orfAyRE`6G`tJ`T26vp9)cMP)x-q`!jfALBUbVxz6o|&-T4ZBW3{T)-5u{s zC%=WowvucX`kjb{RY8wY7(p|Mbo|IdW?4xcMMl3N9wO{M z`((_@E`{xqa0zpBXzNR6K-m-yt15HTFr|LmEL?kC@a5l{wA~m0VTd6=zO|6Jg9vb> zzqd)MB{}edx)m6=@`ZOO-au!V{ zuPAD|E61a;%U!ykPlfQkwleW9Nj7aG2eF1qNty!{ig!AhzRkTg*;eyo@qK=88qc#A z;ZT8`YhCr_slyWDP(Z4*K97xH^g+V4`|wH$u+DfkxnEc@Q0DDTduz2*6Lhw#RMjigL2DH zL7ox}skwQ3q+VH-5Qme|!{0=QFZQ>Uv>X*C=s^n6y|Pszy?yNyU^ z8DL8;wz@-;NPJ&2t`fjz#q&$}*X?pUf7pNp;%c_+az4|ZXO>~2j-)@egkSq9d?lEh z4O97uFCsQxwkKHj-F92u=H7a=MmxpHfFq6$F{`+*%m00JLwq5=u!@P zROqISkTw{MU>ELg}2sNH&E^wYYGtB5_uZBbFdcn_$e^PiDhDw`%PX!$Hi}$bXg7mAx2>#XRL?FJuPRG1{YFrQ4V{%sXr_gF-ZGz&vNfol4(>-}$#vKvY zgqj%%6_}!*Im{d6&U;EqHLESJRkmhraizmfM*yaqXN67o4|)`w->Gi`wP)!cIUfqV zlvh=kXGtWG8vK@?d`MWcejA7&x#sEdCXttY_4KQN4xPFPvxj-1(XJww0nt6~icFxm zn~L_LN?rf){HYaSb)%91M(AB91}YD$AQIF@0OIN~KY24qDd(85W);;bEn(jDO~m>k zhh(X+C|?@mzp=3gL1j(Szwz9r3(@LqXR2I3RktLV@uU+2Zu=B5J$ZDL6Nn%c zrdbD@A@5%2V%Ii=7~@ZAO@wUJu*{X}lx@lIPc*JW35;T3QdA&)C#>yBG*EE6b!l^> zJgXFIY|WrOQK%>*wfjKwjH*s2IX}_2qB0)^`{@U| z;jsLS<^v?qtwX~Aoc(yYJ0ly(h+Y_$hawpCrB>LDx4y8nc}$Apo#O_Bn}X`C&&0Fh zb0xmwHy|%HC=>wrpcYI0e5{Pa-z*#qM>!R^jwMs8s1Z8S48CFuv=KzYI7Z*h2B~8B zRJ8w@k=kKTso-*3z^OFz-APR+x`kEl&|6f*jA*5RJcFSJ=K)@+A(`e#E!8XkNLNFv zr*%ZURdy@?F|xoG@#Lw{cdF`}9cC%R>v>t&@CJ$m-epL-{$5P9D^L7QhbKeJx~YH; zq>lCM{aO3cJW8ye|G^$^184-km0HQlco}l1$ng5hoWr3>RP{vyi=pC zkqcPtJ@~JG<|M<%XNSj&UJK0<(P;e$zm~Z%b822Wwm}f`&?J)|z&OD$tXA+y4JS3k zS(IVBhRl(40xTNxvoo(Fm^-nV$1;^|32lL@$Znjmi<4*Ay%pE>(dCFcMc?_0<8M25 z8ImQn$OhAb#QRiW5N~(7bE@}MQjnY0Dvwvk+g{7>K+Pm7gw)f$0`Mtj{MM7}UuEXH zl6!q+F+Pvl$7#2~FySsBOs1}Qnaw{bw8sGiU4sZOwcFW*aP($I)cpaG$kv>LQ=e)d zezGFPp+?-pVC7+P&ldu5&~O=4%U1(%CQYnopnj}f;`))mmBspXI*7`uax)pt9hIIT^&8Qtj1m>}>o&g6 zDWx$(oT7Hmm;JY-ksZZGI5Bz51q~s|1}`n~5!=oVR-Xl)xxt(~sX^A+OSj8HfbXThB#&6-xXOEQ*& z5Sf@BE`@*r;=~Tx8MXNf33>rL_uQc|RmMB}B0EBiQNAN0jVul_lGz@><_>aYik{M0 zr(@%dz{-Z(++TaYwIhLDOReP0cR5P}v|&wgf5z6L+?OO_!gh>0fUhm>3IedqH7 z6VM=oE(L1Y>0TEo^PPm6atpybmy)Lyt5?d<$*KkXS(YSb&2?kJt8X>Mnx_0 zWdaC@Hxph5h=%MxCo#-J>2$^h2lb&aS1n(Z>KM45(!Z0-r*czS^%~SGHbEK~VJ!q> zU8$!Su#oc7YF%?L^?93dvS7tfUNoWeGI0F0P04wKUGxfPGhE>#3bdA^9t{WW#yQfC)j)A-KU!OtZ%v#%KdknV+T zI1xZ6wZJNgyo6RFW#k`GnG)#S(Tf?WX&$+dih@SPoM}S|;m-<{usE9I;bv<3K=(+l zwcD7qCzA{c|6oDJJy&E)JdsE<3KHFux;wV5_m%R%XY0#7&v;Q>>~3=?^^*(PBnv~) zEQ^0w(S~w1rBlDoz0yP&EV+o`6unwksDsW?XGmL#)S;X+qeXG!DwmJ_J#y9X%mUuu z#KOs#1OPy}H*z3Y_O&Mh0mQA;hV@ zWVEn2r7<4bQFe2MAh$v#3rxp~E_E6S($@%6HOmuo z4455wz9XW*c^V(OsMVOI2APVmMSMPN@d690f|Vqtgx zo~WV+iJlom5Nb?I>hU`@qZknwK8`sO7gJra+Bs);mWR}=N?E~|wTY|Arlgw~#5i-^3lmTHfL z@CNd`b#h3V(-o1A7B+fD9bC?o{idvK_riH!(eb(%Z}|J7LYo>!2t{NQE4buI_}Mat z&ag>O7>hfzbRoiyi}gsB`$e}JWNzmurrx98$XDlEy+tL9S(y?b$O=hb!+4)&54(&w zMY&6ZPOI3fhlyY3`sTn)iWEm?hf0jotnhL9Z=L;dObeHtk?OU`#WL<{lMqUmLYhIo zj}6m$D?pPqgHXTNzn>tClgVt>^+JKY0VGDd6|~nQ+?j7NYP1^CkR>tF=|)8?T>PFj z)i;&pu$3u0+clA{(YrGB9eF*)Otnn1$~9Jai=~CI3#yf@mWTKn9afK5b$liff+#wD zDxaM}*d|DizXYfc9o{l#2M!#o5tZRIH;0%XeiIJEbN2>ucVX5em=ZEGEO+TA>;)5J zyh80>5arYl$QzT?r{`eGtK+4MDSZ2IF*10@Xlm$j>_Pxxp^tJg&f2Q3h3N6_YW4%o zWMh|P?Dac=whr{R@uG&}9xP41Evui#s6Q3l)f0>exmPN-n*ur((uiSqt+60U1xQt2 zNnX{g)RL2pC+%WC&Vh16FreL}Av?o9Uzr1lS{jvdbDKqVLS`i@5@HJX?=4Z6MP%&` zV@9qzSe}OM`?rN4W%Er}8GJ2j;>)TmlnBueOJOzy95Pi|Ti|0w(07_Bv8r$n4B3Ts zEEI2cr+4gL7d}fU-X^wRM>r@xB_ah#o4UN9fY0K$te+?$hSKemLw;)?-FYTPd=|;7VnO^4&~4+P{NNZOxcE zVfbme#BS5n^63C6eoywWFufk>?6hOq!bDlWwgDOxusya>>&~j(NgL{!L`f$!kAmix z{fJU}Ou77R3nou5&)gxGFh2o@(;&=fPjPnCE&Ncy}eETpF@aX#(4Z`(vtZx zzbhqTQ1YPX{kP2d$@_<83_RSz} zuGB$z#D0Y_XNP7#5a%+{BxhKcIcc8(3AT~>bu0gHR^O;3Z7=Y}?isq8&X3lz`?^f9 zhp@N}Wt*KJHzMn?upM9O}e zUn0XvnM+tAoZTZ>7!P>UbI}9Rf&e{m`>t-kHCbt-MjAYS@i6QT$kIK{cS_wga%}7C z6HSK|W5CLcc{Kyk(#ANeIdi~3<_;uStPk|Qx*>Jf@K^yN&Cg6l807)E0rUE6tJtLQ z9Lz{Q%ClW&3kuYK@@=YJfyysE1+P+@YmJ+Er_(zaE8K%4{x>93fohi9t#BZpu6j*J zKTS{X9c4zqf|cg@c8AU)3+F7O*7lRu;iL{`If}_jwx?;oxx*ct93^qtuJC``&Dz$_ z5Y9%qTY#IUQ$uP5zo}iaV&B=LlT)hO<@g-rb_@gP_98G^@fP}4{*F%C1Sxm?o7Z+V zeY?}$${^o@X3Hp{S45bCzv+|2lh2Q@P3z(9xqofUw5IGI-?&roBT-ux&IYzHNN6Nb z_3?Kz{h!%ly56ouZ(m1CrYpDMtv6f0StQwqj)Uu*8BzY*K|tq>w9{g!Qb(mBQ=k8_ zn>pYdt6{nQNHwLlY5up}Obkc$-L@+SLRrB|ur;EzEi^!mhO)`_@nJl1jcwnY^sn_^ zB@+v#gJK$|IcJIVk)RyqpR7a|QOLZ$ff{BSAK>lAG zdg1g7Ua_F3!u1qTj{3ZPr_7Sp$UMeT^Jfl_#=VgOZDW~q%x3wkq3Jqm>00q*)$BIe zboK1H99xgO4X*L<3iHSu;75VXzD|7i^9%ovnWZ3frM?vQ{Jpb5U%p?=tXEamt{(-) z941caw(tm8;bAH}H#ArkjZPGU-(pxdeHF<*g)HSUZhLSj4&K1m9pYjK4+q8#x}!mp zoqKIS?BRMvty;}tL<*?-btsw0)s+fXk|Ve|?aRDWJ~~H=_ZGAcs24+!tU&0kFi3;u zr?Bw@Wy?dUG|B&_v5noAlv5q z-1WT?ILf1xOTveY>sohRVmK}^Tr60ncm-TMdAG1B0K8sL63I-|-Do0;{giv9tb_`+ z#u$Dy%+9=@RDzEu7dO23Vn$k7JtK+9r=UN46QY4+<=^GtAWEjbCST6aOFg=+I4p(9 zjOzDm(O`qzo%4PCdlYa=8^L>KO%xBVQ=3k#zB(JY5(C*?*USD7s|KZyuk)N_TQH^S ztQu)p+3v%MAW}CAyMh&tBU_AC*g6#18pecfZ-nTNl}zr6{Sl{`BpjD3EW^D%d+I?t zEt)t8U!LvrbJ!0NV(q*O2NPN3Uw5a$`{d~z$X^S1f3om?LEcq5QwR_0Pag&Y~E&*ryf@^hbAIh56}Ox zWT?kOcqyt&@iurCADGprR+u*8LfK9|e%MJxm46s)P0f5aPN~vNflJqJcd1|(fm2yE zr2^u9RyGs{rtZw2)AKi51w=bmjF7fcjQteZYkLr2lWG^Zsf{&m3>p+N+qF$VZ0 zC2}V8T>|$F(izSKsK)KzdP6as$BXJNN7GmF#$mo$>P?M?0>%X6+_IA{e4-ySSe|Y$ zHq`^r&y|0`eICCN3G0GP&;^MZ0u;A3w5@}vY1t~!81K6`Bg0-QMpC?eKpq<{&)xGbx9or%)%6AT+ITeyqE$NPtlR4t5q zq?HMdTW?XR3o%s)jYhcwXN9~Zfn~kZSs^*H|$F4iZ%d;ed`bWg`c5Jk7W!EHVt9X zL>7P{{jtF4q*alcE$twiyibQzE@gj|ocxrS4FuyG*q^<~YaLVOH)g^m(CBB=Szrwk zLpc(N*E1&%ugerlCC)ZVn++=8HMMFtzpP+!`Isk?v~Qnu)K7)#DucYZMQN+C17;vo ze4uX5pF%8eDxR2Y--z0F8qGvjUuqb?OLtcQS{MTs0CQOFA*vNV< zK5yt25I7bCt5r@j2HkG_x2j%auP?No;NWwrR-Ml!Ltb66-46WBFz_1y$jRC8w z4-`Q$C;K0Xg-y4iDE>YoOHxFkz0Bfiud}2L8`}KHC${}0#@L$tV3!$2@@s=uT~NPS zMY;T+tn}~1z%#oe$hlGp5C!d1tFP-54$XbK`x9y{iOo991yc*tB5vqlrTebFZVqP% zaPexvbEt4}W9c1nZyop7C)5e0#H)cG0Rjh6rVGf&dpINz)W>u?XeS3_sNGRQ@SF$| zp=B0tiT#-4yCN?A+$zbD0W<0=dP=Hbvka%(y^+lC(BdRn=kbkNz?1Mg;zS=H84-X& z**m`PDnT5KZnD1&C!o?pnt)lZT+w8v;^);Z(n7V=Op(X}k00_IBeupt-eK`X`-qRd zB`GzC;RJqv8Vx4EN$06bpj6RfTZ@n+7EP5D9lVovhD#gt!rK(mSP_Ni$e}clej8n2 z@@4+tIhxNd6NYz}g2Wgo7;%1?LYM=X#)-xH7IAapEp;S$l$gMKOvz-2U>jMSRqJHe9Kt%+4Un@IVMa%%`^#v3%5uS8UoEgoqwr=vaMNsVgGQFg-xmH zP*eX#g~Fi7!b5S1yid>cq*0W$823_UzY#gWnIJel9SPKH4KKma-$SuAX;+iI5Ci7Z zwGkW1jt-V5tx1>(9j>p2xZHe_?i>102=@kxI)o#J4O%|TLrZG5shOnvtT&j+{M$or zX$UvL?ZAjBaj1>yq8xU$Dp8|*uh>PSlH5t8NCBEV0GUQO&jx;%+{Vv58p;Q%>-x|f zug}HJN#}S49`c_nY1P~05|4CuOzAx-={d@i6QpF!o6e^A77^NJ$?Y2zDbg}w-z+40 zbiE>yd68P65uNR?pD07J!r+i?im5Op4ED*c zP_Erfc)uIQbVqbGw>8Zc`}lIP2D|w3MuQ!amZf&*gk`&nW*I<&^=7Mhz3Q222_V#= za*1*TqXm-7cV{)0uveVe1649uY~!iImPLPBrPJw6VJ@<=v8iT*c&r^OY_hVusXD!V znR2Jpbq){A(eWhWSLsoqL41RYNL|Q9bauoXadf2MxmD?cTz+5u#9BGxbdj*s&2ojr z_8@5!Y>PB!46D*yAQ(EBj|88;{`qcRwj&7RAi2iO6o}%jh>7;o8-M&1{Y10};phu_ zy-Yf=mAXR30b&Tf)kzvkPL=THPS4&NGhPU|ahV-~^xESR*taBo3NYF4=C9J(?%uC3 z+1#2m8Ljt|@-bQO=KD-!Iui{Rg_k`BWYS#^_P}I_OwuOPlrKNdBGZ6eq{Y+Al80N; zj-?URkI<$v0@cHeHc6U$IErODo6ul0T$_?7(_M?2Bhy`tb@R4L2@t#qi9{`hk^Mm7 z+zTSJA`_bCie;0>wrPaOZ71g@GXtZLEE)-I;-miL-Puo@@fW1e-0oknWit|<24_q+ zc7P#HnZQK6Agc|(O1(36zjodaM_YUBtRl8hK;0rCK@ET$OS7ws*~Th8vGo8W4KhdN zae_!ZHCY<%kSvP6sNQl2iwI56S(K&H`y(_SE8|M{0YHISlvC33&}Q>+Q_F<@oko{N z@YD;&gas&xmS#5NSAfC#d>_p8qcjnRN%17!wx3B4uTYX_V(1q=h~Q~@AFn8Em)ZZbyYGebJawm-m0%jRBM#5fVr zFwA(@#mkX-qffRFc%oN@L-n|K%C9ZZ*#^uAQgR&^BAdOz%`X7&-!HqmlOasugQ)unf_BYRP`yT0LcNSf#90|p@_vrh z7tTCW{$hf9oBUTq&yrhh4yt8|wIFS1hVQ*YyDg2lLKH0WfblJH#PA_Q=5Wy`*oqlW zK_d8gqwq3tu*FDWW1ULkF)+uP5xh7>q^VI)FX*>_w(#*qyAJ8UC50+x{)YJnlg(`^ zRkC^Jsbz*Q?w__8%lfRs=?!%;D|uosAAi z4k1Q>h;!wlj!ZpZ^{lsDybKNJ@sQZAW=rC(Y(b{&xg=qylBX5eK40&E;t%E_r9yuC zhKUtrj_%b>B2DCTf_EAI7e?Btj)wt~@B0(oOQU3o!eG&df2*8uLF0vaAW8;v)WD^? z)%3tt)6?v+<99hx)0O6R)|Az%gDzaY1LmU`*2mi^_wG`NJkWE#@2s=&a(}J5c#p7_ zz5d`7mS`XcS}=U!D!c@SUOkVma0TlFL**j3eB&rdY1Ogd-u5efb{_zG~92?jqn(ciroAV)Jczo-5|VEdaat8MP1Y|gdM zkN)wykn5ynln*o+DlVAP3B{e>DOXY#hhA3V7q438sb5 zMzp42y;~*s{|iamoYSCvl|$zQX~;{}x6$do;eE3@>H?pL>9L#P)#9?3eP97XOA%M_ z>-n_N2@wwF4d;tyOgsDK(DZ}$s9%frRO!v{=)^zotzA(HbebB~=&j3Mxc$e-L>LIm zL_@PQ9|~UXXpP>++LFkjsipAuD}q?XW(u-CJH)uoM6b6(CSs-|B$)1!;z;>8A8Olf zvpqhxk|M}Kjp};aT#&0f4C-ZECXP4!6f2LDI#S&Q;tATVuCHn~J<#nkCgfz>q1l#s zdG~Z)(~rj$rl}XXJ;t?Lq0kTiUALCVY23mxPYt5e{F2H>Q4v`MlmSU#{k3V)i<=y$ z%SWcy1@sv?o^=kEfd3MsgQW#`H|yRyyus_Z$<40D1CIN5e6-rj8{-5FCDKg@;RGky zsa}eP?-nJnISOIZZ>UQlKh=T5cLWrl@l#|Ej4iUi7w!*wSozx5KEIiKKcB9Fe*u7v zr3b!NbY2pXt+yillxYy@z5RkCF>(o!m%;dsTZ9&|ykKd%p@bzN;O!eyl@ChllmE#{AmdcN*7`#gFr#6rK zkK+*dj277g$mZx$H!r+c{9-5&+g}I&Z(Lt>#|??wQBJm6>m5!MFrnr7PUPhg*!VtK z0;p=q)JH={gJ?GCkI5qsN3=U7M-5l%jhQl>q0QuSt%ht17?1$_06N3DB23$0=Jt8@ zCXwn^|2-nPwU>T%VlkjGQ*x}+H%Ra^pPSbz279QiCY}X) zsn0ir)|MFoAei;@U#}WSs~AJDe1~_a!)74w?tU)6UN+CDuYc7nKT8;mGoXxYR^J((7_i>%(SnERbhMV02e~T?F z{+F6$!+T-=<@|pTv=t=7-v36>P`@f&z!_d$-}O6S8YGWkWE_5n?HOA$}b ztGK9J@au?g8f=r%`5aTW`WcFwtr9(RUdiGHLY!^N=hS0kx9eCn{ePO+}*LMJ&oE+LS98#DeuF^6me?AL!s*g-K~9qO6{zjLt05pf{!b zlMlMSpI=&1((Sm}vd8DO{C?BirJ0_j=lk{-9OC*5fCHCOx^YotustU&IU;%h@oflx zSPE$y<@eg*^+Z0z+`d~ds;NUHQ3?20ykD%(Ng&ur<2-ft=qE`);({Zj1_+~$pyeuz z-Y#Rx5xNnd|65$Vc@aS;9Da`E0|>9Q~}M+yjQ3W zN{`@}z@Qr=u?ZiYWW7`FxYP_|DH%_{8mV+e8P%e_gs%1FJ21q5r18F`#J5grWNS!d zRS6!S1%F4w>$Z!X33;*i(G9h$=lHg)8N@{a5QHqM0eCLO2O?qmcuM4Djq6}?xK|MU ztP>+l1HZpTt23wc$lo>@-!w_M0OVG)7JxDJ5y)_tw>@x}h+W(~!>S;^Py`|djgd~H z0&wm$?+z@pr{(5g3Tq>U$sojbtuih+cGN@w>=GO4a#GJ8k&*)IKoW|DbzR57VY}^! zPCGPLv4DMr`#H_y1BWL{xF0=IH2#khDOoIyrO<#DrT}v20<#=FnK?>pxgMVvd6(-S zkOeM_T5C&^@TVqr&Y_(I-1&%;)V<{fsp}?emcN}J+fB#hH+nA7WPofg14hoYxTJFq z2o(Aa^{Bg}9h1!#+ytn1f9p7|;$I6@7Vg{su>QFB&pS6@HdBV5>B{_-{%OvqxjugA zfpcd3>c9(P447>nPyMzQF_-=kRlr{8xlqFSXIp>%drQh~@<&u2f1$^kGWKkTReALB zhBI;W)0TX3bQp3~ZK_&-pH-~IGV^t6>H5BB?yr=uGR|^?Kg2wD@h8vF+FPL0^R=L6AUQ_Z2$f5>ZbYO{iJO2TQ@HH0k5t){Bk*g?ty&QtC9WNF(LYow!w) zt9F)HM{f$I!bfECU8(y%^KN6NqIcxZlOnKJm--Kofa6dJ22nJjSCSU?ORF z=4j(@e-Y)bC$Bz#Hkp(7mhlg=_95|Nvy??AFol{5F-{X7xH;_acXxw*+ep_WF;C#i ze(Fn+aS9{3a-Jrvp8+rL5Q<;n%VIXd7ixTJ6GWomOCu&R_afdRh&mM`At+Zix>!yU z`?qq;_hbjqL$F%LJ_0m&+|qJp>smi63_?Si6;chA=+z%G65Mo)vh$&@lOXK#^R+Iv zLj_L2Q(_m>t98FE@$!xvzI+{KSOFcO=lePh`Et1P`%iq4r33N-3@{|er@*rT2&duW zn8Dx=(mX{Us4Qm@+Dv{e-cu}Ft*3x!u?0m&*@)`89&VGYh$#WbtmCJ`Z zlp4uW;F-6y7kTn&;m98+Wu@8!<$-=wytPtZj;QU;xF02MTN?kXVORcl-D0gs#kSQr zWW5^z;i`RMd`lyRxeA8LpSnsb&sa=@L~gu&(~O7WHkH(05i$G;`$&-b&k}Lw*?pvS z;78?V|1jswxV3fV#bpG2es|xua;FbNkb6i-$N}*euzPPbg)|X&m6>#>WRJQN$P%ZigHTA3NYMOf*3V2 zmc+~fY79YRvYCFUw;1spA|(fRM$rl*YqvQkuH%J@SF}o}n9o-8NjB6`$=!z1T}49h z|3%k3Mpv@MkG8RGtJAS<+w9o3ZFFoW9ox2T+qT(p^7c9B{qMW?j`7A=pX<}E+N)Om z=9~xXk1oAtGe8Xk1YRi^Y#VwrEFQSmRc1I|C*Ds{!smr0*~oir+w$^Of5{)_7d%nM z6bB^Gxli-V0(Ka`d47DDZ+mE}?o>_9Up#pp|8@f)7xXD~{*FJwap#du!*p-(dg-mfr+(=hY-Ogd&;Sg=!G zYmy@%z}-E-l#{GtXSkBXn(C1+^=BLDkhgUx@Odu`|B!r$g#*(d@$i9b6hfKfm;{mB zn49ZDR9#B`kY5rrY_2{@xqtljkK*U%BsbG;&}XdsR&mg;_>0x1+Thz z9$b?L*1#(-3>%8HG^0?yot>H|*1b`uohWIgMG!(2HvNM6CU95w82cLOtpA<%tvox* zbJURUn)0mr2f}l{yAk_D^0!QghZ67~#q3~w8%`=Q;M`~O#SHnrHw>Cr%^Gw0Lr`J6 z?4Ab4J7a(uz`5^dR_;rqH6z7RvnH$*>RkRpVN*HdcZ|6rWbd6W61=Crxcvp}(#!tQ z%KJk0=fHUbdD=(7?dcl(Vy_@U^${90%f!QYlW9VUB3%b(wnC}Nh-}Y!96DbFL!e8> zn??o{%fmv7L6&dUy-MOdbu7R6cB}1aUSHCzxOfG2o;TuanGDnbH>1Jz5ZH|vPbFbQ zJ}XNf^35NTUJdc-P`>trj2~29@}m(aAyl1Fhrv`yb_?LD{dYnK0m5P8{e|&4xV0Dpph8{S4JlxC++A1!Xq< zY^zdze0zT^N|5G;R5A<94e5J?R2G6pr;DU4gif!HA^Ee6%8Fj6SOv|8$`%y_2UNkq zx$=Aj*O6;=yCQ#wSZ5v&)fUjSr@Gig!O!+ou7of8GhZzAl_C^|jKMn-qDw(U*c=sw zk+xpd9+ZMGx8uz-{Je0(N$z!-Hc`numT;8qVzz5%Ji6?fmHl3eM^SGCHY=AH&e0tp=hGg7gKs~ zfI-UthRu;zp3d@|SQW$nQLZvZ*laT2=o#FUWrJi+MT%0xidI7OcV4|rvCI?^H`44* zuQrvbMYT-!fUx2ahL3VFb$g+l%HP&G?!^lm_#g*wEM3@bGy<2)YE--!=&#wZ7Sbra zbrcQfaH``?_KFeDj&t3nz2$;jUYM7}Tm0U%EIN6`rvyrC0nI6$l|6i9ne<}85DeUS zLCCYo6ku>^RK5tKEJ(>jSZzRSQcN>NMdDZMH72b17g3i$t}}m##ow{xhS%fIk=?Sk zmw*nSzfpjV>Iz@mUN(?4!JE<+pLUX@n|xSh!sz^B>hNk%Vy?p@UMNFkFf??>r$?0n zU?(ZThMlWIGZAJjSRo($!-ujC-Uv=nfMkYm8E%7_gQ)#k^QVnRmb1n)!Ph>#up;}6~Q|rqqa|Xl_RTBFzf5q6{w<) zJ>28EX_S}_BSynR;Sm%syZdIMd*;o86hHBd9fOvq_Ev^`=7D}jz?*5F=qHVZcv@WLsA=i^I(x@zGzSFb`=EC22&BT03J|#Yl`9NPmv%0X zw|Gqnlp~|-C?DxS_JxKIvmX^&v<~pFTx}K!(x5ZUd~YU)%IqgwLGUO`Tvr*r#ZCV~Wt%o<6%6k9(ED z=N>uYs_IfG5vVLQ!wFS7!R@YQA~q+$YFj#uy_|d6fCAFAe@ELwmrwSpY13=u7e9aA zAfglIwgJXX;aHOv;^s|E0kkV|n!ql~^QQ^JZP#n9hi!mnV|y%JYYG3o4;jC2t^k|Ya%QafuXbOGO!LDUmh#wgHz=cA+WdbPNn(hzfsbg#GG9-X`7=Zik z+N-wPA7*Ink;IR(Uu<9J*xy_>aF=?5iS~yae|puXP>;6Cvn97rOr>wJ-s7556san- z%~H6a{gSL9`4vWXqi_nt^PSey`X3@y^<}@}^LEXD{N4EXWC3tGat@pu?CA1zvoy&(qqtvN3O$kh_mNY5xe}M&Kgj0rCa7B$xFd8dzl?``lb>K-RNS)?kAF+-<-3}9fw0*0FB5ZE zX_+rK_fCRU3J2Cd!HjWH5;6B7Vnm6ihJsg)Kuq*SMUtaH?kPZ}x`Io$uO~QGsr46i zEh2D#25%5$rm}`a5O0Spdo~|{0zv%rh+tg)!`|Z{EftxuWat3Un$ARa=xkG39MC6J2CJB9m z%tIIbGM$eMNo^KD$rpq1V;q&FyJmisH`*i9VzPx3Tx<&Q_7-~yPghvXEW(%3s`)Sm z7x#2Lhp04w4Wci%=pBmPu$Gm843cRb80yr7jdIqSNIhrWV$#@I;5E6K27xB^&JR>N zXsJVwVFe@q)IU>juunwh46xm=^13Q6dZIv&-pe*d#}^zqZ&k-&;m{i?B7SP!yPCG0YfU)rkOx84tPgpS^r z73#mZD^$OKCYNR@r`x6kHa48k>UW(uft%$9J`g@&_Q70k%-q4B5kC0k2mBVrSm*#% zOotfdnx15@$=P!&wsUwr)(C@)p_k!rgGd%(FIa)c@F8wy^Fz;{Ox(7zdeuHy7yO>| z^I)V4XC%KWTX-%Nzz9g|r%8rS{^Y~lu&rKo{c%`Env{jvovNflR3efcY;k}60v)oE_nS1J>52QMZec>Y+;2!x(Ko**b=u($NF`x5Z|FJ@d&tiy`Ao4dUcw1F#J41>AX!U z2m=(4d}Vz9W)}4T?1QZ02ESop!`;RW6|*_TmF$7hZV8gCrHGfQk6EBt6srBKA}}Nk z71->)b_qJ9WKFT!6EzoC&}w+E(s0UA*k+d^-FMt|Mr4Ho4Gao!GT<1z;i2#?)0NoL z-Q9fe%lJ(!_x%3U_FRE4Zq>TCHUi0}9e&-mB#&U4N0>(FR?Nr&0DWJPoD3uH zB=WQ&k?MkM=N#S{YrTf{7*OqQe<3jeqljywo=$@=sT%E(Yy%3oC%?9R1b$BrzqC`a zptiWbWxT&2uZ*C@kYFGlZcE;gEp|FCbUo9NZVIxgW1q&Y`l1 zmcJy5WF-T%=f{*1YGu`X4R(#Si6uv!S1+dRp3i(HY#BSgdvsGZIvIA`9LL)oy~@#F zT2_2R5oG%J!3VJXhrm7`cjsI5Jgi7V?#f5a=w*9dhRC&&uvQOdo3MoQ-a%j2k1w*Q zK-I<8%B&L=7ncR$6>y;iQk%wGm30mAD~L0&+QI}n2ZEf&Jx@SzFw3{&&ybBvY?L+} z{hkvSTXnlkeRP6;UJ>C8B0iosWW{4xtY`gy!WsMer%&UfD{i%;Kb-Cpml`_*gknxC zw*U*tIlpXRw_o~D(!LexDV|7;={(Y0$g&oef6SYEvmy4Dk5S@|_bbEL=w+Svsn7#k zV*P&qYRO}AnJ=$B=x9hf=@Sm^&U{>(4ph73?ePTlncwkRR26zcr8=WnmG* zBjQJAWK&Gx~nS5B?uYw zx40%Qmr55uL%}6*X0|-M_dd2dUdU%Y0=Zu@^>jaQS*X;FC<@Nx?KQ~mo(C!fA>qX! zO5_a9-F?ScpPB(uA)2j)hZ&c`Ip;#$V(?G1C8w_r3#M=UO%D?Ad$b@I!}BOfW`!Bu zLZkt6%({Ol20RCCeY@SbAUZpnGVq)ZBAY1K1i@CHEqSDt&KKW@NwpWCpaqQS)e^!z z)TI#|e)`Tk)A+1%Ln2%pn8eb^WQ4ikIHE(wlHi7|z@!-17$tkeq`B@Ts4Sa=T0ca! zic$kk-UXOpYGhv+cRIoT=LSZ5B-@`eSsh>4s(ASGlB4?B+-|&X*XLVkWgic%3H;Iy z?jwooAKrs!Tcxj`S;rliw@-m@B@7O1j9<$*kle^@PuVoM^Bwws6F1ZyQw6ACo?Cf9 z5hVslw%UGcbRG?jOvv*A9mw!Eu)kFDd@b{O#MFoM4V-2Mp)-;H2ka7GRLS@uwhlb%}sX&%X#t7S!ztA(q znUTLX#wb(%b8hjsrnu#_P;);=rEWJs-%K`G;u!{fO>8K5kk9n6N= z&Z=3VT38wev&{Y;=l$$mFb!LJ_aB7*ZLrhF$tmRnUHW=AkN!Mpak@GWzG>!R0r;;= zNE!^>f>A^Xat(X|-c@#ZW#n6-^Wsx`obLELjXJj{)$vh1MQGL>eu#DJ7!xqiUuSk5 z?xbQ@=lFd+uIARYZeT>Jw5I4NPTn5T%SiP#rR_({kpX@D9%omCIo(g-`bG^K5=P63 zY+J44F{}OZe*Q|hSwg7zN_G`>>KtMgGqu33CQu@TMm4iU4p`sE=cp;8k{g{!nFnXk z&s{wVPAIGa%G8>e1F~!pYZ8SOS0R=}*%x?hnkT!$MitXfU?Cl1FRWQD1Nv#)Snx?Y zPQXzE^_1piqOu;bixNOs~cvYg1D<%`I2!5Vd@ zP98jUL`v>*xF%3l?Tq{-RqJA3W`$MXOu8~Qamy~PgjJEfuD$T5W`)|9V%I0n8EVtY z@Q7Gjz8+DbJwSPQgHR;(9M^YVF0XsoP`G5HYtCu~gGp9r1kM_9BzHZ-mjlVBI9a@+uk;cdcHGf$>(mJ49I-G8Q{j(kGKe+EO$i z48y)h4GH6wBSO+=my1BlSNvd1sC#~X;;~d0(gW*C@QeSpAZLwc>5C;TvjKc&=X(!3 zdL@3boSTPEu8d&7Eije3w-LtVIDkH{7A`BD`ZMJ=cJj9FYG{aWt7B!JSo4XNxKxh8 z<{h|#ea3A5OW2I9(d`1^!{TDwsp`_HsxtiDQiFP9BZCa>+=GkiWuH^8DGxR}|MpQU zzg!;0;j(YMpvG8~YAF<^Xk$5K{_L`mE=hBWf_Pb7S({o4KJq`~Z9fmGDNVo$RBNty(k6y!nnt*6%a^Qfj)o0Lda)x4Z@ zbd|molsvnFocAi0zpG3nbrAgG9?9LB2z>QQF_yK=_4DwBJ?OfXRuHvjOt(76J>(v> zBfWQ*D_mk{qNer_21@PsLx}BueZ)sB7SzR}`QeG_1AcEu6RKz#eB5uF${l%K_CQxDR zyHTwv2)bB5zl@N_+hj8c3|ZmSrGr7<+2BoD)DMb z)3swhS7f~*YfK3r-1yxQjTXBRhkS%eE9%P6XQeuIc6&v4eJQB&%?T*aa%OBiqw=Lb z4%Y~@@!0!sKJNV;4lN`K6w5xUniAvQf`W4a{LA3*ut^Z-T?xYmFQUjUOQT84h(VKG z)5+A6r_EoETse(P2ltDnmB`kPw73jw`p%_H%dT_($p%KXb*LZceC;Vg4=q-v`!*R; zRm9uLoZR}sk?Bi83WDH5UNRzN+tFKa`t|D5YNq4x0=S-QljcH!BEF_Lws=%ZUXkVX zbjOU=8)OU(Y?)D}P=gm^7$C`IW$mCy3F8T)?|Qg9rE*N_E0o$yf|);1&CpRv7R@e$ z5s>xXWdt0|;HMJnRL8M2%EAzSgBQibxrn$4Y2Ib?0HcO)ci2A5?h+95Ko z1b{25d7F(+rgysAj849<&{A8VF%pK0tO96k-}08SO-O4#6W2uV5h~dgq^;QEA*7Io zNahu)o?0L9p;340X$#97FH+*jD6+zjyXStz&BJ`Cu*G$PnkwOFq*!@wSoQd|K+erL z%|)yUhEWv29|Xl@1m63C;VRuLRi!E3bKytnHQ1s`*)4A{=b}}(ZFfp0L7 z9EIz5Nr3;s&mQw%gS+~bD^fb}i8d_}AoNULNK=|57jy(K>xVB_%Wd0GDwlhARyjoS z-HhCK^%;6L~sqlYu zZy}~>rW{cWKDJ1a7BD%7+|zf5y(yi=$G;$*%aSV(+#v_+jp5!LN!gg$7H3mmt>GTr z=+n5 z6Kv1?A$<3EJ6u(==p91YgmRBAB*^u22pQoM*E}fdrzHp*^u6Z!0_Cmr1f4KM%M+LO zdN@LW#=n&6byMiI4=np4${dt{C$ZRznfXi?P(oO!-gVn%$RdNPPwy*hz*@6e)HyO< zB>xWx=4EY~=6#IUt@<_dL7wf6jgVR`R4k6tt8FES8|IrBU9hs_ZF6%0S?}d~ z<#p6su?AWL&_aAy)LeOtIS&2g&_dKI*!ezftnVkFo_@@~;=r0fnVx z0}^d@$zMkVugrv04^;nI@Q7vEL%{4Oqu~TD^_igr)TRH24OWk4y#tnol1htKjr?zY zv4;U3k{K;erP|$A&CyMkZDWHHt>LsC4116S(*mV8=rFl>Akv6uzRI}l$tk9O&)Ev^ zt=j4p^yJ}zFtECBxnK2SJ!F>;-GgF)DEht_TeW~+NMi6;;r!VPM0}2IqZbUm!))GW zWzc=w2eBDia^`RfSAJVIIgd()K(ufgDLzjynM9j)lM+41W;lnX%4im0pJY@dr+IM{ zBjwHFg?o8fuC|Tq%3P=NPm2zvg40pr61YBxi(LazE$3qLEFoSx>wvROybR}A=Xi_( zGl{BGuwL2Nd{2{g!7u9qixP*@KtXgWsGae9QJJ6mX+YH^muyJJ{e@^$N3`%NuA@Y` zN_7DUoUEwAR5m1(<2431QtC4;9$D}jRMnG7x_<;E9GfN9UP0`b;^Y*KaR^o^i~~<> z4g`xXb?UoTV~{xGY#$BJMim`868=DSoE<zm@apw<6IfXo1bm4!fs%~n zNf=eqGVn4tx8>KKyZIXb7h~JmdZV}zoGd+@SjvtrUlh9FzD5z28emv*KkzC<%zlb? z|C(oD=19uQlrYKJrtPV3`7S}n&X6Qi^e~2IsxWTOUo25(usNZ4PbhWZ>Z4w8`efL) z&(YL9;QmcWo3B^=mU#B~H{i?I@nvXHf&Un}L@)P9um5C%f{P*O<_gr^Xqx1dSIU-9 zwnz+8xs;tV!FU+|-Hhye_Yk+h{PGDO(zTl_=FlPHB%R&?G{C-1Ws59Ze1jrB%L45# zPbCXA?z#tADmGVFmor|%y1;Cu8AST4d5XLH1sAd3uVAs)bF zV~D>HgSG#fO+%SvAMo`#uEq!^1N2?o_Nxth=|?H*utsT2fU@0#rElyqre5)m%;Vc6co* zKoHM}lz;CQ6edl8CJrqnAF3tf>)H}NKGbEmGWZj~eyj<+G!WkoF=l8Yz35jR*|G%P zlyz^=(4?2J2&KLEMPV--E7QT#yB7p{WGEd$wYeuzOQf7C{0%XZEQH=F4cDFltM?(F zNygSm62kRo2*KcrVUofHna9%H!E1!hT8;iQJ1t8(bycQ9euk?=f5+)a$fvH`DUwJV zmY#~Pz@i=5VGnMMN0@E7{f0-nxA-#X<7#-dYy?d+%8o&Qhm!3@0)=t1+Oj!E!Val; zb29Eek;~@Sr<`M<25RDs-PCR9Zk$QV2J8+uSPu=5FG1ymKmhs|U6_bQ{C(9+^s%fx z++PRt=@OsCav+!ED9Y9)`x2gvEVB#Ow@_SI=Q!f;|G$yK!#D7@vFa_m?C(I&F6Eim zrJkOfY5F}6bKMU40`0vb7kQdCSVZ7V54QGrb?kd@elGg#Z&LZue(dMLtmKLt&sP<= zQW%XTxtw4q4K+ZI;2)fLZC2B&q*bM9iJ3>Deel{RiWF(IGFTyQoS_fsGFAp;h%`Mg z=8V-5bvk+Yc3p@)VXzTUrivaH7TZszwbn^$N9o~g#NUa4VCd1nI+Op)4H~5QI(145 zZl5<70A!aT_w(Qf5s}2OvSbeH5q6PY2K7mCzDlVpAmd5&Pf>(;hMQY{C2I3)V!fuK8 z#$uos99e!do}XfMqZ$;7i(_+MOGgJK3V3SMr2myJKbkF-=q!`?5ov+nEjVv1MZ*Hg zSWuS2P)x`F)#-7iAuvH&BQ$tS0w0(f%@b8n_M&*5B+P{`EdIo`2?iC8;@JG$ruM7y zhJwxwZPU$og7l0~VDu}#kAu(N^IbgT0Mr8i2Y^aX54yZn9znB_e$XS85#3MvvxMdK0!E8Yi~xVKDB_5bMbqXM5a*JELi_M7%pKI=H!eZ_ldXBmGp z`n~IMvIPBux(}|B^okieu6x(9Y4A~o5|>#C%xPx?lsF+PuJl9@NqNFBgD=sozRp-X zoA9nyaxt?!2>UE)(C>NhH{UQac+y&+gL~w5cQOf#M7T(=m{73w`zQ&fWHW&<$f4G`4s861T6`RNLp!HWo~mQO&Bb zMdZNf8+;`;4D2_BDV|6&+>Znyns)%^L>F6)SJrEM10gmR9U->M zDb43^?3P*^jZ?CgWzE?}J8+*Ryf44`e%tKGa$o$IhXMu9pX*Oo{i+hb^>FZgTKx1= zEe{m;J}&b*SA2NGhC%XBOlzvwj6%q|&c*cJ3T3}woWa$mp7Az2fY@i2xe3a<@c-jb9w_CB;F9g2UXl;+ zGu3fHhOCKotW%s!?mbEocO@c_29T;TT$n$O`}!V`UYelSe^Ea{IGitm?PjkKyU3z_ zxDQ%{fXcbZlDa}4)-H&=F5zH3I7feGt6)FhYd!{$NhcgoS|KrDhfc%fw=nUg$LPm-<5GsOUH-hl7Sjh$trt$5HB=;xNin>Y}oY>YGBc zV0sQ)&~A(U23ehp{0?SSEKfqTjDwZya<7%)Rl;f7?+cQHKkDTy<{8lnx^ee79~wh> z5*{EovnbwuxrSlcowu(3HL2Jm3rC}u4rnj>4g*&hY8~G6z%Fy-zO&EE=kBv0^hZzAYD@?pMR4bhJw}*kbsB97mhuvN%qCXrDBG zq7~ajwse`iny$<7K82dothNaRN1mcL#NjSiwp7Y z(Y`r~6LyCpLF(~OhpOOx&qy*dvQA~0%~IWaVONGU1_@3M?{7sX%Fn}E2|d}bOU9td z)ytgfv3bz0ZHwg5UttD(?0q3V#7!1@r9Tb*d>t)Gvz$@F`vrW>9nBe*@_!l@HtRQY z&6?D|iV6IP4|Ipw1chs$O=Cg2=;D2mQm^LDW`aL)93kSoH2 za}?Dny>UEi|BWza`7ygESdtiDqXxyVu{)Y(yLFU)=j-!~!ZaKh@fS02ek;50`r`#B?4@)w74 z3=3g~M;=syCxwIKcT(NYV;8U%&0o5A2&(H661CFFL6APv2`PCe$8g|#1>iL#;@NY> zRICYPgE-%bQfJfIcGp`yhy{lJ1d&J>{iuTK1Ts^RKMhr4_|xSx{gJ%aq1U22MdbIT zUD3KSOJ!FYkRoN9`12QI>`xt>5M2@pgN}4s7=dY-aB*C!mHF04A)O94fh3}K?TQ%I1UKo z(k$ezer?W|W_!y4$F`M5gX)&zQEK(#`kzgO_|F{Su2B-rak~;o^mN+E=wf|bxTE#^ zs5I20&_it_844vbdYCZa>tp+Mz6Pr?Ipxy0h;Jx_^{iuzk)QQO3&!k^>@ewTR=vgV z_svHkVXppYH;zoyGh#BY050cJKyFlo!K_sX%$)v1s8~oPm+>noT8+jo6GR^K_Dku3 zXUypBup#-GpcS*42xl~Q@D^13r*MR1r;-IKgF$`cD=DQ_;T@@?3kR-|IB&B5tjE*H zU26^K?VraJ4ZR)`JzfM9BpLHQGPEYTqV^YpimmJeM8#|~#2F@r!`q{q!v^hGcRcYU z8akOEln_z=l*8&z&<*Gr-bUVT9e#P&TP?lbvi)!#u{q-+rG2K^RwbFgWvJR4i`9WS zxSO=1K*lnM=eg*ykOC;a|H>L#R;~ZsEGIsFNBn4p4K#YAM*rqGSUmQIh}~Tzq!dbb zgqoe0Wyrki`rMc@^GRw#&Y^4N4V3Br$5;?h>VumB*(h$@D&Ogb3_6$Ix*z^?=MW-GOMKKmEV7cb0i_#jd8$hiZJa4#%gmW zbZ=eV8}7LhW72}n_JM}VMAtp1Rbcf8j4E>JEVOYHMN)5Mg&lO>#bOzSbPzb(K%jLs zhD$V4eBXb53Gq>){|vs56>>=7(So$2SPNqB2%CkUg5T6Z z9_$=-od)q1f8`hq%>HOk$Po9FbfY&> z+bQpP)oq*DAa)hS{+Bu&68_mk&o`2()~9gdX4Q%27SL3)*7TZJ+7|L z%n>3%SR|J*ZkynSkia~4f*Ajy+NPhKKPbGnrR>>ug7OFG&t$43161F-IPd}mC2Hm@ zZYm^|@|=O1PSjboOTjZ|NBzOca@Up+iVsJ!6)aJso`?c8KIpHoZ~n@Bk}5FphO@vJ ztuYR%7-7oE;#fwgFfgdsc8`b9e!q2(|p9Buc~7h0GELn4gG+MNj|a3 zu3fc@#&He$d&sC}#E^}GSJxMXi^8=AEI%F2O5Zw3Z6Da?wmLztXn2nR^?(^d!NmZW zhsE3Omqt(2YUZz*K@JGvhh&$o<~1{io`9^eN2{Lk4Z&{BM2QojKQ0X4;n?ru(|u9S zI5-mqD9OmvqDI(`d}f89n}$VM3XhM2}3Olf3_ zV|rPuV{Ab1NHWqU3jsce5mJBMA%Hii${mNU$d)&U`p7cLk9g_oG8h|ZNl)Jwl|BCK zZK?Z{-eQ>BdfAdk%z`Wn2a5pfj1UKl4HKYd!&RKXm@qkyNFBkGAhVzYQ_iJNUG6E{ zpYsQ4^EQI01ziZ{?>8Y=8XIWn6CKJT^NAO(ld5H2r7uheqSMx#u^YU{S_rAF@bmRTfK~t<9XdLs+fnMGN z>b9s6$XtLyF6ywL4AJvpcoP85JmLTCen(2OLTz^`P^Y#FvXAz*4U`|=`FhdEO!fmU za<$1B)o>x_d-3z!=M+R&Eq^JW-Nfzec)bjEYNAAKI>%DCqv!7wy2tW7^r5fofX@vX zL5SCl@xw9Y^Wxa?4(6{IerLPR8b|C3`LA`JXY9d?`sgwbQ{SsM9+4@%f5}~;iYx%R zJ7x3&AMH-hzLO&`5vj*wM^}aa-tqouMXe6xzUTgM&r1+?MMHBM9AGOu3O_2brn`^H zN`c9^tQ>13O@)=`uZm5MdAHzV{2`YHBet+U=F*usQ3G7_@rrSMBxQ98wwGD4+I||W z4^FgG51*yr3l}PAcit)dH9L4M0%Up6*EOvHlcvA-1BNgzG*fK4=DLS@X*R~aC;x9^ zx^Tb;<2{Ovs%bDo+j)8Vbs`3|KN)t$IZ7)3-xQelD?sS-`B)MXJ)%L>vYeYU^n=yV zr4!Tcx%8LIEf$N}_TVRDz2SH6|Iu2}FIn=q; z0oUuj>1L)`bs^Xab7j|NXP&=_mT4ec6ZEt(!b9fxwdw>m1O=5n{J}5cV9Lj&{c{uE z^1qXyMB_^vWps66AV4Hc%}evx8i^6J)C851j}c83lB7UvdWnF~P^EgT0cGBgm7L$I z4_y=2U2b!@4v&90-m;XcMYrD{3n;08u|a_GVS2b;hIYAqVri)o=jlA@pqj-jVNiGnkF>*lXJdafIdI(Xb$#WgMwg8iHwYIpoPklG*nrEt1C}Hr zNYXO#oQm0FN`1WjHQwb0EcAz;^Nem&{rx4_YtPEmx*VdK?_UPGMaR2=-)BC`4Ed>k zuuV3Akhsv8oyiTJ%jGrw{dD4(T(5Fdvv;e|h*%kk#liZ%dPk5(9YeF@KDX`mnJ~55 zl6YK@IgVr_dv!78UFpDmyKoA36U+qz#uC_XS%2RxZtMP^(q{PoUD|wn{ohNQwa3M$ zN7i5el{U}k>`z)bVOWe!DI@+VZ61X3=`B;dqa;wH_Tm(eA~5LXSmE{CwC~`DLY)Cv z=Z_JjDL=qH|B(X<;nMRu7qnEv=O*uqSodL2c8RwRe5a zEe}$11?Xu=@AP5mA20KDpJw~!PHx`s^KAdB0Qkc?8%6x7FD}U8PKypZuJs{=AGH_O zRU^-qF+y@L0^8uMC1k?FLyOySTSiq|9%o;G%DW|2*PHjieghXbU)y^7yl_icCW*4)zqlHb?)*Hap7#*k(-&8!Ak5N8(&bD+rxhi|%3V^ZR{| z_Dc8UQ-lJh#N0^m?OhRQi2}XZDwYVt{A9%Ifw}C~Mn=g1AK&l~nJI5bFAvD2cA@(7 z{{v|z`VVOa8QM_x$Aw8rn^3sm`tymy4IdP>nw0ISDHWmTTqGXfQY}IvAp}plSAf+g z3TCW%kTz>DHz+PFmZ5F=bOwn9?~wQp#*8a)8zyaL8CIB)VXP;yUHRY~QWm2OD$K^CkE7W%)elX2<{=aKNzPtf#)&(5tLLW}|ok8QveLZw|{JENSv^od15A zkSL|Jp)1h3o!eN;&=Xi zWt#ZRyzovQyIZVkvh`xHD*~VAJ@5h+eM7E>|UQEpZ7M;^_Fw** zjH0yx+rYeFCeqSp6siRX)7KPuI}UX_=03jU9=2ewy;d@EI)mDplI6eaU5{>xb<-+G z2bl*O`pqGIwT~pT#;a0QL+m(qzzEE-8^GmWN^f5rusSu9hys6oK zjmEkcI~VhMv3IanKZ35YMR2922aBnldYa~4RXC@67`;pB*gmb7Yo;z+3D!+)#to=; zH14)1b!`2q+%D3wmF}CU+8R*p4BT^$l+v9^Y8Pd#(bwI|&QC?G1})b;IAvG)woGf1 zo$JLh^Ko*i_1+?}Fj(Vh$_)k3Z&EC0S$HJuDwx#bLh$p$<3IP-w0$bKzsRhN;qjs7 z#>QDE`#jl|nx_y#rzKikGU_GoZMu6ves0>_Bt*wK!d7%3aJEG8UZgmSLz&AMCmG~R zVleP%yPnWW0_ccsoqR{MIGrpK#xkZb2}kwFZ}>3j%FIhYonS;Jh);fpG#Z| z)JFoJ_fxUdfG+D zzX@+@2Ty9p$K?^-ckq1D2}ztRTENGG1Zz84=Q?V3EIuqE0x=XdD;>+Jr>}+HOws47 zkGIbA9m3#!y>ggqfrp0O$m84T7NQukYM1iyRtJ71&1aFvS=rIAOWht}`kYH9`(wPi zC}4)`u0$r*(23^*dJ2x`by?aZ;Lvb-;Hf|Sts~X#nAbw!hRQ+}tpA}t-H5Z*Wp3cL zsO_F}I_8huhGPvHB?3`la6=5h<5)! zBifaJB_eX*FU3i9mj{WO-|MeLzcy$>CzC)T$v=LF|RuSk<1 zo)PLX-2xEEud}ta|Onfj}9>DcA4q*W&E(_ z{qSW(pF>?N_8-3(fant^3nQs3i6}hZc{-n*tA3!XE|c*T30c{D0%0t-N6e}L3OEOQ zg1G5}nNHd-ntseH&nxUeJ_aH#BHs=ff*?SNww2%4qR^p!*sP<1g9bs#zs0S&x4cy> zg~#0~p`iC-(uQW5g4j2iZaVU>111Ze17M5R2gDY*$>)1%Ww&S_NxYTD#K;B3^UAGC zf2Ki@9G!7~j8F-04%Zh>%3NH{H%D<=1dDQmtb4R>9p95;|4G$=faX!^MWeRc)n+O@ zs;k73M9NTAoRPP+P76qjeI#2%xlN?6VU>=9T$%5%ebihN480`|_yZ)4l0Y0q+m%X+ zs-(M&N5syVeS-d4_Y+(nUSiY(?BjwApNei+HA}loJ)^)C_;fUnWSapFg3h}ZL=2`C zEsTKb=j5GGK4VF)zbR>K-ALM@Y+Yk*8%pr`p?rl3;Kzw@VJ>;X4!X*z|2o zxZ)R)MOGefvJ`rJR1POk?WT3LI!IXvJ9nHG$_yZBBCG9xYbCKRuzgIufHh}+P5+UT z_#jrZFIhA!E?LNFq5^slpSEnbkZpio1{{(=vj)j!TPj(7Tzf0rDkj-;o#Nct5+_20Mn8kd`KRqsRl5w^PX;!GFh7m*!5-!kX zMp||OvUoL%KT+tYg?gxPFx_Cw?jb_FSgrAF|Kv6a27U*!0BNKfrUBue%)5%HH%Csc zjJd&|u`|aafj~Gx4FxSElSsE!Fvr8FByf)bl_x(;JvOrgS=O}D;P1YYx+|5@Y*78i zz!HGQnKh*6^2csG=`GW%c|nbbe-y~@oVPG#fO0Rv5?as(yAM&4TqxZ*L=-ly|Cucx zXyT6E4{9nsG*DH^-HjlO1#H6!C~l4gRn%edEzVB?XjwZpKylO8%$u4PP~6mWN+Q0y z(;Fd|L1kq7r?^=_zW8r(Q{2{kbUF0D;-<~{L0ZZNJOO%r)6%aJ%24bX4e>Py<|C48 zx5-u(#X6CFAdO)MeY8Uen|C370%opEk-9w@TycuU=wW`#Cfb@dnvKBf5V`gC6V+cU zDsOxc@au1<0Z=>V1YBSsjh!j2Wbe?7rUXin(PiXaS$-7_!C}LkP#ZPCyyl@n_5Oj7 z=8YJ2I^Ix8Q6Be!=S$SgO`VDHWR>6<2=aqEn`HEu61Us~tFN+PPVSbzxu;f_(5nTE z_8D*yFj??%2TVO^uVFfvxb1TX2m^%}CBiAL)PYy25@~Vk1VY#>b!^Ddv2en4Bk_9y zt;Blz_C7{Tx-XzG@U>gD)L`x;no3Msnbj$o7BWp?y~dk~`Bu2+qP}nwrzFJxjk-ojrE~^KvI>wYp(fR(W5B_ z5iytfqNzIB5kpe7dzp2Z3nxWBF&uRi-i;ElH@r`Z)qN`WR-;nAw_ ziz>~aW~xi&H!La8$}Xb;pT@I@J<*ntqChOFZ{n^$!GtyTp%!b(#4pk;`bjHHE|e>f zu(UD2*38K87dNIhlY3%<$_%ZAztSo;QSU{5rQvP!VE;Dtl`W1!e5+=-|J_Rtn$<8d z>P=^eK$IsL6Dev=P+DzZt0Z9I7ImFscORnQWT_W@L5)G3V60~jPi?IiC#jIFEfQ+F zfqWNCA=zqV$ny_fBr5xZ?@XB`HL?&~*gylLMXK6wGr8Z>tC1V2UvZCzf{uUnsCCM2 z3(1+TmMl*gecQwe|LtaMc>HKlw2eQovg%|gQEPLv?K!I%MLhvYpjIC%lNv#CdK@2C zO`}j{D}sZ9u)Xg@R(_{mmU9q?P0rQ8q->*F2@L^2l)Ea0f{bfXCP$c>HU7U^bsbMFPWw9S-KxlC>-4uH4(C z2&iR8&#P7pbSqK?kqktxI`UF`pIK7byTgz@M-1hQG33zAtcbwPGuC9ys!A2FTsV%G z|0OpGz1u8NIrv$=q;(+2%>Z3yGKzO<+?iBESzP);(0>CwG*sa5u2?nw?PpC%4u6Kj?*jB6LS2PB_X3Od zN;4wYAjT>hB;bwR@3r(`Wto1%rCyZvmc&N1qa%9Q#MCUkX11b#CXGLEgcwBV+v4vc z)a5C4hJaqDxLMS@8$9gZwF?co3mGU0pQR8Y=WrKOW zldTiXL;e=|+*3XME5aX&F4{ld2B9hRc-cF5X~~t`(Xo%&xf3Ye-}fqH2g%A(Ws*TZ zd6BNSrBPY{(|L;gT+$L&Uy8ift|6Ame>{lzUj72^uHO4;ATeF0FmCQlUY{?B z>*#$mK-0F>_@oG>3Q)4Px!|574{mKwRN(8LlyphP%hLTVXex9ZyqVD^ol7QdD8^E@L7DM)d`%^lK# zkvkr7nRt|~CULlDte11VcB?PUBT8lsQ7JYhueB%Ddo9+;?m6Vol{6`yY$o76n2S2? z2)4FdxD>wLclCbELc{x~HkHt^^1Am57{xYH2U4EFAcraSJRgNj^=!}2^Jc2_H!ZZD zUSCzDj_iN-W6+_1Oqq4O4`{!fbtM18&wOgax=!F-z+GR@vu_Q{iCOifhts>BgiSk+ z6{A9eCOd_pkm%9&nmDoTNrkDsz$rvL;E= zv4)`p`FnxZAY1EzL@6t47A|EimY%!PkCBIqO1T!IxF5m`xxs;HVTmI)E~!Fas0fRN zH^}jl_I?a8PesX;m7$xITV57v_q<Ej6p6I+o-hZJdk5i0Oz@l>xCNVUsE^>~{M_-Ry+ODxF3=!*Z{CR#(WgOx z6ZkX&Ga&DUK58UKk4Jv8lnz{89^wg<1Z5S>vw0Nd1|;$3WyYW-Yl?^yWiiOZOSz2N zUXLOFmIGw4;jzxps39k}-E8%(0C-B<{zu?s_2cqSp!eMWufSRSD{wMS7(mt#{hoi8 zQJQVLjN0)1oBzK8=X?-nP93N9?;c33tN3|9;)%;JA83xI6tntQ;DqX>ae-E6EFGvD zeRE=`l(hyWL22YC;@R}mGlSIhig{>WUZHjr+Ro6T`;WjmQ4|$RZOoDyI}>uYET#9E z^AsWRAAwVkAh(_gBav!m$n}~v<}jC*e43%4A_PqM&Na7Qrp2jVv-EahW>NEIrM4?F zh5`?X2VC3&OKy)MBt=SB3MS_WA_DiR*|Ps)48*DE+V=u19iP8<-%Q}#A5VicgcPQ$ z4?%53zEaGy|FX`oq2;k{Qd7v0tC!l(vOpT6_axgK2P0T-3@|bVzKuClz*1 zKd6vq`2HJ%yU((X$lr?hH?BYeLC=O4U#j>B_G!T%b!eI7?q!&zgqfsNyV)#$?*Ozo zG!Y}3p>`((@}^UiBF|3E+>+t4`I@SPmAWR)1z5}?4P@pbQ(NBdv@ky6E~`H#iyclXXeTCbkW!<`eIkDWNs@G!;?Th=LUK#Re*GsjO@kNeZs2ZbnTl?abxtEai&p zEx@NYnzk7??|xHbLQacf{5x=@RSq2l$jAR(cmhT3KM{3H{)h7Y__#jv*dx@5aAM{Q z2l^12c%qpC;}A`pbvqUZQDaR@g`n0V9nNz*H4QxYvS6;Rw?0j^1EhakjV)Ph4}wi7 zNl#cO!I9x8UMUIDf@gEE%`@gewlFjHF-gJ|tr*gfoLi<|Oz>tRD2!=Gr773-GST?16+fKmUtWCFxt}tk0bDHG5ujA&0q?@$-uBdj1ZU@OtHR_ z^ANivVJ?H8!FmyOLaPnYga~Uj6kkCV(;$#ULNL!>eQm%qhQk4Cn(n#*O+bz#5P&=H z6&=&RPrzlLn|j_^-Gwg{F@yF4q*mNa^pu)s{SMsgfJn(F68c{2YlYkpAqT>eN|XW? z8wPm`9J2PUAfK-`x%o&OJ_ui5{&6_KzHc(T*J7T{>1vg_sa65iL4v8UX>HQ z?STOKiUOmR${goY%T(fZc62I!HF5~^3{Vp(f8sZB;{7PNv95-4EiX#xy2Eg*nS#LM%g3;zy#39AD=$JjS-{_k6QP?3g?VhKWvFhXoM9g5H&$p8fmr-;X|oVUq^aH_4JGx{ z0m5e{|D$l$>KeT|(b9==fN17Jk`#6tQ1OfkL`=0|%lDTyXo zq#lb^lbInRxBZX8S;`|%g5u_-pTe{Vv+$`8r2eaLDkC!wFh@4i{|M+-wR6PgK_Bs1 zW!K9#vcLaF;Y=qRUzLPNBX4Syw%mk>#CWmM=R1Q*fB*rptK$jZp2za^kfdh@g~}j# zOrFKT2(;(W$PdHi1}O~FgU1iQ;tvrfe#{4ay)Q`U+4kx+XpxN(j1b#yxrRXW2Mb#_ z-;5&SNG`9VR}27NFBt$UO3b>D!+amzob6tLn;kY_YOkuNon@1jkZhTB))GxUbBq<` zr`NrSTO)6tEd(OC!Wd2gMKxxC``oawv)?KvE9eOa6EHYt8gA+HeVi~A`R&R1y`yl$ zxKdet?;_P-FM=gWdIc~p&KW|XIcf=afeW%*$y?}l@FUl{@BgsXjeNv4ZD zB50?up%ZkD(x2>DgJ<-g5T<9!saT1y^JG3%7NEmVtv}2CWjJrtJB7;v`Fm6{8PN^F zK?jj!%nS=T9vmvkR8)HN=rE^zpICUp?gWlaB#3)ucilz)CKUYJxM^{FO=$Z|Z>uyZ zRdq$Ec_RoiXqT%e6ksd3{tnJ&&5+uE2@WdYP-8Tk`*vPv=KzVz?EuZZTlVoa%1`J) z2D2c?xX!JELrqBWsRw5;UwSYe>b@`kK094zcrI^do4-9w*|g7(1hTXl{z!Qm-uE2S z>cSlQrbWT4NQlm1QMb20i_oU!KbHdqx}#C)<;@ASn5*rl5B`M>>bX1JdkgFnGY=5; z&nz`*cML+vKgWH>@2Mx8nobB1uTmja{^5mld$Yx3hpOOIT4u`(L2+3}DyJ70WREUW z0B21E&4WsSY$w`Ic>o4ClMZ9Yt&yb}FuR$XMR=J2nf8y8YnT+dEnv)yu93eQ>^zP! z4rAXZI9G<-<-Eo&i#1duG>xJYmeuk@t4t!Ls9sqaPzcuKu|?Zi zdzyTnc|VlSrFNA4vxiBN`2@YvhT@K}Ecwu=ZwmlFY4m;chIpS?2vC$BdH+D6qy-(p zmjzK+@!crP99@997iVEW40#QnT`DRFYyr57s9ujjAGkDeIR7NLXh-G z4I~ZnUy`#8gZIZ_HFe3V|t_Ak?}V^Xz%7k z^|5Z{fpDZg_K-@aAd^0jICC$#cd~Ql{(xj+p4BXtL^vLnhr=z!4}f$p83YJ2kPpaD zoD_^S5S{*3H~TL6O7;!fR#o@;`eHemnlp{#G6~(|=KB2l{P$IUUVfO|z3u9G+cd~W zx7|sqOERPjK7_ySV@LUWawVHO>>K|@KF7AX!?wv8+I}Et|Cj2cc=%`E$a28U=w|tA zV_Tpr!$UOAymlM=-$UWP-GLbs33DWOiI#JwU%We>Cw z7z4@I`8JixPtTne9H|@|3FosMUq2U#UiQ2F$_FtN5*`dWN2(C_xuE_N(tcS!wX3Zk z`F0MrzekHqP^lu})~i`;kS*#PRE!#_1rmx{_m-(8TH86$Vi^??XzyueFqK*Zfdq>*??;9__ z3+D&&e}`07xH{DDq4AVikn*l2U=_p>CaOj8#mOw~jk}A>N859y>lD`7LOP%uD|B%6t-+ z4xdP=jF-I@OVRq9!zpmTavJ4FCEuh}CY*?QIS)h3b?cCsQ66&=u{Qswa3IpSgj8Q( zpH$3z8TfwYcxt&3y0IL_w-6|b5PX4UBKl%CeXt*KuKhQkjexZJqwdhiPJP047nuu= z1=}yy9b%8o;!y!WHq>Eg28TQLu^wBkD&3gO))?C3mMty%8EA?gK%^rZtj;#Mq@T}a zTTVoT#ODzeA1g~u$bSZ<4E_y}(i-X)DN4PC*A(y^ov=MW+_Pk`L`@$6HAYFA9GUN` zB$Ac0$ux&jp)XI4)YsGS*+Jx5-_f2gu+kyLJgVlq8mrZNf}SP9I!nRi-^2iLosg>+ zYBbwa2dGoRq0(#F)DJhV_)6v|&d2}``OgtXpmLkhV;5l8JXtswmzWU}QOQt*AURmn z6xASX7)Fp{3u<8&3ub^Frb&PS-jRD3<4h$j7?4=m(T+#n-a%2F3PwRBAc{2B+2@s_$@}$G-PBcF%_&Z?49Jzo_0I$46X8^Ciz-{Q?Uoj2vcS z#L$QTWN8B{b|t@iT&usqPYxva=zNiU0b}pp?0waEnsN=S{`NMKgjRB%Vtb2^ ztm|3Zcw5H#BSzEuxIgipRxlyUTcz7`Je^}NyE$}Knd!c&N1rvC^)cV|fT$Hf@)DMU z#_R0UN{9y4BIVqy!21&RamaPOHr`cG!Isl|7E+YW)c4e23aMvs7(PeSLhR(X8Q}| zx7;)<1&stt@@w9!tu*X9a)CfD8^c}*Y;nPO#G*)`TO;ChGG1y+BRZ&GXAop2u`IeS zSXzL1^y5Q1&5y!)FRtbpDiod=S2yHOxYPc(m1(&-5=0zHkb*<8WdeU}Jt{*YpqT^& z%6Vcj%)A4zx>5!0=UuqDWy5$OV|7vIH7umkePHj#*xQWC_0}r2 z<1#Jay#mD_n@m!wwbGBovqAoz#uGVWfI?ou^RY(vDanM3LOP}aTjj}OUd_1-b3TZ? z#!5~mf9S;=49i}_gs{&XKX+!$DAJ%*-(ajzMt&Bovqj8BYvmeLkbeVCmKtQ3HZoOa zg6UkGR8PyE-mHC=Huf34jrx8A3+D{K+t{?Cshr%PD_N&jd13kl2vA92sfQsJ8W zUUU45UB-Ike{(mGTO3Llw#vnP?n784xHcuI!3o|1r0dT6$t`40Z?Ys7YV=N@RyN=jhC#DSu zDHdW}ldT^l3kjz0W)@;Ox8A4~F4?qX>vI12zUp|nS#99#vOZb%`e?oZCR6HZf+KV6 zEpvjoV8TR!lIFYd;PQkVI)%e@^jru~nd43VU~01Z36)CFX~STJMuEKkwxaZ3%)(!zfXBYI69D>%!B##Vv|z zfsEWH3Lp$aO+boM%5rJDsA&?2Z7J6+&)X^2-OP=@@}zNjrF*x8 z5Tvr<1Iyd~?j}~B0uI|Yo9}(uCRc8bgS**8NeliH_g1fJkUdt#C}3r`w2CZUs#hJ3&Vj&LvVj@{SUC$_z1w(wc^J*I`^Af`cTh zg=h{7X^7x4F|sf;1KRlU_w?)u_=+H8gpAY3VrK-2jEu{()o3hVR-{pXR z2IW8&5o!TEZ6*SxTnZ07)19e;VWM4}SE}Kouc^$}we-uoXoap%t>-7Wi~VxfgFH2= zp$mZE&b_%45#{T(RKR*6)noCD>+%>$$a_OwDpNKVHnzmysjWxI1tdM2t1pskyRSKd z)?|b$EQ|YQS@#Xijg~L{n{1qpmk1ZHUGD>v*SOH9fx{;&>0vORoxTe*mw;C$Q*RY;BL15*M4DujB8Q`HvlU(~s_!=Sxoj*SnsuVd>gU~eNdh0BeM4bxVxYv zp2K87;p!l9(R6?#jx8m0xdT3tLr2qI1o8wh8I&Nv2-29Q@hcJ>f z)!!o(XNYrf%WTLxra%eqv%SuO!9)gQ&DsZ6_0RVEbuwVUx${<&wBDxmjS43QxwhBr!8 zpiN9TDP(a@&aLr~3ac6RP5`5>j4xF5nIOIrK&L>inQMtpvKHSz;Tst*lP zx~LXb|^uWf2mnmQZ=$$25bU1I@Cyd|tA6qogv)OM4hUjF?Z4N1|r^T7~mwrpPm z|9iPw;Vc=2Z_8Q^U8Jt{nBASnWkWGHqTTZ`h@VtnC7iQ23uTK$161d-Q z^p$8c5WINrwxbbg8Cz5WqsAIx;Pu8rJaqXY$xwg4QKnBYh>iUn)h}5hws=MfE;}q5 zhd2R`#~*pI$24r}o+k}-Ryg1tLL%BN3*$gv5#Q$ekeEjb0A-JZGm_n;a=r*Tnaw#W!BE*%&jph26Ep+B|r!U}1b z<;(aD0<^qYHOB-+m#>jE?tF2;t{gFtD8s=RmI&Kjw%WSLI7VMFZU9~9+Zit%Kr89~ z4=x1u^STIQWpzxtPK?yF8M7Is5}q~W?~KgZGfG2+Y7$r>=@x?8{8P=WU3pbW)K#E* zk~gsCf|GLA>i^V~vi*j_G7d5>IS-Hib?Ze5SzpPru9y59zkbW#sj{2bxTF0^JK$;(|d zw-L_izEU88%8=tJG_t@ykg0P1DaJgacNjAF7y{b@{bJ=f-(7Pg+!wVtMVljJDkzuoxom8q?RJ4av>aAwKl9R^bZrC3<&YN2J+(Z;M9&jVN z!J^rl-N#Ei%%cbrEx=>2k$$^7|N z{km`&b&yI49ap6acdf_G!@6SSVFxY;23WVFZkMf_vSL&OMzc-3wf>FH7w4C@K*w|J zxq}-nxlpmZnOW*0gzrcmfD)`j9gfEiM1Ps4>k~042Gf46Rap)=zo$}~GH^q6;?Y%y zF|z&9u(z6qqWGv=6PQF4)UvIoQlT0HN-{fyTw`32Bd1AGG{F)5)41$5OEwSmN#~gH z$c&GixEU?h?MK}2H@61T1c6ef6ehnZ!L_bgF>EAt+KWD;z;T_6gc_@N<5LsX-5CQv zi%#}t5xacIZ9Qm05&>%b_x?+ms!bY`M6<1V6`7iZU)M=rWfng0JU%xPfwR!L2)Rpz z-IF-g#Tm)VcV+{aCA8IvahuIZb7HElK;~JIyDH{0Tkqda$Lo(E=0aW%BCs-JNkgsc zpCQq_KRor(^a!Ke9_HqI9q&;&7c1Xjs_5C7>fZ14p%imDM^bQYjxTeezImmgw2~+I z6Xgj9T&Snu@I9DetG%w%Vpd9#@O_j~Y*zY4sVn)BnEUyZHAn?^l`(difNH>Hy^gtZ zm@6=TLMC_X3sjg#YkIEU)xsMvN@027?7T(pPc!%Kx9Xhu8SHd{=F+K8o5TA46 z|Cu}TcM=o_!ft4V)*d7hE2cuLU`HmM_k9Y{7%u|9HNX^;@ujOxYo*tUP4Xy0;jyGV z_xBY?IvNf6D}psMWlIe1myYC*wI~HgoTWh}Rfu_YG#dKxAsgwdqqD`IzkXeN)wrz6 zsoJm7#3?b(v8}Q#a?Y}uQBV;tS!=QDQqPZSsCRx9?I$bvLRhcEcx8^_K~l!`nY!cu znIy$8v>DA0DaJ}fSxLU_$ZH%X<#QeR)LKytj)h$kzCp7&lK^_ifU&G6gSEe4wV(NE zntUp}Yr?qj80?G9qlpW(pjw$}M>|f=&J^0%I|sdsBy1s%<;8XRe!p+~aK+N{en+Vm z)~?uuuko)rO-cD-@QBQx0FPpmexK_4hZe;!^o3sT^W6Ge8%{IXnU9ID$l=A~A@S)Wn6Cz3%yli^oTJ>gK-T9j7Uc5Kf z=jvqZ6+HJmv7h;I#V~l!eI3a3O&3vuk!+Jy<>Ll(`|sv%JRrk$IXPP}^smdc(%0q6 zylbb2Sm0CR#;;scgenXB(=jD`z;cUrF5R1y<$XV6S;Nv|g>h0xu0WD)__WS<-XPC| zr4e}s`>;BTISc^Nu^zh2;>mW88V40IR|@e4Lz*8ZgpG{wZETiH0tLmK>{g_~oHVCX z(~?&ydR=iV(;5PqeIJIdk@W@S*pl>?G4YVXP-A~0lMiuR5vw0$WJp3eN&Y^NgP{CH zKz9f6o7&qA5z$xPjK93NYhO^pZ5Tp5h4}yqDqD-_!1o#bedlh;JHw%inAgBPa{?? zE#?soECzvXh09f5s$Yk!uu!KlQ=+K|sw_8d(yPT(qq(Z^W64zaz!Cwy6o=XD7e650 zqXd+f%us}@K)4*&ucG1qF~5k(KM8s&I?dxFYj*WMJF}kU;O^*c6g`(__;xz`JXhVA z2OiaCR>cgOHHl1pHt~l|)!hh}WZ`zQ7r)nEY{;U8ZHQw@?7iig1sbQ{KQn@wx4LILQB&fKF zH7&He7h-5Md=!WW!JgXWJY=s^MSNsySmDf}>=uH=h=eJ=db7@OykU zyO7Im!wp*MN$Us9;lN^ifwe`HK3E|?cL7;c(6rM&m)fNx%B0(ADxEhjn9E}UePPTS z%a^VE3lZl0Y-**VlpFX<1+@h8ei0iT&TUe8VH4gzg32N;v0~*;T$CyvQTV6sEH=?V zpE~N(rMhPGLOQL2^+9cbh`KY`t+=yv%lqqwKW4gNTwOAQQ1^7`8f80waMa zf17O$Rz3@<0y)1c_uRiUpL-Ngp1gwkV|R=Aw6lcTutx2JP}Gq(F_}=*Qii(s#w}cV z@|)k>=Slx-(;D^9VuGDLZHqYl4{5 z?naq)yk!H1Une{zh)=q2Ca_39W8V zGdaH)<*`o3&2*O2j89~XnHzOZwNCeh3AXHo+WK$9OTbE${nDT0-xB%Y2;HRndf8VZ6yCi#NQ*~RZ|oD;uTx3~Kg#5aHrpn@*~ce4EAtT8ur z=pr?V-K}|-+sbNX47csmWuPajg}$rEYvt8?TZ%Isl>TbwcwxfC!S}+dE0+oa0qK!p z3MP`@c;s5ODY^o8sxxcSnU$? zet*7#Y8$JzQ&X8OR`4>DuYJ9@`?j33Z%9Dk$UlLua5Vya{~Q2?J*wMJ!)GAnT1C)L zUK8G8nN5w@F_1AN@2}JGvXFYw7Jx|1Q?pvo-HE+-i}bk z_7NG14OTvuM-3Q^angWcqy{%JN>)U?_#CT%!SE1lX7tHW^R@dyj^P+wK+CYdF}07? z1@(f5V&3dKS?E{sOk43*hLlaNYq3Y9qGgOs+eNTYxO6jcNANtujm~VCkpvbs>RqpI z<6Rz-bjMqoratPQ{arpt#?glrW$gVo=V*9zG+cRIcbCja5RAwJ4wb6)KA{gKPf}^v zk*${mWFs=8hc%)4-|QNgun}84>|f}L7x$_@+Fkbx^9NL}Ov7E7nU?}e)b=McyVBcV zw<2E?2T$N5o3hiTzP=%1;bdBz`PAF1N^xYO5@PC|gZ3<^^a(GY#|Av3sA;$fGT)*O zT}w5*cDCL(+3a(^cX_Ia`d%uO@;}^r{HXikHaOU5-piU*eZ+{hMMmB=RSa% zT6x=YVY8^{0AMX+Kv6 z=(b&FhM>!6`3+SCUSpB)CWR(uOV-^CRBV5l1qU_bauHZ3x>$a613G2S4ISEZ)Q->0VYR~7WV!b>K>QYtJbau7R z=`jPJIh1 z85MxB(ZuecBmQ;~3J8mKjAzo7#8j>CUL%SaHd4)ls6a{W3OehD(v8Sc)k!alB8CCY ze|fZ<+bGxUSo5B9M934oNpA*6+twDs`RmA`ymr@AFVmv=e!u^FfWMr>%Hv7GhbLz3 zKZzR(S59Wz-_TTNx`%EQwWybNaMJE(yLGznU!EVmE!EU?GL!x9R5ln$Z$kL@BX=sb zU@cxtsx%SZw~zt;MDcgcm_oS=)yaK?I+){_;5M;#arQhE3)AKkmn} ztWq|x&PDFKgYzM`ae2QiBrl}nGz7GVMT~N5q&BKjf5?fV&oBph9CmNwswFbh&9{SG z6y!ft2pEsAOp`BV@YDvag4aQgyMWEEhEI!bD2%lwTJBThdYvCXcpH^ic>K;AnKrF^ zKvxlafHQ3|cylOkpNcPw$b2cj-0$z{OhCn@2)YzHujmfJ z43mtl{uX9&53gsqZ9x9I@4bi=iQE4fU1iAZnq)tkpNNW=$?u{`Ju*9?aoLL$+sa8^ zKkbaWg(Ni9UAkd~wn~udF-ho?)S_Do*VB%N)R~vDD%uT34Uki&0BcT0+1NO<3F5TbV-v-U!w-f}z?s92{j-$u05kOQgxbtgO zKXx11nc7|Knmyi&T|VC_Lko1uQwGAw^cZ_;{r&eV_X{(s*IUnzHi>`N%iC-=smA}9 zLsvAEY(3=cxw+(M{U!}HclpmxP@VTW+q-q7Ct4Y>pF)YK2BNZHxqV{mfF8ccRNs~; z-=6s=D4o;v4;q?4DQH@mJQgmjEqc4wsvllXQuX_`w%~2`zJtUQWIEIA5h<-=_P5!g z>7Jf@NNi1|tG*Rtey3Z6`!DQ0AE@i9R%o{j!eJCt`}^jIr0fq^v@2N}QvpoBV{z_CDY zI=tkNXV;W3=_wa^njd`ts%o~$@&_-Ru_1mdnvI^@wd(S z{+&GezWl;Sum*JYrmfesZS~%UUlS=fUXhG}Ia1f*z5}%F`gW0<_eEg_OJ=6SCVkUK zMf>W9+4}MaB;`;JQD%^wG?mkE8D_Hqi(-;~bFKInpvBk3Mx<+U1a-r5!nF`@dYg}O z_u4FlZCCDWxuh*Jep*>T7>6uJ`p1l3b4oNU0Fr zpp0nw&7U(E<{KAm?&tY&!e%7FpHkq?>2MY^S}UFzcZVfm9Sva}$@0U1@9uoxM;T%1 z?D-ZjLtb5AoU`50_LmLr;y&DNEg_XSFU8HlFrtG^T0M{>%p{>@Q?k9F;0Qt74SZeu zvoyWm13d1jQLg+$=6J|%WEvY?4V|A8j!JSK82tWe4gK0kX}S8w&yk%FyF^LqlR)Iw zBAktc|MS2k1+_;n{bXat%nK2a@lWSjlHetjG<}2a%E%UTw(iy(w^9k~;>MTm)b*@P zLqBhselPf($N>O(P<4(o-K6|g=adS-0MWIqA!q8lIo+)$)&9`?eHvUDfbJM-W%=@o;t2fye)6`{6|l#E;GlZ})to!L3#HX_vIg z+YHk*o>Y0!GR+x3_*Itr8E~!Wlts~6XI-!PvxHvhUUA;GPec42Nw5Bs29s*T{HOr% zn?&0BYnb9?$Qbn`XSQNUfK;g-GtbIDox(9SN=q+>tgr(tAUl|CLm5-|-11EZ_QLJ~ zl{3TpYK2dw0d;m5LdUSM2EUSZ&}v}2isLq=%7cz3|97nGLRn2ANu=GvYWD7EX`A$* zVGVDIX}X><&&40f-V4TPN;!Xv{KMc#@eiCIyi!yz)BjXZo@c05%=aPkWtAiOb+J?^ zFCdV>H1L#Y(Qxu*H>LIeJw;Z-+l8z^K`3u5MM1#djeb1}OvpoIaXa*5ip?sAbhRLi~!$2NQ^q0$XIgAMvV52`}^D@sn0XFZ9O zv>dKC)N+9+L%Y%5f*1fE3n)lhZV3LI&|CDrpmbyv5m|~FDYbW>7OR7ASdG#EDSKWl z+Nmz}LDX7Eq$RJ*w@;&F=~I7j@wj%UV$Qc1J&!pzv#AlGati|69%0pFaecTB3@}9) zc4rZtAquh(#<9c+CwyKj;cp_JQB%I8e7j-smJ-B1i={#Pm$u}Z%+Yc!c891e;nX0Uc|KMy!ZYNwksyJLdz$R0L-rKd~Lk z{ncVgzg$MypT1HW%`%X>IT1<}QbXhx`8LNY+x%kw>IqZ}i_$EtQpoT07Fpz3R_kn$ z!J3t23YKR}+e`)1^etB7?z&{b2-O~S!Q%N4&ucNo5^UG>^Nb(7Xobw6jSLgM<4-6=!2`T4c z2%{Cj)BQDK0A3J6%*()tB1Q`!7en|Vo^?=b#Nz>E)FZ%S7%ipFSa{fJ|o#%XV#o8*BdQfI}n!HiBcZXL}?n`O4v`; zSndJW3K>i`7UNzzlUeZtloV_mGjN`)mAID#;NGWNAx$p<OLK2#iu6?QEi`2g$Cw zP6#za-`{;^26DO)sjpcAfj>QiJsiyaGa;UicWkw!Wde%<>?f&K;T*70ffKXnV~-@+!P5W*QdSz7~t)H zgLBLu`%OaQh43eq;CCgL{AAj+$`7{$`7)pGy@8_Ds4NQ^jrt`@IrxfM8pRI1Wp|kK zy2JO2;K`rN-XkObhp&Hbj=X!IhTqsuCbn(cn3xmW$;1v!MJ zId!T|Rlk7z(cRz9-fOMTe!~$(dQ)<|L`K{f>*951rqT=X^H>cI3@=8jnv_mS^#Q_{ zm^azfw}_1U{`9hK=s2ilRz4*^g?N$$b56DzwD=o3=sGmWx3@fz*YBSBDl6g;^^VCu z`NU_^omRFtuf_;QF&4mKq{8LAFB#|=X#73C0Fn{%k0Z!JLg__n?5joh)l zf!e%rw{>j?j<164lhs$I*D6sMcJS{|kd_Z%HM^pMK6lqItezpx+qjrJ>g zC4Ym52Ifg$2Z=jhu6&qs*O1A+P&X&AmJGmZigb%$?()1X%`0gSd-Vu?uabr%F~F7z z*{xH&Jj|m#qKp0EE0cem_=B<*cHko3CNKLtfygN5vCdN9QEz`gGg5CEV=Di#u>+HP z3d(O5N!1S7X!YsushVI3&CgCuiec0F?COvMiS#>~n_{b^0zGrBSve0`IJEGUA(Y;H zZVa^jt`zGFZ zn8DK7%Bf2E99(Kn42Y_4+?eCmGsE1&PY5Xz^Q253Fn`F=r{iHd4F#i!h}=ZPkCMKT zH^G@Odx74FGULeWr%X_N#o7;A&J7j#H%JsbIiRFFO$>Ye9 zu{k-59EQ}BhPxlKGt(OaP#k|ON$SXm#otuLi^54}8LJi&Vrez!mex81kjrcm18zy!Es=79CAa&TGoPvBshWrFKjd#AQS7DrWShhkZX0e4c~f4mV5_jzB;YVk|3sl2WGVce#!~_WS(a4Gng& zLu}}GpHw-=7uGJHFeN|vc~rY!;4vvn#`1!d%x{%l0>Do{3RuM`7#&t^bwBOF?|4`6bMGjigAxYFztC7w3xOwSXefBxdVab z;lsQDVFV4vOhbeQW<-cg`-JQFIG&%? zwh|f!=?5s-3eP;jAdZ~N)wF&CL#gt9)vPmLTT9DX%I|VX9LbbtgmXsKOzPy&GmbQS zOICgoXb}s-3pL z>Ao(;wu|ow;9Ep_euL@{^Td+!9<_zOj-{-uR}w9ow~fJ9l$hBTrbLXl<)wJ$h*c`G ziQbnebdCC_3gx-*kmuz2*p7SZ4R6Zd*hED^!MkjX@!_v>6{3>>*XGfCma9zXD+-;> zF|<{VvgpyM)JwnnNi-u~ZM6U9l=kqw zZPrY7gK(a_0hi3>x5l~)Kobhmt(=1Z>l3*&u6;AN-~BE~Pacog<8+4gkTXz%jKWAp zpdRKjzD~<~OJ`Q-n|gkCSafsL;9C1f;w)sT&|r~(2JFOiBNiFjRG-Wu6U%05rF@XQ z80I7-mL%nk9~&yxz5P1fKBZXRB-4o$Kk%Ko2S$1n5B~y)Zm~^`Mo%dXURsJ?Iml z2c0AQw5XfG|5?5mlMSX~?UvotyWWzic`C$sHSMvy%{ypbp7^cfaYGe+arToMNaHN; zr^)CeCc3rqc;k-YPs`FOuiT~v87=LCBulH6$H{4*jYz#?<@|$o0}n4re1mPl2gmFd z`9JmH5dSJBF*QCyRRBNewbkP^6hl`wG@CpL9^?CPy_c(cv}^5M_;Y(2fQ}(}-I=7p|6aM#99zmhG70m9X977VT8S0J4>z+bt45k%k%Q zK!9k}}){^91&k3?X-K^-Ff1OHp#Z0M~IwzGaK-G^A z*ZKAW;(%Zs{EtH_Ot9W?zklN5Phrb?=Ahne9s`LUexeifC?0kmP6)w}CIj2G$%D+Tq$sP$NP^D!dgdCPir3R`9 zZqoFHAUJroTc=t6Sluv>9rQ%&z85Q^1}fe7LU0|a+(BNy8wLehCcgWD%7L4dnKLVA z>{Rw{xw@CV+<Gne{3tL4QHhtM=_4ioKpigTNd7A*)4(iQ6Y z4(Q)hM+mM^?%A9-*|9A!X)LJ0^4gJ!F9XeHl~}SY7OfPayB&O6 zbw<-xqIyeUMsKdEI+~h4q6gfjryRKx*^ymxiC2qruRoMf-=I_{zh~jGy&HC+lCUjz z(Ku}Tf$j8+`W4KOm0=-FWdG3fTF8#V>^vKoD1YxGSigwKJI(dv(xKy7;f+4kAnTpT zgQ^-rpW~m2VT%YiWo$Q9#}6$J+m33^NTnQ8!DUrW1A#70n@lb$`|@it3ENktQ|1Ko z)SrEc{!(7AwTRON7(#jGxYAfW3h4g~p%jO7%*e<{xAv44g-&e=Ya>R*m^ypcA2M`> z_nG~9UM6m&yg}jvPX2$?Y0MT2&$hwl_$pCWVQXhH)AfhDApSx^6o$6Wo`#fIE{3(* zdd~WuIRY~U6ow^DZ8)#QK_DXNTS%Q#*m&1Az6Q16^$@$A&8P&>;i|*!6T*w5qWs}7 zsk?rq6Tpyoz84YX$M)3mZXJNBmYt1hf7KEBLo4gA^ghY5keHWDOqq$Uc$Rj~c09$Z z9UjEGK5L%T0l4hW+^Q%RlU+&i?n_ds`c?ufw2R=LGR6C0;!mwz>j7{ZAWQ{Yp9C;8 zqUjp*z1*k+#ho0nJaUhDcKsii400yXNGkQij#G0$aX8Yr8GU2RXm{L@9pil+9S&aV z79`}o^a%*K*JRR~L6_;uXaVx*9bvdgk|L^xlu{{dpA-foW_N$8yxN(lAeL*PjEI<# zwdV%bdG!e84xTygqEKTz71xM(VqJI^wlnXrdcTuNuTb zxh`?zX*EcdUpd6)`RMfed0>2*!32th)Wovoy+brf9CBd@$`x|IWR+~J4iXsh$`hoS>W>_bZ;g7iW}R_|g;{8)424-IP)p6W zKR*5>Vt(pLI!5;rSQ(S|0&Wj0zQq*ZD^&736U_h4^dgsO*C=~Iz3%8=UkK$J5sHh~ zusZ`J*~j5{Ff*eVKGpPM&8PcOO=(z8$T^sxbhANK%)AHd8c?)Z2xc-yD2gcJUo-6T zWW#jzU~nkz$-R9K1^COt9CJX*E-j|3O{H^q^2Vw~(u+rV4}^L{Z@~R9C|H88{T$0c zOc@iN9*zmuUcM`@vZ?YBv&n0YOs_@Wr!^MI?L?nge&kwsE2FIU9H=vqs!XV`vFL6C zn-P24H_esRiY-`izgTvK9083_BV)O$bW~Xaj!ZUC8PXgq7%iNBBewL-UhcU2){}A0 z?uTc<(CmK_ncV6d zIsR4#Hz!21VDdZ8K-|&UHawYEj#3)66+#Qlrrhp;2S)h>NUtfqy=qszQ4(D8bvj!@y6Hi8c zrHCi)zU??02=+eduP=m(x2d+T@uM74S~(CG+eHvwP;}s_&-uWs%ya4oPsK&<)BQ}B zEj7J-`fqwBxwH`>U*mAH6hG{_xZ8-;V7&X}Qp?C;)KV-4>umdv5)V;3 z4)}6b<2~S?(jjUi8n5+Q))P(F#bZczx=5&uQvvUbT`_YQ1cWe=UHeJqS2;xXa+p-X zP7^#$&_09eq%R?q&80_xF`U1;(03Jm)Gqbq--=YFiGIt@{33;qT<>FRz+V0M335Yk zK@)kFW{v@Uh-MCv2CTPll;5{Aw{_mXh!W8>^hHB(QzlLtzD4IfQzrBozC!u?O8UlJ zopN|r>Am5!B=B+7zu)J|)4wtvPdShWueZLLi}uK?h$gk#wrKttwO@rq2p`ss6tz}p zYmvalLF4%WG8e&Nsc)R}<5pOn*s#8N zS8df=&yHnDS(X_VfL$ z09{DCP=($NoxYI1z;Rrq8DMZ2&lKBDF-~TnJp?g9dap94x=S8M#OT z9!bLot}|E`7q;zRRw(gpm}qxCUi#ACs6_@-%|Gdgp!rD{INDny!?Bw$sA-Xrv4QPneD?$p!uVm6rQz;S_jw1JCoME>Fo-N*x%_>=e?KUaccZg{BmFv4Y=VhRT>(r?_QYl zhQDdEHY&@`(&=P}4Fy7vYJXx7Hg9zF(dn{B>&_K=KREO-0(A47KZbTsXR_CiwlsP{ zdhUi#lP~WCthv7V}WNZY<#};pIN`zWN>a*JLw3zQHz4OJ5GPDP*Fq3b+IljS} zHO8O0@+9kl>b^Xq{{e_X){r}K08z*lx)(-}-voy{*V7u#KWpAC_&qmt0~&aq+}|U! z9*wcLPYZ365RtnvLT|&ztStBEjknlEyoPXZdPv8pBTrf|GVy*{y(zc6CM!^ysd=7A z#^;C!_4!l#G-*cqFkZsA{84H7>D2S!G(Tz^|RC~H`?O`ty20>fGL#s*AxQRhz|vrLe_sxA6rr9jxlX{95i&M zT1xHi`xtjjTE5PUe^D`7;$j%R9&jQ=Ke%8)XHUj*VB$Ueq3+*@mU?KN4`XNfqQHS) z>zRbrx=X(Rrcn9~W>BZ5RITQ$wyaR?Geu{0Ar&V0w`R{>bhIz~yeoO%F#&T+uwMXE z$OP29RMo_CP++OlVi9mWYJ}YM(_4qE%l|~NV^PwKp2>gf_88Lu40qvm1PqxEOJW2 zotm?F$7v0U_a;akp!l@Tpzv*VYp3v4+r%x?LL`Zl_kGR?mG>VnCRjQGh*(}Vo;wAC zq^69liw&5OX_+Z6#5uGu4s$y0Ee+Rr=}b?Hcu${hP(MG<3jm+q2f%&)%T-E!DR#^2 z1eMDtEa$~;{KvW!OJ-ohc-;6;n0F*1GM?C0tLai{hINXq#qN$yPe**0_^7}fLMVGU zx1~#BbXwTTf7OqZhJ5fa_M%=RJD9CP|d|>4P4H?o0^A^W-T<}$( z`UFumUC*c!k>O;vu`bbl7!dXTgk|8v&D2$)^`f_JOi@VzCW&}95!SCkobvi?Ft}5L zkK6o2o-IJcEI|E}s^sinW2y{}@f&HMO zHf)YjRgPaX08*$^QvH9V(6W`_Us7m!32K$JOuf|wryJ}Q(VB4J{&=JyJWMXd>vAV{ zDIYkh&?ZJ1?ZTC$%3rbG;jqQyfk?xf)TcL8!bvFT;wr7gj%1@v1={jrQ%5}21+TN+ z39E)7=<}MQq2syqtv~T~A^2hHvRXCCDjhTh*m+J@mwn89^o2%Hio6S!N(}i%V#7|o zxjn{XCqr>o{PF{mG#pYAaq-t;s3^YJk*YdEO|^pk1gSUj0RkQel~qhfiLZ(5Xk8bx znidz|UMC2C&$^I!IFx$N!jzB8RE8acXZQUx$rJ0os6m|thxlsoGkoW-4X|(oP7w>5 z27w@PKm}>=E`;K1Ug~L)YG=7?e81M0p#CH(4m1(8Br*g$q^H1f7a4tSWYE09;qLOb zlET{O?e(oNuzF)~4-+v5zC4?!@~|tuL)BoBB!WIUcga(Tq@;J~YcK5lhwpT$n z0`7F;csx|eLxb22qJ`=upQfgnJ_6RJ4P`u=A%a1hc2$$m)?gPUN0UX?)>ra}^jO~y zgOdql_87y73O+yN_yF{TtvzqHj}+YbjjT@j+a1<KCYhfZnn=vT z3bSI$QgmeNEZW3@gLAN?oHPYs z=FAToUG{;{77Rz9%G*})zAEA-0h`Cv$#%nrd!V<_j91WEe-NswBf?ggfK)v- zav2ax?I}+miIu{$8HR|#YjwT-nRB5`OHTt)u)bu){Cnb4RiWFrg|WH zDs>|wy|yq?|H}{$vylNysg7nfbU%H2 z(E4(NG;Gyq%UWyiq0Z_-5?sgnT4!zO54#MT*^#qWB@Ty*%=@r`jxD_sZzt*fB&7+;Qj;L@7whqr|6#6%AsWlMkni<=zrX9!U>rG`{LVSlr zB~~QaMnQr&7B}3KQRF<1o~k2XX$j23!ziOtwuJ8c3o-114PKUw5-`Ot*Wummfe8rL zF(`V%K?nj$4TCebN+94LBr))~c*`gk5WRBx+&__@55m7!7B~O+w_Q-!mixL=y{?65 zjNx?yqa-q!lq*nar5$jlS!v1st^u?7Q}S#qZ{vC#?2O~xfvBqRB~XN#Wgv^a{14LZ zS#!}elBlnKV$gJ4$vh**%%UJlwIA$LkAh6qc7?Z3U>wCmA>%CHhiq2~z?_21{-RMH zM^=|9?(2Iba-T7O+2MQY$FZX+)=VprHH{)%jq@y&unt)_3TW$HVjezA5@5S@i9eBMfnwfQw{L2?GIYbYi4Rh(RC!5 zZ?e?x$Rt-dB0ewf2Coc9qdzO$JOI2ns$E_v=8HO1{H$|e_Or#yXP6Eb@b4~{Yp=JF ze%Gk%eD{w^4t{LJUDeAPUUI^>6`^I|2{va-OMY;YiH_cU(Z^Jr)(AyYRc_vo zhR=f^AgiCjR>qnOP8Hc%ztQ#mNWDMbZs+dqErH1z0qacL>r$2gAKeu@l5)4)>GS2b z-p%g3&$jm>w573b;p1+L$;i&u3qsd}ba1MThDYTa`0j_cySwv?zeI}v{1h(RDPP|i zNE+XEaM{c4fNlf9fLqSs19xj z#qNj-Jw~J~V=)ZN|9SiuA$}{`FCoM=VM@IzI6}GRI4f{if}?INKKDgFm%4w+>eA$? zZ;ttiD=g)Dy2I)G$K-*e)dj)u099!9uPU_JW3kA9UjuT56T=l4JZ}s?Ey=uy+5t(! zsT`KfJr`US6n81A4ib=%<%)G^r^@7I9$d^3MCoRXUjWQ-Yb#IPV1SS z-?4LIM>fKW6Ozoelmt2ik-{iWp>D<$cT@}L`_ba1G_nzYs}miHRb{7Xp$@5S@@-=1QKX@wP3bsjDh3*ansm6S zyRi4=IFK4_(tTdIr?VlvvcIU144Ym`t~c=UbuX{(KU63OOb&nwmEeB|r~QixQ55@{ z(ng04ryR)EG}NSm1f0t3A&aVnMaBV8p49D7Xj$}A~4YuX#^*O76@Ou3U%*|j3dPdQ}O z9yijwhqVS%b+J?5TvF*rzbI_WxO}dPb&?2bw=QeoJz^+Y2$eZDh34 zv??`l8hcb(%glsJg;3s<8%$+Lu)vxTHPG{=FjC!P4u1IpJSE2pY4I6afD5xmUedF` z?=o@uk(fgDn~_sFOeP1>y=d-xz3&Yv`Sa3Mh2o6Z9w^2t*Xcus&}|e78G4z%o|DGu z%#)P7)}Brp9+I1i$0+aJ>46%J&v5yVPoF_)r3Lr@mn->IOuSuBge!-7)V?>*~Bo%4mhW(s$06uH=_bQ zBRCj-Ftz`gLL4}oLjY6A@1H3|oC1tvZ?Q&9J1&-VwqGA~KS@i%3gk}~@a{LNlvTLC z5GV;wJ(_6@v@r2$>b`~t)Tot_#6IGQ!dt46Mwh6K3X%f`?^=t2*8{iO-cw2dFol%h z{+U8H1^`p29ZVRN9>heXw7_DXZY&s{aV24bqL;bDk`S~A_C;iJf`|W~Da1YhBWKh< z|36cx)bJa_|4bq3$%A@bJr)(hn-jfaiV0LEwn1J&3168zL%r)z z1{+`s!BhNa3em3#{~uH6NafW747=fB26V=?)MFnBYWh|)F72FWAjC~lmc8im-ub0+ z422lme%ibf;y&ja^cV;X{`QRKfb$KJf|BVz_(FM75w!*d8Bqmv7J~eU>Uth6#>n*= z$zM}Q`mZT;?RZYL+z)M+h7d*#Fop8#I?&w1b>cDGHur2qrDXF&7M*FrE|#QHV#eh9 zuvi<}6(bShh3SWLd%H^huPFo!0e`(wgY%y$l$lM7ks&qF7}94vKlj%ZLUa1XzEEH) zU&B#wXczDd_{|N^Gjcb*8R&p5SD)~uW&y}Yf&k$Y$#HGL>xtQGu0w7{qUap^-1M>j zktUmra8?ha>PN@^(G$+HDu5P@wZU97$eRCs=r!cF+4x2hL)rph3Pl9EZW`c4IHj29 zivUcaD*UnJUU3!KqYR6QytxBma_K{WDHKB|xqwhgt_r2{&lExfm_p=5$^cV{YTi#Q zZff@bnnKP(rVS|#n4Xqge+E|~o_`ckTM%Qtr4Xz$W8#T!gbqw2ef|*P6KgRcZb-ty zsWF`}lu?nYQ!KM4`R06SIUkdjm2W>u={`rr{jE;i4v4vLBUM zLN8r($wz>NVz6&RS$M9v&)VC8!{nV)Ho5#!K?CJWoU^Qg#e#oYsz8_kFomdo2;wa< z-I^)6C&Yg>~5NhVLB!Q|LhE zLZ6Kt9$*R~9-B0z3aEZpoDvb;gp3L|JMmQ}v1g|$K9*z=z6oyU>yy<{y^t@Oud+$J`onMefI0{d2JN^9Y5Hg@%g8b zsV72v{OTveJeR&4!-62Rop@_)fBQCj_=nWE!sa^L&ezlOPtw8T9sjXVFbSVHdeFqUf( zWc9^|CE%|D7n!5^(yIwl$9nb~R)dolWv2ci^7^5!9nlKvdd$cdc#;jlL)&h8ZpE+U zwPb?V;y>U8a;<*Db`}nn00Lkh-=nNW4X%Eg6jL`CrtoMv;qB=#92+X_Ny|p!C?T-+ z8`>%KM^c2J#RDmj`@BX$*pU0{eeU`Lj!kH{I8U?O)w-PozLRiVkfsDMX0wvvN*O|a z?k#m`y}Z7Ct=Iav-P9*~5S%G8G!`|fA*jD2fxpj9$05mQ1WXQl8TN`yA!Uxt_MuAA z5mRBR7OOH#u>K*6#HqXAHT^i|yY}bvvHkVf(8@5zJA)7_2cVSWs_&a3-GmT;ePC-~sL4Re$@|%(aZ$U#%aitF&n=#L;+TT-32O0e+w%KkAcKPFbI( z^tnhADw)9Fb;z*My}gJ1DWGFllqqaEmo-pCAGra(8;mb>H1(`hwpyZ3I%$q9-Saw4*{kYFe zOgF=}|BDI%&dUbwGbcjPZeiYCp&cBjQxN5+syjFX#kz(4a9K~%s#HzgGe|UwwP6Ec z+QQu*ovEueuHfFlLJLd86KHh7-sq4Rc)KcO3iyn$$QO=oS2V{QI)5bAHiD#?r01=; z_+E#PAs}i~4HW0}@~Kxde+0HMn6IMXT;(Nh=OL&^2gCotlHo2j6^unzL%*RUjR+BZ zi!lu&R1USQRH65`r`H-aQe?=U2{(Vm08L0WL=W#&xfrAWs|s;GcWBJ~Q-xTRo00&T z1nS2b53jHU)F^? zszR?9W`JJ*{H(wP)y}Ggw$sxhhWgO%ZJ)(|s?hBn-_r}^KUHYq?6O|o9P#JHH0Sz$ z)<0DU>8~neC)x0=HYF!5e6(Pw5nY9Qef-4MtgFmX4wNyHnybU>yr2D%pSGt|vupcT zSZRDKnSp#QV|}m=_fB>b>eD_9L+#G}9It}A(-TH^>jQUgr$&MIbzAZ#FEU~!GRLrhy|xY$(tWt<{9x;~zKJRRq? z)@!_YxMbsm^j!(WuX+WLRmW~Xcf@w!=m-&(YKlxt##*s9o40-$QU}(=Ozd*+i*j3K ztWKvq-5lm#M3XWS-ctuROXTT^#TrT(h8Mz|d`2jv<~<*@(f)XB#hqf2S$pLEe7=o- zO4o?;mDt!&ifrcjG>0SqW=OIh-9%I*Y#pmRKC6}*_XT&#xW@Cs06)|BsCdEvP=z8B z1>ngD7-*Yi;>7ulcJWiLV*aW^(&2jZw!Br@na$+1jshLw(ZJ6sx?jiD*tQ^EU}grc z!{t+6H-6!-Pf27AB^GJlzwfWLST~! z_2ik${8NSehyJGuk;-2NtQZZ;IdxFabloD{5bP{Gu8PSiSxEb|9p0P!f`%4~+v4P`G=ta&k)jsri_SqP1u`-qs`XOB#gYLB@u%#gRxI zOdCh25k?z_Ubd#Ze7(BPePKsSXR-|;b|v?3oOwkYq3a7%0uulMlM8DLeV(#qkSTzUm2(2Tp;a0O-X z$lBw3$~xfjHsC=Vc{H%Co4%H?hTF|RT9pMnIiE48ffIhLH|2I=bJ)0xlGxa!L{$yc zv9dhQ-L)T?7wyeI<*&ACtvp^cFV-GA?p}*1nccC>>GW4A8U1ar1K{Q=Qc88RPfHv~ zZ=fg!KOjkPFnMH;3~Xv(a0|Vs&aTru#iq5)D0d|fep~p>tXi*#XUi&z%P+K#EvsylfrPP?BY*n1 zQ@T+)ai|MFV$>HHe&pZuDG;!Jgh;Mm=ge`uyo*EOe`2wDvS@ykravkoN%cLcLghY? zQa-(-!(|B7z`3Xs&wE7DBpLh^1N-r#a)4Y-AAFkimPheps5J5A?Q+WJ2^Wm9Q)vn0 zsju~*SODMv2vso8uu0{h4|j=Yc7OJsp2I&4)ei zqkUU9=%0k}X4320(J#G%*JSez$KU{tKmQdgDj}Imx=~D5#$^P14g%c$GF=mw483j^ zs9op*af`s)ddlCTo84K@{TA3>Q`2CV#9V8a$z(|fB_Euz(xhzX@0h(9hRl7Z{!<sVj6G3e?MB}!$4w<9y7A{0hJ>CRa%8oH*e&b0+H`ylUph7}8Vm5@52r1YM@Y+> z>M5_L*Mnc9=7vZ5b;^nX1XL{y?Z^5?;Q|JzXZ!wy@N@;47#BuC{gPeg`;q8WURr)K z?ulSZc^p#AbA_H!CBi-EsX{SxaNhmTzo)G4Yk|m(RtpX+)0E{xs&=8zcK*P(dN!(! zM$7k}TF!&Z+m~Zg3_9j6nuM>F<-|A3iwN$iME{-$OoXo0_}uV#z++MeBk5D15r~K|svpZWgp0M&pJE^p4-}6_0R;u9CH&dV z|LHT$!7uz^q%wVUs@ESOn*+`*J%I9%yEDTG>OE zKq!l-pzH#^@*)%+CZmYNn%(eC{UQfKrz0G~RTk#!@&eVCBW~56baB(P0n>>el8DO* zNLn^r@pt(ItA|6Dc8EMmH`PBThN5g%Jodh!N#)=h&5$byFBGopMl0^6b08(3>;EwE zhT29KL;~}}nixT9@MiQ{k)dNYTz?8>6$PP=fJDX z=4VYS?MWC9*V{vZ3o*%&p`f-se_?hOb3A_*>6^o#1(Drn{bPl=({@Ag1;Fy4cMu87 zE_#E+Snu;dpJ(IazgmsD2W_)$g3%K@t~O=>E_H!ru{{TZnR?mKx(YbEc{U)ieiDVX5*Vb`;6P1J|**P>uFg?FwTTfE6;hpd$YtD>V0) z6)L7E?1q69Mmbde#|rhEefL@DInTwtQ(7&pd2y5)dv4--lL{$E>;?M8o-(T!ST_>D zM@gwtrBnJ+OmxE*GuO?Jw`}1%+0Ee#`}IF^DTg*%(Q6JIk)qP2*l7RXIs=?US#zh;0oEN^zHt0 zg$fk9UychT9hd;F5Gw0->R(qV&-I@xw9fUPD+C0IY%#BBBC0dP=w)uQ#Pm>NVB429 z^gZz%`2V;6hps)T4<);E7$AujFTa{yN;`>!h`05+ea3UGzI5J36u zO!S}qw)$BG^YO52P$D7(iJu#na9zL32W~AfuY1YCQ(3<{9x^uo4`Z%~&7Ji~2qu#Y zF()u*A}F7o{jw4Ewh<2^?icTnd~EbFQ8eG>&_N*E&p?cWK7A2S?~a^P~u-#$Z56R9*-sgp2U`!JePTUeD zy2JA_0rlF|FufY{od!t04w_MQ2S7YY@G;aInZF0|an&dMtN5zp$Cr5SZTA;la(0R! zA0Om4%I+a^4bH!IA>t$H-sq<&F+XPXlc{uk{EKFUr}$oO*ayx~DCnI4FozP57$tHv z;{C%1j%*J{f|Q)`8+pdxcHwNjo!2u5uWkd^k&G?9+&7r7N9+K+2!uh8Hw717Y>Dd5 zM_WGIo{Sy5QXe~MJ>tzRoa_QAkyvmAO+ zTa{Zy))1~i*58-;q7Il}OW8eMIKLGy)3Lw8<169wV>`8;Y&~G%ZSb+hRt0b7?_=(F z?QYCj3ORTU|HuBs{m1^if58jHK^=oZm&n+uJW4rI@h7E8={n}Dq;)Dd0$Hlzb=Wut zy(Xo3d8cn59t7Cd>k*~~wDu>ZfeKF+R32_8=>p>H{=o1k+||UZdY(f)WRN@JwaP?c zd6FFb_i%)Dlp|YwDONYuNqwhK3nV%478`ZyEa_tS-=a*1w5s21`JBjj{N;$lf2h;4 zGcir4FEcL>|8PgtnXxhlUVKSwd&VAZT1PavJ`t$CF)ygDH*EjDb|8ZyxyGp;6h#|Z z-2aFP*ecak+40GL5iv(Mg7JHo3-fb;6>d|)!wHX+j%1L?V~qw(ivfJxSlv3ScHJY& zXA-1akeLaWzJ8;(8-U@$LZmxAe+v4UtO4zgiE54 zmVlBGF^0z55C8g+eWaZjcI=vSn7znxJM0kzV@DDH5C7U_!8)HDpB}8{IbZlVanPpU z?AVy{QBS6?Z*Rkp6t8vu{{E?5n>R$-Zj?MJq4K1AURvZq4svT2+uLZ?5XA7Ld;Ogv zYS@x6tc?1Feg%S{pb*&RnlZ%#2qK-^I4d&B)n`>cM+>q&mD&eF6={j(YZLqVD3 zzjrJWkhRyxNXQ|Y_k$o_>rg*JL)UwHT3DD^cZ?z2_q_a9K0`C+I5;_Hw%Y&P#~$Nc zUf~Sg21N3^V|khHu_#C}zz+)gl3QX)TV`zZ@Of7XSSLPn;F&sGs0aac?_v*btZPiTnUj+WK1VfPs4bg6Y92M z#{`!+*ip8%N}|~VObB%yg6^fjo%5O{giH-&c55}^M}UvGTS3ek%9U)C5im&hHpF;TT zRA|DG058}R45&VW3_tQ$@H8aYszBQ5BvC?*3t2V4BC|4b$CZGe`H_&)J^?&!S(VdU~EvwgjCT9r2OF z9;lsCW_F&Qg!qE68?uWI;w7J7+clRqN~SP^7W|8=+*N=n58wUZn{g9n}e*AYQ0VpJ&rkoGm$6FmKKNk=z9Pv;qIPUxH_WDHoV z#g0f8r&Ra;JNYFAVKq|5Pi)fvCpxYj-;FU1OB0_Bo(@aoa!-PWW2$0eOFOGu22yhP zU=UB?EW%ZVdLP^@iMup7-`wH`_UXmM<&qxbN-24Z~SAGp|{2r!0MXA zU_0>tj7~lfFjRK5gdO$&f38gi5MC^M|MJ5UFRP4wBnsQ}emNXBx>CpSndY2Ha>@z5qxCcYmA$g4XwW#n6v1X z3YiC=5#cY)9G$>i{r0lsW^iXV@KtO_+65(S?NcYP8XdU&xi8otkby+=I8AE-g3rLO z=gp2dsstu!{?a>F7)bD3<3!c4!A!}^$`txU0UP1$9%bBKDEnUb3@%;h@OpDvdPfYL zKk~e9jIl?~nwl{p`^n@Xv%ayj^Kw#P9Vnp?P<2J0_5F`dhhKUa`zk(P9rLvo!T#yF zMUrgPEoYxm{!W3ZGOVY>jn;Jvh3-gs@gInO)4m212+KkT>|{T%@*=Wcqbc_T@P=(_ z33c9%tbvj+x4KjV`tll&RiiPdAKtI6p9eo{l^T;q@%!K|f!5LaC)8-&Mq3RdJgI+t z_c5^O`-OjUp_;ElzN~BlYy56NXXjYhF9vITTME6r0rAl8WxkHG1weJ0rUe96)xiCB3t8{%%P0*hr4kKA1?lMmD z0(E!A&cyNYqfP3Y9z+=X|7?KyJBiDwpsuf@&{I~niNesrQ!Jb?8nbkqMi%WVtvw5C8SVA!q zwJ?7@6Mi0RA>t^21`6x<>n|No!KTHvATfWxT6r}ue1+{VQ?5m;EC*8VEZHC5QwY0uN4kafYNZ3r&Vn^o}^u02`3|JF9NNZ(|pM)DW+>}sl3JkfA zhCBPcQPtMu)H21}+(9`4xQc_X-vUQi;|mO}>=Ru+6qhPaswr3PGOhxxLuc0zL;MWS z`#RL;!HIN;T>}b$v`>ONqqHIqYkD7V7Dt5ZG;t0PhX!TTCx>iMs}e!9a!6}J7VYpd z#qxjPy8Q~zGbTbmxM~0fou9R8rfgXQx~elBS$jf^Jm;!Fk)kI?PrR<|b6>tP?_le{ z{V=OCZa|07oEala9OXdquETsTf+I)vYgD~655%y^7!|%PSQ;yk+b7+OU*46l zx@*zFk+3(@q!z8*%4puWWM8Y|l6Cae5+koaF?PdWx1DoE*4~4ZacY@}Z|>l{mdJQ7 z+?TA7YhFqL==2wj!~nXCkJ_*MarYjK$8M>Av6@?)~?7a54`RVYeY> zU+En3#u2!`RX(YcNDcJqo4;aG4|fpRE|+;|a5~hH(d`&)TejM1v+r@Z@Ssezo_Oke z&F3FW{{W+nzlu}rh+PwJ?bo6|6Fci0;Kl_}{d*KZ25Q+4AldKau?@^`&#d0G39U305h%niG3dEuRB>L_-08{gEmY{|BmOoy+1=qzS(|vFA zocjDWmKoXEvof?$p=H^B|Bzt6noy7eYT*6RSt-*vGBv@4WRTU_3GL6Pdk4#Mba14H zqz-{W59ijw)5+5N&uJlPVEt7c?&z*iug!eIz z$mRI!#`ZjX5uMT8<$(*ZR8!peu>?xd;1A5^biqXHw$ zogpDI8g5L6L&vRYxeq~vj3;tMwB=bCU@jeVb{R;L-bj@gq}tA=*aKn3jI;QJRx|Bp zYj00+HfX4Y+?Cen{{kpRtsz{usPupX1|3{E+y$5u#k8;WWPgnRJhCd4n@}-txJ#SK z0Bjg5EsH>5DE_k=2=x76gIr(Sjv~j^yWLAfurRy=C;fSFN%#y@_Kcxw#1|hhQ9{B_ zeMx}1hs9~oBl8Uuss%IH3l2;5=+076n2v$_8; zWVmU^6sC#YA;oq=k2R>ae8sp%Ha_P?CL^2JYAyIKO>P5nR!J# zl1a@;NGa=R;7vw~&uOocuROoSHZo?~b=WT4E}Nc76y2RjvjMJV*m#Ap7BHas1MnP9 z>tP4U+gf#DMw>p@9dE5B1cGSALCNPFhXF1xlyK*eg)N2ruR^NL1?Lt6t-yzN;j7o; z!V~a&;Mh6nl%%aCtR-T8M)Qg8gHSL%GBNg6PeH_)h7-N=ER24x?k^?+Eo8PwZIVky z+AWoAhj%{3s?52gyFIAuTuB}x_#a33pzo)f%^mB3#o)n!7oJAV-y=)D>r z6f5V2kI4`)*A+yyV89-Fq?F8C4i4G&Y?HGjQQh0h>}?W2G=t#WssU1H0tasfY@vPFvJ%I<)_@>09ue6VdJ4QWDzl8SbF%*bAo3sI+x zjTmaxLaLR*@Yx{JUYN3Nis{ov;OX&irgq;qf+gad^foCdMw9AAQsgvt=O4*H#!~-e zXyyPe8zpGk=7&{ocA@7}VQI`gu(8)X-E3<@)=B7x=nO3_$P#40_ z5#-w@hPX+`L!pX-WBwxO^xSFhCgXxc!Mq3ZSn6WGc4h5Z-YxmpL}MvXpUUP@{f+N(X;vT9an&+Au=#2od% zJoAAn&9j?!3lFDM8mq6`=MDZxvfPzO>m|Pr;@Jyw;uuiA)-u-}1*b&`gD(l@z(~y? z{*F3y8idXm=%1FvmDs-z zk~au%G!tkRKYzmq&2X6%lC#__R~0jw^kJqyH5&vREB5890!v0 zmhjiZ(BU__Y~fRk7roIz z^N18_+{5I2B_{5pSd?9qB_KCm!)cN%qUOicRIc#LC6 zB`^~??yj`NyKy0xOKoKWO0jd7%jHptDAE%q?^VjJd%Q8n({zM7dKE`vk;Mmm3*Q#T z1uo0)M+VusRB1t@-1Q2BY(yLwWQy0EkiH9)4Yx&HBUA4w%<<&k9wgPW$&CATm{omp zeRKLWR1}u2u}p$s%4DZNMaX=&=>STegQf3?d{@F0W@mnYYW}X9HJ5LgCu4RL_A!xY zKgRf}N)jZ8#U-@fcAnr)zux z6!$(MZ~^e$nibL}R=sctraaUT{grGNxfUgg!O0LGe+fcdq;u2eQO6Odu__N5vXA!8 z2;R2JG|Ni%f0uHk*2r-`OhW;YYOEpHWFMEeQ&8b3JJ}mf80{VO{LVkQcgJ7)1<@jx zJ}Uo7PGkz+@C1Zc07)7r;3}E6xe8nY$SyY>hJi=rkpFq_i2Rf6Zd04(lwWn^XaKHI zhW^haU$WSb2TH9fffm@|Br@~l#qh6)Hq|WBd_mF}K0Dm%S41Kw}^LCfgV8w--KQd(EZsJ<2N$B2v70S@a0U zL_ZK}b%FG&GG{2ih&9@xMyw>RZ7WD8uBdNVK7V{xuQ|h-vS@}+&vliIz#RCHJ-LS` zrb07Ct(RCV9t=s$2hla^_&SV*S_92t;&9f;2#&K6Q~j2DFueB3p@-E|@;!vZWa{Xe z{9uzgCg4S*dR4^WAwqAx+v<%1i7g%xa;{YZQbkAD`tz(MoYn#D^(Eoy<8df8rp%^j zpG6xfaR%;bs7%;zs^v$S_Ioc6Qy%`X$(J8rJUw1Xq?eViL+~BQU^Gix!t^krrs9%? zzf8A>c488R-n|ioxQ?_(6i9p_xOR3`!ps7M#g0kxazoZ_Hk$;7vK?9CkC;fgYfp*Y;NmcGoz<@-%26Ni~q`=N8 z2j1UihU>E(WgrSrf68l?+Zl)bgdY;^CqDFdrQab!6#}s`2Q%eKrTyIx!N^hXH%yBZ zS%8-jpDdpJU-QAxc1x0Q&&|9^$`rnRkW;8stq{yEg5cN4q?4#f8zPXAgW4-& zU`qM=`w@_O*iW&ioNsehSK^J?O`P;t2voCVN%35HrOd^1A1@hU`fA@l1UBqU9 ze~6XUN4G{_ZlcvWso;8(ukTSSbC3huaB-e2%u?;5Cp(vC3+M2>s*6{}t5|3!zT|SJ zWmEmF2$sIhs;L9b8}}F_!r&(7X8wTv*G2p>wor-e(%XZDp4ZZWe784@WCS&IF{Y|q zhZC_g(w7COO;_%}!MY4qW<1LciCk zGZtDW7Wy-4-q!x?2T0xkRUs;#9WDKc9_Meav2T?KG%5adJ{-0~+XOWa+rj*-CtIW( zEiy<4{ZsJ0fWd_I{XYR;@{p7`rAblvwDsq@!AKeEAxyoyUeO^*BC9(!+(|x~)-t9| z46{p69)RdIdJV#nWVd?ce+<(iCaJbD{xK#mB2=l*bFt@V!B0X)o$o8KgX3cRl>+^M z4Di@yX5Mv}w{^GcibRPFyTT)+$B0Vr5oZN;ng7T2{R)fCd#Ql?Qbu?IP3T@l_7+QaSOROQ;-8tb=wjX5mbIXerc5cZC%yuD!? zEkR0tE#;gkbDQ>j`16qA?ctz&wVgK6Wv&)1S?bmk3{3I9z@~7xG0R+N``%6=0n;VO z8iN!M8qp#70en5f@>sI!r9cdu!8#2*LSkQ0!w=yIJ0X+NN=3=7QnpKrIsvyY1GM{U zAGOuwik}^R=CvG-mB9r{o*2JJfqiW0SpAKitXwoazTlrcB+NO`sm{yqUz^%~l2!Gi zsAg|>7_G_&*}D+JBl1}e6a}yhfo2~;zDlCTJH9jS*6ZY}6i)o|&0(sfTl%ajW zCh9tp-ACGM5_OK#e$Z#@!*=)(So)PMkb0w87W7SlOEyo$$2<`VKdbtpj0~N35j@&N z5Ld;GrDp1=09^&rhpB1KJw{;zDPm;qZ8Ta;Xt;uq_fPonZtArNLyu~SaFj;!{}_%A zuYF21?n}Ne;A)Hk`Tj+luhfq-cq{z>QI1m#`?m;N804yUe966E^={z8N2I9@LSCr9 z8GbNFL`KWSS@aGz;;0tL7FeDcykHpO!WjCZy)839XHH(Nardq&&IFa)t_9S@M8w#g>h`ZS!|eZIT7vuf2N}xPBe?OQC92U5`bKe z=Ia@;s|(UWp3mdn-AAS86LPlZ|8DT~rq#`b4~}1T{p25}!{7Pk^SFK2c^zC|BM}HD zZQUI`p5pp0`n-Y-2PeTwwhemBp$1VT(oW6|jw&F4JKCTFVFq_Jt*eB6?9jO8HMm#B=dwA#_)<39a&6&X=T*BY zyY+9==vVqI`MKkfJq+W5rssuVRdOREV&x4C4{@XaD@o$I{4f5)oJQ|2S@a1gdh90V zFTPr$NZsn7u%~uIu1%=3X16O^TqNwPOZWC5&1P-Lgqtcw|Lm%Y$Fr40klP=Bj%qEi zxDBIxH&~5TP8W95<+k|u9JPUC?}&{M*0ymfLDWg)+B|9|iqYB~>R0$#IS+`hjEvC3 zaVyx5m>BUvJj^U_u2>;s&KZ$u7_&{@dmm8Y&$kk8p2bFhh&P!DiR-Vz1s_F<&|#kG z;b|ctc2^2PDi$5*NBP>T*r|1Fy4SZ^0ib7!MQ!>&dSa-g5cKaVV%?Mw4c|&ceX29R za02Q+*h-QrW~ey6J;$cipXrFrcrlaV*}FM`h^14dsUVV)gi$tMvh)0|5OuWZrnMnj z$)&y^7vb2Bjv-YFMZTY61_PYk{6>o-D9H6mfY3z6PC=Hv_6{|k&z7f(rpDtPSBfBO z%XkMGA}y^m3Evc*Q?{*02y*BDI*Le`?;f57f#81tyMe0A73B(~-MyL`dSt7a%U;n$ zUscDn?DAxhqy-fAdrNm<}|S3Z?v?kP(su&s;gU|Tc)bO#L+*YCi&t^)urmp5viP2ByT1_DzB)NtY z?~QQB!~Yf)5dz#pw+FyaD8;m@DtDW%tEp-MPinxUniw|JQq3c^`Wc$la2QrfQqKQi znu<);_lcU6wI-VFLcT6+KZJ6yX?Nf@E`;wOK_JR=-Uq!Mea?nbsGno1x4nM#&`eR_ z_WoJvXCFHmt1hnrF%n{xqJ1NqH31?v_*tjIah!+l+2tb|dz+6?@ZLj1J-5>*Y_hl# z=nt8b)3S;zS^F>V7+b$E1J!YQFO2M@(HeF~2v&#pZY$>wJtT;>VI&+-!t?mX)EZwe zNVm$l#i&=LKh}pm$-Z1KL{Xro@-h#+b7?(C-hUj#D~BmgAtgT?mJ8fQetVcU#fz|X z$(xQ`EKmv95;CMmCBaTI^|-z%csII-sC&)MC`ZnlB8+c~MUgyE19j9(eSliei(Iuhae!&_;+-3$;+8qcNy3LJiC==V#u^@n*j31KFy|-*{xixhm-qdLi?JR6we+xS7BM3+o zdXl&x=_TVPnBpzVuZ0-2c0FR{=;zW0NZ_(UnGHO5f#)MRF;X0zFRV}L=g`NSOc)++ zy&xvpPE<%OseRBNCy~+mj87&5T-HmwoM;d-DVpgDP>dJ8q&#A*qD6ZXM*;Z-9`_`o zl2OMt#nv=RT{GKU;M3WOY5;1^q5fuT2kHJ;*z98g0e@R4^#jsmruI z4cn7?Jtz*>d>!%IjX;pY-jTq)#@%|jRL_DS%F@ZdyJoN6! z2`&H77`oDmq+c7=wg10EpE=`GxOYeJ2CI;x_||gBE4PWX@Pt$aL~iIrbJh%c8I^Ak zn+eph3e94B$d}&2d;IT`R-*-yE}`CuIeD1>8$YB)uld>ewYHwz7#a+g)HV-_?h_Fl zAKn{MsZE-9g^zqviCly*Y!@`EOlH2AceW2W%T|5;cc^%j9oDsW`!jRd3?FjzUh}{8 z{}M^e4%@pcb7>(gpwjXZKBQxu+TD}*@c|L9yRu%y)hoHSO#oD}WGbVm2+}6-9H&6m zW`q7P*Wrn;mdTT-+i<+!=NqtFfiK>KNG z*%y}_ilu*#ovviP%pO5JM9tgh{s$e#l2Www!U2Jt|06G}qhO&~cPl*@3&S#xFCD{c zf?Gu!j#SF4!$|GjYdf(38jB`{mo+f4j&R2vt`ZQ?KiYZNzWmS1Pi8YjwjMRzlL9ot zZn2k8K<3c6p=HacwN}vPQ!yryOk#g0ieoz7T&=g@BpUkRMuvodra?|pFa+eiM)X3k$+yeU^89WpvwimgHIe z!KqzXEcHsfnkCqde@AlDg|d*^CEqDRRf$!hD%cu6lP#gU$EKFUIk6j_rN=xASmW<1+&_&fSv#)!DY}2S@{)NAez`=eJH!*`kzRK zMxx(<>jO|xTyD~=X?cNZJT96)$JkwBM^OTMVnPfy|FF#s5fZUr#hZcO`_tjI z#QeW|hhctwONW>e|4-cEnvmCA4H8!460izxDdc)-Fia|ipxnvcALX(L@deA;+lZrR ziW_TSeA?xit&m9dnQq;$n|RMJ=zL(R&e`s49n| zQ0Pr1N(X3=6M@7=tf*AA_uB8R9eD_sVPXNhase5#SyZ-0!oH45sO%^k7ldIn+qh7T{p`;0&qeAUVeBNbLR zNe@=w$Cr9}&cnL-4I~FQme-gsGDt#xD-FlvZ9E@}-~(jgARqxryoH9?soY=ZhNEb$ z1!q|)=H4XA0pP6WY^}q)5s9*zgs_9N*=(iO%{+j&@+A3H&$Rw5GjwF3G;0F$2aZIG zrVn;86_6Baz~!39&b8bD6kYaPSXzbVW!Q0j)z2K+1x29%!W7H9AqAx zS;XHlG;9syTb?=OCH9}>mr3zV=1o{a&nJM2?L9_2_p?T{T=<(4ulhy}h48a!n;pqm z%?L)Us8ALB*T*aD63n={xHTT_p})LJ6*gdG0cVLn3O7mbt*n4(eeq3PsamV;<)OswM{+Y_g={7@}ifn!93V;HEs zn#S)$Ade*`C7#KbvW~MVTf!bT+2o{S%&ZylX~OMNxg&GtW>&u#NQ7xCq|U1mf`4Ol#H zliR1qbLS%c9)3V`Xm9wZ3aF1nMx-OKtR#qFQI@xpN{4? z+1;OC(8l#ml)3PeDBHmUvZV>3#etC#)DkZ`^p$JTrAy-p6ZH0fpD!#3Qh0Z1grB@N~*a$0O48D^F_ zLRsHKUbLro9dqRAsRQ`&qmuhY6<4p}SeNMwH~Q^s#jf*RM)hf)!7iNj+KZNQk{v^I z^*(AL+kJT@8J5w9rP8{yXzNO23mPC5!pW6k_XOl}&nwB|kF&eYi76cD#%x-h5FX*BP2I2Wb64pavWw)-%Hw8jIR{ z{%Mt(E}4?5uvEg+3bP>3l3wZX+K-$4CI}6%yH={adc4d&jncc>xrHJ?Z3n+p6&gQG zF5opiSkia25PNZt-`>Kf-fgc)fXs}EZKW9{HYGdhg(Q!z(&LEGP*D_oy&Bcla@G1g zBW!+eM!xlWQIV5lKjS-+DYMkOWG35ouh(mVk#l(>n|?-9r~}ZpRoV@Oq+AQ1)CrW; z@z~>gZT>p?ca(A8n&CtP3MLq4G}_^UQKzqA{y1Ln_953pdzY#`UIkwVq(D%4Gb!jk>YUFFo?pz0 zNkOdaqV#4jT5k0Zha_cx2C@#0DQ3Y_i%9D4_Ww%bb9GlLJdqNqA@*Vy55hA%|Mw)% zwgoXlAoLdAmi(HL3-9^yvt&4Yen}1-=%L>8VSY5tel>71jjMBMqJ6|#NKbLrM zecmwoSpeL;e^!m*hY{ImNc>q@>cs^5Talw@e}YGus->AaG+}u(G$Iul3EIlkXuPS5`eNGz1kY4_x}RHfn?<6VmiyuM={OubUS=B_cJQPIYbrZ-VVFaU#!gJ zywdd+2jGwC+3zFi#2LJIkWp=lx6r8sXm+@!C>-njxu(+rzl_Y||A3;U_c{NvFY@^b z%s0GSB_FLH1DX15HKwFQGif@0LCYtEUkgbphkni6gsYb$h@k#FKl#WHdo{clH6BW;#HMW zf1P*Am=Q3)=>je(1Qmol4iYP(zr@KPF->uNYR^R?|VAWw~=q;{C=Pg9xgvCu+XUVa=^C2 zwSzh#9xH1fE33Achl9lriw%m3Z~4r;QJwCj*n-HG1n@l(NZKL-#hfvaxac}wEQW<+ zMXK4p5o3t#8vNXH9l)@$q${Ll#lvg(BzDW35eAkAive*+0SR>Jl9rTv*=6FshH$*Z zXjVFt5<{NGPB6nP%`c!6(Y-ZfQWx&4ykb4O1I~3y8A*0DW^5em@P9&*6ag{TGURpjrl@2vih|3&EGm z2sI>U>KpHrg36dvM!#(@&kC#%W1JmOk0lub;8r=h{P>*(7~KxR|naq55^#myCzqs;J0<-F#|%UD;t_Q(?)tsaWJrC6VT+gwsDxbYKU^D(D2 zktu-Q+{%Bjjwrtv7M0{qw~Lb#E`hWMdSbg1bvx0dHx(gd#xG4bqiAhF-!U#Q<9KT^ ziPbeZCcH{@wVi5oYrfrDzWUss2Pz)vV~s3qUMqU4bzjF8VG=RYTFq8J=8W@@4YfJY z!c>$<`nUuZuNXX|{8yP(oE-<}e5FC;E*Fg$X$*6<$MTR9u7`%9Uy){#rW=p1O?#rL zy$L477*Ci|W^cfCW@sCo6Y2iiS#On2bm?ZWUd(&QV7jca9u61lXxhsRZ%xCoUC-0C z81Ei!Uq#Dw1{^pgis{ZIw@WbB?CNf%=cOQ2!xifuoUSZ=S@y1nsOv|o@2KO@vvt0ZJ1^V zdp2%x$*mE5vgzshV_yN(#yrg1g4c8cv$cw|S*_ZIL0QPS#>L7d&>1@GuBI~*?AN(_ zd~FuZUInz2MH~h{;Pdh6y8O6dCr0oJ1{oYrY-b~3UV!$xIe>LhUkIepxDP+XSj{2T zSb7U06hCJns47qJ{q?#_7d<>YA5USP2Q|uZ%5y%)yB-Bx<}hjqSQo9a-WJ-2MO&+v z%1sAyiuAWa8mnp6;d-QNeFo?;_BGS>I0~7BFD)9*VF|l^tf1ipK_OKwAZf!T;r=4o zz>(Zyx7s7X4jn>I|2$h!xGU;G6K^f-97b9~fsWt2_AoX5_8)ShqOr!$$A|{tZ~xrs z1GhMq7lp4IA4BHTj;VPJhK$-2|26jn|E0fxqa+2x@3;Efh=Ayl&1wk-DQs7Ap&#XW z`(VioL)t28Uj^G=m5f3`%*ZCzLX@@Wde4S+{)NT5l-2`BiF%KfUC5?lcxc1omor_z z&zsh4yZBzc?zzs-&*$DF#aGB?TncRzyjH0z4t46k0XQshhjG z-bCK#k-L}7$t~O9UH#uM)dF`7xsivL(=3D#B-Spay)ABB8v3s$pWC)0k9&&Lq-ale7VP`|h6oUH=A){|>Nuze+=iY0#Dj)zNZ zK7`roGBdCnl(+qW4!7+K9$S>ny_1<%@+-v^WTgz?4A}up6V;-D1-T17AJW`^zO7=# zDl#K)XO!l^=;He)J@ov%ay97kidyat(9>Ck!&1>9C39g!-I6Cc^!ynRBXl3=o7rJn z)LL8o-?+@+&&y`JO@9E$j( z-vu)vG_`B)IXvEHHeAVF|E*a;J zSSm2v`b&O6Z+d}{$RTRII^jIaBimcXwieSA89xBx6aqETd zv=cZ+#ZydV6(Al{m%G8M+?A%(J(tEW$ChkIF@Qo;^w>8^t=S`-JAESQ4xVo!n!&B; zk0T%;gzmYer~$_hUa!6Jj=2jd8Av8ruqMzw*4Bc@#L@4^%(9*;V8csjR;9v^NQIW` zEqNd$g$c)GO}^?c8aKxFi>q3|^76OU0fQjroXQlE&XVWoS^)?Us*7B+%;n|;A#Tb1 zGyEb+NQ5ynJm@9J3Og$~dTp6_W~lFsq@a5(S%cuRER~2ku62xM zC_j9y2aAd%sKu0*SrOQxqGbay7SDuXmaUWTV=|5|aWkCo(@r6J$#WVh6`}z(XY41F zFQFp&q?!PW`MS)_7q*afv`x#TTD*&^#Gqo7h+#7Xo$_Gk3mi^L<_A|gjfS{iFpN>8 zZv$ZWnO{o8DfHQ4N0Odq3pn^V|GHyMv>9h9xT(FJGpV{#BNmC3EwNohu?c*pN2?uA z7Ga7IuM%!Y0mC9LJfkQgIGrrbLZ>j?p~7|1$BIT$1#+)j@KfMnTq!2%HPAjqWL0gh@IL};ZXt_i@(% z1osj*g^Cu<%8oLSe#k7lL%L2t%;wj>N_;x^<-{3hK`hQCI)eKMcK;zm%Ctfz!4)MG zE3{DHv-KJ)G39_x+?4ynrqUfEurgo2aKIXXHlE4r`I8~U3~Ro~OR<+ksC2Goe(%3# z-IlSW%y@WD$@>qM_-17@$PF%{-UEq7vF8m_7fHjDI0_3l^9^=NzIX@GDCG<{Zra|I z##L3ZMHa)N^OWG4U`eChx9%VvECv-JOZ+m%8WEI_!)dI75pFaN&B=rOlF2uzk<*8y9^-Eeb>F(;pTZF@4t6(Z=3O#Tf1=g&;#f*`%DL)m+G(Ff{;bsCA?!t6sLU|Jh zPxBQel6mlqR9Q|5#S(c4bO}DG3d!Wf!(owh9GdOOKdg)`C-+l2heT z?5$+}1p4to#R;S1@i8@Vv+wQ}e{)FVdHPSC zdZq9X!?yBm-#n)fJScI^qv2YSDSpS0F}1%SCJaA}pLFrAP;H)YQBOPNxX z7Xir~pMx&D2t+n|aIG*ipLhy#hh6+IA!3LBS3whuHdP^TCf6xoDKr!W4h+WW-T;2&y#g!F24Of!Xr^4$L-j#S$Eq zSSA6kl!DXfeg475po8A$_zfdm?UV0_9P~;%#QZyfgYKQ%k>T4=ff4c8ia~Pq4sgR~ z!>o&Dl7Q{;?#J_rnzuyMf>4B?86sw+OD`=}j;H`Uam~e+V~&Y|Qj9 zxqwpBRX@D=Gyd)HT9UKMXCn-*tmmAgT)T23C8FtBS#8BRXJI4z1mW1TX{zOZ38a0T zabvfp!TS6xpr~GXHFzDAdY)^1&VN!wZ#E|CBlmmFC)tyPhW{K6hQKTAMD(-ALedmgsIQ!*I`(^xBb3E< zbUrfUvEca!3Haz*T3a~ornnd37UA0adCu*g->60{T&)vX+|0-gclvRv-R)7@9|q8# zS9R!vsOYwZBZOzyG*}OUkb17;oQ(t-T^UhA0<|b>$5@G7+moiEv^Ldc8W+c|T0KR8 zU4g=8e&!^b$NGO^djd#3B;0$zwibsI)9!qjF_O%9Nxv0#H6=i{6ndrB5JREHw z->KuKk$U(&!&&5B!@&b|MV^7?P*V^~k_V%;KkxoZ-v$wF$xX|rJ}$Vb+Gbk}8ki@w z>7n-^F1Rw0D~?j%9FUGq2^#V7@b=v8qoRL8gE`I?TDAW_BC)inRHt<3Ya3;qrd`aJ zC?tSf5PdM0n()9O^R)> z!)?~2v7%jGV)XfEda&}Q5gIg4icQoVr#6-q`@8q1qEdfooY7~b*EEll)k;_YTt8AO z%5r=)dYo5!-u;bn5n164^6YR?i=f|ut4H9yAqdmpcFps7MS_VJoUcc$E`*ULQT#k_ z$yxe#QIh&*C<(GEahTiE($mL^oPa6mQbuGc_%}ZDym^AJzV2_otX9)8!vdRx;n3cie^(kZJ(HpAT zTGsP`aMc^OAt{}SJpNEtdRrlfU>Myd3*>PXdB3K-3oss$ODccPjYahc@?PlycsVXP zzAau?csfwD@qD&-{Vi9UCz3vwCdldG)eT~yn60=JH1=_yMCnIjCyUB(%yW!)Z%;el zw^Iss4g*jB4p}6?{r9{vyCA+rlqp9Lf-r)a8~r%62h!8?s7_4Xc6?UC%TW7;m?AIY zP|hkHEQO7>M>0a2`4f8mW-ESgb2KS9C>CKY&I@l z4kq}w?Myj;;W@iXv2I_z6fmYLhegNnuZN^U*NJ-2N@a5*llE8kA3is6S7ViETmeA1pDD>R6`nYVT*G_)k(D8xq8hLyc^jLO6 zJs@QU@8Q<)J*Ty%?bleF0hW+;|JZKjl~I{*yN=olsLtOlfj#$QrOyZJ@fn$|zYUGV zdC2?=O}=oS7U4A1lBd=ZwVw5U#D6RJ;I(ih^xH`QYHzvaJxNS0>(m((jPfM%vhp_CbZvuT4j zA&4uVsU-FX11@5JOh7ct?#!B9ZNDR@tYv#e%nHj+iI1ccA>pTE*ssJ!pVpHK$~S?I z&UbDzlDd+}cq+N%Ux8yD1qvLQ&sYZIW1a?!&HMl^tgrJYo~I4PG!~b8?lv3r{gX~A zcMUNUiA}1{rA?zPvk|L~RhW+ZEa+cVDdgB2MjaB@WL)E*`ZoW8%#Gx6ZSDpFT&gP6 zugR@Yk}K41;2uqR;x)-KO`RRG7Y^ks%x+wbA${v{F$R>&7<~2J$jg4?qEP4X)j!nk zX6t);4=~S#_{eE#s_wG{BoHK%6XsIV9p<$Jy#p{3D8$eh#JHW1N}Enmikv$&b8{xm z7Hejb)*2bK7rzjeXrVF}ncMPqr-Sv8>}^1osO)W8S^BE09ebv!82wD4)n6u%zO%g^ za|`fkL?3y~5iOJ%eDI} zEr@OrgTzFl>=tTH#Q9+gMPq(!pN2%cWe%VCGgUGSJBn7+Yb&#G|FtwpUfWFV~!-x6~@Zdl)j; zOq7V9$wGK|FZRVC2TSZnlznH7*a_7}Knx@VkZ>%VyJVHrsvpl%)?)RUeFRb$-_~J! zYoaF#?L?Y9j_MO*#RG>n?XqS-v*}*;13il1TcK55{@YT@iJ60i>y_$f1Ac4PJ&a|( z{(9s@?0E^l^UU_QiNiE=NBw0>3jWXKx%qq|LE>NNZLA0$QkPX zXaWnP*CLNNC922Q9sJ*0e-V2hBz0gmVNL)!Ujh<5pmpOSD0kw|tMSJUJYcl$7m5;n;CrhP+-w0*kuY&6j}abT>hz>sLgP zaS4MHoN}5=kE3X7VL(5Qg$E>c$?S8`9FwoGM!9be%iniByq~7eFEOtA`{(;L#s~YI zhfA9XcyJv}H)Q+ZroA<(#ZD)Uns{k3QHl4cHu|`gl7eg$%-QR6z+Ck<_AX8eg(-? zCL#o|G{(8;HpXJX6TjxivuhSU*nlyesUVAoZ9@HNbh z{7Q|oNCZ?^$Z`_4N03*BaX-f&^2?ILqBUT)e{Iq%%S7{g^8@Gz!e$W{3*Ee0=X2cAed9(3&#w83&q&tm^9zp$FSrb z_bY>+3JO%qB16nW7!nyR*zW&g>K%h4Ys0ne*tR*bJ+W=uwrv{|+qP|UVkaHjnmGCL z?7hFLx2u13b@i`aYt?;S=W(1o=@>tSz0>5r!Y6Tt6_FW@^(9w4C87_6J}N*tgKB!~oN#z-nW_crio%F;3(;kZMw@Ixc~ z#*vBWCrf`&JlF!iy-t4z;5q9X#dOqy9NG|N!!B$gRV>z^*D#thlN1UAD6TSPbAH>X}@ zi{Nr9#Ho-)aT$Rju1}vr>c~qsU)4%G8)^T?tDWIk!+MMLPQksux$943)p?n z3M!ydz)GF-7IHU18a=a!qVW1aiXac+4=8Vi#KoAOi$>U$mY4fFiTlLp^X);rlLv=fqCDwkpX zJC1SnBp(-iz^P}W!d2IEE(0o=J6#3aV;Mh=+e4@lt20;3F^Q6e1#6TlH85N zGGO*^z#|)qXM}Ck`;>yVG3nm64$^&oOu23;Z`um2UUg(Cc_b;thU zc8o<82~eJy;k6n14-gI&5gQnU3eN{``9QeUJYonFO^60H(UU0jd0do4GBM)uI4(OE zV1E5f3K_Yd!zd#>;c(*NS|3q-3yLRFtSFv>>?%6+SdgMSwp0|>1MQ7Nqk0c1fu!P0 zvrv^!3Zc!vzcEi_#u9p|5~y%kKN21-EPUtyE$2^-=3~|H@AVT&Ay^GO4F8c6?xjB^ za1YWCaS-k*G7HUQ;p1V&-R)-W^O(^q+?fMsIJpIL z9(gU4=p!_#ie&SYc(SzM-v!rErrI~7uzIll$BLi>N4+7XLF^P{);KjTNtr^uwIK&^E;P5z+Jw9_l8tz|w8!4tha{23QC=!Z~3S#02}Y zeHMk#--v8VnI;AjT>D2fa99rKT2(8I%b1-K=Bt(YUx(z`CsV(#VY}fXIs1|1nQl<0 zt16YAs1khAN4Z^Zbz0N-4HSkh5L*Kr{lLOC1Z$NxZOj`s)GEF_-;GW!k(Gn$A2d#*bOWpA44(b&GPyA zc@T{1yFM^5Aftctf)vJ_j6Xe0k$BfjhbqiT5$I+(eeFC+$rRzuDf1%S(jiS$# z12f+icSz@fup>L}d+F%+zKQjag~`pz=O#dCVy34^f}o1gc3W@IpD7hk_Wt6*vsij1SY9Px zWXLcis5An(slgY?IoU52QO(;q5JzrK#%6H(RW9^GmPi-l*FFva;tG6t@7Z1fo~D<0 z;0KPo>~LK(y#`q7%>E>N_`}`G#e(15QzGtp>@%0|XTzGo&8~qxdj{GnnkI+!DVe_I zw)G8t^+2zOK3ZL_-Afz8nfR4Y4mv<}gCHRyop}Shd-<<`87v*5#D93o-)i^fQF34O(WvzAUR#ogLNrIUzZQwE){gdmWTG=h9e1esqnPcnAS(n_MzJ|A|5 zu3~YiDNG@OZbp|vYFOcOzi7FN;-bv_VpUF=!-Jut^?f&8?T5I)V@D|SI|3{WUM>u} zM~Gz>5i6zT0!rt@PSInI!hb>LKln@$9sF6J4tWIo>Dq(fm??#MK#%j?Qn7ukZ5jwmB%nE-1dvR4J6g(^k&bMA%Z7 z|3VIb@8hHWb!^i0EACSVESlI)s)e!WeK~+0?t9hj#|$tb*(#JE0Axv>4JXOT}J z=tz_CG`hCkd{p<<>2*|AGNyjcuczhdh&~(rz8CwJHgdnuL&N5ZPc&TytYYv=2W5 zS@eR$w3HR|^EveFR0T58$p9!d-i;f=0P{rh8>aD^N(H&%2c>Hn=sETH@X1-!9INCz z42XS)iYv*!c&4y>_nW@rBYmvO?mXSF=TV!-u6t@90fR1`Vk@g$2h+E*fO2kSRfZTk z+Qdv0;pS3({~k5Djfq(Q;DKJrM|ha}9(YS0Ne(K~3|L0sbXKQ}={OJ^a}YEx~} z1-w?{6`?^yCZL)E2w}z3Lf+!@@>#7w%wIo6ls#(aJd6wCa;+~Q14GX8@0Tr3-9 z6XBoYNQ8oGy1Ornx^(cG4E)fmgkLvCk)oSq{vPX1N44oHE!#O#Lr!s&Jx_LC!;ivP zi+%0`N09^qrAerRX2_p6YHlJoRpfu9fX zfyanD2z}*Ayexwxv@>ilq&HP|VsHJmt2E2fg}a`=n8>9!(P$+{MaA>#-a6Wt6MlT# zR4Lv9Zq6Su;d3GE{7kS~#j7jB9zz8EwpL$K#H*73IXn0!CRzIrupc}?b-j?^M;uBL z@_Lw$+(l_%z$x8Akpi7bnmBOwPuDJ&e?a1#ekqT32_C9*I|@!noR-~y-`N4Ox9 zl@ig0503FX&BYeL`nDguZKM6Fr|cK@u}p`hFkr04(VFbILLh#>{;tU_=piMfu28uyD=bggu^n zeS2hGkXjj!1s~77?L*p&XR3SLC;r3$l(#Y%xUi2eNFJrpMLGRAdy6c1@+wN3(P8cB zvOg7+Ybv8nuCC<2C$G|IuON3dZuA#x7OpT>bcZF-{QQiR#3sG0dJqIt9ntMMY?2yo zs$)=>m|rY{uptO|G06bL3~kQjM*1$ya@8ZW-Bjqq>z8x=9CRi1OwN}0H-)f8LGV$H ztixj>R5ViJT=_t=x;BCTi8!QE1Sp#$z!79BmU4bNl%X`;(zywFa*_+56(b5xS|eX= zzws{CWG4I})ZyhxZcj}>;*}X?h3z1s{{$_nvOL^!)pU7(6YBjHZE`l$ZoRb-8Wb5L zj1G_bk=8xRS>Ct-JQEGU>=y@m?r90zE-#q1B25>Uw{4xiWTs9Zqf`3a+&bhz6s0 zz3oh=h>MHIzFR`V+6a`Eg3>ZJtDB*ih%Zue7HzCLjOqCS&7G;0D?oB$fOvt6U8FIXTnk~4F1U z?}%K}Hy!_L%pm(QW~51kXgaJ2F3Od_z#_k#3&35a6G=|7gcLzoBVI`XdvHZycvh;N zSNw*gMlgF>MCi$aQ;7V&E0|f-fyNRw4Zfe1-xFnx;1e#+Jcht0vNb7pLr2z^h$a9b zXlf=)5c+Xt7^A4VSHj%=&A{{+W*|^CYul4sjBHJv3dI%yH*8LWX3YS(4>78d!&ung zm)zGG$7~@3C3y^Wp+0(+C#)S;F1){=6_{Os{uY z9dOZK;|jOP&h7B<;{NgA+kf8Mm;~PDPDDWhqU^8oU;`Vh@RGp%MC;dG6U6ndSTRP- z9@bWLim3QV(I|*z%bHbbv3T>`(*j+-S?ql+VvoqtS{TGg#iUYD-IFlTH7+A|h_WO? zp1jPVRv^6k>K(w6WVq&!6drq)6Dy{Z5%woC?ZqvF2#xsE!aN zkTB7ue4k=S_=p3%LMh(L=x*L^(E{@U$8We6Ed!fycV z>FB#6E1&2>Q;#{0NI=hU4BWmI(37u}`tg}NB|QpJ0ZMlX^l=$urLHjdKNo$H2d7Dg zL11(}5)5%(_6y21t5@JSy%d>SZR1fg@JPf6-*gcVE2G)cL zYl*le6TysZ1jBA7>+MUdCSiUKxri@bu0^+}Kb`P3_WU58F2H)~0_2iKjt=)7oxT!uw z7~10!!msQ^v!@1XD}tDural`j-MIlgzmcQ8@@_H36mIhguZ@jlw($g7FJ#afiV0wH z^>8x<#`AfWrRPLy-t{6Nr~(>WI+Q)ZFaN42{UK7P)?*z1BXvk-#z9@ogL@)U(%L;h zucGw;k6%8#uu{e0P~&nKi~oH?iJ?jD@zF9EQ~ozlb_Z4qj2MncAGp->hPj^So9@L> z@=4gsP|`K?ejriXdH__~x)c1n3J9ylTVGOO_0*ee0OrV(Y#Hib7s^Y&R(Gw)If4`j z9ZR(9L;FIO)#wiOu+p}KK2608{Bv`n){X-|O~n)ZedE5Z)mBEQwKvtt&&}PVZCvQ| z*M4KHJ~R*ZFnXTriBT!=I(~b;M8bP&z0DApeWc2E^F?ZBe$3xrRs}8Xf6D8Fx&>q| zQod_OvmYSl61!5Ng&zKDkt8G4YKd*NU6z*!!SZHZwN-HkyfbB+aQEgk6*MHiRCBJ_#ms zGb-Y#_wb&1)4%hY@zlT5qI=W3JMRtbQ4`A76c&+Ihb_~GC3x+8HNj?Ty(ZAah_TP+ zPvy8w0B8(WT3*VL)TTfEySs~=J>4sMeZIZ7y}f4oa}M$HubBs(_|r=!>7}au`bGI2 zf{&x;_20@&7XX0Ml1_drgOWP5^dR%DDQcf&`;6!hk~_>YT|g~T^+e}mf?ff6BiCFP z;!ag*?-Z8|`)Wh5RQ(@Jo(o~XsA8-d31;#7SYQcFrAB51im)fllFtT}%g!~k4if6r zdJBxC4r_))GGO1>GTMvkf#GJm{S4WB693N8*K54fT65D~Ys8R@87h%u{opEaj;cln zJqy9fRX&sXSE?WzPl%c~2p^l(%D(PBNT9IsAxPY2$rXdo0aiy==i_a?ij4vSnY0r& z_zaK6n?7CY6altWiJc!D^==9PX)qh(6_>#RCaGyf!fY&_X;)<)#3NCqAEZ$YT?ba; z2K7lj%K=;UQ@36p@cX-r#2oH(ofrVx@4yod;xj&Og7^cxaEF)*ay9H2d!k7f>UKL3 z0Ym$@?5r*jlBSqxR1pPPM_NABvfYNvwW;oZCqUnP3+S*SJOtzHL5rDydy3o!S>K1> zK3bmDyg%FT{47Rtx6f>Uy!@MOzJwC&{}l?$9S10%VXg4OixgfF#pI9S)+^}o%{E9y zNJ=_IWUdnF{GBs!0dl2C^u4nnsSTeplx%bh0$U=2z@5@cP~8Sqii!nP5s5=2%0k_e z&Nzi_{^2MjHd-2nANnVwy&x4{A=C(@`I{61EW$9-QoA(3+2Bucov?VA9hwaH?=`33 zjb;W~E*E__gP+j_a&)YDj*e3~|E7t{se&!A*v6bG(T<7zPu?df??oO5&A(KpT7-4x z{I5^Tan*LAiY2C~KdskvWF55Hz^22~YI_^gnuD=Ytxf%(o`G9oE+UMh<>>UXefaFy zKE(#gpL#M)!sh#S$nMFTHrsFKUN;8{N!+$3m64P-!La_T5u8+Gv(qfy_tqhX`vB-A zEWdvI5|tR!+knzYVU-lnolaZ-``B&1&iA)K&D}iV0K0Q;P`TePm@&>ye5DMeTclxi z0gtV1t@iIj0A6ml8~MJ5lbpv3@7BF2`~+JAmJf;HH}Bidv-Qjv_^;lfs%G3?M3f~T8|f-|Wn3|;&> zFJqTX6%#3#c?@(uCx7+z`P}_2_ar|jD^gHr)9k-&&(4bUF;2GZ;|pro;1FzOf21p! zVhDO(PK{p=pg+3cW@|nX=z8z{vE}H8r|9@?#GA_2%KO{obCl|NY2r=(;U~!6waTv~@BV+#-6(ToY}Nlkcai@K-3^uJ{Xlo{5pca+ z?=DBLV^)wv{SW^ya|Z&E@ITC*{_8)B&jBVVq_{FS8Lmv102csLDSH8mWk6T)bj6qC z{w&)8VcC)wcFAQMUP(20o+SL7JxP$J$aCC<)u~C@>+okM@wmW1dm!={NTxR8SX;pH z#>tsjNZ0ObBA(W74g_?q@zV|xOiO*r0td$cz0reL*^z@kBe$iHEx=!-T4}3D-=W18 zntu0%eBNbWdt}dFkRMW=YWI1}K(H7(4GFq^k8{5Np69Q*(24{n-1uU$FU-PCdM^*^f=vu;e)YPhu0Pa+i@*(!urF&i7s5 zeGz;FO95Wk0Nm*XDmQH;rsk$}A|2dV|CZvNo+C(2$>n}s!+rhv&^}-C{~&4NNyJpz z*p+#lM}Pf+%`bocA{|NWokK?K9mr{}B+ScVBY=64MJZVLiOjg=E#A|$63pu)`wM-5 zUN2{Sm(-Z>98wHl*r2sL@{~@nS{+a2?Dr2XiYY}(#fnmP#$1?B6kBK{Z-iznueB75 zQ@{xb+UW1IR=DF{ibd(8{kvCDIqTc~GlqJa;`kXQfh7Et`tWOMy%-*C&0fFGN+`O! z-KC8z3To*5FQRsqtKo7&cmP4Q2Juod{6CZ$71!T$exKk5!VqAodl(dr#pFleb3n$1 z;MUJJaqWDSu$|)AB8c_*?&ku>t!l~xL%1~jyf6BxB5kMeulE1f?&pkfu*OyT)}pBa z8bCL9uw19_JiDXK`Be>kpPQr8`vNoy{G$E&5ybs0F|dO&k=lB{e^4?a3Zj=`K6-ml zRG!uNTp28ysC9;vx>`+F1@78MHrLEj2+YAc@+312Rag7>KUj@6&~&b9sIB+0^tmVX zc|P=J{=7*H8TODI zvuHSO>m4yb7ul24;PZ!Ci=NlSNQJr-P>ErQv~^PRDjF68XThD_09vn};RZRF17YtM z5FV8oY5*Gb`|9#O4)NZ5FBv23D4QMpXov~Na?RW4h+wQM#rm;*OxHWT8(Ww>Eot%; zr6bAZxY*~Qpn-NVFAIlpy)b1FQdugj2WqvITf)jHQVO1fATPzo-gPc6Q|w_?JEEgF(&0HITaLbYrZuN-)nB#GkUu0qNEt}m|E`2JZQe+ z-B5X%jHUS$`g+5C9!Rfr_q*?K2mUx0(UK!6n9G0CCS}yE%G|9emgf*%UwXMSsN*N9FO4kTmqxh%wYVL{(OI;05p-O%H_CNgk54V}~5ld^!D z^K5uH!#ppc79c*nr#|l~1Yw$a7cqC{FHg({StpVEvKAl-=R}a#9-CE@ms6$-93}y> zLk4jh;uH1*wDp5v{#BGw_NLQgrzs;i2NRoa<8-^zR0{o*MjnNB0Yqm6OuB!eqKQ}- zl$!lKiQkGo4MJcTMyyW@(HRi$Up)uv}kI>xZF|Df+0RHbq@$+mZMVvn> zm?a2)I#Xl8+Xh=Tg>`)E}( zm)uJld|`F!DJn1~vyB3r+yoKZ|4?m6?gNixZFYgDgChB`xL&nq>CP@3p15u&+K6`> zv6ph`XT$DuE9s&|YmR~huNP=?e|Mje>96rnb(~l~w&)w{06k~)xrpXHMPM5IZ!O*z zwA+;;ero}xzuK;qL<~%TZQ-4SA!Rw`pGAZ@C$t7?kls(Z&qlOkCHi^*C#`COoQct- zDt0}Mu|9jW50u`H^F3ZYUkBT7kSE)Sbt`>5ab*0NY=AA)BG8WJ#Sj_1f?-nWG=Yd4 zM(>5Y-};AgaaRZ|k*LRzRgsxrOR-Cj=E`R0a(u|vOe%|c*37Cb zNOV|LXS1c#%a7)i!EB!$Vo9ziVYoGCvlufgALdv?Y@h3yXPI9%BObE;wG%!w{;eYc zw!?#5+R5;1n?yg2obR}{s>l4o8!6bWnT@oId9tTyCAu^*u9|@UpsT4)HM5^RvKJ!z z)~L;?kT6Y?=4*wYwmvV^x7W#0PpTMnXYnjBMqF;0u`bz=n`M%ns)Lh z`44krc$J~ijbc&emcG zH?$Ov%>nNm0dVym2BqgW4FLGcP^4`XHy>=e&iy-ywXe2~Kh{KIiK;T{C`COrZ%aGp zbCMMNBS@jbjSb6S#lKy&&nTpz0qIfd4Y3D(?JFH90p9lO^}fkdz%IKT?I+~)3TYZ% z{El^v*6sn|Kl=&wLT~Xpx|sR_>yXFiImmxCH{bs0Yx)Cm%*4~m7SwXUr7*eB9L38Z_%h$Nn@%-A#>FV>@JDmJTDhWlbm?giq{d~mNi6mJPE|~r`u7A}R zdEscX$*a>B@PtvTWAcfTHfKAi0GInTsfR*xUBUC^fEj(nHKRkLre*!5)BtdKP>=Ek zF=+NF(70aPs70NsAJidFwPy8C!J0_vvwdgYUpWG4 z*c#^2CrriFKA9S->k0FE^G2Nqe1V-4;Nis}JpDT}logstx@9#jy0c=l%u}TDt!c#u zRzOvU4b_s>rG=Zft_de7WgmAk$(?;_#aTZ!{G=1(-AHC{#<(losJ_o6TJ(;xN3YqN zmezY^Ja>!~W@o0ctmdYT4mTf~mK^(ai zf3CU{EG+YVSCM1kL%L(im(go)HMx3^qa$MfBhKM)1iN$zmxhZj=EF$JJ+G>_M+QM# z1h3KG8pe*b_MT{CwkT&DvUI6`!C1}b2$c=8wJTvQL+y7ZplD|?a8O^_o-U8Zrkrea zR#lIcjE-I|`{^yl{~=~zg<+e6&^1ra(s`bWL1Fhlnr8A1Bv8`h2<|BoiLyJzE~&X8 zx6`TYF0Xl}L*KpXw56iU{pS%##cTUnTC&tet~2p?)z;HS>pS90na7|w+joqcN*&vB z?kqz2UewBZECT@f}BLc;%T z@}ya@GfinZ7pA0BuKLbbnLOpDOsRi^mvULyxSD1}g44CSbS$q9H0y@Xs+15^Gx@wH<$7rRRP@Pow@$9y)Wsn)Vf9vD(pt?Hk#uNVeUU$4t4GLj4+lE2BkUc@NUhwB|bp6H(l*=Ens(sf1J>Y0YSk_PC z&CSS{(dDyaTK3J8SQH~^|Ff^0S&~i~E0WJ~vlo8_d#Li{V!s|q@4z+gCY%?;SB&X` zS621RQBD!GmYSFsHdpBjnc_pTJJ26M2ZVJ6(5P5Ez108}woJ8=1enzVV4enbyV0q^ z^t;euTD{Iv12w*B+IY3ng!S8(tJvG^Xp(L28*3Drs5lS`^pfK6N(Wj7QjCQe);H=~ z0e?~4s_m=0MAj}^NOFG~qtS$Y6KZ!Cs6|Bn(l4^ut zY#aU&`8sPf<%|s~ugotPYw5FV&u*wqc&&M__yn#9@LIRNTBMoF{zmBKqCtaKgP&4H zk>lLDmTrQXBY(^$A8y_otOl!#T{RUo30Ka>`2mWB>}@ekQ#2bZJ04w2XgZUb2P$jp zF-=kiiBnYR^x{;{+3SRrL<0=ylx|xzN+hPzjM#o(xRR+^m-}KGv329r4#Aj$U`axS zs##+i?tDG`+n&(%ZiPJZE7i>Lyf(io)ePR6H8TaT`*q@%I=Us%(R!WcSW^?C?$6d9 zL385~7*(+|NT$Q)dCu4Pb;r=hklYwvJ)vCFwzHC;hJr=^E2nNL*;@Xxq(iMnMUId; zRsJAvcgxhWMkQldf9jBxpB^Xy(A0?&A?FgXWy>-&x`TI?8RhpxYjs*rwVVyIJ{Hfx zlHj<3_t!Z@{=7EFo0JFJvsc5JI!TT`{$0W=ao@dNN>1y24cv?8C#Y!!F6V^oTH5&- z5i_89f2&mUn^O({HS%8QNwL!A)HZ^Av4EUnp7Bcm#JXe1|IZH5EyV9A&9{lS==Hau zG6zN1J!N--ZYFiJ*=j$}ZC+q{@yC+Z-1gCA`Z5R@#li|@{Y+z?xZtP7F8okY>S>)d zss-yXJ?-YP&Lm=~qlK8yV^$!My@ezsT&6L5aIgDppP7E%9ZFos0p)=$q`Fs-3E96H zT+{Bua7$ab0U$DKC)f5KNy_CyGo%U@NFk=XKb}d$ z(FcnQUg!&_;P5IxlUj^~T50TevyHU#rD6d3hEFn7>unb&Vi|VPByNI<$;O{wILN6D zfssB;$`aB@SQy_^70H!nLSG8btO2G3w+JibV!q(wIBTk;1zlXK)%#?pADCf+5(Lf- zczka6UuKk9PwF9Zy6IPj8&RGCB&p@&2PsPAX)`L;VWl=b5uPcx?@*fEd?T(In~H|o zo3;d+KPof+D@s`7%Jebcc&su?o{?SKSbNeA(leaBy!x35*$wq5eQbB>V6>^#Y{}sK zqNUQepVN0+{7b*ersHnJ?%p{Z>?;nJ)lH$gZjSzcK;l9+r~t@9I=II$aUdM;HtxQ&!r0gr|lq#~9LC4&g9#?6~h)p?Z9*2V>3toG+s@@6aaJyhIrbr|^c}9WZ_E z@SqYQ2$iXrk;9CYWo4qP63RZ#tuMa9=@Z?Q&png;+saa_L15HT@JHZ;QtqT3gy-T4 z6fr!g%w|Q?$cuy_yY%Abb->>im?i$2a8zmG+22F41rz18Sd;e)8-EA%WU&5mJW1Al ze>8a;#znpEx;`D5qOGfL#Ek;0VLmBQ{G}({t&^75?=Kl%Z9r3X%@$Tb*vjqeQ|eoNRYY zeZ|ht=2heQAM179{LOKh_5aC|sEa+Eoga14KBE5|xEF=@!A8k?0vN+c9NupxIG-=c zI+=)Wm$SR(M7Mm&ER1a3o~EkgIze@dF#WnTZkCy;a>1SS4{+o8XipNE4>D0XlWt^ng-sxCNn=+bAPDKc`JFeWJ#H2Qh$ zonOtXu{>RmIF-9Nm5C%8XRgIK#LE=aeSs&0O_=|SRGt1W(eh#Z?_gQ}c*{Hf=NO(s z^juR%sLT1J0KGsVVEnAb;M47O&O@*C{dl?1WStD%*mL-J@U~+|(y8nAM_ZSms)T17 zY-P#OaR|QR-<~R)Mxp@F$p!V?XM`iwPzGIAh3K?CZ`4XW|0NO|wX{6x>zw0q#$q;%7=Y-V!nR-kqS4>GI-dggZhK3oN-`hx zpwQa7s=KTI_fl0KV8dP@r3<1B6+AWplVnC!1_3`?N!A>EnZhUb;woGWv}YO;aXJh^ zj1@iNet<${aJ|;`bOE>DY%fw)){U5CydKn~MQp#>vG8##y!-6B4mW3$=-1{k#CZ#= zs(K?oD@JRBqidd(k$A+Fu^6kv_)U^z_LZVZGF;vMn&W+M0;Mj(Ks<6#S;ucu}sp+^latq$L7x^sW*{oY{M- zyHOt}C%!4>;#_laq_sMMJcD~m#2f=-;wX-#}`URC}sO*8q#dBz8Iw;*!=a|He=@b4UcpyD7WoMr= zM&=SSclWWoy`zuMHMQHjI$eUu!wWohaq{%- zb$l`9uP668aj+((F_t{O|GkySOHOyOTuVxH7?=7FZa*hq*vlAdTEIkm$mKS?g7P3p zl0};*Hsl8B(-L*Ky>An{fT|#^2{1$;tu1hdy=$&%A*$13f(0Oh-yb9_+Kh`moLr9su9>1nTGH8(Mg=uXs;*A2BZg7O0qQElV7heB-+I}Us!KC*h|vN+J|rDi)j9Hy@mf!%e3~}UY;<2H>a02 zip(5rtBfYyt!81x;Izxtdf7$UtP*ePxSQaGj0cmc7LAaDFEp1>BErDuZFQFe2&-LC zErim2jz>@%n3pOf6PS<{Cdw&HW3zRBp!pFCt{S&7^j=?elD&^sUo>W~9z0orT9 zM59eUsy*Wyc?slbi&KLANZ5ib?N#r0WCA~E7zci;^? zU+DvajYH>^c1$aDgY|EeOyQ|*OfwF&bTZSHSb4i@I-anZoQ91SbM;-^gI1l|!dK;4 zmf#~49Xc1xTLOo0Dci&SmN9Qc66*PkxJjg=+jVp6=Qv|KDb!>^kVbRNK;Q$qgJWX{ z*4JBFzN4-wObb*Z6&)UFlIUFhEMsQUqm^G?=q^AHyC+_KcuYmQL z{~&W~FbKi7Q@iagAeuP3y?K4V^;Xg`dNQ{Cj@dojtbMHBP>P%#^2W)iFW=5htF zo&6H2*wN^6^|lL%Ygjufj(zH2xekrDNb@B)jAmn#ib!xOifg3RNpvaO@0 z0PjcA3C>_0Th7lVj@wM_5mkP0ChK5I)$trr77~Y>O(>T)xXTzfq4=EYeUC7`v-qIB zvTNsQVZW+$Yra}rnd<4&{!@Mg^YVqVBscDFN%bTDK*}e7;Y?Ud?>h+z@edLmzqk)QDEjBQc%K+V zSYcxF$J3H~;}yH%ac@dJqwwP&DvZEj5FEr!MG6$I91G9Ad9Q&v&OF0zs&2C#?@g84 zcgS&KD~_;V&nO%r)K0kH_gIeiLf7;4v<<(5!AUNt{pEJjdV}@**C+S$L0~tXjT%0J z$0rQc&9cpNGuj>E)t!xpkmvg47){Tj-VhAs3?KiAUUpd+d{z#)Td*=kB+F6Fvi)C0 z(9!elcy)#yHG7vDy|-LX)!gm_zn`zRH~PJSaJ5&(s9*M4c~&a%N8p_FIi3){si8p& zAl~v7{H_*VlM+9Tg+5p9@fqReMD*3}4d|+wet0`uOkCG;s3$hFW{ghQMDW5A`;_l9gj&(k zeYnaU%NI^_M;F)|f$7*L!>(0^L%S3gnQLmlq{b}5?*6g({Y$@mHafMZiYh67j}>pE zgKXo@BWZ>}+b!O%aLUrPmIBWqIInVq410b zp#8DZT1gY{7CH#N zejA1_pD7w%uN~)@hSy{{(4Edtur152tAmm(za?>YM`IVSo8&b}D14)GJiB-d+6~ zZ0VqE_AIAU&3Igx=^_82iB?x<$P2v)o{I0;Ev=HiF~qr>Ic*j~+!zB{`nf?v#j^mb z6F+nRMRq88%j>vEw)4U{;UA7Tb3Bo+%jNv!ch2)v^W9*xxL#fl13S;v(d&Cnj`h0hb_Z9m6VM}*#f|N0v1P%22QS2pIpR{B8Ih8f#flak6ZhZ> ztk|C!2W#`iFu}H~|HO)%eP|}|j-C9ob=b=f3F;&eT$2?~7QUMHF!uG!Py9O6el_0M z%tHzm1)Lz^BF{HYoy~&*=J}0RZ<(DgSUl;rc@K$hVDpwqE67Z~>o6MC>8peQG#MlQ z04K(+Qx6i@KJB3e8Z(f>siN3&i7t0Sm~nkcQaZ@g!xTdz7!E4%0Z^$+RGpOzq0%Y-`Y8ek%icEH$ve|V$BxO*IBQr;+%GPtc5m-2Q zse57#p-{BQaY+zTJE_MM-$pHNW9@i7tLPt*mzd>KWjX{5ft3I>C6dJ~Liy?kI19YY zuFuxhWVwJ%?8_ue448(&j%%tq}y;4rfx3S}E7j%k6Qq?!6{ zXE^pf$MGswLNnsEO@(ZQeI$@OyY} zOeYCjXSzcmj%y7p*A^z*-(3+FLPu7*Joe6i z2B5H}TEp{$AVS?TDGj=f`>+Q^3X*B$)6NC>AFH`4sc=y`&ZvHuSv+3J)>7}Li#OP2 z5q)S?uoT9l)KR|EBvVjWBLqlPqLQ+J*~XkN!FC4-k!$QXW<_XCXDh@v;6fz4Oj`s+ zy<49HVe5BAfSIOCYtW6Qm=T$<6nIo_Z1=V4OZ41@`9u$~QA8o>Se;DCS9 z8b_GRRrs3mw#&3?+ikN1SuFK<&$@1+k{y~oZkX^kFsz$T87^(x1`dB-5qLY3%ybHb zS`2O8hnm+rQ3WJ1j>+rQQnktJRpiY3v0vYx5%-CP%QyJFZ}Za4ZC+eX;6l4SB~!XO zbnrLM4ULZ<@58DpCf0WMngVJ`}#^#FaCS961juMU0Ut2)1u9Abp$k8-X5InI|U6YWM9fyl=L|zhNh>JLZ3;Kci_-j34qUw0~VAAiIGOBS2 z|BILFBYuH&3Qz9$oxpLohAuF&$}J(LOz4Lp{kvNlymJQtgBqThIx0OF^~fn|xF zS64S>4_vHB@`!f(!&3NIree(@KKBaH3YbU%IfR8Zj-ty( z<6FMx%7}4!Q!_pOlf179W-djqhLUPD=lO8Y=nMl#ViIlMdo;@z8OM_PAE_YLIvwIA z$;;CEtj7#WeNM+(zQ=fcsb8~O=DBHIX(Zs+n)Ldiw0XbEy8$757@xoK#NPp1UEVu^ z(?3&xJ?>kt#grMfNov->;x-^6)|G61XKF&!zE*Uz&pZ_=5$?%W&9jIBVM^p zBG(9ce$0tn#gk}Y^vWy~e#+a)s!xcap9s+CNi3 z5q)*o{4y5%i4ko`OJqbUsHrLx8;Z4P^~1j-vxI(iT>avcJGdefOgE=fq%|vRe1w&_ zdAy*1Q8xgE_7UsHh0Mk($M4pe!Mn!_An^iERmCRMoF%cxkwF0@OO+jDjbw7o<0hS^ z2P2rrb5W+v;Jb+_%vba&2(wbW2*%eU6uucw*CODHK}vSWl<3d^?FKulFreat1kvGQ zI?uU4qpJR`SZ3_3kkhkvKau%ck+5V_f$VZwuN*Wt%+{_bmgQ1Pp6x?NUma0#Nt@{~ zrxJ)UlqJ)$3kEC~pReUPDabcDpRCNUGg^Od8WQi;0dkDbs1Trzp+QqAld_VGX13m(SwhyY?iUNMO0ZQ8iDcArp^BKPQYw-(dvH(?ldhQMf)-V%P5%T!Tr$qSD-_r#js2>%HL zv;}Mrn-q$tTZ}=A1$h(s7R6&bOQ4`s7`1Ibi6mrP-yriAum zc9j~$Jslzcp)_t;05*d56-s5tsrDSfykVw*ew*q|rqjcHy~*U@%c+50EMHE-ak zH=z-9VA_eJKD=a)A{YTm4v(Z#BmF7Vog7MX=ac=@6Mu4Y^$|3-Z~K8Aua6^nu=2hq z==MMjT`E?#iQCpT%7>@x1-y4vW(7L@Z-mlAsN1WZNF=t=J#Jf`<^0>`ZFBsM2jag` zs&5!|8>_zsWCjNgzrOd10r+n;G=!I#^7upPC8SeBxx-+#&BxL3-~i@WHGFu3WfJS| zild=Hyx8$RL|*hLK4KD8M?uU)lu`^hCE#2(pHDlp;S7a>OT(smfT^@JGD1y; zxpHL|&PFfrorakV@%}Ab;wH9g*w1(G+m9#YD3%Bk=C+Q2l5K1xj_`l7crXrvHQorR zcFSD*Xkcx!h068STQ__6WcI$fea~wXVEcsaH6f+(!>r7BA8C7*c|61r`l;<1SnQ&B@278#liK{X6xL zai5)BzqusA1+B^AC0@e2p9u=gS@tbEQ}8@i)U&=ag#{9?dX`t_?<@w!sB>M@Gt$lsG@%_6$EuV?jb$b7S9e84jt5+OJF9!O$ zS{@;x3R7kCi2wbK188;m_7*B&&^-=;AEXkdZ0 zzklG)En{3Xu05b(!|)8dCzsuYM*z-(kxnGmhP?2P&x^7F$?>vq#yxpw_3C?GLL39l zLf*XjQxJ3P>U*mXzI2x_o0@BKgO}sN7ZHn6RVg^Fy~R{pOkpb6 zx%QGHb+$N$CmFhuBrZ7k3#XQFQ8SWQZ4v#V)Ro23fnj`@>ce7beyBIu>-6p4AJ5|W-Iq)C*b6EiMX1SiN&h7*iNFINU z@q)^GU*K&~r&lclwPg-xi<2A@kc;p-HX}>b9UUESmZ^~#ekMXOHEV(wO$|0XctsQ$ z313Z3q?(x&CaswksL1sc70Ssun~e5>u!8kql;l`0Pp3W3gB! z#NVe<#5B9I^i|{9vw*fq4veVnMH>e%9DO7+srU@Pd_?*kLo<3M?yiI+eLQ0Z{2E%U zX9gFXfqP1dBbD+OsIzydKaO{!?%tuncwbs4&)wynmPkc>OESi?lXjOV5wzF|tTtV# zkgEY`)~P2nvx8mypXE*+o*zq0@hBjEvczo+vprDP;!o5!#rMBUb#>7%FM?}-h9F-P z&iR5N?h|qr`taUzz)hfHw zw2Y6=)-lm(#ew);yGP@hxGhOsW*BbdJ7Mxmes9Jbj@koOj>w_Q4bl-;s-a6M{L*Sc-OB0QLBc{h@h z#EwaPsi-54;btuAL8);wC}37Tv{BwMOT+LeH(`|$=2cLzw2ptzcnJ0HCSk0OdA%B8 zU9yAl?kLuX_$vVot|cuz36f#v<$8&!>0%~k~oZ2*bW0ClzXi8+b$%aYOLRH^B$I zI<0U4%MJUMR=w&;)8g7k;?wNP?=iY0g{^uxsDaPIAnNYx?Trr(G<{BMYK?`};>)e2 z)n)?D$O3SdzMiC*)4<1GRU=-gv+qPYI`+TX1 z4O^&wSCjpp*OpLMF6cWUI|1i3s~H`HDDpjSSt^YvVCi1LN;(y`KGsUOQWHi zxUF6|iwUUN`bKU8`W&}%sS6?8MDJtRQH)VpOAERUyRQnZ#r;DdFq~i+_-**f<#_yY`w4+-$ds|=q2Ca;$c7LU76R)vrENs14kMmM2H2a zatWYdbnrMBzU{`wJoT6w1hi50EQksq{eF=*HC)bLP)WFTLbXOM(W~V8&j4uajNGJ5s1**j+u=ITENSj(g!` zOBtVj$fAZ@{g{z6T>DpTeNOYbGaEB+8XNyOGGfX-E z_FttpUYMfSBY3$un!pmt8z_fM^OCq$LcN%b4jw${Z*t)zoB=UW^fr)|s zRI|>M(OLtHwWHT$Ee?SpBn7c3yt{8$zQOE*xF7DqXUkH}@viNVWiwgGp<=dlXgYhy zL5EPD`tvRlIGuat#?gW4N)a#gKa{+i)v^QM0Pv;WetMgez{66IheLJ+ zlLMD%T)OG$RQHmiTr%-p3^kbKo)(9wFgZToypnM3m{m|N&Z?~>B&WBM3SmKsVEUXK5yQu-a(H+S#fgK0C!oJLpF^$?3-dmN>aFz10Sh_2YY#{@Q$9HcZxntcsOvB`Cj$1XU z8_gY=7X4+rRsTHh&*{P5c&Z=I)lf2?HpDc7WPo9OfU+Ym(wDJJlOe%qB`vKW32~yl zQx~KW5(I>UXqkUPV#YzWXK_ojh9ravd@OlwBJ_OuSjht408MW<41P9O&rAaQ-oE2N z`qe#S^7P)oRBIhlZnf%QMc5W~U(l2Z8bJ?Dq8eAkonD2F zTC8A3glCNSV7YOgB)om{ z_RrXiXOj@Rul_h57mN^on?Abrc`Wc*EOgeI&ySnHH(RNzbMt=)Yh=Kai(qZ|&| zqWMdC%hzL6$5APR3)2vhq4BC^EpKn%Fp<{>hU3HfcC(!yAA>d>PnR-+Cd%a^SGUI1 za`G)h;i8|~UAXlJ$5aUd85lWxXyxMppoHJdmB+`&t#0YI)A~g1|HTno6d@@Z)T(by zJJnfVCL|R079}rZRtS2e8U$gWyLZ?u6hxXO@#%oGD3}9!s?MP*XxNW}HaAxW-4s=@ z9ocr2OjrTbsTglre1e_I9>K@g=_;7zP{CK9r~21}%GF(Lqj z*#pZ?h5^THfHYt$ANP6{N1SDV?$M}aWpgIuBrI4T<3}_+9q!)S9k{WM5!z%2ZXJIn zedv%G7rd5UyaT(&ey-KQOXJP*KBjdsYUjj3iugs^tWH<*;CW=LpiXRD8dBu@@z0Uk z4Q4>P%MpHa*vaNeM9~u7^fL6@Z}AuMNv$c7mcLbsxbmDbyy0XHRm;;A-LY7i1|OUO z-9Z4gs3vMdBnq&fD}@~Dvvu%Lk{1F=P5{yaCNh}-@L94K3(MPx+AdLht6FC#8;m3= z#D-|IHHzgt*v`DM5o~+go3PZ4mq+Pc+W!Px&?~g2Lhwv>c-xd`@2wp! zPZ#qLKd?NBh(P&v)Xuy^b^e7~hrL1cwm#r@jOEq&Q;)PPulho+)>lIjo5rp>k_^d(w4q=vt z!E^0Ni7*i$#E!!<@+%P0+Dzt-!^Xi*tq^z@QZtVc6I52Ks<(_(0?o0 zJaXsnqcH(XaS>&f?%{fQC(l;{uwRUTS}bM*)8*x&q2pqwQ{bn`&=$r!`p@zTk^<08Wpxrb0M;ok4k&J&8~nkQ#)LktdEKZRKrZe*&Yxl zmot|qTQ}bAMwM;PT$-SefHuyQ6%P(kcub6@5=qp<6pdcZe6?ID`^Gk}WdGH`q7SJJ2_1DO+-8>X8c!HFqa(IZDg%8IK?_-rVxo3E((O!Esc z9*vZ}T+At@7H6t5VPVIm>T%^zqr%}`SK|HxF9_PoNe2{P4ow7ron!fmd-mhiGxK;O z@dACV-9G_=z8+t8zke5(iNc4_>cS`P-HU$sd(t8eukS8=zIy)t%Hre3mD6h{pRV5g z>&nu@pO)v{uir)6>lhl^p8je1%-XYSYd6oXJU-R<@q3m&>g>k?KoPKe@`v?H3yn*U z-1qNDLrx{o>NiX4pWXsEXFk9`!vGF?CzgYV4i3XTXdK8v%jRh+;2XBKHwQ9CN_Hbc9(CgpT5wz zJimJWbmQ8C_3y4Is7oi1`{n2E`#11F-20~*H%_~!Pc{C&u=?&J_wx@_sbuZRUmNGX zgji{|Q8muYtzVymgb0f#-20axw%z%^tll_|-_TqJO8dPDw6b`+abtP?_}|^9-&)}U z#0&2>F5l;&@Xo@t9yBntgGvhub+(rEM~#L?To1LK+m;7NC}Fn@%4PfnEu(%vj3p!5 z5amkV?&ep4S3cUs!@klU+he%UE4D9Q!LUP+kM4$UC_hF3LeKxaOaAD@4n=`#liW9BHZzmC&1Q?dWrDt!!y609#QE>*mSw72LQ%&&^7Sb7-Ogxd)~amSs8`w8)Ks zCbC#cRaUcB&^S!ds%#p*1|`hAsFqic|B(%O9t2UN-%@=^q!b_!J+-x(F8mrTmR|V~ z%cfVZWy>ak@N-x;qgyZyXI79wA3tl%ebadVvU~qM^!qsak2v~IwP3TRPOqY*+7~!q zNHs-?FEBAnMSWygD8=G)n>l9eatR-Y`aE`N-V z8XA_GwE;)6VXC$}MQRVac?3^n!3v(q4#-wP?`78N)#+RvtpA6zUZM~dh5o2h1*dg% zv@g*+kbv}6aA^bXb4-)8P*lRoWXK^ulfj?JWu(kI$S7*@=Q;W*nA;EVQ_w0{qOI#e z!0{0hpH%U2a`nuk#=`sVt@l4$LR+xbvqh^cK-pOm6!r4Md6b*@?`Z zgKk}Z=05(q@!9Rwn+v$3bTkABfw*J%yM70;0A*PWHYeb500pcfI z>Pd}b)JY3Dqz!ssDyRwkSp$mSK|<7@mgn&_HRk`kvh>%+CpQ{DTn3O3zJ6u#?%Meu z-3uR+q9ioEF#=%-(mVj& zH(Ni98+E^b7sA}1oTtgOAveSFjrF+?gNF5>@yWxrZ@zL*{Q#L6Zq8!_TSwHmJ_p24 zqWHUCzuWlyx_jZ?+KDH@n0s<+_0sn+3P9farx)G%hYIpm7U%HtT6^~AmE~muU72G) z|3f!r=QRh*9?}&)tzosC?h9fFn=iQZz{13AnOs~|SLF+G)fc*GXa&toC9HAVfC9rc zQv`<4KO{gPwmpJ5i8R*k-Nr9r83eZwVTyEo(#1t5Er0;vkr$=`kgal{`xfnc7ze`W zSvz~ez4m}CM3Ro;?@~eT3i8Yw67}?x6FQzFzwiXl(uTIXs(~22P)H zZ=Dtn`@4-U6!L>Bx?-PK3)D4`f=-(2 zLBhS5>55B>loZ(Gm&i?5EeEz&w^LIuWIaFqt^58R_tCS)hj&53I5W4hbelrOYgGUo z4wpFcP;Wq+vmM(g13m7PzItX+N@sM!N02wYM6*cZG9=-n6b46xWSZ)j&}GT}NGD60 zQEct$N0=aStBrFDv?ss<$<@}>0Ld>l!M>1uY3C`LsaaJI_(mv8x^gS2wZt2@A|;Ad z^gLv&eK8Xm?~~cN857&`V=P|+2b-)wP51N#_x_*pQ1F;m|MnDUcKKa*aS>M4qlb;9 zn`;k$5JzT$8lP(`vQmu?&C{n?057++a6!}xu1)tuQPR^ zw$L{2vF6Qqu)i9c(;#m(Z`Oi+)!?!Dl)?FlY#hIbUHH}myj){@Qbzjl9nCcMVCj5I zE@tp-#K8=nDDGvX9l?Jp>8qOZ&SZAFSl>-{&pS8W!Mouhk_$ozr{ zN%T3!i%no#0Zk2Z^8~Qk(~s%irx%lK4BU9IpT3h`ydFaBQYTGX@}}ISEscb@NgL20 zJBKm$u(EiLId?s~77nz#qZ5EVzI-YhL!8!kaZP3y$Kgnwa?r+hJrjqF{6Fz6R7iAIq;1_0QY( z%O@|I^_StQ8T)T1-))_#2)ePSAHW?0QCqfjd;Rz`_vAUMacZ3V!oBsqa@|0@Xcew9Gj(YL zS{&hOTFbf;f3`xjCDKB(=}Y`VAsQ3EV3d1!e(9*{5wW*TD@cSi2`#Ef1UVr@Ng}i* zs!1$mv?Y~ev)tIr4VS|7yiPuegqiP)37j!h~A`H zeQ?ix{{~z+s+A|}Q`s3rn^G<;i7k=)TIbQuEQBhjm%>&S?}D72U@&6rR&ULV$%x?6 z*5YIL-lO#=>isVfxzZT)LctkOBIuAzORPyP5^pY4j}+Qgdz3+@W|~n43H9w-9B(TM z*5e4PoL{MeW=pB^oRuoSCN)XNL9hBj?_ME`=nhdI%T=y} z=-AQ#Y+M7db+&KA*}ff-((F8=Ug5dMAq!L|p=?)bE`uR+srF$-^;z={AMuUb~@(dEh|t@u6|!NB8(u_rg7>LEK(wy~ao9 zR-RnPCkfg)^^^e~O;<0v%NM-v<3_=~f8nfq@(%TWF*X{o8->S!vuZGwK5BgSg^0Q> zEf3m=?qm}cAG81qw(;ZNR+lc&OB0mr)}Eb)(<`a5r6N&OByq&o?ZVU5Ckx<>hTnpt z7V*wy3;xHyVP4_h`WQS8NiwJBp*>Mm9*B2!W$E1Ndk-;NX`DT7RHQVeS!h~Y5UJ|c z;h!D;t!H{3;c_#N_UQM2+_@*cd+*N7_5%lY?m6(r-aVNeySDG&zjHr`J-ie=9S&?o z9V2>#B#Asb2&?MxMj=#?2kr&t<}#8gl|^4~U%Wqwx`$E&@&0~2P~(2GOz?-_70bCJ zp;;|3byc0bpTJklh>>G)bF={Uqb!7pR5_y+GpAvYD?tExDLjQ9@P;e}Lvjac>eWO# zSX8<2?fUU^?)g)VXLF4QOI{ja%&6Ap&aQv(2cmSY z#B&a7KqL@INTFa>3utj*p+2w!pVPeRX^bVLahBVVTp$asgBPmIbD$p9?vDwivWB!4zqZo4}?mD^o zgkMAM4e~Zup29Ro^Szqjif5?A1V7q}88++R#4I|fu!4gylir9h zmo4Qf9g};yF^tNd6d4zM(itzo`6Np6Ya)>In`~jmiG_y^kuGy}FB>CDdpd#Ro-`b? zx2cd-J|&V8buA=wM- zm;Q{RO-jr)wI!lZoRX>DoaC2A?9UUeZcEZo83V8{Q_oKEpwT zv;g_HVwfu%@oC#StSd(r0Odo9^qWP92!Vqoh{B<%a)D)O*2PT)EXj~%0M*>qXL{Qm z1h-q9oOZ~D5=B0W-Y|krgw=!jY!tA0MY7+=4bM>MKwY!qHlf+G^|Rbg72d zp7Z~BYH)m{%@=ZBce7Mv@fU+mS(s5`x=MI}V&*$w6V-Ay4+6ZQC0i#FUM2iKMpms) zZF*8acvR5VW}8P<7E7o{9c4{SmkPM+G2;Y9Cl^UlWQprRl8~FUE)qe?9Mu6O3Uz17 zJWQc-L_t%lZhOy_<`&X5^wZiyueLh5()TfF(b`Q9M)=zI2qPGx?Y$&i**A|d^Qf9` zPL&1hf?ka!#x&v~bo~bPp%p>P1`8fw-g;Pl@gZksnRe(-m4&t!)4um&K~+TkytY)=eLJC`C1k5>=h9 zXwF|D5mu_Z2(!@j=0y2EGze)uhX(uNhJ=wJ0L(d?;UBP{fLDp#^uUu9laiUjRHf)l zVO~ZwFXNt~Ue86SqOG&3Mm5VtKvP;aPY7G?I|VUq0FSy}tYqLgOgVFnhSZ6v5a1D_ z$n)*;C6uJ)edygN)rI~yFHA$_R5Q$Ch{c@>?6FA|zBIM{LrIw0;elb()b6V~2dddZ z$;s~qMOaMvE*n=erE-R3HwR-SrT|zMV;v?c>5x)>WokzX6);TIGI{4H=H8h~F`FY3 zjUjZ%W~*9Db#{xQ^?WEay2a(HqC{=c1mU-nX;Lf z>UqxM+1&WI#i|)1P76~oI6N31MBS;>&=CHs%>l){Wk&S28FvWCkj7c8k|U*D-9hT? z+8x@MnqDQmY8<`hl$W;Z%zZ(=^eojdDD_%`8))78`2MRu>@UkGE26G3-&$3qe8+9pDs=< zLN2}Zc`OSy*UWT3YZC3?y-oY|tiK^P`)dILL zpll5v5en+#_0+~%}a3aN8k(rNRk-4Sqj$%BN!DmP77Bm#Z1Tr}+@! zEhKTg{75Z!My(Ch2c%A{WFFm{LyCiSwk86NLE$*tr`WN-Hrg>&)7KO{Llgo>X4a{Hv+bY(<~prv=XF>9A}kmYlBI;tqt|X+qJ8`^wLZIWF_d~g^HFj#=9Nx zv&Lr`5Zj2iIuw@5xaN#TI)-v z;=Kc@Hc=Zq@2D>|4=T1G2>R$E^3MCZfI}~+F`CVQD*+`=Hm>Mb)^gKTS~b#jHM+L( zz8G{JE5co_2u&$Rq7gAptLdhf%}GlDsjLY4lE&61C|JGv-r6@`t=>59o_^A}d_PX( zV(lUrg^;T>3m{2d2(Izhzg_$CyMKS&M9#i$dC6h%1P14vaE6;8=j^6{H-V68E=gwGqqf8j>66uC} z{SNt+UIV5u#jd5|mi`3IsC)0x+V?l1Q+Kbp@80o(+T!9pq&P$Zp40M$n6dmOSxGiT zbEi0ANVuUQn`JYKkCt=U37bQ%k*qpdJOGbA;GzS$*r1mcw?f!Kn^@Hn<&1it0`Ifq zSsnF0SxULSBd;E<=tL8Zcw& z!rHUbSY9j^$$^!w^Lll*$C=4Fm3l;c(Cyaq1fU|G{YJG=Y*C1mk3+%JTd;qm2c zA|sj$u|q+=kgJc{?mlZ0x?V&xx>I}$UK)N`*(rl-ocLsf-4OO_(+||eLf67r`5LVT zmz2P;J!14CVe(8Z3mTH>(u+qikwgmGkqlEb8H?2F8F~ONF z<*?vPb-nCXys^2egO9LPJ%hE>V?0e2d!ZL!q%tGD?5N5{QTf;U>2vO>Px%xe$`l~d z-t{|DmVj=MnCRgNL3H8yi;YhnvS=eD?G9mD204o)99aMQefplQAqYf%?Q{3ZhqRBjhG-0Ra?eR@K^O@Lg_f3#cqFQj zL_*L3CS7SmyKTgKi}adt1e zYQulpNxZJByo6>(6dK|>Arb_MwNS#lY`Fq{&Q#4|Sk|oI^i*XQ6mg|WIJ4=4RH?|1Ll}o+xqv%d!(YDLcC>yBPwzQFs%3lNf%yw4(AsjbsmzBN+)C$~o`4d4(@?xsX^l4;0 zMuviU5&6+Qb#i6-YU9SE#<>M+x;pp$JNOWR&;5-L{-W_KP;zj#hPT7U{Kw)}i+^SN zwxa9sdUERH0O`WI%QU!4aX`}_`=3Jm=6nYk|c)_m=cwJO=dVIxzM(|H4&4zpS;Af_SW>qs zx3ab&wt&bzF9VT#>nlc5K*nrD4-+LVIFN3=veL~YyXXjR0VQ^GwA#ZuMQpXzf^*d2zv!U}NKYhc zGp_ldaY*h`n7PrWnBvl8vVez|l4eJgJ;pkC5vBy{7^jBf4Oq`t{5g>)|x!+={ds9uHJ)a6Nblb5@w)`%%KTf`z_p7gH}Cx{Ib&}UMA!Bj^4e%Z)V`t{T8@qRA-rTuk@0+i@5JzVB2fYnUB}Hal z&VUqRnjFg-(C0j^9AZ2Y8fAauf{-*CBoNu;0K9xetOSJTV~8fW%hu?$Q4i0cgXSc7 z+}5)siil7p0iVO~6^W;De1NFHl4Gn`vU1C1MLRgBT~?dD&0y&+<4b5F^~z_nmU>7W zyTJL2-8{ftyIw69BlMPsv3t8i8Y?}uxSEh^U_T@0)mEl0eB5VFT8!3r&i1YkD(ng}g| zevO0k=+Fqr`_=3UW?kSh;lyC*v>}hJeMK8EsCvW(XiJ<@Ob0->z%K5=-%Q(buH5k2 zK7N%>N36!$OVRdhJe7Vk(WXzuHao6ge|2FF7kB4xH9o$}?W|X+5rio~K6%(U^O1Y* z{!h#Ep5d2eN&e*F`X>*Jx)y;h00I0UOzqq+WWZVuaZFwa8XP=yhvUWV!io5BRVeQ5 zIQpOK9}(n~D|jye{U+6glKY$x5Cz313sroS1RXOCu>v78@aEcOCVzPB&6k{+Iy>Gg z_spC3nh;6D_nwHNN+pQ*JShyULNHEu*is94%`3)fZFPjyR}IxB)?kOw!W?v>#*Nc! z3-{e0|K_8q%M5 zAO)}x;s^Neh5BAj>eTQR(rLrh4IaTjxx!%q;Eh~0!;2SL0iW!3i3F1vMQOT`+jM=# zCfMx=ezhaKmX}A~XOLW1d=@fCJig|B|9NBXoAtS??&9Lg;yKFMB6i29`XJ06@~zuW z(P4*1tr;VD=62hrCJgG8CTiOhC+$att%Vk_AwKrjeJ>$>WyLgh9~a5gB7s3?0=IN( z(opOWn1j(>nOQifzd&RD~xEVobuQbKDO{Av6$?%OM2WCdJ3l5{wu66YpZN$V>X= ziWqS>3%~)xLrXes@zFL%=p7%V`eR&B7VE!88in{onPE9ZZ18xPc_2_j&L5Q$LhQp$ zqdpvjL{3dKA3_vnc?D*bzM*r-P}R(g*;pJsjngP8R*oN$GeeZFnii~+2O-n>ks1q3 zhv?Vg3VjmMu~eQaWQ#G<9*y})`&JuGbtTL(Y9dQHdXw4eq3Hb&!?$P6;5TY(aN&2l zi1(B1Q4$TaP$hj}GeZo?m2BRRHffzAtIa?1XvJ?Yn$0TRvLnfhD3(G!#-dH3h3a9s z(*Q<~Z%IL13WQCUxkG?+0znc)v9KUSP$j}F=CbwJm{BqyLK6vLIUYiiz&DsrRD`q> z@bKaZStxG?IYvtq#aBlBP1Wg#O4|X-jR@{p1RomX91Z5HQC+?4*s$38jIaL>`v0`G zAPR>Di7?0*1q-ACUNz(dF}jLD*UgeLrJ|4h`Piz;l~7ZaTv0G! zCbE@^Q_9B#SUsWvs6uZG z9a3PJr=>Z5>G4XSph*4+qKN=(XT%yHdaUxis(l(XWf zcC>8aR5NOP+oa7V!flQah&p>5)n7$414TY$C?v0482VMw`$j11UxOk7sDV*}H;>^0 z5;2Q$cOGi+43+#XcR-`uCJ8wTWNWno>@RhC7BQ3%yzdOqvomZAEaIhH#AM+sOd5R4 zgTjpoJGn8kpJiq?6p1WAIuQMOz?B?t*mz2_(?t?5DR5>ZZrxFP?swgSf{APmB%w>zY=^`P`5u3vq2ZzWR6&w{+F!;!0?e#lRzy>W43_R%PNbmy}32q?p_ed?! z0Fws9gy18X7Jl~i2Ui+Ynt3S-{179COrn3>{^p)H_PjQN-sIelhLU9s#&jfhA&3;> zY;Z_YL;R>0Hx9aT}X7>%HDR78!@aM1%EU;G#iltf66 zl<61^5l`TG&Ra6A1{cUp!H=P-j;ZX_M7D#+mZpLa4Sk%+c$d!;#qu$jb80x^9>3XY zcB*C(WN1dXycZ2w6SV!V+12MV5Y=(8a!J#OV9?z-T{6MNt!4*$MDu$q_;g#Dw!PU>mxtVq>4%Q zBoe~|1O4e?hyDmj*SF=JquWZ;#iBP`&9(gU%cv(APonNrJUuu7Q9~)9S7DM;xyixd z{LnyOZn8f$nI9OKa3+T)o$NqwvTrcgKbd#>1_vkmd(mz@6Z@SCN)6(^j*$OQI+;xQ zutw0nLaFRH)t)1GZf6SUx2fJ_Iz8Oin@kSAoEqrGTin!3wEQcecti&dPsh;!<`zdW zsB}L{^^NrRjr0zp?&MIC&CmYni9b2H`Uo1^xBbA5*HQ1b-f?6au`r$8UTS){WVMab z9!8^6sX5M(Z65jYH;KwDT@U|_P#O==%Z4Uae+x(tf_nP(y;lstf8%)_Od&>z>|v0E z8B4*8_2d5cN%SN>7*cvLje69F0qZ5CR6{v%IRhEKKRpE3kew<~(Zss(sUn^1O@qzF zp|5(_;-{}9G7RsVh+Ao`u#h&}#I+s`-xICZrTw7mX0Uf~us4@ZClcw&;nZ+?Vp3dN zWc_TQmwd%_Z@jy(3j0YwRs3(d52C5kT;bH&4IBrF=RSQIfSdquk76c62mL3r5%LfcSz;CX&wy-&h72;Z@(ik#cm!jko*q&jF(3s<|+Ni*Eo~B z@V*X4QWiL{Muk)@-FS^3?8OMMZ?Vgw5}>h^7#aA;QA-)(s9wG$W2ed%p{GVGOQ2e&$eyWEE=q?0)nKB zjUxxsI!mbR;JU=f64QFBVT(`|_$jbFc}ZC41>(m+&q*4Hu|f~$M{2HL0njM#kf}j< zWEXQol4mvRUcuBT{?YTyn(=erVMLC3of%<0vDAkd^W4E^#w3UVRmwtMxtIh-onuQS z_yWFtF#}PLSw=7(0NXpHdg~p3swZkW;1E@kT}~_+N6EO1K0gU;sw+{);!`Y!KeMFG zV|dcLy5d6LyUgUs_=4D!QbJ%Poo3e1DKm!+G6n8Bq2P%mUbeCZUa3j;z^k>wA}Z7P zU*=F{WEc|gCYgpLIK2Ugx8 z*n!(9GFZlK7aBy0jm#H>N*fm(h}bsV9|sjc-{4R@m1;_G6AWWN>!yLUJ_TG-3XRMI1OYPtss{u@|o@#d=WV%%k718fWGj*Uv~U_&$`iAAeYTcDnJ&4IZQ8>U%4TAGr$;wUR5(K5Lvk zzPj{LO}7RXYMy2z&&Ad>w0zhBlqqVrBuYFYhX8x^B=m8 zFW~O2-8;U1^D%rDIwlltNeF1uDM0|%+?Fq(kq#8@-QQfm!$rdKMl@iyd+(9^_>z16 z3mS~mzq!hRL`Q*F1z!63WBvTp1{h?YQqv!?`f7UXXrnjI_nD`dh&+;3NmeMUB{kRO zBJ`4X$0Fv@T+BOlYvoz?DdtwN>pv_ZxW6JNFZ!j~6K zyPs8zRWJ@j$|*Y^POOsG2HtEge7tt=HeXnjBf>67n5TR3JezL!?Bn%M)XmEycH=aU zjd;Bp?;&#Xn=g%Qk7jM<5>@P31%R z4%)qIpS_cYCF+xqaj9whfP2m!6syKrUOv_-%Sk6YLvI0|N4{ zeeRz6)SbUiB|wm5?w(m$`caf*beTm!LgWP?RaOD0wgUbnZ9lX?_8JTytA+xW8M9N7 zwwoDGE%c{09j*k3^xvq8IIUs}Lg;2GtH5KG zq5OY?6?L{M&5Aq$O@|3SK~+PQb|IBpR@?=_eSXS2try|DS4&EkW>4Wt#f_iDO#qu* z>qbEBSn%+)e9kh!YHF$TCvc(rGahl7vkDtr@ECE%3Cuj zM_KyMUDk-mXDr+-SU>pB&z+*HI)15IGdt65LOZZIY&MEYs|oIGR%=D$*1rlK97V?z zof)^LXKDjO4xE>uv!+?yVh7ZU$|bm4`UzB5y)?HqAk!Z(aSkW@;)AGrcwi_#pxZaM zbC)-Rm6QG94{Mru>6KYxO7exxxayQCt7X`?PNn@<@wdZ4ouI?^3|j2_wB2Kqq;113 z=wg>`+je!??6S=++qPX@w#_cvwr$(yWE_q(FS+x1r6OQ&AS{Ve(OFoZy2_}W7xr0B|MCBwf(V7E=cd?R5W+$W0xsY z6dJ;~#Tpds(vqca#Jar*EIWE=Ohb?|0(6TMJwH^#TIyKig+3PjDv|{dA9NjhrDrc1@MXIT~_FCGZs$s{DHP&gVRxwAg z!G%FVAHL)&+26%PeaWfxgk~lZjU?i$y!1SmJc;$sQ_`J+x(RNl9~>xXo_!2D&g5;k z5igF=nQOY}K9dJ9O2r_3;1h|bu0;L_GOpL&97(F1Es|o73x0}owV?fJQbTaOYEY?! zIi~{g9K=-qNi!U4C^CE@&ZA^GOe*OlEWf|8jBjB__EcI*Q0_fV8fNj3rtO%_b)fbU zJtu}22IVlFAb+fgb09T^65-Qch5d!#D|c%xu=v@;;t8~GNay~IwnU5wLfd}|y)kG= zgt`j?PXrlFar1)bOR6#i<+~c|E{VX2&UUro#Q&}TIsU9Fx8a-u5X$%@NRdDGj1Ma0 z3|t2TY*B(2ioNrebI2WJ`FIy{q)9WwKX!KH*UazUlnR@GJgrZdZ?Dc{5dg6>w{7`) zwOnRy-R;(W{-;ZCqiNvVFBEO4%^%-|$e)bCyzBGfz1z-iNewVcAMjv;7y1tIbU(3%|9^ff`4pjXK+8N@~tBtPHmT8oW zw;3D=V2KwpP>Kb%J5tqa=`5CDSj3vJxNaE{l#B-NhyvH=_9~1%DXEGacT- zFVabuuZ&%{2t#%gVmDZh5r>qV(PzOHgam!qWCaVsA~vJ@YmL32gFDV0i$RcUhY(z` zuXS>TNnE!#o^2~>yQaDJeB6?I-^^bx-u|7h+c-t8qRlo1-jUxQCcxp(MqbBY*qQwl zZSxprPiW!K>rJ_3g>Kdvo`=??KX9%i#khK9z|PqW`H4oJ&;}t@cqka{Fa)Wk@K&x! zM*RCtJn7oP-aJ{n$ksx^07cFB?#OWAy%1eNi3Mw(`L@p=i7Qtig_X$49SK2m7h*hd z&mGYRag0T3#^)ekC~_7`Gu>&a5OuW*Jl`L$o`Vflu1LYvmj8=8HADxcM=^su+1toL z!YAuJPw@u^`9(F%V@DyHd;@9xUs3j#ppI>dk)tz9Pl<~kcT=&OJ;hNKSuf{6;Ihs6 z_GdPMLsb5DIKiVJGKb7Mj`CINgiNck1|NyBuR!96R_9JPitlzY7y3v9dBmu~@16RK z#ux9BQmRe@J$mAO%+#liL~7`8r%VCHz~r(?bjmXD8i9mj-QsnfqP@hs%sf1Z zwz<*_hSUBMq&w|Zx5jO3ZrHH}N8s}B%83D@pH1bU{k=Oz}A z4u>h_|81LePOv%#y_noV~AZy2;MMV@9Rl!9U@CP!B4F??Q^*> zW3^Qk%AXqOTyiJ!+C}{sXqUL@l*X2dKiG=mNo3ga=}U?;xZk`oXtdH{1X8!~=E4$T zH=7{;dhx@&q!m3LNi0GcRl!r zA%%W68}yZ^3aU`MAxCN^MKW+fGL=|a0+D(~0tJ92)-TW3XXb@Fw2q*OdVqIq*DF~{ z2&x3n9^{5;`vTD~o1yIi*Zx=KQruW*7q6)*F?{X&UiJfPLoJ-IRAx&xiBjytDFY-n zF_x2y!LKaw94}ls_71=GXx1;?K<5WI_fM+ba3{mP*ab;Jk(h~+J?7YxhyrEYB4IIV zj^l6Cnbbb}0`>|*DI4{VzdADKp#13uO`;oJjLJuc@%YOY`eh=z$HfA5%;yI`8Hn+f zcf#qbdstTOfpQQ_i)-R7$*bV=zU+aX$6JfPhLgomha^b#a+LPj5A(au z?|Vr=NZ@FApo$EH7$BwZ%tFOaA}8OBlp*B69{!Apj2DE1(EDPQlf4PsZ2~oLiKT83 zzra?<9Cn~mnjZYvvkyLdFgPBo)rBUTD)dqi(e%##^j?6d21R*Q)IOpdpw2d!x!ohD zTNxYfv>7a-TWt%`Qjebyn^G$3-1OHv$IP>K**Us3oePI|2Mu1Hi*yy3Zt31TAgrX6 zzs)%G<~Hxq`Wz7@SHdjr`%FwL3ST#Df<8MKDq>25E#l{%wFKfyYl>%tpjH#cw_@y= zx9t{f>hBWK?V3^?NPfh`Z$o&!fyG1!FcWP&`7y-aR%V5#y?&^ozwU91jxOG*6L$@> z%$-@TYbWIU0wbz4{xC-hqd&z3qW8Rc-cvmNez;k<6RE1d;V9rdJsd zQHBKR(Cd3|dCFFLFg*i;Ghl5Et%B4Fe&Ob?QP(K$=i-OicK~anEcQ5-s%Ov8#*4Jv zVzVx=!u0ea1{Y>MrIjMFVA*eCU$TAB*C6KHgU*E?H_XsWh8MHYzlB>N6wH~}P=&ir zDi|0wClX(HC?oUJl2yPD$zIwS)bUv?(Z905n=tfbE6*rq*UC@milgyaQeA;R;IU1m z;9<#vbIuKzQOAq$qObc`v2WZ?tCo?90j(5~;6SF!!b#7zq|L@`U#@;jg`x$jgl=YF z$&|ePcn|S_c>7@;91J^!Bn9YZVjt0lHjmC0R$=A}=LiV%1G<_1qW^R=<>8J95V3w* ztdfZLLr0#=yjW3JJ0iw?G7T>`<|!CUW#PRHYPF~K@(Pi(qr^4w)N{9H-_*Av$1QQ3 zNX(E#hr-$hgSBhDe`I>v`yAdjq)FD{pLjR+lNd3I@9eAgl7)y6Y1zbZ>Or?_IZ;jzfnvXk4g@6K^# z(AMDh2F63pBths?k#@Clzg%quEG}<8U$Y*?)NJza@_lK>oGgPjkpEPP$$x;%FY{aO zZm)Z5E@_EjCr`)6#9ijgbiL*!?sCD{c$G7AWc`!N!MudHx`ui1&w}fz(hiQ;t%={S zn#pP(`xK#}RGs&uibb5^(}lXyVTVker$6{HGtrBEXOhBO@%Ppog=EQ zxV|abpVLf6{ap8r&e!1bA1JWjAIg8f2yzuDxl3n=bWbWXl;dHi0Z(LklrEtB4UYwz#h6p2 zFv0D;-_MwguZExV9JBh7ubz_Uc=O8yv~j?2R^sSKA&vujV*jzB=LT=mDq3V?CcS;``HdG-bqX-gGumK&pG+#mBFPzm;!>hX&C*_nUG;MNkFS_!M+_$@34RATsWU@4drZdGy^ZrWj-ms zXkcB9l9ksAguzwT0CjHd#HGw_;>fK$ZcRKDkO{L$uO3#0b}S@)sHy`qsM~aKZU!;) zMGr20+`9r-O>@EX14B}5!k>@!EKw^KLKmv)P<4}jF|mI(lRkLwVVjaZMXIQYOZ%)p z_gl=lS(&L5 zqBPavL`+i1DqE{cdKs(wY!PbAkzyH9E>x&gkMC>ed5{13LfPp8QZfY+PD8yf{;NHNstwn;INprzbE)QHfZMiSOX@gwqv8e zw?D)>#-nhEb#iLx^5bJy_~GFuyeYT@D!5_@gm)CWWQhp$LGR&kkl zibowD7LsdP9-g6oH41&?l7j)wdAcv7e0JP=u}%5bza30hNH0&);r2-jMKtC^L`2}U zS%=jUv;8W=+iC?PM7!)TR=;u#u9G^HOV0a!uQ_T;9d6Joa>v7oGjNf~3LW2Jx;&i2#79BmzGo?Sg4f z{Bwa=m-)@vbJ*ANO$#?e?~OS4rto2=8=yG&^Y~5)dJxu_#_8 zip0|6zIk&Ld;@XJ#XO1ClS#HRl4?4v*QilWE3jejaas}mlgw5vT@~l6L)6=uO!ww{ zAj%J*4KP&vbD1_UdZ*IXQwubLJMU*u_kQ+RLT0D%#rFiMBPw!^O;Hk2>`)p7--G`M zPl){1L~7ZBr9>HK8*eQBj22?iovKwlI+H&OkfnNNXjn)`Iq5-Xw=J~t%?VJAq~>Jk zj1ks*Z2DNSKAO@?h5-Upazg4@z!?o+=X}IXh1mXfEi0|cC}X{*Zs1`NC+we9QWZ2) zVY{x#G%Y6KxRlq5Y#4vAENBm4dcY{p{+jHNOyc1fl8PV7Gsab?eq!s$St@Ip@C2?N z-ZeYv%tz>wzM2;=^r9Ub%)G4EjqP~krz0sw<69fW1~Yyd^=)?&2;yj?Q^ zZ!f?D$J6*4F1sD(x#t>VdS_GXA>?bVZ*#E-g;fwrqVC_WyvP1oyPJa=`ZQn=L+kym zr9{`Z!xx7Dm*{w+4-0VpLd%wMn$x0lLAAVnlVGd(9T@|CT3W5dslZ$s*(HndW znjEWO&Nlw_PjwwMNQQaD@u&*flo}_i3jL1G_>xBpXSo(CsR594M*L%b>pv=#+jupm z*!oI-c2q&SoQHDTYh#)mphaK2L|dNZzXEsMZP)!O!0J>*yX9Q7iTK$4&?31?t6R&u zHng^(@w{~hxDaYF@W=`Ec-{xOS`|O1n6C1ztNl7mqB9ywW-=O2MFQwhOhyT;u6wdb zn41RD)3%HE%^J;J;nX*DS+4J@Q#GR2AJQ`@-{UkP-x`fQ*2=GZ9G^BBNv*w7>vV@! zsD~H)NAv-c@BfWM)of|2yFaB-CV7NXmXiTrj;6H^J+gevIAf&P0`9AX4DUXP7rL`Z zKhS?x)Y`!_VMj&A2;?l%dEX{?wV-+1Q-Gnw|L=6Db=idsAoBC<4};G+kl&3xrs2iV zEJ|Uj9&faZM&VW%4O2lFO2wk96~J6Rq`@nprklYix%MbdY>4%b11KG2^2fRZh+!L% z$GyJRuS0LAFxX~08m|)}i&N@CFAV1gI0S;?8=2@_eIA$6JA%{gvG{Y~Zx1BT4@LrM zeu8gvQKtZHh9+`bF~}d3U6f>qjx?44h0NVI67%^2;-Ap&v#p2%W29$X;YyR7)XT~{ zQw`Twk>@$CN@bf|m$vse(d}mIKb36f&gSJ+y&#Z|gkedcgqNs;g|n@9y$ms27PyYI zO8UvjyH&YyHL7!hLTZw%mOE8LtV=aHiO%!(Q>-FhhxaP zRU2G)nI`f$t=9pHo}>lF4WUK2h}}VDe7_R{De2^w_-|U+ImDOL4EBHbWC{S+dY04l zAfQ89LzBZhW_P<9USqB)3^U{cr7dpgvLDjN%SWjb60K}4j~4NF!f~_cix#Q8fy<0l zQ|~2!`c-er)!lE*3sZ_Exg5uT0V!Yvl!B4YXUIbJuEXA^;o?v!!-ra1hiS7w@)SQa zAyd=i%c~DI*Hb1n%p+68_Ey*L#Fj($e<`IQ3jxv45XjTdn9bU0zF zOm}E;!*YhVyY`{_)}cT$q2z}!mg(X@G6cz(OIN(P2^%e*>pIz5-K@(mA1$8WeRAWb zO{i-Q*U=<1w=suIv{hB~|2*rON@fd;=S<+-gJKkPCD+w+xc7BmFqQlOSEzDJTM-D# zj?LxH$B&QTR+`I5KxM}y8kaT~HYoRyP?9&vH!s-T`s8}`W;)&s^$uQQ&4k5d6xO!s zoxBJuTAHcD(JtFxgOAmU6x+CjUXB?=Uy8fEw9fU+3g`d8%X~=49(FoS~QV4g#v2 z5z6yp*<5F+bqX=-g84`SVHZ&{0Ev}e+i5y0WW>(O*MPZjdm7@omQw5iq8SI_c$vSm zNS3xm+bF-?J-HQFo^Bu(=IX9JJEO6tvc(nQ`v9w^W&iOtX|P+cB7jaB55a`n?xOvc zq%+OG1RyF+u$0Id*kKMIoVdXt6YjlwdiY4^Y-IAe-=hw78%E!B=Z`(>4UckZ4q!4- z!>uQ2+*VgZ z=QqoI{c-J3OVKa;@qp5C{YFY^jL9_iiUqmZwF;~j7NT=Hzn0L=KIILyhDmI?r_*6? zWAOM+2H08(h$WNS2IqI-%p9%AeU)ddx^MO+(|Z}Y)#Cw0OQ;D${1K3zoANHI5x!>8 zwCd3Fy?TLm$;>cf*Mxi1jC+&QY~f8~Hk!=WuHx-Nk^htYb~T{wgZVeJefE5MC{Pfr ze>OQ52&|EPf@l8CIfgRQWw;Z0c&u1siz5n*6HTNvjT|2c0XhgDo504)R1N*6vs_j| zfc}ca+G+PzxN5R(<+OK8+XEE=X6e3)Q{0y-Bt4)4K1KuLQy35}YFu<_Hmvubb*ibc zJC)+&uFoKlxLU%%UM)kiKA&0G$oDkf7a5W_@RnQg^QHTz!*7ZPmxjgU+w(?INP#uW zSRkEwz-wyc{Q|x|iyROJpR>c`Xy4f0zeG~!_kR&dI*n=!-q(PG5J!6XK;Qp?BvBsV z^ZmDuGz0-95Sw0(Sp&S%Css;Bx+*Gw=HifDEH${PdeiRn)?mzw_Q4bOh*}9X5RG)L zMkDRk-+JMmFh76v7EX~*)9Q1fo#UC}-jB3(5nKQfJH9`?k#Jm^W$@VBxoCU!x+pv> z$y!2k;P)#ErgvYt+QojodPkgrlRzkH(`KcxZDp%c78k@CNrk6c&wWMx<^7*2pOPH; zO_Ix)6Q$VM;kd_5G{!(goRQ=~qj5l^`xY^;-fwHoBE)iBPOk0xJGf2P%)cW8Pyv9C zSMZ|E&!Xm7)vW!E?pi$IK<;4Tdt)*6Xquo?|(sD=g;xZ)2Pk$OVz z2WR%c;Y?&>-k4LN>M0!^-L$YWEgrq;V(lCmVNP9nB3{W3MKeLenlNeb5TL3bsdGk{ z-re3fXbU&~rZ#5#okow?*ki-taBV95s~IlE7Lal1kv;Hn?zTVVY25egzg!-ERneEf z6C?%CUpHE7`?t~~(1IptZ3+aG_}f2Z8h0@N zdHZyhg+R!B7wkzo)2Ffq)pGxA3``&A?VGTs?&QoaSxHkqVr79UEJBA)7GS0K&{xIB ze_5OIB?Vjfq5T+H`{&OTSele?9{b#rCc3%B4@uCC7W*~tC#dlm0f}Xk1Qd(aXkB*u zsZ_QpF-}5IMCPq9G==``%b_?1oF*^lu*ViOnmiee=X=6R^Lb0oWDZOLI>g4^6S&Lm zko__Z?v~#Xo6Ljhp^I%)O_DIQ=9AM|@UgC<72r5lsN6KxXaBGu<@}2TatH-i%G7MQ z1iu959?;tuYIdTjEEloN1Xad=NC-GvXhJ@%RhCZ*ys%9mW=U1;0aY!TMrP#2>1X*- z)Ts)V&X=nS0{Or1VW7q&ip_*uJt8VI_T=eMEXhw(n}#qcP#%uW7jcz-X*d{7BBlaz zT>i$%3pjVhn!I2c;&lM-M$)1ueGvpo<&ed|-J$LXD6WuCvd&r@r$QF`rq)3I1WfNE zlY!X1k}<|3N{0z#kU0>F{;IJR#W?d2x%kLD>qUY=+THg=EXoN-6BfJH*<+GFDh(U& zww(=s0lgp<5qV7>r5E~74--V6wV0j_t~H^P2Kuc`2Ge}004&q86sLBty7N{@duBPTWC8wmr6#=xT{b>%X!}`!I+&=lB5!D zB|R%Qy5TNri~Vh;z5tPvs`iR9|LE%~^3uySjp&i>25%LMOVe$YqAk|W5^67F;ZR%z zf%y=nMx5Y}UyjGbZs}`#L%HGhEMU%!f3{NDv5MNvU0GeI=dR3bgzHa-9#r*&Myo~2 zv_WDB#+R5MPo-uJV&V-_Q((K=@-PqxK2{Cq1jmH+c%_H5dqC$~-@+hVGBH8gQi0X8 z#Oi{ibBFV9+URb;?3f8bdH5a9O6~s)ODOd~sr*62jQEMQk_zFuvDMX`UA1Ht%P?yK z@j?Nl-k^At+0g3bu|<322JV#zaf?jzXR zY~T9MUUk-7TnMSg;)9TnF&(+P-OEoN>QiT93vHG;Sukc_t@|DQ*{S0d6GG691ENiZ ze7!cA&symnQ94;K1Q}W`XOjYZy~n(2^9~x3z`>I#e;25IPRDbE_Iuiq1%9hHl4JD7 zLNZ}=9eTQ6Z@o{-0oCedmn*X4^a$@vFwa`Uzag60m}At{ z5h4mxyOKwa(QCr8j#$92n0*{A`#NDMc*+p)fEH=%MC@gAuJ~sU%;Q2vLQ=m}94AOT zKdtwo+8RHq$&)}{(5i(a`wT*`vW7n!d&Q5?B-yqgeFv>o?W&(9=%}JSOslLFBAy{{ z26Qa?6rMmD99B8>OTJYPa^kyy$H`Yt#R$Rv}%SZS|?gzmiDLcKp|2k7k z`@hKa5E$_ilCflSOE>9KrVMqqCvA-^9vY)iCXP>}nir`!v28EHd$m;JT@j6ep6G>k z2F8Dw(Q~*q%Tu{ic6i)&|OHV98FfCC|;}xl(woXpOl3!=_2; zPk_+w*vqnf-q_Ii?(nseljwR^2@E2k*C7d$Mjf6VjY5_b#xA3nbR!wCm?jHfGIEA{ zaUF?p8hO9y;0%E^Q@z85Y~2MD^-{tzEcK$YR2GsBj;hJ!9&6yXxJe(8;@A%f#OpL`6cH19b4dSd$rklEnu=d#an~t z#_PEA$l-eB>wR6Vh8kg>Z=#(~EF=KU*eO7)7UTwmz9m7`9S4zh8E>S|E}`!c=InOO ztxN=bo#{h1!7OMUR9D0kFh+t54mYesMVJRn8~)go79!*U-c>WhGS99+B=Iyw<0IPi z01%Oa2h>_*uok47-0ShUNs)qrqM5?JLrh#QH}5scTj*p4X_}MR&37U-mAXYSk1`dx3dM z1F+n2ao^Ir4YOSX>Ff*IAzVGk@de40)sa~f=Ov;|I8L?!|8Nv~vb@1C1FGQ8Eer?Q z!^+Ite_I28dK6Q_=m??y>L%3>=8NH2RYbh za%9P;TC~l1lp+?ZF-G#H{|@`xQAU=`1)48pMNu%t8rkU$!lD;B1h#tdtzREFwgNh5 zhLmiqi4$C_R&BNpZj>39I1lXLwr5edW%nP=A^Rjb{C9(W;DHqgwXo;u^bDI-r~VH@S2I8ti@>vM4kT0)h-V z+?r8kaIJ}F+(WoKkT1IXR^Wp}>JhHM#^|wGe;;TwTW{NQ5a@#g8yj0-b8H`>C?!r7 z?hrn3T!Swr1OlSgPsy>o*2m)SRW&Czwn)&_NF@HDVw}jT0wN1i5h|!G6lcoh@w{vu z0*al710}$1RdPolJ2;W|PtG_6M|_i~#*=%@vw6utPYiq)42)j~UNk(=D@edUipo0a zF@Q*-sm)8NI064vR1}YNHk(MJW`c5f`Uazo{F`2UP~(;l{gwncdf>|Y{Wb`f?f_S% zre}={DTPPv{2vi!CQG;yQ`eZb;kO0M-xl^bu^L^mRk&? zDNGzst$+@|780KPY#*tyPgL~;>cwSLLK~X-ZMkEn4=OHHVl z(l_BQ3skDw(++EBPhcU+6U%u0=z!%*ML=R&@Q*FTteh>v_g6&tteD7ASmGd3ekdi_ z3U*GS9cp++@Xj2~NLS>g ztUs#-h-9P9{P6Z_v%jZdjFvAEeO?`X%#cRbo&lrN-6P|XAxwoZS#7@f)q7v65_>up zCyf}rOSX+&KAJs-D6Rr@CJIiH4+yt2O6cP*q<&-a!99F7@O0zE8uJs?HP*y(bonk% zzvIn0d!&2>BsN<_(o>y5;q|DKU%s;0gC7 z_Oq4N&%Hh>wymDK1c0vn4`Aq0gnKLr1hU)?2(>*uS4Rdla<$B=@0{XO^MAt*omN4; zq(KfY?7_-Lv#nh0pv}&cdo6O8@g(#Wvay87DY${>Vr{RYb(b+AjOzago6ctK2OFoC+F+m&{Z&sRt;bf)3uBfMR3EdZ+hBjwH!IGZ5HWqDbzQj4 z#9E-Xw^blfm&0PxQ?NYbVPe;d)k4Ss(IYsfmpiqU_rySAjbX$z#4RO7uMgvTB*|Sk zl@<8$1^wHhXv^XsqLJkwM?MpL9xaz}rZl#P>ZyF2!0}&nsw06vs`x6{#*S8qh5M*$ zC1Ae)}K(1>8A`JYv`h?7}LL(oD(kBw0gc?LDd&}d^q4>;&%tfWZ4Gf_`~H&oquNpGM= ztd=(PZHQelG4!zhNPPi&p5oLwn{6#zDoe>NTP3+C6PwP&K$Ldh z#^T!?M21dv^%l!G)xTsV9oiEcIb$euXE7LmcI%(`Wr<@xf&b(epxNjl z$q&N`Q0s!?+gUUZhEdPAXlOWVS0Svb@+Re*tp30+m{(jW4-M4ROXqxhe0&;$fJo;n zVQU=fv$#E6nzLX5J~qa%<52t87Mf|-hxYlKUK9ix-L+rPNUIUh&1Ws$SiCr7_yD}+Me6_2QnFyz)%xhk&rD%3W zWVT=`>h>Y0X zdc?5HoW%deg)T(vlt>oC0ntnHJ6L-BCnMcD&?=txOgs*|LL{LepUeV!2C2Wc?SGl3 zZ(77MhUkATvk!M##ti;>eO;F!o6tv1ewQ{WPPVhIvQxoFrXrSu8T$2a9P) z*+J|j34KtE)fO%xlBiqooufYE7qBQ=R$_-YL7!oX)O-Iou+Rv{(fd9_t9Ue+N%H_O~Z<)e2)xD!}*h zAn~J8OYjGCd!~nhzfJQP4@!HCdiba3d~=O%Y+@rt@vy}3 zb>m|=nI_HWBbt#3VB2@h8u*c#F}s*WpE2#022f&}_JckA0+|I7+iB2bfvjfwmrjYJ zBtwQS5)-J@Sp0+w;KGL>6^6%MY5uZUSnvIbp~80Y&FP3PpxEMMLo$Pf9>&@>YA&-Z zz*A!LKJ5X;@C%|N-XCpI-M0_1_Wld_%Um_w`!~5x^=Fwg%(>UDmC@H8AXfJ1Ei9rt zh_`j3qY%n1y=eTj_n4^1URZ245s6#~y`D#Q#1lt6@sl@>z0IC&F!S|(PMYC${!ea7 zN9x~{ZO@?hYJ25>WFZSQ3bh@3RZCJEZQS0fUB#gig1S=sSz40=*kvP0nd6Py{|~s3 zRrLSBh3+)|{s$M@+iq_}{TCNXDFVC2#EZX`!M3Q%74@zGMcsf_&l-tIG{A@oz(WzX zsGpKI9oyO%IFRQ$B}IPi+UOY&@GaLc9W$$P4sBBs3lv`3L+sqcA-K8jGI;$_53}df z0+T~0%{JWLB%C~HOcWZiI+u5PK?H#pa2Zpq?@+D;-l5g!wYxb7p-TG;?aV4Lb11bu zBDeFLKefqw+1e$`jUqovgPKrn4FC z&)2;TSjMs>w4o#NcJ^qF??Uk{llyW_I#h%>C0z+F#NSKRy#*(WBF^Q>{ujaJ>!^pp zA|0_Ap5`}7%3(ydz1)MW7D{fm zDnb`3W5b50)tq%j?xgtJo!$a^r&|Wte||e|luK@!U5zxHFsxU-9ozfLBFvpQp5__z zSchT!^<3_f=Mz|s#|WBKC)aFXFG5BSzx@i5%MkckS^gxe6DKBt*oNSc`2{81)4v+L z>YSY2G9WuZH)&{J6IV6*ryZ6+S~mldSC3D%u7^WlejP3_XL6Axp{J|)nlsMFc{-Jc z_htFk?3cILGXe?%<`$!&Pz95_52LLgP&l#TZW6U*X~Nc)Nn(|tWmQ5@KoH;@?JK_f z_s)YwqfxN%pt%yQOWoCI4HH(=W|b;lCFFYk6-+v_AjwIdr1E}5%X~$xLL}Nm*i94? z0WBai6hdSE?=5Mb=Pq>qD23#fWJ@L5@-kO;T?10O9}8Sn56W;M5m_DE3cnFcwv5n8 z*W13EHlxss_1nqP@!uZP_^Zf8aq0@`$cpKFk=wT*C$zcIwE6oN>xlq$N|E|pEa)8gb; zv=Td_$_h^6Ef9nJueJ>8Wm6~ijzj`xv@%>BW!y=Hi~C`9 zxUXBY7gi5D!;9TmX*Fe>nR+!V0a^4B7Gm4K>Ddm-vBV^lanVz?KV67`$Gji#PflHb zd1bnh%gww)V{5p9UhFcHQv#`5B4hDX3ZEy|3T+`gT3R^5i~sAVVtX&FgzL{fTS60^ z=*;`u;*`8?pUCbwISZNahL0-6fu!TaT{sHIQqAn@lbFFV)|VaF6L^qPQ8on}Vujex z4V?q|oT|XkKOKlW5X8Lt#0gM9Ib@!HYdfM#FOeGox2`UAy%K*E{VS|c_EoI1rr_T6 z3e=|pFDGztdYhU%p`&$aMZs=n(cDsVH)Y#BqmF&JA7f9LRtOk{oZZ)-xd%ZJBCuUDi!x zl3FTSjt5m3&55Iy195p>hX!R!dgK!F=%S);G<9d}3Ol5t`>800{{GO1 zD~E71!!;vAr;aDofogik552sE}EufS(Z=^79P=#(j{bg+Ri?4e^}ej0+qAY}kDc;5~m}#I5^*342~XuPa=zagK7nW^aQk ziQ>bzGcKXt8P3in+X|2e-|Mw(djWa(u>AUE^>8CI-@-vZU6akf+47|S*-is}HV1X! z^V<6U1;>N<{8jd!JbZLFLuJ?d$ma*B3blsk-SSW?1CeUFwQ@9zK{Lj_07J_jp*;GB;w2q5QVv!Slb|K<98+N|0FblcY+J^t)Y*-_6}1pxg%uavg&7K@(CY(s)-kzZcuKQ|+f$%QnX%)LG zRI}3M2Hp=Z$k&F`miK`P49WX_Pybqg!i@@EE)($FW=f?G7V^y45svS2rqSK?J*6&+ z0{)S{CdY36!Io?oRh=6NrReo`0_zW{E!e_JaaZpx>hVa#a~wm<(lWMe44cugLvb9E_(Z+U4f zc?qNzZLroRn^vzk8s}l7GLbWs96QcapV~m9xIi#+1JaB*vPk(inZ-G z`T*Fs7Y)FY*U)>)=KFQt_VBFlz^rdMcm|uZ2O8O8SI?_ZIhB-`f4Bzq8A-no6?zU% z3V}RBSW?mYf0AWH4*0jhhH=nHklv7mlh?;i62>=lG;V--f?OPOw5hyr~)jcvxJ*OL1^`bP*7!g}H8T-qPxH zlOjZi!NsQ5h9#ep-SiTa&jwF}Vc=KS_(+z?@I1$P{sIq^|2{};zj=Ktq1c4kX1>;7 zyLwrjzv@%rWXx&VpUSD5(Gt;OYP~FBMI@UurC@f2Z*zW4dw)3}GHq6~rQeB}n?)oA z+l$$~e#098Nz8ksLBN6W{RV8MqMPDZB2o2$hvnFnQLx1*P|zL3v7dOC<&xi+!cqE! zwkeze`Xf9t(Xae9w4aT)qtQbMDZr#bHF_Ey?Ji&9&way%lu$CZQ3BKQEiZ0>hK*wU zQ04HYLMhQmkyG8%h0-rluAvJm6jDt!`5w(;Rs@0P$t8q+vCZr4V8&;PWAiK9k;z&w zBk!yJTmxP+uRFWTjG7H4egTt^jC;)9B_y_LKyx{P^6P5(nw_uvO1GuiWlxgvC_7WL z^MN-CB(C}ERMhn=FQ6l7@MlNIQue3&Z}&6-wWF2k?-`j@_sQP6P^VxyOLe-E_W;bk zBN%$#T1F-sWnR;^;=t=38m? zAH8%Usl<3WPWH2sFoZyE2f^k*#ud0HU83*!L;j2&OHjHCIJ8 zAJw3f-zD#gqBcg_mDmhZUv7`{yzNjR6u@617iO%3C^qb)*ntglDg41c8*e6;y%_on`PY)`7LEz4`V@%%+UZWbfsgle z$-e;^+4a!!bb}=ZOy{M9{g?*n>LG|wViEQ z=^jK0@29qa##KvfHXC0D4DwJ~E8v#}m0Gs~WXG;$AU@OsbS?98MGWi2zT{z%B~p>A z0EI80esrZM4 z&tIsg-~(xHx!GTn{TPpYBZ% z5$1_*Ag(nG0UkR7Jt3p+0*tk%m$B(jZ65EeGSJ50Ci_@6fQaz9xRUw9y~4d7{i5PW z>|_b-l$`PY3-3i=GycvY_Id*ITx>})>C@DFk`a<!)L#mqzB1ag%B=V=3bge+vq0KDfGAR5OvweO&^L3^g z8|?r-`XTFK?-9`=h0A>I562QN(TT=BXRkHq7?um9^!rXtpU|3;RfL(HO>Blz&)b^p1~O*f8Q2Wu z&gna4u#+uD%w(o+tlUA0nQ7Y@nw<}w*lCva+AOrr;NK8qrh}k8nh#B9D7!U*nhq%T zDu|))h@jclFRTUEZyJ8F75=+z`h1=XHT^K^Gwc4`Z*Itn)D+_kaP5Focb34l z15Sz_dCj;QHvu zAWFp&xc!6mh6l*uUp6Ff&j!e?s%VM81*&p{dGBTA;Sw&5o{WE=u6cd{?l2;@8;e&@ z|4XfnRXSe(BE(-NU4ie4R68c8mrG$S97C!tY5z)nZa^)hNkGq((ceLbTF86`DPAAE zl@?ctkL1M$V^IkTs?(q={hBM@GL47|C9v*ZRAMB2hnGGtdLS zOZH`^3l(_kzd|p)BO1ua4-b(Lc_mvt4f4KxVLO~usO8njo0=ooN_94nw253>=ZbST zBUI(+L1}CDKs&hw4MsF-_p;KH5eG?YN5}8{_89XkUiw9jrHtpHLCpms3xt_y8_8NF zdO8(&l`OQc6u~mHEX8n1mfNizduk~(;)!j#`(b3`P$F%l)0<1D-dcaXL#0qzw55~I zYE_?E%Otqqy_ev<-k~%_+#YZ>!#?rtIj<5r(8-1|mA8sO2%?~kY#-$ta8;>xT zxin%3u8;*v{qA)dy5G`Ai^^Mfo4tPFc!nE{htr&g1HcZyu1pg$`0>}@Qz_5;Dh%&5 z5O{Pvd@L4!=SY1A`lSHO(Yx>o3qM!e(+M}9#NupO8O(!4xA&Y*|F$b0`&U^O+I&61 zz5Fq{ZZTkof(JZXHK_4kakqg^blKLzV?o@NEUV%(4@9#*7VodE+a*9RtI(wF?S>^+ zUow^!C8C%(f~emPEw7$W6Tpt;2OXz1!kte0ivJVl8S4Gl_?ClYA}25WSwz*7V0R_8 z)VcbGcf_XD@wr1WaWTcL(9%}0q)I#7-|e+M@4_==YKCXCc<;yYEbsBUmz#MD9%sGt zYyjK)#l-F3W}pAh>=(iFt zoG`&U5WwToTNJMW>?RbL3EYxOb>s^J#gn1-zvFYAXY#ChGpu+to|XZO#ah|i-QOE$ z8bScL6&ctWk4z-7@pVm*FmN}qd8fWdcf7Z&K$?yo2Fj~Iu*1gc;2r}($@?ChYQ7Cx z9XE0zqhf!4qSVa_mxx=B9putcg?rbCcNX9Wb5gHOS&f6J{z@kH{w3 z^9T>>0g;m+S|>|3b6_WcZO;wJSG1hx47;KZ)DF&JB$37<`zMN_H5IDh9tvq^U}8cR z@_<^nNo{z+(qYJOE3!vja&2l9vr6rtF8>?=>+!)r(D_ur$f0zgTrf|F&{7H4a+1{z zndSa~SWiQeYZ4vUAQqLH-C`I~J&j2gkPf}&X3?|WyJiK%xO=>fRNy4S3@p|;bp}Zo zt<&IXKU=gZfQUwh6H4$#F^V77C1=aOvL5de*gKz4$~d#!V1Lg|ctOlVl1-v4Hp`lr3a#gsTw4MXOZlWEY zR#s7IT-Q9K>|M_yP#=Xzq5x#NM1TggnjU~$zo42}nQEFU=I5$7Rya)VaVRT`egZ1- z2$PbzEw0TlZ2p#=E((UC=73@3F~uJ|XJ=r~B8dxw-GUqsSR2E^Gv zfo7JY;jxZz!KD7u#8}i)j?8f_IZWuC4W)UaOvE`_a!ME)8x+sD=$3jZ92WHKM?KMRo2Lz$pM#C zhihVROWAZuTDyGIhRxY_U_11e8An6{I>Dbhtg{I*zpZw-%gF>#fv-fOUo#{EfkPvR z0w;?KT_ep*cUxs32}To4VaC?CtbIq&p?h?~GW^#YloDmS-Za5ZOd+)&An1*VGTqQS zRa2IX=}$Bnji+F2zDdUdCTO3%Z!sd?fIx}L8r!2P$OMu*GZBWwu+)v9$*4?&V`=q$ z=ZG~#YtdZP2ZTz9Q~Oj=+ms`4e9bHZH#DLwM%OE-i}Y&v>2^ib+~l9Rf~a z)(6a{NNfH;Fo3Pq6-HLNhgGAlPlXwrCA4rrPTBxq~WAb z8DRo852#&+cZ zUbf1CwDJ~zN=PLx9tNB2RbQ2JK^ARF&ZnJ1$Y{zF(HD5-S{L5QlB8MR2Gd~ zAzBT#P@GyC^_Pt~ zj>5FInTUP1V|W0JWzHxSr_$sKZpm!jPGM~2h3ZiC*n+j(L}gNF_kU7)lqJga2D?)U z(h`C5a*13ZBDxboOa$n4BsnTe3nEWtmZj3BXCtMosL zpg@x*hes(4UT*2`NeFoP{=daI?`rbvt|{kVd?E=_BpmF5$HyqSig_3g z`IyXy8~7E}ab?Zw45BC(TtFx2Pr9gqVqTsRB>`f6)E)34Fek0h*=wdhR5YL{_yND@ zjl2c&DATr^D!~~pjc{5;Rj(dxzz5Le>aAiO%U?|KTXz{JmN(D@iNX|WxwBE}hT~&* z7H!&ujS?_&BU?ouy#0A;oUPBQRGTjcT1S*cutfiY^gFXYE0p5g@;{1z9*lWa!g9GI zoTW9z#KY&K)Qf_xdz(H-pvS?GP|31R?ZH(gDm5e{IM)!#Jm(|r@@rGz92C`j!LI65 zIRNYXFUsXURAhFX$8;{8S%5xS@QAxYNaIuS%SPtQn*zQ9mm^ZA3%#5r92(}&ietWA#2uQlL1Bg(o{$g}I&-a@=hL~o^8)aWVS>HDf*%D#s|^qI0ANA}bJNYNG#X31)nuvNn zN*hb<3WYInS!u!qzp%B>g=&{O!3w!`@FDJ9>7l>~I;bF?j_9VR9-8AZ2&;GlrUOag z(Bvq}&y<#3w^Wr_SH0L8`zBg&t&wQ^Bx*jzi3AJrs3ptJYa^!xBw;eDfZo!^HYjMN z`li+6rq<@yr!ag9q~K})wQ z93@P$&q=EvfjMQLlxiZ(;vR(@7J^OHq{&wNYQdSF+rW8@HBps33VeGGOvjZF4eO=p zNF9RP5~)h7I9u^TdCSM0x1xBHTw22OQ+%&3jce-nOJ57qLSj0}*MDAn$iK~m|H-H1 z63;z6hxS&cce$uxK(PP!=bz!YxQGsB<)xs`->6CPk;t-7;G}ZBuIBsK&v$oqf?9Gv z|7)fTfDKt6tsQJ0~+oxl()te8-Mg9)Rj_!(Qd% zIP#=UK?se0_)EwF6HkAZjH##-fg9x*y7Rr_>bjY%@UJLKG52Wue(6DaYA#d9Sfsqz zI;>$m+n0bYjSFo|YW;$hgCr!(7O@yT64~5xQvqQlU9SfUvSdjC?gWEqa-&48UY*K- zPDV9aLx#p5tVxSaT;!Y$B{=9=m9;6mQS9t$u2%@{6=n@--mxW>Ms%+`Q7Y!eskkZ| zlvMxPe7>{1_Y`@+A{juUE8Tu#Q^<3$$Vj|A1QGPS{`J>m!e}<3#cPBz(}ZS;p}?nl ze!I`7vb1Gmy~s{^;jQlWI-`XcX$xeN0 z@|TTxg&qOCZ#RRL<$M(kck8nG@wa)rqZbQ&tz6`!c1qN z*=IMKtu^exw6r?9?$2XB z9=BE=Ue1dwY*F{w=>4u{{i{gS^hpsL&amGFN#i%Q)qJd_J z&t#CxEa%|z_|fs8@B)sMjrg8A&P}0sR-{o$!#L@b3S+QZ3GImTpxZ_P{)sT~kfhx% z^!>V+^fE!@I8c1;Ta3si{B40_&N9h1fp6Q56WbK%Wh;$+QSr&zFlS?afq{Ql3+G^f6CX!l0Xseg)La#^K>D9s7&H7!3 z;{~G0t}?gLSx6qdO5QrF3b*zcy0gb6^K|*&V=BtduXFjZWjt%Wm%3!cys)v|j8w>W) z#Rm^r2v-3Ic|sf=mhKpGvJ@9NGsBTjrhQD=w^x4{y-iQ>Tf;4xq!UIUrQ*h#NtkgG zv~9OGGGin+oR=>rL~bRggCZy{$-(O3dBHLyzk)m(F3TghutBbPBTy#-I>MTJ(~&LY zEBEhTKaiL~`mZ#I)HU0Q+8WykI%4~Q38AyS&xi!nWLoTzL2)t2Sdw7o<)nJUI7pI^ z$wZ43f$7B~=~eX+?b9l-_PFXq%QksypI8yo>@mzlHnfwLdPk-yI&L3L!H8YFnzu%I zL`|EPLmc-z8ZRs0QKX3NC<`7d{Fp8@6Wcb5l1`Qk+QWm=(oJN;xF#6gP^uu#(Tn2N z+QsG9G6=Xt%2q3z>DwuOnE@hynZ2zf@j}E1(?J@1soEi{n`grA*%@sWn_<)}N<(xS ziLX@Xw$6wnA*u~XPLw? z=*VK8GuxOR>Ns5Khi#ru1d75p5UR!ooO%J`gH>n5R5Oe)2E!%&#jR)~1f)Z4 zK$PwRzg*_joW`3(sE03iQ2`&xnh;1P3CU0+e;&g2*4ueIo#o|rGlT1C^fo`8$>o0T zdz*xIG;#Bv-x^R+GC4hgksvIaXsiMBojsZ)93w7Nd}gPEkSrDiN|c2gm=Yh667b*) zvk2<8iIu~Cx`pAgkP8~O<&`dqfKWxc<^0r3^uQVikAR}#9E&2QZr4$333lEzZQa;{ zNxNq13n5dg_hGZH3?FHP&goyD0iJICyW+$sWbcX5-OkP8x5ZfWukF@<(eA^0ZTG*igb*{=31?GB8Z)12SK*Kg;^_3)ubpS6H}SPtmXt^{5Z5= za8THtLvNcdCjwqQ^l&$>_^0z5QBbZk=mlurUP+gta}E+vQNgB3VN_glo^c6*8i5Je zzE0QdK74%FXCkw*`KZ@vuE(<;GLgmSCX%8`CCFngd1!zN;j@vDy zF|3~%g#?Hr?h+Z@+LCT)(IeWN4!nEgA9SlHD?=kJG;Irdp4bK}Bvknnt$JIz_Mk<6 z11_MfaxEw6Eg94u7`^AU!IvPl7-&fq5LYt&+i^JK!nVsER^5uM76Tg}G*I(|RYN7U z(s0{H6ylv?iZa-pQym)iVIQC1Y($r5Ddm_g5i2SlK05F9TpPK|u$kCBb(n(|oL_c; zk`D~+phqG99UWS9PldbUT?C8Q0M{JVO$txl! zv&co2q8wT;!LPmazed~gX`VdYrSyM_eduAh|^>LNHB^GJfu#Oaqs@+xK1<|lEy+UIl8M$IBE zcT!J;5#_Qm%_fRg1zs9Atij`J8p=qiz`9E&HwXtg0*N3BG_(+iD$%qNopyXICMhrx zOCmzr5g`erPw<-$f^kGwW?k%L77L~x8^g88Fy+Mm^kCEJ?nXi;V4;%W?+9U4ky zs`vH$_}VHRNLJMeR8U|uBJE0!QXiRs6=JNxtPBf{w-|-t*tiT=wdU}FE>Kyil@LeMZzt*oPTym}C*sN;NsEQkR&u^PZ4^pD9S<)7~{z%5dc zE_1ErpS~r@j3icxYWd<+oT;Xf+_wv=e@A@jY?#A9ASk7k^3$?c8{PqI=-I2ewXmze z#AaFCRd&Y=!`tn_brK;skUV5K6ZzT81U3FCkWPc9nmiCc)7VOS?p#>E8le!vTwt+S zdtrhL@%3K?QAVjKCv}{a_!OEPYw=jS?4}LT_9;YQMf3Q7{6x$O$HHMk_nV=G`l!A3 zAx+<`6hr}77^EH#7<2@P%^2=yg|!YSaVYO6BH=!jOX>vjDu@ z!(#(PdXu6uXnjK^gY7&N+ZheX*u)>2W@f^Pe}GDW7wW-vCho#w%UQT}N$!$>%-IfC zaa+&))@Ug*{;Ue@7E9S)5@Wo{0g2?t0HEUd-bP9l2k*Dp*6C^}6e>t?w=-yG}TG%QxJyEvsFTlH; zSREilN?O~Ad!c&p7+Oo}h-3ZHY0Jo_Lt4aXqiE)nFksi z2=1QtOfMRMXEgdl@;-iLf9+{JXH9%A zJ9}k8p}A@xBW{EqBtaGz7gCERU9!7k9mtAXCZiOZoW)?E#y^z2vxOJ1I>q`f$$ZP(k9VtXPhT3BZ@i?8b zwg(3#zDZ>56gW}XL3gas!jJJX`YUD;G-5&s#@3#Z`)2CH;Z63al zE3gqdArwbBVBja2nV0U7G03smYW;D2^$Hq&IomaVi6y z|6s>s09~1(Cu{*sDbi$MU*G;hOAPOiFmh$zivtm&3B1Mh1U3KQ@UZ{Cm3U~c7bXyY zS2$Q`_@83p!W8{a*_moaPv3W>&CQ5N9ervR#`%=2jNIwR2N-#C?^|GCWDfK8-WvZr zcg|e#jz`!tH);x&$iy+k8+>W<1WG7#&@-x(>3HQc&fo*DvOvG=6WP+mLEo;4)9uN! z0R8TBnclnX9Cbm2$Va)fBTG!@xFQVE4m`jI;{eL_76Od@&*Mmbj*c%#?A4Z^!T4Vb zi>-N%v6$d<2*!7B2_>%%t=3p29K5RkqaRxMpMD7RS3jh0Pk*Uq)d$)FrpWVEwM(sr zz9u3J+)@_stOD=CpDOiyu<+|?vKbl$w<>B;W1g{wW6^F#w}KOx$qHWc+f~?8YM!%K zCYghV^qHep@C7~CJ8Q}CyPMDO@j z7TJw<-FepZ<0rO?>pdEjO*ofBMKdU+2M~~k4G;m?B5NA62<2MzUfX+73}@uvoXf(( zKO2T6=E}gokbQ6pUlhg|1JeSiZz!Nc5(lNA?7Z%wn2LDtQ=*kYxuPlgbb;vUC%`+Y z6aGf@r1yz%ad9o!?cC<-e9uu~9QeT`Kr=xiB}QS(l{KBfP^T-t?P)Z;qG3Aj^E7ogjysNy*5QEhoip!IP>hi zQjF{JM9yt7FXV@UPWD<1h>0H~&lZ|9`ia3naWC(1-SRc@Pw;*kzxUL>~GC7(C^v zd_QnEh4I#tdcZyZOt~XIq%0j<=j0+Zd_s=-pzSw_!7H=Uwjiqu#44aXFMd+=3IiMY zrR<#K)(^G{QEtVKJ7mm@AYtG0^oVP?MRQEb86(!{R4*9rJIei9)$dvOP7sEB{GBm# zwEV{Xh));yZkKs|s2tc~bPGb@=GQ~DPpVUXz(U!N>mq-6jzKwd1T~$+olxK5A4+5A}!lm?wj=>Wb3-ky?$?M+O;xMAVWO|RN2FkjUGe@ioY<(9u zRc*#gV`Si?#KxB&*bwdNP%qa9a;s!}238X2sIG!&20?Alv*tdQBo;NmokABVAI}iG|8Jf=4<@le**E<7s z2cAe>q$H=e(ev^cb&ImK2~1}^3t$q|Pf#!DN!0_&)JB0J-uDN&dx0W=pblrmYSFe^ zKm}NULYI6UUU%g%fHS;!l1;T4=`tB961TQtDJ9o&6LiGO7g7u|&)&!g4=2W@1{r#g z$jd31RqSjF3lA$GXjr*6tYMVeqH`LCJchV~d0?0C#dcC>Wb#5)sxa$cW6c`y?&N z`}t?SYvUw#Re%B{K=GR51VtKpJ&l=kY56PKzKxj7QBUnQHMnfjb`k4KVc^sayPa+iAD504_Sj5UG9daciOAeJ&^T&w`;AT z+11ebA^-eL(@xY3zid_qruuJ*0J)H;UCO)8&+hhj;0=gCtK`LI217(;4+!(WRm`AB zLBgQ)h1rx%*K1NcM4F7l^oD2^oyB{totCq;ttKZm10etvF zn*qU%x9xvS>B%!m0{@ot$)k~nQ_sq1j9Fcf+6#+O>xA52-%ooq8fqDDzG{dwq5stl zgjRf|QHYM^9^2IQ;s5VYy>P0RmtTuNM)eOk2sE)_^fXEu&oYI%O{z=u0$(nYFbls+r znkbIXmN!Bg%$Q0rj9aV@I|b70DY_p4H%u7^NRqBx0;)8am%9i5OfG`a&TtMMDOb6b zzbep>TXrhAt1)xnpQ%8q&K}Qz8Od*v6Zw)a(xc#N)S#kk%B`bgW(d{MSAm#{P`xNk zdEK6xr+d(eeWjc`X}N@UAjE=8lMRlTynp|)^vsj8srL82umr=Le^*m4bp5& zHyE2|thy*g7^ww?9DZy=OUrC5%>`ww)v(v^;g9Y5_)xl2@mHzMaNsHms0`}!R>?5u z8E@gNgd2{~i2a&HeEARiVx1ZhkQWAUH~vaEwA|#c2s1&=19G_3rjn#Eu#zsYe9RW% zZ7mrL(1hh0aiAor!qMXJlXdz3Bw+IbLJ0wJ2z4J{-Aa~Ta+r*6zfdN;-5vps@^^tO zD0|ZjHd#nPV$E8<*G1hvnhNTRD0~%M^}Lh5GQXm-6*c zWNK-D7rR_1x*f|Y{o?y?@^Lr@*8uTJl;;oeR(OlCqZqEOr{JC*z*!k=;#q7Xt*)`vWzV;6H>Y zac(RKR1F0|4DvDG)2-w@&SpgHyrAHdC4dTHt@wUy=ZRFvXu;)WDc}$dAUHbFIs;Lb zyehZC>loVxWUAtdNG=SZMmxoLDOdrGK%6R+41Z!Z;HN2WVsX1@!lv+%u>u%{1ZSv-t3s>jX~H*=I|@|h&&s2@aN&3U#o%5e+#%B z=jgLjy!W$%LXedzcQl`y4i*QyXS&rfEwnV%hZ>$h`Yo1gYra=Go&x_G<|PO+2h_Hz z!Y+>Q^5*={`T76^h3>*efdrX3|E^WED2_o0D(35iRjl17;^3FbED06C$+nYs7VUID9}rHJ*?Hz<<&;-ESazp! zy*iRU>8Hr9b9@c4(E#3#icg~vf>F~UHARzMnkM8M>w@X>NA^Jyh@fL`fmQ5^?k9!= z0#DM!cv4{8@wq0axeSBvxRvNGV;^`-+m9MNB%cOaPsYAG`b%|*UnIaD-wrbI@kfGV z`b9A&X49QeZ_=gYQZ83u>e)tUY^$aJEOtDmn2-XK=6xJbt-Rm|#gAoFbQV_&l?v3o zFiZByQvRs$k(;6R{I?+S>e9&f0$KxYe=!BJQRJmCYP|hH8o?_O4jc?CyGjbBf-~lv zG7I@GM#4W5xB2?Mv+_`WF(_S+QxR>WW~|1;c2t^e$~PtaPuy0T-R^wZvF1q^7w?yPVtUG`)Nl&M5C6FCKg24)N4Oy$r$%aawK0S3aex{Uve&go4pI z{LESG*?cjSES0_irk}0S&pF z3~tt6!forXK7a#)dO|feGp5m^krO<_|3&qk2X2SXsn6}9SG^oOaNdO#S!3gmdJ1cp zW&D_Yof^|{*3jg+&-n+67XH?Hx}eQ>lHDl|KetarrVYf8D}`*Mx=P%kVLQ=xNjo=0 zONN-4i5_4ba}8dlEHlWu(%*}w|CbY{28BH~(i8T=o(`HmCP@nOcv^gsQPlne&N3n~ zKLn^0)+Vx^#2+lzJD5bsgLo|D!F8tzqL`RKd3P#bo6iH@WzIB*H0VJcq7*bYcMwU$ zh%_sVl`JCXk(Gw;+v6HV;iq^Q26Sv@Dh-9HAVB^TtFAIlnenj*0Oh5bCk;Iiccdwvr|7f5)+ARBBSg$Rl_ zx{Fzd*OaVeLuWW=I^67fw|pNLMHTMSjj;4ZZBM`Mm?qM;5!@UC6L!BsAFTZvsjH>0 zr1jB}X83(RGT3eL?^7u8(;gP*Xqs5Dq(h~}p<)cbb>n4`yCNh^@(t9Wlz2MCMo36Q#Y?k(H zgZ50*L&)9j4(M*qY`$F2|9T@HHMmcuBAjzQxKQ8$zQl}mo;TOC$LrKSmRf$NTYVdU zAZecCrPTYt*av19MhjHpSh5P|h2;Y5ZgKmsoQSrr?l%Ox*`@=~XWjBlNM+Y)S_DM>GDdbRhO4UJxdb<-cg$FRJv^I zKZuc5{;(}Te^X#PMK6}f+?@WnEi{vPgoLKRr!9?CQDs#eHP+O#b||?NMP3Lo@*hqJ zkU&K@mWIWsSD6FpQ}=XaO$nX(xi<<#KwpXHf@Q8lFQ@sFM!e_<u`#w(Pgo%W0MXrH z0z+lY(miIgG(Nnjg3D4viaKPls;LWRfU0^#jE;XZNAj~@8uKVnj5XLWF$(&!vR@YO zX-tzrVdH8^W^|5~mU}R}*Y6?Xnk`(uNK;3<2iKVe!pz`Wf@vajFV7>4;7C1{gWc|> ztIF%7VVbDBtyXI3IF&~TFMEE8jcB;My`@aQsjbX50+sMFa@Zu>2 z+qpEiig}9M5xwA=IWBz{Q(_dWAZRJR$7f(lfu-qNhk5ltI^Zw>GQ(YxK!&BCUdps{ zz{R8&-JMtk8m2~jKDVIWCW;F;1pb^$pIoLel(D;>188^;^XL&JPwbC(JZqr3Fhk;&K@`Sm&Ie_@mTyiW!@>j7uiW< z9|!NAHUR2$96a?KzaB44orG;^hjQ4d;_J}*&vV7w8s3kT?R_ADSn8Jw5}~gWwY0Qf z8NlJ@53~CYue04(GWQSjOQt)(Es#gN0W%a zXKYC_I|iu&o{*rTaV(gJC+o?Ur6J18S+BHEG2EEDVoKJ-I5zU}QE%AEsiER0FKHoN7(RRy zVlK)Vl0A_HO@@py!NzT2%cp^+;mW-mgMr5mM2z$N3=G@+EDItA8}n>K!c$>E1s_bi z@CN#2{=8{_h@)=mO{!VJ{atUN9tQkOh@kodISbjgdX%X}Mi^pZvsx;e7-9nyDNKP?thUn+kX1PdJRw8H7rnw|D z2pw{wCNgu`ZF%^ngGb}?VdEa72%|1_d!o}QdcAml=!S1$9>jycxrb3)gHzkvPa-FB z^~wfi;;r1U+e*e2q9rDI&V1J+SokRt`>_(QifcQ(6^w4{IX->euD`SSdc4k9U-%J3 zYp;K0hf(D`JW#Zrnj*>%UQjMw3t0gG;Fm*BEn(%2nY|ZVx;*R77~z9ZlOINg(E@15vcs?` z$m^H%9dZ8s2ej+k%cOJf6`tg$0I==6dje{ZFhFX1mJYy#QbNlSg1jJ*n?eG)8&wOd z`h=cKPZ^lUG3q}Lo6CgYUTBGcL{A5^K7#7x1*2Q?75r4K2Oa1?F&k_0Qx8Nt;X4Wm z6%_x5YId7W?JU|9DVgiRjEx;iQ7g~u4}<{maup1Puf?lamAaJz6vHzkY|8W6x8nxB zOwhX{mSFm)Ns0PE&B+`$E7*}szm`1mcF6HN&1V(kpdNl+@YVNxN$3Xhr)KotS}|`( zK~)od(j$2F*L5H9Znt*+{hs-{AJa3fr^P0Dy9)56j(WD==lf7}+wf@V8Rf#}EhvW$ zM7XXjT;yhC!)>8P?|ig!sq|GWCrWkmy(4}s`nNOpjOmONpB8S7sHIGTyd;%q79pIP ztqs28ar`}rqhpPYZ@Q!LNTNqh0QUZMbs%n79`l~l}~eeFM{U9ql+7rdW*Br9`ctHwNh`rPIH z`tM)4@2D6%GWBxY@p(lUs{o3+7vFJo+>j{m5hP!!OktJ62*1oQ0k=y&=|@#i@`{~^ zdSKW;j>DXVJN?EbN#2!v-+c6ySU*GpL}IxSp^yjzE&fE@ff+}aP|BWwIluX|?M<9H z0SC{HC|FrgWV$wOxa=L@DzI&_C&-jKPpAlFDJ{$7DTyGb;$^rUR>(`JaPcHb#XfpR zO0~+AR@wp|N#J4P_2^kC(1WIonxGaV#*_h%rD)c8V#-eZ7iK8&ErSiK(G)8el77I# zkZ_HEj_*K*6xpZZM8V2fa+(qYAXyBkEr<;(Q$p_5{Pp#P9GvXiIXF0Iqkl7kP{)&k zJxxZwG|_i3QP{KOhQ+~yD-MhVTarxNE3Aek0nNf-P#V}G01OO^Id$Z4AB$;6A zf{?-OMr0)^hx~>^1{@PCLX>`qI|jmJQyyvHfn^GU-hLl3F?jGfFQ9MN#Lo)(7M(j$ zC`ryepZyPe`bqvYw{LK8;(J%NEM8?#!hcWVrXZEi3)9!)h*?3Bz8Fb(BX(X23!XPY ziKa7%d}zwyG9$y`zH(QGF45urBaH0!uvIx6Hc1SB3)kEH z@8uDb9}|6jCx{!8Ue2CgSrTvMUQ2TaBO`*&4UhphYrr`Td!^xGO|BgsZ^C~SGZe?@=3BINbwf^F%A(SiU3pe!IE z$&n-hNRlv~MPc^5(5&M-cz9#V)?RkbG@Sh56=CX>aU^^8?2~ci_~5n;11l$n0QAnx zyij{8+O7Y$q;U43pl6gfuhzhiqOKz}HU0wSHCpqoHeAA?akbU20;>{>q6$pC^$(q* zR_l#Nq3V^6Vg(n+b~cev*CRlQpwJtuNOw-W@PeQU#menqzXIKg(T*;G^8SSL`3qhV z$b>4cLna!UObMSNQ{+EgEE(QGx$Hn!NAmM~H?Uk4IQmWzBPXWDgK2N6~tDRoh*Wd9X4K7cjY1xcM)0D zc0tt@dj(ZY7ge%YG(zpRnUCghQ^ zR5e;8k%&zx@`x4z1YoxyhU#{bp!JCBNhy>oB|S!=R_KfsJhLj<#Dlp+q8<~7_QRfaKC_DB`#@${icc_wlHEHP}5Vz*y@fh*{qqn zR9(T=f*7+BpD-6#wReKVf^(Q4O?bd>2hO1Cnr_qGyjf7ORaF^uMeTdoAt#|e*-=Hc39G306M1qofhT3sdkq}Y_OBgm zAxhPS-(_x0u~G4HOjvfpb-Gla!f~aPF@sD+7W|2iA=Ru}Xw@4)+F{>JMwAWgN+{LnCDoBHfPRY=Cjcatsgw;t@*TY@htfukqy zL#VDyJ%IMr)^j5so*BCLw;XGmxlkWu8M+Lz;$}}lIBbU(!8CIG85>|>;)aEw9$3O@ zpQp6_#J~Ab(x#d8%{Y@j8vkwO<3+rPr}&oUX&o=kzdG9ZfNiv1B~kM{`gQ#1_L<6K z-zkwTWq=Y%7O#&itDwVA*O;o7Al8Ej0_0uMET{Z{hqp^XY0VFw;viALLeAK+9M4NO1{lK4v2W89Rs+o?+aU8=Gq}G(i3Y&52#S~;m zW|%p*rBXMjN2Q3@`(lgUJz{Wlm@TV0xFbcpi={?n)z%V)ijmzM2aIJoCk2<2QcwRZmo(f(C)13yQo!Zb}A6qZmYo9 zBlUm|N{lvG$)U#B1aRaE$0Nm~ZJ-$tdbrt&rLgcxVh4L~tRe656vbW(6r`^VF5Zqk zNFCO+-$`=^m=umMvMg1Di;6{S1;9h~8Y1bs}8=!R^IaAqXek4d-9Uy#^3Ai-Zz>{FtA>JJFIAf3?SE@^GbQ{7&n+oRP~;ti9h9P%ej*Y}xiTxfSGd_C-jvoF|f zoS{~FggUJm%dtjjTOi#HCm2zDJ}5$6+Gor=G}jce?JF)>ZJfZ`I$@P_tK1_jckngp zaMDzRRmVc?YJFNNR(Z8j>bKuh+4r-$2=|s@Im_8@a=F&oc}$)eSapN1@R#cbpUGI! zml*Ju0K}xf-%L9)oY>~~*cmpmv4(5{JhP#4ie%-=u%DSs?aF-b9vM0^llm29u4udB z#cuISwF4%=_WX!@DnudN^O2z1ulY>@7i)l3pi>nJbNN)dPK<@+R5tgCx)mW^ zH~)!~3fN5~N{#qr^Hc_<=3zUOF5d3NFsm{IvLDEj0$AZ!u^q~v09rXn7)`MgllM(iqqY}q-$E8i*e2EOX2*?&s?5ci~K8#k}jwi|iTOm5i7n#uw(O6FOu zv2B?!oZ-Vy+*#?t8r-#-lwmomJ>5%OhcRb`_1f}TVfkb0D)>}T{r?el56pqB(VBqc zj&0kvZQHhObZpzUZQHhOv*S$8y)!lcVOM>7;ayKz^hsN>9=FAYoz$QCq|FrFZVO)) zDp=_mm7Q4{c3Tash@0pm70X1Xgg><~-zpmqO&?$$!5jNA#v)dAP?7TR%}4!&z;ih2 zN4uMQy@=k|=7Tp>C--Bxf~wkVhg1hDo6vtygCT8Z2)74%BfyQF$IW`h6Qm~zr|3vh zUD?zgIg#G_`}3O;#qWkTrprAFo0f1Y&Z2=pL3W?^?75SLx36kbTF<8)d9<>(Iav0O z9?`>SLEQ>)jwJz|ZY1is!*wL)YKS9-uasH$T>}4Brx=Ksz?wxSBE!lD-XZ2Lf-qo3 z3fZMlhy~&)eANxUfUs9C@1;T3ilcsjq-Kwu8<2u(O{f#bhZQ0yKn7ezo;&30Z_t&Y z3K0$5tYAQp(qRHp>X8o{F{LYFM4A6q^*Ve$DH9vY%;m@oKDE?7%WW^zc_t0R!A&cJ zqwu-=7WEKM=v6$;h&+)0isJf~H7h6{;QkWEK2Pl#GR1jhrX<5ER8!k#s?bg-)-R!; zy7I2`Wz7m`mput{vamDEtxgQUT9REfzy=6xfr0%If3W-%(^~v#x3ZGBlzQi&>6tdb z=eupN&TJ*qG7S_;Jb&WXn$$e*xytu_!-(NhkS(HtkHT|+F;PT1n;`NHw%>`?f zYxE+|wA}URa+#f#ar`>k_g_Q;{(AG~E2=Bd&GRopJp060|6mGJ-q`M-W(oEP%{Bu_n(SyA)<7@0yTk z8)xYVZUbu%&e9$n#u0CuT>^qPKl2FL(&crL=R57NOiZ<6QqR~|e*6&5yBdAkxpkJJ z$s)J+rpz{<)K{)ik6elLRWbpl&AVmKJ5>^{q#Z32mM{TImrk^F7@@`^wd-lGkDgKR zV$u6z!N;dUPRCMQigq67LIn+7oUhGNf9!d$;+OmU{(!naKO`JGvbSXCVF)i)IoTm| z40QE)w%MwVwU6ER&v&;3kOH|g;AeBPtzFGVjEit3y;4m|`kafaUX_v_MK9G$bESI^B1yP zntL%s_}YD7-Z*4`L$hC{lx2y5R7x&j3WnO>SY(M}foRLc!;uwDoEwZ<6)i~_?1j&i zXH6Z1s>&dd+5c%-b(E!ueYyG&3g}Pl!)4bltDirF^C(qiR}R=&l}A0UJeHKp4~o~K zFm)=jDuS^2i-_7JUMunNDxI%fX2vMp9kIInliTffZF?|=8uBB_^Wv#{q>h?kQ+JPf z90Z?WAEE`s^vF9oDwR^qY92=-k^sZY(KX56cVdVS(F#Bi$FHBz-xmZQ2=ke0w$<=O z;k#e0qq42DdXqw(X*|-6NYq^6s&ZZNyC|s_>h$Mb+i!F1ANfS-Xtg6{D$hdbOH!V{ za}S?d8SLK2(T$-%4wyS-A$o9pOWG!XrX~n8T$2cN=-)Xs$IQ{=ty#(Wt-60id$67* zuFyH&_kDns*j!n(X#9QoG{Cd=lQDwD86)H}LZBB4&p zfq9Glk(j8En9g2|;5rn@mF^3Eh8n_=$ss79MK9$nT7&0$rJ?dl^`22Id@ z0w?~Z*ls+LjOX6Gkb@zWsGnh4nAPg*=o^t#YV;&Bcnx>bd%iOqS!PTyMfVIB(EnYI+&%GVNXWrK=kD3I6UpW>Cv)yEOI z)nmG1FZ1&Ka4~bZsDe|?Sgrx@cFr%4@F!o)NwP;(UQ)A__;m!`qs4|OJftvvu-^49 zsL$PeIkE@64rl$cuj^VrU-|QuA4oMkK8)F>>RRRvY4(BA+fkc}9EZ6rMvTAP{+y|zh98cXYG4s)5-$*kp( zOR4;RnL@QJ&DHyBD--P95*Inmfqg192vt&14|2AXq#>tUV*PCS*x_ZUvvp7RY6&VLThu;D#p%dU z9mQDg4AOXIe~`lY#7V+RzK*2VJhaM8b){6tAE5sY-Nd_$ad|esi9oM6m)DjI#AmBA zN|xUFW63*s8qPeS@)cU=R)kx$P9 z5mtHKsM>A2gI4+Gir|$vt_EVMM|CB&)2{CBxFciDNun&F0TrZ?JhMr<>nysWRdsV0 zB$o`+d;*`}6SA?IGmFZN(tLGBxE4I0CLn}XD=w+q9QF(qTrT`s)hTXS z&W|moM70pW=4|9RSLRUZ(i#FDiqA=}#hw9!qWE}ZRYQjW;2xy;fYS`~tWc{so|i@` zJL+$$)D%bqF{5TS{bfcF{A+Au@^J>uyVxTZf(qWH;HCuu-8I9IJ(@JH$hH1Vppw2C zp+2_(mf?E{hqA_S2=%UJymUF^QhxjN2|&Y` ztV~#u7*Ac(5SwZ-zw$|EF>99gorur6YiHg3HM<-GcfEf&dc}fl^2}F)E!r09d@H zI}eCfFy;Wg11)7(z3a*Kw3ro19k`v`WI0DZj(lrBW=hLUZ#b$gvM)HgG(ja9oU` z5r7LXDxHi{zC89`fN{!xG0f7xkVG5HO~6ce!X&W!omgA;{6VQch+p*N{?D7>C{p?ndmviyCSstoTzo+efqymP9x2K>I-d?Wr1=vd1y<2W zoJq?GOy3WJzzi^xljt0(mxh37A>2iPdhTL{(&~pNlb^88RZ8yQ?E;aHkrRS!=Gbm( zVv4$CrOovx3D5s7pn(V&bx2#-BBIWTcAZ!JF-n%3P*D)tFa_`%mmZb`yiJKR((W>Y zDZTGbx7Bg4T!M6XxXzypNx+F(bMA@qVz4!Jy+cqOT0*Qx3Yn9Yj;*%tYdAu-LbyqV z*C8H2EQ(M0y`WPNCccX#_Lum|wfHO|D))lfhY+R%mG{FxcROGMDZFNEuy!zvx-EOzUW~+}L zU!a6!iFP8jp)3o0{NC>nlc|`(tRpFjCbS52hGUJj-Ahn8Y-JATh;xawCQfv2O)$(o zv~xx)Yv(>2YMZTN#?vdz$J6YTG`m@U%;lsFO+n1qA#VUIvG0~f>to-(siAX^5Hr39 z>U!>8ejF;a#3pMRgW``Ti6GWv+KBof2R*wL!S35vGsis2^;XoMoo2Uf%7 z1}C+lIW17AYV%OC0*5ka@SE^4zmRG`q?#y|duoK?V86Fb%Js_k>hrOD)NM9%%slFf`81!>j9%3rrQp1LW@#N?VpPaoVm z5CTJz)VVisBLOmtn}zhNhAJEMT6f_vYD|TwUPDfXvu!o+9mzqcnubIA>~&YQT|Hd*)z;iXR~S;mQD}_%@Y&w|`sB9xC_fx5z=3sVx_8OXNnWw(0-S%RhCb*KB)`h*G@h)`k zKdCRtZECE0c!_5_2cBZht9WRgt!|A6TL&=3Qm&Hg3ym5t66e?RZAI+n5HB}z@J9Ag z@KOwW&Io1;>(-?xw=u&9*@NBdHv|sFA8*#?`7N2-OijlI_5VmY3&U-fV1Ha4sYiJC zsF{Y62nD}6g@ud{HF56IJFg32+>24)-^T!y(hV)RL*A-kVx}mWG-(e9Q6Dk7CV{gc z?0I&Y-SBea>a|BA&Nn$hQnS8$$S<6R9%IUE_~WK8tYY6*PHsP7zu(^54T93Qc<}7D zzxrR6c5r+;7=xai2vYB8aJR@5k+Ixch>p|<-;llFfq}M0s7yAbVWQqx=~3qZ_@r%+ zvWO*ZI+NF|+o~1vO9xBouZ@${OY zvddUqKzVI$zFq!WqT!D7HjJh8@Tqz^ey+YP?sv$tdpkPsQ%MbVCd88#h`ZOfcUJEs zB&g(9R^i?I-mJx6o~hZ9QSioW4utOPWYQM4i-8@#(72|V8VzylEnJy~VC{5gh}BY2i#qgY zYWO*fxm%7*qF9@0m`htWJcn=2CCn}JOV99kB^Fm#d_Sh4FB3;NqMB+q9@q9WG@(uO zw*<&{b-`|Cst%TCOPCC(YvrtA{w1CIga&Bo?V)%0-;siZ7#}$fB@hB-mz)ve32|l2 znOhlbhw2pDesA#5UktYa^}#)kK>dnNO@!sNB)N9r`9Ff{Ryu*(6BICN0kJD14D*&l zfIm92I&(dc<51f9`PftI3W05&AZhiNkrRgkGK5^SZF@S5~r=tT3#|(Chp3&(9rSs^EmTM15&`do8y!YW+HzK$<@Y?#rA73 zCX+<(^}0qR5AS~av4*yzGIdH-dPjJ?-+@j1%NNEbV(}_tim4FXd~LYmk=0=fHaMvJ z{lbw|6!M9q{yRr|dC^K*ETEFUur7H+ez?f+RB)OL=V|!{?(`a!Cy+7sQQCsNP>vxh z_qP;i`NlHrW7lq8%cC8}V1P=JVlr!bfMvck28R%F%uIAEk2ewr`wgep;mnl=#CqS_GIv!l`_CrL0;T)8@^$`!18t((QK%8M>9rK|8mxQm4=2W7p8$wu}k!1BS zGy_lR{+JGDw#CzIc$OsU&brfi}=Tw@G z*7$|ck}>mJd4eO&N;B=P?UEmq)pk*r<%@V5z!?sesR59qU}-ni0q#LEC|+1qX#^z? z#wrq}LM_Q$nlFg!jsJc-jAm!u5&vtImL}=wMXachlQt_V_g{mM(42n6hu3kcZ=1Dq zNj&e{aW{omOdl!h@Ro7@b{TNAn>KBY!5u$#I)8YbF)T(jcmZO#gNha?Lu7|Byp2U= z)Hmv{yxPB4J{rm@ISsh~MvpDaN*>cGYkSjqw8P!Xtv0Sn+y&Ub=k z8R?|T1eF8)&h2A;5COL!@^=l)Q;RIqoDsOPy-UQCKpV+E+2hI~#Y{mfLBp$mo2SM* z4{>jcDW2po%m+t}N-K#mkk>yo^8aE|YfGxZ9Wp`(cGG|O!-)#>xF`^1;sTFkz~fs}?Ua4s$a8{w zy9Ez0_+>jDL~5*u<(UVSi4HqL+W+QG=Gny`+_M)WZlkm_(~1sJey9GeUCo7LE9@w^ z<{*$_7~4_`krR>)CPf9dIPT)m5Vv9C?c6mA9$AKBdPmi*=9da|;syzEV8!`&AKoY0 zHWvqcWHDYX1l-k*&^u*;9ccba#B{cD7}sVM=F*XdMUY(ipa)$X0%1<4C$iNTYC1#f z9y@m%k@eLXRj9jZm1Y3=#*y)g#MS zg~S~CN69}^<+uC>Q@rOAMbXJ9G}xW(K!rN#YQ_p@vy)V|n#O31x$|O5d|={LpX#o) zI7N*s7Ca8hz_%5Cn-Uyx+;Od(<*xJQyx(X3?iUQ(=O9)!bC=GMP^moTy}!5soH84I zRB`2pJ)B1Qn5Xj(ud9V;(Iy2VZ5a6Q8ZZ0cs?DDtWroVKfkT&&)c)x-`}N^OL@jz@ ztxDmkGyrQvfq8Ee+PT~cNg_sc-+*>(I6rYU0}yS6onxd05M%JQeuCLEcre|aGz3!m zp!AqODJ+X=SC&L{)Gt*Zh1Y)fY652*vYdhI zVQcQK^{UI0PZhU(FA&{LkukHNDX#9Hf)5>wt0d=TzpB4v=e!#Gy)BW~!s6wi92#6&^gXLQ zjD*nFJ~^|Bas&Kbt=8*XZ(p}w_Ft{Z-(L7D%o1-zkABNQxJ{Ml;8TBN?nD^QSUgyb zk&Hmj(1W_4y4IJGHtmo~@4RIe?>JPMAj#a4Oy(tU!nd5%4c z!W=YXjSiY3rKaxE$7V? zYx+j!eK!%V8?v9ME)QOrizbL-m2P{PzX)6Ac*|M%)<~rSM-P0hqxB>$P^NK^+Z?HF zYrc!!rd^!}BRY3gcbpLAh<;4TkzmMZBqDsKu#>w=?F5Zss8@A%*|YdW&>5$OYogQJ zqsB1S(NRADqJidsNjt*njS7y|g{li8W`MS-vx$FYcxdO|Y_;U~ltlCtGEx!Z)q}?W z0VggFda;b0lH#WP(6$gXSi-@{^3U1;liMEC`2$e7AY%+hqF-{nm{$iBio}7a4^H>a z?yr0{JW$s(Sx;4v<&I2}oer!hP3R$uXbCDrJpp+$CTB>j*foA1dIDz~ldbm!s#74b zb2IZ^;l{(mr-V3DFjDBV7Wsg*a=P$JxR2YKpl@(f5jCxp$}0JEon-#Re4U0q*sEM; zZqgx<^p=yuZiw|xtqb7p!UAef%VTxAMZY;kU^(2Vl)fRS_Ql1Kd;ZOPTm-0^&AxMvLd|Ed|`{NpR~@1zIF%G&bHNooz|r`>&&z4 zjM?fBZ8%}?Ez;)?fr-U}Ch>}^nW~QEJ$VEiKgHDUKc@dthw@!NQa7f119i3{og$s& z+J_422kuSZQ6kKs_$ zaD&}y%xcx^>cmwA@x6oySc1Cq_16c|%CV1sGuM|LLd@apjDWOe(8K7m?A zv^*E=6b$sNFr^N@ZbCzH4*YJ6Bz-uL0 zc|ldd%S7V!#537mZ5Q#WY$oj^@-|PKb09GPZ1IO!3pN_S47U zX1(@N>E++({0e1ANQ>)lUFdoK99_(HimIDc@%1qD_nzk(<4@cl>)cO^w*~3z1*eZG zV1u!6bj>cl*3V@fKc{!PAMjhNp+y=5xDWKFC73 zc@b%njm%!`Y8~n2PzrwY=5rU-4sXydh3q6n*N3)w7exrF@tA7!&>UDXYSFR;V;2_(R#{UnrRzH}6yUD8`OuR*N4pFMK^vm!be{T% z)M}cr;{kBJu{chF`Z&*a*Kc;_;r(R|}&TtfsM0A z$RIC+-M!T-1V~d~_ra#N=bx;W$h7m(it=AZ2d39Bgmyr&;?siQyU~Xu3%gR%_5eap zOxyicfg>f1_jy)kAhvq~C=(|m4$E+E<{Qvzw~IEqm=+4ufID*+ciL}Wh-&*6%#?-( zhGBZUZh{tvvKm$($3D;&)|QfL6qxEaeX1x!9X^d5KPtj^ZZ%&1#`6>_ z^+pN&E{JJUcbJls8A$H4$kpcXk7lOS>(N|}cJ1SU=L*j42$y4;e-+E=E!u31Q%Qy| z?XGVW%>n9Zoo2aSJ>HI*5=vmubo4&pZPR2(l*Ej$yq^fhMjr)dgRMJQ_=vQhC=Z+a zYPS}`3hrlE_w4Ummf7CjvO?V~k~y|%`O7oewpshja@jRnM6U<@$llGBX7|;``^QfQ zlj*~6RIe5J`sKPf_T)GvRh@}#u6(mZ?t)CJcU2k@Ehjyh*>Kpv#ZWr8eQ&T8dTt4)+Y;&bjEgjCM1u!()AjlL=UQwug;?JcBd6j z&ky{WPHvs4HKmu+LhsdIV|Kz}bXI1AX?1_V`@o;aWuk6J@FF6-HFt?WO+2?+e3#=_ zI9h#)ytUHX)#_pjPeJOf>Ss&0})@MWEVD*iHTc){u7VYF)x6`xpWzhrxV*e7&6puFhXSl~c z7?~vhycyvP1JX>7x{F-I{CPrx*UO9{Py1Yy}$w(o#U z;q&@7cGY&fiH=*7^L_!(iB672*Au7^;E2Z8Wj@D~l)%ow709l(cr}-9Y8L5i0o&^h zZ-auPRz2+>y!LBS$gyJuO&F;)uh{}H7E+?gK^x`m1dQP_j16;BKtajDOPd9M&wd64 z5bz2gF+Y9>PXLF31vO7csNMkUogKjta`|gfW~6M!ERdhrRvIRM$wbeIR@1EjGI*ic z`4+^-h*ATB1ZPCz?AZ$R_HlCcNOQA*(rYE@T7kv-!E%$xB7OKa!_z2V;qV*Xb?EAE z-)W_WTbHADmD!60=dL4qAxf%fjtTThG?6&qJu41;hh%L&?X4L%yu{YoL{2AW1_ami4k8N1~=i!JsN2UvmRBqjk ztbOvEwi*KT?^A|g2PJtbf3kuy5@LiV3XH@&*T+f6())86LO|6m{K zYS-^3(_h%9INS-U(af^3QTYd~EWz2t9vI(~IpY9=4a+QAV54+WPk`@cN6+&ehP?UA zYoTRp%YI`QJV=YyB_pT*`wj510jlxz{OMr4pffr`%exley%i~&2nJiQRRu5Nj#Ru2 zO>CoJfRqH()r*9&&}1<9`m0)bxs%@4$Kv77@5}M=cDaED=Re5KQ%~;}D>&?x42%t( zivA#P#h+M@!$s~xb5qpQVX4qDa~AZEz|AwXM``8*9KvJ~?s)oI`9;o7L3r>DvpU3pE zEZ+0z%zML3U;j6$JpDnEG7CA~Iy~Q)|6rfjShYIS)7Z_%#=6QJzRKxs4bK-OK1Lj6 zK9B4Rw5X30U&F{HT4n$~iwrt=7eMQgqefS+mA^iQe(QMX=0w9$DRnOR-iO0p5t=OS ztZNFY;cV*sc-~FR0K)Q9b^sXHVES-ZXsPl)olE&Ptz@5Dr*)-xj1GAX^QyM>Co^rm z-I$q5d{!)OE1x#YdkpJI{a@kO*m(UE-v20eYT~ycuHWyVr;#7@&%A|5P4B9P`k-^@ zPfUt%X8uy%gl9CY*n2YTZHvgx{xs;b>q3T$tf|7|YQv)Zm@^4M?sb$y=C ze6_LngwX<+3&zR%=z##u~ec&Ah+$;qp!MbHxmE=zZ zY}EUX^G@O5JskI0+Kj1jePa-P#&SAjw(0Mt2mksYUUA$-XGlhlfyP0e!q=1mi<{J6 zH2*$=wKD;+fOIsOovq$i%|4D9l3A9EyK8Q_;x}Mjg(`Jp(yPTN;tuSqO(7;EJJz7$ zQbEzC390VEK*XfF{q%p`bgO&I8Dck-Y=5^ifxdXR^Z$#6qeK~#_n9O`772qPquSs-N0U-tJ>YzZ2lSID^J@8qN!znFGhaC;#@US*Z;=jy?(Ymp0qR_)O; zmiK+~Wjnj`#ahm0yVv*Q=8Nm)O4eqx-DS7u`or(qL9&0*uwP+mp;S~HF~tmu;MY7Q zZ-YGO+wR+fjV+QgncrT<*+v$-?hc=xm&siEHGRfd`|)zB6@Zq=G{BMa9VIDBmM{rlT%Ma;LENV~vzG;t`S5Y`i# zc?RxN2LNbkzKGatqZNNtA^vSg&~Jjj>D|vvsoa;D>jELd314XUk8@N3An@bU1_$U; z=MDJ>zw^U#OFs5cF`()%{)Mpwy$USA^&+Gyg zdmD3mYSe$IPsx$W8Ohj6KEc5?z&7RKN(IyOiM2efz7d?(d`>_x?12(UD^Ly-u>`gj zh%K(lScGPaA6RkrR$my@^|(N#;KIt_o-w#-JSCOHL#5o@z$iSywX5C;?@@dBv0YS93;MNa8VPn zfG8;5@}Sf`Z}edtHiWS^ncr~F(7<4bcx$_N$}>_vPj(+*Di$pJrtuhK_CczWusN>c zn)vG(nT^=FV89OaD-A@%He6&V&6;9h5F>xYGjbk^oSaFMa08lZq)C^Wx~gsim$*Wo zb8*=C2Ur(|7!_nY651&K`3634pk2HPn?*5ZtBzDXi=B~e?mnTQrr)O)f^>58D>Dwx zNH!6p-vNop>vW$mNMAp;@J*jM!)OC@n^KhPv=wdXypXuWHWwuArj0u2e zRk3Wye&;)&cM;{@H}X1fLdS-jP|#f^BuZq4wj`gtg09uA8%srh@5b_?r&nSt zm`~(KsV(H9P#xPY5w1ak^=N@g#}HS6$p~02wurG))0F{-9aXG^`J`(c#YL6MZ9Q+n zr(vYj7UhiKwl^M*Aa_6VMU2>V(EGi1b&5(zR0A-h+y(vIf4J6a4n2ThELjd5=M0Pl6cSRjEcLH}} z5N*zUxq>aqvK%8Upcy)q0+?872hmDgmF%!fTqjRxI?A`TT3%8Y+XIrfK?CZ-o^+IJ ziXexP$8BLJB)fq0^`2VKR3#$g9;j4)XAS;_6$ulXMrod^HeEdRH@XUT4ZR2s3Mn>jMtdafVe2q1B{1?3@c1=5NA(D9#N@E z!MUbo1ZGom%;7BMh&{dTdnbNfA1qU0n@oj2xzHA{;mM9g$djAm+1Z^ySc!|3AZpPC za8&^+q*!ihH{vz;G})H0i@G7&4aj2~<4S>k=wJE6X=U1(;6myr-QghAkHThMOdI7A zGTO6eB7!3}^T^*&*5HzsFSU<_&>%tOHDvjNaRa<$EA~kfMBY3zewl{*el2jthQ*Td zd5*8|%=XncD4%{@G!!YbflR-!&m3ChJPcWau~D1;EF(8Hao7DX2}O9MSUNjsG1p!q zDixF5B)bD(1q>OJLN8IkitN!Kx^%}~%n1IW`Tipq0XG*sUtigl9dy^`;`+8Ziz>(6 zbV_#hVP;gX9vNWn{ds^@;S!`x_U>}(3iTiCL#_6VAfova*zpVdh~_Ea;uuii$19;6 z{#pCus1-EB;r5L%1$IOe^fTW11Hu|l7tmS>!eM2A+rTe!z3d1(g(lw-Q0o}%pvDo< zdq^UdsMm)$j>7mmRJ#R{vi)mx?ie#gs=|(76KVVikg99AcA`R4T=|Epo#YDjNhpm zmMR{#O96?~YB%ri>Wvu{_Z&EKOoLN|&Ii?aN_<2|=^jDMLQWAxS`_fHGpnI2UY4Pm z-Tw~OMs>yZViK8?(l4C}ir?+?f{CGrRxWdqMzxG_lK=n(K~~n5#3v?~2@r|gK(3%m zHF*8Zhs9ACW+*vUJP)4KFRx=*FO|JMu3NjuBjzWeOlXLrT($JgqHh`dS`e&Urp)WQ zJdjJ%yOL>(x>Q8Qc}}G62}b6wB&K*Kp7L36Np8OOPGZ#@ND`@HRKJ03q|nAHEji_y zzKK?RKh+FILl7Kg+mFp5cX84lDcVRJXjo7__xAFVq$97*0LZdwH$+Nu__=$_layug zjYialtDr%mCb+V{((NsOVIOV^`3Nk`lxJiF+j5eFd)tPU4lbnmn(1)TYF77{2|5%g z$`%6f_cKz*+Yib>A)KYP3>kk4K%FdW?R}k<--1_@C8-m0d-GNM$!QdFe-fpD;#_iR zbkt(m!!tdqi9K#1=KsPz!I{jvj2K6j zsVY>FfA(tN?49vrU@SGh$%BxZVsvP#vWh;&S95V@Ef>Cw$th-h6+bP@h^Fln-)qWA z@RdH_aP0VgPUWi$n%8TaE?5-xR3`d28l`+e$1+*=oq{LpsOP>B? zeLU;#4n7}ZSN~&u_k_Mh3P;xv`e#CJou=xJ2!_FIG4Vj zR55oo6exrcIXIFMjR~z)%kaq~L8(9*6UcfzmUWv;?!tA9c@= zJQ1|(7V!|hxFUZPH8@{=IBN2_pfZDmCxL{wxIFWNt1IDMknxKFp4xFRkqsBOdgS?3F#-#-uhMrwYSt z63z+muO6?Z{$D;pjH;7ItCn=g^%>&}^24E5T4nU(qN|~cj;kcOrEm>_7^!doemyGW zYZ6UL-%NT02lHkI3N{W!bRd3>JR~V=b=k|d@O7qFT-3qAKWohP;L1t?C+u>Q?Jahx zT08w7jFp9{=it-QYc)RV9A9BJ5u;9kR*!oxuL~-Q`gg)`Kaqdhdn(|cB<(P8)JDSd z_!ejE$q&%2Pra>Q+k8Isd$um@=hamMZ}@dEXF|CHdn`L3^ ztTFFzo(viy>gr&ozpRgp6&XACJI_{o*#}biH2BxV0a8+!lR}X^GPp^-G#I_DN;`YR zI;&C=@G0Wn5eXY^@XVou)D0Q%C}6Dj3yBbR$V{PpmT{WVf=ALJBgP4U%O)!9!f%qH zb@s~Gg%{*7B)egNEmyN=wB|8rFJUX~;^2KL0hdOgzbWj~JeRYb-nZNJ!^6gNB&}wK z_Ypvk@UFYp^~Dad8;BrOXywj9J+or~@0ay)wUugsZ6$NNT_h%+L~@d<SCYQ={VvoBr=8CXFKB3OJd>6Gme;QWT(q0b#Z_F;jw4O`Idao`zgV1BLv^))0~x2m97+8ih

?QnFeTJIO3<`o^ zMiclizK$o)2eTvzn2j}u(He~IiRUWsXxx{*wZ@YolP4%!_oIYK)Aq3tNl@NfqApF- zI5-0|`29U%Qn=Jh!)jWx8b_#4(O8%P&%^6Bn!I-GumjqWNo|`bNZXdmsrMWtAsPoC zaldF}5lBHSG4fJ@YDJ|ob&F4!bi5ZQlG0nf%3>hyf_lPfK%ME3`vK@MlxYIlTpZFo z9ym!WPzoYsP8lN8NLx2icNUfb=9d?_jl$nj5AyTy37)Q++jgb3a)yFwl7OVO0snZ-baM*@LrXb8-#qf?}tv0MT`VdMOq6M31I3Puf zdh)VAyLr;Gv6W(KyO? zfdcDQgzJQp>wMq1omF2yRb1pgh}6 z0J`BAX4KGcXMZ<`G(%x)D*HN$@T>a3=KXj*N(E#YLga$t?-n)4AN-H%W6Z8Lq9zgd ze^np68k_fHxY+i$0GFbPntuto9$%_Rp9^c1xWw4Fe^T+53CeeORGwuCyi2C20jzLq=HWl+QY zf!m(fABJo`j=Fdrv6sApK9|#DBVDI>xmU(6rOCxIiWUaK>vcSO-ph5xtxEwWL zX1N-Gc-Jq1AV}kdv!C$U#^0M{=x@PCn~B;|GR1a?;$o^^v5jNg9R~k8R(62A5@lB~6Z3$S=szi&V!~R|$!B&nd_L0%&>adb3P+ZPIsI|C zdi3bTGa7;=`XJLKu!ti)E;(niZRJw^Y7(cC2alAE0Zr#$S+lFYMK{ByRo~)Z0Wy|rqqu~6? zELVIK(vnWzf!lRe8E_3+jeeX^W`7`UHi*X?mcG&OJZ@f7l^9Tej{%bPS25H;Aqm%c zPARI1OhFiI%NIzoo#UZs<;Hb8Dd%4-dp>HKG)i;+zOU$RFA=LffN4IIq|d5 z0%8Q}E4-y%{RniqoltOU+S|VGZZ1wb(>G}l!SfVa@ta`+B63+zC_2D*V5$zCaK)Ef znrDxSzzx@f1mrCAQq+eY+<;?w-n*}0?vaYV@(CteD5S#EbV~T(%qB}zfAB?^q*lUoMTv5lYIIh_Lp{eM{>U_o|;U)G0IC;G{X19;#+)(03+ z3h@lAF&r}96$rP~UD|l!QXobZN#s8Hz&Fw4-G8ek)}AQH1GPU?kUjC+nWd^c6$1-Q zC->#~C-9rkgzj!CKZNK>Nwgn~!hcytcvX{Hhd{J7zJ(Q4=2svuh<9l$k`BK!(!1PU-!5!2&2krxS%162XKwbJiL@S@&v6NI;|-X zx9kKez@2;7&mJ43SdScz1z&Pa5mbS^u~Eic8#;28z(!w&yi6<$?4MQnlrOi88att@ z&&Un|Erjx(bQuRC#7FZyFO{k~010W5+gAzB?j5UKRjZqz#t;h&QZcHe8} zj7V(!oOrQlP7=CE?O^8G1bx~jR_roCj|V1Vc~Q|*zU}q0ctTiMQQs`;l{YNyqqdx^ zZJPGhk{c=9m)u7(kbl)K0VDXH7#06WH+dl7?zZFiP%*Ed=!1e7Vh<;7xeLaxSX(FY ztP)ou(m_R}3ukw(ltQV{d9Nc3pE~V>${;pH~FS)A9Hx^~T z5pzw_{*)n2fY6d*G%J*n8<@j45Y8OkF2S#oze91n|4p&j4M7HC=3N$;$0sP-U!)5; z@{HNL-5}bgU9XMdhIm!m8#bjoLQtpkv#%ZKH#ZZQC|G z>Daby+qPX%cWk?Y$@4z*t@&onTKo51e`?=Vdtc{qoQeD3cnVhD>%*nMNeUc%7WeMhHh3{4$cu)*@znhp}T2P;MS+ZTsMWO}quDsIC6OLbr+f2Vu8$KzF$no4y9^6QJE@_^xz4)zxutuVM8NXv3(C&p5836>;3V5fIkCy z(0zC1g79uW1XSVJ`->r2KwoYQ;2eCve#-=Ro9`fk3kgS%FfR1jet15|ZWBEe#1U9a zicK^^lF_1WerviK8Zk_94Gxx&d-!qBQp2=FxZM(Sk8N{oA^2l}<#*XXS^o?x8FM5K zR73F-0uvq%#iinkPo*I1scx!o6J)D?gLg2#u&}SNCKBTOYU6Gwq1hJ`BPvs_z4xLw z;D>jS$B>L2ap=q@tiZ;`$wXm1mk?9OwyU!-zwLk)Z_R$}M(t3JrfpyPYm-59)n|tu z>!>;(n7}VAJN*aJWUx+xwgN9f6XRpx3&b)!ncHNt9EzwE!5`de=;Pl@Q~1{{T4GTA zCL9I%>|AI-3io}DrA|ZgkQm7)gYSt7?p-H(zW~Y4^DJp&Nq4nh%~4#zp92KAdt7SSWfc2|Sv~ZkI-??qm>fRF<>9epU^zI< z?PJ_9J6kZ5)wm(|pfO(F%VHB_M<0E3k7&O+)CQd8D<>n2DS8rbu3{sUyzU+f>_$3K zfOkW5E8(&OurIT1<(!_LN_%)9JQ{QNWaPT~wzKW@n6-1+P|Pz%@B694Pzw=d%5l_m%jY+UHP{?4HA975=4jNn@#>V=i<5Ws6$ z@HrBf6^zAVX9&cI5Vp33LK$?-rk0+mh?*j#?PQ?z_10Dw1b$2nE6Sia;I^lLt`{bm z7SnI(ALN2*hG54E_7)Gy%%HrHkqze^kO%)L47YBISCmV!wT*7a%*`Qk#JkhmyOT91 zXONK$&kEwhwN2j#%^NV%mHwg!R5UTG`u4`1ESpX86%?Mf+XO6bEiCB`h65b2f5vEo(|tXy!MGhJlern3V( zUbP9?3yFwxBqfr3g*=(ki&8F>YE-km5fP3m@*=C4ZXh2WpmMeJb-Vh8b^rxsWQ zsf{uMgzk1%hSO8>;ZhD0S;aQmVd$)yC1SlU-(mCB4&kTp+&i<7~8iX z+YC6Q{Ho{fDw$8uhm>{7gIL2b$@PlbeYblB%)!mot@w9#KfFQa5&)+!0c%v&3bC68 z)J{q*GD58eJAZh8RRvj7E@o$FW&<1D&nabmM)1FF2*{!W=GN)8vC&6l04K>3mY1Cr zZrVvMvhA>g>0=R#+<5WsTr`C=G5!ZHD%INa<|TGK&(+h;N}*F%3bAY#XHuRS8md44 zkX$BNgwiqaUHUPeMturjAQv}RaO7;&Geo&0BZ+LFk)*~50x8?f4&e|O&%6Z21qVYu~!FoMUw2)?)Uu2lV;@iE!~re zWqlsh99iwo^949lSh=;(7D(HseKlw?xHLP&Fi*F%mvSEg{E{?R562Uk`%rb)zyWwJ z_|GX1?NtbGQk(k#LpeMh6v{>%6|C0Z$YP4xu(;99yqQgeXI8}#!OC~N__*L~CYU@C zV&n*;^9vQllM@Is<}AE5*gU4*orOK+elor)%~5kc8f!E|vGR$*P2+n6Nlf;uZVhaZ zUXDIh%kAlKvP;44-Xe#_(o0xf{gTDt^VVh^nx^8@-2*a=twlktNB1XR=#mTeQ91tuw0Oy+ymm4t((7!~JsB058nQ|WUi46m{$*HAtkY$!}H8ZDo9ghouS zvcV`b)~I2fF_;##ah)5?rbd*Z>OE7_-cfOtC~3cM4_<;jg^4|Z8(`aN z4E_^aAzIlv2>MVVlINaGO&B;tUm{H@!G7{&HoxI-$SV+Z99u2ZjVYxXhv~BatpW)h z;8_nsPQe)SZTFsc_Nyo1l}bJ-V4i{%Ix4Slg1;O|q01-XHU0P6AB8$J(7mb2C}XVR z+;e{pdTT>LY7A&>5;qC`H94b_`OBfKKe4o%Uqz5c2uIj)jk}KZc~Nmj0C(g)OTX$s zQY9-|g3p$DH!KFP1k#t6I0w{NqnJqSH;*D?Y^v_n79QDdg`=Y!B$PcC0tEcu8+2`& zZsjT~q@d?nMH;9KZxYFL^g{8jHL|_wZi<3@gbyaSM)-q%Yy@>ll2(9x{!z%Mo% z{8|Ur^E1PZ0dI7K!^M>1#^8T75l2LoRa#zS>;>Q^4xQjX#h`rWtKI8A%}Klr#oDfqVc3deFqeqCWAtY>=-d?O^Mzb=H1VNX#TfE$3qDw@CEYt} z;&p20dhIvPw=}(8f>oiF20he`9Zq&M!um-(i@701aA7U0`BqnoI&6(E3Pqgc7M7vn z1s#~fjokDt7Yt(w54b*4MS66tkj~Myt%6Dj(BsZXWF+F#xbzMiT85uMWXg}AmN|=as~AkaR!OZBV@k~o$NcP4X!c&f2@UT8kf9}^L|OpM?s z!*b)PkU%MWSY+Q<2nIZAdd1z4UWOY$Fn+Ltv#&3Se z$<4@I!Zpvq3m4BQrsyaA$f6yuBBpiaS@ybBLjPci{FI$VsL{3*gRt}T*bT!oEU_8c ztT`#_HT=m~p97Y1IGG7|8lCqr74hPSP$p)TYhuRhnQ{ii(K(g)l=MIFJZ%EK1*)=f ze_Avc6ND`2R2pk}EP98-ESg7IDp|@wHyBfAf4sIuviVzhX~Lk?<3hGO zEuE_xanwICFPRsDOqoDMC~+-|FqNQVklBpj)&Y4r4`t;Sd{1A>A8<%N#x3p|*WX~@ zf}_D>>Io;VAcI8gZXDn3I4i5V_2jij`8cxVJX;Fn-9`vV#>_bVHm}U(PW4B%7=DBXC(=KqSYp4`v;5E!WQOI z-j+R#qKu$w{`BNrso5*n#oIy-#SDcIovkT8l4TnQGRsNRn(ip3&gT|~_Tx;V4gP`A zQ0{$hQ4U6GSNeMhBQ+$1XoQGer?=>4I=?z7wm+gMNGhEUD|UY+Dm^JP8V=VgoOneg zCLOO|I+ct~WyD}N+*R+*to~Z8i{aK}M+waij>GIZ!7VZFLyri%L3^WKgL$S~ERtM) z$|bEm-nfi|L+aa$&Vuvv;^h{s4XQ>-J_qZcKg@~@sgnq9LX?Ff1#A|4j+3|&DU2c< zmu1pn1MzBN7BiM_4*{4p$=-Xy15Vth;;r!B2_$^fL++r22@3+8VNyDuQ$=Pu~<-((il z%2y|S&L7ji@t$G81cWnTx>%zvAt%C2`bEaYpOhcIp3-zG@0oF}Kuo+760dZgaFiq2cQJ*f1Bo zwzNT@Wg5`QFqbK%n5$H|aE%!jcw*;~;e?C)q@TN`i&X|!*hhJ=mF1%pVq>E|urVFI z+?%H~e}}Ex?U&ur_eR6&SdNU=F*WQhr2%}LcPom%bre}nm&-yGf+n#N%3=qzUgERc z%Tor!3DwBuX=5Z%R#E89%)@~b+P(M0Q=0yxB5nrVcq4aAGF&gYVP7$Y2RVZueSuMY zBV5&-rrB+;NCz)a5@Tue_cJ$^Z>g6smVYHiV=2ovCYDei#7z(5A%Hf)AFbo*u}be| z-7Pjszs`k~CDAzMURz|ZW^C;SHDN)NB#|3=$*s;QNMADE{(j5|Z_IGEH9}T64+wuq zKufr=ZCE&RJ+8bb7W4ZG8^Kbc-6O&nyrqFm6N|@zQOsyN1Nz{6$gW_O!=Hm1(HfXU z>%armZ|P)${X%WTtf((QUxveY>v-1Q^KHv{a9peX*5krzxd=`WRq{M>DZJs{ z3hY~W_Ef+-8!`=56LF>Rx}n09)Kju^n8QzCNsY=ut%A^c-tW?2bKWad zC=dYXyX_$Ww%UEX2tb7VqvwBd>l9^!Cl@LR$-SvgK?vQdeagj{{3LCLrsy9P)dSA7 zR{veCvefv;$ZrjV6XM40r&T~?$7+MTr1R`9U&R1y)?2C;FcbomEu%7IshhJz*rP#e ztd}`x38=~^6}OW?B?qUoVkn0bGUX^=oZYZx$5+H)oOAvR*Wd5#!e&kK5KxEViUl-; z>wzPat}TP^DlBG*-9%QCiB1xGY;~pkjYQjIscZzWIPIzrDKFzun)zsU^|w|KkV#AUb4=$VwsdaS4dhJW$_rtvN>4&T9@~td+E0?e_USe-ik(m$hDQbvo=m z`SnBal>D`7ScIx}mdUgJK|;TWDN4+{<>Z1fj8~un$0M79e{xI!&A+^fwjB*X%h|rQ zB&~%cG{#wQ2b)4i{})@5Gd^AlC%xg>Uqd3PjrxdnLp7p2o3*fBM{FuA!yxu7zkTl9 zR&l|G^r}V1D}Do7WaR^>_s=IbyB3qYLX7mhB7)(|=RAjx+S$(8;BWR)A;rX-)ZCce zk<0Kd@(8m*s9%iMAF{r>h!lH2mpZ~~1H&_1`# z&)H{>?@pMVAL2j0E+M*AWQdq?2Z@;PCPK1$qq&dxu0o~kfR4acp{wmCljqN6ppT$@ zPn!i^?FJYQ8RuZ0xUCA*jwFE&pNYVAt7ps8g@3>plrzFOiwHxRw&`SJQ+V2pK7Xt7 z3JMXQ(B+5cTx}0fg|F@U)9@>-;)BS)+iy3_$GsQo!o;F3Iu|m@tR2U%vuWC!l)`zK zhS|bcS~Q!keW^6ws-6%8#Xt7@fTovqL;q+S>Hg%%a)dwJlYs3azmG7MuHucv(sf+7 zaa>EEn_Id44y?OBwj94K8Kc5v5~w%OZOy|E^BguB9AlW(Z6VTsKOS~!nE9sm7^XiB z=eD%@zdVheeqI6&7AG{KJU)AwpD(`2A0Tkw(gtHlOouNZM@UYLm#Ldu4R{#wvF8Ke zLM(;`2)-`em1F=)j-N817U$7enmq&TU#q%6&L6aNK|)zh^v9r32uMQ($_ZxwZ)DFp zjc{d_0Qk44hWZXvcd5p9OmY~oF=C20nzIh^u4u|Qz1ym=GZgrpq$`nL>;qguo ze)^__Cg578RHBZ*TbptjzdYqcNW@9M{tNC!|G(hgot@L^;jYcnuCXfF(ztaQRjPyh z-`e#is*ABS=4ilcAHz;4^woby*&5y;w`~^JEB13tM8FOI$6yrJ{%*%^_t)FS=dtgL z1~Fm}{(<1Zk5Y(qV5&$7f?3O(rhnwaxd-}X9V47j1;CrWeb=A9U7}h@O|dV5+X-c* zw~k?6R*~h*XWwZ@IJG0yX!{+YgD@q~_mpW-DcRniFO3%MfKN)msF}1gDM7z9{8yF) zoUSH}>n$df3)SoS4XPD0l+n6Q16?xgU|Y-gwv*GT&;6+J;z^#N?hq&&HBXJ1^cRQTzS(dikp-%v9W_@mpA6F52n66Q*X9bc3%(^8?J86 z*sfMwADJXh{eR^+3awF0HS`4=r^hWBY){*Bm`u5Qm~kkJF`JI3M@vtQo4rkP)N)mN zK7UT}$lnXmEO&bxgn9#Y^gO-(LyEqyNCO9v*f&#LCkp@_(mqGLx=RZD^=0xnYeeSL z5e9~%k>OKZlOA!URB4&e`n#^pZ-}C6jD4*Jz6QXzoxL2KEiGQspl@|JEB%gdpe7?} zv+t#r-O;hZah*&Jxl%xOsX#=~4M!v_K?G>Duu(8ntLXjPD@TwJo#8Ftax7}!tir$3 zN;$vpFZ(lKWwRxZ5q$h>tAU7NT2PnwE6VR5ERD6o^cQac04t%1=^IxiRl7{Of0}mL zA`QU{H10G`)ei_si;8Rk3|-&+c-afA^@TyGw(A53x!Acq6WOZ%IC!~h%iEwTv*hFl zzBYxtZcd>YydYg#U7j_S0CzXa{44+CPGSpLu*FxqJo;e^A&eakVbf+0z#?SG@x=Ml zmsa+@dw(Z@z^3(eS}F)6DYgqM?=$p#`WpG?_NN_X=78LSF3V5&jSO!6YVm}2aXl(M zOsovS2*2By1Rdg;&_v{{OybksDZv}A#M2zTZ|~rmo>Av2H(kUwUJx_sx9=C9%I$>@ z((TWymr5nv9%3?5`O$)6T&Y^UI3qzd=Rwm?X0L`o?eZ9i-QoH}BpIg2>VMO2JVwG# z6Be#rs|&X&wX=Z*?7KTEE&@P}2)wn(P7_1kcJ8l|j~ngF(Ir62*vvn)Asnwk3nQL> zn|~up0xz8iB7T^%Vw8oT#zLQFo=4I3s&@$^kn^DzQO7OO0*ypMps*62D1Ddfk)D$h z7FnB2j;u7CN=||9Kd`-*DNt;>fht_R`htn0H&|{xtwOE z&-R=5YCmt>qxh1J;q`CcP2^AZisa6RyLEDRp8Q8e0cOB6s=?_oOKT+PtVpjyWwK9>a^+wH&V$d#K|@8K{ESZ<7ib=Rq`|E1b2u{CoElS+FI44d;aXv`x6!Ni@O_N z8srivCc@r!UXPZxy?m=#xpeT&l#&=1=+U|1xhFA+8*&|7G5eu@xwTG1Cq6^rIaQX4*OkB_qK8 zc-=t=u(?4%$b}dEU65duv#0-;cFP$4-?W>U z#Y5}Opa?qWW*GVjgW8YYzMl@q$A1*adL=2->{091ET0T+XZ#u#6{Bk4cg+vEKdO}C zmiiuP#&vRVv=Sn-KlL&avj;52<>bFs%c#o&hAJjccT;Gmban$fvp?-LjjrAiN?6MR z@KiIG`VcB+PIobAXLI^ylyYp88UUHL2(W9v(rR*oTGq_@0*20-a7Mq!yWbx$P8%Y| zkgTr5zYF#Ns7$t);n=jsTGel>ITfpS-y~K~3DC7O65=z=q`cBNox_EF&*>8({YtSd zRzn54u=TPI1aLEm3KV|z%Ou`=sL)3nrhXz&HMvsJHh=dI*t(2(d@ioWi*g-b6ZPjg ze$#Bd|A%JF(Yxsx{ifNfHE!f`Cu_&ml3)>D%STA1XxY^3%846%+%UVjJ#lundJawM zE|Zp2!PK)pM|*+dPA7)rNHpV19yUKwDV{^?Ni$-&Tz)I#8KGq!a_Tn-PsC`jQ|AWi z3U;T5Ub}$@;nsQY7vjdwL8GraUf%BEn$Vbax$a#$-=drxMTD(==R{9fOH&z28AOE+ zVIgF1h@7wj+siBHo3Zyo-MS#JS2QO1-SjxB8OHI9#F*bRI*23 zTEmaGyqXF4w^7{Yb<+Kyw|t^SBH{^>@hgTkUF@K@j?E3$<8qn$RQ+x}+2Qu*G2rq= z{+VibGHAA`PQBuZ)?%#dH;JoshMu9xJ0ZW5|GGlKBv;u(yxNUajri27(2e-esY9c; z+H|442&}U?1Ci0TEd|;@A!#1J|MK@ts;wii1+87aP3iXvwy!JxSaI`SVU#4zN;u?j z=xKS`TeutMx{B;)8n<5ImH=M{40SztM6k(ZzQrLp@bNi0`Nka( zaOVFGy+#dT;b}HF7}+K{W1MrrRHw@rd%VpQ1oXQ-t*wqm1}|`|Eh2$Be11f0g_@5y z)C=%;ZWvqfdZKo&qEEll_2NfZt1<#ZbBQ~hZ>rTXGaZdAe$xMOt-+*QIn-omSqs!@ z0eFJTuMh|d<9bwJcD0IzS)*uDdssL80-s%$l9+?ltI*@VNiWV2CFvwnpNrA9oZf&Z z(0;Fpg=<+KwdVWYT{W%12fpk4SUJ52%eB%U*4L+}5l`-JRRact4Y@bVWf;y?D^)1U z>K~CUBKXDLThtxak334DC+z@fECZX8+Qqr^%Bo7)b7O>MRf4l&$g1+}#m5)=M8O2zJ#yT&0L1Hh-fk@B-KbSXzdxYl%wTZ)-xZN*8bN9tKq3 zP!}5{WLUD!0<0_xB6iXMn1gTnv?*R5KgD)6t~_wIzG-!dT@m(eXfbJlF4~&$=Cu=-S7j!Q+XgJ09U52(qlok4zSdI`#b@i9}I0g^n!br2E_@~ z&vltitn_uD~wd9{HIm_lXCgBi3bgFu|1ZSiJ2HB{SG4vZa?r3ms`;G-g?2f zIra7R<(ccHaH@Xaa53s1t-63>bHclNKlwb6eTg>Hp3oRv-eZg9aSWS&wY945jWPsu zV|qHyOm=ouL*|N%bPV8KWH{9!)^Tt3c4x(9#VYfXPPL#!#@_bv42~(Jhl%2DoJbkj zLVm6k>|uYgiT*YEz#=SLsq1{k3O8|+0W`|qR$Rv#Nlsh`E}W zb3H&g%F*X(?3G{dXi5i9M6zgo!KZPTMp+Efd5SoMy6k;WuKmD1q@@Y6F01kF(I%m2>SU}D)@&E0GO}u&zGHJMvLd%s zE&##NwmdehaMdKyEQ%daK7DK>^){Vil*KlUXh7|26%EAOyjd2PjPs^)ldgeE1H}`GwP<`cdg5awOBVK z^A^{{4+adN&MfOU@COAjq73vSLy34^tE&HQE(9Nrn*$;IvV=k8u1wNXU9z&n6oF|D zqwt6%hUy>*weDl<>2kk{;m$a738_>gY-r@!l37eE-mg?|R-TNrBc>FmnLkPW$BLam zRV|11%t{SK`kncpn<6G=LNG43hfU!iKwL7XT}G!6J#e%2qw;yFTcrD84m-v?BBbUc zpya~~<=@$40AMqe>HPll`g*SOBwZg<$h3lR1Z{r$2^-&u%9H=Th%Dm*&E8N8hiyK< zJYdwO)JAYtaf&9c_=5&rmSr$ojTo6kUNXpG1ep&jHPxn~I_xUbDI*Ps-%OT*1!p{| z`Vny;EF$h?0GIcNVIZ&1pD(|(Jmk!9QvOPELb82UaXCSAPSF7>s*IG1P8$sEAB1=( z%HkU7D53yg$w7dlt26hzNKdqHP^opSy{&ziQd0}%w6yjRyypILF z@sq`BYHzb@>5!eGZ|6yI4W~53Gi5|q7)8bRtiYKiXQr4=O-`rOet7HGi{GCMGl$C- z9Nfg;czE_dj80`axXh)yOq=6}w&N_RDsxMMkpzzaOb|+|!Kj6ZDcUYuYEb6_$QwH6?>Ux+taD@JU*ip$0 zegqe5>yT;`1ljwVmW6d?qgmzcP4~V}4%n5hyD$@@^K{QpBDo@^gplOd%9Ndw9%9}L z2#%JzkGm*8r^p?Ie))^@D(c3nb^-*@W=?Bgg$pOap+3gx9!QVrXXZY&hH5cI#Od~c zZuKtypA7pKV`Hhkd=m9N(`8oNWrfO9$lPVc*qvvuZtq(l1XA#gG>!KN{wQ_H4n{CFn0*}6 z?B-J1X~yq?!0k##Tb53qD>Sd9U+$*nSBeoGHxfbT$*yewZ>n1N&(FQ7 zIjm!QTwE;z7O0oK!NoC`j_;#hUnuq;q8z`wiNPo$H;n)f*Duk%1P|De<=@V{bl32W zm!{VF8G@Z;E6?BMW|_3Cb$GR@!=V)^!?Z>vEU*MY$9ZU2X;%Slp1}1*rnn1@yzif( z`$MME3(2{|V8#q$`IEL-206hLaAN1b>1o#^b7@pee_30Y%CPiziXRek*vstL>QGFS7?RH;Xc&f=KDyl@$G@;tdH4KR!&fBz@z5tH9WFZ6M3($#;%i`lh{>hlg^_18CC_3A8}nw(m;nQV9jL zJq1j5b3WZZd51Y=i6+z%ol3{S$A}l{I0BgWJ$NUIzti+Q4HCW6VtqusfbUi&r0Fw5fu#Qy~ zgiI0;>~(goQsy2$ zX#x%9N)v{x2`+JsgStwmBU}Dnn}_?c?6C6mR||S}gp_7*Zr~5L737xeNzroRgDR*P zmRW>z)0wH#ink3)OH~b(tWNFBWw+gm@5Ob=_sHiKuI*v0Z9-MLo1~)*N3^vQ@uWwi zm9gcBn5WeLhb=D;6QQ$Yfs@O6tFLfurjtpor&L2Fegyuu|l{0z@f9 zFn|k2>Re*SvE`2{gk(zaH*IHQ@ulANou~?kfwgFs3VEdZ@ywi`{MP{e^oC{2HI(t4 zx!k6(&~d)0hISu2rNO4N*M)@f{4$Yj*~UE2_YGXVGOk|$N8w!JWJREo8yJXuu zI{EUa%9o*uIZje4%GQ|X!UED_+m`=g7HaQx#}(-h0mhlY8e0oS5C5`}9$?_^(F!$| z8+==_4CY-z_#Wq5wi%%#zGrRNe7ly$bTh-fq3--p!bk-1=ic0lEnZ78d}Jw#)m7!JOnxcT9+_7rt;xG>*>W`WaeC^V>Vz#m zfA8Rdx2u01FR@Q^5p!1a_PGge4o0e1V0Y_NwGsLlc{>4Q1MCsC)E6BUW2*hp!tut5tYPsMb#R5d*~H#0 z2Seo9MR>Sbi-yGg;*g;g7mMe0sZ?`HFgOZvBK>mV>?NjfzOg?*`Tc)rZDJi?WS|kx z><2Fbr}^P&6m{20sANb4h$m z{m4gAE-iF(m#=Wm2!?`;xnv9ab4+r~h)Jl5*rmsz9Q(Ue>#ssQC(>O=Zu;T>6$s|iMl5M}eMI?P@Nx(n>-=6W5}fqp zn4g{ar)%V**BAbr{2#eYG3#+_L#q-In`xe6m&7t_@!mG@k&FLq8TKpmUo#$(a!({m zsTF7@cQFQ#9;s3rdxmr+2LDS*XDC;DV4@95H6i(Jjp< zrFxh9rvE!m@yeja5FcB8vL!M^rieJi`w@D2Y5C|Qxh0(1W6X#h=24yd8^(Beh2Y^P zQHo==O0<9xJxnhJlYsXxoHW3Y&(r(xT}cF>#^h-qP9JyQji8Zl;ilaTO<6$%NfB6asxcy- zNEB7lskwC2yN_V`p^=Bhl}*fI%>HB7?Xd0{(}Dhk0nHiLHQT*xvuAhUFhsBK^A;A( zF5KAth_m#c=fGl!A?c2(9jDAtZBCLK1kho-N_)W{3fgFOW&8Z>YIJ$~;!Td+z>ytg z1cwTg&2rLO5cnb;-!U&$dntNnK4yU+rRVU;Yrv&S38SU8q4wHy*_#Ffs|qzynz!9eo5-) zaO?|-{(JI}jnb-?SRRt(ky2BKBWC0E(`-#e{j5x${)u$WUF z;0O7xs;Id3TVG%s^^vaIr_pqM*76qG10Qls1qXu{8-P>b3-zBQyxDUNjoZ_U8Y*Y~ zw{2_XAqsf&m7x*Tkj`a>Omoa(<R_2c}03JAg#3# zu=^9x$BNy7bx-9S!S?byWp9Kff)CF~LnBg;>2Fe%Rg{|Fo)(P& zimgc$e9^kVO2P^Ysm)b=qDN-V%2!W86>h(TVbKc)C zz8YL;XjBxZx`%rR4|AJh18;Fp62nQY^z6QeJK7p}ujMhCuv}(7VxH+5RF`Z4MWH;_ zXRU7!a+!E;*l5}D8W(D7E!F4hJF3D$t*H2V;YXop5{`eHNAgeja4^%g#D+Z7mFI2* zpvj6W#e5n`)yB{4r`DTNP0CjbZvU;W%cpQnk z97G{`5vM90{6waufs({B-aoBXA;Ku<*yDn2*MG*ib+9m0NZ&n|#;f;z@?fvfMLVCz z?ILp{Ui^y;-pTB0uZ`1RlZ1-4rFc7e^5*DMOk3(5&{%7!1>JVQeq`LmRC`S1_L{>q z*uqj9i~ZCnZveYw`svW**0(bo6jBjJ~q<{Iue z#{J~^5pTe@EwsY=r_aCSw~A%Ti`+lqqH_^_tetNk4!&|;037uBIl-I;U**KT;9^Ay z5Fwn*n)8k_pJM!~z=QX@91YvQO~-n3k`)kEbIWyg?rmL4I+rQsVj2wUur(E_fJ#XP zhIFgJCD;c{B6aO~xn=#RKycT!@PJiSm&;b#un9C^A11&i4^tz*3-i)-cY^S<>IQ5B z%U@G4bmD-9Vh*|QxvqUGHY0ZBH)nj$EukHJ?dmd8QnZV`Ugd}X?*TxQ-ewC9O@~EQnG8Zs38P=cM zxu~W_u-%I~ zOw})RMFi^V8D?BHtuVs&KGx0cV^HnO`I(x+UOrc%K`J5*-92>88%4Jm0~mvpPJ1V5iPhXABY!2Qk>=Yfd~h_I>*80#8u*EqUq zg%$=J7*y0_x&*HqF0?3$A?Czfl`(n+{q_o1#6cEbd<@bnQE&~39y2Jl)jaCr?^aYk zDXe?CwjftwjycH1-|O&jG=>p z81BK9A<|zpw!)c4kAUe97ELCS>Cv0^DA^<|)LPA1)Q?)Ne9hO{G|io-q^35Nu8YEG z^3Yig?1qnf4Er0N8v$BO9>E#!%ytT}7h+)ADg4U7OpMm7N)Mu{A!kzl{H)dMnm~@? znqP*2#j!&o)oHT>E{*B5edL)t$JiMARpyL$h9Bec9O1U`x}EijWT}+F(II`R)b&(^ z%KvHNTv4?p4--pTWMDV6sI1xvPpLqZZ(=2_?hUD6#&VeLT|NzpI9Mg|TmAN%b-i*y zE=|#@)u0j>&T9PO*1I<4++HQYj#cdLc$o8tPb~&QTBTrS?ORaL(f{)Nn@4%*$TnLm z!NOp)2VPX(W`1GutOX)~cv}W^x-!}+^G%CNY9iT2HzyQZ2D}y~mCx7gP5W`VY0;Ynb6@%wM?KJ)DU&0-zv8Z>Mmi624MmS^zDpY41?&M;J9Lkjb#Xiy(SZe2o*m*c zYkg@LTB-=ADy+i;>Ha0!Yget*7xYons`*;xciIsIgeeb@%O5sV-hPQYVJE4vTZu4N zfaVB!$Lk|Yi#hAr`uPxY6~M?aTvQ{ro9~bLH}K}4l8UdGmT5Oa4g|dLd8(J;rkC(Y z#qL6nDR2`qs}@3nW_i96Do+da1LO)g9QBY`GcteYU@Fp>e9O}9$i&g&lq_3p;;4n1 zJWsQ}Ni=Jsx#%GUg?*pP|9S1MdAMu_x}3hdjo+Ab!%Oz4NG|NeJzoI2gj!sH zS4d9*>42|)W1G)@?H0#ArBj@8aamJ?jm6>71of2(n9qJuSB%FigK!D-qWNNqI#cC$ zF~THspDLGeKXhk_82cF#Si0!DHIgziD1BXmVFmD01AFX5ixbd}KTv-B!pzOU5<&Jc zL(9d^J7i|O+KKmoFo@f=`Y*bgeZ*{!m|!|K^kN@7myb>ls=UE`2SkHnN_P>6*h%8Y8X>dqiu%#oEpNt1*fTlbAIaT&g= z;~Qu(qMd0!HSl))zvV_!Z$08fyx%3q!iFcXp7RTwH3=kiIuXuzrsM*#q4eyDMI$#f zsq-)FJNl;MWJblZmdf=F0Q9B9=^ zwFQwXTVwwa2Fl@%)hh=hA@1S)!B9rkc;?coIm2OsQkg;|u(hS_jQpsoGM`6GEPlzu zv`M|86kU^i+q&^=IXZ8ufS2(0f(gHHX)@-Z^wbD~CX|M1_W-*M*=G%f!9PJCAkKK* zGvC0k>n3h}0a5oDZO2yNGi^GCV{e)AcgJ(e-o>EQE9Q)Ci+Br#6w&r>W5++E<%QoU zA&Yns+0&J_(PR4&v9q4XrQ)PJIsZT6cL)UQoA_nSZW}iZ0^d2hvVTHb@Xh}Z@jJ8e zuRu;rr!(6|ToMY9-ZILGDRJ**@$QNKZ!?r9MsmMg07cIA$#oCLt-l+IT`1vn!nEmB zPMtGpW6K2LkVj}*F-zXVA(o1l?e>THA6QGGQnLuctxdr@qO)PU`}zL^ z{HELEPZS7|ckbon=yZ1N-R63rW_2+;@ZPnm>#Q!1S%o*AGU^|Ll6I2&e??rlPr?k8oTmd3jGaiLFQZA=?WK_?F zUYZP^!+0Xgn|0Z|x80!G?b6P*DL~HP(^xOoa>elW^u`MY*T=PhBj;M={?g{sBsu-` zyPUWQ;q2$QL5~^+WkQXmkYHOBMgGYMD+|_b! zT-(*Bp^x=-te_oy@rZC*Q z)T15rd)uNn9qx#S)m#TI<-A$}$+U?PWc-GRqCK+0cV_>!)c2OT0}LZlDp^_yb5e_q z1f|OxS`UYGn}B9-y4knbxT-Ob-L85NHBlP{Gocj9rX?B-r8_g+j#AWW#{dRIIx*36 zZXo#^;~l)9$eb6S-e&|YG77rLupy4?_)mW$fz(F@PTa@a!CejyDD$|nITH_sE|eN_ z@^^dVDH+dPxkM_SR1srf!oX%Wt5QdzRFUi?GI|mtLnAinja2vwcBP!&Z-uO`fSf!) z2vk#CbxX;0+)6I~m)oc7_>RgNqPt$CRH~H{dH4F(_Ua4q#NPZWXY`wp$Ax6ny+$hG zcMy|Luj`@Q{R+D>b&QYR8RE}H(6=XlcDD%SfAMt>+?95Xnzm!xwrwXB+qRvGom6Za z728I|wq3EERBZRE=XrPc-u-=}f5cj2%z4l2JWwq+LW*-|)==-T2%7=<1w0PExyLb2 z?;~!1R-1LTD|39G<}6ySXac z+O0-|GTpDSD@IrocLi}KD3_LK#dK-YzS&Du3RV**E&1hNVH;BmFTQ;1v)NqC==}y8 zuBejJ%>y0p8?);XY$JfA(0$@K!bo3ncee^@8MQ&3Gq^Hk z6S@iYRqSp_#7h2gmywwm5vynp-G+XWMXxeER|zmd(nrphj7|zi{dFt}Qd>{@@w^q% z0?N}L?hfD^@OXJ32OF~Wwc5zeP$T%!e-$yQ4YsIfTt|cZ5r;|@^XYfNI&Ux8bv{~U zAJ9T?8jj|36`x~GCWzjL@0*IX69`ExYR6ikqZEHd^KkjPC&ZGr9w0l6CSG(WBi5;Q zcyWr|C@pO*t)pICsxF_^?76A5QXY5COTeJT<}eYDtu$A&Xwx8Tt@!36VvAQ$od=abTMJE#A`!E)Pr10!_ns6}>(O9Fd2(0_Fa=EK^J4te#x;}i({R_Eh$Nm=6eWd#^<2FrEYN@41kx6ThsJbLe8WxePv^@ z6K_&(M-H~Rh&U7O1nF}JnM_0C(YJCv+XoF5J@3If(_S`yid6p=vtQ&2zGBI~FXWN$ z1=ptSF7la83MW!PnDsn=gLFGwbRU>PmQ{I%?IMb=c}sM=MFa(!^lb0 zRsz+0>QPh?OB&{lNx-(-Vq74nwf+8TS1i1!u_b#)~#ag3WG( zVP3l0QUqlm}1cwbw!KPckoc?de?nX2)Zm|Q5Ee191z zIRsptf4JvPqWrR~+yE&rxQy6+CW*xv5JSAv+!?p2(Z*v3{<%rlZNRRVwK~)SogjP} z#UD}NWIO0^bQ99T7gbRi?t4-!D1WI($*M$PBPH&TnV`Aol*qwClAPxOn!k#q`8p~{ z&o;)P7q%Fhq7$h8Dy|Fvbv{l8tH>n$5|XQfhRRA~f3+f_o&znrFuLq6Bu*hnWVJ%! zAkd9MS5cmbJs-Lx#7++6FEswLu8vCDwZ^tj8+{*MhO!L6Y;xV3=VTh@z9|1$Woa{# z2<=!#JmIRO>gZNobStU9hec=p>W<2mDl^#9O?i7@(iWI2 z{ft*WqR_qAh1m{$Y<0`>)fa+3myMt70t@hDY^zs5$kuG<$lkUj6fXjQ^ z(E2Y_wczK<6u@TotRgnuJvI$^uu=iTy#<=7#{fBQkcSiUN7J2l*wi^Ysn-FR6j zRv~7-KovryJoU3DUK-Ty*H46N`CxzOmT3mRx7EFfu*dgcpRez9!_@V9L~5`FVOkPU zc7=9wM@|V(G}(5%qgZ&YWeV@uo|T z1$KykR6i|=hI*a84^N>bXdL6$5J0AIiXDRHS{5m*Fz-VWP)|pZx0qk0`w=16K zlB^_(fl-tA9lPA}5t%Q{*`v=pOmf^Tc53_&8ZQwkpx?P37wE}E-eIhqA^!L=*|xSg z=T*5L)~YUFn2LEXvQ$t)}~_PrB1l% zmeaEByPlbY6(a*E#>J-5I!o{YRLb-bl5+xRAoQWly7`dgmgG@}Fe<3@LDpzzhEalx zmsy;+0+&_c{UN2DKANApj!n0OmOAZ%`f3X=B7(HyRbS7w{ju<*iw^_Tm|`O7-m3vD ztxYN@%aPP}>y{1<@w}if)qf~N%tj)7)ubR1u579pOLI5{Sj_aAg2R=ib>@qhuVX5A zRAFpkaSr4T?KQC#*H7V_Auq<^;Gf)vt2OD?|DBx`*m=*!UjYBkn0IVo_ZIp}^yebT z0ZdAQnV^XulX!2)n*r}~Ajun|LY=Xb6)+K^OAs|rfbhcDyi3(6K8KiWp4krb4xLHc zXn2JrnA^wszRxT`GrGoSqEiBxiFmDz1T#+$f;^Xd9 zt$ouT-=F#_4LtT__(^C7;k48o!P9~go^IrnFovg$f4!H8$!Ab~OQ8JQ%!rFs3Y_jn z-q~OJU?%%VDaLW?l(nS5kTfqduy41=4$#cNyz)IBL9t<{*uE-CA-R=O+Q|i35y4lI zWLZrg;ni9TE~C0ZH36}B@NwX*@-y4$VT1|Lr>Q{O4-h45`3?0!a;JTRMv;pTP0yml zK&;?~X3XN`13RRpCshmG;%5KvOBBuG6azKpZd0Uf2X(ci3XqAUp%JdaP|QR`RXhhf9Q=Hx$#WZKd9@~in?nn=-z{b`@*H$N#4r^Vpm z&l8oT;SKF<+50S_!eB!o*mf2}L2)Jju_rJ7w>>EZuqW%lzQTtaVtVpgMIs#FB?h(n z?*q+loS0#2Rey>6{G@0h!^gAsCWEX7f5KpHD85StUR?awo^(QP(KD6{2Yz8E0eND^ zNQImPYHB#*Fz zMzF~kZ1T9|dD!7vq+TJ1E(=x~nn&j{A(hZAh62ceBLP&k;5RrAbv4r#(?8`PspxoX z;2TGmxthh0kWwVZ-qbrgain6$>LJl!!xM7yPsPkcY{CI%zy%^q<>r zp==i9i0fo~mtFllfU$a=Rjf~UC+kou&O31*YvUi+gCLY_y@@0mJsCX|D@S?=82UFL zv18(C-~ENC_Q@N$LI6G9ny>n4x=0h}{hdt9K6iGgWadtY(VV}Ry~*8n&KYs{D9TTJ z#~d~zJhs6#4=(V7IUe64lIX2KxKjjoWOVXS&?D+5ZTM+7;hy*hHUWK52GSv4HSl#@ zj$pc*Jo-5lK9OxvIVYX#<}y@ZmIPtT%5QP{0~`c}9`R`#0p ziiQ#vPqb5tF7OQG(JCp)Gu74gAJX-M_R$U+UOtP`zWTmp*YcMEHzA8ihx~E$tQ0K@ zfm$N;nCqJPOX7rk|H|Iy;^{C&f1B};qFpv9m(|4x4`{fQ z(QvdFm~iiBKzgCJIPTY4Ag}+oB)g{YML2y}#@POdhM5G+UE5nnN5-V1lOex%K~FML zGdl?xt-3T8wfIs_n&|RaLlB$xYu8svFDMikmsx2Gsq|$K0aTUyb;4`RTd8syI+iP{ zL*^sxxKJSxuy7es6)~RTckE$p_q( zgnV(XrZjv_HMUQasy|xw=lwD?w%>}dAAYrJ_A`f?Z3lJ%zI6SdoG~w53HW>Iroq|R z-nQC>Ji;pQ_9J+c%!Kaq+BD~}1cA6ryitG&7gI`zEG^%}BL2247{g|^PJtf(X3VW@ zQP5>`!rRdy*&g5sgR$c*9N{lDL}CWoX7B?lT&T5znln+LKf$=7Ww3T5X;H;f_;lLC zz4j-yU9~$hH>PO4JmHxiXC%rq-3!>4pQK(z6Aqf3tNBt2W?aEDP;x(JkS_#i!ZrK1 zMY}KMju&^POK(Rz$4ly-WS4<_KbcdyhEjq;Ujz*$0>=`9y+|$E^FN4rQ&_Mml7nES zQ&YU6VZ)cLg4X+ea`#W%OIbv+9#;{yD4-5ukzDDZtvo^yd~959(iu}^+j%7=IqVN% zqqNLwZ`As-BKy~%jmdxJ-x=(1ZsrwOkQ{6^dRJ(SPpG+7lKRK2sUoCw5_!$wi$}|I z(N;hTc|c9doR~OuARfLI%iL4?>j(?^D;$aJ=FnU%|*pV{3&-X=Mfbw&FPUqX`G z+e~Y^Dqij zqlg>z&rhAGV?xHs?lQehU7i=~o9S)Lqnr<(Z)Sdec(&EX4wg2@wRfdy`rDfA&UU*) z(4A_5V2 z+bd5q!j#v#s6K>UGwXLPfVY;53<~r)xmBAU?eF{S*nZi}rheWN0?xRIL%u&M+`Q*= z_|)D!&2OHa-+pWTjMDOYd0M+~K`NbQt64jZILt(iKR131{&+?K!{qyF8CoB?#?2@e zz7*7eqX+1fn>hvokBuK^kfeq{p1U3ww$ml&7v@25eR0DpiL${`&YHheihRM3VFkDxJR02ezp8h4I z^KnLbisgX|)b{>A(u4!pZ@%mx;lb}`uZhDD*Mij9FMzMNPS2_6f;7}##wK3`kEk5` zw-OyhG;KT)9^CZ`n}ic=;ajd@Np8#GzASNZ0y;k8LmJxZ@z|RPa3r>(>v=At3T-Qz zN%fgI^a{l@?g$_UULPVE5zRhHkBZ|*? znh_>NOxbKB^e6aNxY0U*{Olr_SH!t%p1)L=!GV~lY;V?7BnGGEO@V0;B<_@-1MfRW z!2u|{C6@C z24rBrS5sloNW(ltKd{sFZmQSo>L-YBD0sikY@*t&1*!JyXU@I)`?d$3m&ojrNRnuCgs+Ib;Q6E9Mf-()Zp9HtJvG|v8J_)^2nFA(^!TG{ zSz!Usc@3uNrVaYs(a=>~&x;4jRn7Pd2G8GK4f;@Xpo1)0!Ik3`mzSiQ`7+7_t@`_# zGQmO*QLq}v4Rhm$+fDQ`P2zjuI@Nj;%#shl3pX@r#6-c%(LP0W8G_#_n?ZC{GklSy11%o#_oUllSe? z=j{eulI{97ncvywr84O;gfN}E4v*t=qln)b|8j$pU`3@=_MD;UGo$sJG<`5k?^@*j zW>Bo2uWiqB_v1s?5HD+Mr3)Vz{O5@YRp$#(eV&8m%Ua{w>jZy*cZM??EO^QJnsG7H_6|n67>!aSlgurp1eXTovPuF zR4*&BQ=PS^J=N%sU35*|;XZC=EFC{(-gcYUVoIB7<1+gIIm%9X*e*LK#)2uB@J9q_?S+Sm^ffPKx z%IX6xN{^G2_4C%wmU04^SlLH)z}C1d=AuBL!@e(JK9@y4HzOg(78eWgOW`*y7g(+a zdz_2*Jq%8l-kb}6w&a)>d-%9xj?oxH##A0RrQ}8HWb79L2)m3E84$)r6ss}SJOd)n z4p=A8ke9uh{xh^z{Xau%LZY94ht{}^RxVb18~uNNuKl4?djRextH#qs3m=X(zw5sk z*0PD;JqK*86aL*;ud%-7u+=niW+5=i?54_fi96M$`1aym_5bnYnev!EnJ!Gk`I>;) zdKDd#GO3xR{~bos>4SDTlJKjn&PXa?Q$J$Ta@D27+mCjMk#W+%AmhDH;I7yAv;A>M zWJge5fez5iG`tx;BnktPm5nZYC>Udj&(8lm3Q)G!31j`%z_t)L|zru3c!o zJwyPTe;5*_q49fZF9%%nV_z{@tM0BQN@&q<>4W%O xL_0brDj(T$fT^aVY2ET3> zBmC&#r#=%1i~ER|LSM372Fv7eKLVvYTiJMBTWD*DlJ$7PXA4oOVmTUW#vj0)39eFV0cZA$8x6>?f9e ze?N)YhmGWNnO6!VScQbssACsUp-aYN8J_6e|90z@j#00obDYLf#Uavu?ZYzHRQzSB zR5Q^CY8vwV>?0k9(~NU}x0LEjk}0EP#EQD%mAq1eeJQ=TGU@L^9{GJ_tJPPHqgpGC zC9|zEOvz%QfL7JAr5iYfW@1-b8s7?KBa6f2$I8Jx_mld101GW^CrIMcd zD}e+(N)Pu2qy{eAT6Gfay^U*}v~G_^3^G4tw;Uyq{S=tQJ`ZvcZUV0%5m?=4f@Mah z-3C228kgPl2%>t}WgJEiW^`pYNy;r^Ou~S(EG`B5FOL`oJ!^a874-Y0lgM;IqNCgw zTvQqqOWDVPer1v~n*Z#upH=>AhrOP-S^r!TF0tm>#1AiYSib#tZD2TQ0w)>l$*HkH zln%8q|6eO?-3<>o*uvRl7-?HnahBqBXV;06ER3LHm4GRBss19HQUPpK$UX^{gB&&B z2I5PX&uYoAU?ThT+kDrT*(+Uvd*0NSswk^1|MVkw3U@1PmvhAS?qoWsZ(wV5srU)v z6YX62T6@8++(HWnU~&;679}Jx5(Rf{wF^;I-jt^zi>d+om>albd+*!UY5Le#r;k&R z_au`HM5Uz>*~mZ*gt;hI-=M&bE|vTrW7ac}un<&#-25M2v;t`YPrLpf;Ufgfp}y`IwnygB%M`ciVyEC?OZ z7~_{_R)^rG&KIKYEF^FLA7Ru=?yoRvTB6S__E8X~=(>Ce)rr?fMc^PDRr4zu=>Pft6gcVt zBdByJaG|1DA%=;(<~N4$0MACf1N-2vcm0B1s&sB*QAZ_jJ9X@uH)>{KCPz6i%Tq#i z(6`P^Msoyk3(CIA>hES&W~xCpGvh#ELu zUbqL%@=T@rqo+a$Q{&{;j}?~m6&7=*+g1yi;dL^&S^(I?q`ZHXnk2QH7fyVMR0`;e z#GO0n;QXYYJQv!JSHWRY%1%0)z+tA_1p3z5_dDFSF%SM$Ysp9TU{5(ojA%}qnn*}- zR_XX>YUL*w`5J5VNFY$h7e+V|2zD|?K_HS{rMxG-mnNX>qf19;w@g;9pMt1?)@Bu# zI4;7R8d1d}$a=d*&(yAxV>p8&s3jI&^?ZpSOb19DgBqz2?Fok*IQUBIB^THY(vB(~ zSbwm0y0fR_!D8&psmL4*1x8c1ddEtn>JC}`$aWp1_2cyuUFjJ}w%lgvx$HK17A}$= zCPMx!OsF@N8s-N*r}x1JR3eR`j;01>T4SOdr_vM2T2bb*?U7bG?Nr3CJTZA_BWS52 z#nS4WN;7$hp^@zNST&iZn);H9Q~_SM2N=-_@!a zIN5~dO(rSe_}pJh90KE+Ngca$DB=9M`&HEx1?qZ;HZiRkaZv6*ZWxNHWCfT-UGj!y&uqIIy=~t`i|T znbk!8*qpW=*-l=t>l#X;##_+`N;PAYmXHovphjdBcyu89vN~b|`!j_l@G>b+)(>vc zn9%f*d(N`uo$;f)+OPW%x7?{Lq7f+jQzgsP(4E-vyh} z>D9e{vgNkjgKqOdsKg}!*{M)f;`$8WSpdnO9}t6EuB8F0klQsFbsU9#P+5uO107QOq|e zzVsK}(`Ce}00=X?zDZp!PF3z1T z-!B+Toxy24V8LvVPdkc6q|hT1*`=NTVE8|45^C|45^g zTo7|4XX7;V%6c?8Wp$RyEfCc$v>J#U$_yfvGU!VOr)-YkymCKT%m>RB z0n+H}jQN;zGC&$doSqT;AJVAlEH#sCOo^*F+Yi>gHZV=V-yNMj>5|(ejGdx<9I7s7 zq(B{v`p-W{dTpd_ft~Rc+_D+iiDT_VwA2c=XvX(p7`bRqYhV<6{a12ihP#wJ``L5y zavX6m9F?S_zW0s-|BY?M}Yd5|G;_JU?7}$E77P| zfWc=>1Q0(MRC;HA+nH8~JT%_@9w@1CKVvOZfgF z>ok3onJAJuEJ@)oOuNz~+%<9S2&2FlrvpZx`}8#H(!D^oAa_pI zQ-_CI2`5RsFmAI8X!L@k#znDc7pZt&{El8H4{j2fZbw}okLz1SLnnt}PKC(gi}Xx* zBksgIx8cwNM-rL^qLHhzThf@yTOftUK;~vQ&?E6v(8v7R4}ZN;AtWL!5dHZS*%~hG z7sMudJ)+u*Q$A)PtoqOW9vEJZ=y=jedI=T2t6n_iKKcloA?@T)2uR37KCACRpn|76 zbBvr1E+|~8#br)hFQy>s(K~{H<(&&!6VM|(&`DD&kj1U#W?IZByO??;1XmQ5H65(T z7i(FTzuSY3%Wl=1?~K=j3==M?6}<9{g3Ig)bo!|4Z1ZGNVrmHNyCyAvsv{hiOFZ!k zaDQaY_P5rBSjo1lKHLyHHxt;2JF_5^bxNJ+;CyBg8Q;8G$UODvboxx@{Lt!n=aMdr zy&y{uxo%;*A(hgER{!Gk(|mm#0VMYP96!hB3uj!F4B4k_=?R&%1DsfeL92z6eG@Z4 zo}xp3^Gg(Q!8S623B{5zq{u~#Rn1A}viVN+pYt}G#Ptfi^cNmRER=jovP$rZ@fdIa z5cDnnoI|79ZWCQxB<(401_TPdoFLmDt}}ez18~i;Ft3P)i23eImoEF{dX~>CIsP5p~Cw-wC;NAQ;addQHw~ku;y?&x4iVP87$*Q%1 z-=&zI&vo}WOBsE%$osUfKk9Dcn|{!J&GW~-**v(kH=CfPKPV=eoG)xbj1=3_VQo*p zcVLzJTa+`J^~L>1F|YKN)+RwdGrC`P&vfaN*->Zw@_y+=JZ+db?$3_8doL8$pD%oE z&+}ad^l|Q*5-jTFI5^bzx^A{;R2nDy_LubPljY#GHe1BkEsagPHy$qiS~$}6ykZ<_ zA!w@!81o*)L0a}ShUUaz z_w=dsKWG_j5_4cThn>}LsULppB<~gxQG6vFKhkebwfm zS!r;^>VdYCsXW;rnoWU!+`ch%RZBULSj(K&*j&Oqm*g8vF_UN+`cijhnU6DKVwCQC zjvQr6aI=~v18s9%=BMT91N@}RKtV=Z_w{?z$646WUEIm1rGvgUt1p0`p&h>aLT*?<%^E^@FVSf3M>XG#)h zC4H<>s*QW@`(2#jdunuMzi}l$15OGsXxJ$iHUA$L3u{(2CJ21*Z6|N2&~ecG6MG`5 zl5v04QAvw>+0ei0sATFYhlwFz@VIN*$eEiwy?>*ANr`F?68D3JVgN5?x-dS^v`l2t zGkJgw`^>{t2QzjR@XP2~|FhvL5G;tq8x&&iyY!A|b__kF)CPPrG*nS6?u8R{iu$r7 z;Ob=b>2>jVw{{=Rtd;L`1l%1+r;t^X{gEj=An3Fg&+|k5cS+4PB=cuV_E`ha(8V}_ zR_L0%9o(3&oFRsUUbTqgG{1;CwsfKx0Z!O#vPsJk*%ND{6j+>N-yBEDj;y+n@gy%R z+xDnLKFp#aD_m3#A#=VGHxw3PCHu9r}m+s$S#riffhjRSNLaL58K z3M>^*LNOnyVj$T0`;j!Cl*lAHfj^8H5Lpa|2J()Rkgt){GoHxr` zU!I2$N=CSY7F=QGz`}^YI$}Pu_DeXq|CG<1tZ`7EMuXe(l%b(>54b3Ytv!39(57?I zlK;9%K7}qC46@bO?f{*Do-0IIAF@0SdRSO&*4J{wfy4bAf!kLaFTyxd8$KRaR{DU9 zbgN6G3NpD(hOmSC;P1k zgF{ZdmMs1Maz|NK%q+G0&dFzzH>nL(T%1kUkk1Yzif!UL(i8t|Q;kTF`*>22=K8mw z#^LlBhQn=x1)aiV3s@JOd5hsg-0fOt1ncjf4emH&+0%To#M$GKuLMPlMrpg(Gu5WM ztFu&2Z+IMcT5%IcWTjZbA7k9}$@3 zN2E8ipsVMd{L);Nt+0SyLW#~uUOxZNt0_k4z)12R;{KD#;~Ehbw805N3A?)3+d@Ka za-a)>{>VRW#~ai>Z%zgKc$CfdfXASFfO!#oBRn4Q@(&upK5A^kLfsr*Z!k>cp;|x_ zeRWrB86 z8R%*SR@1B&mh4d+#dSVui;8dYSD)@TyzLaECGDTFX{4F4NUwMYpZ{NZbWCpr!RnCm z0IiplW2+z3K1-Tc`+3r>Rm;{gnC(H8Gr~X}7`KE1M+oZcx#?<^u{@KmAPYpvw=LtD z?Ii9iLLr4Bu>f>|&C8q>ABJ6{OioYh^EEZ^^`_9NfLndAt%*wjvXeWJdpP2!92#RT z+eV6$!~-s`u7?viC_MWX$gjgO;j(xlgGEv*e)UPS2up~j8Y?hvxLwp?{P)Rgq&sqv zb`kxmGG;@VB_qBN{MdX64vq3uXOY6Xr$FAl$P6Pdqb3;Ng}nt}nqbbR z>7tDXVhaQG?}C;~RB1?INr^mUJJ&*!z-iSA;?8E(sxdKhM4S&Px^Pj6uY&Ua1;zXk zB;(I9k8{Q%lj7WY#PZZc?8B0Zt;52D)aHnHYVH`_M+eS7*g+-hK>IWZ$G>Hi7>>^7 zTjEAx7%1XPkPM$3#rNNO1Gy`%QBl5*B8WX@9V9a6MBl@chiEHi-$uykLdh&6Ep z=gXD2doGiTOL#79V>t!SBq|tUC~nkuQMi-Eo9r}X z%lwz?b$h99lQ~9rv36z6)v*mpgpk0AzsXAO9wagS{OQ6zUweoTWiJ*OiU3A0pGP*S zN-H~zuy7VIJXR~rFx=dd? zb3H{VZ%nZad1ZWzLobg>=5DxKIK#h{fpEU|FNZw?xKCVvx(C0s1cNN1F(K>umA*V$6*Bnk95PV5}8 zuw7W6aaSn_V2@&iJB0kTM+1}@`vLZ7o|8no z14YW46{&jxz#i@LDnLS@5h7A+_zAE_fAfd*4Bb-ekY?lDpM7&G?|3A zkv;qfY;&aS=145CN>s-ds|E2^oyo4#oW>_24t$`8u6?iN z_-=6HEgkSv>X^0AvqAhV`nMwjpJ+|Hp)U16-xIk>Ph4o=d5Z>jMr5kj&kP#j9m;w>`niLj|%lrTZk4a0q{{m-@o{1k^;JV?3mMEd^8;k zKjWS8e7IB{%emHGAkhms^83>{x3N<(R)PcNz;>SMM$9@D(JoN_4Yy}_dM7mfaInEt zunH$dJ;Q6zSNSZKm7{oy5~8Rq?R)5aDDNP%1M-&$HJcdm6TY!%N6Pr?o{`yjI%+-b9PS$Ok|kv6 z?;!C=l%%^YkZ~a7dkGgyV&$+TG#lD4GTTB!{NWMjoY#gtQNHfBj?7PCXY1ir)F|84 zPEVC%jHr*b0Pp|pkZ#>Ct-k&(;CisduA$CN=S(3rr382pn$sm{cypF0gi%!(Q|+G|bTCiy~gy6TTq24(~%FXcDev(xTB< zZa6~C+$Z3@iWJapt+o|+&=OpEQr5S@5(bmnH^N*67@}SPa2BylW_;{{ULMYTQTOeV z%kCP2MS`msUe@OSxl0sTY_7KfIt`_tq|1x*0qJRoap5-pq5ZE!S|j<7MCv{ckVqrK zPR)Av&;39qkP9ljfHpMX`oDI#FQte|QSTiE9i7ws+t!02=jf{#KKt@`ym^$u-nnMm zNBb-C;8w(nA`K~G-CBZPBZ%rgbC3k}p{yF;*@~dT09^2aHziPl z?A8e^dg%Q|Zcp&~k3d?ld3>EZm^z(rWYm%X-yG=Mu2a_q%^zyp77^S$FKKAVPo{O(RU^TOF|J9GKu{v+PYZ)T+Y>G3=0B z*5#cPKQO#1-VwRZej1^RLSLkgGA~GcYLHlThR<2S_l3a(l+a$TnW+`Y%6^(aHFbYd zG+^>`A~=nXC}5qJ`{GyQT=DlLiM4~XfT64{~F0W2BU_##xXOwYehcIzP1#>|mEEymG?lwa_h0BLeur zWqx8oqQsM`Q{Jx?_nyqSwk}Ff!%YcE^XScUK*yO%)JnlKlp0Q2tq=9Irl85S)$T5L#IFnwa}x-3JM_Cq3EsXnP`AzS1Zi!&2HGIHLsjHCw;EfEO5JR{9Hwr*Z)&|<~V-gp-| zWXZ)ZhLYO`%vd{d$fJ01e+MvN9^Y-By)R+pzKjg`T9|n2ncjj#6p5+EP9xT)77AJd zAk9zHasi#P-!)rq!dbds-;Ew+ypMiryX-zJH#?7D{8Lx9n)&GRJ^D)u(bm5VA;k>)#t6-I^~0S11aUq6n)e;j#_q}WEQS^oYA3{s~w z=`~71zyLwt_}M2SzEL8Vf?O6S3}DL)q5D)WXq@1Oy|Q@(ZRpYi|oiQbU;xj!Fgs zGssq@8OAqB1gVL|JT0IsY3a}!U4lK!lW3$8Xs5fgn3pIft2WwQvu>JaCRDREFa%m4 zl0jF?%VlQOGFh>oTgtM+=Y}uS?QJ9KuyTC&sOHCo%hOZa2aYA_?Y9vh^xw@@S08clkMdPM46rGaLtv!590L zqJ;q!UafJ2TQ|gcewX|NUF*l88Rj(eoGyRbWXOaPn}S8Ci?z1p(!vb6^J>bEBQ@5( z1@W1`r6OZbCojiO`(~NiCSH=HX%oKXu9-19%M=tm4Ev1+1s_0mzJ!mIi62n`P+se2 z<;Qsp6%l*JCtge?rjmv^j==S^G9{!E-vhH_e)3SH3n#^Xm6X#> zkhNsw{x<6rzB=ht`O`tX+(wq4&|y3I7XF3Mg$~D)IA`VS@9NmT9QpIJ_jA4~z@vw{ z(T0>n#_zzR<|PfCBQIMtu6sW37kTyyR3Lm^INtTJBxSCGkAPD3AvlRV60WYlr3AR` z)bZRGb!@?AYRaa6hgRUCaLpG|jT}j+gQbj%n;l5wj;Mt_<8W?Z>y7{?w)b>_gLnM@ zPn0YGnt9tgDn-CBMdx*KH5O(l)QAqB;jO?6LoLHhfPYhmbGqas{>~cHnB+||L$rr##}F6h`r09IcjTJwu~q<5U%0r9XHYIK7p$M(fXn)*|{*8m`L6HvGPEjp9J! z*m1S-PIJHx{p0GrL~n`VNtUy~4Y|#%a}`|rQU%Lb-prYF4JE6?T3E+_VnZ~~bGv!S z%==;Y2eY2s_tYrsneVLy5t0SVYzU+WtoV#hBjg9n`>ztbg|{npI`}!|2D{*6MKA9N zbUrYeuPh$)Vx7~c520btI!2-^HXEAH`n>zr&X+VUSUfu>mCzevNvLgC@!g07-C2Pq z(n4;L9Ylpc*k8(y+^MRdSF#Gpua!)Y8TyDLs^C*JxdN0vh0oxS#v2h;ej?OGRq(f( zZ2#Y1IVT@D;4cs{4#bYH4A2!bT|-}k;mxYPTU?dM(wW1Xd3$;BR(tvM_cudeeHX8d zs0Y;h7+VWPJG{>yVP*as;^wSWWdg|OQ*$0%$*RB)it&?q&@J5?-cUL%1X8ZU|R_p#5}oR1vq*uJ93y@cCoI;-ge znX!K9itTYeP~y>p3C4?s1t^m$zF13f=kB6?A*x+fuwh*&W;W#JAJofJiaNr<8gHjU z7mctk!GmJ60?yt>k8tPF08S0!V)?~+;-3s2ex-5~lXp(Yp$I?`EB|nkJ^{I^kW?V) zlS;tnRkC*r+Q3M0wRh1~mB!J-)Y0-neS7C@uD$95b7qyHCmb3&0zq&#f!?q>vs3Y) zwxTC1zLNGd8cyoxET(h6rtN84NObu|ZFUf?_xfE=-s;PrO0hsL3+LB)co)-Gl#%aE zw8OV+^p$o+KUeN(2v9+`CXg(SkaX`DI$|!<4wSArQA+EpwcxLd)8HVC9pxq3qg+yB z7>E?U0$~_Tii0|M*UoA*#RKMDCGqFL^ihhz(SSc=sglufx6@GxyRzA&$4*;s5m9Sq z$Qs_LqPon%I%Dw1mdd$vks&lMc+g#bSSIBjE1)Z3{-}BTn7lww3m6!=sMf;VO*1jV z^20rGrcYvT+<3nRqd`&qW)GAL6I5U0N8d!tRl}q&i2oO1_Y@vm+ogdzHdk!hwr$(i zif!A;itS{@wr$(C?Va!cx_kG&`e4^J&+DXW&Z>8e=ebjm4~$QS=i#ux1$RbTh!=(- z=MEZXiYfLJiDW1bp8!_)kLxv24K3SAG;&2`j(754h_%`6p33=rg80+#Y!uK(dbDX%C&p={<#8--m-mrkHqGp>Utke!}xIya*skUlUt7V2w z;afV+{T8qpA%{E;eCGT}Ett(QGPm85RKuT|BIx)M9nZZ7q!?DGg&xI!iFsv_O+blX z;chU2$&TVQEiKF(np5U<3I-D}?)7X*39-DgUMMe&H%MJGc2yON>wETJ=6LJ%+t#7T zs-l{U&FjJbC+1e=HT+-dIK#>YEhZFtPf{caRd9+1%ca2076#=Zb7`OUu~Iro2y7gq#}*uPfiyv9$J|B0`?L?24a48w5KUfgsFU!_w~TSK3-6 zj>EoEJ2tLxzI=qf#t=1h3(FooJ_kd8^d1ji z4lD;vK0H^jXehSIDG%m-kt$*m{O}Y-*BQzh!{M4mPHAz)&LOu7jYe{PCtCz0z=I!M zcq;M>ktQO%GeE4KcvaiLC2j-bc2F_%j2v%C#ns&<8*B-MTM*k>uk=#lr;20~>S%`n zZ`{}yvD!^p(_K3cZC12XBM*oC7$y; z3QElH9GcCMFJntDvf8Z@nwH*V-nHM*SAbvAZ?lBi)i;`M_Q=Mm|A5?teSHC6EiJBG z(OMO)Rq+&uA(V*17Csi)-C<<`+=B`R@|u^Iwg#W&_?l_NiSp26Q9?t`^f0;GUw%=9 zzy^8lKnl2KL*-sKkcWT^_*-a&82cR@SX>}tb1L+wXw5;V+C#%y$j*%dlI+^o`W1H) zt(3Y|pUn}*X2_=HH*W~FoW?W8-={|#{13#)5PIv*cix-n7&Im~l83S|9j_r$*_4Z6 z_TF8Zvtu*Jaq*?3?VRg|i}@qz;bWC?)=aJ83oaC8DILNBU)jKrdNKD}`hYOGeUd07Rq%Hz|?ap()IU?&FG1xeu z_*zVrjJ3$@;o&`$CGJ{V_0BBzutYw-;u>T_zZEi$WiAjp4YM;qZqvwwYR+=K_KO8j z=upKgxs?P|Um&GiMY%q|UA}mPt5LmENx{EuleBC-LwFK6^=DED5tW93!83#xsf7a~FbQLC0P0@?1swBtu@{yK zLZTfsm6luXq<$LQt?GT677G4x8n~feEuzA|!noE4I6y+Oud+_ykrSFbA#?B|4rba` zM#ig@hykQWAv)luu%j%_H-R=P`%k_T(^z%uB5(|LhCG>h5uX?}al24BROWCqZbIU6 z<8^0%60jCilGM zmFi4BnhOr4$fiOzpJ;{Ql})ra;^2=krDXBOnZmiHOPo>`Gk`Sc0|KlPLgQsdRZrgz z+%U6thqd=F2j)dmPO~D{RTd-ffP!r+X=$q3mBA5r%;o{GpLTOJg)4kb!sKuY6CxDS z#_X} zscfHbtd0f#!u`*_(DY6p6~vXfoXk>jcarGfYbB!MqE{rED;n0NPRB1(s}$EB+;S>F z8=Z_1zUvO9>n2m>F>*}Qlf>9d=!BfcdM}0i-s-%~$T?;(fv~l;hH8DDl`~Ot+F>&y z=)Yu$GQ@js61(u78Ccp%fRaMZ+WnM5Yu}4%=q@8X$y<)R-8)F}VR?t)=c%Z)lN%bD ztr2AjJkC%e^f)2f(!?dqKV{Cdc;8W__(=BQ;hAI|R|@u8m1_4VsugnAC#vOY=V1rd z?V_B+aF^_ATKg*o{PSn$SV^FiyY3{Zhi$lIj5{D<{)b?o!fe2eyXi^vo7fv6R+_7d z1Sp6$!SZKBBq|Zr4r&~@G6Awh_}o?{D~qh%A(lGc@4S*aMFFfylAlmCpf)Qi#Km^I z!b&~}8HHmySSL`=H4crn&{mayhQ{#Ew(qUV=yj7^oI6f5&+{KYw9H^`(a!%9GlF~$ zkW1r?F>tg}@+a^;O_mZ@yK7D)H-LDC`JCCFR2b;&+;3+-V*IJ`a;-j%$zU#^2VFNHE6Si%m zH7KF5Bt*_DWay*z8;BGfy?h!Vs(J%GV*!SW^(QmiKO>(DK*c(btLu%`WJklc93b<% zUf`W22T~JwynCjSZS_gcqM6|W!(2q^_csU>f+l{`n2T@v(?`Wu(4OW2+q}BsRmQBg zB5g}V*nMO8pLcz2RoGQwOiQ-kh@;O93`8N0jIh>00V(X{8`+9r5I9 z3RI!+*7$1%*`QhO@XHpwdiq6a@kmN==I!N%KNfu<-d3B}19P zRc_>C8A_E~0Ws1*dc_1uiKeGvlqt_`LH;@kSOHNc&0+cd_5L%-cPMek@WqU#BExqW ze#7>0AW1bqc=G$M-P^LMORHgZb_2&VJEWkM3QWMxhy677)*gs50!z;7__{XEm5z2@ zE+#R~wRZvHUnTni=Dvry%ym5Zq&Qo5Av^5(cmC`x-`n`pW>)*t?`QZbTK6}3;e?w> zie6EYh%xkfx7(vgaEHZ+AOW;QMzFFXimw}+u2hb~Uj5K0fv14<%q5>!P-$6-B}7ZY z2tovA7l_?#`X@NhSG zHeE_Juf7$4+=Kp@l#nbi8ysf<%Y-XCh8nH+k6E)4VyUDS2N6>}^+(IbTkTThKCHR+ zi>hDM4tFhGDVs+SRr)C13*-3rZNh-vZ1z9C8D(*JXah>27ptV_;2p$QtFUxa2s=cd z=d07fQRqcsk6=vC+g`X$@?-zBao`mP6g~ zqA+no6AkB>Z^DMcidICX{2V-%?jK~qgkoulw3ddvOO;mhVojz5Jozw-Q`Q9!LsbT9 zV$y-UPhiyx3BXOlrLrh#h_a2IjIACF*Ly!59;sZ$p9J@v`I4EwMJ1q)#D8G*FVJ*( z!(pdvFX8yS_kVPzP|#6pqAL(ujA9TyXti}$-Wl2qY)p#j-{pVJgSpLh+i((g0UeYL zEYkKg%{UosB`A)W%b~G(8^7cORp?UEh{p&!6^{$i0cxB!&=u)Y>1s44yzQ zQA+B6`NgfCA{CG(GKJ+R$-o!=M{BC9yWo}U5@BXK%O#(+gOf|XBEc4^*q8h#kpW=w$jTf<>9dLwU-42Wfnsy-hD%~NE{Z0mNje)io=S- z(+RU-nwqm>Jo48;Y&G7P=Ircf`PoN*%EG%$Bm)*F^}s^RKNX_2I+kUc0#@6W3&(dx zP3LT62j7x>x;J}()eZd<6H*d8*0x{AWC^?9+SX&*LUF+S(7;!fCrI3%!5yk>L(vfc zji%j1NPyZ97n&VSbXkO#=dFyQoC*o}h}>On_@C4Vl&Ch9pklo{wyl;HPnMzsymJqjY0 zZGk;kqY&zoAkM#Z6hZQdmyZZ6wD8ng>8mRcQJGkrT%)9YLuPFPkS;$WO#+32FP-3Q zhbo-xr^RUKOCh*k4?}bQd`P%B_Arkce?Xe$ggS!;1WJ(CV=lvxhHTXgh^%JmAH%n< z3ZO}uxYO}t$UXW{?p_^+eA^)a!HxIp_9HkTw=tZzR4$Od7)rCO0zR5#xKVCA-!JxJ zR4~%$i)4eW+(A3s5u>dU^nHck`-doQep`T?*O6YI3R8Ru*p)=n8KBu+lDPoo{UcyX zo1YEaBPiH1;g~U~6f)W6sjo^GM<7T2KWg)YD8h#U7~z0LAm4ty6;%bLQ9AP$Q&#V17YY6QY(G<#&t4_BT|{-3{ctb0ln`e1boUl^*MQ#j3uj*7>-OM#BCdMa! z#@x0Txj>iukKdhPOjx=UlK1?YrTjq6RxM*>;A<&+gUC}?!w+o*CK|-%os*snHPBbF z4(Vacb>ZpTntLvv1!J*In>T4uqWd7>YIqZ!&HJWZWcx|sD^y}8J@Icu{--G94yrkL zHS{M&+0l`vt4Z-X$dx)o^`mN-@ulmfdqrp!(ifS!)z4%{l(;>^JIBQK#NY zZJgeKBl+J3RnX{it?gy|yt0c`_lL{v1e2k$(`c8y6)2-ib9Xm}foC~V3>57CdXh?+ z2j2%=(RFInPN3=k2B`|2GJ~L9!DOe2=r>3d2FNpvbN&mY62dF`|L3SS&#Tese1rcL zsV<&>jz9&(NU!$mk(9yAbgLtumj~a^_dd7%-T2ff06D(deo3bP_Dy?BC@A;?wOqBO zo9+(C4N%Z0RbAc7CK`r=3j;HTW;)H(d5Zm_v%>V^<;JhUy;pOTa3~}y^_S$`l!GK%Lr>!gNG03fSNf>c{ugO7J@%7pVFE6X;9Q~Z$kg@XxX(y1 zJr~vB`h==lR0esxjBQ*-YAkUy4xbyJMb1!QjZCdo_jvd-<$SrzTzI`lkUZiFpRE)tCmn~iHW#~Ac4Q2fY#dlw)FY<0^Ggkc;e+^RI5_b|(>onkoYIWrT`| z5Wf(tycAIO^stKfc4qXT$bY*^;6&p}Qp@VSuRx8j6-2o3ZCdTIY+3Oy>-km^=?JVb zjv`I3X=O48qG(VkXeM;?rS$=xt4=^NskX;RZoq2Ukpigdu6>0R8YFf)Xi-?7U$^H$ z)=_|jG$M3h8>BoH^SY`+CR zr*EGB8gjV$VgCsVSFvF_sQk9BFhIO)o8?mTS&Z6htTXhIxz z6K`VFQ$$<<`A0ckX9&bYkqu4fho|+MR@^PZ7IMy0<7wI;nd4JsJzxrE*?VRgg$Wo7B*ESj$LZvE*A#bqqQ{O53BKbNg+YB%nS ziH=8?!>Kz}{;BxwV2~y~Q!5L@hY}|MB0dJDP9BSJz|cjR5!HT6ntWRG__xKY5GKOG zkJI!;^RLr1@f+gDX)+9S`!A=dF8Uv*$-;+9Gqt;u`*v_|Lv8AW|BG%v=%G~@p^n*J zb_4f@=+H-6n$}vpRh;O3Ve}9Q{s%W>DtqfpK@JFmLWJX`mm-i&XoJ|!c#@lmK4sf2 zM5r^e?NQp*QyQMr1|7TB{QP;7HYgBb_2%NsNF`-pgSjvO_n&aZ?cqe<#pdAqyz*?# z%gxx{p=Si>+g=gb(R)8Zg~0OS@r{`|oy!~@Ixa3{PGwoP!is2183OkQ zS-SK;Qbj!ad-*@eQc3R*vSj`bvb0ntv|U4Z*!vH%)cd*yJ+%ya_=7C{sZPr6INiG$ z0^R-}WJ$y7KgiN<-#^Hbd5?kh53*GL53*!d_7Aeeo%tVR$>;$~ng4%~r6CBIkRN2J zMI>+rVNV7@g8=DkQd`f>2eWU%PLUr+nSQtPDXdRDGp(kCPKd;iM^Go1K%ghOFYnr5<=zMkFpfI_oFPabNwhwv8_MK5~&yw67NC{f#QK;^JCP7Y;ocyVb5O> z-M@`N_F{DN0MxU?LEvfW47-VMM1y+Mhr38EG&d`_gw>5obD7TS4Q`!j`U(HpruyFD z1N*`{lq}n$si}eW#fZAC@D5xhhER=D&F2=cajRU0xY^<-H|r`zHXe^zp_IHISvO?N zsTle|#erc-kzp)&`N2_=DSdE>#hapiRO)Z;KsBw>S%)+VT0eFizb;!ZOXS8aJL7%W zR-T|wCR@`P<{@N#5A(D!hEwYD#&HB-p=|;aszI;EE%qwW6w{lW9x)w+V2qtwr01jQ z!K2ackulijP%u;`Vj&A`=`;jxz#Oa6{V7!D?Dbscir`t!bihC@JZm{eK-~kw8xXu_ z9JL2X?P5kz~P2ioHIP#bV%csOEG_!Dt{_|GZQiS&aoSGyV^5(&9=g3BD}$vDe> zn?#A9fYFF6Y>*N*NMr+DoAA7RCe+cFy$PbT;xSG=o&<4v z#2{C?eRVwoK*N2Pbp#urti)hnDV^f7ytj1PYR}SgKTVeAh3p2~XZ}P#X+&2wS2Yzi zHI;d)Z9=bW^n}glfZVY)>8TNE|GMNHM#f+=g(c=TyWe~O5@M)W4t^DWM+$r3zL|Lz9RtAY(wCdvktm=`RwnKe&RU z)eXnsQBTs*;ApFJI(X$Ze4l*m?ez|tR3jD|hjXl^(yW(N*YGPYd0Y5fB*n-RvmKde zgj@DEv--Q1Tn!O32=gN$AzDY3p!0ZZN+^A=S8UI zW95>plMBt`xVykLm%HtAe+s^NRD*od4oCoenLsel_3M6%D+jcU23^-95qNOP_uJ}k z`Gxj(q3`KtC_L`NR$H*b$K{1s=eRB3*Jp|^*Lx22E5x`h-0rV+yq+Wt;??qeI9mUZ zZ*e#?`kC;=;n*y7N*taRJM&XJi_@FYz{t>8JRXE>P9Z&>HSq1Z>T+w8jDC+TG4rwV_HCp802RwkZ&HEGb^bh}?%j0w+3}0lM-x83L&d5HcD$nS}S#9%H zmuyjtE6}+hR~h@0oSY{!2D!)HbTm~E@){HYTd^zel;Y2s8$mgb~Q(Fj)qXG>J<%3n?WC$0K&)V~5}W73>D zBgdV~1A)LOlvd?RRO)XmzGJ}8yIO9PN*`4(afek zyk}?(-y&&uPOXXzB^8pb+Nn*dk!gku)|$m>lhu>RGE92nW_5>7U{O z^#^=^1}}GOC-ae9Hft-64vKa3u2)W8rD?M!rw$xXA0D971Zjh@I;TU~{WM;vH7&Kk z#*A*egLBQud6+UrCu9vwh;?8_YF2c;S1FrKXYayai#|iu;>1C_)%hEH2%(Q79xaV* z%Pp2RSxPkO)30tnc6|K#TemEneRmfK6duABzZW+X{2Ngm$@O;w>#y)Mo)ymtydJFYJAA?kw83X$u(VPM zxC;eR7F{1gK%IxTOPvyrw|zp62;7O>@x=ie%!8Ka`LVfppOmX8#AzX>kDR#sHO5vkZ+)myv1n`UhA zZ@a0uPnAu9d_*`7EAmt(2FUjZh&SUPV=BeMG@Cc_hh&``{s_+OK4hpxLTN$~=I{QNLr#h%Nbzb57&$bRWur>xcjU+>)b@GYEk5?8DY<4j0;SeB+Z8o(%$ zha{-u64=R!3lc1JKj3;%9r$i_Il;J^#!obg`FeGJPjXxZyZ}~~WGMWS(-9k$pCbf@ zgok>)o#m1s4gIA#f9}g$7v$2k?fLN7&DZhl^8L2_SzcYBTF+(SJf>Ozx_sp2*6e%f zn7&i!db&4t7#Z8uImgu=eLDMnLo1UK0N&>6)9(82$zM`1hQ8Ok+qC+=?AGaDSk>*_ z>iRXF1;1rH`h}8+R`rp`Htu_KG_@(`xz6_I5tiy5bZ3{OJ7ML*-^}h79jte13?g{f z4!KF$&GE>#>;#d*71^T$5SF2Q+^r#x!(#6$)VIc+Nu4OV$lq$ApS_0QvXMT})yTb~ zQ#GR1X=+Sva9KCBj7G2o2Mo5+z6iF%*58-$lz;ZIT8C$*GJ~bPqvj!Funy8qmeVa~ zgGXJfkrL&D=@boy=lEfdp<&x%C3m{7E8g~Z1S_jbWjYJfpr)iRD^99w5o~?ikbG~& zZZ;Vg!6R3^c&3xNK12uE8WYbqn5{J006sTi@(!}u5!p-@Otw*$Nyl~HIG{67td3p% zbXAy1y}+uS1Tz)mJHvzEqjX77E8$k-Tswcqxk>oBc|nBPbw7B{Bi{!_zDt5Cw@WZX z`(6reK2uiI?Cm^YTrrt~Sjb(LUc6iGwLLhWnVmyjpp*-~Ge5a2LPA%M@+3+L4E0}9 zhd~RXoTTG3hT~jHJU+VW1xRm@StK&Yux4w=^=L5rWig%d!6I0LmWTD=D%*3DE`iZF zH7NT(WUC4EOw0(cM;JjIuJ`tDM(_hSe@5^}+B&vv`C?m7S_-(uKo-#F=LOvG$&yK81vIVZ4qUmR z!U4_qg!qPtP7)WOs)=W{56SmqHVOa<`oEHp(z7HGyje>uzv?=Fjj2)c&wmB+Yk(*d znH@4YUakfKOH^o>aAjt9w4he0c{yD?6~{*t)b z%+@#H_jsXU(Uy>Wi3{VVqh*lEq!X9tipTkz>dqX_NQJ50GXUh|=rix6(ocp>gh!dL zi`-NN*uozU-vA@tF@ec?-be`)ieRy-fm#CxfTWtwn5+n&`nR#Xs4&5?)U$>Wfb0?H z&WN~{+dQRGkLR zsZ>A@JQw!k^)FmD@J_DNHSfp&B5Zcpp`FUGEvUTSjGHoWA{sY?n@yxUQ%HU&bP#}V z%`~DuB1pNjp*}!NxgCjA2YxXg3&i}UtLC-4#O&zFDeFSWNF;An`CjvVxa!^yj@)R2 z-+ncA{qFeyfBw9S?TmcZnB%7)!`I`~{Y&OD)-OshYEu$7<{5t;5+E~~fPrxkjcLYF z@H4WZQHqWyJ3Fma$jK>3f`J?L93}spuC&t~R_}PZ&K$R_+lwBs`9xM?NVA6dHE7(1 zxKD7iALejcnj&<6(+j)T!6XCFG?tCG%33zNwTk)pDKg|s{Jz=!ND&^ZZEIH+Y;NWj@y*2r`0Vq zHARXmgbV@;4ZGURMS=<+4G)=+peCqREcH7Q80WMi*o;h-Gzi*2o8>q(m!tkZ39c4a zpzV;S*AGm(Hc3!J{1;y^HAUyM^IumHhim6<3k>B#mBpQ1HZ1X{`RN+hZ;!pU+}c&0 znneP!>Zoe#C{z%9c?1Vqrt_pzNjqNQ8ds>IpkLysZVryYyUB~!l{Muz7~ zAEt609WD~?ZB{4G0y@yYw4N17L!RQAOjmJ8C&4d#dxGZ%CucH1+KC1WFp z>XlYY^!wSG=iVaz!^$0VB9hVW7@l6PE_UN@M}=>2qdM`PUe+Cd0Av~el6r7(0qm9GdQifmk@OJ@Y3 zD~rZBsmiSX*802NcDqa8O_8~6lFQ?%pN{}+h5xE}le`tS{P%mBtxiJzG(Up84jrgc zc!h)Z!T|MF!Ha_?t1TIvmJV>54776=>xsV2s>|9{^^8xwEN^@Repx;a6*&KUW!_hA>)n|qLq?NgnPiu?I*qhR z$P;+9#`XYz>`ZFd$Q<=tsaZjf$ftn_L_NYDpF6Ux;(65|(4P@X&HC|D%cbS@qqg8B z0qi-crV7)^p*dtxSSckqt;MQU8o7WbhSqfMEBrY~xyK%YP3St_Fml^n9d-;qvKUKs z?HaR%$A)$MDbI&bg$_vO)Yw}8WPFTT|F5DLU#X6`7Cclxox@_7go6F^ zM*8NF#cLGOPHN0j!^L*{^b6hbYF1Wp_;=l4L%hntD!a?Wfc_ExDzcPU2`w^{h%N+_ z)x3Jtt`5}%A-t*wX{9_1?MM~`+D=!L{SGhA;KoKF7F!oiNU>oB$yzbi-1{I@P16(e ztXd|MvkWMdlfxh|75#YS9A?ea>!;Kz8d~z8K%cGJ9^*$$Uw!m;ZcP<&YeeE6xq%O$ zJEi+rACbmoT!?Rq6%-^~YZpU8GSU-`A*JkhS*SeVadz!;xwQ(RcOT5idSMuMp9cGe z0dDJ36*`Ao=4HG>Bpc2>v2u@__b~zUfqn^rTGRV7Z|F`39`C06dHWXEHIWAyMQaQ) z&=QRM=YH+@-&y;@#8T)B2aOYzF(-i`Qmza6pmYA0!-(FT0FWdw<&@$1C$V`ksQH+3 zA=|q}xJqsub`jr(8@fF|`LzwhurU1RoQU+^3MWWE|1%Qp)FUMvFCNX2;>_`B1f6br zqlu9@^R4Yr(J^Av!ZeZc7^O#=_7G+HsP=xCXF zDk{Y6=eN)AzwfqScS|F=R!77;^Va37unXdh+p_bc#a&EMHWS4HR;eSCrH!F(`-8r} zYU+a%!$yl6w?O*8P*-Bc;a0jz5p5l>Q875MqMB2E;{G(4+U!l*wL=aEjljvm^83QT z^flPmX^5Evl=Tq(a8`OU#|H=-f&Z z7RpV^9AHrBGQu0PFr&X2=dH?dyfVZuI1mdU9Ycm5##&)to><>F+h3*hqd*E{N>T2y zHMB`f8M5tpcBydDqJDA$kK$1dhUjIq2|1iCozW}c6R2*g;p0HM2q?hQZ1$8ppXPAq z%bwvn^oQ9tO&v0b#edvVItHW}++W{L@$-Ds`#A|lIF|$jok-tX8mnXe7Wof#gGS7} z7@APLRFI@X7?I6fh0Ir}&in-hgUa{m`qj))JzD^_E+f3~<0-sXG4>I-XTKQ2qnf~U3X*|gK}b<0R;RHy#Ys2Co%0Q{y{PNCMw&-kl7%7BHS(arxf z!;I!5d$TA~8|ngF6}W>8H(|1j$-xIr4O=UY7S>CWlf~dLA7NFycPd@z|9gGpQ_|g) zo7ml&{x@`3oIIf+(1|=toF`H*Kl6}^2*=1NdX1rnO*kHd1Sa;6dGwo8Z-Hvis9%=Q zH;}wTj zA1F_nR;xpeRoE;%9Mp!Q8d>9obLMmPW{fkE z^A_wIt$GlcRldv_%3_2N4`UHy~}(C4e? z?e3S8k65neE{~U;_MB$A$1gXy$!2;^=A2LW_q&1|UN5Kfv;KFqH{Z{p_V~*)Pi|2_ z=j)Ge-Y%bZ-_Q3gExm}(slu5lGF9E z?Uy>&OQe>DfP*})hdFLZ+yI(*l{7ENR9ZtP?BPGqa;$Vpn@=(=#QSSxope?FL^tY* z_q=U6*Vfx!Ubk6a(NbK?v`>5;?=REWp{FS_d0QoT&s%Nb?Q`vpdA|3C*L%TV&CRY; z-rMgllIoG}5B+MN&3wV!&TT%AbUEH1vl=Zu`&(F@T+0NlJB{C4t*@Lug4J6PZ1^dH z@Ymj3-L0Q5kG7tCUlNh%@?DNqzWXsbQ8$z@R}`QnTqsq9+%jVA@l=^1G3<(Cg1Vk! zY9sS%j?xsUhKu_S7wv848%Q#U5fVKBV7Q5*=i^?`KM)uMay}3LY~0K^ zyLzcgudVWoK$k@A%-MtQ62A-}%g>QkP3~rmrl}vkTX9IdatMI3N71rZ4ZyEOuy3dM z*^ttuT?XZA=f)+d(Wb`a9Wvp&B>J!AAqCi{r=Jtb6)jfh2Hnei{Q++Dr=3Ro)pbPP z^m$>?q4`?)zQDHUFvkt9YSa$t0ZG=-zr@_9p;b2LtE$K5eC;7E@JaDwac3o3$0^HNHx@qf#Q`4*$U#5tNl z77!pP%fo>(#GLOM2WxoyjRAmVHPi(ZsC~J4shs@$7q+^O^s>6up(j zHY)}87VYmtm4y9vkR$+A0);Jw4T8qWx6_gVr#Q<~F=Uw3?z>O>xZw7(bHBBCjhEw# z6`v)YD;|g$&MzgLo8#lC>K`GmkLnA0zMk(7cg;~FC~f|7EEvMgnOoMJv%V zzQ5J8>7~zcrynhVr7;&nj5}@+3rUAWAIFamD2Pws9~Vggi3WrIq36O=_etSv(B9T& z)>eYYR0sQfMD6&CD7vk@z5Jio-}2I;0)Qd7cVS3L3rDNkOC$fwAyUncx4yNd2_z0X zc_X9aISPl&G(NKY*WztEGV^sA2peY@&iMZP9g+|BmsJ1c?EG}i12W(6!Nl=J@vD299GtJO({ zVt1+!<)oa7B}D0(xNS_yln;V-$xF`n61tz=~Um(wU`iu=>* zc*Wyb@eI5P}u3Tb$Z)UhkKr zg1zaeFG0K)3}l4`vA5q_32-G(;o?q7EP+~YErT=(d>U(U)pV?nDb7mZR8gp7a~EvP zH;_B4O%x(F*6>0rCzO>RsS!licZ=htrQOrkM!Xs1h^LF!Q=c3?T+jAL%5gw#D8t93 zbWDN?E8FMk!~!xXUdAzb)-a>-wM=Xn$(@d9(V?5bqLN7B3%?y#0H|atbLTXcURI~W z;udd|WS-`5(BO}=#8bh#M{j_^PYG?QRydM)Vsy3TdO%VU5=5#fofx60ZOm@ZO6e!2 zn};brwB{nlbm8-hjgUN@%bU36&W~0iUFTp$NW1XVvs3)9*D9Rj--kxzY)8$}*I?Y(4}(&ui{#|!bd%_HuMvA?sil=_h}SkN@5Ne0`QL2 z1_{fzVN641CF>M0gG5*y9A)Q_6-r=UMak)#`BBVKbp@+1mu=8K1s{!a1-6JvRgt#f zk}#a!)fS>AUg1VEQ!FXePuJpxy&k(&!_ck;ofT&za3H&_*#Z9rBbx$y<|8Q}=uiKA z<&eK#816kFrmh`^=aRFX;Q0{2&P1({IV{9i0=UM5ZGRat`oXB2&mu3 z>F8-Dnaj8Nc|5*LmzUDEliZX+oALG*#7tH$=Fnz#g4lApNz>Ze9p!V4z%aUs1eRij_jHbT)cumt zMd!iCZ9mN5`8X;_C9Hhw6_&4DAYjO!JAKl5$^XZat&_1clcAKEzw#l_ov?=euq{+k z?EriPD*X2S4;;;c)r^Z8*AtSCb_8Tb>yC@j0Nkb>n$gftx-pW$*wJ_YN@i8Pp2~=| zy()VsGg<`AV&>|VE486^ADY$jo(s}qVl9Blb2;R4rnA#(zG>&K%Wj_)?-0H9@eDn;em46Jv`l`>*<%%W28DCov0qW==6Qc-?{}iu@ zzTICv8P)Q-n-2@8m|f(eMNUy39Yt7e5EsMg$K;ifba--MOi1z@Cpi6{*23JbF@R>l_%fqtOho(^|fx#)eV&%MAtCP0-*&CZ@1Pb z4s;ME?qS4}H*IV}1n6;El38DgTb|v%(i{e525Ok0WXcK`SS^^9+E|twZ6caeE3zbi z+k`Vqx4E>hywwZLzThmAw$j%L z7%=DXA+8h}2tE)E8qckgME88FfmF*Z1oZ664mT&- zzR1{J*qNQ+Dy3#{su!7fIqoh=nF2MQxA%}Qy8lmIcM(-LxU~TocZcHc?%Lw+?p7QQ zaBy#NhoZ$fP~6?!-MzTG6^d)`Y408UlRsJO%U&zVJIf?H*^iMho?;d%yr_QY84`AJ zuzB;Wn9Nq0Wu`o)`E&`Iw@=s2;!mt+V?b0hj4ad<-Sp5F)PW>qdnavLL zm1|wT{V5kj%?=EeTMZ6tO{i<4l1S0L->*%Y$11d_6h{{_as?|yh@({{Nd0g=3E`C9 zGRSFnw1OK6J$g+=HH^(Jg@oj;L`EO?`#&u{&&Z6f$RRX784Jk2Fcn2CiaB|%By z-b$4jsz(uE0Y)0hce`yB~7ZSG&C zO*uCast6N1hbii4sgY@;8^P_@{%=^^e)1Jxc(Tj&Ln(WmLSFIUJ_Hjc;*mpqkKmta z&-K#}if#SU2iiD9QKp?q%Jq?`jED0+KGg2KR@2t;!uMnJEwDhv+C$vSHhz$ihX`}% z3ws=^4eaW5A!?LaXsQ*G&}J}2dI%JO1~qR`WaMy6pd83zrNOS$^3~J>u#7E&c=f?Q<6$4Cs5H}@=^(7kjgwm2O}gI41H-ssbV6;iMWxY9%Zy|}XJ2~K z46{{MF{#Rtw)NweJ;~M^)(>(iL7MmLR)szgos1ygFN_uf9P|ZOmmm4PbPXu;M}`;#WcsycS$gfHU1&cyd&_>2T~fzAh=F7( z;$1vL4L?Np7`OT|!#Pfsz4gBvZz>l(3>%@yB5Ta-a~GMy6n{o^Q%UzMTw z@}^dOQe*PE7nQF3*0?)cSIf;Jk_H&pbGfioXbY{o_Cip(bjm~ivm)RNCITiyMj>uV zths_3GKsnyG%UIBSIW{RFC@Mu(PJKgD24(_X&N&M(KyJDp=ye}qV-neXkjuU|THT)1ocA_o@=G?bMKk=5{PP;6KrRaRTC)K^vkM`9Fh)eG$e zv8{eynIK~K;p|0F*El#yEot4Q6{dye`+_~64cfT&2HjcMfnyPJC@!oU-&x4nLt^NP zLm9DZQG(cNyBJ(JvBVpFTq&X1=kZ}4#~ah?9SDic07W~!zOqJs9w0-C6tzOs?-S|g zNoq@K2udtz%Lm!cKCCb0K>!ri$>1W$#fz$vFYOaZ-s${O3X~G^7=1LNjbEOSq+5 z=yGemBGB8lyfuwiw_Cc~s*WkYDJB z-U^pz6*(MBC(%&RepZzH_&S3!fVMKgq?1^=KVd^kg=`i2F1rc7)j}-WFD^Cwvud{IvaKV z0LqvvEM-jV; z2Ia2hS(c%F89e)+EjgW#P&D!ZSsbW#(jD=T!|Lk`Z+8u=MU$hbL-TvoKPLr#dGkyc zmwUTf`WNET6=;XnScO^0<0U1Q9p}dnlX0f}Q4jqOX0~JpwR@Z(0hTCs}8;>;< zh$=Vj{Z9@ zLnMpNdIYmG-CmZc+s;aUvVF{vpE!SvPuYs3p6hPy3WO7F^{{=<%&(WZgwx3Ht<-q> zDxANYX}_>va&}Q1o$C!>OK2fO_`kDM$!J+1 z^-CAR$}Z}>*wRQzAKLsvrx*!aOWf%WBOtAC4YFJ@{#MfOob>7+mmT28{oLfZ(SfM6pp7* z6>6LcRp`MuAz>a+!gRouKU(G;(nOdvvnnXnGr5{k*iV2$E=btuGMLg=S4>^L2l>9a z+lvg7L{XQvMeOYMDbQRCvJe}}ON)`rzkA;9ePe7=XYa}E=_KxK%@btec*lti0^X+O z!Q$&>FGWhV7iMmrLayH~RgAE#WIx=mIm&!b8SY1y&(|AT5ON}>v!;muU4kr}@K;zG zI9FYCwJVJ1aUvtnV_w<8uY*1KI#9m(fVOXg0fP`GD%&QYS=yZ0?YY#>^IUeIU5eZn z@$lkb4Lm2wHPF>X(-3r|#!_ctma{$Jl9vb$4!#BhOM%cm%R~8eeD0o!)L>w#PC$&v9s3%sc%Fbgsx|nBPmFEUJgQxlbI^;lN4RnM~WF#`;YrYFx71mbv)Ik6FX+;2-Ix}ZzsxYs6uG*-L zl3w1qW93GN9WaE4*fn@{go9b18aK7^>~a6RvprBhb@bqh?{1Gh$Q;r=I1tBeO5^qR zxs&T*GR}X!#ncfC{PAFxz;<4M#h8cNVBT8>fpqo{~n$UN6wMT2E(Vjo*B0b3(= z$SCrp!n8o;+h;B=-+b3c(nq=Xd*^ zzpmF+_;L(9yAg!J4NL;s!508{ma|}))fbNKXXv)3Z^wrYU;Aog4YYVSd?bHWqPMnx zs~Yq{Iy0@s7FQ2kBfh;4E2?QR>f>FTluxxO!S}1%dZwEds&sv}(hD zsVb+X;?dszoG0~8w!ba=&ZgDbY23Kn-Z(?_1C6x^|9Tj0iz1QIZQrav=0$?ENV2Q9dEvj4D2zUO4)6;n5?0Wnto(VobMo{PzpKlOC`XZyyPCw;TsD~Zy?9>1gm20;3K9$+Mf5sh`civ?PFh_* zanxTSF|PBNAvZ+1S4p_0`=gx1knJz+P%l{(p={f7d`_e{ z5gF*S5P~`xIHXd#6zjDdxqAWx_H`m6^qj%I(gxU93S7m&>{RQW5Jv!e#X>HO_dIgb zg$~~Z>~!Tj-#;ROyr4A`O6G2wpRo6Oui2j5_(wR%M|Kw1)T_cR$*p(9CIb0q%#S{% zlGgRGa93g-06t!SJB&&D^KN!j&6lLfVqQCdn3hG};+-z*W=JbQPz3BnS#gcKgq@Bf zGzh}@ty2X%E!=OMq2KrTaGRaF9Kw!^L93*6;w*Q^{M3qiS@E)mgSR}78 z{Do&5$L9c}xt46H{^3Nd$spNwnWKpRXep_(@iegVJP7j-0NS1Q!5eY0j-g?|U77o@ zt|Z;G=L6l*tQMGOa9m;WrN9}Y!M9PGkal6ut&8O1>l+C*p1^}prg9)yS7J|1*5_CB z=jQVbMc5G4#7?y){ehCbl`iCp{&N@#4+e0%{?LnG5Qfowu19Et^7HfWKqQBFBoqkh z$qKX4IcIN8N>po)SJVDa{@#w(EmvanS7T8H`vEU~`bXri-(wWi)bK}$LAS8C34e7Z zB?132iWeikA{aTauA~XpmG)Kvmu6^`uc5~;$f#gliRrsiifvq$eI;w{RGUh;Q;!{D z0`s`4rl|3u@?>P1MO@aLA&_#wGIXcKK8s9~@Fu zzrysrw{<=Z6!}@yg6?7}bU5CNV-p+M6UC@`Y@rCN`93)htK^g~+txy;7-_VmQIle@ z(36V2G>3pR2FV0Agk}OmoL~z=BeY(2FNcMKGF+7F$I)BUv~1PHveFKv;k^XIrk_u{l0+;on&J4*La!Zw#7c?OmZ^RcEZjzj znrwL1eefZg!FBcKcU2AYE1Aa66QKDMQmI{Y)`i&~dmq-XjorChMD^&@XbBLWsQ^Is ze3cuIvxMQ1{BX758$?6JV^d)UKbh4)jh50CS(4G2PSL$--*qhgsJko3?W<7N*FTe% z{xxg`sB)!MQp?ofCvt|{Nc8f-v3l4aY%9%l$x0RzeBD8g=L`IQtD>TP%#p ztne~hn}Od6LpK;PsmB*Sm(8v3ouJj_?LanU<8QDg&GIxa)R3+>u<{57!JrvB^F<*w z+jXcioRV*HQ0UsoFQHHJfV7z0GC}+_hWWbU)sQ|x45Nt_ydOPfwKOQ&wJ5A*f5bPS z_pj$_09`s^ZJB;=UqG%k6ts&@FYJ`Hw@dn}cdQ!!m6Z?)3n{>|(#mFx!RE$QZy3uN zUs99fdPnYUZ`hrI$fJbg6vrc2R-(97w;Z~}ieP9H;H*S&|Lt09t|F@lHsF=x^P{xW zN9pO}52*8oa8VL=GR*g#p*(CIw&2^kwd?o!9agMr|WkEI! z9%|93R@&QLtYd^(Cx3ao3#rmit!F4MAvG0P`qcn-m0;~>M*!EEv5Hhdy!we$5cSyk z1z*`c>Rqo{n{SMzxsc~`4SkjcyoK?*|C}x8+W1(lG_VPQx^G3Rt1*wE<9pC&ZGgX6 z0%u9hx^MM|;n6SsIy$5RFv+TnX-GCtUZ)ptaK!H5R;Wa+qn@>V>Ob_2@UQGdsZ8bM zv=8-gW0-`TcQ|j9J?ef`y}55r^C9;@^oR5#27kT#PF#9hkMox67#`+{aDV-cgHCce zFwNIck3kM%x5x}M@_7XnM4=sxhfh2!V%IR|Ae7;@wximtTcCZTZ z=dpacV1%8_kxVuGa8Xd?u!+imB+7!^*pODq%PtvHWhLlpb|PiLd2$1X^pwI+W)Omc z*l(RcwBsMDWXF9kJ*yY{sZ-6n1Q=}ew(Y5-u6bt4Ya9r4p}7T&yj}0hU(Gj89alch zj`!~GgWo{cwKtH=ovIwOib~rHhSFv8CwnwprT@B8Z_HJjbJN%;@biXTX6LEJ2#JU* zd>iqXGqTXL#2bm+w@sYS5xFNJi=pL^fZqYMSaT5}wK2I3u}=BBD{;Opw}8%LlMlji z{fEF<5J^E${SbloEKAT;tSq$s^{<1%KU%A@_9bR4*zcsA+3M z+h-3^f$_ZJ>X+CY3vV6p|LDZUbmZ;Rp*5 z;2jXQV}w-q28&8k|3oD<8VBy29WJ}OgeZf_IXI3P#``Kg=Gy~Q4!=2M3ETXt?)GR3 z>O3Lt%&X`@rIx4ls>I5n4TZfp?XRoIykZMc8&2bD|3IY;w7EFl*Za23fO^GyfcM5I zF31l!%c>TPWPI5ws)|DZBxdF4%bw`q?2!thL=yK)i>ExMGvy3l*Y*h4_QAiQb?98tP;t?qim-sxU)e=^B2!D(;v&bR=TPJf5*G* zv7oWRh3||D(7Q394X_It{ho`DgAypbO~shRG^x)|fuH^0e>i$A;;`<}p%zF~-n)SD z)JM(e41|OEc3n4Z2{0T7k~O_Vgxke9@#@hoEz|g#8#Fl#!_@;94GgXSa!TVjmwB<` z6R;Z|M^q==n03c)!Y2(#8-!7czZL#+N|JO(1xRH&nrgnuxDjq4^tM~Hf#3Q; z&U7Bz%#26#p`gg9)#i=WNOeP1IjGXQ$-HOA#tViwNE+!sW9}te#Mf+5BkzA^!eUtQ zl|ryHq(OPO0++xL?VtPR!6)x_Wzn2+@QBST6+LhZef{66BDUE}?0Pn)rTffchsJDx z-;zyZ$~#!nGHdqis2DP60nW~w%w{3SJ}`dQ7N6<9pEZ?pzb3iIYW%tad?Nd+DH%Cn zOG;|y=3ASeX5wNzaiQuuNmCYchLWDg5M|9o$OfC(gcan~&$&fJ%SCyUiZ$X9A{tTU zwP@!((xYf=a8_nh2wQ{~-M}bX6xJS@<)4lho(zgVfHfuXIQPl)noAvzd5qrYlmeam zd{hIS-3)aTwARI3kL0;y{d`a2@S~Z0R(_qKo%ox?+O2T)-p0xgP#gEIZEzK5p$#T) zLPps5^BTdK)JfFw16%$qeL^w(CmmTufzlrCu zh!M(mlQUxhAuGWQZQk?;kO>GeHFzKL?fN&ONUoT7J*w!vaOO+6%oAlfLM@|~Jo;zM^sB8Dd)%|t?83O}u3z`8lwvN9Io6KFGKu`bVhbsx91+6M5e@~` zJ#AxJsvi%R@fT3noQ?^!H|O z!C_E6V7_hgYy=tC_s2u7$m!JY*WZgzs%0q1S6hWYu76y~T=PAr{-Jz@@LxVVW_cpp z3avKoXTAPa-P?Td#J4q8?N)b2=!E&$DPP`Q<07?XSN%q&ssIT^3XKHucVlvnGWb{g zZ$W^-|6hdxaUu4%bTV}_x3P3()6#^6fKEgc0JwW0LqNjZLqb5n;DFEnH|=kQ@Gosd zsj02T5gh{8LJI=>e`(-rAqoP;|4*6~z`@j>^

vn4@LohCJF$C?b;ly@ku2JRCcSAwO+$eg8UD2{)~kR zXFDM1KOyqA;erTQr7{VOj@RmB*a3<8K#{qn3{UuX0!Vx>EX5(I!cae7*3!Gf*7Y$G zqw99a1Ljrr-W*O?*NTSM4_LYjYE9lzEIYYsTG1cr1mE0v zqqlV6m9zD*yz~k*jC3*$F^;+*iOf`ZQHhO+cr-A zf1fkYUVHp)0gCe-tH`yvq%67v1` z#BC+`B*FVA<{`tTBu&GPJf(}}`>&i5EoK+f*ZnU^;2{1kTLRnD*!cAbfZ{fH1+#-} z-+1&i%3;f0rVgk1Ve()y6S%ES9jB+fecM~#ej4@*Pl%B6PaFg&H*WDE8V8On*Ra+& zJFw^Ikc$ee4L0y>!w~GG{y9{}8^-=piY~u1FtSk_Y?$;qc@LwABnmJB%Nibdh4d>I z5kuGjI2Z@)=KX=sJ@Lqxg-|)Pf(HZj8f2emm8HmH`0*Dz6oC)YQU@SxErxO6{;-0c zfHCq;e9}Z(di}pcv=yXHuME+!7LyP+wcbmA;no7S&C3~`lKYm&!uuC49VSM9&Z!&8 z@H6$bjbW1m@sS@lNhFWCBRj834?ms&+QX?NRsSqEhAz+J*y56d6OQ(sd=AX0$M(QYu0Rb+&PVLRazcKJV z<7e**jfiDVk7}mKE;aqpK`qAZwHmSK#J7Cw7g!TwIAwRH=>|NDM_!(A${EOf!GnxU zON&_kVjO;<6L75xWhz*u0VD*;u+ScnP!2|G?=vtZQ1m7xp))G4rJ9ybG=_Z5322hg>#@X3+d;<(<|H9tnGD;}w=j09vS9R|c*rG$rS5h`#P|iB#%1s91!bx$ejc5KX zp%!NCD&M|=I=^3)vqpFCaRxk}p3`%V|AL_YWZ}Gl7+kwT-1)YMA$`Jesym@cp(!F3 z=Y*(5#hr(6b=u47{6P-2_m8F{&-wJ8ii%B-vU%zl5uO}rhM95@|EByMGphwaouOv< z?|A4S#JK)~qF2EtfIdcRspAyPMSKXy_Rf&LRCfOrU1{$}43^5Ee?_Yt49|y8h3INs znh2-PfQulhm}A%?3XlHKXHg{iSuj^?AztvCxPZ+9GxMjgHMnj&q7Ef3Nbi31FHWAi zZ(V;xoosqg8A3$bR`>|!8;sA=45}X#lJSH>f(Y>Yn3?AUjX099pSrybUa3%I<^MRb zkY>KLXg`xQrK<5AR3#1#t(paMGjegf81#Q zc3YA@mY9!-NqecGtcuf!cK>4ErFu^wbEUZZ(-tbsn5(dp^?bZBL1MQRgk!o1tLP{j z9dG@>M8UZ}6P`EFoYOYXel*PkbjAIBL0vM)&9JGsd=vJToc(QhN5NWdyRjf`cnc`z zvXWs!+Ec1rIi;rfSfnDTLFK~+U|mTeJuhqy9mEDf612*8TW@68>v|vLMi+w&Pb}%^ zVm;?P9bB|O*c9*8(smBFW1IK85NIwI+wLB}Uxi1Dd;Vaq_dY(rpD)Hw6CZ=^j>~!Pw#vTLR0n zP|n_Bu2TFogQ~7)IDso^>91l7)GAnNcnodFYxPxa5afmP?v(DaEq-l9$6+ZKs@%A$ zYYB@GIlz!}wY3(dsG-;cTG7IlRdo9*^pv?^h|O$|!|34`B7VK<26OVUtW578uvcRE ziaxx%Y#j#z{6d7+NpM{dn$6A-dwpR`So&^XucXUdM`l?ZRT&yw4@RH!kXF-ZpVUl; zZXOsVkeHF9DP0p;x{i(}yK^|J;mB-)zB<#+ydE%)=auMijN+f5b^$!7B_r$O}o$*H{xwOIFR|s+U0LJl=>GW9p04g7v=rE&!JRe+s@|Hf$hFm7nsZI zSxtQWA!J`-aU=bcwDitX{C5ccmY5KbBnKye?yI8J!>Q+ zyE3nj@Ev2f4=wLIuKP#O&jYtZvh(+1*)8lc=N{vI_XIy{CJ;t(IkPM>G6rUTXx?bphkqL77G}xc^_BD%PBSz;@Q%G&a z7{g-5-4ELkaz1sSr zSC~OT)L5vopG^T}3P8%|4Rso(Gx_NHc{757A@D~lwY0AF;=?<`2>i^sceO;ZR`I-o zDR{r>`t*RGV-50;W}LooWM}M|=AWia4hm*|c0BX)!SeUyfB9L1X8-tYt=zBjcEL*0 z9~0x}Y>avnyApiAk>F+?$?}>@WjomVaQp1G2*aUcF5B!O$@&Ytlq$rq@3TcxKl|PH zV5Q5(db~%y?{GJ;KYwuGwqta29A$k+ZF4C<|C-}){NbO&LIgS9Wbky`6yYQ-t3eR+ zmDmJ5ifg!7(m;ZNTUg=ojPi2=5C<@ZS0CGTIsyi%qkqKf&mX}eAp5sjnDo4pl*CK%>2Gc><;C=T5i)elj>D%T81voag_zNM{r?%U zPUNOOsXSP|O{v9ZH9ZnnwdyW!&<%rxYyuSd|ALZrfdo=gEH>V$bGY<2BC;BD?^Uq0 ziF;hg&)#SB?ru(g6rJ|rqLi?0uRB+pr=z`tVu60Qg_n=pzRd2nVIM0k@3|IM@W1f5 z;`?tT$Q*f)iDHpB%bWMl7`*QI5tYZo$Mn1dTf}|n-7PyZJWIcZqy^~tQqwaFV;cWF zUaT?%0MHh2W56%|WZ%q=rzD{RgdT*thFQCXv|M|1%v0pN7AqCYtM*ePBNmKbZX`9E z{EV2A3PgG5(q=eB`EsLRQGY}ZnvHJ=9imc5p=btDGKAGV{diP{)t8d+DUbmb#+l@V z$7U=sU;xek%f&KWba`AbMnj<#rb`w#sqTT5;|hMUZt5+^{TGULwMzIoQjz0`Npz-_ z8M!K;8+D~ioR5-B+ZqcZciHm7xKvpunqb#v|21LaB z78C~cQ8JuT#_I|LqNB3X1T5V;cD!e{7{<{RUA?^{4cbWN}?=vty$AxZvsOTrb#2*{-dUO(5V+ zCizhYYKI9qX5R~pc9%9^*JWSX9v+_^nRVOi33^B5RTns16mYOUt@ZCFHYQ4kCgAEf z2^Cm0VI`Iq5SYJjQa$!rSXO&+(mY-dKMLt<5Sy=QxZhyli7$ApRc)wdLii8)X;`u<&SD7tKt)8k}s^=Ez_9BkCvVhXD{A<*gpo6N}k?9*7{tfD{8$vgv%(Z}n5 zK{A!4I>O~n>_9wp+wH{)GD}2I%y(<%)39WB_(khw1uTTB zn3;kbH&Bi@A+!0P^>HUm`!jidLgzbgu6HfB7fXHn1c$RHzl4};uzpwM$o6^^k7W`m zBg|ui1IZOPL#pB3+hQLy9RcYK+37LlIfE1ns9{XFUwTlGxPh-SNN>5@)`R;RgldOq z3^}!kx69JU`LV0*#A|m^**5=r-a3+?Gp?nX-d#+e)^kziH58bt+R6ZHZ`(sT$O$ur z8=Kkg_qur^R)3OPXrBs1BP}=v2NnKPs2l9j{XEk|^AXI2e+A%mhH@ zcPRp!V2koa66>}%PuFD%n8@^!hbFt$&}++bQBF*N;I5sTmqpW>C9KO1gO2@-un^Xwwrdw=CVFpG5+bVJ zOr!kIk5Wc_>8U8I(}g&w%UwkAMDMF{^70>61}a8;nrt+6*|cfstJ9>I7=zP?X&99O z+C1ag;Ns=B@AX8li)!Bk(??HyQK=>9{7=kjHWI7YX==cW+C(!5W){FYxui2zAIfVW zw3Q+1pzw>SU+%06wHg1aeFc{e5X&p#(D(up>m-Rk(NrfSm|K(|d12Op6 zi~16+E=-xV6GC2~)+oBUs0&M=ms5i%)CZ=wcMk(vlTQeFeIB62pHW=S$Q&l|l{hKbz~@lb)@0lZsF=iXoOSH~UWeS-ESlM0+r z!JJE@ZxAfYyy*(M7)fOC3s%{@8I0HK7_=tJ9(#l;L-rC%ZLr?_7l`V!78Q14v0%o}UW^A(b4Ez@YwiN|nlKglUA;df zK^v3@ml@|lq_#y}Ye~2|_jh5{nhPZKmri$@u7T|#$a`Io7q|wCfG!q82H=a)%@2C; zJLQ8!imHeXgpbbKdpbTYg+HIf*FsrX<;@l)+9^*?*id5FtCPiYZ|F*T6mUFu>U!Gz zoW8w**xp#Zy`meALZmx6XrAiL`0a@e){bC{oRaZ+)UKVG^{e&f&TW5J{^88?^>gn) z_2hFNeJ*ZWB$CX%0$II>raFNI9f(&BSQ4N>BF+?N|!3n7r4)tanr%=MX`%=MkC-NYP9&_LM64&3r>9UU2$6TpI zb*S#WQl$O{Ws_%;jcQT;g+Vl^QNOh6PN!M)s_+yIPA}zMv~|Et$5cjxvshbu!S@aG zCZ;oAtsG|x7}VjcguDCR+6Zwhbi5ah+|pBOMz}&9U{IQO_NPi#NRSb2y-K|grkWF6 zN2kpkt5i|KSn|0~b6ix=hjS0FK`3toge>X4+xaOSRl<@FL= z4v8Fm9AO)B?xtH-;4h2TLPR5xai zXw=Gpms^3_RAQZb7}dpDH`k&BttQn~G#3Z3Qo8DdvD!+eHmiqg_;zF_1s|S!KvzA5T+oWna`yz-&g8tEPx$>{4k01-( zRM01TVXMIWF_gW-e_`c3JL063hD$2i2}zd|BX?{flT)i&xxZXhF|f=-c1bDlHTVq0 z6!i)$H8;34D4FbsiK4)Ji|eiZ%xw~4 z$>SQp`L&3+S$p*zZ_7s$Zcs7{a|@_oW(nJ9GunQ&BwJ!Ab1ah|XG_;&X55n(e=)v7 zw;#;DyqZRgiG!=48>1*g4M)p+^2obL2W>LlhwS6+qp8Wl9uMWbWO5N3sd&$V?sMcYhwDdod z2w&j@JmhwwU)h0_!>NlBVRM7&soY}$DtF@yn6=|PKHHsM&R-h=LkD>F;~4#C0$f+6 zn%p)TGcPBlc?8454|KWv>FKSUKBsVH<|f&Dv>QJ@9RW|a&l95*iU=|Fy3CexmBbUQ z&p{>O{-txq=Sl8UcKq^cMI-x^7tE)%OR>yEHur6Iyl$BJpMK7P1VLnkHQ*?~tvElV z@6WQgby~1&@G*Y6cE_VXqT^A9$KK0Fr8#B*#(=>IvorhVGd_^a?5%!!32xaty2iUKx-9YJG#~ijCvlmwU{zg2-9FDU5&_WN(F_JuH$ftIGn_&4Z zXBcvm$~Oc;gBE05Hu}}>jswiQ^$&!AK8A5sef3D>Dl|P#L@-YC^m9C3b_cFE?*Rl1 zJ!iDjE;#cqN1hp2T6}!uIXM<+95Btt$NrCV7&&cQCp$3TKOVG0F~i{)2>o@%0#)QP z*r?;i;tT_G+!q%=i_FGJmi_QZ%KjA7lvNc;QQSb1(;x`B8mm?LH7x?i;X`jb!Ph>F zKz!h%Z;~~H*gVsxGYm>!p+Z2%l}QPfAQSr&#rBt48HJF=N92GP2OBuex4#9)g?1|+ zj&6OfnwZ?U2X+SOc|Rtd10|gK&CzH#$NiMD>I9Iu@DHsTg|r(u z@KWn|M13pOpC&KREB|Lpa|=^G8*5$o^X6OjGk3X;dfgEZ)%(}qGn*qd$n@e*0GH06 zLnm57u=dbeQ3VEq+Gx5*H@49NBt-HAoa5^Gcjb(b(6W<^Hh=?pPn-f=^xsvsUz9Jm)F9Wh zQ4d6GV%*92ez^abCFjMuy_JwedHB>u~p zg*o2Om0|Wz3X>F?oafCJmou%P)7p@Mm9~ZkMj=`RO$-!yMBwPB(>%FsT7^2sXu>g# z+bu)~ieG||%cY|nVIq5#|2&nv|2-ZD2;`p-o~r0u*-~EUIKu=MT*~%d_9xCvO#{t+ zD{{nIv0x<4vL&o4k5tJxDBIP(uG*6gGCO#purJVQxb%%C59#8JC$WpMgY9Wn2bU;jM&sJo!zZ`r-{W~tD+H!p8cwY@(ACt=t)D zcCA%WTQ%UTS-)GVnO76ZtUK|%RI+C&&iV;e5sXEXuLpMh?!?uwXxkn46q_wFMvOS$ z4y$P=xiU=~tCmh_?giAssQ-zu4%sk;$VYEhVEXyGiX}cZ21+tHq)|@m0-I2{QL2a` zjmpq2fxnx4H8;t}IxIzNa2Y`E=BVx+D1!M)ON2wvTlxf(;Mkgdq zpoPpzJPxI2~khRDzEXSCye162X%PN7hsNIqyI3}mrM(hrK-~wjPmB5aA1mU zAGa1*=a#{3ofYP~=f?WF)QJ_dq0JTk6cQwL#RFrlFvXwp^WwQWzi52(SEI)iCKd7t z`;UgU@E<}}z`oULM(r?tH6N97*9*CkI3Av;&Lme|`YfY*>T>N3bBLYr1DnT2dYlEs zoJy9FBa`5gYvAPD;~z#W_KSGKLZwxl7f_Mz@>^_;YcR^*TMxY#HR=kPRKQ{b;R&iG zBZ00mENFvV`hUU$r3pBbwS0D(`aV}Z*$s%LH0_sA0A`)>M71yHY_o}Id#726%sJxb zRKG2XK6g%!6YWcYSTQTRsU$^z$k0l&aQyNiZ3KO_(t??VN9qA11rjn+Mh5~atnYGe zh7XpcR_Db=8VMv`-7=+OB+@8SFoB9*wxZMD4i8HrQqW`uDB&mv;0<D;I$sl|dbJ zVr=8Z_X_(=cAQ++h3T~y=C9z3^;_(t6 z>_V^UQMh*OQJ&VIOrcW3Yx4)XDE93$+?^A%cJS1W%TU!3QL*;BTFSXq?2}U&n?BMS zMr%pLPm{v7AZPR8shpq~5?jW*0Pl}J(y=K3#g7f zgJ^NX9J+^{8tuM~oSRmAVSwAVX7k$apeDm=_6QnTFe+`EqRO@K!WG=o_ij{a>B1Zp zy`HLxr@0hQ$ZYW^NXHWgNUPYJn#eo!@Spjo+6l!nPfc|vUFjfK_Ji?h?d!rD zxVNyTWzS_KZ&M)`L`vozEH0v*cN?p{|yY=^X|h3>q<6F zq8y74Qc&(F_p>uqK}54YVfa+cMA6r@(a z$;D3A{z+LuO6DhhUg|Sr;LII!#wj^*J)5IahtWR3Tinc4ML*52lZ|oBNaE*N;)9RR z!O?h8M3#kmo0+1`bC*+9bg^)O`1W`iY%i#Eoae9jppfPo;^(Rsv1|8L{v1~>+9-2H zvgnd?Et?%LiET1utV=3w$Qn{!UFQe9k^{&7y*>b;e8r*Pb;l}r3r$nXpn9Zj|D##C zWa?+cvnG?B3iFn&XexykyIs@+ybxu!oU~r`JM|eoA!zTn-#VV1M5@zu!7}C4O0Mh9 z7^UsR7h(LG+5?A1T6`6o_8m)YBkcS%eYK}gp}jfGc6O&Qjxn`X)#Mexl_3Qd!vrz6 z6%oDr10*zlF7Evl3nB_D;&ME5Q2pEZi3+hx<}`HMW7x=%Pp9vkjK*fa{vMWIwkHWJ z!~k`9#l2jWd@WPR&9nla$zZ!rvNXExD}kV6LsVi=kT!7Rpry1MW^i|31gThsXP;Tq zc2$s3cRKWD4`TG}USF&ANc06^P*%0mjJ`n1Z$|Igu^Rt9NINYW5)gC`tAYxOk&9RF zGvi`k=mDS&iSu}gx|lSJr;TS#JRHWpH2g2!{3za~QNgunAoFUGLA}t1LU@bXau*#E zyzHx*xly7xdBboenhjKI*ff2<%={-9JnQcHvF5Scb+5wX1J;kTF9{36r`=ms9AhcC zz0)qK^F`V4Oz_7m_~^r&3>8oABEp-=Fbv<;5eYZ@F=sKJyLdw=iB1LAEbQ7n<(9r=EM<_LTJKKnlzY4!Tn%*=e*`$IOvc%|eDrT@x z=^&q+w;Qmk3=Ca;AoA-bl%M&B1LGY;9;N)#o9+331m~od>0CECTOh>5&jXyH&#~VtzEUg<63}>m>OqU=p%eDisDFCql^wJ*gMCa+=Cc)0sgy! z?Kx=6Nt4N zsBBL4ZXhSSN8^8Epa4A?ljjBPUuL_)D;%Hf0k%7gsx=rZdf1A|ht{xNcRVv(m+oFI zULl9yhU{nz@wlYlua?YLLCiGJMNr=Zh^mp{vNK4!*Zz^jZc$_4Wm~hXS?b+jih1dI zb*Q>iFntupz`5aZesT%^==>~Yh7@2beF}>x9H8Z4{LtRY;x<$0rDDiFui zXqCS{ro%9^{zEro1Dsw7g&Yq5wT$pyGD0Bt?-{Ha(!8b7D547JQ;=D=9bW~=ax*>s?lT-^|&gK zs*-+o?mEYlRuf%r{)0CsLe>h9LANb>SoPjcDXZlea=z`|5cBIewATcsw0n^6=$Xxw0d+*ICkUQtMN@(9G$BKi*}XQY;79Az=$jn$ ze`Ms!P?2bwh2}&tF?nKxb&y#~Rx^@NoJNa%cO-=Dn`M|Fc@7@L(OQ(CY_KKL`UqhR zC?0&1=5^r+eD`u)K`tUJAWe*Z)J$F3wYxfE#frd1=gcTKK1Jp$Ecn>!az=(3I7 zna?z6zp@?|ljr7eIknrrX4Noj8;4nXZ+v~&?;n_6twP#`Jx6HMG~sw^e0;<@7bPEI zK~f=@O$0a7g|qQT>~&4?qI#}O{f1d%vwuaRVmw(Ve4ReemM2jf@_o+B3O^k%%TRi_ zC%=s~KttduQ6zzsr_8R7i0XAtoEb$Y(2O@ejtLda6H&JrjH8xqw4H@lEs`&AU(y~Y zN|Lg#6()q*L^YLj(v#k#G1Y8qYC~ddPhzpjQcLmP`TB5~Zh&DKzNozla(qCbk2B}! zEnPs#97*AF8;II^akalI@~~RU^zNfy{^Lw(9CXzc(vt)NRgE$YdNqR7(mpVlzBh7%Ru z{5QFQAi$)Wy}~WqPV8R4onKY=7-I+w-H|zk;Lmmq*-{)WDQ6I&I|l;i-7+w@9KEWkH@p;6@_m}L}@34CcUVpt!0U6oIpiY6oR3Q9_vq!tMI?1!lu)bCh(hRI}D zqcQH?h9{4f*Q-f_!@Tv(2w~cGSDMor{qv|7UoZ1JKJSEfCEd>Ng!Xoki77O0|E2RR z?V9GR*L*t?T?mhh_e7GqlRCa21Y=Dolc`=oe3Wk^&rHcTBpSejEHuYQM*5ij+9*DK zNbdR6@NNR!d2F!(v{~f(@Cx6M+ZeG_GX|qCY{6vbe#zD5Ga5AsEhgH4#h?z)1z88X zgubhl0_< z*;ERnop@@XWBl~uKIqbBLKyPj!+-!kqae#v4+FE;ZE3hLe@f*UD>l#aqU>#`AG6tg zi$ZHHW3g7Kq}-ZuzhFFNKibH@ zF`GExC#A_tG(V1(UNJs65QPG|RT}aVuEemMHWur_Ozwu-u@v+-t{O4SR&l@XUNeP| z{!WNxdsjH~zbrp5PmSJpxvNXNPJX^DK)6^3LZv7=)_9ZhuQ9^`c2l62#dQ|w*6>QO z)bIa(ihI9o*Pl~t+O1Tf<=WqhuJlONQ)Ci1VR=&a-R5BTl{Yp$ql&Q+E7t2Fz8tw; zw68Dpr3Z$k*~LV3%OiQt^v%ruSzY4c9PtrQ)p&xJ25jS!3M-iVMG|rpxOx43vS4V; z+o+8uRHZGc6)c9{*nS1OagU```#&zNnUfge>S)B(FfTR7I&C9)ri1FIUQifz!N?bg z{qkldV$In8rlbUz18#LKRVtasOy|>)+8OvGA^IqNQvrfT^7}S&E!BV=wY2iZu#wbr zDHxS5ER2Rf6&t^|_zvURyw!_23~qbiU;%`eR!jW+5IUOCelu3t37tSQugFb|DM`2) zwzH2c0B1qAhLX<*E(d(=r>*h?c(97$0a4WbQWpDvN%C0;G8@<5 zW*ZQ9hau@>L-3mJsQ9gpEcuIkK-G9v$D)%dVjt7#$UuHg0*7Qw0RaiR>d7L8fWi^m z0^vNhLLbN>L(b%Gr}voj8q=6T;d;%$eTSVxWV$;^xYEKL@DJ8VIQ(Co*=8WG>$Yf* ztPUCMTAb@YIk0vE5B0};s9O@vO>@YU@;T+;^AAdy&>Y)^)CxqDtXdCR zsTcfcYL7I2w({ng8hFQ-X*gwQtjco?Oqy6yO3pdY0Rjdei+aw{ys&GN;F zm=SYI60a7!>P_+`i6hs2>UQCh+7ZOIHY_#H{hZmpS%2fCxicO^D(t9Y>w{$C$Lqdh z4+jYU>n}&^YQ~`d$6rUx7 zDQG0Y)dCv;04P}J;^Jg4`oVx-$4<;@v+}k_?CzN?bBrO$9l<(>kmys4nu^;!;HB9_ z-(X~9IXle_sTO;)SuusV1_t&D9Z)<`&s0yjOrQ9;pIuc;Xa&v2Dz$ zH7K#WGmoB?poY@7z+Y1W^Zu|>B@57e$)0#-6~~NE9PT+(+m5wwfw|4}zv(*npd{oG z4@atxI=5}#0&j2bIYn6gB&Jrz!1+E#lTAzL1VeX`B z=9S}ddpq>Khy>Q;#qhe2SE}eygJm4x>~haK%k@W;;ner@B1C=iBda;IKb?66 zo;I_mrRpp^*x;buLL*V2Ru~8CtemJj+$mV1MtjM3h|y&uO>0oX`prwqBC=yaXj2Lm zz*?|3L;_ja1A`I>{uUM&2$78vRJgcE9$plxg$Gszw|8VKJ>B`CfT%Om;*ed?4Aapf zi$6W7mErR;(P`gie_bs%2-I&t*fOtrLp5-S5~=Ak82U`JoA!RUvR7#e^S5K3*-<`0 zt|~z4rrRMTEM-mE9Mag>DPjRO9ZNAW?gjJNx1mh|*^FA@#W#1`ceF}oE~QQCM0)6_ z2Ei4o>nQK+=l-=<(!NBScU#KZn zN8Vq*^8c~km(=UJaMt7sXC>%6MUYYqL!W5R{PuQ>D z!!D8^HtQZT$4?squ|oq4CWY|2s>)bte4InEc$z#nE(dwf)}+K}WsUxVoKc0Y<6Y}2 zxY1njrJhOb7#hSf5ox(C=VSGc*@+`+6qa=(i+pMREYRz3< z3|l@hV}wGKx^&!6#DC@GwuZL~yxV)DcGP`hYoO5|`NQUHhUHMz?)}#q0_VlruP-YF<{MFt7J!{Z?Twz zXwmJ}w_QKK!rrr+CILkGdn0K$)TIVgQ1U7_C~OZ{hZ;ip@}1xSY$wne3zVYOP?oYf z)1t7@Iv@Fi3hm9s#iQo>f!b>Axsv6zp>dzYIrBPro-w?*-<&w5qnAT5C` zet;6A#yuV|gI#x%`!&`;byO5Qs9VsrHW6k5gh}yZU*WUl$jNFT%83Gotq=&lu?wDN zW!gmQ|GXS*#&QT3tLNDEoQv{Z2Q)UtiN4MFg3#Ca*NHYL&GahYv`MkqX!kySAwUQk zqBF(nV7Z>VZjBCc#ZmDZnPiu78X`cYn~U&DlN5gKqN#x;s6zr;WvtiR;%Gb6tVSd? zI`Oqxjt6vR&CZvoyAqVos}<3t9+cO}l&#%TkzdW7Vw;K^Jvx5FqRQf==+-1RIJr2H zdU$C3u@ctCo1_X2nmgmH?R|`R>s22;j1xz0^F14kLs?#quJc7clY*?q^8A>urrWI0 z{+C%Zd5kstBv>fepj@b=KKeNWw$Y~%f8NxwlCUPJZs)1WvI9oXl~2L@HD%_Bu!UnS zqPdC(ng+8@5KtC)=Nls-xkenY)0w8NV9w(@na%_I$Ivd#+zE}&)@2^Jv7uW$F}eOE zsqSE{5gfyPv!}USe547h*rpr#2+ggi{GLKT{Gv&}#b!_&=CaPF&>{mQr z%(5p>Ylt!l9Qytu#dJl+zk(q72SscUzkCHz3VS|IVC7cL6wEui#^_S_LZu4eHI)!+ z=&aOQ<wKYt?TTtZ1WE1e-V)t&F5nvD;hw^L`-bd-e@+aw6pp?n@p;3 zmRW0ojnhc#vN}FHiN1%84QOkse75HS8H_Xy_%B`=gTaLpG}n63>Z}zWXnS7wDNR~ zu;Es%yoC0P7t;+dgm@=5m`Yq3h2r~Hw9iAQKTqSV?~^PpuG6hQnG*8%X{vbpLZ;_a zzsK+)S*)!h9)SZ=mCkDmzMQ7Jh!*FLM@keaqN2mf(els{g9Y92oQkGG|B~}GtKJ3bqWieI=M_TVT@AZR$aMd^PQnO z@3T%UW;lRI78k zK$V0F^E+PWK#}lmHTSR<`f0j7M_nL>#XkMvREeMf<6|uVdDuaxhvz6(d3-?BlCe|l zA}V0mwLS~el)1_HbHtmdEHYb9Zh%L(uPuXjKBevEr+lgmlTyyNluQ9=fL zY>sVi&XW&CtCn24)xcD?crl2CN)>h5vbysqpKGw{+m1Okw5(^pMgLRN@HQj!F4!8e}24l6AilTdm*hrn@_U>S8xB5UqdkV0Ol<#x>h zk+mAW4(;UZ$79|A)sBRC_;)uUEPs%5+hb@PQ&Y}A%=?Kyri+=ZJEuN!lXxFNQ7*Lp zeipYgWL&G2%nV32))ff>O?;YwUi?XKD}&_KFcsWuzluLat)tSCI0Eep^&bg$L+DEp zZULcG8+~ZFFwq?J)|q5z)d+tAH!60x-zif)o5d@3IJZ+PnC9fUTG_?scmcyx=B4ER z2URkAV_r+kU4zH~H9E?cI2PGQyiEG&()M<%2S$v%Z>2w-E0v-iw#!b@AhMH5(6$l+ z{atUJweDnCejE1;%_F*PFuUU7EVh;A4928bfIS{nCD|p9!X`0`Yuu`J-ghj}ek~D@Wp1^Ray5jcE>0(=K||40P~|Lp7S`eM6P?TB$8d`D{pyM`GR|Tot>z|e< zp+w|b@7~s2i*B!}Kj6FQ9c?T!E7ef6;{&yPp@n??}>c>|ILE1{KbIbH@YCh!-iDGNj?Q}EA6m3df#yMYL!@uKg?^m}{rDOvf6D4-BwVu5DLC6HpO+`5cH&UeLpW9M(+-81?xm1jxGD_4an zLH(bq?I5pR}(gj?j9!hgi(4{JNG#x3F#j!#7jfnF+|()@?5NA zAp1}g&A&@mtWc$c`D43QsSyznEWGp1QteuvuEqN(|5!r(AzIB;+6}NT=G=u9NnqpE z3(S>B5j?pb{H(9EjQ1k24t9L~w@V52*l?+AYKyj*3yF(vjA$%YX0_VA3zQi^L^TFP z(26}|zJdr@d`p$aSfI6_M0S^GTcKH-Q$+cI+wL_`MPFoP{77B;ju?s~bz)Nn0iIjx zF=Ux?E~RCE7k-a7*?w72o;1Ucl7P7J7t-b=t-xk9xg!v4(VSx$qNSnw;D$Tr6(;O) zur8;Hs#7&kqpZcVcEYe1i@=J=SMG2;7bwYMsUDY(Mg@ke#B=UkE^eeX3i+(f7g~PU z_MhAVVr(5Wec62oqc8rssr5k4pO;LwGE!!{Vq)V?|A6phJ+COclQ)eS*H}XkE8Aq|eNDfdowMi)L&D z{B+cX6Hj}y1{Ohi=`kPpFfR!#9jIR8I;s1_5h0EL9w4jRy$nNOndx!botu@C$PL6>wIXk@U~-%(Ei|yF z$KsLe?XF*P$-avTwHDMO0_oD56~DP$c=!X9wuV( zxt<_C-|uSz7An15E7RVB=qQp?%mD?$K7^aKKpltgD*OB6p+|xlucOn{SJUW~hGz9Q zrX=cL6Q-PmyT2)k2QLs!TI^u_O**BFb$rp!oQH6Q4tY^3Kpxh4RDJATOw% z3M3rwg9t$}>iceJmcfCZS~@1grYoGC2&!_Rns}Pe*J>J`Cd!kH04Q~pkX54p5!ci` ziCfR8joatb1G+~k+N0zl)I>$sJR+_E)8iEBc_Pqo#8sEnQ47o;6Go}D+;T*89Yms( zZJoVzv9nFE`Gg|yLB@kk)-eb?f{3Pbr3iCXq<7=!KP7B1jT&Z;7XMA(zVlXMcaWwh zrqbaYFaTx#>#VhWUP=LHR*pqs`QPv@Yq&VHxZpV2xUsV|86uZ!Q9q5ZkaBVF$@Je_ zxt?LIF5W0+`140l)Qj|_H6Oy36sgy_Ym44Ms$Sw|IElwYo;8T!dG!=FM{_^YpQyg{ z9=iK68*L0eU7xRZBtkDc8IA{`gH}?^^8qS@9z7Mbr0@xE=K(J_jBy@~2NU0Cs|O+v zl&fJ1kK+-S%iGXLE9C!w%2R@E8Ti-g#_RaS=nwddZj+hn;ofn3FZ6ucT%ibu4mqr# z$xg>>FT06GELlAAk!c#PyD=0-F58cBaS=l9cEc3_mjs+W1wFZgy8@MB6^3(W1i;;p5N{|W9Ue~t0c%j^Xi9e4kw>t z{|PZ>bIrCqnf;`Ad4o9>2R`3|J>m8#bYq*8mSRJcl+qToOZQJOM?R0G0wr$(CZR4xwecp4<9pnD3 z`dxeMT5HWYd%=e+FP5dVa)Gr<=3_es-TbV;&HIZd87fM$JBM-tpWAgyS|3GduP;%( zEj$nR6S4^#RXL8)=Q1|RE+VWeE3%+yBdzfJ&+AbNPeS=|TCrt-AV%|ai$P#c?Dai- z;ViP1re9S)UED2Y zp~!mPS<$4xqhNCB-G$zp2Cd%jro@Jwg-lSOBW;f>%r!S@svy-Cj45}Tmm9!%x1|nKV_8Grh0zQm|Dx{Z zl9DG`k9Lm`>b*D}A3De!DwVnlh#vKjxCxHF0bY?53oYZ_i7J0HzSaSXC|eVBeNvHXdS^n>Sx(!eoTzDCB=4x-U@+erC3 zO%FYZuZn}F&{=2>&wT3O* zduE?)$lD6u`fX1&Y_>>uput_se^T-a+L4UdtMOl)R*Q>flOZ2#yf(ek03CBE4Us(< zwCD*vj4B8O{mKWFUW2X?Xa}jN{}+K4i?#p7eMEHI5)A{*qko2sx>>pdlE7eJYV?Lx z^hsSFV@foy=?_{-{A6j9>OK+CB;%xGQlms>wNWY2#O^kw0|XzU`}%CaK{!hlucfwk zjk1R2V1_wM3$4f4im8{x6f=5Z^oI+WL~t14TX0W%Dgj`Tt!k0ff`B&{y4RAfwLGRt zbdSnrhipN-ZAKI;1E$DKs!@nfcA(zhQ=OGjZ@7_iFK%&BqFYQ%f=aS?eB9k`7S;N* zE8D|yjrZ^EN-JQy68Z+aRf8Rkq+!vM`b)zCTgh1ec0ePAK4k}Gcypb+VEUl zbXICQ=VD5#K`{D2s41r4ZK|e>IvuciZ9sFtD>$8(pH7RH+3u{G`LO=AMZKaFajSq* z>o15|ToZdjPsq8QjR{UR8wM2XK?Yu-m>&4>M69@#Nc$$1Hwo>hmL9OkzXf$O=6*Wf ze?MFod8RvBQ#DSGOsnZKUZ!}Rmc8!kZclkPdgWxNc;AD80}IJ|VZkLfF;ZLO1PN=!uLk^hw8O{MEfL!~D@aEW(hH&4#X@$Az zy)`AC0fo#a7{=t`mf!lsyK-s9@5KtgmOQV&^q9v8U%46K&y%tQ`T0J_SLloU814!hDRR2Ug?HK5eg zOhJm;*)YR(ZF-&X(+(1)XpS&FkF^Zj1n5H_U(%_eObu4A<~OrhtIzm&_*ElzB>puU zAK!IW+tSM&ag6U#_uKkjV)jFl69QhU-MXi*axUz(AGb_DSV@&nKGnatEj8Ttg>iX>A}Rwc&WF%+|)23(|ND5}is7Gr=sCl7HP& zS1Kf$+WqBHNu)|DRDpvgkZ>|Fpkl&PX4OtU$n9ul8@(QfIo0&^rKOSt1Mv=-tvS-o z!+!$s5hR;(8>Q(Jz}SIS&xec+QT>F`l!Pa%K^9Kp?JlDdJ%(eP_cZ}$i9d^Hr`1(k zc-~EKN2H!tG>ZDo{{!dFxNs&_a|K%JH1N^L2ovahtB?KN$PU_JZmlDuDR%t!xM|&> zpFfZc*JJ1-)V|W79=4|pwm*k-uxN+l3zMQ9330fu7aYzM7zWU}NHMyQ>K|V)#LT9k z?B>azMY@g>D9RPG-;w3P6gX9&I|kWM?FS6sCtZNow=KjH716W&R^h<(6+Srgwu}FY!%39#ze{UWd{F#K3?eR&+ZR|k!dZ2OU5VZ`A`}I_j(^FHt-SCb~=W%qx`QL+!Kdvs^s^@3$+U{ku z9ilf)jyyX%W8er`L}hn?-&V~c>AUyL-|X@P%^w51xXi3oef2iJUa#+`r&Q)g#&8@ZqbSUYMOJ=4ZYtoF(UC)! zcYS*`akI|o{%UjGIp3Us$c1ZU-RRRa@Q0uh;~4E)r?;fZVy;2KWWQtR@m1B;!4N?A zoqQ!>yI~HIuqGpc;q8{MJbH}~{!~E{noRd)7<0md5TGCUck(9uLh`UeVFF6GB57E? zYAn)}IRm&*lNGBjg^%)z+j{S~TZ_kWiPxp$*0JvPBsKXSmSdprAa#)5t_LEljXn!1 z@~K23#Z3tX5w&;`n`Hcnkj@LVdPw|==nod;R34SOTs~gL&B>*%CKbM+&?E+2AS@0{ zl58z}zrKRc>8E|Alu4|XOjY#)nc^?Qee<=d#Xj(}>KAriog9!@5t`?^k=4B!NZ`VV zvC(B-7-T}Q^TgUu2d!PCkAzKn5s75to!ZPQMchdupf(xI*{=ozo1NJ8VX?_4b!<@{ zPYXl%yhBtu+g2W#x;>F5a_uQ7!Qv-~XD}%kARAS4kg5t+4K}X zD|8TqqNVi4gUmYNC*S~xnv+1;djA{BM`R^~w9>MG3PSR`JS22IX%OXMFm=dVy2+Y; z5%xZu8=iRIGQx_EUpN@;751$?u2yAZ!v+{7`AT#ZIYhLjCIZGcnI0RcN@ z8@lFKf8LxG$PE^#(tPvl7zHB2I5@~RC#T&26P$9Hp)n7+H5%NbLw+^a<-O#@P9-Vi z12DX0;|UUq+9si_1<`SSpiV27b#tRN?xOpa|15!F9A}0C+aA@3j#Dg=e>_VkpP)mp zdDwKDHq@+sPa)!PIZGnuw1LFeMND1)iG%|V1}^<#c*+7iwSz?{tG@4HCsr1L@LCye zpT*q}uxGWlzOqDh!XU8Cr3Dw(hm1iGY9n!haEqalw@{nSNqsN^K8jY-+ZUsNQ=N4f z1g`^wEB~zD*~Txj_?9`QV6$aJ27z=LCaT^j*tGk}m>1{6oPcq|P&%`T#}tu^Ua*Cs zVRLV+G^xxR98(%A3UracC8{2)rvL;J(@R(^J1VD3K-*XD&I^9!kcsSvS8<~6=yb;m1P zP2d7n@w461sgiz$2_uek^uJJRP{l{sJJ(yBYqo+aSi|nA`H)0t5E8cDBw@4+eAR?@ z=7#YoTqp~z^R@ji7Qw7yO{l=5p3G1%!wYkx*s57 zQIYW2cdS@OT{RfU- z0)DAsYn?l)v-`piUf;;ow?h@XF6mAScra79LUFQQZI2$lh|tIrZtEIr<$dVr3L|(e zJ78LR5y@KLUdY#CWL!0G^UDZa2|*{z#9Pa)QHqhjBUZN-b#F-&_~1;t0GpiWO|K$~ zM%cmuwS7LcXc%7*B{!{ne3>+|MYvX-cypQ&Tz&$2@H9(hnxYeRCzsx)pjv&NVj6ncAkp zB3RzWPCqtQ0;dfZr=1X0B-=)oq``Qc{cu+FF9EqD>cMa{?pZ7Qk3tcz9vhNOf^Q{M z(^|MP4ci`IXam^hYIx{|l*&jiTn?>tceYhisqro>1j&~OjamzIF&6Y*&pGK^#`TQh zUuA!3Cf_@@Q%g*KOabesuhuHk$sNN}4J%s-(;h7vhHPO)i}bdgA1dpflbfkiCyMOR z9ciFd*JR+t1W_He!>jp6Yr=v*`tOwB%zv?lStd!w!j^ZYyIZjB{L#Z7PbeqNDj0zk zsH1gij&%S{Nkuo;mYl`~QAd#5>QGgbD?gWW?}UPlB95FQ$$UhvPr37~Vx)!3XWeNa z80un~7TlUmy=WkD3Fi<`j1;FP|H1E2PKG&Nhm4=-Nw)rJ#UupZpD4!bfB{NsZ#}+` zn^6}__@_d#e14ZIubelUj{D6$W$F%NHEtQmW<{4~5Itb*!&(8_Yk9gvK%cU1k$Cvf z$R&wEj}sAS(t@s!_eO|g-ks|=J$)ynX+@Z2?n+O##i$_L~B3{%rwnnh};M z_u33V#OOdv^Pxm&9z`jDkriG3vegu32nF-8;RKnOzTyt2<(+d zz(oHbkb&4ElPj`*NY#yI{BJKcvXQ2oo2UWO3nB=S?Qbwjy1S{%mY6;DSs0=;nb-w=hkl0h6e%8UCR5Z$93&zjBDFFJxNb5^S_@2MvH=7IrxA#JMPxC2J(CeK-&pzWd@TD(=|Ll-Iq)iTZkwW^@D?) zZZq;eM^-@qx5v^t3IMeTOOpIE*q;t#Z(Q#Wt7t9P&P0^dx{arXhM%L6wgjEH)}W{7 zO{9=EPT#JVjmsYr;*7iLqpA)7Z||Q{Fby0N%(Fn4b{4tbzM3R@&Q`}Ia)coVL52X{ z0OGhZQLnGtCg#|JpTBPNC=d;3^(rEkz2%w?#vW97)$&B=6^@_3i#xnFDK-f@9BD+> zoMSM?bR|j(N!oC*uIml$ zBkwd>!_sZesE#>^GyaG{PA+6!T}hL-yU&dm<9fo6gEUKwp|VHZ-4>KWL0lX`6gSam z>R@WzX0f-A$KqGl!AxE`)Hp%5%dCJ0--ILV!kPPfcH`X0v?CcfP{snil8k?u_A;Fk z99Mh)Gs^%>99zh<*y(u1>!66ZDT)}Zv89NG<2Q5U&L6MQyS~1*o32x{J3H*ZQCQ1T z=LU|>bi?VB0=3ZjV2v8)?E7aB@PMvhd8?-rx zj--|$1c`awoP!<9pn!SJULV7ij`=88-@>Mad1=?^FA(>9=j4<62_8ju^KCuKxs$+j zz9msDMR7rnivIz047Z7IsfZnO#jP$CSZ(b#v=TP7q4x6{aQxf+@`)zB40b)On>ShK zRIP0Glbe@-Mh9(ZCcn1lA4IMsQ8DQsM6TN5ZmO@6Sy0|xG^?21~dKBu*^?`f&$D^_^Q|qZIymP8wcS6 z{+P6af$J(?q*{;3^IoSjsbAQ+o_8aC*Hnwfihdjo>1|oFlZh=@G|{sWDo>8F*qiME zNb?e!qXE4oEJ`nJVOI^8gII<6s8DK)vF9%3@%`;l8qCnep2UJ6gP$gg;~ei>C?~dt zHI&e|YFlORq$xL|2h7#K7O0qxh0kZA+H1b%CV{y`2t0DA{pIoyU?hlOSX9z9*F^uZ z)aiwGfLJ@7^&q5LAK?J#%ev@!T3dkjd+PoR$I+C)p(iK2y^i-3e&yd7vE}VYWMwE( zN?gV(j*4dFq%J&EAnZR&Ad;OBfS-PF$ZiMUFUR+aeXKSGsh7*SU(LXikUe1m8%k3Cm z?t<)IAH^4j38#ZF^ zSw0}O*bVx)w$5VFDaOPcgqf=yrHf!N{=!*rJ`Le-V?Cf#+WwX8jRAwg;(i|ud6Cmj zM*h$jm*9Dp81cvQBP_JCls@f0(t-5eh=l1y)naV!(>O;J?W>C>N09L;=*_8N2VaRR ziHaH+M+?eSDQj3qk<|&?QWjSSWMfiQjSU+v+4pV_p|06~`wWR~n=tLJN4dft#aPr4 z{vBRPVUlU0sunRBA1m>JaEiM&*h`j2z4?@W9oHo^4?vO2A=wS0P#GN`!+tOWgZ>m;$WzKpque?wh5?LFb5 z3#A{j4TetqVV%%Rcc9NO*xes2MDdoj()RQ}ReRu5Z%sJ1SIB8I-HeNUlwvR`6`4|P zlp});ryC8-9nbdDPw;0GfABV@fKu&KU5-}-J*V1&t*ej}xw5X!Kzoy>PVlXR;h8|_ zxujr2{4!JVonEYy1*dIjgNh|=z5AtoV0Ehw{@1{P@x2zT{pNZU*92guyQq~%;*P6K z-dfIag4PN_0DdCDV#AkVSqsj)gGiS*Syjyw!H*k)&6z9~e)Q8c;;CF;8(5i0B($1Ao<;euM zb|4M&pwxa1K%`DwEQgX_IK#bHj8tMV8|jq9w&nXOx)-n^Zu;I>?(fcB#Tk%oYEuefS> zPQ7g!@J%!HrR5nlwfH>$DaZaBVVK&Qj2q?C6T_zwBIQz!=;=Z@PB>z5c%LY_poQ zS?_h+bn(94)N%i{*e}s%~AAuzp((WFmiz&Awr zWiKRTlG)Z=i7Mc{*|E8Q*LvEH8BFl17XK%`Y<(zWcaO$7fU3=wZC64_D<=eHaRXih zDR3eFS>z8qw%iyaMK*^h8xti4fm*eX6vH!D*LTx}B}>%=j+q>2CPh!{)>kn;<4>^} zH053AGUiZEw{DHxx65W2ZV6F06UGP8sLNK}t0hsdN~0b1X+q4DjYGS{+eo$T5XZo`$~QTd&%~E&w2Gu%O**7}wwaJ&`GeY+)wx<%Mrx%$f!w)lrCbkz zg^H)G(hR$nRM+Te3|37F0HV>0@LK>vP38xXPsaNikKK)rdXEU>bi`~)aazQT!BfhL zdUihDjE>Pb=={Q9@YzB-v?o^C%&4zQbssTAN@fc%mzg~kTDs1+twms`PT;d(+1EWZ z!-#IWdF&y~xzQ)_aLrMeOm5~M)#o!Xx}wtACeTj%m=Vn%4{7f=qDFU9dj~2mrlz(> zfbP>Iw*hail6N>KNr0>$p&3CwIJNOrX0zuBI=DbXn4s7@8mgk>%vQdJAGdhgpBD}Ra~_H?ckW{aT%J84oS$9Ibtr+C zTTL?7iU{H~#=#Czb^{D~3a9w-akjG`YZcdb^TzOtp-z6FJ7s$W zho!C$7Vu45jS(9&YcN5-tXN#LF4y^{&^`K(f{jS_q}5jtxW((jZ`g{~GlUH{;?t1K zr3V}sF@sT+2EmSg{v6Z6swva{7l1?kp)%yIJYIl8NjxNgiqvaO0pQC< zFiL2*yl@Jb6O85`?_L(G!!-BZ1&)~Ovb_7qf zdz*2KM5#-Iu0T%rTSH5<=$^B`<~0a!5I{_qOYH7|Al_FZ-{H>YcTZo<`esE zC0|P;P^zeu;s7k+0lSI3*A%SSiskefPL9m;0^2QMrx0*w4RUI_uPswAY(g_}ma`N~ zBUQSwlA=U6K+14gN11=Rv0HXPf^G~4GNdwjDRK|yf4D@Zg<)K(r{{E;@Y}+{o9=x7~*A2Ce(7*-YNX+MQektzWHfe>0*C<54e}T(lH1-gF`pEypK)#z)h>LcAu8u0Qlzyr3FH3KzCe*GIR2 zY!IblQs`Jnp(VsFyD0I^DyUiS1v}iCh?r`~y+>`k7P&0jR{P!3---ppBds&q5l7^K zY4oeY-uI*dXs(;Fdv{V3M?qi949L9W1$NaO4~0d7f;2C+jd7KesFXysQUw=!Lup> zpI*W8y!oXcN<0uRXGTIY1PIwvUk`!Z4@2}6i2dR>^Gxmne2s87JLnXVu?7QDgm2H^ zFVSgzKrz3GG#+fQvq?91;+n6eTl^*0rt4J01!Bl)PE2-7AEj+d-REa5I7}y{ZcU)-HBhq zHxC{ri}7WTPWU|En$gN(^d<SM>;|gn%lzKMS6KwE}=&w2) zdbUod;jsska8OH*>dz=MXc7;9L+qe(sk6pztaqz(mrYN7cwjI7aSe&qSLA)S;`l_0 zl$R7r7H*hFefm*pJ{`kDk*&1P17Uwndde?`odOnjT7QVw~}C)Fd~1{|~%K zC%7V0?fI?LvkMtN&&DEo98AK?OOIR4*Pgjp+9Z;9WjZSBi7zqo?7wA+5i$4eD*y$HXf=0398*w`2g_j1|Bjuj%PVH803`Jsxc!;&#cXoERURCY#`7S13 zPdY{XasFe6bxO@)QHN5~w2n!9syjD3EB8$dL^rwf5bqLx7`Csx{@jm)Llw7O*v`ep z%?>~p{ycPLU7h;V8ayz+kA39{uD=|HZ~Jda^7gj=FW|K-ix zd6mU+!`AZR?BO3f=|cB*caQeF%uGYt2Ie$9(H|`sNsA)d{F}I&5;C-U+RdtO|mYx@Eq zm}p;eNu=y|O@0^)?k6UHcx&~EA5V3M+h|BtJBV)%o73g&NUIU;Rhnc!GU4hAIM&w# z1cj{R0ebO|+T~<*c;IUHcXVq?94VN#pFL-~-oRcj5M66bj*X8U)lInaRZT-bPH3l@ zy=wjIFO=2@J2K_e6{!K_)Ah;lZetTfLR(U+EQBzk3st%xfjd`vBS|JvWO($?YFxXF z3~O!vQEkW0gZQidd`Sy*`SR_x&!NQk-^}s9nYWS9PeX%x`afn_>L7mKXZbuda(hVl zTa2PH=O8LN%;>=;nL<7SI~9e#AKU_l7BmL5*yz}i$Agv17&qebk^y}|4uw_$#6-5>|Z^I{hs!=x5IOp-nN$3(SO8F#@~s!j0MHD z=h4f~@L?+?+Ye>v1U@M8i^ZWLLdbMe{ey8ZpSbs|{NGLbQI@vFhwI-%T7VPT+VfQb zAnpY~Uy~HAc0+X<8dY^)lC7kE%Dt!yqz28ERCwq5e#ow!>{N59S zL9VBJz&JngVrbpFC@F0{YPInXx)PDoZf{{ZKwO1Bt-A|K(QfKxOg@eh4NnSA(c%!X!Q->^<9wMWls8_Yrp2wfBO&C6!N?Cg5Bh=9hYZSiMHQ@R#D3qz5LQD98+zqT zA5cg`7Sg6DyEk&1+OVsFAQx~hFaoK`xS|=R89SgM+J%spS^ybyf2bSdg;L(pDOYh? zwpk`o32U+N^Ho)C|0e&Juk*fNl|3pyr+d4 z{p(H(5t80o5BY_pQ*c2g6wMt50TAn36+`BW@aFjGf-72;5z%gr<@^+Y{)He!szGNI zI4T2S?+1qlL96>z^+h-H_)kV`tfHbDyW;b4rm8%zPPRF)SNtRC6zXO4oT(<~!&B7T zW0*luk)&C3*S=)cfBIZA|Ms~q0DZ2fM<(R_akr6xqs7Oq8xALhr76gVmcA>(ONyjS zoM+~kx7=_?kZ6!<2_|g-#4r@r-;5sB-d%|9#w=9IWO0KooF;03Sk8k&Sqsshn9>8$ zG`qK44EFhqQm!o#Alg|W;$Wo)4t?s?@NjTgo~kjr38ZK7h2i+lBPC=%#8vannBUXF zJE^BMm=*|~vG4-o(VYzLWHtI-eQX*Wfv#_pM*Xgo)v6COz{E*}sZwit22zSlM(BF} zZa>ay25<$!F|!#jYrw_|4H`g-0|vjrusfJ|POA~pQX$calrZw;AYub4gzjT&R1?D` zN1y1uvi-l1W1Hiu&KMmi=g3;F%NCm7!;pt)44-a?9vZSLaB9w>42)Svr z*=~%JrsDgk-cc5>P>*Q!8P&Ipq!<=u>r87rb0?wutPzakoyz;(w*_qBej7fttJ3$5 z%)7B2qNpF-b>!}f+Z>@YjaTTcQY_}$LEN^c(1`DP2 zeH7I72C*O@aHr0UqY$@>t9B_T!mS?k15c5x$$t|dpqZyc3+Ngo&5mR*hS8#uoud(F z*<{}{&Q6$@?Dgs+BB81fioebC6UH4l?d6xYN;sMz7806E6?y`BSt)b!y1rU0KF`3a zUVLFo=^@aBA1N89t9sP(^no$G1n0VOewP=Hq^DaP2sYI2Vvdg{V)_QW}7uL=_|X1!^4Vi#GCEk`1*o-8Ur1n2x85@kuWQL zY3^rNgd2+)%^X||rp#Zcgj>Kn;z(8?=5S)LTr}tUKA?_;!H~8FWs!i`p4iG1d}MN# z*Q$Eias2qRSKUUIeUqOlsdG@7wl{AYBVw6G^i^o6&vPO#K@W(3^aL`yaY8u?gkApU z9t0nJzzkG@*t?`OGZr6>C(wE=^PhJ%>_=0c{OL}c5zjV&7}1naIqa}30^c--g008D z^jd0>uVhJOT%%7h)nrP5lX4z-e`~CIT4`>mSu!#PH#DeQzblW06kaYKnL-Bm+XPETNMdH%t33G4iTZcUv-b2mGBtfpb^0;G@g3 zgrKOQ-p(>jj!?)6j^Fwc_Sa}&^_@{+O;|SWo7nBnb~<8vb~~H-B&&z?8=FGPuIniJ zqM5TWH0n(hsnO3+I7nf#ocys&H|>40G+Rvdtu?Orm+&-W#{|!43KS!5BD&{A=PR^!U!5LResA+mj^9YF%NAn5GBXw&NSgen_f82?Yd*T%BuD`jl+)rrW z$xrPA2(|Tbk%sxO=sCJ@QAxn48iZ7v5k~#{2-L^SVfWcB zEI*#8mFC&mF=l!YU)_IevFUppA8q;&Tz@!B2Uo7U*sXqiEhR_-><>1nXGyj`WN#=U zyus7d41DJ@XJ#Sd!N+7F&-(jyLXFf$ib*3ftP~yt{TmRs0X01fha%-!!kX+fDAkroqU|bkzm?N-Mn;(L{r-ZRz*H_7Iti$1)oK@jOVuhf z|8gO}>)b7K{mqQrvC}VYEfvepy-*eo+spjmzmqwBJ*<|1P9fF%Kw{@B4LPSRqzq32 zby}Tx&v`#QNCLiv6;>hX$WhNtsA}Pao!@k#{YQ(1H5MN=1hvf>u3)8NKCdUvdOk~* zyxna&%clLw+Q&@Ic|2;e-NkM_G7bcx8S%HHN2X=PcEcA;UifU#+7+B6Re6?C4hScW z5MxJ_0l}MlJZ}N*=IUmwv`dNsf$Ll(E(LGk0LY#~`h8m~8}X`KXCF;nP?_uyzD5=| zGX^E-kN~8UEove8m}5NGP=Yq;L=4(81Vu>rLme13=}g&zknxnMpHx4n9$V?9$nnM0 zGwidqbg^Q;Tk*_lGvl_w#>jmkqYF_3fOmwt~BB{1nV7U1vJd#zaMd5n>BaUgHsJM^xZ*x#LC)OL$=v%Z={}i5sOlH)SpjY+MNb5;xPYS39Uuz z$_GLDqJuT4pkwj7dD?v5UVXjFx`?uA-&(tOcAj3pwp;&v;^uD36JJ*h2h_NggQ=(@1 zBL+A<26=zGeH|$8zaex_WEsq|XsG^vJ|zW!Bd|&sr0P>2564%XKfY6V&RCmo?!sGB zuQyJy?~m~&CIM0f&RYA6!H-J(qb%Mp6X&tdCyG^VNO|Mm$Mp5&8p^}1#lzVA{C~2& z$J+*T+<-QX_~~>rY_&c2vf9q08UD9dNS)0Cw_P(XytQf@N^SwEFT7NRfb{%QI~aY{ zcHEWrzsW*-GP2v|n~7u8we5WGzj#Z?zQwF@`B?p{8y7OqE9^l1oxicTf$GIJ?X)z0mpq5d*}Z2qV`&5B*=oC%V%t7M!11s~)a(Bx3yb4}j`>-pyx}lN z1@|p7GPFDRpgZl&_#E-I!7w6!`=odOjaB|&(8W&x zY2!6NQVIdRi+9e|6&19m4;7x$&=r<+VI>Aum#hczCqrWXNBj_D*xAp5#TQzvC=$C* z%UOHP2XG~zkFr?Dy&^UFywcyXy9f_w{hK*+XJ{hL(B_t`8Qvginw<3bR2-|vq7I0N zIQ16du>B|kpB+qu3~9qPFSJDi)PE^+!zyUu3=?hcCE1e^w|#@-vwhy`7~P5;&f?y8 zj2sq~C?DyO%z?o85GApb@mi<5v-lrfuU6TwQpv-Ir?8T#|5t5*k(E_!c=3Epp9sev3D`T~ce zKyELz=|p?)EH2x8YDBry$>a_cHA*JR=Lk}hg`AK$8tnW|t(9CU`px2Pw*PWjv5^7r z{1_%UHve#04%z?5Wod$9GR^e%`H?FaB$NC@X6Z|76d1tOWH1=q05(TN9sOWU-4n9C z3oJ!^0GW^Q{(X8RK0Tqhv;sf-Z@b-%>8scAh7xYhcL2Ld0QgO0dbnpSIYv*m&CBI5 zLexRaVj9qN0Iq8K22}R9^nmvTiOq~F;j-ArN3aRIxD~b`Y1AXK{J*>k>wH&hl z2hrN`{r@0Z*vZ)}wlcuFpu=-5e1HE$VF>9c?+-aWe|!c-XN3@4yxzWsBh-F})(_It zwJ1fiW=zgbQ2+S1dzjPcb~w>yglHN|7n}xaDAZYpe?VeY?oK$557yX`Au!`5_Iow9 zKL4!TE>qJ9R}N^^;1a5vz!Y1O%;*R0;eI7_Dd@Aqw62g0A(3=$P|*8!k5pD2CzNNP z1>@`pI#PN`4(70!ju$$`Qd+Ozt9o*}!%ue@Gl6a*rqFuA%9+ebaciR|DK?g22oTM>BdkQRW`^ z9Yq4Huq(37L_N}+XLd|K=x6)WTJ z$dH+;2A`&2)g)(WATSHmj6>ZFSZHDP_f;0>k?xE4)u_eufkp;>7B2M6BW4F-!s##5 z7T{tW8*tnI))0^M>;A1F!q0I3Zv51AbG;y=Vf(a3FiS!)WfN2y2J{d$8TmVcv|Ntu z_z$@kmba%TJuj!c!Scl-mG(Czy#72i2aPiSW-i8JW}c7^Oqs3Vq6kz9pZ@4KyNaBj zF@?0V0Elbule>CkH@fO?PN*AtYMvVm3M*qXOotva)licbG)Zjlh(5PEr!fTTm`1T7gETPrWV*WyqUi75_v&KW z>J&9GV%(=+&d*8Q9(*fA1EHl32UgEd8cjAMJk)`FsL6|x+J+qHFP%cSQ;Lmsyb{#; z=F>r@p6njQ9j@FF)@oDnY$ZYV$Lems)-Kt)8~{bcEx!L05o-?GuS%^V>G?q{!c4ST zwb(NZ(QNSvN1D*7$+ZM~*-f9#k!#IMrhR63e_=r~%#`Otu1#OSbGW=1;cnJ6P0WG5qJllnXN@VY2sDSf!&iHPZ<|5|PC$EPE0*?ZWJw zOy;RxB_Hr5Nqx~vt95^o(?GeoLxcVq+6qo(inofY*e$dFW~>o;-65c}SaPrE9A&yB z=7a*>#Gj@aq-NpAqUBS_CN6cKt17NUS=_CMn@MkWO%2!-y$+V0dksz_PF^D~_l{Zl z6sBs|eX*zEappmA2)z9hM@%cZS(z?!r?UgeGjIFgt4x8Q&?fy^)`7(D6XN0wps>2% z|Bu3&#Qo(0ps*gmefvq3{>~^N&$HbMgi^5-{bE5-EqQQrz~XH+VV$YBLyzwG6P;Nx;zm(UvLgL7O*<_ErU&N{ei zGh}BYz&QsVOdTaWf$S(Bv!>K-5;7Ql;{=b>HNS?Zcdxb}Y`WxiBeLq!jIxxW7A?{v z$(~tDQGo?0b$13x%%32>L8H+9$U_{72-$xC*?6@FZzmLqYe2Q@w$Q4SyL!vVwlMyu9+-xEQ`e z{Q=Y2!^3f@-Zlw5$K$SJ;0>rU#mD^}{msJuJMEGN^@XKpZynik$c}(@f|q|7oK8d_ z|KJ+hEaUC45(e0Rr0Hfai2G6htM4}Gv>5R8Oo_&RsB^f3SJofNKe_wK^=V^bcj%@) z?@T`ZZQdDM?5+FSa}nWWC32%td2#_6z!fVZ&lbh9%ve0sY7awP}*G( z9%vboP4l4 zt%R*1EEz)2C$qusLXy0ltw=)4&d#o`x_liLe@0ofpS;d?yL{GPSFdQ}@Z^D%0I8Zv zf|W>5Ic!F$Pz_I1*DtY`Pp{m<04nWzv5~HUTIr{y2I62sDhjJS`cpy~(_30PK}X%> zLKO`84`yDB3Ral+!Q;dfoC?}*tlt3usJ*n8B7J6y_OdP>eow9CtYCgNjPnkQLhFi^ z{6|k>#{nwPSE8;%-LAD0J;$nOr_geGMSw=_xQyzZ=3%*vXnOb5AXz^4B#r|Dz%L24iX&;=Hp9Zj_(xRIg`>oA6@F0zk6pK*`@QPYqw-=6YtA7bE=2Fu3xJ* zuh5G0p{`xpsMXj-WzJu1;100+76}b0mPl+dtJiC&xj7i!h`BX)ljS_!OqxbR9G^n^ zU#w<9PrswxAHwW0C8=a|dsQeJTU9!t3iYql!b-MrPkQ&9XjlFp0F*#$zk5Zd(ooM8 zW1U94p{4VZ6}z9M0bYsQS;|6KzGsfs!=>HKQu-9pwMX3wSMFApuUu~Gha>Bumvtjc zr=Mlt$5H?;;Wk#lCLZZ7mQr^KH?dUC+p=eq${1U}g{9YDfjd|}MA%3EF?oTF=O2@w zgbm2Ei)t28X|*iHZ%bKfR-Y}UgyFBHe6{*&DMebp$rYux){SrvFO8O?dUv&6D>&f< z4!3dc?)-)0jn5ymS@rE%^3$3NnNrp6KBkmu}Z8xfx=}^5u|fr?$@pT zZ@$~r>Dk;DUOorrg-Vu}-!r?q5$|4FoWb5&UZEB|wN$1P;H4#%XzezeU$4gwnMCOF za(ur6;0ODCskmOwO*a**wCMb^&l)$b>%G#w%6tm5dz7i<|EqYDHSfWYy$eI456Nbo z(2RjiQ*h9Z$;yo-g+*^?Gp1h3<1Z;kU{~^_H_b2PNrjhk9Z#y>`ei)1a@4DMQk~`( z@ub4R*YKo5OSy!{tFP`AymqCVkuKn=RYI=cNu^r2d?(e4Y@V0uv|ymu2eM2P{VTh4 z7lfhJG>8Xd_|=%iqg=O>@e)DfyCEOPAf_|a)7=2SGr^u(Do~|c}KorXDOX9 z-mavzfT&D+S$2W|p0G-m0)3s>Yz0fF_?5O7EbBs@g({Y1Gf*zm>39sjN+%TyRz<2P zZ?247$xCz^z=(P_9RbU{K&QZ3q3d%pQkyI7v_33N!L;;?bJ{FeRWRn7P?a#+_h2b0il(6`%QO-JxGIqmb3F3A}%R^y7C(a6eNkTY+bkhD8|?>ftZ-Q0&%+fxt9ltPu!ZI-nUITk z>h2P#H2JkJ;pqw2cGI;Azr0bZt_9wvSwhMxWPX^$wOZG%Tkw}=je2D)yWd(i z3qR3O#?nfqItpLL_KRG_qO$T{52Cb|?4@^QOnC6#QQYUvD=kS&2lezf8dM(-9@a5a zq#62cKup%o9q-0PV$fo_?o>yeLJ|5#dyo$8ziBY)Ym z^}zC~It5;Z_z8O=^-HO2>0y)RvXLes9hh=x?dyeuf03@BqNVl-u+`&@my?Z0&+xL^ z%xCU2dZjH)o>~0zhIEw`tQ zWrL4UQK;M}APtE&)(x!v+W4s1JZmyZ1ws7X%;ROpp7sVM1eGp4NkN8V03IopbET|O zlVj`OohlrjOU3=s=NM!O^n5gsufIak9(S~O)9#IzYp{%*e}}r1e1YQ>;AkcKm}6|v zr*4yKfJFh#q&z@zOSH@3Y9&|5!{OM{Oi&L}UL|-Rrwh=hq=eaq4mYsP=*XdIcpZ(% z$IC_NZ`?W+&8Y#l7s1*Khq86b=Wx4Go3c~AUlF=uKuHzr&S>1ki&T^8sxr#HI^Y8u zTw`94wSp5z@c?qPbk5!_uc|(G%HIRnoEc+T0nmK;dXw&fi!{{yc(n?Wp}%tNOLf3L zl*r9+gn2TyJYjLQ4+(bF&^m5?6-H{?&Y>)CQ&h{D%@qp4X`hh8N-HvtaCl~jQR*Fn6vdC^h4bW+~z|>D&ONbW| zWy7+)Z42noQUUrW(`aH|(^=)Xur!ra1uhXZWhv)AInOckn{E93vwLBRiNxrw^5y3w zzO%B83?qyL^Q5H?1|P%Mxo0o&veH(NPHK*`AIqWJ=J!_`m!_CQj{D%q!e5WcF@*kq z_TIg_i6dJS|DR7$S6=sr9LqwIjcvTfIbjG%I71*4m^14xA1_&!+Mq#}6g`Z6Gr7w^ zz$QGCKtg~dkU(IP!H|R@1{?S;XWX*=SH8lpcD=e@-D(M&WHR)cAy#+QuGg+zyY|za zKgO(!NqcM8AGqHiMWc-49!20M^y5XaxxDtUab}*{9J-IrxpSX-o@1XqT0QpIy?W1` zJK>)D(4D&{jpF0APj9o3FAydpO)pF1c)qAH^eBpgXV9aw?wN~}SR=4en>*i0Oc(_u zxh8WL5miYypi9)ruaouqnNvvr=C0d#EgENUtvvp4W%25BxO9Wd8DK76nbqsJ|Ml+U z@<`m{-!DJ=rsAkzbQ{yB4g_14!b*N&|Ibk}|I{y%3vT>EIwz57rrD= zO-ytm$`82nr|}uL#tr<|$rJc_U;oZMe!K~VLN6miqwLhFarz`XlX1=cl;#jM9p>Fu7T>2%7W56q zAWF8Hks||H{KQcz{`<>0W$~t0{+T3J3Z(TDPc@M&HSfwZ$%WMEX;&W9o-C+6IgkNn zJ>);nB>P!6xsL%kSl$y-%;Puj8!6r~7fIgW62)#J#0e1Kv@X1PCc%yR#yEQ9^JXLT zO*+A3w`|socgMzzA_Oj_*J*l}{_J)u%wRi;EbFSTmp<#+bWt0mhuQ!g)C(_y7hVMI zdlCE#_y@c&?B77c{$I7denbvW%cFi)%T6{wp2tT5u${~-KC-dq>FS*Q>zb#(us-*( zKJPM-wwVRTdRT98SWQ@u){Qv)FUdyULlW&yI86(63)N~+Ez8=>$;PR_HLlL!i#-(q z=raa5q_||8^F`)W6&Ls?@m9J_ooWo3u$_%FS*n;mf0i314Yt*4N4WTBoqA&L1z4}$ z1(!IyN_TG190RNKcUBgU(Lnb!PJQExtBuP?@d7R{oS{KUM1q6H2RGL)d`4m=opg^w z+5i)m3FX~Ws_mMQqOwAPGG7frc`))mpVOY3h=0lj+nCNuLVa(Bo~K@9KZ zG$G3icib!2WDOh@Oj0rkuK9~!;_j}_oObUntU$5KI>eQA?bG?SuP(h% z3%^hczfcRmPz&26buyLq&%ygbi@J_l%nedHzR)y=!}wpb^6r<`*!^3n>3&Okxph*+ zJr~W}I;z@&)oL$PX#Dx-%z3crx$`x?Ui@?BLmEy~=H$c$g5K`o)_-!QarGp>aNNP= z8>b<*Exv91A__VZc!=`t%xOEiHgg#QB9k6>`kzNHw0SLhl>PQ(ktEoYvPb6bQNbdD zJG(`?7%aVfA$#0N+2i`k1xXMo5wwz$L7;&|f`SR%>eJ8GW=~W3icAtgl1b}^oestQMKL!#YNxlGSd5%5CWSW}pBQ{wOVx9z2f$XIUSfDSt^fQYy@rnH3 z)&;gc?fcgF?>UI-eSXe;_<(LQ8iN5(ku+oUr8%q13pXH$I++}Bqo7lU=MjSB56c{| zN4iIZ?0MzN0tuxXEWDJ@1w~luzX1MY-5OV-XwXm-#Ig`VVzwto^}L^d34yAYq+}c% zL94ZY83`-IV7@xDvN$L6JU=g8rduAsv9tsbqlFKWo&jL$PND^|^g<_&wo=e?;s8&xuU%h(-^2U?IB#?OI>3JZEn0#M< z1d9HTPUBmwyL66%heP}PX%Y$U?CK2Ty(k4GkyD^?OcyuYHa|!X&P(gN>fQ-d^Z3do|m zojgObtpEt{hEj|GAqpZs@d+r<&fYOXuaRhOjcY%-M=n$A35LX^3iXS_3rxGD2YrKW zarr#n4WiSyJ;>lF?LED{02{@Ckb40UAMtMc{7W`X|0W?XTMa3>F+m||0h%+z{pp95 z#d9>vB}JXpr?WtINsWKQ@R2dZcNb4I0a-VIl_r_g*pB?^2fUfx8=pYdCgSV2A(18Z zn2?C~ZF%v`%KHzfnA93%QV+tYo*;8q+{H&+Pmr5JEJVP)6Y@=>ap{_S?|sO|K+>rK zPEape8W{VRS0Gt}7=}#H5vta64#}vjP~+qi5E@dc_*dTHQGJhu2Aw|e(JJ^CAW zZn{e|YZqo3UmPKclTZG;soLaP1{ig2@l6=AuPuJIdiSQ#O;BuJC!JlB&ajxS1oqm@ zN%!0*G#`|=+XR2-t9WZi|Htc^?3#^>YXU@>pj>&iks{Yx941Px=CT zJ{rUqD#I5l!(f#myBiXL3ebO|$B+aors9I78(%o%o?sU7?&(Ju?=2!{!N8Ib;NQBl zj~f?(6fG~@Se?NPkw|7S4LoF0Swr)g%-{dto`3JX#KyJ((JeA^3XgIo*eBC6H`n{f#h`|je<41 z#gr=f1LWl4(q#+(>3i*IQ3>_JFXNg0GIWvbRWGs6^1`Xc$y@kDfwK=zyeFUH^UXsC zmhslgpQp#m7{Uexl_R=GtsgWW7f3cN34Q#o5L$mFj}f>yshAu-#9eE&=C|l=5-P5p zP`5LFgfJ|?)JXEGUsUfcaq=(RTG&MVT4n9)AtF~{yl`uI;nuQVV&fOV+kOSI+jY!( z@QVw6J5b}t+l}|nVJ4kg@-qo6v9C3n1=|5O%R>AL838b_$Bp8+#}k#_3f@X*yJ+x} z+t)EywY;#n{P<%qrSsC6f_4Am9rwf;%8mK;em&ewh~3EA%ttGWpYcI}tXdyB>0gelY}o?Y+`LYyzoxaa4n@Y_!g z50v@DabiSvW98NvD(GIh^U2D^o3dxe%7yQlCE4-Y?)Nk9xl3!a=g{cx65e5NRZ0`( z+9+zAz2(k5pi*{Gu^Z@;R1EeG9AAUfGMid{E?shu&q05d&f%^~YqmB!tEd|#;{~E_ z1d%zhs>bKl@*+NTE((VedR1II_E`2dQSXdvGe?lbyx#!XRR?OQdycmmW0c|FL@Z zC-=#xRA5KDA-mY%sbUlJ-6fFk6XE^~QT_{2KKjjx@>xa?V<&=V|C8hH2Y;jX1xhB5 z(Bgl~?JtZGLG$b228qaD$i>%DF7AP1>&hkC7S_GVU6!}N0NA@00*=3QUGAbLT9+Gd zA_WO218+mLyah4^>F#J?|OzAt3H>m>7Cckylu(p<_Qm{*WzWDuCCVLFq=(+D^2 ze{&Moy36GY`uHzMAOF=oJ=!3T?K)~a6Omj8&xv1H8w-R&juz_E zuFn6nDgb}wT0edt`NdU#q$7U58^0p`-TG+2X!Zz2LfN=B>puOIC6OTLXwxA{HSEe7 zKeow8iRy<$3)%q&F#7efO{*xzcdExTP#na$kwDd>>xLa9#AFn-Cq0++L=q(l?iK_PoxgR&U zN9}W*LtxtSKQmV;E_64}euO8W+nt+TIXU}3GoK4hF6|OD_w+k?R=Y7E8gxkkk0FBA zQM2!t0lLSR)~-KcBV9Nv+=M*f9=5lnNEG|grbhHWhEpP{<-@2DwH87u5V-*vJRLyj z@L{0CS|5#~3e#Voh~(#aX%j_*G^&u=*9}%C1~k^5Dp42(zLGoq7hxdip8Iy~2qsb| zjDRO#MYi_&7kDRv6&Ule74&3XZqdrBKI6qvf%QyRM+K69S_tpB_VPn?Yqi_BdK&fD z%j7f&@Ol`X>SOo4d8&KaGRijYo=s*g^vy$oqy4lqazTP>-*s(A1|8HjA(1onIqi#> z_5H~O&Xz{}+HWGI)>_b^`nDK$TG|nJ2Vy{I`L}`NAtm5`^_t7TjhdEMZ3)H=D6Odu z{j0a3JPb3)V2z^KlK!g2FpX@CzA;S8;j=W#t|$f=c&H%W=VPD_)2B{rjw=R;CcAGP z5`cz**S=7}NVus)*zlF3Mfbh6bLXq*&AqSg>_m2LE!0l4#s7=!?F|E&JxW%EY zi`fcudWe22rwdumTn>g?hwHAZAEjtCV6-t$we81#Jv=wL^ z@|WWZ3Iy6wB1jNdTzl=Yd@)<97i)kU^u9J^T&CmM`ULRv3eJkf(-{eS@3(A`m`5wt zg|Hjpl~lcRX?(ouK=p*Y0P1Eo#x$sHe%4(Ik<19^iR5X-!~}I)o7ceM`;!eK?xWr? z8BeK*ak1{`eJbX>!CA-LWogAeFyG~}e;BqkH|TXpr>U$Jw)Rbsp-tey#uS0$^nut4 z6Kfrz4_uYK@eZP>Z6v^X*!vR{RI^1(jlkJy-C{#QUMpe_1!JaZM0pSb(O+Wx951$v|$g&tNt?9aJ=Hyem`h;zK5wr|Rm|00%eymbpX3@5;}z3uY3$hHU03*jDGr5z*(XjQSj zenx-12Y!Zg{CjA#IbLDCqGmjs$()$rHVo$Ey6bB$=L32DjOBO({-$yY_N}j>oRxSm zGdZbani-qiq0r2Z$K3cWRvVUDGPqyBh>bVRO`Ob$X(n*n@L374TtnIu!e))i_ttaj z{HCvIhXO??>)Auku!k;dPh81!M@cLgZO~+-O=#`hQ_r;%RTN8S-Y(>pxi z@DIdWFM@fAZ`_Vy)+#BP7G|0mh`uo*Oi{!rJC(Ndo!@%vMkEk>o5(O}*`UFxwo4l@ z64??6FJn}Eqh_*2js+8ia=KJ3Ow+3dzExz3Ii#C1nFdnmWt5;rJ-}Tk*@^x)FHBE@ zN}8pz!sN0{0@Orq256%VjO*^+bCgqB9zuBWu>Pbt_%7 zfxcL(e}fiHX?Tb!Vd5%xu3Erx0tTe@0ENz_UN3U())+!f@JBZ20frDt2luMUO)jR+ljiFkJ+<_4shq#!7>P3twS*KdX z(@W*NN}D^UvQC+#f=Ldd!v~!rgs7~7M+$AunHY0&)tH3U6Pal==Ael@ErtnvCGYqF zO%0-IX~IF?^r8u;cCduuSgFd5*;NkE8?eI+q_GN8R0c*eJ7wh)|sB zmP8_v>`o<8Ta(E|Z?-fsk*~$cH#F9~%%<+5a~NN{3JzHY@K)*T?cLN(g2N7zJ6zHAUQCQ$T*oys&{mF`1$uTJIB?~*-}G`M<9|_7^8x8$`ozZQsA+jS-{{jGik@W(4BOVxhc6J{q=A+E=gT z@%d1mmT>2<2&D#47pWPK$G6ZkWJ`|a1==Ex5&lM0|0_!NZbMy0)pr0@fSLB{idU@i zU(vt--VWMByWc}lpo+2wLFJZ

-^p8FyI;*)2{;>`SjD(RCo-XzXBm3{=%W!&pqR9c6mDb z=GbiU&eUkS?)$ShXME4ajr=Q=u#$+m1OhT@#G4JAnrN~Kq%WmD)XGxfrF<# z6a=RHjj56^M0IX#rMMhObCM~hMT_kKFHR zB8}lOWL{TeszW5SybyJc(|jVtl0#9Z2*AzojPVeU7uEgYsgtD~Sn3L06nj|76?^N< zrpqqZn|0@#)@i)!kd4cM<))=s6i8NH-{15h+)qy3K?-+3pt`1)O|Dl~opHxJh$j=$ zlQHVQz6NCwgQgWhmf3uGs_wmEj`_zKC)HBbfe5e&P5PAEm^*{y04NI|lS9S>1EG0n z?TD~nuK%bSj)z1bt9q!21Hu26=YN|T2!o0i8^t4vQCrziTewi0-F~_34yJhcv=1e{ zQM|I{U_9Rz1hqy)@be97%j|Elw}b-+7mqrc)d&>6jhA3IcVPu`5-B)uVvr}>@~Q|3 z{S9u7=m~%2%kB&NGb5d&y@+WM{des>7tYaRKuix}&yt)7zUkxMIMIOXI^{#Gw*TBSI!~qZG2yPT#yO_OdQm zL+FOwbf87K#jty;ybgbSWZduk0tKf5nrYB@7+{(d%(~Fx*(>6fi8$jDjMl^p!*Fe0 z&D=$@lijs_vSca z4P#f7%focwoQ#p=8y!M5U1DvO+axAO2;98t>JvU!fZF!pWR8d%3)5JVbHTtk?W1?r zm33fY4PESDoEgLF6=^j`0v>AkNZAKfztsys&XWPl=k#(nM%>aRb7#1<68>w-Wz2`{ z3aU3sO{dIbjcie3kLbnorRU65Y^Lea68+_HfP zU_6A2`<4c<9hFkks;%{SP%i2NsMPK-mEBc>`1*xUo08|}iB2mt=+zt{A0HFL_oEc+ z>0!Wvht-BWl&gA}{CcG#!kQNG2HD^5z&0ZGTF1aX$3c{a9Z~m@{%{xoXBc!6evN$g zVEa(ALdy=tDvc@On2wBaGs^yU5{sI$OaRA6VNB;-CfC2mkeINtd^7~Yq{_8P#AVBk zeHjFiPj?-*Lf|6`iTa$2ZUg^zHb{w7^9v=X$!*7;3=y0MG_uhoK_&YdOZXpa39b^u z#7_B9#f}l6HeTs_`i6MtDTpf3LV^TZLd-9PWhMY8@E3hC)yAl0g$JZf7WU=13>AI~ z!Vm)`>c-qUbgnfylWUnX^2e^Pb%M~RD;X-VY;>UwuCF$3@u~C%ZWrk&E+#iq#`9O% z%-O8u!+?Fk!!$>2DWg;pdU$#mexUCIe~pW+Y|d7KY*SQSq2g#Gze`Ir-88qYun}}5 z0*2&wAj&bqWtF%2#~{v>)z*BcdxqOT&NgI*Ei(fZmH*iN<8j#dACH5P;J}Mv<0s=)%Qxg1c^vC~r9Jaqb4j@;UaAumD zLDbxvP6D)9wu*;IR&6d#+RaV7LblC4ljt55RaOO!9g}M-#(XMkyq1@-dXomfAi5iD zI$8ORgfD(YDY8s?pYd^a-|USSy$ydqJ^im8=K&>npVaa|JZA<-#AWS23FW1T6Zmr3c{p3mR#FazJjC3~4=*A$pauzg2`B6l0YIc@Ceg(I zXdXbBfV=cagn3J?+7J-zs_1C(4r9~Bo85C;bYAn#M5KY@+ zL8z9rzhQ^O#_NmAD6xSO zs`q=gDw#Tw-15u6$rNy)z^lar_U5|r8uO*K*5#9#&T{XtFzLoUwma8BQjHcPArl!H z8Ji%oWjw$mNVLRd`t4hIL4Er#-vVI%DohoUX6qlWI*|H-j z;#ce#L&b??1ZjbxX;8aNBqIzjvM3Bo);gjW79zXwpj)A=EM8R75m0(TaBk1Y*lb3! zao--p0)_xe?a(+!JNlvSQp#oJAk38+cJ7;^KG-Srg1k+`lV{RUdQxfszkU)AR((c; z{Lnup=t3`2i?V>k3dIDh&W@3zC{ZO(jE`-DO>H1+jNm)d~Gn; ze_`EVCQ6JMSZRNRF;agPE4twg#NxeKCNeA@_;b)|R$^_1sSW>nS?TkhwPK-IB;2r^ z*dku!5vpf7lQ++SQqadXtdte!xf4a`;ECuj56&W%E2^_R^XPQy#9i#b^;C2KbYMtl zf++AVR2xlw*s%oF{I%l-VfX!$4f=O$bb`s#16W*%m8`Jn6S<0D zdCP`zaT^t2(>Ko_8(IW0LEGjyoSl1FRsu4k%kJE{JZ7Yei0bME`+L!>G4fv*gD=on zV3e^nkf}i-Ct~^U?9xi)9j{lrQOE}HShW1Wla9*KY6B!5USvex|#RECi&l)}*IKFW@UB@l$ z^mC=1ivC{UIf)tT->ln3XUdYY?L(g;7HOO;qcR__wvQbRVCrH;aCC6hiq3m~^Dtr& zB!+%J*^u#=Z<}%!!<3vp{h`6Tyn%PVKvVEG5DKy^PFiLmsv~-mzYa7Yx~1A1yScT& zn%_n|HdC86edAoSnmU{Ygu$N8H?Y!_YN)zO>I~-QrmI`RZMVUajLS`hmHQdhRivsc zE%4x4b0j=1;${Q2<&PA(i0EYW6peM6qqHbzwb|JNcH4#Ic53ek80yN_4gj!L<%Xl< zVp=NoLNO>pXy*D7EzTbPRK5Xj<)k(7zbUp9NT!2HFGwbX3%9-!IiCnLj^-d$o3jd* z^AqFNO_~srk*Hw$Vd-;=dHj9^E?&SvVTvCqJSoSoF9h(NAwO6l6R4D%L-?}%d_8Q5 z@%q!1ho9jpIN%?tzD_ht*ocg`7%@jWTRcdQOP;G&2FI*zrpS(0cl;;d6Use1wcAVQ zSCDCOKF}jW!oQw@y>98|3=vorSG5r_?=@vIC1JTkvy_e4E)8-jPHDtKVyyZ4S}D2k zv?sE^BS9VPjhsvK+v{B9c|&N4K0$Dz8zvXMM}ja(v{q2PiLOWsqr^5uBgwS9*mPM^ z4+y1JI^r+j$0;bh*@V2=t8m@lssIfQ;I7JfYPG%fUSN6~h^fGB+n3x@VuAz>Zmu+?N~v)sDe&TC-h@ISct z_kCP(t;>uA%mHW1uk!%6=s&zMQh$+e ze{iFWX-ue*XA>eas)MDxW?Y5YGJ-u%kg?|mJMh8yGFZ2Mo=_OECK@X4!J~CFX;o>A5v&<(!nOHbO(p6}Fs#d6rH8`B=Zjhv zlv*`Kqsj4)5KFe?YBdmlE|oj6Gs)6C8s}(>OjKpxn-X5ipNYn07GtGKDZL6zg|?=O z-MyovJCLigWdEi7BloH_@N|ovReNhhmvnv!|B5qOUYj{HSwK5-Bl7@^@&s{)WVg>7 z_8bNotLinSlqm%n$TglO#X;mBbSV#$Et{vm&QY(4?Eyx|Y8N|F6c=MH=sU4x>UBLt zKmOtQx&Pt$4=sKR{fFn5(`6~XX5s5P;Cl%zsfZtI$i+u-?kg64une3IE`6mYMBnm7n7=vXR4_5CKJm!)$13D>>6(_cE@A|>w#lZh-cyZ0%P9#4)4+NW$0vf zN48qHi)qfpuqwTu{6&nCd#tSrk;6O!<)bH7Dd;>*DTr%iw_YB>HkrIBH06--dh30r z_CnR<`DkH$tMz;<7=$G!C1DVpam#QJk#!M1kdPS#?2ma%Ti#fwRYiNgMZfLv1loGL zpy=hMbaU)xe`RBho%V81XF8bG{0rNg$u|85wkI>_=8F9Xwg=GP(01J(Hb*p@+CQIQ zj?8?JMaZl?wugeN3687bv3{@O&fistaj7NZ6O7MMaC#J79WbYp>aad~l_32IPF~u9 z5&4C6C|=Ls;dJANCUP*cKO1jfCx@m}fc~ZJ+nb%PFrME{{zKc>ab|e_u6Zj&fWHjS z$Op=pAQKHK!qC&#W;hi;Vf_7Kd10{9dWBn%hKt- zBmYIH|0eFiU610}v|t_Q-PUP3U+(+fd4MacN-!Q;BI38x0}#Ehk_;QlFc=%RlJ)E3 z-9fB4q~*gflDgKHDpZzT6?0Yti*QVpjW}KEStp3HYRN|6cm2Y;f~1pv0|TDB2N8jX zz5PSf5gb8Ii}PTO^~_#MJtsl>6{n|H0$rN0(0Y|>osNW`*M#e;B;bxvV@||i=ZBkW zFHBNLMlvVHDvc|>j<7XGOW)OOaX+5AXZGIBt~#lsej(zd;5Uo|s?w38Aev18SnK7@ zCOk&Se@D87?U(}LXA!}cCS?zFIm`}Cy&sroL0rO_)M|?IDlM14+6QRXe zz)xxBGUjx^8-~9dL0`e?yn%IEyv?mc$qsTB^cR00N}kUvR*ha$qCF-=lC32B<{R-xZe$UFz^_I_-*L{ z5|#gj-GluHyLXP1`Uks*iJuku0hs3RKMBgP$jmB|0nIZEJC)E3S9l6ISmY2lOE?=$ z%w+@9_yNfzO4I*7Y~1nxK==F8@2C?gz*$%o0;JCi?$5nB{{`K9AYgL%Ze(Qy$lSgt z?4MQ@iSv(tL-!E>f$q2ef$j~?Ai{j#&!}P7LnO8Kk22r(`VESqp9poGK%BUiht!1-D)rL(*H(82<=CD^B*zPzT37Gx?P5VhGW#pbSC{nJ>Oqx_nT~fL#m&3(3<_}3V(Z^oAZ>#D>MiHz_&BFCa*Sk zw4CU+&%V6c>=)c$4KHKs`Gj)j4%U3uc^%sRKfJvx+8$kE=snk;66E$ngg>Yx^<0<) zF_^hw)ea=lG4@}^i>%lx;%6e~Y+^1kT(_~gZG}fhWRAN(y!))Gc~$V4;eW}k4$}f@ zDLl_|5u*7_STm>}io-TmJ)FBqTJo<=Cjao=5Moz3D|1pmj+W+8lb}DK^3;c__wT_&o+2OkLt8;^TexPWO8gmbY z?zuAQSIZCKmdA(|r;WdQgGJ~E<1MdiJ^Md~_F(4U|c$U_E0o;PD`e}@QE@u=cY;|VsPO*abI1iC3?3a{58sV90^ z&`fNCCNoc!i>Ebf$^@ne?MN{`4|IOHy{>|L$INgldsKHcgWn<~ORcyFQ?!DG;FF}8 z!*~rcL1@8F7QrdfRs;&jyX#^3JDGM$QTn6w9X4j}e1D2Ypg*BWfNE;4ez`e+g>=P< zi-}Ak?%A7ny12^oWGfrz8%6CjlEo7xa5zi+xk3|le`=Wil`JM z)}m?O@4f3`W8`SrcwIfVKg18F9?YACJp%jL;b$sS(rHuW_y7c3mu;x32XXOCqjA`M z?PGMDr?*4;L}KJBSopZxZ!Wlw5-d`j@USoFt99*;rz2$dpL#M~pCUc?=v0o+FTru4 z{23j*=x5~ke|f#4Wfc3OLJL)}KQy))&4WzXzoq1$=VX_VP5*FC*3JC=hE{fK>xL6U z*wt>DL(i81PK^2fV?s3a6mLDQ`#q4o-t&z~oiT0!_?Lsia?MKUv-R%j>R=4s?&cPU z-Ct`X06F~ zm-U+;>rNc^&)my6LFPX2;WDT^^v(PnMGa?jB#Zk}+`ZQ3680ua|2Bwt?^>CR2|vX* zD42bETeR_3VZqt#B(U4~eR{C4saggovD6h7M`JZAvq(d+8^cDP0S6VH^hrAiuwRSj zBVMnCE!Qtd{9hk;we?r62~imHQ6X}oQ_RwY$;IFOJNIuSmf+1WY(^p`Jur=(6xhdJ zQxu9N{W5^t+ei~WZo~bA<$JZL@@Gp7Jagydl8815vQ}vlGYT0?iK4?9#+e5W`!VGu z4Q9s}ZhU^>IP%Sjq*`^f84wU>aiqEL+8?&MkxSHlQ_lu?V!W;fh7dtV81`nX zXc7(NxJ%@HW{_efPY|=;q~Lx{R3yM^V&dfVX*q0br6E0RPH{I;^f|1s{(fbT#h{mT zoF@ciGnh1(c?p#SR+6HIR1(Z8&VDS$?z$`@`w^0UXR$!%3@>w$-v-Oj3eHu{JPuLk z`jeYp5HSx`d<;H$2vSBm@Z}>1WXFJ5?SWB%K9+`|6)_pWh$c(332JwhspVMVC#Vt0 z4+z38*~$H^m#%POL)6F*1&>d%NUT)_IyI!x@}$9lQCU-1(U!1?Rlj#xw`2)* z+(4{8SV$ptMiXm(?a@tGW&xMlNj(60E(I9^SkJFKR%Hb4k7ZF_a*6${l70)c-N+Nt z;J_T0s}G6xItbNt;oa^S_q~@52uRLM5a|9P2tdh<$qZX0zMvO`Y9h_ZbL|kc3XuJY zcBc$OQzR0c=fg)5P$U4-M%nRwz7G;!;u?HG<0I6d0hY4$W=*Ri=&9(nJ2iSfY)@2d zQfTV+WKC`>V~J@!*BFKZsLwpsu(m+3TFO5c~S!T9z z>JMP-;u?aI1)5#z%oI zN)27k7zkflL>1(x9mB^=*BlEV)bRyjiUy0Qw%~asu>lxDWqC+~V;K_|}SEthXWZZ=H9;%Pz00&syibBoAK9 zNPQ?GwU>oMG}4fp1osY6gY^ijauv?hPTU6p%kY;8GQ}y`kcKM$z^~oj{$8mo<{2EH zb=+ZblUc!+octGZD34mWD|PIdj)2z^6tv2v3hcw2$N%~_2e}u`Gm2~90p}HY?5n09;Ceo;QYRk%wIy&2?m8Lv9QZ&6W zpXVsjA?WwU=E*`!Di_}j{H)QQ>}kNFsat`(TX|85aTY8WX6iA@Bwd5^?-BEwMH$Vb z6ATk-#pH%tS*%Zvj$9`4hzWJfCU5d{eEq)Wj?q-AHCZu0-~8j>gtcD`1%r`VY@idU z(15|20SeO^7e+a@T@oYCwa5$%M_;BBg7Kv(4?C;H8Amy3EhiNrVRHa=OiuLN$Q;2fE*E5bptn4LK|AWS9$w8D{Ys~E%Or|rPxJm0J ziw_m!{d=3uj7E%*ze5oo_ve4#ERjgf+aL3P+FHE*wRaE*bzl))bD3aF6p_>hNIG#c zG3}Z7>ubp!*guc7MKh5tnrWvfj1YWi9?|yifE&vVF=v=WIuOJoMcwJLhG$rx5gk*A zW*QR-GBuKWDvhek={X#>faVl1^;_pNrQw9X+RZCtf}XYZ{-FicR0~(FJf8I8bU;N?2lqZZiJKiDp54~4|xEnNUAh3X02 zrBxiCaU`)h92;S^dy&AQF=Zl7d*sgLEr6)T_WTIavY$u zGcePcH`D-p+7!?@wUb3#gSj~y`*yas8AN{keP94NO~1LXPm!p1+NNGCGb#ZRnpY8o z<3)kBdL5@{XhNBA;O77LR{nzv;P;0&#Jzb6glmHo)c$=36* zhpCqwBx;qIp&hI%s|j!_HpySa4}??c17 z#xl#GKRC09sL(0dp2J)d;|fwfQ;J8 zEqpO8@t}oGxiB?rQQtKcr?IB*2Sfh5uGy(w=BG%iMytP6604$z;{a6k^rNfcC@nOa zQ|9cZ0-=joWxjZx^c*w$F#Pc&7(Nj^pN|8D9Z4ELSuV?2&ZoBod6pN7(64%1W$(lV zI7R@=4XhT(jE;rRcVMGSQr%4mOS$;B$iB|!mBW5bb0otev!=NwHtbTTH`+bndnPZf zkm`J-J)nETHOJT30<_Odw|$rIE%9lh1|B;lapiTqr|1*(-iR%K8mpCact3uHygDDg zTTrRm0&%$LXOLzm76E8KN2T}0pz0G3+@COyJPNLUkr?gY9qr1QYm<@luNBqFxn?u;`+QYaR zd1`5+^Nc}}KTky%%M3VBCybuXySEn4ug?+3S4tN)SXuUt7+sELT{8~ibn!R%PTOFN z=R1qU*77?s`N77zMXAa^7|^jiJlD0leO=Y96gg0`I(nfAm8DuJjcoavl(|;=R zp@on6y?ZIZtz$!djR~MM`&8BJ%DSA*Oq>r%JQI`-O15Jj2$m9-hg@VqFCv5+&_TV$5uX zm`0yGxwN#5W~G+w=PJ`%){rvs9iloKS~?jy(SMR~6@%g(st9znWZl^Lm24LF3$ z1tPXHPOUi++Kf5w65E*@Kd`Ay5sgEA)+uGE7%*?(SVV*>F$W9Io3B00*0-$Nn{37} zw!0j)Os|`&+Ml~R`GVo#|A1iCxW`c*G1>L3n{Ey0I${}NQ_~|6O@2m6YKyOh^#94M z$J{*ZT?a+1)gY-4xc)NcER{^1X0?QayE@a@;)n zRlspB)rMbKKroA7#ni{$%63LI%#a$IgA7qjz+lc#cr-aEYPlf@!m(Y71A(rn%@QC*J&9T^K_NT$U# zfnID5`jk!mW}Imc;yEpVhUJEcL1B__?Ib`ND;Wl~`XvUN2n$f|txk(bcpLK+3A<^iUX=={)F4!qf zW;X0}+xs5X70N7AHWxYkmYJUuyl*?puUL6DMyrQ>z76lM!a8&WRqSxddRL`n_TCW6 zaXbB8TV^abc_lFP7Q82ih#2*|SMUX2t)EzFp$_2l#<5unoH)0#AQ)~xeo`ynT0!T!RYoOe5<9!_E^c6J&C!knu1-@kBu+*aV7%*1)BtNep`;!)=F0OB z*qB#(D$8(dO?8b9C(u^{Q5hx4ruZ%0?<7AuF~3SZRyyr$Zcll`>ttf(L<`Z8rVJl` zEUe<-)5>iZjs^-U2?v~@B*47kk&B6cX_Q_cNl2++fBtMxS8*&+n{Hbxsm9E$K1uvp zqd&B=k8;Z(TkwOInAVV#+R1ccqN*B{Iy5pBZ**^Oh#ATDb(Hma7$@qKdwgDPv$Zv- zwLWo&&SoVbWAoigh9Ad*i61_4u&3S7qPYVoam+G(1rJD1FPmdN(*L!W#>8eDk z7K!D|R%)aV86xrjVe1{3BMrN@-PoDfb|$uM+sVY{#I|kQwr$%Jn-kmldYyJ~uWPM!9B1i!Ga}Y=5?#TFQ%H`l%++**=d6zI*Ifv4S*tSsav#aWn!T+j zHICq)j}{y2L=V9NBIYR1B=4xcAq*)9+Q#4Q+L6YLR-}Yb%{sGkDX{Ji%aGlw=z=pS zFCi!3x{dK8-cUi8aVnLRBPg;Rp%{gk)b;m(m!1}8wjUnyQWYZAxfIQ7kFnFDVGqT4 z;N>Fj93@)uTho|v+H`1d?GG*+Ego`kOftaF{#-R5b79tnifLLA0 zFAZ#eU?FwGY?=Y{ah9+cWNw`IytO`UG_MJQ1Iz%vTg5YaqXg@DuZ^nfyV7y&3Z%bY zkbNS*bU)=43MTm8kRxx zxo{cMh5V+shAnpo*%XI@c(^4EJ%pyiC=+RH)9v6WasJkqBl{&rx4q?aty4|iDea2oU2Mve&t z+sA$qXZW)G>MtiP=2I%|5$GoOUd;W6BRJ|0q5hfv8@ly@ zAzp2);mF&|xA6CDpJBScQhN#E5lEg3_!M$MOUV8jHsfcci%Y{U^OXNKF@(t)Si09gK{iHiZXqnck3G)t@I%=$lSBpS~y$VwVKM#;TCYwZBxP~SA z${7gbCM7JyN%G4?%Gf%k8WEIZ5e-jpqUI!YHtc9@zqI}X4~V!K{M!dI0oZ222miG{ z^N(L~EKt^BbC62xUFRcCp(Tm|e83=Jr;a9!45u7nL+WfUzuX+|_*<`A{}3(~j*1Wc zwb%s>`YpJxeV*&XF{wm%L{ae@%Hu&nS?iCyL$&|xCH%#DcP<-pHp3;0>RufmaK4?2 zO~cXD67v}LdhROPVNLoa#ZMUttBhNgp5e6FI#=p@ddqd?qG|EzN@_UU^|vL0F!yn( z%vcbSqxEERn@vBhgINF;zjt&=G}Zj+pKeTL!v~YJ9cqxp8pSH}{%tVdEQ|WOs;&3F z-r$TTeK5*}=ndYgndebVD7NwtX=w;?HM?YL)eI7M#br{soSN_&3q$7_^q4TbhHS^$ zEoCresDO={Cdvg|9Un3E)0W*-y3d~qhsuxaC*S1_h{R+I(k>d%`^*bUX%+T-8i zjjZ}^Wo-#sWWg;=Az>Q*K>g;xj^8715bRcrBK@OW-~#)@rSs(X8F}_X_V}#Uyg=xd zn)GkMQ}~j?jY-x%2+NRR5*7HEt}0j-GB}cEG&1ki3O?HM8BYak@I^YbxP<-}7jwzt zJ~Zqb2zR;+c}ESyn%bWB7y$(Cq!NwU+3;N`vqzNm zjOPNlP0ivL50KZ&8)gw}AIbtcCX!Zol-a0wBND!D!W}9Ln;|(#7v;h5oq>iJ)$Yki zz?!8?mnjnr%cF$ukPHi^e66|~XStSMYybVGN0E{>b!7MxGpMrbj2es((KZWA3LI=A zZgF+n-HO_j@wnFR?RyC zSkMjsmK18zNvc1&leeqpO%YmM8dEW5X9$S%3%T68*W5?vd@jBlFkzaQ)Q4$7;A)Xc zXFh@AY7zN(+$|TI0R(mCRew7=I_)u(ZdH77@74d_1=lk!ZFvK2MzE>rGjlgbhl>y| zXG;~aktK%r=(u}zDa0nanm4(pJCmI-@YO{ufdJxn!h5#2oms-=0=)NcRVvS{+H zVqe|a$-TV9!s}~>bog8w(x95}bhjqFCY_h#F?!E}k$O7Ng-`fK3`r{b4Kbf1RDQv+ zd?oDm6^lA&m%tmYVhQm463>5|Hq=5AKpxmt{q?{xRxtnW563(o+;Bh^v{N*^*Qa9q z!|Vq@>Y;fi-^sDS|xS*hzToerX*&1A`N~q^WE37%qPYCGd9h7B##$E$T{0#~n(~RpteJ z^!q7&{zOy25PZQ8fMHDnokZ7{R`v$X?AM$rPk8rfg1sN*4VNVZc0EnBg}NP9Vw0p? z>XK#%h2d*72gfB`ZiQM<_$)lJ*c(9EaefNBO_jD*n-GOBor9Q*ENT^A6 zMtYLBD_{?)?}h!kly#3^-!=tzO=*24eJd2Pg&{CgM$F#`)6R9#|lovtU>6EpXFr6ujBTn{(LO? z{4qT;1+LX={|a37$A@zaIP>K1Vx+v;k_awQ$|~d*3qt~<=xM%kWkiraAFus!VZ+O^ zUP|qU&E9P?U|+|4#^L!OGn$^?`onGW^FfArD*AfWH75GcPX#p#+ZP?-ksF*_rdTy<;*^@P(t`Uols!kB_k2dj+M-! zh>5P5=vq~PS!ffvMbJiSiTo(Fa;{7yY{f^actm%lNJ_S+z+(B1)x5DQ!5toYIinxn zLo73)uTjkgp7L)qHZ>}UOIQp}G#H$q_V=>pZQPBz8Knd2gK(n0-57|DM4}j;XQ#4* z%F95SDNNoO%^2|94hoLIuS}FoLtclcq&f;9?rsVEVy4Ij37@cw1WSxIjk-!h z{Q({p^kVoLNJ`Lw-5nigi6iqj8$bi$Sg9J%?;%Zll5}UjnzJ+dvsJ1;ehA;GC^#c^ zlmxb%?{3+l80Up#bkYzkFSgE&pDdE746Uuf^l+1Pu$*MD+SWGa1A1}aexlNR)NaEPm zZx;)n#1eRFqkj^lnJ=o4x6NT%QQ5=@-Mk@=BEPjL=9NXxt1;{$j%~u8);8rj=~j#; zK6K45DyiCJ!1o|oQ@bO>?Xf7-6yE;4Y%Qs)WlUOHIttrCP0`l+(YkFGJ|UkEl0M^^~B`3(d-dyB6!BMxzogfn!*Fnb{>gRaJ*p4TCth<<+}-u zP|JOoANhF%1&tUWS>~Rvy+(nc2roDx(S!{`iQh>rW|ky(6%(7**vi6av|kx^n0HAj zl53#0>T+EDkIP!=rJpyg=v zuKMet7Txx2Z_$FsIijU}M&6($3ahS(Y1MAgqx0$e>Cx+by47Z~f0QF>l(?53k181< zb#D7E$EK^1bb^ZyznBcgV4l~IvKU)YvTR*`d3ik}Lx^b}c?pXN9pWg>FcqV@1Gm2H zabm(owL&f`=)!p_4v6#sJeIi_9bHp40LJFeZh9@FC*zoNGXcmJHfT6F+1}mu8Z`m} z0?vtg5GQ2fS*#ye?$d|}g$r3_d}q4%tmxB?X9}Ei&~^y*ONr=I>JY6Js@LksWa93i zwoqdCZH~_*FSI-{w&Hvc1X&0wEr|<6jPI(r75ph8%hXudFH^>VJCjgd!gLEhY(~FQ z%^?$JDq1W)zz@fCKR=f5*2`BkS7lKiyN1XX_@P1bv67Oh>P^BKi|LgJYgYF7To z`d|pzIG_S&fSm3~_{$ca4P~yohhv>jEWyf*h8r%=oSCbK8Wg_|I(`v=0aWv=0(gM~-nNKj^_q@$s!HC=>F$7vh^% zOXc*Ck(65*#*uRj;%;pT=RPM!MN$o6HjJz@(b++okzW0aUrwVHK4!+IX&t@TM3JIZuY$T<%t7;#V@m43q4UxMEWhu-_L%8tp0 z=Q2rbg4&bCl6BRJuTI~WrK&sf4eE+J>=_;Si8Kk)@%R92ZvoR}?1md^a`IGSi)+mDHVJ~h^_lUDrxie>++Col(gTtRg+Qog&^68W% zZ+IA6AiyHYxrC63hH6ac)CIa_WLUVDW?(CO5hpAi+s9wyTe4_N%%SbxngrdZ4TI>4 zXop5=q)x$0C@enr7X?+1`hFW-1hn_li!&~UDaJmarCYMR%*_Em49O)PY*?#QxoPQn zZp5uQln%2uB0>!}s7fI$`k~_7hLjUTt!~$@s7`B^7Bfv}0DCEM`Hz0GIc8Io!OJIO z15+Tvdkfl#N^;?*5D%xe_Z@ac{lqd{+K!yT_A^WXA*bqwK1D}@BaAfeYeZ8NB{2P~ zLr&~cqEzIZiBh%`&<@SAe4}|a6?>wCg+Twk}1M&Muc-T;UV9_tGeUi*%Py-P_vFqQLhVsWKW{y3&NXK)N@tPw8A!c0$G!3GJNR;1 zyI;R@yY&lMwlALw#M!TsVaRFZMLO$aso7@!O0*nj;9T@4!TPbNO|fVsM&UU6o$Y?VWT8f-EROYd@xp zi>fh3B*aD*Qe=!Dq75iNA3-iF-kmvousywdZ?7FTN#@s5rVXu|wV5O0iC7rJ>t}4V zAsD%e6sVgsDotimhgk;IeoiqBipRGbvfXa&=pvk3uEY#1f|2$W9uIdM2?Zu856Cp^aZnT40VZ$VGXv4j+Uz% zukVL`364t+#GaX8w}LSj>G5tF0sfb+OCXJ!r@i7 zxI04PpB15rk1w$fvKP8ljg2L0^n8t-Pi*vtWkfW(MRYeB6}*SnpAoxK|k_9xdZLtY~5Zp~2b`i8pX?v?pF z@g&!+-{22HW_g0FVIrp%PGh!V)U;RB2a5p{Vp#y`9FaQEt?qI;M;baX6x@7S2@NA{ zCMQd|OhBywwZ5xo9F6iRq%>A^{)d=s@{X=a!J8=zC^9Cox?RoWNMpC-WY12)R_{*l z?d9cri|yLGi_Y+}esxQf^X1>%_%CP@{J>dh1-`$s=dPg&5hvuSuD(YN0nL?#N-1-) zs+6#Y`KEz40ap6wcO|OVMAg(ONAa1KCZ9bemWhw8CCHled!^hf@#@P^f!kE%aLXlQ z#)oBZeOcfp(ADdP&H`$`>UE01W$RU#1zbiQyZdY0%wP$5N(Yl{A`AIBL@Hsy1Svfj z53;B60>r^Epd|g5NbL-AemFX2cQw!pN%X4C9#7$59HDAh^)*n1?@e8SnHMh`MV%Hj z?m5qki29`IDUEHjk`I?;AA_$i$ik+}nNEIvZn#|T^mwY9{*9r1f3n$t%mRUJMYL~# z%(66|Tl#V+ij0e1r-dL(R=s9y1L2wwX6%ACAcS}`@-CxJKbIX!fM$jRlWwaAZ433& z5TG%1lv7_VA^R++lp)bB+c$nmG;nABq%ax7Jgt+99(WL0vsD@`i>riMt1nmwI{F#=>!9lDJOHWoB1n2WT)b;S)@AzAp8Z=)lf%X12`p@BwbQegWk< z8vN(PzPIw&G;R^bFs5hIE~u8y@|!4y0|l6G-r2G23l9E zIB@ih;9G0A!|}=oC7IiQ0X_e8_?a`Dol7c4!urE$a^+@hDRA$ycVRLvrg;5w@2>3$9Gi3fRXF1)Su2*ZhYxOp(o$ou7 z7ru7aE#B9c+qZ0WBIW*Q!7=DaC!Y5*-;Wx*9^S>auT{O1**WtJmcsV;!-V$( zdnjd0a*dgPTv#`(Z!RBQhPKD;FyKCZbrC)6h=6{a!%~x_#dYg-yP}7Y!}sRwGxB-G zv92_lHAIY^7DJ2gN6sO+B8?2o|p2pX0mG^mhf4CMKxD8 zQxDxv=y3VJWEc(m4?(t*9-GZ9J-#P)z(l=A0C@BDAX*wBrH6av<(30Xe^{EMpl%Oo zSAm3t$qfq_RHjcQ;tm0Ol(OnqoZ5ji3vT$1hspeJttN0~psoc8PQ^v4vc_`zcmsro zyTL8LxpOkn)>{ip+9)5d3@u)2f&q@TRWU}yUDQSk3AjOI0q<@uB8H67IyXG!{oOA$G#OLyX)mfGOF+%WK^G2FdkNMe0)-47(4Gz*Gt+6~AM)NB84p z9-tuEVvbFf&tlD@@=GLD;@>^=$y(`)tN*cID=gF3x_4=yA^ZY;#riV8F{p*dvk={O z5?CN7Rg=`9>VbEr4^M&@NBKO&R(|_Rk=Cr|=Hha@cCr6qGF4l7f5$~S&iVrD1cMW2#lDw9X~u8Ft{iio^ubcIIa&mzBsb^m zV7FsWkTnWV$HN_7dp)U_^?TM=9);0-6xK3}^Jqpl*psY>e>SUW56x6Q{6rc+-(q4V zdWxYiS>bC!v(sC8}UrOs_ zT#VxrXmguiSoLyng>+ImRzFNp&X0uC_ai{}9Kile@zO_H%imjeN2@Ex-EQ&J2 z=#q9K2F16$(J5Clz64**Q!^DYc*vy9_?&8aNwwj{c5~cB*!9|SYJmQJy!TXu7@ub! z<+AO{L7qnk4ln%Qm{wH~0<0M?ItrGS<_-Xo-Yo-eH-Tm5{?TOOVv5E)-YaY`5ad1| z`(tJ8zra?+KVXZH@$ln$O&Cw2gv$wy%qbKlph1akMRIz?@+gDS(YTP=YKowmc-b^b zB?DLiD$agk4vazi{{w5CHP-3Ua3kl`%-*1ggvpr){B@k1ciAoBz}Dh(Mz6)yy-=pu z+9)l8eVDPnj8nt+y`ZA$_;`RT%|NMO16Cak7f)+0aB-7vy`tOp8IUWh_0>pwU(fzV zjKwFFi*_u1oR7ErUtEh7X)3B@8JIlF+zZ{;>wNnKtTkPqmbUA~;dTLIE_}l^1>Xl8 z8}luTrK*WFWqBzr=QctpK?~MvG3^!KF$vc3E<9-dhu zDqeu?mhbBu@i6i_>gPr*>Yt+h}SLefo_&VI2u~Nm*JF9P6*Gs*nkI7dBCZOtg_+XY1^Ok#LD7@JoJLRZ1ape|_Ke zNq}ocMRJa$7M7^T4B5f^Df3akr=WSKM@o|qEf~(|1j$0z#-Dyp*A&h7|G-*;iPs$j zdH^nk=(xmuM77a*kWwxGi73F@s@{RgMBj_VC6XLJGUU(4`zuibhhGmwFRL`o`4fn= zbKQ7T1>KpRs=MTOh?<_ujijx*zvOpI^?>oRwu<3GZ~k)HgPpP{`DZ7Hs_Ce-vgtxE zZ3Xk|jwy`nTdQ+2`R7K3QsS45boKamjYho%-YvEzE-mvTvQx}Tx{s!o_!oQ7cKo!Q z`83HVj?!^)7VdOY>^aeXtuIS4481a_Ma!3xb6mcwz}}S)j}j| z>ibhXMGl0n9EJ0@nd8?9`4+k;pG-y9`Gh>X#OOJ!;hilAYm5ocgf zBfjx`gVD5`FwQO7rN(yumd`3?N*-CO-Q3`x5Zki$$eav_@@b;3M%=XTOaQhM4KN0s z@Tr%A>_Qg21Flu^W3SX@Nn#MVGO*eQsx{RH0X}Q{febBCJthRK4GcP}j5$ib3F%B> zKc?}u2O}y_@!h)1`l&{l%rNaQ!IMTUL4gz&WcED}qz;ChD2@nsElG54-SBeL^L{IP zGhdcGdVa7*64pY}={h!z=$qnbj?-pFe9qX&;EI5`#sP{*W}b^*JFF?44V9&JoYVG9 zK+oT)e?`cZDRte9thO+xB4xNiOLRo~VOLjFWC2=9cw5(uBz8=+d>^kCwn!wPOO)+V zt7a(p-9pi*vAP^EifIwr_Sb2!HT8&A3|pH_uih3y?jDt7RyRIZgI%})%`F@Wv35i} zily5W-%yQ(qyYGrPMyKve&YM&qF(sX2m%y342GcJa+_-=-!Mv$n9wfK8S{ezaX$$h zCeFl#qZe&RxC|oe&Fp0E&Y~QJN6c=FpX6p>A^_b&wckP+O))*`Te=uTxf<`Ko$?lv zM+fmjA2b|AmKd=EVx!^V9>t^!79=^PPh$@Gyi)`RVw`Mw_-N<8*S{SYFJ7YnBq}N6 zuy0O|oN>2-Wb-in+7_eOQCp>!SCGw=&v{-rk1y@f{p`C)a3D_7RMkine4W^7dxv<3XQ;Xtawz0`w~4^x_slCRaliXlWgE%g2)2n5WF#)1Bb zcle0a9$#hsj8wdui?i~IqbxvAxK*$@pVbN53oz~ME~miyyOt#K>+|!o-aki|*-Jnf zAD%DY{ z^GJ=XSZ3`={rIVvBd9xdwKi=jIl*q_!GbY3{y_eC*{Du{7gr~-S;?`tHO}ORM|TF* zrR}9huULHx_(yX26nI){9-VqEBAaUasVuXPtY){WN<4Mie2B;9P!*f%!N{`K7hZt2 zmgp#Db#>}7`s1BjxoY7{ z(y#kg-S~QzIfvmtr9`|e5QzE2 z0jYBSF(=oHzCWHrakbE^=7*b(qL)S#=rb2_pn((L6f$g~a0@?n-+uo3};h)!YZ; z$%7sHlFk|2`;x(*TQ@~QmmGWb=T?6MR;!(ulR$H{$of9zsFnlnvsI{@Jaj7-0|Imj z#$Z$*=%>$S>_S$Qh_3hnG%gwb$d+7ADk3sLtrO)hR}OLRYRddEzPBs&L*tXhC+Q{k zP?x&Kq$`%3nvxcvauH8a4-kTkIoZ1*~YJA9v18(#ttIwC_2a18kkfT6U;T&ZQ~6U3)n|jz8WM)?jt0#9M_>_Hg1c&k&|D4GZ(8k#jK9$oc`qd4dSa z((*WdsAO5G&`U;aJ zr0Ek7A&~zA%r5N*F3bk^pl-%Rm(0d`f5H zNQ*xLhGN;$CApqBBNtnY3<&Fd4X?&)`dXr_x@fy&^<*sGmRY94(Fhr3tvzW0<8 z)2TsgIBS}SLDs(vTIr_Ac}vdB*}cj7Vys^LAyKjq9x1lGP|0jl94?(5wn6zHl7oNE zCHNGncE7}`rRQ;RZf*x8J-;MMO-Ca;tP&ClDQj`EO>x3m;?fwc=VMPlrI3PIrP}j>D+<^` z$yfB7m!!1__wN^fxg1!;*sXeCJC+0N)Id%6uN@<(W2c&y2?DGte-_v&4l#wUE<&d| z(wbVm{>B43Oe_N&S4dz!Li~ClnUnvCswT$HM#ybnC1_)&*t++1 zF~gTAe6fDL@1!||4H_Stn_R%mp|^zN4Jg>fdN|DTd_pE$fwEpQrE@|?5d)U28Q&BY zQC)yPMRvaXYuv4C4|C?*{+CTRa-^L*a`vn1<>5`b%_(paOP@|R9U$5Ge!V)llh{Pz zr{j4Q?97UY|6tx(+@VQk2E>(i1 zJz9&CoBhkpB^#MsY*C)y_Tv$~KwO#P_64yhc0nDlXj}7D0hXv@cNrCPLJ-z<6npnF zuKsa1CjpRTbd2L4P5wShhd4+D$)tiw<3tw9-CPN4Es2K}0z;n4Cqus*F8BVcY}dxd z&S$Bc`GOq_FYiuZ5~Ek6!`F9jxXSbWqV`^$2uS5gB0oirVH!GZH4&du5zNAyX*Ez{ zk3B7KKz^mL8Yri56A%eVdYU|a7qUh6? z%^NtY@qh2XM1g=m_2{$Y^S4#dAt>5HG3M^QJS8%!=nC5(bUu*NX zQOL36E1bfMm?vd(vCd?fnlaV1=PuwDlni!l>@UWaC4zegFoq>mn;wn{p!3@&MSOob z$Z2++CF{ixN%=Q%Ys1 zGsN$8*OR`DX_vTzNTqZ^A(Hb0Ds|qQjsRmnM>#l+pUrHZV?P5++-lHd-p0dWjzDZn z(k(sv^%mJ#%@BA=*-%y4qxDvmd{(d`fzS-WRR&P_)^LEMD6MpSnzhRLT7TOdee0SOSvN(;tu*NFH()j3 z)nebj=_SQ+N04md-eaL%qE)FBY^Ir9EXB$FD7N>;*1IT&63AujBN`^#$f9Set!V4x z4{u#CIae_n3%az<8dn;Acx~LIw5Y3A;~p61uUta$r0?^%=x%*^ z5XL^u87a5hY};=0z@^_E*wn;M5+UU<_xiZJ8>W}2Dq6gPmcY^9XUgyDoOrxENL$4m zwN6YlBX#3{uCFz|jbL<8you06R;j7MURU7&^M_mFgX~j%@v7N)vRczaG*m^F<}g-g z!;!LI$y~{(KP&k8Jl8?I?qczosw_ai^_`Obiqg()TtXP~?2r)`-0peRoY?>o14I53 z3vBhHRB!K!PBz#6dX_%hBMhgQ4`S&VpT{|j`tRLIrSyKH zv)y(#kNbe*V96efiJY7c;prv5?dN~?(e7bKd3Ad1Aq+S6H%6!x=i>W>lkCjoHs+7c zDiIOWAKDhbc|5A8t>z0a^yJYc19*RW8Xp6r(Ga&aY8OTJFHEjBj5a6n*3{_4fyE;W zQUksjwgf3<+8n^IX85{JN^nMo3b$8M-Au+tx@B049+@FQAo! zkZZssEBlO;Ndv_w4^l4hqy$bgm+ECYyf5QL{H^yFM4SKeedXnvGwj^kV8r_)%$5l| z5LiB*#pofQvl7;^l_-oW9rt%(o^r?uQotd?IBuPIM`-Yc`|Ua|bTJN1SJEuFsK$9O z(3jB`s(OL>oC3;F1EP>&3x)V&SFT>Lz_PRw6GTj2^BaG35Sk%_Uaqbe7s)*jpk-RD zjo(+_1cq1`jF6B6tbZ<70ABwfL?OSMOG%YBFUL9i3TQ23Q4Ft<#W$&?j1(9X-}WM* zjSZaAWdR36herp!3vHN`F5pT^1-~V4Gp$=6ruzqzv2U zoQ05rh>hz3+OE=#^1DDLhjwSfXfMN*geS`8+YOE~KOa$bf;9GR27v1~ccSoP2!CF; zFTsadKK@7KOK8U-D6Ege)Mrs8k(g0Id70=xIxW7ST?nSUMn5w9>8R~&8 z%$y49d~WYZE$v(SVBb+XOlbF8QbhoL`SL+C!3JEwk3Go{3`!C4He6o7-{mO>$s=0C z(Q5N!nAas)OP+z0T(wLOS66?XjQ14BIip3gB+cQeLC=k`zG#^M3bTsI*j1QV0r*Y+0W(2=SVgDK>A=x!D{+c= zoA(~FR~)D>`=## z2mh!Xo*lH-`*HVq159`(2HE*arO1v*5Hc5TybG@ z^58VwQ4sIeYG6@*6lUT~nSzm$r}WU$b%WO)W99s{eN~G?y6(w$Ht+FUSE|vMH#PJ) z{^ro!H7(6h3Wl^6)&jL^jke?;4nTpBN~irIzhfmIDOI-WgGK!`IVak%+C9Vv>zKykjgr87bkLAQOk6_nMdXXE9R$8@2bj6bmt`CY3}vkp(ck!Gki4hiVq zW0Nw*lSlEadbirCB@%+a-4oVP6`;q&$nl0Kanj{duDkN}nB{d(aWL}f)`Mq9F=TJx z`8?ZSq<1Od3+rqv{a3DnIi*GU5ghrC~=k!g6*6#HtKM7)a<+P& zPlnwV40AlF@-zs6>HR#s5lENMArv!i#kkHY4!9cjOvRx0ruKB9(fx?mOvrBg`^9mY z)7#*6&V#+CIKpi$-^`<19-9X^7++11P?=z$b^6mu4(ng{-Lgv1HkKcqe4srK!9AFL z@my6jnNF6<{X7>Nck?g95j2*os?xv#ieUW)$Q@oCvTpUeT#mZkL z4u!+?Nu1hHN$YvJ`P$EZZ?fEN3*)el+cXb<*!0!OI3ZK_10cCpjZyd`a$vJx zCls5p-$IwLNiz8P!`~?zdYB{u$E*FU$nm`?k+lUno1nuh$Lb*{8v zxV~o{o1(H!bnM$AFvxNc+aczj(`*Y?x7QirzIu~hfKrz{?gZfT=tb>Gf(*5U)W(Qu z8kcd!+k9Py)PPILtcYolf_cn-?}QsAK^JJRv5%Zo7OX{fu0v8NHypA-4eUd4Krp zy72SU+DPi?-ZK3T-S#I~fK+a;{Cb3rCH-H!2Lj5A*>+UF??slDh z7;y9KS;XW?mVIyDzwum&=~hGNQ9}q(L~v=&%MZj>X$RQ}>XpLb>R{;LSb+glAkwe&{QGbGzk{vJnZVQ_W6UBLQHpW#akvPsML302Fhn!hlxWsU$F zC$j7rxEYJHS@^^`L&RY7ntS#a6{7>^1kkmfyPV0=;_*KBE$Kzg&{-!lzZyokt8y_C zU^z_y!50RrE;QYt2X&I4qTlhD5#w@|YlC+0^hAS0OnG97VvZM`C zA#?y`dT|2E9}M;XVdcycK_%nsO`O0_FP%w=U8ESi)KgQ>#F$tAq!Kh0^`-&l;)`;;kB-uyo@p%j;)Eq_cztuLEhHwoxAy>0-cTv&3Xs2yMr zDrHq6{C#dii7*Jqq)|xOQq*h7C_`dZXa~3G@In*aKpg=L!KZ?Feb1y2dZt{;I*JA= zlDO3h11&JvXQfxBPr9i>Z2dAZU-159z464h-u`}RlRZ{b9sCwilR}d_okR`Vzp1V* zK50f%#EL<`ozdCYOP1PY-hh|t?_*^-OxDUmlP*6#3ku&RV)8B6YTGpo+-w1eNN^Sn z(Z-#qNVBomzr?W#UZa&+kZOV5=8`K`vlqxv&!wvA2nQulv|=pVXi!Gf&ey4)e3teR z(e@u9^5v|_6Uf6O3jnhpZeGQwO>#I2SHk-pxV^eCV#1Y^i3#bW+pk&g#zkUq{W%hi zd~ZJ7f#^cgl~CbI7aP5dTb2`tA+R+yk_OQ{J=-Bolz_Fz6VDT!mX>F0b&t*A<=Y@Q zt>sDX6-MT~{EbB@Xm&{HHH4@MY_HC6eDRTCIN+;#MQl&h)1q7$5x9d$_hv zQ>wLndtE12z8f8Et*S5gt13}bI=Y7vOW-{@bRU;B25(o?QK<^Rr8y>#Z7m_q$`=h$ zkCQ{?Y^OnH)EY(K*>S>D!yLOxTI?p0dXXpDw>yGyLs0bBb6j3^j5k0}f`21aFZ@UV zZL0XA%Uj654az^r`Nh|?hFe>wrO2#i%0xll1|P`Tes!KP`6{5xYs9(j*IS+y;9Y4& zX-0-WB_ns9W;#OdO3BiAz{u_sxg1iQ0(XUmS$|v79-hd_I1OfN#YYXCVkPk&RHs>Z z*S(y@)q7^Oo-cqjVfi%2e33u7j-r0IsfY|KhQoqofuhyfkz)B4rO97uQaaLG<8gSB zYE?MW+wwaO^hcSR{L)NmM_`?O#59UO-ntk?2owcW)D7Vut0wbCl5`!nW0nQs_c3k| zbB|C++%JKyN945Bmyt_M!|H?tEQ6Bt;8UEz0aVeb24xHBo!9DCTM>}@)^NDQ6jtmk z%g!bUl6YGG8z!+~ZcB3QeGLZPD&vXbebjEtU2V~pB=L@)i57&x+XB#9OQ4&<3E8K< z_X7{4Wu_Wb)G~C`B6O5EG<&Sx|L8awc(|cA1$C!eEUMwDi2`_UOn z-3!ZrBPlIH3Z?}@Gl&v> zs>Ynf=GHEXO;r@wKr=ILrjrtc6h$OuvY96Lrl{;2OPYG#?N!5FIp(v%PGW_Q#?u~b zHU(B}a~RT(wetl(;Y-tz$!i==D!_3m^aZpK6i#){n}J!NUUK0>e+;sVvRtjn4nQ`T z=6nb(J>QMBA;@)5;Llq#u^`}%j|jrE^pBnqbBGAOb=~ibq35z-oNYd%DgboK9&hxz z9~F#_{_E3Bi3^Knx5*d({%g^@h`o4TjHmU2tMo{r>R@y)d)*%+^Lag8uQwa{8;v$< zT%UVW*S@#6ZQi#xx({p~*r@!dj0FOuAE_8oT8qd+lNRC`;fHX19F2_FOx}&;goNzNEkzBl%VKgF zu|HTa#wrUhxsWJf`S8rG+LiinmoXutB^){!y;z5ec;v#=X?1B#M;jj?f7d;n{CJ_Z znddl8WXy-CNk8YDT%H21=Ol9>tJ6{8<}=Dtd!T@-mMBrGZze>;Gk?eAy2uLzjOUIT z+5=|@skj~=Sw$ReW1w)c1Rn#_yl^?f#3b5figl=g=Hf9cQpfQy`|ZObIn4`#kjPG! zEZscHv(zmdH{!FDiOTg80#9!QOTmb!tajstG95?6v%P4^4IbTGV4}Iqy96Q)K=f!i zd<2vO%hd2)P*1v~S1;5junEdZ8! zdVGgf%t#sQj7U-PJm#RY-j?&$J_wt`G%q>YMiR=uP<0V>UhxP^$8~(^a%XKF z?%H+qdYQT6a4{u}v#r^dxm(-`GcPZtNtN40kePwWnh+9WjQOwD2wxG3e@DX25w0s! zCgYW^dqZ6>|Nmq?f2_RVwB2l^?IIL>qP&!>N474I2H+PC9rqhM{8mPGCWAoE``BS; zd=sZoClC?#Y`NHZ+h$OL6q`jOX9lr1Dc4w|B!I{Hd6`*;ar3H>p78tR|B!VL%$2ok zxTs^>wr#Ux+qP|WY}>Xvb~?7*vD2~bGrw=GwX4qFRr4Q=8ko=f-1nsbRSKbqS&1D# z%DKR5YLa&ZzD^Z~{3WXTjVk?QFRmxP{0yuXkoj~LB~f#iKi3Dejj$f>-Wy&Z1b>7O zG(k@&lKuY9G+ePK16APCT+JE=@pdad^pMAQV*bPqBOaC+GS5`XWEd6DVJndDjPm=| zlZ2fk;Tr#uDTT$&Dy8irhyD^CSE0-f~o+-R^Jek5sBx8_l1FJWGL3 z2PU<;z*(BViXO2xq=SQ1ACqY{$^v!_W~sSVK6jfGovX-r#JF$)i>%?4ekl3BoSJe) zh`NnVo0d)7v9M$kDc5Vbu{Bj@2;uK*a;aX@x_z$G^sWJoDr*82z2S(bn8|%6Rr*tZ z0}}V581K>!AJlXp^z55iDIU8PA$(boUQAJU@8;lca?07(FN9EEM*ezz_n$$1f(^(* znE%XJPj~*|X?Jcm#g8@_4qdKKL#n1Lj13)!G)?IwM^Sev6%1cG6Aw<`j8Ywog>h3I zjiU{U$4z(H1NLCalWD^iSotzCj!|FX8n2IT^pMsDM*!UL7;sp$u@sf|;sZXgo;hy5F{zHgKzWUnPZOTu%Kt1JPl|Z zF=s#(nF#LLo@o^qQ`fJvjJpB!8c0Onr>9?idl_BX9_)XPWSYEu@hx8_M8lHxbUf8N zUFn=<%jEi`e?3L=^2d*Ew?8F>b@x}F{JW$G$~6h+O4LCyIm3{ZwYq**) z(i8~8H^i5pjrw&t9^>EO?LxNo-kjQ5i|gSVjQVP9d+AIHx}IROsKJ%pJpwS8~;k318a}tgYgQZ;9T;FlQ(Cm@#l|L z6Jz9VjA845mz#N%gQpQ;pm#>0PTZ#K1qE18S3Wz!%e&zQM!R(A%Y9nGm2COQp zMqyzB`pp;HpGXSiP*nJ2b*A5EV8hNo6HPCmJnV`5Vv9)4!R@~SLv+*aOs@t|E(L;M z@qr1U#<4;_5i?d;>#4&O!hYoierY;Ph_>V&*`-zBeRwh%(@>`f$Ub1Y>r~3A(hQkm z%W*%pla#eK&8p}oGT4F_Wx1!X0;8bf%H|32Mf56t_F818TwIh0as-S!g~|RJ*u7(R zM$f7`T zSu>g{@99%V)*8V>!Yf=01yT|!6$ zLPodY0)`kysBlQAh{{f3Dgw9&h2%jL|S*<(PV1W86W55W2tG(lX zY0Hz(W0ybTK38|fm*$Pyj%hGJjb~nc{tzraXZAGHme=>(aayW^_-!@@La={&d{s#q zdtOdWMP{5)1rG1;pKJ}V_4c|mwkjMI*e6)Sdox14;ZQ;bVx&626mM=icDPdSHzKmJ z1&Q<-MY_vRvJzS4654W2o%*?o&@;Jjx>tgL|LU7Q~n=XPk=Ko7Tw}U=cAI(0;$(}i9@1mcbpv3Aqq#*$ary#pfmk?F>LzLoV z=H1)=^+HQvT44RZC`XW8Q1_OFV4=fQgy0KBa$xrH6J~>38b1_>K8!nD`+%}NtV2P` zts4#MFc9z@tEgaqftFHlqkW7_D$>SaRQAFP^hBY0jW&IhC8|+0E&oXC%rP?TxFdrA zcSl25b^E7p0pd#VC%@RY(5SK8Pl>}o>PF8-R%Ul$q6|=Z{IYDfA45|VfrH(YSvU7uu4{to=xv})!&6dOe5fz8BH=|v{| z5W_2>gicb1@1)E(xaEx1e?iJXI<9H}_>(IR?=aa*e#9E5~JZp@=84i;Q zcRy~FOA^Rt+zu(o9iR z>_d(nW4-3|K~A2-9T3TewNp)Z)&4MoPjo1VzU;(^8O`FR1;aIGulyZ{oA7ujJ_?}> zO-V5O{P#`#or7G5naE}vhHhNiNBU_I2nvw!&X)5vX1We&<7EE59(XNRnnk(^40RTd z?p%>UkDoC{;5Y}&WpDYDqFkD9+feVuN(>ks_8^ppH;3Y;M%Ao-s1)d++`@PQFRoQ7 zozDIqd>E^PaFh-Er*d+0B@8osQJs5&Ve@7uQ`nP&YLutRSC1$kF26{bKmk5~c7+5+ zU@$iU5}&f8MSM1aT#Hy2v$OW*n}OwtA+Z4GzR_}l@^lJRRF z@Caak;Nc^yAoX|Z-=k$s)P)o>Nac{ceg7n>MFqR-8NV(j{ntuaC~6`;D6q}GrB+3< zznw4HKViT`d(M`o*es2aUZHt$|3&<;Bce;P(-gBrHMcT;Yh=lV+tV)_$pJ^uL6B81 zVaB&77BlOT*_4EfD@AigN5^KK!pzDCH8x$f*YaCh(I2082NlSu7rXKO!`N4&)-7kqR zy7|;%ac9m#Bosgt;rb*&W`VZggL-czFT+yOKaw+nBRrFinC@$+LzHO;(LeL|ER&h2 zT&ML);dhgbq#neFI~U+WDP5YILg={i(zKwd{Qd!koJpf>4M1ZDuEI0U1xX|L1^wMB|#l5W8xJ;$8#f6 zjacAFlY3s{``N3WPe?ssa;n0$K`?m{nC(;wOA_zsw|3b9v&waJXD8Tg7#umCh;q(5 z3`|vOppimwg;S#RyxM@i+@UKx5h% zfv+?KD7?``e5-?Fs#SK##Je*J3glwk=a9tPWe7zbK@)}7Wx?7adXwErX%YiRCxVNG zhM%3RK*b+_AMwl57)pj63Xi))#w=%79QOYhtPAQ1U^e5D^OG|Gfv}+=@LoXTahe@lNhnX+eyK>bMID+IG=$@kXG3eqUJMC{svHF z{mHO3CFZg%J{^~82!(LkUdc#AF#}x>@1r4v3n?=8u+6o`dCbNrD#?DpFBQG^L-o_{ z;_uWEUsODLV?0ywVI@*KmBEJSx0X-Ln9R>}#W-qgyglI_wo5Fs*p|&e^YG(56s%k&o$LTZQJ&Hhp!%AInQF$T| zW?o{qMF~50pUXH#z2AH`47nsNhFT=vjzas>S1CTtt#iO>xRkqPU&&8n8|vhQv)|nL zd_;?URiA?iBUQ#1v4l@mi#&vm!FjbcfBvjhibgCdY?I8|_q#s$Mug91FHogO<(wx+ z7Kt>X3ia9{R>x1w0REqISP$&|uX7j%rq_D)f1E?xp~K1sSLi?Eyk2p`LNK?6HuDvP z5%S0zd2yjz0|6uCAcctVcQ||@;M;g-#fMCr5JkReK&Iu;$>KFrmNKbn8_KlUy4p;5 zWl(1<7Ma84;0PpJ#XZUm+P__rsJfMT+3PPNvhYdDm9@FEr+#GQl$4TA&?rCi_wb13w>devKfuiN|Hqew#fB5*Ti>bCX}N*fL@vo z(JE5!(f*ZE2_jalIcXNXMstKFsrQY#shkb7zzi84SWfD2x&G`v#38IkFvyrI{q*#V z@bK{Xq%)D4L;9!)4qJ-yNI0>FDwtcJ5&&!iYy4nU<2ZC}$DxrQU^a^T>er6X#^`Rz zQjpV)TXL-}r6wkGYZFMl5V$B@ubo^N8qYWpvn*Vc@$Vzd+ zA9s{O@$W0`Rpw-Lxdl@pM!#VifiG0QVVg}<=WggO z++)kdKsd)3h3I%|$4+?OBT&ZUZmSoxC~PRlZC@*CE<|iLvln_kVb{N5ov^Z;geLmkV&k>+ ztZ(?-{HIlkCWJwNdKA zsIx?t@WX+rSeq{lR=~|`AYyH>JpcLfV;=T!Z1GC9S@%K*s3cjzuMr447^v9_#ssIy z#w~yNhEYvX{eu%oo~7bkOFa987nU4GS&Rzy!xOZ-wBTiA#5i~HhcKQ;V(G?hCK%zm z26&o0mLvQ>an(17Eno%`o9fzGzE;W<6%pT>J;-v{Jcv_$ph)R{qvT|hU%XPJdDS-WEx>LK3Ujs=}Es?T|+QX(8 z@U_WQxCNOgX6p;GiR+9itmyKnrCXx`@jiD&%zov!x8%?L5+@$a4;xr-j2()!XEaKH zZ8m9D>VhbW)l{gM(j*?3sVO7(CZ|*4Wgb92b%p=l^L0o|^u;wk5-bh|crYYD=G{p< z^alh_L~oF+xdGhof0Df`qVwf5kVig_X$3|43<_SvFCBo4%H~j%Sd0 zs9hH(f$<1~%}OQkzyAs%Q+sfbES|6BtgJ2iaG}pqO9y3L5j9$T0T65rjC{OZFShPZ zUo7T>sajTI7JpO`OV5qju2yCTfdmS%&;jLFM2#Gep;gP(s4ty*p}uG&j@;3$yr-)D zNEmWYBR+5Y39TS-o&765K_wvtcrp|s3{9DEU*b6kEE0+Roj0>G3z!=lPxF|#q4db+ z(}oobZQEEtIFEbg#7p8K9G>-@8#(aJiKs>6< z<;pN_Qtq(#e7N7P&|MMXgJXAJ`!;ZQ{lsht;IMq1C?2`k`I*~mJ+uUV>Xvk2YWzp@ zOdjFjRGGsZ$oGsC8H~W669|wZ7QOf^5<@uFwqI0JU3G{Y2tC`SQg3@s^iS7)eY# z^6)WzbgJ+fc3*${T3pJz(@yCh2DR8`epB)mR{~ffMamJSaqs2bZg<+^_xC>?wD@Xz zyg1w+CM8D-xzqQ@`DBU>aSKo}nD$;_THU^Crq-=r?kXAEKyvJ;*E&mM- zBPnhXdPk5mA1BxobV>?D!xOqKY6Yy)yk!l~eQ^|1dGe&q_@3zdNOE97SNiQEZ#frs zt_igM#`7Tvoz*Ioi%&(@qs=CW6+VwG-uc4R&iS7D?NRD@QG{Nz=u*7Q0L+-AFK zqW9Gr<5vSh_9~1e9?nGlCfwh-9=Qp7TqDcqO!y%=Fgz&bu)_n$3$wa@lJSM#2RO@e zAVfy|IMI~Sf^`w~BJwHuW;YF1kC$YfpoS+DeKAdl=nJ8KJEDA^f_@%AgpC(;QV{9$ z8)0|f>v#h^2pI~EplfzM#kM@M_bn}7m$bUAJ#!UuT^;h17>0S_A348PTjX4iPOaq7 z@l_{*t0Gb3nLx$5Ip!B?yQ16j>km%O2{HcmxE~7+nLw^79c)_J*_Qk+J!J(%-Z=mM zt&oQ-6I<*DN|$?JWwQgbXX(cYQbkGc2@u%qX?nP!Rm_+>F460#Y^dqI)Bh2UA?t0Z z;2ZBHKK5HK4UrLQi(eINZ*u)0SLQ)|DKpPLSr{#KTG;Vig2!Z{j1=76iyYu>OdcKQ z(yP)#Ve>2aT>~l_{62~Rz1CcN3JQRSO4^U1cX<#TxMVC%MH?EyL2`G`S8YZsKQtQl zC<#}9fvEX|gqu`g2Kt5OhrpPk?Id<;C(V?ws%!C@L|UGI7&3g@5*M$G4&;KE?w=$- zGR$$Rcj3FroMIJeH79Ez;vn`Y+FX*Ay*yWzu<_w(U-B9~ZWHM1Xse+}WUDXezdHeU zF=_>AK4;zUyR+@@o_NuM{E)gif>$%s{Rmzk6}Jy73&r_I^Cw8brmAK6gjdPm&{fJ6 zimAb5l0866;8%l(znolV{BHM{7De9YSb1O}n9syt%p;y5mTi4F%9z!K)wPQ~{M&{> zgp?M3L1rF!b5(6oV$RR$ED@WhQ?71|YV|&6jLEDq(Kex>(4IL^4y{bhN=w#6^~NIk zeq4C(_6SG3UGK7We22&u zd9-uhaccMHG+zI04KESq3a1O3=O-<1fibWNLTXc7nQkevJ7eIT+2JE05Cs797_Tjw z=;adl^e^+M4^&2z+bHEIk{$fUhGYm+1+T@vYP8{LbVbM7?d3H8+r9YqerrVHcq_gR z@aRHfm5Da^G<%)ci5Y=MCC#kSnNCrqK-fIar2%W8LLT+swIp@#A~EEnAmV4rP>;u| zkF0Vfi&OAPVdPDbxRuS(B*y5Lyb{MXR}?@9+g1tsYz#`wd<#`&hH>fTNR63FtCeFX zI5?9{wG;9CBLo}WQ!4iP)4U{y6W$df*1$+mk z;E7&M$yv+CbvQo{L(eT9Y>=edCQKrtV-aV0>H~`(`Tuo;KIE7c1kSe7O1gj8I5J_} zM75|iEpd9HeSV5hdTqX#L#CUV@0u@L5_Wd8I*+@q74M@rSR-ro>CqtCExSgy{yTV( z3g=Rr?(=71SRV#_Yh+oUCmEFCSsC~4arwdrr12E}!#xtP&zwnQ z{?V5ky8p3D$MQe!(PVZmQuXZ@M!kfE1Ic;4U3|g`A5S+5B&weDaGEAYeS&Z3f-EFeaZrL}7~q&Bp?ldJ zT@n)dJLZEs<_0@zI_=fJ*o@=9^&M!<<3C}fu62SA2t-xQWVns{9T;c6OPy{zF3(N? z!cLua%x?j|G%P6~xaWSURHVVjhtoE+G@r||Gxup=zlo`L^(y?V&fT22Xw@H*iIh*ICuvv-$DIN?=Ddq@Ava|8BI14BT87@ED zk*1k{%k__CxNHV;vHw0QA1*(xWabnrF2yY0O%u2N*wDW=iMVcCJC2?;?{|h21P8@5 z4jkh+Wfa`|yxi`A!_Xv?8q(hSc%3j{VdUEh8 zeEeO0cVbcPr6<<=KG>>PncLN71~>xlwfpmR+2vu@dbQfyv-n@W#A#%e?>J99`3wHZH zNc=M)m7(j_1DT@1JC<(jX_js!ExXP473$oohFtEbfl|&SVJPkNUAi8<>+SfZZ{ia8 zXqc>syc2`wvdsGsX^4&Ge=#IPhP&0I(}n}mfF?IIE|PIS8N*3SAdNeC@_?ZZ=~zLT ztVuVxa=hCW%nBmVvz!k2)TZ zV6Q0vwpUW~;I1OunIbpkuVYi2(0w#D{Nh1KSWiSIp!Sh|#T=#1Eq4;Fd- zu*nK^6E5Z!924l5MzM&5^nSnivsEloU@<5)cWZ*S;b$$p#*N)tq|#&80Tx=`i6a=2 z@js28d~4&jg;41}%%lllu(N29Ml#)`HzpB`@_BzJ4zS`a39vr$lhK3K6AE>Yf=#T# zs}1Fl*gI`5z|^B@j#|Ky35JY*<8;&yaPSQz#FZnZmTCTqM?BIlLl_Wo_&#AGW+kqn z!*@q-vUbZu3EAUx!Y4-kkr-Xu17IHqz+{C0?Bh0X8an#F>?28<@!o&fM|2WZ<0(|T z{*cMHf7wSnK9Fj1@t{g0YK7b6msPRbB>?;AxtlFbH2x6>U>|wI@zM_^nK2LrY&Bi8 z&nza9kbxFMaWL>q`?QKhmQSN&7?et_pgQWN`;n|T#S_83S!{LOd6KVF5qayy#K*3} znhUWressgNM8b?Kt1@aZUp3i!uwl}pg=*ENGDGXfbN&&XZU_!f#UkFqHX^!#uaZWQ z%WQD0OnE-Iv|hN9noO=h@N6f}3TV z1CeC9UmpPDfwu{sPW``sdUa>{iWIIZ$ZL)@i?NyUDWByV$qC=;3%5Wbc$z)i)C$`iNq- z)r82@Vk~66FT@sAY2&=f8Aawa!?tmBB){;L)bLs&G-9`NZRlzsFB{U^yF{dW=ZRAx zTp157$?MvaWRpfx%_&h)+v7J>Q6Kpz+iYj}5tFRL>ho;9-#MR?T^Gj#fwO|NWVEHt zmn~Xr$M@;6)dQBSoc2poZiM?DZ|G&OoBzLXg z(HQ1jh%(8>vs5Bg>CB{nvD`~&0ux(RAihIQ@Js1^VtZ471B4VjVV(ZMKan;N6ro}4 zof{zNCQ6&@eS^0%ap;F*!eVYZ4uE}HHU>g)MyR<=*La5 zry^$|aogp|0q55-FB+)jKjLMr;YQ1-5NfqA+F(d((GZaZAds&AAdqPIWw6`u;|u+% zpZ_lcNtGNK&+hvUKpns**^%RB`oELmN(^4^&WSCNycVW zfyl=dPh9+RQSgnR*sTK5i!nCe5?h5ANn(tB1#Lu62p z;Z=)*L{BXQE;p3pIQw_Wu)Po}l)?oyGrile8=;p_F7j8|y(|M7^?hf3l|1a9lcK-7 z^Mf0_C0N9ggN~g4&hYt3VjIHBhyN4rG{^mBCD2Z#}*CP*i^VJIsHDVC{r8`E*?b4kePh%k{Q&f)e z;`S}5KP^$#j4$dwmK~13rEgy}j;T3m6s9KK6{|ib|x=lw@AUv4S%-=JA1C12}ruMRjk5K0>Z6!XC`oK@A(~ta-qZ z5!Bi+%CzQtdQ0;k1+ugf7IrK8g%8Gg!*c(6z6AP`hKXu)dy!w*c7zdy3Of1}6lXRC z&h7S0LdK&9FqWL+f2?sj6^Yxs3i=J~ru zEL{b6O_|@2l{}duQ`{-ov6D=Q?FuG()FTWE1W<5D)d<9&*xri7Hii|;UVpQ%ez59S zuf*r?@AcMPin(q)_bcJ=*15;H&RLPf-ngFh2;C=?HvRM0K`Ehwu*H?0Q~VBkKa|t0 zy)s@AhdklpY4G}(ej}(Y>O^d039;d>SZmNzWe(kOV=;$_2&fuQhJUT53sV(^_>wjE z>q+1u%^`y-6A69K>s#zTx=ge~n(|p^*>Zl`-Fjzf%Z@fYv)1+j=kMRo6286t-xg6b zB`UIYJGr_$l}TlNUPc{^LOAbIk!M{_8+CFjoskM1KMtNMb;O{XY)m zy*j{wOqlo6`kw>2ZF=!Qg^Hs+fw9Jq%g;9WAx*?GF~eAgM7Vcwn;VkO0?RbQWiYn9 z$RRi+|E53Gc&ip+YB@@A}l5{`x2my zVShcX;McG(7jL?583xT0^QXnTLm8M|;g@!wtqO7VZ_W}p?@wx9%o)}WqI)LxtO}l& zbu~F1UsbK^n-q-;a1Pso4^%hVB(Dhb2NS@g=J^oQR~mV%0Lkz4O^+PM8PgHtfhB;* z`d}(_yj@;9+v<#;Li4A}#fY4Z>< z->>fDZDWN3R%|#oE%Fz7UfF+`OrQutVt2BE7-rGcV%>}XI=XEqGl;6@8mc=^*;yY% zwvi}QcYqle;zY0|^l3=tM+CPk(x-~LV)CL)H@ap(FL!56_WiQPZGfx2Ly0<*unw6d zlVn1G_^nE!wPg~a*V}SA1|t`vQ6_YvQuTrwjIi`#-g&#+h9vtw;9;9d7P2I>hzDcN zjU4ZPI?O&L0+KiNPw;5n@9T5F;~i~YXVi9cJM?-pld-v^IGSt%W=kZHPVj|W3KwIw zs-lp1?u0s(Ek-PCd;4kSdIujv2SZ+g(!mh{>Y*X{1oVZn$^~7Yt_n++y!3T|-8VSo z&l1(1gk2B81pz)rSm@E_Y?b{N3&s4k1xa?Mytzv4$7@=kF%0 zg=HMjmHXf?-`MDQ;`g|1Xx<|fcM>m>^iX2x%CWJwJ|f0hR7OOBA&%FSP5!N za4|e7P@_4Nm5HF1$J!bk!?6$gy%p9_KbwSfxl?mF`zpBUa6cdI)hAqWnAI~XTp&L|eN{be=frMe4qYFKx81#i4VKWT>E4ILb`X z)~rwSh`2f|2eF>saF$Ino(2Xfjcmi8bs4({L8ID*K`**Gar$~fn~@*h`%?`>GZl)P zX0WX4$4Zv#)$TX3IX8{mt+v@!AW+w*2vpXBxzaiw1efwxY4jg7@Qrlewy){MR;)IP zO+_6==1LU0&&@N~HtJ%VnzJp$@0zGdAZUZyB|2M zpe42rFa565;59yc{ku>Fw_J;2E>P@jrWB8@Q3WAu@jHM~RxKt?4QnBRg7V{qnNKdxR($w z<7BD7R650zKWf)M7z|Ae`88)z;(aPT@p|6h3CB2HD|Dbh+*%L|ub09PsC;s?Q2wENtND;2K50`@u*kq+hB|dSy6#F@{dC5l}FCpG;r!T$1SZrs^7ob%5Bgy z{w{yCr+6YorDNMovtytpBW%nyCD zk84y7M46L{UgLU3NShk$;uVL1Skm5KEW z#4{R;*i=psxc>}c2hls4mB|01)_Vo_uYDvA88=07Vd$a~i=@t2oR4&Rpvi=Yf&_(@ z#oB=d1Ear}$UOMV9Le$<*~F`dy#n2axhDTs?M140OHsc>3~t6OSKuXMitL;DK~x*Y zc;kwsC%ScRel8v6ov10$L?d00F6RyWEJC;WVC=gR+y-&gijo{p)+17#G^a-CF>H%> z3lk48E=9c=aqW_t#hu*bN#*p5A_~{LtcOsOs`B|F1!{a)X0V^zTECj0V=^iEVz_TLE3S=i&km^G)POc z6|6zlol+b~TcLTfvsMgvc6e3sG4vr%9==n)eO+`sU)dvgX-gLBU1a-A zAgK#*+EhS^kjM&ZCKCnO!YY5)f* z4kpB^md@_mvc%oh3T4`_OpNqR;b*QwU|FggNkNBde2E>xqO!hf$z}tnE6F;?7UhW@ zBFqw^UxJv2G`ErX0~xR&RBt6rD2bnVmc2-j_Y-;@d=;s}AWdwhK!i=*30Jh(66m`E7p1F z#W~evQQRh^FYOfumDwq^;HfUqRa0TO?De{&g^-6bY}7-8xK|AykFJ2mvVMtmU$mm0 z0o?Lh1<-;Fjh5zDDoLKYt>}dj_A0R}0Xx6-pcugraBm>CWZ%+h2{~1W!VG(OKlss5 zf6GlfOr0-wjdhFEHJE7jI}5gms*r3|#UphhOxs&%8}mJaJ`@Xp2RZk|b2Qk}mNc9M z!3^*qQ^k1h_0O|%2Qr2QIH)dSU6r*-MYPPjE+e>rDeG)yfXy3T6o+I|ho2pp5D|16 zC`L_Ix4pMo;Jny$KnWsYBr56EEH|to#8<#bu+(>PLmFoKFV|(wnLvez6VQOHn#Z00 z%yE@-#c#9{2*H{)V9JV`mzT_INr$!nIWrL@11#0^SHW?V#g?*Sl!4ab!?pIXT}s@` z$aZLR3GKtI@6pEluDP{x9{L#t0JeWRB1we1gtKoEAVbIa;o!PlkupA?FE8iKiGmXd zKoPvR(AY0++A9#>;&=BOhdV zuNM191%3H2h%P7L6Z9=r@(>J>ckfZ+Bmo}~KkWKx3Nyz%-F?|`_nZW&7ljAa%vlBs z+4SiM!yZeb!0j(HYRyM&FvNq*GsmN;(jbb@+$I^4)PPSgLh>k+RiiF#nYVgYY|aGM zrH!b_av|`O1C2;Z*f2D7XR+lC^Lz*h z(yx3J5Krn zY~}3SZMOtpSCj3`P0cJSjB(j;htp|JB@dD`A-!?Us$N8r(b|7lNWM(m^PL>Rlk@3V zJL$Ni_c!s}hCpV@jX%jJm$P;_lf1t_KTWc|=j8dS|KmdT09;5#nnX!uu3$@rLRI%1 z>u2I_aS50c&hb1^nwE|>%ES|Y*0n#u@kzIwD{jBxgBY8?cSYeI_w@K8g-bXJO|N&$ zw)^vbE~R2g?t1-S0X%}UwN=&_+hQFF^%@^NOfGZVl9!fLj!c~M9h{7NB@j;G4W^z~Qp_g7#@jT%)k$xhBzDK_y{=##B0@u=77R9_KW70#C2<@?}V zhE=+^)R(hH0}E3f%E}Okxm!e0hxkrk>-71M{?NxR6|t-y9~8P5LbJ}0TAcN(DRz3$tc0R#gP|(>L_2mRm(54x1nJwY>orV5 zAzA#cr{%&izY@j<(S1b33;3!{#UsifNi0q1W`}$T*{B7@67Xvf2fg!yu=~V`SeWCo z5hBK98L{V_K6CKd;W+U0G~4fwSf2g`ShuQD%%{pda?ftjbE2`@<`EN~d&{UpXp;Az zaL9t~ktkH>_iNtjB2b#z3(%WuB&%%)u&i?Wm6S;oR(_rsgg#%`ov7YE$qm~COUqqS z(1SBh-WFcjIBW(;0_4906?g*1{K-E1vq@@beSijop|Vwx#_6~-#ZM(;D9(_fRjjgL zkBIM$Lgi1eprg5t{Ot`b;%94)bGi{YV4rf>mUterQGzLmp7b{5aX=5v?a+_b#7;=|`FaZQA67 zUJU>*LM-p@QSz&pgLf64&7=gDx{o`7?;mBgXdx3>Vm(272)ltb%5)x@vy_%>>S_k^Z5Bp80-*uaRzH5!d3ouqk1J4n<+>C*q@TXW zws!5T^kV6w|8vW9$3{$N``uhc?zy*A+6Bk34opXIab>%$ulU)$XRBKNC&{sCQD8P+ z6aULh-L>nKM4&lSL`R4)HPUVF5TCePzs+cW-H?(4$ud+3)F z{%)+})ZoMQpE{G3retxkX2*A>*486IMEo7Z|h@GC74v@grlvM2;N zUv-D+{XJqUToRn-rQhGHe$Mm!&*<uzz}}0)DXF1D@t_^}#R$A{mc(b!qo=4Xvo}my2JgL%b&;agEh(;t(~)yGY7V|=49@K8VhiKO~Zazd19JH8=-9P4w9rY-W{o4($M zYugdXjOvaQ3eRWn$&)agJt*%_5ZxFWzCjf*z+QP9iu#>TgqtSTywu#tt)M`^AEdoj z7yO+h+F#H=Zz3;li>+6o1BS!KIQO}tZ4G(QT{zwm@^qehYdr z0Omtta|0?xv;Z-n$ zAByki9jwP+;G>)X-gXb~y;DXOkAgek1w#?yI%TCgyryo2Rf5)tMsYufzs>3EG^@tGqg(&l(oX6w1A*+)NY|0{92U~f`cl|>{CLS4I zp{*W9_P4E@6=&5&K!Mwir8JF=mLb`B-IFsny*Aaq1Yc5ahQ5eXG|=jGUXCzCVwO zy7#Eha&yVvT&gMhUCtNi&Mp6w|4I)0$2x46r>;*BMN9Z8pjshY^XU+nk8vaf%-y-P zDd>*^TgqCa|97h%V9v6gy4Gnj+2gnI6qrkBL;T||TLNO*E$SANLWlkwMN(wRp8vivEh%eL~iN5M8a-gjzKj{`Cv%KBo5~SZPaaa$mn`uY4V)b(!}paBs07c(C7%orm@R z7A;%P#zfvU;4Wj#@VFO?_Kn3hzO_Sh*q_NI>aC_aGsaE6iWS-2ux25Z$g_D&2YHO z7z#SNcBMm$A>5+oL5TRXtqok-oF2@X>1(9xPtDCj{^+6X=%3}T_OT*!Mo~ltYT2kj zL!PVa$f!6fcMY*1Iqt!OMdtmBtTN&4hmOXJ@Ty=ADOOSk4$J$KMxTRuz)nrjZ@;DU z*g_RU1N@Tp+Y$n>p6l zOKeTuRR&b_n^Z%in!bQMy2eI>4u@Al$y=kw>*qcGZTJ!NSOArQn@{KeIlmwNy!mPc zu$yi@@p58*bWyY2??F9*SSn|C(Zu}C@wGNoVIc4jzyJIMr7j282raW?^H8^h3DTsp zi-He`bg%I!K=%lpT6u1@`@GEC+VHqc|3vSLwI=O}*Hh z`&}H<)d6|$nA$0!-1U61_1k6kYbwsJYkTWk-~8Lhr|r_m;W@6&33{r0uS3x$u>31- zt806jtvbyAa*I&0ic*X^*_Wtw*Ci1ck(K`lZiD*_+j>@37Mkr*5A|x=)8NMAg+w%M z)Hk`@N|3+c5ZE7f=B5UWc;usFUFBQAYe<5P=A^(v{cy%lLe0n!a8eZN>`Rx$Bc)=$ zbY|E?jwBd$*d^ilu}q!Tl#afCMY(+MfpX3HOh(k zSHK*;D?iVONff~)aTT!shX=b9jMMwE@Nz`vpyF;@=&IvnWhBb;mp*Q5e7H|EJ3CrQ z(#7@Eu)hwFB>-#&VysALV_7$nDgF?G8Ki~$|A@NB;7Gf+4b(9wwr$(CZEIrNnb@|S zbexH8+qP|EcAod!yLQ#8`)B{^x_k9n=XD-Noe@vWW%+bw+8=ZnYU*^=pNUaYVY#20 zllIy^Nd_EKC)2R+UY-LJ@=S9!u!u6e&$H&9x1Z7snp(|;M6Z zKFrKhSLM6ZGWk=;G&iw``87Yggr_4if10A1a zqmu?kldutY4DTl&;D%-T#n3+!;t2C)64hO*0~?8B887=a|AboPm2;J=)5f>fhTZWd zBX|QrBH@O(^cLE-o``B5(w2jX&GI8jCPljvNU?-I%RHb(Sd$AJM)KJ#k`vjc4n%6K0C; zFG_0l9xFUH_SZp7nZlUc&UUt#*P=1LI>Lrj8#GMYN^l38j*ac@K9l-`6^&8qw}vDr zabKw|i;=zy%rlN%cwUhA*nq7MLT0*!d`8jPqA`>YgW6qYIM5foHRVHD>j1*EGBhAj6Piv_MjnUYZL_L zOyd$ALg;n-%Mj5XH5R-#;s3y7go+9gybcNLe{^=0Sep!HQ)2H1B`|G`-e*Om>m`5R z&-Fu&lrC_sw8Q4XnwE^?QyzZ}~O7!n&Jv#8zGLJ<1cFB^Fu zEzXd)5dncjJW34i8abe9vqEy*s?nkhp09y!76+s!hbHC5d%Li?fv`2e+jvR(TAiLDLXauwx0z}$i3?6-y#siBq2*;uphRIaeiUN}vj0~x9#62i96)zRUy!*^ zCixjng#8&!ln3L8=t&BFIvfgG2e_dg>VHtn!wXjf|GuWXSrZR>Uy2q#GO|RK(Gg1J zUH_R)3}-UqZCAcXA^fXcA0*W?I$X+2d8h$4MDg{6W7@GuZE-L*bEtLKPEOO8=5)vA(> z2oCXMra78iXg;*^^KxFa9UDSc)d_QsR5^xYLNml7$ER!|mtKaA?mlW-w2_YwFzwM* z<-{qxhxk{^?axYPb@3H(0A{!mETL!-_$z}<5UqPxAe$8lv)lVFFr}?um3~_QmI9?634TODgJNL(B4a=ir;BQUpxJy+wmc7_^Btb?>UJj zN`YYg4EjrF>^Onv0SHd&0V#=MWSK(p=exrt1vc1wP%T#E#~7>zjq{6ysBR6pZ#N9W z6HQu`s6FqQlIz#X4rLxi0Kf#q2Of2zC=@-C5x$VAI(3W}Lc|yvFqu(<DDoXIqHBe2~=2uW4;B;0;?cjXlrO7|4T2tiSs^m zZ+csk_XUVRG)keTzp1<=`s0p`as1={;U@=FmU;<g zIgB_$xM8>Bx697?!BAcOvwzBkwfJ4}Vi_E+sLD?c8R7P}OL+_2=9T1SPO+SPhfE#F~Y%Z{BP4s*LvB7`*RK-Hwg zxw)R|n|A!dlhUDe_5OPBd6nk^cBU^KRB_@}Y>~2t0WWgg$SrlB!Mit(0r@rzc9|s0 zIoA3Vo7%Jsj2>wZ7y|<%S**m|_#&HZdaAiS!<2sCV1F)GNPjy1cMAFGa2X!d9`PxH zzqo2-hy|hNH+1nn(pMI?;w`5uNTqu=HXNmUd|mCbe$#>_?gBQo8#kZg-%=|)#}X<| zJvj6b_>u5B2Kama0y=0wwV=rsI$BjTM4YP*+Kc#4MVd-;uC75;(^Uz>0z7e0xeN_x zp6gr+mX1DwTt=H^L#`}MCB&hfWK+HnY#rm_?T+f;9;ErxdOaH?ZSGj*d=Ho5vd~9p z=2p=~v8PTFIhvm>-jgRbPI5T$U8H6Qn|TlSo~|O5z33Z$88I4f#EQgTiBqi)jnu<< z)IbSb)=etqO|T+kj-pJke`(DIZyaaeaHCRk2SvZ&Tuo8LdJ!#zuN> z1p`_!pwg%SkG@^4f}M<5ZG$~Za!I*8T$RoUJ)xq+=P=VsNuvf(FtRy7$Sik4c(Trd z-3+F27dgKY(5O;i_y-FTcrfc6xO-&K+2+UHL(aOQ)Fc+J?L6$dbE7*2_rbp`f(16zDkxQRsalJpQ*3q$2Q1*$)?TQsd&^h^mU8UT^u9s0%de8_ z9YTjp*5)3j7UKh#lGFg>W-)OsD|6Z3T2>isgK=rVsq3A8C9&Ch@5@)C-%?g9L$7)} zX6|FoeyPro%kpX|XqK+oIU{3Y5)pO`T!m>b%cPP3@j`@50^@A8U>N$>X^ z1J5(5+s)Pbzb)YDh?4MI3l*m5{6f(W1cKex zpfm0F;hbt}*XHkSnz|Lz`*JdBMr%FJkbaZpbAuEzMR9A`4 z<>9Qq!U@lmwtzIKfoKlM@v97gM``QiE_LT8cuKnZ&7pfmy@2xk<@Ab)4XBa7R~9K2 zf2+f;qf-E6eqf$lKVr;TnE3Mhuu}MeTI;+YQou0?`G59K<<|jwRMLukl|SCR#91Yb z@>Lb4Q5;cD=s%{+#t0Bkj8GiQoM7_MY!J-e{N^W*Z_uxp2JEIyGZfmCt$3S0_H>+4 z*TJy$BU3-pov3z93u%eAE+T5rs3wxDlq8xzcD5J-_uE+a!VpXFRl4uSZ+inn8V%1m z&H@a&4FoyDXA!z&SR$`lSMv=+Z9f$ssx*X46s`F2K?KO-^wWwRJHHHZ&ER#k+_*@4 z`a`AmC2#Rz8g|GnJ<_`0KgO2`F!J<WC5_~_rKMu#Z zQDM45pkkO+%GTEwE@g+=5EL^-rY_9o>866zdcH2Hd;>KQ$#kAZ6Ph)Hi&-n6D;43U z^(j!o{y4!6n_>(8H7j2bE>V~#VT!2myaKcZ7R*?3T(Vsy)9;K>ut>fPnGxVoVJv9P z4()0iB|i0=ov&|PxQodMgZg4I96G=n9R%HgSwkIsG3Ols;m( zV(FFkz3)ddnUAg1YI&_looP%RR|PR3^>i?FAX$hcPrloY%-)i}lG<{}6DzGAD4Rh{ zkWxjk8txY8;EL2OSt%PQxzk3jxvw#Lj?bpI;5uFzYtM4J8m0y|u>~b@KX(0RQ`(c8ZmzDCa zK8V^?`|XyvXt*TESWK^i<8vi$tsndz7+(I$k{u@dZ_ZGH^I^R(H{BbUs+u71izv4Q z(!JdM;{GzInuZS7CR5vn>TjWnu^4S+uSx?q+`iX|OVn&o-|! zu1&_54n^%L*Vwfz%O&!>N(8c%je?^%2!QUSf`r292{{G=kL5kzA?PBCk@osW7Q9r! zi7i6<2tF3g`4#NWYAq{ffoyWVAc}i(bj|cy&SVNRmw6@MFif^kpM$xKdHKITl z>JNCc%M2x{F!Jsh!Sj?lG^5uyn$CQd9O^U+n0o~o zO9>nHtok2EA&3^{)=)z}qvzAn{1JnLfknrqYUW2Od2Y4b)_??>WUzqWlepK%|JDY~ znQe_S2=aSY;%}^~?{qs!Lc`?GrJ7CcTwi!dN)rEx+M0d_abT`IkktGr=aUR&EBs&C zS6Z@3H!ASqt)`rh4wy>Ul3mEgrIDfUL#Q>+2bc4A=1miEsRnvr|A|9aC;y20fy{vca-eNCRRpfJ0^pm zw%n38hE#%WS5i3_=tfBpoidi>T%}bSm9i_?yn~FUJOih+@&+Bgq4Ft7XHPxfhm2=a zJi>E5A@mdM4lKi5LP~D(g_&As3FNil4lc(!S>rprMNlL-tR(J*(`+6=r?s`)nY8r% z0E)el7DnKHnmGBoKlkxs%OH*oNPG6hQ@Op%68N9uKtJn}0jN}eFPRVGU7=VRB&#Z_ z;Aay=JN*xI5+Kj?A^vgc3`JZEi@5vS{|kK^S5W1oyP?cLkQ#PUY;sF0mpj6fm%2eS zyF=8I)V@>=<@?@yq9VAaT{c-tkge~;UKi66k*gaC{`+?|9iA2;YLRkN+k?Z>Egs|Y(hN~I)Sm{_x1fmnyRnqPMbuQUdB5NC`wjH`UHmspSY1c%GwS{DZawq zy=bzfP551zf{`L-+2Ohi zWFp#Lo0sQzOy^&f+I6Rb!&=$<*Be6p&nv939?ovih?_z)4D>kmD8(?6R6pT?MnS&-@*zjpMcK)gzxzVec9w#r6cvtB-RUXq2Jd zp|1*JtXp*;BqCMyeB2Snzw(?9>|%&uhJiBEtIf;#1J93REc<8OvsyzP*feBMGDrlo zSaDH=CAMk*Vj=oog@vcB!C|G){r!kv~1lD<<$Thq#> z$18F4mC5^i&*#`1t?>#yGcYh}m^{z6FdF}|7-NC)XG#N~dMGMCV$(&)k$A=&4?cA? z;Ij|)AOSok4!3*g-`FZUy|CCvMI`M!s3o3}EkQ$gU7gN`F84QcO1w|XCq2&kPGr(< zB9Dt@((o6OkNG`!tWR#%SQG%g>wxkSuMum^*aToJcyUVZa0~hA*+%2mN|0;y(lWB@ z^n4QvZ6*q99)1q=6Q;YKOU<@%bfZwI~{!wDKpBcbd8cuvOkE!`&&S_c%w`$dMle`9{0 zA>-W^9HJfik`4+;4_|7CDJWcTIDl6i0O$LZ)Q}dUsx3tz&)!7R^N4?$l>^nQrJ+>& z9xtoDoquBmJs*E%aVze@ZP)}f0qL3lFBftfs2+6Md-!W+o zv_KrkF|&tE3{UNa|CSb3ZeLkjuS3e?|ZSNcLrwOr<1 z({BlYCeSkQl1@*qXvTgd?WO}yQ^eV;Xm!(SlR=+}*2htEGl~WpZLb_!6vVnPe`{3A ze1y=Iz$ z|20Lj+NrfWODY&6$rEgK?+f7M)~*K^>qKUIU&B=TpG(y%sQVvF?fNM20!XpqoVUk(t?nCH;HsX? za{wu+IWFY8l{_c1h&qnrSP5O^=Glw$)_i-#xvVbT&@TCBGJz^>L_=AtF7!M|pZvgi zp|ROQKmx@p>hAbwA=0lQR;4QAs|kW-$-i_KzCxasw}tDUEebS&B5~z4l$Y+5w~dD_ z^wU$41KXJq5l}8;`lm|-yt1yh9~SMnAAVE-bn?UqRx!_sCd(?_rkEkUR%6HOL&qcH5b8dL#WN6mXChQkRu{VLXo+wHa+dD}bfSO4<2yjyN8B!0?KKnn3? zdX`f$R123e;s!r{&U27%qDb4blFN~Ns`dXI_TC$#XS|2^EG+RDhk+;{Et$MVmtet# zLfL(#@ZixSo05_x)dQco9;7Q%*Wzoy^sYqGl~OYikLH8G-t4rPj66RL#%V6f$X(ztJH}q-u zCcfJy$0Eov_*H0U=2N`6?e^Dgq9a@?cU#-=qzQ^x-M3c^q1RXPyml;Q$`{%LwKGBY zAM;)TtB3g;z-Bn$aT*xjU-@-j5oo8f=W;b^waZf#S5&ol07o4bP5&y3Q?%gw65 zs5TjyO6;0YdnOHfD6RmVdGeLA&U;%CC%bWfB4&K*@1iZ(y(qOKMT)jxB6-5c@=Hh5 z%cx|MIjd|o;Wp)T_1BUF+X{SqKXK)}?z~&C=6j=WYlAgnadP!k-W5BPE9$RREy46& z&2q0%;S#4iN&2XnEF*LHP{5phO(60H$!U4W&y*JIt4Q+j(=g2$FP+0EwGQQGR~H}k zc&gMNR2CeF_wbkQl=JD*Rh%*`WpY_Mc%29Ud#yB)Dc{4?{_v`)TLZ002rjAenk7;wV_@7>A>tY)A^ss;X1|)R7#)Fcy~joAwq^>VSi{%ZuAZ<4kwiHAlAF(4>pmNl*-SGq)b@}JNu^Yq0cF_|! z(Gi!`!t%(sDde~wO%hQ+?a9?06I(2hx7)lin?*5hiy(Y{i~bS~uP#%#-LpFg4jGFp zf!8{-N#5pceP~tP=JF3OElsZp6@rS$xSuml*XI**lXEW*mdj{u*Z$R0RRPtg;0BEO7)oaqm#-)iW@Zw)gqmB$a zEm3-giJzzSF;;hiJLS*mZVQ_rPSD^-J-NBtpZ@eJEN0>1onfubqs0ubLf-UGPZQ$x zssvJ)IWKenZDY)xX^@d1%{`ULkJTg%Jwph6f+u&P%srbhZ8*cH$r^in)*|`3`~KEk zmo7uuD$;phIJ^G2AWVoDVp35N{HR9mUrEul3HS^{ZUy}{3$&g_gGyJq>B*_V$(bP1 z6XiaH96`6GSFgFVakpJVdA+OCdg~@j3pxA8ktVSH4#^X&59p_$&VX~xtxzn1DyCadzurGQzRHcsnk{C9ifaAlhhnRX^B z`M(|{=|>V}09WD9r)#p0_hC!2~)fc?RDgSVu zUS0~}k3>3l#8L&p&EWq~$+HJGKRy``=NqROZE5u5T_G**gpGO;A#i7i{Ar{PA~YhE zr23Uk-z&?VC8t?$*9nAD^tE=4?*m|0c%_OuZq01jRh@mrzPSfsbz*h=_F%WI0Hn++Py>7rtCXPXbubz{Hv za2>nYodbIA{fhg0+?I2U$vH#8TrhFWR)2H@O1hjbXn~d7MYv}}t~c%#L_is`m6H{Y z*L=N^ePh;R2szNd?~)TnTXoDxIxgMbl`QTA=D1m#cAsFV=#XBKl30s1D6(i&5Ai9X zm3-71+p?~}^b<2RWG~tNsGl*w{%T$kdpBxe_L`eEu6T0jV=E#;vy2S>?9WJ}|Pe;SdPY-U?8NFBf`rhMRcygK~#)7N6W zd-7+?{v6#Y zpr1@H(eqX1@P$?!jmcXRo1|38YkQslvzw7+$jR0JzKiB-3K-Gv_zw13IM8ZTpkHm6 z0m=pjjgMNY9=lJ1F=vCe=JCd6KGXjgjd-!vj6+n*D?RFbMv~L4WMAqnpjom$9aVEY z_c7=2Z=Qgb^_YqZn1`-`fBo*Og;#wdEl+<=&7X7eLsTGS#UptZe;pAdbfZVCJnypt z2VDI1>(aShD#B^-Tl=ojOk3JQ34elJ4Mh;aLwKglD&zrPO z=7RdK=1GBuzeK&RbF`V*437W4n_*-q=Pi>p!&z6SJqo}ImPuS4qP#L@%~OagDU$zp zI^%?-u5_-*EdZlvCpe$qSA%-bm!37pn3xg|s8$JqIq?#;4EkAho zHv@C91)(!@F4%e%GWBl+QJH2xT`9USDEyqQi2X^Nd@*BvzdJK_Tb3C*$+1NVuYOe2 z!_Hz&&%I?*Fb&+i`d9ZuV4s%&Qk<+qoa+1vW9BT411$I9DgT=Iu@d}14od4R7-vnp z5E0^sKviSScP#?GfW zwC9CEp%m6fTYx-mCUmcolZL`=&f`dt#9SJ!gh0&P-8pL( zDquNSs9q;zbKYMKQCFe%++ULbTsEMrp3X%rA~^UJ$+Xz~qt`UgSy+AAPQ`RiasKEv zFp(0uy#*&oEbf7Of^IYM1WqD0^RD$`&$+ku?F^g2iznRMwm*TiX`ha6EGZQYNM{kd zPc{JYRkQNM9m#FQ(R+0|9yV_~#XAEf$qM6A0Vnq?e}vlVV`P(Pw|eHDylrS!w)Py1 zRPL%(eKuFIT(t!snFj?`EPo+OrQ*Os-qM6ZIbSAK%t4RGR5JO%U&u-rj-hvwHku5Q zWLUK^m>5J1adh~#e%EqprHh(KrdPxk)Z-u&l6UiceksYOF(wq(!q<3si0}wlsxcl@ z3q<}vlG6L-&P38}ruMj6X1vq_3Rv%}^6hc*WBGE6zNg1B<-ViK1 zN}r6m{%!sTjBs9lE#PC7YG;z3-6NNy)0$^+JmyA+)DL^BDJN4Xd?z+!LfSnLQ`ptr zRqWjhr@azvo94;t)^_2xh6(ssW^5!+o)$}}dt=NoE-PMlKRC9flI6M~WS2y@giXD? znoPnhOhy6(KUFy%pm`c|qpusL!a1t0aVBx~S@!$@ zJL&iF?EE?n^6yjpe@;!>^P@8bcOoXR6fGugV<<_sdmWA&h?w{DARokNpwod% zBhMT;AbUr2^Yg%|3YzO+aS>OAit~-+;dW+2IK0B^@HS2SlvfypP5RD5GVW*Fj)DR# zm5VilOvH-ub0TSKFe~anGNNr`I9xe)Tdpb zw94ao&)bPZBKB|{NkDR5v3}WZaKM={nFc4NI$33{YL#llOc`w**-dQegbCm-@j`I7 z5K$h>gHNCMdq9*;_Hp@eFu`o9ggM<%L zJTO}@yd!nEyC6I4^7y_$_wB<_zYv_wa>XJ1{@Fk>xY|Tp5M(v8w8pGvJL6bZ;Eo42 zbz1`O(`ny`h99qW+w<cw7&~ zRAj%wXdR z<4q#b;2kxx9BrbYQ4L7aa!(FcLtgb6t-c+`O8G0e&44Nz7UuQ5aIHCuW}XF}`3)5N z!GJ11nkTu@^8vySKwkTim={3d*D>DhGkK{-$88rtu^!hQ=rQro|xvO%YndZbcRE%ed(poo!-f#0gVJseP>vR#5a?Ow#Bhd}>nHIM?dKSt4c zm5PJdQah<(b-J*R8VVzG44|Hzt<*WS%L7)U&Saed7@2h;)yvhuh6zMyE=BJk5U zHhG#3mZ^yeBLT4F!4H)G%$ogBTny4N?S%3n0 zwcC(vkk=9xE#r1X>IgN}sC;qe|A{r>Z5oksVODfYzbEN~e?gGaokz)Rj?SIt-LFzz z;v6iqXw<0WxCcww4&Szw1nW|x#f73=A(=|AiB`#Xv>SDiwl~i*f>yKLkv^SgNzEJg z$fC?VF^$PS&V`Fcd~1>#-Ij%t9{K&})lgW(J$L(=5|jj{?q_qp(98@@6$=H_$;I1W z)=Ex#*F2N&&t+%A=!M2`rLYBN)8W&3^s)=iZ#)L&OslFC3Cg$F@x*(W&*byMy?j{T7ATICImqprV}t-JYQRstywL7*sHGWY&SyCYqD<`_HR!;$-;oY7#0{V~^pF zN8}tW$MDJQ2bgsm-txFPk!pyg?=xY9<`46QQ9mMH6QUaCB0`>kc>+&aV4EJ?`<4!e zR%8eSnizD&ScOo>;J~M`kbLd;&m=A4!8$FjKlki#G^B!KLb4?N|yI;0~q|HsuV?vdcMwt)H*5>$HC%tBEYI~ZZRPH}jK z($HO=ZwtS!Y||ysagXhc*NW>0@3fsDD2WMGYcTVSAeD6;ZPG^&MNNk+a6Qgq^l@>=#b7b$5VPQ#jx>SGwzaCpd zcMRG>|2(7sb*)NBd`+tZIYAi02cq03i)2con(yrKfW(Z&g*aid+tFLD9*Zp$C?2gbPvVL1PJD@O zev*qCtl+9Ls^$b9?KqhgYBq{Amo`nifXq$q)#q4BQ;>z^o$=7(5)Qv8i)07h24Y}t zw(kf&UsrePeWG|$4BN70Jr%j#h)JJtS%zuY=?m>pdyiBmfi&2{J=(gyBnMI-Ns3$E z{Dgc@d6mEC?~uk1wFe;#+L(c8hvZLj6>3Th8ut@f1wq@HDS`6<^cZZ86oX`%O5!Gm z%M?#-dr-lmW9y?WLPtvu04}WC;b}CL?q-uhXQ9KLRf%3Sd~S)4$b<=S8Ex^cf^HXa=^c82+nhv9|nWQ|SN!Sdy zp2l}u09cmvtcX!c?Khpa@mAkxL<+cLh_w(Io0NULuUgXMtX&JoVIjtIT(*ApCESj3 zgN;0+$;BKwI-tC{cxM|f_%^1;Km4NzO@Dpt zP!Gd}VnnHdtm_rQWHmB^(tO|N!H7a!?!$0b|^UK5 z3_%}%vh$xP%~0I3TT~?tqq&~J5a2eVqKlaA!5600v)xTFq$7v#Wbx&=SMC#X>YT@4K8ny@|^bV5W8u}%+SEq zVl22duVC`C!+62tG@~Pa2ghI}REVJt=d6TXQ!$Ac`Fr67nNkLR>+rDuVSaGQNLS?f zfoceH0t^{v7&hYK4w!{3K(sTvj0IXH`wL2R{A{w+&;k(Ysvf1q{z$D(WRMXY4iA6yr3wX6#`{|>;WaaK{KGXtOW|!t zZ;8&@v;c^b2>kRqQ);kTf8h!3X5i`cNzdP#oIlTZlRtw1U5xX|g+K8~nUgZ>mAon+ ze>FiB;>gQd`*nhye$YCCoQj{zvoeJxzid73@5ydYalWNAgU=$jYwPfw1GdcdGr&(2 z3DL#tI{51pd613!-%E+k$Sn#si?&4vO(gos^|BpZ74FBJ-&+oaY=GOI$JIRvKFArZ z>`vXE)nBl)At@`m%n)qK@Jt1X@**l;5JQ>@EBB)}InNGACmrh40rZkJyV>M+jvj}F zKSWK+{}DB-H5QS!HFO3na8oo47yn1pxQiM~oC`hfo95VgqRs;CJ%AEuhgDgvS6}ci zL|RK!>&SSQI6>|EJVwikYLcRVS$|ORyXzTS~KtS=E;PrcNi9Zi@pZ zc==SbJ=ecFn0PLVnEquU8~Fr!Gq*y+$ifR6rA}qd~+JA@f8{<G;eRYpg z;i{kAW1Z*Q>~xPy?9)CW3L+5_U?kjI!k3#^5dQ~$x0WzzjX>zecAa0enCA5)m(#OR z4=HTd$Zq%x-=TtZkQ-VA4?JsM;oPcFY8b4o(gy*Jbe4(OBR1%A%^4jGQ0OiK6JysY zWrjYKXFXTNi!jf)c;Nxiba}7V=f07QDRyQ9$T~mWEy);C*S)CRJ>;u9? zeTxMrWNLd3s|H9NxpT!a_<}FV={=%dzecM36E2k$5$P6-9AsDbgb!(tBg&Pk z8maqRh3Yeg6R((G6#JrI;G&p;z2&NaVHpO8c=EhY@$LzFXZBXj>g=&${c_|vCoYvf z7a2VW7y>Sh8bZ~qMYg+_s_V}$;k3M69gZyGk7>S-Iq>lQkA2kc@LC_o`R-f1hEElf z+T5IG^V&W#B!duhd8}m>!7co|MWuk1sNv<%$(+DzT|YB{f6dQFTO}I}twAoRrnkAT z#+)6F1phHLN9POcmvKLF(*@)6z(;O*H|Bb4T9@H4#`fZvGtj(u$=5$4J8S{3Sm%Ow zc^JQM8yA1`Yi0kFxEilsM@R!E4ZhIJ_OA2!rugVTDgI$Vg9Mh4a`4Gjou27!;a|_qTQ^Y&wor!cEKCX4^z`$odQtk z3VkScJ371XvcFUj5DY%sV8I0Bk^r{S1Ht8q(Kh~DCzv1LGhh zY3%EkUbn|MEyY|dZgKuT55q6i{xxh1A10NcZXP3IXkK_hw3?XO_me1mC8Bk;KLsuo z9U6F1mxP(sU)lKj5=Hn(sh!7Vda;qyZ z_u?wykK&&B!>#2nBUy@bh3o}E<^4iNN)IT^`?k#Evo!ehyIaHhU#cG&ok4 z{)bpU%i`_%tAuA}&GGT+CCb@g1D4`;Rme@1$6p%@$I{;O{3q2i($vOPu>jmi$&I=D zH^MKpG^ST>SW_fkdjB3{HjL*bHO~he2IW1wk%pZu^I3fxcA@|{34ax3$0!#`IXLNT z07cjx0Zqu`@3Y4Vau=FwY1xJ;P>7f_qGZT2egZolRow_(0u}=fy}*r4iIvc(BKKF$ z&yUf^TIG5f!fvntUlXT)+J8n3&^f~Y88r<588x2T3{<`@s#Y#|b5eMM)2PT}Sa*)G zeHs*pK~O`iIYtGQv0UMun-t?RoG)G^2iqs3fzt%5kpV^AvH|@AvJb`5g7CR&XYaBM_6=1bl&fQU#BRWzPzy| z{@uY%tqz;HSwVRBmzK8Q&-V)igLCDGU#6<<{~&~YnLtkuUG8SAU} zIby>I+U3y^k@c}Jr2>U$J$e`ZxG^>K)Fhzp``roRFlBr@+@g&REX+y!mO=&Xc7A zaaM5`i+~AK6wOBe_Q7uq|Kups_ogako}aRs2(DN7+YSAFx}J}qCu!+?pI%Zb1*~GM z$;FAaI5MfmWN}+(bj+ZPl?Y@1_0EM(x{39AL~B&hGF9U*8=|P4IiUW`8tODs#_eZv z)(g{mB&^1ePv-OD{@N-teTsq+|9y_3ICPr+!3HojZMKoCz-oIl7Ks$srYf&UXj5KR zVxL!@s>In?mfMr`Ue6y-z`}DXNg*&s?G=h#Ao1>BY)U+kc)x%;;tb5y7H-4zSyoag=jhJO6c)Mg zYZq>7B}zP6hqA`vvj5YnUWQ~%Uw70otYf;dd2ipgs-8ad8@%QIT{Gu&nPZe9Vek!R zSCe*=+ldTb2}wxr^%F!TKT|S z{2g?FX^^g8c`uY}yS_dyX3D>_6|(q^PF_&+qScgkn^|OR2#u9oT<6IjRM`^*9a;57Ou2>e=Jx64}w==?ohbm+=_|7XkA6Mvb}+Lt&*1zSZcc> zQD$KOG>mRKf9y4+U|=-yi6Buu>Cqd;PU}l78?EC5m*90m8+c0Sy;frAre{JCwczbs#84cpzC_VsYSZSO8XI^Qu3X_q}0?h8;Mwaz#a&*$@e|BiKjk3)ei(BRBP@F_Opb=!syJ z&l|erN?(d<;T`W^9NkSz;N$J@CBd{~MkQn0&gfKq`62q4<=3{JT}h-pyC%joz3IRs z1YZ~khGCF|UyFI)5ie~%Cr+%-UMO8J*HXXu?G46qL3v-UrGIU;xcvRb{|Ja#?;ont z68aVe^$bQbj|hRnK6Cb@MN5U03M>H0BMv>kzCMqA99EL_>*icsDuTS+wV#-KDm4-v#Kk#QasL4^+?FM{$Bwk_TCDJZ~ul^RTiC{D&}oZ$ke#FnO@5m3Z1nn1Q`J%TbLF1*b0Kb%{2I z<~G|K1*n1tWl;KJZCjy*z(zX_rTb8dfIOWIsV6CLKHaPz7J$^Km!8v65Ue(6(cvRx z!Nd(#EgLew_h7}w6NHc4Yr+u4H_0LPmq**M!cGbbl(tD}FQ5`YJvgDMcG`o>nM-xz zQ#t`3*hC;Iq)^&hVBPasB#EN5lq{zlQ3R`o;kjxwSsp=I$0-(ZZv}&@UMS;kYS+&b z%MCD_)OS)w7_R@}`B$mg0P##X{*O{4R4J0!Tm(>RjFbUNjd=-e$0r#;sVNF`{Et#I zK=L1@2J_)xrAA#n_2zu*2n6@)tNV&gJgMdIXZ3PV!;?7smVDNM>Q2#%LVsBI$dWy8Eja?oi{sV1yTWu zajpd{kIrgcf;3yT&7omBy*XJr4o0K14+>!ylZ38$b~I$~1X$v4H?)!V4emefWRj5_ z9=C==g@=z=z2Q^jbnNiNFX(u1iLRv14MK{ca!U%#Z+ds- z;MykXm$p*4nILEfmgUiDQmKVj4bf`?Q?8Ou3nI$>Rwd!mw-VhQqqM2meAB~BXR}P8 z0d`8Ji54<(A=bs!p=HWtb^Yb2r}1l~(bVP0etN~aTktGS~#(uCp&aBgxmGU?&-UwUFth1N-e)(L;wR5s9u|@>kQJe-H zCxF>X@CWr|l*7n+o8qQ_} zTwaWms*^c=Y~l5f244*8i|Nn?C>N2@AdL%5_TkXWaQY=oR>{|q*ZKaol@2}WGRNfm zM>(QbT)CpS4etH&iLb2ctK&72!Z=d-ITr=xCj^1G>1&t3p@<+QQ|f5^ ziX#weiI6}8$31AsEtn9V+8T!?G|I1D&FwzpG7uy#BNGZq*bCOkI=Fy+j{p}|r>N2Q{ ziTSOzlDOim*q4_jv}u{eAAxF!Hn>S6xGdE`6)FeAu+;ls{#E-&7aCFJ&{-jXlCRK-A#cg8KG&-9V*T2y zEHV((Bc2`cZ1;dK*{W#rAc^d`x=78p3 z3>YQ2ViYl>^YOLp)=E_FL5uL@;u6IOI+9EI=CUzq?JZ^Q=AqnSAcy4RtMblU9xN}8 z;SoQugB1z}`r->kodMO1 zSkZr(n%eW-#{$rQn3@s*Q!_TE`43ZLh}X|7XDdEW?CXXCfwah+};9M&?^a zEB^`mpaTwIYEqp5OwAcgZ!a1J{yxulOd2vg0>V-rt;zl2&jp8=ZC9yl>%vhA7#NZo zjmIVX2GUU>TfnB>vmAvV#1d(o@YJ+8A#X7<{u(vd2LdOe>?mK2+~NAP6|rPJ(>^k| z+*vJC1MqJxk{+>ww17(cG3I?qdHPzidFURG9zDTtgXkBBRq+Y~wBLa>;wg)%^-|vv zc}5M{AlqltElt$JcXza;H^G{)>KJqYr^aEmk>DA<*#R9{2@6JI{pUb!LyCz-HOiu= zI3A|1B^2Io$f+}%_3T>4k=d@>Qu*c;sJ4{&_AxT}oI@-hSi?E)Vt?GM^%=kb&b?fh zpQ%4F87IHJVM-U|;89a9TsB-T8O`nREa9+(Q6etw+g2rTBUCZ1ZP#-uCz{H1mzo#D z%y|U_S9y90^g@)*`p$6H@mguy^j^I)-cbXFd?Y<>2ogq=GG(w6?TFu+MP%DzAP1vv zT!umeukI6aGnFA;a>Y5e@rOZ&+MG*4b;b7gy(11-nM<^6p_;sMC5 zC?JV~*Uf_kMX=d2%aUdl;i>V`j1zx*SueB3!vqC66fTwRUM&aFqRMFA=~N{?j0+0u za$e*(=ciHa`pDl81;aJC7#zaLJ5YaiX;_h#EkkO0(8>eONQ}SgZblQa8JB2WbTv=) zhEEdmwj}WV;&|k#Gy;H`pL5TljaYnmuHxOofrw5 zz!XCuGgN^jD`Wziq(QCMVb@Vz-`}|$NWja1zVB-UimrmKiS=9lM6#R^OUwbpzO%Hy z#Wp%d+zlNv7~L#?nhCHKiGwMLMLaW(wZYxW)n-GOL6Q6Z{^YWyC}Z+OzAo|17Uy+R z+p;*|z^Dr?4(olNQyVZQn~Jj&H7C$%iQ97vL_RpkDdrOUmCaMS2-%VFGJ(KEoQ#VK zE!yAzU;{Bz&i7-wSW+%Dbn}O$@psE?c91eVb7AUKgd}3NIq`I*I#ZoU+8;ww|`YUF2Yk3ZMSjv))|s0^t9p4lfjR>V;cH~zBfX})7L%v1Vxy3L)M+F=g&O4p;Ma4_D0<5 zHJ4ZlRnl&PqQ>z+Pf>Q3ma)f7=$v#4 zXwoxGS&s#Cy*c7uahX~A5Apdz9N@X;xq^}FzoeQ`A8H&&$p4mVMmAH*rh{^WzbJ!6 z_r;}@j#BId>%Iyy@-}j1$S80S9H~~O4Yj&HA0wq2V zR>7mmEB&1igH6mn(OK_)B6uq1N{$GQ(sD~^e^o!6s+-wYM(IEn}403hYAjK2{wam7Mqd< zt5R2oyZjwBVhsFi;y@)T989@gDH%$qL5@ejQZ3~5=Z0&NIg@#~$G5_v!>}O3+P*g{ zfT~fR%QwjZP&I5;3Ds@@s>aoZ`~u%ZA!{)pHQCSeIVs2fGfEIZ)lh!jm+yOg-o+uV z?GLA=tYs$XScm~FA>4!Pl>laDE4KVqIayC~+Qxp~gse-ivs%0lZZW<6A=yd9-fSyo z0+?}ppQ$-{gFV}4BRO7dTs-m#u)?hq$C{l2=K6fHPGI(JxUNT4x}FdlZp#Yk)M>_) ziR6y9YI%u{vP#f35z=WPIy^{8<1&ziEH%0I#eot^!wJin5cG!Cm-Vf=DW;%s34P?- zo-`U#4$zgz(UL#0iT(r{OyZ%!y|fIzm(fyJLE)IBOi2|w8?sERJaY|$hU9bUgB7^q za1R*D!|;Ir2sdjTN2FB|%xT-rLd&CJfb%v;_=L)*`7w)=p_%T^xD7m+=uKjwX|)Z{l^`(`>T3X1y+oBwI?GJpfFUVLDCI9`!zpg^NN#VR3ysq- z6FD|&Hp~Hf_oakU)1){^mZP$6jXY!`g&Rihy!iEF>Z~YmotsHjYXp_14JF}J&i>Sv zeGY@=$f%?-^!5!(l?1aJ`HiEt%~%JwXu1MgwH^EzUH#f5hq$xNahAvEjFPm0}YPwY1{E_VpHJQ7-)CuR%Dbp-9X;l4<=8PJwHa)`+pyuvyC)v)z^u zexiZ2PonnSS?bLwq^y@JebAV(L$b%ALsJ)wazn{OI{Fj}Z=B9ie2LDwSO?4av_&yt zSX>G|w*|UAj9dyGpE|;i=1lvk7|hp?|u;@rrmT?OvS?D3ZpkWkFEIN?3{Svi#u63^)bgwkSbyyw&=0r zLRhb8Y~mv87Dq$4Zr2{>V;{WFHd_fi^-eFb%OAY0S3Z6pqfm=#ODr~gA0DL?a=?%V znLtBYt0%05On+w=r5gQ6L5ve0jYQ;5nDo9$AovFPm`zadN@$(6%GI*gz-`c~A*(D8 z92n8!2Houbg5F6)X566IU`o*7)0gNdEF0dlk6SgCQBQN1=pabxgxuqo7W@ybMty3; zJLg}mCYX%}3uoT)FiZgLAqWw^&1A3DWRlZ>%!}}SNwh#&kZgm}PqTwfcIf^y`o>~U zZpD|+c#f}mHZ#A2CRdH2m#1rB1%lk<&QXCa?@zEPb(V8GR6ag|*P3*T2HjlG`tF-u-&sGlivS-`@InZuMBI}ET%rJA3n#UXcl4vH-1HH{=Y>P@${Cb;lc?SI*0i2ME*J*vIa&QLI0n$8oo^0 z!~dz(boFNM8{4)T$g2Xhn!7g}xh;Byo~ox>%4bq2SGM1wzZj^gT_Oo6%_Ui(`OSSVVLK$ygR$*O>z;y})Ov|T z(gPMYd!8Q%Tez753zG?ppaWR$-v)L~0(EBqKZMkg@$6>0*f1;Jt^|1J-QX zl^G)qdBl!*8l@$B2o8Rp{i%Gksr^BJA~9+>^xAqUMXR=w;l8x!X~iOd1~yA@`fQmh zW5vH|a`IYKcZq?MoiQoz(1anNv8a3ZQ@VlWzM1315PBj{`P zQvE#mAQX`iCnaNpD4fZPh06BF0k2vrU~lEAvwM*~6YpRhOn;{Ma-Ol5E_qde$RhCV zH~+V+#xh~!2N#{7DFodkil$7kJiYi5x=}Tjt(|#-kNY3spzIX#$nZ77i_LTHNOFDl zFdbJ zz9g6EfHhpZ;mk{9>D5dl#tQ1#97eW>b9*2ySoa-v=2a8KtTmy0>8(>M%#SO5k9P!Y z8b#qQ{-f1U0JIuifL7Dx#+%pJIUxDN4=ulU7?(0e(I@{@qow$JT#=`4J1TLLy;AsE zz|wouKU&Zqyc3Zv$D8~~(nS>_Al=sV%_((3KkZ1Wgh6SnK|c-8_@C5r72 z;S}v&o%#XF>fDenMA__GRY#$%s$@isT^I~mnBhmT3)(2><6d=k-VWs;pKASe-2GxU ziBJ}UuCs710m?dy|I%tyQbzyLYMyf%c*cz9cRP33;XKH6f&WXZG5ANTQHT7GR>Lv( zeUm5Y{-pWZx;-TSo1@ zy(Xu*F3?jSg&K=-p;b*yNiANG8!uXIVZB2bKcxIH!uMTofT3O{y7*LR{aZ6k(W%$^ z;4HVA^y^3LNAce~M8Y(1DNjexVoaFHBmp`Q@74YAVzqwiWwdIPOz4=BF^NTDf|n!~ zd^R0Znj@}ciHDIQraaY#{u=zHqesn=eyMpn)}NG-C8On?J1kKqra+b81s4-!$ws_e z!fjI=9w_qo;%y zw~gh_w;9*Xs!vViP5>m!#ujBePaxOsd*IS+j zTV7PFP22>yB;tp6nuN9}LpW1voHqX3CBNtEGY&kQRvTr|r+B(IgxUR2=;asRzU)5# z(mN~@YkcAoVGZVYu;)lt)2NE|97jstTlJ87REVoLfPt-5+QjZNJ2&1rFAa!I@LC$j zM?$w41!`RWsyD(^y$p#&_f}&xfnS+H@7lo_*d1kQU0>?~pUZpS@GtXVgKUVfRGYqd zetLWg&@WJz z{i^@D9XGH#A@$(GIiGFRkbAwmQ0wl;wz;>5-qtDkzD{K@Qd??=%cP*O(%1I0b{tS0 z82I+df}Go$odnE<;txVdGi;meFOnJD$U>^YEKncXP^*|wXjV7cU)8bk^&u^sZ)LK6WZ(}M--mtmvB>aD}*D-@f=HDao*;anU zQAd^1PG+;D{&&``=F2{VLZS^2R{FRXySj7uSZZ>?uzvLKRD&~4Qunr=A8deo8@Jmv zctc4~)6DOoT`^c?KRaJ{7qG!I6+7cJ`)k1o=13Oqs4Wst7E>%Jofyd>nan55CX&iy z!t?~4q}r`kES;KEm&ipe3L}&Q}kA zx){Led3HP+VGO=N6D=$rzKwK1Hm}jp+tQPVA_6Yj{g@<2msrWVfWZGg46 zoWUL>iti^hG`LFeTRa*hcmm!Sx%A&|4HLq}$>j5yWC>`9UT9t{anRGx$A zWbn)cLaJyShR>AZFl^OJq0U#UUNnHNGhqN#FuW8{8KBzVkF2E`2e0M^8-Ed&wEM#AqaNdmx@q=Mn;f00`yx zCY&Umgd*Aku)#s2;}LWm&MfR77&TZHgg^P(vF5qM`6Ho)XL9W9T2K$nB8xnmlC0;R z=*!%)+H}E!gTa?Jl6Jg(ZW8leWMVx^zk6IuSjJ5t_BP)BA`EaI<$T>{Y*9H{Kp*`~ z)&B$y?3vqZnFB=FJrwweQ;+l3=lWa%urp}Vkw7;|&L^Mqj4k0T#&WMFxrP&c(xWJW zfztdu+nvuExUh17gAhxI4D9$2xD$saxj3+;+b&kOBD`c!u1YB#(F2Hs5f9;qF@Cp< z3rGPl4b=mPOe0W`D3}ZB(60~&7b{kW0`|nKiwi}~h0|1bjm|pGR`T}tdOg@D@X#kB zAkp;GEtH5cCCHDLU+jCMA0t+sd>ug-YNprq(qB>?P+i0>R5^q@LS{Pd{x?L?SHpB8 zAL25BjgZ5*yN+`)j^{VwFUD_=wY~>+ov@juP5HiF?aJPF zhhkssR`K_DNz{nlYcVz)*|F*cF?D5!l`zpZkuOJ+iKOUUvYZnLvE|h}Iem&O`4`O% zxze^doU^!>m$5~4o}5#Hoy9x_x0lg_2sygznpO$;(;8$o9YTK_!=2g#o>o&L0>aY> z))AMo%8ppOYsbe`8zo$xN0A*ZVm=qKwwUPTz)YTbB43>;If1`1N`-{WZF)=B1$+b* z`l0e(B+QFjNi)x4Mf`W)xnlZshul{L_RW?rR42?=q!9_V&Weh2QhOF|rFC_TJ3H)) zaaGUe?~4r!V%1(#7JjQ%cg=)2art_do2U%|wZ3(hicm|t7wlKihMQh1mUio<1yy#i z3%y>7J$3JXWERq`aIKHe&t$EdI|~+R=oi>mio^VA0gPU&Ab}|-n zUoBeO+=g+k2wN6=bQ`XF<>o9_&#y0Vug2{ZTN~(C5cJ|mCO`CZWb{b3U*cS34Tqun)=t=pRX>+cg zazA1rj(dlUuf6)cq4hUUx2;h(>8~C|+0@t@wg*rAhE%X`_CU~jiwunm-3u89=$-$Z z2gpTsGIhd12;jqKVUfRkwh_V0w3$&RpI)n91s{C%G!`gH;>!{1XJ zD>%e7$XgzWy<4EaqmM{Xp6+2*A@6nK)iNk(=0Pn%SijQt{2;y4^Bx>b)6)e%DppCo z=EM)Ik76B1$_bsQB6);N0;$-rkh+@XC?V$dXxi2?$L)LJ+H(`tj8hmF3}yUji{#zxz{I`@!SUFY?eL&eW9ipf=tFD3+V8mVeeUgB6mVvXRaHvuK`nO*RhVk;w7e znIec2s>N0OwXY^6T5Y_L|Kl62WVHYdZz5b?GP9~3x_&91lvkS}mB-28No%rqwP>m$;Z^2E9^rdQVS?)f-RL8=+g*_g5@PIhuWD zuz}`hmnz)$bB>6}qVi`lO6uieOp}cuz`mg&Sp(reaerM?&sRPm+x`y~&!@m`Uv`&H{7#6!A zv%1zNieoxIzRz^&*O!ONbiE>P2q9F>DVeoLvypHIo?Nx;<^lw}2ih=u<+Tw6HVgQ0 z-c|PP3*4LicNwp@26$y1Lk|v>(^<|NsgOm9hg#)Cj(jH@cw(T_!S%Jnc;ZTE!$LkH zwPN)3{+t^{e~!}1YN!?PINY5B}!(@7` zN05oZx_m=;_e652W!AEUeNodu5TuIVv_?|K{E{?(IRn2XRA>XF%bJ_A$#i__q@fqv zOPs(Pt7Q4)aw8n}RI0C?!O4>~gQtTyqjl}f*O_twtG4vcDOa21rd8C*t`>u^Nf(5* z8qqi9j#cD1tsf&CkGXbg$CLypqg1rC@bmCrT9$!F)4i z-jEU^aIe>H7r3zuD|r(q`4VmT4$-BYM-O0wcMgzEG59O|a6uHmq{T8Mo8y=f&iEX$ zq$|8jp{h7EelHK1$`2h7_&-ZaCfeq0@=J?}?G9R=Dn|;Bb_EE%&=5Lwq9v>o{Muq` zM=swFLUnx&Lb4@h6WW5_r?_+bd=c&c{hgQ+U;tAu2lXN4g7|I&lcRnb#?nH^pREDz z;!66%&L{lkzvdbu!!eFuG3d>AFCTkbP*rLvZmk7^2t+b#U{*7!S)xPX;LTE>!Hv5i zCVx@mOhp2JN0+XEj!3Z;nev$AAkfNtTt*6pi;h^Q#|Uj3hAxEH9Yw;UNn~v=5}ag5 zK?i(8_6JE2fk0Jy$3LFv6D1D&9283P1jfhRZm4z+0O9f+A3VXHz!V52Ow?h) z_V$XYZ2oc9$XHX%PkuBk^hJsx(58t+TX<8{9>e>waACvb8gS1^oE{JXip@+2)IB~u(p-;59QKho9<7lV2&qdK9y%CRQB1>zbNRdKPvVy`Htz#qT<*DDn zbW;*AwUX9ArY!Tg`64rCCy)?WZI+cHiFmX|LMYykDU@bOKgL99N-1CA~3w@8-pVnVyZw zOfDL|*+t#wotmES4-ISQ-6;+RI!=Ub4=--A5Vp&c7Qw}h?>MkT zTt|!&=*vk7ZqDPd3SaB7mb?53+53rEZR<(?W&S4_yLwd|l^RqWlZ4b#XF&sv3YPyU`^b7tS+gzbiB!OmfIS-*USeK`_NnSZK*YqD-%`J2xRIYV2(dgyQP071s4 z?7(Bz)!%OgNPozYLxSM6GH>Dk2o*CbF~pQ65M$bgPT2Eid>ri}iO}P#zs^m!G`K;V z+6p%vsNFG)BakVw5#`lg0NHyYf}>b7z}QY9zb0*S@BplLHOReW_LN^axKeQ#p}9kI zIQC`AX9(4#Poxs_$fd(lXu zMpxWy`^7cXjzW>qWzB?(lAO7r2DQe#j_`LUd;=>b2{>CgP$jH7FC>PAPDS=enq3AP z?pOVF9_I6cLIqxhW_?L%oc zO~J1d*aI&Ng3J%Nt6MG$L(;fOQVqd|fzgnym1G8=;p7->((X{&3t2#bQVU~!f|jTd zpT9AGc!ytpe7VQDiEzbjer-L^u+r-LH9H>!^e~z3@uJ3{MEu3!l61kE+2>XDa=+$p zDG+51M?6LoA17dqpbXtHK-pU$nJbSCjXvN7d6RI#`pku!RmOi6G{V|jsJ>8;j0mlm=a|3 zsyGiL>&Vr62)C2JLx6F|xH$OWHhX^V?Q@yy;{EaAico$7y@5XH{yU32-L&lV^v))9 zLYss=Njnxq*#;B*;fmt7q;~*e1*i^QCPc0@&m+fHc;o^`W4=R4K+mP|HZ+6uM|!#a$cWkWeRm3 z$cy$Y28YXb%_Cv10Rp%_2(#*6 zo03Qa5AQVqF(LFR{LR+u7%f<`^BEcOoSO50)}KyZkUR0?<~3FE6_DG5{eR?Hq6!r$ z)};%?$}$(y4hcelG&ot`JdXYB+ok6x)$U}zNV~d;bosf%AE4RSZ|hg}_(5u8L_!es zo8mSOYn!Y-pe8JU$CO4J=^IRx8C>YhzPKqSM92_Ne8-*ye_$}*x!RZrZ5NPPf`kuU zT+rb{3;lQpR$gy4qphjxqYP9GMTn1Zr97A9Mz;v9{IO(9C*T?h*^a|Nxj}v)&TXj{ zp*aX{YLZ9U(LWdCPjRACOgvXPvS?@(Ceo`+Ou-&>5uJqO_&7$PWL5^{NDiUVFiR(t z(1_&oQ(XYC(i6j%o1|ZXJ!vE|s8*1PY3N|RHLXvCFKgO58{pa(nD!nv`7w$Xhet~) zH=2goIekkRqqz*j5iVcjORBj-<7K!u?!RP6O&dMA^YB)R^g=tTw1soSAuPOn`MQ;f z>yhO!{n>@(+fg(EoE{M`3m-i(avD3$lQM_jI*9rsFX!}HwfdThez|{d^Zhsf6iq|u z&wolv3py(}OBr?;LTgwB>s=U8EmdNIE<6RgvB7XOd^wRUj_@i39vZcG#}(;ilrRvM&;|Qx0%g!`2e>>nOqViO`XpCTa zGbs~WgQ;dS=RBVgkMBT{g$=m5i^EVmt}WV1Gh z@0H*UC%hd?;W>M<^miK8BB)46O0A1Q=exmME=ehEew~YrjXqGTw0{Z5F(7&b6?N9i zK!~9np_O(YV$*_~MK2-^4-sj#%3vHTAEM}QF+jY>5y4P(ATut+$Q&N`{K{YYT(`C5 zzN^qTo*d9}OGjnaIw=?B}zuZJX0c5Z`aUqpO=B z5abgyr%QgTSD_JPmlf(316`jax0H~zZ>MIC8sM+`Z+@vFb4uILg;RQ~LN~`Dj~PL6#xH=`K743UL9X|n}xK|TeHnB&5uul zl43}oRdtf2Bj(IgV4v5nwcl-#k9P7Bt>IhgXqALpHuk5ZnodW91p-IZALlv{-u5q@ z=!sU3ice+iHv1E;vQ%6l0s1FydPvZYH%TcE?|Ts{M2VT5+ZCjZ>hA40O6|3+WCj9@BO~RP^ zH=865wUF=c5;(3NRjxHFY@>S1h%bvt;h}Pis|X682c;4gSNG|z9=$EpJri3z#ck9A zuCAM2i$9RxA3F%rrBGl`t;-+$;Rm*T3ph>IES|lI=VFq^P;{lx(?x2!%2yk?vTr3x zHj?e9Mv0t%Z8rMaE(aQfUmz|@NGG_}CC?=CAHV@C--zVg07j6gp?6iL!2)lmCfFLC z3?(;35j&(xRq1OuEVN2MnR+MPlKYD;fJ^+o@En!K!&l~&}t$*T;Jc1-R-4>f(z%KTr!BxGKg zfLLp5G2;7W(ygL!g;VzZ0t^AzS6#Km3UromZx^A!OgB-`QDf_@(pcPh7gGs|4^(Da zoEcYhPf1U&gBYxMa4EI5hD{R}8^I@3MoAp*Tt_N*FVv+f-Pc{!a?My?p^&kNi85Ho zu_Z`)XQWtMqk^Eh>IR(>k^lQh5}hbPl;F^>-h3?5m*4sDf)&!6CK2fn3xM>}(Qr&M zW`V;q$*CeZ%H|Zi$vODu!is2Op%Pbmll|?G6FW=(EDyQl7zYun`=@$_=dYm(MfplL zu3^BRvdZCz)A)7$r?A;`)wf-NyODfuE|E~A|bAdjCoWhpxu$44IT!q@cY z`DD9!M2~2I|2=UR(XankCJBB3f)DE0`bYZa@%^w4rk{!;FoAhuBeY{T2cVVI=@zND zzHzJ2%%?s8iFX}uq$OgBV(3D8tVhj(U)iFXC5E2^g#-bbi19Q+Zu`e#x)H0>kxl4c zm5#1wHBuwWa0I`Zmbhhvcd(cPNmTrr|M=_QaFP_E7zwq%I1wJml?|YsBsKUTww4NI zl4jWuL0O@(CzpuvVVdb)8z*g6SVxO;Gw?7^7)AE}cDD=v~4m6ERt-U|I^(*EGWgWX?ejKXhYtkdVoXd9?I$(IIGk@^kb?OG{Wr} zZgR7*bS$HrhfTx#4Z)|MWy-N9+@@DFYX`}dM}Gc zPbHQV)8}cDqZN`e>_{x*_LB`e71GsR{`rupTa@2)Pf>yCtUjt=pWbwG_*5 zMcvlvD{yfVzAqHLy4?L#TJ#hw<^=jhgR}qiCO#>0s)n2%`=uPk!3YXd#>tfWMGAU_ zfeKWn$KS>FwjG`d_|LdFyDm5^f`{-BEGJeaS2$R(*sqs}v$1*VPWyo7E~c=(5QlDZZ}Z^k*M# z-N^&0B3y0E1-SGs=^ue_;iUWe)+j$L!2v$p4DbB?UnL+rw*S2n5Lkz1_5UaV zK~ZXA`I>IVH zutOx^9nthQA}<SDM(3@-aj=TphL*r&HF0GTV#W|T^*jk77o`QSLcuS6KS$-e*u>Uk0&A6 zA6thO*65-l+Nc1nN6**A=k0l8kiRw0Pu6S1-@*TRf>51dW|e^-;5FUAF%pQ& zz8q`xwu8DT<_*;44-Fx33Phh8FjX5GkA)@pod21O_;`M+OJ(p82mRCiF;+wM$zXq` z-ue(nX4@9jhkfi5M>tlCPi&Qw|m&C_B9o9uFx6`%eKk1;xNk12Kav^p9^yAI}e z8}xOCKETQO`>|MK{5mnY{CPj67~I-b2>1!uUabyJHV|H=?RgA>SzyMlF$f+I$IlKv z!sCX|)C1{idI(VWn%3ClqwQTQu#6OA>}Iy`-FwA-?#fgEbrN-m%;POs@y3*dbFgh{ zhFa(Y$&-hx6LI7bY5}V<@;CqJZpi-4;x_MnZ1G0LwByI#5SGo!Q*9n#TbCuLHT8}6 z1J_C;jB4C7hzjiD$oF{D0sC9a#FL|~8N(xIhQ5U%s>#~ILi0kTRdWnZTo!?Lu3@|K zy>&vho&_mRwbqu{lFQLd%_ICr&AW%%PUB3mnu|4Ou9{joE?Y&Vc8RZoT3dCT*(CK+ zvRUM^p5lxulYFuxA*jIhTh-=r&*oU#l_cREpSe+i#Il+~Nk2Eo4h?+qx~gPA2KTh> zDuZBw*2&0Q*Yk!sE#;i$nV7#cSSU1b+OxY%s0aj)e_x=SnIBkBo8<>mLhbLocnt`_ z&DwC$TPDs!Uqmi76r-Hrf>|E2*T?jbmg$=_?kMZ5T@_u(|0hA6pJ{0M$E9o1`Zl&( zqWab79AfD!K?MgFsJ#XIRm+QK<}Dj>SGJ*eMOQ;Yi9QOtwIPbc>CG~fg~u3PxbvU2$!3>IZYY$@V8vIL+xXq#Xs1- zC&~#ib?&c;{%bZGd4YRNco*C+r#Z~LJdzX(Z@q1d5KYJeXL-85cMwJ>e3j$dxkXR) z&{?gL3ml;9tZITm_|3RDQK`Ko)~Tj8VUfqnvxEP?i}U|R7bnqd6AbO^QSswa`s3gw zYHoJDjQF>}Ar3E7t-z3CxvYf}$c2kv!i%mV8hou>-cVEMiN|}gkGWPm+Rf_n-mCE{ z$3jL~8M~UZ5Mp&YL%#a@sQXdi`5^eh~CFWQ@s zqj`U6#LI{qrnz=5w_F9nT%=B~a;U+imNDM6)0%Qt{v8TXD!R;|->a0Hq>H7b;Lq*k zEQh8;M<+Qos`Tyc_P;#dc|Q(6o|`mU6bbvG{XDH4z8ONoUvhw!2`QNC^H){%LS-*+ z@0<00yq}gYpYjwY@%U#`lO?hq9O4DlDU3rRM_Td>iD=+EA$fExT#`;P3J$@-x2f`S z%4$O{j<8@S>X4CMLQa%9<`yPb9{d3kFpI~{)VTTvw*hRSi-!N%=>+P=s@@NKwdo(a`?QBcNuT4X#LQZmiJ8*D-m)^S+w}| zX`k1vsFWcBUU^9=d=EyXXwOLD)}rn7T!ktshj=ACM}@&=v>b6egLDI5x#9Xtj#Add; zIyn31T`O{6oT8_#%<^BTvi%%N_oJEnc1wfZ8v*T&jT{Mql;tfo zNivX3bax1O3Hng*yU&@I$mFm{+k<8Wd9wXBhN*=6$+2#@ehPs0y0WU>b`p&pCoG>W`?E6E+kmVGa^7=JgVS?>0pNM2)Q=nH)^4<;_f0- z$V*p-aBK=`@M8~_c6sGJL%~mX0Q>?xiCGmH zD$I}>o^$A@krfpcOUo?<4Ju4du|{27BYM&kdZd+eHsdMqR^JuRXxm>p-}IOpFioq~ zyXyLcRj6C<5N~8jTn4~+(P26g@<;gsfyLujidCUp3`61IOnC5pA(E!ib^X=8XRoYi zghJ1g^w#om9nK$1QnL#O8ziZ=2@{CuSj1VL`XJ&*0q<_mha8MEfm6*?e=_^JoG(-T z?(tG%7N`BpcD(2^6rvIEzxVN4aOI@&!8FoKuV1fiUpMbd0X7guAMkDHWTaLz~8i(<@8Uy?~BfgqZmY_-qg+ikZ8GoS1#Q# zt;^*Cp}xM{$vpfCWb6!%VGgUr7*|4xcopvu^dVZprHwD6K(rpS%SGHO@+mf5gl`1} zEQ7CvaY6tTk7M>UIB8S2KlC87OULrE>HlhbH2lMEMo~=KFC8|csA(n9Eu&RJ$O_V7 zKx118^ZOhevy>t@k|EUFyduN}&T|4kwCL(dnxOcU%fS>8@Tn7hzNJN`=|+aAZeZF? z_a8=t=T(cq7bL>~$0P~c`^C{E5wX2#9N1|xN&gD}b; zSU#TSJcQ8K(|r4~UXor~RpfUd&{N6${Qk&oOa{yZSlyY5z^yF2b)lvCeWhKxPXQaZ zcDi3seyp;842V+6$--OatXrI57;N_Aaa^U0`-PN@wM0J;fR}Bw3o|L?^4kD@2j~%+ zA^hZF{$h#cw%xtWSR42cD?K@or&@ka_lc*wYX@?-{e7G%d4pZd_{C6kq_N;AL(2Yb zSNDQE^s@19d&jUNBriA!wn51B*JD}0{+Hd(8y;P~WKv{H=UbJA`%qUKSAhLlQCSAq z;xqv(4Dk7O0&eFUMCG%CSHV+v+5M?SO?Pdf;J4Xm?$C7)ud)sz-fF{1|8$R!@`k(J zR$suVzac5dEQR72$UPy>ENLj044v4RWZ8mFK|P-k3LN??9J8v2hlWQnWjN3KK~G{X z0e`}Ze`EDs`p!vo=3uW<*WEdljuz)!4%QX#ps}Q7kYDoTlR;(neAL-q*%1tquZHv2 zGN^EZSv6R*d#N(qd3^jQ>`o0HuSK4qnh(3056eyQ_~4KUz{ED5&Ut4A%fgW>82?x) zj{~gNy^F26@>5jep~sQXOLm%AS`g82G=Cuh?YOE2bWD(`FS&0LoH|UyPBddUSl8N< z^Qh93j7^{Qu1teRZI}Cv3`p6HEHrt!t|P~qf3=l?+;cRMPnyY!$eR+pfO#qqQ&nWB zT6RfGmap2RMUw}Q1DzmDqhiqb(T5q2m>((nr~yrg$Rd*G*`I+;tBoV>-c=SN<4eXbpuhn?_-Gmg&%ImLh`8TBd?_LJq0|f%~%7d|@1UnIKQx zr4<^oe7T6Dxy`UZG$Yt#%+O5%Ww8E_n{&D_MPnoQBgvG@88ODTT%c+gnU%!v0?VO7 zXL;QjX$|K$sY)Jk3B;pY0z)9jy^tS@#s|udh>XX~xX^EA)zwhxO0*-6#@6iPD7PgK zNW_d`9|Yh9>#MY3}WC%V0(|UF-6OE#=Dk zQcPzec*<3Yu(Y;=(5)&1f3reVbGb-M&@Vr6B%k%_bEiSeyA(R z!APC$tC!JXK~yAQ?_>h*Mlqe!T-b=d$XZ2nq0=5q~g%*?so^nf3#J2x5A+S zGv$;MM}$0b+o+b)_#vmuXjwojR5BxeDY5fny#r}dW4!tiUI+& z+;BAU%qjx28^H-heZohoznFx&6~0jzn;9AnNoCYz4Q|Cr9@T;N6Ey+TCD01rl3L{$ zvV%M(lf&NuLXO59$BI=}8-Ee}q=!uC{D!9+OU{qL(O=5g9Q1fef~YsR%B8}P2c{#WJ;+>G$tOgtTulJATOh*UHQVSH&1G!nqDN^Xo7;gw4RweYb z*KscUov)K6ch<=mXawKD`f=izVZ@3&wxaPSV>^YW_) zSYa=`nM||35|Wk4e%wbBwo=emRJ(zxEUHrHO(#eVwTL!G6`?p+&2_nig+1l-yQ6xv zL7#E1htAh;x?pSnp=TgZE_Nze`!fV~Z9=gm#y;_fez4<0SXmx zj5;H=xCE?mj3vOCwDmJRSOD6$BI_2IL$@U8Q zlB|xP3j6W;o(`;09Gq*>3Mg>=NieHjAr)lpk>aINXm~viwgR=<(4~{+jFiQREg;sj zSX>@fo>+>@Ie`Gr5EG0!K1eR~FCbEcZpYpvi!Yv;XJSugUdd@^4g=FLZ%8<@t z#Kku}%13Gpas`@XY)0jaeqQs0$1in<-3of05E|URk$9Fh*qcfhEI>d>DEEyc{E z`}$GK@(j3r;xdxr;Ng`DYK$F@HWeP=bX+BDu876Tp`nRs*PWK_8B4?jZB~}JJ}HuW zgUu5q2+mGOk_wif6K_#MeZaAyC`<@fN$+pY*%3-NuZ=ZAB!!yeKjxmtyA_>gN(eUeSW${k2%iA!JoSI$8$t1lCZ0M{Z z11TLTlZyI!WPx;4#vq1(@oTFTtO-7k!J+3VjTc*Kyj#VcZt}DeoTD%!9dbEBb9sL- z`)ISEed4gjmH4O@O*)F1F#-`ST%IP(je6W?*)+6$E|`~IKQUFdmP`K)v6)_vD8BNH zZ*Dj4eVajCUDDAtOWrzzT{s* z9WwnS2}Dt6_7ghDOe$c|`>=f5=9FoRTU%hXPAB|f>6@kpW`ZOVZ;n7h9oMh>GU72| z5b!sT7m@DY84^_T&g{*x<$K{|>dZh>oMHHZvm~Ioc3L`u9W4?l&1@frKOl=@`{QD> zL?&#@{rgoLEhdP+#qS0NQ!5t2^{JADyF{|u95yK(DxwL(DBY&lnS}!AJ(g=6L zr}^5g3HxrqNQZ`brJN$yx!hz+5S>YHlJN~is5mK5!UKa&*5I;~yoxiV;xpjjTT13f zK5cTiA&aby#pu&f;|aKGK(0lsevsuj zhEGm3di!9kx^5p_(xnDdaK@f0P^&@1ju}FB!^!A2#?uXUk7SlJzs|H^>fPVB@2{G+ zhk_si-@W#-GU8&jzJzrz2^mD;yw3Rq0*v04*p6%Ny_|`QJ{q-iFPkY?)GM?tE+45} z*qT%5S$sG;C*U4vd!vu@kOdfaq&%s6fJ|bVI%L_c?nP~1a>?b8rJeq;?b~x} zK~I+u;{EZ@^^uo*536h*6wotv4aa~i16Sk#WG=l&AXVobed+?vd0F*eBuisUGx9}! z^eQl9$Q2(<$GImtj^@XGNaJlw?aM2 zD>q3R6J0DKq+K|q)LVAM`&rVQ#EG z2+abLh}{eL6gP(~mpx&hhx+D8GX_^gX$5BbBFbM-v5uKw&?CR|F=5E6e83tm+8Xh- z3ho-_$)NmT$Q_Jl!>s`ti{4Jy-_4Bv6Qxp`3D#CwqQZ(V4YDRLR`LilCR#@=x_B(eU zf<3&4wpv{LHPs{BM6;s^`G~uD_2X*o4do<`sc5%hHl7gBbz@2cf^2Wz^**kOSP-R{ zELPd9@}^9xUG2__Pn0O}uOdO%26JNT24wvL^&V>S|~6#o#sSB8;6w`$8vvuVARlvOTE~mx)_y)%> z2$q{t^wkm&AY>r$dZp(T4D;z&h~|B0tbT3Y4Jx%+JXdlgNtE#a}=ej zJvqr-=I-p4buPW>_^`AY{N4j{xfp{d>AjtVG<$C1lP?3L+5F@;cAb?4o7RXn9uS6@{ zRfbp^`MgDgml(u!OKlkyT0b`)!Fj#;FdYYo=dQJqx*48a<5#?R*lv7$KgS{$SCyKr zdOknM%4L8c^s@nnw^Ywqh#7t7lq8z|6$cw3I~xwom^SME`~l|~>Sa7e#wD$B+N@Z` zQ3tz9rHQD%Fm#|#MG$tg{SCPvi^jS`zRD1z%A+makzGD`Xp^vHs-TkLBHe+P&<=CN zD=XSV63%SMH%OPm%wCfg{l>$8NSr7DW#!7mmcKB9=tp%7Nr+?rr`PUJibKETv%q6n zgh-LUWaGbqj;!;e5`=dRF~u*5Iea`1LSGJuLiD>mBS)vm8fw zZr|*Bd^Gs*ieW9NWmg%NgkNUb4%g#sXE&RjGEFM&<@_#X&u^f$7G2_DH<$APM^;yl z7SA0LfN4=K7{Ro#+wp}|S_QVXV5S=;Lx1Mt_sLl5*~s9Xzv6Ykzaw(Fy2~@Mkpy*J zBYe0*zfVRiz_5o=N|%V@iyC?N%fN#`6+PKBMpoz&g60*$cD>^eJKg=b2S+z;e`l39 zUexSAtYx(7WP58|=qX%EAYEGa!*A#+N`}iCrRSKVhZVQ3`<+*gOD!yOtk?t1*p<86 zczZo3ek};%x3KC*tLoSp@b$6%4P@ByhE>v6iKP=rZb@^9aPV(taf>c3gIGd>_miy_ z;A`b?NPVBtbz?MWR1Ml+kP2DjdO^UL9Ob9sEmB>oDI~?&jpb)Rint_`o)ujrD`N?^ zNL$~m-s}rcR#+l8)s6GQATby0vlr~u`qkOS;x?80ODCIGqCzpy9;@yx6iM`3sZOVO z2VB}_04ggJs}-nQI0mV{G(9lCW$-0-21l?1IkcbMl1NU^~SXn4np*RRG%MJ7)tmx)obo!GqL?+@Qwt7WoR|m|5QN z!v#F2H?w`;&Woi69YsG(`=l%$%|(lZ1;|VbN4OUEHM{||Zq2bsA8pVza@@@{InJAZ zsA>L>+S!_JiS~4C%wYJP-CCl0WhLufY01-~sV@~&HvjC^0$u8&ck`d2B4Nk+KA287 zZDJCzFibIf-y#D3Q0yU=-m#PMO0zofWz^IxzsxYkv8CIruIKThFs#Tx?L ztSZ24mn{aSx`8Kg{6tP73&>AnG9*!Ye$GRIh+E?YI*YJnMqzBmLXOQ@MPgNnM;xb} zfS97-S1aO8lsp*CW5gm#e5gu|Y!rFNgbQVdZmR;wB+a}p$8*W?jul<-?6;AJjFGbV z1S;ze7OP)Ic~s~!XVthhlU(q*-rGV&r~!5?(HFd?2Z3@_coI1yg+a4IW}&*mF>`y; zC&(w2ySpYd4vHeE#eov-z`(()u;w($wc(Jllq^QFlBZ3?c_k1ZIcqhG(8KVA`fN0; zO~Ob8X3#X}yIbZhA{Pwm|+%)f{amZ5a!^RAaz`H#RZ2Ug$-KEeP4aO&1`5GVD}#nyC%D|QB`!LD$AJJQYX@+k=L^xTPoj;Fn=ufRtU|zzEg0gJSdwi)DNbRL3 z_Ox`mcI7xqyr_ZIepJU#ejmhH^MQlTf(nwK%k@Z~+RJn*Gm~mxlU;tRrV&`nEe7Nz z&}VWi6k}a@4|VQ0?gD0RA@RlER6>3~IpuNk-i7AstfjsZ^IY{iFv(kNR6{oX=3kqE zu>*i$T?s9D-lWgt&&%OJn7797m+&zC=WN>6mhLyQt>U%JSuZCOIg5l<$v2CGY1(`{ z3_}z&`C&Z>)42314g*^?s0nMk(dp9ABF~_gH89R5{c5b;or@qoH2dl~2c)cG;LQ5n z7m%HN<@M*%Kqigv3L?vBVsG{g2`+g}+~P*_d)&Hgu*%10en6uHW%;x}56g?Pt}IWr>1AaLeBm}<{Pa%Qs4{`E1-Nc=Z&o_< z`aiYxFde{yu1G)A(svD7d;0*LISPnfTaZQNC+RGak$Kan*4rDga+5VI)vt9HhfrI&szz6_l8-W!@-{CKIdV z)@6qAXBB+VLu|?v8%hVxdyZg( zeFoZXgXv=k^o?B~aPvVgRJYRV-;8omV5k)ZXy9&?@0+1ZJJF3O@}z3a%ZuS#54KGS z@~oM%(fLQnv5>PFz1x=@V=uujAo~!tEZixxj9G}|*pxHh=!p8Ld_4d)wH-xRU*6IF zHZ6bpUmH$q0qOrJTJo%2pj+G3*6^Z-QF4B27E%(#KLJ$iCLPu(-J+yV6Sk*Rni6+E z)5=DUsQz1~&xcMizC?|sKq2knU$dPFM;K%24LaQE}g zr$!FXZa5r_-vDVJ{Qr{q7cc)enIBcOh2;TA(?`AR<^Chsfk3wi{J#V{G!)LK9(T2x zBn>59v=|pYyxSXL3Na0Bj(b-`yCRSXu@Z`m+cgqMk66R4aAd}ydSB(F%Y>l5H-v<5 zRjCyS`M>>i5N&-B$vHgWiMQkl`nteRzpW_9;ZinGCkjFMxk{{m64o+O?+hNv|(F_lPP~*50EV8Z>v> z0Ux{t=y9AVPxoh@J%&50uX8P%>#FMNXp<$mxL*;$Z#V0w6xhZd_kqU3oeBRS|rOBz4};0>xu`PPXgti5csyw2h$JIebKI*uT& zP*vSQ5-#P253x{lR;Nbkd0n7+(c@Ez8L1zIqB6G+I#OjU1$Z80#3>%ftOR+hu0*_N zbVLK)+k3w^6Ow!eYoxpC77xzqe9yVHc&0jd;HZfZ3LbVJn~}-~%LU4MK)#Ir?e82umz#oCqZ4l;l;i#T37@ zKP%4zX2|8l8ngM$mCmBq$4t#tB^B19|J%D`E>tR7MCuH;bc7Q!9fiiS%cQPiz$*?!e^#W8LWhRrZAnzrt0Kd0=(^P6c|n zY`I6{3cixqol_xA$S@*y`;i$jJr8)v-SK{~-&EB?$ncMZ*5dc~tNPUhYRl;8mBP%j z-pCy2xi|~p7Llz&8e&GEJ_o3GsM#_O4Ewj}my9Sh{;S@x5j*R;lP6N-M&|2+EMF7; zn#QHxGoM`;NpWDE)go1c^PR`adf>N^-C9tok^2bQt7*b26#`nDd2IG5zbKJbRFhjH8Ac|^&|)>-D7HqRZ;N`4muy?M+oicx2- zQYdd3xJtz4B3z?jY9&mIoiZZ$Y-yK+O4B2oc~fwHqSpNs*G?vICX2n5NKTu!lp5+5 zNN+{^4mY$BMw|6U7}s*nu-NlCB0lhpEE&btngA=f@nb%)JshZw%Q}-oRDxIBH)ndIRRF zrA2IyfNW;(nw)l7onQ0D_bxc>)I8eQi0hOt`A8$-A8!-ibjS2U$Of!;*QKqJf}r7f{jiR8 zASMMzZ@UxR!cWpY2O=inBkn!eVlmc}FHfSH5tnZ;!(iO&h> zb{PNZSY^(0B}*rFnhwqNYLWk3jPy)u9Z%O=R4mZb7nxS=pv6z4I10(U7xpnLXVA6T zj2aLXML*F@da&-ONlY(o#i}{9Em?g&o`P|V4@c*r^APsZ!W{bT?9_ee*m2!HG`$#c zR)Xc;d92hA1f!6gI#5-YF3-hX2LrCjzIlhJ`)b zoH^`u6!f>TgOodK(b{z0%vSN0XuDxhBy113^&`xJ(EBoP4ry(75P_?#@gWov3a&N% zTl|`P5!*(i*03ZKJ#-c`_32>sa5eV7ItU^@_PPB*c=n3gZ#548LpKinN6Ssf7Vi@r znL|L&(d$2aJ|0$%MQ$)NFD zWG4N(BO$$HU**Y=RP_MzxOhxaRk+q9y)-7rm=YZMLUA6|CxnQ z(jEc>JC-|_L?Qez?FobPd(LC+lz&li{7u1hO4jO0;`T^>lqc{6!T|Exd=X#M4SldC zH96p9^GjT@H%EQbWf#$5_B!7EfFLy99q?^`t8(d55zyzqM1Q#7A|w;Ny#k?^%12t? z%u6v>#K;<^ZjU3JA6JtGhh>WP68|nS7+`jPQ96$ol zuc%7c$Qc!YyE@=xN4{q2(M-S9i(S>{erZN_JjSIZ0FF2T;hD$3I%m~wW}=v@@7A}3 zvXUHSy&S&1BV13IN@V^V!NK8}5Dr&rwVbn|dJqyC#Car(POO1Mq@W5M2ZJZ7{{15B z20I{u7B2iQ&;7d__2+VW(B7H+AALsd?Lvy&0ELFmh15Pp926Fl{y?Cf0;7X`G^vXI zL|kRsj-}uW?ojDiOYRi=)`A!MU?Y+jY?mSJYu|SS8+K>2ru=zG{F(#^NGn&l;i=CoPIpQP)E`N;B{4~wy3^?K0*+j7g3tSIb8@0xINAa zsRASQusB+@ADPA&!fRE*JQK(jUA=5y@yL%6Q17e-SK?{|BQWc)A73|ip{Z3rRX1h{ zALgad|z+rk(5rO?HQGISa!&@s(5y^7;%AT@sle) z4q`3ez)5W}JU}GM=qgM}7RdzWIUNI3W^1>+P`Q&(#^h1kYtkZu!TyN$=IWh7zlLo+dG7PQNx{Z|!!}q#i^;oaOI+&UgX!f*KE{}w{7=NO zXTRO%;0PzL04K>;7?q)H)OrZQSJz$9tU?0POHh@U9@n;EVO;1%uWS~ z;6XEFwVV~SGi=R<<5Z-|BvoMPFbNjJMa+w5C5;Ds6n&IR5e-~U&2yYELoXCG(kK?g z)wc|3LJM+NG8>1;T&+>SXeZ%6DNm&59gne0)iMys*qv>uBrJc-3njprK_mYbqo~+U zT`%7!Xm{ye1QLEdf;win<00Gn0Rd0oYgrBMolE&=&)#x5^Ru`UF6O`-HP5vdTRNzi zUavHi1^U8-bz68zbZ&fSTusYj#lBLGwGfCkHD>R(g9S6hm~up!Gy2ZMPYA?<<65#e z4O*NjSW3jy&G+^U`5Evs1D>a=QilZzi>5U1@sJmE}= z4@e_$16Luxkfb)%=^TH?gP&+|`=DKy8CjbyTx+fZ<>#m+?5as#;0E+r{C$q4`t-T& zjLPS~cHf@qh@x0uAC-M5qVplQ)19L9{!RvVKQhu5@k$%$=xic#m7)z%4%S$Tgl